Commit 1dc4491

Document CloudWatch integration
1 parent beb4f75 commit 1dc4491

4 files changed

Lines changed: 62 additions & 4 deletions

README.md

Lines changed: 7 additions & 4 deletions
@@ -3,8 +3,10 @@ lambda-alertmanager?
 
 - Provides simple & reliable alerting for your infrastructure.
 - Uses so little resources that it is practically free to run.
-- [Monitors your web properties for being up](docs/usecase_http-monitoring.md), receive alerts from Prometheus,
-Amazon CloudWatch alarms, alarms via SNS topic or any custom HTTP integration (as JSON).
+- [Monitors your web properties for being up](docs/usecase_http-monitoring.md),
+[receive alerts from Prometheus](docs/usecase_prometheus-alerting.md),
+[Amazon CloudWatch alarms](docs/usecase_cloudwatch-alerting.md), alarms via SNS topic or
+[any custom HTTP integration (as JSON)](docs/setup_custom_integration.md).
 - Runs **entirely** on AWS' reliable infrastructure (after setup nothing for you to manage or fix). The compute part is Lambda,
 but we also use DynamoDB + streams (for state), IAM (for sandboxing AlertManager), API Gateway (for inbound https integrations),
 CloudWatch Events (for scheduling) and SNS (inbound alarm receiving, outbound alert delivery).
@@ -59,8 +61,9 @@ Follow these steps precisely, and you've got yourself a working installation:
 4. [Set up AlertManager](docs/setup_alertmanager.md)
 5. [Set up API Gateway](docs/setup_apigateway.md) (also includes: testing that this works)
 6. (recommended) [Set up AlertManager-canary](docs/setup_alertmanager-canary.md)
-7. (optional) Set up Prometheus integration
+7. (optional) [Set up Prometheus integration](docs/usecase_prometheus-alerting.md)
 8. (optional) [Set up custom integration](docs/setup_custom_integration.md)
+9. (optional) [Set up CloudWatch integration](docs/usecase_cloudwatch-alerting.md)
 
 
 Diagram
@@ -109,7 +112,7 @@ Q: Why use this, [uptimerobot.com](https://uptimerobot.com/) is free?
 
 A: uptimerobot.com is awesome, but:
 
-- It only supports 5 minute rates while lambda-alertmanager supports 1 minute rates.
+- The free option only supports 5 minute rates while lambda-alertmanager supports 1 minute rates.
 - It does mainly HTTP/HTTPS checks, while lambda-alertmanager integrates with Prometheus, Amazon CloudWatch & others as well.
 - It supports free SMS messages (no delivery guarantees), but they have non-free "pro SMS" (better delivery).
 lambda-alertmanager SMSes are all "pro SMS" and free to a certain limit.
Binary file (image), 17.7 KB
Binary file (image), 18.4 KB
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+Use case: CloudWatch alerting
+=============================
+
+NOTE: this guide applies to most AWS services - not just SQS. But we'll use SQS as an example.
+
+Let's say that you have queue workers (whether in AWS or outside of AWS) that use AWS's SQS
+(Simple Queue Service). A great way to detect problems is to detect if the queue is backing up.
+
+
+What does a healthy queue look like?
+------------------------------------
+
+A healthy queue would not have many queued work items for a prolonged amount of time.
+A healthy queue looks like this:
+
+![](usecase_cloudwatch-alerting-healthy-queue.png)
+
+Observations:
+
+- Items are sent to the queue pretty constantly.
+- Visible messages (= messages that are not yet consumed by a worker) should be close to zero at all times.
+
+
+What does an unhealthy queue look like?
+---------------------------------------
+
+An unhealthy queue gets messages sent to it faster than they are processed. It looks like this:
+
+![](usecase_cloudwatch-alerting-unhealthy-queue.png)
+
+Observations:
+
+- Items are sent to the queue pretty constantly.
+- Visible messages (= messages that are not yet consumed by a worker) ARE NOT close to zero.
+
+
+Creating a CloudWatch alarm to detect an unhealthy queue
+--------------------------------------------------------
+
+Go to `CloudWatch > Alarms > Create Alarm > SQS Queue Metrics`:
+
+- QueueName = your queue
+- Metrics = `ApproximateNumberOfMessagesVisible`
+- `[ Next ]`
+- Name = `Queue XYZ health`
+- Whenever ApproximateNumberOfMessagesVisible `is >= 5 for 1 consecutive periods`
+- Period = `5 minutes`
+- Action = `state = alarm => send notification to AlertManager-ingest`
+- `[ Create Alarm ]`
+
+Now when an alarming condition is detected, CloudWatch uses AlertManager to dispatch the alert to you. :)
+
+NOTE: `ApproximateAgeOfOldestMessage` is probably the best metric for detecting an unhealthy queue, and it
+works even in high-bandwidth queues.
+`ApproximateNumberOfMessagesVisible` was mainly used as the easiest explanation.
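
The console steps in the new guide could also be scripted. Below is a minimal sketch, assuming Python with boto3 and CloudWatch's `put_metric_alarm` API. The function names `build_queue_alarm` and `create_queue_alarm`, the `Maximum` statistic, the queue name and the account ID and region in the SNS topic ARN are illustrative assumptions and are not part of this commit.

```python
"""Sketch: the console alarm from the guide expressed as put_metric_alarm arguments."""


def build_queue_alarm(queue_name, topic_arn,
                      metric="ApproximateNumberOfMessagesVisible",
                      threshold=5.0):
    """Return keyword arguments for CloudWatch's put_metric_alarm call.

    Mirrors the guide: metric >= threshold for 1 period of 5 minutes,
    notifying the SNS topic when the alarm state is reached.
    """
    return {
        "AlarmName": f"Queue {queue_name} health",
        "Namespace": "AWS/SQS",
        "MetricName": metric,
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",  # assumption: the guide does not name a statistic
        "Period": 300,           # 5 minutes, in seconds
        "EvaluationPeriods": 1,  # "for 1 consecutive periods"
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],  # ARN of the AlertManager-ingest topic
    }


def create_queue_alarm(queue_name, topic_arn, **overrides):
    """Create the alarm; needs boto3 and AWS credentials (not run on import)."""
    import boto3  # imported lazily so the builder above has no dependencies

    params = build_queue_alarm(queue_name, topic_arn, **overrides)
    boto3.client("cloudwatch").put_metric_alarm(**params)
    return params


if __name__ == "__main__":
    # Hypothetical account ID and region; replace with your own topic ARN.
    arn = "arn:aws:sns:us-east-1:123456789012:AlertManager-ingest"
    print(build_queue_alarm("XYZ", arn))
```

To follow the guide's closing NOTE instead, the same builder can be called with `metric="ApproximateAgeOfOldestMessage"` and a `threshold` expressed in seconds, since that metric reports an age rather than a message count.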
