
Commit beb4f75

Prometheus integration
1 parent 6ca29b2 commit beb4f75

5 files changed

Lines changed: 151 additions & 5 deletions

alertmanager/index.js

Lines changed: 44 additions & 5 deletions
```
@@ -60,8 +60,6 @@ function failAndLog(context, failResult) {
 }
 
 var apis = {
-	// TODO: 'POST /prom-alertmanager-webhook': function () { }
-
 	'GET /alerts': function (event, context) {
 		dynamodb.scan({
 			TableName: 'alertmanager_alerts'
@@ -113,19 +111,19 @@
 		dynamodb.scan({
 			TableName: 'alertmanager_alerts',
 			Limit: 1000 // whichever comes first, 1 MB or 1 000 records
-		}, function (err, data){
+		}, function (err, firingAlertsResult){
 			if (err) {
 				context.fail(err);
 				return;
 			}
 
-			if (data.Items.length >= MAX_FIRING_ALERTS) {
+			if (firingAlertsResult.Items.length >= MAX_FIRING_ALERTS) {
 				// should not context.fail(), as otherwise the submitter could re-try again (that would be undesirable)
 				httpSucceedAndLog(context, "Max alerts already firing. Discarding the submitted alert.");
 				return;
 			}
 
-			var items = data.Items.map(unwrapDynamoDBTypedObject);
+			var items = firingAlertsResult.Items.map(unwrapDynamoDBTypedObject);
 
 			var largestNumber = 0;
 
@@ -178,6 +176,47 @@
 		trySaveOnce(1);
 	},
 
+	// Prometheus integration
+	'POST /prometheus-alertmanager/api/v1/alerts': function (event, context) {
+		var eventBody = JSON.parse(event.body);
+		/* eventBody=
+		[
+			{
+				"labels": {
+					"alertname": "dummy_service_down",
+					"instance": "10.0.0.17:80",
+					"job": "prometheus-dummy-service"
+				},
+				"annotations": {
+				},
+				"startsAt": "2017-01-17T08:42:07.804Z",
+				"endsAt": "2017-01-17T08:42:52.806Z",
+				"generatorURL": "http://f67e003689ac:9090/graph?g0.expr=fictional_healthmeter%7Bjob%3D%22prometheus-dummy-service%22%7D+%3C+50\\u0026g0.tab=0"
+			}
+		]
+		*/
+
+		// FIXME: this only takes care of the first alert
+		var subject = eventBody.length === 1 ?
+			eventBody[0].labels.alertname :
+			'Alert count not 1, was: ' + eventBody.length; // Fallback for actually letting us know
+
+		// convert to simulated incoming HTTP message
+		var simulatedHttpEvent = {
+			httpMethod: 'POST',
+			path: '/alerts/ingest',
+			body: JSON.stringify({
+				subject: subject,
+				details: "Job: " + eventBody[0].labels.job + "\nInstance: " + eventBody[0].labels.instance,
+				timestamp: eventBody[0].startsAt
+			})
+		};
+
+		// run the main dispatcher again
+		exports.handler(simulatedHttpEvent, context);
+	},
+
 	'SNS: ingest': function (event, context) {
 		/* {
 			Type: 'Notification',
```
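
For local testing, the new route can be exercised without API Gateway by feeding a synthetic event straight into the exported handler (the dispatcher routes on `httpMethod` plus `path`, the same mechanism `simulatedHttpEvent` relies on). A hypothetical harness, not part of this commit; the stub `context` just logs the outcome:

```
// test-prometheus-webhook.js (hypothetical file, for illustration only)
var handler = require('./alertmanager/index.js').handler;

// one Prometheus-style alert, same shape as the example payload above
var event = {
	httpMethod: 'POST',
	path: '/prometheus-alertmanager/api/v1/alerts',
	body: JSON.stringify([{
		labels: {
			alertname: 'dummy_service_down',
			instance: '10.0.0.17:80',
			job: 'prometheus-dummy-service'
		},
		annotations: {},
		startsAt: '2017-01-17T08:42:07.804Z',
		endsAt: '2017-01-17T08:42:52.806Z'
	}])
};

// minimal stand-in for the Lambda context object
var context = {
	succeed: function (result) { console.log('succeed:', result); },
	fail: function (err) { console.error('fail:', err); }
};

handler(event, context);
```
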
(3 binary image files added, 4.61 KB, 17.3 KB and 24.4 KB: usecase_prometheus-alerting-email.png, usecase_prometheus-alerting-graph-normal.png, usecase_prometheus-alerting-graph-unhealthy.png)
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
Use case: Prometheus alerting
=============================

- Prometheus is the main component of the Prometheus ecosystem. It monitors your services by
  scraping them for metrics, and it lets you define alerting rules for these metrics: "if this metric
  looks like something is wrong -> raise an alert".

- However, Prometheus only **raises** alerts. It does not filter or transport them. The Prometheus authors
  wisely made this modular and separated those concerns into another Prometheus project:
  [prom-alertmanager](https://prometheus.io/docs/alerting/alertmanager/).

- Our lambda-alertmanager is a simple replacement for prom-alertmanager that runs entirely on AWS.

Configure Prometheus to send alarms to lambda-alertmanager
----------------------------------------------------------

Edit `prometheus.conf`:

```
global:
  ... snipped

# the most verbose way of specifying 'https://REDACTED.execute-api.us-east-1.amazonaws.com/prod/prometheus-alertmanager'
# Prometheus will do an HTTP POST to /prod/prometheus-alertmanager/api/v1/alerts
alerting:
  alertmanagers:
  - scheme: 'https'
    path_prefix: '/prod/prometheus-alertmanager'
    static_configs:
    - targets:
      - 'REDACTED.execute-api.us-east-1.amazonaws.com'

scrape_configs:
  ... snipped
```
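
Before wiring up Prometheus, you can sanity-check the endpoint by POSTing a sample alert yourself. A minimal Node.js sketch; the hostname is your own deployed API Gateway host, and the payload mirrors the example documented in `alertmanager/index.js`:

```
var https = require('https');

// one Prometheus-style alert, in the shape prom-alertmanager clients send
var payload = JSON.stringify([{
	labels: {
		alertname: 'dummy_service_down',
		instance: '10.0.0.17:80',
		job: 'prometheus-dummy-service'
	},
	annotations: {},
	startsAt: new Date().toISOString(),
	endsAt: new Date().toISOString()
}]);

var req = https.request({
	hostname: 'REDACTED.execute-api.us-east-1.amazonaws.com', // your API Gateway host
	path: '/prod/prometheus-alertmanager/api/v1/alerts',
	method: 'POST',
	headers: {
		'Content-Type': 'application/json',
		'Content-Length': Buffer.byteLength(payload)
	}
}, function (res) {
	console.log('HTTP status:', res.statusCode);
});

req.on('error', console.error);
req.end(payload);
```

A 2xx response means the alert was accepted for ingestion (or deliberately discarded, if the maximum number of firing alerts has already been reached).
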

Have a Prometheus-enabled service you want to monitor/graph
-----------------------------------------------------------

In our example we have a service `http://prometheus-dummy-service`.
Its Prometheus-scrapable metrics live at `http://prometheus-dummy-service/metrics`.

The response looks like this:

```
# this is a fictional value
fictional_healthmeter 100
```

Prometheus metrics can have a much [richer data structure](https://prometheus.io/docs/concepts/data_model/)
than this, but this is the simplest possible example.
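
As an aside, such a dummy service is only a few lines. A sketch of what `prometheus-dummy-service` could look like in Node.js (the implementation is made up; only the `/metrics` response format above is given):

```
var http = require('http');

http.createServer(function (req, res) {
	if (req.url === '/metrics') {
		// Prometheus plain-text exposition format: one metric, no labels
		res.writeHead(200, { 'Content-Type': 'text/plain' });
		res.end('# this is a fictional value\nfictional_healthmeter 100\n');
	} else {
		res.writeHead(404);
		res.end();
	}
}).listen(80); // matches the 10.0.0.17:80 instance in the alert example
```
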

Prometheus [autodiscovers](https://prometheus.io/docs/operating/configuration/) our services
and will scrape those metrics automatically.

Now we can graph that metric inside Prometheus:

![](usecase_prometheus-alerting-graph-normal.png)

The metric reports a constant `100`, which in our fictional case means everything is OK.


Configure an alert in Prometheus
--------------------------------

We'll decide that the metric `fictional_healthmeter` signals an error if it dips below `50`.
Add this to Prometheus' alerting rules:

```
ALERT dummy_service_down
  IF fictional_healthmeter{job="prometheus-dummy-service"} < 50
```
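
For illustration, the same rule can be made less trigger-happy and more descriptive. Prometheus 1.x rule syntax also supports `FOR`, `LABELS` and `ANNOTATIONS` (the values below are made up; annotations end up in the webhook payload's `annotations` field, which is empty in the example above):

```
ALERT dummy_service_down
  IF fictional_healthmeter{job="prometheus-dummy-service"} < 50
  FOR 1m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "fictional_healthmeter dipped below 50"
  }
```
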

Now, when that happens (`fictional_healthmeter` dips to `20`):

![](usecase_prometheus-alerting-graph-unhealthy.png)

Prometheus will submit this alarm to lambda-alertmanager, and you'll get a notification via your configured transports:

![](usecase_prometheus-alerting-email.png)


Why did we replace Prometheus' AlertManager?
--------------------------------------------

- Prometheus' AlertManager would have to run on your own infrastructure: more stuff for you to operate and worry about.

- Reliability. If AlertManager goes down, you are not going to be alerted. AlertManager is in a sense
  the most critical part of your infrastructure, as you have to trust it to work when shit hits the fan.
  You don't want your customers calling you because you yourself don't know that your servers are down.
  In other words: if monitoring goes down, who monitors the monitoring? I have great confidence in letting all this
  run on AWS' well-managed environment.


But what if Prometheus goes down?
---------------------------------

Okay, we've established that lambda-alertmanager is in charge of reliably delivering alerts. But since
Prometheus is the one that **raises** these alerts, what if Prometheus itself goes down, so there's nobody
left to raise the alert that monitoring is down?

For this case I advise you to have AlertManager-Canary monitor your Prometheus. Just configure an HTTP check
in Canary to alert you if Prometheus goes down. That way, as long as AWS stays up, you'll always be notified,
even if your entire cluster dies at the exact same moment.
