Use case: Prometheus alerting
=============================

- The Prometheus server is the core component of the Prometheus stack. It monitors your services by
  scraping them for metrics, and it lets you define alerting rules for those metrics: "if this metric
  looks like something is wrong -> raise an alert".

- However, Prometheus only **raises** alerts. It does not filter or transport them. The Prometheus
  authors wisely kept this modular and separated those concerns into another project:
  [prom-alertmanager](https://prometheus.io/docs/alerting/alertmanager/).

- Our lambda-alertmanager is a simple replacement for prom-alertmanager that runs entirely on AWS.


Configure Prometheus to send alarms to lambda-alertmanager
----------------------------------------------------------

Edit `prometheus.conf`:

```
global:
  ... snipped

# most verbose way of specifying 'https://REDACTED.execute-api.us-east-1.amazonaws.com/prod/prometheus-alertmanager'
# Prometheus will do an HTTP POST to /prod/prometheus-alertmanager/api/v1/alerts
alerting:
  alertmanagers:
  - scheme: 'https'
    path_prefix: '/prod/prometheus-alertmanager'
    static_configs:
    - targets:
      - 'REDACTED.execute-api.us-east-1.amazonaws.com'

scrape_configs:
  ... snipped
```
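
For reference, the body of that POST is a JSON array of alerts in Alertmanager's v1 API format. A single
firing alert would look roughly like this (the timestamp and generator URL here are illustrative, not
captured from a real setup):

```
[
  {
    "labels": {
      "alertname": "dummy_service_down",
      "job": "prometheus-dummy-service"
    },
    "annotations": {},
    "startsAt": "2017-01-01T12:00:00Z",
    "generatorURL": "http://your-prometheus/graph..."
  }
]
```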


Have a Prometheus-enabled service you want to monitor/graph
-----------------------------------------------------------

In our example we have a service `http://prometheus-dummy-service`.
Its Prometheus-scrapable metrics live at `http://prometheus-dummy-service/metrics`.

The response looks like this:

```
# this is a fictional value
fictional_healthmeter 100

```

Prometheus metrics can have a much [richer data structure](https://prometheus.io/docs/concepts/data_model/)
than this, but this is the simplest possible example.
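
To make the example concrete, here's a minimal sketch of a service exposing that metric. The choice of
Python and the official `prometheus_client` library is purely an assumption for illustration - the real
service could be implemented in anything that can serve this text format:

```
# Sketch only: assumes Python with the prometheus_client library installed
# (pip install prometheus_client).
import time

from prometheus_client import Gauge, start_http_server

# a Gauge is a metric whose value can go up and down arbitrarily
fictional_healthmeter = Gauge(
    'fictional_healthmeter',
    'Fictional health of the dummy service (100 = everything OK)')

if __name__ == '__main__':
    start_http_server(8000)  # serves GET /metrics on port 8000
    fictional_healthmeter.set(100)
    while True:
        time.sleep(60)  # keep the process alive so Prometheus can keep scraping
```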

Prometheus [autodiscovers](https://prometheus.io/docs/operating/configuration/) our services,
and will scrape those metrics automatically.
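
If you're not using autodiscovery, the equivalent static scrape configuration would look something like
this (the job name matches our example; the port is hypothetical):

```
scrape_configs:
  - job_name: 'prometheus-dummy-service'
    static_configs:
      - targets:
        - 'prometheus-dummy-service:80'
```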

Now we can graph that metric inside Prometheus:

The metric is reporting a constant `100`, which in our fictional case means everything is OK.


Configure an alert in Prometheus
--------------------------------

We'll decide that the metric `fictional_healthmeter` signals an error if it dips below `50`.
Add this to Prometheus' alerting rules:

```
ALERT dummy_service_down
  IF fictional_healthmeter{job="prometheus-dummy-service"} < 50
```
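
In practice you'll usually also want the condition to hold for some time before firing, plus a
human-readable summary for your notifications. In the same rule syntax, that could look like this (the
`FOR` duration and annotation text are just examples):

```
ALERT dummy_service_down
  IF fictional_healthmeter{job="prometheus-dummy-service"} < 50
  FOR 5m
  ANNOTATIONS {
    summary = "prometheus-dummy-service healthmeter has been below 50 for 5 minutes"
  }
```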

Now, when that happens (`fictional_healthmeter` dips to `20`):

Prometheus will submit this alert to lambda-alertmanager - you'll get a notification via your configured transports:


Why did we replace Prometheus' AlertManager?
--------------------------------------------

- Prometheus' AlertManager would have to run on your own infrastructure - more stuff for you to operate and worry about.

- Reliability. If AlertManager goes down, you are not going to be alerted. AlertManager is in a sense
  the most critical part of your infrastructure, as you have to trust it to work when shit hits the fan.
  You don't want your customers to call you because you yourself don't know that your servers are down.
  I.e. if monitoring goes down, who monitors the monitoring? I have great confidence in letting all of this
  run on AWS' well-managed environment.


But what if Prometheus goes down?
---------------------------------

Okay, we've established that lambda-alertmanager is in charge of reliably delivering alerts. But since
Prometheus is the one that **raises** those alerts, what if Prometheus itself goes down, so there's nobody
to alert us that monitoring is down?

For this case I advise you to have AlertManager-Canary monitor your Prometheus. Just configure an HTTP check
in Canary to alert you if Prometheus goes down. That way, as long as AWS stays up, you'll be notified even
if your entire cluster dies at the exact same moment.