A Guide to Unit Testing Prometheus Alerts

Although Prometheus alerting is widely used, unit testing these alerts is uncommon. Learn the best practices for testing them.
SRE || DevOps Engineer. I'm always fascinated by blazing-fast technology changes.

Alerting systems are an indispensable component of any robust monitoring setup. They function as the first line of defense, promptly notifying you of system anomalies that require immediate attention. Prometheus alerting is the most widely used alerting system in the Kubernetes ecosystem, so making sure that your alerts work as expected is critical to the health of your monitoring system.

In this article, we will go through the basics of unit testing Prometheus alerts and look at a few caveats of this approach.

Prerequisites

Unit Testing of Prometheus Alerts

Let’s create the alert rule below in a file called grafana-alert.yaml. It fires when the Grafana target has been reported as down (up == 0) for 15 minutes:

groups:
- name: Grafana
  rules:
  - alert: GrafanaDown
    annotations:
      summary: 'Grafana is missing from service discovery'
      description: 'Grafana in {{ $labels.cluster }} is down'
    expr: up{job="grafana",cluster="lab"} == 0
    for: 15m
    labels:
      severity: warning
      team: devops

Now, we will write a unit test for the above alert in a file called grafana-alert-test.yaml:

rule_files:
  - grafana-alert.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="grafana",cluster="lab"}'
        values: '0x15 1 1'
    alert_rule_test:
    - alertname: GrafanaDown
      eval_time: 14m
      exp_alerts: []
    - alertname: GrafanaDown
      eval_time: 15m
      exp_alerts:
      - exp_labels:
          severity: warning
          team: devops
          job: grafana
          cluster: lab
        exp_annotations:
          description: 'Grafana in lab is down'
          summary: 'Grafana is missing from service discovery'

Understanding the unit test config

evaluation_interval: 1m : How often the alert rules are evaluated. Default is 1m.

interval: 1m : The time between consecutive samples in input_series.

input_series : The time series that are used as input for the test. In our case, we are using the up metric with the job label set to grafana and the cluster label set to lab.

values : The input series values. In our case, we are setting the value to 0x15 1 1. With a 1m interval, this means the series is 0 from 0m through 15m (16 samples) and then 1 at 16m and 17m.

alert_rule_test : This is the actual test. Each entry has the following fields:

  • alertname : The name of the alert being tested.
  • eval_time : The time at which the alert is evaluated. In our case, we are evaluating the alert at 14m and 15m.
  • exp_alerts : The alerts expected to be firing at eval_time. Since we have set the for field in the alert to 15m, the alert is still pending at 14m and should not fire, hence exp_alerts is set to [].
  • exp_labels and exp_annotations : The expected labels and annotations for the alert. At 15m, the alert should fire, hence exp_alerts contains one entry with the expected labels and annotations.
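The trailing 1 1 in the input series can also be put to use. As a minimal sketch, assuming the same grafana-alert.yaml rule file, the test below adds a check that the alert is no longer firing once up returns to 1 at 16m:

```yaml
rule_files:
  - grafana-alert.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="grafana",cluster="lab"}'
        values: '0x15 1 1'
    alert_rule_test:
      # up is back to 1 at 16m, so "up == 0" returns nothing
      # and no firing alert is expected.
      - alertname: GrafanaDown
        eval_time: 16m
        exp_alerts: []
```

Checking both the firing and the recovery side of a rule helps catch expressions that never clear.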

Specifying the Input Series Values

Using Shorthand

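In our example, we specified the values as 0x15 1 1. Here 0x15 is shorthand for 0+0x15, which expands to sixteen zeros (the initial 0 plus 15 further samples), so the full series is 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1.

The syntax is: initial_value + (increment_value x increment_count). The series starts at initial_value and adds increment_value increment_count times, which yields increment_count + 1 samples.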

For instance, to specify values starting at 10 and going up to 50 with an increment of 5, use 10+5x8.
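One way to confirm how a shorthand expands is to query the series directly with a promql_expr_test. The sketch below is illustrative only: queue_length and its job label are hypothetical names, not metrics from this article's setup. With a 1m interval, 10+5x8 produces nine samples (10 15 20 25 30 35 40 45 50), so the value at 8m should be 50:

```yaml
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Hypothetical metric; expands to 10 15 20 25 30 35 40 45 50
      - series: 'queue_length{job="worker"}'
        values: '10+5x8'
    promql_expr_test:
      # The ninth sample lands at 8m: 10 + 5*8 = 50
      - expr: queue_length
        eval_time: 8m
        exp_samples:
          - labels: 'queue_length{job="worker"}'
            value: 50
```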

Stale and Missing Samples

To represent missing and stale samples, use _ for a missing sample and stale for a stale sample.

Examples:

| Shorthand     | Expanded             |
|---------------|----------------------|
| 1+2x3         | 1 3 5 7              |
| 0+10x100      | 0 10 20 30 ... 1000  |
| 1x10          | 1 1 1 ... (11 times) |
| 3+0x3 2+0x3   | 3 3 3 3 2 2 2 2      |
| 1 _x3         | 1 _ _ _              |
| 2 stale 5+2x3 | 2 stale 5 7 9 11     |
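A stale marker can be used to test what happens when a series disappears partway through the for window. The following is a sketch under the assumption that the GrafanaDown rule from grafana-alert.yaml is used unchanged: the series is 0 from 0m to 10m and goes stale at 11m, so the alert should never reach the firing state.

```yaml
rule_files:
  - grafana-alert.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 0x10 gives eleven zeros (0m to 10m); the series is marked stale at 11m.
      - series: 'up{job="grafana",cluster="lab"}'
        values: '0x10 stale'
    alert_rule_test:
      # The series went stale before the 15m "for" window elapsed,
      # so no firing alert is expected.
      - alertname: GrafanaDown
        eval_time: 15m
        exp_alerts: []
```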

Running the Test

Execute the test using promtool :

promtool test rules grafana-alert-test.yaml

Unit Testing:  grafana-alert-test.yaml
  SUCCESS
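To run the same command on every change, the test can be wired into CI. The workflow below is only a sketch and rests on several assumptions not made in this article: GitHub Actions as the CI system, a hypothetical workflow file name, and the public prom/prometheus container image as the source of the promtool binary. Adapt it to whatever CI you use.

```yaml
# .github/workflows/alert-tests.yaml (hypothetical file name)
name: prometheus-alert-tests

on:
  pull_request:
  push:
    branches: [main]

jobs:
  promtool-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run promtool unit tests
        # Assumes the prom/prometheus image ships promtool at /bin/promtool.
        run: |
          docker run --rm \
            -v "$PWD:/rules" -w /rules \
            --entrypoint /bin/promtool \
            prom/prometheus:latest \
            test rules grafana-alert-test.yaml
```

promtool exits with a non-zero status when a test fails, which is what makes the CI job fail.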

Debugging Failed Tests

If a test fails, the output may look like:

FAILED:
    alertname: GrafanaDown, time: 15m, 
        exp:[
            0:
              Labels:{alertname="GrafanaDown", cluster="lab", severity="warning", team="devops"}
              Annotations:{description="Grafana in lab is down", summary="Grafana is missing from service discovery"}
            ], 
        got:[
            0:
              Labels:{alertname="GrafanaDown", cluster="lab", job="grafana", severity="warning", team="devops"}
              Annotations:{description="Grafana in lab is down", summary="Grafana is missing from service discovery"}
            ]

A test can fail because the alert carries unexpected labels or because expected labels are missing. In this output, the failure is due to the job label being present in the alert but not expected, i.e., not present in the exp_labels section of the test.

Types of Alerts to Test

  • Critical Alerts (P1): Alerts that directly impact business operations or indicate potential service outages should be tested thoroughly. For instance, alerts with severity critical that are triggered by low disk space on critical servers, high error rates on web services, critical certificate rotations, or database connection limits should undergo unit testing to ensure proper detection. A simple rule: every alert sent to PagerDuty should have unit tests.
  • Complex Logic Alerts: Alerts with complex logic or dependencies on multiple metrics are prime candidates. For example, imagine you run an e-commerce app and have an alert that triggers when a surge in abandoned carts coincides with a rise in the payment failure rate. Testing this alert well is important because it may join multiple metrics from different services.

Alerts Not Ideal for Unit Testing

  • Low Impact Alerts: Alerts that have minimal business impact or serve informational purposes, but are valuable enough to keep, may not justify the overhead of unit testing.
  • Dynamic Environment Alerts: Alerts in dynamically changing environments, where infrastructure and metrics fluctuate frequently, pose challenges for unit testing. Such setups are discouraged in the first place because of metric cardinality issues, and keeping unit tests for their alerts reliable becomes cumbersome and may not yield much benefit.

Notable Points

  • While unit testing is valuable, debugging failed tests can be challenging and requires manual intervention, especially at scale.
  • Updates to alerts, even minor label changes, require corresponding updates to the unit tests.
  • If you have hundreds of alerts, writing and maintaining unit tests for all of them is difficult.
  • SaaS providers like Grafana Cloud, Coralogix, New Relic, etc. provide alerting as a service. Their alerting engines are Prometheus-compatible and based on Prometheus, but they are not Prometheus itself. So, if you use one of these services and want unit tests for your alerts, you need to maintain the alerts in Git and write a script to run the unit tests in CI or by other means.
  • Unit testing is not widely adopted in the Prometheus community, so implementing unit tests for your alerts can be challenging when you get stuck.
  • Unit tests do not fully reflect reality. For example, your test might succeed, but your alert might still not fire because one of the labels, like cluster, is not present on the real metric in Prometheus. Hence, running tests against actual metrics can be a more viable option.
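The missing-label caveat can be reproduced inside a unit test. As a sketch, assuming the same grafana-alert.yaml rule, the test below feeds in a series without the cluster label, as a real Prometheus might if that label is never attached. The rule's selector requires cluster="lab", so nothing matches and no alert fires even though Grafana is down:

```yaml
rule_files:
  - grafana-alert.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Same target, but without the cluster label.
      - series: 'up{job="grafana"}'
        values: '0x20'
    alert_rule_test:
      # The selector cluster="lab" never matches this series, so the
      # alert stays silent although up is 0 for more than 15m.
      - alertname: GrafanaDown
        eval_time: 20m
        exp_alerts: []
```

The earlier test passes only because its input series was written with the labels the rule expects, which is why comparing test inputs against real metrics matters.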

Aviator.co | Blog
