This document outlines the specification that software has to follow to be Prometheus Alert-Generator compliant. It covers:
Input:
- The format of the alerting rules.
Output:
- Format of the alert.
- Payload format of the alerts sent to the Alertmanager. (Alertmanager is described in “Sending Alerts to Alertmanager” section)
- GET APIs to support, with their respective format.
Between Input and Output:
- How to maintain different states and lifecycles of an alert.
- When to send an alert out to the Alertmanager.
This document follows the RFC 2119 language.
Software MUST pass the test suite at {} to be called “Prometheus alert-generator compliant”.
The setup is made up of 3 different components:
- Sample receiver: to which samples are sent via the Prometheus remote write protocol.
- Sample querier: which allows querying samples via PromQL using Prometheus style query APIs. Used to query the ALERTS series generated by the alert-generator.
- Alert-generator: that does everything mentioned below in the doc: accepts the alerting rules, executes them, maintains the alert states, sends alerts to Alertmanager, and supports GET /api/v1/alerts and GET /api/v1/rules.
Only the alert-generator needs to follow the specification below; the sample receiver and sample querier only facilitate ingestion and querying of time series data. The 3 components are not required to be part of the same software: they can be a single piece of software or separate pieces of software.
An alert in JSON MUST follow the following format:
{
"labels": {
"alertname": "<alertname>",
"label1": "value1",
"label2": "value2",
"..."
},
"annotations": {
"label1": "value1",
"label2": "value2",
"..."
},
"startsAt": "<RFC3339Millis time>",
"endsAt": "<RFC3339Millis time>",
"generatorURL": "<string>"
}
- labels: MUST be present. The labels uniquely identify an alert.
- annotations: MUST be present IF the alert has annotations. Annotations provide additional details about the alert which can change over time for the same alert.
- startsAt: SHOULD be present. It is the time when the alert was triggered.
- endsAt: SHOULD be present. It is the time when the alert MUST be considered inactive. Note that future alert updates MAY change this value.
- generatorURL: SHOULD be present. It is a URL that takes the user to the query page for the source expression of the alert.
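As an illustration of the format above, here is a minimal sketch that serializes one alert. The helper names `rfc3339_millis` and `alert_to_json` are hypothetical, and the sketch assumes that "RFC3339Millis" means an RFC 3339 timestamp in UTC with millisecond precision.

```python
import json
from datetime import datetime, timezone


def rfc3339_millis(t: datetime) -> str:
    """Format a timezone-aware datetime as RFC 3339 (UTC, millisecond precision)."""
    t = t.astimezone(timezone.utc)
    return t.strftime("%Y-%m-%dT%H:%M:%S.") + f"{t.microsecond // 1000:03d}Z"


def alert_to_json(labels, annotations, starts_at, ends_at, generator_url):
    """Serialize one alert; `annotations` is emitted only if the alert has any."""
    alert = {"labels": dict(labels)}
    if annotations:
        alert["annotations"] = dict(annotations)
    alert["startsAt"] = rfc3339_millis(starts_at)
    alert["endsAt"] = rfc3339_millis(ends_at)
    alert["generatorURL"] = generator_url
    return json.dumps(alert)
```

The sketch always emits startsAt, endsAt and generatorURL, which is one way to satisfy the SHOULD requirements; an implementation could make different choices.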
The alert-generator MUST accept the Prometheus style alerting rules configuration as described in the v2.33 docs for Alerting Rules with the following structure. It MUST be in either YAML format or an equivalent JSON format. Alert-generator MAY accept them as files on disk or via an API.
groups:
  [ - <rule_group> ]

<rule_group>
# The name of the group. MUST be unique within a file.
name: <string>

# How often rules in the group are evaluated.
[ interval: <duration> | default = 1m ]

rules:
  [ - <rule> ... ]

<rule>
# The name of the alert. MUST be a valid label value.
alert: <string>

# The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and all resultant time series become
# pending/firing alerts.
expr: <string>

# Alerts are considered firing once they have been returned for this long.
# Alerts which have not yet fired for long enough are considered pending.
[ for: <duration> | default = 0s ]

# Labels to add or overwrite for each alert.
labels:
  [ <labelname>: <template_string> ]

# Annotations to add to each alert.
annotations:
  [ <labelname>: <template_string> ]
Example config:
groups:
- name: example
  rules:
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
  - alert: VeryHighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 1
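The defaults in the structure above (1m for interval, 0s for for) can be applied while loading the configuration. The following is a minimal sketch under stated assumptions: it reads the equivalent JSON form so that it needs only the standard library, and `parse_duration` and `load_groups` are hypothetical helper names, not part of any Prometheus API.

```python
import json
import re

# Seconds per duration unit (a year is treated as 365 days).
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text: str) -> float:
    """Parse a duration such as "10m" or "1h30m" into seconds."""
    parts = re.findall(r"(\d+)(ms|[smhdwy])", text)
    # Reject input that is not made up entirely of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"invalid duration: {text!r}")
    return float(sum(int(n) * _UNITS[u] for n, u in parts))


def load_groups(config_text: str):
    """Load rule groups from JSON, applying the default interval and for values."""
    config = json.loads(config_text)
    groups, seen = [], set()
    for g in config.get("groups", []):
        if g["name"] in seen:
            raise ValueError(f"duplicate group name: {g['name']!r}")
        seen.add(g["name"])
        rules = [
            {
                "alert": r["alert"],
                "expr": r["expr"],
                "for": parse_duration(r.get("for", "0s")),  # default = 0s
                "labels": r.get("labels", {}),
                "annotations": r.get("annotations", {}),
            }
            for r in g.get("rules", [])  # rule order is preserved
        ]
        interval = parse_duration(g.get("interval", "1m"))  # default = 1m
        groups.append({"name": g["name"], "interval": interval, "rules": rules})
    return groups
```

Keeping the rules in a list preserves the order given in the config, which matters because later rules in a group may depend on the results of earlier ones.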
- Results of a rule evaluation MUST be available for any subsequent rules in the group during the same evaluation cycle that depend on these results. The order of rules MUST be the same as provided in the config.
- Labels and annotations in alerting rules MUST support all template variables and functions as described in the v2.33 template reference for the values, with the following exceptions:
  - graphLink, tableLink: MAY be supported if the sample querier supports having a UI link for graph and table respectively.
  - tmpl, pathPrefix, safeHtml: MAY be supported if the software supports console template files.
  - strvalue: MAY be supported if there is a use case.
  - .ExternalLabels/$externalLabels, .ExternalURL/$externalURL: MAY be supported if the software supports configuring external labels and external URL.
The config MUST NOT be rejected if it contains the optional template variables and/or functions listed above. They MUST result in an empty string if not supported.
The PromQL expression expr of the alerting rule MUST be executed against the Sample Querier as an instant query for the current time (called the “evaluation time” or the “group evaluation time”). This MUST be done at regular intervals and the interval MUST be the interval from the parent <rule_group> of this alerting rule, which MUST default to 1 minute if not specified in the config.
Steps to follow in order to process the query result:
1. Each element in the result vector of the instant query MUST produce a distinct alert, and the labels of the element MUST become the labels of the alert.
For example, if the result vector was
my_metric_total{job="foo", status="500"} => 10
my_metric_total{job="foo", status="400"} => 18
then the corresponding alerts produced at this step MUST be
Alert1 = { "labels": {"__name__":"my_metric_total","job":"foo","status":"500"}, ... }
Alert2 = { "labels": {"__name__":"my_metric_total","job":"foo","status":"400"}, ... }
2. The labels and annotation templates from the alerting rule MUST be run for each of these alerts individually, with the label-value data for the template coming from the corresponding element of the result vector. The output of the template execution MUST be added to the alert as labels and annotations respectively. Labels from template execution MUST override the existing labels in the alert.
3. The alert name from the alerting rule (HighRequestLatency from the example above) MUST be added to the labels of the alert with the label name alertname. It MUST override any existing alertname label.
The labels of the alert at the end of step 3 MUST uniquely identify an alert.
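The label handling in steps 1 to 3 can be sketched as follows. This is only an illustration: `build_alert` and its `render` callback are hypothetical names, and the default `render` assumes templates are used verbatim, whereas a real implementation would run the template engine there.

```python
def build_alert(sample_labels, sample_value, rule, render=None):
    """Apply steps 1 to 3 to one element of the instant-query result vector."""
    if render is None:
        # Assumption: with no template engine supplied, templates are used verbatim.
        render = lambda template, labels, value: template

    # Step 1: the labels of the result element become the labels of the alert.
    labels = dict(sample_labels)

    # Step 2: run label and annotation templates; rule labels override existing ones.
    for name, template in rule.get("labels", {}).items():
        labels[name] = render(template, sample_labels, sample_value)
    annotations = {
        name: render(template, sample_labels, sample_value)
        for name, template in rule.get("annotations", {}).items()
    }

    # Step 3: alertname always comes from the rule and overrides any existing value.
    labels["alertname"] = rule["alert"]
    return {"labels": labels, "annotations": annotations}
```

Setting alertname last guarantees it wins over both the query result and the rule's own labels.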
An alert MUST be in pending, firing or inactive state. The “pending State Conditions”, “firing State Conditions”, “inactive State Conditions” and “Time Series to Create” MUST be checked after step 3.
If there is more than one alert with the same labels at the end of step 3, the execution of the alerting rule MUST error out immediately. In that case it MUST NOT send any alerts as described in the “Sending Alerts to Alertmanager” section, and MUST NOT add samples to the sample receiver as described in the “Time Series to Create” section. This error MUST be reflected in the output of the GET /api/v1/rules API as described in the “APIs to Support” section below.
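A minimal sketch of the duplicate check described above, assuming each alert is a dict with a "labels" mapping. `DuplicateAlertError`, `ensure_unique` and `finish_evaluation` are hypothetical names; the returned dict only mirrors the health and lastError fields of the rules API.

```python
class DuplicateAlertError(Exception):
    """Raised when two alerts of one rule evaluation share the same label set."""


def ensure_unique(alerts):
    """Fail if any two alerts have identical labels at the end of step 3."""
    seen = set()
    for alert in alerts:
        key = tuple(sorted(alert["labels"].items()))
        if key in seen:
            raise DuplicateAlertError(f"duplicate alert labels: {dict(key)}")
        seen.add(key)


def finish_evaluation(alerts):
    """Return the rule outcome; on duplicates nothing is sent or written."""
    try:
        ensure_unique(alerts)
    except DuplicateAlertError as err:
        return {"health": "err", "lastError": str(err), "alerts": []}
    return {"health": "ok", "lastError": "", "alerts": alerts}
```

Sorting the label pairs makes the comparison independent of the order in which labels were added.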
Alerts MUST start in pending state if the for duration is non-zero. This evaluation time when the alert is first created is referred to as ActiveAt for that alert.
The alert MUST stay in pending state during an evaluation if the difference between evaluation time and ActiveAt is less than the for duration (as specified in the alerting rule).
If the annotation values change at any evaluation, the latest annotations MUST be updated to the alert immediately.
If the difference between the current evaluation time and ActiveAt is greater than or equal to the for duration (as specified by the alerting rule), the alert MUST go into firing state immediately. This evaluation time when it first went into firing state is referred to as FiredAt.
For a zero for duration, the alert MUST directly go into firing state the first time the alert was created and skip the initial pending state. This evaluation time when the alert is first created is referred to as ActiveAt.
For a non-zero for duration that is less than the group evaluation interval, the alert MUST go into firing state during the next evaluation after it went into pending state, and not in between evaluations, if the alert does not become inactive in the next evaluation.
If the annotation values change at any evaluation, the latest annotations MUST be updated to the alert immediately.
If an existing pending or firing state alert was not produced by the current evaluation of the rule, that alert MUST immediately go into inactive state. This evaluation time where the alert got resolved is referred to as ResolvedAt.
Any alerts in future evaluations with the same labels as an inactive alert MUST be considered as a new alert and MUST follow the pending and firing state conditions as stated above. The ActiveAt and ResolvedAt MUST be set again according to the above conditions for pending and firing states.
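The pending, firing and inactive conditions above amount to a small state machine that runs once per evaluation. The sketch below is one possible reading, not a reference implementation: the `Alert` dataclass, the `evaluate` function and the use of plain numbers (seconds) for time are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    labels: dict
    annotations: dict
    active_at: float                      # ActiveAt: evaluation time of creation
    state: str = "pending"
    fired_at: Optional[float] = None      # FiredAt
    resolved_at: Optional[float] = None   # ResolvedAt


def evaluate(active, produced, now, for_seconds):
    """Advance alert states for one evaluation.

    active:   dict of key -> Alert holding pending/firing alerts (mutated in place)
    produced: dict of key -> (labels, annotations) from the current evaluation
    Returns the alerts that became inactive in this evaluation.
    """
    resolved = []
    for key in list(active):
        if key not in produced:
            alert = active.pop(key)
            alert.state, alert.resolved_at = "inactive", now
            resolved.append(alert)
    for key, (labels, annotations) in produced.items():
        alert = active.get(key)
        if alert is None:
            # A reappearing alert is treated as new: ActiveAt is set again.
            alert = active[key] = Alert(labels, annotations, active_at=now)
        alert.annotations = annotations   # latest annotations apply immediately
        if alert.state == "pending" and now - alert.active_at >= for_seconds:
            alert.state, alert.fired_at = "firing", now
    return resolved
```

With a zero for duration, a new alert satisfies `now - active_at >= 0` in the evaluation that creates it, so it skips pending and fires at once. State changes happen only inside `evaluate`, so nothing transitions between evaluations.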
At the end of a single alerting rule evaluation, for each active alert (i.e. pending and firing state alerts), the alert-generator MUST produce the following time series with a sample value of 1 and a timestamp matching the evaluation time and send it over to the sample receiver.
The sample MUST be immediately available via the sample querier for the evaluation of subsequent rules in the parent rule group during the same evaluation cycle.
Series labels (sorted) MUST have these labels only.
{
"__name__": "ALERTS",
"alertstate": "pending" or "firing",
<all labels from the alert including "alertname">
}
The alertstate MUST be "pending" for a pending state alert and MUST be "firing" for a firing state alert.
The __name__ and alertstate labels MUST override any existing labels in the alert with the values above.
For example, if the alert labels of a firing alert at the end of step 3 of processing the instant query result were { "__name__":"my_metric_name", "alertstate":"very_critical", "alertname":"HighRequestLatency", "severity":"page" }, then the labels for the time series would be { "__name__":"ALERTS", "alertstate":"firing", "alertname":"HighRequestLatency", "severity":"page" }.
Series MUST NOT be created for an alert that is in inactive state.
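The label rules for the ALERTS series can be sketched as a single function. `alerts_series_labels` is a hypothetical name, and raising an error for inactive alerts is just one way to express that no series is created for them.

```python
def alerts_series_labels(alert_labels, state):
    """Build the sorted label set of the ALERTS series for one active alert."""
    if state not in ("pending", "firing"):
        raise ValueError("series MUST NOT be created for inactive alerts")
    labels = dict(alert_labels)
    # __name__ and alertstate override any labels of the same name on the alert.
    labels["__name__"] = "ALERTS"
    labels["alertstate"] = state
    return dict(sorted(labels.items()))
```

The sample attached to this series has the value 1 and the evaluation time as its timestamp; sending it to the sample receiver is outside the scope of this sketch.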
Alertmanager is any software that accepts alerts to process further in the format described below in “Sending Alerts to Alertmanager” section, for example, Prometheus Alertmanager.
Alert-generators MUST send only firing and inactive state alerts to an alertmanager. The alerts MUST be sent only after the respective rule evaluations and not in between two evaluations.
The “Conditions for Sending firing Alerts” and “Conditions for Sending inactive Alerts” MUST be checked after the “pending State Conditions”, “firing State Conditions” and “inactive State Conditions” steps.
The ResendDelay used for resending the alert SHOULD be configurable and MUST default to 1 minute.
firing alerts MUST be sent to Alertmanager in the following scenarios:
- When it first went into firing state.
- The difference between current evaluation time and the last time the firing alert was sent to the alertmanager is more than ResendDelay.
This implies that the firing alert MUST be sent continuously with a fixed interval until it becomes inactive.
ResendDelay acts as a minimum interval while the actual interval MUST be the first >0 multiple of the group interval that is more than or equal to ResendDelay.
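The actual resend interval described above can be computed directly from the group interval and ResendDelay. The sketch below covers only that calculation; `effective_resend_interval` is a hypothetical name and both arguments are assumed to be in seconds.

```python
import math


def effective_resend_interval(group_interval, resend_delay=60.0):
    """First >0 multiple of the group interval that is >= ResendDelay."""
    multiples = max(1, math.ceil(resend_delay / group_interval))
    return multiples * group_interval
```

For example, with the default ResendDelay of 1 minute, a 15 second group interval gives 60 seconds (4 evaluations), a 45 second group interval gives 90 seconds (2 evaluations), and a 5 minute group interval gives 300 seconds (every evaluation).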
inactive alerts MUST be sent to Alertmanager in the following scenarios:
- When it first went into inactive state.
- The difference between current evaluation time and the last time the inactive alert was sent to the Alertmanager is more than ResendDelay AND the difference between current evaluation time and ResolvedAt is less than 15 minutes AND there is no new active alert (pending or firing state alert) with the same labels.
This implies that the inactive alert MUST be sent continuously with a fixed interval until 15 minutes after the ResolvedAt of the alert, or until a new alert is created with the same labels.
ResendDelay acts as a minimum interval while the actual interval MUST be the first >0 multiple of the group interval that is more than or equal to ResendDelay.
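To make the implied schedule concrete, the sketch below lists the evaluation times at which an inactive alert would be sent, assuming no new alert with the same labels appears in the meantime. `inactive_send_times` is a hypothetical helper, times are in seconds, and the schedule follows the "fixed interval until 15 minutes after ResolvedAt" reading of the text above.

```python
import math


def inactive_send_times(resolved_at, group_interval, resend_delay=60.0, retention=15 * 60):
    """Times at which a resolved alert is sent, starting at ResolvedAt."""
    # Actual interval: first >0 multiple of the group interval >= ResendDelay.
    interval = max(1, math.ceil(resend_delay / group_interval)) * group_interval
    times = []
    t = resolved_at
    # Keep sending while less than 15 minutes have passed since ResolvedAt.
    while t - resolved_at < retention:
        times.append(t)
        t += interval
    return times
```

With a 1 minute group interval and the default ResendDelay this yields 15 sends (at ResolvedAt and then every minute up to 14 minutes later); with a 5 minute group interval it yields 3.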
The alerts can be sent out in any format required by the software, but the format MUST be translatable to the following JSON format:
[
<alert 1>,
<alert 2>,
...
]
Where the structure of each <alert i> MUST be the same as described in the “Alert Format” section above.
The parameters of each alert MUST be set as follows:
- labels: MUST be the same as labels of the alert produced in “Executing an Alerting Rule” after step 3.
- annotations: MUST be the same as annotations of the alert produced in “Executing an Alerting Rule” after step 3.
- startsAt: MUST be set to the FiredAt time of the alert (in both cases when alert is firing or inactive) as described in “Executing an Alerting Rule”.
- endsAt:
  - If the alert is inactive, then it MUST be ResolvedAt from “Executing an Alerting Rule”.
  - If the alert is firing, the value MUST be StartsAt + (4 * ResendDelay) or StartsAt + (4 * Group Interval), whichever is higher. StartsAt is as seen in “Executing an Alerting Rule”, ResendDelay is as seen in “Sending Alerts to Alertmanager”, and Group Interval is the interval of the parent <rule_group> of the corresponding alerting rule.
- generatorURL: It SHOULD be set to a URL that takes the user to the query page for the source expression of the alert.
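The endsAt rule above reduces to a short calculation. The following sketch assumes times and durations are plain numbers in seconds; `compute_ends_at` is a hypothetical name.

```python
def compute_ends_at(state, starts_at, resolved_at, resend_delay, group_interval):
    """Return endsAt for an alert that is about to be sent to Alertmanager."""
    if state == "inactive":
        return resolved_at
    if state == "firing":
        # The higher of StartsAt + 4*ResendDelay and StartsAt + 4*GroupInterval.
        return starts_at + 4 * max(resend_delay, group_interval)
    raise ValueError("only firing and inactive alerts are sent")
```

Taking the maximum before multiplying is equivalent to comparing the two sums, since both share the same StartsAt and the same factor of 4.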
This API returns all the rules along with their health and associated alerts. The alert-generator MUST support the GET /api/v1/rules API. The API MUST return a JSON containing the following fields and it MAY add additional custom fields anywhere in the JSON.
{
"status" : "success",
"data": {
"groups": [ <group>, ]
}
}
<group>
{
"name": "<string>",
"interval": <float>,
"lastEvaluation": "<RFC3339Millis time>",
"rules": [ <rule>, ]
}
- name is the group name as present in the config.
- interval is the group evaluation interval in float seconds as present in the file.
- lastEvaluation is the timestamp of the last time the group was evaluated.
An example for a custom field here that is used by Prometheus is "file": "<string>", which tells where on disk is the rule file that contains this group.
<rule>
{
"type": "alerting",
"name": "<string>",
"query": "<string>",
"duration": <float>,
"labels": {
"label1": "value1",
"label2": "value2",
"..."
},
"annotations": {
"label1": "value1",
"label2": "value2",
"..."
},
"lastEvaluation": "<RFC3339Millis time>",
"evaluationTime": <float>,
"health": "<string>",
"state": "<string>",
"alerts": [ <alert>, ],
[ "lastError": "<string>" ]
}
- name, query, labels, annotations are exactly the same as present in the alerting rule config.
- duration is the same as the for period in float seconds.
- lastEvaluation is the timestamp of the last time the rule was evaluated.
- evaluationTime is the time taken to completely evaluate the rule in float seconds.
- health is the health of rule evaluation. It MUST be one of "ok", "err", "unknown".
- state MUST be one of these under the following scenarios:
  - "pending": at least 1 alert in the rule in pending state and no other alert in firing state.
  - "firing": at least 1 alert in the rule in firing state.
  - "inactive": no alert in the rule in firing or pending state.
- alerts is the list of all the alerts in this rule that are currently pending or firing.
- lastError MUST be omitted or empty "" when health is "ok". lastError MUST be non empty for other health states, containing the error faced while executing the rule.
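The rule-level state follows from the states of the rule's alerts. A minimal sketch, with `rule_state` as a hypothetical helper name:

```python
def rule_state(alert_states):
    """Derive the rule state from the states of its alerts."""
    states = set(alert_states)
    if "firing" in states:
        return "firing"      # at least 1 firing alert
    if "pending" in states:
        return "pending"     # at least 1 pending alert and none firing
    return "inactive"        # no firing or pending alerts
```

Checking for firing first encodes the "no other alert in firing state" condition attached to "pending".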
<alert>
{
"activeAt": "<RFC3339Millis time>",
"state": "firing",
"value": "<string>",
"labels": {
"label1": "value1",
"label2": "value2",
"..."
},
"annotations": {
"label1": "value1",
"label2": "value2",
"..."
}
}
- activeAt is the time the alert was created as described in the above specification as ActiveAt.
- state MUST be one of "pending", "firing" or "inactive".
- value is the stringified float value of the instant query sample that created this alert.
- labels are the labels of the alert.
- annotations are the annotations of the alert.
This API returns the union of all alerts across all the rules as seen in GET /api/v1/rules. The alert-generator MUST support the GET /api/v1/alerts API. The API MUST return a JSON containing the following fields and it MAY add additional custom fields anywhere in the JSON.
{
"status": "success",
"data": {
"alerts": [ <alert>, ]
}
}
<alert>
{
"activeAt": "<RFC3339Millis time>",
"state": "firing",
"value": "<string>",
"labels": {
"label1": "value1",
"label2": "value2",
"..."
},
"annotations": {
"label1": "value1",
"label2": "value2",
"..."
}
}
- activeAt is the time the alert was created as described in the above specification as ActiveAt.
- state MUST be one of "pending", "firing" or "inactive".
- value is the stringified float value of the sample that created this alert.
- labels are the labels of the alert.
- annotations are the annotations of the alert.
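Because the alerts API returns the union of the alerts seen in the rules API, one response can be derived from the other. The sketch below assumes the rules response has already been built as a Python dict in the shape given earlier; `alerts_response` is a hypothetical name.

```python
def alerts_response(rules_response):
    """Build the GET /api/v1/alerts body from a GET /api/v1/rules body."""
    alerts = []
    for group in rules_response["data"]["groups"]:
        for rule in group["rules"]:
            alerts.extend(rule.get("alerts", []))
    return {"status": "success", "data": {"alerts": alerts}}
```

Deriving both responses from the same in-memory state is one way to keep the two endpoints consistent with each other.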