A collection of static analysis and testing tools for PromQL-compatible monitoring systems. All tools are written in Go and designed for use in CI/CD workflows, with --check mode as the default behavior.
This collection of static analysis tools is intended to help:
- maintain PromQL rules, by keeping them readable, consistent, unit tested and, most importantly, approachable by non-experts in PromQL
- identify common bugs around labels, time controls, etc.
- maintain high-quality, actionable alerts, including the ability to preview how an alert would render
- identify alerts in need of refinement, or even deletion once they are no longer useful
Each of these tools is inspired by mistakes I've made when writing PromQL rules, and by issues seen while peer reviewing PromQL PRs or providing PromQL consultations, during my tenure as a member of Google SRE.
Statically analyzes PromQL expressions and formats them across multiple lines for readability.
Features:
- Checks PromQL expressions for multiline formatting standards
- Automatically formats long or complex expressions for better readability
- Integrates with CI to enforce formatting standards
Usage:
# Check formatting (default mode, exits 1 if issues found)
promql-fmt --check ./alerts/
# Automatically fix formatting issues
promql-fmt --fix ./alerts/
promql-fmt --fmt ./alerts/ # alias for --fix
# Verbose output
promql-fmt --verbose --check ./prometheus/
Example:
Before:
expr: sum(rate(http_requests_total{job="api",status=~"5.."}[5m])) by (instance) / sum(rate(http_requests_total{job="api"}[5m])) by (instance)
After:
expr: |
sum (
rate(http_requests_total{job="api",status=~"5.."}[5m])
)
/ on (instance)
sum by (instance) (
rate(http_requests_total{job="api"}[5m])
)
Note: The formatter automatically:
- Removes redundant aggregation clauses from the left operand when both operands share the same 'by' clause
- Adds explicit 'on()' clauses for vector matching based on the aggregation labels
- Follows PromQL best practices where only the final operand needs the aggregation
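The formatting step itself can be built on the upstream PromQL parser. The following is a minimal sketch for illustration only, not the tool's actual implementation; it assumes github.com/prometheus/prometheus/promql/parser and its Pretty method, and the aggregation/on() clause rewriting described in the note above would need additional AST manipulation not shown here.

package main

import (
	"fmt"
	"log"

	"github.com/prometheus/prometheus/promql/parser"
)

// formatExpr parses a PromQL expression and re-renders it with the parser's
// Pretty method, which splits long expressions across multiple indented lines.
func formatExpr(raw string) (string, error) {
	expr, err := parser.ParseExpr(raw)
	if err != nil {
		return "", fmt.Errorf("parse error: %w", err)
	}
	return expr.Pretty(0), nil
}

func main() {
	out, err := formatExpr(`sum by (instance) (rate(http_requests_total{job="api"}[5m]))`)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}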
Enforces required labels in PromQL expressions to prevent collisions in multi-tenant observability platforms.
Features:
- Validates that all PromQL expressions include required labels
- Default: checks for the 'job' label to prevent tenant collisions
- Configurable for any set of required labels
- Detailed violation reporting with line numbers
Usage:
# Check for default 'job' label
label-check --check ./alerts/
# Check for multiple required labels
label-check --labels=job,namespace ./alerts/
# Check specific file
label-check --labels=job,cluster alerts.yml
Example Output:
./alerts/api-alerts.yml:
Expression: rate(http_requests_total[5m])
Missing required labels: job
Line: 12
Found 1 expressions with missing required labels
Required labels: job
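Under the hood, a check like this amounts to walking the PromQL AST and verifying that every vector selector carries a matcher for each required label. A minimal sketch of that idea, assuming the upstream parser (github.com/prometheus/prometheus/promql/parser) rather than the tool's actual code:

package main

import (
	"fmt"
	"log"

	"github.com/prometheus/prometheus/promql/parser"
)

// missingLabel returns the vector selectors in expr that carry no matcher
// for the required label (e.g. 'job').
func missingLabel(expr, required string) ([]string, error) {
	root, err := parser.ParseExpr(expr)
	if err != nil {
		return nil, err
	}
	var violations []string
	parser.Inspect(root, func(node parser.Node, _ []parser.Node) error {
		vs, ok := node.(*parser.VectorSelector)
		if !ok {
			return nil
		}
		for _, m := range vs.LabelMatchers {
			if m.Name == required {
				return nil // this selector already constrains the required label
			}
		}
		violations = append(violations, vs.String())
		return nil
	})
	return violations, nil
}

func main() {
	violations, err := missingLabel(`rate(http_requests_total[5m])`, "job")
	if err != nil {
		log.Fatal(err)
	}
	for _, sel := range violations {
		fmt.Printf("missing required label %q in selector %s\n", "job", sel)
	}
}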
Analyzes historical alert firing patterns and recommends optimal 'for' durations to reduce spurious, unactionable alerts.
Features:
- Queries Prometheus for historical alert firing data
- Compares actual firing durations with configured 'for' values
- Recommends better hysteresis values based on statistical analysis
- Identifies spurious short-lived alerts
- Suggests optimal values to reduce alert fatigue
Usage:
# Analyze all alerts from last 7 days
alert-hysteresis --prometheus-url=http://localhost:9090
# Analyze specific alert over 24 hours
alert-hysteresis --prometheus-url=http://prometheus:9090 \
--alert=HighErrorRate \
--timeframe=24h
# Compare with configured values in rules file
alert-hysteresis --prometheus-url=http://prometheus:9090 \
--rules=./alerts.yml \
--timeframe=7d
# Adjust sensitivity threshold (default: 20% mismatch)
alert-hysteresis --prometheus-url=http://prometheus:9090 \
--threshold=0.3 \
--rules=./alerts.yml
Example Output:
Fetching alert history from http://prometheus:9090 (timeframe: 168h0m0s)...
Analyzing 156 alert firing events...
Alert: HighErrorRate
Firing events: 45
Average duration: 3m24s
Median duration: 2m15s
Min/Max duration: 45s / 25m30s
Configured 'for': 30s
⚠ RECOMMENDATION: Change 'for' duration to 2m
Reason: 33.3% of alerts (15/45) fire for less than 2m, suggesting spurious alerts
Spurious alerts (< recommended): 15 (33.3%)
Alert: HighMemoryUsage
Firing events: 12
Average duration: 45m12s
Median duration: 42m0s
Min/Max duration: 15m / 2h15m
Configured 'for': 30m
Recommended 'for': 30m
✓ Current configuration is acceptable
Found 1 alerts that need hysteresis adjustment
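The recommendation boils down to picking a 'for' value from the distribution of observed firing durations. A simplified sketch of that idea (an assumption about the approach, not the exact algorithm the tool uses):

package main

import (
	"fmt"
	"sort"
	"time"
)

// recommendFor returns a candidate 'for' duration: the point below which the
// given fraction of historical firings would have been suppressed as spurious.
func recommendFor(durations []time.Duration, spuriousFraction float64) time.Duration {
	if len(durations) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), durations...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(spuriousFraction * float64(len(sorted)))
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	// Round to a human-friendly granularity.
	return sorted[idx].Round(30 * time.Second)
}

func main() {
	firings := []time.Duration{45 * time.Second, 90 * time.Second, 2 * time.Minute, 5 * time.Minute, 25 * time.Minute}
	fmt.Println("recommended 'for':", recommendFor(firings, 0.2))
}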
Automatically generates unit test cases for PromQL expressions to ensure recording rules and metric calculations work as expected.
Features:
- Generates test cases from existing PromQL recording rules
- Creates baseline test fixtures based on current metric values
- Supports custom test scenarios and edge cases
- Integration with Prometheus unit testing framework
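Because the generated files use the standard Prometheus rule unit-test format, they can also be run directly with promtool, e.g. promtool test rules ./tests/recording_rules_test.yml (using the output path and file name from the example shown below).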
Usage:
# Generate tests for recording rules
autogen-promql-tests --rules=./recording-rules.yml --output=./tests/
# Generate tests with custom scenarios
autogen-promql-tests --rules=./rules.yml \
--scenarios=edge-cases \
--output=./tests/
# Validate generated tests
autogen-promql-tests --rules=./rules.yml --validate
Example Output:
# Generated test file: recording_rules_test.yml
rule_files:
- recording-rules.yml
tests:
- interval: 1m
input_series:
- series: 'http_requests_total{job="api",status="200"}'
values: '0+10x10'
- series: 'http_requests_total{job="api",status="500"}'
values: '0+1x10'
promql_expr_test:
- expr: job:http_requests:rate5m
eval_time: 5m
exp_samples:
- labels: '{job="api"}'
value: 0.183
Tests Alertmanager configurations end-to-end, including routing, grouping, inhibition rules, and receiver integrations.
Features:
- Validates Alertmanager routing configurations
- Tests alert grouping and inhibition rules
- Simulates alert flows through receivers
- Validates notification templates and formatting
- Supports testing without sending actual notifications
Usage:
# Test Alertmanager configuration
e2e-alertmanager-test --config=./alertmanager.yml
# Test specific routing tree
e2e-alertmanager-test --config=./alertmanager.yml \
--alert='{"labels":{"severity":"critical","team":"platform"}}'
# Validate inhibition rules
e2e-alertmanager-test --config=./alertmanager.yml \
--test-inhibition
# Dry-run mode (no actual notifications)
e2e-alertmanager-test --config=./alertmanager.yml \
--dry-run
Example Output:
Testing Alertmanager configuration: ./alertmanager.yml
Routing Test:
Alert: {severity="critical", team="platform"}
✓ Matched route: platform-critical
✓ Receiver: pagerduty-platform
✓ Group by: [alertname, cluster]
✓ Group wait: 10s
✓ Group interval: 5m
✓ Repeat interval: 4h
Inhibition Test:
Alert: {severity="warning", alertname="HighMemory"}
Inhibited by: {severity="critical", alertname="NodeDown"}
✓ Inhibition rule matched: inhibit-warning-if-critical
Template Test:
Receiver: slack-platform
✓ Template rendered successfully
✓ No template errors
Preview:
[CRITICAL] High Error Rate in production
Severity: critical
Team: platform
Runbook: https://runbooks.example.com/high-error-rate
All tests passed ✓
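For context, routing results like the ones above would come from an ordinary Alertmanager route and inhibition rule. The fragment below is a hedged reconstruction for illustration only (receiver definitions omitted), not a file shipped with this repository:

route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
        - team="platform"
      receiver: pagerduty-platform
      group_by: [alertname, cluster]
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 4h
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: [cluster]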
Identifies alerts that haven't fired in a specified time period, helping teams clean up obsolete or poorly tuned alerting rules.
Features:
- Queries Prometheus for alert firing history
- Identifies alerts that haven't fired in N days
- Suggests candidates for deletion or review
- Differentiates between intentionally quiet alerts and stale rules
- Exports analysis results for review
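One common source for this firing history (an assumption about the implementation, not a statement of it) is Prometheus's built-in ALERTS series, which records when each alert is pending or firing. For example:

count_over_time(ALERTS{alertname="HighErrorRate", alertstate="firing"}[90d])

returns the number of samples during which HighErrorRate was firing over the last 90 days, and returns nothing at all if it never fired in that window.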
Usage:
# Find alerts that haven't fired in 90 days
stale-alerts-analyzer --prometheus-url=http://localhost:9090 \
--days=90
# Analyze specific rules file
stale-alerts-analyzer --prometheus-url=http://prometheus:9090 \
--rules=./alerts.yml \
--days=60
# Export results to JSON
stale-alerts-analyzer --prometheus-url=http://prometheus:9090 \
--days=90 \
--output=json > stale-alerts.json
# Exclude known quiet alerts
stale-alerts-analyzer --prometheus-url=http://prometheus:9090 \
--days=90 \
--exclude="DeadMansSwitch,Watchdog"
Example Output:
Analyzing alert staleness over last 90 days...
Fetching alert history from http://prometheus:9090...
Stale Alerts (No firings in 90+ days):
Alert: LowDiskSpaceWarning
Last fired: 127 days ago (2024-09-15)
Configured in: ./alerts/infrastructure.yml:45
⚠ RECOMMENDATION: Review for deletion
Reason: No firings in 127 days suggests either:
- Alert threshold is too conservative
- Infrastructure improvements made alert obsolete
- Alert rule needs adjustment
Alert: HighAPILatency
Last fired: 95 days ago (2024-10-17)
Configured in: ./alerts/application.yml:12
⚠ RECOMMENDATION: Review alert threshold
Reason: Long period without firing may indicate threshold too high
Recently Active Alerts:
Alert: HighErrorRate
Last fired: 2 days ago
Firing frequency: 12 times in last 90 days
✓ Alert is active and useful
Alert: HighMemoryUsage
Last fired: 15 days ago
Firing frequency: 8 times in last 90 days
✓ Alert is active and useful
Summary:
Total alerts analyzed: 24
Stale alerts (90+ days): 2 (8.3%)
Active alerts: 22 (91.7%)
Recommendations:
- Review 2 stale alerts for potential deletion
- Consider lowering thresholds or improving sensitivity
brew install conallob/tap/o11y-analysis-tools
Each tool is available as a container image:
# Pull specific tools
docker pull ghcr.io/conallob/promql-fmt:latest
docker pull ghcr.io/conallob/label-check:latest
docker pull ghcr.io/conallob/alert-hysteresis:latest
docker pull ghcr.io/conallob/autogen-promql-tests:latest
docker pull ghcr.io/conallob/e2e-alertmanager-test:latest
docker pull ghcr.io/conallob/stale-alerts-analyzer:latest
# Run in container
docker run -v $(pwd):/data ghcr.io/conallob/promql-fmt:latest --check /data
docker run -v $(pwd):/data ghcr.io/conallob/label-check:latest /data
docker run ghcr.io/conallob/stale-alerts-analyzer:latest --prometheus-url=http://prometheus:9090 --days=90
Debian/Ubuntu:
# Download .deb from releases page
wget https://github.com/conallob/o11y-analysis-tools/releases/download/vX.Y.Z/o11y-analysis-tools_X.Y.Z_linux_amd64.deb
sudo dpkg -i o11y-analysis-tools_X.Y.Z_linux_amd64.deb
RHEL/Fedora/CentOS:
# Download .rpm from releases page
wget https://github.com/conallob/o11y-analysis-tools/releases/download/vX.Y.Z/o11y-analysis-tools_X.Y.Z_linux_amd64.rpm
sudo rpm -i o11y-analysis-tools_X.Y.Z_linux_amd64.rpm
Download the latest release for your platform from the releases page.
Binaries are available for:
- Linux (amd64, arm64)
- macOS (amd64, arm64)
- Windows (amd64, arm64)
# Clone the repository
git clone https://github.com/conallob/o11y-analysis-tools.git
cd o11y-analysis-tools
# Build all tools
make build
# Or build individually
go build -o bin/promql-fmt ./cmd/promql-fmt
go build -o bin/label-check ./cmd/label-check
go build -o bin/alert-hysteresis ./cmd/alert-hysteresis
go build -o bin/autogen-promql-tests ./cmd/autogen-promql-tests
go build -o bin/e2e-alertmanager-test ./cmd/e2e-alertmanager-test
go build -o bin/stale-alerts-analyzer ./cmd/stale-alerts-analyzer
# Install to $GOPATH/bin
make install
All tools are designed to work in CI/CD pipelines with --check mode as the default behavior.
name: PromQL Validation
on: [pull_request]
jobs:
validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Go
uses: actions/setup-go@v4
with:
go-version: '1.21'
- name: Install tools
run: |
go install github.com/conallob/o11y-analysis-tools/cmd/promql-fmt@latest
go install github.com/conallob/o11y-analysis-tools/cmd/label-check@latest
- name: Check PromQL formatting
run: promql-fmt --check ./prometheus/
- name: Check required labels
run: label-check --labels=job,namespace ./prometheus/
promql-validation:
image: golang:1.21
script:
- go install github.com/conallob/o11y-analysis-tools/cmd/promql-fmt@latest
- go install github.com/conallob/o11y-analysis-tools/cmd/label-check@latest
- promql-fmt --check ./alerts/
- label-check --labels=job ./alerts/
only:
- merge_requests
#!/bin/bash
# .git/hooks/pre-commit
promql-fmt --check $(git diff --cached --name-only --diff-filter=ACM | grep -E '\.(yml|yaml)$')
if [ $? -ne 0 ]; then
echo "PromQL formatting issues found. Run 'promql-fmt --fix' to fix."
exit 1
fi
label-check --check $(git diff --cached --name-only --diff-filter=ACM | grep -E '\.(yml|yaml)$')
if [ $? -ne 0 ]; then
echo "Missing required labels. Please add 'job' label to all PromQL expressions."
exit 1
fi
No configuration file needed. All options are provided via CLI flags.
Create a .label-check.yml in your repository root:
required_labels:
- job
- namespace
- cluster
Then run without flags:
label-check ./alerts/
Create a .alert-hysteresis.yml:
prometheus_url: http://prometheus:9090
timeframe: 7d
threshold: 0.2
rules_file: ./prometheus/alerts.yml
- CONTRIBUTING.md - Contributing guidelines, development setup, and testing
- RELEASING.md - Release process and versioning
- CLAUDE.md - AI assistant guidance for working with this codebase
We welcome contributions! Please see CONTRIBUTING.md for:
- Development setup and workflow
- Code style guidelines
- Testing requirements
- Pull request process
For information about creating releases, see RELEASING.md.
BSD 3-Clause - See LICENSE file for details
- Add support for Cortex and Thanos
- Web UI for alert hysteresis analysis
- Export analysis results to JSON/CSV
- Integration with Grafana for visualization
- Support for Mimir-specific PromQL extensions
- Alert simulation mode to test hysteresis changes
- Automatic PR creation for recommended changes
Q: Does promql-fmt support all PromQL syntax?
A: Currently supports most common PromQL patterns. Complex nested queries may need manual formatting.
Q: Can alert-hysteresis work with Thanos or Cortex?
A: Yes, as long as they expose a Prometheus-compatible API endpoint.
Q: What if my alerts don't have a 'job' label?
A: Use --labels to specify your required labels, or configure via .label-check.yml.
Q: How does alert-hysteresis calculate recommendations?
A: It uses statistical analysis (median, percentiles) of historical firing durations to recommend values that filter spurious short-lived alerts while preserving actionable ones.