o11y-analysis-tools


A collection of static analysis and testing tools for PromQL-compatible monitoring systems. All tools are written in Go and designed for use in CI/CD workflows, with --check mode as the default behavior.

Background

This collection of static analysis tools is intended to help:

  • maintain PromQL rules, by keeping them readable, consistent, unit tested and, most importantly, approachable by non-PromQL experts
  • identify common bugs involving labels, time controls, etc.
  • maintain high-quality, actionable alerts, including the ability to preview how an alert would render
  • identify alerts in need of refinement, or even deletion once they are no longer useful

Each of these tools is inspired by mistakes I've made when writing PromQL rules, and by issues I've seen when peer reviewing PromQL PRs or providing PromQL consultations during my tenure as a member of Google SRE.

Tools

1. promql-fmt - PromQL Expression Formatter

Statically analyzes and formats PromQL expressions for proper multiline formatting.

Features:

  • Checks PromQL expressions for multiline formatting standards
  • Automatically formats long or complex expressions for better readability
  • Integrates with CI to enforce formatting standards

Usage:

# Check formatting (default mode, exits 1 if issues found)
promql-fmt --check ./alerts/

# Automatically fix formatting issues
promql-fmt --fix ./alerts/
promql-fmt --fmt ./alerts/  # alias for --fix

# Verbose output
promql-fmt --verbose --check ./prometheus/

Example:

Before:

expr: sum(rate(http_requests_total{job="api",status=~"5.."}[5m])) by (instance) / sum(rate(http_requests_total{job="api"}[5m])) by (instance)

After:

expr: |
  sum (
    rate(http_requests_total{job="api",status=~"5.."}[5m])
  )
    / on (instance)
  sum by (instance) (
    rate(http_requests_total{job="api"}[5m])
  )

Note: The formatter automatically:

  • Removes redundant aggregation clauses from the left operand when both operands share the same by clause
  • Adds explicit on() clauses for vector matching based on the aggregation labels
  • Follows PromQL best practices where only the final operand needs the aggregation
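As a rough illustration of what --check mode might flag, here is a minimal Python sketch of a length-and-complexity heuristic. The function name and thresholds are hypothetical, and the real tool parses PromQL properly rather than pattern-matching:

```python
import re

def needs_multiline(expr: str, max_len: int = 80) -> bool:
    """Heuristic: flag expressions that are long, or that combine
    multiple aggregations with a binary operator on one line."""
    if len(expr) > max_len:
        return True
    # A binary operation between two aggregations rarely reads well inline.
    aggs = len(re.findall(r'\b(sum|avg|min|max|count)\s*(by\b|\()', expr))
    has_binop = bool(re.search(r'\)\s*(/|\*|\+|-)\s*\w', expr))
    return aggs >= 2 and has_binop

expr = ('sum(rate(http_requests_total{job="api",status=~"5.."}[5m])) by (instance)'
        ' / sum(rate(http_requests_total{job="api"}[5m])) by (instance)')
print(needs_multiline(expr))  # the single-line example above is flagged
```

In check mode a CI job would exit non-zero on any flagged expression; in fix mode the expression would be reflowed as shown in the Before/After example.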

2. label-check - Label Standards Enforcement

Enforces required labels in PromQL expressions to prevent collisions in multi-tenant observability platforms.

Features:

  • Validates that all PromQL expressions include required labels
  • Default: checks for job label to prevent tenant collisions
  • Configurable for any set of required labels
  • Detailed violation reporting with line numbers

Usage:

# Check for default 'job' label
label-check --check ./alerts/

# Check for multiple required labels
label-check --labels=job,namespace ./alerts/

# Check specific file
label-check --labels=job,cluster alerts.yml

Example Output:

./alerts/api-alerts.yml:
  Expression: rate(http_requests_total[5m])
    Missing required labels: job
    Line: 12

Found 1 expression with missing required labels
Required labels: job
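The core of such a check can be sketched in a few lines of Python. This is a simplified, regex-based illustration (the real tool works on parsed PromQL); the function name and default set are hypothetical:

```python
import re

REQUIRED = {"job"}  # default; configurable via --labels in the real tool

def missing_labels(expr: str, required=REQUIRED):
    """Collect every label that appears in a {...} matcher and
    report which required labels never appear."""
    present = set()
    for matcher in re.findall(r'\{([^}]*)\}', expr):
        for m in re.finditer(r'(\w+)\s*(=~|!~|!=|=)', matcher):
            present.add(m.group(1))
    return sorted(required - present)

print(missing_labels('rate(http_requests_total[5m])'))             # ['job']
print(missing_labels('rate(http_requests_total{job="api"}[5m])'))  # []
```

A non-empty result for any expression in the scanned files would produce the violation report shown above.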

3. alert-hysteresis - Alert Hysteresis Analyzer

Analyzes historical alert firing patterns and recommends optimal for durations to reduce spurious, unactionable alerts.

Features:

  • Queries Prometheus for historical alert firing data
  • Compares actual firing durations with configured for values
  • Recommends better hysteresis values based on statistical analysis
  • Identifies spurious short-lived alerts
  • Suggests optimal values to reduce alert fatigue

Usage:

# Analyze all alerts from last 7 days
alert-hysteresis --prometheus-url=http://localhost:9090

# Analyze specific alert over 24 hours
alert-hysteresis --prometheus-url=http://prometheus:9090 \
  --alert=HighErrorRate \
  --timeframe=24h

# Compare with configured values in rules file
alert-hysteresis --prometheus-url=http://prometheus:9090 \
  --rules=./alerts.yml \
  --timeframe=7d

# Adjust sensitivity threshold (default: 20% mismatch)
alert-hysteresis --prometheus-url=http://prometheus:9090 \
  --threshold=0.3 \
  --rules=./alerts.yml

Example Output:

Fetching alert history from http://prometheus:9090 (timeframe: 168h0m0s)...
Analyzing 156 alert firing events...

Alert: HighErrorRate
  Firing events: 45
  Average duration: 3m24s
  Median duration: 2m15s
  Min/Max duration: 45s / 25m30s
  Configured 'for': 30s
  ⚠ RECOMMENDATION: Change 'for' duration to 2m
     Reason: 33.3% of alerts (15/45) fire for less than 2m, suggesting spurious alerts
  Spurious alerts (< recommended): 15 (33.3%)

Alert: HighMemoryUsage
  Firing events: 12
  Average duration: 45m12s
  Median duration: 42m0s
  Min/Max duration: 15m / 2h15m
  Configured 'for': 30m
  Recommended 'for': 30m
  ✓ Current configuration is acceptable

Found 1 alert that needs hysteresis adjustment
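One plausible shape for the recommendation logic is sketched below in Python: if a large enough fraction of firings are shorter than the median duration, raising 'for' toward the median would have suppressed them as spurious. The function name and exact statistics are assumptions for illustration, not the tool's actual algorithm:

```python
import statistics
from datetime import timedelta

def recommend_for(durations, configured, threshold=0.2):
    """Recommend a 'for' duration from historical firing durations.
    If more than `threshold` of firings were shorter than the median
    and the configured value sits below the median, suggest the median."""
    median = statistics.median(durations)
    short = sum(1 for d in durations if d < median)
    if configured < median and short / len(durations) > threshold:
        return median
    return configured

durations = [timedelta(seconds=s) for s in (45, 90, 135, 600, 1530)]
print(recommend_for(durations, timedelta(seconds=30)))  # → 0:02:15
```

With a configured 'for' of 30s and a median firing duration of 2m15s, the sketch recommends raising the duration, mirroring the HighErrorRate example above.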

4. autogen-promql-tests - PromQL Test Case Generator

Automatically generates unit test cases for PromQL expressions to ensure recording rules and metrics calculations work as expected.

Features:

  • Generates test cases from existing PromQL recording rules
  • Creates baseline test fixtures based on current metric values
  • Supports custom test scenarios and edge cases
  • Integration with Prometheus unit testing framework

Usage:

# Generate tests for recording rules
autogen-promql-tests --rules=./recording-rules.yml --output=./tests/

# Generate tests with custom scenarios
autogen-promql-tests --rules=./rules.yml \
  --scenarios=edge-cases \
  --output=./tests/

# Validate generated tests
autogen-promql-tests --rules=./rules.yml --validate

Example Output:

# Generated test file: recording_rules_test.yml
rule_files:
  - recording-rules.yml

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api",status="200"}'
        values: '0+10x10'
      - series: 'http_requests_total{job="api",status="500"}'
        values: '0+1x10'
    promql_expr_test:
      - expr: job:http_requests:rate5m
        eval_time: 5m
        exp_samples:
          - labels: '{job="api"}'
            value: 0.183

5. e2e-alertmanager-test - End-to-End Alertmanager Testing

Tests Alertmanager configurations end-to-end, including routing, grouping, inhibition rules, and receiver integrations.

Features:

  • Validates Alertmanager routing configurations
  • Tests alert grouping and inhibition rules
  • Simulates alert flows through receivers
  • Validates notification templates and formatting
  • Supports testing without sending actual notifications

Usage:

# Test Alertmanager configuration
e2e-alertmanager-test --config=./alertmanager.yml

# Test specific routing tree
e2e-alertmanager-test --config=./alertmanager.yml \
  --alert='{"labels":{"severity":"critical","team":"platform"}}'

# Validate inhibition rules
e2e-alertmanager-test --config=./alertmanager.yml \
  --test-inhibition

# Dry-run mode (no actual notifications)
e2e-alertmanager-test --config=./alertmanager.yml \
  --dry-run

Example Output:

Testing Alertmanager configuration: ./alertmanager.yml

Routing Test:
  Alert: {severity="critical", team="platform"}
  ✓ Matched route: platform-critical
  ✓ Receiver: pagerduty-platform
  ✓ Group by: [alertname, cluster]
  ✓ Group wait: 10s
  ✓ Group interval: 5m
  ✓ Repeat interval: 4h

Inhibition Test:
  Alert: {severity="warning", alertname="HighMemory"}
  Inhibited by: {severity="critical", alertname="NodeDown"}
  ✓ Inhibition rule matched: inhibit-warning-if-critical

Template Test:
  Receiver: slack-platform
  ✓ Template rendered successfully
  ✓ No template errors
  Preview:
    [CRITICAL] High Error Rate in production
    Severity: critical
    Team: platform
    Runbook: https://runbooks.example.com/high-error-rate

All tests passed ✓
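The routing test boils down to walking the configured route tree with an alert's label set. Here is a deliberately simplified Python sketch of that walk (real Alertmanager routing also supports regex matchers, `continue`, grouping inheritance, and more; the data layout here is an assumption):

```python
def match_route(route, labels):
    """Depth-first walk of a simplified routing tree: the first child
    whose equality matchers all hold wins; otherwise the current
    node's receiver applies."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return match_route(child, labels)
    return route["receiver"]

tree = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical", "team": "platform"},
         "receiver": "pagerduty-platform"},
        {"match": {"team": "platform"}, "receiver": "slack-platform"},
    ],
}
print(match_route(tree, {"severity": "critical", "team": "platform"}))  # pagerduty-platform
print(match_route(tree, {"severity": "warning", "team": "platform"}))   # slack-platform
```

The e2e tool additionally checks grouping, timing, inhibition, and template rendering along the matched route, as in the output above.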

6. stale-alerts-analyzer - Alert Staleness Analyzer

Identifies alerts that haven't fired in a specified time period, helping teams clean up obsolete or overly sensitive alerting rules.

Features:

  • Queries Prometheus for alert firing history
  • Identifies alerts that haven't fired in N days
  • Suggests candidates for deletion or review
  • Differentiates between intentionally quiet alerts and stale rules
  • Exports analysis results for review

Usage:

# Find alerts that haven't fired in 90 days
stale-alerts-analyzer --prometheus-url=http://localhost:9090 \
  --days=90

# Analyze specific rules file
stale-alerts-analyzer --prometheus-url=http://prometheus:9090 \
  --rules=./alerts.yml \
  --days=60

# Export results to JSON
stale-alerts-analyzer --prometheus-url=http://prometheus:9090 \
  --days=90 \
  --output=json > stale-alerts.json

# Exclude known quiet alerts
stale-alerts-analyzer --prometheus-url=http://prometheus:9090 \
  --days=90 \
  --exclude="DeadMansSwitch,Watchdog"

Example Output:

Analyzing alert staleness over last 90 days...
Fetching alert history from http://prometheus:9090...

Stale Alerts (No firings in 90+ days):

  Alert: LowDiskSpaceWarning
    Last fired: 127 days ago (2024-09-15)
    Configured in: ./alerts/infrastructure.yml:45
    ⚠ RECOMMENDATION: Review for deletion
       Reason: No firings in 127 days suggests either:
         - Alert threshold is too conservative
         - Infrastructure improvements made alert obsolete
         - Alert rule needs adjustment

  Alert: HighAPILatency
    Last fired: 95 days ago (2024-10-17)
    Configured in: ./alerts/application.yml:12
    ⚠ RECOMMENDATION: Review alert threshold
       Reason: Long period without firing may indicate threshold too high

Recently Active Alerts:

  Alert: HighErrorRate
    Last fired: 2 days ago
    Firing frequency: 12 times in last 90 days
    ✓ Alert is active and useful

  Alert: HighMemoryUsage
    Last fired: 15 days ago
    Firing frequency: 8 times in last 90 days
    ✓ Alert is active and useful

Summary:
  Total alerts analyzed: 24
  Stale alerts (90+ days): 2 (8.3%)
  Active alerts: 22 (91.7%)

Recommendations:
  - Review 2 stale alerts for potential deletion
  - Consider lowering thresholds or improving sensitivity
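The staleness classification itself is simple: compare each alert's last firing date against the cutoff, skipping intentionally quiet alerts. A minimal Python sketch, with hypothetical names and an assumed dict-of-dates input:

```python
from datetime import date

def classify_alerts(last_fired, today, stale_days=90, exclude=()):
    """Split alerts into stale vs active by days since last firing,
    skipping excluded (intentionally quiet) alerts like Watchdog."""
    stale, active = [], []
    for name, fired in last_fired.items():
        if name in exclude:
            continue
        age = (today - fired).days
        (stale if age >= stale_days else active).append((name, age))
    return stale, active

history = {
    "LowDiskSpaceWarning": date(2024, 9, 15),
    "HighErrorRate": date(2025, 1, 18),
    "Watchdog": date(2024, 1, 1),
}
stale, active = classify_alerts(history, date(2025, 1, 20), exclude={"Watchdog"})
print(stale)   # [('LowDiskSpaceWarning', 127)]
print(active)  # [('HighErrorRate', 2)]
```

Note the `--exclude` flag in the usage above serves exactly this purpose: dead-man's-switch alerts never fire in normal operation and should not be reported as stale.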

Installation

Homebrew (macOS/Linux)

brew install conallob/tap/o11y-analysis-tools

Container Images

Each tool is available as a container image:

# Pull specific tools
docker pull ghcr.io/conallob/promql-fmt:latest
docker pull ghcr.io/conallob/label-check:latest
docker pull ghcr.io/conallob/alert-hysteresis:latest
docker pull ghcr.io/conallob/autogen-promql-tests:latest
docker pull ghcr.io/conallob/e2e-alertmanager-test:latest
docker pull ghcr.io/conallob/stale-alerts-analyzer:latest

# Run in container
docker run -v $(pwd):/data ghcr.io/conallob/promql-fmt:latest --check /data
docker run -v $(pwd):/data ghcr.io/conallob/label-check:latest /data
docker run ghcr.io/conallob/stale-alerts-analyzer:latest --prometheus-url=http://prometheus:9090 --days=90

Package Managers

Debian/Ubuntu:

# Download .deb from releases page
wget https://github.com/conallob/o11y-analysis-tools/releases/download/vX.Y.Z/o11y-analysis-tools_X.Y.Z_linux_amd64.deb
sudo dpkg -i o11y-analysis-tools_X.Y.Z_linux_amd64.deb

RHEL/Fedora/CentOS:

# Download .rpm from releases page
wget https://github.com/conallob/o11y-analysis-tools/releases/download/vX.Y.Z/o11y-analysis-tools_X.Y.Z_linux_amd64.rpm
sudo rpm -i o11y-analysis-tools_X.Y.Z_linux_amd64.rpm

Pre-built Binaries

Download the latest release for your platform from the releases page.

Binaries are available for:

  • Linux (amd64, arm64)
  • macOS (amd64, arm64)
  • Windows (amd64, arm64)

Build from source

# Clone the repository
git clone https://github.com/conallob/o11y-analysis-tools.git
cd o11y-analysis-tools

# Build all tools
make build

# Or build individually
go build -o bin/promql-fmt ./cmd/promql-fmt
go build -o bin/label-check ./cmd/label-check
go build -o bin/alert-hysteresis ./cmd/alert-hysteresis
go build -o bin/autogen-promql-tests ./cmd/autogen-promql-tests
go build -o bin/e2e-alertmanager-test ./cmd/e2e-alertmanager-test
go build -o bin/stale-alerts-analyzer ./cmd/stale-alerts-analyzer

# Install to $GOPATH/bin
make install

CI/CD Integration

All tools are designed to work in CI/CD pipelines with --check mode as the default behavior.

GitHub Actions Example

name: PromQL Validation

on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Go
        uses: actions/setup-go@v4
        with:
          go-version: '1.21'

      - name: Install tools
        run: |
          go install github.com/conallob/o11y-analysis-tools/cmd/promql-fmt@latest
          go install github.com/conallob/o11y-analysis-tools/cmd/label-check@latest

      - name: Check PromQL formatting
        run: promql-fmt --check ./prometheus/

      - name: Check required labels
        run: label-check --labels=job,namespace ./prometheus/

GitLab CI Example

promql-validation:
  image: golang:1.21
  script:
    - go install github.com/conallob/o11y-analysis-tools/cmd/promql-fmt@latest
    - go install github.com/conallob/o11y-analysis-tools/cmd/label-check@latest
    - promql-fmt --check ./alerts/
    - label-check --labels=job ./alerts/
  only:
    - merge_requests

Pre-commit Hook

#!/bin/bash
# .git/hooks/pre-commit

# Only check staged YAML files; exit cleanly if none are staged.
files=$(git diff --cached --name-only --diff-filter=ACM | grep -E '\.(yml|yaml)$')
[ -z "$files" ] && exit 0

if ! promql-fmt --check $files; then
    echo "PromQL formatting issues found. Run 'promql-fmt --fix' to fix."
    exit 1
fi

if ! label-check --check $files; then
    echo "Missing required labels. Please add 'job' label to all PromQL expressions."
    exit 1
fi

Configuration

promql-fmt

No configuration file needed. All options are provided via CLI flags.

label-check

Create a .label-check.yml in your repository root:

required_labels:
  - job
  - namespace
  - cluster

Then run without flags:

label-check ./alerts/

alert-hysteresis

Create a .alert-hysteresis.yml:

prometheus_url: http://prometheus:9090
timeframe: 7d
threshold: 0.2
rules_file: ./prometheus/alerts.yml

Documentation

  • CONTRIBUTING.md - Contributing guidelines, development setup, and testing
  • RELEASING.md - Release process and versioning
  • CLAUDE.md - AI assistant guidance for working with this codebase

Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Development setup and workflow
  • Code style guidelines
  • Testing requirements
  • Pull request process

Releasing

For information about creating releases, see RELEASING.md.

License

BSD 3-Clause - See LICENSE file for details

Roadmap

  • Add support for Cortex and Thanos
  • Web UI for alert hysteresis analysis
  • Export analysis results to JSON/CSV
  • Integration with Grafana for visualization
  • Support for Mimir-specific PromQL extensions
  • Alert simulation mode to test hysteresis changes
  • Automatic PR creation for recommended changes

FAQ

Q: Does promql-fmt support all PromQL syntax? A: Currently supports most common PromQL patterns. Complex nested queries may need manual formatting.

Q: Can alert-hysteresis work with Thanos or Cortex? A: Yes, as long as they expose a Prometheus-compatible API endpoint.

Q: What if my alerts don't have a 'job' label? A: Use --labels to specify your required labels, or configure via .label-check.yml.

Q: How does alert-hysteresis calculate recommendations? A: It uses statistical analysis (median, percentiles) of historical firing durations to recommend values that filter spurious short-lived alerts while preserving actionable ones.
