o11y-analysis-tools


A collection of static analysis and testing tools for PromQL-compatible monitoring systems. All tools are written in Go and designed for use in CI/CD workflows, with --check mode as the default behavior.

Background

This collection of static analysis tools is intended to help:

  • maintain PromQL rules, by keeping them readable, consistent, unit tested and, most importantly, approachable by non-PromQL experts
  • identify common bugs involving labels, time controls, etc.
  • maintain high-quality, actionable alerts, including the ability to preview how an alert would render
  • identify alerts in need of refinement, or even deletion once they are no longer useful

Each of these tools is inspired by mistakes I've made when writing PromQL rules, and by issues I've seen when peer reviewing PromQL PRs or providing PromQL consultations during my tenure as a member of Google SRE.

Tools

1. promql-fmt - PromQL Expression Formatter

Statically analyzes and formats PromQL expressions for proper multiline formatting.

Features:

  • Checks PromQL expressions for multiline formatting standards
  • Automatically formats long or complex expressions for better readability
  • Integrates with CI to enforce formatting standards

Usage:

# Check formatting (default mode, exits 1 if issues found)
promql-fmt --check ./alerts/

# Automatically fix formatting issues
promql-fmt --fix ./alerts/
promql-fmt --fmt ./alerts/  # alias for --fix

# Verbose output
promql-fmt --verbose --check ./prometheus/

Example:

Before:

expr: sum(rate(http_requests_total{job="api",status=~"5.."}[5m])) by (instance) / sum(rate(http_requests_total{job="api"}[5m])) by (instance)

After:

expr: |
  sum (
    rate(http_requests_total{job="api",status=~"5.."}[5m])
  )
    / on (instance)
  sum by (instance) (
    rate(http_requests_total{job="api"}[5m])
  )

Note: The formatter automatically:

  • Removes redundant aggregation clauses from the left operand when both operands share the same by clause
  • Adds explicit on() clauses for vector matching based on the aggregation labels
  • Follows PromQL best practices where only the final operand needs the aggregation
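As a rough illustration of what --check mode might flag, here is a minimal Python sketch of a length-and-complexity heuristic. The function name and thresholds are hypothetical, and the real tool parses PromQL properly rather than pattern-matching:

```python
import re

def needs_multiline(expr: str, max_len: int = 80) -> bool:
    """Heuristic: flag expressions that are long, or that combine
    multiple aggregations with a binary operator on one line."""
    if len(expr) > max_len:
        return True
    # A binary operation between two aggregations rarely reads well inline.
    aggs = len(re.findall(r'\b(sum|avg|min|max|count)\s*(by\b|\()', expr))
    has_binop = bool(re.search(r'\)\s*(/|\*|\+|-)\s*\w', expr))
    return aggs >= 2 and has_binop

expr = ('sum(rate(http_requests_total{job="api",status=~"5.."}[5m])) by (instance)'
        ' / sum(rate(http_requests_total{job="api"}[5m])) by (instance)')
print(needs_multiline(expr))  # the single-line example above is flagged
```

In check mode a CI job would exit non-zero on any flagged expression; in fix mode the expression would be reflowed as shown in the Before/After example.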

2. label-check - Label Standards Enforcement

Enforces required labels in PromQL expressions to prevent collisions in multi-tenant observability platforms.

Features:

  • Validates that all PromQL expressions include required labels
  • Default: checks for job label to prevent tenant collisions
  • Configurable for any set of required labels
  • Detailed violation reporting with line numbers

Usage:

# Check for default 'job' label
label-check --check ./alerts/

# Check for multiple required labels
label-check --labels=job,namespace ./alerts/

# Check specific file
label-check --labels=job,cluster alerts.yml

Example Output:

./alerts/api-alerts.yml:
  Expression: rate(http_requests_total[5m])
    Missing required labels: job
    Line: 12

Found 1 expression with missing required labels
Required labels: job
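The core of such a check can be sketched in a few lines of Python. This is a simplified, regex-based illustration (the real tool works on parsed PromQL); the function name and default set are hypothetical:

```python
import re

REQUIRED = {"job"}  # default; configurable via --labels in the real tool

def missing_labels(expr: str, required=REQUIRED):
    """Collect every label that appears in a {...} matcher and
    report which required labels never appear."""
    present = set()
    for matcher in re.findall(r'\{([^}]*)\}', expr):
        for m in re.finditer(r'(\w+)\s*(=~|!~|!=|=)', matcher):
            present.add(m.group(1))
    return sorted(required - present)

print(missing_labels('rate(http_requests_total[5m])'))             # ['job']
print(missing_labels('rate(http_requests_total{job="api"}[5m])'))  # []
```

A non-empty result for any expression in the scanned files would produce the violation report shown above.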

3. alert-hysteresis - Alert Hysteresis Analyzer

Analyzes historical alert firing patterns and recommends optimal for durations to reduce spurious, unactionable alerts.

Features:

  • Queries Prometheus for historical alert firing data
  • Compares actual firing durations with configured for values
  • Recommends better hysteresis values based on statistical analysis
  • Identifies spurious short-lived alerts
  • Suggests optimal values to reduce alert fatigue

Usage:

# Analyze all alerts from last 7 days
alert-hysteresis --prometheus-url=http://localhost:9090

# Analyze specific alert over 24 hours
alert-hysteresis --prometheus-url=http://prometheus:9090 \
  --alert=HighErrorRate \
  --timeframe=24h

# Compare with configured values in rules file
alert-hysteresis --prometheus-url=http://prometheus:9090 \
  --rules=./alerts.yml \
  --timeframe=7d

# Adjust sensitivity threshold (default: 20% mismatch)
alert-hysteresis --prometheus-url=http://prometheus:9090 \
  --threshold=0.3 \
  --rules=./alerts.yml

Example Output:

Fetching alert history from http://prometheus:9090 (timeframe: 168h0m0s)...
Analyzing 156 alert firing events...

Alert: HighErrorRate
  Firing events: 45
  Average duration: 3m24s
  Median duration: 2m15s
  Min/Max duration: 45s / 25m30s
  Configured 'for': 30s
  ⚠ RECOMMENDATION: Change 'for' duration to 2m
     Reason: 33.3% of alerts (15/45) fire for less than 2m, suggesting spurious alerts
  Spurious alerts (< recommended): 15 (33.3%)

Alert: HighMemoryUsage
  Firing events: 12
  Average duration: 45m12s
  Median duration: 42m0s
  Min/Max duration: 15m / 2h15m
  Configured 'for': 30m
  Recommended 'for': 30m
  ✓ Current configuration is acceptable

Found 1 alert that needs hysteresis adjustment
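One plausible shape for the recommendation logic is sketched below in Python: if a large enough fraction of firings are shorter than the median duration, raising 'for' toward the median would have suppressed them as spurious. The function name and exact statistics are assumptions for illustration, not the tool's actual algorithm:

```python
import statistics
from datetime import timedelta

def recommend_for(durations, configured, threshold=0.2):
    """Recommend a 'for' duration from historical firing durations.
    If more than `threshold` of firings were shorter than the median
    and the configured value sits below the median, suggest the median."""
    median = statistics.median(durations)
    short = sum(1 for d in durations if d < median)
    if configured < median and short / len(durations) > threshold:
        return median
    return configured

durations = [timedelta(seconds=s) for s in (45, 90, 135, 600, 1530)]
print(recommend_for(durations, timedelta(seconds=30)))  # → 0:02:15
```

With a configured 'for' of 30s and a median firing duration of 2m15s, the sketch recommends raising the duration, mirroring the HighErrorRate example above.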

4. autogen-promql-tests - PromQL Test Case Generator

Automatically generates unit test cases for PromQL expressions to ensure recording rules and metrics calculations work as expected.

Features:

  • Generates test cases from existing PromQL recording rules
  • Creates baseline test fixtures based on current metric values
  • Supports custom test scenarios and edge cases
  • Integration with Prometheus unit testing framework

Usage:

# Generate tests for recording rules
autogen-promql-tests --rules=./recording-rules.yml --output=./tests/

# Generate tests with custom scenarios
autogen-promql-tests --rules=./rules.yml \
  --scenarios=edge-cases \
  --output=./tests/

# Validate generated tests
autogen-promql-tests --rules=./rules.yml --validate

Example Output:

# Generated test file: recording_rules_test.yml
rule_files:
  - recording-rules.yml

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api",status="200"}'
        values: '0+10x10'
      - series: 'http_requests_total{job="api",status="500"}'
        values: '0+1x10'
    promql_expr_test:
      - expr: job:http_requests:rate5m
        eval_time: 5m
        exp_samples:
          - labels: '{job="api"}'
            value: 0.183

5. e2e-alertmanager-test - End-to-End Alertmanager Testing

Tests Alertmanager configurations end-to-end, including routing, grouping, inhibition rules, and receiver integrations.

Features:

  • Validates Alertmanager routing configurations
  • Tests alert grouping and inhibition rules
  • Simulates alert flows through receivers
  • Validates notification templates and formatting
  • Supports testing without sending actual notifications

Usage:

# Test Alertmanager configuration
e2e-alertmanager-test --config=./alertmanager.yml

# Test specific routing tree
e2e-alertmanager-test --config=./alertmanager.yml \
  --alert='{"labels":{"severity":"critical","team":"platform"}}'

# Validate inhibition rules
e2e-alertmanager-test --config=./alertmanager.yml \
  --test-inhibition

# Dry-run mode (no actual notifications)
e2e-alertmanager-test --config=./alertmanager.yml \
  --dry-run

Example Output:

Testing Alertmanager configuration: ./alertmanager.yml

Routing Test:
  Alert: {severity="critical", team="platform"}
  ✓ Matched route: platform-critical
  ✓ Receiver: pagerduty-platform
  ✓ Group by: [alertname, cluster]
  ✓ Group wait: 10s
  ✓ Group interval: 5m
  ✓ Repeat interval: 4h

Inhibition Test:
  Alert: {severity="warning", alertname="HighMemory"}
  Inhibited by: {severity="critical", alertname="NodeDown"}
  ✓ Inhibition rule matched: inhibit-warning-if-critical

Template Test:
  Receiver: slack-platform
  ✓ Template rendered successfully
  ✓ No template errors
  Preview:
    [CRITICAL] High Error Rate in production
    Severity: critical
    Team: platform
    Runbook: https://runbooks.example.com/high-error-rate

All tests passed ✓
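The routing test boils down to walking the configured route tree with an alert's label set. Here is a deliberately simplified Python sketch of that walk (real Alertmanager routing also supports regex matchers, `continue`, grouping inheritance, and more; the data layout here is an assumption):

```python
def match_route(route, labels):
    """Depth-first walk of a simplified routing tree: the first child
    whose equality matchers all hold wins; otherwise the current
    node's receiver applies."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return match_route(child, labels)
    return route["receiver"]

tree = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical", "team": "platform"},
         "receiver": "pagerduty-platform"},
        {"match": {"team": "platform"}, "receiver": "slack-platform"},
    ],
}
print(match_route(tree, {"severity": "critical", "team": "platform"}))  # pagerduty-platform
print(match_route(tree, {"severity": "warning", "team": "platform"}))   # slack-platform
```

The e2e tool additionally checks grouping, timing, inhibition, and template rendering along the matched route, as in the output above.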

6. stale-alerts-analyzer - Alert Staleness Analyzer

Identifies alerts that haven't fired in a specified time period, helping teams clean up obsolete or overly sensitive alerting rules.

Features:

  • Queries Prometheus for alert firing history
  • Identifies alerts that haven't fired in N days
  • Suggests candidates for deletion or review
  • Differentiates between intentionally quiet alerts and stale rules
  • Exports analysis results for review

Usage:

# Find alerts that haven't fired in 90 days
stale-alerts-analyzer --prometheus-url=http://localhost:9090 \
  --days=90

# Analyze specific rules file
stale-alerts-analyzer --prometheus-url=http://prometheus:9090 \
  --rules=./alerts.yml \
  --days=60

# Export results to JSON
stale-alerts-analyzer --prometheus-url=http://prometheus:9090 \
  --days=90 \
  --output=json > stale-alerts.json

# Exclude known quiet alerts
stale-alerts-analyzer --prometheus-url=http://prometheus:9090 \
  --days=90 \
  --exclude="DeadMansSwitch,Watchdog"

Example Output:

Analyzing alert staleness over last 90 days...
Fetching alert history from http://prometheus:9090...

Stale Alerts (No firings in 90+ days):

  Alert: LowDiskSpaceWarning
    Last fired: 127 days ago (2024-09-15)
    Configured in: ./alerts/infrastructure.yml:45
    ⚠ RECOMMENDATION: Review for deletion
       Reason: No firings in 127 days suggests either:
         - Alert threshold is too conservative
         - Infrastructure improvements made alert obsolete
         - Alert rule needs adjustment

  Alert: HighAPILatency
    Last fired: 95 days ago (2024-10-17)
    Configured in: ./alerts/application.yml:12
    ⚠ RECOMMENDATION: Review alert threshold
       Reason: Long period without firing may indicate threshold too high

Recently Active Alerts:

  Alert: HighErrorRate
    Last fired: 2 days ago
    Firing frequency: 12 times in last 90 days
    ✓ Alert is active and useful

  Alert: HighMemoryUsage
    Last fired: 15 days ago
    Firing frequency: 8 times in last 90 days
    ✓ Alert is active and useful

Summary:
  Total alerts analyzed: 24
  Stale alerts (90+ days): 2 (8.3%)
  Active alerts: 22 (91.7%)

Recommendations:
  - Review 2 stale alerts for potential deletion
  - Consider lowering thresholds or improving sensitivity
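The staleness classification itself is simple: compare each alert's last firing date against the cutoff, skipping intentionally quiet alerts. A minimal Python sketch, with hypothetical names and an assumed dict-of-dates input:

```python
from datetime import date

def classify_alerts(last_fired, today, stale_days=90, exclude=()):
    """Split alerts into stale vs active by days since last firing,
    skipping excluded (intentionally quiet) alerts like Watchdog."""
    stale, active = [], []
    for name, fired in last_fired.items():
        if name in exclude:
            continue
        age = (today - fired).days
        (stale if age >= stale_days else active).append((name, age))
    return stale, active

history = {
    "LowDiskSpaceWarning": date(2024, 9, 15),
    "HighErrorRate": date(2025, 1, 18),
    "Watchdog": date(2024, 1, 1),
}
stale, active = classify_alerts(history, date(2025, 1, 20), exclude={"Watchdog"})
print(stale)   # [('LowDiskSpaceWarning', 127)]
print(active)  # [('HighErrorRate', 2)]
```

Note the `--exclude` flag in the usage above serves exactly this purpose: dead-man's-switch alerts never fire in normal operation and should not be reported as stale.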

Installation

Homebrew (macOS/Linux)

brew install conallob/tap/o11y-analysis-tools

Container Images

Each tool is available as a container image:

# Pull specific tools
docker pull ghcr.io/conallob/promql-fmt:latest
docker pull ghcr.io/conallob/label-check:latest
docker pull ghcr.io/conallob/alert-hysteresis:latest
docker pull ghcr.io/conallob/autogen-promql-tests:latest
docker pull ghcr.io/conallob/e2e-alertmanager-test:latest
docker pull ghcr.io/conallob/stale-alerts-analyzer:latest

# Run in container
docker run -v $(pwd):/data ghcr.io/conallob/promql-fmt:latest --check /data
docker run -v $(pwd):/data ghcr.io/conallob/label-check:latest /data
docker run ghcr.io/conallob/stale-alerts-analyzer:latest --prometheus-url=http://prometheus:9090 --days=90

Package Managers

Debian/Ubuntu:

# Download .deb from releases page
wget https://github.com/conallob/o11y-analysis-tools/releases/download/vX.Y.Z/o11y-analysis-tools_X.Y.Z_linux_amd64.deb
sudo dpkg -i o11y-analysis-tools_X.Y.Z_linux_amd64.deb

RHEL/Fedora/CentOS:

# Download .rpm from releases page
wget https://github.com/conallob/o11y-analysis-tools/releases/download/vX.Y.Z/o11y-analysis-tools_X.Y.Z_linux_amd64.rpm
sudo rpm -i o11y-analysis-tools_X.Y.Z_linux_amd64.rpm

Pre-built Binaries

Download the latest release for your platform from the releases page.

Binaries are available for:

  • Linux (amd64, arm64)
  • macOS (amd64, arm64)
  • Windows (amd64, arm64)

Build from source

# Clone the repository
git clone https://github.com/conallob/o11y-analysis-tools.git
cd o11y-analysis-tools

# Build all tools
make build

# Or build individually
go build -o bin/promql-fmt ./cmd/promql-fmt
go build -o bin/label-check ./cmd/label-check
go build -o bin/alert-hysteresis ./cmd/alert-hysteresis
go build -o bin/autogen-promql-tests ./cmd/autogen-promql-tests
go build -o bin/e2e-alertmanager-test ./cmd/e2e-alertmanager-test
go build -o bin/stale-alerts-analyzer ./cmd/stale-alerts-analyzer

# Install to $GOPATH/bin
make install

CI/CD Integration

All tools are designed to work in CI/CD pipelines with --check mode as the default behavior.

GitHub Actions Example

name: PromQL Validation

on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Go
        uses: actions/setup-go@v4
        with:
          go-version: '1.21'

      - name: Install tools
        run: |
          go install github.com/conallob/o11y-analysis-tools/cmd/promql-fmt@latest
          go install github.com/conallob/o11y-analysis-tools/cmd/label-check@latest

      - name: Check PromQL formatting
        run: promql-fmt --check ./prometheus/

      - name: Check required labels
        run: label-check --labels=job,namespace ./prometheus/

GitLab CI Example

promql-validation:
  image: golang:1.21
  script:
    - go install github.com/conallob/o11y-analysis-tools/cmd/promql-fmt@latest
    - go install github.com/conallob/o11y-analysis-tools/cmd/label-check@latest
    - promql-fmt --check ./alerts/
    - label-check --labels=job ./alerts/
  only:
    - merge_requests

Pre-commit Hook

#!/bin/bash
# .git/hooks/pre-commit

# Only check staged YAML files; exit cleanly if none are staged.
files=$(git diff --cached --name-only --diff-filter=ACM | grep -E '\.(yml|yaml)$')
[ -z "$files" ] && exit 0

if ! promql-fmt --check $files; then
    echo "PromQL formatting issues found. Run 'promql-fmt --fix' to fix."
    exit 1
fi

if ! label-check --check $files; then
    echo "Missing required labels. Please add 'job' label to all PromQL expressions."
    exit 1
fi

Configuration

promql-fmt

No configuration file needed. All options are provided via CLI flags.

label-check

Create a .label-check.yml in your repository root:

required_labels:
  - job
  - namespace
  - cluster

Then run without flags:

label-check ./alerts/

alert-hysteresis

Create a .alert-hysteresis.yml:

prometheus_url: http://prometheus:9090
timeframe: 7d
threshold: 0.2
rules_file: ./prometheus/alerts.yml

Documentation

  • CONTRIBUTING.md - Contributing guidelines, development setup, and testing
  • RELEASING.md - Release process and versioning
  • CLAUDE.md - AI assistant guidance for working with this codebase

Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Development setup and workflow
  • Code style guidelines
  • Testing requirements
  • Pull request process

Releasing

For information about creating releases, see RELEASING.md.

License

BSD 3-Clause - See LICENSE file for details

Roadmap

  • Add support for Cortex and Thanos
  • Web UI for alert hysteresis analysis
  • Export analysis results to JSON/CSV
  • Integration with Grafana for visualization
  • Support for Mimir-specific PromQL extensions
  • Alert simulation mode to test hysteresis changes
  • Automatic PR creation for recommended changes

FAQ

Q: Does promql-fmt support all PromQL syntax? A: Currently supports most common PromQL patterns. Complex nested queries may need manual formatting.

Q: Can alert-hysteresis work with Thanos or Cortex? A: Yes, as long as they expose a Prometheus-compatible API endpoint.

Q: What if my alerts don't have a 'job' label? A: Use --labels to specify your required labels, or configure via .label-check.yml.

Q: How does alert-hysteresis calculate recommendations? A: It uses statistical analysis (median, percentiles) of historical firing durations to recommend values that filter spurious short-lived alerts while preserving actionable ones.
