
Conversation

@danbryan

Summary

  • Fix RPC latency metric (rpc/duration/all) to use ResettingTimer instead of Timer

Problem

The rpc/duration/all metric was using metrics.NewRegisteredTimer() which creates a Timer that accumulates samples over the entire process lifetime. When exported to Prometheus as a Summary, the quantile values grow unbounded over time.

In production, this caused Grafana dashboards to display RPC latency as years instead of milliseconds:

| Percentile | Expected | Actual (after days of uptime) |
|------------|-----------|-------------------------------|
| p50 | ~50-100ms | 101 days |
| p95 | ~150-200ms | 1.58 years |
| p99 | ~200-300ms | 1.73 years |

Screenshot from production Grafana:

EVM RPC Latency showing years

The raw Prometheus values confirmed the issue: p50 quantile values ranged from 5.5M to 12.2M seconds across nodes.

Root Cause

The Timer type in go-metrics uses an ExpDecaySample internally, but when exported to Prometheus via the metrics/prometheus collector, the addTimer() function calls m.Percentiles(), which computes percentiles over the entire sample reservoir accumulated since process start, not over a sliding window.

From metrics/prometheus/collector.go:

func (c *collector) addTimer(name string, m metrics.Timer) {
    pv := []float64{0.5, 0.75, 0.95, 0.99, 0.999, 0.9999}
    ps := m.Percentiles(pv)  // Calculates over ALL samples since start
    // ...
}
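The cumulative-vs-resetting difference can be sketched with a stdlib-only toy model (no go-metrics dependency). The 5-second "startup burst", the sample counts, and the naive index-based percentile function are illustrative assumptions, not the real ExpDecaySample logic:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile of vals (0 < p < 1),
// using simple nearest-rank indexing for illustration.
func percentile(vals []float64, p float64) float64 {
	s := append([]float64(nil), vals...)
	sort.Float64s(s)
	return s[int(p*float64(len(s)-1))]
}

func main() {
	var cumulative []float64 // never cleared: models the Timer reservoir
	var window []float64     // cleared at each scrape: models a resetting sample

	// Hypothetical startup burst of slow calls (5000ms each).
	for i := 0; i < 1000; i++ {
		cumulative = append(cumulative, 5000)
		window = append(window, 5000)
	}
	window = nil // first scrape resets the window

	// Steady state: fast calls (~50ms).
	for i := 0; i < 1000; i++ {
		cumulative = append(cumulative, 50)
		window = append(window, 50)
	}

	fmt.Printf("cumulative p95: %.0fms\n", percentile(cumulative, 0.95))
	fmt.Printf("resetting  p95: %.0fms\n", percentile(window, 0.95))
}
```

The cumulative reservoir keeps reporting the startup outliers in its p95 indefinitely, while the per-scrape window tracks only recent traffic.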

Solution

Switch from NewRegisteredTimer to NewRegisteredResettingTimer. The ResettingTimer resets its sample buffer on each Prometheus scrape (via Snapshot()), ensuring quantile calculations are based on recent data only.

- rpcServingTimer = metrics.NewRegisteredTimer("rpc/duration/all", nil)
+ rpcServingTimer = metrics.NewRegisteredResettingTimer("rpc/duration/all", nil)

Resetting behavior is already used correctly elsewhere: for example, the per-method metrics in the same file build their histograms from ResettingSample:

sampler := func() metrics.Sample {
    return metrics.ResettingSample(
        metrics.NewExpDecaySample(1028, 0.015),
    )
}
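For intuition, here is a loose stdlib-only mock of the snapshot-and-reset semantics (the real go-ethereum types are concurrency-safe and track min/max/mean as well; the names and structure here are illustrative, not the actual implementation):

```go
package main

import "fmt"

// resettingSample loosely mimics metrics.ResettingSample:
// Snapshot returns the collected values and clears the buffer,
// so each Prometheus scrape sees only data collected since the
// previous scrape.
type resettingSample struct {
	values []int64
}

// Update records one observation (e.g. an RPC duration).
func (s *resettingSample) Update(v int64) {
	s.values = append(s.values, v)
}

// Snapshot hands back the buffered values and resets the buffer.
func (s *resettingSample) Snapshot() []int64 {
	out := s.values
	s.values = nil // reset: the next snapshot starts fresh
	return out
}

func main() {
	s := &resettingSample{}
	s.Update(42)
	s.Update(51)
	fmt.Println(len(s.Snapshot())) // first scrape sees 2 values
	fmt.Println(len(s.Snapshot())) // second scrape sees 0: buffer was reset
}
```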

Test Plan

  • Verify the change compiles: go build ./...
  • Deploy to a test node and verify rpc_duration_all metric shows reasonable values (milliseconds, not years)
  • Confirm Grafana dashboard displays latency correctly after node restart

Vvaradinov and others added 11 commits January 10, 2023 17:49
Co-authored-by: Federico Kunze Küllmer <31522760+fedekunze@users.noreply.github.com>
Co-authored-by: Vladislav Varadinov <vladislav.varadinov@gmail.com>
Co-authored-by: Vladislav Varadinov <vlad@evmos.org>
The rpc/duration/all metric was using metrics.NewRegisteredTimer which
creates a Timer that accumulates samples over the entire process lifetime.
When exported to Prometheus as a Summary, the quantile values grow
unbounded over time, eventually reaching millions of seconds.

In production, this caused Grafana dashboards to display RPC latency
as "years" instead of milliseconds:
- p50: 101 days
- p95: 1.58 years
- p99: 1.73 years

This change switches to NewRegisteredResettingTimer which resets its
sample buffer on each Prometheus scrape. This ensures quantile
calculations are based on recent data only (since last scrape interval)
rather than the entire process lifetime.

The per-method metrics (rpc/duration/<method>/<status>) already use
ResettingSample correctly via GetOrRegisterHistogramLazy, so only the
aggregate rpc/duration/all timer was affected.
@vladjdk
Member

vladjdk commented Dec 17, 2025

This should point to release 1.16 instead of master

@danbryan danbryan changed the base branch from master to release/1.16 December 17, 2025 00:21
@danbryan danbryan closed this Dec 17, 2025
@danbryan danbryan deleted the fix/rpc-duration-resetting-timer branch December 17, 2025 00:27