
Conversation

@danbryan

Summary

  • Fix RPC latency metric (rpc/duration/all) to use ResettingTimer instead of Timer

Problem

The rpc/duration/all metric was using metrics.NewRegisteredTimer() which creates a Timer that accumulates samples over the entire process lifetime. When exported to Prometheus as a Summary, the quantile values grow unbounded over time.

In production, this caused Grafana dashboards to display RPC latency as years instead of milliseconds:

| Percentile | Expected | Actual (after days of uptime) |
|------------|-----------|-------------------------------|
| p50 | ~50-100ms | 101 days |
| p95 | ~150-200ms | 1.58 years |
| p99 | ~200-300ms | 1.73 years |

Screenshot from production Grafana:

EVM RPC Latency showing years

The raw Prometheus values confirmed the issue: p50 quantile values ranged from 5.5M to 12.2M seconds across nodes.

Root Cause

The Timer type in go-metrics uses an ExpDecaySample internally, but when exported to Prometheus via the metrics/prometheus collector, the addTimer() function calls m.Percentiles(), which computes percentiles over the entire sample reservoir accumulated since process start, not over a sliding window.

From metrics/prometheus/collector.go:

func (c *collector) addTimer(name string, m metrics.Timer) {
    pv := []float64{0.5, 0.75, 0.95, 0.99, 0.999, 0.9999}
    ps := m.Percentiles(pv)  // Calculates over ALL samples since start
    // ...
}
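The cumulative-vs-resetting difference can be sketched with a stdlib-only toy model (no go-metrics dependency). The 5-second "startup burst", the sample counts, and the naive index-based percentile function are illustrative assumptions, not the real ExpDecaySample logic:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile of vals (0 < p < 1),
// using simple nearest-rank indexing for illustration.
func percentile(vals []float64, p float64) float64 {
	s := append([]float64(nil), vals...)
	sort.Float64s(s)
	return s[int(p*float64(len(s)-1))]
}

func main() {
	var cumulative []float64 // never cleared: models the Timer reservoir
	var window []float64     // cleared at each scrape: models a resetting sample

	// Hypothetical startup burst of slow calls (5000ms each).
	for i := 0; i < 1000; i++ {
		cumulative = append(cumulative, 5000)
		window = append(window, 5000)
	}
	window = nil // first scrape resets the window

	// Steady state: fast calls (~50ms).
	for i := 0; i < 1000; i++ {
		cumulative = append(cumulative, 50)
		window = append(window, 50)
	}

	fmt.Printf("cumulative p95: %.0fms\n", percentile(cumulative, 0.95))
	fmt.Printf("resetting  p95: %.0fms\n", percentile(window, 0.95))
}
```

The cumulative reservoir keeps reporting the startup outliers in its p95 indefinitely, while the per-scrape window tracks only recent traffic.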

Solution

Switch from NewRegisteredTimer to NewRegisteredResettingTimer. The ResettingTimer resets its sample buffer on each Prometheus scrape (via Snapshot()), ensuring quantile calculations are based on recent data only.

- rpcServingTimer = metrics.NewRegisteredTimer("rpc/duration/all", nil)
+ rpcServingTimer = metrics.NewRegisteredResettingTimer("rpc/duration/all", nil)

Resetting behavior is already used correctly elsewhere: for example, the per-method metrics in the same file build their histograms from ResettingSample:

sampler := func() metrics.Sample {
    return metrics.ResettingSample(
        metrics.NewExpDecaySample(1028, 0.015),
    )
}
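For intuition, here is a loose stdlib-only mock of the snapshot-and-reset semantics (the real go-ethereum types are concurrency-safe and track min/max/mean as well; the names and structure here are illustrative, not the actual implementation):

```go
package main

import "fmt"

// resettingSample loosely mimics metrics.ResettingSample:
// Snapshot returns the collected values and clears the buffer,
// so each Prometheus scrape sees only data collected since the
// previous scrape.
type resettingSample struct {
	values []int64
}

// Update records one observation (e.g. an RPC duration).
func (s *resettingSample) Update(v int64) {
	s.values = append(s.values, v)
}

// Snapshot hands back the buffered values and resets the buffer.
func (s *resettingSample) Snapshot() []int64 {
	out := s.values
	s.values = nil // reset: the next snapshot starts fresh
	return out
}

func main() {
	s := &resettingSample{}
	s.Update(42)
	s.Update(51)
	fmt.Println(len(s.Snapshot())) // first scrape sees 2 values
	fmt.Println(len(s.Snapshot())) // second scrape sees 0: buffer was reset
}
```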

Test Plan

  • Verify the change compiles: go build ./...
  • Deploy to a test node and verify rpc_duration_all metric shows reasonable values (milliseconds, not years)
  • Confirm Grafana dashboard displays latency correctly after node restart

Vvaradinov and others added 11 commits January 10, 2023 17:49
Co-authored-by: Federico Kunze Küllmer <31522760+fedekunze@users.noreply.github.com>
Co-authored-by: Vladislav Varadinov <vladislav.varadinov@gmail.com>
Co-authored-by: Vladislav Varadinov <vlad@evmos.org>
The rpc/duration/all metric was using metrics.NewRegisteredTimer which
creates a Timer that accumulates samples over the entire process lifetime.
When exported to Prometheus as a Summary, the quantile values grow
unbounded over time, eventually reaching millions of seconds.

In production, this caused Grafana dashboards to display RPC latency
as "years" instead of milliseconds:
- p50: 101 days
- p95: 1.58 years
- p99: 1.73 years

This change switches to NewRegisteredResettingTimer which resets its
sample buffer on each Prometheus scrape. This ensures quantile
calculations are based on recent data only (since last scrape interval)
rather than the entire process lifetime.

The per-method metrics (rpc/duration/<method>/<status>) already use
ResettingSample correctly via GetOrRegisterHistogramLazy, so only the
aggregate rpc/duration/all timer was affected.
@vladjdk
Member

vladjdk commented Dec 17, 2025

This should point to release 1.16 instead of master

@danbryan danbryan changed the base branch from master to release/1.16 December 17, 2025 00:21
@danbryan danbryan closed this Dec 17, 2025
@danbryan danbryan deleted the fix/rpc-duration-resetting-timer branch December 17, 2025 00:27