fix(rpc): use ResettingTimer for RPC duration metrics #11
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
rpc/duration/all) to useResettingTimerinstead ofTimerProblem
The
rpc/duration/allmetric was usingmetrics.NewRegisteredTimer()which creates a Timer that accumulates samples over the entire process lifetime. When exported to Prometheus as a Summary, the quantile values grow unbounded over time.In production, this caused Grafana dashboards to display RPC latency as years instead of milliseconds:
Screenshot from production Grafana:
The raw Prometheus values confirmed the issue - p50 quantile values ranged from 5.5M to 12.2M seconds across nodes.
Root Cause
The
Timertype in go-metrics uses anExpDecaySampleinternally, but when exported to Prometheus via themetrics/prometheuscollector, theaddTimer()function callsm.Percentiles()which calculates percentiles over the entire sample reservoir since process start - not a sliding window.From
metrics/prometheus/collector.go:Solution
Switch from
NewRegisteredTimertoNewRegisteredResettingTimer. TheResettingTimerresets its sample buffer on each Prometheus scrape (viaSnapshot()), ensuring quantile calculations are based on recent data only.The
ResettingTimeris already used correctly elsewhere - for example, the per-method metrics in the same file useResettingSample:Test Plan
go build ./...rpc_duration_allmetric shows reasonable values (milliseconds, not years)