@m-nagarajan m-nagarajan commented Jan 6, 2026

Problem Statement

OTel integration in the server for HeartbeatStat metrics

Solution

Current Tehuti integration, using heartbeat metrics as an example, has 2 layers:
Layer 1: metrics of different types (Rate(), Min(), Max(), etc.) in the HeartbeatStat class are created using a local metrics repository and thus can't be exported. These are created for each available version.
Layer 2: metrics in the HeartbeatStatReporter class, all of which are AsyncGauge() instances that read from the layer 1 metrics for the current/future versions. These get exported.
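
To make the two layers concrete, here is a minimal stdlib-only sketch of the pattern described above. It does not use the Tehuti API: `TwoLayerSketch`, `LocalVersionStat`, `exportedGauges`, and the gauge names are all hypothetical stand-ins, and the real HeartbeatStat/HeartbeatStatReporter classes may be structured differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.DoubleSupplier;

final class TwoLayerSketch {
  /** Layer 1 stand-in (hypothetical): per-version aggregates kept locally, never exported. */
  static final class LocalVersionStat {
    private long count;
    private double sum;
    private double max = Double.NaN;

    synchronized void record(double delay) {
      count++;
      sum += delay;
      max = Double.isNaN(max) ? delay : Math.max(max, delay);
    }

    synchronized double avg() {
      return count == 0 ? Double.NaN : sum / count;
    }

    synchronized double max() {
      return max;
    }
  }

  /**
   * Layer 2 stand-in (hypothetical): exported gauges that read layer 1 lazily.
   * Each aggregate needs its own gauge, which is where the name suffixes come from.
   */
  static Map<String, DoubleSupplier> exportedGauges(
      Map<Integer, LocalVersionStat> perVersion, int currentVersion) {
    Map<String, DoubleSupplier> gauges = new HashMap<>();
    gauges.put("heartbeat_delay_max", () -> {
      LocalVersionStat stat = perVersion.get(currentVersion);
      return stat == null ? Double.NaN : stat.max();
    });
    gauges.put("heartbeat_delay_avg", () -> {
      LocalVersionStat stat = perVersion.get(currentVersion);
      return stat == null ? Double.NaN : stat.avg();
    });
    return gauges;
  }

  public static void main(String[] args) {
    Map<Integer, LocalVersionStat> perVersion = new ConcurrentHashMap<>();
    perVersion.computeIfAbsent(3, v -> new LocalVersionStat()).record(10.0);
    perVersion.get(3).record(30.0);
    exportedGauges(perVersion, 3)
        .forEach((name, gauge) -> System.out.println(name + " = " + gauge.getAsDouble()));
  }
}
```

The sketch shows the maintenance cost: every aggregate that should be visible needs a second, gauge-shaped metric that mirrors a layer 1 metric.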

This adds complexity: 2 sets of metrics need to be maintained, and the 2nd layer must always be a Gauge, which leads to metric names carrying _min/_max/_avg suffixes to note what they measure and limits what can be exported. Moving to OpenTelemetry, I want to remove this complexity by making it a single layer and using a dimension to denote whether a metric belongs to the current/future/backup version, so histograms can be exported directly. The new metrics are defined and recorded directly in the HeartbeatVersionedStats class, where the previous layer 1 metrics are recorded.

New Otel metric: venice.server.ingestion.replication.heartbeat.delay (Histogram)
New Dimensions:
venice.replica.type: leader/follower
venice.replica.state: ready_to_serve/catching_up
venice.version.role: current/future/backup
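
As a rough illustration of the single-layer idea, the sketch below records one measurement stream under the metric name above and separates series by the three dimensions. It is a stdlib-only stand-in, not the OpenTelemetry SDK and not the code in this PR: `HeartbeatDelaySketch`, `Series`, the enum names, and the count/sum aggregation are assumptions made for illustration (a real histogram would also keep bucket counts).

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

final class HeartbeatDelaySketch {
  static final String METRIC_NAME = "venice.server.ingestion.replication.heartbeat.delay";

  // Hypothetical enums mirroring the dimension values listed above.
  enum ReplicaType { LEADER, FOLLOWER }
  enum ReplicaState { READY_TO_SERVE, CATCHING_UP }
  enum VersionRole { CURRENT, FUTURE, BACKUP }

  /** Simplified stand-in for one histogram series: count and sum only. */
  static final class Series {
    final LongAdder count = new LongAdder();
    final DoubleAdder sum = new DoubleAdder();
  }

  // One series per distinct combination of dimension values.
  private final Map<String, Series> seriesByAttributes = new ConcurrentHashMap<>();

  /** Builds the attribute key, e.g. "venice.replica.type=leader,...". */
  static String attributes(ReplicaType type, ReplicaState state, VersionRole role) {
    return "venice.replica.type=" + lower(type)
        + ",venice.replica.state=" + lower(state)
        + ",venice.version.role=" + lower(role);
  }

  private static String lower(Enum<?> value) {
    return value.name().toLowerCase(Locale.ROOT);
  }

  /** Records a delay directly; no second reporter layer is involved. */
  void record(double delay, ReplicaType type, ReplicaState state, VersionRole role) {
    Series series =
        seriesByAttributes.computeIfAbsent(attributes(type, state, role), key -> new Series());
    series.count.increment();
    series.sum.add(delay);
  }

  /** Returns the series for the given dimensions, or null if nothing was recorded. */
  Series series(ReplicaType type, ReplicaState state, VersionRole role) {
    return seriesByAttributes.get(attributes(type, state, role));
  }

  public static void main(String[] args) {
    HeartbeatDelaySketch histogram = new HeartbeatDelaySketch();
    histogram.record(5.0, ReplicaType.LEADER, ReplicaState.READY_TO_SERVE, VersionRole.CURRENT);
    histogram.record(7.0, ReplicaType.LEADER, ReplicaState.READY_TO_SERVE, VersionRole.CURRENT);
    histogram.record(40.0, ReplicaType.FOLLOWER, ReplicaState.CATCHING_UP, VersionRole.FUTURE);
    Series current =
        histogram.series(ReplicaType.LEADER, ReplicaState.READY_TO_SERVE, VersionRole.CURRENT);
    System.out.println(METRIC_NAME + " count=" + current.count.sum() + " sum=" + current.sum.sum());
  }
}
```

Because the version role is a dimension rather than part of the metric name, a version moving from future to current only changes the attribute value used at record time; no additional metric has to be registered.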

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@sixpluszero sixpluszero left a comment

Overall looks good, thank you for putting up this PR. Left some comments for clarification.

lluwm commented Jan 16, 2026

Thanks @m-nagarajan for addressing my comments and it LGTM!

@sixpluszero sixpluszero left a comment

Overall LGTM! Thank you for the change.
Only one small nit about sharing the enum: if it is a common one, maybe it should not be in the stat package.

@m-nagarajan m-nagarajan merged commit a26fe09 into linkedin:main Jan 17, 2026
50 checks passed

3 participants