[Perf][Metrics] Use flurry's concurrent hashmap for 5x throughput #2305

utpilla · 2024-11-18T01:58:39Z

Towards #1740

Changes

The current Metrics SDK implementation uses a RwLock<HashMap> for aggregating measurements. RwLock introduces a significant amount of contention, even for concurrent reads.
This PR showcases how we could use the flurry crate's concurrent HashMap instead of a RwLock<HashMap>.
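To make the contention point concrete, here is a minimal sketch of the RwLock<HashMap> pattern described above. It is not the SDK's actual code: `CounterMap`, `add`, `get`, and `hammer` are hypothetical names, and attribute sets are reduced to plain strings. It only illustrates that every measurement has to acquire the shared lock, even on the read-only fast path.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;
use std::thread;

/// Hypothetical stand-in for a counter's per-attribute-set state.
pub struct CounterMap {
    trackers: RwLock<HashMap<String, AtomicU64>>,
}

impl CounterMap {
    pub fn new() -> Self {
        CounterMap { trackers: RwLock::new(HashMap::new()) }
    }

    /// Records one measurement for the given (simplified) attribute set.
    pub fn add(&self, attrs: &str, value: u64) {
        {
            // Fast path: the key already exists. Every caller still acquires
            // the read lock, which updates lock state shared by all threads.
            let map = self.trackers.read().unwrap();
            if let Some(tracker) = map.get(attrs) {
                tracker.fetch_add(value, Ordering::Relaxed);
                return;
            }
        } // read guard dropped here, before the write lock is requested

        // Slow path: first time this attribute set is seen. `entry` handles
        // the case where another thread inserted the key in the meantime.
        let mut map = self.trackers.write().unwrap();
        map.entry(attrs.to_string())
            .or_insert_with(|| AtomicU64::new(0))
            .fetch_add(value, Ordering::Relaxed);
    }

    /// Returns the current sum for an attribute set, or 0 if it is unknown.
    pub fn get(&self, attrs: &str) -> u64 {
        let map = self.trackers.read().unwrap();
        map.get(attrs).map_or(0, |t| t.load(Ordering::Relaxed))
    }
}

/// Runs `threads` writers that each add 1 to the same key `per_thread` times
/// and returns the final sum.
pub fn hammer(threads: usize, per_thread: u64) -> u64 {
    let counters = CounterMap::new();
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    counters.add("http.method=GET", 1);
                }
            });
        }
    });
    counters.get("http.method=GET")
}
```

A concurrent map removes the single lock from the fast path. The sketch stays on the standard library so that it compiles without extra dependencies, which is why it does not show flurry itself.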

The performance gains are huge!

Stress Tests results:

Machine details:

OS: Ubuntu 22.04.4 LTS (5.15.153.1-microsoft-standard-WSL2)
Hardware: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz, 16 vCPUs
RAM: 64.0 GB

Counter throughput went up from 9M to 45M iterations per second (5x).
Histogram throughput went up from 7.5M to 35M iterations per second (about 4.7x).

Benchmarks

The benchmark results are comparable: there is no significant difference with and without this change.

Note for reviewers

This PR is not meant for merging as-is. It's meant to show that we can use a more efficient concurrent data structure to our advantage. If we do decide to use flurry's HashMap, we have to address the following:

There is a potential race condition during collect for Delta temporality. This needs to be fixed to avoid losing measurements.
Provide a meaningful Ord implementation for KeyValue. flurry requires the HashMap's key type to implement Ord. For this PR, I have added a very basic implementation just to unblock myself from testing the HashMap.
We need to offer this under a feature flag. We don't want to add a dependency on an external crate by default.
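For the Ord requirement in the list above, the following is a rough sketch of what a total ordering over attribute values could look like. The types `AttrValue` and `AttrKeyValue` are hypothetical, simplified stand-ins, not the real `KeyValue`, and the variant ranking is an arbitrary assumption. The main point is that floats need `f64::total_cmp`, because `f64` has no `Ord` of its own.

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified stand-in for an attribute value.
#[derive(Debug, Clone)]
pub enum AttrValue {
    Bool(bool),
    I64(i64),
    F64(f64),
    Str(String),
}

impl AttrValue {
    /// Arbitrary rank used to order values of different variants.
    fn rank(&self) -> u8 {
        match self {
            AttrValue::Bool(_) => 0,
            AttrValue::I64(_) => 1,
            AttrValue::F64(_) => 2,
            AttrValue::Str(_) => 3,
        }
    }
}

impl Ord for AttrValue {
    fn cmp(&self, other: &Self) -> Ordering {
        match (self, other) {
            (AttrValue::Bool(a), AttrValue::Bool(b)) => a.cmp(b),
            (AttrValue::I64(a), AttrValue::I64(b)) => a.cmp(b),
            // `total_cmp` gives floats a total order (NaN included),
            // which `Ord` requires and `partial_cmp` cannot provide.
            (AttrValue::F64(a), AttrValue::F64(b)) => a.total_cmp(b),
            (AttrValue::Str(a), AttrValue::Str(b)) => a.cmp(b),
            // Different variants: fall back to the variant rank.
            _ => self.rank().cmp(&other.rank()),
        }
    }
}

impl PartialOrd for AttrValue {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// Equality is defined through `cmp` so that `Eq` and `Ord` always agree.
impl PartialEq for AttrValue {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for AttrValue {}

/// Hypothetical key-value pair; the derived `Ord` compares `key`, then `value`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct AttrKeyValue {
    pub key: String,
    pub value: AttrValue,
}
```

If a type like this were also used as a hash map key, its `Hash` implementation would need to agree with this equality, for example by hashing `f64::to_bits()` for the float variant.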

Merge requirement checklist

CONTRIBUTING guidelines followed
Unit tests added/updated (if applicable)
Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
Changes in public API reviewed (if applicable)

codecov · 2024-11-18T02:02:00Z

Codecov Report

Attention: Patch coverage is 53.84615% with 24 lines in your changes missing coverage. Please review.

Project coverage is 79.5%. Comparing base (3ac2d9f) to head (dcc1aab).

Files with missing lines	Patch %	Lines
opentelemetry/src/common.rs	0.0%	23 Missing ⚠️
opentelemetry-sdk/src/metrics/internal/mod.rs	96.5%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##            main   #2305     +/-   ##
=======================================
- Coverage   79.6%   79.5%   -0.1%     
=======================================
  Files        123     123             
  Lines      21263   21291     +28     
=======================================
+ Hits       16938   16947      +9     
- Misses      4325    4344     +19


lalitb · 2024-11-18T07:57:19Z

Thank you for sharing the results! The performance improvements are impressive, as anticipated, given that flurry is modeled after Java's ConcurrentHashMap and provides more fine-grained concurrency than DashMap. This can be added under a feature flag, but we also need to look into a concern raised about its stability under high load in jonhoo/flurry#127:

flurry has been a very interesting concurrent hash table experiment and has driven a lot of innovation in that space. However, at this point, flurry suffers from performance as well as memory usage issues under load.

Use flurry's concurrent hashmap

dcc1aab

utpilla requested a review from a team as a code owner November 18, 2024 01:58

utpilla marked this pull request as draft November 18, 2024 02:03
