
[deltatocumulativeprocessor] Introduce an upper bound for exp histogram buckets #36874

Merged 11 commits into open-telemetry:main on Jan 16, 2025

Conversation

euroelessar
Contributor

Description

This PR introduces a limit on the maximum number of exponential histogram buckets within the deltatocumulativeprocessor. Previously, when merging delta metrics into cumulative metrics, the resulting exponential histograms could grow very large, potentially causing excessive memory usage and processing overhead. By capping the number of buckets at 160 and dynamically downscaling histograms when necessary, this change ensures that the processor remains efficient and stable even when handling large, merged exponential histograms.
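As a rough illustration of the capping idea (a simplified sketch, not the processor's actual code; the function name and index math here are assumptions), each downscale step halves a histogram's bucket indices, so a merged index range can be shrunk until it fits under the cap:

```go
package main

import "fmt"

// downscalesNeeded is an illustrative sketch: it computes how many times a
// histogram covering bucket indices [lo, hi] must be downscaled so that its
// bucket count fits within maxBuckets. Each downscale step halves bucket
// widths, i.e. shifts indices right by one (simplified here to
// non-negative indices).
func downscalesNeeded(lo, hi int32, maxBuckets int) int32 {
	var n int32
	for int(hi-lo)+1 > maxBuckets {
		lo >>= 1
		hi >>= 1
		n++
	}
	return n
}

func main() {
	// A merged histogram spanning indices 0..319 (320 buckets) needs one
	// downscale to fit within the 160-bucket cap: 0..159 after one shift.
	fmt.Println(downscalesNeeded(0, 319, 160)) // 1
}
```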

Link to tracking issue

Fixes #33277

Testing

Added unit tests for edge cases.

Documentation

Updated changelog.

@euroelessar
Contributor Author

@sh0rez Please have a look. It's an alternative take on #34157: it avoids any extra memory allocations and could be easier to follow.
All changes are in ExpHistogram.Add (and in Merge, which is called from it), as it's the only place that can grow the histogram.

Comment on lines 109 to 110
// Downscale if the expected number of buckets after the merge is too large.
if deltaScale := max(getDeltaScale(dp.Positive(), in.Positive()), getDeltaScale(dp.Negative(), in.Negative())); deltaScale > 0 {
Member

If I understand correctly, this downscales using the existing algorithm first, possibly exceeding the cap.
We check for that only after the fact, having already allocated more memory than we want, right?
This sounds like it's still vulnerable to DoS attacks.

Contributor Author

A downscaling operation should always produce no more buckets than the original, right?
In particular, expo.Downscale never allocates (except on the panicking path), so it cannot be a memory-based DoS attack vector.
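The allocation-free point can be sketched like this (an illustration of the idea only, assuming an even bucket offset; this is not the expo.Downscale implementation):

```go
package main

import "fmt"

// downscaleOnce merges adjacent bucket pairs in place. The result has at
// most ceil(len/2) buckets and therefore always fits in the prefix of the
// existing slice, so no new memory is allocated. Real downscaling must also
// handle the bucket offset (an odd offset pairs the first bucket alone);
// that detail is omitted here.
func downscaleOnce(counts []uint64) []uint64 {
	n := (len(counts) + 1) / 2
	for i := 0; i < n; i++ {
		c := counts[2*i]
		if 2*i+1 < len(counts) {
			c += counts[2*i+1]
		}
		counts[i] = c
	}
	return counts[:n]
}

func main() {
	// Five buckets collapse to three: 1+2, 3+4, and the unpaired 5.
	fmt.Println(downscaleOnce([]uint64{1, 2, 3, 4, 5})) // [3 7 5]
}
```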

Member

very true, thanks for the hint ;)

@euroelessar euroelessar requested a review from sh0rez December 20, 2024 17:12
Contributor

github-actions bot commented Jan 4, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 4, 2025
@euroelessar
Contributor Author

@sh0rez hi, would you be able to have another look at the comment above?

@github-actions github-actions bot removed the Stale label Jan 5, 2025
Member

@sh0rez sh0rez left a comment


sorry for the slow review!

// getDeltaScale computes how many times the histograms need to be downscaled to ensure
// the bucket range after their merge fits within maxBuckets.
// This logic assumes that leading and trailing zeros will be removed.
func getDeltaScale(arel, brel pmetric.ExponentialHistogramDataPointBuckets) expo.Scale {
Member

This is great, but I think it better fits the public api of the expo package, maybe like this?

package expo

// Limit returns a target Scale such that, when histograms are downscaled to it,
// the total bucket count after [Merge] never exceeds max
func Limit(max int, scale Scale, arel, brel Buckets) Scale { ... }
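A minimal sketch of how such a Limit could be computed (hypothetical: the real expo.Limit takes pmetric bucket types and handles offsets, while here each bucket set is flattened to its index range [lo, hi], assumed non-negative, and each downscale step is assumed to shift indices right by one):

```go
package main

import "fmt"

type Scale int32

// limitSketch is a simplified, illustrative stand-in for the proposed Limit:
// it finds the largest (finest) scale at which the merged index range of two
// bucket sets still fits within maxBuckets.
func limitSketch(maxBuckets int, scale Scale, aLo, aHi, bLo, bHi int32) Scale {
	lo, hi := min(aLo, bLo), max(aHi, bHi) // merged index range
	for int(hi-lo)+1 > maxBuckets {
		lo >>= 1
		hi >>= 1
		scale-- // a lower scale means coarser buckets
	}
	return scale
}

func main() {
	// Two histograms at scale 0 covering 0..159 and 160..319 merge into a
	// 320-bucket range, so the target scale drops to -1 to respect the cap.
	fmt.Println(limitSketch(160, 0, 0, 159, 160, 319)) // -1
}
```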

Contributor Author

done


Comment on lines 92 to 94
dp: expdp{Scale: 0, PosNeg: generateBins(10, 80, 1), Count: 160},
in: expdp{Scale: 0, PosNeg: generateBins(80+10, 60, 2), Count: 120},
want: expdp{Scale: 0, PosNeg: generateBins(10, 80, 1, 60, 2), Count: 280},
Member

why do you need generateBins and downscaled?
From looking at this test case, it's hard to reason about: I cannot clearly see what either the input or the output looks like, and as such cannot tell whether it's correct.
Is there a way to stick to written-out bins{} literals?

Contributor Author

Made maxBuckets configurable in code (a var instead of a const), updated the test to cap at 8 buckets, and moved to the rawbs method.
It should be easier to read now; please let me know if it needs any further changes!

@sh0rez
Member

sh0rez commented Jan 15, 2025

@open-telemetry/collector-contrib-approvers this is ready to merge

Member

@mx-psi mx-psi left a comment


LGTM and the chosen size aligns with the SDK default 👍

@mx-psi mx-psi merged commit 37c8044 into open-telemetry:main Jan 16, 2025
163 checks passed
@github-actions github-actions bot added this to the next release milestone Jan 16, 2025
Comment on lines +83 to +86
to := max(
expo.Limit(maxBuckets, from, dp.Positive(), in.Positive()),
expo.Limit(maxBuckets, from, dp.Negative(), in.Negative()),
)


@euroelessar @mx-psi Shouldn't this be min instead of max? What I am seeing is that if I have no negative data points, then no scaling happens and the number of positive buckets still explodes.

I think the tests missed this because they all use PosNeg, i.e., use the same positive and negative data points.
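The scale semantics behind this question can be sketched as follows (index math simplified to non-negative indices; this is an illustration, not the processor's code). A lower scale means coarser buckets, so taking the max of the two per-sign target scales keeps the less downscaled one and can leave the other sign over the cap:

```go
package main

import "fmt"

// bucketsAt returns how many buckets the index range [lo, hi] occupies
// after moving from scale `from` down to scale `to` (to <= from); each
// downscale step shifts indices right by one.
func bucketsAt(lo, hi, from, to int32) int {
	for s := from; s > to; s-- {
		lo >>= 1
		hi >>= 1
	}
	return int(hi-lo) + 1
}

func main() {
	// Positive buckets span 0..319 at scale 0 and need target scale -1 to
	// fit a 160-bucket cap; empty negative buckets need no change, so their
	// target stays 0. max(-1, 0) = 0 leaves the positive side at 320
	// buckets, while min(-1, 0) = -1 brings it under the cap.
	fmt.Println(bucketsAt(0, 319, 0, 0))  // 320
	fmt.Println(bucketsAt(0, 319, 0, -1)) // 160
}
```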

Member

@tiit-clarifai can you file an issue for this? It definitely looks suspicious


Done, #37416.

Contributor Author

My bad, #37432 fixes it. Thanks for letting me know about it! I've also added tests to catch it in the future.

chengchuanpeng pushed a commit to chengchuanpeng/opentelemetry-collector-contrib that referenced this pull request Jan 26, 2025
Successfully merging this pull request may close these issues.

deltatocumulative: Number of buckets in exponential histograms should be capped
5 participants