
enforce max series for metrics queries #4525

Open: wants to merge 3 commits into base: main


ie-pham (Contributor) commented Jan 7, 2025

What this PR does: Adds a config option that caps the number of time series returned by a metrics query. The limit is enforced only in the query frontend.

New config: max_response_series (default: 1000)

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

ie-pham (Contributor, Author) commented Jan 21, 2025

As implemented, Tempo truncates the final results at the frontend level. Alternatively, we could return as soon as 1000 series are reached, regardless of how many data points each series has, so the query exits early. I'm not sure which approach we prefer.
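The two options can be contrasted in a small sketch. It assumes a hypothetical point type (one sample tagged with its series key) in place of tempopb.QueryRangeResponse, and hypothetical helpers combineThenTruncate and combineEarlyExit; it is not the PR's combiner code. Truncating after combining keeps complete series but does all the work first, while exiting early stops reading once maxSeries series have been seen, so the kept series may hold very few points.

```go
package main

import "fmt"

// point is one (series, value) sample as it might arrive from a sub-query job.
// Hypothetical type, much simpler than tempopb.QueryRangeResponse.
type point struct {
	series string
	value  float64
}

// combineThenTruncate collects every sample, then keeps only the first
// maxSeries series in first-seen order. All input is consumed before the cut.
func combineThenTruncate(in []point, maxSeries int) map[string][]float64 {
	all := map[string][]float64{}
	order := []string{}
	for _, p := range in {
		if _, ok := all[p.series]; !ok {
			order = append(order, p.series)
		}
		all[p.series] = append(all[p.series], p.value)
	}
	out := map[string][]float64{}
	for i, s := range order {
		if i >= maxSeries {
			break
		}
		out[s] = all[s]
	}
	return out
}

// combineEarlyExit stops consuming input as soon as maxSeries distinct series
// have been seen (assumes maxSeries > 0). It saves work, but later samples for
// already-kept series are never read.
func combineEarlyExit(in []point, maxSeries int) map[string][]float64 {
	out := map[string][]float64{}
	for _, p := range in {
		out[p.series] = append(out[p.series], p.value)
		if len(out) >= maxSeries {
			break // limit reached: stop reading further results
		}
	}
	return out
}

func main() {
	// Samples arrive interleaved and in no guaranteed order.
	in := []point{{"a", 1}, {"b", 2}, {"a", 3}, {"c", 4}, {"b", 5}, {"a", 6}}
	fmt.Println(len(combineThenTruncate(in, 2)["a"])) // prints 3
	fmt.Println(len(combineEarlyExit(in, 2)["a"]))    // prints 1
}
```

With interleaved input, the early-exit variant keeps series a with a single sample, which is the usefulness concern discussed further down in this thread.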

PathSearchTagValuesV2 = "/api/v2/search/tag/{" + MuxVarTagName + "}/values"
PathSearchTagsV2 = "/api/v2/search/tags"
PathTracesV2 = "/api/v2/traces/{traceID}"
PathMetricsQueryInstantV2 = "/api/v2/metrics/query"
Review comment (Member):

Let's add these /v2/metrics endpoints to the API Reference page in the docs.

Review comment (Member):

Why are we adding a v2 metrics API?

@@ -58,11 +58,7 @@ message TraceByIDRequest {
message TraceByIDResponse {
Trace trace = 1;
TraceByIDMetrics metrics = 2;
enum Status {
Review comment (Member):

I'm not sure whether taking the enum out is backwards compatible. I think it should be okay, but can you verify that the tracebyidv2 endpoint works as expected with Grafana?

I think the e2e tests would catch this, but I'm not sure whether we test it at the proto level.

@@ -1,5 +1,7 @@
## main / unreleased

* [CHANGE] Enforce max series in response for metrics queries [#4525](https://github.com/grafana/tempo/pull/4525) (@ie-pham)
Review comment (Member):

Nit: the changelog entry reads as a behaviour change to the current endpoints, whereas we are adding new v2 endpoints. Can we update the entry to make that clear?

@@ -696,6 +696,9 @@ query_frontend:
# Maximun number of exemplars per range query. Limited to 100.
[max_exemplars: <int> | default = 100 ]

# Maximum number of time series returned for a metrics query.
[max_response_series: <int> | default 1000]
Review comment (Member):

Suggested change
[max_response_series: <int> | default 1000]
[max_response_series: <int> | default = 1000]

to match the other default values in the doc.

@@ -14,7 +15,7 @@ import (
var _ GRPCCombiner[*tempopb.QueryRangeResponse] = (*genericCombiner[*tempopb.QueryRangeResponse])(nil)

// NewQueryRange returns a query range combiner.
func NewQueryRange(req *tempopb.QueryRangeRequest, trackDiffs bool) (Combiner, error) {
func NewQueryRange(req *tempopb.QueryRangeRequest, trackDiffs bool, setMaxSeries bool, maxSeries int) (Combiner, error) {
Review comment (Member):

We could pass maxSeries as 0 to disable the limit and drop the setMaxSeries parameter.

No strong preference here, though; I'm okay with this as well.
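A minimal sketch of that suggestion, assuming a hypothetical Series type and truncateSeries helper (not the PR's code): a non-positive maxSeries means "no limit", so the combiner would need only one parameter instead of setMaxSeries plus maxSeries. Treating negative values as unlimited, not just 0, is an extra assumption here.

```go
package main

import "fmt"

// Series is a hypothetical stand-in for a combined time series.
type Series struct {
	Labels string
}

// truncateSeries caps the number of series at maxSeries.
// maxSeries <= 0 disables the limit, so no separate setMaxSeries flag is needed.
func truncateSeries(series []Series, maxSeries int) []Series {
	if maxSeries <= 0 || len(series) <= maxSeries {
		return series
	}
	return series[:maxSeries]
}

func main() {
	s := []Series{{"a"}, {"b"}, {"c"}}
	fmt.Println(len(truncateSeries(s, 0))) // prints 3 (limit disabled)
	fmt.Println(len(truncateSeries(s, 2))) // prints 2
}
```

Using the zero value as "disabled" also means an unset config field naturally turns the limit off, which may or may not be the desired default given max_response_series defaults to 1000.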

@@ -43,6 +44,11 @@ func NewQueryRange(req *tempopb.QueryRangeRequest, trackDiffs bool) (Combiner, e
if resp == nil {
resp = &tempopb.QueryRangeResponse{}
}
if setMaxSeries && len(resp.Series) > maxSeries {
resp.Series = resp.Series[:maxSeries]
Review comment (Member):

IIUC, we are still collecting all the series and then dropping the extra ones before returning them, rather than exiting early here, right?

It would be great if we exited early when we hit this limit.

This would be very useful when a metrics query returns high-cardinality results, for example: {} | rate() by (span:id)

Just pulling the series from the blocks is resource-intensive and can OOM all the generators.

Reply (Contributor, Author):

I raised this question earlier in #4525 (comment).
The problem is that the results arrive in a nondeterministic order, so if we exit as soon as we hit the max series, we could return a response where each series has just one data point, which isn't very useful. But yes, exiting early would save us from doing extra work.

Reply (Member):

Cool, I'm okay with the current implementation. We can leave a note or TODO that this can be improved by making the results deterministic and exiting early.

Side note: I think Joe's ordered-results work might help here.
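One way to make the truncated result deterministic, sketched under assumptions: sort the combined series by a canonical label key before cutting, so the same maxSeries series survive regardless of arrival order. Series, Key, and truncateDeterministic are hypothetical names, and this is a plain sort, not Joe's ordered-results work. It only makes truncation stable; exiting early would additionally require sub-query results to arrive in key order.

```go
package main

import (
	"fmt"
	"sort"
)

// Series is a hypothetical combined time series; Key is assumed to be a
// canonical string form of its labels.
type Series struct {
	Key    string
	Points int
}

// truncateDeterministic sorts by Key before cutting, so the kept series do not
// depend on the order in which sub-query results arrived. It still combines
// everything first; it does not by itself enable early exit.
func truncateDeterministic(series []Series, maxSeries int) []Series {
	sorted := append([]Series(nil), series...) // copy: avoid mutating the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Key < sorted[j].Key })
	if maxSeries > 0 && len(sorted) > maxSeries {
		sorted = sorted[:maxSeries]
	}
	return sorted
}

func main() {
	arrival1 := []Series{{"c", 1}, {"a", 2}, {"b", 3}}
	arrival2 := []Series{{"b", 3}, {"c", 1}, {"a", 2}}
	fmt.Println(truncateDeterministic(arrival1, 2)) // prints [{a 2} {b 3}]
	fmt.Println(truncateDeterministic(arrival2, 2)) // prints [{a 2} {b 3}]
}
```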

PARTIAL = 1;
}
Status status = 3;
PartialStatus status = 3;
Review comment (Member):

Nice work reusing this.

Reply (Contributor, Author):

We don't have to add a V2; it's just an extra precaution for the rollout, while Grafana is not yet using the new protos.


3 participants