-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Per-operation and client error metric #6443
base: dev
Are you sure you want to change the base?
Conversation
✅ Docs preview readyThe preview is ready to be viewed. View the preview File Changes 0 new, 2 changed, 0 removed
Build ID: d6eafb54039dc8c1ca88d695 URL: https://www.apollographql.com/docs/deploy-preview/d6eafb54039dc8c1ca88d695 |
CI performance tests
|
println!("temp_service: {temp_service}"); | ||
|
||
u64_counter!( | ||
"temp.nick.apollo.router.operations.error", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😂 Cool to see a PoC in under 50 lines.
05abe17
to
1582ded
Compare
1582ded
to
a46fee7
Compare
fn count_errors(&self, errors: &[graphql::Error], context: &Context) { | ||
let codes = errors | ||
.iter() | ||
.map(|e| e.extensions.get("code").and_then(|c| c.as_str())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC for errors without a code, it would count against the "empty" code e.g. code=""
. I think we could consider defining an explicit UNKNOWN
code but it seems fine to support this with an empty code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is an empty string. I can easily add a fallback code but I think an empty string is just as good.
if matches!(self.oltp_error_metrics_mode, OtlpErrorMetricsMode::Enabled) { | ||
let code_str = code.unwrap_or_default().to_string(); | ||
u64_counter!( | ||
"apollo.router.operations.error", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that the Router team has asked for this to be a private metric - would that mean we would need to call it apollo_private.router.operations.error
? Ultimately we don't want it to be, so I'm fine with it not being private at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should pass the "private" filter regex and not pass the "public" filter regex, so it should be private. I don't really know how to test that thought because it is still coming through in prometheus.
pub(crate) fn private<T: Into<MeterProvider>>(delegate: T) -> Self {
FilterMeterProvider::builder()
.delegate(delegate)
.allow(
Regex::new(
r"apollo\.(graphos\.cloud|router\.(operations?|lifecycle|config|schema|query|query_planning|telemetry|instance))(\..*|$)|apollo_router_uplink_fetch_count_total|apollo_router_uplink_fetch_duration_seconds",
)
.expect("regex should have been valid"),
)
.build()
}
pub(crate) fn public<T: Into<MeterProvider>>(delegate: T) -> Self {
FilterMeterProvider::builder()
.delegate(delegate)
.deny(
Regex::new(r"apollo\.router\.(config|entities|instance|operations\.(connectors|fetch|request_size|response_size)|schema\.connectors)(\..*|$)")
.expect("regex should have been valid"),
)
.build()
}
This is looking pretty good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good from my side.
error[E0432]: unresolved import `opentelemetry_api`
--> apollo-router/src/services/router/tests.rs:12:5
|
12 | use opentelemetry_api::KeyValue;
| ^^^^^^^^^^^^^^^^^ use of undeclared crate or module `opentelemetry_api` The dependency on |
This adds a new metric that tracks errors per operation, client, and error code. The metric in incremented whenever an error occurs and includes the following attributes:
This new metric is marked as experimental and disabled by default, since using it in a high-cardinality graph will likely mean that we'll hit OTel cardinality issues.
Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩