-
Notifications
You must be signed in to change notification settings - Fork 888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(metricprovider): add support for multiple formulas in dd metricprovider #3892
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3892 +/- ##
==========================================
- Coverage 83.87% 83.85% -0.03%
==========================================
Files 163 163
Lines 18564 18609 +45
==========================================
+ Hits 15571 15604 +33
- Misses 2120 2128 +8
- Partials 873 877 +4 ☔ View full report in Codecov by Sentry. |
Published E2E Test Results 4 files 4 suites 3h 11m 17s ⏱️ For more details on these failures, see this check. Results for commit 0833186. ♻️ This comment has been updated with latest results. |
Published Unit Test Results2 296 tests 2 296 ✅ 2m 59s ⏱️ Results for commit 0833186. ♻️ This comment has been updated with latest results. |
Quality Gate passedIssues Measures |
593a099
to
33e3f77
Compare
33e3f77
to
5ddd301
Compare
@zachaller I see this is labelled as v1.9, can it possibly be added to the v1.8 milestone? |
Sorry yea, my main reason for not putting this in 1.8 was just time (I need to get 1.8 out it's way overdue) and not having access and knowledge about Datadog as a metrics provider. Hence, the recent addition of plugins for metrics it's just a lot for us to maintain. I tend to lean on @meeech for extra input on Datadog as a provider. Getting some docs on this feature would also be required as well. I did already cut 1.8.0-rc1 and don't really want to add a feat into an already cut release but let me stew on it a bit and lets get this into master until then. |
5ddd301
to
fc758ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey thanks for this. Cool stuff.
So a little nit, and a bigger question around how default and isNil have been changed. They close over for a reason, and I think it may have to do with at what point they are actually evaluted. But maybe it was done that way for no reason?
Also, I agree with @zachaller the docs definitely need updating around this.
tbh not even clear for the use case around this, or what issue its solving (want less api requests to DD? Just dont like the idea of this being 2 separate metric calls)? but I'm probably missing something.
So for sure some docs update at a minimum with example usage in an analysis template.
@@ -100,8 +100,8 @@ func EvalCondition(resultValue any, condition string) (bool, error) { | |||
"asFloat": asFloat, | |||
"isNaN": math.IsNaN, | |||
"isInf": isInf, | |||
"isNil": isNilFunc(resultValue), | |||
"default": defaultFunc(resultValue), | |||
"isNil": isNil, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whats up with this change? (i looked, no commits I can refer to for rationale)? So I'm worried we might break something (because I can't comment on test confidence across all the metric providers - @zachaller let me know what your confidence level is there) - because for sure default
is used a lot. the original looks like some curry func / closure(?) to me whereas now it isn't. I feel like a change like this can actually pass simple tests and still mess up in real usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some code archaeology and cant find the rationale for doing it that way initially. Just usually when you close over a value there is a reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I was baffled by the way it was done because it made no actual sense to me. I had to change it however because this extra level of indirection made it impossible to actually pass a dynamic result vector to the defaultFunc
or isNil
during expression compilation.
So with the original function, for a result vector of length 2 (assume non nil for now) and a condition like default(result[0]) > 10 && default(result[1]) < 4
for example, defaultFunc
takes the result vector as an argument, and returns the function that just returns this result vector (i.e., it discards the first argument _ any
) either way. Compiling an expression like the above would result in default(result[0])
returning the result vector, so the comparison ends up being vector > integer
.
While the new implementation dynamically calls defaultFunc
during compilation, and it actually considers the first argument, which in our case is the element result[0]
/result[1]
passed to it.
Originally, it didn't cause a problem, since it is one value not a vector, and so doing it either way would return the same value.
func isNilFunc(resultValue any) func(any) bool { | ||
return func(_ any) bool { | ||
return isNil(resultValue) | ||
func defaultFunc(resultValue any, defaultValue any) any { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment above
@meeech Sure, I'll update the docs with clear examples, but for now a clearer description of an example use case for this is having a condition on two different (datadog) metrics simultaneously. I think the example in the enhancement proposal above should be sufficiently clear, but reframing it here Currently, you can have multiple queries (every query is one dd metric), combine those queries into a formula, then writing a condition over that formula. This new field makes it possible to have multiple queries (thus multiple dd metrics), and assert a combined condition over each individually, by making every one of the queries a formula, i.e. - metric: dummy
failureCondition: result[0] > 10 && result[1] < 5
provider:
datadog:
queries:
- a: "sum:something{}"
- b: "avg:another{}"
formulas:
- "a"
- "b"
(And obviously, the formulas themselves can be more complex) I think the example in the proposal is really a common one, which is, fail if the rate of success ( This is, like, the simplest example of the possible more complex conditions with multiple formulas. You can basically do any datadog queries you'd normally write in dd monitors here. |
…rovider Signed-off-by: Youssef Rabie <[email protected]>
fc758ff
to
0833186
Compare
Quality Gate passedIssues Measures |
@y-rabie sorry about that - i clicked the enchancement proposal link on the checklist - i missed the link in the body :/ I will have a look at it |
So you make some interesting points: (was gonna add this to the proposal, but maybe better to just keep the conversation here at this point).
If its not significant enough, then why are we trusting this number? This seems like a case where setting up an inconclusive result makes sense - for a human to come in and judge? Like, if I have a higher than allowed error rate, but not a ton of traffic, is this something I want to see blindly proceed? or maybe an analysis run step makes sense in that scenario that waits till we cross the collection threshhold? I guess I dont see the scenario where my error rate is higher than desired, but since I have low traffic, I ignore it. If anything it tells me my soak times arent long enough. But the more interesting point you mention is this:
I assume you would still write those monitors? (which leads to having to keep them in step with your analysis templates)/ Rather than adding this complexity around multiple formulas, would it be more interesting and useful to add support for this end point |
@meeech I agree, valid points regarding the example. I mean, maybe analysis steps are not always ideal when you need a continuously running analysis during the whole rollout. But I digress, I realize critiquing the example helps critiquing if there is ever a use case for this, but I just don't think it's far-fetched that I need to have this sort of flexible combined condition over two (or more) different metrics? My brain can't come up with another example now. But generally the most basic thing is that I should be able to have this sort of ANDing, instead of ORing, of failures, and this cannot be done at the level of the Analysis metrics (nor should it I think). I'll give it more thought and I'll add any other example I can come up with 😄 Really interesting thought about the monitors part. I'm afraid that semantically, it might be overloading the Analysis concept in Argo? It's metrics and measurements. Argo integrates with datadog as a metric provider, having something like monitors for datadog singularly breaks this concept. If added, should be there something to prevent alertmanger alerts from being integrated with Argo down the line? |
Just throwing this out there, I am not exactly sure what Datadog monitors are but based of what you two have said it sounds to me that they could/should be implemented as an analysis run plugin vs adding directly to the current dd provider. I would say the same thing about Alertmanager and the built in prometheus provider. |
Agree. In this scenario, I was envisioning them being used in tandem. Specifically around ensuring you have hit your required traffic count, while the error count analysis is running in the background.
This is a trap we all fall into. :D I think we can always rationalize why we might need something (like I don't debate the utility of ANDing and ORing). I guess I'm pushing back here only because this adds complexity to the dd metric provider (which then has to be maintained moving forward). And would add complexity to the analysis templates themselves. I don't think it overloads the analysis concept. For me the analysis concept is "take this outside signal, and use it to evaluate the health of my release". its totally reasonable to say I don't want anything releasing when Im in an emergency state. The only reason I suggest to put it in the DD metric provider is because its a good fit since it already has the access it needs. Otherwise, it could be implemented with the new analysis provider plugin. |
Checklist:
"fix(controller): Updates such and such. Fixes #1234"
.This regards the Enhancement Proposal I've opened. It was not discussed but I went ahead and did the changes.
I've written unit tests for it, and built and tested the behavior with multiple formulas, and with a single formula.
These changes are backward-compatible.
Outstanding changes that needs to addressed would be in the UI, to visualize the values of all elements of the
result
vector each in a separate chart, instead of just the first value.