-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PROF-10073: Read and propagate helm config for profiling #27185
Conversation
Regression DetectorRegression Detector ResultsRun ID: 7584aa4d-b07d-41ef-80f6-853130a1ba4f Metrics dashboard Target profiles Baseline: c7e9128 Performance changes are noted in the perf column of each table:
No significant changes in experiment optimization goalsConfidence level: 90.00% There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
|
perf | experiment | goal | Δ mean % | Δ mean % CI | links |
---|---|---|---|---|---|
➖ | tcp_syslog_to_blackhole | ingress throughput | +1.73 | [-10.99, +14.45] | Logs |
➖ | file_tree | memory utilization | +0.72 | [+0.63, +0.80] | Logs |
➖ | otel_to_otel_logs | ingress throughput | +0.54 | [-0.27, +1.36] | Logs |
➖ | pycheck_1000_100byte_tags | % cpu utilization | +0.41 | [-4.50, +5.31] | Logs |
➖ | idle | memory utilization | +0.07 | [+0.02, +0.12] | Logs |
➖ | basic_py_check | % cpu utilization | +0.06 | [-2.55, +2.67] | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.00, +0.00] | Logs |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.51 | [-1.42, +0.39] | Logs |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
8899694
to
b261685
Compare
Test changes on VMUse this command from test-infra-definitions to manually test this PR changes on a VM: inv create-vm --pipeline-id=38725070 --os-family=ubuntu Note: This applies to commit 8855e1b |
b261685
to
ac20b1b
Compare
ac20b1b
to
5222933
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of small suggestions, but approving.
releasenotes-dca/notes/admission-controller-profiling-fd7678fa28f7b90e.yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for files owned by ASC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved for #container-platform owned files.
Some suggestions:
- It would be nice to have some E2E tests to validate that injecting this env var doesn't cause conflicts or issues in the nominal case of a user app. We already have e2e tests for auto instrumentation and language detection here, so you can augment them to test your change too.
- It would be good to remove the label
team/container-platform
from this PR because the QA should not be done by our team, but rather by the team that is directly impacted by this feature. Our team scope is only ensuring the admission controller injects what is expected, but QA here should ensure that feature works fine.
Thanks for the suggestion @adel121! There's currently an e2e onboarding system tests draft PR that exercises this functionality (with the drawback that it can only run tests for a released agent.) I'm glad to see that there's an AWS/Pulumi framework locally in the agent as well. I learned since from @robertomonteromiguel that folks that created agent's e2e have helped him create the onboarding system tests infrastructure too. I'll figure out how to validate the feature in this e2e framework and follow up with a test. |
/merge |
🚂 MergeQueue: pull request added to the queue The median merge time in Use |
PROF-10258 ### What does this PR do? Add profiling, enabled manually by setting `DD_PROFILING_ENABLED=true` at run time. The profiler has roughly the default configuration, with a few non-default profile types which should be low-enough overhead and sufficiently useful to enable for recent Go releases. There are not currently any other configuration knobs. If enabled this way, the profiles will be tagged with `orchestrion:true`. This PR also accepts the value `DD_PROFILING_ENABLED=auto` to enable profiling. This can be provided via the [Datadog Admission Controller](https://docs.datadoghq.com/containers/cluster_agent/admission_controller) (see [this PR](DataDog/datadog-agent#27185)). Some languages/runtimes use heuristics to decide whether to enable profiling when `auto` is provided, to avoid profiling short-lived or non-instrumented applications. We assume the application is meant to be instrumented by virtue of the user building with Orchestrion. And the Go profiler won't send any data until at least one minute has passed. So, we treat `auto` the same as `true`. TODO - document this once it's released TODO - this has only been manually tested. That's probably okay for now; there's not much to this code. But we can investigate ways to automatically test it. ### Motivation If somebody uses Orchestrion to add APM instrumentation, and they also want profiling, Orchestrion should be able to add it so the user doesn't have to separately modify their code to get profiling. ### Reviewer's Checklist <!-- * Authors can use this list as a reference to ensure that there are no problems during the review but the signing off is to be done by the reviewer(s). --> - [ ] Changed code has unit tests for its functionality. --------- Signed-off-by: github-actions on behalf of RomainMuller <[email protected]> Co-authored-by: Romain Marcadier <[email protected]> Co-authored-by: github-actions on behalf of RomainMuller <[email protected]> Co-authored-by: Romain Marcadier <[email protected]>
PROF-10258 ### What does this PR do? Add profiling, enabled manually by setting `DD_PROFILING_ENABLED=true` at run time. The profiler has roughly the default configuration, with a few non-default profile types which should be low-enough overhead and sufficiently useful to enable for recent Go releases. There are not currently any other configuration knobs. If enabled this way, the profiles will be tagged with `orchestrion:true`. This PR also accepts the value `DD_PROFILING_ENABLED=auto` to enable profiling. This can be provided via the [Datadog Admission Controller](https://docs.datadoghq.com/containers/cluster_agent/admission_controller) (see [this PR](DataDog/datadog-agent#27185)). Some languages/runtimes use heuristics to decide whether to enable profiling when `auto` is provided, to avoid profiling short-lived or non-instrumented applications. We assume the application is meant to be instrumented by virtue of the user building with Orchestrion. And the Go profiler won't send any data until at least one minute has passed. So, we treat `auto` the same as `true`. TODO - document this once it's released TODO - this has only been manually tested. That's probably okay for now; there's not much to this code. But we can investigate ways to automatically test it. ### Motivation If somebody uses Orchestrion to add APM instrumentation, and they also want profiling, Orchestrion should be able to add it so the user doesn't have to separately modify their code to get profiling. ### Reviewer's Checklist <!-- * Authors can use this list as a reference to ensure that there are no problems during the review but the signing off is to be done by the reviewer(s). --> - [ ] Changed code has unit tests for its functionality. --------- Signed-off-by: github-actions on behalf of RomainMuller <[email protected]> Co-authored-by: Romain Marcadier <[email protected]> Co-authored-by: github-actions on behalf of RomainMuller <[email protected]> Co-authored-by: Romain Marcadier <[email protected]>
PROF-10258 ### What does this PR do? Add profiling, enabled manually by setting `DD_PROFILING_ENABLED=true` at run time. The profiler has roughly the default configuration, with a few non-default profile types which should be low-enough overhead and sufficiently useful to enable for recent Go releases. There are not currently any other configuration knobs. If enabled this way, the profiles will be tagged with `orchestrion:true`. This PR also accepts the value `DD_PROFILING_ENABLED=auto` to enable profiling. This can be provided via the [Datadog Admission Controller](https://docs.datadoghq.com/containers/cluster_agent/admission_controller) (see [this PR](DataDog/datadog-agent#27185)). Some languages/runtimes use heuristics to decide whether to enable profiling when `auto` is provided, to avoid profiling short-lived or non-instrumented applications. We assume the application is meant to be instrumented by virtue of the user building with Orchestrion. And the Go profiler won't send any data until at least one minute has passed. So, we treat `auto` the same as `true`. TODO - document this once it's released TODO - this has only been manually tested. That's probably okay for now; there's not much to this code. But we can investigate ways to automatically test it. ### Motivation If somebody uses Orchestrion to add APM instrumentation, and they also want profiling, Orchestrion should be able to add it so the user doesn't have to separately modify their code to get profiling. ### Reviewer's Checklist <!-- * Authors can use this list as a reference to ensure that there are no problems during the review but the signing off is to be done by the reviewer(s). --> - [ ] Changed code has unit tests for its functionality. --------- Signed-off-by: github-actions on behalf of RomainMuller <[email protected]> Co-authored-by: Romain Marcadier <[email protected]> Co-authored-by: github-actions on behalf of RomainMuller <[email protected]> Co-authored-by: Romain Marcadier <[email protected]>
What does this PR do?
Allows Continuous Profiler to be enabled (or disabled) by Helm Charts config on the controller pod. The cluster-agent, running on the controller pod, will read an environment variable and use this to mutate the configuration of other pods to set the environment variables that will activate profiling within the tracer/profiler client libraries.
This PR follows the same approach as #23618 did for activation of ASM products.
Motivation
to make it easier for k8s clients to activate Continous Profiling. Simplified installation is a common request.
Additional Notes
The PR is designed to establish the fundamentals that will make these other PRs work:
datadog.profiling
helm-charts#1443feature.profiling
datadog-operator#1271It is necessary to first have the functionality here in the agent before we can make the Helm and Operator changes available.
Continuous Profiler will have the env var
DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_APPSEC_ENABLED
set by changes in those Datadog Operator and Helm Charts PRs. It will result inDD_PROFILING_ENABLED
being propagated to all pods (or those conforming to the filters).Possible Drawbacks / Trade-offs
Describe how to test/QA your changes
invoke test --targets=./pkg/clusteragent