PROF-10073: Read and propagate helm config for profiling #27185

szegedi · 2024-07-01T10:13:14Z

What does this PR do?

Allows Continuous Profiler to be enabled (or disabled) by Helm Charts config on the controller pod. The cluster-agent, running on the controller pod, will read an environment variable and use this to mutate the configuration of other pods to set the environment variables that will activate profiling within the tracer/profiler client libraries.

This PR follows the same approach as #23618 did for activation of ASM products.

Motivation

Make it easier for k8s clients to activate Continuous Profiling. Simplified installation is a common request.

Additional Notes

The PR is designed to establish the fundamentals that will make these other PRs work:

Add support for datadog.profiling helm-charts#1443
PROF-10073: Add suport for feature.profiling datadog-operator#1271
It is necessary to first have the functionality here in the agent before we can make the Helm and Operator changes available.

For Continuous Profiler, the env var DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_PROFILING_ENABLED will be set by the changes in those Datadog Operator and Helm Charts PRs. This results in DD_PROFILING_ENABLED being propagated to all pods (or to those matching the filters).
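
As a rough illustration of the propagation described above, here is a minimal Go sketch. It is not the agent's actual webhook code: `EnvVar`, `Container`, and `injectProfilingEnv` are hypothetical stand-ins for the Kubernetes types and the mutation logic, and the choices to skip containers that already define the variable and to inject nothing for an empty value are assumptions made for the example.

```go
package main

import "fmt"

// EnvVar and Container are simplified, hypothetical stand-ins for the
// Kubernetes pod spec types; they exist only for this sketch.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

// profilingEnvName is the variable the tracer/profiler libraries read.
const profilingEnvName = "DD_PROFILING_ENABLED"

// hasEnv reports whether the container environment already defines name.
func hasEnv(env []EnvVar, name string) bool {
	for _, e := range env {
		if e.Name == name {
			return true
		}
	}
	return false
}

// injectProfilingEnv adds DD_PROFILING_ENABLED=value to every container
// that does not already define it. An empty value means the controller-side
// setting was absent, so nothing is injected (an assumption of this sketch).
// It returns true if at least one container was changed.
func injectProfilingEnv(containers []Container, value string) bool {
	if value == "" {
		return false
	}
	mutated := false
	for i := range containers {
		if hasEnv(containers[i].Env, profilingEnvName) {
			continue // keep an explicit per-container setting
		}
		containers[i].Env = append(containers[i].Env, EnvVar{Name: profilingEnvName, Value: value})
		mutated = true
	}
	return mutated
}

func main() {
	pod := []Container{
		{Name: "app"},
		{Name: "sidecar", Env: []EnvVar{{Name: profilingEnvName, Value: "false"}}},
	}
	// "auto" stands in for the value read from the controller's environment.
	changed := injectProfilingEnv(pod, "auto")
	fmt.Println("mutated:", changed)
	for _, c := range pod {
		fmt.Println(c.Name, c.Env)
	}
}
```

Leaving an existing per-container value untouched means a workload that explicitly opts out keeps its setting; whether the real webhook behaves this way is not stated in this PR.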

Possible Drawbacks / Trade-offs

More complexity in our config handling, to make it easier for customers.

Describe how to test/QA your changes

Unit tests have been added and confirmed to pass with invoke test --targets=./pkg/clusteragent

bits-bot · 2024-07-01T10:13:18Z

All committers have signed the CLA.

pr-commenter · 2024-07-01T10:34:37Z

Regression Detector

Regression Detector Results

Run ID: 7584aa4d-b07d-41ef-80f6-853130a1ba4f Metrics dashboard Target profiles

Baseline: c7e9128
Comparison: 8855e1b

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	links
➖	tcp_syslog_to_blackhole	ingress throughput	+1.73	[-10.99, +14.45]	Logs
➖	file_tree	memory utilization	+0.72	[+0.63, +0.80]	Logs
➖	otel_to_otel_logs	ingress throughput	+0.54	[-0.27, +1.36]	Logs
➖	pycheck_1000_100byte_tags	% cpu utilization	+0.41	[-4.50, +5.31]	Logs
➖	idle	memory utilization	+0.07	[+0.02, +0.12]	Logs
➖	basic_py_check	% cpu utilization	+0.06	[-2.55, +2.67]	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.01, +0.01]	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.00, +0.00]	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	-0.51	[-1.42, +0.39]	Logs

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
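
The three criteria above can be expressed as a small predicate. The following Go sketch is only an illustration of the stated rule, not the Regression Detector's implementation: the `Experiment` type and `isRegression` function are hypothetical names, and the `Erratic` flag is assumed to be false for the table rows because the report does not show it.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds one row of the results table. The type and its field
// names are hypothetical; Erratic mirrors the "erratic" configuration flag.
type Experiment struct {
	Name         string
	DeltaMeanPct float64 // estimated delta mean %
	CILow        float64 // lower bound of the 90% confidence interval
	CIHigh       float64 // upper bound of the 90% confidence interval
	Erratic      bool
}

// effectSizeTolerance is the 5.00% threshold quoted in the report.
const effectSizeTolerance = 5.0

// isRegression applies the three criteria: the effect is large enough,
// the confidence interval excludes zero, and the experiment is not erratic.
func isRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}

func main() {
	rows := []Experiment{
		// Two rows copied from the table above; Erratic is assumed false.
		{Name: "tcp_syslog_to_blackhole", DeltaMeanPct: 1.73, CILow: -10.99, CIHigh: 14.45},
		{Name: "file_tree", DeltaMeanPct: 0.72, CILow: 0.63, CIHigh: 0.80},
		// Hypothetical row, not from this run, to show a positive case.
		{Name: "hypothetical_example", DeltaMeanPct: 6.2, CILow: 5.1, CIHigh: 7.3},
	}
	for _, r := range rows {
		fmt.Printf("%s: regression=%v\n", r.Name, isRegression(r))
	}
}
```

The file_tree row shows why both tests are needed: its interval excludes zero, but the estimated change is well under the 5.00% tolerance, so it is not flagged.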

pr-commenter · 2024-07-01T13:34:59Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=38725070 --os-family=ubuntu

Note: This applies to commit 8855e1b

maycmlee

Just a couple of small suggestions, but approving.

releasenotes-dca/notes/admission-controller-profiling-fd7678fa28f7b90e.yaml

eliottness

LGTM

ogaca-dd

LGTM for files owned by ASC

adel121

Approved for #container-platform owned files.

Some suggestions:

It would be nice to have some E2E tests to validate that injecting this env var doesn't cause conflicts or issues in the nominal case of a user app. We already have e2e tests for auto instrumentation and language detection here, so you can augment them to test your change too.
It would be good to remove the label team/container-platform from this PR because the QA should not be done by our team, but rather by the team that is directly impacted by this feature. Our team's scope is only to ensure that the admission controller injects what is expected, whereas QA here should ensure that the feature itself works correctly.

szegedi · 2024-07-15T15:24:32Z

Thanks for the suggestion @adel121! There's currently a draft PR for e2e onboarding system tests that exercises this functionality (with the drawback that it can only run tests against a released agent). I'm glad to see that there's an AWS/Pulumi framework locally in the agent as well. I have since learned from @robertomonteromiguel that the folks who created the agent's e2e tests also helped him create the onboarding system tests infrastructure. I'll figure out how to validate the feature in this e2e framework and follow up with a test.

szegedi · 2024-07-15T15:25:24Z

/merge

dd-devflow · 2024-07-15T15:25:31Z

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 32m.

Use /merge -c to cancel this operation!

PROF-10258

### What does this PR do?

Add profiling, enabled manually by setting `DD_PROFILING_ENABLED=true` at run time. The profiler has roughly the default configuration, with a few non-default profile types which should be low-enough overhead and sufficiently useful to enable for recent Go releases. There are not currently any other configuration knobs. If enabled this way, the profiles will be tagged with `orchestrion:true`.

This PR also accepts the value `DD_PROFILING_ENABLED=auto` to enable profiling. This can be provided via the [Datadog Admission Controller](https://docs.datadoghq.com/containers/cluster_agent/admission_controller) (see [this PR](DataDog/datadog-agent#27185)). Some languages/runtimes use heuristics to decide whether to enable profiling when `auto` is provided, to avoid profiling short-lived or non-instrumented applications. We assume the application is meant to be instrumented by virtue of the user building with Orchestrion. And the Go profiler won't send any data until at least one minute has passed. So, we treat `auto` the same as `true`.

TODO - document this once it's released

TODO - this has only been manually tested. That's probably okay for now; there's not much to this code. But we can investigate ways to automatically test it.

### Motivation

If somebody uses Orchestrion to add APM instrumentation, and they also want profiling, Orchestrion should be able to add it so the user doesn't have to separately modify their code to get profiling.

### Reviewer's Checklist

- [ ] Changed code has unit tests for its functionality.

---------

Signed-off-by: github-actions on behalf of RomainMuller <[email protected]>
Co-authored-by: Romain Marcadier <[email protected]>
Co-authored-by: github-actions on behalf of RomainMuller <[email protected]>
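
The handling of `auto` described in the commit message above could look roughly like the following Go sketch. It is not Orchestrion's actual code: `profilingEnabled` is a hypothetical helper, and the case-insensitive comparison and whitespace trimming are assumptions. The real implementation may accept other spellings of a true value.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// profilingEnabled reports whether a DD_PROFILING_ENABLED value should
// turn the profiler on. "auto" is treated the same as "true", as the
// commit message describes; every other value leaves profiling off.
// Lower-casing and trimming the input are assumptions of this sketch.
func profilingEnabled(value string) bool {
	switch strings.ToLower(strings.TrimSpace(value)) {
	case "true", "auto":
		return true
	default:
		return false
	}
}

func main() {
	if profilingEnabled(os.Getenv("DD_PROFILING_ENABLED")) {
		fmt.Println("profiling enabled")
	} else {
		fmt.Println("profiling disabled")
	}
}
```

Keeping the decision in one pure function makes it straightforward to cover with unit tests, which the TODO above notes are still missing.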

github-actions bot added the team/container-platform and team/apm-onboarding labels Jul 1, 2024

szegedi force-pushed the szegedi/profiling-ssi-kubernetes branch 2 times, most recently from 8899694 to b261685, July 1, 2024 13:03

szegedi force-pushed the szegedi/profiling-ssi-kubernetes branch from b261685 to ac20b1b, July 8, 2024 16:02

szegedi added the component/cluster-agent, component/config, need-change/helm-chart, and category/feature labels and removed the team/container-platform label Jul 8, 2024

Read and propagate helm config for profiling

5222933

szegedi force-pushed the szegedi/profiling-ssi-kubernetes branch from ac20b1b to 5222933, July 9, 2024 02:50

szegedi added the team/container-platform label Jul 9, 2024

szegedi marked this pull request as ready for review July 9, 2024 10:38

szegedi requested review from a team as code owners July 9, 2024 10:38

szegedi requested review from robertpi and eliottness July 9, 2024 10:38

maycmlee approved these changes Jul 9, 2024

releasenotes-dca/notes/admission-controller-profiling-fd7678fa28f7b90e.yaml (outdated, resolved)

Documentation suggestion

8855e1b

eliottness approved these changes Jul 10, 2024

ogaca-dd approved these changes Jul 10, 2024

adel121 approved these changes Jul 12, 2024

dd-mergequeue bot merged commit 7b152a6 into main Jul 15, 2024
224 checks passed

dd-mergequeue bot deleted the szegedi/profiling-ssi-kubernetes branch July 15, 2024 16:04

github-actions bot added this to the 7.57.0 milestone Jul 15, 2024

stanistan pushed a commit that referenced this pull request Jul 15, 2024

PROF-10073: Read and propagate helm config for profiling (#27185)

466dae2

szegedi mentioned this pull request Jul 16, 2024

Add support for datadog.profiling DataDog/helm-charts#1443

Draft

5 tasks

nsrip-dd mentioned this pull request Jul 30, 2024

feat: add continuous profiler instrumentation DataDog/orchestrion#178

Merged

1 task

szegedi mentioned this pull request Jul 31, 2024

PROF-10073: Add support for datadog.profiling DataDog/helm-charts#1471

Merged

5 tasks

clamoriniere removed the team/container-platform label Aug 9, 2024

szegedi mentioned this pull request Aug 16, 2024

PROF-10073: Add suport for feature.profiling DataDog/datadog-operator#1271

Open

2 tasks
