Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PROF-10073: Read and propagate helm config for profiling #27185

Merged
merged 2 commits into from
Jul 15, 2024

Conversation

szegedi
Copy link
Contributor

@szegedi szegedi commented Jul 1, 2024

What does this PR do?

Allows Continuous Profiler to be enabled (or disabled) by Helm Charts config on the controller pod. The cluster-agent, running on the controller pod, will read an environment variable and use this to mutate the configuration of other pods to set the environment variables that will activate profiling within the tracer/profiler client libraries.

This PR follows the same approach as #23618 did for activation of ASM products.

Motivation

to make it easier for k8s clients to activate Continous Profiling. Simplified installation is a common request.

Additional Notes

The PR is designed to establish the fundamentals that will make these other PRs work:

Continuous Profiler will have the env var DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_APPSEC_ENABLED set by changes in those Datadog Operator and Helm Charts PRs. It will result in DD_PROFILING_ENABLED being propagated to all pods (or those conforming to the filters).

Possible Drawbacks / Trade-offs

  • More complexity in our config handling, to make it easier for customers.

Describe how to test/QA your changes

  • Unit tests have been added and ensured to pass with invoke test --targets=./pkg/clusteragent

@bits-bot
Copy link
Collaborator

bits-bot commented Jul 1, 2024

CLA assistant check
All committers have signed the CLA.

@pr-commenter
Copy link

pr-commenter bot commented Jul 1, 2024

Regression Detector

Regression Detector Results

Run ID: 7584aa4d-b07d-41ef-80f6-853130a1ba4f Metrics dashboard Target profiles

Baseline: c7e9128
Comparison: 8855e1b

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI links
tcp_syslog_to_blackhole ingress throughput +1.73 [-10.99, +14.45] Logs
file_tree memory utilization +0.72 [+0.63, +0.80] Logs
otel_to_otel_logs ingress throughput +0.54 [-0.27, +1.36] Logs
pycheck_1000_100byte_tags % cpu utilization +0.41 [-4.50, +5.31] Logs
idle memory utilization +0.07 [+0.02, +0.12] Logs
basic_py_check % cpu utilization +0.06 [-2.55, +2.67] Logs
tcp_dd_logs_filter_exclude ingress throughput -0.00 [-0.01, +0.01] Logs
uds_dogstatsd_to_api ingress throughput -0.00 [-0.00, +0.00] Logs
uds_dogstatsd_to_api_cpu % cpu utilization -0.51 [-1.42, +0.39] Logs

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

@szegedi szegedi force-pushed the szegedi/profiling-ssi-kubernetes branch 2 times, most recently from 8899694 to b261685 Compare July 1, 2024 13:03
@pr-commenter
Copy link

pr-commenter bot commented Jul 1, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=38725070 --os-family=ubuntu

Note: This applies to commit 8855e1b

@szegedi szegedi force-pushed the szegedi/profiling-ssi-kubernetes branch from b261685 to ac20b1b Compare July 8, 2024 16:02
@szegedi szegedi added component/cluster-agent component/config need-change/helm-chart Add this label if your change require also a change in the datadog helm chart category/feature and removed team/container-platform The Container Platform Team labels Jul 8, 2024
@szegedi szegedi force-pushed the szegedi/profiling-ssi-kubernetes branch from ac20b1b to 5222933 Compare July 9, 2024 02:50
@szegedi szegedi added the team/container-platform The Container Platform Team label Jul 9, 2024
@szegedi szegedi marked this pull request as ready for review July 9, 2024 10:38
@szegedi szegedi requested review from a team as code owners July 9, 2024 10:38
@szegedi szegedi requested review from robertpi and eliottness July 9, 2024 10:38
Copy link
Contributor

@maycmlee maycmlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of small suggestions, but approving.

Copy link
Contributor

@eliottness eliottness left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@ogaca-dd ogaca-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for files owned by ASC

Copy link
Contributor

@adel121 adel121 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved for #container-platform owned files.

Some suggestions:

  • It would be nice to have some E2E tests to validate that injecting this env var doesn't cause conflicts or issues in the nominal case of a user app. We already have e2e tests for auto instrumentation and language detection here, so you can augment them to test your change too.
  • It would be good to remove the label team/container-platform from this PR because the QA should not be done by our team, but rather by the team that is directly impacted by this feature. Our team scope is only ensuring the admission controller injects what is expected, but QA here should ensure that feature works fine.

@szegedi
Copy link
Contributor Author

szegedi commented Jul 15, 2024

Thanks for the suggestion @adel121! There's currently an e2e onboarding system tests draft PR that exercises this functionality (with the drawback that it can only run tests for a released agent.) I'm glad to see that there's an AWS/Pulumi framework locally in the agent as well. I learned since from @robertomonteromiguel that folks that created agent's e2e have helped him create the onboarding system tests infrastructure too. I'll figure out how to validate the feature in this e2e framework and follow up with a test.

@szegedi
Copy link
Contributor Author

szegedi commented Jul 15, 2024

/merge

@dd-devflow
Copy link

dd-devflow bot commented Jul 15, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 32m.

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit 7b152a6 into main Jul 15, 2024
224 checks passed
@dd-mergequeue dd-mergequeue bot deleted the szegedi/profiling-ssi-kubernetes branch July 15, 2024 16:04
@github-actions github-actions bot added this to the 7.57.0 milestone Jul 15, 2024
github-merge-queue bot pushed a commit to DataDog/orchestrion that referenced this pull request Aug 6, 2024
PROF-10258

### What does this PR do?

Add profiling, enabled manually by setting `DD_PROFILING_ENABLED=true`
at run time.
The profiler has roughly the default configuration, with a few
non-default profile types which
should be low-enough overhead and sufficiently useful to enable for
recent Go releases.
There are not currently any other configuration knobs.

If enabled this way, the profiles will be tagged with
`orchestrion:true`.

This PR also accepts the value `DD_PROFILING_ENABLED=auto` to enable
profiling.
This can be provided via the [Datadog Admission
Controller](https://docs.datadoghq.com/containers/cluster_agent/admission_controller)
(see [this PR](DataDog/datadog-agent#27185)).
Some languages/runtimes use heuristics to decide whether to enable
profiling when
`auto` is provided, to avoid profiling short-lived or non-instrumented
applications.
We assume the application is meant to be instrumented by virtue of the
user building
with Orchestrion. And the Go profiler won't send any data until at least
one minute has
passed. So, we treat `auto` the same as `true`.

TODO - document this once it's released

TODO - this has only been manually tested. That's probably okay for now;
there's not
much to this code. But we can investigate ways to automatically test it.

### Motivation

If somebody uses Orchestrion to add APM instrumentation, and they also
want
profiling, Orchestrion should be able to add it so the user doesn't have
to
separately modify their code to get profiling.

### Reviewer's Checklist
<!--
* Authors can use this list as a reference to ensure that there are no
problems
during the review but the signing off is to be done by the reviewer(s).
-->

- [ ] Changed code has unit tests for its functionality.

---------

Signed-off-by: github-actions on behalf of RomainMuller <[email protected]>
Co-authored-by: Romain Marcadier <[email protected]>
Co-authored-by: github-actions on behalf of RomainMuller <[email protected]>
Co-authored-by: Romain Marcadier <[email protected]>
github-merge-queue bot pushed a commit to DataDog/orchestrion that referenced this pull request Aug 7, 2024
PROF-10258

### What does this PR do?

Add profiling, enabled manually by setting `DD_PROFILING_ENABLED=true`
at run time.
The profiler has roughly the default configuration, with a few
non-default profile types which
should be low-enough overhead and sufficiently useful to enable for
recent Go releases.
There are not currently any other configuration knobs.

If enabled this way, the profiles will be tagged with
`orchestrion:true`.

This PR also accepts the value `DD_PROFILING_ENABLED=auto` to enable
profiling.
This can be provided via the [Datadog Admission
Controller](https://docs.datadoghq.com/containers/cluster_agent/admission_controller)
(see [this PR](DataDog/datadog-agent#27185)).
Some languages/runtimes use heuristics to decide whether to enable
profiling when
`auto` is provided, to avoid profiling short-lived or non-instrumented
applications.
We assume the application is meant to be instrumented by virtue of the
user building
with Orchestrion. And the Go profiler won't send any data until at least
one minute has
passed. So, we treat `auto` the same as `true`.

TODO - document this once it's released

TODO - this has only been manually tested. That's probably okay for now;
there's not
much to this code. But we can investigate ways to automatically test it.

### Motivation

If somebody uses Orchestrion to add APM instrumentation, and they also
want
profiling, Orchestrion should be able to add it so the user doesn't have
to
separately modify their code to get profiling.

### Reviewer's Checklist
<!--
* Authors can use this list as a reference to ensure that there are no
problems
during the review but the signing off is to be done by the reviewer(s).
-->

- [ ] Changed code has unit tests for its functionality.

---------

Signed-off-by: github-actions on behalf of RomainMuller <[email protected]>
Co-authored-by: Romain Marcadier <[email protected]>
Co-authored-by: github-actions on behalf of RomainMuller <[email protected]>
Co-authored-by: Romain Marcadier <[email protected]>
github-merge-queue bot pushed a commit to DataDog/orchestrion that referenced this pull request Aug 7, 2024
PROF-10258

### What does this PR do?

Add profiling, enabled manually by setting `DD_PROFILING_ENABLED=true`
at run time.
The profiler has roughly the default configuration, with a few
non-default profile types which
should be low-enough overhead and sufficiently useful to enable for
recent Go releases.
There are not currently any other configuration knobs.

If enabled this way, the profiles will be tagged with
`orchestrion:true`.

This PR also accepts the value `DD_PROFILING_ENABLED=auto` to enable
profiling.
This can be provided via the [Datadog Admission
Controller](https://docs.datadoghq.com/containers/cluster_agent/admission_controller)
(see [this PR](DataDog/datadog-agent#27185)).
Some languages/runtimes use heuristics to decide whether to enable
profiling when
`auto` is provided, to avoid profiling short-lived or non-instrumented
applications.
We assume the application is meant to be instrumented by virtue of the
user building
with Orchestrion. And the Go profiler won't send any data until at least
one minute has
passed. So, we treat `auto` the same as `true`.

TODO - document this once it's released

TODO - this has only been manually tested. That's probably okay for now;
there's not
much to this code. But we can investigate ways to automatically test it.

### Motivation

If somebody uses Orchestrion to add APM instrumentation, and they also
want
profiling, Orchestrion should be able to add it so the user doesn't have
to
separately modify their code to get profiling.

### Reviewer's Checklist
<!--
* Authors can use this list as a reference to ensure that there are no
problems
during the review but the signing off is to be done by the reviewer(s).
-->

- [ ] Changed code has unit tests for its functionality.

---------

Signed-off-by: github-actions on behalf of RomainMuller <[email protected]>
Co-authored-by: Romain Marcadier <[email protected]>
Co-authored-by: github-actions on behalf of RomainMuller <[email protected]>
Co-authored-by: Romain Marcadier <[email protected]>
@clamoriniere clamoriniere removed the team/container-platform The Container Platform Team label Aug 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category/feature component/cluster-agent component/config need-change/helm-chart Add this label if your change require also a change in the datadog helm chart team/apm-onboarding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants