Add support for GPU monitoring #1601

Open: gjulianm wants to merge 4 commits into main

gjulianm (Collaborator) commented Jan 3, 2025

What does this PR do?

This PR adds operator support for the GPU monitoring feature. It makes the changes to the agent deployment that the feature requires, so that customers do not have to configure them manually.

Motivation

Simplify deployment of GPU monitoring.

Additional Notes

This is an initial implementation of the feature. It does not support mixed clusters (those where not all nodes have GPUs).

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: v7.60.x for the GPU monitoring feature.
  • Cluster Agent: N/A

Describe your test plan

  1. Deploy the operator in a cluster.
  2. Deploy the agent resource with feature.gpu.enabled: yes.
  3. Check that the deployed agent pod has runtimeClassName: nvidia by running kubectl get pod datadog-agent-XXX -o json | jq ".spec.runtimeClassName".
  4. Ensure that DD_GPU_MONITORING_ENABLED is set to true in both the agent and system-probe containers.
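
Steps 3 and 4 can also be checked programmatically. The following is only a minimal sketch, not code from this PR: `checkGPUPod`, the `pod` struct, and `samplePod` are hypothetical names, and it assumes that the containers are literally named `agent` and `system-probe` and that the variable is set as a plain `value` rather than through `valueFrom`. It parses pod JSON of the shape returned by `kubectl get pod ... -o json` and reports the first expectation that is not met.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gpuEnvVar is the environment variable named in step 4 of the test plan.
const gpuEnvVar = "DD_GPU_MONITORING_ENABLED"

// pod mirrors only the Pod fields that the test plan inspects.
type pod struct {
	Spec struct {
		RuntimeClassName string `json:"runtimeClassName"`
		Containers       []struct {
			Name string `json:"name"`
			Env  []struct {
				Name  string `json:"name"`
				Value string `json:"value"`
			} `json:"env"`
		} `json:"containers"`
	} `json:"spec"`
}

// checkGPUPod is a hypothetical helper: it returns nil when the pod uses the
// nvidia runtime class and both assumed containers set gpuEnvVar to "true".
func checkGPUPod(podJSON []byte) error {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return fmt.Errorf("parsing pod JSON: %w", err)
	}
	if p.Spec.RuntimeClassName != "nvidia" {
		return fmt.Errorf("runtimeClassName is %q, want %q", p.Spec.RuntimeClassName, "nvidia")
	}

	// Record every container that sets the variable to the literal "true".
	enabled := map[string]bool{}
	for _, c := range p.Spec.Containers {
		for _, e := range c.Env {
			if e.Name == gpuEnvVar && e.Value == "true" {
				enabled[c.Name] = true
			}
		}
	}

	// Container names are an assumption taken from the wording of step 4.
	for _, name := range []string{"agent", "system-probe"} {
		if !enabled[name] {
			return fmt.Errorf("container %q does not set %s=true", name, gpuEnvVar)
		}
	}
	return nil
}

// samplePod is made-up input for illustration, not output from a real cluster.
const samplePod = `{
  "spec": {
    "runtimeClassName": "nvidia",
    "containers": [
      {"name": "agent", "env": [{"name": "DD_GPU_MONITORING_ENABLED", "value": "true"}]},
      {"name": "system-probe", "env": [{"name": "DD_GPU_MONITORING_ENABLED", "value": "true"}]}
    ]
  }
}`

func main() {
	if err := checkGPUPod([]byte(samplePod)); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("pod has the expected GPU monitoring configuration")
}
```

To run the check against a real deployment, `samplePod` could be replaced with the JSON printed by the kubectl command in step 3.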

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label

gjulianm self-assigned this Jan 3, 2025
gjulianm added the enhancement label Jan 3, 2025
gjulianm added this to the v1.12.0 milestone Jan 3, 2025

codecov-commenter commented Jan 3, 2025

Codecov Report

Attention: Patch coverage is 84.00000% with 20 lines in your changes missing coverage. Please review.

Project coverage is 48.96%. Comparing base (db00883) to head (da4ab24).
Report is 2 commits behind head on main.

Files with missing lines                                 Patch %   Lines
...nal/controller/datadogagent/feature/gpu/feature.go    90.09%    9 Missing and 1 partial ⚠️
internal/controller/testutils/agent.go                   0.00%     10 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1601      +/-   ##
==========================================
+ Coverage   48.94%   48.96%   +0.01%     
==========================================
  Files         227      236       +9     
  Lines       20636    21315     +679     
==========================================
+ Hits        10101    10437     +336     
- Misses      10010    10337     +327     
- Partials      525      541      +16     

Flag        Coverage Δ
unittests   48.96% <84.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines                                 Coverage Δ
api/datadoghq/v2alpha1/datadogagent_types.go             100.00% <ø> (ø)
internal/controller/datadogagent/controller.go           51.85% <ø> (ø)
...ller/datadogagent/defaults/datadogagent_default.go    91.24% <100.00%> (+0.14%) ⬆️
pkg/testutils/builder.go                                 91.62% <100.00%> (+0.10%) ⬆️
...nal/controller/datadogagent/feature/gpu/feature.go    90.09% <90.09%> (ø)
internal/controller/testutils/agent.go                   0.00% <0.00%> (ø)

... and 11 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

gjulianm force-pushed the guillermo.julian/gpu-monitoring branch from 6777208 to ce25955 on January 7, 2025 13:40
gjulianm force-pushed the guillermo.julian/gpu-monitoring branch 4 times, most recently from ce25955 to 60173ad on January 8, 2025 09:55
gjulianm force-pushed the guillermo.julian/gpu-monitoring branch from 60173ad to dd0dd9c on January 8, 2025 10:03
gjulianm marked this pull request as ready for review January 8, 2025 11:06
gjulianm requested review from a team as code owners January 8, 2025 11:06
tbavelier modified the milestones: v1.12.0, v1.13.0 Jan 8, 2025

buraizu (Contributor) left a comment

Approving with a minor edit requested

docs/configuration.v2alpha1.md (outdated, resolved review thread)

Labels: enhancement
Projects: None yet
4 participants