ddtrace/tracer: runtime metrics v2 (exclude from release notes) #2772

Merged

Conversation

felixge
Member

@felixge felixge commented Jul 5, 2024

What does this PR do?

⚠️ IMPORTANT ⚠️: This is not intended to be used by customers yet. DO NOT ENABLE this yet. We are still evaluating this feature internally and may decide to remove it again.

Implement the DD_RUNTIME_METRICS_V2_ENABLED environment variable, which switches runtime metrics collection to Go's runtime/metrics package. This gives us access to metrics that are not available through runtime.ReadMemStats, e.g. scheduler latency.
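
As an illustration of the kind of metric mentioned above, here is a minimal sketch that reads the scheduler latency histogram directly from the standard library's runtime/metrics package. It is not the code added by this PR; the helper name readSchedLatency is hypothetical, and the metric name /sched/latencies:seconds is the one documented by the Go runtime.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readSchedLatency is a hypothetical helper (not part of this PR). It samples
// the runtime's scheduler latency histogram and reports whether the running
// Go version supports the metric.
func readSchedLatency() (*metrics.Float64Histogram, bool) {
	samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(samples)
	// Unsupported metrics come back with KindBad, so check the kind first.
	if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
		return nil, false
	}
	return samples[0].Value.Float64Histogram(), true
}

func main() {
	h, ok := readSchedLatency()
	if !ok {
		fmt.Println("scheduler latency metric is not supported by this Go version")
		return
	}
	// Sum the per-bucket counts to get the number of recorded observations.
	var total uint64
	for _, c := range h.Counts {
		total += c
	}
	fmt.Printf("%d buckets, %d observations\n", len(h.Counts), total)
}
```

runtime.ReadMemStats has no equivalent field for this histogram, which is the gap the description refers to.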

Motivation

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.

Unsure? Have a question? Request a review!

@felixge felixge changed the title from "ddtrace/tracer: runtime metrics v2" to "ddtrace/tracer: runtime metrics v2 (exclude from release notes)" Jul 17, 2024
Contributor

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale Stuck for more than 1 month label Aug 23, 2024
@felixge felixge removed the stale Stuck for more than 1 month label Aug 28, 2024
@pr-commenter

pr-commenter bot commented Aug 28, 2024

Benchmarks

Benchmark execution time: 2024-11-06 17:40:25

Comparing candidate commit b1a595b in PR branch felix.geisendoerfer/PROF-8665-experimental-runtime-v2-metrics with baseline commit c9fc691 in branch main.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 58 metrics, 0 unstable metrics.

scenario:BenchmarkTracerAddSpans-24

  • 🟩 execution_time [-167.076ns; -87.124ns] or [-4.203%; -2.192%]

Contributor

@anatolebeuzon anatolebeuzon left a comment

lgtm! (just need to fix the failing CI tests)

Contributor

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale Stuck for more than 1 month label Sep 18, 2024
@darccio
Member

darccio commented Oct 1, 2024

@felixge When do you plan to open this PR for review/merge?

@darccio darccio added do-not-merge/WIP and removed stale Stuck for more than 1 month labels Oct 1, 2024
@felixge
Member Author

felixge commented Oct 13, 2024

I'm planning to get this merged, probably about 1.5 weeks from now (I have time blocked to work on this project every 2 weeks).

@DataDog DataDog deleted a comment from github-actions bot Nov 6, 2024
@DataDog DataDog deleted a comment from github-actions bot Nov 6, 2024
this toil shouldn't be needed, will try to refactor this later
@felixge felixge marked this pull request as ready for review November 6, 2024 15:12
@felixge felixge requested review from a team as code owners November 6, 2024 15:12
Contributor

@mtoffl01 mtoffl01 left a comment

Reviewed this just because I'm interested! I don't have context on this change, so I left a few questions in the comments.

Also: Curious why we are calling this "runtime metrics v2" if the feature collects perf metrics but about a different process than v1 -- runtime metrics "v1" collects runtime metrics about the process the tracer is running in, whereas "v2" collects runtime metrics about dd-trace-go (and the agent?), as I understand it. Based on the name, I would've expected runtime metrics v2 to do the same thing as v1 but maybe with improved accuracy/more data.

// Enable runtime metrics v2 by default
if v := os.Getenv("DD_RUNTIME_METRICS_V2_ENABLED"); v == "" {
	os.Setenv("DD_RUNTIME_METRICS_V2_ENABLED", "true")
}
Contributor

Why do this rather than change line 387 in ddtrace/tracer/option.go to c.runtimeMetricsV2 = internal.BoolVal("DD_RUNTIME_METRICS_V2_ENABLED", true)?

Member Author

This feature is not meant to be enabled for customers yet. We want to add it to dd-trace-go behind an env var so that we can experiment with it in the backend go repos at Datadog. If that goes well, we'll document and advertise this feature and probably turn it on by default.
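
To make the opt-in behavior concrete, here is a minimal sketch of an environment-variable lookup that defaults to false. The helpers parseBoolOr and boolEnv are hypothetical stand-ins for dd-trace-go's internal.BoolEnv; the assumption that unset or unparsable values fall back to the default is mine and is not confirmed by this thread.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseBoolOr returns def when raw is empty or is not a valid boolean
// according to strconv.ParseBool ("1", "t", "true", "0", "f", "false", ...).
func parseBoolOr(raw string, def bool) bool {
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return def
	}
	return v
}

// boolEnv is a hypothetical stand-in for internal.BoolEnv: it reads key from
// the environment and falls back to def (assumed behavior).
func boolEnv(key string, def bool) bool {
	return parseBoolOr(os.Getenv(key), def)
}

func main() {
	// With a false default the feature stays off unless it is explicitly enabled.
	enabled := boolEnv("DD_RUNTIME_METRICS_V2_ENABLED", false)
	fmt.Println("runtime metrics v2 enabled:", enabled)
}
```

Under this assumption, a test that sets the variable only when it is empty exercises the feature by default without changing what customers get.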

@@ -329,6 +331,14 @@ func newTracer(opts ...StartOption) *tracer {
			t.reportRuntimeMetrics(defaultMetricsReportInterval)
		}()
	}
	if c.runtimeMetricsV2 {
		l := slog.New(slogHandler{})
Contributor

Why use slog instead of tracer logger? Do these logs get routed somewhere else that customers don't see / only we see somewhere?

Member Author

@felixge felixge Nov 6, 2024

> Why use slog instead of tracer logger?

We want to share the new runtime metrics implementation between dd-trace-go and the Datadog Agent in the future. The latter doesn't use dd-trace-go to instrument itself, which is why we have put the code in https://github.com/DataDog/go-runtime-metrics-internal. For that repo we decided to use slog as the logging interface as it's the new standard Go logger. We should consider adopting it in dd-trace-go itself in the future as well, but for now we decided to integrate with our internal logger via an adapter. Let me know if that makes sense.

> Do these logs get routed somewhere else that customers don't see / only we see somewhere?

No, the slog logs just get forwarded to dd-trace-go's internal logger and end up in the customer stdout as usual.
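
The adapter approach described in this reply can be sketched as follows. This is an illustration only, not the slogHandler from this PR: legacyLogger, adapter and newLogger are hypothetical names, and the output format is an arbitrary choice. Only the slog.Handler interface itself comes from the standard library.

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// legacyLogger is a hypothetical stand-in for an existing internal logger.
type legacyLogger struct{ lines []string }

// Log records the line and prints it to stdout.
func (l *legacyLogger) Log(msg string) {
	l.lines = append(l.lines, msg)
	fmt.Println(msg)
}

// adapter implements slog.Handler by forwarding records to a legacyLogger.
type adapter struct {
	dst   *legacyLogger
	attrs []slog.Attr
}

// Enabled accepts every level; filtering is left to the destination logger.
func (a adapter) Enabled(context.Context, slog.Level) bool { return true }

// Handle flattens level, message and attributes into a single line.
func (a adapter) Handle(_ context.Context, r slog.Record) error {
	parts := []string{r.Level.String(), r.Message}
	for _, at := range a.attrs {
		parts = append(parts, at.String())
	}
	r.Attrs(func(at slog.Attr) bool {
		parts = append(parts, at.String())
		return true
	})
	a.dst.Log(strings.Join(parts, " "))
	return nil
}

// WithAttrs returns a copy of the handler carrying the extra attributes.
func (a adapter) WithAttrs(attrs []slog.Attr) slog.Handler {
	merged := append(append([]slog.Attr{}, a.attrs...), attrs...)
	return adapter{dst: a.dst, attrs: merged}
}

// WithGroup ignores groups to keep this sketch short.
func (a adapter) WithGroup(string) slog.Handler { return a }

// newLogger wires a slog.Logger to the legacy logger through the adapter.
func newLogger(dst *legacyLogger) *slog.Logger {
	return slog.New(adapter{dst: dst})
}

func main() {
	l := newLogger(&legacyLogger{})
	l.Warn("metric not supported", "name", "/sched/latencies:seconds")
}
```

A shared library that accepts a *slog.Logger stays independent of the host's logger, while the host decides where the lines end up.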

@felixge
Member Author

felixge commented Nov 6, 2024

> Reviewed this just because I'm interested! I don't have context on this change, so I left a few questions in the comments.

Sorry, I was going to write a PR description, but forgot that I hadn't done it when pressing "ready for review" 🙈. We have a little squad working on this (Anatole, Nayef and I), so I was assuming only they would end up reviewing.

> Also: Curious why we are calling this "runtime metrics v2" if the feature collects perf metrics but about a different process than v1 -- runtime metrics "v1" collects runtime metrics about the process the tracer is running in, whereas "v2" collects runtime metrics about dd-trace-go (and the agent?), as I understand it. Based on the name, I would've expected runtime metrics v2 to do the same thing as v1 but maybe with improved accuracy/more data.

Runtime metrics v2 serves the same purpose as v1: both report metrics about the process the tracer is running in. The only difference is that v2 uses Go's "new" runtime/metrics package rather than the old runtime.ReadMemStats interface. This gives us access to some new metrics that were previously unavailable.
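
To illustrate that both interfaces describe the same process, here is a minimal sketch that reads heap object bytes through each of them. The function names heapBytesV1 and heapBytesV2 are hypothetical, and the pairing of MemStats.HeapAlloc with /memory/classes/heap/objects:bytes is my reading of the Go documentation rather than something stated in this PR. The two reads happen at different instants, so the values will usually differ slightly.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

// heapBytesV1 reads heap object bytes through the older runtime.ReadMemStats API.
func heapBytesV1() uint64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.HeapAlloc
}

// heapBytesV2 reads the corresponding value through runtime/metrics and
// reports whether the metric is supported by the running Go version.
func heapBytesV2() (uint64, bool) {
	s := []metrics.Sample{{Name: "/memory/classes/heap/objects:bytes"}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return s[0].Value.Uint64(), true
}

func main() {
	v1 := heapBytesV1()
	v2, ok := heapBytesV2()
	fmt.Println("ReadMemStats HeapAlloc:", v1)
	if ok {
		fmt.Println("runtime/metrics heap objects:", v2)
	}
}
```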

@@ -381,6 +384,7 @@ func newConfig(opts ...StartOption) *config {
	}
	c.logStartup = internal.BoolEnv("DD_TRACE_STARTUP_LOGS", true)
	c.runtimeMetrics = internal.BoolVal(getDDorOtelConfig("metrics"), false)
	c.runtimeMetricsV2 = internal.BoolEnv("DD_RUNTIME_METRICS_V2_ENABLED", false)
Member Author

NIT from @Gandem: We should consider adding a test for this.

Member Author

Resolution: Will probably add one in a follow-up PR.

@felixge felixge enabled auto-merge (squash) November 7, 2024 06:24
@felixge felixge merged commit bebced4 into main Nov 7, 2024
171 checks passed
@felixge felixge deleted the felix.geisendoerfer/PROF-8665-experimental-runtime-v2-metrics branch November 7, 2024 06:34
6 participants