[NTP] generate ntp.offset metric based on intake's HTTP response
#45664
Conversation
Force-pushed from 14b3565 to a55b5cc
Static quality checks ✅
Please find below the results from static quality gates.
22 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Force-pushed from a55b5cc to af12394
Regression Detector
Regression Detector Results — Metrics dashboard
Baseline: c8b028a
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -1.78 | [-4.86, +1.30] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +1.97 | [+1.88, +2.07] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.63 | [+0.55, +0.70] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.63 | [+0.55, +0.70] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.31 | [+0.12, +0.50] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.20 | [-0.03, +0.43] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | +0.16 | [+0.05, +0.26] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.12 | [+0.07, +0.17] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.01 | [-0.13, +0.15] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.12, +0.14] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.01 | [-0.04, +0.05] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.10, +0.09] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.01 | [-0.43, +0.41] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.01 | [-0.06, +0.03] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.04 | [-0.27, +0.19] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.08 | [-0.61, +0.44] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.10 | [-0.49, +0.28] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.16 | [-0.37, +0.05] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.19 | [-0.35, -0.03] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.20 | [-0.24, -0.16] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.42 | [-0.46, -0.38] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | -0.59 | [-0.75, -0.43] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -0.63 | [-2.12, +0.86] | 1 | Logs bounds checks dashboard |
| ➖ | docker_containers_cpu | % cpu utilization | -1.78 | [-4.86, +1.30] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true (see the sketch after this list):
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
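For concreteness, the decision rule reads as a small predicate. This is an illustrative sketch, not the Regression Detector's actual code; the function name and signature are assumptions:

```go
package main

import "fmt"

// isRegression reports whether an experiment result is worth investigating,
// per the three criteria above: effect size of at least 5%, a 90% confidence
// interval that excludes zero, and a configuration not marked "erratic".
func isRegression(deltaMeanPct, ciLow, ciHigh float64, erratic bool) bool {
	bigEnough := deltaMeanPct >= 5.0 || deltaMeanPct <= -5.0 // |Δ mean %| ≥ 5.00%
	ciExcludesZero := ciLow > 0 || ciHigh < 0                // CI does not contain zero
	return bigEnough && ciExcludesZero && !erratic
}

func main() {
	// docker_containers_cpu: Δ mean % = -1.78, CI = [-4.86, +1.30] → not a regression.
	fmt.Println(isRegression(-1.78, -4.86, 1.30, false)) // false
}
```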
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
Introduces an ntp.offset metric with a source:intake tag to monitor clock drift using Datadog intake server timestamps from HTTP responses. This provides clock monitoring even when NTP is blocked by firewalls.

Changes:
- Capture the Date header from intake HTTP responses in the forwarder
- Store the intake offset in an expvar for global access
- Submit the ntp.offset metric with the source:intake tag (independent of the NTP check)
- Display both NTP and intake offsets in the agent status Clocks section
- Update tests to handle the new metric submission

The metric uses intake server time for accurate drift detection and is submitted even when NTP queries fail (see the capture sketch below).
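A minimal sketch of the capture path, assuming the forwarder sees each intake response; the function name is illustrative and not necessarily the exact identifier in the PR, while the `intakeOffset` expvar name comes from the notes below:

```go
package forwarder

import (
	"expvar"
	"math"
	"net/http"
	"time"
)

// intakeOffset holds the latest clock offset in seconds derived from an
// intake response; NaN means no offset has been captured yet.
var intakeOffset = expvar.NewFloat("intakeOffset")

func init() { intakeOffset.Set(math.NaN()) }

// recordIntakeOffset parses the Date header of a successful intake response
// and stores serverTime.Sub(agentTime).Seconds() in the expvar.
func recordIntakeOffset(resp *http.Response) {
	serverTime, err := http.ParseTime(resp.Header.Get("Date"))
	if err != nil {
		return // header absent or malformed; keep the previous value
	}
	agentTime := time.Now()
	intakeOffset.Set(serverTime.Sub(agentTime).Seconds())
}
```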
Force-pushed from af12394 to 9a606e7
ntp.offset metric based on intake's HTTP response
fabbing left a comment
I think @vickenty's suggestions should be applied; besides that, LGTM!
What does this PR do?
Introduces another `ntp.offset` metric with a `source:intake` tag. This metric is derived from the `Date` HTTP response header from intake communications and provides an alternative to NTP-based clock monitoring.

The original `ntp.offset` metric, calculated from an actual NTP server, will now be tagged `source:ntp`.

Key changes:
- Capture the `Date` header from successful HTTP responses in the forwarder
- Store the intake offset in an `expvar` for global access
- Submit the `ntp.offset` metric with a `source:intake` tag from the NTP check, independent of NTP query success (sketched below)
- Display both NTP and Intake offsets in the `agent status` Clocks section, if available
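On the check side, a sketch of how `ntp.go` might read the expvar and submit the tagged metric; the `Sender` interface here is a reduced stand-in for the agent's real metric sender, and the helper names are assumptions:

```go
package ntp

import (
	"expvar"
	"math"
)

// Sender is a reduced stand-in for the agent's metric sender; only the
// method used in this sketch is declared.
type Sender interface {
	Gauge(metric string, value float64, hostname string, tags []string)
}

// intakeClockOffset reads the expvar published by the forwarder, returning
// NaN when it has not been created or set yet.
func intakeClockOffset() float64 {
	v, ok := expvar.Get("intakeOffset").(*expvar.Float)
	if !ok {
		return math.NaN()
	}
	return v.Value()
}

// submitIntakeOffset reports ntp.offset with the source:intake tag,
// independently of whether the NTP query itself succeeded.
func submitIntakeOffset(s Sender) {
	offset := intakeClockOffset()
	if math.IsNaN(offset) {
		return // no intake response observed yet; submit nothing
	}
	s.Gauge("ntp.offset", offset, "", []string{"source:intake"})
}
```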
Motivation

It's common for NTP checks to fail due to firewall restrictions while remaining enabled in agent configurations. When the system clock drifts significantly, the Datadog intake begins dropping metrics, leading to missing data despite the agent running normally. This often goes unnoticed until data gaps are discovered.
This PR addresses the issue by deriving the clock offset from the intake's `Date` response header, which the agent receives whenever it can send data.
Describe how you validated your changes
- Verified the `source:intake` tagged metric is submitted

Additional Notes
- The `ntp.offset` metric now has two variants distinguished by the `source` tag: `source:ntp` and `source:intake`
- The intake offset is computed as `serverTime.Sub(agentTime).Seconds()` from the HTTP `Date` header
- The metric is submitted with a corrected timestamp (`agent_time + offset`), ensuring the metric appears at the correct time from Datadog's perspective even when there is clock drift (see the sketch after these notes)
- If the `intakeOffset` expvar is not set (NaN), the metric is not submitted
- `ntp.offset` queries without tag filters will include both sources
- The `intakeOffset` expvar is only created once in `transaction.go`; `ntp.go` reads it using `expvar.Get()`
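The timestamp correction mentioned above amounts to shifting the submission time by the measured offset. A minimal sketch, with the helper name being an assumption rather than an identifier from the PR:

```go
package ntp

import "time"

// correctedTimestamp implements the "agent_time + offset" correction: the
// point is stamped with intake time so it lands at the right time from
// Datadog's perspective even when the agent clock has drifted.
func correctedTimestamp(agentTime time.Time, offsetSeconds float64) float64 {
	return float64(agentTime.Unix()) + offsetSeconds
}
```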