
Conversation

@sangeetashivaji
Contributor

@sangeetashivaji sangeetashivaji commented Jan 15, 2026

What does this PR do?

We want to support ClickHouse in DBM, and this PR includes the agent changes needed to support:

  • Query Metrics

  • Query Activity

  • Query Completion

Note: the default collection interval is 10s for Query Metrics, 1s for Query Activity, and 10s for Completed Query Samples.

The majority of the logic sits in the three new files added: statement_activity.py, statements.py, and completed_query_samples.py.
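For orientation, here is a minimal sketch of an instance configuration enabling these jobs, written as the Python dict the check would receive. The option names are assumptions modeled on other DBM integrations such as postgres and sqlserver; only the default intervals above are confirmed by this PR.

```python
# Hypothetical instance config: option names are assumptions, not from this PR's spec.
instance = {
    'server': 'localhost',
    'port': 9000,
    'dbm': True,  # enable Database Monitoring for this instance
    'query_metrics': {'collection_interval': 10},            # default per the description above
    'query_activity': {'collection_interval': 1},            # default per the description above
    'completed_query_samples': {'collection_interval': 10},  # default per the description above
}
```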

Motivation

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@codecov

codecov bot commented Jan 16, 2026

Codecov Report

❌ Patch coverage is 82.99051% with 215 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.04%. Comparing base (34cfce2) to head (6bda942).
⚠️ Report is 2 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (34cfce2) and HEAD (6bda942).

HEAD has 1 fewer upload than BASE: BASE (34cfce2) had 2 uploads, HEAD (6bda942) had 1.
Additional details and impacted files
Flag | Coverage Δ: every listed integration flag, from active_directory through zk (including clickhouse), reports `?` for this run.

Flags with carried forward coverage won't be shown.


@sangeetashivaji sangeetashivaji changed the title [DRAFT] Sangeeta.shivajirao/clickhouse fixes jan13 [DBMON-6018] ClickHouse support for DBM Jan 20, 2026
@sangeetashivaji sangeetashivaji marked this pull request as ready for review January 20, 2026 17:25
@sangeetashivaji sangeetashivaji requested review from a team as code owners January 20, 2026 17:25

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5e525d34a3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +171 to +176
```python
if not rows:
    # No new queries, but still advance checkpoint
    if self._current_checkpoint_microseconds:
        self._save_checkpoint(self._current_checkpoint_microseconds)
        self._last_checkpoint_microseconds = self._current_checkpoint_microseconds
        self._log.debug("Advanced checkpoint (no new completed queries)")
```


P1: Avoid advancing completion checkpoint on query errors

This block advances the checkpoint whenever rows is empty, but _collect_completed_queries() catches query/processing errors and returns [] (see the exception path later in the same file), so a transient ClickHouse error or permission issue will look identical to “no new queries.” In that failure scenario the checkpoint still moves forward, permanently skipping the failed window and losing completion samples. Consider distinguishing “no data” from “error” (e.g., re-raise or return a sentinel) before advancing the checkpoint.

Useful? React with 👍 / 👎.
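A minimal sketch of the distinction Codex is asking for, assuming the method and attribute names visible in the diff (methods are shown out of their class for brevity; `_query_completed_rows` is a hypothetical name for the actual fetch). Returning `None` on error instead of `[]` is one way to keep the checkpoint from advancing past a failed window:

```python
# Sketch only: return None as an error sentinel so callers can tell
# "collection failed" apart from "the window was genuinely empty".
def _collect_completed_queries(self):
    try:
        return self._query_completed_rows()
    except Exception:
        self._log.exception("completed query collection failed")
        return None  # sentinel: error, not an empty window

def collect(self):
    rows = self._collect_completed_queries()
    if rows is None:
        return  # leave the checkpoint untouched so the failed window is retried
    if not rows and self._current_checkpoint_microseconds:
        # Genuinely empty window: safe to advance the checkpoint.
        self._save_checkpoint(self._current_checkpoint_microseconds)
        self._last_checkpoint_microseconds = self._current_checkpoint_microseconds
```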

Comment on lines +277 to +282
```python
rows = self._collect_metrics_rows()
if not rows:
    # Even if no rows, save the checkpoint to advance the window
    # This prevents re-querying the same empty window repeatedly
    if self._pending_checkpoint_microseconds:
        self._save_checkpoint(self._pending_checkpoint_microseconds)
```


P1: Don’t save metrics checkpoint when query_log load fails

This early-return saves _pending_checkpoint_microseconds when no rows are returned, but _load_query_log_statements() swallows exceptions and returns an empty list on failure. If the query_log fetch fails (e.g., transient connection issue), this path still persists the checkpoint, causing the next run to skip that entire window and drop metrics. Treat error vs empty-result separately (e.g., let the exception bubble or set a failure flag) before saving the checkpoint.

Useful? React with 👍 / 👎.
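The same fix sketched with a failure flag instead of a sentinel return, again borrowing names from the diff (methods shown out of their class; `_execute_query_log_fetch` is a hypothetical helper):

```python
# Sketch only: the loader records failure instead of silently returning [].
def _load_query_log_statements(self):
    self._collection_failed = False
    try:
        return self._execute_query_log_fetch()
    except Exception:
        self._collection_failed = True
        self._log.exception("query_log fetch failed")
        return []

def collect_metrics(self):
    rows = self._collect_metrics_rows()
    if not rows:
        if not self._collection_failed and self._pending_checkpoint_microseconds:
            # Only advance past a window we know was genuinely empty.
            self._save_checkpoint(self._pending_checkpoint_microseconds)
        return
```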

brett0000FF
brett0000FF previously approved these changes Jan 20, 2026
@temporal-github-worker-1 temporal-github-worker-1 bot dismissed brett0000FF’s stale review January 20, 2026 19:19

Review from brett0000FF is dismissed. Related teams and files:

  • documentation
    • clickhouse/assets/configuration/spec.yaml
@sangeetashivaji sangeetashivaji force-pushed the sangeeta.shivajirao/clickhouse-fixes-jan13 branch 2 times, most recently from 59dc9e0 to 2872c02 on January 21, 2026 22:54
@sangeetashivaji sangeetashivaji requested a review from a team as a code owner January 22, 2026 17:55
@sangeetashivaji sangeetashivaji force-pushed the sangeeta.shivajirao/clickhouse-fixes-jan13 branch from fc7785d to 2df0d33 on January 22, 2026 17:59
@sangeetashivaji sangeetashivaji force-pushed the sangeeta.shivajirao/clickhouse-fixes-jan13 branch from 75e49ec to ba84005 on January 22, 2026 19:30
```yaml
description: |
  Set to `true` when connecting through a single endpoint that load-balances across multiple nodes.
  When enabled, the agent uses `clusterAllReplicas('default', system.<table>)` to query
```
Contributor Author


Do we want to include details about why we're using this?

Contributor


I'd probably put extended details in the docs rather than in the spec

Contributor Author


makes sense! will update the spec here
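For reference, the substitution the spec text above describes could look something like this (a sketch; the option and helper names are assumptions, while `clusterAllReplicas('default', system.<table>)` is quoted from the spec):

```python
# Sketch: choose the table expression based on the load-balanced-endpoint option.
def _system_table(self, table):
    if self._config.is_load_balanced:  # hypothetical option name
        return "clusterAllReplicas('default', system.{})".format(table)
    return "system.{}".format(table)

# e.g. _system_table('query_log') -> "clusterAllReplicas('default', system.query_log)"
```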

```yaml
  value:
    type: boolean
    example: false
- name: database_instance_collection_interval
```
Contributor


Why do we need this as a config at all? Is there any use case for changing it?

Contributor Author


Discussed offline: we'd want to remove this config.

```python
# (C) Datadog, Inc. 2019-present
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
import json
```
Contributor


nit: use `from datadog_checks.base.utils.format import json` to smartly load a more efficient json library

Contributor Author


will fix this

```python
from datadog_checks.base.stubs import datadog_agent


class ClickhouseCheck(AgentCheck):
```
Contributor


This check should probably extend `DatabaseCheck`: `from datadog_checks.base.checks.db import DatabaseCheck`. That gives access to shared DBM functions and properties.
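A sketch of what that change might look like (the import path is quoted from the comment above; the rest is illustrative):

```python
from datadog_checks.base.checks.db import DatabaseCheck  # path quoted from the review comment


class ClickhouseCheck(DatabaseCheck):
    # Inheriting from DatabaseCheck rather than AgentCheck would expose the
    # shared DBM helpers the reviewer mentions; the existing body stays as-is.
    __NAMESPACE__ = 'clickhouse'
```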

```python
# Build typed configuration
config, validation_result = build_config(self)
self._config = config
self._validation_result = validation_result
```
Contributor


Can we emit a DBM agent health event with the config?

```python
self._agent_hostname = None

# _database_instance_emitted: limit the collection and transmission of the database instance metadata
self._database_instance_emitted = TTLCache(
```
Contributor


Do we need a TTLCache for this? Can we just use a database_instance_last_emitted var or such?

Contributor Author


TTLCache is overkill here, so I'll use a plain variable for this.
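A sketch of the variable-based approach (the class and attribute names are hypothetical; the interval would come from the check's existing config):

```python
import time


class _Sketch:
    def __init__(self):
        # Replaces the TTLCache with a single last-emitted timestamp.
        self._database_instance_last_emitted = 0

    def _should_emit_database_instance(self, interval_seconds):
        # Emit at most once per interval.
        now = time.time()
        if now - self._database_instance_last_emitted >= interval_seconds:
            self._database_instance_last_emitted = now
            return True
        return False
```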


```python
# Only save checkpoint after ALL payloads are successfully submitted
# This ensures we don't lose data if submission fails partway through
if self._pending_checkpoint_microseconds:
```
Contributor


What happens if we double-submit some activity?

```python
    # Do NOT save checkpoint on error - this ensures we retry the same window
    return []

def _get_clickhouse_version(self):
```
Contributor


Should this be in the main check file?

Contributor Author


Yes, that's right! It should be in the main check.
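A minimal version of that on the check itself might look like this (a sketch shown out of its class; the `self._client` attribute name is an assumption, though `SELECT version()` is a real ClickHouse function):

```python
# Sketch: version lookup living on the main check class.
def _get_clickhouse_version(self):
    try:
        result = self._client.execute('SELECT version()')  # self._client name assumed
        return result[0][0] if result else None
    except Exception:
        self.log.debug('failed to fetch ClickHouse version', exc_info=True)
        return None
```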

```sql
    is_initial_query
FROM {query_log_table}
WHERE
    event_time_microseconds > fromUnixTimestamp64Micro({last_checkpoint_microseconds})
```
Contributor


Same set of questions here as in statements

```sql
    event_time_microseconds > fromUnixTimestamp64Micro({last_checkpoint_microseconds})
    AND event_time_microseconds <= fromUnixTimestamp64Micro({current_checkpoint_microseconds})
    AND event_date >= toDate(fromUnixTimestamp64Micro({last_checkpoint_microseconds}))
    AND type = 'QueryFinish'
```
Contributor


This seems highly duplicative with statements. Could the querying/batching/etc be abstracted out to create two minimal jobs that mostly do the same thing? Or should they actually be one job and just collect both on the same interval?
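One shape the suggested abstraction could take (a sketch; the function name and client API are assumptions, while the `fromUnixTimestamp64Micro` window predicate is taken from the diff above):

```python
# Sketch of a shared window-query helper both jobs could call.
def query_log_window(client, columns, table, start_us, end_us, extra_where=''):
    sql = """
        SELECT {columns}
        FROM {table}
        WHERE event_time_microseconds > fromUnixTimestamp64Micro({start})
          AND event_time_microseconds <= fromUnixTimestamp64Micro({end})
          {extra}
    """.format(
        columns=', '.join(columns), table=table, start=start_us, end=end_us, extra=extra_where
    )
    return client.execute(sql)  # client API assumed (clickhouse-driver style)
```

The two jobs would then differ only in their column list and any extra predicates (e.g. `AND type = 'QueryFinish'` for completions).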

```python
SERVICE_CHECK_CONNECT = 'can_connect'

def __init__(self, name, init_config, instances):
    super(ClickhouseCheck, self).__init__(name, init_config, instances)
```
Contributor


If you add in the DBM health integration, you'll also get things like uncaught errors and missed collection intervals for free.
