[DBMON-6018] ClickHouse support for DBM #22341
Conversation
Force-pushed from c5aeec3 to 4b37573
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5e525d34a3
```python
if not rows:
    # No new queries, but still advance checkpoint
    if self._current_checkpoint_microseconds:
        self._save_checkpoint(self._current_checkpoint_microseconds)
        self._last_checkpoint_microseconds = self._current_checkpoint_microseconds
        self._log.debug("Advanced checkpoint (no new completed queries)")
```
Avoid advancing completion checkpoint on query errors
This block advances the checkpoint whenever rows is empty, but _collect_completed_queries() catches query/processing errors and returns [] (see the exception path later in the same file), so a transient ClickHouse error or permission issue will look identical to “no new queries.” In that failure scenario the checkpoint still moves forward, permanently skipping the failed window and losing completion samples. Consider distinguishing “no data” from “error” (e.g., re-raise or return a sentinel) before advancing the checkpoint.
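One way to make that distinction is a sentinel return value. This is a hypothetical sketch (class and attribute names are illustrative, not taken from the PR) of returning a sentinel from the collection path on error, so the caller only advances the checkpoint on a genuinely empty window:

```python
# Sentinel distinct from an empty result, so "query failed" and
# "no new completed queries" are no longer indistinguishable.
_COLLECTION_FAILED = object()

class CompletedQueryCollector:
    def __init__(self):
        self.checkpoint = 100          # last committed checkpoint
        self.pending_checkpoint = 200  # end of the current window

    def _collect_completed_queries(self, fail=False):
        try:
            if fail:
                raise ConnectionError("transient ClickHouse error")
            return []  # genuinely no new completed queries
        except Exception:
            return _COLLECTION_FAILED  # instead of swallowing into []

    def run(self, fail=False):
        rows = self._collect_completed_queries(fail=fail)
        if rows is _COLLECTION_FAILED:
            return  # keep the old checkpoint so the window is retried
        if not rows:
            self.checkpoint = self.pending_checkpoint  # safe to advance

collector = CompletedQueryCollector()
collector.run(fail=True)
assert collector.checkpoint == 100  # error: window is retried, not skipped
collector.run(fail=False)
assert collector.checkpoint == 200  # truly empty window: checkpoint advances
```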
```python
rows = self._collect_metrics_rows()
if not rows:
    # Even if no rows, save the checkpoint to advance the window
    # This prevents re-querying the same empty window repeatedly
    if self._pending_checkpoint_microseconds:
        self._save_checkpoint(self._pending_checkpoint_microseconds)
```
Don’t save metrics checkpoint when query_log load fails
This early-return saves _pending_checkpoint_microseconds when no rows are returned, but _load_query_log_statements() swallows exceptions and returns an empty list on failure. If the query_log fetch fails (e.g., transient connection issue), this path still persists the checkpoint, causing the next run to skip that entire window and drop metrics. Treat error vs empty-result separately (e.g., let the exception bubble or set a failure flag) before saving the checkpoint.
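An alternative to the sentinel approach is letting the exception bubble up to the caller. A minimal sketch, with illustrative names (not the PR's real signatures), of catching the error above the checkpoint-saving path instead of inside the loader:

```python
class MetricsCollector:
    def __init__(self):
        self.saved_checkpoint = None
        self._pending_checkpoint_microseconds = 1_700_000_000_000_000

    def _load_query_log_statements(self, fail=False):
        if fail:
            # Bubble the failure instead of returning [] - an empty list
            # should always mean "the window really had no rows".
            raise ConnectionError("query_log fetch failed")
        return []

    def collect(self, fail=False):
        try:
            rows = self._load_query_log_statements(fail=fail)
        except ConnectionError:
            return  # checkpoint untouched; same window retried next run
        if not rows:
            # Only a true empty window reaches this save.
            self.saved_checkpoint = self._pending_checkpoint_microseconds

m = MetricsCollector()
m.collect(fail=True)
assert m.saved_checkpoint is None  # failure did not persist the checkpoint
m.collect(fail=False)
assert m.saved_checkpoint == 1_700_000_000_000_000
```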
Review from brett0000FF is dismissed. Related teams and files:
- documentation
- clickhouse/assets/configuration/spec.yaml
Force-pushed from 59dc9e0 to 2872c02
Force-pushed from fc7785d to 2df0d33
Force-pushed from 75e49ec to ba84005
```yaml
description: |
  Set to `true` when connecting through a single endpoint that load-balances across multiple nodes.
  When enabled, the agent uses `clusterAllReplicas('default', system.<table>)` to query
```
Do we want to give out details around why we are using this?
I'd probably put extended details in the docs rather than in the spec
makes sense! will update the spec here
```yaml
value:
  type: boolean
  example: false
- name: database_instance_collection_interval
```
Why do we need this as a config at all? Is there any use case for changing it?
discussed offline we'd want to remove this config
```python
# (C) Datadog, Inc. 2019-present
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
import json
```
nit: use `from datadog_checks.base.utils.format import json` to smartly load a more efficient json library
will fix this
```python
from datadog_checks.base.stubs import datadog_agent


class ClickhouseCheck(AgentCheck):
```
This check should probably extend `DatabaseCheck`: `from datadog_checks.base.checks.db import DatabaseCheck`. That gives access to shared DBM functions and properties.
```python
# Build typed configuration
config, validation_result = build_config(self)
self._config = config
self._validation_result = validation_result
```
Can we emit a DBM agent health event with the config?
```python
self._agent_hostname = None

# _database_instance_emitted: limit the collection and transmission of the database instance metadata
self._database_instance_emitted = TTLCache(
```
Do we need a TTLCache for this? Can we just use a database_instance_last_emitted var or such?
TTLCache is overkill here - will use a plain variable for this
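The variable-based approach could look like this sketch (names and interval are illustrative, not the PR's actual code), assuming a single database instance is tracked per check instance:

```python
import time

class InstanceMetadataEmitter:
    def __init__(self, interval_seconds=300):
        self._interval = interval_seconds
        # Replaces the TTLCache: one timestamp of the last emission.
        self._database_instance_last_emitted = 0.0

    def maybe_emit(self, now=None):
        now = time.time() if now is None else now
        if now - self._database_instance_last_emitted < self._interval:
            return False  # still within the interval; skip emission
        self._database_instance_last_emitted = now
        return True  # caller emits the database instance metadata event

e = InstanceMetadataEmitter(interval_seconds=300)
assert e.maybe_emit(now=1000.0) is True   # first call always emits
assert e.maybe_emit(now=1100.0) is False  # 100s elapsed < 300s interval
assert e.maybe_emit(now=1400.0) is True   # 400s elapsed >= 300s interval
```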
```python
# Only save checkpoint after ALL payloads are successfully submitted
# This ensures we don't lose data if submission fails partway through
if self._pending_checkpoint_microseconds:
```
What happens if we double submit some activity?
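To make the concern concrete: saving the checkpoint only after all payloads go out gives at-least-once delivery, so a failure partway through means the whole window is re-submitted on the next run and the backend can see duplicates for at most one window. A hypothetical sketch (names are illustrative) of that behavior:

```python
class ActivitySubmitter:
    def __init__(self):
        self.checkpoint = 0
        self.submitted = []  # stands in for payloads sent to the backend

    def run(self, payloads, pending_checkpoint, fail_at=None):
        for i, payload in enumerate(payloads):
            if i == fail_at:
                raise RuntimeError("submission failed partway through")
            self.submitted.append(payload)
        # Only reached when every payload went out successfully.
        self.checkpoint = pending_checkpoint

s = ActivitySubmitter()
try:
    s.run(["a", "b", "c"], pending_checkpoint=10, fail_at=2)
except RuntimeError:
    pass
assert s.checkpoint == 0  # checkpoint not advanced on partial failure
s.run(["a", "b", "c"], pending_checkpoint=10)  # retry re-sends "a" and "b"
assert s.submitted == ["a", "b", "a", "b", "c"]
assert s.checkpoint == 10
```

So "a" and "b" are double-submitted once; deduplication (e.g., by query_id plus event timestamp) would have to happen downstream.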
```python
# Do NOT save checkpoint on error - this ensures we retry the same window
return []


def _get_clickhouse_version(self):
```
Should this be in the main check file?
yes that's right! it should be in the main check
```sql
    is_initial_query
FROM {query_log_table}
WHERE
    event_time_microseconds > fromUnixTimestamp64Micro({last_checkpoint_microseconds})
```
Same set of questions here as in statements
```sql
    event_time_microseconds > fromUnixTimestamp64Micro({last_checkpoint_microseconds})
    AND event_time_microseconds <= fromUnixTimestamp64Micro({current_checkpoint_microseconds})
    AND event_date >= toDate(fromUnixTimestamp64Micro({last_checkpoint_microseconds}))
    AND type = 'QueryFinish'
```
This seems highly duplicative with statements. Could the querying/batching/etc be abstracted out to create two minimal jobs that mostly do the same thing? Or should they actually be one job and just collect both on the same interval?
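The suggested refactor could be sketched as a shared base job that owns the window/checkpoint plumbing, with each concrete job supplying only what differs. All names here are hypothetical, for illustration only:

```python
class QueryLogJob:
    """Shared plumbing: windowing, checkpointing, row collection."""

    def __init__(self):
        self.last_checkpoint = 0

    def row_filter(self, row):
        raise NotImplementedError  # each concrete job defines its filter

    def collect(self, rows, current_checkpoint):
        picked = [r for r in rows if self.row_filter(r)]
        self.last_checkpoint = current_checkpoint  # shared checkpoint logic
        return picked

class StatementsJob(QueryLogJob):
    def row_filter(self, row):
        return row["type"] in ("QueryFinish", "ExceptionWhileProcessing")

class CompletedSamplesJob(QueryLogJob):
    def row_filter(self, row):
        return row["type"] == "QueryFinish"

rows = [{"type": "QueryFinish"}, {"type": "ExceptionWhileProcessing"}]
assert len(StatementsJob().collect(rows, 1)) == 2
assert len(CompletedSamplesJob().collect(rows, 1)) == 1
```

Whether this stays as two jobs on separate intervals or collapses into one job collecting both payloads is then mostly a scheduling decision rather than a code-duplication one.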
```python
SERVICE_CHECK_CONNECT = 'can_connect'


def __init__(self, name, init_config, instances):
    super(ClickhouseCheck, self).__init__(name, init_config, instances)
```
If you add in DBM health integration you'll also get things like uncaught errors and missed collection intervals for free.
What does this PR do?
We want to support ClickHouse in DBM, and this PR includes the agent changes to support:
- Query Metrics
- Query Activity
- Query Completion

Note: the default collection interval is 10s for Query Metrics, 1s for Query Activity, and 10s for Completed Query Samples.
The majority of the logic sits in the three new files added: statement_activity.py, statements.py & completed_query_samples.py

Motivation
Review checklist (to be filled by reviewers)
- Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
- Add the `backport/<branch-name>` label to the PR and it will automatically open a backport PR once this one is merged