Conversation
gianm
left a comment
This is a useful concept, although I think it should be done in such a way that the Coordinator handles communications with the metadata store. Other things that work this way include lookup definitions, users and roles for basic authentication and authorization, centralized schema, etc.
The rationale for why we've done it this way, in these other cases, is three-fold:
- Operationally, it's better if a smaller number of servers is communicating with the metadata store, to avoid overloading it and to make it easier to configure security for the metadata store.
- Syncing dynamic configurations from a central leader (Coordinator) allows updates to propagate more quickly.
- Someday we may want to make an option to use a builtin metadata store. This will be simplest to implement if metadata store operations go through central leaders. Those leaders would naturally become the place where the builtin metadata store lives.
The way it can work is:
- Move the user-facing POST and GET APIs to the Coordinator.
- Brokers pull configs from the Coordinator on startup.
- When the Coordinator receives an update to the config, it should push it out to an internal POST API on the Brokers.
- Brokers also pull configs from the Coordinator periodically, in case they missed a push.
There are possible race conditions surrounding the push and pull: it's possible a new config is getting pushed out around the same time that the Broker is doing a scheduled pull. To solve these you can include a timestamp in the config object, and have the Brokers reject configs with older timestamps than the current one.
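A minimal sketch of that timestamp guard, assuming names that are illustrative rather than from this PR:

```java
// Illustrative sketch of timestamp-based conflict resolution between pushed
// and pulled configs. Class and member names are hypothetical.
public class TimestampedConfigHolder<T>
{
  private long currentTimestamp = Long.MIN_VALUE;
  private T currentConfig = null;

  // Accept a config only if it is newer than the one we hold, so a scheduled
  // pull cannot overwrite a fresher pushed config (and vice versa).
  public synchronized boolean offer(T config, long timestamp)
  {
    if (timestamp <= currentTimestamp) {
      return false; // stale: keep the config we already have
    }
    currentTimestamp = timestamp;
    currentConfig = config;
    return true;
  }

  public synchronized T get()
  {
    return currentConfig;
  }
}
```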
Please also consider what guarantees should exist around availability of the Broker config. Is it "required", i.e., if the Broker can't fetch it on startup then startup will fail? Or is it "best effort", i.e., if the Broker can't fetch it on startup, we proceed with a null config or default config for some time, until the next scheduled pull succeeds?
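For the "best effort" option, startup could look roughly like the following sketch; the fetch and default here are hypothetical stand-ins, not APIs from this PR:

```java
// Illustrative "best effort" startup: fall back to a default config when the
// initial fetch fails, relying on the periodic pull to recover later.
static <T> T fetchConfigBestEffort(java.util.function.Supplier<T> fetchFromCoordinator, T defaultConfig)
{
  try {
    return fetchFromCoordinator.get();
  }
  catch (RuntimeException e) {
    // Proceed with defaults; the next scheduled pull picks up the real config.
    return defaultConfig;
  }
}
```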
jtuglu1
left a comment
Did a first pass – overall looking good!
```java
this.validDebugDimensions = validateDebugDimensions(debugDimensions);
this.turboLoadingNodes = Configs.valueOrDefault(turboLoadingNodes, Set.of());
this.cloneServers = Configs.valueOrDefault(cloneServers, Map.of());
this.queryBlocklist = queryBlocklist != null ? ImmutableList.copyOf(queryBlocklist) : Defaults.QUERY_BLOCKLIST;
```
Is there a reason we are creating an immutable copy of this? Maybe let's use Configs.valueOrDefault to follow convention here with the rest of the containers being serde'd.
```java
private static class Defaults
{
  static final List<QueryBlocklistRule> QUERY_BLOCKLIST = ImmutableList.of();
}
```
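For reference, the suggested form would look roughly like this (a sketch, assuming the generic Configs.valueOrDefault helper applies to this List as it does to the containers above):

```java
// Suggested convention, mirroring the other serde'd containers:
this.queryBlocklist = Configs.valueOrDefault(queryBlocklist, Defaults.QUERY_BLOCKLIST);
```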
```java
throw DruidException.forPersona(DruidException.Persona.USER)
                    .ofCategory(DruidException.Category.FORBIDDEN)
                    .build(
                        "Query blocked by rule[%s]",
```
Does this log the query ID? It'd be nice to co-locate the query ID and the rule name in a log so users/operators don't need to grep that separately.
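For example, the message could include the query ID alongside the rule, along these lines (Query#getId() is Druid's standard query ID accessor; rule.getName() is an assumed accessor on QueryBlocklistRule, not necessarily in this PR):

```java
// Sketch: co-locating the query ID and rule name in the thrown error.
throw DruidException.forPersona(DruidException.Persona.USER)
                    .ofCategory(DruidException.Category.FORBIDDEN)
                    .build("Query[%s] blocked by rule[%s]", query.getId(), rule.getName());
```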
```java
 * @param query the query to check
 * @return true if the query matches this rule, false otherwise
 */
public boolean matches(Query<?> query)
```
It might be worth caching this result for all 3 criteria types. The JVM may branch-predict this well, but since this is called per-query on constant fields, precomputing it explicitly would save ≥ 2 comparisons per query.
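One way to do that precomputation, as a sketch; field and method names here are assumptions, not the PR's actual implementation:

```java
// Illustrative precomputation: resolve "null/empty means wildcard" once at
// construction so matches() skips those checks on every query.
private final boolean matchesAllDataSources; // e.g. dataSources == null || dataSources.isEmpty()
private final Set<String> dataSourceSet;     // e.g. Set.copyOf(dataSources) when non-empty

private boolean matchesDataSources(Query<?> query)
{
  return matchesAllDataSources
         || query.getDataSource().getTableNames().stream().anyMatch(dataSourceSet::contains);
}
```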
Another thing I forgot to mention is docs + UI, namely:
Docs:
UI:
Fixes #18964
Description
Added a query blocklist for dynamically blocking queries when a rogue app or user is spamming the cluster, without relying on static configs or restarts.
QueryBlocklistRule
Rules are enforced early in QueryLifecycle (after init); a DruidException is thrown when a query matches a rule, i.e. when ALL specified criteria match (AND logic). Null or empty criteria act as wildcards (match everything):

- dataSources: matches if ANY datasource in the query intersects with the rule's datasources
- queryTypes: matches if the query type is in the rule's query types
- contextMatches: matches if ALL key-value pairs in the rule match the query context (exact string match)

The query blocklist is managed via the existing coordinator config APIs:
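For illustration, a rule blocking one noisy client could look like this in the dynamic config; the exact JSON shape is defined by QueryBlocklistRule's serde, so treat the field values as a hypothetical example:

```json
{
  "queryBlocklist": [
    {
      "dataSources": ["wikipedia"],
      "queryTypes": ["groupBy"],
      "contextMatches": {"appName": "rogue-dashboard"}
    }
  ]
}
```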
Considerations
Creating a separate broker-level dynamic config was considered, which could also support other future features such as datasource aliasing, routing rules, or feature flags. Following contributor feedback, this implementation reuses the existing CoordinatorDynamicConfig and the coordinator-broker config sync infrastructure instead of creating a separate broker config, so that fewer servers communicate with the metadata store and future built-in metadata store operations can go through central leaders.
The query blocklist follows a "best effort" approach:
Rationale: The query blocklist is an operational safety feature, not a security control. Blocking all queries when config is unavailable would be more disruptive than the problem it's trying to solve. Operators can still use coordinator/historical unavailability to enforce hard blocks if needed.
Release note
Added query blocklist feature for dynamically blocking queries without restarts. Operators can block queries by datasource, query type, or query context using the new /druid/broker/v1/config API. Rules use AND logic (all criteria must match) and are stored in the metadata database.
Key changed/added classes in this PR
This PR has: