# Proposal for adding KIP-714 support to metrics-reporter #159
## Conversation
Signed-off-by: Mickael Maison <[email protected]>
Thanks for the proposal. Some quick comments.
> This proposes adding support for KIP-714 to the server-side metric reporter, `ServerKafkaMetricsReporter`.
>
> ## Motivation
I strongly disagree with this section. Monitoring applications is essential. And one of the keys to making it useful in the situations when you need the monitoring the most is independent communication channels. I.e. do not monitor Kafka dependents through Kafka (the same Kafka cluster they depend on / the same SaaS Kafka service etc.). KIP-714 is a great business strategy for some vendors, but a bad architecture pattern. And the cloud native landscape has various established solutions for metric collection across distributed application landscapes, which provide this.
I'm not saying that we cannot have some support for KIP-714 - it is a Kafka feature and we provide support for Kafka 🤷. But we should not present it as some superior solution.
I've not tried comparing the different monitoring solutions; each has different tradeoffs and strengths. Nowhere do I say this is superior to another alternative. I just described some of the benefits support for KIP-714 could bring to users, and I believe all the statements I made are valid.
I'm happy to include weaknesses of the KIP-714 model if you think that would be useful.
I didn't read the motivation as a way to highlight the Kafka client metrics support as the best one.
Maybe just this sentence ...
> When clients are able to send their metrics to broker this eases their collection, and greatly simplifies diagnosing issues.

... could be changed in a way that mentions it as the Kafka solution. Not sure if Jakub is referring to this sentence as if it read "having the clients send metrics to the broker is the easiest and best solution ..."?
I think the motivation is valid. The main goal of KIP-714 is to make troubleshooting Kafka clients easier. Specifically, the zero-config approach is something that established solutions can't provide. Maybe we can explicitly mention that.
Trust me: troubleshooting a Kafka client through information that is supposed to be sent by ... wait for it ... exactly the same Kafka client you are trying to troubleshoot ... does not work well and is not a good practice.
I re-read the motivation again and my perspective is still the same after my first comment ... it doesn't sound like Mickael is saying this is a great and awesome feature and the best one compared to others.
I agree that monitoring is usually done by using several solutions all together and this one is part of the toolbox imho. I also agree with Mickael's use cases.
I'm not against having this feature in Strimzi. I personally would leave it as something that users can contribute if they want it.
@scholzj Mickael is not a user per-se but a contributor. Why should we accept this only from a user? Because it would be clear that coming from users it would be useful? It's even possible that by adding this feature we'll have a lot of users using it.
From your previous replies it's pretty clear you don't like that feature. If you're against this feature and don't want it in Strimzi, just say so, it will save both of us time.
@mimaison remember the Strimzi maintainers are a group of people ;-)
> @scholzj Mickael is not a user per-se but a contributor. Why should we accept this only from a user? Because it would be clear that coming from users it would be useful? It's even possible that by adding this feature we'll have a lot of users using it.

@ppatierno I did not say we should accept it only from a user. I said that if it were up to me, I would leave it for a user contribution as I do not consider it well invested time and since user contributions are a useful measure of real user interest (and I'm actually a bit surprised no user - as far as I know - asked about this feature in the past). But as I said, it is not my time being invested here, so it is not my problem, just my opinion. I also acknowledged that my (maintainer of a CNCF project) role is obviously different from Mickael's role and our motivations might differ.
But I stand behind my opinion that I'm not going to give my +1 vote to a feature which has a motivation that does not follow the best interest of users but the best interest of one or more Kafka vendors. That said, I'm not likely to -1 this just because of the motivation. So as you said, there are other maintainers as well. 🤷

> remember the Strimzi maintainers are a group of people ;-)

That is of course true. But I think it is sad that you and me are the only maintainers who actually commented on this during the time it has been open. I think that makes it pretty hard for people to read the room and understand that the Strimzi maintainers are a group of people when it is only two people who raised some comments. And of the two people, one (me) has raised opinions that seem to be perceived as strongly negative and against the proposal ... and the other one (you) has so far not approved it either.
My intention was just to make clear to Mickael that contributors should not give up just because one maintainer doesn't agree with a proposal. It happens to me as well to be against something, but that's not a reason why something new shouldn't be added.
Regarding you and me being the only maintainers to comment, let me tag all the rest and I will also raise this as a topic of discussion in the next community meeting. You are right, it's been open for more than a month but no other comments here.
I am happy for Strimzi to provide support for this feature. I can see it being useful for organisations where a single team manages the Kafka cluster while other teams that are inexperienced with Kafka internals write and manage the client applications.
I don't have any problem with the motivation section. To me it mostly reads as explaining why users might want to do this, rather than being a strong recommendation that this is the best way. The only sentence that could be softened is "When clients are able to send their metrics to broker this eases their collection, and greatly simplifies diagnosing issues.", perhaps by removing "and greatly simplifies diagnosing issues".
I don't have a problem with the motivation section either. Maybe just softening that "greatly ..." as Kate stated above... but that's it :)
> ```yaml
> telemetryLabels:
>   - "client_id"
>   - "principal"
> ```
So, how do these labels apply to the Kafka server metrics? I would expect this configuration to be a bit more sophisticated:
- A separate section for the client-side metrics
- Maybe require enabling the client-side metrics first (without that, it seems like it is pretty easy to DoS the brokers through the client metrics)
- Have a separate configuration for the client metrics and for the server metrics?
- I guess it needs to distinguish the broker configuration from the client-based operand configuration as well? (i.e. there are no client metrics in Connect / MM2 / Bridge)
These labels would only apply to client metrics and have no impact on server metrics. Likewise, the `allowList` only filters server metrics; client metrics are selected via metric subscriptions.
I understand your concern that this could be confusing. We could start by adjusting the name of the new field, maybe `clientLabels` or `clientMetricsLabels` would be more explicit?
We've now split the reporter into server and client side modules, but it's not in a metrics-reporter release yet, and that will need to be introduced into the operator too. The client label configuration only applies to the server side module.
I would go with the `clientMetricsLabels` field name in this case.
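For illustration, a minimal sketch of what the earlier example could look like with the renamed field; this only shows the rename being discussed, not an agreed schema:

```yaml
# Sketch of the rename discussed above; the surrounding structure is unchanged
clientMetricsLabels:
  - "client_id"
  - "principal"
```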
> Then administrators must set metrics subscriptions to define the metrics clients will send. There are no default subscriptions. Subscriptions can be set, updated and deleted at runtime via the `kafka-configs.sh` or `kafka-client-metrics.sh` tools, or via the `Admin` API. For example:
>
> ```sh
> ./bin/kafka-client-metrics.sh --bootstrap-server localhost:9092 \
>   --alter --name topic-metrics \
>   --metrics org.apache.kafka.producer.topic. \
>   --interval 30000
> ```
This sounds like it should have some declarative way of being managed rather than relying on the Kafka commands ...
Chatting with @ppatierno last week, he also suggested introducing a mechanism to manage metric subscriptions.
I've not included this in the proposal yet as I'm not familiar enough with the operator to propose a design that makes sense. So help from maintainers would be appreciated.
Would a simple `ConfigMap` be enough to hold the subscriptions? Then we would need a way to attach it to the cluster definition. I also don't understand well the mechanism that would process and apply the subscriptions.
So AFAIU you can have one or more subscriptions, and for each of them you have a couple of lists (metric prefixes and filters) in addition to an interval value.
Thinking out loud, we should start from the `Kafka` custom resource and have a field dedicated to the client metrics (because we already have a `metricsConfig` field to support JMX and Prometheus server related metrics). If such a field is, for example, `clientsMetrics` (or whatever), it could be a list of subscriptions with the fields for the corresponding configuration, or it could point to a `ConfigMap`. I guess the choice could depend on how big this configuration could be, but also taking into account that a `ConfigMap` is not strongly typed (as a direct API in the Kafka CR would be), so we should think more about how to add such configuration (i.e. as JSON?).
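To make that concrete, a purely hypothetical sketch of such a field in the `Kafka` custom resource; every name below (`clientsMetrics`, `subscriptions`, `match`, `intervalMs`) is invented for illustration and nothing here is agreed:

```yaml
# Hypothetical sketch only: field names and schema are not part of the proposal yet
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    clientsMetrics:
      subscriptions:
        - name: topic-metrics
          metrics:
            - org.apache.kafka.producer.topic.
          match:
            - client_software_name=apache-kafka-java
          intervalMs: 30000
```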
@mimaison can the subscriptions be set at deploy time, or only at runtime? I see here you listed the dedicated shell script, but this can be configured via the configs script as well, can't it? Is it effectively a configuration option that can be updated dynamically (i.e. without rolling the brokers)?
Subscriptions can only be set, updated and deleted at runtime. Yes, you can also use the `kafka-configs.sh` tool or the `incrementalAlterConfigs()` Admin API to configure subscriptions.
> Is it effectively a configuration option that can be updated dynamically (i.e. without rolling the brokers)?

Exactly.
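For completeness, a sketch of what the equivalent `kafka-configs.sh` invocation for the example subscription might look like (worth double-checking the exact syntax against the Kafka version in use):

```sh
# Client metrics subscriptions are exposed as a client-metrics config resource,
# so the same subscription as the kafka-client-metrics.sh example can be created here
./bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type client-metrics --entity-name topic-metrics \
  --add-config 'metrics=org.apache.kafka.producer.topic.,interval.ms=30000'
```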
> When clients are able to send their metrics to broker this eases their collection, and greatly simplifies diagnosing issues. The mechanism to set subscriptions also allows users to precisely adjust metrics at runtime to collect the most relevant metrics.
>
> ## Proposal
Could we try to clearly split which parts of the proposal are for the metrics-reporter component and which are for the operator?
The part specific to the operator is in the `Updates to strimzi-kafka-operator` section. Anything else is for metrics-reporter.
> The reporter can convert the other fields as labels and add them to the metric series. In many cases it does not make sense to add all of them, for example if the cluster is behind a proxy the client address will always be the same. Also using all of them can create high cardinality label series which can be problematic in Prometheus.
>
> I propose introducing a configuration to select the metadata fields to use as labels:
>
> - `prometheus.metrics.reporter.telemetry.labels`: List of label names in client metrics. The valid names are `client_id`, `listener_name`, `security_protocol`, `principal`, `client_address`. This defaults to `client_id`. This configuration is reconfigurable at runtime.
Is it ok to default to an optional configuration? Maybe we could default to `client_id` and `client_address`.
I see what you mean about `client_id`, it could be empty if users don't set it. There's the `client_instance_id` label that is always injected so I thought that would be fine. I'm not really keen on defaulting to `client_address` as in virtualized environments this may change often and increase the cardinality of the metrics.
In the end you need to know which are the best labels to use for your own environment.
I agree to use `client_instance_id`, but I would also put `client_id` when non-empty. Wdyt?
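As a purely illustrative sketch, assuming the property ends up being exposed through the dynamic broker config mechanism (the proposal only says it is reconfigurable at runtime, not how), changing the labels could look roughly like this:

```sh
# Hypothetical: assumes the reporter's property is updatable as a dynamic broker config;
# square brackets group the comma-separated list value for kafka-configs.sh
./bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type brokers --entity-default \
  --add-config 'prometheus.metrics.reporter.telemetry.labels=[client_id,principal]'
```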
> When the reporter receives metrics from a client, it converts them and adds them into Prometheus registry, ready to be collected when the `GET /metrics` endpoint is scraped. The subscription specifies an interval for the client to send metrics. However when metrics are received, the interval or any details about the subscription that caused this emission is not available.
>
> So we need a mechanism to remove metrics from the registry when a client stops emitting metrics, otherwise if a client shuts down, the metrics reporter will keep reporting its metrics. As we don't have the interval, the simplest approach is to delete client metrics whenever they are scraped. While this may create gaps in metric series, this ensures that metrics collected are always valid.
What if we have two Prometheus instances scraping from the same SMR endpoint? Depending on configuration, one of them could be starved and never see client metrics.
Is it common to have 2 Prometheus servers scraping the same endpoints? I have to admit that's not a scenario I've considered.
I mean, why not? You could easily have dedicated Prometheus instances, for example one dedicated to server metrics and another one to client metrics, or maybe one instance collecting a critical set of metrics from all Kafka clusters, and then a smaller instance for each cluster with a bigger set of metrics.
What about deleting removed metrics on telemetry pushes? I don't know if there is any technical limitation, but you already have the old set in the Prometheus registry, so it should be possible to remove metrics by comparing them with the incoming set.
> What about deleting removed metrics on telemetry pushes? I don't know if there is any technical limitation, but you already have the old set in the Prometheus registry, so it should be possible to remove metrics by comparing them with the incoming set.

The issue is when the client stops pushing new metrics: you then keep the last value you received and start exposing stale metrics.
Right, that's a problem.
> However when metrics are received, the interval or any details about the subscription that caused this emission is not available.

Isn't this a design flaw that should be addressed in Kafka with a small KIP? Or can we use the admin client to get this information?
Overall this proposal makes sense to me. The only part that feels unfinished is how to manage the subscriptions. I do think that it ought to be configurable declaratively, but I don't feel I understand how the subscriptions work well enough to make a concrete suggestion of the best way to handle it, so I've added some questions that will hopefully help me understand better.
> ### Client metrics life cycle
>
> When the reporter receives metrics from a client, it converts them and adds them into Prometheus registry, ready to be collected when the `GET /metrics` endpoint is scraped. The subscription specifies an interval for the client to send metrics. However when metrics are received, the interval or any details about the subscription that caused this emission is not available.
If Strimzi provided a way to manage the subscription, and therefore could inspect the interval, would that enable a better process for managing the lifecycle of the client metrics?
Not really. You'd have to find the matching rule. This would also not work outside of the Strimzi operator.
I'm currently considering whether we can update the `ClientTelemetry` API and provide the push interval.
> - `kafka_server_client_metrics_throttle_count{client_instance_id="<ID>"}`/`kafka_server_client_metrics_throttle_rate{client_instance_id="<ID>"}`: `PushTelemetry` requests can be throttled if they violate quotas defined by the cluster administrator. The count and rate of throttled `PushTelemetry` requests.
> - `kafka_server_client_metrics_unknown_subscription_request_count`/`kafka_server_client_metrics_unknown_subscription_request_rate`: The count and rate of `PushTelemetry` requests received from clients that have not retrieved their subscriptions.
>
> Administrators should monitor these metrics when they set subscriptions.
Will we document this as a recommendation, or did you have something else in mind?
Yes, this recommendation could go in the docs. I think it's important to understand the extra load subscriptions can add to your clusters.
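For example, a quick way to eyeball these counters after setting a subscription could be something like the following (host and port are illustrative only; they depend on how the reporter's listener is configured):

```sh
# Illustrative only: scrape the reporter endpoint and filter for the client metrics counters
curl -s http://my-cluster-kafka-0:8080/metrics \
  | grep -E 'kafka_server_client_metrics_(throttle|unknown_subscription_request)'
```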
> I think it's important to understand the extra load subscriptions can add to your clusters.

Do we have some quantification here? I mean roughly in %, how much overhead could it add? (I know it depends on factors like number of clients, metrics, and intervals, but in general...)
No, I don't have numbers as I have only run this in a local one-node cluster. As you said, the impact will largely depend on the number of clients and the subscriptions enabled, so it probably ranges from insignificant to a major slowdown. That's why I mention these metrics, as they should help evaluate the impact of handling client metrics.
> When the reporter receives metrics from a client, it converts them and adds them into Prometheus registry, ready to be collected when the `GET /metrics` endpoint is scraped. The subscription specifies an interval for the client to send metrics. However when metrics are received, the interval or any details about the subscription that caused this emission is not available.
>
> So we need a mechanism to remove metrics from the registry when a client stops emitting metrics, otherwise if a client shuts down, the metrics reporter will keep reporting its metrics. As we don't have the interval, the simplest approach is to delete client metrics whenever they are scraped. While this may create gaps in metric series, this ensures that metrics collected are always valid.
Does this mean that if Prometheus is scraping the endpoint, but then a user directly queries it, it would prevent the metrics from being collected by Prometheus? If so, we would need to make sure this is very clear in the docs. I would assume mostly this wouldn't happen, but when troubleshooting sometimes people do unexpected things.
> Does this mean that if Prometheus is scraping the endpoint, but then a user directly queries it, it would prevent the metrics from being collected by Prometheus?

With the current proposal, when the endpoint is scraped by anything, user or Prometheus, all existing client metrics in the Prometheus registry are returned, and then the client metrics are deleted from the registry. Clients send their metrics periodically based on the configured push interval. Until clients push their metrics again, if the endpoint is scraped again, there won't be any client metrics.
I was wondering if, for debugging purposes, when the user wants to scrape without deleting metrics, we could have a query parameter like `?clean=true|false` (defaulting to `true` when it's not present, which is what Prometheus does) in order to allow the user to scrape without deletion. But maybe I am overthinking.
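If such a parameter were adopted, a debugging session might look roughly like this (the parameter name, host and port are all hypothetical):

```sh
# Hypothetical: inspect client metrics without deleting them from the registry
curl -s "http://my-cluster-kafka-0:8080/metrics?clean=false" | grep client_instance_id

# A normal scrape (what Prometheus would issue) would still delete client metrics after returning them
curl -s http://my-cluster-kafka-0:8080/metrics
```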
This proposes adding support for KIP-714 to `metrics-reporter`. This allows clients to send their metrics to brokers, simplifying the monitoring and troubleshooting of client-side issues.