# Proposal for adding KIP-714 support to metrics-reporter #159
## Conversation
Signed-off-by: Mickael Maison <[email protected]>
Thanks for the proposal. Some quick comments.
> This proposes adding support for KIP-714 to the server-side metric reporter, `ServerKafkaMetricsReporter`.
>
> ## Motivation
I strongly disagree with this section. Monitoring applications is essential. And one of the keys to making it useful in the situations when you need the monitoring the most is independent communication channels. I.e. do not monitor Kafka dependents through Kafka (the same Kafka cluster they depend on / the same SaaS Kafka service etc.). KIP-714 is a great business strategy for some vendors, but a bad architecture pattern. And the cloud native landscape has various established solutions for metric collection across distributed application landscapes, which provide this.
I'm not saying that we cannot have some support for KIP-714 - it is a Kafka feature and we provide support for Kafka 🤷. But we should not present it as some superior solution.
I've not tried comparing the different monitoring solutions; each has different tradeoffs and strengths. Nowhere do I say this is superior to another alternative. I just described some of the benefits support for KIP-714 could bring to users, and I believe all the statements I made are valid.
I'm happy to include weaknesses of the KIP-714 model if you think that would be useful.
I didn't read the motivation as a way to highlight the Kafka client metrics support as the best one.
Maybe just this sentence ...
> When clients are able to send their metrics to broker this eases their collection, and greatly simplifies diagnosing issues.

... could be changed in a way that mentions it as the Kafka solution. Not sure if Jakub is referring to this sentence as if it read "having the clients send metrics to the broker is the easiest and best solution ..."?
I think the motivation is valid. The main goal of KIP-714 is to make troubleshooting Kafka clients easier. Specifically, the zero-config approach is something that established solutions can't provide. Maybe we can explicitly mention that.
Trust me: troubleshooting a Kafka client through information that is supposed to be sent by ... wait for it ... exactly the same Kafka client you are trying to troubleshoot ... does not work well and is not a good practice.
I re-read the motivation again and my perspective is still the same after my first comment ... it doesn't sound like Mickael is saying this is a great and awesome feature and the best one compared to others.
I agree that monitoring is usually done by using several solutions all together and this one is part of the toolbox imho. I also agree with Mickael's use cases.
I'm not against having this feature in Strimzi. I personally would leave it as something that users can contribute if they want it.
@scholzj Mickael is not a user per-se but a contributor. Why should we accept this only from a user? Because it would be clear that coming from users it would be useful? It's even possible that by adding this feature we'll have a lot of users using it.
From your previous replies it's pretty clear you don't like that feature. If you're against this feature and don't want it in Strimzi, just say so, it will save both of us time.
@mimaison remember the Strimzi maintainers are a group of people ;-)
> @scholzj Mickael is not a user per-se but a contributor. Why should we accept this only from a user? Because it would be clear that coming from users it would be useful? It's even possible that by adding this feature we'll have a lot of users using it.

@ppatierno I did not say we should accept it only from a user. I said that if it were up to me, I would leave it for a user contribution as I do not consider it well invested time and since user contributions are a useful measure of real user interest (and I'm actually a bit surprised no user - as far as I know - asked about this feature in the past). But as I said, it is not my time being invested here, so it is not my problem, just my opinion. I also acknowledged that my (maintainer of a CNCF project) role is obviously different from Mickael's role and our motivations might differ.
But I stand behind my opinion that I'm not going to give my +1 vote to a feature which has a motivation that does not follow the best interest of users but the best interest of one or more Kafka vendors. That said, I'm not likely to -1 this just because of the motivation. So as you said, there are other maintainers as well. 🤷

> remember the Strimzi maintainers are a group of people ;-)

That is of course true. But I think it is sad that you and me are the only maintainers who actually commented on this during the time it has been open. I think that makes it pretty hard for people to read the room and understand that the Strimzi maintainers are a group of people when it is only two people who raised some comments. And of the two people, one (me) has raised opinions that seem to be perceived as strongly negative and against the proposal ... and the other one (you) has so far not approved it either.
My intention was just to make clear to Mickael that contributors should not give up just because one maintainer doesn't agree with a proposal. It happens to me as well to be against something, but that's not a reason why something new shouldn't be added.
Regarding you and me being the only maintainers to comment, let me tag all the rest and I will also raise this as a topic of discussion in the next community meeting. You are right, it's been open for more than a month but no other comments here.
I am happy for Strimzi to provide support for this feature. I can see it being useful for organisations where a single team manages the Kafka cluster while other teams that are inexperienced with Kafka internals write and manage the client applications.
I don't have any problem with the motivation section. To me it mostly reads as explaining why users might want to do this, rather than being a strong recommendation that this is the best way. The only sentence that could be softened is "When clients are able to send their metrics to broker this eases their collection, and greatly simplifies diagnosing issues.", perhaps by removing "and greatly simplifies diagnosing issues".
I don't have a problem with the motivation section either. Maybe just softening that "greatly ..." as Kate stated above... but that's it :)
> ```yaml
> telemetryLabels:
>   - "client_id"
>   - "principal"
> ```
So, how do these labels apply to the Kafka server metrics? I would expect this configuration to be a bit more sophisticated:
- A separate section for the client-side metrics
- Maybe require enabling the client-side metrics first (without that, it seems like it is pretty easy to DoS the brokers through the client metrics)
- Have a separate configuration for the client metrics and for the server metrics?
- I guess it needs to distinguish the broker configuration from the client-based operand configuration as well? (i.e. there are no client metrics in Connect / MM2 / Bridge)
These labels would only apply to client metrics and have no impact on server metrics. Likewise, the `allowList` only filters server metrics; client metrics are selected via metric subscriptions.
I understand your concern that this could be confusing. We could start by adjusting the name of the new field, maybe `clientLabels` or `clientMetricsLabels` would be more explicit?
We've now split the reporter into server and client side modules, but it's not in a metrics-reporter release yet, and that will need to be introduced into the operator too. The client label configuration only applies to the server side module.
I would go with the `clientMetricsLabels` field name in this case.
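For illustration, a minimal sketch of what the earlier example could look like with the renamed field; this only shows the rename being discussed, not an agreed schema:

```yaml
# Sketch of the rename discussed above; the surrounding structure is unchanged
clientMetricsLabels:
  - "client_id"
  - "principal"
```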
> Then administrators must set metrics subscriptions to define the metrics clients will send. There are no default subscriptions. Subscriptions can be set, updated and deleted at runtime via the `kafka-configs.sh` or `kafka-client-metrics.sh` tools, or via the `Admin` API. For example:
>
> ```sh
> ./bin/kafka-client-metrics.sh --bootstrap-server localhost:9092 \
>   --alter --name topic-metrics \
>   --metrics org.apache.kafka.producer.topic. \
>   --interval 30000
> ```
This sounds like it should have some declarative way of being managed rather than relying on the Kafka commands ...
Chatting with @ppatierno last week, he also suggested introducing a mechanism to manage metric subscriptions.
I've not included this in the proposal yet as I'm not familiar enough with the operator to propose a design that makes sense. So help from maintainers would be appreciated.
Would a simple `ConfigMap` be enough to hold the subscriptions? Then we would need a way to attach it to the cluster definition. I also don't understand well the mechanism that would process and apply the subscriptions.
So AFAIU you can have one or more subscriptions, and for each of them you have a couple of lists (metric prefixes and filters) in addition to an interval value.
Thinking out loud, we should start from the `Kafka` custom resource and have a field dedicated to the client metrics (because we already have a `metricsConfig` field to support JMX and Prometheus server related metrics). If such a field is, for example, `clientsMetrics` (or whatever), it could be a list of subscriptions with the fields for the corresponding configuration, or it could point to a `ConfigMap`. I guess the choice could depend on how big this configuration could be, but also taking into account that a `ConfigMap` is not strongly typed (as a direct API in the Kafka CR would be), so we should think more about how to add such configuration (i.e. as JSON?).
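To make that concrete, a purely hypothetical sketch of such a field in the `Kafka` custom resource; every name below (`clientsMetrics`, `subscriptions`, `match`, `intervalMs`) is invented for illustration and nothing here is agreed:

```yaml
# Hypothetical sketch only: field names and schema are not part of the proposal yet
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    clientsMetrics:
      subscriptions:
        - name: topic-metrics
          metrics:
            - org.apache.kafka.producer.topic.
          match:
            - client_software_name=apache-kafka-java
          intervalMs: 30000
```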
@mimaison can the subscriptions be set at deploy time, or only at runtime? I see here you listed the dedicated shell script, but this can be configured via the configs script as well, can't it? Is it effectively a configuration option that can be updated dynamically (i.e. without rolling the brokers)?
Subscriptions can only be set, updated and deleted at runtime. Yes, you can also use the `kafka-configs.sh` tool or the `incrementalAlterConfigs()` Admin API to configure subscriptions.
> Is it effectively a configuration option that can be updated dynamically (i.e. without rolling the brokers)?

Exactly.
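For completeness, a sketch of what the equivalent `kafka-configs.sh` invocation for the example subscription might look like (worth double-checking the exact syntax against the Kafka version in use):

```sh
# Client metrics subscriptions are exposed as a client-metrics config resource,
# so the same subscription as the kafka-client-metrics.sh example can be created here
./bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type client-metrics --entity-name topic-metrics \
  --add-config 'metrics=org.apache.kafka.producer.topic.,interval.ms=30000'
```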
> When clients are able to send their metrics to broker this eases their collection, and greatly simplifies diagnosing issues. The mechanism to set subscriptions also allows users to precisely adjust metrics at runtime to collect the most relevant metrics.
>
> ## Proposal
Could we try to clearly split which parts of the proposal are for the metrics-reporter component and which are for the operator?
The part specific to the operator is in the `Updates to strimzi-kafka-operator` section. Anything else is for metrics-reporter.
> The reporter can convert the other fields as labels and add them to the metric series. In many cases it does not make sense to add all of them, for example if the cluster is behind a proxy the client address will always be the same. Also using all of them can create high cardinality label series which can be problematic in Prometheus.
>
> I propose introducing a configuration to select the metadata fields to use as labels:
>
> - `prometheus.metrics.reporter.telemetry.labels`: List of label names in client metrics. The valid names are `client_id`, `listener_name`, `security_protocol`, `principal`, `client_address`. This defaults to `client_id`. This configuration is reconfigurable at runtime.
Is it ok to default to an optional configuration? Maybe we could default to `client_id` and `client_address`.
I see what you mean about `client_id`, it could be empty if users don't set it. There's the `client_instance_id` label that is always injected so I thought that would be fine. I'm not really keen on defaulting to `client_address` as in virtualized environments this may change often and increase the cardinality of the metrics.
In the end you need to know which are the best labels to use for your own environment.
I agree to use `client_instance_id`, but I would also put `client_id` when non-empty. Wdyt?
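As a purely illustrative sketch, assuming the property ends up being exposed through the dynamic broker config mechanism (the proposal only says it is reconfigurable at runtime, not how), changing the labels could look roughly like this:

```sh
# Hypothetical: assumes the reporter's property is updatable as a dynamic broker config;
# square brackets group the comma-separated list value for kafka-configs.sh
./bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type brokers --entity-default \
  --add-config 'prometheus.metrics.reporter.telemetry.labels=[client_id,principal]'
```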
> When the reporter receives metrics from a client, it converts them and adds them into Prometheus registry, ready to be collected when the `GET /metrics` endpoint is scraped. The subscription specifies an interval for the client to send metrics. However when metrics are received, the interval or any details about the subscription that caused this emission is not available.
>
> So we need a mechanism to remove metrics from the registry when a client stops emitting metrics, otherwise if a client shuts down, the metrics reporter will keep reporting its metrics. As we don't have the interval, the simplest approach is to delete client metrics whenever they are scraped. While this may create gaps in metric series, this ensures that metrics collected are always valid.
What if we have two Prometheus instances scraping from the same SMR endpoint? Depending on configuration, one of them could be starved and never see client metrics.
Is it common to have 2 Prometheus servers scraping the same endpoints? I have to admit that's not a scenario I've considered.
I mean, why not? You could easily have dedicated Prometheus instances, for example one dedicated to server metrics and another one to client metrics, or maybe one instance collecting a critical set of metrics from all Kafka clusters, and then a smaller instance for each cluster with a bigger set of metrics.
What about deleting removed metrics on telemetry pushes? I don't know if there is any technical limitation, but you already have the old set in the Prometheus registry, so it should be possible to remove metrics by comparing them with the incoming set.
> What about deleting removed metrics on telemetry pushes? I don't know if there is any technical limitation, but you already have the old set in the Prometheus registry, so it should be possible to remove metrics by comparing them with the incoming set.

The issue is when the client stops pushing new metrics: you then keep the last value you received and start exposing stale metrics.
Right, that's a problem.
> However when metrics are received, the interval or any details about the subscription that caused this emission is not available.

Isn't this a design flaw that should be addressed in Kafka with a small KIP? Or can we use the admin client to get this information?
Overall this proposal makes sense to me. The only part that feels unfinished is how to manage the subscriptions. I do think that it ought to be configurable declaratively, but I don't feel I understand how the subscriptions work well enough to make a concrete suggestion of the best way to handle it, so I've added some questions that will hopefully help me understand better.
> ### Client metrics life cycle
>
> When the reporter receives metrics from a client, it converts them and adds them into Prometheus registry, ready to be collected when the `GET /metrics` endpoint is scraped. The subscription specifies an interval for the client to send metrics. However when metrics are received, the interval or any details about the subscription that caused this emission is not available.
If Strimzi provided a way to manage the subscription, and therefore could inspect the interval, would that enable a better process for managing the lifecycle of the client metrics?
Not really. You'd have to find the matching rule. This would also not work outside of the Strimzi operator.
I'm currently considering whether we can update the `ClientTelemetry` API and provide the push interval.
> - `kafka_server_client_metrics_throttle_count{client_instance_id="<ID>"}`/`kafka_server_client_metrics_throttle_rate{client_instance_id="<ID>"}`: `PushTelemetry` requests can be throttled if they violate quotas defined by the cluster administrator. The count and rate of throttled `PushTelemetry` requests.
> - `kafka_server_client_metrics_unknown_subscription_request_count`/`kafka_server_client_metrics_unknown_subscription_request_rate`: The count and rate of `PushTelemetry` requests received from clients that have not retrieved their subscriptions.
>
> Administrators should monitor these metrics when they set subscriptions.
Will we document this as a recommendation, or did you have something else in mind?
Yes, this recommendation could go in the docs. I think it's important to understand the extra load subscriptions can add to your clusters.
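For example, a quick way to eyeball these counters after setting a subscription could be something like the following (host and port are illustrative only; they depend on how the reporter's listener is configured):

```sh
# Illustrative only: scrape the reporter endpoint and filter for the client metrics counters
curl -s http://my-cluster-kafka-0:8080/metrics \
  | grep -E 'kafka_server_client_metrics_(throttle|unknown_subscription_request)'
```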
> I think it's important to understand the extra load subscriptions can add to your clusters.

Do we have some quantification here? I mean roughly in %, how much overhead could it add? (I know it depends on factors like number of clients, metrics, and intervals, but in general...)
No, I don't have numbers as I have only run this in a local one-node cluster. As you said, the impact will largely depend on the number of clients and the subscriptions enabled, so it probably ranges from insignificant to a major slowdown. That's why I mention these metrics, as they should help evaluate the impact of handling client metrics.
> When the reporter receives metrics from a client, it converts them and adds them into Prometheus registry, ready to be collected when the `GET /metrics` endpoint is scraped. The subscription specifies an interval for the client to send metrics. However when metrics are received, the interval or any details about the subscription that caused this emission is not available.
>
> So we need a mechanism to remove metrics from the registry when a client stops emitting metrics, otherwise if a client shuts down, the metrics reporter will keep reporting its metrics. As we don't have the interval, the simplest approach is to delete client metrics whenever they are scraped. While this may create gaps in metric series, this ensures that metrics collected are always valid.
Does this mean that if Prometheus is scraping the endpoint, but then a user directly queries it, it would prevent the metrics from being collected by Prometheus? If so, we would need to make sure this is very clear in the docs. I would assume mostly this wouldn't happen, but when troubleshooting sometimes people do unexpected things.
> Does this mean that if Prometheus is scraping the endpoint, but then a user directly queries it, it would prevent the metrics from being collected by Prometheus?

With the current proposal, when the endpoint is scraped by anything, user or Prometheus, all existing client metrics in the Prometheus registry are returned, and then the client metrics are deleted from the registry. Clients send their metrics periodically based on the configured push interval. Until clients push their metrics again, if the endpoint is scraped again, there won't be any client metrics.
I was wondering if, for debugging purposes, when the user wants to scrape without deleting metrics, we could have a query parameter like `?clean=true|false` (defaulting to `true` when it's not present, which is what Prometheus does) in order to allow the user to scrape without deletion. But maybe I am overthinking.
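If such a parameter were adopted, a debugging session might look roughly like this (the parameter name, host and port are all hypothetical):

```sh
# Hypothetical: inspect client metrics without deleting them from the registry
curl -s "http://my-cluster-kafka-0:8080/metrics?clean=false" | grep client_instance_id

# A normal scrape (what Prometheus would issue) would still delete client metrics after returning them
curl -s http://my-cluster-kafka-0:8080/metrics
```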
This proposes adding support for KIP-714 to `metrics-reporter`. This allows clients to send their metrics to brokers, simplifying the monitoring and troubleshooting of client-side issues.