Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Add type and unit metadata labels #39

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

dashpole
Copy link

@dashpole dashpole commented Oct 24, 2024

Some screenshots from the Prometheus UI after this PoC: prometheus/prometheus@main...dashpole:prometheus:type_and_unit_labels

The PoC updates all PromQL tests to demonstrate that adding type and unit labels doesn't break existing queries.

Querying for __unit__

Screenshot 2024-10-23 at 3 31 51 PM

@dashpole dashpole force-pushed the type_and_unit_labels branch from 8aab401 to 25305b0 Compare October 24, 2024 15:21
Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start! I think we need to discuss a bit more whether to handle __* labels in the PromQL engine or the UI. Maybe there's a chance we're doing a breaking change...?

proposals/2024-09-25_metadata-labels.md Outdated Show resolved Hide resolved
proposals/2024-09-25_metadata-labels.md Show resolved Hide resolved
* Users might be surprised by, or dislike the additional suffixes and delimiters in the metric name results
* Mitigation: Opt-in for query engines?

### "Hide" __type__ and __unit__ labels in PromQL, instead of UI
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how safe this is. If PromQL does not hide the return, new labels will suddenly appear in users' queries.

I'm also not sure about the potential impact other than dashboards/alerts having new labels, could that be a problem?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm definitely open to this alternative. I like having the option of showing __type__ and __unit__ in the UI, but if it has to wait for Prom 4.0 because it is breaking, that seems like a significant downside.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it could be also merge of both, where PromQL only returns those labels based on something e.g. on conflict or when type or unit is selectors.

The downside is that we now have to really deeply design groupings and joins.. should they use those labels or not. With the current proposal the answer is trivial and straightforward. This helps with creating alerts and recording rules etc

I think I would be keen to try the current approach e.g. behind feature flag and see who will use it and how to tell more. cc @juliusv @roidelapluie WDYT from UX point of view for non-UI things?

proposals/2024-09-25_metadata-labels.md Outdated Show resolved Hide resolved
proposals/2024-09-25_metadata-labels.md Show resolved Hide resolved
Signed-off-by: David Ashpole <[email protected]>
@dashpole dashpole force-pushed the type_and_unit_labels branch from 98a86bd to e56b2be Compare November 1, 2024 15:35
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for proposing this. I love it in general, and as said elsewhere, I had a very similar idea a while ago for completely different reasons (mostly to avoid mixed series with native histograms and float-based sample types). Finding a similar solution for unrelated problems seems like a signal that this could be actually useful. However, I have also run into some conundrums, which was the reason why I put the whole idea on the backburner. I'll try to sketch out my train of thought here:

The __name__ label is the precedent here for "special labels", and as such it gives us a hint of what the issues might be. It's special power is not just about how to display it. It is treated specially for label matching and aggregation, and it is removed by most operations, following the argument that a rate of process_cpu_seconds_total is not process_cpu_seconds_total anymore. We have to ask ourselves the same questions for __unit__ and __type__.

In the brave new world, process_cpu_seconds_total would probably look like process_cpu{__type__="counter", __unit__="seconds"}. If we rate that, the outcome is not a counter anymore (but a gauge), and it is actually unit-less (seconds per second, i.e. it is the CPU usage ratio. In wishful thinking, the outcome should be process_cpu_usage{__type__="gauge", __unit__="ratio"}. Of course, we cannot easily come up with a general procedure to make up new names (which is the reason why current PromQL simply removes the name), and even changing the unit is tough. Changing the type might actually be feasible (we explored that a bit with native histograms, which "know" if they are a counter histogram or a gauge histogram). So maybe we can come up with "type translation rules" for all PromQL operations, and then maybe drop the unit in doubt (when calculating a rate or multiplying) and keep it when it makes sense (a sum aggregation or adding). But you will notice how deep we are in the woods here already.

The next thing is aggregation and label matching. As said, __name__ is also special here, but we probably need to be special in different ways for __unit__ and __type__. For example if we simply do a + b, we probably want to include unit and type in the label match (only add gauges to gauges, only add seconds to seconds etc.). However, a * b is already very different. b might be a scaling factor, i.e. a unit-less gauge. a could be counters or histograms. So in this case, we would like to exclude the unit and type from the label match.

It gets very confusing quickly. I wish we could just add the __unit__ and __type__ as an experiment without further special treatment and see how people cope with it. But right now my worry is it will hit too many road bumps…


* [Required] OpenTelemetry users can query for the original names of their metrics.
* [Required] Users can filter by the metric type or unit in PromQL queries.
* [Required] PromQL returns a warning when querying across metrics with different units or types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And PromQL could finally warn reliably about "inappropriate" operations (apply rate to a gauge, delta to a counter, …). (Recently, Prometheus has added an info-level annotation for cases where rate is applied to a metric that doesn't end on _total etc., but that's error prone (gauges could have a name ending on _total, and not all counters end on _total).)

By making unit and type a label (and thus part of the series' identity) it becomes implicitly harder to accidentally query across metrics with different unit or type anyway (there aren't any "mixed series" anymore). Maybe that is something we could work into the goals as preventing the mismatch from happening in the first place is a much more attractive selling point than just stating that we can "return a warning".

Copy link
Member

@ArthurSens ArthurSens Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that's a happy side effect of the proposal, but not a requirement :)

Copy link
Author

@dashpole dashpole Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning about inappropriate operations sgtm. Added to goals.

In the context of preventing collisions from happening in the first place, i've been thinking about what best practices should be. If users write queries and always include __unit__ and __type__ filters in queries, it is impossible for mismatch to happen in the first place. It also makes queries more readable, as you can easily tell the type and unit of the metric when reading a query. But it does seem quite verbose, and I'm pessimistic that users would actually do this if we tell them they should.


When a query for a metric returns multiple metrics with a different `__type__` or `__unit__` label, but the same `__name__`, users see a warning in the UI.

Users don't see the `__type__` or `__unit__` labels in the Prometheus UI next to other labels, but the unit displayed next to the value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe the type next to the metric name and labels, but as a colored badge or something (cf. how it's done in the PGW, which utilizes the full exposition data model so it even has the help string, which is of course out of scope here).

image

proposals/2024-09-25_metadata-labels.md Outdated Show resolved Hide resolved
proposals/2024-09-25_metadata-labels.md Outdated Show resolved Hide resolved
proposals/2024-09-25_metadata-labels.md Show resolved Hide resolved
proposals/2024-09-25_metadata-labels.md Outdated Show resolved Hide resolved
* Doing this would require some changes to client libraries, which is significantly more work, and is harder to experiment with.
* There can be conflicts between `# TYPE` and `# UNIT` metadata for a metric, and `__type__` and `__unit__` labels. This adds complexity to understanding the exposition format, and requires establishing rules for dealing with conflicts.

### More metadata labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I like the idea to model OTel's notion of "identifying metadata" as labels in Prometheus (to be distinguished from "non-identifying metadata", which is closer to Prometheus's original metadata idea).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately OTel resource attributes are all considered identifying today (correct me if I'm wrong David).

There is work in progress to replace resource attributes with a new thing called Entity. Entity will be able to distinguish, using otel lingo, Identifying from Descriptive attributes, but this will take some years to get popular in the industry

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTel resource attributes are all considered identifying today, but we don't treat them all as identifying when translating from OTel -> Prometheus. Instead, we just treat service.name + service.instance.id + service.namespace as identifying. That is correct if the OTLP originates from an OTel-instrumented application, but not if it wasn't (e.g. comes from an otel collector receiver).

Anyways, what I had in mind was more things that users don't think of as "labels", like schema url, or maybe scope name + version. I think resource attributes make sense as "real" labels, rather than metadata.

@beorn7
Copy link
Member

beorn7 commented Nov 12, 2024

It gets very confusing quickly. I wish we could just add the __unit__ and __type__ as an experiment without further special treatment and see how people cope with it. But right now my worry is it will hit too many road bumps…

Still in brainstorming mode, here is an idea how maybe a step zero could look like: Add the __unit__ and __type__ upon ingestion, but mostly ignore them in PromQL operations, maybe with the following exceptions:

  • __unit__ and __type__ can be used in label selectors. This allows the filtering (listed as a goal in this doc) and could also be used to increase readability of a PromQL expression.
  • __unit__ and __type__ can still be added via label_replace, e.g. to improve display of a result or to give the result of recording rules the unit and type information.

However, aggregations and label matches would ignore __unit__ and __type__ and would therefore work as before, and any operation would remove the __unit__ and __type__ label (with the exception of label_replace), both meant to circumvent any of the issues laid out above.

From there, we could cautiously move forward with more ideas. A step one might then be to handle __type__ in a more "native" fashion, i.e. attaching the "correct" __type__ label depending on the operation, still allowing the user to override via label_replace (like hard-casting a type in C), but maybe enforce a histogram type on native histograms (and prevent such a type on floats), so that we would avoid mixed series. Also, taking __type__ into account during aggregations might be possible.

@bwplotka
Copy link
Member

unit and type can still be added via label_replace, e.g. to improve display of a result or to give the result of recording rules the unit and type information.

Hm, I think it would be also nice to display them when collision actually occurs 🤔

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Looking good, some comments.

proposals/2024-09-25_metadata-labels.md Outdated Show resolved Hide resolved

When a query for a metric returns multiple metrics with a different `__type__` or `__unit__` label, but the same `__name__`, users see a warning in the UI.

Users don't see the `__type__` or `__unit__` labels in the Prometheus UI next to other labels, but the unit displayed next to the value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so this is UI, but what we envision in PromQL response? Should it give those labels back in JSON? If yes, in what order?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes back to TSDB API too. I assume TSDB API provide those labels as normal labels (worth to clarify?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the alternative explains it, should we mention that both API and PromQL contains those labels as-is?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the TSDB API. Can you link to it?

proposals/2024-09-25_metadata-labels.md Outdated Show resolved Hide resolved
proposals/2024-09-25_metadata-labels.md Show resolved Hide resolved

### Prometheus Server Ingestion

When receiving OTLP or PRW 2.0, or when scraping the text, OM, or proto formats, the type and unit of the metric are added as the `__type__` and `__unit__` labels, if not already present.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For exposition formats, I guess it means this would allow:

TYPE metric counter
UNIT metric seconds
HELP ...
metric{__type__="gauge"}

Ideally I do opposite to what's written, so TYPE and UNIT wins here, or this could be blocked even by spec. Essentially I wonder about efficiency of parsers here.

For ingestion protocols, at least it feels metadata should override it (not what is written now) so it's easier for everyone.

Any further concerns around perf of parsing here for this @bboreham?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about what would be on the /federate endpoint if the Prometheus server scraped conflicting metrics from applications. Our choices are:

  1. [current] Only #UNIT metadata, and no __unit__ labels.
  2. Only __unit__ labels, and no #UNIT metadata.
  3. [proposal] Mixed __unit__ labels, and a #UNIT with one of the units the metric has.
  • 1 will result in collisions, but maybe that is acceptable for the /federate endpoint. It isn't a regression from what we have today.
  • 2 works, but would be breaking for existing users that are consuming #UNIT.
  • 3 is confusing because of the mixing, but I wanted to leave that open as an option.

I'm OK with disallowing mixing __unit__ and #UNIT, since we could relax that in the future if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline, and:

  1. would epic to follow 1 and allow multiple the same named metric for different types and units in OM, let's try proposing that, it would help massively with exporters and federate
  2. type and unit from metadata is primary (in case of mix)
  3. Should we allow type and unit at all in exposition, SHOULD NOT maybe?

Copy link
Member

@ArthurSens ArthurSens Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too fond of adding __unit__ and __type__ as labels in the exposition format. I think they should stay as a single line at the top of the metric family like it is today.

The Unit/Type will be the same for all time series in a metric family. If they become labels in the exposition format, it will just be repeated information for all time series. It increases the payload, and we waste CPU time parsing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have "Users see no difference to exposition formats." stated above, but if you put the type and especially the unit in the # UNIT part then you need them in the metric name as well, defeating the purpose. You're other option is to put them into the series on each line.

Looking at /metrics is for humans and follows what you see is what you get in terms of showing what you can query - implying that if you can query by __type__ or the API/UI returns the __type__ then you should put them on each line.

And the implication of that is that there needs to be at least one more efficient exposition format that is not so verbose for machines to read.

So I'm not sure this would work without either putting the new labels on every line or modifying OpenMetrics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course the above argument doesn't stand if we follow @beorn7 's advice and not show __type__, __unit__ on the query API (/query , /query_range) , although I would love to have it on the metadata queries endpoint like (/series) for Grafana to be able to know what's a float series and what's a native histogram.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should show __type__ and __unit__ in the query APIs. I'm just saying, in the zeroth step, we should remove those labels in PromQL operations (like the name is removed).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with showing __type__ and __unit__ in query APIs, just like we show __name__.

We have "Users see no difference to exposition formats." stated above, but if you put the type and especially the unit in the # UNIT part then you need them in the metric name as well, defeating the purpose. You're other option is to put them into the series on each line.

That's a really good catch! This is something we'll probably need to tackle with OM 2.0. Is this what you had in mind with this issue @bwplotka prometheus/OpenMetrics#280?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "Users see no difference to exposition formats", I meant that this proposal does not change how metrics are currently exposed in exposition formats. I'll leave that for follow-up discussions in prometheus/OpenMetrics#280 and elsewhere. I updated the language in 0a545c8.

I am open to ways of querying for type and unit that align better with the existing #TYPE and #UNIT comments, if we want to preserve the "what you see is what you get" experience of Prometheus. We do support querying for __name__, so there is at least some precedence.

* Users might be surprised by, or dislike the additional suffixes and delimiters in the metric name results
* Mitigation: Opt-in for query engines?

### "Hide" __type__ and __unit__ labels in PromQL, instead of UI
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it could be also merge of both, where PromQL only returns those labels based on something e.g. on conflict or when type or unit is selectors.

The downside is that we now have to really deeply design groupings and joins.. should they use those labels or not. With the current proposal the answer is trivial and straightforward. This helps with creating alerts and recording rules etc

I think I would be keen to try the current approach e.g. behind feature flag and see who will use it and how to tell more. cc @juliusv @roidelapluie WDYT from UX point of view for non-UI things?

@dashpole dashpole force-pushed the type_and_unit_labels branch from 15b0c44 to a2e5ed9 Compare November 26, 2024 02:52
@dashpole dashpole force-pushed the type_and_unit_labels branch from d4ecde7 to bce6d9c Compare November 26, 2024 03:46
@dashpole
Copy link
Author

@beorn7 I've added "Aggregations and label matches ignore __unit__ and __type__ and any operation would remove the __unit__ and __type__ label (with the exception of label_replace)." based on your suggestion above. I've also added "Handle type and unit in PromQL operations" as a potential future extension.

@beorn7
Copy link
Member

beorn7 commented Nov 26, 2024

Quick note about federation: To my knowledge, federation still completely ignores metadata, so the exposition format it generates has all metric types as "untyped" and no units and help strings anyway, simply because Prometheus doesn't know about metadata post-ingestion. So being able to funnel unit and type through the Prometheus TSDB via labels would be a direct improvement for federation.

@ArthurSens
Copy link
Member

Quick note about federation: To my knowledge, federation still completely ignores metadata, so the exposition format it generates has all metric types as "untyped" and no units and help strings anyway, simply because Prometheus doesn't know about metadata post-ingestion. So being able to funnel unit and type through the Prometheus TSDB via labels would be a direct improvement for federation.

I think it would be better to enable metadata post-scrape/federation than adding type and unit as labels in the exposition format. See #39 (comment)

@beorn7
Copy link
Member

beorn7 commented Nov 27, 2024

Quick note about federation: To my knowledge, federation still completely ignores metadata, so the exposition format it generates has all metric types as "untyped" and no units and help strings anyway, simply because Prometheus doesn't know about metadata post-ingestion. So being able to funnel unit and type through the Prometheus TSDB via labels would be a direct improvement for federation.

I think it would be better to enable metadata post-scrape/federation than adding type and unit as labels in the exposition format. See #39 (comment)

The problem is that federation breaks the assumption that all metrics of the same name (one metric family) all have the same metadata.

@bwplotka
Copy link
Member

The problem is that federation breaks the assumption that all metrics of the same name (one metric family) all have the same metadata.

Yea, I wonder if we could lift this with OM 2.0. WDYT @beorn7 ?

Related discussion: #39 (comment)

@beorn7
Copy link
Member

beorn7 commented Nov 27, 2024

The problem is that federation breaks the assumption that all metrics of the same name (one metric family) all have the same metadata.

Yea, I wonder if we could lift this with OM 2.0. WDYT @beorn7 ?

The current structure of both protobuf and text versions of OM doesn't really lend itself to multiple different type, unit, or help entries for metrics with the same name. So we needed a new structural way of specifying unit and type (let's ignore help for now) per metric (rather than per metric family). Ironically, we already have a place for things that are per metric. We call it labels. So let's put type and metric into labels? Wait… that's what this proposal actually proposes. 😁

@bwplotka
Copy link
Member

bwplotka commented Nov 27, 2024

Well yes, but we just discussed that type and unit per label will not be easily parsable in OM/text, so putting it somewhere else actually helps. There might be a big questions there:

  1. Is Metric Family concept actually useful these days? Should OM 2.0 simply redefine it, and flat that out? Personally it's only confusing, we don't use this semantic on the storage or PromQL (AFAIK). 💣
  2. What's wrong to keep metric families but make them unique per type and unit? (not only name)?

@beorn7
Copy link
Member

beorn7 commented Nov 27, 2024

Well yes, but we just discussed that type and unit per label will not be easily parsable in OM/text, so putting it somewhere else actually helps.

I wouldn't say that this is actually the case. The way we parse the labels doesn't really create a lot of parsing cost if there are more labels. It does create more payload, but not so much in relative terms. The OM text format is designed to have a lot of repetitive information anyway. If we want to avoid repetitive information, we needed a fundamentally different structure.

  1. Is Metric Family concept actually useful these days? Should OM 2.0 simply redefine it, and flat that out? Personally it's only confusing, we don't use this semantic on the storage or PromQL (AFAIK). 💣

It correctly models the per-scrape data model. It avoids (a part of the) repetition. Of course you can flat that out, but then you get the repetition that was marked above as "not easily parseable". My claim is that "flattening that out" is pretty much equivalent to "adding that as labels".

  1. What's wrong to keep metric families but make them unique per type and unit? (not only name)?

Well, the text format originally was designed to not be ordered in any way. So every line was keyed by the metric family name, i.e. a metric line starts with the metric family name, and all the metadata lines contain the metric family name as the 2nd token after the hash. OM became somewhat dependent on order, but not consequently so (you could avoid a lot of repetition if you did that) (another of the design problems I see with OM). With the original idea of the text format structure, you needed to repeat not just the metric family name on each line, but the combination of metric family name, type, and unit (which, you guessed it, is equivalent to putting type and unit into a label). Alternatively, we could redesign the text format fundamentally and make it really depend on the order everywhere, but then we should do it thoroughly and avoid other repetitions, too (at the very least the metric family name).

Protobuf is a bit different as it is structured, and all the family members are a repeated Metric message inside a MetricFamily message. So keying on more than just the name is relatively easy to accomplish. This is another reason why I think we should "simply" solve the protobuf parsing performance issue and then go back to "if you want efficiency, use protobuf – if you want simplicity, use text" (simplicity as in "easy to create in a valid form", which means kicking out many of the requirements OM text introduced on top of the original text format, like order, whitespace, …).

@bwplotka
Copy link
Member

bwplotka commented Nov 28, 2024

The OM text format is designed to have a lot of repetitive information anyway. If we want to avoid repetitive information, we needed a fundamentally different structure.

I thought it's more the fact we usually need to know what's the metric type (and metric family name) ahead of time (e.g. histograms have a different flow). But you are right, perhaps we could make label based approach work fine. Then it's only human readability question and what's defined in SDK and what's queried.

It correctly models the per-scrape data model.

I think I got it now, FAIK this only makes a difference for unstructured types like classic histogram and summary where metric name you define in SDKs (aka metric family name) is != resulting metric name. In the word with native histograms and perhaps native summaries one day, it would make no different, right?

My claim is that "flattening that out" is pretty much equivalent to "adding that as labels".

Agree.

Well, the text format originally was designed to not be ordered in any way. So every line was keyed by the metric family name, i.e. a metric line starts with the metric family name, and all the metadata lines contain the metric family name as the 2nd token after the hash. OM became somewhat dependent on order, but not consequently so (you could avoid a lot of repetition if you did that) (another of the design problems I see with OM). With the original idea of the text format structure, you needed to repeat not just the metric family name on each line, but the combination of metric family name, type, and unit (which, you guessed it, is equivalent to putting type and unit into a label). Alternatively, we could redesign the text format fundamentally and make it really depend on the order everywhere, but then we should do it thoroughly and avoid other repetitions, too (at the very least the metric family name).

Got it, thank you for detailed explanation! I think we should discuss those options in a clear proposal, make some decision on this in OM 2.0, will add issue for this.

Protobuf is a bit different as it is structured, and all the family members are a repeated Metric message inside a MetricFamily message. So keying on more than just the name is relatively easy to accomplish. This is another reason why I think we should "simply" solve the prometheus/prometheus#14668 and then go back to "if you want efficiency, use protobuf – if you want simplicity, use text" (simplicity as in "easy to create in a valid form", which means kicking out many of the requirements OM text introduced on top of the original text format, like order, whitespace, …).

Nice! Proposing as an intention in our OM 2.0 doc

@bwplotka
Copy link
Member

Created related issue on OM 2.0 prometheus/OpenMetrics#280

proposals/2024-09-25_metadata-labels.md Show resolved Hide resolved
proposals/2024-09-25_metadata-labels.md Show resolved Hide resolved

### Prometheus Server Ingestion

When receiving OTLP or PRW 2.0, or when scraping the text, OM, or proto formats, the type and unit of the metric are added as the `__type__` and `__unit__` labels, if not already present.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have "Users see no difference to exposition formats." stated above, but if you put the type and especially the unit in the # UNIT part then you need them in the metric name as well, defeating the purpose. You're other option is to put them into the series on each line.

Looking at /metrics is for humans and follows what you see is what you get in terms of showing what you can query - implying that if you can query by __type__ or the API/UI returns the __type__ then you should put them on each line.

And the implication of that is that there needs to be at least one more efficient exposition format that is not so verbose for machines to read.

So I'm not sure this would work without either putting the new labels on every line or modifying OpenMetrics.

@mrueg
Copy link

mrueg commented Dec 1, 2024

When reading this proposal, a thought that came to my mind is:

Wouldn't it be helpful to standardize on base units for the unit label and have an optional __unit_scale__ label in addition?

Milliseconds would become
{__unit__="seconds", __unit_scale__="10e-3"}

This will help if you have two timeseries, one very small and one very large and want to ensure you have a good precision on both.

(This is example is just for illustrative purposes, please don't make me do float64 math)

storage_space_used{drive="A",__unit__="bytes",__unit_scale__="10e15"} 1.2345678910111213
storage_space_used{drive="B",__unit__="bytes",__unit_scale__="10e3"} 1.2345678910111213

Right now this wouldn't be possible as I either had to set my unit to petabytes and won't get enough precision on the smaller values or bytes and will have issues on the larger values.

(If this is out of scope, similar to mixed units, maybe it could be recorded as a possible future extension)

@dashpole
Copy link
Author

dashpole commented Dec 4, 2024

@mrueg Can you elaborate on how the new label would get used? It is possible with the current proposal to simply use "petabytes" as the unit today (although our conventions recommend against that). Is your intention that PromQL would use it when aggregating between two series, and scale numbers before performing operations? Or is this to make life easier for the UI so it only needs to understand base units?

Signed-off-by: David Ashpole <[email protected]>
@mrueg
Copy link

mrueg commented Dec 4, 2024

@mrueg Can you elaborate on how the new label would get used? It is possible with the current proposal to simply use "petabytes" as the unit today (although our conventions recommend against that).

Would I be able to have

storage_space_used{drive="A",__unit__="petabytes"} 1.2345678910111213
storage_space_used{drive="B",__unit__="bytes"} 1.2345678910111213

?
I assumed the proposal says the __unit__ field needs to be the same over all series in the same metric.

Is your intention that PromQL would use it when aggregating between two series, and scale numbers before performing operations? Or is this to make life easier for the UI so it only needs to understand base units?

For the UI, Alerting etc. it would definitely be easier that way as the scaling becomes easier, only base units to support and it's clear if it's 2eX or 10eX (in case of e.g. Mebi/Mega).
For PromQL it could lead to higher precision and reduce the error when e.g. summing up multiple series. I remember seeing some Kahan summation in Prometheus, so not sure how much the error would be further improved by having a separate (and optional) __unit_scale__ field, I assume one would need to sort the values that need to get summed up first(?)

@beorn7
Copy link
Member

beorn7 commented Dec 5, 2024

I don't think that the precision argument really holds. IEEE 754 floats already have a built-in "scale" (the exponent). I don't see how adding another one will help in any way. (Unless we are talking about hitting the limits of IEEE 754, but that has never been a problem in practice.) The Kahan summation helps with precision issues if you have numbers of very different size in the mix, but as said, adding another exponent on top of IEEE 754's one, is not helping with that.

I could see a possible use for some kind of "scale" field: specifying the factor to convert to the base unit so that we can auto-convert metrics. However, I would not start to pull that into scope for this proposal. Let's think about those things once we have a solid concept of treating units in general.

I assumed the proposal says the __unit__ field needs to be the same over all series in the same metric.

My understanding is exactly the opposite: Since OTel allows metrics with the same name but different unit or type, Prometheus cannot ingest thim without collisions if we completely remove unit and type (rather than encoding it in the name (current state) or in labels (this proposal). Quote from the proposal: "However, OpenTelemetry currently considers the metric type and unit identifying, whereas Prometheus does not. Simply removing suffixes might result in "collisions" between distinct OpenTelemetry metrics which have the same name, but different types (less commonly) or units."

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just same more musings about the "real suffixes" approach.

These comments are not meant to change the proposal, just thought how this alternative might actually be compatible with the main proposal after all (i.e. we don't have to do it now, but we could add it incrementally at a later stage).

proposals/2024-09-25_metadata-labels.md Outdated Show resolved Hide resolved

## Alternatives

### “Real” Type and Unit suffixes in **name**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To view this alternative from a different perspective:

We could see it as "syntactic sugar":

http_request_duration{__unit__="seconds", __type__="histogram"} would be just another syntax for http_request_duration~seconds.histogram. (Order is arbitrary, and both are optional. Could be http_request_duration.histogram~seconds or http_request_duration.histogram or http_request_duration~seconds.)

This would be fully compatible with this proposal. It would make it much easier to add units and/or types to the metric name, so it would address the concern that you cannot see the unit and type anymore by looking at a PromQL expression without supporting tooling. If we allowed .total as an alias of .counter, we would have very little visible change. http_requests_total would become http_requests.total.


This solution is not chosen because:

* Requires PromQL changes (intrusive), touches on “dot” operator ideas.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, using a real . operator to separate something like the type is one of the ideas the . was reserved for. So it's more a pro than a con. :)

Copy link
Member

@ywwg ywwg Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seconded. If we keep reserving the dot for the future, we'll never use it. If the dot is useful here, let's take advantage

Copy link
Member

@ywwg ywwg Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OR, we could decide we never really need the dot operator and we can officially add it to the blessed-character list, and then people can avoid the quoting syntax just for dots! I actually think that's an acceptable solution as well

This solution is not chosen because:

* Requires PromQL changes (intrusive), touches on “dot” operator ideas.
* Adding suffixes outside of quotes looks strange: {“http.server.duration”.seconds/histogram}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would apply those operators outside of the braces anyway {"http.server.duration"}~seconds.histogram. Looks less strange IMHO. (For conventional names, it could also follow the name before (but still outside) the braces: http_server_duration~seconds.histogram{job="foobar"} or http_server_duration{job="foobar"}~seconds.histogram

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't mind putting the unit/type outside of the braces. I could also see some syntax with other gathering operators like:

foo{bar="baz"}(seconds,histogram) or foo{bar="baz"}{seconds,histogram}

but that is not very pretty


* Requires PromQL changes (intrusive), touches on “dot” operator ideas.
* Adding suffixes outside of quotes looks strange: {“http.server.duration”.seconds/histogram}
* Rolling this out would be breaking for existing Prometheus users: E.g. {foo_seconds} becomes {foo.seconds/histogram}. Could this be part of OM 2.0?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the same breakage you get if you switch from the current name mangling to "please keep OTel names as is" mode, isn't it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For OTel users, yes: this will involve a breaking change regardless. When I wrote this, I think I was assuming that prometheus users would move away from explicit unit suffixes because of the new suffixes introduced here (or end up with foo_seconds~seconds.histogram). I guess at least they can continue to query for foo_seconds like before, so maybe moving away from explicit unit suffixes won't be neccessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants