Add Sampling SIG research notes #213
base: main
Conversation
In SIG meeting Peter noted that the resulting estimates would have such enormous error if ever a span _was_ sampled at a rate of 2^-63 that this "workaround" is merely a curiosity, not something that'd be practically useful.
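For intuition only (this sketch is not part of the notes or the SIG discussion): the adjusted count attributed to a single span sampled at 2^-63 is 2^63 ≈ 9.2×10^18, so any estimate that happens to include such a span is dominated by it.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Sampling probability of 2^-63, as discussed above.
	p := math.Pow(2, -63)

	// The "adjusted count" (inverse probability) that a single sampled span
	// contributes to any count estimate.
	adjustedCount := 1 / p
	fmt.Printf("adjusted count per sampled span: %.3g\n", adjustedCount) // ~9.22e+18

	// If the true population were, say, one million spans, the estimate is
	// almost always 0 (nothing sampled), and on the rare occasion a span is
	// sampled it jumps to ~9.2 quintillion. Either way the error is enormous,
	// which is why this "workaround" is a curiosity rather than something
	// practically useful.
	truePopulation := 1e6
	fmt.Printf("overestimate factor if one such span is sampled: %.3gx\n",
		adjustedCount/truePopulation)
}
```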
1. "Statistics" can be anything from [RED metrics](https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/) by service, to data used to answer richer questions like "Which dimensions of trace data are correlated with higher error rate?". You want to ensure that all inferences made from the data you *do* collect are valid. | ||
2. Setting sampling error targets is akin to setting Service Level Objectives: just as one aspires to build *appropriately reliable* systems, so too one needs statistics which are *just accurate enough* to get valid insights from, but not so accurate that you excessively sacrifice goals #1 and #2. | ||
3. An example consequence of this goal being unmet: metrics derived from the trace data become spiky and unfit for purpose. | ||
4. Ensure traces are complete. |
I feel this is a bit too strong. While we want to see many complete traces, we must not require that all traces are complete. Consider infrequently used sub-services which might not get enough representation when all sampling decisions are made at the root. BTW, poor coverage for such services is a weak point across the whole surveyed landscape today.
- limiting: Support all of: spans per second, spans per month, and GB per month (approximation is ok)
- degree of limiting: Soft is ok
- horizontally scalable: Yes
- Prioritize tail sampling in Collector over head sampling in SDK
I don't think this fits well here. It is a technical design decision. While I agree with this sentence in practice, users may even have the opposite opinion, as tail-based sampling is generally more expensive than head-based sampling.
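As a rough illustration of what "soft" limiting can mean in practice (a hypothetical sketch, not drawn from any surveyed system): a token bucket enforces a spans-per-second target only approximately, so brief bursts may exceed the target while sustained throughput converges to it.

```go
package sampling

import (
	"sync"
	"time"
)

// softLimiter is a hypothetical token-bucket limiter illustrating "soft"
// spans-per-second limiting: short bursts may exceed the target rate, but
// sustained throughput converges to it.
type softLimiter struct {
	mu         sync.Mutex
	perSecond  float64 // target spans per second
	burst      float64 // extra headroom tolerated in a burst
	tokens     float64
	lastRefill time.Time
}

func newSoftLimiter(perSecond, burst float64) *softLimiter {
	return &softLimiter{perSecond: perSecond, burst: burst, tokens: burst, lastRefill: time.Now()}
}

// Allow reports whether one more span fits within the soft limit right now.
func (l *softLimiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	elapsed := now.Sub(l.lastRefill).Seconds()
	l.lastRefill = now

	// Refill tokens at the target rate, capped at the burst size.
	l.tokens += elapsed * l.perSecond
	if l.tokens > l.burst {
		l.tokens = l.burst
	}
	if l.tokens < 1 {
		return false // over the limit: drop (or defer) this span
	}
	l.tokens--
	return true
}
```

One way to approximate the "horizontally scalable" requirement with a limiter like this is to give each collector or SDK instance its own limiter configured with a share of the global budget, accepting some imprecision in exchange for avoiding coordination.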
1. Reduce or limit costs stemming from the construction and transmission of spans.
   2. Analytics queries are faster when searching less data.
2. Respect limits of downstream storage systems.
   1. Trace storage systems often have data ingest limits (e.g., GBs per second, spans per second, spans per calendar month). The costs of exceeding these limits can be either reduced reliability or increased hosting expenditures.
I think similarly "hard" limitations apply for the tracers, collector and the network. Collecting too much data up front can lead to excessive memory usage, CPU or network saturation and can cause not only performance issues, but application malfunction as well.
- Per-stratum limiting: Partition input traces into strata, and sample such that each stratum's throughput does not exceed a threshold.
- Global limiting: Sample such that total throughput doesn't exceed a threshold.

Note that in addition to limiting traces per unit time, there are also use cases to support limting spans per unit time, or bytes per unit time. In such cases the limiter implementation should take care not to impart bias by systematically preferring traces comprising fewer spans, or fewer bytes, over "larger" traces.
typo: "limting"
##### TraceIdRatioBased

`TraceIdRatioBased` may be used to consistently sample or drop a certain fixed percentage of spans. The decision is based on a random value, the trace ID, rather than any of the span metadata available to ShouldSample (span name, initial attributes, etc.) As a result,
Currently, since OpenTelemetry doesn't specify what hashing algorithm to use, my understanding is this statement is not fully accurate (since different language SDKs could have different approaches) - e.g. when it is used for sampling non-root spans. It may be good to clarify that current limitation here.
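For concreteness, the sketch below shows one common way such a decision can be made: comparing bits of the trace ID against a threshold derived from the ratio. As the comment above notes, the specification does not mandate a particular mapping, so this is illustrative only and different SDKs may disagree for the same trace ID.

```go
package sampling

import "encoding/binary"

// traceIDRatioDecision is an illustrative sketch only: it treats part of the
// trace ID as a pseudo-random value and keeps the span when that value falls
// below a threshold derived from the configured ratio. The OpenTelemetry
// specification does not mandate this exact mapping, so two SDKs configured
// with the same ratio may keep different sets of trace IDs.
func traceIDRatioDecision(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Interpret the low 8 bytes of the trace ID as an unsigned integer,
	// discard one bit, and compare against a threshold covering `ratio` of
	// the remaining 63-bit space. (Assumes these bytes are uniformly random,
	// which is recommended but not guaranteed for all trace ID generators.)
	v := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	threshold := uint64(ratio * float64(1<<63))
	return v < threshold
}
```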
1. "Statistics" can be anything from [RED metrics](https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/) by service, to data used to answer richer questions like "Which dimensions of trace data are correlated with higher error rate?". You want to ensure that all inferences made from the data you *do* collect are valid. | ||
2. Setting sampling error targets is akin to setting Service Level Objectives: just as one aspires to build *appropriately reliable* systems, so too one needs statistics which are *just accurate enough* to get valid insights from, but not so accurate that you excessively sacrifice goals #1 and #2. | ||
3. An example consequence of this goal being unmet: metrics derived from the trace data become spiky and unfit for purpose. | ||
4. Ensure traces are complete.
Would something like "Ensure traces are as complete and consistent as possible" better express the intent here?
4. Ensure traces are complete.
   1. "Complete" means that all of the spans belonging to the trace are collected. For more information, see ["Trace completeness"](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.12.0/specification/trace/tracestate-probability-sampling.md#trace-completeness) in the trace spec.
Have there been any discussions around how sampling impacts "linked traces" and whether the consistency/completeness goals should support linked traces as well? Since many async operations are modelled as linked traces, I am trying to understand if there's a way consistent sampling can be achieved across linked traces.
Notes:

- TODO(Spencer): Look at https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/telemetryquerylanguage/tql and give feedback. Split out from transformprocessor.
Suggested change:
- TODO(Spencer): Look at https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/telemetryquerylanguage/tql and give feedback. Split out from transformprocessor.
+ TODO(Spencer): Look at https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl and give feedback. Split out from transformprocessor.
FYI I would like to merge this for the record. Thank you @spencerwilson!!
Not sure if this is something this PR will address, but we were experimenting with a client/API that let folks set their sampling rate by span attributes. Using RQL would allow =, !=, regex, and like/not-like statements, giving more control over the data being traced. Yes, it does require that the information used to determine the sampling rate be available when a span is first created. Our use case was to let folks increase sampling rates for a given value (or set of values) without making any code changes; the "sampling rules" would be recompiled each time a change arrived from the API (i.e., when a user makes a change). Moving forward, RQL isn't required, but flexibility could be helpful IF folks want more than = / != for comparing span attributes to something they want to sample by.
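A minimal sketch of what such attribute-driven rules might look like (purely hypothetical; this is not the commenter's client/API nor anything proposed in this PR): each rule pairs a span-attribute predicate with a sampling rate, and the first matching rule wins.

```go
package sampling

import "regexp"

// rule is a hypothetical attribute-based sampling rule: if a span attribute
// matches, sample at the given rate. Operators beyond equality (regex, "not
// equals", like/not-like) are what the comment above suggests RQL-style
// flexibility would buy.
type rule struct {
	attributeKey string
	op           string // "eq", "neq", "regex"
	value        string
	sampleRate   float64 // probability in [0, 1]
}

// rateFor returns the sampling rate for a span's attributes, using the first
// matching rule; defaultRate applies when nothing matches. A real system
// would recompile/reload these rules whenever the control API reports a
// change, as described in the comment above.
func rateFor(attrs map[string]string, rules []rule, defaultRate float64) float64 {
	for _, r := range rules {
		got, ok := attrs[r.attributeKey]
		if !ok {
			continue
		}
		switch r.op {
		case "eq":
			if got == r.value {
				return r.sampleRate
			}
		case "neq":
			if got != r.value {
				return r.sampleRate
			}
		case "regex":
			if matched, err := regexp.MatchString(r.value, got); err == nil && matched {
				return r.sampleRate
			}
		}
	}
	return defaultRate
}
```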
@spencerwilson we are cleaning up stale OTEP PRs. If there is no further action at this time, we will close this PR in one week. Feel free to open it again when it is time to pick it back up. |
The OTEP that appears in this PR in partial form is a prerequisite to #191.
This PR needn't be merged. Its purpose is to provide a vehicle for review and iteration by the Sampling SIG. I previously distributed these documents as GitHub gists, but the SIG decided this would be superior to that. I expect that at some point this PR will have served its purpose and at that point may be closed.