Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: Opinionated OpenTelemetry Operator Sampling CR #3297

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

frzifus
Copy link
Member

@frzifus frzifus commented Sep 20, 2024

@frzifus frzifus requested a review from a team as a code owner September 20, 2024 13:28
@frzifus frzifus added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Sep 20, 2024
components:
loadbalancer:
# Defines if the component is managed by the operator
managementState: managed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it needed to set this field at the component level instead of just doing it at the full object level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually Ive no strong opinion on that.

@frzifus frzifus added discuss-at-sig This issue or PR should be discussed at the next SIG meeting and removed discuss-at-sig This issue or PR should be discussed at the next SIG meeting labels Sep 26, 2024
Copy link
Contributor

@jaronoff97 jaronoff97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts. Overall, I think i'm unclear how the CR solves the goals / use cases. It feels like this is very cluster admin focused but doesn't give users many options for customization. What if we used some annotations on the pods to determine something about the sampling configuration? Also, how would a user opt into using the created sampler architecture?

I think I would prefer a design that's a bit more composable for cluster admins / automatic for users. I.e.

Cluster admins:

  • determine which services should be instrumented via a label selector (maybe this is in its own CR)
  • define a SamplerPolicy CR
    • I like the components section, but I want this CR to be more flexible in terms of architecture based on sampler type
      • it would also be nice of users could bring their own collectors in the mix.
    • I also want support for sampling at the SDK level in here, as IMO that's the biggest gains in terms of efficiency

Users:

  • Optionally add an otel-instrumentation specific label to get instrumentation if the cluster admin requires on
  • Adds annotations to override the instrumentation policy within their pod template. The cluster admin should be able to define the limits for these (something like a valid range for sample % or max cardinality, etc.)

LMK if you want to talk through this more :) I like where this is going though.

## Goals and non-goals

**Goals**
- Provide an opinionated CR to simply sampling configuration in distributed environments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Provide an opinionated CR to simply sampling configuration in distributed environments
- Provide an opinionated CR to simplify sampling configuration in distributed environments


### CASE 3

As a user I want to be able to filter relevant data without much specific open telemetry knowledge.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
As a user I want to be able to filter relevant data without much specific open telemetry knowledge.
As a user I want to be able to filter relevant data without much specific OpenTelemetry knowledge.


![sampling arch](./images/arch_sampling_crd.png)

This custom resource creates an environment that allows us to apply e.g. tailbased sampling in a distributed environment. The operator takes care of creating an optional otel LB service and sampling instances similar to the figure shown above.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we prefer tail sampling in the collector versus a global sampling policy that we apply to instrumentations? I guess not everything will be otel instrumented...


LB instances will be pre-configured to distribute traces based on a given routing key like the traceID to the sampler instances.

A policy is used to define which telemetry data should be sampled. Available policies can be found in the [tailbased sampling description](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i also want us to be able to support the probabilistic sampler which requires no sticky load balancing configuration https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/probabilisticsamplerprocessor. How would this design work with that?

Comment on lines +130 to +133
1. Introduction of the CRD in v1alpha1.
2. First controller implementation.
3. Implementation of e2e tests.
4. CRD becomes part of the operator bundle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's adjust this to include some more iteration and design changes prior to being in the main bundle.


**Goals**
- Provide an opinionated CR to simply sampling configuration in distributed environments
- Allow managing access using RBAC to different parts of the collector configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me how the design solves this goal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Skip Changelog PRs that do not require a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants