Introduction of a Synthetic Attribute for Server Span Telemetry #1127

Open
JacksonWeber opened this issue Jun 6, 2024 · 5 comments · May be fixed by #1523
Labels
area:http enhancement New feature or request

Comments

@JacksonWeber

JacksonWeber commented Jun 6, 2024

Area(s)

area:browser

Is your change request related to a problem? Please describe.

I would like to be able to identify telemetry created by synthetic sources such as bots or crawlers. This issue proposes defining conventions for marking spans as originating from a synthetic source.

Describe the solution you'd like

I would like to add an attribute to the HTTP server span semantic conventions, as well as to metrics and logs, whose value is a low-cardinality string such as the following:

synthetic -> "not set" | "bot" | "synthetic test"

Here, setting the synthetic attribute to "not set" indicates telemetry that was not generated by a synthetic source. This convention would be helpful when a user wants to separate telemetry generated by frequent synthetic tests or web crawlers from telemetry generated by direct human engagement.

The determination of which of the three options a span falls into could be made by maintaining a list of known synthetic sources or allowing this decision to be user configurable.
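
To make the idea concrete, here is a rough sketch of how that determination might look. It is only an illustration under assumptions: the function name `classify_synthetic`, the marker lists, and the `override` hook are all hypothetical and not part of any existing convention or SDK.

```python
from typing import Callable, Optional

# Hypothetical marker lists (assumptions for illustration only).
# A real list would need to be maintained and updated over time.
KNOWN_BOT_MARKERS = ("googlebot", "bingbot")
KNOWN_TEST_MARKERS = ("synthetic-test",)


def classify_synthetic(
    user_agent: Optional[str],
    override: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Return one of the three proposed values for the `synthetic` attribute."""
    if not user_agent:
        return "not set"

    # User-configurable decision takes precedence over the built-in lists.
    if override is not None:
        decision = override(user_agent)
        if decision is not None:
            return decision

    ua = user_agent.lower()
    if any(marker in ua for marker in KNOWN_TEST_MARKERS):
        return "synthetic test"
    if any(marker in ua for marker in KNOWN_BOT_MARKERS):
        return "bot"
    return "not set"
```

A user could pass `override=lambda ua: "synthetic test" if "probe" in ua else None` to tag their own availability probes without changing the shared list.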

Describe alternatives you've considered

While we could consider setting the synthetic attribute to a Boolean value, I believe the extra granularity of the low-cardinality string would be valuable.

Additional context

No response

@MSNev
Contributor

MSNev commented Jun 6, 2024

#1230

@lmolkova
Contributor

lmolkova commented Oct 7, 2024

A few questions/thoughts:

  • While non-HTTP usage would probably be low/non-existent, I don't think it belongs in the HTTP domain. So I'd consider adding some attribute like user_agent.type (probably needs a better name).
  • Is there some prior art in the industry to identify a synthetic source/bot user? Is there an attribute in ECS for it? Are there some non-telemetry user-agent conventions for it?
  • nit: let's just not set the attribute instead of using a not_set value.

It'd be awesome if you could send a PR with a specific proposal (considering the above).
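
As a rough sketch of the last point, the attribute would simply be omitted for non-synthetic traffic. The helper `build_attributes` is hypothetical, and `user_agent.type` is only the placeholder name floated above, not an agreed attribute.

```python
from typing import Dict, Optional

# Placeholder key from the suggestion above; the final name is undecided.
SYNTHETIC_TYPE_KEY = "user_agent.type"


def build_attributes(method: str, synthetic_type: Optional[str]) -> Dict[str, str]:
    """Build span attributes, leaving the synthetic key out when it does not apply."""
    attributes = {"http.request.method": method}
    # No sentinel such as "not_set": absence of the key means "not synthetic".
    if synthetic_type is not None:
        attributes[SYNTHETIC_TYPE_KEY] = synthetic_type
    return attributes
```

With this shape, a backend can filter synthetic traffic by checking whether the key exists, and ordinary requests carry no extra attribute.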

@JacksonWeber
Author

Thank you for your feedback on this issue! Just a couple of questions regarding your first point:

  • While I don't expect non-HTTP telemetry to need this synthetic source flag, I suppose it could be more generic and defined outside of HTTP specifically. However, I'm struggling to find a more relevant association for it. For example, if I want to define some attribute in spans.yaml, I have the options of http, rpc, faas, gen-ai, database, messaging, and cloud-events, none of which seems more relevant than http for something like a synthetic source. Maybe I'm missing something about the structure of the semantic conventions here.
  • I'm also curious about the idea of a user_agent.type field: what kind of data would a field with that name hold?

@reyang
Member

reyang commented Oct 10, 2024

I think we need to get some clarity on what a "synthetic source" is. For example, do we think it'll be a static list of client types (e.g. the User-Agent header for HTTP) or a list that will be frequently updated?

For example, we do not want to add an explicit flag saying "this trace is a result of a synthetic request" and then realize later: "oops, there are other traces from agent XYZ, and this agent is actually powered by AI/LLM, so the previously added synthetic flag needs to be fixed".

@JacksonWeber
Author

@reyang I think it'll be important to keep the list of known synthetic sources updated over time as there's no way to predict how popular a certain bot might become.

I'm a little confused by your example. Are you essentially saying that, in that scenario, we could miss synthetic traces created by newer technologies if we only maintained a static list?
