Fix md errors
lameiraatt committed Mar 27, 2023
1 parent 56f3914 commit ad20750
Showing 1 changed file with 12 additions and 7 deletions.
19 changes: 12 additions & 7 deletions text/0228-chain-id.md
@@ -2,7 +2,6 @@

Add a chain ID attribute to spans and logs.


## Motivation

Spans include information on parent spans that makes it possible for observability backends to digest and display span hierarchies within traces.
@@ -11,6 +10,7 @@ The availability of such a view is dependent on the used backend and, in a large
This OTEP introduces the concept of a chain ID attribute that should improve our ability to understand causality while looking at spans individually and also filter spans/logs using chain IDs, facilitating subtree data extraction and analysis.

For instance, let's look at the small span below (exported using the OTLP Exporter) that includes a `chain.id` attribute:

```json
{
  "traceId": "oKgVBuTUDvKapES6R+XB/w==",
```
@@ -40,13 +40,15 @@ For instance, let's look at the small span below (exported using the OTLP Export

Chain ID creation logic will be detailed in the [Explanation](#explanation) section, but, for now, we can hint at the fact that the chain links refer to different span levels.
From a single span without a chain ID, we can only say that it isn't a level 0 span, because it has a parent span ID. With the chain ID present, we know that this span sits on level 3 of the trace tree and that the path to it is as follows:

* level 0 span - chain ID `2e072d7d02464a2490b65c864da59609#5`
* level 1 span - chain ID `2e072d7d02464a2490b65c864da59609#5#31`
* level 2 span - chain ID `2e072d7d02464a2490b65c864da59609#5#31#739`
* level 3 span - chain ID `2e072d7d02464a2490b65c864da59609#5#31#739#11`
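
The path above can be derived mechanically from the chain ID string alone. The following is a minimal sketch, assuming that the prefix before the first `#` never contains a `#` itself; the `ChainIdPath` class and its methods are hypothetical helpers for illustration and are not part of the proposed module.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the proposed module. */
final class ChainIdPath {
    private ChainIdPath() {}

    /** Level of the span in the trace tree; a chain ID with a single link is level 0. */
    static int level(String chainId) {
        // The first segment is the prefix; every following segment is one link.
        return chainId.split("#").length - 2;
    }

    /** Chain IDs from the level 0 ancestor down to the span itself. */
    static List<String> path(String chainId) {
        String[] parts = chainId.split("#");
        List<String> path = new ArrayList<>();
        StringBuilder current = new StringBuilder(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            current.append('#').append(parts[i]);
            path.add(current.toString());
        }
        return path;
    }
}
```

Calling `ChainIdPath.path` with the level 3 chain ID listed above yields the four chain IDs of that list, in the same order.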

The example span above includes a `task.processing.time.ns` attribute, which represents the amount of time the CPU spent processing the task associated with that span.
If we want to determine the total amount of CPU time spent on all tasks performed under that span (level 3+), we can:

1. Find spans where the `chain.id` attribute value starts with `2e072d7d02464a2490b65c864da59609#5#31#739#11`, to get the entire subtree;
2. Extract `task.processing.time.ns` values, if the attribute is present;
3. Sum.
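
The three steps above can be sketched in a few lines of Java. This is an illustration only: the `SubtreeCpuTime` class is hypothetical, and spans are reduced to plain attribute maps instead of real exported span data. One detail is our own assumption rather than part of the proposal: the sketch matches either the exact chain ID or the chain ID followed by `#`, because a bare prefix match on `...#11` would also match a sibling such as `...#110`.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the proposed module. */
final class SubtreeCpuTime {
    private SubtreeCpuTime() {}

    /** True if chainId identifies the subtree root or one of its descendants. */
    static boolean inSubtree(String chainId, String subtreeRoot) {
        // Appending "#" avoids picking up siblings whose last link merely
        // starts with the same digits as the subtree root's last link.
        return chainId.equals(subtreeRoot) || chainId.startsWith(subtreeRoot + "#");
    }

    /** Sums task.processing.time.ns over the subtree; spans without the attribute are skipped. */
    static long totalProcessingTimeNs(List<Map<String, Object>> spans, String subtreeRoot) {
        long total = 0;
        for (Map<String, Object> attributes : spans) {
            Object chainId = attributes.get("chain.id");
            Object time = attributes.get("task.processing.time.ns");
            if (chainId instanceof String
                    && inSubtree((String) chainId, subtreeRoot)
                    && time instanceof Number) {
                total += ((Number) time).longValue();
            }
        }
        return total;
    }
}
```
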
@@ -57,7 +59,6 @@ The presence of chain IDs means that we can not only reduce the complexity of th

Chain ID links also represent a span count (excluding root) and can, for that reason, provide a measure of how much work has been carried out at different trace levels leading up to the associated span.


## Explanation

The proposed change extends the list of attributes in span and log record data to include a chain ID.
@@ -97,19 +98,20 @@ Logs created in a span's execution context will have the same chain ID as that s
The details on how to enable this functionality will vary with language and implementation, but it will be offered as a default OpenTelemetry SDK extension. This means that, once it's added as a dependency to the relevant projects, enabling it should only require SDK configuration changes, with some known exceptions.
For instance, our feasibility experiments were run in Java using auto-instrumentation and, in that case, we implemented an `AutoConfigurationCustomizerProvider` and simply extended the OpenTelemetry Java agent, so neither dependencies nor extension-specific configuration changes were required in the test applications.


## Internal details

The code snippets shown in this section are from the Java module created to test the viability of our proposal.

To achieve the described functionality, we propose the creation of a module that contains the following main components:

* chain ID: keeps chain ID string representation and child counter;
* chain ID manager: implements the chain ID creation logic and keeps the association between spans and chain IDs, so that each span is associated with only one chain ID in a process;
* span processor: adds the chain ID attribute to spans;
* log processor: adds the chain ID attribute to logs;
* baggage propagator: wraps the default baggage propagator (following [W3C Baggage Specification](https://w3c.github.io/baggage/)) and adds chain ID to the context baggage before propagation.

The call to create a new chain ID will be made from the span processor to the chain ID manager when a new span is started.

```java
public void onStart(Context parentContext, ReadWriteSpan span) {
    ChainId chainId = chainIdManager.createChainId(parentContext, span.getSpanContext().getSpanId());
@@ -118,13 +120,15 @@ public void onStart(Context parentContext, ReadWriteSpan span) {
```

When a span ends, the call to delete the created chain ID, and its association with the span, will also be made from the span processor.

```java
public void onEnd(ReadableSpan span) {
    chainIdManager.deleteChainId(span.getSpanContext().getSpanId());
}
```

When a log is emitted, the log processor relies on the log's correlation with its execution context: it reads the span ID from the log record's span context in order to find the chain ID of that span.

```java
public void onEmit(ReadWriteLogRecord logRecord) {
    String spanId = logRecord.toLogData().getSpanContext().getSpanId();
@@ -136,6 +140,7 @@ public void onEmit(ReadWriteLogRecord logRecord) {
```

The context is immutable, so when a chain ID is associated with the span context in the context being propagated, baggage is rebuilt to include that chain ID and a copy of the original context containing that baggage will be propagated instead.

```java
public <C> void inject(Context context, @Nullable C carrier, TextMapSetter<C> setter) {
    ChainId chainId = chainIdManager.getChainId(Span.fromContext(context).getSpanContext().getSpanId());
```
@@ -156,6 +161,7 @@ These key-value pairs are held in memory while a span is active, i.e., from `onS
As seen above, the chain ID manager is called from the span processor, the log processor and the baggage propagator.
At the moment, the `ChainIdManager` class is a singleton and the only component calling methods that write to the map is our span processor.
The methods in question are `createChainId` and `deleteChainId`, respectively:

```java
ChainId createChainId(Context parentContext, String spanId) {
    String parentSpanId = Span.fromContext(parentContext).getSpanContext().getSpanId();
@@ -176,6 +182,7 @@ ChainId deleteChainId(String spanId) {
```
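
Since the two methods above are only partially visible here, the following self-contained sketch shows one way the span-to-chain-ID bookkeeping could look. It is not the module's actual code and makes several assumptions: `ChainIdManagerSketch` and its constructor argument are hypothetical, the parent is passed as a plain span ID instead of a `Context`, and the baggage lookup for parents that live in another process is left out, so a missing parent always yields a new level 0 chain ID.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified stand-in for the chain ID manager; illustration only. */
final class ChainIdManagerSketch {

    /** Simplified chain ID: string representation plus child counter. */
    static final class ChainId {
        final String repr;
        private final AtomicInteger childCounter = new AtomicInteger();

        ChainId(String repr) {
            this.repr = repr;
        }

        ChainId child() {
            return new ChainId(repr + "#" + childCounter.incrementAndGet());
        }
    }

    private final Map<String, ChainId> spanIdToChainId = new ConcurrentHashMap<>();
    private final ChainId root;

    /** The prefix is an assumed process-level identifier used as the root representation. */
    ChainIdManagerSketch(String prefix) {
        this.root = new ChainId(prefix);
    }

    /** Called when a span starts: derives the chain ID from the parent's, if known. */
    ChainId createChainId(String parentSpanId, String spanId) {
        ChainId parent = parentSpanId == null ? null : spanIdToChainId.get(parentSpanId);
        // Assumption: without a local parent we start a new level 0 chain.
        // The baggage lookup for propagated contexts is omitted in this sketch.
        ChainId chainId = parent != null ? parent.child() : root.child();
        spanIdToChainId.put(spanId, chainId);
        return chainId;
    }

    /** Read-only lookup, as used by the log processor and the propagator. */
    ChainId getChainId(String spanId) {
        return spanIdToChainId.get(spanId);
    }

    /** Called when a span ends: drops the association so the map does not grow unbounded. */
    ChainId deleteChainId(String spanId) {
        return spanIdToChainId.remove(spanId);
    }
}
```

A `ConcurrentHashMap` is used here so that reads from other components stay safe while the span processor writes; whether the original module uses the same map type is not shown in this excerpt.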

The log processor and baggage propagator only call the `getChainId` method, which reads from the map.

```java
public ChainId getChainId(String spanId) {
    return spanIdToChainIdMap.get(spanId);
```
@@ -189,6 +196,7 @@ However, for `createChainId`, there should be further consideration.
This method is called exclusively from the span processor `onStart`, which is called once per span.
Knowing that a parent span must be created before its children, since parent data is added to spans on creation, concurrent calls to `createChainId` can only happen when we're starting spans under the same parent or under different parents.
When the parents are different, there won't be an issue, because different parent spans will be associated with different chain IDs, but when the parent is the same, we can have concurrent calls to `parentChainId.child()`.

```java
public ChainId child() {
    return new ChainId(repr + "#" + childCounter.incrementAndGet());
```
@@ -200,6 +208,7 @@ The latter is an `AtomicInteger`, therefore, concurrent calls to this method sho
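
To make the concurrency argument concrete, the sketch below starts several threads that all call `child()` on the same parent and collects the resulting representations. It is an illustration under assumptions: `ConcurrentChildDemo` is a hypothetical class, and its nested `ChainId` is a simplified copy that keeps only the representation and the counter.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo class; illustration only. */
final class ConcurrentChildDemo {

    /** Simplified chain ID: string representation plus child counter. */
    static final class ChainId {
        final String repr;
        private final AtomicInteger childCounter = new AtomicInteger();

        ChainId(String repr) {
            this.repr = repr;
        }

        ChainId child() {
            // incrementAndGet is atomic, so no two callers observe the same value.
            return new ChainId(repr + "#" + childCounter.incrementAndGet());
        }
    }

    /** Runs the given number of workers, each creating perThread children of one parent. */
    static Set<String> createChildrenConcurrently(ChainId parent, int threads, int perThread) {
        Set<String> reprs = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    reprs.add(parent.child().repr);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return reprs;
    }
}
```

If every call produced a distinct link, the returned set contains exactly `threads * perThread` entries; a duplicate would make the set smaller.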

Lastly, in `createChainId`, the case where `parentChainId` is `null` happens when we're processing a span with no parent in the current process.
The `fromBaggage` method of the `ChainId` class tries to get the chain ID representation from the baggage, which should be present in propagated contexts; otherwise, it defaults to `ROOT.child()` (a new level 0 span).

```java
static ChainId fromBaggage(Baggage baggage) {
    String chainIdRepr = baggage.getEntryValue(KEY);
```
@@ -214,7 +223,6 @@ We considered using `service.instance.id` but that resource attribute might not
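
The fallback behaviour of `fromBaggage` described earlier can be sketched without the OpenTelemetry API. The example below is a sketch under stated assumptions: a plain `Map<String, String>` stands in for `Baggage`, the value of `KEY` is assumed to be `chain.id`, and the `ROOT` prefix `example-prefix` is a placeholder, since the excerpt does not show how the real prefix is chosen.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical wrapper class; illustration only. */
final class FromBaggageSketch {

    /** Simplified chain ID: string representation plus child counter. */
    static final class ChainId {
        // Assumed baggage key; the excerpt only shows the constant's name.
        static final String KEY = "chain.id";
        // Placeholder prefix; the real root representation is not shown in the excerpt.
        static final ChainId ROOT = new ChainId("example-prefix");

        final String repr;
        private final AtomicInteger childCounter = new AtomicInteger();

        ChainId(String repr) {
            this.repr = repr;
        }

        ChainId child() {
            return new ChainId(repr + "#" + childCounter.incrementAndGet());
        }

        /** Uses the propagated representation if present, else starts a new level 0 chain. */
        static ChainId fromBaggage(Map<String, String> baggage) {
            String chainIdRepr = baggage.get(KEY);
            return chainIdRepr != null ? new ChainId(chainIdRepr) : ROOT.child();
        }
    }
}
```
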
Our aim is to provide a module containing the described components for each supported language and add those modules to _opentelemetry-*-contrib_ repositories.
It's worth noting that, at the time of writing, the Logs API and Logging SDK are still experimental and not all default OpenTelemetry SDKs include the `LogProcessor` interface (`LogRecordProcessor` from OTel Java SDK 1.19.0) implemented by one of our components.


## Trade-offs and mitigations

Using the extra span and log processors, extending the baggage and context in the propagator and managing the association between spans and chain IDs will incur a small overhead (the [Overhead and benchmarks](#overhead-and-benchmarks) section provides an overview of the tests run to assess the performance overhead).
@@ -223,7 +231,6 @@ This would have reduced the overhead by taking away the need to include a custom
We'd also have to drop our log processor component and change the SDK log builder to keep the implementation consistent.
However, we decided that modifying default OpenTelemetry SDK builders and configuration properties to incorporate a new non-core attribute was less fitting than providing a clean extension module that could be added whenever chain ID is required.


## Overhead and benchmarks

Java Microbenchmark Harness (JMH) was used to benchmark key components of our implementation against some OTel SDK components.
@@ -298,7 +305,6 @@ The tests for the `extract` method were run with a 10 key-value pair baggage set

Our `ChainIdBaggagePropagator`'s `extract` consists of a call to the wrapped propagator `W3CBaggagePropagator`'s `extract` method and, as expected, the results were very close.


## Open questions

When using messaging systems, we can have situations in which the same message is consumed several times, either by a single consumer, different consumers or both.
@@ -347,7 +353,6 @@ One of the reasons why numbers are used as chain ID links instead of UUIDs is to
This raises concerns around propagation due to carrier limits.
For example, if the context is propagated in the headers of an HTTP request, even though the HTTP specification doesn't define any limits, servers usually do and those limits can vary, which makes it hard to find a generic solution to this problem.


## Future possibilities

Chain IDs add support for straightforward analysis of logs and traces, which is especially beneficial when debugging and running complex queries.