Replace span extensions with manual hashmap #22
base: main
Conversation
We still need tracing-subscriber/registry, which needs tracing/std, but that will be dealt with another day.
```diff
 struct SpanData {
     fields: Box<[FieldValueIndex]>,
     activity_id: [u8; 16],         // if set, byte 0 is 1 and 64-bit span ID in the lower 8 bytes
     related_activity_id: [u8; 16], // if set, byte 0 is 1 and 64-bit span ID in the lower 8 bytes
-    start_time: SystemTime,
+    start_time: RwLock<SystemTime>,
```
The old code got away without locking this because it was being boxed into a dyn trait object, which hid the problem from the compiler. Now the compiler sees a potential data race and requires a lock.
At a higher level though, there is no data race, and I wish this extra locking could be avoided.
I don't believe you. This isn't needed.
😥
SpanData can only be Sync if all its members are Sync. A mutable field is not Sync unless it is wrapped in something that provides Sync, which among the interior-mutability types means RwLock or Mutex. If I manually tagged the struct as Sync without ensuring that no reads can happen at the same time as a write, it would technically be UB.
In theory, it is possible to start a span on one thread and end it on another. Both operations only take a shared reference from the logger's point of view; if a layer wants to implement any sort of ordering or mutation within its implementation, it must do so safely, which is what this code is trying to do.
The outer RwLock doesn't protect against interior mutability at all. A mutex would, but that's too aggressive for what we need, which is to ensure enter and exit don't encounter any data races when writing/reading this value.
The outer lock protects you just fine because it ensures you can only access the reference to the interior struct while the outer lock is held.
Multiple threads can hold a read lock and still mutate an interior field though. That's why RwLock<T> explicitly doesn't implement Sync unless T is Sync, while Mutex<T> is Sync whenever T is Send. Statics must be Sync.
That is to say:

- Statics must be Sync, which means the outer RwLock must be Sync.
- The RwLock is Sync if the thing it contains is Sync.
- The struct is Sync only if all of its fields are Sync.
- A mutable field is not Sync unless it guarantees it cannot be mutated while any other reference to it exists; otherwise it is UB.
- The mutable field must therefore be locked or atomically changeable.

Atomic or RwLock, take your pick. RwLock is conceptually nicer, especially since SystemTime is not guaranteed to be any particular size or underlying type.
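A reduced sketch of that chain of requirements. The struct, static, and function here are illustrative, not the PR's exact code:

```rust
use std::sync::RwLock;
use std::time::SystemTime;

// A plain SystemTime field could not be written through a shared reference,
// so the field is wrapped in one of the Sync-providing interior-mutability
// types (RwLock here; Mutex or an atomic would also work).
struct SpanData {
    start_time: RwLock<SystemTime>,
}

// Statics must be Sync; SpanData is Sync only because every field is Sync.
static LAST_SPAN: RwLock<Option<SpanData>> = RwLock::new(None);

fn mark_started(data: &SpanData) {
    // Mutation through &SpanData goes through the interior lock.
    *data.start_time.write().unwrap() = SystemTime::now();
}
```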
I think I see what you're saying. Here's my take.
The issue with RwLock is that there can be multiple threads that all see a const ref to T, so if it's bad for multiple threads to all read the same instance of T at the same time (e.g. Cell) then RwLock won't provide proper protection.
The docs indicate that SystemTime has an auto impl of Sync and Send, so it should all be fine.
If it weren't fine then the second RwLock wouldn't fix it. To make a non-Sync variable thread-safe you need a Mutex.
A second RwLock would only fix things if the mutation happens while the outer lock is not held. That's not happening here -- you only mutate data.start_time while you hold the lock on span_data.
Interior mutability means "Cell". There aren't any Cells here.
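A minimal sketch of that argument, with u64 standing in for tracing::span::Id and the rest of the fields omitted: a plain SystemTime field is fine as long as every writer holds the outer write lock.

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};
use std::time::SystemTime;

struct SpanData {
    start_time: SystemTime, // plain field, no interior lock
}

static SPAN_DATA: LazyLock<RwLock<HashMap<u64, SpanData>>> =
    LazyLock::new(|| RwLock::new(HashMap::new()));

fn on_enter(id: u64) {
    // Exclusive outer lock: no other reference to the entry can exist,
    // so the field can be mutated directly without a data race.
    if let Some(data) = SPAN_DATA.write().unwrap().get_mut(&id) {
        data.start_time = SystemTime::now();
    }
}
```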
Ok, I see now. I can mutate the field directly as long as I hold the write lock on the global and get a mutable reference to its contents. If I want to be able to hold only a read lock on the global (since multiple threads can independently start spans and shouldn't be able to block each other), then I need the interior RwLock. But that gets me into the problem described above: as soon as I make just that field mutable, the struct loses its Sync implementation unless the field is a lock.
When I tried to use a RefCell (or just a Cell) rather than a RwLock:

```
`RefCell<std::time::SystemTime>` cannot be shared between threads safely
within `(Id, common::SpanData)`, the trait `Sync` is not implemented for `RefCell<std::time::SystemTime>`, which is required by `RwLock<hashbrown::HashMap<Id, common::SpanData, asdf>>: Sync`
if you want to do aliasing and mutation between multiple threads, use `std::sync::RwLock` instead
required because it appears within the type `(Id, common::SpanData)`
required for `hashbrown::raw::RawTable<(Id, common::SpanData)>` to implement `Sync`
required for `RwLock<hashbrown::HashMap<Id, common::SpanData, asdf>>` to implement `Sync`
shared static variables must have a type that implements `Sync`
common.rs(35, 8): required because it appears within the type `common::SpanData`
map.rs(185, 12): required because it appears within the type `hashbrown::HashMap<Id, common::SpanData, asdf>`
```
Admittedly the code is currently taking a Write lock anyway, so the interior mutability isn't needed as you've pointed out, but I suspect that should change.
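For contrast, a hedged sketch of the read-lock design described above, using the same illustrative types as the earlier sketch: only a shared lock is taken on the global map, and the interior RwLock is what keeps the one mutable field Sync where RefCell was not.

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};
use std::time::SystemTime;

struct SpanData {
    start_time: RwLock<SystemTime>, // interior lock: Sync, unlike RefCell/Cell
}

static SPAN_DATA: LazyLock<RwLock<HashMap<u64, SpanData>>> =
    LazyLock::new(|| RwLock::new(HashMap::new()));

fn on_enter(id: u64) {
    let map = SPAN_DATA.read().unwrap(); // shared outer lock: other threads are not blocked
    if let Some(data) = map.get(&id) {
        *data.start_time.write().unwrap() = SystemTime::now(); // write guarded by interior lock
    }
}
```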
Makes sense. So I think we agree that at present, the interior lock is not needed because it is covered by an exterior lock and that in the future, we might want to improve the concurrency by minimizing the time we hold the exterior lock.
My thoughts on that:
- You cannot manipulate hashtable entries without some kind of lock on the hashtable (exterior lock) to ensure the entry is not deleted/moved while you're changing it. A lock within the entry (interior lock) doesn't remove this requirement.
- Holding the exterior lock for read-only access is sufficient, but that means you will ping-pong between the read/write lock in on_new_span and the read-only locks elsewhere.
- Typically, if you want to make this all work, you end up making the hashtable entry an Arc so that you aren't manipulating the "entry" with the exterior lock held. Instead, you take the exterior lock, grab a reference to an entry, release the exterior lock, then manipulate the entry (potentially taking an interior lock, though I think we might be able to avoid that), as sketched below.
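A rough sketch of that Arc-per-entry shape, with illustrative types and names rather than a committed design:

```rust
use std::collections::HashMap;
use std::sync::{Arc, LazyLock, RwLock};
use std::time::SystemTime;

struct SpanData {
    start_time: RwLock<SystemTime>, // interior lock; per the comment, possibly avoidable
}

static SPAN_DATA: LazyLock<RwLock<HashMap<u64, Arc<SpanData>>>> =
    LazyLock::new(|| RwLock::new(HashMap::new()));

fn on_enter(id: u64) {
    // Take the exterior lock only long enough to clone the Arc...
    let entry = SPAN_DATA.read().unwrap().get(&id).cloned();
    // ...then manipulate the entry with the map unlocked.
    if let Some(data) = entry {
        *data.start_time.write().unwrap() = SystemTime::now();
    }
}
```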
```
@@ -13,11 +17,15 @@ use crate::{
use super::*;

static SPAN_DATA: LazyLock<RwLock<HashMap<tracing::span::Id, SpanData, BuildHasherDefault<FNV1aHasher64>>>> = LazyLock::new(|| {
```
Get rid of the LazyLock. The only reason you need it is that `BuildHasherDefault::default()` is not const. The `BuildHasher` trait is trivial, so there's no reason why you can't make your own custom `BuildHasherFNv1a` in about 10 lines.
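Something like the following sketch, assuming the crate's existing FNV1aHasher64 (already referenced via BuildHasherDefault<FNV1aHasher64> in the diff) implements Hasher and Default:

```rust
use core::hash::BuildHasher;
// FNV1aHasher64 is assumed to be in scope from wherever the crate defines it.

#[derive(Clone, Copy, Default)]
struct BuildHasherFNv1a;

impl BuildHasher for BuildHasherFNv1a {
    type Hasher = FNV1aHasher64;

    fn build_hasher(&self) -> Self::Hasher {
        FNV1aHasher64::default()
    }
}
```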
It goes deeper than that. The trait itself is not const, and can't be marked as const without using an unstable feature. There's no way to use `with_hasher` at all in a const context.
But the trait does not need to be const to use with_hasher in a const context. with_hasher doesn't call any functions in the trait.
Try it. It compiles.
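For example, assuming the BuildHasherFNv1a sketch above and hashbrown's const `HashMap::with_hasher` (the claim being made here), the static could be built directly, with no LazyLock:

```rust
use hashbrown::HashMap;
use std::sync::RwLock;
// SpanData and BuildHasherFNv1a as defined elsewhere in the crate.

// No hasher is ever invoked here, so nothing in the BuildHasher trait needs to be const.
static SPAN_DATA: RwLock<HashMap<tracing::span::Id, SpanData, BuildHasherFNv1a>> =
    RwLock::new(HashMap::with_hasher(BuildHasherFNv1a));
```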
```diff
@@ -120,7 +128,8 @@ where
         return;
     };
 
-    if span.extensions().get::<SpanData>().is_some() {
+    if SPAN_DATA.read().unwrap().contains_key(&span.id()) {
```
This is pessimistic and costs performance only to guard against something that should not happen.
Instead, be optimistic and assume that this will almost never happen. If it will almost never happen, the best way to handle it is to build the new span and then use a try_insert.
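A sketch of that optimistic shape, using a hypothetical helper name; the entry API below is equivalent for this purpose to hashbrown's try_insert, which expresses the same thing directly.

```rust
// Hypothetical helper; SPAN_DATA and SpanData as in the PR. Build the span
// data unconditionally, then insert only if the key is not already present.
fn record_new_span(id: tracing::span::Id, data: SpanData) {
    SPAN_DATA
        .write()
        .unwrap()
        .entry(id)
        .or_insert(data); // a duplicate id (which should not happen) is simply ignored
}
```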
The tracing_subscriber span extensions are useful, but they are just a hashmap with extra steps, requiring unnecessary boxing, dynamic trait objects, and std. Replace them with a manual hashmap, using hashbrown rather than the built-in HashMap so we can get closer to no_std (hashbrown is the same code that backs the standard HashMap, maintained by the rust-lang team, and it works in no_std mode).
Also move `new` from the EventWriter trait to the ProviderTraits trait. This will simplify a follow-up PR.
Add an integration test for spans, with nested span entry to check for deadlocks. This adds a public "build" method that skips enablement checks so the user-mode code will always run. This can also be used by the benchmarks so you don't need to manually run WPR/perf.