Replace span extensions with manual hashmap #22
base: main
Conversation
We still need tracing-subscriber/registry, which needs tracing/std, but that will be dealt with another day.
```diff
 struct SpanData {
     fields: Box<[FieldValueIndex]>,
     activity_id: [u8; 16],         // if set, byte 0 is 1 and 64-bit span ID in the lower 8 bytes
     related_activity_id: [u8; 16], // if set, byte 0 is 1 and 64-bit span ID in the lower 8 bytes
-    start_time: SystemTime,
+    start_time: RwLock<SystemTime>,
```
The old code got away without locking this because it was being boxed into a dyn trait object, which hid the problem from the compiler. Now the compiler sees a potential data race and requires a lock.
At a higher level though, there is no data race, and I wish this extra locking could be avoided.
I don't believe you. This isn't needed.
😥
SpanData can only be Sync if all its members are Sync. A mutable field is not Sync unless it is wrapped in something that provides Sync, which among the interior-mutability types means RwLock or Mutex. If I manually tagged the struct as Sync without ensuring that no reads can happen at the same time as a write, it would technically be UB.
In theory, it is possible to start a span on one thread and end it on another. Both operations only take a shared reference from the logger's point of view; if a layer wants to implement any sort of ordering or mutation within its implementation, it must do so safely, which is what this code is trying to do.
The outer RwLock doesn't protect against interior mutability at all. A mutex would, but that's too aggressive for what we need, which is to ensure enter and exit don't encounter any data races when writing/reading this value.
The outer lock protects you just fine because it ensures you can only access the reference to the interior struct while the outer lock is held.
Multiple threads can hold a read lock and still mutate an interior field though. That's why RwLock<T> explicitly doesn't implement Sync unless T is Sync, while Mutex<T> is Sync whenever T is Send. Statics must be Sync.
That is to say:

- Statics must be Sync, which means the outer RwLock must be Sync.
- The RwLock is Sync if the thing it contains is Sync.
- The struct is Sync only if all of its fields are Sync.
- A mutable field is not Sync unless it guarantees it cannot be mutated while any other reference to it exists; otherwise it is UB.
- The mutable field must therefore be locked or atomically changeable.

Atomic or RwLock, take your pick. RwLock is conceptually nicer, especially since SystemTime is not guaranteed to be any particular size or underlying type.
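A reduced sketch of that chain of requirements. The struct, static, and function here are illustrative, not the PR's exact code:

```rust
use std::sync::RwLock;
use std::time::SystemTime;

// A plain SystemTime field could not be written through a shared reference,
// so the field is wrapped in one of the Sync-providing interior-mutability
// types (RwLock here; Mutex or an atomic would also work).
struct SpanData {
    start_time: RwLock<SystemTime>,
}

// Statics must be Sync; SpanData is Sync only because every field is Sync.
static LAST_SPAN: RwLock<Option<SpanData>> = RwLock::new(None);

fn mark_started(data: &SpanData) {
    // Mutation through &SpanData goes through the interior lock.
    *data.start_time.write().unwrap() = SystemTime::now();
}
```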
I think I see what you're saying. Here's my take.
The issue with RwLock is that there can be multiple threads that all see a const ref to T, so if it's bad for multiple threads to all read the same instance of T at the same time (e.g. Cell) then RwLock won't provide proper protection.
The docs indicate that SystemTime has an auto impl of Sync and Send, so it should all be fine.
If it weren't fine then the second RwLock wouldn't fix it. To make a non-Sync variable thread-safe you need a Mutex.
A second RwLock would only fix things if the mutation happens while the outer lock is not held. That's not happening here -- you only mutate data.start_time while you hold the lock on span_data.
Interior mutability means "Cell". There aren't any Cells here.
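A minimal sketch of that argument, with u64 standing in for tracing::span::Id and the rest of the fields omitted: a plain SystemTime field is fine as long as every writer holds the outer write lock.

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};
use std::time::SystemTime;

struct SpanData {
    start_time: SystemTime, // plain field, no interior lock
}

static SPAN_DATA: LazyLock<RwLock<HashMap<u64, SpanData>>> =
    LazyLock::new(|| RwLock::new(HashMap::new()));

fn on_enter(id: u64) {
    // Exclusive outer lock: no other reference to the entry can exist,
    // so the field can be mutated directly without a data race.
    if let Some(data) = SPAN_DATA.write().unwrap().get_mut(&id) {
        data.start_time = SystemTime::now();
    }
}
```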
Ok, I see now. I can mutate the field directly as long as I hold the write lock on the global and get a mutable reference to its contents. If I want to be able to hold only a read lock on the global (since multiple threads can independently start spans and shouldn't be able to block each other), then I need the interior RwLock. But that gets me into the problem described above: as soon as I make just that field mutable, the struct loses its Sync implementation unless the field is a lock.
When I tried to use a RefCell (or just a Cell) rather than a RwLock:

```
`RefCell<std::time::SystemTime>` cannot be shared between threads safely
within `(Id, common::SpanData)`, the trait `Sync` is not implemented for `RefCell<std::time::SystemTime>`, which is required by `RwLock<hashbrown::HashMap<Id, common::SpanData, asdf>>: Sync`
if you want to do aliasing and mutation between multiple threads, use `std::sync::RwLock` instead
required because it appears within the type `(Id, common::SpanData)`
required for `hashbrown::raw::RawTable<(Id, common::SpanData)>` to implement `Sync`
required for `RwLock<hashbrown::HashMap<Id, common::SpanData, asdf>>` to implement `Sync`
shared static variables must have a type that implements `Sync`
common.rs(35, 8): required because it appears within the type `common::SpanData`
map.rs(185, 12): required because it appears within the type `hashbrown::HashMap<Id, common::SpanData, asdf>`
```
Admittedly the code is currently taking a Write lock anyway, so the interior mutability isn't needed as you've pointed out, but I suspect that should change.
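For contrast, a hedged sketch of the read-lock design described above, using the same illustrative types as the earlier sketch: only a shared lock is taken on the global map, and the interior RwLock is what keeps the one mutable field Sync where RefCell was not.

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};
use std::time::SystemTime;

struct SpanData {
    start_time: RwLock<SystemTime>, // interior lock: Sync, unlike RefCell/Cell
}

static SPAN_DATA: LazyLock<RwLock<HashMap<u64, SpanData>>> =
    LazyLock::new(|| RwLock::new(HashMap::new()));

fn on_enter(id: u64) {
    let map = SPAN_DATA.read().unwrap(); // shared outer lock: other threads are not blocked
    if let Some(data) = map.get(&id) {
        *data.start_time.write().unwrap() = SystemTime::now(); // write guarded by interior lock
    }
}
```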
Makes sense. So I think we agree that at present, the interior lock is not needed because it is covered by an exterior lock and that in the future, we might want to improve the concurrency by minimizing the time we hold the exterior lock.
My thoughts on that:
- You cannot manipulate hashtable entries without some kind of lock on the hashtable (exterior lock) to ensure the entry is not deleted/moved while you're changing it. A lock within the entry (interior lock) doesn't remove this requirement.
- Holding the exterior lock for read-only access is sufficient, but that means you will ping-pong between the read/write lock in on_new_span and the read-only locks elsewhere.
- Typically, if you want to make this all work, you end up making the hashtable entry an Arc so that you aren't manipulating the "entry" with the exterior lock held. Instead, you take the exterior lock, grab a reference to an entry, release the exterior lock, then manipulate the entry (potentially taking an interior lock, though I think we might be able to avoid that), as sketched below.
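A rough sketch of that Arc-per-entry shape, with illustrative types and names rather than a committed design:

```rust
use std::collections::HashMap;
use std::sync::{Arc, LazyLock, RwLock};
use std::time::SystemTime;

struct SpanData {
    start_time: RwLock<SystemTime>, // interior lock; per the comment, possibly avoidable
}

static SPAN_DATA: LazyLock<RwLock<HashMap<u64, Arc<SpanData>>>> =
    LazyLock::new(|| RwLock::new(HashMap::new()));

fn on_enter(id: u64) {
    // Take the exterior lock only long enough to clone the Arc...
    let entry = SPAN_DATA.read().unwrap().get(&id).cloned();
    // ...then manipulate the entry with the map unlocked.
    if let Some(data) = entry {
        *data.start_time.write().unwrap() = SystemTime::now();
    }
}
```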
```
@@ -13,11 +17,15 @@ use crate::{
use super::*;

static SPAN_DATA: LazyLock<RwLock<HashMap<tracing::span::Id, SpanData, BuildHasherDefault<FNV1aHasher64>>>> = LazyLock::new(|| {
```
Get rid of the LazyLock. The only reason you need it is that `BuildHasherDefault::default()` is not const. The `BuildHasher` trait is trivial, so there's no reason why you can't make your own custom `BuildHasherFNv1a` in about 10 lines.
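Something like the following sketch, assuming the crate's existing FNV1aHasher64 (already referenced via BuildHasherDefault<FNV1aHasher64> in the diff) implements Hasher and Default:

```rust
use core::hash::BuildHasher;
// FNV1aHasher64 is assumed to be in scope from wherever the crate defines it.

#[derive(Clone, Copy, Default)]
struct BuildHasherFNv1a;

impl BuildHasher for BuildHasherFNv1a {
    type Hasher = FNV1aHasher64;

    fn build_hasher(&self) -> Self::Hasher {
        FNV1aHasher64::default()
    }
}
```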
It goes deeper than that. The trait itself is not const, and can't be marked as const without using an unstable feature. There's no way to use `with_hasher` at all in a const context.
But the trait does not need to be const to use with_hasher in a const context. with_hasher doesn't call any functions in the trait.
Try it. It compiles.
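For example, assuming the BuildHasherFNv1a sketch above and hashbrown's const `HashMap::with_hasher` (the claim being made here), the static could be built directly, with no LazyLock:

```rust
use hashbrown::HashMap;
use std::sync::RwLock;
// SpanData and BuildHasherFNv1a as defined elsewhere in the crate.

// No hasher is ever invoked here, so nothing in the BuildHasher trait needs to be const.
static SPAN_DATA: RwLock<HashMap<tracing::span::Id, SpanData, BuildHasherFNv1a>> =
    RwLock::new(HashMap::with_hasher(BuildHasherFNv1a));
```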
```diff
@@ -120,7 +128,8 @@ where
         return;
     };
 
-    if span.extensions().get::<SpanData>().is_some() {
+    if SPAN_DATA.read().unwrap().contains_key(&span.id()) {
```
This is pessimistic and costs performance only to guard against something that should not happen.
Instead, be optimistic and assume that this will almost never happen. If it will almost never happen, the best way to handle it is to build the new span and then use a try_insert.
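A sketch of that optimistic shape, using a hypothetical helper name; the entry API below is equivalent for this purpose to hashbrown's try_insert, which expresses the same thing directly.

```rust
// Hypothetical helper; SPAN_DATA and SpanData as in the PR. Build the span
// data unconditionally, then insert only if the key is not already present.
fn record_new_span(id: tracing::span::Id, data: SpanData) {
    SPAN_DATA
        .write()
        .unwrap()
        .entry(id)
        .or_insert(data); // a duplicate id (which should not happen) is simply ignored
}
```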
The tracing_subscriber span extensions are useful, but they are just a hashmap with extra steps, requiring unnecessary boxing, dynamic trait objects, and std. Replace them with a manual hashmap, using hashbrown rather than the built-in HashMap so we can get closer to no_std (hashbrown is the same code that backs the standard HashMap, maintained by the rust-lang team, and it works in no_std mode).
Also move `new` from the EventWriter trait to the ProviderTraits trait. This will simplify a follow-up PR.
Add an integration test for spans, with nested span entry to check for deadlocks. This adds a public "build" method that skips enablement checks so the user-mode code will always run. This can also be used by the benchmarks so you don't need to manually run WPR/perf.