
128 Bit Atomics Fallback Implementation #215

Conversation

insertinterestingnamehere
Collaborator

Vendor in an implementation of lock-free 128 bit atomics from ARM's progress64 library as a fallback for when the standard implementation in gcc's libatomic isn't actually lock-free. Also test carefully to check whether the standard implementation will actually be lock-free and use it at least in those cases. LSE2 from gcc should be a bit more performant than the fallback from progress64 which uses LSE or just the consistency guarantees of armv8.

Ideally we'll eventually be able to stop carrying around an implementation of this, but that won't be for a while longer since gcc didn't actually start doing lock-free atomics with the LSE extensions until version 13.

This PR just adds the implementation, it doesn't replace any of the 128 bit atomic loads that highlighted the need for it yet. I'll do that separately.

A consequence of using these is that arm builds will likely become sensitive to the gcc version and exact codegen architecture used for compilation, but that's preferable to any hidden correctness issues that may arise from breaking the assumptions behind the standard library's lock-based implementation for atomics. There are a handful of places qthreads uses a speculative 128 bit load to sidestep the need for a lock. That idiom relies on mixed size atomic loads and stores applying correctly to a block of memory. Fortunately that actually does happen on x86 and arm, but only when the atomics are actually lock-free. That's not necessarily the case when using the standard fallback implementation with locks.

I'm initially marking this as a draft since I haven't finished the configure logic that will keep clang from taking the fallback implementation when it can just rely on gcc. Other than that, it should hopefully compile fine with all supported compilers. The source files are at least done.

@insertinterestingnamehere insertinterestingnamehere marked this pull request as draft December 20, 2023 00:05
@olivier-snl
Collaborator

We should talk about this a bit more please. Including external TPLs in Qthreads makes approval for running on some SNL systems more difficult versus doing a "--with-progress64=" option used only when needed/desired. If we did snapshot it in, on what cadence would we want to update the progress64 snapshot?

@insertinterestingnamehere
Collaborator Author

Great questions. I don't know what the approval process is like. Currently I'm just snapshotting the small piece of progress64 (https://github.com/ARM-software/progress64) that provides 128 bit atomics. Presumably that's an extremely stable portion so I don't foresee needing to update it at all unless there's some kind of critical bugfix made upstream. Hopefully we won't even need the fallback in a few more years and can just rely on the atomics from the C standard. That's not really feasible yet though unless we want to drop support for a whole bunch of stuff.

Since it took quite a bit of effort to sort through various patches, bug reports, and documentation to figure out what was even going on here, here's some more explanation of why all this nonsense is needed and why I'm suggesting we handle it this way. Totally open to discussion, just explaining the current status.

Why do we absolutely need lock-free 128 bit atomics?
Qthreads has an idiom in a few places where it speculatively loads 128 bits of a struct to see if it can read a value without bothering to acquire the associated lock (also contained in those same 128 bits). I'd really like to make this kind of thing optional eventually since that would help with portability, but currently I'm just trying to fix things as closely to "as-is" as possible before moving on to refactoring at that level. The current version relies on x86 to just do the right thing here, but in general this has to be done atomically or it's a race condition. Doing it atomically also means we need support for mixed-size atomics, since the lock sometimes gets loaded together with the value it protects and sometimes on its own. That's totally fine on x86 (Intel says it's fine, though I don't think it's in the spec) and on arm, but it's not okay with respect to the C standard, and it's not fine with the standard fallback implementation of locked atomics, which picks a lock by hashing the memory address of the atomic. This means this idiom in qthreads requires lock-free atomics, not locked atomics. We just haven't seen much trouble from this because in the past we've mostly been on x86 CPUs with AVX. Unfortunately, support for lock-free 128 bit atomics is horribly inconsistent across other implementations. Most recent hardware supports it in some form or another, but the standard implementation of atomics is terribly inconsistent about actually using that support.
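To make the idiom concrete, here's a minimal sketch of the speculative-read pattern, scaled down to a single 64-bit word so it stays portable; the lock flag and value packed into one word, the `spec_cell` type, and all the function names are illustrative, not the actual qthreads code (which does this over 128-bit structs):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical simplified sketch: a lock flag and a value packed into one
 * 64-bit word, read together with a single atomic load. This mirrors the
 * 128-bit struct idiom described above at a size where C11 atomics are
 * reliably lock-free. */
typedef struct {
    _Atomic uint64_t word; /* bit 0: lock flag, bits 1..63: value */
} spec_cell;

/* Attempt to read the value without taking the lock. Returns 1 and writes
 * *out if the lock bit was clear in the same atomic snapshot, 0 if the
 * reader must fall back to the locked path. The whole point is that the
 * lock and the value are observed in one consistent snapshot, which only
 * works if the load is genuinely lock-free. */
static int speculative_read(spec_cell *c, uint64_t *out) {
    uint64_t snap = atomic_load_explicit(&c->word, memory_order_acquire);
    if (snap & 1u)
        return 0;        /* locked: snapshot unusable */
    *out = snap >> 1;    /* lock and value seen together */
    return 1;
}

static void store_unlocked(spec_cell *c, uint64_t value) {
    atomic_store_explicit(&c->word, value << 1, memory_order_release);
}

static void lock_cell(spec_cell *c) {
    atomic_fetch_or_explicit(&c->word, 1u, memory_order_acq_rel);
}
```

With the standard lock-based fallback, the speculative load and a smaller-width store to part of the same memory would hash to different internal locks, which is exactly the assumption this pattern breaks.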

What would we have to drop to avoid needing to vendor in another implementation?
We'd need to only support gcc 13 or later on arm and only version 13 or the latest bugfix releases of gcc 11 and 12 on x86. We'd also need to only support armv8.4-a or later and require AVX on x86. All that might be workable, but it seemed like a lot to drop which is why I went out of my way to find an implementation that would get us support for architectures and compilers that are a bit older than that.

Can we implement these things ourselves?
Yes. I'm just not experienced with inline assembly so that would take longer and be more prone to errors because of the learning curve. Using a library seems like a more reliable option. If we did go this route, I really don't want to bother supporting anything earlier than armv8.1-a because the implementation for armv8-a looks especially nasty. That's a less critical limitation though.

What are all these ifdefs doing?
They're deciding whether to use a fallback implementation I've copied over from Arm's progress64 library. It's advantageous to use the standard atomics for performance when they're actually lock-free. Also it'd be nice to eventually get rid of a block of hand-implemented atomic operations since we should be getting that kind of thing from the compiler.

Why not just check for std::atomic<__int128>::is_always_lock_free at configure time?
This interface lies sometimes. Clang will report it as true even if it's not (llvm/llvm-project#75081) and gcc will report it as false even if it should be true (I still need to send those bug reports upstream...).
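For reference, the C-level equivalent of that check is the `__atomic_always_lock_free` builtin, which folds to a compile-time constant (so no libatomic link is needed). A probe might look like the sketch below, assuming a gcc or clang target; the function names are illustrative. Per the bug reports above, the 16-byte answer can't be trusted blindly:

```c
/* Compile-time probes for lock-free atomics of a given size, using the
 * gcc/clang __atomic_always_lock_free builtin (second argument is an
 * optional pointer for alignment; 0 means "typically aligned"). */
int probe_16_byte_lock_free(void) {
    /* This is what the ifdefs are effectively second-guessing: the
     * compiler's answer here can be wrong in both directions. */
    return __atomic_always_lock_free(16, 0);
}

int probe_8_byte_lock_free(void) {
    /* 8-byte atomics are lock-free on all 64-bit targets we care about. */
    return __atomic_always_lock_free(8, 0);
}
```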

What about a configure time check of something like std::atomic<__int128> a; a.is_lock_free();?
In theory this should be equivalent to this mass of ifdefs, but the values I'm seeing here conflict with the documented behavior and with the implementation in gcc's source. I suspect bugs for 128 bit values, but I haven't fully confirmed yet. That's a work in progress.

Wait, but if the compiler doesn't know at compile-time, how can some of these atomics be lock-free at all?
libatomic dynamically detects the architecture it's running on at load time using cpuid (or similar on arm) and dispatches its implementation using ifuncs.
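A toy model of that dispatch scheme, using a plain function pointer instead of an ELF ifunc so it runs anywhere: the implementation is chosen once based on a runtime probe, then all calls go through the pointer. `cpu_has_fast_path()` here is a stand-in for the cpuid/HWCAP checks libatomic actually does, and the lazy initialization is for illustration only (real libatomic resolves the ifunc at load time, before any threads exist):

```c
#include <stdint.h>

/* Pretend CPU-feature probe; real code would use cpuid or getauxval. */
static int cpu_has_fast_path(void) { return 0; }

/* Two interchangeable implementations of the same operation. */
static uint64_t load_generic(const uint64_t *p) { return *p; }
static uint64_t load_fast(const uint64_t *p)    { return *p; }

static uint64_t (*load_impl)(const uint64_t *) = 0;

/* Dispatch through the pointer, selecting an implementation on first use.
 * Note this lazy init is not thread-safe as written; it just sketches the
 * idea of ifunc-style load-time selection. */
uint64_t dispatched_load(const uint64_t *p) {
    if (!load_impl)
        load_impl = cpu_has_fast_path() ? load_fast : load_generic;
    return load_impl(p);
}
```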

Why not just rely on the dynamic dispatch completely?
We need to guarantee that libatomic is not falling back to the lock-based implementation of atomics because it doesn't work for our optimistic load idiom.

What does AVX have to do with this?
On x86-64, 128 bit loads and stores are guaranteed atomic by Intel and AMD's specs, but only on processors with AVX. There's no such guarantee for any others. Gcc's libatomic will fall back to lock-based atomics in those cases, potentially giving us correctness issues.

Why the weird gcc version checks for x86?
gcc only uses the 128 bit loads and stores on AVX with version 13 and with the most recent bugfix releases of 11 and 12.

What are we doing when that doesn't work?
The vendored progress64 code also includes an implementation of 128 bit atomics for x86 using cmpxchg16b, which will be somewhat slower but correct.
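The cmpxchg16b trick is worth spelling out: an atomic load can be emulated by a compare-and-swap whose expected and desired values are equal, so memory is never visibly modified but the current contents come back atomically. Here's that idea sketched at 64 bits with the `__sync` builtin (the progress64 x86 fallback applies the same idea at 128 bits, which is why it's slower than a plain load):

```c
#include <stdint.h>

/* Emulate an atomic load with a CAS: compare against 0 and "replace" with
 * 0. If *p != 0 the CAS fails and returns the current value; if *p == 0
 * it stores 0 back, which changes nothing visible. Either way we get an
 * atomic snapshot. Caveats: the location must be writable, and the CAS
 * takes the cache line exclusive, so this is costlier than a real load. */
uint64_t atomic_load_via_cas(uint64_t *p) {
    return __sync_val_compare_and_swap(p, 0, 0);
}
```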

Should we drop the versions of gcc that don't do this instead?
Maybe. We still need to vendor in the 128 bit atomic implementation from progress64 to manage what's going on with arm though, so I figured using their x86 implementation as a fallback was a better fix than dropping support for a bunch of stuff.

Should we drop support for non-AVX x86?
Maybe, but again, the cost of falling back to the version from progress64 is low. I have AVX on by default in our configure file now though, but there's a switch to turn it off and use the fallback implementation.

Should we just use the 128 bit loads and stores ourselves on x86 with AVX?
This is possible. I just haven't bothered since we get the optimal behavior with recent gcc releases anyway.

What even is progress64?
It's a permissively licensed collection of concurrency primitives implemented in C from ARM. See https://github.com/ARM-software/progress64. In particular, they already have a working implementation of 128 bit atomics there for arm 8 or later. I'm just snapshotting in the relevant files for that. It's a small portion of the library.

Okay, so what's going on with arm?
Lots. First off, we're only bothering with arm 8. Progress64 has two implementations of 128 bit atomics: one for armv8-a and another for armv8.1-a and later. Unfortunately gcc will not use the equivalent idioms on those platforms due to concerns about read-only memory (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814#c3). In order to get lock-free 128 bit atomics from gcc's libatomic, armv8.4-a or later is needed, since that's when arm introduced the additional instructions providing the functionality gcc required. Also, gcc only uses the new armv8.4-a instructions starting with version 13. To summarize: the relevant versions here are armv8-a, armv8.1-a, and armv8.4-a. Gcc only works with 8.4; progress64 gives us implementations for the other two cases. Armv8 starts to account for atomics properly in the memory model. 8.1 adds better/faster atomic instructions (the LSE extensions), but they're not quite enough to implement the standard interface because of corner cases that we don't care about in qthreads. 8.4 adds still more instructions and performance improvements (the LSE2 extensions), and that's what gcc uses (but only with version 13 or later).

What's going on with the weird arm version checks? What's up with __ARM_FEATURE_BTI?
gcc incorrectly reports __ARM_ARCH. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109415 for details. Whenever that macro starts working correctly, we'll be able to check for arm 8.4 directly. Unfortunately with it broken, there's actually no way to directly check for arm 8.4 vs 8.3 at compile-time. We can check for the __ARM_FEATURE_BTI macro though, which is only present on 8.5 or later, so it's the best we can do. __ARM_FEATURE_ATOMICS will be defined on 8.1 or later since the LSE extensions are available on those architectures, but unfortunately there isn't a feature macro to check for the updated LSE2 extensions. Note: this is why compiling with -march=native will probably become beneficial on arm.
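Putting that together, the feature-macro ladder looks roughly like the sketch below. The macro names (`__aarch64__`, `__ARM_FEATURE_BTI`, `__ARM_FEATURE_ATOMICS`) are real, but the tier descriptions and the function wrapper are just illustrative, not the actual qthreads ifdefs:

```c
/* Illustrative version of the selection logic described above: since
 * __ARM_ARCH can't be trusted, use __ARM_FEATURE_BTI (armv8.5+) as a
 * conservative proxy for "LSE2 is available" and __ARM_FEATURE_ATOMICS
 * (armv8.1+) for "LSE is available". */
const char *arm_atomics_tier(void) {
#if !defined(__aarch64__)
    return "not arm: use the x86/other logic";
#elif defined(__ARM_FEATURE_BTI)
    return "armv8.5+: trust libatomic (LSE2 in use)";
#elif defined(__ARM_FEATURE_ATOMICS)
    return "armv8.1+: progress64 LSE path";
#else
    return "armv8-a: progress64 baseline path";
#endif
}
```

Note the gap this leaves: 8.4 machines (which do have LSE2) land on the LSE path unless built with -march=native, which is the compile-time cost mentioned above.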

Should we add -march=native to our build?
Maybe? Binaries are generally assumed to be at least somewhat relocatable in practice, but I don't know to what extent that applies to our downstream users. It definitely could make sense to at least do this for arm. Feedback particularly appreciated here.

What's going on with clang?
Clang happily inlines the appropriate instructions for 128 bit atomics in all optimized builds with armv8-a and later... in spite of the fact that it's supposed to be compatible with gcc... which is using locked atomics in those cases. On top of that, because clang is inlining the atomics instead of calling out to libatomic, it loses the benefit of dispatching at runtime to a potentially better implementation if the runtime architecture is significantly newer than the build-time one. In debug builds clang still calls out to whatever libatomic it gets from gcc. Unfortunately it's impossible to detect via preprocessor macros what version of gcc clang is getting its supporting libraries from. Clang sets its __GNUC__ and __GNUC_MINOR__ macros to mimic gcc 4.2, not whatever version of gcc it's actually using. This info can be obtained at configure time though. I still need to go back, get it, and then pipe that info through to our preprocessor with a separate flag (either our own define or clang's -fgnuc-version flag). Once that's done, we can have clang use the standard 128 bit atomics in all optimized builds and then fall back to the logic involving the gcc version for debug builds.
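A tiny probe shows why the preprocessor is a dead end here; the function name is illustrative, and this assumes a compiler that defines `__GNUC__` at all (gcc, clang, icc, icx, acfl):

```c
/* Encode the gcc version the compiler claims to be as major*100+minor.
 * Under real gcc this tracks the actual libatomic; under clang it is
 * pinned at 402 (4.2) regardless of which gcc installation actually
 * supplies libatomic, which is why the real version has to be discovered
 * at configure time instead. */
#if defined(__GNUC__)
int claimed_gnuc_version(void) {
    return __GNUC__ * 100 + __GNUC_MINOR__;
}
#else
int claimed_gnuc_version(void) { return 0; /* non-GNU-compatible compiler */ }
#endif
```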

What about icc, icx, and acfl?
icc mimics whatever version of gcc it's using for libgcc/libatomic/etc., so it's fine. Icx and acfl are just clang on the inside and will behave similarly to clang.

What about powerpc?
As best I can tell, 128 bit atomics have worked for a very long time on power. I haven't yet been able to confirm that gcc actually uses them though.

@olivier-snl
Collaborator

@insertinterestingnamehere Thank you for the explanations that put things in perspective. As I'm only a few hours from being through with the calendar year for work, I'm going to read it all carefully upon my return. We can also talk it over with our stakeholders in case they have any concerns.

@insertinterestingnamehere
Collaborator Author

Okay, this should be checking the version of gcc that clang uses for its supporting libraries at configure time now so we have that info for making decisions about what clang does WRT 128 bit atomics. This turned out to be substantially harder than I'd expected, but this version does work.

One other alternative that occurred to me: we could maybe try to shrink the size of the structs that are getting loaded speculatively instead of bothering with 128 bit atomics. The two examples I'm aware of are qt_spin_trylock_t which would probably be fine and syncvar_t which probably wouldn't, but I don't know for sure. In the case of syncvar, shrinking it would make it so that a syncvar could no longer hold a pointer as its data which seems unlikely to work, but I don't know exactly how it's used everywhere.

@insertinterestingnamehere
Collaborator Author

insertinterestingnamehere commented Jan 16, 2024

Ooookay, I did some more digging through the qthreads source and it looks like the main motivating examples here are actually fine with just 64 bit atomics which is so much easier to work with. They still need to be lock-free and support overlapping mixed-size atomic accesses, but that is actually the case on all the hardware we support.

I'll leave this open for a while though since this patch has almost no impact on the existing code. It's quite possible another use-case will show up that does require the 128 bit atomics so now we have them if we need them. On the other hand, not having to maintain a consistent interface for 128 bit atomics until things stabilize a bit more upstream would be ideal so hopefully this isn't even necessary.
