128 Bit Atomics Fallback Implementation #215
We should talk about this a bit more, please. Including external TPLs in Qthreads makes approval for running on some SNL systems more difficult than a "--with-progress64=" option used only when needed or desired. If we did snapshot it in, on what cadence would we want to update the progress64 snapshot?
Great questions. I don't know what the approval process is like. Currently I'm just snapshotting the small piece of progress64 (https://github.com/ARM-software/progress64) that provides 128 bit atomics. Presumably that's an extremely stable portion, so I don't foresee needing to update it at all unless some kind of critical bugfix is made upstream. Hopefully we won't even need the fallback in a few more years and can just rely on the atomics from the C standard. That's not really feasible yet, though, unless we want to drop support for a whole bunch of stuff.

Since it took quite a bit of effort to sort through various patches, bug reports, and documentation to figure out what was even going on here, here's some more explanation of why all this nonsense is needed and why I'm suggesting we handle it this way. Totally open to discussion, just explaining the current status.

- Why do we absolutely need lock-free 128 bit atomics?
- What would we have to drop to avoid needing to vendor in another implementation?
- Can we implement these things ourselves?
- What are all these ifdefs doing?
- Why not just check for
- What about a configure time check of something like
- Wait, but if the compiler doesn't know at compile-time, how can some of these atomics be lock-free at all?
- Why not just rely on the dynamic dispatch completely?
- What does AVX have to do with this?
- Why the weird gcc version checks for x86?
- What are we doing when that doesn't work?
- Should we drop the versions of gcc that don't do this instead?
- Should we drop support for non-AVX x86?
- Should we just use the 128 bit loads and stores ourselves on x86 with AVX?
- What even is progress64?
- Okay, so what's going on with arm?
- What's going on with the weird arm version checks?
- What's up with
- Should we add -march=native to our build?
- What's going on with clang?
- What about icc, icx, and acfl?
- What about powerpc?
@insertinterestingnamehere Thank you for the explanations that put things in perspective. As I'm only a few hours from being through with the calendar year for work, I'm going to read it all carefully upon my return. We can also talk it over with our stakeholders, lest they have any concerns.
Okay, this should now be checking, at configure time, the version of gcc that clang uses for its supporting libraries, so we have that info for making decisions about what clang does WRT 128 bit atomics. This turned out to be substantially harder than I'd expected, but this version does work. One other alternative that occurred to me: we could maybe try to shrink the size of the structs that are getting loaded speculatively instead of bothering with 128 bit atomics. The two examples I'm aware of are
Ooookay, I did some more digging through the qthreads source, and it looks like the main motivating examples here are actually fine with just 64 bit atomics, which are so much easier to work with. They still need to be lock-free and support overlapping mixed-size atomic accesses, but that is actually the case on all the hardware we support. I'll leave this open for a while, though, since this patch has almost no impact on the existing code. It's quite possible another use case will show up that does require the 128 bit atomics, so now we have them if we need them. On the other hand, not having to maintain a consistent interface for 128 bit atomics until things stabilize a bit more upstream would be ideal, so hopefully this isn't even necessary.
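To make the 64 bit variant of the idiom concrete, here is a minimal sketch, assuming a hypothetical two-field record (`pair64`, `head`, and `count` are illustrative names, not taken from the qthreads source). One field is written with a 32 bit atomic store while the whole record is read speculatively with a single 64 bit atomic load and then updated with a 64 bit compare-and-swap. It uses the gcc/clang `__atomic` builtins. Note that overlapping mixed-size atomic accesses like this go beyond what the C standard guarantees; the sketch relies on the hardware behavior described above and on the accesses being lock-free.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-byte record; the names are illustrative only. */
typedef union {
  uint64_t whole; /* used for the speculative 64 bit load and the CAS */
  struct {
    uint32_t head;
    uint32_t count;
  } parts; /* used for narrower 32 bit atomic accesses */
} pair64;

_Static_assert(sizeof(pair64) == 8, "record must fit in one 64 bit atomic");

/* Narrow (32 bit) atomic store to one field of the record. */
static inline void pair64_set_count(pair64 *p, uint32_t c) {
  __atomic_store_n(&p->parts.count, c, __ATOMIC_RELEASE);
}

/* Speculative load: read both fields in one 64 bit atomic load, no lock. */
static inline pair64 pair64_snapshot(pair64 *p) {
  pair64 snap;
  snap.whole = __atomic_load_n(&p->whole, __ATOMIC_ACQUIRE);
  return snap;
}

/* Commit an update only if the record still matches the snapshot.
 * Returns nonzero on success, zero if another writer got there first. */
static inline int pair64_try_update(pair64 *p, pair64 expected, pair64 desired) {
  return __atomic_compare_exchange_n(&p->whole, &expected.whole, desired.whole,
                                     0 /* strong */, __ATOMIC_ACQ_REL,
                                     __ATOMIC_ACQUIRE);
}
```

A caller would take a snapshot, compute the desired value from it, and retry from a fresh snapshot whenever `pair64_try_update` returns zero. If the narrow store went through a lock-based fallback instead, the wide load would not be guaranteed to observe it consistently, which is why lock-freedom matters here.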
…s64 as a fallback for when the standard implementation isn't actually lock free. Also test carefully to check whether the standard implementation will actually be lock free and use it at least in those cases.
…ng libraries into a preprocessor define.
Vendor in an implementation of lock-free 128 bit atomics from ARM's progress64 library as a fallback for when the standard implementation in gcc's libatomic isn't actually lock-free. Also test carefully to check whether the standard implementation will actually be lock-free and use it at least in those cases. LSE2 from gcc should be a bit more performant than the fallback from progress64 which uses LSE or just the consistency guarantees of armv8.
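As a rough illustration of the kind of test involved (not the actual logic in this PR, which also looks at compiler versions and target features), the gcc/clang builtin `__atomic_always_lock_free` reports whether the compiler can guarantee at compile time that atomics of a given size are lock-free for the current target. The enum and function names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the two code paths; not from the qthreads source. */
typedef enum { ATOMICS128_NATIVE, ATOMICS128_FALLBACK } atomics128_impl;

/* True only when the compiler guarantees, for this target and these flags,
 * that every typically aligned 16-byte atomic access is lock-free.
 * The second argument (0) asks about typical alignment for that size. */
static inline bool native_128_always_lock_free(void) {
  return __atomic_always_lock_free(16, 0);
}

/* Use the compiler's atomics when they are known to be lock-free and
 * otherwise take the vendored fallback. */
static inline atomics128_impl choose_atomics128_impl(void) {
  return native_128_always_lock_free() ? ATOMICS128_NATIVE
                                       : ATOMICS128_FALLBACK;
}
```

A compile-time answer of "not always lock-free" does not rule out a lock-free path being selected at run time inside libatomic, which is part of why a simple check like this one is not sufficient on its own and the version checks are still needed.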
Ideally we'll eventually be able to stop carrying around an implementation of this, but that won't be for a while longer since gcc didn't actually start doing lock-free atomics with the LSE extensions until version 13.
This PR just adds the implementation; it doesn't yet replace any of the 128 bit atomic loads that highlighted the need for it. I'll do that separately.
A consequence of using these is that arm builds will likely become sensitive to the gcc version and the exact codegen architecture used for compilation, but that's preferable to any hidden correctness issues that may arise from breaking the assumptions behind the standard library's lock-based implementation of atomics. There are a handful of places where qthreads uses a speculative 128 bit load to sidestep the need for a lock. That idiom relies on mixed-size atomic loads and stores applying correctly to the same block of memory. Fortunately that actually does happen on x86 and arm, but only when the atomics are actually lock-free. That's not necessarily the case when using the standard fallback implementation with locks.
I'm initially marking this as a draft since I haven't finished the configure logic that will keep clang from taking the fallback implementation when it can just rely on gcc. Other than that, it should hopefully compile fine with all supported compilers. The source files are at least done.