BUG: unsafe CXI <-> gdrcopy cleanup interactions #10041
Comments
cc @jswaro
@tylerjereddy: Hi Tyler, I missed this notification. I'll look into this.
So, I looked at a single-thread fault and tracked it back to two separate threads. In the log above, the following sequence of events seems to have occurred: in PID=112670 there are at least two threads, TID=112670 and TID=122725.
Crash. MH is null, which triggers the issue in the library below. Something in libfabric likely zeroed the gdrcopy struct after the operation; the zeroing likely happened while TID=112670 was acquiring the global lock. If both threads are trying to unmap the same region, then one of a few possible fixes is needed.
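As an illustration only, here is a minimal sketch of the interleaving described above, using hypothetical stand-in types rather than the real libfabric/gdrcopy structures: the first thread through tears the mapping down and zeroes the shared struct, and the second thread, which was waiting on the same lock, then observes a NULL handle.

```c
/* Sketch of the suspected interleaving; struct and field names are
 * hypothetical stand-ins, not libfabric's actual types. Build with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct fake_handle {
	void *mh;               /* stands in for the gdrcopy mapping handle */
};

static pthread_spinlock_t lock;
static struct fake_handle shared = { .mh = (void *) 0x1 };

static void *unregister_path(void *arg)
{
	pthread_spin_lock(&lock);
	if (shared.mh == NULL)
		printf("thread %s: mh is NULL -> would crash in gdr_unmap\n",
		       (char *) arg);
	else
		printf("thread %s: unmapping handle %p\n",
		       (char *) arg, shared.mh);
	/* The first thread through zeroes the shared struct after cleanup,
	 * which is what the second thread then observes under the lock. */
	memset(&shared, 0, sizeof(shared));
	pthread_spin_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	pthread_create(&a, NULL, unregister_path, "A");
	pthread_create(&b, NULL, unregister_path, "B");
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}
```

Whichever thread enters second always reports the NULL handle here, which is the failure mode the log above suggests.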
@tylerjereddy Nice log capture. Checkpoints helped quite a bit.
@j-xiong: Do you see this happening in general with other providers using GDRCOPY? It seems like it could be unique to CXI's RMA operation tracking, but it's worth asking.
@jswaro I haven't seen this with other providers yet. While I can't say something similar won't happen with other providers, the sequence of events presented here is very much tied to the cxi provider.
@tylerjereddy: A couple of questions:
The CXI provider should have a 1-to-1 mapping between CXI memory region/descriptor and GDR handle. At face value, it seems like the same CXI memory region is being freed by two separate threads.
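To make that invariant concrete, here is a rough, hypothetical sketch of a strict one-owner layout; the struct and function names are illustrative and are not the provider's actual types:

```c
/* Hypothetical sketch of a 1-to-1 ownership layout: each memory descriptor
 * owns exactly one GDR handle, and closing the descriptor is the only
 * place that handle is ever released. Names are illustrative only. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct gdr_handle { int dummy; };       /* opaque stand-in for a gdrcopy handle */

struct cxi_md {                         /* hypothetical memory-descriptor wrapper */
	struct gdr_handle *gdr;         /* owned exclusively by this descriptor */
	bool               gdr_released;
};

static void release_gdr_handle(struct gdr_handle *h)   /* stand-in for unregister */
{
	printf("releasing GDR handle %p\n", (void *) h);
	free(h);
}

/* A single owner means a single free path; a second close is a no-op. */
static void cxi_md_close(struct cxi_md *md)
{
	if (md->gdr && !md->gdr_released) {
		release_gdr_handle(md->gdr);
		md->gdr = NULL;
		md->gdr_released = true;
	}
}

int main(void)
{
	struct cxi_md md = { .gdr = calloc(1, sizeof(struct gdr_handle)) };

	cxi_md_close(&md);
	cxi_md_close(&md);   /* safe: ownership was already released */
	return 0;
}
```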
@iziemba @jswaro Do you both have access to Cray Slingshot 11 hardware (>= 2 nodes)? Probably the easiest way to debug would be to run the cuFFTMp multi-node example yourselves, per my instructions at the top of NVIDIA/gdrcopy#296. If you don't have access to the hardware, I was hoping LANL HPC would help interface a bit, but in the meantime it may be possible for me to do an "interactive" debug session where I set up a reproducer and run it on two nodes for you to watch, add prints to, etc.

The problem is ultimately that we're trying to build out a toolchain that performs multi-node cuFFTMp on Cray Slingshot 11 hardware so that we can do physics simulations with, e.g., GROMACS, but this toolchain seems to be quite problematic and often doesn't complain when component versions mismatch, apart from segfaulting. You've got NVSHMEM and other components whose versions all have to line up. The original report was for x86_64, but we now also want to do these builds on Grace-Hopper.
Yes, either one of us should have access to the hardware to reproduce this. We'll see if it can be reproduced internally following the instructions in the ticket above.
While investigating NVIDIA/gdrcopy#296 (comment) I instrumented the `libfabric` `dev-cxi` branch from @thomasgillis. In a sample 2-node cuFFTMp run on a Slingshot 11 machine, I recorded the output: out12.txt

In particular, I was looking for evidence that `libfabric` was trying to free the same memory address multiple times, one level above where it was happening in `gdrcopy`. If we check the log, for example with `grep -E -i -n "0x7e1ddb0" out12.txt`, it looks like this is indeed happening: the `cuda_gdrcopy_dev_unregister()` function in `src/hmem_cuda_gdrcopy.c` is getting called on the same handle memory address in the same process, but from a different thread. In another trial, it looked like the spin lock in `cuda_gdrcopy_dev_unregister()` behaved slightly better, but ultimately there was still a crash from an attempt to free the already-unmapped address.

In `cuda_gdrcopy_dev_unregister()`, after `pthread_spin_lock(&global_gdr_lock);`, is there not a need to make sure that another thread hasn't already unmapped and unpinned the handle before resuming? I did start messing around with this a little bit. It still wasn't enough to get my cuFFTMp example running, but looking at the output I no longer see evidence of a second thread traversing `gdr_unmap` with the same address that was unpinned by the first thread. Is it plausible that this and other issues exist in this code, or am I way off base?
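For reference, here is a minimal sketch of the kind of re-check being suggested, using a hypothetical wrapper struct and stub calls in place of the real `gdr_unmap()`/`gdr_unpin_buffer()`; this is not the actual patch tried above, just the shape of the guard:

```c
/* Minimal sketch of re-checking state after acquiring the global lock;
 * struct and helper names are hypothetical, not libfabric's actual API. */
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

struct gdr_region {                 /* hypothetical wrapper around one mapping */
	void  *mh;                  /* stands in for gdr_mh_t */
	void  *cuda_ptr;
	size_t length;
	int    unmapped;            /* set once teardown has happened */
};

static pthread_spinlock_t global_lock;

/* stubs standing in for gdr_unmap()/gdr_unpin_buffer() */
static int stub_unmap(void *mh, void *ptr, size_t len)
{
	(void) mh; (void) ptr; (void) len;
	return 0;
}

static int stub_unpin(void *mh)
{
	(void) mh;
	return 0;
}

static int guarded_unregister(struct gdr_region *r)
{
	int err = 0;

	pthread_spin_lock(&global_lock);
	/* Re-check under the lock: another thread may have already torn
	 * this mapping down while we were waiting on the lock. */
	if (r->unmapped)
		goto out;

	err = stub_unmap(r->mh, r->cuda_ptr, r->length);
	if (err)
		goto out;
	err = stub_unpin(r->mh);
	if (!err)
		r->unmapped = 1;    /* publish the teardown before unlocking */
out:
	pthread_spin_unlock(&global_lock);
	return err;
}

int main(void)
{
	struct gdr_region r = { .mh = (void *) 0x1, .length = 4096 };

	pthread_spin_init(&global_lock, PTHREAD_PROCESS_PRIVATE);
	printf("first call:  %d\n", guarded_unregister(&r));
	printf("second call: %d (skipped, already unmapped)\n",
	       guarded_unregister(&r));
	return 0;
}
```

The design point is simply that any state consulted by the teardown has to be re-validated after the lock is acquired, since another thread may have completed the same teardown while this one was waiting.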