prov/efa: Using Shared Memory with EFA results in strange errors with libfabric 1.19 #9694
Comments
Libfabric 1.18 and 1.19 differ considerably in how the EFA provider uses the SHM provider to offload intra-node traffic. As for the specific error you reported, “unexpected status received from receiver”: it usually indicates that the remote peer is down for some reason. I don’t have a theory for how this could relate to the shm provider, since it is a completion error from the EFA NIC. Can you run with FI_LOG_LEVEL=warn in your environment and see whether more log output is printed? Another thing worth trying is to disable shm provider usage inside EFA via FI_EFA_ENABLE_SHM_TRANSFER=0, to see whether the behavior changes.
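These two variables are normally just exported in the shell before launching the job. For an application that cannot easily change its launch environment, a minimal sketch of setting them programmatically might look like the following. The helper name `set_efa_debug_env` is hypothetical, and the sketch assumes it is called before the first libfabric call (for example `fi_getinfo`), so that the values are in place when the library and provider read their environment.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>

/*
 * Hypothetical helper: set the two diagnostic variables suggested above.
 * Assumption: this runs before any libfabric call, so the settings are
 * visible when libfabric and the EFA provider initialize.
 *
 * enable_shm == 0 requests FI_EFA_ENABLE_SHM_TRANSFER=0 (shm offload off),
 * any other value requests FI_EFA_ENABLE_SHM_TRANSFER=1.
 *
 * Returns 0 on success, -1 if setenv() fails.
 */
int set_efa_debug_env(int enable_shm)
{
    /* Ask libfabric to print warnings as well as errors. */
    if (setenv("FI_LOG_LEVEL", "warn", 1) != 0)
        return -1;

    /* Toggle the EFA provider's use of the shm provider. */
    if (setenv("FI_EFA_ENABLE_SHM_TRANSFER", enable_shm ? "1" : "0", 1) != 0)
        return -1;

    return 0;
}
```

Calling `set_efa_debug_env(0)` at the top of `main` would correspond to the run requested here: warning-level logs with the shm offload disabled.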
Could you also clarify how you |
Right, apologies, that didn't come across correctly (the formatting also got badly mangled, so I'll post the error again). The connection between the communication and the shared memory is a little more indirect than that. We use shared memory only for intra-node communication. The pointers we use as our symmetric heap (VA) point into the shared memory region, and we register those symmetric heap addresses with the NIC using fi_mr_regattr. There are two distinct cases:
Libfabric from installer 1.30 (1.19)
case (2) - When we try to issue a write operation from the same address in the shared memory region, we have no problems.
Libfabric from installer 1.22 (1.17)
The only real connection between the two is the change in registration size (~1GiB - ~2GiB)
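For readers following along, the setup described in this comment (a shared memory region used as the symmetric heap and registered with `fi_mr_regattr`) can be sketched roughly as below. This is not the reporter's code: the helper names `map_shared_heap` and `register_heap` are hypothetical, an anonymous shared mapping stands in for the real inter-process shared memory segment, and the access flags are an assumption. The registration helper is only compiled when the libfabric headers are available.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/uio.h>

/* Pull in libfabric only if its headers are installed. */
#if defined(__has_include)
#  if __has_include(<rdma/fi_domain.h>)
#    include <rdma/fabric.h>
#    include <rdma/fi_domain.h>
#    define HAVE_LIBFABRIC 1
#  endif
#endif

/*
 * Hypothetical stand-in for the symmetric heap: an anonymous MAP_SHARED
 * mapping. A real application would map a named or inherited segment
 * that other processes on the node also map.
 * Returns NULL on failure.
 */
void *map_shared_heap(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

#ifdef HAVE_LIBFABRIC
/*
 * Hypothetical helper: register [base, base + len) with one domain.
 * Assumption: the heap is both a local source and a remote target,
 * hence the local and remote access flags below.
 * Returns 0 on success or a negative libfabric error code.
 */
int register_heap(struct fid_domain *domain, void *base, size_t len,
                  struct fid_mr **mr)
{
    struct iovec iov = { .iov_base = base, .iov_len = len };
    struct fi_mr_attr attr = {
        .mr_iov = &iov,
        .iov_count = 1,
        .access = FI_READ | FI_WRITE | FI_REMOTE_READ | FI_REMOTE_WRITE,
    };

    return fi_mr_regattr(domain, &attr, 0, mr);
}
#endif /* HAVE_LIBFABRIC */
```

Under these assumptions, the situation in the report would correspond to calling `register_heap` on the same `base` and `len` more than once, either against one domain or against domains opened on different EFA devices, and then issuing writes from addresses inside the region.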
@Seth5141 Thanks for the clarification, I am still interested in seeing your results with |
That is with FI_LOG_LEVEL=warn. Unfortunately, I don't have the entire logs available to me. I failed to grab them off the machine before the allocation I had access to expired. I don't have an immediate way to get back on a system to continue testing. I can comment on this though:
I was playing with options to try and get to the bottom of this and did toggle that flag. It didn't have any effect on my results.
@Seth5141 It will be hard to identify the issue without having more warning logs to understand why the |
Agreed. I'll let you know more information as soon as I can get access to another P5D instance. Until then, I am as stuck as you.
Describe the bug
In libfabric 1.19 (from the aws-efa installer v1.30), I am unable to communicate over EFA from shared memory that has been registered more than once, whether with the same or different EFA devices.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Libfabric 1.18 and 1.19 should behave the same way in this case.
Output
Environment:
Ubuntu on a P5DN instance
Additional context
@aws-ofiwg-bot