Device to Device transfers don't work with OpenMPI + LinkX provider on AMD GPUs #13048

Open
angainor opened this issue Jan 22, 2025 · 4 comments

@angainor

OpenMPI 5.0.6 with shm+cxi:lnx fails to perform device-to-device transfers on the LUMI system (AMD GPUs) with the OSU benchmark. Host-to-host transfers work as expected, both intra-node and inter-node. For device-to-device transfers, OpenMPI fails with the following:

export FI_LNX_PROV_LINKS=shm+cxi
mpirun --mca opal_common_ofi_provider_include "shm+cxi:lnx" -np 2 -map-by numa ./osu_bibw -m 131072: D D

# OSU MPI-ROCM Bi-Directional Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
--------------------------------------------------------------------------
Open MPI failed to register your buffer.
This error is fatal, your job will abort

  Buffer Type: rocm
  Buffer Address: 0x154beaa00000
  Buffer Length: 131072
  Error: Required key not available (4294967030)
--------------------------------------------------------------------------
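A note for anyone decoding the number in the `Error:` line: it looks like a negative return code printed as an unsigned 32-bit integer. The sketch below only does that reinterpretation. The helper name `as_signed32` is made up for illustration. Mapping the resulting 266 to libfabric's `FI_ENOKEY` is an assumption based on the "Required key not available" text and should be checked against `fi_errno.h` in the libfabric version in use.

```python
def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the value is negative in two's complement.
    return value - (1 << 32) if value & 0x80000000 else value


reported = 4294967030  # number printed in the "Error:" line above
signed = as_signed32(reported)

print(signed)   # -266
print(-signed)  # 266 (assumed to be FI_ENOKEY; verify in fi_errno.h)
```

If that reading is right, the registration call returned -266 and the message text is the error string for that code, which would be consistent with a memory-registration key problem rather than a ROCm allocation failure.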

@hppritcha identified the problem as related to #11076. A fix for that issue was made in #12290, but it was not merged to the 5.x branch.

@jsquyres
Member

AMD: Can you reply?

@edgargabriel
Member

I will need help here from @hppritcha and @amirshehataornl (who developed the linkx provider in libfabric), since I am not that familiar with this code path. If it's as simple as backporting PR #12290, then it shouldn't be a challenge.

@hppritcha
Member

This should be assigned to @naughtont3

@jsquyres
Member

Thanks @edgargabriel. Might want to look into this soon, so that it can get into 5.0.7 final, if possible.
