Heterogeneous Operations on CUDA and ROCm Nodes Using UCX/UCC #9985
UCX supports both CUDA and ROCm, so in theory it should support such an environment. However, that scenario has never been tested or optimized.
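As a first sanity check on each node, it may help to confirm which GPU transports the local UCX build actually exposes. The sketch below is only an illustration, not an official procedure: it assumes `ucx_info -d` is on the `PATH` and prints `Transport: <name>` lines, and it assumes `cuda_copy`/`cuda_ipc` and `rocm_copy`/`rocm_ipc` are the transport names of interest (the list is not exhaustive). The helper names `parse_transports`, `gpu_support`, and `query_local_node` are hypothetical.

```python
import re
import shutil
import subprocess

# GPU-related UCX transport names (an assumed, non-exhaustive list).
GPU_TRANSPORTS = {
    "cuda": {"cuda_copy", "cuda_ipc"},
    "rocm": {"rocm_copy", "rocm_ipc"},
}


def parse_transports(ucx_info_text):
    """Collect transport names from 'Transport: <name>' lines."""
    return set(re.findall(r"Transport:\s*(\S+)", ucx_info_text))


def gpu_support(transports):
    """Map each GPU runtime to True if any of its transports is present."""
    return {name: bool(tls & transports) for name, tls in GPU_TRANSPORTS.items()}


def query_local_node():
    """Run `ucx_info -d` locally; return None if the tool is not installed."""
    if shutil.which("ucx_info") is None:
        return None
    result = subprocess.run(
        ["ucx_info", "-d"], capture_output=True, text=True, check=False
    )
    return gpu_support(parse_transports(result.stdout))


if __name__ == "__main__":
    print(query_local_node())
```

Running this on the CUDA node and on the ROCm node shows whether each build picked up its own GPU runtime. It says nothing about whether the two nodes can interoperate.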
Thank you for the response. That sounds like a starting point. I'll try to test it and see how it works in practice.
I successfully built UCX with ROCm (v6.1.2) on an AMD node (AWS Despite these efforts, I encountered the following issues:
UCC log details for ROCm node
UCC log details for CUDA node
I would greatly appreciate any hints or suggestions on how to "force" the distributed job to run for my PoC, or any guidance on configuration or debugging steps. Thank you for your help.
Hello UCX Team,
I'm working on a high-performance computing project involving nodes with different GPU setups: some nodes have NVIDIA GPUs running CUDA and others have AMD GPUs running ROCm. I am exploring ways to perform efficient MPI operations across these heterogeneous nodes.
Is it possible to use UCX and UCC to facilitate communication and collective operations between nodes with CUDA and ROCm environments? Specifically, can UCX and UCC act as a middleware to bridge the communication between RCCL (for ROCm) and NCCL (for CUDA)? If so, are there any specific configurations or build steps required to enable this interoperability?
Thank you for your guidance and support.