UCX installation done with OFED doesn't recognize cuda, cuda_cpy etc. #9950
Comments
@RamHPC Probably the UCX that comes with the MLNX_OFED bundle does not include compiled-in cuda and gdrcopy support. Currently MLNX_OFED includes UCX CUDA support for most, but not all, operating systems.
Thank you! Will give this a try. When I look at the flags of the UCX that MLNX_OFED installed, it looks like it included cuda and gdr_copy, but somehow the transports don't show up. Is there no way to recalibrate the existing installation?
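For anyone comparing the two installs, a minimal sketch of a flag check is below. It assumes `ucx_info -v` prints a `Configured with:` line containing the configure options; the function name `check_ucx_flags` is hypothetical and not part of UCX. Note that a flag being present only shows how the build was configured, which is consistent with the observation above that the flags look right while the transports are still missing.

```shell
# Hypothetical helper (not part of UCX): reads `ucx_info -v` output on stdin
# and reports whether each flag given as an argument appears on the
# "Configured with:" line (assumed output format).
check_ucx_flags() {
    config_line=$(grep 'Configured with:' || true)
    for flag in "$@"; do
        # Match the flag either bare ("--with-cuda ") or with a value ("--with-cuda=...").
        case " $config_line " in
            *" $flag "*|*" $flag="*) printf '%s: present\n' "$flag" ;;
            *) printf '%s: missing\n' "$flag" ;;
        esac
    done
}

# Example usage on a machine with UCX installed:
#   ucx_info -v | check_ucx_flags --with-cuda --with-gdrcopy
```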
Describe the bug
Installed ucx-1.16 and everything was working fine. The devices/transports recognized were in line with expectations. Then installed OFED (MLNX_OFED_LINUX-24.04-0.6.6.0-rhel8.9-x86_64), which automatically installed ucx-1.17. This version doesn't show cuda, cuda_copy and gdr_copy as devices/transports.
Steps to Reproduce
ucx_info -v
export UCX_TLS=ib,sm,cuda,cuda_copy,cuda_ipc,gdr_copy
Setup and versions
cat /etc/issue or cat /etc/redhat-release + uname -a
Red Hat Enterprise Linux release 8.9 (Ootpa) + Linux gpu2 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Mar 14 14:20:09 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/mlnx-release (the string identifies software and firmware setup)
rpm -q rdma-core or rpm -q libibverbs
ofed_info -s
$ ofed_info -s
MLNX_OFED_LINUX-24.04-0.6.6.0:
ibstat or ibv_devinfo -vv command
Driver Version: 555.42.02
lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
$ lsmod|grep gdrdrv
gdrdrv 24576 0
nvidia 8691712 365 nvidia_uvm,nvidia_fs,gdrdrv,nvidia_modeset
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX
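To make the comparison between the working ucx-1.16 install and the OFED-provided ucx-1.17 install concrete, here is a rough sketch that lists the transports reported by `ucx_info -d` and prints which expected ones are absent. It assumes the output contains lines of the form `#      Transport: <name>`; the function names are hypothetical helpers, not UCX commands.

```shell
# Hypothetical helper: reads `ucx_info -d` output on stdin and prints each
# distinct transport name, assuming lines like "#      Transport: cuda_copy".
list_ucx_transports() {
    awk '$1 == "#" && $2 == "Transport:" { print $3 }' | sort -u
}

# Hypothetical helper: reads `ucx_info -d` output on stdin and prints every
# expected transport (passed as arguments) that was not reported.
missing_ucx_transports() {
    found=$(list_ucx_transports)
    for tl in "$@"; do
        printf '%s\n' "$found" | grep -qx -- "$tl" || printf '%s\n' "$tl"
    done
}

# Example usage on a machine with UCX installed:
#   ucx_info -d | missing_ucx_transports cuda_copy cuda_ipc gdr_copy
```

Only concrete transport names are checked here. Values such as ib, sm and cuda in UCX_TLS are, as far as I understand, aliases for groups of transports, so they would not be expected to appear as `Transport:` entries themselves.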