Describe the bug
Installed ucx-1.16 and everything was working fine. The devices/transports recognized were in line with expectations. I then installed OFED (MLNX_OFED_LINUX-24.04-0.6.6.0-rhel8.9-x86_64), which automatically installed ucx-1.17. This version doesn't show cuda, cuda_copy and gdr_copy as devices/transports.
Steps to Reproduce
Command line
UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
Any UCX environment variables used
export UCX_TLS=ib,sm,cuda,cuda_copy,cuda_ipc,gdr_copy
Setup and versions
OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
cat /etc/issue or cat /etc/redhat-release + uname -a
Red Hat Enterprise Linux release 8.9 (Ootpa) + Linux gpu2 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Mar 14 14:20:09 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
For RDMA/IB/RoCE related issues:
Driver version:
rpm -q rdma-core or rpm -q libibverbs
or: MLNX_OFED version ofed_info -s
$ ofed_info -s
MLNX_OFED_LINUX-24.04-0.6.6.0:
HW information from ibstat or ibv_devinfo -vv command
Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
nv_peer_mem has to be loaded manually; it was not loaded when the problem happened.
$ lsmod|grep gdrdrv
gdrdrv 24576 0
nvidia 8691712 365 nvidia_uvm,nvidia_fs,gdrdrv,nvidia_modeset
Additional information (depending on the issue)
OpenMPI version
5.0.3
Output of ucx_info -d to show transports and devices recognized by UCX
Configure result - config.log
Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
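To make the symptom easier to confirm, here is a minimal sketch that reads `ucx_info -d` output and prints which of the expected transports are absent. The helper name `missing_transports` is hypothetical, and the sketch assumes that `ucx_info -d` lists each transport on a line of the form `#      Transport: <name>`; adjust the pattern if your build prints something different.

```shell
#!/bin/sh
# missing_transports (hypothetical helper): read `ucx_info -d` style output
# on stdin and print every transport named in the arguments that is not listed.
missing_transports() {
    listing=$(cat)
    for tl in "$@"; do
        # Assumption: each transport appears as "#      Transport: <name>".
        if ! printf '%s\n' "$listing" | grep -Eq "Transport:[[:space:]]*${tl}[[:space:]]*\$"; then
            printf '%s\n' "$tl"
        fi
    done
}

# Usage on a real system (requires ucx_info in PATH):
#   ucx_info -d | missing_transports cuda_copy cuda_ipc gdr_copy
```

An empty result would mean all named transports were found; any name printed is one that the installed UCX does not report.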
@RamHPC The UCX that comes with the MLNX_OFED bundle probably does not include compiled-in cuda and gdrcopy support. Currently MLNX_OFED includes ucx/cuda support for most, but not all, operating systems.
I'd suggest continuing to use UCX from the GitHub distribution, that is, uninstalling all ucx-* RPMs that came from MLNX_OFED.
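Before uninstalling anything, it may help to see which UCX packages are actually present. The sketch below filters `rpm -qa` output down to packages whose name is `ucx` or starts with `ucx-`; the helper name `ucx_packages` is hypothetical, and whether MLNX_OFED splits UCX into sub-packages on a given OS is an assumption to verify locally.

```shell
#!/bin/sh
# ucx_packages (hypothetical helper): read a package list on stdin
# (one package per line, as printed by `rpm -qa`) and keep only UCX packages.
ucx_packages() {
    # Match "ucx" exactly or any name beginning with "ucx-".
    grep -E '^ucx(-|$)' | LC_ALL=C sort
}

# Usage on a real system:
#   rpm -qa | ucx_packages
# To see which package owns the ucx_info found first in PATH:
#   rpm -qf "$(command -v ucx_info)"
```

The listing only shows what is installed; removing the packages is a separate, deliberate step.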
Thank you! Will give this a try. When I look at the configure flags of the UCX installed by MLNX_OFED, it looks like cuda and gdr_copy were included, but somehow the transports don't show up. Is there no way to reconfigure the existing installation?
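The configure-flag observation can be checked mechanically. This is a rough sketch, assuming `ucx_info -v` prints the configure options on a single line and that CUDA and GDRCopy support are requested with `--with-cuda` and `--with-gdrcopy`; the helper name `has_flag` is hypothetical. Note that a flag being present only says how the build was configured, not that the corresponding transport was installed or can be loaded at run time.

```shell
#!/bin/sh
# has_flag (hypothetical helper): succeed if the configure flag given as $1
# appears as a whole option in the text on stdin (e.g. `ucx_info -v` output).
has_flag() {
    # The flag must be preceded by start-of-line or whitespace and followed
    # by "=", whitespace, or end-of-line, so --without-cuda does not match.
    grep -Eq -- "(^|[[:space:]])$1(=|[[:space:]]|\$)"
}

# Usage on a real system (requires ucx_info in PATH):
#   ucx_info -v | has_flag --with-cuda    && echo "configured with CUDA"
#   ucx_info -v | has_flag --with-gdrcopy && echo "configured with GDRCopy"
```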
Driver Version: 555.42.02 (reported alongside the ibstat / ibv_devinfo -vv hardware information)