Describe the bug
I'm running NGC's HPL benchmark from Slurm. When I ran HPL in an HPL container on two servers with 8 GPUs per node, I encountered a UCX error.

Steps to Reproduce
- Command line: please see the attached log file.
- UCX version used (ucx_info -v): please see the attached log file.

Setup and versions
- OS version (cat /etc/issue or cat /etc/redhat-release, plus uname -a): please see the attached log file.
- cat /etc/mlnx-release (the string identifies the software and firmware setup): please see the attached log file.
- Driver version (rpm -q rdma-core or rpm -q libibverbs, or MLNX_OFED version from ofed_info -s): please see the attached log file.
- HW information from the ibstat or ibv_devinfo -vv command: please see the attached log file.
- Check whether peer-direct is loaded (lsmod | grep nv_peer_mem) and/or gdrcopy (lsmod | grep gdrdrv): please see the attached log file.

Additional information (depending on the issue)
- Output of ucx_info -d to show the transports and devices recognized by UCX: please see the attached log file.

Attachment: logfile.txt
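For anyone trying to reproduce the setup, a launch along these lines matches the description (a minimal sketch only: it assumes Slurm with the pyxis/enroot container plugin and the NGC hpc-benchmarks image; the image tag, hpl.sh entry script, and HPL.dat path are illustrative placeholders, not taken from the log):

# Hypothetical 2-node x 8-GPUs-per-node HPL run in the NGC container via Slurm/pyxis.
# Image tag, hpl.sh entrypoint, and HPL.dat path are placeholders.
srun -N 2 --ntasks-per-node=8 \
     --container-image="nvcr.io#nvidia/hpc-benchmarks:23.10" \
     ./hpl.sh --dat /path/to/HPL.dat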
@shinoharakazuya can you please post the output of the show_gids command, and check whether setting UCX_IB_ROCE_LOCAL_SUBNET=y helps to resolve the issue?
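One way to apply that suggestion under Slurm (a sketch, not from the thread; UCX_IB_ROCE_LOCAL_SUBNET is the UCX variable named above, which restricts RoCE connections to peers on the local subnet, and the srun arguments are the same placeholders as in the earlier sketch):

show_gids                           # inspect the RoCE GID table on each node
export UCX_IB_ROCE_LOCAL_SUBNET=y   # keep UCX RoCE traffic on the local subnet
srun -N 2 --ntasks-per-node=8 --export=ALL \
     --container-image="nvcr.io#nvidia/hpc-benchmarks:23.10" \
     ./hpl.sh --dat /path/to/HPL.dat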
@jandres742 FYI
NOTE: This issue happens on an NVIDIA internal cluster.