Describe the bug
I'm running NGC's HPL benchmark from Slurm. When I ran HPL in an HPL container on two servers with 8 GPUs per node, I encountered a UCX error.

Steps to Reproduce
- Command line: please see the attached log file.
- UCX version used (ucx_info -v): please see the attached log file.

Setup and versions
- OS version (cat /etc/issue or cat /etc/redhat-release, plus uname -a): please see the attached log file.
- cat /etc/mlnx-release (the string identifies the software and firmware setup): please see the attached log file.
- Driver version (rpm -q rdma-core or rpm -q libibverbs, or MLNX_OFED version from ofed_info -s): please see the attached log file.
- HW information from the ibstat or ibv_devinfo -vv command: please see the attached log file.
- Check whether peer-direct is loaded (lsmod | grep nv_peer_mem) and/or gdrcopy (lsmod | grep gdrdrv): please see the attached log file.

Additional information (depending on the issue)
- Output of ucx_info -d to show the transports and devices recognized by UCX: please see the attached log file.

Attachment: logfile.txt
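For anyone trying to reproduce the setup, a launch along these lines matches the description (a minimal sketch only: it assumes Slurm with the pyxis/enroot container plugin and the NGC hpc-benchmarks image; the image tag, hpl.sh entry script, and HPL.dat path are illustrative placeholders, not taken from the log):

# Hypothetical 2-node x 8-GPUs-per-node HPL run in the NGC container via Slurm/pyxis.
# Image tag, hpl.sh entrypoint, and HPL.dat path are placeholders.
srun -N 2 --ntasks-per-node=8 \
     --container-image="nvcr.io#nvidia/hpc-benchmarks:23.10" \
     ./hpl.sh --dat /path/to/HPL.dat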
@shinoharakazuya can you please post the output of the show_gids command, and check whether setting UCX_IB_ROCE_LOCAL_SUBNET=y helps to resolve the issue?
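One way to apply that suggestion under Slurm (a sketch, not from the thread; UCX_IB_ROCE_LOCAL_SUBNET is the UCX variable named above, which restricts RoCE connections to peers on the local subnet, and the srun arguments are the same placeholders as in the earlier sketch):

show_gids                           # inspect the RoCE GID table on each node
export UCX_IB_ROCE_LOCAL_SUBNET=y   # keep UCX RoCE traffic on the local subnet
srun -N 2 --ntasks-per-node=8 --export=ALL \
     --container-image="nvcr.io#nvidia/hpc-benchmarks:23.10" \
     ./hpl.sh --dat /path/to/HPL.dat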
@jandres742 FYI
NOTE: This issue happens on an NVIDIA internal cluster.