cuda, rc Bandwidth fluctuates regularly #10164
Comments
@yangrudan does it happen with a smaller message size (for example, 4 MB)?
@yangrudan can you please try setting
By the way, ping is OK:
root@NH-DC-NM129-I06-12U-GPU-246:~/yangrudan/ucx/out/bin# ping 172.16.4.1
PING 172.16.4.1 (172.16.4.1) 56(84) bytes of data.
64 bytes from 172.16.4.1: icmp_seq=1 ttl=61 time=0.215 ms
64 bytes from 172.16.4.1: icmp_seq=2 ttl=61 time=0.115 ms
^C
--- 172.16.4.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1030ms
rtt min/avg/max/mdev = 0.115/0.165/0.215/0.050 ms
@yangrudan this looks like an issue with the IP/routing config. Can you please try ping from a specific interface (add
It doesn't seem to work. I only added -I on the client side.
Sorry, I meant to try ping with a specific interface, something like:
The first ping command fails, which shows a problem reaching mlx5_cx6_3 on one server from mlx5_cx6_3 on the other. Can you please check the network config?
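The per-interface reachability check discussed above can be scripted. This is a minimal sketch, assuming Linux iputils `ping` (its `-I` option selects the outgoing interface or source address). The helpers `parse_ping_summary` and `ping_from` are illustrative names and are not part of UCX.

```python
import re
import subprocess

# Matches the iputils summary line, e.g.
# "2 packets transmitted, 2 received, 0% packet loss, time 1030ms"
SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) received.*?([\d.]+)% packet loss"
)

def parse_ping_summary(output: str) -> dict:
    """Extract sent/received counts and loss percentage from ping output."""
    match = SUMMARY_RE.search(output)
    if match is None:
        # Happens when ping printed no summary, e.g. an unknown interface.
        raise ValueError("no ping summary line found")
    sent, received, loss = match.groups()
    return {"sent": int(sent), "received": int(received), "loss_pct": float(loss)}

def ping_from(interface: str, target: str, count: int = 2) -> dict:
    """Ping target through a specific interface and return the parsed summary."""
    result = subprocess.run(
        ["ping", "-I", interface, "-c", str(count), "-W", "1", target],
        capture_output=True, text=True, check=False,
    )
    return parse_ping_summary(result.stdout)
```

For example, `ping_from("ens1f0", "172.16.4.1")` would test the peer address from this thread, where `ens1f0` is a hypothetical name for the network interface backed by mlx5_cx6_3. A result with `loss_pct` of 100 on one interface but 0 on another points at per-interface routing or addressing rather than at UCX.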
Maybe it's a network config problem. Closing this issue.
Yes, it seems the reason is that these devices are not reachable. In any case, to get good GPU memory performance for GPU0, the mlx5_cx6_3 device should be used, according to the nvidia-smi topology output.
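Once the devices are reachable, the NIC can be pinned through UCX's `UCX_NET_DEVICES` environment variable. This is a minimal sketch, assuming the device is addressed as `mlx5_cx6_3:1` (port 1 is an assumption) and that `ucx_perftest` is on `PATH`. The helper `perftest_command` is hypothetical, and the test name and message size are illustrative, not taken from the original report.

```python
import os

def perftest_command(server, device="mlx5_cx6_3:1", size_bytes=4 * 1024 * 1024):
    """Return (argv, env) for a ucx_perftest client run pinned to one NIC."""
    env = dict(os.environ)
    env["UCX_NET_DEVICES"] = device  # restrict UCX to this device:port (assumed port 1)
    argv = [
        "ucx_perftest", server,
        "-t", "tag_bw",         # tag-matching bandwidth test (illustrative choice)
        "-m", "cuda",           # place the test buffers in CUDA (GPU) memory
        "-s", str(size_bytes),  # message size in bytes
    ]
    return argv, env
```

The returned pair can be passed to `subprocess.run(argv, env=env)`. The GPU-to-NIC affinity itself can be read from the matrix printed by `nvidia-smi topo -m`.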
Thank you very much for your patience. 😊
Describe the bug
When I run ucx_perftest between two nodes, the bandwidth fluctuates regularly.
Steps to Reproduce
Setup and versions
OS version
Linux NH-DC-NM129-I06-12U-GPU-246 5.4.0-193-generic Ubuntu SMP Fri Aug 2 19:14:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
For RDMA/IB/RoCE related issues:
MLNX_OFED_LINUX-5.8-3.0.7.0:
For GPU related issues: