Single network card is very fast, dual network card speed is super slow #10249

Open
yangrudan opened this issue Oct 23, 2024 · 3 comments

@yangrudan

Describe the bug

I am running the ucx_perftest tag_bw test with GDR on my machine. When the environment variables select a single network card, the measured bandwidth is very high. When they select both network cards, the bandwidth is extremely low. Both network cards have the optimal PCIe topology for the device.

image

Steps to Reproduce

My commands (in each pair, the first line starts the server and the second line runs the client):

UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1 ./ucx_perftest -t tag_bw
UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1 ./ucx_perftest -t tag_bw -s 4194304 -m cuda 173.22.3.35

UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 ./ucx_perftest -t tag_bw 
UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 ./ucx_perftest -t tag_bw -s 4194304 -m cuda 173.22.3.35

UCX version: 1.17

# Library version: 1.17.0
# Library path: /workspace/zccl-ucx/build/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch 'v1.17_kunlun_930', revision 551089e
# Configured with: --prefix=/workspace/zccl-ucx/build --enable-compiler-opt=0 --with-cuda=/usr/local/xpu --with-verbs --with-dm --with-rdmacm --enable-mt=yes --with-rc --with-mlx5-dv --with-go=no --enable-kunlun-gdr

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • Ubuntu 20.04.5 LTS
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • MLNX_OFED_LINUX-24.07-0.6.1.0
  • For GPU related issues:
    • GPU type: Kunlun XPU
@yangrudan yangrudan added the Bug label Oct 23, 2024
@brminich
Contributor

What is the performance when you set UCX_NET_DEVICES=mlx5_1:1?

@yangrudan
Author

> What is the performance when you set UCX_NET_DEVICES=mlx5_1:1?

Both mlx5_0 and mlx5_1 have the optimal PCIe topology for my XPU, so with UCX_NET_DEVICES=mlx5_1:1 alone it is also fast, as shown below.

image
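
For anyone reproducing this, the three device selections compared so far (each NIC alone, then both together) could be run back to back from the client side. The following is only a sketch: the commands and server address are copied from the report, while `DRY_RUN`, `SERVER`, and `client_cmd` are hypothetical names introduced here. It assumes the server side is started separately before each run with the same UCX_TLS and UCX_NET_DEVICES values.

```shell
#!/bin/sh
# Client-side sketch: run the same tag_bw test once per NIC selection.
# Assumption: a matching ucx_perftest server is started before each run.
SERVER=${SERVER:-173.22.3.35}   # server address taken from the report
DRY_RUN=${DRY_RUN:-1}           # hypothetical switch: 1 = only print the commands

# Print the client command line for one UCX_NET_DEVICES value.
client_cmd() {
    echo "UCX_TLS=rc_v,cuda UCX_NET_DEVICES=$1 ./ucx_perftest -t tag_bw -s 4194304 -m cuda $SERVER"
}

for devs in mlx5_0:1 mlx5_1:1 mlx5_0:1,mlx5_1:1; do
    cmd=$(client_cmd "$devs")
    echo "== $devs =="
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"        # show what would be run
    else
        sh -c "$cmd"       # actually run the client
    fi
done
```

With the default `DRY_RUN=1` the script only prints the three command lines, which makes it easy to check them before running with `DRY_RUN=0`.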

@brminich
Contributor

Can you try profiling it with Linux perf and check for hotspots?
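
One possible way to do this, as a rough sketch rather than a prescribed procedure: wrap the slow dual-NIC client command from the report in `perf record -g`, then print a text summary with `perf report`. It assumes perf is installed and the server side is already running. `DRY_RUN`, `SERVER`, `PERF_DATA`, `record_cmd`, and `report_cmd` are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Sketch: sample the dual-NIC client run with call graphs, then list hotspots.
SERVER=${SERVER:-173.22.3.35}      # server address taken from the report
PERF_DATA=${PERF_DATA:-perf.data}  # output file name (arbitrary choice)
DRY_RUN=${DRY_RUN:-1}              # hypothetical switch: 1 = only print the commands

# Print the profiling command for the slow configuration.
record_cmd() {
    echo "UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 perf record -g -o $PERF_DATA -- ./ucx_perftest -t tag_bw -s 4194304 -m cuda $SERVER"
}

# Print the command that summarizes samples per shared object and symbol.
report_cmd() {
    echo "perf report -i $PERF_DATA --stdio --sort dso,symbol"
}

for cmd in "$(record_cmd)" "$(report_cmd)"; do
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"        # show what would be run
    else
        sh -c "$cmd"       # actually run it
    fi
done
```

Recording the fast single-NIC run the same way and comparing the two reports may make it easier to see which symbols only show up in the slow case.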
