
cuda, rc Bandwidth fluctuates regularly #10164

Closed
yangrudan opened this issue Sep 20, 2024 · 15 comments
@yangrudan

Describe the bug

When I run ucx_perftest between two nodes, the bandwidth fluctuates regularly.

+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]                33      0.588 440383.181 440383.181     2325.25    2325.25           2           2
[thread 0]                53      0.658 54057.252 294599.812    18942.88    3475.90          18           3
[thread 0]                65  43810.296 183909.575 274164.691     5567.95    3734.98           5           4
[thread 0]                87  48500.309 47570.272 216864.953    21526.05    4721.83          21           5
[thread 0]                97  48503.238 193600.011 214466.505     5289.26    4774.64           5           5
[thread 0]               118  48505.271 47841.231 184812.855    21404.13    5540.74          21           5
[thread 0]               129  48513.199 178729.274 184294.100     5729.34    5556.34           6           5
[thread 0]               151  48513.988 47579.408 164375.403    21521.92    6229.64          21           6
[thread 0]               161  48513.988 192064.810 166095.242     5331.53    6165.14           5           6
[thread 0]               182  48516.302 48136.813 152484.654    21272.70    6715.43          21           7
[thread 0]               193  48519.674 178550.265 153970.259     5735.08    6650.64           6           6
[thread 0]               215  48520.655 47590.321 143084.870    21516.98    7156.59          21           7

Steps to Reproduce

  • ./ucx_perftest -s 1073741824 -t tag_bw -m cuda -p 12345
  • ./ucx_perftest 172.16.4.4 -s 1073741824 -t tag_bw -m cuda -p 12345
  • ucx_info -v
# Library version: 1.18.0
# Library path: /root/yangrudan/ucx/build/../out/lib/libucs.so.0
# API headers version: 1.18.0
# Git branch 'master', revision 6a87bb1
# Configured with: --prefix=/root/yangrudan/ucx/build/../out --enable-compiler-opt=0 --with-cuda=/usr/local/cuda --with-verbs --with-dm --with-rdmacm --enable-mt=yes --with-rc --with-mlx5-dv --with-go=no
  • env variables: UCX_NET_DEVICES=mlx5_cx6_0:1 UCX_TLS=cuda,rc (combined invocation sketched below)
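
Putting the steps above together, the runs were presumably invoked roughly as follows (combining the listed env variables with the two ucx_perftest commands is an assumption; the commands, device name, and IP come from the report):

# server side
UCX_NET_DEVICES=mlx5_cx6_0:1 UCX_TLS=cuda,rc ./ucx_perftest -s 1073741824 -t tag_bw -m cuda -p 12345
# client side (172.16.4.4 is the server address from the report)
UCX_NET_DEVICES=mlx5_cx6_0:1 UCX_TLS=cuda,rc ./ucx_perftest 172.16.4.4 -s 1073741824 -t tag_bw -m cuda -p 12345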

Setup and versions

  • OS version

  • Linux NH-DC-NM129-I06-12U-GPU-246 5.4.0-193-generic Ubuntu SMP Fri Aug 2 19:14:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  • For RDMA/IB/RoCE related issues:

    • Driver version:
      MLNX_OFED_LINUX-5.8-3.0.7.0:
  • For GPU related issues:

    • GPU type
    • Cuda:
      • NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3
root@NH-DC-NM129-I06-12U-GPU-246:~/yangrudan/ucx/out/bin# lsmod|grep peer
nvidia_peermem         16384  0
ib_core               348160  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              56143872  354 nvidia_uvm,nvidia_peermem,nvidia_modeset
yangrudan added the Bug label Sep 20, 2024
@yosefe
Contributor

yosefe commented Sep 22, 2024

@yangrudan does it happen with a smaller message size (for example, 4 MB)?
Does it also happen if you remove the "--enable-compiler-opt=0" flag from configure?

@yangrudan
Author

@yangrudan does it happen with a smaller message size (for example, 4 MB)? Does it also happen if you remove the "--enable-compiler-opt=0" flag from configure?

image
It seems that the bandwidth is uniform at 4 MB; what could be the reason for that?

The UCX_PROTO_INFO output is as below:
image

@yosefe
Contributor

yosefe commented Sep 23, 2024

@yangrudan

  1. Is there a similar fluctuation with 1 GB of host memory?
  2. Can you pls try setting "UCX_RNDV_SCHEME=get_zcopy" to see if it helps resolve it?
  3. Can you pls post the output of "nvidia-smi topo -m"?
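
A minimal sketch of suggestions 1 and 2 above, on the client side, assuming the same invocation as in the original report (whether the reporter ran them exactly this way is an assumption):

# suggestion 1: repeat the 1 GB test with host memory instead of CUDA memory
UCX_NET_DEVICES=mlx5_cx6_0:1 UCX_TLS=cuda,rc ./ucx_perftest 172.16.4.4 -s 1073741824 -t tag_bw -m host -p 12345
# suggestion 2: repeat the CUDA test while forcing the get_zcopy rendezvous scheme
UCX_RNDV_SCHEME=get_zcopy UCX_NET_DEVICES=mlx5_cx6_0:1 UCX_TLS=cuda,rc ./ucx_perftest 172.16.4.4 -s 1073741824 -t tag_bw -m cuda -p 12345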

@yangrudan
Author

@yangrudan

  1. Is there a similar fluctuation with 1 GB of host memory?
  2. Can you pls try setting "UCX_RNDV_SCHEME=get_zcopy" to see if it helps resolve it?
  3. Can you pls post the output of "nvidia-smi topo -m"?
    1. Host memory doesn't show the fluctuation;
      image
    2. Setting "UCX_RNDV_SCHEME=get_zcopy" helps, but the bandwidth is lower than expected;
      image
    3. The server's and client's topologies are as follows;
#server=======================================================================
root@NH-DC-NM129-I05-20U-GPU-242:~/yangrudan/github_ucx_build/bin# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     32-63,96-127    1               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     32-63,96-127    1               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     32-63,96-127    1               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     32-63,96-127    1               N/A
NIC0    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     PHB     PHB     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      PHB     PHB     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PHB     PHB      X      PIX     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PIX      X      SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_cx6_0
  NIC1: mlx5_cx6_1
  NIC2: mlx5_cx6_2
  NIC3: mlx5_cx6_3
  NIC4: mlx5_cx4lx_0
  NIC5: mlx5_cx4lx_1
  NIC6: mlx5_cx4lx_2
  NIC7: mlx5_cx4lx_3
  NIC8: mlx5_cx4lx_4
  NIC9: mlx5_cx4lx_5


#client================================================================
root@NH-DC-NM129-I06-12U-GPU-246:~/yangrudan/ucx/out/bin# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     32-63,96-127    1               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     32-63,96-127    1               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     32-63,96-127    1               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     32-63,96-127    1               N/A
NIC0    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     PHB     PHB     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      PHB     PHB     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PHB     PHB      X      PIX     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PIX      X      SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_cx6_0
  NIC1: mlx5_cx6_1
  NIC2: mlx5_cx6_2
  NIC3: mlx5_cx6_3
  NIC4: mlx5_cx4lx_0
  NIC5: mlx5_cx4lx_1
  NIC6: mlx5_cx4lx_2
  NIC7: mlx5_cx4lx_3
  NIC8: mlx5_cx4lx_4
  NIC9: mlx5_cx4lx_5

@yosefe
Contributor

yosefe commented Sep 23, 2024

@yangrudan can you pls try setting UCX_NET_DEVICES=mlx5_cx6_3:1 - both with and without UCX_RNDV_SCHEME=get_zcopy?
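
A sketch of the two runs being asked for, reusing the client command from the original report (whether the variable also needs to be set on the server side is not stated in the thread):

# run 1: restrict UCX to the NIC that is PXB-attached to GPU0/GPU1
UCX_NET_DEVICES=mlx5_cx6_3:1 UCX_TLS=cuda,rc ./ucx_perftest 172.16.4.4 -s 1073741824 -t tag_bw -m cuda -p 12345
# run 2: the same, additionally forcing the get_zcopy rendezvous scheme
UCX_RNDV_SCHEME=get_zcopy UCX_NET_DEVICES=mlx5_cx6_3:1 UCX_TLS=cuda,rc ./ucx_perftest 172.16.4.4 -s 1073741824 -t tag_bw -m cuda -p 12345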

@yangrudan
Author

yangrudan commented Sep 23, 2024

@yangrudan can you pls try setting UCX_NET_DEVICES=mlx5_cx6_3:1 - both with and without UCX_RNDV_SCHEME=get_zcopy?

An error occurred, as shown below:
image

By the way, ping works fine:

root@NH-DC-NM129-I06-12U-GPU-246:~/yangrudan/ucx/out/bin# ping 172.16.4.1
PING 172.16.4.1 (172.16.4.1) 56(84) bytes of data.
64 bytes from 172.16.4.1: icmp_seq=1 ttl=61 time=0.215 ms
64 bytes from 172.16.4.1: icmp_seq=2 ttl=61 time=0.115 ms
^C
--- 172.16.4.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1030ms
rtt min/avg/max/mdev = 0.115/0.165/0.215/0.050 ms

@yosefe
Contributor

yosefe commented Sep 23, 2024

@yangrudan it seems there is some issue with the IP/routing config; can you pls try pinging from a specific interface (add -I ens24np0)?

@yangrudan
Author

yangrudan commented Sep 23, 2024

@yangrudan it seems there is some issue with the IP/routing config; can you pls try pinging from a specific interface (add -I ens24np0)?

It seems that it doesn't work. And I only added -I on the client side.
image

@yosefe
Contributor

yosefe commented Sep 23, 2024

Sorry, I meant trying ping with a specific interface, something like:
root@NH-DC-NM129-I06-12U-GPU-246:~/yangrudan/ucx/out/bin# ping -I ens24np0 172.16.4.1

@yangrudan
Author

Sorry, I meant trying ping with a specific interface, something like: root@NH-DC-NM129-I06-12U-GPU-246:~/yangrudan/ucx/out/bin# ping -I ens24np0 172.16.4.1

Sorry, I misunderstood
image

@yosefe
Contributor

yosefe commented Sep 23, 2024

The first ping command fails, which points to some issue reaching mlx5_cx6_3 on one server from mlx5_cx6_3 on the other server. Can you pls check the network config?

@yangrudan
Author

yangrudan commented Sep 23, 2024

I don't know much about network configuration. Could it be that the IP addresses of mlx5_cx6_3 of the two servers are not in the same subnet?
image
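
One way to check this, assuming ens24np0 is the netdev backing mlx5_cx6_3 (that mapping is an assumption based on the earlier ping attempts), is to compare the address/prefix on both nodes and ask the kernel which interface it routes through:

# on each node: show the IPv4 address and prefix length of the RoCE interface
ip -br addr show dev ens24np0
# on the client: show which interface/gateway would be used to reach the server's address
ip route get 172.16.4.1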

@yangrudan
Author

Maybe it's a network config issue. Closing this issue.

@yosefe
Contributor

yosefe commented Sep 24, 2024

I don't know much about network configuration. Could it be that the IP addresses of mlx5_cx6_3 of the two servers are not in the same subnet?

Yes, that seems to be the reason these devices are not reachable. Anyway, in order to get good GPU memory performance for GPU0, according to the nvidia-smi topology output, the mlx5_cx6_3 device should be used.
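
Assuming the test keeps using GPU0 and the subnet issue on mlx5_cx6_3 gets fixed, the recommended setup would look roughly like the sketch below (pinning the visible GPU with CUDA_VISIBLE_DEVICES is an assumption, since the thread does not show which GPU ucx_perftest selected):

# client side: use only GPU0 and the NIC that is PXB-attached to it
CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_cx6_3:1 UCX_TLS=cuda,rc ./ucx_perftest <server-ip> -s 1073741824 -t tag_bw -m cuda -p 12345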

@yangrudan
Author

I don't know much about network configuration. Could it be that the IP addresses of mlx5_cx6_3 of the two servers are not in the same subnet?

Yes, that seems to be the reason these devices are not reachable. Anyway, in order to get good GPU memory performance for GPU0, according to the nvidia-smi topology output, the mlx5_cx6_3 device should be used.

Thank you very much for your patience. 😊
