@lijh5 What is the UCX version? Is it possible to run "perf top" on the sender and receiver sides to see what takes the most CPU time?
@yosefe 1. I am using hpcx-v2.14, ucx-v1.15.0.
receiver: [perf top output]
ucp_am_bw:
receiver: [perf top output]
From this perspective, it is true that tag_bw does a lot of memory copying on the receiver.
The test commands are as follows:
Using tag_bw:
UCX_TLS=rc UCX_NET_DEVICES=mlx5_0:1 UCX_ZCOPY_THRESH=16384 UCX_RNDV_THRESH=16384 numactl -N 0 ucx_perftest -t tag_bw -s 4088
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |  bandwidth (MB/s)   | message rate (msg/s)  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall | average  | overall  |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]            1596961      0.230     0.626     0.626    6224.36    6224.36     1596554     1596554
[thread 0]            3188257      0.230     0.629     0.627    6203.02    6213.69     1591081     1593818
[thread 0]            4785185      0.230     0.626     0.627    6224.57    6217.32     1596607     1594747
[thread 0]            6381601      0.230     0.626     0.627    6223.32    6218.82     1596287     1595132
[thread 0]            7979553      0.230     0.626     0.627    6229.42    6220.94     1597851     1595676
[thread 0]            9576481      0.230     0.626     0.627    6224.44    6221.52     1596575     1595826
Final:               10000000      0.230     0.626     0.627    6229.85    6221.87     1597961     1595916
Using ucp_am_bw:
UCX_TLS=rc UCX_NET_DEVICES=mlx5_0:1 UCX_MAX_EAGER_LANES=4 UCX_MAX_RNDV_LANES=4 UCX_ZCOPY_THRESH=16384 UCX_RNDV_THRESH=16384 numactl -N 0 ucx_perftest -t ucp_am_bw -s 4088
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |  bandwidth (MB/s)   | message rate (msg/s)  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall | average  | overall  |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]            2340027      0.260     0.427     0.427    9133.24    9133.24     2342684     2342684
[thread 0]            4999162      0.240     0.376     0.400   10378.73    9755.98     2662154     2502419
[thread 0]            7635845      0.280     0.379     0.392   10291.09    9934.35     2639674     2548170
Final:               10000000      0.240     0.378     0.389   10311.26   10020.95     2644847     2570383
Question:
The tag_bw and ucp_am_bw results differ significantly (about 6222 MB/s versus 10021 MB/s overall bandwidth). Are there any parameters that can be tuned so that tag_bw reaches the performance of ucp_am_bw?
MPI applications such as osu_bw use tag matching, so they will see the same low performance!
Thank you.
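To put the gap in numbers, here is a small Python sketch that cross-checks the "Final" rows of the two runs above. It assumes that the "MB/s" column of ucx_perftest means 2**20 bytes per second; that is an inference from the reported figures, not something taken from the UCX documentation. The helper name `bandwidth_from_rate` is made up for this illustration.

```python
# Figures copied from the "Final" rows of the two ucx_perftest runs above.
MSG_SIZE_BYTES = 4088   # from "-s 4088"
MIB = 1024 * 1024       # assumption: perftest "MB/s" is 2**20 bytes per second

final_rows = {
    "tag_bw":    {"bw_mb_s": 6221.87,  "msg_rate": 1595916},
    "ucp_am_bw": {"bw_mb_s": 10020.95, "msg_rate": 2570383},
}


def bandwidth_from_rate(msg_rate, msg_size=MSG_SIZE_BYTES):
    """Hypothetical helper: bandwidth in MB/s derived from a message rate in msg/s."""
    return msg_rate * msg_size / MIB


for name, row in final_rows.items():
    derived = bandwidth_from_rate(row["msg_rate"])
    print(f"{name}: reported {row['bw_mb_s']:.2f} MB/s, derived {derived:.2f} MB/s")

# Ratio of the overall bandwidth of the two tests.
ratio = final_rows["ucp_am_bw"]["bw_mb_s"] / final_rows["tag_bw"]["bw_mb_s"]
print(f"ucp_am_bw / tag_bw overall bandwidth: {ratio:.2f}x")
```

Under that assumption the derived values agree with the reported bandwidth columns to within rounding, and ucp_am_bw comes out at roughly 1.61 times the overall bandwidth of tag_bw for 4088-byte messages.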