Reduced performance of MPI_Allreduce in OpenMPI-5 compared to OpenMPI-4 #13082
Your program performs a single MPI_Allreduce call, with no warm-up iterations. First check the PingPong performance (both inter and intra node) to make sure there is not something obviously wrong with your setup.
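A quick way to run such a ping-pong check might look like the following sketch (host names are placeholders; osu_latency is from the same OSU micro-benchmarks suite used later in this issue, and IMB PingPong would serve equally well):
# inter-node ping-pong: one rank on each of two nodes
$ mpirun -np 2 --host nodeA:1,nodeB:1 ./osu_latency
# intra-node ping-pong: two ranks on the same node
$ mpirun -np 2 --host nodeA:2 ./osu_latency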
I deliberately didn't use any warm-up iterations. My concern is about a single call for collectives, like in the example above. I also ran OSU tests that showed a similar performance degradation for the Allreduce latency test:
$ osu_allreduce --version
# OSU MPI Allreduce Latency Test v7.1
$ mpirun -np 256 osu_allreduce -x 1000 -m 20000000
Results:
OMPI-5:
# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 23.53
2 28.21
4 47.42
8 34.88
16 26.18
32 33.35
64 26.99
128 32.02
256 40.39
512 39.98
1024 49.09
2048 48.13
4096 78.87
8192 110.08
16384 253.83
32768 439.93
65536 369.35
131072 569.72
262144 1114.83
524288 2938.23
1048576 6487.50
2097152 13952.42
4194304 46665.48
8388608 55700.13
16777216 114236.02
OMPI-4:
# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 12.25
2 10.56
4 10.40
8 10.34
16 10.52
32 13.30
64 242.02
128 16.28
256 17.17
512 17.59
1024 19.18
2048 55.12
4096 35.89
8192 52.79
16384 86.91
32768 159.92
65536 294.05
131072 628.76
262144 1288.59
524288 2616.03
1048576 5254.27
2097152 15237.20
4194304 27232.65
8388608 52380.46
16777216 110870.23
I know that OMPI-5 switched from ORTE to PRRTE. Could it be that some default parameters changed and now affect the performance?
I would first try to understand how communications are performed and which collective modules are used; you might want to run your test program with the PML and coll framework verbosity turned up. My first intention would be not to suspect this is a PRRTE issue.
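As a sketch of what that could look like (standard Open MPI MCA verbosity parameters; the exact level is arbitrary):
# raise framework verbosity so the selected PML and coll components are reported
$ mpirun -np 256 --mca pml_base_verbose 10 --mca coll_base_verbose 10 osu_allreduce -x 1000 -m 20000000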
You have a dependency on UCC; it is possible the UCC collective component is being selected and affects the results.
Thanks for your advice. @bosilca, checking on the coll component priorities in the two installations gives:
$ diff ompi4_priority.dat ompi5_priority.dat
3c3
< mca:coll:han:param:coll_han_priority:value:0
---
> mca:coll:han:param:coll_han_priority:value:35
9a10
> mca:coll:ftagree:param:coll_ftagree_priority:value:30
The coll_han_priority default changed from 0 to 35, and OMPI-5 additionally has the ftagree component (priority 30). However, it seems that OMPI-5 gets much closer to OMPI-4 once the PML and the collective components are selected explicitly:
OMPI-5 with --mca pml ucx --mca coll basic,tuned,libnbc --mca coll_han_priority 35
# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 10.29
2 9.63
4 9.52
8 9.50
16 9.36
32 10.69
64 10.97
128 13.76
256 14.68
512 15.05
1024 17.32
2048 24.94
4096 29.41
8192 53.14
16384 34.75
32768 49.39
65536 80.90
131072 148.58
262144 340.29
524288 821.65
1048576 2091.90
2097152 18382.99
4194304 19004.10
8388608 36119.31
16777216 81548.02
OMPI-5 with --mca coll_han_priority 0
# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 30.86
2 26.29
4 22.09
8 27.41
16 29.43
32 31.47
64 52.18
128 35.33
256 30.33
512 33.00
1024 36.32
2048 43.09
4096 55.13
8192 71.08
16384 94.10
32768 129.57
65536 215.37
131072 356.73
262144 614.98
524288 1259.46
1048576 2478.40
2097152 6086.05
4194304 15640.81
8388608 41565.68
16777216 76458.63
OMPI-4
# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 36.16
2 13.96
4 16.55
8 11.99
16 11.04
32 12.57
64 12.80
128 15.19
256 17.86
512 17.73
1024 19.18
2048 28.96
4096 33.06
8192 50.52
16384 87.57
32768 161.25
65536 295.48
131072 629.22
262144 1315.63
524288 2739.76
1048576 5349.60
2097152 14242.53
4194304 28070.97
8388608 53160.61
16777216 105697.46
Seems that setting --mca pml ucx together with the restricted coll list is what makes the difference here, rather than the HAN priority on its own.
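For reference, priority dumps in the mca:...:value format shown above can be produced with ompi_info's machine-parsable output; an illustrative sketch (file names are arbitrary):
$ ompi_info --all --parsable | grep -E 'coll.*priority' > ompi5_priority.dat
$ diff ompi4_priority.dat ompi5_priority.dat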
Once you set --mca coll basic,tuned,libnbc, neither HAN nor UCC can be selected anyway, so that run does not isolate the cause. Why would HAN alone be responsible? It would be worth re-running with only UCC disabled.
Then setting coll_ucc_enable 0 in addition:
OMPI-5 with --mca coll_han_priority 0 --mca coll_ucc_enable 0
# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 24.41
2 32.12
4 36.12
8 27.96
16 24.05
32 54.13
64 33.05
128 37.34
256 30.94
512 35.70
1024 40.16
2048 54.50
4096 66.71
8192 73.83
16384 70.24
32768 126.08
65536 216.63
131072 426.60
262144 720.75
524288 1336.14
1048576 2602.08
2097152 6529.77
4194304 16252.92
8388608 37888.25
16777216 80172.36
so, not different from when using only --mca coll_han_priority 0.
OMPI-5 with --mca pml ucx --mca coll_han_priority 0
# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 9.77
2 9.72
4 10.13
8 9.71
16 9.61
32 10.86
64 11.14
128 14.06
256 15.03
512 15.18
1024 17.19
2048 22.12
4096 24.51
8192 32.74
16384 34.74
32768 49.98
65536 82.50
131072 147.81
262144 283.40
524288 824.34
1048576 2218.86
2097152 5976.56
4194304 15519.07
8388608 45093.29
16777216 76054.07
So, the bottom line: I need to force using UCX for the PML and disable HAN. Then the OMPI-5 numbers are comparable to OMPI-4. Looking at this comment, HAN seems to be more consistent across different data sizes compared to the tuned component.
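If forcing UCX and disabling HAN turns out to be the right setting for this system, it can be made the default through the standard per-user MCA parameter file instead of per-job command-line flags; a minimal sketch with the values used above:
# ~/.openmpi/mca-params.conf
pml = ucx
coll_han_priority = 0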
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.3 and v4.1.5
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From sources: https://www.open-mpi.org/software/ompi
Both versions of OpenMPI were installed using the following versions of dependencies:
Please describe the system on which you are running
RHEL9.4
$ lspci | grep -i Mellanox
01:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
01:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.2 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.3 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.4 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.5 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.6 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:00.7 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
01:01.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
21:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
21:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Details of the problem
I've noticed a performance degradation of some collective operations in OpenMPI-5, compared to OpenMPI-4. The test code executes a simple MPI_Allreduce on 16M doubles (a sketch of the pattern appears at the end of this section). The code was compiled with a single -O3 flag and executed on 256 processes across two nodes.
The timing for OpenMPI-5.0.3:
Timing for OpenMPI-4.1.5:
Any suggestions on why the performance could be so different? Are there any recommendations on where to look to improve OMPI-5 performance?
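A minimal sketch of the kind of test described above (a single timed MPI_Allreduce on 16M doubles, no warm-up; this is an illustration, not the exact source behind the reported numbers):
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (16 * 1024 * 1024)  /* 16M doubles */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *in  = malloc(N * sizeof(double));
    double *out = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++)
        in[i] = 1.0;

    /* Single MPI_Allreduce, deliberately without warm-up iterations. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Allreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("MPI_Allreduce of %d doubles took %f s\n", N, t1 - t0);

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
Built with mpicc -O3 and launched with mpirun -np 256 across two nodes, this exercises the same pattern of a single, cold collective call.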