
Reduced performance of MPI_Allreduce in OpenMPI-5 compared to OpenMPI-4 #13082

Open
maxim-masterov opened this issue Feb 5, 2025 · 7 comments

@maxim-masterov

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.3 and v4.1.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From sources: https://www.open-mpi.org/software/ompi
Both versions of Open MPI were installed with the following dependency versions:

depends_on("GCC/12.3.0")
depends_on("zlib/1.2.13-GCCcore-12.3.0")
depends_on("hwloc/2.9.1-GCCcore-12.3.0")
depends_on("libevent/2.1.12-GCCcore-12.3.0")
depends_on("UCX/1.14.1-GCCcore-12.3.0")
depends_on("libfabric/1.18.0-GCCcore-12.3.0")
depends_on("UCC/1.2.0-GCCcore-12.3.0")

Please describe the system on which you are running

  • Operating system/version:
    RHEL9.4
  • Computer hardware:
    $ cat /proc/cpuinfo | grep "model name" | tail -n 1
    model name	: AMD EPYC 7H12 64-Core Processor
  • Network type:
    $ lspci | grep -i Mellanox
    01:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
    01:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.2 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.3 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.4 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.5 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.6 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:00.7 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    01:01.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
    21:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
    21:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Details of the problem

I've noticed a performance degradation of some collective operations in OpenMPI-5, compared to OpenMPI-4. The test code below executes a simple MPI_Allreduce on 16M doubles:

#include <iostream>
#include <mpi.h>
#include <sys/time.h>

#define TABLE_SIZE 16777216

int main(int argc, char **argv) {

  int rank, size;
  // Allocate the two 128 MiB buffers on the heap; arrays this large would
  // overflow a default-sized stack if declared as local variables.
  double *table = new double[TABLE_SIZE];
  double *global_result = new double[TABLE_SIZE];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get the rank of the process
  MPI_Comm_size(MPI_COMM_WORLD, &size); // Get the total number of processes

  for (int i = 0; i < TABLE_SIZE; i++) {
    table[i] = rank + i;
  }

  double start_time = MPI_Wtime();
  MPI_Allreduce(table, global_result, TABLE_SIZE, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  double end_time = MPI_Wtime();
  double elapsed_time = end_time - start_time;

  double min_time, max_time, avg_time;
  MPI_Allreduce(&elapsed_time, &min_time, 1, MPI_DOUBLE, MPI_MIN,
                MPI_COMM_WORLD);
  MPI_Allreduce(&elapsed_time, &max_time, 1, MPI_DOUBLE, MPI_MAX,
                MPI_COMM_WORLD);
  MPI_Allreduce(&elapsed_time, &avg_time, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  avg_time /= size;

  if (rank == 0) {
    std::cout << "Global Reduced Result (sum of all elements across all "
                 "processes):\n";
    std::cout << "Result[0]: " << global_result[0] << std::endl;
    std::cout << "Result[" << TABLE_SIZE - 1
              << "]: " << global_result[TABLE_SIZE - 1] << std::endl;
    std::cout << "MPI_Allreduce (s): " << (end_time - start_time) << std::endl;
    std::cout << "MPI_Allreduce Timing Analysis:" << std::endl;
    std::cout << "  Minimum time: " << min_time << " seconds" << std::endl;
    std::cout << "  Maximum time: " << max_time << " seconds" << std::endl;
    std::cout << "  Average time: " << avg_time << " seconds" << std::endl;
  }

  delete[] table;
  delete[] global_result;

  MPI_Finalize();
  return 0;
}

The code was compiled with a single -O3 flag and executed on 256 processes across two nodes:

$ mpicxx -O3 mpi_allreduce_16M.cpp
...
$ mpirun -n 256 ./a.out

The timing for OpenMPI-5.0.3:

  Minimum time: 3.00434 seconds
  Maximum time: 3.01639 seconds
  Average time: 3.00823 seconds

Timing for OpenMPI-4.1.5:

  Minimum time: 0.816602 seconds
  Maximum time: 0.870789 seconds
  Average time: 0.85539 seconds

Any suggestions on why the performance could be so different? Are there any recommendations on where to look to improve OMPI-5 performance?

@ggouaillardet
Contributor

Your program performs a single MPI_Allreduce(), and its performance might be negatively impacted by lazy initialization performed under the hood. You would be better off using well-established benchmarks such as Intel's IMB or the OSU micro-benchmark suite.

First, check the PingPong performance (both inter- and intra-node) to make sure there is nothing obviously wrong with your setup.
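As an illustration of the warm-up point, here is a minimal sketch of how the timed region of the reproducer above could look with a few untimed warm-up calls added (the warm-up count of 5 is an arbitrary choice, and this is not a substitute for IMB/OSU):

  // Untimed warm-up calls: let connections, buffers, and collective
  // algorithms be set up before the measurement starts.
  const int NUM_WARMUP = 5;
  for (int it = 0; it < NUM_WARMUP; it++) {
    MPI_Allreduce(table, global_result, TABLE_SIZE, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
  }
  MPI_Barrier(MPI_COMM_WORLD);  // start all ranks from a common point

  double start_time = MPI_Wtime();
  MPI_Allreduce(table, global_result, TABLE_SIZE, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  double end_time = MPI_Wtime();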

@maxim-masterov
Author

I deliberately didn't use any warm-up iterations. My concern is about a single call to a collective, like in the example above. I also ran the OSU tests, which showed a similar performance degradation for the Allreduce latency test:

$ osu_allreduce --version
# OSU MPI Allreduce Latency Test v7.1
$ mpirun -np 256 osu_allreduce -x 1000 -m 20000000

Results:

OMPI-5:
# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      23.53
2                      28.21
4                      47.42
8                      34.88
16                     26.18
32                     33.35
64                     26.99
128                    32.02
256                    40.39
512                    39.98
1024                   49.09
2048                   48.13
4096                   78.87
8192                  110.08
16384                 253.83
32768                 439.93
65536                 369.35
131072                569.72
262144               1114.83
524288               2938.23
1048576              6487.50
2097152             13952.42
4194304             46665.48
8388608             55700.13
16777216           114236.02


OMPI-4:
# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      12.25
2                      10.56
4                      10.40
8                      10.34
16                     10.52
32                     13.30
64                    242.02
128                    16.28
256                    17.17
512                    17.59
1024                   19.18
2048                   55.12
4096                   35.89
8192                   52.79
16384                  86.91
32768                 159.92
65536                 294.05
131072                628.76
262144               1288.59
524288               2616.03
1048576              5254.27
2097152             15237.20
4194304             27232.65
8388608             52380.46
16777216           110870.23

I know that OMPI-5 switched from ORTE to PRRTE. Could it be that some default parameters changed and now affect the performance?

@ggouaillardet
Contributor

I would first try to understand how communications are performed and which collective modules are used.
Assuming UCX is used, are you using the very same version with both Open MPI releases?

You might want to run your test program with

mpirun --mca pml ucx --mca coll basic,libnbc ...

to force conservative defaults and see if the discrepancy persists. Then try

mpirun --mca pml ucx --mca coll basic,tuned,libnbc ...

and see if it changes the timing.

My first instinct would be not to suspect an ORTE vs. PRRTE issue.
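Independently of the runtime, one way to check which collective components are selected at run time is to raise the coll framework's verbosity (assuming the usual MCA verbosity parameters; the exact messages differ between releases):

$ mpirun -n 256 --mca coll_base_verbose 10 ./a.out

The selection messages should indicate, per communicator, which components are available and which one is chosen.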

@bosilca
Member

bosilca commented Feb 5, 2025

You have a dependency on UCC, so it is possible that the MPI_Allreduce from UCC is being used, in which case the problem might be more complicated to track down. Here is a list of things you can try (a combined command sketch follows the list):

  • run ompi_info --parsable --param coll all -l 9 | grep 'priority:value' on the two versions to see how the priorities of the different components changed. This will help understand which components are used for the MPI_Allreduce.
  • run with --mca coll ^ucc to see if eliminating UCC leads to more stable results.
  • run with --report-bindings to see if the bindings are the same across the two versions.
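A sketch of what those checks could look like on one of the installs (the output file name and the -n 256 launch are just illustrative):

# dump component priorities for later comparison with the other install
$ ompi_info --parsable --param coll all -l 9 | grep 'priority:value' > ompi5_priority.dat

# re-run the reproducer without UCC and report the process bindings
$ mpirun -n 256 --mca coll ^ucc --report-bindings ./a.out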

@maxim-masterov
Author

Thanks for your advice.
@ggouaillardet Yes, both OMPI versions are compiled against the same UCX and UCC versions. Setting --mca pml ucx --mca coll basic,libnbc didn't do much, but also adding tuned to the list of collectives resolved the problem. The timings are almost identical now.

@bosilca Checking the priority:value output, I spotted only one difference:

$ diff ompi4_priority.dat ompi5_priority.dat 
3c3
< mca:coll:han:param:coll_han_priority:value:0
---
> mca:coll:han:param:coll_han_priority:value:35
9a10
> mca:coll:ftagree:param:coll_ftagree_priority:value:30

The HAN priority was boosted from 0 in OMPI-4 to 35 in OMPI-5. I set it back to 0 and the performance became similar between both OMPI versions, i.e. the same result as setting --mca coll basic,tuned,libnbc.
Somehow I missed this major change; now I see there was a long discussion in #10347, and it is also mentioned here.

However, it seems that HAN improves performance only for large message sizes. Here are results from OSU:

OMPI-5
with --mca pml ucx --mca coll basic,tuned,libnbc --mca coll_han_priority 35

# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      10.29
2                       9.63
4                       9.52
8                       9.50
16                      9.36
32                     10.69
64                     10.97
128                    13.76
256                    14.68
512                    15.05
1024                   17.32
2048                   24.94
4096                   29.41
8192                   53.14
16384                  34.75
32768                  49.39
65536                  80.90
131072                148.58
262144                340.29
524288                821.65
1048576              2091.90
2097152             18382.99
4194304             19004.10
8388608             36119.31
16777216            81548.02

OMPI-5
with --mca coll_han_priority 0

# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      30.86
2                      26.29
4                      22.09
8                      27.41
16                     29.43
32                     31.47
64                     52.18
128                    35.33
256                    30.33
512                    33.00
1024                   36.32
2048                   43.09
4096                   55.13
8192                   71.08
16384                  94.10
32768                 129.57
65536                 215.37
131072                356.73
262144                614.98
524288               1259.46
1048576              2478.40
2097152              6086.05
4194304             15640.81
8388608             41565.68
16777216            76458.63

OMPI-4

# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      36.16
2                      13.96
4                      16.55
8                      11.99
16                     11.04
32                     12.57
64                     12.80
128                    15.19
256                    17.86
512                    17.73
1024                   19.18
2048                   28.96
4096                   33.06
8192                   50.52
16384                  87.57
32768                 161.25
65536                 295.48
131072                629.22
262144               1315.63
524288               2739.76
1048576              5349.60
2097152             14242.53
4194304             28070.97
8388608             53160.61
16777216           105697.46

It seems that setting --mca pml ucx --mca coll basic,tuned,libnbc --mca coll_han_priority 35 gives the best results across all message sizes.

@bosilca
Member

bosilca commented Feb 5, 2025

Once you set --mca coll basic,tuned,libnbc, HAN will never be used, so setting its priority is unnecessary.

Why would --mca coll_han_priority 0 and --mca coll basic,tuned,libnbc --mca coll_han_priority 35 give different performance numbers? The former lowers the priority of HAN below that of tuned, so tuned will always be called, while the latter completely disables the HAN component. Thus, in both cases HAN is never used, and the performance degradation is coming from somewhere else. That "else" might be UCC, because --mca coll basic,tuned,libnbc runs without UCC, while --mca coll_han_priority 0 allows UCC with its default priority.
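To take both components out of the picture at once while leaving everything else at its defaults, the same exclusion syntax as --mca coll ^ucc above can be extended, e.g. (sketch, using the 256-rank launch from earlier):

$ mpirun -n 256 --mca coll ^han,ucc ./a.out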

@maxim-masterov
Author

maxim-masterov commented Feb 6, 2025

Then setting --mca coll_han_priority 0 --mca coll_ucc_priority 0, or --mca coll_han_priority 0 --mca coll_ucc_enable 0, should give the same performance as --mca coll basic,tuned,libnbc, right? I'm getting:

OMPI-5
with --mca coll_han_priority 0 --mca coll_ucc_enable 0

# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      24.41
2                      32.12
4                      36.12
8                      27.96
16                     24.05
32                     54.13
64                     33.05
128                    37.34
256                    30.94
512                    35.70
1024                   40.16
2048                   54.50
4096                   66.71
8192                   73.83
16384                  70.24
32768                 126.08
65536                 216.63
131072                426.60
262144                720.75
524288               1336.14
1048576              2602.08
2097152              6529.77
4194304             16252.92
8388608             37888.25
16777216            80172.36

So, not different from using only --mca coll_han_priority 0.
However, adding --mca pml ucx together with --mca coll_han_priority 0 does make a difference:

OMPI-5
with --mca pml ucx --mca coll_han_priority 0

# OSU MPI Allreduce Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       9.77
2                       9.72
4                      10.13
8                       9.71
16                      9.61
32                     10.86
64                     11.14
128                    14.06
256                    15.03
512                    15.18
1024                   17.19
2048                   22.12
4096                   24.51
8192                   32.74
16384                  34.74
32768                  49.98
65536                  82.50
131072                147.81
262144                283.40
524288                824.34
1048576              2218.86
2097152              5976.56
4194304             15519.07
8388608             45093.29
16777216            76054.07

So, the bottom line: I need to force UCX for the PML and disable HAN; then tuned will be used for the collectives.

Looking at this comment, HAN seems to be more consistent across different data sizes compared to tuned. I'll try increasing the segment size for HAN to see if it helps in my case.
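For the record, the combination identified above (same 256-rank launch as used throughout this issue) is:

$ mpirun -n 256 --mca pml ucx --mca coll_han_priority 0 ./a.out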
