prov/cxi performance regression in fi_pingpong #9802
Comments
ping @mindstorm38
I can't reproduce the regression; here's my environment:

Not yet tested with MPI.
@mindstorm38
I'm using the latest internal sources; to be honest, I don't know the version number. I configure cxi, the Cassini headers, and the UAPI headers to point directly at the sources. Please tell me if there is a command to check a version that would be useful to you, but note that my installation is not standard compared to the official Slingshot packages. I'm working in parallel on a package-based installation, but it's on x86_64, so I guess it won't be helpful in this case (I'll try anyway).
FWIW - I've replicated this result on an x86_64 platform, with pretty much the exact same pingpong bandwidth numbers and the same result if I increase the iterations to 100. We are still running Slingshot 2.1, so maybe things work better with the newly released Slingshot 2.2. But in any case, the HPE 1.15 is performing better than the built-from-source 1.21.
Describe the bug
Using the upstreamed CXI provider (as of commit fc869ae on the main branch) yields reduced throughput in fi_pingpong (14 GB/s for ofiwg/libfabric compared to 20 GB/s for the HPE-internal libfabric).

To Reproduce
Steps to reproduce the behavior: run fi_pingpong -p cxi -e rdm on two Slingshot-connected nodes, as sketched below.
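A minimal sketch of the two-node run, assuming fabtests' fi_pingpong is on the PATH on both nodes and using nid000001 as a placeholder for the server node's hostname:

```sh
# Node A: start the server side (no address argument)
fi_pingpong -p cxi -e rdm

# Node B: start the client side, pointing it at the server node
fi_pingpong -p cxi -e rdm nid000001
```

Comparing the bandwidth reported by the two libfabric builds shows the gap described above.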
Expected behavior
Equivalent performance between both libfabric variants (~20 GB/s).
Output
Deviating performance:
It is worth noting that the observed throughput of ofiwg/libfabric can be increased by setting the number of iterations from the default 10 to 100 via -I 100. Additionally, using osu_bw and osu_latency from the OSU Microbenchmark Suite, no performance differences are observed between the two libfabric variants. I've attached raw output of the fi_pingpong runs and the osu_bw/osu_latency runs.
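For reference, a rough sketch of those additional runs, assuming an Open MPI-style launcher for the OSU binaries; node names and binary paths are placeholders, and the exact launch syntax depends on the local MPI installation:

```sh
# fi_pingpong with the iteration count raised from the default 10 to 100
fi_pingpong -p cxi -e rdm -I 100              # server on node A
fi_pingpong -p cxi -e rdm -I 100 nid000001    # client on node B

# OSU microbenchmarks between the same two nodes
mpirun -np 2 --host nid000001,nid000002 ./osu_bw
mpirun -np 2 --host nid000001,nid000002 ./osu_latency
```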
Environment:
./configure LDFLAGS=-Wl,--build-id --enable-cxi=yes --enable-only --enable-restricted-dl --enable-tcp --enable-udp --enable-rxm --enable-rxd --enable-hook_debug --enable-hook_hmem --enable-dmabuf_peer_mem --enable-verbs --enable-gdrcopy-dlopen --enable-profile=dl
--with-ofi=yes
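For completeness, a hedged sketch of how such a from-source build is typically driven (the repository URL is the upstream one; the install prefix is a placeholder, and the CXI/Cassini headers are assumed to be discoverable by configure):

```sh
git clone https://github.com/ofiwg/libfabric.git
cd libfabric
./autogen.sh
./configure LDFLAGS=-Wl,--build-id --enable-cxi=yes --enable-only \
    --enable-restricted-dl --enable-tcp --enable-udp --enable-rxm --enable-rxd \
    --enable-hook_debug --enable-hook_hmem --enable-dmabuf_peer_mem \
    --enable-verbs --enable-gdrcopy-dlopen --enable-profile=dl \
    --prefix=$HOME/libfabric-install
make -j && make install
```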
Additional context
Due to a currently unresolved issue with the local Slingshot deployment on the ARM platform used here, it is required to set FI_CXI_LLRING_MODE=never for both fi_pingpong and osu_bw.
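A minimal sketch of applying that workaround, using the same placeholder node names as above; for the MPI runs, the variable may also need to reach the remote ranks (with Open MPI, e.g. via mpirun -x FI_CXI_LLRING_MODE):

```sh
# Workaround required on this ARM platform for both benchmarks
export FI_CXI_LLRING_MODE=never

fi_pingpong -p cxi -e rdm              # server on node A
fi_pingpong -p cxi -e rdm nid000001    # client on node B
mpirun -np 2 --host nid000001,nid000002 -x FI_CXI_LLRING_MODE ./osu_bw
```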