prov/cxi performance regression in fi_pingpong #9802
Comments
ping @mindstorm38
I can't reproduce the regression; here's my environment:

Not yet tested with MPI.
@mindstorm38
I'm using the latest internal sources; to be honest, I don't know the version number. I configure cxi, the Cassini headers, and the UAPI headers to point directly at the sources. Please tell me if there is a command to check a version that would be useful to you, but note that my installation is not standard compared to the official Slingshot packages. I'm working in parallel on a package-based installation, but it's on x86_64, so I guess it won't be helpful in this case (I'll try anyway).
FWIW - I've replicated this result on an x86_64 platform, with pretty much the exact same pingpong bandwidth numbers and the same result if I increase the iterations to 100. We are still running Slingshot 2.1, so maybe things work better with the newly released Slingshot 2.2. But in any case, the HPE 1.15 is performing better than the built-from-source 1.21.
Describe the bug
Using the upstreamed CXI provider (as of commit fc869ae on the main branch) yields reduced throughput in fi_pingpong (14 GB/s for ofiwg/libfabric compared to 20 GB/s for the HPE-internal libfabric).

To Reproduce
Steps to reproduce the behavior: run fi_pingpong -p cxi -e rdm on two Slingshot-connected nodes, as sketched below.
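A minimal sketch of the two-node run, assuming fabtests' fi_pingpong is on the PATH on both nodes and using nid000001 as a placeholder for the server node's hostname:

```sh
# Node A: start the server side (no address argument)
fi_pingpong -p cxi -e rdm

# Node B: start the client side, pointing it at the server node
fi_pingpong -p cxi -e rdm nid000001
```

Comparing the bandwidth reported by the two libfabric builds shows the gap described above.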
Expected behavior
Equivalent performance between both libfabric variants (~20 GB/s).
Output
Deviating performance:
It is worth noting that the observed throughput of ofiwg/libfabric can be increased by setting the number of iterations from the default 10 to 100 via -I 100. Additionally, using osu_bw and osu_latency from the OSU Microbenchmark Suite, no performance differences are observed between the two libfabric variants. I've attached raw output of the fi_pingpong runs and the osu_bw/osu_latency runs.
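For reference, a rough sketch of those additional runs, assuming an Open MPI-style launcher for the OSU binaries; node names and binary paths are placeholders, and the exact launch syntax depends on the local MPI installation:

```sh
# fi_pingpong with the iteration count raised from the default 10 to 100
fi_pingpong -p cxi -e rdm -I 100              # server on node A
fi_pingpong -p cxi -e rdm -I 100 nid000001    # client on node B

# OSU microbenchmarks between the same two nodes
mpirun -np 2 --host nid000001,nid000002 ./osu_bw
mpirun -np 2 --host nid000001,nid000002 ./osu_latency
```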
Environment:
./configure LDFLAGS=-Wl,--build-id --enable-cxi=yes --enable-only --enable-restricted-dl --enable-tcp --enable-udp --enable-rxm --enable-rxd --enable-hook_debug --enable-hook_hmem --enable-dmabuf_peer_mem --enable-verbs --enable-gdrcopy-dlopen --enable-profile=dl
--with-ofi=yes
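For completeness, a hedged sketch of how such a from-source build is typically driven (the repository URL is the upstream one; the install prefix is a placeholder, and the CXI/Cassini headers are assumed to be discoverable by configure):

```sh
git clone https://github.com/ofiwg/libfabric.git
cd libfabric
./autogen.sh
./configure LDFLAGS=-Wl,--build-id --enable-cxi=yes --enable-only \
    --enable-restricted-dl --enable-tcp --enable-udp --enable-rxm --enable-rxd \
    --enable-hook_debug --enable-hook_hmem --enable-dmabuf_peer_mem \
    --enable-verbs --enable-gdrcopy-dlopen --enable-profile=dl \
    --prefix=$HOME/libfabric-install
make -j && make install
```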
Additional context
Due to a currently unresolved issue with the local Slingshot deployment on the ARM platform used here, it is required to set FI_CXI_LLRING_MODE=never for both fi_pingpong and osu_bw.
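A minimal sketch of applying that workaround, using the same placeholder node names as above; for the MPI runs, the variable may also need to reach the remote ranks (with Open MPI, e.g. via mpirun -x FI_CXI_LLRING_MODE):

```sh
# Workaround required on this ARM platform for both benchmarks
export FI_CXI_LLRING_MODE=never

fi_pingpong -p cxi -e rdm              # server on node A
fi_pingpong -p cxi -e rdm nid000001    # client on node B
mpirun -np 2 --host nid000001,nid000002 -x FI_CXI_LLRING_MODE ./osu_bw
```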