We did a lot of work with UCX and dask/ucx-py on Summit here: rapidsai/ucx-py#616. I lost my Summit access, so I can't help directly anymore.
Hi, I'm testing UCX PUTs/GETs on CUDA memory on ORNL Summit, and I'm seeing very low performance for small GETs.
I'm using UCX trunk edacb52, configured with
--enable-mt --enable-cma --enable-numa --enable-silent-rules --enable-optimizations --enable-builtin-memcpy --enable-compiler-opt=3 --enable-option-checking --disable-debug --disable-gtest --disable-stats --disable-static --disable-tuning --disable-logging --disable-examples --disable-profiling --disable-assertions --disable-debug-data --disable-params-check --disable-fault-injection --disable-frame-pointer --disable-dependency-tracking --without-go --without-bfd --without-mpi --without-knem --without-java --without-valgrind --with-cache-line-size=128 --with-cuda=/sw/summit/cuda/11.0.3 --with-gdrcopy=/sw/summit/gdrcopy/2.0
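Because the build was configured with `--with-cuda` and `--with-gdrcopy`, it may be worth confirming that those transports were actually built and detected before comparing GET numbers. The sketch below reads the output of `ucx_info -d` and reports whether each CUDA-related transport is listed. It assumes the usual `Transport: <name>` lines in that output and UCX's usual transport names (`cuda_copy`, `cuda_ipc`, `gdr_copy`). The helper name `report_cuda_transports` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads `ucx_info -d` output on stdin and prints
# "present" or "missing" for each CUDA-related transport.
# Transport names are assumptions based on UCX's usual naming.
report_cuda_transports() {
  local listing t
  listing="$(cat)"
  for t in cuda_copy cuda_ipc gdr_copy; do
    # Match lines such as "#      Transport: gdr_copy"
    if grep -Eq "Transport:[[:space:]]+${t}([[:space:]]|\$)" <<<"$listing"; then
      echo "$t: present"
    else
      echo "$t: missing"
    fi
  done
}

# Usage on a compute node, with this UCX build first on PATH:
#   ucx_info -d | report_cuda_transports
```

If `gdr_copy` shows up as missing on a GPU node, small transfers to and from CUDA memory would have to go through other paths. That would be one thing to rule out before tuning thresholds.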
For environment variables, I have set:
With the setup above, I get the following inter-node numbers from `ucx_perftest -t ucp_get`:

If I add `UCX_ZCOPY_THRESH=0` to the environment, I get:

Adding `UCX_PROTO_INFO=y` to the environment shows that software emulation (am bcopy?) is used for GET sizes of 0..64, but I cannot find a way to change this behavior or improve its performance. What's the correct setting for Summit? Any help is appreciated!
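The comparison above (default settings vs. `UCX_ZCOPY_THRESH=0`, with `UCX_PROTO_INFO=y` in both cases) can be repeated over a range of small sizes with a sketch like the one below. It makes several assumptions:

- A `ucx_perftest` server is already listening on the other node.
- `-m cuda` selects CUDA memory for the test buffers.
- The iteration count of 10000 is arbitrary.
- The function name, host name, and `RUN` variable are hypothetical.

By default the function only prints the client commands. Calling it with `RUN=` (empty) executes them instead.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: emits one client-side ucx_perftest GET command per
# message size. The first pass uses default thresholds and the second pass
# sets UCX_ZCOPY_THRESH=0. Both passes set UCX_PROTO_INFO=y so the selected
# protocol is reported. Assumes a ucx_perftest server is already running.
ucx_get_sweep() {
  local host="$1"; shift
  local run="${RUN-echo}"   # RUN unset -> dry run via echo; RUN= -> execute
  local pass size
  for pass in default zcopy0; do
    for size in "$@"; do
      local -a envs=(UCX_PROTO_INFO=y)
      if [ "$pass" = zcopy0 ]; then
        envs+=(UCX_ZCOPY_THRESH=0)
      fi
      # -m cuda: CUDA buffers (assumed); -n 10000: arbitrary iteration count
      $run env "${envs[@]}" ucx_perftest "$host" -t ucp_get -m cuda -s "$size" -n 10000
    done
  done
}

# Dry run (prints commands):  ucx_get_sweep server-node 8 16 32 64 128
# Execute:                     RUN= ucx_get_sweep server-node 8 16 32 64 128
```

Printing the commands first makes it easy to check the exact environment each run gets before spending allocation time on the sweep.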