
@ikryukov ikryukov commented May 22, 2025

What

Split the all-to-all implementation into two flows:

  • Executor (kernel-based)
  • Copy engine (CUDA batched memcpyAsync)

Why?

  • The executor is kernel-based and consumes SMs.
  • The copy engine implementation is SM-free: it only queries event status during progress.
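
The SM-free progress model described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `mock_event_t`, `copy_task_t`, and `copy_task_progress` are hypothetical names, and the event is mocked so the example builds without CUDA. In real code the query would be a call such as `cudaEventQuery`, which returns `cudaErrorNotReady` until the work recorded before the event has completed. The sketch also assumes one completion event per stream.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes; not UCC's real status values. */
typedef enum { TASK_IN_PROGRESS, TASK_DONE } task_status_t;

/* Stand-in for a device event (assumption: real code would use a CUDA
 * event and query it with cudaEventQuery()). */
typedef struct {
    int polls_until_ready; /* mock: number of queries that report "not ready" */
} mock_event_t;

static bool mock_event_query(mock_event_t *ev)
{
    if (ev->polls_until_ready > 0) {
        ev->polls_until_ready--;
        return false; /* copies before this event are still in flight */
    }
    return true;
}

typedef struct {
    mock_event_t *events;   /* assumed: one completion event per stream */
    size_t        n_events;
    size_t        n_done;   /* events [0, n_done) were already seen complete */
} copy_task_t;

/* Non-blocking progress: no kernel launch and no stream synchronization,
 * only event queries. Returns as soon as one event is not ready. */
static task_status_t copy_task_progress(copy_task_t *task)
{
    assert(task != NULL);
    while (task->n_done < task->n_events) {
        if (!mock_event_query(&task->events[task->n_done])) {
            return TASK_IN_PROGRESS; /* retry on the next progress call */
        }
        task->n_done++;
    }
    return TASK_DONE;
}
```

The caller is expected to invoke the progress function repeatedly until it reports completion, so the host thread never blocks and no SM is occupied by a polling kernel.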

How?

Set UCC_TL_CUDA_ALLTOALL_USE_COPY_ENGINE=y to select the copy engine flow.
Example command:

mpirun --mca coll ^hcoll --mca coll_ucc_enable 0 -x LD_LIBRARY_PATH=/ucc_build_redhat/install/lib:$LD_LIBRARY_PATH -x UCC_TLS=cuda,ucp -x UCC_LOG_LEVEL=info -x UCC_TL_CUDA_ALLTOALL_USE_COPY_ENGINE=y -np 8 /ucc_build_redhat/install/bin/ucc_perftest -c alltoall -F -m cuda -b 1k -e 8M -d float32
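
For illustration, the switch between the two flows could look roughly like the sketch below. It is not how UCC reads its configuration (UCC has its own config parser); `env_is_yes`, `select_a2a_flow`, and the `a2a_flow_t` enum are hypothetical names. Accepting `yes` in addition to `y`, and treating the executor as the default when the variable is unset, are both assumptions.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical enum naming the two flows described in this PR. */
typedef enum { A2A_FLOW_EXECUTOR, A2A_FLOW_COPY_ENGINE } a2a_flow_t;

/* Returns true if the variable is set to "y" or "yes" (assumed spellings). */
static bool env_is_yes(const char *name)
{
    const char *v = getenv(name);
    return v != NULL && (strcmp(v, "y") == 0 || strcmp(v, "yes") == 0);
}

/* Picks the copy engine flow only when explicitly requested;
 * otherwise falls back to the kernel-based executor (assumed default). */
static a2a_flow_t select_a2a_flow(void)
{
    return env_is_yes("UCC_TL_CUDA_ALLTOALL_USE_COPY_ENGINE")
               ? A2A_FLOW_COPY_ENGINE
               : A2A_FLOW_EXECUTOR;
}
```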

@ikryukov ikryukov self-assigned this May 22, 2025
@ikryukov ikryukov requested a review from Sergei-Lebedev May 22, 2025 10:17
@ikryukov ikryukov marked this pull request as ready for review July 9, 2025 14:29
@ikryukov ikryukov requested a review from janjust July 9, 2025 14:29
@ikryukov ikryukov requested review from nsarka and MamziB August 12, 2025 09:43
- Implement memcpy batch async
- Use multiple streams and remove redundant events/flags
- Add triggered post support, fix the all-to-all copy engine implementation
- Clean up unused params and improve comments
- Update copyright
- Fix clang-tidy warnings
- Add a check for CL hierarchical teams that disables copy engine usage
- Update alltoallv functions to use the new `use_copy_engine` flag
- Replace redundant references to the library configuration with task-specific settings
- Add a new inline function that determines whether a task is part of a CL hierarchical team
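
The last few items describe a per-task `use_copy_engine` flag that is forced off for CL hierarchical teams. A rough sketch of that logic, under stated assumptions, is shown below. Only the `use_copy_engine` name comes from the commit message; the struct layouts and the names `team_info_t`, `a2a_task_t`, `task_is_cl_hier`, and `task_init_copy_engine` are hypothetical and do not reflect UCC's real types.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical team descriptor: true if the team was created by the
 * hierarchical collective layer (CL). */
typedef struct {
    bool is_cl_hier;
} team_info_t;

/* Hypothetical task: carries its own flag instead of reading the
 * library-wide configuration on every call. */
typedef struct {
    const team_info_t *team;
    bool               use_copy_engine;
} a2a_task_t;

/* Inline check: is this task part of a CL hierarchical team? */
static inline bool task_is_cl_hier(const a2a_task_t *task)
{
    return task->team != NULL && task->team->is_cl_hier;
}

/* Resolve the task-specific setting once at init time: the copy engine
 * is used only if it was requested and the team is not hierarchical. */
static inline void task_init_copy_engine(a2a_task_t *task, bool requested)
{
    task->use_copy_engine = requested && !task_is_cl_hier(task);
}
```

Resolving the flag once per task keeps the alltoall and alltoallv paths consistent, since both can branch on the same field.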

Signed-off-by: Ilya Kryukov <[email protected]>