@ikryukov ikryukov commented Aug 21, 2025

What

Introduce a cross-GPU barrier implementation using NVLS symmetric memory, eliminating the need for host-side synchronization.

Why?

The existing barrier relies on a host-managed shared-memory synchronization mechanism. While functional, this approach is:

  • Expensive - frequent host-device round trips add latency.
  • Not scalable across nodes - host shared memory is not available in multinode deployments, making the current design unsuitable for larger topologies.

This PR addresses both issues by moving barrier management fully into device-side execution.

How?

  • Control space: Reserve 1024 bytes in the NVLS symmetric buffer for barrier control segments.
  • Mechanism:
    • The barrier is represented by a single monotonically incremented uint64_t counter.
    • Each participating GPU in the NVLS group increments the counter upon reaching the barrier.
    • Progress is determined by comparing the counter against the expected value, which scales with the NVLS group size.
  • Kernel integration:
    • NVLS kernels launch with 1–32 blocks.
    • Each block designates a leader thread (threadIdx.x == 0).
    • The leader thread performs the atomic increment and spins until the counter matches the expected group-wide value.

This design ensures all GPUs in the NVLS group complete a given phase of the algorithm before any proceed, without requiring host intervention.
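The counter scheme described above can be modeled on the host as a rough sketch. This is not the code from this PR: it uses CPU threads in place of GPUs and a std::atomic in place of the counter in NVLS symmetric memory, and the names BarrierModel, expected_count, barrier_arrive_and_wait and run_barrier_model are hypothetical. It assumes one increment per GPU per barrier, so the expected value for the k-th barrier is k times the group size.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the barrier counter kept in the NVLS control space.
struct BarrierModel {
    std::atomic<uint64_t> counter{0};
};

// Counter value once every participant has arrived at barrier `phase` (1-based).
// Assumption: exactly one increment per GPU per barrier.
inline uint64_t expected_count(uint64_t phase, uint64_t group_size) {
    return phase * group_size;
}

// What one leader does at a barrier: announce arrival, then spin.
inline void barrier_arrive_and_wait(BarrierModel &b, uint64_t phase,
                                    uint64_t group_size) {
    b.counter.fetch_add(1, std::memory_order_acq_rel);
    const uint64_t target = expected_count(phase, group_size);
    // ">=" rather than "==": the counter never resets, so a fast participant
    // may already have incremented it for the next barrier.
    while (b.counter.load(std::memory_order_acquire) < target) {
        std::this_thread::yield();
    }
}

// Runs `group_size` threads through `phases` barriers. Returns true if no
// thread left a barrier before all threads had arrived at it.
inline bool run_barrier_model(unsigned group_size, unsigned phases,
                              uint64_t *final_count) {
    BarrierModel barrier;
    std::vector<std::atomic<unsigned>> arrived(phases);
    for (auto &a : arrived) a.store(0);
    std::atomic<bool> ok{true};

    std::vector<std::thread> gpus;
    for (unsigned g = 0; g < group_size; ++g) {
        gpus.emplace_back([&] {
            for (unsigned p = 1; p <= phases; ++p) {
                arrived[p - 1].fetch_add(1);  // record arrival for checking
                barrier_arrive_and_wait(barrier, p, group_size);
                // After the barrier, every participant must have arrived.
                if (arrived[p - 1].load() != group_size) ok.store(false);
            }
        });
    }
    for (auto &t : gpus) t.join();

    *final_count = barrier.counter.load();
    return ok.load();
}
```

Calling run_barrier_model(4, 100, &total) runs four simulated participants through 100 barriers on one counter. Because the counter only grows, the same location can be reused for successive barriers without a reset step, and that is why the sketch waits on "at least the expected value" instead of strict equality.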

Performance results:

[Figures: bandwidth_comparison, latency_comparison]

@ikryukov ikryukov self-assigned this Aug 21, 2025
@ikryukov ikryukov marked this pull request as ready for review August 25, 2025 12:53
Signed-off-by: Ilya Kryukov <[email protected]>
@janjust janjust merged commit 222615d into openucx:master Sep 2, 2025
9 checks passed