## What
Introduce a cross-GPU barrier implementation using NVLS symmetric
memory, eliminating the need for host-side synchronization.
## Why?
The existing barrier relies on a host-managed shared-memory
synchronization mechanism. While functional, this approach is:
- **Expensive** - frequent host-device round trips add latency.
- **Not scalable across nodes** - host shared memory is not available in
multi-node deployments, making the current design unsuitable for larger
topologies.
This PR addresses both issues by moving barrier management fully into
device-side execution.
## How?
- **Control space**: Reserve 1024 bytes in the NVLS symmetric buffer for
barrier control segments.
- **Mechanism**:
- The barrier is represented by a single monotonically incremented
**uint64_t** counter.
- Each participating GPU in the NVLS group increments the counter upon
reaching the barrier.
- Progress is determined by comparing the counter against the expected
value, which scales with the NVLS group size.
- **Kernel integration**:
- NVLS kernels launch with 1–32 blocks.
- Each block designates a leader thread (`threadIdx.x == 0`).
- The leader thread performs the atomic increment and spins until the
counter reaches the expected group-wide value (see the sketch after this list).
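To make the mechanism concrete, below is a minimal device-side sketch. It is illustrative only: the names (`nvls_barrier`, `barrier_counter_t`), the `phase` epoch parameter, and the use of libcu++ system-scope atomics in place of the actual NVLS/multimem operations are assumptions, not code from this PR.

```cuda
#include <cuda/atomic>

// Counter type; the counter is assumed to live in the 1024-byte control space
// reserved in the NVLS symmetric buffer, so every GPU in the group observes
// the same logical location.
using barrier_counter_t = unsigned long long;

// Hypothetical device-side barrier. `expected` is the total number of blocks
// across the multicast group (number of GPUs * blocks per GPU); `phase` is an
// assumed epoch index that lets the monotonically growing counter be reused
// across successive barriers without being reset.
__device__ void nvls_barrier(barrier_counter_t* counter,
                             barrier_counter_t expected,
                             barrier_counter_t phase)
{
    // Make sure every thread in this block has finished the current phase's
    // work before the block announces its arrival.
    __syncthreads();

    if (threadIdx.x == 0) {
        // System-scope atomics stand in for whatever NVLS/multimem operation
        // the real implementation uses; the default seq_cst ordering keeps
        // the sketch simple.
        cuda::atomic_ref<barrier_counter_t, cuda::thread_scope_system> ctr(*counter);

        // Announce arrival: one increment per block, per GPU.
        ctr.fetch_add(1);

        // Spin until every block on every GPU has arrived for this phase.
        // The counter only ever grows, so comparing against
        // (phase + 1) * expected distinguishes successive barriers.
        while (ctr.load() < (phase + 1) * expected) {
            // busy-wait
        }
    }

    // Release the rest of the block once the leader has observed completion.
    __syncthreads();
}
```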
This design ensures all GPUs in the NVLS group complete a given phase of
the algorithm before any proceed, without requiring host intervention.
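For context, a hypothetical two-phase kernel could use such a barrier as follows, reusing the sketch above. The kernel name and arguments are illustrative; the expected arrival count is the total number of blocks in the multicast group (blocks per GPU times number of GPUs), matching the group-wide scaling described above.

```cuda
// Illustrative only: a two-phase NVLS collective in which no GPU may start
// the second phase until every GPU has finished the first.
__global__ void nvls_two_phase_kernel(float* symmetric_buf,
                                      barrier_counter_t* barrier_counter,
                                      int num_gpus)
{
    // Total blocks across the NVLS group: blocks per GPU * number of GPUs.
    barrier_counter_t expected =
        (barrier_counter_t)gridDim.x * (barrier_counter_t)num_gpus;

    // ... phase 0: e.g. accumulate partial results into the symmetric buffer ...
    nvls_barrier(barrier_counter, expected, /*phase=*/0);

    // ... phase 1: e.g. read back the combined result ...
    nvls_barrier(barrier_counter, expected, /*phase=*/1);
}
```

Having only the block leader touch the counter keeps contention to at most 32 atomic operations per GPU per barrier (one per block), while the surrounding `__syncthreads()` calls extend the wait to the remaining threads of each block.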
## Performance results
<img width="1200" height="500" alt="bandwidth_comparison"
src="https://github.com/user-attachments/assets/694bf5e6-5129-489c-a350-7b75e6924fc0"
/>
<img width="1200" height="500" alt="latency_comparison"
src="https://github.com/user-attachments/assets/637967c2-99bf-4ed3-97de-52cbecf958f1"
/>
---------
Signed-off-by: Ilya Kryukov <[email protected]>