Host IR: make stream synchronization non blocking #3608

samnordmann · 2024-12-18T11:15:23Z

What

Make stream synchronization non-blocking from the CPU point of view

Why

Needed for achieving overlap in

Lower distributed matmul to pipelined algorithm for fine-grained overlap: AG+GEMM layout #3606

before this patch:

after this patch

How

Before this patch, the host IR Synchronize called c10::synchronize() on the CUDA stream, which blocks the CPU until the stream completes. With this patch, we synchronize the current stream with a given stream through a cudaEvent and the API cudaStreamWaitEvent, so the wait is enforced on the GPU side and the CPU is not blocked.
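To make the ordering semantics concrete without needing a GPU, here is a minimal host-only model of the event-based approach. It is a sketch, not the code in this patch: `Stream`, `Event`, `Device`, `launch`, `recordEvent`, `streamWaitEvent`, `synchronizeNonBlocking` and `demo` are all hypothetical names, with `recordEvent` and `streamWaitEvent` standing in for `cudaEventRecord` and `cudaStreamWaitEvent`. The host only enqueues operations; the stall happens inside the simulated device.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Toy model only: every name below is hypothetical and nothing here calls CUDA.
struct Event {
  bool done = false;  // set once the Record op has "executed"
};

struct Op {
  enum Kind { Work, Record, Wait };
  Kind kind;
  std::string name;  // used by Work ops
  Event* event;      // used by Record and Wait ops
};

struct Stream {
  std::deque<Op> queue;  // ops execute in FIFO order within one stream
};

// Host-side enqueue helpers. None of them waits for anything.
void launch(Stream& s, std::string name) {
  s.queue.push_back({Op::Work, std::move(name), nullptr});
}
void recordEvent(Stream& s, Event& e) {  // stand-in for cudaEventRecord
  s.queue.push_back({Op::Record, "", &e});
}
void streamWaitEvent(Stream& s, Event& e) {  // stand-in for cudaStreamWaitEvent
  s.queue.push_back({Op::Wait, "", &e});
}

// Make `current` wait for everything enqueued so far on `other`,
// without blocking the host.
void synchronizeNonBlocking(Stream& current, Stream& other, Event& e) {
  recordEvent(other, e);
  streamWaitEvent(current, e);
}

struct Device {
  std::vector<std::string> trace;  // order in which Work ops executed

  // Execute ops until no stream can make further progress.
  void drain(std::vector<Stream*> streams) {
    bool progressed = true;
    while (progressed) {
      progressed = false;
      for (Stream* s : streams) {
        while (!s->queue.empty()) {
          Op& op = s->queue.front();
          if (op.kind == Op::Wait && !op.event->done) {
            break;  // this stream stalls; the host is not involved
          }
          if (op.kind == Op::Work) trace.push_back(op.name);
          if (op.kind == Op::Record) op.event->done = true;
          s->queue.pop_front();
          progressed = true;
        }
      }
    }
  }
};

std::vector<std::string> demo() {
  Stream current, other;
  Event e;
  Device dev;
  launch(other, "side_kernel");
  synchronizeNonBlocking(current, other, e);
  launch(current, "main_kernel");  // enqueued right away, no host wait
  dev.drain({&current, &other});   // `current` is visited first and stalls
  return dev.trace;
}
```

In this model `demo()` returns the work in the order `side_kernel`, `main_kernel`, even though the device visits `current` first: the dependency is carried by the event rather than by a host-side wait.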

samnordmann · 2024-12-18T11:19:28Z

csrc/multidevice/communicator.cpp

@@ -196,6 +197,8 @@ Communicator::Communicator(
    return;
  }

+  NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(local_rank_));


I have long suspected this would be required at some point. In any case, it is a recommended (if not required) practice. Without it, cudaEventRecord fails with cudaErrorInvalidResourceHandle in a multi-GPU scenario.

samnordmann · 2024-12-18T11:54:02Z

!test

samnordmann · 2024-12-18T12:00:56Z

!test

samnordmann · 2024-12-18T14:53:03Z

!test

csrc/host_ir/executor.cpp

…lap: AG+GEMM layout (#3606)

Stacked on top of
- [x] #3608
- [x] #3605

# What

Lower a MatmulOp sharded on the first inner axis into a pipelined AG+GEMM algorithm achieving fine grained overlap.

We introduce a new parallel type `Stream` to account for this scheduling.

More precisely, this patch enables lowering the fusion:
```
TensorView* a = makeContigTensor(4); //[S, DIDx(D), M/(S*d), K]
TensorView* b = makeContigTensor(2); //[K, N]
TensorView* c = matmul(a, b); //[S, D, M/(S*D), N]

fusion->addInput(a);
fusion->addInput(b);
fusion->addOutput(c);

auto mesh = DeviceMesh::createForNumDevices(D);
a->setDeviceMesh(mesh);
b->setDeviceMesh(mesh);
c->setDeviceMesh(mesh);

a->axis(1)->parallelize(ParallelType::DIDx);
c->axis(0)->parallelize(ParallelType::Stream);
```
to the Host Ir program (obtained from dump, using `NVFUSER_DUMP=host_ir`)
```
%HostIrContainer { (T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  GetCurrentStream into Stream 0
  T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i4 ), zero_init=false, resets_to_zero=false)
  T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i6 ), zero_init=false, resets_to_zero=false)
  FOR i104 in iS0{i0}:
    SetCurrentStream to Stream ( i104 % numberOfStreams )
    T4_l_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS0{i0}, index = i104 )
    T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS11{i0}, index = i104 )
    Communication 46 (type=Allgather, team=(0 1 2 3 4 5 6 7), input=T4_l_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), output=T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    Wait Communication 46
    T6_l_float[iS21{i2}, iS22{i3}, iS23{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T2_g_float[iStream6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iStream6{i0}, index = i104 )
    T6_l_float[iS21{i2}, iS22{i3}, iS23{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = matmul(T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    SetCurrentStream to Stream 0
    Synchronize Stream ( i104 % numberOfStreams )
} // %HostIrContainer
```

The nsight profile shows that we do achieve overlap, in a way that is comparable to the Aten overlap experiments

![Screenshot 2024-12-18 at 12 08 05](https://github.com/user-attachments/assets/75e37822-a78d-49e6-a644-4fb99c40e945)
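The loop in the Host IR dump picks a stream per iteration with `i104 % numberOfStreams`. As a small illustration of that round-robin rule, the sketch below computes the stream index for each iteration. The function name `assignStreams` and its parameters are hypothetical and do not come from the nvFuser code base.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: for each loop iteration i in [0, num_iterations),
// return the stream index i % number_of_streams used by that iteration.
std::vector<std::size_t> assignStreams(std::size_t num_iterations,
                                       std::size_t number_of_streams) {
  if (number_of_streams == 0) {
    throw std::invalid_argument("number_of_streams must be positive");
  }
  std::vector<std::size_t> stream_of_iteration;
  stream_of_iteration.reserve(num_iterations);
  for (std::size_t i = 0; i < num_iterations; ++i) {
    stream_of_iteration.push_back(i % number_of_streams);
  }
  return stream_of_iteration;
}
```

For example, `assignStreams(5, 3)` yields the indices 0, 1, 2, 0, 1. Iterations 0 and 3 share a stream, so their work is ordered one after the other, while iterations on different streams are free to overlap.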

make stream synchronization non blocking

36fd2be

samnordmann requested a review from wujingyue December 18, 2024 11:20

samnordmann mentioned this pull request Dec 18, 2024

Lower distributed matmul to pipelined algorithm for fine-grained overlap: AG+GEMM layout #3606

Merged

2 tasks

lint

1e9f1d0

samnordmann requested a review from nsarka December 18, 2024 11:38

add event to events_ container

af06de4

destroy event async at create site

5e166a0

wujingyue approved these changes Dec 19, 2024

samnordmann merged commit cd2b3eb into NVIDIA:main Dec 23, 2024
47 checks passed

xwang233 mentioned this pull request Jan 10, 2025

Lower distributed matmul to pipelined algorithm for fine-grained overlap: AG+GEMM layout #3695

Closed

2 tasks
