Lower distributed matmul to pipelined algorithm for fine-grained overlap #3606

samnordmann · 2024-12-18T08:35:59Z

Stacked on top of

What

Lower a `MatmulOp` sharded on the first inner axis into a pipelined allgather + GEMM (AG+GEMM) algorithm that achieves fine-grained overlap of communication and computation.

More precisely, this patch enables lowering the fusion:

  TensorView* a = makeContigTensor(4); // [S, DIDx(D), M/(S*D), K]
  TensorView* b = makeContigTensor(2); // [K, N]
  TensorView* c = matmul(a, b); // [S, D, M/(S*D), N]

  fusion->addInput(a);
  fusion->addInput(b);
  fusion->addOutput(c);

  auto mesh = DeviceMesh::createForNumDevices(D);
  a->setDeviceMesh(mesh);
  b->setDeviceMesh(mesh);
  c->setDeviceMesh(mesh);

  a->axis(1)->parallelize(ParallelType::DIDx);

to the following Host IR program (dumped with `NVFUSER_DUMP=host_ir`):

%HostIrContainer { (T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T4_g_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g_float[iS6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  GetCurrentStream into Stream 0
  T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i4 ), zero_init=false, resets_to_zero=false)
  T2_g_float[iS6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T2_g_float[iS6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i6 ), zero_init=false, resets_to_zero=false)
  FOR i104 in iS0{i0}:
    SetCurrentStream to Stream ( i104 % numberOfStreams )
    T4_g_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS0{i0}, index = i104 )
    T4_g_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS0{i0}, index = i104 )
    T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS11{i0}, index = i104 )
    Communication 46 (type=Allgather, team=(0 1 2 3 4 5 6 7), input=T4_g_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), output=T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    Wait Communication 46
    T6_l_float[iS21{i2}, iS22{i3}, iS23{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T2_g_float[iS6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS6{i0}, index = i104 )
    T6_l_float[iS21{i2}, iS22{i3}, iS23{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = matmul(T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    SetCurrentStream to Stream 0
    Synchronize Stream ( i104 % numberOfStreams )
} // %HostIrContainer
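
To make the data movement in the `FOR` loop above concrete, here is a minimal CPU-only sketch of the same per-slice "allgather, then GEMM" structure. It is an illustration under assumptions, not the nvFuser lowering: `Matrix`, `gemm`, `pipelinedAllgatherGemm` and `matchesUnshardedGemm` are hypothetical names, devices are modelled as entries of a vector, the allgather is a plain copy, and nothing runs concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix; a hypothetical helper type for this sketch only.
struct Matrix {
  std::size_t rows, cols;
  std::vector<float> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
  float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Plain triple-loop GEMM: returns x @ y.
Matrix gemm(const Matrix& x, const Matrix& y) {
  assert(x.cols == y.rows);
  Matrix out(x.rows, y.cols);
  for (std::size_t i = 0; i < x.rows; ++i)
    for (std::size_t k = 0; k < x.cols; ++k)
      for (std::size_t j = 0; j < y.cols; ++j)
        out.at(i, j) += x.at(i, k) * y.at(k, j);
  return out;
}

// shards[d][s] is the [m, K] tile that device d owns for slice s, m = M/(S*D).
// Returns S matrices of shape [D*m, N]: per slice, gather the D tiles into one
// block, then multiply that block by b.
std::vector<Matrix> pipelinedAllgatherGemm(
    const std::vector<std::vector<Matrix>>& shards, const Matrix& b) {
  const std::size_t D = shards.size();
  const std::size_t S = shards.at(0).size();
  const std::size_t m = shards[0][0].rows;
  const std::size_t K = shards[0][0].cols;
  std::vector<Matrix> c;
  for (std::size_t s = 0; s < S; ++s) {
    // "Allgather" modelled as a copy: stack every device's tile of slice s.
    Matrix gathered(D * m, K);
    for (std::size_t d = 0; d < D; ++d)
      for (std::size_t i = 0; i < m; ++i)
        for (std::size_t k = 0; k < K; ++k)
          gathered.at(d * m + i, k) = shards[d][s].at(i, k);
    // GEMM on the gathered slice only.
    c.push_back(gemm(gathered, b));
  }
  return c;
}

// Builds small integer-valued inputs and checks the sliced result against one
// GEMM over the full [S*D*m, K] matrix. Returns true when they agree exactly.
bool matchesUnshardedGemm(std::size_t S, std::size_t D, std::size_t m,
                          std::size_t K, std::size_t N) {
  Matrix b(K, N);
  for (std::size_t i = 0; i < b.data.size(); ++i)
    b.data[i] = static_cast<float>(i % 5) - 2.0f;
  Matrix full(S * D * m, K);
  for (std::size_t i = 0; i < full.data.size(); ++i)
    full.data[i] = static_cast<float>(i % 7) - 3.0f;
  // Row ((s * D + d) * m + i) of `full` is row i of device d's tile of slice s.
  std::vector<std::vector<Matrix>> shards(
      D, std::vector<Matrix>(S, Matrix(m, K)));
  for (std::size_t s = 0; s < S; ++s)
    for (std::size_t d = 0; d < D; ++d)
      for (std::size_t i = 0; i < m; ++i)
        for (std::size_t k = 0; k < K; ++k)
          shards[d][s].at(i, k) = full.at((s * D + d) * m + i, k);
  const Matrix ref = gemm(full, b);
  const std::vector<Matrix> c = pipelinedAllgatherGemm(shards, b);
  for (std::size_t s = 0; s < S; ++s)
    for (std::size_t r = 0; r < D * m; ++r)
      for (std::size_t n = 0; n < N; ++n)
        if (c[s].at(r, n) != ref.at(s * D * m + r, n)) return false;
  return true;
}
```

Calling `matchesUnshardedGemm(2, 3, 2, 4, 5)` checks that slicing along `S` and gathering along `D` reproduces the unsharded product. In the Host IR program, each iteration is additionally issued on stream `i104 % numberOfStreams`, which this sequential sketch does not model.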

The Nsight profile shows that we do achieve overlap, comparable to what the ATen overlap experiments achieve.

samnordmann · 2024-12-18T13:35:55Z

!test

csrc/host_ir/lower.cpp

# What

Make stream synchronization non-blocking from the CPU's point of view.

# Why

Needed to achieve overlap in #3606.

Before this patch:

![Screenshot 2024-12-18 at 12 08 25](https://github.com/user-attachments/assets/f5c84282-ea85-4cb8-8a60-538cd91cfa1c)

After this patch:

![Screenshot 2024-12-18 at 12 08 05](https://github.com/user-attachments/assets/25537a5d-3e33-4ff8-baf4-4f013c1ed230)

# How

Before this patch, the Host IR `Synchronize` would call `c10::synchronize()` on the CUDA stream, which makes the CPU block until the stream completes. With this patch, we synchronize the current stream with a given stream through a `cudaEvent` and the `cudaStreamWaitEvent` API.
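
A toy timeline model can illustrate why a host-blocking synchronize prevents overlap in a loop like the one in #3606. This is a sketch under strong assumptions, not CUDA code: `Timeline` and `simulate` are hypothetical names, time is in abstract units, enqueueing is free, and worker streams are assumed to run fully concurrently.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy timeline model, NOT the CUDA API: all times are abstract units.
struct Timeline {
  double host = 0.0;            // when the CPU can issue its next call
  double main_stream = 0.0;     // when the user's stream has seen all work
  std::vector<double> workers;  // when each worker stream becomes idle
};

// Replays "enqueue one slice on worker (j % num_streams), then synchronize
// that worker with the main stream" for every slice j.
// blocking = true  : the host stalls until the worker drains.
// blocking = false : only the main stream is ordered after the worker,
//                    as with an event wait; the host keeps issuing work.
// Returns the time at which the main stream is ordered after all slices.
double simulate(std::size_t slices, std::size_t num_streams,
                double work_per_slice, bool blocking) {
  Timeline t;
  t.workers.assign(num_streams, 0.0);
  for (std::size_t j = 0; j < slices; ++j) {
    double& w = t.workers[j % num_streams];
    // A slice starts once the host has issued it and its stream is idle.
    w = std::max(w, t.host) + work_per_slice;
    if (blocking) {
      // The CPU cannot issue slice j + 1 before slice j has finished.
      t.host = std::max(t.host, w);
    }
    // In both modes the main stream ends up ordered after the worker.
    t.main_stream = std::max(t.main_stream, w);
  }
  return t.main_stream;
}
```

With 4 slices of 1 unit each on 2 worker streams, `simulate(4, 2, 1.0, true)` yields 4 units because every slice waits for the host, while `simulate(4, 2, 1.0, false)` yields 2 units because the two streams progress side by side. Real timings depend on hardware contention that this model ignores.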

# What

Add the primitive `GetCurrentStream` to the Host IR stack.

# Why

Needed for #3606. If we want to use multiple streams internally, we must first capture the user's stream and then set it back as the active stream before returning.
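
The capture-and-restore idea can be sketched in plain C++ with an RAII guard. This is a swapped-in illustration, not how Host IR does it (Host IR emits explicit `GetCurrentStream` and `SetCurrentStream` nodes). `StreamId`, `currentStream`, `UserStreamGuard` and `runOnWorkerStreams` are hypothetical names, and the "current stream" is just a thread-local integer rather than a real CUDA stream.

```cpp
// Stand-in for a per-thread "current stream" registry; hypothetical, and not
// the PyTorch or nvFuser API.
using StreamId = int;

inline StreamId& currentStream() {
  thread_local StreamId current = 0;
  return current;
}

// Captures the stream that is current at construction and makes it current
// again on destruction.
class UserStreamGuard {
 public:
  UserStreamGuard() : saved_(currentStream()) {}
  ~UserStreamGuard() { currentStream() = saved_; }
  UserStreamGuard(const UserStreamGuard&) = delete;
  UserStreamGuard& operator=(const UserStreamGuard&) = delete;
  StreamId saved() const { return saved_; }

 private:
  StreamId saved_;
};

// Hops across worker streams 1..num_streams for each slice, then returns the
// stream that is current after the guard has gone out of scope.
StreamId runOnWorkerStreams(StreamId user_stream, int slices, int num_streams) {
  currentStream() = user_stream;  // whatever the caller had selected
  {
    UserStreamGuard guard;  // capture the user's stream
    for (int j = 0; j < slices; ++j) {
      currentStream() = 1 + j % num_streams;  // switch to a worker stream
      // ... enqueue communication and compute for slice j here ...
    }
  }  // guard restores the user's stream here
  return currentStream();
}
```

Whatever stream the caller had selected, `runOnWorkerStreams` returns that same stream, so code running after the multi-stream section is not silently moved onto a worker stream.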

samnordmann · 2024-12-23T15:14:28Z

!test

samnordmann · 2024-12-23T15:40:19Z

!test

samnordmann · 2024-12-23T22:55:31Z

I am not sure what's going on with CI. Probably an infra issue.

Host IR: add GetCurrentStream

38721fe

samnordmann mentioned this pull request Dec 18, 2024

Host IR: add GetCurrentStream #3605

Merged

samnordmann added 2 commits December 18, 2024 00:46

lint

c4ca266

lower to collective base pipeline AG+GEMM

b517c2b

samnordmann force-pushed the overlap/lower_matmul_to_hostir branch from bb867e8 to b517c2b on December 18, 2024 08:47

samnordmann added 4 commits December 18, 2024 00:48

lint

92ab927

lint

ed4440a

update with non blocking stream synchronization

ef8f00c

make stream synchronization non blocking

36fd2be

samnordmann mentioned this pull request Dec 18, 2024

Host IR: make stream synchronization non blocking #3608

Merged

samnordmann added 4 commits December 18, 2024 03:34

lint

1e9f1d0

add event to events_ container

af06de4

destroy event async at create site

5e166a0

Merge branch 'host_irs/non_blocking_stream_synchronize' into overlap/lower_matmul_to_hostir

e8ffadb

nsarka reviewed Dec 20, 2024

csrc/host_ir/lower.cpp (outdated)

nsarka reviewed Dec 20, 2024

csrc/host_ir/lower.cpp (outdated)

nsarka reviewed Dec 20, 2024

csrc/host_ir/lower.cpp (outdated)

samnordmann added 2 commits December 23, 2024 03:04

minor review

741202b

Merge branch 'main' of github.com:NVIDIA/Fuser into host_irs/get_current_stream

353c03c

samnordmann added 2 commits December 23, 2024 05:51

Merge branch 'host_irs/get_current_stream' into overlap/lower_matmul_to_hostir

4420eb4

Merge branch 'main' of github.com:NVIDIA/Fuser into overlap/lower_matmul_to_hostir

d0a9340

samnordmann added the Multi-GPU label Dec 23, 2024

samnordmann added 5 commits December 23, 2024 06:25

fix merge

0374604

minor review

5e07ad8

remove now unnecessary trick of adding artifical outputs

b546dce

lint

8e8b247

remove now unnecessary patch on broadcast

d5b42c2

samnordmann requested review from nsarka and wujingyue December 23, 2024 15:15

nsarka approved these changes Dec 23, 2024

samnordmann mentioned this pull request Dec 23, 2024

Ring Allgather + GEMM Overlap HostIR Implementation #3626

Open
