Lower distributed matmul to pipelined algorithm for fine-grained overlap #3606

Open
wants to merge 20 commits into
base: main

samnordmann
Collaborator

@samnordmann samnordmann commented Dec 18, 2024

Stacked on top of

What

Lower a MatmulOp sharded on the first inner axis into a pipelined allgather + GEMM (AG+GEMM) algorithm that achieves fine-grained overlap.

More precisely, this patch enables lowering the fusion:

  TensorView* a = makeContigTensor(4); //[S, DIDx(D), M/(S*D), K]
  TensorView* b = makeContigTensor(2); //[K, N]
  TensorView* c = matmul(a, b); //[S, D, M/(S*D), N]

  fusion->addInput(a);
  fusion->addInput(b);
  fusion->addOutput(c);

  auto mesh = DeviceMesh::createForNumDevices(D);
  a->setDeviceMesh(mesh);
  b->setDeviceMesh(mesh);
  c->setDeviceMesh(mesh);

  a->axis(1)->parallelize(ParallelType::DIDx);

to the Host IR program (dumped with NVFUSER_DUMP=host_ir):

%HostIrContainer { (T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T4_g_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g_float[iS6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  GetCurrentStream into Stream 0
  T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i4 ), zero_init=false, resets_to_zero=false)
  T2_g_float[iS6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T2_g_float[iS6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( ( i0 * i2 ) * i3 ) * i6 ), zero_init=false, resets_to_zero=false)
  FOR i104 in iS0{i0}:
    SetCurrentStream to Stream ( i104 % numberOfStreams )
    T4_g_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS0{i0}, index = i104 )
    T4_g_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T0_g_float[iS0{i0}, ideviceIdx.x1{i2}, iS2{i3}, iS3{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS0{i0}, index = i104 )
    T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T3_g_float[iS11{i0}, iS12{i2}, iS13{i3}, iS14{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS11{i0}, index = i104 )
    Communication 46 (type=Allgather, team=(0 1 2 3 4 5 6 7), input=T4_g_float[ideviceIdx.x15{i2}, iS16{i3}, iS17{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), output=T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    Wait Communication 46
    T6_l_float[iS21{i2}, iS22{i3}, iS23{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = select( T2_g_float[iS6{i0}, iS7{i2}, iS8{i3}, iS9{i6}, rS10{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iS6{i0}, index = i104 )
    T6_l_float[iS21{i2}, iS22{i3}, iS23{i6}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = matmul(T5_l_float[iS18{i2}, iS19{i3}, iS20{i4}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                T1_g_float[iS4{i5}, iS5{i6}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    SetCurrentStream to Stream 0
    Synchronize Stream ( i104 % numberOfStreams )
} // %HostIrContainer
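
The loop structure in the dump above can be modeled on the CPU as a rough sketch. This is not the lowering code from the patch: `Shards`, `allgatherSlice`, `pipelinedAllgatherMatmul`, and `PipelinedResult` are hypothetical names, the Allgather is replaced by a plain concatenation of per-device tiles, and everything runs sequentially. It only illustrates how each slice along `S` is assigned a stream round-robin, gathered, and multiplied independently of the other slices, which is what lets the communication of one slice overlap with the GEMM of another on a real device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // row-major [rows][cols]
// shards[d][s] is the [M/(S*D), K] tile of `a` held by device d for slice s.
using Shards = std::vector<std::vector<Matrix>>;

// Plain triple-loop GEMM: [R, K] x [K, N] -> [R, N].
Matrix gemm(const Matrix& a, const Matrix& b) {
  assert(!a.empty() && !b.empty() && a[0].size() == b.size());
  const std::size_t R = a.size(), K = b.size(), N = b[0].size();
  Matrix c(R, std::vector<float>(N, 0.0f));
  for (std::size_t r = 0; r < R; ++r)
    for (std::size_t k = 0; k < K; ++k)
      for (std::size_t n = 0; n < N; ++n)
        c[r][n] += a[r][k] * b[k][n];
  return c;
}

// Stand-in for the Allgather: concatenates every device's tile of slice s,
// giving a [D * M/(S*D), K] matrix.
Matrix allgatherSlice(const Shards& shards, std::size_t s) {
  Matrix gathered;
  for (const auto& device_tiles : shards) {
    const Matrix& tile = device_tiles[s];
    gathered.insert(gathered.end(), tile.begin(), tile.end());
  }
  return gathered;
}

struct PipelinedResult {
  std::vector<Matrix> c;               // c[s] is the output slice for index s
  std::vector<std::size_t> stream_of;  // stream index chosen for iteration s
};

PipelinedResult pipelinedAllgatherMatmul(
    const Shards& shards, const Matrix& b, std::size_t num_streams) {
  assert(!shards.empty() && num_streams > 0);
  const std::size_t S = shards[0].size();
  PipelinedResult out;
  for (std::size_t s = 0; s < S; ++s) {
    // Mirrors "SetCurrentStream to Stream ( i % numberOfStreams )".
    out.stream_of.push_back(s % num_streams);
    // Mirrors the Allgather communication and its Wait for slice s.
    const Matrix gathered = allgatherSlice(shards, s);
    // Mirrors the matmul writing into slice s of the output.
    out.c.push_back(gemm(gathered, b));
  }
  return out;
}
```

Because the iterations share no data besides the replicated `b`, the result is the same as gathering all of `a` first and running a single GEMM; the pipelined form only changes when each piece of work can start.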

The Nsight profile shows that we do achieve overlap, in a way that is comparable to the ATen overlap experiments.

Screenshot 2024-12-18 at 12 08 05

@samnordmann samnordmann force-pushed the overlap/lower_matmul_to_hostir branch from bb867e8 to b517c2b Compare December 18, 2024 08:47
@samnordmann
Collaborator Author

!test

samnordmann added a commit that referenced this pull request Dec 23, 2024
# What

Make stream synchronization non-blocking from the CPU point of view

# Why

Needed for achieving overlap in 
- #3606

Before this patch:
![Screenshot 2024-12-18 at 12 08 25](https://github.com/user-attachments/assets/f5c84282-ea85-4cb8-8a60-538cd91cfa1c)
After this patch:
![Screenshot 2024-12-18 at 12 08 05](https://github.com/user-attachments/assets/25537a5d-3e33-4ff8-baf4-4f013c1ed230)


# How 

Before this patch, the host IR `Synchronize` would call
`c10::synchronize()` on the CUDA stream, which makes the CPU block
until the stream completes. With this patch, we synchronize the current
stream with a given stream through a `cudaEvent` and the
`cudaStreamWaitEvent` API.
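
The difference between the two approaches can be sketched as follows. This is an illustration, not the code from the commit: `MockRuntime` is a hypothetical stand-in that only logs calls so the example runs without a GPU, and streams and events are plain integers. With the real CUDA runtime, its methods would roughly correspond to `cudaStreamSynchronize`, `cudaEventCreate`, `cudaEventRecord`, and `cudaStreamWaitEvent`; event destruction and error handling are left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mock of the runtime: records what the host asked for.
struct MockRuntime {
  std::vector<std::string> calls;
  bool host_blocked = false;  // true once the host thread had to wait
  int next_event = 0;

  // Host-blocking: returns only after all work on `stream` has finished.
  void streamSynchronize(int stream) {
    host_blocked = true;
    calls.push_back("streamSynchronize(" + std::to_string(stream) + ")");
  }
  int eventCreate() {
    calls.push_back("eventCreate");
    return next_event++;
  }
  // Marks the current tail of `stream`; returns immediately on the host.
  void eventRecord(int event, int stream) {
    calls.push_back("eventRecord(" + std::to_string(event) + ", " +
                    std::to_string(stream) + ")");
  }
  // Makes later work on `stream` wait for `event`; returns immediately.
  void streamWaitEvent(int stream, int event) {
    calls.push_back("streamWaitEvent(" + std::to_string(stream) + ", " +
                    std::to_string(event) + ")");
  }
};

// Behavior before the patch: the host waits for `other` to drain.
template <typename Runtime>
void synchronizeBlocking(Runtime& rt, int other) {
  rt.streamSynchronize(other);
}

// Behavior after the patch: the dependency is enqueued on the device side,
// so the host can keep issuing work.
template <typename Runtime>
void synchronizeNonBlocking(Runtime& rt, int current, int other) {
  assert(current != other);
  const int event = rt.eventCreate();
  rt.eventRecord(event, other);
  rt.streamWaitEvent(current, event);
}
```

The ordering matters: the event has to be recorded on the stream being waited on before the current stream is told to wait for it, otherwise the wait would not cover the work already enqueued there.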
samnordmann added a commit that referenced this pull request Dec 23, 2024
# What

Adds the primitive `GetCurrentStream` to the Host IR stack.

# Why

Needed for
- #3606

The idea is that if we want to use multiple streams internally, we
first need to capture the user stream, and then set it back as the
active stream when returning.
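
The capture-and-restore pattern can be sketched in plain C++ as below. In the Host IR program this is expressed with explicit `GetCurrentStream` and `SetCurrentStream` nodes; the RAII guard here is a swapped-in illustration of the same idea, and `StreamContext`, `UserStreamGuard`, and `runOnInternalStreams` are hypothetical names with streams modeled as integers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the runtime's notion of the active stream.
struct StreamContext {
  int current = 0;
  std::vector<int> history;  // every stream made current, in order

  int getCurrentStream() const { return current; }
  void setCurrentStream(int stream) {
    current = stream;
    history.push_back(stream);
  }
};

// Captures the caller's stream on construction and restores it on destruction.
class UserStreamGuard {
 public:
  explicit UserStreamGuard(StreamContext& ctx)
      : ctx_(ctx), user_stream_(ctx.getCurrentStream()) {}
  ~UserStreamGuard() { ctx_.setCurrentStream(user_stream_); }
  UserStreamGuard(const UserStreamGuard&) = delete;
  UserStreamGuard& operator=(const UserStreamGuard&) = delete;

 private:
  StreamContext& ctx_;
  int user_stream_;
};

// Cycles through an internal stream pool, then hands the user stream back.
void runOnInternalStreams(StreamContext& ctx,
                          const std::vector<int>& pool,
                          std::size_t iterations) {
  assert(!pool.empty());
  UserStreamGuard guard(ctx);  // capture the user stream first
  for (std::size_t i = 0; i < iterations; ++i) {
    ctx.setCurrentStream(pool[i % pool.size()]);
    // ... work for iteration i would be enqueued here ...
  }
}  // guard restores the user stream when the function returns
```

Without the capture step, the caller would come back to whichever internal stream happened to be active last, and any work it enqueues afterwards would land on the wrong stream.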
@samnordmann
Collaborator Author

!test

@samnordmann
Collaborator Author

!test

@samnordmann
Collaborator Author

samnordmann commented Dec 23, 2024

I am not sure what's going on with CI. Probably an infra issue

2 participants