-
This is about compiler optimization avoidance. Take a simple tile-based SGEMM pseudocode as an example: ...
regs = load_global_a_and_b_to_register(/*block_index_k*/0)
for (p = 0; p < num_block_k; p += 1) {
    store_register_a_b_to_smem(regs, smem, p)    // store for this block
    regs = load_global_a_and_b_to_register(p+1)  // load for the next block
    frag_a, frag_b = load_fragments(smem, p)     // load fragments from this block
    mac_loop(frag_a, frag_b, acc)                // rank-1 updates
}
store_global(acc)
... However, without explicit synchronization or barriers, the instruction order is not guaranteed. For example, the compiler might reorder it as follows: ...
regs = load_global_a_and_b_to_register(/*block_index_k*/0)
for (p = 0; p < num_block_k; p += 1) {
    store_register_a_b_to_smem(regs, smem, p)    // store for this block
    frag_a, frag_b = load_fragments(smem, p)     // load fragments from this block
    mac_loop(frag_a, frag_b, acc)                // rank-1 updates
    regs = load_global_a_and_b_to_register(p+1)  // load for the next block; the compiler might sink it here to lower register pressure
}
store_global(acc)
... To circumvent this, we can insert a synchronization: ...
regs = load_global_a_and_b_to_register(/*block_index_k*/0)
for (p = 0; p < num_block_k; p += 1) {
    store_register_a_b_to_smem(regs, smem, p)    // store for this block
    regs = load_global_a_and_b_to_register(p+1)  // load for the next block
    __syncthreads();                             // <---------------- sync -----------------
    frag_a, frag_b = load_fragments(smem, p)     // load fragments from this block
    mac_loop(frag_a, frag_b, acc)                // rank-1 updates
}
store_global(acc)
...
So the previous fix is not ideal; what we would really like is something along these lines: ...
regs = load_global_a_and_b_to_register(/*block_index_k*/0)
for (p = 0; p < num_block_k; p += 1) {
    store_register_a_b_to_smem(regs, smem, p)
    __syncthreads();                             // ensure the smem store is visible to the later load_fragments(...)
    regs = load_global_a_and_b_to_register(p+1)
    <an optimization barrier>                    // keep the compiler from being "too smart" and sinking the load again
    frag_a, frag_b = load_fragments(smem, p)
    mac_loop(frag_a, frag_b, acc)
}
store_global(acc)
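(As a concrete illustration only, and not necessarily what cutlass actually does: on nvcc, one common way to spell a compiler-only barrier is an empty inline-asm statement with a "memory" clobber. A minimal sketch, where opt_barrier is a helper name made up for this example: ...
// Hypothetical helper, for illustration only: the empty asm with a "memory"
// clobber tells the compiler that memory may be read or written here, so it
// must not move loads/stores across this point.
__device__ __forceinline__ void opt_barrier() {
    asm volatile("" ::: "memory");
}
... In the loop above, <an optimization barrier> would then become opt_barrier();. Note this only constrains ordering at the C++/PTX level; ptxas may still reschedule instructions when it generates SASS.)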
... To sum up, how does cutlass maintain the shape of the pipeline as desired?
-
You need to generate PTX close to cutlass's. The compiler pattern-matches cutlass code to deliver the optimum binary.
-
@hwu36 The question has been answered, but the idea is quite interesting. Today I came across some interesting stuff that exists in the AMD compiler.
You cannot ensure that. The nvcc heuristics are supposed to take care of it and generate the "best" binary they think they can. Like any heuristic, they do not always generate the optimum code.
Not really. First, it is hard to draw a line. load_global_a_and_b_to_register has many load instructions. We want them to be far enough away from the shared memory stores, but we also want them spread out a little bit evenly. So the best placement might need to interleave the loads with all the other instructions. Second, it is ver…
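(To make the first point concrete, here is a rough sketch in the same pseudocode as the question, purely illustrative and not the actual cutlass mainloop; num_frag_k, num_load_pieces, load_global_piece_a_and_b, and the per-slice load_fragments/mac are names made up for this example. The idea is to split the big global load into pieces and issue one piece per inner MAC iteration, so the loads end up interleaved with the math: ...
regs = load_global_a_and_b_to_register(/*block_index_k*/0)
for (p = 0; p < num_block_k; p += 1) {
    store_register_a_b_to_smem(regs, smem, p)
    __syncthreads();
    for (k = 0; k < num_frag_k; k += 1) {
        frag_a, frag_b = load_fragments(smem, p, k)        // fragments of this k-slice
        if (k < num_load_pieces)                           // hypothetical split of the global load
            regs[k] = load_global_piece_a_and_b(p + 1, k)  // issue one piece of the next block's load
        mac(frag_a, frag_b, acc)                           // rank-1 update for this k-slice
    }
}
store_global(acc)
... Whether the compiler keeps that interleaving in the final SASS is still up to its scheduler, which is the point being made above.)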