-
This is about compiler optimization avoidance. Take a simple tile-based SGEMM pseudocode as an example: ...
regs = load_global_a_and_b_to_register(/*block_index_k*/0)
for (p = 0; p < num_block_k; p += 1) {
    store_register_a_b_to_smem(regs, smem, p)    // store for this block
    regs = load_global_a_and_b_to_register(p+1)  // load for the next block
    frag_a, frag_b = load_fragments(smem, p)     // load fragments from this block
    mac_loop(frag_a, frag_b, acc)                // rank-1 updates
}
store_global(acc)
... However, without explicit synchronization or barriers, the instruction order is not guaranteed. For example, the compiler might reorder it as follows: ...
regs = load_global_a_and_b_to_register(/*block_index_k*/0)
for (p = 0; p < num_block_k; p += 1) {
    store_register_a_b_to_smem(regs, smem, p)    // store for this block
    frag_a, frag_b = load_fragments(smem, p)     // load fragments from this block
    mac_loop(frag_a, frag_b, acc)                // rank-1 updates
    regs = load_global_a_and_b_to_register(p+1)  // load for the next block; the compiler might sink it here to lower register pressure
}
store_global(acc)
... To circumvent this, we can insert a synchronization: ...
regs = load_global_a_and_b_to_register(/*block_index_k*/0)
for (p = 0; p < num_block_k; p += 1) {
    store_register_a_b_to_smem(regs, smem, p)    // store for this block
    regs = load_global_a_and_b_to_register(p+1)  // load for the next block
    __syncthreads();                             // <---------------- sync -----------------
    frag_a, frag_b = load_fragments(smem, p)     // load fragments from this block
    mac_loop(frag_a, frag_b, acc)                // rank-1 updates
}
store_global(acc)
...
So the previous fix is not ideal; what we would really like is something along these lines: ...
regs = load_global_a_and_b_to_register(/*block_index_k*/0)
for (p = 0; p < num_block_k; p += 1) {
    store_register_a_b_to_smem(regs, smem, p)
    __syncthreads();                             // ensure the smem store is visible to the later load_fragments(...)
    regs = load_global_a_and_b_to_register(p+1)
    <an optimization barrier>                    // keep the compiler from being "too smart" and sinking the load again
    frag_a, frag_b = load_fragments(smem, p)
    mac_loop(frag_a, frag_b, acc)
}
store_global(acc)
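(As a concrete illustration only, and not necessarily what cutlass actually does: on nvcc, one common way to spell a compiler-only barrier is an empty inline-asm statement with a "memory" clobber. A minimal sketch, where opt_barrier is a helper name made up for this example: ...
// Hypothetical helper, for illustration only: the empty asm with a "memory"
// clobber tells the compiler that memory may be read or written here, so it
// must not move loads/stores across this point.
__device__ __forceinline__ void opt_barrier() {
    asm volatile("" ::: "memory");
}
... In the loop above, <an optimization barrier> would then become opt_barrier();. Note this only constrains ordering at the C++/PTX level; ptxas may still reschedule instructions when it generates SASS.)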
... To sum up, how does cutlass maintain the shape of the pipeline as desired?
-
You need to generate PTX close to cutlass's. The compiler pattern-matches cutlass code to deliver the optimum binary.
-
@hwu36 The question has been answered, but the idea is quite interesting. Today I came across some interesting stuff that exists in the AMD compiler.
You cannot ensure that. The nvcc heuristics are supposed to take care of it and generate the "best" binary they think they can. Like any heuristic, they do not always generate the optimum code.
Not really. First, it is hard to draw a line. load_global_a_and_b_to_register has many load instructions. We want them to be far enough away from the shared memory stores, but we also want them spread out a little bit evenly. So the best placement might need to interleave the loads with all the other instructions. Second, it is ver…
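(To make the first point concrete, here is a rough sketch in the same pseudocode as the question, purely illustrative and not the actual cutlass mainloop; num_frag_k, num_load_pieces, load_global_piece_a_and_b, and the per-slice load_fragments/mac are names made up for this example. The idea is to split the big global load into pieces and issue one piece per inner MAC iteration, so the loads end up interleaved with the math: ...
regs = load_global_a_and_b_to_register(/*block_index_k*/0)
for (p = 0; p < num_block_k; p += 1) {
    store_register_a_b_to_smem(regs, smem, p)
    __syncthreads();
    for (k = 0; k < num_frag_k; k += 1) {
        frag_a, frag_b = load_fragments(smem, p, k)        // fragments of this k-slice
        if (k < num_load_pieces)                           // hypothetical split of the global load
            regs[k] = load_global_piece_a_and_b(p + 1, k)  // issue one piece of the next block's load
        mac(frag_a, frag_b, acc)                           // rank-1 update for this k-slice
    }
}
store_global(acc)
... Whether the compiler keeps that interleaving in the final SASS is still up to its scheduler, which is the point being made above.)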