CalebDu changed the title from "[QST] question about streamk implementation in cutlass" to "[QST] Differences between papers and cutlass implementations about streamk algorithm" on Nov 6, 2024
In the Stream-K paper, a CTA with ¬tile_started stores its partial sum to the global memory workspace, and a CTA with ¬tile_ended accumulates partial sums from the global memory workspace. The paper states: "Stream-K is better able to hide the latency of inter-CTA synchronization due to the temporal skew between writers and readers when sharing partial sums."

But then I read the Stream-K implementation in CUTLASS: https://github.com/NVIDIA/cutlass/blob/19f51596e8be9fe87d583616466581ab5740c19d/include/cutlass/gemm/kernel/gemm_universal_streamk.h#L968C1-L982C8
That section of code shows that a CTA with ¬tile_ended stores its partial sum to the global memory workspace, and a CTA with ¬tile_started accumulates partial sums from the global memory workspace. This is the exact opposite of the logic in the paper. In the CUTLASS implementation, CTA1 will stall on tile1 because it has to wait for CTA0 to finish its partial sum for tile1.
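To make the contrast concrete, here is a minimal sketch of the two branch orders as I understand them. All names (`Action`, `paper_rule`, `cutlass_rule`) are hypothetical and are not taken from the paper or from CUTLASS. The precedence between the two checks is also my assumption: I assume the first check decides who stores and the second check only applies to the CTA that writes the output tile.

```cpp
// Hypothetical sketch: names and branch precedence are assumptions,
// not code from the Stream-K paper or from CUTLASS.
enum class Action {
    StorePartial,             // write partial sum to the global memory workspace
    AccumulateThenWriteTile,  // read peers' partial sums, then write the output tile
    WriteTile                 // CTA covered the whole tile, no sharing needed
};

// Rule as I read it in the paper: a CTA that did not start the tile stores.
constexpr Action paper_rule(bool tile_started, bool tile_ended) {
    if (!tile_started) return Action::StorePartial;
    return tile_ended ? Action::WriteTile : Action::AccumulateThenWriteTile;
}

// Rule as I read it in the linked CUTLASS code: a CTA that did not end the tile stores.
constexpr Action cutlass_rule(bool tile_started, bool tile_ended) {
    if (!tile_ended) return Action::StorePartial;
    return tile_started ? Action::WriteTile : Action::AccumulateThenWriteTile;
}
```

With these two rules, a CTA that starts a tile but does not end it accumulates under `paper_rule` and stores under `cutlass_rule`, and a CTA that ends a tile it did not start does the reverse. The sketch only captures which CTA takes which role; it says nothing about the order in which each CTA visits its tiles.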
How can this discrepancy be explained?