[QST] Differences between papers and cutlass implementations about streamk algorithm #1923

CalebDu · 2024-11-06T14:58:12Z

In the Stream-K paper, a CTA for which ¬tile_started holds stores its partial sums to the global memory workspace, and a CTA for which ¬tile_ended holds accumulates partial sums from the global memory workspace. The paper states: "Stream-K is better able to hide the latency of inter-CTA synchronization due to the temporal skew between writers and readers when sharing partial sums."

However, the Stream-K implementation in CUTLASS appears to do the opposite: https://github.com/NVIDIA/cutlass/blob/19f51596e8be9fe87d583616466581ab5740c19d/include/cutlass/gemm/kernel/gemm_universal_streamk.h#L968C1-L982C8
The linked code shows that a block with ¬tile_ended stores its partial sums to the global memory workspace, and a block with ¬tile_started accumulates partial sums from the global memory workspace. This is the exact opposite of the logic in the paper.

In the CUTLASS implementation, it seems that CTA1 would stall on tile1 while waiting for CTA0 to finish its partial sums for tile1.
How can this discrepancy be explained?
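To make the paper-side logic concrete, here is a minimal sketch of the role test described above, assuming each CTA walks its iteration range in ascending order. All names (`PaperRole`, `PaperSegment`, `paper_schedule`) are hypothetical and chosen for illustration. The sketch models only who stores and who accumulates, not the GEMM itself or the actual synchronization.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical names; this models the role decision only, not a real kernel.
enum class PaperRole { StorePartials, AccumulateThenStoreTile, StoreTile };

struct PaperSegment {
    int tile;        // output tile this segment contributes to
    PaperRole role;  // what the CTA does after its MAC loop on this tile
};

// Walk one CTA's iteration range [iter_begin, iter_end) in ascending order.
std::vector<PaperSegment> paper_schedule(int iter_begin, int iter_end,
                                         int iters_per_tile) {
    assert(iters_per_tile > 0 && iter_begin <= iter_end);
    std::vector<PaperSegment> out;
    int iter = iter_begin;
    while (iter < iter_end) {
        int tile = iter / iters_per_tile;
        int tile_iter = tile * iters_per_tile;           // first iteration of the tile
        int tile_iter_end = tile_iter + iters_per_tile;  // one past its last iteration
        bool tile_started = (iter == tile_iter);         // this CTA covers the tile's start
        bool tile_ended = (iter_end >= tile_iter_end);   // this CTA covers the tile's end
        PaperRole role = !tile_started ? PaperRole::StorePartials
                       : !tile_ended   ? PaperRole::AccumulateThenStoreTile
                                       : PaperRole::StoreTile;
        out.push_back({tile, role});
        iter = std::min(iter_end, tile_iter_end);        // advance to the next tile
    }
    return out;
}
```

With 4 iterations per tile and two CTAs covering [0, 6) and [6, 12), the CTA that owns [6, 12) stores partials for tile 1 as its first segment, while the CTA that owns [0, 6) accumulates them as its last segment. That ordering is the temporal skew the paper refers to.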

CalebDu · 2024-11-07T14:16:29Z

https://github.com/NVIDIA/cutlass/blob/d656afbd2a01112c0e4d90aafe0f8f78145c6585/include/cutlass/gemm/kernel/gemm_universal_streamk.h#L1063C1-L1063C75
I figured it out: in the CUTLASS implementation, an SK block works through its iterations in reverse order, from iter_end to iter_begin. So the block with ¬tile_started is the one that accumulates partial sums from the global memory workspace.
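The effect of the reversed traversal can be sketched as follows. This is a simplified model under stated assumptions, not the CUTLASS code: the names (`SkAction`, `SkStep`, `reverse_schedule`) are hypothetical, and the role test simply mirrors the behaviour described in this thread (¬tile_ended stores partials, ¬tile_started accumulates).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical names; a model of the ordering argument, not CUTLASS itself.
enum class SkAction { SharePartials, AcquireThenEpilogue, EpilogueOnly };

struct SkStep {
    int tile;         // output tile this step contributes to
    SkAction action;  // what the block does after its MAC loop on this tile
};

// Walk one block's iteration range [iter_begin, iter_end) from the end backward.
std::vector<SkStep> reverse_schedule(int iter_begin, int iter_end,
                                     int iters_per_tile) {
    assert(iters_per_tile > 0 && iter_begin <= iter_end);
    std::vector<SkStep> steps;
    int end = iter_end;
    while (end > iter_begin) {
        int tile = (end - 1) / iters_per_tile;
        int tile_first = tile * iters_per_tile;      // first iteration of the tile
        int tile_last = tile_first + iters_per_tile; // one past its last iteration
        int begin = std::max(iter_begin, tile_first);
        bool tile_started = (begin == tile_first);   // block covers the tile's start
        bool tile_ended = (end == tile_last);        // block covers the tile's end
        SkAction action = !tile_ended   ? SkAction::SharePartials
                        : !tile_started ? SkAction::AcquireThenEpilogue
                                        : SkAction::EpilogueOnly;
        steps.push_back({tile, action});
        end = begin;                                 // step back to the previous tile
    }
    return steps;
}
```

With 4 iterations per tile and two blocks covering [0, 6) and [6, 12), the block that owns [0, 6) shares its tile 1 partials as its first step, and the block that owns [6, 12) acquires them as its last step. In this model the writer still runs well ahead of the reader, so the roles are mirrored relative to the paper while the temporal skew is kept.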

CalebDu added the "? - Needs Triage" and "question" labels Nov 6, 2024

CalebDu changed the title ~~[QST] question about streamk implementation in cutlass~~ [QST] Differences between papers and cutlass implementations about streamk algorithm Nov 6, 2024

CalebDu closed this as completed Nov 7, 2024
