Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix generic-k code in top-k with softmax #1993

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,11 @@

Those assumptions are as:
1. Fusion is over the N dimension.
2. Top-K is either 2 or 4 elements, and the value is static (meaning two kernels have to be
compiled to support both.)
2. Top-K value is static (meaning multiple kernels have to be compiled to support
different values.)
t4c1 marked this conversation as resolved.
Show resolved Hide resolved
* NOTE: Only K=2 and K=4 cases are performance-optimized and enabled by default.
There is also a generic sort that supports all K values greater than 1, but it can lead to serious performance implications to the underlying kernel.
If necessary, users can simply remove the K==2 || K ==4 assertion under cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp, and the generic sort will automatically be used for all other Ks.
3. The GEMM tile shape along N is greater than or equal to problem size
along N.

Expand Down
9 changes: 6 additions & 3 deletions include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -209,13 +209,14 @@ void add_element_to_desc_sorted_array(cutlass::Array<Element, N>& a, Element b)
// slower generic path with branching, slower, and can cause register spill
CUTLASS_PRAGMA_UNROLL
for (int k = 0; k < N; ++k) {
if (a[k] <= b) {
if (a[k] < b) {
// Shift down
CUTLASS_PRAGMA_UNROLL
for (int l = N - 1; l > k; --l) {
a[l] = a[l-1];
}
a[k] = b;
break;
}
}
}
Expand All @@ -237,7 +238,7 @@ void merge_desc_sorted_arrays(cutlass::Array<Element, N>& a, const cutlass::Arra
int j = 0;
CUTLASS_PRAGMA_UNROLL
for (int k = 0; k < N; ++k) {
if (a[k] <= b[j]) {
if (a[k] < b[j]) {
// Shift down
CUTLASS_PRAGMA_UNROLL
for (int l = N - 1; l > k; --l) {
Expand Down Expand Up @@ -334,7 +335,9 @@ template <
struct Sm90TopKSoftmaxColReduction {
private:
static_assert(is_same_v<ElementCompute, float>, "Fused Top-K + Softmax reduction requires FP32 accumulation.");
t4c1 marked this conversation as resolved.
Show resolved Hide resolved
static_assert(TopK == 2 || TopK == 4, "Fused Top-K + Softmax reduction only supports K=2 and K=4.");
static_assert(TopK == 2 || TopK == 4,
"Fused Top-K + Softmax reduction only allows K=2 and K=4, because those cases have been performance-optimized. Other values of K can be enabled by removing this assertion, but they may come with serious performance implications."
);
static_assert(Alignment * sizeof_bits_v<ElementOutput> % 128 == 0, "sub-16B alignment not supported yet");

// Reduction tensors
Expand Down