Allow scalar broadcasting in VisitorRowBroadcast and VisitorColBroadcast #1539

tlrmchlsmth · 2024-05-16T18:51:22Z

This PR addresses an inconsistency between the VisitorRowBroadcast/VisitorColBroadcast epilogues and the SM90RowBroadcast/SM90ColBroadcast epilogues.

The inconsistency is that the SM90 epilogues can handle either row/column broadcasting or scalar broadcasting (the latter by passing in a nullptr for the first argument and a float for the second), while the visitor epilogues cannot. This PR adds this functionality to the visitor epilogues.
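To make the nullptr convention concrete, here is a minimal host-side sketch of the selection logic. It is not CUTLASS code: `RowBroadcastArgs` and `load_row_broadcast` are hypothetical names invented for illustration, and the real epilogues predicate out-of-bounds columns rather than asserting on them.

```cpp
#include <cassert>

// Hypothetical stand-in for a row-broadcast argument pack (not the CUTLASS type).
struct RowBroadcastArgs {
  const float* ptr_row = nullptr;  // length-N row vector, or nullptr for the scalar case
  float null_default = 0.0f;       // scalar that is broadcast when ptr_row is nullptr
};

// Value that the broadcast contributes at column `col` of an M x N tile.
float load_row_broadcast(const RowBroadcastArgs& args, int col, int n) {
  // Assumption for this sketch: callers stay in bounds.
  assert(col >= 0 && col < n);
  // Vector case: every row of the tile sees ptr_row[col].
  // Scalar case: every element of the tile sees null_default.
  return args.ptr_row ? args.ptr_row[col] : args.null_default;
}
```

With this shape, one compiled kernel can serve both the vector and the scalar case, since the choice is made from the runtime arguments.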

I am using this for quantized GEMMs that can handle either per-token/per-channel quantization or per-tensor quantization without compiling and distributing multiple kernels to handle all cases.

For reference, I ran into this issue when developing vllm-project/vllm#4749

tlrmchlsmth · 2024-05-21T16:42:59Z

Very happy to add unit tests and put in the work to get this PR into a landable state. But first hoping to get some high-level feedback on whether this is the right approach or a reasonable thing to do. Thanks!

tlrmchlsmth · 2024-05-23T21:06:07Z

cc @mnicely

Hongbosherlock · 2024-06-04T07:52:10Z

Hi @tlrmchlsmth thanks for your contribution. I'm working on int8 GEMM with dequant fusion.
Can the following code work with the original VisitorRowBroadcast/VisitorColBroadcast epilogues?

    // inputs
    //     A           [M, K]    int8
    //     B           [N, K]    int8
    //     alphaCol    [M, 1]    fp32
    //     alphaRow    [1, N]    fp32
    // outputs
    //     mat [M, N]            fp32

    // alphaCol    [M, 1]    fp32
    using V1Broadcast = cutlass::epilogue::threadblock::VisitorColBroadcast<
        OutputTileThreadMap, ElementC,
        cute::Stride<int32_t, _1, _0>  // StrideMNL
    >;

    // alphaRow    [1, N]    fp32
    using V2Broadcast = cutlass::epilogue::threadblock::VisitorRowBroadcast<
        OutputTileThreadMap, ElementC,
        cute::Stride<_0, _1, int32_t>  // StrideMNL
    >;

> The inconsistency is that the SM90 epilogues can handle either row/column broadcasting by passing in a nullptr for the first argument, and a float for the second, while the visitor epilogues cannot. This PR adds this functionality to the visitor epilogues.

I don’t quite understand this PR. Regarding this issue, could you please provide some examples? In what situations won’t it work, and in what situations will it work based on this PR?

tlrmchlsmth · 2024-06-04T15:21:04Z

@Hongbosherlock
Compare the cutlass 2.0 epilogues in include/cutlass/epilogue/threadblock/fusion/visitor_load.hpp and
the cutlass 3.0 epilogues in include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp

In the second, the row and column broadcast epilogues (Sm90RowBroadcast and Sm90ColBroadcast) have null_default arguments that are used to provide scalar broadcast functionality. In the first file, the similar row and column broadcast epilogues also have null_default arguments but they simply aren't used.

I tried the approach you suggest for cutlass 2.0 but couldn't get it to compile. If you have a full working example, I'd like to see it :)

Anyway, the same approach won't work for cutlass 3.0, because you will fail this static assert: cute::is_same_v<StrideMNL, Stride<_1,_0, _0>>. The problem this PR is addressing is the inconsistency between these two very similar types.

thakkarV · 2024-06-13T18:31:07Z

@hwu36 can we ask Zhaodong to merge this? I don't know his GitHub username

tlrmchlsmth · 2024-06-13T18:50:54Z

JFYI I did end up going in a different direction with these epilogue changes. See vllm-project/vllm#5137 -- I found that it was much nicer for a variety of reasons if both the scalar and the vector broadcast cases take a float * that points to device memory as an argument.
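A rough host-side sketch of that alternative follows. It is not taken from the linked vLLM change: `RowOrScalarArgs`, `load_row_or_scalar`, and the `is_vector` flag are all assumptions made for illustration, and plain host memory stands in for device memory.

```cpp
#include <cassert>

// Hypothetical argument pack: a single pointer serves both cases.
struct RowOrScalarArgs {
  const float* ptr = nullptr;  // device memory in a real kernel; host memory in this sketch
  bool is_vector = false;      // assumption: a runtime flag selects how ptr is interpreted
};

// Value contributed at column `col`.
float load_row_or_scalar(const RowOrScalarArgs& args, int col) {
  // In this scheme the pointer is always valid; there is no nullptr special case.
  assert(args.ptr != nullptr);
  // Vector case indexes by column; scalar case always reads element 0.
  return args.is_vector ? args.ptr[col] : args.ptr[0];
}
```

Compared with the nullptr convention, the scalar here lives behind the same pointer as the vector, so the caller passes the same kind of argument in both cases.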

thakkarV · 2024-06-13T19:01:22Z

include/cutlass/epilogue/threadblock/fusion/visitor_load.hpp

+          bool guard = get<1>(coord_v(i)) < n;
+          cutlass::arch::global_load<VecType, sizeof(VecType)>(dst_v(i), (void const*)&src_v(i), guard);
+        }
+      } else {


Nit: New line after branch close.

thakkarV · 2024-06-13T19:01:50Z

include/cutlass/epilogue/threadblock/fusion/visitor_load.hpp

+        CUTLASS_PRAGMA_UNROLL
+        for (int i = 0; i < size(src_v); ++i) {
+          if(get<1>(coord_v(i)) < n)
+          {


Nit: no new line before brace open

thakkarV · 2024-06-13T19:02:07Z

include/cutlass/epilogue/threadblock/fusion/visitor_load.hpp

+
+        CUTLASS_PRAGMA_UNROLL
+        for (int i = 0; i < size(src_v); ++i) {
+          if(get<1>(coord_v(i)) < n)


Nit: spacing: if (get

thakkarV · 2024-06-13T19:04:45Z

include/cutlass/epilogue/threadblock/fusion/visitor_load.hpp

-      copy_if(pred, tC_gCol, tC_rCol);
+
+      if (params_ptr->ptr_col) {
+        // In this case we are loading from a column vector and broadcasting


A design question: this isn't really a scalar operation anymore. Does it make sense to extend this visitor, or to replace this with a vector broadcast instead that then has a broadcasting layout for its data?

tlrmchlsmth · 2024-06-13

> replace this with a vector broadcast instead that then has a broadcasting layout for its data

Could you point me to an example of this?

tlrmchlsmth · 2024-06-13

Thanks for taking a look at the PR BTW :)

hwu36 · 2024-06-13T20:54:28Z

@apuaaChen

apuaaChen · 2024-06-17T20:38:33Z

Hi @tlrmchlsmth, thanks for the PR! One question: can we use VisitorScalarBroadcast to achieve the same goal? It also takes a scalar (e.g. float) and broadcasts it to the whole epilogue tile.

tlrmchlsmth · 2024-06-17T21:32:53Z

@apuaaChen We could totally do that, but then, in order to have a kernel for every case of fp8 quantized GEMM that we need to support, we would need 4x the number of kernels. The activations can have per-tensor or per-token scales, and the weights can have per-tensor or per-output-channel scales. So this PR lets us pick another point in the binary size/performance tradeoff space.
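The four combinations (two activation modes times two weight modes) can be illustrated with a small host-side sketch in which one function covers all of them at runtime. `Scale` and `dequantize` are hypothetical names, the accumulator type is simplified to `int`, and none of this is the actual kernel code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scale descriptor: nullptr means "per-tensor", so the scalar is used.
struct Scale {
  const float* ptr = nullptr;
  float scalar = 1.0f;
  float at(std::size_t i) const { return ptr ? ptr[i] : scalar; }
};

// D[i][j] = acc[i][j] * a_scale(i) * b_scale(j), with row-major M x N storage.
// a_scale is per-token (indexed by row) or per-tensor;
// b_scale is per-output-channel (indexed by column) or per-tensor.
std::vector<float> dequantize(const std::vector<int>& acc, std::size_t m, std::size_t n,
                              const Scale& a_scale, const Scale& b_scale) {
  assert(acc.size() == m * n);
  std::vector<float> out(m * n);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      out[i * n + j] = static_cast<float>(acc[i * n + j]) * a_scale.at(i) * b_scale.at(j);
    }
  }
  return out;
}
```

If each scale mode were fixed at compile time instead, every pairing of activation mode and weight mode would need its own instantiation, which is where the 4x figure comes from.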

apuaaChen · 2024-06-21T19:55:59Z

@tlrmchlsmth Got it! Let me merge it. Thanks for the explanation.

ProExpertProg · 2024-07-17T15:30:24Z

@apuaaChen while @tlrmchlsmth ended up using a custom visitor that loads both a row and a scalar from the float* argument, I ran into this use-case when the scalar is a known constant (0 bias). Again the benefit is reducing code size by having one kernel handle both cases. Could we get this PR merged?

ProExpertProg · 2024-07-17T18:55:30Z

Could you let me know if you plan to merge it, or if there's any cleanup you want me to do before we merge? I also have a version with a boolean EnableNullptr template parameter (default false) that enables the new scalar behavior, which is consistent with the c3x epilogues. Let me know if I should push that to this branch.
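As a sketch of how such an opt-in parameter could gate the behavior, the following host-side C++17 example uses a hypothetical `load_broadcast` function. Only the parameter name `EnableNullptr` and its default of false come from this thread; the rest is an assumption and does not reproduce the actual visitor code.

```cpp
#include <cassert>

// Hypothetical loader; EnableNullptr defaults to false so existing users see no change.
template <bool EnableNullptr = false>
float load_broadcast(const float* ptr, float null_default, int idx) {
  if constexpr (EnableNullptr) {
    // Opt-in behavior: a null pointer selects the scalar instead of a vector element.
    return ptr ? ptr[idx] : null_default;
  } else {
    // Default behavior: the pointer is always dereferenced and null_default is ignored.
    assert(ptr != nullptr);
    (void)null_default;
    return ptr[idx];
  }
}
```

Making the behavior a compile-time choice means the null check is only compiled in for instantiations that ask for it.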

apuaaChen · 2024-07-17T20:59:58Z

@ProExpertProg Please push your changes to this branch. I will first merge your updates into our internal repo. Once CI passes, I can get your PR merged, thanks!

ProExpertProg · 2024-07-17T22:16:57Z

Perfect, thank you!!

ProExpertProg · 2024-07-17T22:19:41Z

And please don't hesitate to ask for any changes or improved comments, and feel free to make edits yourself if there are any style/formatting issues.

github-actions · 2024-08-16T23:05:49Z

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

ProExpertProg · 2024-08-17T00:12:16Z

@apuaaChen were you able to get the PR run on the internal CI?

apuaaChen · 2024-08-17T15:35:07Z

> @apuaaChen were you able to get the PR run on the internal CI?

Yes, it passed the internal CI. I’m combining it with a few other fixes right now.

github-actions · 2024-11-15T16:06:35Z

This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.

mnicely added the feature request New feature or request label May 24, 2024

thakkarV reviewed Jun 13, 2024

tlrmchlsmth and others added 2 commits July 17, 2024 11:58

Allow scalar broadcasting in VisitorRowBroadcast and VisitorColBroadcast

012d623

Added template parameter EnableNullptr (default false)

0b6c76e

ProExpertProg force-pushed the tms/2x_scalar_broadcast branch from 80a5654 to 0b6c76e Compare July 17, 2024 22:16

github-actions bot added the inactive-30d label Aug 16, 2024

github-actions bot removed the inactive-30d label Aug 17, 2024

github-actions bot added the inactive-90d label Nov 15, 2024
