Device-side grouped GEMM #1620
-
Hello! Is there a way to do a device-side grouped (or batched) GEMM? That is, my goal is to perform a set of GEMMs with non-uniform sizes (a flexible constraint, since we can pad with zeros) inside a kernel. I could serially invoke N separate GEMMs, but I'd prefer not to. Any suggestions for realizing the grouped GEMM, preferably for compute capability >= 70? Thanks.
-
By "device-side," do you mean that you'd like to perform the grouped-GEMM from within an existing kernel, rather than launching two separate kernels? |
Yes, that example ultimately uses this collective: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp, which you can use directly in your code on the device side.
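For reference, here is a minimal sketch of how that collective is typically composed into a grouped-GEMM kernel, assuming the example in question follows the pattern of `examples/57_hopper_grouped_gemm`. The element types, tile/cluster shapes, alignments, and schedule tags below are illustrative assumptions and may need adjusting to your CUTLASS version:

```cpp
// Sketch only: SM90 grouped-GEMM type composition via the CollectiveBuilder,
// following the pattern of examples/57_hopper_grouped_gemm. Schedule tag names
// may differ slightly between CUTLASS 3.x releases.
#include "cutlass/cutlass.h"
#include "cutlass/gemm/group_array_problem_shape.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

// One (M, N, K) problem shape per group.
using ProblemShape = cutlass::gemm::GroupProblemShape<cute::Shape<int, int, int>>;

using ElementA = cutlass::half_t;  using LayoutA = cutlass::layout::RowMajor;
using ElementB = cutlass::half_t;  using LayoutB = cutlass::layout::ColumnMajor;
using ElementC = float;            using LayoutC = cutlass::layout::ColumnMajor;
using ElementAccumulator = float;
constexpr int AlignA = 8, AlignB = 8, AlignC = 4;   // in elements

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;

// Epilogue first, so its shared-storage size can be carved out of the mainloop stages.
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementAccumulator,
    ElementC, LayoutC *, AlignC,
    ElementC, LayoutC *, AlignC,
    cutlass::epilogue::PtrArrayNoSmemWarpSpecialized   // assumed ptr-array epilogue schedule
  >::CollectiveOp;

// This builder ultimately instantiates the
// sm90_mma_array_tma_gmma_ss_warpspecialized collective linked above.
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA *, AlignA,                       // pointer-to-layout => per-group strides
    ElementB, LayoutB *, AlignB,
    ElementAccumulator,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperative
  >::CollectiveOp;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    ProblemShape, CollectiveMainloop, CollectiveEpilogue>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```

Using the collective from within your own kernel would amount to reusing the `CollectiveMainloop` type above and driving it yourself (TMA descriptors, pipeline setup, and tile scheduling), along the lines of what the kernel layer under `cutlass/gemm/kernel` does.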