How to modify threadblock shape when A/B is transposed in GEMM? #1281
-
I wrote a custom kernel for grouped GEMM using CUTLASS. Should I transpose the threadblock, warp, and instruction shapes here accordingly when A/B is transposed? Also, how should I tune these shapes for optimal performance?
Replies: 1 comment 2 replies
-
Do you mean when there is an "internal transposition" due to the output being column major? If so, you should not need to change those template parameters. To tune for performance, I'd suggest trying many valid combinations of threadblock shape, warp shape, and stage count. You may find it useful to follow some of the combinations listed by the CUTLASS kernel generator. For example, these lines show a subset of valid combinations of threadblock shape, stage count, and warp count for an SM80 FP16 GEMM.
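To make the "try many valid combinations" idea concrete, here is a rough sketch of how a candidate list could be filtered before benchmarking. It is not the CUTLASS generator and does not call any CUTLASS API; the helper names (`smem_bytes`, `is_valid`, `enumerate_configs`) are hypothetical, and the limits (163 KB of shared memory per threadblock, an m16n8k16 FP16 instruction shape, no sliced-K) are assumptions for an SM80-class GPU that you should check against your own device and kernel.

```python
from itertools import product

# Assumed limits for an SM80 (A100-class) GPU; verify for your device.
MAX_SMEM_BYTES = 163 * 1024  # assumed max shared memory per threadblock
MAX_THREADS = 1024           # max threads per threadblock
WARP_SIZE = 32
ELEM_BYTES = 2               # FP16 operands A and B
INST_SHAPE = (16, 8, 16)     # assumed Tensor Core instruction shape (M, N, K)


def smem_bytes(tb, stages):
    """Rough shared-memory estimate: one A tile and one B tile per stage."""
    m, n, k = tb
    return stages * (m * k + n * k) * ELEM_BYTES


def is_valid(tb, warp, stages):
    """Basic sanity checks on (threadblock shape, warp shape, stage count)."""
    # Threadblock tile must be a whole number of warp tiles in M and N.
    if tb[0] % warp[0] or tb[1] % warp[1]:
        return False
    # Simplifying assumption: no sliced-K, so both tiles share the same K.
    if tb[2] != warp[2]:
        return False
    # Warp tile must be a whole number of instruction tiles in M, N, and K.
    if any(w % i for w, i in zip(warp, INST_SHAPE)):
        return False
    # Thread count follows from the number of warp tiles per threadblock.
    warps = (tb[0] // warp[0]) * (tb[1] // warp[1])
    if warps * WARP_SIZE > MAX_THREADS:
        return False
    # The multistage pipeline must fit in shared memory.
    return smem_bytes(tb, stages) <= MAX_SMEM_BYTES


def enumerate_configs(tb_shapes, warp_shapes, stage_counts):
    """Return every combination that passes is_valid, ready to benchmark."""
    return [
        (tb, warp, stages)
        for tb, warp, stages in product(tb_shapes, warp_shapes, stage_counts)
        if is_valid(tb, warp, stages)
    ]


if __name__ == "__main__":
    configs = enumerate_configs(
        tb_shapes=[(128, 128, 32), (256, 128, 64)],
        warp_shapes=[(64, 64, 32), (64, 64, 64)],
        stage_counts=[3, 4],
    )
    for tb, warp, stages in configs:
        print(tb, warp, stages, smem_bytes(tb, stages))
```

Each surviving tuple would then be instantiated as a kernel and timed on the problem sizes you care about. The filter only rules out obviously invalid points; it says nothing about which valid point is fastest.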
Thanks for clarifying. A column-major input is unlikely to alter the best-performing combination of these parameters.