
[QST] Grouped GEMM on A10 GPUs #2053

Closed
MinghaoYan opened this issue Jan 22, 2025 · 2 comments
Comments

@MinghaoYan

What is your question?
I am trying to run the CUTLASS grouped GEMM kernel on A10 GPUs. The code runs smoothly on A100 GPUs; the following is the setup I am using.

using GroupedGemmKernel = typename cutlass::gemm::kernel::DefaultGemmGrouped<
  // A operand.
  ::cutlass::bfloat16_t,
  GroupedGemmInputLayout<false>,
  ::cutlass::ComplexTransform::kNone,
  4, //GroupedGemmConfig::kAlignmentA,
  // B operand.
  ::cutlass::bfloat16_t,
  GroupedGemmInputLayout<false>,
  ::cutlass::ComplexTransform::kNone,
  4, //GroupedGemmConfig::kAlignmentB,
  // C operand.
  ::cutlass::bfloat16_t,
  ::cutlass::layout::RowMajor,
  float,
  ::cutlass::arch::OpClassTensorOp,
  ::cutlass::arch::Sm80,
  GroupedGemmConfig::ThreadblockShape,
  GroupedGemmConfig::WarpShape,
  GroupedGemmConfig::InstructionShape,
  GroupedGemmConfig::EpilogueOutputOp,
  ::cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle,
  GroupedGemmConfig::kStages>::GemmKernel;

However, I'm running into issues deploying it on A10 GPUs. If I change the ArchTag to Sm86, I run into a variety of compilation errors about incomplete types, as described in #609.

After further digging, I found a suggestion in #1181 that compiling for SM80 should be sufficient. My code does compile with the ArchTag set to Sm80; however, every matrix multiplication setup fails at runtime, because my threadblock-count check

int threadblock_count = Gemm::sufficient(problem_sizes_host.data(), num_experts);

always returns 0.

Are there any suggestions or examples on how to properly use the GroupedGemm kernel on A10 GPUs?

@jackkosaian
Contributor

It's likely that your kernel requires more shared memory than is available on the A10 (the A100 has considerably more shared memory per SM than the A10). Try reducing the ThreadblockShape.

@MinghaoYan
Author

Thank you, I tuned it a bit and got it to work.
