What is your question?
I am trying to run the CUTLASS grouped GEMM kernel on A10 GPUs. The code runs smoothly on A100 GPUs; the following is the setup I am using.
using GroupedGemmKernel = typename cutlass::gemm::kernel::DefaultGemmGrouped<
    // A operand.
    ::cutlass::bfloat16_t,
    GroupedGemmInputLayout<false>,
    ::cutlass::ComplexTransform::kNone,
    4, // GroupedGemmConfig::kAlignmentA
    // B operand.
    ::cutlass::bfloat16_t,
    GroupedGemmInputLayout<false>,
    ::cutlass::ComplexTransform::kNone,
    4, // GroupedGemmConfig::kAlignmentB
    // C operand.
    ::cutlass::bfloat16_t,
    ::cutlass::layout::RowMajor,
    // Accumulator type.
    float,
    // Operator class and target architecture.
    ::cutlass::arch::OpClassTensorOp,
    ::cutlass::arch::Sm80,
    // Tile shapes, epilogue, swizzle, and pipeline stages.
    GroupedGemmConfig::ThreadblockShape,
    GroupedGemmConfig::WarpShape,
    GroupedGemmConfig::InstructionShape,
    GroupedGemmConfig::EpilogueOutputOp,
    ::cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle,
    GroupedGemmConfig::kStages>::GemmKernel;
However, I'm running into issues deploying it on A10 GPUs. If I change the ArchTag to Sm86, I run into a variety of compilation errors about incomplete types, as described in #609.
After further digging, I found that #1181 suggests compiling for SM80 should be sufficient. My code does compile with the ArchTag set to Sm80, but every matrix multiplication setup then fails, because my threadblock-count check
int threadblock_count = Gemm::sufficient(problem_sizes_host.data(), num_experts);
always returns 0.
Are there any suggestions or examples of how to properly use the grouped GEMM kernel on A10 GPUs?
It's likely that your kernel requires more shared memory than is available on the A10 (the A100 has considerably more shared memory per SM than the A10). Try reducing the ThreadblockShape.
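For reference, SM86 parts such as the A10 allow at most about 99 KB of shared memory per threadblock, versus about 163 KB on the A100 (SM80), so a tile configuration tuned for the A100 can exceed the A10's budget and drive the occupancy calculation inside Gemm::sufficient to zero. Below is a minimal sketch of a smaller configuration, assuming the GroupedGemmConfig members mirror the snippet above; the exact shapes are illustrative, not prescriptive:

// Hypothetical replacement values for the GroupedGemmConfig members above.
// Per-stage smem for the bf16 A and B tiles: (128*32 + 32*128) * 2 B = 16 KB;
// with 3 pipeline stages, ~48 KB total, comfortably under SM86's ~99 KB limit.
using ThreadblockShape = ::cutlass::gemm::GemmShape<128, 128, 32>;
using WarpShape        = ::cutlass::gemm::GemmShape<64, 64, 32>;
using InstructionShape = ::cutlass::gemm::GemmShape<16, 8, 16>; // Ampere bf16 tensor-op MMA
static int const kStages = 3; // fewer stages also reduce shared memory usage

By contrast, a 128x256x64 threadblock tile with 3 stages needs roughly 3 * (128*64 + 64*256) * 2 B, about 144 KB, which fits on the A100 but not the A10. You can confirm your device's per-block limit at runtime with cudaDeviceGetAttribute and cudaDevAttrMaxSharedMemoryPerBlockOptin.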