- Adaptation to Various Lengths of Group M: We have enhanced the repository to support groups with different M lengths, providing more flexibility for diverse use cases.
- Support for Group K GEMM: We have also added support for Group K GEMM operations, expanding the functionality of the original DeepGEMM. NOTE: the K dimension of each group must be padded to a multiple of 128 (see the padding sketch below).
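As a minimal illustration of the padding requirement above (this is not part of the library API; the tensor shapes and helper name are assumptions for the sketch), the per-group K dimension can be rounded up to the next multiple of 128 before calling the grouped kernel:

```python
import torch

def pad_k_to_128(x: torch.Tensor, alignment: int = 128) -> torch.Tensor:
    """Zero-pad the last (K) dimension of `x` to the next multiple of `alignment`."""
    k = x.shape[-1]
    padded_k = (k + alignment - 1) // alignment * alignment
    if padded_k == k:
        return x
    # F.pad takes (left, right) padding amounts for the last dimension
    return torch.nn.functional.pad(x, (0, padded_k - k))

# Three groups with K = 200, 384 and 513 are padded to 256, 384 and 640.
# Zero-padding K is numerically safe as long as both operands of a group are
# padded consistently: the extra columns contribute 0 to each dot product.
groups = [torch.randn(64, k) for k in (200, 384, 513)]
padded = [pad_k_to_128(g) for g in groups]
print([g.shape[-1] for g in padded])  # [256, 384, 640]
```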
- Hopper architecture GPUs, `sm_90a` must be supported
- Python 3.8 or above
- CUDA 12.3 or above
  - But we highly recommend 12.8 or above for the best performance
- PyTorch 2.1 or above
- CUTLASS 3.6 or above (can be cloned as a Git submodule)
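A quick way to verify the Python-side requirements is the sketch below; it only checks the interpreter, PyTorch, and CUDA versions, not the GPU architecture or the CUTLASS checkout.

```python
import sys
import torch

assert sys.version_info >= (3, 8), "Python 3.8 or above is required"

torch_major, torch_minor = (int(v) for v in torch.__version__.split(".")[:2])
assert (torch_major, torch_minor) >= (2, 1), "PyTorch 2.1 or above is required"

cuda = torch.version.cuda  # e.g. "12.4"; None for CPU-only builds
assert cuda is not None and tuple(int(v) for v in cuda.split(".")[:2]) >= (12, 3), \
    "CUDA 12.3 or above is required (12.8+ recommended)"

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report a Hopper GPU (e.g. H100/H800)
```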
```bash
# Submodule must be cloned
git clone --recursive [email protected]:InternLM/AdaptiveGemm.git

# Make symbolic links for third-party (CUTLASS and CuTe) include directories
python setup.py develop

# Test JIT compilation
python tests/test_jit.py

# Test all GEMM implementations (normal, contiguous-grouped and masked-grouped)
python tests/test_varlen_groupm.py
```
```bash
python setup.py install
```
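After installation, a quick import check confirms the package is visible. Note that the module name `deep_gemm` here is an assumption carried over from the upstream DeepGEMM project; the scripts under `tests/` show the imports and kernel entry points actually used by this repository.

```python
# Optional post-install sanity check.
# NOTE: `deep_gemm` is assumed from upstream DeepGEMM; see tests/test_varlen_groupm.py
# for the exact module name and API.
import deep_gemm
print(deep_gemm.__file__)
```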
If you find our work useful, please cite:
```bibtex
@misc{su2025tmaadaptivefp8groupedgemm,
      title={TMA-Adaptive FP8 Grouped GEMM: Eliminating Padding Requirements in Low-Precision Training and Inference on Hopper},
      author={Zhongling Su and Rong Fu and Weihan Cao and Jianfei Gao and Minxi Jin and Zhilin Pei and Hui Wang},
      year={2025},
      eprint={2508.16584},
      archivePrefix={arXiv},
      primaryClass={cs.AR},
      url={https://arxiv.org/abs/2508.16584},
}
```
For detailed information about the original DeepGEMM, please refer to https://github.com/deepseek-ai/DeepGEMM