[QST] why the implementation of f16xs8 mixed gemm is different between TRT-LLM and native cutlass mixed gemm example? #2022
Comments
A general answer to your question would be: different approaches are possible, with different trade-offs. I'm not familiar with the code base you pointed to, but I would assume that it was written before CUTLASS added support for (some) mixed data-type GEMMs. Nowadays, the CUTLASS way to apply quantization scale factors would be through EVT (see here for an EVT example).

As far as performance/accuracy is concerned, it depends on the context. For example, mixed data-type GEMM on Ampere-generation GPUs requires re-arranging the elements of the tensor with the smaller data type. CUTLASS does this during each GEMM operation. But when a mixed data-type GEMM is used in the context of, say, LLM inference with a model whose weights are quantized, there are implementations (like Marlin) that expect users to do this re-arrangement up front, along with the weight quantization. When such a weight tensor is repeatedly used as a mixed data-type GEMM operand during inference, this gives a slight performance advantage over CUTLASS.

As far as your second question is concerned, I would assume that you need to also check here.
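The two places where the dequantization scale can be applied (folded into the weights up front, or applied in the epilogue, EVT-style) can be sketched in NumPy. The shapes and the per-channel scale below are illustrative assumptions, not taken from either code base:

```python
import numpy as np

# Hypothetical shapes: A holds fp16 activations, W_int8 holds quantized
# weights with a per-output-channel dequantization scale.
M, K, N = 4, 8, 3
rng = np.random.default_rng(0)

A = rng.standard_normal((M, K)).astype(np.float16)
W_int8 = rng.integers(-128, 127, size=(K, N), dtype=np.int8)
scale = rng.uniform(0.01, 0.1, size=(1, N)).astype(np.float32)  # per-channel

# Variant 1: dequantize the weights first, then run a plain fp GEMM.
C_pre = A.astype(np.float32) @ (W_int8.astype(np.float32) * scale)

# Variant 2: run the GEMM on the raw int8 weights (upcast in the main loop)
# and apply the scale afterwards, i.e. in the epilogue, EVT-style.
acc = A.astype(np.float32) @ W_int8.astype(np.float32)
C_evt = acc * scale

# Both orderings compute the same result up to rounding.
assert np.allclose(C_pre, C_evt, rtol=1e-4, atol=1e-3)
```

The epilogue variant avoids materializing a dequantized copy of the weights, which is why it is the natural fit for a fused mixed data-type GEMM.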
@alexsamardzic Thanks for your helpful response. I want to confirm that is
The fundamental way to think about this, irrespective of Ampere or Hopper, is that Tensor Cores need the thread-data arrangement in a specific layout depending on the input data types. There are various ways to achieve it:
Thanks for your detailed information @manishucsd, which is very useful for me. One question is still left: Marlin seems to implement the mixed GEMM by preprocessing the weights ahead of time, which is the first way you mentioned above. But when I was checking Marlin's code, it doesn't use LDSM for the 4-bit operand B, but LDS, stating that LDSM doesn't support 4-bit. It only uses LDSM for the higher-bit operand A. Is this a potential optimization point for Marlin?
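The ahead-of-time re-arrangement discussed above can be sketched in NumPy. The interleave pattern below is a toy stand-in: the real layout required by `ldmatrix`/Tensor Core fragments is architecture-specific, and the function names are illustrative, not Marlin's:

```python
import numpy as np

def preprocess_weights(w_int8, tile=4):
    """Permute rows once, offline, at quantization time (toy pattern)."""
    k, _ = w_int8.shape
    perm = np.arange(k).reshape(-1, tile).T.ravel()  # illustrative interleave
    return w_int8[perm], perm

def gemm_with_preprocessed(a, w_shuffled, perm, scale):
    """Stand-in for the kernel consuming the shuffled layout.

    Here we undo the permutation explicitly to check correctness; a real
    kernel instead issues loads whose pattern matches the shuffle, so no
    per-GEMM re-arrangement work is needed.
    """
    inv = np.argsort(perm)
    return (a.astype(np.float32) @ w_shuffled[inv].astype(np.float32)) * scale

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 8)).astype(np.float16)
W = rng.integers(-8, 8, size=(8, 3), dtype=np.int8)
scale = np.float32(0.05)

W_shuf, perm = preprocess_weights(W)  # cost paid once, not per GEMM
ref = (A.astype(np.float32) @ W.astype(np.float32)) * scale
out = gemm_with_preprocessed(A, W_shuf, perm, scale)
assert np.allclose(out, ref)
```

The point of the sketch is only the cost model: the shuffle runs once alongside quantization, while CUTLASS's in-kernel approach repeats the re-arrangement on every GEMM call.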
What is your question?
Dear cutlass team,
Let's consider sm80 and f16xs8. The example of f16xs8 TN mixed GEMM shown here is different from the TRT-LLM implementation; specifically, to my knowledge, the TRT-LLM one adds the dequantization scale, but the CUTLASS one does not. So my questions are:
Thanks for your time!
cc @manishucsd @alexsamardzic @hwu36