
[QST] How to implement a fused mixed precision matrix multiplication such as w4a4 + w16a16? #2058

Open
hyx1999 opened this issue Jan 24, 2025 · 2 comments

Comments


hyx1999 commented Jan 24, 2025

Dear Team,

I would like to implement a fused mixed-precision matrix multiplication such as w4a4 + w16a16, where the w16a16 part is small. One use case for this kernel is accelerating an LLM with LoRA applied.

I found some examples in torchao that implement w4a4/w4a8 matrix multiplication and fuse the dequantization into the epilogue, but I don't know how to further integrate the w16a16 matrix multiplication on top of that. Are there any examples I can refer to?
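
For reference, a minimal PyTorch sketch of the intended computation might look like the following (a sketch only: the int4 path is simulated with int8 storage and a single per-tensor scale, and the function names are illustrative rather than taken from torchao or CUTLASS):

```python
import torch

def quantize_int4(t: torch.Tensor):
    # Symmetric per-tensor fake quantization to the int4 range [-8, 7].
    # PyTorch has no native int4 matmul, so int8 storage stands in for it.
    scale = t.abs().max().clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(t / scale), -8, 7).to(torch.int8)
    return q, scale

def w4a4_plus_w16a16(x1, w1, x2, w2):
    # w4a4 path: quantize activation and weight, multiply, then
    # dequantize the accumulator with the product of the two scales.
    qx, sx = quantize_int4(x1)
    qw, sw = quantize_int4(w1)
    y1 = (qx.float() @ qw.float().t()) * (sx * sw)
    # w16a16 path: the small high-precision GEMM (e.g. a LoRA branch).
    y2 = x2 @ w2.t()
    # A fused kernel would add the two partial products in one epilogue.
    return y1 + y2
```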

hwu36 (Collaborator) commented Feb 6, 2025

Could you please elaborate on the input and output of every step? Do you want to fuse two GEMMs into one kernel, similar to what our example 13 does?

hyx1999 (Author) commented Feb 6, 2025

Thank you very much for your reply!
The input consists of two activations $X_1 \in \mathbb{R}^{L \times D_1}$ and $X_2 \in \mathbb{R}^{L \times D_2}$ and two weight matrices $W_1 \in \mathbb{R}^{D \times D_1}$ and $W_2 \in \mathbb{R}^{D \times D_2}$, where $L = 2048$, $D_1 = 4096$, $D_2 = 64$, and $D = 4096$. The output is $Y = X_1 W_1^\top + X_2 W_2^\top$. Meanwhile, $X_1$ and $W_1$ will be quantized to 4 bits.

I think the difference from example 13 is that I need to add the results of the two GEMMs, rather than computing the second GEMM from the output of the first.
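
As a plain-PyTorch restatement of that difference (the matmuls stand in for the actual kernels; shapes taken from the comment above):

```python
import torch

L, D1, D2, D = 2048, 4096, 64, 4096
x1, x2 = torch.randn(L, D1), torch.randn(L, D2)
w1, w2 = torch.randn(D, D1), torch.randn(D, D2)

# Back-to-back fusion (example 13 style): the second GEMM consumes the
# first GEMM's output, so the two problems form a dependent chain:
#   y = gemm(gemm(x, w0), w1)

# The pattern in this issue: two independent GEMMs whose partial
# results are summed, so the add can live in one shared epilogue.
y = x1 @ w1.t() + x2 @ w2.t()  # shape [L, D]
```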
