
[QST] How to implement a fused mixed precision matrix multiplication such as w4a4 + w16a16? #2058

Open
hyx1999 opened this issue Jan 24, 2025 · 2 comments

Comments


hyx1999 commented Jan 24, 2025

Dear Team,

I would like to implement a fused mixed-precision matrix multiplication such as w4a4 + w16a16, where the w16a16 part is small. One use case for this kernel is accelerating an LLM with LoRA applied.

I found some examples in torchao that implement w4a4/w4a8 matrix multiplication and fuse the dequantization into the epilogue, but I don't know how to further integrate the w16a16 matrix multiplication on top of that. Are there any examples I can refer to?
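
For reference, a minimal PyTorch sketch of the intended computation might look like the following (a sketch only: the int4 path is simulated with int8 storage and a single per-tensor scale, and the function names are illustrative rather than taken from torchao or CUTLASS):

```python
import torch

def quantize_int4(t: torch.Tensor):
    # Symmetric per-tensor fake quantization to the int4 range [-8, 7].
    # PyTorch has no native int4 matmul, so int8 storage stands in for it.
    scale = t.abs().max().clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(t / scale), -8, 7).to(torch.int8)
    return q, scale

def w4a4_plus_w16a16(x1, w1, x2, w2):
    # w4a4 path: quantize activation and weight, multiply, then
    # dequantize the accumulator with the product of the two scales.
    qx, sx = quantize_int4(x1)
    qw, sw = quantize_int4(w1)
    y1 = (qx.float() @ qw.float().t()) * (sx * sw)
    # w16a16 path: the small high-precision GEMM (e.g. a LoRA branch).
    y2 = x2 @ w2.t()
    # A fused kernel would add the two partial products in one epilogue.
    return y1 + y2
```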

hwu36 (Collaborator) commented Feb 6, 2025

Could you please elaborate on the input and output of every step? Do you want to fuse two GEMMs into one kernel, similar to what our example 13 does?

hyx1999 (Author) commented Feb 6, 2025

Thank you very much for your reply!
The input consists of two activations $X_1 \in \mathbb{R}^{L \times D_1}$ and $X_2 \in \mathbb{R}^{L \times D_2}$ and two weight matrices $W_1 \in \mathbb{R}^{D \times D_1}$ and $W_2 \in \mathbb{R}^{D \times D_2}$, where $L = 2048$, $D_1 = 4096$, $D_2 = 64$, and $D = 4096$. The output is $Y = X_1 W_1^\top + X_2 W_2^\top$. Meanwhile, $X_1$ and $W_1$ will be quantized to 4 bits.

I think the difference from example 13 is that I need to add the results of the two GEMMs, rather than computing the second GEMM from the output of the first.
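
As a plain-PyTorch restatement of that difference (the matmuls stand in for the actual kernels; shapes taken from the comment above):

```python
import torch

L, D1, D2, D = 2048, 4096, 64, 4096
x1, x2 = torch.randn(L, D1), torch.randn(L, D2)
w1, w2 = torch.randn(D, D1), torch.randn(D, D2)

# Back-to-back fusion (example 13 style): the second GEMM consumes the
# first GEMM's output, so the two problems form a dependent chain:
#   y = gemm(gemm(x, w0), w1)

# The pattern in this issue: two independent GEMMs whose partial
# results are summed, so the add can live in one shared epilogue.
y = x1 @ w1.t() + x2 @ w2.t()  # shape [L, D]
```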
