
Integrate marlin fp16/bf16-int4/int8 matrix multiplication kernel #239

Closed
dacorvo opened this issue Jul 12, 2024 · 7 comments
Labels
enhancement New feature or request help wanted Extra attention is needed Stale

Comments

@dacorvo
Collaborator

dacorvo commented Jul 12, 2024

Since IST-DASLab introduced the mixed-precision fp16-int4 MARLIN (Mixed Auto-Regressive Linear) kernels, new mixed-precision MARLIN kernels have appeared for other data types.

In particular, mixed-precision fp16/bf16-int4/int8 kernels have been contributed to TGI, and could be integrated into optimum-quanto as well, with companion Int8MarlinQBytesTensor and Int4MarlinQBitsTensor classes to pack the weights.
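For illustration, here is a minimal sketch of the group-wise int4 quantization and nibble packing such a companion tensor would be responsible for. This is the generic packing scheme, not Marlin's actual tiled weight layout, and `pack_int4` is a hypothetical helper, not an optimum-quanto API:

```python
import torch

def pack_int4(weight: torch.Tensor, group_size: int = 128):
    """Group-wise symmetric int4 quantization, two nibbles packed per byte."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.float().reshape(out_features, in_features // group_size, group_size)
    # One symmetric scale per group so values fit in the int4 range [-8, 7].
    scale = w.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    q = q.reshape(out_features, in_features)
    # Shift to unsigned [0, 15], then pack adjacent nibble pairs into one uint8.
    u = (q + 8).to(torch.uint8)
    packed = u[:, ::2] | (u[:, 1::2] << 4)
    return packed, scale.squeeze(-1).half()

w = torch.randn(256, 512, dtype=torch.float16)
packed, scales = pack_int4(w)
print(packed.shape, scales.shape)  # (256, 256) packed bytes, (256, 4) fp16 scales
```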

@dacorvo dacorvo added the enhancement New feature or request label Jul 12, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Aug 12, 2024

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned Aug 18, 2024
@dacorvo dacorvo reopened this Aug 26, 2024
@dacorvo dacorvo added help wanted Extra attention is needed and removed Stale labels Aug 26, 2024
@dacorvo
Collaborator Author

dacorvo commented Sep 13, 2024

The kernel has been integrated into the quanto CUDA extension in https://github.com/huggingface/optimum-quanto/tree/add_marlin_int4_kernel (thanks to initial work by @shcho1118).
It now needs to be fully integrated at inference.

@shovan777
Contributor

@dacorvo what should be done to integrate this at inference?

@dacorvo
Collaborator Author

dacorvo commented Sep 17, 2024

What is missing is a MarlinWeightsQBitsTensor class in the same spirit as the AWQBitsTensor class, and a modification to QBitsTensor.create to select that class instead of AWQBitsTensor (because this kernel is more stable).
This is a bit involved, as you have to do things right to stay compatible with serialization, gradients and compilation (all things already done in AWQBitsTensor).
This also requires writing some tests.
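For context, a hedged sketch of the dispatch described above. The class bodies and the `create_qbits` signature are illustrative stand-ins, not optimum-quanto's actual API (the real QBitsTensor.create takes more arguments and must also handle serialization, gradients and compilation):

```python
import torch

class QBitsTensor:
    """Stand-in for optimum-quanto's QBitsTensor (illustrative only)."""
    def __init__(self, data, scale, shift):
        self.data, self.scale, self.shift = data, scale, shift

class MarlinWeightsQBitsTensor(QBitsTensor):
    """Stand-in for the proposed Marlin-backed subclass."""

def create_qbits(data, scale, shift):
    # Marlin kernels need an Ampere-class GPU (compute capability >= 8.0):
    # prefer the Marlin-backed tensor there, fall back to the generic one elsewhere.
    if data.device.type == "cuda" and torch.cuda.get_device_capability(data.device)[0] >= 8:
        return MarlinWeightsQBitsTensor(data, scale, shift)
    return QBitsTensor(data, scale, shift)
```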


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Oct 18, 2024
@dacorvo
Collaborator Author

dacorvo commented Oct 18, 2024

Done in #333

@dacorvo dacorvo closed this as completed Oct 18, 2024