
Support GPTQ/Marlin format quantization (4bit weight, f16 input) #89

Merged (20 commits) on Oct 14, 2024

Conversation

guoqingbao (Collaborator)

GPTQ/Marlin quantization

Candle-vllm now supports GPTQ (Marlin kernel). You may supply the quant (marlin) and dtype (f16) parameters if you have Marlin-format quantized models, for example:

cargo run --release -- --port 2000 --dtype f16 --weight-path /home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4-Marlin/ llama3 --quant marlin
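Once the server is up, it can be queried like any OpenAI-compatible endpoint. A minimal usage sketch (not part of this PR), assuming candle-vllm's OpenAI-compatible chat completions API is listening on port 2000; the model name below is a placeholder:

from openai import OpenAI

# Point the standard OpenAI client at the local candle-vllm server (assumed endpoint).
client = OpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")  # no real key needed locally

resp = client.chat.completions.create(
    model="llama3",  # placeholder; use whatever model id the server reports
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)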

Tested speed: 115 tokens/s (batch size = 1) and 753 tokens/s (batch size = 16) for LLaMA 3.1 8B, almost double the single-query performance of the bf16 format.

You may use AutoGPTQ to convert a model to Marlin format: load the (already GPTQ-quantized) model with use_marlin=True, and AutoGPTQ will write the Marlin-format quantized model once you call save_pretrained.
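A minimal conversion sketch, assuming an AutoGPTQ version with Marlin support; the source and output paths are placeholders:

from auto_gptq import AutoGPTQForCausalLM

src = "/home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"         # existing 4-bit GPTQ checkpoint (placeholder path)
dst = "/home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4-Marlin"  # output folder for the Marlin-format model

# Loading with use_marlin=True repacks the GPTQ weights into the Marlin layout.
model = AutoGPTQForCausalLM.from_quantized(src, use_marlin=True, device="cuda:0")

# Saving then writes the Marlin-format quantized model to dst.
model.save_pretrained(dst)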

Note: only 4-bit GPTQ quantization is supported for the Marlin format at the moment, and the input data type must be f16 (--dtype f16). You also need to rename the converted Marlin-format weight file to "model.safetensors" and copy "tokenizer.json" from the source model folder.
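A sketch of that renaming/copying step, assuming the converted folder contains a single .safetensors file; the paths match the placeholders above:

import shutil
from pathlib import Path

src = Path("/home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4")         # source model folder (placeholder)
dst = Path("/home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4-Marlin")  # converted Marlin folder (placeholder)

# Rename the converted weight file to the name candle-vllm expects.
weight = next(dst.glob("*.safetensors"))
weight.rename(dst / "model.safetensors")

# Copy the tokenizer from the source model folder.
shutil.copy(src / "tokenizer.json", dst / "tokenizer.json")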

Further plan: in-situ conversion of quantized models to Marlin format to speed up inference.

guoqingbao merged commit 1b4b0d4 into master on Oct 14, 2024
4 of 6 checks passed
EricLBuehler (Owner)

@guoqingbao nice work! 🤗
