Support GPTQ/Marlin format quantization (4bit weight, f16 input) #89
GPTQ/Marlin quantization

Candle-vllm now supports GPTQ with the Marlin kernel. You may supply the `quant` (marlin) and `dtype` (f16) parameters if you have Marlin-format quantized models.

Tested speed: 115 tokens/s (batch size = 1) and 753 tokens/s (batch size = 16) for LLaMa3.1 8B, which is almost double the single-query performance of the bf16 format.
You may use AutoGPTQ to convert a model to Marlin format: load the (quantized) model with `use_marlin=True` in AutoGPTQ, which will generate a Marlin-format quantized model once you call `save_pretrained` (see the sketch below).
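As a rough illustration of that workflow, here is a minimal sketch using AutoGPTQ; the model paths are placeholders, and the exact `from_quantized` arguments may differ between AutoGPTQ versions.

```python
# Minimal sketch: repack an existing 4-bit GPTQ checkpoint into Marlin format
# with AutoGPTQ. Paths are placeholders; argument names may vary by version.
from auto_gptq import AutoGPTQForCausalLM

gptq_dir = "Meta-Llama-3.1-8B-Instruct-GPTQ-INT4"  # existing GPTQ model (placeholder)
marlin_dir = "Meta-Llama-3.1-8B-Instruct-Marlin"   # output folder (placeholder)

# Loading with use_marlin=True repacks the 4-bit GPTQ weights into Marlin format.
model = AutoGPTQForCausalLM.from_quantized(gptq_dir, use_marlin=True, device="cuda:0")

# Saving the reloaded model writes the Marlin-format checkpoint.
model.save_pretrained(marlin_dir)
```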
Note: only 4-bit GPTQ quantization is supported for the Marlin format at the moment, and the input data type should be `f16` (`--dtype f16`). You also need to rename the converted Marlin-format weight file to "model.safetensors" and copy "tokenizer.json" from the source model folder (a small folder-preparation sketch is included at the end of this note).

Further plan: in-situ conversion of any quantized model to Marlin format to speed up inference.
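For the renaming and copying step, something along these lines could be used; it assumes the converted folder contains a single `.safetensors` weight file, and all paths are placeholders.

```python
# Minimal sketch: prepare the converted Marlin folder for candle-vllm.
# Assumes a single .safetensors file in the output folder; paths are placeholders.
import shutil
from pathlib import Path

src_dir = Path("Meta-Llama-3.1-8B-Instruct-GPTQ-INT4")  # source model folder (placeholder)
marlin_dir = Path("Meta-Llama-3.1-8B-Instruct-Marlin")  # converted folder (placeholder)

# Rename the converted weight file to the name expected at load time.
weights = list(marlin_dir.glob("*.safetensors"))
assert len(weights) == 1, "expected exactly one .safetensors file"
weights[0].rename(marlin_dir / "model.safetensors")

# Copy the tokenizer from the source model folder.
shutil.copy(src_dir / "tokenizer.json", marlin_dir / "tokenizer.json")
```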