LlamaQuantizer is a lightweight tool built on top of llama.cpp and stable-diffusion.cpp that simplifies serial quantization of llama.cpp-supported models. With just a few steps, you can create a quantized model set like OLMo-1B-0724-hf-Quantized.
LlamaQuantizer relies on the llama.cpp library to perform serial, automated quantization of LLaMA models. The tool takes care of the following steps:
- Converting HF-formatted models to GGUF;
- Calculating an importance matrix if necessary;
- Quantizing the model weights and activations at multiple precision levels;
- Saving the quantized model set to the output directory.
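The serial pipeline above can be sketched as a driver around llama.cpp's command-line tools. Note this is a minimal illustration, not LlamaQuantizer's actual implementation: `convert_hf_to_gguf.py` and `llama-quantize` are real llama.cpp tools, but the `build_commands` helper and the list of quantization types are assumptions for the sketch.

```python
# Hypothetical sketch of a serial quantization pipeline driving llama.cpp.
# convert_hf_to_gguf.py and llama-quantize are llama.cpp's real tools;
# this driver function itself is illustrative, not quantizer.py's code.
from pathlib import Path

# A subset of quantization types accepted by llama-quantize (assumed list).
QUANT_TYPES = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]

def build_commands(llamacpp_path: str, model_dir: str, model_name: str, out_dir: str):
    """Return the list of shell commands a serial run would execute."""
    llamacpp = Path(llamacpp_path)
    f16_gguf = Path(out_dir) / f"{model_name}-F16.gguf"
    cmds = [
        # Step 1: convert the HF checkpoint to a full-precision GGUF file.
        ["python", str(llamacpp / "convert_hf_to_gguf.py"), model_dir,
         "--outfile", str(f16_gguf), "--outtype", "f16"],
    ]
    # Step 2: quantize the F16 GGUF to each target precision in turn.
    for qtype in QUANT_TYPES:
        out_file = Path(out_dir) / f"{model_name}-{qtype}.gguf"
        cmds.append([str(llamacpp / "llama-quantize"),
                     str(f16_gguf), str(out_file), qtype])
    return cmds

cmds = build_commands("../llama.cpp", "../OLMo-1B-0724-hf", "OLMo-1B-0724-hf", "./out")
```

Each command list could then be passed to `subprocess.run`, saving every quantized file into the output directory.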
To create a quantized version of the OLMo-1B-0724-hf model, follow these steps:
- Clone this repo with
  ```shell
  git clone https://github.com/aifoundry-org/LlamaQuantizer.git
  ```
- Make sure that the required modules are present in your environment:
  ```shell
  pip install -r ./requirements.txt
  ```
- Download and build the llama.cpp repo according to this guide: Build llama.cpp locally
- Download a Hugging Face model repo like this one: OLMo-1B-0724-hf
- Run `quantizer.py` with the correct paths to your cloned HF model repo and the built llama.cpp repo:
  ```shell
  ./quantizer.py --engine llamacpp --llamacpp_path ../llama.cpp/ --model_name OLMo-7B-SFT-hf-0724 --model_dir ../OLMo-7B-0724-SFT-hf/
  ```
To make 1-bit and some of the 2-bit quantizations, you need to calculate and use an importance matrix, and LlamaQuantizer can handle it! Here is an example with an importance matrix:
```shell
./quantizer.py --engine llamacpp --llamacpp_path ../llama.cpp/ --model_name OLMo-7B-SFT-hf-0724 --model_dir ../OLMo-7B-0724-SFT-hf/ --imatrix_text_name wikitext --imatrix_data_path ../wikitext/wikitext-2-raw-v1/train-00000-of-00001.txt
```
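For reference, llama.cpp computes the importance matrix with its `llama-imatrix` tool and feeds the result to `llama-quantize` via `--imatrix`. The sketch below shows how those two extra calls might be assembled; the helper functions, paths, and the `IQ2_XS` default are illustrative assumptions, not LlamaQuantizer's actual code.

```python
# Hypothetical sketch of the extra imatrix step used for low-bit quant types.
# llama-imatrix and the --imatrix flag of llama-quantize are real llama.cpp
# features; the helper functions and file names here are illustrative.
from pathlib import Path

def imatrix_command(llamacpp_path: str, f16_gguf: str, calib_text: str, imatrix_out: str):
    """Build the llama-imatrix call that profiles weights on calibration text."""
    return [str(Path(llamacpp_path) / "llama-imatrix"),
            "-m", f16_gguf,      # full-precision GGUF to profile
            "-f", calib_text,    # calibration text (e.g. wikitext)
            "-o", imatrix_out]   # resulting importance-matrix file

def quantize_with_imatrix(llamacpp_path: str, f16_gguf: str, out_gguf: str,
                          imatrix_file: str, qtype: str = "IQ2_XS"):
    """Build a llama-quantize call that consumes the importance matrix."""
    return [str(Path(llamacpp_path) / "llama-quantize"),
            "--imatrix", imatrix_file, f16_gguf, out_gguf, qtype]
```

Without the importance matrix, `llama-quantize` refuses or degrades badly on the lowest-bit types, which is why LlamaQuantizer computes it first when those types are requested.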
Quantized model sets created with this tool include:
- OLMo-7B-0724-hf-Quantized
- OLMo-7B-0724-Instruct-hf-Quantized
- OLMo-7B-0724-SFT-hf-Quantized
- OLMo-7B-0424-hf-Quantized
Contributions to LlamaQuantizer are welcome! If you'd like to report a bug, request a feature, or submit a pull request, please use this repo's issue tracker and pull request workflow.
LlamaQuantizer is released under the Apache 2.0 License.