Tutorial: How to convert HuggingFace model to GGUF format #2948

samos123 · 2023-09-01T02:02:05Z

samos123
Sep 1, 2023

Source: https://www.substratus.ai/blog/converting-hf-model-gguf-model/

I published this on our blog but though others here might benefit as well, so sharing the raw blog here on Github too. Hope it's helpful to folks here and feedback is welcome.

Downloading a HuggingFace model

There are various ways to download models, but in my experience the huggingface_hub
library has been the most reliable. The git clone method occasionally results in
OOM errors for large models.

Install the huggingface_hub library:

pip install huggingface_hub

Create a Python script named download.py with the following content:

from huggingface_hub import snapshot_download
model_id="lmsys/vicuna-13b-v1.5"
snapshot_download(repo_id=model_id, local_dir="vicuna-hf",
                  local_dir_use_symlinks=False, revision="main")

Run the Python script:

python download.py

You should now have the model downloaded to a directory called
vicuna-hf. Verify by running:

ls -lash vicuna-hf

Converting the model

Now it's time to convert the downloaded HuggingFace model to a GGUF model.
Llama.cpp comes with a converter script to do this.

Get the script by cloning the llama.cpp repo:

git clone https://github.com/ggerganov/llama.cpp.git

Install the required python libraries:

pip install -r llama.cpp/requirements.txt

Verify the script is there and understand the various options:

python llama.cpp/convert.py -h

Convert the HF model to GGUF model:

python llama.cpp/convert.py vicuna-hf \
  --outfile vicuna-13b-v1.5.gguf \
  --outtype q8_0

In this case we're also quantizing the model to 8 bit by setting
--outtype q8_0. Quantizing helps improve inference speed, but it can
negatively impact quality.
You can use --outtype f16 (16 bit) or --outtype f32 (32 bit) to preserve original
quality.

Verify the GGUF model was created:

ls -lash vicuna-13b-v1.5.gguf

Pushing the GGUF model to HuggingFace

You can optionally push back the GGUF model to HuggingFace.

Create a Python script with the filename upload.py that
has the following content:

from huggingface_hub import HfApi
api = HfApi()

model_id = "substratusai/vicuna-13b-v1.5-gguf"
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    path_or_fileobj="vicuna-13b-v1.5.gguf",
    path_in_repo="vicuna-13b-v1.5.gguf",
    repo_id=model_id,
)

Get a HuggingFace Token that has write permission from here:
https://huggingface.co/settings/tokens

Set your HuggingFace token:

export HUGGING_FACE_HUB_TOKEN=<paste-your-own-token>

Run the upload.py script:

python upload.py

KerfuffleV2 · 2023-09-01T02:20:36Z

KerfuffleV2
Sep 1, 2023
Collaborator

You might want to add a small note that requantizing to other formats from q8_0 will reduce quality a bit compared to quantizing from f16 or f32. It should be a pretty small difference, but could matter if the person is intending to widely distribute quantized models or aiming for the top of the HF leaderboard. In other words, someone like TheBloke wouldn't want to convert to q8_0 and then requantize to all the other formats to distribute them, they'd want to convert to f16 or f32 and quantize from that.

3 replies

samos123 Sep 1, 2023
Author

Thanks very helpful feedback! I'm just getting started and learning as I go here.

I added this as a note:
In this case we're also quantizing the model to 8 bit by setting
--outtype q8_0. Quantizing helps improve inference speed, but it can
negatively impact quality.
You can use --outtype f16 (16 bit) or --outtype f32 (32 bit) to preserve original
quality.

KerfuffleV2 Sep 1, 2023
Collaborator

No problem. The convert.py tool is mostly just for converting models in other formats (like HuggingFace) to one that other GGML tools can deal with. I was actually the who added the ability for that tool to output q8_0 — what I was thinking is that for someone who just wants to do stuff like test different quantizations, etc being able to keep a nearly original quality model around at 1/2 the size is pretty useful.

The actual examples/quantize tool is what should be used most of the time for quantizing, because it supports many formats. Generally speaking, for quality you're better off running a model with more parameters than an unquantized or less quantized model. In other words, if I can run a 16bit 7B model or a 4bit (like q4_0, q4_k) 33B model I'm going to want to use the 4bit 33B. It's also a lot faster when you can run a model on the GPU so quantizing it so it can fit can make a big difference.

For those reasons, I think most people use fairly heavily quantized models (q4_k_m feels like the sweet spot to me) and normally would only use 16bit or q8_0 when running really small models because if you have lots of extra resources why not run the model in full quality?

Quantizing often does improve inference speed but probably the main reason people use it is because you just need so much memory to run big models without it. A 70B 16bit model takes like 140GB RAM even if you're just running it on CPU, however I can run that same 70B model quantized to q4_k_m on my system with 64GB and still have memory left over to do other stuff. Disk space/downloading the models is also a consideration too. Even a q4_k_m 70B is about 40GB. Keeping a 16bit version around would be around 160GB and take much to download in the first place also.

Not telling you to write anything specific, hopefully this information is helpful.

Green-Sky Sep 2, 2023
Collaborator

[...] however I can run that same 70B model quantized to q4_k_m on my system with 64GB and still have memory left over to do other stuff.

You can even run 70B quantized to q3_k_s on 32GB ram + 8GB vram. (imho the most impressive feat of llama.cpp).
There is even enough ram left over to run the model with 8k context size, with a slight quality loss. (--rope-scale 2 -c 8192 -ngl 20)

szymonrucinski · 2023-09-01T14:11:39Z

szymonrucinski
Sep 1, 2023

I have a model trained using Qlora and I can only convert it to min. 8-bit quantization using GGUF. What about q4_K_S quantization why are they not available?

10 replies

ghost Sep 2, 2023

lol Just preparing for the general move to GGUF, so will be doing some experimentation on what is 'best' in my situation. Gotta love moving targets.

Kainat-R Oct 5, 2023

Hey @szymonrucinski how did you convert qlora trained model into GGUF? As qlora trained models do not have a config.json file which is needed by GGUF. kindly refer to #3489 and help me if you can. Thanks!

francisco-lafe Jan 4, 2024

The Python convert tool is mostly for just converting models to GGUF/GGML compatible format. I actually added the q8_0 quantization to that recently since it's very close to the same quality as not quantizing. The idea is basically that it's an okay storage format to use for quantizing to others like q4_k_s and uses half as much space as 16bit.

For quantizing to all the formats llama.cpp supports, use the examples/quantize tool. Also keep in mind what I mentioned: If you quantize from 16bit to q4_k_s you'll get slightly better results than quantizing from q8_0.

Hi, maybe I'm missing something but int that folder, examples/quantize there is no binary or similar, but just a CPP file.
How do you quantize to something smaller than q8_0?
My goal is to go to q4_k

kevinknights29 Mar 3, 2024

Hi @francisco-lafe, in order to run the quantize tool, you need to build the llama.cpp repo.
You can use the following instructions from the README:

mkdir build
cd build
cmake ..
cmake --build . --config Release

Then you can run the quantize tool from binary, located at llama.cpp/build/bin
Example:

cd llama.cpp/build/bin && \
   ./quantize ./models/Llama-2-7b-chat-hf/ggml-model-f16.gguf ./models/Llama-2-7b-chat-hf/ggml-model-q4_0.gguf q4_0

francisco-lafe Mar 4, 2024

@kevinknights29 Thanks, I ended up using the pre-compiled binaries as I'm on Windows.
Regards,

AIAnytime · 2023-09-06T05:59:17Z

AIAnytime
Sep 6, 2023

Thanks for the wonderful explanation. I am getting below error:
Loading model file refact-hf/pytorch_model.bin Traceback (most recent call last): File "/home/ai/Desktop/quantized model/llama.cpp/convert.py", line 1225, in <module> main() File "/home/ai/Desktop/quantized model/llama.cpp/convert.py", line 1174, in main params = Params.load(model_plus) File "/home/ai/Desktop/quantized model/llama.cpp/convert.py", line 304, in load params = Params.loadHFTransformerJson(model_plus.model, hf_config_path) File "/home/ai/Desktop/quantized model/llama.cpp/convert.py", line 214, in loadHFTransformerJson n_embd = config["hidden_size"] KeyError: 'hidden_size'

Can anyone help me debug this?

5 replies

samos123 Sep 6, 2023
Author

Can you show some more steps on what you did? Which model are you trying to load?

AIAnytime Sep 6, 2023

I think the convert.py inside llama.cpp is only for Llama based models? For example I am trying to convert a model whose config.json is below:

`[[{

| "architectures": [
| "GPTRefactForCausalLM"
| ],
| "attention_softmax_in_fp32": false,
| "attn_pdrop": 0.1,
| "auto_map": {
| "AutoConfig": "configuration_gpt_refact.GPTRefactConfig",
| "AutoModelForCausalLM": "modeling_gpt_refact.GPTRefactForCausalLM"
| },
| "bos_token_id": -1,
| "do_sample": true,
| "embd_pdrop": 0.1,
| "eos_token_id": 0,
| "initializer_range": 0.02,
| "layer_norm_epsilon": 1e-05,
| "model_type": "gpt_refact",
| "multi_query": true,
| "n_embd": 2048,
| "n_head": 32,
| "n_inner": null,
| "n_layer": 32,
| "n_positions": 4096,
| "resid_pdrop": 0.1,
| "scale_attention_softmax_in_fp32": false,
| "scale_attn_weights": true,
| "torch_dtype": "float16",
| "transformers_version": "4.31.0",
| "use_cache": true,
| "vocab_size": 49216
| }](url)
](url)
`

Now if you see convert.py, it has many additional params which might not be part of non-llama model? can anyone confirm and help me there?

KerfuffleV2 Sep 6, 2023
Collaborator

convert.py is just for LLaMA models. There's also a converter for Falcon models in a separate script. It looks like you're trying to convert a Refact model? It's not supported yet, there's another discussion open though: #3013

AIAnytime Sep 6, 2023

Thanks for your quick response, appreciate it! One question, is there any script available for Starcoder or any other bigcode LLM family? Or is it only Llama, Falcon?

KerfuffleV2 Sep 6, 2023
Collaborator

No problem. The main GGML repo has some examples and it seems like Starcoder is in there: https://github.com/ggerganov/ggml/tree/master/examples

Those examples tend to be relatively simple and don't have all the functionality of llama.cpp like advanced samplers, interactive mode, etc.

clearsitedesigns · 2023-11-10T20:03:00Z

clearsitedesigns
Nov 10, 2023

Is there a way to directly do this on colab?

1 reply

samos123 Nov 10, 2023
Author

I think it should work. Did you try?

daehuikim · 2023-11-15T00:33:31Z

daehuikim
Nov 15, 2023

This way i can only get one file such ass gguf. Is it available to convert model in reproducable format like TheBloke in huggingface?

I am curious i can produce this kinds of files with llama.cpp

1 reply

dame-cell Dec 10, 2023

hmm maybe we can just store all the different sizes locally in our pc then just upload all of them manually

venturaEffect · 2023-12-22T10:24:06Z

venturaEffect
Dec 22, 2023

Does anyone know why am I getting this error?

(hf2gguf) zasear@zaesarius:~/convert2gguf$ python llama.cpp/convert.py ai_lawyer \ --outfile ai-lawyer-v1.gguf \ --outtype q8_0 usage: convert.py [-h] [--dump] [--dump-single] [--vocab-only] [--outtype {f32,f16,q8_0}] [--vocab-dir VOCAB_DIR] [--outfile OUTFILE] [--ctx CTX] [--concurrency CONCURRENCY] [--bigendian] [--padvocab] model convert.py: error: unrecognized arguments: --outfile ai-lawyer-v1.gguf --outtype q8_0

I've tried also without backslashes but no success:

(hf2gguf) zasear@zaesarius:~/convert2gguf$ python llama.cpp/convert.py ai_lawyer --outfile ai-lawyer-v1.g guf --outtype q8_0 Traceback (most recent call last): File "/home/zasear/convert2gguf/llama.cpp/convert.py", line 1279, in <module> main() File "/home/zasear/convert2gguf/llama.cpp/convert.py", line 1207, in main model_plus = load_some_model(args.model) File "/home/zasear/convert2gguf/llama.cpp/convert.py", line 1131, in load_some_model raise Exception(f"Can't find model in directory {path}") Exception: Can't find model in directory ai_lawyer

This is how the structure looks like following the steps here:

Maybe it doesn't work because it is a private model in HuggingFace? But it shouldn't because it is locally downloaded...

Appreciate your help.

3 replies

teleprint-me Dec 22, 2023

It can't find the path.

Exception: Can't find model in directory ai_lawyer

Solution is in the error.

If it is private, it will only work if it's a supported architecture. Otherwise, you'll need to integrate it yourself.

venturaEffect Dec 22, 2023

Thanks, I think I found a work around.

It seems the finetuned models with Autotrain doesn't convert into GGUF format. Will have to merge it with the base model and from there convert it into GGUF. There is a space for that on HF.

Appreciate

heyili Jan 17, 2024

Hey man I am facing the same problem. Were you able to merge the adapter weights back to the base model and create a new model ?

andrewtvuong · 2024-01-09T18:04:08Z

andrewtvuong
Jan 9, 2024

Hi @samos123 I'm only used to working with .gguf kind of files for LLM, I have no idea what to do with this kind of models and so did a search and found your post.

Am I right to assume all models structured this way are hf models? Is there any where I can read more about this? It seems all Youtube go straight to the quantized version .gguf. Are hf models considered the raw models that can be further tuned into something else? I have lots of assumptions but hard to verify.

3 replies

samos123 Jan 9, 2024
Author

I guess you could call them raw models. The huggingface model in pytorch.bin format can be converted to a gguf model.

teleprint-me Jan 9, 2024

Any model (or model part) that ends with .pt, .pth, or .bin can be assumed to be a torch model. torch models are created using the PyTorch framework by Meta (formerly Facebook).

transformers is a framework created and maintained by HuggingFace and they typically will use any available framework to automate iterating through pre-training, fine-tuning, and other tasks for models, e.g. TensorFlow, Keras, Onnx, PyTorch, etc.

config.json or params.json usually have the Hyperparameters for a model.

tokenizer.json is a protobuf data structure that is automatically generated by the transformers framework.

tokenizer.model is a trained model created using sentencepiece that usually has all of the essential vocabulary for a model in NLP (Natural Language Processing) tasks.

special_tokens_map.json contains all the extra tokens added after pre-training a model.

You can learn all of this stuff by reading the documentation and playing around with the frameworks. They're all in Python.

andrewtvuong Jan 10, 2024

Thanks so much for the detailed explanation, very helpful to kick off my learning.

DevangPagare002 · 2024-01-17T11:53:41Z

DevangPagare002
Jan 17, 2024

3 replies

francisco-lafe Jan 17, 2024

That's a very basic error and you should first learn how to view files in a linux command line. As that's what's used on Py Notebooks.
It's telling you that the program "quantize" is not in the current path you're in.

DevangPagare002 Jan 17, 2024

But I am on the quantize path.

!pwd

output - /tmp/llama.cpp/examples/quantize

francisco-lafe Jan 17, 2024

You should run an ls command.

Taikono-Himazin · 2024-01-22T05:27:11Z

Taikono-Himazin
Jan 22, 2024

Please tell me the difference between the roles of the following files.

convert.py
convert-hf-to-gguf.py
convert-llama2c-to-ggml
convert-llama-ggml-to-gguf.py
convert-lora-to-ggml.py
convert-persimmon-to-gguf.py

My predictions are as follows.

convert.py: Convert from HuggingFace format to gguf
convert-hf-to-gguf.py: Convert from HuggingFace format to gguf
convert-llama2c-to-ggml: Convert from llama2c format to ggml
convert-llama-ggml-to-gguf.py: Convert from ggml format to gguf
convert-lora-to-ggml.py: Add LORA to base model and convert to ggml
convert-persimmon-to-gguf.py: Convert from persimmon format to gguf

Why aren't convert.py and convert-hf-to-gguf.py the same?

Also, only convert-llama2c-to-ggml is not a Python file, why is this?

2 replies

oldgithubman Apr 16, 2024

This should be documented in the main readme

oldgithubman Apr 30, 2024

Bump. I still think this needs better or more-obvious documentation. I personally don't know when to use which. Also, is it better to convert straight from original pth or does it produce the same result as converting from HF safetensors?

nan0bug00 · 2024-01-27T20:47:12Z

nan0bug00
Jan 27, 2024

Improved the download.py script:


import sys
from huggingface_hub import snapshot_download

if len(sys.argv) != 2:
    print("Usage: python download.py <model_id>")
    sys.exit(1)

model_id = sys.argv[1]
local_dir = model_id.replace("/", "-")
snapshot_download(repo_id=model_id, local_dir=local_dir,
                  local_dir_use_symlinks=False, revision="main")

This way you can just pass the model name on huggingface in the command line. It will remove the slash and replace it with a dash when creating the directory. Example:

python download.py lmsys/vicuna-13b-v1.5 will create a directory lmsys-vicuna-13b-v1.5 and place the model from huggingface within.

0 replies

nameless0704 · 2024-01-30T02:08:46Z

nameless0704
Jan 30, 2024

I'm having a keyerror: 'transformer.h.0.attn.c_attn.bias' when transforming a Qwen-14B-Chat model using convert.py with --outtype q8_0. Some say using convert_hf_to_gguf.py will fix it, but convert_hf_to_gguf.py doesn't have quantization options. So how to quantize AND convert?

1 reply

francisco-lafe Jan 30, 2024

You can first convert and then quantize.
https://github.com/ggerganov/llama.cpp/releases/tag/b2008
Any DL for Windows, place it in a folder ie llamacppbinaries
.\llamacppbinaries\quantize.exe

IntrepidWanderer · 2024-01-31T22:51:47Z

IntrepidWanderer
Jan 31, 2024

Hi, I ran into an odd error and was really struggling to find any relevant information online. Hoping someone here can help. I know almost nothing about the technical side of things, just an average AI text gen user. I'm trying to convert GGUFs for models and checked out instructions both here and this guide on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/18av9aw/quick_start_guide_to_converting_your_own_ggufs/?rdt=48304

I managed to get convert.py working, can do FP16 and Q8 converts without issue, but ran into the same mysterious error repeatedly when trying to use quantize.exe to convert pretty much anything. I've tried with both this model Mixtral Erotic and this model CatPPT

The error message is always the same:

PS C:\TextGen\llama-b1983-bin-win-cublas-cu12.2.0-x64> ".\quantize.exe." C:\TextGen\text-generation-webui-main\models\cloudyu_Mixtral_Erotic_13Bx2_MOE_22B C:\Convert\22B.f16.Q5.gguf f16
At line:1 char:19
... ntize.exe." C:\TextGen\text-generation-webui-main\models\cloudyu_Mixt ...
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unexpected token 'C:\TextGen\text-generation-webui-main\models\cloudyu_Mixtral_Erotic_13Bx2_MOE_22B' in expression or
statement.
+ CategoryInfo : ParserError: (:) [], ParentContainsErrorRecordException
+ FullyQualifiedErrorId : UnexpectedToken

The processing always gets stuck on "line: 1 char:19", I'm not sure why and I can't really see what character it is specifically. BtW, I'm running in Powershell, just right clicked on the quantize.exe under Explorer and chose the option to auto navigate to that location. I'm not sure if that makes a difference.

I'm wondering if the error is because I don't have Llama.cpp installed correctly. Running quantize.exe through CMD gives an error about cudart64_12.dll missing, but downloading and putting the cudart files into the same folder doesn't stop the error. If I'm only using convert.py and quantize .exe, do I still need to follow the Cmake instructions on the Llama.cpp main page to "build Llama" from the source code? I've already ran the requirements.txt through pythonnkich is why convert.py is working for me, I think. It's just for some reason quantize.exe doesn't work.

Edit (Update):
Just came back to post an update about this issue. I haven't found a way to make quantize.exe work with the CuBLAS version, even though I was using Nvidia 1080 GPU. In the end, resorted to trying CLBlast, and it worked where CuBLAS wouldn't. I'm really not sure why.

0 replies

shubhamraj216 · 2024-02-27T06:55:32Z

shubhamraj216
Feb 27, 2024

I am getting below error when converting phi-2 model to gguf format. I must be missing something. Please help.

python3 llama.cpp/convert.py myllama-hf --outfile myllama-7b-v0.1.gguf

Error -

Traceback (most recent call last):
  File "/Users/shubh/Work/Personal/llama.cpp/convert.py", line 1483, in <module>
    main()
  File "/Users/shubh/Work/Personal/llama.cpp/convert.py", line 1419, in main
    model_plus = load_some_model(args.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shubh/Work/Personal/llama.cpp/convert.py", line 1271, in load_some_model
    raise Exception(f"Found multiple models in {path}, not sure which to pick: {files}")
Exception: Found multiple models in myllama-hf, not sure which to pick: [PosixPath('myllama-hf/model-00001-of-00002.safetensors'), PosixPath('myllama-hf/model.safetensors')]

Below is how the myllama-hf directory looks like

1 reply

HaoHoo May 30, 2024

New release of llama.cpp looks like supports Phi-2 now. My problem is Phi-3 vision...

phymbert · 2024-02-27T07:43:26Z

phymbert
Feb 27, 2024
Collaborator

I am getting below error when converting phi-2 model to gguf format. I must be missing something. Please help.

python3 llama.cpp/convert.py myllama-hf --outfile myllama-7b-v0.1.gguf

Error -

As the errors state, you are mixing multiple models safetensors files format in myllama-hf. You cannot have both "model-00001-of-*.safetensors" and "model.safetensors".

Please properly download files from HF microsoft/phi-2.

Note: you can directly download GGUF quantized Microsoft Phi-2 models from HF with hf.sh, example for a Q4_K_M:

./scripts/hf.sh --repo TheBloke/phi-2-GGUF --file phi-2.Q4_K_M.gguf

0 replies

FantasiaFoundry · 2024-03-03T11:34:52Z

FantasiaFoundry
Mar 3, 2024

This might be useful. If anyone wants to help improving it, it's always welcome.

https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script

0 replies

daanta-real · 2024-05-25T00:39:07Z

daanta-real
May 25, 2024

For everyone other than me who is having trouble importing distutils:
If the distutils import fail error occurs when you run the pip install command in the article in an environment where Python 3.12 is installed on Windows, you can easily solve it by downgrading Python's version to 3.11.

0 replies

stringang · 2024-05-25T09:58:58Z

stringang
May 25, 2024

notes:

convert.py is just for LLaMA models.
convert-hf-to-gguf.py needs to add the tokenizer of the model
- add this model in convert-hf-to-gguf-update.py.
- use the convert-hf-to-gguf-update.py to update the tokenizer of the model.(python convert-hf-to-gguf-update.py <huggingface_token> )

llama.cpp/convert-hf-to-gguf-update.py

Lines 63 to 86 in d6ef0e7

    
           # TODO: add models here, base models preferred 
        
           models = [ 
        
               {"name": "llama-spm",      "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/meta-llama/Llama-2-7b-hf", }, 
        
               {"name": "llama-bpe",      "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Meta-Llama-3-8B", }, 
        
               {"name": "phi-3",          "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct", }, 
        
               {"name": "deepseek-llm",   "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-llm-7b-base", }, 
        
               {"name": "deepseek-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base", }, 
        
               {"name": "falcon",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/tiiuae/falcon-7b", }, 
        
               {"name": "bert-bge",       "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/BAAI/bge-small-en-v1.5", }, 
        
               {"name": "mpt",            "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mosaicml/mpt-7b", }, 
        
               {"name": "starcoder",      "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/bigcode/starcoder2-3b", }, 
        
               {"name": "gpt-2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/openai-community/gpt2", }, 
        
               {"name": "stablelm2",      "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b", }, 
        
               {"name": "refact",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/smallcloudai/Refact-1_6-base", }, 
        
               {"name": "command-r",      "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/CohereForAI/c4ai-command-r-v01", }, 
        
               {"name": "qwen2",          "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Qwen/Qwen1.5-7B", }, 
        
               {"name": "olmo",           "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/allenai/OLMo-1.7-7B-hf", }, 
        
               {"name": "dbrx",           "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/databricks/dbrx-base", }, 
        
               {"name": "jina-v2-en",     "tokt": TOKENIZER_TYPE.WPM, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-en", }, # WPM! 
        
               {"name": "jina-v2-es",     "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-es", }, 
        
               {"name": "jina-v2-de",     "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-de", }, 
        
               {"name": "smaug-bpe",      "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/abacusai/Smaug-Llama-3-70B-Instruct", }, 
        
           ]

0 replies

cilia · 2024-06-30T13:54:01Z

cilia
Jun 30, 2024

New to this, am trying to convert an embedding model (https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) to gguf format. When I tried using convert-hf-to-gguf.py, it throws this error:
ERROR:hf-to-gguf:Model NewModel is not supported

It seems to stem from here:

llama.cpp/convert-hf-to-gguf.py

Line 3118 in 1c5eba6

logger.error(f"Model {hparams['architectures'][0]} is not supported")

Any idea what's the issue and fix here?

0 replies

severfire · 2024-07-01T17:59:57Z

severfire
Jul 1, 2024

Hi, not only me, but someone else had t his problem...
https://huggingface.co/google-bert/bert-base-uncased/discussions/66

is it me, or does it seems like BERT models are not supported, so they can't be converted to GGUF format?
can anyone direct me to place where I can learn how to do it? thanks!

0 replies

anitaokoh · 2024-07-02T08:11:29Z

anitaokoh
Jul 2, 2024

I have this error message. how do I fix it?

Traceback (most recent call last):
  File "/content/llama.cpp/convert-hf-to-gguf.py", line 3170, in <module>
    main()
  File "/content/llama.cpp/convert-hf-to-gguf.py", line 3154, in main
    model_instance.set_vocab()
  File "/content/llama.cpp/convert-hf-to-gguf.py", line 2233, in set_vocab
    tokens, toktypes, tokpre = self.get_vocab_base()
  File "/content/llama.cpp/convert-hf-to-gguf.py", line 386, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
  File "/content/llama.cpp/convert-hf-to-gguf.py", line 507, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

0 replies

flitzcore · 2024-08-06T07:48:09Z

flitzcore
Aug 6, 2024

I have deleted the code but I remember that I failed with that method. At the end I just at HuggingFace pages for online converter gguf Pada Sel, 6 Agu 2024 13.43, gavin-edward ***@***.***> menulis:

…

Thank you very much for your help. After building I ran quantize with: quantize models/susnato_phi-1_5.gguf models/susnato_phi-1_5_q8_0.gguf Q8_0 And it works nicely. Cheers! hello, could you please share your building method with me ? — Reply to this email directly, view it on GitHub <#2948 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ATWVRZLPFXE3MG23L3QKLMLZQBWC7AVCNFSM6AAAAAA4G4QWYKVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTAMRVGAYTSOA> . You are receiving this because you commented.Message ID: <ggerganov/llama. ***@***.***>

0 replies

zufari4 · 2024-09-19T09:25:17Z

zufari4
Sep 19, 2024

C:\Dev\llama.cpp>python convert_hf_to_gguf.py --outfile ner.gguf bert-base-NER
INFO:hf-to-gguf:Loading model: bert-base-NER
ERROR:hf-to-gguf:Model BertForTokenClassification is not supported

I can't convert any models that classify tokens. What's wrong?

0 replies

mvmn · 2024-10-17T17:31:05Z

mvmn
Oct 17, 2024

Doesn't work. The convert.py isn't there, other versions of that script throw different kind of errors. Checking out older branches brings back convert.py, but it throws all sorts of errors (like AttributeError: module 'gguf' has no attribute 'MODEL_TENSOR_NAMES'. Did you mean: 'MODEL_TENSORS'? or KeyError: 'tok_embeddings.weight').

Is there a working instruction?

0 replies

krecicki · 2024-10-17T17:41:39Z

krecicki
Oct 17, 2024

If you go on hugging face, the boys set up a simple UI that you can use.Cody Krecicki, FounderChoice Internet Brands, Inc & ***@***.*** Oct 17, 2024, at 10:31 AM, Mykola Makhin ***@***.***> wrote: Doesn't work. The convert.py isn't there, other versions of that script throw different kind of errors. Checking out older branches brings back convert.py, but it throws all sorts of errors (like AttributeError: module 'gguf' has no attribute 'MODEL_TENSOR_NAMES'. Did you mean: 'MODEL_TENSORS'? or KeyError: 'tok_embeddings.weight'). Is there a working instruction? —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

mvmn Oct 17, 2024

I know, but I want to run it locally with Jan, and for that I need to convert it to GGUF.

krecicki · 2024-10-17T19:08:12Z

krecicki
Oct 17, 2024

How’d you train the model? What did you use? What was the base model you used with the Lora? Is it just a random model you found and want a gguf?Cody Krecicki, FounderChoice Internet Brands, Inc & ***@***.*** Oct 17, 2024, at 11:00 AM, Mykola Makhin ***@***.***> wrote: I know, but I want to run it locally with Jan, and for that I need to convert it to GGUF. —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

0 replies

josephencila · 2024-11-06T03:09:25Z

josephencila
Nov 6, 2024

cant convert VIT image model?

python llama.cpp/convert_hf_to_gguf.py "C:\Users\user\models\vit-model-hf"
--outfile vit-model-hf.gguf --outfile f16 or f32 or q8_0

INFO:hf-to-gguf:Loading model: vit-model-hf
ERROR:hf-to-gguf:Model ViTForImageClassification is not supported

0 replies

krecicki · 2024-11-06T04:56:23Z

krecicki
Nov 6, 2024

What you guys have to understand is there are certain templates that work only https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_templateCody Krecicki, FounderChoice Internet Brands, Inc & ***@***.*** Nov 5, 2024, at 7:10 PM, Joseph Encila ***@***.***> wrote: cant convert VIT image model? python llama.cpp/convert_hf_to_gguf.py "C:\Users\user\models\vit-model-hf" --outfile vit-model-hf.gguf --outfile f16 or f32 or q8_0 INFO:hf-to-gguf:Loading model: vit-model-hf ERROR:hf-to-gguf:Model ViTForImageClassification is not supported —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

surprisedPikachu007 Nov 15, 2024

So, what should I do if i want to convert my custom LLM to gguf format. I can run the model with huggingface transformers.

krecicki · 2024-11-15T15:47:42Z

krecicki
Nov 15, 2024

When you say custom LLM, are you talking like you trained it from scratch without any base models or did you make a Lora? Did you follow any architecture or you just straight up made everything from scratch?Cody Krecicki, FounderChoice Internet Brands, Inc & ***@***.*** Nov 15, 2024, at 2:51 AM, Aravinda Kumar ***@***.***> wrote: So, what should I do if i want to convert my custom LLM to gguf format. I can run the model with huggingface transformers. —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

surprisedPikachu007 Nov 15, 2024

It is based on the gpt3 architecture. But it was pretrained from scratch.

krecicki · 2024-11-15T16:49:38Z

krecicki
Nov 15, 2024

There should be a file somewhere in this repo where it has templates for the different chat templates for the different architectures. You might have to add a new one.Cody Krecicki, FounderChoice Internet Brands, Inc & ***@***.*** Nov 15, 2024, at 8:13 AM, Aravinda Kumar ***@***.***> wrote: It is based on the gpt3 architecture. But it was pretrained from scratch. —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

surprisedPikachu007 Nov 15, 2024

Thank you for the response. Can you point me to the file?

krecicki · 2024-11-15T17:09:30Z

krecicki
Nov 15, 2024

Here is somewhere to start, https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_templateIt may not be what you’re looking for but your problem sounds similar to others. If this isn’t the case. Then try the gguf maker straight up on huggingface. If it still won’t work, I’m not sure my friend.Cody Krecicki, FounderChoice Internet Brands, Inc & ***@***.*** Nov 15, 2024, at 8:56 AM, Aravinda Kumar ***@***.***> wrote: Thank you for the response. Can you point me to the file? —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

surprisedPikachu007 Nov 15, 2024

Ok. I'll try that. Thank you for your time.

Tutorial: How to convert HuggingFace model to GGUF format #2948

Downloading a HuggingFace model

Converting the model

Pushing the GGUF model to HuggingFace

Replies: 42 comments · 59 replies

KerfuffleV2 Sep 1, 2023 Collaborator

samos123 Sep 1, 2023 Author

KerfuffleV2 Sep 1, 2023 Collaborator

Green-Sky Sep 2, 2023 Collaborator

samos123 Sep 6, 2023 Author

`[[{

KerfuffleV2 Sep 6, 2023 Collaborator

KerfuffleV2 Sep 6, 2023 Collaborator

samos123 Nov 10, 2023 Author

samos123 Jan 9, 2024 Author

phymbert Feb 27, 2024 Collaborator

Replies: 42 comments 59 replies

KerfuffleV2
Sep 1, 2023
Collaborator

samos123 Sep 1, 2023
Author

KerfuffleV2 Sep 1, 2023
Collaborator

Green-Sky Sep 2, 2023
Collaborator

samos123 Sep 6, 2023
Author

KerfuffleV2 Sep 6, 2023
Collaborator

KerfuffleV2 Sep 6, 2023
Collaborator

samos123 Nov 10, 2023
Author

samos123 Jan 9, 2024
Author

phymbert
Feb 27, 2024
Collaborator