Can madlad400 gguf models from huggingface be used? #8300
-
I compiled the latest version, which has T5 support, and tried running a madlad400 model from https://huggingface.co/jbochi/madlad400-3b-mt/resolve/main/model-q4k.gguf
Is there a change in the conversion process from .safetensors that is needed for T5 models?
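For anyone trying the same thing, a rough sketch of a run command (assumptions on my part: a llama-cli build with the new T5 support, the file from the link above, and madlad400's <2xx> target-language prefix convention, if I have that right):

# sketch only: the prompt prefix selects the target language
./llama-cli -m model-q4k.gguf -p "<2fi> The weather is nice today." -n 64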
-
Well, I couldn't figure out a way to use jbochi's GGUF directly either, so I think it's necessary to use the conversion script convert_hf_to_gguf.py. Btw, I'm super enthused about these recent additions; this project just keeps getting better :D
EDIT: OK, so this behaves somewhat similarly to candle, but the glitches are slightly different.
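For the record, the standard route would be something like the sketch below (paths and the quant type are placeholders; it assumes a local copy of the original safetensors checkpoint):

# convert the original HF checkpoint to an f16 GGUF, then quantize
python convert_hf_to_gguf.py ./madlad400-3b-mt --outfile madlad400-3b-mt-f16.gguf --outtype f16
./llama-quantize madlad400-3b-mt-f16.gguf madlad400-3b-mt-Q4_K_M.gguf Q4_K_M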
-
Okay, I got it working now (I think), and man it feels FAST!!
The bad news is that my GGUF conversion procedure from jbochi => llama.cpp was quite a messy business indeed. It involved conjuring up an empty GGUF, filling it with metadata, and doing some frankensteining with KerfuffleV2's gguf-tools. I also wrote a custom script to rename the tensors, and llama.cpp itself needed a teeny weeny change too. The upside of this method is that the quantized tensors remain untouched. I can give more details if there's interest, but somehow I feel there must be a better way :D
EDIT: I've now managed to polish the conversion process a little bit, so that no llama.cpp customization is necessary any longer. Here's the patch if anyone wants to try this version. You'll need the original jbochi model and xdelta3.
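If anyone is unfamiliar with xdelta3, applying the patch is roughly this (file names here are made up; the real ones are whatever the patch attachment uses):

xdelta3 -d -s model-q4k.gguf madlad400-q4k.xdelta3 madlad400-3b-mt-q4k-fixed.gguf   # -d = decode, -s = source file to patch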
-
Has anyone made a llama.cpp-compatible GGUF for another T5 model, aya-101: https://huggingface.co/CohereForAI/aya-101 ?
-
Oh wow, there they are, popping up at HF now:
-
OK, just got aya-101 working! The catch is that you have to quantize it yourself. I wanted to test quantizing a large model with meager resources, and this was as good a candidate as any. (Of course, "large" is relative... in the era of 405B this is peanuts, really :)
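The quantization step itself is just the stock tool, roughly like this (file names are placeholders, assuming you already produced an f16 GGUF with convert_hf_to_gguf.py):

./llama-quantize aya-101-f16.gguf aya-101-IQ4_XS.gguf IQ4_XS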
-
aya-101 is missing the spiece.model file, which is needed to convert it. I copied the one from mt5-xxl, which enabled the conversion to work, and created an IQ4_XS quant.

bash-5.1$ lm "translate to finnish: I wanted to test quantizing a large model with meager resources and this was as good a candidate as any."

The model is pretty dumb; it looks mainly useful for translations:

lm "Answer the following yes/no question by reasoning step-by-step. Could a dandelion suffer from hepatitis?"

Translated the question to German with madlad400, same answer:

bash-5.1$ lm "Beantworten Sie die folgende Ja/Nein-Frage schrittweise: Könnte ein Löwenzahn an Hepatitis leiden?"
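In case anyone wants to replicate the workaround, it amounts to something like this sketch (the mt5-xxl URL follows the usual HF layout; the exact placement next to the aya-101 files is my assumption):

# borrow mt5-xxl's sentencepiece model, drop it into the local aya-101 checkout, then convert as usual
wget -P aya-101/ https://huggingface.co/google/mt5-xxl/resolve/main/spiece.model
python convert_hf_to_gguf.py aya-101 --outfile aya-101-f16.gguf --outtype f16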
-
Gotta agree on dumb :D IQ4_XS, you say? I wonder how that imatrix thing is handled in these multilingual models. Btw, in case anyone's wondering: yes, you can run this on said C2D/4GB machine. Well, it's more of a crawl though.
vvv Thanks vvv
--repeat-penalty 2.0 and leveling up to IQ4_XS mitigated the looping problem, but not all the way.
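i.e. something along these lines (a sketch only; --repeat-penalty is a standard llama-cli sampling flag, the rest of the command is assumed):

./llama-cli -m aya-101-IQ4_XS.gguf --repeat-penalty 2.0 -p "translate to finnish: Could a dandelion suffer from hepatitis?"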
-
I tried this: https://huggingface.co/Eddishockwave/madlad400-10b-mt-Q8_0-GGUF. It works and produces quite good results.
-
I got it working with … I've noticed that in interactive mode it doesn't return anything, and in API mode it returns … Anybody know why? Using code from the latest git.
-
@misutoneko Hi, sir. Could you complete the example in llama.swiftui using T5? I tried using the madlad400 model in Swift, but I got an error: