Regarding ExLlamaV2 and llama.cpp (GGUF), I think it depends on your setup. As far as I know, ExLlamaV2 is faster on GPU but doesn't support CPU inference. llama.cpp, on the other hand, can split layers between CPU and GPU to reduce VRAM usage, and it also supports pure CPU inference (it was originally developed for CPU inference). Both are optimized for fast LLM inference and do their job very well. Note that they also use different quantization methods.
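For reference, this is roughly what the CPU/GPU layer split looks like through the llama-cpp-python bindings (a minimal sketch; the GGUF path is a placeholder, and you'd tune `n_gpu_layers` to your VRAM):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf",  # placeholder path to a GGUF file
    n_gpu_layers=20,  # offload 20 transformer layers to the GPU; the rest run on CPU
    n_ctx=4096,       # context window
)

out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```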
As for this project, we focus specifically on optimizing inference for MoE-based models on consumer-grade GPUs. I can't tell you for sure right now when our method is faster or slower than the others, but we're currently benchmarking that. It's also important to note that we use HQQ quantization, which gives good quality but currently isn't very fast because it lacks fast CUDA kernels. Our team is actively working on supporting other quantization methods along with fast kernels, and on further ways to improve inference speed and quality.
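To give an idea of what these weight-quantization formats do in general (HQQ, GGUF, and exl2 each refine this basic idea differently), here is an illustrative group-wise 4-bit quantization sketch in plain PyTorch; this is not the actual kernel code used anywhere, just the concept:

```python
import torch

def quantize_groupwise(w: torch.Tensor, group_size: int = 64, nbits: int = 4):
    """Affine-quantize a weight matrix per group of `group_size` values."""
    qmax = 2 ** nbits - 1
    groups = w.reshape(-1, group_size)                    # (num_groups, group_size)
    w_min = groups.min(dim=1, keepdim=True).values
    w_max = groups.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax        # one scale per group
    q = torch.round((groups - w_min) / scale).clamp(0, qmax).to(torch.uint8)
    return q, scale, w_min

def dequantize_groupwise(q, scale, w_min, shape):
    return (q.float() * scale + w_min).reshape(shape)

w = torch.randn(4096, 4096)
q, scale, zero = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scale, zero, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
```

The formats differ mainly in how they pick scales/zero-points, how they pack the low-bit values, and how fast their dequantize-and-matmul kernels are, which is why speed varies so much between them.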
Therefore, I believe our method is useful at least when you don't have a lot of GPU VRAM (e.g., on Google Colab) or when you want to fit a bigger, higher-quality model into it. We'll do our best to ship new features and get back to you as soon as possible.
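To illustrate the general expert-offloading idea (this is not our actual implementation, which adds expert caching, prefetching, and HQQ; all names below are made up for the sketch), the trick is to keep expert weights in CPU RAM and copy only the experts the router selects to the GPU for each forward pass:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

class OffloadedExperts(torch.nn.Module):
    def __init__(self, num_experts=8, hidden=1024):
        super().__init__()
        # Expert weights stay in (pinned) CPU RAM; only routed experts visit the GPU.
        self.weights = [
            torch.randn(hidden, hidden).pin_memory() if device == "cuda"
            else torch.randn(hidden, hidden)
            for _ in range(num_experts)
        ]

    def forward(self, x, expert_ids):
        out = torch.zeros_like(x)
        for i in expert_ids:
            # Copy this expert's weight matrix to the GPU just for this call.
            w_gpu = self.weights[i].to(device, non_blocking=True)
            out = out + F.linear(x, w_gpu)
        return out  # per-expert GPU copies are freed once they go out of scope

layer = OffloadedExperts()
x = torch.randn(1, 1024, device=device)
y = layer(x, expert_ids=[0, 3])  # pretend the router picked experts 0 and 3
print(y.shape)
```

The host-to-device copy is the main cost here, which is why keeping frequently used experts cached on the GPU matters for speed.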
I'm a bit lost among the different quantization approaches, such as GGUF, ExLlamaV2, and this project.
Are they the same thing? Is one approach faster than the others?
GGUF: TheBloke/Mixtral-8x7B-v0.1-GGUF
ExLlamaV2: turboderp/Mixtral-8x7B-instruct-exl2