# Feature matrix

*Johannes Gäßler edited this page Feb 17, 2025 · 9 revisions*
| | CPU (AVX2) | CPU (ARM NEON) | Metal | CUDA | ROCm | SYCL | CLBlast | Vulkan | Kompute |
|---|---|---|---|---|---|---|---|---|---|
| K-quants | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ 🐢⁵ | ✅ 🐢⁵ | 🚫 |
| I-quants | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ | ✅ | Partial¹ | 🚫 | 🚫 | 🚫 |
| Parallel Multi-GPU⁶ | N/A | N/A | N/A | ✅ | ✅ | 🚫 | ❓ | ❓ | ❓ |
| K cache quants | ✅ | ❓ | ✅ | ✅ | ✅ | ❓ | ✅ | 🚫 | 🚫 |
| MoE architecture | ✅ | ❓ | ✅ | ✅ | ✅ | ❓ | Partial² | 🚫 | 🚫 |
- ✅: feature works
- 🚫: feature does not work
- ❓: unknown, please contribute if you can test it yourself
- 🐢: feature is slow
- ¹: IQ3_S and IQ1_S, see #5886
- ²: Only with `-ngl 0`
- ³: Inference is 50% slower
- ⁴: Slower than K-quants of comparable size
- ⁵: Slower than cuBLAS/rocBLAS on similar cards
- ⁶: By default, all backends can utilize multiple devices by running them sequentially. The CUDA code (which is also used for ROCm via HIP) also has code for running GPUs in parallel via `--split-mode row`. However, this is optimized relatively poorly and is only faster if the interconnect speed is fast relative to the speed of a single GPU.
- ⁷: Only q8_0 and iq4_nl
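The flags mentioned in the footnotes are passed on the llama.cpp command line. A minimal sketch of how they combine (the model path is a placeholder, and which flags matter depends on the backend your binary was built with):

```shell
# Placeholder model path; substitute your own GGUF file.

# CLBlast + MoE models (footnote ²): keep all layers on the CPU.
./llama-cli -m ./models/model.gguf -ngl 0 -p "Hello"

# CUDA/ROCm parallel multi-GPU (footnote ⁶): split each layer's
# weights across GPUs row-wise instead of assigning whole layers
# to individual GPUs (the default sequential mode).
./llama-cli -m ./models/model.gguf -ngl 99 --split-mode row -p "Hello"
```

With `--split-mode row`, all GPUs work on every layer simultaneously, so it only pays off when the interconnect is fast relative to a single GPU, as noted above.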