
Bug: Severe Performance Degradation on Q4_0 CPU-only with macOS / Apple Silicon M2, after PR #9921 / Version 4081 #10435

Open
AndreasKunar opened this issue Nov 20, 2024 · 12 comments · Fixed by #10457
Labels: bug (Something isn't working)

Comments

AndreasKunar (Contributor) commented Nov 20, 2024

What happened?

Prior to PR #9921 / version 4081, the -ngl 0 Q4_0 llama performance was significantly higher (more than 10x) than afterwards.
(Hardware: Apple MacBook Air, M2 with 10-core GPU, 24GB RAM)

Before the PR:

```sh
make clean
git checkout ae8de6d
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
```

| model             | size     | params | backend    | threads | test  | t/s          |
| ----------------- | -------- | ------ | ---------- | ------- | ----- | ------------ |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 60.48 ± 0.49 |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 14.89 ± 0.20 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 63.50 ± 2.47 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 11.93 ± 3.30 |

With -ngl 99:

| model         | size     | params | backend    | threads | test  | t/s           |
| ------------- | -------- | ------ | ---------- | ------- | ----- | ------------- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 194.94 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 11.81 ± 6.53  |

build: ae8de6d (4080)

Versions after the PR (including the current one):
```sh
make clean
git checkout 1607a5e
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
```

| model             | size     | params | backend    | threads | test  | t/s          |
| ----------------- | -------- | ------ | ---------- | ------- | ----- | ------------ |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 4.11 ± 0.24  |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 1.86 ± 0.01  |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 62.81 ± 2.55 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 14.70 ± 1.97 |

With -ngl 99:

| model         | size     | params | backend    | threads | test  | t/s            |
| ------------- | -------- | ------ | ---------- | ------- | ----- | -------------- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 186.02 ± 13.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 11.25 ± 3.42   |

build: 1607a5e (4081)

The variations, except for the -ngl 0 / Q4_0 case, might be due to the MacBook Air's thermals.

Name and Version

Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix
macOS Sequoia 15.1.1

What operating system are you seeing the problem on?

Mac

Relevant log output

No response

AndreasKunar added the bug-unconfirmed and medium severity labels Nov 20, 2024

slaren (Collaborator) commented Nov 20, 2024

The problem seems to be that Q4_0 is being converted to Q4_0_4_8, which has bad performance on this CPU. If patched so that it is converted to Q4_0_4_4 instead, the performance is as expected, e.g.:

```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 96a16dfb..ff27f9cb 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3549,7 +3549,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
             return GGML_TYPE_Q4_0_8_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-            return GGML_TYPE_Q4_0_4_8;
+            //return GGML_TYPE_Q4_0_4_8;
         }
         if (ggml_cpu_has_neon()) {
             return GGML_TYPE_Q4_0_4_4;
```
@chaxu01 Does the logic for choosing the repack target type need to be updated?

ggerganov (Owner) commented Nov 21, 2024

I can confirm that on Apple, `ggml_arm_arch_features.has_i8mm` ends up being true, but the `defined(__ARM_FEATURE_MATMUL_INT8)` macro is false, so the logic decides to repack, but then does not use the optimized path and falls back to the slow scalar implementation.
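
For illustration, here is a minimal, self-contained sketch of the mismatch (hypothetical names, not the actual ggml source): the repack choice is driven by runtime CPU detection, while the fast kernel only exists when the compiler defined `__ARM_FEATURE_MATMUL_INT8`.

```c
// Sketch of the mismatch: repacking is chosen from runtime detection,
// but the optimized kernel is only compiled in when the macro is set.
#include <stdbool.h>
#include <stdio.h>

typedef enum { REPACK_Q4_0_4_4, REPACK_Q4_0_4_8 } repack_type;

static bool cpu_has_i8mm_at_runtime(void) {
    return true; // an Apple M2 reports i8mm support at runtime
}

static repack_type choose_repack(void) {
    // decided purely from runtime detection -- succeeds on the M2 ...
    return cpu_has_i8mm_at_runtime() ? REPACK_Q4_0_4_8 : REPACK_Q4_0_4_4;
}

static void gemm_q4_0_4_8(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    puts("optimized i8mm kernel");      // only built with -march=...+i8mm
#else
    puts("scalar reference fallback");  // ... but this is what actually ran
#endif
}

int main(void) {
    if (choose_repack() == REPACK_Q4_0_4_8) {
        gemm_q4_0_4_8();
    }
    return 0;
}
```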

ggerganov added the bug (Something isn't working) label and removed the bug-unconfirmed and medium severity labels Nov 21, 2024

eddnjjn (Contributor) commented Nov 21, 2024

@AndreasKunar: If you build with cmake, try adding something like `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod` to the build command. This causes `__ARM_FEATURE_MATMUL_INT8` to be defined.
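
A quick way to verify that the flags took effect is a small feature-macro probe (a standalone sketch, independent of llama.cpp):

```c
// Prints whether the compiler defines __ARM_FEATURE_MATMUL_INT8 under
// the chosen flags, e.g.:
//   cc -march=armv8.2a+i8mm+dotprod probe.c -o probe && ./probe
#include <stdio.h>

int main(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    puts("__ARM_FEATURE_MATMUL_INT8 defined: i8mm kernels will be built");
#else
    puts("__ARM_FEATURE_MATMUL_INT8 NOT defined: scalar fallback only");
#endif
    return 0;
}
```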

chaxu01 (Contributor) commented Nov 21, 2024

@AndreasKunar I would like to confirm whether you built with the `arch=...+i8mm` option. Due to runtime CPU feature detection, we decided to repack into Q4_0_4_8, as it is supported by the CPU. However, when executing the kernel, the reference implementation is used because the optimized kernel was not included in the build.

AndreasKunar (Contributor, Author) commented Nov 21, 2024

> @AndreasKunar: If you build with cmake, try adding something like `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod` to the build command. This causes `__ARM_FEATURE_MATMUL_INT8` to be defined.

@eddnjjn + @chaxu01: I normally do not build with cmake; as in my initial report, I built with `make -j llama-bench` and no switches. This is also in accordance with the llama.cpp performance instructions (discussion #4167), which are essential and still current.

I tried a cmake build with your -march flags; it does not degrade performance:

```sh
cmake -DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod -B build
cmake --build build --config Release --target llama-bench
./build/bin/llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...
```

| model             | size     | params | backend    | threads | test  | t/s          |
| ----------------- | -------- | ------ | ---------- | ------- | ----- | ------------ |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 56.64 ± 1.37 |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 13.76 ± 0.15 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 59.08 ± 1.13 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 12.86 ± 0.94 |

build: 1bb30bf (4149)

I understand that the problem is due to the repack-on-load code (which is great, because it avoids having to manually convert Q4_0 to Q4_0_4_4 or _8). It's a great idea, thanks, and it also helps a lot on Macs with containers/VMs. This is why I initially included the Q4_0_4_4 performance numbers (without repacking).

But if special build flags, or cmake instead of make, are now needed (especially with very common hardware like Apple silicon Macs), I think this needs to be documented thoroughly in the llama.cpp build instructions and also in the Mac performance comparison build instructions (discussion #4167).

I also understand that in the future the build might best be standardized on cmake, with make deprecated (see the ongoing discussion in issue #10268). But until now, and as documented in the build instructions, `make` always worked nicely for common hardware+OS combinations. If it no longer works, that is a severe bug and needs to be documented thoroughly!

P.S.: Please tell me if I can support you or help. I'm currently on vacation and a bit slow in responding, apologies.

chaxu01 (Contributor) commented Nov 21, 2024

@AndreasKunar Thanks for the verification. I agree that the llama.cpp build instructions could be updated to better reflect the runtime repack case.

slaren (Collaborator) commented Nov 21, 2024

At least the `ggml_cpu_has_xxx` functions should also check whether the build supports the feature, and return false otherwise. If the ARM features are not automatically enabled with `-march=native`, it may also be good to enable them by default, since they are checked at runtime anyway.
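
A minimal sketch of such a guard (the struct below stands in for ggml's internal `ggml_arm_arch_features`; this is not the final implementation):

```c
// Report i8mm only when the build compiled the optimized kernels
// (compile-time macro) AND the CPU supports them (runtime value).
#include <stdio.h>

struct arm_arch_features { int has_i8mm; };
static struct arm_arch_features arch_features = { 1 }; // pretend runtime detection succeeded

int ggml_cpu_has_matmul_int8(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return arch_features.has_i8mm;
#else
    return 0; // kernels compiled out: never select the i8mm repack path
#endif
}

int main(void) {
    printf("i8mm usable: %d\n", ggml_cpu_has_matmul_int8());
    return 0;
}
```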

slaren (Collaborator) commented Nov 23, 2024

I will reopen this since the issue with the ARM features not being automatically enabled with the default native builds is still not addressed.

@slaren slaren reopened this Nov 23, 2024

chaxu01 (Contributor) commented Nov 24, 2024

@slaren One option for fixing this could be to update the CMake script to conditionally add the appropriate compile options based on the detected CPU capabilities. This would ensure that the required ARM features, such as `-march=armv8.2a` and specific flags like `i8mm`, are enabled when supported by the hardware.

slaren (Collaborator) commented Nov 24, 2024

Yes, that should work. Ideally it would work in the same way as x86: automatically enable all the features of the local CPU when `GGML_NATIVE` is enabled, and otherwise allow individual features to be enabled manually.

chaxu01 (Contributor) commented Nov 25, 2024

Thanks for confirming. I'll work on updating the CMake script to ensure that, when `GGML_NATIVE` is enabled, all supported ARM features of the local CPU are automatically detected and enabled.

chaxu01 (Contributor) commented Nov 25, 2024

Created PR #10487
