
Bug: Severe Performance Degradation on Q4_0 CPU-only with macOS / Apple Silicon M2, after PR #9921 / Version 4081 #10435

Open
AndreasKunar opened this issue Nov 20, 2024 · 12 comments · Fixed by #10457
Labels: bug (Something isn't working)

Comments

AndreasKunar (Contributor) commented Nov 20, 2024

What happened?

Prior to PR #9921 / version 4081, the -ngl 0 Q4_0 llama performance was significantly higher (more than 10x) than afterwards.
(Hardware: Apple MacBook Air, M2 with 10-core GPU, 24GB RAM)

Before the PR:

```sh
make clean
git checkout ae8de6d
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
```

| model             | size     | params | backend    | threads | test  | t/s          |
| ----------------- | -------- | ------ | ---------- | ------- | ----- | ------------ |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 60.48 ± 0.49 |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 14.89 ± 0.20 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 63.50 ± 2.47 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 11.93 ± 3.30 |

With -ngl 99:

| model         | size     | params | backend    | threads | test  | t/s           |
| ------------- | -------- | ------ | ---------- | ------- | ----- | ------------- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 194.94 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 11.81 ± 6.53  |

build: ae8de6d (4080)

Versions after the PR (including the current one):
```sh
make clean
git checkout 1607a5e
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
```

| model             | size     | params | backend    | threads | test  | t/s          |
| ----------------- | -------- | ------ | ---------- | ------- | ----- | ------------ |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 4.11 ± 0.24  |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 1.86 ± 0.01  |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 62.81 ± 2.55 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 14.70 ± 1.97 |

With -ngl 99:

| model         | size     | params | backend    | threads | test  | t/s            |
| ------------- | -------- | ------ | ---------- | ------- | ----- | -------------- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 186.02 ± 13.18 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 11.25 ± 3.42   |

build: 1607a5e (4081)

The variations, except for the -ngl 0 / Q4_0 case, might be due to the MacBook Air's thermals.

Name and Version

Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix
macOS Sequoia 15.1.1

What operating system are you seeing the problem on?

Mac

Relevant log output

No response

AndreasKunar added the bug-unconfirmed and medium severity labels Nov 20, 2024

slaren (Collaborator) commented Nov 20, 2024

The problem seems to be that Q4_0 is being converted to Q4_0_4_8, which has bad performance on this CPU. If patched so that it is converted to Q4_0_4_4 instead, the performance is as expected, e.g.:

```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 96a16dfb..ff27f9cb 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3549,7 +3549,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
             return GGML_TYPE_Q4_0_8_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-            return GGML_TYPE_Q4_0_4_8;
+            //return GGML_TYPE_Q4_0_4_8;
         }
         if (ggml_cpu_has_neon()) {
             return GGML_TYPE_Q4_0_4_4;
```
@chaxu01 Does the logic for choosing the repack target type need to be updated?

ggerganov (Owner) commented Nov 21, 2024

I can confirm that on Apple, `ggml_arm_arch_features.has_i8mm` ends up being true, but the `defined(__ARM_FEATURE_MATMUL_INT8)` macro is false, so the logic decides to repack, but then does not use the optimized path and falls back to the slow scalar implementation.
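
For illustration, here is a minimal, self-contained sketch of the mismatch (hypothetical names, not the actual ggml source): the repack choice is driven by runtime CPU detection, while the fast kernel only exists when the compiler defined `__ARM_FEATURE_MATMUL_INT8`.

```c
// Sketch of the mismatch: repacking is chosen from runtime detection,
// but the optimized kernel is only compiled in when the macro is set.
#include <stdbool.h>
#include <stdio.h>

typedef enum { REPACK_Q4_0_4_4, REPACK_Q4_0_4_8 } repack_type;

static bool cpu_has_i8mm_at_runtime(void) {
    return true; // an Apple M2 reports i8mm support at runtime
}

static repack_type choose_repack(void) {
    // decided purely from runtime detection -- succeeds on the M2 ...
    return cpu_has_i8mm_at_runtime() ? REPACK_Q4_0_4_8 : REPACK_Q4_0_4_4;
}

static void gemm_q4_0_4_8(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    puts("optimized i8mm kernel");      // only built with -march=...+i8mm
#else
    puts("scalar reference fallback");  // ... but this is what actually ran
#endif
}

int main(void) {
    if (choose_repack() == REPACK_Q4_0_4_8) {
        gemm_q4_0_4_8();
    }
    return 0;
}
```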

ggerganov added the bug (Something isn't working) label and removed the bug-unconfirmed and medium severity labels Nov 21, 2024

eddnjjn (Contributor) commented Nov 21, 2024

@AndreasKunar: If you build with cmake, try adding something like `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod` to the build command. This causes `__ARM_FEATURE_MATMUL_INT8` to be defined.
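
A quick way to verify that the flags took effect is a small feature-macro probe (a standalone sketch, independent of llama.cpp):

```c
// Prints whether the compiler defines __ARM_FEATURE_MATMUL_INT8 under
// the chosen flags, e.g.:
//   cc -march=armv8.2a+i8mm+dotprod probe.c -o probe && ./probe
#include <stdio.h>

int main(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    puts("__ARM_FEATURE_MATMUL_INT8 defined: i8mm kernels will be built");
#else
    puts("__ARM_FEATURE_MATMUL_INT8 NOT defined: scalar fallback only");
#endif
    return 0;
}
```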

chaxu01 (Contributor) commented Nov 21, 2024

@AndreasKunar I would like to confirm whether you built with the `arch=...+i8mm` option. Due to runtime CPU feature detection, we decided to repack into Q4_0_4_8, as it is supported by the CPU. However, when executing the kernel, the reference implementation is used because the optimized kernel was not included in the build.

AndreasKunar (Contributor, Author) commented Nov 21, 2024

> @AndreasKunar: If you build with cmake, try adding something like `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod` to the build command. This causes `__ARM_FEATURE_MATMUL_INT8` to be defined.

@eddnjjn + @chaxu01: I normally do not build with cmake; as in my initial report, I built with `make -j llama-bench` and no switches. This is also in accordance with the llama.cpp performance instructions (discussion #4167), which are essential and still current.

I tried a cmake build with your -march flags; it does not degrade performance:

```sh
cmake -DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod -B build
cmake --build build --config Release --target llama-bench
./build/bin/llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...
```

| model             | size     | params | backend    | threads | test  | t/s          |
| ----------------- | -------- | ------ | ---------- | ------- | ----- | ------------ |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 56.64 ± 1.37 |
| llama 7B Q4_0     | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 13.76 ± 0.15 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | pp512 | 59.08 ± 1.13 |
| llama 7B Q4_0_4_4 | 3.56 GiB | 6.74 B | Metal,BLAS | 4       | tg128 | 12.86 ± 0.94 |

build: 1bb30bf (4149)

I understand that the problem is due to the repack-on-load code (which is great, because it avoids having to manually convert Q4_0 to Q4_0_4_4 or _8). It's a great idea, thanks, and it also helps a lot on Macs with containers/VMs. This is why I initially included the Q4_0_4_4 performance numbers (without repacking).

But if special build flags, or cmake instead of make, are now needed (especially with very common hardware like Apple silicon Macs), I think this needs to be documented thoroughly in the llama.cpp build instructions and also in the Mac performance comparison build instructions (discussion #4167).

I also understand that in the future the build might best be standardized on cmake, with make deprecated (see the ongoing discussion in issue #10268). But until now, and as documented in the build instructions, `make` always worked nicely for common hardware+OS combinations. If it no longer works, that is a severe bug and needs to be documented thoroughly!

P.S.: Please tell me if I can support you or help. I'm currently on vacation and a bit slow in responding, apologies.

chaxu01 (Contributor) commented Nov 21, 2024

@AndreasKunar Thanks for the verification. I agree that the llama.cpp build instructions could be updated to better reflect the runtime repack case.

slaren (Collaborator) commented Nov 21, 2024

At least the `ggml_cpu_has_xxx` functions should also check whether the build supports the feature, and return false otherwise. If the ARM features are not automatically enabled with `-march=native`, it may also be good to enable them by default, since they are checked at runtime anyway.
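
A minimal sketch of such a guard (the struct below stands in for ggml's internal `ggml_arm_arch_features`; this is not the final implementation):

```c
// Report i8mm only when the build compiled the optimized kernels
// (compile-time macro) AND the CPU supports them (runtime value).
#include <stdio.h>

struct arm_arch_features { int has_i8mm; };
static struct arm_arch_features arch_features = { 1 }; // pretend runtime detection succeeded

int ggml_cpu_has_matmul_int8(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return arch_features.has_i8mm;
#else
    return 0; // kernels compiled out: never select the i8mm repack path
#endif
}

int main(void) {
    printf("i8mm usable: %d\n", ggml_cpu_has_matmul_int8());
    return 0;
}
```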

slaren (Collaborator) commented Nov 23, 2024

I will reopen this since the issue with the ARM features not being automatically enabled with the default native builds is still not addressed.

@slaren slaren reopened this Nov 23, 2024

chaxu01 (Contributor) commented Nov 24, 2024

@slaren One option for fixing this could be to update the CMake script to conditionally add the appropriate compile options based on the detected CPU capabilities. This would ensure that the required ARM features, such as `-march=armv8.2a` and specific flags like `i8mm`, are enabled when supported by the hardware.

slaren (Collaborator) commented Nov 24, 2024

Yes, that should work. Ideally it would work in the same way as x86: automatically enable all the features of the local CPU when `GGML_NATIVE` is enabled, and otherwise allow individual features to be enabled manually.

chaxu01 (Contributor) commented Nov 25, 2024

Thanks for confirming. I'll work on updating the CMake script to ensure that, when `GGML_NATIVE` is enabled, all supported ARM features of the local CPU are automatically detected and enabled.

chaxu01 (Contributor) commented Nov 25, 2024

Created PR #10487
