Bug: Severe Performance Degradation on Q4_0 CPU-only with MacOS / Apple Silicon M2, after PR#9921 / Version 4081 #10435
Comments
The problem seems to be that Q4_0 is being converted to Q4_0_4_8, which has bad performance on this CPU. If patched so that it is converted to Q4_0_4_4 instead, the performance is as expected, e.g.:

```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 96a16dfb..ff27f9cb 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3549,7 +3549,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
         return GGML_TYPE_Q4_0_8_8;
     }
     if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-        return GGML_TYPE_Q4_0_4_8;
+        //return GGML_TYPE_Q4_0_4_8;
     }
     if (ggml_cpu_has_neon()) {
         return GGML_TYPE_Q4_0_4_4;
```

@chaxu01 does the logic for choosing the type to repack to need to be updated?
I can confirm that on Apple,
@AndreasKunar: If you build with cmake, try adding something like `-DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod` to the build command. This causes `__ARM_FEATURE_MATMUL_INT8` to be defined.
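Put together, such a build might look like the following. The `-march` flags are taken from the comment above; the out-of-tree `cmake -B build` pattern is a common CMake convention, not quoted from the llama.cpp docs:

```sh
# configure with the ARM features enabled explicitly
cmake -B build \
  -DCMAKE_C_FLAGS="-march=armv8.2a+i8mm+dotprod" \
  -DCMAKE_CXX_FLAGS="-march=armv8.2a+i8mm+dotprod"

# build only the benchmark binary
cmake --build build --target llama-bench -j
```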
@AndreasKunar I would like to confirm whether you built with the `arch=...+i8mm` option. Due to runtime CPU feature detection, we decided to repack into Q4_0_4_8 as it is supported by the CPU. However, when executing the kernel, the reference implementation is used because the optimized kernel was not included in the build.
@eddnjjn + @chaxu01: I normally do not build with cmake; I built, as in my initial report, with `make -j llama-bench` and no switches. This is also in accordance with the llama performance instructions (discussion #4167, which is essential and still current). I tried a cmake build with your `-march` flags - it does not degrade performance:

build: 1bb30bf (4149)

I understand that the problem is due to the repack-on-load code (which is great because it avoids having to manually convert Q4_0 to Q4_0_4_4 or _8). It's a great idea, thanks, and it also helps a lot on Macs with containers/VMs. This is why I initially included the Q4_0_4_4 performance numbers (without repack). But if special build flags, or cmake instead of make, are now needed (especially with very common hardware like the Apple silicon Macs), I think this needs to be documented thoroughly in the llama.cpp build instructions and also in the Mac performance comparison build instructions (discussion #4167).

I also understand that in the future the build might best be standardized on cmake, with make deprecated (see the ongoing discussion in issue #10268). But until now, and as documented in the build instructions, "make" always worked nicely for common hardware+OS combinations. If it does not work anymore, it is a severe bug and needs to be documented thoroughly!

P.S.: please tell me if I can support or help you. I'm currently on vacation and a bit slow in responding, apologies.
@AndreasKunar Thanks for the verification. I agree that the llama.cpp build instructions could be updated to better reflect the runtime repack case.
At least the
I will reopen this since the issue with the ARM features not being automatically enabled with the default native builds is still not addressed. |
@slaren One option for fixing this could be to update the CMake script to conditionally add the appropriate compile options based on the detected CPU capabilities. This would ensure that the required ARM features, such as `-march=armv8.2a` and any specific flags like `i8mm`, are enabled when supported by the hardware.
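A rough sketch of what such a CMake change could look like, stated as an assumption rather than the actual content of the eventual fix:

```cmake
# Hypothetical sketch only: enable the local CPU's ARM features when
# building natively. Not the actual llama.cpp CMake code.
if (GGML_NATIVE AND CMAKE_SYSTEM_PROCESSOR MATCHES "arm|aarch64|ARM64")
    include(CheckCXXCompilerFlag)
    check_cxx_compiler_flag("-mcpu=native" COMPILER_SUPPORTS_MCPU_NATIVE)
    if (COMPILER_SUPPORTS_MCPU_NATIVE)
        # -mcpu=native lets the compiler enable every feature of the
        # local CPU (i8mm, dotprod, ...), which in turn defines the
        # matching __ARM_FEATURE_* macros such as __ARM_FEATURE_MATMUL_INT8.
        add_compile_options(-mcpu=native)
    endif()
endif()
```

This mirrors how native x86 builds pick up the host CPU's features automatically.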
Yes, that should work. Ideally it would work in the same way as x86, and automatically enable all the features of the local CPU when GGML_NATIVE is enabled.
Thanks for confirming. I'll work on updating the CMake script to ensure that, when GGML_NATIVE is enabled, all supported ARM features of the local CPU are automatically detected and enabled.
Created PR #10487 |
What happened?
Prior to PR #9921 / version 4081, the `-ngl 0` Q4_0 llama-bench performance was significantly higher (more than 10x) than afterwards.
(hardware: Apple MacBook Air M2 10 GPU 24GB RAM)
before the PR:

```sh
make clean
git checkout ae8de6d
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
```

build: ae8de6d (4080)

versions after the PR (including current):

```sh
make clean
git checkout 1607a5e
make -j llama-bench
./llama-bench -p 512 -n 128 -t 4 -ngl 0 -m ...model...
```

build: 1607a5e (4081)
The variations except for -ngl 0 / Q4_0 might be due to the MacBook Air's thermals.
Name and Version
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix
macOS Sequoia 15.1.1
What operating system are you seeing the problem on?
Mac
Relevant log output
No response