
cmake : fix ARM feature detection #10543

Merged: 1 commit from gg/cmake-fix-arm into master on Nov 28, 2024

Conversation

ggerganov (Member) commented Nov 27, 2024

Continues #10487.

Fix MSVC I8MM feature detection + add logs.

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 27, 2024.
slaren (Member) commented Nov 27, 2024

I am not sure that I understand what is happening. Why does the ARM feature detection in ggml report that i8mm is supported if it is not? Why does the repacked Q4_0 work with i8mm enabled?

ggerganov (Member, Author) commented Nov 27, 2024

> Why does the ARM feature detection in ggml report that i8mm is supported if it is not?

So far I have 3 datapoints:

  • On M1 Pro, has_i8mm in ggml-cpu.c is determined as false. On master, CMake was incorrectly setting __ARM_FEATURE_MATMUL_INT8; that should be fixed on this branch.
  • On M2 Ultra, has_i8mm in ggml-cpu.c is determined as true, and CMake sets __ARM_FEATURE_MATMUL_INT8 both on master and in this PR. However, the make -j && ./bin/test-backend-ops -o MUL_MAT -b CPU tests keep failing:
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): sentinel mismatch: sent_3 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.033587799 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
  • On M4, both master and this PR work correctly. I suspect it also works on M3 Max.

I don't understand yet why M2 Ultra does not work.
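(For readers following along: whether ggml compiles its i8mm kernels is gated by the compiler-defined __ARM_FEATURE_MATMUL_INT8 macro, which follows from the -march flags, not from what the CPU can actually execute. A minimal standalone probe to see what a given set of flags enables; this is only an illustration, not code from the PR:

/* probe.c: prints whether the compiler flags enabled i8mm.
   Reflects compile-time flags only, not actual CPU support. */
#include <stdio.h>

int main(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    puts("__ARM_FEATURE_MATMUL_INT8 defined: i8mm code paths compiled in");
#else
    puts("__ARM_FEATURE_MATMUL_INT8 not defined: i8mm code paths disabled");
#endif
    return 0;
}

Building this twice, with and without +i8mm in -march, shows the gate flip. A mismatch between this compile-time answer and the CPU's actual capability is the M1 Pro failure mode described above.)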

slaren (Member) commented Nov 27, 2024

I wonder if OS support is also necessary. I get this on M3 Max:

$ sysctl -a | grep hw.optional.arm.
hw.optional.arm.FEAT_FlagM: 1
hw.optional.arm.FEAT_FlagM2: 1
hw.optional.arm.FEAT_FHM: 1
hw.optional.arm.FEAT_DotProd: 1
hw.optional.arm.FEAT_SHA3: 1
hw.optional.arm.FEAT_RDM: 1
hw.optional.arm.FEAT_LSE: 1
hw.optional.arm.FEAT_SHA256: 1
hw.optional.arm.FEAT_SHA512: 1
hw.optional.arm.FEAT_SHA1: 1
hw.optional.arm.FEAT_AES: 1
hw.optional.arm.FEAT_PMULL: 1
hw.optional.arm.FEAT_SPECRES: 0
hw.optional.arm.FEAT_SB: 1
hw.optional.arm.FEAT_FRINTTS: 1
hw.optional.arm.FEAT_LRCPC: 1
hw.optional.arm.FEAT_LRCPC2: 1
hw.optional.arm.FEAT_FCMA: 1
hw.optional.arm.FEAT_JSCVT: 1
hw.optional.arm.FEAT_PAuth: 1
hw.optional.arm.FEAT_PAuth2: 1
hw.optional.arm.FEAT_FPAC: 1
hw.optional.arm.FEAT_DPB: 1
hw.optional.arm.FEAT_DPB2: 1
hw.optional.arm.FEAT_BF16: 1
hw.optional.arm.FEAT_I8MM: 1
hw.optional.arm.FEAT_WFxT: 0
hw.optional.arm.FEAT_RPRES: 1
hw.optional.arm.FEAT_ECV: 1
hw.optional.arm.FEAT_AFP: 1
hw.optional.arm.FEAT_LSE2: 1
hw.optional.arm.FEAT_CSV2: 1
hw.optional.arm.FEAT_CSV3: 1
hw.optional.arm.FEAT_DIT: 1
hw.optional.arm.FEAT_FP16: 1
hw.optional.arm.FEAT_SSBS: 1
hw.optional.arm.FEAT_BTI: 1
hw.optional.arm.FEAT_SME: 0
hw.optional.arm.FEAT_SME2: 0
hw.optional.arm.SME_F32F32: 0
hw.optional.arm.SME_BI32I32: 0
hw.optional.arm.SME_B16F32: 0
hw.optional.arm.SME_F16F32: 0
hw.optional.arm.SME_I8I32: 0
hw.optional.arm.SME_I16I32: 0
hw.optional.arm.FEAT_SME_F64F64: 0
hw.optional.arm.FEAT_SME_I16I64: 0
hw.optional.arm.FP_SyncExceptions: 1
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_2_fhm: 1
hw.optional.armv8_2_sha512: 1
hw.optional.armv8_2_sha3: 1
hw.optional.armv8_3_compnum: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_gpi: 1
hw.optional.arm64: 1
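The same flags can be queried programmatically. A minimal sketch using macOS's sysctlbyname, which may or may not match how ggml-cpu.c actually performs its runtime check:

/* i8mm_check.c: runtime query of the FEAT_I8MM flag on Apple silicon.
   Illustrative only; error handling kept minimal. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void) {
    int has_i8mm = 0;
    size_t len = sizeof(has_i8mm);
    /* a missing key means the feature is not present on this machine */
    if (sysctlbyname("hw.optional.arm.FEAT_I8MM", &has_i8mm, &len, NULL, 0) != 0) {
        has_i8mm = 0;
    }
    printf("FEAT_I8MM: %d\n", has_i8mm);
    return 0;
}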

ggerganov (Member, Author) commented Nov 27, 2024

M2 Ultra also reports FEAT_I8MM with this command. Here are the only differences (a diff against the M3 Max output above), followed by the full M2 Ultra output:

28c28
< hw.optional.arm.FEAT_RPRES: 1
---
> hw.optional.arm.FEAT_RPRES: 0
30c30
< hw.optional.arm.FEAT_AFP: 1
---
> hw.optional.arm.FEAT_AFP: 0
48a49
> hw.optional.arm.caps: 868632327146696703
hw.optional.arm.FEAT_FlagM: 1
hw.optional.arm.FEAT_FlagM2: 1
hw.optional.arm.FEAT_FHM: 1
hw.optional.arm.FEAT_DotProd: 1
hw.optional.arm.FEAT_SHA3: 1
hw.optional.arm.FEAT_RDM: 1
hw.optional.arm.FEAT_LSE: 1
hw.optional.arm.FEAT_SHA256: 1
hw.optional.arm.FEAT_SHA512: 1
hw.optional.arm.FEAT_SHA1: 1
hw.optional.arm.FEAT_AES: 1
hw.optional.arm.FEAT_PMULL: 1
hw.optional.arm.FEAT_SPECRES: 0
hw.optional.arm.FEAT_SB: 1
hw.optional.arm.FEAT_FRINTTS: 1
hw.optional.arm.FEAT_LRCPC: 1
hw.optional.arm.FEAT_LRCPC2: 1
hw.optional.arm.FEAT_FCMA: 1
hw.optional.arm.FEAT_JSCVT: 1
hw.optional.arm.FEAT_PAuth: 1
hw.optional.arm.FEAT_PAuth2: 1
hw.optional.arm.FEAT_FPAC: 1
hw.optional.arm.FEAT_DPB: 1
hw.optional.arm.FEAT_DPB2: 1
hw.optional.arm.FEAT_BF16: 1
hw.optional.arm.FEAT_I8MM: 1
hw.optional.arm.FEAT_WFxT: 0
hw.optional.arm.FEAT_RPRES: 0
hw.optional.arm.FEAT_ECV: 1
hw.optional.arm.FEAT_AFP: 0
hw.optional.arm.FEAT_LSE2: 1
hw.optional.arm.FEAT_CSV2: 1
hw.optional.arm.FEAT_CSV3: 1
hw.optional.arm.FEAT_DIT: 1
hw.optional.arm.FEAT_FP16: 1
hw.optional.arm.FEAT_SSBS: 1
hw.optional.arm.FEAT_BTI: 1
hw.optional.arm.FEAT_SME: 0
hw.optional.arm.FEAT_SME2: 0
hw.optional.arm.SME_F32F32: 0
hw.optional.arm.SME_BI32I32: 0
hw.optional.arm.SME_B16F32: 0
hw.optional.arm.SME_F16F32: 0
hw.optional.arm.SME_I8I32: 0
hw.optional.arm.SME_I16I32: 0
hw.optional.arm.FEAT_SME_F64F64: 0
hw.optional.arm.FEAT_SME_I16I64: 0
hw.optional.arm.FP_SyncExceptions: 1
hw.optional.arm.caps: 868632327146696703
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_2_fhm: 1
hw.optional.armv8_2_sha512: 1
hw.optional.armv8_2_sha3: 1
hw.optional.armv8_3_compnum: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_gpi: 1
hw.optional.arm64: 1

ggerganov (Member, Author):
Btw, it's weird because:

# this passes all tests on M2 Ultra
make -j && ./bin/test-backend-ops -o MUL_MAT -b Metal

# this fails the 2 Q4_0 tests as shown earlier
make -j && ./bin/test-backend-ops -o MUL_MAT -b CPU

But both commands would run the CPU backend, correct?

ggerganov (Member, Author) commented Nov 27, 2024

> On M4, both master and this PR work correctly. I suspect it also works on M3 Max.

Correction: the -b CPU test also fails on M4, on both master and this PR. Does this work for you?

rm -rf build-mm
mkdir build-mm
cd build-mm
cmake ..
make -j && ./bin/test-backend-ops -o MUL_MAT -b CPU

slaren (Member) commented Nov 27, 2024

It also fails for me. The only way I can see this happening is if the second matrix multiplication on the CPU produces different results, and I don't understand how that could happen.

  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): sentinel mismatch: sent_3 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): [MUL_MAT] NMSE = 0.124382206 > 0.000500000 sentinel mismatch: sent_3 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): [MUL_MAT] NMSE = 0.135243290 > 0.000500000 sentinel mismatch: sent_3 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): [MUL_MAT] NMSE = 0.130680086 > 0.000500000 sentinel mismatch: sent_3 FAIL

ggerganov (Member, Author) commented Nov 27, 2024

I'm almost positive that it is related to the compile flags somehow, because it does not fail before 25669aa. Also, using the make build, it does not fail even on master:

cd llama.cpp
make -j tests && ./tests/test-backend-ops -o MUL_MAT -b CPU

The only difference is that with make we don't pass -march explicitly, while with CMake we currently pass -march=armv8.2a+dotprod+i8mm.

Edit: Ah, but the make build likely does not enable the I8MM feature. So it could be a bug in the implementation then.
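A quick way to confirm which feature macros a given -march implies, assuming a GCC/Clang-style compiler driver (flag spelling may vary by compiler):

# dump the compiler's predefined macros for these flags and look for i8mm
echo | cc -march=armv8.2-a+i8mm -dM -E - | grep MATMUL

If the flags really enable i8mm, this prints #define __ARM_FEATURE_MATMUL_INT8 1.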

slaren (Member) commented Nov 27, 2024

In case it isn't clear, a sentinel mismatch means that there is a buffer overflow. I tried allocating a different buffer per tensor to see where it happens, and I got this:

==16634==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x00010510a980 at pc 0x000101c20e28 bp 0x000170455bb0 sp 0x000170455360
WRITE of size 64 at 0x00010510a980 thread T11704
    #0 0x101c20e24 in __asan_memcpy+0x440 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x50e24)
    #1 0x100a82594 in ggml_compute_forward_mul_mat_one_chunk ggml-cpu.c:7417
    #2 0x100a4dfe4 in ggml_compute_forward_mul_mat ggml-cpu.c:7633
    #3 0x100a48f30 in ggml_compute_forward ggml-cpu.c:12410
    #4 0x100a44ae4 in ggml_graph_compute_thread ggml-cpu.c:13420
    #5 0x100a47a68 in ggml_graph_compute_secondary_thread ggml-cpu.c:13540
    #6 0x18ff36f90 in _pthread_start+0x84 (libsystem_pthread.dylib:arm64e+0x6f90)
    #7 0x18ff31d30 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x1d30)

To reproduce:

diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index da66ed85..31fe4c33 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -473,6 +473,10 @@ struct test_case {
             return false;
         }

+        for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+            t->data = malloc(ggml_nbytes(t));
+        }
+
         // build graph
         ggml_build_forward_expand(gf, out);
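For context on the sentinel mechanism itself, here is a stripped-down, hypothetical version of the idea (guard bytes with a known pattern placed after each buffer, checked after the op runs); it is not the actual test-backend-ops code:

/* sentinel_demo.c: hypothetical illustration of sentinel-based overflow
   detection; not the actual test-backend-ops implementation. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUARD    64
#define SENTINEL 0xA5

int main(void) {
    size_t payload = 256;
    unsigned char *buf = malloc(payload + GUARD);
    memset(buf + payload, SENTINEL, GUARD);   /* arm the guard region */

    /* simulate an out-of-bounds write like the one ASan reported above */
    memset(buf, 0, payload + 8);              /* 8 bytes past the end */

    for (size_t i = 0; i < GUARD; i++) {
        if (buf[payload + i] != SENTINEL) {
            printf("sentinel mismatch at guard byte %zu\n", i);
            free(buf);
            return 1;
        }
    }
    puts("sentinels intact");
    free(buf);
    return 0;
}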

chaxu01 (Collaborator) commented Nov 27, 2024

> On M2 Ultra, has_i8mm in ggml-cpu.c is determined as true, and CMake sets __ARM_FEATURE_MATMUL_INT8 both on master and in this PR. However, the make -j && ./bin/test-backend-ops -o MUL_MAT -b CPU tests keep failing:

#10487 (comment)

chaxu01 (Collaborator) commented Nov 28, 2024

The changes look good to me. I'll approve if requested or once more feedback is gathered.

ggerganov merged commit eea986f into master on Nov 28, 2024. 57 checks passed.
ggerganov deleted the gg/cmake-fix-arm branch on November 28, 2024, at 12:56.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Dec 20, 2024.