
cmake : fix ARM feature detection #10543

Merged: 1 commit from gg/cmake-fix-arm into master on Nov 28, 2024

Conversation

ggerganov (Member) commented Nov 27, 2024

Continues #10487.

Fix MSVC I8MM feature detection + add logs.

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 27, 2024.
slaren (Member) commented Nov 27, 2024

I am not sure that I understand what is happening. Why does the ARM feature detection in ggml report that i8mm is supported if it is not? Why does the repacked Q4_0 work with i8mm enabled?

ggerganov (Member, Author) commented Nov 27, 2024

> Why does the ARM feature detection in ggml report that i8mm is supported if it is not?

So far I have 3 datapoints:

  • On M1 Pro, has_i8mm in ggml-cpu.c is determined as false. On master, CMake was incorrectly setting __ARM_FEATURE_MATMUL_INT8; that should be fixed on this branch.
  • On M2 Ultra, has_i8mm in ggml-cpu.c is determined as true, and CMake sets __ARM_FEATURE_MATMUL_INT8 both on master and in this PR. However, the make -j && ./bin/test-backend-ops -o MUL_MAT -b CPU tests keep failing:
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): sentinel mismatch: sent_3 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.033587799 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
  • On M4, both master and this PR work correctly. I suspect it also works on M3 Max.

I don't understand yet why M2 Ultra does not work.
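(For readers following along: whether ggml compiles its i8mm kernels is gated by the compiler-defined __ARM_FEATURE_MATMUL_INT8 macro, which follows from the -march flags, not from what the CPU can actually execute. A minimal standalone probe to see what a given set of flags enables; this is only an illustration, not code from the PR:

/* probe.c: prints whether the compiler flags enabled i8mm.
   Reflects compile-time flags only, not actual CPU support. */
#include <stdio.h>

int main(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    puts("__ARM_FEATURE_MATMUL_INT8 defined: i8mm code paths compiled in");
#else
    puts("__ARM_FEATURE_MATMUL_INT8 not defined: i8mm code paths disabled");
#endif
    return 0;
}

Building this twice, with and without +i8mm in -march, shows the gate flip. A mismatch between this compile-time answer and the CPU's actual capability is the M1 Pro failure mode described above.)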

slaren (Member) commented Nov 27, 2024

I wonder if OS support is also necessary. I get this on M3 Max:

$ sysctl -a | grep hw.optional.arm.
hw.optional.arm.FEAT_FlagM: 1
hw.optional.arm.FEAT_FlagM2: 1
hw.optional.arm.FEAT_FHM: 1
hw.optional.arm.FEAT_DotProd: 1
hw.optional.arm.FEAT_SHA3: 1
hw.optional.arm.FEAT_RDM: 1
hw.optional.arm.FEAT_LSE: 1
hw.optional.arm.FEAT_SHA256: 1
hw.optional.arm.FEAT_SHA512: 1
hw.optional.arm.FEAT_SHA1: 1
hw.optional.arm.FEAT_AES: 1
hw.optional.arm.FEAT_PMULL: 1
hw.optional.arm.FEAT_SPECRES: 0
hw.optional.arm.FEAT_SB: 1
hw.optional.arm.FEAT_FRINTTS: 1
hw.optional.arm.FEAT_LRCPC: 1
hw.optional.arm.FEAT_LRCPC2: 1
hw.optional.arm.FEAT_FCMA: 1
hw.optional.arm.FEAT_JSCVT: 1
hw.optional.arm.FEAT_PAuth: 1
hw.optional.arm.FEAT_PAuth2: 1
hw.optional.arm.FEAT_FPAC: 1
hw.optional.arm.FEAT_DPB: 1
hw.optional.arm.FEAT_DPB2: 1
hw.optional.arm.FEAT_BF16: 1
hw.optional.arm.FEAT_I8MM: 1
hw.optional.arm.FEAT_WFxT: 0
hw.optional.arm.FEAT_RPRES: 1
hw.optional.arm.FEAT_ECV: 1
hw.optional.arm.FEAT_AFP: 1
hw.optional.arm.FEAT_LSE2: 1
hw.optional.arm.FEAT_CSV2: 1
hw.optional.arm.FEAT_CSV3: 1
hw.optional.arm.FEAT_DIT: 1
hw.optional.arm.FEAT_FP16: 1
hw.optional.arm.FEAT_SSBS: 1
hw.optional.arm.FEAT_BTI: 1
hw.optional.arm.FEAT_SME: 0
hw.optional.arm.FEAT_SME2: 0
hw.optional.arm.SME_F32F32: 0
hw.optional.arm.SME_BI32I32: 0
hw.optional.arm.SME_B16F32: 0
hw.optional.arm.SME_F16F32: 0
hw.optional.arm.SME_I8I32: 0
hw.optional.arm.SME_I16I32: 0
hw.optional.arm.FEAT_SME_F64F64: 0
hw.optional.arm.FEAT_SME_I16I64: 0
hw.optional.arm.FP_SyncExceptions: 1
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_2_fhm: 1
hw.optional.armv8_2_sha512: 1
hw.optional.armv8_2_sha3: 1
hw.optional.armv8_3_compnum: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_gpi: 1
hw.optional.arm64: 1
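The same flags can be queried programmatically. A minimal sketch using macOS's sysctlbyname, which may or may not match how ggml-cpu.c actually performs its runtime check:

/* i8mm_check.c: runtime query of the FEAT_I8MM flag on Apple silicon.
   Illustrative only; error handling kept minimal. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void) {
    int has_i8mm = 0;
    size_t len = sizeof(has_i8mm);
    /* a missing key means the feature is not present on this machine */
    if (sysctlbyname("hw.optional.arm.FEAT_I8MM", &has_i8mm, &len, NULL, 0) != 0) {
        has_i8mm = 0;
    }
    printf("FEAT_I8MM: %d\n", has_i8mm);
    return 0;
}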

ggerganov (Member, Author) commented Nov 27, 2024

M2 Ultra also reports FEAT_I8MM with this command. Here are the only differences (a diff against the M3 Max output above), followed by the full M2 Ultra output:

28c28
< hw.optional.arm.FEAT_RPRES: 1
---
> hw.optional.arm.FEAT_RPRES: 0
30c30
< hw.optional.arm.FEAT_AFP: 1
---
> hw.optional.arm.FEAT_AFP: 0
48a49
> hw.optional.arm.caps: 868632327146696703
hw.optional.arm.FEAT_FlagM: 1
hw.optional.arm.FEAT_FlagM2: 1
hw.optional.arm.FEAT_FHM: 1
hw.optional.arm.FEAT_DotProd: 1
hw.optional.arm.FEAT_SHA3: 1
hw.optional.arm.FEAT_RDM: 1
hw.optional.arm.FEAT_LSE: 1
hw.optional.arm.FEAT_SHA256: 1
hw.optional.arm.FEAT_SHA512: 1
hw.optional.arm.FEAT_SHA1: 1
hw.optional.arm.FEAT_AES: 1
hw.optional.arm.FEAT_PMULL: 1
hw.optional.arm.FEAT_SPECRES: 0
hw.optional.arm.FEAT_SB: 1
hw.optional.arm.FEAT_FRINTTS: 1
hw.optional.arm.FEAT_LRCPC: 1
hw.optional.arm.FEAT_LRCPC2: 1
hw.optional.arm.FEAT_FCMA: 1
hw.optional.arm.FEAT_JSCVT: 1
hw.optional.arm.FEAT_PAuth: 1
hw.optional.arm.FEAT_PAuth2: 1
hw.optional.arm.FEAT_FPAC: 1
hw.optional.arm.FEAT_DPB: 1
hw.optional.arm.FEAT_DPB2: 1
hw.optional.arm.FEAT_BF16: 1
hw.optional.arm.FEAT_I8MM: 1
hw.optional.arm.FEAT_WFxT: 0
hw.optional.arm.FEAT_RPRES: 0
hw.optional.arm.FEAT_ECV: 1
hw.optional.arm.FEAT_AFP: 0
hw.optional.arm.FEAT_LSE2: 1
hw.optional.arm.FEAT_CSV2: 1
hw.optional.arm.FEAT_CSV3: 1
hw.optional.arm.FEAT_DIT: 1
hw.optional.arm.FEAT_FP16: 1
hw.optional.arm.FEAT_SSBS: 1
hw.optional.arm.FEAT_BTI: 1
hw.optional.arm.FEAT_SME: 0
hw.optional.arm.FEAT_SME2: 0
hw.optional.arm.SME_F32F32: 0
hw.optional.arm.SME_BI32I32: 0
hw.optional.arm.SME_B16F32: 0
hw.optional.arm.SME_F16F32: 0
hw.optional.arm.SME_I8I32: 0
hw.optional.arm.SME_I16I32: 0
hw.optional.arm.FEAT_SME_F64F64: 0
hw.optional.arm.FEAT_SME_I16I64: 0
hw.optional.arm.FP_SyncExceptions: 1
hw.optional.arm.caps: 868632327146696703
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_2_fhm: 1
hw.optional.armv8_2_sha512: 1
hw.optional.armv8_2_sha3: 1
hw.optional.armv8_3_compnum: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_gpi: 1
hw.optional.arm64: 1

ggerganov (Member, Author):
Btw, it's weird because:

# this passes all tests on M2 Ultra
make -j && ./bin/test-backend-ops -o MUL_MAT -b Metal

# this fails the 2 Q4_0 tests as shown earlier
make -j && ./bin/test-backend-ops -o MUL_MAT -b CPU

But both commands would run the CPU backend, correct?

ggerganov (Member, Author) commented Nov 27, 2024

> On M4, both master and this PR work correctly. I suspect it also works on M3 Max.

Correction: the -b CPU test also fails on M4, on both master and this PR. Does this work for you?

rm -rf build-mm
mkdir build-mm
cd build-mm
cmake ..
make -j && ./bin/test-backend-ops -o MUL_MAT -b CPU

slaren (Member) commented Nov 27, 2024

It also fails for me. The only way I can see this happening is if the second matrix multiplication on the CPU produces different results, and I don't understand how that could happen.

  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): sentinel mismatch: sent_3 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): [MUL_MAT] NMSE = 0.124382206 > 0.000500000 sentinel mismatch: sent_3 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): [MUL_MAT] NMSE = 0.135243290 > 0.000500000 sentinel mismatch: sent_3 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): [MUL_MAT] NMSE = 0.130680086 > 0.000500000 sentinel mismatch: sent_3 FAIL

ggerganov (Member, Author) commented Nov 27, 2024

I'm almost positive that it is related to the compile flags somehow, because it does not fail before 25669aa. Also, using the make build, it does not fail even on master:

cd llama.cpp
make -j tests && ./tests/test-backend-ops -o MUL_MAT -b CPU

The only difference is that with make we don't pass -march explicitly, while with CMake we currently pass -march=armv8.2a+dotprod+i8mm.

Edit: Ah, but the make build likely does not enable the I8MM feature. So it could be a bug in the implementation then.
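A quick way to confirm which feature macros a given -march implies, assuming a GCC/Clang-style compiler driver (flag spelling may vary by compiler):

# dump the compiler's predefined macros for these flags and look for i8mm
echo | cc -march=armv8.2-a+i8mm -dM -E - | grep MATMUL

If the flags really enable i8mm, this prints #define __ARM_FEATURE_MATMUL_INT8 1.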

slaren (Member) commented Nov 27, 2024

In case it isn't clear, a sentinel mismatch means that there is a buffer overflow. I tried allocating a different buffer per tensor to see where it happens, and I got this:

==16634==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x00010510a980 at pc 0x000101c20e28 bp 0x000170455bb0 sp 0x000170455360
WRITE of size 64 at 0x00010510a980 thread T11704
    #0 0x101c20e24 in __asan_memcpy+0x440 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x50e24)
    #1 0x100a82594 in ggml_compute_forward_mul_mat_one_chunk ggml-cpu.c:7417
    #2 0x100a4dfe4 in ggml_compute_forward_mul_mat ggml-cpu.c:7633
    #3 0x100a48f30 in ggml_compute_forward ggml-cpu.c:12410
    #4 0x100a44ae4 in ggml_graph_compute_thread ggml-cpu.c:13420
    #5 0x100a47a68 in ggml_graph_compute_secondary_thread ggml-cpu.c:13540
    #6 0x18ff36f90 in _pthread_start+0x84 (libsystem_pthread.dylib:arm64e+0x6f90)
    #7 0x18ff31d30 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x1d30)

To reproduce:

diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index da66ed85..31fe4c33 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -473,6 +473,10 @@ struct test_case {
             return false;
         }

+        for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+            t->data = malloc(ggml_nbytes(t));
+        }
+
         // build graph
         ggml_build_forward_expand(gf, out);
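For context on the sentinel mechanism itself, here is a stripped-down, hypothetical version of the idea (guard bytes with a known pattern placed after each buffer, checked after the op runs); it is not the actual test-backend-ops code:

/* sentinel_demo.c: hypothetical illustration of sentinel-based overflow
   detection; not the actual test-backend-ops implementation. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUARD    64
#define SENTINEL 0xA5

int main(void) {
    size_t payload = 256;
    unsigned char *buf = malloc(payload + GUARD);
    memset(buf + payload, SENTINEL, GUARD);   /* arm the guard region */

    /* simulate an out-of-bounds write like the one ASan reported above */
    memset(buf, 0, payload + 8);              /* 8 bytes past the end */

    for (size_t i = 0; i < GUARD; i++) {
        if (buf[payload + i] != SENTINEL) {
            printf("sentinel mismatch at guard byte %zu\n", i);
            free(buf);
            return 1;
        }
    }
    puts("sentinels intact");
    free(buf);
    return 0;
}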

chaxu01 (Collaborator) commented Nov 27, 2024

> On M2 Ultra, has_i8mm in ggml-cpu.c is determined as true, and CMake sets __ARM_FEATURE_MATMUL_INT8 both on master and in this PR. However, the make -j && ./bin/test-backend-ops -o MUL_MAT -b CPU tests keep failing:

#10487 (comment)

chaxu01 (Collaborator) commented Nov 28, 2024

The changes look good to me. I'll approve if requested or once more feedback is gathered.

ggerganov merged commit eea986f into master on Nov 28, 2024. 57 checks passed.
ggerganov deleted the gg/cmake-fix-arm branch on November 28, 2024, at 12:56.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Dec 20, 2024.