
ggml-cpu: support IQ4_NL_4_4 by runtime repack #10541

Open · wants to merge 2 commits into master
Conversation

FanShupei (Contributor)
Supersedes #10196.

This PR implements IQ4_NL runtime repack for the CPU backend. Currently only IQ4_NL_4_4 for Arm Neon is supported, implemented with intrinsics. If you are curious about how these intrinsics were derived (and about the many potential optimization opportunities), please see #10196 for more information; it contains a lengthy comparison between the intrinsics version and the original asm version.

Only runtime repack is supported, not llama-quantize, since based on the discussion in #10196 online repack is the preferred flow. Online repack for IQ4_NL is noticeably slower than for Q4_0, though I haven't done any rigorous measurements. Static quantization support could be added later if anyone really needs it.
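To make the repack concrete, here is a minimal C sketch of what interleaving four IQ4_NL blocks into one IQ4_NL_4_4 block means. The struct layouts and the 4-byte interleave width are simplified assumptions for illustration, not the exact ggml definitions.

```c
#include <stdint.h>
#include <string.h>

// Illustrative block layouts (not the exact ggml definitions).
// IQ4_NL: 32 weights per block, 4-bit non-linear indices plus one fp16 scale.
#define QK4_NL 32

typedef uint16_t ggml_half;              // fp16 stored as raw bits here

typedef struct {
    ggml_half d;                         // per-block scale
    uint8_t   qs[QK4_NL / 2];            // 32 x 4-bit indices, packed 2 per byte
} block_iq4_nl;

typedef struct {
    ggml_half d[4];                      // one scale per source row
    uint8_t   qs[4 * QK4_NL / 2];        // interleaved indices from 4 rows
} block_iq4_nlx4;

// Repack 4 blocks (one from each of 4 consecutive rows) into a single
// interleaved block, so a NEON kernel can work on 4 rows per loop iteration.
// A 4-byte interleave chunk is assumed here.
static block_iq4_nlx4 make_block_iq4_nlx4(const block_iq4_nl in[4]) {
    block_iq4_nlx4 out;
    const int chunk = 4;                             // bytes per interleaved chunk
    for (int r = 0; r < 4; r++) {
        out.d[r] = in[r].d;
    }
    for (int i = 0; i < (int) sizeof(out.qs); i += chunk) {
        int src_row    = (i / chunk) % 4;            // which of the 4 rows this chunk comes from
        int src_offset = (i / (4 * chunk)) * chunk;  // position within that row's block
        memcpy(&out.qs[i], &in[src_row].qs[src_offset], chunk);
    }
    return out;
}
```

With such a layout the GEMM/GEMV kernels can reuse each loaded activation vector across four output rows at once, which is where the bulk of the speedup comes from.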

FanShupei (Contributor, Author)
Performance Evaluation

The repack gives roughly a 3x speedup for IQ4_NL. Tested on a Mac M2 with GGML_METAL=off.

The previous PR #10196 contains more evaluation results.

This PR

| model | size | params | backend | threads | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 98.49 ± 0.82 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 97.96 ± 0.83 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 95.77 ± 0.10 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 34.92 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 186.77 ± 0.52 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 186.40 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 181.26 ± 0.08 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 61.11 ± 0.04 |

build: f56013d (4193)

Master

| model | size | params | backend | threads | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 31.41 ± 0.16 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 31.51 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 31.12 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 21.13 ± 0.01 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 61.33 ± 0.03 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 61.24 ± 0.00 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 60.58 ± 0.20 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 41.13 ± 0.03 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 127.61 ± 0.87 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 127.35 ± 0.11 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 122.99 ± 0.05 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 37.04 ± 0.01 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 247.89 ± 0.81 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 247.35 ± 0.15 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 238.69 ± 0.24 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 65.17 ± 0.04 |

build: 4a57d36 (4192)

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 27, 2024
slaren (Collaborator) left a comment
The implementation looks good. I see a ~2x pp speedup on M3 Max and it doesn't seem to affect the load time too badly.

FanShupei (Contributor, Author)
I'm copying my comment from #10196 here.

I worry it won't work as expected if we switch to intrinsics. If the features are not enabled at compile time, the intrinsics won't compile. If they are enabled at compile time, the compiler may introduce SIMD instructions in the base implementation through auto-vectorization. This is why the CI fails, and I currently have no idea how to fix it.

I'm afraid our current runtime dispatch mechanism doesn't actually work and that no one really tests it. The original ASM version also needs the dotprod feature, but it never checks for it...
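For context, here is a hedged C sketch of the compile-time vs. run-time interplay being discussed: the dotprod intrinsics only compile when the feature is enabled at build time, while a separate run-time probe (shown here via getauxval on AArch64 Linux) decides whether the optimized path may execute. This is an illustration of the problem, not ggml's actual dispatch code.

```c
// Illustration of the compile-time vs. run-time feature problem; not ggml code.
#include <stdint.h>
#include <stdbool.h>

#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>            // vdotq_s32 only exists when dotprod is enabled at compile time
#endif

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#ifndef HWCAP_ASIMDDP
#define HWCAP_ASIMDDP (1 << 20)  // dotprod bit in AT_HWCAP on AArch64 Linux
#endif
#endif

// Run-time probe; other platforms would need sysctl, IsProcessorFeaturePresent, etc.
static bool cpu_has_dotprod(void) {
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) != 0;
#else
    return false;
#endif
}

// Plain scalar fallback. Note: if dotprod is enabled at compile time, the compiler
// is free to auto-vectorize this loop as well, which is exactly the concern raised
// above about the "base" implementation.
static int32_t dot_i8_scalar(const int8_t * a, const int8_t * b, int n) {
    int32_t s = 0;
    for (int i = 0; i < n; i++) {
        s += (int32_t) a[i] * (int32_t) b[i];
    }
    return s;
}

int32_t dot_i8(const int8_t * a, const int8_t * b, int n) {
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
    // Compile-time guard: without __ARM_FEATURE_DOTPROD this block would not compile.
    // Run-time guard: without it, this path would fault on CPUs lacking dotprod.
    if (cpu_has_dotprod()) {
        int32x4_t acc = vdupq_n_s32(0);
        int i = 0;
        for (; i + 16 <= n; i += 16) {
            acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
        }
        return vaddvq_s32(acc) + dot_i8_scalar(a + i, b + i, n - i);
    }
#endif
    return dot_i8_scalar(a, b, n);
}
```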

slaren (Collaborator) commented Nov 27, 2024
Yes, I agree; I was aware that this is an issue on x86. The goal for x86 is to bundle multiple versions of the CPU backend, built for the different instruction sets, as dynamic libraries and load the best one at startup. We should probably do the same for ARM; in addition to what you mention, the current mechanism is incomplete, and not every function that uses these features checks for them at runtime.
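For illustration, here is a hedged C sketch of that dynamic-loading approach: several CPU-backend variants built as separate shared libraries, with the loader picking the most capable one the running CPU supports at startup. The library names, entry-point symbol, and feature probe are invented for this example and are not ggml's actual API.

```c
// Sketch of "bundle several CPU backend builds, load the best one at startup".
// File names, the entry-point symbol, and the feature probe are all hypothetical.
#include <stddef.h>
#include <dlfcn.h>

typedef void * (*backend_init_fn)(void);

// Placeholder probe; a real one would inspect AT_HWCAP / CPUID / sysctl.
static int cpu_supports(const char * feature) {
    (void) feature;
    return 0;                       // pretend no optional features are available
}

static void * load_best_cpu_backend(void) {
    // Variants ordered from most to least capable; the last one is a safe fallback.
    static const struct { const char * feature; const char * lib; } variants[] = {
        { "sve",     "libggml-cpu-sve.so"     },
        { "dotprod", "libggml-cpu-dotprod.so" },
        { NULL,      "libggml-cpu-generic.so" },
    };

    for (size_t i = 0; i < sizeof(variants) / sizeof(variants[0]); i++) {
        if (variants[i].feature && !cpu_supports(variants[i].feature)) {
            continue;               // running CPU lacks this feature; try the next variant
        }
        void * handle = dlopen(variants[i].lib, RTLD_NOW | RTLD_LOCAL);
        if (!handle) {
            continue;               // this variant is not shipped or failed to load
        }
        backend_init_fn init = (backend_init_fn) dlsym(handle, "cpu_backend_init");
        if (init) {
            return init();          // hand back whatever the backend's init returns
        }
        dlclose(handle);
    }
    return NULL;                    // no usable backend variant was found
}
```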
