
ggml-cpu: support IQ4_NL_4_4 by runtime repack #10541

Open · wants to merge 2 commits into master
Conversation

FanShupei (Contributor)
Supersedes #10196.

This PR implements IQ4_NL runtime repack for the CPU backend. Currently only IQ4_NL_4_4 for Arm Neon is supported, implemented with intrinsics. If you are curious about how these intrinsics were derived (and about the many potential optimization opportunities), please see #10196 for more information; it contains a lengthy comparison between the intrinsics version and the original asm version.

Only runtime repack is supported, not llama-quantize, since based on the discussion in #10196 online repack is the preferred flow. Online repack for IQ4_NL is noticeably slower than for Q4_0, though I haven't done any rigorous measurements. Static quantization support could be added later if anyone really needs it.
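To make the repack concrete, here is a minimal C sketch of what interleaving four IQ4_NL blocks into one IQ4_NL_4_4 block means. The struct layouts and the 4-byte interleave width are simplified assumptions for illustration, not the exact ggml definitions.

```c
#include <stdint.h>
#include <string.h>

// Illustrative block layouts (not the exact ggml definitions).
// IQ4_NL: 32 weights per block, 4-bit non-linear indices plus one fp16 scale.
#define QK4_NL 32

typedef uint16_t ggml_half;              // fp16 stored as raw bits here

typedef struct {
    ggml_half d;                         // per-block scale
    uint8_t   qs[QK4_NL / 2];            // 32 x 4-bit indices, packed 2 per byte
} block_iq4_nl;

typedef struct {
    ggml_half d[4];                      // one scale per source row
    uint8_t   qs[4 * QK4_NL / 2];        // interleaved indices from 4 rows
} block_iq4_nlx4;

// Repack 4 blocks (one from each of 4 consecutive rows) into a single
// interleaved block, so a NEON kernel can work on 4 rows per loop iteration.
// A 4-byte interleave chunk is assumed here.
static block_iq4_nlx4 make_block_iq4_nlx4(const block_iq4_nl in[4]) {
    block_iq4_nlx4 out;
    const int chunk = 4;                             // bytes per interleaved chunk
    for (int r = 0; r < 4; r++) {
        out.d[r] = in[r].d;
    }
    for (int i = 0; i < (int) sizeof(out.qs); i += chunk) {
        int src_row    = (i / chunk) % 4;            // which of the 4 rows this chunk comes from
        int src_offset = (i / (4 * chunk)) * chunk;  // position within that row's block
        memcpy(&out.qs[i], &in[src_row].qs[src_offset], chunk);
    }
    return out;
}
```

With such a layout the GEMM/GEMV kernels can reuse each loaded activation vector across four output rows at once, which is where the bulk of the speedup comes from.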

FanShupei (Contributor, Author)
Performance Evaluation

The repack gives roughly a 3x speedup for IQ4_NL. Tested on a Mac M2 with GGML_METAL=off.

The previous PR #10196 contains more evaluation results.

This PR

| model | size | params | backend | threads | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 98.49 ± 0.82 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 97.96 ± 0.83 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 95.77 ± 0.10 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 34.92 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 186.77 ± 0.52 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 186.40 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 181.26 ± 0.08 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 61.11 ± 0.04 |

build: f56013d (4193)

Master

| model | size | params | backend | threads | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | --- | ---: |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 31.41 ± 0.16 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 31.51 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 31.12 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 21.13 ± 0.01 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 61.33 ± 0.03 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 61.24 ± 0.00 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 60.58 ± 0.20 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 41.13 ± 0.03 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 127.61 ± 0.87 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 127.35 ± 0.11 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 122.99 ± 0.05 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 37.04 ± 0.01 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 247.89 ± 0.81 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 247.35 ± 0.15 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 238.69 ± 0.24 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 65.17 ± 0.04 |

build: 4a57d36 (4192)

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 27, 2024
slaren (Collaborator) left a comment
The implementation looks good. I see a ~2x pp speedup on M3 Max and it doesn't seem to affect the load time too badly.

FanShupei (Contributor, Author)
I'm copying my comment from #10196 here.

I worry it won't work as expected if we switch to intrinsics. If the features are not enabled at compile time, the intrinsics won't compile. If they are enabled at compile time, the compiler may introduce SIMD instructions in the base implementation through auto-vectorization. This is why the CI fails, and I currently have no idea how to fix it.

I'm afraid our current runtime dispatch mechanism doesn't actually work and that no one really tests it. The original ASM version also needs the dotprod feature, but it never checks for it...
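For context, here is a hedged C sketch of the compile-time vs. run-time interplay being discussed: the dotprod intrinsics only compile when the feature is enabled at build time, while a separate run-time probe (shown here via getauxval on AArch64 Linux) decides whether the optimized path may execute. This is an illustration of the problem, not ggml's actual dispatch code.

```c
// Illustration of the compile-time vs. run-time feature problem; not ggml code.
#include <stdint.h>
#include <stdbool.h>

#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>            // vdotq_s32 only exists when dotprod is enabled at compile time
#endif

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#ifndef HWCAP_ASIMDDP
#define HWCAP_ASIMDDP (1 << 20)  // dotprod bit in AT_HWCAP on AArch64 Linux
#endif
#endif

// Run-time probe; other platforms would need sysctl, IsProcessorFeaturePresent, etc.
static bool cpu_has_dotprod(void) {
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) != 0;
#else
    return false;
#endif
}

// Plain scalar fallback. Note: if dotprod is enabled at compile time, the compiler
// is free to auto-vectorize this loop as well, which is exactly the concern raised
// above about the "base" implementation.
static int32_t dot_i8_scalar(const int8_t * a, const int8_t * b, int n) {
    int32_t s = 0;
    for (int i = 0; i < n; i++) {
        s += (int32_t) a[i] * (int32_t) b[i];
    }
    return s;
}

int32_t dot_i8(const int8_t * a, const int8_t * b, int n) {
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
    // Compile-time guard: without __ARM_FEATURE_DOTPROD this block would not compile.
    // Run-time guard: without it, this path would fault on CPUs lacking dotprod.
    if (cpu_has_dotprod()) {
        int32x4_t acc = vdupq_n_s32(0);
        int i = 0;
        for (; i + 16 <= n; i += 16) {
            acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
        }
        return vaddvq_s32(acc) + dot_i8_scalar(a + i, b + i, n - i);
    }
#endif
    return dot_i8_scalar(a, b, n);
}
```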

slaren (Collaborator) commented Nov 27, 2024
Yes, I agree; I was aware that this is an issue on x86. The goal for x86 is to bundle multiple versions of the CPU backend, built for the different instruction sets, as dynamic libraries and load the best one at startup. We should probably do the same for ARM; in addition to what you mention, the current mechanism is incomplete, and not every function that uses these features checks for them at runtime.
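For illustration, here is a hedged C sketch of that dynamic-loading approach: several CPU-backend variants built as separate shared libraries, with the loader picking the most capable one the running CPU supports at startup. The library names, entry-point symbol, and feature probe are invented for this example and are not ggml's actual API.

```c
// Sketch of "bundle several CPU backend builds, load the best one at startup".
// File names, the entry-point symbol, and the feature probe are all hypothetical.
#include <stddef.h>
#include <dlfcn.h>

typedef void * (*backend_init_fn)(void);

// Placeholder probe; a real one would inspect AT_HWCAP / CPUID / sysctl.
static int cpu_supports(const char * feature) {
    (void) feature;
    return 0;                       // pretend no optional features are available
}

static void * load_best_cpu_backend(void) {
    // Variants ordered from most to least capable; the last one is a safe fallback.
    static const struct { const char * feature; const char * lib; } variants[] = {
        { "sve",     "libggml-cpu-sve.so"     },
        { "dotprod", "libggml-cpu-dotprod.so" },
        { NULL,      "libggml-cpu-generic.so" },
    };

    for (size_t i = 0; i < sizeof(variants) / sizeof(variants[0]); i++) {
        if (variants[i].feature && !cpu_supports(variants[i].feature)) {
            continue;               // running CPU lacks this feature; try the next variant
        }
        void * handle = dlopen(variants[i].lib, RTLD_NOW | RTLD_LOCAL);
        if (!handle) {
            continue;               // this variant is not shipped or failed to load
        }
        backend_init_fn init = (backend_init_fn) dlsym(handle, "cpu_backend_init");
        if (init) {
            return init();          // hand back whatever the backend's init returns
        }
        dlclose(handle);
    }
    return NULL;                    // no usable backend variant was found
}
```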
