fix Half-Quadratic Quantization and Dequantization on CPU #873

Open

haricot wants to merge 6 commits into master

Conversation

@haricot haricot commented Oct 21, 2024

This confirms that test_bitpack is only being exercised on non-CPU hardware. To address this, we can fix the CPU path by ensuring the data slices are contiguous.
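For illustration, a minimal sketch of the contiguity fix, assuming candle's Tensor API (the helper name `ensure_contiguous` is hypothetical, not the PR's actual code):

```rust
use candle_core::{Result, Tensor};

/// Make sure the tensor's storage is laid out contiguously before the
/// CPU bit-packing path takes raw slices, so values are read in order.
fn ensure_contiguous(w: &Tensor) -> Result<Tensor> {
    if w.is_contiguous() {
        Ok(w.clone())
    } else {
        // Materializes a contiguous copy of the underlying data.
        w.contiguous()
    }
}
```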


Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   12          105          104            0            1
 Python                 52         2280         1940           68          272
 TOML                   20          630          564            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          196          169            1           26
 (Total)                            273          201           32           40
-------------------------------------------------------------------------------
 Markdown               38         2803            0         2132          671
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 9          322          274            0           48
 |- TOML                 2           75           63            0           12
 (Total)                           3407          531         2132          744
-------------------------------------------------------------------------------
 Rust                  269        79455        71330         1688         6437
 |- Markdown           131         1357           25         1237           95
 (Total)                          80812        71355         2925         6532
===============================================================================
 Total                 402        85805        74379         3892         7534
===============================================================================
  

@haricot changed the title from "Optimizing HQQ quantization on CPU" to "Optimizing Half-Quadratic Quantization on CPU" on Oct 21, 2024
@EricLBuehler (Owner) left a comment

Hi @haricot! Thanks for the PR. Can you please update it so it also tests 8 bit quantization? Thanks!

@EricLBuehler (Owner) commented

@haricot were you planning on implementing HQQ for non-CUDA devices in this PR? The name seems to indicate so; I was just wondering!

@haricot (Author) commented Oct 22, 2024

Hi @EricLBuehler!

Indeed, the title is not accurate, because I could not optimize it. A better description is: fix Half-Quadratic Quantization and Dequantization on CPU when the quantization bit width is less than 8.
My first goal was to make quantization work on my device. In fact, I could not quantize the models at all; I ran out of memory (OOM).

With your models quantized on GPU and my 8GB of VRAM, it worked correctly with this command:
cargo run -r --features cuda -- --pa-gpu-mem-usage 0.5 -i plain -m '/path/' --from-uqff /model/llm-hqq4.uqff

With the model quantized on CPU, inference on either CPU or GPU produced inconsistent text. After fixing that, I realized that CPU inference with the quantized models was still not optimal. I profiled to see which step should be optimized and implemented a SIMD version of the dequantize function, but that turned out not to be enough.
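For context, a minimal sketch of the kind of sub-8-bit dequantize kernel involved (the function name, nibble order, and flat scale/zero layout are assumptions for illustration, not the PR's actual code):

```rust
/// Dequantize 4-bit packed weights on CPU. Each input byte holds two
/// 4-bit values (low nibble first); `w = (q - zero) * scale`.
/// Written as a flat loop so the compiler can auto-vectorize it.
fn dequant_4bit(packed: &[u8], scale: f32, zero: f32, out: &mut [f32]) {
    assert_eq!(out.len(), packed.len() * 2);
    for (i, &byte) in packed.iter().enumerate() {
        let lo = (byte & 0x0F) as f32;
        let hi = (byte >> 4) as f32;
        out[2 * i] = (lo - zero) * scale;
        out[2 * i + 1] = (hi - zero) * scale;
    }
}
```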

Regarding Half-Quadratic Quantization on GPU and a possible CPU offload to avoid OOM:
I would like to propose a method where, once a maximum limit of available VRAM is reached, the tensor is quantized on CPU instead. Would this be relevant? A rough sketch of the idea is below.
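A minimal sketch of that fallback, with hypothetical names (`QuantDevice`, `pick_quant_device`), since no such API exists yet:

```rust
/// Hypothetical helper: keep quantizing on the GPU while the estimated
/// VRAM footprint stays under the configured budget, otherwise fall
/// back to quantizing this tensor on CPU to avoid an OOM.
enum QuantDevice {
    Gpu,
    Cpu,
}

fn pick_quant_device(vram_used: usize, tensor_bytes: usize, vram_budget: usize) -> QuantDevice {
    if vram_used + tensor_bytes <= vram_budget {
        QuantDevice::Gpu
    } else {
        QuantDevice::Cpu
    }
}
```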

@haricot changed the title from "Optimizing Half-Quadratic Quantization on CPU" to "fix Half-Quadratic Quantization and Dequantization on CPU" on Oct 22, 2024