fix Half-Quadratic Quantization and Dequantization on CPU #873

Open

haricot wants to merge 6 commits into master

Conversation

@haricot haricot commented Oct 21, 2024

This confirms that test_bitpack is only being exercised on non-CPU hardware. To address this, we can fix the CPU path by ensuring the data slices are contiguous.
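For illustration, a minimal sketch of the contiguity fix, assuming candle's Tensor API (the helper name `ensure_contiguous` is hypothetical, not the PR's actual code):

```rust
use candle_core::{Result, Tensor};

/// Make sure the tensor's storage is laid out contiguously before the
/// CPU bit-packing path takes raw slices, so values are read in order.
fn ensure_contiguous(w: &Tensor) -> Result<Tensor> {
    if w.is_contiguous() {
        Ok(w.clone())
    } else {
        // Materializes a contiguous copy of the underlying data.
        w.contiguous()
    }
}
```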


Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   12          105          104            0            1
 Python                 52         2280         1940           68          272
 TOML                   20          630          564            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          196          169            1           26
 (Total)                            273          201           32           40
-------------------------------------------------------------------------------
 Markdown               38         2803            0         2132          671
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 9          322          274            0           48
 |- TOML                 2           75           63            0           12
 (Total)                           3407          531         2132          744
-------------------------------------------------------------------------------
 Rust                  269        79455        71330         1688         6437
 |- Markdown           131         1357           25         1237           95
 (Total)                          80812        71355         2925         6532
===============================================================================
 Total                 402        85805        74379         3892         7534
===============================================================================
  

@haricot changed the title from "Optimizing HQQ quantization on CPU" to "Optimizing Half-Quadratic Quantization on CPU" on Oct 21, 2024
@EricLBuehler (Owner) left a comment

Hi @haricot! Thanks for the PR. Can you please update it so it also tests 8 bit quantization? Thanks!

@EricLBuehler (Owner) commented

@haricot were you planning on implementing HQQ for non-CUDA devices in this PR? The name seems to indicate so; I was just wondering!

@haricot (Author) commented Oct 22, 2024

Hi @EricLBuehler!

Indeed, the title is not accurate, because I could not optimize it. A better description is: fix Half-Quadratic Quantization and Dequantization on CPU when the quantization bit width is less than 8.
My first goal was to make quantization work on my device. In fact, I could not quantize the models at all; I ran out of memory (OOM).

With your models quantized on GPU and my 8GB of VRAM, it worked correctly with this command:
cargo run -r --features cuda -- --pa-gpu-mem-usage 0.5 -i plain -m '/path/' --from-uqff /model/llm-hqq4.uqff

With the model quantized on CPU, inference on either CPU or GPU produced inconsistent text. After fixing that, I realized that CPU inference with the quantized models was still not optimal. I profiled to see which step should be optimized and implemented a SIMD version of the dequantize function, but that turned out not to be enough.
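For context, a minimal sketch of the kind of sub-8-bit dequantize kernel involved (the function name, nibble order, and flat scale/zero layout are assumptions for illustration, not the PR's actual code):

```rust
/// Dequantize 4-bit packed weights on CPU. Each input byte holds two
/// 4-bit values (low nibble first); `w = (q - zero) * scale`.
/// Written as a flat loop so the compiler can auto-vectorize it.
fn dequant_4bit(packed: &[u8], scale: f32, zero: f32, out: &mut [f32]) {
    assert_eq!(out.len(), packed.len() * 2);
    for (i, &byte) in packed.iter().enumerate() {
        let lo = (byte & 0x0F) as f32;
        let hi = (byte >> 4) as f32;
        out[2 * i] = (lo - zero) * scale;
        out[2 * i + 1] = (hi - zero) * scale;
    }
}
```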

Regarding Half-Quadratic Quantization on GPU and a possible CPU offload to avoid OOM:
I would like to propose a method where, once a maximum limit of available VRAM is reached, the tensor is quantized on CPU instead. Would this be relevant? A rough sketch of the idea is below.
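A minimal sketch of that fallback, with hypothetical names (`QuantDevice`, `pick_quant_device`), since no such API exists yet:

```rust
/// Hypothetical helper: keep quantizing on the GPU while the estimated
/// VRAM footprint stays under the configured budget, otherwise fall
/// back to quantizing this tensor on CPU to avoid an OOM.
enum QuantDevice {
    Gpu,
    Cpu,
}

fn pick_quant_device(vram_used: usize, tensor_bytes: usize, vram_budget: usize) -> QuantDevice {
    if vram_used + tensor_bytes <= vram_budget {
        QuantDevice::Gpu
    } else {
        QuantDevice::Cpu
    }
}
```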

@haricot changed the title from "Optimizing Half-Quadratic Quantization on CPU" to "fix Half-Quadratic Quantization and Dequantization on CPU" on Oct 22, 2024