fix Half-Quadratic Quantization and Dequantization on CPU #873
base: master
Conversation
Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   12          105          104            0            1
 Python                 52         2280         1940           68          272
 TOML                   20          630          564            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          196          169            1           26
 (Total)                            273          201           32           40
-------------------------------------------------------------------------------
 Markdown               38         2803            0         2132          671
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 9          322          274            0           48
 |- TOML                 2           75           63            0           12
 (Total)                           3407          531         2132          744
-------------------------------------------------------------------------------
 Rust                  269        79455        71330         1688         6437
 |- Markdown           131         1357           25         1237           95
 (Total)                          80812        71355         2925         6532
===============================================================================
 Total                 402        85805        74379         3892         7534
===============================================================================
Hi @haricot! Thanks for the PR. Can you please update it so it also tests 8 bit quantization? Thanks!
@haricot were you planning on implementing HQQ for non-CUDA devices in this PR? The name seems to indicate so, I was just wondering!
Hi @EricLBuehler! Indeed, the title is not quite right: this PR does not optimize HQQ for non-CUDA devices. It is rather: fix Half-Quadratic Quantization and Dequantization on CPU when the quantization width is below 8 bits. With your models quantized on GPU and my 8 GB of VRAM, it worked correctly with this command: With a model quantized on CPU, inference on either CPU or GPU produced inconsistent text. After fixing that, I realized that CPU inference with the quantized models was still not optimal. I profiled to find which step should be optimized and implemented a SIMD version of the dequantize function, but that did not turn out to be enough. About Half-Quadratic Quantization on GPU and possibly offloading to CPU to avoid OOM:
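For context, the dequantize step discussed above can be sketched as follows. This is a minimal, hypothetical illustration rather than the mistral.rs implementation: it assumes two 4-bit values are packed per byte (high nibble first; the actual packing order in the repo may differ) and uses the usual affine dequantization formula w = scale * (q - zero).

```rust
// Hypothetical sketch (not the mistral.rs API): unpack two 4-bit
// quantized values per byte, then dequantize each one.
fn dequantize_4bit(packed: &[u8], scale: f32, zero: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &b in packed {
        // Assumed packing order: high nibble first, then low nibble.
        for q in [b >> 4, b & 0x0F] {
            out.push(scale * (q as f32 - zero));
        }
    }
    out
}

fn main() {
    // 0x1F unpacks to [1, 15]; with scale 0.5 and zero point 8:
    let w = dequantize_4bit(&[0x1F], 0.5, 8.0);
    assert_eq!(w, vec![-3.5, 3.5]);
    println!("{:?}", w);
}
```

A scalar loop like this is exactly the kind of hot path where a SIMD rewrite would be attempted, though as noted above that alone was not sufficient here.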
This confirms that test_bitpack currently runs only on non-CPU hardware. To address this, we could fix it by ensuring the data slices are contiguous.
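To illustrate why contiguity matters: a sketch of the idea, using a hypothetical helper rather than any mistral.rs API. Bit-unpacking reads a flat byte buffer, so a strided view (e.g. a column of a row-major matrix) must first be gathered into a contiguous buffer before its bytes are reinterpreted.

```rust
// Hypothetical helper: gather one column of a row-major byte buffer
// into a contiguous Vec, so it can safely be fed to a bit-unpacker
// that assumes a flat, contiguous slice.
fn contiguous_column(buf: &[u8], cols: usize, col: usize) -> Vec<u8> {
    buf.chunks_exact(cols).map(|row| row[col]).collect()
}

fn main() {
    // A 3x2 row-major buffer; column 1 is strided in memory.
    let buf = [1u8, 2, 3, 4, 5, 6];
    let col = contiguous_column(&buf, 2, 1);
    assert_eq!(col, vec![2, 4, 6]);
    println!("{:?}", col);
}
```

In tensor libraries this is typically what a `.contiguous()` call does: it copies a strided view into a dense buffer so downstream kernels can index it linearly.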