Can someone enlighten me on how exactly the Matmul 4bit quantization works? #865
han-minhee started this conversation in General
Replies: 1 comment

To make my question (below) clearer, I made a simple Python function assuming that two signed int4 values are packed into one uint8.
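Roughly, the packing scheme I assumed looks like the sketch below (purely illustrative; the nibble order and the signed interpretation are my assumptions, not something confirmed from the onnxruntime source):

```python
def pack_two_signed_int4(a: int, b: int) -> int:
    """Pack two signed int4 values (each in [-8, 7]) into one uint8.

    'a' goes into the low nibble, 'b' into the high nibble -- this
    ordering is an assumption on my part.
    """
    return ((a & 0x0F) | ((b & 0x0F) << 4)) & 0xFF


def unpack_two_signed_int4(byte: int):
    """Inverse of pack_two_signed_int4: recover the two signed int4 values."""
    lo = byte & 0x0F
    hi = (byte >> 4) & 0x0F
    # Sign-extend each 4-bit value from [0, 15] back to [-8, 7].
    lo = lo - 16 if lo >= 8 else lo
    hi = hi - 16 if hi >= 8 else hi
    return lo, hi


# Round-trip sanity check
assert unpack_two_signed_int4(pack_two_signed_int4(-3, 7)) == (-3, 7)
```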
I got an int4-quantized Phi3-Mini using builder.py from the onnxruntime-genai scripts.
However, I guess there's something I'm missing right now.
When I tried to unpack the quantized values (model.layers.0.attn.qkv_proj.MatMul.weight_Q4 from the int4 model), the unpacked values didn't match the float32 ones (model.layers.0.attn.qkv_proj.MatMul.weight from the fp32 ONNX model).
My goal is to learn how to unpack the values.
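For reference, the two tensors can be read out for comparison with something like this (a minimal sketch using the onnx Python API; the model paths are placeholders for wherever builder.py wrote the fp32 and int4 models):

```python
import onnx
from onnx import numpy_helper

def initializers(model):
    """Map initializer names to numpy arrays."""
    return {t.name: numpy_helper.to_array(t) for t in model.graph.initializer}

# Placeholder paths -- substitute the actual output directories of builder.py.
fp32_inits = initializers(onnx.load("phi3-mini-fp32/model.onnx"))
int4_inits = initializers(onnx.load("phi3-mini-int4/model.onnx"))

w_fp32 = fp32_inits["model.layers.0.attn.qkv_proj.MatMul.weight"]
w_q4 = int4_inits["model.layers.0.attn.qkv_proj.MatMul.weight_Q4"]
# The per-block scales live in a separate initializer of the int4 graph
# (name omitted here).

print(w_fp32.shape, w_fp32.dtype)  # expected: the [3072, 9216] float32 matrix discussed below
print(w_q4.shape, w_q4.dtype)      # packed uint8 data
```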
For K=3072, N=9216, bits=4, block_size=32, and an original matrix B, my understanding is the following (a small sketch of this layout comes right after the list):
- B is originally shaped [3072, 9216].
- B is transposed to [9216, 3072].
- Each column of the original B (i.e., each row of the transposed matrix) is grouped into blocks of block_size elements, resulting in a [9216, 96, 32] shape.
- The 32 elements inside one block are scaled using a single scale value, so there are 9216 * 96 scale values.
- Two consecutive scaled float values are then converted into two int4 values and packed into one uint8_t value.
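In code, that understanding corresponds to something like the sketch below (my own reconstruction, not the onnxruntime implementation; in particular the symmetric rounding into [-8, 7] and the low-nibble-first packing are assumptions):

```python
import numpy as np

def quantize_4bit_blockwise(B: np.ndarray, block_size: int = 32):
    """Blockwise 4-bit quantization as I currently picture it.

    B: float32 matrix of shape [K, N] (here [3072, 9216]).
    Returns:
      packed: uint8 array of shape [N, K // block_size, block_size // 2]
      scales: float32 array of shape [N, K // block_size]
    """
    K, N = B.shape
    assert K % block_size == 0
    blocks = B.T.reshape(N, K // block_size, block_size)       # [9216, 96, 32]

    # One scale per block; symmetric mapping of each block into [-8, 7].
    scales = np.maximum(np.abs(blocks).max(axis=-1), 1e-12) / 7.0
    q = np.clip(np.round(blocks / scales[..., None]), -8, 7).astype(np.int8)

    # Pack two consecutive int4 values into one uint8 (low nibble first).
    lo = (q[..., 0::2] & 0x0F).astype(np.uint8)
    hi = (q[..., 1::2] & 0x0F).astype(np.uint8)
    packed = lo | (hi << 4)                                     # [9216, 96, 16]
    return packed, scales.astype(np.float32)
```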
Here comes the first question:
So, based on my understanding, I tried unpacking the values using a function I wrote (implemented independently of onnxruntime, since I wanted to see what was going on).
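In outline (this is a reconstruction from the assumptions above, not my original function), the unpacking looks like this:

```python
import numpy as np

def dequantize_4bit_blockwise(packed: np.ndarray, scales: np.ndarray,
                              K: int = 3072, N: int = 9216,
                              block_size: int = 32) -> np.ndarray:
    """Invert the packing sketched above and return a float32 [K, N] matrix.

    packed: uint8 array of shape [N, K // block_size, block_size // 2]
    scales: float array of shape [N, K // block_size]
    """
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)

    # Sign-extend the 4-bit values from [0, 15] back to [-8, 7].
    lo = np.where(lo >= 8, lo - 16, lo).astype(np.int8)
    hi = np.where(hi >= 8, hi - 16, hi).astype(np.int8)

    # Re-interleave low/high nibbles into blocks of block_size values.
    q = np.empty((N, K // block_size, block_size), dtype=np.int8)
    q[..., 0::2] = lo
    q[..., 1::2] = hi

    # Undo the per-block scaling, then the reshape and transpose.
    deq = q.astype(np.float32) * scales[..., None].astype(np.float32)
    return deq.reshape(N, K).T          # back to [K, N]
```

Quantizing and then dequantizing a random [3072, 9216] float32 matrix with these two sketches round-trips to within the per-block rounding error, so they are at least self-consistent under my assumptions.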
But the unpacked values are totally different from the original values.
What am I missing?
Thank you in advance!
Assume that for a matrix A with M rows and N columns, the [i][j]-th element is stored at index i * M + j.