Can someone enlighten me on how exactly the Matmul 4bit quantization works? #865
han-minhee started this conversation in General
Replies: 1 comment

To make my question (below) clearer, I made a simple Python function assuming that two signed int4 values are packed into one uint8.
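Roughly, the packing scheme I assumed looks like the sketch below (purely illustrative; the nibble order and the signed interpretation are my assumptions, not something confirmed from the onnxruntime source):

```python
def pack_two_signed_int4(a: int, b: int) -> int:
    """Pack two signed int4 values (each in [-8, 7]) into one uint8.

    'a' goes into the low nibble, 'b' into the high nibble -- this
    ordering is an assumption on my part.
    """
    return ((a & 0x0F) | ((b & 0x0F) << 4)) & 0xFF


def unpack_two_signed_int4(byte: int):
    """Inverse of pack_two_signed_int4: recover the two signed int4 values."""
    lo = byte & 0x0F
    hi = (byte >> 4) & 0x0F
    # Sign-extend each 4-bit value from [0, 15] back to [-8, 7].
    lo = lo - 16 if lo >= 8 else lo
    hi = hi - 16 if hi >= 8 else hi
    return lo, hi


# Round-trip sanity check
assert unpack_two_signed_int4(pack_two_signed_int4(-3, 7)) == (-3, 7)
```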
I got an int4-quantized Phi3-Mini using builder.py from the onnxruntime-genai scripts.
However, I guess there's something I'm missing right now.
When I tried to unpack the quantized values (model.layers.0.attn.qkv_proj.MatMul.weight_Q4 from the int4 model), the unpacked values didn't match the float32 ones (model.layers.0.attn.qkv_proj.MatMul.weight from the fp32 ONNX model).
My goal is to learn how to unpack the values.
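For reference, the two tensors can be read out for comparison with something like this (a minimal sketch using the onnx Python API; the model paths are placeholders for wherever builder.py wrote the fp32 and int4 models):

```python
import onnx
from onnx import numpy_helper

def initializers(model):
    """Map initializer names to numpy arrays."""
    return {t.name: numpy_helper.to_array(t) for t in model.graph.initializer}

# Placeholder paths -- substitute the actual output directories of builder.py.
fp32_inits = initializers(onnx.load("phi3-mini-fp32/model.onnx"))
int4_inits = initializers(onnx.load("phi3-mini-int4/model.onnx"))

w_fp32 = fp32_inits["model.layers.0.attn.qkv_proj.MatMul.weight"]
w_q4 = int4_inits["model.layers.0.attn.qkv_proj.MatMul.weight_Q4"]
# The per-block scales live in a separate initializer of the int4 graph
# (name omitted here).

print(w_fp32.shape, w_fp32.dtype)  # expected: the [3072, 9216] float32 matrix discussed below
print(w_q4.shape, w_q4.dtype)      # packed uint8 data
```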
For K=3072, N=9216, bits=4, block_size=32, and an original matrix B, my understanding is the following (a small sketch of this layout comes right after the list):
- B is originally shaped [3072, 9216].
- B is transposed to [9216, 3072].
- Each column of the original B (i.e., each row of the transposed matrix) is grouped into blocks of block_size elements, resulting in a [9216, 96, 32] shape.
- The 32 elements inside one block are scaled using a single scale value, so there are 9216 * 96 scale values.
- Two consecutive scaled float values are then converted into two int4 values and packed into one uint8_t value.
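In code, that understanding corresponds to something like the sketch below (my own reconstruction, not the onnxruntime implementation; in particular the symmetric rounding into [-8, 7] and the low-nibble-first packing are assumptions):

```python
import numpy as np

def quantize_4bit_blockwise(B: np.ndarray, block_size: int = 32):
    """Blockwise 4-bit quantization as I currently picture it.

    B: float32 matrix of shape [K, N] (here [3072, 9216]).
    Returns:
      packed: uint8 array of shape [N, K // block_size, block_size // 2]
      scales: float32 array of shape [N, K // block_size]
    """
    K, N = B.shape
    assert K % block_size == 0
    blocks = B.T.reshape(N, K // block_size, block_size)       # [9216, 96, 32]

    # One scale per block; symmetric mapping of each block into [-8, 7].
    scales = np.maximum(np.abs(blocks).max(axis=-1), 1e-12) / 7.0
    q = np.clip(np.round(blocks / scales[..., None]), -8, 7).astype(np.int8)

    # Pack two consecutive int4 values into one uint8 (low nibble first).
    lo = (q[..., 0::2] & 0x0F).astype(np.uint8)
    hi = (q[..., 1::2] & 0x0F).astype(np.uint8)
    packed = lo | (hi << 4)                                     # [9216, 96, 16]
    return packed, scales.astype(np.float32)
```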
Here comes the first question:
So, based on my understanding, I tried unpacking the values using a function I wrote (implemented independently of onnxruntime, since I wanted to see what was going on).
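In outline (this is a reconstruction from the assumptions above, not my original function), the unpacking looks like this:

```python
import numpy as np

def dequantize_4bit_blockwise(packed: np.ndarray, scales: np.ndarray,
                              K: int = 3072, N: int = 9216,
                              block_size: int = 32) -> np.ndarray:
    """Invert the packing sketched above and return a float32 [K, N] matrix.

    packed: uint8 array of shape [N, K // block_size, block_size // 2]
    scales: float array of shape [N, K // block_size]
    """
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)

    # Sign-extend the 4-bit values from [0, 15] back to [-8, 7].
    lo = np.where(lo >= 8, lo - 16, lo).astype(np.int8)
    hi = np.where(hi >= 8, hi - 16, hi).astype(np.int8)

    # Re-interleave low/high nibbles into blocks of block_size values.
    q = np.empty((N, K // block_size, block_size), dtype=np.int8)
    q[..., 0::2] = lo
    q[..., 1::2] = hi

    # Undo the per-block scaling, then the reshape and transpose.
    deq = q.astype(np.float32) * scales[..., None].astype(np.float32)
    return deq.reshape(N, K).T          # back to [K, N]
```

Quantizing and then dequantizing a random [3072, 9216] float32 matrix with these two sketches round-trips to within the per-block rounding error, so they are at least self-consistent under my assumptions.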
But the unpacked values are totally different from the original values.
What am I missing?
Thank you in advance!
Assume that for a matrix A with M rows and N columns, the [i][j]-th element is stored at index i * M + j.