Notes from the GPTQ paper: https://arxiv.org/pdf/2210.17323.pdf
## What is GPTQ?
GPTQ is a one-shot (post-training) weight quantization method based on approximate second-order information, designed specifically for GPT-style models.
Example - https://lightning.ai/blog/how-to-finetune-gpt-like-large-language-models-on-a-custom-dataset/
## Background
1. Layer-Wise Quantization
At a high level, the method follows the structure of state-of-the-art post-training quantization methods,
performing quantization layer-by-layer and solving a corresponding reconstruction problem for each layer.
Concretely, let W_l be the weights of a linear layer l, and let X_l denote the layer inputs
corresponding to a small set of m data points run through the network.
The objective is then to find a matrix of quantized weights Wc which minimizes the squared error
relative to the full-precision layer output. Formally:
argmin_{Wc} ||W_l X_l − Wc X_l||_2^2
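To make the objective concrete, here is a small NumPy sketch that measures this layer-wise squared error, using a simple round-to-nearest grid as a stand-in for the actual quantized solution Wc; quantize_rtn, layerwise_sq_error and the 4-bit setting are illustrative choices, not part of the paper.

```python
import numpy as np

def quantize_rtn(W: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Round-to-nearest uniform quantization per row (a simple stand-in quantizer)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def layerwise_sq_error(W: np.ndarray, Wc: np.ndarray, X: np.ndarray) -> float:
    """Squared reconstruction error ||W X - Wc X||_2^2 for one linear layer."""
    return float(((W @ X - Wc @ X) ** 2).sum())

# d_row x d_col weights, d_col x m calibration inputs
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
X = rng.standard_normal((16, 32))
print(layerwise_sq_error(W, quantize_rtn(W), X))
```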
2. Optimal Brain Quantization (OBQ)
OBQ quantizes weights iteratively, one at a time, in an order determined by their impact on the loss increase;
after each quantized weight it applies a closed-form update to the remaining unquantized weights, further reducing the loss.
From https://arxiv.org/pdf/2208.11580.pdf
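As a minimal NumPy sketch of this greedy loop for a single weight row, assuming a uniform-grid placeholder quantizer: quant and obq_row are illustrative names, and the weight update and Hessian-removal steps follow the closed-form OBQ expressions as summarized in the GPTQ paper.

```python
import numpy as np

def quant(x, scale: float = 0.1):
    """Nearest point on a uniform grid (placeholder quantizer)."""
    return scale * np.round(x / scale)

def obq_row(w: np.ndarray, Hinv: np.ndarray) -> np.ndarray:
    """Greedily quantize one weight row, one weight at a time."""
    w, Hinv = w.copy(), Hinv.copy()
    remaining = list(range(len(w)))
    while remaining:
        # pick the weight whose rounding currently increases the loss the least
        q = min(remaining, key=lambda j: (quant(w[j]) - w[j]) ** 2 / Hinv[j, j])
        err = (w[q] - quant(w[q])) / Hinv[q, q]
        # closed-form update of all weights; at index q this lands exactly on quant(w[q])
        w = w - err * Hinv[:, q]
        # drop weight q from the inverse Hessian of the remaining weights
        Hinv = Hinv - np.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]
        remaining.remove(q)
    return w
```

Running this independently for every row is what makes OBQ expensive on large layers; the GPTQ steps below share the work across rows by quantizing all rows in the same column order.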
## The GPTQ algorithm
Step 1. Arbitrary Order Insight
The original OBQ method quantizes the weights of each row of W independently, in a row-specific order defined by the corresponding errors.
By contrast, GPTQ quantizes the weights of all rows in the same (arbitrary, e.g. left-to-right) column order; this typically yields
results with a final squared error similar to the original per-row solutions.
Step 2. Lazy Batch-Updates
The final rounding decisions for column i are only affected by updates performed on this very column, so updates to later
columns are irrelevant at that point in the process. This makes it possible to “lazily batch” the updates together,
applying them to a block of B columns at a time, which achieves much better GPU utilization.
Although this strategy does not reduce the theoretical amount of compute, it effectively addresses
the memory-throughput bottleneck.
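A minimal NumPy sketch of just this batching pattern, assuming W and the inverse-Hessian factor Hinv play the same roles as in Algorithm 1 below, quant is any per-column quantizer, and i + B does not exceed the number of columns; lazy_block_update is an illustrative name.

```python
import numpy as np

def lazy_block_update(W: np.ndarray, Hinv: np.ndarray, i: int, B: int, quant) -> np.ndarray:
    """Quantize columns i .. i+B-1 of W in place, deferring updates of columns >= i+B.
    Returns the block's error matrix E."""
    E = np.zeros((W.shape[0], B))                        # per-column errors of this block
    for j in range(i, i + B):
        q_col = quant(W[:, j])
        E[:, j - i] = (W[:, j] - q_col) / Hinv[j, j]
        # cheap immediate update inside the current block (column j itself lands on q_col)
        W[:, j:i + B] -= np.outer(E[:, j - i], Hinv[j, j:i + B])
    # one batched (GPU-friendly) update of everything to the right of the block
    W[:, i + B:] -= E @ Hinv[i:i + B, i + B:]
    return E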
Step 3. Cholesky Reformulation
We begin by noting that the only information required from H^−1_Fq, where Fq denotes
the set of unquantized weights when quantizing weight q, is row q, or more precisely, the elements in
this row starting with the diagonal. The consequence is that all of these rows can be precomputed
via the Cholesky decomposition of H^−1, a more numerically stable method, without any significant increase in memory consumption.
Using a well-optimized Cholesky kernel also yields further speedup.
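For concreteness, a rough NumPy sketch of this precomputation, assuming X holds the calibration layer inputs as a d_col × m matrix and lam is the small damping constant λ; inverse_hessian_cholesky is an illustrative name, and the generic solves here stand in for the well-optimized Cholesky kernels mentioned above.

```python
import numpy as np

def inverse_hessian_cholesky(X: np.ndarray, lam: float = 0.01) -> np.ndarray:
    """Upper-triangular factor Cholesky(H^-1)^T consumed by Algorithm 1 below."""
    d = X.shape[0]
    H = 2.0 * X @ X.T + lam * np.eye(d)    # H = 2 X X^T + lambda I
    L = np.linalg.cholesky(H)              # H = L L^T, computed stably
    Linv = np.linalg.solve(L, np.eye(d))   # L^-1 via a linear solve
    Hinv = Linv.T @ Linv                   # H^-1 = L^-T L^-1, never inverting H directly
    return np.linalg.cholesky(Hinv).T      # transpose of the lower factor = Cholesky(H^-1)^T
```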
Algorithm 1: Quantize W given inverse Hessian H^−1 = (2XX^T + λI)^−1 and block size B.
Q ← 0_{d_row × d_col} // quantized output
E ← 0_{d_row × B} // block quantization errors
H^−1 ← Cholesky(H^−1)^T // Hessian inverse information
for i = 0, B, 2B, . . . do
for j = i, . . . , i + B − 1 do
Q_:,j ← quant(W_:,j ) // quantize column
E_:,j−i ← (W_:,j − Q_:,j ) / [H^−1]_jj // quantization error
W_:,j:(i+B) ← W_:,j:(i+B) − E_:,j−i · H^−1_j,j:(i+B) // update weights in block
end for
W_:,(i+B): ← W_:,(i+B): − E·H^−1_i:(i+B),(i+B): // update all remaining weights
end for
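Putting the pieces together, below is a rough NumPy sketch of Algorithm 1 as written above, assuming Hinv is the upper-triangular factor Cholesky(H^−1)^T from Step 3 and using a simple round-to-nearest stand-in for quant; gptq_quantize is an illustrative name, not the reference implementation.

```python
import numpy as np

def quant(col: np.ndarray, scale: float = 0.1) -> np.ndarray:
    """Placeholder round-to-nearest quantizer for one weight column."""
    return scale * np.round(col / scale)

def gptq_quantize(W: np.ndarray, Hinv: np.ndarray, B: int = 128) -> np.ndarray:
    """Quantize W column by column using the inverse-Hessian factor Hinv."""
    W = W.copy()
    d_row, d_col = W.shape
    Q = np.zeros_like(W)
    for i in range(0, d_col, B):
        b = min(B, d_col - i)                 # allow a ragged final block
        E = np.zeros((d_row, b))              # block quantization errors
        for j in range(i, i + b):
            Q[:, j] = quant(W[:, j])          # quantize column j
            E[:, j - i] = (W[:, j] - Q[:, j]) / Hinv[j, j]
            # update column j and the rest of the block (column j lands on its quantized value)
            W[:, j:i + b] -= np.outer(E[:, j - i], Hinv[j, j:i + b])
        # lazy batched update of all remaining columns to the right of the block
        W[:, i + b:] -= E @ Hinv[i:i + b, i + b:]
    return Q
```

Note that only the elements of each row of Hinv on and after the diagonal are read, which matches the observation in Step 3.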