
Question about the paper #10

Open
jameswu2014 opened this issue May 18, 2024 · 3 comments

Comments

@jameswu2014 commented May 18, 2024

[Attached screenshot of the equation from the paper: 20240518-124213]
F is a nonlinear function; why are they equivalent?

@synxlin (Contributor) commented May 19, 2024

Hi @jameswu2014, thank you for your interest in our work.
The matrix $\mathbf{\Lambda}$ is diagonal, and in modern large language models (LLMs) the commonly used SwiGLU block applies its nonlinearity through an element-wise (Hadamard) product, so $F(\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = F(\mathbf{X}\mathbf{W}^T)\mathbf{\Lambda}$ holds in this context.
Here $\mathbf{W}$ is the up_proj weight, $F(\mathbf{V}) = \mathbf{G} \odot \mathbf{V}$, and $\mathbf{G}$ is the output of gate_proj, i.e., $F(\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = \mathbf{G} \odot (\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = (\mathbf{G} \odot (\mathbf{X}\mathbf{W}^T))\mathbf{\Lambda} = F(\mathbf{X}\mathbf{W}^T)\mathbf{\Lambda}$.
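
A quick numerical check of this identity (a minimal sketch in PyTorch; the tensor shapes and the names `X`, `W`, `G`, `L` are illustrative assumptions, not code from the paper or repo):

```python
# Minimal sketch: verify F(X W^T Λ) == F(X W^T) Λ for a diagonal Λ,
# where F is the SwiGLU-style element-wise gating F(V) = G ⊙ V.
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)            # input activations
W = torch.randn(16, 8)           # up_proj weight (out_features x in_features)
G = torch.randn(4, 16)           # gate_proj output (after SiLU), same shape as X @ W.T
L = torch.diag(torch.randn(16))  # diagonal matrix Λ

def F(V):
    # element-wise gating, as in SwiGLU
    return G * V

lhs = F(X @ W.T @ L)   # scale columns by Λ first, then gate
rhs = F(X @ W.T) @ L   # gate first, then scale columns by Λ
print(torch.allclose(lhs, rhs, atol=1e-5))  # True: a diagonal Λ only rescales
                                            # columns, which commutes with
                                            # element-wise gating
```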

@jameswu2014 (Author) commented May 20, 2024


Got it. First, thank you for your reply. But I still have a couple of questions:
1. Do you mean that G (the output of gate_proj) does not need to be rotated, or that it is rotated online?
2. The SiLU op comes after gate_proj, so G is fp16 and SiLU's output is also fp16. Does the gate_proj GEMM then compute int4 × int8 → fp16? Is that right?

@synxlin (Contributor) commented May 20, 2024

Hi @jameswu2014. For your questions:

  • We do not rotate the block's intermediate activations, so the output of gate_proj is not rotated.
  • All W4A8 GEMM kernels take int4 weights and int8 activations, run the multiply-accumulate on INT8 tensor cores, and produce fp16 outputs (see the sketch below).
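
For intuition, here is a hedged sketch of that W4A8 dataflow using simple symmetric per-tensor fake quantization in PyTorch. It is not the actual CUDA kernel (the real kernel packs int4 weights, uses finer-grained scales, and accumulates in int32 on INT8 tensor cores); it only mirrors the int4-weight × int8-activation → fp16-output pattern described above, and all names and shapes are assumptions:

```python
# Hedged sketch of the W4A8 GEMM dataflow (NOT the actual kernel): symmetric
# per-tensor fake quantization, integer-valued matmul, fp16 output.
import torch

def quantize_sym(x, n_bits):
    # symmetric quantization to signed n_bits integers (per-tensor scale)
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), min=-qmax - 1, max=qmax)
    return q, scale                  # q holds integer values stored as float

X = torch.randn(4, 64)               # activations (fp16 in the real pipeline)
W = torch.randn(128, 64)             # linear-layer weights

qW, sW = quantize_sym(W, n_bits=4)   # "int4" weights
qX, sX = quantize_sym(X, n_bits=8)   # "int8" activations

# Integer-valued accumulation; the real kernel does this in int32 on INT8
# tensor cores, here it is emulated exactly in float32.
acc = qX @ qW.T
Y = (acc * (sX * sW)).to(torch.float16)   # dequantize and emit fp16 output

print(Y.shape, Y.dtype)              # torch.Size([4, 128]) torch.float16
```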
