
Question about the paper #10

Open
jameswu2014 opened this issue May 18, 2024 · 3 comments

Comments

@jameswu2014 commented May 18, 2024

[Attached screenshot of the equation from the paper: 20240518-124213]
F is a nonlinear function; why are they equivalent?

@synxlin (Contributor) commented May 19, 2024

Hi @jameswu2014, thank you for your interest in our work.
The matrix $\mathbf{\Lambda}$ is diagonal, and in modern large language models (LLMs) the commonly used SwiGLU block applies its nonlinearity through an element-wise (Hadamard) product, so $F(\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = F(\mathbf{X}\mathbf{W}^T)\mathbf{\Lambda}$ holds in this context.
Here $\mathbf{W}$ is the up_proj weight, $F(\mathbf{V}) = \mathbf{G} \odot \mathbf{V}$, and $\mathbf{G}$ is the output of gate_proj, i.e., $F(\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = \mathbf{G} \odot (\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = (\mathbf{G} \odot (\mathbf{X}\mathbf{W}^T))\mathbf{\Lambda} = F(\mathbf{X}\mathbf{W}^T)\mathbf{\Lambda}$.
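
A quick numerical check of this identity (a minimal sketch in PyTorch; the tensor shapes and the names `X`, `W`, `G`, `L` are illustrative assumptions, not code from the paper or repo):

```python
# Minimal sketch: verify F(X W^T Λ) == F(X W^T) Λ for a diagonal Λ,
# where F is the SwiGLU-style element-wise gating F(V) = G ⊙ V.
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)            # input activations
W = torch.randn(16, 8)           # up_proj weight (out_features x in_features)
G = torch.randn(4, 16)           # gate_proj output (after SiLU), same shape as X @ W.T
L = torch.diag(torch.randn(16))  # diagonal matrix Λ

def F(V):
    # element-wise gating, as in SwiGLU
    return G * V

lhs = F(X @ W.T @ L)   # scale columns by Λ first, then gate
rhs = F(X @ W.T) @ L   # gate first, then scale columns by Λ
print(torch.allclose(lhs, rhs, atol=1e-5))  # True: a diagonal Λ only rescales
                                            # columns, which commutes with
                                            # element-wise gating
```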

@jameswu2014 (Author) commented May 20, 2024


Got it. First, thank you for your reply. But I still have a couple of questions:
1. Do you mean that G (the output of gate_proj) does not need to be rotated, or that it is rotated online?
2. The SiLU op comes after gate_proj, so G is fp16 and SiLU's output is also fp16. Does the gate_proj GEMM then compute int4 × int8 → fp16? Is that right?

@synxlin (Contributor) commented May 20, 2024

Hi @jameswu2014. For your questions:

  • We do not rotate the block's intermediate activations, so the output of gate_proj is not rotated.
  • All W4A8 GEMM kernels take int4 weights and int8 activations, run the multiply-accumulate on INT8 tensor cores, and produce fp16 outputs (see the sketch below).
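
For intuition, here is a hedged sketch of that W4A8 dataflow using simple symmetric per-tensor fake quantization in PyTorch. It is not the actual CUDA kernel (the real kernel packs int4 weights, uses finer-grained scales, and accumulates in int32 on INT8 tensor cores); it only mirrors the int4-weight × int8-activation → fp16-output pattern described above, and all names and shapes are assumptions:

```python
# Hedged sketch of the W4A8 GEMM dataflow (NOT the actual kernel): symmetric
# per-tensor fake quantization, integer-valued matmul, fp16 output.
import torch

def quantize_sym(x, n_bits):
    # symmetric quantization to signed n_bits integers (per-tensor scale)
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), min=-qmax - 1, max=qmax)
    return q, scale                  # q holds integer values stored as float

X = torch.randn(4, 64)               # activations (fp16 in the real pipeline)
W = torch.randn(128, 64)             # linear-layer weights

qW, sW = quantize_sym(W, n_bits=4)   # "int4" weights
qX, sX = quantize_sym(X, n_bits=8)   # "int8" activations

# Integer-valued accumulation; the real kernel does this in int32 on INT8
# tensor cores, here it is emulated exactly in float32.
acc = qX @ qW.T
Y = (acc * (sX * sW)).to(torch.float16)   # dequantize and emit fp16 output

print(Y.shape, Y.dtype)              # torch.Size([4, 128]) torch.float16
```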
