[82] Efficient Language Modeling with Sparse all-MLP (sMLP) #111

dhkim0225 · 2022-03-16T01:50:45Z

rosinality's comment

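Training an LM with gated MLPs
A sparse, MoE-based model that reaches higher performance with less compute
Tackles autoregressive LM with an all-MLP architecture
Because the model is autoregressive, MoE routing cannot use information from later tokens, and the design accounts for this
The paper does not spell out how spatial mixing handles this, but since it is an n x n matrix, masking would probably work.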

paper

INTRO

The sparse all-MLP greatly increases model capacity and expressiveness while keeping compute constant.
The authors propose a sparse all-MLP and show up to 2x better training efficiency than Transformer-based MoEs (GShard, Switch Transformer, Base Layers, HASH Layers), dense Transformers, and dense all-MLPs.

Contributions

Proposes the sparse all-MLP (sMLP). Combining MoE with an all-MLP architecture is a first in NLP. New routing methods are proposed as well.
Strong performance

Background

Token-wise Operations

Notations

T == sequence length
H == hidden dim
h == total head number
Y == output
X == input token (T x H)
d == dimension per head (H / h)

For token mixing:

the Transformer performs self-attention
MLP models use the Spatial Gating Unit (the approach adopted in gMLP)
W_s can be thought of as playing the role of the attention scores.
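
A minimal sketch of a causally masked Spatial Gating Unit, following the guess in the comment above that masking the n x n matrix should be enough. The gMLP form (split channels into z1 and z2, gate z1 with W_s z2 + b) is used; the lower-triangular mask is an assumption, not something the paper spells out, and gMLP's normalization of z2 is left out. The name `causal_spatial_gating` and the toy values are hypothetical.

```python
def causal_spatial_gating(z1, z2, w_s, bias):
    """z1, z2: T x d lists (the two channel halves); w_s: T x T; bias: length T.

    Returns z1 * (tril(W_s) @ z2 + bias), gated elementwise.
    Assumption: causality is enforced by a lower-triangular mask on W_s.
    """
    T = len(z1)
    d = len(z1[0])
    out = []
    for t in range(T):
        row = []
        for c in range(d):
            # Only positions s <= t contribute, i.e. W_s is masked to lower-triangular.
            mixed = sum(w_s[t][s] * z2[s][c] for s in range(t + 1)) + bias[t]
            row.append(z1[t][c] * mixed)
        out.append(row)
    return out


z1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
z2 = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Entries above the diagonal (9.0) are ignored by the mask.
w_s = [[0.5, 9.0, 9.0], [0.1, 0.2, 9.0], [0.3, 0.3, 0.4]]
bias = [1.0, 1.0, 1.0]
out = causal_spatial_gating(z1, z2, w_s, bias)
```

With this mask, changing the last token of z2 leaves the outputs at earlier positions untouched, which is the property an autoregressive LM needs.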

Sparse Expert Models

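MoE is popular these days.
Many routing methods have been proposed;
the one used as the baseline works as follows.

There are N experts.

The router weight W_r is used to compute the logits h(x) = W_r · x.

The gate value for the i-th expert is the softmax over these logits: p_i(x) = exp(h(x)_i) / Σ_j exp(h(x)_j).

The top-k experts by gate value (usually k = 2) are selected, and the final output is their weighted sum: y = Σ_{i ∈ top-k} p_i(x) · E_i(x).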
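
The baseline routing above can be sketched in plain Python. This is only an illustration: `route_top_k`, the toy experts, and the weights are hypothetical, and the gate values are not renormalized over the selected experts (some implementations do renormalize).

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(x, w_r, experts, k=2):
    """x: hidden vector (length H); w_r: N x H router weights; experts: N callables."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_r]  # h(x) = W_r x
    gates = softmax(logits)  # gate value per expert
    top = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    y = [0.0] * len(x)
    for i in top:
        expert_out = experts[i](x)
        # Weighted sum of the selected experts' outputs.
        y = [acc + gates[i] * o for acc, o in zip(y, expert_out)]
    return y, top, gates


# Toy experts: each scales its input by a different constant.
experts = [lambda v, s=s: [s * e for e in v] for s in (1.0, 10.0, 100.0)]
x = [1.0, 0.0]
w_r = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
y, top, gates = route_top_k(x, w_r, experts, k=2)
```

Only the two selected experts are evaluated, which is where the compute saving of sparse models comes from.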

To keep training from concentrating on only a few experts:

Switch Transformer and GShard add a differentiable load-balancing loss, which also reduces communication latency between devices.
Base Layers replace the balancing loss with a balanced assignment algorithm, which simplifies the pipeline.
HASH Layers use a random hashing function as the routing gate instead of learning the routing weight W_r.
Not mentioned in the paper, but there is also the DTS Gate approach: [54] Dense-to-Sparse Gate for Mixture-of-Experts #82.
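
For reference, a sketch of a Switch Transformer style auxiliary load-balancing loss, alpha * N * Σ_i f_i * P_i, where f_i is the fraction of tokens whose top-1 expert is i and P_i is the mean gate probability for expert i. The function name, the default `alpha`, and the toy inputs are assumptions for illustration, and this is not taken from the sMLP paper itself.

```python
def load_balancing_loss(gate_probs, alpha=0.01):
    """gate_probs: one list of N gate probabilities per token (each sums to 1)."""
    n_tokens = len(gate_probs)
    n_experts = len(gate_probs[0])
    f = [0.0] * n_experts  # fraction of tokens dispatched to each expert (top-1)
    p = [0.0] * n_experts  # mean gate probability per expert
    for probs in gate_probs:
        top1 = max(range(n_experts), key=lambda i: probs[i])
        f[top1] += 1.0 / n_tokens
        for i, v in enumerate(probs):
            p[i] += v / n_tokens
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])
```

The loss is smallest when tokens are spread evenly, so `collapsed` (every token on expert 0) comes out larger than `balanced`.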

Fig. 1 of Base Layers: simply assign the same number of tokens to every expert.
The Base Layers approach is referred to as tMoE below.
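
A rough sketch of "same number of tokens per expert". Note that this is a simple greedy heuristic chosen for brevity, not the assignment algorithm Base Layers actually uses; the function name and scores are hypothetical.

```python
def greedy_balanced_assignment(scores):
    """scores[t][e]: affinity of token t for expert e.

    Assumes the number of tokens is divisible by the number of experts.
    Greedy heuristic: take (token, expert) pairs in descending score order and
    assign whenever the token is still free and the expert has capacity left.
    """
    n_tokens, n_experts = len(scores), len(scores[0])
    capacity = n_tokens // n_experts
    pairs = sorted(
        ((scores[t][e], t, e) for t in range(n_tokens) for e in range(n_experts)),
        reverse=True,
    )
    assignment = [None] * n_tokens
    load = [0] * n_experts
    for _, t, e in pairs:
        if assignment[t] is None and load[e] < capacity:
            assignment[t] = e
            load[e] += 1
    return assignment, load


# Every token prefers expert 0, but each expert can only take two tokens.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
assignment, load = greedy_balanced_assignment(scores)
```

Even though all four tokens prefer expert 0, the capacity limit forces two of them onto expert 1, so no auxiliary balancing loss is needed.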

Methods

sMLP

This is the overall pipeline.

It has N1 dense blocks and N2 sparse blocks.
The Transformer FFN is replaced with tMoE from Base Layers.
Self-attention is replaced with sMoE and the Spatial Gating Unit.

tMoE routes tokens,
while sMoE routes the hidden dimension.

Routing in Feature Space

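Autoregressive LM has to be tackled with an all-MLP model.
Because the model is autoregressive, MoE routing cannot use information from later tokens, so two routing strategies are proposed as a solution.

Deterministic Routing (sMoE)
1. Instead of learning a routing strategy, split the hidden vector into equal chunks and send each chunk to its own expert.
2. Resembles MHSA in that every expert is used.
Partial Prediction (sMoE)
1. Instead of learning the routing strategy from the whole sentence, learn it from only the first 20% of tokens, then use it for prediction on the remaining 80%.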
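
Deterministic routing is simple enough to sketch directly. This is a per-token simplification under stated assumptions: the experts here are hypothetical stand-ins that map a chunk to a chunk, whereas in the paper each sMoE expert does spatial mixing on its chunk.

```python
def deterministic_feature_routing(x, experts):
    """x: hidden vector of length H; experts: N callables mapping a chunk to a chunk.

    Chunk i of the hidden vector always goes to expert i, so the routing
    depends only on the feature index and never on other tokens.
    """
    n = len(experts)
    if len(x) % n != 0:
        raise ValueError("hidden size must be divisible by the number of experts")
    chunk = len(x) // n
    out = []
    for i, expert in enumerate(experts):
        out.extend(expert(x[i * chunk:(i + 1) * chunk]))
    return out


toy_experts = [
    lambda c: [v + 1 for v in c],  # expert 0: add one
    lambda c: [v * 2 for v in c],  # expert 1: double
]
routed = deterministic_feature_routing([1, 2, 3, 4], toy_experts)
```

Since nothing is learned from the sequence, there is no way for future tokens to leak into the routing decision, and every expert receives work on every token.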

Deterministic routing turned out to work better.

Results

Models compared

Data and model sizes

Parameter and FLOP comparison per operation

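Model efficiency comparison

Experiments were run to compare efficiency for both small and large models.
Performance holds up well at large scale too.

A cute typo: it should be "word per sec", not "world per sec".

Model details

The number of dense layers increased substantially.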

Zero-shot priming evaluation

GPT-3 was trained on 3x as much data.

dhkim0225 added Pretraining Meta AI MoE MLP WIP and removed WIP labels Mar 16, 2022
