https://github.com/snap-research/EfficientFormer
## ‘Patch Embedding with large kernel and stride’ causes performance degradation on mobile devices
- The large kernel and stride of the non-overlapping convolution used for patch embedding slow down execution on mobile devices
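A minimal PyTorch sketch (not the paper's code; channel counts are illustrative assumptions) contrasting the two patch-embedding styles: a single large-kernel, large-stride non-overlapping convolution versus the overlapping conv stem described later in these notes (two 3x3, stride-2 convolutions):
```python
import torch
import torch.nn as nn

# Non-overlapping patch embedding: one large-kernel, large-stride convolution
# (e.g. 16x16 kernel, stride 16, as in a vanilla ViT). This is the layer the
# paper identifies as poorly supported by mobile compilers and accelerators.
vit_style_embed = nn.Conv2d(3, 192, kernel_size=16, stride=16)

# Overlapping conv stem: two 3x3, stride-2 convolutions with BatchNorm,
# downsampling 4x overall, as EfficientFormer uses instead.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 24, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(24),
    nn.ReLU(),
    nn.Conv2d(24, 48, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(48),
    nn.ReLU(),
)

x = torch.randn(1, 3, 224, 224)
print(vit_style_embed(x).shape)  # torch.Size([1, 192, 14, 14])
print(conv_stem(x).shape)        # torch.Size([1, 48, 56, 56])
```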
## Consistent feature dimension is important for the token mixer, and MHSA (Multi-head Self-Attention) is not the cause of slowdown
- Choosing a token mixer is essential when designing a ViT-based model.
The authors selected pooling and MHSA as candidates for the token mixer and concluded that 'pooling is simple and efficient' while 'MHSA has better performance'
** Token mixer - https://openreview.net/pdf?id=8l5GjEqGiRG
- The token-mixing MLPs allow communication between different spatial locations (tokens);
they operate on each channel independently and take individual columns of the table as inputs.
These two types of layers are interleaved to enable interaction of both input dimensions
-- CONV-BN is faster than the LN(GN)-Linear structure, and the accuracy drop incurred by choosing CONV-BN is tolerable (see the sketch below)
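A minimal PyTorch sketch of the two token-mixer candidates, assuming a PoolFormer-style pooling mixer on the 4D (CONV-BN friendly) partition and a standard MHSA mixer on the 3D (LN-Linear) partition; class names and dimensions are illustrative, not taken from the paper's code:
```python
import torch
import torch.nn as nn

class PoolMixer(nn.Module):
    """Pooling token mixer on a 4D feature map (B, C, H, W): simple and efficient."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):
        # Subtracting the input keeps only the "mixing" part (PoolFormer convention).
        return self.pool(x) - x

class MHSAMixer(nn.Module):
    """MHSA token mixer on a 3D token sequence (B, N, C): better accuracy, higher cost."""
    def __init__(self, dim=224, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        x_norm = self.norm(x)
        out, _ = self.attn(x_norm, x_norm, x_norm)
        return out

x4d = torch.randn(1, 48, 56, 56)   # 4D partition (CONV-BN friendly)
x3d = torch.randn(1, 49, 224)      # 3D partition (LN-Linear style)
print(PoolMixer()(x4d).shape, MHSAMixer()(x3d).shape)
```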
## Overall Architecture
see : https://openreview.net/pdf?id=NXHXoYMLIG - page 5
- The network consists of a patch embedding (PatchEmbed) and a stack of meta Transformer blocks (MB)
Y = Π_{i=1}^{m} MB_i(PatchEmbed(X_0^{B,3,H,W}))
- where X_0 is the input image with batch size B, 3 channels, and spatial size (H, W), Y is the desired output, and m is the total number of blocks (depth)
- MB consists of an unspecified token mixer followed by an MLP block
X_{i+1} = MB_i(X_i) = MLP(TokenMixer(X_i))
- where X_i (i > 0) is the intermediate feature forwarded into the i-th MB.
- The network includes 4 Stages
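A minimal sketch of the overall forward pass described by the two equations above, assuming PyTorch; MetaBlock, the pooling mixer, and all dimensions are placeholders standing in for the paper's blocks, not its implementation:
```python
import torch
import torch.nn as nn

class MetaBlock(nn.Module):
    """One MB: a token mixer followed by an MLP, with residual connections."""
    def __init__(self, dim, token_mixer, mlp_ratio=4):
        super().__init__()
        self.token_mixer = token_mixer
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):
        x = x + self.token_mixer(x)   # TokenMixer(X_i)
        x = x + self.mlp(x)           # MLP(TokenMixer(X_i))
        return x

class TinyEfficientFormerLike(nn.Module):
    """Y = MB_m( ... MB_1(PatchEmbed(X_0)) ... ) with m = depth blocks."""
    def __init__(self, dim=48, depth=4):
        super().__init__()
        self.patch_embed = nn.Sequential(   # two 3x3, stride-2 convolutions
            nn.Conv2d(3, dim // 2, 3, 2, 1), nn.BatchNorm2d(dim // 2), nn.ReLU(),
            nn.Conv2d(dim // 2, dim, 3, 2, 1), nn.BatchNorm2d(dim), nn.ReLU(),
        )
        # Placeholder token mixer: simple average pooling that preserves shape.
        self.blocks = nn.Sequential(
            *[MetaBlock(dim, nn.AvgPool2d(3, 1, 1)) for _ in range(depth)]
        )

    def forward(self, x):
        return self.blocks(self.patch_embed(x))

y = TinyEfficientFormerLike()(torch.randn(2, 3, 224, 224))
print(y.shape)  # torch.Size([2, 48, 56, 56])
```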
## Dimension-Consistent Design
- The network starts with 4D partition, while 3D partition is applied in the last stages
- First, input images are processed by a CONV stem with two 3 × 3 convolutions with stride 2 as patch embedding
- Then the network starts with MB4D blocks using a simple Pool mixer to extract low-level features
- After processing all the MB4D blocks, a one-time reshaping is performed to transform the feature size and enter the 3D partition
- MB3D follows the conventional ViT block structure (see the sketch below)
- In the paper, LinearG denotes a Linear layer followed by GeLU
- In the MHSA equation, Q, K, V represent the query, key, and value, and b is a parameterized attention bias acting as position encoding
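A minimal sketch of an MB3D-style block, assuming PyTorch; the attention-bias shape, dimensions, and class names are illustrative assumptions rather than the paper's code:
```python
import torch
import torch.nn as nn

class AttentionWithBias(nn.Module):
    """MHSA with a learned attention bias b added to QK^T as position encoding."""
    def __init__(self, dim, num_heads, num_tokens):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One bias per head and per (query, key) token pair.
        self.attn_bias = nn.Parameter(torch.zeros(num_heads, num_tokens, num_tokens))

    def forward(self, x):                      # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.attn_bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class MB3D(nn.Module):
    """Conventional ViT block: LN + MHSA (with bias), then LN + LinearG MLP."""
    def __init__(self, dim=224, num_heads=8, num_tokens=49, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = AttentionWithBias(dim, num_heads, num_tokens)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(              # LinearG = Linear followed by GeLU
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

print(MB3D()(torch.randn(1, 49, 224)).shape)   # torch.Size([1, 49, 224])
```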
## Latency Driven Slimming
# Supernet
- A supernet is defined for searching efficient models
- A MetaPath (MP) is defined as the collection of possible blocks
** The network starts with the 4D partition, while the 3D partition is applied in the last stages (why is MB3D enabled only in the last two stages?)
- Since the computation of MHSA grows quadratically with respect to token length, integrating it in the early stages would largely increase the computation cost (see the worked example below)
- Early stages in the network capture low-level features, while late layers learn long-term dependencies
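A quick worked example in Python of the quadratic growth; the 224x224 input and per-stage overall strides of 4/8/16/32 are standard assumptions for a 4-stage hierarchy, not figures quoted from the paper:
```python
# The N x N attention matrix grows quadratically with the token count N,
# so MHSA is far more expensive in early (high-resolution) stages.
for stage, stride in enumerate([4, 8, 16, 32], start=1):
    n_tokens = (224 // stride) ** 2
    print(f"Stage {stage}: N = {n_tokens}, attention matrix = {n_tokens**2:,} entries")
# Stage 1: N = 3136, attention matrix = 9,834,496 entries
# Stage 2: N = 784, attention matrix = 614,656 entries
# Stage 3: N = 196, attention matrix = 38,416 entries
# Stage 4: N = 49, attention matrix = 2,401 entries
```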
## Searching Space
# Searching Algorithm
- First, the supernet is trained with Gumbel Softmax sampling to get an importance score for the blocks within each MP
X_{i+1} = Σ_n [ exp((α_i^n + ε_i^n) / τ) / Σ_n exp((α_i^n + ε_i^n) / τ) ] · MP_{i,n}(X_i)
- where α evaluates the importance of each block in the MP, as it represents the probability of selecting that block; ε ∼ U(0, 1) ensures exploration and τ is the temperature (a sketch follows this list)
- n ∈ {4D, I} for Stage 1 and Stage 2, and n ∈ {4D, 3D, I} for Stage 3 and Stage 4
- Then, a latency lookup table is built by collecting the on-device latency of MB4D and MB3D with different widths (multiples of 16)
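A minimal sketch of the two search steps above, assuming PyTorch and following the softmax-with-noise equation as written in these notes; the MetaPath blocks, widths, and the latency measurement loop are placeholders, not the paper's code:
```python
import time
import torch
import torch.nn as nn

def metapath_forward(x, blocks, alpha, tau=1.0):
    """Weighted sum over the MetaPath blocks: softmax over (alpha + eps) / tau,
    with eps ~ U(0, 1) for exploration, as in the equation above."""
    eps = torch.rand_like(alpha)
    weights = torch.softmax((alpha + eps) / tau, dim=0)
    return sum(w * blk(x) for w, blk in zip(weights, blocks))

# Example MetaPath for a late stage, n in {4D, 3D, I}; the convolutions are
# stand-ins for MB4D and MB3D, and Identity is the skip (I) path.
dim = 96
blocks = nn.ModuleList([
    nn.Conv2d(dim, dim, 3, padding=1),   # stand-in for MB4D
    nn.Conv2d(dim, dim, 1),              # stand-in for MB3D
    nn.Identity(),                       # I: skip this block
])
alpha = nn.Parameter(torch.zeros(len(blocks)))   # learned importance scores
x = torch.randn(1, dim, 14, 14)
out = metapath_forward(x, blocks, alpha)

# Latency lookup table: measure each block type at widths that are multiples of 16.
latency_lut = {}
for width in range(16, 129, 16):
    blk = nn.Conv2d(width, width, 3, padding=1)  # stand-in for MB4D at this width
    inp = torch.randn(1, width, 14, 14)
    start = time.perf_counter()
    for _ in range(10):
        blk(inp)
    latency_lut[("MB4D", width)] = (time.perf_counter() - start) / 10
```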