- ViT (ICLR 2021) (classification or backbone network)
- DeiT (ICML 2021) (classification or backbone network with knowledge distillation)
- Swin Transformer (ICCV 2021) (classification or backbone network)
- Segmenter (ICCV 2021) (semantic segmentation)
- SETR (CVPR 2021) (semantic segmentation)
- CoaT (ICCV 2021) (classification or backbone network)
- Vision Transformer (ViT) is an image classification model inspired by the Transformer architecture from NLP
- It treats image patches the way an NLP Transformer treats words (tokens)
- Self-attention is the core idea
- It uses no convolutional layers at all and, when pre-trained on large datasets, matched or exceeded state-of-the-art CNNs on image classification benchmarks
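
To make the "patches as words" and self-attention points concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not the ViT implementation: the image is a tiny grayscale grid, the helper names (`split_into_patches`, `self_attention`) are made up for this example, and queries, keys, and values are the raw patch vectors rather than learned projections. The actual model also adds a learned linear embedding, position embeddings, and a class token, all omitted here.

```python
import math


def split_into_patches(image, patch_size):
    """Split an H x W grayscale image (list of rows) into flattened patches.

    Patches are returned in row-major order, each as a flat list of
    patch_size * patch_size pixel values (one "word" per patch).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption for brevity: queries, keys, and values are the tokens
    themselves (identity projections); a real model learns these.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed = [sum(w * value[i] for w, value in zip(weights, tokens))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs


# A 4x4 toy image split into four 2x2 patches, i.e. four "words".
image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
tokens = split_into_patches(image, patch_size=2)
attended = self_attention(tokens)
print(len(tokens), len(tokens[0]))
```

Every output vector mixes information from all patches, which is how a Transformer layer relates distant image regions without any convolution. In practice this would be written with an array library and learned weight matrices; the pure-Python loops are only meant to keep each step visible.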