PaddlePaddle reimplementation of Google's repository for the ViT model that was released with the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy*†, Lucas Beyer*, Alexander Kolesnikov*, Dirk Weissenborn*, Xiaohua Zhai*, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit and Neil Houlsby*†.
(*) equal technical contribution, (†) equal advising.
Overview of the model: we split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable "classification token" to the sequence.
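The core idea can be sketched in a few lines of PaddlePaddle. The snippet below is a minimal, illustrative patch-embedding module; the class, parameter names, and dimensions are ours for illustration and do not mirror the PLSC implementation.

import paddle
import paddle.nn as nn

class PatchEmbedAndCLS(nn.Layer):
    """Minimal sketch: image -> patch tokens + [CLS] token + position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to splitting the image into
        # fixed-size patches and applying a shared linear projection.
        self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = self.create_parameter(shape=[1, 1, embed_dim])
        self.pos_embed = self.create_parameter(shape=[1, num_patches + 1, embed_dim])

    def forward(self, x):                      # x: [B, 3, H, W]
        x = self.proj(x)                       # [B, D, H/P, W/P]
        x = x.flatten(2).transpose([0, 2, 1])  # [B, N, D] patch tokens
        cls = self.cls_token.expand([x.shape[0], -1, -1])
        x = paddle.concat([cls, x], axis=1)    # prepend the classification token
        return x + self.pos_embed              # add position embeddings

# The resulting sequence is fed to a standard Transformer encoder, e.g.:
# encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 12, 3072), num_layers=12)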
PaddlePaddle 2.4 is required for some of the newer features. For installation instructions, refer to installation.md.
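For example, a GPU build can typically be installed with pip; the exact wheel and index depend on your CUDA version, so treat this as illustrative and see installation.md for details.

# Example only: install a CUDA build of PaddlePaddle 2.4 via pip.
# The correct wheel depends on your CUDA/cuDNN setup; see installation.md.
python -m pip install paddlepaddle-gpu==2.4.0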
# Note: when running on multiple nodes, set the
# following environment variables accordingly and
# then run this script on each node.
export PADDLE_NNODES=1
export PADDLE_MASTER="127.0.0.1:12538"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m paddle.distributed.launch \
--nnodes=$PADDLE_NNODES \
--master=$PADDLE_MASTER \
--devices=$CUDA_VISIBLE_DEVICES \
plsc-train \
-c ./configs/ViT_base_patch16_224_in1k_1n8c_dp_fp16o2.yaml
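For example, a two-node job only changes the environment variables above; the master address below is illustrative, and the same launch command is then run on both nodes.

# Example only: two nodes, with node 0 acting as the master.
export PADDLE_NNODES=2
export PADDLE_MASTER="192.168.0.1:12538"  # reachable IP:port of node 0 (illustrative)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7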
# [Optional] Download checkpoint
mkdir -p pretrained/vit/ViT_base_patch16_224/
wget -O ./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224.pdparams https://plsc.bj.bcebos.com/models/vit/v2.4/imagenet2012-ViT-B_16-224.pdparams
# Note: when running on multiple nodes, set the
# following environment variables accordingly and
# then run this script on each node.
export PADDLE_NNODES=1
export PADDLE_MASTER="127.0.0.1:12538"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m paddle.distributed.launch \
--nnodes=$PADDLE_NNODES \
--master=$PADDLE_MASTER \
--devices=$CUDA_VISIBLE_DEVICES \
plsc-train \
-c ./configs/ViT_base_patch16_384_ft_in1k_1n8c_dp_fp16o2.yaml \
-o Global.pretrained_model=./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224
We provide more directly runnable configurations; see ViT Configurations.
Model | Phase | Dataset | Configs | GPUs | Img/sec | Top1 Acc | Official Top1 Acc | Pre-trained checkpoint | Fine-tuned checkpoint | Log |
---|---|---|---|---|---|---|---|---|---|---|
ViT-B_16_224 | pretrain | ImageNet2012 | config | A100*N1C8 | 3583 | 0.75196 | 0.7479 | download | - | log |
ViT-B_16_384 | finetune | ImageNet2012 | config | A100*N1C8 | 719 | 0.77972 | 0.7791 | download | download | log |
ViT-L_16_224 | pretrain | ImageNet21K | config | A100*N4C32 | 5256 | - | - | download | - | log |
ViT-L_16_384 | finetune | ImageNet2012 | config | A100*N4C32 | 934 | 0.85030 | 0.8505 | download | download | log |
ImageNet21K is not "cleaned": the (exact) same image may appear under multiple folders (labels)
(see About ImageNet-21K train and eval label).
The official ViT paper and repository likewise do not provide "cleaned" <image, label>
training label files.
Based on various pieces of information and conjectures (thanks @lucasb-eyer), we were able to match the accuracy reported by the official ViT repository. If you want to pre-train ViT-Large on ImageNet21K from scratch, you can prepare the data with the following steps.
Since ImageNet21K does not provide an official validation split, we use all the images as the training set. The dummy validation set is not meant for hyperparameter tuning or evaluation; it only serves as a quick sanity check that training is progressing normally.
(1) Calculate the md5 value of each image
# 21841 classes
ImageNet21K/
└── images
├── n00004475/
├── n02087122/
├── ...
└── n12492682/
find /data/ImageNet21K/images/ -type f -print0 | xargs --null md5sum > md5sum.txt
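If md5sum/xargs is not convenient, a Python equivalent (illustrative, not part of the repository) can compute the hashes in parallel while keeping a single writer process, producing the same `md5  path` format consumed by the next step.

import hashlib
import multiprocessing
from pathlib import Path

ROOT = Path('/data/ImageNet21K/images')  # adjust to your dataset location

def md5_of(path):
    # Hash the file in 1 MiB chunks to keep memory usage low.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest(), str(path)

if __name__ == '__main__':
    files = [p for p in ROOT.rglob('*') if p.is_file()]
    with multiprocessing.Pool() as pool, open('md5sum.txt', 'w') as out:
        for digest, path in pool.imap_unordered(md5_of, files, chunksize=64):
            out.write(f'{digest}  {path}\n')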
(2) Reassign multi-labels based on the md5 values
from collections import defaultdict

# Read the md5sum output and strip the dataset root prefix from each path.
lines = []
with open('md5sum.txt', 'r') as f:
    for line in f:
        # e.g. 35c1efae521b76e423cdd07a00d963c9  /data/ImageNet21K/images/n00004475/n00004475_54295.JPEG
        line = line.replace('/data/ImageNet21K/', '')
        lines.append(line)

# Group image paths by md5 so that exact duplicates collapse into one entry.
md5_to_paths = defaultdict(list)
classes = set()
for line in lines:
    md5, path = line.strip().split()
    md5_to_paths[md5].append(path)
    classes.add(path.split('/')[-2])

classes = sorted(classes)
class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}

# For each unique image, keep the first path and collect the labels of all
# folders (synsets) its duplicates appear in.
out = []
for paths in md5_to_paths.values():
    path = paths[0]
    labels = sorted({class_to_idx[p.split('/')[-2]] for p in paths})
    out.append((path, labels))
out.sort(key=lambda x: x[1][0])

# Write one "path label1,label2,..." entry per unique image.
with open('image_all_list.txt', 'w') as fp:
    for path, labels in out:
        label = ','.join(str(l) for l in labels)
        fp.write(f'{path} {label}\n')
(3) [Optional] Choose a dummy validation set
from collections import defaultdict

# Collect, per class, the images that carry only that single label.
id_to_images = defaultdict(list)
with open('image_all_list.txt', 'r') as f:
    for line in f:
        path, label = line.strip().split()
        labels = label.split(',')
        if len(labels) == 1:
            id_to_images[labels[0]].append(path)

# Take up to 20 single-label images per class as the dummy validation set.
with open('image_dummy_val_list.txt', 'w') as f:
    for idx in id_to_images:
        for path in id_to_images[idx][:20]:
            f.write(f'{path} {idx}\n')
@article{dosovitskiy2020,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
journal={arXiv preprint arXiv:2010.11929},
year={2020}
}