PaddlePaddle reimplementation of Google's repository for the ViT model that was released with the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy*†, Lucas Beyer*, Alexander Kolesnikov*, Dirk Weissenborn*, Xiaohua Zhai*, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit and Neil Houlsby*†.
(*) equal technical contribution, (†) equal advising.
Overview of the model: we split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable "classification token" to the sequence.
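The core idea can be sketched in a few lines of PaddlePaddle. The snippet below is a minimal, illustrative patch-embedding module; the class, parameter names, and dimensions are ours for illustration and do not mirror the PLSC implementation.

import paddle
import paddle.nn as nn

class PatchEmbedAndCLS(nn.Layer):
    """Minimal sketch: image -> patch tokens + [CLS] token + position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to splitting the image into
        # fixed-size patches and applying a shared linear projection.
        self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = self.create_parameter(shape=[1, 1, embed_dim])
        self.pos_embed = self.create_parameter(shape=[1, num_patches + 1, embed_dim])

    def forward(self, x):                      # x: [B, 3, H, W]
        x = self.proj(x)                       # [B, D, H/P, W/P]
        x = x.flatten(2).transpose([0, 2, 1])  # [B, N, D] patch tokens
        cls = self.cls_token.expand([x.shape[0], -1, -1])
        x = paddle.concat([cls, x], axis=1)    # prepend the classification token
        return x + self.pos_embed              # add position embeddings

# The resulting sequence is fed to a standard Transformer encoder, e.g.:
# encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 12, 3072), num_layers=12)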
PaddlePaddle 2.4 is required for some of the newer features. For installation instructions, refer to installation.md.
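For example, a GPU build can typically be installed with pip; the exact wheel and index depend on your CUDA version, so treat this as illustrative and see installation.md for details.

# Example only: install a CUDA build of PaddlePaddle 2.4 via pip.
# The correct wheel depends on your CUDA/cuDNN setup; see installation.md.
python -m pip install paddlepaddle-gpu==2.4.0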
# Note: when running on multiple nodes, set the
# following environment variables accordingly and
# then run this script on each node.
export PADDLE_NNODES=1
export PADDLE_MASTER="127.0.0.1:12538"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m paddle.distributed.launch \
--nnodes=$PADDLE_NNODES \
--master=$PADDLE_MASTER \
--devices=$CUDA_VISIBLE_DEVICES \
plsc-train \
-c ./configs/ViT_base_patch16_224_in1k_1n8c_dp_fp16o2.yaml
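For example, a two-node job only changes the environment variables above; the master address below is illustrative, and the same launch command is then run on both nodes.

# Example only: two nodes, with node 0 acting as the master.
export PADDLE_NNODES=2
export PADDLE_MASTER="192.168.0.1:12538"  # reachable IP:port of node 0 (illustrative)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7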
# [Optional] Download checkpoint
mkdir -p pretrained/vit/ViT_base_patch16_224/
wget -O ./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224.pdparams https://plsc.bj.bcebos.com/models/vit/v2.4/imagenet2012-ViT-B_16-224.pdparams
# Note: when running on multiple nodes, set the
# following environment variables accordingly and
# then run this script on each node.
export PADDLE_NNODES=1
export PADDLE_MASTER="127.0.0.1:12538"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m paddle.distributed.launch \
--nnodes=$PADDLE_NNODES \
--master=$PADDLE_MASTER \
--devices=$CUDA_VISIBLE_DEVICES \
plsc-train \
-c ./configs/ViT_base_patch16_384_ft_in1k_1n8c_dp_fp16o2.yaml \
-o Global.pretrained_model=./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224
We provide more directly runnable configurations; see ViT Configurations.
Model | Phase | Dataset | Configs | GPUs | Img/sec | Top1 Acc | Official Top1 Acc | Pre-trained checkpoint | Fine-tuned checkpoint | Log |
---|---|---|---|---|---|---|---|---|---|---|
ViT-B_16_224 | pretrain | ImageNet2012 | config | A100*N1C8 | 3583 | 0.75196 | 0.7479 | download | - | log |
ViT-B_16_384 | finetune | ImageNet2012 | config | A100*N1C8 | 719 | 0.77972 | 0.7791 | download | download | log |
ViT-L_16_224 | pretrain | ImageNet21K | config | A100*N4C32 | 5256 | - | - | download | - | log |
ViT-L_16_384 | finetune | ImageNet2012 | config | A100*N4C32 | 934 | 0.85030 | 0.8505 | download | download | log |
ImageNet21K is not "cleaned": the (exact) same image may appear under multiple folders (labels)
(see About ImageNet-21K train and eval label).
The official ViT paper and repository likewise do not provide "cleaned" <image, label>
training label files.
Based on various pieces of information and conjectures (thanks @lucasb-eyer), we were able to match the accuracy reported by the official ViT repository. If you want to pre-train ViT-Large on ImageNet21K from scratch, you can prepare the data with the following steps.
Since ImageNet21K does not provide an official validation split, we use all the images as the training set. The dummy validation set is not meant for hyperparameter tuning or evaluation; it only serves as a quick sanity check that training is progressing normally.
(1) Calculate the md5 value of each image
# 21841 classes
ImageNet21K/
└── images
├── n00004475/
├── n02087122/
├── ...
└── n12492682/
find /data/ImageNet21K/images/ -type f -print0 | xargs --null md5sum > md5sum.txt
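If md5sum/xargs is not convenient, a Python equivalent (illustrative, not part of the repository) can compute the hashes in parallel while keeping a single writer process, producing the same `md5  path` format consumed by the next step.

import hashlib
import multiprocessing
from pathlib import Path

ROOT = Path('/data/ImageNet21K/images')  # adjust to your dataset location

def md5_of(path):
    # Hash the file in 1 MiB chunks to keep memory usage low.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest(), str(path)

if __name__ == '__main__':
    files = [p for p in ROOT.rglob('*') if p.is_file()]
    with multiprocessing.Pool() as pool, open('md5sum.txt', 'w') as out:
        for digest, path in pool.imap_unordered(md5_of, files, chunksize=64):
            out.write(f'{digest}  {path}\n')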
(2) Reassign multi-labels based on the md5 values
from collections import defaultdict

# Read the md5sum output and strip the dataset root prefix from each path.
lines = []
with open('md5sum.txt', 'r') as f:
    for line in f:
        # e.g. 35c1efae521b76e423cdd07a00d963c9  /data/ImageNet21K/images/n00004475/n00004475_54295.JPEG
        line = line.replace('/data/ImageNet21K/', '')
        lines.append(line)

# Group image paths by md5 so that exact duplicates collapse into one entry.
md5_to_paths = defaultdict(list)
classes = set()
for line in lines:
    md5, path = line.strip().split()
    md5_to_paths[md5].append(path)
    classes.add(path.split('/')[-2])

classes = sorted(classes)
class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}

# For each unique image, keep the first path and collect the labels of all
# folders (synsets) its duplicates appear in.
out = []
for paths in md5_to_paths.values():
    path = paths[0]
    labels = sorted({class_to_idx[p.split('/')[-2]] for p in paths})
    out.append((path, labels))
out.sort(key=lambda x: x[1][0])

# Write one "path label1,label2,..." entry per unique image.
with open('image_all_list.txt', 'w') as fp:
    for path, labels in out:
        label = ','.join(str(l) for l in labels)
        fp.write(f'{path} {label}\n')
(3) [Optional] Choose a dummy validation set
from collections import defaultdict

# Collect, per class, the images that carry only that single label.
id_to_images = defaultdict(list)
with open('image_all_list.txt', 'r') as f:
    for line in f:
        path, label = line.strip().split()
        labels = label.split(',')
        if len(labels) == 1:
            id_to_images[labels[0]].append(path)

# Take up to 20 single-label images per class as the dummy validation set.
with open('image_dummy_val_list.txt', 'w') as f:
    for idx in id_to_images:
        for path in id_to_images[idx][:20]:
            f.write(f'{path} {idx}\n')
@article{dosovitskiy2020,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
journal={arXiv preprint arXiv:2010.11929},
year={2020}
}