Official implementation of ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining (AAAI 2024)

ViTEraser (AAAI 2024)

This is the official implementation of ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining (AAAI 2024). ViTEraser revisits the conventional single-step, one-stage framework and improves it with Vision Transformers (ViTs) for feature modeling and the proposed SegMIM pretraining. The frameworks of ViTEraser and SegMIM are shown below.

[Figures: ViTEraser framework; SegMIM framework]

Todo List

  • Inference code and model weights
  • ViTEraser training code
  • SegMIM pre-training code

Environment

We recommend using Anaconda to manage environments. Run the following commands to install dependencies.

conda create -n viteraser python=3.7 -y
conda activate viteraser
pip install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111
git clone https://github.com/shannanyinxiang/ViTEraser.git
cd ViTEraser
pip install -r requirements.txt
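
After installation, it can help to confirm that the pinned PyTorch packages are the ones actually active in the environment. The following is a minimal sketch, not part of the repository: the helper names `installed_versions` and `find_mismatches` are hypothetical, and the script assumes each package exposes a `__version__` attribute.

```python
import importlib

# Pins copied from the pip command above.
EXPECTED = {"torch": "1.8.2", "torchvision": "0.9.2", "torchaudio": "0.8.2"}


def installed_versions(packages):
    """Return {package: version or None}, importing each package if present."""
    found = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", None)
        except (ImportError, OSError):
            # Package is absent or its native libraries failed to load.
            found[name] = None
    return found


def find_mismatches(expected, installed):
    """List (package, expected, installed) for every pin that is not met.

    Local build suffixes such as "+cu111" are ignored in the comparison.
    """
    mismatches = []
    for name, want in expected.items():
        have = installed.get(name)
        base = have.split("+")[0] if have else None
        if base != want:
            mismatches.append((name, want, have))
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(EXPECTED, installed_versions(EXPECTED))
    for name, want, have in problems:
        print(f"{name}: expected {want}, found {have}")
    if not problems:
        print("All pinned packages match.")
```

Run it inside the activated `viteraser` environment; any line it prints names a package whose version differs from the pin.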

Datasets

1. Text Removal Dataset

  • SCUT-EnsText [paper]:

    1. Download the training and testing sets of SCUT-EnsText at link.
    2. Rename the all_images and all_labels folders to image and label, respectively.
    3. Generate text masks:
      # Generating masks for the training set of SCUT-EnsText
      python tools/generate_mask.py \
        --data_root data/TextErase/SCUT-EnsText/train    
    
      # Generating masks for the testing set of SCUT-EnsText
      # Masks are not used for inference; they are generated only to keep the same data structure as the training set.
      python tools/generate_mask.py \
        --data_root data/TextErase/SCUT-EnsText/test
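
The renaming in step 2 can also be scripted. The snippet below is a small sketch under the assumption that the downloaded archives were extracted to `data/TextErase/SCUT-EnsText/train` and `data/TextErase/SCUT-EnsText/test`; `rename_split_folders` is a hypothetical helper, not a tool shipped with the repository.

```python
from pathlib import Path

# Source -> target folder names, as given in step 2 above.
RENAMES = {"all_images": "image", "all_labels": "label"}


def rename_split_folders(split_dir):
    """Rename all_images/all_labels inside one split directory.

    Returns the names of the folders that were created by renaming. Folders
    that are missing or already renamed are skipped, so re-running is safe.
    """
    split_dir = Path(split_dir)
    renamed = []
    for old, new in RENAMES.items():
        src, dst = split_dir / old, split_dir / new
        if src.is_dir() and not dst.exists():
            src.rename(dst)
            renamed.append(new)
    return renamed


if __name__ == "__main__":
    root = Path("data/TextErase/SCUT-EnsText")  # assumed extraction location
    for split in ("train", "test"):
        print(split, rename_split_folders(root / split))
```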
    

2. SegMIM Pretraining Datasets

(optional, only required by SegMIM pretraining)

Arrange the above datasets in the data folder using the following file structure.

data
├─TextErase
│  └─SCUT-EnsText
│     ├─train
│     │  ├─image
│     │  ├─label
│     │  └─mask
│     └─test
│        ├─image
│        ├─label
│        └─mask
└─SegMIMDatasets
   ├─ArT
   ├─ICDAR2013
   ├─ICDAR2015
   ├─LSVT
   ├─MLT2017
   ├─ReCTS
   └─TextOCR
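
Before training or inference, the SCUT-EnsText part of this layout can be checked programmatically. This is an illustrative sketch only: `missing_folders` is a hypothetical helper, and it deliberately ignores the optional SegMIMDatasets folders.

```python
from pathlib import Path

# Splits and sub-folders expected for SCUT-EnsText, per the tree above.
SPLITS = ("train", "test")
SUBFOLDERS = ("image", "label", "mask")


def missing_folders(data_root):
    """Return the expected SCUT-EnsText folders that do not exist yet.

    Paths are reported relative to data_root, using forward slashes.
    """
    data_root = Path(data_root)
    base = data_root / "TextErase" / "SCUT-EnsText"
    expected = [base / split / sub for split in SPLITS for sub in SUBFOLDERS]
    return [
        path.relative_to(data_root).as_posix()
        for path in expected
        if not path.is_dir()
    ]


if __name__ == "__main__":
    missing = missing_folders("data")
    if missing:
        print("Missing folders:")
        for rel in missing:
            print("  " + rel)
    else:
        print("SCUT-EnsText layout looks complete.")
```

A missing `mask` folder usually means the mask generation step has not been run for that split yet.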

Models

Download links for the pre-trained ViTEraser weights are provided in the following table.

| Name | BaiduNetDisk | GoogleDrive |
| --- | --- | --- |
| ViTEraser-Tiny | link | link |
| ViTEraser-Small | link | link |
| ViTEraser-Base | link | link |

Inference

An example command for inference with ViTEraser-Tiny is:

CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch \
        --master_port=3151 \
        --nproc_per_node 1 \
        --use_env \
        main.py \
        --eval \
        --data_root data/TextErase/ \
        --val_dataset scutens_test \
        --batch_size 1 \
        --encoder swinv2 \
        --decoder swinv2 \
        --pred_mask false \
        --intermediate_erase false \
        --swin_enc_embed_dim 96 \
        --swin_enc_depths 2 2 6 2 \
        --swin_enc_num_heads 3 6 12 24 \
        --swin_enc_window_size 16 \
        --swin_dec_depths 2 6 2 2 2 \
        --swin_dec_num_heads 24 12 6 3 2 \
        --swin_dec_window_size 16 \
        --output_dir path/to/save/output/ \
        --resume path/to/weights/

The arguments that differ between ViTEraser scales are listed below:

| Argument | Tiny | Small | Base |
| --- | --- | --- | --- |
| swin_enc_embed_dim | 96 | 96 | 128 |
| swin_enc_depths | 2 2 6 2 | 2 2 18 2 | 2 2 18 2 |
| swin_enc_num_heads | 3 6 12 24 | 3 6 12 24 | 4 8 16 32 |
| swin_enc_window_size | 16 | 16 | 8 |
| swin_dec_depths | 2 6 2 2 2 | 2 18 2 2 2 | 2 18 2 2 2 |
| swin_dec_num_heads | 24 12 6 3 2 | 24 12 6 3 2 | 32 16 8 4 2 |
| swin_dec_window_size | 16 | 8 | 8 |
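
To avoid retyping these values, the table can be kept as a lookup that expands into command-line flags. The sketch below is not part of the repository; `SCALE_ARGS` and `scale_cli_args` are hypothetical names, and only the flag names and values are taken from the table above.

```python
# Per-scale values copied from the table above.
SCALE_ARGS = {
    "tiny": {
        "swin_enc_embed_dim": [96],
        "swin_enc_depths": [2, 2, 6, 2],
        "swin_enc_num_heads": [3, 6, 12, 24],
        "swin_enc_window_size": [16],
        "swin_dec_depths": [2, 6, 2, 2, 2],
        "swin_dec_num_heads": [24, 12, 6, 3, 2],
        "swin_dec_window_size": [16],
    },
    "small": {
        "swin_enc_embed_dim": [96],
        "swin_enc_depths": [2, 2, 18, 2],
        "swin_enc_num_heads": [3, 6, 12, 24],
        "swin_enc_window_size": [16],
        "swin_dec_depths": [2, 18, 2, 2, 2],
        "swin_dec_num_heads": [24, 12, 6, 3, 2],
        "swin_dec_window_size": [8],
    },
    "base": {
        "swin_enc_embed_dim": [128],
        "swin_enc_depths": [2, 2, 18, 2],
        "swin_enc_num_heads": [4, 8, 16, 32],
        "swin_enc_window_size": [8],
        "swin_dec_depths": [2, 18, 2, 2, 2],
        "swin_dec_num_heads": [32, 16, 8, 4, 2],
        "swin_dec_window_size": [8],
    },
}


def scale_cli_args(scale):
    """Expand one scale's settings into a flat list of CLI tokens."""
    tokens = []
    for flag, values in SCALE_ARGS[scale].items():
        tokens.append("--" + flag)
        tokens.extend(str(v) for v in values)
    return tokens


if __name__ == "__main__":
    for scale in SCALE_ARGS:
        print(scale + ": " + " ".join(scale_cli_args(scale)))
```

The printed tokens can be pasted in place of the `--swin_*` lines of the inference command; the remaining arguments stay unchanged.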

Evaluation

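The commands for calculating metrics are given below. The first runs the repository's evaluation script, and the second computes FID with pytorch_fid.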

python eval/evaluation.py \
    --gt_path data/TextErase/SCUT-EnsText/test/label/ \
    --target_path path/to/model/output/

python -m pytorch_fid \
    data/TextErase/SCUT-EnsText/test/label/ \
    path/to/model/output/ \
    --device cuda:0
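
Both commands compare the model output folder against the ground-truth label folder, so a quick pre-check that the two folders contain matching images can catch an incomplete inference run. The following is a sketch under stated assumptions: `unpaired_images` is a hypothetical helper, outputs are assumed to share a filename stem with their ground-truth image, and the extension list is a guess rather than something the repository specifies.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the actual files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def image_stems(folder):
    """Collect the filename stems of all image files directly in a folder."""
    return {
        p.stem
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    }


def unpaired_images(gt_path, target_path):
    """Return (missing_outputs, extra_outputs) as sorted lists of stems."""
    gt, out = image_stems(gt_path), image_stems(target_path)
    return sorted(gt - out), sorted(out - gt)


if __name__ == "__main__":
    missing, extra = unpaired_images(
        "data/TextErase/SCUT-EnsText/test/label/",
        "path/to/model/output/",
    )
    print(f"{len(missing)} ground-truth images without an output")
    print(f"{len(extra)} outputs without a ground-truth image")
```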

ViTEraser Training

1. Training without SegMIM pretraining

  • Download the ImageNet-pretrained weights of Swin Transformer V2 (Tiny: download link, Small: download link, Base: download link, originally released at repo).
  • Download the ImageNet-pretrained weights of VGG-16 (download link, originally released by PyTorch).
  • Put the pretrained weights into the pretrained folder.
  • Run the example scripts in the scripts/viteraser-training-wosegmim folder. For instance, run the following command to train ViTEraser-Tiny without SegMIM pretraining.
bash scripts/viteraser-training-wosegmim/viteraser-tiny-train.sh

2. Training with SegMIM pretraining

  • Download the SegMIM pretraining weights for ViTEraser-Tiny (download link), ViTEraser-Small (download link), or ViTEraser-Base (download link).
  • Download the ImageNet-pretrained weights of VGG-16 (download link, originally released by PyTorch).
  • Put the pretrained weights into the pretrained folder.
  • Run the example scripts in the scripts/viteraser-training-withsegmim folder. For instance, run the following command to train ViTEraser-Tiny with SegMIM pretraining.
bash scripts/viteraser-training-withsegmim/viteraser-tiny-train-withsegmim.sh

SegMIM Pretraining

  • Download the ImageNet-pretrained weights of Swin Transformer V2 (Tiny: download link, Small: download link, Base: download link, originally released at repo) into the pretrained folder.
  • Run the example scripts in the scripts/segmim folder. For instance, run the following command to perform SegMIM pretraining of ViTEraser-Tiny.
# end-to-end encoder-decoder pretraining
bash scripts/segmim/viteraser-tiny-segmim.sh

# standalone encoder finetuning
bash scripts/segmim/viteraser-tiny-encoder-finetune.sh

Citation

@inproceedings{peng2024viteraser,
  title={ViTEraser: Harnessing the power of vision transformers for scene text removal with SegMIM pretraining},
  author={Peng, Dezhi and Liu, Chongyu and Liu, Yuliang and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={5},
  pages={4468--4477},
  year={2024}
}

Copyright

This repository may be used for non-commercial research purposes only.

For commercial use, please contact Prof. Lianwen Jin ([email protected]).

Copyright 2024, Deep Learning and Vision Computing Lab, South China University of Technology.
