ImageCaptioning-Verbose

PyTorch implementation for Image Captioning, supporting single- or multi-GPU training, and inference with a single model or an ensemble of multiple models.

This repo is a version that uses pre-extracted features for training and testing.

Requirements (Our Main Environment)

  • Python 3.10.11
  • PyTorch 1.13.1
  • TorchVision 0.14.1
  • coco-caption
  • numpy
  • tqdm

Note: Earlier PyTorch versions, such as 1.5.1, are also supported. Newer versions (>2.0) have not been verified.

Preparation

1. coco-caption preparation

As described in the coco-caption README.md, you first need to download the Stanford CoreNLP 3.6.0 code and models used by SPICE. To do this, run:

cd coco_caption
bash get_stanford_models.sh

2. Data preparation

The files required for training and evaluation are stored in the mscoco folder, which is organized as follows:

mscoco/
|--feature/
    |--COCO_SwinL_Feats/ # Grid Features, for PureT
       |--*.npz
    |--COCO_UpDown_10_100_Feats/ # Region Features, for UpDown, XLAN, XTransformer, ...
       |--*.npz
|--misc/
|--sent/
|--txt/

where the mscoco/feature/COCO_SwinL_Feats folder contains the pre-extracted features of the MSCOCO 2014 dataset. You can download the other files from GoogleDrive or Baidu Netdisk (extraction code: hryh).

Refer to tools/README.md for more details.
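
Each .npz file holds the pre-extracted features for one image. As a rough illustration of how such a file might be read, here is a minimal sketch; the array key "features", the feature shape, and the example file name are assumptions, not this repository's exact format.

# Minimal sketch of loading one pre-extracted feature file.
# The key "features", the shape comment, and the example path are assumptions.
import numpy as np

def load_image_features(npz_path):
    """Load grid/region features for a single image from an .npz file."""
    with np.load(npz_path) as data:
        feats = data["features"]   # e.g. (num_grids_or_regions, feat_dim)
    return feats

# Usage (hypothetical file name):
# feats = load_image_features("mscoco/feature/COCO_SwinL_Feats/000000000001.npz")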

Training

*Note: our repository is mainly based on JDAI-CV/image-captioning, and we directly reused its config.yml files, so they contain many unused parameters.

1. Training under XE loss

Download the pre-trained backbone model (Swin-Transformer) from GoogleDrive or Baidu Netdisk (extraction code: hryh) and save it in the root directory.

Before training, you may need to check and modify the parameters in the config.yml and train.sh files. Then run the script:

# for XE training
bash experiments_PureT/PureT_XE/train.sh
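
The script above wraps the repository's training entry point. To illustrate what the XE stage optimizes, here is a minimal sketch of a single cross-entropy training step; the model, optimizer, and tensor shapes are placeholders, not the repository's actual classes.

# Illustrative XE (cross-entropy) training step; names are placeholders.
import torch
import torch.nn.functional as F

def xe_step(model, optimizer, feats, captions, pad_idx=0):
    # feats: (B, N, D) pre-extracted features; captions: (B, T) token ids
    logits = model(feats, captions[:, :-1])          # predict the next token
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        captions[:, 1:].reshape(-1),
        ignore_index=pad_idx,                        # ignore padding positions
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()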

2. Training using SCST (self-critical sequence training)

Copy the XE-pretrained model into the experiments_PureT/PureT_SCST/snapshot/ folder and modify the config.yml and train.sh files. Then run the script:

# for SCST training
bash experiments_PureT/PureT_SCST/train.sh
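
For reference, SCST treats caption generation as a policy and uses the greedily decoded caption as a self-critical baseline for the sampled caption's CIDEr reward. The sketch below only illustrates this update; the decoding helpers (decode_greedy, decode_sample) and the CIDEr scorer are assumed placeholders for the repository's actual implementations.

# Illustrative SCST update: reward = CIDEr(sampled) - CIDEr(greedy baseline).
# decode_greedy / decode_sample / cider_scorer are assumed placeholders.
import torch

def scst_step(model, optimizer, feats, gt_captions, cider_scorer):
    model.eval()
    with torch.no_grad():
        greedy_caps = model.decode_greedy(feats)           # baseline captions
    model.train()
    sampled_caps, log_probs = model.decode_sample(feats)   # log_probs: (B,)

    reward_sample = cider_scorer(sampled_caps, gt_captions)  # (B,) CIDEr scores
    reward_greedy = cider_scorer(greedy_caps, gt_captions)
    advantage = torch.as_tensor(reward_sample - reward_greedy,
                                dtype=torch.float32, device=log_probs.device)

    loss = -(advantage * log_probs).mean()                 # policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()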

Evaluation

You can download the pre-trained model from GoogleDrive or Baidu Netdisk (extraction code: hryh).

CUDA_VISIBLE_DEVICES=0 python main_test.py --folder experiments_PureT/PureT_SCST/ --resume 27
BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
82.1    67.3    52.0    40.9    30.2    60.1     138.2  24.2
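
If you want to score a caption result file directly with the bundled coco-caption code, a sketch along the following lines should work. Both file paths are assumptions; replace them with your own ground-truth annotations and generated-caption JSON.

# Standalone metric computation sketch using the coco-caption API.
# Both file paths below are assumptions; adapt them to your setup.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("mscoco/misc/captions_val2014.json")          # ground-truth annotations (assumed path)
coco_res = coco.loadRes("results/captions_result.json")   # generated captions (assumed path)

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()       # evaluate only the captioned images
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():              # BLEU, METEOR, ROUGE_L, CIDEr, SPICE
    print(f"{metric}: {score:.3f}")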

Reference

If you find this repo useful, please consider citing (no obligation at all):

@inproceedings{wangyiyu2022PureT,
  author       = {Yiyu Wang and
                  Jungang Xu and
                  Yingfei Sun},
  title        = {End-to-End Transformer Based Model for Image Captioning},
  booktitle    = {Proceedings of the AAAI Conference on Artificial Intelligence},
  pages        = {2585--2594},
  publisher    = {{AAAI} Press},
  year         = {2022},
  url          = {https://ojs.aaai.org/index.php/AAAI/article/view/20160}, 
  doi          = {10.1609/aaai.v36i3.20160},
}

Acknowledgements

This repository is based on JDAI-CV/image-captioning, ruotianluo/self-critical.pytorch and microsoft/Swin-Transformer.

TODO:

  • Details of data preparation
  • More datasets supported
  • More approaches supported
