ImageCaptioning-Verbose

PyTorch implementation for Image Captioning, supporting single- or multi-GPU training, and inference with a single model or an ensemble of multiple models.

This repo is a version that uses pre-extracted features for training and testing.

Requirements (Our Main Environment)

  • Python 3.10.11
  • PyTorch 1.13.1
  • TorchVision 0.14.1
  • coco-caption
  • numpy
  • tqdm

Note: Earlier PyTorch versions, such as 1.5.1, are also supported. Newer versions (>2.0) have not been verified.

Preparation

1. coco-caption preparation

As described in the coco-caption README.md, you first need to download the Stanford CoreNLP 3.6.0 code and models used by SPICE. To do this, run:

cd coco_caption
bash get_stanford_models.sh

2. Data preparation

The files required for training and evaluation are stored in the mscoco folder, which is organized as follows:

mscoco/
|--feature/
    |--COCO_SwinL_Feats/ # Grid Features, for PureT
       |--*.npz
    |--COCO_UpDown_10_100_Feats/ # Region Features, for UpDown, XLAN, XTransformer, ...
       |--*.npz
|--misc/
|--sent/
|--txt/

where the mscoco/feature/COCO_SwinL_Feats folder contains the pre-extracted features of the MSCOCO 2014 dataset. You can download the other files from GoogleDrive or Baidu Netdisk (extraction code: hryh).

Refer to tools/README.md for more details.
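
Each .npz file holds the pre-extracted features for one image. As a rough illustration of how such a file might be read, here is a minimal sketch; the array key "features", the feature shape, and the example file name are assumptions, not this repository's exact format.

# Minimal sketch of loading one pre-extracted feature file.
# The key "features", the shape comment, and the example path are assumptions.
import numpy as np

def load_image_features(npz_path):
    """Load grid/region features for a single image from an .npz file."""
    with np.load(npz_path) as data:
        feats = data["features"]   # e.g. (num_grids_or_regions, feat_dim)
    return feats

# Usage (hypothetical file name):
# feats = load_image_features("mscoco/feature/COCO_SwinL_Feats/000000000001.npz")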

Training

*Note: our repository is mainly based on JDAI-CV/image-captioning, and we directly reused its config.yml files, so they contain many unused parameters.

1. Training under XE loss

Download the pre-trained backbone model (Swin-Transformer) from GoogleDrive or Baidu Netdisk (extraction code: hryh) and save it in the root directory.

Before training, you may need to check and modify the parameters in the config.yml and train.sh files. Then run the script:

# for XE training
bash experiments_PureT/PureT_XE/train.sh
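
The script above wraps the repository's training entry point. To illustrate what the XE stage optimizes, here is a minimal sketch of a single cross-entropy training step; the model, optimizer, and tensor shapes are placeholders, not the repository's actual classes.

# Illustrative XE (cross-entropy) training step; names are placeholders.
import torch
import torch.nn.functional as F

def xe_step(model, optimizer, feats, captions, pad_idx=0):
    # feats: (B, N, D) pre-extracted features; captions: (B, T) token ids
    logits = model(feats, captions[:, :-1])          # predict the next token
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        captions[:, 1:].reshape(-1),
        ignore_index=pad_idx,                        # ignore padding positions
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()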

2. Training using SCST (self-critical sequence training)

Copy the XE-pretrained model into the experiments_PureT/PureT_SCST/snapshot/ folder and modify the config.yml and train.sh files. Then run the script:

# for SCST training
bash experiments_PureT/PureT_SCST/train.sh
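
For reference, SCST treats caption generation as a policy and uses the greedily decoded caption as a self-critical baseline for the sampled caption's CIDEr reward. The sketch below only illustrates this update; the decoding helpers (decode_greedy, decode_sample) and the CIDEr scorer are assumed placeholders for the repository's actual implementations.

# Illustrative SCST update: reward = CIDEr(sampled) - CIDEr(greedy baseline).
# decode_greedy / decode_sample / cider_scorer are assumed placeholders.
import torch

def scst_step(model, optimizer, feats, gt_captions, cider_scorer):
    model.eval()
    with torch.no_grad():
        greedy_caps = model.decode_greedy(feats)           # baseline captions
    model.train()
    sampled_caps, log_probs = model.decode_sample(feats)   # log_probs: (B,)

    reward_sample = cider_scorer(sampled_caps, gt_captions)  # (B,) CIDEr scores
    reward_greedy = cider_scorer(greedy_caps, gt_captions)
    advantage = torch.as_tensor(reward_sample - reward_greedy,
                                dtype=torch.float32, device=log_probs.device)

    loss = -(advantage * log_probs).mean()                 # policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()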

Evaluation

You can download the pre-trained model from GoogleDrive or Baidu Netdisk (extraction code: hryh).

CUDA_VISIBLE_DEVICES=0 python main_test.py --folder experiments_PureT/PureT_SCST/ --resume 27
BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
82.1    67.3    52.0    40.9    30.2    60.1     138.2  24.2
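
If you want to score a caption result file directly with the bundled coco-caption code, a sketch along the following lines should work. Both file paths are assumptions; replace them with your own ground-truth annotations and generated-caption JSON.

# Standalone metric computation sketch using the coco-caption API.
# Both file paths below are assumptions; adapt them to your setup.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("mscoco/misc/captions_val2014.json")          # ground-truth annotations (assumed path)
coco_res = coco.loadRes("results/captions_result.json")   # generated captions (assumed path)

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()       # evaluate only the captioned images
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():              # BLEU, METEOR, ROUGE_L, CIDEr, SPICE
    print(f"{metric}: {score:.3f}")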

Reference

If you find this repo useful, please consider citing (no obligation at all):

@inproceedings{wangyiyu2022PureT,
  author       = {Yiyu Wang and
                  Jungang Xu and
                  Yingfei Sun},
  title        = {End-to-End Transformer Based Model for Image Captioning},
  booktitle    = {Proceedings of the AAAI Conference on Artificial Intelligence},
  pages        = {2585--2594},
  publisher    = {{AAAI} Press},
  year         = {2022},
  url          = {https://ojs.aaai.org/index.php/AAAI/article/view/20160}, 
  doi          = {10.1609/aaai.v36i3.20160},
}

Acknowledgements

This repository is based on JDAI-CV/image-captioning, ruotianluo/self-critical.pytorch and microsoft/Swin-Transformer.

TODO:

  • Details of data preparation
  • More datasets supported
  • More approaches supported
