This repository provides ViLBERTScore, an evaluation metric for image captioning based on ViLBERT, as described in our paper "ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT".
This code is built upon the original ViLBERT paper and its repository, and the setup guideline below is almost the same as in the original repository.
- Create a fresh conda environment, and install all dependencies.
conda create -n vilbert-score python=3.6
conda activate vilbert-score
git clone https://github.com/hwanheelee1993/ViLBERTScore.git
cd ViLBERTScore
pip install -r requirements.txt
- Install PyTorch
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
- Install this codebase as a package in this environment.
python setup.py develop
We used two pre-trained models (the pretrained ViLBERT and the version fine-tuned on 12 tasks) in our work. Please download the models from the original ViLBERT repository and save them to the "save" dir (two files: "pretrained_model.bin" and "multi_task_model.bin").
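Before running anything, you can quickly verify that both checkpoints are in place. A minimal sketch, assuming the files sit directly under "save":

```python
# Quick sanity check that the two downloaded checkpoints are under ./save
# (filenames as listed above; the flat directory layout is an assumption).
import os

for name in ("pretrained_model.bin", "multi_task_model.bin"):
    path = os.path.join("save", name)
    assert os.path.isfile(path), f"Missing model file: {path}"
    print(f"Found {path} ({os.path.getsize(path) / 1e6:.1f} MB)")
```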
We provide the processed versions of Flickr8k and Composite in this link, including the pre-computed detection features. Download the files and extract them to the "data" dir.
We extract the detection features following the guidelines in this link. Please extract the features in the same way and save them as "imgs_rcnn.pkl", a list containing the detection feature for each image.
Then create "cand_caps.pkl" and "gt_caps.pkl" ("scores.pkl" is optional, for computing correlation), which are per-image lists of the candidate captions, the reference captions, and the scores used for correlation, respectively.
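For a custom dataset, the input pickles can be built along the following lines. This is only a sketch, assuming the lists are aligned by index (one entry per image), that "gt_caps.pkl" holds a list of reference captions per image, and that the files live under "data" (check compute_vilbertscore.py for the exact paths it expects):

```python
# Sketch of building the input pickle files for a custom dataset.
# Assumptions: lists are aligned by index (one entry per image); gt_caps
# stores a list of reference captions per image; files go under "data".
import pickle

cand_caps = [
    "a dog runs across the grass",         # candidate caption for image 0
    "two people ride bikes on a street",   # candidate caption for image 1
]
gt_caps = [
    ["a dog is running on a lawn", "a brown dog plays outside"],
    ["two cyclists ride down the road"],
]
scores = [3.4, 2.0]  # optional scores, only needed for computing correlation

for name, obj in [("cand_caps.pkl", cand_caps),
                  ("gt_caps.pkl", gt_caps),
                  ("scores.pkl", scores)]:
    with open(f"data/{name}", "wb") as f:
        pickle.dump(obj, f)
```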
You can compute the scores using the following command.
python compute_vilbertscore.py --dataset flickr8k
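If you also want to correlate the metric with the judgments in "scores.pkl" yourself, a generic sketch is shown below; it assumes SciPy is available and that the per-caption ViLBERTScore values have been collected into a list aligned with "scores.pkl":

```python
# Generic sketch for correlating per-caption metric scores with the
# judgments stored in scores.pkl (assumption: both lists share the same order).
import pickle
from scipy.stats import kendalltau

def correlation_with_scores(metric_scores, scores_path="data/scores.pkl"):
    """Return Kendall's tau between metric scores and the stored scores."""
    with open(scores_path, "rb") as f:
        stored_scores = pickle.load(f)
    return kendalltau(metric_scores, stored_scores)

# Example with dummy metric values (replace with real ViLBERTScore outputs):
# print(correlation_with_scores([0.71, 0.42, 0.88]))
```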
MIT license
Please cite the following paper if you use this code:
@inproceedings{lee2020vilbertscore,
  title={ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT},
  author={Lee, Hwanhee and Yoon, Seunghyun and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Jung, Kyomin},
  booktitle={Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems},
  pages={34--39},
  year={2020}
}