Scene chainer #5

Open
wants to merge 6 commits into
base: devel
Changes from 2 commits
2 changes: 2 additions & 0 deletions scenes/chainer-caption/.gitignore
@@ -0,0 +1,2 @@
data/*.model
chainer_env/
224 changes: 224 additions & 0 deletions scenes/chainer-caption/README.md
@@ -0,0 +1,224 @@
# image caption generation by chainer

## Notes after two years.
I wrote this when I was practically still an undergrad, and things have changed a lot since then! I feel a bit ashamed to show my dirty code :) The pretrained models still work and are probably fine if you just want to try generating captions in English/Chinese/Japanese. For training, however, this is probably not the best code base, and many algorithmic improvements have appeared since I stopped maintaining it…

If you want to train an image captioning model, I highly recommend pytorch rather than chainer. I stopped using chainer because upgrading it kept breaking compatibility. For example, this code was written for chainer 1.x and does not work with 2.x, 3.x, 4.x, … or 6.x (see how quickly they bump the version!). *1

If you still want to use this code, here are some notes you should know:


- I used python 2.7 at that time, and this code may not work with python 3 or higher.
- Make sure to use chainer 1.x. I tested on 1.24.0 with CUDA 8 (CUDA 9 and chainer 1.x don’t work together).

- I put a more formal description of the requirements in a later section, so please see it and try to use miniconda (or anaconda) to reproduce the environment.


- I tried to change the code to make it possible to fine-tune the CNN part, but I failed to document it, and now I don’t remember what I was doing :) Some files are leftovers that you don't need to use, e.g. `train_image_caption_model.py` is a leftover while `train_caption_model.py` is the good one.

*1 This does NOT mean I do not like chainer anymore. It's still a great tool for quickly prototyping from scratch. I just want to say that if you want to minimize long-term maintenance effort, it is probably not the right tool, because of the frequent major version bumps that break backward compatibility.

## description

This repository contains an implementation of typical neural image caption generation (i.e. CNN + RNN). The model first extracts image features with a CNN and then generates captions with an RNN. The CNN is ResNet50 and the RNN is a standard LSTM.

The training data is MSCOCO. I preprocessed the MSCOCO images by extracting CNN features in advance. Then I trained the language model to generate captions. In addition to English, I trained on Japanese and Chinese.

I made pre-trained models available. For English captions, the model achieves a CIDEr of 0.692 on the MSCOCO validation dataset (the other scores are Bleu1: 0.657, Bleu2: 0.471, Bleu3: 0.327, Bleu4: 0.228, METEOR: 0.213, ROUGE_L: 0.47). The scores increase a little when beam search is used. For example, CIDEr is 0.716 with a beam size of 5. To achieve a better score, the CNN has to be fine-tuned, but I haven’t tried that because it is computationally heavier.
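
To illustrate why beam search can score higher than greedy decoding, here is a minimal, framework-free sketch. It is not the code in `sample_code_beam.py`: the `next_log_probs` callback, the token names, and the toy probability table are hypothetical stand-ins for one LSTM step conditioned on the image feature.

```python
import math


def beam_search(next_log_probs, bos, eos, beam_size=5, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(prefix) -> {token: log_prob} is a hypothetical stand-in
    for the caption model's next-word distribution.
    """
    beams = [(0.0, [bos])]  # (cumulative log probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((score + logp, tokens + [token]))
        # keep only the `beam_size` best partial captions
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that hit max_len without ending
    return max(finished, key=lambda c: c[0])[1]


def toy_model(prefix):
    """Toy distribution in which the greedy first choice is a trap."""
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.5, "y": 0.5},
        "b": {"z": 1.0},
    }
    probs = table.get(prefix[-1], {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


if __name__ == "__main__":
    print(beam_search(toy_model, "<s>", "</s>", beam_size=1))  # greedy
    print(beam_search(toy_model, "<s>", "</s>", beam_size=2))
```

With a beam size of 1 the search commits to the locally best first word, while a wider beam keeps the alternative alive long enough to find the sequence with the higher overall probability.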

<img src="sample.png" >

## requirements
- python 2.7
- CUDA 8.0
- chainer 1.24.0 http://chainer.org

and some more packages. Make sure to install this exact version of chainer with `pip install chainer==1.24.0`. Other versions easily break compatibility.

I use `conda` to manage the environment and here's what I did.
```
conda create -y --prefix=conda python=2.7
conda deactivate
conda activate ./conda
conda install --yes numpy scipy matplotlib pandas pillow #normal tools
CUDA_PATH=/usr/local/cuda-8.0/ pip install chainer==1.24.0 --no-cache-dir
CUDA_PATH=/usr/local/cuda-8.0/ pip install cupy==2.0
pip install h5py
pip install nltk
```
You can also see my [`env.yml`](./data/env.yml) that is basically the exact environment of mine.
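
After setting up the environment, a quick sanity check can save some debugging time. The helper below is a small sketch, not part of this repository; the function name `is_supported_chainer` is hypothetical, and it only checks the major version of the string reported by `chainer.__version__`.

```python
def is_supported_chainer(version):
    """Return True for chainer 1.x version strings such as '1.24.0'."""
    major = version.split(".")[0]
    return major == "1"


if __name__ == "__main__":
    try:
        import chainer
    except ImportError:
        print("chainer is not installed in this environment")
    else:
        ok = is_supported_chainer(chainer.__version__)
        print("chainer %s: %s" % (chainer.__version__, "ok" if ok else "unsupported"))
```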


## citation:
If you find this implementation useful, please consider citing:
```
@article{multilingual-caption-arxiv,
title={{Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation}},
author={Satoshi Tsutsui and David Crandall},
journal={arXiv:1706.06275},
year={2017}
}

@inproceedings{multilingual-caption,
author={Satoshi Tsutsui and David Crandall},
booktitle = {CVPR Language and Vision Workshop},
title = {{Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation}},
year = {2017}
}
```

## I just want to generate captions!
OK, first, you need to download the models and other preprocessed files.
```
bash download.sh
```
Then you can generate captions.
```
#English
python sample_code_beam.py \
--rnn-model ./data/caption_en_model40.model \
--cnn-model ./data/ResNet50.model \
--vocab ./data/MSCOCO/mscoco_caption_train2014_processed_dic.json \
--gpu -1 \
--img ./sample_imgs/COCO_val2014_000000185546.jpg

#Japanese trained from machine translated Japanese (https://github.com/apple2373/mt-mscoco)
python sample_code_beam.py \
--rnn-model ./data/caption_jp_mt_model40.model \
--cnn-model ./data/ResNet50.model \
--vocab ./data/MSCOCO/captions_train2014_jp_translation_processed_dic.json \
--gpu -1 \
--img ./sample_imgs/COCO_val2014_000000185546.jpg

#Japanese by YJCaptions (https://github.com/yahoojapan/YJCaptions)
python sample_code_beam.py \
--rnn-model ./data/caption_jp_yj_model40.model \
--cnn-model ./data/ResNet50.model \
--vocab ./data/MSCOCO/yjcaptions26k_clean_processed_dic.json \
--gpu -1 \
--img ./sample_imgs/COCO_val2014_000000185546.jpg

#Chinese trained from machine translated Chinese (https://github.com/apple2373/mt-mscoco)
python sample_code_beam.py \
--rnn-model ./data/caption_cn_model40.model \
--cnn-model ./data/ResNet50.model \
--vocab ./data/MSCOCO/captions_train2014_cn_translation_processed_dic.json \
--gpu -1 \
--img ./sample_imgs/COCO_val2014_000000185546.jpg
```
See the help for other options. You can, for example, use beam search if you want.
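
If you want to caption several images, the commands above can be wrapped in a small driver. This is only a sketch: the `MODELS` table and the `build_command` helper are hypothetical names, and the flags and paths are simply copied from the commands in this section. Note that it restarts the script (and reloads the models) for every image, so it is slow.

```python
import subprocess
import sys

# (rnn model, vocabulary) pairs copied from the commands in this section
MODELS = {
    "en": ("./data/caption_en_model40.model",
           "./data/MSCOCO/mscoco_caption_train2014_processed_dic.json"),
    "jp_mt": ("./data/caption_jp_mt_model40.model",
              "./data/MSCOCO/captions_train2014_jp_translation_processed_dic.json"),
    "jp_yj": ("./data/caption_jp_yj_model40.model",
              "./data/MSCOCO/yjcaptions26k_clean_processed_dic.json"),
    "cn": ("./data/caption_cn_model40.model",
           "./data/MSCOCO/captions_train2014_cn_translation_processed_dic.json"),
}


def build_command(lang, image_path, gpu=-1):
    """Build the argument list for one sample_code_beam.py run."""
    rnn_model, vocab = MODELS[lang]
    return [sys.executable, "sample_code_beam.py",
            "--rnn-model", rnn_model,
            "--cnn-model", "./data/ResNet50.model",
            "--vocab", vocab,
            "--gpu", str(gpu),
            "--img", image_path]


if __name__ == "__main__":
    # usage (from the repository root): python this_script.py img1.jpg img2.jpg
    for image_path in sys.argv[1:]:
        subprocess.check_call(build_command("en", image_path))
```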

## I want to run the caption generation module as a web API.
I have a simple script for that.
```
cd webapi
python server.py --rnn-model ../data/caption_en_model40.model \
--cnn-model ../data/ResNet50.model \
--vocab ../data/MSCOCO/mscoco_caption_train2014_processed_dic.json \
--gpu -1

# the server keeps running, so send the request from another terminal (from the repository root)
curl -X POST -F image=@./sample_imgs/COCO_val2014_000000185546.jpg http://localhost:8090/predict
# you should get json
```
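
The same request can be sent from Python. The client below is a sketch using only the standard library; it is written for Python 3 and is meant to run outside the Python 2.7 environment of the server. The helper names `build_multipart` and `predict` are hypothetical, and the only things assumed about the server are what the `curl` call shows: a form field named `image`, the `/predict` endpoint on port 8090, and a JSON response.

```python
import json
import os
import uuid
from urllib import request


def build_multipart(field, filename, payload, boundary=None):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        "--%s\r\n"
        'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ) % (boundary, field, filename)
    tail = "\r\n--%s--\r\n" % boundary
    body = head.encode("utf-8") + payload + tail.encode("utf-8")
    return body, "multipart/form-data; boundary=%s" % boundary


def predict(image_path, url="http://localhost:8090/predict"):
    """POST an image to the running server and return the decoded JSON."""
    with open(image_path, "rb") as f:
        body, content_type = build_multipart(
            "image", os.path.basename(image_path), f.read())
    req = request.Request(url, data=body,
                          headers={"Content-Type": content_type})
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `predict("./sample_imgs/COCO_val2014_000000185546.jpg")` should return the same JSON that `curl` prints.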


## I want to train the model by myself.
I made the preprocessed files (e.g., extracted ResNet features) available. You can download them like this.
```
bash download.sh train
```
Then you can train like this.
```
# Preprocessing

cd ./code/

## English
## make sure to download captions_train2014.json from the original MSCOCO!
python preprocess_MSCOCO_captions.py \
--input ../data/MSCOCO/captions_train2014.json \
--output ../data/MSCOCO/mscoco_caption_train2014_processed.json \
--outdic ../data/MSCOCO/mscoco_caption_train2014_processed_dic.json \
--outfreq ../data/MSCOCO/mscoco_caption_train2014_processed_freq.json

## Japanese from Yahoo
python preprocess_MSCOCO_captions.py \
--input ../data/MSCOCO/yjcaptions26k_clean.json \
--output ../data/MSCOCO/yjcaptions26k_clean_processed.json \
--outdic ../data/MSCOCO/yjcaptions26k_clean_processed_dic.json \
--outfreq ../data/MSCOCO/yjcaptions26k_clean_processed_freq.json \
--cut 0 \
--char True

## Japanese from machine translation
python preprocess_MSCOCO_captions.py \
--input ../data/MSCOCO/captions_train2014_jp_translation.json \
--output ../data/MSCOCO/captions_train2014_jp_translation_processed.json \
--outdic ../data/MSCOCO/captions_train2014_jp_translation_processed_dic.json \
--outfreq ../data/MSCOCO/captions_train2014_jp_translation_processed_freq.json \
--cut 5 \
--char True

## Chinese from machine translation
python preprocess_MSCOCO_captions.py \
--input ../data/MSCOCO/captions_train2014_cn_translation.json \
--output ../data/MSCOCO/captions_train2014_cn_translation_processed.json \
--outdic ../data/MSCOCO/captions_train2014_cn_translation_processed_dic.json \
--outfreq ../data/MSCOCO/captions_train2014_cn_translation_processed_freq.json \
--cut 5 \
--char True

cd ../

# Train
## train English caption
python train_caption_model.py --savedir ./experiment1 --epoch 40 --batch 120 --gpu -1 \
--vocab ./data/MSCOCO/mscoco_caption_train2014_processed_dic.json \
--captions ./data/MSCOCO/mscoco_caption_train2014_processed.json

## train Chinese caption by machine translation
python train_caption_model.py --savedir ./experiment1cn --epoch 50 --batch 120 --gpu 0 \
--vocab ./data/MSCOCO/captions_train2014_cn_translation_processed_dic.json \
--captions ./data/MSCOCO/captions_train2014_cn_translation_processed.json

## train Japanese caption by machine translation
python train_caption_model.py --savedir ./experiment1jp_mt --epoch 40 --batch 120 --gpu 0 \
--vocab ./data/MSCOCO/captions_train2014_jp_translation_processed_dic.json \
--captions ./data/MSCOCO/captions_train2014_jp_translation_processed.json \
--preload True

## train Japanese caption by Yahoo's
python train_caption_model.py --savedir ./experiment1jp_yj --epoch 40 --batch 120 --gpu 0 \
--vocab ./data/MSCOCO/yjcaptions26k_clean_processed_dic.json \
--captions ./data/MSCOCO/yjcaptions26k_clean_processed.json \
--preload True
```

## I want to train the model from my own data.
Alright, you need to do some additional work.
```
cd code
#extract features using ResNet50
python ResNet_feature_extractor.py --img-dir ../data/MSCOCO/train2014 \
--out-dir ../data/MSCOCO/train2014_ResNet50_features \
--gpu -1
```
`--gpu` is the GPU id (-1 means CPU). `--img-dir` is the directory where your images are stored. `--out-dir` is the directory where the ResNet features will be saved. Each feature file keeps the name of its image, but with the extension `.npz`.
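
As a sketch of that naming rule, the snippet below maps an image path to the feature file expected for it and loads it. The function names are hypothetical; the `'arr_0'` key is the one the data loaders in this repository read from each `.npz` file.

```python
import os


def feature_path(image_path, out_dir):
    """Map an image file to the .npz feature file with the same base name."""
    stem = os.path.splitext(os.path.basename(image_path))[0]
    return os.path.join(out_dir, stem + ".npz")


def load_feature(image_path, out_dir):
    """Load the saved ResNet feature for one image (requires numpy)."""
    import numpy as np
    # the data loaders in this repository read the array stored under 'arr_0'
    return np.load(feature_path(image_path, out_dir))["arr_0"]
```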
```
#preprocess the json files. you need to have the same structure as MSCOCO json.
python preprocess_MSCOCO_captions.py \
--input ../data/MSCOCO/captions_train2014.json \
--output ../data/MSCOCO/mscoco_caption_train2014_processed.json \
--outdic ../data/MSCOCO/mscoco_caption_train2014_processed_dic.json \
--outfreq ../data/MSCOCO/mscoco_caption_train2014_processed_freq.json \
--cut 5 \
--char True
```
`input` is the json file containing the captions. `output` is the main preprocessed output. `outdic` is the vocabulary file. `outfreq` is an internal file that just holds frequency counts; you don’t need it for training. `cut` is the cutoff frequency for rare words. Character-based chunking is used when `char` is True; use it for languages written without spaces, like Japanese and Chinese.
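
To make the roles of `cut` and `char` concrete, here is a simplified sketch of frequency-cutoff vocabulary building with optional character-level chunking. It is not the logic of `preprocess_MSCOCO_captions.py`: the function names, the `<unk>` token, lowercasing, and the choice to keep words seen at least `cut` times are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(caption, char=False):
    """Split on whitespace, or into characters for unspaced languages."""
    if char:
        return [c for c in caption if not c.isspace()]
    return caption.lower().split()


def build_vocab(captions, cut=5, char=False, specials=("<unk>",)):
    """Return (token -> id, frequency counter).

    Tokens seen fewer than `cut` times are left out of the vocabulary
    (assumed rule; the real script may treat the boundary differently).
    """
    freq = Counter(tok for cap in captions for tok in tokenize(cap, char))
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, count in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= cut:
            vocab[tok] = len(vocab)
    return vocab, freq


def encode(caption, vocab, char=False):
    """Convert a caption to token ids, mapping rare tokens to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokenize(caption, char)]


if __name__ == "__main__":
    vocab, freq = build_vocab(["a cat", "a dog", "a cat sat"], cut=2)
    print(vocab)
    print(encode("a dog", vocab))
```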

Then you can use my script above for training.

## I want to fine-tune CNN part.
Officially, this code doesn’t support tuning the CNN, and the CNN was not tuned in the original paper either. However, informally, I did make it possible to train the CNN part too... but I didn't document it, and now, after two years, I don't remember very much. `train_image_caption_model.py` is the script to train the CNN part. I also remember that it uses a different preprocessed json format from the one documented here: the documented format keeps the `_dic` file and the main processed file separate, whereas the new format combines them. The script that generates a preprocessed file in the new format (i.e. the one compatible with `train_image_caption_model.py`) should be `code/preprocess_captions.py`. That's what I vaguely remember.
Member

Add your own README file, and reference the original file with proper credits.

64 changes: 64 additions & 0 deletions scenes/chainer-caption/code/CaptionDataLoader.py
@@ -0,0 +1,64 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

#class to get the data in a batch way
#loading on memory option (preload_all_features) took 6m10.400s (user time = 2m41.546s) to load if it is true

import numpy as np

class CaptionDataLoader(object):
    def __init__(self, captions,image_feature_path,preload_all_features=False,filename_img_id=False):
        self.captions = captions
        self.image_feature_path=image_feature_path#path before image id. e.g. ../data/MSCOCO/train2014_ResNet50_features/COCO_train2014_
        self.caption_ids = list(captions.keys())#list() keeps the ids indexable on python 3 as well
        self.random_indicies = np.random.permutation(len(self.captions))
        self.index_count=0
        self.epoch=1
        self.preload_all_features=preload_all_features
        self.filename_img_id=filename_img_id
        if self.preload_all_features:
            if self.filename_img_id:
                self.image_features=np.array([np.load("%s/%s.npz"%(self.image_feature_path,self.captions[caption_id]["image_id"]))['arr_0'] for caption_id in self.caption_ids])
            else:
                self.image_features=np.array([np.load("%s%012d.npz"%(self.image_feature_path,self.captions[caption_id]["image_id"]))['arr_0'] for caption_id in self.caption_ids])

    def get_batch(self,batch_size):
        batch_data_indicies=self.random_indicies[self.index_count:self.index_count+batch_size]
        self.index_count+=batch_size
        #>= (not >) so that the next call never returns an empty batch
        if self.index_count >= len(self.captions):
            self.epoch+=1
            self.suffle_data()
            self.index_count=0

        #sorry the following lines are so complicated...
        #this is just loading preprocessed images features and captions for this batch
        if self.preload_all_features:
            batch_image_features=self.image_features[batch_data_indicies]
        else:
            if self.filename_img_id:
                batch_image_features=np.array([np.load("%s/%s.npz"%(self.image_feature_path,self.captions[self.caption_ids[i]]["image_id"]))['arr_0'] for i in batch_data_indicies])
            else:
                batch_image_features=np.array([np.load("%s%012d.npz"%(self.image_feature_path,self.captions[self.caption_ids[i]]["image_id"]))['arr_0'] for i in batch_data_indicies])

        batch_word_indices=[np.array(self.captions[self.caption_ids[i]]["token_ids"],dtype=np.int32) for i in batch_data_indicies]

        return batch_image_features,batch_word_indices

    def suffle_data(self):
        #draw a new random order of the captions for the next epoch
        self.random_indicies = np.random.permutation(len(self.captions))


if __name__ == '__main__':
    #test code
    import json
    with open("../data/MSCOCO/mscoco_caption_train2014_processed.json", 'r') as f:
        captions = json.load(f)
    dataset=CaptionDataLoader(captions,image_feature_path="../data/MSCOCO/train2014_ResNet50_features/COCO_train2014_")
    batch_image_features,batch_word_indices = dataset.get_batch(10)
    print(batch_word_indices)
    print(batch_image_features.shape)

    dataset=CaptionDataLoader(captions,image_feature_path="../data/MSCOCO/train2014_ResNet50_features/COCO_train2014_",preload_all_features=True)
    batch_image_features,batch_word_indices = dataset.get_batch(10)
    print(batch_word_indices)
    print(batch_image_features.shape)
76 changes: 76 additions & 0 deletions scenes/chainer-caption/code/CaptionDataLoader2.py
@@ -0,0 +1,76 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

#class to get the data in a batch way
#loading on memory option (preload_all_features) took 6m10.400s (user time = 2m41.546s) to load if it is true
#refactored version of CaptionDataLoader.py

import numpy as np
import os
from image_loader import Image_loader
from ResNet50 import ResNet

class CaptionDataLoader(object):
    def __init__(self,dataset,image_feature_root,image_root="",preload_all_features=False,image_mean="imagenet",holding_raw_captions=False):
        self.holding_raw_captions=holding_raw_captions
        self.image_loader=Image_loader(mean=image_mean)
        self.captions=dataset["captions"]
        self.num_captions=len(self.captions)
        self.images=dataset["images"]
        self.caption2image={caption["idx"]:caption["image_idx"] for caption in dataset["captions"]}
        self.image_feature_root=image_feature_root+"/"#path to preprocessed image features. It assumes the features are stored under the same name with only the extension changed to .npz
        self.image_root=image_root+"/"#path to image directory
        self.random_indicies = np.random.permutation(len(self.captions))
        self.index_count=0
        self.epoch=1
        self.preload_all_features=preload_all_features
        if self.preload_all_features:
            self.image_features=np.array([np.load("%s/%s.npz"%(self.image_feature_root, os.path.splitext(image["file_path"])[0] ))['arr_0'] for image in self.images])

    def get_batch(self,batch_size,raw_image=False):
        #if raw_image is true, it will give you Batchx3x224x224 otherwise it will be just features
        batch_caption_indicies=self.random_indicies[self.index_count:self.index_count+batch_size]
        self.index_count+=batch_size
        #>= (not >) so that the next call never returns an empty batch
        if self.index_count >= len(self.captions):
            self.epoch+=1
            self.suffle_data()
            self.index_count=0

        #sorry the following lines are so complicated...
        #this is just loading preprocessed images or image features and captions for this batch
        if raw_image:
            batch_images= np.array( [self.image_loader.load(self.image_root+self.images[self.caption2image[i]]["file_path"],expand_batch_dim=False) for i in batch_caption_indicies] )
        else:
            if self.preload_all_features:
                batch_images=self.image_features[[self.caption2image[i] for i in batch_caption_indicies]]
            else:
                batch_images=np.array([np.load("%s/%s.npz"%(self.image_feature_root, os.path.splitext(self.images[self.caption2image[i]]["file_path"])[0] ))['arr_0'] for i in batch_caption_indicies])
        if self.holding_raw_captions:
            batch_word_indices=[self.captions[i]["caption"] for i in batch_caption_indicies]
        else:
            batch_word_indices=[np.array(self.captions[i]["caption"],dtype=np.int32) for i in batch_caption_indicies]

        return batch_images,batch_word_indices

    def suffle_data(self):
        #draw a new random order of the captions for the next epoch
        self.random_indicies = np.random.permutation(len(self.captions))


if __name__ == '__main__':
    #test code
    import json
    with open("../data/MSCOCO/mscoco_train2014_all_preprocessed.json", 'r') as f:
        captions = json.load(f)
    dataset=CaptionDataLoader(captions,image_feature_root="../data/MSCOCO/MSCOCO_ResNet50_features/",image_root="../data/MSCOCO/MSCOCO_raw_images/")
    batch_images,batch_word_indices = dataset.get_batch(10,raw_image=True)
    print(batch_word_indices)
    print(batch_images)

    batch_image_features,batch_word_indices = dataset.get_batch(10)
    print(batch_word_indices)
    print(batch_image_features.shape)

    dataset=CaptionDataLoader(captions,image_feature_root="../data/MSCOCO/MSCOCO_ResNet50_features",preload_all_features=True)
    batch_image_features,batch_word_indices = dataset.get_batch(10)
    print(batch_word_indices)
    print(batch_image_features.shape)