Scene chainer #5
Open
Mostafa3zazi wants to merge 6 commits into devel from scene_chainer
Changes shown are from 2 of the 6 commits:
- 157a95f: Merge pull request #1 from cane-see-project/devel (mhashim6)
- 2ba0539: first commit for chainerV7 (Mostafa3zazi)
- de527c3: Add chainer global configurations (Mostafa3zazi)
- fe014ac: Update README.md (Mostafa3zazi)
- 8bafa4c: Add resize_img_and_generate function (Mostafa3zazi)
- 13fde44: delete unnecessary files (Mostafa3zazi)
@@ -0,0 +1,2 @@
data/*.model
chainer_env/
@@ -0,0 +1,224 @@
# image caption generation by chainer

## Notes after two years.
I wrote this back when I was almost an undergrad, and things have changed a lot since! I feel a bit ashamed to show my dirty coding :) Still, the pretrained models are effective and probably fine if you just want to try generating captions in English/Chinese/Japanese. When it comes to training, however, this is probably not the best option, and many algorithmic improvements have appeared since I stopped maintaining this code base.

If you want to train image captioning, I highly recommend pytorch rather than chainer. I stopped using chainer because it kept breaking compatibility whenever I upgraded it. For example, this code is written for chainer 1.x and never worked in 2.x, 3.x, 4.x, ... or 6.x (see how quickly they change major versions!). *1

If you still want to use this code, here are some notes you should know:

- I used python 2.7 at the time, and this code may not work with python 3.
- Make sure to use chainer 1.x. I tested on 1.24.0 with CUDA 8 (CUDA 9 and chainer 1.x do not work together).

- I put a more formal description of the requirements in a later section, so see that, and please use miniconda (or anaconda) to reproduce the environment.

- I tried to change the code to make it possible to fine-tune the CNN part, but I more or less failed to document it, and now I don't remember what I was doing :) Some files are leftovers that you don't need, e.g. `train_image_caption_model.py` is a leftover, while `train_caption_model.py` is the good one.

*1 This does NOT mean I dislike chainer now. It's still a great tool for quickly prototyping from scratch. I just mean that if you want to minimize maintenance effort over the long term, it is probably not the right tool, because frequent major version bumps break backward compatibility.

## description

This repository contains an implementation of typical image caption generation based on neural networks (i.e. CNN + RNN). The model first extracts an image feature with a CNN and then generates a caption with an RNN. The CNN is ResNet50 and the RNN is a standard LSTM.
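
For orientation, the decoder described above can be sketched as follows. This is a minimal sketch, not this repository's actual model definition; the class name, layer sizes, and the 2048-d ResNet50 feature dimension are illustrative assumptions, written against the chainer 1.x link API that this README pins.
```
import chainer
import chainer.links as L

class CaptionGeneratorSketch(chainer.Chain):  # hypothetical name, not a class from this repo
    def __init__(self, vocab_size, feature_dim=2048, hidden_dim=512):
        super(CaptionGeneratorSketch, self).__init__(
            image_embed=L.Linear(feature_dim, hidden_dim),  # project the ResNet50 feature
            word_embed=L.EmbedID(vocab_size, hidden_dim),   # token embedding
            lstm=L.LSTM(hidden_dim, hidden_dim),            # recurrent decoder
            out=L.Linear(hidden_dim, vocab_size),           # scores over the next word
        )

    def start(self, image_feature):
        # condition the LSTM on the image by feeding its projected feature first
        self.lstm.reset_state()
        self.lstm(self.image_embed(image_feature))

    def step(self, word_ids):
        # one decoding step: previous word ids in, next-word scores out
        return self.out(self.lstm(self.word_embed(word_ids)))
```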

The training data is MSCOCO. I preprocessed the MSCOCO images by extracting CNN features in advance, then trained the language model to generate captions. Besides English, I also trained on Japanese and Chinese.

I have made pre-trained models available. For English captions, the model achieves a CIDEr of 0.692 (others: Bleu1: 0.657, Bleu2: 0.471, Bleu3: 0.327, Bleu4: 0.228, METEOR: 0.213, ROUGE_L: 0.47) on the MSCOCO validation dataset. The scores increase a little when beam search is used; for example, CIDEr is 0.716 with a beam size of 5. To achieve a better score, the CNN would have to be fine-tuned, but I haven't tried that because it's computationally heavier.

<img src="sample.png">

## requirements
- python 2.7
- CUDA 8.0
- chainer 1.24.0 http://chainer.org

and some more packages. Make sure to use this exact version of chainer: `pip install chainer==1.24.0`. Different versions easily break compatibility.

I use `conda` to manage the environment; here's what I did.
```
conda create -y --prefix=conda python=2.7
conda deactivate
conda activate ./conda
conda install --yes numpy scipy matplotlib pandas pillow # common tools
CUDA_PATH=/usr/local/cuda-8.0/ pip install chainer==1.24.0 --no-cache-dir
CUDA_PATH=/usr/local/cuda-8.0/ pip install cupy==2.0
pip install h5py
pip install nltk
```
You can also see my [`env.yml`](./data/env.yml), which is basically my exact environment.

## citation:
If you find this implementation useful, please consider citing:
```
@article{multilingual-caption-arxiv,
  title={{Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation}},
  author={Satoshi Tsutsui and David Crandall},
  journal={arXiv:1706.06275},
  year={2017}
}

@inproceedings{multilingual-caption,
  author={Satoshi Tsutsui and David Crandall},
  booktitle={CVPR Language and Vision Workshop},
  title={{Using Artificial Tokens to Control Languages for Multilingual Image Caption Generation}},
  year={2017}
}
```

## I just want to generate a caption!
OK, first, you need to download the models and other preprocessed files.
```
bash download.sh
```
Then you can generate captions.
```
# English
python sample_code_beam.py \
--rnn-model ./data/caption_en_model40.model \
--cnn-model ./data/ResNet50.model \
--vocab ./data/MSCOCO/mscoco_caption_train2014_processed_dic.json \
--gpu -1 \
--img ./sample_imgs/COCO_val2014_000000185546.jpg

# Japanese, trained on machine-translated Japanese (https://github.com/apple2373/mt-mscoco)
python sample_code_beam.py \
--rnn-model ./data/caption_jp_mt_model40.model \
--cnn-model ./data/ResNet50.model \
--vocab ./data/MSCOCO/captions_train2014_jp_translation_processed_dic.json \
--gpu -1 \
--img ./sample_imgs/COCO_val2014_000000185546.jpg

# Japanese, trained on YJCaptions (https://github.com/yahoojapan/YJCaptions)
python sample_code_beam.py \
--rnn-model ./data/caption_jp_yj_model40.model \
--cnn-model ./data/ResNet50.model \
--vocab ./data/MSCOCO/yjcaptions26k_clean_processed_dic.json \
--gpu -1 \
--img ./sample_imgs/COCO_val2014_000000185546.jpg

# Chinese, trained on machine-translated Chinese (https://github.com/apple2373/mt-mscoco)
python sample_code_beam.py \
--rnn-model ./data/caption_cn_model40.model \
--cnn-model ./data/ResNet50.model \
--vocab ./data/MSCOCO/captions_train2014_cn_translation_processed_dic.json \
--gpu -1 \
--img ./sample_imgs/COCO_val2014_000000185546.jpg
```
See the help for other options; for example, you can use beam search if you want.

## I want to run the caption generation module as a web API.
I have a simple script for that.
```
cd webapi
python server.py --rnn-model ../data/caption_en_model40.model \
--cnn-model ../data/ResNet50.model \
--vocab ../data/MSCOCO/mscoco_caption_train2014_processed_dic.json \
--gpu -1

curl -X POST -F image=@./sample_imgs/COCO_val2014_000000185546.jpg http://localhost:8090/predict
# you should get JSON back
```

## I want to train the model by myself.
I have made the preprocessed files (e.g., extracted ResNet features) available. You can download them like this.
```
bash download.sh train
```
Then you can train like this.
```
# Preprocessing

cd ./code/

## English
## make sure to download captions_train2014.json from the original MSCOCO!
python preprocess_MSCOCO_captions.py \
--input ../data/MSCOCO/captions_train2014.json \
--output ../data/MSCOCO/mscoco_caption_train2014_processed.json \
--outdic ../data/MSCOCO/mscoco_caption_train2014_processed_dic.json \
--outfreq ../data/MSCOCO/mscoco_caption_train2014_processed_freq.json

## Japanese from Yahoo
python preprocess_MSCOCO_captions.py \
--input ../data/MSCOCO/yjcaptions26k_clean.json \
--output ../data/MSCOCO/yjcaptions26k_clean_processed.json \
--outdic ../data/MSCOCO/yjcaptions26k_clean_processed_dic.json \
--outfreq ../data/MSCOCO/yjcaptions26k_clean_processed_freq.json \
--cut 0 \
--char True

## Japanese from machine translation
python preprocess_MSCOCO_captions.py \
--input ../data/MSCOCO/captions_train2014_jp_translation.json \
--output ../data/MSCOCO/captions_train2014_jp_translation_processed.json \
--outdic ../data/MSCOCO/captions_train2014_jp_translation_processed_dic.json \
--outfreq ../data/MSCOCO/captions_train2014_jp_translation_processed_freq.json \
--cut 5 \
--char True

## Chinese from machine translation
python preprocess_MSCOCO_captions.py \
--input ../data/MSCOCO/captions_train2014_cn_translation.json \
--output ../data/MSCOCO/captions_train2014_cn_translation_processed.json \
--outdic ../data/MSCOCO/captions_train2014_cn_translation_processed_dic.json \
--outfreq ../data/MSCOCO/captions_train2014_cn_translation_processed_freq.json \
--cut 5 \
--char True

cd ../

# Train
## train the English caption model
python train_caption_model.py --savedir ./experiment1 --epoch 40 --batch 120 --gpu -1 \
--vocab ./data/MSCOCO/mscoco_caption_train2014_processed_dic.json \
--captions ./data/MSCOCO/mscoco_caption_train2014_processed.json

## train the Chinese caption model from machine translation
python train_caption_model.py --savedir ./experiment1cn --epoch 50 --batch 120 --gpu 0 \
--vocab ./data/MSCOCO/captions_train2014_cn_translation_processed_dic.json \
--captions ./data/MSCOCO/captions_train2014_cn_translation_processed.json

## train the Japanese caption model from machine translation
python train_caption_model.py --savedir ./experiment1jp_mt --epoch 40 --batch 120 --gpu 0 \
--vocab ./data/MSCOCO/captions_train2014_jp_translation_processed_dic.json \
--captions ./data/MSCOCO/captions_train2014_jp_translation_processed.json \
--preload True

## train the Japanese caption model from Yahoo's YJCaptions
python train_caption_model.py --savedir ./experiment1jp_yj --epoch 40 --batch 120 --gpu 0 \
--vocab ./data/MSCOCO/yjcaptions26k_clean_processed_dic.json \
--captions ./data/MSCOCO/yjcaptions26k_clean_processed.json \
--preload True
```

## I want to train the model on my own data.
Alright, you need to do an additional amount of work.
```
cd code
# extract features using ResNet50
python ResNet_feature_extractor.py --img-dir ../data/MSCOCO/train2014 \
--out-dir ../data/MSCOCO/train2014_ResNet50_features \
--gpu -1
```
`--gpu` is the GPU id (-1 means CPU). `--img-dir` is the directory where you store the images. `--out-dir` is the directory where the ResNet features will be saved; each feature file keeps the image's file name, with the extension changed to ".npz".
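
As a sanity check, each `.npz` file can be read back with numpy. The data loaders in this PR read the array from the `'arr_0'` key, which is where an unnamed array saved via `np.savez` ends up; the file name below is only a hypothetical example.
```
import numpy as np

# hypothetical example path; any file written by ResNet_feature_extractor.py works
feature = np.load("../data/MSCOCO/train2014_ResNet50_features/COCO_train2014_000000000009.npz")['arr_0']
print(feature.shape)  # the ResNet50 feature for that one image
```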
```
# preprocess the json files. you need the same structure as the MSCOCO json.
python preprocess_MSCOCO_captions.py \
--input ../data/MSCOCO/captions_train2014.json \
--output ../data/MSCOCO/mscoco_caption_train2014_processed.json \
--outdic ../data/MSCOCO/mscoco_caption_train2014_processed_dic.json \
--outfreq ../data/MSCOCO/mscoco_caption_train2014_processed_freq.json \
--cut 5 \
--char True
```
`--input` is the json file containing the captions. `--output` is the main preprocessed output. `--outdic` is the vocabulary file. `--outfreq` is an internal frequency-count file that you don't need for training. `--cut` is the cutoff frequency below which rare words are dropped. When `--char` is True, character-based chunking is used instead of word tokenization; use it for non-spaced languages like Japanese and Chinese.
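
To make `--cut` and `--char` concrete, here is a toy sketch of the idea only, not the actual logic of `preprocess_MSCOCO_captions.py`: count token frequencies, then keep only tokens at or above the cutoff, tokenizing per character when `char` is set.
```
from collections import Counter

def build_vocab(captions, cut=5, char=False):
    counts = Counter()
    for caption in captions:
        # character-based chunking for non-spaced languages, word-based otherwise
        tokens = list(caption) if char else caption.split()
        counts.update(tokens)
    # drop tokens rarer than the cutoff, then assign integer ids
    return {tok: i for i, tok in enumerate(t for t, c in counts.items() if c >= cut)}

print(build_vocab(["a cat on a mat", "a dog on a log"], cut=2))  # keeps "a" and "on"
```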

Then you can use my script above for training.

## I want to fine-tune the CNN part.
Officially, this code doesn't support CNN fine-tuning, even though that's what I did in the original paper. Informally, though, I made it train the CNN part too... but I didn't document it, and now, after two years, I don't remember much. `train_image_caption_model.py` is the script that trains the CNN part. I also remember trying a different preprocessed json format from the one documented here: currently the `_dic` file and the main processed file are separate, but I combined them. The script that generates a preprocessed file in the new format (i.e. the one compatible with `train_image_caption_model.py`) should be `code/preprocess_captions.py`. That's what I vaguely remember.
@@ -0,0 +1,64 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Class to fetch the data in batches.
# The in-memory option (preload_all_features) took 6m10.400s (user time = 2m41.546s) to load when enabled.

import numpy as np


class CaptionDataLoader(object):
    def __init__(self, captions, image_feature_path, preload_all_features=False, filename_img_id=False):
        self.captions = captions
        # path prefix before the image id, e.g. ../data/MSCOCO/train2014_ResNet50_features/COCO_train2014_
        self.image_feature_path = image_feature_path
        self.caption_ids = captions.keys()
        self.random_indices = np.random.permutation(len(self.captions))
        self.index_count = 0
        self.epoch = 1
        self.preload_all_features = preload_all_features
        self.filename_img_id = filename_img_id
        if self.preload_all_features:
            if self.filename_img_id:
                self.image_features = np.array([np.load("%s/%s.npz" % (self.image_feature_path, self.captions[caption_id]["image_id"]))['arr_0'] for caption_id in self.caption_ids])
            else:
                self.image_features = np.array([np.load("%s%012d.npz" % (self.image_feature_path, self.captions[caption_id]["image_id"]))['arr_0'] for caption_id in self.caption_ids])

    def get_batch(self, batch_size):
        batch_data_indices = self.random_indices[self.index_count:self.index_count + batch_size]
        self.index_count += batch_size
        if self.index_count > len(self.captions):
            self.epoch += 1
            self.shuffle_data()
            self.index_count = 0

        # Load the preprocessed image features and the captions for this batch.
        if self.preload_all_features:
            batch_image_features = self.image_features[batch_data_indices]
        else:
            if self.filename_img_id:
                batch_image_features = np.array([np.load("%s/%s.npz" % (self.image_feature_path, self.captions[self.caption_ids[i]]["image_id"]))['arr_0'] for i in batch_data_indices])
            else:
                batch_image_features = np.array([np.load("%s%012d.npz" % (self.image_feature_path, self.captions[self.caption_ids[i]]["image_id"]))['arr_0'] for i in batch_data_indices])

        batch_word_indices = [np.array(self.captions[self.caption_ids[i]]["token_ids"], dtype=np.int32) for i in batch_data_indices]

        return batch_image_features, batch_word_indices

    def shuffle_data(self):
        self.random_indices = np.random.permutation(len(self.captions))


if __name__ == '__main__':
    # test code
    import json
    with open("../data/MSCOCO/mscoco_caption_train2014_processed.json", 'r') as f:
        captions = json.load(f)
    dataset = CaptionDataLoader(captions, image_feature_path="../data/MSCOCO/train2014_ResNet50_features/COCO_train2014_")
    batch_image_features, batch_word_indices = dataset.get_batch(10)
    print(batch_word_indices)
    print(batch_image_features.shape)

    dataset = CaptionDataLoader(captions, image_feature_path="../data/MSCOCO/train2014_ResNet50_features/COCO_train2014_", preload_all_features=True)
    batch_image_features, batch_word_indices = dataset.get_batch(10)
    print(batch_word_indices)
    print(batch_image_features.shape)
@@ -0,0 +1,76 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Class to fetch the data in batches.
# The in-memory option (preload_all_features) took 6m10.400s (user time = 2m41.546s) to load when enabled.
# Refactored version of CaptionDataLoader.py.

import numpy as np
import os
from image_loader import Image_loader
from ResNet50 import ResNet


class CaptionDataLoader(object):
    def __init__(self, dataset, image_feature_root, image_root="", preload_all_features=False, image_mean="imagenet", holding_raw_captions=False):
        self.holding_raw_captions = holding_raw_captions
        self.image_loader = Image_loader(mean=image_mean)
        self.captions = dataset["captions"]
        self.num_captions = len(self.captions)
        self.images = dataset["images"]
        self.caption2image = {caption["idx"]: caption["image_idx"] for caption in dataset["captions"]}
        # Path to the preprocessed image features. Features are assumed to be stored
        # under the same file name as the image, with the extension changed to .npz.
        self.image_feature_root = image_feature_root + "/"
        self.image_root = image_root + "/"  # path to the raw image directory
        self.random_indices = np.random.permutation(len(self.captions))
        self.index_count = 0
        self.epoch = 1
        self.preload_all_features = preload_all_features
        if self.preload_all_features:
            self.image_features = np.array([np.load("%s/%s.npz" % (self.image_feature_root, os.path.splitext(image["file_path"])[0]))['arr_0'] for image in self.images])

    def get_batch(self, batch_size, raw_image=False):
        # If raw_image is True, this yields a Batch x 3 x 224 x 224 array; otherwise it yields precomputed features.
        batch_caption_indices = self.random_indices[self.index_count:self.index_count + batch_size]
        self.index_count += batch_size
        if self.index_count > len(self.captions):
            self.epoch += 1
            self.shuffle_data()
            self.index_count = 0

        # Load the preprocessed images (or image features) and the captions for this batch.
        if raw_image:
            batch_images = np.array([self.image_loader.load(self.image_root + self.images[self.caption2image[i]]["file_path"], expand_batch_dim=False) for i in batch_caption_indices])
        else:
            if self.preload_all_features:
                batch_images = self.image_features[[self.caption2image[i] for i in batch_caption_indices]]
            else:
                batch_images = np.array([np.load("%s/%s.npz" % (self.image_feature_root, os.path.splitext(self.images[self.caption2image[i]]["file_path"])[0]))['arr_0'] for i in batch_caption_indices])
        if self.holding_raw_captions:
            batch_word_indices = [self.captions[i]["caption"] for i in batch_caption_indices]
        else:
            batch_word_indices = [np.array(self.captions[i]["caption"], dtype=np.int32) for i in batch_caption_indices]

        return batch_images, batch_word_indices

    def shuffle_data(self):
        self.random_indices = np.random.permutation(len(self.captions))


if __name__ == '__main__':
    # test code
    import json
    with open("../data/MSCOCO/mscoco_train2014_all_preprocessed.json", 'r') as f:
        captions = json.load(f)
    dataset = CaptionDataLoader(captions, image_feature_root="../data/MSCOCO/MSCOCO_ResNet50_features/", image_root="../data/MSCOCO/MSCOCO_raw_images/")
    batch_images, batch_word_indices = dataset.get_batch(10, raw_image=True)
    print(batch_word_indices)
    print(batch_images)

    batch_image_features, batch_word_indices = dataset.get_batch(10)
    print(batch_word_indices)
    print(batch_image_features.shape)

    dataset = CaptionDataLoader(captions, image_feature_root="../data/MSCOCO/MSCOCO_ResNet50_features", preload_all_features=True)
    batch_image_features, batch_word_indices = dataset.get_batch(10)
    print(batch_word_indices)
    print(batch_image_features.shape)
Review comment: Add your own README file, and reference the original file with proper credits.