This repository contains code for developing an MED (Multimedia Event Detection) system.
The code is for research purposes within CREST-Deep only. You can download the code and run your own experiments freely.
The required data for developing the system is located on Tsubame.
For detailed information on accessing the data, please refer to Part I: Data.
If you have any questions or requests, please do not hesitate to contact us at mengxi at ks.cs.titech.ac.jp or ryamamot at ks.cs.titech.ac.jp.
You can find more helpful information on running the baseline in the tutorial slides in slidesForMEDBaselineTutorial.
Part 0: Introduction to MED
Part I: Data
Part II: Evaluation
Part III: System Overview
Part IV: Frame Extraction
Part V: Deep Feature Extraction
Part VI: SVM Training and Testing
Part VII: LSTM Training and Testing
MED (Multimedia Event Detection) is a task of TRECVID, a large-scale video information search and retrieval workshop hosted by NIST.
Video is becoming a new means of documenting everything from recipes to how to change a tire of a car. Ever expanding multimedia video content necessitates development of new technologies for retrieving relevant videos based solely on the audio and visual content of the video. Participating MED teams will create a system that quickly finds events in a large collection of search videos. -- http://www-nlpir.nist.gov/projects/tv2016/tv2016.html#med
In this task, a system should find and rank videos containing a specified event from a large collection of videos. The event is specified by a textual description and a small number of example videos.
In contrast to event recognition, videos in the large collection may contain none of the events or multiple events.
Basically, we are required to use the given training video data to construct a system that can judge whether an unknown video clip contains one of the following events:
20 Events for Classification
Event ID | Event Name |
---|---|
E021 | Attempting_a_bike_trick |
E022 | Cleaning_an_appliance |
E023 | Dog_show |
E024 | Giving_directions_to_a_location |
E025 | Marriage_proposal |
E026 | Renovating_a_home |
E027 | Rock_climbing |
E028 | Town_hall_meeting |
E029 | Winning_a_race_without_a_vehicle |
E030 | Working_on_a_metal_crafts_project |
E031 | Beekeeping |
E032 | Wedding_shower |
E033 | Non-motorized_vehicle_repair |
E034 | Fixing_musical_instrument |
E035 | Horse_riding_competition |
E036 | Felling_a_tree |
E037 | Parking_a_vehicle |
E038 | Playing_fetch |
E039 | Tailgating |
E040 | Tuning_musical_instrument |
It is possible that a video clip contains none of the events listed above. For a detailed description of each event, please refer to ***.
We place all the data needed for each module's input on Tsubame under the directory:
/gs/hs0/tga-crest-deep/shinodaG
The whole dataset is split into six parts:
LDC2011E41_TEST (32060 videos)
LDC2012E01 (2000 videos)
LDC2012E110 (10899 videos)
LDC2013E115 (1496 videos)
LDC2013E56 (242 videos)
LDC2014E16 (254 videos)
The parts that are involved in training are:
LDC2011E41_TEST (a portion of the videos)
LDC2012E01
LDC2013E115
The parts that are involved in testing are:
LDC2011E41_TEST (a portion of the videos)
LDC2012E110
LDC2013E56
LDC2014E16
The detailed split information for training and testing is contained in the `csv` annotation files. They are located in:
/gs/hs0/tga-crest-deep/shinodaG/annotations/csv
For detailed explanations of the `csv` annotations, please refer to the Evaluation part.
- Video Data (Input for Frame Extraction)
The video data is located in
/gs/hs0/tga-crest-deep/shinodaG/video
They are compressed with H.264 and stored in .mp4 format.
- Frame Data (Input for Deep Feature Extraction)
The frame data is located in
/gs/hs0/tga-crest-deep/shinodaG/frame
- Feature Data (Input for SVM and LSTM)
The feature data is located in
/gs/hs0/tga-crest-deep/shinodaG/feature
We provide two kinds of features, i.e. `avgFeature` and `perFrameFeature`.
`avgFeature` is a single feature vector per video, calculated by averaging the deep features of the frames within the video.
`perFrameFeature` contains multiple feature vectors per video, one per frame. `perFrameFeature` is stored in h5 format, and every row in the h5 file is the vector of one frame. The rows follow temporal order, i.e. the first row corresponds to the first frame, the second row to the second frame, and so on.
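For a quick look at the data, the sketch below reads one `perFrameFeature` file with h5py. The internal dataset name is not documented here, so the snippet simply takes the first dataset it finds (an assumption), and the file name is hypothetical.

```python
# Minimal sketch (not part of the baseline scripts): inspect one perFrameFeature h5 file.
# The internal dataset name is an assumption; we simply take the first dataset found.
import h5py
import numpy as np

def load_per_frame_feature(h5_path):
    with h5py.File(h5_path, "r") as f:
        first_key = list(f.keys())[0]       # assumption: one dataset per file
        feats = np.array(f[first_key])      # shape: (num_frames, feature_dim)
    return feats

feats = load_per_frame_feature("HVC123456.h5")   # hypothetical file name
print(feats.shape)                               # rows follow frame order
avg_feature = feats.mean(axis=0)                 # averaging rows gives an avgFeature-style vector
```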
- Model Data
We place the caffe models under
/gs/hs0/tga-crest-deep/shinodaG/models/caffeModels
The Deep Feature Extraction module requires the `googLeNet` Caffe model, which is located in
/gs/hs0/tga-crest-deep/shinodaG/models/caffeModels/imageShuffleNet
The official evaluation plan is available at:
https://www.nist.gov/sites/default/files/documents/itl/iad/mig/MED16_Evaluation_Plan_V1.pdf
The performance of the system is evaluated using mAP (mean Average Precision).
Average Precision is often used for measuring the performance of an information retrieval system.
For a given target event, the testing video clips are ranked from top to bottom according to the relevance scores the system assigns them with respect to that event.
From that list, we can calculate precision and recall at different cut-off thresholds. The Average Precision is then the area under the resulting P-R (Precision-Recall) curve.
In practice, we approximate the area under the P-R curve using the following formula:
AP = (1/n) * Σ_{k=1}^{n} precision(k),
where n is the number of target videos in the testing dataset and precision(k) is the precision of the retrieval list cut off at the point where exactly k target videos have been included.
Finally, mAP is calculated by averaging the APs over all events, and this mAP is used as the overall measurement of the system.
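The computation above fits in a few lines of Python. The sketch below is illustrative only (the official scoring follows the NIST evaluation plan linked above): it ranks test clips by their relevance scores and accumulates precision at each rank where a target video appears.

```python
# Illustrative AP/mAP computation following the formula above (not the official NIST scorer).
def average_precision(scores, labels):
    """scores: relevance score per clip; labels: 1 if the clip contains the event, else 0."""
    ranked = sorted(zip(scores, labels), key=lambda x: x[0], reverse=True)
    n = sum(labels)                      # number of target videos in the test set
    hits, ap = 0, 0.0
    for rank, (_, is_target) in enumerate(ranked, start=1):
        if is_target:
            hits += 1
            ap += hits / float(rank)     # precision(k) at the cut-off where the k-th target appears
    return ap / n if n > 0 else 0.0

def mean_average_precision(per_event_scores, per_event_labels):
    aps = [average_precision(s, l) for s, l in zip(per_event_scores, per_event_labels)]
    return sum(aps) / len(aps)
```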
TRECVID provides us with the video information and annotations for training and testing in `csv` form:
EVENTS-BG_20160701_ClipMD.csv provides information about the background videos for training. These videos contain none of the 20 target events.
EVENTS-PS-100Ex_20160701_JudgementMD.csv provides annotations of positive and hard-negative video clips for training. Each row corresponds to a video and gives its `ClipID`, `EventID` and `Instance_type`. A `positive` value in `Instance_type` indicates the video is a positive example, while `miss` indicates it is a hard negative.
Kindred14-Test_20140428_ClipMD.csv provides information about the testing videos.
Kindred14-Test_20140428_Ref.csv provides annotations of the testing videos. Each row corresponds to a video-event combination and indicates whether the video contains the event. For example, "000069.E029","n" means video 000069 does not contain event E029, while "996867.E037","y" means video 996867 contains event E037 (see the parsing sketch after this list).
Kindred14-Test_20140428_EventDB.csv provides the correspondence between event IDs and event names.
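As an illustration, the Ref file can be loaded into a {(clip, event): label} map with the csv module. The header/malformed-line handling below is an assumption based only on the row format shown above.

```python
# Minimal sketch: parse Kindred14-Test_20140428_Ref.csv into a {(clip_id, event_id): bool} map.
# Assumes each data row looks like "000069.E029","n", as in the example above.
import csv

def load_ref(path):
    ref = {}
    with open(path, "r") as f:
        for row in csv.reader(f):
            if len(row) < 2 or "." not in row[0]:
                continue                                   # skip header or malformed lines
            clip_id, event_id = row[0].split(".", 1)
            ref[(clip_id, event_id)] = (row[1].strip().lower() == "y")
    return ref

ref = load_ref("Kindred14-Test_20140428_Ref.csv")
print(ref.get(("996867", "E037")))                          # True, per the example above
```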
In addition to the above annotation files, we generate `txt` versions for convenient use in the code related to training an RNN. For a detailed explanation of the `txt` annotations, please refer to AnnotationProcess in Part VII: LSTM Training and Testing.
The whole system mainly consists of four modules, namely Frame Extraction, Deep Feature Extraction, SVM Training and Testing, and LSTM Training and Testing. Please refer to the following figure to understand the relations among these modules.
- Frame Extraction: This module extracts frame images from videos. The input is a directory containing videos and, optionally, a list of videos. The output is a directory containing PNG images of video frames sampled every 2 seconds.
- Deep Feature Extraction: This module extracts deep features from video frames. The input is the frames extracted from the videos, and the output is the corresponding features. We use googLeNet in the baseline code, though it is easy to switch to another CNN.
- SVM Training and Testing: This module trains and tests an SVM with the deep features. The input is the annotations for the training and testing data together with the averaged deep feature of each video. The output is the detection results and the average precision of the system.
- LSTM Training and Testing: This module builds an LSTM-based RNN for detecting events. The input is the features extracted from frames, and the output is an LSTM-based RNN model (training phase) or detection results (testing phase).
You can start your experiments from any module in this pipeline, since we have prepared the processed data for each module's input on Tsubame. For access to the data, please refer to Part I: Data.
This module extracts frame images from videos. The input is a directory containing videos and, optionally, a list of videos. The output is a directory containing PNG images of video frames sampled every 2 seconds.
- ffmpeg - to extract frames from videos
https://ffmpeg.org/
Binaries are stored in /work1/t2g-crest-deep/ShinodaLab/library/ffmpeg-3.2.4/bin/
- `videodir` - (required) directory of videos. Video files should be placed following the format:
${videodir}/${videoname}.mp4
- `outdir` - (required) directory of frames. Frames will be output with names following the format:
${outdir}/${videoname}/${videoname}_00000001.png
${outdir}/${videoname}/${videoname}_00000002.png
- `list` - (optional) list of videos. This file should contain only file names, not paths, as follows:
hoge.mp4
fuga.mp4
If `list` is not specified, every .mp4 file under `videodir` will be processed.
./extractFrames.sh
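If you want to see what the script does conceptually, the sketch below reproduces the same behaviour by calling ffmpeg from Python (one PNG every 2 seconds, i.e. fps=1/2, with the naming scheme above). The actual baseline logic lives in extractFrames.sh, so treat this only as an illustration.

```python
# Illustrative Python wrapper around ffmpeg (the baseline itself uses extractFrames.sh).
# Extracts one PNG every 2 seconds (fps=1/2) with the naming scheme described above.
import os
import subprocess

def extract_frames(videodir, outdir, videoname, ffmpeg="ffmpeg"):
    video_path = os.path.join(videodir, videoname + ".mp4")
    frame_dir = os.path.join(outdir, videoname)
    if not os.path.isdir(frame_dir):
        os.makedirs(frame_dir)
    out_pattern = os.path.join(frame_dir, videoname + "_%08d.png")
    subprocess.check_call([ffmpeg, "-i", video_path,
                           "-vf", "fps=1/2",               # one frame every 2 seconds
                           out_pattern])

extract_frames("videos", "frames", "hoge")                  # processes videos/hoge.mp4
```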
This module is for extracting the deep features from video frames. The input is the frames extracted from videos, and the output is the corresponding features.
This module is written in Python and depends on:
- Python 2.7
- Caffe
You can easily import these dependencies by executing the following if you are on Tsubame:
source /usr/apps.sp3/nosupport/gsic/env/caffe-0.13.sh
source /usr/apps.sp3/nosupport/gsic/env/python-2.7.7.sh
To run the code for extracting the features, please edit the variables in 'extractDeepFeaturesStarter.sh', and run:
./extractDeepFeaturesStarter.sh
It will extract both the `avgFeature` and `perFrameFeature` deep features. For an explanation of `avgFeature` and `perFrameFeature`, please refer to Part I: Data.
Please refer to the script extractDeepFeaturesStarter.sh for the configuration of the variables.
Note: the code currently only supports extracting the deep feature from the pool5/7x7_s1 layer of googLeNet.
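For reference, here is a rough pycaffe sketch of pulling the pool5/7x7_s1 activation for a single frame. The prototxt/caffemodel file names and the preprocessing values are assumptions, so please follow extractDeepFeaturesStarter.sh for the actual configuration.

```python
# Rough pycaffe sketch of per-frame feature extraction from the pool5/7x7_s1 layer.
# File names and preprocessing values are assumptions; the baseline configuration lives in
# extractDeepFeaturesStarter.sh.
import numpy as np
import caffe

net = caffe.Net("deploy.prototxt", "googlenet.caffemodel", caffe.TEST)   # hypothetical names

transformer = caffe.io.Transformer({"data": net.blobs["data"].data.shape})
transformer.set_transpose("data", (2, 0, 1))                   # HxWxC -> CxHxW
transformer.set_mean("data", np.array([104.0, 117.0, 123.0]))  # assumed BGR mean values
transformer.set_raw_scale("data", 255)                         # caffe.io loads images in [0, 1]
transformer.set_channel_swap("data", (2, 1, 0))                # RGB -> BGR

def extract_frame_feature(image_path):
    image = caffe.io.load_image(image_path)
    net.blobs["data"].data[...] = transformer.preprocess("data", image)
    net.forward()
    return net.blobs["pool5/7x7_s1"].data[0].flatten()          # 1024-dim vector for googLeNet

feat = extract_frame_feature("frames/hoge/hoge_00000001.png")
```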
This module trains and tests an SVM with the deep features. The input is the annotations for the training and testing data together with the averaged deep feature of each video. The output is the detection results and the average precision of the system.
- libsvm - to train and test the SVM
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
Binaries are stored in /work1/t2g-crest-deep/ShinodaLab/library/libsvm-3.22/
- EXPID - (required) name of the experiment
- TempOutDir - (required) temporary directory
- LIBSVM - (required) location of LIBSVM
- IS_LINEAR - (required) SVM kernel type
  - 0 - use the RBF kernel
  - 1 - use the linear kernel
- SVSUFFIX - (required) suffix of the feature file names
- ANNOT_DIR - (required) directory where the annotation files are saved
- TEST_DATA - (required) prefix of the annotation files for testing
- BG_DATA - (required) prefix of the annotation files for training background data
- TRAIN_DATA - (required) prefix of the annotation files for training positive data
- TEST_SVDIR - (required) directory where the feature files for testing are saved
- TRAIN_SVDIR - (required) directory where the feature files for training are saved
./svm.sh
- ${EXPID}/${EXPID}.detection.csv - detection results
- ${EXPID}/ap.csv - average precision per event and their mean
You are expected to get an mAP of 0.512 on the test set.
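Conceptually, the module trains one binary SVM per event on the avgFeature vectors (positive clips vs. background/miss clips) and ranks the test videos by decision value. The sketch below illustrates this idea with scikit-learn rather than the libsvm binaries that svm.sh actually calls, so it is not the code that produces the 0.512 figure.

```python
# Conceptual per-event SVM illustration with scikit-learn; the baseline itself calls the
# libsvm binaries through svm.sh, so this is only a sketch of the idea.
import numpy as np
from sklearn.svm import SVC

def train_event_svm(pos_feats, bg_feats, is_linear=False):
    X = np.vstack([pos_feats, bg_feats])                       # avgFeature vectors
    y = np.array([1] * len(pos_feats) + [0] * len(bg_feats))   # 1 = event, 0 = background
    clf = SVC(kernel="linear" if is_linear else "rbf")         # mirrors the IS_LINEAR switch
    clf.fit(X, y)
    return clf

def score_test_videos(clf, test_feats):
    # Higher decision value = more likely to contain the event; this ranking is what AP measures.
    return clf.decision_function(test_feats)
```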
This module aims to build an LSTM-based RNN for detecting events. The input is the features extracted from frames, and the output is an LSTM-based RNN model (training phase), or detection results (testing phase).
This module is divided into three parts: AnnotationProcess, Lstm and ResultEvaluate.
AnnotationProcess processes the `csv` annotations and converts them into `txt` annotations, which are used as part of the input to Lstm.
AnnotationProcess is written in C/C++.
To compile the C/C++ code of AnnotationProcess, simply run:
./compile.sh
It will give you an executable named convertCsvToTxt.
Edit the variables in the script convertCsvToTxt.sh and run:
./convertCsvToTxt.sh
This will generate the `txt` annotations in the location you specify in the script.
The output variables include:
TRAIN_TXT_PATH, TEST_TXT_PATH: the `txt` annotation files for training and testing.
Each row in a `txt` annotation file corresponds to a video clip. It includes the path of the video's feature file and a label index in the range [1, 21]. Labels [1, 20] correspond to the event IDs E021 to E040, and label 21 is the background label, indicating that none of the target events is included (a minimal parsing sketch is given after this list).
NEW_TEST_REF_FILE: the test `csv` annotation file, which will be used for evaluation in ResultEvaluate. Its format is the same as that of the input test `csv` file.
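Under the assumption that each row is whitespace-separated ("<feature_path> <label>", which is not spelled out above), the `txt` annotations can be read like this:

```python
# Minimal sketch: read a txt annotation file produced by convertCsvToTxt.
# The whitespace-separated "<feature_path> <label>" row format is an assumption;
# labels 1-20 correspond to events E021-E040 and 21 to background.
def load_txt_annotation(path):
    samples = []
    with open(path, "r") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            samples.append((parts[0], int(parts[1])))   # (feature path, label index)
    return samples

train_samples = load_txt_annotation("train.txt")        # hypothetical TRAIN_TXT_PATH value
print(len(train_samples), train_samples[0])
```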
Lstm is written in Lua and depends on:
- Torch
We have installed the Torch framework under:
/gs/hs0/tga-crest-deep/shinodaG/library/torch/torch-master_17.8.8
If you are on Tsubame, you can easily import the framework into your environment by executing:
source /gs/hs0/tga-crest-deep/shinodaG/library/env/torch.sh
To train your own LSTM model, first edit the variables in trainStarter.sh:
TRAIN_ANNOTATION_PATH (input): the training `txt` annotation file you created in AnnotationProcess
MODEL_SAVING_DIR (output): the directory where you would like to save your models
Then run:
./trainStarter.sh
If you are on Tsubame3, you can instead submit a job by first editing the submission script file (e.g. modifying the -o and -e options) and running:
qsub -g YOUR_GROUP_NAME submitLstmTrain.sh
The training should complete within an hour.
To use your trained LSTM model for testing, edit the variables in testStarter.sh:
TEST_ANNOTATION_PATH (input): the test `txt` annotation file you created in AnnotationProcess
and edit the variables in testStarterBatch:
MODEL_DIR (input): the directory containing your trained models
OUTPUT_DIR (output): the directory where you want to store the softmax probabilities of the test data
and then run:
./testStarterBatch.sh
If you are on Tsubame3, you can instead submit a job by first editing the submission script file (e.g. modifying the -o and -e options) and running:
qsub -g YOUR_GROUP_NAME submitLstmTest.sh
The testing should complete within 20 minutes.
ResultEvaluate is written in Python and bash. It depends on:
- Python 2.7
To get the final AP (Average Precision) for the detection results, edit the variables in the script evaluateStarter.sh:
H5_SOFTMAX_DIR (input): the directory containing the h5 result files to be evaluated, which are output by lstm_test
TEST_REF (input): the test `csv` file you created in AnnotationProcess
OUTPUT_AP_DIR (output): the directory in which to store the mAP (mean Average Precision) scores
and then run:
./evaluateStarter.sh
The AP performance will be written to the location you specify (OUTPUT_AP_DIR) in the script.
Using the following parameters for training the LSTM, you are expected to get an mAP of around 0.43 on the test set:
EPOCH_NUM=40
BATCH_SIZE=128
HIDDEN_UNIT=256
LEARNING_RATE=0.001
LEARNING_RATE_DECAY=1E-4
WEIGHT_DECAY=0.01
GRADIENT_CLIP=5
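To make the architecture these parameters describe concrete, here is a minimal PyTorch sketch. The baseline itself is implemented in Lua/Torch; the single LSTM layer, the 1024-dimensional input (googLeNet pool5/7x7_s1 features) and the use of the Adam optimizer are assumptions made only for this illustration, not a description of the baseline code.

```python
# Illustrative PyTorch sketch of an LSTM classifier with the hyperparameters above.
# The baseline is Lua/Torch; the single LSTM layer, 1024-dim input features and Adam
# optimizer are assumptions made only for this example.
import torch
import torch.nn as nn

class EventLSTM(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, num_classes=21):   # HIDDEN_UNIT=256, 21 labels
        super(EventLSTM, self).__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                        # x: (batch, num_frames, feat_dim)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])            # classify from the last time step

model = EventLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)  # LEARNING_RATE, WEIGHT_DECAY
criterion = nn.CrossEntropyLoss()

feats = torch.randn(128, 30, 1024)               # dummy batch: BATCH_SIZE=128 clips, 30 frames each
labels = torch.randint(0, 21, (128,))            # class indices 0-20 (21 labels in total)
for _ in range(1):                               # the baseline trains for EPOCH_NUM=40 epochs
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)   # GRADIENT_CLIP=5
    optimizer.step()
```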