This repository contains the code for a video captioning system inspired by Sequence to Sequence -- Video to Text. This system takes as input a video and generates a caption in English describing the video.

sddai/video-captioning

 
 

Automated Video Captioning using S2VT

Introduction

This repository contains my implementation of a video captioning system. This system takes as input a video and generates a caption describing the event in the video.

I took inspiration from Sequence to Sequence -- Video to Text (S2VT), a video captioning work proposed by researchers at the University of Texas at Austin.

Requirements

To run my code and reproduce the results, install the following packages first. I used Python 2.7 for the whole project.

Packages:

  • TensorFlow
  • Caffe
  • NumPy
  • OpenCV (cv2)
  • imageio
  • scikit-image
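
Before going further, it can help to check which of these packages are importable in your environment. The snippet below is a minimal sketch and not part of this repository: `missing_modules` and `REQUIRED_MODULES` are hypothetical names, and the only assumption is that the packages use their usual import names (scikit-image imports as `skimage`, OpenCV as `cv2`).

```python
import importlib

# Usual import names for the packages listed above.
# These names are an assumption of this sketch, not taken from the repository code.
REQUIRED_MODULES = ["tensorflow", "caffe", "numpy", "cv2", "imageio", "skimage"]


def missing_modules(names):
    """Return the module names that cannot be imported in this environment."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Raised when the package is not installed.
            missing.append(name)
    return missing
```

Calling `missing_modules(REQUIRED_MODULES)` returns the subset of packages that still need to be installed; an empty list means every import succeeded.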

S2VT - Architecture and working

The architecture diagram of S2VT, as given in the original paper, is shown below.

Arch_S2VT

The working of the system while generating a caption for a given video is represented below diagrammatically.

S2VT_Working
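
As I read the S2VT paper, the two stacked LSTMs are unrolled over one long sequence: during encoding the first LSTM reads frame features while the word input of the second LSTM is padding, and during decoding the frame input is padding while the second LSTM is fed the previously generated word. The sketch below only lays out that input schedule; it is not the model code of this repository, and `s2vt_schedule`, `PAD` and `BOS` are hypothetical names used for illustration.

```python
PAD = "<pad>"  # placeholder for an all-zero input
BOS = "<BOS>"  # begin-of-sentence tag that starts decoding


def s2vt_schedule(num_frames, caption_tokens):
    """Return one (frame_input, word_input) pair per time step."""
    steps = []
    # Encoding stage: frames go to the first LSTM, the word slot is padding.
    for t in range(num_frames):
        steps.append(("frame_%d" % t, PAD))
    # Decoding stage: the frame slot is padding, the word slot holds the
    # previous word (BOS at the first decoding step).
    previous = BOS
    for token in caption_tokens:
        steps.append((PAD, previous))
        previous = token
    return steps


for step in s2vt_schedule(2, ["a", "man", "<EOS>"]):
    print(step)
```

Each decoding step is expected to predict the next caption token, so the last token (here `<EOS>`) is only ever a target and never an input.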

Running instructions

  1. Install all the packages listed in the 'Requirements' section.
  2. Using Vid2Url_Full.txt, download the dataset clips from YouTube and store them in <YOUTUBE_CLIPS_DIR>.
    • Example Vid2Url entry - {'vid1547': 'm1NR0uNNs5Y_104_110'}
    • YouTube video identifier - m1NR0uNNs5Y
    • Start time - 104 seconds, End time - 110 seconds
    • Download the frames between 104 seconds and 110 seconds from https://www.youtube.com/watch?v=m1NR0uNNs5Y
    • These frames make up the clip for video id 'vid1547'
  3. Pass the downloaded video paths and a batch size (chosen to fit your hardware) to extract_feats() in Extract_Feats.py to extract VGG16 features for the downloaded clips, and store the features in <VIDEO_DIR>.
  4. Change the paths in lines 13 to 16 of utils.py to point to directories in your workspace.
  5. Run training_vidcap.py with the number of epochs as a command line argument, e.g. python training_vidcap.py 10
  6. Pass the saved checkpoint files from Step 5 to test_videocap.py to run the trained model on the validation set.
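
The clip identifiers in Step 2 can be split into their parts programmatically. The following is a minimal sketch, assuming every value has the form `<YouTube id>_<start>_<end>` as in the example entry; `parse_clip_id` and `watch_url` are hypothetical helpers, not functions from this repository, and how Vid2Url_Full.txt is loaded into a dictionary is left out.

```python
def parse_clip_id(clip_id):
    """Split '<youtube_id>_<start>_<end>' into (youtube_id, start, end)."""
    # YouTube ids may contain underscores themselves, so split from the right.
    youtube_id, start, end = clip_id.rsplit("_", 2)
    return youtube_id, int(start), int(end)


def watch_url(youtube_id):
    """Build the watch URL for a YouTube video identifier."""
    return "https://www.youtube.com/watch?v=" + youtube_id


# Example entry copied from Step 2.
vid2url = {'vid1547': 'm1NR0uNNs5Y_104_110'}

for vid, clip_id in vid2url.items():
    youtube_id, start, end = parse_clip_id(clip_id)
    print("%s: %s from %ds to %ds" % (vid, watch_url(youtube_id), start, end))
```

Splitting from the right keeps identifiers that contain underscores intact, which a plain `split("_")` would break apart.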

Sample results

Attached below are a few screenshots from caption generation for videos from the validation set.

Result1

Result2

Dataset

Even though S2VT was trained on MSVD, M-VAD and MPII-MD, I have trained my system only on MSVD, which can be downloaded here.

Demo

A demo of my system can be found here.

Acknowledgements
