mohammadshahabuddin/Multi-modal-SwinBERT


Multi-modal SwinBERT: Video Captioning of Social Media Short Videos for Blind People using End-to-End Transformers

Empowering Accessibility for Blind and Visually Impaired Individuals

figure1

This project proposes a Multi-modal Video Captioning Model based on SwinBERT, leveraging both video frames and audio features to generate captions for short social media videos. By combining visual and auditory modalities, this end-to-end transformer-based solution aims to enhance video accessibility for blind and visually impaired users.

Features

  1. Multi-modal Input:
     - Visual: Frames sampled from video files and encoded with Video Swin Transformers.
     - Audio: Mel-spectrograms generated from video audio.

  2. Captioning Model: Based on the SwinBERT architecture, optimized for multi-modal video captioning tasks.

  3. Dataset Customization: Adaptable to your own dataset, provided it follows the expected folder structure.
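For readers unfamiliar with the audio input, a mel-spectrogram groups frequency content into bands that are evenly spaced on the mel scale rather than in Hz. The sketch below only illustrates that spacing, using the common HTK-style mel formula and the standard library. It is not taken from this repository: the function names, the 64 bands, and the 0 to 8000 Hz range are illustrative assumptions, and the project's notebook may use a different library and different parameters.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Return n_mels + 2 frequencies in Hz, equally spaced on the mel scale.

    Consecutive triples of these edges define the triangular filters that
    turn an ordinary spectrogram into a mel-spectrogram.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


# Assumed values for illustration only: 64 mel bands between 0 Hz and 8 kHz.
edges = mel_band_edges(n_mels=64, f_min=0.0, f_max=8000.0)
```

The bands are narrow at low frequencies and widen toward high frequencies, which roughly follows how human hearing resolves pitch.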

Setup

  1. Clone and Set Up the Repository

git clone https://github.com/mohammadshahabuddin/Multi-modal-SwinBERT.git

cd Multi-modal-SwinBERT

  2. Download Required Components

     - COCO Captioning Tools: Download the cider and coco_caption folders and place them in the src/evalcap directory. (Download Link)
     - Pretrained Video Swin Transformers: Our code is based on SwinBERT (https://github.com/microsoft/SwinBERT). Follow the steps in that repository to download the pretrained Video Swin Transformers. We also follow SwinBERT's data structure.

  3. Generate Mel-Spectrograms: To extract audio features, use this Jupyter Notebook to generate mel-spectrograms from your video files.

  4. Prepare the Dataset: Place your dataset in the datasets folder. List the video names in the msu.txt file. Make sure each mel-spectrogram file has the same name as its video.
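Because the mel-spectrogram files must share names with the videos, it can help to check the pairing before training. The following is a minimal sketch, not part of the repository. It assumes that msu.txt lists one video name per line, that the spectrograms sit in a single directory, and that they use a `.npy` extension. The function name, the extension, and the directory paths are all hypothetical and should be adjusted to your layout.

```python
from pathlib import Path
from typing import List


def find_missing_spectrograms(
    list_file: Path, mel_dir: Path, mel_ext: str = ".npy"
) -> List[str]:
    """Return the listed video names that have no same-named spectrogram file.

    list_file: text file with one video name per line (assumed format).
    mel_dir:   directory holding the spectrogram files (assumed layout).
    mel_ext:   spectrogram file extension (hypothetical default).
    """
    missing = []
    for line in list_file.read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        stem = Path(name).stem  # "clip01.mp4" -> "clip01"
        if not (mel_dir / f"{stem}{mel_ext}").is_file():
            missing.append(name)
    return missing
```

A call such as `find_missing_spectrograms(Path("datasets/msu.txt"), Path("datasets/mel"))` would return the names that still need spectrograms. Both paths here are placeholders.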

This project builds on and integrates components from the following repositories:

SwinBERT by Microsoft

COCO Captioning Tools

PycocoEvalCap
