Multi-modal SwinBERT: Video Captioning of Social Media Short Videos for Blind People using End-to-End Transformers
This project proposes a multi-modal video captioning model based on SwinBERT that leverages both video frames and audio features to generate captions for short social media videos. By combining the visual and auditory modalities, this end-to-end transformer-based solution aims to enhance video accessibility for blind and visually impaired users.
- Multi-modal Input:
  - Visual: Frames extracted from the video files and encoded with Video Swin Transformers (a frame-sampling sketch follows this list).
  - Audio: Mel-spectrograms generated from the video's audio track.
- Captioning Model: Based on the SwinBERT architecture, adapted for multi-modal video captioning.
- Dataset Customization: Easily adaptable to your own dataset with the proper folder structure.
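As a rough illustration of the visual stream, the sketch below samples frames uniformly from a video with OpenCV. The frame count, resolution, and use of OpenCV are assumptions made for illustration only; the repository's actual preprocessing follows SwinBERT's pipeline.

```python
# Hypothetical frame-sampling sketch; frame count and resolution are assumptions
# and may differ from the SwinBERT preprocessing used by this repository.
import cv2
import numpy as np

def sample_frames(video_path, num_frames=32, size=(224, 224)):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3)

# Example usage (path is illustrative):
# clip = sample_frames("datasets/videos/example.mp4")
```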
- Clone and Set Up the Repository

  ```bash
  git clone https://github.com/mohammadshahabuddin/Multi-modal-SwinBERT.git
  cd Multi-modal-SwinBERT
  ```
- Download Required Components
  - COCO Captioning Tools: Download the cider and coco_caption folders and place them in the src/evalcap directory. Download Link
  - Pretrained Video Swin Transformers: Our code is based on SwinBERT (https://github.com/microsoft/SwinBERT); please follow the steps in that repository to download the pretrained Video Swin Transformers. We also follow SwinBERT's data structure.
- Generate Mel-Spectrograms: To extract audio features, use this Jupyter Notebook to generate mel-spectrograms from your video files (an illustrative sketch follows this list).
- Prepare the Dataset: Place your dataset in the datasets folder, list the video names in the msu.txt file, and make sure each mel-spectrogram file shares the same name as its video (a quick consistency check is sketched after this list).
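The linked notebook is the source of truth for audio preprocessing; the minimal librosa-based sketch below only illustrates the general idea of turning an audio track into a log-scaled mel-spectrogram. The sample rate, number of mel bands, file paths, and .npy output format are assumptions.

```python
# Illustrative mel-spectrogram extraction; parameters (sample rate, n_mels)
# and the .npy output format are assumptions and may differ from the notebook.
import librosa
import numpy as np

def audio_to_mel(audio_path, sr=16000, n_mels=128):
    """Load an audio track and return a log-scaled mel-spectrogram."""
    waveform, sr = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, time_frames)

# Example usage (paths are illustrative); saving under the same base name as
# the video keeps the pairing described in the dataset preparation step:
# mel = audio_to_mel("datasets/audio/example.wav")
# np.save("datasets/mel_spectrograms/example.npy", mel)
```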
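Because every video listed in msu.txt needs a mel-spectrogram with the same base name, a quick consistency check such as the following can catch mismatches early. The folder names and the .npy extension are assumptions based on the structure described above.

```python
# Sanity check that every video listed in msu.txt has a matching mel-spectrogram.
# The mel_spectrograms folder name and .npy extension are assumptions.
from pathlib import Path

dataset_dir = Path("datasets")
mel_dir = dataset_dir / "mel_spectrograms"  # assumed folder name

with open(dataset_dir / "msu.txt") as f:
    video_names = [line.strip() for line in f if line.strip()]

missing = [name for name in video_names
           if not (mel_dir / f"{Path(name).stem}.npy").exists()]

if missing:
    print(f"{len(missing)} videos have no matching mel-spectrogram:", missing[:10])
else:
    print(f"All {len(video_names)} videos have matching mel-spectrograms.")
```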