Multi-modal SwinBERT: Video Captioning of Social Media Short Videos for Blind People using End-to-End Transformers
This project proposes a multi-modal video captioning model based on SwinBERT that leverages both video frames and audio features to generate captions for short social media videos. By combining the visual and auditory modalities, this end-to-end transformer-based solution aims to enhance video accessibility for blind and visually impaired users.
- Multi-modal Input:
  - Visual: Frames extracted from the video files and encoded with Video Swin Transformers (a frame-sampling sketch follows this list).
  - Audio: Mel-spectrograms generated from the video's audio track.
- Captioning Model: Based on the SwinBERT architecture, adapted for multi-modal video captioning.
- Dataset Customization: Easily adaptable to your own dataset with the proper folder structure.
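As a rough illustration of the visual stream, the sketch below samples frames uniformly from a video with OpenCV. The frame count, resolution, and use of OpenCV are assumptions made for illustration only; the repository's actual preprocessing follows SwinBERT's pipeline.

```python
# Hypothetical frame-sampling sketch; frame count and resolution are assumptions
# and may differ from the SwinBERT preprocessing used by this repository.
import cv2
import numpy as np

def sample_frames(video_path, num_frames=32, size=(224, 224)):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3)

# Example usage (path is illustrative):
# clip = sample_frames("datasets/videos/example.mp4")
```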
- Clone and Set Up the Repository

  ```bash
  git clone https://github.com/mohammadshahabuddin/Multi-modal-SwinBERT.git
  cd Multi-modal-SwinBERT
  ```
- Download Required Components
  - COCO Captioning Tools: Download the cider and coco_caption folders and place them in the src/evalcap directory. Download Link
  - Pretrained Video Swin Transformers: Our code is based on SwinBERT (https://github.com/microsoft/SwinBERT); please follow the steps in that repository to download the pretrained Video Swin Transformers. We also follow SwinBERT's data structure.
- Generate Mel-Spectrograms: To extract audio features, use this Jupyter Notebook to generate mel-spectrograms from your video files (an illustrative sketch follows this list).
- Prepare the Dataset: Place your dataset in the datasets folder, list the video names in the msu.txt file, and make sure each mel-spectrogram file shares the same name as its video (a quick consistency check is sketched after this list).
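The linked notebook is the source of truth for audio preprocessing; the minimal librosa-based sketch below only illustrates the general idea of turning an audio track into a log-scaled mel-spectrogram. The sample rate, number of mel bands, file paths, and .npy output format are assumptions.

```python
# Illustrative mel-spectrogram extraction; parameters (sample rate, n_mels)
# and the .npy output format are assumptions and may differ from the notebook.
import librosa
import numpy as np

def audio_to_mel(audio_path, sr=16000, n_mels=128):
    """Load an audio track and return a log-scaled mel-spectrogram."""
    waveform, sr = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, time_frames)

# Example usage (paths are illustrative); saving under the same base name as
# the video keeps the pairing described in the dataset preparation step:
# mel = audio_to_mel("datasets/audio/example.wav")
# np.save("datasets/mel_spectrograms/example.npy", mel)
```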
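Because every video listed in msu.txt needs a mel-spectrogram with the same base name, a quick consistency check such as the following can catch mismatches early. The folder names and the .npy extension are assumptions based on the structure described above.

```python
# Sanity check that every video listed in msu.txt has a matching mel-spectrogram.
# The mel_spectrograms folder name and .npy extension are assumptions.
from pathlib import Path

dataset_dir = Path("datasets")
mel_dir = dataset_dir / "mel_spectrograms"  # assumed folder name

with open(dataset_dir / "msu.txt") as f:
    video_names = [line.strip() for line in f if line.strip()]

missing = [name for name in video_names
           if not (mel_dir / f"{Path(name).stem}.npy").exists()]

if missing:
    print(f"{len(missing)} videos have no matching mel-spectrogram:", missing[:10])
else:
    print(f"All {len(video_names)} videos have matching mel-spectrograms.")
```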