# SIGMA: Sinkhorn-Guided Masked Video Modeling (ECCV 2024)


*Figure: overview of the SIGMA framework.*

## 🔥 Sinkhorn-Guided Masked Video Modeling

Video-based pretraining offers immense potential for learning strong visual representations at an unprecedented scale. Masked video modeling methods have recently shown promising scalability, yet they fall short in capturing higher-level semantics because they reconstruct predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modeling (SIGMA), a novel video pretraining method that jointly learns the video model together with a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss leads to trivial solutions, as both networks are jointly optimized. As a solution, we distribute the features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task, where the video model predicts the cluster assignments of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally aware, and robust video representations, improving upon state-of-the-art methods.
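To make the Sinkhorn-guided target generation and the symmetric prediction task concrete, here is a minimal PyTorch sketch. It follows the standard SwAV-style Sinkhorn-Knopp iteration; the function names, hyperparameters (`eps`, `temp`, `n_iters`), and the equal loss weighting are illustrative assumptions, not the exact settings of this repository.

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(scores: torch.Tensor, eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    """Turn (B, K) tube-to-prototype logits into soft cluster assignments.

    Alternating row/column normalization solves the entropy-regularized
    optimal transport problem, spreading the batch evenly over the K
    clusters and thereby ruling out the trivial collapsed solution.
    """
    Q = torch.exp(scores / eps).t()        # (K, B) transport kernel
    Q /= Q.sum()                           # normalize total mass to 1
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)    # rows: each cluster gets 1/K of the mass
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)    # columns: each sample contributes 1/B
        Q /= B
    return (Q * B).t()                     # (B, K), each row sums to 1


def symmetric_swap_loss(logits_video: torch.Tensor,
                        logits_proj: torch.Tensor,
                        temp: float = 0.1) -> torch.Tensor:
    """Symmetric prediction: each branch predicts the other's assignments."""
    with torch.no_grad():
        q_video = sinkhorn_knopp(logits_video)   # targets from the video model
        q_proj = sinkhorn_knopp(logits_proj)     # targets from the projection net
    logp_video = torch.log_softmax(logits_video / temp, dim=1)
    logp_proj = torch.log_softmax(logits_proj / temp, dim=1)
    return -0.5 * ((q_proj * logp_video).sum(dim=1).mean()
                   + (q_video * logp_proj).sum(dim=1).mean())
```

For example, with 256 masked tubes and 300 prototypes, `symmetric_swap_loss(torch.randn(256, 300), torch.randn(256, 300))` returns a scalar loss; in practice the logits would come from scoring the video model's predictions and the projection network's features against the learnable cluster prototypes.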

## ✨ Something-Something V2

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Epochs | Top-1 (%) |
|--------|------------|----------|------------|--------------------------|--------|-----------|
| VideoMAE | no | ViT-S | 224x224 | 16x2x3 | 2400 | 66.8 |
| VideoMAE | no | ViT-B | 224x224 | 16x2x3 | 800 | 69.6 |
| SIGMA | Img-1k | ViT-S | 224x224 | 16x2x3 | 2400 | 68.6 |
| SIGMA | Img-1k | ViT-B | 224x224 | 16x2x3 | 800 | 70.9 |

## ✨ Kinetics-400

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Epochs | Top-1 (%) |
|--------|------------|----------|------------|--------------------------|--------|-----------|
| VideoMAE | no | ViT-S | 224x224 | 16x5x3 | 1600 | 79.0 |
| VideoMAE | no | ViT-B | 224x224 | 16x5x3 | 800 | 80.0 |
| SIGMA | Img-1k | ViT-S | 224x224 | 16x5x3 | 800 | 79.4 |
| SIGMA | Img-1k | ViT-B | 224x224 | 16x5x3 | 800 | 81.6 |

## 🔨 Installation

Please follow the instructions in INSTALL.md.

## ➡️ Data Preparation

Please follow the instructions in DATASET.md for data preparation.

## 🔄 Pre-training

Pre-training instructions are in PRETRAIN.md. Our pre-trained models are available on [Hugging Face](https://huggingface.co/SMSD75/SIGMA).
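As a hedged sketch of fetching one of those checkpoints (the filename below is a placeholder, not a confirmed file in the repo; browse the file list on the Hub for the actual names):

```python
import torch
from huggingface_hub import hf_hub_download

# Hypothetical filename: check https://huggingface.co/SMSD75/SIGMA for the
# actual checkpoint names before running this.
ckpt_path = hf_hub_download(repo_id="SMSD75/SIGMA", filename="checkpoint.pth")
state_dict = torch.load(ckpt_path, map_location="cpu")
print(list(state_dict.keys())[:5])  # inspect the top-level keys
```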

## ⤴️ Fine-tuning with pre-trained models

Fine-tuning instructions are in FINETUNE.md.

## 📍 Model Zoo

⚠️ Our code is based on the VideoMAE codebase.

## ✏️ Citation

If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:


```bibtex
@inproceedings{salehi2024sigma,
  title={SIGMA: Sinkhorn-Guided Masked Video Modeling},
  author={Salehi, Mohammadreza and Dorkenwald, Michael and Thoker, Fida Mohammad and Gavves, Efstratios and Snoek, Cees G. M. and Asano, Yuki M.},
  booktitle={European Conference on Computer Vision},
  year={2024}
}
```