This repository is the official implementation of the following paper.
Paper Title: MMFformer: Multimodal Fusion Transformer Network for Depression Detection
Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray
Proceedings of the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Vienna, Austria. Copyright 2025 by the author(s).
Depression is a serious mental health illness that significantly affects an individual’s well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, the early diagnosis of depression from the content of social networks has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. The transformer network with residual connections captures spatial features from videos, and a transformer encoder is exploited to capture important temporal dynamics in audio. Moreover, the fusion architecture fuses the extracted features through late and intermediate fusion strategies to identify the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% for the D-Vlog dataset and 7.74% for the LMVD dataset. The code is made publicly available at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.
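At a high level, MMFformer encodes each modality with its own transformer and then fuses the modality features. The snippet below is a minimal, simplified PyTorch sketch of that idea only; the feature dimensions (chosen to mirror D-Vlog-style 136-D visual and 25-D acoustic features), layer counts, and the concatenation-based fusion head are illustrative assumptions and do not reproduce the actual `MultiModalDepDet` implementation in this repository.

```python
# Minimal, simplified sketch of per-modality transformer encoding followed by an
# intermediate fusion step. Dimensions, depths, and the fusion head are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class ToyMMFusion(nn.Module):
    """Toy illustration: one transformer encoder per modality + fused classifier."""

    def __init__(self, video_dim=136, audio_dim=25, d_model=256, num_heads=4):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)

        # One transformer encoder per modality to model temporal dynamics.
        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)

        self.video_enc = make_encoder()
        self.audio_enc = make_encoder()
        # Intermediate fusion: concatenate pooled modality embeddings, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 2)
        )

    def forward(self, video, audio):
        # video: (B, T_v, video_dim), audio: (B, T_a, audio_dim)
        v = self.video_enc(self.video_proj(video)).mean(dim=1)  # temporal pooling
        a = self.audio_enc(self.audio_proj(audio)).mean(dim=1)
        return self.classifier(torch.cat([v, a], dim=-1))  # logits of shape (B, 2)


if __name__ == "__main__":
    logits = ToyMMFusion()(torch.randn(2, 50, 136), torch.randn(2, 60, 25))
    print(logits.shape)  # torch.Size([2, 2])
```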
## Python Implementation
### Local Environment
- OS: Ubuntu 24.04.2 LTS
- Memory: 128.0 GiB
- Processor: Intel® Xeon® w5-3425 × 24
- Graphics: 2 × NVIDIA RTX A6000
- GPU Memory: 2 × 48 GB = 96 GB
- CPU(s): 24
- GNOME: 46.0

### Datasets

We use the D-Vlog and LMVD datasets, proposed in their respective papers. For the D-Vlog dataset, please fill in the form at the bottom of the dataset website and send a request email to the authors. For the LMVD dataset, please download the features from the released Baidu Netdisk website or figshare.
Following D-Vlog's setup, that dataset is split into train, validation, and test sets with a 7:1:2 ratio. Since LMVD has no official split, we randomly split it with an 8:1:1 ratio; the specific division is stored in `../data/lmvd-dataset/lmvd_labels.csv`.
Furthermore, you can run `lmvd_extract_npy.py` to obtain the `.npy` features used to train the model. You can also generate the labels with `lmvd_prepare_labels.py`.
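As a quick sanity check of the prepared data, the snippet below shows one way to inspect the label file and a single extracted feature. The identifier column and the feature directory layout are assumptions made for illustration; adapt them to the actual output of `lmvd_prepare_labels.py` and `lmvd_extract_npy.py`.

```python
# Inspect the prepared LMVD labels and one extracted .npy feature file.
# The identifier column position and the 'npy' directory name are assumptions;
# adjust them to match the files produced by the preparation scripts.
import numpy as np
import pandas as pd

labels = pd.read_csv("../data/lmvd-dataset/lmvd_labels.csv")
print(labels.head())  # inspect the columns, labels, and split assignment

sample_id = labels.iloc[0, 0]  # assumed: first column holds the sample identifier
feat = np.load(f"../data/lmvd-dataset/npy/{sample_id}.npy")  # assumed feature path
print(feat.shape, feat.dtype)
```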
- Note: The pretrained model can be found here (`ckpt_path='../pretrained_models/visualmae_pretrained.pth'`).
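The snippet below is a minimal sketch of how such a checkpoint is typically loaded in PyTorch. The internal structure of the checkpoint (e.g., a `state_dict` key) and the use of `strict=False` are assumptions; follow the repository's own loading code for the real model.

```python
# Minimal sketch of loading the pretrained visual MAE weights with PyTorch.
# The checkpoint layout and strict=False are assumptions, not the repository's
# exact loading logic.
import torch

ckpt_path = "../pretrained_models/visualmae_pretrained.pth"
ckpt = torch.load(ckpt_path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # handle both wrapped and raw state dicts

# model = ...  # instantiate the visual encoder defined in this repository
# missing, unexpected = model.load_state_dict(state_dict, strict=False)
# print(missing, unexpected)
```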
### Training and Testing

Based on the options defined in `parse_args()`, the following are commonly used command-line examples for training and testing with `mainkfold.py`.
```bash
# Train on dvlog-dataset with 10-fold cross-validation (--fusion it)
python mainkfold.py --train True --num_folds 10 --start_fold 0 \
    --epochs 225 --batch_size 16 --learning_rate 1e-5 \
    --model MultiModalDepDet --fusion it --dataset dvlog-dataset

# Train on lmvd-dataset with 10-fold cross-validation (--fusion it)
python mainkfold.py --train True --num_folds 10 --start_fold 0 \
    --epochs 225 --batch_size 16 --learning_rate 1e-5 \
    --model MultiModalDepDet --fusion it --dataset lmvd-dataset

# Resume training on dvlog-dataset from fold 5 using a saved checkpoint (--fusion video)
python mainkfold.py --train True --num_folds 10 --start_fold 5 \
    --epochs 225 --batch_size 16 --learning_rate 1e-5 \
    --model MultiModalDepDet --fusion video --dataset dvlog-dataset \
    --resume_path ../weights/dvlog-dataset_MultiModalDepDet_4/checkpoints/best_model.pt

# Train on dvlog-dataset with --cross_infer enabled (--fusion ia)
python mainkfold.py --train True --cross_infer True \
    --num_folds 10 --model MultiModalDepDet \
    --fusion ia --dataset dvlog-dataset

# Train on lmvd-dataset with --cross_infer enabled (--fusion ia)
python mainkfold.py --train True --cross_infer True \
    --num_folds 10 --model MultiModalDepDet \
    --fusion ia --dataset lmvd-dataset

# Inference only (--train False, --cross_infer True) on dvlog-dataset (--fusion it)
python mainkfold.py --train False --cross_infer True \
    --num_folds 5 --model MultiModalDepDet --fusion it \
    --dataset dvlog-dataset

# Inference only (--train False, --cross_infer True) on lmvd-dataset (--fusion it)
python mainkfold.py --train False --cross_infer True \
    --num_folds 5 --model MultiModalDepDet --fusion it \
    --dataset lmvd-dataset

# Multi-GPU training on dvlog-dataset across cuda:0 and cuda:1 (--num_heads 4, --fusion lt)
python mainkfold.py --train True --num_folds 5 \
    --device cuda:0 cuda:1 --num_heads 4 --fusion lt \
    --model MultiModalDepDet --dataset dvlog-dataset \
    --batch_size 16 --epochs 225 --learning_rate 1e-5

# Multi-GPU training on lmvd-dataset across cuda:0 and cuda:1 (--num_heads 4, --fusion lt)
python mainkfold.py --train True --num_folds 5 \
    --device cuda:0 cuda:1 --num_heads 4 --fusion lt \
    --model MultiModalDepDet --dataset lmvd-dataset \
    --batch_size 16 --epochs 225 --learning_rate 1e-5

# Train on dvlog-dataset with wandb logging enabled and tqdm progress bars disabled
python mainkfold.py --train True --if_wandb True --tqdm_able False \
    --model MultiModalDepDet --fusion lt --dataset dvlog-dataset \
    --batch_size 16 --epochs 225 --learning_rate 1e-5

# Train on lmvd-dataset with wandb logging enabled and tqdm progress bars disabled
python mainkfold.py --train True --if_wandb True --tqdm_able False \
    --model MultiModalDepDet --fusion lt --dataset lmvd-dataset \
    --batch_size 16 --epochs 225 --learning_rate 1e-5

# Full example with all parameters spelled out
python mainkfold.py \
    --train True \
    --num_folds 10 \
    --start_fold 0 \
    --epochs 225 \
    --batch_size 16 \
    --learning_rate 1e-5 \
    --model MultiModalDepDet \
    --fusion lt \
    --dataset lmvd-dataset \
    --device cuda \
    --cross_infer False \
    --resume_path "" \
    --begin_epoch 1 \
    --if_wandb False \
    --tqdm_able True \
    --num_heads 1
```

## Citation

If you find this project useful for your research, please cite our paper:
```bibtex
@article{haque2025mmfformer,
  title   = {MMFformer: Multimodal Fusion Transformer Network for Depression Detection},
  author  = {Haque, Md Rezwanul and Islam, Md Milon and Raju, S M Taslim Uddin and Altaheri, Hamdi and Nassar, Lobna and Karray, Fakhri},
  journal = {arXiv preprint arXiv:2508.06701},
  year    = {2025},
}
```

## License

This project is licensed under the MIT License.
