Large-Scale-Multimodal-Depression-Detection

This repository is the official implementation of the following paper.

Paper Title: MMFformer: Multimodal Fusion Transformer Network for Depression Detection
Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray

Proceedings of the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Vienna, Austria. Copyright 2025 by the author(s).

arXiv: https://arxiv.org/abs/2508.06701

Abstract

Depression is a serious mental health illness that significantly affects an individual’s well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, the early diagnosis of depression, thanks to the content of social networks, has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. The transformer network with residual connections captures spatial features from videos, and a transformer encoder is exploited to design important temporal dynamics in audio. Moreover, the fusion architecture fused the extracted features through late and intermediate fusion strategies to find out the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% for D-Vlog dataset and 7.74% for LMVD dataset. The code is made available publicly at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.
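
At a high level, the paper describes per-modality transformer encoders whose outputs are combined through intermediate or late fusion. The snippet below is a minimal PyTorch sketch of an intermediate (feature-level) fusion of video and audio token sequences; all module names, depths, and feature dimensions are illustrative assumptions, not the exact MMFformer implementation.

import torch
import torch.nn as nn

# Minimal sketch of transformer-based intermediate fusion for video and audio
# token sequences. Dimensions, depths, and module names are illustrative only.
class IntermediateFusionSketch(nn.Module):
    def __init__(self, video_dim=136, audio_dim=25, d_model=256, num_heads=4):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)   # per-modality projection
        self.audio_proj = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.video_enc = nn.TransformerEncoder(layer, num_layers=2)   # video encoder
        self.audio_enc = nn.TransformerEncoder(layer, num_layers=2)   # temporal audio encoder
        self.fusion_enc = nn.TransformerEncoder(layer, num_layers=2)  # fuses concatenated tokens
        self.classifier = nn.Linear(d_model, 2)                       # depressed vs. non-depressed

    def forward(self, video, audio):
        # video: (B, T_v, video_dim), audio: (B, T_a, audio_dim)
        v = self.video_enc(self.video_proj(video))
        a = self.audio_enc(self.audio_proj(audio))
        fused = self.fusion_enc(torch.cat([v, a], dim=1))  # intermediate fusion over tokens
        return self.classifier(fused.mean(dim=1))          # mean-pool, then classify

The --fusion flag in the commands further below (it, ia, lt, video) selects among the fusion variants; which value maps to which strategy is best confirmed in the code.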



LOCAL ENVIRONMENT

OS          :   Ubuntu 24.04.2 LTS       
Memory      :   128.0 GiB
Processor   :   Intel® Xeon® w5-3425 × 24
Graphics    :   2 x (NVIDIA RTX A6000)
GPU Memory  :   2 x (48 GB) = 96 GB
CPU(s)      :   24
Gnome       :   46.0 

1. Prepare Datasets

We use the D-Vlog and LMVD datasets, proposed in their respective papers. For the D-Vlog dataset, please fill in the form at the bottom of the dataset website and send a request email to the authors. For the LMVD dataset, please download the features from the released Baidu Netdisk website or figshare.

Following D-Vlog's setup, the dataset is split into train, validation, and test sets with a 7:1:2 ratio. Since LMVD has no official split, we randomly split it with the same 7:1:2 ratio; the specific division is stored in `../data/lmvd-dataset/lmvd_labels.csv`.
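
For illustration only, a random 7:1:2 split of this kind could be generated as follows. The sample IDs, labels, and column names below are placeholders; the split actually used by this repository is the one shipped in ../data/lmvd-dataset/lmvd_labels.csv.

import numpy as np
import pandas as pd

# Illustrative 7:1:2 train/valid/test split; IDs, labels, and column names
# are placeholders, not the exact format of lmvd_labels.csv.
rng = np.random.default_rng(seed=42)
sample_ids = np.arange(1, 1001)                     # placeholder LMVD sample IDs
labels = rng.integers(0, 2, size=len(sample_ids))   # placeholder binary labels

perm = rng.permutation(len(sample_ids))
n_train = int(0.7 * len(sample_ids))
n_valid = int(0.1 * len(sample_ids))

fold = np.full(len(sample_ids), "test", dtype=object)
fold[perm[:n_train]] = "train"
fold[perm[n_train:n_train + n_valid]] = "valid"

pd.DataFrame({"id": sample_ids, "label": labels, "fold": fold}).to_csv(
    "lmvd_labels.csv", index=False)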

Furthermore, you can run lmvd_extract_npy.py to obtain the .npy features used to train the model, and generate the corresponding labels with lmvd_prepare_labels.py.
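
Loading those features for training could then look like the sketch below. The one-file-per-sample layout (<feature_dir>/<id>.npy) and the CSV columns ("id", "label", "fold") are assumptions about what lmvd_extract_npy.py and lmvd_prepare_labels.py produce.

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

# Sketch of a dataset over the extracted .npy features; the directory layout
# and CSV column names are assumptions, not the repository's exact format.
class LMVDFeatureDataset(Dataset):
    def __init__(self, csv_path, feature_dir, fold="train"):
        df = pd.read_csv(csv_path)
        self.rows = df[df["fold"] == fold].reset_index(drop=True)
        self.feature_dir = feature_dir

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows.iloc[idx]
        feats = np.load(f"{self.feature_dir}/{row['id']}.npy")
        return torch.from_numpy(feats).float(), int(row["label"])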

  • Note: The pretrained model can be found here. [ckpt_path='../pretrained_models/visualmae_pretrained.pth']
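
Restoring that checkpoint before training typically looks like the sketch below; whether the .pth file stores a bare state_dict or wraps it under a "state_dict" key is an assumption to verify for this checkpoint.

import torch

def load_visualmae_weights(model, ckpt_path="../pretrained_models/visualmae_pretrained.pth"):
    # Load the pretrained weights non-strictly so backbone-only checkpoints
    # still apply; the "state_dict" wrapper handling below is an assumption.
    state = torch.load(ckpt_path, map_location="cpu")
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model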

Based on parse_args() in mainkfold.py, the following command-line examples cover common training and testing scenarios.


✅ Common Command-Line Examples


🔁 1. Standard 10-Fold Training

python mainkfold.py --train True --num_folds 10 --start_fold 0 \
--epochs 225 --batch_size 16 --learning_rate 1e-5 \
--model MultiModalDepDet --fusion it --dataset dvlog-dataset
python mainkfold.py --train True --num_folds 10 --start_fold 0 \
--epochs 225 --batch_size 16 --learning_rate 1e-5 \
--model MultiModalDepDet --fusion it --dataset lmvd-dataset

🎯 2. Resume Training from a Specific Fold

python mainkfold.py --train True --num_folds 10 --start_fold 5 \
--epochs 225 --batch_size 16 --learning_rate 1e-5 \
--model MultiModalDepDet --fusion video --dataset dvlog-dataset \
--resume_path ../weights/dvlog-dataset_MultiModalDepDet_4/checkpoints/best_model.pt

🧪 3. Cross-corpus Validation

python mainkfold.py --train True --cross_infer True \
--num_folds 10 --model MultiModalDepDet \
--fusion ia --dataset dvlog-dataset
python mainkfold.py --train True --cross_infer True \
--num_folds 10 --model MultiModalDepDet \
--fusion ia --dataset lmvd-dataset

🧪 4. Cross Inference with Fusion Strategy Testing

python mainkfold.py --train False --cross_infer True \
--num_folds 5 --model MultiModalDepDet --fusion it \
--dataset dvlog-dataset
python mainkfold.py --train False --cross_infer True \
--num_folds 5 --model MultiModalDepDet --fusion it \
--dataset lmvd-dataset

💻 5. Train with GPU(s) and Multiple Heads

python mainkfold.py --train True --num_folds 5 \
--device cuda:0 cuda:1 --num_heads 4 --fusion lt \
--model MultiModalDepDet --dataset dvlog-dataset \
--batch_size 16 --epochs 225 --learning_rate 1e-5
python mainkfold.py --train True --num_folds 5 \
--device cuda:0 cuda:1 --num_heads 4 --fusion lt \
--model MultiModalDepDet --dataset lmvd-dataset \
--batch_size 16 --epochs 225 --learning_rate 1e-5

🔁 6. WandB Enabled Training with TQDM Off

python mainkfold.py --train True --if_wandb True --tqdm_able False \
--model MultiModalDepDet --fusion lt --dataset dvlog-dataset \
--batch_size 16 --epochs 225 --learning_rate 1e-5
python mainkfold.py --train True --if_wandb True --tqdm_able False \
--model MultiModalDepDet --fusion lt --dataset lmvd-dataset \
--batch_size 16 --epochs 225 --learning_rate 1e-5

✅ All-in-One Command

python mainkfold.py \
  --train True \
  --num_folds 10 \
  --start_fold 0 \
  --epochs 225 \
  --batch_size 16 \
  --learning_rate 1e-5 \
  --model MultiModalDepDet \
  --fusion lt \
  --dataset lmvd-dataset \
  --device cuda \
  --cross_infer False \
  --resume_path "" \
  --begin_epoch 1 \
  --if_wandb False \
  --tqdm_able True \
  --num_heads 1
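
For reference, the flags used in these commands map onto an argparse setup roughly like the one below. This is a minimal sketch; the exact names, defaults, and choices are defined by parse_args() in mainkfold.py. Boolean flags are passed as explicit True/False strings, so a str2bool-style converter is assumed.

import argparse

def str2bool(v):
    # Allows flags to be written as "--train True" / "--train False".
    return str(v).lower() in ("true", "1", "yes")

def parse_args():
    # Sketch only: mirrors the flags shown in the commands above.
    p = argparse.ArgumentParser(description="K-fold training / cross inference")
    p.add_argument("--train", type=str2bool, default=True)
    p.add_argument("--cross_infer", type=str2bool, default=False)
    p.add_argument("--num_folds", type=int, default=10)
    p.add_argument("--start_fold", type=int, default=0)
    p.add_argument("--epochs", type=int, default=225)
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--learning_rate", type=float, default=1e-5)
    p.add_argument("--model", type=str, default="MultiModalDepDet")
    p.add_argument("--fusion", type=str, default="it")        # observed values: it, ia, lt, video
    p.add_argument("--dataset", type=str, default="dvlog-dataset")
    p.add_argument("--device", nargs="+", default=["cuda"])   # e.g. cuda, or cuda:0 cuda:1
    p.add_argument("--num_heads", type=int, default=1)
    p.add_argument("--resume_path", type=str, default="")
    p.add_argument("--begin_epoch", type=int, default=1)
    p.add_argument("--if_wandb", type=str2bool, default=False)
    p.add_argument("--tqdm_able", type=str2bool, default=True)
    return p.parse_args()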

📖 Citation

  • If you find this project useful for your research, please cite our paper:

@article{haque2025mmfformer,
  title   = {MMFformer: Multimodal Fusion Transformer Network for Depression Detection},
  author  = {Haque, Md Rezwanul and Islam, Md Milon and Raju, S M Taslim Uddin and Altaheri, Hamdi and Nassar, Lobna and Karray, Fakhri},
  journal = {arXiv preprint arXiv:2508.06701},
  year    = {2025}
}

License

This project is licensed under the MIT License.

