- About
- Setting up the repository
- Feature Extraction
- Train EmoTx with different configurations!
- Download
- Bibtex
This is the official code repository for the CVPR 2023 paper "How you feelin'? Learning Emotions and Mental States in Movie Scenes". It contains the implementation of EmoTx, a Transformer-based model designed to predict emotions and mental states at both the scene and character levels. The model leverages multiple modalities, including visual, facial, and language features, to capture emotions in complex movie environments. We also provide the pre-trained weights for EmoTx and for all the pre-trained feature backbones used in this project, along with the extracted features for scenes (full frame), character faces, and subtitles from the MovieGraphs dataset.
- Clone the repository and change the working directory to the project's root.
$ git clone https://github.com/katha-ai/EmoTx-CVPR2023.git
$ cd EmoTx-CVPR2023
- This project strictly requires python==3.6.
Create a virtual environment using Conda-
$ conda create -n emotx python=3.6
$ conda activate emotx
(emotx) $ pip install -r requirements.txt
OR
Create a virtual environment using pip (make sure you have Python3.6 installed)
$ python3.6 -m pip install virtualenv
$ python3.6 -m virtualenv emotx
$ source emotx/bin/activate
(emotx) $ pip install -r requirements.txt
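Because the project pins Python 3.6, it can help to confirm which interpreter is active before installing the requirements. The following is a minimal sketch; the helper name `is_supported_python` is hypothetical and not part of this repository.

```python
import sys
from typing import Sequence

REQUIRED = (3, 6)  # major/minor version stated in this README


def is_supported_python(version: Sequence[int] = sys.version_info) -> bool:
    """Return True when the major/minor version matches the required one."""
    return tuple(version[:2]) == REQUIRED


# Report the status of the interpreter that runs this snippet.
status = "OK" if is_supported_python() else "unsupported"
print(f"Python {sys.version.split()[0]}: {status}")
```

Running this inside the activated `emotx` environment should report `OK` if the environment was created as described above.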
You can also use wget to download these files-
$ wget -O <FILENAME> <LINK>
File name | Contents | Comments |
---|---|---|
EmoTx_min_feats.tar.gz | | Contains data/ directory, which will occupy 167GB of disk space. |
InceptionResNetV1_VGGface_face_feats.tar.gz | Character face features extracted from InceptionResNet_v1 model pre-trained on VGGface2 dataset. | Contains generic_face_features/ directory. To use these features with EmoTx, move this directory inside data/ extracted from EmoTx_min_feats.tar.gz . After extraction, generic_face_features/ will occupy 32GB of disk space. |
VGG-vm_FER13_face_feats.tar.gz | Character face features extracted from VGG-vm model pre-trained on VGGFace and FER13 datasets. | Contains emo_face_features/ directory. To use these features with EmoTx, move this directory inside data/ extracted from EmoTx_min_feats.tar.gz . After extraction, emo_face_features/ will occupy 254GB of disk space. |
ResNet152_ImgNet_scene_feats.tar.gz | Scene (full frame) features extracted from ResNet152 model pre-trained on ImageNet dataset | Contains generic_scene_features/ directory. To use these features with EmoTx, move this directory inside data/ extracted from EmoTx_min_feats.tar.gz . After extraction, generic_scene_features/ will occupy 72GB of disk space. |
ResNet50_PL365_scene_feats.tar.gz | Scene (full frame) features extracted from ResNet50 model pre-trained on Places365 dataset. | Contains resnet50_places_scene_features/ directory. To use these features with EmoTx, move this directory inside data/ extracted from EmoTx_min_feats.tar.gz . After extraction, resnet50_places_scene_features/ will occupy 143GB of disk space. |
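The extracted features are large, so it may be worth checking free disk space before downloading. The sketch below sums the extracted sizes listed in the table above and compares them with the free space reported by `shutil.disk_usage`. The helper names are hypothetical, the sizes are treated as approximate decimal gigabytes, and the space taken by the archives themselves during extraction is not counted.

```python
import shutil
from typing import Iterable

# Extracted sizes in GB, copied from the table above.
EXTRACTED_GB = {
    "EmoTx_min_feats.tar.gz": 167,
    "InceptionResNetV1_VGGface_face_feats.tar.gz": 32,
    "VGG-vm_FER13_face_feats.tar.gz": 254,
    "ResNet152_ImgNet_scene_feats.tar.gz": 72,
    "ResNet50_PL365_scene_feats.tar.gz": 143,
}


def required_gb(archives: Iterable[str]) -> int:
    """Sum the extracted sizes of the selected archives."""
    return sum(EXTRACTED_GB[name] for name in archives)


def has_room(archives: Iterable[str], target_dir: str = ".") -> bool:
    """Check whether target_dir has enough free space for the extracted data."""
    free_gb = shutil.disk_usage(target_dir).free / 1e9
    return free_gb >= required_gb(archives)


# Example selection: the minimal features plus the ResNet152 scene features.
selection = ["EmoTx_min_feats.tar.gz", "ResNet152_ImgNet_scene_feats.tar.gz"]
print(f"Need about {required_gb(selection)} GB; enough room: {has_room(selection)}")
```

Extracting all five archives would need roughly 668GB by this count.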
- Create a copy of the given config template
(emotx) $ cp config_base.yaml config.yaml
- Edit lines 2-9 in the config as directed in the comments. If you have extracted EmoTx_min_feats.tar.gz in /home/user/data, then the path variables in config.yaml would be-
# Path variables
data_path: /home/user/data
resource_path: /home/user/data/MovieGraph/resources/
clip_srts_path: /home/user/data/MovieGraph/srt/clip_srt/
emotic_mapping_path: /home/user/data/emotic_mapping.json
pkl_path: /home/user/data/MovieGraph/mg/py3loader/
save_path: /home/user/checkpoints/
saved_model_path: /home/user/data/pretrained_models/
hugging_face_cache_path: /home/user/.cache/
dumps_path: "./dumps"
# Directory names
...
Refer to the full config_base.yaml for the default parameter configuration.
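Most of the path variables in the example above are derived from two roots: the directory where EmoTx_min_feats.tar.gz was extracted and the user's home directory. The sketch below reproduces that layout for arbitrary roots. The function `build_path_variables` is a hypothetical helper, not part of the repository, and it assumes the same directory layout as the example.

```python
import posixpath
from typing import Dict


def build_path_variables(data_path: str, home: str) -> Dict[str, str]:
    """Derive the path variables of the example config from two root directories."""
    mg = posixpath.join(data_path, "MovieGraph")
    return {
        "data_path": data_path,
        "resource_path": posixpath.join(mg, "resources") + "/",
        "clip_srts_path": posixpath.join(mg, "srt", "clip_srt") + "/",
        "emotic_mapping_path": posixpath.join(data_path, "emotic_mapping.json"),
        "pkl_path": posixpath.join(mg, "mg", "py3loader") + "/",
        "save_path": posixpath.join(home, "checkpoints") + "/",
        "saved_model_path": posixpath.join(data_path, "pretrained_models") + "/",
        "hugging_face_cache_path": posixpath.join(home, ".cache") + "/",
        "dumps_path": "./dumps",
    }


# Print the values in the same "key: value" form used by config.yaml.
for key, value in build_path_variables("/home/user/data", "/home/user").items():
    print(f"{key}: {value}")
```

The printed lines can be compared against the edited config.yaml to spot a mistyped path.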
Follow the instructions in feature_extractors/README.md to extract the required features from the MovieGraphs dataset. Note that the pre-extracted features are already provided above, so you do not need to extract them again.
After extracting the features and creating the config, you can train EmoTx on a 12GB GPU!
You can also use the pre-trained weights provided in the Download section.
Note: the Eval_mAP: [[A,B], C] in the log line (printed during training) represents char_mAP (A), scene_mAP (B), and the average of both (C), respectively.
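To make the nesting of the Eval_mAP value concrete, the sketch below unpacks a value of the form [[A, B], C] into its three parts. The helper `unpack_eval_map` is hypothetical, and the numbers are made up for illustration; they are not results from the paper.

```python
from typing import Tuple


def unpack_eval_map(eval_map) -> Tuple[float, float, float]:
    """Split [[A, B], C] into (char_mAP, scene_mAP, average of both)."""
    (char_map, scene_map), average = eval_map
    return char_map, scene_map, average


# Made-up numbers for illustration only.
char_map, scene_map, average = unpack_eval_map([[0.25, 0.75], 0.5])
print(f"char_mAP={char_map} scene_mAP={scene_map} average={average}")
```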
Note: it is recommended to use wandb for logging.
Using the default values given in config_base.yaml-
- To train EmoTx for MovieGraphs-top10 emotion label set, use the default config (no argument required)
(emotx) $ python trainer.py
- To train EmoTx with MovieGraphs-top25 emotion label set-
(emotx) $ python trainer.py top_k=25
- To use EmoticMapping label set-
(emotx) $ python trainer.py use_emotic_mapping=True
- To use different scene features (valid keywords- mvit_v1, resnet50_places, generic) [generic=ResNet152_ImageNet]
(emotx) $ python trainer.py scene_feat_type="mvit_v1"
- To use different character face features (valid keywords- resnet50_fer, emo, generic) [emo=VGG-vm_FER13, generic=InceptionResNetV1_VGGface]
(emotx) $ python trainer.py face_feat_type="resnet50_fer"
- To use fine-tuned/pre-trained subtitle features (valid choices- False (to use fine-tuned RoBERTa) | True (to use pre-trained RoBERTa))
(emotx) $ python trainer.py srt_feat_pretrained=False
- To train with only scene features
(emotx) $ python trainer.py use_char_feats=False use_srt_feats=False get_char_targets=False
- To train with only character face features
(emotx) $ python trainer.py use_scene_feats=False use_srt_feats=False get_scene_targets=False
- To train with scene and subtitle features
(emotx) $ python trainer.py use_char_feats=False get_char_targets=False
- Enable wandb logging (recommended)
(emotx) $ python trainer.py wandb.logging=True wandb.project=<PROJECT_NAME> wandb.entity=<WANDB_USERNAME>
All the above arguments can be combined to train with different configurations.
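Since the overrides are plain `key=value` arguments, a combined run can also be assembled programmatically, for example when launching several configurations in a row. The sketch below builds the argument list from a dictionary. The helper `build_trainer_command` is hypothetical, and only override keys that appear in the commands above are used.

```python
import shlex
from typing import Dict, List, Union

Value = Union[bool, int, float, str]


def build_trainer_command(overrides: Dict[str, Value]) -> List[str]:
    """Turn a dict of config overrides into ['python', 'trainer.py', 'key=value', ...]."""
    return ["python", "trainer.py"] + [f"{key}={value}" for key, value in overrides.items()]


# Combine three of the overrides shown above into a single run.
command = build_trainer_command({
    "top_k": 25,
    "scene_feat_type": "generic",
    "use_srt_feats": False,
})
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the project root inside the activated environment.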
File name | Comments | Training command |
---|---|---|
EmoTx_Top10.pt | EmoTx trained on MovieGraphs-top10 emotion label set | (emotx) $ python trainer.py model_no=4.0 top_k=10 |
EmoTx_Top25.pt | EmoTx trained on MovieGraphs-top25 emotion label set | (emotx) $ python trainer.py model_no=4.0 top_k=25 |
EmoTx_Emotic.pt | EmoTx trained on EmoticMapping emotion label set | (emotx) $ python trainer.py model_no=4.0 use_emotic_mapping=True |
These models can be loaded using the following code-
import torch
from models.emotx import EmoTx

model_checkpoint_filepath = "<PATH_TO_CHECKPOINT>.pt"
# Load onto the CPU first so the checkpoint also opens on machines without a GPU.
chkpt = torch.load(model_checkpoint_filepath, map_location="cpu")
# The checkpoint stores the hyperparameters needed to rebuild the model.
model = EmoTx(
    num_labels=chkpt["num_labels"],
    num_pos_embeddings=chkpt["num_pos_embeddings"],
    scene_feat_dim=chkpt["scene_feat_dim"],
    char_feat_dim=chkpt["char_feat_dim"],
    srt_feat_dim=chkpt["srt_feat_dim"],
    num_chars=chkpt["num_chars"],
    num_enc_layers=chkpt["num_enc_layers"],
    max_individual_tokens=chkpt["max_individual_tokens"],
    hidden_dim=chkpt["hidden_dim"],
)
# Restore the trained weights.
model.load_state_dict(chkpt["state_dict"])
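The loading code above reads a fixed set of keys from the checkpoint, so a missing key surfaces as a `KeyError` partway through model construction. As a small sketch, the keys can be checked up front. The helper `missing_checkpoint_keys` is hypothetical, and the dictionary below is a stand-in for a real checkpoint, which would come from `torch.load`.

```python
from typing import Iterable, List, Mapping

# Keys read from the checkpoint by the loading code above.
EXPECTED_KEYS = (
    "num_labels", "num_pos_embeddings", "scene_feat_dim", "char_feat_dim",
    "srt_feat_dim", "num_chars", "num_enc_layers", "max_individual_tokens",
    "hidden_dim", "state_dict",
)


def missing_checkpoint_keys(chkpt: Mapping, expected: Iterable[str] = EXPECTED_KEYS) -> List[str]:
    """Return the expected keys that are absent from a loaded checkpoint dict."""
    return [key for key in expected if key not in chkpt]


# Stand-in dict for illustration, deliberately lacking "hidden_dim".
fake_chkpt = {key: None for key in EXPECTED_KEYS if key != "hidden_dim"}
print(missing_checkpoint_keys(fake_chkpt))
```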
File name | Comments |
---|---|
ResNet50_PL365.pt | ResNet50 trained on Places365 dataset |
MViT_v1_Kinetics400.pt | MViT_v1 trained on Kinetics400 dataset |
ResNet50_FER.pt | ResNet50 trained on VGGFace, FER2013 and SFEW datasets |
InceptionResNetV1_VGGface.pt | InceptionResNetV1 trained on VGGFace2 dataset |
VGG-vm_FER13.pt | VGG-vm trained on VGGFace and FER2013 datasets |
MTCNN.pth and MTCNN.json | MTCNN model and config used for face detection |
Cascade_RCNN_movienet.pth and Cascade_RCNN_movienet.json | Config and person detection model pre-trained on MovieNet character annotations |
RoBERTa_finetuned_t10.pt | RoBERTa fine-tuned on MovieGraphs dataset with Top-10 label set |
RoBERTa_finetuned_t25.pt | RoBERTa fine-tuned on MovieGraphs dataset with Top-25 label set |
RoBERTa_finetuned_Emotic.pt | RoBERTa fine-tuned on MovieGraphs dataset with Emotic-Mapped label set |
If you find any part of this repository useful, please cite the following paper!
@inproceedings{dhruv2023emotx,
title = {{How you feelin'? Learning Emotions and Mental States in Movie Scenes}},
author = {Dhruv Srivastava and Aditya Kumar Singh and Makarand Tapaswi},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023}
}