GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [ICCV 2025]
- Inference Code
- Pretrained Models
- A web demo
- Training Code
- Clean up the code to make it more readable
- Support for MeanFlow (I assume it will be even faster; we will see soon)
- Merge with Intentional-Gesture
conda create -n gesturelsm python=3.12
conda activate gesturelsm
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
bash demo/install_mfa.sh
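As an optional sanity check before moving on, you can confirm that the pinned PyTorch build is active and sees the GPU. This snippet is only a suggestion and is not part of the setup scripts:

```python
# Quick environment check: verify the PyTorch version and CUDA availability.
import torch

print(torch.__version__)          # expected: 2.1.2
print(torch.cuda.is_available())  # should print True on a machine with a CUDA 11.8-compatible driver
```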
This table shows the results of the 1-speaker and all-speaker comparisons. RAG-Gesture refers to "Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis", accepted at CVPR 2025. The 1-speaker stats use speaker id 2 ('scott') to stay consistent with previous SOTA methods. The RAG-Gesture numbers are copied directly from its repo and differ from those reported in the current paper.
- The statistics reported in the paper are based on the 1-speaker setting with speaker id 2 ('scott'), to stay consistent with previous SOTA methods.
- The previously provided pretrained models were trained on 1-speaker data (RVQ-VAEs, Diffusion, Shortcut, Shortcut-reflow).
- If you want the all-speaker setting, modify the config files to include all speaker ids.
- April 16, 2025: updated the pretrained models to include all speakers (RVQ-VAEs, Shortcut).
- No hyperparameter tuning was done for all-speaker; it uses the same settings as 1-speaker.
- You can add a speaker embedding for better all-speaker performance (see the sketch after this list). I did not add it, to keep the code aligned with the current 1-speaker setting and to keep the model able to generate gestures for a novel speaker.
- No gesture type information is used in the current version: for a novel speaker in a novel setting you never know the gesture type, so relying on it is unrealistic for real-world applications. However, if you just want an even better FGD, you can try adding this information.
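For reference, below is a minimal sketch of what an optional speaker embedding could look like. The class and argument names (`SpeakerConditioner`, `num_speakers`, `cond_dim`) are hypothetical and are not part of the released code.

```python
import torch
import torch.nn as nn


class SpeakerConditioner(nn.Module):
    """Hypothetical add-on: maps a speaker id to a learned embedding and adds it
    to the existing conditioning features (names are illustrative only)."""

    def __init__(self, num_speakers: int, cond_dim: int):
        super().__init__()
        self.speaker_emb = nn.Embedding(num_speakers, cond_dim)

    def forward(self, cond: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # cond: (batch, frames, cond_dim); speaker_id: (batch,)
        return cond + self.speaker_emb(speaker_id).unsqueeze(1)
```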
- Got accepted to ICCV 2025!!! Thanks to my co-authors. I will clean up the dirty code when I have time, to make future re-implementation easier.
# Option 1: From Google Drive
# Download the pretrained model (Shortcut) + (Shortcut-reflow) + (Diffusion) + (RVQ-VAEs)
gdown https://drive.google.com/drive/folders/1OfYWWJbaXal6q7LttQlYKWAy0KTwkPRw?usp=drive_link -O ./ckpt --folder
# Option 2: From Huggingface Hub
huggingface-cli download pliu23/GestureLSM --local-dir ./ckpt
# Download the SMPL model
gdown https://drive.google.com/drive/folders/1MCks7CMNBtAzU2XihYezNmiGT_6pWex8?usp=drive_link -O ./datasets/hub --folder
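If you prefer to download from Python instead of the CLI, the same Hugging Face repo can be fetched with `huggingface_hub` (the repo id is the one referenced above; the target directory is only a suggestion):

```python
from huggingface_hub import snapshot_download

# Download the full GestureLSM checkpoint repo into ./ckpt
snapshot_download(repo_id="pliu23/GestureLSM", local_dir="./ckpt")
```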
Required for evaluation and training; not necessary for running the web demo or inference.
- Download the original raw data
bash preprocess/bash_raw_cospeech_download.sh
Requires the downloaded dataset.
# Evaluate the pretrained shortcut model (20 steps)
python test.py -c configs/shortcut_rvqvae_128.yaml
# Evaluate the pretrained shortcut-reflow model (2-step)
python test.py -c configs/shortcut_reflow_test.yaml
# Evaluate the pretrained diffusion model
python test.py -c configs/diffuser_rvqvae_128.yaml
Requires the downloaded dataset.
bash train_rvq.sh
Requires the downloaded dataset.
# Train the shortcut model
python train.py -c configs/shortcut_rvqvae_128.yaml
# Train the diffusion model
python train.py -c configs/diffuser_rvqvae_128.yaml
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
Our code partially borrows from SynTalker, EMAGE, and DiffuseStyleGesture. Please check out these useful repos.
If you find our code or paper helpful, please consider citing:
@misc{liu2025gesturelsmlatentshortcutbased,
title={GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling},
author={Pinxin Liu and Luchuan Song and Junhua Huang and Haiyang Liu and Chenliang Xu},
year={2025},
eprint={2501.18898},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.18898},
}