GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [ICCV 2025]
- Inference Code
- Pretrained Models
- A web demo
- Training Code
- Clean up the code to make it more readable
- Support for MeanFlow (I assume it will be even faster; we will see soon)
- Merge with Intentional-Gesture
conda create -n gesturelsm python=3.12
conda activate gesturelsm
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
bash demo/install_mfa.sh
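As an optional sanity check before moving on, you can confirm that the pinned PyTorch build is active and sees the GPU. This snippet is only a suggestion and is not part of the setup scripts:

```python
# Quick environment check: verify the PyTorch version and CUDA availability.
import torch

print(torch.__version__)          # expected: 2.1.2
print(torch.cuda.is_available())  # should print True on a machine with a CUDA 11.8-compatible driver
```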
This table shows the results of the 1-speaker and all-speaker comparisons. RAG-Gesture refers to "Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis", accepted at CVPR 2025. The 1-speaker stats use speaker id 2 ('scott') to stay consistent with previous SOTA methods. The RAG-Gesture numbers are copied directly from its repo and differ from those reported in the current paper.
- The statistics reported in the paper are based on the 1-speaker setting with speaker id 2 ('scott'), to stay consistent with previous SOTA methods.
- The previously provided pretrained models were trained on 1-speaker data (RVQ-VAEs, Diffusion, Shortcut, Shortcut-reflow).
- If you want the all-speaker setting, modify the config files to include all speaker ids.
- April 16, 2025: updated the pretrained models to include all speakers (RVQ-VAEs, Shortcut).
- No hyperparameter tuning was done for all-speaker; it uses the same settings as 1-speaker.
- You can add a speaker embedding for better all-speaker performance (see the sketch after this list). I did not add it, to keep the code aligned with the current 1-speaker setting and to keep the model able to generate gestures for a novel speaker.
- No gesture type information is used in the current version: for a novel speaker in a novel setting you never know the gesture type, so relying on it is unrealistic for real-world applications. However, if you just want an even better FGD, you can try adding this information.
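For reference, below is a minimal sketch of what an optional speaker embedding could look like. The class and argument names (`SpeakerConditioner`, `num_speakers`, `cond_dim`) are hypothetical and are not part of the released code.

```python
import torch
import torch.nn as nn


class SpeakerConditioner(nn.Module):
    """Hypothetical add-on: maps a speaker id to a learned embedding and adds it
    to the existing conditioning features (names are illustrative only)."""

    def __init__(self, num_speakers: int, cond_dim: int):
        super().__init__()
        self.speaker_emb = nn.Embedding(num_speakers, cond_dim)

    def forward(self, cond: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # cond: (batch, frames, cond_dim); speaker_id: (batch,)
        return cond + self.speaker_emb(speaker_id).unsqueeze(1)
```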
- Got accepted to ICCV 2025!!! Thanks to my co-authors. I will clean up the dirty code when I have time, to make future re-implementation easier.
# Option 1: From Google Drive
# Download the pretrained model (Shortcut) + (Shortcut-reflow) + (Diffusion) + (RVQ-VAEs)
gdown https://drive.google.com/drive/folders/1OfYWWJbaXal6q7LttQlYKWAy0KTwkPRw?usp=drive_link -O ./ckpt --folder
# Option 2: From Huggingface Hub
huggingface-cli download pliu23/GestureLSM --local-dir ./ckpt
# Download the SMPL model
gdown https://drive.google.com/drive/folders/1MCks7CMNBtAzU2XihYezNmiGT_6pWex8?usp=drive_link -O ./datasets/hub --folder
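If you prefer to download from Python instead of the CLI, the same Hugging Face repo can be fetched with `huggingface_hub` (the repo id is the one referenced above; the target directory is only a suggestion):

```python
from huggingface_hub import snapshot_download

# Download the full GestureLSM checkpoint repo into ./ckpt
snapshot_download(repo_id="pliu23/GestureLSM", local_dir="./ckpt")
```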
Required for evaluation and training; not necessary for running the web demo or inference.
- Download the original raw data
bash preprocess/bash_raw_cospeech_download.sh
Requires the downloaded dataset.
# Evaluate the pretrained shortcut model (20 steps)
python test.py -c configs/shortcut_rvqvae_128.yaml
# Evaluate the pretrained shortcut-reflow model (2-step)
python test.py -c configs/shortcut_reflow_test.yaml
# Evaluate the pretrained diffusion model
python test.py -c configs/diffuser_rvqvae_128.yaml
Requires the downloaded dataset.
bash train_rvq.sh
Requires the downloaded dataset.
# Train the shortcut model
python train.py -c configs/shortcut_rvqvae_128.yaml
# Train the diffusion model
python train.py -c configs/diffuser_rvqvae_128.yaml
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
Our code partially borrows from SynTalker, EMAGE, and DiffuseStyleGesture. Please check out these useful repos.
If you find our code or paper helpful, please consider citing:
@misc{liu2025gesturelsmlatentshortcutbased,
title={GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling},
author={Pinxin Liu and Luchuan Song and Junhua Huang and Haiyang Liu and Chenliang Xu},
year={2025},
eprint={2501.18898},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.18898},
}