Jiefeng Li · Jinkun Cao · Haotian Zhang · Davis Rempe · Jan Kautz · Umar Iqbal · Ye Yuan
GENMO is a generalist model for human motion that handles motion estimation and generation with a single model, supporting diverse conditioning signals including video, 2D keypoints, text, audio, and 3D keyframes.
- [October 2025] 📢 The GENMO codebase is released! Stay tuned for the pretrained models and evaluation scripts. Follow the project page for updates and announcements.
GENMO introduces a unified generative framework that connects motion estimation and generation through shared objectives.
- Unified framework: Reframes motion estimation as constrained generation, allowing a single model to perform both tasks (see the toy sketch after this list).
- Regression × Diffusion synergy: Combines the accuracy of regression models with the diversity of diffusion-based generation.
- Estimation-guided training: Trains effectively on in-the-wild datasets using only 2D annotations or text descriptions as supervision.
- Multimodal conditioning: Supports video, text, audio, 2D/3D keyframes, or even time-varying mixed inputs (e.g., video → text → video).
- Arbitrary-length motion: Generates continuous, coherent sequences of any duration in one diffusion pass.
- State-of-the-art performance: Achieves leading results on diverse motion estimation and generation benchmarks.
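
To make the "estimation as constrained generation" idea concrete, below is a minimal, self-contained toy sketch. It is not the GENMO architecture or API: the `ToyDenoiser`, the simplified sampling update, and all dimensions are illustrative assumptions. The point it demonstrates is that a single conditional diffusion sampler can serve both tasks, with only the condition tensor swapped between, say, per-frame video features (estimation) and text features (generation).

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in noise predictor: maps (noisy motion, timestep, condition) to noise."""
    def __init__(self, motion_dim=69, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + 1 + cond_dim, 256),
            nn.SiLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, x_t, t, cond):
        # Broadcast the scalar timestep to every frame, then concatenate
        # motion, timestep, and per-frame condition features.
        t_feat = t.float().view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        return self.net(torch.cat([x_t, t_feat, cond], dim=-1))

@torch.no_grad()
def sample_motion(denoiser, cond, motion_dim=69, steps=50, guidance=2.0):
    """Toy DDPM-style sampler with classifier-free guidance.

    `cond` is a (batch, frames, cond_dim) tensor of per-frame condition
    features; in a generalist setting these would come from video, text,
    audio, or keyframe encoders mapped into a shared space.
    """
    batch, frames, _ = cond.shape
    x = torch.randn(batch, frames, motion_dim)
    null_cond = torch.zeros_like(cond)  # "no condition" branch for guidance
    for step in reversed(range(steps)):
        t = torch.full((batch,), step)
        eps_cond = denoiser(x, t, cond)
        eps_null = denoiser(x, t, null_cond)
        eps = eps_null + guidance * (eps_cond - eps_null)
        x = x - eps / steps  # simplified update; a real sampler uses the DDPM posterior
    return x

# Estimation and generation share the same sampler; only the condition differs.
denoiser = ToyDenoiser()
video_cond = torch.randn(1, 120, 64)  # e.g., per-frame video features -> estimation
text_cond = torch.randn(1, 120, 64)   # e.g., broadcast text embedding -> generation
motion_from_video = sample_motion(denoiser, video_cond)
motion_from_text = sample_motion(denoiser, text_cond)
```

Because the condition is per-frame in this sketch, time-varying mixed inputs (e.g., video → text → video) reduce to concatenating different encoders' features along the frame axis before a single sampling pass.
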
For more details, visit the GENMO project page →
Once released, pretrained models will be downloadable from Google Drive.
Paper:
GENMO: A GENeralist Model for Human MOtion
Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, Ye Yuan
ICCV, 2025
BibTeX:
@inproceedings{genmo2025,
  title     = {GENMO: A GENeralist Model for Human MOtion},
  author    = {Li, Jiefeng and Cao, Jinkun and Zhang, Haotian and Rempe, Davis and Kautz, Jan and Iqbal, Umar and Yuan, Ye},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}