Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu,
Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie
- [Nov 6, 2025] 🔥 We release Cambrian-S model weights, training code, and evaluation suite.
- [Nov 6, 2025] 🔥 We release VSI-SUPER, a benchmark designed for spatial supersensing.
- [Nov 6, 2025] 🔥 We release VSI-590K, a dataset curated for spatial sensing.
Here are our Cambrian-S checkpoints along with instructions on how to use the weights. Our models excel at spatial reasoning in video understanding, demonstrating significant improvements over previous state-of-the-art methods on spatial understanding benchmarks while maintaining competitive performance on general video understanding tasks.
Comparison of Cambrian-S with other leading MLLMs on general video understanding benchmarks.
Results: Cambrian-S maintains competitive performance on standard video benchmarks (Perception Test and EgoSchema) while excelling at spatial reasoning tasks.
VSI-SUPER performance is evaluated on Cambrian-S-7B-LFP.
| Model | Base-LLM | Vision Encoder | Hugging Face |
|---|---|---|---|
| Cambrian-S-7B-LFP | Qwen2.5-7B-Instruct | siglip2-so400m-patch14-384 | [nyu-visionx/Cambrian-S-7B-LFP](https://huggingface.co/nyu-visionx/Cambrian-S-7B-LFP) |
| Model | Base-LLM | Vision Encoder | Hugging Face |
|---|---|---|---|
| Cambrian-S-7B | Qwen2.5-7B-Instruct | siglip2-so400m-patch14-384 | [nyu-visionx/Cambrian-S-7B](https://huggingface.co/nyu-visionx/Cambrian-S-7B) |
| Cambrian-S-3B | Qwen2.5-3B-Instruct | siglip2-so400m-patch14-384 | [nyu-visionx/Cambrian-S-3B](https://huggingface.co/nyu-visionx/Cambrian-S-3B) |
| Cambrian-S-1.5B | Qwen2.5-1.5B-Instruct | siglip2-so400m-patch14-384 | [nyu-visionx/Cambrian-S-1.5B](https://huggingface.co/nyu-visionx/Cambrian-S-1.5B) |
| Cambrian-S-0.5B | Qwen2.5-0.5B-Instruct | siglip2-so400m-patch14-384 | [nyu-visionx/Cambrian-S-0.5B](https://huggingface.co/nyu-visionx/Cambrian-S-0.5B) |
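
As a minimal sketch (not the repository's official loader), the checkpoint files can be fetched from the Hugging Face Hub with `huggingface_hub`; the actual model-loading and inference entry points are provided by the code in this repository. The local directory path below is only an illustrative choice.

```python
# Minimal sketch: download a Cambrian-S checkpoint from the Hugging Face Hub.
# Assumes `huggingface_hub` is installed (pip install huggingface_hub);
# the inference loader itself comes from this repository and is not shown here.
from huggingface_hub import snapshot_download

# Download all files of the 7B checkpoint to a local directory (hypothetical path).
local_dir = snapshot_download(
    repo_id="nyu-visionx/Cambrian-S-7B",
    local_dir="checkpoints/Cambrian-S-7B",
)
print(f"Checkpoint downloaded to: {local_dir}")
```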
VSI-590K is a video instruction-tuning dataset focusing on spatial understanding.
VSI-590K dataset statistics: QAs are grouped by question type (left) and task group (right).
Hugging Face: [nyu-visionx/VSI-590K](https://huggingface.co/datasets/nyu-visionx/VSI-590K)
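
A minimal sketch for obtaining the dataset files: the snapshot can be pulled from the Hub with `huggingface_hub`, assuming you want the raw annotation and video files as distributed in the dataset repository (the exact file layout is defined there). The local directory path is illustrative.

```python
# Minimal sketch: pull the VSI-590K dataset snapshot from the Hugging Face Hub.
# Assumes `huggingface_hub` is installed; the file layout (annotation files,
# video sources, etc.) is determined by the dataset repository itself.
from huggingface_hub import snapshot_download

data_dir = snapshot_download(
    repo_id="nyu-visionx/VSI-590K",
    repo_type="dataset",
    local_dir="data/VSI-590K",  # hypothetical target path
)
print(f"VSI-590K downloaded to: {data_dir}")
```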
We are working on cleaning and re-organizing our TPU-based training code. Please stay tuned!
We have released our evaluation code in the lmms-eval/ subfolder. Please see the README there for more details.
For detailed benchmark results, please refer to the General Model Performance and VSI-SUPER Performance sections above.
If you find our work useful for your research, please consider citing:
@article{yang2025cambrians,
title={Cambrian-S: Towards Spatial Supersensing in Video},
author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and Yang, Zihao and Yu, Yue and Tong, Shengbang and Zheng, Zihan and Xu, Yifan and Wang, Muhan and Lu, Daohan and Fergus, Rob and LeCun, Yann and Fei-Fei, Li and Xie, Saining},
journal={arXiv preprint arXiv:2511.04670},
year={2025}
}
@article{brown2025shortcuts,
title={Benchmark Designers Should ``Train on the Test Set'' to Expose Exploitable Non-Visual Shortcuts},
author={Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
journal={arXiv preprint arXiv:2511.04655},
year={2025}
}
@article{brown2025simsv,
title={{SIMS-V}: Simulated Instruction-Tuning for Spatial Video Understanding},
author={Brown, Ellis and Ray, Arijit and Krishna, Ranjay and Girshick, Ross and Fergus, Rob and Xie, Saining},
journal={arXiv preprint arXiv:2511.04668},
year={2025}
}
@article{yang2024think,
title={{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali W. and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
year={2024},
journal={arXiv preprint arXiv:2412.14171},
}
@article{tong2024cambrian,
title={{Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs}},
author={Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and Wang, Austin and Fergus, Rob and LeCun, Yann and Xie, Saining},
journal={arXiv preprint arXiv:2406.16860},
year={2024}
}

- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
- Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces - Introduces VSI-Bench for evaluating visual-spatial intelligence
- SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
- Test-Set Stress-Test: Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts







