Cambrian-S: Towards Spatial Supersensing in Video


arXiv | Website | HF Model: Cambrian-S | HF Dataset: VSI-590K | HF Dataset: VSI-SUPER
*Equal Contribution    †Core Contributor

Release

  • [Nov 6, 2025] 🔥 We release Cambrian-S model weights, training code, and evaluation suite.
  • [Nov 6, 2025] 🔥 We release VSI-SUPER, a benchmark designed for spatial supersensing.
  • [Nov 6, 2025] 🔥 We release VSI-590K, a dataset curated for spatial sensing.

Contents

  • Cambrian-S Weights
  • VSI-590K Dataset
  • Train
  • Evaluation
  • Citation
  • Related Projects

Cambrian-S Weights

Here are our Cambrian-S checkpoints and notes on how to use the weights. Our models demonstrate significant improvements over previous state-of-the-art methods on spatial understanding benchmarks while maintaining competitive performance on general video understanding tasks.

General Model Performance

Comparison of Cambrian-S with other leading MLLMs on general video understanding benchmarks.


Results: Cambrian-S maintains competitive performance on standard video benchmarks (Perception Test and EgoSchema) while excelling at spatial reasoning tasks.

VSI-SUPER Performance

VSI-SUPER performance is evaluated with Cambrian-S-7B-LFP.

Figures: VSI-SUPER Count results and VSI-SUPER SOR results.

Model Card

Model Trained with Predictive Sensing

| Model | Base LLM | Vision Encoder | Hugging Face |
|---|---|---|---|
| Cambrian-S-7B-LFP | Qwen2.5-7B-Instruct | siglip2-so400m-patch14-384 | nyu-visionx/Cambrian-S-7B-LFP |

Standard MLLM Models

| Model | Base LLM | Vision Encoder | Hugging Face |
|---|---|---|---|
| Cambrian-S-7B | Qwen2.5-7B-Instruct | siglip2-so400m-patch14-384 | nyu-visionx/Cambrian-S-7B |
| Cambrian-S-3B | Qwen2.5-3B-Instruct | siglip2-so400m-patch14-384 | nyu-visionx/Cambrian-S-3B |
| Cambrian-S-1.5B | Qwen2.5-1.5B-Instruct | siglip2-so400m-patch14-384 | nyu-visionx/Cambrian-S-1.5B |
| Cambrian-S-0.5B | Qwen2.5-0.5B-Instruct | siglip2-so400m-patch14-384 | nyu-visionx/Cambrian-S-0.5B |
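
As a minimal sketch (assuming only that the `huggingface_hub` package is installed), the checkpoints listed above can be fetched from the Hub as follows; the actual model-building and inference code lives in this repository and is not shown here:

```python
# Minimal sketch: download a Cambrian-S checkpoint from the Hugging Face Hub.
# Assumes `huggingface_hub` is installed (`pip install huggingface_hub`).
# Loading the weights into a model is done with this repo's own code and is not shown.
from huggingface_hub import snapshot_download

# Any repo id from the tables above works; the 7B predictive-sensing model is used as an example.
local_dir = snapshot_download(repo_id="nyu-visionx/Cambrian-S-7B-LFP")
print(f"Checkpoint files downloaded to: {local_dir}")
```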

VSI-590K Dataset

VSI-590K is a video instruction-tuning dataset focusing on spatial understanding.

Figure: VSI-590K data construction pipeline and dataset statistics.

VSI-590K Details

QA pairs are grouped by question type (left) and task group (right).


Hugging Face: nyu-visionx/VSI-590K
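
As a rough sketch of pulling the dataset from the Hub (assuming the 🤗 `datasets` library can read the repo directly; split and field names are not specified here and may differ from the dataset card):

```python
# Sketch: load VSI-590K from the Hugging Face Hub with the `datasets` library.
# This assumes the repo is directly readable by `load_dataset`; inspect the
# printed splits and features rather than relying on any field names shown here.
from datasets import load_dataset

vsi_590k = load_dataset("nyu-visionx/VSI-590K")
print(vsi_590k)                  # available splits and row counts

first_split = next(iter(vsi_590k))
print(vsi_590k[first_split][0])  # peek at the first QA example
```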

Train

We are working on cleaning and re-organizing our TPU-based training code. Please stay tuned!

Evaluation

We have released our evaluation code in the lmms-eval/ subfolder. Please see the README there for more details.
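
For orientation only, here is a hedged sketch of driving the suite through lmms-eval's standard command-line interface from Python; the `--model` and `--tasks` values are placeholders rather than names confirmed by this repository, so check lmms-eval/README.md for the registered identifiers:

```python
# Sketch: invoke the lmms-eval CLI. The "--model" and "--tasks" values below
# ("cambrian_s", "vsi_bench") are HYPOTHETICAL placeholders; see lmms-eval/README.md
# in this repository for the actual registered model and task names.
import subprocess

cmd = [
    "python", "-m", "lmms_eval",
    "--model", "cambrian_s",                                 # placeholder model name
    "--model_args", "pretrained=nyu-visionx/Cambrian-S-7B",  # checkpoint from the model card
    "--tasks", "vsi_bench",                                  # placeholder task name
    "--batch_size", "1",
    "--output_path", "./logs",
]
subprocess.run(cmd, check=True)
```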

For detailed benchmark results, please refer to the General Model Performance and VSI-SUPER Performance sections above.

Citation

If you find our work useful for your research, please consider citing our work:

@article{yang2025cambrians,
  title={Cambrian-S: Towards Spatial Supersensing in Video},
  author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and Yang, Zihao and Yu, Yue and Tong, Shengbang and Zheng, Zihan and Xu, Yifan and Wang, Muhan and Lu, Daohan and Fergus, Rob and LeCun, Yann and Fei-Fei, Li and Xie, Saining},
  journal={arXiv preprint arXiv:2511.04670},
  year={2025}
}

@article{brown2025shortcuts,
  title={Benchmark Designers Should ``Train on the Test Set'' to Expose Exploitable Non-Visual Shortcuts},
  author={Brown, Ellis and Yang, Jihan and Yang, Shusheng and Fergus, Rob and Xie, Saining},
  journal={arXiv preprint arXiv:2511.04655},
  year={2025}
}

@article{brown2025simsv,
  title={{SIMS-V}: Simulated Instruction-Tuning for Spatial Video Understanding},
  author={Brown, Ellis and Ray, Arijit and Krishna, Ranjay and Girshick, Ross and Fergus, Rob and Xie, Saining},
  journal={arXiv preprint arXiv:2511.04668},
  year={2025}
}

@article{yang2024think,
  title={{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
  author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali W. and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
  journal={arXiv preprint arXiv:2412.14171},
  year={2024}
}

@article{tong2024cambrian,
  title={{Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs}},
  author={Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and Wang, Austin and Fergus, Rob and LeCun, Yann and Xie, Saining},
  journal={arXiv preprint arXiv:2406.16860},
  year={2024}
}

Related Projects

  • Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
  • Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces - Introduces VSI-Bench for evaluating visual-spatial intelligence
  • SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
  • Test-Set Stress-Test: Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts