Open-source Omni-modal Foundation Model Supporting Text, Image, Video, and Audio Inputs as Well as Text and Audio Outputs
English | 中文
Baichuan-Omni-1.5 🤗 | Baichuan-Omni-1.5-Base 🤗 | Technical Report 📖
Baichuan-Omni-1.5 is the latest end-to-end trained omni-modal large model that supports comprehensive input modalities (text, image, video, audio) and dual output modalities (text and audio). Built upon the Qwen2.5-7B language model, it can process inputs from various modalities and generate high-quality text and speech outputs in a controllable manner.
- **Baichuan-Omni-1.5-Base**: To promote the development of omni-modal models, we have open-sourced a foundational model trained on high-quality, extensive datasets. This model has not undergone supervised fine-tuning (SFT) on instruction data, offering great flexibility and serving as the best-performing foundational omni-modal model currently available.
- **Baichuan-Omni-1.5**: Leveraging the robust Baichuan-Omni-1.5-Base, this model undergoes end-to-end training with high-quality omni-modal aligned data. Baichuan-Omni-1.5 achieves text, image, video, and audio understanding comparable to GPT-4o-mini.
- 🏁 Baichuan-Omni-1.5
- ⭐ Model Architecture
- 🧠 Multi-stage Omni-modal Training Framework
- 📊 Performance Evaluation
- 🍰 Example Use Cases
- 🚀 Local WebUI Demo
- ⚙️ Fine-tuning
- 📈 Open-source Evaluation Datasets
- 📣 Acknowledgments
- ⚠️ Disclaimer
- 📜 License
- ✒️ Citation
Baichuan-Omni-1.5 is the latest and most advanced model in the Baichuan-Omni series, trained and performing inference in a fully end-to-end manner. Compared with open-source counterparts, Baichuan-Omni-1.5 demonstrates significant improvements in understanding text, image, audio, and video inputs. Notably, the model shows impressive capabilities in controllable real-time voice interaction and collaborative real-time understanding across modalities. Beyond its general capabilities, Baichuan-Omni-1.5 stands out as the most outstanding MLLM in the medical domain, opening up exciting new possibilities for AGI to contribute to the well-being of human society. Based on the evaluation results, we summarize the key advantages and contributions of Baichuan-Omni-1.5:
- **Omni-modal Interaction**: Baichuan-Omni-1.5 is designed to process text, image, audio, and video inputs and to deliver high-quality text and speech outputs. It achieves seamless, high-quality cross-modal interaction without compromising the capabilities of any modality.
- **Excellent Vision-Language Capability**: Baichuan-Omni-1.5 scores an average of 73.3 across ten image-understanding benchmarks, surpassing GPT-4o-mini by an average of 6 points.
- **Unified and Outstanding Speech Capabilities**: We design an 8-layer RVQ audio tokenizer (Baichuan-Audio-Tokenizer) that achieves an optimal balance between capturing semantic and acoustic information at a 12.5 Hz frame rate, supporting high-quality, controllable bilingual (Chinese and English) real-time conversations; a back-of-the-envelope token-rate sketch follows this list. We have also open-sourced OpenAudioBench, an audio understanding and generation benchmark, to evaluate end-to-end audio capabilities.
- **Leading Medical Image Understanding**: We collect OpenMM-Medical, a comprehensive medical understanding benchmark that integrates existing datasets. Our model achieves state-of-the-art performance on GMAI-MMBench and OpenMM-Medical. Specifically, on OpenMM-Medical, Baichuan-Omni-1.5 scores 83.8% with a 7B LLM, surpassing Qwen2-VL-72B's score of 80.7%.
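To make the audio-token budget above concrete, the short sketch below works out the throughput implied by a 12.5 Hz frame rate with 8 RVQ codebooks. It is a back-of-the-envelope illustration of the quoted figures, not code from this repository, and it assumes every frame emits one code per codebook.

```python
# Back-of-the-envelope throughput for the Baichuan-Audio-Tokenizer figures
# quoted above (assumes one code per RVQ codebook per frame).
FRAME_RATE_HZ = 12.5  # audio token frames per second
RVQ_CODEBOOKS = 8     # residual vector-quantization layers

def audio_codes(seconds: float) -> int:
    """Total discrete codes produced for `seconds` of speech."""
    return round(seconds * FRAME_RATE_HZ * RVQ_CODEBOOKS)

print(audio_codes(1.0))   # 100 codes per second of audio
print(audio_codes(60.0))  # 6,000 codes per minute of audio
```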
Click here to view the detailed results of pure text understanding ability.
**Comprehensive Tasks**

| Model | Size | MMLU (Acc.) | CMMLU (Acc.) | AGIEval (Acc.) | C-Eval (Acc.) | GAOKAO (Acc.) |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 88.0♢ | 78.3♢ | 62.3♢ | 86.0♢ | - |
| GPT-4o-mini | - | 82.0 | 67.6 | 52.2 | 63.6 | 70.8 |
| **Open-source Models (Pure text)** | | | | | | |
| MAP-Neo | 7B | 58.2 | 55.1 | 33.9 | 57.5 | - |
| Qwen1.5-Chat | 7B | 61.5 | 68.0 | 39.3 | 68.8 | - |
| Llama3-Instruct | 8B | 67.1 | 51.7 | 38.4 | 50.7 | - |
| OLMo | 7B | 28.4 | 25.6 | 19.9 | 27.3 | - |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 71.0* | 46.6 | 46.2* | 56.7* | - |
| VITA-1.5 | 7B | 71.0 | 75.1 | 47.9 | 65.6 | 57.4 |
| Baichuan-Omni | 7B | 65.3 | 72.2 | 47.7 | 68.9 | - |
| MiniCPM-o 2.6 | 7B | 65.3 | 63.3 | 50.9 | 61.5 | 56.3 |
| **Baichuan-Omni-1.5** | 7B | 72.2 | 75.5 | 54.4 | 73.1 | 73.5 |
Click here to view detailed evaluation results of image understanding ability.
**Multi-choice & Yes-or-No Question**

| Model | Size | MMBench-EN (Acc.) | MMBench-CN (Acc.) | SEED-IMG (Acc.) | MMMU-val (Acc.) | HallusionBench (Acc.) |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 83.4♢ | 82.1♢ | - | 69.1♢ | 55.0♢ |
| GPT-4o-mini | - | 77.7 | 76.9 | 72.3 | 60.0♢ | 46.1♢ |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2-VL-7B | 7B | 81.7 | 81.9 | 76.5 | 52.7 | 50.6* |
| MiniCPM-Llama3-V 2.5 | 8B | 76.7 | 73.3 | 72.4 | 45.8* | 42.5 |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 74.7 | 71.4 | 72.6 | 45.3 | 39.7* |
| VITA-1.5 | 7B | 80.8 | 80.2 | 74.2 | 53.1 | 44.1 |
| Baichuan-Omni | 7B | 76.2 | 74.9 | 74.1 | 47.3 | 47.8 |
| MiniCPM-o 2.6 | 7B | 83.6 | 81.8 | 75.4 | 51.1 | 50.1 |
| **Baichuan-Omni-1.5** | 7B | 85.6 | 83.6 | 75.7 | 53.9 | 49.7 |
**Visual Question Answering**

| Model | Size | RealWorldQA (Acc.) | MathVista-mini (Acc.) | TextVQA-val (Acc.) | ChartQA (Acc.) | OCRBench (Acc.) |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 75.4♢ | 63.8♢ | - | 85.7♢ | 73.6♢ |
| GPT-4o-mini | - | 66.3 | 53.4 | 66.8 | - | 77.4 |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2-VL-7B | 7B | 69.7 | 58.2* | 84.3* | 83.0* | 84.5* |
| MiniCPM-Llama3-V 2.5 | 8B | 63.5 | 54.3* | 76.6 | 72.0 | 72.5 |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 59.0 | 44.9* | 71.8 | 76.6 | 68.5* |
| VITA-1.5 | 7B | 66.8 | 66.5 | 74.9 | 79.6 | 73.3 |
| Baichuan-Omni | 7B | 62.6 | 51.9 | 74.3 | 79.6 | 70.0 |
| MiniCPM-o 2.6 | 7B | 67.7 | 64.6 | 80.1 | 87.6 | 89.7* |
| **Baichuan-Omni-1.5** | 7B | 68.8 | 63.6 | 83.2 | 84.9 | 84.0 |
Click here to view detailed evaluation results of video understanding ability.
**General VQA**

| Model | Size | # Frames | MVBench (Acc.) | Egoschema (Acc.) | VideoMME (Acc.) | Perception-Test (Acc.) |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| Gemini 1.5 Pro | - | - | 81.3♢ | 63.2* | 75.0♢ | - |
| GPT-4o-mini | - | - | 55.2 | 58.5 | 63.6 | 48.2 |
| GPT-4o | - | - | - | 77.2* | 71.9♢ | - |
| GPT-4V | - | - | 43.7♢ | 55.6* | 59.9♢ | - |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2-VL-7B | 7B | 2 fps (max 768) | 67.0* \| 64.4 | 66.7* \| 66.6 | 63.3* \| 59.0 | 62.3* \| 60.3 |
| AnyGPT | 8B | 48 | 33.2 | 32.1 | 29.8 | 29.1 |
| VideoLLaMA 2 | 7B | 16 | 54.6* | 51.7* | 46.6* | 51.4* |
| VideoChat2 | 7B | 16 | 51.1* | 42.1♢ | 33.7♢ | 47.3♢ |
| LLaVA-NeXT-Video | 7B | 32 | 46.5♢ | 43.9♢ | 33.7♢ | 48.8♢ |
| Video-LLaVA | 7B | 8 | 41.0♢ | 38.4♢ | 39.9♢ | 44.3♢ |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 1 fps (max 32) | 53.4 | 53.9 | 56.1 | 56.2 |
| VITA-1.5 | 7B | 1 fps (max 32) | 55.5 | 54.7 | 57.3 | 57.6 |
| Baichuan-Omni | 7B | 1 fps (max 32) | 60.9 | 58.8 | 58.2 | 56.8 |
| MiniCPM-o 2.6 | 7B | 1 fps (max 64) | 58.6 | 50.7 | 63.4 | 66.6 |
| **Baichuan-Omni-1.5** | 7B | 1 fps (max 32) | 63.7 | 62.4 | 60.1 | 68.9 |
**Open-ended VQA**

| Model | Size | # Frames | ActivityNet-QA (Acc.) | ActivityNet-QA (Score) | MSVD-QA (Acc.) | MSVD-QA (Score) |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| Gemini 1.5 Pro | - | - | 56.7* | - | - | - |
| GPT-4o-mini | - | 1 fps (max 32) | 62.1 | 3.1 | 67.5 | 3.3 |
| GPT-4o | - | - | 61.9* | - | - | - |
| GPT-4V | - | - | 59.5* | - | - | - |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2-VL-7B | 7B | 2 fps (max 768) | 17.4 | 1.9 | 61.1 | 3.5 |
| VideoLLaMA 2 | 7B | 16 | 50.2* | 3.3* | 70.9* | 3.8* |
| VideoChat2 | 7B | 16 | 49.1* | 3.3* | 70.0* | 3.9* |
| LLaVA-NeXT-Video | 7B | 32 | 53.5* | 3.2* | 67.4 | 3.4 |
| Video-LLaVA | 7B | 8 | 45.3* | 3.3* | 70.7* | 3.9* |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 1 fps (max 32) | 55.0 | 3.5 | 63.9 | 3.7 |
| VITA-1.5 | 7B | 1 fps (max 32) | 59.6 | 3.0 | 67.6 | 3.3 |
| Baichuan-Omni | 7B | 1 fps (max 48) | 58.6 | 3.7 | 72.2 | 4.0 |
| MiniCPM-o 2.6 | 7B | 1 fps (max 64) | 63.0 | 3.1 | 73.7 | 3.6 |
| **Baichuan-Omni-1.5** | 7B | 1 fps (max 48) | 62.0 | 3.1 | 74.2 | 3.6 |
Click here to view detailed evaluation results of audio understanding and generation ability.
**Audio Comprehensive Capacity**

| Model | Size | Reasoning QA (s→t) | Reasoning QA (s→s) | Llama Questions (s→t) | Llama Questions (s→s) | Web Questions (s→t) | Web Questions (s→s) | TriviaQA (s→t) | TriviaQA (s→s) | AlpacaEval (s→t) | AlpacaEval (s→s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | | |
| GPT-4o-Audio | - | 55.6 | - | 88.4 | - | 8.10 | - | 9.06 | - | 8.01 | - |
| **Open-source Models (Pure Audio)** | | | | | | | | | | | |
| GLM-4-Voice | 9B | - | 26.5 | - | 71.0 | - | 5.15 | - | 4.66 | - | 4.89 |
| **Open-source Models (Omni-modal)** | | | | | | | | | | | |
| VITA-1.5 | 7B | 41.0 | - | 74.2 | - | 5.73 | - | 4.68 | - | 6.82 | - |
| MiniCPM-o 2.6 | 7B | 38.6 | - | 77.8 | - | 6.86 | - | 6.19 | - | 5.18 | - |
| **Baichuan-Omni-1.5** | 7B | 50.0 | 40.9 | 78.5 | 75.3 | 5.91 | 5.52 | 5.72 | 5.31 | 7.79 | 6.94 |
Click here to view the detailed evaluation results of omni-modal understanding ability.
**Omni-Understanding**

| Model | Size | Image & Audio (Acc.) | Image Caption & Audio (Acc.) | Image & Audio Transcript (Acc.) | Image Caption & Audio Transcript (Acc.) |
|---|---|---|---|---|---|
| **Proprietary Models** | | | | | |
| GPT-4o-mini | - | - | - | 37.0 | 37.7 |
| **Open-source Models (Omni-modal)** | | | | | |
| VITA | 8x7B | 33.1 | 31.8 | 42.0 | 44.2 |
| VITA-1.5 | 7B | 33.4 | 29.6 | 48.5 | 47.2 |
| Baichuan-Omni | 7B | 32.2 | 26.5 | 42.6 | 44.2 |
| MiniCPM-o 2.6 | 7B | 40.5 | 30.8 | 53.2 | 46.3 |
| **Baichuan-Omni-1.5** | 7B | 42.9 | 37.7 | 47.9 | 46.9 |
Click here to view detailed evaluation results of medical image understanding ability.
**Medical Understanding**

| Model | Size | GMAI-MMB-VAL (Acc.) | OpenMM-Medical (Acc.) |
|---|---|---|---|
| **Proprietary Models** | | | |
| GPT-4o-mini | - | 46.4 | 74.3 |
| **Open-source Models (Vision-Language)** | | | |
| Qwen2-VL-7B | 7B | 46.3 | 76.9 |
| Qwen2-VL-72B | 72B | 50.7 | 80.7 |
| **Open-source Models (Omni-modal)** | | | |
| VITA-1.5 | 7B | 36.7 | 67.1 |
| MiniCPM-o 2.6 | 7B | 41.5 | 73.6 |
| **Baichuan-Omni-1.5** | 7B | 49.9 | 83.8 |
```bash
# Create and activate the environment
conda create -n baichuan_omni python==3.12
conda activate baichuan_omni

# Install the pinned PyTorch stack (CUDA 12.4 wheels)
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124

# Install project and runtime dependencies
pip install -r baichuan_omni_requirements.txt
pip install accelerate flash_attn==2.6.3 speechbrain==1.0.0 deepspeed==0.14.4

# System packages needed for audio/video processing
apt install llvm ffmpeg
```
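A quick sanity check (a minimal sketch, not a script from this repository) confirms the pinned PyTorch stack imports correctly and that a CUDA device is visible before launching the demos:

```python
# Minimal environment sanity check (not part of the repository).
import torch
import torchaudio
import torchvision

print("torch:", torch.__version__)              # expect 2.4.0 (+cu124)
print("torchvision:", torchvision.__version__)  # expect 0.19.0
print("torchaudio:", torchaudio.__version__)    # expect 2.4.0
print("CUDA available:", torch.cuda.is_available())
```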
Modify `MODEL_PATH` in `web_demo/constants.py` to point to your local model path, then launch the demo you need:

```bash
# Image demo (multi-turn speech-to-speech interaction with vision input)
cd web_demo
python vision_s2s_gradio_demo_cosy_multiturn.py
```

```bash
# Audio demo (multi-turn speech-to-speech interaction)
cd web_demo
python s2s_gradio_demo_cosy_multiturn.py
```

```bash
# Video demo (single-turn speech-to-speech interaction with video input)
cd web_demo
python video_s2s_gradio_demo_cosy_singleturn.py
```
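Beyond the Gradio demos, the sketch below shows one way to load the checkpoint programmatically for a plain-text prompt. It is an assumption-laden illustration, not the official API: it assumes the checkpoint loads via `transformers` with `trust_remote_code=True` and that the text-only path follows the standard `generate()` interface; for image, video, and audio inputs, follow the `web_demo` scripts above.

```python
# Text-only loading sketch (illustrative assumptions, not the official API):
# load the same local checkpoint referenced by MODEL_PATH in web_demo/constants.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/path/to/Baichuan-Omni-1.5"  # hypothetical local path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Assumes the custom modeling code keeps the standard text generate() path;
# multimodal inputs go through the demo scripts instead.
inputs = tokenizer("Briefly introduce Baichuan-Omni-1.5.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```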
Fine-tuning: coming soon.
OpenMM-Medical
To comprehensively evaluate the model's multi-modal medical capabilities, we collected OpenMM-Medical, which integrates publicly available medical image datasets such as ACRIMA (retinal images), BioMediTech (microscope images), and CoronaHack (X-rays), totaling 88,996 images.
OpenAudioBench
To efficiently assess the model's "IQ", we developed OpenAudioBench, comprising five end-to-end audio understanding sub-datasets: four public benchmarks (Llama Questions, Web Questions, TriviaQA, AlpacaEval) and a speech logical-reasoning dataset created internally by the Baichuan team, totaling 2,701 entries. Together, these reflect the model's comprehensive "IQ" level.
- Visual Encoder Architecture: NaViT
- Automatic Speech Recognition (ASR) Model: Whisper
- Large Language Model (LLM): Qwen2.5 7B
- Visual Encoder Weight Initialization: Based on Qwen2-VL-7B (Link)
- Some Code Contributions: From CosyVoice and Matcha-TTS (CosyVoice GitHub, Matcha-TTS GitHub)
- HiFi-GAN Vocoder Used in CosyVoice 2.0: (CosyVoice 2.0)
We strongly urge all users not to employ the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models for any activities that may endanger national or social security or engage in illegal activities. Additionally, we request that these models not be used in internet services without proper safety reviews and registrations. We hope all users adhere to these guidelines to ensure technological development proceeds within a regulated and legal framework.
We have made every effort to ensure the compliance of the data used during the training process. However, despite our extensive efforts, due to the complexity of models and data, unforeseen issues may still arise. Therefore, we will not be held responsible for any problems arising from the use of the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base open-source models, including but not limited to data security issues, public opinion risks, or risks associated with misleading, misuse, dissemination, or improper utilization of the models.
Community use of the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models must comply with the Apache 2.0 license and the "Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base Community License Agreement." These models support commercial use. If you plan to use the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models or their derivatives for commercial purposes, please confirm that your entity meets the following criteria:
- Your or your affiliated party's daily active user count (DAU) is below 1 million.
- You or your affiliated party are not software service providers or cloud service providers.
- There is no possibility of re-granting the commercial license to third parties without prior approval from Baichuan Inc.
Under these conditions, you need to submit the required application materials for the "Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base Community License Agreement" via email at [email protected]. Upon approval, Baichuan Inc. will grant you a non-exclusive, global, non-transferable, non-sublicensable, and revocable commercial license.
If you find our model/code/paper helpful, please consider citing our papers 📝 and giving us a star ⭐️!
```bibtex
@article{li2025baichuan,
  title={Baichuan-Omni-1.5 Technical Report},
  author={Li, Yadong and Liu, Jun and Zhang, Tao and Chen, Song and Li, Tianpeng and Li, Zehuan and Liu, Lijun and Ming, Lingfeng and Dong, Guosheng and Pan, Da and others},
  journal={arXiv preprint arXiv:2501.15368},
  year={2025}
}
```