Open-source Omni-modal Foundation Model Supporting Text, Image, Video, and Audio Inputs as Well as Text and Audio Outputs

English | 中文

Baichuan-Omni-1.5 🤗 | Baichuan-Omni-1.5-Base 🤗 | Technical Report 📖

OpenMM-Medical 🤗 | OpenAudioBench 🤗

Baichuan-Omni-1.5 is the latest end-to-end trained omni-modal large model that supports comprehensive input modalities (text, image, video, audio) and dual output modalities (text and audio). Built upon the Qwen2.5-7B language model, it can process inputs from various modalities and generate high-quality text and speech outputs in a controllable manner.

  • Baichuan-Omni-1.5-Base: To promote the development of omni-modal models, we have open-sourced a foundation model trained on extensive, high-quality data. This model has not undergone supervised fine-tuning (SFT) on instruction data, which keeps it highly flexible, and it is the best-performing omni-modal foundation model currently available.

  • Baichuan-Omni-1.5: Leveraging the robust Baichuan-Omni-1.5-Base, this model undergoes end-to-end training with high-quality omni-modal aligned data. Baichuan-Omni-1.5 achieves text, image, video, and audio understanding capabilities comparable to GPT-4o-mini.
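For a quick orientation, below is a minimal sketch of one plausible way to load the instruction-tuned checkpoint with Hugging Face transformers. The repo id and the text-only round trip are assumptions on our part; the omni-modal input pipeline lives in the model's remote code, so consult the 🤗 model card linked above for the authoritative usage.

```python
# Minimal loading sketch (assumptions: the checkpoint loads via transformers
# with trust_remote_code, and the repo id matches the 🤗 link above; the
# omni-modal chat/generation methods are defined by the model's remote code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "baichuan-inc/Baichuan-Omni-1.5"  # assumption: verify on the 🤗 model card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision fits a 7B model on one GPU
    device_map="auto",
    trust_remote_code=True,
)

# Plain-text round trip; image/video/audio inputs go through the model's own
# preprocessing (remote code), which this sketch does not cover.
inputs = tokenizer("What modalities do you support?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```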


Baichuan-Omni-1.5

Baichuan-Omni-1.5 is the latest and most advanced model in the Baichuan-Omni series; both training and inference are end-to-end. Compared to open-source counterparts, Baichuan-Omni-1.5 demonstrates significant improvements in understanding text, image, audio, and video inputs. Notably, the model showcases impressive capabilities in controllable real-time voice interaction and collaborative real-time understanding across modalities. Beyond its general capabilities, Baichuan-Omni-1.5 also stands out as the strongest MLLM in the medical domain, opening exciting new possibilities for AGI to contribute to the well-being of human society. Based on the evaluation results, we summarize the key advantages and contributions of Baichuan-Omni-1.5:

  • Omni-modal Interaction: Baichuan-Omni-1.5 is designed to process text, image, audio, and video inputs, delivering high-quality text and speech outputs. It is capable of achieving seamless, high-quality cross-modal interactions without compromising the capabilities of any modality.

  • Excellent Vision-Language Capability: Baichuan-Omni-1.5 scores an average of 73.3 across ten image-understanding benchmarks, which surpasses GPT-4o-mini by an average of 6 points.

  • Unified and Outstanding Speech Capabilities: We designed an 8-layer RVQ audio tokenizer (Baichuan-Audio-Tokenizer) that strikes an optimal balance between capturing semantic and acoustic information at a 12.5 Hz frame rate, which supports high-quality, controllable, bilingual (Chinese and English) real-time conversation. We have also open-sourced an audio understanding and generation benchmark, OpenAudioBench, to evaluate end-to-end audio capabilities; a minimal sketch of the RVQ idea follows this list.

  • Leading Medical Image Understanding: We collect a comprehensive medical understanding benchmark, OpenMM-Medical, which integrates existing datasets. Our model achieves state-of-the-art performance on GMAI-MMBench and OpenMM-Medical. Specifically, on OpenMM-Medical, Baichuan-Omni-1.5 scores 83.8% using a 7B LLM, surpassing Qwen2-VL-72B's score of 80.7%.
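To make the tokenizer bullet concrete, here is a self-contained sketch of 8-layer residual vector quantization (RVQ): each layer quantizes the residual left by the previous one, so every frame is represented by 8 codebook indices. This is a generic illustration of the technique, not the Baichuan-Audio-Tokenizer implementation; the codebook size and feature dimension are made up.

```python
import torch

# Generic RVQ sketch: NOT the actual Baichuan-Audio-Tokenizer, just the
# technique the bullet above describes. Assumptions: 8 quantizer layers,
# codebook size 1024, feature dimension 128.
NUM_LAYERS, CODEBOOK_SIZE, DIM = 8, 1024, 128
codebooks = [torch.randn(CODEBOOK_SIZE, DIM) for _ in range(NUM_LAYERS)]

def rvq_encode(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, DIM) audio features at 12.5 Hz -> (T, 8) code indices."""
    residual = frames
    codes = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # distance to every codebook entry
        idx = dists.argmin(dim=-1)          # nearest entry per frame
        codes.append(idx)
        residual = residual - cb[idx]       # next layer quantizes what's left
    return torch.stack(codes, dim=-1)

def rvq_decode(codes: torch.Tensor) -> torch.Tensor:
    """Sum the selected entries across layers: coarse-to-fine reconstruction."""
    return sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))

# At a 12.5 Hz frame rate, one second of audio is ~12-13 frames.
feats = torch.randn(13, DIM)
codes = rvq_encode(feats)   # (13, 8): 8 indices per frame
recon = rvq_decode(codes)   # approximate reconstruction of the features
```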

Model Architecture


Multi-stage Omni-modal Training Framework


Performance Evaluation


Click here to view the detailed results of pure text understanding ability.

Pure text understanding ability

Comprehensive Tasks

| Model | Size | MMLU (Acc.) | CMMLU (Acc.) | AGIEval (Acc.) | C-Eval (Acc.) | GAOKAO (Acc.) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| GPT 4o | - | 88.0♢ | 78.3♢ | 62.3♢ | 86.0♢ | - |
| GPT 4o mini | - | 82.0 | 67.6 | 52.2 | 63.6 | 70.8 |
| **Open-source Models (Pure text)** | | | | | | |
| MAP-Neo | 7B | 58.2 | 55.1 | 33.9 | 57.5 | - |
| Qwen1.5-Chat | 7B | 61.5 | 68.0 | 39.3 | 68.8 | - |
| Llama3-Instruct | 8B | 67.1 | 51.7 | 38.4 | 50.7 | - |
| OLMo | 7B | 28.4 | 25.6 | 19.9 | 27.3 | - |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 71.0* | 46.6 | 46.2* | 56.7* | - |
| VITA-1.5 | 7B | 71.0 | 75.1 | 47.9 | 65.6 | 57.4 |
| Baichuan-Omni | 7B | 65.3 | 72.2 | 47.7 | 68.9 | - |
| MiniCPM-o 2.6 | 7B | 65.3 | 63.3 | 50.9 | 61.5 | 56.3 |
| Baichuan-Omni-1.5 | 7B | 72.2 | 75.5 | 54.4 | 73.1 | 73.5 |
Click here to view detailed evaluation results of image understanding ability.

Image understanding ability

Multi-choice & Yes-or-No Question

| Model | Size | MMBench-EN (Acc.) | MMBench-CN (Acc.) | SEED-IMG (Acc.) | MMMU-val (Acc.) | HallusionBench (Acc.) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 83.4♢ | 82.1♢ | - | 69.1♢ | 55.0♢ |
| GPT-4o-mini | - | 77.7 | 76.9 | 72.3 | 60.0♢ | 46.1♢ |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2-VL-7B | 7B | 81.7 | 81.9 | 76.5 | 52.7 | 50.6∗ |
| MiniCPM-Llama3-V 2.5 | 8B | 76.7 | 73.3 | 72.4 | 45.8∗ | 42.5 |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 74.7 | 71.4 | 72.6 | 45.3 | 39.7∗ |
| VITA-1.5 | 7B | 80.8 | 80.2 | 74.2 | 53.1 | 44.1 |
| Baichuan-Omni | 7B | 76.2 | 74.9 | 74.1 | 47.3 | 47.8 |
| MiniCPM-o 2.6 | 7B | 83.6 | 81.8 | 75.4 | 51.1 | 50.1 |
| Baichuan-Omni-1.5 | 7B | 85.6 | 83.6 | 75.7 | 53.9 | 49.7 |

Visual Question Answering

| Model | Size | RealWorldQA (Acc.) | MathVista-mini (Acc.) | TextVQA-val (Acc.) | ChartQA (Acc.) | OCRBench (Acc.) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 75.4♢ | 63.8♢ | - | 85.7♢ | 73.6♢ |
| GPT-4o-mini | - | 66.3 | 53.4 | 66.8 | - | 77.4 |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2-VL-7B | 7B | 69.7 | 58.2∗ | 84.3∗ | 83.0∗ | 84.5∗ |
| MiniCPM-Llama3-V 2.5 | 8B | 63.5 | 54.3∗ | 76.6 | 72.0 | 72.5 |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 59.0 | 44.9∗ | 71.8 | 76.6 | 68.5∗ |
| VITA-1.5 | 7B | 66.8 | 66.5 | 74.9 | 79.6 | 73.3 |
| Baichuan-Omni | 7B | 62.6 | 51.9 | 74.3 | 79.6 | 70.0 |
| MiniCPM-o 2.6 | 7B | 67.7 | 64.6 | 80.1 | 87.6 | 89.7∗ |
| Baichuan-Omni-1.5 | 7B | 68.8 | 63.6 | 83.2 | 84.9 | 84.0 |
Click here to view detailed evaluation results of video understanding ability.

Video understanding ability

General VQA

| Model | Size | # Frames | MVBench (Acc.) | Egoschema (Acc.) | VideoMME (Acc.) | Perception-Test (Acc.) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| Gemini 1.5 Pro | - | - | 81.3♢ | 63.2* | 75.0♢ | - |
| GPT 4o mini | - | - | 55.2 | 58.5 | 63.6 | 48.2 |
| GPT 4o | - | - | - | 77.2* | 71.9♢ | - |
| GPT 4V | - | - | 43.7♢ | 55.6* | 59.9♢ | - |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2-VL-7B | 7B | 2 fps (max 768) | 67.0* / 64.4 | 66.7* / 66.6 | 63.3* / 59.0 | 62.3* / 60.3 |
| AnyGPT | 8B | 48 | 33.2 | 32.1 | 29.8 | 29.1 |
| VideoLLaMA 2 | 7B | 16 | 54.6* | 51.7* | 46.6* | 51.4* |
| VideoChat2 | 7B | 16 | 51.1* | 42.1♢ | 33.7♢ | 47.3♢ |
| LLaVA-NeXT-Video | 7B | 32 | 46.5♢ | 43.9♢ | 33.7♢ | 48.8♢ |
| Video-LLaVA | 7B | 8 | 41.0♢ | 38.4♢ | 39.9♢ | 44.3♢ |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 1 fps (max 32) | 53.4 | 53.9 | 56.1 | 56.2 |
| VITA-1.5 | 7B | 1 fps (max 32) | 55.5 | 54.7 | 57.3 | 57.6 |
| Baichuan-Omni | 7B | 1 fps (max 32) | 60.9 | 58.8 | 58.2 | 56.8 |
| MiniCPM-o 2.6 | 7B | 1 fps (max 64) | 58.6 | 50.7 | 63.4 | 66.6 |
| Baichuan-Omni-1.5 | 7B | 1 fps (max 32) | 63.7 | 62.4 | 60.1 | 68.9 |

Open-ended VQA

| Model | Size | # Frames | ActivityNet-QA (Acc.) | ActivityNet-QA (Score) | MSVD-QA (Acc.) | MSVD-QA (Score) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| Gemini 1.5 Pro | - | - | 56.7* | - | - | - |
| GPT 4o mini | - | 1 fps (max 32) | 62.1 | 3.1 | 67.5 | 3.3 |
| GPT 4o | - | - | 61.9* | - | - | - |
| GPT 4V | - | - | 59.5* | - | - | - |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2 VL | 7B | 2 fps (max 768) | 17.4 | 1.9 | 61.1 | 3.5 |
| VideoLLaMA 2 | 7B | 16 | 50.2* | 3.3* | 70.9* | 3.8* |
| VideoChat2 | 7B | 16 | 49.1* | 3.3* | 70.0* | 3.9* |
| LLaVA-NeXT-Video | 7B | 32 | 53.5* | 3.2* | 67.4 | 3.4 |
| Video-LLaVA | 7B | 8 | 45.3* | 3.3* | 70.7* | 3.9* |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 1 fps (max 32) | 55.0 | 3.5 | 63.9 | 3.7 |
| VITA-1.5 | 7B | 1 fps (max 32) | 59.6 | 3.0 | 67.6 | 3.3 |
| Baichuan-Omni | 7B | 1 fps (max 48) | 58.6 | 3.7 | 72.2 | 4.0 |
| MiniCPM-o 2.6 | 7B | 1 fps (max 64) | 63.0 | 3.1 | 73.7 | 3.6 |
| Baichuan-Omni-1.5 | 7B | 1 fps (max 48) | 62.0 | 3.1 | 74.2 | 3.6 |
Click here to view detailed evaluation results of audio understanding and generation ability.

Audio understanding and generation ability

Audio Comprehensive Capacity (s→t: speech input, text output; s→s: speech input, speech output)

| Model | Size | Reasoning QA (s→t) | Reasoning QA (s→s) | Llama Questions (s→t) | Llama Questions (s→s) | Web Questions (s→t) | Web Questions (s→s) | TriviaQA (s→t) | TriviaQA (s→s) | AlpacaEval (s→t) | AlpacaEval (s→s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | | | | | | |
| GPT-4o-Audio | - | 55.6 | - | 88.4 | - | 8.10 | - | 9.06 | - | 8.01 | - |
| **Open-source Models (Pure Audio)** | | | | | | | | | | | |
| GLM-4-Voice | 9B | - | 26.5 | - | 71.0 | - | 5.15 | - | 4.66 | - | 4.89 |
| **Open-source Models (Omni-modal)** | | | | | | | | | | | |
| VITA-1.5 | 7B | 41.0 | - | 74.2 | - | 5.73 | - | 4.68 | - | 6.82 | - |
| MiniCPM-o 2.6 | 7B | 38.6 | - | 77.8 | - | 6.86 | - | 6.19 | - | 5.18 | - |
| Baichuan-Omni-1.5 | 7B | 50.0 | 40.9 | 78.5 | 75.3 | 5.91 | 5.52 | 5.72 | 5.31 | 7.79 | 6.94 |
Click here to view the detailed evaluation results of omni-modal understanding ability.

Omni-modal understanding ability

Omni-Understanding

| Model | Size | Image & Audio (Acc.) | Image Caption & Audio (Acc.) | Image & Audio Transcript (Acc.) | Image Caption & Audio Transcript (Acc.) |
| --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | |
| GPT-4o-mini | - | - | - | 37.0 | 37.7 |
| **Open-source Models (Omni-modal)** | | | | | |
| VITA | 8x7B | 33.1 | 31.8 | 42.0 | 44.2 |
| VITA-1.5 | 7B | 33.4 | 29.6 | 48.5 | 47.2 |
| Baichuan-Omni | 7B | 32.2 | 26.5 | 42.6 | 44.2 |
| MiniCPM-o 2.6 | 7B | 40.5 | 30.8 | 53.2 | 46.3 |
| Baichuan-Omni-1.5 | 7B | 42.9 | 37.7 | 47.9 | 46.9 |
Click here to view detailed evaluation results of medical image understanding ability.

Medical image understanding ability

Medical Understanding

| Model | Size | GMAI-MMB-VAL (Acc.) | OpenMM-Medical (Acc.) |
| --- | --- | --- | --- |
| **Proprietary Models** | | | |
| GPT-4o-mini | - | 46.4 | 74.3 |
| **Open-source Models (Vision-Language)** | | | |
| Qwen2 VL | 7B | 46.3 | 76.9 |
| Qwen2 VL | 72B | 50.7 | 80.7 |
| **Open-source Models (Omni-modal)** | | | |
| VITA-1.5 | 7B | 36.7 | 67.1 |
| MiniCPM-o 2.6 | 7B | 41.5 | 73.6 |
| Baichuan-Omni-1.5 | 7B | 49.9 | 83.8 |

Typical Examples


(Example images: pipeline, math, fly_bill)

Local WebUI Demo

Preparation

Creating a Virtual Environment

```bash
conda create -n baichuan_omni python==3.12
conda activate baichuan_omni
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r baichuan_omni_requirements.txt
pip install accelerate flash_attn==2.6.3 speechbrain==1.0.0 deepspeed==0.14.4
apt install llvm ffmpeg
```
Download the model and modify the model path

Modify MODEL_PATH in web_demo/constants.py to the local model path
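For example, after downloading the weights the entry might look like this (the path is a placeholder for your local checkpoint directory):

```python
# web_demo/constants.py
MODEL_PATH = "/path/to/Baichuan-Omni-1.5"  # placeholder: set to your local model directory
```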

Image Demo

```bash
cd web_demo
python vision_s2s_gradio_demo_cosy_multiturn.py
```

Audio Demo

```bash
cd web_demo
python s2s_gradio_demo_cosy_multiturn.py
```

Video Demo

```bash
cd web_demo
python video_s2s_gradio_demo_cosy_singleturn.py
```

Fine-tuning

Coming soon

Open-source Evaluation Datasets

OpenMM-Medical

To comprehensively evaluate the model's multi-modal medical capabilities, we have collected OpenMM-Medical, which integrates publicly available medical image datasets such as ACRIMA (retinal images), BioMediTech (microscope images), and CoronaHack (X-rays), totaling 88,996 images.

OpenAudioBench

To efficiently assess the model's "IQ", we developed OpenAudioBench, comprising five end-to-end audio understanding sub-datasets: four public benchmarks (Llama Question, WEB QA, TriviaQA, AlpacaEval) and a speech logical-reasoning dataset created in-house by the Baichuan team, totaling 2,701 entries. Together, this suite reflects the model's comprehensive "IQ" level.
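If you want to pull either benchmark programmatically, a minimal sketch with the 🤗 `datasets` library is below; the repo ids and split names are assumptions, so verify them against the dataset cards linked at the top of this README.

```python
# Sketch only: the repo ids and split names below are assumptions; check the
# OpenMM-Medical / OpenAudioBench dataset cards linked at the top of this README.
from datasets import load_dataset

medical = load_dataset("baichuan-inc/OpenMM-Medical", split="test")      # assumed id/split
audio_bench = load_dataset("baichuan-inc/OpenAudioBench", split="test")  # assumed id/split

# Inspect the schema before writing an evaluation loop.
print(medical[0].keys())
print(audio_bench[0].keys())
```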

Acknowledgments

Disclaimer

We strongly urge all users not to employ the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models for any activities that may endanger national or social security or engage in illegal activities. Additionally, we request that these models not be used in internet services without proper safety reviews and registrations. We hope all users adhere to these guidelines to ensure technological development proceeds within a regulated and legal framework.

We have made every effort to ensure the compliance of the data used during the training process. However, despite our extensive efforts, due to the complexity of models and data, unforeseen issues may still arise. Therefore, we will not be held responsible for any problems arising from the use of the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base open-source models, including but not limited to data security issues, public opinion risks, or risks associated with misleading, misuse, dissemination, or improper utilization of the models.

License

Community use of the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models must comply with the Apache 2.0 license and the "Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base Community License Agreement." These models support commercial use. If you plan to use the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models or their derivatives for commercial purposes, please confirm that your entity meets the following criteria:

  • Your or your affiliated party's daily active user count (DAU) is below 1 million.
  • You or your affiliated party are not software service providers or cloud service providers.
  • There is no possibility of re-granting the commercial license to third parties without prior approval from Baichuan Inc.

Under these conditions, you need to submit the required application materials for the "Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base Community License Agreement" via email at [email protected]. Upon approval, Baichuan Inc. will grant you a non-exclusive, global, non-transferable, non-sublicensable, and revocable commercial license.

Citation

If you find our model/code/paper helpful, please consider citing our paper 📝 and starring the repo ⭐️!

```bibtex
@article{li2025baichuan,
  title={Baichuan-Omni-1.5 Technical Report},
  author={Li, Yadong and Liu, Jun and Zhang, Tao and Chen, Song and Li, Tianpeng and Li, Zehuan and Liu, Lijun and Ming, Lingfeng and Dong, Guosheng and Pan, Da and others},
  journal={arXiv preprint arXiv:2501.15368},
  year={2025}
}
```