Open-source Omni-modal Foundation Model Supporting Text, Image, Video, and Audio Inputs as Well as Text and Audio Outputs

English | 中文

Baichuan-Omni-1.5 🤗 | Baichuan-Omni-1.5-Base 🤗 | Technical Report 📖

OpenMM-Medical 🤗 | OpenAudioBench 🤗

Baichuan-Omni-1.5 is the latest end-to-end trained omni-modal large model that supports comprehensive input modalities (text, image, video, audio) and dual output modalities (text and audio). Built upon the Qwen2.5-7B language model, it can process inputs from various modalities and generate high-quality text and speech outputs in a controllable manner.

  • Baichuan-Omni-1.5-Base: To promote the development of omni-modal models, we have open-sourced a foundation model trained on extensive, high-quality data. This model has not undergone supervised fine-tuning (SFT) on instruction data, which keeps it highly flexible, and it is the best-performing omni-modal foundation model currently available.

  • Baichuan-Omni-1.5: Leveraging the robust Baichuan-Omni-1.5-Base, this model undergoes end-to-end training with high-quality omni-modal aligned data. Baichuan-Omni-1.5 achieves text, image, video, and audio understanding capabilities comparable to GPT-4o-mini.
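For a quick orientation, below is a minimal sketch of one plausible way to load the instruction-tuned checkpoint with Hugging Face transformers. The repo id and the text-only round trip are assumptions on our part; the omni-modal input pipeline lives in the model's remote code, so consult the 🤗 model card linked above for the authoritative usage.

```python
# Minimal loading sketch (assumptions: the checkpoint loads via transformers
# with trust_remote_code, and the repo id matches the 🤗 link above; the
# omni-modal chat/generation methods are defined by the model's remote code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "baichuan-inc/Baichuan-Omni-1.5"  # assumption: verify on the 🤗 model card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision fits a 7B model on one GPU
    device_map="auto",
    trust_remote_code=True,
)

# Plain-text round trip; image/video/audio inputs go through the model's own
# preprocessing (remote code), which this sketch does not cover.
inputs = tokenizer("What modalities do you support?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```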


Baichuan-Omni-1.5

Baichuan-Omni-1.5 is the latest and most advanced model in the Baichuan-Omni series; both training and inference are end-to-end. Compared to open-source counterparts, Baichuan-Omni-1.5 demonstrates significant improvements in understanding text, image, audio, and video inputs. Notably, the model showcases impressive capabilities in controllable real-time voice interaction and collaborative real-time understanding across modalities. Beyond its general capabilities, Baichuan-Omni-1.5 also stands out as the strongest MLLM in the medical domain, opening exciting new possibilities for AGI to contribute to the well-being of human society. Based on the evaluation results, we summarize the key advantages and contributions of Baichuan-Omni-1.5:

  • Omni-modal Interaction: Baichuan-Omni-1.5 is designed to process text, image, audio, and video inputs, delivering high-quality text and speech outputs. It is capable of achieving seamless, high-quality cross-modal interactions without compromising the capabilities of any modality.

  • Excellent Vision-Language Capability: Baichuan-Omni-1.5 scores an average of 73.3 across ten image-understanding benchmarks, which surpasses GPT-4o-mini by an average of 6 points.

  • Unified and Outstanding Speech Capabilities: We designed an 8-layer RVQ audio tokenizer (Baichuan-Audio-Tokenizer) that strikes an optimal balance between capturing semantic and acoustic information at a 12.5 Hz frame rate, which supports high-quality, controllable, bilingual (Chinese and English) real-time conversation. We have also open-sourced an audio understanding and generation benchmark, OpenAudioBench, to evaluate end-to-end audio capabilities; a minimal sketch of the RVQ idea follows this list.

  • Leading Medical Image Understanding: We collect a comprehensive medical understanding benchmark, OpenMM-Medical, which integrates existing datasets. Our model achieves state-of-the-art performance on GMAI-MMBench and OpenMM-Medical. Specifically, on OpenMM-Medical, Baichuan-Omni-1.5 scores 83.8% using a 7B LLM, surpassing Qwen2-VL-72B's score of 80.7%.
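To make the tokenizer bullet concrete, here is a self-contained sketch of 8-layer residual vector quantization (RVQ): each layer quantizes the residual left by the previous one, so every frame is represented by 8 codebook indices. This is a generic illustration of the technique, not the Baichuan-Audio-Tokenizer implementation; the codebook size and feature dimension are made up.

```python
import torch

# Generic RVQ sketch: NOT the actual Baichuan-Audio-Tokenizer, just the
# technique the bullet above describes. Assumptions: 8 quantizer layers,
# codebook size 1024, feature dimension 128.
NUM_LAYERS, CODEBOOK_SIZE, DIM = 8, 1024, 128
codebooks = [torch.randn(CODEBOOK_SIZE, DIM) for _ in range(NUM_LAYERS)]

def rvq_encode(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, DIM) audio features at 12.5 Hz -> (T, 8) code indices."""
    residual = frames
    codes = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # distance to every codebook entry
        idx = dists.argmin(dim=-1)          # nearest entry per frame
        codes.append(idx)
        residual = residual - cb[idx]       # next layer quantizes what's left
    return torch.stack(codes, dim=-1)

def rvq_decode(codes: torch.Tensor) -> torch.Tensor:
    """Sum the selected entries across layers: coarse-to-fine reconstruction."""
    return sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))

# At a 12.5 Hz frame rate, one second of audio is ~12-13 frames.
feats = torch.randn(13, DIM)
codes = rvq_encode(feats)   # (13, 8): 8 indices per frame
recon = rvq_decode(codes)   # approximate reconstruction of the features
```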

Model Architecture


Multi-stage Omni-modal Training Framework


Performance Evaluation


Click here to view the detailed results of pure text understanding ability.

Pure text understanding ability

Comprehensive Tasks

| Model | Size | MMLU (Acc.) | CMMLU (Acc.) | AGIEval (Acc.) | C-Eval (Acc.) | GAOKAO (Acc.) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| GPT 4o | - | 88.0♢ | 78.3♢ | 62.3♢ | 86.0♢ | - |
| GPT 4o mini | - | 82.0 | 67.6 | 52.2 | 63.6 | 70.8 |
| **Open-source Models (Pure text)** | | | | | | |
| MAP-Neo | 7B | 58.2 | 55.1 | 33.9 | 57.5 | - |
| Qwen1.5-Chat | 7B | 61.5 | 68.0 | 39.3 | 68.8 | - |
| Llama3-Instruct | 8B | 67.1 | 51.7 | 38.4 | 50.7 | - |
| OLMo | 7B | 28.4 | 25.6 | 19.9 | 27.3 | - |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 71.0* | 46.6 | 46.2* | 56.7* | - |
| VITA-1.5 | 7B | 71.0 | 75.1 | 47.9 | 65.6 | 57.4 |
| Baichuan-Omni | 7B | 65.3 | 72.2 | 47.7 | 68.9 | - |
| MiniCPM-o 2.6 | 7B | 65.3 | 63.3 | 50.9 | 61.5 | 56.3 |
| Baichuan-Omni-1.5 | 7B | 72.2 | 75.5 | 54.4 | 73.1 | 73.5 |
Click here to view detailed evaluation results of image understanding ability.

Image understanding ability

Multi-choice & Yes-or-No Question

| Model | Size | MMBench-EN (Acc.) | MMBench-CN (Acc.) | SEED-IMG (Acc.) | MMMU-val (Acc.) | HallusionBench (Acc.) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 83.4♢ | 82.1♢ | - | 69.1♢ | 55.0♢ |
| GPT-4o-mini | - | 77.7 | 76.9 | 72.3 | 60.0♢ | 46.1♢ |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2-VL-7B | 7B | 81.7 | 81.9 | 76.5 | 52.7 | 50.6∗ |
| MiniCPM-Llama3-V 2.5 | 8B | 76.7 | 73.3 | 72.4 | 45.8∗ | 42.5 |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 74.7 | 71.4 | 72.6 | 45.3 | 39.7∗ |
| VITA-1.5 | 7B | 80.8 | 80.2 | 74.2 | 53.1 | 44.1 |
| Baichuan-Omni | 7B | 76.2 | 74.9 | 74.1 | 47.3 | 47.8 |
| MiniCPM-o 2.6 | 7B | 83.6 | 81.8 | 75.4 | 51.1 | 50.1 |
| Baichuan-Omni-1.5 | 7B | 85.6 | 83.6 | 75.7 | 53.9 | 49.7 |

Visual Question Answering

| Model | Size | RealWorldQA (Acc.) | MathVista-mini (Acc.) | TextVQA-val (Acc.) | ChartQA (Acc.) | OCRBench (Acc.) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 75.4♢ | 63.8♢ | - | 85.7♢ | 73.6♢ |
| GPT-4o-mini | - | 66.3 | 53.4 | 66.8 | - | 77.4 |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2-VL-7B | 7B | 69.7 | 58.2∗ | 84.3∗ | 83.0∗ | 84.5∗ |
| MiniCPM-Llama3-V 2.5 | 8B | 63.5 | 54.3∗ | 76.6 | 72.0 | 72.5 |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 59.0 | 44.9∗ | 71.8 | 76.6 | 68.5∗ |
| VITA-1.5 | 7B | 66.8 | 66.5 | 74.9 | 79.6 | 73.3 |
| Baichuan-Omni | 7B | 62.6 | 51.9 | 74.3 | 79.6 | 70.0 |
| MiniCPM-o 2.6 | 7B | 67.7 | 64.6 | 80.1 | 87.6 | 89.7∗ |
| Baichuan-Omni-1.5 | 7B | 68.8 | 63.6 | 83.2 | 84.9 | 84.0 |
Click here to view detailed evaluation results of video understanding ability.

Video understanding ability

General VQA

| Model | Size | # Frames | MVBench (Acc.) | Egoschema (Acc.) | VideoMME (Acc.) | Perception-Test (Acc.) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| Gemini 1.5 Pro | - | - | 81.3♢ | 63.2* | 75.0♢ | - |
| GPT 4o mini | - | - | 55.2 | 58.5 | 63.6 | 48.2 |
| GPT 4o | - | - | - | 77.2* | 71.9♢ | - |
| GPT 4V | - | - | 43.7♢ | 55.6* | 59.9♢ | - |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2-VL-7B | 7B | 2 fps (max 768) | 67.0* / 64.4 | 66.7* / 66.6 | 63.3* / 59.0 | 62.3* / 60.3 |
| AnyGPT | 8B | 48 | 33.2 | 32.1 | 29.8 | 29.1 |
| VideoLLaMA 2 | 7B | 16 | 54.6* | 51.7* | 46.6* | 51.4* |
| VideoChat2 | 7B | 16 | 51.1* | 42.1♢ | 33.7♢ | 47.3♢ |
| LLaVA-NeXT-Video | 7B | 32 | 46.5♢ | 43.9♢ | 33.7♢ | 48.8♢ |
| Video-LLaVA | 7B | 8 | 41.0♢ | 38.4♢ | 39.9♢ | 44.3♢ |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 1 fps (max 32) | 53.4 | 53.9 | 56.1 | 56.2 |
| VITA-1.5 | 7B | 1 fps (max 32) | 55.5 | 54.7 | 57.3 | 57.6 |
| Baichuan-Omni | 7B | 1 fps (max 32) | 60.9 | 58.8 | 58.2 | 56.8 |
| MiniCPM-o 2.6 | 7B | 1 fps (max 64) | 58.6 | 50.7 | 63.4 | 66.6 |
| Baichuan-Omni-1.5 | 7B | 1 fps (max 32) | 63.7 | 62.4 | 60.1 | 68.9 |

Open-ended VQA

| Model | Size | # Frames | ActivityNet-QA (Acc.) | ActivityNet-QA (Score) | MSVD-QA (Acc.) | MSVD-QA (Score) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| Gemini 1.5 Pro | - | - | 56.7* | - | - | - |
| GPT 4o mini | - | 1 fps (max 32) | 62.1 | 3.1 | 67.5 | 3.3 |
| GPT 4o | - | - | 61.9* | - | - | - |
| GPT 4V | - | - | 59.5* | - | - | - |
| **Open-source Models (Vision-Language)** | | | | | | |
| Qwen2 VL | 7B | 2 fps (max 768) | 17.4 | 1.9 | 61.1 | 3.5 |
| VideoLLaMA 2 | 7B | 16 | 50.2* | 3.3* | 70.9* | 3.8* |
| VideoChat2 | 7B | 16 | 49.1* | 3.3* | 70.0* | 3.9* |
| LLaVA-NeXT-Video | 7B | 32 | 53.5* | 3.2* | 67.4 | 3.4 |
| Video-LLaVA | 7B | 8 | 45.3* | 3.3* | 70.7* | 3.9* |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 1 fps (max 32) | 55.0 | 3.5 | 63.9 | 3.7 |
| VITA-1.5 | 7B | 1 fps (max 32) | 59.6 | 3.0 | 67.6 | 3.3 |
| Baichuan-Omni | 7B | 1 fps (max 48) | 58.6 | 3.7 | 72.2 | 4.0 |
| MiniCPM-o 2.6 | 7B | 1 fps (max 64) | 63.0 | 3.1 | 73.7 | 3.6 |
| Baichuan-Omni-1.5 | 7B | 1 fps (max 48) | 62.0 | 3.1 | 74.2 | 3.6 |
Click here to view detailed evaluation results of audio understanding and generation ability.

Audio understanding and generation ability

Audio Comprehensive Capacity (s→t: speech input, text output; s→s: speech input, speech output)

| Model | Size | Reasoning QA (s→t) | Reasoning QA (s→s) | Llama Questions (s→t) | Llama Questions (s→s) | Web Questions (s→t) | Web Questions (s→s) | TriviaQA (s→t) | TriviaQA (s→s) | AlpacaEval (s→t) | AlpacaEval (s→s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | | | | | | |
| GPT-4o-Audio | - | 55.6 | - | 88.4 | - | 8.10 | - | 9.06 | - | 8.01 | - |
| **Open-source Models (Pure Audio)** | | | | | | | | | | | |
| GLM-4-Voice | 9B | - | 26.5 | - | 71.0 | - | 5.15 | - | 4.66 | - | 4.89 |
| **Open-source Models (Omni-modal)** | | | | | | | | | | | |
| VITA-1.5 | 7B | 41.0 | - | 74.2 | - | 5.73 | - | 4.68 | - | 6.82 | - |
| MiniCPM-o 2.6 | 7B | 38.6 | - | 77.8 | - | 6.86 | - | 6.19 | - | 5.18 | - |
| Baichuan-Omni-1.5 | 7B | 50.0 | 40.9 | 78.5 | 75.3 | 5.91 | 5.52 | 5.72 | 5.31 | 7.79 | 6.94 |
Click here to view the detailed evaluation results of omni-modal understanding ability.

Omni-modal understanding ability

Omni-Understanding

| Model | Size | Image & Audio (Acc.) | Image Caption & Audio (Acc.) | Image & Audio Transcript (Acc.) | Image Caption & Audio Transcript (Acc.) |
| --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | |
| GPT-4o-mini | - | - | - | 37.0 | 37.7 |
| **Open-source Models (Omni-modal)** | | | | | |
| VITA | 8x7B | 33.1 | 31.8 | 42.0 | 44.2 |
| VITA-1.5 | 7B | 33.4 | 29.6 | 48.5 | 47.2 |
| Baichuan-Omni | 7B | 32.2 | 26.5 | 42.6 | 44.2 |
| MiniCPM-o 2.6 | 7B | 40.5 | 30.8 | 53.2 | 46.3 |
| Baichuan-Omni-1.5 | 7B | 42.9 | 37.7 | 47.9 | 46.9 |
Click here to view detailed evaluation results of medical image understanding ability.

Medical image understanding ability

Medical Understanding

| Model | Size | GMAI-MMB-VAL (Acc.) | OpenMM-Medical (Acc.) |
| --- | --- | --- | --- |
| **Proprietary Models** | | | |
| GPT-4o-mini | - | 46.4 | 74.3 |
| **Open-source Models (Vision-Language)** | | | |
| Qwen2 VL | 7B | 46.3 | 76.9 |
| Qwen2 VL | 72B | 50.7 | 80.7 |
| **Open-source Models (Omni-modal)** | | | |
| VITA-1.5 | 7B | 36.7 | 67.1 |
| MiniCPM-o 2.6 | 7B | 41.5 | 73.6 |
| Baichuan-Omni-1.5 | 7B | 49.9 | 83.8 |

Typical Examples


(Example images: pipeline, math, fly_bill)

Local WebUI Demo

Preparation

Creating a Virtual Environment

```bash
conda create -n baichuan_omni python==3.12
conda activate baichuan_omni
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r baichuan_omni_requirements.txt
pip install accelerate flash_attn==2.6.3 speechbrain==1.0.0 deepspeed==0.14.4
apt install llvm ffmpeg
```
Download the model and modify the model path

Modify MODEL_PATH in web_demo/constants.py to the local model path
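For example, after downloading the weights the entry might look like this (the path is a placeholder for your local checkpoint directory):

```python
# web_demo/constants.py
MODEL_PATH = "/path/to/Baichuan-Omni-1.5"  # placeholder: set to your local model directory
```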

Image Demo

```bash
cd web_demo
python vision_s2s_gradio_demo_cosy_multiturn.py
```

Audio Demo

```bash
cd web_demo
python s2s_gradio_demo_cosy_multiturn.py
```

Video Demo

```bash
cd web_demo
python video_s2s_gradio_demo_cosy_singleturn.py
```

Fine-tuning

Coming soon

Open-source Evaluation Datasets

OpenMM-Medical

To comprehensively evaluate the model's multi-modal medical capabilities, we have collected OpenMM-Medical, which integrates publicly available medical image datasets such as ACRIMA (retinal images), BioMediTech (microscope images), and CoronaHack (X-rays), totaling 88,996 images.

OpenAudioBench

To efficiently assess the model's "IQ", we developed OpenAudioBench, comprising five end-to-end audio understanding sub-datasets: four public benchmarks (Llama Question, WEB QA, TriviaQA, AlpacaEval) and a speech logical-reasoning dataset created in-house by the Baichuan team, totaling 2,701 entries. Together, this suite reflects the model's comprehensive "IQ" level.
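If you want to pull either benchmark programmatically, a minimal sketch with the 🤗 `datasets` library is below; the repo ids and split names are assumptions, so verify them against the dataset cards linked at the top of this README.

```python
# Sketch only: the repo ids and split names below are assumptions; check the
# OpenMM-Medical / OpenAudioBench dataset cards linked at the top of this README.
from datasets import load_dataset

medical = load_dataset("baichuan-inc/OpenMM-Medical", split="test")      # assumed id/split
audio_bench = load_dataset("baichuan-inc/OpenAudioBench", split="test")  # assumed id/split

# Inspect the schema before writing an evaluation loop.
print(medical[0].keys())
print(audio_bench[0].keys())
```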

Acknowledgments

Disclaimer

We strongly urge all users not to employ the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models for any activities that may endanger national or social security or engage in illegal activities. Additionally, we request that these models not be used in internet services without proper safety reviews and registrations. We hope all users adhere to these guidelines to ensure technological development proceeds within a regulated and legal framework.

We have made every effort to ensure the compliance of the data used during the training process. However, despite our extensive efforts, due to the complexity of models and data, unforeseen issues may still arise. Therefore, we will not be held responsible for any problems arising from the use of the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base open-source models, including but not limited to data security issues, public opinion risks, or risks associated with misleading, misuse, dissemination, or improper utilization of the models.

License

Community use of the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models must comply with the Apache 2.0 license and the "Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base Community License Agreement." These models support commercial use. If you plan to use the Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base models or their derivatives for commercial purposes, please confirm that your entity meets the following criteria:

  • Your or your affiliated party's daily active user count (DAU) is below 1 million.
  • You or your affiliated party are not software service providers or cloud service providers.
  • There is no possibility of re-granting the commercial license to third parties without prior approval from Baichuan Inc.

Under these conditions, you need to submit the required application materials for the "Baichuan-Omni-1.5/Baichuan-Omni-1.5-Base Community License Agreement" via email at [email protected]. Upon approval, Baichuan Inc. will grant you a non-exclusive, global, non-transferable, non-sublicensable, and revocable commercial license.

Citation

If you find our model/code/paper helpful, please consider citing our paper 📝 and starring the repo ⭐️!

```bibtex
@article{li2025baichuan,
  title={Baichuan-Omni-1.5 Technical Report},
  author={Li, Yadong and Liu, Jun and Zhang, Tao and Chen, Song and Li, Tianpeng and Li, Zehuan and Liu, Lijun and Ming, Lingfeng and Dong, Guosheng and Pan, Da and others},
  journal={arXiv preprint arXiv:2501.15368},
  year={2025}
}
```