- WavChat: A Survey of Spoken Dialogue Models, arXiv, 2411.13577, arxiv, pdf, cication: -1
  Shengpeng Ji, Yifu Chen, Minghui Fang, ..., Jin Xu, Zhou Zhao
- Awesome-Speech-Language-Model - ddlBoJack
- A Survey on Speech Large Language Models, arXiv, 2410.18908, arxiv, pdf, cication: -1
  Jing Peng, Yucheng Wang, Yu Xi, ..., Xizhuo Zhang, Kai Yu
- 🌟 MinMo: A Multimodal Large Language Model for Seamless Voice Interaction, arXiv, 2501.06282, arxiv, pdf, cication: -1
  Qian Chen, Yafeng Chen, Yanni Chen, ..., Chong Zhang, Jinren Zhou · (funaudiollm.github)
- A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone 🤗
- OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis, arXiv, 2501.04561, arxiv, pdf, cication: -1
  Run Luo, Ting-En Lin, Haonan Zhang, ..., Hamid Alinejad-Rokny, Fei Huang
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction, arXiv, 2501.01957, arxiv, pdf, cication: -1
  Chaoyou Fu, Haojia Lin, Xiong Wang, ..., Caifeng Shan, Ran He · (VITA - VITA-MLLM)
- SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training, arXiv, 2412.15649, arxiv, pdf, cication: -1
  Wenxi Chen, Ziyang Ma, Ruiqi Yan, ..., Shujie Liu, Xie Chen · (slam-omni.github)
- AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM, arXiv, 2412.01145, arxiv, pdf, cication: -1
  Ruchao Fan, Bo Ren, Yuxuan Hu, ..., Shujie Liu, Jinyu Li
- 🌟 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions, arXiv, 2412.09596, arxiv, pdf, cication: -1
  Pan Zhang, Xiaoyi Dong, Yuhang Cao, ..., Dahua Lin, Jiaqi Wang · (InternLM-XComposer - InternLM) · (huggingface)
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition, arXiv, 2412.09501, arxiv, pdf, cication: -1
  Zhisheng Zhong, Chengyao Wang, Yuqi Liu, ..., Shu Liu, Jiaya Jia · (lyra-omni.github) · (103.170.5) · (Lyra - dvlab-research) · (huggingface)
- Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners, arXiv, 2412.04917, arxiv, pdf, cication: -1
  Ze Yuan, Yanqing Liu, Shujie Liu, ..., Sheng Zhao · (cognitivespeech.github)
- Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60,000 Hours of Synthetic Speech Dialogue Data, arXiv, 2412.01078, arxiv, pdf, cication: -1
  Shuaijiang Zhao, Tingwei Guo, Bajian Xiang, ..., Wei Zou, Xiangang Li · (huggingface)
- 🌟 Scaling Speech-Text Pre-training with Synthetic Interleaved Data 🤗
- SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation, arXiv, 2411.18138, arxiv, pdf, cication: -1
  Wenyi Yu, Siyin Wang, Xiaoyu Yang, ..., Yuxuan Wang, Chao Zhang
- Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM, arXiv, 2409.17353, arxiv, pdf, cication: -1
  Robin Shing-Hei Yuen, Timothy Tin-Long Tse, Jian Zhu
- 🌟 Building a Taiwanese Mandarin Spoken Language Model: A First Attempt, arXiv, 2411.07111, arxiv, pdf, cication: -1
  Chih-Kai Yang, Yu-Kuan Fu, Chen-An Li, ..., Shu-wen Yang, Hung-yi Lee
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback, arXiv, 2411.01834, arxiv, pdf, cication: -1
  Guan-Ting Lin, Prashanth Gurunath Shivakumar, Aditya Gourav, ..., Hung-yi Lee, Ivan Bulyko
- Introducing hertz-dev, the first open-source base model for conversational audio generation
- 🌟 Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM, arXiv, 2411.00774, arxiv, pdf, cication: -1
  Xiong Wang, Yangze Li, Chaoyou Fu, ..., Xing Sun, Long Ma · (Freeze-Omni - VITA-MLLM)
- Generative Expressive Conversational Speech Synthesis, arXiv, 2407.21491, arxiv, pdf, cication: -1
  Rui Liu, Yifan Hu, Yi Ren, ..., Xiang Yin, Haizhou Li · (GPT-Talker - walker-hyf) · (mp.weixin.qq)
- Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation, arXiv, 2410.20336, arxiv, pdf, cication: -1
  Maohao Shen, Shun Zhang, Jilong Wu, ..., Mike Seltzer, Qing He · (maohaos2.github)
- GPT-4o System Card, arXiv, 2410.21276, arxiv, pdf, cication: -1
  OpenAI, Aaron Hurst, ..., Yunxing Dai, Yury Malkov
- Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant, arXiv, 2410.15316, arxiv, pdf, cication: -1
  Alan Dao, Dinh Bach Vu, Huy Hoang Ha · (ichigo.homebrew) · (homebrew) · (ichigo - homebrewltd)
- 🌟 Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities, arXiv, 2410.11190, arxiv, pdf, cication: 1
  Zhifei Xie, Changqiao Wu · (mini-omni2 - gpt-omni)
- 🌟 OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation, arXiv, 2410.17799, arxiv, pdf, cication: -1
  Qinglin Zhang, Luyao Cheng, Chong Deng, ..., Hai Yu, Chaohong Tan
- Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents, arXiv, 2409.15594, arxiv, pdf, cication: -1
  Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, ..., Hongyu Gong, Shyamnath Gollakota
- VoiceBench: Benchmarking LLM-Based Voice Assistants, arXiv, 2410.17196, arxiv, pdf, cication: -1
  Yiming Chen, Xianghu Yue, Chen Zhang, ..., Robby T. Tan, Haizhou Li · (VoiceBench - MatthewCYM)
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning, arXiv, 2410.16130, arxiv, pdf, cication: -1
  Chun-Yi Kuan, Hung-yi Lee
- 🌟 MiniCPM-o - OpenBMB
- openai-realtime-embedded-sdk - openai
- Infini-Megrez - infinigence
- AnyModal - ritabratamaiti
  A Flexible Multimodal Language Model Framework
- VideoChat - Henry-23
- ultravox - fixie-ai · (demo.ultravox) · (huggingface)
- WavChat - jishengpeng
  A Survey of Spoken Dialogue Models
- gradio-groq-basics - bklieger-groq · (𝕏)
- GLM-4-Voice - THUDM
- Introducing OCTAVE, a next-generation speech-language model. 𝕏
- Project Astra: Exploring a Universal AI Assistant with Greg Wayne 🎬
- voice AI hackathon 𝕏
- OpenAI Realtime API: The Missing Manual
- Agora (声网)'s Liu Bin: truly bringing "Her" to life is inseparable from RTE capabilities | MEET 2025
- OmniAudio is the world's fastest and most efficient audio-language model for on-device deployment 🤗
- TEN-Agent - TEN-framework
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
- Talk to AI with natural speech detection 𝕏
- chatgpt voice demo with mini-omni 2 (multimodal) 𝕏
- Mini-Omni 2 understands image, audio and text inputs all via end-to-end voice conversations with users 𝕏
- 🎬 dotAI 2024 - Neil Zeghidour - Multimodal language models
- s2s_endpoint - 🤗