- WavChat: A Survey of Spoken Dialogue Models, arXiv, 2411.13577, arxiv, pdf, cication: -1
  Shengpeng Ji, Yifu Chen, Minghui Fang, ..., Jin Xu, Zhou Zhao
- Awesome-Speech-Language-Model - ddlBoJack
- A Survey on Speech Large Language Models, arXiv, 2410.18908, arxiv, pdf, cication: -1
  Jing Peng, Yucheng Wang, Yu Xi, ..., Xizhuo Zhang, Kai Yu
- 🌟 MinMo: A Multimodal Large Language Model for Seamless Voice Interaction, arXiv, 2501.06282, arxiv, pdf, cication: -1
  Qian Chen, Yafeng Chen, Yanni Chen, ..., Chong Zhang, Jinren Zhou · (funaudiollm.github)
- A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone 🤗
- OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis, arXiv, 2501.04561, arxiv, pdf, cication: -1
  Run Luo, Ting-En Lin, Haonan Zhang, ..., Hamid Alinejad-Rokny, Fei Huang
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction, arXiv, 2501.01957, arxiv, pdf, cication: -1
  Chaoyou Fu, Haojia Lin, Xiong Wang, ..., Caifeng Shan, Ran He · (VITA - VITA-MLLM)
- SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training, arXiv, 2412.15649, arxiv, pdf, cication: -1
  Wenxi Chen, Ziyang Ma, Ruiqi Yan, ..., Shujie Liu, Xie Chen · (slam-omni.github)
- AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM, arXiv, 2412.01145, arxiv, pdf, cication: -1
  Ruchao Fan, Bo Ren, Yuxuan Hu, ..., Shujie Liu, Jinyu Li
- 🌟 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions, arXiv, 2412.09596, arxiv, pdf, cication: -1
  Pan Zhang, Xiaoyi Dong, Yuhang Cao, ..., Dahua Lin, Jiaqi Wang · (InternLM-XComposer - InternLM) · (huggingface)
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition, arXiv, 2412.09501, arxiv, pdf, cication: -1
  Zhisheng Zhong, Chengyao Wang, Yuqi Liu, ..., Shu Liu, Jiaya Jia · (lyra-omni.github) · (103.170.5) · (Lyra - dvlab-research) · (huggingface)
- Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners, arXiv, 2412.04917, arxiv, pdf, cication: -1
  Ze Yuan, Yanqing Liu, Shujie Liu, ..., Sheng Zhao · (cognitivespeech.github)
- Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60,000 Hours of Synthetic Speech Dialogue Data, arXiv, 2412.01078, arxiv, pdf, cication: -1
  Shuaijiang Zhao, Tingwei Guo, Bajian Xiang, ..., Wei Zou, Xiangang Li · (huggingface)
- 🌟 Scaling Speech-Text Pre-training with Synthetic Interleaved Data 🤗
- SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation, arXiv, 2411.18138, arxiv, pdf, cication: -1
  Wenyi Yu, Siyin Wang, Xiaoyu Yang, ..., Yuxuan Wang, Chao Zhang
- Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM, arXiv, 2409.17353, arxiv, pdf, cication: -1
  Robin Shing-Hei Yuen, Timothy Tin-Long Tse, Jian Zhu
- 🌟 Building a Taiwanese Mandarin Spoken Language Model: A First Attempt, arXiv, 2411.07111, arxiv, pdf, cication: -1
  Chih-Kai Yang, Yu-Kuan Fu, Chen-An Li, ..., Shu-wen Yang, Hung-yi Lee
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback, arXiv, 2411.01834, arxiv, pdf, cication: -1
  Guan-Ting Lin, Prashanth Gurunath Shivakumar, Aditya Gourav, ..., Hung-yi Lee, Ivan Bulyko
- Introducing hertz-dev, the first open-source base model for conversational audio generation
- 🌟 Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM, arXiv, 2411.00774, arxiv, pdf, cication: -1
  Xiong Wang, Yangze Li, Chaoyou Fu, ..., Xing Sun, Long Ma · (Freeze-Omni - VITA-MLLM)
- Generative Expressive Conversational Speech Synthesis, arXiv, 2407.21491, arxiv, pdf, cication: -1
  Rui Liu, Yifan Hu, Yi Ren, ..., Xiang Yin, Haizhou Li · (GPT-Talker - walker-hyf) · (mp.weixin.qq)
- Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation, arXiv, 2410.20336, arxiv, pdf, cication: -1
  Maohao Shen, Shun Zhang, Jilong Wu, ..., Mike Seltzer, Qing He · (maohaos2.github)
- GPT-4o System Card, arXiv, 2410.21276, arxiv, pdf, cication: -1
  OpenAI, Aaron Hurst, ..., Yunxing Dai, Yury Malkov
- Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant, arXiv, 2410.15316, arxiv, pdf, cication: -1
  Alan Dao, Dinh Bach Vu, Huy Hoang Ha · (ichigo.homebrew) · (homebrew) · (ichigo - homebrewltd)
- 🌟 Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities, arXiv, 2410.11190, arxiv, pdf, cication: 1
  Zhifei Xie, Changqiao Wu · (mini-omni2 - gpt-omni)
- 🌟 OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation, arXiv, 2410.17799, arxiv, pdf, cication: -1
  Qinglin Zhang, Luyao Cheng, Chong Deng, ..., Hai Yu, Chaohong Tan
- Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents, arXiv, 2409.15594, arxiv, pdf, cication: -1
  Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, ..., Hongyu Gong, Shyamnath Gollakota
- VoiceBench: Benchmarking LLM-Based Voice Assistants, arXiv, 2410.17196, arxiv, pdf, cication: -1
  Yiming Chen, Xianghu Yue, Chen Zhang, ..., Robby T. Tan, Haizhou Li · (VoiceBench - MatthewCYM)
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning, arXiv, 2410.16130, arxiv, pdf, cication: -1
  Chun-Yi Kuan, Hung-yi Lee
- 🌟 MiniCPM-o - OpenBMB
- openai-realtime-embedded-sdk - openai
- Infini-Megrez - infinigence
- AnyModal - ritabratamaiti
  A Flexible Multimodal Language Model Framework
- VideoChat - Henry-23
- ultravox - fixie-ai · (demo.ultravox) · (huggingface)
- WavChat - jishengpeng
  A Survey of Spoken Dialogue Models
- gradio-groq-basics - bklieger-groq · (𝕏)
- GLM-4-Voice - THUDM
- Introducing OCTAVE, a next-generation speech-language model. 𝕏
- Project Astra: Exploring a Universal AI Assistant with Greg Wayne 🎬
- voice AI hackathon 𝕏
- OpenAI Realtime API: The Missing Manual
- Agora (声网)'s Liu Bin: truly bringing "Her" to life is inseparable from RTE capabilities | MEET 2025
- OmniAudio is the world's fastest and most efficient audio-language model for on-device deployment 🤗
- TEN-Agent - TEN-framework
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
- Talk to AI with natural speech detection 𝕏
- chatgpt voice demo with mini-omni 2 (multimodal) 𝕏
- Mini-Omni 2 understands image, audio and text inputs all via end-to-end voice conversations with users 𝕏
- 🎬 dotAI 2024 - Neil Zeghidour - Multimodal language models
- s2s_endpoint - 🤗