Voice Omni

Survey

  • WavChat: A Survey of Spoken Dialogue Models, arXiv, 2411.13577, arxiv, pdf, citation: -1

    Shengpeng Ji, Yifu Chen, Minghui Fang, ..., Jin Xu, Zhou Zhao

  • Awesome-Speech-Language-Model - ddlBoJack Star

  • A Survey on Speech Large Language Models, arXiv, 2410.18908, arxiv, pdf, citation: -1

    Jing Peng, Yucheng Wang, Yu Xi, ..., Xizhuo Zhang, Kai Yu

Voice Omni

  • 🌟 MinMo: A Multimodal Large Language Model for Seamless Voice Interaction, arXiv, 2501.06282, arxiv, pdf, citation: -1

    Qian Chen, Yafeng Chen, Yanni Chen, ..., Chong Zhang, Jinren Zhou · (funaudiollm.github)

  • A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone 🤗

  • OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis, arXiv, 2501.04561, arxiv, pdf, citation: -1

    Run Luo, Ting-En Lin, Haonan Zhang, ..., Hamid Alinejad-Rokny, Fei Huang

  • GPT-4o

  • VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction, arXiv, 2501.01957, arxiv, pdf, citation: -1

    Chaoyou Fu, Haojia Lin, Xiong Wang, ..., Caifeng Shan, Ran He · (VITA - VITA-MLLM) Star

  • SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training, arXiv, 2412.15649, arxiv, pdf, citation: -1

    Wenxi Chen, Ziyang Ma, Ruiqi Yan, ..., Shujie Liu, Xie Chen · (slam-omni.github)

  • AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-Following Speech-LLM, arXiv, 2412.01145, arxiv, pdf, citation: -1

    Ruchao Fan, Bo Ren, Yuxuan Hu, ..., Shujie Liu, Jinyu Li

  • 🌟 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions, arXiv, 2412.09596, arxiv, pdf, citation: -1

    Pan Zhang, Xiaoyi Dong, Yuhang Cao, ..., Dahua Lin, Jiaqi Wang · (InternLM-XComposer - InternLM) Star · (huggingface)

  • Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition, arXiv, 2412.09501, arxiv, pdf, citation: -1

    Zhisheng Zhong, Chengyao Wang, Yuqi Liu, ..., Shu Liu, Jiaya Jia · (lyra-omni.github) · (103.170.5) · (Lyra - dvlab-research) Star · (huggingface)

  • Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners, arXiv, 2412.04917, arxiv, pdf, citation: -1

    Ze Yuan, Yanqing Liu, Shujie Liu, ..., Sheng Zhao · (cognitivespeech.github)

  • Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60,000 Hours of Synthetic Speech Dialogue Data, arXiv, 2412.01078, arxiv, pdf, citation: -1

    Shuaijiang Zhao, Tingwei Guo, Bajian Xiang, ..., Wei Zou, Xiangang Li · (huggingface)

  • 🌟 Paper page - Scaling Speech-Text Pre-training with Synthetic Interleaved Data

  • SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation, arXiv, 2411.18138, arxiv, pdf, citation: -1

    Wenyi Yu, Siyin Wang, Xiaoyu Yang, ..., Yuxuan Wang, Chao Zhang

  • Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM, arXiv, 2409.17353, arxiv, pdf, citation: -1

    Robin Shing-Hei Yuen, Timothy Tin-Long Tse, Jian Zhu

  • 🌟 Building a Taiwanese Mandarin Spoken Language Model: A First Attempt, arXiv, 2411.07111, arxiv, pdf, citation: -1

    Chih-Kai Yang, Yu-Kuan Fu, Chen-An Li, ..., Shu-wen Yang, Hung-yi Lee

  • freddyaboulton / llama-code-editor 🤗

  • Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback, arXiv, 2411.01834, arxiv, pdf, citation: -1

    Guan-Ting Lin, Prashanth Gurunath Shivakumar, Aditya Gourav, ..., Hung-yi Lee, Ivan Bulyko

  • Introducing hertz-dev, the first open-source base model for conversational audio generation

    · (x) · (hertz-dev - Standard-Intelligence) Star

  • 🌟 Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM, arXiv, 2411.00774, arxiv, pdf, citation: -1

    Xiong Wang, Yangze Li, Chaoyou Fu, ..., Xing Sun, Long Ma

    · (freeze-omni.github)

    · (Freeze-Omni - VITA-MLLM) Star

  • Generative Expressive Conversational Speech Synthesis, arXiv, 2407.21491, arxiv, pdf, citation: -1

    Rui Liu, Yifan Hu, Yi Ren, ..., Xiang Yin, Haizhou Li · (GPT-Talker - walker-hyf) Star · (mp.weixin.qq)

  • Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation, arXiv, 2410.20336, arxiv, pdf, citation: -1

    Maohao Shen, Shun Zhang, Jilong Wu, ..., Mike Seltzer, Qing He · (maohaos2.github)

  • GPT-4o System Card, arXiv, 2410.21276, arxiv, pdf, citation: -1

    OpenAI, Aaron Hurst, ..., Yunxing Dai, Yury Malkov

  • Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant, arXiv, 2410.15316, arxiv, pdf, citation: -1

    Alan Dao, Dinh Bach Vu, Huy Hoang Ha · (ichigo.homebrew) · (homebrew) · (ichigo - homebrewltd) Star

  • 🌟 Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities, arXiv, 2410.11190, arxiv, pdf, citation: 1

    Zhifei Xie, Changqiao Wu · (mini-omni2 - gpt-omni) Star

Duplex

  • 🌟 OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation, arXiv, 2410.17799, arxiv, pdf, citation: -1

    Qinglin Zhang, Luyao Cheng, Chong Deng, ..., Hai Yu, Chaohong Tan

    · (omniflatten.github)

  • Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents, arXiv, 2409.15594, arxiv, pdf, citation: -1

    Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, ..., Hongyu Gong, Shyamnath Gollakota

    · (syncllm.cs.washington)

Evaluation

  • Evaluating Audio Reasoning with Big Bench Audio 🤗

  • VoiceBench: Benchmarking LLM-Based Voice Assistants, arXiv, 2410.17196, arxiv, pdf, citation: -1

    Yiming Chen, Xianghu Yue, Chen Zhang, ..., Robby T. Tan, Haizhou Li

    · (VoiceBench - MatthewCYM) Star

  • Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning, arXiv, 2410.16130, arxiv, pdf, citation: -1

    Chun-Yi Kuan, Hung-yi Lee

Projects

Products

Datasets

Toolkits

Misc