Is your feature request related to a problem? Please describe.
I often face challenges with latency and lack of fluidity when generating real-time audio and animations for interactive applications. This becomes particularly frustrating when working with AI-driven dialogues or streaming outputs from generative AI models (like GPT), where delays disrupt the experience and reduce engagement.
Describe the solution you'd like
I would like a Real-Time Text Streaming feature that processes text incrementally as it arrives, allowing immediate conversion to audio and simultaneous generation of blend shapes (see the sketch after this list). This feature should:
Enable continuous, low-latency text-to-audio synthesis with natural speech.
Produce synchronized facial animation (blend shapes) in real time, ensuring that visual feedback matches the audio output.
Minimize latency to improve the responsiveness of applications like virtual avatars, live performances, and conversational AI.
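As a stopgap while true input streaming is unavailable, here is a minimal sketch of the kind of incremental pipeline I have in mind: split the LLM token stream at sentence boundaries and hand each sentence to the synthesizer as soon as it completes. The key/region placeholders, `llm_token_stream()`, and the naive boundary check are hypothetical; only the `SpeechConfig`, `SpeechSynthesizer`, and `speak_text_async` calls are the real Speech SDK API.

```python
import azure.cognitiveservices.speech as speechsdk

# Hypothetical sketch: synthesize an LLM response sentence-by-sentence instead
# of waiting for the full text. SPEECH_KEY, SPEECH_REGION, and
# llm_token_stream() are placeholders, not part of any real API.
speech_config = speechsdk.SpeechConfig(subscription="SPEECH_KEY", region="SPEECH_REGION")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

buffer = ""
for token in llm_token_stream():  # placeholder for a streaming GPT response
    buffer += token
    # Naive sentence-boundary check; a real implementation would need to
    # handle abbreviations, numbers, and similar edge cases.
    if buffer.rstrip().endswith((".", "!", "?")):
        synthesizer.speak_text_async(buffer).get()  # synthesize this chunk now
        buffer = ""
if buffer.strip():
    synthesizer.speak_text_async(buffer).get()  # flush any trailing text
```

Even this crude chunking would cut time-to-first-audio from the full response time down to the first sentence, though native SDK support would avoid the prosody breaks between chunks.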
Describe alternatives you've considered
Currently, I generate audio and blend shapes only after receiving the complete response from the LLM, which takes around 3-5 seconds. This approach creates noticeable latency when using Azure Speech for audio generation and blend shape rendering, leading to a suboptimal experience for end users.
The delay disrupts the fluidity of real-time interactions, making the system feel less responsive and natural, especially in dynamic applications like AI-driven dialogues or interactive avatars. An alternative could involve exploring real-time streaming methods to incrementally process the input, reducing the noticeable lag and improving user satisfaction.
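For reference, this is roughly what my current non-streaming approach looks like: one `speak_ssml_async` call over the full LLM response, with blend-shape frames delivered through the `viseme_received` event once the SSML requests `<mstts:viseme type="FacialExpression"/>`. The `apply_blend_shapes` callback body is a placeholder for the avatar update; the event wiring and SSML element are the documented Speech SDK mechanism.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="SPEECH_KEY", region="SPEECH_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

def on_viseme(evt: speechsdk.SpeechSynthesisVisemeEventArgs):
    # evt.animation is a JSON string of blend-shape frames when the SSML below
    # requests FacialExpression visemes; apply the frames to the avatar here.
    if evt.animation:
        apply_blend_shapes(evt.animation)  # placeholder avatar-update function

synthesizer.viseme_received.connect(on_viseme)

# Full response from the LLM, available only after the 3-5 second wait.
full_text = "Hello! How can I help you today?"
ssml = f"""<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:viseme type="FacialExpression"/>{full_text}
  </voice>
</speak>"""
synthesizer.speak_ssml_async(ssml).get()
```

Nothing here can start until `full_text` is complete, which is exactly the 3-5 second dead time the requested streaming feature would eliminate.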
vipin-ust changed the title from "Real-Time Text Streaming for Audio and Blend Shapes Generation - SPEECH SDK" to "Real-Time Text Streaming for Audio and Blend Shapes Generation (facial position with viseme)" on Dec 5, 2024.