Is your feature request related to a problem? Please describe.
I often face challenges with latency and lack of fluidity when generating real-time audio and animations for interactive applications. This becomes particularly frustrating when working with AI-driven dialogues or streaming outputs from generative AI models (like GPT), where delays disrupt the experience and reduce engagement.
Describe the solution you'd like
I would like a Real-Time Text Streaming feature that processes text incrementally as it arrives, allowing immediate conversion to audio and simultaneous generation of blend shapes (see the sketch after this list). This feature should:
Enable continuous, low-latency text-to-audio synthesis with natural speech.
Produce synchronized facial animation (blend shapes) in real time, ensuring that visual feedback matches the audio output.
Minimize latency to improve the responsiveness of applications like virtual avatars, live performances, and conversational AI.
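As a stopgap while true input streaming is unavailable, here is a minimal sketch of the kind of incremental pipeline I have in mind: split the LLM token stream at sentence boundaries and hand each sentence to the synthesizer as soon as it completes. The key/region placeholders, `llm_token_stream()`, and the naive boundary check are hypothetical; only the `SpeechConfig`, `SpeechSynthesizer`, and `speak_text_async` calls are the real Speech SDK API.

```python
import azure.cognitiveservices.speech as speechsdk

# Hypothetical sketch: synthesize an LLM response sentence-by-sentence instead
# of waiting for the full text. SPEECH_KEY, SPEECH_REGION, and
# llm_token_stream() are placeholders, not part of any real API.
speech_config = speechsdk.SpeechConfig(subscription="SPEECH_KEY", region="SPEECH_REGION")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

buffer = ""
for token in llm_token_stream():  # placeholder for a streaming GPT response
    buffer += token
    # Naive sentence-boundary check; a real implementation would need to
    # handle abbreviations, numbers, and similar edge cases.
    if buffer.rstrip().endswith((".", "!", "?")):
        synthesizer.speak_text_async(buffer).get()  # synthesize this chunk now
        buffer = ""
if buffer.strip():
    synthesizer.speak_text_async(buffer).get()  # flush any trailing text
```

Even this crude chunking would cut time-to-first-audio from the full response time down to the first sentence, though native SDK support would avoid the prosody breaks between chunks.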
Describe alternatives you've considered
Currently, I generate audio and blend shapes only after receiving the complete response from the LLM, which takes around 3-5 seconds. This approach creates noticeable latency when using Azure Speech for audio generation and blend shape rendering, leading to a suboptimal experience for end users.
The delay disrupts the fluidity of real-time interactions, making the system feel less responsive and natural, especially in dynamic applications like AI-driven dialogues or interactive avatars. An alternative could involve exploring real-time streaming methods to incrementally process the input, reducing the noticeable lag and improving user satisfaction.
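For reference, this is roughly what my current non-streaming approach looks like: one `speak_ssml_async` call over the full LLM response, with blend-shape frames delivered through the `viseme_received` event once the SSML requests `<mstts:viseme type="FacialExpression"/>`. The `apply_blend_shapes` callback body is a placeholder for the avatar update; the event wiring and SSML element are the documented Speech SDK mechanism.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="SPEECH_KEY", region="SPEECH_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

def on_viseme(evt: speechsdk.SpeechSynthesisVisemeEventArgs):
    # evt.animation is a JSON string of blend-shape frames when the SSML below
    # requests FacialExpression visemes; apply the frames to the avatar here.
    if evt.animation:
        apply_blend_shapes(evt.animation)  # placeholder avatar-update function

synthesizer.viseme_received.connect(on_viseme)

# Full response from the LLM, available only after the 3-5 second wait.
full_text = "Hello! How can I help you today?"
ssml = f"""<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:viseme type="FacialExpression"/>{full_text}
  </voice>
</speak>"""
synthesizer.speak_ssml_async(ssml).get()
```

Nothing here can start until `full_text` is complete, which is exactly the 3-5 second dead time the requested streaming feature would eliminate.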
vipin-ust changed the title from "Real-Time Text Streaming for Audio and Blend Shapes Generation - SPEECH SDK" to "Real-Time Text Streaming for Audio and Blend Shapes Generation (facial position with viseme)" on Dec 5, 2024.