Real-Time Text Streaming for Audio and Blend Shapes Generation(facial position with viseme) #2694

Open
vipin-ust opened this issue Dec 5, 2024 · 3 comments
Labels: enhancement (New feature or request), in-review (In review), text-to-speech (Text-to-Speech)

Comments

vipin-ust commented Dec 5, 2024

Is your feature request related to a problem? Please describe.
I often face challenges with latency and lack of fluidity when generating real-time audio and animations for interactive applications. This becomes particularly frustrating when working with AI-driven dialogues or streaming outputs from generative AI models (like GPT), where delays disrupt the experience and reduce engagement.

Describe the solution you'd like
I would like a Real-Time Text Streaming feature that processes text incrementally as it's input, allowing for immediate conversion into audio and simultaneous generation of blend shapes. This feature should:

Enable continuous, low-latency text-to-audio synthesis with natural speech.
Produce synchronized facial animation (blend shapes) in real-time, ensuring that visual feedback matches the audio output.
Minimize latency to improve the responsiveness of applications like virtual avatars, live performances, and conversational AI.

Describe alternatives you've considered
Currently, I generate audio and blend shapes only after receiving the complete response from the LLM, which takes around 3-5 seconds. This creates noticeable latency when using Azure Speech for audio generation and blend shape rendering, leading to a suboptimal experience for end users.

The delay disrupts the fluidity of real-time interactions, making the system feel less responsive and natural, especially in dynamic applications like AI-driven dialogues or interactive avatars. An alternative could involve exploring real-time streaming methods to incrementally process the input, reducing the noticeable lag and improving user satisfaction.
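
A minimal sketch of that incremental approach, assuming the LLM output arrives as an iterable of text chunks. The names `sentences_from_stream`, `speak_incrementally`, and `synthesize` are hypothetical and are not part of the Azure Speech SDK; `synthesize` stands in for whatever call turns one piece of text into audio. The idea is to flush each sentence to the synthesizer as soon as it is complete instead of waiting for the full response:

```python
import re
from typing import Callable, Iterable, Iterator

# Sentence-ending punctuation followed by whitespace marks a safe flush point.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _BOUNDARY.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    tail = buffer.strip()
    if tail:
        # Flush whatever remains when the stream ends.
        yield tail


def speak_incrementally(chunks: Iterable[str], synthesize: Callable[[str], None]) -> int:
    """Pass each sentence to `synthesize` (hypothetical callback) as it completes."""
    count = 0
    for sentence in sentences_from_stream(chunks):
        synthesize(sentence)
        count += 1
    return count
```

With this shape, synthesis of the first sentence can start while the LLM is still producing the rest, so the perceived delay is roughly the time to the first sentence boundary rather than the time to the full response.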

vipin-ust changed the title from "Real-Time Text Streaming for Audio and Blend Shapes Generation- SPEECH SDK" to "Real-Time Text Streaming for Audio and Blend Shapes Generation(facial position with viseme)" Dec 5, 2024

yulin-li (Contributor) commented Dec 6, 2024

We support input text streaming, but blend shapes are not available in this API yet.

vipin-ust (Author) commented

Can we expect blend shapes in an upcoming release?

This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.

github-actions bot added the "update needed" (For items that are in progress but have not been updated) label Dec 30, 2024
pankopon added the enhancement (New feature or request), in-review (In review), and text-to-speech (Text-to-Speech) labels and removed the "update needed" (For items that are in progress but have not been updated) label Mar 6, 2025