[New feature] Adding visemes as part of the output #99

Open
wants to merge 13 commits into main

Conversation

fabiocat93

Introducing SpeechToVisemes 🗣️

This PR addresses issue #37 and introduces the SpeechToVisemes module, a submodule of TTS 🤖. This new functionality converts speech into visemes, which is crucial for apps that need visual representations of spoken words, such as lip-syncing in animations or accessibility features 🎬.

Example Usage 📹

demo_s2s.mov

How it works 🤔

The tool generates timestamped sequences of visemes (22 mouth shapes in total, following Microsoft's documentation 📚) by transcribing the synthesized speech with the Hugging Face ASR pipeline and phoneme-recognition models. The default model is "bookbot/wav2vec2-ljspeech-gruut", which gives decent results with low latency and no external dependencies; alternatives include "ct-vikramanantha/phoneme-scorer-v2-wav2vec2" and "Bluecast/wav2vec2-Phoneme". The recognized phonemes are then mapped to visemes, taking language-specific sounds into account.
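
To make this flow concrete, here is a minimal sketch (my illustration, not the code in this PR): it runs the default phoneme-recognition model through the Hugging Face ASR pipeline with character-level timestamps and looks each phoneme up in an abridged phoneme-to-viseme table. The helper name and the handful of mapping entries are assumptions for illustration; viseme IDs follow Microsoft's 0-21 numbering.

```python
# Sketch only: approximates the described flow, not the implementation in this PR.
from transformers import pipeline

# Abridged, illustrative subset of an IPA-phoneme -> Microsoft-viseme-ID mapping (IDs 0-21).
PHONEME_TO_VISEME = {
    "p": 21, "b": 21, "m": 21,   # bilabials
    "f": 18, "v": 18,            # labiodentals
    "t": 19, "d": 19, "n": 19,   # alveolars
    "k": 20, "ɡ": 20,            # velars
    "s": 15, "z": 15,            # sibilants
    "æ": 1,  "ə": 1,             # open/neutral vowels
}

asr = pipeline(
    "automatic-speech-recognition",
    model="bookbot/wav2vec2-ljspeech-gruut",  # the default phoneme-recognition model
)

def speech_to_visemes(audio_path: str) -> list[dict]:
    """Return a timestamped viseme sequence for a synthesized-speech file."""
    # CTC models support character-level timestamps; for this model each "character" is a phoneme.
    result = asr(audio_path, return_timestamps="char")
    visemes = []
    for chunk in result["chunks"]:
        phoneme = chunk["text"].strip()
        if not phoneme:  # skip word separators
            continue
        visemes.append({
            "viseme_id": PHONEME_TO_VISEME.get(phoneme, 0),  # 0 = silence / unmapped
            "start": chunk["timestamp"][0],
            "end": chunk["timestamp"][1],
        })
    return visemes
```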

Notes on the server architecture 🛠️

I've implemented STV as a submodule of TTS to leverage the existing architecture (I didn't want to make any major edits here). However, I have ideas on how we could restructure the entire tool to make it more generalizable. I propose a sensor-engine-actuator framework (instead of just STT, LM, and TTS): the sensor includes submodules like STT and speech emotion recognition (which can also run in parallel), the engine comprises the LLM and application-specific rules, and the actuator includes TTS, viseme generation, and potentially more output instructions. Such a framework would enable emotion-aware agents, which are fundamental in many scenarios (e.g., therapy)! @andimarafioti Let's discuss this further if you're interested! 💬
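
To make the proposal slightly more tangible, here is a purely conceptual sketch of the sensor-engine-actuator wiring; every name and interface below is hypothetical, and nothing here comes from this PR or from the current library.

```python
# Conceptual sketch of the proposed sensor-engine-actuator split; all names are hypothetical.
from queue import Queue

def fan_out(source_q: Queue, sink_qs: list) -> None:
    """Copy each item to several consumers, e.g. raw audio to both STT and emotion recognition."""
    while True:
        item = source_q.get()
        for q in sink_qs:
            q.put(item)

def run_stage(fn, in_q: Queue, out_q: Queue) -> None:
    """Generic worker shared by sensors (STT, emotion recognition), the engine (LLM +
    application rules), and actuators (TTS, viseme generation): read, process, forward."""
    while True:
        out_q.put(fn(in_q.get()))

# Wiring idea (one thread per stage): audio -> {STT, emotion} -> engine -> {TTS, visemes} -> client
```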

Your feedback is welcome! 🤗

Updating the viseme branch
Updating the speech-to-speech fork with visemes
@fabiocat93 (Author)

FYI: I have updated the branch with the latest changes from the upstream branch. Feel free to review the code whenever you have time.

@andimarafioti (Member) left a comment


Generally, this looks good but there are some things that could be more solid. I would put that huge dictionary of mappings into a dictionary, and I would rethink a bit how we are unpacking things on the clients. There are a few parts commented out and things. I think Visemes is a great feature, but I would also untie it from the TTS_handlers. The model seems to only process the audio, so I don't see a reason why they couldn't work in the same way as all the other handlers and have something like piping the TTS to Visemes and then to the connector. That would make this implementation more in-line with the rest of the library, and it would make Visemes not affect any TTS. Plus, it would make extending TTSs simpler.

@fabiocat93 (Author)

Generally, this looks good but there are some things that could be more solid. I would put that huge dictionary of mappings into a dictionary, and I would rethink a bit how we are unpacking things on the clients. There are a few parts commented out and things. I think Visemes is a great feature, but I would also untie it from the TTS_handlers. The model seems to only process the audio, so I don't see a reason why they couldn't work in the same way as all the other handlers and have something like piping the TTS to Visemes and then to the connector. That would make this implementation more in-line with the rest of the library, and it would make Visemes not affect any TTS. Plus, it would make extending TTSs simpler.

Thank you @andimarafioti for the review. Overall, this was mainly a first attempt to explore the feasibility and latency of the speech-to-viseme conversion, so I was more focused on the output structure and seeing if you were happy with it. Since the latency seems acceptable and the output looks good, I can now refine the code with a stronger focus on readability and modularity.

More specifically:

I would put that huge dictionary of mappings into a dictionary.

I can definitely do this! It will clean up the code and make it easier to maintain.

I would rethink a bit how we are unpacking things on the clients. There are a few parts commented out and things.

Totally fair. Those were more for internal testing and to show how to access the info without overwhelming the terminal logs. I'll clean it up and find a better way to manage that information.

I think Visemes is a great feature, but I would also untie it from the TTS_handlers. The model seems to only process the audio, so I don't see a reason why they couldn't work in the same way as all the other handlers and have something like piping the TTS to Visemes and then to the connector. That would make this implementation more in-line with the rest of the library, and it would make Visemes not affect any TTS. Plus, it would make extending TTSs simpler.

I initially tied speech-to-visemes to the TTS handlers because certain commercial APIs (like Amazon Polly) offer TTS with time-stamped visemes, and I wanted to emulate that interface. But I agree with your suggestion of decoupling speech-to-visemes from TTS: it would improve modularity, simplify future extensions, and keep things consistent across the tool. I'll work on it.
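
For illustration, a decoupled handler could look roughly like the sketch below. The class name, the queue-based interface, and the None sentinel are assumptions meant to mirror the style of the library's other handlers, not its actual API; `speech_to_visemes` stands for any callable like the one sketched in the PR description.

```python
# Illustrative sketch of a standalone viseme handler sitting between TTS and the connector.
from queue import Queue

class SpeechToVisemesHandler:
    def __init__(self, queue_in: Queue, queue_out: Queue, speech_to_visemes):
        self.queue_in = queue_in                     # audio chunks produced by any TTS handler
        self.queue_out = queue_out                   # consumed by the socket/connector handler
        self.speech_to_visemes = speech_to_visemes   # callable: audio chunk -> viseme sequence

    def run(self):
        while True:
            audio_chunk = self.queue_in.get()
            if audio_chunk is None:                  # hypothetical stop sentinel
                break
            visemes = self.speech_to_visemes(audio_chunk)
            # Forward both so the client can lip-sync playback without an extra round trip.
            self.queue_out.put({"audio": audio_chunk, "visemes": visemes})
```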

@fabiocat93 (Author)

Generally, this looks good but there are some things that could be more solid. I would put that huge dictionary of mappings into a dictionary, and I would rethink a bit how we are unpacking things on the clients. There are a few parts commented out and things. I think Visemes is a great feature, but I would also untie it from the TTS_handlers. The model seems to only process the audio, so I don't see a reason why they couldn't work in the same way as all the other handlers and have something like piping the TTS to Visemes and then to the connector. That would make this implementation more in-line with the rest of the library, and it would make Visemes not affect any TTS. Plus, it would make extending TTSs simpler.

hi @andimarafioti, I have addressed all your points above. I hope you like my changes:

  • Integrated the speech-to-visemes (STV) module with its own handler, in line with the rest of the library
  • Read the phoneme-viseme mapping from a JSON file (a minimal sketch of this is shown after this list)
  • Added pre-commit hooks to fix spelling issues with codespell and style issues with ruff
  • Cleaned up the code some more (e.g., removed my old unnecessary print statements)
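
As a rough sketch of the JSON-based mapping mentioned above (the file name, layout, and helper are illustrative, not necessarily what the PR ships):

```python
# Illustrative sketch: load the phoneme -> viseme table from JSON instead of hard-coding it.
import json
from pathlib import Path

def load_phoneme_to_viseme(path: str = "phoneme_to_viseme.json") -> dict:
    """Return the phoneme -> viseme-ID mapping, with IDs coerced to int."""
    with Path(path).open(encoding="utf-8") as f:
        raw = json.load(f)                    # e.g. {"p": 21, "s": 15, "æ": 1, ...}
    return {phoneme: int(viseme_id) for phoneme, viseme_id in raw.items()}
```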
