[New feature] Adding visemes as part of the output #99

Open
wants to merge 13 commits into main

Conversation

fabiocat93

Introducing SpeechToVisemes 🗣️

This PR addresses issue #37 and introduces the SpeechToVisemes module, a submodule of TTS 🤖. This new functionality converts speech into visemes, which is crucial for apps that need visual representations of spoken words, such as lip-syncing in animations or accessibility features 🎬.

Example Usage 📹

demo_s2s.mov

How it works 🤔

The tool generates timestamped sequences of visemes (22 mouth shapes in total, following Microsoft's documentation 📚) by transcribing the synthesized speech with the Hugging Face ASR pipeline and phoneme-recognition models. The default model is "bookbot/wav2vec2-ljspeech-gruut", which gives decent results with low latency and no external dependencies; alternatives include "ct-vikramanantha/phoneme-scorer-v2-wav2vec2" and "Bluecast/wav2vec2-Phoneme". The recognized phonemes are then mapped to visemes, taking language-specific sounds into account.
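
To make this flow concrete, here is a minimal sketch (my illustration, not the code in this PR): it runs the default phoneme-recognition model through the Hugging Face ASR pipeline with character-level timestamps and looks each phoneme up in an abridged phoneme-to-viseme table. The helper name and the handful of mapping entries are assumptions for illustration; viseme IDs follow Microsoft's 0-21 numbering.

```python
# Sketch only: approximates the described flow, not the implementation in this PR.
from transformers import pipeline

# Abridged, illustrative subset of an IPA-phoneme -> Microsoft-viseme-ID mapping (IDs 0-21).
PHONEME_TO_VISEME = {
    "p": 21, "b": 21, "m": 21,   # bilabials
    "f": 18, "v": 18,            # labiodentals
    "t": 19, "d": 19, "n": 19,   # alveolars
    "k": 20, "ɡ": 20,            # velars
    "s": 15, "z": 15,            # sibilants
    "æ": 1,  "ə": 1,             # open/neutral vowels
}

asr = pipeline(
    "automatic-speech-recognition",
    model="bookbot/wav2vec2-ljspeech-gruut",  # the default phoneme-recognition model
)

def speech_to_visemes(audio_path: str) -> list[dict]:
    """Return a timestamped viseme sequence for a synthesized-speech file."""
    # CTC models support character-level timestamps; for this model each "character" is a phoneme.
    result = asr(audio_path, return_timestamps="char")
    visemes = []
    for chunk in result["chunks"]:
        phoneme = chunk["text"].strip()
        if not phoneme:  # skip word separators
            continue
        visemes.append({
            "viseme_id": PHONEME_TO_VISEME.get(phoneme, 0),  # 0 = silence / unmapped
            "start": chunk["timestamp"][0],
            "end": chunk["timestamp"][1],
        })
    return visemes
```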

Notes on the server architecture 🛠️

I've implemented STV as a submodule of TTS to leverage the existing architecture (I didn't want to make any major edits here). However, I have ideas on how we could restructure the entire tool to make it more generalizable. I propose a sensor-engine-actuator framework (instead of just STT, LM, and TTS): the sensor includes submodules like STT and speech emotion recognition (which can also run in parallel), the engine comprises the LLM and application-specific rules, and the actuator includes TTS, viseme generation, and potentially more output instructions. Such a framework would enable emotion-aware agents, which are fundamental in many scenarios (e.g., therapy)! @andimarafioti Let's discuss this further if you're interested! 💬
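
To make the proposal slightly more tangible, here is a purely conceptual sketch of the sensor-engine-actuator wiring; every name and interface below is hypothetical, and nothing here comes from this PR or from the current library.

```python
# Conceptual sketch of the proposed sensor-engine-actuator split; all names are hypothetical.
from queue import Queue

def fan_out(source_q: Queue, sink_qs: list) -> None:
    """Copy each item to several consumers, e.g. raw audio to both STT and emotion recognition."""
    while True:
        item = source_q.get()
        for q in sink_qs:
            q.put(item)

def run_stage(fn, in_q: Queue, out_q: Queue) -> None:
    """Generic worker shared by sensors (STT, emotion recognition), the engine (LLM +
    application rules), and actuators (TTS, viseme generation): read, process, forward."""
    while True:
        out_q.put(fn(in_q.get()))

# Wiring idea (one thread per stage): audio -> {STT, emotion} -> engine -> {TTS, visemes} -> client
```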

Your feedback is welcome! 🤗

Updating the viseme branch
Updating the speech-to-speech fork with visemes
@fabiocat93 (Author)

FYI: I have updated the branch with the latest changes from the upstream branch. Feel free to review the code whenever you have time.

@andimarafioti (Member) left a comment


Generally, this looks good but there are some things that could be more solid. I would put that huge dictionary of mappings into a dictionary, and I would rethink a bit how we are unpacking things on the clients. There are a few parts commented out and things. I think Visemes is a great feature, but I would also untie it from the TTS_handlers. The model seems to only process the audio, so I don't see a reason why they couldn't work in the same way as all the other handlers and have something like piping the TTS to Visemes and then to the connector. That would make this implementation more in-line with the rest of the library, and it would make Visemes not affect any TTS. Plus, it would make extending TTSs simpler.

@fabiocat93 (Author)

Generally, this looks good but there are some things that could be more solid. I would put that huge dictionary of mappings into a dictionary, and I would rethink a bit how we are unpacking things on the clients. There are a few parts commented out and things. I think Visemes is a great feature, but I would also untie it from the TTS_handlers. The model seems to only process the audio, so I don't see a reason why they couldn't work in the same way as all the other handlers and have something like piping the TTS to Visemes and then to the connector. That would make this implementation more in-line with the rest of the library, and it would make Visemes not affect any TTS. Plus, it would make extending TTSs simpler.

Thank you @andimarafioti for the review. Overall, this was mainly a first attempt to explore the feasibility and latency of the speech-to-viseme conversion, so I was more focused on the output structure and seeing if you were happy with it. Since the latency seems acceptable and the output looks good, I can now refine the code with a stronger focus on readability and modularity.

More specifically:

I would put that huge dictionary of mappings into a dictionary.

I can definitely do this! It will clean up the code and make it easier to maintain.

I would rethink a bit how we are unpacking things on the clients. There are a few parts commented out and things.

Totally fair. Those were more for internal testing and to show how to access the info without overwhelming the terminal logs. I'll clean it up and find a better way to manage that information.

I think Visemes is a great feature, but I would also untie it from the TTS_handlers. The model seems to only process the audio, so I don't see a reason why they couldn't work in the same way as all the other handlers and have something like piping the TTS to Visemes and then to the connector. That would make this implementation more in-line with the rest of the library, and it would make Visemes not affect any TTS. Plus, it would make extending TTSs simpler.

I initially tied speech-to-visemes to the TTS handlers because certain commercial APIs (like Amazon Polly) offer TTS with time-stamped visemes, and I wanted to emulate that interface. But I agree with your suggestion of decoupling speech-to-visemes from TTS: it would improve modularity, simplify future extensions, and keep things consistent across the tool. I'll work on it.
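
For illustration, a decoupled handler could look roughly like the sketch below. The class name, the queue-based interface, and the None sentinel are assumptions meant to mirror the style of the library's other handlers, not its actual API; `speech_to_visemes` stands for any callable like the one sketched in the PR description.

```python
# Illustrative sketch of a standalone viseme handler sitting between TTS and the connector.
from queue import Queue

class SpeechToVisemesHandler:
    def __init__(self, queue_in: Queue, queue_out: Queue, speech_to_visemes):
        self.queue_in = queue_in                     # audio chunks produced by any TTS handler
        self.queue_out = queue_out                   # consumed by the socket/connector handler
        self.speech_to_visemes = speech_to_visemes   # callable: audio chunk -> viseme sequence

    def run(self):
        while True:
            audio_chunk = self.queue_in.get()
            if audio_chunk is None:                  # hypothetical stop sentinel
                break
            visemes = self.speech_to_visemes(audio_chunk)
            # Forward both so the client can lip-sync playback without an extra round trip.
            self.queue_out.put({"audio": audio_chunk, "visemes": visemes})
```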

@fabiocat93 (Author)

Generally, this looks good but there are some things that could be more solid. I would put that huge dictionary of mappings into a dictionary, and I would rethink a bit how we are unpacking things on the clients. There are a few parts commented out and things. I think Visemes is a great feature, but I would also untie it from the TTS_handlers. The model seems to only process the audio, so I don't see a reason why they couldn't work in the same way as all the other handlers and have something like piping the TTS to Visemes and then to the connector. That would make this implementation more in-line with the rest of the library, and it would make Visemes not affect any TTS. Plus, it would make extending TTSs simpler.

hi @andimarafioti, I have addressed all your points above. I hope you like my changes:

  • Integrated the speech-to-visemes (STV) module with its own handler, in line with the rest of the library
  • Read the phoneme-viseme mapping from a JSON file (a minimal sketch of this is shown after this list)
  • Added pre-commit hooks to fix spelling issues with codespell and style issues with ruff
  • Cleaned up the code some more (e.g., removed my old unnecessary print statements)
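
As a rough sketch of the JSON-based mapping mentioned above (the file name, layout, and helper are illustrative, not necessarily what the PR ships):

```python
# Illustrative sketch: load the phoneme -> viseme table from JSON instead of hard-coding it.
import json
from pathlib import Path

def load_phoneme_to_viseme(path: str = "phoneme_to_viseme.json") -> dict:
    """Return the phoneme -> viseme-ID mapping, with IDs coerced to int."""
    with Path(path).open(encoding="utf-8") as f:
        raw = json.load(f)                    # e.g. {"p": 21, "s": 15, "æ": 1, ...}
    return {phoneme: int(viseme_id) for phoneme, viseme_id in raw.items()}
```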
