Always-listening mode from multiple sources? #93
-
For my use case, I'm interested in real-time, always-on voice transcription, ideally from multiple devices (microphones). I intend to wire this up to an LLM, with custom "instruct" prompts to identify when the input requires a response. Willow would mainly need to be responsible for identifying the beginning and end of voice recordings, probably with silence detection. For input, I have an ESP BOX and some ESP Echos on the way, as well as Bluetooth speakers and USB microphones available. After reading through the issues and discussions, I have a few questions (in order of importance):
I think that's it for now - I know these are some pretty niche use cases! Very exciting project; I'm looking forward to contributing where I can!
-
Sorry, I was working my way up in unread notifications. I think I address most of this in my reply to #84.

I am confident you don't want silence detection but rather voice activity detection (VAD), and even that will be problematic for this use case (as I noted in #84).

Yes, WIS can support multiple sessions simultaneously. However, see my notes about your intended use case and the multi-model pipeline it would require (and the compute cost and imperfections each model adds). I'm losing track at this point, but I think you're looking at half a dozen fairly resource-intensive models which, when sequenced together, will expose the imperfections in each of them, compounding their errors, hallucinations, etc.

We currently suppress Whisper timestamps, but it's just Whisper, so you can get them with a tweak. Note that many people find them inadequate and end up doing all kinds of things with Silero and chunking to work around this.

We don't directly support diarization, but we do support speaker authentication/verification via WavLM to identify/verify the speaker. It supposedly supports diarization as well (makes sense), although I've never tried it for that specific use case.

WIS also has a REST API that supports a POST of nearly any audio file as input. It wouldn't take much to get that audio to WIS - or, if you get really ambitious (you seem like the type, hah), you can use our WebRTC audio streaming endpoint to make Whisper as "realtime" as it's going to get.
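To make the silence detection vs. VAD distinction concrete, here's a minimal sketch using Silero VAD loaded through torch.hub (this is the upstream silero-vad repo, not part of Willow/WIS; the file name is a placeholder):

```python
import torch

# Load Silero VAD from the upstream repo (downloads on first run).
model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# Silero expects 16 kHz mono float audio.
wav = read_audio("recording.wav", sampling_rate=16000)

# Unlike a simple energy/silence threshold, VAD scores frames for the
# presence of human speech, so fan noise or music won't hold a session open.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
for ts in speech_timestamps:
    print(f"speech from sample {ts['start']} to {ts['end']}")
```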
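On timestamps: plain openai-whisper (outside of WIS, which suppresses them) returns segment-level timestamps by default, and recent releases can add word-level ones. A quick illustration, not Willow's actual pipeline:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
# Segment timestamps always come back; word_timestamps=True additionally
# emits per-word times in recent openai-whisper releases.
result = model.transcribe("recording.wav", word_timestamps=True)

for seg in result["segments"]:
    print(f"[{seg['start']:6.2f} -> {seg['end']:6.2f}] {seg['text'].strip()}")
```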
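For the WavLM speaker verification idea, a hedged sketch using the public x-vector checkpoint on Hugging Face (the checkpoint name and the 0.86 threshold come from that model card, not necessarily what WIS ships; file names are placeholders):

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

# Public speaker-verification checkpoint; WIS may use a different one.
CKPT = "microsoft/wavlm-base-plus-sv"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
model = WavLMForXVector.from_pretrained(CKPT)

def embed(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                       # (channels, samples)
    wav = torchaudio.functional.resample(wav, sr, 16000)  # model expects 16 kHz
    inputs = extractor(wav.mean(dim=0).numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).embeddings[0]

# Cosine similarity between an enrollment clip and a new utterance.
score = torch.nn.functional.cosine_similarity(
    embed("enrolled_speaker.wav"), embed("new_utterance.wav"), dim=-1
)
print("same speaker" if score > 0.86 else "different speaker")
```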
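And the REST route really is just an HTTP POST of audio bytes. A sketch under the assumption of an `/api/willow`-style endpoint; verify the actual route and parameters against the WIS docs for your deployment:

```python
import requests

# Hypothetical host and route: check the WIS docs/README for the real
# ASR endpoint and any query parameters on your deployment.
WIS_URL = "https://my-wis-host:19000/api/willow"

with open("recording.wav", "rb") as f:
    resp = requests.post(WIS_URL, data=f, headers={"Content-Type": "audio/wav"})
resp.raise_for_status()
print(resp.text)  # transcription result
```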