Always-listening mode from multiple sources? #93
-
For my use case, I'm interested in real-time, always-on voice transcription, ideally from multiple devices (microphones). I intend to wire this up to an LLM, with custom "instruct" prompts to identify when the input requires a response. Willow would mainly need to be responsible for identifying the beginning and end of voice recordings, probably with silence detection. For input, I have an ESP BOX and some ESP Echos on the way, as well as Bluetooth speakers and USB microphones available. After reading through the issues and discussions, I have a few questions (in order of importance):
I think that's it for now - I know these are some pretty niche use cases! Very exciting project; I'm looking forward to contributing where I can!
-
Sorry, I was working my way up in unread notifications. I think I address most of this in my reply to #84.

I am confident you don't want silence detection but rather voice activity detection (VAD), and even that will be problematic for this use case (as I noted in #84).

Yes, WIS can support multiple sessions simultaneously. However, see my notes about your intended use case and the multi-model pipeline it would require (and the compute cost and imperfections each model adds). I'm losing track at this point, but I think you're looking at half a dozen fairly resource-intensive models which, when sequenced together, will expose the imperfections in each of them, compounding their errors, hallucinations, etc.

We currently suppress Whisper timestamps, but it's just Whisper, so you can get them with a tweak. Note that many people find them inadequate and end up doing all kinds of things with Silero and chunking to work around this.

We don't directly support diarization, but we do support speaker authentication/verification via WavLM to identify/verify the speaker. It supposedly supports diarization as well (makes sense), although I've never tried it for that specific use case.

WIS also has a REST API that supports a POST of nearly any audio file as input. It wouldn't take much to get that audio to WIS - or, if you get really ambitious (you seem like the type, hah), you can use our WebRTC audio streaming endpoint to make Whisper as "realtime" as it's going to get.
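To make the silence detection vs. VAD distinction concrete, here's a minimal sketch using Silero VAD loaded through torch.hub (this is the upstream silero-vad repo, not part of Willow/WIS; the file name is a placeholder):

```python
import torch

# Load Silero VAD from the upstream repo (downloads on first run).
model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# Silero expects 16 kHz mono float audio.
wav = read_audio("recording.wav", sampling_rate=16000)

# Unlike a simple energy/silence threshold, VAD scores frames for the
# presence of human speech, so fan noise or music won't hold a session open.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
for ts in speech_timestamps:
    print(f"speech from sample {ts['start']} to {ts['end']}")
```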
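On timestamps: plain openai-whisper (outside of WIS, which suppresses them) returns segment-level timestamps by default, and recent releases can add word-level ones. A quick illustration, not Willow's actual pipeline:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
# Segment timestamps always come back; word_timestamps=True additionally
# emits per-word times in recent openai-whisper releases.
result = model.transcribe("recording.wav", word_timestamps=True)

for seg in result["segments"]:
    print(f"[{seg['start']:6.2f} -> {seg['end']:6.2f}] {seg['text'].strip()}")
```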
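For the WavLM speaker verification idea, a hedged sketch using the public x-vector checkpoint on Hugging Face (the checkpoint name and the 0.86 threshold come from that model card, not necessarily what WIS ships; file names are placeholders):

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

# Public speaker-verification checkpoint; WIS may use a different one.
CKPT = "microsoft/wavlm-base-plus-sv"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
model = WavLMForXVector.from_pretrained(CKPT)

def embed(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                       # (channels, samples)
    wav = torchaudio.functional.resample(wav, sr, 16000)  # model expects 16 kHz
    inputs = extractor(wav.mean(dim=0).numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).embeddings[0]

# Cosine similarity between an enrollment clip and a new utterance.
score = torch.nn.functional.cosine_similarity(
    embed("enrolled_speaker.wav"), embed("new_utterance.wav"), dim=-1
)
print("same speaker" if score > 0.86 else "different speaker")
```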
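And the REST route really is just an HTTP POST of audio bytes. A sketch under the assumption of an `/api/willow`-style endpoint; verify the actual route and parameters against the WIS docs for your deployment:

```python
import requests

# Hypothetical host and route: check the WIS docs/README for the real
# ASR endpoint and any query parameters on your deployment.
WIS_URL = "https://my-wis-host:19000/api/willow"

with open("recording.wav", "rb") as f:
    resp = requests.post(WIS_URL, data=f, headers={"Content-Type": "audio/wav"})
resp.raise_for_status()
print(resp.text)  # transcription result
```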