Wyoming Integration #135

hornej · 2023-10-25T05:09:45Z

Home Assistant recently announced wake word support in their latest Year of the Voice update, which brought attention to the Wyoming protocol, a "peer-to-peer protocol for voice assistants".

It would be awesome to be able to hook up WIS as the speech to text inference server for Home Assistant using the Wyoming protocol since WIS is way faster than a Raspberry Pi 4 (which is what I am using).

Are you guys planning on a Wyoming integration anytime soon? I've seen mention of the WAS protocol but haven't found documentation or implementation details for it.

Home Assistant also mentioned they are targeting the ESP32-S3-Box-3 for future voice satellite improvements.

kristiankielhofner (Maintainer) · 2023-10-25T14:31:51Z

I saw that! Fair warning: this will be long. As with some of my other rants on these issues and discussions, you're the first to ask this question but I'm sure you won't be the last, so this response will serve as a draft of what we will eventually make available on heywillow.io and elsewhere :).

TLDR: we have no plans to support Wyoming and likely never will. If you want to know why, read on.

A little background/clarification:

Both WIS and faster-whisper use the same underlying engine (ctranslate2). WIS is very slightly faster because of some optimization work we've done but they are extremely close on CPU.
Where WIS really pulls ahead is the finely tuned and even more highly optimized Nvidia CUDA GPU support. Long story short speech recognition is for GPUs, not CPUs. It sounds like you may have already seen them but if not you can view WIS benchmarks and comparison benchmarks to see how CPU compares. For the models I consider usable for voice assistant tasks (small beam 2 at a very minimum) all but the fastest CPUs available (like Threadripper) are essentially unusable and even Threadripper isn't great. We have seen this with Willow for months and from what I can tell on HA reddit, community forum, github issues, etc people are realizing that tiny and even base work very poorly for voice assistant tasks in the real world - often to the point of unusable. If you want to get really wild my local RTX 3090 (not fair) holds the current record in voice command completion at 108ms from wake to command completion on HA which is completely unnecessarily fast. However, I also know that as we add functionality this floor will likely only go up. We intend to surpass Alexa/Google Home/etc functionality while maintaining the highest quality and fastest response times anywhere. We count in single digit milliseconds.
In my view the Raspberry Pi (even the 5) and similar hardware are non-starters, with medium likely taking at least 25 seconds (on the 5) to transcribe 3.8 seconds of speech. It's somewhat counter-intuitive but shorter speech segments have lower realtime multiples (see benchmarks), so it's even worse with many speech commands being in the sub-two-second range. Many Willow+WIS users go "all out" with large-v2 and beam size 5 on GPU because it's so fast it doesn't really matter, with even seven-year-old used $100 Pascal GPUs going from wake to TTS response in well under one second with the largest models. GPUs have wildly different architectures compared to CPUs and, as I like to say, if you bring a CPU to a GPU fight you're going to lose. It's all about using the right tool for the job. The usable demo videos and user reports with Raspberry Pi and similar are almost certainly all using the Nabu Casa (Azure) cloud.
Overall, Willow is as fast, accurate, and usable as it is because of the overall design and entire ecosystem - Willow, WAS, and WIS. With the current approach Home Assistant uses (Wyoming, etc) even with WIS it will always be (much) slower and less accurate than the equivalent experience using Willow. Accuracy and general usability will also likely be worse because of fundamentals like audio hardware, frontend processing, overall architecture, and protocol design. HA Voice and Wyoming are extremely new but I'm pretty confident that over time there will need to be some fundamental re-thinking to that approach.
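
To make the "realtime multiple" mentioned above concrete, here is a minimal sketch of the arithmetic using the figures quoted in this comment (roughly 25 seconds to transcribe 3.8 seconds of speech with medium on a Raspberry Pi 5). The function name and the convention used (audio duration divided by processing time) are assumptions for illustration; the WIS benchmarks may define the multiple differently.

```python
def realtime_multiple(audio_seconds: float, processing_seconds: float) -> float:
    """Return seconds of audio transcribed per second of compute.

    Values above 1.0 are faster than realtime; values below 1.0 are slower.
    This convention is an assumption for illustration.
    """
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Figures quoted above: ~25 s to transcribe 3.8 s of speech (medium, Raspberry Pi 5).
pi5_medium = realtime_multiple(audio_seconds=3.8, processing_seconds=25.0)
print(f"realtime multiple: {pi5_medium:.3f}x")
print(f"slowdown vs realtime: {1 / pi5_medium:.1f}x")
```

Under this convention a multiple well below 1.0 means the user waits several times the length of the command itself before anything happens.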

With that, I've done extensive research and testing with Home Assistant Voice and its component pieces, including Wyoming. They are both a tremendous achievement and are very interesting, but they are problematic in terms of aligning with the overall goals of Willow. In no particular order:

More-or-less constant audio streaming from all devices. This incurs substantial performance and usability penalties because Wyoming uses an audio frame packetization interval that results in relatively high bandwidth usage and packet processing overhead everywhere from wifi to the system(s) dealing with the ~35 packets per second per device/stream. It seems that HA Voice and Wyoming have implemented some VAD approaches on the satellite hardware but these have accuracy and optimization issues. For example, VAD can be kind of "finicky" generally and if you have any kind of conversation (from people talking to music, podcasts, TV, etc) within audio range of satellite devices, VAD will be activating/deactivating all over the place, rendering the implementation of it moot. Because HA does extremely latency-sensitive wake word detection on the host post-stream, there's really no way to work around this. In my testing with this scenario, five Atom Echo devices used roughly 72% CPU time across all cores on a Raspberry Pi 4 just to process the constant incoming audio without wake, leading to situations where even the HA web interface became unresponsive and command processing even with tiny took tens of seconds. With Willow, host CPU usage is zero with any number of devices in any scenario until on-device wake and then VAD activation.
Protocol design and implementation. Wyoming uses UDP for audio transport which is problematic for a few reasons:

Absent any retransmission handling from higher layers in the stack UDP doesn't retransmit. Generally speaking UDP is great for bi-directional audio streaming for conversational audio flows between humans and that's why it's used there. If a packet is lost in a bidirectional real-time conversation you don't want to retransmit because the conversation has already "moved along". However, for single direction async command-response flows like voice assistant tasks protocols with retransmission logic are superior and more robust. Wifi (especially with 2.4 GHz) is pretty prone to packet loss and this can also lead to fundamental accuracy and usability issues.
UDP without fairly complex higher layer application-specific handling (like DTLS) doesn't support encryption. From what I've seen Wyoming doesn't support encryption (esphome can use NOISE generally but not for voice to my knowledge) and I view this as problematic for obvious reasons. Even if Wyoming were to implement encryption somehow the fundamental packetization I mentioned above would add significant processing and latency overhead because of the packetization rate and flow. Roughly 35 packets per second only gets worse when the packets are larger and systems in the audio path need to encrypt/decrypt them at that rate, especially for the host-based wake word detection.
Multiplexing of disparate UDP streams over a single port is challenging and often impossible. From what I've seen Wyoming uses separate UDP ports for each voice session/device (which is very common/standard with UDP approaches like VoIP, WebRTC, etc). However, this introduces various issues with firewalls, docker, etc. We've seen with Willow that a surprising number of Home Assistant users have very complex firewall and network configurations and supporting them with Wyoming would be challenging.
Performance and bandwidth utilization. It's also slower and uses more bandwidth because each (very small) audio frame incurs the header/frame encapsulation overhead from each underlying layer (1-3 in the OSI model). This contributes to both packet processing overhead and bandwidth utilization everywhere.

Ecosystem support. Wyoming is very, very new and relatively obscure. Willow and WIS use bog-standard HTTP/HTTPS for all audio transport and streaming, which everything in the commercial world and elsewhere has used for decades for very good reasons. HTTP is a very well understood and robust protocol (obviously) that is able to leverage the entire HTTP ecosystem which is essentially second to none. It also inherently supports multiple sessions over a single port, TLS, authentication, chunking, etc. We also benefit from the very standard and widespread support of hardware accelerated TLS and encryption.
Security. In addition to the encryption issues, Willow devices do not listen on any network sockets. This has substantial advantages for everything from security to network handling as Willow devices only initiate outbound connections over single ports. With the UDP streaming approach, devices on each end of the connection basically throw/"spray and pray" packets at dedicated device/session UDP listening ports on each side, which has implications for everything from security to network compatibility.
Network support. Willow has been in the real world with real users for six months. We've seen that many Willow users have somewhat esoteric network configurations that involve NAT, firewalls, and more. Willow users are not unique in this regard as they're also Home Assistant users and, with the fundamental design of Wyoming, supporting these network configurations ranges from difficult to impossible. The VoIP and WebRTC worlds have necessary yet completely ridiculous things that have been developed over decades in trying to address this. I've lost years of my life trying to get this approach to work universally.
Extensibility and feature support. Willow, WIS, and WAS already do things like speaker identification, multiple simultaneous wake session handling, etc. If you haven't already seen it you can check out my latest demo video. It's not clear to me how or if Wyoming and HA Voice can fundamentally support this functionality - and as a six month old project we're only just getting started. Stay tuned!
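
The points above about HTTP (a single outbound connection, many sessions over one port, TLS, chunking) can be sketched with Python's standard library. This is a minimal illustration only, not Willow's or WIS's actual client code: the function names, the 1400-byte chunk size, and the host and path in the usage comment are all hypothetical.

```python
import http.client
from typing import Iterator, Tuple


def pcm_chunks(pcm: bytes, chunk_bytes: int = 1400) -> Iterator[bytes]:
    """Yield successive slices of an audio buffer, each at most chunk_bytes long.

    The 1400-byte default is an arbitrary illustrative value.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


def stream_audio(host: str, port: int, path: str, pcm: bytes) -> Tuple[int, bytes]:
    """POST audio over one outbound HTTPS connection and return (status, body).

    The client only dials out; it never listens on a socket. TCP underneath
    retransmits lost segments and TLS encrypts the stream.
    """
    conn = http.client.HTTPSConnection(host, port, timeout=10)
    try:
        # An iterable body with no Content-Length header makes http.client
        # send the request with Transfer-Encoding: chunked (Python 3.6+).
        conn.request(
            "POST",
            path,
            body=pcm_chunks(pcm),
            headers={"Content-Type": "application/octet-stream"},
        )
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()


# Hypothetical usage (host and path are placeholders, not a real endpoint):
# status, body = stream_audio("stt.example.invalid", 443, "/speech", pcm_buffer)
```

Because the device only opens an outbound connection, nothing on it needs to accept inbound traffic, and the server can handle any number of such requests on its one HTTPS port.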

Overall, Willow and WIS are much more performant because we do things like stuff audio frames into packets at MTU, plus everything else I've described. They're also much more robust (accurate) even in the most challenging conditions because HTTP/HTTPS use TCP (of course) with retransmissions. Willow and WIS also support high-quality and performant audio compression to reduce wifi airtime and network utilization even further, which some Willow users enable because they have especially challenging wifi environments - situations where the 2.4 GHz wifi spectrum is completely stomped on with dozens of visible networks and overall competition for airtime. Implementing this with Wyoming and the HA Voice ecosystem would also likely be challenging and is currently unsupported.
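
As a rough illustration of why retransmission matters for short voice commands, the sketch below estimates the chance that at least one packet of an utterance is lost when nothing retransmits. It uses the ~35 packets per second figure from this comment; the assumption that losses are independent, the helper names, and the example loss rates are illustrative only, not measurements.

```python
import math


def packets_in_utterance(duration_s: float, packets_per_second: float = 35.0) -> int:
    """Number of packets needed to carry an utterance of the given length."""
    return math.ceil(duration_s * packets_per_second)


def prob_any_loss(num_packets: int, loss_rate: float) -> float:
    """Probability that at least one packet is lost.

    Assumes each packet is lost independently with probability loss_rate,
    which is a simplification (real wifi loss tends to be bursty).
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    return 1.0 - (1.0 - loss_rate) ** num_packets


# A two-second command at ~35 packets per second.
n = packets_in_utterance(2.0)
for loss_rate in (0.001, 0.01, 0.05):  # illustrative loss rates, not measured
    p = prob_any_loss(n, loss_rate)
    print(f"{n} packets, {loss_rate:.1%} loss: {p:.1%} chance of a gap in the audio")
```

With a transport that retransmits, a lost segment costs some delay instead of a hole in the audio handed to the speech recognition model.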

So, the concern overall with Wyoming and HA Voice ecosystem support is that (in my view) it's fundamentally incapable of meeting our goals. This is a very good and valid question but overall I don't want to create a scenario where the fundamental user experience with WIS and HA Voice doesn't (and almost certainly can't) compare to native Willow. Also note I'm in no way an authority on Wyoming/HA Voice or a true expert with them so I may be wrong on some of these points but based on documentation, code review, and testing I believe all of this is reasonably accurate.

Willow is a very small team with three less-than-fulltime developers. Certainly at this time we need to focus our energies and time on our project goals and users. Our goal is to truly be the best voice interface in the world - even beating commercial implementations. We have more than enough to do just to catch up with the tens/hundreds of millions of dollars over nearly a decade Amazon has put into Alexa ;).

In terms of the BOX-3, I'm not sure what you mean here. Willow has had full support for the BOX-3 since mid-September and obviously the BOX and BOX-Lite since inception. We've moved on to supporting even more hardware like the M5Stack CoreS3.

I understand this may sound critical as I tend to write very matter-of-factly but I want to be very clear: what HA has done is fantastic and I support anything that enables the goals we're all interested in - privacy, local hosting, and flexibility. The HA Voice approach has other advantages over Willow (like ease of use and tight integration) and in the end users should use what works best for them and their situation. Willow has no HA user monetization approaches (and never will) or other motivations to try to convince people we're "better" because we're not - all of this is very user and situation specific. Willow is designed with the approach I've described because our monetization strategy is commercial and "enterprise" applications, the only markets I understand and have experience with.

I have many hard-learned lessons from 20 years of experience with the intersection of voice, audio, networks, and systems. In the end these points may not have much meaningful difference to end users but in my experience they almost always do, with the difference often being it works or doesn't.

hornej (Author) · 2023-10-29T04:10:18Z

@kristiankielhofner I appreciate the very thorough response! It is very impressive what you guys have built. I have tested various demos like Picovoice, Sensory, OpenWakeWord, and DSP Concepts on different hardware platforms like STM32, Arduino, XMOS, and Raspberry Pi, and have still had the best speech-to-text results from Willow Inference Server. Whisper just seems to work very well with a variety of mics and without requiring any extra DSP.

I recently have been using the M5 Atom Echo with HA and the out of the box experience (haven't messed with any of the voice settings yet) using both Home Assistant Cloud and faster-whisper has sadly been underwhelming. I am excited for my ESP32-S3-BOX-3 to arrive so I can test out the full Willow experience!

1 reply

kristiankielhofner Oct 29, 2023
Maintainer

You're welcome, and thanks! I'm glad WIS is working well for you.

While Whisper is fairly robust to noise, etc we find the audio frontend processing from the ESP-BOX and overall hardware configuration to be key in terms of "true" far-field voice. Far-field voice is generally defined as "up to five meters" away. No one outside of OpenAI is quite sure what dataset(s) Whisper was trained on but it stands to reason there probably weren't a lot of samples of speech from 5 meters away (if any) with the various levels of acoustic echo inherent to those types of samples. For example, one of my more challenging tests for Willow is speaking from the end of a ~5 meter hallway (all "hard" surfaces) with the ESP-BOX/speaker around a corner. It's not perfect but it works surprisingly well, with only my various XMOS platforms (XVF3510, XVF3500, and to a lesser extent the XVF3000) being comparable. As you likely know, at low quantities these platforms (XMOS dev boards, USB Respeaker V2) cost more than the entire ESP-BOX for the audio hardware alone. Audio gain/level is another issue at these distances.

The Atom Echo and "throw a microphone on a Pi" can be usable under ideal conditions from up to a couple of meters away but the next issue is the available models and resulting performance when using Whisper on CPU (as in faster-whisper with HA Voice). As I noted in my previous comment, in our and our users' experience the Whisper small model with beam size 2 should be considered the minimum for typical conditions and voice assistant tasks, which is more-or-less a non-starter on all but the most capable and expensive CPUs. The overall HA Voice implementation can apply AEC in software but this introduces even more performance and resource utilization issues. I haven't done a significant amount of testing with it but I also suspect it doesn't work as well as the Espressif ADF/ESP-BOX implementation and overall hardware configuration. Of course, with the GPU emphasis of WIS, users are free to experiment with larger models and beam sizes while still maintaining the level of responsiveness they expect from commercial solutions. When you stack up these fundamentals plus the overall implementation and architecture (Wyoming) I'm not surprised to hear your experience with the HA Voice approach is, as you say, "underwhelming". In perusing the feedback from the HA community this seems to unfortunately be fairly typical.

I should point out the production version of the ESP-BOX-3 uses slightly different hardware from the pre-release version. We will be publishing a release this week that fully supports it. Let us know how things go with Willow when you receive your ESP-BOX-3!
