Common problems with wakeword example #301
Closed
StuartIanNaylor started this conversation in Dev
Replies: 1 comment 2 replies
make a OVOS plugin 😉
https://github.com/StuartIanNaylor/wake_word_capture
It contains 5 training runs and the resultant tflite files are included.
It has a brief introduction to https://github.com/google-research/google-research/tree/master/kws_streaming, which was used for model training.
The framework has a number of models to choose from (some don't work), but for the parameters/accuracy trade-off I have always found a CRNN a good bet.
It's slightly better than the GRU of Precise, as combining a CNN and GRU creates a wakeword model that is slightly lighter (fewer params) and a touch more accurate.
https://github.com/google-research/google-research/tree/master/kws_streaming contains some very good benchmarks of likely every wakeword model, which should be of interest.
I will add a BcResnet, as it is probably the current king: it manages very high accuracy but is extremely light at 10K params vs the included CRNN's 400K. I'm still getting to grips with Torch, as I have trained one but haven't exported a quantised model to Onnx; previously I have purely used tensorflow.
Onnx is probably the way to go, as RKNN and Espressif are actively supporting it, whilst tflite seems a bit of a dead end unless converted.
It's all something I did 3 years ago, trying to work out why many opensource KW models are often bad and prone to false positives.
There seems to be a myth that you can just pour a sample soup into an 'unknown' category, add your KW class (and maybe a noise/silence category), and hey presto.
For basic classification models, which is what all lightweight KW models are (there are transformer wakewords, but they are not lite), this is a huge class imbalance that often fails because of the softmax calculation: a false positive occurs not because the KW got a good hit but because the other classes got extremely few feature hits. The softmax, roughly the KW score divided by the total of all class scores, then triggers anyway...
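A quick numpy sketch of that failure mode (my own illustration, not code from the repo): with only a couple of competing classes, the KW can win the softmax off a mediocre score simply because everything else scored worse, whereas adding phonetically similar classes soaks up the shared features.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Hypothetical logits for a NON-keyword utterance that shares a few features with the KW.
# 3-class model [kw, notkw, noise]: kw only gets a weak hit, but the others get none,
# so softmax still pushes the kw probability to ~0.88 and a 0.8 threshold would trigger.
print(softmax(np.array([1.5, -1.0, -1.5])))

# 7-class model [kw, likekw1, likekw2, notkw, 1syl, phon, noise]: the phonetically
# similar classes absorb the shared features and the kw probability collapses to ~0.32.
print(softmax(np.array([1.5, 1.2, 1.0, -1.0, 0.5, 0.2, -1.5])))
```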
The solution is simple: create less binary KW models with more classes. The repo above is purely an example of training runs, with logs and models, so you can see that the more classes you add, the less you get that hockey-stick instant accuracy of overfitting.
Basic classification models are very much a result of the dataset you provide, and wakeword is a subset of image-style classification that benefits from having many classes, which force a training run to search harder because there is much more cross entropy.
When you have a few very different classes, just KW vs every other feature the words of a full language provide, training will easily find simple distinctions and quickly overfit, giving 100% accuracy. You can point tensorboard at the included logs dir to see the training graphs.
The 1st training run uses a fairly common Kw, notKw, Noise class format; the next adds 2x LikeKw classes, which are just phonetically similar words that are excluded from notKw. This greatly increases cross entropy and forces the model to train harder, but it also massively reduces the chance of only Kw getting a feature hit and softmax triggering a false positive.
NotKw picks words of similar syllable count to provide basic matching of probable spectra to the KW. The 3rd training run adds a 1syl class: wakeword features are just basic features like edges, textures, and shapes, so single-syllable words should force training to look at the full edge of 1syl vs the KW syllables as represented in spectra.
Then finally, for want of a better word, phon is added: a concatenation of words into continuous speech, the spoken-word equivalent of the Noise class in that it is full-frame speech spectra vs full-frame noise spectra. It is also the opposite of 1syl, providing more syllable spectra than the KW.
Each training run is included, with examples of finished tflite models in streaming and non-streaming varieties, quantised and non-quantised, plus a labels.txt for the index ordinals of the classes used.
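If you want to poke at one of the included non-streaming tflite files, a minimal sketch along these lines should work (the file names here are placeholders for whatever is in the training-run folders, and the input shape/dtype is read from the interpreter rather than assumed):

```python
import numpy as np
import tflite_runtime.interpreter as tflite  # or: from tensorflow import lite as tflite

# Placeholder paths: point these at a model + labels.txt from one of the training runs.
interpreter = tflite.Interpreter(model_path="non_stream.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

labels = [line.strip() for line in open("labels.txt")]

# Dummy input of the expected shape/dtype; replace with real 16 kHz mono audio/features.
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()
scores = interpreter.get_tensor(out["index"])[0]
print(dict(zip(labels, np.round(scores, 3))))
```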
It's also basically something I did 3 years ago but stopped, due to the lack of KW word datasets and the poor quality of ML-Commons, with the bad labelling and alignment that forced alignment often gives.
When I saw that microWakeWord, which HA is using, was getting many complaints about wakeword performance, I thought I would take a look, and the problem was obvious instantly, purely from listening to the samples that their training script creates as the base for training. They do a lot of weird stuff, but the 1000 voices have very little difference in prosody, and it's no surprise that you have to talk in an American English accent quite rapidly, because that is what the Piper model provides.
It interested me though, as I had previously been creating a database of ML-Commons to select words by phone/syllable, and trying synthetic data to create clean samples rather than the dross of what is available. You get no control over noise and reverberation with the available datasets, which contain a ton of badly recorded audio. With TTS you can create clean samples and be accurate with additive augmentation that is balanced throughout your dataset, which you cannot do otherwise: there is zero metadata, and once it is a mono audio file it is near impossible to measure and differentiate.
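That is the real win of starting from clean synthetic samples: because the signal is known clean, you can mix noise in at an exact level. A rough sketch of SNR-controlled additive augmentation (my own illustration, not the repo's code, assuming float32 audio in [-1, 1]):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to known-clean speech at an exact SNR."""
    noise = np.resize(noise, speech.shape)              # loop/trim noise to match length
    speech_rms = np.sqrt(np.mean(speech ** 2) + 1e-12)
    noise_rms = np.sqrt(np.mean(noise ** 2) + 1e-12)
    target_noise_rms = speech_rms / (10 ** (snr_db / 20))
    mixed = speech + noise * (target_noise_rms / noise_rms)
    return np.clip(mixed, -1.0, 1.0)                    # assumes float32 audio in [-1, 1]
```

With recorded datasets the sample already contains unknown noise and reverb, so anything you add on top is a guess; with clean TTS the mix ratio is exact and can be balanced across every class.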
Synthetic data works great, but you need to use 'better', bigger TTS rather than the more toy-like TTS that are sized to run on low-resource embedded devices such as a Pi3 or above.
I stopped at just under 4000 voices, purely as an example:
Coqui ⓍTTSv2: 870
Emotivoice: 1932
Piper (Sherpa): 904
Kokoro v1: 53
Kokoro v1.1: 103
Kokoro_en: 11
VCTK: 109
I am sure there are many more TTS that can provide many more voices, as 4000 really is far too few to represent all the different prosody that dialects and regions can provide, but Coqui xTTSv2 is great for spoken English with an accent, for those that use English as a 2nd language.
The above is just an example, but yes, you can use synthetic data. Emotivoice is great as it has a huge choice of roughly 2000 voices, and strangely, the Piper model from the Sherpa Onnx TTS models (https://k2-fsa.github.io/sherpa/onnx/tts/index.html), purely from listening, gives far more prosody than the HA training script, so Piper has been used here, just not what they call the 'sample generator', which is far too basic in prosody.
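To give the flavour of the synthetic-sample generation, a sketch assuming the Coqui TTS python API (the model name, speaker-list attribute and keyword are assumptions/placeholders and may differ by version):

```python
import os
from TTS.api import TTS  # coqui-tts package

# XTTSv2 ships with several hundred built-in speakers (the ~870 counted above).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

keyword = "hey jarvis"  # placeholder keyword, substitute your own
os.makedirs("kw", exist_ok=True)
for i, speaker in enumerate(tts.speakers):
    tts.tts_to_file(text=keyword, speaker=speaker, language="en",
                    file_path=f"kw/xtts_{i:04d}.wav")
```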
The "words.db" with the word/sylable/phone dictionary created with class views is included which is a SqLite database used to create class word lists.
What is provided is just an example, as more voices will likely help, but it is a proof of concept to show that many KW models have a huge class imbalance by having far too few classes.
The more the merrier, and a KW model would benefit from having multiple keywords, as long as they are similar in syllable construction and length but unique in phonetics. You will increase the accuracy of the KW, and a 3-keyword model is very doable with no increase in size or computation, giving a big increase in classes with supporting LikeKw in a single model.
I thought I would share, as ML and KW became a bit of an interest: learning about basic ML principles and how to interpret what your training curve is telling you, overfitting and class imbalance.
I am not actually providing, nor do I intend to provide, a 'product'.
I will at some point include the BcResnet in Onnx format, probably in the next couple of days, and will leave it to others to go in search of TTS with more prosody, as there is a terrible habit of TTS not just being US English but a very neutral form that I call 'TV English', which is likely the same in any language because of the source datasets available.
There is a ton of ultra-accurate one-shot TTS and there are many datasets; it is just pain and work finding and collating them, or finding other multi-voice TTS that do provide much variance in prosody and are indistinguishable from real speech in spectra, especially when quantised to MFCC.
https://github.com/Qualcomm-AI-research/bcresnet
It is trained and tests accurate, but I still have to work out the convert-and-quantise-to-Onnx code.
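The usual Torch route is roughly the following (a sketch only: the stand-in module, input shape and class count are assumptions, and a conv-heavy BcResnet will probably want static quantisation with a calibration set rather than the quick dynamic pass shown):

```python
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType

# Stand-in for the trained BC-ResNet nn.Module so the sketch is self-contained;
# swap in the real model built/trained from the Qualcomm repo code.
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 7),          # 7 classes, matching a labels.txt of that length
).eval()

dummy = torch.randn(1, 1, 40, 101)   # assumed (batch, channels, mel_bins, frames) input
torch.onnx.export(model, dummy, "bcresnet.onnx",
                  input_names=["mels"], output_names=["logits"],
                  opset_version=17)

# Quick weight-only int8 pass; for real conv speed-ups use quantize_static
# with a calibration data reader instead.
quantize_dynamic("bcresnet.onnx", "bcresnet.int8.onnx", weight_type=QuantType.QInt8)
```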
Also included, the demo script https://github.com/StuartIanNaylor/wake_word_capture/blob/main/kws-stream-avg.py demonstrates how easy it is to capture the spoken wake word with a streaming model, which gives high polling rates at low computation; capture is far less accurate with rolling-window style KW models, due to their slow polling rates and much greater shift in the time of capture.
The best data is from the device of use, as it contains the artefacts of the microphones, device and filters in use; maybe not on-device training, but local training can be used to continuously adapt to the user's voice and environment of use.
It also provides a simple mechanism to stop double triggers: the more accurate the KW model, the quicker it will detect the KW from less input, so you need some sort of mechanism to let most of the KW pass, or double (or even more) triggers can happen.
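Not the repo script itself, but the gist of averaging a fast-polling streaming model and suppressing double triggers looks something like this (the window, threshold and hold-off values are made up for illustration):

```python
import collections
import time
import numpy as np

WINDOW = 8         # average the last N outputs (e.g. N x 20 ms hops = 160 ms)
THRESHOLD = 0.85   # assumed trigger level for the averaged kw probability
HOLD_OFF_S = 1.5   # refractory period so one utterance cannot fire twice

recent = collections.deque(maxlen=WINDOW)
last_trigger = 0.0

def on_model_output(kw_prob: float) -> None:
    """Call once per streaming inference with the kw softmax probability."""
    global last_trigger
    recent.append(kw_prob)
    avg = float(np.mean(recent))
    if avg > THRESHOLD and (time.monotonic() - last_trigger) > HOLD_OFF_S:
        last_trigger = time.monotonic()
        print("wake word detected, averaged prob:", round(avg, 3))
```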
There are limits to how much noise and reverb you can mix before the distinctness of a KW is lost; it seems approx 30% noise and above becomes problematic, but post-filter you should not get to those levels. The above is recorded with no reverb and 30% noise, as the expectation is purely to hook up a microphone and test without the hassle of including a filter such as PiDTLN or https://github.com/Rikorose/DeepFilterNet/tree/main/DeepFilterNet, which is near RTX Voice standards but runs only on a single big core.
The same goes for source separation to cope with doubletalk, such as https://github.com/yluo42/TAC, which is excellent as it can use invariant or fixed wide-array microphones. I have trained on x2 mics, but that does reduce the SNR capability to approx half of what x6 provides; I guess there is a sweet spot somewhere between the two.
Also, don't stick your microphones on top of a speaker unless you want to create massively larger problems in filter complexity and voice filtering/extraction. Create simple 'ears' that only use local speech enhancement and send clean, unprocessed audio to an upstream authoritative speech enhancement and wake word model. Opensource doesn't need to mimic consumer product: it can create low-cost edge 'ears' that broadcast, using a low-sensitivity wakeword as a low-energy 24/7 input, to multiplex zonal microphones into an upstream central speech system, and use something like https://github.com/badaix/snapcast, which is more than just smart-speaker output; it is an audiophile-grade total audio system for multiroom. On a Pi it is very low resource, it allows multiple servers with multiple clients, it can be as complex or as simple as you desire, and it supports multichannel 96kHz audio with ultra-tight, non-skipping latency compensation.
Even enclosures create huge design problems in stopping resonance passing from speaker to microphone, which you can see in the complex construction in teardowns of later-gen Alexa/Nest smart speakers; it is beyond simple maker enclosure builds. You just don't need that if you separate microphone and speaker, and simpler, lower-cost devices can quickly create multiple point-source, wide-area arrays to feed the central, more compute-heavy parts of the pipeline where ASR/LLM might reside, with the wireless audio being Squeezelite or my personal fave snapcast, which is wonderful Sonos-beating opensource.
PS: a cheap microphone (https://www.adafruit.com/product/1713?srsltid), with its silicon analog 1st-stage AGC, and a similarly cheap CM108 card (https://www.aliexpress.com/item/1005007032785063.html), together with PiDTLN, will beat the XMOS offerings such as Voice PE or FutureProofHomes/Satellite1 hands down, because the reality is they are not very good, at least currently. Coupled with digital AGC and a filter (PiDTLN), it will cope well with 3m farfield, which can be extended by adding further 'ears' to extend coverage, using the physical positioning of multiple devices rather than trying for NASA-style long-range farfield devices.
They merely use a low-compute tflite model on a multicore microcontroller, whereas even a Pi Zero 2 allows much more complex models such as DTLN, which works well on 2 cores. Upstream, with more compute, you have a much wider choice of better, more complex models.
It is sort of stupid to try to provide upstream-grade speech enhancement at the micro edge. 2x MAX9814 can be used with https://www.amazon.co.uk/KOUWELL-dwi%C3%84TMkowa-AXAGON-ADA-17-USB2-0/dp/B07JGYSZJY or https://plugable.com/products/usb-audio?srsltid=AfmBOoowsOEf5XOOlRFjFbEvEv_qBcWk-_f2eEiBOeTYz-K0gnwnhgUY, which is low cost and far easier to implement than any HAT, as it is plug&play on any device.
A Pi Zero 2 with those makes a low-cost edge 'ear', or, what I consider better, the https://www.aliexpress.com/item/1005007614734251.html, whose A55 cores roughly match a Pi4 and have Armv8.2 vector ML instructions; a wakeword with a BcResnet would likely run on the NPU, or you can use the GPU. Both have less compute than the (great) CPU, but they are more efficient in energy by quite a margin, even though the A55 is excellent in energy usage.