- Dataset Overview
- Data Visualization
- Methodology used
- Observation
- Deduction
The ASR (automatic speech recognition) model is implemented using machine learning and natural language processing.
- The dataset used for this task is sourced from Google.
- The dataset is composed of short (one-second or less) audio clips of commands, such as "down", "go", "left", "no", "right", "stop", "up" and "yes".
- Each of the 8 commands has 1,000 different voice samples.
- Each WAV file contains time-series data with a set number of samples per second.
- Each sample represents the amplitude of the audio signal at that specific time.
- The full dataset consists of 8,000 audio clips, split into 80% for training, 10% for validation, and 10% for testing.
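The 80/10/10 split described above can be sketched as follows. This is a minimal illustration, not the project's actual loading code: the file names, the `split_dataset` helper, and the fixed shuffle seed are all assumptions made for the example.

```python
import random

COMMANDS = ["down", "go", "left", "no", "right", "stop", "up", "yes"]
CLIPS_PER_COMMAND = 1000  # 8 commands x 1,000 clips = 8,000 clips


def split_dataset(items, train_pct=80, val_pct=10, seed=0):
    """Shuffle items and split them into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption, for repeatability
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (10% here)
    return train, val, test


# Hypothetical file names standing in for the real clips.
all_clips = [f"{cmd}/{i:04d}.wav" for cmd in COMMANDS for i in range(CLIPS_PER_COMMAND)]
train_clips, val_clips, test_clips = split_dataset(all_clips)
print(len(train_clips), len(val_clips), len(test_clips))  # 6400 800 800
```

Shuffling before splitting keeps each subset from being dominated by a single command when the clips are listed command by command.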
We used a convolutional neural network (CNN). A CNN exploits local correlations that exist within the input data: each successive layer connects each of its units to only a small region of the previous layer's units.
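To illustrate the local connectivity mentioned above, here is a small NumPy sketch of a single convolution pass. It is not the model used in this project; the `conv2d_valid` helper, the 4x4 input, and the 2x2 filter values are hypothetical and chosen only to keep the example readable.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output unit only sees a kh x kw patch of the input.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


image = np.arange(16, dtype=float).reshape(4, 4)  # stand-in for a spectrogram patch
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])      # hypothetical 2x2 filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3)
```

Because the same small filter is reused at every position, the layer needs far fewer weights than a fully connected layer over the same input.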
- The raw data is in .wav format.
- We transform the waveforms from time-domain signals into time-frequency-domain signals by computing the short-time Fourier transform (STFT), which converts them to spectrograms that show how frequency content changes over time.
- A spectrogram is a visual representation of a signal's strength, or loudness, over time at various frequencies.
- Spectrograms can be represented as 2D images.
- We feed these spectrogram images to our model.
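The waveform-to-spectrogram step above can be sketched with NumPy alone. This is a rough illustration under stated assumptions rather than the project's preprocessing code: the frame length of 256, the hop of 128, the Hann window, and the synthetic 1 kHz tone are all choices made for this example.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz, the rate stated for this dataset


def spectrogram(waveform, frame_length=256, frame_step=128):
    """Return STFT magnitudes with shape (num_frames, frame_length // 2 + 1)."""
    window = np.hanning(frame_length)  # assumed window; tapers each frame's edges
    num_frames = 1 + (len(waveform) - frame_length) // frame_step
    frames = np.stack([
        waveform[k * frame_step:k * frame_step + frame_length] * window
        for k in range(num_frames)
    ])
    # rfft keeps only the non-negative frequencies of each real-valued frame.
    return np.abs(np.fft.rfft(frames, axis=1))


# Synthetic one-second 1 kHz tone standing in for a real clip.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 1000 * t).astype(np.float32)
spec = spectrogram(tone)
print(spec.shape)  # (124, 129): time frames x frequency bins
```

Each row is one time frame and each column one frequency bin of width `SAMPLE_RATE / frame_length` (62.5 Hz here), so the result can be treated as a single-channel 2D image.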
The following can be observed from the waveform and spectrogram plots of the various commands:
- There is some overlap across different command categories.
- For commands such as “NO” and “STOP”, both the spectrograms and the mel-scale spectrograms look similar.
- The sample rate for this dataset is 16 kHz (clips shorter than one second are padded).
- In 32-bit floating-point representation, the amplitude values of each WAV file in the dataset range from -1.0 to +1.0.
- The presence of noise makes this a difficult classification challenge.
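The padding and amplitude range noted above can be illustrated with a short sketch. It assumes the WAV files store 16-bit PCM samples, which the text does not state; the `to_float_and_pad` helper and the synthetic 0.75-second clip are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16000  # samples per second, so one second is 16,000 samples


def to_float_and_pad(pcm16, target_len=SAMPLE_RATE):
    """Scale 16-bit PCM to float32 in [-1, 1) and zero-pad or trim to target_len."""
    audio = pcm16.astype(np.float32) / 32768.0  # assumes 16-bit PCM input
    audio = audio[:target_len]                  # trim clips that are too long
    padded = np.zeros(target_len, dtype=np.float32)
    padded[:len(audio)] = audio                 # trailing zeros act as silence
    return padded


# Hypothetical 0.75-second clip (12,000 samples) of a 440 Hz tone.
n = np.arange(12000)
short_clip = (0.5 * 32767 * np.sin(2 * np.pi * 440 * n / SAMPLE_RATE)).astype(np.int16)
clip = to_float_and_pad(short_clip)
print(clip.shape, clip.dtype)  # (16000,) float32
```

Giving every clip the same length means every spectrogram has the same shape, which a CNN with a fixed input size requires.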
Confusion matrix showing how well the model classified each of the commands in the test set:
Some deductions made from the CNN model used are:
- Precision score of the model: 0.8125
- Recall score of the model: 0.8125
- F1 score of the model: 0.812
- With only a few CNN layers, the model can learn only a limited set of features, whereas a deep learning model needs to extract more.
- Overfitting degrades the model's performance.
- The small size of the dataset also limits how well the CNN performs.
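Precision, recall, and F1 scores like those listed above are derived from a confusion matrix. The sketch below shows one way to compute them. The text does not say which averaging was used, so this example reports accuracy and macro-averaged scores. For single-label classification, micro-averaged precision, recall, and F1 all equal accuracy. The 3-class matrix is made up for illustration and is not the project's result.

```python
import numpy as np


def scores_from_confusion(cm):
    """cm[i, j] = number of clips with true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)  # per class: correct / all predicted as that class
    recall = tp / cm.sum(axis=1)     # per class: correct / all truly in that class
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "macro_precision": precision.mean(),
        "macro_recall": recall.mean(),
        "macro_f1": f1.mean(),
    }


# Hypothetical confusion matrix for three classes (rows: true, columns: predicted).
cm = np.array([
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
])
scores = scores_from_confusion(cm)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The helper assumes every class appears at least once in both the true and the predicted labels; otherwise the divisions would need a guard against zero.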