
Automatic Speech Recognition

Table of Contents

  • Dataset Overview
  • Data Visualization
  • Methodology Used
  • Observation
  • Deduction

The ASR model is implemented using machine learning and natural language processing techniques.

  • The dataset used for this task is sourced from Google.
  • The dataset is composed of short (one-second or less) audio clips of commands, such as "down", "go", "left", "no", "right", "stop", "up", and "yes".
  • Each of the 8 commands has 1,000 different voice samples.
  • Each WAV file contains time-series data with a fixed number of samples per second.
  • Each sample represents the amplitude of the audio signal at that instant in time.
  • The full dataset consists of 8,000 clips, split 80% for training and 10% each for validation and testing.
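The 80/10/10 split above can be sketched as follows. The directory layout and file names here are an assumption for illustration; only the 8,000-clip count and the split ratios come from the description.

```python
import random

commands = ["down", "go", "left", "no", "right", "stop", "up", "yes"]
# Hypothetical file layout -- the repository's actual paths are an assumption.
filenames = [f"data/{cmd}/{i:04d}.wav" for cmd in commands for i in range(1000)]

# Shuffle with a fixed seed so the split is reproducible.
random.Random(42).shuffle(filenames)

n = len(filenames)
train_files = filenames[: int(0.8 * n)]              # 6,400 clips for training
val_files = filenames[int(0.8 * n) : int(0.9 * n)]   # 800 clips for validation
test_files = filenames[int(0.9 * n) :]               # 800 clips for testing

print(len(train_files), len(val_files), len(test_files))  # 6400 800 800
```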

Methodology Used

We used a CNN (convolutional neural network) model. A CNN exploits correlations that exist within the input data: each successive layer of the network connects to a local region of input neurons.

  • First of all, our data is in .wav form.
  • We transform these waveforms from time-domain signals into frequency-domain signals by computing the short-time Fourier transform (STFT), converting the waveforms into spectrograms, which show how frequency content changes over time.
  • A spectrogram is a visual representation of the signal strength, or loudness, of a signal over time at various frequencies.
  • Spectrograms can be represented as 2D images.
  • These spectrogram images are what we feed into the model.
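The waveform-to-spectrogram step above can be sketched with a minimal STFT in plain numpy. The frame length and step below are common choices for 16 kHz speech clips; the project's exact parameters are an assumption.

```python
import numpy as np

def spectrogram(waveform, frame_length=255, frame_step=128):
    """Magnitude spectrogram via a short-time Fourier transform.

    Frame/step sizes are typical for 16 kHz speech; any trailing
    partial frame is dropped for simplicity.
    """
    n_frames = 1 + max(0, len(waveform) - frame_length) // frame_step
    window = np.hanning(frame_length)  # taper each frame to reduce leakage
    frames = np.stack([
        waveform[i * frame_step : i * frame_step + frame_length] * window
        for i in range(n_frames)
    ])
    # Real FFT of each windowed frame -> (time frames, frequency bins)
    return np.abs(np.fft.rfft(frames, axis=-1))

wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
spec = spectrogram(wave)
print(spec.shape)  # (124, 128): 124 time frames, 128 frequency bins
```

The resulting 2D array is the "image" the CNN consumes: one axis is time, the other frequency.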

Observation

The following can be deduced from the spectrogram plots of the various command waveforms:

  • There is some degree of overlap across the different command categories.
  • For commands like "no" and "stop", both the spectrogram and the mel-scale spectrogram look quite similar.
  • The sample rate for this dataset is 16 kHz (shorter clips are padded to the full length).
  • The WAV files store 32-bit floating-point samples, so amplitude values range from -1.0 to +1.0.
  • The presence of noise makes this a difficult classification task.
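The padding mentioned above (clips shorter than one second brought up to 16,000 samples) can be sketched as follows. Right-side zero-padding is one common choice; the project's exact padding scheme is an assumption.

```python
import numpy as np

SAMPLE_RATE = 16000  # dataset sample rate: 16 kHz, so 1 s = 16,000 samples

def pad_to_one_second(waveform, target_len=SAMPLE_RATE):
    """Zero-pad (or truncate) a clip so every example has target_len samples."""
    waveform = waveform[:target_len]          # truncate anything too long
    pad = target_len - len(waveform)
    return np.pad(waveform, (0, pad)) if pad > 0 else waveform

short_clip = np.ones(12000, dtype=np.float32)  # e.g. a 0.75 s recording
padded = pad_to_one_second(short_clip)
print(padded.shape)  # (16000,)
```

Fixing every clip to the same length means every spectrogram has the same shape, which the CNN's input layer requires.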

Confusion matrix showing how well the model did classifying each of the commands in the test set:

Deduction

Some of the deductions made from the CNN model used are:

  • Precision score of the model: 0.8125
  • Recall score of the model: 0.8125
  • F1 score of the model: 0.812
  • With only a few CNN layers we can extract only a limited set of features; a deeper network is needed to extract more.
  • Overfitting degrades the model's performance.
  • The relatively small dataset also limits how well the CNN can perform.
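The three scores above are consistent with each other: F1 is the harmonic mean of precision and recall, so when precision and recall are both 0.8125, F1 is also 0.8125 (reported rounded to 0.812).

```python
precision = 0.8125
recall = 0.8125

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.812
```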
