
JoyKrishan/UrbanSound-classification-using-2types-of-Deep-Models-and-different-audio-features


Research Topic:

Urban Sound Classification Using Convolutional Neural Network and Long Short Term Memory Based on Multiple Features

Dataset:

Urban Sound Classification. For a detailed explanation of the dataset and the methods used to clean and compile it, please read this paper.

Audio features (spectral features) extracted:

MFCC, Mel Spectrogram, Chroma STFT, Chroma CQT, Chroma Cens, Spectral Contrast, Tonnetz.

(Figures: MFCC, Mel Spectrogram, Chroma STFT, Chroma CQT, Chroma CENS, and Spectral Contrast of a dog-bark sample.)

Models Used:

  • A CNN model with the layer sequence CONV2D ---> MAXPOOL ---> CONV2D ---> MAXPOOL ---> DENSE ---> DENSE ---> SOFTMAX. The first Conv2D layer applies 64 filters of dimension (5*1*1) to the input of shape (20*5*1). A max-pooling layer follows, then another Conv2D layer, and so on. A softmax layer at the end classifies between the 10 classes. We use the Adam optimizer to minimize the cost.
  • An LSTM model with 2 LSTM blocks, 2 time-distributed dense layers, and a final output layer for classification: LSTM ---> LSTM ---> DENSE ---> DENSE ---> SOFTMAX. The first and second LSTM layers each contain 128 units and return 128 output values for the given inputs. The outputs of the LSTM layers are passed to the dense layers and then to the output layer. Here too we use the Adam optimizer so that the model converges faster.
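The two architectures can be sketched in Keras as follows. The 64 filters, 5×1 kernel, (20, 5, 1) input, 128 LSTM units, and 10 classes come from the description above; everything else (second-conv filter count, dense widths, pool sizes) is an assumption for illustration.

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(20, 5, 1), n_classes=10):
    # CONV2D -> MAXPOOL -> CONV2D -> MAXPOOL -> DENSE -> DENSE -> SOFTMAX
    m = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(64, (5, 1), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 1)),
        layers.Conv2D(128, (3, 1), padding="same", activation="relu"),  # assumed size
        layers.MaxPooling2D(pool_size=(2, 1)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),  # assumed width
        layers.Dense(n_classes, activation="softmax"),
    ])
    m.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return m

def build_lstm(timesteps=20, n_features=5, n_classes=10):
    # LSTM -> LSTM -> time-distributed DENSE x2 -> SOFTMAX
    m = models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.TimeDistributed(layers.Dense(64, activation="relu")),  # assumed width
        layers.TimeDistributed(layers.Dense(32, activation="relu")),  # assumed width
        layers.Flatten(),
        layers.Dense(n_classes, activation="softmax"),
    ])
    m.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return m
```

Both models end in a 10-way softmax and compile with Adam, as described above.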

Augmentation

  • Time Stretch : In this technique, we slow down or speed up the sound clips with rates of 0.9 and 1.1. In this way, we could generate 17464 more audio clips for our augmented dataset.

Time Stretch

  • Pitch Shift : Factors of {-2, +2} are used to lower and raise the pitch (in semitones) of each audio sample in the dataset, through which we could create 17464 more samples using pitch shift.

Pitch Shift

  • Pitch Shift along with Time Stretch : In this augmentation step, a sound clip is manipulated using both pitch shift and time stretch to generate a total of 34928 new audio clips.

Pitch Shift and Time Stretch

Results

The table below shows the best accuracy achieved by each model across the different feature sets.

Model Accuracy
CNN 96.5%
LSTM 98.81%

Fundamental Libraries required before implementation

Librosa and ffmpeg

pip install librosa
# ffmpeg is a system tool, not a pip package; install it with your OS package manager:
sudo apt-get install ffmpeg    # Debian/Ubuntu
brew install ffmpeg            # macOS

Please cite the paper if you use the code or information from it 😃

About

Thesis Work
