
JoyKrishan/UrbanSound-classification-using-2types-of-Deep-Models-and-different-audio-features


Research Topic:

Urban Sound Classification Using Convolutional Neural Network and Long Short Term Memory Based on Multiple Features

Dataset:

Urban Sound Classification. For a detailed explanation of the dataset and the methods used to clean and compile it, please read this paper.

Audio features (spectral features) extracted:

MFCC, Mel Spectrogram, Chroma STFT, Chroma CQT, Chroma Cens, Spectral Contrast, Tonnetz.

(Figures: MFCC, Mel Spectrogram, Chroma STFT, Chroma CQT, Chroma CENS, and Spectral Contrast of a dog-bark sample.)

Models Used:

  • A CNN model with the layer sequence CONV2D ---> MAXPOOL ---> CONV2D ---> MAXPOOL ---> DENSE ---> DENSE ---> SOFTMAX. The first Conv2D layer applies 64 filters of dimension (5*1*1) to the input of shape (20*5*1). A max-pooling layer follows, then another Conv2D layer, and so on. A softmax layer at the end classifies between the 10 classes. We use the Adam optimizer to minimize the cost.
  • An LSTM model with 2 LSTM blocks, 2 time-distributed dense layers, and a final output layer for classification: LSTM ---> LSTM ---> DENSE ---> DENSE ---> SOFTMAX. The first and second LSTM layers each contain 128 units and return 128 output values for the given inputs. The outputs of the LSTM layers are passed to the dense layers and then to the output layer. Here too we use the Adam optimizer so that the model converges faster.
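The two architectures can be sketched in Keras as follows. The 64 filters, 5×1 kernel, (20, 5, 1) input, 128 LSTM units, and 10 classes come from the description above; everything else (second-conv filter count, dense widths, pool sizes) is an assumption for illustration.

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(20, 5, 1), n_classes=10):
    # CONV2D -> MAXPOOL -> CONV2D -> MAXPOOL -> DENSE -> DENSE -> SOFTMAX
    m = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(64, (5, 1), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 1)),
        layers.Conv2D(128, (3, 1), padding="same", activation="relu"),  # assumed size
        layers.MaxPooling2D(pool_size=(2, 1)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),  # assumed width
        layers.Dense(n_classes, activation="softmax"),
    ])
    m.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return m

def build_lstm(timesteps=20, n_features=5, n_classes=10):
    # LSTM -> LSTM -> time-distributed DENSE x2 -> SOFTMAX
    m = models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.TimeDistributed(layers.Dense(64, activation="relu")),  # assumed width
        layers.TimeDistributed(layers.Dense(32, activation="relu")),  # assumed width
        layers.Flatten(),
        layers.Dense(n_classes, activation="softmax"),
    ])
    m.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return m
```

Both models end in a 10-way softmax and compile with Adam, as described above.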

Augmentation

  • Time Stretch : In this technique, we slow down or speed up the sound clips with rates of 0.9 and 1.1. In this way, we could generate 17464 more audio clips for our augmented dataset.

Time Stretch

  • Pitch Shift : Factors of {-2, +2} are used to lower and raise the pitch (in semitones) of each audio sample in the dataset, through which we could create 17464 more samples using pitch shift.

Pitch Shift

  • Pitch Shift along with Time Stretch : In this augmentation step, a sound clip is manipulated using both pitch shift and time stretch to generate a total of 34928 new audio clips.

Pitch Shift and Time Stretch

Results

The table below shows the best accuracy achieved by each model across the different feature sets.

Model Accuracy
CNN 96.5%
LSTM 98.81%

Fundamental Libraries required before implementation

Librosa and ffmpeg

pip install librosa
# ffmpeg is a system tool, not a pip package; install it with your OS package manager:
sudo apt-get install ffmpeg    # Debian/Ubuntu
brew install ffmpeg            # macOS

Please cite the paper if you use the code or information from it 😃

About

Thesis Work
