Many studies have leveraged the harmonic patterns in music to achieve high accuracy on music/speech classification. However, rap vocals, with a delivery that closely resembles spoken words, blur this line. This project investigates how well four existing pre-trained models paired with an LSTM, and one CNN+FC (fully connected layers) model, discriminate between rap vocals and speech.
Our data is self-collected audio of speech and rap vocals, scraped from YouTube via yt-dlp and then processed with Demucs (htdemucs_ft version) to separate the target vocals from their music tracks.
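A minimal sketch of this collection step, assuming the command-line interfaces of yt-dlp and Demucs (the URL, output paths, and file names below are placeholders, not entries from the actual dataset):

```python
import subprocess

# Hypothetical example URL -- not a track from the actual dataset.
url = "https://www.youtube.com/watch?v=EXAMPLE_ID"

# Download the audio (plus its metadata JSON) with yt-dlp.
subprocess.run([
    "yt-dlp", "-x", "--audio-format", "wav",
    "--write-info-json", "-o", "raw/%(title)s.%(ext)s", url,
], check=True)

# Separate vocals from the accompaniment with the htdemucs_ft model.
subprocess.run([
    "demucs", "-n", "htdemucs_ft", "--two-stems", "vocals",
    "-o", "separated", "raw/example_track.wav",
], check=True)
```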
This project uses self-collected data.
The Ultimate_Rap_Dataset_Cleaned has 207 rap songs totaling 48,109 sec (≈ 13.36 hr); the Ultimate_Speech_Dataset_Cleaned has 172 speech audio files totaling 76,362 sec (≈ 21.21 hr).
Data preparation, pre-processing, and cleaning are time-consuming. After downloading the audio along with its JSON metadata, we perform vocal separation to extract rap vocals and speech from their music tracks. We then remove and replace problematic characters to ensure compatibility across different systems and software and to prevent errors, which yields our Ultimate datasets.
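The exact character substitutions are project-specific; the sketch below shows one reasonable way to sanitize file names after separation (the directory layout and regex here are assumptions, not the project's actual cleanup code):

```python
import re
from pathlib import Path

def clean_filename(name: str) -> str:
    """Replace characters that commonly break cross-platform file handling."""
    # Keep letters, digits, dots, hyphens, and underscores; replace the rest.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name)

# Hypothetical directory of separated vocals.
for path in Path("separated/htdemucs_ft").rglob("*.wav"):
    cleaned = path.with_name(clean_filename(path.name))
    if cleaned != path:
        path.rename(cleaned)
```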
A comprehensive list of rap music was curated to ensure a diverse and representative dataset, spanning a wide range of rap from its late-1970s origins to contemporary innovations. In addition, a conscious effort was made to include more songs by female rappers to achieve a more balanced gender distribution.
We specifically target speech audio that contains background music, applying
Demucs
for speech separation to maintain consistency between isolated rap vocals and isolated speech in our dataset.
We compare 5 models on this task: four pre-trained embedding extractors, each paired with an LSTM, and a simple CNN+FC model.
If using conda:

```bash
conda env create -f environment.yml
```

If using pip:

```bash
pip install -r requirements.txt
```
To progress through the model training process effectively, it is crucial to run the cells in the Jupyter notebook sequentially. Each cell in BetterNotebook.ipynb
builds upon the previous ones, from data loading and preprocessing to the final stages of model training. Here are some important points to keep in mind:
- Data Reshaping: Different pre-trained models require input tensors of different shapes. Pay attention to the reshaping steps in the notebook to ensure that your data conforms to the required dimensions for each model.
- Variable and File Names: In the notebook, variables that store temporary data might share names with the .npy or .npz files where data is saved. Although the names match, their contents at any given point may differ because of ongoing data processing steps.
- Saving and Loading Data: Throughout the notebook, data is frequently saved to and loaded from .npy (NumPy arrays) or .npz (compressed NumPy array archives) files. Make sure to update the paths to your own locations; a minimal sketch of this pattern follows this list.
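As an illustration of the save/load pattern (the directory, file names, and array shapes below are hypothetical placeholders, not the ones the notebook actually uses):

```python
import numpy as np

FEATURE_DIR = "/path/to/features"  # hypothetical; point this at your own folder

# Placeholder arrays standing in for extracted embeddings and labels.
embeddings = np.random.rand(100, 2048).astype(np.float32)
labels = np.random.randint(0, 2, size=100)

# Save a single array (.npy) and a compressed archive of several arrays (.npz).
np.save(f"{FEATURE_DIR}/panns_embeddings.npy", embeddings)
np.savez_compressed(f"{FEATURE_DIR}/dataset.npz", X=embeddings, y=labels)

# Later cells reload the data; the keys match those used when saving.
data = np.load(f"{FEATURE_DIR}/dataset.npz")
X, y = data["X"], data["y"]
```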
Check out our Colab demo to see how the model identifies three raw rap vocals. The demo uses PANNs+LSTM, our best-performing model, which outputs a probability between 0 and 1, where 0 indicates rap and 1 indicates speech.
For a bit of fun, try recording your own rap vocals and testing them with the model! Use your own audio and see how our classification system handles your unique style.
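For reference, a minimal sketch of how the scalar output can be mapped to a label, based on the 0 = rap / 1 = speech convention above (the threshold and function name are illustrative, not taken from the demo code):

```python
def label_from_probability(prob: float, threshold: float = 0.5) -> str:
    """Map the model's scalar output (0 = rap, 1 = speech) to a class name."""
    return "speech" if prob >= threshold else "rap"

print(label_from_probability(0.12))  # -> rap
print(label_from_probability(0.93))  # -> speech
```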
The four pre-trained embedding extractor models with LSTM, as well as the simple CNN+FC model, achieved rather similar test accuracy, with PANNs+LSTM and VGGish+LSTM delivering the best performance. Interestingly, the naive CNN+FC model proved competitive on this task. All models reached roughly 80%-90% test accuracy.
Special thanks to my teammates, Junzhe Liu and Nick Lin, for their contributions to debugging and creating the demo. Their collaboration and support have been invaluable to this project.
Please cite this repo if you find this project helpful for your project/paper:
Chung, F. (2024). Sound Classification on Rap Vocals and Speech. GitHub repository, https://github.com/Vio-Chung/Rap-Speech-Classification.
```yaml
cff-version: 1.2.0
message: "Please cite it as below if used."
authors:
  - family-names: Chung
    given-names: Fang-Chi (Vio)
    orcid: https://orcid.org/0009-0004-0857-5252
title: "Sound Classification on Rap Vocals and Speech"
version: 1.0.0
date-released: 2024-05-02
```
- Rouard, S., Massa, F., & Défossez, A. (2023). Hybrid Transformers for Music Source Separation. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5.
- Cramer, A., Wu, H.-H., Salamon, J., & Bello, J. P. (2019). Look, Listen and Learn More: Design Choices for Deep Audio Embeddings. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 3852-3856, Brighton, UK.
- Arandjelović, R., & Zisserman, A. (2017). Look, Listen and Learn. IEEE International Conference on Computer Vision (ICCV), Venice, Italy.
- Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., & Plumbley, M. D. (2020). PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2880-2894.