NI-MVI Semestral project 2022

Goal: Compare the quality of audio recordings of Spanish speakers enhanced by the CMGAN model. I will compare the outputs inferred by:

  1. The model pretrained on English speakers, as provided by the authors of CMGAN.
  2. The pretrained model fine-tuned on Spanish speakers.
  3. A model trained from scratch on custom data.

Model: CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement, Sherif Abdulatif, Ruizhe Cao, Bin Yang

Data: High-quality (podcast-quality) audio recordings of Spanish speakers, downloaded from YouTube. Both male and female speakers are included. After preprocessing, the dataset contains 50 minutes of audio, split 40:10 between training and evaluation. Preprocessing consists of:

  • Tokenization (segmentation): split the data into short audio files, 3-8 seconds long.
  • Downsampling: convert to 16 kHz and 16 bits per sample, as recommended in the paper.
  • Adding noise: mix in noise from the DEMAND dataset, as recommended in the paper.

The list of data sources is in the sources.txt file. The preprocessed data are available on my university Google Drive; the link is in sources.txt.
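The segmentation and noise-mixing steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual preprocessing script: the function names, the random choice of segment length, and the `snr_db` parameter are all assumptions, and plain Python lists stand in for the audio arrays a real pipeline would use.

```python
import math
import random


def segment_bounds(n_samples, sr=16000, min_s=3.0, max_s=8.0, seed=0):
    """Return (start, end) sample indices of consecutive 3-8 s segments.

    Hypothetical helper: segment lengths are drawn at random, and a
    trailing remainder shorter than min_s is dropped.
    """
    rng = random.Random(seed)
    bounds, start = [], 0
    while n_samples - start >= int(min_s * sr):
        length = int(rng.uniform(min_s, max_s) * sr)
        end = min(start + length, n_samples)
        bounds.append((start, end))
        start = end
    return bounds


def rms(x):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean clip at the requested signal-to-noise ratio.

    Assumes the noise is not silent. The SNR value itself is an
    assumption; the README does not state which levels were used.
    """
    # Tile the noise if it is shorter than the clean clip, then trim.
    reps = -(-len(clean) // len(noise))  # ceiling division
    noise = (list(noise) * reps)[:len(clean)]
    # Scale the noise so that 20*log10(rms(clean) / rms(gain*noise)) == snr_db.
    gain = rms(clean) / (rms(noise) * 10 ** (snr_db / 20))
    return [c + gain * n for c, n in zip(clean, noise)]
```

For example, `segment_bounds(16000 * 60)` splits one minute of 16 kHz audio into index ranges, and each resulting clip could then be passed to `mix_at_snr` together with a DEMAND noise excerpt.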

Research: CMGAN is close to the state of the art. Its successor, SCP-CMGAN, uses a different system of metrics that I did not understand, so I chose the closest solution with an available pretrained model and a public dataset. Source: paperswithcode.com

GAN training

Approach:

  1. Preprocess custom data.

  2. Train model from scratch using custom data.

  3. Fine-tune existing pretrained model using custom data.

  4. Evaluate models with PESQ and STOI metrics:

    • Scratch.
    • Fine-tuned.
    • Pre-trained.
  5. Compare results.

  6. Prepare samples:

    • Clean audio.
    • Noisy audio.
    • Enhanced with different models.
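Steps 4 and 5 (evaluate each model, then compare) could be organised along these lines. This is only a sketch under assumptions: the `evaluate` and `best_model` names and the callable-based interface are hypothetical, and the metric functions are passed in from outside rather than implemented here.

```python
from statistics import mean


def evaluate(models, pairs, metrics):
    """Average every metric over all (clean, noisy) pairs for each model.

    models:  dict of name -> callable(noisy) returning the enhanced signal
    pairs:   list of (clean, noisy) signals from the evaluation set
    metrics: dict of name -> callable(clean, enhanced) returning a float
    All three interfaces are assumptions made for illustration.
    """
    table = {}
    for model_name, enhance in models.items():
        enhanced = [(clean, enhance(noisy)) for clean, noisy in pairs]
        table[model_name] = {
            metric_name: mean(fn(c, e) for c, e in enhanced)
            for metric_name, fn in metrics.items()
        }
    return table


def best_model(table, metric):
    """Name of the model with the highest mean score for one metric.

    Higher is better for both PESQ and STOI.
    """
    return max(table, key=lambda name: table[name][metric])
```

Here `models` would hold the scratch, fine-tuned, and pretrained variants, and `metrics` would hold PESQ and STOI implementations. The `pesq` and `pystoi` Python packages are one possible source for those, although the README does not say which implementation was used.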

Final Report