The root path of the source code, SED_PyTorch/, is made up of three folders: aed_data/, MainClasses/, and task_scripts/. A detailed introduction to each of them is given below.
SED_PyTorch/aed_data/
This folder contains the database directories of the two datasets (i.e., Freesound and TUT), together with the experiment results on them. Under the database directories freesound/ and tut_data/ there are two folders, mfccs/ and result/, respectively: mfccs/ provides the raw input data, while result/ saves the results of each experiment on the corresponding database, as well as the weights of the proposed model. The structure of aed_data/ is described below.
- freesound/
  - ae_dataset/: saves the original Freesound dataset.
  - audio_target/: saves the discriminative audio signals selected from the original Freesound dataset.
  - audio/: saves the short sound segments cut from audio_target/ via Audacity (a free software). The labels of the short sound segments are inherited from the long audio signals they are cut from.
  - mfccs/:
    - datas/: contains the MFCC features (*.pkl files), named mfcc_[events number]_.pkl.
    - labels/: contains the pickle files corresponding to the feature files.
  - noise/: contains three types of environmental noise, which are used when generating sub-sets with different polyphonic levels.
  - result/: saves the model weights from the training stage and the results of the cross evaluation as table files (*.csv).
- tut_data/
  - train/: saves the data related to the training set, including annotation files, original signal files and MFCC features.
  - test/: saves the testing set.
  - result/: saves the model weights and experiment results.
SED_PyTorch/MainClasses/
This folder contains several core files (Python class files), which define the data processing, model initialization, and the training and evaluation processes. The details of each class are given below.
- DataHandler.py: This Python script defines the class for parameter initialization and data processing, which contains several methods for data preprocessing, such as enframe(), windowing(), etc. Besides, it provides all the raw input data for each task script via the load_data() method. Note that the most important parameter of load_data() is fold, which is used to generate the training and validation sets. It is worth mentioning that the method mix_data() is essential for the Freesound dataset, because it is used to generate the sub-datasets with various polyphonic levels. The usage for mixing polyphonic sound for the Freesound dataset is as follows:

1. Load the DataHandler class and create an object:

```python
from MainClasses.DataHandler import DataHandler
dh = DataHandler(dataset='Freesound')
```
2. Generate n mixture sounds with k categories, where k denotes the polyphonic level:

```python
dh.mix_data(nevents=k, nsamples=n)
```

By using mix_data(), we get a .pkl file which contains the MFCC features extracted from the mixed sounds.
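For illustration, here is a hedged sketch of how the generated features might be loaded afterwards; the return signature of load_data() is an assumption, so check DataHandler.py for the actual one:

```python
from MainClasses.DataHandler import DataHandler

# Hypothetical usage sketch: the exact return values of load_data() may differ.
dh = DataHandler(dataset='Freesound')
# fold selects which cross-validation fold is held out as the validation set
train_x, train_y, val_x, val_y = dh.load_data(fold=0)
```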
- TorchDataset.py: This Python script defines the class for data packing, which extends the superclass Dataset imported from PyTorch. As data in array form cannot be loaded directly by a PyTorch module, we have to convert the data into tensors and pack them for the model. To achieve the data packing, we override the initialization and __getitem__() methods in the TorchDataset class.
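As an aside, a minimal sketch of this data-packing idea (a generic stand-in, not the repository's actual TorchDataset class):

```python
import torch
from torch.utils.data import Dataset

class ArrayDataset(Dataset):
    """Generic stand-in for TorchDataset: wraps numpy arrays as tensors."""

    def __init__(self, datas, labels):
        self.datas = torch.from_numpy(datas).float()
        self.labels = torch.from_numpy(labels).float()

    def __len__(self):
        return len(self.datas)

    def __getitem__(self, idx):
        # returns one (feature, label) pair as tensors for the DataLoader
        return self.datas[idx], self.labels[idx]
```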
- CustmerMetrics.py and CustmerLosses.py: These two scripts define the loss function and the metrics used in our experiments. In CustmerMetrics, we define the two most important metrics proposed by the DCASE 2017 Challenge, namely the segment-based F1 score and the segment-based Error Rate (ER). Besides, in order to make the training stage clearer, we also define R square and binary accuracy in this class. The proposed novel disentangling loss function is defined in CustmerLosses.py. The usage of these two scripts is:
```python
from MainClasses.CustmerLosses import DisentLoss
from MainClasses.CustmerMetrics import segment_metrics, r_square, binary_accurate

loss_fcn = DisentLoss(K=num_events, beta=beta)  # K denotes the number of event categories
```
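A hedged usage sketch follows; the call signatures below are assumptions inferred from the names above, so check the two scripts for the actual ones:

```python
# Hypothetical calls; the real signatures may differ.
loss = loss_fcn(outputs, targets)               # disentangling loss for one batch
f1, er = segment_metrics(predictions, targets)  # segment-based F1 and ER
```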
- SBetaVAE.py: This Python script defines the classes implementing our proposed method, which extend the PyTorch superclass nn.Module by overriding the __init__() and forward() functions. Specifically, SBetaVAE.py contains four classes: LinearAttention, Encoder, Decoder and SBetaVAE, of which the first three define the key layers of the proposed model, while the full model is defined in the SBetaVAE class. It is simple to define a new model using the code below:

```python
from MainClasses.SBetaVAE import SBetaVAE
sb_vae = SBetaVAE(options)
```
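For readers unfamiliar with this override pattern, here is a generic sketch of such an nn.Module subclass (placeholder layers, not SBetaVAE's actual structure):

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Illustration of the __init__()/forward() override pattern only."""

    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, latent_dim)  # placeholder layer

    def forward(self, x):
        return torch.relu(self.fc(x))
```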
SED_PyTorch/task_scripts/
This folder contains the Python scripts for the major experiments implemented in our paper; it is important to review their code carefully when replicating the work. We provide five script files for the various tasks mentioned in the paper, which correspond to the evaluation of the feature distribution in Fig. 3, the disentanglement evaluation in Fig. 4, the evaluation on the DCASE 2017 challenge TUT dataset in Table 3, the evaluation on the Freesound dataset with various event polyphonic levels in Table 4, and the evaluation of data augmentation in Table 5, respectively. To make these script files easier to read, we encapsulate each script with at least two functions: setup_args() for argument setup and running() for preparing the input data, building the model and executing the training/evaluation process. Besides, some of the scripts contain validation() and test() functions, which are not executed individually but are called inside the running() function. All the task scripts can be run using the following shell commands:

```shell
cd SED_PyTorch/task_scripts/
python [any script].py [args]  # -h for help
```
The optional arguments can be found in setup_args() or via -h in the terminal. Note that, for all experiments, this shell command runs the full training, validation and testing stages. If you want to execute the testing stage only, set the argument -e (epoch) to 0 (e.g., python tb3_dcase17.py -e 0), which means the model is not trained. Although this command is enough to repeat the experiments, a detailed introduction to each task script is given below. It is important to emphasize that the task scripts must be executed in the order given below, as a later task script may depend on the outputs of previous ones.
- tb4_various_events.py: This script evaluates the performance of our method on data of various polyphonic levels made from Freesound, as shown in Table 4 of the accepted paper. The script has many optional arguments which determine the structure of the model, and the experiment can be conducted once a suitable value is chosen for each of them. For example, to train our model on samples containing 10 event categories, the parameter k should be set to 10; the other hyper-parameters $\beta$, $\lambda$ and latent_dim will then be set to 4, 2 and 30 automatically. Therefore we just need to call the shell command below to conduct this evaluation experiment:

```shell
python tb4_various_events.py -k 10  # here "-k" is short for "--num_events"
```
Note that the argument -m indicates whether new data needs to be generated: it is set to 1 by default, which generates new data; set it to 0 when you do not need to generate new data. As we have uploaded the sub-datasets for these experiments, there is no need to generate them again.
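For orientation, here is a hypothetical sketch of what setup_args() may look like, inferred from the flags mentioned in this document (the long option names and defaults are assumptions):

```python
import argparse

def setup_args():
    # Flags inferred from this README; the real names/defaults may differ.
    parser = argparse.ArgumentParser()
    parser.add_argument('-k', '--num_events', type=int, default=10)
    parser.add_argument('-m', '--mix_data', type=int, default=1)   # 1: generate new data
    parser.add_argument('-b', '--batchsize', type=int, default=128)
    parser.add_argument('-e', '--epoch', type=int, default=50)
    parser.add_argument('-lr', '--learning_rate', type=float, default=0.0003)
    return parser.parse_args()
```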
- fig4_disentanglement_visualization.py: This script evaluates the disentanglement performance, as shown in Fig. 4 of our paper. To qualitatively show the event-specific disentangled factors learned by the supervised $\beta$-VAE, we call the shell command below, which carries out five major steps:

```shell
python fig4_disentanglement_visualization.py
```
- Create the model and load the weights:

```python
sb_vae = SBetaVAE(options).cuda()
sb_vae.load_state_dict(torch.load(options.result_path + sb_vae.name + '/fold_' + str(fold) + '_cp_weight.h5'))
```
- Give n samples of input data x, and extract z* for some specific sound event categories:

```python
..., z_stars, ... = sb_vae(torch.from_numpy(test_data).float().cuda())
```
- Adjust one latent variable while fixing the others in z_star, and visualize the corresponding changes in the generated data. Taking Children (shown in Fig. 4 of our paper) as an instance, we adjust the value of the 14-th dimension of z_star while fixing the others (see the sketch after this list).
- Define the delta() function, mentioned in Section 4.5 of our paper, and calculate the differences among the generated data.
- Visualize the differences calculated above as a heat map.
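Since steps 3-5 are only described in prose, here is a minimal sketch of how they could look, reusing sb_vae and z_stars from steps 1-2; the traversal range, the event index and the plotting details are assumptions rather than the script's actual code:

```python
import matplotlib.pyplot as plt
import torch

event_idx = 0                            # assumed index of the event category to inspect
dim = 14                                 # dimension to traverse (the "Children" example)
values = torch.linspace(-3.0, 3.0, 8)    # assumed traversal range

z = z_stars[:1, :, event_idx].clone()    # z* of one sample for one event category
decoded = []
with torch.no_grad():
    for v in values:
        z_adj = z.clone()
        z_adj[0, dim] = v                # adjust one latent variable, fix the others
        decoded.append(sb_vae.decoder(z_adj))

# delta(): here simply the difference between consecutive generated samples
diffs = [torch.abs(decoded[i + 1] - decoded[i]).cpu() for i in range(len(decoded) - 1)]

# visualize the differences as heat maps
fig, axes = plt.subplots(1, len(diffs))
for ax, d in zip(axes, diffs):
    ax.imshow(d.squeeze().numpy(), aspect='auto', cmap='hot')
plt.show()
```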
- tb5_data_augmentation.py: This script evaluates the data generation ability of the proposed method.
- We first make up the unbalanced dataset from Freesound using the DataHandler class mentioned before:

```python
from DataHandler import *
dh = DataHandler('freesound')
```
- Then we call the mix_data() method with the argument isUnbalanced=True, which limits the number of samples of a specific event category in the unbalanced dataset:

```python
dh.mix_data(nevents=5, nsamples=2000, isUnbalanced=True)
```
- After getting the unbalanced dataset, we train and evaluate the model:

```python
model = SBetaVAE(options)
loss_fcn = DisentLoss(K=num_events, beta=beta)
for e in range(epoch):
    outputs = model(inputs)
    loss = loss_fcn(outputs, targets)
    loss.backward()

# here we get the original results
f1_score, error_rate = test(model, test_data, test_labels, supervised=True)
```
- Next, we generate samples for the event category with insufficient samples by decoding the specific latent factors z*:

```python
# generate new data
with torch.no_grad():
    x_augmented = torch.Tensor()
    for n_sample, (x_data, y_data) in enumerate(train_loader):
        dec_out, detectors_out, z_stars, alphas, (mu, log_var) = sb_vae(x_data.float().cuda())
        dec_x = sb_vae.decoder(z_stars[:, :, 0])  # we generate the first event by default
        x_augmented = torch.cat([x_augmented, dec_x.cpu()])  # move to CPU before concatenating
```
- At last, we extend the unbalanced dataset with the generated data to retrain the model, and evaluate it again, which improves the performance of the augmented event category (a sketch is given below).
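A minimal sketch of this last step, assuming train_data and train_labels hold the original unbalanced set as tensors (placeholder names, not the script's actual variables):

```python
# Label the generated samples as the first event category, matching the
# z_stars[:, :, 0] decoding above (placeholder names/shapes).
y_augmented = torch.zeros(len(x_augmented), num_events)
y_augmented[:, 0] = 1

# extend the unbalanced dataset with the generated data
x_train = torch.cat([train_data, x_augmented])
y_train = torch.cat([train_labels, y_augmented])

# then retrain the model on (x_train, y_train) as in the earlier step
# and evaluate it again with test()
```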
- tb3_dcase17_ours.py: This script evaluates the performance of the proposed method on the DCASE 2017 SED challenge TUT dataset, as shown in Table 3. In tb3_dcase17.py, the running() function defines the process of building the model, loading the weights, and training, evaluating and testing the model over the 4-fold cross-validation. Since the training and testing sets are fixed by the DCASE challenge, no arguments other than -b (batch size), -e (epoch) and -lr (learning rate) need to be provided when calling the shell command below:

```shell
python tb3_dcase17.py -b 128 -e 50 -lr 0.0003
```
- fig3_feature_distribution.py: This script visualizes the distributions of the features learned by the proposed model, as shown in Fig. 3. The visualization depends on $t$-SNE, which is imported from sklearn.manifold. To keep the replication simple, we have put all the steps of the algorithm into the running() function. To plot the feature distribution, call the command below; the figure file (.png) will be saved at aed_data/tut_data/result/distribution.png:

```shell
python fig3_feature_distribution.py
```

The details of this experiment are as follows.
- Create the model and load the weights:

```python
sb_vae = SBetaVAE(options).cuda()
sb_vae.load_state_dict(torch.load('/path/of/weight'))
```
- Give n samples as the input data x, and extract the output of the hidden layers (m_th):

```python
for i in range(options.num_events):
    n_ = 0
    while n_ < num_to_plot:
        item = random.randint(0, len(test_dataset) - 1)
        if test_dataset[item][1][i] == 1:  # ensure the test data contains the i-th type of event
            n_ += 1
            x_data, y_data = test_dataset[item]
            with torch.no_grad():
                ..., bottleneck_features, ... = sb_vae(torch.from_numpy(x_data).float().cuda())
                h_out = torch.cat([h_out, torch.relu(bottleneck_features)[:, :, i]])
```
- Create the $t$-SNE object and initialize it with PCA:

```python
from sklearn.manifold import TSNE
tsne = TSNE(n_components=target_dim, init='pca')
```
- Train and transform the high-level features into 2 dimensions:

```python
tsne_datas = tsne.fit_transform(datas)
```
- Use matplotlib to plot the transformed data with colored legend labels:

```python
nums_each_ev = int(len(datas) / n_events)
for i in range(n_events):
    plt.scatter(tsne_datas[nums_each_ev * i:nums_each_ev * (i + 1), 0],
                tsne_datas[nums_each_ev * i:nums_each_ev * (i + 1), 1],
                c=plt.cm.Set1((i + 1) / 10.), alpha=0.6, marker='o')
plt.legend(label_list, loc='lower right', ncol=2)
plt.title('Distribution of features learned by {}.'.format(model.name))
plt.show()
```
To simplify the procedure, all five steps above are put into the running() function of fig3_feature_distribution.py. For more details, please read the code comments carefully.
The running times of the main task scripts on our GPU server are shown below.

tb4_various_events_ours.py with various event categories:

- 5 events with 2000 samples: 26s * 80 epochs * 4 folds
- 10 events with 3000 samples: 44s * 80 epochs * 4 folds
- 15 events with 4000 samples: 50s * 80 epochs * 4 folds
- 20 events with 5000 samples: 73s * 80 epochs * 4 folds

tb3_dcase17_ours.py:

- dcase: 179s * 200 epochs * 4 folds
The CPU of our server is an Intel(R) Core(TM) i9-9820X @ 3.30GHz and the GPU is an NVIDIA RTX 2080Ti.
Finally, to make the reproduction more efficient, we provide all the model weights used in the five experiments. If one wants to skip the training stage, he/she needs to put the weights files into the corresponding result/ directories of the datasets first, and then run each task script mentioned above with the parameter -e (epoch) set to 0.