Overview of options for the nkululeko framework

  • Options are specified in a .ini file, using Python configparser syntax
  • Most (though not all) values have defaults

Sections

EXP

  • root: experiment root folder
    • root = ./results/
  • type: the kind of experiment
    • type = classification
    • possible values:
      • classification: supervised learning experiment with a restricted set of categories (e.g., emotion categories).
      • regression: supervised learning experiment with continuous values (e.g., speaker age in years).
  • store: (relative to root) folder for caches
    • store = ./store/
  • name: a name for debugging output
    • name = emodb_exp
  • fig_dir: (relative to root) folder for plots
    • fig_dir = ./images/
  • res_dir: (relative to root) folder for result output
    • res_dir = ./results/
  • models_dir: (relative to root) folder to save models
    • models_dir = ./models/
  • runs: number of runs (e.g., to average over random initializations)
    • runs = 1
  • epochs: number of epochs for ANN training
    • epochs = 1
  • save: save the experiment as a pickle file to be restored again later (True or False)
    • save = False
  • save_test: save the test predictions as a new database in CSV format; either False (the default) or the name of the output file
    • save_test = ./my_saved_test_predictions.csv
  • databases: name of databases to compare for the multidb module
    • databases = ['emodb', 'timit']
  • use_splits: for the multidb module: use the original split sets when a database serves as train or test set; otherwise the whole database is used.
    • use_splits = True
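
To see how these options combine, here is a minimal [EXP] section; the values are illustrative examples, not recommended settings:

```ini
[EXP]
# bookkeeping: where results are stored and how the run is identified
root = ./results/
name = emodb_exp
type = classification
runs = 1
epochs = 1
save = False
```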

DATA

  • type: a flag to mark continuous data, so it can be binned to categorical data (using bins and labels)
    • type = continuous
  • databases: list of databases to be used in the experiment
    • databases = ['emodb', 'timit']
  • tests: Datasets to be used as test data for the given best model. The databases do NOT have to appear in the databases field!
    • tests = ['emovo']
  • root_folders: an additional configuration file for all entries starting with a dataset name, acting as global defaults.
    • root_folders = data_roots.ini
  • db_name: path to the audformat repository for each database listed in 'databases'. If this path is not absolute, it will be treated as relative to the experiment folder.
    • emodb = /home/data/audformat/emodb/
  • db_name.type: type of storage, e.g., audformat database or 'csv' (needs header: file, speaker, task)
    • emodb.type = audformat
  • db_name.absolute_path: only for 'csv' databases: are the audio file paths relative or absolute? If not absolute, they will be treated as relative to the database parent folder, NOT the experiment root folder.
    • my_data.absolute_path = True
  • db_name.audio_path: only for 'csv' databases: are the audio files in a special common folder?
    • my_data.audio_path = wav_files/
  • db_name.mapping: a Python dictionary to map between categories for cross-database experiments (format: {'target_emo':'source_emo'})
    • emodb.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'neutral':'neutral'}
  • db_name.colnames: mapping to rename columns to standard names
    • my_data.colnames = {'speaker':'Participant ID', 'sex':'gender', 'Age': 'age'}
  • db_name.split_strategy: How to identify sets for train/development data splits within one database
    • emodb.split_strategy = reuse
    • Possible values:
      • database: default (task.train and task.test)
      • specified: specify the tables (an opportunity to assign multiple or no tables to train or dev set)
        • emodb.train_tables = ['emotion.categories.train.gold_standard']
        • emodb.test_tables = ['emotion.categories.test.gold_standard']
      • speaker_split: split samples randomly but speaker-disjunct, given a percentage of speakers for the test set.
        • emodb.test_size = 50 (default:20)
      • list of test speakers: you can simply provide a list of test ids
        • emodb.split_strategy = [12, 14, 15, 16]
      • random: split samples randomly (NOT speaker-disjunct, e.g., when no speaker info is given or each sample has its own speaker), given a percentage of samples for the test set.
        • emodb.test_size = 50 (default: 20)
      • reuse: reuse the splits after a speaker_split run to save time with feature extraction.
      • train: use the entire database for training
      • test: use the entire database for evaluation
  • db_name.target_tables: tables that contain the target / speaker / sex labels
    • emodb.target_tables = ['emotion']
  • target_tables_append: set this to True if multiple tables should be combined row-wise; otherwise they are combined column-wise
    • target_tables_append = False
  • db_name.files_tables: tables that contain the audio file names
    • emodb.files_tables = ['files']
  • db_name.test_tables: tables that should be used for testing
    • emodb.test_tables = ['emotion.categories.test.gold_standard']
  • db_name.train_tables: tables that should be used for training
    • emodb.train_tables = ['emotion.categories.train.gold_standard']
  • db_name.limit_samples: maximum number of random N samples per table (for testing with very large data mainly)
    • emodb.limit_samples = 20
  • db_name.required: require samples to have a specific label (for example, keep only samples with gender labeled in a database where this is not the case for all samples, e.g., MozillaCommonVoice)
    • emodb.required = gender
  • db_name.limit_samples_per_speaker: maximum number of samples per speaker (for leveling data where the same speakers have a large number of samples)
    • emodb.limit_samples_per_speaker = 20
  • db_name.min_duration_of_sample: limit the samples to a minimum length (in seconds)
    • emodb.min_duration_of_sample = 0.0
  • db_name.max_duration_of_sample: limit the samples to a maximum length (in seconds)
    • emodb.max_duration_of_sample = 0.0
  • db_name.rename_speakers: add the database name to the speaker names, e.g., because several databases use the same names
    • emodb.rename_speakers = False
  • db_name.filter: don't use all the data but only selected values from columns: [col, val]*
    • emodb.filter = [['gender', 'female']]
  • db_name.scale: scale (standard normalize) the target variable (if numeric)
    • my_data.scale = True
  • db_name.reverse: reverse the target variable (if numeric), i.e., f(x) = abs(x - max)
  • db_name.reverse.max: the max value to be used in the formula above. If omitted, the reversed distribution will start at 0.
  • target: the task name, e.g. age or emotion
    • target = emotion
  • labels: for classification experiments: the names of the categories (is also used for regression when binning the values)
    • labels = ['anger', 'boredom', 'disgust', 'fear', 'happiness', 'neutral', 'sadness']
  • bins: array of integers to be used for binning continuous data
    • bins = [-100, 40, 50, 60, 70, 100]
  • no_reuse: don't re-use any tables, but start fresh
    • no_reuse = False
  • min_dur_test: specify a minimum duration for test samples (in seconds)
    • min_dur_test = 3.5
  • target_divide_by: divide the target values by some factor, e.g., to scale age values into the range 0 to 1
    • target_divide_by = 100
  • limit_samples: maximum number of random N samples per sample selection
    • limit_samples = 20
  • limit_samples_per_speaker: maximum number of samples per speaker per sample selection
    • limit_samples_per_speaker = 20
  • min_duration_of_sample: limit the samples to a minimum length (in seconds) per sample selection
    • min_duration_of_sample = 0.0
  • max_duration_of_sample: limit the samples to a maximum length (in seconds) per sample selection
    • max_duration_of_sample = 0.0
  • check_size: check the file size (in bytes) of all samples in the train and test splits
    • check_size = 1000
  • check_vad: check if the files contain speech, using silero VAD
    • check_vad = True
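
As a sketch of how these options work together, a hypothetical [DATA] section combining one audformat database and one CSV database might look like this (paths, split settings, and labels are illustrative):

```ini
[DATA]
databases = ['emodb', 'my_data']
# an audformat database with a speaker-disjunct random split
emodb = /home/data/audformat/emodb/
emodb.type = audformat
emodb.split_strategy = speaker_split
emodb.test_size = 20
# a CSV database (header: file, speaker, task), used entirely for training
my_data = ./data/my_data.csv
my_data.type = csv
my_data.absolute_path = False
my_data.split_strategy = train
target = emotion
labels = ['anger', 'happiness', 'neutral', 'sadness']
```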

AUGMENT

  • augment: select the augmentation methods: one or both of traditional and random_splice
    • augment = ['traditional', 'random_splice']
    • choices are:
      • traditional: uses the audiomentations package
      • random_splice: randomly re-orders short splices (obfuscates the words)
  • p_reverse: for random_splice: probability of some samples to be in reverse order (default: 0.3)
  • top_db: for random_splice: top db level for silence to be recognized (default: 12)
  • sample_selection: select the samples to augment: either train, test, or all
    • sample_selection = all
  • result: file name to store the augmented data (can then be added to training)
    • result = augmented.csv
  • augmentations: select the augmentation chain for the audiomentations module. A default is provided.
    • augmentations = Compose([AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.05),Shift(p=0.5),BandPassFilter(min_center_freq=100.0, max_center_freq=6000),])
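
For orientation, a small [AUGMENT] section using both methods might look like this (values are illustrative):

```ini
[AUGMENT]
augment = ['traditional', 'random_splice']
sample_selection = train
# random_splice parameters
p_reverse = 0.3
top_db = 12
result = augmented.csv
```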

SEGMENT

  • sample_selection: select the samples to segment: either train, test, or all
    • sample_selection = all
  • segment_result: name of the resulting segmented data table
    • segment_result = segmented.csv
  • method: select the segmentation method
  • min_length: the minimum length of rest samples (in seconds)
    • min_length = 2
  • max_length: the maximum length of segments (in seconds); longer ones are cut at this length.
    • max_length = 10 # if not set, original segmentation is used
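
A minimal [SEGMENT] section could then read (values are illustrative):

```ini
[SEGMENT]
sample_selection = all
segment_result = segmented.csv
min_length = 2
max_length = 10
```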

FEATS

  • type: a list of feature types; they will be concatenated column-wise
    • type = ['os']
    • possible values:
      • import: already computed features
        • import_file = paths to files with features in CSV format
          • import_file = ['path1/file1.csv', 'path2/file2.csv']
      • mld: mid-level-descriptors
        • mld.model = path to the mld sources folder
        • min_syls = minimum number of syllables
      • os: openSMILE features
        • set = eGeMAPSv02 (feature set)
        • level = functionals (or lld: feature level)
        • os.features: list of selected features (disregard others)
      • praat: selected Praat features, computed with David R. Feinberg's scripts
        • praat.features: list of selected features (disregard others)
      • spectra: mel spectrograms for convolutional networks
        • fft_win_dur = 25 (msec analysis frame/window length)
        • fft_hop_dur = 10 (msec hop duration)
        • fft_nbands = 64 (number of frequency bands)
      • ast: audio spectrogram transformer features from MIT
      • trill: TRILL embeddings from Google
        • trill.model = path to the TRILL model folder, optional
      • wav2vec2 variants: wav2vec 2.0 embeddings from Facebook
        • "wav2vec2-large-robust-ft-swbd-300h"
        • wav2vec2.model = path to the wav2vec2 model folder
        • wav2vec2.layer = which last hidden layer to use
      • Hubert variants: Facebook HuBERT models
        • "hubert-base-ls960", "hubert-large-ll60k", "hubert-large-ls960-ft", "hubert-xlarge-ll60k", "hubert-xlarge-ls960-ft"
      • WavLM:
        • "wavlm-base", "wavlm-base-plus", "wavlm-large"
      • Whisper: whisper models
        • "whisper-base", "whisper-large", "whisper-medium", "whisper-tiny"
      • audmodel: embeddings from the audEERING emotion model, a wav2vec 2.0 model finetuned on MSP-Podcast emotions
        • aud.model = ./audmodel/ (path to the audEERING model folder)
      • auddim: dimension predictions (arousal, dominance, valence) from the audEERING emotion model, a wav2vec 2.0 model finetuned on MSP-Podcast
      • agender: embeddings from the audEERING age and gender model, a wav2vec 2.0 model finetuned on several age databases
        • agender.model = ./agender/ (path to the audEERING model folder)
      • agender_agender: age and gender predictions from the audEERING age and gender model, a wav2vec 2.0 model finetuned on several age and gender databases: age, female, male, child
      • clap: LAION's CLAP embeddings
      • xbow: openXBOW bag-of-words features with a codebook computed from openSMILE features
        • xbow.model = path to xbow root folder (containing xbow.jar)
        • size = 500 (codebook size, rule of thumb: should grow with datasize)
        • assignments = 10 (number of words in the bag representation where the counter is increased for each input LLD, rule of thumb: should grow/shrink with codebook size)
      • snr: estimated SNR (signal-to-noise ratio)
      • mos: estimated MOS (mean opinion score)
      • pesq: estimated PESQ (Perceptual Evaluation of Speech Quality)
      • sdr: estimated SDR (signal-to-distortion ratio)
      • spkrec: speaker-id: speechbrain embeddings
      • stoi: estimated STOI (Short-Time Objective Intelligibility)
  • features = python list of selected features to be used (all others ignored)
    • features = ['JitterPCA', 'meanF0Hz', 'hld_sylRate']
  • no_reuse: don't re-use already extracted features, but start fresh
    • no_reuse = False
  • store_format: how to store the features: possible values [pkl | csv]
    • store_format = pkl
  • scale: scale the features (important for gmm)
    • scale = standard
    • possible values:
      • standard: z-transformation (mean of 0 and std of 1) based on the training set
      • robust: robust scaler
      • speaker: like standard but based on individual speaker sets (also for the test)
      • bins: convert feature values into 0, .5 and 1 (for low, mid and high)
  • set: name of opensmile feature set, e.g. eGeMAPSv02, ComParE_2016, GeMAPSv01a, eGeMAPSv01a
    • set = eGeMAPSv02
  • level: level of opensmile features
    • level = functional
    • possible values:
      • functional: aggregated over the whole utterance
      • lld: low-level descriptor: framewise
  • balancing: balance the features with respect to class distribution
    • balancing = smote
    • possible values:
      • ros: simply repeat random samples from the minority classes
      • smote: synthesize new minority-class samples as small variations of existing ones
      • adasyn: similar to smote, but resulting in uneven class distributions
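
Combining the options above, a hypothetical [FEATS] section for scaled and balanced openSMILE functionals might read:

```ini
[FEATS]
type = ['os']
set = eGeMAPSv02
store_format = pkl
scale = standard
balancing = smote
```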

MODEL

  • type: type of classifier
    • type = svm
    • possible values:
      • bayes: Naive Bayes classifier
      • cnn: Convolutional neural network (only works with feature type=spectra)
      • finetune: finetune a transformer model with huggingface. In this case the features are ignored, because the audio files are used directly.
        • pretrained_model: Base model for finetuning/transfer learning. Variants of wav2vec2, Hubert, and WavLM are tested to work. Default is facebook/wav2vec2-large-robust-ft-swbd-300h.
          • pretrained_model = microsoft/wavlm-base
        • push_to_hub: For finetuning, whether to push the model to the huggingface model hub. Default is False.
          • push_to_hub = True
        • max_duration: Max. duration of samples/segments for the transformer in seconds, frames are pooled.
          • max_duration = 8.0
      • gmm: Gaussian mixture classifier
      • knn: k nearest neighbor classifier
        • K_val = 5
        • KNN_weights = uniform | distance
      • knn_reg: K nearest neighbor regressor
      • mlp: Multi-Layer-Perceptron for classification
      • mlp_reg: Multi-Layer-Perceptron for regression
      • svm: Support Vector Machine
        • C_val = 0.001
        • kernel = rbf # 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
      • svr: Support Vector Regression
        • C_val = 0.001
        • kernel = rbf # 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
      • tree: decision tree classifier
      • tree_reg: decision tree regressor
      • xgb: XGBoost classifier
      • xgr: XGBoost regressor
  • balancing: balancing for the finetune type; for other model types, set balancing in [FEATS].
    • possible values: [ros, smote, adasyn]
  • tuning_params: possible tuning parameters for x-fold optimization (for Bayes, KNN, KNN_reg, Tree, Tree_reg, SVM, SVR, XGB and XGR)
    • tuning_params = ['subsample', 'n_estimators', 'max_depth']
      • subsample = [.5, .7]
      • n_estimators = [50, 80, 200]
      • max_depth = [1, 6]
  • scoring: scoring measure for the optimization
    • scoring = recall_macro
  • layers: layer outline (number of hidden layers and number of neurons per layer) for the MLP as a python dictionary
    • layers = {'l1':8, 'l2':4}
  • class_weight: add class_weight to the linear classifier (XGB, SVM) fit methods for imbalanced data (True or False)
    • class_weight = False
  • logo: leave-one-speaker-group out. Disregards train/dev splits, splits the speakers into the given number of groups, and performs a LOGO evaluation. If you want LOSO (leave one speaker out), simply set the number to the number of speakers.
    • logo = 10
  • k_fold_cross: k-fold cross validation. Disregards train/dev splits and does a stratified cross validation (meaning that classes are balanced across folds). Speaker IDs are ignored.
    • k_fold_cross = 10
  • loss: A loss function for regression ANN models (classification models use Cross Entropy Loss with or without class weights)
    • loss = mse/cross
    • possible values (SHOULD correspond with measure):
      • mse: mean squared error
      • mae: mean absolute error
      • 1-ccc: concordance correlation coefficient
      • cross: cross entropy
      • f1: Soft (differentiable) F1 Loss
  • measure: A measure/metric to report progress with regression experiments (classification always reports UAR)
    • measure = mse
    • possible values:
      • mse: mean squared error
      • mae: mean absolute error
      • ccc: concordance correlation coefficient
  • learning_rate: The learning rate for ANN models
    • learning_rate = 0.0001
  • drop: Add dropout (after each hidden layer); the value gives the dropout probability
    • drop = .5
  • batch_size: Size of the batch before backpropagation for neural nets
    • batch_size = 8
  • device: For torch/huggingface models: select your GPU number if you have one. Values are either "cpu" or GPU ids (e.g., 0, 1 or both as "0,1"). By default, the GPU/CUDA is used if available, otherwise the CPU.
    • device = 0
  • patience: Number of epochs to wait if the result gets better (for early stopping)
    • patience = 5
  • n_jobs: set/restrict the number of processes for model training (replaces former num_workers)
    • n_jobs = 8
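
As an illustration, a [MODEL] section for a small MLP classifier might look like this (layer sizes and hyperparameters are arbitrary examples):

```ini
[MODEL]
type = mlp
layers = {'l1':64, 'l2':16}
learning_rate = 0.0001
drop = .5
batch_size = 8
patience = 5
device = 0
```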

EXPL

  • model: Which model to use to estimate feature importance.
    • model = ['log_reg'] # can be any model from the MODEL section; if several are combined, the mean result is used.
  • max_feats: Maximal number of important features
    • max_feats = 10
  • sample_selection: Which sample set/split to use for feature importance, sample distribution, spotlight and feature distributions
    • sample_selection = all # either all, train or test
  • feature_distributions: plot distributions for features and analyze importance
    • feature_distributions = True
  • permutation: use feature permutation to determine the best features. Make sure to test the models first.
    • permutation = True
  • scatter: make a scatter plot of combined train and test data, colored by label.
    • scatter = ['tsne', 'umap', 'pca']
  • scatter.target: target for the scatter plot (defaults to target value).
    • scatter.target = ['age', 'gender', 'likability']
  • scatter.dim: dimension of reduction, can be 2 or 3.
    • scatter.dim = 2
  • plot_tree: Plot a decision tree for classification (Requires model = tree)
    • plot_tree = False
  • value_counts: plot distributions of the target for samples and speakers (stored in the fig_dir folder)
    • value_counts = [['gender'], ['age'], ['age', 'duration']]
  • column.bin_reals: if the column variable contains real numbers (instead of categories), should it be binned? Applies to any variable in value_counts as well as the target variable
    • age.bin_reals = True
  • dist_type: type of plot for value counts, either histogram (hist) or density estimation (kde)
    • dist_type = kde
  • spotlight: open a web-browser window to inspect the data with the spotlight software. Needs package renumics-spotlight to be installed!
    • spotlight = False
  • shap: compute SHAP values
    • shap = False
  • targets: Speaker/speech characteristics to be predicted by some models
    • targets = ['speaker', 'gender', 'age', 'snr', 'arousal', 'valence', 'dominance', 'pesq', 'mos']
  • sample_selection: which split: [train, test, all]
    • sample_selection = all
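
A sketch of an [EXPL] section combining feature importance with data inspection (values are illustrative):

```ini
[EXPL]
model = ['tree']
max_feats = 10
sample_selection = all
scatter = ['tsne']
value_counts = [['gender'], ['age']]
```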

EXPORT

  • target_root: New root directory for the database, will be created
    • target_root = ./exported_data/
  • orig_root: Path to folder that is parent to the original audio files
    • orig_root = ../data/emodb/wav
  • data_name: Name for the CSV file
    • data_name = exported_database
  • segments_as_files: whether to keep the original files or to write each segment as its own file (potentially resulting in many new files).
    • segments_as_files = False
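
Put together, an [EXPORT] section might look like this (paths are illustrative):

```ini
[EXPORT]
target_root = ./exported_data/
orig_root = ../data/emodb/wav
data_name = exported_database
segments_as_files = False
```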

CROSSDB

PLOT

  • name: special name as a prefix for all plots (stored in fig_dir).
    • name = my_special_config_within_the_experiment
  • epochs: whether to make a separate plot for every epoch result.
    • epochs = False
  • anim_progression: generate an animated GIF from the epoch plots
    • anim_progression = False
  • fps: frames per second for the animated GIF
    • fps = 1
  • epoch_progression: plot the progression of test, train and loss results over epochs
    • epoch_progression = False
  • best_model: search for the best performing model and plot conf matrix (needs MODEL.store to be turned on)
    • best_model = False
  • combine_per_speaker: plot an extra confusion matrix where the predictions per speaker are combined, using either the mode or the mean function
    • combine_per_speaker = mode
  • format: format for plots, either png or eps (for scalable graphics)
    • format = png
  • ccc: show concordance correlation coefficient in plot headings
    • ccc = False
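
For example, a [PLOT] section that animates per-epoch plots could look like this (values are illustrative):

```ini
[PLOT]
name = my_plots
epochs = True
anim_progression = True
fps = 2
format = png
```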

RESAMPLE

  • sample_selection: which split: [train, test, all]
    • sample_selection = all
  • replace: whether the samples should be replaced in place, or copies made and returned in a new dataframe
    • replace = False
  • target: the name of the new dataframe, if replace = False
    • target = data_resampled.csv
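
A minimal [RESAMPLE] section might read (values are illustrative):

```ini
[RESAMPLE]
sample_selection = all
replace = False
target = data_resampled.csv
```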

REPORT

  • show: print the report at the end
    • show = False
  • fresh: start a new report
  • latex: generate a LaTeX (and PDF) document; the value is the document name
    • latex = my_latex_document
  • title: title for document
  • author: author for document
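
Finally, a small [REPORT] section could look like this (title and author are placeholders):

```ini
[REPORT]
show = True
fresh = True
latex = my_latex_document
title = My experiment report
author = Jane Doe
```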