- All options are specified in a .ini file, using Python configparser syntax
- Most (though not all) values have defaults
- root: experiment root folder
- root = ./results/
- type: the kind of experiment
- type = classification
- possible values:
- classification: supervised learning experiment with a restricted set of categories (e.g., emotion categories).
- regression: supervised learning experiment with continuous values (e.g., speaker age in years).
- store: (relative to root) folder for caches
- store = ./store/
- name: a name for debugging output
- name = emodb_exp
- fig_dir: (relative to root) folder for plots
- fig_dir = ./images/
- res_dir: (relative to root) folder for result output
- res_dir = ./results/
- models_dir: (relative to root) folder to save models
- models_dir = ./models/
- runs: number of runs (e.g., to average over random initializations)
- runs = 1
- epochs: number of epochs for ANN training
- epochs = 1
- save: save the experiment as a pickle file to be restored again later (True or False)
- save = False
- save_test: file name to save the test predictions as a new database in CSV format (default: False, i.e., don't save)
- save_test = ./my_saved_test_predictions.csv
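A minimal sketch of how these experiment-level options fit together in configparser syntax (the section name [EXP] is an assumption here, and the values are only illustrative):

```ini
# assumed section name; illustrative values
[EXP]
root = ./results/
name = emodb_exp
type = classification
runs = 1
epochs = 50
save = True
```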
- databases: names of the databases to compare for the multidb module
- databases = ['emodb', 'timit']
- use_splits: for the multidb module: whether to use the original split sets when a database is used as train or test data; otherwise the whole database is used.
- use_splits = True
- type: a flag to mark continuous target data so that it can be binned into categorical data (using bins and labels)
- type = continuous
- databases: list of databases to be used in the experiment
- databases = ['emodb', 'timit']
- tests: Datasets to be used as test data for the given best model. The databases do NOT have to appear in the databases field!
- tests = ['emovo']
- root_folders: an additional configuration file that holds all entries starting with a dataset name, acting as global defaults.
- root_folders = data_roots.ini
- db_name: path to the audformat repository for each database listed in 'databases'. If this path is not absolute, it is treated as relative to the experiment folder.
- emodb = /home/data/audformat/emodb/
- db_name.type: type of storage, e.g., audformat database or 'csv' (needs header: file, speaker, task)
- emodb.type = audformat
- db_name.absolute_path: only for 'csv' databases: are the audio file paths relative or absolute? If not absolute, they will be treated relative to the database parent folder. NOT the experiment root folder.
- my_data.absolute_path = True
- db_name.audio_path: only for 'csv' databases: are the audio files in a special common folder?
- my_data.audio_path = wav_files/
- db_name.mapping: mapping python dictionary to map between categories for cross-database experiments (format: {'target_emo':'source_emo'})
- emodb.mapping = {'anger':'angry', 'happiness':'happy', 'sadness':'sad', 'neutral':'neutral'}
- db_name.colnames: mapping to rename columns to standard names
- my_data.colnames = {'speaker':'Participant ID', 'sex':'gender', 'Age': 'age'}
- db_name.split_strategy: How to identify sets for train/development data splits within one database
- emodb.split_strategy = reuse
- Possible values:
- database: default (task.train and task.test)
- specified: specify the tables (an opportunity to assign multiple or no tables to train or dev set)
- emodb.train_tables = ['emotion.categories.train.gold_standard']
- emodb.test_tables = ['emotion.categories.test.gold_standard']
- speaker_split: split samples randomly but speaker disjunct, given a percentage of speakers for the test set.
- emodb.test_size = 50 (default:20)
- list of test speakers: you can simply provide a list of test ids
- emodb.split_strategy = [12, 14, 15, 16]
- random: split samples randomly (but NOT speaker disjunct, e.g., when no speaker info is given or each sample has its own speaker), given a percentage of samples for the test set.
- emodb.test_size = 50 (default: 20)
- reuse: reuse the splits after a speaker_split run to save time with feature extraction.
- train: use the entire database for training
- test: use the entire database for evaluation
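A sketch of a database entry using two of the split strategies listed above (the section name [DATA] is an assumption; paths and values are illustrative):

```ini
# assumed section name; paths are illustrative
[DATA]
databases = ['emodb']
emodb = /home/data/audformat/emodb/
emodb.type = audformat
emodb.split_strategy = speaker_split
emodb.test_size = 20
# alternative: use tables that the database already provides
# emodb.split_strategy = specified
# emodb.train_tables = ['emotion.categories.train.gold_standard']
# emodb.test_tables = ['emotion.categories.test.gold_standard']
```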
- db_name.target_tables: tables that contain the target / speaker / sex labels
- emodb.target_tables = ['emotion']
- target_tables_append: set this to True if multiple tables should be combined row-wise; otherwise they are combined column-wise
- target_tables_append = False
- db_name.files_tables: tables that contain the audio file names
- emodb.files_tables = ['files']
- db_name.test_tables: tables that should be used for testing
- emodb.test_tables = ['emotion.categories.test.gold_standard']
- db_name.train_tables: tables that should be used for training
- emodb.train_tables = ['emotion.categories.train.gold_standard']
- db_name.limit_samples: maximum number of randomly selected samples per table (mainly for testing with very large data)
- emodb.limit_samples = 20
- db_name.required: require the samples to have a specific label (for example, keep only samples with gender labeled in a database where this is not the case for all samples, e.g., MozillaCommonVoice)
- emodb.required = gender
- db_name.limit_samples_per_speaker: maximum number of samples per speaker (for leveling data where the same speakers have a large number of samples)
- emodb.limit_samples_per_speaker = 20
- db_name.min_duration_of_sample: limit the samples to a minimum length (in seconds)
- emodb.min_duration_of_sample = 0.0
- db_name.max_duration_of_sample: limit the samples to a maximum length (in seconds)
- emodb.max_duration_of_sample = 0.0
- db_name.rename_speakers: add the database name to the speaker names, e.g., because several databases use the same names
- emodb.rename_speakers = False
- db_name.filter: don't use all the data, but only rows with selected values in given columns: a list of [column, value] pairs
- emodb.filter = [['gender', 'female']]
- db_name.scale: scale (standard normalize) the target variable (if numeric)
- my_data.scale = True
- db_name.reverse: reverse the target variable (if numeric). I.e. f(x) = abs(x-max)
- db_name.reverse.max: max value to be used in the formula above. If omitted, the distribution will start with 0.
- target: the task name, e.g. age or emotion
- target = emotion
- labels: for classification experiments: the names of the categories (is also used for regression when binning the values)
- labels = ['anger', 'boredom', 'disgust', 'fear', 'happiness', 'neutral', 'sadness']
- bins: array of integers to be used for binning continuous data
- bins = [-100, 40, 50, 60, 70, 100]
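For example, a continuous target such as age could be binned into categories with these options (a sketch; the label names are made up here):

```ini
# illustrative: bin a continuous age target into five categories
type = continuous
target = age
bins = [-100, 40, 50, 60, 70, 100]
labels = ['u40', '40s', '50s', '60s', '70+']
```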
- no_reuse: don't re-use any tables, but start fresh
- no_reuse = False
- min_dur_test: specify a minimum duration for test samples (in seconds)
- min_dur_test = 3.5
- target_divide_by: divide the target values by some factor, e.g., to scale age from years to values between 0 and 1
- target_divide_by = 100
- limit_samples: maximum number of randomly selected samples per sample selection
- limit_samples = 20
- limit_samples_per_speaker: maximum number of samples per speaker per sample selection
- limit_samples_per_speaker = 20
- min_duration_of_sample: limit the samples to a minimum length (in seconds) per sample selection
- min_duration_of_sample = 0.0
- max_duration_of_sample: limit the samples to a maximum length (in seconds) per sample selection
- max_duration_of_sample = 0.0
- check_size: check the filesize of all samples in train and test splits in bytes
- check_size = 1000
- check_vad: check if the files contain speech, using silero VAD
- check_vad = True
- augment: select the methods to augment: either traditional or random_splice
- augment = ['traditional', 'random_splice']
- choices are:
- traditional: uses the audiomentations package
- random_splice: randomly re-orders short splices (obfuscates the words)
- p_reverse: for random_splice: probability of some samples to be in reverse order (default: 0.3)
- top_db: for random_splice: top db level for silence to be recognized (default: 12)
- sample_selection: select the samples to augment: either train, test, or all
- sample_selection = all
- result: file name to store the augmented data (can then be added to training)
- result = augmented.csv
- augmentations: select the augmentation methods for the audiomentations package. A default is provided.
- augmentations = Compose([AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.05), Shift(p=0.5), BandPassFilter(min_center_freq=100.0, max_center_freq=6000)])
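A sketch of an augmentation setup combining the options above (the section name [AUGMENT] is an assumption):

```ini
# assumed section name
[AUGMENT]
augment = ['traditional']
sample_selection = train
result = augmented.csv
augmentations = Compose([AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.05), Shift(p=0.5)])
```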
- sample_selection: select the samples to segment: either train, test, or all
- sample_selection = all
- segment_result: name of the resulting segmented data table
- segment_result = segmented.csv
- method: select the segmentation model
- method = silero
- min_length: the minimum length of rest samples (in seconds)
- min_length = 2
- max_length: the maximum length of segments; longer ones are cut here. (in seconds)
- max_length = 10 # if not set, original segmentation is used
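A sketch of a segmentation setup using the options above (the section name [SEGMENT] is an assumption):

```ini
# assumed section name
[SEGMENT]
sample_selection = all
method = silero
min_length = 2
max_length = 10
segment_result = segmented.csv
```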
- type: a comma-separated list of types of features; they will be column-wise concatenated
- type = ['os']
- possible values:
- import: already computed features
- import_file: paths to files with features in CSV format
- import_file = ['path1/file1.csv', 'path2/file2.csv']
- mld: mid-level-descriptors
- mld.model = path to the mld sources folder
- min_syls = minimum number of syllables
- os: openSMILE features
- set = eGeMAPSv02 (feature set)
- level = functionals (or lld: feature level)
- os.features: list of selected features (disregard others)
- praat: selected Praat features, extracted with David R. Feinberg's scripts
- praat.features: list of selected features (disregard others)
- spectra: mel spectrograms for convolutional networks
- fft_win_dur = 25 (msec analysis frame/window length)
- fft_hop_dur = 10 (msec hop duration)
- fft_nbands = 64 (number of frequency bands)
- ast: audio spectrogram transformer features from MIT
- trill: TRILL embeddings from Google
- trill.model = path to the TRILL model folder, optional
- wav2vec variants: wav2vec2 embeddings from facebook
- "wav2vec2-large-robust-ft-swbd-300h"
- wav2vec.model = path to the wav2vec2 model folder
- wav2vec2.layer = which last hidden layer to use
- Hubert variants: Facebook Hubert models
- "hubert-base-ls960", "hubert-large-ll60k", "hubert-large-ls960-ft", "hubert-xlarge-ll60k", "hubert-xlarge-ls960-ft"
- WavLM:
- "wavlm-base", "wavlm-base-plus", "wavlm-large"
- Whisper: whisper models
- "whisper-base", "whisper-large", "whisper-medium", "whisper-tiny"
- audmodel: embeddings from the audEERING emotion model, a wav2vec 2.0 model finetuned on MSP-Podcast emotions
- aud.model = ./audmodel/ (path to the audEERING model folder)
- auddim: emotional dimensions (arousal, dominance, valence) predicted by the audEERING emotion model, a wav2vec 2.0 model finetuned on MSP-Podcast
- agender: embeddings from the audEERING age and gender model, a wav2vec 2.0 model finetuned on several age databases
- agender.model = ./agender/ (path to the audEERING model folder)
- agender_agender: age and gender predictions from the audEERING age and gender model, a wav2vec 2.0 model finetuned on several age and gender databases: age, female, male, child
- clap: LAION's CLAP embeddings
- xbow: openXBOW bag-of-audio-words features, with the codebook computed from openSMILE features
- xbow.model = path to xbow root folder (containing xbow.jar)
- size = 500 (codebook size, rule of thumb: should grow with datasize)
- assignments = 10 (number of words in the bag representation where the counter is increased for each input LLD, rule of thumb: should grow/shrink with codebook size)
- snr: estimated SNR (signal-to-noise ratio)
- mos: estimated MOS (mean opinion score)
- pesq: estimated PESQ (Perceptual Evaluation of Speech Quality)
- sdr: estimated SDR (signal-to-distortion ratio)
- spkrec: speaker-ID embeddings from speechbrain
- stoi: estimated STOI (Short-Time Objective Intelligibility)
- features: python list of selected features to be used (all others are ignored)
- features = ['JitterPCA', 'meanF0Hz', 'hld_sylRate']
- no_reuse: don't re-use already extracted features, but start fresh
- no_reuse = False
- store_format: how to store the features: possible values [pkl | csv]
- store_format = pkl
- scale: scale the features (important for gmm)
- scale = standard
- possible values:
- standard: z-transformation (mean of 0 and std of 1) based on the training set
- robust: robust scaler
- speaker: like standard but based on individual speaker sets (also for the test)
- bins: convert feature values into 0, .5 and 1 (for low, mid and high)
- set: name of opensmile feature set, e.g. eGeMAPSv02, ComParE_2016, GeMAPSv01a, eGeMAPSv01a
- set = eGeMAPSv02
- level: level of opensmile features
- level = functional
- possible values:
- functional: aggregated over the whole utterance
- lld: low-level descriptor: framewise
- balancing: balance the features with respect to class distribution
- balancing = smote
- possible values:
- ros: simply repeat random samples from the minority classes
- smote: create new minority-class samples by small changes to existing ones
- adasyn: similar to smote, but resulting in uneven class distributions
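A sketch of a [FEATS] section combining several of the options above (the combination itself is illustrative):

```ini
[FEATS]
type = ['os', 'praat']
set = eGeMAPSv02
level = functionals
scale = standard
balancing = smote
store_format = pkl
```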
- type: type of classifier
- type = svm
- possible values:
- bayes: Naive Bayes classifier
- cnn: Convolutional neural network (only works with feature type=spectra)
- finetune: finetune a transformer model with huggingface. In this case the features are ignored, because the audio files are used directly.
- pretrained_model: Base model for finetuning/transfer learning. Variants of wav2vec2, Hubert, and WavLM are tested to work. Default is facebook/wav2vec2-large-robust-ft-swbd-300h.
- pretrained_model = microsoft/wavlm-base
- push_to_hub: For finetuning, whether to push the model to the huggingface model hub. Default is False.
- push_to_hub = True
- max_duration: Max. duration of samples/segments for the transformer in seconds, frames are pooled.
- max_duration = 8.0
- gmm: Gaussian mixture classifier
- GMM_components = 4 (currently must be the same as number of labels)
- GMM_covariance_type = full | tied | diag | spherical
- knn: k nearest neighbor classifier
- K_val = 5
- KNN_weights = uniform | distance
- knn_reg: K nearest neighbor regressor
- mlp: Multi-Layer-Perceptron for classification
- mlp_reg: Multi-Layer-Perceptron for regression
- svm: Support Vector Machine
- C_val = 0.001
- kernel = rbf # 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
- svr: Support Vector Regression
- C_val = 0.001
- kernel = rbf # 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
- tree: decision tree classifier
- tree_reg: decision tree regressor
- xgb: XG-Boost
- xgr: XG-Boost Regression
- balancing: balancing for finetune type; for other than finetune, set balancing in [FEATS].
- possible values: [ros, smote, adasyn]
- tuning_params: possible tuning parameters for x-fold optimization (for Bayes, KNN, KNN_reg, Tree, Tree_reg, SVM, SVR, XGB and XGR)
- tuning_params = ['subsample', 'n_estimators', 'max_depth']
- subsample = [.5, .7]
- n_estimators = [50, 80, 200]
- max_depth = [1, 6]
- scoring: scoring measure for the optimization
- scoring = recall_macro
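A sketch of a [MODEL] section with hyperparameter tuning for XGBoost, based on the options above (the value grids are illustrative):

```ini
[MODEL]
type = xgb
tuning_params = ['subsample', 'n_estimators', 'max_depth']
subsample = [.5, .7]
n_estimators = [50, 80, 200]
max_depth = [1, 6]
scoring = recall_macro
```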
- layers: layer outline (number of hidden layers and number of neurons per layer) for the MLP as a python dictionary
- layers = {'l1':8, 'l2':4}
- class_weight: add class_weight to the linear classifier (XGB, SVM) fit methods for imbalanced data (True or False)
- class_weight = False
- logo: leave-one-speaker-group out. Will disregard train/dev splits, split the speakers into logo groups and then do a LOGO evaluation. If you want LOSO (leave one speaker out), simply set the number to the number of speakers.
- logo = 10
- k_fold_cross: k-fold cross validation. Will disregard train/dev splits and do a stratified cross validation (meaning that classes are balanced across folds). Speaker IDs are ignored.
- k_fold_cross = 10
- loss: A loss function for regression ANN models (classification models use Cross Entropy Loss with or without class weights)
- loss = mse/cross
- possible values (SHOULD correspond with measure):
- mse: mean squared error
- mae: mean absolute error
- 1-ccc: concordance correlation coefficient
- cross: cross entropy loss
- f1: Soft (differentiable) F1 Loss
- measure: A measure/metric to report progress with regression experiments (classification is UAR)
- measure = mse
- possible values:
- mse: mean squared error
- mae: mean absolute error
- ccc: concordance correlation coefficient
- learning_rate: The learning rate for ANN models
- learning_rate = 0.0001
- drop: Adding dropout (after each hidden layer). Value states dropout probability
- drop = .5
- batch_size: Size of the batch before backpropagation for neural nets
- batch_size = 8
- device: For torch/huggingface models: select your GPU number if you have one. Values are either "cpu" or GPU ids (e.g., 0, 1 or both "0,1"). By default, the GPU/CUDA is used if available, otherwise the CPU.
- device = 0
- patience: Number of epochs to wait without improvement of the result (for early stopping)
- patience = 5
- n_jobs: set/restrict the number of processes for model training (replaces former num_workers)
- n_jobs = 8
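A sketch of a [MODEL] section for an MLP regressor using the neural-net options above (layer sizes and values are illustrative):

```ini
[MODEL]
type = mlp_reg
layers = {'l1':64, 'l2':16}    # illustrative layer outline
loss = mse
measure = mse
learning_rate = 0.0001
drop = .5
batch_size = 8
patience = 5
device = cpu
```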
- model: Which model to use to estimate feature importance.
- model = ['log_reg'] # can be any of the models from the MODEL section; if several are combined, the mean result is used.
- max_feats: Maximal number of important features
- max_feats = 10
- sample_selection: Which sample set/split to use for feature importance, sample distribution, spotlight and feature distributions
- sample_selection = all # either all, train or test
- feature_distributions: plot distributions for features and analyze importance
- feature_distributions = True
- permutation: use feature permutation to determine the best features. Make sure to test the models before.
- permutation = True
- scatter: make a scatter plot of combined train and test data, colored by label.
- scatter = ['tsne', 'umap', 'pca']
- scatter.target: target for the scatter plot (defaults to target value).
- scatter.target = ['age', 'gender', 'likability']
- scatter.dim: dimension of reduction, can be 2 or 3.
- scatter.dim = 2
- plot_tree: Plot a decision tree for classification (Requires model = tree)
- plot_tree = False
- value_counts: plot distributions of target for the samples and speakers (in the image_dir)
- value_counts = [['gender'], ['age'], ['age', 'duration']]
- column.bin_reals: if the column variable contains real numbers (instead of categories), should it be binned? This applies to any value in value_counts as well as to the target variable
- age.bin_reals = True
- dist_type: type of plot for value counts, either histogram (hist) or density estimation (kde)
- dist_type = kde
- spotlight: open a web-browser window to inspect the data with the spotlight software. Needs package renumics-spotlight to be installed!
- spotlight = False
- shap: compute SHAP values
- shap = False
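A sketch of how these exploration options might be combined (the section name [EXPL] is an assumption):

```ini
# assumed section name
[EXPL]
model = ['log_reg']
max_feats = 10
sample_selection = all
feature_distributions = True
scatter = ['tsne']
shap = False
```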
- targets: Speaker/speech characteristics to be predicted by some models
- targets = ['speaker', 'gender', 'age', 'snr', 'arousal', 'valence', 'dominance', 'pesq', 'mos']
- sample_selection: which split: [train, test, all]
- sample_selection = all
- target_root: New root directory for the database, will be created
- target_root = ./exported_data/
- orig_root: Path to folder that is parent to the original audio files
- orig_root = ../data/emodb/wav
- data_name: Name for the CSV file
- data_name = exported_database
- segments_as_files: whether the original files should be used, or split into segments (potentially resulting in many new files).
- segments_as_files = False
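A sketch of an export setup using the options above (the section name [EXPORT] is an assumption; paths are illustrative):

```ini
# assumed section name; paths are illustrative
[EXPORT]
target_root = ./exported_data/
orig_root = ../data/emodb/wav
data_name = exported_database
segments_as_files = False
```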
- train_extra: add an additional training partition to all experiments in the cross-database series. This extra data should be described in a root_folders file
- train_extra = ['addtrain_db_1', 'addtrain_db_2']
- name: special name as a prefix for all plots (stored in img_dir).
- name = my_special_config_within_the_experiment
- epochs: whether to make a separate plot for every epoch result.
- epochs = False
- anim_progression: generate an animated GIF from the epoch plots
- anim_progression = False
- fps: frames per second for the animated GIF
- fps = 1
- epoch_progression: plot the progression of test, train and loss results over epochs
- epoch_progression = False
- best_model: search for the best performing model and plot conf matrix (needs MODEL.store to be turned on)
- best_model = False
- combine_per_speaker: print an extra confusion plot where the predictions per speaker are combined, using either the mode or the mean function
- combine_per_speaker = mode
- format: format for plots, either png or eps (for scalable graphics)
- format = png
- ccc: show concordance correlation coefficient in plot headings
- ccc = False
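A sketch of a plot configuration using the options above (the section name [PLOT] is an assumption):

```ini
# assumed section name
[PLOT]
best_model = True
epoch_progression = True
combine_per_speaker = mode
format = png
ccc = True
```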
- sample_selection: which split: [train, test, all]
- sample_selection = all
- replace: whether the samples should be replaced in place, or copies made and a new dataframe produced
- replace = False
- target: the name of the new dataframe, if replace == False
- target = data_resampled.csv
- show: print the report at the end
- show = False
- fresh: start a new report
- latex: generate a latex and PDF document: name of document
- latex = my_latex_document
- title: title for document
- author: author for document
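A sketch of a report configuration using the options above (the section name [REPORT] is an assumption; title and author are made-up examples):

```ini
# assumed section name; title and author are made up
[REPORT]
show = True
latex = my_latex_document
title = My emodb experiment
author = Jane Doe
```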