This visualizer explores the latent space of BigGAN (Brock et al., 2018) by using the pitch and tempo of an audio file to generate, and interpolate between, the noise and class vectors fed to the model. Classes are chosen manually or, optionally, through semantic similarity on BERT encodings of a lyrics corpus.
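To make "interpolating between class vectors" concrete, here is a minimal sketch. It assumes one-hot class vectors over BigGAN's 1,000 ImageNet classes and plain linear interpolation; the helper `interpolate` and the two class ids are hypothetical, and the visualizer's own pitch- and tempo-driven update rule is not reproduced here.

```python
import numpy as np

NUM_IMAGENET_CLASSES = 1000  # BigGAN is conditioned on the 1,000 ImageNet classes


def interpolate(start: np.ndarray, end: np.ndarray, steps: int) -> np.ndarray:
    """Return `steps` vectors moving linearly from `start` to `end` (both included)."""
    weights = np.linspace(0.0, 1.0, steps)[:, None]  # column of blend weights 0 ... 1
    return (1.0 - weights) * start + weights * end


# Hypothetical one-hot class vectors for two arbitrary ImageNet ids
class_a = np.zeros(NUM_IMAGENET_CLASSES)
class_a[1] = 1.0
class_b = np.zeros(NUM_IMAGENET_CLASSES)
class_b[207] = 1.0

path = interpolate(class_a, class_b, steps=5)  # shape (5, 1000)
```

Every intermediate row still sums to 1, so each frame is a weighted blend of the two classes rather than a new class.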
```
usage: visualizer.py [-h] -s SONG [-r {128,256,512}] [-d DURATION]
                     [-ps [200-295]] [-ts [0.05-0.8]]
                     [-c CLASSES [CLASSES ...]] [-n NUM_CLASSES] [-j [0-1]]
                     [-fl i*2^6] [-t [0.1-1]] [-sf [10-30]] [-bs BATCH_SIZE]
                     [-o OUTPUT_FILE] [--use_last_vectors]
                     [--use_last_classes] [--sort_pitch] [-l LYRICS]
                     [-e {sbert,doc2vec}] [-es {best,random,ransac}]
```
- To speed up runtime, the code can be run on Google Colab GPUs (or other cloud notebook providers) using `biggan_music_visualizer.ipynb` (hosted here).
- The `[-n NUM_CLASSES]` parameter selects the number of classes to interpolate between.
- By default, `[-n NUM_CLASSES]` random classes are selected. The `[-c CLASSES [CLASSES ...]]` parameter can be used to select specific ImageNet classes. A full list can be found here, and a list categorized by coarse descriptors here. Be sure to use the `int` ids and not the `string` labels, and set `[-n NUM_CLASSES]` to the number of chosen classes.
- Use the `[--sort_by_power]` flag to map classes to the `[-n NUM_CLASSES]` highest-power pitches. By default, classes are mapped to a chromatic scale.
- The `[-d DURATION]` parameter is useful for generating short videos while tweaking other parameters. Once the desired parameters are set, use the `[--use_last_vectors]` flag and remove the `[-d DURATION]` parameter to generate the same video at full length.
- Reducing the output resolution with `[-r {128,256,512}]` and/or increasing the frame length with `[-fl i*2^6]` can help reduce the runtime.
- To compute classes through semantic similarity to a lyrics file, use the `[-l LYRICS]` parameter. The embedding technique and the strategy for choosing classes can be set with `[-e {sbert,doc2vec}]` and `[-es {best,random,ransac}]`, respectively.
- Pitch and tempo sensitivity can be set with `[-ps [200-295]]` and `[-ts [0.05-0.8]]`, respectively. Jitter, truncation and smooth factor can be set with `[-j [0-1]]`, `[-t [0.1-1]]` and `[-sf [10-30]]`, respectively.
- See the help column of the arguments section for details on all parameters.
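The default chromatic mapping can be pictured as each of the 12 pitch classes (C through B) driving one ImageNet class. The sketch below assumes a `(12, frames)` chromagram, such as one computed by `librosa.feature.chroma_stft`, and the full 12 classes; the function name, the per-frame normalization and the class ids are illustrative assumptions, not the visualizer's code.

```python
from typing import Sequence

import numpy as np


def chroma_to_class_vectors(chroma: np.ndarray, class_ids: Sequence[int],
                            num_imagenet_classes: int = 1000) -> np.ndarray:
    """Map a (12, frames) chromagram onto (frames, num_imagenet_classes) class vectors.

    Pitch class i (C=0, ..., B=11) drives ImageNet class class_ids[i]; each
    frame's weights are normalized so that its class vector sums to 1.
    """
    if chroma.shape[0] != 12 or len(class_ids) != 12:
        raise ValueError("expected 12 pitch classes and 12 class ids")
    # Normalize each frame (column); guard against silent frames with a tiny floor
    weights = chroma / np.maximum(chroma.sum(axis=0, keepdims=True), 1e-8)
    vectors = np.zeros((chroma.shape[1], num_imagenet_classes))
    for pitch, class_id in enumerate(class_ids):
        vectors[:, class_id] += weights[pitch]
    return vectors


chroma = np.zeros((12, 2))
chroma[0, 0] = 1.0                 # frame 0: all energy in C
chroma[[0, 7], 1] = 0.5            # frame 1: C and G equally
class_ids = list(range(100, 112))  # hypothetical ImageNet ids, one per pitch class
vecs = chroma_to_class_vectors(chroma, class_ids)
```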
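Why a larger frame length cuts runtime can be seen with a little arithmetic. This sketch assumes that the frame length is the number of audio samples per video frame and that audio is loaded at 22,050 Hz (librosa's default load rate); both are assumptions, and `video_frame_stats` is a hypothetical helper that also enforces the `i*2^6` constraint from the table.

```python
import math


def video_frame_stats(duration_s: float, frame_length: int = 512,
                      sample_rate: int = 22050, batch_size: int = 20):
    """Estimate frames per second, number of video frames and number of BigGAN batches."""
    if frame_length <= 0 or frame_length % 64 != 0:
        raise ValueError("frame_length must be a positive multiple of 2**6 = 64")
    fps = sample_rate / frame_length
    num_frames = math.ceil(duration_s * sample_rate / frame_length)
    num_batches = math.ceil(num_frames / batch_size)
    return fps, num_frames, num_batches


fps, frames, batches = video_frame_stats(60)                        # one minute of audio
fps2, frames2, batches2 = video_frame_stats(60, frame_length=1024)  # doubled frame length
```

Under these assumptions, one minute at the defaults gives about 43 fps and 2,584 frames (130 batches of 20), while `frame_length=1024` gives about 21.5 fps and 1,292 frames (65 batches), so roughly half as many BigGAN forward passes.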
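One way to read the `best` strategy is: embed each lyric line and each ImageNet class label, then keep the labels whose embeddings are closest to the lyrics under cosine similarity. The sketch assumes the embeddings were already computed (for example with an SBERT model) and scores each label by its best match over lyric lines; the scoring rule and the helper name `best_classes` are assumptions, and the `random` and `ransac` strategies are not covered.

```python
import numpy as np


def best_classes(lyric_embeddings: np.ndarray, label_embeddings: np.ndarray,
                 num_classes: int) -> list:
    """Return indices of the `num_classes` labels most similar to the lyrics.

    lyric_embeddings: (num_lines, dim); label_embeddings: (num_labels, dim).
    Each label is scored by its highest cosine similarity to any lyric line.
    """
    def normalize(x: np.ndarray) -> np.ndarray:
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    sims = normalize(label_embeddings) @ normalize(lyric_embeddings).T  # (labels, lines)
    scores = sims.max(axis=1)
    return np.argsort(-scores)[:num_classes].tolist()


# Toy 2-D embeddings standing in for real sentence embeddings
label_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]])
lyric_embeddings = np.array([[1.0, 0.1], [0.9, 0.0]])
top = best_classes(lyric_embeddings, label_embeddings, num_classes=2)
```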
| short | long | default | range | help |
|---|---|---|---|---|
| `-h` | `--help` | | | show this help message and exit |
| `-s` | `--song` | | | path to input audio file [REQUIRED] |
| `-r` | `--resolution` | 512 | {128,256,512} | output video resolution |
| `-d` | `--duration` | None | int | output video duration |
| `-ps` | `--pitch_sensitivity` | 220 | [200-295] | controls the sensitivity of the class vector to changes in pitch |
| `-ts` | `--tempo_sensitivity` | 0.25 | [0.05-0.8] | controls the sensitivity of the noise vector to changes in volume and tempo |
| `-c` | `--classes` | None | | manually specify `[--num_classes]` ImageNet classes |
| `-n` | `--num_classes` | 12 | [1-12] | number of unique classes to use |
| `-j` | `--jitter` | 0.5 | [0-1] | controls jitter of the noise vector to reduce repetition |
| `-fl` | `--frame_length` | 512 | i*2^6 | number of audio frames per video frame in the output |
| `-t` | `--truncation` | 1 | [0.1-1] | BigGAN truncation parameter; controls complexity of structure within frames |
| `-sf` | `--smooth_factor` | 20 | [10-30] | controls interpolation between class vectors to smooth rapid fluctuations |
| `-bs` | `--batch_size` | 20 | int | BigGAN batch size |
| `-o` | `--output_file` | | | name of output file stored in `output/`; defaults to the base name of the `[--song]` path |
| | `--use_last_vectors` | False | bool | set flag to use previously saved class/noise vectors |
| | `--use_last_classes` | False | bool | set flag to use previous classes |
| | `--sort_pitches` | False | bool | set flag to sort pitches by the ordering of classes |
| `-l` | `--lyrics` | None | | path to lyrics file; setting `[--lyrics LYRICS]` computes classes by semantic similarity under BERT encodings |
| `-e` | `--encoding` | sbert | {sbert,doc2vec} | controls choice of sentence embedding technique |
| `-es` | `--encoding_strategy` | None | {random,best,ransac} | controls strategy for choosing classes: `[-e sbert]` can use best or random, while `[-e doc2vec]` can use ransac |
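The truncation parameter refers to BigGAN's truncation trick, which keeps noise values away from the tails of the normal distribution. The sketch below resamples any entry outside a fixed bound of 2 and then scales by the truncation value; this follows a common convention (used, for instance, by the `truncated_noise_sample` helper in the `pytorch_pretrained_biggan` package), but whether this visualizer does exactly the same is an assumption, as is the 128-dimensional noise size.

```python
import numpy as np


def truncated_noise(batch_size: int, dim_z: int = 128, truncation: float = 1.0,
                    bound: float = 2.0, seed=None) -> np.ndarray:
    """Sample standard-normal noise, resample any |z| > bound, then scale by truncation."""
    if not 0.0 < truncation <= 1.0:
        raise ValueError("truncation is expected in (0, 1]")
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((batch_size, dim_z))
    outside = np.abs(z) > bound
    while outside.any():
        # Rejection step: redraw only the out-of-range entries
        z[outside] = rng.standard_normal(outside.sum())
        outside = np.abs(z) > bound
    return truncation * z


z = truncated_noise(4, truncation=0.5, seed=0)  # every entry lies within [-1, 1]
```

Lower truncation values shrink the noise toward zero, which trades variety for simpler, more typical images.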
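Jitter is described as reducing repetition in the noise vector. A minimal sketch of one way to do that: each time the noise vector moves, a random half of its dimensions take a damped step of `(1 - jitter)` times the full step, so the dimensions do not all move in lockstep. The damping scheme, the helper names and the idea that the step size comes from the tempo sensitivity times a change in spectral power are all assumptions for illustration.

```python
import numpy as np


def jitter_mask(dim_z: int, jitter: float, rng: np.random.Generator) -> np.ndarray:
    """Per-dimension step multipliers: each entry is (1 - jitter) with probability 0.5, else 1."""
    if not 0.0 <= jitter <= 1.0:
        raise ValueError("jitter is expected in [0, 1]")
    damped = rng.random(dim_z) < 0.5
    return np.where(damped, 1.0 - jitter, 1.0)


def step_noise(noise: np.ndarray, direction: np.ndarray, step_size: float,
               jitter: float, rng: np.random.Generator) -> np.ndarray:
    """Move `noise` along `direction`, damping a random subset of dimensions."""
    return noise + step_size * direction * jitter_mask(noise.shape[-1], jitter, rng)


rng = np.random.default_rng(0)
noise = np.zeros(128)
direction = np.ones(128)
# Hypothetical step size, e.g. tempo_sensitivity times a change in spectral power
new_noise = step_noise(noise, direction, step_size=0.25, jitter=0.5, rng=rng)
```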
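A trailing moving average is one simple way a smooth factor could damp rapid class changes: each frame's class vector becomes the mean of the last `smooth_factor` frames. This is a sketch under that assumption; the visualizer's own interpolation scheme may differ, and `smooth_class_vectors` is a hypothetical name.

```python
import numpy as np


def smooth_class_vectors(class_vectors: np.ndarray, smooth_factor: int = 20) -> np.ndarray:
    """Trailing moving average of (frames, classes) vectors over `smooth_factor` frames."""
    if smooth_factor < 1:
        raise ValueError("smooth_factor must be >= 1")
    smoothed = np.empty_like(class_vectors, dtype=float)
    for t in range(len(class_vectors)):
        # Window covers frames [t - smooth_factor + 1, t], clipped at the start
        start = max(0, t - smooth_factor + 1)
        smoothed[t] = class_vectors[start:t + 1].mean(axis=0)
    return smoothed


frames = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # class flips every frame
smoothed = smooth_class_vectors(frames, smooth_factor=2)
```

Because each output row averages rows that sum to 1, the smoothed class vectors also sum to 1; a larger window trades responsiveness for steadier imagery.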
Thanks to Matt Siegelman for providing the inspiration as well as a boilerplate for the project.