This is the official implementation of "Long-Term Rhythmic Video Soundtracker" (ICML 2023).
Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, and Yu Qiao.
OpenGVLab, Shanghai Artificial Intelligence Laboratory
We present Long-Term Rhythmic Video Soundtracker (LORIS), a novel framework that synthesizes long-term conditional waveforms in sync with visual cues. The framework uses a latent conditional diffusion probabilistic model to perform waveform synthesis, together with a series of context-aware conditioning encoders that take temporal information into account for long-term generation. We also extend the model's applicability from dances to multiple sports scenarios such as floor exercise and figure skating. To enable comprehensive evaluation, we establish a benchmark for rhythmic video soundtracking, including a pre-processed dataset, improved evaluation metrics, and robust generative baselines.
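As an illustration of the overall idea (visual-cue conditioning, a latent diffusion model, then decoding to a waveform), the toy sketch below implements plain DDPM-style conditional sampling in PyTorch. The module, shapes, and noise schedule are hypothetical simplifications for exposition and do not reflect this repository's actual API.

import torch
import torch.nn as nn

class ToyConditionalDenoiser(nn.Module):
    """Toy network that predicts the noise in a latent, conditioned on visual-cue features."""
    def __init__(self, latent_dim=128, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, t, cond):
        t_feat = t.float().unsqueeze(-1) / 1000.0  # scalar timestep feature per sample
        return self.net(torch.cat([z_t, cond, t_feat], dim=-1))

@torch.no_grad()
def sample(denoiser, cond, steps=50, latent_dim=128):
    """Plain DDPM ancestral sampling with a linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn(cond.size(0), latent_dim)
    for i in reversed(range(steps)):
        t = torch.full((cond.size(0),), i, dtype=torch.long)
        eps = denoiser(z, t, cond)
        # posterior mean of the reverse step, then add noise except at the final step
        z = (z - betas[i] / torch.sqrt(1.0 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            z = z + torch.sqrt(betas[i]) * torch.randn_like(z)
    return z  # in LORIS, a latent like this would be decoded back to a waveform

# Condition on (hypothetical) pooled visual-cue features for a batch of two clips
denoiser = ToyConditionalDenoiser()
visual_cond = torch.randn(2, 64)
print(sample(denoiser, visual_cond).shape)  # torch.Size([2, 128])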
Install the dependencies:
pip install -r requirements.txt
To train LORIS on a given subset and clip length, run:
bash scripts/loris_{subset}_s{length}.sh
To run inference with a trained model, use:
bash scripts/infer_{subset}_s{length}.sh
Here {subset} denotes the target scenario (dance, figure skating, or floor exercise) and {length} the clip length in seconds.
The dataset is available on Hugging Face and can be loaded with the datasets library:
from datasets import load_dataset
dataset = load_dataset("OpenGVLab/LORIS")
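Once loaded, the object behaves like a standard Hugging Face DatasetDict; the lines below are a hedged inspection example, since the exact split and column names are defined by the release rather than assumed here.

print(dataset)                     # lists the available splits and their columns
first_split = next(iter(dataset))  # name of the first split in the release
print(dataset[first_split][0])     # inspect one example as a Python dict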
We provide the pre-trained checkpoints and the backbone audio diffusion model as follows:
- Audio-diffusion-pytorch-v0.0.43 (backbone)
- Dance, 25 seconds
- Figure Skating, 25 seconds
- Floor Exercise, 25 seconds
- Floor Exercise, 50 seconds
Note that these checkpoints must only be used for research purposes.
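For reference, a downloaded checkpoint can be inspected with vanilla PyTorch before wiring it into the training or inference scripts; the file name below is a placeholder and the state-dict layout depends on the release.

import torch

state = torch.load("checkpoint.pt", map_location="cpu")  # placeholder file name
print(list(state.keys()) if isinstance(state, dict) else type(state))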
@inproceedings{Yu2023Long,
title={Long-Term Rhythmic Video Soundtracker},
author={Yu, Jiashuo and Wang, Yaohui and Chen, Xinyuan and Sun, Xiao and Qiao, Yu},
booktitle={International Conference on Machine Learning (ICML)},
year={2023}
}
We would like to thank the authors of previous related projects for generously sharing their code and insights: audio-diffusion-pytorch, CDCD, D2M-GAN, VQ-Diffusion, and JukeBox.