Johannes Schusterbauer-Fischer* · Ming Gui* · Pingchuan Ma* · Nick Stracke · Stefan A. Baumann · Vincent Tao Hu · Björn Ommer
CompVis Group @ LMU Munich
* equal contribution
ECCV 2024 Oral
In this work, we leverage the complementary strengths of Diffusion Models (DMs), Flow Matching models (FMs), and Variational AutoEncoders (VAEs): the diversity of stochastic DMs, the speed of FMs during training and inference, and the efficiency of a convolutional decoder that maps latents into pixel space. This synergy yields a small diffusion model that excels at generating diverse samples at a low resolution. Flow Matching then takes a direct path from this lower-resolution representation to a higher-resolution latent, which is subsequently translated into a high-resolution image by a convolutional decoder. We achieve competitive high-resolution image synthesis at 1024² resolution.
During training, we feed both a low- and a high-res image through the pre-trained encoder to obtain a low- and a high-res latent code. Our model is trained to regress a vector field that forms a probability path from the low- to the high-res latent within the latent space.
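For concreteness, here is a minimal PyTorch sketch of such a training step (a sketch under assumptions: flow_model is a hypothetical velocity-prediction network and the latents follow a straight interpolation path; this is not the repository's exact API):

```python
import torch
import torch.nn.functional as F

def cfm_training_step(flow_model, z_lowres, z_highres):
    """One training step: regress the velocity of the straight
    probability path from the low-res to the high-res latent."""
    b = z_lowres.shape[0]
    t = torch.rand(b, device=z_lowres.device)      # sample time uniformly in [0, 1]
    t_ = t.view(b, 1, 1, 1)                        # broadcast over (C, H, W)
    z_t = (1 - t_) * z_lowres + t_ * z_highres     # point on the linear path
    v_target = z_highres - z_lowres                # constant velocity of that path
    v_pred = flow_model(z_t, t)                    # hypothetical model interface
    return F.mse_loss(v_pred, v_target)
```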
At inference, we can take any diffusion model, generate the low-res latent, and then use our Coupling Flow Matching model to synthesize the higher-dimensional latent code. Finally, the pre-trained decoder projects the latent code back to pixel space, resulting in the final high-resolution image.
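A hedged sketch of this inference pipeline with simple Euler integration (diffusion_model.sample, flow_model, and decoder are placeholder names, and the step count is arbitrary):

```python
import torch

@torch.no_grad()
def synthesize(diffusion_model, flow_model, decoder, num_steps=50):
    """Generate a low-res latent, move it along the learned flow to a
    high-res latent, and decode it to pixel space (illustrative names)."""
    z = diffusion_model.sample()                   # low-res latent from any DM
    dt = 1.0 / num_steps
    for i in range(num_steps):                     # Euler steps for dz/dt = v(z, t)
        t = torch.full((z.shape[0],), i * dt, device=z.device)
        z = z + dt * flow_model(z, t)
    return decoder(z)                              # latent -> high-res image
```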
We show a zero-shot quantitative comparison of our method against other state-of-the-art methods on the COCO dataset. Our method achieves a good trade-off between performance and computational cost.
We can cascade our models to increase the resolution of a generated image even further.
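In code, such a cascade might look like the following sketch (every name here is illustrative; it assumes each stage upsamples in pixel space before re-encoding, in the spirit of the PSU setup described in the training notes below):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cascade(image, flow_models, encoder, decoder, num_steps=50):
    """Boost resolution repeatedly: upsample pixels 2x, re-encode,
    and integrate the next flow model (illustrative interfaces)."""
    for flow_model in flow_models:
        image = F.interpolate(image, scale_factor=2, mode="bilinear")
        z = encoder(image)                         # latent of the upsampled image
        dt = 1.0 / num_steps
        for i in range(num_steps):                 # Euler integration as above
            t = torch.full((z.shape[0],), i * dt, device=z.device)
            z = z + dt * flow_model(z, t)
        image = decoder(z)                         # higher-resolution image
    return image
```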
You can find more qualitative results on our project page.
Please execute the following commands to download the first-stage autoencoder checkpoint:
mkdir -p checkpoints
wget -O checkpoints/sd_ae.ckpt "https://www.dropbox.com/scl/fi/lvfvy7qou05kxfbqz5d42/sd_ae.ckpt?rlkey=fvtu2o48namouu9x3w08olv3o&st=vahu44z5&dl=0"
For training the model, you have to provide a config file. An example config can be found in configs/flow400_64-128/unet-base_psu.yaml. Please customize the data part to your use case.
In order to speed up the training process, we pre-compute the latents. Your dataloader should return a batch with the following keys: image, latent, and latent_lowres. Please note that we use pixel space upsampling (PSU in the paper); therefore, latent and latent_lowres should have the same spatial resolution (refer to extract_from_batch() at L228 in fmboost/trainer.py).
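As an illustration, a dataset that satisfies this contract could look as follows (the shapes and the precomputation details are assumptions for a 64→128 setting, not the repository's exact preprocessing):

```python
import torch
from torch.utils.data import Dataset

class LatentPairDataset(Dataset):
    """Serves pre-computed latents; latent and latent_lowres share the
    same spatial resolution because the low-res image was upsampled in
    pixel space (PSU) before encoding (illustrative implementation)."""
    def __init__(self, images, latents, latents_lowres):
        self.images = images                       # e.g. (N, 3, 128, 128)
        self.latents = latents                     # e.g. (N, 4, 16, 16)
        self.latents_lowres = latents_lowres       # same shape as latents

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return {
            "image": self.images[idx],
            "latent": self.latents[idx],
            "latent_lowres": self.latents_lowres[idx],
        }
```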
Afterwards, you can start the training with
python3 train.py --config configs/flow400_64-128/unet-base_psu.yaml --name your-name --use_wandb
The flag --use_wandb enables logging to Weights & Biases (WandB). Without it, metrics are only logged to a CSV file and TensorBoard. All logs are stored in the logs folder. You can also define a folder structure for your experiment name, e.g. logs/exp_name.
If you want to resume from a checkpoint, just add the additional parameter
... --resume_checkpoint path_to_your_checkpoint.ckpt
This resumes all states from the checkpoint (e.g., optimizer states). If you instead want to load only the weights in a non-strict manner from a checkpoint, use the --load_weights argument.
We will release a pretrained checkpoint and the corresponding inference Jupyter notebook soon. Stay tuned!
Please cite our paper:
@misc{fischer2023boosting,
  title={Boosting Latent Diffusion with Flow Matching},
  author={Johannes S. Fischer and Ming Gui and Pingchuan Ma and Nick Stracke and Stefan A. Baumann and Vincent Tao Hu and Björn Ommer},
  year={2023},
  eprint={2312.07360},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}