@mahmoudhas

Refactors the pipeline to use standalone functions for the different steps and applies torch.compile where possible. Observed the following improvements with torch.compile:

  • The UNet forward pass went down from 200 ms to 100 ms (per denoising step)
  • The VAE forward pass went down from 250 ms to 110 ms (per inference chunk)
  • The ImageProcessor (face and mask detection) time went down from 290 ms to 230 ms (per inference chunk)

The compilation takes around 1 minute.
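
For context, a minimal sketch of the approach, with placeholder modules standing in for the real models (the actual pipeline wires up its own UNet, VAE, and schedulers):

```python
import torch

# Placeholder stand-ins for the pipeline's UNet and VAE decode steps;
# shapes and architectures here are illustrative only.
unet = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
vae_decode = torch.nn.ConvTranspose2d(4, 3, kernel_size=3, padding=1)

# Compile once up front; the first call pays the ~1 minute compilation
# cost, and subsequent calls run the optimized kernels.
compiled_unet = torch.compile(unet)
compiled_vae = torch.compile(vae_decode)

latents = torch.randn(1, 4, 64, 64)
for _ in range(20):                   # denoising loop (one UNet pass per step)
    latents = compiled_unet(latents)
frames = compiled_vae(latents)        # per-chunk VAE decode
```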

TODO for production readiness:

  • Re-use the Python process to serve multiple requests, saving compilation time
  • Remove the torch profiler
  • Pad the last inference chunk so it doesn't trigger a recompilation (see the sketch below)
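
On the padding item: torch.compile specializes each compiled graph on input shapes, so a shorter final chunk compiles a second graph. A minimal sketch of one way to avoid that, assuming a frame-first layout and a repeat-last-frame filler (`chunk_size` and `pad_chunk` are hypothetical names, not the pipeline's code):

```python
import torch

def pad_chunk(chunk: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Pad the frame dimension up to chunk_size so every chunk has the
    same shape and torch.compile reuses the already-compiled graph."""
    pad = chunk_size - chunk.shape[0]
    if pad > 0:
        # Repeat the last frame as filler; the padded outputs are
        # sliced off again after inference.
        filler = chunk[-1:].expand(pad, *chunk.shape[1:])
        chunk = torch.cat([chunk, filler], dim=0)
    return chunk

last_chunk = torch.randn(5, 4, 64, 64)            # shorter final chunk
last_chunk = pad_chunk(last_chunk, chunk_size=8)  # now (8, 4, 64, 64)
```

An alternative would be marking the chunk dimension as dynamic so one graph handles all lengths, at the cost of less shape-specialized kernels.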
