Description
Describe your use-case.
The latest version of Diffusers allows configuring or selecting a specific attention backend such as FlashAttention-2/FlashAttention-3 (which supports the backward pass).
OneTrainer could potentially see a performance benefit if this were togglable via the provided API, per the documentation below, using a context manager:
https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends
```python
with attention_backend("flash"):
    # training/inference goes here
```
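As a rough illustration, a training step could be wrapped like this (a sketch only; the import path is my assumption and should be checked against the installed Diffusers version, and the training-loop names are placeholders, not OneTrainer code):

```python
# Sketch only: wrapping a single training step in the attention-backend
# context manager. The import path is an assumption on my side; model,
# batch, loss_fn and optimizer are placeholders for the real training loop.
from diffusers.models.attention_dispatch import attention_backend

def training_step(model, batch, loss_fn, optimizer):
    with attention_backend("flash"):  # FA2, same name as the env-var example below
        loss = loss_fn(model(batch))
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```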
It is also possible to force OneTrainer to use a particular attention backend by setting the following environment variable (example using FA2):
DIFFUSERS_ATTN_BACKEND=flash
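If OneTrainer went the environment-variable route instead, the variable presumably needs to be set before Diffusers performs its first attention dispatch; I have not verified the exact timing, so the ordering below is an assumption:

```python
import os

# Set before importing diffusers so the dispatcher can pick it up;
# "flash" selects FlashAttention-2 as in the example above.
os.environ.setdefault("DIFFUSERS_ATTN_BACKEND", "flash")

import diffusers  # noqa: E402
```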
The feature itself is marked as experimental, and I have yet to find out the potential drawbacks of using it for training. So far I have observed that, for training a Chroma LoRA with batch size 3 on an RTX 5090, the time per iteration dropped from 9.5 s/it to 6.4 s/it on my machine.
There are checks associated with each of the backends (i.e. shapes, dtypes, etc.) which are skipped by default. Depending on the selected training types in OneTrainer, things can easily break if these are not aligned. It is possible to enable the checks in Diffusers, which incurs extra overhead per attention call but can serve as a quick sanity check that the current configuration is sane:
DIFFUSERS_ATTN_CHECKS=1
I tried enabling this flag and it failed one of the assertions:
Attention mask must match the key's second to last dimension.
Probably needs some more investigation.
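For reference, my reading of what that assertion checks, with made-up tensor shapes (this is not the actual Diffusers check code):

```python
import torch

# SDPA-style shapes: query [B, H, q_len, d], key [B, H, kv_len, d],
# attn_mask broadcastable to [B, H, q_len, kv_len].
B, H, q_len, kv_len, d = 2, 8, 77, 64, 64
key = torch.randn(B, H, kv_len, d)
attn_mask = torch.ones(B, H, q_len, kv_len, dtype=torch.bool)

# The failing check presumably enforces this relationship:
assert attn_mask.shape[-1] == key.shape[-2], \
    "Attention mask must match the key's second to last dimension."
```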
What would you like to see as a solution?
Look into adding a dropdown to select supported attention backends (or at least document the feature, along with the necessary caveats and tested configurations/models).
The feature is currently marked as experimental by Hugging Face.
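To sketch what such a dropdown could map to on the OneTrainer side (entirely hypothetical code; only the backend name and the context manager come from the Diffusers docs, and the import path is the same assumption as above):

```python
from contextlib import nullcontext

from diffusers.models.attention_dispatch import attention_backend  # assumed path

# "default" leaves Diffusers' normal dispatch untouched; additional entries
# would need per-backend install checks and per-model testing first.
ATTENTION_BACKEND_CHOICES = ["default", "flash"]

def attention_backend_ctx(selection: str):
    """Return the context to wrap the training step with, based on the dropdown."""
    if selection == "default":
        return nullcontext()
    return attention_backend(selection)
```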
Have you considered alternatives? List them here.
No response