
Conversation

@dxqb (Collaborator) commented Nov 8, 2025

torch SDPA automatically uses a flash attention kernel when possible. However, while torch picks flash attention when no attention mask is given at all, it does not recognize a no-op attention mask (a mask is passed, but no tokens are actually masked).

This PR detects no-op masks and passes no mask instead, resulting in a significant speed-up of 20-25% in those cases (see the sketch at the end of this comment). The mask is always a no-op if the batch size is 1, and less often if the batch size is > 1.

| s/it | Qwen bs1, 1328 px, RTX 6000 Ada | Chroma bs2, 512 px, RTX 4070 |
| --- | --- | --- |
| OneTrainer baseline | 8.6 | 2.3 |
| OneTrainer avoiding masks | 6.6 | 1.85 |
| OneTrainer +compiled | 5.1 | |
| OneTrainer +int8 | 3.4 | |

In principle this can be combined with other upcoming features that improve performance further; see the last two rows of the table.

Thank you to @FurkanGozukara for pointing out this speed difference.

  • only implemented for Qwen and Chroma. Hunyuan, HiDream and Pixart appear to use attention masks as well
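
A minimal sketch of the idea, assuming a boolean mask where `True` means "attend" (illustrative only, not the OneTrainer implementation):

```python
import torch
import torch.nn.functional as F

def sdpa_drop_noop_mask(q, k, v, attn_mask=None):
    # If a boolean mask is given but masks nothing (all True), drop it so that
    # torch.nn.functional.scaled_dot_product_attention can dispatch to its
    # flash attention kernel instead of the slower masked code path.
    if attn_mask is not None and attn_mask.dtype == torch.bool and bool(attn_mask.all()):
        attn_mask = None
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)

# Example: batch size 1 with no padded tokens -> the mask is a no-op and gets dropped.
q = k = v = torch.randn(1, 8, 77, 64)                    # (batch, heads, seq, head_dim)
noop_mask = torch.ones(1, 1, 77, 77, dtype=torch.bool)   # masks nothing
out = sdpa_drop_noop_mask(q, k, v, noop_mask)
```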

@dxqb (Collaborator, Author) commented Nov 8, 2025

> * [x] only implemented for Qwen and Chroma. Hunyuan, HiDream and Pixart appear to use attention masks as well

For now it doesn't make sense to implement this for the other models, because they all use a fixed sequence length, so their attention mask is never a no-op.
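
To illustrate with a toy padding mask (hypothetical lengths, not OneTrainer code): when padding only to the longest prompt in the batch, a batch of one is padded to its own length and the mask keeps everything, while a fixed sequence length leaves masked positions for any shorter prompt.

```python
import torch

lengths = torch.tensor([77])                                            # batch size 1, prompt length 77
per_batch_mask = torch.arange(int(lengths.max())) < lengths[:, None]    # shape (1, 77): all True -> no-op
fixed_len_mask = torch.arange(512) < lengths[:, None]                   # shape (1, 512): mostly False -> real mask

print(per_batch_mask.all().item())   # True  -> mask can be dropped
print(fixed_len_mask.all().item())   # False -> mask must be kept
```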

@dxqb dxqb marked this pull request as ready for review November 8, 2025 15:53
@dxqb dxqb added the merging last steps before merge label Nov 8, 2025
@dxqb dxqb changed the title Avoid attention masks Avoid attention masks for Qwen and Chroma Nov 8, 2025
dxqb added a commit to dxqb/OneTrainer that referenced this pull request Nov 8, 2025
@zzlol63 commented Nov 9, 2025

Looks good. I just did a quick test with #1107 to include the FlashAttention code path on Windows and am seeing a speedup with Chroma, more so when the batch size is 1, which avoids the padded text token scenario that requires attention masking.

#chroma LoRA 24GB + bs2 + res 1024
before: 4.8s/it
after: 4.2-4.8s/it (fluctuating across steps)

#chroma LoRA 24GB + bs1 + res 1024
before: 2.3s/it
after: 1.5s/it

It would be more advantageous to train with batch size 1 and use accumulation steps to make up for the reduced batch size, in order to take advantage of the performance gains from FlashAttention.
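
A generic sketch of that pattern with a toy model (not OneTrainer code): two accumulation steps at batch size 1 give the same effective batch size as batch size 2, while each individual forward pass can stay on the unmasked fast path.

```python
import torch
from torch import nn

# Toy setup purely to show the accumulation pattern.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
micro_batches = [torch.randn(1, 8) for _ in range(4)]    # batch size 1 each
accumulation_steps = 2                                   # effective batch size = 1 * 2

optimizer.zero_grad()
for i, x in enumerate(micro_batches):
    loss = model(x).pow(2).mean() / accumulation_steps   # scale so gradients average
    loss.backward()                                      # gradients accumulate in .grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```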

@dxqb dxqb merged commit ccc0501 into Nerogar:master Nov 9, 2025
1 check passed
@dxqb dxqb deleted the avoid_attn_mask branch November 9, 2025 12:28
@DarkViewAI

@dxqb I just tested this and a previous commit; it seems avoiding the attention mask started giving me banding issues in my images, like those long line artifacts.

Using the default config for Qwen in OneTrainer:
[Screenshot 2025-11-12 180430: qwen lora 24gb preset]

I did not have these on a previous commit.

@zzlol63 commented Nov 13, 2025

@dxqb Is the issue due to the removal of the tensor multiplication on the mask for Qwen?

@dxqb (Collaborator, Author) commented Nov 13, 2025

> @dxqb I just tested this and a previous commit; it seems avoiding the attention mask started giving me banding issues in my images, like those long line artifacts.

It's not possible that this PR is the reason for this. Please look for other reasons, or make a direct comparison.

This PR is mathematically identical before and after, both in theory and in tests.
Because the Qwen diffusers pipeline has bugs regarding attention masks, I wrote test code that confirms this in the following way:
the same code that is used for training is used for sampling images instead. I sample 4 different prompts, long and short: once each image separately with batch size 1, and once as a batch of 4 using an attention mask. The result is pixel-identical.
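
A condensed sketch of that kind of check, assuming a hypothetical `sample_images` helper that runs the shared training/sampling code with a fixed seed:

```python
import torch

def check_mask_equivalence(sample_images, prompts):
    # sample_images(prompts) is assumed to return one image tensor per prompt,
    # deterministically (fixed seed). With a single prompt no mask is needed;
    # with the full batch a padding attention mask is used internally.
    single = [sample_images([p])[0] for p in prompts]
    batched = sample_images(prompts)
    for img_single, img_batched in zip(single, batched):
        assert torch.equal(img_single, img_batched), "outputs are not pixel-identical"
```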

@dxqb (Collaborator, Author) commented Nov 18, 2025

> @dxqb I just tested this and a previous commit; it seems avoiding the attention mask started giving me banding issues in my images, like those long line artifacts.

Was this generated with a Lightning 8-step or 4-step LoRA?

@DarkViewAI

@dxqb Hi, no, this was a OneTrainer sample, but the LoRA did the same in ComfyUI.

@DarkViewAI commented Nov 18, 2025

I've just been using an older commit and it works fine now.

@dxqb (Collaborator, Author) commented Nov 18, 2025

> @dxqb Hi, no, this was a OneTrainer sample, but the LoRA did the same in ComfyUI.

I've never seen any such artifact in a OneTrainer sample. If you can reproduce this, please open an issue or join the Discord to show it there.
It is a known issue with Lightning LoRAs in Comfy, but it happens with and without any OneTrainer LoRA, purely as an artifact of Lightning.
