
Conversation

@Anri-Lombard
Contributor

Summary

  • Document that MLX's mask="causal" uses lower-right alignment
  • Clarify the difference from PyTorch's default is_causal=True (upper-left)

When T_q != T_kv, this distinction matters:

  • MLX (lower-right): Last query aligns with last key
  • PyTorch default (upper-left): First query aligns with first key
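
For illustration, here is a minimal sketch of the MLX call in question (batch, head, and feature sizes are assumed); with T_q=2 and T_kv=4 the lower-right alignment lets the last query attend to every key:

import mlx.core as mx

T_q, T_kv, D = 2, 4, 8
q = mx.random.normal((1, 1, T_q, D))   # queries shorter than keys
k = mx.random.normal((1, 1, T_kv, D))
v = mx.random.normal((1, 1, T_kv, D))

# mask="causal" aligns the last query with the last key (lower-right)
out = mx.fast.scaled_dot_product_attention(
    q, k, v, scale=1.0 / D**0.5, mask="causal"
)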

References:

Relates to #2835

Clarify that MLX uses lower-right alignment for causal masks when
T_q != T_kv, which differs from PyTorch's default upper-left alignment.

Relates to ml-explore#2835
Collaborator

@zcbenz left a comment

I don't think PyTorch has a causal_lower_right option for SDPA and the description is not really right.

@Anri-Lombard
Contributor Author

Hey @zcbenz, it has had causal_lower_right since 2.3, and it can be used with SDPA via the attn_mask parameter. I ran a script with:

import torch.nn.functional as F
from torch.nn.attention.bias import causal_lower_right

# assumes q, k, v tensors and sequence lengths T_q, T_kv are already defined
bias = causal_lower_right(T_q, T_kv)
F.scaled_dot_product_attention(q, k, v, attn_mask=bias)

to verify.

Here is the tutorial that documents this explicitly: https://docs.pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html.

I also verified masks are mathematically identical. For example with T_q=2, T_kv=4:

  MLX's mask (using q_off = max(0, kL - qL)):
        k0  k1  k2  k3
  q0 [  1   1   1   0  ]
  q1 [  1   1   1   1  ]

  PyTorch's causal_lower_right(2, 4):
        k0  k1  k2  k3
  q0 [  1   1   1   0  ]
  q1 [  1   1   1   1  ]

  PyTorch's is_causal=True (upper_left):
        k0  k1  k2  k3
  q0 [  1   0   0   0  ]
  q1 [  1   1   0   0  ]
  

The first two are identical; the third is different. This is also consistent with MLX's CUDA backend, which uses cuDNN's set_causal_mask_bottom_right.
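
A self-contained sketch of that comparison (shapes here are illustrative, with batch and head dimensions of 1): build the lower-right mask by hand with q_off = max(0, T_kv - T_q) and compare it against causal_lower_right:

import torch
import torch.nn.functional as F
from torch.nn.attention.bias import causal_lower_right

T_q, T_kv, D = 2, 4, 8
q = torch.randn(1, 1, T_q, D)
k = torch.randn(1, 1, T_kv, D)
v = torch.randn(1, 1, T_kv, D)

# Explicit boolean mask with the lower-right offset q_off = max(0, T_kv - T_q)
q_off = max(0, T_kv - T_q)
explicit = torch.tril(torch.ones(T_q, T_kv, dtype=torch.bool), diagonal=q_off)

out_bias = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_lower_right(T_q, T_kv))
out_mask = F.scaled_dot_product_attention(q, k, v, attn_mask=explicit)
print(torch.allclose(out_bias, out_mask))  # expected: True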

Is there something specific about the description you think is incorrect? If your concern is that causal_lower_right isn't a direct SDPA parameter (like is_causal=True) but rather a separate utility, I could clarify the wording to use the full module path torch.nn.attention.bias.causal_lower_right.

@zcbenz
Collaborator

zcbenz commented Jan 18, 2026

Thanks for linking the docs, this is new to me. On the behavior, it actually depends on whether T_q is larger or smaller than T_kv:

if (q.shape(2) > k.shape(2)) {
  options.set_causal_mask(do_causal);
} else {
  options.set_causal_mask_bottom_right(do_causal);
}

The mask uses lower-right alignment when T_q <= T_kv and upper-left when T_q > T_kv.
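
To make the two alignments concrete, here's a small illustrative sketch (plain PyTorch, not the MLX/cuDNN code) of the masks the two settings correspond to:

import torch

def causal_mask(T_q: int, T_kv: int, lower_right: bool) -> torch.Tensor:
    # Lower-right shifts the diagonal so the last query sees the last key;
    # upper-left keeps the diagonal at zero (first query aligned with first key).
    offset = T_kv - T_q if lower_right else 0
    return torch.tril(torch.ones(T_q, T_kv, dtype=torch.bool), diagonal=offset)

print(causal_mask(2, 4, lower_right=True).int())   # T_q <= T_kv: lower-right
print(causal_mask(4, 2, lower_right=False).int())  # T_q > T_kv: upper-left
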
@Anri-Lombard
Contributor Author

Thanks! Fixed to describe the conditional alignment behavior 🙏

Collaborator

@zcbenz left a comment

Looks good to me. /cc @awni for a second look.

@awni
Member

awni commented Jan 21, 2026

The comment definitely makes sense. But I also find it a bit strange that we switch from lower-right to upper-left depending on whether the query is longer or shorter than the keys. It's quite rare for the query to be longer than the keys, which is why we never really looked at it carefully.

I'm wondering if we should change the behavior in that case rather than documenting something that is a bit unusual? Or maybe it's a good idea to keep it this way?

@zcbenz
Collaborator

zcbenz commented Jan 21, 2026

I agree the current behavior is unusual; using lower-right for all cases would be the better choice.
