Conversation

@LuFinch (Contributor) commented Jan 12, 2026

Currently, the SYCLTLA FlashAttention fwd/bwd kernels only support BHSD/BSHD layouts, which is sufficient in single-process scenarios.

However, in distributed scenarios the batch_size/num_heads dimensions may be split by DP/TP, which makes the strides non-contiguous with respect to BHSD/BSHD, so the current support is not enough.

This PR uses PyTorch's tensor strides to compute offsets so that the FA kernels can support any PyTorch stride.
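
As a rough illustration of the idea (a minimal sketch, not code from this PR; slice_offset is a hypothetical helper): instead of hard-coding BHSD/BSHD index math, the base offset of each (batch, head) slice is derived from the tensor's own strides, so any stride pattern PyTorch produces, including DP/TP-sliced views, is handled uniformly.

```cpp
#include <ATen/ATen.h>
#include <cstdint>

// Hypothetical helper (not the PR's actual code): element offset of the
// (batch b, head h) slice of a 4-D tensor whose logical dimension order is
// [batch, num_heads, seq_len, head_dim]. Because it reads the strides from
// the tensor itself, it works for BHSD, BSHD, or any other layout PyTorch
// can produce.
inline int64_t slice_offset(const at::Tensor& t, int64_t b, int64_t h) {
  return b * t.stride(0) + h * t.stride(1);
}

// Usage sketch: the per-slice base pointer a kernel argument could be built from.
// const float* q_base = q.const_data_ptr<float>() + slice_offset(q, b, h);
```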

Copilot AI review requested due to automatic review settings January 12, 2026 07:44
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the SYCLTLA FlashAttention implementation to support all PyTorch tensor strides, removing the previous BHSD/BSHD layout restrictions. This enables the kernels to work correctly in distributed scenarios where batch_size/num_heads dimensions may be split by data/tensor parallelism.

Changes:

  • Removed layout-specific code and replaced it with stride-based offset calculations using PyTorch tensor strides
  • Introduced new parameter structs (QKV_params, FLASH_FWD_params, FLASH_BWD_params) to encapsulate stride information
  • Simplified kernel interfaces by passing parameter structs instead of individual arguments

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
  • src/ATen/native/transformers/xpu/flash_attn/utils.h: Removed the ATTN_TENSOR_LAYOUT enum and related layout-checking functions; deprecated check_flash_attention_layout
  • src/ATen/native/transformers/xpu/flash_attn/sycltla/mha_fwd.cpp: Replaced layout-based logic with stride-based offset calculations; updated function signatures to use FLASH_FWD_params
  • src/ATen/native/transformers/xpu/flash_attn/sycltla/mha_common.h: Added QKV_params, FLASH_FWD_params, and FLASH_BWD_params structs plus helper functions to populate them from tensors
  • src/ATen/native/transformers/xpu/flash_attn/sycltla/mha_bwd.h: Updated the Param struct with stride fields and offset-calculation methods; removed layout-specific stride setup functions
  • src/ATen/native/transformers/xpu/flash_attn/sycltla/mha_bwd.cpp: Refactored to use FLASH_BWD_params; updated offset calculations to use do_offset/dqaccum_offset; made grad_out contiguous
  • src/ATen/native/transformers/xpu/flash_attn/sycltla/kernel/xe_sdpa_fwd.h: Removed the is_bshd parameter and layout-specific branching; simplified coordinate calculations
  • src/ATen/native/transformers/xpu/flash_attn/sycltla/kernel/tile_scheduler_sdpa_fwd.h: Removed the is_bshd parameter; eliminated the BSHD-specific grid calculation path
  • src/ATen/native/transformers/xpu/flash_attn/sycltla/collective/xe_flash_attn_sdpa_fwd_mma.h: Changed Arguments to use stride fields instead of packed strides; added a RuntimeParams struct; simplified offset calculations
  • src/ATen/native/transformers/xpu/flash_attn/sycltla/collective/xe_flash_attn_sdpa_fwd_epilogue.h: Updated to use stride-based offset calculations; removed is_bshd branching; separated Params and RuntimeParams
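
For orientation, a minimal sketch of what a stride-carrying parameter struct and its populate helper could look like; the names below (QKVParamsSketch, set_qkv_params) are illustrative assumptions, not the actual definitions in mha_common.h.

```cpp
#include <ATen/ATen.h>
#include <cstdint>

// Illustrative stand-in for a QKV_params-style struct: raw pointers plus the
// PyTorch strides (in elements) of the logical [batch, num_heads, seq_len,
// head_dim] view, so the kernels no longer need a BHSD/BSHD layout flag.
struct QKVParamsSketch {
  const void* q_ptr;
  const void* k_ptr;
  const void* v_ptr;
  int64_t q_batch_stride, q_head_stride, q_seq_stride;
  int64_t k_batch_stride, k_head_stride, k_seq_stride;
  int64_t v_batch_stride, v_head_stride, v_seq_stride;
};

// Hypothetical populate helper, analogous to the "helper functions to populate
// them from tensors" listed above: copies pointers and strides straight from
// the at::Tensor objects.
inline void set_qkv_params(QKVParamsSketch& p,
                           const at::Tensor& q,
                           const at::Tensor& k,
                           const at::Tensor& v) {
  p.q_ptr = q.const_data_ptr();
  p.k_ptr = k.const_data_ptr();
  p.v_ptr = v.const_data_ptr();
  p.q_batch_stride = q.stride(0);
  p.q_head_stride = q.stride(1);
  p.q_seq_stride = q.stride(2);
  p.k_batch_stride = k.stride(0);
  p.k_head_stride = k.stride(1);
  p.k_seq_stride = k.stride(2);
  p.v_batch_stride = v.stride(0);
  p.v_head_stride = v.stride(1);
  p.v_seq_stride = v.stride(2);
}
```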

Review comment context from src/ATen/native/transformers/xpu/flash_attn/sycltla/mha_bwd.cpp:
dropout == 0.0,
"FlashAttentionBackwardXPU does not only support dropout > 0.0 yet");

at::Tensor contiguous_grad_out = grad_out.contiguous();
Copilot AI commented Jan 12, 2026

Creating a contiguous copy of grad_out unconditionally may introduce unnecessary memory allocation and copy overhead when grad_out is already contiguous. Consider checking if grad_out is contiguous first and only creating the copy when needed, similar to the pattern: at::Tensor contiguous_grad_out = grad_out.is_contiguous() ? grad_out : grad_out.contiguous();

Suggested change
-  at::Tensor contiguous_grad_out = grad_out.contiguous();
+  at::Tensor contiguous_grad_out =
+      grad_out.is_contiguous() ? grad_out : grad_out.contiguous();

A reviewer replied:

@LuFinch, can you check this and other memory hogs? I am seeing similar memory use as the MATH kernel, which is not the case in CUDA. See the section "Using flash attention SDP kernel (without dropout), A100" in this blog.

@intel deleted a comment from Copilot AI on Jan 13, 2026
@LuFinch requested a review from cfgfung on January 19, 2026 06:13