Is your feature request related to a problem? Please describe.
Training and inference with long sequences (32K+) using dense attention are prohibitively expensive. TE currently offers sliding window attention as the only sparse alternative, but it uses a fixed local pattern that loses long-range dependencies.
Describe the solution you'd like
Add support for Native Sparse Attention (NSA) from the paper Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. NSA dynamically selects important tokens via learned compression and top-k block selection, maintaining long-range dependencies while being hardware-efficient.
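For illustration, here is a minimal PyTorch sketch of the top-k block-selection idea (not TE's API and not the paper's kernel). The function name `nsa_topk_block_attention` is hypothetical; mean pooling stands in for the paper's learned compression, and the causal mask plus the gated combination with the compressed and sliding-window branches are omitted for brevity.

```python
# Hypothetical sketch of NSA-style top-k block selection (illustration only).
# Shapes: [batch, heads, seq, dim]; assumes seq is divisible by block_size.
import torch
import torch.nn.functional as F

def nsa_topk_block_attention(q, k, v, block_size=64, top_k=4):
    b, h, s, d = q.shape
    scale = d ** -0.5
    n_blocks = s // block_size

    # 1) Compress K into block-level summaries (mean pooling as a stand-in
    #    for the paper's learned compression).
    k_blocks = k.view(b, h, n_blocks, block_size, d).mean(dim=3)          # [b, h, n_blocks, d]

    # 2) Score each query against block summaries and pick top-k blocks per query.
    block_scores = torch.einsum("bhqd,bhnd->bhqn", q, k_blocks) * scale   # [b, h, s, n_blocks]
    topk_idx = block_scores.topk(top_k, dim=-1).indices                   # [b, h, s, top_k]

    # 3) Expand selected blocks into a token-level attention mask.
    block_mask = torch.zeros(b, h, s, n_blocks, dtype=torch.bool, device=q.device)
    block_mask.scatter_(-1, topk_idx, True)
    token_mask = block_mask.repeat_interleave(block_size, dim=-1)         # [b, h, s, s]

    # 4) Dense attention restricted to the selected blocks (a real implementation
    #    would gather only the selected blocks and run a fused sparse kernel).
    attn = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    attn = attn.masked_fill(~token_mask, float("-inf"))
    attn = F.softmax(attn, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)

# Smoke test on random tensors: 512 tokens, 8 blocks of 64, 4 blocks kept per query.
q = torch.randn(1, 8, 512, 64)
k = torch.randn(1, 8, 512, 64)
v = torch.randn(1, 8, 512, 64)
print(nsa_topk_block_attention(q, k, v).shape)  # torch.Size([1, 8, 512, 64])
```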
Describe alternatives you've considered
Using the standalone Triton implementation (native-sparse-attention-triton): it works, but it doesn't integrate with TE's FP8 quantization, context parallelism (CP), or fused kernels, and it requires maintaining separate code paths.
Additional context