Is your feature request related to a problem? Please describe.
Training and inference with long sequences (32K+) using dense attention are prohibitively expensive. TE currently offers sliding window attention as the only sparse alternative, but it uses a fixed local pattern that loses long-range dependencies.
Describe the solution you'd like
Add support for Native Sparse Attention (NSA) from the paper Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. NSA dynamically selects important tokens via learned compression and top-k block selection, maintaining long-range dependencies while being hardware-efficient.
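For illustration, here is a minimal PyTorch sketch of the top-k block-selection idea (not TE's API and not the paper's kernel). The function name `nsa_topk_block_attention` is hypothetical; mean pooling stands in for the paper's learned compression, and the causal mask plus the gated combination with the compressed and sliding-window branches are omitted for brevity.

```python
# Hypothetical sketch of NSA-style top-k block selection (illustration only).
# Shapes: [batch, heads, seq, dim]; assumes seq is divisible by block_size.
import torch
import torch.nn.functional as F

def nsa_topk_block_attention(q, k, v, block_size=64, top_k=4):
    b, h, s, d = q.shape
    scale = d ** -0.5
    n_blocks = s // block_size

    # 1) Compress K into block-level summaries (mean pooling as a stand-in
    #    for the paper's learned compression).
    k_blocks = k.view(b, h, n_blocks, block_size, d).mean(dim=3)          # [b, h, n_blocks, d]

    # 2) Score each query against block summaries and pick top-k blocks per query.
    block_scores = torch.einsum("bhqd,bhnd->bhqn", q, k_blocks) * scale   # [b, h, s, n_blocks]
    topk_idx = block_scores.topk(top_k, dim=-1).indices                   # [b, h, s, top_k]

    # 3) Expand selected blocks into a token-level attention mask.
    block_mask = torch.zeros(b, h, s, n_blocks, dtype=torch.bool, device=q.device)
    block_mask.scatter_(-1, topk_idx, True)
    token_mask = block_mask.repeat_interleave(block_size, dim=-1)         # [b, h, s, s]

    # 4) Dense attention restricted to the selected blocks (a real implementation
    #    would gather only the selected blocks and run a fused sparse kernel).
    attn = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    attn = attn.masked_fill(~token_mask, float("-inf"))
    attn = F.softmax(attn, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)

# Smoke test on random tensors: 512 tokens, 8 blocks of 64, 4 blocks kept per query.
q = torch.randn(1, 8, 512, 64)
k = torch.randn(1, 8, 512, 64)
v = torch.randn(1, 8, 512, 64)
print(nsa_topk_block_attention(q, k, v).shape)  # torch.Size([1, 8, 512, 64])
```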
Describe alternatives you've considered
Using the standalone Triton implementation (native-sparse-attention-triton): it works, but it doesn't integrate with TE's FP8 quantization, context parallelism (CP), or fused kernels, and it requires maintaining separate code paths.
Additional context