@yaox12 yaox12 commented Dec 11, 2025

Description

Optimizations:

  1. Use vectorized loads and stores in the aligned case, i.e., when the layout is non-interleaved, the input and output are contiguous in the last dimension, and the last dimension is a multiple of 8.
  2. In the forward kernel, do not load freqs into shared memory: each entry of freqs is used by only one thread, so nothing is shared. This avoids a thread synchronization and shared memory bank conflicts.
  3. In the backward kernel, keep using shared memory, because different threads may separately access the sin and cos of the same element of freqs.
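
The three points above can be sketched in plain Python. This is an illustrative model, not the PR's CUDA code: it assumes the non-interleaved (rotate-half) RoPE convention, an even rotary dimension, one "thread" per output element, and that "aligned to 8" means the last dimension is a multiple of 8 elements. The names `can_use_vectorized_path`, `rope_forward`, and `rope_backward` are hypothetical.

```python
import math

VEC_WIDTH = 8  # assumed vector width in elements, taken from "aligned to 8"


def can_use_vectorized_path(interleaved, in_last_stride, out_last_stride, last_dim):
    """Hypothetical predicate for the aligned case described in item 1."""
    contiguous = in_last_stride == 1 and out_last_stride == 1
    return (not interleaved) and contiguous and last_dim % VEC_WIDTH == 0


def _partner(d, half):
    """Index paired with d by rotate-half, plus the sign used in the backward pass."""
    return (d + half, 1.0) if d < half else (d - half, -1.0)


def rope_forward(x, freqs):
    """Rotate one vector; readers[i] is the set of 'threads' that read freqs[i]."""
    half = len(x) // 2
    readers = [set() for _ in x]
    out = []
    for d in range(len(x)):  # d stands in for one thread
        p, sign = _partner(d, half)
        readers[d].add(d)  # sin and cos both come from freqs[d]
        rotated = -sign * x[p]  # -x[d + half] in the first half, +x[d - half] in the second
        out.append(x[d] * math.cos(freqs[d]) + rotated * math.sin(freqs[d]))
    return out, readers


def rope_backward(grad_out, freqs):
    """Transpose of rope_forward; readers[i] is the set of 'threads' that read freqs[i]."""
    half = len(grad_out) // 2
    readers = [set() for _ in grad_out]
    grad_in = []
    for d in range(len(grad_out)):  # d stands in for one thread
        p, sign = _partner(d, half)
        readers[d].add(d)  # cos comes from freqs[d]
        readers[p].add(d)  # sin comes from the partner entry freqs[p]
        grad_in.append(
            grad_out[d] * math.cos(freqs[d]) + sign * grad_out[p] * math.sin(freqs[p])
        )
    return grad_in, readers
```

In this model every entry of freqs has a single reader in the forward pass and two readers in the backward pass, which is consistent with dropping shared memory only in the forward kernel.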

Performance

Tested on GB200:

The forward kernel is 1.2-1.4x faster and the backward kernel is about 1.4-1.6x faster than the baseline.

Shape [4096, 1, 128, 128] in sbhd layout:

  • Baseline
      fused_rope_forward_kernel CUDA time avg: 0.095 ms, bandwidth: 2.86 TB/s
      fused_rope_backward_kernel CUDA time avg: 0.092 ms, bandwidth: 2.93 TB/s
  • Optimized
      fused_rope_forward_kernel CUDA time avg: 0.067 ms, bandwidth: 4.06 TB/s
      fused_rope_backward_kernel CUDA time avg: 0.057 ms, bandwidth: 4.74 TB/s

Shape [512, 32, 24, 128] in sbhd layout:

  • Baseline
      fused_rope_forward_kernel CUDA time avg: 0.079 ms, bandwidth: 2.55 TB/s
      fused_rope_backward_kernel CUDA time avg: 0.071 ms, bandwidth: 2.85 TB/s
  • Optimized
      fused_rope_forward_kernel CUDA time avg: 0.064 ms, bandwidth: 3.17 TB/s
      fused_rope_backward_kernel CUDA time avg: 0.049 ms, bandwidth: 4.14 TB/s
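
The reported figures can be cross-checked with a short calculation. The sketch below assumes 2-byte elements (for example BF16) and counts one read and one write of the whole tensor, ignoring freqs traffic. Neither assumption is stated in the PR, and `effective_bandwidth_tbps` and `speedup` are hypothetical helpers.

```python
def effective_bandwidth_tbps(shape, time_ms, bytes_per_element=2):
    """Estimate bandwidth in TB/s, assuming one read plus one write of the tensor."""
    numel = 1
    for dim in shape:
        numel *= dim
    bytes_moved = 2 * numel * bytes_per_element  # read input + write output
    return bytes_moved / (time_ms * 1e-3) / 1e12


def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline to optimized kernel time."""
    return baseline_ms / optimized_ms


# Timings copied from the measurements above: (baseline_ms, optimized_ms).
results = {
    ((4096, 1, 128, 128), "fwd"): (0.095, 0.067),
    ((4096, 1, 128, 128), "bwd"): (0.092, 0.057),
    ((512, 32, 24, 128), "fwd"): (0.079, 0.064),
    ((512, 32, 24, 128), "bwd"): (0.071, 0.049),
}

for (shape, kernel), (base_ms, opt_ms) in results.items():
    print(
        shape,
        kernel,
        f"speedup {speedup(base_ms, opt_ms):.2f}x,",
        f"baseline ~{effective_bandwidth_tbps(shape, base_ms):.2f} TB/s,",
        f"optimized ~{effective_bandwidth_tbps(shape, opt_ms):.2f} TB/s",
    )
```

Under these assumptions the estimates land within a few percent of the reported bandwidths; some residual is expected because the timings are rounded to three decimal places.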

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@yaox12 yaox12 force-pushed the xiny/optimize_rope branch from 533c6ce to 3d9be7c Compare December 11, 2025 10:57
@ptrendx ptrendx added the performance Performance issues label Dec 11, 2025
@yaox12 yaox12 force-pushed the xiny/optimize_rope branch from 9068bfc to cafa75f Compare December 12, 2025 08:01