
Reduce memory overhead from copy2d #13

Merged: 9 commits merged into main on Oct 6, 2024
Conversation

EndlessReform (Owner) commented Oct 6, 2024

This PR fixes:

  • Now using the official KVCache implementation
  • For smaller models, the Candle repeat_interleave optimization using Tensor::cat is actually slower than the previous .unsqueeze().expand().reshape(), despite avoiding strided copies, because of the overhead of 8 sequential copy2d calls plus another copy to make the result contiguous. I've reverted to the original approach, which improves throughput by another 20 tokens/sec. Further optimization is still needed: this op went from 70µs down to 25µs on a 4090, but that's still as long as the attention mechanism itself!
  • Moved the repetition penalty calculation for the slow codebooks to the GPU, avoiding the round trip of copying GPU -> CPU for rep pen -> GPU for softmax and temperature scaling -> sampling. It's lower precision at BF16, but it shaves off another ~100µs, since we have to do this 8 times per timestep.
  • Fixed the RTF (real-time factor) calculation for DualAR
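To make the KV-interleaving point concrete, here is a minimal plain-Rust sketch (not the PR's Candle code) of what repeating KV heads for grouped-query attention has to produce. The function name, flat-buffer layout, and shapes are assumptions for illustration only:

```rust
// Sketch: repeat each KV head `n_rep` times so grouped-query attention can
// match n_kv_heads up to n_heads. Assumed layout: a row-major flattened
// [n_kv_heads, seq_len, head_dim] buffer.
fn repeat_kv(
    kv: &[f32],
    n_kv_heads: usize,
    seq_len: usize,
    head_dim: usize,
    n_rep: usize,
) -> Vec<f32> {
    let head_size = seq_len * head_dim;
    let mut out = Vec::with_capacity(kv.len() * n_rep);
    for h in 0..n_kv_heads {
        let head = &kv[h * head_size..(h + 1) * head_size];
        // .unsqueeze().expand().reshape() materializes exactly this repetition
        // in a single pass; the Tensor::cat route reaches the same layout via
        // n_kv_heads separate copy2d calls plus a final contiguous copy, which
        // is where the launch overhead comes from.
        for _ in 0..n_rep {
            out.extend_from_slice(head);
        }
    }
    out
}
```

The output is identical either way; the PR's finding is purely about how many kernel launches it takes to get there.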
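For reference, the repetition penalty itself is cheap; the cost the PR removes is the device round trip. A CPU stand-in for the computation now kept on-GPU might look like this (hypothetical function name; the divide-positive/multiply-negative rule follows the common CTRL-style penalty, which may differ from the repo's exact formula):

```rust
// Apply a repetition penalty to the logits of previously generated tokens,
// in place: positive logits are divided by the penalty, negative logits are
// multiplied, so repeated tokens always become less likely.
fn apply_rep_pen(logits: &mut [f32], prev_tokens: &[usize], penalty: f32) {
    for &t in prev_tokens {
        let l = logits[t];
        logits[t] = if l > 0.0 { l / penalty } else { l * penalty };
    }
}
```

Doing this on-device means softmax, temperature scaling, and sampling can all follow without copying logits back to the host 8 times per timestep.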
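On the RTF fix: real-time factor is conventionally wall-clock generation time divided by the duration of audio produced, so RTF < 1.0 means faster than real time. A hypothetical helper (assumed name, not the repo's API):

```rust
// Real-time factor: seconds spent generating per second of audio produced.
fn rtf(generation_secs: f64, audio_secs: f64) -> f64 {
    generation_secs / audio_secs
}
```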

For anyone else reading this: precomputed indexes aren't actually faster for interleaving the KV tensors, probably due to overhead.

Still behind the reference implementation with torch.compile: 160 tokens/sec on an Nvidia 4090, vs the 250 tokens/sec reported by the maintainers. We're also becoming increasingly CPU-bound due to kernel launch overhead.

Next steps:

  • Add flash attention support
  • Look into efficient RoPE and fused add+norm operators: they're making up a large percentage of runtime now
  • Add a potential CustomOp for interleaving; it's still taking too long
  • Experiment with CUDA graphs: it looks like Candle may be adding support soon (Experimentations around cuda graphs huggingface/candle#2538)

@EndlessReform EndlessReform self-assigned this Oct 6, 2024
@EndlessReform EndlessReform merged commit f543dae into main Oct 6, 2024
@EndlessReform EndlessReform deleted the reduce-attn-fragmentation branch October 6, 2024 18:36