-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Question] Supported compute capabilities? #21
Comments
Sorry for delay. TL;DR is that ThunderKittens as-is will NOT run on pre-SM80, but it would not be very hard to modify it to support down to SM_75 with a couple of targeted preprocessor directives within the codebase. SM_70 is definitely possible but would take a bit more work. AFAICT there are just two things in the codebase that will break on SM_75.
The above looks fairly easy, although it's not a P0 for us at the moment. But if someone wants to submit a PR, happy to look it through and merge. SM_70 is going to be a little bit more of a hassle, although still not hugely difficult. (Main difficulty being a lack of half-precision types.) The swizzling indexes (src/types/shared/st_layout) for FP32 would need to be checked and possibly extended, and MMA again would need some wrapping. Global to shared memory movement I think would also need a touch-up. But this reads to me like just a day or two of work, and again, happy to review or merge a PR if people want it enough to put it in. |
given the post and the thing you said, It's probably pointless to implement thunderkittens on ancient GPU like P40/P100? just curious. what's your opinion |
The reason to do it would be if you happen to really like the TK programming model of working with tiles. But there are no tensor cores, so the MMA wrapper would turn into actual sequential multiply+add instructions and you wouldn't really see any performance gains over naively writing raw CUDA. |
I've been working on porting FlashAttention-2 to pre-SM80 architectures (Turing and Volta) and was wondering if TK supports SM70 and SM75 hardware. Writing 100 lines of TK primitives sounds a lot easier than 1000 lines of CUDA...
The text was updated successfully, but these errors were encountered: