[Question] Supported compute capabilities? #21

Open
bayley opened this issue May 13, 2024 · 3 comments

Comments

@bayley commented May 13, 2024

I've been working on porting FlashAttention-2 to pre-SM80 architectures (Turing and Volta) and was wondering if TK supports SM70 and SM75 hardware. Writing 100 lines of TK primitives sounds a lot easier than 1000 lines of CUDA...

@benjaminfspector
Collaborator

Sorry for the delay. TL;DR: ThunderKittens as-is will NOT run on pre-SM80, but it would not be very hard to modify it to support down to SM_75 with a couple of targeted preprocessor directives within the codebase. SM_70 is definitely possible but would take a bit more work.

AFAICT there are just two things in the codebase that will break on SM_75.

  1. All the load_async/store_async functions may refuse to compile. Since they're templates, if you simply don't instantiate them, the C++ compiler probably won't complain? But it would be safer to wrap them in an #ifdef so that people don't try to use them.
  2. SM_75 doesn't do BF16, but I kind of half-assed the FP16 support in TK since BF16 is usually better. So the stuff in src/common/ that defines ops for bf16 would also need to be defined for FP16. The other thing that would need wrapping is the base matmul instruction (you'd want the m16n8k8 shape). Rough sketches of both changes are below.
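
For concreteness, here is a rough sketch of what those two changes might look like. The function names and placement are hypothetical, not TK's actual API; the point is just the arch guard around the cp.async path and an m16n8k8 FP16 PTX wrapper that Turing supports.

    #include <cstdint>

    // (1) Hypothetical sketch: gate the cp.async-based loaders so they only exist on SM_80+.
    //     The load_async/store_async templates would live behind the same guard.
    #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    __device__ inline void cp_async_16B(void *smem_dst, const void *gmem_src) {
        uint32_t s = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
        asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                     :: "r"(s), "l"(gmem_src));
    }
    #endif

    // (2) Hypothetical sketch: an FP16 wrapper for the base matmul instruction,
    //     using the m16n8k8 shape (f16 inputs, f32 accumulator) available from SM_75.
    __device__ inline void hmma_m16n8k8(float (&d)[4], const uint32_t (&a)[2],
                                        const uint32_t &b, const float (&c)[4]) {
        asm volatile(
            "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
            "{%0, %1, %2, %3}, {%4, %5}, {%6}, {%7, %8, %9, %10};\n"
            : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
            : "r"(a[0]), "r"(a[1]), "r"(b),
              "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
    }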

The above looks fairly easy, although it's not a P0 for us at the moment. But if someone wants to submit a PR, I'm happy to look it over and merge.

SM_70 is going to be a little bit more of a hassle, although still not hugely difficult. (Main difficulty being a lack of half-precision types.) The swizzling indexes (src/types/shared/st_layout) for FP32 would need to be checked and possibly extended, and MMA again would need some wrapping. Global to shared memory movement I think would also need a touch-up. But this reads to me like just a day or two of work, and again, happy to review or merge a PR if people want it enough to put it in.
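
To illustrate the global-to-shared part (again a hypothetical sketch, not TK's internals): on targets without cp.async, the tile loaders would fall back to plain synchronous copies, something like:

    // Hypothetical sketch: a synchronous global->shared fallback for targets
    // without cp.async. Each of the 32 threads in a warp copies 16-byte chunks.
    template<int BYTES>
    __device__ inline void warp_copy_tile(void *smem_dst, const void *gmem_src, int lane) {
        static_assert(BYTES % (16 * 32) == 0, "tile must be a multiple of 512 bytes");
        float4       *dst = reinterpret_cast<float4*>(smem_dst);
        const float4 *src = reinterpret_cast<const float4*>(gmem_src);
        #pragma unroll
        for (int i = lane; i < BYTES / 16; i += 32)
            dst[i] = src[i];  // plain ld.global / st.shared, no asynchrony
    }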

@sorasoras

Given the post and what you said, it's probably pointless to implement ThunderKittens on ancient GPUs like the P40/P100? Just curious what your opinion is.

@benjaminfspector
Collaborator

The reason to do it would be if you happen to really like the TK programming model of working with tiles. But there are no tensor cores, so the MMA wrapper would turn into actual sequential multiply+add instructions and you wouldn't really see any performance gains over naively writing raw CUDA.
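
To make that concrete, here is a hypothetical sketch of what a tile "mma" degenerates to without tensor cores, fixed to a 16x8x8 tile for illustration; the compiler just emits ordinary FMA instructions, so there's nothing to gain over hand-written CUDA.

    #include <cuda_fp16.h>

    // Hypothetical sketch: the tensor-core-free fallback is a plain FMA loop,
    // one output element at a time.
    __device__ inline void mma_16x8x8_fallback(float d[16][8], const __half a[16][8],
                                               const __half b[8][8], const float c[16][8]) {
        for (int i = 0; i < 16; ++i)
            for (int j = 0; j < 8; ++j) {
                float acc = c[i][j];
                for (int k = 0; k < 8; ++k)
                    acc = fmaf(__half2float(a[i][k]), __half2float(b[k][j]), acc);
                d[i][j] = acc;  // same math as the tensor-core op, just sequential
            }
    }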
