[Feature]: CUDA acceleration for Cheetah protocol #503
Comments
Try playing with https://github.com/privateLLM001/ which already integrates CUDA into the SEAL lib to some extent.
Hi @BeStrongok Based on the experience we had accelerating ABY3 matmul with CUDA, the improvement might be marginal. MPC protocols usually have tasks that the GPU cannot handle, like sending/receiving data over the network, so there are many data movements between GPU and CPU and I/O becomes a huge bottleneck. From some preliminary data collected from the ABY3 GPT-2 inference example, copying data to/from the GPU can take ~95% of the matmul time. Another common issue is that MPC protocols usually work on integers like int64/int128; these types are not well optimized for computation on either CPU or GPU, and lack support from libraries like cuBLAS. But feel free to give it a shot :P
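For reference, this is the kind of copy-vs-compute breakdown being described. Below is a minimal, hypothetical micro-benchmark (not from the SPU codebase; the size and the kernel are placeholders) that times the host-to-device copy, a trivial int64 kernel, and the device-to-host copy with CUDA events; when the on-device work is this light, the two transfers tend to dominate.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder for the "real" GPU work: an elementwise int64 multiply.
__global__ void scale_kernel(int64_t* data, int64_t factor, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main() {
  const size_t n = size_t(1) << 24;  // assumed size: ~16M int64 shares
  std::vector<int64_t> host(n, 3);
  int64_t* dev = nullptr;
  cudaMalloc((void**)&dev, n * sizeof(int64_t));

  cudaEvent_t t0, t1, t2, t3;
  cudaEventCreate(&t0); cudaEventCreate(&t1);
  cudaEventCreate(&t2); cudaEventCreate(&t3);

  cudaEventRecord(t0);
  cudaMemcpy(dev, host.data(), n * sizeof(int64_t), cudaMemcpyHostToDevice);
  cudaEventRecord(t1);
  scale_kernel<<<(n + 255) / 256, 256>>>(dev, 7, n);
  cudaEventRecord(t2);
  cudaMemcpy(host.data(), dev, n * sizeof(int64_t), cudaMemcpyDeviceToHost);
  cudaEventRecord(t3);
  cudaEventSynchronize(t3);

  float h2d = 0.f, kern = 0.f, d2h = 0.f;
  cudaEventElapsedTime(&h2d, t0, t1);
  cudaEventElapsedTime(&kern, t1, t2);
  cudaEventElapsedTime(&d2h, t2, t3);
  printf("H2D %.2f ms | kernel %.2f ms | D2H %.2f ms\n", h2d, kern, d2h);

  cudaFree(dev);
  return 0;
}
```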
Thanks for pointing out this repo :) I'm also trying to do some experiments on applying the CUDA version of SEAL to Cheetah.
Thank you for providing me with this useful information. :)
I would expect ~60x faster key switching than a single-core CPU implementation, if you put the whole key-switching logic onto the GPU. However, it might take "a little bit" of work to do so. The GPU code in
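As a rough illustration of "put the whole logic onto the GPU": the sketch below (placeholder kernels, not SEAL's actual key-switching code) chains the sub-steps over device-resident buffers, so the PCIe transfer cost is paid once per ciphertext rather than once per sub-operation.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

// Placeholder for digit decomposition of the input polynomial.
__global__ void decompose_step(const uint64_t* in, uint64_t* tmp, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) tmp[i] = in[i] >> 16;
}

// Placeholder for the multiply-accumulate against the key-switching keys.
__global__ void inner_product_step(const uint64_t* tmp, uint64_t* out, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = tmp[i] * 3u;
}

// One upload, a chain of kernels on device memory, one download.
void keyswitch_like_pipeline(const uint64_t* host_in, uint64_t* host_out, size_t n) {
  uint64_t *d_in, *d_tmp, *d_out;
  cudaMalloc((void**)&d_in,  n * sizeof(uint64_t));
  cudaMalloc((void**)&d_tmp, n * sizeof(uint64_t));
  cudaMalloc((void**)&d_out, n * sizeof(uint64_t));

  cudaMemcpy(d_in, host_in, n * sizeof(uint64_t), cudaMemcpyHostToDevice);
  decompose_step<<<(n + 255) / 256, 256>>>(d_in, d_tmp, n);
  inner_product_step<<<(n + 255) / 256, 256>>>(d_tmp, d_out, n);
  cudaMemcpy(host_out, d_out, n * sizeof(uint64_t), cudaMemcpyDeviceToHost);

  cudaFree(d_in); cudaFree(d_tmp); cudaFree(d_out);
}
```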
Less than 10x for RotateRows is less impressive to me, since a 10-core CPU is much easier to get than a x100 NV card.
Feature Request Type
Performance
Have you searched existing issues?
No
Is your feature request related to a problem?
No.
Describe features you want to add to SPU
Hi, SPU team:
The Cheetah protocol is a high-performance two-party inference protocol that currently runs on CPUs. I'm wondering whether there is a way to apply CUDA acceleration to this protocol?
For example, the matrix encoding process, or the computation that produces the results for each modulus. Do you have any corresponding development plans?
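To make the "computation for each modulus" part concrete: in a SEAL-style RNS/NTT representation, multiplying two polynomials reduces to an element-wise modular product per RNS modulus, which maps naturally to one GPU thread per coefficient. The sketch below is hypothetical (not SPU or SEAL code) and, for brevity, assumes moduli that fit in 32 bits so the product fits in 64 bits; SEAL's coefficient moduli are typically 50-60-bit primes and would need 128-bit intermediates or Barrett/Montgomery reduction.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

// One thread per (modulus, coefficient) pair. Layout is limb-major:
// a[m * n + i] is coefficient i of the residue polynomial under moduli[m].
__global__ void dyadic_product_rns(const uint64_t* a, const uint64_t* b,
                                   uint64_t* out, const uint64_t* moduli,
                                   size_t n, size_t num_moduli) {
  size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n * num_moduli) return;
  uint64_t q = moduli[idx / n];        // modulus for this limb
  out[idx] = (a[idx] * b[idx]) % q;    // element-wise modular multiply (NTT domain)
}

// Hypothetical launcher: n is the polynomial degree, num_moduli the RNS basis
// size; all buffers are assumed to already be resident in device memory.
void launch_dyadic_product(const uint64_t* d_a, const uint64_t* d_b,
                           uint64_t* d_out, const uint64_t* d_moduli,
                           size_t n, size_t num_moduli) {
  size_t total = n * num_moduli;
  dyadic_product_rns<<<(total + 255) / 256, 256>>>(d_a, d_b, d_out, d_moduli,
                                                   n, num_moduli);
}
```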