
Make fp8 work on older GPUs #34

Merged
merged 5 commits into main from yorickvp/pre-ada-mm on Nov 4, 2024

Conversation

yorickvP (Contributor)

Tested on RTX A5000.

  • Use float32 matmul on older GPUs
  • Offload based on total GPU memory instead of 'A40' in name
  • Tell the FP8 code about offloading and compiling
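Hypothetically, the first two bullets might reduce to checks like this minimal sketch (the function names and the 40 GiB threshold are illustrative assumptions, not the PR's actual code):

```python
import torch

def supports_fp8_matmul() -> bool:
    # FP8 tensor-core matmul requires compute capability >= (8, 9),
    # i.e. Ada or Hopper. The RTX A5000 and A40 are Ampere (8, 6),
    # so they fall back to a float32 matmul on dequantized weights.
    return torch.cuda.get_device_capability() >= (8, 9)

def should_offload(threshold_gib: float = 40.0) -> bool:
    # Decide on CPU offloading from total VRAM rather than matching the
    # substring "A40" in torch.cuda.get_device_name(). The threshold
    # here is a placeholder, not the value the PR uses.
    total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
    return total_gib < threshold_gib
```

Keying the decision to total memory rather than a device-name substring makes the offload path apply to any similarly sized GPU, not just those whose names happen to contain "A40".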

@daanelson (Collaborator)

@yorickvP this is great; making 8-bit inference work regardless of hardware is useful. That said, when I run cog predict -i prompt=<whatever> on an A40, it takes ~10 minutes to compile and then there's no output. Can you take a look? I'm wary of pushing a broken path here.

@yorickvP (Contributor, Author)

I checked again, and the prediction eventually succeeded: https://replicate.com/p/8r81q9z2b5rgp0cjv26977a6kg .
What this changed is that it now attempts to compile fp8 on the A40, where it didn't previously, which pushes the boot time over 10 minutes.
I'll disable fp8 compilation in the offload case!
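A hedged sketch of that gate, with illustrative names (the actual function and flags in the repo may differ):

```python
import torch

def maybe_compile(model: torch.nn.Module, use_fp8: bool, offload: bool) -> torch.nn.Module:
    # Skip torch.compile for the fp8 path when weights are offloaded to CPU:
    # compiling under offloading is what pushed A40 boot time past 10 minutes.
    if use_fp8 and not offload:
        model = torch.compile(model)
    return model
```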

@daanelson (Collaborator)

@yorickvP fantastic! Works great now, merging.

daanelson merged commit a9a42fb into main on Nov 4, 2024
1 check passed
daanelson deleted the yorickvp/pre-ada-mm branch on November 4, 2024 at 21:54