Lora loading for bf16 and fp8, as separate models #24
Conversation
Had to fix some bugs in the original lora loading code. Outputs are here: https://replicate.com/replicate-internal/test-flux-dev-lora
* bf16 lora loading works
* For fp8 lora loading, the lora strength is much lower and I need to crank up lora_scale to 1.7 to match lora_scale 1.1 at bf16
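A minimal sketch (not the code in this PR) of how a LoRA delta gets fused into a base weight with `lora_scale`, and where an fp8 round-trip can perturb that small delta relative to the base weight — one plausible reason a larger `lora_scale` is needed at fp8. Shapes and scales are illustrative only:

```python
import torch

def fuse_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, lora_scale: float) -> torch.Tensor:
    # Standard LoRA fusion: W' = W + lora_scale * (B @ A)
    return W + lora_scale * (B @ A)

# Toy shapes, not the real Flux dimensions.
W = torch.randn(3072, 3072, dtype=torch.bfloat16)
A = torch.randn(16, 3072, dtype=torch.bfloat16) * 0.02
B = torch.randn(3072, 16, dtype=torch.bfloat16) * 0.02

fused = fuse_lora(W, A, B, lora_scale=1.1)

# Round-trip both the base and the fused weight through fp8 (requires a PyTorch
# build with float8 dtypes) and compare the LoRA delta that survives against the
# true bf16 delta. The quantization error shows up directly in the small delta.
delta_true = fused - W
delta_fp8 = fused.to(torch.float8_e4m3fn).to(torch.bfloat16) - W.to(torch.float8_e4m3fn).to(torch.bfloat16)
rel_err = (delta_fp8 - delta_true).float().norm() / delta_true.float().norm()
print(f"relative error of LoRA delta after fp8 round-trip: {rel_err.item():.3f}")
```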
Very excited for this PR. Thanks for doing this! I was looking for an H100 lora inference provider and this seems like it would do the trick. I was also curious whether pricing for fast generations would differ from per-image pricing, since the GPU usage time is much lower?
Hi, is work still continuing on this PR? I believe this should make flux dev lora inference fast enough that you'd gain a customer over Fal — they are currently a few seconds faster, but from my benchmarking this should end up faster. Happy to take on any tasks if you would like as well.
clean up a few things and then we're good!
Fusing and unfusing is also slightly lossy, so the model slowly degrades over time. We could do something like what peft does and add a new node that applies the matmul on the fly instead of fusing, but that would slow down inference. Curious if you have ideas @daanelson
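A rough sketch of the peft-style alternative mentioned above: keep the base weight frozen and apply the low-rank matmul at forward time, so there is no fuse/unfuse round-trip to accumulate error. The module name and rank handling here are illustrative, not from this repo:

```python
import torch
import torch.nn as nn

class LoraLinear(nn.Module):
    """On-the-fly LoRA wrapper (in the spirit of peft's LoraLayer): the base
    weight is never modified, so loading/unloading LoRAs cannot degrade it."""

    def __init__(self, base: nn.Linear, rank: int, lora_scale: float = 1.0):
        super().__init__()
        self.base = base
        self.lora_A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.lora_scale = lora_scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Two extra small matmuls per call: this is the inference cost of not fusing.
        return self.base(x) + self.lora_scale * (x @ self.lora_A.T @ self.lora_B.T)
```

The trade-off is exactly the one noted above: a couple of extra low-rank matmuls per linear layer per step, in exchange for a base model that stays bit-exact no matter how many LoRAs are swapped in and out.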