Lora loading for bf16 and fp8, as separate models #24
Conversation
Had to fix some bugs in the original lora loading code. Outputs are here: https://replicate.com/replicate-internal/test-flux-dev-lora
* bf16 lora loading works
* For fp8 lora loading, the lora strength is much lower and I need to crank up lora_scale to 1.7 to match lora_scale 1.1 at bf16
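A minimal sketch (not the code in this PR) of how a LoRA delta gets fused into a base weight with `lora_scale`, and where an fp8 round-trip can perturb that small delta relative to the base weight — one plausible reason a larger `lora_scale` is needed at fp8. Shapes and scales are illustrative only:

```python
import torch

def fuse_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, lora_scale: float) -> torch.Tensor:
    # Standard LoRA fusion: W' = W + lora_scale * (B @ A)
    return W + lora_scale * (B @ A)

# Toy shapes, not the real Flux dimensions.
W = torch.randn(3072, 3072, dtype=torch.bfloat16)
A = torch.randn(16, 3072, dtype=torch.bfloat16) * 0.02
B = torch.randn(3072, 16, dtype=torch.bfloat16) * 0.02

fused = fuse_lora(W, A, B, lora_scale=1.1)

# Round-trip both the base and the fused weight through fp8 (requires a PyTorch
# build with float8 dtypes) and compare the LoRA delta that survives against the
# true bf16 delta. The quantization error shows up directly in the small delta.
delta_true = fused - W
delta_fp8 = fused.to(torch.float8_e4m3fn).to(torch.bfloat16) - W.to(torch.float8_e4m3fn).to(torch.bfloat16)
rel_err = (delta_fp8 - delta_true).float().norm() / delta_true.float().norm()
print(f"relative error of LoRA delta after fp8 round-trip: {rel_err.item():.3f}")
```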
Very excited for this PR. Thanks for doing this! I was looking for an H100 lora inference provider and this seems like it would do the trick. I was also curious whether pricing for fast generations would differ from per-image pricing, since the GPU usage time is much lower?
Hi, is work still continuing on this PR? I believe this should make flux dev lora inference fast enough that you'd gain a customer over Fal — they are currently a few seconds faster, but from my benchmarking this should end up faster. Happy to take on any tasks if you would like as well.
clean up a few things and then we're good!
Fusing and unfusing is also slightly lossy, so the model slowly degrades over time. We could do something like what peft does and add a new node that applies the matmul on the fly instead of fusing, but that would slow down inference. Curious if you have ideas @daanelson
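A rough sketch of the peft-style alternative mentioned above: keep the base weight frozen and apply the low-rank matmul at forward time, so there is no fuse/unfuse round-trip to accumulate error. The module name and rank handling here are illustrative, not from this repo:

```python
import torch
import torch.nn as nn

class LoraLinear(nn.Module):
    """On-the-fly LoRA wrapper (in the spirit of peft's LoraLayer): the base
    weight is never modified, so loading/unloading LoRAs cannot degrade it."""

    def __init__(self, base: nn.Linear, rank: int, lora_scale: float = 1.0):
        super().__init__()
        self.base = base
        self.lora_A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.lora_scale = lora_scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Two extra small matmuls per call: this is the inference cost of not fusing.
        return self.base(x) + self.lora_scale * (x @ self.lora_A.T @ self.lora_B.T)
```

The trade-off is exactly the one noted above: a couple of extra low-rank matmuls per linear layer per step, in exchange for a base model that stays bit-exact no matter how many LoRAs are swapped in and out.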