
How to adapt to non-square image model training and inference? #4

Open · SkylerZheng opened this issue Oct 12, 2023 · 52 comments

Comments

@SkylerZheng commented Oct 12, 2023

Can you share how to adapt the model for non-square image training and inference? Is it possible to use the Stable Diffusion pipeline to generate non-square images?

@SkylerZheng (Author)

Hi, you mentioned that you will adapt this to SD 2.1; can you specify how you are going to do that?

@SkylerZheng (Author)

@mihirp1998 Any thoughts on this? I tried to adapt to SD 2.1, but the HPS score is very low; after training for 20 epochs it is still around 0.24. I'm wondering what went wrong with my experiment.

@mihirp1998 (Owner) commented Oct 13, 2023

I haven't tried SD 2.1 yet; I plan to try it over the weekend.
Also, I'm not sure what the issue is with non-square image training.
Can you elaborate on the problems you are facing with SD 2.1 training and non-square image training? That would help me with the integration.

@SkylerZheng (Author) commented Oct 13, 2023

Hi @mihirp1998, thank you very much for the quick response. I am trying to train with SD 2.1: I changed the latent height and width for the VAE from 64, 64 to 96, 96 (512 vs. 768 pixels), but the generated images from epoch 0 are nonsense, and the longer I train the model, the worse the quality gets. The HPS reward stays in the range of 0.22 to 0.24.

I also tried a non-square setting (128, 72) and hit the same issue.

I'm wondering, besides the VAE config, what else do I need to change? What is the parameter value 0.18215 here? Do I need to change it for SD 2.1?
ims = pipeline.vae.decode(latent.to(pipeline.vae.dtype) / 0.18215).sample
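If I understand correctly, 0.18215 is the VAE latent scaling factor used by SD 1.x and 2.x, in which case it can be read from the model config rather than hard-coded. A minimal sketch, assuming a standard diffusers StableDiffusionPipeline and a dummy 96x96 latent:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; swap in whichever model is actually being trained.
pipeline = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Dummy latent: batch 1, 4 channels, 96x96 latent grid (768x768 pixels).
latent = torch.randn(1, 4, 96, 96, device="cuda", dtype=pipeline.vae.dtype)

# 0.18215 is the VAE scaling factor for SD 1.x/2.x; reading it from the
# config keeps the code correct if the checkpoint ever changes.
scale = pipeline.vae.config.scaling_factor
ims = pipeline.vae.decode(latent / scale).sample
```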

BTW, accelerate does not work for me, so I can only use 1 GPU for training. I have scaled the learning rate down to 1e-4 or even 5e-5, with no improvement.

config = set_config_batch(config, total_samples_per_epoch=256, total_batch_size=32, per_gpu_capacity=1)

Any advice or help is appreciated! Thanks!

@mihirp1998 (Owner)

Okay, I'll look into SD 2.1.

BTW, what is the error you get with accelerate in the multi-GPU setting? Also, does accelerate work for you with other repos, or does it fail only with this repo?

@SkylerZheng (Author) commented Oct 15, 2023

@mihirp1998, cool, thanks a lot! When I use accelerate, the training just hangs; it looks like the data has not been loaded at all, so no training happens. I have used accelerate with DreamBooth training and it worked there. It could be that Python 3.10 and accelerate 0.17.0 are not compatible with my AWS EC2 environment. Please let me know if you have any updates on SD 2.1! I tried to load stabilityai/stable-diffusion-2-1 for training, but the losses are NaN; I printed the latent values and they are all NaN, yet evaluation works fine, which is very weird. Let me know if you have encountered the same problem!

@mihirp1998 (Owner) commented Oct 16, 2023

For accelerate, does it hang after one epoch, or from the very beginning?

Can you try removing this line and running it again:

accelerator.save_state()

@SkylerZheng (Author)

@mihirp1998 From the beginning. I did not use accelerate for SD 1.5, and I was able to replicate your results. Sure, let me try this, thank you!

@SkylerZheng (Author)

@mihirp1998 Still no luck. Have you tried it on SD 2.1? Any good news?

@SkylerZheng (Author)

@mihirp1998 This is the training log with SD 2.1; the loss does not drop but increases gradually.
[screenshot: training loss curve]

@mihirp1998 (Owner)

Can you maybe try lower learning rates to see if the loss goes down?

I did try SD 2.1-base and found a similar issue of the loss not going down. I think I'll have to look into it more closely to get it to work.

Playing with the learning rate or with which parameters to adapt (LoRA vs. the full UNet, or changing the LoRA dimension) might be worth trying.

@mihirp1998 (Owner)

Also, I'd recommend trying SDXL directly instead: https://stablediffusionxl.com/

I think it's probably better than SD 2.1.

@SkylerZheng (Author)

Hi @mihirp1998, thank you very much for the confirmation! I did try different LoRA ranks and different learning rates; none of them worked. Unfortunately, SDXL is too big for us, so we can only consider SD 2.1. I will keep looking into this and keep you posted! BTW, accelerate now works with multiple GPUs for me, thankfully!

@mihirp1998 (Owner)

I see, what changed in accelerate to get it to work?

@SkylerZheng (Author)

I see, what changed in accelerate to get it to work?

I honestly do not know. Maybe the system updates helped...

@mihirp1998 (Owner)

@mihirp1998 This is the training log with SD 2.1; the loss does not drop but increases gradually. [screenshot]

Are these curves with SD 2.1 or SD 2.1-base?

If they are with SD 2.1, how did you fix the NaN problem?

@SkylerZheng (Author) commented Oct 17, 2023

@mihirp1998 This is SD 2.1. I used pipeline.unet to do the prediction instead of unet, but this is a bit different from your original LoRA setting. I believe the loss increases because the learning rate is too big: I reduced per_gpu_capacity to 1, but the lr is still 1e-3. When I changed the lr from 1e-3 to 1e-4, the loss neither drops nor increases.
[screenshot: loss curve]
I also tried the new LoRA setting with SD 1.5; it does not seem to work well. Check the orange wandb logs attached.
[screenshot: wandb loss curves]

@mihirp1998 (Owner)

I see, so I'm assuming you are no longer updating the LoRA parameters but the whole UNet?

Also, can you try setting: config.train.adam_weight_decay = 0.0

Try both settings, updating with and without LoRA; I'm not sure why you get NaN with LoRA.
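Concretely, a minimal sketch of what that setting amounts to (the module and hyperparameter values here are placeholders, not the repo's exact code):

```python
import torch
from torch import nn

# Placeholder module standing in for the UNet; in practice only the injected
# LoRA layers (or the whole UNet, if training without LoRA) would have
# requires_grad=True.
unet = nn.Linear(8, 8)
trainable_params = [p for p in unet.parameters() if p.requires_grad]

# config.train.adam_weight_decay = 0.0 amounts to building the optimizer with
# no weight decay on whichever parameters are being updated.
optimizer = torch.optim.AdamW(
    trainable_params,
    lr=1e-4,             # placeholder learning rate
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.0,    # the setting suggested above
)
```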

@SkylerZheng (Author) commented Oct 17, 2023

No, I did freeze the UNet and am only updating the LoRA; otherwise the memory explodes, as you mentioned in your paper. Let me try config.train.adam_weight_decay = 0.0. Are you not getting the NaN problem with SD 2.1?

@mihirp1998 (Owner)

I don't understand how this fixes the NaN problem. What is happening here, and how does this change anything?

I used pipeline.unet to do the prediction instead of unet, but this is a bit different from your original LoRA setting.

@SkylerZheng (Author)

I don't understand how this fixes the NaN problem. What is happening here, and how does this change anything?

I used pipeline.unet to do the prediction instead of unet, but this is a bit different from your original LoRA setting.

It's weird indeed, but it seems the added LoRA layers do not work for SD 2.1. I'm thinking we could try other ways of adding LoRA for SD 2.1, for example peft.
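For instance, a minimal sketch of what I have in mind, assuming a recent diffusers version with peft installed (rank and alpha are placeholder values):

```python
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig

# Assumed checkpoint; the point is the peft-based LoRA injection below.
pipeline = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float32
)
unet = pipeline.unet
unet.requires_grad_(False)  # keep the base UNet frozen

# LoRA on the attention projections; rank and alpha are placeholders.
lora_config = LoraConfig(
    r=4,
    lora_alpha=4,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
unet.add_adapter(lora_config)

# Only the injected LoRA weights should now require gradients.
lora_params = [p for p in unet.parameters() if p.requires_grad]
print(sum(p.numel() for p in lora_params), "trainable LoRA parameters")
```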

@mihirp1998 (Owner) commented Oct 17, 2023

Okay, sure, but do you know what causes the NaN outcome in the first place?

BTW, I tried SD 2.1-base with config.train.adam_weight_decay = 0.0 and I find that the loss goes down.

@SkylerZheng (Author)

Okay, sure, but do you know what causes the NaN outcome in the first place?

BTW, I tried SD 2.1-base with config.train.adam_weight_decay = 0.0 and I find that the loss goes down.

Regarding what causes the NaN outcome: I just replaced SD 1.5 with stabilityai/stable-diffusion-2-1 from Hugging Face and changed the latent dimension from 64 to 96. As a result, the LoRA weights were not updated because of the NaN problem, so the image quality stayed unchanged.

Great to hear that! Can you help try SD 2.1 as well? Because with SD 2.1 the resolution changes from 512 to 768, per_gpu_capacity also goes down from 4 to 1, and that affects the learning rate.
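As a rough sanity check of that batch/learning-rate interaction, assuming the effective batch is assembled with gradient accumulation (a sketch, not the repo's exact logic):

```python
# 512x512 reportedly fits 4 samples per GPU; 768x768 only 1 (single GPU here).
total_batch_size = 32
num_gpus = 1

for resolution, per_gpu_capacity in ((512, 4), (768, 1)):
    grad_accum_steps = total_batch_size // (per_gpu_capacity * num_gpus)
    print(f"{resolution}px: per_gpu_capacity={per_gpu_capacity}, "
          f"gradient accumulation steps={grad_accum_steps}")
```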

@mihirp1998 (Owner) commented Oct 17, 2023

As a result, the LoRA weights were not updated because of the NaN problem, so the image quality stayed unchanged.

I think having 64 as the latent height/width was causing the NaN issue. SD 2.1 should probably work after setting weight_decay to 0.

Can you help try SD 2.1 as well?

I plan to try this after a week, and will also try the SD refiner then, as I have a NeurIPS camera-ready deadline. But I think SD 2.1-base is working, and the same strategy should work for SD 2.1. Let me know if it works for you.

@SkylerZheng (Author)

I think having 64 as the latent height/width was causing the NaN issue. SD 2.1 should probably work after setting weight_decay to 0.

I tried 96 and the NaN issue was still not solved. I'm currently testing with 0 weight decay; hopefully it will work!

Thanks a lot for the help! I will keep you posted on this.

@Xynonners

Sorry for hijacking this thread, but when trying to adapt this for SDXL, the following occurs:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (308x768 and 2048x640)

It seems the LoRA implementation for SDXL is completely different too.
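If it helps, the 768-vs-2048 mismatch is consistent with SDXL's UNet expecting text embeddings from both of its text encoders concatenated along the feature dimension. A quick way to check, assuming a standard StableDiffusionXLPipeline:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# SDXL's UNet cross-attention expects 2048-dim text context...
print(pipe.unet.config.cross_attention_dim)    # 2048
# ...built by concatenating the hidden states of its two text encoders.
print(pipe.text_encoder.config.hidden_size)    # 768  (CLIP ViT-L)
print(pipe.text_encoder_2.config.hidden_size)  # 1280 (OpenCLIP ViT-bigG)
```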

@Xynonners

got much further and now running into a negative tensor issue on the backprop...

@mihirp1998 (Owner)

Thanks! If you are successful in integrating it, please do send a pull request. I would love to integrate it.

@Xerxemi (Contributor) commented Oct 22, 2023

got much further and now running into a negative tensor issue on the backprop...

Oddly, after a few hours of working on this, the issue can be skirted by setting the latent dim to 64x64 or 96x96 rather than 128x128 (which causes it)...

EDIT: It seems the LoRA still isn't training, even though it reports that everything is fine.

@Xerxemi (Contributor) commented Oct 22, 2023

I tried to load stabilityai/stable-diffusion-2-1 for training, but the losses are NaN; I printed the latent values and they are all NaN, yet evaluation works fine.

In my experience, losses end up as NaN when using float16; bfloat16 doesn't have this issue. I still have to check whether lowering the latent dim causes NaN on SDXL.

EDIT: Calling pipeline.upcast_vae() upcasts parts of the VAE to float32, bypassing the issue.
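A minimal sketch of that workaround, assuming the SDXL pipeline (for SD 2.1 the equivalent would be loading in bfloat16 or keeping the VAE in float32):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Assumed SDXL checkpoint, loaded in float16.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Upcast the VAE internals to float32 so decoding doesn't overflow to NaN/Inf
# under float16.
pipe.upcast_vae()

# Alternative: load everything in bfloat16, which has the same exponent range
# as float32 and avoids the float16 NaNs:
# pipe = StableDiffusionXLPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
# ).to("cuda")
```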

@Xerxemi (Contributor) commented Oct 22, 2023

I don't understand how this fixes the NaN problem. What is happening here, and how does this change anything?

I used pipeline.unet to do the prediction instead of unet, but this is a bit different from your original LoRA setting.

It's weird indeed, but it seems the added LoRA layers do not work for SD 2.1. I'm thinking we could try other ways of adding LoRA for SD 2.1, for example peft.

huggingface/diffusers#5331

It might also be worthwhile to note that LoRAAttnProcessor (and by extension LoRAAttnProcessor2_0) is deprecated; a dump of the UNet shows a difference in layout compared to directly setting LoRALinearLayer(s).

@mihirp1998 (Owner)

Thanks for the information! I'll work on this in a few days. Did SDXL work for you?

@Xerxemi (Contributor) commented Oct 26, 2023

Thanks for the information! I'll work on this in a few days. Did SDXL work for you?

Yup, though the architecture is different enough that I ended up rewriting the code a few times.

I'm currently working on adding optimizations and cleaning up the code (replacing the denoising loop this time, since it seems to be somewhat broken but "working").

@SkylerZheng (Author)

@Xerxemi Could you share the script for SDXL if possible?

@SkylerZheng (Author)

Hi @Xerxemi, is SDXL working well for you? Does the loss drop normally?

@Xerxemi (Contributor) commented Nov 2, 2023

Hi @Xerxemi, is SDXL working well for you? Does the loss drop normally?

Yup. I'm currently fixing a lot of little bugs (I have practically replaced 80% of the code), but the loss does drop with weight decay active.

@SkylerZheng (Author)

Hi @Xerxemi, that's great! I cannot make the loss decrease. Could you share the learning rate you use and how you add the LoRA layers for SDXL? I would really appreciate the help!

@Xerxemi (Contributor) commented Nov 3, 2023

@SkylerZheng Sure, my LR is 3e-4 and weight decay is 5e-2 using the AdamW bnb optimizer (I also lowered beta2 to 0.99 and epsilon to 1e-6).

I use peft to add the LoRA layers (or LoHa/LoKr).
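Roughly, the optimizer setup would look like this, assuming "AdamW bnb" means bitsandbytes' 8-bit AdamW (lora_params is a placeholder for whatever trainable parameters peft injects):

```python
import torch
from torch import nn
import bitsandbytes as bnb

# Placeholder for the LoRA parameters left trainable by peft.
lora_params = [nn.Parameter(torch.zeros(16, 16))]

# 8-bit AdamW with the hyperparameters mentioned above.
optimizer = bnb.optim.AdamW8bit(
    lora_params,
    lr=3e-4,
    betas=(0.9, 0.99),  # beta2 lowered from the usual 0.999
    eps=1e-6,           # lowered from the usual 1e-8
    weight_decay=5e-2,
)
```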

@Xerxemi (Contributor) commented Nov 3, 2023

[screenshot: training curves]
(This is somewhat inaccurate, as the optimizer here is actually Prodigy.)

@SkylerZheng (Author)

Hi @Xerxemi, congratulations on this! It looks great, and thank you very much for sharing!

@mihirp1998 (Owner) commented Jan 26, 2024

Hi @Xerxemi, I have set up the codebase for SDXL and am currently trying to get it to work.

In the plots above, what reward function were you using? Also, is there anything else you changed other than what you mentioned:

my LR is 3e-4 and weight decay is 5e-2 using the AdamW bnb optimizer (I also lowered beta2 to 0.99 and epsilon to 1e-6).

@dain5832

@mihirp1998, @Xerxemi
Hi, are there any updates for SDXL?
I would appreciate it if I could use the XL version.

@Xynonners

Hi @Xerxemi, I have set up the codebase for SDXL and am currently trying to get it to work.

In the plots above, what reward function were you using? Also, is there anything else you changed other than what you mentioned:

my LR is 3e-4 and weight decay is 5e-2 using the AdamW bnb optimizer (I also lowered beta2 to 0.99 and epsilon to 1e-6).

The reward function was HPSv2.

@Xynonners

@mihirp1998, @Xerxemi Hi, are there any updates for SDXL? I would appreciate it if I could use the XL version.

Well, we practically gave up on the idea for quality reasons, but yes, it did work at one point.

I'm currently out on a trip, but once I'm back home the codebase should still exist somewhere.

@anonymous-atom

Has anyone successfully trained using a custom loss function? I have tried adjusting lots of parameters, and the loss is still increasing.

@mihirp1998 (Owner) commented Nov 16, 2024 via email

@anonymous-atom

I did try lots of combinations in the hyperparameter space, including your suggestion above, but the loss still seems to be increasing.

@anonymous-atom

So do the AlignProp scripts support SDXL-Turbo now?

@mihirp1998 (Owner) commented Nov 24, 2024

@Xynonners or @Xerxemi, can you please release it if you have it somewhere? Unfortunately, I don't have the bandwidth to add this support.

@Xynonners

@Xynonners or @Xerxemi, can you please release it if you have it somewhere? Unfortunately, I don't have the bandwidth to add this support.

Sure, I can try to dig it up if I have the time. I'm currently quite busy, though, so it might take a while.

@anonymous-atom commented Nov 26, 2024 via email

@anonymous-atom

Hey @Xynonners, have you had a chance to check the code?
