Fix gradient accumulation related unbound errors for tracking gradnorm #54

Open
wants to merge 1 commit into main

Conversation

@a-r-r-o-w (Owner) commented Oct 19, 2024

Fixes #41. @Yuancheng-Xu, would you mind giving this a review?

@sayakpaul (Collaborator) left a comment

Left a small comment.

@@ -651,6 +651,8 @@ def load_model_hook(models, input_dir):

for step, batch in enumerate(train_dataloader):
    models_to_accumulate = [transformer]
    gradient_norm_before_clip = None
@sayakpaul (Collaborator) commented Oct 20, 2024

Oh I thought we always calculate it?

gradient_norm_before_clip = get_gradient_norm(transformer.parameters())

edit: Oh or not when gradient accumulation steps > 1?
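For reference, here is a minimal sketch of the loop structure being discussed (loosely mirroring the diff; compute_loss, optimizer, and args.max_grad_norm are illustrative placeholders, and get_gradient_norm is the helper referenced above). With gradient_accumulation_steps > 1, accelerator.sync_gradients is True only on the micro-step that actually syncs, so the norms are computed only there and have to be pre-initialized for the logging code that runs on every step:

for step, batch in enumerate(train_dataloader):
    models_to_accumulate = [transformer]
    # The fix: make sure these names exist on every micro-step, not only on sync steps.
    gradient_norm_before_clip = None
    gradient_norm_after_clip = None

    with accelerator.accumulate(*models_to_accumulate):
        loss = compute_loss(transformer, batch)  # placeholder for the actual loss computation
        accelerator.backward(loss)

        if accelerator.sync_gradients:
            # Reached only once every gradient_accumulation_steps micro-steps.
            gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
            accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
            gradient_norm_after_clip = get_gradient_norm(transformer.parameters())

        optimizer.step()
        optimizer.zero_grad()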

@Yuancheng-Xu (Contributor)

The current fix looks fine to me.

However, I have a question: why not just put all of the logging inside if accelerator.sync_gradients:?

What I mean is to change the original code

if accelerator.sync_gradients:
    progress_bar.update(1)
    global_step += 1
    <some code...>

last_lr = lr_scheduler.get_last_lr()[0] if lr_scheduler is not None else args.learning_rate
logs = {
    "loss": loss.detach().item(),
    "lr": last_lr,
    "gradient_norm_before_clip": gradient_norm_before_clip,
    "gradient_norm_after_clip": gradient_norm_after_clip,
}
progress_bar.set_postfix(**logs)
accelerator.log(logs, step=global_step)

to this modified one

if accelerator.sync_gradients:
    progress_bar.update(1)
    global_step += 1
    <some code...>

    last_lr = lr_scheduler.get_last_lr()[0] if lr_scheduler is not None else args.learning_rate
    logs = {
        "loss": loss.detach().item(),
        "lr": last_lr,
        "gradient_norm_before_clip": gradient_norm_before_clip,
        "gradient_norm_after_clip": gradient_norm_after_clip,
    }
    progress_bar.set_postfix(**logs)
    accelerator.log(logs, step=global_step)

Since the actual logging only happens via accelerator.log(logs, step=global_step), it seems to make more sense to log only when global_step is incremented.

I am not completely sure about this, though, since I have also seen the original pattern in other diffusers training scripts, like here.
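For what it's worth, a tiny self-contained illustration (plain Python, no accelerate; sync stands in for accelerator.sync_gradients with an assumed gradient_accumulation_steps of 4) of how often each placement ends up calling accelerator.log per optimizer step:

def count_log_calls(log_every_micro_step: bool, num_steps: int = 8, accum: int = 4) -> int:
    calls = 0
    for step in range(num_steps):
        sync = (step + 1) % accum == 0  # mimics accelerator.sync_gradients
        if log_every_micro_step or sync:
            calls += 1
    return calls

print(count_log_calls(True))   # 8: original placement logs on every micro-step
print(count_log_calls(False))  # 2: proposed placement logs once per optimizer step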

@sayakpaul (Collaborator)

Not a bad idea, but I guess it's more of a structural preference and a matter of logical convenience, honestly.


Successfully merging this pull request may close these issues.

cannot access local variable 'gradient_norm_before_clip' where it is not associated with a value