Fix gradient accumulation related unbound errors for tracking gradnorm #54

Open
wants to merge 1 commit into main

Conversation

@a-r-r-o-w (Owner) commented Oct 19, 2024

Fixes #41. @Yuancheng-Xu, would you mind giving this a review?

@sayakpaul (Collaborator) left a comment

Left a small comment.

@@ -651,6 +651,8 @@ def load_model_hook(models, input_dir):

for step, batch in enumerate(train_dataloader):
    models_to_accumulate = [transformer]
    gradient_norm_before_clip = None
@sayakpaul (Collaborator) commented Oct 20, 2024

Oh I thought we always calculate it?

gradient_norm_before_clip = get_gradient_norm(transformer.parameters())

edit: Oh or not when gradient accumulation steps > 1?
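For reference, here is a minimal sketch of the loop structure being discussed (loosely mirroring the diff; compute_loss, optimizer, and args.max_grad_norm are illustrative placeholders, and get_gradient_norm is the helper referenced above). With gradient_accumulation_steps > 1, accelerator.sync_gradients is True only on the micro-step that actually syncs, so the norms are computed only there and have to be pre-initialized for the logging code that runs on every step:

for step, batch in enumerate(train_dataloader):
    models_to_accumulate = [transformer]
    # The fix: make sure these names exist on every micro-step, not only on sync steps.
    gradient_norm_before_clip = None
    gradient_norm_after_clip = None

    with accelerator.accumulate(*models_to_accumulate):
        loss = compute_loss(transformer, batch)  # placeholder for the actual loss computation
        accelerator.backward(loss)

        if accelerator.sync_gradients:
            # Reached only once every gradient_accumulation_steps micro-steps.
            gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
            accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
            gradient_norm_after_clip = get_gradient_norm(transformer.parameters())

        optimizer.step()
        optimizer.zero_grad()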

@Yuancheng-Xu (Contributor)

The current fix looks fine to me.

However, I have a question: why not just put all of the logging inside if accelerator.sync_gradients:?

What I mean is to change the original code

if accelerator.sync_gradients:
    progress_bar.update(1)
    global_step += 1
    <some code...>

last_lr = lr_scheduler.get_last_lr()[0] if lr_scheduler is not None else args.learning_rate
logs = {
    "loss": loss.detach().item(),
    "lr": last_lr,
    "gradient_norm_before_clip": gradient_norm_before_clip,
    "gradient_norm_after_clip": gradient_norm_after_clip,
}
progress_bar.set_postfix(**logs)
accelerator.log(logs, step=global_step)

to this modified one

if accelerator.sync_gradients:
    progress_bar.update(1)
    global_step += 1
    <some code...>

    last_lr = lr_scheduler.get_last_lr()[0] if lr_scheduler is not None else args.learning_rate
    logs = {
        "loss": loss.detach().item(),
        "lr": last_lr,
        "gradient_norm_before_clip": gradient_norm_before_clip,
        "gradient_norm_after_clip": gradient_norm_after_clip,
    }
    progress_bar.set_postfix(**logs)
    accelerator.log(logs, step=global_step)

Since the actual logging only happens via accelerator.log(logs, step=global_step), it seems to make more sense to log only when global_step is incremented.

I am not completely sure about this, though, since I have also seen the original pattern in other diffusers training scripts, like here.
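For what it's worth, a tiny self-contained illustration (plain Python, no accelerate; sync stands in for accelerator.sync_gradients with an assumed gradient_accumulation_steps of 4) of how often each placement ends up calling accelerator.log per optimizer step:

def count_log_calls(log_every_micro_step: bool, num_steps: int = 8, accum: int = 4) -> int:
    calls = 0
    for step in range(num_steps):
        sync = (step + 1) % accum == 0  # mimics accelerator.sync_gradients
        if log_every_micro_step or sync:
            calls += 1
    return calls

print(count_log_calls(True))   # 8: original placement logs on every micro-step
print(count_log_calls(False))  # 2: proposed placement logs once per optimizer step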

@sayakpaul (Collaborator)

Not a bad idea, but I guess it's more of a structural preference and a matter of logical convenience, honestly.


Successfully merging this pull request may close these issues.

cannot access local variable 'gradient_norm_before_clip' where it is not associated with a value