Enabling LR scaling for a specific layer (ex. down-projection...) during pretraining #1262
This PR enables scaling the learning rate of a specific layer by giving its name in `scale-lr-layer` and the multiplier in `lr-multiplier`, reusing the existing internal logic of `scale_lr_cond` and `lr_mult`.
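For illustration, here is a minimal sketch of how these two arguments could be registered. The argument names follow this description; the defaults, help strings, and parser plumbing are assumptions, not the PR's actual diff:

```python
import argparse

def add_lr_scaling_args(parser):
    # Argument names follow this PR's description; defaults and help text are illustrative.
    group = parser.add_argument_group(title='layer-wise LR scaling')
    group.add_argument('--scale-lr-layer', type=str, default=None,
                       help='Name (substring) of the layer whose LR is scaled, '
                            'e.g. linear_fc2. No scaling when unset.')
    group.add_argument('--lr-multiplier', type=float, default=1.0,
                       help='LR multiplier applied to the matched layer.')
    return parser

# Hypothetical invocation:
#   pretrain_gpt.py ... --scale-lr-layer linear_fc2 --lr-multiplier 0.28
args = add_lr_scaling_args(argparse.ArgumentParser()).parse_args(
    ['--scale-lr-layer', 'linear_fc2', '--lr-multiplier', '0.28'])
```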
Motivation:
MuP and several interesting follow-up papers (e.g. Depth-MuP) suggest, among other techniques such as scaling layers' outputs and adjusting initializations, using width-dependent LRs in order to enhance feature learning and prevent output layers from dominating the learning process. Combined with proper initializations and output scaling, this makes for a stable setup, especially when sweeping and scaling hyperparameters for pretraining.
Implementation:
Generalizes and makes more flexible the existing use of this feature for the LM head during finetuning, by letting the user specify the name of the target layer as well as the LR multiplier, and extends it to pretraining. When no layer is specified, the `scale_lr_cond` argument is `None` and no LR scaling is applied.
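A minimal sketch of that wiring, assuming `scale_lr_cond` is a `(name, param)` predicate as in Megatron's existing param-group construction; `make_scale_lr_cond` and the commented call site are illustrative, not the code in this PR:

```python
def make_scale_lr_cond(scale_lr_layer):
    """Build the scale_lr_cond predicate from --scale-lr-layer (None => no scaling)."""
    if scale_lr_layer is None:
        return None  # existing behaviour: param groups are built without LR scaling
    def scale_lr_cond(name, param):
        # Matches e.g. "decoder.layers.3.mlp.linear_fc2.weight"
        # when scale_lr_layer == "linear_fc2".
        return scale_lr_layer in name
    return scale_lr_cond

# Hypothetical call site (keyword names taken from the existing internal logic;
# the exact function and signature touched by the diff may differ):
# model, optimizer, scheduler = setup_model_and_optimizer(
#     model_provider, model_type,
#     scale_lr_cond=make_scale_lr_cond(args.scale_lr_layer),
#     lr_mult=args.lr_multiplier)
```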
Why?:
A GPT-like model typically has an FFN factor > 1 (3.5 for Llama 3.1 70B), which suggests that the down-projection (`linear_fc2` in Megatron) requires a lower LR, theoretically `LR x 1/ffn_factor`. This way, we don't have to add a new argument (e.g. `downproj-lr-mult`) every time we want to test scaling of a certain layer (e.g. `linear_fc2`).
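For concreteness, a worked example with illustrative numbers (the base LR is hypothetical; only the 3.5 factor comes from Llama 3.1 70B):

```python
base_lr = 3e-4                    # hypothetical pretraining LR
ffn_factor = 3.5                  # Llama 3.1 70B: 28672 / 8192
lr_mult = 1 / ffn_factor          # ~0.286, passed via --lr-multiplier
down_proj_lr = base_lr * lr_mult  # ~8.6e-5, applied only to linear_fc2 parameters
```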
P.S:
Layers' output scaling (before residual connections), as introduced in Depth-MuP to account for depth scaling, will be proposed in a separate PR. Same for initialization.