Don't use _get_clones #2270

Open
ebsmothers opened this issue Jan 16, 2025 · 8 comments
Labels
best practice: Things we should be doing but aren't
community help wanted: We would love the community's help completing this issue

Comments

@ebsmothers
Contributor

We've used the _get_clones utility a lot for cleaner instantiation of our TransformerDecoder from a single layer. However, we shouldn't use it in cases that require random initialization and don't subsequently override the params with a state dict load (i.e. what we do for LoRA when we're not resuming from an intermediate checkpoint). The following script demonstrates why: cloned modules share their initial values, so with _get_clones (and no later weight load to override the init) every layer ends up identical.

import torch
from torchtune.modules.transformer import _get_clones
from torchtune.modules.peft import LoRALinear

def main():
    # Constructing each layer separately gives every layer its own random init
    loras_loop = []
    for _ in range(4):
        loras_loop.append(LoRALinear(in_dim=16, out_dim=8, rank=4, alpha=1.0))

    # _get_clones copies a single, already-initialized layer
    loras_cloned = _get_clones(
        LoRALinear(in_dim=16, out_dim=8, rank=4, alpha=1.0), 4
    )

    loop_max_diff = torch.max(torch.abs(loras_loop[0].lora_a.weight - loras_loop[3].lora_a.weight))
    cloned_max_diff = torch.max(torch.abs(loras_cloned[0].lora_a.weight - loras_cloned[3].lora_a.weight))

    print(f"Max diff in for loop: {loop_max_diff}")
    print(f"Max diff with _get_clones: {cloned_max_diff}")

if __name__ == "__main__":
    main()


...

Max diff in for loop: 0.4612632691860199
Max diff with _get_clones: 0.0
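
For context, the identical values come from the fact that _get_clones deep-copies one module that has already been randomly initialized. A minimal sketch of that pattern (mirroring PyTorch's own helper; not necessarily torchtune's exact implementation):

import copy
from torch import nn

def _get_clones_sketch(module: nn.Module, n: int) -> nn.ModuleList:
    # Every clone is a deep copy of the same already-initialized module,
    # so all n layers start from identical parameter values.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])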
ebsmothers added the best practice label Jan 16, 2025
@RdoubleA
Contributor

@ebsmothers what would be the recommended alternative here? Could we just call reset_parameters or similar on all the layers in _get_clones?
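
Something like the following, just as a sketch (it assumes every relevant submodule actually exposes a reset_parameters()-style hook, which isn't guaranteed for our modules):

import copy
from torch import nn

def _get_clones_with_reinit(module: nn.Module, n: int) -> nn.ModuleList:
    clones = nn.ModuleList([copy.deepcopy(module) for _ in range(n)])
    for clone in clones:
        # Re-randomize every submodule that knows how to reset itself,
        # so the clones no longer share identical initial values.
        for m in clone.modules():
            if hasattr(m, "reset_parameters"):
                m.reset_parameters()
    return clones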

@Ankur-singh
Contributor

Dumb question: why can't we use a for loop? Does it have something to do with performance?

@RdoubleA
Contributor

The layer is already instantiated outside of TransformerDecoder / VisionTransformer, or wherever _get_clones is used, so we can't use a for loop to instantiate multiple layers unless we change the __init__ signatures to accept a ModuleList instead.

@ebsmothers
Contributor Author

@Ankur-singh not a dumb question at all. Actually it has nothing to do with performance and more to do with convenience. The idea of _get_clones is that you don't need a for loop: you can just pass a single layer plus a num_layers param and have the stack built for you. But if it messes with initialization, then it's probably better to always write the for loop.

@RdoubleA actually we do support passing a ModuleList now (at least for TransformerDecoder). We even have a bunch of models built this way already, see e.g. Llama 3.1. So really we just need to migrate the other builders onto this. (Personally I wouldn't recommend the reset_parameters approach, since it just introduces some extra API coupling.)
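
To make that concrete, the migration pattern is roughly the following (sketched here with the LoRALinear toy example from above rather than a real model builder; the exact TransformerDecoder arguments are omitted):

from torch import nn
from torchtune.modules.peft import LoRALinear

# Each layer is constructed inside the loop, so each one gets its own random init.
# The resulting ModuleList is then handed to the model, e.g. TransformerDecoder(layers=layers, ...).
layers = nn.ModuleList(
    LoRALinear(in_dim=16, out_dim=8, rank=4, alpha=1.0) for _ in range(4)
)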

joecummings changed the title to Don't use _get_clones Jan 16, 2025
RdoubleA added the community help wanted label Jan 16, 2025
@Ankur-singh
Contributor

The ability to pass a ModuleList is quite neat. We should update VisionTransformer.__init__ as well.

@ebsmothers after migrating other builders, will support for layers: nn.Module be dropped? If we plan to keep it around, then updating _get_clones to use a for-loop might be a good idea, as it would better accommodate the ModuleList format. This change could also improve flexibility and readability when managing layers in the model.

@ebsmothers
Contributor Author

@Ankur-singh agreed about updating VisionTransformer too. Nitpicking a bit but really we should probably drop support for n_layers (since ModuleList is still nn.Module). I don't think we can really keep _get_clones in its current format either, since the for loop approach requires that each Module be created inside the function rather than passed as an argument. (Though lmk if I'm misunderstanding your point)

@Ankur-singh
Contributor

the for loop approach requires that each Module be created inside the function rather than passed as an argument.

That's a very good point. For now, we can update _get_clones to use a for loop. Once all the models are updated to use the new builder API, we can drop support for n_layers.
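
To sketch what the for-loop version would have to look like (hypothetical, not an agreed-on API): since each module must be created inside the helper, it would take a factory/callable rather than a pre-built layer, something like:

from typing import Callable
from torch import nn

def _build_layers(layer_factory: Callable[[], nn.Module], n: int) -> nn.ModuleList:
    # Each factory call constructs (and randomly initializes) a fresh layer,
    # instead of deep-copying one layer n times.
    return nn.ModuleList(layer_factory() for _ in range(n))

# e.g. _build_layers(lambda: LoRALinear(in_dim=16, out_dim=8, rank=4, alpha=1.0), 4)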

Does that sound good? I can submit a PR updating all the model builders.

@ebsmothers
Contributor Author

Yes that sounds great. Thanks @Ankur-singh, looking forward to the PR!

Ankur-singh mentioned this issue Jan 19, 2025