Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpoint 1.3 backwards compatibility #152

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

AleHD
Copy link
Contributor

@AleHD AleHD commented Apr 25, 2024

This pull request introduces three main additions:

  • 1.3 weights are now backwards compatible, meaning we can load checkpoints that had the .safetensors suffix incorrectly placed.
  • The logic used to get the filenames of weights shards was moved from serialize.weights.load_weights to serialize.utils.get_path thanks to the addition of return_all_matches keyword.
  • Addition of a test that verifies the capability to load weights when changing parallelism topology.

@zzhhjjj
Copy link
Collaborator

zzhhjjj commented Apr 26, 2024

Hello,

Thank you for your PR.

As I understand it, you are verifying the checkpoint version. If it is 1.3, a different pattern will be used to retrieve the shards; otherwise, the existing pattern will be used. I suspect that this approach may not resolve all the issues, given that the checkpoint version 1.3 was only added to the main branch 4 days ago (see commit), whereas the bug was introduced a month and a half ago in this commit. Additionally, checking for version 1.2 seems problematic as it might disrupt the loading logic for all checkpoints before this commit.

Timeline:
Checkpoint version=1.2 -> bug introduced -> Checkpoint version=1.3

We have already merged a quick fix that utilized a script to modify the file name: PR #151. Could you build upon this solution? Perhaps making it automatic when this pattern is detected, instead of requiring users to run the script manually.

I really appreciate your contribution.

@AleHD
Copy link
Contributor Author

AleHD commented Apr 29, 2024

Thanks for the feedback. I implemented the suggested changes and tested it locally. Seems to be working ok. Let me know if you have any other comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants