Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for tokenizer in downloaded models directory #364

Merged
merged 8 commits into from
Nov 14, 2024

Conversation

khaledsulayman
Copy link
Member

@khaledsulayman khaledsulayman commented Nov 11, 2024

This change forces tokenizer loading to be done using the locally downloaded teacher model, as opposed to pulling from huggingface.

Resolves: #343

Signed-off-by: Khaled Sulayman [email protected]

@mergify mergify bot added the ci-failure label Nov 11, 2024
@khaledsulayman khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from f4096d4 to a6ee9b3 Compare November 11, 2024 21:42
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 11, 2024
@khaledsulayman khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from a6ee9b3 to 13537a7 Compare November 12, 2024 16:14
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 12, 2024
@khaledsulayman khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 13537a7 to ae3ab10 Compare November 12, 2024 19:13
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 12, 2024
Copy link
Contributor

mergify bot commented Nov 12, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @khaledsulayman please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added testing Relates to testing needs-rebase and removed ci-failure labels Nov 12, 2024
@khaledsulayman khaledsulayman force-pushed the ks-teacher-model-tokenizer branch 3 times, most recently from 7ebb8d3 to 2a1a0cc Compare November 13, 2024 21:27
@mergify mergify bot added ci-failure and removed needs-rebase labels Nov 13, 2024
@khaledsulayman khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 2a1a0cc to 8c256b4 Compare November 14, 2024 03:15
@mergify mergify bot removed the ci-failure label Nov 14, 2024
@khaledsulayman khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 8c256b4 to 0dbb79a Compare November 14, 2024 03:16
@khaledsulayman khaledsulayman marked this pull request as ready for review November 14, 2024 03:17
@khaledsulayman khaledsulayman requested a review from a team November 14, 2024 03:17
@mergify mergify bot added the ci-failure label Nov 14, 2024
Copy link
Contributor

mergify bot commented Nov 14, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @khaledsulayman please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 14, 2024
@RobotSail
Copy link
Member

It seems like we're including artifacts directly in our code for testing purpose, e.g.: tests/testdata/models/instructlab/granite-7b-lab/tokenizer.json.

We want to avoid this approach since the artifacts themselves may become out-of-date overtime and therefore lead to inaccurate tests, in addition to adding an unnecessary amount of data to our codebase. If possible, I recommend to instead setup a workflow in which these files are downloaded automatically as part of the test suite. The AutoTokenizer.from_pretrained API already does this when loading the tokenizer, so I would check if there's a more direct way to simply download the tokenizer.

@khaledsulayman khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 0dbb79a to 1a47348 Compare November 14, 2024 16:59
@mergify mergify bot removed the needs-rebase label Nov 14, 2024
@khaledsulayman khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 23dd23d to 66b41e0 Compare November 14, 2024 18:59
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 14, 2024
@khaledsulayman khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 4991e87 to b1bd62d Compare November 14, 2024 19:20
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 14, 2024
Copy link
Contributor

mergify bot commented Nov 14, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @khaledsulayman please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@khaledsulayman
Copy link
Member Author

khaledsulayman commented Nov 14, 2024

@RobotSail I see your point. I personally don't feel too strongly either way, I'll let others chime in on if that's worth blocking this PR. I've also opened #384 so we don't lose track of it

…ation in instructlab.utils

In our case, .safetensor file validation is not needed since we don't read it to load a tokenizer

Signed-off-by: Khaled Sulayman <[email protected]>
@khaledsulayman khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from a9ce521 to fa555c2 Compare November 14, 2024 20:23
@mergify mergify bot removed the needs-rebase label Nov 14, 2024
Copy link
Member

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @khaledsulayman! LGTM 🚀

@mergify mergify bot added the one-approval label Nov 14, 2024
Copy link
Contributor

@bbrowning bbrowning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good enough to me to merge and then keep improving things as we go. A few things that come to mind we may want to tackle later:

  • renaming tokenizer_model_name to something else, since it really expects a directory and not a huggingface model name
  • consider factoring out some common requirements, util classes, etc into some instructlab/common repo/package that SDG, CLI, and training repos could share - things like is_model_gguf or ensuring we're all aligned on identical versions of transformers, torch, etc - needs more discussion outside this PR, but this PR copies a few more things from instructlab/instructlab that we'll want to stay in-sync over time
  • adding an e2e test that exercises the ContextAwareChunker - right now we have a functional test that tests PDF chunking, but we don't have something that ensures the entire e2e flow from ilab data generate is wired up properly for PDFs, including things like the tokenizer_model_name gets passed in as expected

Because we're hoping to get a release out soon and some of us are off tomorrow, I don't think any of the above needs to hold this up. And, some of the above would explode the scope of this PR so best done separately anyway.

Thanks for all the work on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable Tokenizer Loading from downloaded Teacher Model
4 participants