Check for tokenizer in downloaded models directory #364

khaledsulayman · 2024-11-11T20:51:02Z

This change forces tokenizer loading to be done using the locally downloaded teacher model, as opposed to pulling from huggingface.

Resolves: #343

Signed-off-by: Khaled Sulayman [email protected]

mergify · 2024-11-12T22:04:00Z

This pull request has merge conflicts that must be resolved before it can be
merged. @khaledsulayman please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2024-11-14T03:25:53Z

This pull request has merge conflicts that must be resolved before it can be
merged. @khaledsulayman please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

RobotSail · 2024-11-14T13:31:59Z

It seems like we're including artifacts directly in our code for testing purpose, e.g.: tests/testdata/models/instructlab/granite-7b-lab/tokenizer.json.

We want to avoid this approach since the artifacts themselves may become out-of-date overtime and therefore lead to inaccurate tests, in addition to adding an unnecessary amount of data to our codebase. If possible, I recommend to instead setup a workflow in which these files are downloaded automatically as part of the test suite. The AutoTokenizer.from_pretrained API already does this when loading the tokenizer, so I would check if there's a more direct way to simply download the tokenizer.

mergify · 2024-11-14T19:44:17Z

This pull request has merge conflicts that must be resolved before it can be
merged. @khaledsulayman please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

khaledsulayman · 2024-11-14T20:17:23Z

@RobotSail I see your point. I personally don't feel too strongly either way, I'll let others chime in on if that's worth blocking this PR. I've also opened #384 so we don't lose track of it

Signed-off-by: Khaled Sulayman <[email protected]>

…ation in instructlab.utils In our case, .safetensor file validation is not needed since we don't read it to load a tokenizer Signed-off-by: Khaled Sulayman <[email protected]>

Signed-off-by: Khaled Sulayman <[email protected]>

aakankshaduggal

Thanks @khaledsulayman! LGTM 🚀

bbrowning

This looks good enough to me to merge and then keep improving things as we go. A few things that come to mind we may want to tackle later:

renaming tokenizer_model_name to something else, since it really expects a directory and not a huggingface model name
consider factoring out some common requirements, util classes, etc into some instructlab/common repo/package that SDG, CLI, and training repos could share - things like is_model_gguf or ensuring we're all aligned on identical versions of transformers, torch, etc - needs more discussion outside this PR, but this PR copies a few more things from instructlab/instructlab that we'll want to stay in-sync over time
adding an e2e test that exercises the ContextAwareChunker - right now we have a functional test that tests PDF chunking, but we don't have something that ensures the entire e2e flow from ilab data generate is wired up properly for PDFs, including things like the tokenizer_model_name gets passed in as expected

Because we're hoping to get a release out soon and some of us are off tomorrow, I don't think any of the above needs to hold this up. And, some of the above would explode the scope of this PR so best done separately anyway.

Thanks for all the work on this!

mergify bot added the ci-failure label Nov 11, 2024

khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from f4096d4 to a6ee9b3 Compare November 11, 2024 21:42

mergify bot added ci-failure and removed ci-failure labels Nov 11, 2024

khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from a6ee9b3 to 13537a7 Compare November 12, 2024 16:14

mergify bot added ci-failure and removed ci-failure labels Nov 12, 2024

khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 13537a7 to ae3ab10 Compare November 12, 2024 19:13

mergify bot added ci-failure and removed ci-failure labels Nov 12, 2024

mergify bot added testing Relates to testing needs-rebase and removed ci-failure labels Nov 12, 2024

khaledsulayman force-pushed the ks-teacher-model-tokenizer branch 3 times, most recently from 7ebb8d3 to 2a1a0cc Compare November 13, 2024 21:27

mergify bot added ci-failure and removed needs-rebase labels Nov 13, 2024

khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 2a1a0cc to 8c256b4 Compare November 14, 2024 03:15

mergify bot removed the ci-failure label Nov 14, 2024

khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 8c256b4 to 0dbb79a Compare November 14, 2024 03:16

khaledsulayman marked this pull request as ready for review November 14, 2024 03:17

khaledsulayman requested a review from a team November 14, 2024 03:17

mergify bot added the ci-failure label Nov 14, 2024

mergify bot added the needs-rebase label Nov 14, 2024

khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 0dbb79a to 1a47348 Compare November 14, 2024 16:59

mergify bot removed the needs-rebase label Nov 14, 2024

khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 23dd23d to 66b41e0 Compare November 14, 2024 18:59

mergify bot added ci-failure and removed ci-failure labels Nov 14, 2024

khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from 4991e87 to b1bd62d Compare November 14, 2024 19:20

mergify bot added ci-failure and removed ci-failure labels Nov 14, 2024

mergify bot added needs-rebase and removed ci-failure labels Nov 14, 2024

khaledsulayman mentioned this pull request Nov 14, 2024

Download tokenizer artifacts in CI instead of storing them in tests/testdata/models #384

Open

khaledsulayman added 8 commits November 14, 2024 15:22

Check for tokenizer in downloaded models directory

42f64a0

Signed-off-by: Khaled Sulayman <[email protected]>

Add error logging for failed tokenizer loading

b75b851

Signed-off-by: Khaled Sulayman <[email protected]>

Add unit tests for tokenizer creation

063d220

Signed-off-by: Khaled Sulayman <[email protected]>

Increase Exception specificity for invalid model paths

37f2f33

Signed-off-by: Khaled Sulayman <[email protected]>

include custom safetensor and gguf format checking based on implement…

89cde8a

…ation in instructlab.utils In our case, .safetensor file validation is not needed since we don't read it to load a tokenizer Signed-off-by: Khaled Sulayman <[email protected]>

add necessary dependencies for validating gguf files

b2fa4fc

Signed-off-by: Khaled Sulayman <[email protected]>

hotfix: change error logging for invalid model formats

36cc0d2

Signed-off-by: Khaled Sulayman <[email protected]>

change HF model path to local model path in functional test

fa555c2

Signed-off-by: Khaled Sulayman <[email protected]>

khaledsulayman force-pushed the ks-teacher-model-tokenizer branch from a9ce521 to fa555c2 Compare November 14, 2024 20:23

mergify bot removed the needs-rebase label Nov 14, 2024

aakankshaduggal approved these changes Nov 14, 2024

View reviewed changes

mergify bot added the one-approval label Nov 14, 2024

bbrowning approved these changes Nov 14, 2024

View reviewed changes

mergify bot removed the one-approval label Nov 14, 2024

mergify bot merged commit e8bbd06 into instructlab:main Nov 14, 2024
22 checks passed

khaledsulayman mentioned this pull request Nov 15, 2024

Delete .gitattributes #393

Merged

relyt0925 mentioned this pull request Nov 21, 2024

data generate --model parameter used for local file path and where to point to remote teacher model endpoint (need two separate variables) #425

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check for tokenizer in downloaded models directory #364

Check for tokenizer in downloaded models directory #364

khaledsulayman commented Nov 11, 2024 •

edited

Loading

mergify bot commented Nov 12, 2024

mergify bot commented Nov 14, 2024

RobotSail commented Nov 14, 2024

mergify bot commented Nov 14, 2024

khaledsulayman commented Nov 14, 2024 •

edited

Loading

aakankshaduggal left a comment

bbrowning left a comment

Check for tokenizer in downloaded models directory #364

Check for tokenizer in downloaded models directory #364

Conversation

khaledsulayman commented Nov 11, 2024 • edited Loading

mergify bot commented Nov 12, 2024

mergify bot commented Nov 14, 2024

RobotSail commented Nov 14, 2024

mergify bot commented Nov 14, 2024

khaledsulayman commented Nov 14, 2024 • edited Loading

aakankshaduggal left a comment

Choose a reason for hiding this comment

bbrowning left a comment

Choose a reason for hiding this comment

khaledsulayman commented Nov 11, 2024 •

edited

Loading

khaledsulayman commented Nov 14, 2024 •

edited

Loading