Check for tokenizer in downloaded models directory #364
It seems like we're including artifacts directly in our code for testing purposes, e.g.: We want to avoid this approach since the artifacts themselves may become out-of-date over time and therefore lead to inaccurate tests, in addition to adding an unnecessary amount of data to our codebase. If possible, I recommend instead setting up a workflow in which these files are downloaded automatically as part of the test suite. The |
@RobotSail I see your point. I personally don't feel too strongly either way, so I'll let others chime in on whether that's worth blocking this PR. I've also opened #384 so we don't lose track of it.
Signed-off-by: Khaled Sulayman <[email protected]>
…ation in instructlab.utils In our case, .safetensor file validation is not needed since we don't read it to load a tokenizer Signed-off-by: Khaled Sulayman <[email protected]>
Thanks @khaledsulayman! LGTM 🚀
This looks good enough to me to merge and then keep improving things as we go. A few things that come to mind we may want to tackle later:
- renaming `tokenizer_model_name` to something else, since it really expects a directory and not a huggingface model name
- consider factoring out some common requirements, util classes, etc into some instructlab/common repo/package that SDG, CLI, and training repos could share - things like `is_model_gguf` or ensuring we're all aligned on identical versions of transformers, torch, etc - needs more discussion outside this PR, but this PR copies a few more things from instructlab/instructlab that we'll want to stay in-sync over time
- adding an e2e test that exercises the ContextAwareChunker - right now we have a functional test that tests PDF chunking, but we don't have something that ensures the entire e2e flow from `ilab data generate` is wired up properly for PDFs, including things like the `tokenizer_model_name` gets passed in as expected
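For anyone following along, a rough idea of the kind of small helper that could live in a shared package: a check for whether a path points at a GGUF file. This is only a sketch, not the code this PR copies from instructlab/instructlab. The name `looks_like_gguf` is hypothetical, and the only fact it relies on is that GGUF files begin with the four ASCII bytes `GGUF`.

```python
from pathlib import Path

# GGUF files start with this 4-byte magic value.
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(model_path) -> bool:
    """Return True if model_path is a regular file that starts with the GGUF magic.

    Hypothetical helper for illustration; it does not validate the rest of
    the file, only the leading magic bytes.
    """
    path = Path(model_path)
    if not path.is_file():
        # Directories (e.g. a safetensors checkpoint folder) are not GGUF files.
        return False
    with path.open("rb") as handle:
        return handle.read(len(GGUF_MAGIC)) == GGUF_MAGIC
```

A check like this only looks at the first four bytes, so it is cheap enough to call before deciding how to treat a model path.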
Because we're hoping to get a release out soon and some of us are off tomorrow, I don't think any of the above needs to hold this up. And, some of the above would explode the scope of this PR so best done separately anyway.
Thanks for all the work on this!
This change forces tokenizer loading to be done using the locally downloaded teacher model, as opposed to pulling from huggingface.
Resolves: #343
Signed-off-by: Khaled Sulayman [email protected]
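
As a rough illustration of the behaviour described above (not the code in this PR), the check for a tokenizer in the downloaded model directory could look something like the sketch below. The helper names and the list of expected file names are assumptions made for the example; real models may ship a different set of tokenizer files.

```python
from pathlib import Path

# Assumption: file names commonly shipped alongside Hugging Face tokenizers.
# The actual set of files depends on the model.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json")


def has_local_tokenizer(model_dir) -> bool:
    """Return True if model_dir is a directory containing the expected tokenizer files."""
    path = Path(model_dir)
    return path.is_dir() and all((path / name).is_file() for name in TOKENIZER_FILES)


def resolve_tokenizer_dir(model_dir) -> Path:
    """Return model_dir as a Path, or raise if no tokenizer is found locally.

    Hypothetical helper: it fails early instead of falling back to a
    download from the Hugging Face Hub.
    """
    path = Path(model_dir)
    if not has_local_tokenizer(path):
        raise FileNotFoundError(
            f"No tokenizer files found in {path}; expected {', '.join(TOKENIZER_FILES)}"
        )
    return path
```

The returned path could then be passed to `transformers.AutoTokenizer.from_pretrained(path, local_files_only=True)` so that nothing is fetched from the network.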