fix: upsample the phase10 knowledge dataset #377
Is upsampling a special case here? Or is it just that we need to adjust our mixing recipe in use for these knowledge leaf node(s) to have a fixed sampling size or a sampling ratio larger than the default of 1.0? See |
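To make the sampling-ratio reading of this question concrete, here is a minimal sketch of ratio-based sampling in which a ratio above 1.0 repeats rows. The function name `sample_with_ratio` and its parameters are hypothetical and chosen for illustration; this is not the mixing recipe code from the repository.

```python
import random


def sample_with_ratio(rows, ratio, seed=0):
    """Return about len(rows) * ratio rows (hypothetical helper).

    A ratio below 1.0 downsamples; a ratio above 1.0 repeats the
    whole dataset and then adds a random partial copy.
    """
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    if not rows:
        return []
    target = round(len(rows) * ratio)
    # Number of full copies, plus how many extra rows are still needed.
    whole, remainder = divmod(target, len(rows))
    rng = random.Random(seed)
    return rows * whole + rng.sample(rows, remainder)
```

Under this reading, upsampling is not a special case: it is the same sampling step with a ratio larger than the default of 1.0.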
This looks like a very reasonable solution to the upsampling problem, especially given how little time there was. I actually don't think it's quite as hacky as the code comments imply, but I agree it's not an ideal solution to the general problem.
I'm running the full data generation pipeline against a sample taxonomy with a skill leaf node and a knowledge leaf node, plus a precomputed skills dataset mixed in via a customized default skills recipe (i.e., using https://github.com/instructlab/sdg/blob/main/docs/data_mixing.md#using-instructlab-community-pre-generated-dataset). However, I don't think this will finish on my available hardware before I head out for the night. If it errors out overnight because of these changes, I'll leave a note tomorrow.
Other than the one nit about replacing the stdout print with a logger, this looks ready to go. Since I'll be scarce tomorrow, I'm going ahead and approving this now.
Thanks for the detailed PR, for taking a couple of iterations on this to make it far less hacky than originally proposed, and for the attention to detail with code comments and type hints!
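The logging nit mentioned in this review could be addressed along these lines. This is only a sketch: the helper name `report_upsampling` and the message text are assumptions, not taken from the PR diff.

```python
import logging

# Module-level logger, the usual replacement for bare print() calls.
logger = logging.getLogger(__name__)


def report_upsampling(original_size, upsampled_size):
    """Log the dataset sizes instead of printing them (hypothetical helper)."""
    logger.info(
        "Upsampled knowledge dataset from %d to %d samples",
        original_size,
        upsampled_size,
    )
```

Routing the message through `logging` lets callers control verbosity and destination, which a print to stdout does not.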
When we mix the knowledge dataset with skills today, we do not account for the potential discrepancy in size between the generated knowledge data and skills data. As a result, the model can forget the data it was trained on in the knowledge phase. As a simple workaround, we upsample the knowledge samples before mixing them in with the generated skills dataset.

Signed-off-by: Oleg S <[email protected]>
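The workaround described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: the helper `upsample_to_size` is hypothetical, plain Python lists stand in for the real dataset objects, and matching the skills dataset size is an assumed target that the message does not specify.

```python
import random


def upsample_to_size(samples, target_size, seed=42):
    """Pad samples with random repeats up to target_size (hypothetical helper)."""
    if not samples or len(samples) >= target_size:
        return list(samples)
    rng = random.Random(seed)
    # Keep every original sample and draw the shortfall with replacement.
    extra = rng.choices(samples, k=target_size - len(samples))
    return list(samples) + extra


# Toy data: a small knowledge set and a larger skills set.
knowledge = [{"id": f"k{i}"} for i in range(3)]
skills = [{"id": f"s{i}"} for i in range(10)]

# Assumption: upsample knowledge to the size of the skills set before mixing.
upsampled = upsample_to_size(knowledge, len(skills))
mixed = upsampled + skills
random.Random(0).shuffle(mixed)
```

Keeping all original samples and only sampling the shortfall guarantees that no knowledge sample is dropped from the mix.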
This is a good approach, thanks for working to get this in!
Thanks @RobotSail 🚢