
Mix all the user's generated skills instead of 30 per leaf node #421

Open · wants to merge 1 commit into main
Conversation

bbrowning (Contributor) commented:
Users have provided feedback that they expect all of their generated skills to be mixed into the final output dataset by default, instead of truncating each leaf node to only 30 samples. The choice of 30 was not a great default for every use case anyway, being tailored mostly toward very large taxonomies with many skills. Most of our users have a smaller number of skills and expect more of the generated skill data to make it into the results, increasing the overall number of skill samples in the mixed output that match their custom skills.

Fixes #420
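As a rough sketch of the behavior change (hypothetical function and variable names; the int-vs-float semantics are inferred from the linked issue's "30" vs "1.0" defaults, not from the library's actual code):

```python
import random

def sample_dataset(samples, sampling_size):
    """Pick samples from one leaf node's generated data for mixing.

    Assumed semantics, based on the defaults discussed in this PR:
      - int   -> keep at most that many samples (the old default of 30)
      - float -> keep that fraction of all samples (1.0 keeps everything)
    """
    if isinstance(sampling_size, float):
        count = int(len(samples) * sampling_size)
    else:
        count = min(sampling_size, len(samples))
    return random.sample(samples, count)

skills = [{"id": i} for i in range(100)]
print(len(sample_dataset(skills, 30)))   # old default: truncated to 30
print(len(sample_dataset(skills, 1.0)))  # new default: all 100 samples mixed
```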

Signed-off-by: Ben Browning <[email protected]>
@aakankshaduggal (Member) left a comment:

@bbrowning Thanks for the PR. I like the idea of mixing all the samples being generated, but we need to control how much we generate versus how much we mix. Should we continue to default to maybe 30/50 and let users override it for special cases? Having said that, should we make this a parameter that we expose?

@jwm4 left a comment:

This seems like an improvement to me. FWIW, I agree it would be even better to make it a parameter as @aakankshaduggal suggests above.

@bbrowning (Contributor, Author) commented:
I agree that users need the ability to control how many skills get mixed. However, do we want to introduce a new parameter for that here? Or instead tackle that once we expose data mixing directly via the CLI? I'm personally inclined to mix everything in the default recipes at a sampling size of 1.0 as the starting point, and then show advanced users how to re-mix their mixed dataset using Recipes, which gives them fine-grained control over the sampling size of every individual leaf node in the mixed dataset.
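To illustrate the re-mixing workflow described above, a data-mixing recipe might look something like this (a hypothetical sketch: the paths are invented and the exact recipe schema is defined by the SDG library, not by this comment):

```yaml
# Hypothetical recipe sketch: per-leaf-node sampling control when re-mixing.
datasets:
  - path: node_datasets_example/skills_writing.jsonl
    sampling_size: 1.0   # mix every generated sample from this leaf node
  - path: node_datasets_example/skills_math.jsonl
    sampling_size: 0.5   # mix half of this node's samples
  - path: node_datasets_example/skills_coding.jsonl
    sampling_size: 30    # cap this node at 30 samples
```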

@bbrowning commented:
We had some discussion outside GitHub about this, and perhaps it's not worth doing until we expose knobs to users to control the number of samples generated (which --sdg-scale-factor plays a role in, only sometimes) and the number of samples mixed (which has no exposed knob today).

Successfully merging this pull request may close these issues.

Change default skills sampling size during datamixing to "1.0" from "30"
3 participants