Mix all the user's generated skills instead of 30 per leaf node #421
base: main
Conversation
Users have provided feedback that they expect all of their generated skills to be mixed into the final output dataset by default, instead of truncating the mix to only 30 samples per leaf node. The choice of 30 was not a great default for every use case anyway, and was mostly tailored to very large taxonomies with large numbers of skills. Most of our users have fewer skills and expect more of their generated skill data to make it into the results, increasing the overall number of skill samples in the mixed output that match their custom skills.

Signed-off-by: Ben Browning <[email protected]>
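As a minimal sketch of the behavioral change under discussion (the helper name and signature here are hypothetical, not the actual sdg code): the old behavior caps each leaf node's contribution at a fixed count, while the proposed default mixes everything.

```python
import random

def mix_leaf_node(samples, sampling_size=1.0):
    """Select samples from one leaf node for the mixed dataset.

    sampling_size: a float is treated as a fraction of the node's
    samples; an int is an absolute cap. Hypothetical helper, for
    illustration only.
    """
    if isinstance(sampling_size, int):
        k = min(sampling_size, len(samples))
        return random.sample(samples, k)
    return random.sample(samples, int(len(samples) * sampling_size))

# Old default: cap each leaf node at 30 samples.
#   capped = mix_leaf_node(generated, sampling_size=30)
# Proposed default: mix everything the pipeline generated.
#   mixed = mix_leaf_node(generated, sampling_size=1.0)
```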
@bbrowning Thanks for the PR. I like the idea of mixing all the samples being generated, but we need to control how much we are generating vs. how much we are mixing. Should we continue to default to maybe 30/50 and let users override it in special cases? Having said that, should we make this a parameter that we expose?
This seems like an improvement to me. FWIW, I agree it would be even better to make it a parameter as @aakankshaduggal suggests above.
I agree that users need the ability to control how many skills get mixed. However, do we want to introduce a new parameter for that here? Or instead tackle that once we expose data mixing directly via the CLI? I'm personally inclined to have everything mix in the default recipes at a sampling size of 1.0 as the starting point, and then show advanced users how to re-mix their mixed dataset using Recipes, which gives them fine-grained control over the sampling size of every individual leaf node in the mixed dataset.
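As a rough illustration of the re-mixing approach described above (the function and recipe shape are hypothetical sketches, not the actual Recipes API): a recipe maps each leaf node to a sampling size, defaulting to 1.0 so that everything is mixed unless the user says otherwise.

```python
import random

def remix_dataset(leaf_node_datasets, recipe):
    """Re-mix per-leaf-node datasets according to a recipe.

    leaf_node_datasets: dict of leaf node name -> list of samples.
    recipe: dict of leaf node name -> sampling size, where a float is
    a fraction of that node's samples and an int is an absolute count.
    Hypothetical names, for illustration only.
    """
    mixed = []
    for node, samples in leaf_node_datasets.items():
        size = recipe.get(node, 1.0)  # default: mix everything
        if isinstance(size, float):
            size = int(len(samples) * size)
        size = min(size, len(samples))
        mixed.extend(random.sample(samples, size))
    return mixed
```

With an empty recipe every leaf node contributes all of its samples; an advanced user can then dial individual nodes down (e.g. `{"my_skill": 0.5}`) without touching the others.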
We had some discussion outside GitHub about this, and perhaps it's not worth doing until we expose knobs to users to control the number of samples generated (which
Fixes #420