Introduce data mixing recipe yaml files #203

bbrowning · 2024-07-24T19:17:26Z

This introduces Recipe yaml files, which are used both as an input into the data mixing process and as an output of the process.

As an input, we have some default recipe files that specify any precomputed datasets that should be mixed with data from new skills when generating the overall mix of samples that will be sent to the training process. In our specific case, we're adding in a single precomputed dataset of instructlab/InstructLabCommunity that gets pulled from HuggingFace and mixed with new skill samples.

As an output of the data generation process, we write recipe yamls to document which datasets were mixed together and in what proportions along with the system prompt that was used during the generation. Here's an example of a recipe yaml put into the output directory after running data generation:

datasets:
- path: instructlab/InstructLabCommunity
  sampling_size: 1.0
- path: node_datasets_2024-07-25T10_27_12/knowledge_tonsils_overview_e2e-tonsils_p10.jsonl
  sampling_size: 1.0
metadata:
  sys_prompt: "I am, Red Hat\xAE Instruct Model based on Granite 7B, an AI language\
    \ model developed by Red Hat and IBM Research, based on the Granite-7b-base language\
    \ model. My primary function is to be a chat assistant."

While this change is relatively small, it came from some much larger PRs that were co-authored by multiple people so giving credit to all of those here.

Parts of this are extracted and rebased from
aakankshaduggal#4
aakankshaduggal#20

Refs #162, #171, #185, #201.

aakankshaduggal

Thanks! This looks great @bbrowning 🚢

markmc

Some minor comments, but I think the point about what recipe_path is relative to is probably important - e.g. has this been tested with ilab data generate ?

src/instructlab/sdg/datamixing.py

src/instructlab/sdg/configs/knowledge/data_recipe/default_recipe.yaml

src/instructlab/sdg/datamixing.py

bbrowning · 2024-07-25T03:28:48Z

Marking as draft while I work on absolute vs relative paths a bit, as well as address the other feedback here.

src/instructlab/sdg/configs/skills/data_recipe/default_recipe.yaml

markmc · 2024-07-25T14:19:50Z

Ok, lots of great discussion in these threads. It's a lot to absorb, but I think it boils down to a pretty straightforward set of changes

Default recipes - downstream

If a downstream user/packager wants to add default recipes (and datasets), they should install them in /usr/share/instructlab/sdg

generate_data() should supply default_recipes_dir path to DataMixer using platformdirs.PlatformDirs
downstream users should install recipes into /usr/share/instructlab/sdg/default_data_recipes/{knowledge,skills}.yaml
downstream users should install datasets into /usr/share/instructlab/sdg/datasets
this a packaging thing - e.g. it would typically be done as part of the container image build
the library would use any default recipes found there

Default recipes - testing

Since we won't have default recipes in the library, we do need some way of ensuring this code path is tested - this can be done with a unit test. But it is a requirement to merge this PR IMO.

Community dataset recipe - upstream

Assuming we have at least unit testing of recipes, I don't think this is a blocker to merging this PR - I'm fine with just filing an issue to capture this.

We want to enable upstream users to mix the instructlab/InstructLabCommunity dataset.

They would download the dataset and place it in ~/.cache ? Potentially there would be an ilab data download command to do this

It can be supplied with e.g. ilab data generate --skills-dataset= and ilab would construct a simple skills recipe (in memory) and pass it to the library

This path could be tested in our e2e CI, which would also improve automated testing coverage of the library.

Hugging Face downloads in library

Given the above, there is no need for the library to have support for downloading from Hugging Face

Relative paths

There is no requirement to support relative paths to recipes files.

Paths to dataset should be considered relative to the recipe file.

Empty recipes

Recipe() should be sufficient to create an empty recipe.

sys_prompt

It sounds like we never want to use the sys_prompt value from a user-supplied recipe. It is included in recipe files that are written by the library for information only.

Move sys_prompt in the recipe format to a new metadata section, and ignore that whole section when reading.

bbrowning · 2024-07-25T14:37:24Z

Thank you @markmc , that is super helpful. I've pushed a commit that should adjust to the sys_prompt guidance under a new metadata section, but it will take a bit more time/thought to implement some of the other changes. If you feel comfortable with the downstream dataset chunk of this or anything else, feel free to grab it. Otherwise I'll dig in to them in a bit.

bbrowning · 2024-07-25T14:39:24Z

For reference, the generated recipe yaml files now look like this:

datasets:
- path: instructlab/InstructLabCommunity
  sampling_size: 1.0
- path: node_datasets_2024-07-25T10_27_12/knowledge_tonsils_overview_e2e-tonsils_p10.jsonl
  sampling_size: 1.0
metadata:
  sys_prompt: "I am, Red Hat\xAE Instruct Model based on Granite 7B, an AI language\
    \ model developed by Red Hat and IBM Research, based on the Granite-7b-base language\
    \ model. My primary function is to be a chat assistant."

I've updated the original comment at the top of this PR to reflect the current state as well.

bbrowning · 2024-07-25T20:35:40Z

Assuming CI passes, I believe all concerns are addressed now.

@markmc added in support for loading of default recipe files from platformdirs.PlatformDirs, allowing downstream users to package default recipe yaml files in a place like /usr/share/instructlab/sdg/default_data_recipes/{knowledge,skills}.yaml. I added a unit test to ensure that if we find default recipe yaml files in those directories, they get loaded and used.

#201 was opened and expanded with more details as a result of the conversations in this PR to figure out how to properly get the precomputed dataset from HuggingFace into the upstream ilab experience in an easy-to-use way. This may involve a new command, such as ilab data download, but discussion around that design can happen in that issue.

This PR no longer includes downloading anything from HuggingFace, as several valid concerns were brought up about doing that silently and without explicit user intent.

Empty Recipe()s can be created, and we have unit tests ensuring that works along with a few other permutations.

We do not load sys_prompt from the recipe files, but do write it out to them under a new metadata section so that it's still obvious which system prompt was used during data mixing without having to look inside the data files themselves.

mergify · 2024-07-25T21:05:17Z

This pull request has merge conflicts that must be resolved before it can be
merged. @bbrowning please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

src/instructlab/sdg/datamixing.py

markmc · 2024-07-25T21:24:35Z

src/instructlab/sdg/datamixing.py

+            # Mixed data is written out relative to a recipe's own
+            # path, so update the recipe's path to the newly written
+            # one before writing out its datasets
+            recipe.recipe_path = full_recipe_path


Maybe I'm being dumb, but I really don't understand what's going on here

We've just written a recipe file to a path, and setting recipe.recipe_path is going to influence where the dataset gets written? But save_mixed_dataset() doesn't use recipe_path?

Thanks!

save_mixed_dataset() DOES use recipe_path, as it reads the datasets in from disk to mix them. I wasn't very happy with this when I wrote it, and I should have recognized the code smell myself when I wrote that comment explaining what it's doing. What happens is we loaded a default recipe from whatever path - say /default_recipes/skills.yaml (or instantiated an empty Recipe(), perhaps). Then, we save the recipe to a new path - like /generated_data/skills_train_recipe.yaml. At this point, the actual Recipe instance we have still thinks its recipe_path is /default_recipes/skills.yaml, even though we wrote it out to a different path. The fix staring me in the face as I type this is that when someone calls save_recipe, that should update the Recipe's recipe_path to the new path it just wrote itself out to.

The datasets are added to the recipe with a relative path, the recipe is written to a new path, and we then need to update the recipe to know its new path so that it can successfully load the datasets from the relative path, which is relative to the newly written location versus the original location.

I pushed one more commit that adjusts this that feels better for me, and keeps this detail encapsulated within the Recipe class instead of leaking out to consumers.

markmc

Looks good to me, thanks @bbrowning

Happy for you to merge this when CI passes

bbrowning · 2024-07-25T21:44:14Z

Thanks! Will squash commits, unmark from Draft, and merge when CI is good.

This introduces Recipe yaml files, which are used both as an input into the data mixing process and as an output of the process. As an input, we have some default recipe files that specify any precomputed datasets that should be mixed with data from new skills when generating the overall mix of samples that will be sent to the training process. If a downstream user/packager wants to add default recipes (and datasets), they should install them to a path like `/usr/share/instructlab/sdg` (varies by platform, uses Python's `platformdirs.PlatformDirs` to respect platform conventions). Recipes should be in sdg/default_data_recipes/{knowledge,skills}.yaml Datasets should be in sdg/datasets but this location is not enforced. Currently we are not shipping any default recipe files in the upstream, but there is a unit test in place to ensure the functionality to load default recipes from disk works once we decide how we want to ship a precomputed dataset to our upstream users. As an output of the data generation process, we write recipe yamls to document which datasets were mixed together and in what proportions along with the system prompt that was used during the generation. Here's an example of a recipe yaml put into the output directory after running data generation: ```yaml datasets: - path: node_datasets_2024-07-25T17_49_46/knowledge_tonsils_overview_e2e-tonsils_p10.jsonl sampling_size: 1.0 metadata: sys_prompt: "I am, Red Hat\xAE Instruct Model based on Granite 7B, an AI language\ \ model developed by Red Hat and IBM Research, based on the Granite-7b-base language\ \ model. My primary function is to be a chat assistant." ``` Datasets may be referenced by relative paths, which are relative to the recipe's own directory. Or, they may use absolute filesystem paths. Anything written out under the metadata section (currently just sys_prompt) is purely informational for the user and ignored when loading recipes. Parts of this are extracted and rebased from aakankshaduggal#4 aakankshaduggal#20 Refs instructlab#162, instructlab#171, instructlab#185, instructlab#201. Co-authored-by: shivchander <[email protected]> Co-authored-by: Khaled Sulayman <[email protected]> Co-authored-by: abhi1092 <[email protected]> Co-authored-by: Aakanksha Duggal <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Signed-off-by: Ben Browning <[email protected]>

aakankshaduggal approved these changes Jul 24, 2024

View reviewed changes

markmc requested changes Jul 24, 2024

View reviewed changes

markmc reviewed Jul 24, 2024

View reviewed changes

src/instructlab/sdg/datamixing.py Outdated Show resolved Hide resolved

bbrowning marked this pull request as draft July 25, 2024 03:28

mergify bot added the testing Relates to testing label Jul 25, 2024

markmc added this to the 0.2.1 milestone Jul 25, 2024

markmc reviewed Jul 25, 2024

View reviewed changes

src/instructlab/sdg/configs/skills/data_recipe/default_recipe.yaml Outdated Show resolved Hide resolved

mergify bot added the ci-failure label Jul 25, 2024

mergify bot removed the ci-failure label Jul 25, 2024

mergify bot added ci-failure and removed ci-failure labels Jul 25, 2024

markmc force-pushed the write-recipe-yamls branch 2 times, most recently from e6bdfc2 to cc2aa1d Compare July 25, 2024 15:58

mergify bot added the ci-failure label Jul 25, 2024

markmc force-pushed the write-recipe-yamls branch from cc2aa1d to 4e43bad Compare July 25, 2024 16:03

mergify bot removed the ci-failure label Jul 25, 2024

This was referenced Jul 25, 2024

Pull taxonomy precomputed dataset from hugging face #201

Open

[Epic] Support for mixing generated datasets before training #162

Closed

mergify bot added the needs-rebase label Jul 25, 2024

markmc reviewed Jul 25, 2024

View reviewed changes

src/instructlab/sdg/datamixing.py Outdated Show resolved Hide resolved

bbrowning force-pushed the write-recipe-yamls branch from 9f8bcee to 9c4f64f Compare July 25, 2024 21:23

mergify bot removed the needs-rebase label Jul 25, 2024

markmc reviewed Jul 25, 2024

View reviewed changes

markmc approved these changes Jul 25, 2024

View reviewed changes

bbrowning force-pushed the write-recipe-yamls branch from 6ab40d8 to fcfacfc Compare July 25, 2024 22:06

bbrowning marked this pull request as ready for review July 25, 2024 22:06

bbrowning merged commit 7087621 into instructlab:main Jul 26, 2024
11 checks passed

bbrowning deleted the write-recipe-yamls branch July 26, 2024 01:11

bbrowning mentioned this pull request Jul 26, 2024

Write Recipe files during data mixing #185

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce data mixing recipe yaml files #203

Introduce data mixing recipe yaml files #203

bbrowning commented Jul 24, 2024 •

edited

Loading

aakankshaduggal left a comment

markmc left a comment

bbrowning commented Jul 25, 2024

markmc commented Jul 25, 2024

bbrowning commented Jul 25, 2024

bbrowning commented Jul 25, 2024

bbrowning commented Jul 25, 2024

mergify bot commented Jul 25, 2024

markmc Jul 25, 2024

bbrowning Jul 25, 2024

bbrowning Jul 25, 2024

markmc left a comment

bbrowning commented Jul 25, 2024

Introduce data mixing recipe yaml files #203

Introduce data mixing recipe yaml files #203

Conversation

bbrowning commented Jul 24, 2024 • edited Loading

aakankshaduggal left a comment

Choose a reason for hiding this comment

markmc left a comment

Choose a reason for hiding this comment

bbrowning commented Jul 25, 2024

markmc commented Jul 25, 2024

bbrowning commented Jul 25, 2024

bbrowning commented Jul 25, 2024

bbrowning commented Jul 25, 2024

mergify bot commented Jul 25, 2024

markmc Jul 25, 2024

Choose a reason for hiding this comment

bbrowning Jul 25, 2024

Choose a reason for hiding this comment

bbrowning Jul 25, 2024

Choose a reason for hiding this comment

markmc left a comment

Choose a reason for hiding this comment

bbrowning commented Jul 25, 2024

bbrowning commented Jul 24, 2024 •

edited

Loading