[Epic] Support for mixing generated datasets before training #162

bbrowning · 2024-07-18T00:57:05Z

Overview

The research team that developed InstructLab's processes has determined that we need a way to mix generated datasets before training. This is necessary to get the best results we can when adding knowledge to a model.

This issue tracks the work across the SDG and other repos required to implement this change.

instructlab/sdg repository

0.2.0 milestone: https://github.com/instructlab/sdg/milestone/4

In-progress PR at #163

The text was updated successfully, but these errors were encountered:

bbrowning · 2024-07-18T01:27:02Z

This is related to #95, although I felt it warranted its own issue here as this is mostly about taking the data mixing implementation done in another fork and getting it ready to merge back into this repo, while the other epic is mostly about tracking the actual implementation of data mixing and is likely already done, for some value of done.

shivchander · 2024-07-18T01:28:20Z

determine if aakankshaduggal#6 is a hard prerequisite for data mixing or can be done separately - if separate, create an issue to track merging batching and parallel generation
determine if aakankshaduggal#9 is a hard prerequisite for data mixing or can be done separately - if separate, create an issue to track merging caching

Yep these can come after mixing

bbrowning · 2024-07-18T12:39:13Z

Great, thanks @shivchander for that confirmation. I created separate issues to track batching/parallel (#167) and caching (#168), and updated the description above to link to those.

bbrowning · 2024-07-18T17:54:02Z

Added some additional items in the issue description where changes may be needed in instructlab/instructlab and/or instructlab/training to handle the new data-mixed filenames, or we may need to output filenames that are compatible with the existing prefix standard of train_*. I'm not sure which way to proceed there yet, but will track that down.

bbrowning · 2024-07-18T21:54:15Z

After discussion with others offline, I took the approach of outputting additional files in the legacy train/test jsonl formats expected by the legacy Linux training code in ilab. This gets the e2e CI job passing now. I've also tested manual generate/train workflows using the simple pipeline with legacy training, but have not yet verified full pipeline or new training work here.

This introduces Recipe yaml files, which are used both as an input into the data mixing process and as an output of the process. As an input, we have some default recipe files that specify any precomputed datasets that should be mixed with data from new skills when generating the overall mix of samples that will be sent to the training process. If a downstream user/packager wants to add default recipes (and datasets), they should install them to a path like `/usr/share/instructlab/sdg` (varies by platform, uses Python's `platformdirs.PlatformDirs` to respect platform conventions). Recipes should be in sdg/default_data_recipes/{knowledge,skills}.yaml Datasets should be in sdg/datasets but this location is not enforced. Currently we are not shipping any default recipe files in the upstream, but there is a unit test in place to ensure the functionality to load default recipes from disk works once we decide how we want to ship a precomputed dataset to our upstream users. As an output of the data generation process, we write recipe yamls to document which datasets were mixed together and in what proportions along with the system prompt that was used during the generation. Here's an example of a recipe yaml put into the output directory after running data generation: ```yaml datasets: - path: node_datasets_2024-07-25T17_49_46/knowledge_tonsils_overview_e2e-tonsils_p10.jsonl sampling_size: 1.0 metadata: sys_prompt: "I am, Red Hat\xAE Instruct Model based on Granite 7B, an AI language\ \ model developed by Red Hat and IBM Research, based on the Granite-7b-base language\ \ model. My primary function is to be a chat assistant." ``` Datasets may be referenced by relative paths, which are relative to the recipe's own directory. Or, they may use absolute filesystem paths. Anything written out under the metadata section (currently just sys_prompt) is purely informational for the user and ignored when loading recipes. Parts of this are extracted and rebased from aakankshaduggal#4 aakankshaduggal#20 Refs instructlab#162, instructlab#171, instructlab#185, instructlab#201. Co-authored-by: shivchander <[email protected]> Co-authored-by: Khaled Sulayman <[email protected]> Co-authored-by: abhi1092 <[email protected]> Co-authored-by: Aakanksha Duggal <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Signed-off-by: Ben Browning <[email protected]>

This adds support for generating auxiliary datasets during knowledge data generation. An auxiliary dataset is where we ask the model to generate some additional data samples with a different prompt than the standard dataset, along with some extra instruction prompts that will get matched to the auxiliary generated samples and used during training. Parts of this are extracted and rebased from aakankshaduggal#4 aakankshaduggal#21 Refs instructlab#162. Co-authored-by: shivchander <[email protected]> Co-authored-by: Khaled Sulayman <[email protected]> Co-authored-by: abhi1092 <[email protected]> Co-authored-by: Aakanksha Duggal <[email protected]> Signed-off-by: Ben Browning <[email protected]>

This adds support for generating auxiliary datasets during knowledge data generation. An auxiliary dataset is where we ask the model to generate some additional data samples with a different prompt than the standard dataset, along with some extra instruction prompts that will get matched to the auxiliary generated samples and used during training. The auxiliary instructions are a new part of the pipeline config, as they are tightly coupled to the pipeline config. An example, where you'll note the `spellcheck` value from the pipeline config has to match across both the pipeline config and the new auxiliary instructions, so we just list both in the same config file. version: "1.0" blocks: ... - name: flatten_auxiliary_columns type: FlattenColumnsBlock config: var_cols: - spellcheck - base_document value_name: corrected_document var_name: dataset_type ... datamixing: auxiliary_instructions: spellcheck: - Correct any spelling errors in the document and output the corrected version. - Rewrite the document to remove any spelling errors. Parts of this are extracted and rebased from aakankshaduggal#4 aakankshaduggal#21 Refs instructlab#162. Co-authored-by: shivchander <[email protected]> Co-authored-by: Khaled Sulayman <[email protected]> Co-authored-by: abhi1092 <[email protected]> Co-authored-by: Aakanksha Duggal <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Signed-off-by: Ben Browning <[email protected]>

This was referenced Jul 23, 2024

Write Recipe files during data mixing #185

Closed

Introduce a way to mix generated datasets before sending to training #163

Merged

bbrowning mentioned this issue Jul 24, 2024

Introduce data mixing recipe yaml files #203

Merged

bbrowning mentioned this issue Jul 25, 2024

Add support for auxiliary dataset generation #204

Merged

markmc added this to the 0.2.1 milestone Jul 25, 2024

markmc mentioned this issue Jul 25, 2024

[EPIC] Data mixing #95

Closed

2 tasks

markmc modified the milestones: 0.2.1, 0.2.2, 0.2.3 Jul 26, 2024

markmc closed this as completed Jul 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Epic] Support for mixing generated datasets before training #162

[Epic] Support for mixing generated datasets before training #162

bbrowning commented Jul 18, 2024 •

edited by markmc

Loading

bbrowning commented Jul 18, 2024

shivchander commented Jul 18, 2024

bbrowning commented Jul 18, 2024

bbrowning commented Jul 18, 2024

bbrowning commented Jul 18, 2024

[Epic] Support for mixing generated datasets before training #162

[Epic] Support for mixing generated datasets before training #162

Comments

bbrowning commented Jul 18, 2024 • edited by markmc Loading

Overview

instructlab/sdg repository

bbrowning commented Jul 18, 2024

shivchander commented Jul 18, 2024

bbrowning commented Jul 18, 2024

bbrowning commented Jul 18, 2024

bbrowning commented Jul 18, 2024

bbrowning commented Jul 18, 2024 •

edited by markmc

Loading