[Epic] Support for mixing generated datasets before training #162

Closed
17 of 19 tasks
bbrowning opened this issue Jul 18, 2024 · 5 comments

@bbrowning
Contributor

bbrowning commented Jul 18, 2024

Overview

The research team that developed InstructLab's processes has determined that we need a way to mix generated datasets before training. This is necessary to get the best results we can when adding knowledge to a model.

This issue tracks the work across the SDG and other repos required to implement this change.

instructlab/sdg repository

0.2.0 milestone: https://github.com/instructlab/sdg/milestone/4

In-progress PR at #163

  • cherry-pick only the data-mixing commits from aakankshaduggal/sdg#4 ("Add data mixing") on top of instructlab/sdg main (some unrelated changes to the knowledge schema and other bits snuck into that PR)
  • determine if batching and parallel generation is a hard prerequisite for data mixing or can be done separately - if separate, create an issue to track merging batching and parallel generation
  • determine if caching is a hard prerequisite for data mixing or can be done separately - if separate, create an issue to track merging caching
  • fix unresolved placeholder value in src/instructlab/sdg/configs/skills/data_recipe/default_recipe.yaml causing <path_to_dataset> to actually get used as a path when attempting skill data generation
    • This depends on a precomputed dataset being released somewhere (such as HuggingFace), and on the code being modified to download and reference that dataset from its coordinates listed in default_recipe.yaml. For now I just removed the placeholder (so skills are running with no precomputed data added in), and opened #171 ("Add precomputed dataset to skills data generation") to track figuring this out.
  • Update data mixing code to work with "simple" pipeline or, if not possible, discuss fallout if it only works in the "full" pipeline
    • I was able to get this working with the simple pipeline by utilizing the _get_question and _get_response helpers.
  • Fix build_raft_dataset in parse_and_convert.py to not infinitely loop when working with a small dataset, such as 1-2 pieces of generated data that we'll encounter in the "simple" CI pipeline or with users testing locally with very small numbers of instructions.
  • Fix bug in generate_data.py where if a knowledge taxonomy leaf gets generated first, it treats all subsequent taxonomy leaves as knowledge even though they may be skills, which blows up
  • Ensure legacy train continues to work by continuing to produce train_*.jsonl and test_*.jsonl files.
  • ensure e2e tests pass with new data mixing code in CI
  • confirm with instructlab/training that data mixing output format and content matches expectations for training's input
  • remove trailing whitespace, unused imports, dead code, typos
  • squash, reorder, reword, general clean up of existing commits plus new fixes
  • ensure correct DCO and co-authorship on all commits, attributing original authors but signed off by me on any modified commits
  • Create a follow-up PR to write out recipe yaml files - tracked at #185 ("Write Recipe files during data mixing")
  • Create a follow-up PR to remove legacy train/messages jsonl formats, once instructlab/instructlab can work with only the new formats.
  • Create a follow-up PR to add in precomputed datasets (partially done, see below)
  • Create a follow-up PR to add auxiliary datasets
  • Create a follow-up PR to fix the "duplicate context issue" - tracked at #200 ("data mixing - Fix duplicate context issue by taking set of all context, using sampling without replacement, and comparing text directly instead of row_idx")
  • manually verify data generation is properly mixing data after all of the above
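
To illustrate two of the items above (the small-dataset loop in build_raft_dataset and the duplicate context fix), here is a minimal sketch of picking extra contexts without replacement. It is not the actual build_raft_dataset code: `pick_distractor_contexts` and its parameters are hypothetical names, and it assumes the goal is to pick contexts other than the sample's own, compared by text rather than by row index.

```python
import random


def pick_distractor_contexts(all_contexts, own_context, num_distractors, rng=None):
    """Pick up to `num_distractors` contexts that differ from `own_context`.

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    # Deduplicate by comparing text directly (not row indexes) and drop the
    # sample's own context. dict.fromkeys keeps first-seen order.
    candidates = [c for c in dict.fromkeys(all_contexts) if c != own_context]
    # Cap the request at what is actually available, so a dataset with only
    # 1-2 rows returns fewer contexts instead of looping while it waits for
    # distinct ones that do not exist.
    k = min(num_distractors, len(candidates))
    # random.sample draws without replacement, so no context is repeated.
    return rng.sample(candidates, k)
```

With a single-row dataset this returns an empty list rather than retrying, which is the behavior the "simple" CI pipeline needs.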
@bbrowning
Contributor Author

This is related to #95, but I felt it warranted its own issue. This issue is mostly about taking the data mixing implementation done in another fork and getting it ready to merge back into this repo. The other epic mostly tracks the actual implementation of data mixing, which is likely already done, for some value of done.

@shivchander
Member

> determine if aakankshaduggal#6 is a hard prerequisite for data mixing or can be done separately - if separate, create an issue to track merging batching and parallel generation
> determine if aakankshaduggal#9 is a hard prerequisite for data mixing or can be done separately - if separate, create an issue to track merging caching

Yep these can come after mixing

@bbrowning
Contributor Author

Great, thanks @shivchander for that confirmation. I created separate issues to track batching/parallel (#167) and caching (#168), and updated the description above to link to those.

@bbrowning
Contributor Author

Added some additional items in the issue description where changes may be needed in instructlab/instructlab and/or instructlab/training to handle the new data-mixed filenames, or we may need to output filenames that are compatible with the existing prefix standard of train_*. I'm not sure which way to proceed there yet, but will track that down.

@bbrowning
Contributor Author

After discussion with others offline, I took the approach of outputting additional files in the legacy train/test jsonl formats expected by the legacy Linux training code in ilab. This gets the e2e CI job passing now. I've also tested manual generate/train workflows using the simple pipeline with legacy training, but have not yet verified that the full pipeline or the new training works with these changes.
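
As a rough sketch of what "outputting additional files in the legacy formats" could look like: the helper below writes `train_<suffix>.jsonl` and `test_<suffix>.jsonl` next to the mixed output. The function name, the `suffix` argument, and the record fields in the usage example are assumptions for illustration, not the actual code or schema.

```python
import json
from pathlib import Path


def write_legacy_files(train_samples, test_samples, output_dir, suffix):
    """Write train_<suffix>.jsonl and test_<suffix>.jsonl, one JSON object per line.

    Hypothetical helper; `suffix` stands in for whatever the real file
    naming scheme puts after the train_/test_ prefix.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for prefix, samples in (("train", train_samples), ("test", test_samples)):
        path = output_dir / f"{prefix}_{suffix}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for sample in samples:
                # JSON Lines: exactly one serialized record per line.
                f.write(json.dumps(sample, ensure_ascii=False) + "\n")
        written.append(path)
    return written
```

For example, `write_legacy_files([{"user": "q", "assistant": "a"}], [], "generated", "example")` would produce `generated/train_example.jsonl` and an empty `generated/test_example.jsonl`; the `user`/`assistant` keys here are placeholders.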

@markmc markmc added this to the 0.2.1 milestone Jul 25, 2024
bbrowning added a commit to bbrowning/instructlab-sdg that referenced this issue Jul 25, 2024
This introduces Recipe yaml files, which are used both as an input
into the data mixing process and as an output of the process.

As an input, we have some default recipe files that specify any
precomputed datasets that should be mixed with data from new skills
when generating the overall mix of samples that will be sent to the
training process.

If a downstream user/packager wants to add default recipes (and
datasets), they should install them to a path like
`/usr/share/instructlab/sdg` (varies by platform, uses Python's
`platformdirs.PlatformDirs` to respect platform conventions).

Recipes should be in sdg/default_data_recipes/{knowledge,skills}.yaml

Datasets should be in sdg/datasets but this location is not enforced.
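
A minimal sketch of how such a lookup could work, assuming the
application name passed to `platformdirs` is `instructlab` (an
assumption, as are the helper names `site_data_dirs` and
`find_default_recipe`):

```python
import os
from pathlib import Path

try:
    from platformdirs import PlatformDirs  # third-party package
except ImportError:
    PlatformDirs = None


def site_data_dirs(appname="instructlab"):
    """Return the platform's shared data directories for `appname`.

    The appname is an assumption for this sketch.
    """
    if PlatformDirs is None:
        return []
    dirs = PlatformDirs(appname=appname, multipath=True)
    # With multipath=True, site_data_dir may contain several paths joined
    # by os.pathsep (for example /usr/local/share/... and /usr/share/...).
    return [Path(p) for p in dirs.site_data_dir.split(os.pathsep)]


def find_default_recipe(data_dirs, kind):
    """Return the first existing sdg/default_data_recipes/<kind>.yaml, or None."""
    if kind not in ("knowledge", "skills"):
        raise ValueError(f"unknown recipe kind: {kind!r}")
    for base in data_dirs:
        candidate = Path(base) / "sdg" / "default_data_recipes" / f"{kind}.yaml"
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_default_recipe(site_data_dirs(), "skills")` returns `None`
when nothing is installed, which matches the upstream situation
described below.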

Currently we are not shipping any default recipe files in the upstream,
but there is a unit test in place to ensure the functionality to load
default recipes from disk works once we decide how we want to ship a
precomputed dataset to our upstream users.

As an output of the data generation process, we write recipe yamls to
document which datasets were mixed together and in what proportions
along with the system prompt that was used during the
generation. Here's an example of a recipe yaml put into the output
directory after running data generation:

```yaml
datasets:
- path: node_datasets_2024-07-25T17_49_46/knowledge_tonsils_overview_e2e-tonsils_p10.jsonl
  sampling_size: 1.0
metadata:
  sys_prompt: "I am, Red Hat\xAE Instruct Model based on Granite 7B, an AI language\
    \ model developed by Red Hat and IBM Research, based on the Granite-7b-base language\
    \ model. My primary function is to be a chat assistant."
```

Datasets may be referenced by relative paths, which are relative to the
recipe's own directory. Or, they may use absolute filesystem paths.

Anything written out under the metadata section (currently just
sys_prompt) is purely informational for the user and ignored when
loading recipes.
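
The path and metadata rules above can be sketched as follows. This is
an illustration rather than the actual loader: `resolve_dataset_paths`
is a hypothetical name, and it takes the already-parsed recipe mapping
(for example the result of `yaml.safe_load`) so the sketch needs only
the standard library.

```python
from pathlib import Path


def resolve_dataset_paths(recipe, recipe_file):
    """Return (path, sampling_size) pairs for a parsed recipe mapping.

    Hypothetical helper. Relative dataset paths are resolved against the
    directory that contains the recipe file; absolute paths are kept.
    """
    recipe_dir = Path(recipe_file).parent
    resolved = []
    # Only the `datasets` key is read; anything under `metadata` (such as
    # sys_prompt) is informational and deliberately ignored.
    for entry in recipe.get("datasets") or []:
        path = Path(entry["path"])
        if not path.is_absolute():
            path = recipe_dir / path
        resolved.append((path, entry["sampling_size"]))
    return resolved
```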

Parts of this are extracted and rebased from
aakankshaduggal#4
aakankshaduggal#20

Refs instructlab#162, instructlab#171, instructlab#185, instructlab#201.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
@markmc markmc mentioned this issue Jul 25, 2024
2 tasks
bbrowning added a commit to bbrowning/instructlab-sdg that referenced this issue Jul 26, 2024
This adds support for generating auxiliary datasets during knowledge
data generation. An auxiliary dataset is where we ask the model to
generate some additional data samples with a different prompt than the
standard dataset, along with some extra instruction prompts that will
get matched to the auxiliary generated samples and used during
training.

Parts of this are extracted and rebased from
aakankshaduggal#4
aakankshaduggal#21

Refs instructlab#162.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
@markmc markmc modified the milestones: 0.2.1, 0.2.2, 0.2.3 Jul 26, 2024
markmc pushed a commit to bbrowning/instructlab-sdg that referenced this issue Jul 29, 2024
bbrowning added a commit to bbrowning/instructlab-sdg that referenced this issue Jul 29, 2024
This adds support for generating auxiliary datasets during knowledge
data generation. An auxiliary dataset is where we ask the model to
generate some additional data samples with a different prompt than the
standard dataset, along with some extra instruction prompts that will
get matched to the auxiliary generated samples and used during
training.

The auxiliary instructions are a new part of the pipeline config, as
they are tightly coupled to it. In the example below, note that the
`spellcheck` value has to match between the `var_cols` of the pipeline
block and the keys of the new auxiliary instructions, so we just list
both in the same config file.

    version: "1.0"
    blocks:
    ...
      - name: flatten_auxiliary_columns
        type: FlattenColumnsBlock
        config:
          var_cols:
            - spellcheck
            - base_document
          value_name: corrected_document
          var_name: dataset_type
    ...
    datamixing:
      auxiliary_instructions:
        spellcheck:
          - Correct any spelling errors in the document and output the corrected version.
          - Rewrite the document to remove any spelling errors.
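
To make the coupling concrete, here is a small sketch of how the
flattened `dataset_type` value could be used to look up an auxiliary
instruction. It assumes the flatten step behaves like a "melt" (one
output row per entry in `var_cols`); the helper names and the
`instruction` key are hypothetical, not the actual implementation.

```python
import random

# Mirrors the datamixing.auxiliary_instructions section of the config above.
AUXILIARY_INSTRUCTIONS = {
    "spellcheck": [
        "Correct any spelling errors in the document and output the corrected version.",
        "Rewrite the document to remove any spelling errors.",
    ]
}


def flatten_columns(row, var_cols, value_name, var_name):
    """Turn one wide row into one row per column in `var_cols` (a melt)."""
    kept = {k: v for k, v in row.items() if k not in var_cols}
    return [{**kept, var_name: col, value_name: row[col]} for col in var_cols]


def attach_instruction(sample, instructions, rng, var_name="dataset_type"):
    """Add an instruction to samples whose dataset type has auxiliary instructions."""
    choices = instructions.get(sample[var_name])
    if not choices:
        # e.g. base_document rows have no entry, so they pass through unchanged
        return sample
    return {**sample, "instruction": rng.choice(choices)}
```

If the config key were spelled differently from the `var_cols` entry,
the lookup would find nothing and those samples would silently get no
instruction, which is why the two names have to match.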

Parts of this are extracted and rebased from
aakankshaduggal#4
aakankshaduggal#21

Refs instructlab#162.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
@markmc markmc closed this as completed Jul 29, 2024