[Epic] Support for v3 schema of knowledge taxonomy additions #160

russellb · 2024-07-17T13:27:05Z

Overview

The research team that developed InstructLab's processes has determined that we need to change how we receive and process knowledge contributions. This change is necessary to get the best results we know how to achieve when adding knowledge to a model.

This issue tracks the work across both the SDG repo and other repos necessary to implement this change.

Note

We are not attempting to retain backward compatibility for knowledge using schema versions v1 or v2. Doing so would be more complicated and would produce lower-quality results, so the effort does not seem worthwhile.

Tasks

instructlab/taxonomy#1253

Update the docs to explain the new format
Devise an approach for handling open knowledge PRs that need to be reworked for the updated schema. The changes are not scriptable.
Audit the docs for knowledge contribution details that are affected by this change

Other Notes

instructlab/schema repository

Sample of new format: https://github.com/instructlab/taxonomy/blob/7729fcd62ca68e36225a98a954e702734cc09ae1/knowledge/science/anatomy/tonsils/qna.yaml

An updated example that is considered valid under the proposed taxonomy schema: https://github.com/russellb/taxonomy/tree/v3-example/knowledge/tonsils

Related work (non-blocking):

We have taxonomy parsing handling in too many places. Let's use this opportunity to complete the move to instructlab-schema.
- Taxonomy reading API schema#33

instructlab/sdg repository

0.2.0 milestone: https://github.com/instructlab/sdg/milestone/4

generate_data() changes, and other code it depends on
Will need work in src.instructlab.sdg.utils.taxonomy

Summary of changes implemented in this fork for preparing pipeline input: https://github.com/aakankshaduggal/sdg

Currently we do this:

Document gets transformed to chunks
For each chunk we attach 3 questions from the seed examples using a sliding window approach: if there are more than 3 seed Q&A pairs, we first attach the first 3 to the chunk, then move the window by 1 seed example and attach the next 3 to the same chunk

New change:

Document gets transformed to chunks
For each chunk we simply iterate through the contexts + Q&A pairs (from the seed examples) and make a call to the knowledge pipeline
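
To make the difference concrete, here is a minimal sketch of the two ways of building pipeline input rows. It is not the actual code in `src.instructlab.sdg.utils.taxonomy`: the function names, the row keys (`document`, `context`, `qa`), and the seed example keys (`context`, `questions_and_answers`) are assumptions chosen for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def sliding_window_rows(chunks: List[str], qa_pairs: List[Dict[str, str]],
                        window: int = 3) -> List[Row]:
    """Old approach (sketch): attach every window of `window` consecutive
    seed Q&A pairs to each chunk, moving the window by one pair at a time."""
    rows: List[Row] = []
    # With fewer pairs than the window size we still emit one (short) window.
    num_windows = max(len(qa_pairs) - window + 1, 1)
    for chunk in chunks:
        for start in range(num_windows):
            rows.append({"document": chunk,
                         "qa": qa_pairs[start:start + window]})
    return rows


def per_seed_example_rows(chunks: List[str],
                          seed_examples: List[Dict[str, Any]]) -> List[Row]:
    """New approach (sketch): one row per (chunk, seed example), keeping each
    seed example's context together with its own Q&A pairs."""
    rows: List[Row] = []
    for chunk in chunks:
        for seed in seed_examples:
            rows.append({
                "document": chunk,
                "context": seed["context"],          # hypothetical key name
                "qa": seed["questions_and_answers"],  # hypothetical key name
            })
    return rows
```

In this sketch, each row from `per_seed_example_rows` would correspond to one call to the knowledge pipeline, and a Q&A pair always travels with the context it was written for instead of being grouped with whichever neighbors fall in the same window.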

Related work (non-blocking):

Change to using taxonomy parsing code from instructlab-schema
- Use instructlab-schema package to parse qna.yaml files #62

instructlab/instructlab repo

Related work (non-blocking):

Make use of instructlab-schema taxonomy parsing code

Nothing uses the parsed taxonomy data that was returned by this code. There is a proposal to change the taxonomy schema. Instead of fixing this copy of this code, make it stop reading the contents aside from doing automated validation.

Rename these functions to `validate_taxonomy` instead of `read_taxonomy` since we only care about the validation part and not returning any data from them. If we look at changing this to use a new API, this helps clarify what functionality we actually care about (validation, not the data).

Related to instructlab/sdg#160

Signed-off-by: Russell Bryant <[email protected]>

For more information on the v3 schema, see this issue: instructlab#160

This change to the prompt does a couple of important things:
- Make use of document-specific context for the provided sample q&a.
- Add the new `document_outline` field which provides a summary of the document.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>

This is part of instructlab#160

The changes here originated from aakankshaduggal@5baf6df

There are two major changes here.
- When parsing a `qna.yaml` file from a taxonomy tree, adjust for the new schema for knowledge. There is no attempt to maintain compatibility with prior versions of the schema (v1, v2).
- Change how we translate the taxonomy data into the dataset sent into the pipeline as input. Instead of implementing a sliding window approach of 3 sample qna pairs at a time over all chunks of the document, we now create a row per seed_example (context and associated qna pairs) for each chunk of knowledge docs.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>

markmc · 2024-07-23T15:25:56Z

Closing as complete; I think instructlab/taxonomy#1253 captures the remaining known work.

markmc · 2024-07-23T15:26:08Z

Closing

russellb mentioned this issue Jul 17, 2024

Introduce v3 schema version to support new knowledge format instructlab/schema#38

Closed

russellb mentioned this issue Jul 17, 2024

utils: Drop reading of taxonomy data instructlab/instructlab#1767

Merged

russellb mentioned this issue Jul 17, 2024

Add v3 knowledge schema support #161

Merged

russellb added this to the 0.2.0 milestone Jul 17, 2024

russellb mentioned this issue Jul 22, 2024

Introduce v3 schema instructlab/schema#39

Merged

This was referenced Jul 23, 2024

Knowledge v3 support instructlab/instructlab#1790

Merged

Knowledge has an incompatible new v3 file format instructlab/taxonomy#1253

Closed

markmc closed this as completed Jul 23, 2024
