[Epic] Support for v3 schema of knowledge taxonomy additions #160

Closed
19 of 27 tasks
russellb opened this issue Jul 17, 2024 · 2 comments

@russellb
Member

russellb commented Jul 17, 2024

Overview

The research team that developed InstructLab's processes has determined that we need to change how we receive and process knowledge contributions. This is necessary to get the best results we know how to achieve when adding knowledge to a model.

This issue tracks the work across both the SDG repo and other repos necessary to implement this change.

Note

We are not attempting to retain backward compatibility for knowledge using schema versions v1 or v2. Doing so would be more complicated and would produce lower-quality results, so the effort does not seem worthwhile.

Tasks

instructlab/taxonomy#1253

  • Update the docs to explain the new format wherever it is documented
  • Devise an approach for dealing with open knowledge PRs that need to be reworked for the updated schema. The changes are not scriptable.
  • Audit the docs that describe knowledge contribution details for impact

Other Notes

instructlab/schema repository

Sample of new format: https://github.com/instructlab/taxonomy/blob/7729fcd62ca68e36225a98a954e702734cc09ae1/knowledge/science/anatomy/tonsils/qna.yaml

An updated example that is considered valid under the proposed taxonomy schema: https://github.com/russellb/taxonomy/tree/v3-example/knowledge/tonsils

Related work (non-blocking):

  • We handle taxonomy parsing in too many places. Let's use this opportunity to complete the move to instructlab-schema.

instructlab/sdg repository

0.2.0 milestone: https://github.com/instructlab/sdg/milestone/4

Changes to generate_data() and the other code it depends on.
This will need work in src.instructlab.sdg.utils.taxonomy.

Summary of changes implemented in this fork for preparing pipeline input: https://github.com/aakankshaduggal/sdg

Currently we do this:

  • The document is transformed into chunks.
  • For each chunk, we attach 3 questions from the seed examples using a sliding window. If there are more than 3 seed Q&A pairs, we first attach the first 3 to the chunk, then move the window forward by 1 seed example and attach that set to the same chunk.

New change:

  • The document is transformed into chunks.
  • For each chunk, we simply iterate through the contexts and Q&A pairs from the seed examples and make a call to the knowledge pipeline.
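
To make the difference concrete, here is a rough sketch of the two ways of building pipeline input. It is not the code from the SDG repo or the fork linked above: the function names, the row keys (`document`, `examples`, `context`, `qna`), and the handling of fewer than 3 seed pairs are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def sliding_window_rows(
    chunks: List[str], seed_qas: List[Dict[str, str]], window: int = 3
) -> List[Dict[str, Any]]:
    """Old approach (sketch): pair each chunk with every run of `window`
    consecutive seed Q&A pairs, moving the window by 1 each time."""
    rows = []
    # Assumption: with fewer seed pairs than `window`, use one short window.
    n_windows = max(1, len(seed_qas) - window + 1)
    for chunk in chunks:
        for start in range(n_windows):
            rows.append(
                {"document": chunk, "examples": seed_qas[start : start + window]}
            )
    return rows


def per_seed_example_rows(
    chunks: List[str], seed_examples: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """New approach (sketch): one row per (chunk, seed example), where each
    seed example carries its own context and associated Q&A pairs."""
    rows = []
    for chunk in chunks:
        for example in seed_examples:
            rows.append(
                {
                    "document": chunk,
                    "context": example["context"],
                    "qna": example["qna"],
                }
            )
    return rows
```

With 2 chunks and 5 seed Q&A pairs, the sliding-window sketch yields 3 windows per chunk; with 2 chunks and 2 seed examples, the new sketch yields one row per pairing of chunk and seed example.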

Related work (non-blocking):

instructlab/instructlab repo

Related work (non-blocking):

  • Make use of instructlab-schema taxonomy parsing code
russellb added a commit to russellb/instructlab that referenced this issue Jul 17, 2024
Nothing uses the parsed taxonomy data that was returned by this code.
There is a proposal to change the taxonomy schema. Instead of fixing
this copy of this code, make it stop reading the contents aside from
doing automated validation.

Rename these functions to `validate_taxonomy` instead of
`read_taxonomy` since we only care about the validation part and not
returning any data from them.

If we look at changing this to use a new API, this helps clarify what
functionality we actually care about (validation, not the data).

Related to instructlab/sdg#160

Signed-off-by: Russell Bryant <[email protected]>
@russellb russellb added this to the 0.2.0 milestone Jul 17, 2024
russellb added a commit to russellb/instructlab that referenced this issue Jul 18, 2024
russellb added a commit to russellb/sdg that referenced this issue Jul 22, 2024
For more information on the v3 schema, see this issue:

  instructlab#160

This change to the prompt does a couple of important things:

- Make use of document-specific context for the provided sample q&a.

- Add the new `document_outline` field which provides a summary of the
  document.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
russellb added a commit to russellb/sdg that referenced this issue Jul 22, 2024
This is part of instructlab#160

The changes here originated from aakankshaduggal@5baf6df

There are two major changes here.

- When parsing a `qna.yaml` file from a taxonomy tree, adjust for the
  new schema for knowledge. There is no attempt to maintain
  compatibility with prior versions of the schema (v1, v2).

- Change how we translate the taxonomy data into the dataset sent into
  the pipeline as input. Instead of implementing a sliding window
  approach of 3 sample qna pairs at a time over all chunks of the
  document, we now create a row per seed_example (context and
  associated qna pairs) for each chunk of knowledge docs.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
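
For readers following along, the two changes in the commit message above could look roughly like the sketch below. It is an illustration only, not the code from that commit: the function name `rows_from_v3_knowledge` is hypothetical, and the field names `version`, `seed_examples`, and `questions_and_answers` are assumptions about the v3 layout (only `document_outline` and the per-seed-example context with its qna pairs are named in this thread).

```python
from typing import Any, Dict, List


def rows_from_v3_knowledge(
    qna: Dict[str, Any], chunks: List[str]
) -> List[Dict[str, Any]]:
    """Turn an already-parsed v3-style knowledge qna.yaml (as a dict) into
    one row per seed example for each document chunk."""
    # No compatibility with v1/v2: anything else is rejected outright.
    version = qna.get("version")
    if version != 3:
        raise ValueError(f"unsupported knowledge schema version: {version!r}")

    outline = qna["document_outline"]
    rows = []
    for chunk in chunks:
        for seed in qna["seed_examples"]:
            rows.append(
                {
                    "document": chunk,
                    "document_outline": outline,
                    # Each seed example has its own context and Q&A pairs.
                    "context": seed["context"],
                    "questions_and_answers": seed["questions_and_answers"],
                }
            )
    return rows
```

Rejecting older versions up front, rather than trying to adapt them, matches the stated decision not to maintain compatibility with v1 or v2.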
@markmc
Contributor

markmc commented Jul 23, 2024

Closing as complete. I think instructlab/taxonomy#1253 captures the remaining known work.

@markmc
Contributor

markmc commented Jul 23, 2024

Closing

@markmc markmc closed this as completed Jul 23, 2024