[Epic] Support for v3 schema of knowledge taxonomy additions #160
russellb
added a commit
to russellb/instructlab
that referenced
this issue
Jul 17, 2024
Nothing uses the parsed taxonomy data that was returned by this code. There is a proposal to change the taxonomy schema. Instead of fixing this copy of this code, make it stop reading the contents aside from doing automated validation.

Rename these functions to `validate_taxonomy` instead of `read_taxonomy` since we only care about the validation part and not returning any data from them. If we look at changing this to use a new API, this helps clarify what functionality we actually care about (validation, not the data).

Related to instructlab/sdg#160

Signed-off-by: Russell Bryant <[email protected]>
russellb
added a commit
to russellb/instructlab
that referenced
this issue
Jul 18, 2024
Nothing uses the parsed taxonomy data that was returned by this code. There is a proposal to change the taxonomy schema. Instead of fixing this copy of this code, make it stop reading the contents aside from doing automated validation.

Rename these functions to `validate_taxonomy` instead of `read_taxonomy` since we only care about the validation part and not returning any data from them. If we look at changing this to use a new API, this helps clarify what functionality we actually care about (validation, not the data).

Related to instructlab/sdg#160

Signed-off-by: Russell Bryant <[email protected]>
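To illustrate the "validation, not the data" idea from the commit message above, here is a minimal sketch of a validation-only entry point. It is not the actual instructlab code: the signature, the `check_file` callback, and the `require_nonempty` helper are all hypothetical, and real schema validation is replaced by a toy check so the example stays self-contained.

```python
from pathlib import Path
from typing import Callable, List


def validate_taxonomy(root: str, check_file: Callable[[Path], List[str]]) -> int:
    """Walk a taxonomy tree and return the number of validation errors.

    Deliberately returns no parsed data: callers only learn whether the
    tree is valid. `check_file` is a hypothetical per-file validator that
    returns a list of error messages (empty when the file is valid).
    """
    error_count = 0
    for qna_path in sorted(Path(root).rglob("qna.yaml")):
        for message in check_file(qna_path):
            print(f"{qna_path}: {message}")
            error_count += 1
    return error_count


def require_nonempty(path: Path) -> List[str]:
    """Toy stand-in for real schema validation: flag empty files."""
    return [] if path.read_text().strip() else ["file is empty"]
```

Because the function returns only an error count, a later switch to a different validation API would not affect callers that never consumed the parsed contents.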
russellb
added a commit
to russellb/sdg
that referenced
this issue
Jul 22, 2024
For more information on the v3 schema, see this issue: instructlab#160

This change to the prompt does a couple of important things:

- Make use of document-specific context for the provided sample q&a.
- Add the new `document_outline` field which provides a summary of the document.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
russellb
added a commit
to russellb/sdg
that referenced
this issue
Jul 22, 2024
This is part of instructlab#160

The changes here originated from aakankshaduggal@5baf6df

There are two major changes here.

- When parsing a `qna.yaml` file from a taxonomy tree, adjust for the new schema for knowledge. There is no attempt to maintain compatibility with prior versions of the schema (v1, v2).
- Change how we translate the taxonomy data into the dataset sent into the pipeline as input. Instead of implementing a sliding window approach of 3 sample qna pairs at a time over all chunks of the document, we now create a row per seed_example (context and associated qna pairs) for each chunk of knowledge docs.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
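The "row per seed_example for each chunk" translation described in the commit message above could look roughly like the following sketch. It is an illustration only, not the code from the referenced commit: the function name `build_pipeline_rows` and the field names `document`, `context`, and `questions_and_answers` are assumptions chosen for the example.

```python
from typing import Dict, List


def build_pipeline_rows(seed_examples: List[Dict], chunks: List[str]) -> List[Dict]:
    """Pair every document chunk with every seed example.

    Each seed example is assumed (hypothetically) to be a dict holding a
    `context` string and a `questions_and_answers` list. The result has
    len(chunks) * len(seed_examples) rows.
    """
    rows = []
    for chunk in chunks:
        for example in seed_examples:
            rows.append(
                {
                    "document": chunk,
                    "context": example["context"],
                    "questions_and_answers": example["questions_and_answers"],
                }
            )
    return rows
```

Unlike a sliding window of 3 sample qna pairs, this keeps each context together with its own question and answer pairs, so a row never mixes pairs drawn from different contexts.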
This was referenced Jul 23, 2024
Closing as complete. I think instructlab/taxonomy#1253 captures the remaining known work.
Closing
Overview
The research team that developed InstructLab's processes has determined that we need to change how we receive and process knowledge contributions. This change is necessary to get the best results we currently know how to achieve when adding knowledge to a model.
This issue tracks the work across both the SDG repo and other repos necessary to implement this change.
Note
We are not attempting to retain backward compatibility for knowledge using schema versions v1 or v2. Doing so would be more complicated and would produce lower-quality results, so the effort does not seem worthwhile.
Tasks
instructlab-schema 0.3.0 -- https://github.com/instructlab/schema/milestone/1
instructlab-sdg 0.2.0 -- https://github.com/instructlab/sdg/milestone/4 (blocked on instructlab-schema 0.3.0)
instructlab 0.18.0 -- blocked on both of the above -- https://github.com/instructlab/instructlab/milestone/16
ilab: depend on instructlab-sdg>=0.2.0 (target for this epic) -- Knowledge v3 support instructlab#1790
instructlab-schema>=0.3.0
Remove the "Switch instructlab to PR 1790" e2e hack from the instructlab/sdg repo -- Remove temporary e2e hack to use knowledge v3 PR #187
instructlab/taxonomy and instructlab/community repositories
instructlab/taxonomy#1253
Other Notes
instructlab/schema repository
Sample of new format: https://github.com/instructlab/taxonomy/blob/7729fcd62ca68e36225a98a954e702734cc09ae1/knowledge/science/anatomy/tonsils/qna.yaml
An updated example that is considered valid under the proposed taxonomy schema: https://github.com/russellb/taxonomy/tree/v3-example/knowledge/tonsils
Related work (non-blocking):
instructlab/sdg repository
0.2.0 milestone: https://github.com/instructlab/sdg/milestone/4
generate_data() changes, and other code it depends on
Will need work in src.instructlab.sdg.utils.taxonomy
Summary of changes implemented in this fork for preparing pipeline input: https://github.com/aakankshaduggal/sdg
Related work (non-blocking):
instructlab/instructlab repo
Related work (non-blocking):