Use knowledge data for phase 1 training and skills data for phase 2 training in standalone script #113

MichaelClifford · 2024-10-18T17:17:39Z

Closes #86

Updates data_processing_op() to output two datasets, knowledge_processed_data and skills_processed_data, instead of a single processed_data output. These new datasets are stored in .../processed_data/knowledge and .../processed_data/skills, respectively.
This change requires that the SDG data contain both a knowledge_train_msgs*.jsonl file and a skills_train_msgs*.jsonl file. It should, but data processing may fail if a small PR was used to generate the SDG data.
Updates the PYTORCH_TRAINING_JOB template in standalone.tpl to use knowledge_processed_data for phase 1 training and skills_processed_data for phase 2.
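
The file requirement and the phase-to-dataset pairing described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: find_sdg_files(), processed_data_dir(), and the base_dir parameter are hypothetical names, and only the two glob patterns and the knowledge/skills directory names come from the description.

```python
from pathlib import Path

# Glob patterns taken from the PR description; the dict itself is illustrative.
REQUIRED_GLOBS = {
    "knowledge": "knowledge_train_msgs*.jsonl",
    "skills": "skills_train_msgs*.jsonl",
}

# Phase-to-dataset pairing described in the PR: knowledge first, then skills.
PHASE_DATASET = {1: "knowledge", 2: "skills"}


def find_sdg_files(sdg_dir):
    """Hypothetical pre-flight check: return {kind: [paths]} for the SDG
    output directory, or raise if either kind of file is missing."""
    sdg_dir = Path(sdg_dir)
    found = {}
    missing = []
    for kind, pattern in REQUIRED_GLOBS.items():
        matches = sorted(sdg_dir.glob(pattern))
        if matches:
            found[kind] = matches
        else:
            missing.append(pattern)
    if missing:
        raise FileNotFoundError(
            f"SDG output in {sdg_dir} is missing: {', '.join(missing)}"
        )
    return found


def processed_data_dir(base_dir, phase):
    """Hypothetical helper: directory holding the processed data for a phase.
    base_dir stands in for the elided path prefix, which is an assumption."""
    return Path(base_dir) / "processed_data" / PHASE_DATASET[phase]
```

Running a check like this before data processing would surface the small-PR case early, with a message naming the missing pattern, rather than failing partway through training.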

Note: This PR does not add this feature to the KFP. I still need to update pytorchJob_manifest_op() to reflect this change. However, that function has fallen a bit behind the standalone version at this point, so I will bring it back to parity in a later PR.

Signed-off-by: Michael Clifford <[email protected]>

sallyom

This code looks clean, so let's merge it, with the caveat that it hasn't been tested in the pipeline, so there may be a follow-up PR to fix the pipeline. As long as this has been tested and proven to work with the standalone script, I'm good with merging this!
LGTM

use knowledge data for phase 1 training and skills data for phase 2

a1015c7

Signed-off-by: Michael Clifford <[email protected]>

MichaelClifford requested review from leseb, cooktheryan and sallyom October 18, 2024 17:17

sallyom approved these changes Oct 18, 2024


sallyom merged commit 348c920 into opendatahub-io:main Oct 18, 2024
1 check passed

leseb mentioned this pull request Nov 5, 2024

Update Training steps to include appropriate dataset #49

Closed
