Use knowledge data for phase 1 training and skills data for phase 2 training in standalone script #113

Merged
1 commit merged into opendatahub-io:main on Oct 18, 2024

Conversation

MichaelClifford
Collaborator

Closes #86

  • Updates data_processing_op() to output two datasets, skills_processed_data and knowledge_processed_data, instead of a single processed_data output. These new datasets are stored in .../processed_data/knowledge and .../processed_data/skills respectively.

  • This change requires that the SDG data contain both a knowledge_train_msgs*.jsonl file and a skills_train_msgs*.jsonl file. It normally should, but the job may fail if a small PR is used to generate the SDG data and one of the files is missing.

  • Updates the PYTORCH_TRAINING_JOB template in standalone.tpl to use the knowledge_processed_data for phase 1 training and skills_processed_data for phase 2.
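The file requirement and the phase-to-dataset mapping described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper names `find_sdg_train_files` and `phase_data_dirs`, the `sdg_dir` and `processed_root` parameters, and the choice to pick the last match in lexicographic order are all assumptions made for the example.

```python
from pathlib import Path


def find_sdg_train_files(sdg_dir: str) -> dict[str, Path]:
    """Locate one knowledge and one skills training file in sdg_dir.

    Hypothetical helper, not the actual data_processing_op() from this PR.
    Raises FileNotFoundError if either kind of file is missing, which is
    the failure mode the description warns about for small SDG runs.
    """
    found: dict[str, Path] = {}
    for kind in ("knowledge", "skills"):
        matches = sorted(Path(sdg_dir).glob(f"{kind}_train_msgs*.jsonl"))
        if not matches:
            raise FileNotFoundError(
                f"no {kind}_train_msgs*.jsonl file found in {sdg_dir}"
            )
        # Assumption: if several files match, take the last one in
        # lexicographic order.
        found[kind] = matches[-1]
    return found


def phase_data_dirs(processed_root: str) -> dict[int, Path]:
    """Map each training phase to its processed dataset directory.

    processed_root stands in for the .../processed_data path, whose
    prefix is not given in the PR description.
    """
    root = Path(processed_root)
    return {1: root / "knowledge", 2: root / "skills"}
```

Failing early with an explicit message when one of the two files is absent makes the new requirement visible before any training job starts.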

Note: This PR does not add this feature to the KFP version; I still need to update pytorchJob_manifest_op() to reflect this change. That function has fallen a bit behind the standalone version at this point, so I will bring it back to parity in a later PR.

Collaborator

@sallyom sallyom left a comment

This code looks clean, so let's merge it, with the caveat that it hasn't been tested in the pipeline, so there may be a follow-up PR to fix the pipeline. As long as this has been tested and proven to work with the standalone script, I'm good with merging this!
LGTM

@sallyom sallyom merged commit 348c920 into opendatahub-io:main Oct 18, 2024
1 check passed
Development

Successfully merging this pull request may close these issues.

Add option to set the dataset for each training phase.
2 participants