How to use train and test split with the recipes? #2222

Open
7rabbit opened this issue Jan 1, 2025 · 1 comment
Labels: enhancement (New feature or request), triaged (This issue has been assigned an owner and appropriate label)

7rabbit commented Jan 1, 2025

Dear torchtune team,

With an SFT trainer we can pass
train_dataset=ds["train"],
eval_dataset=ds["validation"],
when the Hugging Face dataset already comes with splits.

I wonder how this is achieved in a fine-tuning recipe with an instruction dataset, particularly in a YAML configuration file. With the current example in the tutorial, split: train, it seems that the whole dataset is used for training. Should we prepare JSON/CSV files beforehand, already split into train/test/validation sets?
Thanks

joecummings added the enhancement and triaged labels on Jan 6, 2025
joecummings self-assigned this on Jan 6, 2025
joecummings (Contributor) commented

Hey @7rabbit - currently this isn't available through a YAML config. You're more than welcome to hack on our recipes to add this functionality, and we also have it on our roadmap to support early this year!

For now, if you only want to train on part of the dataset, you can preprocess it yourself, do the split "online" through a custom transform, or specify a percentage of the dataset to use, like so: "train[:50%]".
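
As a rough illustration of the "preprocess yourself" option, here is a minimal sketch that shuffles a list of instruction records and writes separate train.json and validation.json files. It uses only the Python standard library; the function names (split_records, write_split_files), the file names, and the "instruction"/"output" field names are assumptions made for this example and are not part of torchtune.

```python
import json
import random
from pathlib import Path


def split_records(records, val_fraction=0.1, seed=0):
    """Shuffle a copy of `records` and return (train, validation) lists.

    Hypothetical helper for illustration; not a torchtune API.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A seeded Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one record for validation.
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_split_files(records, out_dir, val_fraction=0.1, seed=0):
    """Write train.json and validation.json under `out_dir`; return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    train, val = split_records(records, val_fraction, seed)
    paths = {}
    for name, rows in (("train", train), ("validation", val)):
        path = out_dir / f"{name}.json"
        path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
        paths[name] = path
    return paths


if __name__ == "__main__":
    # Toy records; the field names are placeholders for your own schema.
    data = [{"instruction": f"question {i}", "output": f"answer {i}"} for i in range(20)]
    print(write_split_files(data, "splits", val_fraction=0.25, seed=1))
```

The training config would then point only at the train file, with the validation file kept aside for a separate evaluation step. If the data is loaded through Hugging Face datasets, the percentage syntax mentioned above can give a similar effect without writing files, for example "train[:90%]" for training and "train[90%:]" for held-out data, though that slices in the stored order rather than shuffling first.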
