Add a train field to training data #283

Open · hamishivi opened this issue Aug 23, 2024 · 2 comments
Labels: enhancement (New feature or request)

@hamishivi (Collaborator)
We may want to mask intermediate dialogue turns, e.g. if they arise due to the model making a mistake. I propose supporting this by adding a `train` field to the messages within a dataset instance. An instance could then look like:

 [{ "role": "user", "content": "some questions" },
{ "role": "assistant", "content": "some answer", "train": false },
{ "role": "user", "content": "oh, i see", "train": false },
{ "role": "assistant", "content": "some answer", "train": true }, ...]

When `train` is set, it overrides our original logic for deciding which turns to train on. When it is absent, we fall back to the old logic (train on all assistant turns only); see the sketch below. We can then additionally add a basic flag to the dataset mixer for when we want some higher-level behaviour applied automatically (e.g., `train_on_final_turn_only`).
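A minimal sketch of that override logic, assuming per-message dicts as above (the function name and exact plumbing are illustrative, not the repo's actual implementation):

```python
def turns_to_train_on(messages: list[dict]) -> list[bool]:
    """Return one flag per message: True if its tokens contribute to the loss."""
    mask = []
    for msg in messages:
        if "train" in msg:
            # Explicit per-turn override takes precedence.
            mask.append(bool(msg["train"]))
        else:
            # Fallback: old behaviour, train on assistant turns only.
            mask.append(msg["role"] == "assistant")
    return mask


messages = [
    {"role": "user", "content": "some questions"},
    {"role": "assistant", "content": "some answer", "train": False},
    {"role": "user", "content": "oh, i see", "train": False},
    {"role": "assistant", "content": "some answer", "train": True},
]
assert turns_to_train_on(messages) == [False, False, False, True]
```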

This gives us the flexibility to train on arbitrary turn combinations: users can preprocess the dataset however they want if they want to do something fancy (for example, using another model to judge whether turns are worth training on or not). A hypothetical preprocessor for the mixer flag is sketched below.
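For instance, a mixer-level flag like `train_on_final_turn_only` could be implemented purely as preprocessing that stamps the `train` field onto every message, so the masking code above needs no special cases (names here are assumptions for illustration):

```python
def apply_train_on_final_turn_only(messages: list[dict]) -> list[dict]:
    """Mark only the last assistant turn as trainable; everything else is masked."""
    last_assistant = max(
        (i for i, m in enumerate(messages) if m["role"] == "assistant"),
        default=None,  # no assistant turns: nothing gets trained on
    )
    return [{**m, "train": i == last_assistant} for i, m in enumerate(messages)]
```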

Let me know if this makes sense!

@hamishivi added the enhancement label on Aug 23, 2024
@vwxyzjn (Collaborator) commented Aug 23, 2024

Have there been any studies on whether masking out the prompt is helpful?

@hamishivi (Collaborator, Author) commented Aug 23, 2024

Nothing that screams we should go one way or the other:

  • QLoRA Table 10 suggests training on the prompt hurts marginally.
  • A more recent paper suggests training on the prompt (but, importantly, not on the special prompt tokens) can help, though it depends on the length of the inputs. They look specifically at the Tulu 2 dataset and find that when training on the full dataset, also training on inputs isn't that helpful, but for subsets it is.
