Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Examples for E2E Mistral training & deployment #673

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jonetiz
Copy link

@jonetiz jonetiz commented Jul 30, 2024

What does this PR do?

Adds examples which are used for an upcoming AWS Machine Learning blog.

Coming soon to AWS Machine Learning blog
Copy link
Member

@michaelbenayoun michaelbenayoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it is quite different than our examples which are the equivalent of the examples in Transformers.
Should we create a directory like "aws-examples" or something where we would put them?

cc @philschmid @pagezyhf

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the run_clm.py script we already provide?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Distinction is in dataset loading, it uses load_from_disk to load a saved DatasetDict, before downloading from HF hub.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I do not think this is really needed as a distinction. But that's fair.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we should arther update the general on to make it support loading datasets from disk.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think we should keep examples easy. Saving the dataset on disk is not really needed. There is a caching mechanism, we should just push them to use the dataset id and let the Datasets library handle the rest.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelbenayoun

In our example, gsm8k is not pre-formatted for run_clm.py, which uses a single text column for training data, so the dataset needs to be formatted.

The reason we're loading from disk is to show customers (in the blog post) how the dataset needs to be formatted by leading them through it step-by-step (which ends up being dataset.py). While we can modify the training script to shape the dataset as necessary, we thought it'd be easier to show it in a standalone file - for the purposes of our blog post, run_clm.py isn't discussed in detail (the idea being it's plug and play with any LLM or dataset) so simplicity isn't necessarily our concern there.

Technically speaking, we could modify run_clm.py to shape the data itself, but I'd also argue it's better to have a training script that can take in any pre-formatted dataset, rather than having to modify it every time for new datasets. Hope this helps you see my point of view.

@michaelbenayoun
Copy link
Member

@pagezyhf @philschmid wdyt?

Copy link
Member

@philschmid philschmid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for opening the PR. I missing some context. Whats is the goal of that? I cannot see instruction what to do with the yaml file or how the scripts are used. If want to make an official example we should make sure that's understandable and useable for every user.

examples/mistral-e2e/examples/chat.py Outdated Show resolved Hide resolved
examples/mistral-e2e/examples/chat.py Outdated Show resolved Hide resolved
examples/mistral-e2e/examples/compile.py Outdated Show resolved Hide resolved
examples/mistral-e2e/examples/dataset.py Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we should arther update the general on to make it support loading datasets from disk.

examples/mistral-e2e/examples/test.py Outdated Show resolved Hide resolved
@jonetiz
Copy link
Author

jonetiz commented Aug 1, 2024

Thank you for opening the PR. I missing some context. Whats is the goal of that? I cannot see instruction what to do with the yaml file or how the scripts are used. If want to make an official example we should make sure that's understandable and useable for every user.

So this is for an AWS Machine Learning blog, to give beginners to intermediates an idea of how to use AWS Trainium and AWS Inferentia2 for simple end-to-end training and inferencing.

The YAML file is an Amazon CloudFormation template - it doesn't necessarily need to be included, so I can take that out.

Overall, similar to the other examples in this directory we (will) have documentation elsewhere that refers to these examples. Michael suggested we might create an aws-examples directory, which would be fine as well.

@michaelbenayoun
Copy link
Member

I would say let's me the tutorial as minimalist as possible and put it in an aws-examples folder. We can still update that afterwards.

aws-examples/mistral-e2e/chat.py Outdated Show resolved Hide resolved
aws-examples/mistral-e2e/chat.py Show resolved Hide resolved
aws-examples/README.md Outdated Show resolved Hide resolved
aws-examples/mistral-e2e/chat.py Outdated Show resolved Hide resolved
aws-examples/mistral-e2e/compile.py Outdated Show resolved Hide resolved
@HuggingFaceDocBuilderDev

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!

1 similar comment
@HuggingFaceDocBuilderDev

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants