KEP-2170: Add PyTorch DDP MNIST training example #2387
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: … The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing …
Pull Request Test Coverage Report for Build 12904336983 (Details)
💛 - Coveralls
Force-pushed from c953498 to dced478
Thank you for creating this example @astefanutti. We will use it as a getting started example!
However, we discussed before that we want to keep all of our examples as Jupyter Notebooks: #2213.
At least initially, before we see a need for other kinds of examples.
That way, Data Scientists and ML Engineers can quickly take them and execute them locally or inside Kubeflow Notebooks.
Additionally, we are planning to build testing infrastructure using Papermill to make sure these Notebooks are runnable.
cc @kubeflow/wg-training-leads @Electronic-Waste @akshaychitneni @shravan-achar
Also, we should keep this example super simple, with the training function as small as possible, similar to this getting started example: https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob
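The getting-started page linked above illustrates the kind of minimal training function being asked for. As a hedged sketch only (synthetic tensors stand in for MNIST, no DDP wiring, and all names are illustrative, not the page's actual code):

```python
import torch
import torch.nn as nn


def train_func():
    # Minimal training loop on synthetic data (illustrative sketch, not the
    # official getting-started code): a tiny linear model trained for a few
    # steps so the example stays short enough for a first read.
    model = nn.Linear(28 * 28, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()
    for _ in range(3):
        inputs = torch.randn(8, 28 * 28)          # stand-in for MNIST batches
        labels = torch.randint(0, 10, (8,))
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
    return loss.item()
```

A real getting-started example would swap the random tensors for the MNIST dataset, but the overall shape (model, optimizer, loss, short loop) stays this small.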
Force-pushed from 6b542e9 to 5f45584
@andreyvelich awesome, sorry if I missed that. Do I understand correctly that you initially want the examples to be created under …?
I can see it's nice to have an example that is as small as possible, and I can certainly remove the "evaluation" part to make it shorter. That being said, I'd be inclined to assume evaluation is a critical part of training for any Data Scientist or ML Engineer, so its value is high, and it does not add much complexity or any concepts foreign to what the train section already has.
We can still use the …
Thanks, that all makes sense. Keeping the examples in the … I can turn this into a Jupyter notebook; I don't see one for the v1 MNIST example in any case. Would it be useful if I proceed with that, or would you rather have it done as part of the e2e testing?
Sure, go ahead! We can create the E2Es once you have the Notebook ready. Do you want to take the FashionMNIST example as a reference: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb ?
Actually I hesitated when I started :) I agree with you. Let's move it to use FashionMNIST. |
Hi @astefanutti, did you get a chance to transfer your example into a Jupyter Notebook so we can use it as the Getting Started example?
@andreyvelich yes, I'm on it; I should be able to push it quickly.
Force-pushed from 77fbfdc to 95788ea
@andreyvelich PTAL
Signed-off-by: Antonin Stefanutti <[email protected]>
Thank you for doing this @astefanutti!
I left a few comments.
@@ -0,0 +1,344 @@
import argparse
Since we don't yet support passing Python files to the TrainJobs, I would suggest that we keep the Python function in the Jupyter Notebook: #2347
cc @shravan-achar
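The pattern this comment asks for is to keep the whole training function inside a notebook cell so the SDK can ship the function itself to the TrainJob pods, rather than referencing a Python file. A hedged sketch of that pattern (the function body is synthetic and illustrative; the commented-out submission call is hypothetical, since the exact SDK v2 method and arguments may differ):

```python
def train_func():
    # All imports live inside the function so it is self-contained and can be
    # serialized and executed as-is in the TrainJob pods.
    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()
    inputs = torch.randn(4, 10)        # synthetic batch; real code would load MNIST
    labels = torch.randint(0, 2, (4,))
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


# Hypothetical submission call, for illustration only:
# TrainingClient().train(train_func=train_func, num_nodes=2)
```

Because the function is self-contained, it can also be called directly in the notebook to verify it locally before submitting a distributed job.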
backend = dict.get("backend")
batch_size = dict.get("batch_size")
test_batch_size = dict.get("test_batch_size")
epochs = dict.get("epochs")
lr = dict.get("lr")
lr_gamma = dict.get("lr_gamma")
lr_period = dict.get("lr_period")
seed = dict.get("seed")
log_interval = dict.get("log_interval")
save_model = dict.get("save_model")
Can we keep this example as simple as we can and remove all of these args?
Since we want to use this as a Getting Started example, I would suggest keeping a minimal amount of working PyTorch code.
We can create more examples where we showcase the train func args.
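One way to follow this suggestion is to drop the parameter plumbing entirely and hard-code a few defaults. Incidentally, the code under review names its argument `dict`, which shadows the Python builtin. A minimal arg-free variant (the default values here are illustrative, not taken from the PR):

```python
def train_func():
    # Hard-coded hyperparameters instead of a parameters dict or CLI args.
    # (Note: the original signature used an argument named `dict`, shadowing
    # the Python builtin; dropping the args avoids that too.)
    batch_size = 64
    epochs = 1
    lr = 1e-3
    # ... the training code would use these constants directly ...
    return {"batch_size": batch_size, "epochs": epochs, "lr": lr}
```

Follow-up examples can then reintroduce the train func args to showcase parameterized jobs, as suggested above.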
def __init__(self):
    super(Net, self).__init__()
    self.flatten = nn.Flatten()
    self.linear_relu_stack = nn.Sequential(
Do we want to keep 2 linear and 2 CNN layers for this model, similar to this example: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb?
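A small model in the spirit of what this comment suggests, with two convolutional and two linear layers, could look like the following sketch (layer sizes are illustrative and not copied from the linked notebook):

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    # A compact 2-conv / 2-linear CNN for 1x28x28 (Fashion)MNIST images.
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 14x14 -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```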
    logits = self.linear_relu_stack(x)
    return logits


def train(model, device, criterion, train_loader, optimizer, epoch, log_interval):
Let's avoid nested functions for simplicity.
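Hoisting the helper out of the training function, as this comment suggests, keeps each notebook cell flat and readable. A sketch of the refactor (the signature mirrors the `train` helper in the diff, but the body is a simplified illustration):

```python
import torch
import torch.nn as nn


# Defined at module/notebook level rather than nested inside train_func.
def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    last_loss = 0.0
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        last_loss = loss.item()
    return last_loss
```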
    )
)


def evaluate(model, device, criterion, rank, test_loader, epoch):
We can remove the evaluation step.
dist.init_process_group(backend=backend)

torch.manual_seed(seed)
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

model = Net()

use_cuda = torch.cuda.is_available()
if use_cuda:
    if backend != torch.distributed.Backend.NCCL:
        print(
            "Please use NCCL distributed backend for the best performance using NVIDIA GPUs"
        )
    device = torch.device(f"cuda:{local_rank}")
    model = DistributedDataParallel(model.to(device), device_ids=[local_rank])
else:
    device = torch.device("cpu")
    model = DistributedDataParallel(model.to(device))

transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ]
)
Can we refactor this part to make it as simple as here: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb
I think we can significantly reduce the amount of code.
dist.destroy_process_group()


if __name__ == "__main__":
This can also be removed.
I think the steps in this Notebook should be:
- Install the kubeflow SDK.
- Create the PyTorch code.
- Run the Notebook locally for 1 epoch to verify that the code is functional.
- List the available Training Runtimes.
- Create a TrainJob for distributed training.
- Check the TrainJob's components.
- Get the TrainJob's logs.
- Delete the TrainJob.
What this PR does / why we need it:
This PR adds an example that demonstrates how to train MNIST with PyTorch DDP using the training operator and SDK v2.
Checklist: