KEP-2170: Add PyTorch DDP MNIST training example #2387

Open
wants to merge 1 commit into
base: master

astefanutti
Contributor

What this PR does / why we need it:

This PR adds an example that demonstrates how to train MNIST with PyTorch DDP using the training operator and SDK v2.

Checklist:

  • Docs included if any changes are user facing


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Jan 14, 2025

Pull Request Test Coverage Report for Build 12904336983

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12862208924: 0.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

@astefanutti astefanutti force-pushed the pr-10 branch 3 times, most recently from c953498 to dced478 Compare January 14, 2025 13:57
Member

@andreyvelich andreyvelich left a comment

Thank you for creating this example @astefanutti. We will use it as a Getting Started example!
However, we discussed before that we want to keep all of our examples as Jupyter Notebooks: #2213.
At least initially, before we see a need to have other examples.

That way, Data Scientists and ML Engineers can quickly pick them up and run them locally or inside Kubeflow Notebooks.

Additionally, we are planning to build the testing infra using Papermill to make sure these Notebooks are runnable.

cc @kubeflow/wg-training-leads @Electronic-Waste @akshaychitneni @shravan-achar
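
A minimal sketch of how the Papermill check mentioned above might be wired up, assuming the `papermill` CLI is installed. The helper name `papermill_command`, the notebook paths, and the `num_epochs` parameter are hypothetical and not taken from this PR; only the `-p` (parameter) and `-k` (kernel) options are standard Papermill CLI flags.

```python
import shlex


def papermill_command(notebook, output, params=None, kernel=None):
    """Build the argv list for a `papermill` CLI run (hypothetical helper)."""
    cmd = ["papermill", notebook, output]
    for name, value in (params or {}).items():
        # -p NAME VALUE passes one parameter to the notebook.
        cmd += ["-p", name, str(value)]
    if kernel:
        # -k selects the Jupyter kernel used to execute the notebook.
        cmd += ["-k", kernel]
    return cmd


cmd = papermill_command(
    "examples/pytorch/mnist.ipynb",  # hypothetical notebook path
    "/tmp/mnist-output.ipynb",       # hypothetical output path
    params={"num_epochs": 1},        # hypothetical notebook parameter
    kernel="python3",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` should make a test job fail when a notebook cell raises, since the CLI is expected to exit with a non-zero status in that case.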

@andreyvelich
Member

Also, we should keep this example very simple, with the training function as small as possible, similar to this Getting Started example: https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob
That will make it easier to understand.

@astefanutti astefanutti force-pushed the pr-10 branch 2 times, most recently from 6b542e9 to 5f45584 Compare January 14, 2025 14:11
@astefanutti
Contributor Author

astefanutti commented Jan 14, 2025

> However, we discussed before that we want to keep all of our examples as Jupyter Notebooks: #2213.
> At least initially, before we see a need to have other examples.

@andreyvelich awesome, sorry if I missed that.

Do I understand correctly that you initially want the examples to be created under /test/e2e/notebooks?

@astefanutti
Contributor Author

> Also, we should keep this example very simple, with the training function as small as possible, similar to this Getting Started example: https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob
> That will make it easier to understand.

I can see it's nice to have an example as small as possible. I can certainly remove the "evaluation" part to make it shorter.

That being said, I'd assume evaluation is a critical part of training for any Data Scientist or ML Engineer, so its value is high, and it adds neither much complexity nor any concepts beyond those the train section already has.

@andreyvelich
Member

> Do I understand correctly that you initially want the examples to be created under /test/e2e/notebooks?

We can still use the /examples folder for them; the /test/e2e/notebooks folder could hold additional test suites if we need them.
For example, we can keep the script to run Notebooks in the e2e/notebooks folder: https://github.com/kubeflow/training-operator/blob/master/scripts/run-notebook.sh

@astefanutti
Contributor Author

Thanks, that all makes sense. Keeping examples in the examples directory makes them easier to find, obviously :)

I can turn this into a Jupyter notebook. I don't see one for the v1 MNIST example in any case.

Would it be useful if I proceed with that, or would you rather have that done as part of the e2e testing?

@andreyvelich
Member

> I can turn this into a Jupyter notebook. I don't see one for the v1 MNIST example in any case.
> Would it be useful if I proceed with that, or would you rather have that done as part of the e2e testing?

Sure, go ahead! We can create the E2Es once you have the Notebook ready.

Do you want to take the FashionMNIST example as a reference: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb ?
I think FashionMNIST might be more representative than MNIST (it is also the first example that PyTorch shows in its tutorials: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html?highlight=nn%20crossentropyloss)

@astefanutti
Contributor Author

astefanutti commented Jan 14, 2025

> Do you want to take the FashionMNIST example as a reference: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb ?
> I think FashionMNIST might be more representative than MNIST (it is also the first example that PyTorch shows in its tutorials: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html?highlight=nn%20crossentropyloss)

Actually I hesitated when I started :)

I agree with you. Let's move it to use FashionMNIST.

@andreyvelich
Member

Hi @astefanutti, did you get a chance to convert your example into a Jupyter Notebook so we can use it as the Getting Started example?

@astefanutti
Contributor Author

@andreyvelich yes I'm on it, I should be able to push it quickly.


Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jan 20, 2025
@astefanutti astefanutti marked this pull request as ready for review January 20, 2025 15:15
@astefanutti
Contributor Author

@andreyvelich PTAL

Member

@andreyvelich andreyvelich left a comment

Thank you for doing this @astefanutti!
I left a few comments.

@@ -0,0 +1,344 @@
import argparse
Member

Since we don't yet support passing Python files to TrainJobs, I would suggest that we keep the Python function in the Jupyter Notebook: #2347
cc @shravan-achar

Comment on lines +20 to +29
backend = dict.get("backend")
batch_size = dict.get("batch_size")
test_batch_size = dict.get("test_batch_size")
epochs = dict.get("epochs")
lr = dict.get("lr")
lr_gamma = dict.get("lr_gamma")
lr_period = dict.get("lr_period")
seed = dict.get("seed")
log_interval = dict.get("log_interval")
save_model = dict.get("save_model")
Member

Can we keep this example as simple as we can and remove all of these args?
Since we want to use this as a Getting Started example, I would suggest keeping a minimal amount of working PyTorch code.
We can create more examples where we showcase the train func args.

def __init__(self):
    super(Net, self).__init__()
    self.flatten = nn.Flatten()
    self.linear_relu_stack = nn.Sequential(
Member

Do we want to keep 2 linear and 2 convolutional layers for this model, as in this example: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb ?

logits = self.linear_relu_stack(x)
return logits

def train(model, device, criterion, train_loader, optimizer, epoch, log_interval):
Member

Let's avoid nested functions for simplicity.

)
)

def evaluate(model, device, criterion, rank, test_loader, epoch):
Member

We can remove the evaluation step.

Comment on lines +130 to +156
dist.init_process_group(backend=backend)

torch.manual_seed(seed)
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

model = Net()

use_cuda = torch.cuda.is_available()
if use_cuda:
    if backend != torch.distributed.Backend.NCCL:
        print(
            "Please use NCCL distributed backend for the best performance using NVIDIA GPUs"
        )
    device = torch.device(f"cuda:{local_rank}")
    model = DistributedDataParallel(model.to(device), device_ids=[local_rank])
else:
    device = torch.device("cpu")
    model = DistributedDataParallel(model.to(device))

transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ]
)

Member

Can we refactor this part to make it as simple as it is here: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb
I think we can significantly reduce the amount of code.
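
A rough sketch of the kind of simplification being asked for, assuming the backend is derived from GPU availability instead of being passed in as an argument. The helper name `select_backend_and_device` is hypothetical and is not code from this PR or from the referenced notebook; in the Notebook, `cuda_available` would come from `torch.cuda.is_available()` and the returned backend would be passed to `dist.init_process_group(backend=backend)`.

```python
import os


def select_backend_and_device(cuda_available, env=os.environ):
    """Pick the process-group backend and device string for this worker."""
    # The snippet under review reads LOCAL_RANK from the environment;
    # default to 0 here so the helper also works in a single-process run.
    local_rank = int(env.get("LOCAL_RANK", "0"))
    if cuda_available:
        # NCCL on GPUs, one device per local rank.
        return "nccl", f"cuda:{local_rank}"
    # Gloo is the usual choice for CPU-only training.
    return "gloo", "cpu"


backend, device = select_backend_and_device(cuda_available=False)
print(backend, device)
```

Deriving both values in one place means the backend can no longer be mismatched with the device, so the NCCL warning branch is not needed.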

dist.destroy_process_group()


if __name__ == "__main__":
Member

This can also be removed.

Member

I think the steps in this Notebook should be:

  1. Install the Kubeflow SDK.
  2. Create the PyTorch code.
  3. Run the Notebook locally for 1 epoch to verify that the code is functional.
  4. List the available Training Runtimes.
  5. Create a TrainJob for distributed training.
  6. Check the TrainJob's components.
  7. Get the TrainJob's logs.
  8. Delete the TrainJob.


3 participants