KEP-2170: Add PyTorch DDP MNIST training example #2387
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: … The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing …
Pull Request Test Coverage Report for Build 12904336983 (Details)
💛 - Coveralls
Force-pushed from c953498 to dced478
Thank you for creating this example @astefanutti. We will use it as a getting started example!
However, we discussed before that we want to keep all of our examples as Jupyter Notebooks: #2213.
At least initially, before we see a need for other kinds of examples.
That way, Data Scientists and ML Engineers can quickly take them and execute them locally or inside Kubeflow Notebooks.
Additionally, we are planning to build testing infrastructure using Papermill to make sure these Notebooks are runnable.
cc @kubeflow/wg-training-leads @Electronic-Waste @akshaychitneni @shravan-achar
Also, we should keep this example super simple, with the training function as small as possible, similar to this getting started example: https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob
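The getting-started page linked above illustrates the kind of minimal training function being asked for. As a hedged sketch only (synthetic tensors stand in for MNIST, no DDP wiring, and all names are illustrative, not the page's actual code):

```python
import torch
import torch.nn as nn


def train_func():
    # Minimal training loop on synthetic data (illustrative sketch, not the
    # official getting-started code): a tiny linear model trained for a few
    # steps so the example stays short enough for a first read.
    model = nn.Linear(28 * 28, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()
    for _ in range(3):
        inputs = torch.randn(8, 28 * 28)          # stand-in for MNIST batches
        labels = torch.randint(0, 10, (8,))
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
    return loss.item()
```

A real getting-started example would swap the random tensors for the MNIST dataset, but the overall shape (model, optimizer, loss, short loop) stays this small.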
Force-pushed from 6b542e9 to 5f45584
@andreyvelich awesome, sorry if I missed that. Do I understand correctly that you initially want the examples to be created under …?
I can see it's nice to have an example that is as small as possible, and I can certainly remove the "evaluation" part to make it shorter. That being said, I'd be inclined to assume evaluation is a critical part of training for any Data Scientist or ML Engineer, so its value is high, and it does not add much complexity or any concepts foreign to what the train section already has.
We can still use the …
Thanks, that all makes sense. Keeping the examples in the … I can turn this into a Jupyter notebook; I don't see one for the v1 MNIST example in any case. Would it be useful if I proceed with that, or would you rather have it done as part of the e2e testing?
Sure, go ahead! We can create the E2Es once you have the Notebook ready. Do you want to take the FashionMNIST example as a reference: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb ?
Actually I hesitated when I started :) I agree with you. Let's move it to use FashionMNIST. |
Hi @astefanutti, did you get a chance to transfer your example into a Jupyter Notebook so we can use it as the Getting Started example?
@andreyvelich yes, I'm on it; I should be able to push it quickly.
Force-pushed from 77fbfdc to 95788ea
@andreyvelich PTAL
Signed-off-by: Antonin Stefanutti <[email protected]>
Thank you for doing this @astefanutti!
I left a few comments.
@@ -0,0 +1,344 @@
import argparse
Since we don't yet support passing Python files to the TrainJobs, I would suggest that we keep the Python function in the Jupyter Notebook: #2347
cc @shravan-achar
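The pattern this comment asks for is to keep the whole training function inside a notebook cell so the SDK can ship the function itself to the TrainJob pods, rather than referencing a Python file. A hedged sketch of that pattern (the function body is synthetic and illustrative; the commented-out submission call is hypothetical, since the exact SDK v2 method and arguments may differ):

```python
def train_func():
    # All imports live inside the function so it is self-contained and can be
    # serialized and executed as-is in the TrainJob pods.
    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()
    inputs = torch.randn(4, 10)        # synthetic batch; real code would load MNIST
    labels = torch.randint(0, 2, (4,))
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


# Hypothetical submission call, for illustration only:
# TrainingClient().train(train_func=train_func, num_nodes=2)
```

Because the function is self-contained, it can also be called directly in the notebook to verify it locally before submitting a distributed job.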
backend = dict.get("backend")
batch_size = dict.get("batch_size")
test_batch_size = dict.get("test_batch_size")
epochs = dict.get("epochs")
lr = dict.get("lr")
lr_gamma = dict.get("lr_gamma")
lr_period = dict.get("lr_period")
seed = dict.get("seed")
log_interval = dict.get("log_interval")
save_model = dict.get("save_model")
Can we keep this example as simple as we can and remove all of these args?
Since we want to use this as a Getting Started example, I would suggest keeping a minimal amount of working PyTorch code.
We can create more examples where we showcase the train func args.
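One way to follow this suggestion is to drop the parameter plumbing entirely and hard-code a few defaults. Incidentally, the code under review names its argument `dict`, which shadows the Python builtin. A minimal arg-free variant (the default values here are illustrative, not taken from the PR):

```python
def train_func():
    # Hard-coded hyperparameters instead of a parameters dict or CLI args.
    # (Note: the original signature used an argument named `dict`, shadowing
    # the Python builtin; dropping the args avoids that too.)
    batch_size = 64
    epochs = 1
    lr = 1e-3
    # ... the training code would use these constants directly ...
    return {"batch_size": batch_size, "epochs": epochs, "lr": lr}
```

Follow-up examples can then reintroduce the train func args to showcase parameterized jobs, as suggested above.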
def __init__(self):
    super(Net, self).__init__()
    self.flatten = nn.Flatten()
    self.linear_relu_stack = nn.Sequential(
Do we want to keep 2 linear and 2 CNN layers for this model, similar to this example: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb?
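A small model in the spirit of what this comment suggests, with two convolutional and two linear layers, could look like the following sketch (layer sizes are illustrative and not copied from the linked notebook):

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    # A compact 2-conv / 2-linear CNN for 1x28x28 (Fashion)MNIST images.
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 14x14 -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```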
    logits = self.linear_relu_stack(x)
    return logits


def train(model, device, criterion, train_loader, optimizer, epoch, log_interval):
Let's avoid nested functions for simplicity.
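Hoisting the helper out of the training function, as this comment suggests, keeps each notebook cell flat and readable. A sketch of the refactor (the signature mirrors the `train` helper in the diff, but the body is a simplified illustration):

```python
import torch
import torch.nn as nn


# Defined at module/notebook level rather than nested inside train_func.
def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    last_loss = 0.0
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        last_loss = loss.item()
    return last_loss
```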
    )
)


def evaluate(model, device, criterion, rank, test_loader, epoch):
We can remove the evaluation step.
dist.init_process_group(backend=backend)

torch.manual_seed(seed)
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

model = Net()

use_cuda = torch.cuda.is_available()
if use_cuda:
    if backend != torch.distributed.Backend.NCCL:
        print(
            "Please use NCCL distributed backend for the best performance using NVIDIA GPUs"
        )
    device = torch.device(f"cuda:{local_rank}")
    model = DistributedDataParallel(model.to(device), device_ids=[local_rank])
else:
    device = torch.device("cpu")
    model = DistributedDataParallel(model.to(device))

transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ]
)
Can we refactor this part to make it as simple as here: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb
I think we can significantly reduce the amount of code.
dist.destroy_process_group()


if __name__ == "__main__":
This can also be removed.
I think the steps in this Notebook should be:
- Install the kubeflow SDK.
- Create the PyTorch code.
- Run the Notebook locally for 1 epoch to verify that the code is functional.
- List the available Training Runtimes.
- Create a TrainJob for distributed training.
- Check the TrainJob's components.
- Get the TrainJob's logs.
- Delete the TrainJob.
What this PR does / why we need it:
This PR adds an example that demonstrates how to train MNIST with PyTorch DDP using the training operator and SDK v2.
Checklist: