Commit f9c2724

KEP-2170: Add PyTorch DDP Fashion MNIST training example
Signed-off-by: Antonin Stefanutti <[email protected]>
1 parent 1dfa40c commit f9c2724

3 files changed

+590
-0
lines changed

examples/pytorch/mnist-ddp/README.md

+112
@@ -0,0 +1,112 @@
# PyTorch DDP Fashion MNIST Training Example

This example demonstrates how to train a convolutional neural network to classify images
using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset
and [PyTorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).

You can run this example either with the provided Jupyter notebook
or by running the Python script directly.

In either case, you need to install the Kubeflow Training v2 control plane
on your Kubernetes cluster, if it's not already deployed:

```console
kubectl apply --server-side -k "https://github.com/kubeflow/training-operator.git/manifests/v2/overlays/standalone?ref=master"
```

## Jupyter Notebook

You can set up your environment by running the following commands:

```console
python -m venv .venv
source .venv/bin/activate
pip install jupyter
```

Then start the notebook by running:

```console
jupyter notebook examples/pytorch/mnist-ddp/mnist.ipynb
```

You can then access the notebook from your web browser and follow the instructions.

## Python Script

### Setup

You need to set up the Python environment on your local machine or client:

```console
python -m venv .venv
source .venv/bin/activate
pip install git+https://github.com/kubeflow/training-operator.git@master#subdirectory=sdk_v2
```

You can refer to the [training operator documentation](https://www.kubeflow.org/docs/components/training/installation/)
for more information.

### Usage

```console
python mnist.py --help
usage: mnist.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N] [--lr LR] [--lr-gamma G] [--lr-period P] [--seed S] [--log-interval N] [--save-model]
                [--backend {gloo,nccl}] [--num-workers N] [--worker-resources RESOURCE QUANTITY] [--runtime NAME]

PyTorch DDP Fashion MNIST Training Example

options:
  -h, --help            show this help message and exit
  --batch-size N        input batch size for training [100]
  --test-batch-size N   input batch size for testing [100]
  --epochs N            number of epochs to train [10]
  --lr LR               learning rate [1e-1]
  --lr-gamma G          learning rate decay factor [0.5]
  --lr-period P         learning rate decay period in step size [20]
  --seed S              random seed [0]
  --log-interval N      how many batches to wait before logging training metrics [10]
  --save-model          saving the trained model [False]
  --backend {gloo,nccl}
                        Distributed backend [nccl]
  --num-workers N       Number of workers [1]
  --worker-resources RESOURCE QUANTITY
                        Resources per worker [cpu: 1, memory: 2Gi, nvidia.com/gpu: 1]
  --runtime NAME        the training runtime [torch-distributed]
```

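The usage text above implies that `--worker-resources` takes two values per occurrence and can be repeated. As a rough sketch (not the actual `mnist.py`, whose source is not shown here), options of this shape could be declared with `argparse` as follows. The `build_parser` and `resources_per_node` helpers, and the idea of merging the repeated pairs over the documented defaults, are assumptions made for illustration:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring a few options from the usage text above."""
    parser = argparse.ArgumentParser(description="PyTorch DDP Fashion MNIST Training Example")
    parser.add_argument("--epochs", type=int, default=10, metavar="N",
                        help="number of epochs to train [10]")
    parser.add_argument("--backend", type=str, choices=["gloo", "nccl"], default="nccl",
                        help="Distributed backend [nccl]")
    parser.add_argument("--num-workers", type=int, default=1, metavar="N",
                        help="Number of workers [1]")
    # nargs=2 consumes "RESOURCE QUANTITY"; action="append" lets the flag repeat.
    parser.add_argument("--worker-resources", nargs=2, action="append", default=None,
                        metavar=("RESOURCE", "QUANTITY"),
                        help="Resources per worker")
    return parser


def resources_per_node(pairs):
    """Merge repeated (resource, quantity) pairs over the documented defaults (assumed behavior)."""
    resources = {"cpu": "1", "memory": "2Gi", "nvidia.com/gpu": "1"}
    resources.update(dict(pairs or []))
    return resources


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(
    ["--num-workers", "4",
     "--worker-resources", "nvidia.com/gpu", "1",
     "--worker-resources", "cpu", "4"]
)
print(resources_per_node(args.worker_resources))
```

Each `--worker-resources` occurrence is collected as a two-element list, so converting the collected pairs with `dict()` gives a mapping similar in shape to the `resources_per_node` argument used in the notebook.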
### Example

Train the model on 4 worker nodes using 1 NVIDIA GPU each:

```console
python mnist.py \
    --num-workers 4 \
    --worker-resources "nvidia.com/gpu" 1 \
    --worker-resources cpu 4 \
    --worker-resources memory 16Gi \
    --epochs 100 \
    --batch-size 100 \
    --lr 1e-1 \
    --lr-period 20 \
    --lr-gamma 0.8
```

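The `--lr`, `--lr-period` and `--lr-gamma` options suggest a step decay schedule, where the learning rate is multiplied by the decay factor once every period. Assuming the script follows the same rule as PyTorch's `StepLR` scheduler and counts the period in epochs (both are assumptions, since the scheduler code is not shown here), the learning rate for the command above would evolve roughly as sketched below. `step_lr` is a hypothetical helper, not part of `mnist.py`:

```python
def step_lr(base_lr: float, gamma: float, period: int, epoch: int) -> float:
    """Learning rate at a 0-based epoch under step decay: base_lr * gamma ** (epoch // period)."""
    return base_lr * gamma ** (epoch // period)


# Values taken from the example command:
# --lr 1e-1 --lr-period 20 --lr-gamma 0.8 --epochs 100
for epoch in (0, 19, 20, 40, 99):
    print(f"epoch {epoch:3d}: lr = {step_lr(0.1, 0.8, 20, epoch):.5f}")
```

Under these assumptions the rate stays at its initial value for the first 20 epochs and is then scaled by 0.8 at each subsequent period boundary.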
At the end of each epoch, local metrics are printed in each worker's logs, and the global metrics
are gathered and printed in the rank 0 worker's logs.

When the training completes, you should see output similar to the following at the end of the rank 0 worker logs:

```text
--------------- Epoch 50 Evaluation ---------------

Local rank 0:
- Loss: 0.0040
- Accuracy: 2255/2500 (90%)

Global metrics:
- Loss: 0.004319
- Accuracy: 9011/10000 (90.11%)

---------------------------------------------------
```
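
The log output above reflects how evaluation is split across workers: rank 0 evaluates 2500 of the 10000 test images (10000 split across 4 workers), and the global accuracy is the sum of correct predictions divided by the sum of evaluated samples. The following pure-Python sketch simulates that sum reduction without `torch.distributed`. The counts for ranks 1 to 3 are made-up values chosen only so that the totals match the log above, and `reduce_sum` is a hypothetical stand-in for a collective such as an all-reduce:

```python
from typing import List, Tuple

# (correct, total) per rank. Rank 0 comes from the log above; ranks 1-3 are
# hypothetical values picked so that the totals match the global metrics.
local_counts: List[Tuple[int, int]] = [
    (2255, 2500),
    (2250, 2500),
    (2252, 2500),
    (2254, 2500),
]


def reduce_sum(values: List[int]) -> int:
    """Stand-in for a sum reduction across ranks (e.g. an all-reduce with SUM)."""
    return sum(values)


def global_accuracy(counts: List[Tuple[int, int]]) -> Tuple[int, int, float]:
    """Combine per-rank counts into global correct, total, and accuracy in percent."""
    correct = reduce_sum([c for c, _ in counts])
    total = reduce_sum([t for _, t in counts])
    return correct, total, 100.0 * correct / total


correct, total, accuracy = global_accuracy(local_counts)
print(f"Accuracy: {correct}/{total} ({accuracy:.2f}%)")
```

Summing the raw counts, rather than averaging per-rank percentages, keeps the global figure exact even when the shards do not all have the same size.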
examples/pytorch/mnist-ddp/mnist.ipynb

+141
@@ -0,0 +1,141 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# PyTorch DDP Fashion MNIST Training Example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "This example demonstrates how to train a convolutional neural network to classify images using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset and [PyTorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Install the Kubeflow Training Python SDK\n",
    "\n",
    "You need to install the Kubeflow Training SDK to run this Notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create the Kubeflow Training Client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from kubeflow.training import Trainer, TrainingClient\n",
    "from mnist import train_fashion_mnist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "client = TrainingClient()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start the Train Job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "job_name = client.train(\n",
    "    runtime_ref=\"torch-distributed\",\n",
    "    trainer=Trainer(\n",
    "        func=train_fashion_mnist,\n",
    "        func_args={\n",
    "            \"backend\": \"nccl\",\n",
    "            \"batch_size\": 100,\n",
    "            \"test_batch_size\": 100,\n",
    "            \"epochs\": 100,\n",
    "            \"lr\": 1e-1,\n",
    "            \"lr_gamma\": 0.95,\n",
    "            \"lr_period\": 20,\n",
    "            \"seed\": 0,\n",
    "            \"log_interval\": 10,\n",
    "            \"save_model\": False,\n",
    "        },\n",
    "        num_nodes=4,\n",
    "        resources_per_node={\n",
    "            \"nvidia.com/gpu\": 1,\n",
    "        },\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Watch the Train Job Logs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "client.get_job_logs(job_name, follow=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
