KEP-2170: Add PyTorch DDP Fashion MNIST training example

astefanutti · astefanutti · commit f9c2724a3ee7 · 2025-01-20T16:10:32.000+01:00
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
diff --git a/examples/pytorch/mnist-ddp/README.md b/examples/pytorch/mnist-ddp/README.md
@@ -0,0 +1,112 @@
+# PyTorch DDP Fashion MNIST Training Example
+
+This example demonstrates how to train a convolutional neural network to classify images
+using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset
+and [PyTorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
+
+You can run this example either with the provided Jupyter notebook
+or by running the Python script directly.
+
+In either case, you need to install the Kubeflow Training v2 control plane
+on your Kubernetes cluster if it's not already deployed:
+
+```console
+kubectl apply --server-side -k "https://github.com/kubeflow/training-operator.git/manifests/v2/overlays/standalone?ref=master"
+```
+
+## Jupyter Notebook
+
+You can set up your environment by running the following commands:
+
+```console
+python -m venv .venv
+source .venv/bin/activate
+pip install jupyter
+```
+
+And start the notebook by running:
+
+```console
+jupyter notebook examples/pytorch/mnist-ddp/mnist.ipynb
+```
+
+You can then open the notebook in your web browser and follow the instructions.
+
+## Python Script
+
+### Setup
+
+You need to set up the Python environment on your local machine or client:
+
+```console
+python -m venv .venv
+source .venv/bin/activate
+pip install git+https://github.com/kubeflow/training-operator.git@master#subdirectory=sdk_v2
+```
+
+You can refer to the [training operator documentation](https://www.kubeflow.org/docs/components/training/installation/)
+for more information.
+
+### Usage
+
+```console
+python mnist.py --help
+usage: mnist.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N] [--lr LR] [--lr-gamma G] [--lr-period P] [--seed S] [--log-interval N] [--save-model]
+                [--backend {gloo,nccl}] [--num-workers N] [--worker-resources RESOURCE QUANTITY] [--runtime NAME]
+
+PyTorch DDP Fashion MNIST Training Example
+
+options:
+  -h, --help            show this help message and exit
+  --batch-size N        input batch size for training [100]
+  --test-batch-size N   input batch size for testing [100]
+  --epochs N            number of epochs to train [10]
+  --lr LR               learning rate [1e-1]
+  --lr-gamma G          learning rate decay factor [0.5]
+  --lr-period P         learning rate decay period in step size [20]
+  --seed S              random seed [0]
+  --log-interval N      how many batches to wait before logging training metrics [10]
+  --save-model          saving the trained model [False]
+  --backend {gloo,nccl}
+                        Distributed backend [nccl]
+  --num-workers N       Number of workers [1]
+  --worker-resources RESOURCE QUANTITY
+                        Resources per worker [cpu: 1, memory: 2Gi, nvidia.com/gpu: 1]
+  --runtime NAME        the training runtime [torch-distributed]
+```
+
+### Example
+
+Train the model on 4 worker nodes using 1 NVIDIA GPU each:
+
+```console
+python mnist.py \
+    --num-workers 4 \
+    --worker-resources "nvidia.com/gpu" 1 \
+    --worker-resources cpu 4 \
+    --worker-resources memory 16Gi \
+    --epochs 100 \
+    --batch-size 100 \
+    --lr 1e-1 \
+    --lr-period 20 \
+    --lr-gamma 0.8
+```
+
+At the end of each epoch, each worker prints its local metrics in its own logs, and the global metrics
+are gathered and printed in the rank 0 worker's logs.
+
+When the training completes, you should see output similar to the following at the end of the rank 0 worker's logs:
+
+```text
+--------------- Epoch 50 Evaluation ---------------
+
+Local rank 0:
+- Loss: 0.0040
+- Accuracy: 2255/2500 (90%)
+
+Global metrics:
+- Loss: 0.004319
+- Accuracy: 9011/10000 (90.11%)
+
+---------------------------------------------------
+```
diff --git a/examples/pytorch/mnist-ddp/mnist.ipynb b/examples/pytorch/mnist-ddp/mnist.ipynb
@@ -0,0 +1,141 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
+   "source": [
+    "# PyTorch DDP Fashion MNIST Training Example"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
+   "source": [
+    "This example demonstrates how to train a convolutional neural network to classify images using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset and [PyTorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Install the Kubeflow Training Python SDK\n",
+    "\n",
+    "You need to install the Kubeflow Training SDK to run this Notebook."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create the Kubeflow Training Client"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from kubeflow.training import Trainer, TrainingClient\n",
+    "from mnist import train_fashion_mnist"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = TrainingClient()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Start the Train Job"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "job_name = client.train(\n",
+    "    runtime_ref=\"torch-distributed\",\n",
+    "    trainer=Trainer(\n",
+    "        func=train_fashion_mnist,\n",
+    "        func_args={\n",
+    "            \"backend\": \"nccl\",\n",
+    "            \"batch_size\": 100,\n",
+    "            \"test_batch_size\": 100,\n",
+    "            \"epochs\": 100,\n",
+    "            \"lr\": 1e-1,\n",
+    "            \"lr_gamma\": 0.95,\n",
+    "            \"lr_period\": 20,\n",
+    "            \"seed\": 0,\n",
+    "            \"log_interval\": 10,\n",
+    "            \"save_model\": False,\n",
+    "        },\n",
+    "        num_nodes=4,\n",
+    "        resources_per_node={\n",
+    "            \"nvidia.com/gpu\": 1,\n",
+    "        },\n",
+    "    ),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Watch the Train Job Logs"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client.get_job_logs(job_name, follow=True)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/examples/pytorch/mnist-ddp/mnist.py b/examples/pytorch/mnist-ddp/mnist.py