# Webinar Demo Flow
- Explain `kfapp/aws_config/cluster_config.sh` for GPUs
- Explain `kfapp/aws_config/cluster_features.sh` for private access, disabling the endpoint, and control/data plane logging
- In `kfapp/env.sh`, explain `KUBEFLOW_COMPONENTS` and the disabling of the ALB and ingress controllers
- Show the kubectl config: `kubectl config get-contexts`
- Show the GPUs: `kubectl get nodes "-o=custom-columns=NAME:.metadata.name,MEMORY:.status.allocatable.memory,CPU:.status.allocatable.cpu,GPU:.status.allocatable.nvidia\.com/gpu"`
- Get the ingress address: `kubectl get ingress -n istio-system`
- Open that address in a browser to reach the Kubeflow dashboard
- Click on Notebooks
- Create a new server
- Specify the name
- Optionally, change the CPU (for faster processing)
- Click SPAWN
- Wait for the server to be ready
- Click CONNECT
- Create a new notebook (top right) with the Python 3 kernel
- Copy the code from https://github.com/aws-samples/machine-learning-using-k8s/blob/master/samples/mnist/training/tensorflow/mnist.py
- Change `args = parser.parse_args()` to `args = parser.parse_args(args=[])` so argparse ignores the notebook kernel's own arguments (see the notebook-cell sketch below)
- Delete the last two lines
- Run the cell
- In a new code block, add `main()`
- Run the cell
- Show the output
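
The `parse_args` change is needed because the Jupyter kernel injects its own command-line flags, which the script's unmodified parser would reject. A minimal sketch of the resulting notebook cell, using placeholder flag names rather than the actual arguments defined in mnist.py:

```python
import argparse

# Placeholder flags for illustration; mnist.py defines its own set.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=100)
parser.add_argument("--learning_rate", type=float, default=0.01)

# Parse an empty list so argparse ignores the Jupyter kernel's flags and
# falls back to the defaults above.
args = parser.parse_args(args=[])

def main():
    # Stand-in for the training loop defined in mnist.py.
    print("training with batch_size=%d, lr=%g" % (args.batch_size, args.learning_rate))

# In the next cell, call main() directly, replacing the script's deleted
# entry-point lines.
main()
```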
- Follow the steps from https://github.com/aws-samples/machine-learning-using-k8s/blob/master/docs/mnist/inference/tensorflow.md to run the inference engine
- Deploy the serving components: `ks apply default -c ${TF_SERVING_SERVICE}` and `ks apply default -c ${TF_SERVING_DEPLOYMENT}`
- Show that the pods are running: `kubectl get pods -n kubeflow --selector=app=mnist`
- Port-forward to the serving pod: ``kubectl port-forward -n kubeflow `kubectl get pods -n kubeflow --selector=app=mnist -o jsonpath='{.items[0].metadata.name}' --field-selector=status.phase=Running` 8500:8500``
- Run inference (a sketch of the underlying REST call follows below): `python samples/mnist/inference/tensorflow/inference_client.py --endpoint http://localhost:8500/v1/models/mnist:predict`
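
For reference, the request the client sends can be reproduced in a few lines. This is only a sketch: it assumes the port-forward above exposes the TensorFlow Serving REST API on localhost:8500 and that the exported model accepts a flattened 784-float image; the real inference_client.py feeds an actual MNIST digit.

```python
import json
import numpy as np
import requests

# Stand-in input; the exported model may expect [28, 28] images rather than
# a flattened 784-element vector.
image = np.zeros(784, dtype=np.float32).tolist()
payload = {"instances": [image]}

resp = requests.post(
    "http://localhost:8500/v1/models/mnist:predict",
    data=json.dumps(payload),
)
resp.raise_for_status()
print(resp.json()["predictions"])
```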
- Delete the serving components: `ks delete -c ${TF_SERVING_DEPLOYMENT}`
- Walk through https://github.com/aws-samples/machine-learning-using-k8s/blob/master/docs/imagenet/training/tensorflow-horovod.md
- Explain the EXEC command, specifically `batch_size`, `num_batches`, and `display_every` (see the toy loop below)
- Explain `gpusPerReplica`
- Explain:
  - Show FSx for Lustre
  - Show the S3 backing bucket
  - Show the logs and explain the output from ring 0 and ring 1: `kubectl -n kubeflow logs -f ${POD_NAME}`
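
To make the parameter discussion concrete, here is a toy loop showing how `batch_size`, `num_batches`, and `display_every` typically interact in benchmark-style training scripts; it is an illustration only, not the code the EXEC command actually runs, and the values are placeholders.

```python
import time

batch_size = 256     # images processed per step on each replica (placeholder)
num_batches = 100    # total training steps to run (placeholder)
display_every = 10   # print throughput every N steps (placeholder)

start = time.time()
for step in range(1, num_batches + 1):
    time.sleep(0.01)  # stand-in for the real work of one training step
    if step % display_every == 0:
        elapsed = time.time() - start
        print("step %d: %.1f images/sec" % (step, step * batch_size / elapsed))
```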
- TensorBoard
- Katib
- Fairing
- KFServing