# Webinar Demo Flow
- Explain `kfapp/aws_config/cluster_config.sh` for GPUs
- Explain `kfapp/aws_config/cluster_features.sh` for private access, disabling the endpoint, and control/data plane logging
- In `kfapp/env.sh`, explain `KUBEFLOW_COMPONENTS` and the disabling of the ALB and ingress controllers
- Show the kubectl config: `kubectl config get-contexts`
- Show the GPUs: `kubectl get nodes "-o=custom-columns=NAME:.metadata.name,MEMORY:.status.allocatable.memory,CPU:.status.allocatable.cpu,GPU:.status.allocatable.nvidia\.com/gpu"`
- Get the ingress address: `kubectl get ingress -n istio-system`
- Open that address in a browser to reach the Kubeflow dashboard
- Click on Notebooks
- Create a new server
- Specify the name
- Optionally, change the CPU (for faster processing)
- Click SPAWN
- Wait for the server to be ready
- Click CONNECT
- Create a new notebook (top right) with the Python 3 kernel
- Copy the code from https://github.com/aws-samples/machine-learning-using-k8s/blob/master/samples/mnist/training/tensorflow/mnist.py
- Change `args = parser.parse_args()` to `args = parser.parse_args(args=[])` so argparse ignores the notebook kernel's own arguments (see the notebook-cell sketch below)
- Delete the last two lines
- Run the cell
- In a new code block, add `main()`
- Run the cell
- Show the output
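
The `parse_args` change is needed because the Jupyter kernel injects its own command-line flags, which the script's unmodified parser would reject. A minimal sketch of the resulting notebook cell, using placeholder flag names rather than the actual arguments defined in mnist.py:

```python
import argparse

# Placeholder flags for illustration; mnist.py defines its own set.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=100)
parser.add_argument("--learning_rate", type=float, default=0.01)

# Parse an empty list so argparse ignores the Jupyter kernel's flags and
# falls back to the defaults above.
args = parser.parse_args(args=[])

def main():
    # Stand-in for the training loop defined in mnist.py.
    print("training with batch_size=%d, lr=%g" % (args.batch_size, args.learning_rate))

# In the next cell, call main() directly, replacing the script's deleted
# entry-point lines.
main()
```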
- Follow the steps from https://github.com/aws-samples/machine-learning-using-k8s/blob/master/docs/mnist/inference/tensorflow.md to run the inference engine
- Deploy the serving components: `ks apply default -c ${TF_SERVING_SERVICE}` and `ks apply default -c ${TF_SERVING_DEPLOYMENT}`
- Show that the pods are running: `kubectl get pods -n kubeflow --selector=app=mnist`
- Port-forward to the serving pod: ``kubectl port-forward -n kubeflow `kubectl get pods -n kubeflow --selector=app=mnist -o jsonpath='{.items[0].metadata.name}' --field-selector=status.phase=Running` 8500:8500``
- Run inference (a sketch of the underlying REST call follows below): `python samples/mnist/inference/tensorflow/inference_client.py --endpoint http://localhost:8500/v1/models/mnist:predict`
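
For reference, the request the client sends can be reproduced in a few lines. This is only a sketch: it assumes the port-forward above exposes the TensorFlow Serving REST API on localhost:8500 and that the exported model accepts a flattened 784-float image; the real inference_client.py feeds an actual MNIST digit.

```python
import json
import numpy as np
import requests

# Stand-in input; the exported model may expect [28, 28] images rather than
# a flattened 784-element vector.
image = np.zeros(784, dtype=np.float32).tolist()
payload = {"instances": [image]}

resp = requests.post(
    "http://localhost:8500/v1/models/mnist:predict",
    data=json.dumps(payload),
)
resp.raise_for_status()
print(resp.json()["predictions"])
```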
- Delete the serving components: `ks delete -c ${TF_SERVING_DEPLOYMENT}`
- Walk through https://github.com/aws-samples/machine-learning-using-k8s/blob/master/docs/imagenet/training/tensorflow-horovod.md
- Explain the EXEC command, specifically `batch_size`, `num_batches`, and `display_every` (see the toy loop below)
- Explain `gpusPerReplica`
- Explain:
  - Show FSx for Lustre
  - Show the S3 backing bucket
  - Show the logs and explain the output from ring 0 and ring 1: `kubectl -n kubeflow logs -f ${POD_NAME}`
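
To make the parameter discussion concrete, here is a toy loop showing how `batch_size`, `num_batches`, and `display_every` typically interact in benchmark-style training scripts; it is an illustration only, not the code the EXEC command actually runs, and the values are placeholders.

```python
import time

batch_size = 256     # images processed per step on each replica (placeholder)
num_batches = 100    # total training steps to run (placeholder)
display_every = 10   # print throughput every N steps (placeholder)

start = time.time()
for step in range(1, num_batches + 1):
    time.sleep(0.01)  # stand-in for the real work of one training step
    if step % display_every == 0:
        elapsed = time.time() - start
        print("step %d: %.1f images/sec" % (step, step * batch_size / elapsed))
```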
- TensorBoard
- Katib
- Fairing
- KFServing