This document explains how to create an Amazon EKS cluster with GPU-enabled workers.
This documentation comes from the official Kubeflow on AWS documentation; see the Kubeflow website for more details on Kubeflow on AWS. If you run into any problems during installation, see Troubleshooting Deployments on Amazon EKS.
- Install kubectl.
- Install and configure the AWS Command Line Interface (AWS CLI):
  - Install the AWS Command Line Interface.
  - Configure the AWS CLI by running the following command: `aws configure`.
  - Enter your Access Keys (Access Key ID and Secret Access Key).
  - Enter your preferred AWS Region and default output options.
- Install eksctl (version 0.1.27 or newer).
- Install jq.
- Install ksonnet (`brew install ksonnet/tap/ks` for macOS users).
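As an optional sanity check, you can confirm that each tool above is installed and on your `PATH`:

```bash
# Each command prints a version; "command not found" means the tool is missing
kubectl version --client
aws --version
eksctl version
jq --version
ks version
```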
- Subscribe to the GPU-supported AMI.
- Run the following commands to download the latest `kfctl.sh`:

```bash
export KUBEFLOW_SRC=~/tmp/kubeflow-aws
export KUBEFLOW_TAG=v0.5-branch

mkdir -p ${KUBEFLOW_SRC} && cd ${KUBEFLOW_SRC}
curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
```

`~/tmp/kubeflow-aws` is the full path to your preferred download directory.
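If the download succeeded, the script should now be in place; a quick optional check:

```bash
# Confirm kfctl.sh was downloaded into the scripts directory
ls ${KUBEFLOW_SRC}/scripts/kfctl.sh
```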
- Run the following commands to set up your environment and initialize the cluster.

```bash
export KFAPP=kfapp
export REGION=us-west-2
export AWS_CLUSTER_NAME=kubeflow-aws

${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform aws \
  --awsClusterName ${AWS_CLUSTER_NAME} \
  --awsRegion ${REGION}
```

  - `AWS_CLUSTER_NAME` - a unique name for your Amazon EKS cluster.
  - `KFAPP` - use a relative directory name here rather than an absolute path, such as `kfapp`.
  - `REGION` - the AWS Region you want to create your cluster in.
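After `init` completes, the settings it recorded live in the app directory's `env.sh` (the same file edited later to disable ingress). A quick way to review them; the grep filter is illustrative:

```bash
# Show the AWS-related settings kfctl.sh init wrote
grep -E 'AWS|REGION|PLATFORM' ${KUBEFLOW_SRC}/${KFAPP}/env.sh
```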
- Generate and apply platform changes.

You can customize your cluster configuration, control plane logging, and private cluster endpoint access before you `apply platform`; see Customizing Kubeflow on AWS for more information.

```bash
cd ${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh generate platform
# Customize your Amazon EKS cluster configuration before following the next step
```
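`generate platform` writes the eksctl cluster manifest used in the next step; you can confirm it is there before editing (path as used in the apply step below):

```bash
# The generated cluster manifest lives under aws_config
ls ${KUBEFLOW_SRC}/${KFAPP}/aws_config/cluster_config.yaml
```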
- Open `cluster_config.yaml` and update the file so that it looks like the example below. This file can also be copied from the repo.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  # AWS_CLUSTER_NAME and AWS_REGION will override `name` and `region` here.
  name: kubeflow-aws
  region: us-west-2
  version: '1.12'

# If your region has multiple availability zones, you can specify 3 of them.
#availabilityZones: ["us-west-2b", "us-west-2c", "us-west-2d"]

# NodeGroup holds all configuration attributes that are specific to a nodegroup
# You can have several node groups in your cluster.
nodeGroups:
  #- name: cpu-nodegroup
  #  instanceType: m5.2xlarge
  #  desiredCapacity: 1
  #  minSize: 0
  #  maxSize: 2
  #  volumeSize: 30

  # Example of a GPU node group
  - name: Tesla-V100
    instanceType: p3.8xlarge
    availabilityZones: ["us-west-2b"]
    desiredCapacity: 2
    minSize: 0
    maxSize: 2
    volumeSize: 50
    ssh:
      allow: true
      publicKeyPath: '~/.ssh/id_rsa.pub'
```
Then apply the changes:

```bash
# vim ${KUBEFLOW_SRC}/${KFAPP}/aws_config/cluster_config.yaml
${KUBEFLOW_SRC}/scripts/kfctl.sh apply platform
```
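Once `apply platform` finishes, you can verify that the cluster and its nodes came up; both commands below are standard eksctl/kubectl calls:

```bash
# List the clusters eksctl sees in the region, then the nodes Kubernetes registered
eksctl get cluster --region=${REGION}
kubectl get nodes
```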
- Generate and apply the Kubernetes changes.

```bash
${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s
```

Important!!! By default, these scripts create an AWS Application Load Balancer for Kubeflow that is open to the public. This is fine for development testing and for short-term use, but we do not recommend that you use this configuration for production workloads.

To secure your installation, you have two options:

- Disable ingress before you `apply k8s`: open `${KUBEFLOW_SRC}/${KFAPP}/env.sh`, edit the `KUBEFLOW_COMPONENTS` environment variable by deleting `,\"alb-ingress-controller\",\"istio-ingress\"`, and save the file (see the sketch after this list).
- Follow the instructions to add authentication before you `apply k8s`.
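For the first option, a one-liner like the following can make the edit; this is only a sketch, assuming the escaped component names appear in `env.sh` exactly as shown above, so verify the result before proceeding:

```bash
# Strip the ingress components from KUBEFLOW_COMPONENTS, keeping a .bak backup.
# Assumes the escaped names appear verbatim in env.sh; check the diff afterwards.
sed -i.bak 's/,\\"alb-ingress-controller\\",\\"istio-ingress\\"//' ${KUBEFLOW_SRC}/${KFAPP}/env.sh
diff ${KUBEFLOW_SRC}/${KFAPP}/env.sh.bak ${KUBEFLOW_SRC}/${KFAPP}/env.sh
```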
Once your customization is done, or if you're fine with a public endpoint for testing, run this command to deploy Kubeflow:

```bash
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s
```

It will take a few minutes for all pods to become ready.
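To monitor progress, you can watch the namespace until everything reports `Running` (press Ctrl+C to stop):

```bash
# Watch pods in the kubeflow namespace as they come up
kubectl get pods -n kubeflow -w
```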
- Get memory, CPU, and GPU counts for each node in the cluster:
```bash
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,MEMORY:.status.allocatable.memory,CPU:.status.allocatable.cpu,GPU:.status.allocatable.nvidia\.com/gpu"
```

The output looks something like this:

```
NAME                                            MEMORY        CPU   GPU
ip-192-168-101-177.us-west-2.compute.internal   251643680Ki   32    4
ip-192-168-196-254.us-west-2.compute.internal   251643680Ki   32    4
```
The maximum number of GPUs that may be scheduled to a pod is capped by the number of GPUs available per node. By default, pods are scheduled on CPU.
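To land a pod on the GPU nodes, request the `nvidia.com/gpu` resource explicitly. The following is a minimal sketch; the pod name and CUDA image are illustrative, not part of the Kubeflow deployment:

```yaml
# Hypothetical smoke-test pod that requests a single GPU; the scheduler will
# only place it on a node with allocatable nvidia.com/gpu capacity.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:10.0-base   # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1          # request one GPU
```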
- Verify Kubeflow:

```bash
kubectl get pods -n=kubeflow
```

```
NAME                                                      READY   STATUS    RESTARTS   AGE
ambassador-5cf8cd97d5-68xgv                               1/1     Running   0          19m
ambassador-5cf8cd97d5-cxp85                               1/1     Running   0          19m
ambassador-5cf8cd97d5-r57hc                               1/1     Running   0          19m
argo-ui-7c9c69d464-p7mjh                                  1/1     Running   0          17m
centraldashboard-6f47d694bd-qd56p                         1/1     Running   0          18m
jupyter-0                                                 1/1     Running   0          18m
katib-ui-6bdb7d76cc-z9dv5                                 1/1     Running   0          16m
metacontroller-0                                          1/1     Running   0          17m
minio-7bfcc6c7b9-d8xqf                                    1/1     Running   0          17m
ml-pipeline-6fdd759597-n9zws                              1/1     Running   0          17m
ml-pipeline-persistenceagent-5669f69cdd-gdzwq             1/1     Running   1          16m
ml-pipeline-scheduledworkflow-9f6d5d5b6-zfqtd             1/1     Running   0          16m
ml-pipeline-ui-67f79b964d-jrwx6                           1/1     Running   0          16m
mysql-6f6b5f7b64-bg2vp                                    1/1     Running   0          17m
pytorch-operator-6f87db67b7-4bksz                         1/1     Running   0          17m
spartakus-volunteer-6f5f47f95-cl5sf                       1/1     Running   0          17m
studyjob-controller-774d45f695-2k6t8                      1/1     Running   0          16m
tf-job-dashboard-5f986cf99d-nqxm7                         1/1     Running   0          18m
tf-job-operator-v1beta1-5876c48976-zbmrq                  1/1     Running   0          18m
vizier-core-fc7969897-rns98                               1/1     Running   1          16m
vizier-core-rest-6fcd4665d9-bf69g                         1/1     Running   0          16m
vizier-db-777675b958-kt8p2                                1/1     Running   0          16m
vizier-suggestion-bayesianoptimization-54db8d594f-5srk6   1/1     Running   0          16m
vizier-suggestion-grid-6f5d9d647f-tzgqw                   1/1     Running   0          16m
vizier-suggestion-hyperband-59dd9bb9bc-mldbl              1/1     Running   0          16m
vizier-suggestion-random-6dd597c997-8qdnr                 1/1     Running   0          16m
workflow-controller-5c95f95f58-nvg9l                      1/1     Running   0          17m
```
You may not need all the components. If you want to customize components, please check Kubeflow Customization.
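If you disabled the public ingress, one common way to reach the Kubeflow dashboard locally is to port-forward the ambassador service from the pod list above; this assumes the service is named `ambassador` and serves on port 80:

```bash
# Forward the ambassador gateway to localhost, then open http://localhost:8080
kubectl port-forward -n kubeflow svc/ambassador 8080:80
```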
- Once you're done experimenting, uninstall Kubeflow and delete the cluster:

```bash
cd ${KUBEFLOW_SRC}/${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh delete all
```
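Teardown can take several minutes; to confirm the cluster is gone, you can list the clusters left in the region (an optional check):

```bash
# kubeflow-aws should no longer appear once teardown finishes
aws eks list-clusters --region ${REGION}
```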