This document explains how to create an Amazon EKS cluster with GPU-enabled workers.
This documentation comes from the official Kubeflow on AWS documentation; see the Kubeflow website for more details on Kubeflow on AWS. If you run into any problems during installation, see Troubleshooting Deployments on Amazon EKS.
- Install kubectl.
- Install and configure the AWS Command Line Interface (AWS CLI):
  - Install the AWS Command Line Interface.
  - Configure the AWS CLI by running the following command: `aws configure`.
  - Enter your Access Keys (Access Key ID and Secret Access Key).
  - Enter your preferred AWS Region and default output options.
- Install eksctl (version 0.1.27 or newer).
- Install jq.
- Install ksonnet (`brew install ksonnet/tap/ks` for macOS users).
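As an optional sanity check, you can confirm that each tool above is installed and on your `PATH`:

```bash
# Each command prints a version; "command not found" means the tool is missing
kubectl version --client
aws --version
eksctl version
jq --version
ks version
```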
- Subscribe to the GPU-supported AMI.
- Run the following commands to download the latest `kfctl.sh`:

```bash
export KUBEFLOW_SRC=~/tmp/kubeflow-aws
export KUBEFLOW_TAG=v0.5-branch

mkdir -p ${KUBEFLOW_SRC} && cd ${KUBEFLOW_SRC}
curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
```

`~/tmp/kubeflow-aws` is the full path to your preferred download directory.
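If the download succeeded, the script should now be in place; a quick optional check:

```bash
# Confirm kfctl.sh was downloaded into the scripts directory
ls ${KUBEFLOW_SRC}/scripts/kfctl.sh
```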
- Run the following commands to set up your environment and initialize the cluster.

```bash
export KFAPP=kfapp
export REGION=us-west-2
export AWS_CLUSTER_NAME=kubeflow-aws

${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform aws \
  --awsClusterName ${AWS_CLUSTER_NAME} \
  --awsRegion ${REGION}
```

  - `AWS_CLUSTER_NAME` - a unique name for your Amazon EKS cluster.
  - `KFAPP` - use a relative directory name here rather than an absolute path, such as `kfapp`.
  - `REGION` - the AWS Region you want to create your cluster in.
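After `init` completes, the settings it recorded live in the app directory's `env.sh` (the same file edited later to disable ingress). A quick way to review them; the grep filter is illustrative:

```bash
# Show the AWS-related settings kfctl.sh init wrote
grep -E 'AWS|REGION|PLATFORM' ${KUBEFLOW_SRC}/${KFAPP}/env.sh
```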
- Generate and apply platform changes.

You can customize your cluster configuration, control plane logging, and private cluster endpoint access before you `apply platform`; see Customizing Kubeflow on AWS for more information.

```bash
cd ${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh generate platform
# Customize your Amazon EKS cluster configuration before following the next step
```
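`generate platform` writes the eksctl cluster manifest used in the next step; you can confirm it is there before editing (path as used in the apply step below):

```bash
# The generated cluster manifest lives under aws_config
ls ${KUBEFLOW_SRC}/${KFAPP}/aws_config/cluster_config.yaml
```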
- Open `cluster_config.yaml` and update the file so that it looks like the example below. This file can also be copied from the repo.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  # AWS_CLUSTER_NAME and AWS_REGION will override `name` and `region` here.
  name: kubeflow-aws
  region: us-west-2
  version: '1.12'

# If your region has multiple availability zones, you can specify 3 of them.
#availabilityZones: ["us-west-2b", "us-west-2c", "us-west-2d"]

# NodeGroup holds all configuration attributes that are specific to a nodegroup
# You can have several node groups in your cluster.
nodeGroups:
  #- name: cpu-nodegroup
  #  instanceType: m5.2xlarge
  #  desiredCapacity: 1
  #  minSize: 0
  #  maxSize: 2
  #  volumeSize: 30

  # Example of a GPU node group
  - name: Tesla-V100
    instanceType: p3.8xlarge
    availabilityZones: ["us-west-2b"]
    desiredCapacity: 2
    minSize: 0
    maxSize: 2
    volumeSize: 50
    ssh:
      allow: true
      publicKeyPath: '~/.ssh/id_rsa.pub'
```
Then apply the changes:

```bash
# vim ${KUBEFLOW_SRC}/${KFAPP}/aws_config/cluster_config.yaml
${KUBEFLOW_SRC}/scripts/kfctl.sh apply platform
```
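Once `apply platform` finishes, you can verify that the cluster and its nodes came up; both commands below are standard eksctl/kubectl calls:

```bash
# List the clusters eksctl sees in the region, then the nodes Kubernetes registered
eksctl get cluster --region=${REGION}
kubectl get nodes
```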
- Generate and apply the Kubernetes changes.

```bash
${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s
```

Important!!! By default, these scripts create an AWS Application Load Balancer for Kubeflow that is open to the public. This is fine for development testing and for short-term use, but we do not recommend that you use this configuration for production workloads.

To secure your installation, you have two options:

- Disable ingress before you `apply k8s`: open `${KUBEFLOW_SRC}/${KFAPP}/env.sh`, edit the `KUBEFLOW_COMPONENTS` environment variable by deleting `,\"alb-ingress-controller\",\"istio-ingress\"`, and save the file (see the sketch after this list).
- Follow the instructions to add authentication before you `apply k8s`.
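For the first option, a one-liner like the following can make the edit; this is only a sketch, assuming the escaped component names appear in `env.sh` exactly as shown above, so verify the result before proceeding:

```bash
# Strip the ingress components from KUBEFLOW_COMPONENTS, keeping a .bak backup.
# Assumes the escaped names appear verbatim in env.sh; check the diff afterwards.
sed -i.bak 's/,\\"alb-ingress-controller\\",\\"istio-ingress\\"//' ${KUBEFLOW_SRC}/${KFAPP}/env.sh
diff ${KUBEFLOW_SRC}/${KFAPP}/env.sh.bak ${KUBEFLOW_SRC}/${KFAPP}/env.sh
```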
Once your customization is done, or if you're fine with a public endpoint for testing, run this command to deploy Kubeflow:

```bash
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s
```

It will take a few minutes for all pods to become ready.
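To monitor progress, you can watch the namespace until everything reports `Running` (press Ctrl+C to stop):

```bash
# Watch pods in the kubeflow namespace as they come up
kubectl get pods -n kubeflow -w
```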
- Get memory, CPU, and GPU counts for each node in the cluster:
```bash
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,MEMORY:.status.allocatable.memory,CPU:.status.allocatable.cpu,GPU:.status.allocatable.nvidia\.com/gpu"
```

The output looks something like this:

```
NAME                                            MEMORY        CPU   GPU
ip-192-168-101-177.us-west-2.compute.internal   251643680Ki   32    4
ip-192-168-196-254.us-west-2.compute.internal   251643680Ki   32    4
```
The maximum number of GPUs that may be scheduled to a pod is capped by the number of GPUs available per node. By default, pods are scheduled on CPU.
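To land a pod on the GPU nodes, request the `nvidia.com/gpu` resource explicitly. The following is a minimal sketch; the pod name and CUDA image are illustrative, not part of the Kubeflow deployment:

```yaml
# Hypothetical smoke-test pod that requests a single GPU; the scheduler will
# only place it on a node with allocatable nvidia.com/gpu capacity.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:10.0-base   # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1          # request one GPU
```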
- Verify Kubeflow:

```bash
kubectl get pods -n=kubeflow
```

```
NAME                                                      READY   STATUS    RESTARTS   AGE
ambassador-5cf8cd97d5-68xgv                               1/1     Running   0          19m
ambassador-5cf8cd97d5-cxp85                               1/1     Running   0          19m
ambassador-5cf8cd97d5-r57hc                               1/1     Running   0          19m
argo-ui-7c9c69d464-p7mjh                                  1/1     Running   0          17m
centraldashboard-6f47d694bd-qd56p                         1/1     Running   0          18m
jupyter-0                                                 1/1     Running   0          18m
katib-ui-6bdb7d76cc-z9dv5                                 1/1     Running   0          16m
metacontroller-0                                          1/1     Running   0          17m
minio-7bfcc6c7b9-d8xqf                                    1/1     Running   0          17m
ml-pipeline-6fdd759597-n9zws                              1/1     Running   0          17m
ml-pipeline-persistenceagent-5669f69cdd-gdzwq             1/1     Running   1          16m
ml-pipeline-scheduledworkflow-9f6d5d5b6-zfqtd             1/1     Running   0          16m
ml-pipeline-ui-67f79b964d-jrwx6                           1/1     Running   0          16m
mysql-6f6b5f7b64-bg2vp                                    1/1     Running   0          17m
pytorch-operator-6f87db67b7-4bksz                         1/1     Running   0          17m
spartakus-volunteer-6f5f47f95-cl5sf                       1/1     Running   0          17m
studyjob-controller-774d45f695-2k6t8                      1/1     Running   0          16m
tf-job-dashboard-5f986cf99d-nqxm7                         1/1     Running   0          18m
tf-job-operator-v1beta1-5876c48976-zbmrq                  1/1     Running   0          18m
vizier-core-fc7969897-rns98                               1/1     Running   1          16m
vizier-core-rest-6fcd4665d9-bf69g                         1/1     Running   0          16m
vizier-db-777675b958-kt8p2                                1/1     Running   0          16m
vizier-suggestion-bayesianoptimization-54db8d594f-5srk6   1/1     Running   0          16m
vizier-suggestion-grid-6f5d9d647f-tzgqw                   1/1     Running   0          16m
vizier-suggestion-hyperband-59dd9bb9bc-mldbl              1/1     Running   0          16m
vizier-suggestion-random-6dd597c997-8qdnr                 1/1     Running   0          16m
workflow-controller-5c95f95f58-nvg9l                      1/1     Running   0          17m
```
You may not need all the components. If you want to customize components, please check Kubeflow Customization.
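If you disabled the public ingress, one common way to reach the Kubeflow dashboard locally is to port-forward the ambassador service from the pod list above; this assumes the service is named `ambassador` and serves on port 80:

```bash
# Forward the ambassador gateway to localhost, then open http://localhost:8080
kubectl port-forward -n kubeflow svc/ambassador 8080:80
```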
- Once you're done experimenting, uninstall Kubeflow and delete the cluster:

```bash
cd ${KUBEFLOW_SRC}/${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh delete all
```
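Teardown can take several minutes; to confirm the cluster is gone, you can list the clusters left in the region (an optional check):

```bash
# kubeflow-aws should no longer appear once teardown finishes
aws eks list-clusters --region ${REGION}
```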