Create Amazon EKS cluster with GPU-enabled workers and Kubeflow

This document explains how to create an Amazon EKS cluster with GPU-enabled workers.

This guide is adapted from the official Kubeflow on AWS documentation; see that website for more details about Kubeflow on AWS. If you run into problems during installation, see Troubleshooting Deployments on Amazon EKS.

Prerequisites

  • Install kubectl.
  • Install and configure the AWS Command Line Interface (AWS CLI).
  • Install eksctl (version 0.1.27 or newer).
  • Install jq.
  • Install ksonnet (on macOS: brew install ksonnet/tap/ks).
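
Before continuing, it can help to confirm that the tools listed above are actually on your PATH. The following is a minimal sketch, not part of the official instructions; the function name check_tools is hypothetical, and it assumes the binaries are named kubectl, aws, eksctl, jq and ks.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, run check_tools kubectl aws eksctl jq ks and install anything it reports as missing. The sketch only checks presence; it does not verify versions such as the eksctl minimum.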

Create cluster and install Kubeflow

  1. Subscribe to the GPU supported AMI:

    https://aws.amazon.com/marketplace/pp/B07GRHFXGM

  2. Run the following commands to download kfctl.sh from the branch named in KUBEFLOW_TAG:

    export KUBEFLOW_SRC=~/tmp/kubeflow-aws
    export KUBEFLOW_TAG=v0.5-branch
    
    mkdir -p ${KUBEFLOW_SRC} && cd ${KUBEFLOW_SRC}
    curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
    
    • KUBEFLOW_SRC - The full path to your preferred download directory (~/tmp/kubeflow-aws in this example).
  3. Run the following commands to set up your environment and initialize the cluster.

    export KFAPP=kfapp
    export REGION=us-west-2
    export AWS_CLUSTER_NAME=kubeflow-aws
    
    ${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform aws \
    --awsClusterName ${AWS_CLUSTER_NAME} \
    --awsRegion ${REGION}
    
    • AWS_CLUSTER_NAME - A unique name for your Amazon EKS cluster.
    • KFAPP - Use a relative directory name here rather than absolute path, such as kfapp.
    • REGION - Use the AWS Region you want to create your cluster in.
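
The variable requirements above can be checked before running kfctl.sh init. This is a rough sketch under the assumption that the variables are exported exactly as named in steps 2 and 3; the function name check_init_env is hypothetical and is not part of kfctl.sh.

```shell
# Verify that the variables used by the init step are set and that
# KFAPP is a relative directory name rather than an absolute path.
check_init_env() {
  if [ -z "${KUBEFLOW_SRC:-}" ] || [ -z "${KFAPP:-}" ] ||
     [ -z "${REGION:-}" ] || [ -z "${AWS_CLUSTER_NAME:-}" ]; then
    echo "error: KUBEFLOW_SRC, KFAPP, REGION and AWS_CLUSTER_NAME must all be set" >&2
    return 1
  fi
  case "$KFAPP" in
    /*)
      echo "error: KFAPP must be a relative directory name, got: $KFAPP" >&2
      return 1
      ;;
  esac
  return 0
}
```

Calling check_init_env right before the kfctl.sh init command catches an unset variable or an absolute KFAPP early, before any AWS resources are touched.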
  4. Generate and apply platform changes.

    Before you apply the platform, you can customize your cluster configuration, control plane logging, and private cluster endpoint access. See Customizing Kubeflow on AWS for more information.

    cd ${KFAPP}
    ${KUBEFLOW_SRC}/scripts/kfctl.sh generate platform
    # Customize your Amazon EKS cluster configuration before following the next step
  5. Open cluster_config.yaml and update it to match the example below. You can also copy this file from the repo.

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      # AWS_CLUSTER_NAME and AWS_REGION will override `name` and `region` here.
      name: kubeflow-aws
      region: us-west-2
      version: '1.12'
    # If your region has multiple availability zones, you can specify 3 of them.
    #availabilityZones: ["us-west-2b", "us-west-2c", "us-west-2d"]
    
    # NodeGroup holds all configuration attributes that are specific to a nodegroup
    # You can have several node groups in your cluster.
    nodeGroups:
      #- name: cpu-nodegroup
      #  instanceType: m5.2xlarge
      #  desiredCapacity: 1
      #  minSize: 0
      #  maxSize: 2
      #  volumeSize: 30
    
      # Example of GPU node group
      - name: Tesla-V100
        instanceType: p3.8xlarge
        availabilityZones: ["us-west-2b"]
        desiredCapacity: 2
        minSize: 0
        maxSize: 2
        volumeSize: 50
        ssh:
          allow: true
          publicKeyPath: '~/.ssh/id_rsa.pub'
    

    Then apply the changes:

    # vim ${KUBEFLOW_SRC}/${KFAPP}/aws_config/cluster_config.yaml
    ${KUBEFLOW_SRC}/scripts/kfctl.sh apply platform
  6. Generate and apply the Kubernetes changes.

    ${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s

    Important: By default, these scripts create an AWS Application Load Balancer for Kubeflow that is open to the public. This is acceptable for development, testing, and short-term use, but we do not recommend this configuration for production workloads.

    To secure your installation, you have two options:

    • Disable ingress before you apply k8s. Open ${KUBEFLOW_SRC}/${KFAPP}/env.sh and edit the KUBEFLOW_COMPONENTS environment variable. Delete ,\"alb-ingress-controller\",\"istio-ingress\" and save the file.

    • Follow the instructions to add authentication before you apply k8s.
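
The first option (removing the two ingress components) can also be done non-interactively. The sketch below assumes that env.sh contains the text ,\"alb-ingress-controller\",\"istio-ingress\" literally, backslashes included, as quoted above; the function name disable_public_ingress is hypothetical. Check the file afterwards, since the exact layout of env.sh may differ in your version.

```shell
# Delete the ALB ingress controller and Istio ingress entries from the
# given env.sh file. The original file is kept alongside it as <file>.bak.
disable_public_ingress() {
  # In the sed pattern, \\ matches one literal backslash in the file.
  sed -i.bak 's/,\\"alb-ingress-controller\\",\\"istio-ingress\\"//' "$1"
}
```

A possible invocation is disable_public_ingress ${KUBEFLOW_SRC}/${KFAPP}/env.sh. If the substitution matches nothing, the file is left unchanged, so comparing it against the .bak copy shows whether the edit took effect.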

    Once your customization is done, or if a public endpoint is acceptable for testing, run this command to deploy Kubeflow:

    ${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s

    It takes a few minutes for all pods to become ready.
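
Rather than checking by hand while the pods start, you can poll until a readiness check succeeds. The helper below is a generic sketch, not something kfctl.sh provides; the name retry_until and its argument order are assumptions made for illustration.

```shell
# Re-run a command until it succeeds or the attempts run out.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until() {
  local attempts=$1
  local delay=$2
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Assuming your kubectl version provides the wait subcommand, one possible use is retry_until 30 20 kubectl wait --for=condition=Ready pods --all -n kubeflow --timeout=10s, which tries for roughly fifteen minutes in total before giving up.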

  7. Get memory, CPU and GPU for each node in the cluster:

    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,MEMORY:.status.allocatable.memory,CPU:.status.allocatable.cpu,GPU:.status.allocatable.nvidia\.com/gpu"
    

    The output looks similar to this:

    NAME                                            MEMORY        CPU       GPU
    ip-192-168-101-177.us-west-2.compute.internal   251643680Ki   32        4
    ip-192-168-196-254.us-west-2.compute.internal   251643680Ki   32        4
    

    A single pod can be given at most as many GPUs as one node provides (4 per node in the output above). By default, pods are scheduled on CPU only; a pod is assigned a GPU only when it requests one.
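
To illustrate how a pod asks for GPUs, the sketch below writes a minimal pod manifest that sets an nvidia.com/gpu limit, the same resource name used in the kubectl query above. The function name write_gpu_pod_manifest, the pod name gpu-smoke-test, and the default image nvidia/cuda:10.0-base are all placeholders chosen for this example; substitute an image you know is available to your cluster.

```shell
# Write a minimal pod manifest that requests GPU_COUNT GPUs.
# Usage: write_gpu_pod_manifest GPU_COUNT OUT_FILE [IMAGE]
write_gpu_pod_manifest() {
  local gpu_count=$1
  local out_file=$2
  # Placeholder image; replace with one that ships nvidia-smi.
  local image=${3:-nvidia/cuda:10.0-base}
  cat > "$out_file" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: ${gpu_count}
EOF
}
```

For example, write_gpu_pod_manifest 1 gpu-pod.yaml followed by kubectl apply -f gpu-pod.yaml submits a pod that needs one GPU. A count larger than the per-node GPU figure shown above cannot be satisfied by any single node, so such a pod would be expected to stay Pending.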

  8. Verify the Kubeflow installation:

    kubectl get pods -n=kubeflow
    NAME                                                      READY   STATUS    RESTARTS   AGE
    ambassador-5cf8cd97d5-68xgv                               1/1     Running   0          19m
    ambassador-5cf8cd97d5-cxp85                               1/1     Running   0          19m
    ambassador-5cf8cd97d5-r57hc                               1/1     Running   0          19m
    argo-ui-7c9c69d464-p7mjh                                  1/1     Running   0          17m
    centraldashboard-6f47d694bd-qd56p                         1/1     Running   0          18m
    jupyter-0                                                 1/1     Running   0          18m
    katib-ui-6bdb7d76cc-z9dv5                                 1/1     Running   0          16m
    metacontroller-0                                          1/1     Running   0          17m
    minio-7bfcc6c7b9-d8xqf                                    1/1     Running   0          17m
    ml-pipeline-6fdd759597-n9zws                              1/1     Running   0          17m
    ml-pipeline-persistenceagent-5669f69cdd-gdzwq             1/1     Running   1          16m
    ml-pipeline-scheduledworkflow-9f6d5d5b6-zfqtd             1/1     Running   0          16m
    ml-pipeline-ui-67f79b964d-jrwx6                           1/1     Running   0          16m
    mysql-6f6b5f7b64-bg2vp                                    1/1     Running   0          17m
    pytorch-operator-6f87db67b7-4bksz                         1/1     Running   0          17m
    spartakus-volunteer-6f5f47f95-cl5sf                       1/1     Running   0          17m
    studyjob-controller-774d45f695-2k6t8                      1/1     Running   0          16m
    tf-job-dashboard-5f986cf99d-nqxm7                         1/1     Running   0          18m
    tf-job-operator-v1beta1-5876c48976-zbmrq                  1/1     Running   0          18m
    vizier-core-fc7969897-rns98                               1/1     Running   1          16m
    vizier-core-rest-6fcd4665d9-bf69g                         1/1     Running   0          16m
    vizier-db-777675b958-kt8p2                                1/1     Running   0          16m
    vizier-suggestion-bayesianoptimization-54db8d594f-5srk6   1/1     Running   0          16m
    vizier-suggestion-grid-6f5d9d647f-tzgqw                   1/1     Running   0          16m
    vizier-suggestion-hyperband-59dd9bb9bc-mldbl              1/1     Running   0          16m
    vizier-suggestion-random-6dd597c997-8qdnr                 1/1     Running   0          16m
    workflow-controller-5c95f95f58-nvg9l                      1/1     Running   0          17m
    

    You may not need all the components. To customize which components are installed, see Kubeflow Customization.
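
With this many pods in the listing, it is easy to miss one that is not Running. The filter below is a small sketch that reads kubectl get pods output on standard input and prints only the names of pods whose STATUS column differs from Running; the function name pods_not_running is hypothetical.

```shell
# Print the names of pods whose STATUS (third column) is not "Running".
# Blank lines and the header row, if present, are skipped.
pods_not_running() {
  awk 'NF && $1 != "NAME" && $3 != "Running" { print $1 }'
}
```

For example, kubectl get pods -n=kubeflow | pods_not_running prints nothing when every pod in the namespace is Running.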

  9. When you are finished, uninstall Kubeflow and delete the cluster:

    cd ${KUBEFLOW_SRC}/${KFAPP}
    ${KUBEFLOW_SRC}/scripts/kfctl.sh delete all