
data-max-hq/cost-effective-ml


Cost Effective ML

Building a Multi GPU Kubernetes Cluster for Scalable and Cost-Effective ML Training with Ray and Kubeflow. Related blog post here.

Building the Multi GPU Kubernetes Cluster

[Figure: 1-setup.png]

What we will be doing:

  1. Create one CPU node and two GPU nodes
  2. Create a Kubernetes cluster and add the nodes to the cluster
  3. Enable Kubernetes dashboard
  4. Install NVIDIA GPU Operator
  5. Check that the GPUs are available in the cluster
  6. Install KubeRay
  7. Create a Ray Cluster
  8. Enable Ray dashboard
  9. Run Ray workload in Kubeflow

Prerequisites

These tools must be installed on the nodes before starting:

  • Git
  • Helm 3
  • Kustomize
  • Make
  • NVIDIA Container Runtime
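
To see quickly which of the tools above are still missing on a node, a small check like the following can help. This is only a sketch: the function name `missing_tools` is made up for this example, and the NVIDIA Container Runtime is left out because the name of its binary depends on the installed package version.

```shell
#!/bin/sh
# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the tool cannot be found.
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools from the prerequisites list whose binary has the same name as the tool.
missing=$(missing_tools git helm kustomize make)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All listed tools found."
fi
```

Run it on each node before continuing with the installation steps.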

Versions tested in the demo

  • Kubernetes 1.25
  • Python 3.8
  • Ray 2.6
  • Kubeflow 1.7
  • Ubuntu 20.04
  • KubeRay 0.6.0
  • NVIDIA GPU Operator v23.6.0
  • Demo tested on Genesis Cloud with NVIDIA RTX3090 GPUs

How to set up K3S master node

Install prerequisites

Install common utilities

sudo apt-get install apt-transport-https git make -y

Install helm

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
   && chmod 700 get_helm.sh \
   && ./get_helm.sh

Install kustomize

curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"  | bash
sudo mv kustomize /bin/

Install Kubernetes

Install K3S on the main node

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.8+k3s1 sh -

Run kubectl without sudo

sudo chown $USER /etc/rancher/k3s/k3s.yaml

Kubernetes worker nodes setup

Install prerequisites

(If node contains GPUs) Make sure NVIDIA drivers are installed

Check by running:

nvidia-smi

Install NVIDIA Container Runtime

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update \
    && sudo apt-get install -y nvidia-container-toolkit

Installing K3S agents on worker nodes

On the main (master) node, get the node token

sudo cat /var/lib/rancher/k3s/server/node-token

Run the K3S installation command on the worker nodes

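export K3S_NODE_TOKEN="NODE_TOKEN"   # replace NODE_TOKEN with the token printed on the master node
export SERVER_IP="MASTER_NODE_IP"    # replace MASTER_NODE_IP with the public or private IP of the master node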
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.8+k3s1 K3S_URL=https://${SERVER_IP}:6443 K3S_TOKEN=${K3S_NODE_TOKEN} sh -
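
Once the agents are installed, it is worth confirming that every node has joined and is Ready. The sketch below parses the output of `kubectl get nodes --no-headers`, whose second column is the node status. The function names and the `EXPECTED_NODES` variable are assumptions made for this example; the value 3 comes from the one CPU node plus two GPU nodes described above.

```shell
#!/bin/sh
EXPECTED_NODES=3   # one CPU node plus two GPU nodes (adjust for your cluster)

# Read `kubectl get nodes --no-headers` output on stdin and print how many
# nodes have STATUS exactly "Ready" (column 2).
count_ready_nodes() {
    awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Succeed only when the number of Ready nodes matches EXPECTED_NODES.
check_cluster() {
    ready=$(count_ready_nodes)
    if [ "$ready" -eq "$EXPECTED_NODES" ]; then
        echo "all $ready nodes are Ready"
    else
        echo "only $ready of $EXPECTED_NODES nodes are Ready"
        return 1
    fi
}

# Usage on the main node:
#   sudo k3s kubectl get nodes --no-headers | check_cluster
```

A node can take a short while to switch to Ready after the agent starts, so rerun the check if it fails right after installation.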

Install NVIDIA GPU Operator from the main node

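The NVIDIA GPU Operator gives the cluster access to the GPUs on its nodes by installing the components Kubernetes needs to discover and schedule them. The driver and container toolkit are disabled in the command below because both were already installed on the GPU nodes in the previous steps.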

More on nvidia-gpu-operator

sudo helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && sudo helm repo update

sudo helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=false \
      --set toolkit.enabled=false \
      --kubeconfig /etc/rancher/k3s/k3s.yaml  
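
To check that the GPUs are available in the cluster, one option is to look at the `nvidia.com/gpu` resource each node advertises once the operator is running. The sketch below sums that resource across nodes. The function name `total_gpus` is made up for this example, and the expected total depends on how many GPUs each of your GPU nodes has.

```shell
#!/bin/sh
# Read two columns (node name, allocatable nvidia.com/gpu) on stdin and
# print the total number of GPUs the cluster advertises.
total_gpus() {
    # Column 2 is "<none>" on nodes without GPUs, so only sum plain integers.
    awk '$2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}

# Usage on the main node:
#   sudo k3s kubectl get nodes --no-headers \
#       -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
#       | total_gpus
```

A result of 0 usually means the operator pods are still starting or the NVIDIA drivers are missing on the GPU nodes.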

Usage/Examples

You can experiment with the GPUs from a Jupyter notebook.

Install Kubeflow

git clone https://github.com/data-max-hq/manifests.git
cd manifests/
while ! kustomize build example | awk '!/well-defined/' | sudo k3s kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
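
The loop above retries forever, which is convenient while resources are still being registered but hides a permanently failing apply. A bounded variant could look like the sketch below. The `retry` and `apply_manifests` helpers are hypothetical names introduced for this example, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds, giving up after MAX_ATTEMPTS tries.
retry() {
    max=$1; delay=$2; shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "Retrying to apply resources (attempt $attempt of $max)" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Same pipeline as the loop above; its exit status is that of kubectl apply.
apply_manifests() {
    kustomize build example | awk '!/well-defined/' | sudo k3s kubectl apply -f -
}

# Usage from the manifests/ directory:
#   retry 30 10 apply_manifests
```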

Check Kubeflow installation status:

sudo kubectl get po -n kubeflow
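
The pod list for Kubeflow is long, so it can help to print only the pods that are not ready yet. The sketch below filters the output of `kubectl get po --no-headers`, assuming the usual column order of NAME, READY, STATUS. The function name `pending_pods` is made up for this example.

```shell
#!/bin/sh
# Read `kubectl get po --no-headers` output on stdin and print the names of
# pods that are neither Completed nor Running with all containers ready.
pending_pods() {
    awk '{
        split($2, r, "/")    # READY column looks like "1/2"
        finished = ($3 == "Completed")
        healthy = ($3 == "Running" && r[1] == r[2])
        if (!finished && !healthy) print $1
    }'
}

# Usage on the main node:
#   sudo kubectl get po -n kubeflow --no-headers | pending_pods
```

An empty result means every pod in the namespace is either running with all containers ready or has completed.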

After Kubeflow is installed, expose the Kubeflow UI on port 8080 of the node:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 --address='0.0.0.0'

Create Ray Cluster

Install KubeRay Operator

sudo helm repo add kuberay https://ray-project.github.io/kuberay-helm/
sudo helm repo update
sudo helm upgrade --install \
    kuberay-operator kuberay/kuberay-operator \
    --namespace kuberay-operator \
    --create-namespace \
    --version 0.6.0 \
    --kubeconfig /etc/rancher/k3s/k3s.yaml

Check the Operator Installation

sudo kubectl get pods -n kuberay-operator

Create Ray Cluster

sh ray-cluster.sh
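
To reach the Ray dashboard once the cluster is up, the head service can be port-forwarded in the same way as the Kubeflow UI. The sketch below only builds and prints the command. It assumes KubeRay's usual head service name of `<cluster-name>-head-svc` and Ray's default dashboard port 8265. The cluster name `example-cluster` and the namespace `default` are placeholders, since the values used by `ray-cluster.sh` are not shown here.

```shell
#!/bin/sh
RAY_DASHBOARD_PORT=8265   # Ray's default dashboard port

# Name KubeRay normally gives the head service of a RayCluster (assumption).
ray_head_service() {
    echo "$1-head-svc"
}

# Print the port-forward command for a given cluster name and namespace.
ray_dashboard_forward_cmd() {
    echo "kubectl port-forward svc/$(ray_head_service "$1") -n $2 ${RAY_DASHBOARD_PORT}:${RAY_DASHBOARD_PORT} --address=0.0.0.0"
}

# Placeholder values; replace with your cluster name and namespace.
ray_dashboard_forward_cmd example-cluster default
```

Check the actual service name with `kubectl get svc` in the namespace of the Ray cluster before running the printed command.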

Troubleshooting

Links

Made with ❤️ by data-max.io.

