Building a Multi-GPU Kubernetes Cluster for Scalable and Cost-Effective ML Training with Ray and Kubeflow. Related blog post here.
- Create one CPU node and two GPU nodes
- Create a Kubernetes cluster and join the nodes to it
- Enable Kubernetes dashboard
- Install NVIDIA GPU Operator
- Check that GPUs are available in the cluster
- Install KubeRay
- Create a Ray Cluster
- Enable Ray dashboard
- Run Ray workload in Kubeflow
These tools must be installed on the nodes before starting:
- Git
- Helm3
- Kustomize
- Make
- NVIDIA Container Runtime
The demo was tested with the following versions:
- Kubernetes 1.25
- Python 3.8
- Ray 2.6
- Kubeflow 1.7
- Ubuntu 20.04
- KubeRay 0.6.0
- NVIDIA GPU Operator v23.6.0
- Demo tested on Genesis Cloud with NVIDIA RTX 3090 GPUs
# Install Git, Make, and the apt HTTPS transport
sudo apt-get install apt-transport-https git make -y
# Install Helm 3
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /bin/
# Install the k3s server (v1.25.8) on the master (CPU) node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.8+k3s1 sh -
# Make the generated kubeconfig readable by the current user
sudo chown $USER /etc/rancher/k3s/k3s.yaml
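To verify the control plane is up, point kubectl at the k3s kubeconfig (k3s also bundles its own kubectl as `k3s kubectl`):
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get nodes   # the master node should report STATUS Ready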
Verify that the NVIDIA driver is working on each GPU node:
nvidia-smi
# Add the NVIDIA container toolkit apt repository (on the GPU nodes)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the toolkit
sudo apt-get update \
&& sudo apt-get install -y nvidia-container-toolkit
# On the master node, print the cluster join token
sudo cat /var/lib/rancher/k3s/server/node-token
# On each GPU node, export the token and the master's IP, then install the k3s agent
export K3S_NODE_TOKEN=<node-token-from-master>
export SERVER_IP=<public-or-private-IP-of-master>
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.8+k3s1 K3S_URL=https://${SERVER_IP}:6443 K3S_TOKEN=${K3S_NODE_TOKEN} sh -
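Back on the master node, the GPU nodes should appear shortly:
kubectl get nodes   # all three nodes should eventually report Ready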
The NVIDIA GPU Operator makes the GPUs on the nodes accessible to the cluster: it installs the components Kubernetes needs to discover, advertise, and schedule the GPUs.
More on the NVIDIA GPU Operator: https://github.com/NVIDIA/gpu-operator
# Add the NVIDIA Helm repository and install the GPU Operator.
# The driver and container toolkit are disabled because both are already present on the hosts.
sudo helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& sudo helm repo update
sudo helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --kubeconfig /etc/rancher/k3s/k3s.yaml
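Each GPU node should now advertise the nvidia.com/gpu resource:
kubectl describe node <gpu-node> | grep nvidia.com/gpu
As a further smoke test, you can run nvidia-smi in a throwaway GPU pod; the pod name and CUDA image tag below are illustrative assumptions, not part of this repo:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs -f gpu-smoke-test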
Once Kubeflow is installed, you can experiment with the GPUs from a Jupyter notebook. Install Kubeflow from the manifests:
git clone https://github.com/data-max-hq/manifests.git
cd manifests/
# Apply the Kubeflow example manifests, retrying until all resources are accepted
while ! kustomize build example | awk '!/well-defined/' | sudo k3s kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
# Wait until all Kubeflow pods are running
sudo kubectl get po -n kubeflow
# Expose the Kubeflow dashboard through the Istio ingress gateway
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 --address='0.0.0.0'
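The Kubeflow central dashboard is then reachable at http://localhost:8080 (the port-forward binds 0.0.0.0, so <node-ip>:8080 works as well). With the stock Kubeflow 1.7 manifests the default login is user@example.com with password 12341234; this fork may differ.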
# Install the KubeRay operator via Helm
sudo helm repo add kuberay https://ray-project.github.io/kuberay-helm/
sudo helm repo update
sudo helm upgrade --install \
kuberay-operator kuberay/kuberay-operator \
--namespace kuberay-operator \
--create-namespace \
--version 0.6.0 \
--kubeconfig /etc/rancher/k3s/k3s.yaml
# Check that the KubeRay operator pod is running
sudo kubectl get pods -n kuberay-operator
# Create the Ray cluster (see ray-cluster.sh in this repo)
sh ray-cluster.sh
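The script provisions a RayCluster custom resource through the KubeRay operator. For orientation, here is a rough sketch of what such a resource can look like with KubeRay 0.6 and Ray 2.6; the names, images, replica counts, and resource sizes are assumptions, so see ray-cluster.sh for the actual definition:
kubectl apply -f - <<EOF
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-gpu
spec:
  rayVersion: "2.6.3"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.6.3
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 2
    minReplicas: 2
    maxReplicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.6.3-gpu
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
KubeRay serves the Ray dashboard on port 8265 through a head service named <cluster-name>-head-svc, so with the sketch above it could be reached with:
kubectl port-forward svc/raycluster-gpu-head-svc 8265:8265 --address='0.0.0.0'
kubectl exec svc/raycluster-gpu-head-svc -- ray status   # should list 2 GPUs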
- Configure private registries in k3s: https://docs.k3s.io/installation/private-registry
- Restart k3s and k3s-agent (https://docs.k3s.io/upgrades/manual#restarting-k3s) if kubectl describe node <gpu-node> does not show the nvidia.com/gpu resource
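For systemd-based installs, the restart boils down to (per the k3s docs linked above):
sudo systemctl restart k3s         # on the server node
sudo systemctl restart k3s-agent   # on the agent (GPU) nodes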
- https://cloud.google.com/blog/products/ai-machine-learning/build-a-ml-platform-with-kubeflow-and-ray-on-gke
- https://github.com/ray-project/kuberay
- https://docs.ray.io/en/latest/cluster/kubernetes/examples/gpu-training-example.html#kuberay-gpu-training-example
- https://ray-project.github.io/kuberay/deploy/helm/
- https://docs.ray.io/en/latest/train/train.html
- https://github.com/NVIDIA/gpu-operator
Made with ❤️ by data-max.io.