Building a Multi-GPU Kubernetes Cluster for Scalable and Cost-Effective ML Training with Ray and Kubeflow. Related blog post here.
- Create one CPU node and two GPU nodes
- Create a Kubernetes cluster and join the nodes to it
- Enable Kubernetes dashboard
- Install NVIDIA GPU Operator
- Check that GPUs are available in the cluster
- Install KubeRay
- Create a Ray Cluster
- Enable Ray dashboard
- Run Ray workload in Kubeflow
These tools must be installed on the nodes before starting:
- Git
- Helm3
- Kustomize
- Make
- NVIDIA Container Runtime
The demo was tested with the following versions:
- Kubernetes 1.25
- Python 3.8
- Ray 2.6
- Kubeflow 1.7
- Ubuntu 20.04
- KubeRay 0.6.0
- NVIDIA GPU Operator v23.6.0
- Demo tested on Genesis Cloud with NVIDIA RTX 3090 GPUs
# Install Git, Make, and the apt HTTPS transport
sudo apt-get install apt-transport-https git make -y
# Install Helm 3
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /bin/
# Install the k3s server (v1.25.8) on the master (CPU) node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.8+k3s1 sh -
# Make the generated kubeconfig readable by the current user
sudo chown $USER /etc/rancher/k3s/k3s.yaml
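To verify the control plane is up, point kubectl at the k3s kubeconfig (k3s also bundles its own kubectl as `k3s kubectl`):
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get nodes   # the master node should report STATUS Ready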
Verify that the NVIDIA driver is working on each GPU node:
nvidia-smi
# Add the NVIDIA container toolkit apt repository (on the GPU nodes)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the toolkit
sudo apt-get update \
&& sudo apt-get install -y nvidia-container-toolkit
# On the master node, print the cluster join token
sudo cat /var/lib/rancher/k3s/server/node-token
# On each GPU node, export the token and the master's IP, then install the k3s agent
export K3S_NODE_TOKEN=<node-token-from-master>
export SERVER_IP=<public-or-private-IP-of-master>
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.8+k3s1 K3S_URL=https://${SERVER_IP}:6443 K3S_TOKEN=${K3S_NODE_TOKEN} sh -
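Back on the master node, the GPU nodes should appear shortly:
kubectl get nodes   # all three nodes should eventually report Ready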
The NVIDIA GPU Operator makes the GPUs on the nodes accessible to the cluster: it installs the components Kubernetes needs to discover, advertise, and schedule the GPUs.
More on the NVIDIA GPU Operator: https://github.com/NVIDIA/gpu-operator
# Add the NVIDIA Helm repository and install the GPU Operator.
# The driver and container toolkit are disabled because both are already present on the hosts.
sudo helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& sudo helm repo update
sudo helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --kubeconfig /etc/rancher/k3s/k3s.yaml
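Each GPU node should now advertise the nvidia.com/gpu resource:
kubectl describe node <gpu-node> | grep nvidia.com/gpu
As a further smoke test, you can run nvidia-smi in a throwaway GPU pod; the pod name and CUDA image tag below are illustrative assumptions, not part of this repo:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs -f gpu-smoke-test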
Once Kubeflow is installed, you can experiment with the GPUs from a Jupyter notebook. Install Kubeflow from the manifests:
git clone https://github.com/data-max-hq/manifests.git
cd manifests/
# Apply the Kubeflow example manifests, retrying until all resources are accepted
while ! kustomize build example | awk '!/well-defined/' | sudo k3s kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
# Wait until all Kubeflow pods are running
sudo kubectl get po -n kubeflow
# Expose the Kubeflow dashboard through the Istio ingress gateway
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 --address='0.0.0.0'
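The Kubeflow central dashboard is then reachable at http://localhost:8080 (the port-forward binds 0.0.0.0, so <node-ip>:8080 works as well). With the stock Kubeflow 1.7 manifests the default login is user@example.com with password 12341234; this fork may differ.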
# Install the KubeRay operator via Helm
sudo helm repo add kuberay https://ray-project.github.io/kuberay-helm/
sudo helm repo update
sudo helm upgrade --install \
kuberay-operator kuberay/kuberay-operator \
--namespace kuberay-operator \
--create-namespace \
--version 0.6.0 \
--kubeconfig /etc/rancher/k3s/k3s.yaml
# Check that the KubeRay operator pod is running
sudo kubectl get pods -n kuberay-operator
# Create the Ray cluster (see ray-cluster.sh in this repo)
sh ray-cluster.sh
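The script provisions a RayCluster custom resource through the KubeRay operator. For orientation, here is a rough sketch of what such a resource can look like with KubeRay 0.6 and Ray 2.6; the names, images, replica counts, and resource sizes are assumptions, so see ray-cluster.sh for the actual definition:
kubectl apply -f - <<EOF
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-gpu
spec:
  rayVersion: "2.6.3"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.6.3
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 2
    minReplicas: 2
    maxReplicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.6.3-gpu
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
KubeRay serves the Ray dashboard on port 8265 through a head service named <cluster-name>-head-svc, so with the sketch above it could be reached with:
kubectl port-forward svc/raycluster-gpu-head-svc 8265:8265 --address='0.0.0.0'
kubectl exec svc/raycluster-gpu-head-svc -- ray status   # should list 2 GPUs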
- Configure private registries in k3s: https://docs.k3s.io/installation/private-registry
- Restart k3s and k3s-agent (https://docs.k3s.io/upgrades/manual#restarting-k3s) if kubectl describe node <gpu-node> does not show the nvidia.com/gpu resource
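For systemd-based installs, the restart boils down to (per the k3s docs linked above):
sudo systemctl restart k3s         # on the server node
sudo systemctl restart k3s-agent   # on the agent (GPU) nodes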
- https://cloud.google.com/blog/products/ai-machine-learning/build-a-ml-platform-with-kubeflow-and-ray-on-gke
- https://github.com/ray-project/kuberay
- https://docs.ray.io/en/latest/cluster/kubernetes/examples/gpu-training-example.html#kuberay-gpu-training-example
- https://ray-project.github.io/kuberay/deploy/helm/
- https://docs.ray.io/en/latest/train/train.html
- https://github.com/NVIDIA/gpu-operator
Made with ❤️ by data-max.io.