[Docs] Multiple k8s support #4586

Open
wants to merge 19 commits into
base: master
1 change: 1 addition & 0 deletions docs/source/_static/custom.js
@@ -29,6 +29,7 @@ document.addEventListener('DOMContentLoaded', () => {
{ selector: '.toctree-l1 > a', text: 'Many Parallel Jobs' },
{ selector: '.toctree-l1 > a', text: 'Admin Policy Enforcement' },
{ selector: '.toctree-l1 > a', text: 'Using Existing Machines' },
{ selector: '.toctree-l2 > a', text: 'Multi-Kubernetes Clusters' },
];
newItems.forEach(({ selector, text }) => {
document.querySelectorAll(selector).forEach((el) => {
1 change: 1 addition & 0 deletions docs/source/images/multi-kubernetes.svg
1 change: 1 addition & 0 deletions docs/source/reference/kubernetes/index.rst
@@ -103,3 +103,4 @@ Table of Contents
Getting Started <kubernetes-getting-started>
kubernetes-setup
kubernetes-troubleshooting
multi-kubernetes
146 changes: 146 additions & 0 deletions docs/source/reference/kubernetes/multi-kubernetes.rst
@@ -0,0 +1,146 @@
.. _multi-kubernetes:

Multi-Kubernetes Clusters
=============================


SkyPilot allows you to manage dev pods, jobs, and services across multiple Kubernetes clusters through a single pane of glass.

You may have multiple Kubernetes clusters for different:

* **Use cases:** e.g., a production cluster and a development/testing cluster.
* **Regions or clouds:** e.g., US and EU regions; or AWS and Lambda clouds.
* **Accelerators:** e.g., an NVIDIA H100 cluster and a Google TPU cluster.
* **Configurations:** e.g., a small cluster for a single node and a large cluster for multiple nodes.
* **Kubernetes versions:** e.g., to upgrade a cluster from Kubernetes 1.20 to 1.21, you may create a new Kubernetes cluster to avoid downtime or unexpected errors.


.. image:: /images/multi-kubernetes.svg
    :width: 80%
    :align: center

.. original image: https://docs.google.com/presentation/d/1_NzqS_ccihsQKfbOTewPaH8D496zaHMuh-fvPsPf9y0/edit#slide=id.p

Configuration
-------------

Step 1: Set Up Credentials
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To work with multiple Kubernetes clusters, their credentials must be set up as individual `contexts <https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/>`_ in your local ``~/.kube/config`` file.

For deploying new clusters and getting credentials, see :ref:`kubernetes-setup-deploy`.

For example, a ``~/.kube/config`` file may look like this:

.. code-block:: yaml

    apiVersion: v1
    clusters:
    - cluster:
        certificate-authority-data:
          ...
        server: https://xx.xx.xx.xx:45819
      name: my-h100-cluster
    - cluster:
        certificate-authority-data:
          ...
        server: https://yy.yy.yy.yy:45819
      name: my-tpu-cluster
    contexts:
    - context:
        cluster: my-h100-cluster
        user: my-h100-cluster
      name: my-h100-cluster
    - context:
        cluster: my-tpu-cluster
        namespace: my-namespace
        user: my-tpu-cluster
      name: my-tpu-cluster
    current-context: my-h100-cluster
    ...


In this example, we have two Kubernetes clusters: ``my-h100-cluster`` and ``my-tpu-cluster``, and each Kubernetes cluster has a context for it.
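
To see which context names a kubeconfig exposes, the structure above can be walked directly. The following is a minimal sketch, not part of SkyPilot: the helper names ``list_contexts`` and ``current_context`` are hypothetical, and ``example_kubeconfig`` is a hand-written stand-in for the dictionary a YAML loader would return for the file shown above.

```python
from typing import Any, Dict, List, Optional


def list_contexts(kubeconfig: Dict[str, Any]) -> List[str]:
    """Return the context names in a parsed kubeconfig, in file order."""
    return [entry["name"] for entry in kubeconfig.get("contexts") or []]


def current_context(kubeconfig: Dict[str, Any]) -> Optional[str]:
    """Return the value of ``current-context``, or None if it is unset."""
    return kubeconfig.get("current-context")


# Hypothetical stand-in for the parsed ~/.kube/config shown above
# (only the fields relevant to contexts are included).
example_kubeconfig = {
    "apiVersion": "v1",
    "contexts": [
        {
            "context": {"cluster": "my-h100-cluster", "user": "my-h100-cluster"},
            "name": "my-h100-cluster",
        },
        {
            "context": {
                "cluster": "my-tpu-cluster",
                "namespace": "my-namespace",
                "user": "my-tpu-cluster",
            },
            "name": "my-tpu-cluster",
        },
    ],
    "current-context": "my-h100-cluster",
}

print(list_contexts(example_kubeconfig))
print(current_context(example_kubeconfig))
```

The names returned by ``list_contexts`` are the values you would later refer to when selecting a Kubernetes cluster.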

Step 2: Setup SkyPilot to Access Multiple Kubernetes Clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unlike with clouds, SkyPilot does not fail over across different Kubernetes clusters (regions) by default, because each Kubernetes cluster may serve a different purpose.

By default, SkyPilot uses only the current context in your kubeconfig (``current-context: my-h100-cluster`` in the example above). You can print the current context with ``kubectl config current-context``.

To allow SkyPilot to access multiple Kubernetes clusters, set ``kubernetes.allowed_contexts`` in the SkyPilot config:

.. code-block:: yaml

    kubernetes:
      allowed_contexts:
        - my-h100-cluster
        - my-tpu-cluster

To check the enabled Kubernetes clusters, you can run ``sky check kubernetes``.

.. code-block:: console

    $ sky check kubernetes

    🎉 Enabled clouds 🎉
      ✔ Kubernetes
        Allowed contexts:
        ├── my-h100-cluster
        └── my-tpu-cluster
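
The context tree in this output can be built with a small loop. The sketch below is modeled on the ``sky/check.py`` change in this PR, but it is an illustration only: the function name ``format_contexts``, the ``len(...) > 1`` condition, and the four-space indent are assumptions, since the surrounding code and exact whitespace are not shown in the diff.

```python
from typing import List


def format_contexts(existing_contexts: List[str]) -> str:
    """Format context names as a tree; expects a non-empty list.

    Assumption: a single context is reported as the active context,
    and several contexts are reported as the allowed contexts.
    """
    if len(existing_contexts) > 1:
        contexts_formatted = []
        for i, context in enumerate(existing_contexts):
            # The last entry gets the closing branch symbol.
            symbol = "└── " if i == len(existing_contexts) - 1 else "├── "
            # The four-space indent is an assumed value.
            contexts_formatted.append(f"\n    {symbol}{context}")
        return f"Allowed contexts:{''.join(contexts_formatted)}"
    return f"Active context: {existing_contexts[0]}"


print(format_contexts(["my-h100-cluster", "my-tpu-cluster"]))
```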


Failover across Multiple Kubernetes Clusters
--------------------------------------------

With the ``kubernetes.allowed_contexts`` global config set, SkyPilot fails over through the Kubernetes clusters listed in ``allowed_contexts``, in the order
in which they are specified.


.. code-block:: console

    $ sky launch --gpus H100 --cloud kubernetes echo 'Hello World'

    Considered resources (1 node):
    ------------------------------------------------------------------------------------------------------------
     CLOUD        INSTANCE           vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE           COST ($)   CHOSEN
    ------------------------------------------------------------------------------------------------------------
     Kubernetes   2CPU--8GB--1H100   2       8         H100:1         my-h100-cluster-gke   0.00          ✔
     Kubernetes   2CPU--8GB--1H100   2       8         H100:1         my-h100-cluster-eks   0.00
    ------------------------------------------------------------------------------------------------------------
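
Conceptually, ordered failover means trying each allowed context in turn and stopping at the first one that can satisfy the request. The following is a minimal sketch of that idea, not SkyPilot's actual implementation: ``launch_with_failover``, the ``try_launch`` callback, and the ``free_h100s`` availability numbers are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def launch_with_failover(
    allowed_contexts: Sequence[str],
    try_launch: Callable[[str], bool],
) -> Optional[str]:
    """Try each context in the given order; return the first that succeeds.

    Returns None if no context can satisfy the request.
    """
    for context in allowed_contexts:
        if try_launch(context):
            return context
    return None


# Hypothetical availability: pretend the first cluster has no free H100s.
free_h100s = {"my-h100-cluster-gke": 0, "my-h100-cluster-eks": 8}

chosen = launch_with_failover(
    ["my-h100-cluster-gke", "my-h100-cluster-eks"],
    lambda context: free_h100s.get(context, 0) >= 1,
)
print(chosen)
```

Because the list is traversed in order, placing the preferred cluster first in ``allowed_contexts`` makes it the first candidate.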


Point to a Kubernetes Cluster and Launch
-----------------------------------------

SkyPilot borrows the ``region`` concept from clouds to denote a Kubernetes context. You can point to a Kubernetes cluster
by specifying the ``--region`` with the context name for that cluster.

.. code-block:: console

    # Check the GPUs available in a Kubernetes cluster
    $ sky show-gpus --cloud kubernetes --region my-h100-cluster

    Kubernetes GPUs (Context: my-h100-cluster)
    GPU   QTY_PER_NODE            TOTAL_GPUS  TOTAL_FREE_GPUS
    H100  1, 2, 3, 4, 5, 6, 7, 8  8           8

    Kubernetes per node GPU availability
    NODE_NAME             GPU_NAME  TOTAL_GPUS  FREE_GPUS
    my-h100-cluster-hbzn  H100      8           8
    my-h100-cluster-w5x7  None      0           0

When launching a SkyPilot cluster or task, you can likewise pass the context name to ``--region`` to choose the Kubernetes cluster it runs in.

.. code-block:: console

    $ sky launch --cloud kubernetes --region my-tpu-cluster echo 'Hello World'


Dynamically Update Kubernetes Clusters to Use
----------------------------------------------

You can configure SkyPilot to dynamically fetch Kubernetes cluster configs and enforce restrictions on which clusters are used. Refer to :ref:`dynamic-kubernetes-contexts-update-policy` for more.

4 changes: 2 additions & 2 deletions sky/check.py
@@ -245,10 +245,10 @@ def _format_enabled_cloud(cloud_name: str) -> str:
# here we are using rich. We should migrate this file to
# use colorama as we do in the rest of the codebase.
symbol = ('└── ' if i == len(existing_contexts) - 1 else '├── ')
- contexts_formatted.append(f'\n {symbol}{context}')
+ contexts_formatted.append(f'\n {symbol}{context}')
Review comment from @Michaelvll (Collaborator, Author), Jan 31, 2025:

Reduced the indent for cleaner view, cc'ing @romilbhardwaj : )

context_info = f'Allowed contexts:{"".join(contexts_formatted)}'
else:
context_info = f'Active context: {existing_contexts[0]}'

- return f'{cloud_name}[/green][dim]\n └── {context_info}[/dim][green]'
+ return f'{cloud_name}[/green][dim]\n {context_info}[/dim][green]'
return cloud_name