Report Management provides data statistics for clusters, nodes, pods, workspaces, and namespaces across five dimensions: CPU Utilization, Memory Utilization, Storage Utilization, GPU Utilization, and GPU Memory Utilization. It also integrates with the audit and alert modules to support statistical management of audit and alert data, for a total of seven types of reports.
Check if the GPU in the cluster has been detected. Click Clusters -> Cluster Settings -> Addon Plugins , and check if the proper GPU type has been automatically enabled and detected.
Currently, the cluster will automatically enable GPU and set the GPU type as Iluvatar .
Confirm whether the cluster has detected the GPU. Click Clusters -> Cluster Settings -> Addon Plugins ,
and check whether the proper GPU type is automatically enabled and detected.
Currently, the cluster will automatically enable GPU and set the GPU type to Ascend .
Similar to regular computer hardware, NVIDIA GPUs, as physical devices, need to have the NVIDIA GPU driver installed in order to be used. To reduce the cost of using GPUs on Kubernetes, NVIDIA provides the NVIDIA GPU Operator component to manage various components required for using NVIDIA GPUs. These components include the NVIDIA driver (for enabling CUDA), NVIDIA container runtime, GPU node labeling, DCGM-based monitoring, and more. In theory, users only need to plug the GPU into a compute device managed by Kubernetes, and they can use all the capabilities of NVIDIA GPUs through the GPU Operator. For more information about NVIDIA GPU Operator, refer to the NVIDIA official documentation. For deployment instructions, refer to Offline Installation of GPU Operator.
This chapter provides installation guidance for MetaX's gpu-extensions, gpu-operator, and other components, as well as usage methods for both the full GPU and vGPU modes.
The required tar package has been downloaded and installed from the MetaX Software Center. This article uses metax-gpu-k8s-package.0.7.10.tar.gz as an example.
Metax provides two helm-chart packages: metax-extensions and gpu-operator. Depending on the usage scenario, different components can be selected for installation.
Metax-extensions: Includes two components, gpu-device and gpu-label. When using the Metax-extensions solution, the user's application container image needs to be built based on the MXMACA® base image. Moreover, Metax-extensions is only suitable for scenarios using the full GPU.
gpu-operator: Includes components such as gpu-device, gpu-label, driver-manager, container-runtime, and operator-controller. When using the gpu-operator solution, users can choose to create application container images that do not include the MXMACA® SDK. The gpu-operator is suitable for both full GPU and vGPU scenarios.
The current cluster has installed the Cambricon firmware, drivers, and DevicePlugin components. For installation details, please refer to the official documentation:
When installing DevicePlugin, please disable the --enable-device-type parameter; otherwise, the Suanova AI computing platform will not be able to correctly recognize the Cambricon GPU.
The container management module of the AI platform has been deployed and is running properly.
The container management module has been connected to a Kubernetes cluster or a Kubernetes cluster has been created, and you can access the UI interface of the cluster.
GPU Operator has been offline installed and NVIDIA DevicePlugin has been enabled on the current cluster. Refer to Offline Installation of GPU Operator for instructions.
The GPU in the current cluster has not undergone any virtualization operations or been occupied by other applications.
NVIDIA, as a well-known graphics computing provider, offers various software and hardware solutions to enhance computational power. Among them, NVIDIA provides the following three solutions for GPU usage:
Full GPU refers to allocating the entire NVIDIA GPU to a single user or application. In this configuration, the application can fully occupy all the resources of the GPU and achieve maximum computational performance. Full GPU is suitable for workloads that require a large amount of computational resources and memory, such as deep learning training, scientific computing, etc.
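As a concrete illustration, a workload that needs the whole card simply requests one nvidia.com/gpu in its container resource limits. A minimal Pod sketch, assuming the NVIDIA device plugin from the GPU Operator is active (the name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: full-gpu-demo                              # illustrative name
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                        # the entire physical GPU is allocated to this container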
The kernel version of the cluster nodes where the gpu-operator is to be deployed must be completely consistent. The distribution and GPU model of the nodes must fall within the scope specified in the GPU Support Matrix.
When installing the gpu-operator, select v23.9.0+2 or above.
NVIDIA Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA on H100, A100, and A30 series GPUs. Its purpose is to divide a physical GPU into multiple GPU instances to provide finer-grained resource sharing and isolation. MIG can split a GPU into up to seven GPU instances, allowing a single physical GPU to provide separate GPU resources to multiple users, maximizing GPU utilization.
This feature enables multiple applications or users to share GPU resources simultaneously, improving the utilization of computational resources and increasing system scalability.
With MIG, each GPU instance's processor has an independent and isolated path throughout the entire memory system, including cross-switch ports on the chip, L2 cache groups, memory controllers, and DRAM address buses, all uniquely allocated to a single instance.
This ensures that the workload of individual users can run with predictable throughput and latency, along with identical L2 cache allocation and DRAM bandwidth. MIG can partition available GPU compute resources (such as streaming multiprocessors or SMs and GPU engines like copy engines or decoders) to provide defined quality of service (QoS) and fault isolation for different clients such as virtual machines, containers, or processes. MIG enables multiple GPU instances to run in parallel on a single physical GPU.
MIG allows multiple vGPUs (and virtual machines) to run in parallel on a single GPU instance while retaining the isolation guarantees provided by vGPU. For more details on using vGPU and MIG for GPU partitioning, refer to NVIDIA Multi-Instance GPU and NVIDIA Virtual Compute Server.
The following diagram provides an overview of MIG, illustrating how it virtualizes one physical GPU into seven GPU instances that can be used by multiple users.
SM (Streaming Multiprocessor): The core computational unit of a GPU responsible for executing graphics rendering and general-purpose computing tasks. Each SM contains a group of CUDA cores, as well as shared memory, register files, and other resources, capable of executing multiple threads concurrently. Each MIG instance has a certain number of SMs and other related resources, along with the allocated memory slices.
Confirm if the cluster has recognized the GPU type.
Go to Cluster Details -> Nodes and check if it has been correctly recognized as MIG.
When deploying an application using an image, you can select and use NVIDIA MIG resources.
Example of MIG Single Mode (used in the same way as a full GPU):
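A minimal sketch of what such a workload could look like under the single strategy, where a MIG instance is still requested as nvidia.com/gpu (the name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: mig-single-demo                            # illustrative name
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                        # one MIG instance, requested like a full GPU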
Note
The MIG single policy allows users to request and use GPU resources in the same way as a full GPU (nvidia.com/gpu). The difference is that these resources can be a portion of the GPU (MIG device) rather than the entire GPU. Learn more from the GPU MIG Mode Design.
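For comparison, when the MIG strategy is set to mixed, each MIG profile is exposed under its own resource name instead of nvidia.com/gpu. A hedged sketch, assuming a 1g.10gb profile has been created on the node (the profile, name, and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: mig-mixed-demo                             # illustrative name
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1                # one 1g.10gb MIG instance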
Using vGPU through YAML Configuration
limits:
  nvidia.com/gpucores: '20'   # Request 20% of GPU cores for each card
  nvidia.com/gpumem: '200'    # Request 200 MB of GPU memory for each card
  nvidia.com/vgpu: '1'        # Request 1 GPU
imagePullPolicy: Always
restartPolicy: Always
This YAML configuration requests the application to use vGPU resources. It specifies that each card should utilize 20% of GPU cores, 200MB of GPU memory, and requests 1 GPU.
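For context, here is a minimal sketch showing where these limits sit in a complete Pod manifest; the name and image are illustrative, and the resource names are taken from the snippet above:

apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo                                  # illustrative name
spec:
  restartPolicy: Always
  containers:
    - name: app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
      imagePullPolicy: Always
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpucores: '20'                # 20% of GPU cores per card
          nvidia.com/gpumem: '200'                 # 200 MB of GPU memory per card
          nvidia.com/vgpu: '1'                     # 1 vGPU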
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plug-in on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container, only positive integers are supported. The GPU quota setting supports setting exclusive use of the entire GPU or part of the vGPU for the container. For example, for an 8-core GPU, enter the number 8 to let the container exclusively use the entire length of the card, and enter the number 1 to configure a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plug-in on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
etcd backup takes cluster data as the core of the backup. In cases such as hardware damage or configuration errors in development and testing, cluster data can be restored from an etcd backup.
This page explains how to create a Worker Cluster. By default, when creating a new Worker Cluster, the operating system type and CPU architecture of the worker nodes should be consistent with the Global Service Cluster. If you want to create a cluster with a different operating system or architecture than the Global Management Cluster, refer to Creating an Ubuntu Worker Cluster on a CentOS Management Platform for instructions.
Architecture diagram of NVIDIA GPU Operator:
As the number of business applications continues to grow, the resources of the cluster become increasingly tight. At this point, you can expand the cluster nodes based on kubean. After the expansion, applications can run on the newly added nodes, alleviating resource pressure.
This is the archive site of the 豐收二號 AI computing center.
End User Manual: in a containerized environment, use cloud hosts to develop AI algorithms and build training and inference jobs.
AI Lab 5.0 provides a job scheduler that helps you manage jobs better; in addition to the basic scheduler, user-defined schedulers are also supported.
In Kubernetes, the job scheduler is responsible for deciding which node a Pod is assigned to run on. It considers many factors, such as resource requirements, hardware/software constraints, affinity/anti-affinity rules, and data locality.
The default scheduler is a core component of a Kubernetes cluster, responsible for deciding which node a Pod runs on. Let's take a closer look at how it works, its features, and how to configure it.
The scheduler iterates over all nodes and filters out those that do not satisfy the Pod's requirements. The factors it considers include:
The above describes how to configure and use scheduler options for jobs in AI Lab.
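For reference, the scheduler choice ultimately takes effect on the Pod (or the Pod template of a job) through the standard Kubernetes schedulerName field. A minimal sketch with an illustrative scheduler name:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduler-demo            # illustrative name
spec:
  schedulerName: my-custom-scheduler     # illustrative; omit this field to use the default kube-scheduler
  containers:
    - name: app
      image: busybox                     # illustrative image
      command: ["sleep", "3600"]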
CPU: at least 8 cores, 16 cores recommended
Memory: 64 GB, 128 GB recommended
Info
Before you start, please check that the AI computing platform and AI Lab are deployed correctly, that GPU queue resources have been initialized successfully, and that sufficient compute resources are available.
AI Lab preheats data fully automatically in the background so that subsequent jobs can access the data quickly.
AI Lab provides environment management capabilities, decoupling Python dependency management from development tools and job images, which solves problems such as chaotic dependency management and inconsistent environments.
Here, the environment management feature of AI Lab is used to create the environment required for ChatGLM3 fine-tuning, for later use.
AI Lab provides Notebook as an IDE, allowing users to write and run code directly in the browser and view the results, which is well suited to development in fields such as data analysis, machine learning, and deep learning.
You can use the JupyterLab Notebook provided by AI Lab to carry out the ChatGLM3 fine-tuning task.
This article uses ChatGLM3 as an example to help you quickly get familiar with model fine-tuning in AI Lab, fine-tuning the ChatGLM3 model with LoRA.
AI Lab provides a rich set of features that help model developers quickly carry out model development, fine-tuning, inference, and other tasks, and it also provides rich OpenAPI interfaces for convenient integration with third-party application ecosystems.
Label Studio is an open-source data labeling tool for all kinds of machine learning and artificial intelligence tasks. A brief introduction to Label Studio follows:
Through its flexibility and rich feature set, Label Studio provides data scientists and machine learning engineers with a powerful data labeling solution.
Deploy to the AI computing platform
To use Label Studio in AI Lab, it needs to be deployed to the global service cluster; you can deploy it quickly with Helm.
Note
For more deployment details, refer to Deploy Label Studio on Kubernetes.
If you want to add Label Studio to the navigation bar, you can refer to the Global Management OEM IN approach. The following example shows how to add it as a second-level navigation entry under AI Lab.
The above shows how to add Label Studio and use it as the labeling component of AI Lab: by adding labeled data to AI Lab datasets and linking it with algorithm development, the algorithm development workflow is completed. For subsequent usage, please refer to the other documents.
The Developer Console is where developers routinely run AI inference, large model training, and other jobs.
This article provides a simple operation guide so that users can use AI Lab for the entire development and training workflow of datasets, Notebooks, and training jobs.
After the environment has been preheated successfully, you only need to mount this environment into a Notebook or training job and use the base image provided by AI Lab.
AI Lab provides all the dataset management capabilities needed for model development, training, and inference. It currently supports unified access to multiple data sources.
With simple configuration, data sources can be connected to AI Lab, providing unified data governance, preheating, dataset management, and other capabilities.
This article explains how to manage your environment dependency libraries in AI Lab; the specific steps and notes are as follows.
With the rapid iteration of AI Lab, inference services for a variety of models are now supported; here you can see information about the supported models.
AI Lab v0.3.0 introduced model inference services. For traditional deep learning models, users can directly use the inference service of AI Lab without worrying about model deployment and maintenance.
AI Lab v0.6.0 supports the complete vLLM inference capability, supporting many large language models such as LLama, Qwen, and ChatGLM.
In AI Lab you can use GPU types that have been validated on the Suanova AI computing platform; for more details, see the GPU Support Matrix.
Traditional deep learning models are well supported through the Triton Inference Server. We currently support the mainstream inference backends:
AI Lab currently provides Triton and vLLM as inference frameworks; users only need a simple configuration to quickly start a high-performance inference service.
Request authentication via API key is supported, and users can add custom authentication parameters.
After the inference service is created, click the inference service name to enter the details page and view the API invocation method. You can verify the execution result using Curl, Python, Node.js, and other approaches.
Similarly, we can enter the job details to view resource usage and the log output of each Pod.
The AI Lab module provides an important visualization and analysis tool for the model development process, used to display the training process and results of machine learning models. This article introduces the basic concepts of Job Analysis (Tensorboard), how to use it in the AI Lab system, and how to configure the log content of datasets.
In the AI Lab system, we provide a convenient way to create and manage Tensorboard. The specific steps are as follows:
Create a distributed job: create a new distributed training job on the AI Lab platform.
Similarly, we can enter the job details to view resource usage and the log output of each Pod.
jovyan@19d0197587cc:/$ baizectl
AI platform management tool

Usage:
  baizectl [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  data        Management datasets
  help        Help about any command
  job         Manage jobs
  login       Login to the platform
  version     Show cli version

Flags:
      --cluster string      Cluster name to operate
  -h, --help                help for baizectl
      --mode string         Connection mode: auto, api, notebook (default "auto")
  -n, --namespace string    Namespace to use for the operation. If not set, the default Namespace will be used.
  -s, --server string       Suanova AI computing platform access base url
      --skip-tls-verify     Skip TLS certificate verification
      --token string        Suanova AI computing platform access token
  -w, --workspace int32     Workspace ID to use for the operation

Use "baizectl [command] --help" for more information about a command.
jovyan@19d0197587cc:/$ baizectl job
Manage jobs

Usage:
  baizectl job [command]

Available Commands:
  delete      Delete a job
  logs        Show logs of a job
  ls          List jobs
  restart     restart a job
  submit      Submit a job

Flags:
  -h, --help             help for job
  -o, --output string    Output format. One of: table, json, yaml (default "table")
      --page int         Page number (default 1)
      --page-size int    Page size (default -1)
      --search string    Search query
      --sort string      Sort order
      --truncate int     Truncate output to the given length, 0 means no truncation (default 50)

Use "baizectl job [command] --help" for more information about a command.
(base) jovyan@den-0:~$ baizectl job submit --help
Submit a job

Usage:
  baizectl job submit [flags] -- command ...

Aliases:
  submit, create

Examples:
# Submit a job to run the command "torchrun python train.py"
baizectl job submit -- torchrun python train.py
# Submit a job with 2 workers (each pod uses 4 gpus) to run the command "torchrun python train.py" and use the image "pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime"
baizectl job submit --image pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime --workers 2 --resources nvidia.com/gpu=4 -- torchrun python train.py
# Submit a tensorflow job to run the command "python train.py"
baizectl job submit --tensorflow -- python train.py

Flags:
      --annotations stringArray      The annotations of the job, the format is key=value
      --auto-load-env                It only takes effect when executed in Notebook, the environment variables of the current environment will be automatically read and set to the environment variables of the Job, the specific environment variables to be read can be specified using the BAIZE_MAPPING_ENVS environment variable, the default is PATH,CONDA_*,*PYTHON*,NCCL_*, if set to false, the environment variables of the current environment will not be read. (default true)
      --commands stringArray         The default command of the job
  -d, --datasets stringArray         The dataset bind to the job, the format is datasetName:mountPath, e.g. mnist:/data/mnist
  -e, --envs stringArray             The environment variables of the job, the format is key=value
  -x, --from-notebook string         Define whether to read the configuration of the current Notebook and directly create tasks, including images, resources, Dataset, etc.
                                     auto: Automatically determine the mode according to the current environment. If the current environment is a Notebook, it will be set to notebook mode.
                                     false: Do not read the configuration of the current Notebook.
                                     true: Read the configuration of the current Notebook. (default "auto")
  -h, --help                         help for submit
      --image string                 The image of the job, it must be specified if fromNotebook is false.
  -t, --job-type string              Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
      --labels stringArray           The labels of the job, the format is key=value
      --max-retries int32            number of retries before marking this job failed
      --max-run-duration int         Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it
      --name string                  The name of the job, if empty, the name will be generated automatically.
      --paddle                       PaddlePaddle Job, has higher priority than --job-type
      --priority string              The priority of the job, current support baize-medium-priority, baize-low-priority, baize-high-priority
      --pvcs stringArray             The pvcs bind to the job, the format is pvcName:mountPath, e.g. mnist:/data/mnist
      --pytorch                      Pytorch Job, has higher priority than --job-type
      --queue string                 The queue to used
      --requests-resources stringArray   Similar to resources, but sets the resources of requests
      --resources stringArray        The resources of the job, it is a string in the format of cpu=1,memory=1Gi,nvidia.com/gpu=1, it will be set to the limits and requests of the container.
      --restart-policy string        The job restart policy (default "on-failure")
      --runtime-envs                 The runtime environment to use for the job, you can use baizectl data ls --runtime-env to get the runtime environment
      --shm-size int32               The shared memory size of the job, default is 0, which means no shared memory, if set to more than 0, the job will use the shared memory, the unit is MiB
      --tensorboard-log-dir string   The tensorboard log directory, if set, the job will automatically start tensorboard, else not. The format is /path/to/log, you can use relative path in notebook.
      --tensorflow                   Tensorflow Job, has higher priority than --job-type
      --workers int                  The workers of the job, default is 1, which means single worker, if set to more than 1, the job will be distributed. (default 1)
      --working-dir string           The working directory of job container, if in notebook mode, the default is the directory of the current file
(base) jovyan@den-0:~$ baizectl job logs --help
Show logs of a job

Usage:
  baizectl job logs <job-name> [pod-name] [flags]

Aliases:
  logs, log

Flags:
  -f, --follow            Specify if the logs should be streamed.
  -h, --help              help for logs
  -t, --job-type string   Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
      --paddle            PaddlePaddle Job, has higher priority than --job-type
      --pytorch           Pytorch Job, has higher priority than --job-type
      --tail int          Lines of recent log file to display.
      --tensorflow        Tensorflow Job, has higher priority than --job-type
      --timestamps        Show timestamps
(base) jovyan@den-0:~$ baizectl job log -t TENSORFLOW tf-sample-job-v2-202406161632-evgrbrhn -f
2024-06-16 08:33:06.083766: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-16 08:33:06.086189: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-16 08:33:06.132416: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-16 08:33:06.132903: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-16 08:33:07.223046: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 Conv1 (Conv2D)              (None, 13, 13, 8)         80

 flatten (Flatten)           (None, 1352)              0

 Softmax (Dense)             (None, 10)                13530

=================================================================
Total params: 13610 (53.16 KB)
Trainable params: 13610 (53.16 KB)
Non-trainable params: 0 (0.00 Byte)
...
(base) jovyan@den-0:~$ baizectl data
Management datasets

Usage:
  baizectl data [flags]
  baizectl data [command]

Aliases:
  data, dataset, datasets, envs, runtime-envs

Available Commands:
  ls          List datasets

Flags:
  -h, --help             help for data
  -o, --output string    Output format. One of: table, json, yaml (default "table")
      --page int         Page number (default 1)
      --page-size int    Page size (default -1)
      --search string    Search query
      --sort string      Sort order
      --truncate int     Truncate output to the given length, 0 means no truncation (default 50)

Use "baizectl data [command] --help" for more information about a command.
baizectl data supports listing datasets with the ls command. The table format is displayed by default, and users can specify the output format with the -o parameter.
(base) jovyan@den-0:~$ baizectl data ls
 NAME             TYPE   URI                                                  PHASE
 fashion-mnist    GIT    https://gitee.com/samzong_lu/fashion-mnist.git       READY
 sample-code      GIT    https://gitee.com/samzong_lu/training-sample-code.... READY
 training-output  PVC    pvc://training-output                                READY
jovyan@19d0197587cc:/$ baizess
source switch tool

Usage:
  baizess [command] [package-manager]

Available Commands:
  set     Switch the source of specified package manager to current fastest source
  reset   Reset the source of specified package manager to default source

Available Package-managers:
  apt (require root privilege)
  conda
  pip
Notebook provides an online, interactive web programming environment that makes it easy for developers to quickly run data science and machine learning experiments.
When the system prompts you to "Enter a file in which to save the key", you can press Enter to use the default path, or specify a new path.
Log in to the Suanova AI computing platform, then open the account menu in the upper-right corner and select Personal Center.
Click Next, and PyCharm will try to connect to the remote server. If the connection succeeds, you will be asked to enter a password or select a private key file.
Operations Management is the workspace where IT operations staff manage IT resources and handle their daily work.
This article will continuously collect and organize errors that may occur while using AI Lab due to environment issues or non-standard operations, as well as analyses of and solutions to certain errors encountered during use.
Warning
This document applies only to the AI computing center version. If you run into problems while using AI Lab, please consult this troubleshooting manual first.
AI Lab, whose module name in the AI computing center is baize, provides one-stop capabilities for model training, inference, model management, and more.
In the AI Lab developer console or operations console, the desired cluster cannot be found in the drop-down list of the cluster search condition of a functional module.
In AI Lab, if the cluster drop-down list is missing the cluster you want, it may be caused by one of the following reasons:
baize-agent is not installed or the installation failed, so AI Lab cannot obtain the cluster information
The cluster name was not configured when installing baize-agent, so AI Lab cannot obtain the cluster information
AI Lab has some base components that must be installed in every worker cluster. If baize-agent is not installed in a worker cluster, you can choose to install it in the UI; a missing installation may lead to unexpected errors and other problems.
If the observability components in the cluster are abnormal, AI Lab may be unable to obtain cluster information; please check whether the platform's observability services are running properly and are correctly configured.
In AI Lab, if the namespace specified when creating a service does not contain a LocalQueue, you will be prompted to initialize the queue.
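The LocalQueue mentioned here is namespace-scoped; assuming it follows the Kueue LocalQueue API, initializing a queue in a namespace could look like the following sketch (the names are illustrative and the exact spec used by AI Lab may differ):

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default                  # illustrative queue name
  namespace: demo-namespace      # the namespace that reported the missing LocalQueue
spec:
  clusterQueue: cluster-queue    # illustrative ClusterQueue to bind to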
The Suanova AI computing platform provides preset system roles to help users simplify the steps of using role permissions.
Note
\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u63d0\u4f9b\u4e86\u4e09\u79cd\u7c7b\u578b\u7684\u7cfb\u7edf\u89d2\u8272\uff0c\u5206\u522b\u4e3a\u5e73\u53f0\u89d2\u8272\u3001\u5de5\u4f5c\u7a7a\u95f4\u89d2\u8272\u548c\u6587\u4ef6\u5939\u89d2\u8272\u3002
IAM\uff08Identity and Access Management\uff0c\u7528\u6237\u4e0e\u8bbf\u95ee\u63a7\u5236\uff09\u662f\u5168\u5c40\u7ba1\u7406\u7684\u4e00\u4e2a\u91cd\u8981\u6a21\u5757\uff0c\u60a8\u53ef\u4ee5\u901a\u8fc7\u7528\u6237\u4e0e\u8bbf\u95ee\u63a7\u5236\u6a21\u5757\u521b\u5efa\u3001\u7ba1\u7406\u548c\u9500\u6bc1\u7528\u6237\uff08\u7528\u6237\u7ec4\uff09\uff0c\u5e76\u4f7f\u7528\u7cfb\u7edf\u89d2\u8272\u548c\u81ea\u5b9a\u4e49\u89d2\u8272\u63a7\u5236\u5176\u4ed6\u7528\u6237\u4f7f\u7528\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u7684\u6743\u9650\u3002
\u5f53\u60a8\u5e0c\u671b\u672c\u4f01\u4e1a\u5458\u5de5\u53ef\u4ee5\u4f7f\u7528\u4f01\u4e1a\u5185\u90e8\u7684\u8ba4\u8bc1\u7cfb\u7edf\u767b\u5f55\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\uff0c\u800c\u4e0d\u9700\u8981\u5728\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u521b\u5efa\u5bf9\u5e94\u7684\u7528\u6237\uff0c\u60a8\u53ef\u4ee5\u4f7f\u7528\u7528\u6237\u4e0e\u8bbf\u95ee\u63a7\u5236\u7684\u8eab\u4efd\u63d0\u4f9b\u5546\u529f\u80fd\uff0c\u5efa\u7acb\u60a8\u6240\u5728\u4f01\u4e1a\u4e0e\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u7684\u4fe1\u4efb\u5173\u7cfb\uff0c\u901a\u8fc7\u8054\u5408\u8ba4\u8bc1\u4f7f\u5458\u5de5\u4f7f\u7528\u4f01\u4e1a\u5df2\u6709\u8d26\u53f7\u76f4\u63a5\u767b\u5f55\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\uff0c\u5b9e\u73b0\u5355\u70b9\u767b\u5f55\u3002
\u5168\u5c40\u7ba1\u7406\u652f\u6301\u57fa\u4e8e LDAP \u548c OIDC \u534f\u8bae\u7684\u5355\u70b9\u767b\u5f55\uff0c\u5982\u679c\u60a8\u7684\u4f01\u4e1a\u6216\u7ec4\u7ec7\u5df2\u6709\u81ea\u5df1\u7684\u8d26\u53f7\u4f53\u7cfb\uff0c\u540c\u65f6\u5e0c\u671b\u7ba1\u7406\u7ec4\u7ec7\u5185\u7684\u6210\u5458\u4f7f\u7528\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u8d44\u6e90\uff0c\u60a8\u53ef\u4ee5\u4f7f\u7528\u5168\u5c40\u7ba1\u7406\u63d0\u4f9b\u7684\u8eab\u4efd\u63d0\u4f9b\u5546\u529f\u80fd\uff0c\u800c\u4e0d\u5fc5\u5728\u60a8\u7684\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u4e2d\u4e3a\u6bcf\u4e00\u4f4d\u7ec4\u7ec7\u6210\u5458\u521b\u5efa\u7528\u6237\u540d/\u5bc6\u7801\u3002\u60a8\u53ef\u4ee5\u5411\u8fd9\u4e9b\u5916\u90e8\u7528\u6237\u8eab\u4efd\u6388\u4e88\u4f7f\u7528\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u8d44\u6e90\u7684\u6743\u9650\u3002
An identity provider is the service responsible for collecting and storing user identity information such as usernames and passwords, and for authenticating users at login. When an enterprise authenticates against the AI platform, the identity provider refers to the enterprise's own identity provider.
A service provider establishes a trust relationship with the identity provider (IdP) and uses the user information supplied by the IdP to deliver concrete services to users. When an enterprise authenticates against the AI platform, the service provider is the AI platform itself.
Administrators do not need to re-create users on the AI platform.
Before identity provider authentication is used, administrators must create an account for each user in both the enterprise management system and the AI platform. After identity provider authentication is enabled, the enterprise administrator only needs to create the account in the enterprise management system, and the user can access both systems, which lowers user management costs.
Before identity provider authentication is used, users must log in to the enterprise management system and the AI platform with two separate accounts. Afterwards, logging in to the enterprise management system is enough to access both systems.
LDAP stands for Lightweight Directory Access Protocol. It is an open, vendor-neutral, industry-standard application protocol that provides access control and maintains directory information for distributed data over IP networks.
If your enterprise or organization already has its own account system and your user management system supports the LDAP protocol, you can use the LDAP-based identity provider feature of Global Management instead of creating a username/password for every member on the AI platform, and then grant these external identities permission to use AI platform resources.
After a trust relationship is established between your enterprise user management system and the AI platform via LDAP, users or user groups from the enterprise system can be synchronized to the AI platform in one batch, either manually or automatically.
After synchronization, administrators can grant permissions to users and user groups in bulk, and users can log in to the AI platform with the username/password from the enterprise user management system.
If all members of your enterprise or organization are managed in WeCom (Enterprise WeChat), you can use the OAuth 2.0-based identity provider feature of Global Management instead of creating a username/password for every member on the AI platform, and then grant these external identities permission to use AI platform resources.
Fields for WeCom integration:
- Corp ID: the ID of your WeCom organization
- Agent ID: the ID of the self-built application
- ClientSecret: the Secret of the self-built application
Fields for OIDC integration:
- Provider Name: shown on the login page as the entry point of the identity provider
- Authentication Method: the client authentication method. If the JWT is signed with a private key, select JWT signed with private key from the drop-down. See Client Authentication for details.
- Client ID: the ID of the client
- Client Secret: the client secret (password)
- Client URL: the login URL, Token URL, user information URL, and logout URL can all be fetched in one click from the identity provider's well-known endpoint
- Auto Association: when enabled, if an identity provider username/email duplicates an AI platform username/email, the two accounts are automatically linked
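For reference, the well-known endpoint mentioned above is the standard OIDC discovery document. As a sketch (the issuer URL below is a placeholder; replace it with your identity provider's issuer), it can be inspected with:

# The issuer URL is hypothetical; substitute your IdP's issuer
curl https://idp.example.com/.well-known/openid-configuration
# The JSON response contains authorization_endpoint, token_endpoint, userinfo_endpoint,
# and end_session_endpoint, which map to the login URL, Token URL, user information URL,
# and logout URL described above.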
Note
User information is synchronized to User and Access Control -> User List on the AI platform only after the user has logged in to the AI platform through the enterprise user management system for the first time.
The AI platform has three role scopes, which flexibly and effectively address your permission needs:
Platform roles are coarse-grained permissions that apply to all relevant resources on the platform. A platform role can grant a user create/read/update/delete permissions on all clusters, all workspaces, and so on, but cannot be scoped to a specific cluster or workspace. The AI platform provides 5 preset platform roles that can be used directly:
Admin
Kpanda Owner
Workspace and Folder Owner
IAM Owner
Audit Owner
The AI platform also lets users create custom platform roles whose contents can be defined as needed. For example, when creating a platform role that contains all Workbench permissions, the platform will check the workspace view permission by default because Workbench depends on workspaces; do not manually uncheck it. If user A is granted this Workbench role, they automatically gain create/read/update/delete permissions for Workbench features in all workspaces.
Folder roles sit between platform roles and workspace roles in permission granularity. A folder role can grant a user management or view permissions on a folder, its subfolders, and all workspaces under that folder, which suits departmental scenarios in an enterprise. For example, user B is the leader of a first-level department and usually needs to manage that department, all of its second-level departments, and the projects within them. Granting user B administrator permission on the first-level folder also gives them the corresponding permissions on the second-level folders and workspaces under it. The AI platform provides 3 preset folder roles that can be used directly:
Folder Admin
Folder Editor
Folder Viewer
The AI platform also lets users create custom folder roles whose contents can be defined as needed. For example, create a folder role that contains all Workbench permissions. If user A is granted this role in folder 01, they will have create/read/update/delete permissions for Workbench features in all workspaces under that folder.
After integrating with a customer's system, the AI platform can create Webhooks that send notifications when a user is created, updated, or deleted, or when a user logs in or out.
The source application (the AI platform) performs a specific operation or event.
Method: choose the method appropriate for your target; for example, WeCom recommends the POST method.
The AI platform predefines several variables that you can use in the message body as needed.
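As an illustration only, a POST-style delivery to a WeCom group bot might look like the sketch below; the webhook key and the {{user_name}} variable are placeholders, not the platform's actual predefined variable names:

curl -X POST 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<YOUR-BOT-KEY>' \
  -H 'Content-Type: application/json' \
  -d '{"msgtype": "text", "text": {"content": "User {{user_name}} just logged in"}}'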
"},{"location":"admin/ghippo/audit/open-audit.html#ai","title":"\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u5b89\u88c5\u5b8c\u6210\u65f6\u72b6\u6001","text":"
The source IP in audit logs plays a key role in system and network management: it helps trace activity, maintain security, troubleshoot problems, and ensure compliance. However, capturing the source IP incurs some performance overhead, so the audit log source IP is not always enabled on the AI platform. Its default state and the way to enable it differ between installation modes. The following sections describe, per installation mode, whether the audit log source IP is enabled by default and how to enable it.
In this installation mode, the audit log source IP is disabled by default. Enable it as follows:
A regular user is someone who can use most product modules and features of the AI platform (except management features), has certain operation permissions on resources within their scope, and can independently deploy applications using those resources.
The AI platform introduces the concept of workspaces for this purpose. By sharing resources, a workspace provides higher-level resource quota capabilities, allowing a workspace (tenant) to create Kubernetes namespaces in a self-service manner within its resource quota.
First, build a folder hierarchy that mirrors your existing corporate structure. The AI platform supports 5 levels of folders, which can be combined freely according to the actual situation of the enterprise, mapping folders and workspaces to entities such as departments, projects, and suppliers.
As the business keeps expanding, the company grows: subsidiaries and branch offices are established, some subsidiaries set up their own subsidiaries, and the original large departments are gradually split into multiple smaller ones, so the organizational hierarchy keeps deepening. These organizational changes also affect the IT governance architecture.
System messages are used to notify all users, similar to system announcements, and are displayed in the top bar of the AI platform UI at specific times.
Best practice: the operations team owns a highly available cluster 01 and wants to allocate it to department A (workspace A) and department B (workspace B), giving department A 50 CPU cores and department B 100 CPU cores. Using the shared-resource concept, cluster 01 can be shared with both department A and department B, limiting department A's CPU quota to 50 cores and department B's to 100 cores. The administrator of department A (Workspace A Admin) can then create and use namespaces in Workbench whose total quota does not exceed 50 cores, and the administrator of department B (Workspace B Admin) can create and use namespaces whose total quota does not exceed 100 cores. Namespaces created by the administrators of department A and department B are automatically bound to their department, and other members of the department receive the corresponding Namespace Admin, Namespace Edit, or Namespace View role on those namespaces (here "department" refers to the workspace; a workspace can also be mapped to other concepts such as an organization or a supplier). The whole process is shown in the table below:
Department | Role | Shared Cluster | Resource Quota
Department A administrator | Workspace Admin | Cluster 01 | 50 CPU cores
Department B administrator | Workspace Admin | Cluster 01 | 100 CPU cores
"},{"location":"admin/ghippo/best-practice/ws-best-practice.html#ai","title":"\u5de5\u4f5c\u7a7a\u95f4\u5bf9 AI \u7b97\u529b\u4e2d\u5fc3\u5404\u6a21\u5757\u7684\u4f5c\u7528","text":"
GProduct is the collective name for all modules in the AI platform other than Global Management; these modules must be integrated with Global Management before they can join the AI platform.
"},{"location":"admin/ghippo/best-practice/oem/custom-idp.html","title":"\u5b9a\u5236 AI \u7b97\u529b\u4e2d\u5fc3\u5bf9\u63a5\u5916\u90e8\u8eab\u4efd\u63d0\u4f9b\u5546 (IdP)","text":"
Identity Provider (IdP): when the AI platform needs to use a customer's system as its user source, and the customer system's login page is used for login authentication, that customer system is called the AI platform's identity provider.
cd quarkus
mvn -f ../pom.xml clean install -DskipTestsuite -DskipExamples -DskipTests
"},{"location":"admin/ghippo/best-practice/oem/keycloak-idp.html#ide","title":"\u4ece IDE \u8fd0\u884c","text":""},{"location":"admin/ghippo/best-practice/oem/keycloak-idp.html#service","title":"\u6dfb\u52a0 service \u4ee3\u7801","text":""},{"location":"admin/ghippo/best-practice/oem/keycloak-idp.html#keycloak","title":"\u5982\u679c\u53ef\u4ece keycloak \u7ee7\u627f\u90e8\u5206\u529f\u80fd","text":"
"},{"location":"admin/ghippo/best-practice/oem/oem-in.html","title":"\u5982\u4f55\u5c06\u5ba2\u6237\u7cfb\u7edf\u96c6\u6210\u5230 AI \u7b97\u529b\u4e2d\u5fc3\uff08OEM IN\uff09","text":"
OEM IN means that a partner's platform is embedded into the AI platform as a sub-module and appears in the AI platform's top-level navigation bar. Users log in and are managed uniformly through the AI platform. Implementing OEM IN takes 5 steps:
The open-source software Label Studio is used below to demonstrate the embedding. In a real scenario you need to solve the customer system's own issues, for example:
The customer system needs to add its own subpath to distinguish which services belong to the AI platform and which belong to the customer system.
Integrate the customer system with the AI platform via protocols such as OIDC/OAuth, so that after logging in to the AI platform users can enter the customer system without logging in again.
Note
Two AI platform deployments integrated with each other are used for this demonstration, covering both scenarios: using the AI platform as the user source to log in to the customer platform, and using the customer platform as the user source to log in to the AI platform.
AI platform as user source, logging in to the customer platform: first use the first AI platform as the user source. After integration, users from the first AI platform can log in to the second AI platform directly via OIDC without creating users in the second one. In the first AI platform, create an SSO access entry via Global Management -> User and Access Control -> Access Management.
Customer platform as user source, logging in to the AI platform: fill the client ID, client secret, single sign-on URL, and other values generated by the first AI platform into Global Management -> User and Access Control -> Identity Providers -> OIDC of the second AI platform to complete the user integration. After integration, users from the first AI platform can log in to the second AI platform directly via OIDC without creating users in the second one.
After the integration is complete, the login page of the second AI platform shows an OIDC option. On the first login, choose to log in via OIDC (the name is customizable; here it is loginname); subsequent logins go straight in without having to choose again.
Note
Using two AI platform deployments shows that as long as the customer supports the OIDC protocol, both scenarios are supported, whether the AI platform or the "customer platform" acts as the user source.
Refer to the tar package at the bottom of the document to implement an empty-shell front-end sub-application, and embed the customer system into that shell application as an iframe.
After the integration is complete, Customer System appears in the AI platform's top-level navigation bar; click it to enter the customer system.
The AI platform supports appearance customization by writing CSS. How the customer system implements appearance customization in practice depends on its own situation.
"},{"location":"admin/ghippo/best-practice/oem/oem-in.html#anyproduct-ai","title":"AnyProduct \u4f7f\u7528 AI \u7b97\u529b\u4e2d\u5fc3\u7684\u5176\u4ed6\u80fd\u529b(\u53ef\u9009)","text":"
This is done by calling the AI platform OpenAPI.
OEM OUT means integrating the AI platform into another product as a sub-module, appearing in that product's menus. After logging in to the other product, users can jump directly to the AI platform without logging in again. Implementing OEM OUT takes 5 steps:
Deploy the AI platform (assume the access address after deployment is https://10.6.8.2:30343/).
An nginx reverse proxy can be placed in front of the customer system and the AI platform to achieve same-domain access: / is routed to the customer system and /dce5 (the subpath) is routed to the AI platform. An example of vi /etc/nginx/conf.d/default.conf is shown below:
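A minimal sketch of such a configuration, assuming the customer system listens on 127.0.0.1:8080 and the AI platform is reachable at https://10.6.8.2:30343/ (both addresses are placeholders; adjust to your environment):

server {
    listen 80;
    server_name oem.example.local;   # hypothetical domain

    # Route everything under / to the customer system
    location / {
        proxy_pass http://127.0.0.1:8080/;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # Route the /dce5 subpath to the AI platform
    location /dce5/ {
        proxy_pass https://10.6.8.2:30343/;
        proxy_ssl_verify off;        # assumption: the platform uses a self-signed certificate
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}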
Integrate the customer system with the AI platform via protocols such as OIDC/OAuth, so that after logging in to the customer system users can enter the AI platform without logging in again. Once you have the customer system's OIDC information, fill it in under Global Management -> User and Access Control -> Identity Providers.
After integration, the AI platform login page shows an OIDC (custom-named) option. The first time a user enters the AI platform from the customer system, they choose to log in via OIDC; afterwards they enter the AI platform directly without having to choose again.
Integrating the navigation bar means the AI platform appears in the customer system's menu, and clicking the corresponding menu item takes the user directly to the AI platform. This therefore depends on the customer system, and different platforms need to handle it according to their own situation.
After the SM (Chinese national cryptography) gateway is deployed successfully, customize the AI platform reverse proxy server address.
In User and Access Control, the AI platform assigns a user an account with certain permissions by having an administrator create the new user. All actions taken by that user are associated with their own account.
"},{"location":"admin/ghippo/install/reverse-proxy.html","title":"\u81ea\u5b9a\u4e49 AI \u7b97\u529b\u4e2d\u5fc3\u53cd\u5411\u4ee3\u7406\u670d\u52a1\u5668\u5730\u5740","text":"
Access keys can be used to access the open APIs and for continuous delivery. Users can obtain a key in their Personal Center by following the steps below and then access the API.
Log in to the AI platform with your username/password, then click Global Management at the bottom of the left navigation bar.
"},{"location":"admin/ghippo/personal-center/ssh-key.html#4-ai","title":"\u6b65\u9aa4 4\uff1a\u5728\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u4e0a\u8bbe\u7f6e\u516c\u94a5","text":"
Log in to the AI platform UI, and in the upper-right corner of the page select Personal Center -> SSH Public Key.
On the AI platform, you can use Appearance Customization to change the login page, the top navigation bar, and the copyright and ICP filing information at the bottom, helping users better recognize the product.
When a user forgets their password, the AI platform sends an email to verify the email address and confirm that the operation is performed by the user themselves. For the AI platform to send email, you must first provide your mail server address.
The AI platform provides password-based and access-control security policies in the graphical interface.
Session timeout policy: if a user performs no operation within x hours, they are logged out of the current account.
Supports customizing the billing units for CPU, memory, storage, and GPU, as well as the currency unit.
Supports displaying the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization.
Cluster report: shows the maximum, minimum, and average CPU utilization, memory utilization, storage utilization, and GPU memory utilization of all clusters over a period of time, as well as the number of nodes in each cluster during that period. Clicking the node count takes you to the node report, where you can view node usage for that cluster in that period.
Node report: shows the maximum, minimum, and average CPU utilization, memory utilization, storage utilization, and GPU memory utilization of all nodes over a period of time, as well as each node's IP, type, and the cluster it belongs to.
Pod report: shows the maximum, minimum, and average CPU utilization, memory utilization, storage utilization, and GPU memory utilization of all pods over a period of time, as well as each pod's namespace, cluster, and workspace.
Workspace report: shows the maximum, minimum, and average CPU utilization, memory utilization, storage utilization, and GPU memory utilization of all workspaces over a period of time, as well as the number of namespaces and pods in each workspace. Clicking the namespace count takes you to the namespace report to view namespace usage in that workspace for that period; pod usage in that workspace can be viewed the same way.
Namespace report: shows the maximum, minimum, and average CPU utilization, memory utilization, storage utilization, and GPU memory utilization of all namespaces over a period of time, as well as each namespace's pod count, cluster, and workspace. Clicking the pod count takes you to the pod report to view pod usage in that namespace for that period.
The cause of this issue: the MySQL database that ghippo-keycloak connects to failed, which caused the OIDC public keys to be reset.
Drop and re-create the keycloak database; the suggested statement is CREATE DATABASE IF NOT EXISTS keycloak CHARACTER SET utf8
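A minimal MySQL sketch of the step above, run against the database instance used by ghippo-keycloak (dropping the database is destructive; make sure you have a backup if any data must be kept):

-- Drop the broken keycloak database
DROP DATABASE IF EXISTS keycloak;
-- Re-create it with the character set suggested above
CREATE DATABASE IF NOT EXISTS keycloak CHARACTER SET utf8;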
Restart the Keycloak Pod to resolve the issue.
"},{"location":"admin/ghippo/troubleshooting/ghippo03.html#cpu-does-not-support-86-64-v2","title":"CPU does not support \u00d786-64-v2","text":""},{"location":"admin/ghippo/troubleshooting/ghippo03.html#_5","title":"\u6545\u969c\u8868\u73b0","text":"
Keycloak fails to start: the keycloak pod is in the CrashLoopBackOff state and the keycloak log shows the message in the figure below.
Run the check script below to determine the x86-64 microarchitecture feature level supported by the current node's CPU.
Run the following command to view the current CPU's features; if the output contains sse4_2, your processor supports SSE 4.2.
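A rough shell sketch of the two checks described above, based on the flags reported in /proc/cpuinfo on Linux (this is not the platform's official script, only a simplified illustration):

#!/bin/sh
flags=$(grep -m1 '^flags' /proc/cpuinfo)

# 1. Rough x86-64 feature-level check based on a few representative CPU flags.
level=1                                                                # x86-64-v1 baseline (64-bit CPU)
echo "$flags" | grep -qw sse4_2 && echo "$flags" | grep -qw popcnt && level=2   # x86-64-v2 needs SSE4.2, POPCNT, ...
echo "$flags" | grep -qw avx2 && level=3                               # x86-64-v3 (rough check)
echo "This CPU roughly supports up to x86-64-v$level"

# 2. Check specifically whether SSE 4.2 is available.
grep -o 'sse4_2' /proc/cpuinfo | head -n 1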
You need to upgrade your VM or physical machine CPU to support x86-64-v2 or above, ensuring the x86 CPU instruction set supports SSE 4.2. For how to upgrade, consult your cloud provider or hardware vendor.
Log in to the AI platform as an end user, navigate to the corresponding service, and check the access port.
The Alert Center is an important feature of the AI platform. It lets users conveniently view all active and historical alerts by cluster and namespace through a graphical interface, and search alerts by severity (Critical, Warning, Info).
All alerts are triggered by threshold conditions defined in preset alert rules. The AI platform ships with some built-in global alert policies, and you can also create or delete alert policies at any time and configure the following metrics:
The AI platform Alert Center is a powerful alert management platform that helps users promptly detect and resolve problems in their clusters, improving business stability and availability and facilitating cluster inspection and troubleshooting.
Switch to the insight-system -> Fluent Bit dashboard.
At the top of the Fluent Bit dashboard there are several drop-down boxes for selecting the log collection plugin, log filtering plugin, log output plugin, and the cluster name.
Taking the AI platform as an example, this article explains how to use Insight to discover abnormal components in the AI platform and analyze the root cause of the abnormality.
Referring to the official Grafana documentation (current image version 9.3.14), configure an external database as follows, using MySQL as an example:
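A minimal sketch of the relevant Grafana [database] settings (host, database name, and credentials are placeholders); the same options can also be supplied as environment variables on the Grafana container:

[database]
type = mysql
host = mysql.example.svc:3306
name = grafana
user = grafana
password = your-password

# Equivalent environment-variable form:
# GF_DATABASE_TYPE=mysql
# GF_DATABASE_HOST=mysql.example.svc:3306
# GF_DATABASE_NAME=grafana
# GF_DATABASE_USER=grafana
# GF_DATABASE_PASSWORD=your-password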
AI platform Insight currently recommends tail sampling and gives it first-class support.
Pod Monitor: in the Kubernetes ecosystem, scrapes monitoring data from Pods based on the Prometheus Operator.
Service Monitor: in the Kubernetes ecosystem, scrapes monitoring data from the Endpoints behind a Service based on the Prometheus Operator.
Because ICMP requires elevated privileges, the Pod's privileges must also be raised; otherwise an operation not permitted error occurs. There are two ways to elevate privileges:
port: specifies the port through which data is scraped; its value is the name set on the Service port to be scraped.
This defines the scope of Services to be discovered. namespaceSelector contains two mutually exclusive fields with the following meanings (see the sketch after this list):
any: takes the single value true; when set, changes to all Services matching the Selector filter are watched.
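A minimal ServiceMonitor sketch illustrating the fields above (the name and labels are placeholders):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # placeholder name
  namespace: monitoring
spec:
  endpoints:
    - port: metrics            # must match the name of the Service port being scraped
      interval: 30s
  namespaceSelector:
    any: true                  # watch matching Services in all namespaces
    # matchNames:              # mutually exclusive with any; list specific namespaces instead
    #   - default
  selector:
    matchLabels:
      app: example-app         # placeholder label selector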
Resource consumption: view the resource trend over the last hour for the top 5 clusters and nodes, by CPU utilization, memory utilization, and disk utilization respectively.
Sorted by CPU utilization by default. You can switch the metric to change how clusters and nodes are sorted.
Resource trends: view the node count trend over the last 15 days and the Pod running trend over the last hour.
Use logical operators (AND, OR, NOT, "") to query multiple keywords, for example: keyword1 AND (keyword2 OR keyword3) NOT keyword4.
kubectl get secrets -n mcamel-system mcamel-common-es-cluster-masters-es-elastic-user -o jsonpath="{.data.elastic}" | base64 -d
Go to Kibana -> Stack Management -> Index Management and enable the Include hidden indices option to see all indices. Based on the index sequence numbers, keep the indices with larger numbers and delete those with smaller numbers.
Restart the Pod; after the Pod returns to the running state, Fluent Bit no longer collects logs from the containers in this Pod.
"},{"location":"admin/insight/infra/cluster.html#_4","title":"\u53c2\u8003\u6307\u6807\u8bf4\u660e","text":"\u6307\u6807\u540d \u8bf4\u660e CPU \u4f7f\u7528\u7387 \u8be5\u6307\u6807\u662f\u6307\u96c6\u7fa4\u4e2d\u6240\u6709 Pod \u8d44\u6e90\u7684\u5b9e\u9645 CPU \u7528\u91cf\u4e0e\u6240\u6709\u8282\u70b9\u7684 CPU \u603b\u91cf\u7684\u6bd4\u7387\u3002 CPU \u5206\u914d\u7387 \u8be5\u6307\u6807\u662f\u6307\u96c6\u7fa4\u4e2d\u6240\u6709 Pod \u7684 CPU \u8bf7\u6c42\u91cf\u7684\u603b\u548c\u4e0e\u6240\u6709\u8282\u70b9\u7684 CPU \u603b\u91cf\u7684\u6bd4\u7387\u3002 \u5185\u5b58\u4f7f\u7528\u7387 \u8be5\u6307\u6807\u662f\u6307\u96c6\u7fa4\u4e2d\u6240\u6709 Pod \u8d44\u6e90\u7684\u5b9e\u9645\u5185\u5b58\u7528\u91cf\u4e0e\u6240\u6709\u8282\u70b9\u7684\u5185\u5b58\u603b\u91cf\u7684\u6bd4\u7387\u3002 \u5185\u5b58\u5206\u914d\u7387 \u8be5\u6307\u6807\u662f\u6307\u96c6\u7fa4\u4e2d\u6240\u6709 Pod \u7684\u5185\u5b58\u8bf7\u6c42\u91cf\u7684\u603b\u548c\u4e0e\u6240\u6709\u8282\u70b9\u7684\u5185\u5b58\u603b\u91cf\u7684\u6bd4\u7387\u3002"},{"location":"admin/insight/infra/container.html","title":"\u5bb9\u5668\u76d1\u63a7","text":"
"},{"location":"admin/insight/infra/container.html#_4","title":"\u6307\u6807\u53c2\u8003\u8bf4\u660e","text":"\u6307\u6807\u540d\u79f0 \u8bf4\u660e CPU \u4f7f\u7528\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684 CPU \u4f7f\u7528\u91cf\u4e4b\u548c\u3002 CPU \u8bf7\u6c42\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684 CPU \u8bf7\u6c42\u91cf\u4e4b\u548c\u3002 CPU \u9650\u5236\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684 CPU \u9650\u5236\u91cf\u4e4b\u548c\u3002 \u5185\u5b58\u4f7f\u7528\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684\u5185\u5b58\u4f7f\u7528\u91cf\u4e4b\u548c\u3002 \u5185\u5b58\u8bf7\u6c42\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684\u5185\u5b58\u4f7f\u7528\u91cf\u4e4b\u548c\u3002 \u5185\u5b58\u9650\u5236\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684\u5185\u5b58\u9650\u5236\u91cf\u4e4b\u548c\u3002 \u78c1\u76d8\u8bfb\u5199\u901f\u7387 \u6307\u5b9a\u65f6\u95f4\u8303\u56f4\u5185\u78c1\u76d8\u6bcf\u79d2\u8fde\u7eed\u8bfb\u53d6\u548c\u5199\u5165\u7684\u603b\u548c\uff0c\u8868\u793a\u78c1\u76d8\u6bcf\u79d2\u8bfb\u53d6\u548c\u5199\u5165\u64cd\u4f5c\u6570\u7684\u6027\u80fd\u5ea6\u91cf\u3002 \u7f51\u7edc\u53d1\u9001\u63a5\u6536\u901f\u7387 \u6307\u5b9a\u65f6\u95f4\u8303\u56f4\u5185\uff0c\u6309\u5de5\u4f5c\u8d1f\u8f7d\u7edf\u8ba1\u7684\u7f51\u7edc\u6d41\u91cf\u7684\u6d41\u5165\u3001\u6d41\u51fa\u901f\u7387\u3002"},{"location":"admin/insight/infra/event.html","title":"\u4e8b\u4ef6\u67e5\u8be2","text":"
AI platform Insight supports querying events by cluster and namespace, and provides an event status distribution chart with statistics for important events.
With important event statistics, you can easily see the number of image pull failures, health check failures, Pod run failures, Pod scheduling failures, container OOM (out-of-memory) events, volume mount failures, and the total number of all events. These events are usually divided into two categories: Warning and Normal.
"},{"location":"admin/insight/infra/namespace.html#_4","title":"\u6307\u6807\u8bf4\u660e","text":"\u6307\u6807\u540d \u8bf4\u660e CPU \u4f7f\u7528\u91cf \u6240\u9009\u547d\u540d\u7a7a\u95f4\u4e2d\u5bb9\u5668\u7ec4\u7684 CPU \u4f7f\u7528\u91cf\u4e4b\u548c \u5185\u5b58\u4f7f\u7528\u91cf \u6240\u9009\u547d\u540d\u7a7a\u95f4\u4e2d\u5bb9\u5668\u7ec4\u7684\u5185\u5b58\u4f7f\u7528\u91cf\u4e4b\u548c \u5bb9\u5668\u7ec4 CPU \u4f7f\u7528\u91cf \u547d\u540d\u7a7a\u95f4\u4e2d\u5404\u5bb9\u5668\u7ec4\u7684 CPU \u4f7f\u7528\u91cf \u5bb9\u5668\u7ec4\u5185\u5b58\u4f7f\u7528\u91cf \u547d\u540d\u7a7a\u95f4\u4e2d\u5404\u5bb9\u5668\u7ec4\u7684\u5185\u5b58\u4f7f\u7528\u91cf"},{"location":"admin/insight/infra/node.html","title":"\u8282\u70b9\u76d1\u63a7","text":"
Metric Name / Description
- Current Status Response: the response status code of the HTTP probe request.
- Ping Status: whether the probe request succeeded; 1 means success, 0 means failure.
- IP Protocol: the IP protocol version used by the probe request.
- SSL Expiry: the earliest expiration time of the SSL/TLS certificate.
- DNS Response (Latency): the duration of the whole probe, in seconds.
- HTTP Duration: the total time from sending the request to receiving the complete response.
Delete a Probe Task
The AI platform manages multiple clouds and clusters and supports creating clusters. On this basis, the observability module Insight serves as a unified multi-cluster observability solution: deploying the insight-agent plugin collects observability data from multiple clusters, and metrics, logs, and trace data can then be queried through the AI platform observability product.
"},{"location":"admin/insight/quickstart/install/gethosturl.html#insight-agent_1","title":"\u5728\u5176\u4ed6\u96c6\u7fa4\u5b89\u88c5 insight-agent","text":""},{"location":"admin/insight/quickstart/install/gethosturl.html#insight-server","title":"\u901a\u8fc7 Insight Server \u63d0\u4f9b\u7684\u63a5\u53e3\u83b7\u53d6\u5730\u5740","text":"
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam'
When calling the API, you also need to pass the IP of any externally accessible node in the cluster; that IP is used to assemble the complete access address of the corresponding service.
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam' --data '{"extra": {"EXPORTER_EXTERNAL_IP": "10.5.14.51"}}'
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam'
Once the above is ready, you can instrument your application for tracing via annotations; OTel currently supports onboarding traces this way. Different pod annotations need to be added depending on the service language. Each service can add one of two kinds of annotations:
Because Go auto-instrumentation requires OTEL_GO_AUTO_TARGET_EXE to be set, you must provide a valid executable path via an annotation or the Instrumentation resource. Not setting this value aborts the Go auto-instrumentation injection, and trace onboarding fails.
Go auto-instrumentation also requires elevated privileges. The following permissions are set automatically and are required.
Manual instrumentation using Go as an example: enhancing Go applications with the OpenTelemetry SDK.
Non-intrusive instrumentation for Go via eBPF (experimental feature).
OpenTelemetry, often abbreviated OTel, is an open-source observability framework that helps generate and collect telemetry data in Go applications: traces, metrics, and logs.
This article mainly explains how to enhance a Go application with the OpenTelemetry Go SDK and onboard it to trace monitoring.
"},{"location":"admin/insight/quickstart/otel/golang/golang.html#otel-sdk-go_1","title":"\u4f7f\u7528 OTel SDK \u589e\u5f3a Go \u5e94\u7528","text":""},{"location":"admin/insight/quickstart/otel/golang/golang.html#_1","title":"\u5b89\u88c5\u76f8\u5173\u4f9d\u8d56","text":"
OTEL_SERVICE_NAME=my-golang-app OTEL_EXPORTER_OTLP_ENDPOINT=http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 go run main.go...
# Add one line to your import() stanza depending upon your request router:
middleware "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
# Add one line to your import() stanza depending upon your request router:
middleware "go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux"
Create a meter provider and specify Prometheus as the exporter.
/*
 * Copyright The OpenTelemetry Authors
 * SPDX-License-Identifier: Apache-2.0
 */

package io.opentelemetry.example.prometheus;

import io.opentelemetry.api.metrics.MeterProvider;
import io.opentelemetry.exporter.prometheus.PrometheusHttpServer;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.MetricReader;

public final class ExampleConfiguration {

  /**
   * Initializes the Meter SDK and configures the prometheus collector with all default settings.
   *
   * @param prometheusPort the port to open up for scraping.
   * @return A MeterProvider for use in instrumentation.
   */
  static MeterProvider initializeOpenTelemetry(int prometheusPort) {
    MetricReader prometheusReader = PrometheusHttpServer.builder().setPort(prometheusPort).build();

    return SdkMeterProvider.builder().registerMetricReader(prometheusReader).build();
  }
}
Define a custom meter and start the HTTP server.
package io.opentelemetry.example.prometheus;

import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.metrics.MeterProvider;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Example of using the PrometheusHttpServer to convert OTel metrics to Prometheus format and expose
 * these to a Prometheus instance via a HttpServer exporter.
 *
 * <p>A Gauge is used to periodically measure how many incoming messages are awaiting processing.
 * The Gauge callback gets executed every collection interval.
 */
public final class PrometheusExample {
  private long incomingMessageCount;

  public PrometheusExample(MeterProvider meterProvider) {
    Meter meter = meterProvider.get("PrometheusExample");
    meter
        .gaugeBuilder("incoming.messages")
        .setDescription("No of incoming messages awaiting processing")
        .setUnit("message")
        .buildWithCallback(result -> result.record(incomingMessageCount, Attributes.empty()));
  }

  void simulate() {
    for (int i = 500; i > 0; i--) {
      try {
        System.out.println(
            i + " Iterations to go, current incomingMessageCount is: " + incomingMessageCount);
        incomingMessageCount = ThreadLocalRandom.current().nextLong(100);
        Thread.sleep(1000);
      } catch (InterruptedException e) {
        // ignored here
      }
    }
  }

  public static void main(String[] args) {
    int prometheusPort = 8888;

    // it is important to initialize the OpenTelemetry SDK as early as possible in your process.
    MeterProvider meterProvider = ExampleConfiguration.initializeOpenTelemetry(prometheusPort);

    PrometheusExample prometheusExample = new PrometheusExample(meterProvider);

    prometheusExample.simulate();

    System.out.println("Exiting");
  }
}
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} trace_id=%X{trace_id} span_id=%X{span_id} trace_flags=%X{trace_flags} %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Just wrap your logging appender, for example ConsoleAppender, with OpenTelemetryAppender -->
  <appender name="OTEL" class="io.opentelemetry.instrumentation.logback.mdc.v1_0.OpenTelemetryAppender">
    <appender-ref ref="CONSOLE"/>
  </appender>

  <!-- Use the wrapped "OTEL" appender instead of the original "CONSOLE" one -->
  <root level="INFO">
    <appender-ref ref="OTEL"/>
  </root>

</configuration>
Here we use the second approach: when starting the JVM, the JMX Exporter jar file and its configuration file must be specified. The jar is a binary file that is awkward to mount via a ConfigMap, and the configuration file rarely needs modification, so we recommend packaging both the JMX Exporter jar and its configuration file directly into the application container image.
With the second approach, the JMX Exporter jar file can either be placed inside the application image or mounted in at deployment time. Both options are described below:
"},{"location":"admin/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html#jmx-exporter-jar","title":"\u65b9\u5f0f\u4e00\uff1a\u5c06 JMX Exporter JAR \u6587\u4ef6\u6784\u5efa\u81f3\u4e1a\u52a1\u955c\u50cf\u4e2d","text":"
Then prepare the jar file. You can find the latest jar download link on the jmx_exporter GitHub page and refer to a Dockerfile like the following:
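A minimal Dockerfile sketch (the base image, exporter version, file names, and port are placeholders; substitute the actual jar downloaded from the jmx_exporter releases page):

FROM openjdk:11-jre-slim

WORKDIR /app

# Copy the JMX Exporter jar and its config into the image (files prepared next to the Dockerfile)
COPY jmx_prometheus_javaagent-0.20.0.jar /app/jmx_prometheus_javaagent.jar
COPY prometheus-jmx-config.yaml /app/prometheus-jmx-config.yaml

# Copy the application jar
COPY my-app.jar /app/my-app.jar

# Expose the metrics port and attach the exporter as a Java agent
EXPOSE 8088
ENTRYPOINT ["java", "-javaagent:/app/jmx_prometheus_javaagent.jar=8088:/app/prometheus-jmx-config.yaml", "-jar", "/app/my-app.jar"]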
With the service mesh disabled, testing shows the relationship between system job metrics and Pods is: number of series = 800 * number of Pods.
With the service mesh enabled, the volume of Istio-related metrics produced per Pod is on the order of: number of series = 768 * number of Pods.
The Pod count in the table refers to Pods that run stably in the cluster. If a large number of Pods restart, the metric volume spikes for a short time, and resources then need to be scaled up accordingly.
Disk usage = instantaneous metric count x 2 x disk space per data point x 60 x 24 x retention (days)
Retention (days) x 60 x 24 converts the time in days into minutes so that disk usage can be calculated.
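A worked example with hypothetical numbers: with 1,000,000 instantaneous series, 0.5 byte of disk per data point, and 7 days of retention:

Disk usage = 1,000,000 x 2 x 0.5 byte x 60 x 24 x 7 = 10,080,000,000 bytes ≈ 9.4 GiB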
In the following example, Alertmanager routes all alerts whose CPU utilization exceeds the threshold to a policy named "critical_alerts".
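A minimal sketch of what such a routing rule might look like in the Alertmanager configuration (the alert label values and receiver details are placeholders; the actual example referenced above may differ):

route:
  receiver: default
  routes:
    - receiver: critical_alerts         # route matching alerts to the "critical_alerts" policy
      matchers:
        - alertname = "HighCpuUsage"    # placeholder alert name produced by the CPU threshold rule
        - severity = "critical"

receivers:
  - name: default
  - name: critical_alerts
    # notification settings (email, webhook, etc.) would be configured here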
For example: field:[value TO ] means the value of field ranges from value to positive infinity, and field:[ TO value] means the value of field ranges from negative infinity to value.
Each Pod can run multiple sidecar containers, separated by ;, so that different sidecar containers can collect multiple files into multiple storage volumes.
Restart the Pod; once its status becomes Running, you can search the in-container logs of this Pod through the Log Query page.
In the left navigation, select Index Lifecycle Policies, find the policy insight-es-k8s-logs-policy, and click it to open the details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period; the default retention is 7d.
After making the change, click Save policy at the bottom of the page to apply it.
In the left navigation, select Index Lifecycle Policies, find the policy jaeger-ilm-policy, and click it to open the details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period; the default retention is 7d.
After making the change, click Save policy at the bottom of the page to apply it.
Deploying a Kubernetes cluster supports efficient scheduling and management of AI computing power, enables elastic scaling, and provides high availability, thereby optimizing model training and inference.
At this point the cluster has been created successfully; you can check the nodes it contains, and go on to create AI workloads and use GPUs.
Next step: create an AI workload
Backups are generally divided into full, incremental, and differential backups. The AI platform currently supports full and incremental backups.
The backup and restore capability provided by the AI platform falls into two kinds, application backup and etcd backup, and supports manual backups as well as scheduled automatic backups based on CronJobs.
This article describes how to back up an application on the AI platform. The demo application used in this tutorial is named dao-2048 and is a stateless workload.
CA certificate: you can view the certificate with the following command, then copy and paste its contents into the corresponding field:
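The original command is not reproduced here. As one hypothetical way to print a CA certificate, assuming it is stored in a Kubernetes secret named minio-ca in the minio-system namespace (both names are placeholders for your environment):

kubectl get secret minio-ca -n minio-system -o jsonpath='{.data.ca\.crt}' | base64 -d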
S3 region: the geographic region of the cloud storage. Defaults to us-east-1, provided by the system administrator.
S3 force path style: keep the default value true.
S3 server URL: the console access address of the object storage (MinIO). MinIO generally provides two services, UI access and console access; use the console access address here.
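Assuming the platform's application backup is Velero-based (an assumption; adjust to the actual backup component), these values typically end up in a BackupStorageLocation similar to the sketch below, where the name, bucket, and MinIO address are placeholders:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: minio-bsl              # placeholder name
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: backup-bucket      # placeholder bucket
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://10.6.8.100:9000   # placeholder MinIO address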
\u5df2\u7ecf\u901a\u8fc7 AI \u7b97\u529b\u4e2d\u5fc3\u5e73\u53f0\u521b\u5efa\u597d\u4e00\u4e2a\u5de5\u4f5c\u96c6\u7fa4\uff0c\u53ef\u53c2\u8003\u6587\u6863\u521b\u5efa\u5de5\u4f5c\u96c6\u7fa4\u3002
\u672c\u6587\u5c06\u4ecb\u7ecd\u79bb\u7ebf\u6a21\u5f0f\u4e0b\uff0c\u5982\u4f55\u624b\u52a8\u4e3a\u5168\u5c40\u670d\u52a1\u96c6\u7fa4\u7684\u5de5\u4f5c\u8282\u70b9\u8fdb\u884c\u6269\u5bb9\u3002 \u9ed8\u8ba4\u60c5\u51b5\u4e0b\uff0c\u4e0d\u5efa\u8bae\u5728\u90e8\u7f72 AI \u7b97\u529b\u4e2d\u5fc3\u540e\u5bf9\u5168\u5c40\u670d\u52a1\u96c6\u7fa4\u8fdb\u884c\u6269\u5bb9\uff0c\u8bf7\u5728\u90e8\u7f72 AI \u7b97\u529b\u4e2d\u5fc3\u524d\u505a\u597d\u8d44\u6e90\u89c4\u5212\u3002
\u5df2\u7ecf\u901a\u8fc7\u706b\u79cd\u8282\u70b9\u5b8c\u6210 AI \u7b97\u529b\u4e2d\u5fc3\u5e73\u53f0\u7684\u90e8\u7f72\uff0c\u5e76\u4e14\u706b\u79cd\u8282\u70b9\u4e0a\u7684 kind \u96c6\u7fa4\u8fd0\u884c\u6b63\u5e38\u3002
[root@localhost ~]# podman ps

# Expected output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
220d662b1b6a docker.m.daocloud.io/kindest/node:v1.26.2 2 weeks ago Up 2 weeks 0.0.0.0:443->30443/tcp, 0.0.0.0:8081->30081/tcp, 0.0.0.0:9000-9001->32000-32001/tcp, 0.0.0.0:36674->6443/tcp my-cluster-installer-control-plane
"},{"location":"admin/kpanda/best-practice/add-worker-node-on-global.html#kind-ai","title":"\u5c06 kind \u96c6\u7fa4\u63a5\u5165 AI \u7b97\u529b\u4e2d\u5fc3\u96c6\u7fa4\u5217\u8868","text":"
Log in to the AI Computing Center and go to Container Management. On the right side of the cluster list page, click the Integrate Cluster button to enter the cluster integration page.
This demo uses the application backup feature of the AI Computing Center to perform a cross-cluster backup and migration of a stateful application.
Note
The current operator should have AI Computing Center platform administrator permissions.
Check the NFS Pod status and wait for it to become running (this takes about 2 minutes).
kubectl get pod -n nfs-system -owide
The expected output is:
[root@g-master1 ~]# kubectl get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nfs-provisioner-7dfb9bcc45-74ws2 1/1 Running 0 4m45s 10.6.175.100 g-master1 <none> <none>
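As an optional shortcut, instead of repeatedly listing Pods you can block until the provisioner Pod reports Ready. The label selector and timeout below are assumptions about how the NFS provisioner is labeled in your deployment.

# Hedged sketch: wait for the NFS provisioner Pod to become Ready (label selector is an assumption).
kubectl wait --for=condition=Ready pod -l app=nfs-provisioner -n nfs-system --timeout=300s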
"},{"location":"admin/kpanda/best-practice/backup-mysql-on-nfs.html#mysql_1","title":"\u90e8\u7f72 MySQL \u5e94\u7528","text":"
\u4e3a MySQL \u5e94\u7528\u51c6\u5907\u57fa\u4e8e NFS \u5b58\u50a8\u7684 PVC\uff0c\u7528\u6765\u5b58\u50a8 MySQL \u670d\u52a1\u5185\u7684\u6570\u636e\u3002
\u4f7f\u7528 vi pvc.yaml \u547d\u4ee4\u5728\u8282\u70b9\u4e0a\u521b\u5efa\u540d\u4e3a pvc.yaml \u7684\u6587\u4ef6\uff0c\u5c06\u4e0b\u9762\u7684 YAML \u5185\u5bb9\u590d\u5236\u5230 pvc.yaml \u6587\u4ef6\u5185\u3002
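The referenced pvc.yaml content is not reproduced in this excerpt. A minimal sketch of an NFS-backed PVC for this tutorial might look like the following; the PVC name mydata matches the label commands later on this page, while the storageClassName and requested size are assumptions.

# Hedged sketch of pvc.yaml; storageClassName "nfs" and the 1Gi size are assumptions.
cat > pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mydata
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nfs
  resources:
    requests:
      storage: 1Gi
EOF
kubectl apply -f pvc.yaml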
Run kubectl get pod | grep mysql to check the MySQL Pod status, and wait for it to become running (this takes about 2 minutes).
The expected output is:
[root@g-master1 ~]# kubectl get pod |grep mysql
mysql-deploy-5d6f94cb5c-gkrks 1/1 Running 0 2m53s
Note
If the MySQL Pod stays in a non-running state for a long time, it is usually because the NFS dependencies have not been installed on all nodes of the cluster.
Run kubectl describe pod ${mysql pod name} to view the Pod's details.
If the error contains a message like MountVolume.SetUp failed for volume "pvc-4ad70cc6-df37-4253-b0c9-8cb86518ccf8" : mount failed: exit status 32, run kubectl delete -f for nfs.yaml, pvc.yaml, and mysql.yaml to delete the previously created resources, then start over from deploying the NFS service.
Write data into the MySQL application.
To make it easier to verify later whether the data migration succeeded, you can use a script to write test data into the MySQL application.
Run vi insert.sh on the node to create a script named insert.sh, and copy the content below into the script.
insert.sh
#!/bin/bash

function rand(){
    min=$1
    max=$(($2-$min+1))
    num=$(date +%s%N)
    echo $(($num%$max+$min))
}

function insert(){
    user=$(date +%s%N | md5sum | cut -c 1-9)
    age=$(rand 1 100)

    sql="INSERT INTO test.users(user_name, age)VALUES('${user}', ${age});"
    echo -e ${sql}

    kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e "${sql}"

}

kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e "CREATE DATABASE IF NOT EXISTS test;"
kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e "CREATE TABLE IF NOT EXISTS test.users(user_name VARCHAR(10) NOT NULL,age INT UNSIGNED)ENGINE=InnoDB DEFAULT CHARSET=utf8;"

while true;do
    insert
    sleep 1
done

mysql: [Warning] Using a password on the command line interface can be insecure.
mysql: [Warning] Using a password on the command line interface can be insecure.
INSERT INTO test.users(user_name, age)VALUES('dc09195ba', 10);
mysql: [Warning] Using a password on the command line interface can be insecure.
INSERT INTO test.users(user_name, age)VALUES('80ab6aa28', 70);
mysql: [Warning] Using a password on the command line interface can be insecure.
INSERT INTO test.users(user_name, age)VALUES('f488e3d46', 23);
mysql: [Warning] Using a password on the command line interface can be insecure.
INSERT INTO test.users(user_name, age)VALUES('e6098695c', 93);
mysql: [Warning] Using a password on the command line interface can be insecure.
INSERT INTO test.users(user_name, age)VALUES('eda563e7d', 63);
mysql: [Warning] Using a password on the command line interface can be insecure.
INSERT INTO test.users(user_name, age)VALUES('a4d1b8d68', 17);
mysql: [Warning] Using a password on the command line interface can be insecure.
Press Ctrl+C to stop the script.
Go to the MySQL Pod and view the data written into MySQL.
kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e "SELECT * FROM test.users;"
The expected output is:
mysql: [Warning] Using a password on the command line interface can be insecure.
user_name age
dc09195ba 10
80ab6aa28 70
f488e3d46 23
e6098695c 93
eda563e7d 63
a4d1b8d68 17
ea47546d9 86
a34311f2e 47
740cefe17 33
ede85ea28 65
b6d0d6a0e 46
f0eb38e50 44
c9d2f28f5 72
8ddaafc6f 31
3ae078d0e 23
6e041631e 96
mysql> set global read_only=1;                      # 1 means read-only, 0 means read-write
mysql> show global variables like "%read_only%";    # check the current status
Add a dedicated label backup=mysql to the MySQL application and its PVC data so that the resources can be selected during backup.
kubectl label deploy mysql-deploy backup=mysql                  # label the mysql-deploy workload
kubectl label pod mysql-deploy-5d6f94cb5c-gkrks backup=mysql    # label the MySQL Pod
kubectl label pvc mydata backup=mysql                           # label the MySQL PVC
"},{"location":"admin/kpanda/best-practice/backup-mysql-on-nfs.html#mysql_3","title":"\u8de8\u96c6\u7fa4\u6062\u590d MySQL \u5e94\u7528\u53ca\u6570\u636e","text":"
\u767b\u5f55 AI \u7b97\u529b\u4e2d\u5fc3\u5e73\u53f0\uff0c\u5728\u5de6\u4fa7\u5bfc\u822a\u9009\u62e9 \u5bb9\u5668\u7ba1\u7406 -> \u5907\u4efd\u6062\u590d -> \u5e94\u7528\u5907\u4efd \u3002
NAME READY STATUS RESTARTS AGE
mysql-deploy-5798f5d4b8-62k6c 1/1 Running 0 24h
Check whether the data in the MySQL data table has been restored successfully.
kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e "SELECT * FROM test.users;"
The expected output is:
mysql: [Warning] Using a password on the command line interface can be insecure.
user_name age
dc09195ba 10
80ab6aa28 70
f488e3d46 23
e6098695c 93
eda563e7d 63
a4d1b8d68 17
ea47546d9 86
a34311f2e 47
740cefe17 33
ede85ea28 65
b6d0d6a0e 46
f0eb38e50 44
c9d2f28f5 72
8ddaafc6f 31
3ae078d0e 23
6e041631e 96
Success
As you can see, the data in this Pod is the same as the data in the Pod in the main-cluster cluster. This shows that the MySQL application and its data have been successfully restored across clusters from main-cluster to the recovery-cluster cluster.
This article only covers creating a worker cluster with the AI Computing Center platform in offline mode, where both the management platform and the worker cluster to be created use the AMD architecture. Heterogeneous (mixed AMD and ARM) deployment is not supported during cluster creation; after the cluster is created, you can manage a mixed deployment by adding heterogeneous nodes.
A full-mode AI Computing Center has already been deployed and the bootstrap node is still alive. For deployment, refer to the document Offline Installation of the AI Computing Center Commercial Edition.
Make sure you are logged in to the bootstrap node, and that the clusterConfig.yaml file used when the AI Computing Center was deployed is still available.
Download the required RedHat OS package and ISO offline packages:
| Resource | Description | Download URL |
| --- | --- | --- |
| os-pkgs-redhat9-v0.9.3.tar.gz | RedHat 9.2 OS package bundle | https://github.com/kubean-io/kubean/releases/download/v0.9.3/os-pkgs-redhat9-v0.9.3.tar.gz |
| ISO offline package | ISO image | Log in to the official RedHat site and download |
| import-iso | Script for importing the ISO into the bootstrap node | https://github.com/kubean-io/kubean/releases/download/v0.9.3/import_iso.sh |

Import the OS package offline bundle into the MinIO on the bootstrap node
Extract the RedHat OS package offline bundle
Run the following command to extract the downloaded OS package offline bundle. Here we use the RedHat OS package offline bundle downloaded above.
tar -xvf os-pkgs-redhat9-v0.9.3.tar.gz
The contents of the extracted OS package are as follows:
os-pkgs
  ├── import_ospkgs.sh          # script used to import the os packages into the MinIO file service
  ├── os-pkgs-amd64.tar.gz      # os packages for the amd64 architecture
  ├── os-pkgs-arm64.tar.gz      # os packages for the arm64 architecture
  └── os-pkgs.sha256sum.txt     # sha256sum checksum file for the os packages
Import the OS packages into the MinIO on the bootstrap node
Run the following command to import the os packages into the MinIO file service:
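The exact command is not reproduced in this excerpt. Based on the import_ospkgs.sh script listed above, the import usually takes the MinIO address, credentials, and an os packages tarball; the environment variable names and argument order below are assumptions, so check the script's own usage output before running it.

# Hedged sketch: import the amd64 os packages into MinIO on the bootstrap node.
# Variable names and argument order are assumptions; verify against the script itself.
cd os-pkgs
MINIO_USER=rootuser MINIO_PASS=rootpass123 \
  ./import_ospkgs.sh http://<bootstrap-node-ip>:9000 os-pkgs-amd64.tar.gz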
"},{"location":"admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#iso-minio","title":"\u5bfc\u5165 ISO \u79bb\u7ebf\u5305\u81f3\u706b\u79cd\u8282\u70b9\u7684 MinIO","text":"
\u6267\u884c\u5982\u4e0b\u547d\u4ee4, \u5c06 ISO \u5305\u5230 MinIO \u6587\u4ef6\u670d\u52a1\u4e2d:
\u672c\u6587\u4ec5\u9488\u5bf9\u79bb\u7ebf\u6a21\u5f0f\u4e0b\uff0c\u4f7f\u7528 AI \u7b97\u529b\u4e2d\u5fc3\u5e73\u53f0\u521b\u5efa\u5de5\u4f5c\u96c6\u7fa4\uff0c\u7ba1\u7406\u5e73\u53f0\u548c\u5f85\u5efa\u5de5\u4f5c\u96c6\u7fa4\u7684\u67b6\u6784\u5747\u4e3a AMD\u3002 \u521b\u5efa\u96c6\u7fa4\u65f6\u4e0d\u652f\u6301\u5f02\u6784\uff08AMD \u548c ARM \u6df7\u5408\uff09\u90e8\u7f72\uff0c\u60a8\u53ef\u4ee5\u5728\u96c6\u7fa4\u521b\u5efa\u5b8c\u6210\u540e\uff0c\u901a\u8fc7\u63a5\u5165\u5f02\u6784\u8282\u70b9\u7684\u65b9\u5f0f\u8fdb\u884c\u96c6\u7fa4\u6df7\u5408\u90e8\u7f72\u7ba1\u7406\u3002
\u5df2\u7ecf\u90e8\u7f72\u597d\u4e00\u4e2a AI \u7b97\u529b\u4e2d\u5fc3\u5168\u6a21\u5f0f\uff0c\u5e76\u4e14\u706b\u79cd\u8282\u70b9\u8fd8\u5b58\u6d3b\uff0c\u90e8\u7f72\u53c2\u8003\u6587\u6863\u79bb\u7ebf\u5b89\u88c5 AI \u7b97\u529b\u4e2d\u5fc3\u5546\u4e1a\u7248
\u8bf7\u786e\u4fdd\u5df2\u7ecf\u767b\u5f55\u5230\u706b\u79cd\u8282\u70b9\uff01\u5e76\u4e14\u4e4b\u524d\u90e8\u7f72 AI \u7b97\u529b\u4e2d\u5fc3\u65f6\u4f7f\u7528\u7684 clusterConfig.yaml \u6587\u4ef6\u8fd8\u5728\u3002
\u4e0b\u8f7d\u6240\u9700\u7684 Ubuntu OS package \u5305\u548c ISO \u79bb\u7ebf\u5305\uff1a
| Resource | Description | Download URL |
| --- | --- | --- |
| os-pkgs-ubuntu2204-v0.18.2.tar.gz | Ubuntu 22.04 OS package bundle | https://github.com/kubean-io/kubean/releases/download/v0.18.2/os-pkgs-ubuntu2204-v0.18.2.tar.gz |
| ISO offline package | ISO image | http://mirrors.melbourne.co.uk/ubuntu-releases/ |

Import the OS package and ISO offline packages into the MinIO on the bootstrap node
AI Computing Center ETCD backup and restore is limited to backing up and restoring the same cluster (with no change in node count or IP addresses). For example, after backing up the etcd data of cluster A, the backup data can only be restored to cluster A, not to cluster B.
The AI Computing Center backup is a full data backup; when restoring, the full data of the last backup will be restored.
{\"level\":\"warn\",\"ts\":\"2023-03-29T17:51:50.817+0800\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.6/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001ba000/controller-node-1:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.5.14.31:2379: connect: connection refused\\\"\"}\nFailed to get the status of endpoint controller-node-1:2379 (context deadline exceeded)\n{\"level\":\"warn\",\"ts\":\"2023-03-29T17:51:55.818+0800\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.6/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001ba000/controller-node-2:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.5.14.32:2379: connect: connection refused\\\"\"}\nFailed to get the status of endpoint controller-node-2:2379 (context deadline exceeded)\n{\"level\":\"warn\",\"ts\":\"2023-03-29T17:52:00.820+0800\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.6/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001ba000/controller-node-1:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.5.14.33:2379: connect: connection refused\\\"\"}\nFailed to get the status of endpoint controller-node-3:2379 (context deadline exceeded)\n+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |\n+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n
--initial-advertise-peer-urls: the address used for access between etcd member peers. It must be consistent with the etcd configuration.
The expected output is:
INFO[0000] Finding latest set of snapshot to recover from...
INFO[0000] Restoring from base snapshot: Full-00000000-00111147-1679991074 actor=restorer
INFO[0001] successfully fetched data of base snapshot in 1.241380207 seconds actor=restorer
{"level":"info","ts":1680011221.2511616,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":110327}
{"level":"info","ts":1680011221.3045986,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"66638454b9dd7b8a","local-member-id":"0","added-peer-id":"123c2503a378fc46","added-peer-peer-urls":["https://10.6.212.10:2380"]}
INFO[0001] Starting embedded etcd server... actor=restorer
....

{"level":"info","ts":"2023-03-28T13:47:02.922Z","caller":"embed/etcd.go:565","msg":"stopped serving peer traffic","address":"127.0.0.1:37161"}
{"level":"info","ts":"2023-03-28T13:47:02.922Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"default","data-dir":"/var/lib/etcd","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
INFO[0003] Successfully restored the etcd data directory.

etcdctl member list -w table \
--cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
--cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
--key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
The expected output is:
+------------------+---------+-------------------+--------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-------------------+--------------------------+--------------------------+------------+
| 123c2503a378fc46 | started | controller-node-1 | https://10.6.212.10:2380 | https://10.6.212.10:2379 | false |
+------------------+---------+-------------------+--------------------------+--------------------------+------------+
Check the status of controller-node-1:
etcdctl endpoint status --endpoints=controller-node-1:2379 -w table \
--cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
--cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
--key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
The expected output is:
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| controller-node-1:2379 | 123c2503a378fc46 | 3.5.6 | 15 MB | true | false | 3 | 1200 | 1199 | |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Restore the data of the other nodes
The steps above have restored the data of node 01. To restore the data of the other nodes, you just need to start the etcd Pods and let etcd complete the data synchronization by itself.
etcdctl member list -w table \
--cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
--cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
--key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
The expected output is:
+------------------+---------+-------------------+-------------------------+-------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-------------------+-------------------------+-------------------------+------------+
| 6ea47110c5a87c03 | started | controller-node-1 | https://10.5.14.31:2380 | https://10.5.14.31:2379 | false |
| e222e199f1e318c4 | started | controller-node-2 | https://10.5.14.32:2380 | https://10.5.14.32:2379 | false |
| f64eeda321aabe2d | started | controller-node-3 | https://10.5.14.33:2380 | https://10.5.14.33:2379 | false |
+------------------+---------+-------------------+-------------------------+-------------------------+------------+
Check whether the 3 member nodes are healthy:
etcdctl endpoint status --endpoints=controller-node-1:2379,controller-node-2:2379,controller-node-3:2379 -w table \
--cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
--cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
--key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
The AI Computing Center also provides the ability to configure advanced parameters through the UI; add custom parameters in the last step of cluster creation:
| Protocol | Port | Source | Destination | Description |
| --- | --- | --- | --- | --- |
| TCP | 2379-2380 | Servers | Servers | Required for HA with embedded etcd |
| TCP | 6443 | Agents | Servers | K3s supervisor and Kubernetes API Server |
| UDP | 8472 | All nodes | All nodes | Required only for Flannel VXLAN |
| TCP | 10250 | All nodes | All nodes | Kubelet metrics |
| UDP | 51820 | All nodes | All nodes | Required only for Flannel Wireguard with IPv4 |
| UDP | 51821 | All nodes | All nodes | Required only for Flannel Wireguard with IPv6 |
| TCP | 5001 | All nodes | All nodes | Required only for the embedded distributed registry (Spegel) |
| TCP | 6443 | All nodes | All nodes | Required only for the embedded distributed registry (Spegel) |
$ export K3S_VERSION=v1.30.3+k3s1
$ bash k3slcm
* Copying ./v1.30.3/k3s-airgap-images-amd64.tar.zst to 172.30.41.5
* Copying ./v1.30.3/k3s to 172.30.41.5
* Copying ./v1.30.3/k3s-install.sh to 172.30.41.5
* Copying ./v1.30.3/k3s-airgap-images-amd64.tar.zst to 172.30.41.6
* Copying ./v1.30.3/k3s to 172.30.41.6
* Copying ./v1.30.3/k3s-install.sh to 172.30.41.6
* Copying ./v1.30.3/k3s-airgap-images-amd64.tar.zst to 172.30.41.7
* Copying ./v1.30.3/k3s to 172.30.41.7
* Copying ./v1.30.3/k3s-install.sh to 172.30.41.7
* Installing on first server node [172.30.41.5]
[INFO] Skipping k3s download and verify
[INFO] Skipping installation of SELinux RPM
[INFO] Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[INFO] Skipping /usr/local/bin/crictl symlink to k3s, already exists
[INFO] Skipping /usr/local/bin/ctr symlink to k3s, already exists
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] No change detected so skipping service start
* Installing on other server node [172.30.41.6]
......
Currently, the supported version range for self-built worker clusters is v1.26-v1.28; see the AI Computing Center cluster version support policy.
$ cat /etc/fstab

# /etc/fstab
# Created by anaconda on Thu Mar 19 11:32:59 2020
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/centos-root / xfs defaults 0 0
UUID=3ed01f0e-67a1-4083-943a-343b7fed1708 /boot xfs defaults 0 0
/dev/mapper/centos-swap swap swap defaults 0 0
This article only covers adding heterogeneous nodes to worker clusters created with the AI Computing Center platform in offline mode; it does not cover integrated clusters.
A full-mode AI Computing Center has already been deployed and the bootstrap node is still alive. For deployment, refer to the document Offline Installation of the AI Computing Center Commercial Edition.
A worker cluster with AMD architecture and CentOS 7.9 as its operating system has already been created through the AI Computing Center platform; for creation, refer to the document Create a Worker Cluster.
ARM architecture with the Kylin v10 SP2 operating system is used as the example.
Make sure you are logged in to the bootstrap node, and that the clusterConfig.yaml file used when the AI Computing Center was deployed is still available.
kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath=\"{.spec.hostsConfRef}{'\\n'}\"\n{\"name\":\"mini-1-hosts-conf\",\"namespace\":\"kubean-system\"}\n
kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath=\"{.spec.varsConfRef}{'\\n'}\"\n{\"name\":\"mini-1-vars-conf\",\"namespace\":\"kubean-system\"}\n
Following this guide, users can deploy Kubernetes versions other than those recommended in the UI for clusters created by the AI Computing Center platform.
Users can upgrade the Kubernetes version of worker clusters created with the AI Computing Center platform by building an incremental offline package.
Log in to the AI Computing Center UI management interface; you can then continue with the following operations:
This article describes how to create a worker cluster, in offline mode, on an OS that is not declared as supported. For the range of OSes declared as supported by the AI Computing Center, refer to Operating Systems Supported by the AI Computing Center.
The main workflow for creating a worker cluster on an undeclared OS in offline mode is shown in the following figure:
A full-mode AI Computing Center has already been deployed; for deployment, refer to the document Offline Installation of the AI Computing Center Commercial Edition.
Find an online environment whose node architecture and OS are identical to those of the cluster to be created; this article uses AnolisOS 8.8 GA as an example. Run the following command to generate an offline os-pkgs package.
After running the above command, wait until the prompt All packages for node (X.X.X.X) have been installed appears, which indicates that the installation is complete.
The Request and Limit of LSE/LSR Pods must be equal, and the CPU value must be an integer multiple of 1000.
The CPUs allocated to an LSE Pod are fully exclusive and must not be shared. If the node uses a hyper-threaded architecture, isolation is only guaranteed at the logical core level, but better isolation can be achieved with the CPUBindPolicyFullPCPUs policy.
The CPUs allocated to an LSR Pod can only be shared with BE Pods.
LS Pods are bound to the shared CPU pool outside the CPUs exclusively used by LSE/LSR Pods.
BE Pods are bound to all CPUs on the node except those exclusively used by LSE Pods.
In the following example, four Deployments with 1 replica each are created, with the QoS classes set to LSE, LSR, LS, and BE respectively. After the Pods are created, observe the CPU allocation of each Pod.
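The four Deployment manifests themselves are not reproduced in this excerpt. A minimal sketch of the LSE variant follows, assuming the QoS class is declared with Koordinator's koordinator.sh/qosClass Pod label; the other three Deployments would differ only in that label value (LSR, LS, BE) and in their resource sizes.

# Hedged sketch of the LSE Deployment; the koordinator.sh/qosClass label is an assumption
# about how the QoS class is declared in this environment.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-lse
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-lse
  template:
    metadata:
      labels:
        app: nginx-lse
        koordinator.sh/qosClass: LSE   # QoS class: LSE / LSR / LS / BE
    spec:
      containers:
        - name: nginx
          image: nginx
          resources:
            requests:
              cpu: "2"        # for LSE/LSR, request must equal limit and be a multiple of 1000m
              memory: 256Mi
            limits:
              cpu: "2"
              memory: 256Mi
EOF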
[root@controller-node-1 ~]# ./get_cpuset.sh nginx-lse-56c9cd77f5-cdqbd
CPU set for Pod nginx-lse-56c9cd77f5-cdqbd (Burstable QoS): 0-1
The Pod with QoS type LSR is bound to CPU cores 2-3, which can be shared with BE-type Pods.
[root@controller-node-1 ~]# ./get_cpuset.sh nginx-lsr-c7fdb97d8-b58h8
CPU set for Pod nginx-lsr-c7fdb97d8-b58h8 (Burstable QoS): 2-3
The Pod with QoS type LS uses CPU cores 4-15, bound to the shared CPU pool outside the cores exclusively used by the LSE/LSR Pods.
[root@controller-node-1 ~]# ./get_cpuset.sh nginx-ls-54746c8cf8-rh4b7
CPU set for Pod nginx-ls-54746c8cf8-rh4b7 (Burstable QoS): 4-15
The Pod with QoS type BE can use the CPUs other than those exclusively used by the LSE Pod.
[root@controller-node-1 ~]# ./get_cpuset.sh nginx-be-577c946b89-js2qn
CPU set for Pod nginx-be-577c946b89-js2qn (BestEffort QoS): 2,4-12
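The get_cpuset.sh helper used above is not included in this excerpt. A simple way to check which CPUs a container is actually pinned to, assuming cgroup v1 is mounted inside the container, is to read its cpuset directly:

# Hedged sketch: read the cpuset of a Pod's first container (the cgroup v1 path is an assumption).
POD=nginx-be-577c946b89-js2qn   # example Pod name taken from the output above
kubectl exec "$POD" -- cat /sys/fs/cgroup/cpuset/cpuset.cpus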
Check whether the ratio of the workload's Pod resource requests to its limits conforms to the overcommit ratio.
Clusters integrated into or created by the Suanfeng AI Computing Platform container management module can be accessed directly through the UI, and access can also be controlled in two other ways:
The Suanfeng AI Computing Platform classifies clusters into roles based on their different functional positioning, helping users better manage their IT infrastructure.
This cluster is used to run Suanfeng AI Computing Platform components, such as Container Management, Global Management, Observability, and the image registry. It generally does not carry business workloads.
In the Suanfeng AI Computing Platform, integrated clusters and self-built clusters use different version support mechanisms.
For example, if the version range supported by the community is 1.25, 1.26, and 1.27, the version range for creating worker clusters through the UI in the Suanfeng AI Computing Platform is 1.24, 1.25, and 1.26, and a stable version such as 1.24.7 is recommended to users.
In addition, the version range for worker clusters created through the UI in the Suanfeng AI Computing Platform stays closely in sync with the community: when the community version increments, the UI-created worker cluster version range in the platform also increments by one version.
"},{"location":"admin/kpanda/clusters/cluster-version.html#kubernetes","title":"Kubernetes \u7248\u672c\u652f\u6301\u8303\u56f4","text":"Kubernetes \u793e\u533a\u7248\u672c\u8303\u56f4 \u81ea\u5efa\u5de5\u4f5c\u96c6\u7fa4\u7248\u672c\u8303\u56f4 \u81ea\u5efa\u5de5\u4f5c\u96c6\u7fa4\u63a8\u8350\u7248\u672c \u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u5b89\u88c5\u5668 \u53d1\u5e03\u65f6\u95f4
In the Container Management module of the Suanfeng AI Computing Platform, clusters fall into four roles: global service cluster, management cluster, worker cluster, and integrated cluster. Integrated clusters can only be integrated from third-party vendors; see Integrate a Cluster.
This page describes how to create a worker cluster. By default, the OS type and CPU architecture of the worker nodes of a new worker cluster must be consistent with those of the global service cluster. If you need to create a cluster with nodes whose OS or architecture differs from the global service cluster, refer to Creating an Ubuntu Worker Cluster on a CentOS Management Platform.
It is recommended to create clusters using an operating system supported by the Suanfeng AI Computing Platform. If your local nodes are outside the supported range, refer to Creating a Cluster on a Non-Mainstream Operating System.
Prepare a certain number of nodes according to business needs, with consistent OS types and CPU architectures.
Kubernetes version 1.29.5 is recommended. For the specific version range, see the Suanfeng AI Computing Platform Cluster Version Support Policy; the platform currently supports self-built worker cluster versions in the range v1.28.0-v1.30.2. To create a cluster with a lower version, refer to Cluster Version Support Range and Deploying and Upgrading Kubean for Backward-Compatible Versions.
The target hosts must allow IPv4 forwarding. If Pods and Services use IPv6, the target servers must also allow IPv6 forwarding.
The Suanfeng AI Computing Platform does not currently provide firewall management; you need to define the firewall rules of the target hosts yourself in advance. To avoid problems during cluster creation, it is recommended to disable the firewall on the target hosts (see the sketch below).
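As a quick reference, preparing a target host as described above might look like the following; firewalld is an assumption about the host's firewall implementation, and the sysctl change should also be persisted in /etc/sysctl.conf.

# Enable IPv4 forwarding (runtime only; persist it in /etc/sysctl.conf or /etc/sysctl.d/).
sysctl -w net.ipv4.ip_forward=1
# Disable the firewall, as recommended above (assumes firewalld).
systemctl disable --now firewalld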
Service CIDR: the CIDR used by Service resources when containers within the same cluster access each other; it determines the upper limit on Service resources and cannot be modified after creation.
To completely delete an integrated cluster, you need to do so on the original platform where that cluster was created. The Suanfeng AI Computing Platform does not support deleting integrated clusters.
In the Suanfeng AI Computing Platform, the difference between Uninstall Cluster and Remove Integration is:
Step 3: Integrate the cluster in the Suanfeng AI Computing Platform UI
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Dec 14, 2024 07:26 UTC 204d no
apiserver Dec 14, 2024 07:26 UTC 204d ca no
apiserver-etcd-client Dec 14, 2024 07:26 UTC 204d etcd-ca no
apiserver-kubelet-client Dec 14, 2024 07:26 UTC 204d ca no
controller-manager.conf Dec 14, 2024 07:26 UTC 204d no
etcd-healthcheck-client Dec 14, 2024 07:26 UTC 204d etcd-ca no
etcd-peer Dec 14, 2024 07:26 UTC 204d etcd-ca no
etcd-server Dec 14, 2024 07:26 UTC 204d etcd-ca no
front-proxy-client Dec 14, 2024 07:26 UTC 204d front-proxy-ca no
scheduler.conf Dec 14, 2024 07:26 UTC 204d no

CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Dec 12, 2033 07:26 UTC 9y no
etcd-ca Dec 12, 2033 07:26 UTC 9y no
front-proxy-ca Dec 12, 2033 07:26 UTC 9y no
Static Pods are managed by the local kubelet rather than by the API server, so kubectl cannot be used to delete or restart them.
If a Pod's manifest is not in the manifest directory, the kubelet will terminate it. After another fileCheckFrequency period, you can move the file back; the kubelet will then recreate the Pod, and the certificate renewal for that component is completed.
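A minimal sketch of the move-out / move-back approach described above, assuming a kubeadm cluster with the default static Pod manifest directory and the default 20-second fileCheckFrequency:

# Move the manifest out so the kubelet terminates the static Pod ...
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 20   # wait at least one fileCheckFrequency period (default 20s)
# ... then move it back so the kubelet recreates the Pod with the renewed certificates.
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/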
Kubernetes versions are expressed as x.y.z, where x is the major version, y is the minor version, and z is the patch version.
When the secret type is TLS (kubernetes.io/tls): you need to fill in the certificate credential and the private key data. The certificate is a self-signed or CA-signed credential used for authentication; a certificate request is a request for signing and needs to be signed with the private key.
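For reference, the same kind of TLS secret can also be created from certificate and key files on the command line; the file names below are placeholders.

kubectl create secret tls my-tls-secret --cert=tls.crt --key=tls.key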
"},{"location":"admin/kpanda/configmaps-secrets/use-secret.html#pod","title":"\u4f7f\u7528\u5bc6\u94a5\u4f5c\u4e3a Pod \u7684\u6570\u636e\u5377","text":""},{"location":"admin/kpanda/configmaps-secrets/use-secret.html#_6","title":"\u56fe\u5f62\u754c\u9762\u64cd\u4f5c","text":"
This article introduces the Suanfeng AI Computing container management platform's capabilities for unified operations and management of heterogeneous resources, represented by GPUs.
With the rapid development of emerging technologies such as AI applications, large models, artificial intelligence, and autonomous driving, enterprises face a growing number of compute-intensive tasks and data processing demands. Traditional computing architectures represented by CPUs can no longer meet these growing computing needs. Heterogeneous computing represented by GPUs has been widely adopted because of its unique advantages in processing large-scale data, performing complex computation, and real-time graphics rendering.
At the same time, the lack of experience and professional solutions in areas such as heterogeneous resource scheduling and management has led to extremely low utilization of GPU devices, bringing enterprises high AI production costs. How to reduce costs, increase efficiency, and improve the utilization of heterogeneous resources such as GPUs has become a pressing challenge that many enterprises urgently need to overcome.
The Suanfeng AI Computing container management platform supports unified scheduling and operations management of heterogeneous resources such as GPUs and NPUs, fully unleashing the computing power of GPU resources and accelerating the development of emerging enterprise applications such as AI. The GPU management capabilities are as follows:
The Suanfeng AI Computing Platform container management module has been deployed and is running normally.
Number of physical GPUs (iluvatar.ai/vcuda-core): the number of physical GPUs the current Pod needs to mount; the input value must be an integer and less than or equal to the number of GPUs on the host.
After the adjustment, check the GPU resources allocated to the Pod:
Through the steps above, you can dynamically adjust the computing power and GPU memory resources of a vGPU Pod without restarting it, meeting business needs more flexibly and optimizing resource utilization.
This page describes the support matrix of GPUs and operating systems for the Suanfeng AI Computing Platform.
Spread: multiple Pods are spread across different GPUs on a node, suitable for high-availability scenarios and avoiding single-GPU failures.
Binpack: multiple Pods preferentially select the same node, suitable for improving GPU utilization and reducing resource fragmentation.
Spread: multiple Pods are spread across different nodes, suitable for high-availability scenarios and avoiding single-node failures.
Number of physical GPUs (nvidia.com/vgpu): the number of physical GPUs the current Pod needs to mount, which must be less than or equal to the number of GPUs on the host.
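A minimal sketch of requesting one physical GPU through the vGPU resource name mentioned above follows; whether additional resource names (for example, GPU memory or compute share) are also required depends on the vGPU solution in use and is not shown here.

# Hedged sketch: request one physical GPU via nvidia.com/vgpu.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/vgpu: "1"   # number of physical GPUs; must be <= GPUs on the host
EOF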
Number of physical GPUs (huawei.com/Ascend910): the number of physical GPUs the current Pod needs to mount; the input value must be an integer and less than or equal to the number of GPUs on the host.
The Suanfeng AI Computing Platform container management module has been deployed and is running normally.
GPU computing power (cambricon.com/mlu.smlu.vcore): the percentage of cores the current Pod needs to use.
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  containers:
    - image: ubuntu:16.04
      name: pod1-ctr
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          cambricon.com/mlu: "1" # use this when device type is not enabled, else delete this line.
          #cambricon.com/mlu: "1" #uncomment to use when device type is enabled
          #cambricon.com/mlu.share: "1" #uncomment to use device with env-share mode
          #cambricon.com/mlu.mim-2m.8gb: "1" #uncomment to use device with mim mode
          #cambricon.com/mlu.smlu.vcore: "100" #uncomment to use device with mim mode
          #cambricon.com/mlu.smlu.vmemory: "1024" #uncomment to use device with mim mode
To expose "identical" MIG device types across all products, create the same GIs and CIs.
The Suanfeng AI Computing Platform container management module has been deployed and is running normally.
This article demonstrates with CentOS 7.9 (kernel 3.10.0-1160) on AMD architecture. For deployment on Red Hat 8.4, refer to Uploading the Red Hat GPU Operator Offline Image to the Bootstrap Node Registry and Building a Red Hat 8.4 Offline Yum Repository.
When using the built-in operating system version, there is no need to modify the image version; for other operating system versions, refer to Uploading Images to the Bootstrap Node Registry. Note that the operating system name (Ubuntu, CentOS, Red Hat, etc.) does not need to be appended after the version number; if the official image contains an operating system suffix, remove it manually.
Upload the Red Hat GPU Operator offline image to the bootstrap node registry
This article uses the Red Hat 8.4 offline driver image nvcr.io/nvidia/driver:525.105.17-rhel8.4 as an example to describe how to upload an offline image to the bootstrap node registry.
ctr i pull nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i export --all-platforms driver.tar.gz nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i import driver.tar.gz
ctr i tag nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 {bootstrap-node-registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i push {bootstrap-node-registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 --skip-verify=true
When the kernel version of the worker nodes differs from the kernel version or OS type of the control nodes of the global service cluster, users need to manually build an offline yum repository.
Check the OS and kernel version of the cluster nodes
Run the following commands on a control node of the global service cluster and on the node where GPU Operator is to be deployed. If the OS and kernel versions of the two nodes are identical, there is no need to build a yum repository; you can install directly by following the Offline Installation of GPU Operator document. If the OS or kernel versions differ, proceed to the next step.
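For reference, the OS release and kernel version can be checked with the following commands (the release file shown is for RHEL/CentOS; other distributions use different files):

cat /etc/redhat-release   # OS release, e.g. Red Hat Enterprise Linux release 8.4
uname -r                  # kernel version, e.g. 4.18.0-305.el8.x86_64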
In the current directory on the node, run the following command to connect the node's local mc command-line tool to the MinIO server.
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123
The expected output is:
Added `minio` successfully.
The mc command-line tool is the client command-line tool provided by the MinIO file server; for details, refer to MinIO Client.
The OS of the cluster nodes where GPU Operator is to be deployed must be Red Hat 8.4, and the kernel versions must be exactly the same.
This article uses a Red Hat 8.4 node with kernel 4.18.0-305.el8.x86_64 as an example to describe how to build a Red Hat 8.4 offline yum repository package based on any node of the global service cluster, and how to use it via the RepoConfig.ConfigMapName parameter when installing GPU Operator.
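As a hedged sketch, the resulting ConfigMap is typically passed to the GPU Operator Helm chart through the driver's repoConfig value; the chart reference, release name, namespace, and ConfigMap name below are placeholders, and the exact value path should be verified against the chart version in use.

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.repoConfig.configMapName=<your-repo-configmap>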
In the current directory on the node, run the following command to connect the node's local mc command-line tool to the MinIO server.
mc config host add minio <file-server-access-address> <username> <password>
For example:
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123
The expected output is:
Added `minio` successfully.
The mc command-line tool is the client command-line tool provided by the MinIO file server; for details, refer to MinIO Client.
"},{"location":"admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html","title":"\u6784\u5efa Red Hat 7.9 \u79bb\u7ebf yum \u6e90","text":""},{"location":"admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#_1","title":"\u4f7f\u7528\u573a\u666f\u4ecb\u7ecd","text":"
The Suanfeng AI Computing Platform ships with a GPU Operator offline package for CentOS 7.9 with kernel 3.10.0-1160. For nodes with other OS types or kernels, users need to manually build an offline yum repository.
This article describes how to build a Red Hat 7.9 offline yum repository package based on any node of the global service cluster, and how to use the RepoConfig.ConfigMapName parameter when installing GPU Operator.
The OS of the cluster nodes where GPU Operator is to be deployed must be Red Hat 7.9, and the kernel versions must be exactly the same.
2. Download the offline driver image for Red Hat 7.9 OS
3. Upload the Red Hat GPU Operator offline image to the bootstrap node registry
Refer to Uploading the Red Hat GPU Operator Offline Image to the Bootstrap Node Registry.
SM: Streaming Multiprocessor, the GPU's core compute unit, responsible for executing graphics rendering and general-purpose compute tasks. Each SM contains a set of CUDA cores, plus shared memory, a register file, and other resources, and can execute multiple threads concurrently. Each MIG instance owns a certain number of SMs and other related resources, together with its partitioned GPU memory.
GPU SM Slice: a GPU SM slice is the smallest compute unit of SMs on a GPU. When configured in MIG mode, a GPU SM slice is roughly one seventh of the total number of SMs available in the GPU.
Compute Instance: the compute slices of a GPU instance can be further subdivided into multiple Compute Instances (CIs); CIs share the engines and memory of their parent GI, but each CI has dedicated SM resources.
The compute slices of a GPU Instance (GI) can be further subdivided into multiple Compute Instances (CIs), where CIs share the engines and memory of the parent GI but each CI has dedicated SM resources. Using the same 4g.20gb example as above, a CI can be created with the 1c.4g.20gb compute configuration that uses only the first compute slice, as shown in the blue part of the figure below:
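For reference, the GI/CI combination described above can be created with nvidia-smi on a MIG-enabled GPU. Profile names and availability vary by GPU model, so list the profiles first; the commands below are a hedged sketch rather than an exact procedure for a specific card.

nvidia-smi mig -lgip                        # list available GPU instance (GI) profiles
nvidia-smi mig -cgi 4g.20gb                 # create a 4g.20gb GPU instance
nvidia-smi mig -lgi                         # list created GPU instances and note the GI ID
nvidia-smi mig -cci 1c.4g.20gb -gi <gi-id>  # create a 1c.4g.20gb compute instance inside that GI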
The Suanfeng AI Computing Platform container management module has been deployed and is running normally.
Number of physical GPUs (nvidia.com/vgpu): the number of physical GPUs the current Pod needs to mount; the input value must be an integer and less than or equal to the number of GPUs on the host.
A NUMA node is a basic building block in the Non-Uniform Memory Access (NUMA) architecture; a Kubernetes node is a collection of multiple NUMA nodes. Memory access across NUMA nodes incurs latency, so developers can improve memory access efficiency and overall performance by optimizing task scheduling and memory allocation strategies.
Common scenarios for NUMA affinity scheduling are compute-intensive jobs that are sensitive to CPU parameters or scheduling latency, such as scientific computing, video decoding, animation rendering, and offline big data processing.
These are the NUMA placement policies that can be used when scheduling a Pod; for the scheduling behavior corresponding to each policy, see the description of Pod scheduling behavior.
single-numa-node: when the Pod is scheduled, a node is selected from the node pool whose topology manager policy is set to single-numa-node, and the CPUs must be placed on the same NUMA node. If no node in the node pool satisfies this condition, the Pod cannot be scheduled.
restricted: when the Pod is scheduled, a node is selected from the node pool whose topology manager policy is set to restricted, and the CPUs must be placed on the same NUMA node set. If no node in the node pool satisfies this condition, the Pod cannot be scheduled.
best-effort: when the Pod is scheduled, a node is selected from the node pool whose topology manager policy is set to best-effort, and CPUs are placed on the same NUMA node as far as possible. If no node satisfies this condition, the most suitable node is chosen for placement.
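In Volcano, the placement policy is usually declared on the Pod itself through an annotation; the annotation key below follows Volcano's NUMA-aware scheduling feature and should be verified against the Volcano version in use.

# Hedged sketch: a Pod that requests single-NUMA placement under the Volcano scheduler.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: numa-demo
  annotations:
    volcano.sh/numa-topology-policy: single-numa-node   # one of: none, best-effort, restricted, single-numa-node
spec:
  schedulerName: volcano
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "2"
        limits:
          cpu: "2"
EOF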
When a Pod sets a topology policy, Volcano predicts a list of matching nodes based on the Pod's topology policy. The scheduling process is as follows:
Based on the Volcano topology policy set on the Pod, filter the nodes that have the same policy.
Among the nodes with the same policy, select for scheduling those whose CPU topology satisfies the policy's requirements.
| Topology policy configurable on the Pod | 1. Filter schedulable nodes by the Pod's topology policy | 2. Further filter nodes whose CPU topology satisfies the policy |
| --- | --- | --- |
| none | No filtering is applied for nodes configured with any of the following topology policies. none: schedulable; best-effort: schedulable; restricted: schedulable; single-numa-node: schedulable | - |
| best-effort | Filters nodes whose topology policy is also "best-effort". none: not schedulable; best-effort: schedulable; restricted: not schedulable; single-numa-node: not schedulable | Schedules while satisfying the policy as far as possible: placement on a single NUMA node is preferred; if a single NUMA node cannot satisfy the CPU request, placement across multiple NUMA nodes is allowed. |
| restricted | Filters nodes whose topology policy is also "restricted". none: not schedulable; best-effort: not schedulable; restricted: schedulable; single-numa-node: not schedulable | Strictly restricted scheduling: when the CPU capacity limit of a single NUMA node is greater than or equal to the CPU request, scheduling is only allowed onto a single NUMA node; if the remaining available CPU on that NUMA node is insufficient, the Pod cannot be scheduled. When the CPU capacity limit of a single NUMA node is smaller than the CPU request, placement across multiple NUMA nodes is allowed. |
| single-numa-node | Filters nodes whose topology policy is also "single-numa-node". none: not schedulable; best-effort: not schedulable; restricted: not schedulable; single-numa-node: schedulable | Scheduling is only allowed onto a single NUMA node. |

Configure a NUMA affinity scheduling policy
Assume the NUMA nodes are as follows:

| Worker Node | Topology Manager Policy of the Node | Allocatable CPU on NUMA node 0 | Allocatable CPU on NUMA node 1 |
| --- | --- | --- | --- |
| node-1 | single-numa-node | 16U | 16U |
| node-2 | best-effort | 16U | 16U |
| node-3 | best-effort | 20U | 20U |

In Example 1, the Pod requests 2U of CPU and its topology policy is set to "single-numa-node", so it is scheduled to node-1, which uses the same policy.
In Example 2, the Pod requests 20U of CPU and its topology policy is set to "best-effort". It is scheduled to node-3, because node-3 can satisfy the Pod's CPU request on a single NUMA node, whereas node-2 would need two NUMA nodes to do so.
View the CPU Overview of the Current Node
You can view the CPU overview of the current node with the lscpu command:
View the CPU Allocation of the Current Node
Then check how the NUMA nodes are being used:
```
# View the CPU allocation on the current node
cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0,10-15,25-31","entries":{"777870b5-c64f-42f5-9296-688b9dc212ba":{"container-1":"16-24"},"fb15e10a-b6a5-4aaa-8fcd-76c1aa64e6fd":{"container-1":"1-9"}},"checksum":318470969}
```
The example above shows that two containers are running on the node: one occupies cores 1-9 of NUMA node0, and the other occupies cores 16-24 of NUMA node1.
Using Volcano's Gang Scheduler
The Gang scheduling policy is one of the core scheduling algorithms of volcano-scheduler. It satisfies the "All or nothing" requirement of the scheduling process and prevents arbitrary scheduling of Pods from wasting cluster resources. Concretely, the algorithm checks whether the number of Pods already scheduled under a Job meets the minimum running count; only when the Job's minimum running count is satisfied are all Pods under the Job scheduled, otherwise none are.
The Gang scheduling algorithm, built on the pod-group concept, is well suited to scenarios that need multi-process cooperation. AI scenarios often involve complex workflows such as Data Ingestion, Data Analysis, Data Splitting, Trainer, Serving, and Logging, which require a group of containers to work together, making group-based Gang scheduling a natural fit. Multi-threaded parallel computation and communication under the MPI framework, where master and worker processes must cooperate, is also a very good match for Gang scheduling. Highly related containers in a pod group may also compete for resources, and scheduling and allocating them as a whole effectively avoids deadlocks.
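As a concrete illustration of the "all or nothing" behavior, the sketch below shows a minimal Volcano Job whose minAvailable equals its replica count, so no Pod is scheduled until all three can be placed. The name, image, and command are placeholders.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo                    # hypothetical name
spec:
  schedulerName: volcano
  minAvailable: 3                    # gang size: schedule all 3 Pods or none
  tasks:
    - replicas: 3
      name: worker
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: busybox         # placeholder image
              command: ["sleep", "3600"]
```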
When Binpack scores a node, it combines the weight of the Binpack plugin itself with the weight configured for each resource. It first scores each resource type requested by the Pod in turn; taking CPU as an example, the score of the CPU resource on a candidate node is computed as follows:
```
CPU.weight * (request + used) / allocatable
```
That is, the higher the CPU weight and the fuller the node's resource usage, the higher the score. Memory, GPU, and other resources follow the same principle. In the formula (a configuration sketch follows this list):
CPU.weight is the CPU weight configured by the user
request is the amount of CPU requested by the current Pod
used is the amount of CPU already allocated on the current node
allocatable is the total allocatable CPU of the current node
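These weights are set in the volcano-scheduler configuration. The sketch below is minimal and assumes the upstream volcano-scheduler.conf format and the default ConfigMap name; the weight values are purely illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap    # assumed default name of the scheduler ConfigMap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: gang
      - name: priority
    - plugins:
      - name: nodeorder
      - name: binpack
        arguments:
          binpack.weight: 10       # weight of the Binpack plugin itself
          binpack.cpu: 5           # CPU.weight in the formula above
          binpack.memory: 1        # memory weight
```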
Priority is determined by the Value of the configured PriorityClass: the larger the value, the higher the priority. It is enabled by default and requires no change; you can confirm or modify it with the following command.
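For reference, a PriorityClass is a standard Kubernetes object; a minimal sketch looks like this (the name, value, and description are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority        # hypothetical name
value: 1000000               # higher value means higher priority
globalDefault: false
description: "Priority class for latency-sensitive AI jobs"
```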
Check the Pod information with kubectl get pod; when cluster resources are insufficient, the Pod stays in the Pending state:
In addition, Volcano integrates seamlessly with mainstream computing frameworks such as Spark, TensorFlow, and PyTorch, and supports mixed scheduling of heterogeneous devices including CPUs and GPUs, providing comprehensive optimization for AI computing tasks.
Next, we will describe how to install and use Volcano so that you can take full advantage of its scheduling policies to optimize AI computing tasks.
```
I1222 15:01:47.119777 8743 sync.go:45] Using config file: "examples/sync-dao-2048.yaml"
W1222 15:01:47.234238 8743 syncer.go:263] Ignoring skipDependencies option as dependency sync is not supported if container image relocation is true or syncing from/to intermediate directory
I1222 15:01:47.234685 8743 sync.go:58] There is 1 chart out of sync!
I1222 15:01:47.234706 8743 sync.go:66] Syncing "dao-2048_1.4.1" chart...
.relok8s-images.yaml hints file found
Computing relocation...

Relocating dao-2048@1.4.1...
Pushing 10.5.14.40/daocloud/dao-2048:v1.4.1...
Done
Done moving /var/folders/vm/08vw0t3j68z9z_4lcqyhg8nm0000gn/T/charts-syncer869598676/dao-2048-1.4.1.tgz
```
The cluster inspection feature provided by the container management module of the Suanova AI computing platform supports custom inspection items across three dimensions: cluster, node, and pod. A visual inspection report is generated automatically when the inspection finishes.
Pod dimension: checks the CPU and memory usage of Pods, their running status, the status of PVs and PVCs, and so on.
To learn about or perform security-related inspections, refer to the security scan types supported by the Suanova AI computing platform.
The container management module of the Suanova AI computing platform provides cluster inspection, supporting inspections at the cluster, node, and pod dimensions.
Pod dimension: checks the CPU and memory usage of Pods, their running status, the status of PVs and PVCs, and so on.
Applications and services that were already running on a node before it is set as a dedicated node are not affected and continue to run normally on that node; only when those Pods are deleted or rebuilt will they be scheduled to other, non-dedicated nodes.
Because platform components such as kpanda, ghippo, and insight run on the global service cluster, enabling namespace-dedicated nodes on Global may prevent system components from being scheduled to the dedicated nodes after a restart, affecting the overall high availability of the system. Therefore, we generally do not recommend enabling the namespace-dedicated-node feature on the global service cluster.
After the dedication is cancelled, Pods from other namespaces can also be scheduled onto the node.
A pod security policy controls Pod behavior across security dimensions in a Kubernetes cluster by configuring different levels and modes for a given namespace: only Pods that satisfy certain conditions are accepted by the system. It defines three levels and three modes, and users can choose the combination that best fits their needs to set the restriction policy.
Note
Each security mode can be configured with only one security policy. Also be careful when configuring the enforce security mode for a namespace: violations will prevent Pods from being created.
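The levels and modes are applied to a namespace through Pod Security Admission labels. A minimal sketch, assuming the standard pod-security.kubernetes.io labels; the namespace name and the chosen levels are illustrative:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo-ns                                    # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline   # violating Pods are rejected
    pod-security.kubernetes.io/warn: restricted    # violations only trigger a warning
    pod-security.kubernetes.io/audit: restricted   # violations are recorded in the audit log
```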
An Ingress instance has been created, an application workload has been deployed, and the corresponding Service has been created.
In a Kubernetes cluster, every Pod has its own internal IP address, but the Pods of a workload may be created and deleted at any time, so Pod IP addresses cannot be used directly to expose a service externally.
This is why a Service is needed: a Service gives you a fixed IP address, decoupling the frontend of the workload from its backend and allowing external users to access it. A Service can also provide load balancing (LoadBalancer), so that users can reach the workload from the public network.
Select ClusterIP (intra-cluster access), which exposes the Service through the cluster's internal IP; a Service of this type can only be accessed from inside the cluster. This is the default Service type. Configure the parameters according to the table below.
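A minimal ClusterIP Service sketch; the name, selector, and ports are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-svc            # hypothetical name
spec:
  type: ClusterIP          # in-cluster access only
  selector:
    app: web               # matches the Pods of the workload
  ports:
    - protocol: TCP
      port: 80             # Service port
      targetPort: 8080     # container port
```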
Policy configuration is divided into ingress policies and egress policies. For a source Pod to connect successfully to a target Pod, both the egress policy of the source Pod and the ingress policy of the target Pod must allow the connection; if either side denies it, the connection fails.
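The sketch below illustrates the ingress side: a policy on the target Pods that only admits traffic from the source Pods. A matching egress rule would be needed on the source side if its namespace restricts egress. Labels, ports, and the policy name are illustrative.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend     # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: backend              # target Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend     # source Pods allowed to connect
      ports:
        - protocol: TCP
          port: 8080
```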
\u6267\u884c ls \u547d\u4ee4\u67e5\u770b\u7ba1\u7406\u96c6\u7fa4\u4e0a\u7684\u5bc6\u94a5\u662f\u5426\u521b\u5efa\u6210\u529f\uff0c\u6b63\u786e\u53cd\u9988\u5982\u4e0b\uff1a
| Check Item | Description |
| --- | --- |
| Operating system | See the supported architectures and operating systems |
| SELinux | Disabled |
| Firewall | Disabled |
| Architecture consistency | The CPU architecture is consistent across nodes (for example, all ARM or all x86) |
| Host time | The synchronization error between all hosts is less than 10 seconds |
| Network connectivity | The node and its SSH port can be reached normally by the platform |
| CPU | Available CPU resources are greater than 4 cores |
| Memory | Available memory resources are greater than 8 GB |

Supported Architectures and Operating Systems

| Architecture | Operating System | Notes |
| --- | --- | --- |
| ARM | Kylin Linux Advanced Server release V10 (Sword) SP2 | Recommended |
| ARM | UOS Linux | |
| ARM | openEuler | |
| x86 | CentOS 7.x | Recommended |
| x86 | Redhat 7.x | Recommended |
| x86 | Redhat 8.x | Recommended |
| x86 | Flatcar Container Linux by Kinvolk | |
| x86 | Debian Bullseye, Buster, Jessie, Stretch | |
| x86 | Ubuntu 16.04, 18.04, 20.04, 22.04 | |
| x86 | Fedora 35, 36 | |
| x86 | Fedora CoreOS | |
| x86 | openSUSE Leap 15.x/Tumbleweed | |
| x86 | Oracle Linux 7, 8, 9 | |
| x86 | Alma Linux 8, 9 | |
| x86 | Rocky Linux 8, 9 | |
| x86 | Amazon Linux 2 | |
| x86 | Kylin Linux Advanced Server release V10 (Sword) - SP2 | Hygon |
| x86 | UOS Linux | |
| x86 | openEuler | |

Node Details
Nodes can be paused for scheduling or resumed. Pausing scheduling means that Pods will no longer be scheduled onto the node; resuming scheduling means that Pods can be scheduled onto the node again.
A taint allows a node to repel a certain class of Pods, preventing them from being scheduled onto that node. One or more taints can be applied to each node; Pods that cannot tolerate these taints will not be scheduled onto the node.
After a taint is added to a node, only Pods that tolerate the taint can be scheduled onto it (a toleration sketch follows the effect list below).
NoSchedule: New Pods will not be scheduled onto a node carrying this taint unless they have a matching toleration. Pods already running on the node are not evicted.
If the Pod cannot tolerate the taint, it is evicted immediately.
If the Pod tolerates the taint but its toleration does not specify tolerationSeconds, it keeps running on the node indefinitely.
If the Pod tolerates the taint and specifies tolerationSeconds, it may keep running on the node for the specified duration, after which it is evicted.
PreferNoSchedule: a "soft" version of NoSchedule. The control plane will try to avoid scheduling Pods that do not tolerate the taint onto the node, but this is not guaranteed, so avoid relying on this taint where possible.
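A minimal sketch of a node taint and a Pod toleration that matches it; the taint key, value, and names are illustrative:

```yaml
# Node with a taint: only Pods tolerating gpu=true:NoSchedule can be scheduled here.
apiVersion: v1
kind: Node
metadata:
  name: node-1                  # hypothetical node name
spec:
  taints:
    - key: gpu
      value: "true"
      effect: NoSchedule
---
# Pod that tolerates the taint above.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload            # hypothetical name
spec:
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: app
      image: nginx              # placeholder image
```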
The current cluster has been connected to container management, and the kolm component has been installed on the global service cluster (search for kolm in the Helm templates).
Permission points only need to be added on the Global Cluster; the Kpanda controller synchronizes the permission points added on the Global Cluster to all connected sub-clusters. The synchronization takes some time to complete.
Permission points can only be added on the Global Cluster; permission points added on a sub-cluster will be overwritten by the built-in role permission points of the Global Cluster.
Only a ClusterRole with the fixed Label can be used to append permissions; replacing or deleting permissions is not supported, and a Role cannot be used to append permissions. The mapping between the built-in roles and the Labels of user-created ClusterRoles is as follows.
The Suanova AI computing platform supports metric-based horizontal elastic scaling of Pod resources (Horizontal Pod Autoscaling, HPA). Users can dynamically adjust the number of Pod replicas by setting CPU utilization, memory usage, or custom metrics. For example, after a CPU-utilization-based autoscaling policy is set for a workload, the workload controller automatically increases or decreases the number of Pod replicas when the Pods' CPU utilization rises above or falls below the threshold you set.
If an HPA policy is created based on CPU utilization, a resource limit (Limit) must be configured for the workload in advance; otherwise the CPU utilization cannot be calculated.
The system has built-in CPU and memory autoscaling metrics to cover basic business scenarios (an HPA sketch follows the two metrics below).
Target CPU utilization: the CPU utilization of the Pods under the workload, calculated as the CPU usage of all Pods under the workload divided by the workload's requested (request) value. When the actual CPU usage is greater/less than the target value, the system automatically increases/decreases the number of Pod replicas.
Target memory usage: the memory usage of the Pods under the workload. When the actual memory usage is greater/less than the target value, the system automatically increases/decreases the number of Pod replicas.
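A minimal HorizontalPodAutoscaler sketch for the built-in CPU metric, assuming the upstream autoscaling/v2 API; the workload name, replica bounds, and threshold are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                    # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # the workload to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # target CPU utilization in percent
```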
The Vertical Pod Autoscaler (VPA) monitors the resource requests and actual usage of a Pod over a period of time and computes the most suitable CPU and memory request values for that Pod. Using VPA allows resources to be allocated more sensibly to each Pod in the cluster, improving overall resource utilization and avoiding waste of cluster resources.
The Suanova AI computing platform supports the Vertical Pod Autoscaler (VPA), which can dynamically adjust Pod request values based on actual container resource usage. The platform supports modifying resource request values either manually or automatically; configure whichever suits your needs.
This article describes how to configure vertical Pod scaling for a workload.
Warning
Using VPA to modify a Pod's resource requests triggers a Pod restart. Due to limitations of Kubernetes itself, the Pod may be scheduled onto a different node after the restart.
Scaling mode: how the CPU and memory request values are modified; vertical scaling currently supports manual and automatic scaling modes.
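A minimal VerticalPodAutoscaler sketch, assuming the upstream autoscaling.k8s.io/v1 API; updateMode "Auto" corresponds to the automatic mode, while "Off" (recommendation only) roughly corresponds to the manual mode. The workload name is illustrative.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa                  # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # the workload whose requests are adjusted
  updatePolicy:
    updateMode: "Auto"           # "Off" only produces recommendations
```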
Resource type: the type of custom metric being monitored, either Pod or Service.
Data type: the method used to compute the metric value, either target value or target average value; when the resource type is Pod, only the target average value is supported.
Total number of pending requests + number of requests allowed to exceed the target concurrency > target concurrency per Pod × number of Pods
Run the command below to test, and observe the scaled-out Pods with kubectl get pods -A -w.
API security: whether insecure API versions are enabled, whether appropriate RBAC roles and permission restrictions are configured, and so on.
A persistent volume (PersistentVolume, PV) is a piece of storage in the cluster that can be provisioned in advance by an administrator or provisioned dynamically through a StorageClass. A PV is a cluster resource with its own independent lifecycle: it is not deleted when the Pod process ends. Mounting a PV into a workload persists the workload's data; the PV holds the data directory that containers in the Pod can access.
HostPath: uses a file or directory on the node's filesystem as the volume; Pod scheduling based on node affinity is not supported.
ReadWriteOncePod: the volume can be mounted read-write by a single Pod.
Reclaim policy:
Retain: the PV is not deleted; its status simply changes to released and it must be reclaimed manually by the user. For how to reclaim it manually, see Persistent Volumes.
Filesystem: the volume is mounted into a directory in the Pod. If the volume is backed by a block device that is currently empty, a filesystem is created on the device before the volume is mounted for the first time.
Block: the volume is used as a raw block device. Such a volume is handed to the Pod as a block device without any filesystem on it, which gives the Pod faster access to the volume. A PV sketch combining these options follows below.
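A minimal PersistentVolume sketch combining the options above; the name, capacity, and hostPath path are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-demo                          # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem                 # or Block for a raw block device
  persistentVolumeReclaimPolicy: Retain  # keep the volume after release
  hostPath:
    path: /data/pv-demo                  # directory on the node
```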
The container management module of the Suanova AI computing platform supports sharing one storage pool with multiple namespaces to improve resource utilization.
Configure the container parameters inside the Pod, add environment variables to the Pod, pass in configuration, and so on. For details, see Configuring Container Environment Variables.
Timeout: when this time is exceeded, the task is marked as failed and all Pods under the task are deleted. Leaving it empty means no timeout is set. The default value is 360 s.
A DaemonSet uses node affinity and taints to ensure that a replica of a Pod runs on all (or some) nodes. For nodes newly joining the cluster, the DaemonSet automatically deploys the corresponding Pod on the new node and tracks the Pod's running status. When a node is removed, the DaemonSet deletes all Pods it created.
Configure the container parameters inside the Pod, add environment variables to the Pod, pass in configuration, and so on. For details, see Configuring Container Environment Variables.
In some scenarios an application issues redundant DNS queries. Kubernetes provides DNS-related configuration options that can effectively reduce redundant DNS queries in those scenarios and increase business throughput.
DNS policy
Default: the container uses the domain name resolution file pointed to by the kubelet --resolv-conf parameter. This configuration can only resolve external domain names registered on the Internet; it cannot resolve cluster-internal domain names, and it produces no invalid DNS queries.
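A minimal Pod sketch with an explicit DNS policy and dnsConfig; the nameserver, search domain, and ndots value are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-demo               # hypothetical name
spec:
  dnsPolicy: "None"            # ignore cluster defaults and use dnsConfig only
  dnsConfig:
    nameservers:
      - 10.96.0.10             # example nameserver
    searches:
      - my-ns.svc.cluster.local
    options:
      - name: ndots
        value: "2"             # fewer search-list expansions, fewer redundant queries
  containers:
    - name: app
      image: nginx             # placeholder image
```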
Max unavailable Pods: the maximum number or percentage of Pods that may be unavailable during a workload update; 25% by default. If it equals the number of replicas, there is a risk of service interruption.
Max surge: the maximum number or percentage by which the total number of Pods may exceed the desired replica count during an update; 25% by default.
Minimum ready time: the minimum time a Pod must be ready before it is considered available; only after this time has elapsed does the Pod count as available. 0 seconds by default. A Deployment strategy sketch follows below.
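These three parameters map directly onto the Deployment spec; a minimal sketch with illustrative names and values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                    # hypothetical name
spec:
  replicas: 4
  minReadySeconds: 0           # minimum time a Pod must be ready to count as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%      # max unavailable Pods during the update
      maxSurge: 25%            # max Pods above the desired replica count
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: nginx         # placeholder image
```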
Node affinity: constrains which nodes a Pod can be scheduled onto, based on the labels on the nodes.
Workload affinity: constrains which nodes a Pod can be scheduled onto, based on the labels of Pods already running on those nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto, based on the labels of Pods already running on those nodes.
A stateless workload (Deployment) is a common Kubernetes resource that mainly provides declarative updates for Pods and ReplicaSets, supporting elastic scaling, rolling upgrades, version rollback, and more. You declare the desired Pod state in the Deployment, and the Deployment Controller modifies the current state through ReplicaSets until it reaches the declared desired state. A Deployment is stateless and does not support data persistence; it is suitable for deploying stateless applications that do not need to save data and can be restarted or rolled back at any time.
Through the container management module of the Suanova AI computing platform, you can easily manage workloads across multi-cloud, multi-cluster environments based on the corresponding role permissions, covering the full lifecycle of stateless workloads: creation, update, deletion, elastic scaling, restart, version rollback, and more.
Configure the container parameters inside the Pod, add environment variables to the Pod, pass in configuration, and so on. For details, see Configuring Container Environment Variables.
DNS configuration: in some scenarios an application issues redundant DNS queries. Kubernetes provides DNS-related configuration options that can effectively reduce redundant DNS queries in those scenarios and increase business throughput.
DNS policy
Default: the container uses the domain name resolution file pointed to by the kubelet --resolv-conf parameter. This configuration can only resolve external domain names registered on the Internet; it cannot resolve cluster-internal domain names, and it produces no invalid DNS queries.
Max unavailable: the maximum number or percentage of Pods that may be unavailable during a workload update; 25% by default. If it equals the number of replicas, there is a risk of service interruption.
Max surge: the maximum number or percentage by which the total number of Pods may exceed the desired replica count during an update; 25% by default.
Minimum ready time: the minimum time a Pod must be ready before it is considered available; only after this time has elapsed does the Pod count as available. 0 seconds by default.
Node affinity: constrains which nodes a Pod can be scheduled onto, based on the labels on the nodes.
Workload affinity: constrains which nodes a Pod can be scheduled onto, based on the labels of Pods already running on those nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto, based on the labels of Pods already running on those nodes.
Replicas: enter the number of Pod replicas for the workload. One Pod replica is created by default.
Configure the container parameters inside the Pod, add environment variables to the Pod, pass in configuration, and so on. For details, see Configuring Container Environment Variables.
Parallelism: the maximum number of Pods allowed to be created concurrently while the task runs; the parallelism should not exceed the total number of Pods. The default is 1.
Timeout: when this time is exceeded, the task is marked as failed and all Pods under the task are deleted. Leaving it empty means no timeout is set. A Job sketch follows below.
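Parallelism and the timeout map onto the Job spec as shown in this minimal sketch; the name, image, and values are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-demo               # hypothetical name
spec:
  parallelism: 2                 # max Pods created at the same time
  completions: 4                 # total Pods that must finish successfully
  activeDeadlineSeconds: 360     # timeout; the Job is marked failed afterwards
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox         # placeholder image
          command: ["sh", "-c", "echo processing && sleep 30"]
```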
A stateful workload (StatefulSet) is a common Kubernetes resource. Like a stateless workload (Deployment), it is mainly used to manage the deployment and scaling of a set of Pods. The main difference is that a Deployment is stateless and does not save data, while a StatefulSet is stateful and is mainly used to manage stateful applications. In addition, the Pods in a StatefulSet have permanent, unchanging IDs, which makes it easy to identify the matching Pod when binding storage volumes.
Through the container management module of the Suanova AI computing platform, you can easily manage workloads across multi-cloud, multi-cluster environments based on the corresponding role permissions, covering the full lifecycle of stateful workloads: creation, update, deletion, elastic scaling, restart, version rollback, and more.
Configure the container parameters inside the Pod, add environment variables to the Pod, pass in configuration, and so on. For details, see Configuring Container Environment Variables.
DNS configuration: in some scenarios an application issues redundant DNS queries. Kubernetes provides DNS-related configuration options that can effectively reduce redundant DNS queries in those scenarios and increase business throughput.
DNS policy
Default: the container uses the domain name resolution file pointed to by the kubelet --resolv-conf parameter. This configuration can only resolve external domain names registered on the Internet; it cannot resolve cluster-internal domain names, and it produces no invalid DNS queries.
Kubernetes v1.7 and later can set the Pod management policy through .spec.podManagementPolicy, which supports the following two modes (a sketch follows this list):
OrderedReady: the default Pod management policy. Pods are deployed in order: the stateful workload only starts deploying the next Pod after the previous one has been deployed successfully. Deletion happens in reverse order, so the most recently created Pod is deleted first.
Parallel: containers are created or deleted in parallel, just like Pods of a Deployment. The StatefulSet controller starts or terminates all containers in parallel, without waiting for a Pod to reach Running and Ready (or to terminate completely) before starting or terminating another Pod. This option only affects scaling behavior; it does not affect the ordering during updates.
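A minimal StatefulSet sketch using the parallel policy; the names and image are illustrative:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                         # hypothetical name
spec:
  serviceName: db-headless         # headless Service backing the StatefulSet
  replicas: 3
  podManagementPolicy: Parallel    # default is OrderedReady
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: nginx             # placeholder image
```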
Node affinity: constrains which nodes a Pod can be scheduled onto, based on the labels on the nodes.
Workload affinity: constrains which nodes a Pod can be scheduled onto, based on the labels of Pods already running on those nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto, based on the labels of Pods already running on those nodes.
An environment variable is a variable set in the container's runtime environment, used to add environment markers to a Pod or to pass in configuration; environment variables can be configured for Pods as key/value pairs.
On top of native Kubernetes, the container management module of the Suanova AI computing platform adds a graphical interface for configuring environment variables for Pods and supports the following configuration methods (a sketch follows below):
Variable/variable reference (Pod Field): use a Pod field as the value of the environment variable, for example the Pod's name.
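A minimal sketch showing a plain key/value variable next to a variable reference that injects the Pod name through the downward API; the names and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: env-demo                   # hypothetical name
spec:
  containers:
    - name: app
      image: nginx                 # placeholder image
      env:
        - name: GREETING           # plain key/value environment variable
          value: "hello"
        - name: MY_POD_NAME        # variable reference (Pod field)
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
```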
A readiness probe (ReadinessProbe) detects when a container is ready to accept request traffic; a Pod is only considered ready when all of its containers are ready. One use of this signal is to decide which Pods serve as backends for a Service: a Pod that is not yet ready is removed from the Service's load balancer.
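A minimal readiness probe sketch; the path, port, and timings are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readiness-demo           # hypothetical name
spec:
  containers:
    - name: app
      image: nginx               # placeholder image
      readinessProbe:
        httpGet:
          path: /healthz         # endpoint that reports readiness
          port: 8080
        initialDelaySeconds: 5   # wait before the first probe
        periodSeconds: 10        # probe interval
```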
A Pod follows a predefined lifecycle. It starts in the Pending phase and, if at least one of its containers starts normally, enters the Running state. If any container in the Pod ends in a failed state, the status becomes Failed. The following phase values indicate which stage of its lifecycle a Pod is in.
| Value | Description |
| --- | --- |
| Pending | The Pod has been accepted by the system, but one or more containers have not yet been created or started. This phase includes the time spent waiting for the Pod to be scheduled and the time spent downloading images over the network. |
| Running | The Pod has been bound to a node and all of its containers have been created. At least one container is still running, or is in the process of starting or restarting. |
| Succeeded | All containers in the Pod have terminated successfully and will not be restarted. |
| Failed | All containers in the Pod have terminated, and at least one container terminated due to failure; that is, the container exited with a non-zero status or was terminated by the system. |
| Unknown | The status of the Pod could not be obtained for some reason, usually because communication with the Pod's host failed. |
When creating a workload in the container management module of the Suanova AI computing platform, an image is typically used to specify the runtime environment of the container. By default, the commands and arguments executed at container runtime are defined by the Entrypoint and CMD fields set when the image was built. If you need to change the commands and arguments run before the container image starts, after it starts, or before it stops, you can override the image defaults by setting lifecycle event commands and arguments for the container.
The Suanova AI computing platform provides two handler types, command-line script and HTTP request, for configuring the post-start command. Choose the configuration method that suits you from the table below.
The Suanova AI computing platform provides two handler types, command-line script and HTTP request, for configuring the pre-stop command. Choose the configuration method that suits you from the table below.
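A minimal sketch showing a command-line post-start hook and an HTTP pre-stop hook; the command, path, and port are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo                 # hypothetical name
spec:
  containers:
    - name: app
      image: nginx                     # placeholder image
      lifecycle:
        postStart:
          exec:                        # command-line script after start
            command: ["/bin/sh", "-c", "echo started > /tmp/started"]
        preStop:
          httpGet:                     # HTTP request before stop
            path: /shutdown
            port: 8080
```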
In a Kubernetes cluster, nodes also have labels. You can add labels manually, and Kubernetes adds some standard labels to every node in the cluster; see Well-Known Labels, Annotations and Taints for common node labels. By adding labels to nodes, you can have Pods scheduled onto specific nodes or node groups, and you can use this to ensure that certain Pods only run on nodes with particular isolation, security, or regulatory properties.
nodeSelector is the simplest recommended form of node selection constraint. You add the nodeSelector field to the Pod spec and set the node labels you want the target node to carry; Kubernetes then only schedules the Pod onto nodes that have every one of the specified labels. nodeSelector is the simplest way to constrain a Pod to nodes with specific labels. Affinity and anti-affinity extend the types of constraints you can define. Some of the benefits of affinity and anti-affinity are:
You can mark a rule as a "soft requirement" or "preference", so that when the scheduler cannot find a matching node it ignores the affinity/anti-affinity rule and still schedules the Pod successfully.
You can constrain scheduling using the labels of other Pods running on the node (or in another topology domain), rather than only the labels of the node itself. This lets you define rules about which Pods may be placed together. A combined sketch follows below.
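A minimal sketch combining a nodeSelector with a "preferred" (soft) node affinity rule; the labels and zone are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-demo               # hypothetical name
spec:
  nodeSelector:
    disktype: ssd                         # hard requirement: node must carry this label
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1                       # soft preference, ignored if unsatisfiable
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["zone-a"]
  containers:
    - name: app
      image: nginx                        # placeholder image
```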
You can choose the nodes a Pod is deployed onto by setting affinity and anti-affinity.
Workload affinity mainly determines which Pods the workload's Pods may be deployed with in the same topology domain. For example, services that communicate with each other can be deployed into the same topology domain (such as the same availability zone) via affinity scheduling, reducing the network latency between them.
Workload anti-affinity mainly determines which Pods the workload's Pods must not be deployed with in the same topology domain. For example, spreading identical Pods of a workload across different topology domains (such as different hosts) improves the stability of the workload itself. Both behaviors are sketched below.
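A minimal sketch of both behaviors: co-locate with a cache Pod in the same zone, and spread replicas of the same app across hosts. The labels and topology keys are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-0                                      # hypothetical name
  labels:
    app: web
spec:
  affinity:
    podAffinity:                                   # run in the same zone as the cache
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: cache
          topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:                               # keep replicas of "web" on different hosts
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname
  containers:
    - name: app
      image: nginx                                 # placeholder image
```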
A Pod is the smallest unit of computing that is created and managed in Kubernetes: a group of containers that share storage, network, and the policies governing how the containers run. Pods are usually not created directly by users but through workload resources. A Pod follows a predefined lifecycle: it starts in the Pending phase, enters Running if at least one of its primary containers starts normally, and then moves to Succeeded or Failed depending on whether any container in the Pod ended in a failed state.
Based on factors such as Pod status and replica count, the fifth-generation container management module defines a built-in set of workload lifecycle states so that users get a more realistic view of how workloads are running. Because different workload types (for example stateless workloads and tasks) manage Pods differently, different workloads present different lifecycle states while running, as shown in the following table:
Congratulations, you have successfully entered the AI computing platform; now you can start your AI journey.
The Suanova AI computing platform provides comprehensive, automated security for containers, Pods, images, runtimes, and microservices. The table below lists some of the security features that have been implemented or are being implemented.
Log in to the AI computing platform as the user and check whether the test-ns-1 namespace has been assigned to them.
Next step: Create an AI workload that uses GPU resources
Create an AI Workload That Uses GPU Resources
After the administrator allocates a resource quota to the workspace, users can create AI workloads to consume GPU computing resources.
"},{"location":"admin/virtnest/best-practice/import-ubuntu.html","title":"\u5982\u4f55\u4ece VMWare \u5bfc\u5165\u4f20\u7edf Linux \u4e91\u4e3b\u673a\u5230\u4e91\u539f\u751f\u4e91\u4e3b\u673a\u5e73\u53f0","text":"
\u672c\u6587\u5c06\u8be6\u7ec6\u4ecb\u7ecd\u5982\u4f55\u901a\u8fc7\u547d\u4ee4\u884c\u5c06\u5916\u90e8\u5e73\u53f0 VMware \u4e0a\u7684 Linux \u4e91\u4e3b\u673a\u5bfc\u5165\u5230 AI \u7b97\u529b\u4e2d\u5fc3\u7684\u4e91\u4e3b\u673a\u4e2d\u3002
```
Can't use SSL_get_servername
depth=0 CN = vcsa.daocloud.io
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = vcsa.daocloud.io
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN = vcsa.daocloud.io
verify return:1
DONE
sha1 Fingerprint=C3:9D:D7:55:6A:43:11:2B:DE:BA:27:EA:3B:C2:13:AF:E4:12:62:4D # required value
```
Different information must be configured depending on the network mode; if a fixed IP is required, choose the Bridge network mode.
How to Import a Traditional Windows VM from VMware into the Cloud-Native VM Platform
This article explains in detail how to import a virtual machine from the external VMware platform into a virtual machine in the AI computing center via the command line.
Check the Boot Type of Windows
When importing a virtual machine from an external platform into the virtualization platform of the AI computing center, configure it according to the VM's boot type (BIOS or UEFI) so that it can boot and run correctly.
Differences Between Importing Linux and Windows VMs
Windows may require a UEFI configuration.
Windows usually requires the VirtIO drivers to be installed.
Multi-disk Windows imports usually do not require the disks to be remounted.
Create a Windows VM
This article describes how to create a Windows virtual machine via the command line.
Before creating a Windows VM, refer to the dependencies and prerequisites for installing the VM module to make sure your environment is ready.
For the creation process, it is recommended to follow the official documentation: installing Windows and installing the Windows-related drivers.
Windows VMs are recommended to be accessed via VNC.
Import an ISO Image
The main reason an ISO image must be imported when creating a Windows VM is to install the Windows operating system. Unlike Linux, the Windows installation process usually needs to boot from an installation disc or an ISO image file. Therefore, when creating a Windows VM, the Windows installation ISO image must be imported first so that the VM can be installed properly.
Two ways to import an ISO image are described below:
Windows VMs usually need remote desktop access; it is recommended to use Microsoft Remote Desktop to control your VM.
Note
Your Windows version must support Remote Desktop before Microsoft Remote Desktop can be used.
The Windows firewall needs to be turned off.
Adding a data disk to a Windows VM works the same way as for a Linux VM. You can refer to the YAML example below:
These capabilities are the same as for Linux VMs and can be configured by following the Linux VM instructions directly.
Access the Windows VM
```
[root@master ~]# helm search repo virtnest-release/virtnest --versions
NAME                       CHART VERSION  APP VERSION  DESCRIPTION
virtnest-release/virtnest  0.6.0          v0.6.0       A Helm chart for virtnest
```
```
# success case
QEMU: Checking for hardware virtualization : PASS
QEMU: Checking if device /dev/kvm exists : PASS
QEMU: Checking if device /dev/kvm is accessible : PASS
QEMU: Checking if device /dev/vhost-net exists : PASS
QEMU: Checking if device /dev/net/tun exists : PASS
QEMU: Checking for cgroup 'cpu' controller support : PASS
QEMU: Checking for cgroup 'cpuacct' controller support : PASS
QEMU: Checking for cgroup 'cpuset' controller support : PASS
QEMU: Checking for cgroup 'memory' controller support : PASS
QEMU: Checking for cgroup 'devices' controller support : PASS
QEMU: Checking for cgroup 'blkio' controller support : PASS
QEMU: Checking for device assignment IOMMU support : PASS
QEMU: Checking if IOMMU is enabled by kernel : PASS
QEMU: Checking for secure guest support : WARN (Unknown if this platform has Secure Guest support)

# failure case
QEMU: Checking for hardware virtualization : FAIL (Only emulated CPUs are available, performance will be significantly limited)
QEMU: Checking if device /dev/vhost-net exists : PASS
QEMU: Checking if device /dev/net/tun exists : PASS
QEMU: Checking for cgroup 'memory' controller support : PASS
QEMU: Checking for cgroup 'memory' controller mount-point : PASS
QEMU: Checking for cgroup 'cpu' controller support : PASS
QEMU: Checking for cgroup 'cpu' controller mount-point : PASS
QEMU: Checking for cgroup 'cpuacct' controller support : PASS
QEMU: Checking for cgroup 'cpuacct' controller mount-point : PASS
QEMU: Checking for cgroup 'cpuset' controller support : PASS
QEMU: Checking for cgroup 'cpuset' controller mount-point : PASS
QEMU: Checking for cgroup 'devices' controller support : PASS
QEMU: Checking for cgroup 'devices' controller mount-point : PASS
QEMU: Checking for cgroup 'blkio' controller support : PASS
QEMU: Checking for cgroup 'blkio' controller mount-point : PASS
WARN (Unknown if this platform has IOMMU support)
```
```
[root@master ~]# helm search repo virtnest/virtnest --versions
NAME               CHART VERSION  APP VERSION  DESCRIPTION
virtnest/virtnest  0.2.0          v0.2.0       A Helm chart for virtnest
...
```
In Passt (passthrough) and Bridge modes, network interfaces can be added manually. Click Add NIC to configure the NIC's IP pool. Select a Multus CR that matches the network mode; if none exists, create one yourself.
If the Use default IP pool switch is turned on, the default IP pool in the Multus CR configuration is used. If the switch is turned off, select an IP pool manually.
IP: the IP address of the VM. A VM with multiple NICs is assigned multiple IP addresses.
The VM has not written data to disk, or uses Rook-Ceph or HwameiStor HA mode as the storage system.
Check the status of the VM's launcher pod:
kubectl get pod\n
\u67e5\u770b launcher pod \u662f\u5426\u5904\u4e8e Terminating \u72b6\u6001\u3002
\u5f3a\u5236\u5220\u9664 launcher pod\uff1a
\u5982\u679c launcher pod \u72b6\u6001\u4e3a Terminating\uff0c\u53ef\u4ee5\u6267\u884c\u4ee5\u4e0b\u547d\u4ee4\u8fdb\u884c\u5f3a\u5236\u5220\u9664\uff1a
kubectl delete <launcher pod> --force\n
\u66ff\u6362 <launcher pod> \u4e3a\u4f60\u7684 launcher pod \u540d\u79f0\u3002
\u5f3a\u5236\u5220\u9664 pod \u540e\uff0c\u9700\u8981\u7b49\u5f85\u5927\u7ea6\u516d\u5206\u949f\u4ee5\u8ba9 launcher pod \u542f\u52a8\uff0c\u6216\u8005\u53ef\u4ee5\u901a\u8fc7\u4ee5\u4e0b\u547d\u4ee4\u7acb\u5373\u542f\u52a8 pod\uff1a
kubectl get pv | grep <vm name>\nkubectl get VolumeAttachment | grep <pv name>\n
\u6fc0\u6d3b VMExport Feature Gate\uff0c\u5728\u539f\u6709\u96c6\u7fa4\u5185\u6267\u884c\u5982\u4e0b\u547d\u4ee4\uff0c \u53ef\u53c2\u8003How to activate a feature gate
In Bridge mode, NICs can be added manually. Click Add NIC to configure the NIC IP pool. Select a Multus CR that matches the network mode; if none exists, create one yourself.
If the Use Default IP Pool switch is turned on, the default IP pool from the multus CR configuration is used. If the switch is turned off, select an IP pool manually.
AI Lab provides all the dataset management capabilities needed for model development, training, and inference, and currently supports unified integration of multiple data sources.
With simple configuration, data sources can be connected to AI Lab, enabling unified data governance, data preheating, dataset management, and other capabilities.
This article explains how to manage your environment dependency libraries in AI Lab, including the specific steps and notes.
As AI Lab iterates rapidly, it already supports inference services for many kinds of models; you can find the supported model information here.
AI Lab v0.3.0 introduced the model inference service for traditional deep learning models, so users can use AI Lab's inference service directly without worrying about model deployment and maintenance.
AI Lab v0.6.0 supports the full vLLM inference capability, covering many large language models such as LLama, Qwen, and ChatGLM.
In AI Lab you can use GPU types that have been validated on the 算丰 AI computing platform; for more details, see the GPU Support Matrix.
Triton Inference Server provides good support for traditional deep learning models; we currently support the following mainstream inference backend services:
AI Lab currently provides Triton and vLLM as inference frameworks; with simple configuration, users can quickly launch a high-performance inference service.
API key request authentication is supported; users can add custom authentication parameters.
After the inference service is created, click the service name to open its details and view the API call methods. Verify the results using Curl, Python, Node.js, and so on.
Similarly, you can open the job details to view resource usage and the log output of each Pod.
The AI Lab module provides an important visual analysis tool for the model development process, used to display the training process and results of machine learning models. This article introduces the basic concepts of Job Analysis (Tensorboard), how to use it in the AI Lab system, and how to configure the log content of datasets.
AI Lab provides a convenient way to create and manage Tensorboard. The specific steps are as follows:
Create a distributed job: create a new distributed training job on the AI Lab platform.
Similarly, you can open the job details to view resource usage and the log output of each Pod.
Access Keys can be used to access the open API and continuous publishing. You can follow the steps below in the Personal Center to obtain a key and access the API.
Log in to the AI computing platform with your username and password, then click Global Management at the bottom of the left navigation bar.
Step 4: Set the Public Key on the 算丰 AI Computing Platform
Log in to the 算丰 AI computing platform UI, and in the upper-right corner of the page select Personal Center -> SSH Public Key.
Log in to the AI computing platform as an end user, navigate to the corresponding service, and check the access port.
The Alert Center is an important feature of the AI computing platform. It lets users conveniently view all active and historical alerts by cluster and namespace through a graphical interface, and search alerts by severity (Critical, Warning, Info).
All alerts are triggered by threshold conditions defined in preset alert rules. The AI computing platform ships with some built-in global alert policies, and you can also create or delete alert policies at any time to configure the following metrics:
The AI computing platform Alert Center is a powerful alert management platform that helps users promptly discover and resolve problems in a cluster, improving business stability and availability and facilitating cluster inspection and troubleshooting.
Pod Monitor: in the Kubernetes ecosystem, scrapes monitoring data from Pods based on the Prometheus Operator.
Service Monitor: in the Kubernetes ecosystem, scrapes monitoring data from the Endpoints of a Service based on the Prometheus Operator.
Because ICMP requires higher privileges, we also need to elevate the Pod's privileges; otherwise an operation not permitted error occurs. There are two ways to elevate the privileges, as sketched below:
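As a hedged sketch, the two approaches usually translate into one of the following container securityContext settings (standard Kubernetes fields; the probe workload itself is yours, and only one of the two options is needed):

```yaml
# Option 1: run the probe container as privileged
securityContext:
  privileged: true

# Option 2: grant only the raw-socket capability that ICMP needs
securityContext:
  capabilities:
    add: ["NET_RAW"]
```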
port: specifies the port used to scrape the data; its value is the name configured on the port of the Service being scraped.
This defines the scope of Services to be discovered. namespaceSelector contains two mutually exclusive fields, with the following meanings:
any: has one and only one value, true. When this field is set, changes to all Services matching the Selector filter will be watched. A minimal ServiceMonitor sketch showing these fields in context follows.
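The sketch below follows the Prometheus Operator ServiceMonitor CRD; the name, labels, and scrape interval are placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app          # placeholder name
  namespace: monitoring
spec:
  endpoints:
    - port: metrics          # must match the name of the Service port being scraped
      interval: 30s
  namespaceSelector:
    any: true                # watch matching Services in all namespaces
    # matchNames:            # mutually exclusive with any: list namespaces explicitly instead
    #   - my-namespace
  selector:
    matchLabels:
      app: example-app       # labels of the Service to discover
```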
Resource consumption: view the resource trend over the last hour for the top 5 clusters and nodes, by CPU usage, memory usage, and disk usage respectively.
Sorted by CPU usage by default. You can switch the metric to change how clusters and nodes are sorted.
Resource trend: view the node count trend over the last 15 days and the Pod running trend over the last hour.
Use logical operators (AND, OR, NOT, "") to query multiple keywords, for example: keyword1 AND (keyword2 OR keyword3) NOT keyword4.
Reference Metrics
- CPU usage: the ratio of the actual CPU usage of all Pod resources in the cluster to the total CPU of all nodes.
- CPU allocation: the ratio of the sum of CPU requests of all Pods in the cluster to the total CPU of all nodes.
- Memory usage: the ratio of the actual memory usage of all Pod resources in the cluster to the total memory of all nodes.
- Memory allocation: the ratio of the sum of memory requests of all Pods in the cluster to the total memory of all nodes.
Container Monitoring
Metric Reference
- CPU usage: sum of the CPU usage of all pods under the workload.
- CPU requests: sum of the CPU requests of all pods under the workload.
- CPU limits: sum of the CPU limits of all pods under the workload.
- Memory usage: sum of the memory usage of all pods under the workload.
- Memory requests: sum of the memory requests of all pods under the workload.
- Memory limits: sum of the memory limits of all pods under the workload.
- Disk read/write rate: the total of continuous disk reads and writes per second within the specified time range; a performance measure of the number of disk read and write operations per second.
- Network send/receive rate: the inbound and outbound network traffic rate within the specified time range, aggregated by workload.
Event Queries
AI computing platform Insight supports querying events by cluster and namespace, and provides an event status distribution chart with statistics on important events.
Through the important-event statistics you can easily see the number of image pull failures, health check failures, Pod run failures, Pod scheduling failures, container OOM (out-of-memory) events, volume mount failures, and the total number of all events. These events are usually divided into two categories, "Warning" and "Normal".
Metric Descriptions
- CPU usage: sum of the CPU usage of the pods in the selected namespace
- Memory usage: sum of the memory usage of the pods in the selected namespace
- Pod CPU usage: CPU usage of each pod in the namespace
- Pod memory usage: memory usage of each pod in the namespace
Node Monitoring
Metric reference:
- Current Status Response: the response status code of the HTTP probe request.
- Ping Status: whether the probe request succeeded. 1 means the probe succeeded, 0 means it failed.
- IP Protocol: the IP protocol version used by the probe request.
- SSL Expiry: the earliest expiration time of the SSL/TLS certificate.
- DNS Response (Latency): the duration of the entire probe, in seconds.
- HTTP Duration: the time for the entire process from sending the request to receiving the complete response.
Delete a Probe Task
The AI Computing Center platform manages multiple clouds and clusters and supports creating clusters. On this basis, the observability module Insight serves as a unified multi-cluster observability solution: the insight-agent plugin is deployed to collect observability data from multiple clusters, and metrics, logs, and trace data can then be queried through the AI Computing Center observability product.
Install insight-agent in Other Clusters
Get the Address via the API Provided by Insight Server
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam'
When calling this API, you also need to pass the IP of any externally accessible node in the cluster; this IP is used to assemble the complete access address of the corresponding service.
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam' --data '{"extra": {"EXPORTER_EXTERNAL_IP": "10.5.14.51"}}'
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam'
Once the above is ready, you can connect your application to trace collection via annotations; OTel currently supports trace integration through annotations. Depending on the service language, different pod annotations need to be added. Each service can add one of two types of annotations:
Because Go auto-instrumentation requires OTEL_GO_AUTO_TARGET_EXE to be set, you must provide a valid executable path via an annotation or the Instrumentation resource. Leaving this value unset aborts the Go auto-instrumentation injection, and trace integration fails.
Go auto-instrumentation also requires elevated privileges. The following permissions are set automatically and are required.
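A hedged sketch of what the annotation-based approach looks like for Go, assuming the OpenTelemetry Operator's Go injection annotations; the Pod name, image, and executable path are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-go-app                                                  # placeholder workload
  annotations:
    instrumentation.opentelemetry.io/inject-go: "true"
    # OTEL_GO_AUTO_TARGET_EXE is derived from this annotation; the path is a placeholder
    instrumentation.opentelemetry.io/otel-go-auto-target-exe: "/app/my-go-app"
spec:
  containers:
    - name: my-go-app
      image: my-registry/my-go-app:latest                          # placeholder image
```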
Manual instrumentation, using Go as an example: enhance a Go application with the OpenTelemetry SDK
Use eBPF to implement a non-intrusive Go probe (experimental feature)
OpenTelemetry, also abbreviated as OTel, is an open-source observability framework that helps you generate and collect telemetry data in Go applications: traces, metrics, and logs.
This article mainly explains how to enhance a Go application with the OpenTelemetry Go SDK and connect it to trace monitoring.
Enhance a Go Application with the OTel SDK
Install Dependencies
OTEL_SERVICE_NAME=my-golang-app OTEL_EXPORTER_OTLP_ENDPOINT=http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 go run main.go...
# Add one line to your import() stanza depending upon your request router:
middleware "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
# Add one line to your import() stanza depending upon your request router:
middleware "go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux"
Create a meter provider and specify prometheus as the exporter.
/*
 * Copyright The OpenTelemetry Authors
 * SPDX-License-Identifier: Apache-2.0
 */

package io.opentelemetry.example.prometheus;

import io.opentelemetry.api.metrics.MeterProvider;
import io.opentelemetry.exporter.prometheus.PrometheusHttpServer;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.MetricReader;

public final class ExampleConfiguration {

  /**
   * Initializes the Meter SDK and configures the prometheus collector with all default settings.
   *
   * @param prometheusPort the port to open up for scraping.
   * @return A MeterProvider for use in instrumentation.
   */
  static MeterProvider initializeOpenTelemetry(int prometheusPort) {
    MetricReader prometheusReader = PrometheusHttpServer.builder().setPort(prometheusPort).build();

    return SdkMeterProvider.builder().registerMetricReader(prometheusReader).build();
  }
}
Customize the meter and start an HTTP server
package io.opentelemetry.example.prometheus;

import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.metrics.MeterProvider;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Example of using the PrometheusHttpServer to convert OTel metrics to Prometheus format and expose
 * these to a Prometheus instance via a HttpServer exporter.
 *
 * <p>A Gauge is used to periodically measure how many incoming messages are awaiting processing.
 * The Gauge callback gets executed every collection interval.
 */
public final class PrometheusExample {
  private long incomingMessageCount;

  public PrometheusExample(MeterProvider meterProvider) {
    Meter meter = meterProvider.get("PrometheusExample");
    meter
        .gaugeBuilder("incoming.messages")
        .setDescription("No of incoming messages awaiting processing")
        .setUnit("message")
        .buildWithCallback(result -> result.record(incomingMessageCount, Attributes.empty()));
  }

  void simulate() {
    for (int i = 500; i > 0; i--) {
      try {
        System.out.println(
            i + " Iterations to go, current incomingMessageCount is: " + incomingMessageCount);
        incomingMessageCount = ThreadLocalRandom.current().nextLong(100);
        Thread.sleep(1000);
      } catch (InterruptedException e) {
        // ignored here
      }
    }
  }

  public static void main(String[] args) {
    int prometheusPort = 8888;

    // it is important to initialize the OpenTelemetry SDK as early as possible in your process.
    MeterProvider meterProvider = ExampleConfiguration.initializeOpenTelemetry(prometheusPort);

    PrometheusExample prometheusExample = new PrometheusExample(meterProvider);

    prometheusExample.simulate();

    System.out.println("Exiting");
  }
}
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} trace_id=%X{trace_id} span_id=%X{span_id} trace_flags=%X{trace_flags} %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Just wrap your logging appender, for example ConsoleAppender, with OpenTelemetryAppender -->
  <appender name="OTEL" class="io.opentelemetry.instrumentation.logback.mdc.v1_0.OpenTelemetryAppender">
    <appender-ref ref="CONSOLE"/>
  </appender>

  <!-- Use the wrapped "OTEL" appender instead of the original "CONSOLE" one -->
  <root level="INFO">
    <appender-ref ref="OTEL"/>
  </root>

</configuration>
Here we use the second approach: when starting the JVM, you need to specify the JMX Exporter jar file and its configuration file. The jar is a binary file that is not well suited to being mounted via a ConfigMap, and the configuration file rarely needs modification, so we recommend packaging both the JMX Exporter jar and its configuration file directly into the business container image.
For the second approach, you can either place the JMX Exporter jar file in the business application image, or mount it at deployment time. Both options are introduced below:
Method 1: Build the JMX Exporter JAR into the Business Image
Then prepare the jar file. You can find the download link for the latest jar on the jmx_exporter GitHub page, and refer to the following Dockerfile:
Without service mesh enabled, tests show the relationship between system Job metric volume and Pods is: number of series = 800 × number of Pods
With service mesh enabled, the Istio-related metrics generated by Pods after the feature is turned on are on the order of: number of series = 768 × number of Pods
The number of Pods in the table refers to the number of Pods running in a basically stable state in the cluster. If a large number of Pods restart, the metric volume will spike over a short period, and resources need to be scaled up accordingly.
Disk usage = instantaneous metric count × 2 × disk space per data point × 60 × 24 × retention period (days)
Retention (days) × 60 × 24 converts the time in days into minutes for computing the disk usage.
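As a hedged worked example of the formula above, with an assumed 100,000 instantaneous series, 1 byte of disk per data point, and 7 days of retention (none of these are measured values):

```latex
\text{Disk usage} = 100{,}000 \times 2 \times 1\,\mathrm{B} \times 60 \times 24 \times 7
\approx 2.0\,\mathrm{GB}
```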
In the left navigation, select Index Lifecycle Policies, find the index policy insight-es-k8s-logs-policy, and click to open its details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period; the default retention is 7d.
After the modification, click Save policy at the bottom of the page to apply the change.
In the left navigation, select Index Lifecycle Policies, find the index policy jaeger-ilm-policy, and click to open its details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period; the default retention is 7d.
After the modification, click Save policy at the bottom of the page to apply the change.
Deploying a Kubernetes cluster supports efficient AI compute scheduling and management, enables elastic scaling, and provides high availability, thereby optimizing the model training and inference process.
The cluster has now been created successfully, and you can view the nodes it contains. You can go on to create AI workloads and use GPUs.
Next step: Create an AI Workload
Backups are generally divided into three types: full, incremental, and differential. The 算丰 AI computing platform currently supports full and incremental backups.
The backup and restore capability provided by the 算丰 AI computing platform comes in two kinds, application backup and ETCD backup, and supports manual backups as well as scheduled automatic backups based on CronJobs.
This article describes how to back up an application in the 算丰 AI computing platform. The demo application used in this tutorial is named dao-2048 and is a stateless workload.
CA certificate: you can view the certificate with the following command, then copy and paste its content into the corresponding field:
S3 region: the geographic region of the cloud storage. The default us-east-1 is used, provided by the system administrator
S3 force path style: keep the default value true
S3 server URL: the console access address of the object storage (minio). minio generally provides two services, UI access and console access; use the console access address here
Check whether the ratio of the Pod resource requests to limits for the workload meets the overcommit ratio
Clusters integrated into or created by the 算丰 AI computing platform container management module can be accessed directly through the UI, and access can also be controlled in two other ways:
The 算丰 AI computing platform classifies clusters into roles based on their different functional positioning, helping users better manage their IT infrastructure.
This cluster runs the 算丰 AI computing platform components, such as container management, global management, observability, and the image registry. It generally does not carry business workloads.
In the 算丰 AI computing platform, integrated clusters and self-built clusters use different version support mechanisms.
For example, if the community-supported version range is 1.25, 1.26, and 1.27, then the version range for creating work clusters through the UI in the 算丰 AI computing platform is 1.24, 1.25, and 1.26, and a stable version such as 1.24.7 is recommended to the user.
In addition, the version range for creating work clusters through the UI in the 算丰 AI computing platform stays closely in sync with the community: when the community version increments, the UI-created work cluster version range in the 算丰 AI computing platform also increments by one version.
Kubernetes Version Support Range
The support matrix lists: Kubernetes community version range, self-built work cluster version range, recommended self-built work cluster version, 算丰 AI computing platform installer, and release date.
In the container management module of the 算丰 AI computing platform, clusters fall into four roles: global service cluster, management cluster, work cluster, and integrated cluster. Integrated clusters can only be integrated from third-party vendors; see Integrate Clusters.
This page describes how to create a work cluster. By default, the worker node OS type and CPU architecture of a newly created work cluster must be consistent with those of the global service cluster. To create a cluster with nodes whose OS or architecture differs from the global service cluster, see Creating an Ubuntu Work Cluster on a CentOS Management Platform.
It is recommended to create clusters with operating systems supported by the 算丰 AI computing platform. If your local nodes are outside the supported range, refer to Creating a Cluster on a Non-Mainstream Operating System.
Prepare a certain number of nodes according to business needs, with a consistent node OS type and CPU architecture.
Kubernetes version 1.29.5 is recommended; for the specific version range, see the 算丰 AI computing platform Cluster Version Support System. The 算丰 AI computing platform currently supports self-built work clusters in the range v1.28.0-v1.30.2. To create a cluster with a lower version, refer to the cluster version support range and Deploying and Upgrading Kubean with Backward-Compatible Versions.
The target host must allow IPv4 forwarding. If Pods and Services use IPv6, the target server must also allow IPv6 forwarding.
The 算丰 AI computing platform does not currently provide firewall management, so you need to define the target host's firewall rules yourself in advance. To avoid problems during cluster creation, it is recommended to disable the firewall on the target host.
Service CIDR: the CIDR used by Service resources when containers in the same cluster access each other, which determines the upper limit of Service resources. It cannot be modified after creation.
To permanently delete an integrated cluster, you need to go to the original platform where the cluster was created. The 算丰 AI computing platform does not support deleting integrated clusters.
In the 算丰 AI computing platform, the difference between Uninstall Cluster and Remove Integration is:
Step 3: Integrate the Cluster in the 算丰 AI Computing Platform UI
CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Dec 14, 2024 07:26 UTC   204d                                    no
apiserver                  Dec 14, 2024 07:26 UTC   204d            ca                      no
apiserver-etcd-client      Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
apiserver-kubelet-client   Dec 14, 2024 07:26 UTC   204d            ca                      no
controller-manager.conf    Dec 14, 2024 07:26 UTC   204d                                    no
etcd-healthcheck-client    Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
etcd-peer                  Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
etcd-server                Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
front-proxy-client         Dec 14, 2024 07:26 UTC   204d            front-proxy-ca          no
scheduler.conf             Dec 14, 2024 07:26 UTC   204d                                    no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Dec 12, 2033 07:26 UTC   9y              no
etcd-ca                 Dec 12, 2033 07:26 UTC   9y              no
front-proxy-ca          Dec 12, 2033 07:26 UTC   9y              no
Static Pods are managed by the local kubelet rather than the API server, so kubectl cannot be used to delete or restart them.
If a Pod is not in the manifest directory, the kubelet terminates it. After another fileCheckFrequency period you can move the file back; the kubelet then recreates the Pod, and the certificate renewal for the component is thereby completed.
Kubernetes versions are expressed as x.y.z, where x is the major version, y is the minor version, and z is the patch version.
When the secret type is TLS (kubernetes.io/tls): fill in the certificate credential and the private key data. The certificate is a self-signed or CA-signed credential used for identity authentication; a certificate request is a request for signing and must be signed with the private key.
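A minimal sketch of such a secret; the name is a placeholder, and the angle-bracket values stand in for your actual base64-encoded certificate and key:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-tls-secret                      # placeholder name
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded certificate>    # the certificate credential
  tls.key: <base64-encoded private key>    # the private key used for signing
```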
"},{"location":"end-user/kpanda/configmaps-secrets/use-secret.html#pod","title":"\u4f7f\u7528\u5bc6\u94a5\u4f5c\u4e3a Pod \u7684\u6570\u636e\u5377","text":""},{"location":"end-user/kpanda/configmaps-secrets/use-secret.html#_6","title":"\u56fe\u5f62\u754c\u9762\u64cd\u4f5c","text":"
This article introduces the unified operations and management capability of the 算丰 AI computing container management platform for heterogeneous resources, represented by GPUs.
With the rapid development of emerging technologies such as AI applications, large models, artificial intelligence, and autonomous driving, enterprises face a growing number of compute-intensive tasks and data processing needs. Traditional computing architectures represented by the CPU can no longer meet these growing demands. Heterogeneous computing represented by the GPU has therefore been widely adopted for its unique advantages in processing large-scale data, performing complex computation, and real-time graphics rendering.
At the same time, the lack of experience and professional solutions in areas such as heterogeneous resource scheduling and management has led to extremely low utilization of GPU devices, bringing enterprises high AI production costs. How to reduce costs, increase efficiency, and improve the utilization of GPUs and other heterogeneous resources has become a pressing challenge for many enterprises.
The 算丰 AI computing container management platform supports unified scheduling and operations management of heterogeneous resources such as GPUs and NPUs, fully unleashing GPU compute power and accelerating the development of enterprise AI and other emerging applications. The GPU management capabilities are as follows:
The 算丰 AI computing platform container management platform has been deployed and is running normally.
Number of physical GPUs (iluvatar.ai/vcuda-core): indicates how many physical GPUs the current Pod needs to mount. The value must be an integer and less than or equal to the number of GPUs on the host.
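A minimal sketch of a Pod requesting one physical GPU through this field; the Pod name and image are placeholders, and only the resource name comes from this page:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: iluvatar-gpu-demo          # placeholder name
spec:
  containers:
    - name: demo
      image: ubuntu:22.04          # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          iluvatar.ai/vcuda-core: "1"   # integer, <= number of physical GPUs on the host
```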
After the adjustment, check the GPU resources allocated in the Pod:
Through the above steps, you can dynamically adjust the compute and GPU memory resources of a vGPU Pod without restarting it, meeting business needs more flexibly and optimizing resource utilization.
This page describes the matrix of GPUs and operating systems supported by the 算丰 AI computing platform.
Spread: multiple Pods are spread across different GPUs of a node, suitable for high-availability scenarios and avoiding single-GPU failures.
Binpack: multiple Pods preferentially choose the same node, suitable for improving GPU utilization and reducing resource fragmentation.
Spread: multiple Pods are spread across different nodes, suitable for high-availability scenarios and avoiding single-node failures.
Number of physical GPUs (nvidia.com/vgpu): indicates how many physical GPUs the current Pod needs to mount; it must be less than or equal to the number of GPUs on the host.
Number of physical cards (huawei.com/Ascend910): indicates how many physical cards the current Pod needs to mount. The value must be an integer and **less than or equal to** the number of cards on the host.
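Likewise, a minimal sketch of a Pod requesting one Ascend card through this field; the Pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ascend-demo                # placeholder name
spec:
  containers:
    - name: demo
      image: ubuntu:22.04          # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          huawei.com/Ascend910: "1"     # integer, <= number of physical cards on the host
```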
The 算丰 AI computing platform container management platform has been deployed and is running normally.
GPU compute (cambricon.com/mlu.smlu.vcore): indicates the percentage of cores that the current Pod needs to use.
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  containers:
    - image: ubuntu:16.04
      name: pod1-ctr
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          cambricon.com/mlu: "1" # use this when device type is not enabled, else delete this line.
          #cambricon.com/mlu: "1" #uncomment to use when device type is enabled
          #cambricon.com/mlu.share: "1" #uncomment to use device with env-share mode
          #cambricon.com/mlu.mim-2m.8gb: "1" #uncomment to use device with mim mode
          #cambricon.com/mlu.smlu.vcore: "100" #uncomment to use device with mim mode
          #cambricon.com/mlu.smlu.vmemory: "1024" #uncomment to use device with mim mode
To expose "identical" MIG device types across all products, create the same GIs and CIs.
The 算丰 AI computing platform container management platform has been deployed and is running normally.
This article uses CentOS 7.9 (kernel 3.10.0-1160) on the AMD architecture for the demonstration. To deploy on Red Hat 8.4, refer to Uploading the Red Hat GPU Operator Offline Image to the Bootstrap Node Repository and Building an Offline Yum Source for Red Hat 8.4.
When using a built-in operating system version, there is no need to modify the image version; for other operating system versions, refer to Uploading Images to the Bootstrap Node Repository. Note that no operating system name such as Ubuntu, CentOS, or Red Hat needs to be appended after the version number; if the official image carries an operating system suffix, remove it manually.
Upload the Red Hat GPU Operator Offline Image to the Bootstrap Node Repository
Taking the Red Hat 8.4 offline driver image nvcr.io/nvidia/driver:525.105.17-rhel8.4 as an example, this article describes how to upload an offline image to the bootstrap node repository.
ctr i pull nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i export --all-platforms driver.tar.gz nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i import driver.tar.gz
ctr i tag nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 {bootstrap-node-registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i push {bootstrap-node-registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 --skip-verify=true
When the kernel version of the worker nodes differs from the kernel version or OS type of the global service cluster's control nodes, users need to build an offline yum source manually.
Check the OS and Kernel Version of the Cluster Nodes
Run the following command on a control node of the global service cluster and on the node where the GPU Operator will be deployed. If the OS and kernel versions of the two nodes are the same, there is no need to build a yum source; you can install directly by following the Offline Installation of GPU Operator documentation. If the OS or kernel versions of the two nodes differ, proceed to the next step.
In the current directory on the node, run the following command to connect the node's local mc command-line tool to the minio server.
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123
The expected output is as follows:
Added `minio` successfully.
The mc command-line tool is the client CLI provided by the MinIO file server; for details, see MinIO Client.
The OS of the cluster nodes where the GPU Operator will be deployed must be Red Hat 8.4, with exactly matching kernel versions.
Using a Red Hat 8.4 4.18.0-305.el8.x86_64 node as an example, this article describes how to build an offline Red Hat 8.4 yum source package on any node of the global service cluster, and how to use it through the RepoConfig.ConfigMapName parameter when installing the GPU Operator.
In the current directory on the node, run the following command to connect the node's local mc command-line tool to the minio server.
mc config host add minio <file-server-address> <username> <password>
For example:
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123
The expected output is as follows:
Added `minio` successfully.
The mc command-line tool is the client CLI provided by the MinIO file server; for details, see MinIO Client.
Build an Offline Yum Source for Red Hat 7.9
Use Cases
The 算丰 AI computing platform ships with a GPU Operator offline package for CentOS 7.9 with kernel 3.10.0-1160. For nodes with other OS types or kernels, users need to build an offline yum source manually.
This article describes how to build an offline Red Hat 7.9 yum source package on any node of the global service cluster, and how to use it through the RepoConfig.ConfigMapName parameter when installing the GPU Operator.
The OS of the cluster nodes where the GPU Operator will be deployed must be Red Hat 7.9, with exactly matching kernel versions
2. Download the Offline Driver Image for Red Hat 7.9 OS
3. Upload the Red Hat GPU Operator Offline Image to the Bootstrap Node Repository
Refer to Uploading the Red Hat GPU Operator Offline Image to the Bootstrap Node Repository.
SM: Streaming Multiprocessor, the core compute unit of a GPU, responsible for executing graphics rendering and general-purpose compute tasks. Each SM contains a set of CUDA cores, plus shared memory, a register file, and other resources, and can execute multiple threads concurrently. Each MIG instance owns a certain number of SMs and other related resources, as well as the GPU memory partitioned off for it.
GPU SM Slice: a GPU SM slice is the smallest compute unit of SMs on the GPU. When configured in MIG mode, a GPU SM slice is roughly one seventh of the total number of SMs available in the GPU.
Compute Instance: the compute slices of a GPU instance can be further subdivided into multiple compute instances (CIs), where the CIs share the parent GI's engines and memory, but each CI has dedicated SM resources.
The compute slices of a GPU instance (GI) can be further subdivided into multiple compute instances (CIs), where the CIs share the parent GI's engines and memory, but each CI has dedicated SM resources. Using the same 4g.20gb example above, a CI can be created to use only the first compute slice, with the 1c.4g.20gb compute configuration, as shown in the blue part of the figure below:
The 算丰 AI computing platform container management platform has been deployed and is running normally.
Number of physical GPUs (nvidia.com/vgpu): indicates how many physical GPUs the current Pod needs to mount. The value must be an integer and less than or equal to the number of GPUs on the host.
A NUMA node is a basic building block of the Non-Uniform Memory Access (NUMA) architecture. A (Kubernetes) node is a collection of multiple NUMA nodes, and memory access across NUMA nodes incurs latency. Developers can improve memory access efficiency and overall performance by optimizing task scheduling and memory allocation strategies.
Typical scenarios for NUMA affinity scheduling are compute-intensive jobs that are sensitive to CPU parameters or scheduling latency, such as scientific computing, video decoding, animation rendering, and offline big data processing.
These are the NUMA placement policies that can be used when scheduling Pods; for the scheduling behavior of each policy, see the description of Pod scheduling behavior.
single-numa-node: the Pod is scheduled onto nodes in node pools whose topology manager policy is set to single-numa-node, and the CPUs must be placed under the same NUMA node. If no node in the node pool meets the condition, the Pod cannot be scheduled.
restricted: the Pod is scheduled onto nodes in node pools whose topology manager policy is set to restricted, and the CPUs must be placed under the same NUMA set. If no node in the node pool meets the condition, the Pod cannot be scheduled.
best-effort: the Pod is scheduled onto nodes in node pools whose topology manager policy is set to best-effort, and the CPUs are placed under the same NUMA node as far as possible. If no node satisfies this, the most suitable node is selected for placement.
When a Pod sets a topology policy, Volcano predicts the list of matching nodes according to the Pod's topology policy. The scheduling process is as follows:
Based on the Volcano topology policy set on the Pod, filter the nodes that have the same policy.
Among the nodes with the same policy, filter the nodes whose CPU topology satisfies the policy requirements and schedule onto them.
The topology policies a Pod can configure, the node filtering they trigger (step 1), and the further CPU-topology filtering used for scheduling (step 2) are as follows:
- none: for nodes configured with any of the following topology policies, no filtering is applied at scheduling time (none: schedulable; best-effort: schedulable; restricted: schedulable; single-numa-node: schedulable). No further CPU-topology filtering.
- best-effort: filters nodes whose topology policy is also "best-effort" (none: not schedulable; best-effort: schedulable; restricted: not schedulable; single-numa-node: not schedulable). Schedules while satisfying the policy as far as possible: it prefers placing the Pod on a single NUMA node; if a single NUMA node cannot satisfy the CPU request, scheduling across multiple NUMA nodes is allowed.
- restricted: filters nodes whose topology policy is also "restricted" (none: not schedulable; best-effort: not schedulable; restricted: schedulable; single-numa-node: not schedulable). Strictly restricted scheduling: when the CPU capacity of a single NUMA node is greater than or equal to the CPU request, only scheduling onto a single NUMA node is allowed; if the remaining available CPU on that single NUMA node is insufficient, the Pod cannot be scheduled. When the CPU capacity of a single NUMA node is smaller than the CPU request, scheduling across multiple NUMA nodes is allowed.
- single-numa-node: filters nodes whose topology policy is also "single-numa-node" (none: not schedulable; best-effort: not schedulable; restricted: not schedulable; single-numa-node: schedulable). Only scheduling onto a single NUMA node is allowed.
Configure a NUMA Affinity Scheduling Policy
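A hedged sketch of declaring one of these policies on a Pod, assuming Volcano's volcano.sh/numa-topology-policy annotation and the volcano scheduler; the Pod name, image, and CPU request are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: numa-demo                                        # placeholder name
  annotations:
    volcano.sh/numa-topology-policy: single-numa-node    # or best-effort / restricted / none
spec:
  schedulerName: volcano
  containers:
    - name: demo
      image: nginx:latest                                # placeholder image
      resources:
        requests:
          cpu: "2"
        limits:
          cpu: "2"    # equal integer requests/limits so the static CPU Manager can pin cores
```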
Assume the NUMA nodes are as follows:

| Worker node | Topology manager policy of the node | Allocatable CPU on NUMA node 0 | Allocatable CPU on NUMA node 1 |
| --- | --- | --- | --- |
| node-1 | single-numa-node | 16U | 16U |
| node-2 | best-effort | 16U | 16U |
| node-3 | best-effort | 20U | 20U |

In example one, the Pod's CPU request is 2U and its topology policy is set to "single-numa-node", so it is scheduled to node-1, which has the same policy.
In example two, the Pod's CPU request is 20U and its topology policy is set to "best-effort"; it will be scheduled to node-3, because node-3 can allocate the Pod's CPU request on a single NUMA node, whereas node-2 would need two NUMA nodes to do so.
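For reference, the sketch below shows how such a topology policy is typically declared on a Pod scheduled by Volcano. The `volcano.sh/numa-topology-policy` annotation key follows the Volcano NUMA-aware plugin documentation and is an assumption here; verify it against the Volcano version running in your cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: numa-demo
  annotations:
    # Assumed annotation from the Volcano NUMA-aware plugin:
    # none / best-effort / restricted / single-numa-node
    volcano.sh/numa-topology-policy: single-numa-node
spec:
  schedulerName: volcano        # schedule this Pod with volcano-scheduler
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "2"              # 2U CPU request, matching example one above
        limits:
          cpu: "2"
```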
"},{"location":"end-user/kpanda/gpu/volcano/numa.html#cpu","title":"\u67e5\u770b\u5f53\u524d\u8282\u70b9\u7684 CPU \u6982\u51b5","text":"
You can view the CPU overview of the current node with the lscpu command:
"},{"location":"end-user/kpanda/gpu/volcano/numa.html#cpu_1","title":"\u67e5\u770b\u5f53\u524d\u8282\u70b9\u7684 CPU \u5206\u914d","text":"
Then check the NUMA node usage:

```bash
# View the CPU allocation of the current node
cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0,10-15,25-31","entries":{"777870b5-c64f-42f5-9296-688b9dc212ba":{"container-1":"16-24"},"fb15e10a-b6a5-4aaa-8fcd-76c1aa64e6fd":{"container-1":"1-9"}},"checksum":318470969}
```

The example above shows that two containers are running on the node: one occupies cores 1-9 on NUMA node0, and the other occupies cores 16-24 on NUMA node1.
"},{"location":"end-user/kpanda/gpu/volcano/volcano-gang-scheduler.html","title":"\u4f7f\u7528 Volcano \u7684 Gang Scheduler","text":"
The Gang scheduling policy is one of the core scheduling algorithms of volcano-scheduler. It satisfies the "All or nothing" requirement of the scheduling process and prevents arbitrary scheduling of Pods from wasting cluster resources. The algorithm checks whether the number of already-scheduled Pods under a Job meets the minimum running count; only when the Job's minimum running count is satisfied are scheduling actions performed for all Pods under the Job, otherwise no Pods are scheduled.
The container-group-based Gang scheduling algorithm is well suited to scenarios that require multi-process collaboration. AI scenarios often involve complex workflows, such as Data Ingestion, Data Analysts, Data Splitting, Trainer, Serving, and Logging, which require a group of containers to work together, so they fit the container-group-based Gang scheduling policy very well. Multi-process parallel computing and communication under the MPI computing framework is also a good fit, because the master and worker processes must cooperate. Containers in a container group are highly correlated and may contend for resources; scheduling and allocating them as a whole can effectively resolve deadlocks.
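As a hedged illustration of the "All or nothing" semantics, a minimal Volcano Job can declare its gang size through minAvailable: no Pod of the job is started until at least that many Pods can be scheduled. The image and replica counts below are placeholders.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo
spec:
  schedulerName: volcano
  minAvailable: 4              # the gang size: all 4 workers must fit, otherwise none start
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: busybox
              command: ["sleep", "3600"]
```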
When Binpack scores a node, it combines the weight of the Binpack plugin itself with the weight configured for each resource. First, each resource type requested by the Pod is scored in turn; taking CPU as an example, the score of the CPU resource on a candidate node is calculated as follows:

```
CPU.weight * (request + used) / allocatable
```

That is, the higher the CPU weight, the higher the score, and the fuller the node's resource usage, the higher the score. Memory, GPU, and other resources follow the same principle. Where:
CPU.weight is the CPU weight set by the user
request is the amount of CPU resources requested by the current Pod
used is the amount of CPU already allocated and in use on the current node
allocatable is the total allocatable CPU on the current node
Priority is determined by the Value field of the configured PriorityClass; the larger the value, the higher the priority. This is enabled by default and does not need to be modified. You can confirm or modify it with the following command.
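For reference, a standard Kubernetes PriorityClass looks like the sketch below; the name and value are illustrative only.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000          # a larger Value means a higher priority
globalDefault: false
description: "Priority class for latency-sensitive workloads."
```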
Check the Pod's running information with kubectl get pod; when cluster resources are insufficient, the Pod stays in the Pending state:
In addition, Volcano integrates seamlessly with mainstream computing frameworks such as Spark, TensorFlow, and PyTorch, and supports mixed scheduling of heterogeneous devices such as CPUs and GPUs, providing comprehensive optimization support for AI computing tasks.
Next, we will introduce how to install and use Volcano so that you can take full advantage of its scheduling policies to optimize AI computing tasks.
```
I1222 15:01:47.119777 8743 sync.go:45] Using config file: "examples/sync-dao-2048.yaml"
W1222 15:01:47.234238 8743 syncer.go:263] Ignoring skipDependencies option as dependency sync is not supported if container image relocation is true or syncing from/to intermediate directory
I1222 15:01:47.234685 8743 sync.go:58] There is 1 chart out of sync!
I1222 15:01:47.234706 8743 sync.go:66] Syncing "dao-2048_1.4.1" chart...
.relok8s-images.yaml hints file found
Computing relocation...

Relocating dao-2048@1.4.1...
Pushing 10.5.14.40/daocloud/dao-2048:v1.4.1...
Done
Done moving /var/folders/vm/08vw0t3j68z9z_4lcqyhg8nm0000gn/T/charts-syncer869598676/dao-2048-1.4.1.tgz
```
The cluster inspection feature provided by the container management module of the Suanfeng AI computing platform supports custom inspection items across three dimensions: cluster, node, and pod. After an inspection finishes, a visual inspection report is generated automatically.
Pod dimension: checks the Pod's CPU and memory usage, running status, the status of PVs and PVCs, and so on.
To learn about or run security-related inspections, refer to the security scan types supported by the Suanfeng AI computing platform.
The container management module of the Suanfeng AI computing platform provides a cluster inspection feature that supports inspections at the cluster, node, and pod dimensions.
Pod dimension: checks the Pod's CPU and memory usage, running status, the status of PVs and PVCs, and so on.
Applications and services that were already running on a node before it was set as a dedicated node are not affected and continue to run normally on that node; only when those Pods are deleted or rebuilt will they be scheduled onto other, non-dedicated nodes.
Because platform base components such as kpanda, ghippo, and insight run on the global service cluster, enabling namespace-dedicated nodes on Global may mean that, after a system component restarts, it cannot be scheduled onto the dedicated nodes, which affects the overall high availability of the system. Therefore, under normal circumstances we do not recommend enabling the namespace-dedicated-node feature on the global service cluster.
After the dedication is canceled, Pods from other namespaces can also be scheduled onto this node.
Pod security policies in a Kubernetes cluster control Pod behavior across the various aspects of security by configuring different levels and modes for a specified namespace; only Pods that satisfy certain conditions are accepted by the system. The feature defines three levels and three modes, and users can choose the most suitable combination to set restriction policies according to their needs.
Note
Each security mode can be configured with only one security policy. Also, be cautious when configuring the enforce security mode for a namespace: once a Pod violates the policy, it cannot be created.
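Assuming the three levels and modes follow the standard Kubernetes Pod Security Standards (privileged / baseline / restricted with enforce / audit / warn), the namespace-level configuration is equivalent to labels like the following; this is an assumption for illustration, and the platform UI applies the setting for you.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo-ns
  labels:
    # mode: enforce / audit / warn; level: privileged / baseline / restricted
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
```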
An Ingress instance has been created, an application workload has been deployed, and the corresponding Service has been created.
In a Kubernetes cluster, every Pod has its own internal IP address, but the Pods in a workload may be created and deleted at any time, so using Pod IP addresses directly cannot provide a service to the outside.
This is why you create a Service: through a Service you obtain a fixed IP address, decoupling the frontend and backend of the workload and allowing external users to access the service. A Service also provides load balancing (LoadBalancer), so that users can access the workload from the public network.
Select Intra-cluster access (ClusterIP), which means exposing the service through the cluster's internal IP; a service of this type can only be accessed from within the cluster. This is the default service type. Refer to the table below to configure the parameters.
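A minimal ClusterIP Service sketch corresponding to this option; the selector labels and ports are placeholders.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-svc
spec:
  type: ClusterIP            # in-cluster access only (default Service type)
  selector:
    app: demo                # Pods selected by this label receive the traffic
  ports:
    - protocol: TCP
      port: 80               # port exposed by the Service
      targetPort: 8080       # container port behind the Service
```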
Policy configuration is divided into ingress policies and egress policies. For a source Pod to connect successfully to a target Pod, both the egress policy of the source Pod and the ingress policy of the target Pod must allow the connection; if either side does not allow it, the connection fails.
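As an illustration, the following NetworkPolicy sketch allows ingress traffic to backend Pods only from frontend Pods; the labels and port are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend           # target Pods (ingress side)
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend  # source Pods allowed to connect
      ports:
        - protocol: TCP
          port: 8080
```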
Run the ls command to check whether the key on the management cluster was created successfully; the expected output is as follows:
| Check item | Description |
| --- | --- |
| Operating system | See Supported Architectures and Operating Systems |
| SELinux | Disabled |
| Firewall | Disabled |
| Architecture consistency | The CPU architecture is consistent across nodes (for example, all ARM or all x86) |
| Host time | The time synchronization error between all hosts is less than 10 seconds |
| Network connectivity | The node and its SSH port can be accessed normally by the platform |
| CPU | Available CPU resources are greater than 4 cores |
| Memory | Available memory resources are greater than 8 GB |

Supported Architectures and Operating Systems

| Architecture | Operating System | Remarks |
| --- | --- | --- |
| ARM | Kylin Linux Advanced Server release V10 (Sword) SP2 | Recommended |
| ARM | UOS Linux | |
| ARM | openEuler | |
| x86 | CentOS 7.x | Recommended |
| x86 | Redhat 7.x | Recommended |
| x86 | Redhat 8.x | Recommended |
| x86 | Flatcar Container Linux by Kinvolk | |
| x86 | Debian Bullseye, Buster, Jessie, Stretch | |
| x86 | Ubuntu 16.04, 18.04, 20.04, 22.04 | |
| x86 | Fedora 35, 36 | |
| x86 | Fedora CoreOS | |
| x86 | openSUSE Leap 15.x/Tumbleweed | |
| x86 | Oracle Linux 7, 8, 9 | |
| x86 | Alma Linux 8, 9 | |
| x86 | Rocky Linux 8, 9 | |
| x86 | Amazon Linux 2 | |
| x86 | Kylin Linux Advanced Server release V10 (Sword) - SP2 | Hygon |
| x86 | UOS Linux | |
| x86 | openEuler | |

Node Details
Nodes can be cordoned (scheduling paused) or uncordoned (scheduling resumed). Pausing scheduling means Pods will no longer be scheduled onto the node; resuming scheduling means Pods can be scheduled onto the node again.
A taint (Taint) allows a node to repel a certain class of Pods, preventing those Pods from being scheduled onto the node. One or more taints can be applied to each node, and Pods that cannot tolerate these taints will not be scheduled onto it.
After a taint is added to a node, only Pods that can tolerate the taint can be scheduled onto that node.
NoSchedule: new Pods will not be scheduled onto a node with this taint unless they have a matching toleration. Pods already running on the node are not evicted.
NoExecute: if a Pod cannot tolerate this taint, it is evicted immediately.
If a Pod can tolerate the taint but does not specify tolerationSeconds in its toleration, the Pod keeps running on the node indefinitely.
If a Pod can tolerate the taint and specifies tolerationSeconds, the Pod can continue running on the node for the specified duration; after that time has passed, the Pod is evicted from the node.
PreferNoSchedule: this is a "soft" version of NoSchedule. The control plane will **try** to avoid scheduling Pods that do not tolerate this taint onto the node, but it cannot guarantee this completely, so avoid relying on this taint where possible.
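A minimal sketch of the matching toleration on a Pod; the taint key, value, and duration are placeholders (the corresponding node taint could be added with, for example, kubectl taint nodes node-1 dedicated=gpu:NoExecute).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: toleration-demo
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoExecute"
      tolerationSeconds: 3600   # stay on the node at most 1 hour after the taint appears
  containers:
    - name: app
      image: nginx
```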
The current cluster has been integrated into container management, and the kolm component has been installed on the global service cluster (search for kolm in the Helm templates).
You only need to add permission points on the Global Cluster; the Kpanda controller synchronizes the permission points added on the Global Cluster to all integrated sub-clusters, and the synchronization takes some time to complete.
Permission points can only be added on the Global Cluster; permission points added in a sub-cluster will be overwritten by the built-in role permission points of the Global Cluster.
Only ClusterRoles with the fixed Label can be used to append permissions; replacing or deleting permissions is not supported, nor can a Role be used to append permissions. The correspondence between built-in roles and the Labels of user-created ClusterRoles is as follows.
The Suanfeng AI computing platform supports metric-based elastic scaling of Pod resources (Horizontal Pod Autoscaling, HPA). Users can dynamically adjust the number of Pod replicas by setting CPU utilization, memory usage, and custom metrics. For example, after setting an elastic scaling policy based on the CPU utilization metric for a workload, when the Pods' CPU utilization exceeds or falls below the threshold you set, the workload controller automatically increases or decreases the number of Pod replicas.
If you create an HPA policy based on CPU utilization, you must configure a resource limit (Limit) for the workload in advance; otherwise the CPU utilization cannot be calculated.
The system has two built-in elastic scaling metrics, CPU and memory, to cover basic business scenarios.
Target CPU utilization: the CPU utilization of the Pods under the workload, calculated as the resource usage of all Pods under the workload divided by the workload's requested (request) value. When the actual CPU usage is greater/less than the target value, the system automatically increases/decreases the number of Pod replicas.
Target memory usage: the memory usage of the Pods under the workload. When the actual memory usage is greater/less than the target value, the system automatically increases/decreases the number of Pod replicas.
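A minimal HPA sketch using the built-in CPU metric; the workload name, replica bounds, and target utilization are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # target CPU utilization in percent
```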
The Vertical Pod Autoscaler (VPA) monitors a Pod's resource requests and usage over a period of time and computes the most suitable CPU and memory request values for that Pod. Using VPA allows resources to be allocated to each Pod in the cluster more sensibly, improving the cluster's overall resource utilization and avoiding wasted cluster resources.
The Suanfeng AI computing platform supports the Vertical Pod Autoscaler (VPA), which can dynamically adjust Pod request values based on container resource usage. The platform supports modifying resource request values either manually or automatically; you can configure this according to your actual needs.
This article describes how to configure Pod vertical scaling for a workload.
Warning
Using VPA to modify Pod resource requests triggers a Pod restart. Due to limitations in Kubernetes itself, the Pod may be scheduled onto a different node after restarting.
Scaling mode: how the CPU and memory request values are modified. Vertical scaling currently supports two modes, manual and automatic.
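Assuming the feature is backed by the upstream Kubernetes VPA components, the equivalent resource is sketched below; updateMode "Off" roughly corresponds to manual mode (recommendations only) and "Auto" to automatic mode.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: demo-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  updatePolicy:
    updateMode: "Auto"    # "Off" only recommends values; "Auto" applies them and restarts Pods
```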
Resource type: the type of custom metric being monitored, including the Pod and Service types.
Data type: the method used to calculate the metric value, including target value and target average value; when the resource type is Pod, only the target average value is supported.
Total number of pending requests + number of requests exceeding the target concurrency that can be accepted > target concurrency per Pod × number of Pods
Run the following command to test, and you can watch the scaled-out Pods with kubectl get pods -A -w.
API security: whether insecure API versions are enabled, whether appropriate RBAC roles and permission restrictions are configured, and so on.
A PersistentVolume (PV) is a piece of storage in the cluster that can be provisioned by an administrator in advance or dynamically provisioned using a StorageClass. A PV is a cluster resource, but it has an independent lifecycle and is not deleted when the Pod process ends. Mounting a PV into a workload provides data persistence for the workload. The PV holds the data directory that can be accessed by the containers in a Pod.
HostPath: uses a file or directory on the node's filesystem as the volume; Pod scheduling based on node affinity is not supported.
ReadWriteOncePod: the volume can be mounted in read-write mode by a single Pod.
Reclaim policy:
Retain: the PV is not deleted; its status only changes to released, and it must be reclaimed manually by the user. For how to reclaim it manually, refer to Persistent Volumes.
Filesystem: the volume is mounted into a directory by the Pod. If the volume's storage is backed by a block device and the device is currently empty, a filesystem is created on the device before the volume is mounted for the first time.
Block: the volume is used as a raw block device. Such a volume is handed to the Pod as a block device, with no filesystem on it, which allows the Pod to access the volume faster.
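A minimal PV sketch that ties these options together (HostPath volume, Filesystem mode, Retain reclaim policy); the capacity and path are placeholders.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem                   # use Block for a raw block device
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain    # released PVs must be reclaimed manually
  hostPath:
    path: /data/demo                       # directory on the node's filesystem
```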
The container management module of the Suanfeng AI computing platform supports sharing one storage pool with multiple namespaces to improve resource utilization.
Configure the container parameters within the Pod, add environment variables to the Pod, pass in configuration, and so on. For details, refer to Container Environment Variable Configuration.
Timeout: when this time is exceeded, the job is marked as failed and all Pods under the job are deleted. Leaving it empty means no timeout is set. The default value is 360 s.
A DaemonSet uses node affinity and taints to ensure that a replica of a Pod runs on all (or some) nodes. For nodes newly joined to the cluster, the DaemonSet automatically deploys the corresponding Pod on the new node and tracks the Pod's running status. When a node is removed, the DaemonSet deletes all the Pods it created.
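A minimal DaemonSet sketch; the labels, image, and control-plane toleration are illustrative.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: demo-ds
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule        # also run on control-plane nodes if desired
      containers:
        - name: agent
          image: busybox
          command: ["sleep", "infinity"]
```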
Configure the container parameters within the Pod, add environment variables to the Pod, pass in configuration, and so on. For details, refer to Container Environment Variable Configuration.
In some scenarios, applications issue redundant DNS queries. Kubernetes provides DNS-related configuration options for applications that can effectively reduce redundant DNS queries and increase business throughput in those scenarios.
DNS policy
Default: the container uses the domain name resolution file pointed to by the kubelet's --resolv-conf parameter. This configuration can only resolve external domain names registered on the Internet; it cannot resolve cluster-internal domain names, but there are no invalid DNS queries.
Maximum unavailable Pods: specifies the maximum number or ratio of unavailable Pods during the workload update; the default is 25%. If it equals the number of instances, there is a risk of service interruption.
Maximum surge: the maximum number or ratio by which the total number of Pods may exceed the desired replica count while Pods are being updated. The default is 25%.
Minimum Pod ready time: the minimum time a Pod must be ready; only after this time has elapsed is the Pod considered available. The default is 0 seconds.
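These three parameters map to the standard Deployment rolling-update fields, roughly as sketched below; the workload name and image are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 4
  minReadySeconds: 10            # minimum time a Pod must be ready to count as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%        # maximum unavailable Pods during the update
      maxSurge: 25%              # maximum Pods above the desired replica count
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: app
          image: nginx:1.25
```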
Node affinity: constrains which nodes a Pod can be scheduled onto based on the labels on the nodes.
Workload affinity: constrains which nodes a Pod can be scheduled onto based on the labels of Pods already running on the nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto based on the labels of Pods already running on the nodes.
A Deployment is a common resource in Kubernetes. It mainly provides declarative updates for Pods and ReplicaSets and supports elastic scaling, rolling upgrades, version rollback, and other features. You declare the desired Pod state in the Deployment, and the Deployment Controller modifies the current state through ReplicaSets until it reaches the declared desired state. A Deployment is stateless and does not persist data; it is suitable for deploying stateless applications that do not need to save data and can be restarted and rolled back at any time.
Through the container management module of the Suanfeng AI computing platform, you can easily manage workloads across multiple clouds and clusters based on the corresponding role permissions, including full lifecycle management of stateless workloads: creation, update, deletion, elastic scaling, restart, version rollback, and more.
Configure the container parameters within the Pod, add environment variables to the Pod, pass in configuration, and so on. For details, refer to Container Environment Variable Configuration.
DNS configuration: in some scenarios, applications issue redundant DNS queries. Kubernetes provides DNS-related configuration options for applications that can effectively reduce redundant DNS queries and increase business throughput in those scenarios.
DNS policy
Default: the container uses the domain name resolution file pointed to by the kubelet's --resolv-conf parameter. This configuration can only resolve external domain names registered on the Internet; it cannot resolve cluster-internal domain names, but there are no invalid DNS queries.
Maximum unavailable: specifies the maximum number or ratio of unavailable Pods during the workload update; the default is 25%. If it equals the number of instances, there is a risk of service interruption.
Maximum surge: the maximum number or ratio by which the total number of Pods may exceed the desired replica count while Pods are being updated. The default is 25%.
Minimum Pod ready time: the minimum time a Pod must be ready; only after this time has elapsed is the Pod considered available. The default is 0 seconds.
Node affinity: constrains which nodes a Pod can be scheduled onto based on the labels on the nodes.
Workload affinity: constrains which nodes a Pod can be scheduled onto based on the labels of Pods already running on the nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto based on the labels of Pods already running on the nodes.
Instances: enter the number of Pod instances for the workload. By default, 1 Pod instance is created.
Configure the container parameters within the Pod, add environment variables to the Pod, pass in configuration, and so on. For details, refer to Container Environment Variable Configuration.
Parallelism: the maximum number of Pods allowed to be created at the same time during job execution; the parallelism should not be greater than the total number of Pods. The default is 1.
Timeout: when this time is exceeded, the job is marked as failed and all Pods under the job are deleted. Leaving it empty means no timeout is set.
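The parallelism and timeout parameters correspond to the standard Kubernetes Job fields, roughly as in this sketch; the completions count and command are placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: demo-job
spec:
  completions: 4                  # total number of successful Pods required
  parallelism: 2                  # at most 2 Pods run at the same time
  activeDeadlineSeconds: 360      # timeout; the Job is marked failed after this
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo processing && sleep 30"]
```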
A StatefulSet is a common resource in Kubernetes. Like a Deployment, it is mainly used to manage the deployment and scaling of a set of Pods. The main difference between the two is that a Deployment is stateless and does not persist data, while a StatefulSet is stateful and is mainly used to manage stateful applications. In addition, Pods in a StatefulSet have permanent, unchanging IDs, which makes it easy to identify the matching Pod when binding storage volumes.
Through the container management module of the Suanfeng AI computing platform, you can easily manage workloads across multiple clouds and clusters based on the corresponding role permissions, including full lifecycle management of stateful workloads: creation, update, deletion, elastic scaling, restart, version rollback, and more.
Configure the container parameters within the Pod, add environment variables to the Pod, pass in configuration, and so on. For details, refer to Container Environment Variable Configuration.
DNS configuration: in some scenarios, applications issue redundant DNS queries. Kubernetes provides DNS-related configuration options for applications that can effectively reduce redundant DNS queries and increase business throughput in those scenarios.
DNS policy
Default: the container uses the domain name resolution file pointed to by the kubelet's --resolv-conf parameter. This configuration can only resolve external domain names registered on the Internet; it cannot resolve cluster-internal domain names, but there are no invalid DNS queries.
Kubernetes v1.7 and later can set the Pod management policy through .spec.podManagementPolicy, which supports the following two options:
Ordered policy (OrderedReady): the default Pod management policy. Pods are deployed in order; only after the previous Pod has been deployed successfully does the stateful workload start deploying the next one. Pods are deleted in reverse order, with the most recently created Pod deleted first.
Parallel policy (Parallel): containers are created or deleted in parallel, just like the Pods of a Deployment. The StatefulSet controller starts or terminates all Pods in parallel and does not wait for a Pod to become Running and Ready, or to be fully stopped, before starting or terminating other Pods. This option only affects the behavior of scaling operations; it does not affect the order during updates.
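A minimal StatefulSet sketch showing where podManagementPolicy is set; the headless Service name, labels, and image are placeholders.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-sts
spec:
  serviceName: demo-sts           # headless Service that owns the Pod DNS records
  replicas: 3
  podManagementPolicy: Parallel   # OrderedReady (default) or Parallel
  selector:
    matchLabels:
      app: demo-sts
  template:
    metadata:
      labels:
        app: demo-sts
    spec:
      containers:
        - name: app
          image: nginx:1.25
```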
Node affinity: constrains which nodes a Pod can be scheduled onto based on the labels on the nodes.
Workload affinity: constrains which nodes a Pod can be scheduled onto based on the labels of Pods already running on the nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto based on the labels of Pods already running on the nodes.
An environment variable is a variable set in the container's runtime environment, used to add environment flags to a Pod or pass in configuration. Environment variables can be configured for a Pod in the form of key-value pairs.
On top of native Kubernetes, the container management module of the Suanfeng AI computing platform adds a graphical interface for configuring environment variables for Pods, supporting the following configuration methods:
Variable/variable reference (Pod Field): uses a Pod field as the value of the environment variable, for example the Pod's name.
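In native Kubernetes terms, a plain key-value variable and a Pod-field reference look roughly like this sketch; the variable names are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: env-demo
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "echo My name is $(POD_NAME) && sleep 3600"]
      env:
        - name: GREETING                 # plain key-value variable
          value: "hello"
        - name: POD_NAME                 # variable reference (Pod field)
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
```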
A readiness probe (ReadinessProbe) detects when a container is ready to accept request traffic; a Pod is considered ready only when all containers inside it are ready. One use of this signal is to control which Pods serve as backends for a Service: if a Pod is not ready, it is removed from the Service's load balancer.
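A minimal readiness-probe sketch; the probe path, port, and timings are placeholders to adapt to your application.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readiness-demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      readinessProbe:
        httpGet:
          path: /                # health endpoint of your application
          port: 80
        initialDelaySeconds: 5   # wait before the first probe
        periodSeconds: 10        # probe interval
        failureThreshold: 3      # consecutive failures before the Pod is marked not ready
```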
A Pod follows a predefined lifecycle, starting in the Pending phase. If at least one container in the Pod starts normally, it enters the Running state; if a container in the Pod ends in a failed state, the status becomes Failed. The following phase field values indicate which stage of the lifecycle a Pod is in.

| Value | Description |
| --- | --- |
| Pending | The Pod has been accepted by the system, but one or more containers have not yet been created or run. This phase includes the time spent waiting for the Pod to be scheduled and the time spent downloading images over the network. |
| Running | The Pod has been bound to a node and all containers in the Pod have been created. At least one container is still running, or is starting or restarting. |
| Succeeded | All containers in the Pod have terminated successfully and will not be restarted. |
| Failed | All containers in the Pod have terminated, and at least one container terminated due to failure; that is, the container exited with a non-zero status or was terminated by the system. |
| Unknown | The Pod's status could not be obtained for some reason, usually because communication with the host where the Pod is located failed. |

When creating a workload in the container management module of the Suanfeng AI computing platform, an image is usually used to specify the runtime environment of the container. By default, when an image is built, the Entrypoint and CMD fields define the command and arguments executed when the container runs. If you need to change the commands and arguments run before the container image starts, after it starts, or before it stops, you can override the defaults in the image by configuring the container's lifecycle event commands and arguments.
The Suanfeng AI computing platform provides two handler types, command-line script and HTTP request, for configuring the post-start command. You can choose the configuration method that suits you based on the table below.
The Suanfeng AI computing platform provides two handler types, command-line script and HTTP request, for configuring the pre-stop command. You can choose the configuration method that suits you based on the table below.
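In plain Kubernetes YAML, the two handler types for the post-start and pre-stop commands correspond to the lifecycle hooks sketched below; the command and HTTP path are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      lifecycle:
        postStart:
          exec:                                           # command-line script handler
            command: ["sh", "-c", "echo started > /tmp/started"]
        preStop:
          httpGet:                                        # HTTP request handler
            path: /shutdown                               # hypothetical endpoint
            port: 80
```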
In a Kubernetes cluster, nodes also have labels. You can add labels manually, and Kubernetes also adds some standard labels to every node in the cluster. See Well-Known Labels, Annotations and Taints for common node labels. By adding labels to nodes, you can have Pods scheduled onto specific nodes or groups of nodes. You can use this feature to ensure that specific Pods run only on nodes with certain isolation, security, or regulatory properties.
nodeSelector is the simplest recommended form of node selection constraint. You can add the nodeSelector field to the Pod spec and set the node labels you want the target node to have; Kubernetes will only schedule the Pod onto nodes that have every one of the specified labels. nodeSelector provides the simplest way to constrain Pods to nodes with specific labels, while affinity and anti-affinity expand the types of constraints you can define. Some of the benefits of using affinity and anti-affinity are:
You can mark a rule as a "soft requirement" or "preference", so that when the scheduler cannot find a matching node it ignores the affinity/anti-affinity rule and still schedules the Pod successfully.
You can use the labels of other Pods running on a node (or in another topology domain) to impose scheduling constraints, instead of only using the labels of the node itself. This capability lets you define rules about which Pods may be placed together.
You can choose which nodes Pods are deployed to by setting affinity and anti-affinity rules.
Workload affinity is mainly used to decide which Pods a workload's Pods can be deployed with in the same topology domain. For example, services that communicate with each other can be deployed into the same topology domain (such as the same availability zone) through application affinity scheduling, reducing the network latency between them.
Workload anti-affinity is mainly used to decide which Pods a workload's Pods cannot be deployed with in the same topology domain. For example, spreading identical Pods of a workload across different topology domains (such as different hosts) improves the stability of the workload itself.
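A sketch combining the mechanisms above: a nodeSelector label match, a "soft" node-affinity preference, and a required pod anti-affinity rule that spreads replicas across hosts. All labels and values are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
  labels:
    app: web
spec:
  nodeSelector:
    disktype: ssd                                          # simplest constraint: node label match
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:     # "soft" preference, ignored if unsatisfiable
        - weight: 1
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["zone-a"]
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:      # spread Pods with the same label across hosts
        - labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname
  containers:
    - name: app
      image: nginx:1.25
```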
A Pod is the smallest computing unit created and managed in Kubernetes: a collection of containers. These containers share storage, network, and the policies that govern how the containers run. Pods are usually not created directly by users but through workload resources. A Pod follows a predefined lifecycle, starting in the Pending phase; if at least one of its primary containers starts normally, it enters Running, and it then enters the Succeeded or Failed phase depending on whether any container in the Pod terminated in failure.
Based on factors such as Pod status and replica count, the fifth-generation container management module designs a built-in set of workload lifecycle states, so that users can perceive the actual running condition of their workloads more accurately. Because different workload types (for example, stateless workloads and jobs) manage Pods differently, different workloads exhibit different lifecycle states while running, as shown in the following table:
Congratulations, you have successfully entered the AI computing platform, and you can now begin your AI journey.
"},{"location":"end-user/share/workload.html","title":"\u521b\u5efa AI \u8d1f\u8f7d\u4f7f\u7528 GPU \u8d44\u6e90","text":"
\u7ba1\u7406\u5458\u4e3a\u5de5\u4f5c\u7a7a\u95f4\u5206\u914d\u8d44\u6e90\u914d\u989d\u540e\uff0c\u7528\u6237\u5c31\u53ef\u4ee5\u521b\u5efa AI \u5de5\u4f5c\u8d1f\u8f7d\u6765\u4f7f\u7528 GPU \u7b97\u529b\u8d44\u6e90\u3002
AI Lab provides a job scheduler to help you better manage jobs. In addition to the basic scheduler, it also supports custom schedulers.
"},{"location":"en/admin/baize/best-practice/add-scheduler.html#introduction-to-job-scheduler","title":"Introduction to Job Scheduler","text":"
In Kubernetes, the job scheduler is responsible for deciding which node to assign a Pod to. It considers various factors such as resource requirements, hardware/software constraints, affinity/anti-affinity rules, and data locality.
The default scheduler is a core component in a Kubernetes cluster that decides which node a Pod should run on. Let's delve into its working principles, features, and configuration methods.
In addition to basic job scheduling capabilities, we also support the use of Scheduler Plugins: Kubernetes SIG Scheduling, which maintains a set of scheduler plugins including Coscheduling (Gang Scheduling) and other features.
To deploy a secondary scheduler plugin in a worker cluster, refer to Deploying Secondary Scheduler Plugin.
"},{"location":"en/admin/baize/best-practice/add-scheduler.html#enable-scheduler-plugins-in-ai-lab","title":"Enable Scheduler Plugins in AI Lab","text":"
Danger
Improper operations when adding scheduler plugins may affect the stability of the entire cluster. It is recommended to test in a test environment or contact our technical support team.
Note that if you wish to use more scheduler plugins in training jobs, you need to manually install them successfully in the worker cluster first. Then, when deploying the baize-agent in the cluster, add the proper scheduler plugin configuration.
Through the container management UI provided by Helm Apps , you can easily deploy scheduler plugins in the cluster.
Then, click Install in the top right corner. (If the baize-agent has already been deployed, you can update it in the Helm App list.) Add the scheduler.
Note the parameter hierarchy of the scheduler. After adding, click OK .
Note: Do not omit this configuration when updating the baize-agent in the future.
"},{"location":"en/admin/baize/best-practice/add-scheduler.html#specify-scheduler-when-creating-a-job","title":"Specify Scheduler When Creating a Job","text":"
Once you have successfully deployed the proper scheduler plugin in the cluster and correctly added the proper scheduler configuration in the baize-agent, you can specify the scheduler when creating a job.
If everything is set up correctly, you will see the scheduler plugin you deployed in the scheduler dropdown menu.
This concludes the instructions for configuring and using the scheduler options in AI Lab.
In the Notebook, multiple available base images are provided by default for developers to choose from. In most cases, this will meet the developers' needs.
DaoCloud provides a default Notebook image that contains all necessary development tools and resources.
```
baize/baize-notebook
```
This Notebook includes basic development tools. Taking baize-notebook:v0.5.0 (May 30, 2024) as an example, the relevant dependencies and versions are as follows:
| Dependency | Version | Description |
| --- | --- | --- |
| Ubuntu | 22.04.3 | Default OS |
| Python | 3.11.6 | Default Python version |
| pip | 23.3.1 | |
| conda (mamba) | 23.3.1 | |
| jupyterlab | 3.6.6 | JupyterLab image, providing a complete Notebook experience |
| codeserver | v4.89.1 | Mainstream Code development tool for a familiar experience |
| *baizectl | v0.5.0 | DaoCloud built-in CLI task management tool |
| *SSH | - | Supports local SSH direct access to the Notebook container |
| *kubectl | v1.27 | Kubernetes CLI for managing container resources within Notebook |
Note
With each version iteration, AI platform will proactively maintain and update.
However, sometimes users may need custom images. This page explains how to update images and add them to the Notebook creation interface for selection.
Building a new image requires using baize-notebook as the base image to ensure the Notebook runs properly.
When building a custom image, it is recommended to first understand the Dockerfile of the baize-notebook image to better understand how to build a custom image.
"},{"location":"en/admin/baize/best-practice/change-notebook-image.html#dockerfile-for-baize-notebook","title":"Dockerfile for baize-notebook","text":"
"},{"location":"en/admin/baize/best-practice/change-notebook-image.html#add-to-the-notebook-image-list-helm","title":"Add to the Notebook Image List (Helm)","text":"
Warning
Note that this must be done by the platform administrator. Be cautious with changes.
Currently, the image selector needs to be modified by updating the Helm parameters of baize. The specific steps are as follows:
In the Helm Apps list of the kpanda-global-cluster global management cluster, find baize, enter the update page, and modify the Notebook image in the YAML parameters:
Note the parameter modification path global.config.notebook_images:
```yaml
...
global:
  ...
  config:
    notebook_images:
      ...
      names: release.daocloud.io/baize/baize-notebook:v0.5.0
      # Add your image information here
```
After the update is completed and the Helm App restarts successfully, you can see the new image in the Notebook creation interface image selection.
"},{"location":"en/admin/baize/best-practice/checkpoint.html","title":"Checkpoint Mechanism and Usage","text":"
In practical deep learning scenarios, model training typically lasts for a period, which places higher demands on the stability and efficiency of distributed training tasks. Moreover, during actual training, unexpected interruptions can cause the loss of the model state, requiring the training process to start over. This not only wastes time and resources, which is particularly evident in LLM training, but also affects the training effectiveness of the model.
The ability to save the model state during training, so that it can be restored in case of an interruption, becomes crucial. Checkpointing is the mainstream solution to this problem. This article will introduce the basic concepts of the Checkpoint mechanism and its usage in PyTorch and TensorFlow.
"},{"location":"en/admin/baize/best-practice/checkpoint.html#what-is-a-checkpoint","title":"What is a Checkpoint?","text":"
A checkpoint is a mechanism for saving the state of a model during training. By periodically saving checkpoints, you can restore the model in the following situations:
Training interruption (e.g., system crash or manual interruption)
TensorFlow provides the tf.train.Checkpoint class to manage the saving and restoring of models and optimizers.
"},{"location":"en/admin/baize/best-practice/checkpoint.html#save-checkpoints-in-tensorflow","title":"Save Checkpoints in TensorFlow","text":"
Here is an example of saving a checkpoint in TensorFlow:
```python
import tensorflow as tf

# Assume you have a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, input_shape=(10,))
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Define checkpoint
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
checkpoint_dir = './checkpoints'
checkpoint_prefix = f'{checkpoint_dir}/ckpt'

# Train the model...
# Save checkpoint
checkpoint.save(file_prefix=checkpoint_prefix)
```
Note
Users of AI Lab can directly mount high-performance storage as the checkpoint directory to improve the speed of saving and restoring checkpoints.
"},{"location":"en/admin/baize/best-practice/checkpoint.html#restore-checkpoints-in-tensorflow","title":"Restore Checkpoints in TensorFlow","text":"
Load the checkpoint and restore the model and optimizer state:
```python
# Restore checkpoint
latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
checkpoint.restore(latest_checkpoint)

# Continue training or inference...
```
"},{"location":"en/admin/baize/best-practice/checkpoint.html#manage-checkpoints-in-distributed-training-with-tensorflow","title":"Manage Checkpoints in Distributed Training with TensorFlow","text":"
In distributed training, TensorFlow manages checkpoints primarily through the following methods:
Using tf.train.Checkpoint and tf.train.CheckpointManager
Regular Saving : Determine a suitable saving frequency based on training time and resource consumption, such as every epoch or every few training steps.
Save Multiple Checkpoints : Keep the most recent few checkpoints so that a corrupted or otherwise unusable checkpoint does not block recovery.
Record Metadata : Save additional information in the checkpoint, such as the epoch number and loss value, to better restore the training state.
Use Version Control : Save checkpoints for different experiments to facilitate comparison and reuse.
Validation and Testing : Use checkpoints for validation and testing at different training stages to ensure model performance and stability.
The checkpoint mechanism plays a crucial role in deep learning training. By effectively using the checkpoint features in PyTorch and TensorFlow, you can significantly improve the reliability and efficiency of training. The methods and best practices described in this article should help you better manage the training process of deep learning models.
"},{"location":"en/admin/baize/best-practice/deploy-nfs-in-worker.html","title":"Deploy NFS for Preloading Dataset","text":"
A Network File System (NFS) allows remote hosts to mount file systems over a network and interact with those file systems as though they are mounted locally. This enables system administrators to consolidate resources onto centralized servers on the network.
Dataset is a core feature provided by AI Lab. By abstracting the dependency on data throughout the entire lifecycle of MLOps into datasets, users can manage various types of data in datasets so that training tasks can directly use the data in the dataset.
When remote data is not within the worker cluster, datasets provide the capability to automatically preheat data, supporting data preloading from sources such as Git, S3, and HTTP to the local cluster.
A storage service supporting the ReadWriteMany mode is needed for preloading remote data for the dataset, and it is recommended to deploy NFS within the cluster.
This article mainly introduces how to quickly deploy an NFS service and add it as a StorageClass for the cluster.
Installing csi-driver-nfs requires the use of Helm, please ensure it is installed beforehand.
# Add Helm repository\nhelm repo add csi-driver-nfs https://mirror.ghproxy.com/https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts\nhelm repo update csi-driver-nfs\n\n# Deploy csi-driver-nfs\n# The parameters here mainly optimize the image address to accelerate downloads in China\nhelm upgrade --install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \\\n --set image.nfs.repository=k8s.m.daocloud.io/sig-storage/nfsplugin \\\n --set image.csiProvisioner.repository=k8s.m.daocloud.io/sig-storage/csi-provisioner \\\n --set image.livenessProbe.repository=k8s.m.daocloud.io/sig-storage/livenessprobe \\\n --set image.nodeDriverRegistrar.repository=k8s.m.daocloud.io/sig-storage/csi-node-driver-registrar \\\n --namespace nfs \\\n --version v4.5.0\n
Warning
Not all images of csi-nfs-controller support helm parameters, so the image field of the deployment needs to be manually modified. Change image: registry.k8s.io to image: k8s.dockerproxy.com to accelerate downloads in China.
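After csi-driver-nfs is running, you can register the NFS server as a StorageClass. The following is a minimal sketch; the server address nfs-server.example.com and the exported path /nfsdata are placeholders that must be replaced with your actual NFS server and export:
apiVersion: storage.k8s.io/v1\nkind: StorageClass\nmetadata:\n name: nfs-csi\nprovisioner: nfs.csi.k8s.io\nparameters:\n server: nfs-server.example.com # replace with your NFS server address\n share: /nfsdata # replace with your NFS export path\nreclaimPolicy: Retain\nvolumeBindingMode: Immediate\nmountOptions:\n - nfsvers=4.1\n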
Create a dataset and set the dataset's associated storage class and preloading method to NFS to preheat remote data into the cluster.
After the dataset is successfully created, you can see that the dataset's status is preloading, and you can start using it after the preloading is completed.
Run the following command to install the NFS client:
sudo yum install nfs-utils\n
Check the NFS server configuration to ensure that the NFS server is running and configured correctly. You can try mounting manually to test:
sudo mkdir -p /mnt/test\nsudo mount -t nfs <nfs-server>:/nfsdata /mnt/test\n
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html","title":"Fine-tune the ChatGLM3 Model by Using AI Lab","text":"
This page uses the ChatGLM3 model as an example to demonstrate how to use LoRA (Low-Rank Adaptation) to fine-tune the ChatGLM3 model within the AI Lab environment. The demo program is from the ChatGLM3 official example.
GPU with at least 20GB memory, recommended RTX4090 or NVIDIA A/H series
At least 200GB of available disk space
At least 8-core CPU, recommended 16-core
64GB RAM, recommended 128GB
Info
Before starting, ensure AI platform and AI Lab are correctly installed, GPU queue resources are successfully initialized, and computing resources are sufficient.
Utilize the dataset management feature provided by AI Lab to quickly preheat and persist the data required for fine-tuning large models, reducing GPU resource occupation due to data preparation, and improving resource utilization efficiency.
Create the required data resources on the dataset list page. These resources include the ChatGLM3 code and data files, all of which can be managed uniformly through the dataset list.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#code-and-model-files","title":"Code and Model Files","text":"
ChatGLM3 is a dialogue pre-training model jointly released by zhipuai.cn and Tsinghua University KEG Lab.
First, pull the ChatGLM3 code repository and download the pre-training model for subsequent fine-tuning tasks.
AI Lab will automatically preheat the data in the background to ensure quick data access for subsequent tasks.
You also need to prepare an empty dataset to store the model files output after the fine-tuning task is completed. Here, create an empty dataset, using PVC as an example.
Warning
Ensure to use a storage type that supports ReadWriteMany to allow quick access to resources for subsequent tasks.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#set-up-environment","title":"Set up Environment","text":"
For model developers, preparing the Python environment dependencies required for model development is crucial. Traditionally, environment dependencies are either packaged directly into the development tool's image or installed in the local environment, which can lead to inconsistency in environment dependencies and difficulties in managing and updating dependencies.
AI Lab provides environment management capabilities, decoupling Python environment dependency package management from development tools and task images, solving dependency management chaos and environment inconsistency issues.
Here, use the environment management feature provided by AI Lab to create the environment required for ChatGLM3 fine-tuning for subsequent use.
Warning
The ChatGLM repository contains a requirements.txt file that includes the environment dependencies required for ChatGLM3 fine-tuning.
This fine-tuning does not use the deepspeed and mpi4py packages. It is recommended to comment them out in the requirements.txt file to avoid compilation failures.
In the environment management list, you can quickly create a Python environment and complete the environment creation through a simple form configuration; a Python 3.11.x environment is required here.
Since CUDA is required for this experiment, GPU resources need to be configured here to preheat the necessary resource dependencies.
Creating the environment involves downloading a series of Python dependencies, and download speeds may vary based on your location. Using a domestic mirror can speed up the download.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#use-notebook-as-ide","title":"Use Notebook as IDE","text":"
AI Lab provides Notebook as an IDE feature, allowing users to write, run, and view code results directly in the browser. This is very suitable for development in data analysis, machine learning, and deep learning fields.
You can use the JupyterLab Notebook provided by AI Lab for the ChatGLM3 fine-tuning task.
In the Notebook list, you can create a Notebook according to the page operation guide. Note that you need to configure the proper Notebook resource parameters according to the resource requirements mentioned earlier to avoid resource issues affecting the fine-tuning process.
Note
When creating a Notebook, you can directly mount the preloaded model code dataset and environment, greatly saving data preparation time.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#mount-dataset-and-code","title":"Mount Dataset and Code","text":"
Note: The ChatGLM3 code files are mounted to the /home/jovyan/ChatGLM3 directory, and you also need to mount the AdvertiseGen dataset to the /home/jovyan/ChatGLM3/finetune_demo/data/AdvertiseGen directory to allow the fine-tuning task to access the data.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#mount-pvc-to-model-output-folder","title":"Mount PVC to Model Output Folder","text":"
The model output location used this time is the /home/jovyan/ChatGLM3/finetune_demo/output directory. You can mount the previously created PVC dataset to this directory, so the trained model can be saved to the dataset for subsequent inference tasks.
After creation, you can see the Notebook interface where you can write, run, and view code results directly in the Notebook.
Once in the Notebook, you can find the previously mounted dataset and code in the File Browser option in the Notebook sidebar. Locate the ChatGLM3 folder.
You will find the fine-tuning code for ChatGLM3 in the finetune_demo folder. Open the lora_finetune.ipynb file, which contains the fine-tuning code for ChatGLM3.
First, follow the instructions in the README.md file to understand the entire fine-tuning process. It is recommended to read it thoroughly to ensure that the basic environment dependencies and data preparation work are completed.
Open the terminal and use conda to switch to the preheated environment, ensuring consistency with the JupyterLab Kernel for subsequent code execution.
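A minimal sketch of switching environments in the terminal (the environment name chatglm3-finetune below is hypothetical; use the name of the environment you created earlier):
conda env list # list the preheated environments\nconda activate chatglm3-finetune # replace with your environment name\n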
First, preprocess the AdvertiseGen dataset, standardizing the data to meet the LoRA fine-tuning format requirements. Save the processed data to the AdvertiseGen_fix folder.
import json\nfrom typing import Union\nfrom pathlib import Path\n\ndef _resolve_path(path: Union[str, Path]) -> Path:\n return Path(path).expanduser().resolve()\n\ndef _mkdir(dir_name: Union[str, Path]):\n dir_name = _resolve_path(dir_name)\n if not dir_name.is_dir():\n dir_name.mkdir(parents=True, exist_ok=False)\n\ndef convert_adgen(data_dir: Union[str, Path], save_dir: Union[str, Path]):\n def _convert(in_file: Path, out_file: Path):\n _mkdir(out_file.parent)\n with open(in_file, encoding='utf-8') as fin:\n with open(out_file, 'wt', encoding='utf-8') as fout:\n for line in fin:\n dct = json.loads(line)\n sample = {'conversations': [{'role': 'user', 'content': dct['content']},\n {'role': 'assistant', 'content': dct['summary']}]}\n fout.write(json.dumps(sample, ensure_ascii=False) + '\\n')\n\n data_dir = _resolve_path(data_dir)\n save_dir = _resolve_path(save_dir)\n\n train_file = data_dir / 'train.json'\n if train_file.is_file():\n out_file = save_dir / train_file.relative_to(data_dir)\n _convert(train_file, out_file)\n\n dev_file = data_dir / 'dev.json'\n if dev_file.is_file():\n out_file = save_dir / dev_file.relative_to(data_dir)\n _convert(dev_file, out_file)\n\nconvert_adgen('data/AdvertiseGen', 'data/AdvertiseGen_fix')\n
To save debugging time, you can reduce the number of entries in /home/jovyan/ChatGLM3/finetune_demo/data/AdvertiseGen_fix/dev.json to 50. The data is in JSON format, making it easy to process.
After preprocessing the data, you can proceed with the fine-tuning test. Configure the fine-tuning parameters in the /home/jovyan/ChatGLM3/finetune_demo/configs/lora.yaml file. Key parameters to focus on include:
Open a new terminal window and use the following command for local fine-tuning testing. Ensure that the parameter configurations and paths are correct:
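A sketch of the command, run from the finetune_demo directory (the meaning of each part is explained below; adjust paths to your environment):
cd /home/jovyan/ChatGLM3/finetune_demo\npython finetune_hf.py data/AdvertiseGen_fix ./chatglm3-6b configs/lora.yaml\n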
finetune_hf.py is the fine-tuning script in the ChatGLM3 code
data/AdvertiseGen_fix is your preprocessed dataset
./chatglm3-6b is your pre-trained model path
configs/lora.yaml is the fine-tuning configuration file
During fine-tuning, you can use the nvidia-smi command to check GPU memory usage:
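For example, either run it once or refresh it periodically (availability of watch depends on the image):
nvidia-smi\n# Or refresh automatically every second:\nwatch -n 1 nvidia-smi\n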
After fine-tuning is complete, an output directory will be generated in the finetune_demo directory, containing the fine-tuned model files. This way, the fine-tuned model files are saved to the previously created PVC dataset.
After completing the local fine-tuning test and ensuring that your code and data are correct, you can submit the fine-tuning task to the AI Lab for large-scale training and fine-tuning tasks.
Note
This is the recommended model development and fine-tuning process: first, conduct local fine-tuning tests to ensure that the code and data are correct.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#submit-fine-tuning-tasks-via-ui","title":"Submit Fine-tuning Tasks via UI","text":"
Use Pytorch to create a fine-tuning task. Select the resources of the cluster you need to use based on your actual situation. Ensure to meet the resource requirements mentioned earlier.
Image: You can directly use the model image provided by baizectl.
Startup command: Based on your experience using LoRA fine-tuning in the Notebook, the code files and data are in the /home/jovyan/ChatGLM3/finetune_demo directory, so you can directly use this path:
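A sketch of such a startup command, reusing the same invocation as the local test (the paths assume the datasets are mounted at the same locations as in the Notebook):
bash -c \"cd /home/jovyan/ChatGLM3/finetune_demo && python finetune_hf.py data/AdvertiseGen_fix ./chatglm3-6b configs/lora.yaml\"\n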
After successfully submitting the task, you can view the training progress of the task in real-time in the task list. You can see the task status, resource usage, logs, and other information.
View task logs
After the task is completed, you can view the fine-tuned model files in the data output dataset for subsequent inference tasks.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#submit-tasks-via-baizectl","title":"Submit Tasks via baizectl","text":"
AI Lab's Notebook supports using the baizectl command-line tool without authentication. If you prefer using CLI, you can directly use the baizectl command-line tool to submit tasks.
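A sketch of submitting the same fine-tuning job with baizectl (flags follow the baizectl job submit syntax described later in this document; the job name, GPU count, and paths are illustrative):
baizectl job submit --name chatglm3-lora --resources nvidia.com/gpu=1 \\\n -- bash -c \"cd /home/jovyan/ChatGLM3/finetune_demo && python finetune_hf.py data/AdvertiseGen_fix ./chatglm3-6b configs/lora.yaml\"\n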
After completing the fine-tuning task, you can use the fine-tuned model for inference tasks. Here, you can use the inference service provided by AI Lab to create an inference service with the output model.
In the inference service list, you can create a new inference service. When selecting the model, choose the previously output dataset and configure the model path.
Configure the model and GPU resource requirements for the inference service based on the model size and expected inference concurrency. You can refer to the resource configuration of the previous fine-tuning tasks.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#configure-model-runtime","title":"Configure Model Runtime","text":"
Configuring the model runtime is crucial. Currently, AI Lab supports vLLM as the model inference service runtime, which can be directly selected.
Tip
vLLM supports a wide range of large language models. Visit vLLM for more information. These models can be easily used within AI Lab.
After creation, you can see the created inference service in the inference service list. The model service list allows you to get the model's access address directly.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#test-the-model-service","title":"Test the Model Service","text":"
Try using the curl command in the terminal to test the model service. Here, you can see the returned results, enabling you to use the model service for inference tasks.
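A sketch of such a request, assuming the service exposes the OpenAI-compatible chat completions endpoint (replace the address and model name with the values shown in the inference service details):
curl -X POST http://{service-address}/v1/chat/completions \\\n -H \"Content-Type: application/json\" \\\n -d '{\n \"model\": \"chatglm3-6b\",\n \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]\n }'\n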
This page used ChatGLM3 as an example to quickly introduce and get you started with the AI Lab for model fine-tuning, using LoRA to fine-tune the ChatGLM3 model.
AI Lab provides a wealth of features to help model developers quickly conduct model development, fine-tuning, and inference tasks. It also offers rich OpenAPI interfaces, facilitating integration with third-party application ecosystems.
Refer to the video tutorial: Data Labeling and Dataset Usage Instructions
Label Studio is an open-source data labeling tool used for various machine learning and artificial intelligence jobs. Here is a brief introduction to Label Studio:
Supports labeling of various data types including images, audio, video, and text
Can be used for jobs such as object detection, image classification, speech transcription, and named entity recognition
Provides a customizable labeling interface
Supports various labeling formats and export options
Label Studio offers a powerful data labeling solution for data scientists and machine learning engineers due to its flexibility and rich features.
"},{"location":"en/admin/baize/best-practice/label-studio.html#deploy-to-ai-platform","title":"Deploy to AI platform","text":"
To use Label Studio in AI Lab, it needs to be deployed to the Global Service Cluster. You can quickly deploy it using Helm.
Note
For more deployment details, refer to Deploy Label Studio on Kubernetes.
Enter the Global Service Cluster, find Helm Apps -> Helm Repositories from the left navigation bar, click the Create Repository button, and fill in the following parameters:
After successfully adding the repository, click the ┇ on the right side of the list and select Sync Repository. Wait a moment to complete the synchronization. (This sync operation will also be used for future updates of Label Studio).
Then navigate to the Helm Charts page, search for label-studio, and click the card.
Choose the latest version and configure the installation parameters as shown below, naming it label-studio. It is recommended to create a new namespace. Switch the parameters to YAML and modify the configuration according to the instructions.
global:\n image:\n repository: heartexlabs/label-studio # Configure proxy address here if docker.io is inaccessible\n extraEnvironmentVars:\n LABEL_STUDIO_HOST: https://{Access_Address}/label-studio # Use the AI platform login address, refer to the current webpage URL\n LABEL_STUDIO_USERNAME: {User_Email} # Must be an email, replace with your own\n LABEL_STUDIO_PASSWORD: {User_Password}\napp:\n nginx:\n livenessProbe:\n path: /label-studio/nginx_health\n readinessProbe:\n path: /label-studio/version\n
At this point, the installation of Label Studio is complete.
Warning
By default, PostgreSQL will be installed as the data service middleware. If the image pull fails, it may be because docker.io is inaccessible. Ensure to switch to an available proxy.
If you have your own PostgreSQL data service middleware, you can use the following parameters:
global:\n image:\n repository: heartexlabs/label-studio # Configure proxy address here if docker.io is inaccessible\n extraEnvironmentVars:\n LABEL_STUDIO_HOST: https://{Access_Address}/label-studio # Use the AI platform login address, refer to the current webpage URL\n LABEL_STUDIO_USERNAME: {User_Email} # Must be an email, replace with your own\n LABEL_STUDIO_PASSWORD: {User_Password}\napp:\n nginx:\n livenessProbe:\n path: /label-studio/nginx_health\n readinessProbe:\n path: /label-studio/version\npostgresql:\n enabled: false # Disable the built-in PostgreSQL\nexternalPostgresql:\n host: \"postgres-postgresql\" # PostgreSQL address\n port: 5432\n username: \"label_studio\" # PostgreSQL username\n password: \"your_label_studio_password\" # PostgreSQL password\n database: \"label_studio\" # PostgreSQL database name\n
"},{"location":"en/admin/baize/best-practice/label-studio.html#add-gproduct-to-navigation-bar","title":"Add GProduct to Navigation Bar","text":"
To add Label Studio to the AI platform navigation bar, you can refer to the method in Global Management OEM IN. The following example shows how to add it to the secondary navigation of AI Lab.
The above describes how to add Label Studio and integrate it as a labeling component in AI Lab. By labeling the datasets in AI Lab, you can tie labeling into algorithm development and improve the overall development workflow. For further usage, refer to the relevant documentation.
"},{"location":"en/admin/baize/best-practice/train-with-deepspeed.html","title":"Submit a DeepSpeed Training Task","text":"
According to the DeepSpeed official documentation, it is recommended to modify your code to implement the training task.
Specifically, you can use deepspeed.init_distributed() instead of torch.distributed.init_process_group(...). Then run the command using torchrun to submit it as a PyTorch distributed task, which will allow you to run a DeepSpeed task.
You can use torchrun to run your DeepSpeed training script. torchrun is a utility provided by PyTorch for distributed training. You can combine torchrun with the DeepSpeed API to start your training task.
Below is an example of running a DeepSpeed training script using torchrun:
Write the training script:
train.py
import torch\nimport deepspeed\nfrom torch.utils.data import DataLoader\n\n# Load model and data\nmodel = YourModel()\ntrain_dataset = YourDataset()\ntrain_dataloader = DataLoader(train_dataset, batch_size=32)\n\n# Configure file path\ndeepspeed_config = \"deepspeed_config.json\"\n\n# Create DeepSpeed training engine\nmodel_engine, optimizer, _, _ = deepspeed.initialize(\n model=model,\n model_parameters=model.parameters(),\n config_params=deepspeed_config\n)\n\n# Training loop\nfor batch in train_dataloader:\n loss = model_engine(batch)\n model_engine.backward(loss)\n model_engine.step()\n
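The script above references a deepspeed_config.json file. A minimal sketch of such a configuration (the values are illustrative and should be tuned for your model and hardware):
{\n \"train_batch_size\": 32,\n \"gradient_accumulation_steps\": 1,\n \"fp16\": {\n \"enabled\": true\n },\n \"zero_optimization\": {\n \"stage\": 1\n },\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.001\n }\n }\n}\n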
Run the training script using torchrun or baizectl:
torchrun train.py\n
In this way, you can combine PyTorch's distributed training capabilities with DeepSpeed's optimization technologies for more efficient training. You can use the baizectl command to submit a job in a notebook:
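For example (a sketch; the worker and GPU counts are illustrative):
baizectl job submit --workers 2 --resources nvidia.com/gpu=4 -- torchrun train.py\n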
This document provides a simple guide for users to use the AI Lab platform for the entire development and training process of datasets, Notebooks, and job training.
Click Data Management -> Datasets in the navigation bar, then click Create. Create three datasets as follows:
Code: https://github.com/d-run/drun-samples
For faster access in China, use Gitee: https://gitee.com/samzong_lu/training-sample-code.git
For faster access in China, use Gitee: https://gitee.com/samzong_lu/fashion-mnist.git
Empty PVC: Create an empty PVC to output the trained model and logs after training.
Note
Currently, only StorageClass with ReadWriteMany mode is supported. Please use NFS or the recommended JuiceFS.
Prepare the development environment by clicking Notebooks in the navigation bar, then click Create. Associate the three datasets created in the previous step and fill in the mount paths as shown in the image below:
Wait for the Notebook to be created successfully, click the access link in the list to enter the Notebook. Execute the following command in the Notebook terminal to start the job training.
Click Job Center -> Jobs in the navigation bar, create a Tensorflow Single job. Refer to the image below for job configuration and enable the Job Analysis (Tensorboard) feature. Click Create and wait for the status to complete.
For large datasets or models, it is recommended to enable GPU configuration in the resource configuration step.
In the job created in the previous step, you can click the specific job analysis to view the job status and optimize the job training.
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html","title":"Create, Use and Delete Datasets","text":"
AI Lab provides comprehensive dataset management functions needed for model development, training, and inference processes. Currently, it supports unified access to various data sources.
With simple configurations, you can connect data sources to AI Lab, achieving unified data management, preloading, dataset management, and other functionalities.
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#create-a-dataset","title":"Create a Dataset","text":"
In the left navigation bar, click Data Management -> Dataset List, and then click the Create button on the right.
Select the worker cluster and namespace to which the dataset belongs, then click Next.
Configure the data source type for the target data, then click OK.
Currently supported data sources include:
GIT: Supports repositories such as GitHub, GitLab, and Gitee
Upon successful creation, the dataset will be returned to the dataset list. You can perform more actions by clicking ┇ on the right.
Info
The system will automatically perform a one-time data preloading after the dataset is successfully created; the dataset cannot be used until the preloading is complete.
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#use-a-dataset","title":"Use a Dataset","text":"
Once the dataset is successfully created, it can be used in tasks such as model training and inference.
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#use-in-notebook","title":"Use in Notebook","text":"
In creating a Notebook, you can directly use the dataset; the usage is as follows:
Use the dataset as training data mount
Use the dataset as code mount
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#use-in-training-obs","title":"Use in Training obs","text":"
Use the dataset to specify job output
Use the dataset to specify job input
Use the dataset to specify TensorBoard output
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#use-in-inference-services","title":"Use in Inference Services","text":"
Use the dataset to mount a model
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#delete-a-dataset","title":"Delete a Dataset","text":"
If you find a dataset to be redundant, expired, or no longer needed, you can delete it from the dataset list.
Click the ┇ on the right side of the dataset list, then choose Delete from the dropdown menu.
In the pop-up window, confirm the dataset you want to delete, enter the dataset name, and then click Delete.
A confirmation message will appear indicating successful deletion, and the dataset will disappear from the list.
Caution
Once a dataset is deleted, it cannot be recovered, so please proceed with caution.
Traditionally, Python environment dependencies are built into an image, which includes the Python version and dependency packages. This approach has high maintenance costs and is inconvenient to update, often requiring a complete rebuild of the image.
In AI Lab, users can manage pure environment dependencies through the Environment Management module, decoupling this part from the image. The advantages include:
One environment can be used in multiple places, such as in Notebooks, distributed training tasks, and even inference services.
Updating dependency packages is more convenient; you only need to update the environment dependencies without rebuilding the image.
The main components of the environment management are:
Cluster : Select the cluster to operate on.
Namespace : Select the namespace to limit the scope of operations.
Environment List : Displays all environments and their statuses under the current cluster and namespace.
"},{"location":"en/admin/baize/developer/dataset/environments.html#explanation-of-environment-list-fields","title":"Explanation of Environment List Fields","text":"
Name : The name of the environment.
Status : The current status of the environment (normal or failed). New environments undergo a warming-up process, after which they can be used in other tasks.
Creation Time : The time the environment was created.
"},{"location":"en/admin/baize/developer/dataset/environments.html#creat-new-environment","title":"Creat New Environment","text":"
On the Environment Management interface, click the Create button at the top right to enter the environment creation process.
Fill in the following basic information:
Name : Enter the environment name, with a length of 2-63 characters, starting and ending with lowercase letters or numbers.
Deployment Location:
Cluster : Select the cluster to deploy, such as gpu-cluster.
Namespace : Select the namespace, such as default.
Remarks (optional): Enter remarks.
Labels (optional): Add labels to the environment.
Annotations (optional): Add annotations to the environment. After completing the information, click Next to proceed to environment configuration.
Python Version : Select the required Python version, such as 3.12.3.
Package Manager : Choose the package management tool, either PIP or CONDA.
Environment Data :
If PIP is selected: Enter the dependency package list in requirements.txt format in the editor below.
If CONDA is selected: Enter the dependency package list in environment.yaml format in the editor below.
Other Options (optional):
Additional pip Index URLs : Configure additional pip index URLs; suitable for internal enterprise private repositories or PIP acceleration sites.
GPU Configuration : Enable or disable GPU configuration; some GPU-related dependency packages need GPU resources configured during preloading.
Associated Storage : Select the associated storage configuration; environment dependency packages will be stored in the associated storage. Note: Storage must support ReadWriteMany.
After configuration, click the Create button, and the system will automatically create and configure the new Python environment.
Verify that the Python version and package manager configuration are correct.
Ensure the selected cluster and namespace are available.
If dependency preloading fails:
Check if the requirements.txt or environment.yaml file format is correct.
Verify that the dependency package names and versions are correct. If other issues arise, contact the platform administrator or refer to the platform help documentation for more support.
These are the basic steps and considerations for managing Python dependencies in AI Lab.
With the rapid iteration of AI Lab, we have now supported various model inference services. Here, you can see information about the supported models.
AI Lab v0.3.0 launched model inference services, facilitating users to directly use the inference services of AI Lab without worrying about model deployment and maintenance for traditional deep learning models.
AI Lab v0.6.0 supports the full vLLM inference capability, covering many large language models such as Llama, Qwen, ChatGLM, and more.
Note
The support for inference capabilities is related to the version of AI Lab.
You can use GPU types that have been verified by AI platform in AI Lab. For more details, refer to the GPU Support Matrix.
Through the Triton Inference Server, traditional deep learning models can be well supported. Currently, AI Lab supports mainstream inference backend services:
Using vLLM as a Triton backend has been deprecated. It is recommended to use the native vLLM support to deploy your large language models.
With vLLM, we can quickly use large language models. Here, you can see the list of models we support, which generally aligns with the vLLM Support Models.
HuggingFace Models: We support most of HuggingFace's models. You can see more models at the HuggingFace Model Hub.
The vLLM Supported Models list includes supported large language models and vision-language models.
Models fine-tuned using the vLLM support framework.
"},{"location":"en/admin/baize/developer/inference/models.html#new-features-of-vllm","title":"New Features of vLLM","text":"
Currently, AI Lab also supports some new features when using vLLM as an inference tool:
Enable Lora Adapter to optimize model inference services during inference.
Provide a compatible OpenAPI interface with OpenAI, making it easy for users to switch to local inference services at a low cost and quickly transition.
"},{"location":"en/admin/baize/developer/inference/triton-inference.html","title":"Create Inference Service Using Triton Framework","text":"
The AI Lab currently offers Triton and vLLM as inference frameworks. Users can quickly start a high-performance inference service with simple configurations.
Danger
Using vLLM as a Triton backend has been deprecated. It is recommended to use the native vLLM support to deploy your large language models.
"},{"location":"en/admin/baize/developer/inference/triton-inference.html#introduction-to-triton","title":"Introduction to Triton","text":"
Triton is an open-source inference server developed by NVIDIA, designed to simplify the deployment and inference of machine learning models. It supports a variety of deep learning frameworks, including TensorFlow and PyTorch, enabling users to easily manage and deploy different types of models.
Prepare model data: Manage the model code in dataset management and ensure that the data is successfully preloaded. The following example illustrates the PyTorch model for mnist handwritten digit recognition.
Note
The model to be inferred must adhere to the following directory structure within the dataset:
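Based on the model path used later in this example (model-repo/mnist-cnn/1/model.pt), the layout looks roughly like this; the repository name, model name, and version directory are illustrative:
model-repo/\n└── mnist-cnn # model name\n    └── 1 # model version\n        └── model.pt # model weights file\n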
Currently, form-based creation is supported, allowing you to create services with field prompts in the interface.
"},{"location":"en/admin/baize/developer/inference/triton-inference.html#configure-model-path","title":"Configure Model Path","text":"
The model path model-repo/mnist-cnn/1/model.pt must be consistent with the directory structure of the dataset.
"},{"location":"en/admin/baize/developer/inference/triton-inference.html#model-configuration","title":"Model Configuration","text":""},{"location":"en/admin/baize/developer/inference/triton-inference.html#configure-input-and-output-parameters","title":"Configure Input and Output Parameters","text":"
Note
The first dimension of the input and output parameters defaults to batchsize, setting it to -1 allows for the automatic calculation of the batchsize based on the input inference data. The remaining dimensions and data type must match the model's input.
Send HTTP POST Request: Use tools like curl or HTTP client libraries (e.g., Python's requests library) to send POST requests to the Triton Server.
Set HTTP Headers: This configuration is generated automatically based on user settings; include metadata about the model inputs and outputs in the HTTP headers.
Construct Request Body: The request body usually contains the input data for inference and model-specific metadata.
<ip> is the host address where the Triton Inference Server is running.
<port> is the port where the Triton Inference Server is running.
<inference-name> is the name of the inference service that has been created.
\"name\" must match the name of the input parameter in the model configuration.
\"shape\" must match the dims of the input parameter in the model configuration.
\"datatype\" must match the Data Type of the input parameter in the model configuration.
\"data\" should be replaced with the actual inference data.
Please note that the above example code needs to be adjusted according to your specific model and environment. The format and content of the input data must also comply with the model's requirements.
"},{"location":"en/admin/baize/developer/inference/vllm-inference.html","title":"Create Inference Service Using vLLM Framework","text":"
AI Lab supports using vLLM as an inference service, offering all the capabilities of vLLM while fully adapting to the OpenAI interface definition.
"},{"location":"en/admin/baize/developer/inference/vllm-inference.html#introduction-to-vllm","title":"Introduction to vLLM","text":"
vLLM is a fast and easy-to-use library for inference and services. It aims to significantly improve the throughput and memory efficiency of language model services in real-time scenarios. vLLM boasts several features in terms of speed and flexibility:
Continuous batching of incoming requests.
Efficiently manages attention keys and values memory using PagedAttention.
Seamless integration with popular HuggingFace models.
Select the vLLM inference framework. In the model module selection, choose the pre-created model dataset hdd-models and fill in the path information where the model is located within the dataset.
This guide uses the ChatGLM3 model for creating the inference service.
Configure the resources for the inference service and adjust the parameters for running the inference service.
GPU Resources : Configure GPU resources for inference based on the model scale and cluster resources.
Allow Remote Code : Controls whether vLLM trusts and executes code from remote sources.
LoRA : LoRA is a parameter-efficient fine-tuning technique for deep learning models. It reduces the number of parameters and computational complexity by decomposing the original model parameter matrix into low-rank matrices. 1. --lora-modules: Specifies specific modules or layers for low-rank approximation. 2. max_loras_rank: Specifies the maximum rank for each adapter layer in the LoRA model. For simpler tasks, a smaller rank value can be chosen, while more complex tasks may require a larger rank value to ensure model performance. 3. max_loras: Indicates the maximum number of LoRA layers that can be included in the model, customized based on model size and inference complexity. 4. max_cpu_loras: Specifies the maximum number of LoRA layers that can be handled in a CPU environment.
Associated Environment : Selects predefined environment dependencies required for inference.
Info
For models that support LoRA parameters, refer to vLLM Supported Models.
In the Advanced Configuration , support is provided for automated affinity scheduling based on GPU resources and other node configurations. Users can also customize scheduling policies.
Once the inference service is created, click the name of the inference service to enter the details and view the API call methods. Verify the execution results using Curl, Python, and Node.js.
Copy the curl command from the details and execute it in the terminal to send a model inference request. The expected output should be:
Job management refers to the functionality of creating and managing job lifecycles through job scheduling and control components.
AI platform Smart Computing Capability adopts Kubernetes' Job mechanism to schedule various AI inference and training jobs.
Click Job Center -> Jobs in the left navigation bar to enter the job list. Click the Create button on the right.
The system will pre-fill basic configuration data, including the cluster, namespace, type, queue, and priority. Adjust these parameters and click Next.
Configure the URL, runtime parameters, and associated datasets, then click Next.
Optionally add labels, annotations, runtime env variables, and other job parameters. Select a scheduling policy and click Confirm.
After the job is successfully created, it will have several running statuses:
Pytorch is an open-source deep learning framework that provides a flexible environment for training and deployment. A Pytorch job is a job that uses the Pytorch framework.
In the AI Lab platform, we provide support and adaptation for Pytorch jobs. Through a graphical interface, you can quickly create Pytorch jobs and perform model training.
Here we use the baize-notebook base image and the associated environment as the basic runtime environment for the job.
To learn how to create an environment, refer to Environments.
"},{"location":"en/admin/baize/developer/jobs/pytorch.html#create-jobs","title":"Create Jobs","text":""},{"location":"en/admin/baize/developer/jobs/pytorch.html#pytorch-single-jobs","title":"Pytorch Single Jobs","text":"
Log in to the AI Lab platform, click Job Center in the left navigation bar to enter the Jobs page.
Click the Create button in the upper right corner to enter the job creation page.
Select the job type as Pytorch Single and click Next .
Fill in the job name and description, then click OK .
Once the job is successfully submitted, we can enter the job details to see the resource usage. From the upper right corner, go to Workload Details to view the log output during the training process.
import os\nimport torch\nimport torch.distributed as dist\nimport torch.nn as nn\nimport torch.optim as optim\nfrom torch.nn.parallel import DistributedDataParallel as DDP\n\nclass SimpleModel(nn.Module):\n def __init__(self):\n super(SimpleModel, self).__init__()\n self.fc = nn.Linear(10, 1)\n\n def forward(self, x):\n return self.fc(x)\n\ndef train():\n # Print environment information\n print(f'PyTorch version: {torch.__version__}')\n print(f'CUDA available: {torch.cuda.is_available()}')\n if torch.cuda.is_available():\n print(f'CUDA version: {torch.version.cuda}')\n print(f'CUDA device count: {torch.cuda.device_count()}')\n\n rank = int(os.environ.get('RANK', '0'))\n world_size = int(os.environ.get('WORLD_SIZE', '1'))\n\n print(f'Rank: {rank}, World Size: {world_size}')\n\n # Initialize distributed environment\n try:\n if world_size > 1:\n dist.init_process_group('nccl')\n print('Distributed process group initialized successfully')\n else:\n print('Running in non-distributed mode')\n except Exception as e:\n print(f'Error initializing process group: {e}')\n return\n\n # Set device\n try:\n if torch.cuda.is_available():\n device = torch.device(f'cuda:{rank % torch.cuda.device_count()}')\n print(f'Using CUDA device: {device}')\n else:\n device = torch.device('cpu')\n print('CUDA not available, using CPU')\n except Exception as e:\n print(f'Error setting device: {e}')\n device = torch.device('cpu')\n print('Falling back to CPU')\n\n try:\n model = SimpleModel().to(device)\n print('Model moved to device successfully')\n except Exception as e:\n print(f'Error moving model to device: {e}')\n return\n\n try:\n if world_size > 1:\n ddp_model = DDP(model, device_ids=[rank % torch.cuda.device_count()] if torch.cuda.is_available() else None)\n print('DDP model created successfully')\n else:\n ddp_model = model\n print('Using non-distributed model')\n except Exception as e:\n print(f'Error creating DDP model: {e}')\n return\n\n loss_fn = nn.MSELoss()\n optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)\n\n # Generate some random data\n try:\n data = torch.randn(100, 10, device=device)\n labels = torch.randn(100, 1, device=device)\n print('Data generated and moved to device successfully')\n except Exception as e:\n print(f'Error generating or moving data to device: {e}')\n return\n\n for epoch in range(10):\n try:\n ddp_model.train()\n outputs = ddp_model(data)\n loss = loss_fn(outputs, labels)\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n if rank == 0:\n print(f'Epoch {epoch}, Loss: {loss.item():.4f}')\n except Exception as e:\n print(f'Error during training epoch {epoch}: {e}')\n break\n\n if world_size > 1:\n dist.destroy_process_group()\n\nif __name__ == '__main__':\n train()\n
"},{"location":"en/admin/baize/developer/jobs/pytorch.html#number-of-job-replicas","title":"Number of Job Replicas","text":"
Note that Pytorch Distributed training jobs will create a group of Master and Worker training Pods, where the Master is responsible for coordinating the training job, and the Worker is responsible for the actual training work.
Note
In this demonstration the Master replica count is 1 and the Worker replica count is 2, so the replica count in the Job Configuration needs to be set to 3, which is the sum of the Master and Worker replica counts. PyTorch will automatically assign the Master and Worker roles.
AI Lab provides important visualization analysis tools for the model development process, used to display the training process and results of machine learning models. This document will introduce the basic concepts of Job Analysis (Tensorboard), its usage in the AI Lab system, and how to configure the log content of datasets.
Note
Tensorboard is a visualization tool provided by TensorFlow, used to display the training process and results of machine learning models. It can help developers more intuitively understand the training dynamics of their models, analyze model performance, debug issues, and more.
The role and advantages of Tensorboard in the model development process:
Visualize Training Process : Display metrics such as training and validation loss, and accuracy through charts, helping developers intuitively observe the training effects of the model.
Debug and Optimize Models : Viewing the weights and gradient distributions of different layers helps developers discover and fix issues in the model.
Compare Different Experiments : Simultaneously display the results of multiple experiments, making it convenient for developers to compare the effects of different models and hyperparameter configurations.
Track Training Data : Record the datasets and parameters used during training to ensure the reproducibility of experiments.
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#how-to-create-tensorboard","title":"How to Create Tensorboard","text":"
In the AI Lab system, we provide a convenient way to create and manage Tensorboard. Here are the specific steps:
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#enable-tensorboard-when-creating-a-notebook","title":"Enable Tensorboard When Creating a Notebook","text":"
Create a Notebook : Create a new Notebook on the AI Lab platform.
Enable Tensorboard : On the Notebook creation page, enable the Tensorboard option and specify the dataset and log path.
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#enable-tensorboard-after-creating-and-completing-a-distributed-job","title":"Enable Tensorboard After Creating and Completing a Distributed Job","text":"
Create a Distributed Job : Create a new distributed training job on the AI Lab platform.
Configure Tensorboard : On the job configuration page, enable the Tensorboard option and specify the dataset and log path.
View Tensorboard After Job Completion : After the job is completed, you can view the Tensorboard link on the job details page. Click the link to see the visualized results of the training process.
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#directly-reference-tensorboard-in-a-notebook","title":"Directly Reference Tensorboard in a Notebook","text":"
In a Notebook, you can directly start Tensorboard through code. Here is a sample code snippet:
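A minimal sketch of starting Tensorboard from a Notebook cell using the Jupyter extension (the log directory is illustrative):
# Load the Tensorboard extension and point it at the log directory\n%load_ext tensorboard\n%tensorboard --logdir logs/\n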
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#how-to-configure-dataset-log-content","title":"How to Configure Dataset Log Content","text":"
When using Tensorboard, you can record and configure different datasets and log content. Here are some common configuration methods:
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#configure-training-and-validation-dataset-logs","title":"Configure Training and Validation Dataset Logs","text":"
While training the model, you can use TensorFlow's tf.summary API to record logs for the training and validation datasets. Here is a sample code snippet:
# Import necessary libraries\nimport tensorflow as tf\n\n# Create log directories\ntrain_log_dir = 'logs/gradient_tape/train'\nval_log_dir = 'logs/gradient_tape/val'\ntrain_summary_writer = tf.summary.create_file_writer(train_log_dir)\nval_summary_writer = tf.summary.create_file_writer(val_log_dir)\n\n# Train model and record logs\nfor epoch in range(EPOCHS):\n for (x_train, y_train) in train_dataset:\n # Training step\n train_step(x_train, y_train)\n with train_summary_writer.as_default():\n tf.summary.scalar('loss', train_loss.result(), step=epoch)\n tf.summary.scalar('accuracy', train_accuracy.result(), step=epoch)\n\n for (x_val, y_val) in val_dataset:\n # Validation step\n val_step(x_val, y_val)\n with val_summary_writer.as_default():\n tf.summary.scalar('loss', val_loss.result(), step=epoch)\n tf.summary.scalar('accuracy', val_accuracy.result(), step=epoch)\n
In addition to logs for training and validation datasets, you can also record other custom log content such as learning rate and gradient distribution. Here is a sample code snippet:
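For example, a sketch of recording the learning rate and gradient distribution with the tf.summary API (the writer, optimizer, model, and gradients variables are assumed to exist in the training loop above):
with train_summary_writer.as_default():\n # Record the current learning rate as a scalar\n tf.summary.scalar('learning_rate', optimizer.learning_rate, step=epoch)\n # Record the gradient distribution of each trainable variable as a histogram\n for var, grad in zip(model.trainable_variables, gradients):\n tf.summary.histogram(f'gradients/{var.name}', grad, step=epoch)\n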
In AI Lab, Tensorboards created through various methods are uniformly displayed on the job analysis page, making it convenient for users to view and manage.
Users can view information such as the link, status, and creation time of Tensorboard on the job analysis page and directly access the visualized results of Tensorboard through the link.
Tensorflow, along with Pytorch, is a highly active open-source deep learning framework that provides a flexible environment for training and deployment.
AI Lab provides support and adaptation for the Tensorflow framework. You can quickly create Tensorflow jobs and conduct model training through graphical operations.
Here, we use the baize-notebook base image and the associated environment as the basic runtime environment for jobs.
For information on how to create an environment, refer to Environment List.
"},{"location":"en/admin/baize/developer/jobs/tensorflow.html#creating-a-job","title":"Creating a Job","text":""},{"location":"en/admin/baize/developer/jobs/tensorflow.html#example-tfjob-single","title":"Example TFJob Single","text":"
Log in to the AI Lab platform and click Job Center in the left navigation bar to enter the Jobs page.
Click the Create button in the upper right corner to enter the job creation page.
Select the job type as Tensorflow Single and click Next .
Fill in the job name and description, then click OK .
"},{"location":"en/admin/baize/developer/jobs/tensorflow.html#pre-warming-the-code-repository","title":"Pre-warming the Code Repository","text":"
Use AI Lab -> Dataset List to create a dataset and pull the code from a remote GitHub repository into the dataset. This way, when creating a job, you can directly select the dataset and mount the code into the job.
Command parameters: Use python /code/tensorflow/tf-single.py
\"\"\"\n pip install tensorflow numpy\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\n\n# Create some random data\nx = np.random.rand(100, 1)\ny = 2 * x + 1 + np.random.rand(100, 1) * 0.1\n\n# Create a simple model\nmodel = tf.keras.Sequential([\n tf.keras.layers.Dense(1, input_shape=(1,))\n])\n\n# Compile the model\nmodel.compile(optimizer='adam', loss='mse')\n\n# Train the model, setting epochs to 10\nhistory = model.fit(x, y, epochs=10, verbose=1)\n\n# Print the final loss\nprint('Final loss: {' + str(history.history['loss'][-1]) +'}')\n\n# Use the model to make predictions\ntest_x = np.array([[0.5]])\nprediction = model.predict(test_x)\nprint(f'Prediction for x=0.5: {prediction[0][0]}')\n
After the job is successfully submitted, you can enter the job details to see the resource usage. From the upper right corner, navigate to Workload Details to view log outputs during the training process.
Once a job is created, it will be displayed in the job list.
In the job list, click the ┇ on the right side of a job and select Job Workload Details .
A pop-up window will appear asking you to choose which Pod to view. Click Enter .
You will be redirected to the container management interface, where you can view the container's working status, labels and annotations, and any events that have occurred.
You can also view detailed logs of the current Pod for the recent period. By default, 100 lines of logs are displayed. To view more detailed logs or to download logs, click the blue Insight text at the top.
Additionally, you can use the ... in the upper right corner to view the current Pod's YAML, and to upload or download files. Below is an example of a Pod's YAML.
baizectl is a command line tool specifically designed for model developers and data scientists within the AI Lab module. It provides a series of commands to help users manage distributed training jobs, check job statuses, manage datasets, and more. It also supports connecting to Kubernetes worker clusters and AI platform workspaces, aiding users in efficiently using and managing Kubernetes platform resources.
Running baizectl without any arguments prints its basic usage information:
jovyan@19d0197587cc:/$ baizectl\nAI platform management tool\n\nUsage:\n baizectl [command]\n\nAvailable Commands:\n completion Generate the autocompletion script for the specified shell\n data Management datasets\n help Help about any command\n job Manage jobs\n login Login to the platform\n version Show cli version\n\nFlags:\n --cluster string Cluster name to operate\n -h, --help help for baizectl\n --mode string Connection mode: auto, api, notebook (default \"auto\")\n -n, --namespace string Namespace to use for the operation. If not set, the default Namespace will be used.\n -s, --server string access base url\n --skip-tls-verify Skip TLS certificate verification\n --token string access token\n -w, --workspace int32 Workspace ID to use for the operation\n\nUse \"baizectl [command] --help\" for more information about a command.\n
The above provides basic information about baizectl. Users can view the help information using baizectl --help, or view the help information for specific commands using baizectl [command] --help.
The basic format of the baizectl command is as follows:
baizectl [command] [flags]\n
Here, [command] refers to the specific operation command, such as data and job, and [flags] are optional parameters used to specify detailed information about the operation.
baizectl provides a series of commands to manage distributed training jobs, including viewing job lists, submitting jobs, viewing logs, restarting jobs, deleting jobs, and more.
jovyan@19d0197587cc:/$ baizectl job\nManage jobs\n\nUsage:\n baizectl job [command]\n\nAvailable Commands:\n delete Delete a job\n logs Show logs of a job\n ls List jobs\n restart restart a job\n submit Submit a job\n\nFlags:\n -h, --help help for job\n -o, --output string Output format. One of: table, json, yaml (default \"table\")\n --page int Page number (default 1)\n --page-size int Page size (default -1)\n --search string Search query\n --sort string Sort order\n --truncate int Truncate output to the given length, 0 means no truncation (default 50)\n\nUse \"baizectl job [command] --help\" for more information about a command.\n
"},{"location":"en/admin/baize/developer/notebooks/baizectl.html#submit-training-jobs","title":"Submit Training Jobs","text":"
baizectl supports submitting a job using the submit command. You can view detailed information by using baizectl job submit --help.
(base) jovyan@den-0:~$ baizectl job submit --help\nSubmit a job\n\nUsage:\n baizectl job submit [flags] -- command ...\n\nAliases:\n submit, create\n\nExamples:\n# Submit a job to run the command \"torchrun python train.py\"\nbaizectl job submit -- torchrun python train.py\n# Submit a job with 2 workers(each pod use 4 gpus) to run the command \"torchrun python train.py\" and use the image \"pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime\"\nbaizectl job submit --image pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime --workers 2 --resources nvidia.com/gpu=4 -- torchrun python train.py\n# Submit a tensorflow job to run the command \"python train.py\"\nbaizectl job submit --tensorflow -- python train.py\n\n\nFlags:\n --annotations stringArray The annotations of the job, the format is key=value\n --auto-load-env It only takes effect when executed in Notebook, the environment variables of the current environment will be automatically read and set to the environment variables of the Job, the specific environment variables to be read can be specified using the BAIZE_MAPPING_ENVS environment variable, the default is PATH,CONDA_*,*PYTHON*,NCCL_*, if set to false, the environment variables of the current environment will not be read. (default true)\n --commands stringArray The default command of the job\n -d, --datasets stringArray The dataset bind to the job, the format is datasetName:mountPath, e.g. mnist:/data/mnist\n -e, --envs stringArray The environment variables of the job, the format is key=value\n -x, --from-notebook string Define whether to read the configuration of the current Notebook and directly create tasks, including images, resources, and dataset.\n auto: Automatically determine the mode according to the current environment. If the current environment is a Notebook, it will be set to notebook mode.\n false: Do not read the configuration of the current Notebook.\n true: Read the configuration of the current Notebook. (default \"auto\")\n -h, --help help for submit\n --image string The image of the job, it must be specified if fromNotebook is false.\n -t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default \"PYTORCH\")\n --labels stringArray The labels of the job, the format is key=value\n --max-retries int32 number of retries before marking this job failed\n --max-run-duration int Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it\n --name string The name of the job, if empty, the name will be generated automatically.\n --paddle PaddlePaddle Job, has higher priority than --job-type\n --priority string The priority of the job, current support baize-medium-priority, baize-low-priority, baize-high-priority\n --pvcs stringArray The pvcs bind to the job, the format is pvcName:mountPath, e.g. 
mnist:/data/mnist\n --pytorch Pytorch Job, has higher priority than --job-type\n --queue string The queue to used\n --requests-resources stringArray Similar to resources, but sets the resources of requests\n --resources stringArray The resources of the job, it is a string in the format of cpu=1,memory=1Gi,nvidia.com/gpu=1, it will be set to the limits and requests of the container.\n --restart-policy string The job restart policy (default \"on-failure\")\n --runtime-envs baizectl data ls --runtime-env The runtime environment to use for the job, you can use baizectl data ls --runtime-env to get the runtime environment\n --shm-size int32 The shared memory size of the job, default is 0, which means no shared memory, if set to more than 0, the job will use the shared memory, the unit is MiB\n --tensorboard-log-dir string The tensorboard log directory, if set, the job will automatically start tensorboard, else not. The format is /path/to/log, you can use relative path in notebook.\n --tensorflow Tensorflow Job, has higher priority than --job-type\n --workers int The workers of the job, default is 1, which means single worker, if set to more than 1, the job will be distributed. (default 1)\n --working-dir string The working directory of job container, if in notebook mode, the default is the directory of the current file\n
Note
Explanation of command parameters for submitting jobs:
--name: Job name. If empty, it will be auto-generated.
--resources: Job resources, formatted as cpu=1,memory=1Gi,nvidia.com/gpu=1.
--workers: Number of job worker nodes. The default is 1. When set to greater than 1, the job will run in a distributed manner.
--queue: Job queue. Queue resources need to be created in advance.
--working-dir: Working directory. In Notebook mode, the current file directory will be used by default.
--datasets: Dataset, formatted as datasetName:mountPath, for example mnist:/data/mnist.
--shm-size: Shared memory size. This can be enabled for distributed training jobs, indicating the use of shared memory, with units in MiB.
--labels: Job labels, formatted as key=value.
--max-retries: Maximum retry count, i.e. the number of times the job is restarted after a failure. Unlimited by default.
--max-run-duration: Maximum run duration. The job will be terminated by the system if it exceeds the specified run time. Default is unlimited.
--restart-policy: Restart policy, supporting on-failure, never, always. The default is on-failure.
--from-notebook: Whether to read configurations from the Notebook. Supports auto, true, false, with the default being auto.
"},{"location":"en/admin/baize/developer/notebooks/baizectl.html#example-of-a-pytorch-single-node-job","title":"Example of a PyTorch Single-Node Job","text":"
Example of submitting a training job. Users can modify parameters based on their actual needs. Below is an example of creating a PyTorch job:
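A minimal sketch of such a submission is shown below; the image name and resource values are illustrative, and every flag used here appears in the help output above:

```bash
# Single-node PyTorch job: 1 worker, 1 GPU (values are illustrative)
baizectl job submit \
    --name demo-pytorch-single \
    --image pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime \
    --resources cpu=4,memory=8Gi,nvidia.com/gpu=1 \
    -- torchrun python train.py
```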
"},{"location":"en/admin/baize/developer/notebooks/baizectl.html#example-of-a-distributed-pytorch-job","title":"Example of a Distributed PyTorch Job","text":"
Example of submitting a training job. You can modify parameters based on their actual needs. Below is an example of creating a distributed PyTorch job:
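A sketch of a distributed submission, again with illustrative values; setting --workers greater than 1 makes the job distributed, and --shm-size enables shared memory as described in the parameter list above:

```bash
# Distributed PyTorch job: 2 workers, 4 GPUs per worker (values are illustrative)
baizectl job submit \
    --name demo-pytorch-ddp \
    --image pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime \
    --workers 2 \
    --resources cpu=8,memory=16Gi,nvidia.com/gpu=4 \
    --shm-size 1024 \
    -- torchrun python train.py
```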
baizectl job supports viewing the job list using the ls command. By default, it displays pytorch jobs, but users can specify the job type using the -t parameter.
(base) jovyan@den-0:~$ baizectl job ls   # View pytorch jobs by default
 NAME        TYPE     PHASE      DURATION  COMMAND
 demong      PYTORCH  SUCCEEDED  1m2s      sleep 60
 demo-sleep  PYTORCH  RUNNING    1h25m28s  sleep 7200
(base) jovyan@den-0:~$ baizectl job ls demo-sleep   # View a specific job
 NAME        TYPE     PHASE    DURATION  COMMAND
 demo-sleep  PYTORCH  RUNNING  1h25m28s  sleep 7200
(base) jovyan@den-0:~$ baizectl job ls -t TENSORFLOW   # View tensorflow jobs
 NAME       TYPE        PHASE    DURATION  COMMAND
 demotfjob  TENSORFLOW  CREATED  0s        sleep 1000
The job list uses table as the default display format. If you want to view more information, you can use the json or yaml format, which can be specified using the -o parameter.
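For example, assuming a job named demo-sleep exists as in the listing above:

```bash
# Show the full job information in YAML instead of the table view
baizectl job ls demo-sleep -o yaml
```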
baizectl job supports viewing job logs using the logs command. You can view detailed information by using baizectl job logs --help.
(base) jovyan@den-0:~$ baizectl job logs --help
Show logs of a job

Usage:
  baizectl job logs <job-name> [pod-name] [flags]

Aliases:
  logs, log

Flags:
  -f, --follow            Specify if the logs should be streamed.
  -h, --help              help for logs
  -t, --job-type string   Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
      --paddle            PaddlePaddle Job, has higher priority than --job-type
      --pytorch           Pytorch Job, has higher priority than --job-type
      --tail int          Lines of recent log file to display.
      --tensorflow        Tensorflow Job, has higher priority than --job-type
      --timestamps        Show timestamps
Note
The --follow parameter allows for real-time log viewing.
The --tail parameter specifies the number of log lines to view, with a default of 50 lines.
The --timestamps parameter displays timestamps.
Example of viewing job logs:
(base) jovyan@den-0:~$ baizectl job log -t TENSORFLOW tf-sample-job-v2-202406161632-evgrbrhn -f
2024-06-16 08:33:06.083766: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-16 08:33:06.086189: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-16 08:33:06.132416: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-16 08:33:06.132903: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-16 08:33:07.223046: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 Conv1 (Conv2D)              (None, 13, 13, 8)         80

 flatten (Flatten)           (None, 1352)               0

 Softmax (Dense)             (None, 10)                 13530

=================================================================
Total params: 13610 (53.16 KB)
Trainable params: 13610 (53.16 KB)
Non-trainable params: 0 (0.00 Byte)
...
baizectl supports managing datasets. Currently, it supports viewing the dataset list, making it convenient to quickly bind datasets during job training.
(base) jovyan@den-0:~$ baizectl data
Management datasets

Usage:
  baizectl data [flags]
  baizectl data [command]

Aliases:
  data, dataset, datasets, envs, runtime-envs

Available Commands:
  ls          List datasets

Flags:
  -h, --help              help for data
  -o, --output string     Output format. One of: table, json, yaml (default "table")
      --page int          Page number (default 1)
      --page-size int     Page size (default -1)
      --search string     Search query
      --sort string       Sort order
      --truncate int      Truncate output to the given length, 0 means no truncation (default 50)

Use "baizectl data [command] --help" for more information about a command.
baizectl data supports viewing the datasets using the ls command. By default, it displays in table format, but users can specify the output format using the -o parameter.
(base) jovyan@den-0:~$ baizectl data ls
 NAME             TYPE  URI                                                    PHASE
 fashion-mnist    GIT   https://gitee.com/samzong_lu/fashion-mnist.git         READY
 sample-code      GIT   https://gitee.com/samzong_lu/training-sample-code....  READY
 training-output  PVC   pvc://training-output                                  READY
When submitting a training job, you can specify the dataset using the -d or --datasets parameter, for example:
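A short sketch using the sample-code dataset from the listing above (the mount path is illustrative, and the image is assumed to be inherited from the current Notebook):

```bash
# Bind a dataset in the datasetName:mountPath format
baizectl job submit --datasets sample-code:/home/jovyan/code -- python train.py
```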
The environment runtime-env is a unique environment management capability of Suanova. By decoupling the dependencies required for model development, training tasks, and inference, it offers a more flexible way to manage dependencies without the need to repeatedly build complex Docker images. You simply need to select the appropriate environment.
Additionally, runtime-env supports hot updates and dynamic upgrades, allowing you to update environment dependencies without rebuilding the image.
baizectl data supports viewing the environment list using the runtime-env command. By default, it displays in table format, but users can specify the output format using the -o parameter.
(base) jovyan@den-0:~$ baizectl data ls --runtime-env
 NAME               TYPE   URI                                                    PHASE
 fashion-mnist      GIT    https://gitee.com/samzong_lu/fashion-mnist.git         READY
 sample-code        GIT    https://gitee.com/samzong_lu/training-sample-code....  READY
 training-output    PVC    pvc://training-output                                  READY
 tensorflow-sample  CONDA  conda://python?version=3.12.3                          PROCESSING
When submitting a training job, you can specify the environment using the --runtime-env parameter:
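For example, using the tensorflow-sample environment shown in the listing above (again assuming the image is inherited from the current Notebook):

```bash
# Attach a runtime environment to the job
baizectl job submit --runtime-env tensorflow-sample -- python train.py
```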
baizectl supports more advanced usage, such as generating auto-completion scripts, using specific clusters and namespaces, and using specific workspaces.
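The completion command itself did not survive on this page; assuming baizectl follows the usual Cobra-style CLI convention of providing a completion subcommand, generating and installing the bash script would look roughly like this (verify the exact subcommand with baizectl --help):

```bash
# Sketch: generate a bash completion script and install it system-wide
baizectl completion bash > /etc/bash_completion.d/baizectl
```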
The above command generates an auto-completion script for bash and saves it to the /etc/bash_completion.d/baizectl directory. You can load the auto-completion script by using source /etc/bash_completion.d/baizectl.
"},{"location":"en/admin/baize/developer/notebooks/baizectl.html#using-specific-clusters-and-namespaces","title":"Using Specific Clusters and Namespaces","text":"
baizectl job ls --cluster my-cluster --namespace my-namespace
This command will list all jobs in the my-namespace namespace within the my-cluster cluster.
"},{"location":"en/admin/baize/developer/notebooks/baizectl.html#using-specific-workspaces","title":"Using Specific Workspaces","text":"
Solution: Check if the --server parameter is set correctly and ensure that the network connection is stable. If the server uses a self-signed certificate, you can use --skip-tls-verify to skip TLS certificate verification.
Question: How can I resolve insufficient permissions issues?
Solution: Ensure that you are using the correct --token parameter to log in and check if the current user has the necessary permissions for the operation.
Question: Why can't I list the datasets?
Solution: Check if the namespace and workspace are set correctly and ensure that the current user has permission to access these resources.
With this guide, you can quickly get started with baizectl commands and efficiently manage AI platform resources in practical applications. If you have any questions or issues, it is recommended to use baizectl [command] --help to check more detailed information.
baizess is a built-in, out-of-the-box source switch tool within the Notebook of AI Lab module. It provides a streamlined command-line interface to facilitate the management of package sources for various programming environments. With baizess, users can easily switch sources for commonly used package managers, ensuring seamless access to the latest libraries and dependencies. This tool enhances the efficiency of developers and data scientists by simplifying the process of managing package sources.
The basic information of the baizess command is as follows:
jovyan@19d0197587cc:/$ baizess
source switch tool

Usage:
  baizess [command] [package-manager]

Available Commands:
  set     Switch the source of specified package manager to current fastest source
  reset   Reset the source of specified package manager to default source

Available Package-managers:
  apt   (require root privilege)
  conda
  pip
set: Back up the current source, run a speed test, and switch the specified package manager's source to the fastest domestic mirror based on the result.
reset: Reset the specified package manager to its default source.
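For example, to switch the pip index to the fastest mirror and later restore the default:

```bash
baizess set pip     # back up, speed test, and switch the pip source
baizess reset pip   # restore the default pip source
```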
Notebook provides an online web interactive programming environment, making it convenient for developers to quickly conduct data science and machine learning experiments.
Upon entering the developer console, developers can create and manage Notebooks in different clusters and namespaces.
Click Notebooks in the left navigation bar to enter the Notebook list. Click the Create button on the right.
The system will pre-fill basic configuration data, including the cluster, namespace, queue, priority, resources, and job arguments. Adjust these arguments and click OK.
The newly created Notebook will initially be in the Pending state, and will change to Running after a moment, with the latest one appearing at the top of the list by default.
Click the ┇ on the right side to perform more actions: update arguments, start/stop, clone Notebook, view workload details, and delete.
Note
If you choose pure CPU resources and find that all GPUs on the node are mounted, you can try adding the following container environment variable to resolve this issue:
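The exact variable is not preserved on this page; a commonly used workaround with the NVIDIA container runtime is to explicitly hide GPU devices from the container, for example the sketch below (treat this as an assumption and confirm against your platform's documentation):

```yaml
env:
  - name: NVIDIA_VISIBLE_DEVICES   # honored by the NVIDIA container runtime
    value: "none"                  # expose no GPU devices to this container
```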
If you find a Notebook to be redundant, expired, or no longer needed for any other reason, you can delete it from the Notebook list.
Click the ┇ on the right side of the Notebook in the Notebook list, then choose Delete from the dropdown menu.
In the pop-up window, confirm the Notebook you want to delete, enter the Notebook name, and then click Delete.
A confirmation message will appear indicating successful deletion, and the Notebook will disappear from the list.
Caution
Once a Notebook is deleted, it cannot be recovered, so please proceed with caution.
"},{"location":"en/admin/baize/developer/notebooks/notebook-auto-close.html","title":"Automatic Shutdown of Idle Notebooks","text":"
To optimize resource usage, the smart computing system automatically shuts down idle notebooks after a period of inactivity. This helps free up resources when a notebook is not in use.
Advantages: This feature significantly reduces resource waste from long periods of inactivity, enhancing overall efficiency.
Disadvantages: Without proper backup strategies in place, this may lead to potential data loss.
Note
This feature is enabled by default at the cluster level, with a default timeout of 30 minutes.
Currently, configuration changes must be made manually, but more convenient options will be available in the future.
To modify the deployment parameters of baize-agent in your worker cluster, update the Helm App.
"},{"location":"en/admin/baize/developer/notebooks/notebook-auto-close.html#modify-on-ui","title":"Modify on UI","text":"
In the clusters page, locate your worker cluster, go to its details, select Helm Apps, and find baize-agent under the baize-system namespace, Click Update on the upper right corner.
"},{"location":"en/admin/baize/developer/notebooks/notebook-auto-close.html#modify-on-cli","title":"Modify on CLI","text":"
In the console, use the helm upgrade command to change the configuration:
# Set version number
export VERSION=0.8.0

# Update the Helm chart:
#   notebook-controller.culling_enabled=true      enable automatic shutdown (default: true)
#   notebook-controller.cull_idle_time=120        idle timeout in minutes (default: 30)
#   notebook-controller.idleness_check_period=1   check interval in minutes (default: 1)
helm upgrade --install baize-agent baize/baize-agent \
    --namespace baize-system \
    --create-namespace \
    --set global.imageRegistry=release.daocloud.io \
    --set notebook-controller.culling_enabled=true \
    --set notebook-controller.cull_idle_time=120 \
    --set notebook-controller.idleness_check_period=1 \
    --version=$VERSION
Note
To prevent data loss after an automatic shutdown, upgrade to v0.8.0 or higher and enable the auto-save feature in your notebook configuration.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html","title":"Use Environments in Notebooks","text":"
Environment management is one of the key features of AI Lab. By associating an environment in a Notebook , you can quickly switch between different environments, making it easier for them to develop and debug.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#select-an-environment-when-creating-a-notebook","title":"Select an Environment When Creating a Notebook","text":"
When creating a Notebook, you can select one or more environments. If there isn\u2019t a suitable environment, you can create a new one in Environments .
For instructions on how to create an environment, refer to Environments.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#use-environments-in-notebooks_1","title":"Use Environments in Notebooks","text":"
Note
In the Notebook, both conda and mamba are provided as environment management tools. You can choose the appropriate tool based on your needs.
In AI Lab, you can use the conda environment management tool. You can view the list of current environments in the Notebook by using the command !conda env list.
This command lists all conda environments and adds an asterisk (*) before the currently activated environment.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#manage-kernel-environment-in-jupyterlab","title":"Manage Kernel Environment in JupyterLab","text":"
In JupyterLab, the environments associated with the Notebook are automatically bounded to the Kernel list, allowing you to quickly switch environments through the Kernel.
With this method, you can write and debug algorithms using multiple environments within a single Notebook.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#switch-environments-in-a-terminal","title":"Switch Environments in a Terminal","text":"
The Notebook for AI Lab now also supports VSCode.
If you prefer managing and switching environments in the Terminal, you can follow these steps:
Upon first starting and using the Notebook, you need to execute conda init, and then run conda activate <env_name> to switch to the proper environment.
(base) jovyan@chuanjia-jupyter-0:~/yolov8$ conda init bash   # Initialize bash environment, only needed for the first use
no change     /opt/conda/condabin/conda
 change     /opt/conda/bin/conda
 change     /opt/conda/bin/conda-env
 change     /opt/conda/bin/activate
 change     /opt/conda/bin/deactivate
 change     /opt/conda/etc/profile.d/conda.sh
 change     /opt/conda/etc/fish/conf.d/conda.fish
 change     /opt/conda/shell/condabin/Conda.psm1
 change     /opt/conda/shell/condabin/conda-hook.ps1
 change     /opt/conda/lib/python3.11/site-packages/xontrib/conda.xsh
 change     /opt/conda/etc/profile.d/conda.csh
 change     /home/jovyan/.bashrc
 action taken.
Added mamba to /home/jovyan/.bashrc

==> For changes to take effect, close and re-open your current shell. <==

(base) jovyan@chuanjia-jupyter-0:~/yolov8$ source ~/.bashrc            # Reload bash environment
(base) jovyan@chuanjia-jupyter-0:~/yolov8$ conda activate python-3.10  # Switch to python-3.10 environment
(python-3.10) jovyan@chuanjia-jupyter-0:~/yolov8$ conda env list

      mamba version : 1.5.1
# conda environments:
#
dkj-python312-pure       /opt/baize-runtime-env/dkj-python312-pure/conda/envs/dkj-python312-pure
python-3.10           *  /opt/baize-runtime-env/python-3.10/conda/envs/python-3.10   # Currently activated environment
torch-smaple             /opt/baize-runtime-env/torch-smaple/conda/envs/torch-smaple
base                     /opt/conda
baize-base               /opt/conda/envs/baize-base
If you prefer to use mamba, you will need to use mamba init and mamba activate <env_name>.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#view-packages-in-environment","title":"View Packages in Environment","text":"
One important feature of different environment management is the ability to use different packages by quickly switching environments within a Notebook.
You can use the command below to view all packages in the current environment using conda.
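For example, run the following in a Notebook cell:

```bash
!conda list   # list all packages installed in the currently activated environment
```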
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#update-packages-in-environment","title":"Update Packages in Environment","text":"
Currently, you can update the packages in the environment through the Environment Management UI in AI Lab.
Notebooks provided by AI Lab support remote access via SSH.
With simple configuration, you can use SSH to access the Jupyter Notebook. Whether you are using Windows, Mac, or Linux operating systems, you can follow the steps below.
First, you need to generate an SSH public and private key pair on your computer. This key pair will be used for the authentication process to ensure secure access.
Mac/Linux:
Open the terminal.
Enter the command:
ssh-keygen -t rsa -b 4096
When prompted with "Enter a file in which to save the key," you can press Enter to use the default path or specify a new path.
Next, you will be prompted to enter a passphrase (optional), which adds an extra layer of security. If you choose to enter a passphrase, remember it as you will need it each time you use the key.
Windows:
Install Git Bash (if you haven't already).
Open Git Bash.
Enter the command:
ssh-keygen -t rsa -b 4096
Follow the same steps as Mac/Linux.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-ssh.html#add-ssh-public-key-to-personal-center-optional","title":"Add SSH Public Key to Personal Center (Optional)","text":"
Open the generated public key file, usually located at ~/.ssh/id_rsa.pub (if you did not change the default path).
Copy the public key content.
Log in to the system's personal center.
Look for the SSH public key configuration area and paste the copied public key into the designated location.
Save the changes.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-ssh.html#enable-ssh-in-notebook","title":"Enable SSH in Notebook","text":"
Log in to the Jupyter Notebook web interface.
Find the Notebook for which you want to enable SSH.
In the Notebook's settings or details page, find the option Enable SSH and enable it.
Record or copy the displayed SSH access command. This command will be used in subsequent steps for SSH connection.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-ssh.html#ssh-in-different-environments","title":"SSH in Different Environments","text":""},{"location":"en/admin/baize/developer/notebooks/notebook-with-ssh.html#example","title":"Example","text":"
Assume the SSH command you obtained is as follows:
ssh username@mockhost -p 2222
Replace username with your username, mockhost with the actual hostname, and 2222 with the actual port number.
If prompted to accept the host's identity, type yes.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-ssh.html#remote-development-with-ide","title":"Remote Development with IDE","text":"
In addition to using command line tools for SSH connection, you can also utilize modern IDEs such as Visual Studio Code (VSCode) and PyCharm's SSH remote connection feature to develop locally while utilizing remote server resources.
Using SSH in VSCode
VSCode supports SSH remote connection through the Remote - SSH extension, allowing you to edit files on the remote server directly in the local VSCode environment and run commands.
Steps:
Ensure you have installed VSCode and the Remote - SSH extension.
Open VSCode and click the remote resource manager icon at the bottom of the left activity bar.
Select Remote-SSH: Connect to Host... and then click + Add New SSH Host...
Enter the SSH connection command, for example:
ssh username@mockhost -p 2222
Press Enter. Replace username, mockhost, and 2222 with your actual username, hostname, and port number.
Select a configuration file to save this SSH host, usually the default is fine.
After completing, your SSH host will be added to the SSH target list. Click your host to connect. If it's your first connection, you may be prompted to verify the host's fingerprint. After accepting, you will be asked to enter the passphrase (if the SSH key has a passphrase). Once connected successfully, you can edit remote files in VSCode and utilize remote resources just as if you were developing locally.
Using SSH in PyCharm
PyCharm Professional Edition supports connecting to remote servers via SSH and directly developing in the local PyCharm.
Steps:
Open PyCharm and open or create a project.
Select File -> Settings (on Mac, it's PyCharm -> Preferences).
In the settings window, navigate to Project: YourProjectName -> Python Interpreter.
Click the gear icon in the upper right corner and select Add...
In the pop-up window, select SSH Interpreter.
Enter the remote host information: hostname (mockhost), port number (2222), username (username). Replace these placeholders with your actual information.
Click Next. PyCharm will attempt to connect to the remote server. If the connection is successful, you will be asked to enter the passphrase or select the private key file.
Once configured, click Finish. Now, your PyCharm will use the Python interpreter on the remote server.
Within the same Workspace, any user can log in to a Notebook with SSH enabled using their own SSH credentials. This means that as long as users have configured their SSH public key in the personal center and the Notebook has enabled SSH, they can use SSH for a secure connection.
Note that permissions for different users may vary depending on the Workspace configuration. Ensure you understand and comply with your organization's security and access policies.
By following the above steps, you should be able to successfully configure and use SSH to access the Jupyter Notebook. If you encounter any issues, refer to the system help documentation or contact the system administrator.
"},{"location":"en/admin/baize/developer/notebooks/start-pause.html","title":"Start and Stop Notebook","text":"
After a Notebook is successfully created, it typically has several states:
Pending
Running
Stopped
If a Notebook is in the Stopped state, click the ┇ on the right side in the list, then choose Start from the dropdown menu.
This Notebook will move into the running queue, and its status will change to Pending. If everything is normal, its status will change to Running after a moment.
If you have finished using the Notebook, you can choose Stop from the menu to change its status to Stopped.
GPU Allocated: Statistics on the GPU allocation status of all unfinished tasks in the current cluster, calculating the ratio between requested GPUs (Request) and total resources (Total).
GPU Utilization: Statistics on the actual resource utilization of all running tasks in the current cluster, calculating the ratio between the GPUs actually used (Usage) and the total resources (Total).
The platform automatically consolidates GPU resource information across the entire platform, provides a detailed display of GPU device information, and lets you view workload statistics and task execution information for each GPU.
After entering Operator, click Resource Management -> GPU Management in the left navigation bar to view GPU and task information.
In the Operator mode, queues can be used to schedule and optimize batch job workloads, effectively managing multiple tasks running on a cluster and optimizing resource utilization through a queue system.
Click Queue Management in the left navigation bar, then click the Create button on the right.
The system will pre-fill basic setup data, including the cluster to deploy to, workspace, and queuing policy. Click OK after adjusting these parameters.
A confirmation message will appear upon creation, returning you to the queue management list. Click the ┇ on the right side of the list to perform additional operations such as update or delete.
This document compiles errors that may arise from environmental issues or improper operations when using AI Lab, and provides analysis and solutions for some errors encountered during use.
Warning
This documentation is only applicable to the AI platform. If you encounter issues while using AI Lab, please refer to this troubleshooting guide first.
In AI platform, the module name for AI Lab is baize, which offers one-stop solutions for model training, inference, model management, and more.
"},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html","title":"Cluster Not Found in Drop-Down List","text":""},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html#symptom","title":"Symptom","text":"
In the AI Lab Developer and Operator UI, the desired cluster cannot be found in the drop-down list while you search for a cluster.
If the desired cluster is missing from the cluster drop-down list in AI Lab, it could be due to the following reasons:
The baize-agent is not installed or failed to install, causing AI Lab to be unable to retrieve cluster information.
The cluster name was not configured when installing baize-agent, causing AI Lab to be unable to retrieve cluster information.
Observable components within the worker cluster are abnormal, leading to the inability to collect metrics information from the cluster.
"},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html#solution","title":"Solution","text":""},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html#baize-agent-not-installed-or-failed-to-install","title":"baize-agent not installed or failed to install","text":"
AI Lab requires some basic components to be installed in each worker cluster. If the baize-agent is not installed in the worker cluster, you can choose to install it via UI, which might lead to some unexpected errors.
Therefore, to ensure a good user experience, the selectable cluster range only includes clusters where the baize-agent has been successfully installed.
If the issue is due to the baize-agent not being installed or installation failure, use the following steps:
Container Management -> Clusters -> Helm Apps -> Helm Charts , find baize-agent and install it.
Note
Quickly jump to this address: https://<host>/kpanda/clusters/<cluster_name>/helm/charts/addon/baize-agent. Be sure to replace <host> with the actual console address and <cluster_name> with the actual cluster name.
Cluster name not configured in the process of installing baize-agent
When installing baize-agent, make sure to configure the cluster name. This name will be used for Insight metrics collection; it is empty by default and requires manual configuration.
"},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html#insight-components-in-the-worker-cluster-are-abnormal","title":"Insight components in the worker cluster are abnormal","text":"
If the Insight components in the cluster are abnormal, it might cause AI Lab to be unable to retrieve cluster information. Check if the platform's Insight services are running and configured correctly.
Check if the insight-server component is running properly in the Global Service Cluster.
Check if the insight-agent component is running properly in the worker cluster.
When creating a Notebook, training task, or inference service, if the queue is being used for the first time in that namespace, there will be a prompt to initialize the queue with one click. However, the initialization fails.
In the AI Lab environment, the queue management capability is provided by Kueue. Kueue provides two types of queue management resources:
ClusterQueue: A cluster-level queue mainly used to manage resource quotas within the queue, including CPU, memory, and GPU.
LocalQueue: A namespace-level queue that needs to point to a ClusterQueue for resource allocation within the queue.
In the AI Lab environment, if a service is created and the specified namespace does not have a LocalQueue, there will be a prompt to initialize the queue.
In rare cases, the LocalQueue initialization might fail due to special reasons.
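For reference, a LocalQueue is a namespace-scoped Kueue object that simply points at an existing ClusterQueue; a minimal manifest looks like the sketch below, where the namespace, queue name, and ClusterQueue name are all illustrative:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: demo-ns               # namespace where Notebooks/jobs will run (illustrative)
  name: default                    # LocalQueue name (illustrative)
spec:
  clusterQueue: demo-cluster-queue # must reference an existing ClusterQueue (illustrative)
```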
"},{"location":"en/admin/baize/troubleshoot/notebook-not-controlled-by-quotas.html","title":"Notebook Not Controlled by Queue Quota","text":"
In the AI Lab module, when users create a Notebook, they find that even if the selected queue lacks resources, the Notebook can still be created successfully.
The queue management capability in AI Lab is provided by Kueue, and the Notebook service is provided through JupyterHub. JupyterHub has high requirements for the Kubernetes version. For versions below v1.27, even if queue quotas are set in AI platform, and users select the quota when creating a Notebook, the Notebook will not actually be restricted by the queue quota.
Solution: Plan in advance. It is recommended to use Kubernetes version v1.27 or above in the production environment.
Reference: Jupyter Notebook Documentation
"},{"location":"en/admin/baize/troubleshoot/notebook-not-controlled-by-quotas.html#issue-02-configuration-not-enabled","title":"Issue 02: Configuration Not Enabled","text":"
Analysis:
When the Kubernetes cluster version is greater than v1.27, the Notebook still cannot be restricted by the queue quota.
This is because Kueue's enablePlainPod support must be enabled for queue quotas to take effect for the Notebook service.
Solution: When deploying baize-agent in the worker cluster, enable Kueue support for enablePlainPod.
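The exact chart value is not documented on this page; as a sketch, it would be passed to the baize-agent Helm release along the lines of the following, where the kueue.enablePlainPod key is hypothetical, so check the chart's values.yaml for the real path:

```bash
# Hypothetical values key -- verify against the baize-agent chart before using
helm upgrade --install baize-agent baize/baize-agent \
    --namespace baize-system \
    --set kueue.enablePlainPod=true
```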
If you forget your password, you can reset it by following the instructions on this page.
"},{"location":"en/admin/ghippo/password.html#steps-to-reset-password","title":"Steps to Reset Password","text":"
When an administrator initially creates a user, it sets a username and password for him. After the user logs in, fill in the email address and change the password in Personal Center . If the user has not set an email address, he can only contact the administrator to reset the password.
If you forget your password, you can click Forgot your password? on the login interface.
Enter your login email and click Submit .
Find the password reset email in the mailbox, and click the link in your email. The link is effective for 5 minutes.
Install applications that support 2FA dynamic password generation (such as Google Authenticator) on mobile phone or other devices. Set up a dynamic password to activate your account, and click Submit .
Set a new password and click Submit . The requirements for setting a new password are consistent with the password rules when creating an account.
The password is successfully reset, and you enter the home page directly.
The flow of the password reset process is as follows.
graph TB

pass[Forgot password] --> usern[Enter username]
--> button[Click button to send an email] --> judge1[Check whether the username is correct]

    judge1 -.Correct.-> judge2[Check whether an email is bound]
    judge1 -.Wrong.-> tip1[Error of incorrect username]

    judge2 -.Email bound.-> send[Send a reset email]
    judge2 -.No email bound.-> tip2[No email bound<br>Contact admin to reset password]

send --> click[Click the email link] --> config[Configure dynamic password] --> reset[Reset password]
--> success[Successfully reset]

classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:1px,color:#326ce5;

class pass,usern,button,tip1,send,tip2,click,config,reset,success plain;
class judge1,judge2 k8s
AI platform supports the creation of three scopes of custom roles:
The permissions of Platform Role take effect on all relevant resources of the platform
The permissions of workspace role take effect on the resources under the workspace where the user is located
The permissions of folder role take effect on the folder where the user is located and the subfolders and workspace resources under it
"},{"location":"en/admin/ghippo/access-control/custom-role.html#create-a-platform-role","title":"Create a platform role","text":"
A platform role refers to a role that can manipulate features related to a certain module of AI platform (such as container management, microservice engine, Multicloud Management, service mesh, Container registry, Workbench, and global management).
From the left navigation bar, click Global Management -> Access Control -> Roles , and click Create Custom Role .
Enter the name and description, select Platform Role , check the role permissions and click OK .
Return to the role list, search for the custom role you just created, and click ┇ on the right to perform operations such as copying, editing, and deleting.
After the platform role is successfully created, you can go to User/group to add users and groups for this role.
"},{"location":"en/admin/ghippo/access-control/custom-role.html#create-a-workspace-role","title":"Create a workspace role","text":"
A workspace role refers to a role that can manipulate features related to a module (such as container management, microservice engine, Multicloud Management, service mesh, container registry, Workbench, and global management) according to the workspace.
From the left navigation bar, click Global Management -> Access Control -> Roles , and click Create Custom Role .
Enter the name and description, select Workspace role , check the role permissions and click OK .
Return to the role list, search for the custom role you just created, and click ┇ on the right to perform operations such as copying, editing, and deleting.
After the workspace role is successfully created, you can go to Workspace to authorize and set which workspaces this role can manage.
A folder role refers to a role that can operate on features of an AI platform module (such as container management, microservice engine, Multicloud Management, service mesh, container registry, Workbench, and global management) within the scope of a folder and its subfolders.
From the left navigation bar, click Global Management -> Access Control -> Roles , and click Create Custom Role .
Enter the name and description, select Folder Role , check the role permissions and click OK .
Return to the role list, search for the custom role you just created, and click ┇ on the right to perform operations such as copying, editing, and deleting.
After the folder role is successfully created, you can go to Folder to authorize and set which folders this role can manage.
When two or more platforms need to integrate or embed with each other, user system integration is usually required. During the process of user system integration, the Docking Portal mainly provides SSO (Single Sign-On) capability. If you want to integrate AI platform as a user source into a client platform, you can achieve it by docking a product through Docking Portal .
"},{"location":"en/admin/ghippo/access-control/docking.html#docking-a-product","title":"Docking a product","text":"
Prerequisite: Administrator privileges for the platform or IAM Owner privileges for access control.
Log in with an admin, navigate to Access Control , select Docking Portal , enter the Docking Portal list, and click Create SSO Profile in the upper right corner.
On the Create SSO Profile page, fill in the Client ID.
After successfully creating the SSO access, in the Docking Portal list, click the just created Client ID to enter the details, copy the Client ID, Secret Key, and Single Sign-On URL information, and fill them in the client platform to complete the user system integration.
AI platform provides predefined system roles to help users simplify the process of role permission usage.
Note
AI platform provides three types of system roles: platform role, workspace role, and folder role.
Platform role: has proper permissions for all related resources on the platform. Please go to user/group page for authorization.
Workspace role: has proper permissions for a specific workspace. Please go to the specific workspace page for authorization.
Folder role: has proper permissions for a specific folder, subfolder, and resources under its workspace. Please go to the specific folder page for authorization.
Five system roles are predefined in Access Control: Admin, IAM Owner, Audit Owner, Kpanda Owner, and Workspace and Folder Owner. These five roles are created by the system and cannot be modified by users. The proper permissions of each role are as follows:
| Role Name | Role Type | Module | Role Permissions |
| --- | --- | --- | --- |
| Admin | System role | All | Platform administrator, manages all platform resources, represents the highest authority of the platform. |
| IAM Owner | System role | Access Control | Administrator of Access Control, has all permissions under this service, such as managing users/groups and authorization. |
| Audit Owner | System role | Audit Log | Administrator of Audit Log, has all permissions under this service, such as setting audit log policies and exporting audit logs. |
| Kpanda Owner | System role | Container Management | Administrator of Container Management, has all permissions under this service, such as creating/accessing clusters, deploying applications, granting cluster/namespace-related permissions to users/groups. |
| Workspace and Folder Owner | System role | Workspace and Folder | Administrator of Workspace and Folder, has all permissions under this service, such as creating folders/workspaces, authorizing folder/workspace-related permissions to users/groups, using features such as Workbench and microservice engine under the workspace. |

Workspace Roles
Three system roles are predefined in Access Control: Workspace Admin, Workspace Editor, and Workspace Viewer. These three roles are created by the system and cannot be modified by users. The proper permissions of each role are as follows:
| Role Name | Role Type | Module | Role Permissions |
| --- | --- | --- | --- |
| Workspace Admin | System role | Workspace | Administrator of a workspace, with management permission of the workspace. |
| Workspace Editor | System role | Workspace | Editor of a workspace, with editing permission of the workspace. |
| Workspace Viewer | System role | Workspace | Viewer of a workspace, with readonly permission of the workspace. |

Folder Roles
Three system roles are predefined in Access Control: Folder Admin, Folder Editor, and Folder Viewer. These three roles are created by the system and cannot be modified by users. The proper permissions of each role are as follows:
| Role Name | Role Type | Module | Role Permissions |
| --- | --- | --- | --- |
| Folder Admin | System role | Workspace | Administrator of a folder and its subfolders/workspaces, with management permission. |
| Folder Editor | System role | Workspace | Editor of a folder and its subfolders/workspaces, with editing permission. |
| Folder Viewer | System role | Workspace | Viewer of a folder and its subfolders/workspaces, with readonly permission. |

Group
A group is a collection of users. By joining a group, a user can inherit the role permissions of the group. Authorize users in batches through groups to better manage users and their permissions.
Enter Access Control, select Groups to enter the group list, and click Create a group in the upper right corner.
Fill in the group information on the Create group page.
Click OK , the group is created successfully, and you will return to the group list page. The first line in the list is the newly created group.
"},{"location":"en/admin/ghippo/access-control/group.html#add-permissions-to-a-group","title":"Add permissions to a group","text":"
Prerequisite: The group already exists.
Enter Access Control, select Groups to enter the group list, and click ┇ -> Add permissions.
On the Add permissions page, check the required role permissions (multiple choices are allowed).
Click OK to add permissions to the group. Automatically return to the group list, click a group to view the permissions granted to the group.
"},{"location":"en/admin/ghippo/access-control/group.html#add-users-to-a-group","title":"Add users to a group","text":"
Enter Access Control, select Groups to display the group list, and on the right side of a group, click ┇ -> Add Members.
On the Add Group Members page, click the user to be added (multiple choices are allowed). If there is no user available, click Create a new user , first go to create a user, and then return to this page and click the refresh icon to display the newly created user.
Click OK to finish adding users to the group.
Note
Users in the group will inherit the permissions of the group; users who join the group can be viewed in the group details.
Note: Deleting a group will not delete the users in it, but those users will no longer be able to inherit the permissions of the group.
The administrator enters Access Control, selects Groups to enter the group list, and on the right side of a group, clicks ┇ -> Delete.
Click Delete to delete the group.
Return to the group list, and the screen will prompt that the deletion is successful.
Note
Deleting a group will not delete the users in the group, but the users in the group will no longer be able to inherit the permissions from the group.
"},{"location":"en/admin/ghippo/access-control/iam.html","title":"What is IAM","text":"
IAM (Identity and Access Management) is an important module of global management. You can create, manage and destroy users (groups) through the access control module, and use system roles and custom roles to control other users Access to the AI platform.
Structures and roles within an enterprise can be complex, with the management of projects, work groups, and mandates constantly changing. Access control uses a clear and tidy page to open up the authorization relationship between users, groups, and roles, and realize the authorization of users (groups) with the shortest link.
Appropriate role
Access control predefines an administrator role for each submodule, so without any extra user maintenance you can directly grant the platform's predefined system roles to users and manage the platform module by module. For fine-grained permissions, please refer to Permission Management.
Enterprise-grade access control
When you want your company's employees to log in to the AI platform using the company's internal authentication system, without creating corresponding users on the AI platform, you can use the identity provider feature of access control to establish a trust relationship between your company and Suanova. Through federated authentication, employees can then log in to the AI platform directly with their existing enterprise accounts, achieving single sign-on.
Global Management supports single sign-on based on the LDAP and OIDC protocols. If your enterprise or organization has its own account system and you want its members to use AI platform resources, you can use the identity provider feature provided by Global Management instead of creating a username/password for every organization member in the AI platform. You can grant these external user identities permission to use AI platform resources.
Identity Provider (IdP)
Responsible for collecting and storing user identity information, usernames, and passwords, and for authenticating users when they log in. In the identity authentication process between an enterprise and the AI platform, the identity provider refers to the enterprise's own identity provider.
Service Provider (SP)
The service provider establishes a trust relationship with the identity provider IdP, and uses the user information provided by the IDP to provide users with specific services. In the process of enterprise authentication with AI platform, the service provider refers to AI platform.
LDAP
LDAP stands for Lightweight Directory Access Protocol and is often used for single sign-on, that is, a user can log in to multiple services with one account and password. Global Management supports LDAP for identity authentication, so an enterprise IdP that establishes identity authentication with the AI platform through the LDAP protocol must support LDAP. For a detailed description of LDAP, please refer to: Welcome to LDAP.
OIDC
OIDC, short for OpenID Connect, is an identity authentication standard protocol based on the OAuth 2.0 protocol. Global management supports the OIDC protocol for identity authentication, so the enterprise IdP that establishes identity authentication with AI platform through the OIDC protocol must support the OIDC protocol. For a detailed description of OIDC, please refer to: Welcome to OpenID Connect.
OAuth 2.0
OAuth 2.0 is short for Open Authorization 2.0, an open authorization protocol whose framework allows third-party applications to obtain access permissions in their own name.
Administrators do not need to recreate AI platform users
Before using an identity provider for authentication, the administrator needs to create an account for each user in both the enterprise management system and the AI platform. After switching to identity provider authentication, the enterprise administrator only needs to create the account in the enterprise management system, and the user can access both systems, reducing personnel management costs.
Users do not need to remember two sets of platform accounts
Before using an identity provider for authentication, users had to log in with separate accounts to access the enterprise management system and the AI platform. Afterwards, users can access both systems by logging in with their enterprise account alone.
The full name of LDAP is Lightweight Directory Access Protocol, which is an open and neutral industry-standard application protocol that provides access control and maintains directories for distributed information through the IP protocol.
If your enterprise or organization has its own account system, and your enterprise user management system supports the LDAP protocol, you can use the identity provider feature based on the LDAP protocol provided by the Global Management instead of creating usernames/passwords for each member in AI platform. You can grant permissions to use AI platform resources to these external user identities.
In Global Management, the operation steps are as follows:
Log in to AI platform as a user with admin role. Click Global Management -> Access Control in the lower left corner of the left navigation bar.
Click Identity Provider on the left nav bar, click Create an Identity Provider button.
In the LDAP tab, fill in the following fields and click Save to establish a trust relationship with the identity provider and a user mapping relationship.
| Field | Description |
| --- | --- |
| Vendor | Supports LDAP (Lightweight Directory Access Protocol) and AD (Active Directory) |
| Identity Provider Name (UI display name) | Used to distinguish different identity providers |
| Connection URL | The address and port number of the LDAP service, e.g., ldap://10.6.165.2:30061 |
| Bind DN | The DN of the LDAP administrator, which Keycloak will use to access the LDAP server |
| Bind credentials | The password of the LDAP administrator. This field can retrieve its value from a vault using the ${vault.ID} format. |
| Users DN | The full DN of the LDAP tree where your users are located. This DN is the parent of the LDAP users. For example, if the DN of a typical user is similar to "uid='john',ou=users,dc=example,dc=com", it can be "ou=users,dc=example,dc=com". |
| User Object Classes | All values of the LDAP objectClass attribute for users in LDAP, separated by commas. For example: "inetOrgPerson,organizationalPerson". New Keycloak users will be written to LDAP with all of these object classes, and existing LDAP user records will be found if they contain all of these object classes. |
| Enable StartTLS | Encrypts the connection between AI platform and LDAP when enabled |
| Default Permission | Users/groups have no permissions by default after synchronization |
| Full name mapping | Corresponding First name and Last Name |
| User Name Mapping | The unique username for the user |
| Mailbox Mapping | User email |

Advanced Config

| Field | Description |
| --- | --- |
| Enable or not | Enabled by default. When disabled, this LDAP configuration will not take effect. |
| Periodic full sync | Disabled by default. When enabled, a sync period can be configured, such as syncing once every hour. |
| Edit mode | Read-only mode will not modify the source data in LDAP. Write mode will sync data back to LDAP after user information is edited on the platform. |
| Read timeout | Adjusting this value can effectively avoid interface timeouts when the amount of LDAP data is large. |
| User LDAP filter | An additional LDAP filter used to filter the search for users. Leave it empty if no additional filter is needed. Ensure it starts with "(" and ends with ")". |
| Username LDAP attribute | The name of the LDAP attribute mapped to the Keycloak username. For many LDAP server vendors, it can be "uid". For Active Directory, it can be "sAMAccountName" or "cn". This attribute should be filled in for all LDAP user records you want to import into Keycloak. |
| RDN LDAP attribute | The name of the LDAP attribute that serves as the RDN (top-level attribute) of the typical user DN. It is usually the same as the Username LDAP attribute, but this is not required. For example, for Active Directory, when the username attribute might be "sAMAccountName", "cn" is often used as the RDN attribute. |
| UUID LDAP attribute | The name of the LDAP attribute used as the unique object identifier (UUID) for objects in LDAP. For many LDAP server vendors, it is "entryUUID". However, some may differ. For example, for Active Directory, it should be "objectGUID". If your LDAP server does not support the UUID concept, you can use any other attribute that should be unique among LDAP users in the tree, such as "uid" or "entryDN". |
On the Sync Groups tab, fill in the following fields to configure the mapping relationship of groups, and click Save again.
| Field | Description | Example |
| --- | --- | --- |
| base DN | Location of the group in the LDAP tree | ou=groups,dc=example,dc=org |
| Usergroup Object Filter | Object classes for usergroups, separated by commas if more classes are required. In a typical LDAP deployment, usually "groupOfNames"; the system has filled it in automatically, and you can edit it if needed. * means all. | * |
| group name | cn | Unchangeable |
Note
After you have established a trust relationship between the enterprise user management system and AI platform through the LDAP protocol, you can synchronize the users or groups in the enterprise user management system to AI platform at one time through auto/manual synchronization.
After synchronization, the administrator can authorize users/groups in batches, and users can log in to AI platform through the account/password in the enterprise user management system.
See the LDAP Operations Demo Video for a hands-on tutorial.
If all members in your enterprise or organization are managed in WeCom, you can use the identity provider feature based on the OAuth 2.0 protocol provided by Global Management, without the need to create a username/password for each organization member in AI platform. You can grant these external user identities permission to use AI platform resources.
Log in to AI platform with a user who has the admin role. Click Global Management -> Access Control at the bottom of the left navigation bar.
Select Identity Providers on the left navigation bar, and click the OAuth 2.0 tab. Fill in the form fields and establish a trust relationship with WeCom, then click Save.
"},{"location":"en/admin/ghippo/access-control/oauth2.0.html#proper-fields-in-wecom","title":"proper fields in WeCom","text":"
Note
Before integration, you need to create a custom application in the WeCom management console. Refer to How to create a custom application link
| Field | Description |
| --- | --- |
| Corp ID | ID of WeCom |
| Agent ID | ID of the custom application |
| ClientSecret | Secret of the custom application |
WeCom ID:
Agent ID and ClientSecret:
"},{"location":"en/admin/ghippo/access-control/oidc.html","title":"Create and Manage OIDC","text":"
OIDC (OpenID Connect) is an identity layer based on OAuth 2.0 and an identity authentication standard protocol based on the OAuth2 protocol.
If your enterprise or organization already has its own account system, and your enterprise user management system supports the OIDC protocol, you can use the OIDC protocol-based identity provider feature provided by the Global Management instead of creating usernames/passwords for each member in AI platform. You can grant permissions to use AI platform resources to these external user identities.
The specific operation steps are as follows.
Log in to AI platform as a user with admin role. Click Global Management -> Access Control at the bottom of the left navigation bar.
On the left nav bar select Identity Provider , click OIDC -> Create an Identity Provider
After completing the form fields and establishing a trust relationship with the identity provider, click Save .
| Field | Description |
| --- | --- |
| Provider Name | Displayed on the login page and is the entry point for the identity provider |
| Authentication Method | Client authentication method. If the JWT is signed with a private key, select JWT signed with private key from the dropdown. For details, refer to Client Authentication. |
| Client ID | Client ID |
| Client Secret | Client Secret |
| Client URL | One-click access to the login URL, Token URL, user information URL, and logout URL through the identity provider's well-known interface |
| Auto-associate | When enabled, if the identity provider username/email duplicates an existing AI platform username/email, the two will be automatically associated |
Note
After the user completes the first login to AI platform through the enterprise user management system, the user information will be synchronized to Access Control -> User List of AI platform.
Users who log in for the first time will not be given any default permissions and need to be authorized by an administrator (the administrator can be a platform administrator, submodule administrator or resource administrator).
For practical tutorials, please refer to OIDC Operation Video Tutorials, or refer to Azure OpenID Connect (OIDC) Access Process.
The interactive process of user authentication is as follows:
Use a browser to initiate a single sign-on request for AI platform.
According to the information carried in the login link, AI platform searches for the proper configuration information in Global Management -> Access Control -> Identity Provider , constructs an OIDC authorization Request, and sends it to the browser.
After the browser receives the request, it forwards the OIDC authorization Request to the enterprise IdP.
Enter the username and password on the login page of the enterprise IdP. The enterprise IdP verifies the provided identity information, constructs an ID token carrying user information, and sends an OIDC authorization response to the browser.
After the browser responds, it forwards the OIDC authorization Response to AI platform.
AI platform takes the ID Token from the OIDC Authorization Response, maps it to a specific user list according to the configured identity conversion rules, and issues the Token.
Complete single sign-on to access AI platform.
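For reference, the OIDC authorization Request constructed in step 2 follows the standard OAuth 2.0 authorization code flow. The example below is only an illustration; the host, client ID, redirect URI, and state value are placeholders rather than the platform's actual values:

```text
GET https://idp.example.com/oauth2/authorize
    ?response_type=code
    &client_id=<Client ID from the identity provider form>
    &redirect_uri=https://<ai-platform-address>/<callback-path>
    &scope=openid profile email
    &state=<random-string>
```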
"},{"location":"en/admin/ghippo/access-control/role.html","title":"Role and Permission Management","text":"
A role corresponds to a set of permissions that determine the actions that can be performed on resources. Granting a user a role means granting all the permissions included in that role.
AI platform provides three levels of roles, which effectively solve your permission-related issues:
Platform roles are coarse-grained permissions that grant proper permissions to all relevant resources on the platform. By assigning platform roles, users can have permissions to create, delete, modify, and view all clusters and workspaces, but not specifically to a particular cluster or workspace. AI platform provides 5 pre-defined platform roles that users can directly use:
Admin
Kpanda Owner
Workspace and Folder Owner
IAM Owner
Audit Owner
Additionally, AI platform supports the creation of custom platform roles with customized content as needed. For example, creating a platform role that includes all functional permissions in the Workbench. Since the Workbench depends on workspaces, the platform will automatically select the \"view\" permission for workspaces by default. Please do not manually deselect it. If User A is granted this Workbench role, they will automatically have all functional permissions related to the Workbench in all workspaces.
"},{"location":"en/admin/ghippo/access-control/role.html#platform-role-authorization-methods","title":"Platform Role Authorization Methods","text":"
There are three ways to authorize platform roles:
In the Global Management -> Access Control -> Users section, find the user in the user list, click ... , select Authorization , and grant platform role permissions to the user.
In the Global Management -> Access Control -> Groups section, create a group in the group list, add the user to the group, and grant authorization to the group (the specific operation is: find the group in the group list, click ... , select Add Permissions , and grant platform roles to the group).
In the Global Management -> Access Control -> Roles section, find the proper platform role in the role list, click the role name to access details, click the Related Members button, select the user or group, and click OK .
Workspace roles are fine-grained roles that grant users management permissions, view permissions, or Workbench-related permissions for a specific workspace. Users with these roles can only manage the assigned workspace and cannot access other workspaces. AI platform provides 3 pre-defined workspace roles that users can directly use:
Workspace Admin
Workspace Editor
Workspace Viewer
Moreover, AI platform supports the creation of custom workspace roles with customized content as needed. For example, creating a workspace role that includes all functional permissions in the Workbench. Since the Workbench depends on workspaces, the platform will automatically select the \"view\" permission for workspaces by default. Please do not manually deselect it. If User A is granted this role in Workspace 01, they will have all functional permissions related to the Workbench in Workspace 01.
Note
Unlike platform roles, workspace roles need to be used within the workspace. Once authorized, users will only have the functional permissions of that role within the assigned workspace.
"},{"location":"en/admin/ghippo/access-control/role.html#workspace-role-authorization-methods","title":"Workspace Role Authorization Methods","text":"
In the Global Management -> Workspace and Folder list, find the workspace, click Authorization , and grant workspace role permissions to the user.
Folder roles have permissions granularity between platform roles and workspace roles. They grant users management permissions and view permissions for a specific folder and its sub-folders, as well as all workspaces within that folder. Folder roles are commonly used in departmental scenarios in enterprises. For example, User B is a leader of a first-level department and usually has management permissions over the first-level department, all second-level departments under it, and projects within those departments. In this scenario, User B is granted admin permissions for the first-level folder, which also grants proper permissions for the second-level folders and workspaces below them. AI platform provides 3 pre-defined folder roles that users can directly use:
Folder Admin
Folder Editor
Folder Viewer
Additionally, AI platform supports the creation of custom folder roles with customized content as needed. For example, creating a folder role that includes all functional permissions in the Workbench. If User A is granted this role in Folder 01, they will have all functional permissions related to the Workbench in all workspaces within Folder 01.
Note
The functionality of modules depends on workspaces, and folders provide further grouping mechanisms with permission inheritance capabilities. Therefore, folder permissions not only include the folder itself but also its sub-folders and workspaces.
"},{"location":"en/admin/ghippo/access-control/role.html#folder-role-authorization-methods","title":"Folder Role Authorization Methods","text":"
In the Global Management -> Workspace and Folder list, find the folder, click Authorization , and grant folder role permissions to the user.
A user refers to a user created by the platform administrator Admin or the access control administrator IAM Owner on the Global Management -> Access Control -> Users page, or a user connected through LDAP / OIDC . The username represents the account, and the user logs in to the Suanova Enterprise platform through the username and password.
Having a user account is a prerequisite for accessing the platform. A newly created user has no permissions by default, so you need to assign proper role permissions, for example granting administrator permissions for submodules in User List or User Details . A submodule administrator has the highest authority within that submodule and can create, manage, and delete all of its resources. If a user needs to be granted permission for a specific resource, such as the permission to use a certain resource, see the Resource Authorization Description.
This page introduces operations such as creating, authorizing, disabling, enabling, and deleting users.
Prerequisite: You have the platform administrator Admin permission or the access control administrator IAM Admin permission.
The administrator enters Access Control , selects Users , enters the user list, and clicks Create User on the upper right.
Fill in the username and login password on the Create User page. To create multiple users at once, click Create User to add them in batches; up to 5 users can be created at a time. Decide whether to require users to reset their password at first login based on your actual situation.
Click OK . The user is created successfully and you are returned to the user list page.
Note
The username and password set here will be used to log in to the platform.
"},{"location":"en/admin/ghippo/access-control/user.html#authorize-for-user","title":"Authorize for User","text":"
Prerequisite: The user already exists.
The administrator enters Access Control , selects Users , enters the user list, and clicks \u2507 -> Authorization .
On the Authorization page, check the required role permissions (multiple choices are allowed).
Click OK to complete the authorization for the user.
Note
In the user list, click a user to enter the user details page.
"},{"location":"en/admin/ghippo/access-control/user.html#add-user-to-group","title":"Add user to group","text":"
The administrator enters Access Control , selects Users , enters the user list, and clicks \u2507 -> Add to Group .
On the Add to Group page, check the groups to be joined (multiple choices are allowed). If there is no optional group, click Create a new group to create a group, and then return to this page and click the Refresh button to display the newly created group.
Click OK to add the user to the group.
Note
The user will inherit the permissions of the group, and you can view the groups that the user has joined in User Details .
Once a user is deactivated, that user will no longer be able to access the Platform. Unlike deleting a user, a disabled user can be enabled again as needed. It is recommended to disable the user before deleting it to ensure that no critical service is using the key created by the user.
The administrator enters Access Control , selects Users , enters the user list, and clicks a username to enter user details.
Click Edit in the upper right corner and turn off the status toggle; when the toggle turns gray and inactive, the user is disabled.
Prerequisite: the user's email address must be set. There are two ways to set it:
On the user details page, the administrator clicks Edit , enters the user's email address in the pop-up box, and clicks OK to complete the email setting.
Users can also enter the Personal Center and set the email address on the Security Settings page.
If the user forgets the password when logging in, please refer to Reset Password.
After deleting a user, the user will no longer be able to access platform resources in any way, so delete with caution. Before deleting a user, make sure your critical programs no longer use keys created by that user. If you are unsure, it is recommended to disable the user before deleting. If you delete a user and then create a new user with the same name, the new user is considered a new, separate identity that does not inherit the deleted user's roles.
The administrator enters Access Control , selects Users , enters the user list, and clicks \u2507 -> Delete .
With AI platform integrated into the client's system, you can create Webhooks to send message notifications when users are created, updated, deleted, logged in, or logged out.
Webhook is a mechanism for implementing real-time event notifications. It allows an application to push data or events to another application without the need for polling or continuous querying. By configuring Webhooks, you can specify that the target application receives and processes notifications when a certain event occurs.
The working principle of Webhook is as follows:
The source application (AI platform) performs a specific operation or event.
The source application packages the relevant data and information into an HTTP request and sends it to the URL specified by the target application (e.g., enterprise WeChat group robot).
The target application receives the request and processes it based on the data and information provided.
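As a hedged illustration of step 2, the notification is an ordinary HTTP POST to the configured target URL. The platform's actual payload schema is not shown in this document; the example below simulates the format expected by a WeCom group robot, whose fields (msgtype, text.content) are defined by WeCom rather than by AI platform:

```shell
# Simulate a user-event notification delivered to a WeCom group robot (illustrative only)
curl -X POST "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<robot-key>" \
  -H "Content-Type: application/json" \
  -d '{"msgtype": "text", "text": {"content": "AI platform event: user user01 was created"}}'
```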
By using Webhooks, you can achieve the following functionalities:
Real-time notification: Notify other applications in a timely manner when a specific event occurs.
Automation: The target application can automatically trigger predefined operations based on the received Webhook requests, eliminating the need for manual intervention.
Data synchronization: Use Webhooks to pass data from one application to another, enabling synchronized updates.
Common use cases include:
Version control systems (e.g., GitHub, GitLab): Automatically trigger build and deployment operations when code repositories change.
E-commerce platforms: Send update notifications to logistics systems when order statuses change.
Chatbot platforms: Push messages to target servers via Webhooks for processing when user messages are received.
Audit logs help you monitor and record the activities of each user, and provide features for collecting, storing and querying security-related records arranged in chronological order. With the audit log service, you can continuously monitor and retain user behaviors in the Global Management module, including but not limited to user creation, user login/logout, user authorization, and user operations related to Kubernetes.
The audit log feature has the following characteristics:
Out of the box: When installing and using the platform, the audit log feature will be enabled by default, automatically recording various user-related actions, such as creating users, authorization, and login/logout. By default, 365 days of user behavior can be viewed within the platform.
Security analysis: The audit log will record user operations in detail and provide an export function. Through these events, you can judge whether the account is at risk.
Real-time recording: Operation events are collected quickly and can be traced in the audit log list right after a user operation, so suspicious behavior can be spotted at any time.
Convenient and reliable: The audit log supports manual cleaning and automatic cleaning, and the cleaning policy can be configured according to your storage size.
On the Settings tab, you can clean up audit logs for user operations and system operations.
You can manually clean up the logs, but it is recommended to export and save them before cleaning. You can also set the maximum retention time for the logs to automatically clean them up.
Note
The audit logs related to Kubernetes in the auditing module are provided by the Insight module. To reduce the storage pressure of the audit logs, Global Management by default does not collect Kubernetes-related logs. If you need to record them, please refer to Enabling K8s Audit Logs. Once enabled, the cleanup function is consistent with the Global Management cleanup function, but they do not affect each other.
"},{"location":"en/admin/ghippo/audit/open-audit.html","title":"Enable/Disable collection of audit logs","text":"
Kubernetes Audit Logs: Kubernetes itself generates audit logs. When this feature is enabled, audit log files for Kubernetes will be created in the specified directory.
Collecting Kubernetes Audit Logs: The log files mentioned above are collected by the Insight Agent. The prerequisites for collecting Kubernetes audit logs are that the cluster has enabled Kubernetes audit logs, allowed the export of audit logs, and turned on audit log collection.
Run the following command to check if audit logs are generated under the /var/log/kubernetes/audit directory. If they exist, it means that Kubernetes audit logs are successfully enabled.
ls /var/log/kubernetes/audit\n
If they are not enabled, please refer to the documentation on enabling/disabling Kubernetes audit logs.
"},{"location":"en/admin/ghippo/audit/open-audit.html#enable-collection-of-kubernetes-audit-logs-process","title":"Enable Collection of Kubernetes Audit Logs Process","text":"
Modify the IP address in this command to the IP address of the Spark node.
Note
If using a self-built Harbor repository, please modify the chart repo URL in the first step to the insight-agent chart URL of the self-built repository.
Save the current Insight Agent helm values.
helm get values insight-agent -n insight-system -o yaml > insight-agent-values-bak.yaml\n
Get the current version number ${insight_version_code}.
insight_version_code=`helm list -n insight-system |grep insight-agent | awk {'print $10'}`\n
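Updating the configuration is done with a helm upgrade against the saved values and version. The command below is only a hedged sketch: the chart reference and the value key that switches audit-log collection on are placeholders and must be checked against your insight-agent chart (or self-built Harbor repo):

```shell
# Sketch only — replace <insight-agent-chart> and <audit-log-collection-switch> with real values
helm upgrade insight-agent <insight-agent-chart> \
  -n insight-system \
  -f insight-agent-values-bak.yaml \
  --set <audit-log-collection-switch>=true \
  --version ${insight_version_code}
```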
Restart all fluentBit pods under the insight-system namespace.
fluent_pod=`kubectl get pod -n insight-system | grep insight-agent-fluent-bit | awk {'print $1'} | xargs`\nkubectl delete pod ${fluent_pod} -n insight-system\n
"},{"location":"en/admin/ghippo/audit/open-audit.html#disable-collection-of-kubernetes-audit-logs","title":"Disable Collection of Kubernetes Audit Logs","text":"
The remaining steps are the same as enabling the collection of Kubernetes audit logs, with only a modification in the previous section's step 4: updating the helm value configuration.
"},{"location":"en/admin/ghippo/audit/open-audit.html#ai-community-online-installation-environment","title":"AI Community Online Installation Environment","text":"
Note
If installing AI Community in a Kind cluster, perform the following steps inside the Kind container.
Run the following command to check if audit logs are generated under the /var/log/kubernetes/audit directory. If they exist, it means that Kubernetes audit logs are successfully enabled.
ls /var/log/kubernetes/audit\n
If they are not enabled, please refer to the documentation on enabling/disabling Kubernetes audit logs.
"},{"location":"en/admin/ghippo/audit/open-audit.html#enable-collection-of-kubernetes-audit-logs-process_1","title":"Enable Collection of Kubernetes Audit Logs Process","text":"
Save the current values.
helm get values insight-agent -n insight-system -o yaml > insight-agent-values-bak.yaml\n
Get the current version number ${insight_version_code} and update the configuration.
insight_version_code=`helm list -n insight-system |grep insight-agent | awk {'print $10'}`\n
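The upgrade command is roughly as follows. This is a hedged sketch: the value key that switches audit-log collection on is a placeholder and should be checked against the insight-agent chart in the insight-release repo:

```shell
# Sketch only — replace <audit-log-collection-switch> with the real value key
helm upgrade insight-agent insight-release/insight-agent \
  -n insight-system \
  -f insight-agent-values-bak.yaml \
  --set <audit-log-collection-switch>=true \
  --version ${insight_version_code}
```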
If the upgrade fails due to an unsupported version, check whether the helm repo used in the command contains that version. If not, update the helm repo and try again.
helm repo update insight-release\n
Restart all fluentBit pods under the insight-system namespace.
fluent_pod=`kubectl get pod -n insight-system | grep insight-agent-fluent-bit | awk {'print $1'} | xargs`\nkubectl delete pod ${fluent_pod} -n insight-system\n
"},{"location":"en/admin/ghippo/audit/open-audit.html#disable-collection-of-kubernetes-audit-logs_1","title":"Disable Collection of Kubernetes Audit Logs","text":"
The remaining steps are the same as enabling the collection of Kubernetes audit logs, with only a modification in the previous section's step 3: updating the helm value configuration.
Audit log collection for each worker cluster is independent and can be turned on as needed.
"},{"location":"en/admin/ghippo/audit/open-audit.html#steps-to-enable-audit-log-collection-when-creating-a-cluster","title":"Steps to Enable Audit Log Collection When Creating a Cluster","text":"
By default, the collection of K8s audit logs is turned off. If you need to enable it, you can follow these steps:
Set the switch to the enabled state to enable the collection of K8s audit logs.
When creating a worker cluster via AI platform, ensure that the K8s audit log option for the cluster is set to 'true' so that the created worker cluster will have audit logs enabled.
After the cluster creation is successful, the K8s audit logs for that worker cluster will be collected.
"},{"location":"en/admin/ghippo/audit/open-audit.html#steps-to-enabledisable-after-accessing-or-creating-the-cluster","title":"Steps to Enable/Disable After Accessing or Creating the Cluster","text":""},{"location":"en/admin/ghippo/audit/open-audit.html#confirm-enabling-k8s-audit-logs","title":"Confirm Enabling K8s Audit Logs","text":"
Run the following command to check if audit logs are generated under the /var/log/kubernetes/audit directory. If they exist, it means that K8s audit logs are successfully enabled.
ls /var/log/kubernetes/audit\n
If they are not enabled, please refer to the documentation on enabling/disabling K8s audit logs.
"},{"location":"en/admin/ghippo/audit/open-audit.html#enable-collection-of-k8s-audit-logs","title":"Enable Collection of K8s Audit Logs","text":"
The collection of K8s audit logs is disabled by default. To enable it, follow these steps:
Select the cluster that has been accessed and needs to enable the collection of K8s audit logs.
Go to the Helm App management page and update the insight-agent configuration (if insight-agent is not installed, you can install it).
Enable/Disable the collection of K8s audit logs switch.
After enabling/disabling the switch, the fluent-bit pod needs to be restarted for the changes to take effect.
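The restart can be done the same way as in the sections above, by deleting the fluent-bit pods so they are recreated with the new configuration:

```shell
fluent_pod=`kubectl get pod -n insight-system | grep insight-agent-fluent-bit | awk {'print $1'} | xargs`
kubectl delete pod ${fluent_pod} -n insight-system
```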
By default, the Kubernetes cluster does not generate audit log information. Through the following configuration, you can enable the audit log feature of Kubernetes.
Note
In a public cloud environment, it may not be possible to control the output and output path of Kubernetes audit logs.
Prepare the Policy file for the audit log
Configure the API server, and enable audit logs
Reboot and verify
"},{"location":"en/admin/ghippo/audit/open-k8s-audit.html#prepare-audit-log-policy-file","title":"Prepare audit log Policy file","text":"Click to view Policy YAML for audit log policy.yaml
Put the above audit log file in /etc/kubernetes/audit-policy/ folder, and name it apiserver-audit-policy.yaml .
"},{"location":"en/admin/ghippo/audit/open-k8s-audit.html#configure-the-api-server","title":"Configure the API server","text":"
Open the configuration file kube-apiserver.yaml of the API server, usually in the /etc/kubernetes/manifests/ folder, and add the following configuration information:
Please back up kube-apiserver.yaml before this step. The backup file must not be placed in /etc/kubernetes/manifests/ ; it is recommended to put it in /etc/kubernetes/tmp .
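The exact snippet to add is not included in this extract. As a hedged reference, enabling audit logging in a static-pod kube-apiserver typically involves the standard audit flags and matching volume mounts shown below; the values are assumptions aligned with the policy path and log directory used on this page:

```yaml
# Sketch of the typical additions to /etc/kubernetes/manifests/kube-apiserver.yaml
spec:
  containers:
    - command:
        - kube-apiserver
        # ... existing flags ...
        - --audit-policy-file=/etc/kubernetes/audit-policy/apiserver-audit-policy.yaml
        - --audit-log-path=/var/log/kubernetes/audit/kube-apiserver-audit.log
        - --audit-log-maxage=30
        - --audit-log-maxbackup=10
        - --audit-log-maxsize=100
      volumeMounts:
        # ... existing mounts ...
        - mountPath: /etc/kubernetes/audit-policy
          name: audit-policy
        - mountPath: /var/log/kubernetes/audit
          name: audit-logs
  volumes:
    # ... existing volumes ...
    - hostPath:
        path: /etc/kubernetes/audit-policy
        type: DirectoryOrCreate
      name: audit-policy
    - hostPath:
        path: /var/log/kubernetes/audit
        type: DirectoryOrCreate
      name: audit-logs
```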
"},{"location":"en/admin/ghippo/audit/open-k8s-audit.html#test-and-verify","title":"Test and verify","text":"
After a while, the API server restarts automatically. Run the following command to check whether audit logs are generated in the /var/log/kubernetes/audit directory. If so, the K8s audit log has been successfully enabled.
ls /var/log/kubernetes/audit\n
To disable it again, simply remove the relevant audit flags from spec.containers.command .
"},{"location":"en/admin/ghippo/audit/source-ip.html","title":"Get Source IP in Audit Logs","text":"
The source IP in audit logs plays a critical role in system and network management. It helps track activities, maintain security, resolve issues, and ensure compliance. However, obtaining the source IP incurs some performance overhead, so it is not always enabled in AI platform. Whether the source IP is recorded by default, and how to enable it, depends on the installation mode; the following sections explain both for each installation mode.
Note
Enabling audit logs modifies the replica count of the istio-ingressgateway, resulting in a certain performance overhead. Enabling audit logs also requires disabling kube-proxy's LoadBalance and Topology Aware Routing, which can affect cluster performance. After enabling audit logs, make sure the istio-ingressgateway runs on the node proper to the access IP. If the istio-ingressgateway drifts to another node due to node health or other issues, it must be manually rescheduled back to that node; otherwise, the normal operation of AI platform will be affected.
"},{"location":"en/admin/ghippo/audit/source-ip.html#determine-the-installation-mode","title":"Determine the Installation Mode","text":"
kubectl get pod -n metallb-system\n
Run the above command in the cluster. If the result is as follows, it means that the cluster is not in the MetalLB installation mode:
No resources found in metallb-system namespace.\n
In this mode, the source IP in audit logs is obtained by default after installation. For more information, refer to MetalLB Source IP.
"},{"location":"en/admin/ghippo/audit/gproduct-audit/ghippo.html","title":"Audit Items of Global Management","text":"Events Resource Type Notes UpdateEmail-Account Account UpdatePassword-Account Account CreateAccessKeys-Account Account UpdateAccessKeys-Account Account DeleteAccessKeys-Account Account Create-User User Delete-User User Update-User User UpdateRoles-User User UpdatePassword-User User CreateAccessKeys-User User UpdateAccessKeys-User User DeleteAccessKeys-User User Create-Group Group Delete-Group Group Update-Group Group AddUserTo-Group Group RemoveUserFrom-Group Group UpdateRoles-Group Group UpdateRoles-User User Create-LADP LADP Update-LADP LADP Delete-LADP LADP Unable to audit through API server for OIDC Login-User User Logout-User User UpdatePassword-SecurityPolicy SecurityPolicy UpdateSessionTimeout-SecurityPolicy SecurityPolicy UpdateAccountLockout-SecurityPolicy SecurityPolicy UpdateLogout-SecurityPolicy SecurityPolicy MailServer-SecurityPolicy SecurityPolicy CustomAppearance-SecurityPolicy SecurityPolicy OfficialAuthz-SecurityPolicy SecurityPolicy Create-Workspace Workspace Delete-Workspace Workspace BindResourceTo-Workspace Workspace UnBindResource-Workspace Workspace BindShared-Workspace Workspace SetQuota-Workspace Workspace Authorize-Workspace Workspace DeAuthorize-Workspace Workspace UpdateDeAuthorize-Workspace Workspace Update-Workspace Workspace Create-Folder Folder Delete-Folder Folder UpdateAuthorize-Folder Folder Update-Folder Folder Authorize-Folder Folder DeAuthorize-Folder Folder AutoCleanup-Audit Audit ManualCleanup-Audit Audit Export-Audit Audit"},{"location":"en/admin/ghippo/audit/gproduct-audit/insight.html","title":"Insight Audit Items","text":"Events Resource Type Notes Create-ProbeJob ProbeJob Update-ProbeJob ProbeJob Delete-ProbeJob ProbeJob Create-AlertPolicy AlertPolicy Update-AlertPolicy AlertPolicy Delete-AlertPolicy AlertPolicy Import-AlertPolicy AlertPolicy Create-AlertRule AlertRule Update-AlertRule AlertRule Delete-AlertRule AlertRule Create-RuleTemplate RuleTemplate Update-RuleTemplate RuleTemplate Delete-RuleTemplate RuleTemplate Create-email email Update-email email Delete-Receiver Receiver Create-dingtalk dingtalk Update-dingtalk dingtalk Delete-Receiver Receiver Create-wecom wecom Update-wecom wecom Delete-Receiver Receiver Create-webhook webhook Update-webhook webhook Delete-Receiver Receiver Create-sms sms Update-sms sms Delete-Receiver Receiver Create-aliyun(tencent,custom) aliyun, tencent, custom Update-aliyun(tencent,custom) aliyun, tencent, custom Delete-SMSserver SMSserver Create-MessageTemplate MessageTemplate Update-MessageTemplate MessageTemplate Delete-MessageTemplate MessageTemplate Create-AlertSilence AlertSilence Update-AlertSilence AlertSilence Delete-AlertSilence AlertSilence Create-AlertInhibition AlertInhibition Update-AlertInhibition AlertInhibition Delete-AlertInhibition AlertInhibition Update-SystemSettings SystemSettings"},{"location":"en/admin/ghippo/audit/gproduct-audit/kpanda.html","title":"Audit Items of Container Management","text":"Events Resource Types Create-Cluster Cluster Delete-Cluster Cluster Integrate-Cluster Cluster Remove-Cluster Cluster Upgrade-Cluster Cluster Integrate-Node Node Remove-Node Node Update-NodeGPUMode NodeGPUMode Create-HelmRepo HelmRepo Create-HelmApp HelmApp Delete-HelmApp HelmApp Create-Deployment Deployment Delete-Deployment Deployment Create-DaemonSet DaemonSet Delete-DaemonSet DaemonSet Create-StatefulSet StatefulSet Delete-StatefulSet StatefulSet Create-Job Job Delete-Job Job 
Create-CronJob CronJob Delete-CronJob CronJob Delete-Pod Pod Create-Service Service Delete-Service Service Create-Ingress Ingress Delete-Ingress Ingress Create-StorageClass StorageClass Delete-StorageClass StorageClass Create-PersistentVolume PersistentVolume Delete-PersistentVolume PersistentVolume Create-PersistentVolumeClaim PersistentVolumeClaim Delete-PersistentVolumeClaim PersistentVolumeClaim Delete-ReplicaSet ReplicaSet BindResourceTo-Workspace Workspace UnBindResource-Workspace Workspace BindResourceTo-Workspace Workspace UnBindResource-Workspace Workspace Create-CloudShell CloudShell Delete-CloudShell CloudShell"},{"location":"en/admin/ghippo/audit/gproduct-audit/virtnest.html","title":"Audit Items of Virtual Machine","text":"Events Resource Type Notes Restart-VMs VM ConvertToTemplate-VMs VM Edit-VMs VM Update-VMs VM Restore-VMs VM Power on-VMs VM LiveMigrate-VMs VM Delete-VMs VM Delete-VM Template VM Template Create-VMs VM CreateSnapshot-VMs VM Power off-VMs VM Clone-VMs VM"},{"location":"en/admin/ghippo/best-practice/authz-plan.html","title":"Ordinary user authorization plan","text":"
Ordinary users refer to those who can use most product modules and features (except management features), have certain operation rights to resources within the scope of authority, and can independently use resources to deploy applications.
The authorization and resource planning process for such users is shown in the following figure.
graph TB\n\n start([Start]) --> user[1. Create User]\n user --> ns[2. Prepare Kubernetes Namespace]\n ns --> ws[3. Prepare Workspace]\n ws --> ws-to-ns[4. Bind a workspace to namespace]\n ws-to-ns --> authu[5. Authorize a user with Workspace Editor]\n authu --> complete([End])\n\nclick user \"https://docs.daocloud.io/en/ghippo/access-control/user/\"\nclick ns \"https://docs.daocloud.io/en/kpanda/namespaces/createns/\"\nclick ws \"https://docs.daocloud.io/en/ghippo/workspace/workspace/\"\nclick ws-to-ns \"https://docs.daocloud.io/en/ghippo/workspace/ws-to-ns-across-clus/\"\nclick authu \"https://docs.daocloud.io/en/ghippo/workspace/wspermission/\"\n\n classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;\n classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;\n classDef cluster fill:#fff,stroke:#bbb,stroke-width:1px,color:#326ce5;\n class user,ns,ws,ws-to-ns,authu cluster;\n class start,complete plain;
"},{"location":"en/admin/ghippo/best-practice/cluster-for-multiws.html","title":"Assign a Cluster to Multiple Workspaces (Tenants)","text":"
Cluster resources are typically managed by operations personnel. When allocating resources, they need to create namespaces to isolate resources and set resource quotas. This method has a drawback: if the business volume of the enterprise is large, manually allocating resources requires a significant amount of work, and flexibly adjusting resource quotas can also be challenging.
To address this, the AI platform introduces the concept of workspaces. By sharing resources, workspaces can provide higher-dimensional resource quota capabilities, allowing workspaces (tenants) to self-create Kubernetes namespaces under resource quotas.
For example, if you want several departments to share different clusters:
| Department (Workspace) | Cluster01 (Normal) | Cluster02 (High Availability) |
| --- | --- | --- |
| Department (Workspace) A | 50 quota | 10 quota |
| Department (Workspace) B | 100 quota | 20 quota |
You can follow the process below to share clusters with multiple departments/workspaces/tenants:
"},{"location":"en/admin/ghippo/best-practice/cluster-for-multiws.html#prepare-a-workspace","title":"Prepare a Workspace","text":"
Workspaces are designed to meet multi-tenant usage scenarios, forming isolated resource environments based on clusters, cluster namespaces, meshes, mesh namespaces, multicloud, multicloud namespaces, and other resources. Workspaces can be mapped to various concepts such as projects, tenants, enterprises, and suppliers.
Log in to AI platform with a user having the admin/folder admin role and click Global Management at the bottom of the left navigation bar.
Click Workspaces and Folders in the left navigation bar, then click the Create Workspace button at the top right.
Fill in the workspace name, folder, and other information, then click OK to complete the creation of the workspace.
"},{"location":"en/admin/ghippo/best-practice/cluster-for-multiws.html#prepare-a-cluster","title":"Prepare a Cluster","text":"
Follow these steps to prepare a cluster.
Click Container Management at the bottom of the left navigation bar, then select Clusters .
Click Create Cluster to create a cluster or click Integrate Cluster to integrate a cluster.
"},{"location":"en/admin/ghippo/best-practice/cluster-for-multiws.html#add-cluster-to-workspace","title":"Add Cluster to Workspace","text":"
Return to Global Management to add clusters to the workspace.
Click Global Management -> Workspaces and Folders -> Shared Resources, then click a workspace name and click the New Shared Resource button.
Select the cluster, fill in the resource quota, and click OK .
"},{"location":"en/admin/ghippo/best-practice/folder-practice.html","title":"Folder Best Practices","text":"
A folder represents an organizational unit (such as a department) and is a node in the resource hierarchy.
A folder can contain workspaces, subfolders, or a combination of both. It provides identity management, multi-level and permission mapping capabilities, and can map the role of a user/group in a folder to its subfolders, workspaces and resources. Therefore, with the help of folders, enterprise managers can centrally manage and control all resources.
Build corporate hierarchy
First, build a folder hierarchy that mirrors your existing enterprise hierarchy. AI platform supports up to 5 levels of folders, which can be combined freely according to the actual situation of the enterprise; folders and workspaces map to entities such as departments, projects, and suppliers.
Folders are not directly linked to resources, but indirectly achieve resource grouping through workspaces.
User identity management
Folders provide three roles: Folder Admin, Folder Editor, and Folder Viewer. See role permissions for details. You can grant different roles to users/groups in the same folder through Authorization.
Role and permission mapping
Enterprise administrator: Grant the Folder Admin role on the root folder. They will then have administrative authority over all departments, projects, and their resources.
Department manager: Grant management rights separately on each subfolder and workspace.
Project members: Grant management rights separately at the workspace and resource levels.
"},{"location":"en/admin/ghippo/best-practice/super-group.html","title":"Architecture Management of Large Enterprises","text":"
As the business continues to scale, the company grows, subsidiaries and branches are established one after another, and some subsidiaries even establish their own subsidiaries. The original large departments are gradually subdivided into multiple smaller departments, leading to more and more hierarchical levels in the organizational structure. This organizational change also affects the IT governance architecture.
The specific operational steps are as follows:
Enable Isolation Mode between Folder/WS
Please refer to Enable Isolation Mode between Folder/WS.
Plan Enterprise Architecture according to the Actual Situation
Under a multi-level organizational structure, it is recommended to use the second-level folder as an isolation unit to isolate users/user groups/resources between \"sub-companies\". After isolation, users/user groups/resources between \"sub-companies\" are not visible to each other.
Create Users/Integrate User Systems
The main platform administrator Admin can create users on the platform or integrate users through LDAP/OIDC/OAuth2.0 and other identity providers to AI platform.
Create Folder Roles
In the Folder/WS isolation mode, the platform administrator Admin must first authorize users to invite them into the various sub-companies, so that the sub-company administrators (Folder Admin) can manage these users, for example through secondary authorization or permission editing. To simplify the platform administrator's work, it is recommended to create a role without actual permissions to assist the Admin in inviting users to sub-companies through \"authorization\"; the actual permissions of sub-company users are then delegated to the sub-company administrators (Folder Admin) to manage independently. (The following demonstrates how to create a resource-bound role without actual permissions, i.e., minirole.)
Note
Resource-bound permissions used alone do not take effect, hence meeting the requirement of inviting users to sub-companies through \"authorization\" and then managed by sub-company administrators Folder Admin.
Authorize Users
The platform administrator invites users to various sub-companies according to the actual situation and appoints sub-company administrators.
Authorize sub-company regular users as \"minirole\" (1), and authorize sub-company administrators as Folder Admin.
Refers to the role without actual permissions created in step 4
Sub-company Administrators Manage Users/User Groups Independently
Sub-company administrator Folder Admin can only see their own \"Sub-company 2\" after logging into the platform, and can adjust the architecture by creating folders, creating workspaces, and assigning other permissions to users in Sub-company 2 through adding authorization/edit permissions.
When adding authorization, sub-company administrator Folder Admin can only see users invited by the platform administrator through \"authorization\", and cannot see all users on the platform, thus achieving user isolation between Folder/WS, and the same applies to user groups (the platform administrator can see and authorize all users and user groups on the platform).
Note
The main difference between large enterprises and small/medium-sized enterprises lies in whether users/user groups in Folder and workspaces are visible to each other. In large enterprises, users/user groups between subsidiaries are not visible + permission isolation; in small/medium-sized enterprises, users between departments are visible to each other + permission isolation.
System messages are used to notify all users, similar to system announcements, and will be displayed at the top bar of the AI platform UI at specific times.
"},{"location":"en/admin/ghippo/best-practice/system-message.html#configure-system-messages","title":"Configure System Messages","text":"
You can create a system message by applying the YAML for the system message in the Cluster Roles. The display time of the message is determined by the time fields in the YAML. System messages will only be displayed within the time range configured by the start and end fields.
On the Clusters page, click the name of the Global Service Cluster to enter it.
Select CRDs from the left navigation bar, search for ghippoconfig, and click the ghippoconfigs.ghippo.io that appears in the search results.
Click Create from YAML or modify an existing YAML.
A sample YAML is as follows:
apiVersion: ghippo.io/v1alpha1\nkind: GhippoConfig\nmetadata:\n name: system-message\nspec:\n message: \"this is a message\"\n start: 2024-01-02T15:04:05+08:00\n end: 2024-07-24T17:26:05+08:00\n
"},{"location":"en/admin/ghippo/best-practice/ws-best-practice.html","title":"Workspace Best Practices","text":"
A workspace is a resource grouping unit, and most resources can be bound to a certain workspace. The workspace can realize the binding relationship between users and roles through authorization and resource binding, and apply it to all resources in the workspace at one time.
Through the workspace, you can easily manage teams and resources, and solve cross-module and cross-cluster resource authorization issues.
A workspace consists of three features: authorization, resource groups, and shared resources. It mainly solves the problems of unified authorization of resources, resource grouping and resource quota.
Authorization: Grant users/groups different roles in the workspace, and apply the roles to the resources in the workspace.
Best practice: When ordinary users want to use Workbench, microservice engine, service mesh, and middleware module features, or need to have permission to use container management and some resources in the service mesh, the administrator needs to grant the workspace permissions (Workspace Admin, Workspace Edit, Workspace View). The administrator here can be the Admin role, the Workspace Admin role of the workspace, or the Folder Admin role above the workspace. See Relationship between Folder and Workspace.
Resource group: Resource group and shared resource are two resource management modes of the workspace.
Resource groups support four resource types: Cluster, Cluster-Namespace (cross-cluster), Mesh, and Mesh-Namespace. A resource can only be bound to one resource group. After a resource is bound to a resource group, the owner of the workspace will have all the management rights of the resource, which is equivalent to the owner of the resource, so it is not limited by the resource quota.
Best practice: The workspace can grant different role permissions to department members through the \"authorization\" function, and the workspace can apply the authorization relationship between people and roles to all resources in the workspace at one time. Therefore, the operation and maintenance personnel only need to bind resources to resource groups, and add different roles in the department to different resource groups to ensure that resource permissions are assigned correctly.
| Department | Role | Cluster | Cross-cluster Cluster-Namespace | Mesh | Mesh-Namespace |
| --- | --- | --- | --- | --- | --- |
| Department Admin | Workspace Admin | ✓ | ✓ | ✓ | ✓ |
| Department Core Members | Workspace Edit | ✓ | ✗ | ✓ | ✗ |
| Other Members | Workspace View | ✓ | ✗ | ✗ | ✗ |
Shared resources: The shared resource feature is mainly for cluster resources.
A cluster can be shared by multiple workspaces (referring to the shared resource feature in the workspace); a workspace can also use the resources of multiple clusters at the same time. However, resource sharing does not mean that the sharer (workspace) can use the shared resource (cluster) without restriction, so the resource quota that the sharer (workspace) can use is usually limited.
At the same time, unlike resource groups, workspace members are only users of shared resources and can use the cluster's resources within the resource quota. For example, a member can go to the Workbench to create namespaces and deploy applications, but has no management authority over the cluster. After the quota is set, the total resource quota of the namespaces created/bound under this workspace cannot exceed the resources allocated to this workspace from that cluster.
Best practice: The operation and maintenance department has a high-availability cluster 01 and wants to allocate it to department A (workspace A) and department B (workspace B), giving department A 50 CPU cores and department B 100 CPU cores. You can use the shared resource concept: share cluster 01 with departments A and B respectively, and limit department A's CPU quota to 50 and department B's CPU quota to 100. The administrator of department A (workspace A Admin) can then create and use namespaces in the Workbench, with the sum of the namespace quotas not exceeding 50 cores, and the administrator of department B (workspace B Admin) can do the same with a limit of 100 cores. The namespaces created by the administrators of departments A and B are automatically bound to the department, and other members of the department will have the Namespace Admin, Namespace Edit, and Namespace View roles proper to the namespace (a department here refers to a workspace; a workspace can also be mapped to other concepts such as organization or supplier). The whole process is as follows:
| Department | Role | Cluster | Resource Quota |
| --- | --- | --- | --- |
| Department Administrator A | Workspace Admin | CPU 50 cores | CPU 50 cores |
| Department Administrator B | Workspace Admin | CPU 100 cores | CPU 100 cores |
| Other Members of the Department | Namespace Admin / Namespace Edit / Namespace View | Assign as Needed | Assign as Needed |
"},{"location":"en/admin/ghippo/best-practice/ws-best-practice.html#the-effect-of-the-workspace-on-the-ai-platfrom","title":"The effect of the workspace on the AI platfrom","text":"
Module name: Container Management
Due to the particularity of functional modules, resources created in the container management module will not be automatically bound to a certain workspace.
If you need to perform unified authorization management of users and resources through workspaces, you can manually bind the required resources to a workspace, so that users' roles in this workspace apply to those resources (the resources can come from different clusters).
In addition, there is a slight difference between container management and service mesh in terms of resource binding entries: the workspace provides binding entries for the Cluster and Cluster-Namespace resources of container management, but has not yet opened binding entries for the Mesh and Mesh-Namespace resources of service mesh.
For Mesh and Mesh-Namespace resources, you can manually bind them in the resource list of the service mesh.
"},{"location":"en/admin/ghippo/best-practice/ws-best-practice.html#use-cases-of-workspace","title":"Use Cases of Workspace","text":"
Mapping to concepts such as different departments, projects, and organizations. At the same time, the roles of Workspace Admin, Workspace Edit, and Workspace View in the workspace can be mapped to different roles in departments, projects, and organizations
Add resources for different purposes to different workspaces for separate management and use
Set up completely independent administrators for different workspaces to realize user and authority management within the scope of the workspace
Share resources to different workspaces, and limit the upper limit of resources that can be used by workspaces
"},{"location":"en/admin/ghippo/best-practice/ws-to-ns.html","title":"Workspaces (tenants) bind namespaces across clusters","text":"
Namespaces from different clusters are bound under the workspace (tenant), which enables the workspace (tenant) to flexibly manage the Kubernetes Namespace under any cluster on the platform. At the same time, the platform provides permission mapping capabilities, which can map the user's permissions in the workspace to the bound namespace.
When one or more cross-cluster namespaces are bound to a workspace (tenant), the administrator does not need to authorize the workspace members again: their roles in the workspace are automatically mapped according to the following relationships, avoiding repeated authorization operations:
Workspace Admin corresponds to Namespace Admin
Workspace Editor corresponds to Namespace Editor
Workspace Viewer corresponds to Namespace Viewer
Here is an example:
| User | Workspace | Role |
| --- | --- | --- |
| User A | Workspace01 | Workspace Admin |
After binding a namespace to a workspace:
| User | Category | Role |
| --- | --- | --- |
| User A | Workspace01 | Workspace Admin |
| User A | Namespace01 | Namespace Admin |
"},{"location":"en/admin/ghippo/best-practice/ws-to-ns.html#implementation-plan","title":"Implementation plan","text":"
Bind different namespaces from different clusters to the same workspace (tenant), and use the process for members under the workspace (tenant) as shown in the figure.
graph TB\n\npreparews[prepare workspace] --> preparens[prepare namespace] --> judge([whether the namespace is bound to another workspace])\njudge -.unbound.->nstows[bind namespace to workspace] -->wsperm[manage workspace access]\njudge -.bound.->createns[Create a new namespace]\n\nclassDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000;\nclassDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff;\nclassDef cluster fill:#fff,stroke:#bbb,stroke-width:1px,color:#326ce5;\n\nclass preparews,preparens,createns,nstows,wsperm cluster;\nclass judge plain\n\nclick preparews \"https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_3\"\nclick preparens \"https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_4\"\nclick nstows \"https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_5\"\nclick wsperm \"https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_6\"\nclick createns \"https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_4\"
In order to meet the multi-tenant use cases, the workspace forms an isolated resource environment based on multiple resources such as clusters, cluster namespaces, meshs, mesh namespaces, multicloud, and multicloud namespaces. Workspaces can be mapped to various concepts such as projects, tenants, enterprises, and suppliers.
Log in to AI platform as a user with the admin/folder admin role, and click Global Management at the bottom of the left navigation bar.
Click Workspace and Folder in the left navigation bar, and click the Create Workspace button in the upper right corner.
After filling in the workspace name, folder and other information, click OK to complete the creation of the workspace.
Tip: If the created namespace already exists in the platform, click a workspace, and under the Resource Group tab, click Bind Resource to directly bind the namespace.
"},{"location":"en/admin/ghippo/best-practice/ws-to-ns.html#prepare-the-namespace","title":"Prepare the namespace","text":"
A namespace is a smaller unit of resource isolation that can be managed and used by members of a workspace after it is bound to a workspace.
Follow the steps below to prepare a namespace that is not yet bound to any workspace.
Click Container Management at the bottom of the left navigation bar.
Click the name of the target cluster to enter Cluster Details .
Click Namespace on the left navigation bar to enter the namespace management page, and click the Create button on the right side of the page.
Fill in the name of the namespace, configure the workspace and tags (optional settings), and click OK .
Info
Workspaces are primarily used to divide groups of resources and grant users (groups of users) different access rights to that resource. For a detailed description of the workspace, please refer to Workspace and Folder.
Click OK to complete the creation of the namespace. On the right side of the namespace list, click \u2507 , and you can select Bind Workspace from the pop-up menu.
"},{"location":"en/admin/ghippo/best-practice/ws-to-ns.html#bind-the-namespace-to-the-workspace","title":"Bind the namespace to the workspace","text":"
In addition to binding in the namespace list, you can also return to global management , follow the steps below to bind the workspace.
Click Global Management -> Workspace and Folder -> Resource Group , click a workspace name, and click the Bind Resource button.
Select the namespaces to be bound (multiple choices are allowed), and click OK to complete the binding.
"},{"location":"en/admin/ghippo/best-practice/ws-to-ns.html#add-members-to-the-workspace-and-authorize","title":"Add members to the workspace and authorize","text":"
In Workspace and Folder -> Authorization , click the name of a workspace, and click the Add Authorization button.
After selecting the User/group and Role to be authorized, click OK to complete the authorization.
"},{"location":"en/admin/ghippo/best-practice/gproduct/intro.html","title":"How GProduct connects to global management","text":"
GProduct is the general term for all other modules in AI platform except the global management. These modules need to be connected with the global management before they can be added to AI platform.
"},{"location":"en/admin/ghippo/best-practice/gproduct/intro.html#what-to-be-docking","title":"What to be docking","text":"
Docking Navigation Bar
The entrances are unified on the left navigation bar.
Access Routing and AuthN
Unify the IP or domain name, and unify the routing entry through the globally managed Istio Gateway.
Unified login / unified AuthN authentication
The login page is unified using the global management (Keycloak) login page, and the API authn token verification uses Istio Gateway. After GProduct is connected to the global management, there is no need to pay attention to how to implement login and authentication.
Only one of overview, workbench, container, microservice, data service, and management is supported.
The larger the number, the higher the menu is ranked.
The configuration for the global management navigation bar category is stored in a ConfigMap and cannot be added through registration at present. Please contact the global management team to add it.
The kpanda front-end is integrated into the AI platform parent application Anakin as a micro-frontend.
AI platform frontend uses qiankun to connect the sub-applications UI. See getting started.
After registering the GProductNavigator CR, the proper registration information will be generated for the front-end parent application. For example, kpanda will generate the following registration information:
{\n \"id\": \"kpanda\",\n \"title\": \"\u5bb9\u5668\u7ba1\u7406\",\n \"url\": \"/kpanda\",\n \"uiAssetsUrl\": \"/ui/kpanda/\", // The trailing / is required\n \"needImportLicense\": false\n},\n
The proper relation between the above registration and the qiankun sub-application fields is:
container and loader are provided by the frontend parent application; the sub-application does not need to be concerned with them. Props provides a pinia store containing the user's basic information and the sub-product registration information.
qiankun will use the following parameters on startup:
start({\n sandbox: {\n experimentalStyleIsolation: true,\n },\n // Remove the favicon in the sub-application to prevent it from overwriting the parent application's favicon in Firefox\n getTemplate: (template) => template.replaceAll(/<link\\s* rel=\"[\\w\\s]*icon[\\w\\s]*\"\\s*( href=\".*?\")?\\s*\\/?>/g, ''),\n});\n
Refer to Docking demo tar to GProduct provided by frontend team.
"},{"location":"en/admin/ghippo/best-practice/gproduct/route-auth.html","title":"Access routing and login authentication","text":"
After docking, login and password verification are unified. The effect is as follows:
The API bearer token verification of each GProduct module goes through the Istio Gateway.
Take kpanda as an example to register GProductProxy CR.
# GProductProxy CR example, including routing and login authentication\n\n# spec.proxies: The route written later cannot be a subset of the route written first, and vice versa\n# spec.proxies.match.uri.prefix: If it is a backend api, it is recommended to add \"/\" at the end of the prefix to indicate the end of this path (special requirements can not be added)\n# spec.proxies.match.uri: supports prefix and exact modes; Prefix and Exact can only choose 1 out of 2; Prefix has a higher priority than Exact\n\napiVersion: ghippo.io/v1alpha1\nkind: GProductProxy\nmetadata:\n name: kpanda # (1)\nspec:\n gproduct: kpanda # (2)\n proxies:\n - labels:\n kind: UIEntry\n match:\n uri:\n prefix: /kpanda # (3)\n rewrite:\n uri: /index.html\n destination:\n host: ghippo-anakin.ghippo-system.svc.cluster.local\n port: 80\n authnCheck: false # (4)\n - labels:\n kind: UIAssets\n match:\n uri:\n prefix: /ui/kpanda/ # (5)\n destination:\n host: kpanda-ui.kpanda-system.svc.cluster.local\n port: 80\n authnCheck: false\n - match:\n uri:\n prefix: /apis/kpanda.io/v1/a\n destination:\n host: kpanda-service.kpanda-system.svc.cluster.local\n port: 80\n authnCheck: false\n - match:\n uri:\n prefix: /apis/kpanda.io/v1 # (6)\n destination:\n host: kpanda-service.kpanda-system.svc.cluster.local\n port: 80\n authnCheck: true\n
Cluster-level CRDs
You need to specify the GProduct name in lowercase
Can also support exact
Whether istio-gateway is required to perform AuthN Token authentication for this routing API, false means to skip authentication
UIAssets recommends adding / at the end to indicate the end (otherwise there may be problems in the front end)
The route written later cannot be a subset of the route written earlier, and vice versa
"},{"location":"en/admin/ghippo/best-practice/menu/menu-display-or-hiding.html","title":"Display/Hide Navigation Bar Menu Based on Permissions","text":"
Under the current permission system, Global Management has the capability to regulate the visibility of navigation bar menus according to user permissions. However, due to the authorization information of Container Management not being synchronized with Global Management, Global Management cannot accurately determine whether to display the Container Management menu.
This document implements the following through configuration: By default, the menus for Container Management and Insight will not be displayed in areas where Global Management cannot make a judgment. A Whitelist authorization strategy is employed to effectively manage the visibility of these menus. (The permissions for clusters or namespaces authorized through the Container Management page cannot be perceived or judged by Global Management)
For example, if User A holds the Cluster Admin role for cluster A in Container Management, Global Management cannot determine whether to display the Container Management menu. After the configuration described in this document, User A will not see the Container Management menu by default. They will need to have explicit permission in Global Management to access the Container Management menu.
The feature to show/hide menus based on permissions must be enabled. The methods to enable this are as follows:
For new installation environments, add the --set global.navigatorVisibleDependency=true parameter when running helm install.
For existing environments, back up the values with helm get values ghippo -n ghippo-system -o yaml > bak.yaml, then edit bak.yaml and add global.navigatorVisibleDependency: true.
Then upgrade the Global Management using the following command:
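A sketch of the upgrade command, following the same helm pattern used elsewhere in this document (the ghippo repo alias and chart name ghippo/ghippo are assumptions; adjust to your environment):

# Get the currently deployed chart version
version=$(helm get notes ghippo -n ghippo-system | grep "Chart Version" | awk -F ': ' '{ print $2 }')

# Upgrade with the backed-up values (bak.yaml now contains global.navigatorVisibleDependency: true)
helm upgrade ghippo ghippo/ghippo -n ghippo-system -f ./bak.yaml --version ${version}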
"},{"location":"en/admin/ghippo/best-practice/menu/menu-display-or-hiding.html#configure-the-navigation-bar","title":"Configure the Navigation Bar","text":"
Apply the following YAML in kpanda-global-cluster:
"},{"location":"en/admin/ghippo/best-practice/menu/menu-display-or-hiding.html#achieve-the-above-effect-through-custom-roles","title":"Achieve the Above Effect Through Custom Roles","text":"
Note
Only the Container Management module requires separate configuration of menu permissions. Other modules automatically show or hide menus based on user permissions.
Create a custom role that includes the permission to view the Container Management menu, and then grant this role to users who need access to the Container Management menu.
After being granted this role, the user can see the navigation bar menus for Container Management and Insight (observability). The result is as follows:
"},{"location":"en/admin/ghippo/best-practice/oem/custom-idp.html","title":"Customizing AI platform Integration with IdP","text":"
Identity Provider (IdP): In AI platform, when a client system needs to be used as the user source and user authentication is performed through the client system's login interface, the client system is referred to as the Identity Provider for AI platform.
If there is a strong customization requirement for the Ghippo login IdP, such as supporting WeCom, WeChat, or other social or enterprise logins, refer to this document for implementation.
Upgrade Ghippo to v0.15.0 or above. You can also directly install and deploy Ghippo v0.15.0, but make sure to manually record the following information.
After a successful upgrade, an installation command should be run manually. The --set parameter values should be taken from the content saved above, together with the following additional parameters (a command sketch follows this list):
global.idpPlugin.enabled: Whether to enable the custom plugin, default is disabled.
global.idpPlugin.image.repository: The image address used by the initContainer to initialize the custom plugin.
global.idpPlugin.image.tag: The image tag used by the initContainer to initialize the custom plugin.
global.idpPlugin.path: The directory file of the custom plugin within the above image.
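A hedged sketch of the upgrade command combining the backed-up values with the plugin parameters listed above; the image repository, tag, and path placeholders are assumptions you must replace with your own plugin details:

export GHIPPO_HELM_VERSION=$(helm get notes ghippo -n ghippo-system | grep "Chart Version" | awk -F ': ' '{ print $2 }')

helm upgrade ghippo ghippo/ghippo \
  -n ghippo-system \
  -f ./ghippo-values-bak.yaml \
  --version ${GHIPPO_HELM_VERSION} \
  --set global.idpPlugin.enabled=true \
  --set global.idpPlugin.image.repository=<your-plugin-image-repo> \
  --set global.idpPlugin.image.tag=<your-plugin-image-tag> \
  --set global.idpPlugin.path=<plugin-directory-inside-the-image>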
Known issue: in Keycloak >= v21, support for old-version themes has been removed, which may be fixed in v22. See Issue #15344.
This demo uses Keycloak v20.0.5.
"},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#source-based-development","title":"Source-based Development","text":""},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#configure-the-environment","title":"Configure the Environment","text":"
Refer to keycloak/building.md for environment configuration.
Run the following commands based on keycloak/README.md:
cd quarkus\nmvn -f ../pom.xml clean install -DskipTestsuite -DskipExamples -DskipTests\n
"},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#run-from-ide","title":"Run from IDE","text":""},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#add-service-code","title":"Add Service Code","text":""},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#if-inheriting-some-functionality-from-keycloak","title":"If inheriting some functionality from Keycloak","text":"
Add files under the directory services/src/main/java/org/keycloak/broker :
The file names should be xxxProvider.java and xxxProviderFactory.java .
xxxProviderFactory.java example:
Pay attention to the variable PROVIDER_ID = \"oauth\"; , as it will be used in the HTML definition later.
xxxProvider.java example:
"},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#if-unable-to-inherit-functionality-from-keycloak","title":"If unable to inherit functionality from Keycloak","text":"
Refer to the three files in the image below to write your own code:
Add xxxProviderFactory to resource service
Add xxxProviderFactory to services/src/main/resources/META-INF/services/org.keycloak.broker.provider.IdentityProviderFactory so that the newly added code can work:
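A minimal sketch of registering the factory; the fully qualified class name below is a hypothetical example and must match the package and class you actually created:

# Append the factory's fully qualified class name to the services registration file
echo "org.keycloak.broker.oauth.OAuthIdentityProviderFactory" >> \
  services/src/main/resources/META-INF/services/org.keycloak.broker.provider.IdentityProviderFactory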
Add HTML file
Copy the file themes/src/main/resources/theme/base/admin/resources/partials/realm-identity-provider-oidc.html and rename it as realm-identity-provider-oauth.html (remember the variable to pay attention to from earlier).
Place the copied file in themes/src/main/resources/theme/base/admin/resources/partials/realm-identity-provider-oauth.html .
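For example, from the repository root:

cd themes/src/main/resources/theme/base/admin/resources/partials
cp realm-identity-provider-oidc.html realm-identity-provider-oauth.html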
All the necessary files have been added. Now you can start debugging the functionality.
"},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#packaging-as-a-jar-plugin","title":"Packaging as a JAR Plugin","text":"
Create a new Java project and copy the above code into the project, as shown below:
Refer to pom.xml.
Run mvn clean package to package the code, resulting in the xxx-jar-with-dependencies.jar file.
Download Keycloak Release 20.0.5 zip package and extract it.
Copy the xxx-jar-with-dependencies.jar file to the keycloak-20.0.5/providers directory.
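For example, from the plugin project root (the JAR name follows the xxx placeholder used above):

# Package the plugin, then place the resulting JAR into Keycloak's providers directory
mvn clean package
cp target/xxx-jar-with-dependencies.jar keycloak-20.0.5/providers/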
Run the following command to check if the functionality is working correctly:
bin/kc.sh start-dev\n
"},{"location":"en/admin/ghippo/best-practice/oem/oem-in.html","title":"Integrating Customer Systems into AI platform (OEM IN)","text":"
OEM IN refers to the partner's platform being embedded as a submodule in AI platform, appearing in the primary navigation bar of AI platform. Users can log in and manage it uniformly through AI platform. The implementation of OEM IN is divided into 5 steps:
Unify Domain
Integrate User Systems
Integrate Navigation Bar
Customize Appearance
Integrate Permission System (Optional)
For specific operational demonstrations, refer to the OEM IN Best Practices Video Tutorial.
Note
The open source software Label Studio is used for the embedding demonstration below. In actual scenarios, you need to solve the following issues in the customer system:
The customer system needs to add a Subpath to distinguish which services belong to AI platform and which belong to the customer system.
Adjust the operations on the customer system during the application according to the actual situation.
Plan the Subpath path of the customer system: http://10.6.202.177:30123/label-studio (It is recommended to use a recognizable name as the Subpath, which should not conflict with the HTTP router of the main AI platform). Ensure that users can access the customer system through http://10.6.202.177:30123/label-studio.
"},{"location":"en/admin/ghippo/best-practice/oem/oem-in.html#unify-domain-name-and-port","title":"Unify Domain Name and Port","text":"
SSH into the AI platform server.
ssh root@10.6.202.177\n
Create the label-studio.yaml file using the vim command.
vim label-studio.yaml\n
label-studio.yaml
apiVersion: networking.istio.io/v1beta1\nkind: ServiceEntry\nmetadata:\n name: label-studio\n namespace: ghippo-system\nspec:\n exportTo:\n - \"*\"\n hosts:\n - label-studio.svc.external\n ports:\n # Add a virtual port\n - number: 80\n name: http\n protocol: HTTP\n location: MESH_EXTERNAL\n resolution: STATIC\n endpoints:\n # Change to the domain name (or IP) of the customer system\n - address: 10.6.202.177\n ports:\n # Change to the port number of the customer system\n http: 30123\n---\napiVersion: networking.istio.io/v1alpha3\nkind: VirtualService\nmetadata:\n # Change to the name of the customer system\n name: label-studio\n namespace: ghippo-system\nspec:\n exportTo:\n - \"*\"\n hosts:\n - \"*\"\n gateways:\n - ghippo-gateway\n http:\n - match:\n - uri:\n exact: /label-studio # Change to the routing address of the customer system in the Web UI entry\n - uri:\n prefix: /label-studio/ # Change to the routing address of the customer system in the Web UI entry\n route:\n - destination:\n # Change to the value of spec.hosts in the ServiceEntry above\n host: label-studio.svc.external\n port:\n # Change to the value of spec.ports in the ServiceEntry above\n number: 80\n---\napiVersion: security.istio.io/v1beta1\nkind: AuthorizationPolicy\nmetadata:\n # Change to the name of the customer system\n name: label-studio\n namespace: istio-system\nspec:\n action: ALLOW\n selector:\n matchLabels:\n app: istio-ingressgateway\n rules:\n - from:\n - source:\n requestPrincipals:\n - '*'\n - to:\n - operation:\n paths:\n - /label-studio # Change to the value of spec.http.match.uri.prefix in VirtualService\n - /label-studio/* # Change to the value of spec.http.match.uri.prefix in VirtualService (Note: add \"*\" at the end)\n
Apply the label-studio.yaml using the kubectl command:
kubectl apply -f label-studio.yaml\n
Verify if the IP and port of the Label Studio UI are consistent:
"},{"location":"en/admin/ghippo/best-practice/oem/oem-in.html#integrate-user-systems","title":"Integrate User Systems","text":"
Integrate the customer system with AI platform through protocols like OIDC/OAUTH, allowing users to enter the customer system without logging in again after logging into AI platform.
In a scenario where two AI platform instances are docked with each other, you can create SSO access through Global Management -> Access Control -> Docking Portal.
After creating it, fill in details such as the Client ID, Client Secret, and Login URL in the customer system's Global Management -> Access Control -> Identity Provider -> OIDC to complete user integration.
After integration, the customer system login page will display the OIDC (Custom) option. Select OIDC login the first time you enter the customer system from AI platform; afterwards, you will enter the customer system directly without selecting it again.
Refer to the tar package at the bottom of the document to implement an empty frontend sub-application, and embed the customer system into this empty shell application in the form of an iframe.
Download the gproduct-demo-main.tar.gz file and change the value of the src attribute in App-iframe.vue under the src folder (the user entering the customer system):
The absolute address: src=\"https://10.6.202.177:30443/label-studio\" (AI platform address + Subpath)
The relative address, such as src=\"./external-anyproduct/insight\"
Delete the App.vue and main.ts files under the src folder, then rename the iframe variants (a command sketch follows these steps):
Rename App-iframe.vue to App.vue
Rename main-iframe.ts to main.ts
Build the image following the steps in the readme (Note: before executing the last step, replace the image address in demo.yaml with the built image address)
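A sketch of the delete and rename steps above, run from the project root of the extracted gproduct-demo-main package:

# Replace the default entry files with the iframe variants
rm src/App.vue src/main.ts
mv src/App-iframe.vue src/App.vue
mv src/main-iframe.ts src/main.ts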
After integration, the Customer System will appear in the primary navigation bar of AI platform, and clicking it will allow users to enter the customer system.
AI platform supports customizing the appearance by writing CSS. How the customer system implements appearance customization needs to be handled according to its own situation.
Log in to the customer system, and through Global Management -> Settings -> Appearance, you can customize platform background colors, logos, and names. For specific operations, please refer to Appearance Customization.
"},{"location":"en/admin/ghippo/best-practice/oem/oem-in.html#integrate-permission-system-optional","title":"Integrate Permission System (Optional)","text":"
Method One:
The customization team can implement a custom module: AI platform notifies this module of each user login event via Webhook, and the module then calls the OpenAPIs of AnyProduct and AI platform to synchronize the user's permission information.
Method Two:
Through Webhook, notify AnyProduct of each authorization change (if required, it can be implemented later).
"},{"location":"en/admin/ghippo/best-practice/oem/oem-in.html#use-other-capabilities-of-ai-platform-in-anyproduct-optional","title":"Use Other Capabilities of AI platform in AnyProduct (Optional)","text":"
Download the tar package for gProduct-demo-main integration
"},{"location":"en/admin/ghippo/best-practice/oem/oem-out.html","title":"Integrate AI platform into Customer System (OEM OUT)","text":"
OEM OUT refers to integrating AI platform as a sub-module into other products, appearing in their menus. You can directly access AI platform without logging in again after logging into other products. The OEM OUT integration involves 5 steps:
Deploy AI platform (Assuming the access address after deployment is https://10.6.8.2:30343/).
To achieve cross-domain access between the customer system and AI platform, you can use an nginx reverse proxy. Use the following example configuration in vi /etc/nginx/conf.d/default.conf :
server {\n listen 80;\n server_name localhost;\n\n location /dce5/ {\n proxy_pass https://10.6.8.2:30343/;\n proxy_http_version 1.1;\n proxy_read_timeout 300s; # This line is required for using kpanda cloudtty, otherwise it can be removed\n proxy_send_timeout 300s; # This line is required for using kpanda cloudtty, otherwise it can be removed\n\n proxy_set_header Host $host;\n proxy_set_header X-Real-IP $remote_addr;\n proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n\n proxy_set_header Upgrade $http_upgrade; # This line is required for using kpanda cloudtty, otherwise it can be removed\n proxy_set_header Connection $connection_upgrade; # This line is required for using kpanda cloudtty, otherwise it can be removed\n }\n\n location / {\n proxy_pass https://10.6.165.50:30443/; # Assuming this is the customer system address (e.g., Yiyun)\n proxy_http_version 1.1;\n\n proxy_set_header Host $host;\n proxy_set_header X-Real-IP $remote_addr;\n proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n }\n}\n
Assuming the nginx entry address is 10.6.165.50, follow the Customize AI platform Reverse Proxy Server Address to set the AI_PROXY reverse proxy as http://10.6.165.50/dce5. Ensure that AI platform can be accessed via http://10.6.165.50/dce5. The customer system also needs to configure the reverse proxy based on its specific requirements.
"},{"location":"en/admin/ghippo/best-practice/oem/oem-out.html#user-system-integration","title":"User System Integration","text":"
Integrate the customer system with AI platform using protocols like OIDC/OAUTH, allowing users to access AI platform without logging in again after logging into the customer system. Fill in the OIDC information of the customer system in Global Management -> Access Control -> Identity Provider .
After integration, the AI platform login page will display the OIDC (custom) option. When accessing AI platform from the customer system for the first time, select OIDC login, and subsequent logins will directly enter AI platform without needing to choose again.
"},{"location":"en/admin/ghippo/best-practice/oem/oem-out.html#navigation-bar-integration","title":"Navigation Bar Integration","text":"
Navigation bar integration means adding AI platform to the menu of the customer system. You can directly access AI platform by clicking the proper menu item. The navigation bar integration depends on the customer system and needs to be handled based on specific circumstances.
Use Global Management -> Settings -> Appearance to customize the platform's background color, logo, and name. For detailed instructions, refer to Appearance Customization.
"},{"location":"en/admin/ghippo/best-practice/oem/oem-out.html#permission-system-integration-optional","title":"Permission System Integration (optional)","text":"
Permission system integration is complex. If you have such requirements, please contact the Global Management team.
Tengine: Tengine is a web server project initiated by taobao.com. Based on Nginx, it adds many advanced features and characteristics to meet the needs of high-traffic websites.
Tongsuo: Formerly known as BabaSSL, Tongsuo is an open-source cryptographic library that offers a range of modern cryptographic algorithms and secure communication protocols. It is designed to support a variety of use cases, including storage, network security, key management, and privacy computing. By providing foundational cryptographic capabilities, Tongsuo ensures the privacy, integrity, and authenticity of data during transmission, storage, and usage. It also enhances security throughout the data lifecycle, offering robust privacy protection and security features.
You can refer to the Tongsuo official documentation to use OpenSSL to generate SM2 certificates, or visit Guomi SSL Laboratory to apply for SM2 certificates.
In the end, we will get the following files:
-rw-r--r-- 1 root root 749 Dec 8 02:59 sm2.*.enc.crt.pem\n-rw-r--r-- 1 root root 258 Dec 8 02:59 sm2.*.enc.key.pem\n-rw-r--r-- 1 root root 749 Dec 8 02:59 sm2.*.sig.crt.pem\n-rw-r--r-- 1 root root 258 Dec 8 02:59 sm2.*.sig.key.pem\n
-rw-r--r-- 1 root root 216 Dec 8 03:21 rsa.*.crt.pem\n-rw-r--r-- 1 root root 4096 Dec 8 02:59 rsa.*.key.pem\n
"},{"location":"en/admin/ghippo/install/gm-gateway.html#configure-sm2-and-rsa-tls-certificates-for-the-guomi-gateway","title":"Configure SM2 and RSA TLS Certificates for the Guomi Gateway","text":"
The Guomi gateway used in this article supports SM2 and RSA TLS certificates. The advantage of dual certificates is that when the browser does not support SM2 TLS certificates, it automatically switches to RSA TLS certificates.
For more detailed configurations, please refer to the Tongsuo official documentation.
Enter the Tengine container.
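One way to enter the container is with kubectl exec; the Deployment name and namespace below are assumptions, so adjust them to your actual deployment:

kubectl exec -it deploy/tengine -n ghippo-system -- bash

Once inside the container, run the following commands: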
# Go to the nginx configuration file directory\ncd /usr/local/nginx/conf\n\n# Create the cert folder to store TLS certificates\nmkdir cert\n\n# Copy the SM2 and RSA TLS certificates to the `/usr/local/nginx/conf/cert` directory\ncp sm2.*.enc.crt.pem sm2.*.enc.key.pem sm2.*.sig.crt.pem sm2.*.sig.key.pem /usr/local/nginx/conf/cert\ncp rsa.*.crt.pem rsa.*.key.pem /usr/local/nginx/conf/cert\n\n# Edit the nginx.conf configuration\nvim nginx.conf\n...\nserver {\n listen 443 ssl;\n proxy_http_version 1.1;\n # Enable Guomi function to support SM2 TLS certificates\n enable_ntls on;\n\n # RSA certificate\n # If your browser does not support Guomi certificates, you can enable this option, and Tengine will automatically recognize the user's browser and use RSA certificates for fallback\n ssl_certificate /usr/local/nginx/conf/cert/rsa.*.crt.pem;\n ssl_certificate_key /usr/local/nginx/conf/cert/rsa.*.key.pem;\n\n # Configure two pairs of SM2 certificates for encryption and signature\n # SM2 signature certificate\n ssl_sign_certificate /usr/local/nginx/conf/cert/sm2.*.sig.crt.pem;\n ssl_sign_certificate_key /usr/local/nginx/conf/cert/sm2.*.sig.key.pem;\n # SM2 encryption certificate\n ssl_enc_certificate /usr/local/nginx/conf/cert/sm2.*.enc.crt.pem;\n ssl_enc_certificate_key /usr/local/nginx/conf/cert/sm2.*.enc.key.pem;\n ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;\n\n location / {\n proxy_set_header Host $http_host;\n proxy_set_header X-Real-IP $remote_addr;\n proxy_set_header REMOTE-HOST $remote_addr;\n proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n # You need to modify the address here to the address of the Istio ingress gateway\n # For example, proxy_pass https://istio-ingressgateway.istio-system.svc.cluster.local\n # Or proxy_pass https://demo-dev.daocloud.io\n proxy_pass https://istio-ingressgateway.istio-system.svc.cluster.local;\n }\n}\n
"},{"location":"en/admin/ghippo/install/gm-gateway.html#reload-the-configuration-of-the-guomi-gateway","title":"Reload the Configuration of the Guomi Gateway","text":"
You can deploy a web browser that supports Guomi certificates, for example Samarium Browser, and then access the UI through Tengine to verify whether the Guomi certificate takes effect.
Before a user starts using a new system, the system has no data about the user and cannot identify them. To identify the user's identity and bind user data, the user needs an account that uniquely identifies that identity.
AI platform assigns the user an account with certain permissions when the administrator creates a new user in User and Access Control . All actions performed by this user are associated with that account.
The user logs in with the account and password, and the system verifies whether the identity is valid. If the verification passes, the user is logged in successfully.
Note
If the user performs no operations within 24 hours after logging in, the session is automatically logged out. If the logged-in user remains active, the login state persists.
The simple process of user login is shown in the figure below.
Set environment variables for easier use in the following steps.
# Your reverse proxy address, for example `export Suanova_PROXY=\"https://demo-alpha.daocloud.io\"` \nexport Suanova_PROXY=\"https://domain:port\"\n\n# Helm --set parameter backup file\nexport GHIPPO_VALUES_BAK=\"ghippo-values-bak.yaml\"\n\n# Get the current version of ghippo\nexport GHIPPO_HELM_VERSION=$(helm get notes ghippo -n ghippo-system | grep \"Chart Version\" | awk -F ': ' '{ print $2 }')\n
Backup the --set parameters.
helm get values ghippo -n ghippo-system -o yaml > ${GHIPPO_VALUES_BAK}\n
The access key can be used to access the OpenAPI and for continuous delivery. You can obtain a key in the Personal Center and access the API by following the steps below.
Log in to AI platform, find Personal Center in the drop-down menu in the upper right corner, and you can manage the access key of the account on the Access Keys page.
Info
Access key is displayed only once. If you forget your access key, you will need to create a new key.
"},{"location":"en/admin/ghippo/personal-center/accesstoken.html#use-the-key-to-access-api","title":"Use the key to access API","text":"
When accessing the AI platform OpenAPI, add the header Authorization:Bearer ${token} to the request to identify the visitor, where ${token} is the key obtained in the previous step. Request example:
curl -X GET -H 'Authorization:Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IkRKVjlBTHRBLXZ4MmtQUC1TQnVGS0dCSWc1cnBfdkxiQVVqM2U3RVByWnMiLCJ0eXAiOiJKV1QifQ.eyJleHAiOjE2NjE0MTU5NjksImlhdCI6MTY2MDgxMTE2OSwiaXNzIjoiZ2hpcHBvLmlvIiwic3ViIjoiZjdjOGIxZjUtMTc2MS00NjYwLTg2MWQtOWI3MmI0MzJmNGViIiwicHJlZmVycmVkX3VzZXJuYW1lIjoiYWRtaW4iLCJncm91cHMiOltdfQ.RsUcrAYkQQ7C6BxMOrdD3qbBRUt0VVxynIGeq4wyIgye6R8Ma4cjxG5CbU1WyiHKpvIKJDJbeFQHro2euQyVde3ygA672ozkwLTnx3Tu-_mB1BubvWCBsDdUjIhCQfT39rk6EQozMjb-1X1sbLwzkfzKMls-oxkjagI_RFrYlTVPwT3Oaw-qOyulRSw7Dxd7jb0vINPq84vmlQIsI3UuTZSNO5BCgHpubcWwBss-Aon_DmYA-Et_-QtmPBA3k8E2hzDSzc7eqK0I68P25r9rwQ3DeKwD1dbRyndqWORRnz8TLEXSiCFXdZT2oiMrcJtO188Ph4eLGut1-4PzKhwgrQ' https://demo-dev.daocloud.io/apis/ghippo.io/v1alpha1/users?page=1&pageSize=10 -k\n
This section explains how to set the interface language. Currently, Chinese and English are supported.
Language setting is the entry point for the platform's multilingual services. The platform is displayed in Chinese by default. Users can switch the platform language by selecting English or by having the browser language preference detected automatically. Each user's language setting is independent, and switching it will not affect other users.
The platform provides three language options: Chinese, English, and automatic detection of your browser language preference.
The operation steps are as follows.
Log in to the AI platform with your username/password. Click Global Management at the bottom of the left navigation bar.
Click the username in the upper right corner and select Personal Center .
Function description: this page is used to fill in the email address and change the login password.
Email: after the administrator configures the mail server address, the user can click the Forget Password button on the login page and enter their email address there to retrieve the password.
Password: the password used to log in to the platform; it is recommended to change it regularly.
The specific operation steps are as follows:
Click the username in the upper right corner and select Personal Center .
Click the Security Settings tab. Fill in your email address or change the login password.
"},{"location":"en/admin/ghippo/personal-center/ssh-key.html","title":"Configuring SSH Public Key","text":"
This article explains how to configure SSH public key.
Before generating a new SSH key, please check if you need to use an existing SSH key stored in the home directory of the local user. For Linux and Mac, use the following command to view existing public keys. Windows users can use the following command in WSL (requires Windows 10 or above) or Git Bash to view the generated public keys.
ED25519 Algorithm:
cat ~/.ssh/id_ed25519.pub\n
RSA Algorithm:
cat ~/.ssh/id_rsa.pub\n
If a long string starting with ssh-ed25519 or ssh-rsa is returned, it means that a local public key already exists. You can skip Step 2 Generate SSH Key and proceed directly to Step 3.
If Step 1 does not return the specified content string, it means that there is no available SSH key locally and a new SSH key needs to be generated. Please follow these steps:
Access the terminal (Windows users please use WSL or Git Bash), and run ssh-keygen -t.
Enter the key algorithm type and an optional comment.
The comment will appear in the .pub file and can generally use the email address as the comment content.
To generate a key pair based on the ED25519 algorithm, use the following command:
ssh-keygen -t ed25519 -C \"<comment>\"\n
To generate a key pair based on the RSA algorithm, use the following command:
ssh-keygen -t rsa -C \"<comment>\"\n
Press Enter to choose the SSH key generation path.
Taking the ED25519 algorithm as an example, the default path is as follows:
Generating public/private ed25519 key pair.\nEnter file in which to save the key (/home/user/.ssh/id_ed25519):\n
The default key generation path is /home/user/.ssh/id_ed25519, and the proper public key is /home/user/.ssh/id_ed25519.pub.
Set a passphrase for the key.
Enter passphrase (empty for no passphrase):\nEnter same passphrase again:\n
The passphrase is empty by default, and you can choose to use a passphrase to protect the private key file. If you do not want to enter a passphrase every time you access the repository using the SSH protocol, you can enter an empty passphrase when creating the key.
Press Enter to complete the key pair creation.
"},{"location":"en/admin/ghippo/personal-center/ssh-key.html#step-3-copy-the-public-key","title":"Step 3. Copy the Public Key","text":"
In addition to manually copying the generated public key information printed on the command line, you can use the following commands to copy the public key to the clipboard, depending on the operating system.
Windows (in WSL or Git Bash):
cat ~/.ssh/id_ed25519.pub | clip\n
Mac:
tr -d '\\n'< ~/.ssh/id_ed25519.pub | pbcopy\n
GNU/Linux (requires xclip):
xclip -sel clip < ~/.ssh/id_ed25519.pub\n
"},{"location":"en/admin/ghippo/personal-center/ssh-key.html#step-4-set-the-public-key-on-ai-platform-platform","title":"Step 4. Set the Public Key on AI platform Platform","text":"
Log in to the AI platform UI page and select Profile -> SSH Public Key in the upper right corner of the page.
Add the generated SSH public key information.
SSH public key content.
Public key title: Supports customizing the public key name for management differentiation.
Expiration: Set the expiration period for the public key. After it expires, the public key will be automatically invalidated and cannot be used. If not set, it will be permanently valid.
The About page primarily showcases the latest versions of each module, highlights the open source software used, and expresses gratitude to the technical team via an animated video.
Steps to view are as follows:
Log in to AI platform as a user with Admin role. Click Global Management at the bottom of the left navigation bar.
Click Settings , select About , and check the product version, open source software statement, and development teams.
In AI platform, you have the option to customize the appearance of the login page, top navigation bar, bottom copyright and ICP registration to enhance your product recognition.
"},{"location":"en/admin/ghippo/platform-setting/appearance.html#customizing-login-page-and-top-navigation-bar","title":"Customizing Login Page and Top Navigation Bar","text":"
To get started, log in to AI platform as a user with the admin role and navigate to Global Management -> Settings found at the bottom of the left navigation bar.
Select Appearance . On the Custom your login page tab, modify the icon and text of the login page as needed, then click Save .
Log out and refresh the login page to see the configured effect.
On the Advanced customization tab, you can modify the login page, navigation bar, copyright, and ICP registration with CSS.
Note
If you wish to restore the default settings, simply click Revert . This action will discard all customized settings.
Advanced customization allows you to modify the color, font spacing, and font size of the entire container platform using CSS styles. Please note that familiarity with CSS syntax is required.
To reset any advanced customizations, delete the contents of the black input box or click the Revert button.
If a user forgets the password, AI platform will send an email to the user to verify the email address, ensuring the user is acting in person. For AI platform to be able to send email, you need to provide your mail server address first.
The specific operation steps are as follows:
Log in to AI platform as a user with admin role. Click Global Management at the bottom of the left navigation bar.
Click Settings , select Mail Server Settings .
Complete the following fields to configure the mail server:
| Field | Description | Example |
| --- | --- | --- |
| SMTP server address | SMTP server address that can provide mail service | smtp.163.com |
| SMTP server port | Port for sending mail | 25 |
| Username | Name of the SMTP user | test@163.com |
| Password | Password for the SMTP account | 123456 |
| Sender's email address | Sender's email address | test@163.com |
| Use SSL secure connection | SSL can be used to encrypt emails, improving the security of information transmitted via email; usually requires configuring a certificate for the mail server | Disable |
After the configuration is complete, click Save , and click Test Mail Server .
A message indicating that the mail has been successfully sent appears in the upper right corner of the screen, indicating that the mail server has been successfully set up.
Q: What is the reason why the user still cannot retrieve the password after the mail server is set up?
Answer: The user may not have set an email address, or may have set a wrong one. In this case, a user with the admin role can find the user by username in Global Management -> Access Control and set a new login password for them.
If the mail server is not connected, please check whether the mail server address, username and password are correct.
New passwords must differ from the most recent historical password.
Users are required to change their passwords upon expiration.
Passwords must not match the username.
Passwords cannot be the same as the user's email address.
Customizable password rules.
Customizable minimum password length.
"},{"location":"en/admin/ghippo/platform-setting/security.html#access-control-policy","title":"Access Control Policy","text":"
Session Timeout Policy: Users will be automatically logged out after a period of inactivity lasting x hours.
Account Lockout Policy: Accounts will be locked after multiple failed login attempts within a specified time frame.
Login/Logout Policy: Users will be logged out when closing the browser.
To configure the password and access control policies, navigate to global management, then click Settings -> Security Policy in the left navigation bar.
Operation Management provides a visual representation of the total usage and utilization rates of CPU, memory, storage and GPU across various dimensions such as cluster, node, namespace, pod, and workspace within a specified time range on the platform. It also automatically calculates platform consumption information based on usage, usage time, and unit price. By default, the module enables all report statistics, but platform administrators can manually enable or disable individual reports. After enabling or disabling, the platform will start or stop collecting report data within a maximum of 20 minutes. Previously collected data will still be displayed normally. Operation Management data can be retained on the platform for up to 365 days. Statistical data exceeding this retention period will be automatically deleted. You can also download reports in CSV or Excel format for further statistics and analysis.
Operation Management is available only for the Standard Edition and above. It is not supported in the Community Edition.
You need to install or upgrade the Operations Management module first, and then you can experience report management and billing metering.
Report Management provides data statistics for cluster, node, pods, workspace, and namespace across five dimensions: CPU Utilization, Memory Utilization, Storage Utilization, GPU Computing Power Utilization, and GPU Memory Utilization. It also integrates with the audit and alert modules to support the statistical management of audit and alert data, supporting a total of seven types of reports.
Accounting & Billing provides billing statistics for clusters, nodes, pods, namespaces, and workspaces on the platform. It calculates the consumption for each resource during the statistical period based on the usage of CPU, memory, storage and GPU, as well as user-configured prices and currency units. Depending on the selected time span, such as monthly, quarterly, or annually, it can quickly calculate the actual consumption for that period.
Accounting and billing further process the usage data of resources based on reports. You can manually set the unit price and currency unit for CPU, memory, GPU and storage. After setting, the system will automatically calculate the expenses of clusters, nodes, pods, namespaces, and workspaces over a period. You can adjust the period freely and export billing reports in Excel or CSV format after filtering by week, month, quarter, or year.
"},{"location":"en/admin/ghippo/report-billing/billing.html#billing-rules-and-effective-time","title":"Billing Rules and Effective Time","text":"
Billing Rules: By default, billing is based on the maximum of request and actual usage (a worked example follows this list).
Effective Time: Effective the next day; the fees incurred on that day are calculated based on the unit price and quantity obtained at midnight of the next day.
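A small worked example of the default rule under assumed numbers (the unit price of 0.05 per core-hour is purely illustrative):

# A pod requests 4 CPU cores but actually uses 2.5 cores during one day;
# billing takes max(request, usage) = 4 cores, so the day's CPU cost is:
echo "4 * 0.05 * 24" | bc   # 4.80 for that day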
Support customizing the billing unit for CPU, memory, storage and GPU, as well as the currency unit.
Support custom querying of billing data within a year, automatically calculating the billing situation for the selected time period.
Support exporting billing reports in CSV and Excel formats.
Support enabling/disabling individual billing reports. After enabling/disabling, the platform will start/stop collecting data within 20 minutes, and past collected data will still be displayed normally.
Support selective display of billing data for CPU, total memory, storage, GPU and total.
Cluster Billing Report: Displays the CPU billing, memory billing, storage billing, GPU billing and overall billing situation for all clusters within a certain period, as well as the number of nodes in that cluster. By clicking the number of nodes, you can quickly enter the node billing report and view the billing situation of nodes in that cluster during that time period.
Node Billing Report: Displays the CPU billing, memory billing, storage billing, GPU billing and overall billing situation for all nodes within a certain period, as well as the IP, type, and belonging cluster of nodes.
Pod Report: Displays the CPU billing, memory billing, storage billing, GPU billing and overall billing situation for all pods within a certain period, as well as the namespace, cluster, and workspace to which the pod belongs.
Workspace Billing Report: Displays the CPU billing, memory billing, storage billing, GPU billing and overall billing situation for all workspaces within a certain period, as well as the number of namespaces and pods. By clicking the number of namespaces, you can quickly enter the namespace billing report and view the billing situation of namespaces in that workspace during that time period; the same method can be used to view the billing situation of pods in that workspace during that time period.
Namespace Billing Report: Displays the CPU billing, memory billing, storage billing, GPU billing and overall billing situation for all namespaces within a certain period, as well as the number of pods, the belonging cluster, and workspace. By clicking the number of pods, you can quickly enter the pod billing report and view the billing situation of pods in that namespace during that time period.
Report management visually displays statistical data across clusters, nodes, pods, workspaces, namespaces, audits, and alarms. This data provides a reliable foundation for platform billing and utilization optimization.
Supports custom queries for statistical data within a year
Allows exporting reports in CSV and Excel formats
Supports enabling/disabling individual reports; once toggled, the platform will start/stop data collection within 20 minutes, but previously collected data will still be displayed.
Displays maximum, minimum, and average values for CPU utilization, memory utilization, storage utilization, and GPU memory utilization
Cluster Report: Displays the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization for all clusters during a specific time period, as well as the number of nodes under the cluster. You can quickly access the node report by clicking on the node count and view the utilization of nodes under the cluster during that period.
Node Report: Displays the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization for all nodes during a specific time period, along with the node's IP, type, and affiliated cluster.
Pod Report: Shows the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization for all pods during a specific time period, as well as the pod's namespace, affiliated cluster, and workspace.
Workspace Report: Displays the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization for all workspaces during a specific time period, along with the number of namespaces and pods. You can quickly access the namespace report by clicking on the namespace count and view the utilization of namespaces under the workspace during that period; similarly, you can view the utilization of pods under the workspace.
Namespace Report: Displays the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization for all namespaces during a specific time period, as well as the number of pods, affiliated clusters, and workspaces. You can quickly access the pod report by clicking on the pod count and view the utilization of pods within the namespace during that period.
Audit Report: Divided into user actions and resource operations. The user action report mainly counts the number of operations by a single user during a period, including successful and failed attempts; The resource operation report mainly counts the number of operations on a type of resource by all users.
Alarm Report: Displays the number of alarms for all nodes during a specific period, including the occurrences of fatal, severe, and warning alarms.
Log in to AI platform as a user with the Admin role. Click Global Management -> Operations Management at the bottom of the left sidebar.
After entering Operations Management, switch between different menus to view reports on clusters, nodes, and pods.
"},{"location":"en/admin/ghippo/troubleshooting/ghippo01.html","title":"Unable to start istio-ingressgateway when restarting the cluster (virtual machine)?","text":"
The error message is as shown in the following image:
Possible cause: The jwtsUri address of the RequestAuthentication CR cannot be accessed, causing istiod to be unable to push the configuration to istio-ingressgateway (This bug can be avoided in Istio 1.15: https://github.com/istio/istio/pull/39341/).
Solution:
Backup the RequestAuthentication ghippo CR.
kubectl get RequestAuthentication ghippo -n istio-system -o yaml > ghippo-ra.yaml\n
Before applying the RequestAuthentication ghippo CR, make sure that ghippo-apiserver and ghippo-keycloak are started correctly.
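A hedged sketch of the recovery sequence implied above: delete the CR so istio-ingressgateway can start, then re-apply the backup once the dependent services are healthy.

kubectl delete RequestAuthentication ghippo -n istio-system
# ... wait until the ghippo-apiserver and ghippo-keycloak pods are Running ...
kubectl apply -f ghippo-ra.yaml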
"},{"location":"en/admin/ghippo/troubleshooting/ghippo02.html","title":"Login loop with error 401 or 403","text":"
This issue occurs when the MySQL database connected to ghippo-keycloak encounters a failure, causing the OIDC Public keys to be reset.
For Global Management version 0.11.1 and above, you can follow these steps to restore normal operation by updating the Global Management configuration file using helm .
# Update helm repository\nhelm repo update ghippo\n\n# Backup ghippo parameters\nhelm get values ghippo -n ghippo-system -o yaml > ghippo-values-bak.yaml\n\n# Get the current deployed ghippo version\nversion=$(helm get notes ghippo -n ghippo-system | grep \"Chart Version\" | awk -F ': ' '{ print $2 }')\n\n# Perform the update operation to make the configuration file take effect\nhelm upgrade ghippo ghippo/ghippo \\\n-n ghippo-system \\\n-f ./ghippo-values-bak.yaml \\\n--version ${version}\n
"},{"location":"en/admin/ghippo/troubleshooting/ghippo03.html","title":"Keycloak Unable to Start","text":""},{"location":"en/admin/ghippo/troubleshooting/ghippo03.html#common-issues","title":"Common Issues","text":""},{"location":"en/admin/ghippo/troubleshooting/ghippo03.html#symptoms","title":"Symptoms","text":"
MySQL is ready with no errors, but after installing Global Management, Keycloak fails to start (restarting more than 10 times).
If the database is MySQL, check if the Keycloak database encoding is UTF8.
Check the network connection from Keycloak to the database, ensure the database resources are sufficient, including but not limited to resource limits, storage space, and physical machine resources.
Check if MySQL resource usage has reached the limit
Check if the number of tables in the MySQL database keycloak is 95. (The number of tables may vary across different versions of Keycloak, so you can compare it with the number of tables in the Keycloak database of the same version in development or testing environments). If the number is fewer, it indicates that there may be an issue with the database table initialization (The command to check the number of tables is: show tables;).
Delete and recreate the Keycloak database with the command CREATE DATABASE IF NOT EXISTS keycloak CHARACTER SET utf8
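A sketch of the checks and the recreate step described above; the MySQL credentials and host are placeholders, and the DROP is only implied by "delete and recreate", so confirm the table initialization is actually broken before running it:

# Count the tables in the keycloak database (around 95 is expected for this Keycloak version)
mysql -u root -p -e "USE keycloak; SHOW TABLES;" | wc -l

# If initialization is broken, drop and recreate the database with UTF8 encoding
mysql -u root -p -e "DROP DATABASE IF EXISTS keycloak; CREATE DATABASE IF NOT EXISTS keycloak CHARACTER SET utf8;"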
You need to upgrade your virtual machine or physical machine CPU to support x86-64-v2 and above, ensuring that the x86 CPU instruction set supports SSE4.2. For details on how to upgrade, you should consult your virtual machine platform provider or your physical machine provider.
For more information, see: https://github.com/keycloak/keycloak/issues/17290
"},{"location":"en/admin/ghippo/troubleshooting/ghippo04.html","title":"Failure to Upgrade Global Management Separately","text":"
If the upgrade fails and includes the following message, you can refer to the Offline Upgrade section to complete the installation of CRDs by following the steps for updating the ghippo crd.
ensure CRDs are installed first\n
"},{"location":"en/admin/ghippo/workspace/folder-permission.html","title":"Description of folder permissions","text":"
Folders have permission mapping capabilities, which can map the permissions of users/groups in this folder to subfolders, workspaces and resources under it.
If a user/group has the Folder Admin role in this folder, it still has the Folder Admin role when mapped to a subfolder, and is mapped to Workspace Admin in the workspaces under it. If a Namespace is bound in Workspace and Folder -> Resource Group , the user/group also becomes Namespace Admin after mapping.
Note
The permission mapping capability of folders will not be applied to shared resources, because sharing is to share the use permissions of the cluster to multiple workspaces, rather than assigning management permissions to workspaces, so permission inheritance and role mapping will not be implemented.
Folders have hierarchical capabilities, so when folders are mapped to departments/suppliers/projects in the enterprise,
If a user/group has administrative authority (Admin) in a first-level department, it also has administrative authority in the second-, third-, and fourth-level departments or projects under it;
If a user/group has access rights (Editor) in a first-level department, it also has access rights in the second-, third-, and fourth-level departments or projects under it;
If a user/group has read-only permission (Viewer) in a first-level department, it also has read-only permission in the second-, third-, and fourth-level departments or projects under it.
| Objects | Actions | Folder Admin | Folder Editor | Folder Viewer |
| --- | --- | --- | --- | --- |
| The folder itself | View | ✓ | ✓ | ✓ |
| The folder itself | Authorization | ✓ | ✗ | ✗ |
| The folder itself | Modify Alias | ✓ | ✗ | ✗ |
| Subfolders under it | Create | ✓ | ✗ | ✗ |
| Subfolders under it | View | ✓ | ✓ | ✓ |
| Subfolders under it | Authorization | ✓ | ✗ | ✗ |
| Subfolders under it | Modify Alias | ✓ | ✗ | ✗ |
| Workspaces under it | Create | ✓ | ✗ | ✗ |
| Workspaces under it | View | ✓ | ✓ | ✓ |
| Workspaces under it | Authorization | ✓ | ✗ | ✗ |
| Workspaces under it | Modify Alias | ✓ | ✗ | ✗ |
| Workspaces under it - Resource Group | View | ✓ | ✓ | ✓ |
| Workspaces under it - Resource Group | Resource binding | ✓ | ✗ | ✗ |
| Workspaces under it - Resource Group | Unbind | ✓ | ✗ | ✗ |
| Workspaces under it - Shared Resources | View | ✓ | ✓ | ✓ |
| Workspaces under it - Shared Resources | New share | ✓ | ✗ | ✗ |
| Workspaces under it - Shared Resources | Unshare | ✓ | ✗ | ✗ |
| Workspaces under it - Shared Resources | Resource Quota | ✓ | ✗ | ✗ |
"},{"location":"en/admin/ghippo/workspace/folders.html","title":"Create/Delete Folders","text":"
Folders have the capability to map permissions, allowing users/user groups to have their permissions in the folder mapped to its sub-folders, workspaces, and resources.
Follow the steps below to create a folder:
Log in to AI platform with a user account having the admin/folder admin role. Click Global Management -> Workspace and Folder at the bottom of the left navigation bar.
Click the Create Folder button in the top right corner.
Fill in the folder name, parent folder, and other information, then click OK to complete creating the folder.
Tip
After successful creation, the folder name will be displayed in the left tree structure, represented by different icons for workspaces and folders.
Note
To edit or delete a specific folder, select it and click ┇ on the right side.
If there are resources bound to the resource group or shared resources within the folder, the folder cannot be deleted. All resources need to be unbound before deleting.
If there are registry resources accessed by the microservice engine module within the folder, the folder cannot be deleted. All access to the registry needs to be removed before deleting the folder.
Shared resources do not necessarily mean that the shared users can use the shared resources without any restrictions. Admin, Kpanda Owner, and Workspace Admin can limit the maximum usage quota of a user through the Resource Quota feature in shared resources. If no restrictions are set, it means the usage is unlimited.
CPU Request (Core)
CPU Limit (Core)
Memory Request (MB)
Memory Limit (MB)
Total Storage Request (GB)
Persistent Volume Claims (PVC)
GPU Type, Spec, Quantity (including but not limited to Nvidia, Ascend, ILLUVATAR, and other GPUs)
A resource (cluster) can be shared among multiple workspaces, and a workspace can use resources from multiple shared clusters simultaneously.
"},{"location":"en/admin/ghippo/workspace/quota.html#resource-groups-and-shared-resources","title":"Resource Groups and Shared Resources","text":"
Cluster resources in both shared resources and resource groups are derived from Container Management. However, different effects will occur when binding a cluster to a workspace or sharing it with a workspace.
Binding Resources
Users/User groups in the workspace will have full management and usage permissions for the cluster. Workspace Admin will be mapped as Cluster Admin. Workspace Admin can access the Container Management module to manage the cluster.
Note
As of now, there are no Cluster Editor and Cluster Viewer roles in the Container Management module. Therefore, Workspace Editor and Workspace Viewer cannot be mapped.
Adding Shared Resources
Users/User groups in the workspace will have usage permissions for the cluster resources.
Unlike resource groups, when sharing a cluster with a workspace, the roles of the users in the workspace will not be mapped to the resources. Therefore, Workspace Admin will not be mapped as Cluster Admin.
This section demonstrates three scenarios related to resource quotas.
Select workspace ws01 and the shared cluster in Workbench, and create a namespace ns01 .
If no resource quotas are set in the shared cluster, there is no need to set resource quotas when creating the namespace.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the CPU request for the namespace must be less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful creation.
"},{"location":"en/admin/ghippo/workspace/quota.html#bind-namespace-to-workspace","title":"Bind Namespace to Workspace","text":"
Prerequisite: Workspace ws01 has added a shared cluster, and the operator has the Workspace Admin + Kpanda Owner or Admin role.
The two methods of binding have the same effect.
Bind the created namespace ns01 to ws01 in Container Management.
If no resource quotas are set in the shared cluster, the namespace ns01 can be successfully bound regardless of whether resource quotas are set.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the namespace ns01 must meet the requirement of CPU requests less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful binding.
Bind the namespace ns01 to ws01 in Global Management.
If no resource quotas are set in the shared cluster, the namespace ns01 can be successfully bound regardless of whether resource quotas are set.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the namespace ns01 must meet the requirement of CPU requests less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful binding.
"},{"location":"en/admin/ghippo/workspace/quota.html#unbind-namespace-from-workspace","title":"Unbind Namespace from Workspace","text":"
The two methods of unbinding have the same effect.
Unbind the namespace ns01 from workspace ws01 in Container Management.
If no resource quotas are set in the shared cluster, unbinding the namespace ns01 will not affect the resource quotas, regardless of whether resource quotas were set for the namespace.
If resource quotas (CPU Request = 100 cores) are set in the shared cluster and the namespace ns01 has its own resource quotas, unbinding will release the proper resource quota.
Unbind the namespace ns01 from workspace ws01 in Global Management.
If no resource quotas are set in the shared cluster, unbinding the namespace ns01 will not affect the resource quotas, regardless of whether resource quotas were set for the namespace.
If resource quotas (CPU Request = 100 cores) are set in the shared cluster and the namespace ns01 has its own resource quotas, unbinding will release the proper resource quota.
"},{"location":"en/admin/ghippo/workspace/res-gp-and-shared-res.html","title":"Differences between Resource Groups and Shared Resources","text":"
Both resource groups and shared resources support cluster binding, but they have significant differences in usage.
"},{"location":"en/admin/ghippo/workspace/res-gp-and-shared-res.html#differences-in-usage-scenarios","title":"Differences in Usage Scenarios","text":"
Cluster Binding for Resource Groups: Resource groups are usually used for batch authorization. After binding a resource group to a cluster, the workspace administrator will be mapped as a cluster administrator and able to manage and use cluster resources.
Cluster Binding for Shared Resources: Shared resources are usually used for resource quotas. A typical scenario is that the platform administrator assigns a cluster to a first-level supplier, who then assigns the cluster to a second-level supplier and sets resource quotas for the second-level supplier.
Note: In this scenario, the platform administrator needs to impose resource restrictions on secondary suppliers. Currently, it is not supported to limit the cluster quota of secondary suppliers by the primary supplier.
"},{"location":"en/admin/ghippo/workspace/res-gp-and-shared-res.html#differences-in-cluster-quota-usage","title":"Differences in Cluster Quota Usage","text":"
Cluster Binding for Resource Groups: The workspace administrator is mapped as the administrator of the cluster and is equivalent to being granted the Cluster Admin role in Container Management-Permission Management. They can have unrestricted access to cluster resources, manage important content such as management nodes, and cannot be subject to resource quotas.
Cluster Binding for Shared Resources: The workspace administrator can only use the quota in the cluster to create namespaces in the Workbench and does not have cluster management permissions. If the workspace is restricted by a quota, the workspace administrator can only create and use namespaces within the quota range.
"},{"location":"en/admin/ghippo/workspace/res-gp-and-shared-res.html#differences-in-resource-types","title":"Differences in Resource Types","text":"
Resource Groups: Can bind to clusters, cluster-namespaces, multiclouds, multicloud namespaces, meshs, and mesh-namespaces.
Shared Resources: Can only bind to clusters.
"},{"location":"en/admin/ghippo/workspace/res-gp-and-shared-res.html#similarities-between-resource-groups-and-shared-resources","title":"Similarities between Resource Groups and Shared Resources","text":"
After binding to a cluster, both resource groups and shared resources can go to the Workbench to create namespaces, which will be automatically bound to the workspace.
A workspace is a resource category that represents a hierarchical relationship of resources. A workspace can contain resources such as clusters, namespaces, and registries. Typically, each workspace corresponds to a project and different resources can be allocated, and different users and user groups can be assigned to each workspace.
Follow the steps below to create a workspace:
Log in to AI platform with a user account having the admin/folder admin role. Click Global Management -> Workspace and Folder at the bottom of the left navigation bar.
Click the Create Workspace button in the top right corner.
Fill in the workspace name, folder assignment, and other information, then click OK to complete creating the workspace.
Tip
After successful creation, the workspace name will be displayed in the left tree structure, represented by different icons for folders and workspaces.
Note
To edit or delete a specific workspace or folder, select it and click ... on the right side.
If the workspace has resources bound through resource groups or shared resources, it cannot be deleted. Unbind all resources before deleting the workspace.
If Microservices Engine has an Integrated Registry under the workspace, the workspace cannot be deleted. Remove the Integrated Registry before deleting the workspace.
If Container Registry has a Registry Space or Integrated Registry under the workspace, the workspace cannot be deleted. Remove the Registry Space and delete the Integrated Registry before deleting the workspace.
"},{"location":"en/admin/ghippo/workspace/ws-folder.html","title":"Workspace and Folder","text":"
Workspace and Folder is a feature that provides resource isolation and grouping, addressing issues related to unified authorization, resource grouping, and resource quotas.
Workspace and Folder involves two concepts: workspaces and folders.
Workspaces allow the management of resources through Authorization , Resource Group , and Shared Resource , enabling users (and user groups) to share resources within the workspace.
Resources
Resources are at the lowest level of the hierarchy in the resource management module. They include clusters, namespaces, pipelines, gateways, and more. All these resources can only have workspaces as their parent level. Workspaces act as containers for grouping resources.
Workspace
A workspace usually refers to a project or environment, and the resources in each workspace are logically isolated from those in other workspaces. You can grant users (groups of users) different access rights to the same set of resources through authorization in the workspace.
Workspaces are at the first level, counting from the bottom of the hierarchy, and contain resources. All resources except shared resources have one and only one parent. All workspaces also have one and only one parent folder.
Resources are grouped by workspace, and there are two grouping modes in workspace, namely Resource Group and Shared Resource .
Resource group
A resource can be added to only one resource group, and each resource group corresponds to exactly one workspace. After a resource is added to a resource group, the Workspace Admin gains management authority over the resource, equivalent to being the owner of the resource.
Shared resource
With shared resources, multiple workspaces can share one or more resources. Resource owners can choose to share their own resources with workspaces. Generally, when sharing, the resource owner limits the amount of resources the receiving workspace can use. After resources are shared, the Workspace Admin only has usage rights within the resource limit and cannot manage the resources or adjust the amount the workspace can use.
Shared resources also impose requirements on the resources themselves: only Cluster resources can be shared. A Cluster Admin can share a Cluster with different workspaces and limit how much of that Cluster each workspace can use.
Workspace Admin can create multiple Namespaces within the resource quota, but the sum of the resource quotas of the Namespaces cannot exceed the resource quota of the Cluster in the workspace. For Kubernetes resources, the only resource type that can be shared currently is Cluster.
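Conceptually, each per-namespace quota behaves like a Kubernetes ResourceQuota object whose totals across all namespaces must stay within the cluster quota shared with the workspace. A minimal sketch of such a quota (the namespace name and values are hypothetical; the platform manages these objects for you):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: workspace-ns-quota
  namespace: dev-team-ns        # hypothetical namespace created from the workspace
spec:
  hard:
    requests.cpu: "20"          # must fit within the CPU amount shared with this workspace
    requests.memory: 64Gi
    pods: "100"
```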
Folders can be used to build enterprise business hierarchy relationships.
Folders are a further grouping mechanism based on workspaces and have a hierarchical structure. A folder can contain workspaces, other folders, or a combination of both, forming a tree-like organizational relationship.
Folders allow you to map your business hierarchy and group workspaces by department. Folders are not directly linked to resources, but indirectly achieve resource grouping through workspaces.
A folder has one and only one parent folder, and the root folder is the highest level of the hierarchy. The root folder has no parent, and folders and workspaces are attached to the root folder.
In addition, users (groups) in folders can inherit permissions from their parents through a hierarchical structure. The permissions of the user in the hierarchical structure come from the combination of the permissions of the current level and the permissions inherited from its parents. The permissions are additive and there is no mutual exclusion.
"},{"location":"en/admin/ghippo/workspace/ws-permission.html","title":"Description of workspace permissions","text":"
The workspace has permission mapping and resource isolation capabilities, and can map the permissions of users/groups in the workspace to the resources under it. If the user/group has the Workspace Admin role in the workspace and the resource Namespace is bound to the workspace-resource group, the user/group will become Namespace Admin after mapping.
Note
The permission mapping capability of the workspace does not apply to shared resources: sharing grants cluster usage permissions to multiple workspaces rather than assigning management permissions to them, so permission inheritance and role mapping are not performed.
Resource isolation is achieved by binding resources to different workspaces. Therefore, resources can be flexibly allocated to each workspace (tenant) with the help of permission mapping, resource isolation, and resource sharing capabilities.
Generally applicable to the following two use cases:
Cluster one-to-one
| Ordinary Cluster | Department/Tenant (Workspace) | Purpose |
| --- | --- | --- |
| Cluster 01 | A | Administration and Usage |
| Cluster 02 | B | Administration and Usage |
Cluster one-to-many
| Cluster | Department/Tenant (Workspace) | Resource Quota |
| --- | --- | --- |
| Cluster 01 | A | 100-core CPU |
| Cluster 01 | B | 50-core CPU |
Authorized users can go to modules such as workbench, microservice engine, middleware, multicloud orchestration, and service mesh to use resources in the workspace. For the operation scope of the roles of Workspace Admin, Workspace Editor, and Workspace Viewer in each module, please refer to the permission description:
Suppose a user John ("John" represents any user who needs to bind resources) has been assigned the Workspace Admin role, or has been granted the workspace's "Resource Binding" permission through a custom role, and wants to bind a specific cluster or namespace to the workspace.
To bind cluster/namespace resources to a workspace, the workspace's "Resource Binding" permission alone is not enough; Cluster Admin permissions are also required.
"},{"location":"en/admin/ghippo/workspace/wsbind-permission.html#granting-authorization-to-john","title":"Granting Authorization to John","text":"
Using the Platform Admin Role, grant John the role of Workspace Admin on the Workspace -> Authorization page.
Then, on the Container Management -> Permissions page, authorize John as a Cluster Admin by Add Permission.
"},{"location":"en/admin/ghippo/workspace/wsbind-permission.html#binding-to-workspace","title":"Binding to Workspace","text":"
Log in to AI platform with John's account. On the Container Management -> Clusters page, John can bind the specified cluster to his own workspace using the Bind Workspace button.
Note
John can only bind clusters or namespaces to a specific workspace in the Container Management module, and cannot perform this operation in the Global Management module.
To bind a namespace to a workspace, you must have at least Workspace Admin and Cluster Admin permissions.
"},{"location":"en/admin/host/createhost.html","title":"Create and Start a Cloud Host","text":"
After the user completes registration and is assigned a workspace, namespace, and resources, they can create and start a cloud host.
"},{"location":"en/admin/host/usehost.html#steps-to-follow","title":"Steps to Follow","text":"
Log in to the AI platform as an administrator.
Navigate to Container Management -> Container Network -> Services, click the service name to enter the service details page, and click Update at the top right corner.
Change the port range to 30900-30999, ensuring there are no conflicts.
Log in to the AI platform as an end user, navigate to the proper service, and check the access port.
Use an SSH client to log in to the cloud host from the external network.
At this point, you can perform various operations on the cloud host.
Next step: Cloud Resource Sharing: Quota Management
The Alert Center is an important feature provided by AI platform that allows users to easily view all active and historical alerts by cluster and namespace through a graphical interface, and search alerts based on severity level (critical, warning, info).
All alerts are triggered based on the threshold conditions set in the preset alert rules. In AI platform, some global alert policies are built-in, but users can also create or delete alert policies at any time, and set thresholds for the following metrics:
CPU usage
Memory usage
Disk usage
Disk reads per second
Disk writes per second
Cluster disk read throughput
Cluster disk write throughput
Network send rate
Network receive rate
Users can also add labels and annotations to alert rules. Alert rules can be classified as active or expired, and certain rules can be enabled/disabled to achieve silent alerts.
When the threshold condition is met, users can configure how they want to be notified, including email, DingTalk, WeCom, webhook, and SMS notifications. All notification message templates can be customized and all messages are sent at specified intervals.
In addition, the Alert Center also supports sending alert messages to designated users through short message services provided by Alibaba Cloud, Tencent Cloud, and more platforms that will be added soon, enabling multiple ways of alert notification.
AI platform Alert Center is a powerful alert management platform that helps users quickly detect and resolve problems in the cluster, improve business stability and availability, and facilitate cluster inspection and troubleshooting.
In addition to the built-in alert policies, AI platform allows users to create custom alert policies. Each alert policy is a collection of alert rules that can be set for clusters, nodes, and workloads. When an alert object reaches the threshold set by any of the rules in the policy, an alert is automatically triggered and a notification is sent.
Taking the built-in alerts as an example, click the first alert policy alertmanager.rules .
You can see that some alert rules have been set under it. You can add more rules under this policy, or edit or delete them at any time. You can also view the historical and active alerts related to this alert policy and edit the notification configuration.
Select Alert Center -> Alert Policies , and click the Create Alert Policy button.
Fill in the basic information, select one or more clusters, nodes, or workloads as the alert objects, and click Next .
The list must have at least one rule. If the list is empty, please Add Rule .
Create an alert rule in the pop-up window, fill in the parameters, and click OK .
Template rules: Pre-defined basic metrics that can monitor CPU, memory, disk, and network.
PromQL rules: Enter a PromQL expression; for the syntax, refer to Prometheus expression queries (see the example after this list).
Duration: After the condition is met and persists for the set duration, the alert policy enters the triggered state.
Alert level: Including emergency, warning, and information levels.
Advanced settings: Custom tags and annotations.
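For illustration, a PromQL rule that fires when a pod uses more than 80% of its CPU limit could look like the expression below. It assumes the standard cAdvisor and kube-state-metrics metrics are being collected; the namespace is a placeholder:

```promql
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)
  / sum(kube_pod_container_resource_limits{resource="cpu", namespace="default"}) by (pod)
  > 0.8
```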
After clicking Next , configure notifications.
After the configuration is complete, click the OK button to return to the Alert Policy list.
Tip
The newly created alert policy is in the Not Triggered state. Once the threshold conditions and duration specified in the rules are met, it will change to the Triggered state.
After filling in the basic information, click Add Rule and select Log Rule as the rule type.
Creating log rules is supported only when the resource object is selected as a node or workload.
Field Explanation:
Filter Condition : Field used to query log content, supports four filtering conditions: AND, OR, regular expression matching, and fuzzy matching.
Condition : Based on the filter condition, enter keywords or matching conditions.
Time Range : Time range for log queries.
Threshold Condition : Enter the alert threshold value in the input box. When the set threshold is reached, an alert will be triggered. Supported comparison operators are: >, ≥, =, ≤, <.
Alert Level : Select the alert level to indicate the severity of the alert.
After filling in the basic information, click Add Rule and select Event Rule as the rule type.
Creating event rules is supported only when the resource object is selected as a workload.
Field Explanation:
Event Rule : Only supports selecting the workload as the resource object.
Event Reason : The available event reasons differ by workload type; multiple event reasons are combined with an "AND" relationship.
Time Range : Detect data generated within this time range. If the threshold condition is reached, an alert event will be triggered.
Threshold Condition : When the generated events reach the set threshold, an alert event will be triggered.
Trend Chart : By default, it shows the trend of events over the last 10 minutes. Each point represents the total number of occurrences within the configured time range ending at that point.
Click ┇ at the right side of the list, then choose Delete from the pop-up menu to delete an alert policy. By clicking on the policy name, you can enter the policy details where you can add, edit, or delete the alert rules under it.
Warning
Deleted alert policies are permanently removed, so please proceed with caution.
The Alert template feature allows platform administrators to create alert templates and rules; business units can then use these templates directly to create alert policies. This reduces the effort business personnel spend managing alert rules and allows alert thresholds to be adjusted to the actual environment.
In the navigation bar, select Alert -> Alert Policy, and click Alert Template at the top.
Click Create Alert Template, and set the name, description, and other information for the Alert template.
| Parameter | Description |
| --- | --- |
| Template Name | The name can only contain lowercase letters, numbers, and hyphens (-), must start and end with a lowercase letter or number, and can be up to 63 characters long. |
| Description | The description can contain any characters and can be up to 256 characters long. |
| Resource Type | Used to specify the matching type of the Alert template. |
| Alert Rule | Supports pre-defining multiple Alert rules, including template rules and PromQL rules. |
Click OK to complete the creation and return to the Alert template list. Click the template name to view the template details.
Alert Inhibition is mainly a mechanism for temporarily hiding or reducing the priority of alerts that do not need immediate attention. The purpose of this feature is to reduce unnecessary alert information that may disturb operations personnel, allowing them to focus on more critical issues.
Alert inhibition recognizes and ignores certain alerts by defining a set of rules to deal with specific conditions. There are mainly the following conditions:
Parent-child inhibition: when a parent alert (for example, a crash on a node) is triggered, all child alerts caused by it (for example, a crash on a container running on that node) are inhibited.
Similar alert inhibition: When alerts have the same characteristics (for example, the same problem on the same instance), multiple alerts are inhibited.
In the left navigation bar, select Alert -> Noise Reduction, and click Inhibition at the top.
Click Create Inhibition, and set the name and rules for the inhibition.
Note
By defining a set of rules through Rule Details and Alert Details, certain alerts are identified and ignored, which avoids sending multiple similar or related alerts that may be triggered by the same issue.
| Parameter | Description |
| --- | --- |
| Name | The name can only contain lowercase letters, numbers, and hyphens (-), must start and end with a lowercase letter or number, and can be up to 63 characters long. |
| Description | The description can contain any characters and can be up to 256 characters long. |
| Cluster | The cluster where the inhibition rule applies. |
| Namespace | The namespace where the inhibition rule applies. |
| Source Alert | Matches alerts by label conditions. Alerts that meet all label conditions are compared against the inhibition conditions; alerts that do not meet the inhibition conditions are sent to the user as usual. Value range explanation:<br />- Alert Level: The level of metric or event alerts, can be set as: Critical, Major, Minor.<br />- Resource Type: The resource type of the alert object, can be set as: Cluster, Node, StatefulSet, Deployment, DaemonSet, Pod.<br />- Labels: Alert identification attributes, consisting of a label name and label value; user-defined values are supported. |
| Inhibition | Specifies the matching conditions for the target alert (the alert to be inhibited). Alerts that meet all the conditions will no longer be sent to the user. |
| Equal | Specifies the list of labels to compare to determine whether the source alert and target alert match. Inhibition is triggered only when the values of the labels specified in equal are exactly the same in the source and target alerts. The equal field is optional. If the equal field is omitted, all labels are used for matching. |
Click OK to complete the creation and return to Inhibition list. Click the inhibition rule name to view the rule details.
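Under the hood, this kind of rule corresponds to an Alertmanager-style inhibition rule. A minimal sketch, with illustrative label names and values:

```yaml
inhibit_rules:
  - source_matchers:
      - severity="Critical"      # the source alert, e.g. a node-level failure
    target_matchers:
      - severity="Minor"         # the alerts to be inhibited
    equal:
      - cluster                  # inhibit only when both alerts carry the same cluster label
      - node                     # ...and the same node label
```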
After entering Insight , click Alert Center -> Notification Settings in the left navigation bar. By default, the email notification object is selected. Click Add email group and add one or more email addresses.
Multiple email addresses can be added.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list to edit or delete the email group.
In the left navigation bar, click Alert Center -> Notification Settings -> WeCom . Click Add Group Robot and add one or more group robots.
For the URL of the WeCom group robot, please refer to the official document of WeCom: How to use group robots.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information , and you can also edit or delete the group robot.
In the left navigation bar, click Alert Center -> Notification Settings -> DingTalk . Click Add Group Robot and add one or more group robots.
For the URL of the DingTalk group robot, please refer to the official document of DingTalk: Custom Robot Access.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information , and you can also edit or delete the group robot.
In the left navigation bar, click Alert Center -> Notification Settings -> Lark . Click Add Group Bot and add one or more group bots.
Note
When signature verification is required in Lark's group bot, you need to fill in the specific signature key when enabling notifications. Refer to Customizing Bot User Guide.
After configuration, you will be automatically redirected to the list page. Click ┇ on the right side of the list and select Send Test Message . You can edit or delete group bots.
In the left navigation bar, click Alert Center -> Notification Settings -> Webhook . Click New Webhook and add one or more Webhooks.
For the Webhook URL and more configuration methods, please refer to the webhook document.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information , and you can also edit or delete the Webhook.
In the left navigation bar, click Alert Center -> Notification Settings -> SMS . Click Add SMS Group and add one or more SMS groups.
Enter the name, the object receiving the message, phone number, and notification server in the pop-up window.
The notification server needs to be created in advance under Notification Settings -> Notification Server . Currently, two cloud SMS providers are supported: Alibaba Cloud and Tencent Cloud. For the specific configuration parameters, refer to your own cloud account information.
After the SMS group is successfully added, the notification list will automatically return. Click ┇ on the right side of the list to edit or delete the SMS group.
The message template feature supports customizing the content of message templates and can notify specified objects in the form of email, WeCom, DingTalk, Webhook, and SMS.
"},{"location":"en/admin/insight/alert-center/msg-template.html#creating-a-message-template","title":"Creating a Message Template","text":"
In the left navigation bar, select Alert -> Message Template .
Insight comes with two default built-in templates in both Chinese and English for user convenience.
Fill in the template content.
Info
Observability comes with predefined message templates. If you need to define the content of the templates, refer to Configure Notification Templates.
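As an illustration, a custom template body could combine the variables documented in the table below. This is purely an example; adjust the wording to your needs:

```text
[{{ .Labels.severity }}] {{ .Labels.alertname }}
Cluster: {{ .Labels.cluster }} / Namespace: {{ .Labels.namespace }}
Target: {{ .Labels.target_type }} {{ .Labels.target }}
Current value: {{ .Annotations.value }}
Started at: {{ .StartsAt }}
Description: {{ .Annotations.description }}
```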
Click the name of a message template to view the details of the message template in the right slider.
| Parameter | Variable | Description |
| --- | --- | --- |
| ruleName | {{ .Labels.alertname }} | The name of the rule that triggered the alert |
| groupName | {{ .Labels.alertgroup }} | The name of the alert policy to which the alert rule belongs |
| severity | {{ .Labels.severity }} | The level of the alert that was triggered |
| cluster | {{ .Labels.cluster }} | The cluster where the resource that triggered the alert is located |
| namespace | {{ .Labels.namespace }} | The namespace where the resource that triggered the alert is located |
| node | {{ .Labels.node }} | The node where the resource that triggered the alert is located |
| targetType | {{ .Labels.target_type }} | The resource type of the alert target |
| target | {{ .Labels.target }} | The name of the object that triggered the alert |
| value | {{ .Annotations.value }} | The metric value at the time the alert notification was triggered |
| startsAt | {{ .StartsAt }} | The time when the alert started to occur |
| endsAt | {{ .EndsAt }} | The time when the alert ended |
| description | {{ .Annotations.description }} | A detailed description of the alert |
| labels | {{ for .labels }} {{ end }} | All labels of the alert; use the for function to iterate through the labels list to get all label contents |
"},{"location":"en/admin/insight/alert-center/msg-template.html#editing-or-deleting-a-message-template","title":"Editing or Deleting a Message Template","text":"
Click ┇ on the right side of the list and select Edit or Delete from the pop-up menu to modify or delete the message template.
Warning
Once a template is deleted, it cannot be recovered, so please use caution when deleting templates.
Alert silence is a feature that allows alerts meeting certain criteria to be temporarily disabled from sending notifications within a specific time range. This feature helps operations personnel avoid receiving too many noisy alerts during certain operations or events, while also allowing for more precise handling of real issues that need to be addressed.
On the Alert Silence page, you can see two tabs: Active Rule and Expired Rule. The former presents the rules currently in effect, while the latter presents those that were defined in the past but have now expired (or have been deleted by the user).
"},{"location":"en/admin/insight/alert-center/silent.html#creating-a-silent-rule","title":"Creating a Silent Rule","text":"
In the left navigation bar, select Alert -> Noise Reduction -> Alert Silence , and click the Create Silence Rule button.
Fill in the parameters for the silent rule, such as cluster, namespace, tags, and time, to define the scope and effective time of the rule, and then click OK .
Return to the rule list, and on the right side of the list, click ┇ to edit or delete a silent rule.
Through the Alert Silence feature, you can flexibly control which alerts should be ignored and when they should be effective, thereby improving operational efficiency and reducing the possibility of false alerts.
Insight supports SMS notifications and currently sends alert messages using integrated Alibaba Cloud and Tencent Cloud SMS services. This article explains how to configure the SMS notification server in Insight. The variables supported in the SMS signature are the default variables in the message template. As the number of SMS characters is limited, it is recommended to choose more explicit variables.
For information on how to configure SMS recipients, refer to the document: Configure SMS Notification Group.
Go to Alert Center -> Notification Settings -> Notification Server .
Click Add Notification Server .
Configure Alibaba Cloud server.
To apply for Alibaba Cloud SMS service, please refer to Alibaba Cloud SMS Service.
Field descriptions:
AccessKey ID : Parameter used by Alibaba Cloud to identify the user.
AccessKey Secret : Key used by Alibaba Cloud to authenticate the user. AccessKey Secret must be kept confidential.
SMS Signature : The SMS service supports creating signatures that meet the requirements according to user needs. When sending SMS, the SMS platform will add the approved SMS signature to the SMS content before sending it to the SMS recipient.
Template CODE : The SMS template is the specific content of the SMS to be sent.
Parameter Template : The SMS body template can contain variables. Users can use variables to customize the SMS content.
Please refer to Alibaba Cloud Variable Specification.
Note
Example: The template content defined in Alibaba Cloud is: ${severity}: ${alertname} triggered at ${startat}. Refer to the configuration in the parameter template.
Configure Tencent Cloud server.
To apply for Tencent Cloud SMS service, please refer to Tencent Cloud SMS.
Field descriptions:
Secret ID : Parameter used by Tencent Cloud to identify the API caller.
SecretKey : Parameter used by Tencent Cloud to authenticate the API caller.
SMS Template ID : The SMS template ID automatically generated by Tencent Cloud system.
Signature Content : The SMS signature content, which is the full name or abbreviation of the actual website name defined in the Tencent Cloud SMS signature.
SdkAppId : SMS SdkAppId, the actual SdkAppId generated after adding the application in the Tencent Cloud SMS console.
Parameter Template : The SMS body template can contain variables. Users can use variables to customize the SMS content. Please refer to: Tencent Cloud Variable Specification.
Note
Example: The template content defined in Tencent Cloud is: {1}: {2} triggered at {3}. Refer to the configuration in the parameter template.
After installing the insight-agent in the cluster, Fluent Bit in insight-agent will collect logs in the cluster by default, including Kubernetes event logs, node logs, and container logs. Fluent Bit has already configured various log collection plugins, related filter plugins, and log output plugins. The working status of these plugins determines whether log collection is normal. Below is a dashboard for Fluent Bit that monitors the working conditions of each Fluent Bit in the cluster and the collection, processing, and export of plugin logs.
Log in to the AI platform, enter Insight , and select Dashboard in the left navigation bar.
Click the dashboard title Overview .
Switch to the insight-system -> Fluent Bit dashboard.
There are several check boxes above the Fluent Bit dashboard to select the input plugin, filter plugin, output plugin, and cluster in which it is located.
| Filter Plugin | Plugin Description |
| --- | --- |
| Lua.audit_log.k8s | Use lua to filter Kubernetes audit logs that meet certain conditions |
Note
There are more filter plugins than Lua.audit_log.k8s; the table above only covers the filters that discard logs.
Log Output Plugin
| Output Plugin | Plugin Description |
| --- | --- |
| es.kube.kubeevent.syslog | Write Kubernetes audit logs, event logs, and syslog logs to the ElasticSearch cluster |
| forward.audit_log | Send Kubernetes audit logs and global management audit logs to Global Management |
| es.skoala | Write request logs and instance logs of the microservice gateway to the ElasticSearch cluster |
"},{"location":"en/admin/insight/best-practice/debug-trace.html","title":"Trace Collection Troubleshooting Guide","text":"
Before attempting to troubleshoot issues with trace data collection, you need to understand the transmission path of trace data. The following is a schematic diagram of the transmission of trace data:
As shown in the above figure, any transmission failure at any step will result in the inability to query trace data. If you find that there is no trace data after completing the application trace enhancement, please perform the following steps:
Log in to the AI platform, enter Insight , and select Dashboard in the left navigation bar.
Click the dashboard title Overview .
Switch to the insight-system -> insight tracing debug dashboard.
You can see that this dashboard is composed of three blocks, each responsible for monitoring the data transmission of different clusters and components. Check whether there are problems with trace data transmission through the generated time series chart.
Display the opentelemetry collector in different worker clusters receiving language probe/SDK trace data and sending aggregated trace data. You can select the cluster where it is located by the Cluster selection box in the upper left corner.
Note
Based on these four time series charts, you can determine whether the opentelemetry collector in this cluster is running normally.
global opentelemetry collector
Display the opentelemetry collector in the Global Service Cluster receiving trace data from the worker cluster's opentelemetry collector and sending aggregated trace data.
Note
The opentelemetry collector in the Global Management Cluster is also responsible for sending audit logs of all worker clusters' global management module and Kubernetes audit logs (not collected by default) to the audit server component of the global management module.
global jaeger collector
Display the jaeger collector in the Global Management Cluster receiving data from the otel collector in the Global Management Cluster and sending trace data to the ElasticSearch cluster.
"},{"location":"en/admin/insight/best-practice/find_root_cause.html","title":"Troubleshooting Service Issues with Insight","text":"
This article serves as a guide on using Insight to identify and analyze abnormal components in AI platform and determine the root causes of component exceptions.
Please note that this article assumes you have a basic understanding of Insight's product features.
"},{"location":"en/admin/insight/best-practice/find_root_cause.html#service-map-identifying-abnormalities-on-a-macro-level","title":"Service Map - Identifying Abnormalities on a Macro Level","text":"
In enterprise microservice architectures, managing a large number of services with complex interdependencies can be challenging. Insight offers service map monitoring, allowing users to gain a high-level overview of the running microservices in the system.
In the example below, you observe that the node insight-server is highlighted in red/yellow on the service map. By hovering over the node, you can see the error rate associated with it. To investigate further and understand why the error rate is not 0 , you can explore more detailed information:
Alternatively, clicking on the service name at the top will take you to the service's overview UI:
"},{"location":"en/admin/insight/best-practice/find_root_cause.html#service-overview-delving-into-detailed-analysis","title":"Service Overview - Delving into Detailed Analysis","text":"
When it becomes necessary to analyze inbound and outbound traffic separately, you can use the filter in the upper right corner to refine the data. After applying the filter, you can observe that the service has multiple operations with a non-zero error rate. To investigate further, you can inspect the traces generated by these operations during a specific time period by clicking on "View Traces":
"},{"location":"en/admin/insight/best-practice/find_root_cause.html#trace-details-identifying-and-eliminating-root-causes-of-errors","title":"Trace Details - Identifying and Eliminating Root Causes of Errors","text":"
In the trace list, you can easily identify traces marked as error (circled in red in the figure above) and examine their details by clicking on the proper trace. The following figure illustrates the trace details:
Within the trace diagram, you can quickly locate the last piece of data in an error state. Expanding the associated logs section reveals the cause of the request error:
Following the above analysis method, you can also identify traces related to other operation errors:
"},{"location":"en/admin/insight/best-practice/find_root_cause.html#lets-get-started-with-your-analysis","title":"Let's Get Started with Your Analysis!","text":""},{"location":"en/admin/insight/best-practice/insight-kafka.html","title":"Kafka + Elasticsearch Stream Architecture for Handling Large-Scale Logs","text":"
As businesses grow, the amount of log data generated by applications increases significantly. To ensure that systems can properly collect and analyze massive amounts of log data, it is common practice to introduce a streaming architecture using Kafka to handle asynchronous data collection. The collected log data flows through Kafka and is consumed by proper components, which then store the data into Elasticsearch for visualization and analysis using Insight.
This article will introduce two solutions:
Fluentbit + Kafka + Logstash + Elasticsearch
Fluentbit + Kafka + Vector + Elasticsearch
Once we integrate Kafka into the logging system, the data flow diagram looks as follows:
Both solutions share similarities but differ in the component used to consume Kafka data. To ensure compatibility with Insight's data analysis, the format of the data consumed from Kafka and written into Elasticsearch should be consistent with the data directly written by Fluentbit to Elasticsearch.
Let's first see how Fluentbit writes logs to Kafka:
Once the Kafka cluster is ready, we need to modify the content of the insight-system namespace's ConfigMap . We will add three Kafka outputs and comment out the original three Elasticsearch outputs:
Assuming the Kafka Brokers address is: insight-kafka.insight-system.svc.cluster.local:9092
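A minimal sketch of one of these Kafka outputs in classic Fluent Bit syntax (the Match pattern and topic name are illustrative; a similar block is repeated for the other two log streams):

```ini
[OUTPUT]
    Name     kafka
    Match    kube.*
    Brokers  insight-kafka.insight-system.svc.cluster.local:9092
    Topics   insight-logs
    Format   json
```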
Next, let's discuss the subtle differences in consuming Kafka data and writing it to Elasticsearch. As mentioned at the beginning of this article, we will explore Logstash and Vector as two ways to consume Kafka data.
"},{"location":"en/admin/insight/best-practice/insight-kafka.html#consuming-kafka-and-writing-to-elasticsearch","title":"Consuming Kafka and Writing to Elasticsearch","text":"
Assuming the Elasticsearch address is: https://mcamel-common-es-cluster-es-http.mcamel-system:9200
"},{"location":"en/admin/insight/best-practice/insight-kafka.html#using-logstash-for-consumption","title":"Using Logstash for Consumption","text":"
If you are familiar with the Logstash technology stack, you can continue using this approach.
When deploying Logstash via Helm, you can add the following pipeline in the logstashPipeline section:
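A minimal sketch of such a pipeline in the Helm values is shown below; the topic, index pattern, and credentials are placeholders and must be aligned with the Kafka output above and your Elasticsearch setup:

```yaml
logstashPipeline:
  logstash.conf: |
    input {
      kafka {
        bootstrap_servers => "insight-kafka.insight-system.svc.cluster.local:9092"
        topics => ["insight-logs"]               # placeholder; must match the Fluent Bit Kafka output
        codec => "json"
      }
    }
    output {
      elasticsearch {
        hosts => ["https://mcamel-common-es-cluster-es-http.mcamel-system:9200"]
        user => "elastic"                        # placeholder credentials
        password => "<password>"
        index => "insight-logs-%{+YYYY.MM.dd}"   # placeholder index pattern
      }
    }
```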
"},{"location":"en/admin/insight/best-practice/insight-kafka.html#checking-if-its-working-properly","title":"Checking if it's Working Properly","text":"
You can verify if the configuration is successful by checking if there are new data in the Insight log query interface or observing an increase in the number of indices in Elasticsearch.
Understand and meet the DeepFlow runtime permissions and kernel requirements
Storage volume is ready
"},{"location":"en/admin/insight/best-practice/integration_deepflow.html#install-deepflow-and-configure-insight","title":"Install DeepFlow and Configure Insight","text":"
Installing DeepFlow components requires two charts:
deepflow: includes components such as deepflow-app, deepflow-server, deepflow-clickhouse, and deepflow-agent. Generally, deepflow is deployed in the global service cluster, so it also installs deepflow-agent together.
deepflow-agent: only includes the deepflow-agent component, used to collect eBPF data and send it to deepflow-server.
DeepFlow needs to be installed in the global service cluster.
Go to the kpanda-global-cluster cluster and click Helm Apps -> Helm Charts in the left navigation bar, select community as the repository, and search for deepflow in the search box:
Click the deepflow card to enter the details page:
Click Install to enter the installation page:
Most of the values have defaults. ClickHouse and MySQL require persistent volumes, and their default sizes are 10Gi. You can search for the relevant settings using the persistence keyword and modify them.
After configuring, click OK to start the installation.
DeepFlow Agent is installed in the sub-cluster using the deepflow-agent chart. It is used to collect eBPF observability data from the sub-cluster and report it to the global service cluster. Similar to installing deepflow, go to Helm Apps -> Helm Charts, select community as the repository, and search for deepflow-agent in the search box. Follow the process to enter the installation page.
Parameter Explanation:
DeployComponent : deployment mode, default is daemonset.
timezone : timezone, default is Asia/Shanghai.
DeepflowServerNodeIPS : addresses of the nodes where deepflow server is installed.
deepflowK8sClusterID : cluster UUID.
agentGroupID : agent group ID.
controllerPort : data reporting port of deepflow server, can be left blank, default is 30035.
clusterNAME : cluster name.
After configuring, click OK to complete the installation.
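For reference, the parameters above roughly correspond to chart values along these lines; every value below is hypothetical and must be replaced with data from your own environment:

```yaml
DeployComponent: daemonset
timezone: "Asia/Shanghai"
DeepflowServerNodeIPS:
  - 10.0.0.10                      # hypothetical address of a deepflow-server node
deepflowK8sClusterID: "<cluster-uuid>"
agentGroupID: "<agent-group-id>"
controllerPort: "30035"
clusterNAME: "<worker-cluster-name>"
```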
After correctly installing DeepFlow, click Network Observability to enter the DeepFlow Grafana UI. It contains a large number of dashboards for viewing and helping analyze issues. Click DeepFlow Templates to browse all available dashboards:
"},{"location":"en/admin/insight/best-practice/sw-to-otel.html","title":"Simplifying Trace Data Integration with OpenTelemetry and SkyWalking","text":"
This article explains how to seamlessly integrate trace data from SkyWalking into the Insight platform, using OpenTelemetry. With zero code modification required, you can transform your existing SkyWalking trace data and leverage Insight's capabilities.
"},{"location":"en/admin/insight/best-practice/sw-to-otel.html#understanding-the-code","title":"Understanding the Code","text":"
To ensure compatibility with different distributed tracing implementations, OpenTelemetry provides a way to incorporate components that standardize data processing and output to various backends. While Jaeger and Zipkin are already available, we have contributed the SkyWalkingReceiver to the OpenTelemetry community. This receiver has been refined and is now suitable for use in production environments without any modifications to your application's code.
Although SkyWalking and OpenTelemetry share similarities, such as using Trace to define a trace and Span to mark the smallest granularity, there are differences in certain details and implementations:
|  | SkyWalking | OpenTelemetry |
| --- | --- | --- |
| Data Structure | Span -> Segment -> Trace | Span -> Trace |
| Attribute Information | Tags | Attributes |
| Application Time | Logs | Events |
| Reference Relationship | References | Links |
Now, let's discuss the steps involved in converting SkyWalking Trace to OpenTelemetry Trace. The main tasks include:
Constructing OpenTelemetry's TraceId and SpanId
Constructing OpenTelemetry's ParentSpanId
Retaining SkyWalking's original TraceId, SegmentId, and SpanId in OpenTelemetry Spans
First, let's look at how to construct the TraceId and SpanId for OpenTelemetry. Both SkyWalking and OpenTelemetry use TraceId to connect distributed service calls and use SpanId to mark each Span, but there are significant differences in the implementation specifications:
Info
View GitHub for the code implementation:
Skywalking Receiver
PR: Create skywalking component folder/structure
PR: add Skywalking tracing receiver impl
Specifically, the possible formats for SkyWalking TraceId and SegmentId are as follows:
In the OpenTelemetry protocol, a Span is unique across all Traces, while in SkyWalking, a Span is only unique within each Segment. This means that to uniquely identify a Span in SkyWalking, it is necessary to combine the SegmentId and SpanId, and convert it to the SpanId in OpenTelemetry.
Info
View GitHub for the code implementation:
Skywalking Receiver
PR: Fix skywalking traceid and spanid convertion
Next, let's see how to construct the ParentSpanId for OpenTelemetry. Within a Segment, the ParentSpanId field in SkyWalking can be directly used to construct the ParentSpanId field in OpenTelemetry. However, when a Trace spans multiple Segments, SkyWalking uses the association information represented by ParentTraceSegmentId and ParentSpanId in the Reference. In this case, the ParentSpanId in OpenTelemetry needs to be constructed using the information in the Reference.
Code implementation can be found on GitHub: Skywalking Receiver
Finally, let's see how to preserve the original TraceId, SegmentId, and SpanId from SkyWalking in the OpenTelemetry Span. Carrying this original information lets us associate the OpenTelemetry TraceId and SpanId displayed in the distributed tracing backend with the SkyWalking TraceId, SegmentId, and SpanId in the application logs. We choose to carry the original SkyWalking TraceId, SegmentId, and ParentSegmentId in the OpenTelemetry Attributes.
Info
View GitHub for the code implementation:
Skywalking Receiver
Add extra link attributes from skywalking ref
After this series of conversions, we have fully transformed the SkyWalking Segment Object into an OpenTelemetry Trace, as shown in the following diagram:
"},{"location":"en/admin/insight/best-practice/sw-to-otel.html#deploying-the-demo","title":"Deploying the Demo","text":"
To demonstrate the complete process of collecting and displaying SkyWalking tracing data using OpenTelemetry, we will use a demo application.
First, deploy the OpenTelemetry Agent and enable the following configuration to ensure compatibility with the SkyWalking protocol:
# otel-agent config\nreceivers:\n skywalking:\n protocols:\n grpc:\n endpoint: 0.0.0.0:11800 # Receive trace data reported by the SkyWalking Agent\n http: \n endpoint: 0.0.0.0:12800 # Receive trace data reported from the front-end / nginx or other HTTP protocols\nservice: \n pipelines: \n traces: \n receivers: [skywalking]\n\n# otel-agent service yaml\nspec:\n ports: \n - name: sw-http\n port: 12800 \n protocol: TCP \n targetPort: 12800 \n - name: sw-grpc \n port: 11800 \n protocol: TCP \n targetPort: 11800\n
Next, modify the connection of your business application from the SkyWalking OAP Service (e.g., oap:11800) to the OpenTelemetry Agent Service (e.g., otel-agent:11800). This will allow you to start receiving trace data from the SkyWalking probe using OpenTelemetry.
To demonstrate the entire process, we will use the SkyWalking-showcase Demo. This demo utilizes the SkyWalking Agent for tracing, and after being processed by OpenTelemetry, the final results are presented using Jaeger:
From the architecture diagram of the SkyWalking Showcase, we can observe that the data remains intact even after standardization by OpenTelemetry. In this trace, the request starts from app/homepage, then two requests /rcmd and /songs/top are initiated simultaneously within the app, distributed to the recommendation and songs services, and finally reach the database for querying, completing the entire request chain.
Additionally, you can view the original SkyWalking Id information on the Jaeger page, which facilitates correlation with application logs:
By following these steps, you can seamlessly integrate SkyWalking trace data into OpenTelemetry and leverage the capabilities of the Insight platform.
"},{"location":"en/admin/insight/best-practice/tail-based-sampling.html","title":"About Trace Sampling and Configuration","text":"
Using distributed tracing, you can observe how requests flow through various systems in a distributed system. Undeniably, it is very useful for understanding service connections, diagnosing latency issues, and providing many other benefits.
However, if most of your requests succeed with no unacceptable delays or errors, do you really need all of this data? In that case, appropriate sampling can give you the right insights without collecting a large volume of, or complete, data.
The idea behind sampling is to control the traces sent to the observability collector, thereby reducing collection costs. Different organizations have different reasons for sampling, including why they want to sample and what types of data they wish to sample. Therefore, we need to customize the sampling strategy:
Cost Management: If a large amount of telemetry data needs to be stored, it incurs higher computational and storage costs.
Focus on Interesting Traces: Different organizations prioritize different data types.
Filter Out Noise: For example, you may want to filter out health checks.
It is important to use consistent terminology when discussing sampling. A Trace or Span is considered sampled or unsampled:
Sampled: A Trace or Span that is processed and stored. It is chosen by the sampler to represent the overall data, so it is considered sampled.
Unsampled: A Trace or Span that is not processed or stored. Because it was not selected by the sampler, it is considered unsampled.
"},{"location":"en/admin/insight/best-practice/tail-based-sampling.html#what-are-the-sampling-options","title":"What Are the Sampling Options?","text":""},{"location":"en/admin/insight/best-practice/tail-based-sampling.html#head-sampling","title":"Head Sampling","text":"
Head sampling is a sampling technique used to make a sampling decision as early as possible. A decision to sample or drop a span or trace is not made by inspecting the trace as a whole.
For example, the most common form of head sampling is Consistent Probability Sampling. This is also referred to as Deterministic Sampling. In this case, a sampling decision is made based on the trace ID and the desired percentage of traces to sample. This ensures that whole traces are sampled - no missing spans - at a consistent rate, such as 5% of all traces.
The upsides to head sampling are:
- Easy to understand
- Easy to configure
- Efficient
- Can be done at any point in the trace collection pipeline
The primary downside to head sampling is that it is not possible to make a sampling decision based on data in the entire trace. This means that while head sampling is effective as a blunt instrument, it is insufficient for sampling strategies that must consider information from the entire system. For example, you cannot ensure that all traces containing an error are sampled with head sampling alone. For this situation and many others, you need tail sampling.
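For reference, fixed-percentage head sampling can be expressed in an OpenTelemetry Collector pipeline with the probabilistic sampler processor. A minimal sketch, with the receiver and exporter names as placeholders:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 5     # keep roughly 5% of traces, decided from the trace ID

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]
```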
Tail sampling is where the decision to sample a trace takes place by considering all or most of the spans within the trace. Tail Sampling gives you the option to sample your traces based on specific criteria derived from different parts of a trace, which isn't an option with Head Sampling.
Some examples of how to use tail sampling include:
Always sampling traces that contain an error
Sampling traces based on overall latency
Sampling traces based on the presence or value of specific attributes on one or more spans in a trace; for example, sampling more traces originating from a newly deployed service
Applying different sampling rates to traces based on certain criteria, such as when traces only come from low-volume services versus traces with high-volume services
As you can see, tail sampling allows for a much higher degree of sophistication in how you sample data. For larger systems that must sample telemetry, it is almost always necessary to use Tail Sampling to balance data volume with the usefulness of that data.
There are three primary downsides to tail sampling today:
Tail sampling can be difficult to implement. Depending on the kind of sampling techniques available to you, it is not always a "set and forget" kind of thing. As your systems change, so too will your sampling strategies. For a large and sophisticated distributed system, rules that implement sampling strategies can also be large and sophisticated.
Tail sampling can be difficult to operate. The component(s) that implement tail sampling must be stateful systems that can accept and store a large amount of data. Depending on traffic patterns, this can require dozens or even hundreds of compute nodes that all utilize resources differently. Furthermore, a tail sampler might need to "fall back" to less computationally intensive sampling techniques if it is unable to keep up with the volume of data it is receiving. Because of these factors, it is critical to monitor tail-sampling components to ensure that they have the resources they need to make the correct sampling decisions.
Tail samplers often end up as vendor-specific technology today. If you're using a paid vendor for Observability, the most effective tail sampling options available to you might be limited to what the vendor offers.
Finally, for some systems, tail sampling might be used in conjunction with Head Sampling. For example, a set of services that produce an extremely high volume of trace data might first use head sampling to sample only a small percentage of traces, and then later in the telemetry pipeline use tail sampling to make more sophisticated sampling decisions before exporting to a backend. This is often done in the interest of protecting the telemetry pipeline from being overloaded.
Insight currently recommends using tail sampling and prioritizes support for tail sampling.
The tail sampling processor samples traces based on a defined set of strategies. However, all spans of a trace must be received by the same collector instance to make effective sampling decisions.
Therefore, adjustments need to be made to the Global OpenTelemetry Collector architecture of Insight to implement the tail sampling strategy.
"},{"location":"en/admin/insight/best-practice/tail-based-sampling.html#specific-changes-to-insight","title":"Specific Changes to Insight","text":"
Introduce an Opentelemetry Collector Gateway component with load balancing capabilities in front of the insight-opentelemetry-collector in the Global cluster, allowing the same group of Traces to be routed to the same Opentelemetry Collector instance based on the TraceID.
Deploy an OTEL COL Gateway component with load balancing capabilities.
If you are using Insight V0.25.x, you can quickly enable this by using the Helm Upgrade parameter --set opentelemetry-collector-gateway.enabled=true, thereby skipping the deployment process described below.
Refer to the following YAML to deploy the component.
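A minimal sketch of the gateway's core configuration, built around the OpenTelemetry Collector loadbalancing exporter (the headless service name and ports below are assumptions for your environment):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: "traceID"        # spans of the same trace go to the same backend collector
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: insight-opentelemetry-collector-headless.insight-system.svc.cluster.local  # assumed service name
        port: "4317"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```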
Tail sampling rules need to be added to the existing insight-otel-collector-config configmap configuration group.
Add the following content in the processor section, and adjust the specific rules as needed, refer to the OTel official example.
........\ntail_sampling:\n decision_wait: 10s # Wait for 10 seconds, traces older than 10 seconds will no longer be processed\n num_traces: 1500000 # Number of traces saved in memory, assuming 1000 traces per second, should not be less than 1000 * decision_wait * 2;\n # Setting it too large may consume too much memory resources, setting it too small may cause some traces to be dropped\n expected_new_traces_per_sec: 10\n policies: # Reporting policies\n [\n {\n name: latency-policy,\n type: latency, # Report traces that exceed 500ms\n latency: {threshold_ms: 500}\n },\n {\n name: status_code-policy,\n type: status_code, # Report traces with ERROR status code\n status_code: {status_codes: [ ERROR ]}\n }\n ]\n......\ntail_sampling: # Composite sampling\n decision_wait: 10s # Wait for 10 seconds, traces older than 10 seconds will no longer be processed\n num_traces: 1500000 # Number of traces saved in memory, assuming 1000 traces per second, should not be less than 1000 * decision_wait * 2;\n # Setting it too large may consume too much memory resources, setting it too small may cause some traces to be dropped\n expected_new_traces_per_sec: 10\n policies: [\n {\n name: debug-worker-cluster-sample-policy,\n type: and,\n and:\n {\n and_sub_policy:\n [\n {\n name: service-name-policy,\n type: string_attribute,\n string_attribute:\n { key: k8s.cluster.id, values: [xxxxxxx] },\n },\n {\n name: trace-status-policy,\n type: status_code,\n status_code: { status_codes: [ERROR] },\n },\n {\n name: probabilistic-policy,\n type: probabilistic,\n probabilistic: { sampling_percentage: 1 },\n }\n ]\n }\n }\n ]\n
Activate the processor in the otel col pipeline within the insight-otel-collector-config configmap:
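For example, the traces pipeline in that ConfigMap would gain a tail_sampling entry roughly like this (the other processors shown are assumptions about the existing pipeline):

```yaml
service:
  pipelines:
    traces:
      processors: [memory_limiter, tail_sampling, batch]   # insert tail_sampling into the existing chain
```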
"},{"location":"en/admin/insight/collection-manag/agent-status.html","title":"insight-agent Component Status Explanation","text":"
In AI platform, Insight acts as a multi-cluster observability product. To achieve unified data collection across multiple clusters, users need to install the Helm App insight-agent (installed by default in the insight-system namespace). Refer to How to Install insight-agent .
In the \"Observability\" -> \"Collection Management\" section, you can view the installation status of insight-agent in each cluster.
Not Installed : insight-agent is not installed in the insight-system namespace of the cluster.
Running : insight-agent is successfully installed in the cluster, and all deployed components are running.
Error : If insight-agent is in this state, it indicates that the helm deployment failed or there are components deployed that are not in a running state.
You can troubleshoot using the following steps:
Run the following command. If the status is deployed , proceed to the next step. If it is failed , it is recommended to uninstall and reinstall it from Container Management -> Helm Apps as it may affect application upgrades:
helm list -n insight-system\n
Run the following command or check the status of the deployed components in Insight -> Data Collection . If there are Pods not in the Running state, restart the containers in an abnormal state.
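For example, you can list the components and restart any abnormal Pod with standard kubectl commands (the Pod name is a placeholder):

```bash
kubectl get pods -n insight-system
# delete the abnormal Pod; its controller (Deployment/DaemonSet) recreates it automatically
kubectl delete pod <pod-name> -n insight-system
```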
The resource consumption of the Prometheus metric collection component in insight-agent is directly proportional to the number of Pods running in the cluster. Please adjust the resources for Prometheus according to the cluster size. Refer to Prometheus Resource Planning.
The storage capacity of the vmstorage metric storage component in the global service cluster is directly proportional to the total number of Pods in the clusters.
Please contact the platform administrator to adjust the disk capacity of vmstorage based on the cluster size. Refer to vmstorage Disk Capacity Planning.
Adjust vmstorage disk based on multi-cluster scale. Refer to vmstorge Disk Expansion.
Data Collection provides a central place to manage and view the insight-agent collection plug-in installed in each cluster. It helps users quickly check the health status of each cluster's collection plug-in and provides a quick entry for configuring collection rules.
The specific operation steps are as follows:
Click in the upper left corner and select Insight -> Data Collection .
You can view the status of all cluster collection plug-ins.
When the cluster is connected to insight-agent and is running, click a cluster name to enter the details.
In the Service Monitor tab, click the shortcut link to jump to Container Management -> CRD to add service discovery rules.
Prometheus primarily uses the Pull approach to retrieve monitoring metrics from target services' exposed endpoints. Therefore, it requires configuring proper scraping jobs to request monitoring data and write it into the storage provided by Prometheus. Currently, Prometheus offers several configurations for these jobs:
Native Job Configuration: This provides native Prometheus job configuration for scraping.
Pod Monitor: In the Kubernetes ecosystem, it allows scraping of monitoring data from Pods using Prometheus Operator.
Service Monitor: In the Kubernetes ecosystem, it allows scraping monitoring data from Endpoints of Services using Prometheus Operator.
# Name of the scraping job, also adds a label (job=job_name) to the scraped metrics\njob_name: <job_name>\n\n# Time interval between scrapes\n[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]\n\n# Timeout for scrape requests\n[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]\n\n# URI path for the scrape request\n[ metrics_path: <path> | default = /metrics ]\n\n# Handling of label conflicts between scraped labels and labels added by the backend Prometheus.\n# true: Retains the scraped labels and ignores conflicting labels from the backend Prometheus.\n# false: Adds an \"exported_<original-label>\" prefix to the scraped labels and includes the additional labels added by the backend Prometheus.\n[ honor_labels: <boolean> | default = false ]\n\n# Whether to use the timestamp generated by the target being scraped.\n# true: Uses the timestamp from the target if available.\n# false: Ignores the timestamp from the target.\n[ honor_timestamps: <boolean> | default = true ]\n\n# Protocol for the scrape request: http or https\n[ scheme: <scheme> | default = http ]\n\n# URL parameters for the scrape request\nparams:\n [ <string>: [<string>, ...] ]\n\n# Set the value of the `Authorization` header in the scrape request through basic authentication. password/password_file are mutually exclusive, with password_file taking precedence.\nbasic_auth:\n [ username: <string> ]\n [ password: <secret> ]\n [ password_file: <string> ]\n\n# Set the value of the `Authorization` header in the scrape request through bearer token authentication. bearer_token/bearer_token_file are mutually exclusive, with bearer_token taking precedence.\n[ bearer_token: <secret> ]\n\n# Set the value of the `Authorization` header in the scrape request through bearer token authentication. bearer_token/bearer_token_file are mutually exclusive, with bearer_token taking precedence.\n[ bearer_token_file: <filename> ]\n\n# Whether the scrape connection should use a TLS secure channel, configure the proper TLS parameters\ntls_config:\n [ <tls_config> ]\n\n# Use a proxy service to scrape the metrics from the target, specify the address of the proxy service.\n[ proxy_url: <string> ]\n\n# Specify the targets using static configuration, see explanation below.\nstatic_configs:\n [ - <static_config> ... ]\n\n# CVM service discovery configuration, see explanation below.\ncvm_sd_configs:\n [ - <cvm_sd_config> ... ]\n\n# After scraping the data, rewrite the labels of the proper target using the relabel mechanism. Executes multiple relabel rules in order.\n# See explanation below for relabel_config.\nrelabel_configs:\n [ - <relabel_config> ... ]\n\n# Before writing the scraped data, rewrite the values of the labels using the relabel mechanism. Executes multiple relabel rules in order.\n# See explanation below for relabel_config.\nmetric_relabel_configs:\n [ - <relabel_config> ... ]\n\n# Limit the number of data points per scrape, 0: no limit, default is 0\n[ sample_limit: <int> | default = 0 ]\n\n# Limit the number of targets per scrape, 0: no limit, default is 0\n[ target_limit: <int> | default = 0 ]\n
The corresponding configuration items are explained as follows:
# Prometheus Operator CRD version\napiVersion: monitoring.coreos.com/v1\n# proper Kubernetes resource type, here it is PodMonitor\nkind: PodMonitor\n# proper Kubernetes Metadata, only the name needs to be concerned. If jobLabel is not specified, the value of the job label in the scraped metrics will be <namespace>/<name>\nmetadata:\n name: redis-exporter # Specify a unique name\n namespace: cm-prometheus # Fixed namespace, no need to modify\n# Describes the selection and configuration of the target Pods to be scraped\n labels:\n operator.insight.io/managed-by: insight # Label indicating managed by Insight\nspec:\n # Specify the label of the proper Pod, pod monitor will use this value as the job label value.\n # If viewing the Pod YAML, use the values in pod.metadata.labels.\n # If viewing Deployment/Daemonset/Statefulset, use spec.template.metadata.labels.\n [ jobLabel: string ]\n # Adds the proper Pod's Labels to the Target's Labels\n [ podTargetLabels: []string ]\n # Limit the number of data points per scrape, 0: no limit, default is 0\n [ sampleLimit: uint64 ]\n # Limit the number of targets per scrape, 0: no limit, default is 0\n [ targetLimit: uint64 ]\n # Configure the Prometheus HTTP endpoints that need to be scraped and exposed. Multiple endpoints can be configured.\n podMetricsEndpoints:\n [ - <endpoint_config> ... ] # See explanation below for endpoint\n # Select the namespaces where the monitored Pods are located. Leave it blank to select all namespaces.\n [ namespaceSelector: ]\n # Select all namespaces\n [ any: bool ]\n # Specify the list of namespaces to be selected\n [ matchNames: []string ]\n # Specify the Label values of the Pods to be monitored in order to locate the target Pods [K8S metav1.LabelSelector](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#labelselector-v1-meta)\n selector:\n [ matchExpressions: array ]\n [ example: - {key: tier, operator: In, values: [cache]} ]\n [ matchLabels: object ]\n [ example: k8s-app: redis-exporter ]\n
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: redis-exporter # Specify a unique name
  namespace: cm-prometheus # Fixed namespace, do not modify
  labels:
    operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.
spec:
  podMetricsEndpoints:
    - interval: 30s
      port: metric-port # Specify the Port Name corresponding to the Prometheus Exporter in the pod YAML
      path: /metrics # Specify the Path corresponding to the Prometheus Exporter; defaults to /metrics if not specified
      relabelings:
        - action: replace
          sourceLabels:
            - instance
          regex: (.*)
          targetLabel: instance
          replacement: "crs-xxxxxx" # Adjust to the corresponding Redis instance ID
        - action: replace
          sourceLabels:
            - instance
          regex: (.*)
          targetLabel: ip
          replacement: "1.x.x.x" # Adjust to the corresponding Redis instance IP
  namespaceSelector: # Select the namespaces where the monitored Pods are located
    matchNames:
      - redis-test
  selector: # Specify the Label values of the Pods to be monitored in order to locate the target pods
    matchLabels:
      k8s-app: redis-exporter
```
The corresponding configuration items are explained as follows:
# Prometheus Operator CRD version\napiVersion: monitoring.coreos.com/v1\n# proper Kubernetes resource type, here it is ServiceMonitor\nkind: ServiceMonitor\n# proper Kubernetes Metadata, only the name needs to be concerned. If jobLabel is not specified, the value of the job label in the scraped metrics will be the name of the Service.\nmetadata:\n name: redis-exporter # Specify a unique name\n namespace: cm-prometheus # Fixed namespace, no need to modify\n# Describes the selection and configuration of the target Pods to be scraped\n labels:\n operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.\nspec:\n # Specify the label(metadata/labels) of the proper Pod, service monitor will use this value as the job label value.\n [ jobLabel: string ]\n # Adds the Labels of the proper service to the Target's Labels\n [ targetLabels: []string ]\n # Adds the Labels of the proper Pod to the Target's Labels\n [ podTargetLabels: []string ]\n # Limit the number of data points per scrape, 0: no limit, default is 0\n [ sampleLimit: uint64 ]\n # Limit the number of targets per scrape, 0: no limit, default is 0\n [ targetLimit: uint64 ]\n # Configure the Prometheus HTTP endpoints that need to be scraped and exposed. Multiple endpoints can be configured.\n endpoints:\n [ - <endpoint_config> ... ] # See explanation below for endpoint\n # Select the namespaces where the monitored Pods are located. Leave it blank to select all namespaces.\n [ namespaceSelector: ]\n # Select all namespaces\n [ any: bool ]\n # Specify the list of namespaces to be selected\n [ matchNames: []string ]\n # Specify the Label values of the Pods to be monitored in order to locate the target Pods [K8S metav1.LabelSelector](https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#labelselector-v1-meta)\n selector:\n [ matchExpressions: array ]\n [ example: - {key: tier, operator: In, values: [cache]} ]\n [ matchLabels: object ]\n [ example: k8s-app: redis-exporter ]\n
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: go-demo # Specify a unique name
  namespace: cm-prometheus # Fixed namespace, do not modify
  labels:
    operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.
spec:
  endpoints:
    - interval: 30s
      # Specify the Port Name corresponding to the Prometheus Exporter in the service YAML
      port: 8080-8080-tcp
      # Specify the Path corresponding to the Prometheus Exporter; defaults to /metrics if not specified
      path: /metrics
      relabelings:
        # ** There must be a label named 'application'. Assuming there is a label named 'app' in k8s,
        # we replace it with 'application' using the relabel 'replace' action
        - action: replace
          sourceLabels: [__meta_kubernetes_pod_label_app]
          targetLabel: application
  # Select the namespace where the monitored service is located
  namespaceSelector:
    matchNames:
      - golang-demo
  # Specify the Label values of the service to be monitored in order to locate the target service
  selector:
    matchLabels:
      app: golang-app-demo
```
The corresponding configuration items are explained as follows:
# The name of the proper port. Please note that it's not the actual port number.\n# Default: 80. Possible values are as follows:\n# ServiceMonitor: corresponds to Service>spec/ports/name;\n# PodMonitor: explained as follows:\n# If viewing the Pod YAML, take the value from pod.spec.containers.ports.name.\n# If viewing Deployment/DaemonSet/StatefulSet, take the value from spec.template.spec.containers.ports.name.\n[ port: string | default = 80]\n# The URI path for the scrape request.\n[ path: string | default = /metrics ]\n# The protocol for the scrape: http or https.\n[ scheme: string | default = http]\n# URL parameters for the scrape request.\n[ params: map[string][]string]\n# The interval between scrape requests.\n[ interval: string | default = 30s ]\n# The timeout for the scrape request.\n[ scrapeTimeout: string | default = 30s]\n# Whether the scrape connection should be made over a secure TLS channel, and the TLS configuration.\n[ tlsConfig: TLSConfig ]\n# Read the bearer token value from the specified file and include it in the headers of the scrape request.\n[ bearerTokenFile: string ]\n# Read the bearer token from the specified K8S secret key. Note that the secret namespace must match the PodMonitor/ServiceMonitor.\n[ bearerTokenSecret: string ]\n# Handling conflicts when scraped labels conflict with labels added by the backend Prometheus.\n# true: Keep the scraped labels and ignore the conflicting labels from the backend Prometheus.\n# false: For conflicting labels, prefix the scraped label with 'exported_<original-label>' and add the labels added by the backend Prometheus.\n[ honorLabels: bool | default = false ]\n# Whether to use the timestamp generated on the target during the scrape.\n# true: Use the timestamp on the target if available.\n# false: Ignore the timestamp on the target.\n[ honorTimestamps: bool | default = true ]\n# Basic authentication credentials. Fill in the values of username/password from the proper K8S secret key. Note that the secret namespace must match the PodMonitor/ServiceMonitor.\n[ basicAuth: BasicAuth ]\n# Scrape the metrics from the target through a proxy server. Specify the address of the proxy server.\n[ proxyUrl: string ]\n# After scraping the data, rewrite the values of the labels on the target using the relabeling mechanism. Multiple relabel rules are executed in order.\n# See explanation below for relabel_config\nrelabelings:\n[ - <relabel_config> ...]\n# Before writing the scraped data, rewrite the values of the proper labels on the target using the relabeling mechanism. Multiple relabel rules are executed in order.\n# See explanation below for relabel_config\nmetricRelabelings:\n[ - <relabel_config> ...]\n
The corresponding configuration items are explained as follows:
# Specifies which labels to take from the original labels for relabeling. The values taken are concatenated using the separator defined in the configuration.\n# For PodMonitor/ServiceMonitor, the proper configmap is sourceLabels.\n[ source_labels: '[' <labelname> [, ...] ']' ]\n# Defines the character used to concatenate the values of the labels to be relabeled. Default is ';'.\n[ separator: <string> | default = ; ]\n\n# When the action is replace/hashmod, target_label is used to specify the proper label name.\n# For PodMonitor/ServiceMonitor, the proper configmap is targetLabel.\n[ target_label: <labelname> ]\n\n# Regular expression used to match the values of the source labels.\n[ regex: <regex> | default = (.*) ]\n\n# Used when action is hashmod, it takes the modulus value based on the MD5 hash of the source label's value.\n[ modulus: <int> ]\n\n# Used when action is replace, it defines the expression to replace when the regex matches. It can use regular expression replacement with regex.\n[ replacement: <string> | default = $1 ]\n\n# Actions performed based on the matched values of regex. The available actions are as follows, with replace being the default:\n# replace: If the regex matches, replace the proper value with the value defined in replacement. Set the value using target_label and add the proper label.\n# keep: If the regex doesn't match, discard the value.\n# drop: If the regex matches, discard the value.\n# hashmod: Take the modulus of the MD5 hash of the source label's value based on the value specified in modulus.\n# Add a new label with a label name specified by target_label.\n# labelmap: If the regex matches, replace the proper label name with the value specified in replacement.\n# labeldrop: If the regex matches, delete the proper label.\n# labelkeep: If the regex doesn't match, delete the proper label.\n[ action: <relabel_action> | default = replace ]\n
Insight uses the Blackbox Exporter provided by Prometheus as a blackbox monitoring solution, allowing detection of target instances via HTTP, HTTPS, DNS, ICMP, TCP, and gRPC. It can be used in the following scenarios:
HTTP/HTTPS: URL/API availability monitoring
ICMP: Host availability monitoring
TCP: Port availability monitoring
DNS: Domain name resolution
On this page, we will explain how to configure custom probers in an existing Blackbox ConfigMap.
The ICMP prober is not enabled by default in Insight because it requires higher permissions. Therefore, we will use the HTTP prober as an example to demonstrate how to modify the ConfigMap to achieve custom HTTP probing.
```yaml
modules:
  ICMP: # Example of ICMP prober configuration
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
  icmp_example: # Example 2 of ICMP prober configuration
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
      source_ip_address: "127.0.0.1"
```
Since ICMP requires higher permissions, we also need to elevate the pod permissions. Otherwise, an operation not permitted error will occur. There are two ways to elevate permissions:
Directly edit the BlackBox Exporter deployment file to enable it (a sketch of the required change is shown below).
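As a minimal sketch of the first approach, assuming the deployment is named insight-agent-prometheus-blackbox-exporter in the insight-system namespace, the elevated permissions could be granted through the container's securityContext:

```yaml
# Hypothetical excerpt of the blackbox-exporter Deployment; adjust names to your environment.
spec:
  template:
    spec:
      containers:
        - name: blackbox-exporter
          securityContext:
            capabilities:
              add: ["NET_RAW"] # allows sending ICMP packets, avoiding the "operation not permitted" error
```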
The following YAML file contains various probers such as HTTP, TCP, SMTP, ICMP, and DNS. You can modify the configuration file of insight-agent-prometheus-blackbox-exporter according to your needs.
Click to view the complete YAML file
kind: ConfigMap\napiVersion: v1\nmetadata:\n name: insight-agent-prometheus-blackbox-exporter\n namespace: insight-system\n labels:\n app.kubernetes.io/instance: insight-agent\n app.kubernetes.io/managed-by: Helm\n app.kubernetes.io/name: prometheus-blackbox-exporter\n app.kubernetes.io/version: v0.24.0\n helm.sh/chart: prometheus-blackbox-exporter-8.8.0\n annotations:\n meta.helm.sh/release-name: insight-agent\n meta.helm.sh/release-namespace: insight-system\ndata:\n blackbox.yaml: |\n modules:\n HTTP_GET:\n prober: http\n timeout: 5s\n http:\n method: GET\n valid_http_versions: [\"HTTP/1.1\", \"HTTP/2.0\"]\n follow_redirects: true\n preferred_ip_protocol: \"ip4\"\n HTTP_POST:\n prober: http\n timeout: 5s\n http:\n method: POST\n body_size_limit: 1MB\n TCP:\n prober: tcp\n timeout: 5s\n # Not enabled by default:\n # ICMP:\n # prober: icmp\n # timeout: 5s\n # icmp:\n # preferred_ip_protocol: ip4\n SSH:\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - expect: \"^SSH-2.0-\"\n POP3S:\n prober: tcp\n tcp:\n query_response:\n - expect: \"^+OK\"\n tls: true\n tls_config:\n insecure_skip_verify: false\n http_2xx_example: # http prober example\n prober: http\n timeout: 5s # probe timeout\n http:\n valid_http_versions: [\"HTTP/1.1\", \"HTTP/2.0\"] # Version in the response, usually default\n valid_status_codes: [] # Defaults to 2xx # Valid range of response codes, probe successful if within this range\n method: GET # request method\n headers: # request headers\n Host: vhost.example.com\n Accept-Language: en-US\n Origin: example.com\n no_follow_redirects: false # allow redirects\n fail_if_ssl: false \n fail_if_not_ssl: false\n fail_if_body_matches_regexp:\n - \"Could not connect to database\"\n fail_if_body_not_matches_regexp:\n - \"Download the latest version here\"\n fail_if_header_matches: # Verifies that no cookies are set\n - header: Set-Cookie\n allow_missing: true\n regexp: '.*'\n fail_if_header_not_matches:\n - header: Access-Control-Allow-Origin\n regexp: '(\\*|example\\.com)'\n tls_config: # tls configuration for https requests\n insecure_skip_verify: false\n preferred_ip_protocol: \"ip4\" # defaults to \"ip6\" # Preferred IP protocol version\n ip_protocol_fallback: false # no fallback to \"ip6\" \n http_post_2xx: # http prober example with body\n prober: http\n timeout: 5s\n http:\n method: POST # probe request method\n headers:\n Content-Type: application/json\n body: '{\"username\":\"admin\",\"password\":\"123456\"}' # body carried during probe\n http_basic_auth_example: # prober example with username and password\n prober: http\n timeout: 5s\n http:\n method: POST\n headers:\n Host: \"login.example.com\"\n basic_auth: # username and password to be added during probe\n username: \"username\"\n password: \"mysecret\"\n http_custom_ca_example:\n prober: http\n http:\n method: GET\n tls_config: # root certificate used during probe\n ca_file: \"/certs/my_cert.crt\"\n http_gzip:\n prober: http\n http:\n method: GET\n compression: gzip # compression method used during probe\n http_gzip_with_accept_encoding:\n prober: http\n http:\n method: GET\n compression: gzip\n headers:\n Accept-Encoding: gzip\n tls_connect: # TCP prober example\n prober: tcp\n timeout: 5s\n tcp:\n tls: true # use TLS\n tcp_connect_example:\n prober: tcp\n timeout: 5s\n imap_starttls: # IMAP email server probe configuration example\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - expect: \"OK.*STARTTLS\"\n - send: \". STARTTLS\"\n - expect: \"OK\"\n - starttls: true\n - send: \". 
capability\"\n - expect: \"CAPABILITY IMAP4rev1\"\n smtp_starttls: # SMTP email server probe configuration example\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - expect: \"^220 ([^ ]+) ESMTP (.+)$\"\n - send: \"EHLO prober\\r\"\n - expect: \"^250-STARTTLS\"\n - send: \"STARTTLS\\r\"\n - expect: \"^220\"\n - starttls: true\n - send: \"EHLO prober\\r\"\n - expect: \"^250-AUTH\"\n - send: \"QUIT\\r\"\n irc_banner_example:\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - send: \"NICK prober\"\n - send: \"USER prober prober prober :prober\"\n - expect: \"PING :([^ ]+)\"\n send: \"PONG ${1}\"\n - expect: \"^:[^ ]+ 001\"\n # icmp_example: # ICMP prober configuration example\n # prober: icmp\n # timeout: 5s\n # icmp:\n # preferred_ip_protocol: \"ip4\"\n # source_ip_address: \"127.0.0.1\"\n dns_udp_example: # DNS query example using UDP\n prober: dns\n timeout: 5s\n dns:\n query_name: \"www.prometheus.io\" # domain name to resolve\n query_type: \"A\" # type proper to this domain\n valid_rcodes:\n - NOERROR\n validate_answer_rrs:\n fail_if_matches_regexp:\n - \".*127.0.0.1\"\n fail_if_all_match_regexp:\n - \".*127.0.0.1\"\n fail_if_not_matches_regexp:\n - \"www.prometheus.io.\\t300\\tIN\\tA\\t127.0.0.1\"\n fail_if_none_matches_regexp:\n - \"127.0.0.1\"\n validate_authority_rrs:\n fail_if_matches_regexp:\n - \".*127.0.0.1\"\n validate_additional_rrs:\n fail_if_matches_regexp:\n - \".*127.0.0.1\"\n dns_soa:\n prober: dns\n dns:\n query_name: \"prometheus.io\"\n query_type: \"SOA\"\n dns_tcp_example: # DNS query example using TCP\n prober: dns\n dns:\n transport_protocol: \"tcp\" # defaults to \"udp\"\n preferred_ip_protocol: \"ip4\" # defaults to \"ip6\"\n query_name: \"www.prometheus.io\"\n
"},{"location":"en/admin/insight/collection-manag/service-monitor.html","title":"Configure service discovery rules","text":"
Insight supports creating the ServiceMonitor CRD through Container Management to meet custom service-discovery collection requirements. Users can use a ServiceMonitor to define the namespace scope for discovery and select the monitored Services through matchLabels .
This is the service endpoint, which represents the address where Prometheus collects metrics. endpoints is an array, and multiple endpoints can be created at the same time. Each endpoint contains three fields, explained below (a short sketch follows the list):
interval : Specifies the collection cycle of Prometheus for the current endpoint . The unit is seconds, set to 15s in this example.
path : Specifies the collection path of Prometheus. In this example, it is specified as /actuator/prometheus .
port : Specifies the port used for collection. Its value is the port name defined in the Service being collected.
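A minimal sketch of the endpoints section matching the fields above; the port name http-metrics is an assumption and must match a port name defined in the monitored Service:

```yaml
endpoints:
  - interval: 15s
    path: /actuator/prometheus
    port: http-metrics # the name of the Service port, not the port number
```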
This is the scope of the Service that needs to be discovered. namespaceSelector contains two mutually exclusive fields, and the meaning of the fields is as follows:
any : Accepts only the value true . When this field is set, it watches all Services that match the selector filtering conditions, regardless of namespace.
matchNames : An array value that specifies the scope of namespace to be monitored. For example, if you only want to monitor the Services in two namespaces, default and insight-system, the matchNames are set as follows:
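For example, a namespaceSelector restricted to those two namespaces would look like the following sketch:

```yaml
namespaceSelector:
  matchNames:
    - default
    - insight-system
```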
| Test Case | Test Method | OCP 4.10 (K8s 1.23.0) | Remarks |
| --- | --- | --- | --- |
| Collect and query web application metrics | Manual | ✅ | - |
| Add custom metric collection | Manual | ✅ | - |
| Query real-time metrics | Manual | ✅ | - |
| Instantaneous index query | Manual | ✅ | - |
| Instantaneous metric API field verification | Manual | ✅ | - |
| Query metrics over a period of time | Manual | ✅ | - |
| Metric API field verification within a period of time | Manual | ✅ | - |
| Batch query cluster CPU, memory usage, total cluster CPU, cluster memory usage, total number of cluster nodes | Manual | ✅ | - |
| Batch query node CPU, memory usage, total node CPU, node memory usage | Manual | ✅ | - |
| Batch query cluster metrics within a period of time | Manual | ✅ | - |
| Metric API field verification within a period of time | Manual | ✅ | - |
| Query Pod log | Manual | ✅ | - |
| Query SVC log | Manual | ✅ | - |
| Query statefulset logs | Manual | ✅ | - |
| Query Deployment logs | Manual | ✅ | - |
| Query NPD log | Manual | ✅ | - |
| Log filtering | Manual | ✅ | - |
| Log fuzzy query - workloadSearch | Manual | ✅ | - |
| Log fuzzy query - podSearch | Manual | ✅ | - |
| Log fuzzy query - containerSearch | Manual | ✅ | - |
| Log accurate query - cluster | Manual | ✅ | - |
| Log accurate query - namespace | Manual | ✅ | - |
| Log query API field verification | Manual | ✅ | - |
| Alert Rule - CRUD operations | Manual | ✅ | - |
| Alert Template - CRUD operations | Manual | ✅ | - |
| Notification Method - CRUD operations | Manual | ✅ | - |
| Link Query | Manual | ✅ | - |
| Topology Query | Manual | ✅ | - |
The table above shows the OpenShift 4.x cluster compatibility test. It lists the test cases, their test method (manual), and the results for OCP 4.10 (Kubernetes 1.23.0).
Please note that this is not an exhaustive list, and additional test scenarios may exist.
| Test Scenario | Test Method | Rancher rke2c1 (K8s 1.24.11) | Notes |
| --- | --- | --- | --- |
| Collect and query web application metrics | Manual | ✅ | - |
| Add custom metric collection | Manual | ✅ | - |
| Query real-time metrics | Manual | ✅ | - |
| Instantaneous index query | Manual | ✅ | - |
| Instantaneous metric API field verification | Manual | ✅ | - |
| Query metrics over a period of time | Manual | ✅ | - |
| Metric API field verification within a period of time | Manual | ✅ | - |
| Batch query cluster CPU, memory usage, total cluster CPU, cluster memory usage, total number of cluster nodes | Manual | ✅ | - |
| Batch query node CPU, memory usage, total node CPU, node memory usage | Manual | ✅ | - |
| Batch query cluster metrics within a period of time | Manual | ✅ | - |
| Metric API field verification within a period of time | Manual | ✅ | - |
| Query Pod log | Manual | ✅ | - |
| Query SVC log | Manual | ✅ | - |
| Query statefulset logs | Manual | ✅ | - |
| Query Deployment logs | Manual | ✅ | - |
| Query NPD log | Manual | ✅ | - |
| Log filtering | Manual | ✅ | - |
| Log fuzzy query - workloadSearch | Manual | ✅ | - |
| Log fuzzy query - podSearch | Manual | ✅ | - |
| Log fuzzy query - containerSearch | Manual | ✅ | - |
| Log accurate query - cluster | Manual | ✅ | - |
| Log accurate query - namespace | Manual | ✅ | - |
| Log query API field verification | Manual | ✅ | - |
| Alert Rule - CRUD operations | Manual | ✅ | - |
| Alert Template - CRUD operations | Manual | ✅ | - |
| Notification Method - CRUD operations | Manual | ✅ | - |
| Link Query | Manual | ✅ | - |
| Topology Query | Manual | ✅ | - |

Dashboard
Grafana is a cross-platform open source visual analysis tool. Insight uses open source Grafana to provide monitoring services, and supports viewing resource consumption from multiple dimensions such as clusters, nodes, and namespaces.
For more information on open source Grafana, see Grafana Official Documentation.
In the Insight / Overview dashboard, you can view the resource usage of multiple clusters and analyze resource usage, network, storage, and more based on dimensions such as namespaces and Pods.
Click the dropdown menu in the upper-left corner of the dashboard to switch between clusters.
Click the lower-right corner of the dashboard to switch the time range for queries.
Insight provides several recommended dashboards that allow monitoring from different dimensions such as nodes, namespaces, and workloads. Switch between dashboards by clicking the insight-system / Insight / Overview section.
Note
For accessing Grafana UI, refer to Access Native Grafana.
For importing custom dashboards, refer to Importing Custom Dashboards.
By using Grafana CRD, you can incorporate the management and deployment of dashboards into the lifecycle management of Kubernetes. This enables version control, automated deployment, and cluster-level management of dashboards. This page describes how to import custom dashboards using CRD and the UI interface.
Log in to the AI platform and go to Container Management . Select kpanda-global-cluster from the cluster list.
Choose Custom Resources from the left navigation bar. Look for the grafanadashboards.integreatly.org CRD in the list and click it to view the details.
Click YAML Create and use the following template. Replace the dashboard JSON in the Json field.
namespace : Specify the target namespace.
name : Provide a name for the dashboard.
label : Mandatory. Set the label as operator.insight.io/managed-by: insight .
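The template itself is not reproduced in this extract; the following is a minimal sketch based on the fields listed above. The apiVersion is assumed from the grafanadashboards.integreatly.org CRD group, and the namespace and JSON content are placeholders:

```yaml
apiVersion: integreatly.org/v1alpha1 # assumption: matches the grafanadashboards.integreatly.org CRD group
kind: GrafanaDashboard
metadata:
  name: my-custom-dashboard # dashboard name
  namespace: insight-system # target namespace (assumption)
  labels:
    operator.insight.io/managed-by: insight # mandatory label
spec:
  json: |
    {
      "title": "My Custom Dashboard",
      "panels": []
    }
```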
Insight only collects data from clusters that have insight-agent installed and running in a normal state. The Overview page provides a summary of resources across multiple clusters:
Alert Statistics: Provides statistics on active alerts across all clusters.
Resource Consumption: Displays the resource usage trends for the top 5 clusters and nodes in the past hour, based on CPU usage, memory usage, and disk usage.
By default, the sorting is based on CPU usage. You can switch the metric to sort clusters and nodes.
Resource Trends: Shows the trends in the number of nodes over the past 15 days and the running trend of pods in the last hour.
Service Requests Ranking: Displays the top 5 services with the highest request latency and error rates, along with their respective clusters and namespaces in the multi-cluster environment.
By default, Insight collects node logs, container logs, and Kubernetes audit logs. In the log query page, you can search for standard output (stdout) logs within the permissions of your login account. This includes node logs, product logs, and Kubernetes audit logs. You can quickly find the desired logs among a large volume of logs. Additionally, you can use the source information and contextual raw data of the logs to assist in troubleshooting and issue resolution.
In the left navigation bar, select Data Query -> Log Query .
After selecting the query criteria, click Search , and the log records in the form of graphs will be displayed. The most recent logs are displayed on top.
In the Filter panel, switch Type and select Node to check the logs of all nodes in the cluster.
In the Filter panel, switch Type and select Event to view the logs generated by all Kubernetes events in the cluster.
Lucene Syntax Explanation:
Use logical operators (AND, OR, NOT, "") to query multiple keywords. For example: keyword1 AND (keyword2 OR keyword3) NOT keyword4.
Use a tilde (~) for fuzzy queries. You can optionally specify a parameter after the "~" to control the similarity of the fuzzy query. If not specified, it defaults to 0.5. For example: error~.
Use wildcards as placeholders: ? matches a single character and * matches zero or more characters. For example: err?r or error*.
Use square brackets [ ] or curly braces { } for range queries. Square brackets [ ] represent a closed interval and include the boundary values. Curly braces { } represent an open interval and exclude the boundary values. Range queries are applicable only to fields that can be sorted, such as numeric fields and date fields. For example timestamp:[2022-01-01 TO 2022-01-31].
For more information, please refer to the Lucene Syntax Explanation.
Clicking on the button next to a log will slide out a panel on the right side where you can view the default 100 lines of context for that log. You can switch the Display Rows option to view more contextual content.
Metric query supports querying the metric data of each container resource, allowing you to view metric trends over time. In addition, advanced query supports native PromQL statements.
In the left navigation bar, click Data Query -> Metrics .
After selecting query conditions such as cluster, type, node, and metric name, click Search , and the corresponding metric chart and data details will be displayed on the right side of the screen.
Tip
A custom time range is supported. You can manually click the Refresh icon or select a preset time interval to refresh.
Modify the Kibana Service to be exposed as a NodePort for access:
```shell
kubectl patch svc -n mcamel-system mcamel-common-es-cluster-masters-kb-http -p '{"spec":{"type":"NodePort"}}'

# After modification, check the NodePort. For example, if the port is 30128, the access URL will be https://{NodeIP in the cluster}:30128
[root@insight-master1 ~]# kubectl get svc -n mcamel-system | grep mcamel-common-es-cluster-masters-kb-http
mcamel-common-es-cluster-masters-kb-http   NodePort   10.233.51.174   <none>   5601:30128/TCP   108m
```
Retrieve the ElasticSearch Secret to log in to Kibana (username is elastic):
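A sketch of retrieving the password; the secret name follows the ECK convention &lt;cluster-name&gt;-es-elastic-user and is an assumption here:

```shell
kubectl get secret -n mcamel-system mcamel-common-es-cluster-masters-es-elastic-user \
  -o jsonpath='{.data.elastic}' | base64 -d
```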
Go to Kibana -> Stack Management -> Index Management and enable the Include hidden indices option to see all indexes. Based on the index sequence numbers, keep the indexes with larger numbers and delete the ones with smaller numbers.
"},{"location":"en/admin/insight/faq/traceclockskew.html","title":"Clock offset in trace data","text":"
In a distributed system, time drift exists between different hosts due to clock skew. Generally speaking, the system times of different hosts at the same moment deviate slightly from one another.
The tracing system is a typical distributed system, and it is affected by this phenomenon when collecting time data. For example, within a single trace, the start time of the server-side span may appear earlier than that of the client-side span. Logically this should be impossible, but because of clock skew between the hosts at the moment the trace data is collected in each service, the recorded times deviate, which eventually leads to the phenomenon shown in the following figure:
The phenomenon in the above figure cannot be eliminated theoretically. However, this phenomenon is rare, and even if it occurs, it will not affect the calling relationship between services.
Currently Insight uses the Jaeger UI to display trace data, and the UI shows a warning when it encounters such a trace:
The Jaeger community is currently trying to mitigate this problem at the UI level.
Through cluster monitoring, you can view the basic information of the cluster, the resource consumption and the trend of resource consumption over a period of time.
Select Infrastructure > Clusters from the left navigation bar. On this page, you can view the following information:
Resource Overview: Provides statistics on the number of normal/all nodes and workloads across multiple clusters.
Fault: Displays the number of alerts generated in the current cluster.
Resource Consumption: Shows the actual usage and total capacity of CPU, memory, and disk for the selected cluster.
Metric Explanations: Describes the trends in CPU, memory, disk I/O, and network bandwidth.
Click Resource Level Monitor to view more metrics of the current cluster.
"},{"location":"en/admin/insight/infra/cluster.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The ratio of the actual CPU usage of all pod resources in the cluster to the total CPU capacity of all nodes. CPU Allocation The ratio of the sum of CPU requests of all pods in the cluster to the total CPU capacity of all nodes. Memory Usage The ratio of the actual memory usage of all pod resources in the cluster to the total memory capacity of all nodes. Memory Allocation The ratio of the sum of memory requests of all pods in the cluster to the total memory capacity of all nodes."},{"location":"en/admin/insight/infra/container.html","title":"Container Insight","text":"
Container insight is the process of monitoring workloads in cluster management. In the list, you can view basic information and status of workloads. On the Workloads details page, you can see the number of active alerts and the trend of resource consumption such as CPU and memory.
Follow these steps to view service monitoring metrics:
Go to the Insight product module.
Select Infrastructure > Workloads from the left navigation bar.
Switch between tabs at the top to view data for different types of workloads.
Click the target workload name to view the details.
Faults: Displays the total number of active alerts for the workload.
Resource Consumption: Shows the CPU, memory, and network usage of the workload.
Monitoring Metrics: Provides the trends of CPU, Memory, Network, and disk usage for the workload over the past hour.
Switch to the Pods tab to view the status of various pods for the workload, including their nodes, restart counts, and other information.
Switch to the JVM monitor tab to view the JVM metrics for each pod.
Note
The JVM monitoring feature only supports the Java language.
To enable the JVM monitoring feature, refer to Getting Started with Monitoring Java Applications.
"},{"location":"en/admin/insight/infra/container.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The sum of CPU usage for all pods under the workload. CPU Requests The sum of CPU requests for all pods under the workload. CPU Limits The sum of CPU limits for all pods under the workload. Memory Usage The sum of memory usage for all pods under the workload. Memory Requests The sum of memory requests for all pods under the workload. Memory Limits The sum of memory limits for all pods under the workload. Disk Read/Write Rate The total number of continuous disk reads and writes per second within the specified time range, representing a performance measure of the number of read and write operations per second on the disk. Network Send/Receive Rate The incoming and outgoing rates of network traffic, aggregated by workload, within the specified time range."},{"location":"en/admin/insight/infra/event.html","title":"Event Query","text":"
AI platform Insight supports event querying by cluster and namespace.
"},{"location":"en/admin/insight/infra/event.html#event-status-distribution","title":"Event Status Distribution","text":"
By default, the events that occurred within the last 12 hours are displayed. You can select a different time range in the upper right corner to view longer or shorter periods. You can also customize the sampling interval from 1 minute to 5 hours.
The event status distribution chart provides a visual representation of the intensity and dispersion of events. This helps in evaluating and preparing for subsequent cluster operations and maintenance tasks. If events are densely concentrated during specific time periods, you may need to allocate more resources or take proper measures to ensure cluster stability and high availability. On the other hand, if events are dispersed, you can effectively schedule other maintenance tasks such as system optimization, upgrades, or handling other tasks during this period.
By considering the event status distribution chart and the selected time range, you can better plan and manage your cluster operations and maintenance work, ensuring system stability and reliability.
"},{"location":"en/admin/insight/infra/event.html#event-count-and-statistics","title":"Event Count and Statistics","text":"
Through important event statistics, you can easily understand the number of image pull failures, health check failures, Pod execution failures, Pod scheduling failures, container OOM (Out-of-Memory) occurrences, volume mounting failures, and the total count of all events. These events are typically categorized as \"Warning\" and \"Normal\".
Select Infrastructure -> Namespaces from the left navigation bar. On this page, you can view the following information:
Switch Namespace: Switch between clusters or namespaces at the top.
Resource Overview: Provides statistics on the number of normal and total workloads within the selected namespace.
Incidents: Displays the number of alerts generated within the selected namespace.
Events: Shows the number of Warning level events within the selected namespace in the past 24 hours.
Resource Consumption: Provides the sum of CPU and memory usage for Pods within the selected namespace, along with the CPU and memory quota information.
"},{"location":"en/admin/insight/infra/namespace.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The sum of CPU usage for Pods within the selected namespace. Memory Usage The sum of memory usage for Pods within the selected namespace. Pod CPU Usage The CPU usage for each Pod within the selected namespace. Pod Memory Usage The memory usage for each Pod within the selected namespace."},{"location":"en/admin/insight/infra/node.html","title":"Node Monitoring","text":"
Through node monitoring, you can get an overview of the current health status of the nodes in the selected cluster and the number of abnormal pods. On the node details page, you can view the number of alerts and the trends of resource consumption such as CPU, memory, and disk.
Probe refers to the use of black-box monitoring to regularly test the connectivity of targets through HTTP, TCP, and other methods, enabling quick detection of ongoing faults.
Insight uses the Prometheus Blackbox Exporter tool to probe the network using protocols such as HTTP, HTTPS, DNS, TCP, and ICMP, and returns the probe results to understand the network status.
Select Infrastructure -> Probes in the left navigation bar.
Click the cluster or namespace dropdown in the table to switch between clusters and namespaces.
The list displays the name, probe method, probe target, connectivity status, and creation time of the probes by default.
The connectivity status can be:
Normal: The probe successfully connects to the target, and the target returns the expected response.
Abnormal: The probe fails to connect to the target, or the target does not return the expected response.
Pending: The probe is attempting to connect to the target.
Supports fuzzy search of probe names.
"},{"location":"en/admin/insight/infra/probe.html#create-a-probe","title":"Create a Probe","text":"
Click Create Probe .
Fill in the basic information and click Next .
Name: The name can only contain lowercase letters, numbers, and hyphens (-), and must start and end with a lowercase letter or number, with a maximum length of 63 characters.
Cluster: Select the cluster for the probe task.
Namespace: The namespace where the probe task is located.
Configure the probe parameters.
Blackbox Instance: Select the blackbox instance responsible for the probe.
Probe Method:
HTTP: Sends HTTP or HTTPS requests to the target URL to check its connectivity and response time. This can be used to monitor the availability and performance of websites or web applications.
TCP: Establishes a TCP connection to the target host and port to check its connectivity and response time. This can be used to monitor TCP-based services such as web servers and database servers.
Other: Supports custom probe methods by configuring ConfigMap. For more information, refer to: Custom Probe Methods
Probe Target: The target address of the probe, supports domain names or IP addresses.
Labels: Custom labels that will be automatically added to Prometheus' labels.
Probe Interval: The interval between probes.
Probe Timeout: The maximum waiting time when probing the target.
After configuring, click OK to complete the creation.
Warning
After the probe task is created, it takes about 3 minutes to synchronize the configuration. During this period, no probes will be performed, and probe results cannot be viewed.
Click ┇ in the operations column and click View Monitoring Dashboard .
| Metric Name | Description |
| --- | --- |
| Current Status Response | Represents the response status code of the HTTP probe request. |
| Ping Status | Indicates whether the probe request was successful. 1 indicates a successful probe request, and 0 indicates a failed probe request. |
| IP Protocol | Indicates the IP protocol version used in the probe request. |
| SSL Expiry | Represents the earliest expiration time of the SSL/TLS certificate. |
| DNS Response (Latency) | Represents the duration of the entire probe process in seconds. |
| HTTP Duration | Represents the duration of the entire process from sending the request to receiving the complete response. |

Edit a Probe
Click ┇ in the operations column and click Edit .
"},{"location":"en/admin/insight/infra/probe.html#delete-a-probe","title":"Delete a Probe","text":"
Click ┇ in the operations column and click Delete .
The AI platform enables the management and creation of multicloud and multiple clusters. Building upon this capability, Insight serves as a unified observability solution for multiple clusters. It collects observability data from multiple clusters by deploying the insight-agent plugin and allows querying of metrics, logs, and trace data through AI platform Insight.
insight-agent is a tool that facilitates the collection of observability data from multiple clusters. Once installed, it automatically collects metrics, logs, and trace data without any modifications.
Clusters created through Container Management come pre-installed with insight-agent. Hence, this guide specifically provides instructions on enabling observability for integrated clusters.
Install insight-agent online
As a unified observability platform for multiple clusters, Insight's resource consumption for certain components is closely related to the scale of the created clusters and the number of integrated clusters. When installing insight-agent, the resources of the corresponding components need to be adjusted according to the cluster size.
Adjust the CPU and memory resources of the Prometheus collection component in insight-agent according to the size of the cluster created or integrated. Please refer to Prometheus resource planning.
As the metric data from multiple clusters is stored centrally, AI platform administrators need to adjust the disk space of vmstorage based on the cluster size. Please refer to vmstorage disk capacity planning.
For instructions on adjusting the disk space of vmstorage, please refer to Expanding vmstorage disk.
Since AI platform supports the management of multicloud and multiple clusters, insight-agent has undergone partial verification. However, there are known conflicts with monitoring components when installing insight-agent in Suanova 4.0 clusters and Openshift 4.x clusters. If you encounter similar issues, please refer to the following documents:
Install insight-agent in Suanova 4.0.x
Install insight-agent in Openshift 4.x
Currently, the insight-agent collection component has undergone functional testing for popular versions of Kubernetes. Please refer to:
Kubernetes cluster compatibility testing
Openshift 4.x cluster compatibility testing
Rancher cluster compatibility testing
"},{"location":"en/admin/insight/quickstart/install/big-log-and-trace.html","title":"Enable Big Log and Big Trace Modes","text":"
The Insight module supports switching logs to Big Log mode and traces to Big Trace mode in order to enhance data-writing capabilities in large-scale environments. This page introduces the following methods for enabling these modes:
Enable or upgrade to Big Log and Big Trace modes through the installer (controlled by the same parameter value in manifest.yaml)
Manually enable Big Log and Big Trace modes through Helm commands
This mode is referred to as the Kafka mode, and the data flow diagram is shown below:
"},{"location":"en/admin/insight/quickstart/install/big-log-and-trace.html#enabling-via-installer","title":"Enabling via Installer","text":"
When deploying/upgrading AI platform using the installer, the manifest.yaml file includes the infrastructures.kafka field. To enable observable Big Log and Big Trace modes, Kafka must be activated:
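A minimal sketch of the relevant fragment of manifest.yaml; the exact field layout under infrastructures is an assumption based on the description above:

```yaml
infrastructures:
  kafka:
    enabled: true # enable Kafka so that Big Log and Big Trace modes are activated
```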
When using a manifest.yaml that enables kafka during installation, Kafka middleware will be installed by default, and Big Log and Big Trace modes will be enabled automatically. The installation command is:
The upgrade also involves modifying the kafka field. However, note that since the old environment was installed with kafka: false, Kafka is not present in the environment. Therefore, you need to specify the upgrade for middleware to install Kafka middleware simultaneously. The upgrade command is:
In the Container Management module, find the cluster, select Helm Apps from the left navigation bar, and find and update the insight-agent.
In Trace Settings, select kafka for output and fill in the correct brokers address.
Note that after the upgrade is complete, you need to manually restart the insight-agent-opentelemetry-collector and insight-opentelemetry-collector components.
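The restart can be done with a rollout restart; the Deployment names below are assumptions based on the component names mentioned above:

```shell
kubectl -n insight-system rollout restart deployment insight-agent-opentelemetry-collector
kubectl -n insight-system rollout restart deployment insight-opentelemetry-collector
```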
When deploying Insight to a Kubernetes environment, proper resource management and optimization are crucial. Insight includes several core components such as Prometheus, OpenTelemetry, FluentBit, Vector, and Elasticsearch. These components, during their operation, may negatively impact the performance of other pods within the cluster due to resource consumption issues. To effectively manage resources and optimize cluster operations, node affinity becomes an important option.
This page describes how to add taints and node affinity so that each component runs on the appropriate nodes, avoiding resource competition or contention and thereby guaranteeing the stability and efficiency of the entire Kubernetes cluster.
"},{"location":"en/admin/insight/quickstart/install/component-scheduling.html#configure-dedicated-nodes-for-insight-using-taints","title":"Configure dedicated nodes for Insight using taints","text":"
Since the Insight Agent includes DaemonSet components, the configuration method described in this section is to have all components except the Insight DaemonSet run on dedicated nodes.
This is achieved by adding taints to the dedicated nodes and using tolerations to match them. More details can be found in the Kubernetes official documentation.
You can refer to the following commands to add and remove taints on nodes:
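The original command listing is not reproduced in this extract; the following is a sketch using the taint key/value referenced later in this section (node.daocloud.io=insight-only):

```shell
# Add a taint to node8 so that only pods tolerating it are scheduled there
kubectl taint nodes node8 node.daocloud.io=insight-only:NoSchedule

# Remove the taint from node8
kubectl taint nodes node8 node.daocloud.io=insight-only:NoSchedule-
```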
There are two ways to schedule Insight components to dedicated nodes:
"},{"location":"en/admin/insight/quickstart/install/component-scheduling.html#1-add-tolerations-for-each-component","title":"1. Add tolerations for each component","text":"
Configure the tolerations for the insight-server and insight-agent Charts respectively:
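A minimal sketch of the values override applied to each chart; the exact key path may differ per chart (a top-level tolerations value is assumed here), so check each chart's values.yaml:

```yaml
# values-override.yaml (sketch) -- applied to insight-server and insight-agent respectively
tolerations:
  - key: node.daocloud.io
    operator: Equal
    value: insight-only
    effect: NoSchedule
```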
"},{"location":"en/admin/insight/quickstart/install/component-scheduling.html#2-configure-at-the-namespace-level","title":"2. Configure at the namespace level","text":"
Allow pods in the insight-system namespace to tolerate the node.daocloud.io=insight-only taint.
Adjust the apiserver configuration file /etc/kubernetes/manifests/kube-apiserver.yaml to include PodTolerationRestriction and PodNodeSelector in the enabled admission plugins, for example:
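A sketch of the relevant kube-apiserver flag; your existing plugin list will differ:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
    - command:
        - kube-apiserver
        - --enable-admission-plugins=NodeRestriction,PodTolerationRestriction,PodNodeSelector
```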
Add an annotation to the insight-system namespace:
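A sketch of the annotation, assuming the taint key/value from the previous section; scheduler.alpha.kubernetes.io/defaultTolerations is the annotation read by the PodTolerationRestriction admission plugin:

```shell
kubectl annotate namespace insight-system \
  scheduler.alpha.kubernetes.io/defaultTolerations='[{"operator": "Equal", "effect": "NoSchedule", "key": "node.daocloud.io", "value": "insight-only"}]'
```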
Restart the components under the insight-system namespace to allow normal scheduling of pods under the insight-system.
"},{"location":"en/admin/insight/quickstart/install/component-scheduling.html#use-node-labels-and-node-affinity-to-manage-component-scheduling","title":"Use node labels and node affinity to manage component scheduling","text":"
Info
Node affinity is conceptually similar to nodeSelector, allowing you to constrain which nodes a pod can be scheduled on based on labels on the nodes. There are two types of node affinity:
requiredDuringSchedulingIgnoredDuringExecution: The scheduler will only schedule the pod if the rules are met. This feature is similar to nodeSelector but has more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution: The scheduler will try to find nodes that meet the rules. If no matching nodes are found, the scheduler will still schedule the Pod.
For more details, please refer to the Kubernetes official documentation.
To meet different user needs for scheduling Insight components, Insight provides fine-grained labels for different components' scheduling policies. Below is a description of the labels and their associated components:
| Label Key | Label Value | Description |
| --- | --- | --- |
| node.daocloud.io/insight-any | Any value, recommended to use true | Represents that all Insight components prefer nodes with this label |
| node.daocloud.io/insight-prometheus | Any value, recommended to use true | Specifically for Prometheus components |
| node.daocloud.io/insight-vmstorage | Any value, recommended to use true | Specifically for VictoriaMetrics vmstorage components |
| node.daocloud.io/insight-vector | Any value, recommended to use true | Specifically for Vector components |
| node.daocloud.io/insight-otel-col | Any value, recommended to use true | Specifically for OpenTelemetry components |
You can refer to the following commands to add and remove labels on nodes:
```shell
# Add a label to node8, prioritizing scheduling insight-prometheus to node8
kubectl label nodes node8 node.daocloud.io/insight-prometheus=true

# Remove the node.daocloud.io/insight-prometheus label from node8
kubectl label nodes node8 node.daocloud.io/insight-prometheus-
```
Below is the default affinity preference for the insight-prometheus component during deployment:
Prioritize scheduling insight-prometheus to nodes with the node.daocloud.io/insight-prometheus label
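A sketch of what this preferred node affinity looks like; the weight value is illustrative:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node.daocloud.io/insight-prometheus
              operator: Exists
```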
"},{"location":"en/admin/insight/quickstart/install/gethosturl.html","title":"Get Data Storage Address of Global Service Cluster","text":"
Insight is a product for unified observation of multiple clusters. To achieve unified storage and querying of observation data from multiple clusters, sub-clusters need to report the collected observation data to the global service cluster for unified storage. This document provides the required address of the storage component when installing the collection component insight-agent.
"},{"location":"en/admin/insight/quickstart/install/gethosturl.html#install-insight-agent-in-global-service-cluster","title":"Install insight-agent in Global Service Cluster","text":"
If installing insight-agent in the global service cluster, it is recommended to access the cluster via domain name:
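The concrete addresses are not reproduced in this extract; the following sketch uses in-cluster service domain names, where the service names are assumptions inferred from the commands later on this page:

```yaml
global:
  exporters:
    logging:
      host: mcamel-common-es-cluster-masters-es-http.mcamel-system.svc.cluster.local
    metric:
      host: vminsert-insight-victoria-metrics-k8s-stack.insight-system.svc.cluster.local
    trace:
      host: insight-opentelemetry-collector.insight-system.svc.cluster.local
```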
"},{"location":"en/admin/insight/quickstart/install/gethosturl.html#install-insight-agent-in-other-clusters","title":"Install insight-agent in Other Clusters","text":""},{"location":"en/admin/insight/quickstart/install/gethosturl.html#get-address-via-interface-provided-by-insight-server","title":"Get Address via Interface Provided by Insight Server","text":"
The management cluster uses the default LoadBalancer mode for exposure.
Log in to the console of the global service cluster and run the following command:
```shell
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam'
```
Note
Please replace the ${INSIGHT_SERVER_IP} parameter in the command.
global.exporters.logging.host is the log service address, no need to set the proper service port, the default value will be used.
global.exporters.metric.host is the metrics service address.
global.exporters.trace.host is the trace service address.
global.exporters.auditLog.host is the audit log service address (same service as trace but different port).
When the management cluster has LoadBalancer disabled:
When calling the interface, you need to additionally pass an externally accessible node IP from the cluster, which will be used to construct the complete access address of the proper service.
```shell
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam' --data '{"extra": {"EXPORTER_EXTERNAL_IP": "10.5.14.51"}}'
```
global.exporters.logging.host is the log service address.
global.exporters.logging.port is the NodePort exposed by the log service.
global.exporters.metric.host is the metrics service address.
global.exporters.metric.port is the NodePort exposed by the metrics service.
global.exporters.trace.host is the trace service address.
global.exporters.trace.port is the NodePort exposed by the trace service.
global.exporters.auditLog.host is the audit log service address (same service as trace but different port).
global.exporters.auditLog.port is the NodePort exposed by the audit log service.
"},{"location":"en/admin/insight/quickstart/install/gethosturl.html#connect-via-loadbalancer","title":"Connect via LoadBalancer","text":"
If LoadBalancer is enabled in the cluster and a VIP is set for Insight, you can manually execute the following command to obtain the address information for vminsert and opentelemetry-collector:
```shell
$ kubectl get service -n insight-system | grep lb
lb-insight-opentelemetry-collector                LoadBalancer   10.233.23.12   <pending>   4317:31286/TCP,8006:31351/TCP   24d
lb-vminsert-insight-victoria-metrics-k8s-stack    LoadBalancer   10.233.63.67   <pending>   8480:31629/TCP                  24d
```
lb-vminsert-insight-victoria-metrics-k8s-stack is the address for the metrics service.
lb-insight-opentelemetry-collector is the address for the tracing service.
Execute the following command to obtain the address information for elasticsearch:
```shell
$ kubectl get service -n mcamel-system | grep es
mcamel-common-es-cluster-masters-es-http   NodePort   10.233.16.120   <none>   9200:30465/TCP   47d
```
mcamel-common-es-cluster-masters-es-http is the address for the logging service.
"},{"location":"en/admin/insight/quickstart/install/gethosturl.html#connect-via-nodeport","title":"Connect via NodePort","text":"
The LoadBalancer feature is disabled in the global service cluster.
In this case, the LoadBalancer resources mentioned above will not be created by default. The relevant service names are:
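The original list of service names is not reproduced in this extract; as a sketch, they can be looked up directly:

```shell
kubectl get svc -n insight-system | grep -E 'vminsert|opentelemetry-collector'
kubectl get svc -n mcamel-system | grep es
```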
insight-agent is a plugin for collecting Insight data, supporting unified observation of metrics, traces, and log data. This article describes how to install insight-agent for an integrated cluster in an online environment.
Enter Container Management from the left navigation bar, and enter Clusters . Find the cluster where you want to install insight-agent.
Click Install now to jump to the installation page, or click the cluster, click Helm Apps -> Helm Templates in the left navigation bar, search for insight-agent in the search box, and click it for details.
Select the appropriate version and click Install .
Fill in the name, select the namespace and version, and fill in the addresses of logging, metric, audit, and trace reporting data in the yaml file. The system has filled in the address of the component for data reporting by default, please check it before clicking OK to install.
If you need to modify the data reporting address, please refer to Get Data Reporting Address.
The system will automatically return to Helm Apps . When the application status changes from Unknown to Deployed , it means that insight-agent is installed successfully.
Note
Click ┇ on the far right, and you can perform more operations such as Update , View YAML and Delete in the pop-up menu.
For a practical installation demo, watch Video demo of installing insight-agent
This page lists some issues related to the installation and uninstallation of Insight Agent and their workarounds.
"},{"location":"en/admin/insight/quickstart/install/knownissues.html#uninstallation-failure-of-insight-agent","title":"Uninstallation Failure of Insight Agent","text":"
When you run the following command to uninstall Insight Agent,
```shell
helm uninstall insight-agent
```
the TLS secret used by otel-operator fails to be uninstalled.
Due to the "reuse TLS secret" logic in otel-operator, it checks whether the MutationConfiguration exists and reuses the CA cert bound to it. However, since helm uninstall has already removed the MutationConfiguration, this results in a null value.
Therefore, please manually delete the corresponding secret using one of the following methods:
Delete via command line: Log in to the console of the target cluster and run the deletion command (see the sketch after this list).
Delete via UI: Log in to AI platform container management, select the target cluster, select Secret from the left menu, input insight-agent-opentelemetry-operator-controller-manager-service-cert, then select Delete.
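A sketch of the command-line deletion mentioned above, assuming the secret lives in the insight-system namespace:

```shell
kubectl -n insight-system delete secret insight-agent-opentelemetry-operator-controller-manager-service-cert
```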
"},{"location":"en/admin/insight/quickstart/install/knownissues.html#insight-agent_1","title":"Insight Agent","text":""},{"location":"en/admin/insight/quickstart/install/knownissues.html#log-collection-endpoint-not-updated-when-upgrading-insight-agent","title":"Log Collection Endpoint Not Updated When Upgrading Insight Agent","text":"
When updating the log configuration of the insight-agent from Elasticsearch to Kafka or from Kafka to Elasticsearch, the changes do not take effect and the agent continues to use the previous configuration.
Solution :
Manually restart Fluent Bit in the cluster.
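A sketch of the restart, assuming Fluent Bit runs as the insight-agent-fluent-bit DaemonSet in the insight-system namespace:

```shell
kubectl -n insight-system rollout restart daemonset insight-agent-fluent-bit
```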
"},{"location":"en/admin/insight/quickstart/install/knownissues.html#podmonitor-collects-multiple-sets-of-jvm-metrics","title":"PodMonitor Collects Multiple Sets of JVM Metrics","text":"
In this version, there is a defect in PodMonitor/insight-kubernetes-pod: it incorrectly creates scrape jobs for all containers in Pods marked with insight.opentelemetry.io/metric-scrape=true, instead of only the containers corresponding to insight.opentelemetry.io/metric-port.
After a PodMonitor is declared, PrometheusOperator pre-configures some service discovery configurations. For CRD compatibility reasons, configuring the collection tasks through annotations has been abandoned.
Instead, the additional scrape config mechanism provided by Prometheus is used: the service discovery rules are defined in a secret and introduced into Prometheus.
Therefore:
Delete the current PodMonitor for insight-kubernetes-pod
Use a new rule
In the new rule, action: keepequal is used to compare the consistency between source_labels and target_label to determine whether to create collection tasks for the ports of a container. Note that this feature is only available in Prometheus v2.41.0 (2022-12-20) and higher.
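The new rule itself is not included in this extract; a hedged sketch of what such an additional scrape config could look like (job name and annotation-derived label names are assumptions):

- job_name: insight-kubernetes-pod
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Only keep pods explicitly marked for scraping
    - source_labels: [__meta_kubernetes_pod_annotation_insight_opentelemetry_io_metric_scrape]
      regex: "true"
      action: keep
    # Only create a target when the container port equals the annotated metric port (requires Prometheus >= v2.41.0)
    - source_labels: [__meta_kubernetes_pod_annotation_insight_opentelemetry_io_metric_port]
      target_label: __meta_kubernetes_pod_container_port_number
      action: keepequal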
This page provides some considerations for upgrading insight-server and insight-agent.
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v028x-or-lower-to-v029x","title":"Upgrade from v0.28.x (or lower) to v0.29.x","text":"
Due to the upgrade of the Opentelemetry community operator chart version in v0.29.0, the supported values for featureGates in the values file have changed. Therefore, before upgrading, you need to set the value of featureGates to empty, as follows:
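The values snippet is not included in this extract; a hedged sketch, assuming the setting sits under the opentelemetry-operator sub-chart in the values file:

opentelemetry-operator:
  manager:
    featureGates: ""   # set to empty before upgrading (key path is an assumption)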
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v026x-or-lower-to-v027x-or-higher","title":"Upgrade from v0.26.x (or lower) to v0.27.x or higher","text":"
In v0.27.x, the switch for the vector component has been separated. If the existing environment has vector enabled, you need to specify --set vector.enabled=true when upgrading the insight-server.
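A hedged upgrade command sketch; the release name, chart reference, and namespace are assumptions:

helm upgrade insight-server <insight-server-chart> -n insight-system --reuse-values --set vector.enabled=true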
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v019x-or-lower-to-020x","title":"Upgrade from v0.19.x (or lower) to 0.20.x","text":"
Before upgrading Insight , you need to manually delete the jaeger-collector and jaeger-query deployments by running the following command:
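The command is not included in this extract; a hedged sketch, assuming the Jaeger deployments live in the insight-system namespace and carry the release prefix:

kubectl -n insight-system delete deployment insight-jaeger-collector insight-jaeger-query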
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v017x-or-lower-to-v018x","title":"Upgrade from v0.17.x (or lower) to v0.18.x","text":"
In v0.18.x, there have been updates to the Jaeger-related deployment files, so you need to manually run the following commands before upgrading insight-server:
There have been changes to metric names in v0.18.x, so after upgrading insight-server, insight-agent should also be upgraded.
In addition, the parameters for enabling the tracing module and adjusting the ElasticSearch connection have been modified. Refer to the following parameters:
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v015x-or-lower-to-v016x","title":"Upgrade from v0.15.x (or lower) to v0.16.x","text":"
In v0.16.x, a new feature parameter disableRouteContinueEnforce in the vmalertmanagers CRD is used. Therefore, you need to manually run the following command before upgrading insight-server:
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v023x-or-lower-to-v024x","title":"Upgrade from v0.23.x (or lower) to v0.24.x","text":"
In v0.24.x, CRDs have been added to the OTEL operator chart. However, helm upgrade does not update CRDs, so you need to manually run the following command:
If you are performing an offline installation, you can find the above CRD yaml file after extracting the insight-agent offline package. After extracting the insight-agent Chart, manually run the following command:
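The command itself is not shown in this extract; a hedged sketch for the offline case, assuming the CRD manifests sit under the extracted chart's crds directory:

kubectl apply --server-side -f insight-agent/charts/opentelemetry-operator/crds/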
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v019x-or-lower-to-v020x","title":"Upgrade from v0.19.x (or lower) to v0.20.x","text":"
In v0.20.x, Kafka log export configuration has been added, and there have been some adjustments to the log export configuration. Before upgrading insight-agent , please note the parameter changes. The previous logging configuration has been moved to the logging.elasticsearch configuration:
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v017x-or-lower-to-v018x_1","title":"Upgrade from v0.17.x (or lower) to v0.18.x","text":"
Due to the updated deployment files for Jaeger in v0.18.x, it is important to note the parameter changes before upgrading the insight-agent.
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v016x-or-lower-to-v017x","title":"Upgrade from v0.16.x (or lower) to v0.17.x","text":"
In v0.17.x, the kube-prometheus-stack chart version was upgraded from 41.9.1 to 45.28.1, and some fields in the CRDs it uses were also upgraded, such as the attachMetadata field of ServiceMonitor. Therefore, the following command needs to be run before upgrading the insight-agent:
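The command is not included in this extract; a hedged sketch, assuming the CRD manifests are available locally (for example from the location mentioned in the next paragraph):

kubectl apply --server-side -f insight-agent/dependency-crds/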
If you are performing an offline installation, you can find the yaml for the above CRD in insight-agent/dependency-crds after extracting the insight-agent offline package.
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v011x-or-earlier-to-v012x","title":"Upgrade from v0.11.x (or earlier) to v0.12.x","text":"
v0.12.x upgrades the kube-prometheus-stack chart from 39.6.0 to 41.9.1, including prometheus-operator to v0.60.1 and the prometheus-node-exporter chart to v4.3.0. prometheus-node-exporter uses the Kubernetes recommended labels after the upgrade, so you need to delete the node-exporter DaemonSet. prometheus-operator has also updated its CRDs, so you need to run the following command before upgrading the insight-agent:
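The commands are not included in this extract; a hedged sketch, with the DaemonSet name and CRD path being assumptions:

kubectl -n insight-system delete daemonset insight-agent-prometheus-node-exporter
kubectl apply --server-side -f insight-agent/dependency-crds/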
Please ensure that the insight-agent is ready. If not, please refer to Install insight-agent for data collection and make sure the following three items are ready:
Enable trace functionality for insight-agent
Check if the address and port for trace data are correctly filled
Ensure that the Pods corresponding to deployment/insight-agent-opentelemetry-operator and deployment/insight-agent-opentelemetry-collector are ready
"},{"location":"en/admin/insight/quickstart/otel/operator.html#works-with-the-service-mesh-product-mspider","title":"Works with the Service Mesh Product (Mspider)","text":"
If you enable the tracing capability of the Mspider(Service Mesh), you need to add an additional environment variable injection configuration:
"},{"location":"en/admin/insight/quickstart/otel/operator.html#the-operation-steps-are-as-follows","title":"The operation steps are as follows","text":"
Log in to AI platform, then enter Container Management and select the target cluster.
Click CRDs in the left navigation bar, find instrumentations.opentelemetry.io, and enter the details page.
Select the insight-system namespace, then edit insight-opentelemetry-autoinstrumentation, and add the following content under spec:env::
"},{"location":"en/admin/insight/quickstart/otel/operator.html#add-annotations-to-automatically-access-traces","title":"Add annotations to automatically access traces","text":"
After the above is ready, you can integrate traces for the application through annotations. OTel currently supports trace integration through annotations. Depending on the service language, different pod annotations need to be added. Each service can add one of two types of annotations:
Only inject environment variable annotations
There is only one such annotation, which is used to add OTel-related environment variables, such as the trace reporting address, the ID of the cluster where the container is located, and the namespace (this annotation is very useful when there is no automatic probe for the application's language)
The value is divided into two parts by /, the first value (insight-system) is the namespace of the CR installed in the previous step, and the second value (insight-opentelemetry-autoinstrumentation) is the name of the CR.
Automatic probe injection and environment variable injection annotations
There are currently 4 such annotations, corresponding to 4 different programming languages: java, nodejs, python, dotnet. After using one of them, the automatic probe and the OTel default environment variables will be injected into the first container of the Pod spec:
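A hedged sketch of the pod annotations, using the namespace/name format described above (the CR reference value is taken from the earlier step; treat this as an illustration rather than the exact manifest):

annotations:
  # Inject only the OTel environment variables:
  instrumentation.opentelemetry.io/inject-sdk: "insight-system/insight-opentelemetry-autoinstrumentation"
  # Or inject an automatic probe for a specific language, for example Java:
  instrumentation.opentelemetry.io/inject-java: "insight-system/insight-opentelemetry-autoinstrumentation"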
Since Go auto-instrumentation requires OTEL_GO_AUTO_TARGET_EXE to be set, you must provide a valid executable path through annotations or the Instrumentation resource. Failure to set this value terminates the Go auto-instrumentation injection, which means the traces will not be connected.
The OpenTelemetry Operator automatically adds some OTel-related environment variables when injecting probes and also supports overriding these variables. The priority order for overriding these environment variables is as follows:
original container env vars -> language specific env vars -> common env vars -> instrument spec configs' vars
However, it is important to avoid manually overriding OTEL_RESOURCE_ATTRIBUTES_NODE_NAME . This variable serves as an identifier within the operator to determine if a pod has already been injected with a probe. Manually adding this variable may prevent the probe from being injected successfully.
To learn how to query the services that have been connected, refer to Trace Query.
"},{"location":"en/admin/insight/quickstart/otel/otel.html","title":"Use OTel to provide the application observability","text":"
Enhancement is the process of enabling application code to generate telemetry data. i.e. something that helps you monitor or measure the performance and status of your application.
OpenTelemetry is a leading open source project providing instrumentation libraries for major programming languages and popular frameworks. It is a project under the Cloud Native Computing Foundation and is supported by the vast resources of the community. It provides a standardized data format for collected data without the need to integrate specific vendors.
Insight supports OpenTelemetry for application instrumentation to enhance your applications.
This guide introduces the basic concepts of telemetry enhancement using OpenTelemetry. OpenTelemetry also has an ecosystem of libraries, plugins, integrations, and other useful tools to extend it. You can find these resources at the OTel Registry.
You can use any open standard library for telemetry enhancement and use Insight as an observability backend to ingest, analyze, and visualize data.
To enhance your code, you can use the enhanced operations provided by OpenTelemetry for specific languages:
Insight currently provides an easy way to enhance .NET, Node.js, Java, Python, and Golang applications with OpenTelemetry. Please follow the guidelines below.
Best practices for integrating traces: Application Non-Intrusive Enhancement via Operator
Manual instrumentation with Go language as an example: Enhance Go application with OpenTelemetry SDK
Using eBPF to implement non-intrusive auto-instrumentation in Go (experimental feature)
"},{"location":"en/admin/insight/quickstart/otel/send_tracing_to_insight.html","title":"Sending Trace Data to Insight","text":"
This document describes how customers can send trace data to Insight on their own. It mainly includes the following two scenarios:
Customer apps report traces to Insight through OTEL Agent/SDK
Forwarding traces to Insight through Opentelemetry Collector (OTEL COL)
In each cluster where Insight Agent is installed, there is an insight-agent-otel-col component that is used to receive trace data from that cluster. Therefore, this component serves as the entry point for user access and needs to obtain its address first. You can get the address of the Opentelemetry Collector in the cluster through the AI platform interface, such as insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 :
In addition, there are some slight differences for different reporting methods:
"},{"location":"en/admin/insight/quickstart/otel/send_tracing_to_insight.html#customer-apps-report-traces-to-insight-through-otel-agentsdk","title":"Customer apps report traces to Insight through OTEL Agent/SDK","text":"
To successfully report trace data to Insight and display it properly, it is recommended to provide the required metadata (Resource Attributes) for OTLP through the following environment variables. There are two ways to achieve this:
Manually add them to the deployment YAML file, for example:
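The example is not included in this extract; a hedged sketch of the env section in a Deployment, where the service name, namespace, and resource attribute keys are assumptions:

env:
  - name: OTEL_SERVICE_NAME
    value: "my-app"   # service name shown in Insight (assumed)
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "k8s.namespace.name=my-namespace,k8s.pod.name=my-app-0"   # attribute keys are assumptions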
"},{"location":"en/admin/insight/quickstart/otel/send_tracing_to_insight.html#forwarding-traces-to-insight-through-opentelemetry-collector","title":"Forwarding traces to Insight through Opentelemetry Collector","text":"
After ensuring that the application has added the metadata mentioned above, you only need to add an OTLP Exporter in your customer's Opentelemetry Collector to forward the trace data to Insight Agent Opentelemetry Collector. Below is an example Opentelemetry Collector configuration file:
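The configuration file itself is not included in this extract; a hedged sketch of the relevant pieces of a customer-side OpenTelemetry Collector config, forwarding traces to the Insight Agent collector (endpoint taken from the address above; exporter name is an assumption):

receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp/insight:
    endpoint: "insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317"
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/insight]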
Enhancing Applications Non-intrusively with the Operator
Achieving Observability with OTel
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html","title":"Enhance Go applications with OTel SDK","text":"
This page contains instructions on how to set up OpenTelemetry enhancements in a Go application.
OpenTelemetry, also known simply as OTel, is an open-source observability framework that helps generate and collect telemetry data: traces, metrics, and logs in Go apps.
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#enhance-go-apps-with-the-opentelemetry-sdk","title":"Enhance Go apps with the OpenTelemetry SDK","text":""},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#install-related-dependencies","title":"Install related dependencies","text":"
Dependencies related to the OpenTelemetry exporter and SDK must be installed first. If you are using another request router, please refer to request routing. After navigating to the application source folder, run the following command:
go get go.opentelemetry.io/otel@v1.8.0 \
  go.opentelemetry.io/otel/trace@v1.8.0 \
  go.opentelemetry.io/otel/sdk@v1.8.0 \
  go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin@v0.33.0 \
  go.opentelemetry.io/otel/exporters/otlp/otlptrace@v1.7.0 \
  go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.4.1
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#create-an-initialization-feature-using-the-opentelemetry-sdk","title":"Create an initialization feature using the OpenTelemetry SDK","text":"
In order for an application to send data, a function is required to initialize OpenTelemetry. Add the following code snippet to the main.go file:
import (\n \"context\"\n \"os\"\n \"time\"\n\n \"go.opentelemetry.io/otel\"\n \"go.opentelemetry.io/otel/exporters/otlp/otlptrace\"\n \"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc\"\n \"go.opentelemetry.io/otel/propagation\"\n \"go.opentelemetry.io/otel/sdk/resource\"\n sdktrace \"go.opentelemetry.io/otel/sdk/trace\"\n semconv \"go.opentelemetry.io/otel/semconv/v1.7.0\"\n \"go.uber.org/zap\"\n \"google.golang.org/grpc\"\n)\n\nvar tracerExp *otlptrace.Exporter\n\nfunc retryInitTracer() func() {\n var shutdown func()\n go func() {\n for {\n // otel will reconnected and re-send spans when otel col recover. so, we don't need to re-init tracer exporter.\n if tracerExp == nil {\n shutdown = initTracer()\n } else {\n break\n }\n time.Sleep(time.Minute * 5)\n }\n }()\n return shutdown\n}\n\nfunc initTracer() func() {\n // temporarily set timeout to 10s\n ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)\n defer cancel()\n\n serviceName, ok := os.LookupEnv(\"OTEL_SERVICE_NAME\")\n if !ok {\n serviceName = \"server_name\"\n os.Setenv(\"OTEL_SERVICE_NAME\", serviceName)\n }\n otelAgentAddr, ok := os.LookupEnv(\"OTEL_EXPORTER_OTLP_ENDPOINT\")\n if !ok {\n otelAgentAddr = \"http://localhost:4317\"\n os.Setenv(\"OTEL_EXPORTER_OTLP_ENDPOINT\", otelAgentAddr)\n }\n zap.S().Infof(\"OTLP Trace connect to: %s with service name: %s\", otelAgentAddr, serviceName)\n\n traceExporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure(), otlptracegrpc.WithDialOption(grpc.WithBlock()))\n if err != nil {\n handleErr(err, \"OTLP Trace gRPC Creation\")\n return nil\n }\n\n tracerProvider := sdktrace.NewTracerProvider(\n sdktrace.WithBatcher(traceExporter),\n sdktrace.WithSampler(sdktrace.AlwaysSample()),\n sdktrace.WithResource(resource.NewWithAttributes(semconv.SchemaURL)))\n\n otel.SetTracerProvider(tracerProvider)\n otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))\n\n tracerExp = traceExporter\n return func() {\n // Shutdown will flush any remaining spans and shut down the exporter.\n handleErr(tracerProvider.Shutdown(ctx), \"failed to shutdown TracerProvider\")\n }\n}\n\nfunc handleErr(err error, message string) {\n if err != nil {\n zap.S().Errorf(\"%s: %v\", message, err)\n }\n}\n
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#initialize-tracker-in-maingo","title":"Initialize tracker in main.go","text":"
Modify the main function to initialize the tracer in main.go. When your service shuts down, you should also call TracerProvider.Shutdown() to ensure all spans are exported. The service makes this call as a deferred function in main:
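The main.go snippet is not included in this extract; a hedged sketch that reuses retryInitTracer from the initialization code above:

func main() {
    // Initialize the tracer; retryInitTracer keeps retrying until the collector is reachable.
    shutdown := retryInitTracer()
    defer func() {
        if shutdown != nil {
            shutdown() // flush any remaining spans before the process exits
        }
    }()
    // ... set up the router and start the service here
}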
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#add-opentelemetry-gin-middleware-to-the-application","title":"Add OpenTelemetry Gin middleware to the application","text":"
Configure Gin to use the middleware by adding the following line to main.go :
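The line itself is not included in this extract; a hedged sketch, assuming the otelgin package is imported under the alias middleware as shown in the import stanza later in this page:

router := gin.Default()
router.Use(middleware.Middleware(os.Getenv("OTEL_SERVICE_NAME"))) // name reported as the service in traces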
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#run-the-application","title":"Run the application","text":"
Local debugging and running
Note: This step is only used for local development and debugging. In the production environment, the Operator will automatically complete the injection of the following environment variables.
The above steps complete the SDK initialization. If you now need to develop and debug locally, you first need to obtain the address of insight-agent-opentelemetry-collector in the insight-system namespace, for example: insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317.
Therefore, you can add the following environment variables when you start the application locally:
OTEL_SERVICE_NAME=my-golang-app OTEL_EXPORTER_OTLP_ENDPOINT=http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 go run main.go
Running in a production environment
Please refer to the introduction of Only injecting environment variable annotations in Achieving non-intrusive enhancement of applications through Operators to add annotations to deployment yaml:
// Add one line to your import() stanza depending upon your request router:
middleware "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
// Add one line to your import() stanza depending upon your request router:
middleware "go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux"
The OpenTelemetry community has also developed middleware for database access libraries, such as Gorm:
import (\n \"github.com/uptrace/opentelemetry-go-extra/otelgorm\"\n \"gorm.io/driver/sqlite\"\n \"gorm.io/gorm\"\n)\n\ndb, err := gorm.Open(sqlite.Open(\"file::memory:?cache=shared\"), &gorm.Config{})\nif err != nil {\n panic(err)\n}\n\notelPlugin := otelgorm.NewPlugin(otelgorm.WithDBName(\"mydb\"), # Missing this can lead to incomplete display of database related topology\n otelgorm.WithAttributes(semconv.ServerAddress(\"memory\"))) # Missing this can lead to incomplete display of database related topology\nif err := db.Use(otelPlugin); err != nil {\n panic(err)\n}\n
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#add-custom-properties-and-custom-events-to-span","title":"Add custom properties and custom events to span","text":"
It is also possible to set custom attributes, tags, and events on a span. To add custom properties and events, follow these steps:
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#import-tracking-and-property-libraries","title":"Import Tracking and Property Libraries","text":"
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#get-the-current-span-from-the-context","title":"Get the current Span from the context","text":"
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#set-properties-in-the-current-span","title":"Set properties in the current Span","text":"
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#add-an-event-to-the-current-span","title":"Add an Event to the current Span","text":"
Adding span events is done using AddEvent on the span object.
span.AddEvent(msg)
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#log-errors-and-exceptions","title":"Log errors and exceptions","text":"
import \"go.opentelemetry.io/otel/codes\"\n\n// Get the current span\nspan := trace.SpanFromContext(ctx)\n\n// RecordError will automatically convert an error into a span even\nspan.RecordError(err)\n\n// Flag this span as an error\nspan.SetStatus(codes.Error, \"internal error\")\n
Navigate to your application\u2019s source folder and run the following command:
go get go.opentelemetry.io/otel \
  go.opentelemetry.io/otel/attribute \
  go.opentelemetry.io/otel/exporters/prometheus \
  go.opentelemetry.io/otel/metric/global \
  go.opentelemetry.io/otel/metric/instrument \
  go.opentelemetry.io/otel/sdk/metric
"},{"location":"en/admin/insight/quickstart/otel/golang/meter.html#create-an-initialization-function-using-otel-sdk","title":"Create an Initialization Function Using OTel SDK","text":"
For Java applications, you can directly expose JVM-related metrics by using the OpenTelemetry agent with the following environment variable:
OTEL_METRICS_EXPORTER=prometheus
You can then check your metrics at http://localhost:8888/metrics.
Next, combine it with a Prometheus ServiceMonitor to complete the metrics integration. If you want to expose custom metrics, please refer to opentelemetry-java-docs/prometheus.
The process is mainly divided into two steps:
Create a meter provider and specify Prometheus as the exporter.
/*\n * Copyright The OpenTelemetry Authors\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage io.opentelemetry.example.prometheus;\n\nimport io.opentelemetry.api.metrics.MeterProvider;\nimport io.opentelemetry.exporter.prometheus.PrometheusHttpServer;\nimport io.opentelemetry.sdk.metrics.SdkMeterProvider;\nimport io.opentelemetry.sdk.metrics.export.MetricReader;\n\npublic final class ExampleConfiguration {\n\n /**\n * Initializes the Meter SDK and configures the Prometheus collector with all default settings.\n *\n * @param prometheusPort the port to open up for scraping.\n * @return A MeterProvider for use in instrumentation.\n */\n static MeterProvider initializeOpenTelemetry(int prometheusPort) {\n MetricReader prometheusReader = PrometheusHttpServer.builder().setPort(prometheusPort).build();\n\n return SdkMeterProvider.builder().registerMetricReader(prometheusReader).build();\n }\n}\n
Create a custom meter and start the HTTP server.
package io.opentelemetry.example.prometheus;\n\nimport io.opentelemetry.api.common.Attributes;\nimport io.opentelemetry.api.metrics.Meter;\nimport io.opentelemetry.api.metrics.MeterProvider;\nimport java.util.concurrent.ThreadLocalRandom;\n\n/**\n * Example of using the PrometheusHttpServer to convert OTel metrics to Prometheus format and expose\n * these to a Prometheus instance via a HttpServer exporter.\n *\n * <p>A Gauge is used to periodically measure how many incoming messages are awaiting processing.\n * The Gauge callback gets executed every collection interval.\n */\npublic final class PrometheusExample {\n private long incomingMessageCount;\n\n public PrometheusExample(MeterProvider meterProvider) {\n Meter meter = meterProvider.get(\"PrometheusExample\");\n meter\n .gaugeBuilder(\"incoming.messages\")\n .setDescription(\"No of incoming messages awaiting processing\")\n .setUnit(\"message\")\n .buildWithCallback(result -> result.record(incomingMessageCount, Attributes.empty()));\n }\n\n void simulate() {\n for (int i = 500; i > 0; i--) {\n try {\n System.out.println(\n i + \" Iterations to go, current incomingMessageCount is: \" + incomingMessageCount);\n incomingMessageCount = ThreadLocalRandom.current().nextLong(100);\n Thread.sleep(1000);\n } catch (InterruptedException e) {\n // ignored here\n }\n }\n }\n\n public static void main(String[] args) {\n int prometheusPort = 8888;\n\n // It is important to initialize the OpenTelemetry SDK as early as possible in your process.\n MeterProvider meterProvider = ExampleConfiguration.initializeOpenTelemetry(prometheusPort);\n\n PrometheusExample prometheusExample = new PrometheusExample(meterProvider);\n\n prometheusExample.simulate();\n\n System.out.println(\"Exiting\");\n }\n}\n
After running the Java application, you can check if your metrics are working correctly by visiting http://localhost:8888/metrics.
For integrating and monitoring traces of Java applications, please refer to the document Implementing Non-Intrusive Enhancements for Applications via Operator, which explains how to automatically integrate traces through annotations.
Monitoring the JVM of Java applications: How Java applications that have already exposed JVM metrics and those that have not yet exposed JVM metrics can connect with observability Insight.
If your Java application has not yet started exposing JVM metrics, you can refer to the following documents:
Exposing JVM Monitoring Metrics Using JMX Exporter
Exposing JVM Monitoring Metrics Using OpenTelemetry Java Agent
If your Java application has already exposed JVM metrics, you can refer to the following document:
Connecting Existing JVM Metrics of Java Applications to Observability
Writing TraceId and SpanId into Java Application Logs to correlate trace data with log data.
"},{"location":"en/admin/insight/quickstart/otel/java/mdc.html","title":"Writing TraceId and SpanId into Java Application Logs","text":"
This article explains how to automatically write TraceId and SpanId into Java application logs using OpenTelemetry. By including TraceId and SpanId in your logs, you can correlate distributed tracing data with log data, enabling more efficient fault diagnosis and performance analysis.
Spring Boot projects come with a built-in logging framework and use Logback as the default logging implementation. If your Java project is a Spring Boot project, you can write TraceId into logs with minimal configuration.
Set logging.pattern.level in application.properties, adding %mdc{trace_id} and %mdc{span_id} to the logs.
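A hedged sketch of the application.properties entry; the surrounding pattern text is an assumption, while the %mdc{trace_id} and %mdc{span_id} placeholders come from the step above:

# application.properties
logging.pattern.level=%5p [trace_id=%mdc{trace_id} span_id=%mdc{span_id}]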
Modify the log4j2.xml configuration, adding %X{trace_id} and %X{span_id} in the pattern to automatically write TraceId and SpanId into the logs. If you use Logback, you can instead wrap your appender with the OpenTelemetry MDC appender as shown below:
<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<configuration>\n <appender name=\"CONSOLE\" class=\"ch.qos.logback.core.ConsoleAppender\">\n <encoder>\n <pattern>%d{HH:mm:ss.SSS} trace_id=%X{trace_id} span_id=%X{span_id} trace_flags=%X{trace_flags} %msg%n</pattern>\n </encoder>\n </appender>\n\n <!-- Just wrap your logging appender, for example ConsoleAppender, with OpenTelemetryAppender -->\n <appender name=\"OTEL\" class=\"io.opentelemetry.instrumentation.logback.mdc.v1_0.OpenTelemetryAppender\">\n <appender-ref ref=\"CONSOLE\"/>\n </appender>\n\n <!-- Use the wrapped \"OTEL\" appender instead of the original \"CONSOLE\" one -->\n <root level=\"INFO\">\n <appender-ref ref=\"OTEL\"/>\n </root>\n\n</configuration>\n
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html","title":"Exposing JVM Monitoring Metrics Using JMX Exporter","text":"
JMX Exporter provides two usage methods:
Standalone Process: Specify parameters when starting the JVM to expose a JMX RMI interface. The JMX Exporter calls RMI to obtain the JVM runtime state data, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape.
In-Process (JVM process): Specify parameters when starting the JVM to run the JMX Exporter jar file as a javaagent. This method reads the JVM runtime state data in-process, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape.
Note
The official recommendation is not to use the first method due to its complex configuration and the requirement for a separate process, which introduces additional monitoring challenges. Therefore, this article focuses on the second method, detailing how to use JMX Exporter to expose JVM monitoring metrics in a Kubernetes environment.
In this method, you need to specify the JMX Exporter jar file and configuration file when starting the JVM. Since the jar file is a binary file that is not ideal for mounting via a configmap, and the configuration file typically does not require modifications, it is recommended to package both the JMX Exporter jar file and the configuration file directly into the business container image.
For the second method, you can choose to include the JMX Exporter jar file in the application image or mount it during deployment. Below are explanations for both approaches:
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html#method-1-building-jmx-exporter-jar-file-into-the-business-image","title":"Method 1: Building JMX Exporter JAR File into the Business Image","text":"
The content of prometheus-jmx-config.yaml is as follows:
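The file content is not included in this extract; a minimal hedged sketch that exposes all MBeans (the individual options are standard JMX Exporter settings, but their exact values here are assumptions):

ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false
rules:
  - pattern: ".*"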
The format of the startup parameter is: -javaagent:<jmx-exporter-jar-path>=<port>:<config-file-path>
Here, port 8088 is used to expose JVM monitoring metrics; you may change it if it conflicts with the Java application.
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html#method-2-mounting-via-init-container","title":"Method 2: Mounting via Init Container","text":"
First, we need to create a Docker image for the JMX Exporter. The following Dockerfile is for reference:
FROM alpine/curl:3.14
WORKDIR /app/
# Copy the previously created config file into the image
COPY prometheus-jmx-config.yaml ./
# Download the jmx prometheus javaagent jar online
RUN set -ex; \
    curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar;
Build the image using the above Dockerfile: docker build -t my-jmx-exporter .
Add the following init container to the Java application deployment YAML:
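The YAML snippet is not included in this extract; a hedged sketch built from the Dockerfile above (jar and config names, port 8088), with the volume name, mount paths, and image tag being assumptions:

      initContainers:
        - name: jmx-sidecar
          image: my-jmx-exporter
          # Copy the agent jar and config into a shared volume for the app container
          command: ["sh", "-c", "cp /app/jmx_prometheus_javaagent-0.17.2.jar /app/prometheus-jmx-config.yaml /jmx/"]
          volumeMounts:
            - name: jmx-exporter
              mountPath: /jmx
      containers:
        - name: my-demo-app
          image: my-demo-app:latest   # hypothetical application image
          env:
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/app/jmx/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/jmx/prometheus-jmx-config.yaml"
          volumeMounts:
            - name: jmx-exporter
              mountPath: /app/jmx
      volumes:
        - name: jmx-exporter
          emptyDir: {}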
With the above modifications, the example application my-demo-app now has the capability to expose JVM metrics. After running the service, you can access the Prometheus formatted metrics at http://localhost:8088.
Next, you can refer to Connecting Existing JVM Metrics of Java Applications to Observability.
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/legacy-jvm.html","title":"Integrating Existing JVM Metrics of Java Applications with Observability","text":"
If your Java application exposes JVM monitoring metrics through other means (such as Spring Boot Actuator), you will need to ensure that the monitoring data is collected. You can achieve this by adding annotations (Kubernetes Annotations) to your workload to allow Insight to scrape the existing JVM metrics:
annotations:
  insight.opentelemetry.io/metric-scrape: "true"   # Whether to scrape
  insight.opentelemetry.io/metric-path: "/"        # Path to scrape metrics
  insight.opentelemetry.io/metric-port: "9464"     # Port to scrape metrics
For example, to add annotations to the my-deployment-app:
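The example manifest is not included in this extract; a hedged sketch placing the annotations on the Pod template, with path and port taken from the explanation that follows (labels and image are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment-app
spec:
  selector:
    matchLabels:
      app: my-deployment-app
  template:
    metadata:
      labels:
        app: my-deployment-app
      annotations:
        insight.opentelemetry.io/metric-scrape: "true"
        insight.opentelemetry.io/metric-path: "/actuator/prometheus"
        insight.opentelemetry.io/metric-port: "8080"
    spec:
      containers:
        - name: my-deployment-app
          image: my-deployment-app:latest   # hypothetical image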
In the above example, Insight will scrape the Prometheus metrics exposed through Spring Boot Actuator via http://<pod-ip>:8080/actuator/prometheus.
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.html","title":"Exposing JVM Metrics Using OpenTelemetry Java Agent","text":"
Starting from OpenTelemetry Agent v1.20.0 and later, the OpenTelemetry Agent has introduced the JMX Metric Insight module. If your application is already integrated with the OpenTelemetry Agent for tracing, you no longer need to introduce another agent to expose JMX metrics for your application. The OpenTelemetry Agent collects and exposes metrics by detecting the locally available MBeans in the application.
The OpenTelemetry Agent also provides built-in monitoring examples for common Java servers or frameworks. Please refer to the Predefined Metrics.
When using the OpenTelemetry Java Agent, you also need to consider how to mount the JAR into the container. In addition to the methods for mounting the JAR file as described with the JMX Exporter, you can leverage the capabilities provided by the OpenTelemetry Operator to automatically enable JVM metrics exposure for your application.
If your application is already integrated with the OpenTelemetry Agent for tracing, you do not need to introduce another agent to expose JMX metrics. The OpenTelemetry Agent can now locally collect and expose metrics interfaces by detecting the locally available MBeans in the application.
However, as of the current version, you still need to manually add the appropriate annotations to your application for the JVM data to be collected by Insight. For specific annotation content, please refer to Integrating Existing JVM Metrics of Java Applications with Observability.
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.html#exposing-metrics-for-java-middleware","title":"Exposing Metrics for Java Middleware","text":"
The OpenTelemetry Agent also includes built-in examples for monitoring middleware. Please refer to the Predefined Metrics.
By default, no specific types are designated; you need to specify them using the -Dotel.jmx.target.system JVM options, for example, -Dotel.jmx.target.system=jetty,kafka-broker.
Although OpenShift comes with its own monitoring system, we still install Insight Agent because of some rules in its data collection conventions.
In addition to the basic installation configuration, the following parameters need to be added during helm install:
## Parameters related to fluentbit
--set fluent-bit.ocp.enabled=true \
--set fluent-bit.serviceAccount.create=false \
--set fluent-bit.securityContext.runAsUser=0 \
--set fluent-bit.securityContext.seLinuxOptions.type=spc_t \
--set fluent-bit.securityContext.readOnlyRootFilesystem=false \
--set fluent-bit.securityContext.allowPrivilegeEscalation=false \

## Enable Prometheus (CR) for OpenShift 4.x
--set compatibility.openshift.prometheus.enabled=true \

## Disable the higher-version Prometheus instance shipped with kube-prometheus-stack
--set kube-prometheus-stack.prometheus.enabled=false \
--set kube-prometheus-stack.kubeApiServer.enabled=false \
--set kube-prometheus-stack.kubelet.enabled=false \
--set kube-prometheus-stack.kubeControllerManager.enabled=false \
--set kube-prometheus-stack.coreDns.enabled=false \
--set kube-prometheus-stack.kubeDns.enabled=false \
--set kube-prometheus-stack.kubeEtcd.enabled=false \
--set kube-prometheus-stack.kubeScheduler.enabled=false \
--set kube-prometheus-stack.kubeStateMetrics.enabled=false \
--set kube-prometheus-stack.nodeExporter.enabled=false \

## Limit the namespaces processed by PrometheusOperator to avoid competing with OpenShift's own PrometheusOperator
--set kube-prometheus-stack.prometheusOperator.kubeletService.namespace="insight-system" \
--set kube-prometheus-stack.prometheusOperator.prometheusInstanceNamespaces="insight-system" \
--set kube-prometheus-stack.prometheusOperator.denyNamespaces[0]="openshift-monitoring" \
--set kube-prometheus-stack.prometheusOperator.denyNamespaces[1]="openshift-user-workload-monitoring" \
--set kube-prometheus-stack.prometheusOperator.denyNamespaces[2]="openshift-customer-monitoring" \
--set kube-prometheus-stack.prometheusOperator.denyNamespaces[3]="openshift-route-monitor-operator" \
"},{"location":"en/admin/insight/quickstart/other/install-agent-on-ocp.html#write-system-monitoring-data-into-prometheus-through-openshifts-own-mechanism","title":"Write system monitoring data into Prometheus through OpenShift's own mechanism","text":"
"},{"location":"en/admin/insight/quickstart/res-plan/modify-vms-disk.html","title":"vmstorage Disk Expansion","text":"
This article describes the method for expanding the vmstorage disk. Please refer to the vmstorage disk capacity planning for the specifications of the vmstorage disk.
Log in to the AI platform as a global service cluster administrator. Click Container Management -> Clusters and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu Container Storage -> PVCs and find the PVC bound to the vmstorage.
Click a vmstorage PVC to enter the details of the volume claim for vmstorage and confirm the StorageClass that the PVC is bound to.
Select the left navigation menu Container Storage -> Storage Class and find local-path. Click the ┇ on the right side of the target and select Edit in the popup menu.
Enable Scale Up and click OK .
"},{"location":"en/admin/insight/quickstart/res-plan/modify-vms-disk.html#modify-the-disk-capacity-of-vmstorage","title":"Modify the disk capacity of vmstorage","text":"
Log in to the AI platform as a global service cluster administrator and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu CRDs and find the custom resource for vmcluster .
Click the custom resource for vmcluster to enter the details page, switch to the insight-system namespace, and select Edit YAML from the right menu of insight-victoria-metrics-k8s-stack .
Modify according to the legend and click OK .
Select the left navigation menu Container Storage -> PVCs again and find the volume claim bound to vmstorage. Confirm that the modification has taken effect. In the details page of a PVC, click the associated storage source (PV).
Open the volume details page and click the Update button in the upper right corner.
After modifying the Capacity , click OK and wait for a moment until the expansion is successful.
"},{"location":"en/admin/insight/quickstart/res-plan/modify-vms-disk.html#clone-the-storage-volume","title":"Clone the storage volume","text":"
If the storage volume expansion fails, you can refer to the following method to clone the storage volume.
Log in to the AI platform as a global service cluster administrator and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu Workloads -> StatefulSets and find the StatefulSet for vmstorage. Click the ┇ on the right side of the target and select Status -> Stop -> OK in the popup menu.
After logging into the master node of the kpanda-global-cluster cluster in the command line, run the following command to copy the vm-data directory in the vmstorage container to store the metric information locally:
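The command is not included in this extract; a hedged sketch using kubectl cp, where the pod name and the in-container data path are assumptions based on the vmcluster resource mentioned above:

kubectl cp -n insight-system vmstorage-insight-victoria-metrics-k8s-stack-0:vm-data ./vm-data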
Log in to the AI platform and go to the details of the kpanda-global-cluster cluster. Select the left navigation menu Container Storage -> PVs, click Clone in the upper right corner, and modify the capacity of the volume.
Delete the previous data volume of vmstorage.
Wait for a moment until the volume claim is bound to the cloned data volume, then run the following command to import the exported data from step 3 into the proper container, and then start the previously paused vmstorage .
In actual use, the CPU, memory, and other resource usage of Prometheus is affected by the number of containers in the cluster and by whether Istio is enabled, and may exceed the configured resources.
In order to ensure the normal operation of Prometheus in clusters of different sizes, it is necessary to adjust the resources of Prometheus according to the actual size of the cluster.
When the service mesh is not enabled, test statistics show that the relationship between the metrics produced by system jobs and the number of pods is: Series count = 800 x Pod count
When the service mesh is enabled, the Istio-related metrics generated by pods add: Series count = 768 x Pod count
"},{"location":"en/admin/insight/quickstart/res-plan/prometheus-res.html#when-the-service-mesh-is-not-enabled","title":"When the service mesh is not enabled","text":"
The following resource planning is recommended by Prometheus when the service mesh is not enabled :
Pod count in the table refers to the number of pods running stably in the cluster. If a large number of pods are restarted, the metric count will increase sharply in a short period of time, and resources need to be adjusted accordingly.
Prometheus stores two hours of data in memory by default, and when the Remote Write feature is enabled in the cluster, a certain amount of additional memory is occupied; it is recommended to set the resource surge ratio to 2.
The data in the table are recommended values, applicable to general situations. If the environment has precise resource requirements, it is recommended to check the resource usage of the proper Prometheus after the cluster has been running for a period of time for precise configuration.
"},{"location":"en/admin/insight/quickstart/res-plan/vms-res-plan.html","title":"vmstorage disk capacity planning","text":"
vmstorage is responsible for storing multicluster metrics for observability. In order to ensure the stability of vmstorage, it is necessary to adjust the disk capacity of vmstorage according to the number of clusters and the size of the cluster. For more information, please refer to vmstorage retention period and disk space.
After observing the vmstorage disks of clusters of different sizes for 14 days, we found that the disk usage of vmstorage was positively correlated with the amount of metrics it stored and with the disk usage of a single data point.
Instantaneous amount of metrics stored: use increase(vm_rows{type != "indexdb"}[30s]) to obtain the amount of metrics added within 30s
Disk usage of a single data point: sum(vm_data_size_bytes{type!="indexdb"}) / sum(vm_rows{type != "indexdb"})
Disk usage = Instantaneous metrics x 2 x disk usage for a single data point x 60 x 24 x storage time (days)
Parameter Description:
The unit of disk usage is Byte .
Storage duration (days) x 60 x 24 converts time (days) into minutes to calculate disk usage.
The default collection time of Prometheus in Insight Agent is 30s, so twice the amount of metrics will be generated within 1 minute.
The default storage duration in vmstorage is 1 month, please refer to Modify System Configuration to modify the configuration.
Warning
This formula is a general solution, and it is recommended to reserve redundant disk capacity on the calculation result to ensure the normal operation of vmstorage.
The data in the table is calculated based on the default storage time of one month (30 days), and the disk usage of a single data point (datapoint) is calculated as 0.9. In a multicluster scenario, the number of Pods represents the sum of the number of Pods in the multicluster.
"},{"location":"en/admin/insight/quickstart/res-plan/vms-res-plan.html#when-the-service-mesh-is-not-enabled","title":"When the service mesh is not enabled","text":"Cluster size (number of Pods) Metrics Disk capacity 100 8W 6 GiB 200 16W 12 GiB 300 24w 18 GiB 400 32w 24 GiB 500 40w 30 GiB 800 64w 48 GiB 1000 80W 60 GiB 2000 160w 120 GiB 3000 240w 180 GiB"},{"location":"en/admin/insight/quickstart/res-plan/vms-res-plan.html#when-the-service-mesh-is-enabled","title":"When the service mesh is enabled","text":"Cluster size (number of Pods) Metrics Disk capacity 100 15W 12 GiB 200 31w 24 GiB 300 46w 36 GiB 400 62w 48 GiB 500 78w 60 GiB 800 125w 94 GiB 1000 156w 120 GiB 2000 312w 235 GiB 3000 468w 350 GiB"},{"location":"en/admin/insight/quickstart/res-plan/vms-res-plan.html#example","title":"Example","text":"
There are two clusters in the AI platform platform, of which 500 Pods are running in the global management cluster (service mesh is turned on), and 1000 Pods are running in the worker cluster (service mesh is not turned on), and the expected metrics are stored for 30 days.
The number of metrics in the global management cluster is 800x500 + 768x500 = 784000
Worker cluster metrics are 800x1000 = 800000
Then the vmstorage disk capacity should be set to (784,000 + 800,000) x 2 x 0.9 x 60 x 24 x 30 ≈ 123,171,840,000 bytes ≈ 115 GiB
Note
For the relationship between the number of metrics and the number of Pods in the cluster, please refer to Prometheus Resource Planning.
"},{"location":"en/admin/insight/reference/alertnotification.html","title":"Alert Notification Process Description","text":"
When configuring an alert policy in Insight, you have the ability to set different notification sending intervals for alerts triggered at different levels within the same policy. However, due to the presence of two parameters, group_interval and repeat_interval , in the native Alertmanager configuration, the actual intervals for sending alert notifications may deviate.
group_wait : Specifies the waiting time before sending alert notifications. When Alertmanager receives a group of alerts, if no further alerts are received within the duration specified by group_wait , Alertmanager waits for a certain amount of time to collect additional alerts with the same labels and content. It then includes all qualifying alerts in the same notification.
group_interval : Determines the waiting time before merging a group of alerts into a single notification. If no more alerts from the same group are received during this period, Alertmanager sends a notification containing all received alerts.
repeat_interval : Sets the interval for resending alert notifications. After Alertmanager sends an alert notification to a receiver, if it continues to receive alerts with the same labels and content within the duration specified by repeat_interval , Alertmanager will resend the alert notification.
When the group_wait , group_interval , and repeat_interval parameters are set simultaneously, Alertmanager handles alert notifications under the same group as follows:
When Alertmanager receives qualifying alerts, it waits for at least the duration specified in the group_wait parameter to collect additional alerts with the same labels and content. It includes all qualifying alerts in the same notification.
If no further alerts are received during the group_wait duration, Alertmanager sends all received alerts to the receiver after that time. If additional qualifying alerts arrive during this period, Alertmanager continues to wait until all alerts are collected or a timeout occurs.
If more alerts with the same labels and content are received within the group_interval parameter, these new alerts are merged into the previous notification and sent together. If there are still unsent alerts after the group_interval duration, Alertmanager starts a new timing cycle and waits for more alerts until the group_interval duration is reached again or new alerts are received.
If Alertmanager keeps receiving alerts with the same labels and content within the duration specified by repeat_interval , it will resend the previously sent alert notifications. When resending alert notifications, Alertmanager does not wait for group_wait or group_interval , but sends notifications repeatedly according to the time interval specified by repeat_interval .
If there are still unsent alerts after the repeat_interval duration, Alertmanager starts a new timing cycle and continues to wait for new alerts with the same labels and content. This process continues until there are no new alerts or Alertmanager is stopped.
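A hedged sketch of an Alertmanager route using the values assumed in the example below (other routing fields are omitted):

route:
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h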
When Alertmanager receives an alert, it waits for at least 30 seconds to collect additional alerts with the same labels and content, and includes them in the same notification.
If more alerts with the same labels and content are received within 5 minutes, these new alerts are merged into the previous notification and sent together. If there are still unsent alerts after 15 minutes, Alertmanager starts a new timing cycle and waits for more alerts until 5 minutes have passed or new alerts are received.
If Alertmanager continues to receive alerts with the same labels and content within 1 hour, it will resend the previously sent alert notifications.
"},{"location":"en/admin/insight/reference/lucene.html","title":"Lucene Syntax Usage","text":""},{"location":"en/admin/insight/reference/lucene.html#introduction-to-lucene","title":"Introduction to Lucene","text":"
Lucene is a subproject of Apache Software Foundation's Jakarta project and is an open-source full-text search engine toolkit. The purpose of Lucene is to provide software developers with a simple and easy-to-use toolkit for implementing full-text search functionality in their target systems.
Lucene's syntax allows you to construct search queries in a flexible way to meet different search requirements. Here is a detailed explanation of Lucene's syntax:
To perform searches with multiple keywords using Lucene syntax, you can use Boolean logical operators to combine multiple keywords. Lucene supports the following operators:
AND operator
Use AND or && to represent the logical AND relationship.
Example: term1 AND term2 or term1 && term2
OR operator
Use OR or || to represent the logical OR relationship.
Example: term1 OR term2 or term1 || term2
NOT operator
Use NOT or - to represent the logical NOT relationship.
Example: term1 NOT term2 or term1 -term2
Quotes
You can enclose a phrase in quotes for exact matching.
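For example, a quoted phrase for exact matching (an illustrative query):

"term1 term2"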
In Lucene, fuzzy queries can be performed using the tilde ( ~ ) operator for approximate matching. You can specify an edit distance to limit the degree of similarity in the matches.
term~\n
In the above example, term is the keyword to perform a fuzzy match on.
Please note the following:
After the tilde ( ~ ), you can optionally specify a parameter to control the similarity of the fuzzy query.
The parameter value ranges from 0 to 2, where 0 represents an exact match, 1 allows for one edit operation (such as adding, deleting, or replacing characters) to match, and 2 allows for two edit operations to match.
If no parameter value is specified, the default similarity threshold used is 0.5.
Fuzzy queries will return documents that are similar to the given keyword but may incur some performance overhead, especially for larger indexes.
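The wildcard query example referred to by the next paragraph is not included in this extract; presumably it looks like the following, using ? as a single-character wildcard:

te?t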
In the above example, te?t represents a word that starts with \"te\", followed by any single character, and ends with \"t\". This query can match words like \"test\", \"text\", and \"tent\".
It is important to note that the question mark ( ? ) represents only a single character. If you want to match multiple characters or varying lengths of characters, you can use the asterisk ( * ) for multi-character wildcard matching. Additionally, the question mark will not match an empty string.
To summarize, in Lucene syntax, the question mark ( ? ) is used as a single-character wildcard to match any single character. By using the question mark in your search keywords, you can perform more flexible and specific pattern matching.
Lucene syntax supports range queries, where you can use square brackets [ ] or curly braces { } to represent a range. Here are examples of range queries:
Inclusive boundary range query:
Square brackets [ ] indicate a closed interval that includes the boundary values.
Example: field:[value1 TO value2] represents the range of values for field , including both value1 and value2 .
Exclusive boundary range query:
Curly braces { } indicate an open interval that excludes the boundary values.
Example: field:{value1 TO value2} represents the range of values for field between value1 and value2 , excluding both.
Omitted boundary range query:
You can omit one or both boundary values to specify an infinite range.
Example: field:[value TO ] represents the range of values for field from value to positive infinity, and field:[ TO value] represents the range of values for field from negative infinity to value .
Note
Please note that range queries are applicable only to fields that can be sorted, such as numeric fields and date fields. Also, ensure that you correctly specify the boundary values as the actual value type of the field in your query. If you want to perform a range query across the entire index without specifying a specific field, you can use the wildcard query * instead of a field name.
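Specify a field (a hedged reconstruction of the example the next sentence describes):

timestamp:[2022-01-01 TO 2022-01-31]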
This will retrieve data where the timestamp field falls within the range from January 1, 2022, to January 31, 2022.
Without specifying a field
*:[value1 TO value2]
This will search the entire index for documents with values ranging from value1 to value2 .
"},{"location":"en/admin/insight/reference/lucene.html#insight-common-keywords","title":"Insight Common Keywords","text":""},{"location":"en/admin/insight/reference/lucene.html#container-logs","title":"Container Logs","text":"
The toClusterName function retrieves the cluster name based on the cluster's unique identifier (ID). If no corresponding cluster is found, it directly returns the passed-in cluster identifier.
The toClusterId function retrieves the cluster's unique identifier (ID) based on the cluster name. If no corresponding cluster is found, it directly returns the passed-in cluster name.
Because Insight combines messages generated by the same rule at the same time when sending alert messages, email subjects are different from the four templates above and only use the content of commonLabels in the alert message to render the template. The default template is as follows:
Other fields that can be used as email subjects are as follows:
{{ .status }} Triggering status of the alert message
{{ .alertgroup }} Name of the policy to which the alert belongs
{{ .alertname }} Name of the rule to which the alert belongs
{{ .severity }} Severity level of the alert
{{ .target_type }} Type of resource for which the alert is raised
{{ .target }} Resource object for which the alert is raised
{{ .Custom label key for other rules }}
"},{"location":"en/admin/insight/reference/tailing-sidecar.html","title":"Collecting Container Logs through Sidecar","text":"
Tailing Sidecar is a Kubernetes cluster-level logging proxy that acts as a streaming sidecar container. It allows automatic collection and summarization of log files within containers, even when the container cannot write to standard output or standard error streams.
Insight supports log collection through the Sidecar mode, which involves running a Sidecar container alongside each Pod to output log data to the standard output stream. This enables FluentBit to collect container logs effectively.
The Insight Agent comes with the tailing-sidecar operator installed by default. To enable file log collection within a container, you can add annotations to the Pod, which will automatically inject the Tailing Sidecar container. The injected Sidecar container reads the files in the business container and outputs them to the standard output stream.
Here are the specific steps to follow:
Modify the YAML file of the Pod and add the following parameters in the annotation field:
sidecar-name-0: Name for the Tailing Sidecar container (optional; a container name will be created automatically if not specified, starting with the prefix "tailing-sidecar").
volume-name-0 : Name of the storage volume.
path-to-tail-0 : File path to tail.
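A hedged sketch of the annotation built from the three parts above; the annotation key tailing-sidecar is an assumption, and the angle-bracket values are placeholders:

metadata:
  annotations:
    tailing-sidecar: <sidecar-name-0>:<volume-name-0>:<path-to-tail-0>;<sidecar-name-1>:<volume-name-1>:<path-to-tail-1>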
Note
Each Pod can run multiple sidecar containers, separated by ; . This allows different sidecar containers to collect multiple files and store them in various volumes.
Restart the Pod. Once the Pod's status changes to Running , you can use the Log Query interface to search for logs within the container of the Pod.
The metrics in this article are organized based on the community's kube-prometheus framework. Currently, it covers metrics from multiple levels, including Cluster, Node, Namespace, and Workload. This article lists some commonly used metrics, their descriptions, and units for easy reference.
"},{"location":"en/admin/insight/reference/used-metric-in-insight.html#cluster","title":"Cluster","text":"Metric Name Description Unit cluster_cpu_utilization Cluster CPU Utilization cluster_cpu_total Total CPU in Cluster Core cluster_cpu_usage CPU Used in Cluster Core cluster_cpu_requests_commitment CPU Allocation Rate in Cluster cluster_memory_utilization Cluster Memory Utilization cluster_memory_usage Memory Usage in Cluster Byte cluster_memory_available Available Memory in Cluster Byte cluster_memory_requests_commitment Memory Allocation Rate in Cluster cluster_memory_total Total Memory in Cluster Byte cluster_net_utilization Network Data Transfer Rate in Cluster Byte/s cluster_net_bytes_transmitted Network Data Transmitted in Cluster (Upstream) Byte/s cluster_net_bytes_received Network Data Received in Cluster (Downstream) Byte/s cluster_disk_read_iops Disk Read IOPS in Cluster times/s cluster_disk_write_iops Disk Write IOPS in Cluster times/s cluster_disk_read_throughput Disk Read Throughput in Cluster Byte/s cluster_disk_write_throughput Disk Write Throughput in Cluster Byte/s cluster_disk_size_capacity Total Disk Capacity in Cluster Byte cluster_disk_size_available Available Disk Size in Cluster Byte cluster_disk_size_usage Disk Usage in Cluster Byte cluster_disk_size_utilization Disk Utilization in Cluster cluster_node_total Total Nodes in Cluster units cluster_node_online Online Nodes in Cluster units cluster_node_offline_count Count of Offline Nodes in Cluster units cluster_pod_count Total Pods in Cluster units cluster_pod_running_count Count of Running Pods in Cluster units cluster_pod_abnormal_count Count of Abnormal Pods in Cluster units cluster_deployment_count Total Deployments in Cluster units cluster_deployment_normal_count Count of Normal Deployments in Cluster units cluster_deployment_abnormal_count Count of Abnormal Deployments in Cluster units cluster_statefulset_count Count of StatefulSets in Cluster units cluster_statefulset_normal_count Count of Normal StatefulSets in Cluster units cluster_statefulset_abnormal_count Count of Abnormal StatefulSets in Cluster units cluster_daemonset_count Count of DaemonSets in Cluster units cluster_daemonset_normal_count Count of Normal DaemonSets in Cluster units cluster_daemonset_abnormal_count Count of Abnormal DaemonSets in Cluster units cluster_job_count Total Jobs in Cluster units cluster_job_normal_count Count of Normal Jobs in Cluster units cluster_job_abnormal_count Count of Abnormal Jobs in Cluster units
Tip
Utilization is generally a number in the range (0,1] (e.g., 0.21, not 21%)
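Because utilization metrics are fractions, a dashboard panel that should display a percentage simply multiplies the value by 100. A minimal example query, assuming a standard PromQL data source:

```promql
# Cluster CPU utilization expressed as a percentage
cluster_cpu_utilization * 100
```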
"},{"location":"en/admin/insight/reference/used-metric-in-insight.html#node","title":"Node","text":"Metric Name Description Unit node_cpu_utilization Node CPU Utilization node_cpu_total Total CPU in Node Core node_cpu_usage CPU Usage in Node Core node_cpu_requests_commitment CPU Allocation Rate in Node node_memory_utilization Node Memory Utilization node_memory_usage Memory Usage in Node Byte node_memory_requests_commitment Memory Allocation Rate in Node node_memory_available Available Memory in Node Byte node_memory_total Total Memory in Node Byte node_net_utilization Network Data Transfer Rate in Node Byte/s node_net_bytes_transmitted Network Data Transmitted in Node (Upstream) Byte/s node_net_bytes_received Network Data Received in Node (Downstream) Byte/s node_disk_read_iops Disk Read IOPS in Node times/s node_disk_write_iops Disk Write IOPS in Node times/s node_disk_read_throughput Disk Read Throughput in Node Byte/s node_disk_write_throughput Disk Write Throughput in Node Byte/s node_disk_size_capacity Total Disk Capacity in Node Byte node_disk_size_available Available Disk Size in Node Byte node_disk_size_usage Disk Usage in Node Byte node_disk_size_utilization Disk Utilization in Node"},{"location":"en/admin/insight/reference/used-metric-in-insight.html#workload","title":"Workload","text":"
The currently supported workload types include: Deployment, StatefulSet, DaemonSet, Job, and CronJob.
Metric Name Description Unit workload_cpu_usage Workload CPU Usage Core workload_cpu_limits Workload CPU Limit Core workload_cpu_requests Workload CPU Requests Core workload_cpu_utilization Workload CPU Utilization workload_memory_usage Workload Memory Usage Byte workload_memory_limits Workload Memory Limit Byte workload_memory_requests Workload Memory Requests Byte workload_memory_utilization Workload Memory Utilization workload_memory_usage_cached Workload Memory Usage (including cache) Byte workload_net_bytes_transmitted Workload Network Data Transmitted Rate Byte/s workload_net_bytes_received Workload Network Data Received Rate Byte/s workload_disk_read_throughput Workload Disk Read Throughput Byte/s workload_disk_write_throughput Workload Disk Write Throughput Byte/s
The metrics here are aggregated for the workload as a whole (summed over all of its Pods).
Metrics can be obtained using workload_cpu_usage{workload_type="deployment", workload="prometheus"}.
Calculation rule for workload_pod_utilization: workload_pod_usage / workload_pod_request.
"},{"location":"en/admin/insight/reference/used-metric-in-insight.html#pod","title":"Pod","text":"Metric Name Description Unit pod_cpu_usage Pod CPU Usage Core pod_cpu_limits Pod CPU Limit Core pod_cpu_requests Pod CPU Requests Core pod_cpu_utilization Pod CPU Utilization pod_memory_usage Pod Memory Usage Byte pod_memory_limits Pod Memory Limit Byte pod_memory_requests Pod Memory Requests Byte pod_memory_utilization Pod Memory Utilization pod_memory_usage_cached Pod Memory Usage (including cache) Byte pod_net_bytes_transmitted Pod Network Data Transmitted Rate Byte/s pod_net_bytes_received Pod Network Data Received Rate Byte/s pod_disk_read_throughput Pod Disk Read Throughput Byte/s pod_disk_write_throughput Pod Disk Write Throughput Byte/s
You can obtain the CPU usage of all Pods belonging to the Deployment named prometheus by using pod_cpu_usage{workload_type="deployment", workload="prometheus"}.
"},{"location":"en/admin/insight/reference/used-metric-in-insight.html#span-metrics","title":"Span Metrics","text":"Metric Name Description Unit calls_total Total Service Requests duration_milliseconds_bucket Service Latency Histogram duration_milliseconds_sum Total Service Latency ms duration_milliseconds_count Number of Latency Records otelcol_processor_groupbytrace_spans_released Number of Collected Spans otelcol_processor_groupbytrace_traces_released Number of Collected Traces traces_service_graph_request_total Total Service Requests (Topology Feature) traces_service_graph_request_server_seconds_sum Total Latency (Topology Feature) ms traces_service_graph_request_server_seconds_bucket Service Latency Histogram (Topology Feature) traces_service_graph_request_server_seconds_count Total Service Requests (Topology Feature)"},{"location":"en/admin/insight/system-config/modify-config.html","title":"Modify system configuration","text":"
Observability persists metric, log, and trace data by default. Users can modify the system configuration as described on this page.
"},{"location":"en/admin/insight/system-config/modify-config.html#how-to-modify-the-metric-data-retention-period","title":"How to modify the metric data retention period","text":"
Refer to the following steps to modify the metric data retention period.
After saving the modification, the Pod of the component responsible for storing metrics will restart automatically; just wait a few moments.
"},{"location":"en/admin/insight/system-config/modify-config.html#how-to-modify-the-log-data-storage-duration","title":"How to modify the log data storage duration","text":"
Refer to the following steps to modify the log data retention period:
"},{"location":"en/admin/insight/system-config/modify-config.html#method-1-modify-the-json-file","title":"Method 1: Modify the Json file","text":"
Modify the max_age parameter in the rollover field in the following files and set the retention period. The default storage period is 7d . Change http://localhost:9200 to the access address of Elasticsearch.
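The original policy files are not reproduced here. As a rough illustration only, an Elasticsearch ILM policy update generally looks like the request below; the retention value is an example, and a real update should keep all other fields of the existing insight-es-k8s-logs-policy policy unchanged:

```bash
curl -X PUT "http://localhost:9200/_ilm/policy/insight-es-k8s-logs-policy" \
  -H 'Content-Type: application/json' \
  -d '{
        "policy": {
          "phases": {
            "hot": {
              "actions": {
                "rollover": { "max_age": "14d" }
              }
            }
          }
        }
      }'
```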
After making the change, run the command. If it prints the content shown below, the modification was successful.
{\n\"acknowledged\": true\n}\n
"},{"location":"en/admin/insight/system-config/modify-config.html#method-2-modify-from-the-ui","title":"Method 2: Modify from the UI","text":"
Log in to Kibana and select Stack Management in the left navigation bar.
Select Index Lifecycle Policies in the left navigation, find the policy insight-es-k8s-logs-policy , and click it to enter the details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period. The default storage period is 7d .
After modification, click Save policy at the bottom of the page to complete the modification.
"},{"location":"en/admin/insight/system-config/modify-config.html#how-to-modify-the-trace-data-storage-duration","title":"How to modify the trace data storage duration","text":"
Refer to the following steps to modify the trace data retention period:
"},{"location":"en/admin/insight/system-config/modify-config.html#method-1-modify-the-json-file_1","title":"Method 1: Modify the Json file","text":"
Modify the max_age parameter in the rollover field in the following files and set the retention period. The default storage period is 7d . Also change http://localhost:9200 to the access address of Elasticsearch.
On the system component page, you can quickly view the running status of the system components in Insight. When a system component fails, some features in Insight will be unavailable.
Go to the Insight product module.
In the left navigation bar, select System Management -> System Components .
"},{"location":"en/admin/insight/system-config/system-component.html#component-description","title":"Component description","text":"Module Component Name Description Metrics vminsert-insight-victoria-metrics-k8s-stack Responsible for writing the metric data collected by Prometheus in each cluster to the storage component. If this component is abnormal, the metric data of the worker cluster cannot be written. Metrics vmalert-insight-victoria-metrics-k8s-stack Responsible for taking effect of the recording and alert rules configured in the VM Rule, and sending the triggered alert rules to alertmanager. Metrics vmalertmanager-insight-victoria-metrics-k8s-stack is responsible for sending messages when alerts are triggered. If this component is abnormal, the alert information cannot be sent. Metrics vmselect-insight-victoria-metrics-k8s-stack Responsible for querying metrics data. If this component is abnormal, the metric cannot be queried. Metrics vmstorage-insight-victoria-metrics-k8s-stack Responsible for storing multicluster metrics data. Dashboard grafana-deployment Provide monitoring panel capability. The exception of this component will make it impossible to view the built-in dashboard. Link insight-jaeger-collector Responsible for receiving trace data in opentelemetry-collector and storing it. Link insight-jaeger-query Responsible for querying the trace data collected in each cluster. Link insight-opentelemetry-collector Responsible for receiving trace data forwarded by each sub-cluster Log elasticsearch Responsible for storing the log data of each cluster."},{"location":"en/admin/insight/system-config/system-config.html","title":"System Settings","text":"
System Settings displays the default storage duration of metrics, logs, and traces, as well as the default Apdex threshold.
Click the right navigation bar and select System Settings .
Currently, only the storage duration of historical alerts can be modified. Click Edit and enter the target duration.
When the storage duration is set to "0", historical alerts will not be cleared.
Note
To modify other settings, refer to How to modify the system settings?
In Insight , a service refers to a group of workloads that provide the same behavior for incoming requests. Service insight uses the OpenTelemetry SDK to help observe the performance and status of applications at runtime.
For how to use OpenTelemetry, please refer to: Using OTel to give your application insight.
Service: A service represents a group of workloads that provide the same behavior for incoming requests. You can define the service name when using the OpenTelemetry SDK or use the name defined in Istio.
Operation: An operation refers to a specific request or action handled by a service. Each span has an operation name.
Outbound Traffic: Outbound traffic refers to all the traffic generated by the current service when making requests.
Inbound Traffic: Inbound traffic refers to all the traffic initiated by the upstream service targeting the current service.
The Services List page displays key metrics such as throughput rate, error rate, and request latency for all services that have been instrumented with distributed tracing. You can filter services based on clusters or namespaces and sort the list by throughput rate, error rate, or request latency. By default, the data displayed in the list is for the last hour, but you can customize the time range.
Follow these steps to view service insight metrics:
Go to the Insight product module.
Select Trace Tracking -> Services from the left navigation bar.
Attention
If the namespace of a service in the list is unknown , it means that the service has not been properly instrumented. We recommend reconfiguring the instrumentation.
If multiple services have the same name and none of them have the correct Namespace environment variable configured, the metrics displayed in the list and service details page will be aggregated for all those services.
Click a service name (taking insight-system as an example) to view the detailed metrics and operation metrics for that service.
In the Service Topology section, you can view the service topology one layer above or below the current service. When you hover over a node, you can see its information.
In the Traffic Metrics section, you can view the monitoring metrics for all requests to the service within the past hour (including inbound and outbound traffic).
You can use the time selector in the upper right corner to quickly select a time range or specify a custom time range.
Sorting is available for throughput, error rate, and request latency in the operation metrics.
Clicking on the icon next to an individual operation will take you to the Traces page to quickly search for related traces.
"},{"location":"en/admin/insight/trace/service.html#service-metric-explanations","title":"Service Metric Explanations","text":"Metric Description Throughput Rate The number of requests processed within a unit of time. Error Rate The ratio of erroneous requests to the total number of requests within the specified time range. P50 Request Latency The response time within which 50% of requests complete. P95 Request Latency The response time within which 95% of requests complete. P99 Request Latency The response time within which 99% of requests complete."},{"location":"en/admin/insight/trace/topology.html","title":"Service Map","text":"
Service map is a visual representation of the connections, communication, and dependencies between services. It provides insights into the service-to-service interactions, allowing you to view the calls and performance of services within a specified time range. The connections between nodes in the topology map represent the existence of service-to-service calls during the queried time period.
Select Tracing -> Service Map from the left navigation bar.
In the Service Map, you can perform the following actions:
Click a node to slide out the details of the service on the right side. Here, you can view metrics such as request latency, throughput, and error rate for the service. Clicking on the service name takes you to the service details page.
Hover over the connections to view the traffic metrics between the two services.
Click Display Settings , you can configure the display elements in the service map.
In the Service Map, there can be nodes that are not part of the cluster. These external nodes can be categorized into three types:
Database
Message Queue
Virtual Node
If a service makes a request to a Database or Message Queue, these two types of nodes will be displayed by default in the topology map. However, Virtual Nodes represent nodes outside the cluster or services not integrated into the trace, and they will not be displayed by default in the map.
When a service makes a request to MySQL, PostgreSQL, or Oracle Database, the detailed database type can be seen in the map.
TraceID: Used to identify a complete request call trace.
Operation: Describes the specific operation or event represented by a Span.
Entry Span: The entry Span represents the first request of the entire call.
Latency: The duration from receiving the request to completing the response for the entire call trace.
Span: The number of Spans included in the entire trace.
Start Time: The time when the current trace starts.
Tag: A collection of key-value pairs that constitute Span tags. Tags are used to annotate and supplement Spans, and each Span can have multiple key-value tag pairs.
Click the icon on the right side of the trace data to search for associated logs.
By default, it queries the log data within the duration of the trace and one minute after its completion.
The queried logs include those with the trace's TraceID in their log text and container logs related to the trace invocation process.
Click View More to jump to the Associated Log page with conditions.
By default, all logs are searched, but you can filter by the TraceID or the relevant container logs from the trace call process using the dropdown.
Note
Since a trace may span clusters or namespaces, if the user does not have sufficient permissions, they will be unable to query the associated logs for that trace.
"},{"location":"en/admin/k8s/add-node.html#steps-to-add-nodes","title":"Steps to Add Nodes","text":"
Log in to the AI platform as an administrator.
Navigate to Container Management -> Clusters, and click the name of the target cluster.
On the cluster overview page, click Node Management, and then click the Add Node button on the right side.
Follow the wizard, fill in the required parameters, and then click OK.
Basic Information / Parameter Configuration
Click OK in the popup window.
Return to the node list. The status of the newly added node will be Pending. After a few minutes, if the status changes to Running, it indicates that the node has been successfully added.
Tip
For nodes that have just been successfully added, it may take an additional 2-3 minutes for the GPU to be recognized.
"},{"location":"en/admin/k8s/create-k8s.html","title":"Creating a Kubernetes Cluster on the Cloud","text":"
Deploying a Kubernetes cluster supports efficient scheduling and management of AI computing resources, enables elastic scaling, provides high availability, and optimizes model training and inference.
After configuring the node information, click Start Check.
By default, each node can run 110 Pods. If the node has a higher hardware configuration, this limit can be raised to 200 or 300 Pods.
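Under the hood this limit corresponds to the kubelet maxPods setting. If you manage the kubelet configuration yourself, a minimal sketch looks like this (the value 200 is just an example):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 200   # default is 110; raise only if the node has enough CPU, memory, and IP addresses
```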
Wait for the cluster creation to complete.
In the cluster list, find the newly created cluster, click the cluster name, navigate to Helm Apps -> Helm Charts, and search for metax-gpu-extensions in the search box, then click the card.
Click the Install button on the right to begin installing the GPU plugin.
The cost of GPU resources is relatively high. If you temporarily do not need a GPU, you can remove the worker nodes with GPUs. The following steps are also applicable for removing regular worker nodes.
Navigate to Container Management -> Clusters, and click the name of the target cluster.
On the cluster overview page, click Nodes, find the node you want to remove, click the ┇ icon on the right side of the list, and select Remove Node from the pop-up menu.
In the pop-up window, enter the node name, and after confirming it is correct, click Delete.
You will automatically return to the node list, and the status will be Removing. After a few minutes, refresh the page; if the node is no longer there, it indicates that the node has been successfully removed.
After removing the node from the UI list, log in to the host of the removed node via SSH and run the shutdown command.
Tip
After removing the node from the UI and shutting it down, the data on the node is not immediately deleted; the node's data will be retained for a period of time.
"},{"location":"en/admin/kpanda/backup/index.html","title":"Backup and Restore","text":"
Backup and restore are essential aspects of system management. In practice, it is important to first back up the data of the system at a specific point in time and securely store the backup. In case of incidents such as data corruption, loss, or accidental deletion, the system can be quickly restored based on the previous backup data, reducing downtime and minimizing losses.
In real production environments, services may be deployed across different clouds, regions, or availability zones. If one infrastructure faces a failure, organizations need to quickly restore applications in other available environments. In such cases, cross-cloud or cross-cluster backup and restore become crucial.
Large-scale systems often involve multiple roles and users with complex permission management systems. With many operators involved, accidents caused by human error can lead to system failures. In such scenarios, the ability to roll back the system quickly using previously backed-up data is necessary. Relying solely on manual troubleshooting, fault repair, and system recovery can be time-consuming, resulting in prolonged system unavailability and increased losses for organizations.
Additionally, factors like network attacks, natural disasters, and equipment malfunctions can also cause data accidents.
Therefore, backup and restore are vital as the last line of defense for maintaining system stability and ensuring data security.
Backups are typically classified into three types: full backups, incremental backups, and differential backups. Currently, AI platform supports full backups and incremental backups.
The backup and restore provided by AI platform can be divided into two categories: Application Backup and ETCD Backup. It supports both manual backups and scheduled automatic backups using CronJobs.
Application Backup
Application backup refers to backing up data of a specific workload in the cluster and then restoring that data either within the same cluster or in another cluster. It supports backing up all resources under a namespace or filtering resources by specific labels.
Application backup also supports cross-cluster backup of stateful applications. For detailed steps, refer to the Backup and Restore MySQL Applications and Data Across Clusters guide.
etcd Backup
etcd is the data storage component of Kubernetes. Kubernetes stores its own component's data and application data in etcd. Therefore, backing up etcd is equivalent to backing up the entire cluster's data, allowing quick restoration of the cluster to a previous state in case of failures.
It's worth noting that currently, restoring etcd backup data is only supported within the same cluster (the original cluster). To learn more about related best practices, refer to the ETCD Backup and Restore guide.
This article explains how to back up applications in AI platform. The demo application used in this tutorial is called dao-2048 , which is a deployment.
Before backing up a deployment, the following prerequisites must be met:
Integrate a Kubernetes cluster or create one in the Container Management module, and be able to access the cluster's UI.
Create a Namespace and a user.
The current operating user should have NS Editor or higher permissions. For details, refer to Namespace Authorization.
Install the velero component and ensure it is running properly.
Create a deployment (the workload in this tutorial is named dao-2048 ), and label the deployment with app: dao-2048 .
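If you do not already have such a workload, a quick way to prepare one is sketched below; the image is purely illustrative, and the namespace placeholder should be replaced with the namespace created above:

```bash
kubectl -n <your-namespace> create deployment dao-2048 --image=docker.io/library/nginx:latest  # illustrative image
kubectl -n <your-namespace> label deployment dao-2048 app=dao-2048 --overwrite                 # label used to select resources during backup
```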
Follow the steps below to backup the deployment dao-2048 .
Enter the Container Management module, click Backup Recovery -> Application Backup on the left navigation bar, and enter the Application Backup list page.
On the Application Backup list page, select the cluster where velero and the dao-2048 application have been installed. Click Backup Plan in the upper right corner to create a new backup plan.
Refer to the instructions below to fill in the backup configuration.
Name: The name of the new backup plan.
Source Cluster: The cluster where the application backup plan is to be executed.
Object Storage Location: The access path of the object storage configured when installing velero on the source cluster.
Namespace: The namespaces that need to be backed up, multiple selections are supported.
Advanced Configuration: Back up specific resources in the namespace based on resource labels, such as an application, or do not back up specific resources in the namespace based on resource labels during backup.
Refer to the instructions below to set the backup execution frequency, and then click Next .
Backup Frequency: Set the execution schedule in minutes, hours, days, weeks, or months. Custom Cron expressions with numbers and * are supported (for example, 0 2 * * * runs the backup daily at 2:00 AM); after entering an expression, its meaning is displayed. For detailed expression syntax rules, refer to Cron Schedule Syntax.
Retention Time (days): Set the storage time of backup resources. The default is 30 days; backups are deleted after expiration.
Backup Data Volume (PV): Whether to back up the data in persistent volumes (PVs). Direct copy and CSI snapshots are supported.
Direct Replication: directly copy the data in the data volume (PV) for backup;
Use CSI snapshots: Use CSI snapshots to back up data volumes (PVs). Requires a CSI snapshot type available for backup in the cluster.
Click OK . The page automatically returns to the application backup plan list. Find the newly created dao-2048 backup plan and click Immediate Execution .
At this point, the Last Execution State of the plan will change to in progress . After the backup is complete, you can click the name of the backup plan to view its details.
etcd backup is a backup centered on cluster data. In cases such as hardware damage or misconfiguration in development and test environments, the backed-up cluster data can be restored through an etcd backup.
This section introduces how to perform etcd backups for clusters. Also see etcd Backup and Restore Best Practices.
Enter Container Management -> Backup Recovery -> etcd Backup page, you can see all the current backup policies. Click Create Backup Policy on the right.
Fill in the Basic Information. Then, click Next to automatically verify the connectivity of etcd. If the verification passes, proceed to the next step.
First select the backup cluster and log in to its terminal.
Enter the etcd access address in the format https://${NodeIP}:${Port}.
In a standard Kubernetes cluster, the default port for etcd is 2379.
In a Suanova 4.0 cluster, the default port for etcd is 12379.
In a public cloud managed cluster, you need to contact the relevant developers to obtain the etcd port number. This is because the control plane components of public cloud clusters are maintained and managed by the cloud service provider. Users cannot directly access or view these components, nor can they obtain control plane port information through regular commands (such as kubectl).
Ways to obtain port number
Find the etcd Pod in the kube-system namespace
kubectl get po -n kube-system | grep etcd\n
Get the port number from the listen-client-urls of the etcd Pod
kubectl get po -n kube-system ${etcd_pod_name} -oyaml | grep listen-client-urls # (1)!\n
Replace etcd_pod_name with the actual Pod name
The expected output is as follows, where the number after the node IP is the port number:
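The exact output depends on your cluster; an illustrative line (IP and port are placeholders) looks roughly like this, where 2379 is the etcd client port:

```
- --listen-client-urls=https://127.0.0.1:2379,https://10.6.212.10:2379
```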
Fill in the CA certificate. You can use the following command to view the certificate content, then copy and paste it to the proper location:
Standard Kubernetes Cluster / Suanova 4.0 Cluster
cat /etc/kubernetes/ssl/etcd/ca.crt\n
cat /etc/daocloud/dce/certs/ca.crt\n
Fill in the Cert certificate. You can use the following command to view the content of the certificate, then copy and paste it to the proper location:
Click How to get below the input box to see how to obtain the proper information on the UI page.
Refer to the following information to fill in the Backup Policy.
Backup Method: Choose either manual backup or scheduled backup
Manual Backup: Immediately perform a full backup of etcd data based on the backup configuration.
Scheduled Backup: Periodically perform full backups of etcd data according to the set backup frequency.
Backup Chain Length: The maximum number of backups to retain. The default is 30.
Backup Frequency: Hourly, daily, weekly, or monthly, and can also be customized.
Refer to the following information to fill in the Storage Path.
Storage Provider: Default is S3 storage
Object Storage Access Address: The access address of MinIO
Bucket: Create a Bucket in MinIO and fill in the Bucket name
Username: The login username for MinIO
Password: The login password for MinIO
After clicking OK , the page will automatically redirect to the backup policy list, where you can view all the currently created ones.
Click the ┇ action button on the right side of the policy to view logs, view YAML, update the policy, stop the policy, or execute the policy immediately.
When the backup method is manual, you can click Execute Now to perform the backup.
When the backup method is scheduled, the backup will be performed according to the configured time.
Click Logs to view the log content. By default, 100 lines are displayed. If you want to see more log information or download the logs, you can follow the prompts above the logs to go to the observability module.
Go to Container Management -> Backup Recovery -> etcd Backup, and click the Recovery Point tab.
After selecting the target cluster, you can view all the backup information under that cluster.
Each time a backup is executed, a proper recovery point is generated, which can be used to quickly restore the application from a successful recovery point.
"},{"location":"en/admin/kpanda/backup/install-velero.html","title":"Install the Velero Plugin","text":"
velero is an open source tool for backing up and restoring Kubernetes cluster resources. It can back up resources in a Kubernetes cluster to cloud storage services, local storage, or other locations, and restore those resources to the same or a different cluster when needed.
This section introduces how to deploy the Velero plugin in AI platform using the Helm Apps.
Please perform the following steps to install the velero plugin for your cluster.
On the cluster list page, find the target cluster where the velero plugin needs to be installed, click the name of the cluster, click Helm Apps -> Helm chart in the left navigation bar, and enter velero in the search bar to search.
Read the introduction of the velero plugin, select the version, and click the Install button. This page uses version 5.2.0 as an example; it is recommended to install version 5.2.0 or later.
Configure basic info .
Name: Enter the plugin name. The name can be up to 63 characters, may only contain lowercase letters, numbers, and hyphens ("-"), and must start and end with a lowercase letter or number, for example metrics-server-01.
Namespace: Select the namespace for plugin installation; it must be the velero namespace.
Version: The version of the plugin, here we take 5.2.0 version as an example.
Wait: When enabled, it will wait for all associated resources under the application to be ready before marking the application installation as successful.
Deletion Failed: When enabled, sync is enabled by default together with ready wait. If the installation fails, the installation-related resources are removed.
Detailed Logs: Turn on the verbose output of the installation process log.
Note
After enabling Ready Wait and/or Failed Delete , it takes a long time for the app to be marked as Running .
Configure Velero chart Parameter Settings according to the following instructions
S3 Credentials: Configure the authentication information of object storage (minio).
Use secret: Keep the default configuration true.
Secret name: Keep the default configuration velero-s3-credential.
SecretContents.aws_access_key_id : Configure the username (access key) used to access object storage; replace it with the actual value.
SecretContents.aws_secret_access_key : Configure the password (secret key) used to access object storage; replace it with the actual value.
An example of the Use existing secret parameter is shown below.
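A minimal sketch of preparing such a secret manually is shown below. The credentials file follows the AWS-style format that Velero expects; the secret name and key must match the chart values, and the account values are placeholders:

```bash
cat > credentials-velero <<EOF
[default]
aws_access_key_id = rootuser
aws_secret_access_key = rootpass123
EOF

kubectl create secret generic velero-s3-credential -n velero \
  --from-file=cloud=credentials-velero
```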
BackupStorageLocation: The location where Velero backs up data.
S3 bucket: The name of the storage bucket used to save backup data (must be a real storage bucket that already exists in minio).
Is default BackupStorage: Keep the default configuration true.
S3 access mode: Velero's access mode to the data. One of the following can be selected:
ReadWrite: Allow Velero to read and write backup data;
ReadOnly: Allow Velero to read backup data, but cannot modify backup data;
WriteOnly: Only allow Velero to write backup data, and cannot read backup data.
S3 Configs: Detailed configuration of S3 storage (minio).
S3 region: The geographical region of cloud storage. The default is to use the us-east-1 parameter, which is provided by the system administrator.
S3 force path style: Keep the default configuration true.
S3 server URL: The API access address of object storage (MinIO). MinIO generally exposes two services, the web UI and the API; use the API access address here (typically on port 9000), not the web UI address.
Click the OK button to complete the installation of the Velero plugin. The system will automatically jump to the Helm Apps list page. After waiting for a few minutes, refresh the page, and you can see the application just installed.
"},{"location":"en/admin/kpanda/best-practice/add-master-node.html","title":"Scaling Controller Nodes in a Worker Cluster","text":"
This article provides a step-by-step guide on how to manually scale the control nodes in a worker cluster to achieve high availability for self-built clusters.
Note
It is recommended to enable high availability mode when creating the worker cluster in the interface. Manually scaling the control nodes of the worker cluster involves certain operational risks, so please proceed with caution.
A worker cluster has been created on the AI platform. You can refer to the documentation on Creating a Worker Cluster.
The managed cluster associated with the worker cluster exists in the current platform and is running normally.
Note
Managed cluster refers to the cluster specified during the creation of the worker cluster, which provides capabilities such as Kubernetes version upgrades, node scaling, uninstallation, and operation records for the current cluster.
"},{"location":"en/admin/kpanda/best-practice/add-master-node.html#modify-the-host-manifest","title":"Modify the Host manifest","text":"
Log in to the container management platform and go to the overview page of the cluster where you want to scale the control nodes. In the Basic Information section, locate the Managed Cluster of the current cluster and click its name to enter the overview page.
In the overview page of the managed cluster, click Console to open the cloud terminal console. Run the following command to find the host manifest of the worker cluster that needs to be scaled.
kubectl get cm -n kubean-system ${ClusterName}-hosts-conf -oyaml\n
${ClusterName} is the name of the worker cluster to be scaled.
Modify the host manifest file based on the example below and add information for the controller nodes.
all.hosts.node1: Existing master node in the original cluster
all.hosts.node2, all.hosts.node3: Control nodes to be added during cluster scaling
all.children.kube_control_plane.hosts: Control plane group in the cluster
all.children.kube_node.hosts: Worker node group in the cluster
all.children.etcd.hosts: ETCD node group in the cluster
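A simplified sketch of what the modified hosts manifest might look like is shown below; the node names, IP addresses, and credentials are placeholders, and the actual ConfigMap generated by the platform may contain additional groups and variables:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${ClusterName}-hosts-conf
  namespace: kubean-system
data:
  hosts.yml: |
    all:
      hosts:
        node1:                      # existing master node
          ip: 10.6.212.10
          ansible_host: 10.6.212.10
          ansible_user: root
          ansible_password: password01
        node2:                      # control node to be added
          ip: 10.6.212.11
          ansible_host: 10.6.212.11
          ansible_user: root
          ansible_password: password01
        node3:                      # control node to be added
          ip: 10.6.212.12
          ansible_host: 10.6.212.12
          ansible_user: root
          ansible_password: password01
      children:
        kube_control_plane:
          hosts:
            node1:
            node2:
            node3:
        kube_node:
          hosts:
            node1:
            node2:
            node3:
        etcd:
          hosts:
            node1:
            node2:
            node3:
        k8s_cluster:
          children:
            kube_control_plane:
            kube_node:
```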
"},{"location":"en/admin/kpanda/best-practice/add-master-node.html#add-expansion-task-scale-master-node-opsyaml-using-the-clusteroperationyml-template","title":"Add Expansion Task \"scale-master-node-ops.yaml\" using the ClusterOperation.yml Template","text":"
Use the following ClusterOperation.yml template to add a cluster control node expansion task called \"scale-master-node-ops.yaml\".
ClusterOperation.yml
apiVersion: kubean.io/v1alpha1\nkind: ClusterOperation\nmetadata:\n name: cluster1-online-install-ops\nspec:\n cluster: ${cluster-name} # Specify cluster name\n image: ghcr.m.daocloud.io/kubean-io/spray-job:v0.18.0 # Specify the image for the kubean job\n actionType: playbook\n action: cluster.yml\n extraArgs: --limit=etcd,kube_control_plane -e ignore_assert_errors=yes\n preHook:\n - actionType: playbook\n action: ping.yml\n - actionType: playbook\n action: disable-firewalld.yml\n - actionType: playbook\n action: enable-repo.yml # In an offline environment, you need to add this yaml and\n # set the correct repo-list (for installing operating system packages).\n # The following parameter values are for reference only.\n extraArgs: |\n -e \"{repo_list: ['http://172.30.41.0:9000/kubean/centos/\\$releasever/os/\\$basearch','http://172.30.41.0:9000/kubean/centos-iso/\\$releasever/os/\\$basearch']}\"\n postHook:\n - actionType: playbook\n action: upgrade-cluster.yml\n extraArgs: --limit=etcd,kube_control_plane -e ignore_assert_errors=yes\n - actionType: playbook\n action: kubeconfig.yml\n - actionType: playbook\n action: cluster-info.yml\n
Note
spec.image: The image address should be consistent with the image within the job that was previously deployed
spec.action: Set to cluster.yml. If three or more master (etcd) nodes are being added at once, add the extra parameter -e etcd_retries=10 to cluster.yml to increase the number of etcd node join retries.
spec.extraArgs: set to --limit=etcd,kube_control_plane -e ignore_assert_errors=yes
In an offline environment, spec.preHook must include enable-repo.yml, and the extraArgs parameter must specify the correct repo_list for the relevant OS.
spec.postHook.action: should include upgrade-cluster.yml, where extraArgs is set to --limit=etcd,kube_control_plane -e ignore_assert_errors=yes
Create and deploy scale-master-node-ops.yaml based on the above configuration.
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html","title":"Scaling the Worker Nodes of the Global Service Cluster","text":"
This page introduces how to manually scale the worker nodes of the global service cluster in offline mode. By default, it is not recommended to scale the global service cluster after deploying AI platform. Please ensure proper resource planning before deploying AI platform.
Note
The controller nodes of the global service cluster do not support scaling.
The AI platform deployment has been completed through the bootstrap node, and the kind cluster on the bootstrap node is running normally.
You must log in with a user account that has admin privileges on the platform.
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#get-kubeconfig-for-the-kind-cluster-on-the-bootstrap-node","title":"Get kubeconfig for the kind cluster on the bootstrap node","text":"
Run the following command to log in to the bootstrap node:
ssh root@bootstrap-node-ip-address\n
On the bootstrap node, run the following command to get the CONTAINER ID of the kind cluster:
[root@localhost ~]# podman ps\n\n# Expected output:\nCONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n220d662b1b6a docker.m.daocloud.io/kindest/node:v1.26.2 2 weeks ago Up 2 weeks 0.0.0.0:443->30443/tcp, 0.0.0.0:8081->30081/tcp, 0.0.0.0:9000-9001->32000-32001/tcp, 0.0.0.0:36674->6443/tcp my-cluster-installer-control-plane\n
Run the following command to enter a container in the kind cluster:
podman exec -it {CONTAINER ID} bash\n
Replace {CONTAINER ID} with your actual container ID.
Inside the container of the kind cluster, run the following command to get the kubeconfig information for the kind cluster:
kubectl config view --minify --flatten --raw\n
After the console output, copy the kubeconfig information of the kind cluster for the next step.
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#create-clusterkubeanio-resources-in-the-kind-cluster-on-the-bootstrap-node","title":"Create cluster.kubean.io resources in the kind cluster on the bootstrap node","text":"
Use the command podman exec -it {CONTAINER ID} bash to enter the kind cluster container.
Inside the kind cluster container, run the following command to get the kind cluster name:
kubectl get clusters\n
Copy and run the following command within the kind cluster to create the cluster.kubean.io resource:
The default cluster name for spec.hostsConfRef.name, spec.kubeconfRef.name, and spec.varsConfRef.name is my-cluster. Please replace it with the kind cluster name obtained in the previous step.
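Based on the field names mentioned above, the resource manifest is roughly of the following shape; the referenced names use the my-cluster prefix and should be replaced with the kind cluster name obtained earlier, then applied inside the kind cluster container with kubectl apply -f <file>:

```yaml
apiVersion: kubean.io/v1alpha1
kind: Cluster
metadata:
  name: kpanda-global-cluster
spec:
  hostsConfRef:
    namespace: kubean-system
    name: my-cluster-hosts-conf      # replace my-cluster with the actual kind cluster name
  varsConfRef:
    namespace: kubean-system
    name: my-cluster-vars-conf
  kubeconfRef:
    namespace: kubean-system
    name: my-cluster-kubeconf
```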
Run the following command in the kind cluster to verify if the cluster.kubean.io resource is created successfully:
kubectl get clusters\n
Expected output is:
```
NAME                    AGE
kpanda-global-cluster   3s
my-cluster              16d
```
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#update-the-containerd-configuration-in-the-kind-cluster-on-the-bootstrap-node","title":"Update the containerd configuration in the kind cluster on the bootstrap node","text":"
Run the following command to log in to one of the controller nodes of the global service cluster:
On the global service cluster controller node, run the following command to copy the containerd configuration file config.toml from the controller node to the bootstrap node:
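A typical way to do this is with scp; the bootstrap node IP and destination path below are placeholders:

```bash
scp /etc/containerd/config.toml root@<bootstrap-node-ip>:/root/config.toml.controller
```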
On the bootstrap node, select the insecure registry section from the containerd configuration file config.toml that was copied from the controller node, and add it to the config.toml in the kind cluster.
An example of the insecure registry section is as follows:
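The section copied from the controller node will look roughly like the snippet below; the registry address is a placeholder, and the exact keys depend on your containerd version:

```toml
[plugins."io.containerd.grpc.v1.cri".registry]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."10.6.212.100:8443"]
      endpoint = ["https://10.6.212.100:8443"]
  [plugins."io.containerd.grpc.v1.cri".registry.configs."10.6.212.100:8443".tls]
    insecure_skip_verify = true
```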
Since the config.toml file in the kind cluster cannot be modified directly, you can first copy the file out to modify it and then copy it back to the kind cluster. The steps are as follows:
Run the following command on the bootstrap node to copy the file out:
{CONTAINER ID} should be replaced with your actual container ID.
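A sketch of the copy-out / copy-back workflow with podman cp (the paths are the usual kind node defaults and may differ in your environment):

```bash
# Copy the config out of the kind node container
podman cp {CONTAINER ID}:/etc/containerd/config.toml ./config.toml
# ... add the insecure registry section to ./config.toml ...
# Copy the modified file back into the kind node container
podman cp ./config.toml {CONTAINER ID}:/etc/containerd/config.toml
```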
Run the following command within the kind cluster to restart the containerd service:
systemctl restart containerd\n
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#integrate-a-kind-cluster-into-the-ai-platform-cluster-list","title":"Integrate a Kind cluster into the AI platform cluster list","text":"
Log in to AI platform, navigate to Container Management, and on the right side of the cluster list, click the Integrate Cluster button.
In the integration configuration section, fill in and edit the kubeconfig of the Kind cluster.
Skip TLS verification ( insecure-skip-tls-verify: true ); this line needs to be added manually.
Replace it with the IP of the Kind node, and change port 6443 to the port mapped to the node (you can run the command podman ps|grep 6443 to check the mapped port).
Click the OK to complete the integration of the Kind cluster.
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#add-labels-to-the-global-service-cluster","title":"Add Labels to the Global Service Cluster","text":"
Log in to AI platform, navigate to Container Management, find the kpanda-global-cluster cluster, and on the right side find the Basic Configuration menu option.
In the Basic Configuration page, add the label kpanda.io/managed-by=my-cluster for the global service cluster:
Note
The value in the label kpanda.io/managed-by=my-cluster corresponds to the name of the cluster specified during the integration process, which defaults to my-cluster. Please adjust this according to your actual situation.
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#add-nodes-to-the-global-service-cluster","title":"Add nodes to the global service cluster","text":"
Go to the node list page of the global service cluster, find the Integrate Node button on the right side of the node list, and click to enter the node configuration page.
After filling in the IP and authentication information of the node to be integrated, click Start Check . Once the node check is completed, click Next .
Add the following custom parameters in the Custom Parameters section:
Click the OK button and wait for the node to be added.
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html","title":"Cross-Cluster Backup and Recovery of MySQL Application and Data","text":"
This demonstration will show how to use the application backup feature in AI platform to perform cross-cluster backup migration for a stateful application.
Note
The current operator should have admin privileges on the AI platform.
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#prepare-the-demonstration-environment","title":"Prepare the Demonstration Environment","text":""},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#prepare-two-clusters","title":"Prepare Two Clusters","text":"
main-cluster will be the source cluster for backup data, and recovery-cluster will be the target cluster for data recovery.
Cluster IP Nodes main-cluster 10.6.175.100 1 node recovery-cluster 10.6.175.110 1 node"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#set-up-minio-configuration","title":"Set Up MinIO Configuration","text":"MinIO Server Address Bucket Username Password http://10.7.209.110:9000 mysql-demo root dangerous"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#deploy-nfs-storage-service-in-both-clusters","title":"Deploy NFS Storage Service in Both Clusters","text":"
Note
NFS storage service needs to be deployed on all nodes in both the source and target clusters.
Install the dependencies required for NFS on all nodes in both clusters.
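The required packages depend on the node operating system; typical commands are shown below (the package names are the common defaults, not platform-specific guarantees):

```bash
# CentOS / RHEL nodes
yum install -y nfs-utils
# Ubuntu / Debian nodes
apt-get install -y nfs-common
```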
Prepare NFS storage service for the MySQL application.
Log in to any control node of both main-cluster and recovery-cluster . Use the command vi nfs.yaml to create a file named nfs.yaml on the node, and copy the following YAML content into the nfs.yaml file.
Run the nfs.yaml file on the control nodes of both clusters.
kubectl apply -f nfs.yaml\n
Check the status of the NFS Pod and wait for its status to become running (approximately 2 minutes).
kubectl get pod -n nfs-system -owide\n
Expected output
[root@g-master1 ~]# kubectl get pod -owide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\nnfs-provisioner-7dfb9bcc45-74ws2 1/1 Running 0 4m45s 10.6.175.100 g-master1 <none> <none>\n
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#deploy-mysql-application","title":"Deploy MySQL Application","text":"
Prepare a PVC (Persistent Volume Claim) based on NFS storage for the MySQL application to store its data.
Use the command vi pvc.yaml to create a file named pvc.yaml on the node, and copy the following YAML content into the pvc.yaml file.
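The original pvc.yaml is not reproduced here. As a rough sketch, a PVC named mydata backed by the NFS StorageClass might look like this (the StorageClass name and size are assumptions that must match your NFS provisioner):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mydata
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs   # must match the StorageClass created by the NFS provisioner above
```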
Run kubectl get pod | grep mysql to view the status of the MySQL Pod and wait for its status to become running (approximately 2 minutes).
Expected output
```
[root@g-master1 ~]# kubectl get pod | grep mysql
mysql-deploy-5d6f94cb5c-gkrks   1/1   Running   0   2m53s
```
Note
If the MySQL Pod remains in a non-running state for a long time, it is usually because NFS dependencies are not installed on all nodes in the cluster.
Run kubectl describe pod ${mysql pod name} to view detailed information about the Pod.
If there is an error message like MountVolume.SetUp failed for volume \"pvc-4ad70cc6-df37-4253-b0c9-8cb86518ccf8\" : mount failed: exit status 32 , please delete the previous resources by executing kubectl delete -f nfs.yaml/pvc.yaml/mysql.yaml and start from deploying the NFS service again.
Write data to the MySQL application.
To verify the success of the data migration later, you can use a script to write test data to the MySQL application.
Use the command vi insert.sh to create a script named insert.sh on the node, and copy the following content into the script.
insert.sh
#!/bin/bash\n\nfunction rand(){\n min=$1\n max=$(($2-$min+1))\n num=$(date +%s%N)\n echo $(($num%$max+$min))\n}\n\nfunction insert(){\n user=$(date +%s%N | md5sum | cut -c 1-9)\n age=$(rand 1 100)\n\n sql=\"INSERT INTO test.users(user_name, age)VALUES('${user}', ${age});\"\n echo -e ${sql}\n\n kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"${sql}\"\n\n}\n\nkubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"CREATE DATABASE IF NOT EXISTS test;\"\nkubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"CREATE TABLE IF NOT EXISTS test.users(user_name VARCHAR(10) NOT NULL,age INT UNSIGNED)ENGINE=InnoDB DEFAULT CHARSET=utf8;\"\n\nwhile true;do\n insert\n sleep 1\ndone\n
mysql: [Warning] Using a password on the command line interface can be insecure.\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('dc09195ba', 10);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('80ab6aa28', 70);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('f488e3d46', 23);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('e6098695c', 93);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('eda563e7d', 63);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('a4d1b8d68', 17);\nmysql: [Warning] Using a password on the command line interface can be insecure.\n
Press Ctrl + C on the keyboard simultaneously to pause the script execution.
Go to the MySQL Pod and check the data written in MySQL.
kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"SELECT * FROM test.users;\"\n
Expected output
[root@g-master1 ~]# kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"SELECT * FROM test.users;\"\nmysql: [Warning] Using a password on the command line interface can be insecure.\nuser_name age\ndc09195ba 10\n80ab6aa28 70\nf488e3d46 23\ne6098695c 93\neda563e7d 63\na4d1b8d68 17\nea47546d9 86\na34311f2e 47\n740cefe17 33\nede85ea28 65\nb6d0d6a0e 46\nf0eb38e50 44\nc9d2f28f5 72\n8ddaafc6f 31\n3ae078d0e 23\n6e041631e 96\n
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#install-velero-plugin-on-both-clusters","title":"Install Velero Plugin on Both Clusters","text":"
Note
The velero plugin needs to be installed on both the source and target clusters.
Refer to the Install Velero Plugin documentation and the MinIO configuration below to install the velero plugin on the main-cluster and recovery-cluster .
MinIO Server Address Bucket Username Password http://10.7.209.110:9000 mysql-demo root dangerous
Note
When installing the plugin, replace S3url with the MinIO server address prepared for this demonstration, and replace the bucket with an existing bucket in MinIO.
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#backup-mysql-application-and-data","title":"Backup MySQL Application and Data","text":"
Add a unique label, backup=mysql , to the MySQL application and PVC data. This will facilitate resource selection during backup.
kubectl label deploy mysql-deploy backup=mysql # Add label to mysql-deploy\nkubectl label pod mysql-deploy-5d6f94cb5c-gkrks backup=mysql # Add label to mysql pod\nkubectl label pvc mydata backup=mysql # Add label to mysql pvc\n
Refer to the steps described in Application Backup and the parameters below to create an application backup.
After creating the backup plan, the page will automatically return to the backup plan list. Find the newly created backup plan backup-mysql and click the more options button ... on the plan. Select Run Now to execute the newly created backup plan.
Wait for the backup plan execution to complete before proceeding with the next steps.
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#cross-cluster-recovery-of-mysql-application-and-data","title":"Cross-Cluster Recovery of MySQL Application and Data","text":"
Log in to the AI platform and select Container Management -> Backup & Restore -> Application Backup from the left navigation menu.
Select Recovery in the left-side toolbar, then click Restore Backup on the right side.
Fill in the parameters based on the following instructions:
Name: restore-mysql (can be customized)
Backup Source Cluster: main-cluster
Backup Plan: backup-mysql
Backup Point: default
Recovery Target Cluster: recovery-cluster
Refresh the backup plan list and wait for the backup plan execution to complete.
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#check-if-the-data-is-restored-successfully","title":"Check if the data is restored successfully","text":"
Log in to the control plane of recovery-cluster and check whether mysql-deploy has been successfully restored in the current cluster.
kubectl get pod\n
Expected output is as follows:
```
NAME                            READY   STATUS    RESTARTS   AGE
mysql-deploy-5798f5d4b8-62k6c   1/1     Running   0          24h
```
Check whether the data in the MySQL table has been restored.
kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"SELECT * FROM test.users;\"\n
Expected output is as follows:
[root@g-master1 ~]# kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"SELECT * FROM test.users;\"\nmysql: [Warning] Using a password on the command line interface can be insecure.\nuser_name age\ndc09195ba 10\n80ab6aa28 70\nf488e3d46 23\ne6098695c 93\neda563e7d 63\na4d1b8d68 17\nea47546d9 86\na34311f2e 47\n740cefe17 33\nede85ea28 65\nb6d0d6a0e 46\nf0eb38e50 44\nc9d2f28f5 72\n8ddaafc6f 31\n3ae078d0e 23\n6e041631e 96\n
Success
As you can see, the data in the Pod is consistent with the data inside the Pods in the main-cluster . This indicates that the MySQL application and its data from the main-cluster have been successfully recovered to the recovery-cluster cluster.
"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html","title":"Create a RedHat 9.2 Worker Cluster on a CentOS Management Platform","text":"
This article explains how to create a RedHat 9.2 worker cluster on an existing CentOS management platform.
Note
This article only applies to the offline mode, using the AI platform to create a worker cluster. The architecture of both the management platform and the cluster to be created is AMD. When creating a cluster, heterogeneous deployment (mixing AMD and ARM) is not supported. After the cluster is created, you can use the method of connecting heterogeneous nodes to achieve mixed deployment and management of the cluster.
AI platform has been deployed in full mode, and the spark node is still alive. For deployment details, see the document Offline Install AI platform Enterprise.
"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#download-and-import-redhat-offline-packages","title":"Download and Import RedHat Offline Packages","text":"
Make sure you are logged in to the spark node and that the clusterConfig.yaml file used when deploying AI platform is still available.
"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#download-the-relevant-redhat-offline-packages","title":"Download the Relevant RedHat Offline Packages","text":"
Download the required RedHat OS package and ISO offline packages:
Resource Name Description Download Link os-pkgs-redhat9-v0.9.3.tar.gz RedHat9.2 OS-package package Download ISO Offline Package ISO package import script Go to RedHat Official Download Site import-iso ISO import script Download"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#import-the-os-package-to-the-minio-of-the-spark-node","title":"Import the OS Package to the MinIO of the Spark Node","text":"
Extract the RedHat OS package
Execute the following command to extract the downloaded OS package. Here we download the RedHat OS package.
tar -xvf os-pkgs-redhat9-v0.9.3.tar.gz\n
The contents of the extracted OS package are as follows:
```
os-pkgs
 ├── import_ospkgs.sh       # This script is used to import OS packages into the MinIO file service
 ├── os-pkgs-amd64.tar.gz   # OS packages for the amd64 architecture
 ├── os-pkgs-arm64.tar.gz   # OS packages for the arm64 architecture
 └── os-pkgs.sha256sum.txt  # sha256sum verification file of the OS packages
```
Import the OS Package to the MinIO of the Spark Node
Execute the following command to import the OS packages to the MinIO file service:
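The invocation generally has the shape shown below; the environment variable names are assumptions, so verify the exact arguments against the usage notes inside import_ospkgs.sh itself:

```bash
MINIO_USER=rootuser MINIO_PASS=rootpass123 ./import_ospkgs.sh \
  http://127.0.0.1:9000 os-pkgs-redhat9-v0.9.3.tar.gz
```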
The above command is only applicable to the MinIO service built into the spark node. If an external MinIO is used, replace http://127.0.0.1:9000 with the access address of the external MinIO. "rootuser" and "rootpass123" are the default account and password of the MinIO service built into the spark node. "os-pkgs-redhat9-v0.9.3.tar.gz" is the name of the downloaded offline OS package.
"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#import-the-iso-offline-package-to-the-minio-of-the-spark-node","title":"Import the ISO Offline Package to the MinIO of the Spark Node","text":"
Execute the following command to import the ISO package to the MinIO file service:
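Similarly, the ISO import generally has the shape shown below; the script name comes from the download table above, the environment variable names are assumptions, and the exact arguments should be verified against the usage notes inside the script:

```bash
MINIO_USER=rootuser MINIO_PASS=rootpass123 ./import-iso.sh \
  http://127.0.0.1:9000 rhel-9.2-x86_64-dvd.iso
```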
The above command is only applicable to the MinIO service built into the spark node. If an external MinIO is used, replace http://127.0.0.1:9000 with the access address of the external MinIO. "rootuser" and "rootpass123" are the default account and password of the MinIO service built into the spark node. "rhel-9.2-x86_64-dvd.iso" is the name of the downloaded offline ISO package.
"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#create-the-cluster-in-the-ui","title":"Create the Cluster in the UI","text":"
Refer to the document Creating a Worker Cluster to create a RedHat 9.2 cluster.
"},{"location":"en/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.html","title":"Create an Ubuntu Worker Cluster on CentOS","text":"
This page explains how to create an Ubuntu worker cluster on an existing CentOS.
Note
This page is specifically for the offline mode, using the AI platform to create a worker cluster, where both the CentOS platform and the worker cluster to be created are based on the AMD architecture. Heterogeneous (mixed AMD and ARM) deployments are not supported during cluster creation; however, after the cluster is created, you can manage a mixed deployment by adding heterogeneous nodes.
A fully deployed AI platform system, with the bootstrap node still active. For deployment reference, see the documentation Offline Install AI platform Enterprise.
"},{"location":"en/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.html#download-and-import-ubuntu-offline-packages","title":"Download and Import Ubuntu Offline Packages","text":"
Please ensure you are logged into the bootstrap node! Also, make sure that the clusterConfig.yaml file used during the AI platform deployment is still available.
Download the required Ubuntu OS packages and ISO offline packages:
Resource Name Description Download Link os-pkgs-ubuntu2204-v0.18.2.tar.gz Ubuntu 22.04 OS package https://github.com/kubean-io/kubean/releases/download/v0.18.2/os-pkgs-ubuntu2204-v0.18.2.tar.gz ISO Offline Package ISO Package http://mirrors.melbourne.co.uk/ubuntu-releases/"},{"location":"en/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.html#import-os-and-iso-packages-into-minio-on-the-bootstrap-node","title":"Import OS and ISO Packages into MinIO on the Bootstrap Node","text":"
Refer to the documentation Importing Offline Resources to import offline resources into MinIO on the bootstrap node.
"},{"location":"en/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.html#create-cluster-on-ui","title":"Create Cluster on UI","text":"
Refer to the documentation Creating a Worker Cluster to create the Ubuntu cluster.
"},{"location":"en/admin/kpanda/best-practice/etcd-backup.html","title":"etcd Backup and Restore","text":"
The ETCD backup feature lets you create a backup policy to back up the etcd data of a specified cluster to S3 storage on a schedule. This page focuses on how to restore previously backed-up data to the current cluster.
Note
AI platform ETCD backup restores are limited to backups and restores for the same cluster (with no change in the number of nodes and IP addresses). For example, after the etcd data of Cluster A is backed up, the backup data can only be restored to Cluster A, not to Cluster B.
For cross-cluster backups and restores, it is recommended to use the application backup and restore feature.
First, create a backup policy to back up the current cluster state. For details, refer to ETCD Backup.
The following is a specific case to illustrate the whole process of backup and restore.
Start with the basic information of the target cluster and the S3 storage used for the restore. Here, MinIO is used as the S3 storage, and the cluster has 3 control plane nodes (3 etcd replicas).
IP Host Role Remarks 10.6.212.10 host01 k8s-master01 k8s node 1 10.6.212.11 host02 k8s-master02 k8s node 2 10.6.212.12 host03 k8s-master03 k8s node 3 10.6.212.13 host04 minio minio service"},{"location":"en/admin/kpanda/best-practice/etcd-backup.html#prerequisites","title":"Prerequisites","text":""},{"location":"en/admin/kpanda/best-practice/etcd-backup.html#install-the-etcdbrctl-tool","title":"Install the etcdbrctl tool","text":"
To implement ETCD data backup and restore, you need to install the etcdbrctl open source tool on any of the above k8s nodes. This tool does not provide prebuilt binaries yet and must be compiled from source; refer to its compilation instructions.
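One possible way to build it, assuming the tool comes from the upstream gardener/etcd-backup-restore project and a local Go toolchain is available (the build target and output path may differ between versions):
git clone https://github.com/gardener/etcd-backup-restore.git\ncd etcd-backup-restore\nmake build                          # binaries are typically produced under ./bin\ncp ./bin/etcdbrctl /usr/local/bin/\n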
After installation, use the following command to check whether the tool is available:
etcdbrctl -v\n
The expected output is as follows:
INFO[0000] etcd-backup-restore Version: v0.23.0-dev\nINFO[0000] Git SHA: b980beec\nINFO[0000] Go Version: go1.19.3\nINFO[0000] Go OS/Arch: linux/amd64\n
"},{"location":"en/admin/kpanda/best-practice/etcd-backup.html#check-the-backup-data","title":"Check the backup data","text":"
You need to check the following before restoring:
Whether the data has been successfully backed up in AI platform
Check if backup data exists in S3 storage
Note
AI platform performs full data backups; when restoring, the full data of the most recent backup will be restored.
"},{"location":"en/admin/kpanda/best-practice/etcd-backup.html#shut-down-the-cluster","title":"Shut down the cluster","text":"
Before restoring, the cluster must be shut down. By default, etcd and kube-apiserver are started as static Pods. Shutting down the cluster here means moving the static Pod manifest files out of the /etc/kubernetes/manifests directory, so that kubelet removes the Pods and stops the services.
First, delete any backup directory left over from a previous attempt. "Removing" the existing etcd data does not mean deleting it; it means renaming the etcd data directory, as shown in the sketch after the command below. Delete this directory only after the restore has completed successfully, so that the current cluster can still be recovered if the etcd restore fails. This step needs to be performed on each node.
rm -rf /var/lib/etcd_bak\n
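A sketch of the rename step described above, assuming the default etcd data directory /var/lib/etcd (which matches the restore output shown later in this document):
mv /var/lib/etcd /var/lib/etcd_bak\n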
Then shut down the kube-apiserver service to ensure that no new changes are written to the etcd data. This step needs to be performed on each node.
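A sketch of stopping the services by moving the static Pod manifests aside and then confirming that etcd is no longer reachable (the backup directory name is an assumption; the check command is the same etcdctl command used later in this document):
mkdir -p /etc/kubernetes/manifests.bak\nmv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests.bak/\nmv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/manifests.bak/\n\netcdctl endpoint status --endpoints=controller-node-1:2379,controller-node-2:2379,controller-node-3:2379 -w table \\\n--cacert=\"/etc/kubernetes/ssl/etcd/ca.crt\" \\\n--cert=\"/etc/kubernetes/ssl/apiserver-etcd-client.crt\" \\\n--key=\"/etc/kubernetes/ssl/apiserver-etcd-client.key\"\n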
The expected output is as follows, indicating that all etcd members are no longer reachable:
{\"level\":\"warn\",\"ts\":\"2023-03-29T17:51:50.817+0800\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.6/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001ba000/controller-node-1:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.5.14.31:2379: connect: connection refused\\\"\"}\nFailed to get the status of endpoint controller-node-1:2379 (context deadline exceeded)\n{\"level\":\"warn\",\"ts\":\"2023-03-29T17:51:55.818+0800\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.6/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001ba000/controller-node-2:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.5.14.32:2379: connect: connection refused\\\"\"}\nFailed to get the status of endpoint controller-node-2:2379 (context deadline exceeded)\n{\"level\":\"warn\",\"ts\":\"2023-03-29T17:52:00.820+0800\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.6/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001ba000/controller-node-1:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.5.14.33:2379: connect: connection refused\\\"\"}\nFailed to get the status of endpoint controller-node-3:2379 (context deadline exceeded)\n+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |\n+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n
"},{"location":"en/admin/kpanda/best-practice/etcd-backup.html#restore-the-backup","title":"Restore the backup","text":"
You only need to restore the data of one node, and the etcd data of other nodes will be automatically synchronized.
Set environment variables
Before restoring the data with etcdbrctl, run the following commands to set the S3 connection and authentication information as environment variables:
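A sketch of these environment variables, assuming the MinIO service at 10.6.212.13 from the table above and the S3-compatible "ECS" provider of etcd-backup-restore; the variable values are assumptions and must match the credentials and endpoint configured in your backup policy:
export ECS_ENDPOINT=http://10.6.212.13:9000     # MinIO access address\nexport ECS_ACCESS_KEY_ID=rootuser               # MinIO access key (assumption)\nexport ECS_SECRET_ACCESS_KEY=rootpass123        # MinIO secret key (assumption)\n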
--data-dir: the etcd data directory. It must match the data directory used by the cluster's etcd so that etcd can load the restored data normally.
--store-container: the S3 storage location, i.e. the bucket in MinIO. It must be the bucket that holds the backup data.
--initial-cluster: the initial etcd cluster configuration. The etcd member names must be the same as in the original cluster.
--initial-advertise-peer-urls: the peer access address of the etcd member. It must be consistent with the etcd configuration.
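Putting the above parameters together, a minimal restore sketch might look like the following; the bucket name, storage provider, and member address are assumptions based on the environment described above, and additional flags may be required depending on your backup layout:
etcdbrctl restore \\\n--data-dir /var/lib/etcd \\\n--store-container=\"etcd-backup\" \\\n--storage-provider=ECS \\\n--initial-cluster=controller-node-1=https://10.6.212.10:2380 \\\n--initial-advertise-peer-urls=https://10.6.212.10:2380\n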
The expected output is as follows:
INFO[0000] Finding latest set of snapshot to recover from...\nINFO[0000] Restoring from base snapshot: Full-00000000-00111147-1679991074 actor=restorer\nINFO[0001] successfully fetched data of base snapshot in 1.241380207 seconds actor=restorer\n{\"level\":\"info\",\"ts\":1680011221.2511616,\"caller\":\"mvcc/kvstore.go:380\",\"msg\":\"restored last compact revision\",\"meta-bucket-name\":\"meta\",\"meta-bucket-name-key\":\"finishedCompactRev\",\"restored-compact-revision\":110327}\n{\"level\":\"info\",\"ts\":1680011221.3045986,\"caller\":\"membership/cluster.go:392\",\"msg\":\"added member\",\"cluster-id\":\"66638454b9dd7b8a\",\"local-member-id\":\"0\",\"added-peer-id\":\"123c2503a378fc46\",\"added-peer-peer-urls\":[\"https://10.6.212.10:2380\"]}\nINFO[0001] Starting embedded etcd server... actor=restorer\n\n....\n\n{\"level\":\"info\",\"ts\":\"2023-03-28T13:47:02.922Z\",\"caller\":\"embed/etcd.go:565\",\"msg\":\"stopped serving peer traffic\",\"address\":\"127.0.0.1:37161\"}\n{\"level\":\"info\",\"ts\":\"2023-03-28T13:47:02.922Z\",\"caller\":\"embed/etcd.go:367\",\"msg\":\"closed etcd server\",\"name\":\"default\",\"data-dir\":\"/var/lib/etcd\",\"advertise-peer-urls\":[\"http://localhost:0\"],\"advertise-client-urls\":[\"http://localhost:0\"]}\nINFO[0003] Successfully restored the etcd data directory.\n
Note
You can compare against the etcd YAML file to avoid configuration errors.
Then start etcd again (for example by moving its static Pod manifest back), wait for the etcd service to finish starting, and check the status of etcd. The default directory of etcd-related certificates is /etc/kubernetes/ssl . If the cluster certificates are stored in another location, specify the proper path.
Check the etcd cluster list:
etcdctl member list -w table \\\n--cacert=\"/etc/kubernetes/ssl/etcd/ca.crt\" \\\n--cert=\"/etc/kubernetes/ssl/apiserver-etcd-client.crt\" \\\n--key=\"/etc/kubernetes/ssl/apiserver-etcd-client.key\" \n
The expected output is as follows:
+------------------+---------+-------------------+--------------------------+--------------------------+------------+\n| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |\n+------------------+---------+-------------------+--------------------------+--------------------------+------------+\n| 123c2503a378fc46 | started | controller-node-1 | https://10.6.212.10:2380 | https://10.6.212.10:2379 | false |\n+------------------+---------+-------------------+--------------------------+--------------------------+------------+\n
To view the status of controller-node-1:
etcdctl endpoint status --endpoints=controller-node-1:2379 -w table \\\n--cacert=\"/etc/kubernetes/ssl/etcd/ca.crt\" \\\n--cert=\"/etc/kubernetes/ssl/apiserver-etcd-client.crt\" \\\n--key=\"/etc/kubernetes/ssl/apiserver-etcd-client.key\"\n
The expected output is as follows:
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |\n+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n| controller-node-1:2379 | 123c2503a378fc46 | 3.5.6 | 15 MB | true | false | 3 | 1200 | 1199 | |\n+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n
Restore other node data
The above steps restore the data on the first node. To restore the data of the other nodes, you only need to start the etcd Pod on each of them and let etcd synchronize the data by itself.
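For example, on each remaining node, a sketch of starting etcd again by moving its static Pod manifest back (assuming the manifests were moved to the backup directory used earlier in this document):
mv /etc/kubernetes/manifests.bak/etcd.yaml /etc/kubernetes/manifests/\n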
Data synchronization between etcd members takes some time. You can check the etcd cluster status to ensure that all etcd members are healthy:
Check whether the etcd cluster status is normal:
etcdctl member list -w table \\\n--cacert=\"/etc/kubernetes/ssl/etcd/ca.crt\" \\\n--cert=\"/etc/kubernetes/ssl/apiserver-etcd-client.crt\" \\\n--key=\"/etc/kubernetes/ssl/apiserver-etcd-client.key\"\n
The expected output is as follows:
+------------------+---------+-------------------+-------------------------+-------------------------+------------+\n| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |\n+------------------+---------+-------------------+-------------------------+-------------------------+------------+\n| 6ea47110c5a87c03 | started | controller-node-1 | https://10.5.14.31:2380 | https://10.5.14.31:2379 | false |\n| e222e199f1e318c4 | started | controller-node-2 | https://10.5.14.32:2380 | https://10.5.14.32:2379 | false |\n| f64eeda321aabe2d | started | controller-node-3 | https://10.5.14.33:2380 | https://10.5.14.33:2379 | false |\n+------------------+---------+-------------------+-------------------------+-------------------------+------------+\n
Check whether the three member nodes are normal:
etcdctl endpoint status --endpoints=controller-node-1:2379,controller-node-2:2379,controller-node-3:2379 -w table \\\n--cacert=\"/etc/kubernetes/ssl/etcd/ca.crt\" \\\n--cert=\"/etc/kubernetes/ssl/apiserver-etcd-client.crt\" \\\n--key=\"/etc/kubernetes/ssl/apiserver-etcd-client.key\"\n
Move the kube-apiserver manifest back so that kubelet starts kube-apiserver, then check whether the restored Kubernetes data is normal:
kubectl get nodes\n
The expected output is as follows:
NAME STATUS ROLES AGE VERSION\ncontroller-node-1 Ready <none> 3h30m v1.25.4\ncontroller-node-2 Ready control-plane 3h29m v1.25.4\ncontroller-node-3 Ready control-plane 3h28m v1.25.4\n
"},{"location":"en/admin/kpanda/best-practice/hardening-cluster.html","title":"How to Harden a Self-built Work Cluster","text":"
In AI platform, when running a CIS Benchmark (CIS) scan on a worker cluster created through the user interface, some scan items did not pass. This article provides hardening instructions based on different versions of the CIS Benchmark.
"},{"location":"en/admin/kpanda/best-practice/hardening-cluster.html#hardening-configuration-to-pass-cis-scan","title":"Hardening Configuration to Pass CIS Scan","text":"
To address these security scan issues, kubespray has added default values in v2.22 to solve some of the problems. For more details, refer to the kubespray hardening documentation.
Add parameters by modifying the kubean var-config configuration file:
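For example, a sketch of editing the cluster's vars ConfigMap (named after the cluster, following the naming shown elsewhere in this document) together with the kind of hardening variables kubespray documents; take the exact variable set from the kubespray hardening documentation for your version:
# Edit the vars ConfigMap of the target cluster\nkubectl -n kubean-system edit cm ${cluster-name}-vars-conf\n\n# Examples of hardening variables documented by kubespray:\n#   kubernetes_audit: true\n#   kube_encrypt_secret_data: true\n#   kubelet_rotate_server_certificates: true\n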
In AI platform, there is also a feature to configure advanced parameters through the user interface. Add custom parameters in the last step of cluster creation:
After setting the custom parameters, the following parameters are added to the var-config configmap in kubean:
Perform a scan after installing the cluster:
After the scan, all items pass (WARN and INFO are counted as PASS). Note that this document only applies to CIS Benchmark 1.27, as the CIS Benchmark is continuously updated.
"},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html","title":"Deploy and Upgrade Compatible Versions of Kubean in Offline Scenarios","text":"
To meet customers' needs to build Kubernetes (K8s) clusters with earlier versions, Kubean provides the capability to create K8s clusters with those earlier versions.
Currently, the supported versions for self-built worker clusters range from v1.26.0 to v1.28. Refer to the AI platform Cluster Version Support System for more information.
This article will demonstrate how to deploy a K8s cluster with a lower version.
Prepare a management cluster where kubean resides, with the podman, skopeo, and minio client commands available in the current environment. If they are not available, you can install these dependencies through a script; see Installing Prerequisite Dependencies.
Go to kubean to view the released artifacts, and choose the specific artifact version based on the actual situation. The currently supported artifact versions and their proper cluster version ranges are as follows:
Artifact Version Cluster Range AI platform Support release-2.21 v1.23.0 ~ v1.25.6 Supported since installer v0.14.0 release-2.22 v1.24.0 ~ v1.26.9 Supported since installer v0.15.0 release-2.23 v1.25.0 ~ v1.27.7 Expected to support from installer v0.16.0
This article demonstrates the offline deployment of a K8s cluster with version 1.23.0 and the offline upgrade of a K8s cluster from version 1.23.0 to 1.24.0, so we choose the artifact release-2.21.
"},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html#prepare-the-relevant-artifacts-for-the-lower-version-of-kubespray-release","title":"Prepare the Relevant Artifacts for the Lower Version of Kubespray Release","text":"
Import the spray-job image into the registry of the offline environment.
# Assuming the registry address in the bootstrap cluster is 172.30.41.200\nREGISTRY_ADDR=\"172.30.41.200\"\n\n# The image spray-job can use the accelerator address here, and the image address is determined based on the selected artifact version\nSPRAY_IMG_ADDR=\"ghcr.m.daocloud.io/kubean-io/spray-job:2.21-d6f688f\"\n\n# skopeo parameters\nSKOPEO_PARAMS=\" --insecure-policy -a --dest-tls-verify=false --retry-times=3 \"\n\n# Online environment: Export the spray-job image of version release-2.21 and transfer it to the offline environment\nskopeo copy docker://${SPRAY_IMG_ADDR} docker-archive:spray-job-2.21.tar\n\n# Offline environment: Import the spray-job image of version release-2.21 into the bootstrap registry\nskopeo copy ${SKOPEO_PARAMS} docker-archive:spray-job-2.21.tar docker://${REGISTRY_ADDR}/${SPRAY_IMG_ADDR}\n
"},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html#create-offline-resources-for-the-earlier-versions-of-k8s","title":"Create Offline Resources for the Earlier Versions of K8s","text":"
Prepare the manifest.yml file.
cat > \"manifest.yml\" <<EOF\nimage_arch:\n - \"amd64\" ## \"arm64\"\nkube_version: ## Fill in the cluster version according to the actual scenario\n - \"v1.23.0\"\n - \"v1.24.0\"\nEOF\n
Create the offline incremental package.
# Create the data directory\nmkdir data\n# Create the offline package\nAIRGAP_IMG_ADDR=\"ghcr.m.daocloud.io/kubean-io/airgap-patch:2.21-d6f688f\" # (1)\npodman run --rm -v $(pwd)/manifest.yml:/manifest.yml -v $(pwd)/data:/data -e ZONE=CN -e MODE=FULL ${AIRGAP_IMG_ADDR}\n
The airgap-patch image can use the accelerator address here; the image address is determined based on the selected artifact version
Import the offline images and binary packages for the proper K8s version.
# Import the binaries from the data directory to the minio in the bootstrap node\ncd ./data/amd64/files/\nMINIO_ADDR=\"http://172.30.41.200:9000\" # Replace IP with the actual repository url\nMINIO_USER=rootuser MINIO_PASS=rootpass123 ./import_files.sh ${MINIO_ADDR}\n\n# Import the images from the data directory to the image repository in the bootstrap node\ncd ./data/amd64/images/\nREGISTRY_ADDR=\"172.30.41.200\" ./import_images.sh # Replace IP with the actual repository url\n
Deploy the manifest and localartifactset.cr.yaml custom resources to the management cluster where kubean resides or the Global cluster. In this example, we use the Global cluster.
# Deploy the localArtifactSet resources in the data directory\ncd ./data\nkubectl apply -f localartifactset.cr.yaml\n\n# Download the manifest resources proper to release-2.21\nwget https://raw.githubusercontent.com/kubean-io/kubean-manifest/main/manifests/manifest-2.21-d6f688f.yml\n\n# Deploy the manifest resources proper to release-2.21\nkubectl apply -f manifest-2.21-d6f688f.yml\n
"},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html#deployment-and-upgrade-legacy-k8s-cluster","title":"Deployment and Upgrade Legacy K8s Cluster","text":""},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html#deploy","title":"Deploy","text":"
Go to Container Management and click the Create Cluster button on the Clusters page.
For the Managed By parameter, choose the cluster where the manifest and localartifactset.cr.yaml custom resources were deployed. In this example, we use the Global cluster.
Refer to Creating a Cluster for the remaining parameters.
Select the newly created cluster and go to the details page.
Click Cluster Operations in the left navigation bar, then click Cluster Upgrade on the top right of the page.
Select an available version to upgrade the cluster to.
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html","title":"How to Add Heterogeneous Nodes to a Worker Cluster","text":"
This page explains how to add ARM architecture nodes with Kylin v10 sp2 operating system to an AMD architecture worker cluster with CentOS 7.9 operating system.
Note
This page is only applicable to adding heterogeneous nodes to a worker cluster created with AI platform in offline mode; it does not cover integrated clusters.
An AI platform Full Mode deployment has been successfully completed, and the bootstrap node is still available. Refer to the documentation Offline Installation of AI platform Enterprise for the deployment process.
A worker cluster with AMD architecture and CentOS 7.9 operating system has been created through AI platform. Refer to the documentation Creating a Worker Cluster for the creation process.
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/best-practice/multi-arch.html#download-and-import-offline-packages","title":"Download and Import Offline Packages","text":"
Take ARM architecture and Kylin v10 sp2 operating system as examples.
Make sure you are logged into the bootstrap node! Also, make sure the clusterConfig.yaml file used during the AI platform deployment is available.
The latest version can be downloaded from the Download Center.
CPU Architecture Version Download Link AMD64 v0.18.0 https://qiniu-download-public.daocloud.io/DaoCloud_Enterprise/dce5/offline-v0.18.0-amd64.tar ARM64 v0.18.0 https://qiniu-download-public.daocloud.io/DaoCloud_Enterprise/dce5/offline-v0.18.0-arm64.tar
After downloading, extract the offline package:
tar -xvf offline-v0.18.0-arm64.tar\n
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#iso-offline-package-kylin-v10-sp2","title":"ISO Offline Package (Kylin v10 sp2)","text":"CPU Architecture Operating System Version Download Link ARM64 Kylin Linux Advanced Server release V10 (Sword) SP2 https://www.kylinos.cn/support/trial.html
Note
Kylin operating system requires personal information to be provided for downloading and usage. Select V10 (Sword) SP2 when downloading.
The Kubean project provides osPackage offline packages for different operating systems. Visit https://github.com/kubean-io/kubean/releases to view the available packages.
Operating System Version Download Link Kylin Linux Advanced Server release V10 (Sword) SP2 https://github.com/kubean-io/kubean/releases/download/v0.16.3/os-pkgs-kylinv10-v0.16.3.tar.gz
Note
Check the specific version of the osPackage offline package in the offline/sample/clusterConfig.yaml file of the offline image package.
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#import-offline-packages-to-the-bootstrap-node","title":"Import Offline Packages to the Bootstrap Node","text":"
Add the information of the newly added worker nodes to the cluster's host list ConfigMap:
kubectl edit cm ${cluster-name}-hosts-conf -n kubean-system\n
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#add-expansion-tasks-through-clusteroperationyml","title":"Add Expansion Tasks through ClusterOperation.yml","text":"
Ensure the spec.image address matches the image used in the previous deployment job.
Set spec.action to scale.yml .
Set spec.extraArgs to --limit=g-worker .
Fill in the correct repo_list parameter for the relevant OS in spec.preHook 's enable-repo.yml script.
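A minimal sketch of join-node-ops.yaml reflecting the settings above, modeled on the ClusterOperation example shown later in this document; the name, image, and repo_list values are placeholders for your environment:
apiVersion: kubean.io/v1alpha1\nkind: ClusterOperation\nmetadata:\n  name: add-worker-node-ops\nspec:\n  cluster: ${CLUSTER_NAME}\n  image: ${SPRAY_IMG_ADDR}   # must match the image used by the previous deployment job\n  actionType: playbook\n  action: scale.yml\n  extraArgs: --limit=g-worker\n  preHook:\n    - actionType: playbook\n      action: enable-repo.yml   # required in offline environments\n      extraArgs: |\n        -e \"{repo_list: ['${YOUR_OS_REPO_URL}']}\"\n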
To create and deploy join-node-ops.yaml according to the above configuration:
vi join-node-ops.yaml\nkubectl apply -f join-node-ops.yaml -n kubean-system\n
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#check-the-status-of-the-task-execution","title":"Check the status of the task execution","text":"
kubectl -n kubean-system get pod | grep add-worker-node\n
To check the progress of the scaling task, you can view the logs of the proper pod.
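For example (the pod name is a placeholder taken from the output of the command above):
kubectl -n kubean-system logs -f ${add-worker-node-pod-name}\n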
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#verify-in-ui","title":"Verify in UI","text":"
Go to Container Management -> Clusters -> Nodes .
Click the newly added node to view details.
"},{"location":"en/admin/kpanda/best-practice/replace-first-master-node.html","title":"Replace the first master node of the worker cluster","text":"
This page takes a highly available worker cluster with three master nodes as an example and explains how to replace or reintroduce the first master node when it fails or malfunctions.
The three master nodes in this example are:
node1 (172.30.41.161)
node2 (172.30.41.162)
node3 (172.30.41.163)
Assuming node1 is down, the following steps will explain how to reintroduce the recovered node1 back into the worker cluster.
Before performing the replacement operation, first obtain basic information about the cluster resources, which will be used when modifying related configurations.
Note
The following commands to obtain cluster resource information are executed in the management cluster.
Get the cluster name
Run the following command to find the clusters.kubean.io resource proper to the cluster:
# For example, if the resource name of clusters.kubean.io is cluster-mini-1\n# Get the name of the cluster\nCLUSTER_NAME=$(kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath=\"{.metadata.name}{'\\n'}\")\n
Get the host list configmap of the cluster
kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath=\"{.spec.hostsConfRef}{'\\n'}\"\n{\"name\":\"mini-1-hosts-conf\",\"namespace\":\"kubean-system\"}\n
Get the configuration parameters configmap of the cluster
kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath=\"{.spec.varsConfRef}{'\\n'}\"\n{\"name\":\"mini-1-vars-conf\",\"namespace\":\"kubean-system\"}\n
Reset node1 to the state it was in before the cluster was installed (or use a new node), and keep node1 reachable on the network.
Adjust the order of the node1 node in the kube_control_plane, kube_node, and etcd sections in the host list (node1/node2/node3 -> node2/node3/node1):
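For example, a sketch of editing the host list ConfigMap obtained above (the ConfigMap name comes from the earlier query; adjust the node order inside the kube_control_plane, kube_node, and etcd groups):
kubectl -n kubean-system edit cm mini-1-hosts-conf\n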
Manually modify the cluster configuration, edit and update cluster-info
# Edit cluster-info\nkubectl -n kube-public edit cm cluster-info\n\n# 1. If the ca.crt certificate is updated, the content of the certificate-authority-data field needs to be updated\n# View the base64 encoding of the ca certificate:\ncat /etc/kubernetes/ssl/ca.crt | base64 | tr -d '\\n'\n\n# 2. Change the IP address in the server field to the new first master IP, this document will use the IP address of node2, 172.30.41.162\n
Manually modify the cluster configuration, edit and update kubeadm-config
# Edit kubeadm-config\nkubectl -n kube-system edit cm kubeadm-config\n\n# Change controlPlaneEndpoint to the new first master IP,\n# this document will use the IP address of node2, 172.30.41.162\n
Scale up the master node and update the cluster
Note
Use --limit to limit the update operation to only affect the etcd and kube_control_plane node groups.
In an offline environment, spec.preHook must include enable-repo.yml, and the extraArgs parameter must provide the correct repo_list for the relevant OS.
cat << EOF | kubectl apply -f -\n---\napiVersion: kubean.io/v1alpha1\nkind: ClusterOperation\nmetadata:\n name: cluster-mini-1-update-cluster-ops\nspec:\n cluster: ${CLUSTER_NAME}\n image: ${SPRAY_IMG_ADDR}:${SPRAY_RLS_2_22_TAG}\n actionType: playbook\n action: cluster.yml\n extraArgs: --limit=etcd,kube_control_plane -e kube_version=${KUBE_VERSION}\n preHook:\n - actionType: playbook\n action: enable-repo.yml # This yaml needs to be added in an offline environment,\n # and set the correct repo-list (install operating system packages),\n # the following parameter values are for reference only\n extraArgs: |\n -e \"{repo_list: ['http://172.30.41.0:9000/kubean/centos/\\$releasever/os/\\$basearch','http://172.30.41.0:9000/kubean/centos-iso/\\$releasever/os/\\$basearch']}\"\n postHook:\n - actionType: playbook\n action: cluster-info.yml\nEOF\n
You have now completed the replacement of the first master node.
"},{"location":"en/admin/kpanda/best-practice/update-offline-cluster.html","title":"Offline Deployment/Upgrade Guide for Worker Clusters","text":"
Note
This document is specifically designed for deploying or upgrading the Kubernetes version of worker clusters created by AI platform in offline mode. It does not cover the deployment or upgrade of other Kubernetes components.
This guide is applicable to the following offline scenarios:
You can follow the operational guidelines to deploy the recommended Kubernetes version in a non-GUI environment created by AI platform.
You can upgrade the Kubernetes version of worker clusters created using AI platform by generating incremental offline packages.
The overall approach is as follows:
Build the offline package on an integrated node.
Import the offline package to the bootstrap node.
Update the Kubernetes version manifest for the global service cluster.
Use the AI platform UI to create or upgrade the Kubernetes version of the worker cluster.
Note
For a list of currently supported offline Kubernetes versions, refer to the list of Kubernetes versions supported by Kubean.
"},{"location":"en/admin/kpanda/best-practice/update-offline-cluster.html#building-the-offline-package-on-an-integrated-node","title":"Building the Offline Package on an Integrated Node","text":"
Since the offline environment cannot connect to the internet, you need to prepare an integrated node in advance to build the incremental offline package and start Docker or Podman services on this node. Refer to How to Install Docker?
Check the status of the Docker service on the integrated node.
Create a file named manifest.yaml in the /root directory of the integrated node with the following command:
vi manifest.yaml\n
The content of manifest.yaml should be as follows:
manifest.yaml
image_arch:\n- \"amd64\"\nkube_version: # Specify the version of the cluster to be upgraded\n- \"v1.28.0\"\n
image_arch specifies the CPU architecture type, with options for amd64 and arm64.
kube_version indicates the version of the Kubernetes offline package to be built. You can refer to the supported offline Kubernetes versions mentioned earlier.
Create a folder named /data in the /root directory to store the incremental offline package.
mkdir data\n
Run the following command to generate the offline package using the kubean airgap-patch image. Make sure the tag of the airgap-patch image matches the Kubean version, and that the Kubean version covers the Kubernetes version you wish to upgrade.
# Assuming the Kubean version is v0.13.9\ndocker run --rm -v $(pwd)/manifest.yaml:/manifest.yaml -v $(pwd)/data:/data ghcr.m.daocloud.io/kubean-io/airgap-patch:v0.13.9\n
After the Docker service completes running, check the files in the /data folder. The folder structure should look like this:
"},{"location":"en/admin/kpanda/best-practice/update-offline-cluster.html#importing-the-offline-package-to-the-bootstrap-node","title":"Importing the Offline Package to the Bootstrap Node","text":"
Copy the /data files from the integrated node to the /root directory of the bootstrap node. On the integrated node , run the following command:
scp -r data root@x.x.x.x:/root\n
Replace x.x.x.x with the IP address of the bootstrap node.
On the bootstrap node, copy the image files in the /data folder to the built-in Docker registry of the bootstrap node. After logging into the bootstrap node, run the following commands:
Navigate to the directory where the image files are located.
cd data/amd64/images\n
Run the import_images.sh script to import the images into the built-in Docker Registry of the bootstrap node.
REGISTRY_ADDR=\"127.0.0.1\" ./import_images.sh\n
Note
The above command is only applicable to the built-in Docker Registry of the bootstrap node. If you are using an external registry, set the registry address and any required credentials accordingly when running import_images.sh.
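Next, navigate to the directory containing the binary files and import them into the MinIO service of the bootstrap node; the sketch below follows the import_files.sh usage shown earlier in this document and assumes the default built-in MinIO credentials:
cd ../files\nMINIO_USER=rootuser MINIO_PASS=rootpass123 ./import_files.sh http://127.0.0.1:9000\n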
The above command is only applicable to the built-in Minio service of the bootstrap node. If you are using an external Minio, replace http://127.0.0.1:9000 with the access address of the external Minio. \"rootuser\" and \"rootpass123\" are the default account and password for the built-in Minio service of the bootstrap node.
"},{"location":"en/admin/kpanda/best-practice/update-offline-cluster.html#updating-the-kubernetes-version-manifest-for-the-global-service-cluster","title":"Updating the Kubernetes Version Manifest for the Global Service Cluster","text":"
Run the following command on the bootstrap node to deploy the localartifactset resource to the global service cluster:
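A sketch, assuming the data directory copied to /root earlier and that kubectl on the bootstrap node points at the global service cluster (adjust the kubeconfig and the file location inside the data directory if your airgap-patch version differs):
kubectl apply -f /root/data/localartifactset.cr.yaml\n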
Log into the AI platform UI management interface to continue with the following actions:
Refer to the Creating Cluster Documentation to create a worker cluster, where you can select the incremental version of Kubernetes.
Refer to the Upgrading Cluster Documentation to upgrade your self-built worker cluster.
"},{"location":"en/admin/kpanda/best-practice/use-otherlinux-create-custer.html","title":"Creating a Cluster on Non-Supported Operating Systems","text":"
This document outlines how to create a worker cluster on an unsupported OS in offline mode. For the range of OS supported by AI platform, refer to AI platform Supported Operating Systems.
The main process for creating a worker cluster on an unsupported OS in offline mode is illustrated in the diagram below:
Next, we will use the openAnolis operating system as an example to demonstrate how to create a cluster on a non-mainstream operating system.
AI platform Full Mode has been deployed following the documentation: Offline Installation of AI platform Enterprise.
At least one node with the same architecture and OS version as the target cluster nodes, with access to the internet.
"},{"location":"en/admin/kpanda/best-practice/use-otherlinux-create-custer.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/best-practice/use-otherlinux-create-custer.html#online-node-building-an-offline-package","title":"Online Node - Building an Offline Package","text":"
Find an online environment with the same architecture and OS as the nodes in the target cluster. In this example, we will use AnolisOS 8.8 GA. Run the following command to generate an offline os-pkgs package:
# Download relevant scripts and build os packages package\n$ curl -Lo ./pkgs.yml https://raw.githubusercontent.com/kubean-io/kubean/main/build/os-packages/others/pkgs.yml\n$ curl -Lo ./other_os_pkgs.sh https://raw.githubusercontent.com/kubean-io/kubean/main/build/os-packages/others/other_os_pkgs.sh && chmod +x other_os_pkgs.sh\n$ ./other_os_pkgs.sh build # Build the offline package\n
After executing the above command, you should have a compressed package named os-pkgs-anolis-8.8.tar.gz in the current directory. The file structure in the current directory should look like this:
"},{"location":"en/admin/kpanda/best-practice/use-otherlinux-create-custer.html#offline-node-installing-the-offline-package","title":"Offline Node - Installing the Offline Package","text":"
Copy the three files generated on the online node ( other_os_pkgs.sh , pkgs.yml , and os-pkgs-anolis-8.8.tar.gz ) to all nodes in the target cluster in the offline environment.
Log in to any one of the nodes in the offline environment that is part of the target cluster, and run the following command to install the os-pkg package on the node:
# Configure environment variables\n$ export PKGS_YML_PATH=/root/workspace/os-pkgs/pkgs.yml # Path to the pkgs.yml file on the current offline node\n$ export PKGS_TAR_PATH=/root/workspace/os-pkgs/os-pkgs-anolis-8.8.tar.gz # Path to the os-pkgs-anolis-8.8.tar.gz file on the current offline node\n$ export SSH_USER=root # Username for the current offline node\n$ export SSH_PASS=dangerous # Password for the current offline node\n$ export HOST_IPS='172.30.41.168' # IP address of the current offline node\n$ ./other_os_pkgs.sh install # Install the offline package\n
After executing the above command, wait for the interface to prompt: All packages for node (X.X.X.X) have been installed , which indicates that the installation is complete.
"},{"location":"en/admin/kpanda/best-practice/use-otherlinux-create-custer.html#go-to-the-user-interface-to-create-cluster","title":"Go to the User Interface to Create Cluster","text":"
Refer to the documentation on Creating a Worker Cluster to create an openAnolis cluster.
"},{"location":"en/admin/kpanda/clusterops/cluster-oversold.html","title":"Dynamic Resource Overprovision in the Cluster","text":"
Currently, many businesses experience peaks and valleys in demand. To ensure service performance and stability, resources are typically allocated based on peak demand when deploying services. However, peak periods may be very short, resulting in resource waste during off-peak times. Cluster resource overprovision utilizes these allocated but unused resources (i.e., the difference between allocation and usage) to enhance cluster resource utilization and reduce waste.
This article mainly introduces how to use the cluster dynamic resource overprovision feature.
The container management module has been integrated with a Kubernetes cluster or a Kubernetes cluster has been created, and access to the cluster's UI interface is available.
A namespace has been created, and the user has been granted Cluster Admin permissions. For details, refer to Cluster Authorization.
Once the cluster dynamic resource overprovision ratio is set, it will take effect while workloads are running. The following example uses nginx to validate the use of resource overprovision capabilities.
Create a workload (nginx) and set the proper resource limits. For the creation process, refer to Creating Stateless Workloads (Deployment).
Check whether the ratio of the Pod's resource requests to limits meets the overprovision ratio.
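For example, you can inspect the resulting requests and limits with kubectl (the namespace and pod name are placeholders):
kubectl -n ${namespace} get pod ${nginx-pod-name} -o jsonpath='{.spec.containers[0].resources}'\n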
Cluster settings are used to customize advanced feature settings for your cluster, including whether to enable GPU, helm repo refresh cycle, Helm operation record retention, etc.
Enable GPU: GPUs and proper driver plug-ins need to be installed on the cluster in advance.
Click the name of the target cluster, and click Operations and Maintenance -> Cluster Settings -> Addons in the left navigation bar.
Helm operation basic image, registry refresh cycle, number of operation records retained, whether to enable cluster deletion protection (the cluster cannot be uninstalled directly after enabling)
On this page, you can view the recent cluster operation records and Helm operation records, as well as the YAML files and logs of each operation, and you can also delete a certain record.
Set the number of reserved entries for Helm operations:
By default, the system keeps the last 100 Helm operation records. If you keep too many entries, it may cause data redundancy, and if you keep too few entries, you may lose the key operation records you need. A reasonable reserved quantity needs to be set according to the actual situation. Specific steps are as follows:
Click the name of the target cluster, and click Recent Operations -> Helm Operations -> Set Number of Retained Items in the left navigation bar.
Set how many Helm operation records need to be kept, and click OK .
Clusters integrated or created using the AI platform Container Management platform can be accessed not only through the UI interface but also in two other ways for access control:
Access online via CloudShell
Access via kubectl after downloading the cluster certificate
Note
When accessing the cluster, the user should have Cluster Admin permission or higher.
"},{"location":"en/admin/kpanda/clusters/access-cluster.html#access-via-cloudshell","title":"Access via CloudShell","text":"
Enter Clusters page, select the cluster you want to access via CloudShell, click the ... icon on the right, and then click Console from the dropdown list.
Run kubectl get node command in the Console to verify the connectivity between CloudShell and the cluster. If the console returns node information of the cluster, you can access and manage the cluster through CloudShell.
"},{"location":"en/admin/kpanda/clusters/access-cluster.html#access-via-kubectl","title":"Access via kubectl","text":"
If you want to access and manage remote clusters from a local node, make sure you have met these prerequisites:
Your local node and the cloud cluster are in a connected network.
The cluster certificate has been downloaded to the local node.
The kubectl tool has been installed on the local node. For detailed installation guides, see Installing tools.
If everything is in place, follow these steps to access a cloud cluster from your local environment.
Enter Clusters page, find your target cluster, click ... on the right, and select Download kubeconfig in the drop-down list.
Set the Kubeconfig period and click Download .
Open the downloaded certificate and copy its content to the config file of the local node.
By default, the kubectl tool will look for a file named config in the $HOME/.kube directory on the local node. This file stores access credentials of clusters. Kubectl can access the cluster with that configuration file.
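For example, a minimal sketch (the downloaded file name is a placeholder; back up any existing config file first):
mkdir -p $HOME/.kube\ncp ./${downloaded-kubeconfig-file} $HOME/.kube/config\n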
Run the following command on the local node to verify its connectivity with the cluster:
kubectl get pod -n default\n
An expected output is as follows:
NAME READY STATUS RESTARTS AGE\ndao-2048-2048-58c7f7fc5-mq7h4 1/1 Running 0 30h\n
Now you can access and manage the cluster locally with kubectl.
Suanova AI platform categorizes clusters based on different functionalities to help users better manage IT infrastructure.
"},{"location":"en/admin/kpanda/clusters/cluster-role.html#global-service-cluster","title":"Global Service Cluster","text":"
This cluster is used to run AI platform components such as Container Management, Global Management, Insight. It generally does not carry business workloads.
This cluster is used to manage worker clusters and generally does not carry business workloads.
Classic Mode deploys the global service cluster and management cluster in different clusters, suitable for multi-data center, multi-architecture enterprise scenarios.
Simple Mode deploys the management cluster and global service cluster in the same cluster.
This is a cluster created using Container Management and is mainly used to carry business workloads. This cluster is managed by the management cluster.
Supported Features Description K8s Version Supports K8s 1.22 and above Operating System RedHat 7.6 x86/ARM, RedHat 7.9 x86, RedHat 8.4 x86/ARM, RedHat 8.6 x86;Ubuntu 18.04 x86, Ubuntu 20.04 x86;CentOS 7.6 x86/AMD, CentOS 7.9 x86/AMD Full Lifecycle Management Supported K8s Resource Management Supported Cloud Native Storage Supported Cloud Native Network Calico, Cillium, Multus, and other CNIs Policy Management Supports network policies, quota policies, resource limits, disaster recovery policies, security policies"},{"location":"en/admin/kpanda/clusters/cluster-role.html#integrated-cluster","title":"Integrated Cluster","text":"
This cluster is used to integrate existing standard K8s clusters, including but not limited to self-built clusters in local data centers, clusters provided by public cloud vendors, clusters provided by private cloud vendors, edge clusters, Xinchuang clusters, heterogeneous clusters, and different Suanova clusters. It is mainly used to carry business workloads.
Supported Features Description K8s Version 1.18+ Supported Vendors VMware Tanzu, Amazon EKS, Redhat Openshift, SUSE Rancher, Alibaba ACK, Huawei CCE, Tencent TKE, Standard K8s Cluster, Suanova Full Lifecycle Management Not Supported K8s Resource Management Supported Cloud Native Storage Supported Cloud Native Network Depends on the network mode of the integrated cluster's kernel Policy Management Supports network policies, quota policies, resource limits, disaster recovery policies, security policies
Note
A cluster can have multiple cluster roles. For example, a cluster can be both a global service cluster and a management cluster or a worker cluster.
"},{"location":"en/admin/kpanda/clusters/cluster-scheduler-plugin.html","title":"Deploy Second Scheduler scheduler-plugins in a Cluster","text":"
This page describes how to deploy a second scheduler-plugins in a cluster.
"},{"location":"en/admin/kpanda/clusters/cluster-scheduler-plugin.html#why-do-we-need-scheduler-plugins","title":"Why do we need scheduler-plugins?","text":"
Clusters created through the platform install the native K8s scheduler, but the native scheduler has several limitations:
When the native scheduler cannot meet your scheduling requirements, you can use CoScheduling, CapacityScheduling, or other scheduler-plugins.
In special scenarios, a new scheduler-plugin is needed to complete scheduling tasks without affecting the native scheduler's workflow.
Distinguish scheduler-plugins with different functionalities and achieve different scheduling scenarios by switching scheduler-plugin names.
This page takes the scenario of using the vgpu scheduler-plugin while combining the coscheduling plugin capability of scheduler-plugins as an example to introduce how to install and use scheduler-plugins.
This capability was introduced in kubean v0.13.0; please ensure that your kubean version is v0.13.0 or higher.
The installation version of scheduler-plugins is v0.27.8, please ensure that the cluster version is compatible with it. Refer to the document Compatibility Matrix.
scheduler_plugins_enabled Set to true to enable the scheduler-plugins capability.
You can enable or disable certain plugins by setting the scheduler_plugins_enabled_plugins or scheduler_plugins_disabled_plugins options. See K8s Official Plugin Names for reference.
If you need to set parameters for custom plugins, please configure scheduler_plugins_plugin_config, for example: set the permitWaitingTimeoutSeconds parameter for coscheduling. See K8s Official Plugin Configuration for reference.
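For example, a sketch of these variables as custom parameters; the exact YAML layout of scheduler_plugins_plugin_config may vary with the kubespray/kubean version, and the plugin name and argument come from the coscheduling example above:
scheduler_plugins_enabled: true\nscheduler_plugins_enabled_plugins:\n  - Coscheduling\nscheduler_plugins_plugin_config:\n  - name: Coscheduling\n    args:\n      permitWaitingTimeoutSeconds: 10\n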
After successful cluster creation, the system will automatically install the scheduler-plugins and controller component loads. You can check the workload status in the proper cluster's deployment.
Here is an example of how to use scheduler-plugins by demonstrating a scenario where the vgpu scheduler is used in combination with the coscheduling plugin capability of scheduler-plugins.
Install vgpu in the Helm Charts and set the values.yaml parameters.
schedulerName: scheduler-plugins-scheduler: This is the scheduler name for scheduler-plugins installed by kubean, and currently cannot be modified.
scheduler.kubeScheduler.enabled: false: Do not install kube-scheduler and use vgpu-scheduler as a separate extender.
Extend vgpu-scheduler on scheduler-plugins.
[root@master01 charts]# kubectl get cm -n scheduler-plugins scheduler-config -ojsonpath=\"{.data.scheduler-config\\.yaml}\"\n
After installing vgpu-scheduler, the system will automatically create a service (svc), and the urlPrefix specifies the URL of the svc.
Note
The svc refers to the Service exposing the vgpu-scheduler Pod. You can use the following command in the namespace where the nvidia-vgpu plugin is installed to get the external access information for port 443.
kubectl get svc -n ${namespace}\n
The urlPrefix format is https://${ip address}:${port}
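For reference, a hedged sketch of what the extender section added to the scheduler-config ConfigMap might look like; the verbs and resource name are assumptions and must match your vgpu-scheduler deployment, so refer to the vgpu documentation for the authoritative configuration:
extenders:\n  - urlPrefix: \"https://${ip address}:${port}\"   # the vgpu-scheduler svc address obtained above\n    filterVerb: filter\n    bindVerb: bind\n    enableHTTPS: true\n    tlsConfig:\n      insecure: true\n    nodeCacheCapable: true\n    weight: 1\n    managedResources:\n      - name: nvidia.com/vgpu   # assumption: use the resource name exposed by your vgpu device plugin\n        ignoredByScheduler: true\n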
Restart the scheduler pod of scheduler-plugins to load the new configuration file.
Note
When creating a vgpu application, you do not need to specify the name of a scheduler-plugin. The vgpu-scheduler webhook will automatically change the scheduler's name to \"scheduler-plugins-scheduler\" without manual specification.
AI platform Container Management module can manage two types of clusters: integrated clusters and created clusters.
Integrated clusters: clusters created in other platforms and now integrated into AI platform.
Created clusters: clusters created in AI platform.
For more information about cluster types, see Cluster Role.
We designed several statuses for these two clusters.
"},{"location":"en/admin/kpanda/clusters/cluster-status.html#integrated-clusters","title":"Integrated Clusters","text":"Status Description Integrating The cluster is being integrated into AI platform. Removing The cluster is being removed from AI platform. Running The cluster is running as expected. Unknown The cluster is lost. Data displayed in the AI platform UI is the cached data before the disconnection, which does not represent real-time data. Any operation during this status will not take effect. You should check cluster network connectivity or host status."},{"location":"en/admin/kpanda/clusters/cluster-status.html#created-clusters","title":"Created Clusters","text":"Status Description Creating The cluster is being created. Updating The Kubernetes version of the cluster is being operating. Deleting The cluster is being deleted. Running The cluster is running as expected. Unknown The cluster is lost. Data displayed in the AI platform UI is the cached data before the disconnection, which does not represent real-time data. Any operation during this status will not take effect. You should check cluster network connectivity or host status. Failed The cluster creation is failed. You should check the logs for detailed reasons."},{"location":"en/admin/kpanda/clusters/cluster-version.html","title":"Supported Kubernetes Versions","text":"
In AI platform, the integrated clusters and created clusters have different version support mechanisms.
This page focuses on the version support mechanism for created clusters.
The Kubernetes community maintains three minor versions at a time, such as 1.26, 1.27, and 1.28. When the community releases a new version, the supported range moves up by one. For example, when the latest community version is 1.29, the community-supported versions are 1.27, 1.28, and 1.29.
To ensure the security and stability of the clusters, when creating clusters in AI platform, the supported version range will always be one version lower than the community's version.
For instance, if the Kubernetes community supports v1.25, v1.26, and v1.27, then the version range for creating worker clusters in AI platform will be v1.24, v1.25, and v1.26. Additionally, a stable version, such as 1.24.7, will be recommended to users.
Furthermore, the version range for creating worker clusters in AI platform will remain highly synchronized with the community. When the community version increases incrementally, the version range for creating worker clusters in AI platform will also increase by one version.
"},{"location":"en/admin/kpanda/clusters/cluster-version.html#supported-kubernetes-versions_1","title":"Supported Kubernetes Versions","text":"Kubernetes Community Versions Created Worker Cluster Versions Recommended Versions for Created Worker Cluster AI platform Installer Release Date
In AI platform Container Management, clusters can have four roles: global service cluster, management cluster, worker cluster, and integrated cluster. An integrated cluster can only be integrated from third-party vendors (see Integrate Cluster).
This page explains how to create a Worker Cluster. By default, when creating a new Worker Cluster, the operating system type and CPU architecture of the worker nodes should be consistent with the Global Service Cluster. If you want to create a cluster with a different operating system or architecture than the Global Management Cluster, refer to Creating an Ubuntu Worker Cluster on a CentOS Management Platform for instructions.
It is recommended to use the supported operating systems in AI platform to create the cluster. If your local nodes are not within the supported range, you can refer to Creating a Cluster on Non-Mainstream Operating Systems for instructions.
Certain prerequisites must be met before creating a cluster:
Prepare enough nodes to be joined into the cluster.
It is recommended to use Kubernetes version 1.25.7. For the specific version range, refer to the AI platform Cluster Version Support System. Currently, the supported version range for created worker clusters is v1.26.0 to v1.28. If you need to create a cluster with a lower version, refer to Supported Cluster Versions.
The target host must allow IPv4 forwarding. If using IPv6 in Pods and Services, the target server needs to allow IPv6 forwarding.
AI platform does not provide firewall management. You need to pre-define the firewall rules of the target host by yourself. To avoid errors during cluster creation, it is recommended to disable the firewall of the target host.
Enter the Container Management module, click Create Cluster on the upper right corner of the Clusters page.
Fill in the basic information by referring to the following instructions.
Cluster Name: only contain lowercase letters, numbers, and hyphens (\"-\"). Must start and end with a lowercase letter or number and totally up to 63 characters.
Managed By: Choose a cluster to manage this new cluster through its lifecycle, such as creating, upgrading, node scaling, deleting the new cluster, etc.
Runtime: Select the runtime environment of the cluster. Currently supports containerd and docker (see How to Choose Container Runtime).
Kubernetes Version: A span of three minor versions is allowed, such as 1.23-1.25, subject to the versions supported by the management cluster.
Fill in the node configuration information and click Node Check .
High Availability: When enabled, at least 3 controller nodes are required. When disabled, only 1 controller node is needed.
It is recommended to use High Availability mode in production environments.
Credential Type: Choose whether to access nodes using username/password or public/private keys.
If using public/private key authentication, SSH keys for the nodes need to be configured in advance. Refer to Using SSH Key Authentication for Nodes.
Same Password: When enabled, all nodes in the cluster will have the same access password. Enter the unified password for accessing all nodes in the field below. If disabled, you can set separate usernames and passwords for each node.
Node Information: Set node names and IPs.
NTP Time Synchronization: When enabled, time will be automatically synchronized across all nodes. Provide the NTP server address.
If the node check passes, click Next . If the check fails, update the Node Information and check again.
Fill in the network configuration and click Next .
CNI: Provide network services for Pods in the cluster. CNI cannot be changed after the cluster is created. Supports cilium and calico. Set none means not installing CNI when creating the cluster. You may install a CNI later.
For CNI configuration details, see Cilium Installation Parameters or Calico Installation Parameters.
Container IP Range: Set an IP range for allocating IPs for containers in the cluster. IP range determines the max number of containers allowed in the cluster. Cannot be modified after creation.
Service IP Range: Set an IP range for allocating IPs for container Services in the cluster. This range determines the max number of container Services that can be created in the cluster. Cannot be modified after creation.
Fill in the plug-in configuration and click Next .
Fill in advanced settings and click OK .
kubelet_max_pods : Set the maximum number of Pods per node. The default is 110.
hostname_override : Reset the hostname (not recommended).
kubernetes_audit : Kubernetes audit log, enabled by default.
auto_renew_certificate : Automatically renew the certificate of the control plane on the first Monday of each month, enabled by default.
disable_firewalld&ufw : Disable the firewall to prevent the node from being inaccessible during installation.
insecure_registries : Set the address of your private container registry. If you use a private container registry, filling in its address allows the container engine to bypass certificate authentication and pull images.
yum_repos : Fill in the Yum source registry address.
Success
After correctly filling in the above information, the page will prompt that the cluster is being created.
Creating a cluster takes a long time, so you need to wait patiently. You can click the Back to Clusters button to let it run in the background.
To view the current status, click Real-time Log .
Note
When the cluster is in an unknown state, it means that the current cluster has been disconnected.
The data displayed by the system is the cached data before the disconnection, which does not represent real data.
Any operations performed in the disconnected state will not take effect. Please check the cluster network connectivity or Host Status.
Clusters created in AI platform Container Management can be either deleted or removed. Clusters integrated into AI platform can only be removed.
Info
If you want to delete an integrated cluster, you should delete it in the platform where it is created.
In AI platform, the difference between Delete and Remove is:
Delete will destroy the cluster and reset the data of all nodes under the cluster. All data will be completely cleared and lost, and you will no longer be able to use that cluster. Making a backup before deleting a cluster is a recommended best practice.
Remove just removes the cluster from AI platform. It will not destroy the cluster and no data will be lost. You can still use the cluster in other platforms or re-integrate it into AI platform later if needed.
Note
You should have Admin or Kpanda Owner permissions to perform delete or remove operations.
Before deleting a cluster, you should turn off Cluster Deletion Protection in Cluster Settings -> Advanced Settings , otherwise the Delete Cluster option will not be displayed.
The global service cluster cannot be deleted or removed.
Enter the Container Management module, find your target cluster, click __...__ on the right, and select Delete Cluster / Remove from the drop-down list.
Enter the cluster name to confirm and click Delete .
You will be automatically redirected to the cluster list. The status of this cluster will change to Deleting . It may take a while to delete or remove a cluster.
With its cluster integration feature, AI platform allows you to manage on-premises and cloud clusters from various providers in a unified manner. This is important for avoiding vendor lock-in and helps enterprises safely migrate their business to the cloud.
In AI platform Container Management module, you can integrate a cluster of the following providers: standard Kubernetes clusters, Redhat Openshift, SUSE Rancher, VMware Tanzu, Amazon EKS, Aliyun ACK, Huawei CCE, Tencent TKE, etc.
Enter Container Management module, and click Integrate Cluster in the upper right corner.
Fill in the basic information by referring to the following instructions.
Cluster Name: It should be unique and cannot be changed after the integration. Maximum 63 characters; it can only contain lowercase letters, numbers, and the separator ("-"), and must start and end with a lowercase letter or number.
Cluster Alias: Enter any characters, no more than 60 characters.
Release Distribution: the cluster provider; the mainstream vendors listed at the beginning of this page are supported.
Fill in the KubeConfig of the target cluster and click Verify Config . The cluster can be successfully connected only after the verification is passed.
Click How do I get the KubeConfig? to see the specific steps for getting this file.
Confirm that all parameters are filled in correctly and click OK in the lower right corner of the page.
Note
The status of the newly integrated cluster is Integrating , which will become Running after the integration succeeds.
"},{"location":"en/admin/kpanda/clusters/integrate-rancher-cluster.html","title":"Integrate the Rancher Cluster","text":"
This page explains how to integrate a Rancher cluster.
Prepare a Rancher cluster with administrator privileges and ensure network connectivity between the container management cluster and the target cluster.
You should have permissions of kpanda owner or higher.
"},{"location":"en/admin/kpanda/clusters/integrate-rancher-cluster.html#steps","title":"Steps","text":""},{"location":"en/admin/kpanda/clusters/integrate-rancher-cluster.html#step-1-create-a-serviceaccount-user-with-administrator-privileges-in-the-rancher-cluster","title":"Step 1: Create a ServiceAccount user with administrator privileges in the Rancher cluster","text":"
Log in to the Rancher cluster with a role that has administrator privileges, and create a file named sa.yaml using the terminal.
vi sa.yaml\n
Press the i key to enter insert mode, then copy and paste the following content:
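The original manifest is not reproduced here. Below is a minimal sketch, assuming the ServiceAccount is named rancher-rke (the name referenced in Step 2), lives in the kube-system namespace, and is bound to the built-in cluster-admin ClusterRole:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: rancher-rke
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rancher-rke
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: rancher-rke
  namespace: kube-system

After saving the file, apply it with kubectl apply -f sa.yaml.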
"},{"location":"en/admin/kpanda/clusters/integrate-rancher-cluster.html#step-2-update-kubeconfig-with-the-rancher-rke-sa-authentication-on-your-local-machine","title":"Step 2: Update kubeconfig with the rancher-rke SA authentication on your local machine","text":"
Perform the following steps on any local node where kubelet is installed:
{cluster-name} : the name of your Rancher cluster.
{APIServer} : the access address of the cluster, usually the IP address of a control plane node plus port 6443, for example https://10.X.X.X:6443 .
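The detailed commands are not reproduced here. Below is a minimal kubeconfig sketch built from these placeholders; it assumes the token of the rancher-rke ServiceAccount has already been read from its associated Secret, and that TLS verification is skipped for brevity:

apiVersion: v1
kind: Config
clusters:
- name: {cluster-name}
  cluster:
    server: {APIServer}
    insecure-skip-tls-verify: true   # or supply certificate-authority-data instead
users:
- name: rancher-rke
  user:
    token: <rancher-rke ServiceAccount token>   # hypothetical placeholder
contexts:
- name: rancher-rke@{cluster-name}
  context:
    cluster: {cluster-name}
    user: rancher-rke
current-context: rancher-rke@{cluster-name}

Save this content as a kubeconfig file and use it in the next step.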
"},{"location":"en/admin/kpanda/clusters/integrate-rancher-cluster.html#step-3-connect-the-cluster-in-the-suanova-interface","title":"Step 3: Connect the cluster in the Suanova Interface","text":"
Using the kubeconfig file fetched earlier, refer to the Integrate Cluster documentation to integrate the Rancher cluster to the global cluster.
"},{"location":"en/admin/kpanda/clusters/runtime.html","title":"How to choose the container runtime","text":"
The container runtime is an important component in kubernetes to manage the life cycle of containers and container images. Kubernetes made containerd the default container runtime in version 1.19, and removed support for the Dockershim component in version 1.24.
Therefore, compared to the Docker runtime, we recommend using the lightweight containerd as your container runtime, as it has become the current mainstream choice.
In addition, some operating system vendors offer limited compatibility with the Docker runtime. The runtime support of different operating systems is as follows:
The Kubernetes community releases a minor version every quarter, and each version is maintained for only about 9 months. Major bugs or security vulnerabilities are no longer fixed once a version reaches end of maintenance. Manually upgrading clusters is cumbersome and places a huge workload on administrators.
In Suanova, you can upgrade the Kubernetes cluster with one click through the web UI interface.
Danger
After the version is upgraded, it will not be possible to roll back to the previous version, please proceed with caution.
Note
Kubernetes versions are denoted as x.y.z , where x is the major version, y is the minor version, and z is the patch version.
Skipping minor versions during a cluster upgrade is not allowed; for example, a direct upgrade from 1.23 to 1.25 is not possible.
**Integrated clusters do not support version upgrades. If there is no Cluster Upgrade option in the left navigation bar, please check whether the cluster is an integrated cluster.**
The global service cluster can only be upgraded through the terminal.
When upgrading a worker cluster, the Management Cluster of the worker cluster should have been connected to the container management module and be running normally.
Click the name of the target cluster in the cluster list.
Then click Cluster Operation and Maintenance -> Cluster Upgrade in the left navigation bar, and click Version Upgrade in the upper right corner of the page.
Select the version that can be upgraded, and enter the cluster name to confirm.
After clicking OK , you can see the upgrade progress of the cluster.
The cluster upgrade is expected to take 30 minutes. You can click the Real-time Log button to view the detailed log of the cluster upgrade.
ConfigMaps store non-confidential data in the form of key-value pairs, decoupling configuration data from application code. ConfigMaps can be used as environment variables for containers, command-line parameters, or configuration files in storage volumes.
Note
The data saved in ConfigMaps cannot exceed 1 MiB. If you need to store larger volumes of data, it is recommended to mount a storage volume or use an independent database or file service.
ConfigMaps do not provide confidentiality or encryption. If you want to store encrypted data, it is recommended to use a Secret or another third-party tool to ensure the privacy of the data.
Click the name of a cluster on the Clusters page to enter Cluster Details .
In the left navigation bar, click ConfigMap and Secret -> ConfigMap , and click the YAML Create button in the upper right corner.
Fill in or paste the configuration file prepared in advance, and then click OK in the lower right corner of the pop-up box.
!!! note

    - Click __Import__ to import an existing local file to quickly create a ConfigMap.
    - After filling in the data, click __Download__ to save the configuration file locally.
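For reference, a minimal ConfigMap written in YAML might look like the sketch below; the name game-config and its keys are purely illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: game-config
  namespace: default
data:
  # plain key-value pairs; total size must stay under 1 MiB
  player_lives: "3"
  ui_properties_file_name: "user-interface.properties"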
After the creation is complete, click More on the right side of the ConfigMap to edit its YAML, update, export, delete, and perform other operations.
A Secret is a resource object used to store and manage sensitive information such as passwords, OAuth tokens, and SSH and TLS credentials. Using Secrets means you don't need to include sensitive data in your application code.
Secrets can be used in some cases:
Used as an environment variable of the container to provide some necessary information required during the running of the container.
Use secrets as pod data volumes.
As the identity authentication credential for the container registry when the kubelet pulls the container image.
A Kubernetes cluster has been integrated or created, and you can access the cluster's UI.
Created a namespace, user, and authorized the user as NS Editor. For details, refer to Namespace Authorization.
"},{"location":"en/admin/kpanda/configmaps-secrets/create-secret.html#create-secret-with-wizard","title":"Create secret with wizard","text":"
Click the name of a cluster on the Clusters page to enter Cluster Details .
In the left navigation bar, click ConfigMap and Secret -> Secret , and click the Create Secret button in the upper right corner.
Fill in the configuration information on the Create Secret page, and click OK .
Note the following when filling in the configuration (a YAML equivalent is sketched after this list):
The name of the Secret must be unique within the same namespace
Secret type:
Default (Opaque): the default Kubernetes Secret type, which supports arbitrary user-defined data.
TLS (kubernetes.io/tls): credentials for TLS client or server data access.
Container registry information (kubernetes.io/dockerconfigjson): credentials for container registry access.
Username and password (kubernetes.io/basic-auth): credentials for basic authentication.
Custom: a type customized by the user according to business needs.
Secret data: the data stored in the Secret; the fields to fill in differ for each Secret type
When the Secret type is default (Opaque) or custom: multiple key-value pairs can be filled in.
When the Secret type is TLS (kubernetes.io/tls): you need to fill in the certificate and private key data. Certificates are self-signed or CA-signed credentials used for authentication. A certificate request is a request for a signature and needs to be signed with a private key.
When the Secret type is container registry information (kubernetes.io/dockerconfigjson): you need to fill in the account and password of the private container registry.
When the Secret type is username and password (kubernetes.io/basic-auth): a username and password need to be specified.
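For reference, the wizard above produces objects equivalent to the following sketch of a default (Opaque) Secret; the name and values are illustrative, and stringData is used so the values can be written in plain text:

apiVersion: v1
kind: Secret
metadata:
  name: mysecret
  namespace: default
type: Opaque   # kubernetes.io/tls, kubernetes.io/dockerconfigjson, kubernetes.io/basic-auth are also supported
stringData:
  username: admin
  password: "123456"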
A ConfigMap is a Kubernetes API object used to store non-confidential data as key-value pairs that other objects can consume. Containers can use a ConfigMap as environment variables, command-line arguments, or configuration files in a storage volume. By using ConfigMaps, configuration data and application code can be separated, providing a more flexible way to modify application configuration.
Note
ConfigMaps do not provide confidentiality or encryption. If the data to be stored is confidential, please use a Secret or another third-party tool instead of a ConfigMap. In addition, when using a ConfigMap in a container, the container and the ConfigMap must be in the same cluster namespace.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-configmap.html#scenes-to-be-used","title":"scenes to be used","text":"
You can use ConfigMaps in Pods. There are many use cases, mainly including:
Use ConfigMaps to set the environment variables of the container
Use ConfigMaps to set the command line parameters of the container
Use ConfigMaps as container data volumes
"},{"location":"en/admin/kpanda/configmaps-secrets/use-configmap.html#set-the-environment-variables-of-the-container","title":"Set the environment variables of the container","text":"
You can use the ConfigMap as the environment variable of the container through the graphical interface or the terminal command line.
Note
Import ConfigMap uses an entire ConfigMap as the source of environment variables; Import ConfigMap Key Values uses a specific key in the ConfigMap as the value of an environment variable.
When creating a workload through an image, you can set environment variables for the container by selecting Import ConfigMaps or Import ConfigMap Key Values on the Environment Variables interface.
Go to the Image Creation Workload page, in the Container Configuration step, select the Environment Variables configuration, and click the Add Environment Variable button.
Select ConfigMap Import or ConfigMap Key Value Import in the environment variable type.
When the environment variable type is ConfigMap Import , enter the variable name, prefix, and ConfigMap name in sequence.
When the environment variable type is ConfigMap Key Value Import , enter the variable name, ConfigMap name, and key name in sequence.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-configmap.html#command-line-operation","title":"Command line operation","text":"
You can set ConfigMaps as environment variables when creating a workload, using the valueFrom parameter to refer to the Key/Value in the ConfigMap.
Use valueFrom with configMapKeyRef to reference the ConfigMap name and key whose value populates the environment variable, as shown in the sketch below.
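A minimal sketch is shown below; the ConfigMap special-config and its key special.how are assumptions used only for illustration:

apiVersion: v1
kind: Pod
metadata:
  name: configmap-env-demo
spec:
  restartPolicy: Never
  containers:
  - name: demo
    image: busybox
    command: ["sleep", "3600"]
    env:
    - name: SPECIAL_LEVEL_KEY     # name of the environment variable
      valueFrom:
        configMapKeyRef:
          name: special-config    # referenced ConfigMap name
          key: special.how        # referenced ConfigMap key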
"},{"location":"en/admin/kpanda/configmaps-secrets/use-configmap.html#set-the-command-line-parameters-of-the-container","title":"Set the command line parameters of the container","text":"
You can use ConfigMaps to set the command or parameter values in a container, using the environment variable substitution syntax $(VAR_NAME), as in the sketch below.
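A minimal sketch follows; it reuses the illustrative ConfigMap special-config from the previous example, and Kubernetes substitutes $(SPECIAL_LEVEL_KEY) in the command before the container starts:

apiVersion: v1
kind: Pod
metadata:
  name: configmap-cmd-demo
spec:
  restartPolicy: Never
  containers:
  - name: demo
    image: busybox
    # $(SPECIAL_LEVEL_KEY) is replaced with the value of the environment variable defined below
    command: ["/bin/echo", "$(SPECIAL_LEVEL_KEY)"]
    env:
    - name: SPECIAL_LEVEL_KEY
      valueFrom:
        configMapKeyRef:
          name: special-config
          key: special.how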
When creating a workload through an image, you can use the ConfigMap as a data volume of the container by selecting ConfigMap as the storage type on the Data Storage interface.
Go to the Image Creation Workload page, select the Data Storage configuration in the Container Configuration step, and click the Add button in the Node Path Mapping list.
Select ConfigMap in the storage type, and enter container path , subpath and other information in sequence.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-configmap.html#command-line-operation_1","title":"Command line operation","text":"
To use a ConfigMap in a Pod's storage volume.
Here is an example Pod that mounts a ConfigMap as a volume:
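A minimal sketch, assuming a ConfigMap named special-config exists in the same namespace:

apiVersion: v1
kind: Pod
metadata:
  name: configmap-volume-demo
spec:
  restartPolicy: Never
  containers:
  - name: demo
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: config-volume
      mountPath: /etc/config   # each key in the ConfigMap appears as a read-only file under this path
  volumes:
  - name: config-volume
    configMap:
      name: special-config     # the ConfigMap to mount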
If there are multiple containers in a Pod, each container needs its own volumeMounts block, but you only need to set one spec.volumes block per ConfigMap.
Note
When a ConfigMap is used as a data volume mounted on a container, the ConfigMap can only be read as a read-only file.
A Secret is a resource object used to store and manage sensitive information such as passwords, OAuth tokens, and SSH and TLS credentials. Using Secrets means you don't need to include sensitive data in your application code.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#scenes-to-be-used","title":"scenes to be used","text":"
You can use keys in Pods in a variety of use cases, mainly including:
Used as an environment variable of the container to provide some necessary information required during the running of the container.
Use secrets as pod data volumes.
Used as the identity authentication credential for the container registry when the kubelet pulls the container image.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#use-the-key-to-set-the-environment-variable-of-the-container","title":"Use the key to set the environment variable of the container","text":"
You can use a Secret as a container environment variable through the GUI or the terminal command line.
Note
Key Import uses the entire Secret as the source of environment variables; Key Key Value Import uses a specific key in the Secret as the value of an environment variable.
When creating a workload from an image, you can set environment variables for the container by selecting Key Import or Key Key Value Import on the Environment Variables interface.
Go to the Image Creation Workload page.
Select the Environment Variables configuration in Container Configuration , and click the Add Environment Variable button.
Select Key Import or Key Key Value Import in the environment variable type.
When the environment variable type is Key Import , enter the variable name, prefix, and Secret name in sequence.
When the environment variable type is Key Key Value Import , enter the variable name, Secret name, and key name in sequence.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#command-line-operation","title":"Command line operation","text":"
As shown in the example below, you can set the secret as an environment variable when creating the workload, using the valueFrom parameter to refer to the Key/Value in the Secret.
In the example, the referenced Secret mysecret must already exist and must contain keys named username and password.
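A minimal sketch of such a workload definition (only the relevant Pod fields are shown):

apiVersion: v1
kind: Pod
metadata:
  name: secret-env-demo
spec:
  restartPolicy: Never
  containers:
  - name: demo
    image: busybox
    command: ["sleep", "3600"]
    env:
    - name: SECRET_USERNAME
      valueFrom:
        secretKeyRef:
          name: mysecret    # the Secret must already exist
          key: username     # and contain a key named "username"
    - name: SECRET_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mysecret
          key: password     # and a key named "password"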
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#use-the-key-as-the-pods-data-volume","title":"Use the key as the pod's data volume","text":""},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#graphical-interface-operation_1","title":"Graphical interface operation","text":"
When creating a workload through an image, you can use the Secret as a data volume of the container by selecting Secret as the storage type on the Data Storage interface.
Go to the Image Creation Workload page.
In the Container Configuration , select the Data Storage configuration, and click the Add button in the Node Path Mapping list.
Select Secret in the storage type, and enter container path , subpath and other information in sequence.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#command-line-operation_1","title":"Command line operation","text":"
The following is an example of a Pod that mounts a Secret named mysecret via a data volume:
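A minimal sketch (the mount path /etc/secret is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: secret-volume-demo
spec:
  restartPolicy: Never
  containers:
  - name: demo
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: secret-volume
      mountPath: /etc/secret   # each key in mysecret appears as a read-only file under this path
      readOnly: true
  volumes:
  - name: secret-volume
    secret:
      secretName: mysecret     # mysecret must already exist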
By default, the referenced Secret mysecret must already exist in the same namespace.
If the Pod contains multiple containers, each container needs its own volumeMounts block, but only one .spec.volumes setting is required for each Secret.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#used-as-the-identity-authentication-credential-for-the-container-registry-when-the-kubelet-pulls-the-container-image","title":"Used as the identity authentication credential for the container registry when the kubelet pulls the container image","text":"
You can use a Secret as the identity authentication credential for the container registry through the GUI or the terminal command line.
When creating a workload through an image, you can use a Secret as the authentication credential for pulling images from a private container registry.
Go to the Image Creation Workload page.
In the second step of Container Configuration , select the Basic Information configuration, and click the Select Image button.
Select the name of the private container registry from the Container Registry drop-down list in the pop-up box. For details on creating a Secret for a private registry, see Create Secret.
Enter the image name in the private registry, click OK to complete the image selection.
Note
When creating the Secret, you need to ensure that you enter the correct container registry address, username, and password, and select the correct image name; otherwise, you will not be able to pull the image from the container registry.
In Kubernetes, all objects are abstracted as resources; Pod, Deployment, Service, Volume, and so on are the default resources provided by Kubernetes. This provides important support for daily operations and management work, but in some special cases the existing preset resources cannot meet business needs. To extend the capabilities of the Kubernetes API, CustomResourceDefinitions (CRDs) were created.
The container management module supports interface-based management of custom resources, and its main features are as follows:
Obtain the list and detailed information of custom resources under the cluster
Create custom resources based on YAML
Create a custom resource example CR (Custom Resource) based on YAML
"},{"location":"en/admin/kpanda/custom-resources/create.html#create-a-custom-resource-example-via-yaml","title":"Create a custom resource example via YAML","text":"
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Custom Resource , and click the YAML Create button in the upper right corner.
Click the custom resource named crontabs.stable.example.com , enter the details, and click the YAML Create button in the upper right corner.
On the Create with YAML page, fill in the YAML statement and click OK .
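The YAML itself is not reproduced here. A minimal sketch of the my-new-cron-object custom resource, assuming the crontabs.stable.example.com CRD follows the upstream Kubernetes example (group stable.example.com, version v1, kind CronTab):

apiVersion: stable.example.com/v1
kind: CronTab
metadata:
  name: my-new-cron-object
spec:
  cronSpec: "* * * * */5"
  image: my-awesome-cron-image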
Return to the details page of crontabs.stable.example.com , and you can view the custom resource named my-new-cron-object just created.
"},{"location":"en/admin/kpanda/gpu/index.html","title":"Overview of GPU Management","text":"
This article introduces the capability of Suanova container management platform in unified operations and management of heterogeneous resources, with a focus on GPUs.
With the rapid development of emerging technologies such as AI applications, large-scale models, artificial intelligence, and autonomous driving, enterprises are facing an increasing demand for compute-intensive tasks and data processing. Traditional compute architectures represented by CPUs can no longer meet the growing computational requirements of enterprises. At this point, heterogeneous computing represented by GPUs has been widely applied due to its unique advantages in processing large-scale data, performing complex calculations, and real-time graphics rendering.
Meanwhile, due to the lack of experience and professional solutions in scheduling and managing heterogeneous resources, the utilization efficiency of GPU devices is extremely low, resulting in high AI production costs for enterprises. The challenge of reducing costs, increasing efficiency, and improving the utilization of GPUs and other heterogeneous resources has become a pressing issue for many enterprises.
"},{"location":"en/admin/kpanda/gpu/index.html#introduction-to-gpu-capabilities","title":"Introduction to GPU Capabilities","text":"
The Suanova container management platform supports unified scheduling and operations management of GPUs, NPUs, and other heterogeneous resources, fully unleashing the computational power of GPU resources, and accelerating the development of enterprise AI and other emerging applications. The GPU management capabilities of Suanova are as follows:
Support for unified management of heterogeneous computing resources from domestic and foreign manufacturers such as NVIDIA, Huawei Ascend, and Iluvatar.
Support for multi-card heterogeneous scheduling within the same cluster, with automatic recognition of GPUs in the cluster.
Support for native management solutions for NVIDIA GPUs, vGPUs, and MIG, with cloud native capabilities.
Support for partitioning a single physical GPU for use by different tenants, and allocating GPU resources to tenants and containers based on compute power and memory quotas.
Support for multi-dimensional GPU resource monitoring at the cluster, node, and application levels, assisting operators in managing GPU resources.
Compatibility with various training frameworks such as TensorFlow and PyTorch.
"},{"location":"en/admin/kpanda/gpu/index.html#introduction-to-gpu-operator","title":"Introduction to GPU Operator","text":"
Similar to regular computer hardware, NVIDIA GPUs, as physical devices, need to have the NVIDIA GPU driver installed in order to be used. To reduce the cost of using GPUs on Kubernetes, NVIDIA provides the NVIDIA GPU Operator component to manage various components required for using NVIDIA GPUs. These components include the NVIDIA driver (for enabling CUDA), NVIDIA container runtime, GPU node labeling, DCGM-based monitoring, and more. In theory, users only need to plug the GPU card into a compute device managed by Kubernetes, and they can use all the capabilities of NVIDIA GPUs through the GPU Operator. For more information about NVIDIA GPU Operator, refer to the NVIDIA official documentation. For deployment instructions, refer to Offline Installation of GPU Operator.
Architecture diagram of NVIDIA GPU Operator:
"},{"location":"en/admin/kpanda/gpu/FAQ.html","title":"GPU FAQs","text":""},{"location":"en/admin/kpanda/gpu/FAQ.html#gpu-processes-are-not-visible-while-running-nvidia-smi-inside-a-pod","title":"GPU processes are not visible while running nvidia-smi inside a pod","text":"
Q: When running the nvidia-smi command inside a pod that uses a GPU, no GPU process information is visible in either full-card mode or vGPU mode.
A: Due to PID namespace isolation, GPU processes are not visible inside the Pod. To view GPU processes, you can use one of the following methods:
Configure the workload that uses the GPU with hostPID: true to enable viewing PIDs on the host (see the sketch after this list).
Run the nvidia-smi command in the driver pod of the gpu-operator to view processes.
Run the chroot /run/nvidia/driver nvidia-smi command on the host to view processes.
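For the first method, a minimal sketch of a Pod with hostPID enabled is shown below; the image and the full-card resource request nvidia.com/gpu are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-hostpid
spec:
  hostPID: true    # share the host PID namespace so nvidia-smi inside the pod can see GPU processes
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1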
"},{"location":"en/admin/kpanda/gpu/Iluvatar_usage.html","title":"How to Use Iluvatar GPU in Applications","text":"
This section describes how to use Iluvatar virtual GPU on AI platform.
The AI platform container management platform has been deployed and is running properly.
The container management module has been integrated with a Kubernetes cluster or a Kubernetes cluster has been created, and the UI interface of the cluster can be accessed.
The Iluvatar GPU driver has been installed on the current cluster. Refer to the Iluvatar official documentation for driver installation instructions, or contact the Suanova ecosystem team for enterprise-level support at peg-pem@daocloud.io.
The GPUs in the current cluster have not undergone any virtualization operations and have not been occupied by other applications.
"},{"location":"en/admin/kpanda/gpu/Iluvatar_usage.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/gpu/Iluvatar_usage.html#configuration-via-user-interface","title":"Configuration via User Interface","text":"
Check if the GPU card in the cluster has been detected. Click Clusters -> Cluster Settings -> Addon Plugins , and check if the proper GPU type has been automatically enabled and detected. Currently, the cluster will automatically enable GPU and set the GPU type as Iluvatar .
Deploy a workload. Click Clusters -> Workloads and deploy a workload using the image. After selecting the type as (Iluvatar) , configure the GPU resources used by the application:
Physical Card Count (iluvatar.ai/vcuda-core): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host machine.
Memory Usage (iluvatar.ai/vcuda-memory): Indicates the amount of GPU memory occupied by each card. The value is in MB, with a minimum value of 1 and a maximum value equal to the entire memory of the card.
If there are any issues with the configuration values, scheduling failures or resource allocation failures may occur.
"},{"location":"en/admin/kpanda/gpu/Iluvatar_usage.html#configuration-via-yaml","title":"Configuration via YAML","text":"
To request GPU resources for a workload, add iluvatar.ai/vcuda-core: 1 and iluvatar.ai/vcuda-memory: 200 to both the requests and limits. These parameters configure the application to use the physical card resources, as in the sketch below.
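A minimal sketch under these assumptions (the image is illustrative; one physical card and 200 MB of GPU memory are requested):

apiVersion: v1
kind: Pod
metadata:
  name: iluvatar-gpu-demo
spec:
  containers:
  - name: demo
    image: ubuntu:20.04   # illustrative image
    command: ["sleep", "infinity"]
    resources:
      requests:
        iluvatar.ai/vcuda-core: 1       # number of physical cards
        iluvatar.ai/vcuda-memory: 200   # GPU memory per card, in MB
      limits:
        iluvatar.ai/vcuda-core: 1
        iluvatar.ai/vcuda-memory: 200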
"},{"location":"en/admin/kpanda/gpu/dynamic-regulation.html","title":"GPU Scheduling Configuration (Binpack and Spread)","text":"
This page introduces how to reduce GPU resource fragmentation and prevent single points of failure through Binpack and Spread when using NVIDIA vGPU, achieving advanced scheduling for vGPU. The AI platform provides Binpack and Spread scheduling policies at two levels, clusters and workloads, meeting different usage requirements in various scenarios.
Binpack: Prioritizes using the same GPU on a node, suitable for increasing GPU utilization and reducing resource fragmentation.
Spread: Multiple Pods are distributed across different GPUs on nodes, suitable for high availability scenarios to avoid single card failures.
Scheduling policy based on node dimension
Binpack: Multiple Pods prioritize using the same node, suitable for increasing GPU utilization and reducing resource fragmentation.
Spread: Multiple Pods are distributed across different nodes, suitable for high availability scenarios to avoid single node failures.
"},{"location":"en/admin/kpanda/gpu/dynamic-regulation.html#use-binpack-and-spread-at-cluster-level","title":"Use Binpack and Spread at Cluster-Level","text":"
Note
By default, workloads will follow the cluster-level Binpack and Spread. If a workload sets its own Binpack and Spread scheduling policies that differ from the cluster, the workload will prioritize its own scheduling policy.
On the Clusters page, select the cluster for which you want to adjust the Binpack and Spread scheduling policies. Click the ┇ icon on the right and select GPU Scheduling Configuration from the dropdown list.
Adjust the GPU scheduling configuration according to your business scenario, and click OK to save.
"},{"location":"en/admin/kpanda/gpu/dynamic-regulation.html#use-binpack-and-spread-at-workload-level","title":"Use Binpack and Spread at Workload-Level","text":"
Note
When the Binpack and Spread scheduling policies at the workload level conflict with the cluster-level configuration, the workload-level configuration takes precedence.
Follow the steps below to create a deployment using an image and configure Binpack and Spread scheduling policies within the workload.
Click Clusters in the left navigation bar, then click the name of the target cluster to enter the Cluster Details page.
On the Cluster Details page, click Workloads -> Deployments in the left navigation bar, then click the Create by Image button in the upper right corner of the page.
Fill in the Basic Information and Container Settings in sequence. In the Container Configuration section, enable GPU configuration and select NVIDIA vGPU as the GPU type. Click Advanced Settings, enable the Binpack / Spread scheduling policy, and adjust the GPU scheduling configuration according to the business scenario. After configuration, click Next to proceed to Service Settings and Advanced Settings. Finally, click OK at the bottom right of the page to complete the creation.
"},{"location":"en/admin/kpanda/gpu/gpu_matrix.html","title":"GPU Support Matrix","text":"
This page explains the matrix of supported GPUs and operating systems for AI platform.
"},{"location":"en/admin/kpanda/gpu/gpu_matrix.html#nvidia-gpu","title":"NVIDIA GPU","text":"GPU Manufacturer and Type Supported GPU Models Compatible Operating System (Online) Recommended Kernel Recommended Operating System and Kernel Installation Documentation NVIDIA GPU (Full Card/vGPU)
NVIDIA Fermi (2.1) Architecture:
NVIDIA GeForce 400 Series
NVIDIA Quadro 4000 Series
NVIDIA Tesla 20 Series
NVIDIA Ampere Architecture Series (A100; A800; H100)
CentOS 7
Kernel 3.10.0-123 ~ 3.10.0-1160
Kernel Reference Document
Recommended Operating System with Proper Kernel Version
This document mainly introduces the configuration of GPU scheduling, which can implement advanced scheduling policies. Currently, the primary implementation is the vgpu scheduling policy.
vGPU provides two policies for resource usage: binpack and spread. These apply at two dimensions, the node level and the GPU level. The choice depends on whether you want to distribute workloads more sparsely across different nodes and GPUs or concentrate them on the same node and GPU, thereby making resource utilization more efficient and reducing resource fragmentation.
You can modify the scheduling policy in your cluster by following these steps:
Go to the cluster management list in the container management interface.
Click the settings button ... next to the cluster.
Click GPU Scheduling Configuration.
Toggle the scheduling policy between node-level and GPU-level. By default, the node-level policy is binpack, and the GPU-level policy is spread.
The above steps modify the cluster-level scheduling policy. Users can also specify their own scheduling policy at the workload level to change the scheduling results. Below is an example of modifying the scheduling policy at the workload level:
In this example, both the node- and GPU-level scheduling policies are set to binpack. This ensures that the workload is scheduled to maximize resource utilization and reduce fragmentation.
Follow these steps to manage GPU quotas in AI platform:
Go to Namespaces and click Quota Management to configure the GPU resources that can be used by a specific namespace.
The currently supported card types for quota management in a namespace are: NVIDIA vGPU, NVIDIA MIG, Iluvatar, and Ascend.
NVIDIA vGPU Quota Management: Configure the specific quota that can be used. This will create a ResourcesQuota CR.
- Physical Card Count (nvidia.com/vgpu): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and **less than or equal to** the number of cards on the host machine.
- GPU Core Count (nvidia.com/gpucores): Indicates the GPU compute power occupied by each card. The value ranges from 0 to 100. If configured as 0, it is considered not to enforce isolation. If configured as 100, it is considered to exclusively occupy the entire card.
- GPU Memory Usage (nvidia.com/gpumem): Indicates the amount of GPU memory occupied by each card. The value is in MB, with a minimum value of 1 and a maximum value equal to the entire memory of the card.
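For reference, a namespace quota built from these resource names might look like the sketch below; the namespace, the quantities, and the requests. prefix convention for extended resources are assumptions, and the object the platform actually generates may differ:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: demo-ns                      # illustrative namespace
spec:
  hard:
    requests.nvidia.com/vgpu: "4"         # total vGPU cards the namespace may request
    requests.nvidia.com/gpucores: "200"   # total GPU compute power, in percent units
    requests.nvidia.com/gpumem: "8192"    # total GPU memory, in MB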
"},{"location":"en/admin/kpanda/gpu/ascend/ascend_driver_install.html","title":"Installation of Ascend NPU Components","text":"
This chapter provides installation guidance for Ascend NPU drivers, Device Plugin, NPU-Exporter, and other components.
Before using NPU resources, you need to complete the firmware installation, NPU driver installation, Docker Runtime installation, user creation, log directory creation, and NPU Device Plugin installation. Refer to the following steps for details.
Confirm that the kernel version is within the range supported by the "binary installation" method, and then you can directly install the NPU driver firmware.
For firmware and driver downloads, refer to: Firmware Download Link
For firmware installation, refer to: Install NPU Driver Firmware
If the driver is not installed, refer to the official Ascend documentation for installation. For example, for Ascend910, refer to: 910 Driver Installation Document.
Run the command npu-smi info, and if the NPU information is returned normally, it indicates that the NPU driver and firmware are ready.
Create the parent directory for component logs and the log directories for each component on the proper node, and set the appropriate owner and permissions for the directories. Execute the following command to create the parent directory for component logs.
Please create the proper log directory for each required component. In this example, only the Device Plugin component is needed. For other component requirements, refer to the official documentation
Refer to the following commands to create labels on the proper nodes:
# Create these labels on computing nodes where the driver is installed
kubectl label node {nodename} huawei.com.ascend/Driver=installed
kubectl label node {nodename} node-role.kubernetes.io/worker=worker
kubectl label node {nodename} workerselector=dls-worker-node
kubectl label node {nodename} host-arch=huawei-arm   # or host-arch=huawei-x86, select according to the actual situation
kubectl label node {nodename} accelerator=huawei-Ascend910   # select according to the actual situation
# Create this label on control nodes
kubectl label node {nodename} masterselector=dls-master-node
"},{"location":"en/admin/kpanda/gpu/ascend/ascend_driver_install.html#install-device-plugin-and-npuexporter","title":"Install Device Plugin and NpuExporter","text":"
Functional module path: Container Management -> Cluster, click the name of the target cluster, then click Helm Apps -> Helm Charts from the left navigation bar, and search for ascend-mindxdl.
DevicePlugin: Provides a general device plugin mechanism and standard device API interface for Kubernetes to use devices. It is recommended to use the default image and version.
NpuExporter: Based on the Prometheus/Telegraf ecosystem, this component provides interfaces to help users monitor the Ascend series AI processors and container-level allocation status. It is recommended to use the default image and version.
ServiceMonitor: Disabled by default. If enabled, you can view NPU-related monitoring in the observability module. To enable, ensure that the insight-agent is installed and running, otherwise, the ascend-mindxdl installation will fail.
isVirtualMachine: Disabled by default. If the NPU node is a virtual machine scenario, enable the isVirtualMachine parameter.
After a successful installation, two components will appear under the proper namespace, as shown below:
At the same time, the proper NPU information will also appear on the node information:
Once everything is ready, you can select the proper NPU device when creating a workload through the page, as shown below:
Note
For detailed information of how to use, refer to Using Ascend (Ascend) NPU.
Ascend virtualization is divided into dynamic virtualization and static virtualization. This document describes how to enable and use Ascend static virtualization capabilities.
To enable virtualization capabilities, you need to manually modify the startup parameters of the ascend-device-plugin-daemonset component. Refer to the following command:
After splitting the instance, manually restart the device-plugin pod, then use the kubectl describe command to check the resources of the registered node:
kubectl describe node {{nodename}}\n
"},{"location":"en/admin/kpanda/gpu/ascend/vnpu.html#how-to-use-the-device","title":"How to Use the Device","text":"
When creating an application, specify the resource key as shown in the following YAML:
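The YAML itself is not reproduced here. A minimal sketch, assuming the node was split with a 2-core template so that the device plugin registers a resource named huawei.com/Ascend910-2c (the exact resource name depends on your chip model and split specification, and the image is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: ascend-vnpu-demo
spec:
  containers:
  - name: demo
    image: ubuntu:20.04   # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        huawei.com/Ascend910-2c: 1   # request one 2-core vNPU slice (assumed resource name)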
"},{"location":"en/admin/kpanda/gpu/metax/usemetax.html","title":"MetaX GPU Component Installation and Usage","text":"
This chapter provides installation guidance for MetaX's gpu-extensions, gpu-operator, and other components, as well as usage methods for both the full GPU card and vGPU modes.
The required tar package has been downloaded and installed from the MetaX Software Center. This article uses metax-gpu-k8s-package.0.7.10.tar.gz as an example.
Metax provides two helm-chart packages: metax-extensions and gpu-operator. Depending on the usage scenario, different components can be selected for installation.
Metax-extensions: Includes two components, gpu-device and gpu-label. When using the Metax-extensions solution, the user's application container image needs to be built based on the MXMACA® base image. Moreover, Metax-extensions is only suitable for scenarios using the full GPU.
gpu-operator: Includes components such as gpu-device, gpu-label, driver-manager, container-runtime, and operator-controller. When using the gpu-operator solution, users can choose to create application container images that do not include the MXMACA® SDK. The gpu-operator is suitable for both full GPU and vGPU scenarios.
If no content is displayed, it indicates that the software package has not been installed. If content is displayed, it indicates that the software package has been installed.
When using metax-operator, it is not recommended to pre-install the MXMACA kernel driver on worker nodes; if it has already been installed, there is no need to uninstall it.
The images for the components metax-operator, gpu-label, gpu-device, and container-runtime must have the amd64 suffix.
The image for the metax-maca component is not included in the metax-k8s-images.0.7.13.run package and needs to be separately downloaded, such as maca-mxc500-2.23.0.23-ubuntu20.04-x86_64.tar.xz. After loading it, the image for the metax-maca component needs to be modified again.
The image for the metax-driver component needs to be downloaded from https://pub-docstore.metax-tech.com:7001 as the k8s-driver-image.2.23.0.25.run file, and then execute the command k8s-driver-image.2.23.0.25.run push {registry}/metax to push the image to the image repository. After pushing, modify the image address for the metax-driver component.
The SuanFeng AI computing platform's container management platform has been deployed and is running normally.
The container management module has either integrated with a Kubernetes cluster or created a Kubernetes cluster, and is able to access the cluster's UI interface.
The current cluster has installed the Cambricon firmware, drivers, and DevicePlugin components. For installation details, please refer to the official documentation:
Driver Firmware Installation
DevicePlugin Installation
When installing DevicePlugin, please disable the --enable-device-type parameter; otherwise, the SuanFeng AI computing platform will not be able to correctly recognize the Cambricon GPU.
"},{"location":"en/admin/kpanda/gpu/mlu/use-mlu.html#introduction-to-cambricon-gpu-modes","title":"Introduction to Cambricon GPU Modes","text":"
Cambricon GPUs have the following modes:
Full Card Mode: Register the Cambricon GPU as a whole card for use in the cluster.
Share Mode: Allows one Cambricon GPU to be shared among multiple Pods, with the number of shareable containers set by the virtualization-num parameter.
Dynamic SMLU Mode: Further refines resource allocation, allowing control over the size of memory and computing power allocated to containers.
MIM Mode: Allows the Cambricon GPU to be divided into multiple GPUs of fixed specifications for use.
"},{"location":"en/admin/kpanda/gpu/mlu/use-mlu.html#using-cambricon-in-suanfeng-ai-computing-platform","title":"Using Cambricon in SuanFeng AI Computing Platform","text":"
Here, we take the Dynamic SMLU mode as an example:
After correctly installing the DevicePlugin and other components, click the proper Cluster -> Cluster Maintenance -> Cluster Settings -> Addon Plugins to check whether the proper GPU type has been automatically enabled and detected.
Click the node management page to check if the nodes have correctly recognized the proper GPU type.
Deploy workloads. Click the proper Cluster -> Workloads, and deploy workloads using images. After selecting the type (MLU VGPU), you need to configure the GPU resources used by the App:
GPU Computing Power (cambricon.com/mlu.smlu.vcore): Indicates the percentage of cores the current Pod needs to use.
GPU Memory (cambricon.com/mlu.smlu.vmemory): Indicates the size of memory the current Pod needs to use, in MB.
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  containers:
  - image: ubuntu:16.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        cambricon.com/mlu: "1" # use this when device type is not enabled, else delete this line.
        #cambricon.com/mlu: "1" # uncomment to use when device type is enabled
        #cambricon.com/mlu.share: "1" # uncomment to use device with env-share mode
        #cambricon.com/mlu.mim-2m.8gb: "1" # uncomment to use device with mim mode
        #cambricon.com/mlu.smlu.vcore: "100" # uncomment to use device with dynamic smlu mode
        #cambricon.com/mlu.smlu.vmemory: "1024" # uncomment to use device with dynamic smlu mode
NVIDIA, as a well-known graphics computing provider, offers various software and hardware solutions to enhance computational power. Among them, NVIDIA provides the following three solutions for GPU usage:
Full GPU refers to allocating the entire NVIDIA GPU to a single user or application. In this configuration, the application can fully occupy all the resources of the GPU and achieve maximum computational performance. Full GPU is suitable for workloads that require a large amount of computational resources and memory, such as deep learning training, scientific computing, etc.
vGPU is a virtualization technology that allows one physical GPU to be partitioned into multiple virtual GPUs, with each virtual GPU assigned to different virtual machines or users. vGPU enables multiple users to share the same physical GPU and independently use GPU resources in their respective virtual environments. Each virtual GPU can access a certain amount of compute power and memory capacity. vGPU is suitable for virtualized environments and cloud computing scenarios, providing higher resource utilization and flexibility.
MIG is a feature introduced by the NVIDIA Ampere architecture that allows one physical GPU to be divided into multiple physical GPU instances, each of which can be independently allocated to different users or workloads. Each MIG instance has its own compute resources, memory, and PCIe bandwidth, just like an independent virtual GPU. MIG provides finer-grained GPU resource allocation and management and allows dynamic adjustment of the number and size of instances based on demand. MIG is suitable for multi-tenant environments, containerized applications, batch jobs, and other scenarios.
Whether using vGPU in a virtualized environment or MIG on a physical GPU, NVIDIA provides users with more choices and optimized ways to utilize GPU resources. The Suanova container management platform fully supports the above NVIDIA capabilities. Users can easily access the full computational power of NVIDIA GPUs through simple UI operations, thereby improving resource utilization and reducing costs.
Single Mode: The node only exposes a single type of MIG device on all its GPUs. All GPUs on the node must:
Be of the same model (e.g., A100-SXM-40GB), with matching MIG profiles only for GPUs of the same model.
Have MIG configuration enabled, which requires a machine reboot to take effect.
Create identical GI and CI to expose "identical" MIG devices across all products.
Mixed Mode: The node exposes a mixture of MIG device types across its GPUs. A request for a specific MIG device type specifies the number of compute slices and the total memory provided by that device type.
All GPUs on the node must: Be in the same product line (e.g., A100-SXM-40GB).
Each GPU can enable or disable MIG individually and freely configure any available mixture of MIG device types.
The k8s-device-plugin running on the node will:
Expose any GPUs not in MIG mode using the traditional nvidia.com/gpu resource type.
Expose individual MIG devices using resource types that follow the pattern nvidia.com/mig-<slice_count>g.<memory_size>gb (see the sketch below).
For detailed instructions on enabling these configurations, refer to Offline Installation of GPU Operator.
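For example, a Pod requesting a single MIG instance in mixed mode might look like the sketch below; the 1g.5gb profile and the image are assumptions that depend on the GPU model and the MIG profile actually configured:

apiVersion: v1
kind: Pod
metadata:
  name: mig-demo
spec:
  containers:
  - name: demo
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one MIG instance with 1 compute slice and 5 GB of memory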
"},{"location":"en/admin/kpanda/gpu/nvidia/index.html#how-to-use","title":"How to Use","text":"
You can refer to the following links to quickly start using Suanova's management capabilities for NVIDIA GPUs.
Using Full NVIDIA GPU
Using NVIDIA vGPU
Using NVIDIA MIG
"},{"location":"en/admin/kpanda/gpu/nvidia/full_gpu_userguide.html","title":"Using the Whole NVIDIA GPU Card for an Application","text":"
This section describes how to allocate an entire NVIDIA GPU to a single application on the AI platform.
AI platform container management platform has been deployed and is running properly.
The container management module has been connected to a Kubernetes cluster or a Kubernetes cluster has been created, and you can access the UI interface of the cluster.
GPU Operator has been offline installed and NVIDIA DevicePlugin has been enabled on the current cluster. Refer to Offline Installation of GPU Operator for instructions.
The GPUs in the current cluster have not undergone any virtualization operations or been occupied by other applications.
"},{"location":"en/admin/kpanda/gpu/nvidia/full_gpu_userguide.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/gpu/nvidia/full_gpu_userguide.html#configuring-via-the-user-interface","title":"Configuring via the User Interface","text":"
Check if the cluster has detected the GPUs. Click Clusters -> Cluster Settings -> Addon Plugins to see if it has automatically enabled and detected the proper GPU types. Currently, the cluster will automatically enable GPU and set the GPU Type as Nvidia GPU .
Deploy a workload. Click Clusters -> Workloads , and deploy the workload using the image method. After selecting the type ( Nvidia GPU ), configure the number of physical cards used by the application:
Physical Card Count (nvidia.com/gpu): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host machine.
If the above value is configured incorrectly, scheduling failures and resource allocation issues may occur.
"},{"location":"en/admin/kpanda/gpu/nvidia/full_gpu_userguide.html#configuring-via-yaml","title":"Configuring via YAML","text":"
To request GPU resources for a workload, add the nvidia.com/gpu: 1 parameter to the resource request and limit configuration in the YAML file. This parameter configures the number of physical cards used by the application.
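A minimal sketch of such a Pod specification (the image is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: full-gpu-demo
spec:
  containers:
  - name: demo
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
    command: ["sleep", "infinity"]
    resources:
      requests:
        nvidia.com/gpu: 1   # one whole physical GPU
      limits:
        nvidia.com/gpu: 1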
AI platform comes with pre-installed driver images for the following three operating systems: Ubuntu 22.04, Ubuntu 20.04, and CentOS 7.9. The driver version is 535.104.12. Additionally, it includes the required Toolkit images for each operating system, so users no longer need to manually provide offline toolkit images.
This page demonstrates using AMD architecture with CentOS 7.9 (3.10.0-1160). If you need to deploy on Red Hat 8.4, refer to Uploading Red Hat gpu-operator Offline Image to the Bootstrap Node Repository and Building Offline Yum Source for Red Hat 8.4.
The kernel version must be exactly the same on all cluster nodes where the gpu-operator is to be deployed. The distribution and GPU model of the nodes must fall within the scope specified in the GPU Support Matrix.
When installing the gpu-operator, select v23.9.0+2 or above.
systemOS : Select the operating system for the host. The current options are Ubuntu 22.04, Ubuntu 20.04, Centos 7.9, and other. Please choose the correct operating system.
Namespace : Select the namespace for installing the plugin
Version: The version of the plugin. Here, we use version v23.9.0+2 as an example.
Failure Deletion: If the installation fails, it will delete the already installed associated resources. When enabled, Ready Wait will also be enabled by default.
Ready Wait: When enabled, the application will be marked as successfully installed only when all associated resources are in a ready state.
Detailed Logs: When enabled, detailed logs of the installation process will be recorded.
Driver.enable : Configure whether to deploy the NVIDIA driver on the node, default is enabled. If you have already deployed the NVIDIA driver on the node before using the gpu-operator, please disable this.
Driver.repository : Repository where the GPU driver image is located, default is nvidia's nvcr.io repository.
Driver.usePrecompiled : Enable the precompiled mode to install the driver.
Driver.version : Version of the GPU driver image, use default parameters for offline deployment. Configuration is only required for online installation. Different versions of the Driver image exist for different types of operating systems. For more details, refer to Nvidia GPU Driver Versions. Examples of Driver Version for different operating systems are as follows:
Note
When using the built-in operating system version, there is no need to modify the image version. For other operating system versions, please refer to Uploading Images to the Bootstrap Node Repository. Note that there is no need to include the operating system name, such as Ubuntu, CentOS, or Red Hat, in the version number. If the official image contains an operating system suffix, please remove it manually.
For Red Hat systems, for example, 525.105.17
For Ubuntu systems, for example, 535-5.15.0-1043-nvidia
For CentOS systems, for example, 525.147.05
Driver.RepoConfig.ConfigMapName : Used to record the name of the offline yum repository configuration file for the gpu-operator. When using the pre-packaged offline bundle, refer to the following documents for different types of operating systems.
For detailed configuration methods, refer to Enabling MIG Functionality.
MigManager.Config.name : The name of the MIG split configuration file, used to define the MIG (GI, CI) split policy. The default is default-mig-parted-config . For custom parameters, refer to Enabling MIG Functionality.
After completing the configuration and creation of the above parameters:
If using full-card mode , GPU resources can be used when creating applications.
If using vGPU mode , after completing the above configuration and creation, proceed to vGPU Addon Installation.
If using MIG mode , you can apply a specific split specification to individual GPU nodes if needed; otherwise, GPUs are split according to the default value in MigManager.Config.
After splitting, applications can use MIG GPU resources.
"},{"location":"en/admin/kpanda/gpu/nvidia/push_image_to_repo.html","title":"Uploading Red Hat GPU Operator Offline Image to Bootstrap Repository","text":"
This guide explains how to upload an offline image to the bootstrap repository using the nvcr.io/nvidia/driver:525.105.17-rhel8.4 offline driver image for Red Hat 8.4 as an example.
The bootstrap node and its components are running properly.
Prepare a node that has internet access and can access the bootstrap node. Docker should also be installed on this node. You can refer to Installing Docker for installation instructions.
"},{"location":"en/admin/kpanda/gpu/nvidia/push_image_to_repo.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/gpu/nvidia/push_image_to_repo.html#step-1-obtain-the-offline-image-on-an-internet-connected-node","title":"Step 1: Obtain the Offline Image on an Internet-Connected Node","text":"
Perform the following steps on the internet-connected node:
Pull the nvcr.io/nvidia/driver:525.105.17-rhel8.4 offline driver image:
Once the image is pulled, save it as a compressed archive named nvidia-driver.tar :
docker save nvcr.io/nvidia/driver:525.105.17-rhel8.4 > nvidia-driver.tar\n
Copy the compressed image archive nvidia-driver.tar to the bootstrap node:
scp nvidia-driver.tar user@ip:/root\n
For example:
scp nvidia-driver.tar root@10.6.175.10:/root\n
"},{"location":"en/admin/kpanda/gpu/nvidia/push_image_to_repo.html#step-2-push-the-image-to-the-bootstrap-repository","title":"Step 2: Push the Image to the Bootstrap Repository","text":"
Perform the following steps on the bootstrap node:
Log in to the bootstrap node and import the compressed image archive nvidia-driver.tar :
docker load -i nvidia-driver.tar\n
View the imported image:
docker images -a | grep nvidia\n
Expected output:
nvcr.io/nvidia/driver e3ed7dee73e9 1 days ago 1.02GB\n
Retag the image to correspond to the target repository in the remote Registry repository:
docker tag <image-name> <registry-url>/<repository-name>:<tag>\n
Replace <image-name> with the name of the Nvidia image from the previous step, <registry-url> with the address of the Registry service on the bootstrap node, <repository-name> with the name of the repository you want to push the image to, and <tag> with the desired tag for the image.
For example:
docker tag nvcr.io/nvidia/driver 10.6.10.5/nvcr.io/nvidia/driver:525.105.17-rhel8.4\n
Check the GPU Driver image version applicable to your kernel, at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags. Use the kernel to query the image version and save the image using ctr export.
ctr i pull nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i export --all-platforms driver.tar.gz nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
Import the image into the cluster's container registry
ctr i import driver.tar.gz
ctr i tag nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 {your_registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i push {your_registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 --skip-verify=true
"},{"location":"en/admin/kpanda/gpu/nvidia/ubuntu22.04_offline_install_driver.html#install-the-driver","title":"Install the Driver","text":"
Install the gpu-operator addon and set driver.usePrecompiled=true
Set driver.version=535, note that it should be 535, not 535.104.12
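If you install the addon via Helm rather than the UI, a minimal sketch of the relevant values might look like the following (the repository alias, release name, and namespace are assumptions; the two driver settings follow the steps above):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --set driver.usePrecompiled=true --set driver.version="535"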
The AI platform comes with a pre-installed GPU Operator offline package for CentOS 7.9 with kernel version 3.10.0-1160. For other OS types or kernel versions, users need to manually build an offline yum source.
This guide explains how to build an offline yum source for CentOS 7.9 with a specific kernel version and use it when installing the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
The user has already installed the v0.12.0 or later version of the addon offline package on the platform.
Prepare a file server that is accessible from the cluster network, such as Nginx or MinIO.
Prepare a node that has internet access, can access the cluster where the GPU Operator will be deployed, and can access the file server. Docker should also be installed on this node. You can refer to Installing Docker for installation instructions.
This guide uses CentOS 7.9 with kernel version 3.10.0-1160.95.1.el7.x86_64 as an example to explain how to upgrade the pre-installed GPU Operator offline package's yum source.
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#check-os-and-kernel-versions-of-cluster-nodes","title":"Check OS and Kernel Versions of Cluster Nodes","text":"
Run the following commands on both the control node of the Global cluster and the node where GPU Operator will be deployed. If the OS and kernel versions of the two nodes are consistent, there is no need to build a yum source. You can directly refer to the Offline Installation of GPU Operator document for installation. If the OS or kernel versions of the two nodes are not consistent, please proceed to the next step.
Run the following command to view the distribution name and version of the node where GPU Operator will be deployed in the cluster.
cat /etc/redhat-release\n
Expected output:
CentOS Linux release 7.9 (Core)\n
The output shows the current node's OS version as CentOS 7.9.
Run the following command to view the kernel version of the node where GPU Operator will be deployed in the cluster.
uname -a\n
Expected output:
Linux localhost.localdomain 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux\n
The output shows the current node's kernel version as 3.10.0-1160.95.1.el7.x86_64.
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#create-the-offline-yum-source","title":"Create the Offline Yum Source","text":"
Perform the following steps on a node that has internet access and can access the file server:
Create a script file named yum.sh by running the following command:
vi yum.sh\n
Then press the i key to enter insert mode and enter the following content:
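The exact script content is not reproduced here; the following is only a minimal sketch of what such a script could look like, assuming yum-utils and createrepo_c are available on the internet-connected node (the package list is an assumption and may need to be extended):
#!/bin/bash
set -e
KERNEL_VERSION=$1   # for example 3.10.0-1160.95.1 (without the .el7.x86_64 suffix)
yum install -y yum-utils createrepo_c
mkdir -p centos-base
# Download the kernel packages needed to build the NVIDIA driver for this kernel
yumdownloader --resolve --destdir=./centos-base \
    kernel-devel-${KERNEL_VERSION}.el7.x86_64 \
    kernel-headers-${KERNEL_VERSION}.el7.x86_64 \
    gcc make elfutils-libelf-devel
# Generate repodata so the directory can be served as a yum repository
createrepo_c ./centos-base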
Press the Esc key to exit insert mode, then enter :wq to save and exit.
Run the yum.sh file:
bash -x yum.sh TARGET_KERNEL_VERSION\n
The TARGET_KERNEL_VERSION parameter is used to specify the kernel version of the cluster nodes.
Note: you don't need to include the distribution identifier (e.g., .el7.x86_64). For example:
bash -x yum.sh 3.10.0-1160.95.1\n
Now you have generated an offline yum source, centos-base , for the kernel version 3.10.0-1160.95.1.el7.x86_64 .
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#upload-the-offline-yum-source-to-the-file-server","title":"Upload the Offline Yum Source to the File Server","text":"
Perform the following steps on a node that has internet access and can access the file server. This step is used to upload the generated yum source from the previous step to a file server that can be accessed by the cluster where the GPU Operator will be deployed. The file server can be Nginx, MinIO, or any other file server that supports the HTTP protocol.
In this example, we will use the built-in MinIO as the file server. The MinIO details are as follows:
Run the following command in the current directory of the node to establish a connection between the node's local mc command-line tool and the MinIO server:
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123\n
The expected output should resemble the following:
Added __minio__ successfully.\n
mc is the command-line tool provided by MinIO for interacting with the MinIO server. For more details, refer to the MinIO Client documentation.
In the current directory of the node, create a bucket named centos-base :
mc mb -p minio/centos-base\n
The expected output should resemble the following:
Bucket created successfully __minio/centos-base__ .\n
Set the access policy of the bucket centos-base to allow public download. This will enable access during the installation of the GPU Operator:
mc anonymous set download minio/centos-base\n
The expected output should resemble the following:
Access permission for __minio/centos-base__ is set to __download__ \n
In the current directory of the node, copy the generated centos-base offline yum source to the minio/centos-base bucket on the MinIO server:
mc cp centos-base minio/centos-base --recursive\n
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#create-a-configmap-to-store-the-yum-source-info-in-the-cluster","title":"Create a ConfigMap to Store the Yum Source Info in the Cluster","text":"
Perform the following steps on the control node of the cluster where the GPU Operator will be deployed.
Run the following command to create a file named CentOS-Base.repo that specifies the configmap for the yum source storage:
# The file name must be CentOS-Base.repo, otherwise it cannot be recognized during the installation of the GPU Operator\ncat > CentOS-Base.repo << EOF\n[extension-0]\nbaseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address where the yum source is placed in step 3\ngpgcheck = 0\nname = kubean extension 0\n\n[extension-1]\nbaseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address where the yum source is placed in step 3\ngpgcheck = 0\nname = kubean extension 1\nEOF\n
Based on the created CentOS-Base.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
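A command along the following lines creates it from the file written above (the file path is an assumption; adjust it to where CentOS-Base.repo was saved):
kubectl create configmap local-repo-config -n gpu-operator --from-file=CentOS-Base.repo=./CentOS-Base.repo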
The expected output should resemble the following:
configmap/local-repo-config created\n
The local-repo-config configmap will be used to provide the value for the RepoConfig.ConfigMapName parameter during the installation of the GPU Operator. You can customize the configuration file name.
View the content of the local-repo-config configmap:
kubectl get configmap local-repo-config -n gpu-operator -oyaml\n
The expected output should resemble the following:
apiVersion: v1\ndata:\nCentOS-Base.repo: \"[extension-0]\\nbaseurl = http://10.6.232.5:32618/centos-base# The file server path where the yum source is placed in step 2\\ngpgcheck = 0\\nname = kubean extension 0\\n \\n[extension-1]\\nbaseurl = http://10.6.232.5:32618/centos-base # The file server path where the yum source is placed in step 2\\ngpgcheck = 0\\nname = kubean extension 1\\n\"\nkind: ConfigMap\nmetadata:\ncreationTimestamp: \"2023-10-18T01:59:02Z\"\nname: local-repo-config\nnamespace: gpu-operator\nresourceVersion: \"59445080\"\nuid: c5f0ebab-046f-442c-b932-f9003e014387\n
You have successfully created an offline yum source configuration file for the cluster where the GPU Operator will be deployed. You can use it during the offline installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html","title":"Building Red Hat 8.4 Offline Yum Source","text":"
The AI platform comes with a pre-installed GPU Operator offline package for CentOS 7.9 with kernel v3.10.0-1160. For other OS types or nodes with different kernels, users need to manually build the offline yum source.
This guide explains how to build an offline yum source package for Red Hat 8.4 based on any node in the Global cluster. It also demonstrates how to use it during the installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
The user has already installed the addon offline package v0.12.0 or higher on the platform.
The OS of the cluster nodes where the GPU Operator will be deployed must be Red Hat v8.4, and the kernel version must be identical.
Prepare a file server that can communicate with the cluster network where the GPU Operator will be deployed, such as Nginx or MinIO.
Prepare a node that can access the internet, the cluster where the GPU Operator will be deployed, and the file server. Ensure that Docker is already installed on this node.
The nodes in the Global cluster must run Red Hat 8.4 with kernel 4.18.0-305.el8.x86_64.
This guide uses a node with Red Hat 8.4 4.18.0-305.el8.x86_64 as an example to demonstrate how to build an offline yum source package for Red Hat 8.4 based on any node in the Global cluster. It also explains how to use it during the installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-1-download-the-yum-source-from-the-bootstrap-node","title":"Step 1: Download the Yum Source from the Bootstrap Node","text":"
Perform the following steps on the master node of the Global cluster.
Use SSH or any other method to access any node in the Global cluster and run the following command:
cat /etc/yum.repos.d/extension.repo # View the contents of extension.repo.\n
The expected output should resemble the following:
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-2-download-the-elfutils-libelf-devel-0187-4el8x86_64rpm-package","title":"Step 2: Download the elfutils-libelf-devel-0.187-4.el8.x86_64.rpm Package","text":"
Perform the following steps on a node with internet access. Before proceeding, ensure that there is network connectivity between the node with internet access and the master node of the Global cluster.
Run the following command on the node with internet access to download the elfutils-libelf-devel-0.187-4.el8.x86_64.rpm package:
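The download source is not reproduced here; a hedged sketch using a placeholder mirror URL (replace <mirror-url> with a yum mirror that hosts the corresponding RHEL 8 packages):
curl -LO https://<mirror-url>/elfutils-libelf-devel-0.187-4.el8.x86_64.rpm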
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-3-generate-the-local-yum-repository","title":"Step 3: Generate the Local Yum Repository","text":"
Perform the following steps on the master node of the Global cluster mentioned in Step 1.
Enter each yum repository directory in turn and generate the repository index in it:
cd ~/redhat-base-repo/extension-1/Packages
createrepo_c ./
cd ~/redhat-base-repo/extension-2/Packages
createrepo_c ./
You have now generated the offline yum source named redhat-base-repo for kernel version 4.18.0-305.el8.x86_64 .
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-4-upload-the-local-yum-repository-to-the-file-server","title":"Step 4: Upload the Local Yum Repository to the File Server","text":"
In this example, we will use Minio, which is built-in as the file server in the bootstrap node. However, you can choose any file server that suits your needs. Here are the details for Minio:
Access URL: http://10.5.14.200:9000 (usually {bootstrap-node-IP}:9000)
Login username: rootuser
Login password: rootpass123
On the current node, establish a connection between the local mc command-line tool and the Minio server by running the following command:
mc config host add minio <file_server_access_url> <username> <password>\n
For example:
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123\n
The expected output should be similar to:
Added __minio__ successfully.\n
mc is the command-line client provided by MinIO for interacting with the MinIO server. For more details, refer to the MinIO Client documentation.
Create a bucket named redhat-base in the current location:
mc mb -p minio/redhat-base\n
The expected output should be similar to:
Bucket created successfully __minio/redhat-base__ .\n
Set the access policy of the redhat-base bucket to allow public downloads so that it can be accessed during the installation of the GPU Operator:
mc anonymous set download minio/redhat-base\n
The expected output should be similar to:
Access permission for __minio/redhat-base__ is set to __download__ \n
Copy the offline yum repository files ( redhat-base-repo ) from the current location to the Minio server's minio/redhat-base bucket:
mc cp redhat-base-repo minio/redhat-base --recursive\n
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-5-create-a-configmap-to-store-yum-repository-information-in-the-cluster","title":"Step 5: Create a ConfigMap to Store Yum Repository Information in the Cluster","text":"
Perform the following steps on the control node of the cluster where you will deploy the GPU Operator.
Run the following command to create a file named redhat.repo , which specifies the configuration information for the yum repository storage:
# The file name must be redhat.repo, otherwise it won't be recognized when installing gpu-operator\ncat > redhat.repo << EOF\n[extension-0]\nbaseurl = http://10.5.14.200:9000/redhat-base/redhat-base-repo/Packages # The file server address where the yum source is stored in Step 1\ngpgcheck = 0\nname = kubean extension 0\n\n[extension-1]\nbaseurl = http://10.5.14.200:9000/redhat-base/redhat-base-repo/Packages # The file server address where the yum source is stored in Step 1\ngpgcheck = 0\nname = kubean extension 1\nEOF\n
Based on the created redhat.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
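A command along the following lines creates it from the redhat.repo file written above (the file path is an assumption):
kubectl create configmap local-repo-config -n gpu-operator --from-file=redhat.repo=./redhat.repo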
The local-repo-config configuration file is used to provide the value for the RepoConfig.ConfigMapName parameter during the installation of the GPU Operator. You can choose a different name for the configuration file.
View the contents of the local-repo-config configuration file:
kubectl get configmap local-repo-config -n gpu-operator -oyaml\n
You have successfully created the offline yum source configuration file for the cluster where the GPU Operator will be deployed. You can use it by specifying the RepoConfig.ConfigMapName parameter during the offline installation of the GPU Operator.
"},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html","title":"Build an Offline Yum Repository for Red Hat 7.9","text":""},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#introduction","title":"Introduction","text":"
The AI platform comes with a pre-installed GPU Operator offline package for CentOS 7.9 with kernel 3.10.0-1160. You need to manually build an offline yum repository for other OS types or nodes with different kernels.
This page explains how to build an offline yum repository for Red Hat 7.9 based on any node in the Global cluster, and how to use the RepoConfig.ConfigMapName parameter when installing the GPU Operator.
The cluster nodes where the GPU Operator is to be deployed must be Red Hat 7.9 with the exact same kernel version.
Prepare a file server that can be connected to the cluster network where the GPU Operator is to be deployed, such as nginx or minio.
Prepare a node that can access the internet, the cluster where the GPU Operator is to be deployed, and the file server. Docker installation must be completed on this node.
The nodes in the global service cluster must be Red Hat 7.9.
"},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#steps","title":"Steps","text":""},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#1-build-offline-yum-repo-for-relevant-kernel","title":"1. Build Offline Yum Repo for Relevant Kernel","text":"
Download rhel7.9 ISO
Download the rhel7.9 ospackage that corresponds to your Kubean version.
Find the version number of Kubean in the Container Management section of the Global cluster under Helm Apps.
Download the rhel7.9 ospackage for that version from the Kubean repository.
Import offline resources using the installer.
Refer to the Import Offline Resources document.
"},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#2-download-offline-driver-image-for-red-hat-79-os","title":"2. Download Offline Driver Image for Red Hat 7.9 OS","text":"
Click here to view the download url.
"},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#3-upload-red-hat-gpu-operator-offline-image-to-boostrap-node-repository","title":"3. Upload Red Hat GPU Operator Offline Image to Boostrap Node Repository","text":"
Refer to Upload Red Hat GPU Operator Offline Image to Bootstrap Node Repository.
Note
This reference is based on rhel8.4, so make sure to modify it for rhel7.9.
"},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#4-create-configmaps-in-the-cluster-to-save-yum-repository-information","title":"4. Create ConfigMaps in the Cluster to Save Yum Repository Information","text":"
Run the following command on the control node of the cluster where the GPU Operator is to be deployed.
Run the following command to create a file named CentOS-Base.repo to specify the configuration information where the yum repository is stored.
# The file name must be CentOS-Base.repo, otherwise it will not be recognized when installing gpu-operator\ncat > CentOS-Base.repo << EOF\n[extension-0]\nbaseurl = http://10.5.14.200:9000/centos-base/centos-base # The server file address of the boostrap node, usually {boostrap node IP} + {9000 port}\ngpgcheck = 0\nname = kubean extension 0\n\n[extension-1]\nbaseurl = http://10.5.14.200:9000/centos-base/centos-base # The server file address of the boostrap node, usually {boostrap node IP} + {9000 port}\ngpgcheck = 0\nname = kubean extension 1\nEOF\n
Based on the created CentOS-Base.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
The local-repo-config configmap is used to provide the value of the RepoConfig.ConfigMapName parameter when installing gpu-operator; the configmap name can be customized by the user.
View the contents of the local-repo-config configmap:
kubectl get configmap local-repo-config -n gpu-operator -oyaml\n
The expected output is as follows:
local-repo-config.yaml
apiVersion: v1\ndata:\n CentOS-Base.repo: \"[extension-0]\\nbaseurl = http://10.6.232.5:32618/centos-base # The file path where yum repository is placed in Step 2 \\ngpgcheck = 0\\nname = kubean extension 0\\n \\n[extension-1]\\nbaseurl\n = http://10.6.232.5:32618/centos-base # The file path where yum repository is placed in Step 2 \\ngpgcheck = 0\\nname\n = kubean extension 1\\n\"\nkind: ConfigMap\nmetadata:\n creationTimestamp: \"2023-10-18T01:59:02Z\"\n name: local-repo-config\n namespace: gpu-operator\n resourceVersion: \"59445080\"\n uid: c5f0ebab-046f-442c-b932-f9003e014387\n
At this point, you have successfully created the offline yum repository configuration for the cluster where the GPU Operator is to be deployed. Use it by specifying the RepoConfig.ConfigMapName parameter during the Offline Installation of GPU Operator.
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/index.html","title":"Overview of NVIDIA Multi-Instance GPU (MIG)","text":""},{"location":"en/admin/kpanda/gpu/nvidia/mig/index.html#mig-scenarios","title":"MIG Scenarios","text":"
Multi-Tenant Cloud Environments:
MIG allows cloud service providers to partition a physical GPU into multiple independent GPU instances, which can be allocated to different tenants. This enables resource isolation and independence, meeting the GPU computing needs of multiple tenants.
Containerized Applications:
MIG enables finer-grained GPU resource management in containerized environments. By partitioning a physical GPU into multiple MIG instances, each container can be assigned dedicated GPU compute resources, providing better performance isolation and resource utilization.
Batch Processing Jobs:
For batch processing jobs requiring large-scale parallel computing, MIG provides higher computational performance and larger memory capacity. Each MIG instance can utilize a portion of the physical GPU's compute resources, accelerating the processing of large-scale computational tasks.
AI/Machine Learning Training:
MIG offers increased compute power and memory capacity for training large-scale deep learning models. By partitioning the physical GPU into multiple MIG instances, each instance can independently carry out model training, improving training efficiency and throughput.
In general, NVIDIA MIG is suitable for scenarios that require finer-grained allocation and management of GPU resources. It enables resource isolation, improved performance utilization, and meets the GPU computing needs of multiple users or applications.
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/index.html#overview-of-mig","title":"Overview of MIG","text":"
NVIDIA Multi-Instance GPU (MIG) is a feature introduced by NVIDIA on H100, A100, and A30 series GPUs. Its purpose is to divide a physical GPU into multiple GPU instances to provide finer-grained resource sharing and isolation. MIG can split a GPU into up to seven GPU instances, allowing a single physical GPU to provide separate GPU resources to multiple users and maximizing GPU utilization.
This feature enables multiple applications or users to share GPU resources simultaneously, improving the utilization of computational resources and increasing system scalability.
With MIG, each GPU instance's processor has an independent and isolated path throughout the entire memory system, including cross-switch ports on the chip, L2 cache groups, memory controllers, and DRAM address buses, all uniquely allocated to a single instance.
This ensures that the workload of individual users can run with predictable throughput and latency, along with identical L2 cache allocation and DRAM bandwidth. MIG can partition available GPU compute resources (such as streaming multiprocessors or SMs and GPU engines like copy engines or decoders) to provide defined quality of service (QoS) and fault isolation for different clients such as virtual machines, containers, or processes. MIG enables multiple GPU instances to run in parallel on a single physical GPU.
MIG allows multiple vGPUs (and virtual machines) to run in parallel on a single GPU instance while retaining the isolation guarantees provided by vGPU. For more details on using vGPU and MIG for GPU partitioning, refer to NVIDIA Multi-Instance GPU and NVIDIA Virtual Compute Server.
The following diagram provides an overview of MIG, illustrating how it virtualizes one physical GPU into seven GPU instances that can be used by multiple users.
SM (Streaming Multiprocessor): The core computational unit of a GPU responsible for executing graphics rendering and general-purpose computing tasks. Each SM contains a group of CUDA cores, as well as shared memory, register files, and other resources, capable of executing multiple threads concurrently. Each MIG instance has a certain number of SMs and other related resources, along with the allocated memory slices.
GPU Memory Slice : The smallest portion of GPU memory, including the proper memory controller and cache. A GPU memory slice is approximately one-eighth of the total GPU memory resources in terms of capacity and bandwidth.
GPU SM Slice : The smallest computational unit of SMs on a GPU. When configuring in MIG mode, the GPU SM slice is approximately one-seventh of the total available SMs in the GPU.
GPU Slice : The GPU slice represents the smallest portion of the GPU, consisting of a single GPU memory slice and a single GPU SM slice combined together.
GPU Instance (GI): A GPU instance is the combination of a GPU slice and GPU engines (DMA, NVDEC, etc.). Anything within a GPU instance always shares all GPU memory slices and other GPU engines, but its SM slice can be further subdivided into Compute Instances (CIs). A GPU instance provides memory QoS. Each GPU slice contains dedicated GPU memory resources, limiting available capacity and bandwidth while providing memory QoS. Each GPU memory slice gets one-eighth of the total GPU memory resources, and each GPU SM slice gets one-seventh of the total SM count.
Compute Instance (CI): A Compute Instance represents the smallest computational unit within a GPU instance. It consists of a subset of SMs, along with dedicated register files, shared memory, and other resources. Each CI has its own CUDA context and can run independent CUDA kernels. The number of CIs in a GPU instance depends on the number of available SMs and the configuration chosen during MIG setup.
Instance Slice : An Instance Slice represents a single CI within a GPU instance. It is the combination of a subset of SMs and a portion of the GPU memory slice. Each Instance Slice provides isolation and resource allocation for individual applications or users running on the GPU instance.
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/index.html#key-benefits-of-mig","title":"Key Benefits of MIG","text":"
Resource Sharing: MIG allows a single physical GPU to be divided into multiple GPU instances, providing efficient sharing of GPU resources among different users or applications. This maximizes GPU utilization and enables improved performance isolation.
Fine-Grained Resource Allocation: With MIG, GPU resources can be allocated at a finer granularity, allowing for more precise partitioning and allocation of compute power and memory capacity.
Improved Performance Isolation: Each MIG instance operates independently with its dedicated resources, ensuring predictable throughput and latency for individual users or applications. This improves performance isolation and prevents interference between different workloads running on the same GPU.
Enhanced Security and Fault Isolation: MIG provides better security and fault isolation by ensuring that each user or application has its dedicated GPU resources. This prevents unauthorized access to data and mitigates the impact of faults or errors in one instance on others.
Increased Scalability: MIG enables the simultaneous usage of GPU resources by multiple users or applications, increasing system scalability and accommodating the needs of various workloads.
Efficient Containerization: By using MIG in containerized environments, GPU resources can be effectively allocated to different containers, improving performance isolation and resource utilization.
Overall, MIG offers significant advantages in terms of resource sharing, fine-grained allocation, performance isolation, security, scalability, and containerization, making it a valuable feature for various GPU computing scenarios.
Check the system requirements for the GPU driver installation on the target node: GPU Support Matrix
Ensure that the cluster nodes have GPUs of the proper models (NVIDIA H100, A100, and A30 Tensor Core GPUs). For more information, see the GPU Support Matrix.
All GPUs on the nodes must belong to the same product line (e.g., A100-SXM-40GB).
When installing the Operator, you need to set the MigManager Config parameter accordingly. The default setting is default-mig-parted-config. You can also customize the sharding policy configuration file:
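As a reference, entries in such a sharding policy file follow the mig-parted config format; the sketch below is an assumption modeled on default-mig-parted-config (the profile name and instance count must match what your GPU model supports):
version: v1
mig-configs:
  all-1g.10gb:              # policy key selected when switching the GPU mode
    - devices: all          # apply to every GPU on the node
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7        # split each GPU into seven 1g.10gb instances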
After successfully installing the GPU Operator, the node is in full GPU mode by default. There will be an indicator on the node management page, as shown below:
Click the ┇ at the right side of the node list, select a GPU mode to switch, and then choose the proper MIG mode and sharding policy. Here, we take MIXED mode as an example:
There are two configurations here:
MIG Policy: Mixed and Single.
Sharding Policy: The policy here needs to match the key in the default-mig-parted-config (or user-defined sharding policy) configuration file.
After clicking the OK button, wait for about a minute and refresh the page. The MIG mode will be switched to:
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/mig_command.html","title":"MIG Related Commands","text":"
GI Related Commands:
Subcommand : Description
nvidia-smi mig -lgi : View the list of created GI instances
nvidia-smi mig -dgi -gi {GI instance ID} : Delete a specific GI instance
nvidia-smi mig -lgip : View the profiles of GI
nvidia-smi mig -cgi {profile ID} : Create a GI using the specified profile ID
CI Related Commands:
Subcommand : Description
nvidia-smi mig -lcip { -gi {GI instance ID} } : View the profiles of CI; specifying -gi shows the CIs that can be created for a particular GI instance
nvidia-smi mig -lci : View the list of created CI instances
nvidia-smi mig -cci {profile ID} -gi {GI instance ID} : Create a CI instance under the specified GI
nvidia-smi mig -dci -ci {CI instance ID} : Delete a specific CI instance
GI+CI Related Commands:
Subcommand : Description
nvidia-smi mig -i 0 -cgi {GI profile ID} -C {CI profile ID} : Create a GI and CI instance directly
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/mig_usage.html","title":"Using MIG GPU Resources","text":"
This section explains how applications can use MIG GPU resources.
The AI platform container management module is deployed and running successfully.
The container management module is integrated with a Kubernetes cluster or a Kubernetes cluster is created, and the UI interface of the cluster can be accessed.
NVIDIA DevicePlugin and MIG capabilities are enabled. Refer to Offline installation of GPU Operator for details.
The nodes in the cluster have GPUs of the proper models.
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/mig_usage.html#using-mig-gpu-through-the-ui","title":"Using MIG GPU through the UI","text":"
Confirm whether the cluster has recognized the GPU type.
Go to Cluster Details -> Nodes and check if it has been correctly recognized as MIG.
When deploying an application using an image, you can select and use NVIDIA MIG resources.
Example of MIG Single mode (used in the same way as a full GPU):
Note
The MIG single policy allows users to request and use GPU resources in the same way as a full GPU (nvidia.com/gpu). The difference is that these resources can be a portion of the GPU (a MIG device) rather than the entire GPU. Learn more from the GPU MIG Mode Design.
MIG Mixed Mode
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/mig_usage.html#using-mig-through-yaml-configuration","title":"Using MIG through YAML Configuration","text":"
Expose MIG devices through resource types of the form nvidia.com/mig-{g}g.{gb}gb, for example nvidia.com/mig-1g.5gb
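A minimal sketch of requesting such a resource in a Pod, assuming the node exposes the 1g.5gb profile (the Pod name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: mig-demo
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image, adjust as needed
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # request one MIG device of the 1g.5gb profile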
After entering the container, you can check if only one MIG device is being used:
"},{"location":"en/admin/kpanda/gpu/nvidia/vgpu/hami.html","title":"Build a vGPU Memory Oversubscription Image","text":"
The vGPU memory oversubscription feature has been removed from the Hami project. To use this feature, you need to rebuild the image with a libvgpu.so file that supports memory oversubscription.
Dockerfile
FROM docker.m.daocloud.io/projecthami/hami:v2.3.11\nCOPY libvgpu.so /k8s-vgpu/lib/nvidia/\n
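A hedged usage example for building and pushing the rebuilt image (the target registry and tag are assumptions):
docker build -t {your_registry}/projecthami/hami:v2.3.11-oversub .
docker push {your_registry}/projecthami/hami:v2.3.11-oversub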
To virtualize a single NVIDIA GPU into multiple virtual GPUs and allocate them to different virtual machines or users, you can use NVIDIA's vGPU capability. This section explains how to install the vGPU plugin in the AI platform, which is a prerequisite for using NVIDIA vGPU capability.
During the installation of vGPU, several basic modification parameters are provided. If you need to modify advanced parameters, click the YAML column to make changes:
deviceMemoryScaling : NVIDIA device memory scaling factor, the input value must be an integer, with a default value of 1. It can be greater than 1 (enabling virtual memory, experimental feature). For an NVIDIA GPU with a memory size of M, if we configure the devicePlugin.deviceMemoryScaling parameter as S, in a Kubernetes cluster where we have deployed our device plugin, the vGPUs assigned from this GPU will have a total memory of S * M .
deviceSplitCount : An integer type, with a default value of 10. Number of GPU splits, each GPU cannot be assigned more tasks than its configuration count. If configured as N, each GPU can have up to N tasks simultaneously.
Resources : Represents the resource usage of the vgpu-device-plugin and vgpu-schedule pods.
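For reference, a sketch of how these advanced parameters might appear in the addon's YAML (the structure is an assumption based on the parameter names above):
devicePlugin:
  deviceMemoryScaling: 1   # no GPU memory oversubscription
  deviceSplitCount: 10     # each GPU can carry up to 10 tasks at once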
After a successful installation, you will see two types of pods in the specified namespace, indicating that the NVIDIA vGPU plugin has been successfully installed:
After a successful installation, you can deploy applications using vGPU resources.
Note
NVIDIA vGPU Addon does not support upgrading directly from the older v2.0.0 to the latest v2.0.0+1; To upgrade, please uninstall the older version and then reinstall the latest version.
"},{"location":"en/admin/kpanda/gpu/nvidia/vgpu/vgpu_user.html","title":"Using NVIDIA vGPU in Applications","text":"
This section explains how to use the vGPU capability in the AI platform.
The nodes in the cluster have GPUs of the proper models.
vGPU Addon has been successfully installed. Refer to Installing GPU Addon for details.
GPU Operator is installed, and the Nvidia.DevicePlugin capability is disabled. Refer to Offline Installation of GPU Operator for details.
"},{"location":"en/admin/kpanda/gpu/nvidia/vgpu/vgpu_user.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/gpu/nvidia/vgpu/vgpu_user.html#using-vgpu-through-the-ui","title":"Using vGPU through the UI","text":"
Confirm if the cluster has detected GPUs. Click the Clusters -> Cluster Settings -> Addon Plugins and check if the GPU plugin has been automatically enabled and the proper GPU type has been detected. Currently, the cluster will automatically enable the GPU addon and set the GPU Type as Nvidia vGPU .
Deploy a workload by clicking Clusters -> Workloads . When deploying a workload using an image, select the type Nvidia vGPU , and you will be prompted with the following parameters:
Number of Physical GPUs (nvidia.com/vgpu): Indicates how many physical GPUs the current pod needs to mount. The input value must be an integer and less than or equal to the number of GPUs on the host machine.
GPU Cores (nvidia.com/gpucores): Indicates the GPU cores utilized by each GPU, with a value range from 0 to 100. Setting it to 0 means no enforced isolation, while setting it to 100 means exclusive use of the entire GPU.
GPU Memory (nvidia.com/gpumem): Indicates the GPU memory occupied by each GPU, in MB. The minimum value is 1, and the maximum value is the total memory of the GPU.
If there are issues with the configuration values above, it may result in scheduling failure or inability to allocate resources.
"},{"location":"en/admin/kpanda/gpu/nvidia/vgpu/vgpu_user.html#using-vgpu-through-yaml-configuration","title":"Using vGPU through YAML Configuration","text":"
Refer to the following workload configuration and add the parameter nvidia.com/vgpu: '1' in the resource requests and limits section to configure the number of physical cards used by the application.
This YAML configuration requests vGPU resources for the application. It specifies that each GPU should utilize 20% of its cores and 200 MB of GPU memory, and it requests 1 physical GPU.
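Since the full manifest is not shown here, the following is a minimal sketch of such a configuration (names and image are assumptions; the resource values mirror the description above):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vgpu-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vgpu-demo
  template:
    metadata:
      labels:
        app: vgpu-demo
    spec:
      containers:
        - name: cuda
          image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/vgpu: '1'        # number of physical GPUs to mount
              nvidia.com/gpucores: '20'   # 20% of GPU cores per GPU
              nvidia.com/gpumem: '200'    # 200 MB of GPU memory per GPU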
"},{"location":"en/admin/kpanda/gpu/volcano/volcano-gang-scheduler.html","title":"Using Volcano's Gang Scheduler","text":"
The Gang scheduling policy is one of the core scheduling algorithms of the volcano-scheduler. It satisfies the \"All or nothing\" scheduling requirement during the scheduling process, preventing arbitrary scheduling of Pods that could waste cluster resources. The specific algorithm observes whether the number of scheduled Pods under a Job meets the minimum running quantity. When the Job's minimum running quantity is satisfied, scheduling actions are performed for all Pods under the Job; otherwise, no actions are taken.
The Gang scheduling algorithm, based on the concept of a Pod group, is particularly suitable for scenarios that require multi-process collaboration. AI scenarios often involve complex workflows, such as Data Ingestion, Data Analysis, Data Splitting, Training, Serving, and Logging, which require a group of containers to work together. This makes the Gang scheduling policy based on pods very appropriate.
In multi-threaded parallel computing communication scenarios under the MPI computation framework, Gang scheduling is also very suitable because it requires master and slave processes to work together. High relevance among containers in a pod may lead to resource contention, and overall scheduling allocation can effectively resolve deadlocks.
In scenarios with insufficient cluster resources, the Gang scheduling policy significantly improves the utilization of cluster resources. For example, if the cluster can currently accommodate only 2 Pods, but the minimum number of Pods required for scheduling is 3, then all Pods of this Job will remain pending until the cluster can accommodate 3 Pods, at which point the Pods will be scheduled. This effectively prevents the partial scheduling of Pods, which would not meet the requirements and would occupy resources, making other Jobs unable to run.
The Gang Scheduler is the core scheduling plugin of Volcano, and it is enabled by default upon installing Volcano. When creating a workload, you only need to specify the scheduler name as Volcano.
Volcano schedules based on PodGroups. When creating a workload, there is no need to manually create PodGroup resources; Volcano will automatically create them based on the workload information. Below is an example of a PodGroup:
Represents the minimum number of Pods or jobs that need to run under this PodGroup. If the cluster resources do not meet the requirements to run the number of jobs specified by minMember, the scheduler will not schedule any jobs within this PodGroup.
Represents the minimum resources required to run this PodGroup. If the allocatable resources of the cluster do not meet the minResources, the scheduler will not schedule any jobs within this PodGroup.
Represents the priority of this PodGroup, used by the scheduler to sort all PodGroups within the queue during scheduling. system-node-critical and system-cluster-critical are two reserved values indicating the highest priority. If not specifically designated, the default priority or zero priority is used.
Represents the queue to which this PodGroup belongs. The queue must be pre-created and in the open state.
In a multi-threaded parallel computing communication scenario under the MPI computation framework, we need to ensure that all Pods can be successfully scheduled to ensure the job is completed correctly. Setting minAvailable to 4 means that 1 mpimaster and 3 mpiworkers are required to run.
apiVersion: scheduling.volcano.sh/v1beta1\nkind: PodGroup\nmetadata:\n annotations:\n creationTimestamp: \"2024-05-28T09:18:50Z\"\n generation: 5\n labels:\n volcano.sh/job-type: MPI\n name: lm-mpi-job-9c571015-37c7-4a1a-9604-eaa2248613f2\n namespace: default\n ownerReferences:\n - apiVersion: batch.volcano.sh/v1alpha1\n blockOwnerDeletion: true\n controller: true\n kind: Job\n name: lm-mpi-job\n uid: 9c571015-37c7-4a1a-9604-eaa2248613f2\n resourceVersion: \"25173454\"\n uid: 7b04632e-7cff-4884-8e9a-035b7649d33b\nspec:\n minMember: 4\n minResources:\n count/pods: \"4\"\n cpu: 3500m\n limits.cpu: 3500m\n pods: \"4\"\n requests.cpu: 3500m\n minTaskMember:\n mpimaster: 1\n mpiworker: 3\n queue: default\nstatus:\n conditions:\n - lastTransitionTime: \"2024-05-28T09:19:01Z\"\n message: '3/4 tasks in gang unschedulable: pod group is not ready, 1 Succeeded,\n 3 Releasing, 4 minAvailable'\n reason: NotEnoughResources\n status: \"True\"\n transitionID: f875efa5-0358-4363-9300-06cebc0e7466\n type: Unschedulable\n - lastTransitionTime: \"2024-05-28T09:18:53Z\"\n reason: tasks in gang are ready to be scheduled\n status: \"True\"\n transitionID: 5a7708c8-7d42-4c33-9d97-0581f7c06dab\n type: Scheduled\n phase: Pending\n succeeded: 1\n
From the PodGroup, it can be seen that it is associated with the workload through ownerReferences and sets the minimum number of running Pods to 4.
"},{"location":"en/admin/kpanda/gpu/volcano/volcano_user_guide.html","title":"Use Volcano for AI Compute","text":""},{"location":"en/admin/kpanda/gpu/volcano/volcano_user_guide.html#usage-scenarios","title":"Usage Scenarios","text":"
Kubernetes has become the de facto standard for orchestrating and managing cloud-native applications, and an increasing number of applications are choosing to migrate to K8s. The fields of artificial intelligence and machine learning inherently involve a large number of compute-intensive tasks, and developers are very willing to build AI platforms based on Kubernetes to fully leverage its resource management, application orchestration, and operations monitoring capabilities. However, the default Kubernetes scheduler was initially designed primarily for long-running services and has many shortcomings in batch and elastic scheduling for AI and big data tasks. For example, resource contention issues:
Take TensorFlow job scenarios as an example. TensorFlow jobs include two different roles, PS and Worker, and the Pods for these two roles need to work together to complete the entire job. If only one type of role Pod is running, the entire job cannot be executed properly. The default scheduler schedules Pods one by one and is unaware of the PS and Worker roles in a Kubeflow TFJob. In a high-load cluster (insufficient resources), multiple jobs may each be allocated some resources to run a portion of their Pods, but the jobs cannot complete successfully, leading to resource waste. For instance, if a cluster has 4 GPUs and both TFJob1 and TFJob2 each have 4 Workers, TFJob1 and TFJob2 might each be allocated 2 GPUs. However, both TFJob1 and TFJob2 require 4 GPUs to run. This mutual waiting for resource release creates a deadlock situation, resulting in GPU resource waste.
Volcano is the first Kubernetes-based container batch computing platform under CNCF, focusing on high-performance computing scenarios. It fills in the missing functionalities of Kubernetes in fields such as machine learning, big data, and scientific computing, providing essential support for these high-performance workloads. Additionally, Volcano seamlessly integrates with mainstream computing frameworks like Spark, TensorFlow, and PyTorch, and supports hybrid scheduling of heterogeneous devices, including CPUs and GPUs, effectively resolving the deadlock issues mentioned above.
The following sections will introduce how to install and use Volcano.
Find Volcano in Cluster Details -> Helm Apps -> Helm Charts and install it.
Check and confirm whether Volcano is installed successfully, that is, whether the components volcano-admission, volcano-controllers, and volcano-scheduler are running properly.
Typically, Volcano is used in conjunction with the AI Lab to achieve an effective closed-loop process for the development and training of datasets, Notebooks, and task training.
"},{"location":"en/admin/kpanda/gpu/volcano/volcano_user_guide.html#volcano-use-cases","title":"Volcano Use Cases","text":"
Volcano is a standalone scheduler. To enable the Volcano scheduler when creating workloads, simply specify the scheduler's name (schedulerName: volcano).
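A minimal sketch of enabling it in a workload (the workload name and image are assumptions):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: volcano-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: volcano-demo
  template:
    metadata:
      labels:
        app: volcano-demo
    spec:
      schedulerName: volcano   # hand these Pods to the Volcano scheduler
      containers:
        - name: nginx
          image: nginx:latest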
The volcanoJob resource is an extension of the Job in Volcano, breaking the Job down into smaller working units called tasks, which can interact with each other.
"},{"location":"en/admin/kpanda/gpu/volcano/volcano_user_guide.html#parallel-computing-with-mpi","title":"Parallel Computing with MPI","text":"
In multi-threaded parallel computing communication scenarios under the MPI computing framework, we need to ensure that all Pods are successfully scheduled to guarantee the task's proper completion. Setting minAvailable to 4 indicates that 1 mpimaster and 3 mpiworkers are required to run. By simply setting the schedulerName field value to \"volcano,\" you can enable the Volcano scheduler.
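The full MPI job manifest is not reproduced here; the following trimmed sketch shows only the fields discussed above (the image and commands are placeholders, not the actual MPI launch commands):
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: lm-mpi-job
spec:
  schedulerName: volcano   # enable the Volcano scheduler
  minAvailable: 4          # 1 mpimaster + 3 mpiworkers must be schedulable together
  tasks:
    - name: mpimaster
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpimaster
              image: volcanosh/example-mpi:0.0.1   # placeholder image
              command: ["/bin/sh", "-c", "sleep infinity"]
    - name: mpiworker
      replicas: 3
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpiworker
              image: volcanosh/example-mpi:0.0.1   # placeholder image
              command: ["/bin/sh", "-c", "sleep infinity"]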
Helm is a package management tool for Kubernetes, which makes it easy for users to quickly discover, share and use applications built with Kubernetes. The Container Management Module provides hundreds of Helm charts, covering storage, network, monitoring, database and other main cases. With these templates, you can quickly deploy and easily manage Helm apps through the UI interface. In addition, it supports adding more personalized templates through Add Helm repository to meet various needs.
Key Concepts:
There are a few key concepts to understand when using Helm:
Chart: A Helm installation package, which contains the images, dependencies, and resource definitions required to run an application, and may also contain service definitions in the Kubernetes cluster, similar to the formula in Homebrew, dpkg in APT, or rpm files in Yum. Charts are called Helm Charts in AI platform.
Release: A Chart instance running on the Kubernetes cluster. A Chart can be installed multiple times in the same cluster, and each installation will create a new Release. Release is called Helm Apps in AI platform.
Repository: A repository for publishing and storing Charts. Repository is called Helm Repositories in AI platform.
For more details, refer to Helm official website.
Related operations:
Manage Helm apps, including installing, updating, uninstalling Helm apps, viewing Helm operation records, etc.
Manage Helm repository, including installing, updating, deleting Helm repository, etc.
"},{"location":"en/admin/kpanda/helm/Import-addon.html","title":"Import Custom Helm Apps into Built-in Addons","text":"
This article explains how to import Helm apps into the system's built-in addons in both offline and online environments.
charts-syncer is available and running. If not, you can click here to download.
The Helm Chart has been adapted for charts-syncer. This means adding a .relok8s-images.yaml file to the Helm Chart. This file should include all the images used in the Chart, including any images that are not directly referenced in the Chart but are used in a similar way, such as images used by an Operator.
Note
Refer to image-hints-file for instructions on how to write a Chart. It is required to separate the registry and repository of the image because the registry/repository needs to be replaced or modified when loading the image.
The installer's fire cluster has charts-syncer installed. If you are importing a custom Helm app into the installer's fire cluster, you can skip the download and proceed to the adaptation. If the charts-syncer binary is not installed, you can download it immediately.
Go to Container Management -> Helm Apps -> Helm Repositories , search for the addon, and obtain the built-in repository address and username/password (the default username/password for the system's built-in repository is rootuser/rootpass123).
Sync the Helm Chart to the built-in repository addon of the container management system
Write the following configuration file, modify it according to your specific configuration, and save it as sync-dao-2048.yaml .
source: # helm charts source information\n repo:\n kind: HARBOR # It can also be any other supported Helm Chart repository type, such as CHARTMUSEUM\n url: https://release-ci.daocloud.io/chartrepo/community # Change to the chart repo URL\n #auth: # username/password, if no password is set, leave it blank\n #username: \"admin\"\n #password: \"Harbor12345\"\ncharts: # charts to sync\n - name: dao-2048 # helm charts information, if not specified, sync all charts in the source helm repo\n versions:\n - 1.4.1\ntarget: # helm charts target information\n containerRegistry: 10.5.14.40 # image repository URL\n repo:\n kind: CHARTMUSEUM # It can also be any other supported Helm Chart repository type, such as HARBOR\n url: http://10.5.14.40:8081 # Change to the correct chart repo URL, you can verify the address by using helm repo add $HELM-REPO\n auth: # username/password, if no password is set, leave it blank\n username: \"rootuser\"\n password: \"rootpass123\"\n containers:\n # kind: HARBOR # If the image repository is HARBOR and you want charts-syncer to automatically create an image repository, fill in this field\n # auth: # username/password, if no password is set, leave it blank\n # username: \"admin\"\n # password: \"Harbor12345\"\n\n# leverage .relok8s-images.yaml file inside the Charts to move the container images too\nrelocateContainerImages: true\n
Run the charts-syncer command to sync the Chart and its included images
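A hedged example of the sync command, assuming the configuration file from the previous step is in the current directory:
charts-syncer sync --config sync-dao-2048.yaml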
I1222 15:01:47.119777 8743 sync.go:45] Using config file: \"examples/sync-dao-2048.yaml\"\nW1222 15:01:47.234238 8743 syncer.go:263] Ignoring skipDependencies option as dependency sync is not supported if container image relocation is true or syncing from/to intermediate directory\nI1222 15:01:47.234685 8743 sync.go:58] There is 1 chart out of sync!\nI1222 15:01:47.234706 8743 sync.go:66] Syncing \"dao-2048_1.4.1\" chart...\n.relok8s-images.yaml hints file found\nComputing relocation...\n\nRelocating dao-2048@1.4.1...\nPushing 10.5.14.40/daocloud/dao-2048:v1.4.1...\nDone\nDone moving /var/folders/vm/08vw0t3j68z9z_4lcqyhg8nm0000gn/T/charts-syncer869598676/dao-2048-1.4.1.tgz\n
Once the previous step is completed, go to Container Management -> Helm Apps -> Helm Repositories , find the proper addon, click Sync Repository in the action column, and you will see the uploaded Helm apps in the Helm template.
You can then proceed with normal installation, upgrade, and uninstallation.
The Helm Repo address for the online environment is release.daocloud.io . If the user does not have permission to add a Helm Repo, they will not be able to import custom Helm apps into the system's built-in addons. You can add your own Helm repository and then integrate it into the platform using the same steps as syncing a Helm Chart in the offline environment.
The container management module supports interface-based management of Helm, including creating Helm instances using Helm charts, customizing Helm instance arguments, and managing the full lifecycle of Helm instances.
This section will take cert-manager as an example to introduce how to create and manage Helm apps through the container management interface.
Integrated the Kubernetes cluster or created the Kubernetes cluster, and you can access the UI interface of the cluster.
Created a namespace, user, and granted NS Admin or higher permissions to the user. For details, refer to Namespace Authorization.
"},{"location":"en/admin/kpanda/helm/helm-app.html#install-the-helm-app","title":"Install the Helm app","text":"
Follow the steps below to install the Helm app.
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Helm Apps -> Helm Chart to enter the Helm chart page.
On the Helm chart page, select the Helm repository named addon , and all the Helm chart templates under the addon repository will be displayed on the interface. Click the Chart named cert-manager .
On the installation page, you can see the relevant detailed information of the Chart, select the version to be installed in the upper right corner of the interface, and click the Install button. Here select v1.9.1 version for installation.
Configure Name , Namespace and Version Information . You can also customize arguments by modifying YAML in the argument Configuration area below. Click OK .
The system will automatically return to the list of Helm apps, and the status of the newly created Helm app is Installing , and the status will change to Running after a period of time.
"},{"location":"en/admin/kpanda/helm/helm-app.html#update-the-helm-app","title":"Update the Helm app","text":"
After we have completed the installation of a Helm app through the interface, we can perform an update operation on the Helm app. Note: Update operations using the UI are only supported for Helm apps installed via the UI.
Follow the steps below to update the Helm app.
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Helm Apps to enter the Helm app list page.
On the Helm app list page, select the Helm app that needs to be updated, click the ... operation button on the right side of the list, and select the Update operation in the drop-down selection.
After clicking the Update button, the system will jump to the update interface, where you can update the Helm app as needed. Here we take updating the http port of the dao-2048 application as an example.
After modifying the proper arguments. You can click the Change button under the argument configuration to compare the files before and after the modification. After confirming that there is no error, click the OK button at the bottom to complete the update of the Helm app.
The system will automatically return to the Helm app list, and a pop-up window in the upper right corner will prompt update successful .
Every installation, update, and deletion of Helm apps has detailed operation records and logs for viewing.
In the left navigation bar, click Cluster Operations -> Recent Operations , and then select the Helm Operations tab at the top of the page. Each record corresponds to an install/update/delete operation.
To view the detailed log of each operation: Click ┇ on the right side of the list, and select Log from the pop-up menu.
At this point, the detailed operation log will be displayed in the form of console at the bottom of the page.
"},{"location":"en/admin/kpanda/helm/helm-app.html#delete-the-helm-app","title":"Delete the Helm app","text":"
Follow the steps below to delete the Helm app.
Find the cluster where the Helm app to be deleted resides, click the cluster name, and enter Cluster Details .
In the left navigation bar, click Helm Apps to enter the Helm app list page.
On the Helm app list page, select the Helm app you want to delete, click the ... operation button on the right side of the list, and select Delete from the drop-down selection.
Enter the name of the Helm app in the pop-up window to confirm, and then click the Delete button.
The Helm repository is a repository for storing and publishing Charts. The Helm App module supports HTTP(s) protocol to access Chart packages in the repository. By default, the system has 4 built-in helm repos as shown in the table below to meet common needs in the production process of enterprises.
Repository : Description
partner : Charts for various high-quality features provided by ecosystem partners. Example: tidb
system : Charts that the system's core functional components and some advanced features depend on. For example, insight-agent must be installed to obtain cluster monitoring information. Example: Insight
addon : Charts commonly used in business cases. Example: cert-manager
community : Charts for the most popular open source components in the Kubernetes community. Example: Istio
In addition to the above preset repositories, you can also add third-party Helm repositories yourself. This page will introduce how to add and update third-party Helm repositories.
The following takes the public container repository of Kubevela as an example to introduce and manage the helm repo.
Find the cluster that needs to be imported into the third-party helm repo, click the cluster name, and enter cluster details.
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo page.
Click the Create Repository button on the helm repo page to enter the Create repository page, and configure relevant arguments according to the table below.
Repository Name: Set the repository name. It can be up to 63 characters long and may only include lowercase letters, numbers, and separators -. It must start and end with a lowercase letter or number, for example, kubevela.
Repository URL: The HTTP(S) address pointing to the target Helm repository. For example, https://charts.kubevela.net/core.
Skip TLS Verification: If the added Helm repository uses an HTTPS address and requires skipping TLS verification, you can check this option. The default is unchecked.
Authentication Method: The method used for identity verification after connecting to the repository URL. For public repositories, you can select None. For private repositories, you need to enter a username/password for identity verification.
Labels: Add labels to this Helm repository. For example, key: repo4; value: Kubevela.
Annotations: Add annotations to this Helm repository. For example, key: repo4; value: Kubevela.
Description: Add a description for this Helm repository. For example: This is a Kubevela public Helm repository.
Click OK to complete the creation of the Helm repository. The page will automatically jump to the list of Helm repositories.
"},{"location":"en/admin/kpanda/helm/helm-repo.html#update-the-helm-repository","title":"Update the Helm repository","text":"
When the address information of the helm repo changes, the address, authentication method, label, annotation, and description information of the helm repo can be updated.
Find the cluster where the repository to be updated is located, click the cluster name, and enter cluster details .
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo list page.
Find the Helm repository that needs to be updated on the repository list page, click the ┇ button on the right side of the list, and click Update in the pop-up menu.
Update the relevant information on the Update Helm Repository page, and click OK when finished.
Return to the Helm repository list; a message indicates that the update was successful.
"},{"location":"en/admin/kpanda/helm/helm-repo.html#delete-the-helm-repository","title":"Delete the Helm repository","text":"
In addition to importing and updating repositories, you can also delete unnecessary repositories, including system preset repositories and third-party repositories.
Find the cluster where the repository to be deleted is located, click the cluster name, and enter cluster details .
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo list page.
Find the Helm repository that needs to be deleted on the repository list page, click the ┇ button on the right side of the list, and click Delete in the pop-up menu.
Enter the repository name to confirm, and click Delete .
Return to the list of Helm repositories; a message indicates that the deletion was successful.
"},{"location":"en/admin/kpanda/helm/multi-archi-helm.html","title":"Import and Upgrade Multi-Arch Helm Apps","text":"
In a multi-arch cluster, it is common to use Helm charts that support multiple architectures to address deployment issues caused by architectural differences. This guide will explain how to integrate single-arch Helm apps into multi-arch deployments and how to integrate multi-arch Helm apps.
The offline package is quite large and requires sufficient space for decompression and loading of images. Otherwise, it may interrupt the process with a \"no space left\" error.
"},{"location":"en/admin/kpanda/helm/multi-archi-helm.html#retry-after-failure","title":"Retry after Failure","text":"
If the multi-arch fusion step fails, you need to clean up the residue before retrying:
If the offline package for fusion contains registry spaces that are inconsistent with the imported offline package, an error may occur during the fusion process due to the non-existence of the registry spaces:
Solution: Simply create the registry space before the fusion. For example, in the above error, creating the registry space \"localhost\" in advance can prevent the error.
When upgrading an addon to a version lower than 0.12.0, the charts-syncer in the target offline package does not check whether an image already exists before pushing it, so the multi-arch image will be recombined into a single architecture during the upgrade. For example, if an addon was implemented as multi-arch in v0.10, upgrading to v0.11 will overwrite the multi-arch addon with a single architecture; upgrading to v0.12.0 or above preserves the multi-arch.
This article explains how to upload Helm charts. See the steps below.
Add a Helm repository, refer to Adding a Third-Party Helm Repository for the procedure.
Upload the Helm Chart to the Helm repository.
Upload with ClientUpload with Web Page
Note
This method is suitable for Harbor, ChartMuseum, JFrog type repositories.
Log in to a node that can access the Helm repository, upload the Helm binary to the node, and install the cm-push plugin (VPN is needed and Git should be installed in advance).
Refer to the plugin installation process.
Push the Helm Chart to the Helm repository by executing the following command:
charts-dir: The directory of the Helm Chart, or the packaged Chart (i.e., .tgz file).
HELM_REPO_URL: The URL of the Helm repository.
username/password: The username and password for the Helm repository with push permissions.
If you want to access via HTTPS and skip the certificate verification, you can add the argument --insecure.
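The exact push command is not reproduced in the text; a minimal sketch using the cm-push plugin, with an illustrative chart directory, repository URL, and credentials, might look like this:

```bash
# Install the cm-push plugin (requires Git and network access)
helm plugin install https://github.com/chartmuseum/helm-push

# Push a chart directory (or a packaged .tgz) to the repository
helm cm-push ./mychart https://charts.example.com --username demo --password demo

# Add --insecure to skip TLS certificate verification over HTTPS
# helm cm-push ./mychart https://charts.example.com --username demo --password demo --insecure
```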
Note
This method is only applicable to Harbor repositories.
Log into the Harbor repository, ensuring the logged-in user has permissions to push;
Go to the relevant project, select the Helm Charts tab, click the Upload button on the page to upload the Helm Chart.
Sync Remote Repository Data
Manual SyncAuto Sync
By default, the cluster does not enable Helm Repository Auto-Refresh, so you need to perform a manual sync operation. The general steps are:
Go to Helm Apps -> Helm Repositories, click the ┇ button on the right side of the repository list, and select Sync Repository to complete the repository data synchronization.
If you need to enable the Helm repository auto-sync feature, you can go to Cluster Maintenance -> Cluster Settings -> Advanced Settings and turn on the Helm repository auto-refresh switch.
Cluster inspection allows administrators to regularly or ad-hoc check the overall health of the cluster, giving them proactive control over ensuring cluster security. With a well-planned inspection schedule, this proactive cluster check allows administrators to monitor the cluster status at any time and address potential issues in advance. It eliminates the previous dilemma of passive troubleshooting during failures, enabling proactive monitoring and prevention.
The cluster inspection feature provided by AI platform's container management module supports custom inspection items at the cluster, node, and pod levels. After the inspection is completed, it automatically generates visual inspection reports.
Cluster Level: Checks the running status of system components in the cluster, including cluster status, resource usage, and specific inspection items for control nodes, such as the status of kube-apiserver and etcd .
Node Level: Includes common inspection items for both control nodes and worker nodes, such as node resource usage, handle counts, PID status, and network status.
Pod Level: Checks the CPU and memory usage and running status of Pods, and the status of PVs (Persistent Volumes) and PVCs (PersistentVolumeClaims).
For information on security inspections or executing security-related inspections, refer to the supported security scan types in AI platform.
AI platform Container Management module provides cluster inspection functionality, which supports inspection at the cluster, node, and pod levels.
Cluster level: Check the running status of system components in the cluster, including cluster status, resource usage, and specific inspection items for control nodes such as kube-apiserver and etcd .
Node level: Includes common inspection items for both control nodes and worker nodes, such as node resource usage, handle count, PID status, and network status.
Pod level: Check the CPU and memory usage, running status, PV and PVC status of Pods.
Here's how to create an inspection configuration.
Click Cluster Inspection in the left navigation bar.
On the right side of the page, click Inspection Configuration .
Fill in the inspection configuration based on the following instructions, then click OK at the bottom of the page.
Cluster: Select the clusters that you want to inspect from the dropdown list. If you select multiple clusters, multiple inspection configurations will be generated automatically (identical in every respect except the inspected cluster).
Scheduled Inspection: When enabled, it allows for regular automatic execution of cluster inspections based on a pre-set inspection frequency.
Inspection Frequency: Set the interval for automatic inspections, e.g., every Tuesday at 10 AM. Custom cron expressions are supported; refer to Cron Schedule Syntax for more information (see the example after this list).
Number of Inspection Records to Retain: Specifies the maximum number of inspection records to be retained, including all inspection records for each cluster.
Parameter Configuration: The parameter configuration is divided into three parts: cluster level, node level, and pod level. You can enable or disable specific inspection items based on your requirements.
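As an illustration of the cron syntax mentioned above, an expression for "every Tuesday at 10 AM", assuming a standard five-field cron format, could be:

```bash
# minute hour day-of-month month day-of-week
# Every Tuesday at 10:00 AM
0 10 * * 2
```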
After creating the inspection configuration, it will be automatically displayed in the inspection configuration list. Click the more options button on the right of the configuration to immediately perform an inspection, modify the inspection configuration or delete the inspection configuration and reports.
Click Inspection to perform an inspection once based on the configuration.
Click Inspection Configuration to modify the inspection configuration.
Click Delete to delete the inspection configuration and reports.
Note
After creating the inspection configuration, if the Scheduled Inspection configuration is enabled, inspections will be automatically executed at the specified time.
If Scheduled Inspection configuration is not enabled, you need to manually trigger the inspection.
After creating an inspection configuration, if the Scheduled Inspection configuration is enabled, inspections will be automatically executed at the specified time. If the Scheduled Inspection configuration is not enabled, you need to manually trigger the inspection.
This page explains how to manually perform a cluster inspection.
When performing an inspection, you can choose to inspect multiple clusters in batches or perform a separate inspection for a specific cluster.
Batch InspectionIndividual Inspection
Click Cluster Inspection in the top-level navigation bar of the Container Management module, then click Inspection on the right side of the page.
Select the clusters you want to inspect, then click OK at the bottom of the page.
If you choose to inspect multiple clusters at the same time, the system will perform inspections based on different inspection configurations for each cluster.
If no inspection configuration is set for a cluster, the system will use the default configuration.
Go to the Cluster Inspection page.
Click the more options button ( ┇ ) on the right of the proper inspection configuration, then select Inspection from the popup menu.
Go to the Cluster Inspection page and click the name of the target inspection cluster.
Click the name of the inspection record you want to view.
Each inspection execution generates an inspection record.
When the number of inspection records exceeds the maximum retention specified in the inspection configuration, the earliest records (by execution time) are deleted first.
View the detailed information of the inspection, which may include an overview of cluster resources and the running status of system components.
You can download the inspection report or delete the inspection report from the top right corner of the page.
Namespaces are an abstraction used in Kubernetes for resource isolation. A cluster can contain multiple namespaces with different names, and the resources in each namespace are isolated from each other. For a detailed introduction to namespaces, refer to Namespaces.
This page will introduce the related operations of the namespace.
"},{"location":"en/admin/kpanda/namespaces/createns.html#create-a-namespace","title":"Create a namespace","text":"
Namespaces can be created easily through a form, or quickly by writing or importing a YAML file.
Note
Before creating a namespace, you need to Integrate a Kubernetes cluster or Create a Kubernetes cluster in the container management module.
The default namespace default is usually generated automatically after cluster initialization. For production clusters, however, it is recommended to create separate namespaces rather than use the default namespace directly, for ease of management.
"},{"location":"en/admin/kpanda/namespaces/createns.html#create-with-form","title":"Create with form","text":"
On the cluster list page, click the name of the target cluster.
Click Namespace in the left navigation bar, then click the Create button on the right side of the page.
Fill in the name of the namespace, configure the workspace and labels (optional), and then click OK.
Info
After binding a namespace to a workspace, the resources of that namespace will be shared with the bound workspace. For a detailed explanation of workspaces, refer to Workspaces and Hierarchies.
After the namespace is created, you can still bind/unbind the workspace.
Click OK to complete the creation of the namespace. On the right side of the namespace list, click ┇ to select update, bind/unbind workspace, quota management, delete, and more from the pop-up menu.
"},{"location":"en/admin/kpanda/namespaces/createns.html#create-from-yaml","title":"Create from YAML","text":"
On the Clusters page, click the name of the target cluster.
Click Namespace in the left navigation bar, then click the YAML Create button on the right side of the page.
Enter or paste the prepared YAML content, or directly import an existing YAML file locally.
After entering the YAML content, click Download to save the YAML file locally.
Finally, click OK in the lower right corner of the pop-up box.
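A minimal YAML sketch for creating a namespace (the name and label are illustrative, not values required by the platform):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo-ns            # illustrative namespace name
  labels:
    environment: test      # optional label, illustrative
```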
Namespace exclusive nodes in a Kubernetes cluster allow a specific namespace to have exclusive access to the CPU, memory, and other resources of one or more nodes through taints and tolerations. Once exclusive nodes are configured for a specific namespace, applications and services from other namespaces cannot run on those nodes. Using exclusive nodes allows important applications to have exclusive access to certain computing resources, achieving physical isolation from other applications.
Note
Applications and services running on a node before it is set to be an exclusive node will not be affected and will continue to run normally on that node. Only when these Pods are deleted or rebuilt will they be scheduled to other non-exclusive nodes.
Check whether the kube-apiserver of the current cluster has enabled the PodNodeSelector and PodTolerationRestriction admission controllers.
The use of namespace exclusive nodes requires users to enable the PodNodeSelector and PodTolerationRestriction admission controllers on the kube-apiserver. For more information about admission controllers, refer to Kubernetes Admission Controllers Reference.
You can go to any Master node in the current cluster to check whether these two features are enabled in the kube-apiserver.yaml file, or you can execute the following command on the Master node for a quick check:
```bash
[root@g-master1 ~]# cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep enable-admission-plugins

# The expected output is as follows:
- --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction
```
"},{"location":"en/admin/kpanda/namespaces/exclusive.html#enable-namespace-exclusive-nodes-on-global-cluster","title":"Enable Namespace Exclusive Nodes on Global Cluster","text":"
Since the Global cluster runs platform basic components such as kpanda, ghippo, and insight, enabling namespace exclusive nodes on Global may cause system components to not be scheduled to the exclusive nodes when they restart, affecting the overall high availability of the system. Therefore, we generally do not recommend users to enable the namespace exclusive node feature on the Global cluster.
If you do need to enable namespace exclusive nodes on the Global cluster, please follow the steps below:
Enable the PodNodeSelector and PodTolerationRestriction admission controllers for the kube-apiserver of the Global cluster
Note
If the cluster has already enabled the above two admission controllers, please skip this step and go directly to configure system component tolerations.
Go to any Master node in the current cluster to modify the kube-apiserver.yaml configuration file, or execute the following command on the Master node for configuration:
```yaml
[root@g-master1 ~]# vi /etc/kubernetes/manifests/kube-apiserver.yaml

# The file content is similar to the following:
apiVersion: v1
kind: Pod
metadata:
  ......
spec:
  containers:
  - command:
    - kube-apiserver
    ......
    - --default-not-ready-toleration-seconds=300
    - --default-unreachable-toleration-seconds=300
    - --enable-admission-plugins=NodeRestriction   # List of enabled admission controllers
    - --enable-aggregator-routing=False
    - --enable-bootstrap-token-auth=true
    - --endpoint-reconciler-type=lease
    - --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt
    ......
```
Find the --enable-admission-plugins parameter and add the PodNodeSelector and PodTolerationRestriction admission controllers (separated by commas). Refer to the following:
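The reference snippet is not shown in the text; based on the expected output earlier on this page, the modified parameter would look like this:

```yaml
- --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction
```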
Add toleration annotations to the namespace where the platform components are located
After enabling the admission controllers, you need to add toleration annotations to the namespace where the platform components are located to ensure the high availability of the platform components.
The system component namespaces for AI platform are as follows:
Check whether the above namespaces exist in the current cluster, and run the following command to add the annotation scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]' to each of them.
Please make sure to replace <namespace-name> with the name of the platform namespace you want to add the annotation to.
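The command referenced above is not included in the text; a sketch using kubectl (replace <namespace-name> as noted) might look like this:

```bash
# Add the default-tolerations annotation to a platform namespace
kubectl annotate ns <namespace-name> \
  scheduler.alpha.kubernetes.io/defaultTolerations='[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]'
```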
Use the interface to set exclusive nodes for the namespace
After confirming that the PodNodeSelector and PodTolerationRestriction admission controllers on the cluster API server have been enabled, please follow the steps below to use the AI platform UI management interface to set exclusive nodes for the namespace.
Click the cluster name in the cluster list page, then click Namespace in the left navigation bar.
Click the namespace name, then click the Exclusive Node tab, and click Add Node on the bottom right.
Select which nodes you want to be exclusive to this namespace on the left side of the page. On the right side, you can clear or delete a selected node. Finally, click OK at the bottom.
You can view the current exclusive nodes for this namespace in the list. You can choose to Stop Exclusivity on the right side of the node.
After cancelling exclusivity, Pods from other namespaces can also be scheduled to this node.
"},{"location":"en/admin/kpanda/namespaces/exclusive.html#enable-namespace-exclusive-nodes-on-non-global-clusters","title":"Enable Namespace Exclusive Nodes on Non-Global Clusters","text":"
To enable namespace exclusive nodes on non-Global clusters, please follow the steps below:
Enable the PodNodeSelector and PodTolerationRestriction admission controllers for the kube-apiserver of the current cluster
Note
If the cluster has already enabled the above two admission controllers, please skip this step and go directly to using the interface to set exclusive nodes for the namespace.
Go to any Master node in the current cluster to modify the kube-apiserver.yaml configuration file, or execute the following command on the Master node for configuration:
```yaml
[root@g-master1 ~]# vi /etc/kubernetes/manifests/kube-apiserver.yaml

# The file content is similar to the following:
apiVersion: v1
kind: Pod
metadata:
  ......
spec:
  containers:
  - command:
    - kube-apiserver
    ......
    - --default-not-ready-toleration-seconds=300
    - --default-unreachable-toleration-seconds=300
    - --enable-admission-plugins=NodeRestriction   # List of enabled admission controllers
    - --enable-aggregator-routing=False
    - --enable-bootstrap-token-auth=true
    - --endpoint-reconciler-type=lease
    - --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt
    ......
```
Find the --enable-admission-plugins parameter and add the PodNodeSelector and PodTolerationRestriction admission controllers (separated by commas). Refer to the following:
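As with the Global cluster steps, the reference snippet is not reproduced here; the modified parameter would look like this:

```yaml
- --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction
```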
Use the interface to set exclusive nodes for the namespace
After confirming that the PodNodeSelector and PodTolerationRestriction admission controllers on the cluster API server have been enabled, please follow the steps below to use the AI platform UI management interface to set exclusive nodes for the namespace.
Click the cluster name in the cluster list page, then click Namespace in the left navigation bar.
Click the namespace name, then click the Exclusive Node tab, and click Add Node on the bottom right.
Select which nodes you want to be exclusive to this namespace on the left side of the page. On the right side, you can clear or delete a selected node. Finally, click OK at the bottom.
You can view the current exclusive nodes for this namespace in the list. You can choose to Stop Exclusivity on the right side of the node.
After cancelling exclusivity, Pods from other namespaces can also be scheduled to this node.
Add toleration annotations to the namespace where the components that need high availability are located (optional)
Run the following command to add the annotation scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]' to the namespace where the components that require high availability are located.
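The command is not reproduced in the text; a kubectl sketch, using a hypothetical <ha-component-namespace> placeholder for the target namespace, might look like this:

```bash
# Add the default-tolerations annotation to the namespace of the HA components
kubectl annotate ns <ha-component-namespace> \
  scheduler.alpha.kubernetes.io/defaultTolerations='[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]'
```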
Pod security policies in a Kubernetes cluster allow you to control the security-related behavior of Pods by configuring different levels and modes for specific namespaces; only Pods that meet certain conditions will be accepted by the system. Three levels and three modes are defined, allowing users to choose the most suitable scheme to set restriction policies according to their needs.
Note
Only one security policy can be configured for one security mode. Please be careful when configuring the enforce security mode for a namespace, as violations will prevent Pods from being created.
This section will introduce how to configure Pod security policies for namespaces through the container management interface.
The container management module has integrated a Kubernetes cluster or created a Kubernetes cluster. The cluster version needs to be v1.22 or above, and you should be able to access the cluster's UI interface.
A namespace has been created, a user has been created, and the user has been granted NS Admin or higher permissions. For details, refer to Namespace Authorization.
"},{"location":"en/admin/kpanda/namespaces/podsecurity.html#configure-pod-security-policies-for-namespace","title":"Configure Pod Security Policies for Namespace","text":"
Select the namespace for which you want to configure Pod security policies and go to the details page. Click Configure Policy on the Pod Security Policy page to go to the configuration page.
Click Add Policy on the configuration page, and a policy will appear, including security level and security mode. The following is a detailed introduction to the security level and security policy.
| Security Level | Description |
| --- | --- |
| Privileged | An unrestricted policy that provides the widest possible range of permissions. This policy allows known privilege escalations. |
| Baseline | A minimally restrictive policy that prohibits known privilege escalations. Allows the default (minimally specified) Pod configuration. |
| Restricted | A highly restrictive policy that follows current best practices for hardening Pods. |

| Security Mode | Description |
| --- | --- |
| Audit | Violations of the specified policy add new audit events to the audit log, and the Pod can still be created. |
| Warn | Violations of the specified policy return user-visible warnings, and the Pod can still be created. |
| Enforce | Violations of the specified policy prevent the Pod from being created. |
Different security levels correspond to different check items. If you are unsure how to configure your namespace, click Policy ConfigMap Explanation in the top right corner of the page to view detailed information.
Click Confirm. If the creation is successful, the security policy you configured will appear on the page.
Click ┇ to edit or delete the security policy you configured.
"},{"location":"en/admin/kpanda/network/create-ingress.html","title":"Create an Ingress","text":"
In a Kubernetes cluster, an Ingress exposes HTTP and HTTPS routes from outside the cluster to Services within the cluster. Traffic routing is controlled by rules defined on the Ingress resource. Here's an example of a simple Ingress that sends all traffic to the same Service:
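The example manifest itself is not reproduced in the text; a minimal sketch, adapted from the standard Kubernetes Ingress example (the names, class, path, and port are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx-example   # illustrative IngressClass
  rules:
  - http:
      paths:
      - path: /testpath
        pathType: Prefix
        backend:
          service:
            name: test              # all traffic goes to this Service
            port:
              number: 80
```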
Ingress is an API object that manages external access to services in the cluster, and the typical access method is HTTP. Ingress can provide load balancing, SSL termination, and name-based virtual hosting.
The container management module has integrated or created a Kubernetes cluster, and the cluster's UI can be accessed.
A namespace and a user have been created, and the user has been granted the NS Editor role. For details, refer to Namespace Authorization.
An Ingress instance has been created, an application workload has been deployed, and the proper Service has been created.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
After successfully logging in as the NS Editor user, click Clusters in the upper left corner to enter the Clusters page. In the list of clusters, click a cluster name.
In the left navigation bar, click Container Network -> Ingress to enter the service list, and click the Create Ingress button in the upper right corner.
Note
It is also possible to Create from YAML .
Open the Create Ingress page to configure it. There are two protocol types to choose from; refer to the following two parameter tables for configuration.
"},{"location":"en/admin/kpanda/network/create-ingress.html#create-http-protocol-ingress","title":"Create HTTP protocol ingress","text":"Parameter Description Example value Ingress name [Type] Required[Meaning] Enter the name of the new ingress. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter, lowercase English letters or numbers. Ing-01 Namespace [Type] Required[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. default Protocol [Type] Required [Meaning] Refers to the protocol that authorizes inbound access to the cluster service, and supports HTTP (no identity authentication required) or HTTPS (identity authentication needs to be configured) protocol. Here select the ingress of HTTP protocol. HTTP Domain Name [Type] Required [Meaning] Use the domain name to provide external access services. The default is the domain name of the cluster testing.daocloud.io LB Type [Type] Required [Meaning] The usage range of the Ingress instance. Scope of use of Ingress Platform-level load balancer : In the same cluster, share the same Ingress instance, where all Pods can receive requests distributed by the load balancer. Tenant-level load balancer : Tenant load balancer, the Ingress instance belongs exclusively to the current namespace, or belongs to a certain workspace, and the set workspace includes the current namespace, and all Pods can receive it Requests distributed by this load balancer. Platform Level Load Balancer Ingress Class [Type] Optional[Meaning] Select the proper Ingress instance, and import traffic to the specified Ingress instance after selection. When it is None, the default DefaultClass is used. Please set the DefaultClass when creating an Ingress instance. For more information, refer to Ingress Class< br /> Ngnix Session persistence [Type] Optional[Meaning] Session persistence is divided into three types: L4 source address hash , Cookie Key , L7 Header Name . Keep L4 Source Address Hash : : When enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\(binary_remote_addr\"<br /> __Cookie Key__ : When enabled, the connection from a specific client will be passed to the same Pod. After enabled, the following parameters are added to the Annotation by default:<br /> nginx.ingress.kubernetes.io/affinity: \"cookie\"<br /> nginx.ingress.kubernetes .io/affinity-mode: persistent<br /> __L7 Header Name__ : After enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\)http_x_forwarded_for\" Close Path Rewriting [Type] Optional [Meaning] rewrite-target , in some cases, the URL exposed by the backend service is different from the path specified in the Ingress rule. If no URL rewriting configuration is performed, There will be an error when accessing. close Redirect [Type] Optional[Meaning] permanent-redirect , permanent redirection, after entering the rewriting path, the access path will be redirected to the set address. close Traffic Distribution [Type] Optional[Meaning] After enabled and set, traffic distribution will be performed according to the set conditions. 
Based on weight : After setting the weight, add the following Annotation to the created Ingress: nginx.ingress.kubernetes.io/canary-weight: \"10\" Based on Cookie : set After the cookie rules, the traffic will be distributed according to the set cookie conditions Based on Header : After setting the header rules, the traffic will be distributed according to the set header conditions Close Labels [Type] Optional [Meaning] Add a label for the ingress - Annotations [Type] Optional [Meaning] Add annotation for ingress -"},{"location":"en/admin/kpanda/network/create-ingress.html#create-https-protocol-ingress","title":"Create HTTPS protocol ingress","text":"Parameter Description Example value Ingress name [Type] Required[Meaning] Enter the name of the new ingress. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter, lowercase English letters or numbers. Ing-01 Namespace [Type] Required[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. default Protocol [Type] Required [Meaning] Refers to the protocol that authorizes inbound access to the cluster service, and supports HTTP (no identity authentication required) or HTTPS (identity authentication needs to be configured) protocol. Here select the ingress of HTTPS protocol. HTTPS Domain Name [Type] Required [Meaning] Use the domain name to provide external access services. The default is the domain name of the cluster testing.daocloud.io Secret [Type] Required [Meaning] Https TLS certificate, Create Secret. Forwarding policy [Type] Optional[Meaning] Specify the access policy of Ingress. Path: Specifies the URL path for service access, the default is the root path/directoryTarget service: Service name for ingressTarget service port: Port exposed by the service LB Type [Type] Required [Meaning] The usage range of the Ingress instance. Platform-level load balancer : In the same cluster, the same Ingress instance is shared, and all Pods can receive requests distributed by the load balancer. Tenant-level load balancer : Tenant load balancer, the Ingress instance belongs exclusively to the current namespace or to a certain workspace. This workspace contains the current namespace, and all Pods can receive the workload from this Balanced distribution of requests. Platform Level Load Balancer Ingress Class [Type] Optional[Meaning] Select the proper Ingress instance, and import traffic to the specified Ingress instance after selection. When it is None, the default DefaultClass is used. Please set the DefaultClass when creating an Ingress instance. For more information, refer to Ingress Class< br /> None Session persistence [Type] Optional[Meaning] Session persistence is divided into three types: L4 source address hash , Cookie Key , L7 Header Name . Keep L4 Source Address Hash : : When enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\(binary_remote_addr\"<br /> __Cookie Key__ : When enabled, the connection from a specific client will be passed to the same Pod. 
After enabled, the following parameters are added to the Annotation by default:<br /> nginx.ingress.kubernetes.io/affinity: \"cookie\"<br /> nginx.ingress.kubernetes .io/affinity-mode: persistent<br /> __L7 Header Name__ : After enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\)http_x_forwarded_for\" Close Labels [Type] Optional [Meaning] Add a label for the ingress Annotations [Type] Optional[Meaning] Add annotation for ingress"},{"location":"en/admin/kpanda/network/create-ingress.html#create-ingress-successfully","title":"Create ingress successfully","text":"
After configuring all the parameters, click the OK button to return to the ingress list automatically. On the right side of the list, click ┇ to modify or delete the selected ingress.
"},{"location":"en/admin/kpanda/network/create-services.html","title":"Create a Service","text":"
In a Kubernetes cluster, each Pod has an internal independent IP address, but Pods in the workload may be created and deleted at any time, and directly using the Pod IP address cannot provide external services.
This requires creating a service through which you get a fixed IP address, decoupling the front-end and back-end of the workload, and allowing external users to access the service. At the same time, the service also provides the Load Balancer feature, enabling users to access workloads from the public network.
The container management module has integrated or created a Kubernetes cluster, and the cluster's UI can be accessed.
A namespace and a user have been created, and the user has been granted the NS Editor role. For details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
After successfully logging in as the NS Editor user, click Clusters in the upper left corner to enter the Clusters page. In the list of clusters, click a cluster name.
In the left navigation bar, click Container Network -> Service to enter the service list, and click the Create Service button in the upper right corner.
Tip

It is also possible to create a service via YAML.
Open the Create Service page, select an access type, and refer to the following three parameter tables for configuration.
Click Intra-Cluster Access (ClusterIP) , which refers to exposing services through the internal IP of the cluster. The services selected for this option can only be accessed within the cluster. This is the default service type. Refer to the configuration parameters in the table below.
| Parameter | Description | Example |
| --- | --- | --- |
| Access Type | Required. The method of Pod service discovery; select Intra-Cluster Access (ClusterIP) here. | ClusterIP |
| Service Name | Required. The name of the new service. 4 to 63 characters; lowercase letters, numbers, and dashes (-); must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | Required. The namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. | default |
| Label Selector | Required. Add labels; the Service selects Pods according to these labels; click "Add" after filling them in. You can also reference the labels of an existing workload: click Reference Workload Label, select the workload in the pop-up window, and the system uses the selected workload's labels as the selector by default. | app:job01 |
| Port Configuration | Required. To add a protocol port to the service, first select the port protocol type; TCP and UDP are currently supported. Port Name: the name of the custom port. Service port (port): the access port the Pod provides externally. Container port (targetport): the port the workload actually listens on, used to expose the service within the cluster. | - |
| Session Persistence | Optional. When enabled, requests from the same client are forwarded to the same Pod. | Enabled |
| Maximum Session Hold Time | Optional. After session persistence is enabled, the maximum hold time; 30 seconds by default. | 30 seconds |
| Annotations | Optional. Add annotations to the service. | - |
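A minimal YAML sketch of a ClusterIP Service matching the parameters above (the names, labels, and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: svc-01
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: job01           # label selector, illustrative
  ports:
  - name: http           # port name
    protocol: TCP
    port: 80             # service port
    targetPort: 8080     # container port
```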
Create NodePort Service

Click NodePort, which means exposing the service via an IP and a static port (NodePort) on each node. A NodePort service is routed to the automatically created ClusterIP service. You can access a NodePort service from outside the cluster by requesting <NodeIP>:<NodePort>. Refer to the configuration parameters in the table below.

| Parameter | Description | Example |
| --- | --- | --- |
| Access Type | Required. The method of Pod service discovery; select Node Access (NodePort) here. | NodePort |
| Service Name | Required. The name of the new service. 4 to 63 characters; lowercase letters, numbers, and dashes (-); must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | Required. The namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. | default |
| Label Selector | Required. Add labels; the Service selects Pods according to these labels; click "Add" after filling them in. You can also reference the labels of an existing workload: click Reference Workload Label, select the workload in the pop-up window, and the system uses the selected workload's labels as the selector by default. | - |
| Port Configuration | Required. To add a protocol port to the service, first select the port protocol type; TCP and UDP are currently supported. Port Name: the name of the custom port. Service port (port): the access port the Pod provides externally; by default it is set to the same value as the container port for convenience. Container port (targetport): the port the workload actually listens on. Node port (nodeport): the port on the node that receives traffic forwarded from the ClusterIP; it serves as the entry point for external traffic. | - |
| Session Persistence | Optional. When enabled, requests from the same client are forwarded to the same Pod. When enabled, .spec.sessionAffinity of the Service is ClientIP; for details refer to Session Affinity for Service. | Enabled |
| Maximum Session Hold Time | Optional. After session persistence is enabled, the maximum hold time; .spec.sessionAffinityConfig.clientIP.timeoutSeconds defaults to 30 seconds. | 30 seconds |
| Annotations | Optional. Add annotations to the service. | - |

Create LoadBalancer Service
Click Load Balancer , which refers to using the cloud provider's load balancer to expose services to the outside. External load balancers can route traffic to automatically created NodePort services and ClusterIP services. Refer to the configuration parameters in the table below.
| Parameter | Description | Example |
| --- | --- | --- |
| Access Type | Required. The method of Pod service discovery; select Load Balancer (LoadBalancer) here. | LoadBalancer |
| Service Name | Required. The name of the new service. 4 to 63 characters; lowercase letters, numbers, and dashes (-); must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | Required. The namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. | default |
| External Traffic Policy | Required. Set the external traffic policy. Cluster: traffic can be forwarded to Pods on all nodes in the cluster. Local: traffic is only sent to Pods on the node that receives it. | - |
| Label Selector | Required. Add labels; the Service selects Pods according to these labels; click "Add" after filling them in. You can also reference the labels of an existing workload: click Reference Workload Label, select the workload in the pop-up window, and the system uses the selected workload's labels as the selector by default. | - |
| Load Balancing Type | Required. The type of load balancer used; MetalLB and others are currently supported. | MetalLB |
| IP Pool | Required. When the selected load balancing type is MetalLB, the LoadBalancer Service allocates IP addresses from this pool by default and announces all IP addresses in the pool through ARP. For details, refer to Install MetalLB. | - |
| Load Balancing Address | Required. 1. If you are using a public cloud CloudProvider, fill in the load balancing address provided by the cloud provider. 2. If the load balancing type is MetalLB, the IP is obtained from the IP pool above by default; if left empty, it is obtained automatically. | - |
| Port Configuration | Required. To add a protocol port to the service, first select the port protocol type; TCP and UDP are currently supported. Port Name: the name of the custom port. Service port (port): the access port the Pod provides externally; by default it is set to the same value as the container port for convenience. Container port (targetport): the port the workload actually listens on. Node port (nodeport): the port on the node that receives traffic forwarded from the ClusterIP; it serves as the entry point for external traffic. | - |
| Annotations | Optional. Add annotations to the service. | - |

Complete Service Creation
After configuring all parameters, click the OK button to return to the service list automatically. On the right side of the list, click ┇ to modify or delete the selected service.
Network policies in Kubernetes allow you to control network traffic at the IP address or port level (OSI layer 3 or layer 4). The container management module currently supports creating network policies based on Pods or namespaces, using label selectors to specify which traffic can enter or leave Pods with specific labels.
For more details on network policies, refer to the official Kubernetes documentation on Network Policies.
Currently, there are two methods available for creating network policies: YAML and form-based creation. Each method has its advantages and disadvantages, catering to different user needs.
YAML creation requires fewer steps and is more efficient, but it has a higher learning curve as it requires familiarity with configuring network policy YAML files.
Form-based creation is more intuitive and straightforward. Users can simply fill in the proper values based on the prompts. However, this method involves more steps.
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies -> Create with YAML in the left navigation bar.
In the pop-up dialog, enter or paste the pre-prepared YAML file, then click OK at the bottom of the dialog.
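A minimal NetworkPolicy sketch that could be pasted here (the namespace, labels, and port are illustrative): it allows ingress on TCP port 80 only from Pods labeled role: frontend.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web            # the policy applies to Pods with this label
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend  # only traffic from these Pods is allowed
    ports:
    - protocol: TCP
      port: 80
```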
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies -> Create Policy in the left navigation bar.
Fill in the basic information.
The name and namespace cannot be changed after creation.
Fill in the policy configuration.
The policy configuration includes ingress and egress policies. To establish a successful connection from a source Pod to a target Pod, both the egress policy of the source Pod and the ingress policy of the target Pod need to allow the connection. If either side does not allow the connection, the connection will fail.
Ingress Policy: Click ➕ to begin configuring the policy. Multiple policies can be configured. The effects of multiple network policies are cumulative. Only when all network policies are satisfied simultaneously can a connection be successfully established.
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies . Click the name of the network policy.
View the basic configuration, associated instances, ingress policies, and egress policies of the policy.
Info
Under the \"Associated Instances\" tab, you can view instance monitoring, logs, container lists, YAML files, events, and more.
There are two ways to update network policies. You can either update them through the form or by using a YAML file.
On the network policy list page, find the policy you want to update, and choose Update in the action column on the right to update it via the form. Choose Edit YAML to update it using a YAML file.
Click the name of the network policy, then choose Update in the top right corner of the policy details page to update it via the form. Choose Edit YAML to update it using a YAML file.
There are two ways to delete network policies. You can delete network policies either through the form or by using a YAML file.
On the network policy list page, find the policy you want to delete, and choose Delete in the action column on the right to delete it via the form. Choose Edit YAML to delete it using a YAML file.
Click the name of the network policy, then choose Delete in the top right corner of the policy details page to delete it via the form. Choose Edit YAML to delete it using a YAML file.
As the number of business applications continues to grow, the resources of the cluster become increasingly tight. At this point, you can expand the cluster nodes based on kubean. After the expansion, applications can run on the newly added nodes, alleviating resource pressure.
Only clusters created through the container management module support node scaling; clusters integrated from outside do not support this operation. This article mainly introduces how to expand same-architecture worker nodes in a worker cluster. If you need to add control nodes or heterogeneous worker nodes to the cluster, refer to: Expanding the Control Nodes of a Worker Cluster, Adding Heterogeneous Nodes to a Worker Cluster, and Expanding the Worker Nodes of the Global Service Cluster.
On the Clusters page, click the name of the target cluster.
If the Cluster Type contains the label Integrated Cluster, the cluster does not support node scaling.
Click Nodes in the left navigation bar, and then click Integrate Node in the upper right corner of the page.
Enter the host name and node IP and click OK.
Click ➕ Add Worker Node to continue integrating more nodes.
Note
Integrating a node takes about 20 minutes; please be patient.
When the peak business period is over, you can reduce the cluster size and remove redundant nodes to save resource costs, that is, scale the cluster down. After a node is removed, applications can no longer run on it.
The current operating user has the Cluster Admin role authorization.
Only clusters created through the container management module support node scaling; clusters integrated from outside do not support this operation.
Before removing a node, you need to cordon it (pause scheduling) and evict the applications on it to other nodes.
Eviction method: log in to a control node and use the kubectl drain command to evict all Pods on the node; safe eviction allows the containers in the Pods to terminate gracefully (see the sketch after these notes).
When scaling down, nodes can only be removed one by one, not in batches.
If you need to remove control nodes, ensure that the final number of control nodes is odd.
The first control node cannot be taken offline during scale-down. If this operation is necessary, contact the after-sales engineer.
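A sketch of the eviction commands mentioned in the notes above (the node name is illustrative):

```bash
# Cordon the node so no new Pods are scheduled onto it
kubectl cordon worker-node-01

# Safely evict all Pods (DaemonSet Pods are skipped; emptyDir data is removed)
kubectl drain worker-node-01 --ignore-daemonsets --delete-emptydir-data
```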
On the Clusters page, click the name of the target cluster.
If the Cluster Type has the label Integrated Cluster, the cluster does not support node scaling.
Click Nodes on the left navigation bar, find the node to be removed, click ┇ and select Remove.
Enter the node name, and click Delete to confirm.
"},{"location":"en/admin/kpanda/nodes/labels-annotations.html","title":"Labels and Annotations","text":"
Labels are identifying key-value pairs added to Kubernetes objects such as Pods, nodes, and clusters, which can be combined with label selectors to find and filter Kubernetes objects that meet certain conditions. Each key must be unique for a given object.
Annotations, like labels, are key/value pairs, but they do not provide identification or filtering capabilities. Annotations can be used to attach arbitrary metadata to nodes. Annotation keys usually use the format prefix(optional)/name(required), for example nfd.node.kubernetes.io/extended-resources. If the prefix is omitted, the annotation key is considered private to the user.
For more information about labels and annotations, refer to the official Kubernetes documentation on Labels and Selectors or Annotations.
The steps to add or delete labels and annotations are as follows:
On the Clusters page, click the name of the target cluster.
Click Nodes on the left navigation bar, click the ┇ operation icon on the right side of the node, and click Edit Labels or Edit Annotations.
Click ➕ Add to add labels or annotations, click X to delete labels or annotations, and finally click OK.
"},{"location":"en/admin/kpanda/nodes/node-authentication.html","title":"Node Authentication","text":""},{"location":"en/admin/kpanda/nodes/node-authentication.html#authenticate-nodes-using-ssh-keys","title":"Authenticate Nodes Using SSH Keys","text":"
If you choose to authenticate the nodes of the cluster-to-be-created using SSH keys, you need to configure the public and private keys according to the following instructions.
Run the following command on any node within the management cluster of the cluster-to-be-created to generate the public and private keys.
```bash
cd /root/.ssh
ssh-keygen -t rsa
```
Run the ls command to check if the keys have been successfully created in the management cluster. The correct output should be as follows:
```bash
ls
id_rsa  id_rsa.pub  known_hosts
```
The file named id_rsa is the private key, and the file named id_rsa.pub is the public key.
Run the following command to load the public key file id_rsa.pub onto all the nodes of the cluster-to-be-created.
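The command itself is not reproduced in the text; a sketch using ssh-copy-id (the user account and node IP are illustrative) might look like this:

```bash
# Copy the public key to one node of the cluster-to-be-created
ssh-copy-id -i /root/.ssh/id_rsa.pub root@10.0.0.11
```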
Replace the user account and node IP in the above command with the username and IP of the nodes in the cluster-to-be-created. The same operation needs to be performed on every node in the cluster-to-be-created.
Run the following command to view the private key file id_rsa created in step 1.
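For example, assuming the default path from step 1:

```bash
cat /root/.ssh/id_rsa
```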
Copy the content of the private key and paste it into the interface's key input field.
"},{"location":"en/admin/kpanda/nodes/node-check.html","title":"Create a cluster node availability check","text":"
When creating a cluster or adding nodes to an existing cluster, refer to the table below to check the node configuration to avoid cluster creation or expansion failure due to wrong node configuration.
| Check Item | Description |
| --- | --- |
| OS | Refer to Supported Architectures and Operating Systems |
| SELinux | Off |
| Firewall | Off |
| Architecture Consistency | Consistent CPU architecture between nodes (such as ARM or x86) |
| Host Time | The time difference between all hosts is within 10 seconds. |
| Network Connectivity | The node and its SSH port can be accessed normally by the platform. |
| CPU | Available CPU resources are greater than 4 cores |
| Memory | Available memory resources are greater than 8 GB |

Supported Architectures and Operating Systems

| Architecture | Operating System | Remarks |
| --- | --- | --- |
| ARM | Kylin Linux Advanced Server release V10 (Sword) SP2 | Recommended |
| ARM | UOS Linux | |
| ARM | openEuler | |
| x86 | CentOS 7.x | Recommended |
| x86 | Redhat 7.x | Recommended |
| x86 | Redhat 8.x | Recommended |
| x86 | Flatcar Container Linux by Kinvolk | |
| x86 | Debian Bullseye, Buster, Jessie, Stretch | |
| x86 | Ubuntu 16.04, 18.04, 20.04, 22.04 | |
| x86 | Fedora 35, 36 | |
| x86 | Fedora CoreOS | |
| x86 | openSUSE Leap 15.x/Tumbleweed | |
| x86 | Oracle Linux 7, 8, 9 | |
| x86 | Alma Linux 8, 9 | |
| x86 | Rocky Linux 8, 9 | |
| x86 | Amazon Linux 2 | |
| x86 | Kylin Linux Advanced Server release V10 (Sword) - SP2 | Haiguang |
| x86 | UOS Linux | |
| x86 | openEuler | |

Node Details
After accessing or creating a cluster, you can view the information of each node in the cluster, including node status, labels, resource usage, Pod, monitoring information, etc.
On the Clusters page, click the name of the target cluster.
Click Nodes on the left navigation bar to view the node status, role, label, CPU/memory usage, IP address, and creation time.
Click the node name to enter the node details page to view more information, including overview information, pod information, label annotation information, event list, status, etc.
In addition, you can also view the node's YAML file, monitoring information, labels and annotations, etc.
Supports suspending or resuming scheduling of nodes. Pausing scheduling means stopping the scheduling of Pods to the node. Resuming scheduling means that Pods can be scheduled to that node.
On the Clusters page, click the name of the target cluster.
Click Nodes on the left navigation bar, click the ┇ operation icon on the right side of the node, and click the Cordon button to suspend scheduling the node.
Click the ┇ operation icon on the right side of the node, and click the Uncordon button to resume scheduling the node.
The node scheduling status may be delayed due to network conditions. Click the refresh icon on the right side of the search box to refresh the node scheduling status.
A taint allows a node to repel a certain class of Pods, preventing them from being scheduled on that node. One or more taints can be applied to each node, and Pods that cannot tolerate these taints will not be scheduled on that node.
Find the target cluster on the Clusters page, and click the cluster name to enter the Cluster page.
In the left navigation bar, click Nodes, find the node whose taints need to be modified, click the ┇ operation icon on the right, and click the Edit Taints button.
Enter the key and value of the taint in the pop-up box, select the taint effect, and click OK.
Click ➕ Add to add multiple taints to the node, and click X on the right side of a taint effect to delete that taint.
Currently supports three taint effects:
NoExecute: This affects pods that are already running on the node as follows:
Pods that do not tolerate the taint are evicted immediately
Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever
Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time. After that time elapses, the node lifecycle controller evicts the Pods from the node.
NoSchedule: No new Pods will be scheduled on the tainted node unless they have a matching toleration. Pods currently running on the node are not evicted.
PreferNoSchedule: This is a \"preference\" or \"soft\" version of NoSchedule. The control plane will try to avoid placing a Pod that does not tolerate the taint on the node, but it is not guaranteed, so this taint is not recommended to use in a production environment.
For more details about taints, refer to the Kubernetes documentation Taints and Tolerations.
The current cluster has been integrated into container management, and the Global cluster has the kolm component installed (search the Helm templates for kolm).
The current cluster has the olm component installed, version 0.2.4 or higher (search the Helm templates for olm).
Go to Container Management -> Select the current cluster -> Helm Apps -> View the olm component -> Plugin Settings , and find the images needed for the opm, minio, minio bundle, and minio operator in the subsequent steps.
Using the screenshot as an example, the four image addresses are as follows:

# opm image
10.5.14.200/quay.m.daocloud.io/operator-framework/opm:v1.29.0

# minio image
10.5.14.200/quay.m.daocloud.io/minio/minio:RELEASE.2023-03-24T21-41-23Z

# minio bundle image
10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3

# minio operator image
10.5.14.200/quay.m.daocloud.io/minio/operator:v5.0.3
Run the opm command to get the operators included in the offline bundle image.
Replace all image addresses in the minio-operator/manifests/minio-operator.clusterserviceversion.yaml file with the image addresses from the offline container registry.
Before replacement:
After replacement:
Generate a Dockerfile for building the bundle image.
# Set the new catalog image
export OFFLINE_CATALOG_IMG=10.5.14.200/release.daocloud.io/operator-framework/system-operator-index:v0.1.0-offline

$ docker build . -f index.Dockerfile -t ${OFFLINE_CATALOG_IMG}

$ docker push ${OFFLINE_CATALOG_IMG}
Go to Container Management and update the built-in catsrc image for the Helm App olm (enter the catalog image, ${catalog-image} , that was specified when building the catalog image).
After the update is successful, the minio-operator component will appear in the Operator Hub.
"},{"location":"en/admin/kpanda/permissions/cluster-ns-auth.html","title":"Cluster and Namespace Authorization","text":"
Container management implements authorization based on global authority management and global user/group management. If you need to grant a user the highest authority for container management (the ability to create, manage, and delete all clusters), refer to What is Access Control.
After logging in to the platform, click Privilege Management under Container Management on the left menu bar; the page defaults to the Cluster Permissions tab.
Click the Add Authorization button.
On the Add Cluster Permission page, select the target cluster, the user/group to be authorized, and click OK .
Currently, the only cluster role supported is Cluster Admin . For details about permissions, refer to Permission Description. If you need to authorize multiple users/groups at the same time, you can click Add User Permissions to add multiple times.
Return to the cluster permission management page, and a message appears on the screen: Cluster permission added successfully .
After the user logs in to the platform, click Permissions under Container Management on the left menu bar, and click the Namespace Permissions tab.
Click the Add Authorization button. On the Add Namespace Permission page, select the target cluster, target namespace, and user/group to be authorized, and click OK .
The currently supported namespace roles are NS Admin, NS Editor, and NS Viewer. For details about permissions, refer to Permission Description. If you need to authorize multiple users/groups at the same time, you can click Add User Permission to add multiple times. Click OK to complete the permission authorization.
Return to the namespace permission management page, and a message appears on the screen: Namespace permission added successfully .
Tip
If you need to delete or edit permissions later, you can click ┇ on the right side of the list and select Edit or Delete .
"},{"location":"en/admin/kpanda/permissions/custom-kpanda-role.html","title":"Adding RBAC Rules to System Roles","text":"
In the past, the RBAC rules for those system roles in container management were pre-defined and could not be modified by users. To support more flexible permission settings and to meet the customized needs for system roles, now you can modify RBAC rules for system roles such as cluster admin, ns admin, ns editor, ns viewer.
The following example demonstrates how to add a new rule to the ns-viewer role, granting the authority to delete workload deployments. Similar operations can be performed for other rules.
Before adding RBAC rules to system roles, the following prerequisites must be met:
Container management v0.27.0 and above.
A Kubernetes cluster has been integrated or created, and the cluster's UI interface is accessible.
A namespace and a user account have been created, and the user has been granted the NS Viewer role. For details, refer to namespace authorization.
Note
RBAC rules only need to be added in the Global Cluster, and the Kpanda controller will synchronize those added rules to all integrated subclusters. Synchronization may take some time to complete.
RBAC rules can only be added in the Global Cluster. RBAC rules added in subclusters will be overridden by the system role permissions of the Global Cluster.
Only ClusterRoles with a fixed label are supported for adding rules. Replacing or deleting rules is not supported, nor is adding rules by using a Role. The correspondence between built-in roles and the ClusterRole labels created by users is as follows.
Create a deployment by a user with admin or cluster admin permissions.
Grant a user the ns-viewer role to provide them with the ns-view permission.
Switch the login user to ns-viewer, open the console to get the token for the ns-viewer user, and use curl to request and delete the nginx deployment mentioned above. However, a prompt appears as below, indicating the user doesn't have permission to delete it.
[root@master-01 ~]# curl -k -X DELETE 'https://${URL}/apis/kpanda.io/v1alpha1/clusters/cluster-member/namespaces/default/deployments/nginx' -H 'authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJOU044MG9BclBRMzUwZ2VVU2ZyNy1xMEREVWY4MmEtZmJqR05uRE1sd1lFIn0.eyJleHAiOjE3MTU3NjY1NzksImlhdCI6MTcxNTY4MDE3OSwiYXV0aF90aW1lIjoxNzE1NjgwMTc3LCJqdGkiOiIxZjI3MzJlNC1jYjFhLTQ4OTktYjBiZC1iN2IxZWY1MzAxNDEiLCJpc3MiOiJodHRwczovLzEwLjYuMjAxLjIwMTozMDE0Ny9hdXRoL3JlYWxtcy9naGlwcG8iLCJhdWQiOiJfX2ludGVybmFsLWdoaXBwbyIsInN1YiI6ImMxZmMxM2ViLTAwZGUtNDFiYS05ZTllLWE5OGU2OGM0MmVmMCIsInR5cCI6IklEIiwiYXpwIjoiX19pbnRlcm5hbC1naGlwcG8iLCJzZXNzaW9uX3N0YXRlIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiYXRfaGFzaCI6IlJhTHoyQjlKQ2FNc1RrbGVMR3V6blEiLCJhY3IiOiIwIiwic2lkIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiZW1haWxfdmVyaWZpZWQiOmZhbHNlLCJncm91cHMiOltdLCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJucy12aWV3ZXIiLCJsb2NhbGUiOiIifQ.As2ipMjfvzvgONAGlc9RnqOd3zMwAj82VXlcqcR74ZK9tAq3Q4ruQ1a6WuIfqiq8Kq4F77ljwwzYUuunfBli2zhU2II8zyxVhLoCEBu4pBVBd_oJyUycXuNa6HfQGnl36E1M7-_QG8b-_T51wFxxVb5b7SEDE1AvIf54NAlAr-rhDmGRdOK1c9CohQcS00ab52MD3IPiFFZ8_Iljnii-RpXKZoTjdcULJVn_uZNk_SzSUK-7MVWmPBK15m6sNktOMSf0pCObKWRqHd15JSe-2aA2PKBo1jBH3tHbOgZyMPdsLI0QdmEnKB5FiiOeMpwn_oHnT6IjT-BZlB18VkW8rA'
{"code":7,"message":"[RBAC] delete resources(deployments: nginx) is forbidden for user(ns-viewer) in cluster(cluster-member)","details":[]}
Create a ClusterRole on the global cluster, as shown in the yaml below.
This field value can be arbitrarily specified, as long as it is not duplicated and complies with the Kubernetes resource naming conventions.
When adding rules to different roles, make sure to apply different labels.
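The referenced YAML is not reproduced here; the following is a minimal sketch of such a ClusterRole. The rule itself (delete on deployments) follows the example above, while the resource name and the label key/value are placeholders that must be replaced with the fixed label from the correspondence table:

```yaml
# A minimal sketch only. The label that maps this ClusterRole to the built-in
# ns-viewer role must match the fixed label from the correspondence table above;
# "rbac.kpanda.io/role-template: ns-viewer" below is a hypothetical placeholder.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # Any non-duplicated name that follows Kubernetes naming conventions
  name: append-ns-viewer-delete-deployment
  labels:
    rbac.kpanda.io/role-template: ns-viewer   # placeholder, check the table above
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["delete"]
```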
Wait for the kpanda controller to add the user-created rule to the built-in ns-viewer role, then check whether the rule added in the previous step is present for ns-viewer.
When curl is used again to request deletion of the aforementioned nginx deployment, the deletion succeeds. This means the rule to delete deployments has been successfully added to ns-viewer.
Container management permissions are based on a multi-dimensional permission management system created by global permission management and Kubernetes RBAC permission management. It supports cluster-level and namespace-level permission control, helping users to conveniently and flexibly set different operation permissions for IAM users and groups (collections of users) under a tenant.
Cluster permissions are authorized based on Kubernetes RBAC's ClusterRoleBinding, allowing users/groups to have cluster-related permissions. The current default cluster role is Cluster Admin (does not have the permission to create or delete clusters).
Namespace permissions are authorized based on Kubernetes RBAC capabilities, allowing different users/groups to have different operation permissions on resources under a namespace (including Kubernetes API permissions). For details, refer to: Kubernetes RBAC. Currently, the default roles for container management are: NS Admin, NS Editor, NS Viewer.
What is the relationship between global permissions and container management permissions?
Answer: Global permissions only provide coarse-grained authorization, such as the ability to create, edit, and delete all clusters. Fine-grained permissions, such as management permissions for a single cluster or the management, editing, and deletion permissions for a single namespace, need to be implemented through Kubernetes RBAC-based container management permissions. Generally, users only need to be authorized in container management.
Currently, only four default roles are supported. Can the RoleBinding and ClusterRoleBinding (Kubernetes fine-grained RBAC) for custom roles also take effect?
Answer: Currently, custom permissions cannot be managed through the graphical interface, but the permission rules created using kubectl can still take effect.
The Suanova AI platform supports elastic scaling of Pod resources based on metrics (Horizontal Pod Autoscaling, HPA). Users can dynamically adjust the number of Pod replicas by setting CPU utilization, memory usage, and custom metrics. For example, after setting an auto scaling policy based on the CPU utilization metric for a workload, when the CPU utilization of the Pods exceeds or falls below the threshold you set, the workload controller will automatically increase or decrease the number of Pod replicas.
This page describes how to configure auto scaling based on built-in metrics and custom metrics for workloads.
Note
HPA is only applicable to Deployment and StatefulSet, and only one HPA can be created per workload.
If you create an HPA policy based on CPU utilization, you must set the configuration limit (Limit) for the workload in advance, otherwise the CPU utilization cannot be calculated.
If built-in metrics and multiple custom metrics are used at the same time, HPA calculates the required number of replicas for each metric separately and takes the largest value (without exceeding the maximum number of replicas configured in the HPA policy) for elastic scaling.
Refer to the following steps to configure a built-in metric auto scaling policy for the workload.
Click Clusters on the left navigation bar to enter the cluster list page. Click a cluster name to enter the Cluster Details page.
On the cluster details page, click Workload in the left navigation bar to enter the workload list, and then click a workload name to enter the Workload Details page.
Click the Auto Scaling tab to view the auto scaling configuration of the current cluster.
After confirming that the cluster has installed the metrics-server plug-in, and the plug-in is running normally, you can click the New Scaling button.
Configure the auto scaling policy parameters as described below.
Policy name: Enter the name of the auto scaling policy. Please note that the name can contain up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as hpa-my-dep.
Namespace: The namespace where the workload resides.
Workload: The workload object that performs auto scaling.
Target CPU Utilization: The CPU utilization of the Pods under the workload, calculated as the actual CPU usage of all Pods under the workload divided by the sum of their CPU requests. When the actual CPU utilization is greater/less than the target value, the system automatically increases/decreases the number of Pod replicas.
Target Memory Usage: The memory usage of the Pods under the workload. When the actual memory usage is greater/less than the target value, the system automatically increases/decreases the number of Pod replicas.
Replica range: the elastic scaling range of the number of Pod replicas. The default interval is 1 - 10.
After completing the parameter configuration, click the OK button to automatically return to the elastic scaling details page. Click ┇ on the right side of the list to edit, delete, and view related events.
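For reference, the form roughly corresponds to a standard autoscaling/v2 HorizontalPodAutoscaler object; the sketch below is illustrative, with placeholder names and thresholds:

```yaml
# Roughly equivalent autoscaling/v2 HPA for the form above; names and values are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-my-dep
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-dep
  minReplicas: 1          # replica range lower bound
  maxReplicas: 10         # replica range upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # target CPU utilization
```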
The Vertical Pod Autoscaler (VPA) calculates the most suitable CPU and memory request values for a Pod by monitoring the Pod's resource requests and usage over a period of time. Using VPA allows cluster resources to be allocated to each Pod more reasonably, improving the overall resource utilization of the cluster and avoiding waste.
The AI platform supports container VPA. Based on this feature, Pod request values can be dynamically adjusted according to container resource usage. The AI platform supports both manual and automatic modification of resource request values, and you can configure them according to actual needs.
This page describes how to configure VPA for deployment.
Warning
Using VPA to modify a Pod resource request will trigger a Pod restart. Due to the limitations of Kubernetes itself, Pods may be scheduled to other nodes after restarting.
Refer to the following steps to configure a VPA policy for the deployment.
Find the current cluster in Clusters , and click the name of the target cluster.
Click Deployments in the left navigation bar, find the deployment that needs to create a VPA, and click the name of the deployment.
Click the Auto Scaling tab to view the auto scaling configuration of the current cluster, and confirm that the relevant plug-ins have been installed and are running normally.
Click the Create Autoscaler button and configure the VPA vertical scaling policy parameters.
Policy name: Enter the name of the vertical scaling policy. Please note that the name can contain up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as vpa-my-dep.
Scaling mode: The method used to modify the CPU and memory request values. Vertical scaling currently supports manual and automatic scaling modes.
Manual scaling: After the vertical scaling policy calculates the recommended resource configuration value, the user needs to manually modify the resource quota of the application.
Auto-scaling: The vertical scaling policy automatically calculates and modifies the resource quota of the application.
Target container: Select the container to be scaled vertically.
After completing the parameter configuration, click the OK button to automatically return to the elastic scaling details page. Click ┇ on the right side of the list to perform edit and delete operations.
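For reference, the form roughly corresponds to a VerticalPodAutoscaler object; the sketch below is illustrative, with placeholder names, and assumes the vpa plugin described later in this document is installed:

```yaml
# Illustrative VerticalPodAutoscaler roughly equivalent to the form configuration.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-my-dep
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-dep
  updatePolicy:
    updateMode: "Auto"       # "Off" only produces recommendations, roughly matching manual mode
  resourcePolicy:
    containerPolicies:
      - containerName: my-container     # the target container to scale vertically
        controlledResources: ["cpu", "memory"]
```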
"},{"location":"en/admin/kpanda/scale/custom-hpa.html","title":"Creating HPA Based on Custom Metrics","text":"
When the built-in CPU and memory metrics do not meet your business needs, you can add custom metrics by configuring a ServiceMonitor and achieve auto-scaling based on them. This article introduces how to configure auto-scaling for workloads based on custom metrics.
Note
HPA is only applicable to Deployment and StatefulSet, and each workload can only create one HPA.
If both built-in metrics and multiple custom metrics are used, HPA will calculate the required number of scaled replicas based on multiple metrics respectively, and take the larger value (but not exceeding the maximum number of replicas configured when setting the HPA policy) for scaling.
Refer to the following steps to configure the auto-scaling policy based on metrics for workloads.
Click Clusters in the left navigation bar to enter the clusters page. Click a cluster name to enter the Cluster Overview page.
On the Cluster Details page, click Workloads in the left navigation bar to enter the workload list, and click a workload name to enter the Workload Details page.
Click the Auto Scaling tab to view the current autoscaling configuration of the cluster.
Confirm that the cluster has installed metrics-server, Insight, and Prometheus-adapter plugins, and that the plugins are running normally, then click the Create AutoScaler button.
Note
If the related plugins are not installed or the plugins are in an abnormal state, you will not be able to see the entry for creating custom metrics auto-scaling on the page.
Policy Name: Enter the name of the auto-scaling policy. Note that the name can be up to 63 characters long, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, e.g., hpa-my-dep.
Namespace: The namespace where the workload is located.
Workload: The workload object that performs auto-scaling.
Resource Type: The type of custom metric being monitored, including Pod and Service types.
Metric: The name of the custom metric created using a ServiceMonitor, or the name of a system-built custom metric.
Data Type: The method used to calculate the metric value, including target value and target average value. When the resource type is Pod, only the target average value can be used.
This case takes a Golang business program as an example. The example program exposes the httpserver_requests_total metric and records HTTP requests. This metric can be used to calculate the QPS value of the business program.
"},{"location":"en/admin/kpanda/scale/custom-hpa.html#deploy-business-program","title":"Deploy Business Program","text":"
"},{"location":"en/admin/kpanda/scale/custom-hpa.html#prometheus-collects-business-monitoring","title":"Prometheus Collects Business Monitoring","text":"
If the insight-agent is installed, Prometheus can be configured by creating a ServiceMonitor CRD object.
Operation steps: In Cluster Details -> Custom Resources, search for "servicemonitors.monitoring.coreos.com", click the name to enter the details. Create the following example CRD in the httpserver namespace via YAML:
If Prometheus is installed via insight, the serviceMonitor must be labeled with operator.insight.io/managed-by: insight. If installed by other means, this label is not required.
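Since the example CRD itself is not reproduced here, the following is a minimal ServiceMonitor sketch for the httpserver example; the Service labels, port name, and scrape interval are assumptions about how the business Service is defined:

```yaml
# Minimal ServiceMonitor sketch; selector, port name, and interval are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: httpserver
  namespace: httpserver
  labels:
    operator.insight.io/managed-by: insight   # only required when Prometheus is installed via insight
spec:
  endpoints:
    - port: http             # name of the Service port that exposes /metrics
      interval: 5s
      path: /metrics
  namespaceSelector:
    matchNames:
      - httpserver
  selector:
    matchLabels:
      app: httpserver
```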
"},{"location":"en/admin/kpanda/scale/custom-hpa.html#configure-metric-rules-in-prometheus-adapter","title":"Configure Metric Rules in Prometheus-adapter","text":"
Steps: In Clusters -> Helm Apps, search for "prometheus-adapter", enter the update page through the action bar, and configure custom metrics in YAML as follows:
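As the exact YAML is not reproduced here, below is a hedged sketch of a prometheus-adapter custom rule (as a Helm values fragment) that exposes httpserver_requests_total as a per-second rate; the metric name comes from the example above, while the label names and the derived metric name are assumptions:

```yaml
# Sketch of a prometheus-adapter "rules.custom" entry (Helm values fragment).
rules:
  custom:
    - seriesQuery: 'httpserver_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "httpserver_requests_total"
        as: "httpserver_requests_qps"          # assumed name of the derived metric
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```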
Following the steps above, find the httpserver application under Deployments and create auto-scaling based on the custom metric.
"},{"location":"en/admin/kpanda/scale/hpa-cronhpa-compatibility-rules.html","title":"Compatibility Rules for HPA and CronHPA","text":"
HPA stands for HorizontalPodAutoscaler, which refers to horizontal pod auto-scaling.
CronHPA stands for Cron HorizontalPodAutoscaler, which refers to scheduled horizontal pod auto-scaling.
"},{"location":"en/admin/kpanda/scale/hpa-cronhpa-compatibility-rules.html#conflict-between-cronhpa-and-hpa","title":"Conflict Between CronHPA and HPA","text":"
Scheduled scaling with CronHPA triggers horizontal pod scaling at specified times. To prevent sudden traffic surges, you may have configured HPA to ensure the normal operation of your application. If both HPA and CronHPA are detected simultaneously, conflicts arise because CronHPA and HPA operate independently without awareness of each other. Consequently, the actions performed last will override those executed first.
By comparing the definition templates of CronHPA and HPA, the following points can be observed:
Both CronHPA and HPA use the scaleTargetRef field to identify the scaling target.
CronHPA schedules the number of replicas to scale based on crontab rules in jobs.
HPA determines scaling based on resource utilization.
Note
If both CronHPA and HPA are set, there will be scenarios where CronHPA and HPA simultaneously operate on a single scaleTargetRef.
"},{"location":"en/admin/kpanda/scale/hpa-cronhpa-compatibility-rules.html#compatibility-solution-for-cronhpa-and-hpa","title":"Compatibility Solution for CronHPA and HPA","text":"
As noted above, the fundamental reason that simultaneous use of CronHPA and HPA results in the later action overriding the earlier one is that the two controllers cannot sense each other. Therefore, the conflict can be resolved by enabling CronHPA to be aware of HPA's current state.
The system will treat HPA as the scaling object for CronHPA, thus achieving scheduled scaling for the Deployment object defined by the HPA.
HPA's definition configures the Deployment in the scaleTargetRef field, and then the Deployment uses its definition to locate the ReplicaSet, which ultimately adjusts the actual number of replicas.
In AI platform, the scaleTargetRef in CronHPA is set to the HPA object, and it uses the HPA object to find the actual scaleTargetRef, allowing CronHPA to be aware of HPA's current state.
CronHPA senses HPA's state by adjusting the HPA itself. CronHPA determines whether scaling is needed and modifies the HPA upper limit by comparing its target number of replicas with the current number of replicas and choosing the larger value. Similarly, CronHPA determines whether to modify the HPA lower limit by comparing its target number of replicas with the configuration in the HPA and choosing the smaller value.
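For illustration, a hedged sketch of a CronHPA object (using the upstream kubernetes-cronhpa-controller CRD) whose scaleTargetRef points at an HPA instead of a Deployment; names and schedules are placeholders:

```yaml
# Illustrative CronHorizontalPodAutoscaler whose target is an HPA object.
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: cronhpa-sample
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: hpa-my-dep
  jobs:
    - name: scale-up-every-morning
      schedule: "0 0 8 * * *"     # second minute hour day month week
      targetSize: 10
```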
The scheduled horizontal pod autoscaling policy (CronHPA) can provide a stable compute resource guarantee for periodically high-concurrency applications, and kubernetes-cronhpa-controller is the key component that implements CronHPA.
This section describes how to install the kubernetes-cronhpa-controller plugin.
Note
To use CronHPA, you must install not only the kubernetes-cronhpa-controller plugin but also the metrics-server plugin.
Refer to the following steps to install the kubernetes-cronhpa-controller plugin for the cluster.
On the Clusters page, find the target cluster where the plugin needs to be installed, click the name of the cluster, then click Workloads -> Deployments on the left, and click the name of the target workload.
On the workload details page, click the Auto Scaling tab, and click Install on the right side of CronHPA .
Read the relevant introduction of the plug-in, select the version and click the Install button. It is recommended to install 1.3.0 or later.
Refer to the following instructions to configure the parameters.
Name: Enter the plugin name. Please note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as kubernetes-cronhpa-controller.
Namespace: Select which namespace the plugin will be installed in, here we take default as an example.
Version: The version of the plugin, here we take the 1.3.0 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be in the ready state before marking the application installation as successful.
Failed to delete: If the plugin installation fails, delete the associated resources that have already been installed. When enabled, Wait will be enabled synchronously by default.
Detailed log: When enabled, a detailed log of the installation process will be recorded.
Note
After enabling Ready Wait and/or Failed to delete , it takes longer for the application to be marked as "running".
Click OK in the lower right corner of the page, and the system will automatically jump to the Helm Apps list page. Wait a few minutes and refresh the page to see the application you just installed.
Warning
If you need to delete the kubernetes-cronhpa-controller plugin, you should go to the Helm Apps list page to delete it completely.
If you delete the plugin under the Auto Scaling tab of the workload, only the workload's copy of the plugin is deleted; the plugin itself remains, and an error will be prompted when the plugin is reinstalled later.
Go back to the Auto Scaling tab under the workload details page, and you can see that the interface displays Plug-in installed . Now it's time to start creating CronHPA policies.
metrics-server is the built-in resource usage metrics collection component of Kubernetes. By configuring HPA policies, you can automatically scale Pod replicas horizontally for workload resources.
This section describes how to install metrics-server .
Please perform the following steps to install the metrics-server plugin for the cluster.
On the Auto Scaling page under workload details, click the Install button to enter the metrics-server plug-in installation interface.
Read the introduction of the metrics-server plugin, select the version and click the Install button. This page will use the 3.8.2 version as an example to install, and it is recommended that you install 3.8.2 and later versions.
Configure basic parameters on the installation configuration interface.
Name: Enter the plugin name. Please note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as metrics-server-01.
Namespace: Select the namespace for plugin installation, here we take default as an example.
Version: The version of the plugin, here we take 3.8.2 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be ready before marking the application installation as successful.
Failed to delete: If the installation fails, the resources that have already been installed will be removed. When enabled, Ready Wait is enabled synchronously by default.
Verbose log: Turn on the verbose output of the installation process log.
Note
After enabling Ready Wait and/or Failed to delete , it takes longer for the app to be marked as Running .
Advanced parameter configuration
If the cluster network cannot access the k8s.gcr.io repository, try modifying the repository parameter to repository: k8s.m.daocloud.io/metrics-server/metrics-server .
An SSL certificate is also required to install the metrics-server plugin. To bypass certificate verification, add the - --kubelet-insecure-tls parameter under defaultArgs: .
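For reference, a hedged sketch of the corresponding Helm values fragment; the field names follow the upstream metrics-server chart and should be verified against the default YAML shown on this page:

```yaml
# Sketch of metrics-server Helm values overrides; verify against the default YAML on the page.
image:
  repository: k8s.m.daocloud.io/metrics-server/metrics-server
defaultArgs:
  - --cert-dir=/tmp
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  - --kubelet-insecure-tls      # bypasses kubelet certificate verification
```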
Click to view and use the YAML parameters to replace the default YAML
Click the OK button to complete the installation of the metrics-server plug-in, and then the system will automatically jump to the Helm Apps list page. After a few minutes, refresh the page and you will see the newly installed Applications.
Note
When deleting the metrics-server plugin, the plugin can only be completely deleted on the Helm Apps list page. If you only delete metrics-server on the workload page, this only deletes the workload copy of the application, the application itself is still not deleted, and an error will be prompted when you reinstall the plugin later.
The Vertical Pod Autoscaler (VPA) can make cluster resource allocation more reasonable and avoid wasting cluster resources. The vpa plugin is the key component for implementing vertical autoscaling of containers.
This section describes how to install the vpa plugin.
To use VPA policies, you must install not only the vpa plugin but also the metrics-server plugin (see Install metrics-server).
Refer to the following steps to install the vpa plugin for the cluster.
On the Clusters page, find the target cluster where the plugin needs to be installed, click the name of the cluster, then click Workloads -> Deployments on the left, and click the name of the target workload.
On the workload details page, click the Auto Scaling tab, and click Install on the right side of VPA .
Read the relevant introduction of the plug-in, select the version and click the Install button. It is recommended to install 1.5.0 or later.
Review the configuration parameters described below.
Name: Enter the plugin name. Please note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as vpa.
Namespace: Select which namespace the plugin will be installed in, here we take default as an example.
Version: The version of the plugin, here we take the 1.5.0 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be in the ready state before marking the application installation as successful.
Failed to delete: If the plugin installation fails, delete the associated resources that have already been installed. When enabled, Wait will be enabled synchronously by default.
Detailed log: When enabled, a detailed log of the installation process will be recorded.
Note
After enabling Ready Wait and/or Failed to delete , it takes longer for the application to be marked as running.
Click OK in the lower right corner of the page, and the system will automatically jump to the Helm Apps list page. Wait a few minutes and refresh the page to see the application you just installed.
Warning
If you need to delete the vpa plugin, you should go to the Helm Apps list page to delete it completely.
If you delete the plugin under the Auto Scaling tab of the workload, only the workload's copy of the plugin is deleted; the plugin itself remains, and an error will be prompted when the plugin is reinstalled later.
Go back to the Auto Scaling tab under the workload details page; the interface displays Plugin installed . Now you can start creating VPA policies.
Log in to the cluster, click the sidebar Helm Apps -> Helm Charts , enter knative in the search box at the top right, and then press Enter to search.
Click the knative-operator to enter the installation configuration interface. You can view the available versions and the Parameters optional items of Helm values on this interface.
After clicking the install button, you will enter the installation configuration interface.
Enter the name and installation tenant; it is recommended to check Wait and Detailed Logs .
In the settings below, you can tick Serving and enter the installation tenant of the Knative Serving component, which will deploy the Knative Serving component after installation. This component is managed by the Knative Operator.
Knative provides a higher level of abstraction, simplifying and speeding up the process of building, deploying, and managing applications on Kubernetes. It allows developers to focus more on implementing business logic, while leaving most of the infrastructure and operations work to Knative, significantly improving productivity.
Components and their features:
Activator: Queues requests (if a Knative Service has scaled to zero). Calls the autoscaler to bring back services that have scaled down to zero and forwards queued requests. The Activator can also act as a request buffer, handling bursts of traffic.
Autoscaler: Responsible for scaling Knative services based on configuration, metrics, and incoming requests.
Controller: Manages the state of Knative CRs. It monitors multiple objects, manages the lifecycle of dependent resources, and updates resource status.
Queue-Proxy: Sidecar container injected into each Knative Service. Responsible for collecting traffic data and reporting it to the Autoscaler, which then initiates scaling requests based on this data and preset rules.
Webhooks: Knative Serving has several Webhooks responsible for validating and mutating Knative resources.

Ingress Traffic Entry Solutions
Istio: If Istio is already in use, it can be chosen as the traffic entry solution.
Contour: If Contour has been enabled in the cluster, it can be chosen as the traffic entry solution.
Kourier: If neither of the above two Ingress components is present, Knative's Envoy-based Kourier Ingress can be used as the traffic entry solution.

Autoscaler Solutions Comparison
Knative Pod Autoscaler (KPA): a core part of Knative Serving, enabled by default, supports scale to zero, does not support CPU-based autoscaling.
Horizontal Pod Autoscaler (HPA): not a core part of Knative Serving, needs to be enabled after installing Knative Serving, does not support scale to zero, supports CPU-based autoscaling.

CRD
Services (service.serving.knative.dev): Automatically manages the entire lifecycle of workloads, controls the creation of other objects, and ensures applications have Routes, Configurations, and new revisions with each update.
Routes (route.serving.knative.dev): Maps network endpoints to one or more revision versions, supports traffic distribution and version routing.
Configurations (configuration.serving.knative.dev): Maintains the desired state of deployments, provides separation between code and configuration, follows the Twelve-Factor App methodology; modifying a configuration creates a new revision.
Revisions (revision.serving.knative.dev): A snapshot of the workload at each modification point in time; an immutable object that automatically scales based on traffic.

Knative Practices
In this section, we will delve into learning Knative through several practical exercises.
Case 1: When there is low or no traffic, traffic is routed to the activator.
Case 2: When there is high traffic, traffic is routed directly to the Pods only once it exceeds the target-burst-capacity.
Configured as 0, the activator is only in the request path when scaling from zero.
Configured as -1, the activator is always present in the request path.
Configured as >0, the value is the number of additional concurrent requests the system can handle before triggering scaling.
Case 3: When the traffic decreases again, traffic is routed back to the activator once current_demand + target-burst-capacity > (pods * concurrency-target).
That is, when the total number of pending requests plus the number of requests allowed to exceed the target concurrency is greater than the target concurrency per Pod times the number of Pods.
"},{"location":"en/admin/kpanda/scale/knative/playground.html#case-2-based-on-concurrent-elastic-scaling","title":"case 2 - Based on Concurrent Elastic Scaling","text":"
We first apply the following YAML definition under the cluster.
"},{"location":"en/admin/kpanda/scale/knative/playground.html#case-3-based-on-concurrent-elastic-scaling-scale-out-in-advance-to-reach-a-specific-ratio","title":"case 3 - Based on concurrent elastic scaling, scale out in advance to reach a specific ratio.","text":"
We can easily achieve this. For example, we can limit the concurrency to 10 per container and use autoscaling.knative.dev/target-utilization-percentage: 70 to start scaling out the Pods when 70% of the target concurrency is reached.
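For illustration, a hedged Knative Service sketch combining these two annotations; the service name and image are placeholders, not part of the original example:

```yaml
# Sketch: target 10 concurrent requests per container, start scaling out at 70% of that target.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go          # placeholder name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/target-utilization-percentage: "70"
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest   # placeholder image
```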
"},{"location":"en/admin/kpanda/security/index.html","title":"Types of Security Scans","text":"
AI platform Container Management provides three types of security scans:
Compliance Scan: Conducts security scans on cluster nodes based on CIS Benchmark.
Authorization Scan: Checks for security and compliance issues in the Kubernetes cluster, records and verifies authorized access, object changes, events, and other activities related to the Kubernetes API.
Vulnerability Scan: Scans the Kubernetes cluster for potential vulnerabilities and risks, such as unauthorized access, sensitive information leakage, weak authentication, container escape, etc.
The object of compliance scanning is the cluster node. The scan result lists the scan items and results and provides repair suggestions for any failed scan items. For specific security rules used during scanning, refer to the CIS Kubernetes Benchmark.
The focus of the scan varies when checking different types of nodes.
Scan the control plane node (Controller)
Focus on the security of system components such as API Server , controller-manager , scheduler , kubelet , etc.
Check the security configuration of the Etcd database.
Verify whether the cluster's authentication mechanism, authorization policy, and network security configuration meet security standards.
Scan worker nodes
Check if the configuration of container runtimes such as kubelet and Docker meets security standards.
Verify whether the container image has been trusted and verified.
Check if the network security configuration of the node meets security standards.
Tip
To use compliance scanning, you need to create a scan configuration first, and then create a scan policy based on that configuration. After executing the scan policy, you can view the scan report.
Authorization scanning focuses on security vulnerabilities caused by authorization issues. Authorization scans can help users identify security threats in Kubernetes clusters, identify which resources need further review and protection measures. By performing these checks, users can gain a clearer and more comprehensive understanding of their Kubernetes environment and ensure that the cluster environment meets Kubernetes' best practices and security standards.
Specifically, authorization scanning supports the following operations:
Scans the health status of all nodes in the cluster.
Scans the running state of components in the cluster, such as kube-apiserver , kube-controller-manager , kube-scheduler , etc.
API security: whether unsafe API versions are enabled, whether appropriate RBAC roles and permission restrictions are set, etc.
Container security: whether insecure images are used, whether privileged mode is enabled, whether appropriate security context is set, etc.
Network security: whether appropriate network policy is enabled to restrict traffic, whether TLS encryption is used, etc.
Storage security: whether appropriate encryption and access controls are enabled.
Application security: whether necessary security measures are in place, such as password management, cross-site scripting attack defense, etc.
Provides warnings and suggestions: Security best practices that cluster administrators should perform, such as regularly rotating certificates, using strong passwords, restricting network access, etc.
Tip
To use authorization scanning, you need to create a scan policy first. After executing the scan policy, you can view the scan report. For details, refer to Security Scanning.
Vulnerability scanning focuses on scanning potential malicious attacks and security vulnerabilities, such as remote code execution, SQL injection, XSS attacks, and some attacks specific to Kubernetes. The final scan report lists the security vulnerabilities in the cluster and provides repair suggestions.
Tip
To use vulnerability scanning, you need to create a scan policy first. After executing the scan policy, you can view the scan report. For details, refer to Vulnerability Scan.
To use the Permission Scan feature, you need to create a scan policy first. After executing the policy, a scan report will be automatically generated for viewing.
"},{"location":"en/admin/kpanda/security/audit.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
On the left navigation bar of the homepage in the Container Management module, click Security Management .
Click Permission Scan on the left navigation bar, then click the Scan Policy tab and click Create Scan Policy on the right.
Fill in the configuration according to the following instructions, and then click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to keep. When the specified retention quantity is exceeded, the earliest reports are deleted.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ┇ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
For one-time scan policies: Only support the Delete operation.
To use the Vulnerability Scan feature, you need to create a scan policy first. After executing the policy, a scan report will be automatically generated for viewing.
"},{"location":"en/admin/kpanda/security/hunter.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
On the left navigation bar of the homepage in the Container Management module, click Security Management .
Click Vulnerability Scan on the left navigation bar, then click the Scan Policy tab and click Create Scan Policy on the right.
Fill in the configuration according to the following instructions, and then click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to keep. When the specified retention quantity is exceeded, the earliest reports are deleted.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ┇ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
For one-time scan policies: Only support the Delete operation.
The first step in using CIS Scanning is to create a scan configuration. Based on the scan configuration, you can then create scan policies, execute scan policies, and finally view scan results.
"},{"location":"en/admin/kpanda/security/cis/config.html#create-a-scan-configuration","title":"Create a Scan Configuration","text":"
The steps for creating a scan configuration are as follows:
Click Security Management in the left navigation bar of the homepage of the container management module.
By default, enter the Compliance Scanning page, click the Scan Configuration tab, and then click Create Scan Configuration in the upper-right corner.
Fill in the configuration name, select the configuration template, and optionally check the scan items, then click OK .
Scan Template: Currently, two templates are provided. The kubeadm template is suitable for general Kubernetes clusters. The daocloud template ignores scan items that are not applicable to AI platform based on the kubeadm template and the platform design of AI platform.
Under the scan configuration tab, clicking the name of a scan configuration displays the type of the configuration, the number of scan items, the creation time, the configuration template, and the specific scan items enabled for the configuration.
After a scan configuration has been successfully created, it can be updated or deleted according to your needs.
Under the scan configuration tab, click the ┇ action button to the right of a configuration:
Select Edit to update the configuration. You can update the description, template, and scan items. The configuration name cannot be changed.
Select Delete to delete the configuration.
"},{"location":"en/admin/kpanda/security/cis/policy.html","title":"Scan Policy","text":""},{"location":"en/admin/kpanda/security/cis/policy.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
After creating a scan configuration, you can create a scan policy based on the configuration.
Under the Security Management -> Compliance Scanning page, click the Scan Policy tab on the right to create a scan policy.
Fill in the configuration according to the following instructions and click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Configuration: Select a pre-created scan configuration. The scan configuration determines which specific scan items need to be performed.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to keep. When the specified retention quantity is exceeded, the earliest reports are deleted.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ┇ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
For one-time scan policies: Only support the Delete operation.
After executing a scan policy, a scan report will be generated automatically. You can view the scan report online or download it to your local computer.
Download and View
Under the Security Management -> Compliance Scanning page, click the Scan Report tab, then click the ┇ action button to the right of a report and select Download .
View Online
Clicking the name of a report allows you to view its content online, which includes:
The target cluster scanned.
The scan policy and scan configuration used.
The start time of the scan.
The total number of scan items, the number passed, and the number failed.
For failed scan items, repair suggestions are provided.
For passed scan items, more secure operational suggestions are provided.
A data volume (PersistentVolume, PV) is a piece of storage in the cluster, which can be prepared in advance by the administrator, or dynamically prepared using a storage class (Storage Class). PV is a cluster resource, but it has an independent life cycle and will not be deleted when the Pod process ends. Mounting PVs to workloads can achieve data persistence for workloads. The PV holds the data directory that can be accessed by the containers in the Pod.
"},{"location":"en/admin/kpanda/storage/pv.html#create-data-volume","title":"Create data volume","text":"
Currently, there are two ways to create data volumes: YAML and form. These two ways have their own advantages and disadvantages, and can meet the needs of different users.
Creating via YAML involves fewer steps and is more efficient, but it has a higher barrier to entry: you need to be familiar with the YAML file configuration of the data volume.
Creating via the form is more intuitive and easier: just fill in the proper values according to the prompts, but it involves more steps.
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
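For reference, a minimal PV sketch that could be pasted into the YAML creation dialog; the name, capacity, and hostPath path are placeholders:

```yaml
# Minimal hostPath PV sketch; all values are illustrative.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-demo
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
  hostPath:
    path: /data/pv-demo      # directory on the node's file system
```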
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) -> Create Data Volume (PV) in the left navigation bar.
Fill in the basic information.
The data volume name, data volume type, mount path, volume mode, and node affinity cannot be changed after creation.
Data volume type: For a detailed introduction to volume types, refer to the official Kubernetes document Volumes.
Local: Packages the node's local storage behind a PVC interface, so the container uses the PVC directly without caring about the underlying storage type. Local volumes do not support dynamic provisioning, but they support node affinity configuration, which can limit which nodes can access the data volume.
HostPath: Uses files or directories on the node's file system as data volumes; it does not support node-affinity-based Pod scheduling.
Mount path: mount the data volume to a specific directory in the container.
access mode:
ReadWriteOnce: The data volume can be mounted by a node in read-write mode.
ReadWriteMany: The data volume can be mounted by multiple nodes in read-write mode.
ReadOnlyMany: The data volume can be mounted read-only by multiple nodes.
ReadWriteOncePod: The data volume can be mounted read-write by a single Pod.
Recycling policy:
Retain: The PV is not deleted, but its status is only changed to released , which needs to be manually recycled by the user. For how to manually reclaim, refer to Persistent Volume.
Recycle: keep the PV but empty its data, perform a basic wipe ( rm -rf /thevolume/* ).
Delete: The PV and its data are deleted together.
Volume mode:
File system: The data volume will be mounted to a certain directory by the Pod. If the data volume is stored from a device and the device is currently empty, a file system is created on the device before the volume is mounted for the first time.
Block: Use the data volume as a raw block device. This type of volume is given to the Pod as a block device without any file system on it, allowing the Pod to access the data volume faster.
Node affinity:
"},{"location":"en/admin/kpanda/storage/pv.html#view-data-volume","title":"View data volume","text":"
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) in the left navigation bar.
On this page, you can view all data volumes in the current cluster, as well as information such as the status, capacity, and namespace of each data volume.
Supports sequential or reverse sorting according to the name, status, namespace, and creation time of data volumes.
Click the name of a data volume to view the basic configuration, StorageClass information, labels, comments, etc. of the data volume.
"},{"location":"en/admin/kpanda/storage/pv.html#clone-data-volume","title":"Clone data volume","text":"
By cloning a data volume, a new data volume can be recreated based on the configuration of the cloned data volume.
Enter the clone page
On the data volume list page, find the data volume to be cloned, and select Clone under the operation bar on the right.
You can also click the name of the data volume, click the operation button in the upper right corner of the details page and select Clone .
Use the original configuration directly, or modify it as needed, and click OK at the bottom of the page.
"},{"location":"en/admin/kpanda/storage/pv.html#update-data-volume","title":"Update data volume","text":"
There are two ways to update data volumes. Support for updating data volumes via forms or YAML files.
Note
Only updating the alias, capacity, access mode, reclamation policy, label, and comment of the data volume is supported.
On the data volume list page, find the data volume that needs to be updated, select Update under the operation bar on the right to update through the form, select Edit YAML to update through YAML.
Click the name of the data volume to enter the details page of the data volume, select Update in the upper right corner of the page to update through the form, select Edit YAML to update through YAML.
"},{"location":"en/admin/kpanda/storage/pv.html#delete-data-volume","title":"Delete data volume","text":"
On the data volume list page, find the data to be deleted, and select Delete in the operation column on the right.
You can also click the name of the data volume, click the operation button in the upper right corner of the details page and select Delete .
A persistent volume claim (PersistentVolumeClaim, PVC) expresses a user's request for storage. PVC consumes PV resources and claims a data volume with a specific size and specific access mode. For example, the PV volume is required to be mounted in ReadWriteOnce, ReadOnlyMany or ReadWriteMany modes.
"},{"location":"en/admin/kpanda/storage/pvc.html#create-data-volume-statement","title":"Create data volume statement","text":"
Currently, there are two ways to create data volume declarations: YAML and form. These two ways have their own advantages and disadvantages, and can meet the needs of different users.
Creating via YAML involves fewer steps and is more efficient, but it has a higher barrier to entry: you need to be familiar with the YAML file configuration of the data volume declaration.
Creating via the form is more intuitive and easier: just fill in the proper values according to the prompts, but it involves more steps.
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
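For reference, a minimal PVC sketch that could be pasted into the YAML creation dialog; the name, size, and StorageClass are placeholders:

```yaml
# Minimal PVC sketch requesting storage from an existing StorageClass; values are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: local-path   # replace with an available StorageClass
```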
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) -> Create Data Volume Declaration (PVC) in the left navigation bar.
Fill in the basic information.
The name, namespace, creation method, data volume, capacity, and access mode of the data volume declaration cannot be changed after creation.
Creation method: dynamically create a new data volume claim in an existing StorageClass or data volume, or create a new data volume claim based on a snapshot of a data volume claim.
The declared capacity of the data volume cannot be modified when the snapshot is created, and can be modified after the creation is complete.
After selecting the creation method, select the desired StorageClass/data volume/snapshot from the drop-down list.
access mode:
ReadWriteOnce, the data volume declaration can be mounted by a node in read-write mode.
ReadWriteMany, the data volume declaration can be mounted by multiple nodes in read-write mode.
ReadOnlyMany, the data volume declaration can be mounted read-only by multiple nodes.
ReadWriteOncePod, the data volume declaration can be mounted by a single Pod in read-write mode.
"},{"location":"en/admin/kpanda/storage/pvc.html#view-data-volume-statement","title":"View data volume statement","text":"
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) in the left navigation bar.
On this page, you can view all data volume declarations in the current cluster, as well as information such as the status, capacity, and namespace of each data volume declaration.
Supports sorting in sequential or reverse order according to the declared name, status, namespace, and creation time of the data volume.
Click the name of the data volume declaration to view the basic configuration, StorageClass information, labels, comments and other information of the data volume declaration.
"},{"location":"en/admin/kpanda/storage/pvc.html#expansion-data-volume-statement","title":"Expansion data volume statement","text":"
In the left navigation bar, click Container Storage -> Data Volume Declaration (PVC) , and find the data volume declaration whose capacity you want to adjust.
Click the name of the data volume declaration, and then click the operation button in the upper right corner of the page and select Expansion .
Enter the target capacity and click OK .
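Under the hood, expansion amounts to raising the requested storage of the claim; a hedged sketch of the relevant spec fragment (assuming the StorageClass has volume expansion enabled) is shown below.

```yaml
# PVC spec fragment: to expand, increase spec.resources.requests.storage.
# The StorageClass must have expansion enabled; shrinking is not supported.
spec:
  resources:
    requests:
      storage: 20Gi   # raised from an original 10Gi (placeholder values)
```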
"},{"location":"en/admin/kpanda/storage/pvc.html#clone-data-volume-statement","title":"Clone data volume statement","text":"
Cloning a data volume claim creates a new data volume claim based on the configuration of the original claim.
Enter the clone page
On the data volume declaration list page, find the data volume declaration that needs to be cloned, and select Clone under the operation bar on the right.
You can also click the name of the data volume declaration, click the operation button in the upper right corner of the details page and select Clone .
Use the original configuration directly, or modify it as needed, and click OK at the bottom of the page.
"},{"location":"en/admin/kpanda/storage/pvc.html#update-data-volume-statement","title":"Update data volume statement","text":"
There are two ways to update data volume claims: via the form or via a YAML file.
Note
Only the alias, labels, and annotations of a data volume claim can be updated.
On the data volume list page, find the data volume declaration that needs to be updated, select Update in the operation bar on the right to update it through the form, and select Edit YAML to update it through YAML.
Click the name of the data volume declaration, enter the details page of the data volume declaration, select Update in the upper right corner of the page to update through the form, select Edit YAML to update through YAML.
"},{"location":"en/admin/kpanda/storage/pvc.html#delete-data-volume-statement","title":"Delete data volume statement","text":"
On the data volume declaration list page, find the data volume declaration to be deleted, and select Delete in the operation column on the right.
You can also click the name of the data volume declaration, click the operation button in the upper right corner of the details page and select Delete .
If there is no optional StorageClass or data volume in the list, you can Create a StorageClass or Create a data volume.
If there is no optional snapshot in the list, you can enter the details page of the data volume declaration and create a snapshot in the upper right corner.
If the StorageClass (SC) used by the data volume declaration is not enabled for snapshots, snapshots cannot be made, and the page will not display the "Make Snapshot" option.
If the StorageClass (SC) used by the data volume declaration does not have the capacity expansion feature enabled, the data volume does not support capacity expansion, and the page will not display the capacity expansion option.
A StorageClass refers to a large storage resource pool composed of many physical disks. This platform supports the creation of block StorageClass, local StorageClass, and custom StorageClass after accessing various storage vendors, and then dynamically configures data volumes for workloads.
Currently, it supports creating StorageClass through YAML and forms. These two methods have their own advantages and disadvantages, and can meet the needs of different users.
Creating through YAML involves fewer steps and is more efficient, but it has a higher barrier to entry: you need to be familiar with the YAML configuration of the StorageClass.
Creating through the form is more intuitive and easier; you only need to fill in the proper values according to the prompts, but the steps are more cumbersome.
Click the name of the target cluster in the cluster list, and then click Container Storage -> StorageClass (SC) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
You can import a YAML file from your local machine, or download and save the completed file locally.
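As a minimal sketch, a custom StorageClass manifest using the rancher.io/local-path driver mentioned in the CSI driver description below might look as follows; the name, reclaim policy, binding mode, and expansion flag are illustrative and should follow your storage vendor's documentation.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: demo-sc                        # example name
provisioner: rancher.io/local-path     # CSI driver in the vendor-specified format
reclaimPolicy: Delete                  # Delete or Retain; cannot be changed after creation
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true             # needed for expansion, if the underlying driver supports it
```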
Click the name of the target cluster in the cluster list, and then click Container Storage -> StorageClass (SC) -> Create StorageClass (SC) in the left navigation bar.
Fill in the basic information and click OK at the bottom.
Custom storage system
The StorageClass name, driver, and reclamation policy cannot be modified after creation.
CSI storage driver: A standard Kubernetes-based container storage interface plug-in, which must comply with the format specified by the storage manufacturer, such as rancher.io/local-path .
For how to fill in the CSI drivers provided by different vendors, refer to the official Kubernetes document Storage Class.
Reclaim policy: when a data volume is deleted, whether to keep the data in it or delete it.
Snapshot/Expansion: after this is enabled, data volumes and data volume declarations based on this StorageClass can support the snapshot and expansion features, provided that the underlying storage driver supports them.
HwameiStor storage system
The StorageClass name, driver, and reclamation policy cannot be modified after creation.
Storage system: HwameiStor storage system.
Storage type: LVM and raw disk types are supported.
LVM type: the usage method recommended by HwameiStor, which supports highly available data volumes; the corresponding CSI storage driver is lvm.hwameistor.io .
Raw disk data volume: suitable for scenarios that do not require high availability, as it provides no high availability capability; the corresponding CSI driver is hdd.hwameistor.io .
High Availability Mode: before using the high availability capability, please make sure the DRBD component has been installed. After high availability mode is turned on, the number of data volume replicas can be set to 1 or 2, and a replica count of 1 can later be converted to 2 if needed.
Reclaim policy: when a data volume is deleted, whether to keep the data in it or delete it.
Snapshot/Expansion: after this is enabled, data volumes and data volume declarations based on this StorageClass can support the snapshot and expansion features, provided that the underlying storage driver supports them.
On the StorageClass list page, find the StorageClass that needs to be updated, and select Edit under the operation bar on the right to update the StorageClass.
Info
Select View YAML to view the YAML file of the StorageClass, but editing is not supported.
This page introduces how to create a CronJob through images and YAML files.
CronJobs are suitable for performing periodic operations, such as backup and report generation. These jobs can be configured to repeat periodically (for example: daily/weekly/monthly), and the time interval at which the job starts to run can be defined.
Before creating a CronJob, the following prerequisites need to be met:
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and make sure the cluster UI can be accessed.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/admin/kpanda/workloads/create-cronjob.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a CronJob using the image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> CronJobs in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings, CronJob Settings, Advanced Configuration, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the CronJobs list. Click ┇ on the right side of the list to perform operations such as updating, deleting, and restarting the CronJob.
On the Create CronJobs page, enter the information according to the table below, and click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select which namespace to deploy the newly created CronJob in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container setting is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the configuration with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull the image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be used, and the image will be re-pulled from the container registry only when it does not exist locally. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports assigning the container either an entire GPU or part of a vGPU. For example, for an 8-core GPU, entering 8 lets the container use the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plugin on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass configuration to the Pod, etc. For details, refer to Container environment variable configuration.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Concurrency Policy: Whether to allow multiple Job jobs to run in parallel.
Allow : A new job can be created before the previous job completes, and multiple jobs can run in parallel. Too many jobs may occupy cluster resources.
Forbid : Before the previous job is completed, a new job cannot be created. If the execution time of the new job is up and the previous job has not been completed, CronJob will ignore the execution of the new job.
Replace : If the execution time of the new job is up, but the previous job has not been completed, the new job will replace the previous job.
The above rules only apply to multiple jobs created by the same CronJob. Multiple jobs created by multiple CronJobs are always allowed to run concurrently.
Policy Settings: Set the time period for job execution based on minutes, hours, days, weeks, and months. Support custom Cron expressions with numbers and * , after inputting the expression, the meaning of the current expression will be prompted. For detailed expression syntax rules, refer to Cron Schedule Syntax.
Job Records: Set how many records of successful or failed jobs to keep. 0 means do not keep.
Timeout: When this time is exceeded, the job will be marked as failed to execute, and all Pods under the job will be deleted. When it is empty, it means that no timeout is set. The default is 360 s.
Retries: the number of times the job can be retried, the default value is 6.
Restart Policy: Set whether to restart the Pod when the job fails.
The advanced configuration of CronJobs mainly involves labels and annotations.
You can click the Add button to add labels and annotations to the workload instance Pod.
"},{"location":"en/admin/kpanda/workloads/create-cronjob.html#create-from-yaml","title":"Create from YAML","text":"
In addition to images, you can also create CronJobs more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> CronJobs in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
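As a reference for the YAML creation method, a minimal CronJob manifest reflecting the settings described above (schedule, concurrency policy, job records, timeout, retries, restart policy) might look like this; the names, namespace, schedule, and image are placeholders, not a definitive configuration.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: demo-report                 # example name
  namespace: default
spec:
  schedule: "0 2 * * *"             # Cron expression: run at 02:00 every day
  concurrencyPolicy: Forbid         # Allow / Forbid / Replace, as described above
  successfulJobsHistoryLimit: 3     # successful job records to keep
  failedJobsHistoryLimit: 1         # failed job records to keep
  jobTemplate:
    spec:
      activeDeadlineSeconds: 360    # timeout: mark the job as failed after 360 s
      backoffLimit: 6               # retries
      template:
        spec:
          restartPolicy: Never      # restart policy when the Pod fails
          containers:
            - name: report
              image: busybox:1.36   # placeholder image
              command: ["sh", "-c", "echo generate report"]
```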
This page introduces how to create a DaemonSet through images and YAML files.
A DaemonSet ensures, by means of node affinity and tolerations, that a replica of a Pod runs on all or some of the nodes. For nodes that newly join the cluster, the DaemonSet automatically deploys the proper Pod on the new node and tracks the Pod's running status. When a node is removed, the DaemonSet deletes all Pods it created.
Common cases for daemons include:
Run cluster daemons on each node.
Run a log collection daemon on each node.
Run a monitoring daemon on each node.
In the simplest case, one DaemonSet covering all nodes can be started for each type of daemon. For finer and more advanced daemon management, you can also deploy multiple DaemonSets for the same daemon. Each DaemonSet can have different flags and different memory and CPU requirements for different hardware types.
Before creating a DaemonSet, the following prerequisites need to be met:
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and make sure the cluster UI can be accessed.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/admin/kpanda/workloads/create-daemonset.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a DaemonSet using an image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> DaemonSets in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings, Service Settings, Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the list of DaemonSets . Click ┇ on the right side of the list to perform operations such as updating, deleting, and restarting the DaemonSet.
On the Create DaemonSets page, after entering the information according to the table below, click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select which namespace to deploy the newly created DaemonSet in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container setting is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be pulled, and only when the image does not exist locally, it will be re-pulled from the container registry. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports assigning the container either an entire GPU or part of a vGPU. For example, for an 8-core GPU, entering 8 lets the container use the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plugin on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes four parts: load network settings, upgrade policy, scheduling policy, label and annotation. You can click the tabs below to view the requirements of each part.
Network Configuration, Upgrade Policy, Scheduling Policies, Labels and Annotations
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related settings options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: Make the container use the domain name resolution file pointed to by the --resolv-conf parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: the application uses the domain name resolution file of the host it runs on.
ClusterFirst: the application connects to the cluster's Kube-DNS/CoreDNS.
None: a new option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting it to None, dnsConfig must be set; the container's domain name resolution file will then be generated entirely from the dnsConfig settings.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Configuration options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig options conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
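The DNS and host alias options above correspond to standard Pod spec fields; the fragment below is a hedged sketch in which the addresses, search domains, and hostnames are placeholders.

```yaml
# Pod spec fragment: DNS and host alias settings behind the form (values are placeholders)
spec:
  dnsPolicy: "None"                  # Default / ClusterFirst / ClusterFirstWithHostNet / None
  dnsConfig:
    nameservers:
      - 10.6.175.20                  # DNS server address
    searches:
      - my-ns.svc.cluster.local      # search domains (Kubernetes allows up to 6)
    options:
      - name: ndots                  # each option has a required name and an optional value
        value: "2"
  hostAliases:                       # host aliases written into the Pod's /etc/hosts
    - ip: 10.6.175.30
      hostnames:
        - demo.internal
```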
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Max Unavailable Pods: Specify the maximum value or ratio of unavailable pods during the workload update process, the default is 25%. If it is equal to the number of instances, there is a risk of service interruption.
Max Surge: The maximum or ratio of the total number of Pods exceeding the desired replica count of Pods during a Pod update. Default is 25%.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Minimum Ready: The minimum time for a Pod to be ready. Only after this time is the Pod considered available. The default is 0 seconds.
Upgrade Max Duration: If the deployment is not successful after the set time, the workload will be marked as failed. Default is 600 seconds.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds.
Node affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
Topology domain: that is, topologyKey, used to specify a group of nodes that Pods can be scheduled to. For example, kubernetes.io/os means that as long as a node with that operating system label meets the labelSelector conditions, Pods can be scheduled to it.
For details, refer to Scheduling Policy.
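The scheduling options above map onto standard Pod spec fields; the fragment below is a hedged sketch in which the label keys, values, and times are placeholders.

```yaml
# Pod spec fragment: scheduling policy examples (label keys and values are placeholders)
spec:
  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300          # toleration time before rescheduling
  affinity:
    nodeAffinity:                     # constrain schedulable nodes by node labels
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values: ["linux"]
    podAntiAffinity:                  # keep replicas away from Pods carrying the same label
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: demo
          topologyKey: kubernetes.io/hostname   # topology domain
```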
You can click the Add button to add labels and annotations to workloads and pods.
"},{"location":"en/admin/kpanda/workloads/create-daemonset.html#create-from-yaml","title":"Create from YAML","text":"
In addition to images, you can also create DaemonSets more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> DaemonSets in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a DaemonSet
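The original page collapses the example manifest here. As a minimal sketch consistent with the steps above, a DaemonSet for a node-level agent might look like the following; the name, labels, image, and resource values are placeholders, and the control-plane toleration is optional.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: demo-node-agent             # example name
  namespace: default
spec:
  selector:
    matchLabels:
      app: demo-node-agent
  template:
    metadata:
      labels:
        app: demo-node-agent
    spec:
      containers:
        - name: agent
          image: busybox:1.36       # placeholder; replace with your log or monitoring agent image
          command: ["sh", "-c", "while true; do echo collecting; sleep 60; done"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
      tolerations:
        - key: node-role.kubernetes.io/control-plane   # optionally also run on control plane nodes
          operator: Exists
          effect: NoSchedule
```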
This page describes how to create deployments through images and YAML files.
Deployment is a common resource in Kubernetes. It mainly provides declarative updates for Pods and ReplicaSets, and supports elastic scaling, rolling upgrades, and version rollback. Declare the desired Pod state in the Deployment, and the Deployment Controller will modify the current state through the ReplicaSet until it reaches the declared desired state. A Deployment is stateless and does not support data persistence; it is suitable for deploying stateless applications that do not need to save data and can be restarted and rolled back at any time.
Through the container management module of AI platform, workloads on multicloud and multiclusters can be easily managed based on proper role permissions, including full lifecycle management of deployments such as creation, update, deletion, elastic scaling, restart, and version rollback.
Before using image to create deployments, the following prerequisites need to be met:
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and make sure the cluster UI can be accessed.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/admin/kpanda/workloads/create-deployment.html#create-by-image","title":"Create by image","text":"
Follow the steps below to create a deployment by image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> Deployments in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Setting, Service Setting, Advanced Setting in turn, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the list of Deployments . Click ┇ on the right side of the list to perform operations such as update, delete, elastic scaling, restart, and version rollback on the workload. If the workload status is abnormal, please check the specific abnormal information; refer to Workload Status.
Workload Name: can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number, such as deployment-01. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select the namespace where the newly created workload will be deployed. The default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Pods: Enter the number of Pod instances for the workload; one Pod instance is created by default.
Description: Enter the description information of the workload and customize the content. The number of characters cannot exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container setting is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, it is essential to correctly fill in the container name and image parameters; otherwise, you will not be able to proceed to the next step. After filling in the configuration according to the following requirements, click OK.
Container Type: The default is Work Container. For information on init containers, see the [K8s Official Documentation](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/).
Container Name: No more than 63 characters, supporting lowercase letters, numbers, and separators (\"-\"). It must start and end with a lowercase letter or number, for example, nginx-01.
Image:
Image: Select an appropriate image from the list. When entering the image name, the default is to pull the image from the official DockerHub.
Image Version: Select an appropriate version from the dropdown list.
Image Pull Policy: By checking Always pull the image, the image will be pulled from the repository each time the workload restarts/upgrades. If unchecked, it will only pull the local image, and will pull from the repository only if the image does not exist locally. For more details, refer to Image Pull Policy.
Registry Secret: Optional. If the target repository requires a Secret to access, you need to create a secret first.
Privileged Container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and has all the privileges of running processes on the host.
CPU/Memory Request: The request value (the minimum resource needed) and the limit value (the maximum resource allowed) for CPU/memory resources. Configure resources for the container as needed to avoid resource waste and system failures caused by container resource overages. Default values are shown in the figure.
GPU Configuration: Configure GPU usage for the container, supporting only positive integers. The GPU quota setting supports configuring the container to exclusively use an entire GPU or part of a vGPU. For example, for a GPU with 8 cores, entering the number 8 means the container exclusively uses the entire GPU, and entering the number 1 means configuring a 1-core vGPU for the container.
Before setting the GPU, the administrator needs to pre-install the GPU and driver plugin on the cluster node and enable the GPU feature in the Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Setting.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Setting.
Configure container parameters within the Pod, add environment variables or pass setting to the Pod, etc. For details, refer to Container environment variable setting.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Setting.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes four parts: Network Settings, Upgrade Policy, Scheduling Policies, Labels and Annotations. You can click the tabs below to view the setting requirements of each part.
Network Settings, Upgrade Policy, Scheduling Policies, Labels and Annotations
For container NIC setting, refer to Workload Usage IP Pool
DNS setting
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related setting options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: make the container use the domain name resolution file pointed to by kubelet's --resolv-conf parameter. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: the application uses the domain name resolution file of the host it runs on.
ClusterFirst: the application connects to the cluster's Kube-DNS/CoreDNS.
None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the setting of dnsConfig.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Setting options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig options conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Max Unavailable: Specify the maximum value or ratio of unavailable pods during the workload update process, the default is 25%. If it is equal to the number of instances, there is a risk of service interruption.
Max Surge: The maximum or ratio of the total number of Pods exceeding the desired replica count of Pods during a Pod update. Default is 25%.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Minimum Ready: The minimum time for a Pod to be ready. Only after this time is the Pod considered available. The default is 0 seconds.
Upgrade Max Duration: If the deployment is not successful after the set time, the workload will be marked as failed. Default is 600 seconds.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
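The upgrade policy fields above map onto standard Deployment spec fields; the fragment below is a hedged sketch that mirrors the default values listed above.

```yaml
# Deployment fragment: upgrade policy fields behind the form (values mirror the defaults above)
spec:
  strategy:
    type: RollingUpdate             # or Recreate for "rebuild and upgrade"
    rollingUpdate:
      maxUnavailable: "25%"         # max unavailable Pods during the update
      maxSurge: "25%"               # max Pods above the desired replica count
  revisionHistoryLimit: 10          # old versions kept for rollback
  minReadySeconds: 0                # minimum ready time before a Pod counts as available
  progressDeadlineSeconds: 600      # upgrade max duration before the rollout is marked failed
  template:
    spec:
      terminationGracePeriodSeconds: 30   # graceful period before the Pod stops
```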
Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds.
Node Affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload Anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
For details, refer to Scheduling Policy.
You can click the Add button to add labels and annotations to workloads and pods.
"},{"location":"en/admin/kpanda/workloads/create-deployment.html#create-from-yaml","title":"Create from YAML","text":"
In addition to images, you can also create deployments more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> Deployments in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a deployment
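The original page collapses the example manifest here. A minimal sketch consistent with the steps above might look like the following; the name, labels, image, replica count, and resource values are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment        # example name
  namespace: default
spec:
  replicas: 2                   # number of Pod instances
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx-01
          image: nginx:1.25     # pulled from the official DockerHub by default
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi
          ports:
            - containerPort: 80
```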
This page introduces how to create a job through image and YAML file.
Job is suitable for performing one-time jobs. A Job creates one or more Pods, and the Job keeps retrying to run Pods until a certain number of Pods are successfully terminated. A Job ends when the specified number of Pods are successfully terminated. When a Job is deleted, all Pods created by the Job will be cleared. When a Job is paused, all active Pods in the Job are deleted until the Job is resumed. For more information about jobs, refer to Job.
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and make sure the cluster UI can be accessed.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/admin/kpanda/workloads/create-job.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a job using an image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> Jobs in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings and Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the job list. Click ┇ on the right side of the list to perform operations such as updating, deleting, and restarting the job.
On the Create Jobs page, enter the basic information according to the table below, and click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select which namespace to deploy the newly created job in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Number of Instances: Enter the number of Pod instances for the workload. By default, 1 Pod instance is created.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the setting requirements of each part.
Container settings are configured for a single container only. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be pulled, and only when the image does not exist locally, it will be re-pulled from the container registry. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports assigning the container either an entire GPU or part of a vGPU. For example, for an 8-core GPU, entering 8 lets the container use the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plugin on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle settings.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check settings.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage settings.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes job settings, labels and annotations.
Job SettingsLabels and Annotations
Parallel Pods: the maximum number of Pods that can be created at the same time during job execution, and the parallel number should not be greater than the total number of Pods. Default is 1.
Timeout: When this time is exceeded, the job will be marked as failed to execute, and all Pods under the job will be deleted. When it is empty, it means that no timeout is set.
Restart Policy: Whether to restart the Pod when the job fails.
You can click the Add button to add labels and annotations to the workload instance Pod.
"},{"location":"en/admin/kpanda/workloads/create-job.html#create-from-yaml","title":"Create from YAML","text":"
In addition to images, jobs can also be created more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> Jobs in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
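As a reference for the YAML creation method, a minimal Job manifest reflecting the settings described above (parallel Pods, timeout, retries, restart policy) might look like this; the names, namespace, and image are placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: demo-job                # example name
  namespace: default
spec:
  completions: 1                # number of Pods that must finish successfully
  parallelism: 1                # parallel Pods (should not exceed the total number of Pods)
  backoffLimit: 6               # retries before the job is marked as failed
  activeDeadlineSeconds: 360    # optional timeout
  template:
    spec:
      restartPolicy: Never      # restart policy when the Pod fails
      containers:
        - name: worker
          image: busybox:1.36   # placeholder image
          command: ["sh", "-c", "echo one-off task && sleep 5"]
```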
This page describes how to create a StatefulSet through images and YAML files.
StatefulSet, like Deployment, is a common resource in Kubernetes, mainly used to manage the deployment and scaling of Pod collections. The main difference between the two is that a Deployment is stateless and does not save data, while a StatefulSet is stateful and is mainly used to manage stateful applications. In addition, Pods in a StatefulSet have a persistent ID, which makes it easy to match the proper Pod with its storage volumes.
Through the container management module of AI platform, workloads on multicloud and multiclusters can be easily managed based on proper role permissions, including full lifecycle management of StatefulSets such as creation, update, deletion, elastic scaling, restart, and version rollback.
Before using image to create StatefulSets, the following prerequisites need to be met:
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and make sure the cluster UI can be accessed.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/admin/kpanda/workloads/create-statefulset.html#create-by-image","title":"Create by image","text":"
Follow the steps below to create a StatefulSet using an image.
Click Clusters on the left navigation bar, then click the name of the target cluster to enter Cluster Details.
Click Workloads -> StatefulSets in the left navigation bar, and then click the Create by Image button in the upper right corner.
Fill in Basic Information, Container Settings, Service Settings, Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the list of StatefulSets , and wait for the status of the workload to become running . If the workload status is abnormal, refer to Workload Status for specific exception information.
Click ┇ on the right side of the New Workload column to perform operations such as update, delete, elastic scaling, restart, and version rollback on the workload.
Workload Name: can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number, such as deployment-01. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select the namespace where the newly created workload will be deployed. The default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Pods: Enter the number of Pod instances for the workload; one Pod instance is created by default.
Description: Enter the description information of the workload and customize the content. The number of characters cannot exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container settings are configured for a single container only. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be pulled, and only when the image does not exist locally, it will be re-pulled from the container registry. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports assigning the container either an entire GPU or part of a vGPU. For example, for an 8-core GPU, entering 8 lets the container use the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plugin on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes four parts: load network settings, upgrade policy, scheduling policy, label and annotation. You can click the tabs below to view the requirements of each part.
Network Configuration, Upgrade Policy, Container Management Policies, Scheduling Policies, Labels and Annotations
For container NIC settings, refer to Workload Usage IP Pool
DNS settings
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related settings options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: Make the container use the domain name resolution file pointed to by the --resolv-conf parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: the application uses the domain name resolution file of the host it runs on.
ClusterFirst: the application connects to the cluster's Kube-DNS/CoreDNS.
None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the settings of dnsConfig.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Configuration options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig options conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
Kubernetes v1.7 and later versions can set Pod management policies through .spec.podManagementPolicy , which supports the following two methods:
OrderedReady : The default Pod management policy, which means that Pods are deployed in order. Only after the deployment of the previous Pod is successfully completed, the statefulset will start to deploy the next Pod. Pods are deleted in reverse order, with the last created being deleted first.
Parallel : Create or delete containers in parallel, just like Pods of the Deployment type. The StatefulSet controller starts or terminates all containers in parallel. There is no need to wait for a Pod to enter the Running and ready state or to stop completely before starting or terminating other Pods. This option only affects the behavior of scaling operations, not the order of updates.
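For reference, the policy is set through a single field in the StatefulSet spec; a hedged fragment (assuming Kubernetes v1.7 or later) is shown below.

```yaml
# StatefulSet fragment: Pod management policy
spec:
  podManagementPolicy: Parallel   # default is OrderedReady
```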
Tolerance time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds.
Node affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
Topology domain: that is, topologyKey, used to specify a group of nodes that Pods can be scheduled to. For example, kubernetes.io/os means that as long as a node with that operating system label meets the labelSelector conditions, Pods can be scheduled to it.
For details, refer to Scheduling Policy.
You can click the Add button to add labels and annotations to workloads and pods.
"},{"location":"en/admin/kpanda/workloads/create-statefulset.html#create-from-yaml","title":"Create from YAML","text":"
In addition to images, you can also create StatefulSets more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> StatefulSets in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a StatefulSet
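The original page collapses the example manifest here. A minimal sketch consistent with the steps above might look like the following; the name, the headless Service referenced by serviceName, the image, and the storage size are placeholders.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-statefulset        # example name
  namespace: default
spec:
  serviceName: demo-headless    # headless Service that gives Pods stable network identities
  replicas: 2
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: demo
          image: nginx:1.25     # placeholder image
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:         # one persistent volume claim per Pod, matched by the stable Pod ID
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```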
An environment variable refers to a variable set in the container running environment, which is used to add environment flags to Pods or transfer configurations, etc. It supports configuring environment variables for Pods in the form of key-value pairs.
Suanova container management adds a graphical interface to configure environment variables for Pods on the basis of native Kubernetes, and supports the following configuration methods:
Key-value pair (Key/Value Pair): Use a custom key-value pair as the environment variable of the container
Resource reference (Resource): Use the fields defined by the Container as the value of environment variables, such as the memory limit of the container, the number of replicas, etc.
Variable/Variable Reference (Pod Field): Use the Pod field as the value of an environment variable, such as the name of the Pod
ConfigMap key value import (ConfigMap key): Import the value of a key in the ConfigMap as the value of an environment variable
Key key value import (Secret Key): use the data from the Secret to define the value of the environment variable
Key Import (Secret): Import all key values \u200b\u200bin Secret as environment variables
ConfigMap import (ConfigMap): import all key values \u200b\u200bin the ConfigMap as environment variables
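A hedged YAML sketch of how these methods map to a container's env and envFrom fields (the names and keys below are illustrative, not values required by the platform):

```yaml
containers:
  - name: demo
    image: nginx:1.25                    # illustrative image
    env:
      - name: MODE                       # key-value pair
        value: "production"
      - name: MEM_LIMIT                  # resource reference
        valueFrom:
          resourceFieldRef:
            containerName: demo
            resource: limits.memory
      - name: POD_NAME                   # Pod field reference
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: LOG_LEVEL                  # single ConfigMap key
        valueFrom:
          configMapKeyRef:
            name: app-config
            key: log-level
      - name: DB_PASSWORD                # single Secret key
        valueFrom:
          secretKeyRef:
            name: app-secret
            key: password
    envFrom:
      - configMapRef:                    # import an entire ConfigMap
          name: app-config
      - secretRef:                       # import an entire Secret
          name: app-secret
```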
"},{"location":"en/admin/kpanda/workloads/pod-config/health-check.html","title":"Container health check","text":"
Container health check checks the health status of containers according to user requirements. After configuration, if the application in the container is abnormal, the container will automatically restart and recover. Kubernetes provides Liveness checks, Readiness checks, and Startup checks.
LivenessProbe can detect application deadlock (the application is running, but cannot continue to run the following steps). Restarting containers in this state can help improve the availability of applications, even if there are bugs in them.
ReadinessProbe can detect when a container is ready to accept request traffic. A Pod can only be considered ready when all containers in a Pod are ready. One use of this signal is to control which Pod is used as the backend of the Service. If the Pod is not ready, it will be removed from the Service's load balancer.
Startup check (StartupProbe) determines when an application container has started. Once configured, liveness and readiness checks are held off until the startup check succeeds, ensuring that those probes do not interfere with application startup. This can be used to perform liveness checks on slow-starting containers, preventing them from being killed before they are up and running.
"},{"location":"en/admin/kpanda/workloads/pod-config/health-check.html#liveness-and-readiness-checks","title":"Liveness and readiness checks","text":"
The configuration of ReadinessProbe is similar to that of LivenessProbe; the only difference is that the readinessProbe field is used instead of the livenessProbe field.
HTTP GET parameter description:
Parameter Description Path (Path) The requested path for access. Such as: /healthz path in the example Port (Port) Service listening port. Such as: port 8080 in the example Protocol (protocol) Access protocol, HTTP or HTTPS Delay time (initialDelaySeconds) Delay check time, in seconds; this setting is related to the normal startup time of business programs. For example, if it is set to 30, the health check will start 30 seconds after the container is started, which is the time reserved for business program startup. Timeout (timeoutSeconds) Timeout, in seconds. For example, if it is set to 10, the timeout waiting period for executing the health check is 10 seconds. If this time is exceeded, the health check is regarded as a failure. If set to 0 or not set, the default timeout waiting time is 1 second. SuccessThreshold (successThreshold) The minimum number of consecutive successes required for a probe to be considered successful after it has failed. The default value is 1, and the minimum value is 1. This value must be 1 for liveness and startup probes. Maximum number of failures (failureThreshold) The number of retries when the probe fails. Giving up in case of a liveness probe means restarting the container. Pods that are abandoned due to readiness probes are marked as not ready. The default value is 3. The minimum value is 1."},{"location":"en/admin/kpanda/workloads/pod-config/health-check.html#check-with-http-get-request","title":"Check with HTTP GET request","text":"
YAML example:
apiVersion: v1\nkind: Pod\nmetadata:\n labels:\n test: liveness\n name: liveness-http\nspec:\n containers:\n - name: liveness # Container name\n image: k8s.gcr.io/liveness # Container image\n args:\n - /server # Arguments to pass to the container\n livenessProbe:\n httpGet:\n path: /healthz # Access request path\n port: 8080 # Service listening port\n httpHeaders:\n - name: Custom-Header # Custom header name\n value: Awesome # Custom header value\n initialDelaySeconds: 3 # Wait 3 seconds before the first probe\n periodSeconds: 3 # Perform liveness detection every 3 seconds\n
According to the set rules, Kubelet sends an HTTP GET request to the service running in the container (the service is listening on port 8080) to perform the detection. The kubelet considers the container alive if the handler under the /healthz path on the server returns a success code. If the handler returns a failure code, the kubelet kills the container and restarts it. Any return code greater than or equal to 200 and less than 400 indicates success, and any other return code indicates failure. The /healthz handler returns a 200 status code for the first 10 seconds of the container's lifetime. The handler then returns a status code of 500.
"},{"location":"en/admin/kpanda/workloads/pod-config/health-check.html#use-tcp-port-check","title":"Use TCP port check","text":"
TCP port parameter description:
Parameter Description Port (Port) Service listening port. Such as: port 8080 in the example Delay time (initialDelaySeconds) Delay check time, in seconds, this setting is related to the normal startup time of business programs. For example, if it is set to 30, it means that the health check will start 30 seconds after the container is started, which is the time reserved for business program startup. Timeout (timeoutSeconds) Timeout, in seconds. For example, if it is set to 10, it indicates that the timeout waiting period for executing the health check is 10 seconds. If this time is exceeded, the health check will be regarded as a failure. If set to 0 or not set, the default timeout waiting time is 1 second.
For a container that provides TCP communication services, the cluster establishes a TCP connection to the container according to the configured rules. If the connection is established successfully, the probe succeeds; otherwise, it fails. If you choose the TCP port probe method, you must specify the port the container listens on.
This example uses both readiness and liveness probes. The kubelet sends the first readiness probe 5 seconds after the container starts, attempting to connect to port 8080 of the goproxy container. If the probe succeeds, the Pod is marked as ready, and the kubelet continues to run the check every 10 seconds.
In addition to the readiness probe, this configuration includes a liveness probe. The kubelet performs the first liveness probe 15 seconds after the container starts; this probe likewise attempts to connect to the goproxy container on port 8080. If the liveness probe fails, the container is restarted. A sketch of such a TCP probe configuration follows.
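A minimal sketch of the TCP probe configuration described above, assuming the goproxy image from the upstream Kubernetes example; the liveness probe period of 20 seconds is an assumption, since only the 15-second initial delay is stated above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
    - name: goproxy
      image: k8s.gcr.io/goproxy:0.1      # image assumed from the upstream example
      ports:
        - containerPort: 8080
      readinessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 5           # first readiness probe after 5 seconds
        periodSeconds: 10                # then every 10 seconds
      livenessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 15          # first liveness probe after 15 seconds
        periodSeconds: 20                # illustrative period
```

The next example, by contrast, uses an exec-type liveness probe that runs a command inside the container.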
apiVersion: v1\nkind: Pod\nmetadata:\n labels:\n test: liveness\n name: liveness-exec\nspec:\n containers:\n - name: liveness # Container name\n image: k8s.gcr.io/busybox # Container image\n args:\n - /bin/sh # Command to run\n - -c # Pass the following string as a command\n - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600 # Command to execute\n livenessProbe:\n exec:\n command:\n - cat # Command to check liveness\n - /tmp/healthy # File to check\n initialDelaySeconds: 5 # Wait 5 seconds before the first probe\n periodSeconds: 5 # Perform liveness detection every 5 seconds\n
The periodSeconds field specifies that the kubelet performs a liveness probe every 5 seconds, and the initialDelaySeconds field specifies that the kubelet waits for 5 seconds before performing the first probe. According to the set rules, the cluster periodically executes the command cat /tmp/healthy in the container through the kubelet to detect. If the command executes successfully and the return value is 0, the kubelet considers the container to be healthy and alive. If this command returns a non-zero value, the kubelet will kill the container and restart it.
"},{"location":"en/admin/kpanda/workloads/pod-config/health-check.html#protect-slow-starting-containers-with-pre-start-checks","title":"Protect slow-starting containers with pre-start checks","text":"
Some applications require a long initialization time at startup. In such cases, you can configure a startup probe that uses the same command as the liveness check, and for HTTP or TCP checks set failureThreshold * periodSeconds to a period long enough to cover the worst-case startup time, as in the sketch below.
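A hedged sketch of such a startup probe, reusing the /healthz endpoint from the liveness example and the failureThreshold * periodSeconds budget referenced below:

```yaml
startupProbe:
  httpGet:
    path: /healthz          # same endpoint as the liveness check
    port: 8080
  failureThreshold: 30      # up to 30 failed attempts allowed
  periodSeconds: 10         # probed every 10 seconds, i.e. a 300-second startup budget
```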
With the settings above, the application has at most 5 minutes (30 * 10 = 300s) to finish starting. Once the startup probe succeeds, the liveness probe takes over and can respond quickly to container deadlocks. If the startup probe never succeeds, the container is killed after 300 seconds and handled according to the restartPolicy .
"},{"location":"en/admin/kpanda/workloads/pod-config/job-parameters.html","title":"Description of job parameters","text":"
According to the settings of .spec.completions and .spec.parallelism , jobs (Job) can be divided into the following types:
Job Type Description Non-parallel Job Creates a Pod until its Job completes successfully Parallel Jobs with deterministic completion counts A Job is considered complete when the number of successful Pods reaches .spec.completions Parallel Job Creates one or more Pods until one finishes successfully
Parameter Description
RestartPolicy The restart policy for the Job's Pods; only Never or OnFailure is allowed .spec.completions Indicates the number of Pods that need to run successfully for the Job to complete; the default is 1 .spec.parallelism Indicates the number of Pods running in parallel; the default is 1 .spec.backoffLimit Indicates the maximum number of retries for failed Pods, beyond which no more retries are attempted. .spec.activeDeadlineSeconds Indicates the maximum running time of the Job. Once this time is reached, the Job, that is, all its Pods, will stop. activeDeadlineSeconds has a higher priority than backoffLimit; a Job that reaches activeDeadlineSeconds ignores the backoffLimit setting.
The following is an example Job configuration, saved in myjob.yaml, which calculates \u03c0 to 2000 digits and prints the output.
apiVersion: batch/v1\nkind: Job #The type of the current resource\nmetadata:\n name: myjob\nspec:\n completions: 50 # Job needs to run 50 Pods at the end, in this example it prints \u03c0 50 times\n parallelism: 5 # 5 Pods in parallel\n backoffLimit: 5 # retry up to 5 times\n template:\n spec:\n containers:\n - name: pi\n image: perl\n command: [\"perl\", \"-Mbignum=bpi\", \"-wle\", \"print bpi(2000)\"]\n restartPolicy: Never #restart policy\n
Related commands
kubectl apply -f myjob.yaml # Start job\nkubectl get job # View this job\nkubectl logs myjob-1122dswzs # View Job Pod logs\n
"},{"location":"en/admin/kpanda/workloads/pod-config/lifecycle.html","title":"Configure the container lifecycle","text":"
Pods follow a predefined lifecycle, starting in the Pending phase and entering the Running state if at least one container in the Pod starts normally. If any container in the Pod ends in a failed state, the state becomes Failed . The following phase field values indicate which phase of the lifecycle a Pod is in.
Value Description Pending The Pod has been accepted by the system, but one or more containers have not yet been created or run. This phase includes waiting for the pod to be scheduled and downloading the image over the network. Running (Running) The Pod has been bound to a node, and all containers in the Pod have been created. At least one container is still running, or in the process of starting or restarting. Succeeded (Success) All containers in the Pod were successfully terminated and will not be restarted. Failed All containers in the Pod have terminated, and at least one container terminated due to failure. That is, the container exited with a non-zero status or was terminated by the system. Unknown (Unknown) The status of the Pod cannot be obtained for some reason, usually due to a communication failure with the host where the Pod resides.
When creating a workload in Suanova container management, images are usually used to specify the running environment in the container. By default, when building an image, the Entrypoint and CMD fields can be used to define the commands and parameters to be executed when the container is running. If you need to change the commands and parameters of the container image before starting, after starting, and before stopping, you can override the default commands and parameters in the image by setting the lifecycle event commands and parameters of the container.
Configure the startup command, post-start command, and pre-stop command of the container according to business needs; a minimal sketch follows.
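Before going through the parameter tables below, here is a hedged sketch of how these three settings map to a container spec; the image, command, and hook bodies are illustrative:

```yaml
containers:
  - name: demo
    image: nginx:1.25                            # illustrative image
    command: ["/bin/sh", "-c"]                   # startup command
    args: ["nginx -g 'daemon off;'"]             # running parameters
    lifecycle:
      postStart:                                 # command after startup
        exec:
          command: ["/bin/sh", "-c", "echo started > /tmp/started"]
      preStop:                                   # command before stopping, e.g. drain traffic
        exec:
          command: ["/bin/sh", "-c", "nginx -s quit; sleep 5"]
```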
Parameter Description Example value Start command Type: Optional Meaning: The container will be started according to the start command. Command after startup Type: Optional Meaning: The command executed after the container starts. Command before stopping Type: Optional Meaning: The command executed by the container after receiving the stop command, ensuring that the services running in the instance can be drained in advance when the instance is upgraded or deleted. -"},{"location":"en/admin/kpanda/workloads/pod-config/lifecycle.html#start-command","title":"start command","text":"
Configure the startup command according to the table below.
Parameter Description Example value Run command Type: Required Meaning: Enter an executable command, and separate multiple commands with spaces. If the command itself contains spaces, enclose it in double quotes (""). Note: When there are multiple commands, it is recommended to use /bin/sh or another shell to run the command, passing all other commands in as parameters. /run/server Running parameters Type: Optional Meaning: Enter the parameters of the container's run command. port=8080"},{"location":"en/admin/kpanda/workloads/pod-config/lifecycle.html#post-start-commands","title":"Post-start commands","text":"
Suanova provides two processing types, command line script and HTTP request, to configure post-start commands. You can choose the configuration method that suits you according to the table below.
Command line script configuration
Parameter Description Example value Run Command Type: Optional Meaning: Enter an executable command, and separate multiple commands with spaces. If the command itself contains spaces, enclose it in double quotes (""). Note: When there are multiple commands, it is recommended to use /bin/sh or another shell to run the command, passing all other commands in as parameters. /run/server Running parameters Type: Optional Meaning: Enter the parameters of the container's run command. port=8080"},{"location":"en/admin/kpanda/workloads/pod-config/lifecycle.html#stop-pre-command","title":"stop pre-command","text":"
Suanova provides two processing types, command line script and HTTP request, to configure the pre-stop command. You can choose the configuration method that suits you according to the table below.
HTTP request configuration
Parameter Description Example value URL Path Type: Optional Meaning: Requested URL path. /run/server Port Type: Required Meaning: Requested port. port=8080 Node Address Type: Optional Meaning: The requested IP address; the default is the IP of the node where the container is located. -"},{"location":"en/admin/kpanda/workloads/pod-config/scheduling-policy.html","title":"Scheduling Policy","text":"
In a Kubernetes cluster, like many other Kubernetes objects, nodes have labels. You can manually add labels. Kubernetes also adds some standard labels to all nodes in the cluster. See Common Labels, Annotations, and Taints for common node labels. By adding labels to nodes, you can have pods scheduled on specific nodes or groups of nodes. You can use this feature to ensure that specific Pods can only run on nodes with certain isolation, security or governance properties.
nodeSelector is the simplest recommended form of a node selection constraint. You can add a nodeSelector field to the Pod's spec to set the node label. Kubernetes will only schedule pods on nodes with each label specified. nodeSelector provides one of the easiest ways to constrain Pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define. Some benefits of using affinity and anti-affinity are:
The affinity and anti-affinity language is more expressive. nodeSelector can only select nodes that have all the specified labels, while affinity and anti-affinity give you greater control over the selection logic.
You can mark a rule as a "soft requirement" or "preference", so that the scheduler will still schedule the Pod even if no matching node can be found.
You can use the labels of other Pods running on the node (or in other topological domains) to enforce scheduling constraints, instead of only using the labels of the node itself. This capability allows you to define rules which allow Pods to be placed together.
You can choose which node the Pod will deploy to by setting affinity and anti-affinity.
Toleration time: when the node running a workload instance becomes unavailable, the period after which the system reschedules the instance to another available node. The default is 300 seconds.
Node affinity is conceptually similar to nodeSelector , which allows you to constrain which nodes Pods can be scheduled on based on the labels on the nodes. There are two types of node affinity:
Must be satisfied: ( requiredDuringSchedulingIgnoredDuringExecution ) The scheduler can only run scheduling when the rules are satisfied. This functionality is similar to nodeSelector , but with a more expressive syntax. You can define multiple hard constraint rules, but only one of them must be satisfied.
Satisfy as much as possible: ( preferredDuringSchedulingIgnoredDuringExecution ) The scheduler will try to find nodes that meet the proper rules. If no matching node is found, the scheduler will still schedule the Pod. You can also set weights for soft constraint rules. During specific scheduling, if there are multiple nodes that meet the conditions, the node with the highest weight will be scheduled first. At the same time, you can also define multiple hard constraint rules, but only one of them needs to be satisfied.
Weight can only be added under the "satisfy as much as possible" policy. It can be understood as scheduling priority: nodes with the highest weight are preferred. The value range is 1 to 100. A minimal node affinity sketch combining both rule types follows.
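A hedged sketch combining a hard rule with a weighted soft rule; the label keys and values are illustrative:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:    # must be satisfied
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/os
              operator: In
              values: ["linux"]
    preferredDuringSchedulingIgnoredDuringExecution:   # satisfy as much as possible
      - weight: 80                                     # 1-100, higher is preferred
        preference:
          matchExpressions:
            - key: disktype                            # illustrative label
              operator: In
              values: ["ssd"]
```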
Similar to node affinity, there are two types of workload affinity:
Must be satisfied: (requiredDuringSchedulingIgnoredDuringExecution) The scheduler can only run scheduling when the rules are satisfied. This functionality is similar to nodeSelector , but with a more expressive syntax. You can define multiple hard constraint rules, but only one of them must be satisfied.
Satisfy as much as possible: (preferredDuringSchedulingIgnoredDuringExecution) The scheduler will try to find nodes that meet the proper rules. If no matching node is found, the scheduler will still schedule the Pod. You can also set weights for soft constraint rules. During specific scheduling, if there are multiple nodes that meet the conditions, the node with the highest weight will be scheduled first. At the same time, you can also define multiple hard constraint rules, but only one of them needs to be satisfied.
The affinity of the workload is mainly used to determine which Pods of the workload can be deployed in the same topology domain. For example, services that communicate with each other can be deployed in the same topology domain (such as the same availability zone) by applying affinity scheduling to reduce the network delay between them.
Similar to node affinity, there are two types of anti-affinity for workloads:
Must be satisfied: (requiredDuringSchedulingIgnoredDuringExecution) The scheduler can only run scheduling when the rules are satisfied. This functionality is similar to nodeSelector , but with a more expressive syntax. You can define multiple hard constraint rules, but only one of them must be satisfied.
Satisfy as much as possible: (preferredDuringSchedulingIgnoredDuringExecution) The scheduler will try to find nodes that meet the proper rules. If no matching node is found, the scheduler will still schedule the Pod. You can also set weights for soft constraint rules. During specific scheduling, if there are multiple nodes that meet the conditions, the node with the highest weight will be scheduled first. At the same time, you can also define multiple hard constraint rules, but only one of them needs to be satisfied.
The anti-affinity of the workload is mainly used to determine which Pods of a workload cannot be deployed in the same topology domain. For example, Pods of the same workload can be spread across different topology domains (such as different hosts) to improve the stability of the workload itself, as in the sketch below.
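A hedged sketch that spreads replicas of the same workload across hosts; the label selector is illustrative:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values: ["web"]                    # illustrative workload label
        topologyKey: kubernetes.io/hostname      # one topology domain per host
```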
A workload is an application running on Kubernetes. Whether your application consists of a single component or of many different components, you can run it with a set of Pods. Kubernetes provides five built-in workload resources to manage Pods:
Deployment
StatefulSet
DaemonSet
Job
CronJob
You can also extend workload resources by defining Custom Resources (CRDs). The fifth-generation container management module supports full lifecycle management of workloads, including creation, update, scaling, monitoring, logging, deletion, and version management.
A Pod is the smallest computing unit created and managed in Kubernetes, that is, a collection of containers. These containers share storage, networking, and the management policies that control how they run. Pods are typically not created directly by users but through workload resources. Pods follow a predefined lifecycle: they start in the Pending phase, enter Running if at least one primary container starts normally, and then move to Succeeded or Failed depending on whether any container in the Pod terminated in a failed status.
The fifth-generation container management module designs a built-in workload life cycle status set based on factors such as Pod status and number of replicas, so that users can more realistically perceive the running status of workloads. Because different workload types (such as Deployment and Jobs) have inconsistent management mechanisms for Pods, different workloads will have different lifecycle status during operation, as shown in the following table.
"},{"location":"en/admin/kpanda/workloads/pod-config/workload-status.html#deployment-statefulset-damemonset-status","title":"Deployment, StatefulSet, DamemonSet Status","text":"Status Description Waiting 1. A workload is in this status while its creation is in progress. 2. After an upgrade or rollback action is triggered, the workload is in this status. 3. Trigger operations such as pausing/scaling, and the workload is in this status. Running This status occurs when all instances under the workload are running and the number of replicas matches the user-defined number. Deleting When a delete operation is performed, the payload is in this status until the delete is complete. Exception Unable to get the status of the workload for some reason. This usually occurs because communication with the pod's host has failed. Not Ready When the container is in an abnormal, pending status, this status is displayed when the workload cannot be started due to an unknown error"},{"location":"en/admin/kpanda/workloads/pod-config/workload-status.html#job-status","title":"Job Status","text":"Status Description Waiting The workload is in this status while Job creation is in progress. Executing The Job is in progress and the workload is in this status. Execution Complete The Job execution is complete and the workload is in this status. Deleting A delete operation is triggered and the workload is in this status. Exception Pod status could not be obtained for some reason. This usually occurs because communication with the pod's host has failed."},{"location":"en/admin/kpanda/workloads/pod-config/workload-status.html#cronjob-status","title":"CronJob status","text":"Status Description Waiting The CronJob is in this status when it is being created. Started After the CronJob is successfully created, the CronJob is in this status when it is running normally or when the paused task is started. Stopped The CronJob is in this status when the stop task operation is performed. Deleting The deletion operation is triggered, and the CronJob is in this status.
When a workload is in an abnormal or unready status, you can hover the mouse over its status value, and the system will display more detailed error information in a tooltip. You can also view the logs or events to obtain the workload's runtime information.
"},{"location":"en/admin/register/bindws.html#steps-to-follow","title":"Steps to Follow","text":"
Log in to the AI platform as an administrator.
Navigate to Global Management -> Workspace and Folder, and click Create Workspace.
Enter the workspace name, select a folder, and click OK to create a workspace.
Bind resources to the workspace.
On this interface, you can click Create Namespace to create a namespace.
Add authorization: Assign the user to the workspace.
The user logs in to the AI platform to check if they have permissions for the workspace and namespace. The administrator can perform more actions through the ┇ on the right side.
Next step: Allocate Resources for the Workspace
"},{"location":"en/admin/register/wsres.html","title":"Allocate Resources to the Workspace","text":"
After binding a user to a workspace, it is necessary to allocate appropriate resources to the workspace.
Navigate to Global Management -> Workspace and Folder, find the workspace to which you want to add resources, and click Add Shared Resources.
Select the cluster, set the appropriate resource quota, and then click OK
Return to the shared resources page. Resources have been successfully allocated to the workspace, and the administrator can modify them at any time using the ┇ on the right side.
AI platform provides a fully automated security implementation for containers, Pods, images, runtimes, and microservices. The following table lists some of the security features that have been implemented or are in the process of being implemented.
Security Features Specific Items Description Image security Trusted image distribution Key pairs and signature information are required to achieve secure transport of images. A key can be selected for image signing during image transmission. Runtime security Event correlation analysis Supports correlation and risk analysis of security events detected at runtime to enhance attack traceability. Supports alert convergence to reduce invalid alerts and improve event response efficiency. - Container decoy repository The container decoy repository is equipped with common decoys including but not limited to: unauthorized access vulnerabilities, code execution vulnerabilities, local file reading vulnerabilities, remote command execution (RCE) vulnerabilities, and other container decoys. - Container decoy deployment Supports custom decoy containers, including service names, service locations, etc. - Container decoy alerting Supports alerting on suspicious behavior in container decoys. - Offset detection While scanning the image, learns all the binary information in the image and forms a "whitelist", allowing only the binaries in the "whitelist" to run after the container goes online. This ensures that the container cannot run unauthorized executable files (such as illegally downloaded ones). Micro-isolation Intelligent recommendation of isolation policies Supports recording historical access traffic to resources and intelligently recommending policies based on that traffic when configuring isolation policies for resources. - Tenant isolation Supports isolation control of tenants in Kubernetes clusters, with the ability to set different network security groups for different tenants, and supports tenant-level security policies to achieve inter-tenant network access and isolation. Microservices security Service and API security scanning Supports automatic, manual, and periodic scanning of services and APIs within a cluster. Supports all traditional web scanning items, including XSS vulnerabilities, SQL injection, command/code injection, directory enumeration, path traversal, XML entity injection, PoC, file upload, weak passwords, JSONP, SSRF, arbitrary redirects, CRLF injection, and other risks. For vulnerabilities found in the container environment, supports displaying the vulnerability type, URL, parameters, danger level, test method, etc."},{"location":"en/admin/security/falco-exporter.html","title":"What is Falco-exporter","text":"
Falco-exporter is a Prometheus Metrics exporter for Falco output events.
Falco-exporter is deployed as a DaemonSet on a Kubernetes cluster. If Prometheus is installed and running in the cluster, metrics provided by Falco-exporter will be automatically discovered.
This section describes how to install Falco-exporter.
Note
Before installing and using Falco-exporter, you need to install and run Falco with gRPC output enabled (enabled via Unix sockets by default). For more information on enabling gRPC output in the Falco Helm Chart, see Enabling gRPC.
Please confirm that your cluster has successfully connected to the Container Management platform, and then perform the following steps to install Falco-exporter.
Click Container Management->Clusters in the left navigation bar, then find the cluster name where you want to install Falco-exporter.
In the left navigation bar, select Helm Releases -> Helm Charts, and then find and click falco-exporter.
Select the version you want to install in Version and click Install.
On the installation screen, fill in the required installation parameters.
Fill in application name, namespace, version, etc.
Fill in the following parameters:
Falco Prometheus Exporter -> Image Settings -> Registry: set the repository address of the falco-exporter image, which is already filled with the available online repositories by default. If it is a private environment, you can change it to a private repository address.
Falco Prometheus Exporter -> Image Settings -> Repository: set the falco-exporter image name.
Falco Prometheus Exporter -> Prometheus ServiceMonitor Settings -> Install ServiceMonitor: install Prometheus Operator service monitor. It is enabled by default.
Falco Prometheus Exporter -> Prometheus ServiceMonitor Settings -> Scrape Interval: user-defined interval; if not specified, the Prometheus default interval is used.
Falco Prometheus Exporter -> Prometheus ServiceMonitor Settings -> Scrape Timeout: user-defined scrape timeout; if not specified, the Prometheus default scrape timeout is used.
In the screen as above, fill in the following parameters:
Falco Prometheus Exporter -> Prometheus prometheusRules -> Install prometheusRules: create PrometheusRules to alert on priority events. It is enabled by default.
Falco Prometheus Exporter -> Prometheus prometheusRules -> Alerts settings: set whether alerts are enabled for different levels of log events, the interval between alerts, and the threshold for alerts.
Click the OK button at the bottom right corner to complete the installation.
Please confirm that your cluster has successfully connected to the Container Management platform, and then perform the following steps to install Falco.
Click Container Management->Clusters in the left navigation bar, then find the cluster name where you want to install Falco.
In the left navigation bar, select Helm Releases -> Helm Charts, and then find and click Falco.
Select the version you want to install in Version, and click Install.
On the installation page, fill in the required installation parameters.
Fill in the application name, namespace, version, etc.
Fill in the following parameters:
Falco -> Image Settings -> Registry: set the repository address of the Falco image, which is already filled with the available online repositories by default. If it is a private environment, you can change it to a private repository address.
Falco -> Image Settings -> Repository: set the Falco image name.
Falco -> Falco Driver -> Image Settings -> Registry: set the repository address of the Falco Driver image, which is already filled with available online repositories by default. If it is a private environment, you can change it to a private repository address.
Falco -> Falco Driver -> Image Settings -> Repository: set the Falco Driver image name.
Falco -> Falco Driver -> Image Settings -> Driver Kind: set the Driver Kind, providing the following two options.
ebpf: use ebpf to detect events, which requires the Linux kernel to support ebpf and enable CONFIG_BPF_JIT and sysctl net.core.bpf_jit_enable=1.
module: use kernel module detection with limited OS version support. Refer to module support system version.
Falco -> Falco Driver -> Image Settings -> Log Level: the minimum log level to be included in the log.
Click the OK button in the bottom right corner to complete the installation.
"},{"location":"en/admin/security/falco.html","title":"What is Falco","text":"
Falco is a cloud-native runtime security tool designed to detect anomalous activity in applications, and can be used to monitor the runtime security of Kubernetes applications and internal components. With only a set of rules, Falco can continuously monitor and watch for anomalous activity in containers, applications, hosts, and networks.
"},{"location":"en/admin/security/falco.html#what-does-falco-detect","title":"What does Falco detect?","text":"
Falco can detect and alert on any behavior involving Linux system calls. Falco alerts can be triggered using specific system calls, parameters, and properties of the calling process. For example, Falco can easily detect events including but not limited to the following:
A shell is running inside a container or pod in Kubernetes.
A container is running in privileged mode or mounting a sensitive path, such as /proc, from the host.
A server process is spawning a child process of an unexpected type.
A sensitive file, such as /etc/shadow, is being read unexpectedly.
A non-device file is being written to /dev.
A standard system binary, such as ls, is making an outbound network connection.
A privileged pod is started in a Kubernetes cluster.
For more information on the default rules that come with Falco, see the Rules documentation.
"},{"location":"en/admin/security/falco.html#what-are-falco-rules","title":"What are Falco rules?","text":"
Falco rules define the behavior and events that Falco should monitor. Rules can be written in the Falco rules file or in a generic configuration file. For more information on writing, managing and deploying rules, see Falco Rules.
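A minimal sketch of what a single rule looks like in a Falco rules file; the condition and output are simplified compared with the production rules shipped with Falco:

```yaml
- rule: Terminal shell in container
  desc: A shell was spawned inside a container (simplified example)
  condition: >
    spawned_process and container
    and proc.name in (bash, sh)
  output: "Shell started in container (user=%user.name container=%container.name command=%proc.cmdline)"
  priority: WARNING
```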
"},{"location":"en/admin/security/falco.html#what-are-falco-alerts","title":"What are Falco Alerts?","text":"
Alerts are configurable downstream actions that can be as simple as logging to STDOUT or as complex as passing a gRPC call to a client. For more information on configuring, understanding, and developing alerts, see Falco Alerts. Falco can send alerts to:
Standard output
A file
A system log
A spawned program
An HTTP[s] endpoint
A client via the gRPC API
"},{"location":"en/admin/security/falco.html#what-are-the-components-of-falco","title":"What are the components of Falco?","text":"
Falco consists of the following main components:
Userspace program: a CLI tool that can be used to interact with Falco. The userspace program handles signals, parses messages from a Falco driver, and sends alerts.
Configuration: define how Falco is run, what rules to assert, and how to perform alerts. For more information, see Configuration.
Driver: a software that adheres to the Falco driver specification and sends a stream of system call information. You cannot run Falco without installing a driver. Currently, Falco supports the following drivers:
Kernel module built on libscap and libsinsp C++ libraries (default)
BPF probe built from the same modules
Userspace instrumentation
For more information, see Falco drivers.
Plugins: allow users to extend the functionality of falco libraries/falco executable by adding new event sources and new fields that can extract information from events. For more information, see Plugins.
Notebook typically refers to Jupyter Notebook or similar interactive computing environments. This is a very popular tool widely used in fields such as data science, machine learning, and deep learning. This page explains how to use Notebook on the AI platform.
Enter a name, select a cluster, namespace, choose the newly created queue, and click One-click Initialization.
Select Notebook type, configure memory and CPU, enable GPU, create and configure PVC:
Enable SSH external access:
You will be automatically redirected to the Notebook instance list; click the instance name.
Enter the Notebook instance details page and click the Open button in the upper right corner.
You will enter the Notebook development environment, where a persistent volume is mounted in the /home/jovyan directory. You can clone code using git and upload data after connecting via SSH, etc.
"},{"location":"en/admin/share/notebook.html#accessing-notebook-instances-via-ssh","title":"Accessing Notebook Instances via SSH","text":"
Generate an SSH key pair on your own computer.
Open the command line on your computer, for example, open git bash on Windows, and enter ssh-keygen.exe -t rsa, then press enter until completion.
Use commands like cat ~/.ssh/id_rsa.pub to view and copy the public key.
Log in to the AI platform as a user, click Personal Center in the upper right corner -> SSH Public Key -> Import SSH Public Key.
Go to the details page of the Notebook instance and copy the SSH link.
Use SSH to access the Notebook instance from the client, for example:
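A hedged example of this step; the address and port come from the SSH link copied on the instance details page, and the jovyan user name is an assumption based on the default home directory shown earlier:

```bash
# Replace the placeholders with the values from the copied SSH link
ssh -i ~/.ssh/id_rsa -p <ssh-port> jovyan@<notebook-ssh-address>
```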
Administrator has assigned a workspace to the user
Resource quotas have been set for the workspace
A cluster has been created
"},{"location":"en/admin/share/workload.html#steps-to-create-ai-workloads","title":"Steps to Create AI Workloads","text":"
Log in to the AI platform as a User.
Navigate to Container Management, select a namespace, then click Workloads -> Deployments, and then click the Create from Image button on the right.
After configuring the parameters, click OK.
Basic InformationContainer ConfigurationOthers
Select your own namespace.
Set the image, configure resources such as CPU, memory, and GPU, and set the startup command.
Service configuration and advanced settings can use default configurations.
Automatically return to the stateless workload list and click the workload name.
Enter the details page to view the GPU quota.
You can also enter the console and run the mx-smi command to check the GPU resources.
Next step: Using Notebook
"},{"location":"en/admin/virtnest/best-practice/import-ubuntu.html","title":"Import a Linux Virtual Machine with Ubuntu from an External Platform","text":"
This page provides a detailed introduction on how to import Linux virtual machines from the external VMware platform into AI platform virtual machines through the command line.
Info
The external virtualization platform in this document is VMware vSphere Client, abbreviated as vSphere. Technically, the import relies on KubeVirt CDI. Before proceeding, the virtual machine to be imported needs to be shut down on vSphere. This page takes a virtual machine running the Ubuntu operating system as an example.
"},{"location":"en/admin/virtnest/best-practice/import-ubuntu.html#fetch-basic-information-of-vsphere-virtual-machine","title":"Fetch Basic Information of vSphere Virtual Machine","text":"
vSphere URL: Fetch information on the URL of the target platform
vSphere SSL Certificate Thumbprint: Needs to be fetched using openssl
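One common way to obtain the thumbprint is shown below, assuming the vCenter address vcsa.daocloud.io that appears in the sample output; adjust the address for your environment. The truncated output that follows is typical of such a command:

```bash
openssl s_client -connect vcsa.daocloud.io:443 </dev/null | openssl x509 -noout -fingerprint -sha1
```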
Can't use SSL_get_servername\ndepth=0 CN = vcsa.daocloud.io\nverify error:num=20:unable to get local issuer certificate\nverify return:1\ndepth=0 CN = vcsa.daocloud.io\nverify error:num=21:unable to verify the first certificate\nverify return:1\ndepth=0 CN = vcsa.daocloud.io\nverify return:1\nDONE\nsha1 Fingerprint=C3:9D:D7:55:6A:43:11:2B:DE:BA:27:EA:3B:C2:13:AF:E4:12:62:4D # Value needed\n
vSphere Account: Fetch account information for vSphere, and pay attention to permissions
vSphere Password: Fetch password information for vSphere
UUID of the virtual machine to be imported: Need to be fetched on the web page of vSphere
Access the vSphere page, go to the details page of the virtual machine to be imported, click Edit Settings , open the browser's developer console at this point, click Network -> Headers , and find the URL as shown in the image below.
Click Response , locate vmConfigContext -> config , and finally find the target value uuid .
Path of the vmdk file of the virtual machine to be imported
Different information needs to be configured based on the chosen network mode. If a fixed IP address is required, you should select the Bridge network mode.
Create a Multus CR of the ovs type. Refer to Creating a Multus CR.
Create subnets and IP pools. Refer to Creating Subnets and IP Pools.
apiVersion: v1\nkind: Secret\nmetadata:\n name: vsphere # Can be changed\n labels:\n app: containerized-data-importer # Do not change\ntype: Opaque\ndata:\n accessKeyId: \"username-base64\"\n secretKey: \"password-base64\"\n
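The accessKeyId and secretKey values are the vSphere account and password encoded in base64, which can be generated as shown below; the credentials here are placeholders:

```bash
echo -n "administrator@vsphere.local" | base64   # result goes into accessKeyId
echo -n "your-password" | base64                 # result goes into secretKey
```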
"},{"location":"en/admin/virtnest/best-practice/import-ubuntu.html#write-a-kubevirt-vm-yaml-to-create-vm","title":"Write a KubeVirt VM YAML to create VM","text":"
Tip
If a fixed IP address is required, the YAML configuration differs slightly from the one used for the default network. These differences have been highlighted.
"},{"location":"en/admin/virtnest/best-practice/import-ubuntu.html#access-vnc-to-verify-successful-operation","title":"Access VNC to verify successful operation","text":"
Modify the network configuration of the virtual machine
Check the current network
When the import has actually completed, the network configuration will look like the one shown below. Note, however, that the enp1s0 interface does not contain the inet field, so it cannot connect to the external network.
Configure netplan
In the netplan configuration shown above, change the interface object under ethernets to enp1s0 and set it to obtain an IP address using DHCP, for example:
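A minimal sketch of the resulting netplan file (for example /etc/netplan/00-installer-config.yaml; the actual file name may differ in your image):

```yaml
network:
  version: 2
  ethernets:
    enp1s0:
      dhcp4: true    # obtain an IP address via DHCP
```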
Apply the netplan configuration to the system network configuration
sudo netplan apply\n
Perform a ping test on the external network
Access the virtual machine on the node via SSH.
"},{"location":"en/admin/virtnest/best-practice/import-windows.html","title":"Import a Windows Virtual Machine from the External Platform","text":"
This page provides a detailed introduction on how to import virtual machines from an external platform -- VMware, into the virtual machines of AI platform using the command line.
Info
The external virtual platform on this page is VMware vSphere Client, abbreviated as vSphere. Technically, it relies on kubevirt cdi for implementation. Before proceeding, the virtual machine imported on vSphere needs to be shut down. Take a virtual machine of the Windows operating system as an example.
Before importing, refer to the Network Configuration to prepare the environment.
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#fetch-information-of-the-windows-virtual-machine","title":"Fetch Information of the Windows Virtual Machine","text":"
Similar to importing a virtual machine with a Linux operating system, refer to Importing a Linux Virtual Machine with Ubuntu from an External Platform to get the following information:
vSphere account and password
vSphere virtual machine information
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#check-the-boot-type-of-windows","title":"Check the Boot Type of Windows","text":"
When importing a virtual machine from an external platform into the AI platform virtualization platform, you need to configure it according to the boot type (BIOS or UEFI) to ensure it can boot and run correctly.
You can check whether Windows uses BIOS or UEFI through \"System Summary.\" If it uses UEFI, you need to add the relevant information in the YAML file.
Prepare the window.yaml file and pay attention to the following configurations:
PVC booting Virtio drivers
Disk bus type, set to SATA or Virtio depending on the boot type
UEFI configuration (if UEFI is used)
Click to view the window.yaml example window.yaml
apiVersion: kubevirt.io/v1\nkind: VirtualMachine\nmetadata:\n labels:\n virtnest.io/os-family: windows\n virtnest.io/os-version: \"server2019\"\n name: export-window-21\n namespace: default\nspec:\n dataVolumeTemplates:\n - metadata:\n name: export-window-21-rootdisk\n spec:\n pvc:\n accessModes:\n - ReadWriteOnce\n resources:\n requests:\n storage: 22Gi\n storageClassName: local-path\n source:\n vddk:\n backingFile: \"[A05-09-ShangPu-Local-DataStore] virtnest-export-window/virtnest-export-window.vmdk\"\n url: \"https://10.64.56.21\"\n uuid: \"421d40f2-21a2-cfeb-d5c9-e7f8abfc2faa\"\n thumbprint: \"D7:C4:22:E3:6F:69:DA:72:50:81:12:FA:42:18:3F:29:5C:7F:41:CA\"\n secretRef: \"vsphere21\"\n initImageURL: \"release.daocloud.io/virtnest/vddk:v8\"\n - metadata:\n name: export-window-21-datadisk\n spec:\n pvc:\n accessModes:\n - ReadWriteOnce\n resources:\n requests:\n storage: 1Gi\n storageClassName: local-path\n source:\n vddk:\n backingFile: \"[A05-09-ShangPu-Local-DataStore] virtnest-export-window/virtnest-export-window_1.vmdk\"\n url: \"https://10.64.56.21\"\n uuid: \"421d40f2-21a2-cfeb-d5c9-e7f8abfc2faa\"\n thumbprint: \"D7:C4:22:E3:6F:69:DA:72:50:81:12:FA:42:18:3F:29:5C:7F:41:CA\"\n secretRef: \"vsphere21\"\n initImageURL: \"release.daocloud.io/virtnest/vddk:v8\"\n # <1>. PVC for booting Virtio drivers\n # \u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\n - metadata:\n name: virtio-disk\n spec:\n pvc:\n accessModes:\n - ReadWriteOnce\n resources:\n requests:\n storage: 10Mi\n storageClassName: local-path\n source:\n blank: {}\n # \u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\n running: true\n template:\n metadata:\n annotations:\n ipam.spidernet.io/ippools: '[{\"cleangateway\":false,\"ipv4\":[\"test86\"]}]'\n spec:\n dnsConfig:\n nameservers:\n - 223.5.5.5\n domain:\n cpu:\n cores: 2\n memory:\n guest: 4Gi\n devices:\n disks:\n - bootOrder: 1\n disk:\n bus: sata # <2> Disk bus type, set to SATA or Virtio depending on the boot type\n name: rootdisk\n - bootOrder: 2\n disk:\n bus: sata # <2> Disk bus type, set to SATA or Virtio depending on the boot type\n name: datadisk\n # <1>. 
disk for booting Virtio drivers\n # \u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\n - bootOrder: 3\n disk:\n bus: virtio\n name: virtdisk\n - bootOrder: 4\n cdrom:\n bus: sata\n name: virtiocontainerdisk\n # \u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\n interfaces:\n - bridge: {}\n name: ovs-bridge0\n # <3> In the above section \"Check the Boot Type of Windows\"\n # If using UEFI, add the following information\n # \u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\n features:\n smm:\n enabled: true\n firmware:\n bootloader:\n efi:\n secureBoot: false\n # \u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\n machine:\n type: q35\n resources:\n requests:\n memory: 4Gi\n networks:\n - multus:\n default: true\n networkName: kube-system/test1\n name: ovs-bridge0\n volumes:\n - dataVolume:\n name: export-window-21-rootdisk\n name: rootdisk\n - dataVolume:\n name: export-window-21-datadisk\n name: datadisk \n # <1> Volumes for booting Virtio drivers\n # \u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\u2193\n - dataVolume:\n name: virtio-disk\n name: virtdisk\n - containerDisk:\n image: release-ci.daocloud.io/virtnest/kubevirt/virtio-win:v4.12.12-5\n name: virtiocontainerdisk\n # \u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\u2191\n
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#install-virtio-drivers-via-vnc","title":"Install VirtIO Drivers via VNC","text":"
Access and connect to the virtual machine via VNC.
Download and install the appropriate VirtIO drivers based on the Windows version.
Enable Remote Desktop to facilitate future connections via RDP.
After installation, update the YAML file and reboot the virtual machine.
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#update-yaml-after-reboot","title":"Update YAML After Reboot","text":"Click to view the modified `window.yaml` example window.yaml
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#access-and-verify-via-rdp","title":"Access and Verify via RDP","text":"
Use an RDP client to connect to the virtual machine. Log in with the default account admin and password dangerous!123.
Verify network access and data disk data
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#differences-between-importing-linux-and-windows-virtual-machines","title":"Differences Between Importing Linux and Windows Virtual Machines","text":"
Windows may require UEFI configuration.
Windows typically requires the installation of VirtIO drivers.
Windows multi-disk imports usually do not require re-mounting of disks.
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html","title":"Create a Windows Virtual Machine","text":"
This document will explain how to create a Windows virtual machine via the command line.
Before creating a Windows virtual machine, it is recommended to first refer to installing dependencies and prerequisites for the virtual machine module to ensure that your environment is ready.
During the creation process, it is recommended to refer to the official documentation: Installing Windows documentation, Installing Windows related drivers.
It is recommended to access the Windows virtual machine using the VNC method.
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html#import-an-iso-image","title":"Import an ISO Image","text":"
Creating a Windows virtual machine requires importing an ISO image primarily to install the Windows operating system. Unlike Linux operating systems, the Windows installation process usually involves booting from an installation disc or ISO image file. Therefore, when creating a Windows virtual machine, it is necessary to first import the installation ISO image of the Windows operating system so that the virtual machine can be installed properly.
Here are two methods for importing ISO images:
(Recommended) Creating a Docker image. It is recommended to refer to building images.
(Not recommended) Using virtctl to import the image into a Persistent Volume Claim (PVC).
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html#create-a-windows-virtual-machine-using-yaml","title":"Create a Windows Virtual Machine Using YAML","text":"
Creating a Windows virtual machine using YAML is more flexible and easier to write and maintain. Below are three reference YAML examples:
(Recommended) Using Virtio drivers + Docker image:
If you need storage capabilities (mounting disks), please install the viostor driver.
If you need to use network capabilities, please install NetKVM drivers.
(Not recommended) In a scenario where Virtio drivers are not used, importing the image into a Persistent Volume Claim (PVC) using the virtctl tool. The virtual machine may use other types of drivers or default drivers to operate disk and network devices.
For Windows virtual machines, remote desktop control access is often required. It is recommended to use Microsoft Remote Desktop to control your virtual machine.
Note
Your Windows version must support remote desktop control to use Microsoft Remote Desktop.
You need to disable the Windows firewall.
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html#add-data-disks","title":"Add Data Disks","text":"
Adding a data disk to a Windows virtual machine follows the same process as adding one to a Linux virtual machine. You can refer to the provided YAML example for guidance.
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html#snapshots-cloning-live-migration","title":"Snapshots, Cloning, Live Migration","text":"
These capabilities are consistent with Linux virtual machines and can be configured using the same methods.
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html#access-your-windows-virtual-machine","title":"Access Your Windows Virtual Machine","text":"
After successful creation, access the virtual machine list page to confirm that the virtual machine is running properly.
Click the console access (VNC) to access it successfully.
"},{"location":"en/admin/virtnest/gpu/vm-gpu.html","title":"Configure GPU Passthrough for Virtual Machines","text":"
This page will explain the prerequisites for configuring GPU when creating a virtual machine.
The key to configuring GPU for virtual machines is to configure the GPU Operator to deploy different software components on the worker nodes, depending on the GPU workload configuration. Here are three example nodes:
The controller-node-1 node is configured to run containers.
The work-node-1 node is configured to run virtual machines with GPU passthrough.
The work-node-2 node is configured to run virtual machines with vGPU.
"},{"location":"en/admin/virtnest/gpu/vm-gpu.html#assumptions-limitations-and-dependencies","title":"Assumptions, Limitations, and Dependencies","text":"
The worker nodes can run GPU-accelerated containers, virtual machines with GPU passthrough, or virtual machines with vGPU. However, a combination of any of these is not supported.
The cluster administrator or developer needs to have prior knowledge of the cluster and correctly label the nodes to indicate the type of GPU workload they will run.
The worker node that runs a GPU-accelerated virtual machine with GPU passthrough or vGPU is assumed to be a bare metal machine. If the worker node is a virtual machine, the GPU passthrough feature needs to be enabled on the virtual machine platform. Please consult your virtual machine platform provider for guidance.
Nvidia MIG is not supported for vGPU.
The GPU Operator does not automatically install GPU drivers in the virtual machine.
To enable GPU passthrough, the cluster nodes need to have IOMMU enabled. Refer to How to Enable IOMMU. If your cluster is running on a virtual machine, please consult your virtual machine platform provider.
"},{"location":"en/admin/virtnest/gpu/vm-gpu.html#label-the-cluster-nodes","title":"Label the Cluster Nodes","text":"
Go to Container Management, select your worker cluster, click Node Management, and then click Modify Labels in the action bar to add labels to the nodes. Each node can only have one label.
You can assign one of the following values to the label: container, vm-passthrough, or vm-vgpu. For example:
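For example, using the node names from the list above and the nvidia.com/gpu.workload.config label key referenced later on this page:

```bash
kubectl label node controller-node-1 nvidia.com/gpu.workload.config=container
kubectl label node work-node-1 nvidia.com/gpu.workload.config=vm-passthrough
kubectl label node work-node-2 nvidia.com/gpu.workload.config=vm-vgpu
```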
Go to Container Management, select your worker cluster, click Helm Apps -> Helm Chart , and choose and install gpu-operator. Modify the relevant fields in the yaml.
gpu-operator.sandboxWorkloads.enabled=true\ngpu-operator.vfioManager.enabled=true\ngpu-operator.sandboxDevicePlugin.enabled=true\ngpu-operator.sandboxDevicePlugin.version=v1.2.4 // version should be >= v1.2.4\ngpu-operator.toolkit.version=v1.14.3-ubuntu20.04\n
Wait for the installation to succeed, as shown in the following image:
"},{"location":"en/admin/virtnest/gpu/vm-gpu.html#install-virtnest-agent-and-configure-cr","title":"Install virtnest-agent and Configure CR","text":"
Install virtnest-agent, refer to Install virtnest-agent.
Add vGPU and GPU passthrough to the Virtnest Kubevirt CR. The following example shows the relevant yaml after adding vGPU and GPU passthrough:
GPU information registered by GPU Operator on the node
Example of obtaining vGPU information (only applicable to vGPU): view the node information on the node labeled nvidia.com/gpu.workload.config=vm-vgpu, such as work-node-2. In the Capacity section, nvidia.com/GRID_P4-1Q: 8 indicates that 8 vGPUs are available:
In this case, the mdevNameSelector should be \"GRID P4-1Q\" and the resourceName should be \"GRID_P4-1Q\".
Get GPU passthrough information: On the node marked as nvidia.com/gpu.workload.config=vm-passthrough (work-node-1 in this example), view the node information. In the Capacity section, nvidia.com/GP104GL_TESLA_P4: 2 indicates that 2 GPUs are available for passthrough:
In this case, the resourceName should be \"GP104GL_TESLA_P4\". To obtain the pciVendorSelector, use SSH to log in to the target node work-node-1 and run the lspci -nnk -d 10de: command to obtain the Nvidia GPU PCI information; the vendor and device ID in the output (for example 10DE:1BB3) form the pciVendorSelector value.
Note when editing the kubevirt CR: if there are multiple GPUs of the same model, you only need to declare the model once in the CR; there is no need to list every individual GPU.
For GPU passthrough, in the example above there are two TESLA P4 GPUs, so the model only needs to be registered once.
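A minimal sketch of the permittedHostDevices section of the kubevirt CR, assuming the GRID_P4-1Q vGPU and TESLA P4 passthrough devices from the example above; the CR namespace and the pciVendorSelector value 10DE:1BB3 are illustrative and must be taken from your own environment:
apiVersion: kubevirt.io/v1\nkind: KubeVirt\nmetadata:\n  name: kubevirt\n  namespace: kubevirt   # adjust to the namespace where the kubevirt CR lives in your cluster\nspec:\n  configuration:\n    developerConfiguration:\n      featureGates:\n        - GPU   # enable the GPU feature gate if it is not already enabled\n    permittedHostDevices:\n      mediatedDevices:   # vGPU\n        - mdevNameSelector: GRID P4-1Q\n          resourceName: nvidia.com/GRID_P4-1Q\n      pciHostDevices:    # GPU passthrough\n        - externalResourceProvider: true   # device is registered by the GPU Operator sandbox device plugin\n          pciVendorSelector: 10DE:1BB3\n          resourceName: nvidia.com/GP104GL_TESLA_P4\n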
"},{"location":"en/admin/virtnest/gpu/vm-gpu.html#create-vm-using-yaml-and-enable-gpu-acceleration","title":"Create VM Using YAML and Enable GPU Acceleration","text":"
The only difference from a regular virtual machine is adding GPU-related information in the devices section.
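As a hedged illustration, a passthrough GPU is requested through a fragment like the following under spec.template.spec.domain of the VirtualMachine (the device name gpu1 is arbitrary; deviceName must match the resource registered on the node):
devices:\n  gpus:\n    - name: gpu1   # arbitrary name for the device inside the VM spec\n      deviceName: nvidia.com/GP104GL_TESLA_P4   # must match the resourceName registered in the kubevirt CR\n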
"},{"location":"en/admin/virtnest/gpu/vm-vgpu.html","title":"Configure GPU (vGPU) for Virtual Machines","text":"
This page will explain the prerequisites for configuring GPU when creating a virtual machine.
The key to configuring GPU for virtual machines is to configure the GPU Operator to deploy different software components on the worker nodes, depending on the GPU workload configuration. Here are three example nodes:
The controller-node-1 node is configured to run containers.
The work-node-1 node is configured to run virtual machines with GPU passthrough.
The work-node-2 node is configured to run virtual machines with vGPU.
"},{"location":"en/admin/virtnest/gpu/vm-vgpu.html#assumptions-limitations-and-dependencies","title":"Assumptions, Limitations, and Dependencies","text":"
The worker nodes can run GPU-accelerated containers, virtual machines with GPU passthrough, or virtual machines with vGPU individually, but not in any combination.
The cluster administrator or developer needs to have prior knowledge of the cluster and correctly label the nodes to indicate the type of GPU workload they will run.
The worker node that runs a GPU-accelerated virtual machine with GPU passthrough or vGPU is assumed to be a bare metal machine. If the worker node is a virtual machine, the GPU passthrough feature needs to be enabled on the virtual machine platform. Please consult your virtual machine platform provider for guidance.
Nvidia MIG is not supported for vGPU.
The GPU Operator does not automatically install GPU drivers in the virtual machine.
To enable GPU passthrough, the cluster nodes need to have IOMMU enabled. Please refer to How to Enable IOMMU. If your cluster is running on a virtual machine, please consult your virtual machine platform provider.
Note: This step is only required when using NVIDIA vGPU. If you plan to use GPU passthrough only, skip this section.
Follow these steps to build the vGPU Manager image and push it to the container registry:
Download the vGPU software from the NVIDIA Licensing Portal.
Log in to the NVIDIA Licensing Portal and go to the Software Downloads page.
The NVIDIA vGPU software is located in the Driver downloads tab on the Software Downloads page.
Select VGPU + Linux in the filter criteria and click Download to get the Linux KVM package. Unzip the downloaded file to obtain the vGPU Manager installer (NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run).
Open a terminal and clone the container-images/driver repository.
git clone https://gitlab.com/nvidia/container-images/driver\ncd driver\n
Switch to the vgpu-manager directory corresponding to your operating system.
cd vgpu-manager/<your-os>\n
Copy the .run file extracted in step 1 to the current directory.
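Building and pushing the image then typically looks like the sketch below; the registry address, driver version, OS tag, and build-argument name are placeholders, so check the Dockerfile in the vgpu-manager directory for the exact arguments:
export PRIVATE_REGISTRY=registry.example.com/nvidia   # your container registry (placeholder)\nexport VERSION=510.73.06                               # version taken from the .run file name (placeholder)\nexport OS_TAG=ubuntu20.04                              # matches the directory chosen above (placeholder)\n\ndocker build --build-arg DRIVER_VERSION=${VERSION} -t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} .\ndocker push ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG}\n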
"},{"location":"en/admin/virtnest/gpu/vm-vgpu.html#label-the-cluster-nodes","title":"Label the Cluster Nodes","text":"
Go to Container Management, select your worker cluster, click Node Management, and then click Modify Labels in the action bar to add labels to the nodes. Each node can only have one label.
You can assign the following values to the labels: container, vm-passthrough, and vm-vgpu.
Go to Container Management, select your worker cluster, click Helm Apps -> Helm Chart, and choose and install gpu-operator. Modify the relevant fields in the yaml.
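For vGPU, the fields to adjust typically include the sandbox workload and vGPU Manager settings; the sketch below is an assumption, and the repository, image, and version values are placeholders that should point to the vGPU Manager image built earlier:
gpu-operator.sandboxWorkloads.enabled=true\ngpu-operator.vgpuManager.enabled=true\ngpu-operator.vgpuManager.repository=registry.example.com/nvidia   # registry holding the vGPU Manager image\ngpu-operator.vgpuManager.image=vgpu-manager\ngpu-operator.vgpuManager.version=510.73.06-ubuntu20.04            # tag of the image built above\n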
GPU information registered by the GPU Operator on the node
Example of obtaining vGPU information (only applicable to vGPU): view the node information on the node labeled nvidia.com/gpu.workload.config=vm-vgpu, such as work-node-2. In the Capacity section, nvidia.com/GRID_P4-1Q: 8 indicates that 8 vGPUs are available:
In this case, the mdevNameSelector should be \"GRID P4-1Q\" and the resourceName should be \"GRID_P4-1Q\".
Get GPU passthrough information: On the node marked as nvidia.com/gpu.workload.config=vm-passthrough (work-node-1 in this example), view the node information. In the Capacity section, nvidia.com/GP104GL_TESLA_P4: 2 indicates that 2 GPUs are available for passthrough:
In this case, the resourceName should be \"GP104GL_TESLA_P4\". To obtain the pciVendorSelector, SSH into the target node work-node-1 and run the lspci -nnk -d 10de: command to obtain the Nvidia GPU PCI information; the vendor and device ID in the output (for example 10DE:1BB3) form the pciVendorSelector value.
Note when editing the kubevirt CR: if there are multiple GPUs of the same model, you only need to declare the model once in the CR; there is no need to list every individual GPU.
For GPU passthrough, in the example above there are two TESLA P4 GPUs, so the model only needs to be registered once.
"},{"location":"en/admin/virtnest/gpu/vm-vgpu.html#create-vm-using-yaml-and-enable-gpu-acceleration","title":"Create VM Using YAML and Enable GPU Acceleration","text":"
The only difference from a regular virtual machine is adding the GPU-related information in the devices section.
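As a hedged illustration, the vGPU is requested through a fragment like the following under spec.template.spec.domain of the VirtualMachine (the device name vgpu1 is arbitrary; deviceName must match the vGPU resource registered on the node):
devices:\n  gpus:\n    - name: vgpu1   # arbitrary name for the device inside the VM spec\n      deviceName: nvidia.com/GRID_P4-1Q   # must match the vGPU resourceName registered in the kubevirt CR\n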
If you want to try the latest development version of virtnest, add the following repository address (note that the development version of virtnest is extremely unstable).
"},{"location":"en/admin/virtnest/install/index.html#choose-a-version-that-you-want-to-install","title":"Choose a Version that You Want to Install","text":"
It is recommended to install the latest version.
[root@master ~]# helm search repo virtnest-release/virtnest --versions\nNAME CHART VERSION APP VERSION DESCRIPTION\nvirtnest-release/virtnest 0.6.0 v0.6.0 A Helm chart for virtnest\n
"},{"location":"en/admin/virtnest/install/index.html#create-a-namespace","title":"Create a Namespace","text":"
"},{"location":"en/admin/virtnest/install/index.html#upgrade","title":"Upgrade","text":""},{"location":"en/admin/virtnest/install/index.html#update-the-virtnest-helm-repository","title":"Update the virtnest Helm Repository","text":"
helm repo update virtnest-release\n
"},{"location":"en/admin/virtnest/install/index.html#back-up-the-set-parameters","title":"Back up the --set Parameters","text":"
Before upgrading the virtnest version, we recommend running the following command to back up the --set parameters of the previous version:
helm get values virtnest -n virtnest-system -o yaml > bak.yaml\n
"},{"location":"en/admin/virtnest/install/install-dependency.html","title":"Dependencies and Prerequisites","text":"
This page explains the dependencies and prerequisites for installing the virtual machine module.
Info
The term virtnest mentioned in the commands or scripts below is the internal development codename for the Virtual Machine module.
"},{"location":"en/admin/virtnest/install/install-dependency.html#prerequisites","title":"Prerequisites","text":""},{"location":"en/admin/virtnest/install/install-dependency.html#kernel-version-being-above-v411","title":"Kernel version being above v4.11","text":"
The kernel version of all nodes in the target cluster needs to be higher than v4.11. For detailed information, see the kubevirt issue. Run the following command to see the version:
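For example, the kernel version can be checked per node or across all nodes at once (the KERNEL-VERSION column in the wide output shows it):
# on each node\nuname -r\n\n# or for all nodes at once\nkubectl get nodes -o wide\n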
"},{"location":"en/admin/virtnest/install/install-dependency.html#cpu-supporting-x86-64-v2-instruction-set-or-higher","title":"CPU supporting x86-64-v2 instruction set or higher","text":"
You can use the following script to check if the current node's CPU is usable:
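The platform's own detect-cpu.sh is not reproduced here; as a minimal sketch, a script along these lines checks the CPU flags required by the x86-64-v2 level (cx16, lahf_lm, popcnt, pni, sse4_1, sse4_2, ssse3):
#!/bin/sh\n# Minimal sketch: report any CPU flag required by x86-64-v2 that is missing on this node\nfor f in cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3; do\n  grep -m1 '^flags' /proc/cpuinfo | grep -qw $f || echo missing flag: $f\ndone\necho check finished\n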
Note
If you encounter a message like the one shown below, you can safely ignore it as it does not impact the final result.
Example
$ sh detect-cpu.sh\ndetect-cpu.sh: line 3: fpu: command not found\n
"},{"location":"en/admin/virtnest/install/install-dependency.html#all-nodes-having-hardware-virtualization-nested-virtualization-enabled","title":"All Nodes having hardware virtualization (nested virtualization) enabled","text":"
Run the following command to check whether hardware virtualization is enabled:
virt-host-validate qemu\n
# Successful case\nQEMU: Checking for hardware virtualization : PASS\nQEMU: Checking if device /dev/kvm exists : PASS\nQEMU: Checking if device /dev/kvm is accessible : PASS\nQEMU: Checking if device /dev/vhost-net exists : PASS\nQEMU: Checking if device /dev/net/tun exists : PASS\nQEMU: Checking for cgroup 'cpu' controller support : PASS\nQEMU: Checking for cgroup 'cpuacct' controller support : PASS\nQEMU: Checking for cgroup 'cpuset' controller support : PASS\nQEMU: Checking for cgroup 'memory' controller support : PASS\nQEMU: Checking for cgroup 'devices' controller support : PASS\nQEMU: Checking for cgroup 'blkio' controller support : PASS\nQEMU: Checking for device assignment IOMMU support : PASS\nQEMU: Checking if IOMMU is enabled by kernel : PASS\nQEMU: Checking for secure guest support : WARN (Unknown if this platform has Secure Guest support)\n\n# Failure case\nQEMU: Checking for hardware virtualization : FAIL (Only emulated CPUs are available, performance will be significantly limited)\nQEMU: Checking if device /dev/vhost-net exists : PASS\nQEMU: Checking if device /dev/net/tun exists : PASS\nQEMU: Checking for cgroup 'memory' controller support : PASS\nQEMU: Checking for cgroup 'memory' controller mount-point : PASS\nQEMU: Checking for cgroup 'cpu' controller support : PASS\nQEMU: Checking for cgroup 'cpu' controller mount-point : PASS\nQEMU: Checking for cgroup 'cpuacct' controller support : PASS\nQEMU: Checking for cgroup 'cpuacct' controller mount-point : PASS\nQEMU: Checking for cgroup 'cpuset' controller support : PASS\nQEMU: Checking for cgroup 'cpuset' controller mount-point : PASS\nQEMU: Checking for cgroup 'devices' controller support : PASS\nQEMU: Checking for cgroup 'devices' controller mount-point : PASS\nQEMU: Checking for cgroup 'blkio' controller support : PASS\nQEMU: Checking for cgroup 'blkio' controller mount-point : PASS\nWARN (Unknown if this platform has IOMMU support)\n
Methods vary across platforms; this page takes vSphere as an example. See the VMware website.
"},{"location":"en/admin/virtnest/install/install-dependency.html#if-using-docker-engine-as-the-container-runtime","title":"If using Docker Engine as the container runtime","text":"
If Docker Engine is used as the container runtime, its version must be higher than v20.10.10.
"},{"location":"en/admin/virtnest/install/install-dependency.html#enabling-iommu-is-recommended","title":"Enabling IOMMU is recommended","text":"
To prepare for future features, it is recommended to enable IOMMU.
"},{"location":"en/admin/virtnest/install/offline-install.html","title":"Offline Upgrade of the Virtual Machine Module","text":"
This page explains how to install or upgrade the Virtual Machine module after downloading it from the Download Center.
Info
The term \"virtnest\" appearing in the following commands or scripts is the internal development code name for the Virtual Machine module.
"},{"location":"en/admin/virtnest/install/offline-install.html#load-images-from-the-installation-package","title":"Load Images from the Installation Package","text":"
You can load the images using one of the following two methods. When a container registry is available in your environment, it is recommended to use the chart-syncer method to synchronize the images to the container registry, as it is more efficient and convenient.
"},{"location":"en/admin/virtnest/install/offline-install.html#synchronize-images-to-the-container-registry-using-chart-syncer","title":"Synchronize Images to the container registry using chart-syncer","text":"
Create a load-image.yaml file.
Note
All parameters in this YAML file are mandatory. You need a private container registry and must modify the relevant configuration accordingly.
If the chart repo is already installed in your environment, chart-syncer also supports exporting the chart as a tgz file.
The relative path to run the charts-syncer command, not the relative path between this YAML file and the offline package.
Change to your container registry URL.
Change to your container registry.
It can also be any other supported Helm Chart repository type.
Change to the chart repo URL.
Your container registry username.
Your container registry password.
Your container registry username.
Your container registry password.
If the chart repo is not installed in your environment, chart-syncer also supports exporting the chart as a tgz file and storing it in the specified path.
The relative path to run the charts-syncer command, not the relative path between this YAML file and the offline package.
Change to your container registry URL.
Change to your container registry.
Local path of the chart.
Your container registry username.
Your container registry password.
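The two load-image.yaml layouts correspond roughly to the sketches below; the bundle path, registry addresses, repository paths, and credentials are placeholders based on the annotations above:
# chart repo already installed\nsource:\n  intermediateBundlesPath: virtnest-offline          # relative path from where charts-syncer is run\ntarget:\n  containerRegistry: registry.example.com            # your container registry URL\n  containerRepository: release.daocloud.io/virtnest  # your container repository\n  repo:\n    kind: HARBOR                                     # or any other supported Helm chart repository type\n    url: http://registry.example.com/chartrepo/release.daocloud.io   # chart repo URL\n    auth:\n      username: \"admin\"\n      password: \"Harbor12345\"\n  containers:\n    auth:\n      username: \"admin\"\n      password: \"Harbor12345\"\n
# chart repo not installed, export the chart as a tgz file\nsource:\n  intermediateBundlesPath: virtnest-offline          # relative path from where charts-syncer is run\ntarget:\n  containerRegistry: registry.example.com            # your container registry URL\n  containerRepository: release.daocloud.io/virtnest  # your container repository\n  repo:\n    kind: LOCAL\n    path: ./local-repo                               # local path of the chart\n  containers:\n    auth:\n      username: \"admin\"\n      password: \"Harbor12345\"\n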
Run the command to synchronize the images.
charts-syncer sync --config load-image.yaml\n
"},{"location":"en/admin/virtnest/install/offline-install.html#load-images-directly-using-docker-or-containerd","title":"Load Images Directly using Docker or containerd","text":"
Unpack and load the image files.
Unpack the tar archive.
tar xvf virtnest.bundle.tar\n
After successful extraction, you will have three files:
hints.yaml
images.tar
original-chart
Load the images from the local file to Docker or containerd.
docker load -i images.tar\n
ctr -n k8s.io image import images.tar\n
Note
Perform the Docker or containerd image loading operation on each node. After loading is complete, tag the images to match the Registry and Repository used during installation.
If the helm version is too low, this may fail. If it fails, try running helm repo update.
Choose the version of the Virtual Machine you want to install (it is recommended to install the latest version).
helm search repo virtnest/virtnest --versions\n
[root@master ~]# helm search repo virtnest/virtnest --versions\nNAME CHART VERSION APP VERSION DESCRIPTION\nvirtnest/virtnest 0.2.0 v0.2.0 A Helm chart for virtnest\n...\n
Back up the --set parameters.
Before upgrading the Virtual Machine version, it is recommended to run the following command to back up the --set parameters of the previous version.
helm get values virtnest -n virtnest-system -o yaml > bak.yaml\n
To utilize the Virtual Machine (VM), the virtnest-agent component needs to be installed in the cluster using Helm.
Click Container Management in the left navigation menu, then click Virtual Machines . If the virtnest-agent component is not installed, you will not be able to use VMs, and the interface will prompt you to install it in the required cluster.
Select the desired cluster, click Helm Apps in the left navigation menu, then click Helm Charts to view the template list.
Search for the virtnest-agent component and click it to see the details. Select the appropriate version and click the Install button to install it.
On the installation page, fill in the required information, and click OK to finish the installation.
Go back to the Virtual Machines in the navigation menu. If the installation is successful, you will see the VM list, and you can now use the VM.
This article explains how to create a virtual machine using two methods: from an image and from a YAML file.
The virtual machine module, based on KubeVirt, manages virtual machines as cloud-native applications that integrate seamlessly with containers. This allows users to easily deploy virtual machine applications and enjoy an experience as smooth as with containerized applications.
Before creating a virtual machine, make sure you meet the following prerequisites:
Expose hardware-assisted virtualization to the user operating system.
Install virtnest-agent on the specified cluster; the operating system kernel version must be 3.15 or higher.
Create a namespace and user.
Prepare the image in advance. The platform comes with three built-in images (as shown below). If you need to create your own image, refer to creating from an image with KubeVirt.
When configuring the network, if you choose to use the Passt network mode, you need to upgrade to Version 0.4.0 or higher.
Follow the steps below to create a virtual machine using an image.
Click Container Management on the left navigation bar, then click Virtual Machines to enter the VM page.
On the virtual machine list page, click Create VMs and select Create with Image.
Fill in the basic information, image settings, storage and network, and login settings, then click OK at the bottom right corner to complete the creation.
The system will automatically return to the virtual machine list. By clicking the \u2507 button on the right side of the list, you can perform operations such as power on/off, restart, clone, update, create snapshots, console access (VNC), and delete virtual machines. Cloning and snapshot capabilities depend on the selected StorageClass.
In the Create VMs page, enter the information according to the table below and click Next.
Name: Up to 63 characters, can only contain lowercase letters, numbers, and hyphens ( - ), and must start and end with a lowercase letter or number. The name must be unique within the namespace, and cannot be changed once the virtual machine is created.
Alias: Allows any characters, up to 60 characters.
Cluster: Select the cluster to deploy the newly created virtual machine.
Namespace: Select the namespace to deploy the newly created virtual machine. If the desired namespace is not found, you can create a new namespace according to the prompts on the page.
Label/Annotation: Select the desired labels/annotations to add to the virtual machine.
Fill in the image-related information according to the table below, then click Next.
Image Source: Supports three types of sources.
Registry: Images stored in the container registry. You can select images from the registry as needed.
HTTP: Images stored in a file server using the HTTP protocol, supporting both HTTPS:// and HTTP:// prefixes.
Object Storage (S3): Virtual machine images obtained through the object storage protocol (S3). For non-authenticated object storage files, please use the HTTP source.
The following are the built-in images provided by the platform, including the operating system, version, and the image URL. Custom virtual machine images are also supported.
| Operating System | Version | Image Address |
| --- | --- | --- |
| CentOS | CentOS 7.9 | release-ci.daocloud.io/virtnest/system-images/centos-7.9-x86_64:v1 |
| Ubuntu | Ubuntu 22.04 | release-ci.daocloud.io/virtnest/system-images/ubuntu-22.04-x86_64:v1 |
| Debian | Debian 12 | release-ci.daocloud.io/virtnest/system-images/debian-12-x86_64:v1 |
Image Secret: Only the default (Opaque) type of secret is supported. For specific operations, refer to Create Secret.
The built-in images are stored in the bootstrap cluster, and the bootstrap cluster's container registry is not encrypted, so there is no need to select a secret when choosing a built-in image.
Note
The hot-plug configuration for CPU and memory requires virtnest v0.10.0 or higher, and virtnest-agent v0.7.0 or higher.
Resource Config: For CPU, it is recommended to use whole numbers. If a decimal is entered, it will be rounded up. The hot-plug configuration for CPU and memory is supported.
GPU Configuration: Enabling GPU functionality requires meeting certain prerequisites. For details, refer to Configuring GPU for Virtual Machines (Nvidia). Virtual machines support two types of Nvidia GPUs: Nvidia-GPU and Nvidia-vGPU. After selecting the desired type, you will need to choose the proper GPU model and the number of GPUs.
"},{"location":"en/admin/virtnest/quickstart/index.html#storage-and-network","title":"Storage and Network","text":"
Storage:
Storage is closely related to virtual machine functionality. By using Kubernetes persistent volumes and storage classes, it provides flexible and scalable storage capabilities for virtual machines. For example, the virtual machine image is stored in a PVC, which supports operations such as cloning and snapshotting along with other data.
System Disk: The system automatically creates a VirtIO type rootfs system disk for storing the operating system and data.
Data Disk: The data disk is a storage device in the virtual machine used to store user data, application data, or other non-operating system related files. Compared with the system disk, the data disk is optional and can be dynamically added or removed as needed. The capacity of the data disk can also be flexibly configured according to demand.
Block storage is used by default. If you need to use the clone and snapshot functions, make sure that your storage pool has created the proper VolumeSnapshotClass, as shown in the following example. If you need to use the live migration function, make sure your storage supports the ReadWriteMany access mode and that this mode is selected.
In most cases, the storage will not automatically create such a VolumeSnapshotClass during the installation process, so you need to manually create a VolumeSnapshotClass. The following is an example of HwameiStor creating a VolumeSnapshotClass:
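A sketch of such a VolumeSnapshotClass for HwameiStor is shown below; the name and snapshot size are illustrative, and the driver must match the provisioner of your storage pool:
apiVersion: snapshot.storage.k8s.io/v1\nkind: VolumeSnapshotClass\nmetadata:\n  name: hwameistor-storage-lvm-snapshot\n  annotations:\n    snapshot.storage.kubernetes.io/is-default-class: \"true\"\nparameters:\n  snapsize: \"1073741824\"      # snapshot size in bytes (illustrative)\ndriver: lvm.hwameistor.io     # must match the provisioner of the storage pool\ndeletionPolicy: Delete\n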
Run the following command to check if the VolumeSnapshotClass was created successfully.
kubectl get VolumeSnapshotClass\n
View the created SnapshotClass and confirm that its provisioner property is consistent with the Driver property of the storage pool.
Network:
Network setting can be combined as needed according to the table information.
| Network Mode | CNI | Install Spiderpool | Network Cards | Fixed IP | Live Migration |
| --- | --- | --- | --- | --- | --- |
| Masquerade (NAT) | Calico | ❌ | Single | ❌ | ✅ |
| Masquerade (NAT) | Cilium | ❌ | Single | ❌ | ✅ |
| Masquerade (NAT) | Flannel | ❌ | Single | ❌ | ✅ |
| Bridge | OVS | ✅ | Multiple | ✅ | ✅ |
Network modes are divided into Masquerade (NAT) and Bridge; the Bridge mode can only be used after the spiderpool component has been installed.
The Masquerade (NAT) network mode is selected by default, using the default network card eth0.
If the spiderpool component is installed in the cluster, you can choose the Bridge mode, and the Bridge mode supports multiple NICs.
Add Network Card
Passthrough / Bridge mode supports manually adding network cards. Click Add NIC to configure the network card IP pool. Choose a Multus CR that matches the network mode; if none exists, you need to create it yourself.
If you turn on the Use Default IP Pool switch, the default IP pool in the Multus CR configuration is used. If the switch is off, manually select an IP pool.
Accessing virtual machines through the terminal provides more flexibility and lightweight access. However, it does not directly display the graphical interface, has limited interactivity, and does not support multiple concurrent terminal sessions.
Click Container Management in the left navigation bar, then click Virtual Machines to access the list page. Click the \u2507 button on the right side of the list to access the virtual machine via the terminal.
Accessing virtual machines through VNC allows you to access and control the full graphical interface of the remote computer. It provides a more interactive experience and allows intuitive operation of the remote device. However, it may have some performance impact, and it does not support multiple concurrent terminal sessions.
Choose VNC for Windows systems.
Click Container Management in the left navigation bar, then click Virtual Machines to access the list page. Click the \u2507 button on the right side of the list to access the virtual machine via Console Access (VNC).
After successfully creating a virtual machine, you can enter the VM Detail page to view Basic Information, Settings, GPU Settings, Overview, Storage, Network, Snapshot and Event List.
Click Container Management in the left navigation bar, then click Clusters to enter the page of the cluster where the virtual machine is located. Click the VM Name to view the virtual machine details.
Operating System: The operating system installed on the virtual machine to execute programs.
Image Address: A link to a virtual hard disk file or operating system installation media, which is used to load and install the operating system in the virtual machine software.
Network Mode: The network mode configured for the virtual machine, including Bridge or Masquerade(NAT).
CPU & Memory: The resources allocated to the virtual machine.
GPU Settings include: GPU Type, GPU Model, and GPU Counts.
"},{"location":"en/admin/virtnest/quickstart/detail.html#other-information","title":"Other Information","text":"OverviewStorageNetworkSnapshotsEvent List
Overview: allows you to view the virtual machine's insight content. Note that if insight-agent is not installed, overview information cannot be obtained.
Storage: displays the storage used by the virtual machine, including information about the system disk and data disks.
Network: displays the network settings of the virtual machine, including the Multus CR, NIC name, IP address, and so on.
Snapshots: if you have created snapshots, this part displays the related information. Restoring the virtual machine from a snapshot is supported.
Event List: includes various state changes, operation records, and system messages during the lifecycle of the virtual machine.
"},{"location":"en/admin/virtnest/quickstart/nodeport.html","title":"Accessing Virtual Machine via NodePort","text":"
This page explains how to access a virtual machine using NodePort.
"},{"location":"en/admin/virtnest/quickstart/nodeport.html#limitations-of-existing-access-methods","title":"Limitations of Existing Access Methods","text":"
Virtual machines support access via VNC or console, but both methods have a limitation: they do not allow multiple terminals to be simultaneously online.
Using a NodePort-formatted Service can help solve this problem.
"},{"location":"en/admin/virtnest/quickstart/nodeport.html#create-a-service","title":"Create a Service","text":"
Using the Container Management Page
Select the cluster page where the target virtual machine is located and create a Service.
Select the access type as NodePort.
Choose the namespace (the namespace where the virtual machine resides).
Fill in the label selector as vm.kubevirt.io/name: your-vm-name.
Port Configuration: Choose TCP for the protocol, provide a custom port name, and set the service port and container port to 22.
After successful creation, you can access the virtual machine by using ssh username@nodeip -p port.
"},{"location":"en/admin/virtnest/quickstart/nodeport.html#create-the-service-via-kubectl","title":"Create the Service via kubectl","text":"
On this page, Alias , Label and Annotation can be updated, while other information cannot. After completing the updates, click Next to proceed to the Image Settings page.
On this page, parameters such as Image Address, Operating System, and Version cannot be changed once selected. Users are allowed to update the GPU Quota, including enabling or disabling GPU support, selecting the GPU type, specifying the required model, and configuring the number of GPUs. A restart is required for the changes to take effect. After completing the updates, click Next to proceed to the Storage and Network page.
"},{"location":"en/admin/virtnest/quickstart/update.html#storage-and-network","title":"Storage and Network","text":"
On the Storage and Network page, the StorageClass and PVC Mode for the System Disk cannot be changed once selected. You can increase the disk capacity, but reducing it is not supported. Data disks can be freely added or removed. Network updates are not supported. After completing the updates, click Next to proceed to the Login Settings page.
Note
It is recommended to restart the virtual machine after modifying storage capacity or adding data disks to ensure the configuration takes effect.
On the Login Settings page, Username, Password, and SSH cannot be changed once set. After confirming your login information is correct, click OK to complete the update process.
In addition to updating the virtual machine via forms, you can also quickly update it using a YAML file.
Go to the virtual machine list page and click the Edit YAML button.
"},{"location":"en/admin/virtnest/template/index.html","title":"Create Virtual Machines via Templates","text":"
This guide explains how to create virtual machines using templates.
With internal templates and custom templates, users can easily create new virtual machines. Additionally, we provide the ability to convert existing virtual machines into templates, allowing users to manage and utilize resources more flexibly.
"},{"location":"en/admin/virtnest/template/index.html#create-with-template","title":"Create with Template","text":"
Follow these steps to create a virtual machine using a template.
Click Container Management in the left navigation menu, then click Virtual Machines to access the Virtual Machine Management page. On the virtual machine list page, click Create Virtual Machine and select Create with Template .
On the template creation page, fill in the required information, including Basic Information, Template Config, Storage and Network, and Login Settings. Then, click OK in the bottom-right corner to complete the creation.
The system will automatically return to the virtual machine list. By clicking \u2507 on the right side of the list, you can perform operations such as power off/restart, clone, update, create snapshot, convert to template, console access (VNC), and delete. The ability to clone and create snapshots depends on the selected storage pool.
On the Create VMs page, enter the information according to the table below and click Next .
Name: Can contain up to 63 characters and can only include lowercase letters, numbers, and hyphens ( - ). The name must start and end with a lowercase letter or number. Names must be unique within the same namespace, and the name cannot be changed after the virtual machine is created.
Alias: Can include any characters, up to 60 characters in length.
Cluster: Select the cluster where the new virtual machine will be deployed.
Namespace: Select the namespace where the new virtual machine will be deployed. If the desired namespace is not found, you can follow the instructions on the page to create a new namespace.
The template list will appear, and you can choose either an internal template or a custom template based on your needs.
Select an Internal Template: AI platform Virtual Machine provides several standard templates that cannot be edited or deleted. When selecting an internal template, the image source, operating system, image address, and other information will be based on the template and cannot be modified. GPU quota will also be based on the template but can be modified.
Select a Custom Template: These templates are created from virtual machine configurations and can be edited or deleted. When using a custom template, you can modify the image source and other information based on your specific requirements.
"},{"location":"en/admin/virtnest/template/index.html#storage-and-network","title":"Storage and Network","text":"
Storage: By default, the system creates a rootfs system disk of VirtIO type for storing the operating system and data. Block storage is used by default. If you need to use clone and snapshot functionality, make sure your storage pool (SC) supports the VolumeSnapshots feature and that the proper VolumeSnapshotClass has been created for it. Please note that the storage pool (SC) has additional prerequisites that need to be met.
Prerequisites:
KubeVirt utilizes the VolumeSnapshot feature of the Kubernetes CSI driver to capture the persistent state of virtual machines. Therefore, you need to ensure that your virtual machine uses a StorageClass that supports VolumeSnapshots and is configured with the correct VolumeSnapshotClass.
Check the created SnapshotClass and confirm that the provisioner property matches the Driver property in the storage pool.
Supports adding one system disk and multiple data disks.
Network: If no configuration is made, the system will create a VirtIO type network by default.
This guide explains the usage of internal VM templates and custom VM templates.
Using both internal and custom templates, users can easily create new VMs. Additionally, we provide the ability to convert existing VMs into VM templates, allowing users to manage and utilize resources more flexibly.
Click Container Management in the left navigation menu, then click VM Template to access the VM Template page. If the template is converted from a virtual machine configured with a GPU, the template will also include GPU information and will be displayed in the template list.
Click the \u2507 on the right side of a template in the list. For internal templates, you can create VM and view YAML. For custom templates, you can create VM, edit YAML and delete template.
Custom templates are created from VM configurations. The following steps explain how to convert a VM configuration into a template.
Click Container Management in the left navigation menu, then click Virtual Machines to access the list page. Click the \u2507 on the right side of a VM in the list to convert the configuration into a template. Only running or stopped VMs can be converted.
Provide a name for the new template. A notification will indicate that the original VM will be preserved and remain available. After a successful conversion, a new entry will be added to the template list.
After successfully creating a template, you can click the template name to view the details of the VM, including Basic Information, GPU Settings, Storage, Network, and more. If you need to quickly deploy a new VM based on that template, simply click the Create VM button in the upper right corner of the page for easy operation.
"},{"location":"en/admin/virtnest/vm/auto-migrate.html","title":"Automatic VM Drifting","text":"
This article will explain how to seamlessly migrate running virtual machines to other nodes when a node in the cluster becomes inaccessible due to power outages or network failures, ensuring business continuity and data security.
Compared to automatic drifting, live migration requires you to manually initiate the migration process through the interface, rather than having the system automatically trigger it.
Check the status of the virtual machine launcher pod:
kubectl get pod\n
Check if the launcher pod is in a Terminating state.
Force delete the launcher pod:
If the launcher pod is in a Terminating state, you can force delete it with the following command:
kubectl delete pod <launcher pod> --force --grace-period=0\n
Replace <launcher pod> with the name of your launcher pod.
Wait for recreation and check the status:
After deletion, the system will automatically recreate the launcher pod. Wait for its status to become running, then refresh the virtual machine list to see if the VM has successfully migrated to the new node.
If using rook-ceph as storage, it needs to be configured in ReadWriteOnce mode:
After force deleting the pod, you need to wait approximately six minutes for the launcher pod to start, or you can immediately start the pod using the following commands:
kubectl get pv | grep <vm name>\nkubectl get VolumeAttachment | grep <pv name>\n
Replace <vm name> and <pv name> with your virtual machine name and persistent volume name.
Then delete the proper VolumeAttachment with the following command:
kubectl delete VolumeAttachment <volumeattachment name>\n
Replace <volumeattachment name> with the name of the VolumeAttachment found in the previous step.
"},{"location":"en/admin/virtnest/vm/clone.html","title":"Cloning a Cloud Host","text":"
This article will introduce how to clone a new cloud host.
Users can clone a new cloud host, which will have the same operating system and system configuration as the original cloud host. This enables quick deployment and scaling, allowing for the rapid creation of new cloud hosts with similar configurations without the need to install from scratch.
Before using the cloning feature, the following prerequisites must be met (which are the same as those for the snapshot feature):
Only cloud hosts that are not in an error state can use the cloning feature.
Install Snapshot CRDs, Snapshot Controller, and CSI Driver. For specific installation steps, refer to CSI Snapshotter.
Wait for the snapshot-controller component to be ready. This component will monitor events related to VolumeSnapshot and VolumeSnapshotContent and trigger related operations.
Wait for the CSI Driver to be ready, ensuring that the csi-snapshotter sidecar is running in the CSI Driver. The csi-snapshotter sidecar will monitor events related to VolumeSnapshotContent and trigger related operations.
If the storage is Rook-Ceph, refer to ceph-csi-snapshot
If the storage is HwameiStor, refer to huameistor-snapshot
"},{"location":"en/admin/virtnest/vm/clone.html#cloning-a-cloud-host_1","title":"Cloning a Cloud Host","text":"
Click Container Management in the left navigation bar, then click Cloud Hosts to enter the list page. Click the \u2507 on the right side of the list to perform the clone operation on a cloud host that is not in an error state.
A popup will appear, requiring you to fill in the name and description for the new cloud host being cloned. The cloning operation may take some time, depending on the size of the cloud host and storage performance.
After a successful clone, you can view the new cloud host in the cloud host list. The newly created cloud host will be in a powered-off state and will need to be manually powered on if required.
It is recommended to take a snapshot of the original cloud host before cloning. If you encounter issues during the cloning process, please check whether the prerequisites are met and try to execute the cloning operation again.
When creating a virtual machine using Object Storage (S3) as the image source, sometimes you need to provide a secret to pass S3's authentication. The following introduces how to create a secret that meets the virtual machine's requirements.
Click Container Management in the left navigation bar, then click Clusters , enter the details of the cluster where the virtual machine is located, click ConfigMaps & Secrets , select the Secrets , and click Create Secret .
On the creation page, fill in the secret name, select the same namespace as the virtual machine, and note that you need to select the default type Opaque . The secret data must follow these principles, as shown in the example below.
accessKeyId: Data represented in Base64 encoding
secretKey: Data represented in Base64 encoding
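A sketch of such a secret; the name, namespace, and Base64-encoded values are placeholders:
apiVersion: v1\nkind: Secret\nmetadata:\n  name: s3-image-secret        # placeholder name\n  namespace: default           # same namespace as the virtual machine\ntype: Opaque\ndata:\n  accessKeyId: QUtJQVhYWFhYWFhYWFhYWA==            # Base64-encoded access key ID (placeholder)\n  secretKey: c2VjcmV0LWtleS1wbGFjZWhvbGRlcg==      # Base64-encoded secret key (placeholder)\n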
After successful creation, you can use the required secret when creating a virtual machine.
"},{"location":"en/admin/virtnest/vm/cross-cluster-migrate.html","title":"Migrate VM across Clusters","text":"
This feature currently does not have a UI, so you can follow the steps in the documentation.
A VM needs to be migrated to another cluster when the original cluster experiences a failure or performance degradation that makes the VM inaccessible.
A VM needs to be migrated to another cluster when perform planned maintenance or upgrades on the cluster.
A VM needs to be migrated to another cluster to match more appropriate resource configurations when the performance requirements of specific applications change and resource allocation needs to be adjusted.
Before performing migration of a VM across cluster, the following prerequisites must be met:
Cluster network connectivity: Ensure that the network between the original cluster and the target migration cluster is accessible.
Same storage type: The target migration cluster must support the same storage type as the original cluster. For example, if the exporting cluster uses rook-ceph-block type StorageClass, the importing cluster must also support this type.
Enable VMExport Feature Gate in KubeVirt of the original cluster.
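The feature gate can be enabled by editing the kubevirt CR in the original cluster, roughly as follows (the CR name and namespace depend on your installation):
# kubectl edit kubevirt kubevirt -n <kubevirt-namespace>\nspec:\n  configuration:\n    developerConfiguration:\n      featureGates:\n        - VMExport\n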
"},{"location":"en/admin/virtnest/vm/cross-cluster-migrate.html#configure-ingress-for-the-original-cluster","title":"Configure Ingress for the Original Cluster","text":"
Using Nginx Ingress as an example, configure Ingress to point to the virt-exportproxy Service:
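A sketch of such an Ingress; the Ingress name, namespace, and host are placeholders, and the backend must point to the virt-exportproxy Service and its port:
apiVersion: networking.k8s.io/v1\nkind: Ingress\nmetadata:\n  name: virt-exportproxy          # placeholder name\n  namespace: kubevirt             # namespace where the virt-exportproxy Service runs\nspec:\n  ingressClassName: nginx\n  rules:\n    - host: vmexport.example.com  # placeholder hostname\n      http:\n        paths:\n          - path: /\n            pathType: Prefix\n            backend:\n              service:\n                name: virt-exportproxy\n                port:\n                  number: 443\n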
If cold migration is performed while the VM is powered off :
apiVersion: v1\nkind: Secret\nmetadata:\n name: example-token # Export Token used by the VM\n namespace: default # Namespace where the VM resides\nstringData:\n token: 1234567890ab # Export the used Token (Modifiable)\n\n---\napiVersion: export.kubevirt.io/v1alpha1\nkind: VirtualMachineExport\nmetadata:\n name: example-export # Export name (Modifiable)\n namespace: default # Namespace where the VM resides\nspec:\n tokenSecretRef: example-token # Must match the name of the token created above\n source:\n apiGroup: \"kubevirt.io\"\n kind: VirtualMachine\n name: testvm # VM name\n
If hot migration is performed using a VM snapshot while the VM is powered on :
apiVersion: v1\nkind: Secret\nmetadata:\n name: example-token # Export Token used by VM\n namespace: default # Namespace where the VM resides\nstringData:\n token: 1234567890ab # Export the used Token (Modifiable)\n\n---\napiVersion: export.kubevirt.io/v1alpha1\nkind: VirtualMachineExport\nmetadata:\n name: export-snapshot # Export name (Modifiable)\n namespace: default # Namespace where the VM resides\nspec:\n tokenSecretRef: example-token # Must match the name of the token created above\n source:\n apiGroup: \"snapshot.kubevirt.io\"\n kind: VirtualMachineSnapshot\n name: export-snap-202407191524 # Name of the proper VM snapshot\n
Check if the VirtualMachineExport is ready:
# Replace example-export with the name of the created VirtualMachineExport\nkubectl get VirtualMachineExport example-export -n default\n\nNAME SOURCEKIND SOURCENAME PHASE\nexample-export VirtualMachine testvm Ready\n
Once the VirtualMachineExport is ready, export the VM YAML.
If virtctl is installed, you can use the following command to export the VM YAML:
# Replace example-export with the name of the created VirtualMachineExport\n# Specify the namespace with -n\nvirtctl vmexport download example-export --manifest --include-secret --output=manifest.yaml\n
If virtctl is not installed, you can use the following commands to export the VM YAML:
# Replace example-export with the name and namespace of the created VirtualMachineExport\nmanifesturl=$(kubectl get VirtualMachineExport example-export -n default -o=jsonpath='{.status.links.internal.manifests[0].url}')\nsecreturl=$(kubectl get VirtualMachineExport example-export -n default -o=jsonpath='{.status.links.internal.manifests[1].url}')\n# Replace with the secret name and namespace\ntoken=$(kubectl get secret example-token -n default -o=jsonpath='{.data.token}' | base64 -d)\n\ncurl -H \"Accept: application/yaml\" -H \"x-kubevirt-export-token: $token\" --insecure $secreturl > manifest.yaml\ncurl -H \"Accept: application/yaml\" -H \"x-kubevirt-export-token: $token\" --insecure $manifesturl >> manifest.yaml\n
Import VM.
Copy the exported manifest.yaml to the target migration cluster and run the following command (if the namespace does not exist, it needs to be created in advance):
kubectl apply -f manifest.yaml\n
After the VM is successfully created, restart it. Once the new VM is running successfully, delete the original VM in the original cluster (do not delete the original VM if the new one has not started successfully).
When configuring the liveness and readiness probes for a cloud host, the process is similar to that of Kubernetes configuration. This article will introduce how to configure health check parameters for a cloud host using YAML.
However, it is important to note that the configuration must be done when the cloud host has been successfully created and is in a powered-off state.
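As a hedged sketch, a liveness probe is added under spec.template.spec of the VirtualMachine (a readinessProbe is configured the same way); the port, delays, and periods below are illustrative:
# Fragment of the VirtualMachine spec (spec.template.spec)\nlivenessProbe:\n  initialDelaySeconds: 120   # wait for the guest OS to boot before probing\n  periodSeconds: 20\n  timeoutSeconds: 10\n  failureThreshold: 3\n  tcpSocket:\n    port: 1500               # illustrative port exposed by a service in the guest\n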
The configuration of userData may vary depending on the operating system (such as Ubuntu/Debian or CentOS). The main differences are:
Package manager:
Ubuntu/Debian uses apt-get as the package manager. CentOS uses yum as the package manager.
SSH service restart command:
Ubuntu/Debian uses systemctl restart ssh.service. CentOS uses systemctl restart sshd.service (note that for CentOS 7 and earlier versions, it uses service sshd restart).
Installed packages:
Ubuntu/Debian installs ncat. CentOS installs nmap-ncat (because ncat may not be available in the default repository for CentOS).
This article will explain how to migrate a virtual machine from one node to another.
When a node needs maintenance or upgrades, users can seamlessly migrate running virtual machines to other nodes while ensuring business continuity and data security.
Click Container Management on the left navigation bar, then click Virtual Machines to enter the list page. Click \u2507 on the right side of the list to migrate running virtual machines. Currently, the virtual machine is on the node controller-node-1 .
A pop-up box will appear, indicating that during live migration, the running virtual machine instances will be migrated to another node, but the target node cannot be predetermined. Please ensure that other nodes have sufficient resources.
After a successful migration, you can view the node information in the virtual machine list. At this time, the node has been migrated to controller-node-2 .
"},{"location":"en/admin/virtnest/vm/migratiom.html","title":"Cold Migration within the Cluster","text":"
This article will introduce how to move a cloud host from one node to another within the same cluster while it is powered off.
The main feature of cold migration is that the cloud host will be offline during the migration process, which may impact business continuity. Therefore, careful planning of the migration time window is necessary, taking into account business needs and system availability. Typically, cold migration is suitable for scenarios where downtime requirements are not very strict.
Click Container Management in the left navigation bar, then click Cloud Hosts to enter the list page. Click the \u2507 on the right side of the list to initiate the migration action for a cloud host that is in a powered-off state. The current node of the cloud host cannot be viewed while it is powered off, so prior planning or checking while it is powered on is required.
Note
If you have used local-path in the storage pool of the original node, there may be issues during cross-node migration. Please choose carefully.
After clicking migrate, a prompt will appear allowing you to choose to migrate to a specific node or randomly. If you need to change the storage pool, ensure that there is an available storage pool in the target node. Also, ensure that the target node has sufficient resources. The migration process may take a significant amount of time, so please be patient.
After the migration succeeds, power on the cloud host to verify the result. In this example, the cloud host has already been powered on to check the migration effect.
The virtual machine's monitoring is based on the Grafana Dashboard open-sourced by Kubevirt, which generates monitoring dashboards for each virtual machine.
Monitoring information of the virtual machine can provide better insights into the resource consumption of the virtual machine, such as CPU, memory, storage, and network resource usage. This information can help optimize and plan resources, improving overall resource utilization efficiency.
Navigate to the VM Detail page and click Overview to view the monitoring content of the virtual machine. Please note that without the insight-agent component installed, monitoring information cannot be obtained. Below are the detailed information:
Total CPU, CPU Usage, Memory Total, Memory Usage.
CPU Utilisation: the percentage of CPU resources currently used by the virtual machine;
Memory Utilisation: the percentage of memory resources currently used by the virtual machine out of the total available memory.
Network Traffic by Virtual Machines: the amount of network data sent and received by the virtual machine during a specific time period;
Network Packet Loss Rate: the proportion of lost data packets during data transmission out of the total sent data packets.
Network Packet Error Rate: the rate of errors that occur during network transmission;
Storage Traffic: the speed and capacity at which the virtual machine system reads and writes to the disk within a certain time period.
Storage IOPS: the number of input/output operations the virtual machine system performs in one second.
Storage Delay: the time delay experienced by the virtual machine system when performing disk read and write operations.
This article introduces how to create snapshots for VMs on a schedule.
You can create scheduled snapshots for VMs, providing continuous protection for data and ensuring effective data recovery in case of data loss, corruption, or deletion.
In the left navigation bar, click Container Management -> Clusters to select the proper cluster where the target VM is located. After entering the cluster, click Workloads -> CronJobs, and choose Create from YAML to create a scheduled task. Refer to the following YAML example to create snapshots for the specified VM on a schedule.
Click to view the YAML example for creating a scheduled task
apiVersion: batch/v1\nkind: CronJob\nmetadata:\n name: xxxxx-xxxxx-cronjob # Scheduled task name (Customizable)\n namespace: virtnest-system # Do not modify the namespace\nspec:\n schedule: \"5 * * * *\" # Modify the scheduled task execution interval as needed\n concurrencyPolicy: Allow\n suspend: false\n successfulJobsHistoryLimit: 10\n failedJobsHistoryLimit: 3\n startingDeadlineSeconds: 60\n jobTemplate:\n spec:\n template:\n metadata:\n labels:\n virtnest.io/vm: xxxx # Modify to the name of the VM that needs to be snapshotted\n virtnest.io/namespace: xxxx # Modify to the namespace where the VM is located\n spec:\n serviceAccountName: kubevirt-operator\n containers:\n - name: snapshot-job\n image: release.daocloud.io/virtnest/tools:v0.1.5 # For offline environments, modify the registry address to the proper registry address of the cluster\n imagePullPolicy: IfNotPresent\n env:\n - name: NS\n valueFrom:\n fieldRef:\n fieldPath: metadata.labels['virtnest.io/namespace']\n - name: VM\n valueFrom:\n fieldRef:\n fieldPath: metadata.labels['virtnest.io/vm']\n command:\n - /bin/sh\n - -c\n - |\n export SUFFIX=$(date +\"%Y%m%d-%H%M%S\")\n cat <<EOF | kubectl apply -f -\n apiVersion: snapshot.kubevirt.io/v1alpha1\n kind: VirtualMachineSnapshot\n metadata:\n name: $(VM)-snapshot-$SUFFIX\n namespace: $(NS)\n spec:\n source:\n apiGroup: kubevirt.io\n kind: VirtualMachine\n name: $(VM)\n EOF\n restartPolicy: OnFailure\n
After creating the scheduled task and running it successfully, you can click Virtual Machines in the list page to select the target VM. After entering the details, you can view the snapshot list.
This guide explains how to create snapshots for virtual machines and restore them.
You can create snapshots for virtual machines to save the current state of the virtual machine. A snapshot can be restored multiple times, and each time the virtual machine will be reverted to the state when the snapshot was created. Snapshots are commonly used for backup, recovery and rollback.
Before using the snapshots, the following prerequisites need to be met:
Only virtual machines in a non-error state can use the snapshot function.
Install Snapshot CRDs, Snapshot Controller, and CSI Driver. For detailed installation steps, refer to CSI Snapshotter.
Wait for the snapshot-controller component to be ready. This component monitors events related to VolumeSnapshot and VolumeSnapshotContent and triggers specific actions.
Wait for the CSI Driver to be ready. Ensure that the csi-snapshotter sidecar is running within the CSI Driver. The csi-snapshotter sidecar monitors events related to VolumeSnapshotContent and triggers specific actions.
If the storage is rook-ceph, refer to ceph-csi-snapshot.
If the storage is HwameiStor, refer to huameistor-snapshot.
"},{"location":"en/admin/virtnest/vm/snapshot.html#create-a-snapshot","title":"Create a Snapshot","text":"
Click Container Management in the left navigation menu, then click Virtual Machines to access the list page. Click the \u2507 on the right side of the list for a virtual machine to perform snapshot operations (only available for non-error state virtual machines).
A dialog box will pop up, prompting you to input a name and description for the snapshot. Please note that the creation process may take a few minutes, during which you won't be able to perform any operations on the virtual machine.
After successfully creating the snapshot, you can view its details within the virtual machine's information section. Here, you have the option to edit the description, recover from the snapshot, delete it, among other operations.
"},{"location":"en/admin/virtnest/vm/snapshot.html#restore-from-a-snapshot","title":"Restore from a Snapshot","text":"
Click Restore from Snapshot and provide a name for the virtual machine recovery record. The recovery operation may take some time to complete, depending on the size of the snapshot and other factors. After a successful recovery, the virtual machine will be restored to the state when the snapshot was created.
After some time, you can scroll down to the snapshot information to view all the recovery records for the current snapshot. It also provides a way to locate the recovery point.
This article will introduce how to configure network information when creating virtual machines.
In virtual machines, network management is a crucial part that allows us to manage and configure network connections for virtual machines in a Kubernetes environment. It can be configured according to different needs and scenarios, achieving a more flexible and diverse network architecture.
Single NIC Scenario: For simple applications that only require basic network connectivity or when there are resource constraints, using a single NIC can save network resources and prevent waste of resources.
Multiple NIC Scenario: When security isolation between different network environments needs to be achieved, multiple NICs can be used to divide different network areas. It also allows for control and management of traffic.
When selecting the Bridge network mode, some information needs to be configured in advance:
Install and run Open vSwitch on the host nodes. See Ovs-cni Quick Start.
Configure Open vSwitch bridge on the host nodes. See vswitch for instructions.
Install Spiderpool. See installing spiderpool for instructions. By default, Spiderpool will install both Multus CNI and Ovs CNI.
Create a Multus CR of type ovs. You can create a custom Multus CR or use YAML for creation
Create a subnet and IP pool. See creating subnets and IP pools .
Network configuration can be combined according to the table information.
| Network Mode | CNI | Spiderpool Installed | NIC Mode | Fixed IP | Live Migration |
| --- | --- | --- | --- | --- | --- |
| Masquerade (NAT) | Calico | ❌ | Single NIC | ❌ | ✅ |
| Masquerade (NAT) | Cilium | ❌ | Single NIC | ❌ | ✅ |
| Masquerade (NAT) | Flannel | ❌ | Single NIC | ❌ | ✅ |
| Bridge | OVS | ✅ | Multiple NICs | ✅ | ✅ |
Network Mode: There are two modes - Masquerade (NAT) and Bridge. Bridge mode requires the installation of the spiderpool component.
The default selection is Masquerade (NAT) network mode using the eth0 default NIC.
If the cluster has the spiderpool component installed, then Bridge mode can be selected. The Bridge mode supports multiple NICs.
Ensure all prerequisites are met before selecting the Bridge mode.
Adding NICs
Bridge mode supports manually adding NICs. Click Add NIC to configure the NIC IP pool. Choose a Multus CR that matches the network mode; if none is available, it needs to be created manually.
If the Use Default IP Pool switch is turned on, the default IP pool in the Multus CR configuration is used. If it is turned off, manually select an IP pool.
"},{"location":"en/admin/virtnest/vm/vm-network.html#network-configuration","title":"Network Configuration","text":""},{"location":"en/admin/virtnest/vm/vm-sc.html","title":"Storage for Virtual Machine","text":"
This article will introduce how to configure storage when creating a virtual machine.
Storage and virtual machine functionality are closely related, mainly providing flexible and scalable virtual machine storage capabilities through Kubernetes persistent volumes and storage classes. For example, virtual machine images stored in PVCs support operations such as cloning and snapshotting together with other data.
"},{"location":"en/admin/virtnest/vm/vm-sc.html#deploying-different-storage","title":"Deploying Different Storage","text":"
Before using virtual machine storage functionality, different storage needs to be deployed according to requirements:
Refer to Deploying hwameistor, or install hwameistor-operator in the Helm template of the container management module.
Refer to Deploying rook-ceph
Deploy localpath, use the command kubectl apply -f to create the following YAML:
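The exact YAML shipped with the platform is not reproduced here; below is a minimal sketch that assumes the open-source rancher local-path-provisioner is deployed, showing only the StorageClass portion.

```yaml
# Minimal sketch of a local-path StorageClass (assumes the rancher
# local-path-provisioner is already running in the cluster).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```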
System Disk: By default, a VirtIO type rootfs system disk is created for the system to store the operating system and data.
Data Disk: The data disk is a storage device in the virtual machine used to store user data, application data, or other files unrelated to the operating system. Compared to the system disk, the data disk is optional and can be dynamically added or removed as needed. The capacity of the data disk can also be flexibly configured according to requirements.
Block storage is used by default. If you need to use the cloning and snapshot functions, make sure that your storage pool has created the proper VolumeSnapshotClass, as shown in the example below. If you need to use live migration, make sure that your storage supports and uses the ReadWriteMany access mode.
In most cases, such VolumeSnapshotClass is not automatically created during the installation process, so you need to manually create VolumeSnapshotClass. Here is an example of creating a VolumeSnapshotClass in HwameiStor:
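A sketch of such a VolumeSnapshotClass is shown below. The driver value assumes the HwameiStor LVM CSI driver; verify it against the CSI driver actually used by your storage pool.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: hwameistor-storage-lvm-snapshot
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: lvm.hwameistor.io   # assumed CSI driver name; check with `kubectl get csidrivers`
deletionPolicy: Delete
```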
This document will explain how to build the required virtual machine images.
A virtual machine image is essentially a replica file: a disk partition with an operating system installed. Common image file formats include raw, qcow2, and vmdk.
"},{"location":"en/admin/virtnest/vm-image/index.html#build-an-image","title":"Build an Image","text":"
Below are some detailed steps for building virtual machine images:
Download System Images
Before building virtual machine images, you need to download the required system images. We recommend using images in qcow2, raw, or vmdk formats. You can visit the following links to get CentOS and Fedora images:
CentOS Cloud Images: Obtain CentOS images from the official CentOS project or other sources. Make sure to choose a version compatible with your virtualization platform.
Fedora Cloud Images: Get images from the official Fedora project. Choose the appropriate version based on your requirements.
Build a Docker Image and Push it to a Container Registry
In this step, we will use Docker to build an image and push it to a container registry for easy deployment and usage when needed.
Create a Dockerfile
```dockerfile
FROM scratch
ADD --chown=107:107 CentOS-7-x86_64-GenericCloud.qcow2 /disk/
```
The Dockerfile above adds a file named CentOS-7-x86_64-GenericCloud.qcow2 to the image being built from a scratch base image and places it in the /disk/ directory within the image. This operation includes the file in the image, allowing it to provide a CentOS 7 x86_64 operating system environment when used to create a virtual machine.
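A build command along these lines is assumed here; the image tag comes from the next paragraph and can be changed to suit your project:

```bash
docker build -t release-ci.daocloud.io/ghippo/kubevirt-demo/centos7:v1 .
```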
The above command builds an image named release-ci.daocloud.io/ghippo/kubevirt-demo/centos7:v1 using the instructions in the Dockerfile. You can modify the image name according to your project requirements.
Push the Image to the Container Registry
Use the following command to push the built image to the release-ci.daocloud.io container registry. You can modify the repository name and address as needed.
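For example, the push command could look like the following (adjust the repository name and address as needed):

```bash
docker push release-ci.daocloud.io/ghippo/kubevirt-demo/centos7:v1
```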
These are the detailed steps and instructions for building virtual machine images. By following these steps, you will be able to successfully build and push images for virtual machines to meet your usage needs.
Naming convention: composed of lowercase \"spec.name\" and \"-custom\"
If used for modifying the category
This field must be true
Define the Chinese and English names of the category
The higher the number, the higher its position in the sorting order
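The field descriptions above refer to a category YAML along the following lines. This is a sketch only: the apiVersion and kind are assumptions and should be taken from the platform's actual CRD definitions.

```yaml
apiVersion: ghippo.io/v1alpha1   # assumed group/version
kind: NavigatorCategory
metadata:
  name: devops-custom            # lowercase "spec.name" + "-custom"
spec:
  name: devops
  isCustom: true                 # must be true
  localizedName:                 # Chinese and English names of the category
    zh-CN: 运维管理
    en-US: DevOps
  priority: 70                   # the higher the number, the higher it sorts
```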
After writing the YAML file, you can see the newly added or modified navigation bar categories by executing the following command and refreshing the page:
kubectl apply -f xxx.yaml
"},{"location":"en/admin/ghippo/best-practice/navigator.html#navigation-bar-menus","title":"Navigation Bar Menus","text":"
To add or reorder navigation bar menus, you can achieve it by adding a navigator YAML.
Note
If you need to edit an existing navigation bar menu (not a custom menu added by the user), set the "gproduct" field of the new custom menu to the same value as the "gproduct" field of the menu to be overridden. The new navigation bar menu will overwrite the entries with the same "name" in the "menus" section and add the entries whose "name" differs.
Naming convention: composed of lowercase \"spec.gproduct\" and \"-custom\"
Define the Chinese and English names of the menu
Either \"category\" or \"parentGProduct\" can be used to distinguish between first-level and second-level menus, and it should match the \"spec.name\" field of NavigatorCategory to complete the matching
Second-level menus
The lower the number, the higher its position in the sorting order
Define the identifier of the menu, used for linkage with the parentGProduct field to establish the parent-child relationship.
Set whether the menu is visible, default is true
This field must be true
The higher the number, the higher its position in the sorting order
Naming convention: composed of lowercase \"spec.gproduct\" and \"-custom\"
Define the Chinese and English names of the menu
Either \"category\" or \"parentGProduct\" can be used to distinguish between first-level and second-level menus. If this field is added, it will ignore the \"menus\" field and insert this menu as a second-level menu under the first-level menu with the \"gproduct\" of \"ghippo\"
Define the identifier of the menu, used for linkage with the parentGProduct field to establish the parent-child relationship.
Set whether the menu is visible, default is true
This field must be true
The higher the number, the higher its position in the sorting order
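For reference, a second-level custom menu YAML could be sketched as follows. The apiVersion and kind are assumptions; the field names follow the descriptions above and should be verified against the platform's CRD definitions.

```yaml
apiVersion: ghippo.io/v1alpha1   # assumed group/version
kind: GProductNavigator
metadata:
  name: gmagpie-custom           # lowercase "spec.gproduct" + "-custom"
spec:
  name: Operations Management
  localizedName:                 # Chinese and English names of the menu
    zh-CN: 运营管理
    en-US: Operations Management
  parentGProduct: ghippo         # insert as a second-level menu under the "ghippo" first-level menu
  gproduct: gmagpie              # identifier, linked with parentGProduct to form the parent-child relationship
  visible: true                  # whether the menu is visible, default is true
  isCustom: true                 # must be true
  order: 20                      # the higher the number, the higher it sorts
```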
Insight is a multicluster observation product in AI platform. In order to realize the unified collection of multicluster observation data, users need to install the Helm App insight-agent (Installed in insight-system namespace by default). See How to install insight-agent .
In Insight -> Data Collection section, you can view the status of insight-agent installed in each cluster.
not installed : insight-agent is not installed under the insight-system namespace in this cluster
Running : insight-agent is successfully installed in the cluster, and all deployed components are running
Exception : If insight-agent is in this state, it means that the helm deployment failed or the deployed components are not running
This can be checked as follows:
Run the following command. If the status is deployed, go to the next step. If it is failed, it will affect the application upgrade, so it is recommended to uninstall and reinstall it via Container Management -> Helm Apps :
helm list -n insight-system
Run the following command, or check the status of the components deployed in the cluster in Insight -> Data Collection . If any pod is not in the Running state, please restart the abnormal pod.
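A quick check could look like the following, using the default insight-system namespace where insight-agent is installed:

```bash
kubectl get pods -n insight-system
# restart any pod that is not in the Running state, for example:
kubectl delete pod <abnormal-pod-name> -n insight-system
```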
The resource consumption of the metric collection component Prometheus in insight-agent is directly proportional to the number of pods running in the cluster. Adjust Prometheus resources according to the cluster size, please refer to Prometheus Resource Planning.
The storage capacity of the metric storage component vmstorage in the global service cluster is directly proportional to the total number of pods across all clusters.
Please contact the platform administrator to adjust the disk capacity of vmstorage according to the cluster size; see vmstorage disk capacity planning.
Adjust the vmstorage disk according to the multicluster size; see vmstorage disk expansion.
"},{"location":"en/admin/insight/quickstart/jvm-monitor/jmx-exporter.html","title":"Use JMX Exporter to expose JVM monitoring metrics","text":"
JMX-Exporter provides two usages:
Start as a standalone process. Specify parameters when the JVM starts to expose the RMI interface of JMX; JMX Exporter calls RMI to obtain the JVM runtime status data, converts it to Prometheus metrics format, and exposes a port for Prometheus to collect.
Run in-process inside the JVM. Specify parameters when the JVM starts to run the JMX Exporter jar package as a javaagent, which reads the JVM runtime status data in-process, converts it into Prometheus metrics format, and exposes a port for Prometheus to collect.
Note
The first method is not officially recommended: the configuration is complicated, and it requires a separate process whose monitoring itself becomes a new problem. This page therefore focuses on the second usage and explains how to use JMX Exporter to expose JVM monitoring metrics in a Kubernetes environment.
The second usage is used here: the JMX Exporter jar package and configuration file need to be specified when starting the JVM. The jar package is a binary file and not easy to mount through a ConfigMap, and the configuration file rarely needs to change, so the suggestion is to package both the JMX Exporter jar package and its configuration file directly into the business container image.
For the in-process usage, you can either build the JMX Exporter jar file into the business application image or mount it during deployment. Both methods are introduced below:
"},{"location":"en/admin/insight/quickstart/jvm-monitor/jmx-exporter.html#method-1-build-the-jmx-exporter-jar-file-into-the-business-image","title":"Method 1: Build the JMX Exporter JAR file into the business image","text":"
The content of prometheus-jmx-config.yaml is as follows:
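A minimal configuration that exposes all MBeans as-is could look like this (a sketch; tune the rules for production use):

```yaml
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false
rules:
  - pattern: ".*"
```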
For more configuration options, refer to the introduction at the bottom of this page or the Prometheus official documentation.
Then prepare the jar package file. You can find the latest jar download address on the jmx_exporter GitHub page and refer to the following Dockerfile:
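A sketch of such a Dockerfile is shown below. The base image, the application jar name (my-app.jar), and port 8088 are assumptions to be adapted to your own application:

```dockerfile
FROM openjdk:11.0.15-jre
WORKDIR /app/
# Copy the JMX Exporter jar and its config into the business image
COPY jmx_prometheus_javaagent-0.17.2.jar ./
COPY prometheus-jmx-config.yaml ./
# my-app.jar is a placeholder for your application jar
COPY my-app.jar ./
# Attach the javaagent at startup: -javaagent:<jar>=<port>:<config>
ENV JAVA_TOOL_OPTIONS="-javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml"
ENTRYPOINT ["java", "-jar", "/app/my-app.jar"]
```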
Port 8088 is used here to expose the JVM monitoring metrics. If it conflicts with the Java application's own ports, change it as needed.
"},{"location":"en/admin/insight/quickstart/jvm-monitor/jmx-exporter.html#method-2-mount-via-init-container-container","title":"Method 2: mount via init container container","text":"
First, build the JMX Exporter into a Docker image; the following Dockerfile is for reference only:
```dockerfile
FROM alpine/curl:3.14
WORKDIR /app/
# Copy the previously created config file into the image
COPY prometheus-jmx-config.yaml ./
# Download the jmx prometheus javaagent jar online
RUN set -ex; \
    curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar;
```
Build the image according to the above Dockerfile: docker build -t my-jmx-exporter .
Add the following init container to the Java application deployment Yaml:
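A sketch of the relevant Deployment fragment is shown below. The my-demo-app container name/image and the mount paths are assumptions; the my-jmx-exporter image is the one built in the previous step:

```yaml
spec:
  template:
    spec:
      initContainers:
        - name: jmx-sidecar
          image: my-jmx-exporter
          # Copy the jar and config into the shared volume
          command: ["cp", "-r", "/app", "/jmx-exporter"]
          volumeMounts:
            - name: jmx-exporter
              mountPath: /jmx-exporter
      containers:
        - name: my-demo-app
          image: my-demo-app:latest            # placeholder business image
          env:
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/jmx-exporter/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/jmx-exporter/app/prometheus-jmx-config.yaml"
          ports:
            - containerPort: 8088
          volumeMounts:
            - name: jmx-exporter
              mountPath: /jmx-exporter
      volumes:
        - name: jmx-exporter
          emptyDir: {}
```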
After the above modification, the sample application my-demo-app is able to expose JVM metrics. After running the service, you can access the Prometheus-format metrics exposed by the service through http://localhost:8088.
Then, you can refer to Java Application Docking Observability with JVM Metrics.
This document mainly describes how to monitor the JVM of a customer's Java application, covering how Java applications that already expose JVM metrics, and those that do not, interface with Insight.
If your Java application has not yet exposed JVM metrics, you can refer to the following documents:
Expose JVM monitoring metrics with JMX Exporter
Expose JVM monitoring metrics using OpenTelemetry Java Agent
If your Java application has exposed JVM metrics, you can refer to the following documents:
Java application docking observability with existing JVM metrics
"},{"location":"en/admin/insight/quickstart/jvm-monitor/legacy-jvm.html","title":"Java Application with JVM Metrics to Dock Insight","text":"
If your Java application exposes JVM monitoring metrics through other means (such as Spring Boot Actuator), that monitoring data needs to be allowed to be collected. You can let Insight collect the existing JVM metrics by adding Kubernetes annotations to the workload:
```yaml
annotations:
  insight.opentelemetry.io/metric-scrape: "true"  # whether to collect
  insight.opentelemetry.io/metric-path: "/"       # path to collect metrics
  insight.opentelemetry.io/metric-port: "9464"    # port for collecting metrics
```
YAML example of adding annotations to the my-deployment-app workload:
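A sketch consistent with the Spring Boot Actuator example in the next paragraph (port 8080, path /actuator/prometheus) could look like this; the image name is a placeholder:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment-app
spec:
  template:
    metadata:
      annotations:
        insight.opentelemetry.io/metric-scrape: "true"
        insight.opentelemetry.io/metric-path: "/actuator/prometheus"
        insight.opentelemetry.io/metric-port: "8080"
    spec:
      containers:
        - name: my-deployment-app
          image: my-deployment-app:latest   # placeholder image
          ports:
            - containerPort: 8080
```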
In the above example, Insight will use :8080/actuator/prometheus to get the Prometheus metrics exposed through Spring Boot Actuator.
"},{"location":"en/admin/insight/quickstart/jvm-monitor/otel-java-agent.html","title":"Use OpenTelemetry Java Agent to expose JVM monitoring metrics","text":"
In OpenTelemetry Agent v1.20.0 and above, the agent includes the JMX Metric Insight module. If your application has already integrated the OpenTelemetry Agent to collect application traces, you no longer need to introduce another agent to expose JMX metrics for the application. The OpenTelemetry Agent collects and exposes metrics by instrumenting the MBeans locally available in the application.
Opentelemetry Agent also has some built-in monitoring samples for common Java Servers or frameworks, please refer to predefined metrics.
Using the OpenTelemetry Java Agent also requires deciding how to mount the JAR into the container. In addition to mounting the JAR file as described for the JMX Exporter above, you can use the Operator capabilities provided by OpenTelemetry to automatically enable JVM metric exposure for your applications:
However, for the current version, you still need to manually add the proper annotations to the workload before the JVM data will be collected by Insight.
"},{"location":"en/admin/insight/quickstart/jvm-monitor/otel-java-agent.html#expose-metrics-for-java-middleware","title":"Expose metrics for Java middleware","text":"
Opentelemetry Agent also has some built-in middleware monitoring samples, please refer to Predefined Metrics.
By default, no middleware type is specified; it needs to be specified through the -Dotel.jmx.target.system JVM option, for example -Dotel.jmx.target.system=jetty,kafka-broker .
Gaining JMX Metric Insights with the OpenTelemetry Java Agent
Otel jmx metrics
"},{"location":"en/admin/insight/quickstart/otel/golang-ebpf.html","title":"Enhance Go apps with OTel auto-instrumentation","text":"
If you don't want to manually change the application code, you can try the eBPF-based automatic enhancement method described on this page. This feature is currently under review for donation to the OpenTelemetry community and does not yet support Operator injection through annotations (this will be supported in the future), so you need to manually change the Deployment YAML or use a patch.
Install the Instrumentation CR under the insight-system namespace; skip this step if it has already been installed.
Note: This CR currently only supports the injection of environment variables (including service name and trace address) required to connect to Insight, and will support the injection of Golang probes in the future.
"},{"location":"en/admin/insight/quickstart/otel/golang-ebpf.html#change-the-application-deployment-file","title":"Change the application deployment file","text":"
Add environment variable annotations
Only one annotation is required; it is used to add OpenTelemetry-related environment variables, such as the trace reporting address, the cluster ID where the container is located, and the namespace:
The value is divided into two parts by /: the first part, insight-system, is the namespace of the CR installed in the previous step, and the second part, insight-opentelemetry-autoinstrumentation, is the name of the CR.
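On the workload's pod template, the annotation takes a form along these lines. The key follows the OpenTelemetry Operator's SDK-injection convention; treat the exact key as an assumption and verify it against your Insight version:

```yaml
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-sdk: "insight-system/insight-opentelemetry-autoinstrumentation"
```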
Getting Started with Go OpenTelemetry Automatic Instrumentation
Donating ebpf based instrumentation
"},{"location":"en/admin/insight/quickstart/other/install-agentindce.html","title":"Install insight-agent in Suanova 4.0","text":"
In the AI platform, an earlier Suanova 4.0 cluster can be attached as a subcluster. This guide covers potential issues and solutions when installing insight-agent in a Suanova 4.0 cluster.
Since most Suanova 4.0 clusters have installed dx-insight as the monitoring system, installing insight-agent at this time will conflict with the existing prometheus operator in the cluster, making it impossible to install smoothly.
Enable the parameters of the prometheus operator, retain the prometheus operator in dx-insight, and make it compatible with the prometheus operator in insight-agent in 5.0.
Enable the --deny-namespaces parameter in the two prometheus operators respectively.
Run the following command (the command is for reference only; replace the prometheus operator name and namespace with your actual values).
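A sketch of what this looks like; the deployment names below are placeholders:

```bash
# prometheus operator shipped with dx-insight
kubectl -n dx-insight edit deploy <dx-insight-prometheus-operator-name>
#   add to the container args:  --deny-namespaces=insight-system

# prometheus operator shipped with insight-agent
kubectl -n insight-system edit deploy <insight-agent-prometheus-operator-name>
#   add to the container args:  --deny-namespaces=dx-insight
```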
As shown in the figure above, the dx-insight component is deployed under the dx-insight namespace, and insight-agent is deployed under the insight-system namespace. Add --deny-namespaces=insight-system to the prometheus operator in dx-insight, and add --deny-namespaces=dx-insight to the prometheus operator in insight-agent.
With only the deny namespace added, both prometheus operators can continue to scan the other namespaces, and the related collection resources under kube-system or customer business namespaces are not affected.
Please pay attention to the problem of node exporter port conflict.
The open-source node-exporter uses hostNetwork by default with port 9100. If the cluster's monitoring system has already installed node-exporter, installing insight-agent will cause a node-exporter port conflict and it will not run normally.
Note
Insight's node exporter enables some features to collect special metrics, so installing it is recommended.
Currently, it does not support modifying the port in the installation command. After helm install insight-agent , you need to manually modify the related ports of the insight node-exporter daemonset and svc.
The docker storage directory of Suanova 4.0 is /var/lib/containers , which is different from the path in the insight-agent configuration, so those logs are not collected.
"},{"location":"en/admin/insight/trace/topology-helper.html","title":"Service Topology Element Explanations","text":"
The service topology provided by Observability allows you to quickly identify the request relationships between services and determine the health status of services based on different colors. The health status is determined based on the request latency and error rate of the service's overall traffic. This article explains the elements in the service topology.
"},{"location":"en/admin/insight/trace/topology-helper.html#node-status-explanation","title":"Node Status Explanation","text":"
The node health status is determined based on the error rate and request latency of the service's overall traffic, following these rules:
| Color | Status | Rules |
| --- | --- | --- |
| Gray | Healthy | Error rate equals 0% and request latency is less than 100ms |
| Orange | Warning | Error rate in (0, 5%) or request latency in (100ms, 200ms) |
| Red | Abnormal | Error rate in (5%, 100%) or request latency in (200ms, +Infinity) |
"},{"location":"en/admin/insight/trace/topology-helper.html#connection-status-explanation","title":"Connection Status Explanation","text":"
| Color | Status | Rules |
| --- | --- | --- |
| Green | Healthy | Error rate equals 0% and request latency is less than 100ms |
| Orange | Warning | Error rate in (0, 5%) or request latency in (100ms, 200ms) |
| Red | Abnormal | Error rate in (5%, 100%) or request latency in (200ms, +Infinity) |
"},{"location":"en/admin/kpanda/gpu/gpu-metrics.html","title":"GPU Metrics","text":"
This page lists some commonly used GPU metrics.
"},{"location":"en/admin/kpanda/gpu/gpu-metrics.html#cluster-level","title":"Cluster Level","text":"
| Metric Name | Description |
| --- | --- |
| Number of GPUs | Total number of GPUs in the cluster |
| Average GPU Utilization | Average compute utilization of all GPUs in the cluster |
| Average GPU Memory Utilization | Average memory utilization of all GPUs in the cluster |
| GPU Power | Power consumption of all GPUs in the cluster |
| GPU Temperature | Temperature of all GPUs in the cluster |
| GPU Utilization Details | 24-hour usage details of all GPUs in the cluster (includes max, avg, current) |
| GPU Memory Usage Details | 24-hour memory usage details of all GPUs in the cluster (includes min, max, avg, current) |
| GPU Memory Bandwidth Utilization | For example, an Nvidia V100 GPU has a maximum memory bandwidth of 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the utilization is 50% |
"},{"location":"en/admin/kpanda/gpu/gpu-metrics.html#node-level","title":"Node Level","text":"
| Metric Name | Description |
| --- | --- |
| GPU Mode | Usage mode of GPUs on the node, including full-card mode, MIG mode, vGPU mode |
| Number of Physical GPUs | Total number of physical GPUs on the node |
| Number of Virtual GPUs | Number of vGPU devices created on the node |
| Number of MIG Instances | Number of MIG instances created on the node |
| GPU Memory Allocation Rate | Memory allocation rate of all GPUs on the node |
| Average GPU Utilization | Average compute utilization of all GPUs on the node |
| Average GPU Memory Utilization | Average memory utilization of all GPUs on the node |
| GPU Driver Version | Driver version information of GPUs on the node |
| GPU Utilization Details | 24-hour usage details of each GPU on the node (includes max, avg, current) |
| GPU Memory Usage Details | 24-hour memory usage details of each GPU on the node (includes min, max, avg, current) |
"},{"location":"en/admin/kpanda/gpu/gpu-metrics.html#pod-level","title":"Pod Level","text":"
| Category | Metric Name | Description |
| --- | --- | --- |
| Application Overview GPU - Compute & Memory | Pod GPU Utilization | Compute utilization of the GPUs used by the current Pod |
| | Pod GPU Memory Utilization | Memory utilization of the GPUs used by the current Pod |
| | Pod GPU Memory Usage | Memory usage of the GPUs used by the current Pod |
| | Memory Allocation | Memory allocation of the GPUs used by the current Pod |
| | Pod GPU Memory Copy Ratio | Memory copy ratio of the GPUs used by the current Pod |
| GPU - Engine Overview | GPU Graphics Engine Activity Percentage | Percentage of time the Graphics or Compute engine is active during a monitoring cycle |
| | GPU Memory Bandwidth Utilization | Memory bandwidth utilization (Memory BW Utilization) indicates the fraction of cycles during which data is sent to or received from the device memory. This value represents the average over the interval, not an instantaneous value. A higher value indicates higher utilization of device memory. A value of 1 (100%) indicates that a DRAM instruction is executed every cycle during the interval (in practice, a peak of about 0.8 (80%) is the maximum achievable). A value of 0.2 (20%) indicates that 20% of the cycles during the interval are spent reading from or writing to device memory. |
| | Tensor Core Utilization | Percentage of time the Tensor Core pipeline is active during a monitoring cycle |
| | FP16 Engine Utilization | Percentage of time the FP16 pipeline is active during a monitoring cycle |
| | FP32 Engine Utilization | Percentage of time the FP32 pipeline is active during a monitoring cycle |
| | FP64 Engine Utilization | Percentage of time the FP64 pipeline is active during a monitoring cycle |
| | GPU Decode Utilization | Decode engine utilization of the GPU |
| | GPU Encode Utilization | Encode engine utilization of the GPU |
| GPU - Temperature & Power | GPU Temperature | Temperature of all GPUs in the cluster |
| | GPU Power | Power consumption of all GPUs in the cluster |
| | GPU Total Power Consumption | Total power consumption of the GPUs |
| GPU - Clock | GPU Memory Clock | Memory clock frequency |
| | GPU Application SM Clock | Application SM clock frequency |
| | GPU Application Memory Clock | Application memory clock frequency |
| | GPU Video Engine Clock | Video engine clock frequency |
| | GPU Throttle Reasons | Reasons for GPU throttling |
| GPU - Other Details | PCIe Transfer Rate | Data transfer rate of the GPU through the PCIe bus |
| | PCIe Receive Rate | Data receive rate of the GPU through the PCIe bus |
"},{"location":"en/admin/kpanda/gpu/ascend/Ascend_usage.html","title":"Use Ascend NPU","text":"
This section explains how to use Ascend NPU on the AI platform platform.
This document uses the AscendCL Image Classification Application example from the Ascend sample library.
Download the Ascend repository
Run the following command to download the Ascend demo repository, and remember the storage location of the code for subsequent use.
git clone https://gitee.com/ascend/samples.git
Prepare the base image
This example uses the ascend-pytorch base image, which can be obtained from the Ascend Container Registry (AscendHub).
Prepare the YAML file
ascend-demo.yaml
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resnetinfer1-1-1usoc
spec:
  template:
    spec:
      containers:
        - image: ascendhub.huawei.com/public-ascendhub/ascend-pytorch:23.0.RC2-ubuntu18.04 # Inference image name
          imagePullPolicy: IfNotPresent
          name: resnet50infer
          securityContext:
            runAsUser: 0
          command:
            - "/bin/bash"
            - "-c"
            - |
              source /usr/local/Ascend/ascend-toolkit/set_env.sh &&
              TEMP_DIR=/root/samples_copy_$(date '+%Y%m%d_%H%M%S_%N') &&
              cp -r /root/samples "$TEMP_DIR" &&
              cd "$TEMP_DIR"/inference/modelInference/sampleResnetQuickStart/python/model &&
              wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/003_Atc_Models/resnet50/resnet50.onnx &&
              atc --model=resnet50.onnx --framework=5 --output=resnet50 --input_shape="actual_input_1:1,3,224,224" --soc_version=Ascend910 &&
              cd ../data &&
              wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/models/aclsample/dog1_1024_683.jpg &&
              cd ../scripts &&
              bash sample_run.sh
          resources:
            requests:
              huawei.com/Ascend910: 1 # Number of the Ascend 910 Processors
            limits:
              huawei.com/Ascend910: 1 # The value should be the same as that of requests
          volumeMounts:
            - name: hiai-driver
              mountPath: /usr/local/Ascend/driver
              readOnly: true
            - name: slog
              mountPath: /var/log/npu/conf/slog/slog.conf
            - name: localtime # The container time must be the same as the host time
              mountPath: /etc/localtime
            - name: dmp
              mountPath: /var/dmp_daemon
            - name: slogd
              mountPath: /var/slogd
            - name: hbasic
              mountPath: /etc/hdcBasic.cfg
            - name: sys-version
              mountPath: /etc/sys_version.conf
            - name: aicpu
              mountPath: /usr/lib64/aicpu_kernels
            - name: tfso
              mountPath: /usr/lib64/libtensorflow.so
            - name: sample-path
              mountPath: /root/samples
      volumes:
        - name: hiai-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: slog
          hostPath:
            path: /var/log/npu/conf/slog/slog.conf
        - name: localtime
          hostPath:
            path: /etc/localtime
        - name: dmp
          hostPath:
            path: /var/dmp_daemon
        - name: slogd
          hostPath:
            path: /var/slogd
        - name: hbasic
          hostPath:
            path: /etc/hdcBasic.cfg
        - name: sys-version
          hostPath:
            path: /etc/sys_version.conf
        - name: aicpu
          hostPath:
            path: /usr/lib64/aicpu_kernels
        - name: tfso
          hostPath:
            path: /usr/lib64/libtensorflow.so
        - name: sample-path
          hostPath:
            path: /root/samples
      restartPolicy: OnFailure
```
Some fields in the above YAML need to be modified according to the actual situation:
The atc ... --soc_version=Ascend910 command uses Ascend910; adjust this value according to your actual situation. You can use the npu-smi info command to check the NPU model and add the Ascend prefix.
sample-path should be adjusted according to the actual situation.
resources should be adjusted according to the actual situation.
Deploy a Job and check its results
Use the following command to create a Job:
kubectl apply -f ascend-demo.yaml
Check the Pod running status:
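For example, the status and logs could be checked like this (the pod name prefix follows the Job name in the YAML above):

```bash
kubectl get pods | grep resnetinfer
kubectl logs <pod-name>   # view the inference output after the Pod completes
```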
After the Pod runs successfully, check the log results. The key prompt information on the screen is shown in the figure below. The Label indicates the category identifier, Conf indicates the maximum confidence of the classification, and Class indicates the belonging category. These values may vary depending on the version and environment, so please refer to the actual situation:
Confirm whether the cluster has detected the GPU. Click Clusters -> Cluster Settings -> Addon Plugins , and check whether the proper GPU type is automatically enabled and detected. Currently, the cluster will automatically enable GPU and set the GPU type to Ascend .
Deploy the workload. Click Clusters -> Workloads , deploy the workload through an image, select the type (Ascend), and then configure the number of physical cards used by the application:
Number of Physical Cards (huawei.com/Ascend910) : This indicates how many physical cards the current Pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host.
If there is an issue with the above configuration, it will result in scheduling failure and resource allocation issues.
This article will introduce the prerequisites for configuring GPU when creating a virtual machine.
The key point of configuring GPU for virtual machines is to configure the GPU Operator so that different software components can be deployed on working nodes depending on the GPU workloads configured on these nodes. Taking the following three nodes as examples:
The controller-node-1 node is configured to run containers.
The work-node-1 node is configured to run virtual machines with direct GPUs.
The work-node-2 node is configured to run virtual machines with virtual vGPUs.
"},{"location":"en/admin/virtnest/vm/vm-gpu.html#assumptions-limitations-and-dependencies","title":"Assumptions, Limitations, and Dependencies","text":"
Working nodes can run GPU-accelerated containers, virtual machines with direct GPUs (pass-through), or virtual machines with vGPUs, but not any combination of these.
Cluster administrators or developers need to understand the cluster situation in advance and correctly label the nodes to indicate the type of GPU workload they will run.
The working nodes running virtual machines with direct GPUs or vGPUs are assumed to be bare metal. If the working nodes are virtual machines, the GPU direct pass-through feature needs to be enabled on the virtual machine platform. Please consult the virtual machine platform provider.
Nvidia MIG vGPU is not supported.
The GPU Operator will not automatically install GPU drivers in virtual machines.
To enable the GPU direct pass-through feature, the cluster nodes need to enable IOMMU. Please refer to How to Enable IOMMU. If your cluster is running on a virtual machine, consult your virtual machine platform provider.
Note: Building a vGPU Manager image is only required when using NVIDIA vGPUs. If you plan to use only GPU direct pass-through, skip this section.
The following are the steps to build the vGPU Manager image and push it to the container registry:
Download the vGPU software from the NVIDIA Licensing Portal.
Log in to the NVIDIA Licensing Portal and go to the Software Downloads page.
The NVIDIA vGPU software is located in the Driver downloads tab on the Software Downloads page.
Select VGPU + Linux in the filter criteria and click Download to get the software package for Linux KVM. Unzip the downloaded file (NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run).
Clone the container-images/driver repository in the terminal
```bash
git clone https://gitlab.com/nvidia/container-images/driver
cd driver
```
Switch to the vgpu-manager directory for your operating system
cd vgpu-manager/<your-os>
Copy the .run file extracted in step 1 to the current directory
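The remaining build and push steps typically look like the following sketch, based on the NVIDIA GPU Operator documentation; the registry address, version, OS tag, and build argument names are assumptions to be verified against that documentation:

```bash
# Copy the extracted .run file into the build context
cp ~/Downloads/NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run .

export PRIVATE_REGISTRY=my-registry.example.com   # assumed registry address
export VERSION=<version>                          # vGPU driver version
export OS_TAG=ubuntu20.04                         # assumed OS tag matching <your-os>

docker build \
  --build-arg DRIVER_VERSION=${VERSION} \
  -t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} .

docker push ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG}
```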
Go to Container Management , select your worker cluster and click Nodes . On the right side of the list, click ⋮ and select Edit Labels to add labels to the nodes. Each node can only carry one of these labels.
The label key is nvidia.com/gpu.workload.config , and it accepts the following values: container , vm-passthrough , and vm-vgpu .
Go to Container Management , select your worker cluster, click Helm Apps -> Helm Charts , choose and install gpu-operator. You need to modify some fields in the yaml.
Fill in the container registry address referred to in the step "Build vGPU Manager Image".
Fill in the VERSION referred to in the step "Build vGPU Manager Image".
Wait for the installation to be successful, as shown in the image below:
"},{"location":"en/admin/virtnest/vm/vm-gpu.html#install-virtnest-agent-and-configure-cr","title":"Install virtnest-agent and Configure CR","text":"
Install virtnest-agent, refer to Install virtnest-agent.
Add vGPU and GPU direct pass-through to the Virtnest Kubevirt CR. The following example shows the key yaml after adding vGPU and GPU direct pass-through:
```yaml
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - GPU
        - DisableMDEVConfiguration
    # Fill in the information below
    permittedHostDevices:
      mediatedDevices: # vGPU
        - mdevNameSelector: GRID P4-1Q
          resourceName: nvidia.com/GRID_P4-1Q
      pciHostDevices: # GPU direct pass-through
        - externalResourceProvider: true
          pciVendorSelector: 10DE:1BB3
          resourceName: nvidia.com/GP104GL_TESLA_P4
```
In the kubevirt CR yaml, permittedHostDevices is used to import VM devices, and vGPU should be added in mediatedDevices with the following structure:
```yaml
mediatedDevices:
  - mdevNameSelector: GRID P4-1Q          # Device Name
    resourceName: nvidia.com/GRID_P4-1Q   # vGPU information registered by GPU Operator to the node
```
GPU direct pass-through should be added in pciHostDevices under permittedHostDevices with the following structure:
```yaml
pciHostDevices:
  - externalResourceProvider: true             # Do not change by default
    pciVendorSelector: 10DE:1BB3               # Vendor ID of the current PCI device
    resourceName: nvidia.com/GP104GL_TESLA_P4  # GPU information registered by GPU Operator to the node
```
Example of obtaining vGPU information (only applicable to vGPU): View node information on a node marked as nvidia.com/gpu.workload.config=vm-vgpu (e.g., work-node-2), and the nvidia.com/GRID_P4-1Q: 8 in Capacity indicates available vGPUs:
So the mdevNameSelector should be \"GRID P4-1Q\" and the resourceName should be \"GRID_P4-1Q\".
Obtain GPU direct pass-through information: On a node marked as nvidia.com/gpu.workload.config=vm-passthrough (e.g., work-node-1), view the node information; nvidia.com/GP104GL_TESLA_P4: 2 in Capacity indicates the GPUs available for pass-through:
So the resourceName should be "GP104GL_TESLA_P4". How to obtain the pciVendorSelector? SSH into the target node work-node-1 and use the command "lspci -nnk -d 10de:" to get the Nvidia GPU PCI information, as shown in the image above.
Note on editing the kubevirt CR: if there are multiple GPUs of the same model, only one entry needs to be written in the CR; listing each GPU is not necessary.
```yaml
# kubectl -n virtnest-system edit kubevirt kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - GPU
        - DisableMDEVConfiguration
    # Fill in the information below
    permittedHostDevices:
      mediatedDevices: # vGPU
        - mdevNameSelector: GRID P4-1Q
          resourceName: nvidia.com/GRID_P4-1Q
      pciHostDevices: # GPU direct pass-through; in the above example, the TESLA P4 has two GPUs, register only one here
        - externalResourceProvider: true
          pciVendorSelector: 10DE:1BB3
          resourceName: nvidia.com/GP104GL_TESLA_P4
```
"},{"location":"en/admin/virtnest/vm/vm-gpu.html#create-vm-yaml-and-use-gpu-acceleration","title":"Create VM YAML and Use GPU Acceleration","text":"
The only difference from a regular virtual machine is adding GPU-related information in the devices section.
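A sketch of the relevant fragment in the virtual machine YAML is shown below. The deviceName values must match the resourceName registered in the kubevirt CR above; gpu1 is just an arbitrary name:

```yaml
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
            # vGPU
            - deviceName: nvidia.com/GRID_P4-1Q
              name: gpu1
            # or GPU direct pass-through
            # - deviceName: nvidia.com/GP104GL_TESLA_P4
            #   name: gpu1
```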
"},{"location":"en/end-user/index.html","title":"Suanova AI Platform - End User","text":"
This is the user documentation for the Suanova AI Platform aimed at end users.
User Registration
User registration is the first step to using the AI platform.
User Registration
Cloud Host
A cloud host is a virtual machine deployed in the cloud.
Create Cloud Host
Use Cloud Host
Container Management
Container management is the core module of the AI computing center.
K8s Clusters on Cloud
Node Management
Workloads
Helm Apps and Templates
AI Lab
Manage datasets and run AI training and inference jobs.
Create AI Workloads
Use Notebook
Create Training Jobs
Create Inference Services
Insight
Monitor the status of clusters, nodes, and workloads through dashboards.
Monitor Clusters/Nodes
Metrics
Logs
Tracing
Personal Center
Set password, keys, and language in the personal center.
Security Settings
Access Keys
Language Settings
"},{"location":"en/end-user/baize/dataset/create-use-delete.html","title":"Create, Use and Delete Datasets","text":"
AI Lab provides comprehensive dataset management functions needed for model development, training, and inference processes. Currently, it supports unified access to various data sources.
With simple configurations, you can connect data sources to AI Lab, achieving unified data management, preloading, dataset management, and other functionalities.
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#create-a-dataset","title":"Create a Dataset","text":"
In the left navigation bar, click Data Management -> Dataset List, and then click the Create button on the right.
Select the worker cluster and namespace to which the dataset belongs, then click Next.
Configure the data source type for the target data, then click OK.
Currently supported data sources include:
GIT: Supports repositories such as GitHub, GitLab, and Gitee
Upon successful creation, the dataset will be returned to the dataset list. You can perform more actions by clicking ⋮ on the right.
Info
The system will automatically perform a one-time data preloading after the dataset is successfully created; the dataset cannot be used until the preloading is complete.
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#use-a-dataset","title":"Use a Dataset","text":"
Once the dataset is successfully created, it can be used in tasks such as model training and inference.
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#use-in-notebook","title":"Use in Notebook","text":"
When creating a Notebook, you can directly use the dataset; the usage is as follows:
Use the dataset as training data mount
Use the dataset as code mount
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#use-in-training-obs","title":"Use in Training obs","text":"
Use the dataset to specify job output
Use the dataset to specify job input
Use the dataset to specify TensorBoard output
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#use-in-inference-services","title":"Use in Inference Services","text":"
Use the dataset to mount a model
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#delete-a-dataset","title":"Delete a Dataset","text":"
If you find a dataset to be redundant, expired, or no longer needed, you can delete it from the dataset list.
Click the ⋮ on the right side of the dataset list, then choose Delete from the dropdown menu.
In the pop-up window, confirm the dataset you want to delete, enter the dataset name, and then click Delete.
A confirmation message will appear indicating successful deletion, and the dataset will disappear from the list.
Caution
Once a dataset is deleted, it cannot be recovered, so please proceed with caution.
Traditionally, Python environment dependencies are built into an image, which includes the Python version and dependency packages. This approach has high maintenance costs and is inconvenient to update, often requiring a complete rebuild of the image.
In AI Lab, users can manage pure environment dependencies through the Environment Management module, decoupling this part from the image. The advantages include:
One environment can be used in multiple places, such as in Notebooks, distributed training tasks, and even inference services.
Updating dependency packages is more convenient; you only need to update the environment dependencies without rebuilding the image.
The main components of the environment management are:
Cluster : Select the cluster to operate on.
Namespace : Select the namespace to limit the scope of operations.
Environment List : Displays all environments and their statuses under the current cluster and namespace.
"},{"location":"en/end-user/baize/dataset/environments.html#explanation-of-environment-list-fields","title":"Explanation of Environment List Fields","text":"
Name : The name of the environment.
Status : The current status of the environment (normal or failed). New environments undergo a warming-up process, after which they can be used in other tasks.
Creation Time : The time the environment was created.
"},{"location":"en/end-user/baize/dataset/environments.html#creat-new-environment","title":"Creat New Environment","text":"
On the Environment Management interface, click the Create button at the top right to enter the environment creation process.
Fill in the following basic information:
Name : Enter the environment name, with a length of 2-63 characters, starting and ending with lowercase letters or numbers.
Deployment Location:
Cluster : Select the cluster to deploy, such as gpu-cluster.
Namespace : Select the namespace, such as default.
Remarks (optional): Enter remarks.
Labels (optional): Add labels to the environment.
Annotations (optional): Add annotations to the environment. After completing the information, click Next to proceed to environment configuration.
Python Version : Select the required Python version, such as 3.12.3.
Package Manager : Choose the package management tool, either PIP or CONDA.
Environment Data :
If PIP is selected: Enter the dependency package list in requirements.txt format in the editor below (a sample is shown after this list).
If CONDA is selected: Enter the dependency package list in environment.yaml format in the editor below.
Other Options (optional):
Additional pip Index URLs : Configure additional pip index URLs; suitable for internal enterprise private repositories or PIP acceleration sites.
GPU Configuration : Enable or disable GPU configuration; some GPU-related dependency packages need GPU resources configured during preloading.
Associated Storage : Select the associated storage configuration; environment dependency packages will be stored in the associated storage. Note: Storage must support ReadWriteMany.
After configuration, click the Create button, and the system will automatically create and configure the new Python environment.
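For example, a PIP dependency list entered in the editor might look like the following; the package names and versions are placeholders:

```txt
# requirements.txt format
numpy==1.26.4
pandas>=2.0
torch==2.1.2
```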
Verify that the Python version and package manager configuration are correct.
Ensure the selected cluster and namespace are available.
If dependency preloading fails:
Check if the requirements.txt or environment.yaml file format is correct.
Verify that the dependency package names and versions are correct. If other issues arise, contact the platform administrator or refer to the platform help documentation for more support.
These are the basic steps and considerations for managing Python dependencies in AI Lab.
With the rapid iteration of AI Lab, we have now supported various model inference services. Here, you can see information about the supported models.
AI Lab v0.3.0 launched model inference services for traditional deep learning models, allowing users to use them directly without worrying about model deployment and maintenance.
AI Lab v0.6.0 supports the complete version of vLLM inference capabilities, supporting many large language models such as LLama, Qwen, ChatGLM, and more.
Note
The support for inference capabilities is related to the version of AI Lab.
You can use GPU types that have been verified by AI platform in AI Lab. For more details, refer to the GPU Support Matrix.
Through the Triton Inference Server, traditional deep learning models can be well supported. Currently, AI Lab supports mainstream inference backend services:
The use of Triton's Backend vLLM method has been deprecated. It is recommended to use the latest support for vLLM to deploy your large language models.
With vLLM, we can quickly use large language models. Here, you can see the list of models we support, which generally aligns with the vLLM Support Models.
HuggingFace Models: We support most of HuggingFace's models. You can see more models at the HuggingFace Model Hub.
The vLLM Supported Models list includes supported large language models and vision-language models.
Models fine-tuned using the vLLM support framework.
"},{"location":"en/end-user/baize/inference/models.html#new-features-of-vllm","title":"New Features of vLLM","text":"
Currently, AI Lab also supports some new features when using vLLM as an inference tool:
Enable Lora Adapter to optimize model inference services during inference.
Provide an OpenAI-compatible API, making it easy for users to switch to local inference services at low cost.
"},{"location":"en/end-user/baize/inference/triton-inference.html","title":"Create Inference Service Using Triton Framework","text":"
The AI Lab currently offers Triton and vLLM as inference frameworks. Users can quickly start a high-performance inference service with simple configurations.
Danger
The use of Triton's Backend vLLM method has been deprecated. It is recommended to use the latest support for vLLM to deploy your large language models.
"},{"location":"en/end-user/baize/inference/triton-inference.html#introduction-to-triton","title":"Introduction to Triton","text":"
Triton is an open-source inference server developed by NVIDIA, designed to simplify the deployment and inference of machine learning models. It supports a variety of deep learning frameworks, including TensorFlow and PyTorch, enabling users to easily manage and deploy different types of models.
Prepare model data: Manage the model code in dataset management and ensure that the data is successfully preloaded. The following example illustrates the PyTorch model for mnist handwritten digit recognition.
Note
The model to be inferred must adhere to the following directory structure within the dataset:
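The expected layout, consistent with the model path used later (model-repo/mnist-cnn/1/model.pt), is sketched below:

```txt
model-repo/
└── mnist-cnn/          # model name
    └── 1/              # model version
        └── model.pt    # TorchScript model file
```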
Currently, form-based creation is supported, allowing you to create services with field prompts in the interface.
"},{"location":"en/end-user/baize/inference/triton-inference.html#configure-model-path","title":"Configure Model Path","text":"
The model path model-repo/mnist-cnn/1/model.pt must be consistent with the directory structure of the dataset.
"},{"location":"en/end-user/baize/inference/triton-inference.html#model-configuration","title":"Model Configuration","text":""},{"location":"en/end-user/baize/inference/triton-inference.html#configure-input-and-output-parameters","title":"Configure Input and Output Parameters","text":"
Note
The first dimension of the input and output parameters defaults to batchsize; setting it to -1 allows the batchsize to be calculated automatically based on the input inference data. The remaining dimensions and the data type must match the model's input.
Send HTTP POST Request: Use tools like curl or HTTP client libraries (e.g., Python's requests library) to send POST requests to the Triton Server.
Set HTTP Headers: The configuration is generated automatically based on user settings; include metadata about the model inputs and outputs in the HTTP headers.
Construct Request Body: The request body usually contains the input data for inference and model-specific metadata.
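A request following the Triton (KServe v2) inference protocol might look like the sketch below. Replace the placeholders as explained in the following items; the input name INPUT__0 and the shape are assumptions for the mnist example, and the data array is elided:

```bash
curl -X POST "http://<ip>:<port>/v2/models/<inference-name>/infer" \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": [
          {
            "name": "INPUT__0",
            "shape": [1, 1, 28, 28],
            "datatype": "FP32",
            "data": [...]
          }
        ]
      }'
```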
<ip> is the host address where the Triton Inference Server is running.
<port> is the port where the Triton Inference Server is running.
<inference-name> is the name of the inference service that has been created.
\"name\" must match the name of the input parameter in the model configuration.
\"shape\" must match the dims of the input parameter in the model configuration.
\"datatype\" must match the Data Type of the input parameter in the model configuration.
\"data\" should be replaced with the actual inference data.
Please note that the above example code needs to be adjusted according to your specific model and environment. The format and content of the input data must also comply with the model's requirements.
"},{"location":"en/end-user/baize/inference/vllm-inference.html","title":"Create Inference Service Using vLLM Framework","text":"
AI Lab supports using vLLM as an inference service, offering all the capabilities of vLLM while fully adapting to the OpenAI interface definition.
"},{"location":"en/end-user/baize/inference/vllm-inference.html#introduction-to-vllm","title":"Introduction to vLLM","text":"
vLLM is a fast and easy-to-use library for inference and services. It aims to significantly improve the throughput and memory efficiency of language model services in real-time scenarios. vLLM boasts several features in terms of speed and flexibility:
Continuous batching of incoming requests.
Efficiently manages attention keys and values memory using PagedAttention.
Seamless integration with popular HuggingFace models.
Select the vLLM inference framework. In the model module selection, choose the pre-created model dataset hdd-models and fill in the path information where the model is located within the dataset.
This guide uses the ChatGLM3 model for creating the inference service.
Configure the resources for the inference service and adjust the parameters for running the inference service.
| Parameter Name | Description |
| --- | --- |
| GPU Resources | Configure GPU resources for inference based on the model scale and cluster resources. |
| Allow Remote Code | Controls whether vLLM trusts and executes code from remote sources. |
| LoRA | LoRA is a parameter-efficient fine-tuning technique for deep learning models. It reduces the number of parameters and computational complexity by decomposing the original model parameter matrix into low-rank matrices. 1. --lora-modules: Specifies specific modules or layers for low-rank approximation. 2. max_loras_rank: Specifies the maximum rank for each adapter layer in the LoRA model. For simpler tasks, a smaller rank value can be chosen, while more complex tasks may require a larger rank value to ensure model performance. 3. max_loras: Indicates the maximum number of LoRA layers that can be included in the model, customized based on model size and inference complexity. 4. max_cpu_loras: Specifies the maximum number of LoRA layers that can be handled in a CPU environment. |
| Associated Environment | Selects predefined environment dependencies required for inference. |
Info
For models that support LoRA parameters, refer to vLLM Supported Models.
In the Advanced Configuration , support is provided for automated affinity scheduling based on GPU resources and other node configurations. Users can also customize scheduling policies.
Once the inference service is created, click the name of the inference service to enter the details and view the API call methods. Verify the execution results using Curl, Python, and Node.js.
Copy the curl command from the details and execute it in the terminal to send a model inference request. The expected output should be:
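Since the interface is OpenAI-compatible, the response is expected to be a JSON body along these lines (a sketch; the actual content depends on your prompt and model):

```json
{
  "id": "cmpl-...",
  "object": "chat.completion",
  "model": "chatglm3-6b",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! How can I help you today?" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 10, "completion_tokens": 12, "total_tokens": 22 }
}
```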
Job management refers to the functionality of creating and managing job lifecycles through job scheduling and control components.
AI platform Smart Computing Capability adopts Kubernetes' Job mechanism to schedule various AI inference and training jobs.
Click Job Center -> Jobs in the left navigation bar to enter the job list. Click the Create button on the right.
The system will pre-fill basic configuration data, including the cluster, namespace, type, queue, and priority. Adjust these parameters and click Next.
Configure the URL, runtime parameters, and associated datasets, then click Next.
Optionally add labels, annotations, runtime environment variables, and other job parameters. Select a scheduling policy and click Confirm.
After the job is successfully created, it will have several running statuses:
Pytorch is an open-source deep learning framework that provides a flexible environment for training and deployment. A Pytorch job is a job that uses the Pytorch framework.
In the AI Lab platform, we provide support and adaptation for Pytorch jobs. Through a graphical interface, you can quickly create Pytorch jobs and perform model training.
Here we use the baize-notebook base image and the associated environment as the basic runtime environment for the job.
To learn how to create an environment, refer to Environments.
"},{"location":"en/end-user/baize/jobs/pytorch.html#create-jobs","title":"Create Jobs","text":""},{"location":"en/end-user/baize/jobs/pytorch.html#pytorch-single-jobs","title":"Pytorch Single Jobs","text":"
Log in to the AI Lab platform, click Job Center in the left navigation bar to enter the Jobs page.
Click the Create button in the upper right corner to enter the job creation page.
Select the job type as Pytorch Single and click Next .
Fill in the job name and description, then click OK .
Once the job is successfully submitted, we can enter the job details to see the resource usage. From the upper right corner, go to Workload Details to view the log output during the training process.
```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

def train():
    # Print environment information
    print(f'PyTorch version: {torch.__version__}')
    print(f'CUDA available: {torch.cuda.is_available()}')
    if torch.cuda.is_available():
        print(f'CUDA version: {torch.version.cuda}')
        print(f'CUDA device count: {torch.cuda.device_count()}')

    rank = int(os.environ.get('RANK', '0'))
    world_size = int(os.environ.get('WORLD_SIZE', '1'))

    print(f'Rank: {rank}, World Size: {world_size}')

    # Initialize distributed environment
    try:
        if world_size > 1:
            dist.init_process_group('nccl')
            print('Distributed process group initialized successfully')
        else:
            print('Running in non-distributed mode')
    except Exception as e:
        print(f'Error initializing process group: {e}')
        return

    # Set device
    try:
        if torch.cuda.is_available():
            device = torch.device(f'cuda:{rank % torch.cuda.device_count()}')
            print(f'Using CUDA device: {device}')
        else:
            device = torch.device('cpu')
            print('CUDA not available, using CPU')
    except Exception as e:
        print(f'Error setting device: {e}')
        device = torch.device('cpu')
        print('Falling back to CPU')

    try:
        model = SimpleModel().to(device)
        print('Model moved to device successfully')
    except Exception as e:
        print(f'Error moving model to device: {e}')
        return

    try:
        if world_size > 1:
            ddp_model = DDP(model, device_ids=[rank % torch.cuda.device_count()] if torch.cuda.is_available() else None)
            print('DDP model created successfully')
        else:
            ddp_model = model
            print('Using non-distributed model')
    except Exception as e:
        print(f'Error creating DDP model: {e}')
        return

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # Generate some random data
    try:
        data = torch.randn(100, 10, device=device)
        labels = torch.randn(100, 1, device=device)
        print('Data generated and moved to device successfully')
    except Exception as e:
        print(f'Error generating or moving data to device: {e}')
        return

    for epoch in range(10):
        try:
            ddp_model.train()
            outputs = ddp_model(data)
            loss = loss_fn(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if rank == 0:
                print(f'Epoch {epoch}, Loss: {loss.item():.4f}')
        except Exception as e:
            print(f'Error during training epoch {epoch}: {e}')
            break

    if world_size > 1:
        dist.destroy_process_group()

if __name__ == '__main__':
    train()
```
"},{"location":"en/end-user/baize/jobs/pytorch.html#number-of-job-replicas","title":"Number of Job Replicas","text":"
Note that Pytorch Distributed training jobs will create a group of Master and Worker training Pods, where the Master is responsible for coordinating the training job, and the Worker is responsible for the actual training work.
Note
In this demonstration, the Master replica count is 1 and the Worker replica count is 2. Therefore, the replica count in the Job Configuration must be set to 3, the sum of the Master and Worker replicas. The Master and Worker roles are assigned automatically by Pytorch.
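For reference, a distributed job like this roughly corresponds to a Kubeflow PyTorchJob resource. The sketch below is illustrative only (the image, command, and names are placeholders rather than the values used by AI Lab) and shows how the 1 Master + 2 Worker replicas map onto the resource:

```yaml
# Minimal PyTorchJob sketch: 1 Master + 2 Workers = 3 replicas in total.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-ddp-demo                  # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1                          # coordinates the training job
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/baize/pytorch:latest   # placeholder image
              command: ["python", "/code/pytorch/ddp-train.py"]  # placeholder script path
    Worker:
      replicas: 2                          # performs the actual training work
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/baize/pytorch:latest   # placeholder image
              command: ["python", "/code/pytorch/ddp-train.py"]  # placeholder script path
```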
AI Lab provides visualization analysis tools for the model development process, used to display the training process and results of machine learning models. This document introduces the basic concepts of Job Analysis (Tensorboard), how it is used in the AI Lab system, and how to configure the log content of datasets.
Note
Tensorboard is a visualization tool provided by TensorFlow, used to display the training process and results of machine learning models. It can help developers more intuitively understand the training dynamics of their models, analyze model performance, debug issues, and more.
The role and advantages of Tensorboard in the model development process:
Visualize Training Process : Display metrics such as training and validation loss, and accuracy through charts, helping developers intuitively observe the training effects of the model.
Debug and Optimize Models : By viewing the weights and gradient distributions of different layers, help developers discover and fix issues in the model.
Compare Different Experiments : Simultaneously display the results of multiple experiments, making it convenient for developers to compare the effects of different models and hyperparameter configurations.
Track Training Data : Record the datasets and parameters used during training to ensure the reproducibility of experiments.
"},{"location":"en/end-user/baize/jobs/tensorboard.html#how-to-create-tensorboard","title":"How to Create Tensorboard","text":"
In the AI Lab system, we provide a convenient way to create and manage Tensorboard. Here are the specific steps:
"},{"location":"en/end-user/baize/jobs/tensorboard.html#enable-tensorboard-when-creating-a-notebook","title":"Enable Tensorboard When Creating a Notebook","text":"
Create a Notebook : Create a new Notebook on the AI Lab platform.
Enable Tensorboard : On the Notebook creation page, enable the Tensorboard option and specify the dataset and log path.
"},{"location":"en/end-user/baize/jobs/tensorboard.html#enable-tensorboard-after-creating-and-completing-a-distributed-job","title":"Enable Tensorboard After Creating and Completing a Distributed Job","text":"
Create a Distributed Job : Create a new distributed training job on the AI Lab platform.
Configure Tensorboard : On the job configuration page, enable the Tensorboard option and specify the dataset and log path.
View Tensorboard After Job Completion : After the job is completed, you can view the Tensorboard link on the job details page. Click the link to see the visualized results of the training process.
"},{"location":"en/end-user/baize/jobs/tensorboard.html#directly-reference-tensorboard-in-a-notebook","title":"Directly Reference Tensorboard in a Notebook","text":"
In a Notebook, you can directly start Tensorboard through code. Here is a sample code snippet:
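A minimal sketch using the TensorBoard notebook extension is shown below; the log directory is an assumption and should match the dataset/log path configured for your Notebook:

```python
# Load the TensorBoard notebook extension and point it at the log directory.
# The path below is a placeholder; use the log path configured for your dataset.
%load_ext tensorboard
%tensorboard --logdir /home/jovyan/data/logs
```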
"},{"location":"en/end-user/baize/jobs/tensorboard.html#how-to-configure-dataset-log-content","title":"How to Configure Dataset Log Content","text":"
When using Tensorboard, you can record and configure different datasets and log content. Here are some common configuration methods:
"},{"location":"en/end-user/baize/jobs/tensorboard.html#configure-training-and-validation-dataset-logs","title":"Configure Training and Validation Dataset Logs","text":"
While training the model, you can use TensorFlow's tf.summary API to record logs for the training and validation datasets. Here is a sample code snippet:
```python
# Import necessary libraries
import tensorflow as tf

# Create log directories
train_log_dir = 'logs/gradient_tape/train'
val_log_dir = 'logs/gradient_tape/val'
train_summary_writer = tf.summary.create_file_writer(train_log_dir)
val_summary_writer = tf.summary.create_file_writer(val_log_dir)

# Train model and record logs
for epoch in range(EPOCHS):
    for (x_train, y_train) in train_dataset:
        # Training step
        train_step(x_train, y_train)
        with train_summary_writer.as_default():
            tf.summary.scalar('loss', train_loss.result(), step=epoch)
            tf.summary.scalar('accuracy', train_accuracy.result(), step=epoch)

    for (x_val, y_val) in val_dataset:
        # Validation step
        val_step(x_val, y_val)
        with val_summary_writer.as_default():
            tf.summary.scalar('loss', val_loss.result(), step=epoch)
            tf.summary.scalar('accuracy', val_accuracy.result(), step=epoch)
```
In addition to logs for training and validation datasets, you can also record other custom log content such as learning rate and gradient distribution. Here is a sample code snippet:
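A minimal sketch is shown below; it assumes a custom training loop in which model, optimizer, epoch, and the gradients list computed in the training step are already defined:

```python
import tensorflow as tf

# Sketch: record the current learning rate and per-variable gradient distributions.
# Assumes `model`, `optimizer`, `epoch`, and the `gradients` list returned by
# tape.gradient(loss, model.trainable_variables) in the training step are available.
custom_summary_writer = tf.summary.create_file_writer('logs/gradient_tape/custom')

with custom_summary_writer.as_default():
    # Record the learning rate (for a schedule, evaluate it for the current step first)
    tf.summary.scalar('learning_rate', optimizer.learning_rate, step=epoch)

    # Record the gradient distribution of each trainable variable as a histogram
    for var, grad in zip(model.trainable_variables, gradients):
        if grad is not None:
            tf.summary.histogram(f'gradients/{var.name}', grad, step=epoch)
```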
In AI Lab, Tensorboards created through various methods are uniformly displayed on the job analysis page, making it convenient for users to view and manage.
Users can view information such as the link, status, and creation time of Tensorboard on the job analysis page and directly access the visualized results of Tensorboard through the link.
Tensorflow, along with Pytorch, is a highly active open-source deep learning framework that provides a flexible environment for training and deployment.
AI Lab provides support and adaptation for the Tensorflow framework. You can quickly create Tensorflow jobs and conduct model training through graphical operations.
Here, we use the baize-notebook base image and the associated environment as the basic runtime environment for jobs.
For information on how to create an environment, refer to Environment List.
"},{"location":"en/end-user/baize/jobs/tensorflow.html#creating-a-job","title":"Creating a Job","text":""},{"location":"en/end-user/baize/jobs/tensorflow.html#example-tfjob-single","title":"Example TFJob Single","text":"
Log in to the AI Lab platform and click Job Center in the left navigation bar to enter the Jobs page.
Click the Create button in the upper right corner to enter the job creation page.
Select the job type as Tensorflow Single and click Next .
Fill in the job name and description, then click OK .
"},{"location":"en/end-user/baize/jobs/tensorflow.html#pre-warming-the-code-repository","title":"Pre-warming the Code Repository","text":"
Use AI Lab -> Dataset List to create a dataset and pull the code from a remote GitHub repository into the dataset. This way, when creating a job, you can directly select the dataset and mount the code into the job.
Command parameters: Use python /code/tensorflow/tf-single.py
\"\"\"\n pip install tensorflow numpy\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\n\n# Create some random data\nx = np.random.rand(100, 1)\ny = 2 * x + 1 + np.random.rand(100, 1) * 0.1\n\n# Create a simple model\nmodel = tf.keras.Sequential([\n tf.keras.layers.Dense(1, input_shape=(1,))\n])\n\n# Compile the model\nmodel.compile(optimizer='adam', loss='mse')\n\n# Train the model, setting epochs to 10\nhistory = model.fit(x, y, epochs=10, verbose=1)\n\n# Print the final loss\nprint('Final loss: {' + str(history.history['loss'][-1]) +'}')\n\n# Use the model to make predictions\ntest_x = np.array([[0.5]])\nprediction = model.predict(test_x)\nprint(f'Prediction for x=0.5: {prediction[0][0]}')\n
After the job is successfully submitted, you can enter the job details to see the resource usage. From the upper right corner, navigate to Workload Details to view log outputs during the training process.
Once a job is created, it will be displayed in the job list.
In the job list, click the ┇ on the right side of a job and select Job Workload Details.
A pop-up window will appear asking you to choose which Pod to view. Click Enter .
You will be redirected to the container management interface, where you can view the container's working status, labels and annotations, and any events that have occurred.
You can also view detailed logs of the current Pod for the recent period. By default, 100 lines of logs are displayed. To view more detailed logs or to download logs, click the blue Insight text at the top.
Additionally, you can use the ... in the upper right corner to view the current Pod's YAML, and to upload or download files. Below is an example of a Pod's YAML.
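A minimal, illustrative sketch of such a Pod YAML is shown below; all names, labels, and the image are placeholders rather than the exact values generated by the platform:

```yaml
# Illustrative Pod YAML for a single-node TFJob worker (placeholder values).
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-single-worker-0
  namespace: default
  labels:
    training.kubeflow.org/job-name: tensorflow-single
spec:
  containers:
    - name: tensorflow
      image: registry.example.com/baize/tensorflow:latest   # placeholder image
      command: ["python", "/code/tensorflow/tf-single.py"]
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
  restartPolicy: Never
```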
The access key can be used to access the openAPI and continuous delivery. Users can obtain the key and access the API by referring to the following steps in the personal center.
Log in to AI platform, find Personal Center in the drop-down menu in the upper right corner, and you can manage the access key of the account on the Access Keys page.
Info
Access key is displayed only once. If you forget your access key, you will need to create a new key.
"},{"location":"en/end-user/ghippo/personal-center/accesstoken.html#use-the-key-to-access-api","title":"Use the key to access API","text":"
When accessing AI platform openAPI, add the header Authorization:Bearer ${token} to the request to identify the visitor, where ${token} is the key obtained in the previous step. For the specific API, see OpenAPI Documentation.
Request Example
```bash
curl -X GET -H 'Authorization:Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IkRKVjlBTHRBLXZ4MmtQUC1TQnVGS0dCSWc1cnBfdkxiQVVqM2U3RVByWnMiLCJ0eXAiOiJKV1QifQ.eyJleHAiOjE2NjE0MTU5NjksImlhdCI6MTY2MDgxMTE2OSwiaXNzIjoiZ2hpcHBvLmlvIiwic3ViIjoiZjdjOGIxZjUtMTc2MS00NjYwLTg2MWQtOWI3MmI0MzJmNGViIiwicHJlZmVycmVkX3VzZXJuYW1lIjoiYWRtaW4iLCJncm91cHMiOltdfQ.RsUcrAYkQQ7C6BxMOrdD3qbBRUt0VVxynIGeq4wyIgye6R8Ma4cjxG5CbU1WyiHKpvIKJDJbeFQHro2euQyVde3ygA672ozkwLTnx3Tu-_mB1BubvWCBsDdUjIhCQfT39rk6EQozMjb-1X1sbLwzkfzKMls-oxkjagI_RFrYlTVPwT3Oaw-qOyulRSw7Dxd7jb0vINPq84vmlQIsI3UuTZSNO5BCgHpubcWwBss-Aon_DmYA-Et_-QtmPBA3k8E2hzDSzc7eqK0I68P25r9rwQ3DeKwD1dbRyndqWORRnz8TLEXSiCFXdZT2oiMrcJtO188Ph4eLGut1-4PzKhwgrQ' \
  'https://demo-dev.daocloud.io/apis/ghippo.io/v1alpha1/users?page=1&pageSize=10' -k
```
This section explains how to set the interface language. Chinese and English are currently supported.
Language setting is the portal for the platform to provide multilingual services. The platform is displayed in Chinese by default. Users can switch the platform language by selecting English or automatically detecting the browser language preference according to their needs. Each user's multilingual service is independent of each other, and switching will not affect other users.
The platform provides three language options: Chinese, English, and automatic detection of your browser language preference.
The operation steps are as follows.
Log in to the AI platform with your username/password. Click Global Management at the bottom of the left navigation bar.
Click the username in the upper right corner and select Personal Center .
Function description: It is used to fill in the email address and modify the login password.
Email: after the administrator configures the email server address, users can click the Forgot Password button on the login page and enter this email address to retrieve the password.
Password: the password used to log in to the platform. It is recommended to change the password regularly.
The specific operation steps are as follows:
Click the username in the upper right corner and select Personal Center .
Click the Security Settings tab. Fill in your email address or change the login password.
"},{"location":"en/end-user/ghippo/personal-center/ssh-key.html","title":"Configuring SSH Public Key","text":"
This article explains how to configure SSH public key.
Before generating a new SSH key, please check whether you can use an existing SSH key stored in the local user's home directory. For Linux and Mac, use the following commands to view existing public keys. Windows users can run the same commands in WSL (requires Windows 10 or above) or Git Bash to view the generated public keys.
ED25519 Algorithm:
```bash
cat ~/.ssh/id_ed25519.pub
```
RSA Algorithm:
```bash
cat ~/.ssh/id_rsa.pub
```
If a long string starting with ssh-ed25519 or ssh-rsa is returned, it means that a local public key already exists. You can skip Step 2 Generate SSH Key and proceed directly to Step 3.
If Step 1 does not return the specified content string, it means that there is no available SSH key locally and a new SSH key needs to be generated. Please follow these steps:
Open a terminal (Windows users should use WSL or Git Bash) and run the ssh-keygen -t command.
Enter the key algorithm type and an optional comment.
The comment will appear in the .pub file and can generally use the email address as the comment content.
To generate a key pair based on the ED25519 algorithm, use the following command:
```bash
ssh-keygen -t ed25519 -C "<comment>"
```
To generate a key pair based on the RSA algorithm, use the following command:
```bash
ssh-keygen -t rsa -C "<comment>"
```
Press Enter to choose the SSH key generation path.
Taking the ED25519 algorithm as an example, the default path is as follows:
```console
Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/user/.ssh/id_ed25519):
```
The default key generation path is /home/user/.ssh/id_ed25519, and the proper public key is /home/user/.ssh/id_ed25519.pub.
Set a passphrase for the key.
```console
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
```
The passphrase is empty by default, and you can choose to use a passphrase to protect the private key file. If you do not want to enter a passphrase every time you access the repository using the SSH protocol, you can enter an empty passphrase when creating the key.
Press Enter to complete the key pair creation.
"},{"location":"en/end-user/ghippo/personal-center/ssh-key.html#step-3-copy-the-public-key","title":"Step 3. Copy the Public Key","text":"
In addition to manually copying the generated public key information printed on the command line, you can use the following commands to copy the public key to the clipboard, depending on the operating system.
Windows (in WSL or Git Bash):
```bash
cat ~/.ssh/id_ed25519.pub | clip
```
Mac:
```bash
tr -d '\n' < ~/.ssh/id_ed25519.pub | pbcopy
```
GNU/Linux (requires xclip):
```bash
xclip -sel clip < ~/.ssh/id_ed25519.pub
```
"},{"location":"en/end-user/ghippo/personal-center/ssh-key.html#step-4-set-the-public-key-on-ai-platform-platform","title":"Step 4. Set the Public Key on AI platform Platform","text":"
Log in to the AI platform UI page and select Profile -> SSH Public Key in the upper right corner of the page.
Add the generated SSH public key information.
SSH public key content.
Public key title: Supports customizing the public key name for management differentiation.
Expiration: Set the expiration period for the public key. After it expires, the public key will be automatically invalidated and cannot be used. If not set, it will be permanently valid.
"},{"location":"en/end-user/ghippo/workspace/folder-permission.html","title":"Description of folder permissions","text":"
Folders have permission mapping capabilities, which can map the permissions of users/groups in this folder to subfolders, workspaces and resources under it.
If a user/group has the Folder Admin role in this folder, it remains Folder Admin when mapped to a subfolder and is mapped to Workspace Admin in the workspaces under it. If a namespace is bound in Workspace and Folder -> Resource Group, the user/group is also mapped to Namespace Admin.
Note
The permission mapping capability of folders will not be applied to shared resources, because sharing is to share the use permissions of the cluster to multiple workspaces, rather than assigning management permissions to workspaces, so permission inheritance and role mapping will not be implemented.
Folders have hierarchical capabilities, so when folders are mapped to departments/suppliers/projects in the enterprise:
If a user/group has administrative authority (Admin) in the first-level department, the second-level, third-level, and fourth-level departments or projects under it also have administrative authority;
If a user/group has access rights (Editor) in the first-level department, the second-, third-, and fourth-level departments or projects under it also have access rights;
If a user/group has read-only permission (Viewer) in the first-level department, the second-level, third-level, and fourth-level departments or projects under it also have read-only permission.
| Objects | Actions | Folder Admin | Folder Editor | Folder Viewer |
| --- | --- | --- | --- | --- |
| The folder itself | View | ✓ | ✓ | ✓ |
|  | Authorization | ✓ | ✗ | ✗ |
|  | Modify Alias | ✓ | ✗ | ✗ |
| Subfolder | Create | ✓ | ✗ | ✗ |
|  | View | ✓ | ✓ | ✓ |
|  | Authorization | ✓ | ✗ | ✗ |
|  | Modify Alias | ✓ | ✗ | ✗ |
| Workspace under it | Create | ✓ | ✗ | ✗ |
|  | View | ✓ | ✓ | ✓ |
|  | Authorization | ✓ | ✗ | ✗ |
|  | Modify Alias | ✓ | ✗ | ✗ |
| Workspace under it - Resource Group | View | ✓ | ✓ | ✓ |
|  | Resource Binding | ✓ | ✗ | ✗ |
|  | Unbind | ✓ | ✗ | ✗ |
| Workspace under it - Shared Resources | View | ✓ | ✓ | ✓ |
|  | New Share | ✓ | ✗ | ✗ |
|  | Unshare | ✓ | ✗ | ✗ |
|  | Resource Quota | ✓ | ✗ | ✗ |

# Create/Delete Folders
Folders have the capability to map permissions, allowing users/user groups to have their permissions in the folder mapped to its sub-folders, workspaces, and resources.
Follow the steps below to create a folder:
Log in to AI platform with a user account having the admin/folder admin role. Click Global Management -> Workspace and Folder at the bottom of the left navigation bar.
Click the Create Folder button in the top right corner.
Fill in the folder name, parent folder, and other information, then click OK to complete creating the folder.
Tip
After successful creation, the folder name will be displayed in the left tree structure, represented by different icons for workspaces and folders.
Note
To edit or delete a specific folder, select it and click ┇ on the right side.
If there are resources bound to the resource group or shared resources within the folder, the folder cannot be deleted. All resources need to be unbound before deleting.
If there are registry resources accessed by the microservice engine module within the folder, the folder cannot be deleted. All access to the registry needs to be removed before deleting the folder.
Shared resources do not necessarily mean that the shared users can use the shared resources without any restrictions. Admin, Kpanda Owner, and Workspace Admin can limit the maximum usage quota of a user through the Resource Quota feature in shared resources. If no restrictions are set, it means the usage is unlimited.
CPU Request (Core)
CPU Limit (Core)
Memory Request (MB)
Memory Limit (MB)
Total Storage Request (GB)
Persistent Volume Claims (PVC)
GPU Type, Spec, Quantity (including but not limited to NVIDIA, Ascend, Iluvatar, and other GPUs)
A resource (cluster) can be shared among multiple workspaces, and a workspace can use resources from multiple shared clusters simultaneously.
"},{"location":"en/end-user/ghippo/workspace/quota.html#resource-groups-and-shared-resources","title":"Resource Groups and Shared Resources","text":"
Cluster resources in both shared resources and resource groups are derived from Container Management. However, different effects will occur when binding a cluster to a workspace or sharing it with a workspace.
Binding Resources
Users/User groups in the workspace will have full management and usage permissions for the cluster. Workspace Admin will be mapped as Cluster Admin. Workspace Admin can access the Container Management module to manage the cluster.
Note
As of now, there are no Cluster Editor and Cluster Viewer roles in the Container Management module. Therefore, Workspace Editor and Workspace Viewer cannot be mapped.
Adding Shared Resources
Users/User groups in the workspace will have usage permissions for the cluster resources.
Unlike resource groups, when sharing a cluster with a workspace, the roles of the users in the workspace will not be mapped to the resources. Therefore, Workspace Admin will not be mapped as Cluster Admin.
This section demonstrates three scenarios related to resource quotas.
Select workspace ws01 and the shared cluster in Workbench, and create a namespace ns01 .
If no resource quotas are set in the shared cluster, there is no need to set resource quotas when creating the namespace.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the CPU request for the namespace must be less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful creation (see the sketch below).
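For reference, a namespace-level quota like this corresponds to a standard Kubernetes ResourceQuota; the sketch below uses illustrative values only:

```yaml
# Illustrative ResourceQuota: the namespace's total CPU requests must stay within
# the quota granted to the workspace (e.g. 100 cores shared from the cluster).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ns01-quota
  namespace: ns01
spec:
  hard:
    requests.cpu: "100"      # CPU Request <= 100 cores
    requests.memory: 200Gi   # illustrative memory request limit
```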
"},{"location":"en/end-user/ghippo/workspace/quota.html#bind-namespace-to-workspace","title":"Bind Namespace to Workspace","text":"
Prerequisite: Workspace ws01 has added a shared cluster, and the operator has the Workspace Admin + Kpanda Owner or Admin role.
The two methods of binding have the same effect.
Bind the created namespace ns01 to ws01 in Container Management.
If no resource quotas are set in the shared cluster, the namespace ns01 can be successfully bound regardless of whether resource quotas are set.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the namespace ns01 must have CPU requests less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful binding.
Bind the namespace ns01 to ws01 in Global Management.
If no resource quotas are set in the shared cluster, the namespace ns01 can be successfully bound regardless of whether resource quotas are set.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the namespace ns01 must have CPU requests less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful binding.
"},{"location":"en/end-user/ghippo/workspace/quota.html#unbind-namespace-from-workspace","title":"Unbind Namespace from Workspace","text":"
The two methods of unbinding have the same effect.
Unbind the namespace ns01 from workspace ws01 in Container Management.
If no resource quotas are set in the shared cluster, unbinding the namespace ns01 will not affect the resource quotas, regardless of whether resource quotas were set for the namespace.
If resource quotas (CPU Request = 100 cores) are set in the shared cluster and the namespace ns01 has its own resource quotas, unbinding will release the proper resource quota.
Unbind the namespace ns01 from workspace ws01 in Global Management.
If no resource quotas are set in the shared cluster, unbinding the namespace ns01 will not affect the resource quotas, regardless of whether resource quotas were set for the namespace.
If resource quotas (CPU Request = 100 cores) are set in the shared cluster and the namespace ns01 has its own resource quotas, unbinding will release the proper resource quota.
"},{"location":"en/end-user/ghippo/workspace/res-gp-and-shared-res.html","title":"Differences between Resource Groups and Shared Resources","text":"
Both resource groups and shared resources support cluster binding, but they have significant differences in usage.
"},{"location":"en/end-user/ghippo/workspace/res-gp-and-shared-res.html#differences-in-usage-scenarios","title":"Differences in Usage Scenarios","text":"
Cluster Binding for Resource Groups: Resource groups are usually used for batch authorization. After binding a resource group to a cluster, the workspace administrator will be mapped as a cluster administrator and able to manage and use cluster resources.
Cluster Binding for Shared Resources: Shared resources are usually used for resource quotas. A typical scenario is that the platform administrator assigns a cluster to a first-level supplier, who then assigns the cluster to a second-level supplier and sets resource quotas for the second-level supplier.
Note: In this scenario, the platform administrator needs to impose resource restrictions on secondary suppliers. Currently, the primary supplier cannot limit the cluster quota of secondary suppliers.
"},{"location":"en/end-user/ghippo/workspace/res-gp-and-shared-res.html#differences-in-cluster-quota-usage","title":"Differences in Cluster Quota Usage","text":"
Cluster Binding for Resource Groups: The workspace administrator is mapped as the administrator of the cluster, which is equivalent to being granted the Cluster Admin role in Container Management -> Permission Management. They have unrestricted access to cluster resources, can manage important content such as management nodes, and are not subject to resource quotas.
Cluster Binding for Shared Resources: The workspace administrator can only use the quota in the cluster to create namespaces in the Workbench and does not have cluster management permissions. If the workspace is restricted by a quota, the workspace administrator can only create and use namespaces within the quota range.
"},{"location":"en/end-user/ghippo/workspace/res-gp-and-shared-res.html#differences-in-resource-types","title":"Differences in Resource Types","text":"
Resource Groups: Can bind to clusters, cluster-namespaces, multiclouds, multicloud namespaces, meshs, and mesh-namespaces.
Shared Resources: Can only bind to clusters.
"},{"location":"en/end-user/ghippo/workspace/res-gp-and-shared-res.html#similarities-between-resource-groups-and-shared-resources","title":"Similarities between Resource Groups and Shared Resources","text":"
After binding to a cluster, both resource groups and shared resources can go to the Workbench to create namespaces, which will be automatically bound to the workspace.
A workspace is a resource category that represents a hierarchical relationship of resources. A workspace can contain resources such as clusters, namespaces, and registries. Typically, each workspace corresponds to a project; different resources can be allocated to it, and different users and user groups can be assigned to it.
Follow the steps below to create a workspace:
Log in to AI platform with a user account having the admin/folder admin role. Click Global Management -> Workspace and Folder at the bottom of the left navigation bar.
Click the Create Workspace button in the top right corner.
Fill in the workspace name, folder assignment, and other information, then click OK to complete creating the workspace.
Tip
After successful creation, the workspace name will be displayed in the left tree structure, represented by different icons for folders and workspaces.
Note
To edit or delete a specific workspace or folder, select it and click ... on the right side.
If resource groups and shared resources have resources under the workspace, the workspace cannot be deleted. All resources need to be unbound before deletion of the workspace.
If Microservices Engine has Integrated Registry under the workspace, the workspace cannot be deleted. Integrated Registry needs to be removed before deletion of the workspace.
If Container Registry has Registry Space or Integrated Registry under the workspace, the workspace cannot be deleted. Registry Space needs to be removed, and Integrated Registry needs to be deleted before deletion of the workspace.
"},{"location":"en/end-user/ghippo/workspace/ws-folder.html","title":"Workspace and Folder","text":"
Workspace and Folder is a feature that provides resource isolation and grouping, addressing issues related to unified authorization, resource grouping, and resource quotas.
Workspace and Folder involves two concepts: workspaces and folders.
Workspaces allow the management of resources through Authorization , Resource Group , and Shared Resource , enabling users (and user groups) to share resources within the workspace.
Resources
Resources are at the lowest level of the hierarchy in the resource management module. They include clusters, namespaces, pipelines, gateways, and more. All these resources can only have workspaces as their parent level. Workspaces act as containers for grouping resources.
Workspace
A workspace usually refers to a project or environment, and the resources in each workspace are logically isolated from those in other workspaces. You can grant users (groups of users) different access rights to the same set of resources through authorization in the workspace.
Workspaces are at the first level, counting from the bottom of the hierarchy, and contain resources. All resources except shared resources have one and only one parent. All workspaces also have one and only one parent folder.
Resources are grouped by workspace, and there are two grouping modes in workspace, namely Resource Group and Shared Resource .
Resource group
A resource can only be added to one resource group, and resource groups correspond one-to-one with workspaces. After a resource is added to a resource group, Workspace Admin obtains management authority over the resource, equivalent to being the owner of the resource.
Shared resource
For shared resources, multiple workspaces can share one or more resources. Resource owners can choose to share their own resources with the workspace. Generally, when sharing, the resource owner will limit the amount of resources that can be used by the shared workspace. After resources are shared, Workspace Admin only has resource usage rights under the resource limit, and cannot manage resources or adjust the amount of resources that can be used by the workspace.
At the same time, shared resources also have certain requirements for the resources themselves. Only Cluster (cluster) resources can be shared. Cluster Admin can share Cluster resources to different workspaces, and limit the use of workspaces on this Cluster.
Workspace Admin can create multiple Namespaces within the resource quota, but the sum of the resource quotas of the Namespaces cannot exceed the resource quota of the Cluster in the workspace. For Kubernetes resources, the only resource type that can be shared currently is Cluster.
Folders can be used to build enterprise business hierarchy relationships.
Folders are a further grouping mechanism based on workspaces and have a hierarchical structure. A folder can contain workspaces, other folders, or a combination of both, forming a tree-like organizational relationship.
Folders allow you to map your business hierarchy and group workspaces by department. Folders are not directly linked to resources, but indirectly achieve resource grouping through workspaces.
A folder has one and only one parent folder, and the root folder is the highest level of the hierarchy. The root folder has no parent, and folders and workspaces are attached to the root folder.
In addition, users (groups) in folders can inherit permissions from their parents through a hierarchical structure. The permissions of the user in the hierarchical structure come from the combination of the permissions of the current level and the permissions inherited from its parents. The permissions are additive and there is no mutual exclusion.
"},{"location":"en/end-user/ghippo/workspace/ws-permission.html","title":"Description of workspace permissions","text":"
The workspace has permission mapping and resource isolation capabilities, and can map the permissions of users/groups in the workspace to the resources under it. If the user/group has the Workspace Admin role in the workspace and the resource Namespace is bound to the workspace-resource group, the user/group will become Namespace Admin after mapping.
Note
The permission mapping capability of the workspace will not be applied to shared resources, because sharing is to share the cluster usage permissions to multiple workspaces, rather than assigning management permissions to the workspaces, so permission inheritance and role mapping will not be implemented.
Resource isolation is achieved by binding resources to different workspaces. Therefore, resources can be flexibly allocated to each workspace (tenant) with the help of permission mapping, resource isolation, and resource sharing capabilities.
Generally applicable to the following two use cases:
Cluster one-to-one
| Ordinary Cluster | Department/Tenant (Workspace) | Purpose |
| --- | --- | --- |
| Cluster 01 | A | Administration and Usage |
| Cluster 02 | B | Administration and Usage |
Cluster one-to-many
| Cluster | Department/Tenant (Workspace) | Resource Quota |
| --- | --- | --- |
| Cluster 01 | A | 100-core CPU |
| Cluster 01 | B | 50-core CPU |
Authorized users can go to modules such as workbench, microservice engine, middleware, multicloud orchestration, and service mesh to use resources in the workspace. For the operation scope of the roles of Workspace Admin, Workspace Editor, and Workspace Viewer in each module, please refer to the permission description:
Suppose a user, John ("John" represents any user who needs to bind resources), has been assigned the Workspace Admin role, or has been granted the workspace "Resource Binding" permission through a custom role, and wants to bind a specific cluster or namespace to the workspace.
To bind cluster/namespace resources to a workspace, not only the workspace's \"Resource Binding\" permissions are required, but also the permissions of Cluster Admin.
"},{"location":"en/end-user/ghippo/workspace/wsbind-permission.html#granting-authorization-to-john","title":"Granting Authorization to John","text":"
Using the Platform Admin Role, grant John the role of Workspace Admin on the Workspace -> Authorization page.
Then, on the Container Management -> Permissions page, authorize John as a Cluster Admin by Add Permission.
"},{"location":"en/end-user/ghippo/workspace/wsbind-permission.html#binding-to-workspace","title":"Binding to Workspace","text":"
Using John's account to log in to AI platform, on the Container Management -> Clusters page, John can bind the specified cluster to his own workspace by using the Bind Workspace button.
Note
John can only bind clusters or namespaces to a specific workspace in the Container Management module, and cannot perform this operation in the Global Management module.
To bind a namespace to a workspace, you must have at least Workspace Admin and Cluster Admin permissions.
"},{"location":"en/end-user/host/createhost.html","title":"Creating and Starting a Cloud Host","text":"
Once the user completes registration and is assigned a workspace, namespace, and resources, they can create and start a cloud host.
"},{"location":"en/end-user/host/usehost.html#steps-to-operate","title":"Steps to Operate","text":"
Log into the AI platform as an administrator.
Navigate to Container Management -> Container Network -> Services, click the service name to enter the service details page, and click Update in the upper right corner.
Change the port range to 30900-30999, ensuring there are no conflicts.
Log into the AI platform as an end user, navigate to the proper service, and check the access ports.
Use an SSH client to log into the cloud host from the external network.
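For example, assuming the node's externally reachable IP is 10.1.2.3 and the service's NodePort is 30901 (both hypothetical values; use the ones shown on the service's access ports page), the login would look like:

```bash
# Hypothetical node IP and NodePort; replace with the values shown for your service.
ssh root@10.1.2.3 -p 30901
```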
At this point, you can perform various operations on the cloud host.
The Alert Center is an important feature provided by AI platform that allows users to easily view all active and historical alerts by cluster and namespace through a graphical interface, and search alerts based on severity level (critical, warning, info).
All alerts are triggered based on the threshold conditions set in the preset alert rules. In AI platform, some global alert policies are built-in, but users can also create or delete alert policies at any time, and set thresholds for the following metrics:
CPU usage
Memory usage
Disk usage
Disk reads per second
Disk writes per second
Cluster disk read throughput
Cluster disk write throughput
Network send rate
Network receive rate
Users can also add labels and annotations to alert rules. Alert rules can be classified as active or expired, and certain rules can be enabled/disabled to achieve silent alerts.
When the threshold condition is met, users can configure how they want to be notified, including email, DingTalk, WeCom, webhook, and SMS notifications. All notification message templates can be customized and all messages are sent at specified intervals.
In addition, the Alert Center also supports sending alert messages to designated users through short message services provided by Alibaba Cloud, Tencent Cloud, and more platforms that will be added soon, enabling multiple ways of alert notification.
AI platform Alert Center is a powerful alert management platform that helps users quickly detect and resolve problems in the cluster, improve business stability and availability, and facilitate cluster inspection and troubleshooting.
In addition to the built-in alert policies, AI platform allows users to create custom alert policies. Each alert policy is a collection of alert rules that can be set for clusters, nodes, and workloads. When an alert object reaches the threshold set by any of the rules in the policy, an alert is automatically triggered and a notification is sent.
Taking the built-in alerts as an example, click the first alert policy alertmanager.rules .
You can see that some alert rules have been set under it. You can add more rules under this policy, or edit or delete them at any time. You can also view the historical and active alerts related to this alert policy and edit the notification configuration.
Select Alert Center -> Alert Policies , and click the Create Alert Policy button.
Fill in the basic information, select one or more clusters, nodes, or workloads as the alert objects, and click Next .
The list must have at least one rule. If the list is empty, please Add Rule .
Create an alert rule in the pop-up window, fill in the parameters, and click OK .
Template rules: Pre-defined basic metrics that can monitor CPU, memory, disk, and network.
PromQL rules: Enter a PromQL expression; refer to Prometheus expressions for the query syntax (see the example after this list).
Duration: After the alert is triggered and the duration reaches the set value, the alert policy will become a triggered state.
Alert level: Including emergency, warning, and information levels.
Advanced settings: Custom tags and annotations.
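For example, a PromQL rule that fires when a node's CPU usage stays above 80% might look like the expression below; it is a sketch based on standard node-exporter metrics, and the metric names in your environment may differ:

```promql
# Node CPU usage (%) derived from the idle CPU time; fires when above 80.
(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
```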
After clicking Next , configure notifications.
After the configuration is complete, click the OK button to return to the Alert Policy list.
Tip
The newly created alert policy is in the Not Triggered state. Once the threshold conditions and duration specified in the rules are met, it will change to the Triggered state.
After filling in the basic information, click Add Rule and select Log Rule as the rule type.
Creating log rules is supported only when the resource object is selected as a node or workload.
Field Explanation:
Filter Condition : Field used to query log content, supports four filtering conditions: AND, OR, regular expression matching, and fuzzy matching.
Condition : Based on the filter condition, enter keywords or matching conditions.
Time Range : Time range for log queries.
Threshold Condition : Enter the alert threshold value in the input box. When the set threshold is reached, an alert will be triggered. Supported comparison operators are: >, ≥, =, ≤, <.
Alert Level : Select the alert level to indicate the severity of the alert.
After filling in the basic information, click Add Rule and select Event Rule as the rule type.
Creating event rules is supported only when the resource object is selected as a workload.
Field Explanation:
Event Rule : Only supports selecting the workload as the resource object.
Event Reason : Different workload types have different event reasons, and the selected event reasons are combined with an "AND" relationship.
Time Range : Detect data generated within this time range. If the threshold condition is reached, an alert event will be triggered.
Threshold Condition : When the generated events reach the set threshold, an alert event will be triggered.
Trend Chart : By default, it queries the trend of event changes within the last 10 minutes. The value at each point represents the total number of occurrences within a certain period of time (time range) from the current time point to a previous time.
Click ┇ at the right side of the list, then choose Delete from the pop-up menu to delete an alert policy. By clicking on the policy name, you can enter the policy details where you can add, edit, or delete the alert rules under it.
Warning
Deleted alert policies will be permanently removed, so please proceed with caution.
The Alert template allows platform administrators to create Alert templates and rules, and business units can directly use Alert templates to create Alert policies. This feature can reduce the management of Alert rules by business personnel and allow for modification of Alert thresholds based on actual environment conditions.
In the navigation bar, select Alert -> Alert Policy, and click Alert Template at the top.
Click Create Alert Template, and set the name, description, and other information for the Alert template.
| Parameter | Description |
| --- | --- |
| Template Name | The name can only contain lowercase letters, numbers, and hyphens (-), must start and end with a lowercase letter or number, and can be up to 63 characters long. |
| Description | The description can contain any characters and can be up to 256 characters long. |
| Resource Type | Used to specify the matching type of the Alert template. |
| Alert Rule | Supports pre-defined multiple Alert rules, including template rules and PromQL rules. |
Click OK to complete the creation and return to the Alert template list. Click the template name to view the template details.
Alert Inhibition is mainly a mechanism for temporarily hiding or reducing the priority of alerts that do not need immediate attention. The purpose of this feature is to reduce unnecessary alert information that may disturb operations personnel, allowing them to focus on more critical issues.
Alert inhibition recognizes and ignores certain alerts by defining a set of rules to deal with specific conditions. There are mainly the following conditions:
Parent-child inhibition: when a parent alert (for example, a crash on a node) is triggered, all child alerts caused by it (for example, a crash on a container running on that node) are inhibited.
Similar alert inhibition: When alerts have the same characteristics (for example, the same problem on the same instance), multiple alerts are inhibited.
In the left navigation bar, select Alert -> Noise Reduction, and click Inhibition at the top.
Click Create Inhibition, and set the name and rules for the inhibition.
Note
Multiple similar or related alerts that may be triggered by the same issue are avoided by defining a set of rules, through Rule Details and Alert Details, that identify and ignore certain alerts.
| Parameter | Description |
| --- | --- |
| Name | The name can only contain lowercase letters, numbers, and hyphens (-), must start and end with a lowercase letter or number, and can be up to 63 characters long. |
| Description | The description can contain any characters and can be up to 256 characters long. |
| Cluster | The cluster where the inhibition rule applies. |
| Namespace | The namespace where the inhibition rule applies. |
| Source Alert | Matches alerts by label conditions. Alerts that meet all label conditions are compared against the inhibition conditions; alerts that do not meet the inhibition conditions are sent to the user as usual. Value range: Alert Level (Critical, Major, Minor); Resource Type (Cluster, Node, StatefulSet, Deployment, DaemonSet, Pod); Labels (alert identification attributes consisting of a label name and label value; user-defined values are supported). |
| Inhibition | Specifies the matching conditions for the target alert (the alert to be inhibited). Alerts that meet all the conditions will no longer be sent to the user. |
| Equal | Specifies the list of labels to compare to determine whether the source alert and target alert match. Inhibition is triggered only when the values of the labels specified in equal are exactly the same in the source and target alerts. The equal field is optional; if omitted, all labels are used for matching. |
Click OK to complete the creation and return to Inhibition list. Click the inhibition rule name to view the rule details.
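Conceptually, such a rule maps to an Alertmanager-style inhibition rule. The sketch below is illustrative only (the label names and values are assumptions, not the exact configuration generated by the platform): while a critical NodeDown alert is firing, warning alerts carrying the same node label are suppressed.

```yaml
# Illustrative Alertmanager-style inhibit rule (parent-child inhibition).
inhibit_rules:
  - source_matchers:          # the parent alert
      - alertname = NodeDown
      - severity = critical
    target_matchers:          # the child alerts to suppress
      - severity = warning
    equal:                    # only suppress alerts on the same node
      - node
```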
After entering Insight , click Alert Center -> Notification Settings in the left navigation bar. By default, the email notification object is selected. Click Add email group and add one or more email addresses.
Multiple email addresses can be added.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list to edit or delete the email group.
In the left navigation bar, click Alert Center -> Notification Settings -> WeCom . Click Add Group Robot and add one or more group robots.
For the URL of the WeCom group robot, please refer to the official document of WeCom: How to use group robots.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information, and you can also edit or delete the group robot.
In the left navigation bar, click Alert Center -> Notification Settings -> DingTalk . Click Add Group Robot and add one or more group robots.
For the URL of the DingTalk group robot, please refer to the official document of DingTalk: Custom Robot Access.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information, and you can also edit or delete the group robot.
In the left navigation bar, click Alert Center -> Notification Settings -> Lark . Click Add Group Bot and add one or more group bots.
Note
When signature verification is required in Lark's group bot, you need to fill in the specific signature key when enabling notifications. Refer to Customizing Bot User Guide.
After configuration, you will be automatically redirected to the list page. Click ┇ on the right side of the list and select Send Test Message. You can edit or delete group bots.
In the left navigation bar, click Alert Center -> Notification Settings -> Webhook . Click New Webhook and add one or more Webhooks.
For the Webhook URL and more configuration methods, please refer to the webhook document.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information, and you can also edit or delete the Webhook.
In the left navigation bar, click Alert Center -> Notification Settings -> SMS . Click Add SMS Group and add one or more SMS groups.
Enter the name, the object receiving the message, phone number, and notification server in the pop-up window.
The notification server needs to be created in advance under Notification Settings -> Notification Server . Currently, two cloud servers, Alibaba Cloud and Tencent Cloud, are supported. Please refer to your own cloud server information for the specific configuration parameters.
After the SMS group is successfully added, the notification list will automatically return. Click ┇ on the right side of the list to edit or delete the SMS group.
The message template feature supports customizing the content of message templates and can notify specified objects in the form of email, WeCom, DingTalk, Webhook, and SMS.
"},{"location":"en/end-user/insight/alert-center/msg-template.html#creating-a-message-template","title":"Creating a Message Template","text":"
In the left navigation bar, select Alert -> Message Template .
Insight comes with two default built-in templates in both Chinese and English for user convenience.
Fill in the template content.
Info
Observability comes with predefined message templates. If you need to define the content of the templates, refer to Configure Notification Templates.
Click the name of a message template to view the details of the message template in the right slider.
| Parameter | Variable | Description |
| --- | --- | --- |
| ruleName | {{ .Labels.alertname }} | The name of the rule that triggered the alert |
| groupName | {{ .Labels.alertgroup }} | The name of the alert policy to which the alert rule belongs |
| severity | {{ .Labels.severity }} | The level of the alert that was triggered |
| cluster | {{ .Labels.cluster }} | The cluster where the resource that triggered the alert is located |
| namespace | {{ .Labels.namespace }} | The namespace where the resource that triggered the alert is located |
| node | {{ .Labels.node }} | The node where the resource that triggered the alert is located |
| targetType | {{ .Labels.target_type }} | The resource type of the alert target |
| target | {{ .Labels.target }} | The name of the object that triggered the alert |
| value | {{ .Annotations.value }} | The metric value at the time the alert notification was triggered |
| startsAt | {{ .StartsAt }} | The time when the alert started to occur |
| endsAt | {{ .EndsAt }} | The time when the alert ended |
| description | {{ .Annotations.description }} | A detailed description of the alert |
| labels | {{ for .labels }} {{ end }} | All labels of the alert; use the for function to iterate through the labels list to get all label contents |

## Editing or Deleting a Message Template
Click ┇ on the right side of the list and select Edit or Delete from the pop-up menu to modify or delete the message template.
Warning
Once a template is deleted, it cannot be recovered, so please use caution when deleting templates.
Alert silence is a feature that allows alerts meeting certain criteria to be temporarily disabled from sending notifications within a specific time range. This feature helps operations personnel avoid receiving too many noisy alerts during certain operations or events, while also allowing for more precise handling of real issues that need to be addressed.
On the Alert Silence page, you can see two tabs: Active Rule and Expired Rule. The former presents the rules currently in effect, while the latter presents those that were defined in the past but have now expired (or have been deleted by the user).
"},{"location":"en/end-user/insight/alert-center/silent.html#creating-a-silent-rule","title":"Creating a Silent Rule","text":"
In the left navigation bar, select Alert -> Noise Reduction -> Alert Silence , and click the Create Silence Rule button.
Fill in the parameters for the silent rule, such as cluster, namespace, tags, and time, to define the scope and effective time of the rule, and then click OK .
Return to the rule list, and on the right side of the list, click ┇ to edit or delete a silent rule.
Through the Alert Silence feature, you can flexibly control which alerts should be ignored and when they should be effective, thereby improving operational efficiency and reducing the possibility of false alerts.
Insight supports SMS notifications and currently sends alert messages using integrated Alibaba Cloud and Tencent Cloud SMS services. This article explains how to configure the SMS notification server in Insight. The variables supported in the SMS signature are the default variables in the message template. As the number of SMS characters is limited, it is recommended to choose more explicit variables.
For information on how to configure SMS recipients, refer to the document: Configure SMS Notification Group.
Go to Alert Center -> Notification Settings -> Notification Server .
Click Add Notification Server .
Configure Alibaba Cloud server.
To apply for Alibaba Cloud SMS service, refer to Alibaba Cloud SMS Service.
Field descriptions:
AccessKey ID : Parameter used by Alibaba Cloud to identify the user.
AccessKey Secret : Key used by Alibaba Cloud to authenticate the user. AccessKey Secret must be kept confidential.
SMS Signature : The SMS service supports creating signatures that meet the requirements according to user needs. When sending SMS, the SMS platform will add the approved SMS signature to the SMS content before sending it to the SMS recipient.
Template CODE : The SMS template is the specific content of the SMS to be sent.
Parameter Template : The SMS body template can contain variables. Users can use variables to customize the SMS content.
Please refer to Alibaba Cloud Variable Specification.
Note
Example: The template content defined in Alibaba Cloud is: ${severity}: ${alertname} triggered at ${startat}. Refer to the configuration in the parameter template.
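One plausible shape for the parameter template in this example is a JSON mapping from the Alibaba Cloud template variables to the message template variables; this is an assumption for illustration, not the exact format required by your environment:

```json
{
  "severity": "{{ .Labels.severity }}",
  "alertname": "{{ .Labels.alertname }}",
  "startat": "{{ .StartsAt }}"
}
```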
Configure Tencent Cloud server.
To apply for Tencent Cloud SMS service, please refer to Tencent Cloud SMS.
Field descriptions:
Secret ID : Parameter used by Tencent Cloud to identify the API caller.
SecretKey : Parameter used by Tencent Cloud to authenticate the API caller.
SMS Template ID : The SMS template ID automatically generated by Tencent Cloud system.
Signature Content : The SMS signature content, which is the full name or abbreviation of the actual website name defined in the Tencent Cloud SMS signature.
SdkAppId : SMS SdkAppId, the actual SdkAppId generated after adding the application in the Tencent Cloud SMS console.
Parameter Template : The SMS body template can contain variables. Users can use variables to customize the SMS content. Please refer to: Tencent Cloud Variable Specification.
Note
Example: The template content defined in Tencent Cloud is: {1}: {2} triggered at {3}. Refer to the configuration in the parameter template.
"},{"location":"en/end-user/insight/collection-manag/agent-status.html","title":"insight-agent Component Status Explanation","text":"
In AI platform, Insight acts as a multi-cluster observability product. To achieve unified data collection across multiple clusters, users need to install the Helm App insight-agent (installed by default in the insight-system namespace). Refer to How to Install insight-agent .
In the \"Observability\" -> \"Collection Management\" section, you can view the installation status of insight-agent in each cluster.
Not Installed : insight-agent is not installed in the insight-system namespace of the cluster.
Running : insight-agent is successfully installed in the cluster, and all deployed components are running.
Error : If insight-agent is in this state, it indicates that the helm deployment failed or there are components deployed that are not in a running state.
You can troubleshoot using the following steps:
Run the following command. If the status is deployed , proceed to the next step. If it is failed , it is recommended to uninstall and reinstall it from Container Management -> Helm Apps as it may affect application upgrades:
```bash
helm list -n insight-system
```
Run the following command or check the status of the deployed components in Insight -> Data Collection . If there are Pods not in the Running state, restart the containers in an abnormal state.
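For example, the following standard kubectl commands list the insight-agent components and recreate a Pod that is stuck in an abnormal state:

```bash
# List insight-agent components and their status
kubectl get pods -n insight-system

# Delete an abnormal Pod so that its controller recreates it
# (replace <pod-name> with the Pod shown as not Running)
kubectl delete pod <pod-name> -n insight-system
```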
The resource consumption of the Prometheus metric collection component in insight-agent is directly proportional to the number of Pods running in the cluster. Please adjust the resources for Prometheus according to the cluster size. Refer to Prometheus Resource Planning.
The storage capacity of the vmstorage metric storage component in the global service cluster is directly proportional to the total number of Pods in the clusters.
Please contact the platform administrator to adjust the disk capacity of vmstorage based on the cluster size. Refer to vmstorage Disk Capacity Planning.
Adjust the vmstorage disk based on multi-cluster scale. Refer to vmstorage Disk Expansion.
Data Collection provides a central place to manage and display the installation of the collection plug-in insight-agent in each cluster. It helps users quickly view the health status of each cluster's collection plug-in and provides a quick entry for configuring collection rules.
The specific operation steps are as follows:
Click in the upper left corner and select Insight -> Data Collection .
You can view the status of all cluster collection plug-ins.
When the cluster is connected to insight-agent and it is running, click a cluster name to enter the details page.
In the Service Monitor tab, click the shortcut link to jump to Container Management -> CRD to add service discovery rules.
Prometheus primarily uses the Pull approach to retrieve monitoring metrics from target services' exposed endpoints. Therefore, it requires configuring proper scraping jobs to request monitoring data and write it into the storage provided by Prometheus. Currently, Prometheus offers several configurations for these jobs:
Native Job Configuration: This provides native Prometheus job configuration for scraping.
Pod Monitor: In the Kubernetes ecosystem, it allows scraping of monitoring data from Pods using Prometheus Operator.
Service Monitor: In the Kubernetes ecosystem, it allows scraping monitoring data from Endpoints of Services using Prometheus Operator.
```yaml
# Name of the scraping job, also adds a label (job=job_name) to the scraped metrics
job_name: <job_name>

# Time interval between scrapes
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]

# Timeout for scrape requests
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]

# URI path for the scrape request
[ metrics_path: <path> | default = /metrics ]

# Handling of label conflicts between scraped labels and labels added by the backend Prometheus.
# true: Retains the scraped labels and ignores conflicting labels from the backend Prometheus.
# false: Adds an "exported_<original-label>" prefix to the scraped labels and includes the additional labels added by the backend Prometheus.
[ honor_labels: <boolean> | default = false ]

# Whether to use the timestamp generated by the target being scraped.
# true: Uses the timestamp from the target if available.
# false: Ignores the timestamp from the target.
[ honor_timestamps: <boolean> | default = true ]

# Protocol for the scrape request: http or https
[ scheme: <scheme> | default = http ]

# URL parameters for the scrape request
params:
  [ <string>: [<string>, ...] ]

# Set the value of the `Authorization` header in the scrape request through basic authentication. password/password_file are mutually exclusive, with password_file taking precedence.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Set the value of the `Authorization` header in the scrape request through bearer token authentication. bearer_token/bearer_token_file are mutually exclusive, with bearer_token taking precedence.
[ bearer_token: <secret> ]

# Set the value of the `Authorization` header in the scrape request through bearer token authentication. bearer_token/bearer_token_file are mutually exclusive, with bearer_token taking precedence.
[ bearer_token_file: <filename> ]

# Whether the scrape connection should use a TLS secure channel, configure the proper TLS parameters
tls_config:
  [ <tls_config> ]

# Use a proxy service to scrape the metrics from the target, specify the address of the proxy service.
[ proxy_url: <string> ]

# Specify the targets using static configuration, see explanation below.
static_configs:
  [ - <static_config> ... ]

# CVM service discovery configuration, see explanation below.
cvm_sd_configs:
  [ - <cvm_sd_config> ... ]

# After scraping the data, rewrite the labels of the proper target using the relabel mechanism. Executes multiple relabel rules in order.
# See explanation below for relabel_config.
relabel_configs:
  [ - <relabel_config> ... ]

# Before writing the scraped data, rewrite the values of the labels using the relabel mechanism. Executes multiple relabel rules in order.
# See explanation below for relabel_config.
metric_relabel_configs:
  [ - <relabel_config> ... ]

# Limit the number of data points per scrape, 0: no limit, default is 0
[ sample_limit: <int> | default = 0 ]

# Limit the number of targets per scrape, 0: no limit, default is 0
[ target_limit: <int> | default = 0 ]
```
The relevant configuration items are explained as follows:
```yaml
# Prometheus Operator CRD version
apiVersion: monitoring.coreos.com/v1
# proper Kubernetes resource type, here it is PodMonitor
kind: PodMonitor
# proper Kubernetes Metadata, only the name needs to be concerned. If jobLabel is not specified, the value of the job label in the scraped metrics will be <namespace>/<name>
metadata:
  name: redis-exporter # Specify a unique name
  namespace: cm-prometheus # Fixed namespace, no need to modify
  # Describes the selection and configuration of the target Pods to be scraped
  labels:
    operator.insight.io/managed-by: insight # Label indicating managed by Insight
spec:
  # Specify the label of the proper Pod, pod monitor will use this value as the job label value.
  # If viewing the Pod YAML, use the values in pod.metadata.labels.
  # If viewing Deployment/Daemonset/Statefulset, use spec.template.metadata.labels.
  [ jobLabel: string ]
  # Adds the proper Pod's Labels to the Target's Labels
  [ podTargetLabels: []string ]
  # Limit the number of data points per scrape, 0: no limit, default is 0
  [ sampleLimit: uint64 ]
  # Limit the number of targets per scrape, 0: no limit, default is 0
  [ targetLimit: uint64 ]
  # Configure the Prometheus HTTP endpoints that need to be scraped and exposed. Multiple endpoints can be configured.
  podMetricsEndpoints:
    [ - <endpoint_config> ... ] # See explanation below for endpoint
  # Select the namespaces where the monitored Pods are located. Leave it blank to select all namespaces.
  [ namespaceSelector: ]
    # Select all namespaces
    [ any: bool ]
    # Specify the list of namespaces to be selected
    [ matchNames: []string ]
  # Specify the Label values of the Pods to be monitored in order to locate the target Pods [K8S metav1.LabelSelector](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#labelselector-v1-meta)
  selector:
    [ matchExpressions: array ]
    [ example: - {key: tier, operator: In, values: [cache]} ]
    [ matchLabels: object ]
    [ example: k8s-app: redis-exporter ]
```
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: redis-exporter # Specify a unique name
  namespace: cm-prometheus # Fixed namespace, do not modify
  labels:
    operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.
spec:
  podMetricsEndpoints:
    - interval: 30s
      port: metric-port # Specify the Port Name proper to Prometheus Exporter in the pod YAML
      path: /metrics # Specify the value of the Path proper to Prometheus Exporter, if not specified, default is /metrics
      relabelings:
        - action: replace
          sourceLabels:
            - instance
          regex: (.*)
          targetLabel: instance
          replacement: "crs-xxxxxx" # Adjust to the proper Redis instance ID
        - action: replace
          sourceLabels:
            - instance
          regex: (.*)
          targetLabel: ip
          replacement: "1.x.x.x" # Adjust to the proper Redis instance IP
  namespaceSelector: # Select the namespaces where the monitored Pods are located
    matchNames:
      - redis-test
  selector: # Specify the Label values of the Pods to be monitored in order to locate the target pods
    matchLabels:
      k8s-app: redis-exporter
```
The relevant configuration items are explained as follows:
```yaml
# Prometheus Operator CRD version
apiVersion: monitoring.coreos.com/v1
# proper Kubernetes resource type, here it is ServiceMonitor
kind: ServiceMonitor
# proper Kubernetes Metadata, only the name needs to be concerned. If jobLabel is not specified, the value of the job label in the scraped metrics will be the name of the Service.
metadata:
  name: redis-exporter # Specify a unique name
  namespace: cm-prometheus # Fixed namespace, no need to modify
  # Describes the selection and configuration of the target Pods to be scraped
  labels:
    operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.
spec:
  # Specify the label(metadata/labels) of the proper Pod, service monitor will use this value as the job label value.
  [ jobLabel: string ]
  # Adds the Labels of the proper service to the Target's Labels
  [ targetLabels: []string ]
  # Adds the Labels of the proper Pod to the Target's Labels
  [ podTargetLabels: []string ]
  # Limit the number of data points per scrape, 0: no limit, default is 0
  [ sampleLimit: uint64 ]
  # Limit the number of targets per scrape, 0: no limit, default is 0
  [ targetLimit: uint64 ]
  # Configure the Prometheus HTTP endpoints that need to be scraped and exposed. Multiple endpoints can be configured.
  endpoints:
    [ - <endpoint_config> ... ] # See explanation below for endpoint
  # Select the namespaces where the monitored Pods are located. Leave it blank to select all namespaces.
  [ namespaceSelector: ]
    # Select all namespaces
    [ any: bool ]
    # Specify the list of namespaces to be selected
    [ matchNames: []string ]
  # Specify the Label values of the Pods to be monitored in order to locate the target Pods [K8S metav1.LabelSelector](https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#labelselector-v1-meta)
  selector:
    [ matchExpressions: array ]
    [ example: - {key: tier, operator: In, values: [cache]} ]
    [ matchLabels: object ]
    [ example: k8s-app: redis-exporter ]
```
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: go-demo # Specify a unique name
  namespace: cm-prometheus # Fixed namespace, do not modify
  labels:
    operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.
spec:
  endpoints:
    - interval: 30s
      # Specify the Port Name proper to Prometheus Exporter in the service YAML
      port: 8080-8080-tcp
      # Specify the value of the Path proper to Prometheus Exporter, if not specified, default is /metrics
      path: /metrics
      relabelings:
        # ** There must be a label named 'application', assuming there is a label named 'app' in k8s,
        # we replace it with 'application' using the relabel 'replace' action
        - action: replace
          sourceLabels: [__meta_kubernetes_pod_label_app]
          targetLabel: application
  # Select the namespace where the monitored service is located
  namespaceSelector:
    matchNames:
      - golang-demo
  # Specify the Label values of the service to be monitored in order to locate the target service
  selector:
    matchLabels:
      app: golang-app-demo
```
The relevant configuration items are explained as follows:
```yaml
# The name of the proper port. Please note that it's not the actual port number.
# Default: 80. Possible values are as follows:
# ServiceMonitor: corresponds to Service>spec/ports/name;
# PodMonitor: explained as follows:
#   If viewing the Pod YAML, take the value from pod.spec.containers.ports.name.
#   If viewing Deployment/DaemonSet/StatefulSet, take the value from spec.template.spec.containers.ports.name.
[ port: string | default = 80 ]
# The URI path for the scrape request.
[ path: string | default = /metrics ]
# The protocol for the scrape: http or https.
[ scheme: string | default = http ]
# URL parameters for the scrape request.
[ params: map[string][]string ]
# The interval between scrape requests.
[ interval: string | default = 30s ]
# The timeout for the scrape request.
[ scrapeTimeout: string | default = 30s ]
# Whether the scrape connection should be made over a secure TLS channel, and the TLS configuration.
[ tlsConfig: TLSConfig ]
# Read the bearer token value from the specified file and include it in the headers of the scrape request.
[ bearerTokenFile: string ]
# Read the bearer token from the specified K8S secret key. Note that the secret namespace must match the PodMonitor/ServiceMonitor.
[ bearerTokenSecret: string ]
# Handling conflicts when scraped labels conflict with labels added by the backend Prometheus.
# true: Keep the scraped labels and ignore the conflicting labels from the backend Prometheus.
# false: For conflicting labels, prefix the scraped label with 'exported_<original-label>' and add the labels added by the backend Prometheus.
[ honorLabels: bool | default = false ]
# Whether to use the timestamp generated on the target during the scrape.
# true: Use the timestamp on the target if available.
# false: Ignore the timestamp on the target.
[ honorTimestamps: bool | default = true ]
# Basic authentication credentials. Fill in the values of username/password from the proper K8S secret key. Note that the secret namespace must match the PodMonitor/ServiceMonitor.
[ basicAuth: BasicAuth ]
# Scrape the metrics from the target through a proxy server. Specify the address of the proxy server.
[ proxyUrl: string ]
# After scraping the data, rewrite the values of the labels on the target using the relabeling mechanism. Multiple relabel rules are executed in order.
# See explanation below for relabel_config
relabelings:
  [ - <relabel_config> ... ]
# Before writing the scraped data, rewrite the values of the proper labels on the target using the relabeling mechanism. Multiple relabel rules are executed in order.
# See explanation below for relabel_config
metricRelabelings:
  [ - <relabel_config> ... ]
```
The relevant configuration items are explained as follows:
```yaml
# Specifies which labels to take from the original labels for relabeling. The values taken are concatenated using the separator defined in the configuration.
# For PodMonitor/ServiceMonitor, the proper configuration item is sourceLabels.
[ source_labels: '[' <labelname> [, ...] ']' ]
# Defines the character used to concatenate the values of the labels to be relabeled. Default is ';'.
[ separator: <string> | default = ; ]

# When the action is replace/hashmod, target_label is used to specify the proper label name.
# For PodMonitor/ServiceMonitor, the proper configuration item is targetLabel.
[ target_label: <labelname> ]

# Regular expression used to match the values of the source labels.
[ regex: <regex> | default = (.*) ]

# Used when action is hashmod, it takes the modulus value based on the MD5 hash of the source label's value.
[ modulus: <int> ]

# Used when action is replace, it defines the expression to replace when the regex matches. It can use regular expression replacement with regex.
[ replacement: <string> | default = $1 ]

# Actions performed based on the matched values of regex. The available actions are as follows, with replace being the default:
# replace: If the regex matches, replace the proper value with the value defined in replacement. Set the value using target_label and add the proper label.
# keep: If the regex doesn't match, discard the value.
# drop: If the regex matches, discard the value.
# hashmod: Take the modulus of the MD5 hash of the source label's value based on the value specified in modulus.
#          Add a new label with a label name specified by target_label.
# labelmap: If the regex matches, replace the proper label name with the value specified in replacement.
# labeldrop: If the regex matches, delete the proper label.
# labelkeep: If the regex doesn't match, delete the proper label.
[ action: <relabel_action> | default = replace ]
```
Insight uses the Blackbox Exporter provided by Prometheus as a blackbox monitoring solution, allowing detection of target instances via HTTP, HTTPS, DNS, ICMP, TCP, and gRPC. It can be used in the following scenarios:
HTTP/HTTPS: URL/API availability monitoring
ICMP: Host availability monitoring
TCP: Port availability monitoring
DNS: Domain name resolution
In this page, we will explain how to configure custom probers in an existing Blackbox ConfigMap.
The ICMP prober is not enabled by default in Insight because it requires elevated permissions. Therefore, we will use the HTTP prober as an example to demonstrate how to modify the ConfigMap to achieve custom HTTP probing.
```yaml
modules:
  ICMP:               # Example of ICMP prober configuration
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
  icmp_example:       # Example 2 of ICMP prober configuration
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
      source_ip_address: "127.0.0.1"
```
Since ICMP requires higher permissions, we also need to elevate the pod permissions. Otherwise, an operation not permitted error will occur. There are two ways to elevate permissions:
Directly edit the BlackBox Exporter deployment file to enable it
The following YAML file contains various probers such as HTTP, TCP, SMTP, ICMP, and DNS. You can modify the configuration file of insight-agent-prometheus-blackbox-exporter according to your needs.
Click to view the complete YAML file
```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: insight-agent-prometheus-blackbox-exporter
  namespace: insight-system
  labels:
    app.kubernetes.io/instance: insight-agent
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus-blackbox-exporter
    app.kubernetes.io/version: v0.24.0
    helm.sh/chart: prometheus-blackbox-exporter-8.8.0
  annotations:
    meta.helm.sh/release-name: insight-agent
    meta.helm.sh/release-namespace: insight-system
data:
  blackbox.yaml: |
    modules:
      HTTP_GET:
        prober: http
        timeout: 5s
        http:
          method: GET
          valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
          follow_redirects: true
          preferred_ip_protocol: "ip4"
      HTTP_POST:
        prober: http
        timeout: 5s
        http:
          method: POST
          body_size_limit: 1MB
      TCP:
        prober: tcp
        timeout: 5s
      # Not enabled by default:
      # ICMP:
      #   prober: icmp
      #   timeout: 5s
      #   icmp:
      #     preferred_ip_protocol: ip4
      SSH:
        prober: tcp
        timeout: 5s
        tcp:
          query_response:
            - expect: "^SSH-2.0-"
      POP3S:
        prober: tcp
        tcp:
          query_response:
            - expect: "^+OK"
          tls: true
          tls_config:
            insecure_skip_verify: false
      http_2xx_example:            # http prober example
        prober: http
        timeout: 5s                # probe timeout
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]  # Version in the response, usually default
          valid_status_codes: []   # Defaults to 2xx; valid range of response codes, probe successful if within this range
          method: GET              # request method
          headers:                 # request headers
            Host: vhost.example.com
            Accept-Language: en-US
            Origin: example.com
          no_follow_redirects: false   # allow redirects
          fail_if_ssl: false
          fail_if_not_ssl: false
          fail_if_body_matches_regexp:
            - "Could not connect to database"
          fail_if_body_not_matches_regexp:
            - "Download the latest version here"
          fail_if_header_matches:  # Verifies that no cookies are set
            - header: Set-Cookie
              allow_missing: true
              regexp: '.*'
          fail_if_header_not_matches:
            - header: Access-Control-Allow-Origin
              regexp: '(\*|example\.com)'
          tls_config:              # tls configuration for https requests
            insecure_skip_verify: false
          preferred_ip_protocol: "ip4"   # defaults to "ip6"; preferred IP protocol version
          ip_protocol_fallback: false    # no fallback to "ip6"
      http_post_2xx:               # http prober example with body
        prober: http
        timeout: 5s
        http:
          method: POST             # probe request method
          headers:
            Content-Type: application/json
          body: '{"username":"admin","password":"123456"}'   # body carried during probe
      http_basic_auth_example:     # prober example with username and password
        prober: http
        timeout: 5s
        http:
          method: POST
          headers:
            Host: "login.example.com"
          basic_auth:              # username and password to be added during probe
            username: "username"
            password: "mysecret"
      http_custom_ca_example:
        prober: http
        http:
          method: GET
          tls_config:              # root certificate used during probe
            ca_file: "/certs/my_cert.crt"
      http_gzip:
        prober: http
        http:
          method: GET
          compression: gzip        # compression method used during probe
      http_gzip_with_accept_encoding:
        prober: http
        http:
          method: GET
          compression: gzip
          headers:
            Accept-Encoding: gzip
      tls_connect:                 # TCP prober example
        prober: tcp
        timeout: 5s
        tcp:
          tls: true                # use TLS
      tcp_connect_example:
        prober: tcp
        timeout: 5s
      imap_starttls:               # IMAP email server probe configuration example
        prober: tcp
        timeout: 5s
        tcp:
          query_response:
            - expect: "OK.*STARTTLS"
            - send: ". STARTTLS"
            - expect: "OK"
            - starttls: true
            - send: ". capability"
            - expect: "CAPABILITY IMAP4rev1"
      smtp_starttls:               # SMTP email server probe configuration example
        prober: tcp
        timeout: 5s
        tcp:
          query_response:
            - expect: "^220 ([^ ]+) ESMTP (.+)$"
            - send: "EHLO prober\r"
            - expect: "^250-STARTTLS"
            - send: "STARTTLS\r"
            - expect: "^220"
            - starttls: true
            - send: "EHLO prober\r"
            - expect: "^250-AUTH"
            - send: "QUIT\r"
      irc_banner_example:
        prober: tcp
        timeout: 5s
        tcp:
          query_response:
            - send: "NICK prober"
            - send: "USER prober prober prober :prober"
            - expect: "PING :([^ ]+)"
              send: "PONG ${1}"
            - expect: "^:[^ ]+ 001"
      # icmp_example:              # ICMP prober configuration example
      #   prober: icmp
      #   timeout: 5s
      #   icmp:
      #     preferred_ip_protocol: "ip4"
      #     source_ip_address: "127.0.0.1"
      dns_udp_example:             # DNS query example using UDP
        prober: dns
        timeout: 5s
        dns:
          query_name: "www.prometheus.io"   # domain name to resolve
          query_type: "A"                   # type proper to this domain
          valid_rcodes:
            - NOERROR
          validate_answer_rrs:
            fail_if_matches_regexp:
              - ".*127.0.0.1"
            fail_if_all_match_regexp:
              - ".*127.0.0.1"
            fail_if_not_matches_regexp:
              - "www.prometheus.io.\t300\tIN\tA\t127.0.0.1"
            fail_if_none_matches_regexp:
              - "127.0.0.1"
          validate_authority_rrs:
            fail_if_matches_regexp:
              - ".*127.0.0.1"
          validate_additional_rrs:
            fail_if_matches_regexp:
              - ".*127.0.0.1"
      dns_soa:
        prober: dns
        dns:
          query_name: "prometheus.io"
          query_type: "SOA"
      dns_tcp_example:             # DNS query example using TCP
        prober: dns
        dns:
          transport_protocol: "tcp"       # defaults to "udp"
          preferred_ip_protocol: "ip4"    # defaults to "ip6"
          query_name: "www.prometheus.io"
```
"},{"location":"en/end-user/insight/collection-manag/service-monitor.html","title":"Configure service discovery rules","text":"
Observable Insight supports the way of creating CRD ServiceMonitor through container management to meet your collection requirements for custom service discovery. Users can use ServiceMonitor to define the scope of the Namespace discovered by the Pod and select the monitored Service through matchLabel .
This is the service endpoint, which represents the address where Prometheus collects Metrics. endpoints is an array, and multiple endpoints can be created at the same time. Each endpoint contains three fields, and the meaning of each field is as follows:
interval : Specifies the collection cycle of Prometheus for the current endpoint . The unit is seconds, set to 15s in this example.
path : Specifies the collection path of Prometheus. In this example, it is specified as /actuator/prometheus .
port : Specifies the port through which the collected data needs to pass. The set port is the name set by the port of the Service being collected.
This is the scope of the Service that needs to be discovered. namespaceSelector contains two mutually exclusive fields, and the meaning of the fields is as follows:
any : Only one value true , when this field is set, it will listen to changes of all Services that meet the Selector filtering conditions.
matchNames : An array value that specifies the scope of namespace to be monitored. For example, if you only want to monitor the Services in two namespaces, default and insight-system, the matchNames are set as follows:
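Putting the endpoint and namespace settings above together, a minimal sketch of such a ServiceMonitor is shown below. The monitor name, the Service label app: my-service, and the port name http are placeholder values for illustration only; adjust them to your own Service.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service-monitor              # placeholder name
  namespace: insight-system             # assumed namespace; adjust as needed
  labels:
    operator.insight.io/managed-by: insight
spec:
  endpoints:
    - interval: 15s                     # collection cycle
      path: /actuator/prometheus        # metrics path exposed by the Service
      port: http                        # name of the Service port, not the port number
  namespaceSelector:
    matchNames:                         # only discover Services in these two namespaces
      - default
      - insight-system
  selector:
    matchLabels:
      app: my-service                   # placeholder label of the monitored Service
```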
Grafana is a cross-platform open source visual analysis tool. Insight uses open source Grafana to provide monitoring services, and supports viewing resource consumption from multiple dimensions such as clusters, nodes, and namespaces.
For more information on open source Grafana, see Grafana Official Documentation.
In the Insight / Overview dashboard, you can view the resource usage of multiple clusters and analyze resource usage, network, storage, and more based on dimensions such as namespaces and Pods.
Click the dropdown menu in the upper-left corner of the dashboard to switch between clusters.
Click the lower-right corner of the dashboard to switch the time range for queries.
Insight provides several recommended dashboards that allow monitoring from different dimensions such as nodes, namespaces, and workloads. Switch between dashboards by clicking the insight-system / Insight / Overview section.
Note
For accessing Grafana UI, refer to Access Native Grafana.
For importing custom dashboards, refer to Importing Custom Dashboards.
By using Grafana CRD, you can incorporate the management and deployment of dashboards into the lifecycle management of Kubernetes. This enables version control, automated deployment, and cluster-level management of dashboards. This page describes how to import custom dashboards using CRD and the UI interface.
Log in to the AI platform and go to Container Management . Select the kpanda-global-cluster from the cluster list.
Choose Custom Resources from the left navigation bar. Look for grafanadashboards.integreatly.org in the list and click it to view the details.
Click YAML Create and use the following template. Replace the dashboard JSON in the Json field.
namespace : Specify the target namespace.
name : Provide a name for the dashboard.
label : Mandatory. Set the label as operator.insight.io/managed-by: insight .
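A minimal sketch of such a template is shown below. The apiVersion (integreatly.org/v1alpha1) and the dashboard JSON are assumptions for illustration; verify the actual API version on the grafanadashboards.integreatly.org details page before applying it.

```yaml
apiVersion: integreatly.org/v1alpha1          # verify against the CRD version in your cluster
kind: GrafanaDashboard
metadata:
  name: my-dashboard                          # dashboard name
  namespace: insight-system                   # target namespace
  labels:
    operator.insight.io/managed-by: insight   # mandatory label
spec:
  json: |
    {
      "title": "My Dashboard",
      "panels": []
    }
```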
Insight only collects data from clusters that have insight-agent installed and running in a normal state. The overview provides an overview of resources across multiple clusters:
Alert Statistics: Provides statistics on active alerts across all clusters.
Resource Consumption: Displays the resource usage trends for the top 5 clusters and nodes in the past hour, based on CPU usage, memory usage, and disk usage.
By default, the sorting is based on CPU usage. You can switch the metric to sort clusters and nodes.
Resource Trends: Shows the trends in the number of nodes over the past 15 days and the running trend of pods in the last hour.
Service Requests Ranking: Displays the top 5 services with the highest request latency and error rates, along with their respective clusters and namespaces in the multi-cluster environment.
By default, Insight collects node logs, container logs, and Kubernetes audit logs. In the log query page, you can search for standard output (stdout) logs within the permissions of your login account. This includes node logs, product logs, and Kubernetes audit logs. You can quickly find the desired logs among a large volume of logs. Additionally, you can use the source information and contextual raw data of the logs to assist in troubleshooting and issue resolution.
In the left navigation bar, select Data Query -> Log Query .
After selecting the query criteria, click Search , and the log records in the form of graphs will be displayed. The most recent logs are displayed on top.
In the Filter panel, switch Type and select Node to check the logs of all nodes in the cluster.
In the Filter panel, switch Type and select Event to view the logs generated by all Kubernetes events in the cluster.
Lucene Syntax Explanation:
Use logical operators (AND, OR, NOT, "") to query multiple keywords. For example: keyword1 AND (keyword2 OR keyword3) NOT keyword4.
Use a tilde (~) for fuzzy queries. You can optionally specify a parameter after the "~" to control the similarity of the fuzzy query. If not specified, it defaults to 0.5. For example: error~.
Use wildcards (*, ?) as single-character placeholders to match any character.
Use square brackets [ ] or curly braces { } for range queries. Square brackets [ ] represent a closed interval and include the boundary values. Curly braces { } represent an open interval and exclude the boundary values. Range queries are applicable only to fields that can be sorted, such as numeric fields and date fields. For example timestamp:[2022-01-01 TO 2022-01-31].
Clicking on the button next to a log will slide out a panel on the right side where you can view the default 100 lines of context for that log. You can switch the Display Rows option to view more contextual content.
Metric query supports querying the metric data of each container resource, allowing you to view trends in the monitoring metrics. Advanced query also supports native PromQL statements for metric queries.
In the left navigation bar, click Data Query -> Metric Query .
After selecting query conditions such as cluster, type, node, and metric name, click Search , and the proper metric chart and data details will be displayed on the right side of the screen.
Tip
Custom time ranges are supported. You can manually click the Refresh icon or select a default time interval to refresh.
Through cluster monitoring, you can view the basic information of the cluster, the resource consumption and the trend of resource consumption over a period of time.
Select Infrastructure > Clusters from the left navigation bar. On this page, you can view the following information:
Resource Overview: Provides statistics on the number of normal/all nodes and workloads across multiple clusters.
Fault: Displays the number of alerts generated in the current cluster.
Resource Consumption: Shows the actual usage and total capacity of CPU, memory, and disk for the selected cluster.
Metric Explanations: Describes the trends in CPU, memory, disk I/O, and network bandwidth.
Click Resource Level Monitor to view more metrics of the current cluster.
"},{"location":"en/end-user/insight/infra/cluster.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The ratio of the actual CPU usage of all pod resources in the cluster to the total CPU capacity of all nodes. CPU Allocation The ratio of the sum of CPU requests of all pods in the cluster to the total CPU capacity of all nodes. Memory Usage The ratio of the actual memory usage of all pod resources in the cluster to the total memory capacity of all nodes. Memory Allocation The ratio of the sum of memory requests of all pods in the cluster to the total memory capacity of all nodes."},{"location":"en/end-user/insight/infra/container.html","title":"Container Insight","text":"
Container insight is the process of monitoring workloads in cluster management. In the list, you can view basic information and status of workloads. On the Workloads details page, you can see the number of active alerts and the trend of resource consumption such as CPU and memory.
Follow these steps to view service monitoring metrics:
Go to the Insight product module.
Select Infrastructure > Workloads from the left navigation bar.
Switch between tabs at the top to view data for different types of workloads.
Click the target workload name to view the details.
Faults: Displays the total number of active alerts for the workload.
Resource Consumption: Shows the CPU, memory, and network usage of the workload.
Monitoring Metrics: Provides the trends of CPU, Memory, Network, and disk usage for the workload over the past hour.
Switch to the Pods tab to view the status of various pods for the workload, including their nodes, restart counts, and other information.
Switch to the JVM monitor tab to view the JVM metrics for each pod.
Note
The JVM monitoring feature only supports the Java language.
To enable the JVM monitoring feature, refer to Getting Started with Monitoring Java Applications.
"},{"location":"en/end-user/insight/infra/container.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The sum of CPU usage for all pods under the workload. CPU Requests The sum of CPU requests for all pods under the workload. CPU Limits The sum of CPU limits for all pods under the workload. Memory Usage The sum of memory usage for all pods under the workload. Memory Requests The sum of memory requests for all pods under the workload. Memory Limits The sum of memory limits for all pods under the workload. Disk Read/Write Rate The total number of continuous disk reads and writes per second within the specified time range, representing a performance measure of the number of read and write operations per second on the disk. Network Send/Receive Rate The incoming and outgoing rates of network traffic, aggregated by workload, within the specified time range."},{"location":"en/end-user/insight/infra/event.html","title":"Event Query","text":"
AI platform Insight supports event querying by cluster and namespace.
"},{"location":"en/end-user/insight/infra/event.html#event-status-distribution","title":"Event Status Distribution","text":"
By default, the events that occurred within the last 12 hours are displayed. You can select a different time range in the upper right corner to view longer or shorter periods. You can also customize the sampling interval from 1 minute to 5 hours.
The event status distribution chart provides a visual representation of the intensity and dispersion of events. This helps in evaluating and preparing for subsequent cluster operations and maintenance tasks. If events are densely concentrated during specific time periods, you may need to allocate more resources or take proper measures to ensure cluster stability and high availability. On the other hand, if events are dispersed, you can effectively schedule other maintenance tasks such as system optimization, upgrades, or handling other tasks during this period.
By considering the event status distribution chart and the selected time range, you can better plan and manage your cluster operations and maintenance work, ensuring system stability and reliability.
"},{"location":"en/end-user/insight/infra/event.html#event-count-and-statistics","title":"Event Count and Statistics","text":"
Through important event statistics, you can easily understand the number of image pull failures, health check failures, Pod execution failures, Pod scheduling failures, container OOM (Out-of-Memory) occurrences, volume mounting failures, and the total count of all events. These events are typically categorized as "Warning" and "Normal".
Select Infrastructure -> Namespaces from the left navigation bar. On this page, you can view the following information:
Switch Namespace: Switch between clusters or namespaces at the top.
Resource Overview: Provides statistics on the number of normal and total workloads within the selected namespace.
Incidents: Displays the number of alerts generated within the selected namespace.
Events: Shows the number of Warning level events within the selected namespace in the past 24 hours.
Resource Consumption: Provides the sum of CPU and memory usage for Pods within the selected namespace, along with the CPU and memory quota information.
"},{"location":"en/end-user/insight/infra/namespace.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The sum of CPU usage for Pods within the selected namespace. Memory Usage The sum of memory usage for Pods within the selected namespace. Pod CPU Usage The CPU usage for each Pod within the selected namespace. Pod Memory Usage The memory usage for each Pod within the selected namespace."},{"location":"en/end-user/insight/infra/node.html","title":"Node Monitoring","text":"
Through node monitoring, you can get an overview of the current health status of the nodes in the selected cluster and the number of abnormal pods. On the current node details page, you can view the number of alerts and the trend of resource consumption such as CPU, memory, and disk.
Probe refers to the use of black-box monitoring to regularly test the connectivity of targets through HTTP, TCP, and other methods, enabling quick detection of ongoing faults.
Insight uses the Prometheus Blackbox Exporter tool to probe the network using protocols such as HTTP, HTTPS, DNS, TCP, and ICMP, and returns the probe results to understand the network status.
Select Infrastructure -> Probes in the left navigation bar.
Click the cluster or namespace dropdown in the table to switch between clusters and namespaces.
The list displays the name, probe method, probe target, connectivity status, and creation time of the probes by default.
The connectivity status can be:
Normal: The probe successfully connects to the target, and the target returns the expected response.
Abnormal: The probe fails to connect to the target, or the target does not return the expected response.
Pending: The probe is attempting to connect to the target.
Supports fuzzy search of probe names.
"},{"location":"en/end-user/insight/infra/probe.html#create-a-probe","title":"Create a Probe","text":"
Click Create Probe .
Fill in the basic information and click Next .
Name: The name can only contain lowercase letters, numbers, and hyphens (-), and must start and end with a lowercase letter or number, with a maximum length of 63 characters.
Cluster: Select the cluster for the probe task.
Namespace: The namespace where the probe task is located.
Configure the probe parameters.
Blackbox Instance: Select the blackbox instance responsible for the probe.
Probe Method:
HTTP: Sends HTTP or HTTPS requests to the target URL to check its connectivity and response time. This can be used to monitor the availability and performance of websites or web applications.
TCP: Establishes a TCP connection to the target host and port to check its connectivity and response time. This can be used to monitor TCP-based services such as web servers and database servers.
Other: Supports custom probe methods by configuring ConfigMap. For more information, refer to: Custom Probe Methods
Probe Target: The target address of the probe, supports domain names or IP addresses.
Labels: Custom labels that will be automatically added to Prometheus' labels.
Probe Interval: The interval between probes.
Probe Timeout: The maximum waiting time when probing the target.
After configuring, click OK to complete the creation.
Warning
After the probe task is created, it takes about 3 minutes to synchronize the configuration. During this period, no probes will be performed, and probe results cannot be viewed.
Click ┇ in the operations column and click View Monitoring Dashboard .
| Metric Name | Description |
| --- | --- |
| Current Status Response | Represents the response status code of the HTTP probe request. |
| Ping Status | Indicates whether the probe request was successful. 1 indicates a successful probe request, and 0 indicates a failed probe request. |
| IP Protocol | Indicates the IP protocol version used in the probe request. |
| SSL Expiry | Represents the earliest expiration time of the SSL/TLS certificate. |
| DNS Response (Latency) | Represents the duration of the entire probe process in seconds. |
| HTTP Duration | Represents the duration of the entire process from sending the request to receiving the complete response. |

Edit a Probe
Click ┇ in the operations column and click Edit .
"},{"location":"en/end-user/insight/infra/probe.html#delete-a-probe","title":"Delete a Probe","text":"
Click ┇ in the operations column and click Delete .
Insight is a multicluster observation product in AI platform. In order to realize the unified collection of multicluster observation data, users need to install the Helm App insight-agent (Installed in insight-system namespace by default). See How to install insight-agent .
In Insight -> Data Collection section, you can view the status of insight-agent installed in each cluster.
not installed : insight-agent is not installed under the insight-system namespace in this cluster
Running : insight-agent is successfully installed in the cluster, and all deployed components are running
Exception : If insight-agent is in this state, it means that the helm deployment failed or the deployed components are not running
You can check this as follows:
Run the following command. If the status is deployed , proceed to the next step. If the status is failed , it will affect the application upgrade, so it is recommended to uninstall it via Container Management -> Helm Apps and then reinstall it:
helm list -n insight-system
Run the following command, or check the status of the components deployed in the cluster in Insight -> Data Collection . If any pod is not in the Running state, restart the abnormal pod.
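For example, a quick check could look like the sketch below (the insight-system namespace is the default installation namespace and is assumed here):

```shell
# List the insight-agent component pods and confirm they are all Running
kubectl get pods -n insight-system
```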
The resource consumption of the metric collection component Prometheus in insight-agent is directly proportional to the number of pods running in the cluster. Adjust Prometheus resources according to the cluster size, please refer to Prometheus Resource Planning.
The storage capacity of the metric storage component vmstorage in the global service cluster is directly proportional to the total number of pods across all clusters.
Please contact the platform administrator to adjust the disk capacity of vmstorage according to the cluster size, see vmstorage disk capacity planning.
Adjust the vmstorage disk according to the multicluster size, see vmstorage disk expansion.
The AI platform enables the management and creation of multicloud environments and multiple clusters. Building upon this capability, Insight serves as a unified observability solution for multiple clusters. It collects observability data from multiple clusters by deploying the insight-agent plugin and allows querying of metrics, logs, and trace data through the AI platform Insight.
insight-agent is a tool that facilitates the collection of observability data from multiple clusters. Once installed, it automatically collects metrics, logs, and trace data without any modifications.
Clusters created through Container Management come pre-installed with insight-agent. Hence, this guide specifically provides instructions on enabling observability for integrated clusters.
Install insight-agent online
As a unified observability platform for multiple clusters, Insight's resource consumption for certain components is closely related to the size of the created clusters and the number of integrated clusters. When installing insight-agent, it is necessary to adjust the resources of the proper components based on the cluster size.
Adjust the CPU and memory resources of the Prometheus collection component in insight-agent according to the size of the cluster created or integrated. Please refer to Prometheus resource planning.
As the metric data from multiple clusters is stored centrally, AI platform administrators need to adjust the disk space of vmstorage based on the cluster size. Please refer to vmstorage disk capacity planning.
For instructions on adjusting the disk space of vmstorage, please refer to Expanding vmstorage disk.
Since AI platform supports the management of multicloud and multiple clusters, insight-agent has undergone partial verification. However, there are known conflicts with monitoring components when installing insight-agent in Suanova 4.0 clusters and Openshift 4.x clusters. If you encounter similar issues, please refer to the following documents:
Install insight-agent in Openshift 4.x
Currently, the insight-agent collection component has undergone functional testing for popular versions of Kubernetes. Please refer to:
Kubernetes cluster compatibility testing
Openshift 4.x cluster compatibility testing
Rancher cluster compatibility testing
"},{"location":"en/end-user/insight/quickstart/install/big-log-and-trace.html","title":"Enable Big Log and Big Trace Modes","text":"
The Insight module supports switching logs to Big Log mode and traces to Big Trace mode, in order to enhance data writing capabilities in large-scale environments. This page introduces the following methods for enabling these modes:
Enable or upgrade to Big Log and Big Trace modes through the installer (controlled by the same parameter value in manifest.yaml)
Manually enable Big Log and Big Trace modes through Helm commands
This mode is referred to as the Kafka mode, and the data flow diagram is shown below:
"},{"location":"en/end-user/insight/quickstart/install/big-log-and-trace.html#enabling-via-installer","title":"Enabling via Installer","text":"
When deploying/upgrading AI platform using the installer, the manifest.yaml file includes the infrastructures.kafka field. To enable observable Big Log and Big Trace modes, Kafka must be activated:
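A sketch of the relevant manifest.yaml fragment is shown below; the exact field name of the switch (enable) is an assumption, so follow the schema of your installer version.

```yaml
infrastructures:
  kafka:
    enable: true   # assumed switch name; activates Kafka so that Big Log and Big Trace modes can be enabled
```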
When using a manifest.yaml that enables kafka during installation, Kafka middleware will be installed by default, and Big Log and Big Trace modes will be enabled automatically. The installation command is:
The upgrade also involves modifying the kafka field. However, note that since the old environment was installed with kafka: false, Kafka is not present in the environment. Therefore, you need to specify the upgrade for middleware to install Kafka middleware simultaneously. The upgrade command is:
After the upgrade is complete, you need to manually restart the following components:
insight-agent-fluent-bit
insight-agent-opentelemetry-collector
insight-opentelemetry-collector
"},{"location":"en/end-user/insight/quickstart/install/big-log-and-trace.html#enabling-via-helm-commands","title":"Enabling via Helm Commands","text":"
Prerequisites: Ensure that there is a usable Kafka and that the address is accessible.
Use the following commands to retrieve the values of the old versions of Insight and insight-agent (it's recommended to back them up):
```shell
helm get values insight -n insight-system -o yaml > insight.yaml
helm get values insight-agent -n insight-system -o yaml > insight-agent.yaml
```
"},{"location":"en/end-user/insight/quickstart/install/big-log-and-trace.html#enabling-big-log","title":"Enabling Big Log","text":"
There are several ways to enable or upgrade to Big Log mode:
Use --set in the helm upgrade command
Modify the YAML and run helm upgrade
Upgrade via the Container Management UI
First, run the following Insight upgrade command, ensuring the Kafka brokers address is correct:
In the Container Management module, find the cluster, select Helm Apps from the left navigation bar, and find and update the insight-agent.
In Trace Settings, select kafka for output and fill in the correct brokers address.
Note that after the upgrade is complete, you need to manually restart the insight-agent-opentelemetry-collector and insight-opentelemetry-collector components.
When deploying Insight to a Kubernetes environment, proper resource management and optimization are crucial. Insight includes several core components such as Prometheus, OpenTelemetry, FluentBit, Vector, and Elasticsearch. These components, during their operation, may negatively impact the performance of other pods within the cluster due to resource consumption issues. To effectively manage resources and optimize cluster operations, node affinity becomes an important option.
This page describes how to add taints and node affinity to ensure that each component runs on the appropriate nodes, avoiding resource competition or contention and thereby guaranteeing the stability and efficiency of the entire Kubernetes cluster.
"},{"location":"en/end-user/insight/quickstart/install/component-scheduling.html#configure-dedicated-nodes-for-insight-using-taints","title":"Configure dedicated nodes for Insight using taints","text":"
Since the Insight Agent includes DaemonSet components, the configuration method described in this section is to have all components except the Insight DaemonSet run on dedicated nodes.
This is achieved by adding taints to the dedicated nodes and using tolerations to match them. More details can be found in the Kubernetes official documentation.
You can refer to the following commands to add and remove taints on nodes:
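For example (the node name node8 is a placeholder; the taint key and value match the one referenced in the namespace-level configuration below):

```shell
# Add the dedicated taint to a node
kubectl taint nodes node8 node.daocloud.io=insight-only:NoSchedule

# Remove the taint from the node
kubectl taint nodes node8 node.daocloud.io=insight-only:NoSchedule-
```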
There are two ways to schedule Insight components to dedicated nodes:
"},{"location":"en/end-user/insight/quickstart/install/component-scheduling.html#1-add-tolerations-for-each-component","title":"1. Add tolerations for each component","text":"
Configure the tolerations for the insight-server and insight-agent Charts respectively:
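A hedged sketch of the values fragment is shown below; the exact key under which each chart accepts tolerations may differ, so check the chart's values.yaml before applying it.

```yaml
# Toleration matching the dedicated-node taint added above
tolerations:
  - key: "node.daocloud.io"
    operator: "Equal"
    value: "insight-only"
    effect: "NoSchedule"
```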
"},{"location":"en/end-user/insight/quickstart/install/component-scheduling.html#2-configure-at-the-namespace-level","title":"2. Configure at the namespace level","text":"
Allow pods in the insight-system namespace to tolerate the node.daocloud.io=insight-only taint.
Adjust the apiserver configuration file /etc/kubernetes/manifests/kube-apiserver.yaml so that the enabled admission plugins include PodTolerationRestriction and PodNodeSelector, for example:
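A sketch of the relevant fragment is shown below; only the flag is shown, and the plugins already enabled in your environment should be kept and the two new ones appended.

```yaml
spec:
  containers:
    - command:
        - kube-apiserver
        # existing plugins kept, PodTolerationRestriction and PodNodeSelector appended
        - --enable-admission-plugins=NodeRestriction,PodTolerationRestriction,PodNodeSelector
```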
Add an annotation to the insight-system namespace:
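A sketch using the upstream PodTolerationRestriction annotation key; the toleration value mirrors the taint used above and is an assumption to adapt to your environment.

```shell
kubectl annotate namespace insight-system \
  scheduler.alpha.kubernetes.io/defaultTolerations='[{"operator":"Equal","effect":"NoSchedule","key":"node.daocloud.io","value":"insight-only"}]'
```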
Restart the components under the insight-system namespace to allow normal scheduling of pods under the insight-system.
"},{"location":"en/end-user/insight/quickstart/install/component-scheduling.html#use-node-labels-and-node-affinity-to-manage-component-scheduling","title":"Use node labels and node affinity to manage component scheduling","text":"
Info
Node affinity is conceptually similar to nodeSelector, allowing you to constrain which nodes a pod can be scheduled on based on labels on the nodes. There are two types of node affinity:
requiredDuringSchedulingIgnoredDuringExecution: The scheduler will only schedule the pod if the rules are met. This feature is similar to nodeSelector but has more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution: The scheduler will try to find nodes that meet the rules. If no matching nodes are found, the scheduler will still schedule the Pod.
For more details, please refer to the Kubernetes official documentation.
To meet different user needs for scheduling Insight components, Insight provides fine-grained labels for different components' scheduling policies. Below is a description of the labels and their associated components:
| Label Key | Label Value | Description |
| --- | --- | --- |
| node.daocloud.io/insight-any | Any value, recommended to use true | Represents that all Insight components prefer nodes with this label |
| node.daocloud.io/insight-prometheus | Any value, recommended to use true | Specifically for Prometheus components |
| node.daocloud.io/insight-vmstorage | Any value, recommended to use true | Specifically for VictoriaMetrics vmstorage components |
| node.daocloud.io/insight-vector | Any value, recommended to use true | Specifically for Vector components |
| node.daocloud.io/insight-otel-col | Any value, recommended to use true | Specifically for OpenTelemetry components |
You can refer to the following commands to add and remove labels on nodes:
```shell
# Add label to node8, prioritizing scheduling insight-prometheus to node8
kubectl label nodes node8 node.daocloud.io/insight-prometheus=true

# Remove the node.daocloud.io/insight-prometheus label from node8
kubectl label nodes node8 node.daocloud.io/insight-prometheus-
```
Below is the default affinity preference for the insight-prometheus component during deployment:
Prioritize scheduling insight-prometheus to nodes with the node.daocloud.io/insight-prometheus label
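Expressed as Kubernetes node affinity, this preference looks roughly like the sketch below; the weight is illustrative rather than the exact value shipped with the component.

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                     # illustrative weight
        preference:
          matchExpressions:
            - key: node.daocloud.io/insight-prometheus
              operator: Exists
```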
"},{"location":"en/end-user/insight/quickstart/install/gethosturl.html","title":"Get Data Storage Address of Global Service Cluster","text":"
Insight is a product for unified observation of multiple clusters. To achieve unified storage and querying of observation data from multiple clusters, sub-clusters need to report the collected observation data to the global service cluster for unified storage. This document provides the required address of the storage component when installing the collection component insight-agent.
"},{"location":"en/end-user/insight/quickstart/install/gethosturl.html#install-insight-agent-in-global-service-cluster","title":"Install insight-agent in Global Service Cluster","text":"
If installing insight-agent in the global service cluster, it is recommended to access the cluster via domain name:
"},{"location":"en/end-user/insight/quickstart/install/gethosturl.html#install-insight-agent-in-other-clusters","title":"Install insight-agent in Other Clusters","text":""},{"location":"en/end-user/insight/quickstart/install/gethosturl.html#get-address-via-interface-provided-by-insight-server","title":"Get Address via Interface Provided by Insight Server","text":"
The management cluster uses the default LoadBalancer mode for exposure.
Log in to the console of the global service cluster and run the following command:
```shell
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam'
```
Note
Please replace the ${INSIGHT_SERVER_IP} parameter in the command.
global.exporters.logging.host is the log service address, no need to set the proper service port, the default value will be used.
global.exporters.metric.host is the metrics service address.
global.exporters.trace.host is the trace service address.
global.exporters.auditLog.host is the audit log service address (same service as trace but different port).
Management cluster disables LoadBalancer
When calling the interface, you need to additionally pass an externally accessible node IP from the cluster, which will be used to construct the complete access address of the proper service.
```shell
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam' --data '{"extra": {"EXPORTER_EXTERNAL_IP": "10.5.14.51"}}'
```
global.exporters.logging.host is the log service address.
global.exporters.logging.port is the NodePort exposed by the log service.
global.exporters.metric.host is the metrics service address.
global.exporters.metric.port is the NodePort exposed by the metrics service.
global.exporters.trace.host is the trace service address.
global.exporters.trace.port is the NodePort exposed by the trace service.
global.exporters.auditLog.host is the audit log service address (same service as trace but different port).
global.exporters.auditLog.port is the NodePort exposed by the audit log service.
"},{"location":"en/end-user/insight/quickstart/install/gethosturl.html#connect-via-loadbalancer","title":"Connect via LoadBalancer","text":"
If LoadBalancer is enabled in the cluster and a VIP is set for Insight, you can manually execute the following command to obtain the address information for vminsert and opentelemetry-collector:
```shell
$ kubectl get service -n insight-system | grep lb
lb-insight-opentelemetry-collector              LoadBalancer   10.233.23.12   <pending>   4317:31286/TCP,8006:31351/TCP   24d
lb-vminsert-insight-victoria-metrics-k8s-stack  LoadBalancer   10.233.63.67   <pending>   8480:31629/TCP                  24d
```
lb-vminsert-insight-victoria-metrics-k8s-stack is the address for the metrics service.
lb-insight-opentelemetry-collector is the address for the tracing service.
Execute the following command to obtain the address information for elasticsearch:
$ kubectl get service -n mcamel-system | grep es\nmcamel-common-es-cluster-masters-es-http NodePort 10.233.16.120 <none> 9200:30465/TCP 47d\n
mcamel-common-es-cluster-masters-es-http is the address for the logging service.
"},{"location":"en/end-user/insight/quickstart/install/gethosturl.html#connect-via-nodeport","title":"Connect via NodePort","text":"
The LoadBalancer feature is disabled in the global service cluster.
In this case, the LoadBalancer resources mentioned above will not be created by default. The relevant service names are:
insight-agent is a plugin for collecting Insight data, supporting unified observation of metrics, traces, and log data. This article describes how to install insight-agent in an online environment for an integrated cluster.
Enter Container Management from the left navigation bar, and enter Clusters . Find the cluster where you want to install insight-agent.
Choose Install now to jump, or click the cluster and click Helm Apps -> Helm Templates in the left navigation bar, search for insight-agent in the search box, and click it for details.
Select the appropriate version and click Install .
Fill in the name, select the namespace and version, and fill in the addresses for reporting logging, metric, audit, and trace data in the YAML file. The system fills in the default data reporting addresses for the components; please check them before clicking OK to install.
If you need to modify the data reporting address, please refer to Get Data Reporting Address.
The system will automatically return to Helm Apps . When the application status changes from Unknown to Deployed , it means that insight-agent is installed successfully.
Note
Click ┇ on the far right, and you can perform more operations such as Update , View YAML and Delete in the pop-up menu.
For a practical installation demo, watch Video demo of installing insight-agent
This page lists some issues related to the installation and uninstallation of Insight Agent and their workarounds.
"},{"location":"en/end-user/insight/quickstart/install/knownissues.html#uninstallation-failure-of-insight-agent","title":"Uninstallation Failure of Insight Agent","text":"
When you run the following command to uninstall Insight Agent,
helm uninstall insight-agent
the tls secret used by otel-operator fails to be uninstalled.
Due to the "reuse tls secret" logic in the otel-operator code, it checks whether the MutationConfiguration exists and reuses the CA cert bound in the MutationConfiguration. However, since helm uninstall has already removed the MutationConfiguration, this results in a null value.
Therefore, please manually delete the proper secret using one of the following methods:
Delete via command line: Log in to the console of the target cluster and run the kubectl command shown after this list.
Delete via UI: Log in to AI platform container management, select the target cluster, select Secret from the left menu, input insight-agent-opentelemetry-operator-controller-manager-service-cert, then select Delete.
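For the command-line method, a sketch of the deletion is shown below; the insight-system namespace is an assumption, so adjust it if the secret lives elsewhere.

```shell
kubectl -n insight-system delete secret insight-agent-opentelemetry-operator-controller-manager-service-cert
```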
"},{"location":"en/end-user/insight/quickstart/install/knownissues.html#insight-agent_1","title":"Insight Agent","text":""},{"location":"en/end-user/insight/quickstart/install/knownissues.html#log-collection-endpoint-not-updated-when-upgrading-insight-agent","title":"Log Collection Endpoint Not Updated When Upgrading Insight Agent","text":"
When updating the log configuration of the insight-agent from Elasticsearch to Kafka or from Kafka to Elasticsearch, the changes do not take effect and the agent continues to use the previous configuration.
Solution :
Manually restart Fluent Bit in the cluster.
"},{"location":"en/end-user/insight/quickstart/install/knownissues.html#podmonitor-collects-multiple-sets-of-jvm-metrics","title":"PodMonitor Collects Multiple Sets of JVM Metrics","text":"
In this version, there is a defect in PodMonitor/insight-kubernetes-pod: it will incorrectly create Jobs to collect metrics for all containers in Pods that are marked with insight.opentelemetry.io/metric-scrape=true, instead of only the containers proper to insight.opentelemetry.io/metric-port.
After PodMonitor is declared, PrometheusOperator will pre-configure some service discovery configurations. For CRD compatibility reasons, configuring collection tasks through annotations has been abandoned.
Use the additional scrape config mechanism provided by Prometheus to configure the service discovery rules in a secret and introduce them into Prometheus.
Therefore:
Delete the current PodMonitor for insight-kubernetes-pod
Use a new rule
In the new rule, action: keepequal is used to compare the consistency between source_labels and target_label to determine whether to create collection tasks for the ports of a container. Note that this feature is only available in Prometheus v2.41.0 (2022-12-20) and higher.
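A hedged sketch of such a rule is shown below. The label names are illustrative (Prometheus sanitizes annotation keys, so insight.opentelemetry.io/metric-port becomes the label shown); this is not the exact rule shipped with Insight.

```yaml
relabel_configs:
  # Only keep targets whose declared metric port matches the container port
  - source_labels: [__meta_kubernetes_pod_annotation_insight_opentelemetry_io_metric_port]
    target_label: __meta_kubernetes_pod_container_port_number
    action: keepequal
```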
This page provides some considerations for upgrading insight-server and insight-agent.
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v028x-or-lower-to-v029x","title":"Upgrade from v0.28.x (or lower) to v0.29.x","text":"
Due to the upgrade of the Opentelemetry community operator chart version in v0.29.0, the supported values for featureGates in the values file have changed. Therefore, before upgrading, you need to set the value of featureGates to empty, as follows:
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v026x-or-lower-to-v027x-or-higher","title":"Upgrade from v0.26.x (or lower) to v0.27.x or higher","text":"
In v0.27.x, the switch for the vector component has been separated. If the existing environment has vector enabled, you need to specify --set vector.enabled=true when upgrading the insight-server.
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v019x-or-lower-to-020x","title":"Upgrade from v0.19.x (or lower) to 0.20.x","text":"
Before upgrading Insight , you need to manually delete the jaeger-collector and jaeger-query deployments by running the following command:
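A sketch of the deletion is shown below; the deployment names and namespace are assumptions, so confirm them first with kubectl get deploy -n insight-system | grep jaeger.

```shell
kubectl -n insight-system delete deployment insight-jaeger-collector insight-jaeger-query
```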
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v017x-or-lower-to-v018x","title":"Upgrade from v0.17.x (or lower) to v0.18.x","text":"
In v0.18.x, there have been updates to the Jaeger-related deployment files, so you need to manually run the following commands before upgrading insight-server:
There have been changes to metric names in v0.18.x, so after upgrading insight-server, insight-agent should also be upgraded.
In addition, the parameters for enabling the tracing module and adjusting the ElasticSearch connection have been modified. Refer to the following parameters:
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v015x-or-lower-to-v016x","title":"Upgrade from v0.15.x (or lower) to v0.16.x","text":"
In v0.16.x, a new feature parameter disableRouteContinueEnforce in the vmalertmanagers CRD is used. Therefore, you need to manually run the following command before upgrading insight-server:
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v023x-or-lower-to-v024x","title":"Upgrade from v0.23.x (or lower) to v0.24.x","text":"
In v0.24.x, CRDs have been added to the OTEL operator chart. However, helm upgrade does not update CRDs, so you need to manually run the following command:
If you are performing an offline installation, you can find the above CRD yaml file after extracting the insight-agent offline package. After extracting the insight-agent Chart, manually run the following command:
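A hedged sketch of applying those CRDs (the directory below assumes the extracted insight-agent chart layout; adjust the path to where the OTel operator CRDs live in your package):
kubectl apply --server-side --force-conflicts -f insight-agent/charts/opentelemetry-operator/crds/\n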
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v019x-or-lower-to-v020x","title":"Upgrade from v0.19.x (or lower) to v0.20.x","text":"
In v0.20.x, Kafka log export configuration has been added, and there have been some adjustments to the log export configuration. Before upgrading insight-agent , please note the parameter changes. The previous logging configuration has been moved to the logging.elasticsearch configuration:
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v017x-or-lower-to-v018x_1","title":"Upgrade from v0.17.x (or lower) to v0.18.x","text":"
Due to the updated deployment files for Jaeger in v0.18.x, it is important to note the parameter changes before upgrading the insight-agent.
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v016x-or-lower-to-v017x","title":"Upgrade from v0.16.x (or lower) to v0.17.x","text":"
In v0.17.x, the kube-prometheus-stack chart version was upgraded from 41.9.1 to 45.28.1, and some fields in the CRDs it uses were also upgraded, such as the attachMetadata field of ServiceMonitor. Therefore, the following command needs to be run before upgrading the insight-agent:
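A hedged sketch that re-applies the updated kube-prometheus-stack CRDs (the path assumes the extracted insight-agent package described below):
kubectl apply --server-side --force-conflicts -f insight-agent/dependency-crds/\n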
If you are performing an offline installation, you can find the yaml for the above CRD in insight-agent/dependency-crds after extracting the insight-agent offline package.
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v011x-or-earlier-to-v012x","title":"Upgrade from v0.11.x (or earlier) to v0.12.x","text":"
v0.12.x upgrades kube-prometheus-stack chart from 39.6.0 to 41.9.1, including prometheus-operator to v0.60.1, prometheus-node-exporter chart to v4.3.0. Prometheus-node-exporter uses Kubernetes recommended label after upgrading, so you need to delete node-exporter daemonset. prometheus-operator has updated the CRD, so you need to run the following command before upgrading the insight-agent:
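A hedged sketch (the daemonset name, namespace, and CRD path are assumptions for a typical installation):
kubectl -n insight-system delete daemonset insight-agent-prometheus-node-exporter\nkubectl apply --server-side --force-conflicts -f insight-agent/dependency-crds/\n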
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/jmx-exporter.html","title":"Use JMX Exporter to expose JVM monitoring metrics","text":"
JMX-Exporter provides two usages:
Standalone process: specify parameters when the JVM starts to expose the RMI interface of JMX. JMX Exporter calls RMI to obtain the JVM runtime status data, converts it to the Prometheus metrics format, and exposes a port for Prometheus to collect.
In-process (javaagent): specify parameters when the JVM starts to run the JMX Exporter jar package as a javaagent. It reads the JVM runtime status data in-process, converts it into the Prometheus metrics format, and exposes a port for Prometheus to collect.
Note
The first method is not officially recommended: the configuration is complicated, and it requires a separate process whose monitoring in turn becomes a new problem. This page therefore focuses on the second usage and explains how to use JMX Exporter to expose JVM monitoring metrics in a Kubernetes environment.
The second usage is used here: the JMX Exporter jar package and its configuration file must be specified when starting the JVM. The jar package is a binary file that is not easy to mount through a ConfigMap, and the configuration file rarely needs to be modified, so the suggestion is to package the JMX Exporter jar package and configuration file directly into the business container image.
With the second usage, you can either build the JMX Exporter jar file into the business application image or mount it during deployment. The two methods are introduced below:
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/jmx-exporter.html#method-1-build-the-jmx-exporter-jar-file-into-the-business-image","title":"Method 1: Build the JMX Exporter JAR file into the business image","text":"
The content of prometheus-jmx-config.yaml is as follows:
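A minimal sketch of such a configuration (these are standard jmx_exporter options; tune the rules for your application):
ssl: false\nlowercaseOutputName: true\nlowercaseOutputLabelNames: true\nrules:\n  - pattern: \".*\"\n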
For more configuration items, please refer to the introduction at the bottom or the Prometheus official documentation.
Then prepare the jar package file, you can find the latest jar package download address on the Github page of jmx_exporter and refer to the following Dockerfile:
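A hedged sketch of such a Dockerfile (the base image, file names, and paths are assumptions; port 8088 matches the note below):
FROM openjdk:11-jre-slim\nWORKDIR /app/\n# Copy the JMX Exporter config file and javaagent jar into the business image\nCOPY prometheus-jmx-config.yaml ./\nCOPY jmx_prometheus_javaagent-0.17.2.jar ./\n# Business application jar\nCOPY my-demo-app.jar ./\n# Attach the JMX Exporter as a javaagent and expose metrics on port 8088\nENV JAVA_TOOL_OPTIONS=\"-javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml\"\nCMD [\"java\", \"-jar\", \"my-demo-app.jar\"]\n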
Port 8088 is used here to expose the JVM monitoring metrics; if it conflicts with the Java application, you can change it as needed.
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/jmx-exporter.html#method-2-mount-via-init-container-container","title":"Method 2: mount via init container container","text":"
First, build the JMX Exporter into a Docker image. The following Dockerfile is for reference only:
FROM alpine/curl:3.14\nWORKDIR /app/\n# Copy the previously created config file into the image\nCOPY prometheus-jmx-config.yaml ./\n# Download the jmx prometheus javaagent jar online\nRUN set -ex; \\\n    curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar;\n
Build the image according to the above Dockerfile: docker build -t my-jmx-exporter .
Add the following init container to the Java application deployment Yaml:
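A hedged sketch of the relevant part of the deployment (container names, images, and mount paths are assumptions; my-jmx-exporter is the image built above):
spec:\n  template:\n    spec:\n      initContainers:\n        - name: jmx-sidecar\n          image: my-jmx-exporter\n          # Copy the javaagent jar and config file into the shared volume\n          command: [\"sh\", \"-c\", \"cp /app/* /target/\"]\n          volumeMounts:\n            - name: jmx\n              mountPath: /target\n      containers:\n        - name: my-demo-app\n          image: my-demo-app\n          env:\n            - name: JAVA_TOOL_OPTIONS\n              value: \"-javaagent:/app/jmx/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/jmx/prometheus-jmx-config.yaml\"\n          volumeMounts:\n            - name: jmx\n              mountPath: /app/jmx\n      volumes:\n        - name: jmx\n          emptyDir: {}\n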
After the above modification, the sample application my-demo-app is able to expose JVM metrics. After running the service, you can access the Prometheus-format metrics exposed by the service at http://localhost:8088.
Then, you can refer to Java Application Docking Observability with JVM Metrics.
This document mainly describes how to monitor the JVM of a customer's Java application, covering how Java applications that have already exposed JVM metrics, and those that have not, connect to Insight.
If your Java application does not start exposing JVM metrics, you can refer to the following documents:
Expose JVM monitoring metrics with JMX Exporter
Expose JVM monitoring metrics using OpenTelemetry Java Agent
If your Java application has exposed JVM metrics, you can refer to the following documents:
Java application docking observability with existing JVM metrics
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/legacy-jvm.html","title":"Java Application with JVM Metrics to Dock Insight","text":"
If your Java application exposes JVM monitoring metrics through other means (such as Spring Boot Actuator), the monitoring data still needs to be collected. You can let Insight collect the existing JVM metrics by adding Kubernetes annotations to the workload:
annotations:\n  insight.opentelemetry.io/metric-scrape: \"true\"  # whether to collect\n  insight.opentelemetry.io/metric-path: \"/\"       # path to collect metrics\n  insight.opentelemetry.io/metric-port: \"9464\"    # port for collecting metrics\n
YAML example of adding annotations to the my-deployment-app workload:
In the above example, Insight will use http://<service-ip>:8080/actuator/prometheus to get the Prometheus metrics exposed through Spring Boot Actuator.
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/otel-java-agent.html","title":"Use OpenTelemetry Java Agent to expose JVM monitoring metrics","text":"
In OpenTelemetry Agent v1.20.0 and above, the OpenTelemetry Agent has added the JMX Metric Insight module. If your application has already integrated the OpenTelemetry Agent to collect application traces, you no longer need to introduce another agent to expose JMX metrics. The OpenTelemetry Agent collects and exposes metrics by instrumenting the MBeans locally available in the application.
The OpenTelemetry Agent also has built-in monitoring samples for common Java servers and frameworks; please refer to Predefined Metrics.
Using the OpenTelemetry Java Agent also requires considering how to mount the JAR into the container. In addition to mounting the JAR file as described for the JMX Exporter above, we can also use the Operator capabilities provided by OpenTelemetry to automatically enable JVM metric exposure for our applications.
If your application has integrated Opentelemetry Agent to collect application traces, then you no longer need to introduce other Agents to expose JMX metrics for our application. The Opentelemetry Agent can now natively collect and expose metrics interfaces by instrumenting metrics exposed by MBeans available locally in the application.
However, in the current version, you still need to manually add the proper annotations to the workload before the JVM data can be collected by Insight.
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/otel-java-agent.html#expose-metrics-for-java-middleware","title":"Expose metrics for Java middleware","text":"
Opentelemetry Agent also has some built-in middleware monitoring samples, please refer to Predefined Metrics.
By default, no type is specified, and it needs to be specified through -Dotel.jmx.target.system JVM Options, such as -Dotel.jmx.target.system=jetty,kafka-broker .
Gaining JMX Metric Insights with the OpenTelemetry Java Agent
Otel jmx metrics
"},{"location":"en/end-user/insight/quickstart/otel/golang-ebpf.html","title":"Enhance Go apps with OTel auto-instrumentation","text":"
If you don't want to manually change the application code, you can try the eBPF-based automatic enhancement method described on this page. This feature is currently under review for donation to the OpenTelemetry community and does not yet support Operator injection through annotations (it will be supported in the future), so you need to manually change the Deployment YAML or use a patch.
Install it under the insight-system namespace; skip this step if it has already been installed.
Note: This CR currently only supports the injection of environment variables (including service name and trace address) required to connect to Insight, and will support the injection of Golang probes in the future.
"},{"location":"en/end-user/insight/quickstart/otel/golang-ebpf.html#change-the-application-deployment-file","title":"Change the application deployment file","text":"
Add environment variable annotations
There is only one such annotation, which is used to add the OpenTelemetry-related environment variables, such as the trace reporting address, the ID of the cluster where the container is located, and the namespace:
The value is divided into two parts by / , the first value insight-system is the namespace of the CR installed in the second step, and the second value insight-opentelemetry-autoinstrumentation is the name of the CR.
Please ensure that the insight-agent is ready. If not, please refer to Install insight-agent for data collection and make sure the following three items are ready:
Enable trace functionality for insight-agent
Check if the address and port for trace data are correctly filled
Ensure that the Pods corresponding to deployment/insight-agent-opentelemetry-operator and deployment/insight-agent-opentelemetry-collector are ready
"},{"location":"en/end-user/insight/quickstart/otel/operator.html#works-with-the-service-mesh-product-mspider","title":"Works with the Service Mesh Product (Mspider)","text":"
If you enable the tracing capability of the Mspider(Service Mesh), you need to add an additional environment variable injection configuration:
"},{"location":"en/end-user/insight/quickstart/otel/operator.html#the-operation-steps-are-as-follows","title":"The operation steps are as follows","text":"
Log in to AI platform, then enter Container Management and select the target cluster.
Click CRDs in the left navigation bar, find instrumentations.opentelemetry.io, and enter the details page.
Select the insight-system namespace, then edit insight-opentelemetry-autoinstrumentation, and add the following content under spec:env::
"},{"location":"en/end-user/insight/quickstart/otel/operator.html#add-annotations-to-automatically-access-traces","title":"Add annotations to automatically access traces","text":"
After the above is ready, you can enable tracing for the application through annotations. OTel currently supports enabling tracing through annotations. Depending on the service language, different pod annotations need to be added. Each service can add one of two types of annotations:
Only inject environment variable annotations
There is only one such annotation, which is used to add OTel-related environment variables, such as the trace reporting address, the ID of the cluster where the container is located, and the namespace (this annotation is very useful when the application's language does not support automatic probe injection)
The value is divided into two parts by /, the first value (insight-system) is the namespace of the CR installed in the previous step, and the second value (insight-opentelemetry-autoinstrumentation) is the name of the CR.
Automatic probe injection and environment variable injection annotations
There are currently 4 such annotations, corresponding to 4 different programming languages: Java, Node.js, Python, and .NET. After using one of them, the automatic probe and the default OTel environment variables will be injected into the first container under spec.pod:
Since Go auto-instrumentation requires OTEL_GO_AUTO_TARGET_EXE to be set, you must provide a valid executable path through annotations or the Instrumentation resource. Failure to set this value will terminate the Go auto-instrumentation injection, and traces will not be connected.
The OpenTelemetry Operator automatically adds some OTel-related environment variables when injecting probes and also supports overriding these variables. The priority order for overriding these environment variables is as follows:
original container env vars -> language specific env vars -> common env vars -> instrument spec configs' vars\n
However, it is important to avoid manually overriding OTEL_RESOURCE_ATTRIBUTES_NODE_NAME . This variable serves as an identifier within the operator to determine if a pod has already been injected with a probe. Manually adding this variable may prevent the probe from being injected successfully.
How to query the connected services, refer to Trace Query.
"},{"location":"en/end-user/insight/quickstart/otel/otel.html","title":"Use OTel to provide the application observability","text":"
Enhancement is the process of enabling application code to generate telemetry data, i.e., data that helps you monitor or measure the performance and status of your application.
OpenTelemetry is a leading open source project providing instrumentation libraries for major programming languages and popular frameworks. It is a project under the Cloud Native Computing Foundation and is supported by the vast resources of the community. It provides a standardized data format for collected data without the need to integrate specific vendors.
Insight supports OpenTelemetry for application instrumentation to enhance your applications.
This guide introduces the basic concepts of telemetry enhancement using OpenTelemetry. OpenTelemetry also has an ecosystem of libraries, plugins, integrations, and other useful tools to extend it. You can find these resources at the OTel Registry.
You can use any open standard library for telemetry enhancement and use Insight as an observability backend to ingest, analyze, and visualize data.
To enhance your code, you can use the enhanced operations provided by OpenTelemetry for specific languages:
Insight currently provides an easy way to enhance .NET, Node.js, Java, Python, and Golang applications with OpenTelemetry. Please follow the guidelines below.
Best practices for integrating traces: Application Non-Intrusive Enhancement via Operator
Manual instrumentation with Go language as an example: Enhance Go application with OpenTelemetry SDK
Using eBPF to implement non-intrusive auto-instrumentation in Go (experimental feature)
"},{"location":"en/end-user/insight/quickstart/otel/send_tracing_to_insight.html","title":"Sending Trace Data to Insight","text":"
This document describes how customers can send trace data to Insight on their own. It mainly includes the following two scenarios:
Customer apps report traces to Insight through OTEL Agent/SDK
Forwarding traces to Insight through Opentelemetry Collector (OTEL COL)
In each cluster where Insight Agent is installed, there is an insight-agent-otel-col component that is used to receive trace data from that cluster. Therefore, this component serves as the entry point for user access and needs to obtain its address first. You can get the address of the Opentelemetry Collector in the cluster through the AI platform interface, such as insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 :
In addition, there are some slight differences for different reporting methods:
"},{"location":"en/end-user/insight/quickstart/otel/send_tracing_to_insight.html#customer-apps-report-traces-to-insight-through-otel-agentsdk","title":"Customer apps report traces to Insight through OTEL Agent/SDK","text":"
To successfully report trace data to Insight and display it properly, it is recommended to provide the required metadata (Resource Attributes) for OTLP through the following environment variables. There are two ways to achieve this:
Manually add them to the deployment YAML file, for example:
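A hedged sketch of such environment variables in the container spec (the service name and resource attribute values are illustrative; OTEL_EXPORTER_OTLP_ENDPOINT should point to the collector address obtained above):
env:\n  - name: OTEL_SERVICE_NAME\n    value: my-app\n  - name: OTEL_EXPORTER_OTLP_ENDPOINT\n    value: http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317\n  - name: OTEL_RESOURCE_ATTRIBUTES\n    value: \"k8s.namespace.name=my-namespace,k8s.pod.name=my-app-0\"\n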
"},{"location":"en/end-user/insight/quickstart/otel/send_tracing_to_insight.html#forwarding-traces-to-insight-through-opentelemetry-collector","title":"Forwarding traces to Insight through Opentelemetry Collector","text":"
After ensuring that the application has added the metadata mentioned above, you only need to add an OTLP Exporter in your customer's Opentelemetry Collector to forward the trace data to Insight Agent Opentelemetry Collector. Below is an example Opentelemetry Collector configuration file:
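A hedged sketch of such a configuration (the receivers and pipelines are illustrative; the exporter endpoint is the Insight Agent collector address mentioned above):
receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nexporters:\n  otlp/insight:\n    endpoint: insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317\n    tls:\n      insecure: true\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n      exporters: [otlp/insight]\n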
Enhancing Applications Non-intrusively with the Operator
Achieving Observability with OTel
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html","title":"Enhance Go applications with OTel SDK","text":"
This page contains instructions on how to set up OpenTelemetry enhancements in a Go application.
OpenTelemetry, also known simply as OTel, is an open-source observability framework that helps generate and collect telemetry data: traces, metrics, and logs in Go apps.
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#enhance-go-apps-with-the-opentelemetry-sdk","title":"Enhance Go apps with the OpenTelemetry SDK","text":""},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#install-related-dependencies","title":"Install related dependencies","text":"
Dependencies related to the OpenTelemetry exporter and SDK must be installed first. If you are using another request router, please refer to request routing. After navigating to the application source folder, run the following command:
go get go.opentelemetry.io/otel@v1.8.0 \\\n go.opentelemetry.io/otel/trace@v1.8.0 \\\n go.opentelemetry.io/otel/sdk@v1.8.0 \\\n go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin@v0.33.0 \\\n go.opentelemetry.io/otel/exporters/otlp/otlptrace@v1.7.0 \\\n go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.4.1\n
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#create-an-initialization-feature-using-the-opentelemetry-sdk","title":"Create an initialization feature using the OpenTelemetry SDK","text":"
In order for an application to be able to send data, a function is required to initialize OpenTelemetry. Add the following code snippet to the main.go file:
import (\n \"context\"\n \"os\"\n \"time\"\n\n \"go.opentelemetry.io/otel\"\n \"go.opentelemetry.io/otel/exporters/otlp/otlptrace\"\n \"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc\"\n \"go.opentelemetry.io/otel/propagation\"\n \"go.opentelemetry.io/otel/sdk/resource\"\n sdktrace \"go.opentelemetry.io/otel/sdk/trace\"\n semconv \"go.opentelemetry.io/otel/semconv/v1.7.0\"\n \"go.uber.org/zap\"\n \"google.golang.org/grpc\"\n)\n\nvar tracerExp *otlptrace.Exporter\n\nfunc retryInitTracer() func() {\n var shutdown func()\n go func() {\n for {\n // otel will reconnected and re-send spans when otel col recover. so, we don't need to re-init tracer exporter.\n if tracerExp == nil {\n shutdown = initTracer()\n } else {\n break\n }\n time.Sleep(time.Minute * 5)\n }\n }()\n return shutdown\n}\n\nfunc initTracer() func() {\n // temporarily set timeout to 10s\n ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)\n defer cancel()\n\n serviceName, ok := os.LookupEnv(\"OTEL_SERVICE_NAME\")\n if !ok {\n serviceName = \"server_name\"\n os.Setenv(\"OTEL_SERVICE_NAME\", serviceName)\n }\n otelAgentAddr, ok := os.LookupEnv(\"OTEL_EXPORTER_OTLP_ENDPOINT\")\n if !ok {\n otelAgentAddr = \"http://localhost:4317\"\n os.Setenv(\"OTEL_EXPORTER_OTLP_ENDPOINT\", otelAgentAddr)\n }\n zap.S().Infof(\"OTLP Trace connect to: %s with service name: %s\", otelAgentAddr, serviceName)\n\n traceExporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure(), otlptracegrpc.WithDialOption(grpc.WithBlock()))\n if err != nil {\n handleErr(err, \"OTLP Trace gRPC Creation\")\n return nil\n }\n\n tracerProvider := sdktrace.NewTracerProvider(\n sdktrace.WithBatcher(traceExporter),\n sdktrace.WithSampler(sdktrace.AlwaysSample()),\n sdktrace.WithResource(resource.NewWithAttributes(semconv.SchemaURL)))\n\n otel.SetTracerProvider(tracerProvider)\n otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))\n\n tracerExp = traceExporter\n return func() {\n // Shutdown will flush any remaining spans and shut down the exporter.\n handleErr(tracerProvider.Shutdown(ctx), \"failed to shutdown TracerProvider\")\n }\n}\n\nfunc handleErr(err error, message string) {\n if err != nil {\n zap.S().Errorf(\"%s: %v\", message, err)\n }\n}\n
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#initialize-tracker-in-maingo","title":"Initialize tracker in main.go","text":"
Modify the main function in main.go to initialize the tracer. Also, when your service shuts down, you should call TracerProvider.Shutdown() to ensure all spans are exported. The service makes this call as a deferred function in main:
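A minimal sketch of main (the router setup is elided; retryInitTracer is the function defined above):
func main() {\n    // Initialize the tracer and flush any remaining spans on shutdown\n    shutdown := retryInitTracer()\n    defer func() {\n        if shutdown != nil {\n            shutdown()\n        }\n    }()\n\n    // ... set up the router and start the service as usual\n}\n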
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#add-opentelemetry-gin-middleware-to-the-application","title":"Add OpenTelemetry Gin middleware to the application","text":"
Configure Gin to use the middleware by adding the following line to main.go :
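A minimal sketch, assuming the otelgin package is imported under the alias middleware as shown later on this page (the service name is an example):
router := gin.Default()\n// Create a span for every incoming HTTP request handled by Gin\nrouter.Use(middleware.Middleware(\"my-golang-app\"))\n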
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#run-the-application","title":"Run the application","text":"
Local debugging and running
Note: This step is only used for local development and debugging. In the production environment, the Operator will automatically complete the injection of the following environment variables.
The above steps have completed the initialization of the SDK. If you now need to develop and debug locally, you first need to obtain the address of insight-agent-opentelemetry-collector in the insight-system namespace, for example: insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 .
Therefore, you can add the following environment variables when you start the application locally:
OTEL_SERVICE_NAME=my-golang-app OTEL_EXPORTER_OTLP_ENDPOINT=http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 go run main.go...\n
Running in a production environment
Please refer to the introduction of Only injecting environment variable annotations in Achieving non-intrusive enhancement of applications through Operators to add annotations to deployment yaml:
# Add one line to your import() stanza depending upon your request router:\nmiddleware \"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin\"\n
# Add one line to your import() stanza depending upon your request router:\nmiddleware \"go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux\"\n
The OpenTelemetry community has also developed middleware for database access libraries, such as Gorm:
import (\n    \"github.com/uptrace/opentelemetry-go-extra/otelgorm\"\n    semconv \"go.opentelemetry.io/otel/semconv/v1.24.0\"\n    \"gorm.io/driver/sqlite\"\n    \"gorm.io/gorm\"\n)\n\ndb, err := gorm.Open(sqlite.Open(\"file::memory:?cache=shared\"), &gorm.Config{})\nif err != nil {\n    panic(err)\n}\n\n// WithDBName and WithAttributes are needed for a complete database-related topology display\notelPlugin := otelgorm.NewPlugin(otelgorm.WithDBName(\"mydb\"),\n    otelgorm.WithAttributes(semconv.ServerAddress(\"memory\")))\nif err := db.Use(otelPlugin); err != nil {\n    panic(err)\n}\n
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#add-custom-properties-and-custom-events-to-span","title":"Add custom properties and custom events to span","text":"
It is also possible to set a custom attribute or tag as a span. To add custom properties and events, follow these steps:
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#import-tracking-and-property-libraries","title":"Import Tracking and Property Libraries","text":"
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#get-the-current-span-from-the-context","title":"Get the current Span from the context","text":"
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#set-properties-in-the-current-span","title":"Set properties in the current Span","text":"
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#add-an-event-to-the-current-span","title":"Add an Event to the current Span","text":"
Adding span events is done using AddEvent on the span object.
span.AddEvent(msg)\n
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#log-errors-and-exceptions","title":"Log errors and exceptions","text":"
import \"go.opentelemetry.io/otel/codes\"\n\n// Get the current span\nspan := trace.SpanFromContext(ctx)\n\n// RecordError will automatically convert an error into a span even\nspan.RecordError(err)\n\n// Flag this span as an error\nspan.SetStatus(codes.Error, \"internal error\")\n
Navigate to your application\u2019s source folder and run the following command:
go get go.opentelemetry.io/otel \\\n go.opentelemetry.io/otel/attribute \\\n go.opentelemetry.io/otel/exporters/prometheus \\\n go.opentelemetry.io/otel/metric/global \\\n go.opentelemetry.io/otel/metric/instrument \\\n go.opentelemetry.io/otel/sdk/metric\n
"},{"location":"en/end-user/insight/quickstart/otel/golang/meter.html#create-an-initialization-function-using-otel-sdk","title":"Create an Initialization Function Using OTel SDK","text":"
For Java applications, you can directly expose JVM-related metrics by using the OpenTelemetry agent with the following environment variable:
OTEL_METRICS_EXPORTER=prometheus\n
You can then check your metrics at http://localhost:8888/metrics.
Next, combine it with a Prometheus ServiceMonitor to complete the metrics integration. If you want to expose custom metrics, please refer to opentelemetry-java-docs/prometheus.
The process is mainly divided into two steps:
Create a meter provider and specify Prometheus as the exporter.
/*\n * Copyright The OpenTelemetry Authors\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage io.opentelemetry.example.prometheus;\n\nimport io.opentelemetry.api.metrics.MeterProvider;\nimport io.opentelemetry.exporter.prometheus.PrometheusHttpServer;\nimport io.opentelemetry.sdk.metrics.SdkMeterProvider;\nimport io.opentelemetry.sdk.metrics.export.MetricReader;\n\npublic final class ExampleConfiguration {\n\n /**\n * Initializes the Meter SDK and configures the Prometheus collector with all default settings.\n *\n * @param prometheusPort the port to open up for scraping.\n * @return A MeterProvider for use in instrumentation.\n */\n static MeterProvider initializeOpenTelemetry(int prometheusPort) {\n MetricReader prometheusReader = PrometheusHttpServer.builder().setPort(prometheusPort).build();\n\n return SdkMeterProvider.builder().registerMetricReader(prometheusReader).build();\n }\n}\n
Create a custom meter and start the HTTP server.
package io.opentelemetry.example.prometheus;\n\nimport io.opentelemetry.api.common.Attributes;\nimport io.opentelemetry.api.metrics.Meter;\nimport io.opentelemetry.api.metrics.MeterProvider;\nimport java.util.concurrent.ThreadLocalRandom;\n\n/**\n * Example of using the PrometheusHttpServer to convert OTel metrics to Prometheus format and expose\n * these to a Prometheus instance via a HttpServer exporter.\n *\n * <p>A Gauge is used to periodically measure how many incoming messages are awaiting processing.\n * The Gauge callback gets executed every collection interval.\n */\npublic final class PrometheusExample {\n private long incomingMessageCount;\n\n public PrometheusExample(MeterProvider meterProvider) {\n Meter meter = meterProvider.get(\"PrometheusExample\");\n meter\n .gaugeBuilder(\"incoming.messages\")\n .setDescription(\"No of incoming messages awaiting processing\")\n .setUnit(\"message\")\n .buildWithCallback(result -> result.record(incomingMessageCount, Attributes.empty()));\n }\n\n void simulate() {\n for (int i = 500; i > 0; i--) {\n try {\n System.out.println(\n i + \" Iterations to go, current incomingMessageCount is: \" + incomingMessageCount);\n incomingMessageCount = ThreadLocalRandom.current().nextLong(100);\n Thread.sleep(1000);\n } catch (InterruptedException e) {\n // ignored here\n }\n }\n }\n\n public static void main(String[] args) {\n int prometheusPort = 8888;\n\n // It is important to initialize the OpenTelemetry SDK as early as possible in your process.\n MeterProvider meterProvider = ExampleConfiguration.initializeOpenTelemetry(prometheusPort);\n\n PrometheusExample prometheusExample = new PrometheusExample(meterProvider);\n\n prometheusExample.simulate();\n\n System.out.println(\"Exiting\");\n }\n}\n
After running the Java application, you can check if your metrics are working correctly by visiting http://localhost:8888/metrics.
For integrating traces for Java applications, please refer to the document Implementing Non-Intrusive Enhancements for Applications via Operator, which explains how to automatically integrate traces through annotations.
Monitoring the JVM of Java applications: how Java applications that have already exposed JVM metrics, and those that have not, can connect with Insight observability.
If your Java application has not yet started exposing JVM metrics, you can refer to the following documents:
Exposing JVM Monitoring Metrics Using JMX Exporter
Exposing JVM Monitoring Metrics Using OpenTelemetry Java Agent
If your Java application has already exposed JVM metrics, you can refer to the following document:
Connecting Existing JVM Metrics of Java Applications to Observability
Writing TraceId and SpanId into Java Application Logs to correlate trace data with log data.
"},{"location":"en/end-user/insight/quickstart/otel/java/mdc.html","title":"Writing TraceId and SpanId into Java Application Logs","text":"
This article explains how to automatically write TraceId and SpanId into Java application logs using OpenTelemetry. By including TraceId and SpanId in your logs, you can correlate distributed tracing data with log data, enabling more efficient fault diagnosis and performance analysis.
Spring Boot projects come with a built-in logging framework and use Logback as the default logging implementation. If your Java project is a Spring Boot project, you can write TraceId into logs with minimal configuration.
Set logging.pattern.level in application.properties, adding %mdc{trace_id} and %mdc{span_id} to the logs.
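A minimal sketch of such a pattern (adjust the surrounding format to your needs):
logging.pattern.level = trace_id=%mdc{trace_id} span_id=%mdc{span_id} %5p\n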
Modify the log4j2.xml configuration, adding %X{trace_id} and %X{span_id} in the pattern to automatically write TraceId and SpanId into the logs:
<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<configuration>\n <appender name=\"CONSOLE\" class=\"ch.qos.logback.core.ConsoleAppender\">\n <encoder>\n <pattern>%d{HH:mm:ss.SSS} trace_id=%X{trace_id} span_id=%X{span_id} trace_flags=%X{trace_flags} %msg%n</pattern>\n </encoder>\n </appender>\n\n <!-- Just wrap your logging appender, for example ConsoleAppender, with OpenTelemetryAppender -->\n <appender name=\"OTEL\" class=\"io.opentelemetry.instrumentation.logback.mdc.v1_0.OpenTelemetryAppender\">\n <appender-ref ref=\"CONSOLE\"/>\n </appender>\n\n <!-- Use the wrapped \"OTEL\" appender instead of the original \"CONSOLE\" one -->\n <root level=\"INFO\">\n <appender-ref ref=\"OTEL\"/>\n </root>\n\n</configuration>\n
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html","title":"Exposing JVM Monitoring Metrics Using JMX Exporter","text":"
JMX Exporter provides two usage methods:
Standalone Process: Specify parameters when starting the JVM to expose a JMX RMI interface. The JMX Exporter calls RMI to obtain the JVM runtime state data, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape.
In-Process (JVM process): Specify parameters when starting the JVM to run the JMX Exporter jar file as a javaagent. This method reads the JVM runtime state data in-process, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape.
Note
The official recommendation is not to use the first method due to its complex configuration and the requirement for a separate process, which introduces additional monitoring challenges. Therefore, this article focuses on the second method, detailing how to use JMX Exporter to expose JVM monitoring metrics in a Kubernetes environment.
In this method, you need to specify the JMX Exporter jar file and configuration file when starting the JVM. Since the jar file is a binary file that is not ideal for mounting via a configmap, and the configuration file typically does not require modifications, it is recommended to package both the JMX Exporter jar file and the configuration file directly into the business container image.
For the second method, you can choose to include the JMX Exporter jar file in the application image or mount it during deployment. Below are explanations for both approaches:
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html#method-1-building-jmx-exporter-jar-file-into-the-business-image","title":"Method 1: Building JMX Exporter JAR File into the Business Image","text":"
The content of prometheus-jmx-config.yaml is as follows:
The format of the startup parameter is -javaagent:<jmx-exporter-jar-path>=<port>:<config-file-path>, for example: -javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml
Here, port 8088 is used to expose JVM monitoring metrics; you may change it if it conflicts with the Java application.
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html#method-2-mounting-via-init-container","title":"Method 2: Mounting via Init Container","text":"
First, we need to create a Docker image for the JMX Exporter. The following Dockerfile is for reference:
FROM alpine/curl:3.14\nWORKDIR /app/\n# Copy the previously created config file into the image\nCOPY prometheus-jmx-config.yaml ./\n# Download the jmx prometheus javaagent jar online\nRUN set -ex; \\\n curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar;\n
Build the image using the above Dockerfile: docker build -t my-jmx-exporter .
Add the following init container to the Java application deployment YAML:
With the above modifications, the example application my-demo-app now has the capability to expose JVM metrics. After running the service, you can access the Prometheus formatted metrics at http://localhost:8088.
Next, you can refer to Connecting Existing JVM Metrics of Java Applications to Observability.
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/legacy-jvm.html","title":"Integrating Existing JVM Metrics of Java Applications with Observability","text":"
If your Java application exposes JVM monitoring metrics through other means (such as Spring Boot Actuator), you will need to ensure that the monitoring data is collected. You can achieve this by adding annotations (Kubernetes Annotations) to your workload to allow Insight to scrape the existing JVM metrics:
annotations: \n insight.opentelemetry.io/metric-scrape: \"true\" # Whether to scrape\n insight.opentelemetry.io/metric-path: \"/\" # Path to scrape metrics\n insight.opentelemetry.io/metric-port: \"9464\" # Port to scrape metrics\n
For example, to add annotations to the my-deployment-app:
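A hedged sketch of the relevant part of the deployment (the path and port match the Actuator example below; adjust them to your application):
apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: my-deployment-app\nspec:\n  template:\n    metadata:\n      annotations:\n        insight.opentelemetry.io/metric-scrape: \"true\"                 # Whether to scrape\n        insight.opentelemetry.io/metric-path: \"/actuator/prometheus\"   # Path to scrape metrics\n        insight.opentelemetry.io/metric-port: \"8080\"                   # Port to scrape metrics\n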
In the above example, Insight will scrape the Prometheus metrics exposed through Spring Boot Actuator via http://<service-ip>:8080/actuator/prometheus.
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.html","title":"Exposing JVM Metrics Using OpenTelemetry Java Agent","text":"
Starting from OpenTelemetry Agent v1.20.0 and later, the OpenTelemetry Agent has introduced the JMX Metric Insight module. If your application is already integrated with the OpenTelemetry Agent for tracing, you no longer need to introduce another agent to expose JMX metrics for your application. The OpenTelemetry Agent collects and exposes metrics by detecting the locally available MBeans in the application.
The OpenTelemetry Agent also provides built-in monitoring examples for common Java servers or frameworks. Please refer to the Predefined Metrics.
When using the OpenTelemetry Java Agent, you also need to consider how to mount the JAR into the container. In addition to the methods for mounting the JAR file as described with the JMX Exporter, you can leverage the capabilities provided by the OpenTelemetry Operator to automatically enable JVM metrics exposure for your application.
If your application is already integrated with the OpenTelemetry Agent for tracing, you do not need to introduce another agent to expose JMX metrics. The OpenTelemetry Agent can now locally collect and expose metrics interfaces by detecting the locally available MBeans in the application.
However, as of the current version, you still need to manually add the appropriate annotations to your application for the JVM data to be collected by Insight. For specific annotation content, please refer to Integrating Existing JVM Metrics of Java Applications with Observability.
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.html#exposing-metrics-for-java-middleware","title":"Exposing Metrics for Java Middleware","text":"
The OpenTelemetry Agent also includes built-in examples for monitoring middleware. Please refer to the Predefined Metrics.
By default, no specific types are designated; you need to specify them using the -Dotel.jmx.target.system JVM options, for example, -Dotel.jmx.target.system=jetty,kafka-broker.
Although OpenShift comes with its own monitoring system, we still install Insight Agent because of certain rules in the data collection conventions.
In addition to the basic installation configuration, the following parameters need to be added during helm install:
## Parameters related to fluentbit;\n--set fluent-bit.ocp.enabled=true \\\n--set fluent-bit.serviceAccount.create=false \\\n--set fluent-bit.securityContext.runAsUser=0 \\\n--set fluent-bit.securityContext.seLinuxOptions.type=spc_t \\\n--set fluent-bit.securityContext.readOnlyRootFilesystem=false \\\n--set fluent-bit.securityContext.allowPrivilegeEscalation=false \\\n\n## Enable Prometheus(CR) for OpenShift4.x\n--set compatibility.openshift.prometheus.enabled=true \\\n\n## Close the Prometheus instance of the higher version\n--set kube-prometheus-stack.prometheus.enabled=false \\\n--set kube-prometheus-stack.kubeApiServer.enabled=false \\\n--set kube-prometheus-stack.kubelet.enabled=false \\\n--set kube-prometheus-stack.kubeControllerManager.enabled=false \\\n--set kube-prometheus-stack.coreDns.enabled=false \\\n--set kube-prometheus-stack.kubeDns.enabled=false \\\n--set kube-prometheus-stack.kubeEtcd.enabled=false \\\n--set kube-prometheus-stack.kubeEtcd.enabled=false \\\n--set kube-prometheus-stack.kubeScheduler.enabled=false \\\n--set kube-prometheus-stack.kubeStateMetrics.enabled=false \\\n--set kube-prometheus-stack.nodeExporter.enabled=false \\\n\n## Limit the namespace processed by PrometheusOperator to avoid competition with OpenShift's own PrometheusOperator\n--set kube-prometheus-stack.prometheusOperator.kubeletService.namespace=\"insight-system\" \\\n--set kube-prometheus-stack.prometheusOperator.prometheusInstanceNamespaces=\"insight-system\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[0]=\"openshift-monitoring\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[1]=\"openshift-user-workload-monitoring\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[2]=\"openshift-customer-monitoring\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[3]=\"openshift-route-monitor-operator\" \\\n
"},{"location":"en/end-user/insight/quickstart/other/install-agent-on-ocp.html#write-system-monitoring-data-into-prometheus-through-openshifts-own-mechanism","title":"Write system monitoring data into Prometheus through OpenShift's own mechanism","text":"
"},{"location":"en/end-user/insight/quickstart/other/install-agentindce.html","title":"Install insight-agent in Suanova 4.0","text":"
In the AI platform, an existing Suanova 4.0 cluster can be connected as a subcluster. This guide describes potential issues and their solutions when installing insight-agent in a Suanova 4.0 cluster.
Since most Suanova 4.0 clusters already have dx-insight installed as the monitoring system, installing insight-agent will conflict with the existing Prometheus Operator in the cluster, preventing a smooth installation.
Enable the relevant Prometheus Operator parameters so that the Prometheus Operator in dx-insight is retained and coexists with the Prometheus Operator of insight-agent in 5.0.
Enable the --deny-namespaces parameter in the two prometheus operators respectively.
Run the following command (for reference only; replace the Prometheus Operator name and namespace with the actual values).
The dx-insight component is deployed in the dx-insight namespace, and insight-agent is deployed in the insight-system namespace. Add --deny-namespaces=insight-system to the Prometheus Operator in dx-insight, and add --deny-namespaces=dx-insight to the Prometheus Operator in insight-agent.
Only the denied namespaces are added; both Prometheus Operators can continue to scan other namespaces, and the related collection resources under kube-system or customer business namespaces are not affected.
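A hedged sketch (the deployment names are placeholders; add the argument to the operator container's args):
# Prometheus Operator of dx-insight: deny the insight-system namespace\nkubectl -n dx-insight edit deployment <dx-insight-prometheus-operator>\n#   args:\n#     - --deny-namespaces=insight-system\n\n# Prometheus Operator of insight-agent: deny the dx-insight namespace\nkubectl -n insight-system edit deployment <insight-agent-prometheus-operator>\n#   args:\n#     - --deny-namespaces=dx-insight\n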
Please pay attention to the problem of node exporter port conflict.
The open-source node-exporter enables hostNetwork by default and uses port 9100 by default. If the cluster's monitoring system has already installed node-exporter, installing insight-agent will cause a node-exporter port conflict and it will not run normally.
Note
Insight's node-exporter enables some features to collect special metrics, so installing it is recommended.
Currently, it does not support modifying the port in the installation command. After helm install insight-agent , you need to manually modify the related ports of the insight node-exporter daemonset and svc.
The Docker storage directory in Suanova 4.0 is /var/lib/containers, which differs from the path in the insight-agent configuration, so the logs are not collected.
"},{"location":"en/end-user/insight/quickstart/res-plan/modify-vms-disk.html","title":"vmstorage Disk Expansion","text":"
This article describes the method for expanding the vmstorage disk. Please refer to the vmstorage disk capacity planning for the specifications of the vmstorage disk.
Log in to the AI platform as a global service cluster administrator. Click Container Management -> Clusters and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu Container Storage -> PVCs and find the PVC bound to the vmstorage.
Click a vmstorage PVC to enter the details of the volume claim for vmstorage and confirm the StorageClass that the PVC is bound to.
Select the left navigation menu Container Storage -> Storage Class and find local-path . Click the \u2507 on the right side of the target and select Edit in the popup menu.
Enable Scale Up and click OK .
"},{"location":"en/end-user/insight/quickstart/res-plan/modify-vms-disk.html#modify-the-disk-capacity-of-vmstorage","title":"Modify the disk capacity of vmstorage","text":"
Log in to the AI platform as a global service cluster administrator and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu CRDs and find the custom resource for vmcluster .
Click the custom resource for vmcluster to enter the details page, switch to the insight-system namespace, and select Edit YAML from the right menu of insight-victoria-metrics-k8s-stack .
Modify according to the legend and click OK .
Select the left navigation menu Container Storage -> PVCs again and find the volume claim bound to vmstorage. Confirm that the modification has taken effect. In the details page of a PVC, click the associated storage source (PV).
Open the volume details page and click the Update button in the upper right corner.
After modifying the Capacity , click OK and wait for a moment until the expansion is successful.
"},{"location":"en/end-user/insight/quickstart/res-plan/modify-vms-disk.html#clone-the-storage-volume","title":"Clone the storage volume","text":"
If the storage volume expansion fails, you can refer to the following method to clone the storage volume.
Log in to the AI platform as a global service cluster administrator and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu Workloads -> StatefulSets and find the statefulset for vmstorage . Click the \u2507 on the right side of the target and select Status -> Stop -> OK in the popup menu.
After logging into the master node of the kpanda-global-cluster cluster in the command line, run the following command to copy the vm-data directory in the vmstorage container to store the metric information locally:
Log in to the AI platform and go to the details of the kpanda-global-cluster cluster. Select the left navigation menu Container Storage -> PVs , click Clone in the upper right corner, and modify the capacity of the volume.
Delete the previous data volume of vmstorage.
Wait for a moment until the volume claim is bound to the cloned data volume, then run the following command to import the data exported in step 3 into the corresponding container, and then start the previously paused vmstorage .
In actual use, the CPU, memory, and other resource usage of Prometheus may exceed the configured resources, depending on the number of containers in the cluster and whether Istio is enabled.
In order to ensure the normal operation of Prometheus in clusters of different sizes, it is necessary to adjust the resources of Prometheus according to the actual size of the cluster.
When the service mesh is not enabled, test statistics show that the relationship between the system job series and pods is: series count = 800 * pod count
When the service mesh is enabled, the Istio-related metrics generated by the pods amount to: series count = 768 * pod count
"},{"location":"en/end-user/insight/quickstart/res-plan/prometheus-res.html#when-the-service-mesh-is-not-enabled","title":"When the service mesh is not enabled","text":"
The following resource planning is recommended for Prometheus when the service mesh is not enabled:
Pod count in the table refers to the pod count that is running stably in the cluster. If a large number of pods are restarted, the series count will increase sharply in a short period of time, and resources need to be adjusted accordingly.
Prometheus stores two hours of data in memory by default, and when the Remote Write feature is enabled in the cluster, a certain amount of additional memory will be occupied; it is recommended to set the resource surge ratio to 2.
The data in the table are recommended values that apply to general situations. If the environment has precise resource requirements, it is recommended to check the resource usage of the corresponding Prometheus after the cluster has been running for a period of time and configure it precisely.
"},{"location":"en/end-user/insight/quickstart/res-plan/vms-res-plan.html","title":"vmstorage disk capacity planning","text":"
vmstorage is responsible for storing multicluster metrics for observability. In order to ensure the stability of vmstorage, it is necessary to adjust the disk capacity of vmstorage according to the number of clusters and the size of the cluster. For more information, please refer to vmstorage retention period and disk space.
After observing the vmstorage disks of clusters of different sizes for 14 days, we found that the disk usage of vmstorage is positively correlated with the amount of metrics it stores and the disk usage of individual data points.
Instantaneous amount of stored metrics: use increase(vm_rows{type != \"indexdb\"}[30s]) to obtain the amount of metrics added within 30s
Disk usage of a single data point: sum(vm_data_size_bytes{type!=\"indexdb\"}) / sum(vm_rows{type != \"indexdb\"})
Disk usage = instantaneous amount of metrics x 2 x disk usage per data point x 60 x 24 x retention period (days)
Parameter Description:
The unit of disk usage is Byte .
Storage duration (days) x 60 x 24 converts time (days) into minutes to calculate disk usage.
The default scrape interval of Prometheus in Insight Agent is 30s, so twice the instantaneous amount of metrics is generated within 1 minute.
The default storage duration in vmstorage is 1 month, please refer to Modify System Configuration to modify the configuration.
Warning
This formula is a general solution, and it is recommended to reserve redundant disk capacity on the calculation result to ensure the normal operation of vmstorage.
The data in the table is calculated based on the default storage time of one month (30 days), and the disk usage of a single data point (datapoint) is calculated as 0.9. In a multicluster scenario, the number of Pods represents the sum of the number of Pods in the multicluster.
"},{"location":"en/end-user/insight/quickstart/res-plan/vms-res-plan.html#when-the-service-mesh-is-not-enabled","title":"When the service mesh is not enabled","text":"Cluster size (number of Pods) Metrics Disk capacity 100 8W 6 GiB 200 16W 12 GiB 300 24w 18 GiB 400 32w 24 GiB 500 40w 30 GiB 800 64w 48 GiB 1000 80W 60 GiB 2000 160w 120 GiB 3000 240w 180 GiB"},{"location":"en/end-user/insight/quickstart/res-plan/vms-res-plan.html#when-the-service-mesh-is-enabled","title":"When the service mesh is enabled","text":"Cluster size (number of Pods) Metrics Disk capacity 100 15W 12 GiB 200 31w 24 GiB 300 46w 36 GiB 400 62w 48 GiB 500 78w 60 GiB 800 125w 94 GiB 1000 156w 120 GiB 2000 312w 235 GiB 3000 468w 350 GiB"},{"location":"en/end-user/insight/quickstart/res-plan/vms-res-plan.html#example","title":"Example","text":"
There are two clusters in the AI platform, of which 500 Pods are running in the global management cluster (service mesh enabled) and 1000 Pods are running in the worker cluster (service mesh not enabled), and the metrics are expected to be stored for 30 days.
The number of metrics in the global management cluster is 800x500 + 768x500 = 784000
Worker cluster metrics are 800x1000 = 800000
Then the current vmstorage disk usage should be set to (784000 + 800000) x 2 x 0.9 x 60 x 24 x 30 = 123171840000 bytes ≈ 115 GiB
Note
For the relationship between the number of metrics and the number of Pods in the cluster, please refer to Prometheus Resource Planning.
"},{"location":"en/end-user/insight/system-config/modify-config.html","title":"Modify system configuration","text":"
Observability persists metrics, logs, and trace data by default. Users can modify the system configuration by following this page.
"},{"location":"en/end-user/insight/system-config/modify-config.html#how-to-modify-the-metric-data-retention-period","title":"How to modify the metric data retention period","text":"
Refer to the following steps to modify the metric data retention period.
After saving the modification, the pod of the component responsible for storing the metrics will automatically restart, just wait for a while.
"},{"location":"en/end-user/insight/system-config/modify-config.html#how-to-modify-the-log-data-storage-duration","title":"How to modify the log data storage duration","text":"
Refer to the following steps to modify the log data retention period:
"},{"location":"en/end-user/insight/system-config/modify-config.html#method-1-modify-the-json-file","title":"Method 1: Modify the Json file","text":"
Modify the max_age parameter in the rollover field in the following request and set the retention period. The default storage period is 7d . Change http://localhost:9200 to the access address of Elasticsearch.
After modification, run the above command. If it prints the content shown below, the modification was successful.
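A hedged sketch of the request (the policy name matches the one shown in Method 2 below; other phases of your existing policy should be kept as they are):
curl -X PUT \"http://localhost:9200/_ilm/policy/insight-es-k8s-logs-policy\" -H 'Content-Type: application/json' -d '\n{\n  \"policy\": {\n    \"phases\": {\n      \"hot\": {\n        \"actions\": {\n          \"rollover\": {\n            \"max_age\": \"7d\"\n          }\n        }\n      }\n    }\n  }\n}'\n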
{\n\"acknowledged\": true\n}\n
"},{"location":"en/end-user/insight/system-config/modify-config.html#method-2-modify-from-the-ui","title":"Method 2: Modify from the UI","text":"
Log in to Kibana and select Stack Management in the left navigation bar.
Select the left navigation Index Lifecycle Policies , find the index insight-es-k8s-logs-policy , and click to enter the details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period. The default storage period is 7d .
After modification, click Save policy at the bottom of the page to complete the modification.
"},{"location":"en/end-user/insight/system-config/modify-config.html#how-to-modify-the-trace-data-storage-duration","title":"How to modify the trace data storage duration","text":"
Refer to the following steps to modify the trace data retention period:
"},{"location":"en/end-user/insight/system-config/modify-config.html#method-1-modify-the-json-file_1","title":"Method 1: Modify the Json file","text":"
Modify the max_age parameter in the rollover field in the following request and set the retention period. The default storage period is 7d . At the same time, change http://localhost:9200 to the access address of Elasticsearch.
On the system component page, you can quickly view the running status of the system components in Insight. When a system component fails, some features in Insight will be unavailable.
Go to Insight product module,
In the left navigation bar, select System Management -> System Components .
"},{"location":"en/end-user/insight/system-config/system-component.html#component-description","title":"Component description","text":"Module Component Name Description Metrics vminsert-insight-victoria-metrics-k8s-stack Responsible for writing the metric data collected by Prometheus in each cluster to the storage component. If this component is abnormal, the metric data of the worker cluster cannot be written. Metrics vmalert-insight-victoria-metrics-k8s-stack Responsible for taking effect of the recording and alert rules configured in the VM Rule, and sending the triggered alert rules to alertmanager. Metrics vmalertmanager-insight-victoria-metrics-k8s-stack is responsible for sending messages when alerts are triggered. If this component is abnormal, the alert information cannot be sent. Metrics vmselect-insight-victoria-metrics-k8s-stack Responsible for querying metrics data. If this component is abnormal, the metric cannot be queried. Metrics vmstorage-insight-victoria-metrics-k8s-stack Responsible for storing multicluster metrics data. Dashboard grafana-deployment Provide monitoring panel capability. The exception of this component will make it impossible to view the built-in dashboard. Link insight-jaeger-collector Responsible for receiving trace data in opentelemetry-collector and storing it. Link insight-jaeger-query Responsible for querying the trace data collected in each cluster. Link insight-opentelemetry-collector Responsible for receiving trace data forwarded by each sub-cluster Log elasticsearch Responsible for storing the log data of each cluster."},{"location":"en/end-user/insight/system-config/system-config.html","title":"System Configuration","text":"
System Configuration displays the default storage time of metrics, logs, traces and the default Apdex threshold.
Click the right navigation bar and select System Configuration .
Currently, only the retention period of historical alerts can be modified. Click Edit and enter the target duration.
When the retention period is set to "0", historical alerts will not be cleared.
Note
To modify other configurations, see How to modify the system configuration?
In Insight , a service refers to a group of workloads that provide the same behavior for incoming requests. Service insight helps observe the performance and status of applications during the operation process by using the OpenTelemetry SDK.
For how to use OpenTelemetry, please refer to: Using OTel to give your application insight.
Service: A service represents a group of workloads that provide the same behavior for incoming requests. You can define the service name when using the OpenTelemetry SDK or use the name defined in Istio.
Operation: An operation refers to a specific request or action handled by a service. Each span has an operation name.
Outbound Traffic: Outbound traffic refers to all the traffic generated by the current service when making requests.
Inbound Traffic: Inbound traffic refers to all the traffic initiated by the upstream service targeting the current service.
The Services List page displays key metrics such as throughput rate, error rate, and request latency for all services that have been instrumented with distributed tracing. You can filter services based on clusters or namespaces and sort the list by throughput rate, error rate, or request latency. By default, the data displayed in the list is for the last hour, but you can customize the time range.
Follow these steps to view service insight metrics:
Go to the Insight product module.
Select Trace Tracking -> Services from the left navigation bar.
Attention
If the namespace of a service in the list is unknown , it means that the service has not been properly instrumented. We recommend reconfiguring the instrumentation.
If multiple services have the same name and none of them have the correct Namespace environment variable configured, the metrics displayed in the list and service details page will be aggregated for all those services.
Click a service name (taking insight-system as an example) to view the detailed metrics and operation metrics for that service.
In the Service Topology section, you can view the service topology one layer above or below the current service. When you hover over a node, you can see its information.
In the Traffic Metrics section, you can view the monitoring metrics for all requests to the service within the past hour (including inbound and outbound traffic).
You can use the time selector in the upper right corner to quickly select a time range or specify a custom time range.
Sorting is available for throughput, error rate, and request latency in the operation metrics.
Clicking on the icon next to an individual operation will take you to the Traces page to quickly search for related traces.
"},{"location":"en/end-user/insight/trace/service.html#service-metric-explanations","title":"Service Metric Explanations","text":"Metric Description Throughput Rate The number of requests processed within a unit of time. Error Rate The ratio of erroneous requests to the total number of requests within the specified time range. P50 Request Latency The response time within which 50% of requests complete. P95 Request Latency The response time within which 95% of requests complete. P99 Request Latency The response time within which 99% of requests complete."},{"location":"en/end-user/insight/trace/topology-helper.html","title":"Service Topology Element Explanations","text":"
The service topology provided by Observability allows you to quickly identify the request relationships between services and determine the health status of services based on different colors. The health status is determined based on the request latency and error rate of the service's overall traffic. This article explains the elements in the service topology.
"},{"location":"en/end-user/insight/trace/topology-helper.html#node-status-explanation","title":"Node Status Explanation","text":"
The node health status is determined based on the error rate and request latency of the service's overall traffic, following these rules:
| Color | Status | Rules |
| --- | --- | --- |
| Gray | Healthy | Error rate equals 0% and request latency is less than 100ms |
| Orange | Warning | Error rate (0, 5%) or request latency (100ms, 200ms) |
| Red | Abnormal | Error rate (5%, 100%) or request latency (200ms, +Infinity) |

Connection Status Explanation

| Color | Status | Rules |
| --- | --- | --- |
| Green | Healthy | Error rate equals 0% and request latency is less than 100ms |
| Orange | Warning | Error rate (0, 5%) or request latency (100ms, 200ms) |
| Red | Abnormal | Error rate (5%, 100%) or request latency (200ms, +Infinity) |

Service Map
Service map is a visual representation of the connections, communication, and dependencies between services. It provides insights into the service-to-service interactions, allowing you to view the calls and performance of services within a specified time range. The connections between nodes in the topology map represent the existence of service-to-service calls during the queried time period.
Select Tracing -> Service Map from the left navigation bar.
In the Service Map, you can perform the following actions:
Click a node to slide out the details of the service on the right side. Here, you can view metrics such as request latency, throughput, and error rate for the service. Clicking on the service name takes you to the service details page.
Hover over the connections to view the traffic metrics between the two services.
Click Display Settings to configure the display elements in the service map.
In the Service Map, there can be nodes that are not part of the cluster. These external nodes can be categorized into three types:
Database
Message Queue
Virtual Node
If a service makes a request to a Database or Message Queue, these two types of nodes will be displayed by default in the topology map. However, Virtual Nodes represent nodes outside the cluster or services not integrated into the trace, and they will not be displayed by default in the map.
When a service makes a request to MySQL, PostgreSQL, or Oracle Database, the detailed database type can be seen in the map.
TraceID: Used to identify a complete request call trace.
Operation: Describes the specific operation or event represented by a Span.
Entry Span: The entry Span represents the first request of the entire call.
Latency: The duration from receiving the request to completing the response for the entire call trace.
Span: The number of Spans included in the entire trace.
Start Time: The time when the current trace starts.
Tag: A collection of key-value pairs that constitute Span tags. Tags are used to annotate and supplement Spans, and each Span can have multiple key-value tag pairs.
Click the icon on the right side of the trace data to search for associated logs.
By default, it queries the log data within the duration of the trace and one minute after its completion.
The queried logs include those with the trace's TraceID in their log text and container logs related to the trace invocation process.
Click View More to jump to the Associated Log page with conditions.
By default, all logs are searched, but you can filter by the TraceID or the relevant container logs from the trace call process using the dropdown.
Note
Since a trace may span clusters or namespaces, users without sufficient permissions will be unable to query the associated logs for that trace.
"},{"location":"en/end-user/k8s/add-node.html#steps-to-add-nodes","title":"Steps to Add Nodes","text":"
Log into the AI platform as an administrator.
Navigate to Container Management -> Clusters, and click the name of the target cluster.
On the cluster overview page, click Nodes, then click the Add Node button on the right.
Follow the wizard to fill in the parameters and click OK.
In the pop-up window, click OK.
Return to the node list; the status of the newly added node will be Connecting. After a few minutes, when the status changes to Running, it indicates that the connection was successful.
Tip
For newly connected nodes, it may take an additional 2-3 minutes to recognize the GPU.
"},{"location":"en/end-user/k8s/create-k8s.html","title":"Creating a Kubernetes Cluster in the Cloud","text":"
Deploying a Kubernetes cluster is aimed at supporting efficient AI computing resource scheduling and management, achieving elastic scaling, providing high availability, and optimizing the model training and inference processes.
Two segments of IP addresses are allocated (Pod CIDR 18 bits, SVC CIDR 18 bits, must not conflict with existing networks)
"},{"location":"en/end-user/k8s/create-k8s.html#steps-to-create","title":"Steps to Create","text":"
Log into the AI platform as an administrator.
Create and launch 3 cloud hosts without GPU to serve as Master nodes for the cluster.
Configure resources: 16 CPU cores, 32 GB RAM, 200 GB system disk (ReadWriteOnce)
Select Bridge network mode
Set the root password or add an SSH public key for SSH connection
Record the IPs of the 3 hosts
Navigate to Container Management -> Clusters, and click the Create Cluster button on the right.
Follow the wizard to configure various parameters of the cluster.
Wait for the cluster creation to complete.
In the cluster list, find the newly created cluster, click the cluster name, navigate to Helm Apps -> Helm Charts, and search for metax-gpu-extensions in the search box, then click the card.
Click the Install button on the right to start installing the GPU plugin.
Automatically return to the Helm App list and wait for the status of metax-gpu-extensions to change to Deployed.
At this point, the cluster has been successfully created. You can check the nodes included in the cluster. You can now create AI workloads and use GPUs.
The cost of GPU resources is relatively high. If GPUs are not needed temporarily, you can remove the worker nodes with GPUs. The following steps also apply to removing regular worker nodes.
"},{"location":"en/end-user/k8s/remove-node.html#steps-to-remove","title":"Steps to Remove","text":"
Log into the AI platform as an administrator.
Navigate to Container Management -> Clusters, and click the name of the target cluster.
Enter the cluster overview page, click Nodes, find the node to be removed, click the ┇ on the right side of the list, and select Remove Node from the pop-up menu.
In the pop-up window, enter the node name, and after confirming it is correct, click Delete.
You will automatically return to the node list, where the status will be Removing. After a few minutes, refresh the page, and if the node is no longer present, it indicates that the node has been successfully removed.
After removing the node from the UI list, SSH into the removed node's host and execute the shutdown command.
Tip
After removing the node in the UI and shutting it down, the data on the node is not immediately deleted; the node's data will be retained for a period of time.
"},{"location":"en/end-user/kpanda/backup/index.html","title":"Backup and Restore","text":"
Backup and restore are essential aspects of system management. In practice, it is important to first back up the data of the system at a specific point in time and securely store the backup. In case of incidents such as data corruption, loss, or accidental deletion, the system can be quickly restored based on the previous backup data, reducing downtime and minimizing losses.
In real production environments, services may be deployed across different clouds, regions, or availability zones. If one infrastructure faces a failure, organizations need to quickly restore applications in other available environments. In such cases, cross-cloud or cross-cluster backup and restore become crucial.
Large-scale systems often involve multiple roles and users with complex permission management systems. With many operators involved, accidents caused by human error can lead to system failures. In such scenarios, the ability to roll back the system quickly using previously backed-up data is necessary. Relying solely on manual troubleshooting, fault repair, and system recovery can be time-consuming, resulting in prolonged system unavailability and increased losses for organizations.
Additionally, factors like network attacks, natural disasters, and equipment malfunctions can also cause data accidents.
Therefore, backup and restore are vital as the last line of defense for maintaining system stability and ensuring data security.
Backups are typically classified into three types: full backups, incremental backups, and differential backups. Currently, AI platform supports full backups and incremental backups.
The backup and restore provided by AI platform can be divided into two categories: Application Backup and ETCD Backup. It supports both manual backups and scheduled automatic backups using CronJobs.
Application Backup
Application backup refers to backing up data of a specific workload in the cluster and then restoring that data either within the same cluster or in another cluster. It supports backing up all resources under a namespace or filtering resources by specific labels.
Application backup also supports cross-cluster backup of stateful applications. For detailed steps, refer to the Backup and Restore MySQL Applications and Data Across Clusters guide.
etcd Backup
etcd is the data storage component of Kubernetes. Kubernetes stores its own component's data and application data in etcd. Therefore, backing up etcd is equivalent to backing up the entire cluster's data, allowing quick restoration of the cluster to a previous state in case of failures.
It's worth noting that currently, restoring etcd backup data is only supported within the same cluster (the original cluster). To learn more about related best practices, refer to the ETCD Backup and Restore guide.
This article explains how to backup applications in AI platform. The demo application used in this tutorial is called dao-2048 , which is a deployment.
Before backing up a deployment, the following prerequisites must be met:
Integrate a Kubernetes cluster or create a Kubernetes cluster in the Container Management module, and be able to access the UI interface of the cluster.
Create a Namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
Install the velero component, and ensure the velero component is running properly.
Create a deployment (the workload in this tutorial is named dao-2048 ), and label the deployment with app: dao-2048 .
Follow the steps below to backup the deployment dao-2048 .
Enter the Container Management module, click Backup Recovery -> Application Backup on the left navigation bar, and enter the Application Backup list page.
On the Application Backup list page, select the cluster where velero and the dao-2048 application have been installed. Click Backup Plan in the upper right corner to create a new backup plan.
Refer to the instructions below to fill in the backup configuration.
Name: The name of the new backup plan.
Source Cluster: The cluster where the application backup plan is to be executed.
Object Storage Location: The access path of the object storage configured when installing velero on the source cluster.
Namespace: The namespaces that need to be backed up, multiple selections are supported.
Advanced Configuration: Use resource labels to back up only specific resources in the namespace (such as a particular application), or to exclude specific resources from the backup.
Refer to the instructions below to set the backup execution frequency, and then click Next .
Backup Frequency: Set the time period for task execution based on minutes, hours, days, weeks, and months. Custom Cron expressions with numbers and * are supported; after you input an expression, its meaning will be displayed. For detailed expression syntax rules, refer to Cron Schedule Syntax.
Retention Time (days): Set the storage time of backup resources, the default is 30 days, and will be deleted after expiration.
Backup Data Volume (PV): Whether to back up the data in the data volume (PV), support direct copy and use CSI snapshot.
Direct Replication: directly copy the data in the data volume (PV) for backup;
Use CSI snapshots: Use CSI snapshots to back up data volumes (PVs). Requires a CSI snapshot type available for backup in the cluster.
Click OK , the page will automatically return to the application backup plan list, find the newly created dao-2048 backup plan, and perform the Immediate Execution operation.
At this point, the Last Execution State of the backup plan will change to In Progress. After the backup is complete, you can click the name of the backup plan to view its details.
etcd backup takes the cluster data as its core. In cases such as hardware damage or misconfiguration in development and test environments, the cluster data can be restored from an etcd backup.
This section introduces how to perform etcd backups for clusters. Also see etcd Backup and Restore Best Practices.
Enter Container Management -> Backup Recovery -> etcd Backup page, you can see all the current backup policies. Click Create Backup Policy on the right.
Fill in the Basic Information. Then, click Next to automatically verify the connectivity of etcd. If the verification passes, proceed to the next step.
First, select the backup cluster and log in to its terminal.
Enter the etcd access address in the format https://${NodeIP}:${Port}.
In a standard Kubernetes cluster, the default port for etcd is 2379.
In a Suanova 4.0 cluster, the default port for etcd is 12379.
In a public cloud managed cluster, you need to contact the relevant developers to obtain the etcd port number. This is because the control plane components of public cloud clusters are maintained and managed by the cloud service provider. Users cannot directly access or view these components, nor can they obtain control plane port information through regular commands (such as kubectl).
Ways to obtain port number
Find the etcd Pod in the kube-system namespace
```bash
kubectl get po -n kube-system | grep etcd
```
Get the port number from the listen-client-urls of the etcd Pod
```bash
# Replace ${etcd_pod_name} with the actual Pod name
kubectl get po -n kube-system ${etcd_pod_name} -oyaml | grep listen-client-urls
```
The expected output is as follows, where the number after the node IP is the port number:
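As a hedged illustration (the node IP 10.6.101.5 below is hypothetical), the matched line might look like the following; in this case the port number to use is 2379:

```
- --listen-client-urls=https://127.0.0.1:2379,https://10.6.101.5:2379
```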
Fill in the CA certificate. You can use the following command to view the certificate content, then copy and paste it into the proper field:
Standard Kubernetes Cluster:

```bash
cat /etc/kubernetes/ssl/etcd/ca.crt
```

Suanova 4.0 Cluster:

```bash
cat /etc/daocloud/dce/certs/ca.crt
```
Fill in the Cert certificate, you can use the following command to view the content of the certificate. Then, copy and paste it to the proper location:
Click How to get below the input box to see how to obtain the proper information on the UI page.
Refer to the following information to fill in the Backup Policy.
Backup Method: Choose either manual backup or scheduled backup
Manual Backup: Immediately perform a full backup of etcd data based on the backup configuration.
Scheduled Backup: Periodically perform full backups of etcd data according to the set backup frequency.
Backup Chain Length: the maximum number of backup data to retain. The default is 30.
Backup Frequency: it can be per hour, per day, per week or per month, and can also be customized.
Refer to the following information to fill in the Storage Path.
Storage Provider: Default is S3 storage
Object Storage Access Address: The access address of MinIO
Bucket: Create a Bucket in MinIO and fill in the Bucket name
Username: The login username for MinIO
Password: The login password for MinIO
After clicking OK , the page will automatically redirect to the backup policy list, where you can view all the currently created ones.
Click the ┇ action button on the right side of the policy to view logs, view YAML, update the policy, stop the policy, or execute the policy immediately.
When the backup method is manual, you can click Execute Now to perform the backup.
When the backup method is scheduled, the backup will be performed according to the configured time.
Click Logs to view the log content. By default, 100 lines are displayed. If you want to see more log information or download the logs, you can follow the prompts above the logs to go to the observability module.
Go to Container Management -> Backup Recovery -> etcd Backup, and click the Recovery Point tab.
After selecting the target cluster, you can view all the backup information under that cluster.
Each time a backup is executed, a proper recovery point is generated, which can be used to quickly restore the application from a successful recovery point.
"},{"location":"en/end-user/kpanda/backup/install-velero.html","title":"Install the Velero Plugin","text":"
velero is an open source tool for backing up and restoring Kubernetes cluster resources. It can back up resources in a Kubernetes cluster to cloud storage services, local storage, or other locations, and restore those resources to the same or a different cluster when needed.
This section introduces how to deploy the Velero plugin in AI platform using the Helm Apps.
Please perform the following steps to install the velero plugin for your cluster.
On the cluster list page, find the target cluster where the velero plugin needs to be installed, click the name of the cluster, click Helm Apps -> Helm chart in the left navigation bar, and enter velero in the search bar to search for it.
Read the introduction of the velero plugin, select the version, and click the Install button. This page takes version 5.2.0 as an example; it is recommended to install version 5.2.0 or later.
Configure basic info .
Name: Enter the plugin name, please note that the name can be up to 63 characters, can only contain lowercase letters, numbers and separators (\"-\"), and must start and end with lowercase letters or numbers, such as metrics-server-01.
Namespace: Select the namespace for plugin installation, it must be velero namespace.
Version: The version of the plugin, here we take 5.2.0 version as an example.
Wait: When enabled, it will wait for all associated resources under the application to be ready before marking the application installation as successful.
Failed Delete: When enabled, Ready Wait is enabled by default as well. If the installation fails, the installation-related resources will be removed.
Detailed Logs: Turn on the verbose output of the installation process log.
Note

After enabling Ready Wait and/or Failed Delete, it may take a long time for the app to be marked as Running.
Configure Velero chart Parameter Settings according to the following instructions
S3 Credentials: Configure the authentication information of object storage (minio).
Use secret: Keep the default configuration true.
Secret name: Keep the default configuration velero-s3-credential.
SecretContents.aws_access_key_id: Configure the username (access key ID) for accessing object storage; replace it with the actual value.
SecretContents.aws_secret_access_key: Configure the password (secret access key) for accessing object storage; replace it with the actual value.
An example of the Use existing secret parameter is as follows:
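As a hedged sketch only, such an existing secret could look roughly like the following. The key name cloud and the INI-style layout follow Velero's documented credential format, and the secret name matches the default velero-s3-credential mentioned above; the access key values are placeholders to be replaced with your MinIO credentials:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: velero-s3-credential   # default secret name referenced above
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id = <your-minio-access-key>        # placeholder
    aws_secret_access_key = <your-minio-secret-key>    # placeholder
```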
BackupStorageLocation: The location where Velero backs up data.
S3 bucket: The name of the storage bucket used to save backup data (must be a real storage bucket that already exists in minio).
Is default BackupStorage: Keep the default configuration true.
S3 access mode: The access mode of Velero to data. One of the following can be selected:
ReadWrite: Allow Velero to read and write backup data;
ReadOnly: Allow Velero to read backup data, but cannot modify backup data;
WriteOnly: Only allow Velero to write backup data, and cannot read backup data.
S3 Configs: Detailed configuration of S3 storage (minio).
S3 region: The geographical region of cloud storage. The default is to use the us-east-1 parameter, which is provided by the system administrator.
S3 force path style: Keep the default configuration true.
S3 server URL: The console access address of object storage (minio). Minio generally provides two services, UI access and console access. Please use the console access address here.
Click the OK button to complete the installation of the Velero plugin. The system will automatically jump to the Helm Apps list page. After waiting for a few minutes, refresh the page, and you can see the application just installed.
Cluster settings are used to customize advanced feature settings for your cluster, including whether to enable GPU, helm repo refresh cycle, Helm operation record retention, etc.
Enable GPU: GPUs and proper driver plug-ins need to be installed on the cluster in advance.
Click the name of the target cluster, and click Operations and Maintenance -> Cluster Settings -> Addons in the left navigation bar.
Advanced settings include the Helm operation base image, the repository refresh cycle, the number of operation records retained, and whether to enable cluster deletion protection (after enabling, the cluster cannot be deleted directly).
On this page, you can view the recent cluster operation records and Helm operation records, as well as the YAML files and logs of each operation, and you can also delete a certain record.
Set the number of reserved entries for Helm operations:
By default, the system keeps the last 100 Helm operation records. If you keep too many entries, it may cause data redundancy, and if you keep too few entries, you may lose the key operation records you need. A reasonable reserved quantity needs to be set according to the actual situation. Specific steps are as follows:
Click the name of the target cluster, and click Recent Operations -> Helm Operations -> Set Number of Retained Items in the left navigation bar.
Set how many Helm operation records need to be kept, and click OK .
Clusters integrated into or created by the AI platform Container Management module can be accessed not only through the UI but also in two other ways:
Access online via CloudShell
Access via kubectl after downloading the cluster certificate
Note
When accessing the cluster, the user should have Cluster Admin permission or higher.
"},{"location":"en/end-user/kpanda/clusters/access-cluster.html#access-via-cloudshell","title":"Access via CloudShell","text":"
Enter Clusters page, select the cluster you want to access via CloudShell, click the ... icon on the right, and then click Console from the dropdown list.
Run kubectl get node command in the Console to verify the connectivity between CloudShell and the cluster. If the console returns node information of the cluster, you can access and manage the cluster through CloudShell.
"},{"location":"en/end-user/kpanda/clusters/access-cluster.html#access-via-kubectl","title":"Access via kubectl","text":"
If you want to access and manage remote clusters from a local node, make sure you have met these prerequisites:
Your local node and the cloud cluster are in a connected network.
The cluster certificate has been downloaded to the local node.
The kubectl tool has been installed on the local node. For detailed installation guides, see Installing tools.
If everything is in place, follow these steps to access a cloud cluster from your local environment.
Enter Clusters page, find your target cluster, click ... on the right, and select Download kubeconfig in the drop-down list.
Set the Kubeconfig period and click Download .
Open the downloaded certificate and copy its content to the config file of the local node.
By default, the kubectl tool will look for a file named config in the $HOME/.kube directory on the local node. This file stores access credentials of clusters. Kubectl can access the cluster with that configuration file.
Run the following command on the local node to verify its connectivity with the cluster:
```bash
kubectl get pod -n default
```
An expected output is as follows:
```
NAME                            READY   STATUS    RESTARTS   AGE
dao-2048-2048-58c7f7fc5-mq7h4   1/1     Running   0          30h
```
Now you can access and manage the cluster locally with kubectl.
This is a cluster created using Container Management and is mainly used to carry business workloads. This cluster is managed by the management cluster.
| Supported Features | Description |
| --- | --- |
| K8s Version | Supports K8s 1.22 and above |
| Operating System | RedHat 7.6 x86/ARM, RedHat 7.9 x86, RedHat 8.4 x86/ARM, RedHat 8.6 x86; Ubuntu 18.04 x86, Ubuntu 20.04 x86; CentOS 7.6 x86/AMD, CentOS 7.9 x86/AMD |
| Full Lifecycle Management | Supported |
| K8s Resource Management | Supported |
| Cloud Native Storage | Supported |
| Cloud Native Network | Calico, Cilium, Multus, and other CNIs |
| Policy Management | Supports network policies, quota policies, resource limits, disaster recovery policies, security policies |

Integrated Cluster
This cluster is used to integrate existing standard K8s clusters, including but not limited to self-built clusters in local data centers, clusters provided by public cloud vendors, clusters provided by private cloud vendors, edge clusters, Xinchuang clusters, heterogeneous clusters, and different Suanova clusters. It is mainly used to carry business workloads.
| Supported Features | Description |
| --- | --- |
| K8s Version | 1.18+ |
| Supported Vendors | VMware Tanzu, Amazon EKS, Redhat Openshift, SUSE Rancher, Alibaba ACK, Huawei CCE, Tencent TKE, Standard K8s Cluster, Suanova |
| Full Lifecycle Management | Not Supported |
| K8s Resource Management | Supported |
| Cloud Native Storage | Supported |
| Cloud Native Network | Depends on the network mode of the integrated cluster's kernel |
| Policy Management | Supports network policies, quota policies, resource limits, disaster recovery policies, security policies |
Note
A cluster can have multiple cluster roles. For example, a cluster can be both a global service cluster and a management cluster or a worker cluster.
"},{"location":"en/end-user/kpanda/clusters/cluster-scheduler-plugin.html","title":"Deploy Second Scheduler scheduler-plugins in a Cluster","text":"
This page describes how to deploy a second scheduler-plugins in a cluster.
"},{"location":"en/end-user/kpanda/clusters/cluster-scheduler-plugin.html#why-do-we-need-scheduler-plugins","title":"Why do we need scheduler-plugins?","text":"
The cluster created through the platform will install the native K8s scheduler-plugin, but the native scheduler-plugin has many limitations:
The native scheduler-plugin cannot meet scheduling requirements, so you can use either CoScheduling, CapacityScheduling or other types of scheduler-plugins.
In special scenarios, a new scheduler-plugin is needed to complete scheduling tasks without affecting the process of the native scheduler-plugin.
Distinguish scheduler-plugins with different functionalities and achieve different scheduling scenarios by switching scheduler-plugin names.
This page takes the scenario of using the vgpu scheduler-plugin while combining the coscheduling plugin capability of scheduler-plugins as an example to introduce how to install and use scheduler-plugins.
kubean is a new feature introduced in v0.13.0, please ensure that your version is v0.13.0 or higher.
The installation version of scheduler-plugins is v0.27.8, please ensure that the cluster version is compatible with it. Refer to the document Compatibility Matrix.
scheduler_plugins_enabled: Set to true to enable the scheduler-plugins capability.
You can enable or disable certain plugins by setting the scheduler_plugins_enabled_plugins or scheduler_plugins_disabled_plugins options. See K8s Official Plugin Names for reference.
If you need to set parameters for custom plugins, please configure scheduler_plugins_plugin_config, for example: set the permitWaitingTimeoutSeconds parameter for coscheduling. See K8s Official Plugin Configuration for reference.
After successful cluster creation, the system will automatically install the scheduler-plugins and controller component loads. You can check the workload status in the proper cluster's deployment.
Here is an example of how to use scheduler-plugins by demonstrating a scenario where the vgpu scheduler is used in combination with the coscheduling plugin capability of scheduler-plugins.
Install vgpu in the Helm Charts and set the values.yaml parameters.
schedulerName: scheduler-plugins-scheduler: This is the scheduler name for scheduler-plugins installed by kubean, and currently cannot be modified.
scheduler.kubeScheduler.enabled: false: Do not install kube-scheduler and use vgpu-scheduler as a separate extender.
Extend vgpu-scheduler on scheduler-plugins.
```bash
kubectl get cm -n scheduler-plugins scheduler-config -ojsonpath="{.data.scheduler-config\.yaml}"
```
After installing vgpu-scheduler, the system will automatically create a service (svc), and the urlPrefix specifies the URL of the svc.
Note
The svc refers to the pod service load. You can use the following command in the namespace where the nvidia-vgpu plugin is installed to get the external access information for port 443.
```bash
kubectl get svc -n ${namespace}
```
The urlPrefix format is https://${ip address}:${port}
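For reference, a hedged sketch of the kind of extenders section this step adds to the scheduler-plugins configuration. The urlPrefix address, verbs, and managed resource name below are illustrative assumptions and must be replaced with the actual values from your vgpu installation:

```yaml
# Illustrative sketch only: an extender entry pointing at the vgpu-scheduler service
extenders:
  - urlPrefix: "https://10.233.3.22:443"   # assumed svc address; see `kubectl get svc` above
    filterVerb: filter                     # assumed verbs; follow the vgpu chart documentation
    bindVerb: bind
    nodeCacheCapable: true
    weight: 1
    managedResources:
      - name: nvidia.com/gpu               # assumed resource name for vgpu requests
        ignoredByScheduler: true
```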
Restart the scheduler pod of scheduler-plugins to load the new configuration file.
Note
When creating a vgpu application, you do not need to specify the name of a scheduler-plugin. The vgpu-scheduler webhook will automatically change the scheduler's name to \"scheduler-plugins-scheduler\" without manual specification.
AI platform Container Management module can manage two types of clusters: integrated clusters and created clusters.
Integrated clusters: clusters created in other platforms and now integrated into AI platform.
Created clusters: clusters created in AI platform.
For more information about cluster types, see Cluster Role.
We designed several status for these two clusters.
"},{"location":"en/end-user/kpanda/clusters/cluster-status.html#integrated-clusters","title":"Integrated Clusters","text":"Status Description Integrating The cluster is being integrated into AI platform. Removing The cluster is being removed from AI platform. Running The cluster is running as expected. Unknown The cluster is lost. Data displayed in the AI platform UI is the cached data before the disconnection, which does not represent real-time data. Any operation during this status will not take effect. You should check cluster network connectivity or host status."},{"location":"en/end-user/kpanda/clusters/cluster-status.html#created-clusters","title":"Created Clusters","text":"Status Description Creating The cluster is being created. Updating The Kubernetes version of the cluster is being operating. Deleting The cluster is being deleted. Running The cluster is running as expected. Unknown The cluster is lost. Data displayed in the AI platform UI is the cached data before the disconnection, which does not represent real-time data. Any operation during this status will not take effect. You should check cluster network connectivity or host status. Failed The cluster creation is failed. You should check the logs for detailed reasons."},{"location":"en/end-user/kpanda/clusters/cluster-version.html","title":"Supported Kubernetes Versions","text":"
In AI platform, the integrated clusters and created clusters have different version support mechanisms.
This page focuses on the version support mechanism for created clusters.
The Kubernetes community maintains three minor versions at a time, such as 1.26, 1.27, and 1.28. When the community releases a new version, the supported range shifts up by one; for example, once the latest released version is 1.29, the community-supported range becomes 1.27, 1.28, and 1.29.
To ensure the security and stability of the clusters, when creating clusters in AI platform, the supported version range will always be one version lower than the community's version.
For instance, if the Kubernetes community supports v1.25, v1.26, and v1.27, then the version range for creating worker clusters in AI platform will be v1.24, v1.25, and v1.26. Additionally, a stable version, such as 1.24.7, will be recommended to users.
Furthermore, the version range for creating worker clusters in AI platform will remain highly synchronized with the community. When the community version increases incrementally, the version range for creating worker clusters in AI platform will also increase by one version.
"},{"location":"en/end-user/kpanda/clusters/cluster-version.html#supported-kubernetes-versions_1","title":"Supported Kubernetes Versions","text":"Kubernetes Community Versions Created Worker Cluster Versions Recommended Versions for Created Worker Cluster AI platform Installer Release Date
In AI platform Container Management, clusters can have four roles: global service cluster, management cluster, worker cluster, and integrated cluster. An integrated cluster can only be integrated from third-party vendors (see Integrate Cluster).
This page explains how to create a Worker Cluster. By default, when creating a new Worker Cluster, the operating system type and CPU architecture of the worker nodes should be consistent with the Global Service Cluster. If you want to create a cluster with a different operating system or architecture than the Global Management Cluster, refer to Creating an Ubuntu Worker Cluster on a CentOS Management Platform for instructions.
It is recommended to use the supported operating systems in AI platform to create the cluster. If your local nodes are not within the supported range, you can refer to Creating a Cluster on Non-Mainstream Operating Systems for instructions.
Certain prerequisites must be met before creating a cluster:
Prepare enough nodes to be joined into the cluster.
It is recommended to use Kubernetes version 1.25.7. For the specific version range, refer to the AI platform Cluster Version Support System. Currently, the supported version range for created worker clusters is v1.26.0-v1.28. If you need to create a cluster with a lower version, refer to the Supported Cluster Versions.
The target host must allow IPv4 forwarding. If using IPv6 in Pods and Services, the target server needs to allow IPv6 forwarding.
AI platform does not provide firewall management. You need to pre-define the firewall rules of the target host by yourself. To avoid errors during cluster creation, it is recommended to disable the firewall of the target host.
Enter the Container Management module, click Create Cluster on the upper right corner of the Clusters page.
Fill in the basic information by referring to the following instructions.
Cluster Name: only contain lowercase letters, numbers, and hyphens (\"-\"). Must start and end with a lowercase letter or number and totally up to 63 characters.
Managed By: Choose a cluster to manage this new cluster through its lifecycle, such as creating, upgrading, node scaling, deleting the new cluster, etc.
Runtime: Select the runtime environment of the cluster. Currently support containerd and docker (see How to Choose Container Runtime).
Kubernetes Version: A span of three minor versions is allowed, such as 1.23-1.25, subject to the versions supported by the management cluster.
Fill in the node configuration information and click Node Check .
High Availability: When enabled, at least 3 controller nodes are required. When disabled, only 1 controller node is needed.
It is recommended to use High Availability mode in production environments.
Credential Type: Choose whether to access nodes using username/password or public/private keys.
If using public/private key authentication, SSH keys for the nodes need to be configured in advance. Refer to Using SSH Key Authentication for Nodes.
Same Password: When enabled, all nodes in the cluster will have the same access password. Enter the unified password for accessing all nodes in the field below. If disabled, you can set separate usernames and passwords for each node.
Node Information: Set node names and IPs.
NTP Time Synchronization: When enabled, time will be automatically synchronized across all nodes. Provide the NTP server address.
If node check is passed, click Next . If the check failed, update Node Information and check again.
Fill in the network configuration and click Next .
CNI: Provide network services for Pods in the cluster. CNI cannot be changed after the cluster is created. Supports cilium and calico. Set none means not installing CNI when creating the cluster. You may install a CNI later.
For CNI configuration details, see Cilium Installation Parameters or Calico Installation Parameters.
Container IP Range: Set an IP range for allocating IPs for containers in the cluster. IP range determines the max number of containers allowed in the cluster. Cannot be modified after creation.
Service IP Range: Set an IP range for allocating IPs for container Services in the cluster. This range determines the max number of container Services that can be created in the cluster. Cannot be modified after creation.
Fill in the plug-in configuration and click Next .
Fill in advanced settings and click OK .
kubelet_max_pods : Set the maximum number of Pods per node. The default is 110.
hostname_override : Reset the hostname (not recommended).
kubernetes_audit : Kubernetes audit log, enabled by default.
auto_renew_certificate : Automatically renew the certificate of the control plane on the first Monday of each month, enabled by default.
disable_firewalld&ufw : Disable the firewall to prevent the node from being inaccessible during installation.
Insecure_registries: Set the address of your private container registry. If you use a private container registry, filling in its address allows the container engine to bypass certificate authentication and pull images.
yum_repos : Fill in the Yum source registry address.
Success
After correctly filling in the above information, the page will prompt that the cluster is being created.
Creating a cluster takes a long time, so please wait patiently. You can click the Back to Clusters button to let it run in the background.
To view the current status, click Real-time Log .
Note
When the cluster is in an unknown state, it means that the current cluster has been disconnected.
The data displayed by the system is the cached data before the disconnection, which does not represent real data.
Any operations performed in the disconnected state will not take effect. Please check the cluster network connectivity or Host Status.
Clusters created in AI platform Container Management can be either deleted or removed. Clusters integrated into AI platform can only be removed.
Info
If you want to delete an integrated cluster, you should delete it in the platform where it is created.
In AI platform, the difference between Delete and Remove is:
Delete will destroy the cluster and reset the data of all nodes under the cluster. All data will be totally cleared and lost. Making a backup before deleting a cluster is a recommended best practice. You can no longer use that cluster anymore.
Remove just removes the cluster from AI platform. It will not destroy the cluster and no data will be lost. You can still use the cluster in other platforms or re-integrate it into AI platform later if needed.
Note
You should have Admin or Kpanda Owner permissions to perform delete or remove operations.
Before deleting a cluster, you should turn off Cluster Deletion Protection in Cluster Settings -> Advanced Settings , otherwise the Delete Cluster option will not be displayed.
The global service cluster cannot be deleted or removed.
Enter the Container Management module, find your target cluster, click ... on the right, and select Delete Cluster / Remove from the drop-down list.
Enter the cluster name to confirm and click Delete .
You will be automatically redirected to the cluster list. The status of this cluster will change to Deleting. It may take a while to delete or remove a cluster.
With the cluster integration feature, AI platform allows you to manage on-premise and cloud clusters from various providers in a unified manner. This is important for avoiding the risk of being locked in by a particular provider and helps enterprises safely migrate their business to the cloud.
In AI platform Container Management module, you can integrate a cluster of the following providers: standard Kubernetes clusters, Redhat Openshift, SUSE Rancher, VMware Tanzu, Amazon EKS, Aliyun ACK, Huawei CCE, Tencent TKE, etc.
Enter Container Management module, and click Integrate Cluster in the upper right corner.
Fill in the basic information by referring to the following instructions.
Cluster Name: It should be unique and cannot be changed after the integration. Maximum 63 characters, can only contain lowercase letters, numbers, and a separator (\"-\"), and must start and end with a lowercase letter or number.
Cluster Alias: Enter any characters, no more than 60 characters.
Release Distribution: the cluster provider; the mainstream vendors listed above are supported.
Fill in the KubeConfig of the target cluster and click Verify Config . The cluster can be successfully connected only after the verification is passed.
Click How do I get the KubeConfig? to see the specific steps for getting this file.
Confirm that all parameters are filled in correctly and click OK in the lower right corner of the page.
Note
The status of the newly integrated cluster is Integrating , which will become Running after the integration succeeds.
"},{"location":"en/end-user/kpanda/clusters/integrate-rancher-cluster.html","title":"Integrate the Rancher Cluster","text":"
This page explains how to integrate a Rancher cluster.
Prepare a Rancher cluster with administrator privileges and ensure network connectivity between the container management cluster and the target cluster.
Have permissions not lower than kpanda owner.
"},{"location":"en/end-user/kpanda/clusters/integrate-rancher-cluster.html#steps","title":"Steps","text":""},{"location":"en/end-user/kpanda/clusters/integrate-rancher-cluster.html#step-1-create-a-serviceaccount-user-with-administrator-privileges-in-the-rancher-cluster","title":"Step 1: Create a ServiceAccount user with administrator privileges in the Rancher cluster","text":"
Log in to the Rancher cluster with a role that has administrator privileges, and create a file named sa.yaml using the terminal.
```bash
vi sa.yaml
```
Press the i key to enter insert mode, then copy and paste the following content:
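The exact file content is not reproduced here. As a hedged sketch, a ServiceAccount named rancher-rke (the name used in the following steps) bound to the cluster-admin ClusterRole would look roughly like this; the kube-system namespace is an assumption:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rancher-rke            # name referenced in Step 2
  namespace: kube-system       # assumed namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rancher-rke
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin          # grants administrator privileges
subjects:
  - kind: ServiceAccount
    name: rancher-rke
    namespace: kube-system
```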
"},{"location":"en/end-user/kpanda/clusters/integrate-rancher-cluster.html#step-2-update-kubeconfig-with-the-rancher-rke-sa-authentication-on-your-local-machine","title":"Step 2: Update kubeconfig with the rancher-rke SA authentication on your local machine","text":"
Perform the following steps on any local node where kubelet is installed:
{cluster-name} : the name of your Rancher cluster.
{APIServer} : the access address of the cluster, usually referring to the IP address of the control node + port 6443, such as https://10.X.X.X:6443 .
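As a hedged sketch of these steps (the exact commands are not reproduced here, and the token retrieval method depends on your Kubernetes version; on 1.24+ you may need kubectl create token or a manually created ServiceAccount token secret):

```bash
# Illustrative sketch: build kubeconfig entries that authenticate with the rancher-rke ServiceAccount
SECRET_NAME=$(kubectl -n kube-system get sa rancher-rke -o jsonpath='{.secrets[0].name}')
TOKEN=$(kubectl -n kube-system get secret ${SECRET_NAME} -o jsonpath='{.data.token}' | base64 -d)

kubectl config set-cluster {cluster-name} --server={APIServer} --insecure-skip-tls-verify=true
kubectl config set-credentials rancher-rke --token=${TOKEN}
kubectl config set-context {cluster-name} --cluster={cluster-name} --user=rancher-rke
kubectl config use-context {cluster-name}
```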
"},{"location":"en/end-user/kpanda/clusters/integrate-rancher-cluster.html#step-3-connect-the-cluster-in-the-suanova-interface","title":"Step 3: Connect the cluster in the Suanova Interface","text":"
Using the kubeconfig file fetched earlier, refer to the Integrate Cluster documentation to integrate the Rancher cluster to the global cluster.
"},{"location":"en/end-user/kpanda/clusters/runtime.html","title":"How to choose the container runtime","text":"
The container runtime is an important component in kubernetes to manage the life cycle of containers and container images. Kubernetes made containerd the default container runtime in version 1.19, and removed support for the Dockershim component in version 1.24.
Therefore, compared to the Docker runtime, we recommend using the lightweight containerd as your container runtime, as it has become the mainstream runtime choice.
In addition, some operating system distribution vendors are not friendly enough for Docker runtime compatibility. The runtime support of different operating systems is as follows:
The Kubernetes community releases a minor version roughly every quarter, and each version is maintained for only about 9 months. Major bugs or security vulnerabilities are no longer fixed after a version leaves maintenance. Manually upgrading clusters is cumbersome and places a huge workload on administrators.
In Suanova, you can upgrade the Kubernetes cluster with one click through the web UI interface.
Danger
After the version is upgraded, it will not be possible to roll back to the previous version, please proceed with caution.
Note
Kubernetes versions are denoted as x.y.z , where x is the major version, y is the minor version, and z is the patch version.
Cluster upgrades across minor versions are not allowed, e.g. a direct upgrade from 1.23 to 1.25 is not possible.
Integrated clusters do not support version upgrades. If there is no Cluster Upgrade option in the left navigation bar, please check whether the cluster is an integrated cluster.
The global service cluster can only be upgraded through the terminal.
When upgrading a worker cluster, the Management Cluster of the worker cluster should have been connected to the container management module and be running normally.
Click the name of the target cluster in the cluster list.
Then click Cluster Operation and Maintenance -> Cluster Upgrade in the left navigation bar, and click Version Upgrade in the upper right corner of the page.
Select the version that can be upgraded, and enter the cluster name to confirm.
After clicking OK , you can see the upgrade progress of the cluster.
The cluster upgrade is expected to take 30 minutes. You can click the Real-time Log button to view the detailed log of the cluster upgrade.
ConfigMaps store non-confidential data in the form of key-value pairs to achieve the effect of mutual decoupling of configuration data and application code. ConfigMaps can be used as environment variables for containers, command-line parameters, or configuration files in storage volumes.
Note
The data saved in ConfigMaps cannot exceed 1 MiB. If you need to store larger volumes of data, it is recommended to mount a storage volume or use an independent database or file service.
ConfigMaps do not provide confidentiality or encryption. If you want to store encrypted data, it is recommended to use secret, or other third-party tools to ensure the privacy of data.
Click the name of a cluster on the Clusters page to enter Cluster Details .
In the left navigation bar, click ConfigMap and Secret -> ConfigMap , and click the YAML Create button in the upper right corner.
Fill in or paste the configuration file prepared in advance, and then click OK in the lower right corner of the pop-up box.
Note

- Click Import to import an existing local file to quickly create a ConfigMap.
- After filling in the data, click Download to save the configuration file locally.
After the creation is complete, click More on the right side of the ConfigMap to edit YAML, update, export, delete and other operations.
A secret is a resource object used to store and manage sensitive information such as passwords, OAuth tokens, SSH, TLS credentials, etc. Using keys means you don't need to include sensitive secrets in your application code.
Secrets can be used in some cases:
Used as an environment variable of the container to provide some necessary information required during the running of the container.
Use secrets as pod data volumes.
As the identity authentication credential for the container registry when the kubelet pulls the container image.
A Kubernetes cluster has been integrated or created, and you can access the cluster's UI.
Created a namespace, user, and authorized the user as NS Editor. For details, refer to Namespace Authorization.
"},{"location":"en/end-user/kpanda/configmaps-secrets/create-secret.html#create-secret-with-wizard","title":"Create secret with wizard","text":"
Click the name of a cluster on the Clusters page to enter Cluster Details .
In the left navigation bar, click ConfigMap and Secret -> Secret , and click the Create Secret button in the upper right corner.
Fill in the configuration information on the Create Secret page, and click OK .
Note when filling in the configuration:
The name of the key must be unique within the same namespace
Key type:
Default (Opaque): Kubernetes default key type, which supports arbitrary data defined by users.
TLS (kubernetes.io/tls): credentials for TLS client or server data access.
Container registry information (kubernetes.io/dockerconfigjson): Credentials for Container registry access.
username and password (kubernetes.io/basic-auth): Credentials for basic authentication.
Custom: the type customized by the user according to business needs.
Key data: the data stored in the key, the parameters that need to be filled in are different for different data
When the key type is default (Opaque)/custom: multiple key-value pairs can be filled in.
When the key type is TLS (kubernetes.io/tls): you need to fill in the certificate certificate and private key data. Certificates are self-signed or CA-signed credentials used for authentication. A certificate request is a request for a signature and needs to be signed with a private key.
When the key type is container registry information (kubernetes.io/dockerconfigjson): you need to fill in the account and password of the private container registry.
When the key type is username and password (kubernetes.io/basic-auth): Username and password need to be specified.
ConfigMap (ConfigMap) is an API object of Kubernetes, which is used to save non-confidential data into key-value pairs, and can store configurations that other objects need to use. When used, the container can use it as an environment variable, a command-line argument, or a configuration file in a storage volume. By using ConfigMaps, configuration data and application code can be separated, providing a more flexible way to modify application configuration.
Note
ConfigMaps do not provide confidentiality or encryption. If the data to be stored is confidential, please use secret, or use other third-party tools to ensure the privacy of the data instead of ConfigMaps. In addition, when using ConfigMaps in containers, the container and ConfigMaps must be in the same cluster namespace.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#scenes-to-be-used","title":"scenes to be used","text":"
You can use ConfigMaps in Pods. There are many use cases, mainly including:
Use ConfigMaps to set the environment variables of the container
Use ConfigMaps to set the command line parameters of the container
Use ConfigMaps as container data volumes
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#set-the-environment-variables-of-the-container","title":"Set the environment variables of the container","text":"
You can use the ConfigMap as the environment variable of the container through the graphical interface or the terminal command line.
Note
ConfigMap Import uses the whole ConfigMap as the value of environment variables, while ConfigMap Key Value Import uses a specific key in the ConfigMap as the value of an environment variable.
When creating a workload through an image, you can set environment variables for the container by selecting Import ConfigMaps or Import ConfigMap Key Values on the Environment Variables interface.
Go to the Image Creation Workload page, in the Container Configuration step, select the Environment Variables configuration, and click the Add Environment Variable button.
Select ConfigMap Import or ConfigMap Key Value Import in the environment variable type.
When the environment variable type is ConfigMap Import, enter the variable name, prefix, and ConfigMap name in sequence.
When the environment variable type is ConfigMap Key Value Import, enter the variable name, ConfigMap name, and key name in sequence.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#command-line-operation","title":"Command line operation","text":"
You can set ConfigMaps as environment variables when creating a workload, using the valueFrom parameter to refer to the Key/Value in the ConfigMap.
In the YAML, valueFrom specifies that the value of the environment variable comes from a ConfigMap; name is the name of the referenced ConfigMap, and key is the referenced key within it.
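A minimal sketch of such a Pod; the ConfigMap and key names are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: configmap-env-demo
spec:
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "env"]
      env:
        - name: SPECIAL_LEVEL_KEY        # environment variable name in the container
          valueFrom:
            configMapKeyRef:
              name: game-config          # referenced ConfigMap name (example)
              key: special.how           # referenced key in the ConfigMap (example)
  restartPolicy: Never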
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#set-the-command-line-parameters-of-the-container","title":"Set the command line parameters of the container","text":"
You can use a ConfigMap to set the command or arguments of a container by referencing the imported values with the environment variable substitution syntax $(VAR_NAME), as shown below.
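A minimal sketch, reusing the environment variable defined above; names and values are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: configmap-args-demo
spec:
  containers:
    - name: demo
      image: busybox
      # $(SPECIAL_LEVEL_KEY) is substituted with the value injected from the ConfigMap
      command: ["sh", "-c", "echo level=$(SPECIAL_LEVEL_KEY)"]
      env:
        - name: SPECIAL_LEVEL_KEY
          valueFrom:
            configMapKeyRef:
              name: game-config
              key: special.how
  restartPolicy: Never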
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#used-as-container-data-volume","title":"Used as container data volume","text":"
You can use a ConfigMap as a data volume of the container through the graphical interface or the terminal command line.
When creating a workload through an image, you can use the ConfigMap as the data volume of the container by selecting the storage type as \"ConfigMap\" on the \"Data Storage\" interface.
Go to the Image Creation Workload page, in the Container Configuration step, select the Data Storage configuration, and click the Add button in the Node Path Mapping list.
Select ConfigMap in the storage type, and enter container path , subpath and other information in sequence.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#command-line-operation_1","title":"Command line operation","text":"
To use a ConfigMap in a Pod's storage volume, mount it under spec.volumes and reference it from the container's volumeMounts. Here is an example Pod that mounts a ConfigMap as a volume:
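A minimal sketch; the ConfigMap name and mount path are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: configmap-volume-demo
spec:
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "ls /etc/config && sleep 3600"]
      volumeMounts:
        - name: config-volume
          mountPath: /etc/config        # each key appears as a file under this path
  volumes:
    - name: config-volume
      configMap:
        name: game-config               # referenced ConfigMap name (example)
  restartPolicy: Never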
If there are multiple containers in a Pod, each container needs its own volumeMounts block, but you only need to set one spec.volumes block per ConfigMap.
Note
When a ConfigMap is used as a data volume mounted on a container, the ConfigMap can only be read as a read-only file.
A Secret is a resource object used to store and manage sensitive information such as passwords, OAuth tokens, SSH keys, and TLS credentials. Using Secrets means you don't need to include sensitive data in your application code.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#scenes-to-be-used","title":"scenes to be used","text":"
You can use Secrets in Pods in a variety of use cases, mainly including:
As environment variables of the container, to provide information required while the container is running.
As Pod data volumes.
As the authentication credential for the container registry when the kubelet pulls the container image.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#use-the-key-to-set-the-environment-variable-of-the-container","title":"Use the key to set the environment variable of the container","text":"
You can use a Secret as environment variables of the container through the GUI or the terminal command line.
Note
Secret import uses the whole Secret as environment variables; Secret key-value import uses a single key in the Secret as the value of one environment variable.
When creating a workload from an image, you can set environment variables for the container by selecting Secret Import or Secret Key-Value Import on the Environment Variables interface.
Go to the Image Creation Workload page.
Select the Environment Variables configuration in Container Configuration , and click the Add Environment Variable button.
Select Secret Import or Secret Key-Value Import as the environment variable type.
When the environment variable type is Secret Import , enter the variable name, prefix, and Secret name in sequence.
When the environment variable type is Secret Key-Value Import , enter the variable name, Secret name, and key name in sequence.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#command-line-operation","title":"Command line operation","text":"
As shown in the example below, you can set the secret as an environment variable when creating the workload, using the valueFrom parameter to refer to the Key/Value in the Secret.
In the example below, the referenced Secret is named mysecret; it must already exist and contain keys named username and password.
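A minimal sketch of such a Pod; the image and variable names are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: secret-env-demo
spec:
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "env"]
      env:
        - name: SECRET_USERNAME
          valueFrom:
            secretKeyRef:
              name: mysecret       # the Secret must exist and contain a key named "username"
              key: username
        - name: SECRET_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysecret       # the Secret must exist and contain a key named "password"
              key: password
  restartPolicy: Never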
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#use-the-key-as-the-pods-data-volume","title":"Use the key as the pod's data volume","text":""},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#graphical-interface-operation_1","title":"Graphical interface operation","text":"
When creating a workload through an image, you can use a Secret as a data volume of the container by selecting Secret as the storage type on the Data Storage interface.
Go to the Image Creation Workload page.
In the Container Configuration , select the Data Storage configuration, and click the Add button in the Node Path Mapping list.
Select Secret in the storage type, and enter container path , subpath and other information in sequence.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#command-line-operation_1","title":"Command line operation","text":"
The following is an example of a Pod that mounts a Secret named mysecret via a data volume:
By default, the Secret named mysecret must already exist in the same namespace.
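A minimal sketch; the mount path is a placeholder:

apiVersion: v1
kind: Pod
metadata:
  name: secret-volume-demo
spec:
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "ls /etc/secret && sleep 3600"]
      volumeMounts:
        - name: secret-volume
          mountPath: /etc/secret     # each key appears as a read-only file under this path
          readOnly: true
  volumes:
    - name: secret-volume
      secret:
        secretName: mysecret         # the Secret must already exist
  restartPolicy: Never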
If the Pod contains multiple containers, each container needs its own volumeMounts block, but only one .spec.volumes setting is required for each Secret.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#used-as-the-identity-authentication-credential-for-the-container-registry-when-the-kubelet-pulls-the-container-image","title":"Used as the identity authentication credential for the container registry when the kubelet pulls the container image","text":"
You can use a Secret as the authentication credential for a container registry through the GUI or the terminal command line.
When creating a workload through an image, you can use a Secret as the credential for pulling images from a private container registry by selecting that registry when choosing the image.
Go to the Image Creation Workload page.
In the second step of Container Configuration , select the Basic Information configuration, and click the Select Image button.
Select the name of the private container registry from the Container Registry drop-down list in the pop-up box. See Create Secret for details on creating a secret for a private registry.
Enter the image name in the private registry, click OK to complete the image selection.
Note
When creating the secret, make sure to enter the correct container registry address, username, and password, and select the correct image name; otherwise you will not be able to pull the image from the container registry.
In Kubernetes, all objects are abstracted as resources; Pod, Deployment, Service, Volume, and so on are the default resources provided by Kubernetes. These cover most daily operations and maintenance needs, but in some special cases the preset resources cannot meet business requirements. To extend the capabilities of the Kubernetes API, CustomResourceDefinition (CRD) was introduced.
The container management module supports interface-based management of custom resources, and its main features are as follows:
Obtain the list and detailed information of custom resources under the cluster
Create custom resources based on YAML
Create a custom resource example CR (Custom Resource) based on YAML
"},{"location":"en/end-user/kpanda/custom-resources/create.html#create-a-custom-resource-example-via-yaml","title":"Create a custom resource example via YAML","text":"
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Custom Resource , and click the YAML Create button in the upper right corner.
Click the custom resource named crontabs.stable.example.com , enter the details, and click the YAML Create button in the upper right corner.
On the Create with YAML page, fill in the YAML statement and click OK .
Return to the details page of crontabs.stable.example.com , and you can view the custom resource named my-new-cron-object just created.
"},{"location":"en/end-user/kpanda/gpu/index.html","title":"Overview of GPU Management","text":"
This article introduces the capability of Suanova container management platform in unified operations and management of heterogeneous resources, with a focus on GPUs.
With the rapid development of emerging technologies such as AI applications, large-scale models, artificial intelligence, and autonomous driving, enterprises are facing an increasing demand for compute-intensive tasks and data processing. Traditional compute architectures represented by CPUs can no longer meet the growing computational requirements of enterprises. At this point, heterogeneous computing represented by GPUs has been widely applied due to its unique advantages in processing large-scale data, performing complex calculations, and real-time graphics rendering.
Meanwhile, due to the lack of experience and professional solutions in scheduling and managing heterogeneous resources, the utilization efficiency of GPU devices is extremely low, resulting in high AI production costs for enterprises. The challenge of reducing costs, increasing efficiency, and improving the utilization of GPUs and other heterogeneous resources has become a pressing issue for many enterprises.
"},{"location":"en/end-user/kpanda/gpu/index.html#introduction-to-gpu-capabilities","title":"Introduction to GPU Capabilities","text":"
The Suanova container management platform supports unified scheduling and operations management of GPUs, NPUs, and other heterogeneous resources, fully unleashing the computational power of GPU resources, and accelerating the development of enterprise AI and other emerging applications. The GPU management capabilities of Suanova are as follows:
Support for unified management of heterogeneous computing resources from domestic and foreign manufacturers such as NVIDIA, Huawei Ascend, and Iluvatar.
Support for multi-card heterogeneous scheduling within the same cluster, with automatic recognition of GPUs in the cluster.
Support for native management solutions for NVIDIA GPUs, vGPUs, and MIG, with cloud native capabilities.
Support for partitioning a single physical card for use by different tenants, and allocate GPU resources to tenants and containers based on computing power and memory quotas.
Support for multi-dimensional GPU resource monitoring at the cluster, node, and application levels, assisting operators in managing GPU resources.
Compatibility with various training frameworks such as TensorFlow and PyTorch.
"},{"location":"en/end-user/kpanda/gpu/index.html#introduction-to-gpu-operator","title":"Introduction to GPU Operator","text":"
Similar to regular computer hardware, NVIDIA GPUs, as physical devices, need to have the NVIDIA GPU driver installed in order to be used. To reduce the cost of using GPUs on Kubernetes, NVIDIA provides the NVIDIA GPU Operator component to manage various components required for using NVIDIA GPUs. These components include the NVIDIA driver (for enabling CUDA), NVIDIA container runtime, GPU node labeling, DCGM-based monitoring, and more. In theory, users only need to plug the GPU into a compute device managed by Kubernetes, and they can use all the capabilities of NVIDIA GPUs through the GPU Operator. For more information about NVIDIA GPU Operator, refer to the NVIDIA official documentation. For deployment instructions, refer to Offline Installation of GPU Operator.
Architecture diagram of NVIDIA GPU Operator:
"},{"location":"en/end-user/kpanda/gpu/FAQ.html","title":"GPU FAQs","text":""},{"location":"en/end-user/kpanda/gpu/FAQ.html#gpu-processes-are-not-visible-while-running-nvidia-smi-inside-a-pod","title":"GPU processes are not visible while running nvidia-smi inside a pod","text":"
Q: When running the nvidia-smi command inside a pod that uses a GPU, no GPU process information is visible in either full GPU mode or vGPU mode.
A: Due to PID namespace isolation, GPU processes are not visible inside the Pod. To view GPU processes, you can use one of the following methods:
Configure the workload using the GPU with hostPID: true to enable viewing PIDs on the host.
Run the nvidia-smi command in the driver pod of the gpu-operator to view processes.
Run the chroot /run/nvidia/driver nvidia-smi command on the host to view processes.
"},{"location":"en/end-user/kpanda/gpu/Iluvatar_usage.html","title":"How to Use Iluvatar GPU in Applications","text":"
This section describes how to use Iluvatar virtual GPU on AI platform.
The AI platform container management platform has been deployed and is running properly.
The container management module has been integrated with a Kubernetes cluster or a Kubernetes cluster has been created, and the UI interface of the cluster can be accessed.
The Iluvatar GPU driver has been installed on the current cluster. Refer to the Iluvatar official documentation for driver installation instructions, or contact the Suanova ecosystem team for enterprise-level support at peg-pem@daocloud.io.
The GPUs in the current cluster have not undergone any virtualization operations and are not occupied by other applications.
"},{"location":"en/end-user/kpanda/gpu/Iluvatar_usage.html#procedure","title":"Procedure","text":""},{"location":"en/end-user/kpanda/gpu/Iluvatar_usage.html#configuration-via-user-interface","title":"Configuration via User Interface","text":"
Check if the GPU in the cluster has been detected. Click Clusters -> Cluster Settings -> Addon Plugins , and check if the proper GPU type has been automatically enabled and detected. Currently, the cluster will automatically enable GPU and set the GPU type as Iluvatar .
Deploy a workload. Click Clusters -> Workloads and deploy a workload using the image. After selecting the type as (Iluvatar) , configure the GPU resources used by the application:
Physical Card Count (iluvatar.ai/vcuda-core): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host machine.
Memory Usage (iluvatar.ai/vcuda-memory): Indicates the amount of GPU memory occupied by each card. The value is in MB, with a minimum value of 1 and a maximum value equal to the entire memory of the card.
If there are any issues with the configuration values, scheduling failures or resource allocation failures may occur.
"},{"location":"en/end-user/kpanda/gpu/Iluvatar_usage.html#configuration-via-yaml","title":"Configuration via YAML","text":"
To request GPU resources for a workload, add the iluvatar.ai/vcuda-core: 1 and iluvatar.ai/vcuda-memory: 200 parameters to the requests and limits. These parameters configure the application's use of the physical GPU resources.
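A minimal sketch of such a workload spec; only the resources section matters here, and the image and names are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: iluvatar-vgpu-demo
spec:
  containers:
    - name: demo
      image: nginx:latest                   # placeholder image
      resources:
        requests:
          iluvatar.ai/vcuda-core: 1         # number of physical cards to mount
          iluvatar.ai/vcuda-memory: 200     # GPU memory per card, in MB
        limits:
          iluvatar.ai/vcuda-core: 1
          iluvatar.ai/vcuda-memory: 200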
"},{"location":"en/end-user/kpanda/gpu/dynamic-regulation.html","title":"GPU Scheduling Configuration (Binpack and Spread)","text":"
This page introduces how to reduce GPU resource fragmentation and prevent single points of failure through Binpack and Spread when using NVIDIA vGPU, achieving advanced scheduling for vGPU. The AI platform provides Binpack and Spread scheduling policies across two dimensions: clusters and workloads, meeting different usage requirements in various scenarios.
Scheduling policy based on GPU dimension
Binpack: Prioritizes using the same GPU on a node, suitable for increasing GPU utilization and reducing resource fragmentation.
Spread: Multiple Pods are distributed across different GPUs on nodes, suitable for high availability scenarios to avoid single card failures.
Scheduling policy based on node dimension
Binpack: Multiple Pods prioritize using the same node, suitable for increasing GPU utilization and reducing resource fragmentation.
Spread: Multiple Pods are distributed across different nodes, suitable for high availability scenarios to avoid single node failures.
"},{"location":"en/end-user/kpanda/gpu/dynamic-regulation.html#use-binpack-and-spread-at-cluster-level","title":"Use Binpack and Spread at Cluster-Level","text":"
Note
By default, workloads will follow the cluster-level Binpack and Spread. If a workload sets its own Binpack and Spread scheduling policies that differ from the cluster, the workload will prioritize its own scheduling policy.
On the Clusters page, select the cluster for which you want to adjust the Binpack and Spread scheduling policies. Click the \u2507 icon on the right and select GPU Scheduling Configuration from the dropdown list.
Adjust the GPU scheduling configuration according to your business scenario, and click OK to save.
"},{"location":"en/end-user/kpanda/gpu/dynamic-regulation.html#use-binpack-and-spread-at-workload-level","title":"Use Binpack and Spread at Workload-Level","text":"
Note
When the Binpack and Spread scheduling policies at the workload level conflict with the cluster-level configuration, the workload-level configuration takes precedence.
Follow the steps below to create a deployment using an image and configure Binpack and Spread scheduling policies within the workload.
Click Clusters in the left navigation bar, then click the name of the target cluster to enter the Cluster Details page.
On the Cluster Details page, click Workloads -> Deployments in the left navigation bar, then click the Create by Image button in the upper right corner of the page.
Sequentially fill in the Basic Information, Container Settings, and in the Container Configuration section, enable GPU configuration, selecting the GPU type as NVIDIA vGPU. Click Advanced Settings, enable the Binpack / Spread scheduling policy, and adjust the GPU scheduling configuration according to the business scenario. After configuration, click Next to proceed to Service Settings and Advanced Settings. Finally, click OK at the bottom right of the page to complete the creation.
"},{"location":"en/end-user/kpanda/gpu/gpu-metrics.html#cluster-level","title":"Cluster Level","text":"Metric Name Description Number of GPUs Total number of GPUs in the cluster Average GPU Utilization Average compute utilization of all GPUs in the cluster Average GPU Memory Utilization Average memory utilization of all GPUs in the cluster GPU Power Power consumption of all GPUs in the cluster GPU Temperature Temperature of all GPUs in the cluster GPU Utilization Details 24-hour usage details of all GPUs in the cluster (includes max, avg, current) GPU Memory Usage Details 24-hour memory usage details of all GPUs in the cluster (includes min, max, avg, current) GPU Memory Bandwidth Utilization For example, an Nvidia V100 GPU has a maximum memory bandwidth of 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the utilization is 50%"},{"location":"en/end-user/kpanda/gpu/gpu-metrics.html#node-level","title":"Node Level","text":"Metric Name Description GPU Mode Usage mode of GPUs on the node, including full-card mode, MIG mode, vGPU mode Number of Physical GPUs Total number of physical GPUs on the node Number of Virtual GPUs Number of vGPU devices created on the node Number of MIG Instances Number of MIG instances created on the node GPU Memory Allocation Rate Memory allocation rate of all GPUs on the node Average GPU Utilization Average compute utilization of all GPUs on the node Average GPU Memory Utilization Average memory utilization of all GPUs on the node GPU Driver Version Driver version information of GPUs on the node GPU Utilization Details 24-hour usage details of each GPU on the node (includes max, avg, current) GPU Memory Usage Details 24-hour memory usage details of each GPU on the node (includes min, max, avg, current)"},{"location":"en/end-user/kpanda/gpu/gpu-metrics.html#pod-level","title":"Pod Level","text":"Category Metric Name Description Application Overview GPU - Compute & Memory Pod GPU Utilization Compute utilization of the GPUs used by the current Pod Pod GPU Memory Utilization Memory utilization of the GPUs used by the current Pod Pod GPU Memory Usage Memory usage of the GPUs used by the current Pod Memory Allocation Memory allocation of the GPUs used by the current Pod Pod GPU Memory Copy Ratio Memory copy ratio of the GPUs used by the current Pod GPU - Engine Overview GPU Graphics Engine Activity Percentage Percentage of time the Graphics or Compute engine is active during a monitoring cycle GPU Memory Bandwidth Utilization Memory bandwidth utilization (Memory BW Utilization) indicates the fraction of cycles during which data is sent to or received from the device memory. This value represents the average over the interval, not an instantaneous value. A higher value indicates higher utilization of device memory.A value of 1 (100%) indicates that a DRAM instruction is executed every cycle during the interval (in practice, a peak of about 0.8 (80%) is the maximum achievable).A value of 0.2 (20%) indicates that 20% of the cycles during the interval are spent reading from or writing to device memory. 
Tensor Core Utilization Percentage of time the Tensor Core pipeline is active during a monitoring cycle FP16 Engine Utilization Percentage of time the FP16 pipeline is active during a monitoring cycle FP32 Engine Utilization Percentage of time the FP32 pipeline is active during a monitoring cycle FP64 Engine Utilization Percentage of time the FP64 pipeline is active during a monitoring cycle GPU Decode Utilization Decode engine utilization of the GPU GPU Encode Utilization Encode engine utilization of the GPU GPU - Temperature & Power GPU Temperature Temperature of all GPUs in the cluster GPU Power Power consumption of all GPUs in the cluster GPU Total Power Consumption Total power consumption of the GPUs GPU - Clock GPU Memory Clock Memory clock frequency GPU Application SM Clock Application SM clock frequency GPU Application Memory Clock Application memory clock frequency GPU Video Engine Clock Video engine clock frequency GPU Throttle Reasons Reasons for GPU throttling GPU - Other Details PCIe Transfer Rate Data transfer rate of the GPU through the PCIe bus PCIe Receive Rate Data receive rate of the GPU through the PCIe bus"},{"location":"en/end-user/kpanda/gpu/gpu_matrix.html","title":"GPU Support Matrix","text":"
This page explains the matrix of supported GPUs and operating systems for AI platform.
"},{"location":"en/end-user/kpanda/gpu/gpu_matrix.html#nvidia-gpu","title":"NVIDIA GPU","text":"GPU Manufacturer and Type Supported GPU Models Compatible Operating System (Online) Recommended Kernel Recommended Operating System and Kernel Installation Documentation NVIDIA GPU (Full Card/vGPU)
NVIDIA Fermi (2.1) Architecture:
NVIDIA GeForce 400 Series
NVIDIA Quadro 4000 Series
NVIDIA Tesla 20 Series
NVIDIA Ampere Architecture Series (A100; A800; H100)
CentOS 7
Kernel 3.10.0-123 ~ 3.10.0-1160
Kernel Reference Document
Recommended Operating System with Proper Kernel Version
This document mainly introduces the configuration of GPU scheduling, which can implement advanced scheduling policies. Currently, the primary implementation is the vgpu scheduling policy.
vGPU provides two policies for resource usage: binpack and spread. These correspond to node-level and GPU-level dimensions, respectively. The use case is whether you want to distribute workloads more sparsely across different nodes and GPUs or concentrate them on the same node and GPU, thereby making resource utilization more efficient and reducing resource fragmentation.
You can modify the scheduling policy in your cluster by following these steps:
Go to the cluster management list in the container management interface.
Click the settings button ... next to the cluster.
Click GPU Scheduling Configuration.
Toggle the scheduling policy between node-level and GPU-level. By default, the node-level policy is binpack, and the GPU-level policy is spread.
The above steps modify the cluster-level scheduling policy. Users can also specify their own scheduling policy at the workload level to change the scheduling results. Below is an example of modifying the scheduling policy at the workload level:
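A minimal sketch of how this is commonly expressed on a workload; the annotation keys hami.io/node-scheduler-policy and hami.io/gpu-scheduler-policy are assumptions based on the HAMi vGPU scheduler and may differ in your environment:

apiVersion: v1
kind: Pod
metadata:
  name: vgpu-binpack-demo
  annotations:
    hami.io/node-scheduler-policy: "binpack"   # assumed annotation: node-level policy
    hami.io/gpu-scheduler-policy: "binpack"    # assumed annotation: GPU-level policy
spec:
  containers:
    - name: demo
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/vgpu: 1        # request one vGPU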
In this example, both the node- and GPU-level scheduling policies are set to binpack. This ensures that the workload is scheduled to maximize resource utilization and reduce fragmentation.
Follow these steps to manage GPU quotas in AI platform:
Go to Namespaces and click Quota Management to configure the GPU resources that can be used by a specific namespace.
The currently supported card types for quota management in a namespace are: NVIDIA vGPU, NVIDIA MIG, Iluvatar, and Ascend.
NVIDIA vGPU Quota Management: Configure the specific quota that can be used. This will create a ResourcesQuota CR.
- Physical Card Count (nvidia.com/vgpu): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host machine.
- GPU Core Count (nvidia.com/gpucores): Indicates the GPU compute power occupied by each card. The value ranges from 0 to 100. If configured as 0, it is considered not to enforce isolation. If configured as 100, it is considered to exclusively occupy the entire card.
- GPU Memory Usage (nvidia.com/gpumem): Indicates the amount of GPU memory occupied by each card. The value is in MB, with a minimum value of 1 and a maximum value equal to the entire memory of the card.
This document uses the AscendCL Image Classification Application example from the Ascend sample library.
Download the Ascend repository
Run the following command to download the Ascend demo repository, and remember the storage location of the code for subsequent use.
git clone https://gitee.com/ascend/samples.git\n
Prepare the base image
This example uses the Ascend-PyTorch base image, which can be obtained from the Ascend Container Registry.
Prepare the YAML file
ascend-demo.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resnetinfer1-1-1usoc
spec:
  template:
    spec:
      containers:
        - image: ascendhub.huawei.com/public-ascendhub/ascend-pytorch:23.0.RC2-ubuntu18.04 # Inference image name
          imagePullPolicy: IfNotPresent
          name: resnet50infer
          securityContext:
            runAsUser: 0
          command:
            - "/bin/bash"
            - "-c"
            - |
              source /usr/local/Ascend/ascend-toolkit/set_env.sh &&
              TEMP_DIR=/root/samples_copy_$(date '+%Y%m%d_%H%M%S_%N') &&
              cp -r /root/samples "$TEMP_DIR" &&
              cd "$TEMP_DIR"/inference/modelInference/sampleResnetQuickStart/python/model &&
              wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/003_Atc_Models/resnet50/resnet50.onnx &&
              atc --model=resnet50.onnx --framework=5 --output=resnet50 --input_shape="actual_input_1:1,3,224,224" --soc_version=Ascend910 &&
              cd ../data &&
              wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/models/aclsample/dog1_1024_683.jpg &&
              cd ../scripts &&
              bash sample_run.sh
          resources:
            requests:
              huawei.com/Ascend910: 1 # Number of the Ascend 910 Processors
            limits:
              huawei.com/Ascend910: 1 # The value should be the same as that of requests
          volumeMounts:
            - name: hiai-driver
              mountPath: /usr/local/Ascend/driver
              readOnly: true
            - name: slog
              mountPath: /var/log/npu/conf/slog/slog.conf
            - name: localtime # The container time must be the same as the host time
              mountPath: /etc/localtime
            - name: dmp
              mountPath: /var/dmp_daemon
            - name: slogd
              mountPath: /var/slogd
            - name: hbasic
              mountPath: /etc/hdcBasic.cfg
            - name: sys-version
              mountPath: /etc/sys_version.conf
            - name: aicpu
              mountPath: /usr/lib64/aicpu_kernels
            - name: tfso
              mountPath: /usr/lib64/libtensorflow.so
            - name: sample-path
              mountPath: /root/samples
      volumes:
        - name: hiai-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: slog
          hostPath:
            path: /var/log/npu/conf/slog/slog.conf
        - name: localtime
          hostPath:
            path: /etc/localtime
        - name: dmp
          hostPath:
            path: /var/dmp_daemon
        - name: slogd
          hostPath:
            path: /var/slogd
        - name: hbasic
          hostPath:
            path: /etc/hdcBasic.cfg
        - name: sys-version
          hostPath:
            path: /etc/sys_version.conf
        - name: aicpu
          hostPath:
            path: /usr/lib64/aicpu_kernels
        - name: tfso
          hostPath:
            path: /usr/lib64/libtensorflow.so
        - name: sample-path
          hostPath:
            path: /root/samples
      restartPolicy: OnFailure
Some fields in the above YAML need to be modified according to the actual situation:
atc ... --soc_version=Ascend910 uses Ascend910; adjust this field to your actual situation. You can run the npu-smi info command to check the NPU model and add the Ascend prefix to it.
samples-path should be adjusted according to the actual situation.
resources should be adjusted according to the actual situation.
Deploy a Job and check its results
Use the following command to create a Job:
kubectl apply -f ascend-demo.yaml\n
Check the Pod running status:
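For example, a quick way to check is (the job name matches the YAML above):

kubectl get pod | grep resnetinfer1-1-1usoc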
After the Pod runs successfully, check the log results. The key prompt information on the screen is shown in the figure below. The Label indicates the category identifier, Conf indicates the maximum confidence of the classification, and Class indicates the belonging category. These values may vary depending on the version and environment, so please refer to the actual situation:
Confirm whether the cluster has detected the GPU. Click Clusters -> Cluster Settings -> Addon Plugins , and check whether the proper GPU type is automatically enabled and detected. Currently, the cluster will automatically enable GPU and set the GPU type to Ascend .
Deploy the workload. Click Clusters -> Workloads , deploy the workload through an image, select the type (Ascend), and then configure the number of physical cards used by the application:
Number of Physical Cards (huawei.com/Ascend910) : This indicates how many physical cards the current Pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host.
If there is an issue with the above configuration, it will result in scheduling failure and resource allocation issues.
"},{"location":"en/end-user/kpanda/gpu/ascend/ascend_driver_install.html","title":"Installation of Ascend NPU Components","text":"
This chapter provides installation guidance for Ascend NPU drivers, Device Plugin, NPU-Exporter, and other components.
Before using NPU resources, you need to complete the firmware installation, NPU driver installation, Docker Runtime installation, user creation, log directory creation, and NPU Device Plugin installation. Refer to the following steps for details.
Confirm that the kernel version is within the range supported by the "binary installation" method; if so, you can directly install the NPU driver and firmware.
For firmware and driver downloads, refer to: Firmware Download Link
For firmware installation, refer to: Install NPU Driver Firmware
If the driver is not installed, refer to the official Ascend documentation for installation. For example, for Ascend910, refer to: 910 Driver Installation Document.
Run the command npu-smi info, and if the NPU information is returned normally, it indicates that the NPU driver and firmware are ready.
Create the parent directory for component logs and the log directories for each component on the proper node, and set the appropriate owner and permissions for the directories. Execute the following command to create the parent directory for component logs.
Please create the proper log directory for each required component, as sketched below. In this example, only the Device Plugin component is needed. For other component requirements, refer to the official documentation.
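A minimal sketch, assuming the MindX DL default log path /var/log/mindx-dl and only the Device Plugin component; paths, owners, and permissions may differ in your environment, so refer to the official Ascend documentation:

# Parent directory for component logs (assumed default path)
mkdir -p /var/log/mindx-dl
chown root:root /var/log/mindx-dl
chmod 755 /var/log/mindx-dl

# Log directory for the Device Plugin component (assumed)
mkdir -p /var/log/mindx-dl/devicePlugin
chown root:root /var/log/mindx-dl/devicePlugin
chmod 750 /var/log/mindx-dl/devicePlugin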
Refer to the following commands to create labels on the proper nodes:
# Create this label on computing nodes where the driver is installed
kubectl label node {nodename} huawei.com.ascend/Driver=installed
kubectl label node {nodename} node-role.kubernetes.io/worker=worker
kubectl label node {nodename} workerselector=dls-worker-node
kubectl label node {nodename} host-arch=huawei-arm // or host-arch=huawei-x86, select according to the actual situation
kubectl label node {nodename} accelerator=huawei-Ascend910 // select according to the actual situation
# Create this label on control nodes
kubectl label node {nodename} masterselector=dls-master-node
"},{"location":"en/end-user/kpanda/gpu/ascend/ascend_driver_install.html#install-device-plugin-and-npuexporter","title":"Install Device Plugin and NpuExporter","text":"
Functional module path: Container Management -> Cluster, click the name of the target cluster, then click Helm Apps -> Helm Charts from the left navigation bar, and search for ascend-mindxdl.
DevicePlugin: Provides a general device plugin mechanism and standard device API interface for Kubernetes to use devices. It is recommended to use the default image and version.
NpuExporter: Based on the Prometheus/Telegraf ecosystem, this component provides interfaces to help users monitor the Ascend series AI processors and container-level allocation status. It is recommended to use the default image and version.
ServiceMonitor: Disabled by default. If enabled, you can view NPU-related monitoring in the observability module. To enable, ensure that the insight-agent is installed and running, otherwise, the ascend-mindxdl installation will fail.
isVirtualMachine: Disabled by default. If the NPU node is a virtual machine scenario, enable the isVirtualMachine parameter.
After a successful installation, two components will appear under the proper namespace, as shown below:
At the same time, the proper NPU information will also appear on the node information:
Once everything is ready, you can select the proper NPU device when creating a workload through the page, as shown below:
Note
For detailed information of how to use, refer to Using Ascend (Ascend) NPU.
Ascend virtualization is divided into dynamic virtualization and static virtualization. This document describes how to enable and use Ascend static virtualization capabilities.
To enable virtualization capabilities, you need to manually modify the startup parameters of the ascend-device-plugin-daemonset component. Refer to the following command:
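A minimal sketch of the kind of change involved; the namespace and the -presetVirtualDevice parameter name are assumptions based on the Ascend device plugin and may differ by version:

# Edit the device plugin DaemonSet (namespace is an assumption; adjust to your deployment)
kubectl edit daemonset ascend-device-plugin-daemonset -n mindx-dl

# In the container args, add the assumed static virtualization flag, for example:
#   - device-plugin -useAscendDocker=true -volcanoType=false -presetVirtualDevice=true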
After splitting the instance, manually restart the device-plugin pod, then use the kubectl describe command to check the resources of the registered node:
kubectl describe node {{nodename}}\n
"},{"location":"en/end-user/kpanda/gpu/ascend/vnpu.html#how-to-use-the-device","title":"How to Use the Device","text":"
When creating an application, specify the resource key as shown in the following YAML:
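A minimal sketch; the resource key huawei.com/Ascend910-4c below denotes an assumed 4-core vNPU slice and may differ depending on your split specification:

apiVersion: v1
kind: Pod
metadata:
  name: ascend-vnpu-demo
spec:
  containers:
    - name: demo
      image: ascendhub.huawei.com/public-ascendhub/ascend-pytorch:23.0.RC2-ubuntu18.04
      command: ["sleep", "infinity"]
      resources:
        requests:
          huawei.com/Ascend910-4c: 1   # assumed vNPU resource key for a 4-core slice
        limits:
          huawei.com/Ascend910-4c: 1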
NVIDIA, as a well-known graphics computing provider, offers various software and hardware solutions to enhance computational power. Among them, NVIDIA provides the following three solutions for GPU usage:
Full GPU refers to allocating the entire NVIDIA GPU to a single user or application. In this configuration, the application can fully occupy all the resources of the GPU and achieve maximum computational performance. Full GPU is suitable for workloads that require a large amount of computational resources and memory, such as deep learning training, scientific computing, etc.
vGPU is a virtualization technology that allows one physical GPU to be partitioned into multiple virtual GPUs, with each virtual GPU assigned to different virtual machines or users. vGPU enables multiple users to share the same physical GPU and independently use GPU resources in their respective virtual environments. Each virtual GPU can access a certain amount of compute power and memory capacity. vGPU is suitable for virtualized environments and cloud computing scenarios, providing higher resource utilization and flexibility.
MIG is a feature introduced by the NVIDIA Ampere architecture that allows one physical GPU to be divided into multiple physical GPU instances, each of which can be independently allocated to different users or workloads. Each MIG instance has its own compute resources, memory, and PCIe bandwidth, just like an independent virtual GPU. MIG provides finer-grained GPU resource allocation and management and allows dynamic adjustment of the number and size of instances based on demand. MIG is suitable for multi-tenant environments, containerized applications, batch jobs, and other scenarios.
Whether using vGPU in a virtualized environment or MIG on a physical GPU, NVIDIA provides users with more choices and optimized ways to utilize GPU resources. The Suanova container management platform fully supports the above NVIDIA capabilities. Users can easily access the full computational power of NVIDIA GPUs through simple UI operations, thereby improving resource utilization and reducing costs.
Single Mode: The node only exposes a single type of MIG device on all its GPUs. All GPUs on the node must:
Be of the same model (e.g., A100-SXM-40GB), with matching MIG profiles only for GPUs of the same model.
Have MIG configuration enabled, which requires a machine reboot to take effect.
Create identical GI and CI for exposing \"identical\" MIG devices across all products.
Mixed Mode: The node exposes mixed MIG device types on all its GPUs. Requesting a specific MIG device type requires the number of compute slices and total memory provided by the device type.
All GPUs on the node must: Be in the same product line (e.g., A100-SXM-40GB).
Each GPU can enable or disable MIG individually and freely configure any available mixture of MIG device types.
The k8s-device-plugin running on the node will:
Expose any GPUs not in MIG mode using the traditional nvidia.com/gpu resource type.
Expose individual MIG devices using resource types that follow the pattern nvidia.com/mig-<slice_count>g.<memory_size>gb .
For detailed instructions on enabling these configurations, refer to Offline Installation of GPU Operator.
"},{"location":"en/end-user/kpanda/gpu/nvidia/index.html#how-to-use","title":"How to Use","text":"
You can refer to the following links to quickly start using Suanova's management capabilities for NVIDIA GPUs.
Using Full NVIDIA GPU
Using NVIDIA vGPU
Using NVIDIA MIG
"},{"location":"en/end-user/kpanda/gpu/nvidia/full_gpu_userguide.html","title":"Using the Whole NVIDIA GPU Card for an Application","text":"
This section describes how to allocate an entire NVIDIA GPU to a single application on the AI platform.
AI platform container management platform has been deployed and is running properly.
The container management module has been connected to a Kubernetes cluster or a Kubernetes cluster has been created, and you can access the UI interface of the cluster.
GPU Operator has been offline installed and NVIDIA DevicePlugin has been enabled on the current cluster. Refer to Offline Installation of GPU Operator for instructions.
The GPUs in the current cluster have not undergone any virtualization operations and are not occupied by other applications.
"},{"location":"en/end-user/kpanda/gpu/nvidia/full_gpu_userguide.html#procedure","title":"Procedure","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/full_gpu_userguide.html#configuring-via-the-user-interface","title":"Configuring via the User Interface","text":"
Check if the cluster has detected the GPUs. Click Clusters -> Cluster Settings -> Addon Plugins to see if it has automatically enabled and detected the proper GPU types. Currently, the cluster will automatically enable GPU and set the GPU Type as Nvidia GPU .
Deploy a workload. Click Clusters -> Workloads , and deploy the workload using the image method. After selecting the type ( Nvidia GPU ), configure the number of physical cards used by the application:
Physical Card Count (nvidia.com/gpu): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host machine.
If the above value is configured incorrectly, scheduling failures and resource allocation issues may occur.
"},{"location":"en/end-user/kpanda/gpu/nvidia/full_gpu_userguide.html#configuring-via-yaml","title":"Configuring via YAML","text":"
To request GPU resources for a workload, add the nvidia.com/gpu: 1 parameter to the resource request and limit configuration in the YAML file. This parameter configures the number of physical cards used by the application.
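A minimal sketch; the image and names are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: full-gpu-demo
spec:
  containers:
    - name: cuda-demo
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
      command: ["sleep", "infinity"]
      resources:
        requests:
          nvidia.com/gpu: 1     # number of whole physical GPUs to mount
        limits:
          nvidia.com/gpu: 1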
AI platform comes with pre-installed driver images for the following three operating systems: Ubuntu 22.04, Ubuntu 20.04, and CentOS 7.9. The driver version is 535.104.12. Additionally, it includes the required Toolkit images for each operating system, so users no longer need to manually provide offline toolkit images.
This page demonstrates using AMD architecture with CentOS 7.9 (3.10.0-1160). If you need to deploy on Red Hat 8.4, refer to Uploading Red Hat gpu-operator Offline Image to the Bootstrap Node Repository and Building Offline Yum Source for Red Hat 8.4.
The kernel versions of the cluster nodes where the gpu-operator is to be deployed must be exactly the same. The distribution and GPU model of the nodes must fall within the scope specified in the GPU Support Matrix.
When installing the gpu-operator, select v23.9.0+2 or above.
systemOS : Select the operating system for the host. The current options are Ubuntu 22.04, Ubuntu 20.04, CentOS 7.9, and other. Please choose the correct operating system.
Namespace : Select the namespace for installing the plugin
Version: The version of the plugin. Here, we use version v23.9.0+2 as an example.
Failure Deletion: If the installation fails, it will delete the already installed associated resources. When enabled, Ready Wait will also be enabled by default.
Ready Wait: When enabled, the application will be marked as successfully installed only when all associated resources are in a ready state.
Detailed Logs: When enabled, detailed logs of the installation process will be recorded.
Driver.enable : Configure whether to deploy the NVIDIA driver on the node, default is enabled. If you have already deployed the NVIDIA driver on the node before using the gpu-operator, please disable this.
Driver.repository : Repository where the GPU driver image is located, default is nvidia's nvcr.io repository.
Driver.usePrecompiled : Enable the precompiled mode to install the driver.
Driver.version : Version of the GPU driver image, use default parameters for offline deployment. Configuration is only required for online installation. Different versions of the Driver image exist for different types of operating systems. For more details, refer to Nvidia GPU Driver Versions. Examples of Driver Version for different operating systems are as follows:
Note
When using the built-in operating system version, there is no need to modify the image version. For other operating system versions, please refer to Uploading Images to the Bootstrap Node Repository. Note that there is no need to include the operating system name, such as Ubuntu, CentOS, or Red Hat, in the version number; if the official image contains an operating system suffix, remove it manually.
For Red Hat systems, for example, 525.105.17
For Ubuntu systems, for example, 535-5.15.0-1043-nvidia
For CentOS systems, for example, 525.147.05
Driver.RepoConfig.ConfigMapName : Used to record the name of the offline yum repository configuration file for the gpu-operator. When using the pre-packaged offline bundle, refer to the following documents for different types of operating systems.
For detailed configuration methods, refer to Enabling MIG Functionality.
MigManager.Config.name : The name of the MIG split configuration file, used to define the MIG (GI, CI) split policy. The default is default-mig-parted-config . For custom parameters, refer to Enabling MIG Functionality.
After completing the configuration and creation of the above parameters:
If using full-card mode , GPU resources can be used when creating applications.
If using vGPU mode , after completing the above configuration and creation, proceed to vGPU Addon Installation.
If using MIG mode , you can apply a specific split specification to individual GPU nodes as needed; otherwise, GPUs are split according to the default value in MigManager.Config.
After splitting, applications can use MIG GPU resources.
"},{"location":"en/end-user/kpanda/gpu/nvidia/push_image_to_repo.html","title":"Uploading Red Hat GPU Operator Offline Image to Bootstrap Repository","text":"
This guide explains how to upload an offline image to the bootstrap repository using the nvcr.io/nvidia/driver:525.105.17-rhel8.4 offline driver image for Red Hat 8.4 as an example.
The bootstrap node and its components are running properly.
Prepare a node that has internet access and can access the bootstrap node. Docker should also be installed on this node. You can refer to Installing Docker for installation instructions.
"},{"location":"en/end-user/kpanda/gpu/nvidia/push_image_to_repo.html#procedure","title":"Procedure","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/push_image_to_repo.html#step-1-obtain-the-offline-image-on-an-internet-connected-node","title":"Step 1: Obtain the Offline Image on an Internet-Connected Node","text":"
Perform the following steps on the internet-connected node:
Pull the nvcr.io/nvidia/driver:525.105.17-rhel8.4 offline driver image:
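For example:

docker pull nvcr.io/nvidia/driver:525.105.17-rhel8.4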
Once the image is pulled, save it as a compressed archive named nvidia-driver.tar :
docker save nvcr.io/nvidia/driver:525.105.17-rhel8.4 > nvidia-driver.tar\n
Copy the compressed image archive nvidia-driver.tar to the bootstrap node:
scp nvidia-driver.tar user@ip:/root\n
For example:
scp nvidia-driver.tar root@10.6.175.10:/root\n
"},{"location":"en/end-user/kpanda/gpu/nvidia/push_image_to_repo.html#step-2-push-the-image-to-the-bootstrap-repository","title":"Step 2: Push the Image to the Bootstrap Repository","text":"
Perform the following steps on the bootstrap node:
Log in to the bootstrap node and import the compressed image archive nvidia-driver.tar :
docker load -i nvidia-driver.tar\n
View the imported image:
docker images -a | grep nvidia\n
Expected output:
nvcr.io/nvidia/driver e3ed7dee73e9 1 days ago 1.02GB\n
Retag the image to correspond to the target repository in the remote Registry repository:
docker tag <image-name> <registry-url>/<repository-name>:<tag>\n
Replace <image-name> with the name of the Nvidia image from the previous step, <registry-url> with the address of the Registry service on the bootstrap node, <repository-name> with the name of the repository you want to push the image to, and <tag> with the desired tag for the image.
For example:
docker tag nvcr.io/nvidia/driver 10.6.10.5/nvcr.io/nvidia/driver:525.105.17-rhel8.4\n
Check the GPU driver image version applicable to your kernel at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags. Use the kernel version to query the image version, then save the image using ctr export.
ctr i pull nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i export --all-platforms driver.tar.gz nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
Import the image into the cluster's container registry
ctr i import driver.tar.gz
ctr i tag nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 {your_registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i push {your_registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 --skip-verify=true
"},{"location":"en/end-user/kpanda/gpu/nvidia/ubuntu22.04_offline_install_driver.html#install-the-driver","title":"Install the Driver","text":"
Install the gpu-operator addon and set driver.usePrecompiled=true
Set driver.version=535, note that it should be 535, not 535.104.12
The AI platform comes with a pre-installed GPU Operator offline package for CentOS 7.9 with kernel version 3.10.0-1160. For other OS types or kernel versions, users need to manually build an offline yum source.
This guide explains how to build an offline yum source for CentOS 7.9 with a specific kernel version and use it when installing the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
The user has already installed the v0.12.0 or later version of the addon offline package on the platform.
Prepare a file server that is accessible from the cluster network, such as Nginx or MinIO.
Prepare a node that has internet access, can access the cluster where the GPU Operator will be deployed, and can access the file server. Docker should also be installed on this node. You can refer to Installing Docker for installation instructions.
This guide uses CentOS 7.9 with kernel version 3.10.0-1160.95.1.el7.x86_64 as an example to explain how to upgrade the pre-installed GPU Operator offline package's yum source.
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#check-os-and-kernel-versions-of-cluster-nodes","title":"Check OS and Kernel Versions of Cluster Nodes","text":"
Run the following commands on both the control node of the Global cluster and the node where GPU Operator will be deployed. If the OS and kernel versions of the two nodes are consistent, there is no need to build a yum source. You can directly refer to the Offline Installation of GPU Operator document for installation. If the OS or kernel versions of the two nodes are not consistent, please proceed to the next step.
Run the following command to view the distribution name and version of the node where GPU Operator will be deployed in the cluster.
cat /etc/redhat-release\n
Expected output:
CentOS Linux release 7.9 (Core)\n
The output shows the current node's OS version as CentOS 7.9.
Run the following command to view the kernel version of the node where GPU Operator will be deployed in the cluster.
uname -a\n
Expected output:
Linux localhost.localdomain 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux\n
The output shows the current node's kernel version as 3.10.0-1160.95.1.el7.x86_64.
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#create-the-offline-yum-source","title":"Create the Offline Yum Source","text":"
Perform the following steps on a node that has internet access and can access the file server:
Create a script file named yum.sh by running the following command:
vi yum.sh\n
Then press the i key to enter insert mode and enter the following content:
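The original script body is not reproduced here; as a minimal sketch, such a script typically downloads the kernel-related packages for the target kernel and builds a repo index. The package set, directory layout, and tool choices below are assumptions:

#!/bin/bash
# Usage: bash yum.sh <TARGET_KERNEL_VERSION>, e.g. bash yum.sh 3.10.0-1160.95.1
KERNEL_VERSION=$1
mkdir -p ./centos-base/Packages
# Download kernel development packages matching the target kernel (assumed package set)
yumdownloader --destdir=./centos-base/Packages --resolve \
  kernel-devel-${KERNEL_VERSION}.el7 kernel-headers-${KERNEL_VERSION}.el7 gcc make elfutils-libelf-devel
# Build the repository index
createrepo_c ./centos-base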
Press the Esc key to exit insert mode, then enter :wq to save and exit.
Run the yum.sh file:
bash -x yum.sh TARGET_KERNEL_VERSION\n
The TARGET_KERNEL_VERSION parameter is used to specify the kernel version of the cluster nodes.
Note: You don't need to include the distribution identifier (e.g., .el7.x86_64 ). For example:
bash -x yum.sh 3.10.0-1160.95.1\n
Now you have generated an offline yum source, centos-base , for the kernel version 3.10.0-1160.95.1.el7.x86_64 .
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#upload-the-offline-yum-source-to-the-file-server","title":"Upload the Offline Yum Source to the File Server","text":"
Perform the following steps on a node that has internet access and can access the file server. This step is used to upload the generated yum source from the previous step to a file server that can be accessed by the cluster where the GPU Operator will be deployed. The file server can be Nginx, MinIO, or any other file server that supports the HTTP protocol.
In this example, we will use the built-in MinIO as the file server. The MinIO details are as follows:
Run the following command in the current directory of the node to establish a connection between the node's local mc command-line tool and the MinIO server:
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123\n
The expected output should resemble the following:
Added __minio__ successfully.\n
mc is the command-line tool provided by MinIO for interacting with the MinIO server. For more details, refer to the MinIO Client documentation.
In the current directory of the node, create a bucket named centos-base :
mc mb -p minio/centos-base\n
The expected output should resemble the following:
Bucket created successfully __minio/centos-base__ .\n
Set the access policy of the bucket centos-base to allow public download. This will enable access during the installation of the GPU Operator:
mc anonymous set download minio/centos-base\n
The expected output should resemble the following:
Access permission for __minio/centos-base__ is set to __download__ \n
In the current directory of the node, copy the generated centos-base offline yum source to the minio/centos-base bucket on the MinIO server:
mc cp centos-base minio/centos-base --recursive\n
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#create-a-configmap-to-store-the-yum-source-info-in-the-cluster","title":"Create a ConfigMap to Store the Yum Source Info in the Cluster","text":"
Perform the following steps on the control node of the cluster where the GPU Operator will be deployed.
Run the following command to create a file named CentOS-Base.repo that specifies the configmap for the yum source storage:
# The file name must be CentOS-Base.repo, otherwise it cannot be recognized during the installation of the GPU Operator
cat > CentOS-Base.repo << EOF
[extension-0]
baseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address where the yum source is placed in step 3
gpgcheck = 0
name = kubean extension 0

[extension-1]
baseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address where the yum source is placed in step 3
gpgcheck = 0
name = kubean extension 1
EOF
Based on the created CentOS-Base.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
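For example:

kubectl create configmap local-repo-config -n gpu-operator --from-file=CentOS-Base.repo=./CentOS-Base.repo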
The expected output should resemble the following:
configmap/local-repo-config created\n
The local-repo-config configmap will be used to provide the value for the RepoConfig.ConfigMapName parameter during the installation of the GPU Operator. You can customize the configuration file name.
View the content of the local-repo-config configmap:
kubectl get configmap local-repo-config -n gpu-operator -oyaml\n
The expected output should resemble the following:
apiVersion: v1
data:
  CentOS-Base.repo: "[extension-0]\nbaseurl = http://10.6.232.5:32618/centos-base # The file server path where the yum source is placed in step 2\ngpgcheck = 0\nname = kubean extension 0\n\n[extension-1]\nbaseurl = http://10.6.232.5:32618/centos-base # The file server path where the yum source is placed in step 2\ngpgcheck = 0\nname = kubean extension 1\n"
kind: ConfigMap
metadata:
  creationTimestamp: "2023-10-18T01:59:02Z"
  name: local-repo-config
  namespace: gpu-operator
  resourceVersion: "59445080"
  uid: c5f0ebab-046f-442c-b932-f9003e014387
You have successfully created an offline yum source configuration file for the cluster where the GPU Operator will be deployed. You can use it during the offline installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html","title":"Building Red Hat 8.4 Offline Yum Source","text":"
The AI platform comes with pre-installed CentOS v7.9 and GPU Operator offline packages with kernel v3.10.0-1160. For other OS types or nodes with different kernels, users need to manually build the offline yum source.
This guide explains how to build an offline yum source package for Red Hat 8.4 based on any node in the Global cluster. It also demonstrates how to use it during the installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
The user has already installed the addon offline package v0.12.0 or higher on the platform.
The OS of the cluster nodes where the GPU Operator will be deployed must be Red Hat v8.4, and the kernel version must be identical.
Prepare a file server that can communicate with the cluster network where the GPU Operator will be deployed, such as Nginx or MinIO.
Prepare a node that can access the internet, the cluster where the GPU Operator will be deployed, and the file server. Ensure that Docker is already installed on this node.
The nodes in the Global cluster must be Red Hat 8.4 4.18.0-305.el8.x86_64.
This guide uses a node with Red Hat 8.4 4.18.0-305.el8.x86_64 as an example to demonstrate how to build an offline yum source package for Red Hat 8.4 based on any node in the Global cluster. It also explains how to use it during the installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-1-download-the-yum-source-from-the-bootstrap-node","title":"Step 1: Download the Yum Source from the Bootstrap Node","text":"
Perform the following steps on the master node of the Global cluster.
Use SSH or any other method to access any node in the Global cluster and run the following command:
cat /etc/yum.repos.d/extension.repo # View the contents of extension.repo.\n
The expected output should resemble the following:
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-2-download-the-elfutils-libelf-devel-0187-4el8x86_64rpm-package","title":"Step 2: Download the elfutils-libelf-devel-0.187-4.el8.x86_64.rpm Package","text":"
Perform the following steps on a node with internet access. Before proceeding, ensure that there is network connectivity between the node with internet access and the master node of the Global cluster.
Run the following command on the node with internet access to download the elfutils-libelf-devel-0.187-4.el8.x86_64.rpm package:
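For example, using yumdownloader from the yum-utils package; the repository providing this exact package version may differ in your environment, and the copy destination below is simply the directory used in Step 3 (the master node IP is a placeholder):
yum install -y yum-utils   # provides the yumdownloader tool\nyumdownloader --destdir . elfutils-libelf-devel-0.187-4.el8.x86_64\n# Copy the downloaded rpm to the Global cluster master node, for example:\nscp elfutils-libelf-devel-0.187-4.el8.x86_64.rpm root@<master-node-ip>:~/redhat-base-repo/extension-1/Packages/\n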
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-3-generate-the-local-yum-repository","title":"Step 3: Generate the Local Yum Repository","text":"
Perform the following steps on the master node of the Global cluster mentioned in Step 1.
Enter each of the two yum repository directories in turn:
cd ~/redhat-base-repo/extension-1/Packages\ncd ~/redhat-base-repo/extension-2/Packages\n
Run the following command in each directory to generate the repository index:
createrepo_c ./\n
You have now generated the offline yum source named redhat-base-repo for kernel version 4.18.0-305.el8.x86_64 .
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-4-upload-the-local-yum-repository-to-the-file-server","title":"Step 4: Upload the Local Yum Repository to the File Server","text":"
In this example, we will use MinIO, the file server built into the bootstrap node. However, you can choose any file server that suits your needs. The MinIO details used here are:
Access URL: http://10.5.14.200:9000 (usually {bootstrap node IP}:9000)
Login username: rootuser
Login password: rootpass123
On the current node, establish a connection between the local mc command-line tool and the Minio server by running the following command:
mc config host add minio <file_server_access_url> <username> <password>\n
For example:
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123\n
The expected output should be similar to:
Added __minio__ successfully.\n
The mc command-line tool is provided by the Minio file server as a client command-line tool. For more details, refer to the MinIO Client documentation.
Create a bucket named redhat-base on the MinIO server:
mc mb -p minio/redhat-base\n
The expected output should be similar to:
Bucket created successfully __minio/redhat-base__ .\n
Set the access policy of the redhat-base bucket to allow public downloads so that it can be accessed during the installation of the GPU Operator:
mc anonymous set download minio/redhat-base\n
The expected output should be similar to:
Access permission for __minio/redhat-base__ is set to __download__ \n
Copy the offline yum repository directory ( redhat-base-repo ) from the current node to the Minio server's minio/redhat-base bucket:
mc cp redhat-base-repo minio/redhat-base --recursive\n
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-5-create-a-configmap-to-store-yum-repository-information-in-the-cluster","title":"Step 5: Create a ConfigMap to Store Yum Repository Information in the Cluster","text":"
Perform the following steps on the control node of the cluster where you will deploy the GPU Operator.
Run the following command to create a file named redhat.repo , which specifies the configuration information for the yum repository storage:
# The file name must be redhat.repo, otherwise it won't be recognized when installing gpu-operator\ncat > redhat.repo << EOF\n[extension-0]\nbaseurl = http://10.5.14.200:9000/redhat-base/redhat-base-repo/Packages # The file server address where the yum source is stored in Step 1\ngpgcheck = 0\nname = kubean extension 0\n\n[extension-1]\nbaseurl = http://10.5.14.200:9000/redhat-base/redhat-base-repo/Packages # The file server address where the yum source is stored in Step 1\ngpgcheck = 0\nname = kubean extension 1\nEOF\n
Based on the created redhat.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
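A typical command for this step, assuming the redhat.repo file created above is in the current directory, is:
kubectl create configmap local-repo-config -n gpu-operator --from-file=./redhat.repo\n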
The local-repo-config ConfigMap is used to provide the value for the RepoConfig.ConfigMapName parameter during the installation of the GPU Operator. You can choose a different name for the ConfigMap.
View the contents of the local-repo-config ConfigMap:
kubectl get configmap local-repo-config -n gpu-operator -oyaml\n
You have successfully created the offline yum source configuration file for the cluster where the GPU Operator will be deployed. You can use it by specifying the RepoConfig.ConfigMapName parameter during the offline installation of the GPU Operator.
"},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html","title":"Build an Offline Yum Repository for Red Hat 7.9","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#introduction","title":"Introduction","text":"
The AI platform ships with a GPU Operator offline package pre-built for CentOS 7.9 with kernel 3.10.0-1160. You need to manually build an offline yum repository for other OS types or nodes with different kernel versions.
This page explains how to build an offline yum repository for Red Hat 7.9 based on any node in the Global cluster, and how to use the RepoConfig.ConfigMapName parameter when installing the GPU Operator.
The cluster nodes where the GPU Operator is to be deployed must be Red Hat 7.9 with the exact same kernel version.
Prepare a file server that can be connected to the cluster network where the GPU Operator is to be deployed, such as nginx or minio.
Prepare a node that can access the internet, the cluster where the GPU Operator is to be deployed, and the file server. Docker installation must be completed on this node.
The nodes in the global service cluster must be Red Hat 7.9.
"},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#steps","title":"Steps","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#1-build-offline-yum-repo-for-relevant-kernel","title":"1. Build Offline Yum Repo for Relevant Kernel","text":"
Download rhel7.9 ISO
Download the rhel7.9 ospackage that corresponds to your Kubean version.
Find the version number of Kubean in the Container Management section of the Global cluster under Helm Apps.
Download the rhel7.9 ospackage for that version from the Kubean repository.
Import offline resources using the installer.
Refer to the Import Offline Resources document.
"},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#2-download-offline-driver-image-for-red-hat-79-os","title":"2. Download Offline Driver Image for Red Hat 7.9 OS","text":"
Click here to view the download url.
"},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#3-upload-red-hat-gpu-operator-offline-image-to-boostrap-node-repository","title":"3. Upload Red Hat GPU Operator Offline Image to Boostrap Node Repository","text":"
Refer to Upload Red Hat GPU Operator Offline Image to Bootstrap Node Repository.
Note
That guide is based on RHEL 8.4, so make sure to adapt it for RHEL 7.9.
"},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#4-create-configmaps-in-the-cluster-to-save-yum-repository-information","title":"4. Create ConfigMaps in the Cluster to Save Yum Repository Information","text":"
Run the following command on the control node of the cluster where the GPU Operator is to be deployed.
Run the following command to create a file named CentOS-Base.repo that specifies where the yum repository is stored.
# The file name must be CentOS-Base.repo, otherwise it will not be recognized when installing gpu-operator\ncat > CentOS-Base.repo << EOF\n[extension-0]\nbaseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address of the bootstrap node, usually {bootstrap node IP} + {port 9000}\ngpgcheck = 0\nname = kubean extension 0\n\n[extension-1]\nbaseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address of the bootstrap node, usually {bootstrap node IP} + {port 9000}\ngpgcheck = 0\nname = kubean extension 1\nEOF\n
Based on the created CentOS-Base.repo file, create a ConfigMap named local-repo-config in the gpu-operator namespace:
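A typical command for this step, assuming the CentOS-Base.repo file created above is in the current directory, is:
kubectl create configmap local-repo-config -n gpu-operator --from-file=./CentOS-Base.repo\n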
The local-repo-config ConfigMap is used to provide the value of the RepoConfig.ConfigMapName parameter when installing gpu-operator, and its name can be customized by the user.
View the contents of the local-repo-config ConfigMap:
kubectl get configmap local-repo-config -n gpu-operator -oyaml\n
The expected output is as follows:
local-repo-config.yaml
apiVersion: v1\ndata:\n CentOS-Base.repo: \"[extension-0]\\nbaseurl = http://10.6.232.5:32618/centos-base # The file path where yum repository is placed in Step 2 \\ngpgcheck = 0\\nname = kubean extension 0\\n \\n[extension-1]\\nbaseurl\n = http://10.6.232.5:32618/centos-base # The file path where yum repository is placed in Step 2 \\ngpgcheck = 0\\nname\n = kubean extension 1\\n\"\nkind: ConfigMap\nmetadata:\n creationTimestamp: \"2023-10-18T01:59:02Z\"\n name: local-repo-config\n namespace: gpu-operator\n resourceVersion: \"59445080\"\n uid: c5f0ebab-046f-442c-b932-f9003e014387\n
At this point, you have successfully created the offline yum repository ConfigMap for the cluster where the GPU Operator is to be deployed. You can use it by specifying the RepoConfig.ConfigMapName parameter during the Offline Installation of GPU Operator.
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/index.html","title":"Overview of NVIDIA Multi-Instance GPU (MIG)","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/mig/index.html#mig-scenarios","title":"MIG Scenarios","text":"
Multi-Tenant Cloud Environments:
MIG allows cloud service providers to partition a physical GPU into multiple independent GPU instances, which can be allocated to different tenants. This enables resource isolation and independence, meeting the GPU computing needs of multiple tenants.
Containerized Applications:
MIG enables finer-grained GPU resource management in containerized environments. By partitioning a physical GPU into multiple MIG instances, each container can be assigned with dedicated GPU compute resources, providing better performance isolation and resource utilization.
Batch Processing Jobs:
For batch processing jobs requiring large-scale parallel computing, MIG provides higher computational performance and larger memory capacity. Each MIG instance can utilize a portion of the physical GPU's compute resources, accelerating the processing of large-scale computational tasks.
AI/Machine Learning Training:
MIG offers increased compute power and memory capacity for training large-scale deep learning models. By partitioning the physical GPU into multiple MIG instances, each instance can independently carry out model training, improving training efficiency and throughput.
In general, NVIDIA MIG is suitable for scenarios that require finer-grained allocation and management of GPU resources. It enables resource isolation, improved performance utilization, and meets the GPU computing needs of multiple users or applications.
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/index.html#overview-of-mig","title":"Overview of MIG","text":"
NVIDIA Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA on H100, A100, and A30 series GPUs. Its purpose is to divide a physical GPU into multiple GPU instances to provide finer-grained resource sharing and isolation. MIG can split a GPU into up to seven GPU instances, allowing a single physical GPU to provide separate GPU resources to multiple users and maximizing GPU utilization.
This feature enables multiple applications or users to share GPU resources simultaneously, improving the utilization of computational resources and increasing system scalability.
With MIG, each GPU instance's processor has an independent and isolated path throughout the entire memory system, including cross-switch ports on the chip, L2 cache groups, memory controllers, and DRAM address buses, all uniquely allocated to a single instance.
This ensures that the workload of individual users can run with predictable throughput and latency, along with identical L2 cache allocation and DRAM bandwidth. MIG can partition available GPU compute resources (such as streaming multiprocessors or SMs and GPU engines like copy engines or decoders) to provide defined quality of service (QoS) and fault isolation for different clients such as virtual machines, containers, or processes. MIG enables multiple GPU instances to run in parallel on a single physical GPU.
MIG allows multiple vGPUs (and virtual machines) to run in parallel on a single GPU instance while retaining the isolation guarantees provided by vGPU. For more details on using vGPU and MIG for GPU partitioning, refer to NVIDIA Multi-Instance GPU and NVIDIA Virtual Compute Server.
The following diagram provides an overview of MIG, illustrating how it virtualizes one physical GPU into seven GPU instances that can be used by multiple users.
SM (Streaming Multiprocessor): The core computational unit of a GPU responsible for executing graphics rendering and general-purpose computing tasks. Each SM contains a group of CUDA cores, as well as shared memory, register files, and other resources, capable of executing multiple threads concurrently. Each MIG instance has a certain number of SMs and other related resources, along with the allocated memory slices.
GPU Memory Slice : The smallest portion of GPU memory, including the proper memory controller and cache. A GPU memory slice is approximately one-eighth of the total GPU memory resources in terms of capacity and bandwidth.
GPU SM Slice : The smallest computational unit of SMs on a GPU. When configuring in MIG mode, the GPU SM slice is approximately one-seventh of the total available SMs in the GPU.
GPU Slice : The GPU slice represents the smallest portion of the GPU, consisting of a single GPU memory slice and a single GPU SM slice combined together.
GPU Instance (GI): A GPU instance is the combination of a GPU slice and GPU engines (DMA, NVDEC, etc.). Anything within a GPU instance always shares all GPU memory slices and other GPU engines, but its SM slice can be further subdivided into Compute Instances (CIs). A GPU instance provides memory QoS. Each GPU slice contains dedicated GPU memory resources, limiting available capacity and bandwidth while providing memory QoS. Each GPU memory slice gets one-eighth of the total GPU memory resources, and each GPU SM slice gets one-seventh of the total SM count.
Compute Instance (CI): A Compute Instance represents the smallest computational unit within a GPU instance. It consists of a subset of SMs, along with dedicated register files, shared memory, and other resources. Each CI has its own CUDA context and can run independent CUDA kernels. The number of CIs in a GPU instance depends on the number of available SMs and the configuration chosen during MIG setup.
Instance Slice : An Instance Slice represents a single CI within a GPU instance. It is the combination of a subset of SMs and a portion of the GPU memory slice. Each Instance Slice provides isolation and resource allocation for individual applications or users running on the GPU instance.
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/index.html#key-benefits-of-mig","title":"Key Benefits of MIG","text":"
Resource Sharing: MIG allows a single physical GPU to be divided into multiple GPU instances, providing efficient sharing of GPU resources among different users or applications. This maximizes GPU utilization and enables improved performance isolation.
Fine-Grained Resource Allocation: With MIG, GPU resources can be allocated at a finer granularity, allowing for more precise partitioning and allocation of compute power and memory capacity.
Improved Performance Isolation: Each MIG instance operates independently with its dedicated resources, ensuring predictable throughput and latency for individual users or applications. This improves performance isolation and prevents interference between different workloads running on the same GPU.
Enhanced Security and Fault Isolation: MIG provides better security and fault isolation by ensuring that each user or application has its dedicated GPU resources. This prevents unauthorized access to data and mitigates the impact of faults or errors in one instance on others.
Increased Scalability: MIG enables the simultaneous usage of GPU resources by multiple users or applications, increasing system scalability and accommodating the needs of various workloads.
Efficient Containerization: By using MIG in containerized environments, GPU resources can be effectively allocated to different containers, improving performance isolation and resource utilization.
Overall, MIG offers significant advantages in terms of resource sharing, fine-grained allocation, performance isolation, security, scalability, and containerization, making it a valuable feature for various GPU computing scenarios.
Check the system requirements for the GPU driver installation on the target node: GPU Support Matrix
Ensure that the cluster nodes have GPUs of the proper models (NVIDIA H100, A100, and A30 Tensor Core GPUs). For more information, see the GPU Support Matrix.
All GPUs on the nodes must belong to the same product line (e.g., A100-SXM-40GB).
When installing the Operator, you need to set the MigManager Config parameter accordingly. The default setting is default-mig-parted-config. You can also customize the sharding policy configuration file:
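For illustration, a custom entry in the same format as default-mig-parted-config might look like the following; the profile names and counts are examples only and must match what your GPU model supports:
version: v1\nmig-configs:\n  custom-config:           # the key referenced as the sharding policy\n    - devices: all\n      mig-enabled: true\n      mig-devices:\n        \"1g.10gb\": 2       # example profiles; run nvidia-smi mig -lgip to see what your GPU supports\n        \"2g.20gb\": 2\n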
After successfully installing the GPU Operator, the node runs in full GPU mode by default. An indicator appears on the node management page, as shown below:
Click the \u2507 at the right side of the node list, select a GPU mode to switch, and then choose the proper MIG mode and sharding policy. Here, we take MIXED mode as an example:
There are two configurations here:
MIG Policy: Mixed and Single.
Sharding Policy: The policy here needs to match the key in the default-mig-parted-config (or user-defined sharding policy) configuration file.
After clicking the OK button, wait for about a minute and refresh the page. The node will have switched to the selected MIG mode:
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/mig_command.html","title":"MIG Related Commands","text":"
GI Related Commands:
| Subcommand | Description |
| --- | --- |
| nvidia-smi mig -lgi | View the list of created GI instances |
| nvidia-smi mig -dgi -gi {gi instance id} | Delete a specific GI instance |
| nvidia-smi mig -lgip | View the profiles of GI |
| nvidia-smi mig -cgi {profile id} | Create a GI using the specified profile ID |
CI Related Commands:
| Subcommand | Description |
| --- | --- |
| nvidia-smi mig -lcip { -gi {gi Instance ID}} | View the profiles of CI; specifying -gi shows the CIs that can be created for a particular GI instance |
| nvidia-smi mig -lci | View the list of created CI instances |
| nvidia-smi mig -cci {profile id} -gi {gi instance id} | Create a CI instance within the specified GI |
| nvidia-smi mig -dci -ci {ci instance id} | Delete a specific CI instance |
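As an illustrative sequence using the GI and CI commands above (the profile and instance IDs are examples only; list the profiles first to see what your GPU supports):
nvidia-smi mig -lgip                 # list the GI profiles supported by this GPU and their IDs\nnvidia-smi mig -cgi 19 -i 0          # example: create a GI with profile ID 19 on GPU 0\nnvidia-smi mig -lgi                  # confirm the GI instance and note its instance ID\nnvidia-smi mig -lcip -gi 1           # list the CI profiles available inside GI instance 1\nnvidia-smi mig -cci 0 -gi 1          # example: create a CI with profile ID 0 inside GI instance 1\n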
GI+CI Related Commands:
| Subcommand | Description |
| --- | --- |
| nvidia-smi mig -i 0 -cgi {gi profile id} -C {ci profile id} | Create a GI + CI instance directly |
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/mig_usage.html","title":"Using MIG GPU Resources","text":"
This section explains how applications can use MIG GPU resources.
The AI platform container management module is deployed and running properly.
The container management module is integrated with a Kubernetes cluster or a Kubernetes cluster is created, and the UI interface of the cluster can be accessed.
NVIDIA DevicePlugin and MIG capabilities are enabled. Refer to Offline installation of GPU Operator for details.
The nodes in the cluster have GPUs of the proper models.
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/mig_usage.html#using-mig-gpu-through-the-ui","title":"Using MIG GPU through the UI","text":"
Confirm whether the cluster has recognized the GPU type.
Go to Cluster Details -> Nodes and check if it has been correctly recognized as MIG.
When deploying an application using an image, you can select and use NVIDIA MIG resources.
Example of MIG Single mode (used in the same way as a full GPU):
Note
The MIG single policy allows users to request and use GPU resources in the same way as a full GPU (nvidia.com/gpu). The difference is that these resources can be a portion of the GPU (a MIG device) rather than the entire GPU. Learn more from the GPU MIG Mode Design.
MIG Mixed Mode
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/mig_usage.html#using-mig-through-yaml-configuration","title":"Using MIG through YAML Configuration","text":"
Expose MIG devices through the nvidia.com/mig-<slice>g.<memory>gb resource type (for example, nvidia.com/mig-1g.5gb):
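A minimal sketch of a Pod requesting a MIG device in mixed mode; the resource name must match a MIG profile actually configured on the node, and the image is only an example:
apiVersion: v1\nkind: Pod\nmetadata:\n  name: mig-demo\nspec:\n  containers:\n    - name: cuda-demo\n      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image\n      command: [\"sleep\", \"infinity\"]\n      resources:\n        limits:\n          nvidia.com/mig-1g.5gb: 1   # example MIG resource; match the profile on your node\n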
After entering the container, you can check if only one MIG device is being used:
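For example, running nvidia-smi inside the container should list only the single MIG device allocated to it (the output below is illustrative):
nvidia-smi -L\n# GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-xxxx)\n#   MIG 1g.5gb Device 0: (UUID: MIG-xxxx)\n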
"},{"location":"en/end-user/kpanda/gpu/nvidia/vgpu/hami.html","title":"Build a vGPU Memory Oversubscription Image","text":"
The vGPU memory oversubscription feature has been removed from the HAMi project. To use it, you need to rebuild the image with a libvgpu.so file that supports memory oversubscription.
Dockerfile
FROM docker.m.daocloud.io/projecthami/hami:v2.3.11\nCOPY libvgpu.so /k8s-vgpu/lib/nvidia/\n
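A hedged example of building and pushing the rebuilt image; the target image name and registry are placeholders, and the libvgpu.so file that supports oversubscription must be present in the build context:
docker build -t <your-registry>/hami:v2.3.11-oversub .\ndocker push <your-registry>/hami:v2.3.11-oversub\n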
To virtualize a single NVIDIA GPU into multiple virtual GPUs and allocate them to different virtual machines or users, you can use NVIDIA's vGPU capability. This section explains how to install the vGPU plugin on the AI platform, which is a prerequisite for using NVIDIA vGPU capability.
During the installation of vGPU, several basic modification parameters are provided. If you need to modify advanced parameters, click the YAML column to make changes:
deviceMemoryScaling : NVIDIA device memory scaling factor, the input value must be an integer, with a default value of 1. It can be greater than 1 (enabling virtual memory, experimental feature). For an NVIDIA GPU with a memory size of M, if we configure the devicePlugin.deviceMemoryScaling parameter as S, in a Kubernetes cluster where we have deployed our device plugin, the vGPUs assigned from this GPU will have a total memory of S * M .
deviceSplitCount : An integer type, with a default value of 10. Number of GPU splits, each GPU cannot be assigned more tasks than its configuration count. If configured as N, each GPU can have up to N tasks simultaneously.
Resources : Represents the resource usage of the vgpu-device-plugin and vgpu-schedule pods.
After a successful installation, you will see two types of pods in the specified namespace, indicating that the NVIDIA vGPU plugin has been successfully installed:
After a successful installation, you can deploy applications using vGPU resources.
Note
NVIDIA vGPU Addon does not support upgrading directly from the older v2.0.0 to the latest v2.0.0+1; To upgrade, please uninstall the older version and then reinstall the latest version.
"},{"location":"en/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.html","title":"Using NVIDIA vGPU in Applications","text":"
This section explains how to use the vGPU capability on the AI platform.
The nodes in the cluster have GPUs of the proper models.
vGPU Addon has been successfully installed. Refer to Installing GPU Addon for details.
GPU Operator is installed, and the Nvidia.DevicePlugin capability is disabled. Refer to Offline Installation of GPU Operator for details.
"},{"location":"en/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.html#procedure","title":"Procedure","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.html#using-vgpu-through-the-ui","title":"Using vGPU through the UI","text":"
Confirm if the cluster has detected GPUs. Click the Clusters -> Cluster Settings -> Addon Plugins and check if the GPU plugin has been automatically enabled and the proper GPU type has been detected. Currently, the cluster will automatically enable the GPU addon and set the GPU Type as Nvidia vGPU .
Deploy a workload by clicking Clusters -> Workloads . When deploying a workload using an image, select the type Nvidia vGPU , and you will be prompted with the following parameters:
Number of Physical Cards (nvidia.com/vgpu) : Indicates how many physical cards need to be mounted by the current pod. The input value must be an integer and less than or equal to the number of cards on the host machine.
GPU Cores (nvidia.com/gpucores): Indicates the GPU cores utilized by each card, with a value range from 0 to 100. Setting it to 0 means no enforced isolation, while setting it to 100 means exclusive use of the entire card.
GPU Memory (nvidia.com/gpumem): Indicates the GPU memory occupied by each card, with a value in MB. The minimum value is 1, and the maximum value is the total memory of the card.
If there are issues with the configuration values above, it may result in scheduling failure or inability to allocate resources.
"},{"location":"en/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.html#using-vgpu-through-yaml-configuration","title":"Using vGPU through YAML Configuration","text":"
Refer to the following workload configuration and add the parameter nvidia.com/vgpu: '1' in the resource requests and limits section to configure the number of physical cards used by the application.
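Since the full manifest is not reproduced here, the following is a minimal sketch consistent with the description below; the workload name and image are placeholders:
apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: vgpu-demo\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app: vgpu-demo\n  template:\n    metadata:\n      labels:\n        app: vgpu-demo\n    spec:\n      containers:\n        - name: cuda-demo\n          image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image\n          command: [\"sleep\", \"infinity\"]\n          resources:\n            requests:\n              nvidia.com/vgpu: '1'       # number of physical cards\n              nvidia.com/gpucores: '20'  # 20% of GPU cores per card\n              nvidia.com/gpumem: '200'   # 200 MB of GPU memory per card\n            limits:\n              nvidia.com/vgpu: '1'\n              nvidia.com/gpucores: '20'\n              nvidia.com/gpumem: '200'\n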
This YAML configuration requests the application to use vGPU resources. It specifies that each card should utilize 20% of GPU cores, 200MB of GPU memory, and requests 1 GPU card.
"},{"location":"en/end-user/kpanda/gpu/volcano/volcano-gang-scheduler.html","title":"Using Volcano's Gang Scheduler","text":"
The Gang scheduling policy is one of the core scheduling algorithms of the volcano-scheduler. It satisfies the \"All or nothing\" scheduling requirement during the scheduling process, preventing arbitrary scheduling of Pods that could waste cluster resources. The specific algorithm observes whether the number of scheduled Pods under a Job meets the minimum running quantity. When the Job's minimum running quantity is satisfied, scheduling actions are performed for all Pods under the Job; otherwise, no actions are taken.
The Gang scheduling algorithm, based on the concept of a Pod group, is particularly suitable for scenarios that require multi-process collaboration. AI scenarios often involve complex workflows, such as Data Ingestion, Data Analysis, Data Splitting, Training, Serving, and Logging, which require a group of containers to work together. This makes the Gang scheduling policy based on pods very appropriate.
In multi-threaded parallel computing communication scenarios under the MPI computation framework, Gang scheduling is also very suitable because it requires master and slave processes to work together. High relevance among containers in a pod may lead to resource contention, and overall scheduling allocation can effectively resolve deadlocks.
In scenarios with insufficient cluster resources, the Gang scheduling policy significantly improves the utilization of cluster resources. For example, if the cluster can currently accommodate only 2 Pods, but the minimum number of Pods required for scheduling is 3, then all Pods of this Job will remain pending until the cluster can accommodate 3 Pods, at which point the Pods will be scheduled. This effectively prevents the partial scheduling of Pods, which would not meet the requirements and would occupy resources, making other Jobs unable to run.
The Gang Scheduler is the core scheduling plugin of Volcano, and it is enabled by default upon installing Volcano. When creating a workload, you only need to specify the scheduler name as Volcano.
Volcano schedules based on PodGroups. When creating a workload, there is no need to manually create PodGroup resources; Volcano will automatically create them based on the workload information. Below is an example of a PodGroup:
minMember: Represents the minimum number of Pods or tasks that need to run under this PodGroup. If the cluster resources do not meet the requirements to run the number of tasks specified by minMember, the scheduler will not schedule any tasks within this PodGroup.
minResources: Represents the minimum resources required to run this PodGroup. If the allocatable resources of the cluster do not meet minResources, the scheduler will not schedule any tasks within this PodGroup.
priorityClassName: Represents the priority of this PodGroup, used by the scheduler to sort all PodGroups within the queue during scheduling. system-node-critical and system-cluster-critical are two reserved values indicating the highest priority. If not specifically designated, the default priority or zero priority is used.
queue: Represents the queue to which this PodGroup belongs. The queue must be pre-created and in the open state.
In a multi-threaded parallel computing communication scenario under the MPI computation framework, we need to ensure that all Pods can be successfully scheduled to ensure the job is completed correctly. Setting minAvailable to 4 means that 1 mpimaster and 3 mpiworkers are required to run.
apiVersion: scheduling.volcano.sh/v1beta1\nkind: PodGroup\nmetadata:\n annotations:\n creationTimestamp: \"2024-05-28T09:18:50Z\"\n generation: 5\n labels:\n volcano.sh/job-type: MPI\n name: lm-mpi-job-9c571015-37c7-4a1a-9604-eaa2248613f2\n namespace: default\n ownerReferences:\n - apiVersion: batch.volcano.sh/v1alpha1\n blockOwnerDeletion: true\n controller: true\n kind: Job\n name: lm-mpi-job\n uid: 9c571015-37c7-4a1a-9604-eaa2248613f2\n resourceVersion: \"25173454\"\n uid: 7b04632e-7cff-4884-8e9a-035b7649d33b\nspec:\n minMember: 4\n minResources:\n count/pods: \"4\"\n cpu: 3500m\n limits.cpu: 3500m\n pods: \"4\"\n requests.cpu: 3500m\n minTaskMember:\n mpimaster: 1\n mpiworker: 3\n queue: default\nstatus:\n conditions:\n - lastTransitionTime: \"2024-05-28T09:19:01Z\"\n message: '3/4 tasks in gang unschedulable: pod group is not ready, 1 Succeeded,\n 3 Releasing, 4 minAvailable'\n reason: NotEnoughResources\n status: \"True\"\n transitionID: f875efa5-0358-4363-9300-06cebc0e7466\n type: Unschedulable\n - lastTransitionTime: \"2024-05-28T09:18:53Z\"\n reason: tasks in gang are ready to be scheduled\n status: \"True\"\n transitionID: 5a7708c8-7d42-4c33-9d97-0581f7c06dab\n type: Scheduled\n phase: Pending\n succeeded: 1\n
From the PodGroup, it can be seen that it is associated with the workload through ownerReferences and sets the minimum number of running Pods to 4.
"},{"location":"en/end-user/kpanda/gpu/volcano/volcano_user_guide.html","title":"Use Volcano for AI Compute","text":""},{"location":"en/end-user/kpanda/gpu/volcano/volcano_user_guide.html#usage-scenarios","title":"Usage Scenarios","text":"
Kubernetes has become the de facto standard for orchestrating and managing cloud-native applications, and an increasing number of applications are choosing to migrate to K8s. The fields of artificial intelligence and machine learning inherently involve a large number of compute-intensive tasks, and developers are very willing to build AI platforms based on Kubernetes to fully leverage its resource management, application orchestration, and operations monitoring capabilities. However, the default Kubernetes scheduler was initially designed primarily for long-running services and has many shortcomings in batch and elastic scheduling for AI and big data tasks. For example, resource contention issues:
Take TensorFlow job scenarios as an example. TensorFlow jobs include two different roles, PS and Worker, and the Pods for these two roles need to work together to complete the entire job. If only one type of role Pod is running, the entire job cannot be executed properly. The default scheduler schedules Pods one by one and is unaware of the PS and Worker roles in a Kubeflow TFJob. In a high-load cluster (insufficient resources), multiple jobs may each be allocated some resources to run a portion of their Pods, but the jobs cannot complete successfully, leading to resource waste. For instance, if a cluster has 4 GPUs and both TFJob1 and TFJob2 each have 4 Workers, TFJob1 and TFJob2 might each be allocated 2 GPUs. However, both TFJob1 and TFJob2 require 4 GPUs to run. This mutual waiting for resource release creates a deadlock situation, resulting in GPU resource waste.
Volcano is the first Kubernetes-based container batch computing platform under CNCF, focusing on high-performance computing scenarios. It fills in the missing functionalities of Kubernetes in fields such as machine learning, big data, and scientific computing, providing essential support for these high-performance workloads. Additionally, Volcano seamlessly integrates with mainstream computing frameworks like Spark, TensorFlow, and PyTorch, and supports hybrid scheduling of heterogeneous devices, including CPUs and GPUs, effectively resolving the deadlock issues mentioned above.
The following sections will introduce how to install and use Volcano.
Find Volcano in Cluster Details -> Helm Apps -> Helm Charts and install it.
Check and confirm whether Volcano is installed successfully, that is, whether the components volcano-admission, volcano-controllers, and volcano-scheduler are running properly.
Typically, Volcano is used in conjunction with the AI Lab to achieve an effective closed-loop process for the development and training of datasets, Notebooks, and task training.
"},{"location":"en/end-user/kpanda/gpu/volcano/volcano_user_guide.html#volcano-use-cases","title":"Volcano Use Cases","text":"
Volcano is a standalone scheduler. To enable the Volcano scheduler when creating workloads, simply specify the scheduler's name (schedulerName: volcano).
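A minimal sketch of a workload that hands scheduling to Volcano; the names and image are placeholders:
apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: volcano-demo\nspec:\n  replicas: 2\n  selector:\n    matchLabels:\n      app: volcano-demo\n  template:\n    metadata:\n      labels:\n        app: volcano-demo\n    spec:\n      schedulerName: volcano   # use the Volcano scheduler instead of the default scheduler\n      containers:\n        - name: nginx\n          image: nginx:latest   # example image\n          resources:\n            requests:\n              cpu: 250m\n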
The VolcanoJob resource is an extension of the Job in Volcano, breaking the Job down into smaller working units called tasks, which can interact with each other.
"},{"location":"en/end-user/kpanda/gpu/volcano/volcano_user_guide.html#parallel-computing-with-mpi","title":"Parallel Computing with MPI","text":"
In multi-threaded parallel computing communication scenarios under the MPI computing framework, we need to ensure that all Pods are successfully scheduled to guarantee the task's proper completion. Setting minAvailable to 4 indicates that 1 mpimaster and 3 mpiworkers are required to run. By simply setting the schedulerName field value to \"volcano,\" you can enable the Volcano scheduler.
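A simplified VolcanoJob sketch for this MPI scenario; the images and commands are placeholders, and a real job also needs the full ssh/svc plugin wiring shown in the Volcano MPI examples:
apiVersion: batch.volcano.sh/v1alpha1\nkind: Job\nmetadata:\n  name: lm-mpi-job\nspec:\n  minAvailable: 4\n  schedulerName: volcano\n  plugins:\n    ssh: []\n    svc: []\n  tasks:\n    - replicas: 1\n      name: mpimaster\n      template:\n        spec:\n          restartPolicy: OnFailure\n          containers:\n            - name: mpimaster\n              image: <mpi-image>   # example image containing mpirun and the workload\n              command: [\"/bin/sh\", \"-c\", \"mpirun --allow-run-as-root -np 3 hostname\"]\n    - replicas: 3\n      name: mpiworker\n      template:\n        spec:\n          restartPolicy: OnFailure\n          containers:\n            - name: mpiworker\n              image: <mpi-image>   # example image\n              command: [\"/bin/sh\", \"-c\", \"/usr/sbin/sshd -D\"]\n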
Helm is a package management tool for Kubernetes, which makes it easy for users to quickly discover, share, and use applications built with Kubernetes. Container Management provides hundreds of Helm charts, covering storage, network, monitoring, database, and other main use cases. With these templates, you can quickly deploy and easily manage Helm apps through the UI. In addition, it supports adding more personalized templates through Add Helm repository to meet various needs.
Key Concepts:
There are a few key concepts to understand when using Helm:
Chart: A Helm installation package, which contains the images, dependencies, and resource definitions required to run an application, and may also contain service definitions in the Kubernetes cluster, similar to a formula in Homebrew, a dpkg package in APT, or an rpm file in Yum. Charts are called Helm Charts in AI platform.
Release: A Chart instance running on the Kubernetes cluster. A Chart can be installed multiple times in the same cluster, and each installation will create a new Release. Release is called Helm Apps in AI platform.
Repository: A repository for publishing and storing Charts. Repository is called Helm Repositories in AI platform.
For more details, refer to Helm official website.
Related operations:
Manage Helm apps, including installing, updating, uninstalling Helm apps, viewing Helm operation records, etc.
Manage Helm repository, including installing, updating, deleting Helm repository, etc.
"},{"location":"en/end-user/kpanda/helm/Import-addon.html","title":"Import Custom Helm Apps into Built-in Addons","text":"
This article explains how to import Helm apps into the system's built-in addons in both offline and online environments.
charts-syncer is available and running. If not, you can click here to download.
The Helm Chart has been adapted for charts-syncer. This means adding a .relok8s-images.yaml file to the Helm Chart. This file should list all the images used by the Chart, including any images that are not directly referenced in the Chart but are used indirectly, such as images used by an Operator.
Note
Refer to image-hints-file for instructions on how to write a Chart. It is required to separate the registry and repository of the image because the registry/repository needs to be replaced or modified when loading the image.
The installer's fire cluster has charts-syncer installed. If you are importing custom Helm apps into the installer's fire cluster, you can skip the download and proceed directly to the adaptation step. If the charts-syncer binary is not installed, you can download it now.
Go to Container Management -> Helm Apps -> Helm Repositories , search for the addon, and obtain the built-in repository address and username/password (the default username/password for the system's built-in repository is rootuser/rootpass123).
Sync the Helm Chart to the built-in repository addon of the container management system
Write the following configuration file, modify it according to your specific configuration, and save it as sync-dao-2048.yaml .
source: # helm charts source information\n repo:\n kind: HARBOR # It can also be any other supported Helm Chart repository type, such as CHARTMUSEUM\n url: https://release-ci.daocloud.io/chartrepo/community # Change to the chart repo URL\n #auth: # username/password, if no password is set, leave it blank\n #username: \"admin\"\n #password: \"Harbor12345\"\ncharts: # charts to sync\n - name: dao-2048 # helm charts information, if not specified, sync all charts in the source helm repo\n versions:\n - 1.4.1\ntarget: # helm charts target information\n containerRegistry: 10.5.14.40 # image repository URL\n repo:\n kind: CHARTMUSEUM # It can also be any other supported Helm Chart repository type, such as HARBOR\n url: http://10.5.14.40:8081 # Change to the correct chart repo URL, you can verify the address by using helm repo add $HELM-REPO\n auth: # username/password, if no password is set, leave it blank\n username: \"rootuser\"\n password: \"rootpass123\"\n containers:\n # kind: HARBOR # If the image repository is HARBOR and you want charts-syncer to automatically create an image repository, fill in this field\n # auth: # username/password, if no password is set, leave it blank\n # username: \"admin\"\n # password: \"Harbor12345\"\n\n# leverage .relok8s-images.yaml file inside the Charts to move the container images too\nrelocateContainerImages: true\n
Run the charts-syncer command to sync the Chart and its included images
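The typical invocation, using the configuration file written above, looks like this:
charts-syncer sync --config sync-dao-2048.yaml\n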
I1222 15:01:47.119777 8743 sync.go:45] Using config file: \"examples/sync-dao-2048.yaml\"\nW1222 15:01:47.234238 8743 syncer.go:263] Ignoring skipDependencies option as dependency sync is not supported if container image relocation is true or syncing from/to intermediate directory\nI1222 15:01:47.234685 8743 sync.go:58] There is 1 chart out of sync!\nI1222 15:01:47.234706 8743 sync.go:66] Syncing \"dao-2048_1.4.1\" chart...\n.relok8s-images.yaml hints file found\nComputing relocation...\n\nRelocating dao-2048@1.4.1...\nPushing 10.5.14.40/daocloud/dao-2048:v1.4.1...\nDone\nDone moving /var/folders/vm/08vw0t3j68z9z_4lcqyhg8nm0000gn/T/charts-syncer869598676/dao-2048-1.4.1.tgz\n
Once the previous step is completed, go to Container Management -> Helm Apps -> Helm Repositories , find the proper addon, click Sync Repository in the action column, and you will see the uploaded Helm apps in the Helm template.
You can then proceed with normal installation, upgrade, and uninstallation.
The Helm repo address for the online environment is release.daocloud.io. If the user does not have permission to add a Helm repo, they will not be able to import custom Helm apps into the system's built-in addons. In that case, you can add your own Helm repository and then integrate it into the platform by following the same steps used for syncing the Helm Chart in the offline environment.
The container management module supports interface-based management of Helm, including creating Helm instances using Helm charts, customizing Helm instance arguments, and managing the full lifecycle of Helm instances.
This section will take cert-manager as an example to introduce how to create and manage Helm apps through the container management interface.
Integrated the Kubernetes cluster or created the Kubernetes cluster, and you can access the UI interface of the cluster.
Created a namespace, user, and granted NS Admin or higher permissions to the user. For details, refer to Namespace Authorization.
"},{"location":"en/end-user/kpanda/helm/helm-app.html#install-the-helm-app","title":"Install the Helm app","text":"
Follow the steps below to install the Helm app.
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Helm Apps -> Helm Chart to enter the Helm chart page.
On the Helm chart page, select the Helm repository named addon , and all the Helm chart templates under the addon repository will be displayed on the interface. Click the Chart named cert-manager .
On the installation page, you can see the relevant detailed information of the Chart, select the version to be installed in the upper right corner of the interface, and click the Install button. Here select v1.9.1 version for installation.
Configure Name , Namespace and Version Information . You can also customize arguments by modifying YAML in the argument Configuration area below. Click OK .
The system will automatically return to the list of Helm apps, and the status of the newly created Helm app is Installing , and the status will change to Running after a period of time.
"},{"location":"en/end-user/kpanda/helm/helm-app.html#update-the-helm-app","title":"Update the Helm app","text":"
After we have completed the installation of a Helm app through the interface, we can perform an update operation on the Helm app. Note: Update operations using the UI are only supported for Helm apps installed via the UI.
Follow the steps below to update the Helm app.
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Helm Apps to enter the Helm app list page.
On the Helm app list page, select the Helm app that needs to be updated, click the ... operation button on the right side of the list, and select Update from the drop-down menu.
After clicking the Update button, the system will jump to the update interface, where you can update the Helm app as needed. Here we take updating the http port of the dao-2048 application as an example.
After modifying the relevant arguments, you can click the Change button under the argument configuration to compare the files before and after the modification. After confirming that there are no errors, click the OK button at the bottom to complete the update of the Helm app.
The system will automatically return to the Helm app list, and a pop-up window in the upper right corner will prompt update successful .
Every installation, update, and deletion of Helm apps has detailed operation records and logs for viewing.
In the left navigation bar, click Cluster Operations -> Recent Operations , and then select the Helm Operations tab at the top of the page. Each record corresponds to an install/update/delete operation.
To view the detailed log of each operation: Click \u2507 on the right side of the list, and select Log from the pop-up menu.
At this point, the detailed operation log will be displayed in the form of console at the bottom of the page.
"},{"location":"en/end-user/kpanda/helm/helm-app.html#delete-the-helm-app","title":"Delete the Helm app","text":"
Follow the steps below to delete the Helm app.
Find the cluster where the Helm app to be deleted resides, click the cluster name, and enter Cluster Details .
In the left navigation bar, click Helm Apps to enter the Helm app list page.
On the Helm app list page, select the Helm app you want to delete, click the ... operation button on the right side of the list, and select Delete from the drop-down menu.
Enter the name of the Helm app in the pop-up window to confirm, and then click the Delete button.
The Helm repository is a repository for storing and publishing Charts. The Helm App module supports HTTP(s) protocol to access Chart packages in the repository. By default, the system has 4 built-in helm repos as shown in the table below to meet common needs in the production process of enterprises.
| Repository | Description | Example |
| --- | --- | --- |
| partner | Various high-quality Charts provided by ecological partners | tidb |
| system | Charts that the system's core functional components and some advanced features must rely upon; for example, insight-agent must be installed to obtain cluster monitoring information | Insight |
| addon | Common Charts for business cases | cert-manager |
| community | Charts of the most popular open source components in the Kubernetes community | Istio |
In addition to the above preset repositories, you can also add third-party Helm repositories yourself. This page will introduce how to add and update third-party Helm repositories.
The following takes the public container repository of Kubevela as an example to introduce and manage the helm repo.
Find the cluster into which the third-party helm repo needs to be imported, click the cluster name, and enter cluster details.
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo page.
Click the Create Repository button on the helm repo page to enter the Create repository page, and configure relevant arguments according to the table below.
Repository Name: Set the repository name. It can be up to 63 characters long and may only include lowercase letters, numbers, and separators -. It must start and end with a lowercase letter or number, for example, kubevela.
Repository URL: The HTTP(S) address pointing to the target Helm repository. For example, https://charts.kubevela.net/core.
Skip TLS Verification: If the added Helm repository uses an HTTPS address and requires skipping TLS verification, you can check this option. The default is unchecked.
Authentication Method: The method used for identity verification after connecting to the repository URL. For public repositories, you can select None. For private repositories, you need to enter a username/password for identity verification.
Labels: Add labels to this Helm repository. For example, key: repo4; value: Kubevela.
Annotations: Add annotations to this Helm repository. For example, key: repo4; value: Kubevela.
Description: Add a description for this Helm repository. For example: This is a Kubevela public Helm repository.
Click OK to complete the creation of the Helm repository. The page will automatically jump to the list of Helm repositories.
"},{"location":"en/end-user/kpanda/helm/helm-repo.html#update-the-helm-repository","title":"Update the Helm repository","text":"
When the address information of the helm repo changes, the address, authentication method, label, annotation, and description information of the helm repo can be updated.
Find the cluster where the repository to be updated is located, click the cluster name, and enter cluster details .
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo list page.
Find the Helm repository that needs to be updated on the repository list page, click the \u2507 button on the right side of the list, and click Update in the pop-up menu.
Update on the Update Helm Repository page, and click OK when finished.
Return to the helm repo list, and the screen prompts that the update is successful.
"},{"location":"en/end-user/kpanda/helm/helm-repo.html#delete-the-helm-repository","title":"Delete the Helm repository","text":"
In addition to importing and updating repositories, you can also delete unnecessary repositories, including system preset repositories and third-party repositories.
Find the cluster where the repository to be deleted is located, click the cluster name, and enter cluster details .
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo list page.
Find the Helm repository that needs to be deleted on the repository list page, click the \u2507 button on the right side of the list, and click Delete in the pop-up menu.
Enter the repository name to confirm, and click Delete .
Return to the list of Helm repositories, and the screen prompts that the deletion is successful.
"},{"location":"en/end-user/kpanda/helm/multi-archi-helm.html","title":"Import and Upgrade Multi-Arch Helm Apps","text":"
In a multi-arch cluster, it is common to use Helm charts that support multiple architectures to address deployment issues caused by architectural differences. This guide will explain how to integrate single-arch Helm apps into multi-arch deployments and how to integrate multi-arch Helm apps.
The offline package is quite large and requires sufficient space for decompression and loading of images. Otherwise, it may interrupt the process with a \"no space left\" error.
"},{"location":"en/end-user/kpanda/helm/multi-archi-helm.html#retry-after-failure","title":"Retry after Failure","text":"
If the multi-arch fusion step fails, you need to clean up the residue before retrying:
If the offline package for fusion contains registry spaces that are inconsistent with the imported offline package, an error may occur during the fusion process due to the non-existence of the registry spaces:
Solution: Simply create the registry space before the fusion. For example, in the above error, creating the registry space \"localhost\" in advance can prevent the error.
When upgrading to a version lower than 0.12.0 of the addon, the charts-syncer in the target offline package does not check the existence of the image before pushing, so it will recombine the multi-arch into a single architecture during the upgrade process. For example, if the addon is implemented as a multi-arch in v0.10, upgrading to v0.11 will overwrite the multi-arch addon with a single architecture. However, upgrading to v0.12.0 or above can still maintain the multi-arch.
This article explains how to upload Helm charts. See the steps below.
Add a Helm repository, refer to Adding a Third-Party Helm Repository for the procedure.
Upload the Helm Chart to the Helm repository.
Upload with ClientUpload with Web Page
Note
This method is suitable for Harbor, ChartMuseum, JFrog type repositories.
Log in to a node that can access the Helm repository, upload the Helm binary to the node, and install the cm-push plugin (VPN is needed and Git should be installed in advance).
Refer to the plugin installation process.
Push the Helm Chart to the Helm repository by executing the following command:
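Based on the arguments described below, a typical cm-push invocation looks like this (a sketch; adjust the values to your repository):
helm cm-push ${charts-dir} ${HELM_REPO_URL} --username ${username} --password ${password}\n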
charts-dir: The directory of the Helm Chart, or the packaged Chart (i.e., .tgz file).
HELM_REPO_URL: The URL of the Helm repository.
username/password: The username and password for the Helm repository with push permissions.
If you want to access via HTTPS and skip the certificate verification, you can add the argument --insecure.
Note
This method is only applicable to Harbor repositories.
Log into the Harbor repository, ensuring the logged-in user has permissions to push;
Go to the relevant project, select the Helm Charts tab, click the Upload button on the page to upload the Helm Chart.
Sync Remote Repository Data
Manual SyncAuto Sync
By default, the cluster does not enable Helm Repository Auto-Refresh, so you need to perform a manual sync operation. The general steps are:
Go to Helm Apps -> Helm Repositories, click the \u2507 button on the right side of the repository list, and select Sync Repository to complete the repository data synchronization.
If you need to enable the Helm repository auto-sync feature, you can go to Cluster Maintenance -> Cluster Settings -> Advanced Settings and turn on the Helm repository auto-refresh switch.
Cluster inspection allows administrators to regularly or ad-hoc check the overall health of the cluster, giving them proactive control over ensuring cluster security. With a well-planned inspection schedule, this proactive cluster check allows administrators to monitor the cluster status at any time and address potential issues in advance. It eliminates the previous dilemma of passive troubleshooting during failures, enabling proactive monitoring and prevention.
The cluster inspection feature provided by AI platform's container management module supports custom inspection items at the cluster, node, and pod levels. After the inspection is completed, it automatically generates visual inspection reports.
Cluster Level: Checks the running status of system components in the cluster, including cluster status, resource usage, and specific inspection items for control nodes, such as the status of kube-apiserver and etcd .
Node Level: Includes common inspection items for both control nodes and worker nodes, such as node resource usage, handle counts, PID status, and network status.
Pod Level: Checks the CPU and memory usage and running status of pods, as well as the status of PVs (Persistent Volumes) and PVCs (PersistentVolumeClaims).
For information on security inspections or executing security-related inspections, refer to the supported security scan types in AI platform.
AI platform Container Management module provides cluster inspection functionality, which supports inspection at the cluster, node, and pod levels.
Cluster level: Check the running status of system components in the cluster, including cluster status, resource usage, and specific inspection items for control nodes such as kube-apiserver and etcd .
Node level: Includes common inspection items for both control nodes and worker nodes, such as node resource usage, handle count, PID status, and network status.
Pod level: Check the CPU and memory usage, running status, PV and PVC status of Pods.
Here's how to create an inspection configuration.
Click Cluster Inspection in the left navigation bar.
On the right side of the page, click Inspection Configuration .
Fill in the inspection configuration based on the following instructions, then click OK at the bottom of the page.
Cluster: Select the clusters that you want to inspect from the dropdown list. If you select multiple clusters, multiple inspection configurations will be automatically generated (only the inspected clusters are inconsistent, all other configurations are identical).
Scheduled Inspection: When enabled, it allows for regular automatic execution of cluster inspections based on a pre-set inspection frequency.
Inspection Frequency: Set the interval for automatic inspections, e.g., every Tuesday at 10 AM. It supports custom CronExpressions; refer to Cron Schedule Syntax for more information.
Number of Inspection Records to Retain: Specifies the maximum number of inspection records to be retained, including all inspection records for each cluster.
Parameter Configuration: The parameter configuration is divided into three parts: cluster level, node level, and pod level. You can enable or disable specific inspection items based on your requirements.
After creating the inspection configuration, it will be automatically displayed in the inspection configuration list. Click the more options button on the right of the configuration to immediately perform an inspection, modify the inspection configuration or delete the inspection configuration and reports.
Click Inspection to perform an inspection once based on the configuration.
Click Inspection Configuration to modify the inspection configuration.
Click Delete to delete the inspection configuration and reports.
Note
After creating the inspection configuration, if the Scheduled Inspection configuration is enabled, inspections will be automatically executed at the specified time.
If Scheduled Inspection configuration is not enabled, you need to manually trigger the inspection.
After creating an inspection configuration, if the Scheduled Inspection configuration is enabled, inspections will be automatically executed at the specified time. If the Scheduled Inspection configuration is not enabled, you need to manually trigger the inspection.
This page explains how to manually perform a cluster inspection.
When performing an inspection, you can choose to inspect multiple clusters in batches or perform a separate inspection for a specific cluster.
Batch InspectionIndividual Inspection
Click Cluster Inspection in the top-level navigation bar of the Container Management module, then click Inspection on the right side of the page.
Select the clusters you want to inspect, then click OK at the bottom of the page.
If you choose to inspect multiple clusters at the same time, the system will perform inspections based on different inspection configurations for each cluster.
If no inspection configuration is set for a cluster, the system will use the default configuration.
Go to the Cluster Inspection page.
Click the more options button ( \u2507 ) on the right of the proper inspection configuration, then select Inspection from the popup menu.
Go to the Cluster Inspection page and click the name of the target inspection cluster.
Click the name of the inspection record you want to view.
Each inspection execution generates an inspection record.
When the number of inspection records exceeds the maximum retention specified in the inspection configuration, the earliest records (by execution time) are deleted first.
View the detailed information of the inspection, which may include an overview of cluster resources and the running status of system components.
You can download the inspection report or delete the inspection report from the top right corner of the page.
Namespaces are an abstraction used in Kubernetes for resource isolation. A cluster can contain multiple namespaces with different names, and the resources in each namespace are isolated from each other. For a detailed introduction to namespaces, refer to Namespaces.
This page will introduce the related operations of the namespace.
"},{"location":"en/end-user/kpanda/namespaces/createns.html#create-a-namespace","title":"Create a namespace","text":"
Supports easy creation of namespaces through forms, and quick creation of namespaces by writing or importing YAML files.
Note
Before creating a namespace, you need to Integrate a Kubernetes cluster or Create a Kubernetes cluster in the container management module.
The default namespace default is usually automatically generated after cluster initialization. But for production clusters, for ease of management, it is recommended to create other namespaces instead of using the default namespace directly.
"},{"location":"en/end-user/kpanda/namespaces/createns.html#create-with-form","title":"Create with form","text":"
On the cluster list page, click the name of the target cluster.
Click Namespace in the left navigation bar, then click the Create button on the right side of the page.
Fill in the name of the namespace, configure the workspace and labels (optional), and then click OK.
Info
After binding a namespace to a workspace, the resources of that namespace will be shared with the bound workspace. For a detailed explanation of workspaces, refer to Workspaces and Hierarchies.
After the namespace is created, you can still bind/unbind the workspace.
Click OK to complete the creation of the namespace. On the right side of the namespace list, click ⋮ to select update, bind/unbind workspace, quota management, delete, and more from the pop-up menu.
"},{"location":"en/end-user/kpanda/namespaces/createns.html#create-from-yaml","title":"Create from YAML","text":"
On the Clusters page, click the name of the target cluster.
Click Namespace in the left navigation bar, then click the YAML Create button on the right side of the page.
Enter or paste the prepared YAML content, or directly import an existing YAML file locally.
After entering the YAML content, click Download to save the YAML file locally.
Finally, click OK in the lower right corner of the pop-up box.
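For reference, a minimal Namespace manifest that could be pasted or imported in this step is sketched below; the name and labels are placeholders, not values from the original document.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo-ns           # placeholder name; must follow DNS label naming rules
  labels:
    environment: test     # optional labels; adjust as needed
```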
Namespace exclusive nodes in a Kubernetes cluster allow a specific namespace to have exclusive access to one or more node's CPU, memory, and other resources through taints and tolerations. Once exclusive nodes are configured for a specific namespace, applications and services from other namespaces cannot run on the exclusive nodes. Using exclusive nodes allows important applications to have exclusive access to some computing resources, achieving physical isolation from other applications.
Note
Applications and services running on a node before it is set to be an exclusive node will not be affected and will continue to run normally on that node. Only when these Pods are deleted or rebuilt will they be scheduled to other non-exclusive nodes.
Check whether the kube-apiserver of the current cluster has enabled the PodNodeSelector and PodTolerationRestriction admission controllers.
The use of namespace exclusive nodes requires users to enable the PodNodeSelector and PodTolerationRestriction admission controllers on the kube-apiserver. For more information about admission controllers, refer to Kubernetes Admission Controllers Reference.
You can go to any Master node in the current cluster to check whether these two features are enabled in the kube-apiserver.yaml file, or you can execute the following command on the Master node for a quick check:
```shell
[root@g-master1 ~]# cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep enable-admission-plugins

# The expected output is as follows:
- --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction
```
"},{"location":"en/end-user/kpanda/namespaces/exclusive.html#enable-namespace-exclusive-nodes-on-global-cluster","title":"Enable Namespace Exclusive Nodes on Global Cluster","text":"
Since the Global cluster runs platform base components such as kpanda, ghippo, and insight, enabling namespace exclusive nodes on Global may prevent system components from being scheduled when they restart, affecting the overall high availability of the system. Therefore, we generally do not recommend enabling the namespace exclusive node feature on the Global cluster.
If you do need to enable namespace exclusive nodes on the Global cluster, please follow the steps below:
Enable the PodNodeSelector and PodTolerationRestriction admission controllers for the kube-apiserver of the Global cluster
Note
If the cluster has already enabled the above two admission controllers, please skip this step and go directly to configure system component tolerations.
Go to any Master node in the current cluster to modify the kube-apiserver.yaml configuration file, or execute the following command on the Master node for configuration:
```shell
[root@g-master1 ~]# vi /etc/kubernetes/manifests/kube-apiserver.yaml
```

The file content is similar to the following:

```yaml
apiVersion: v1
kind: Pod
metadata:
  ......
spec:
  containers:
  - command:
    - kube-apiserver
    ......
    - --default-not-ready-toleration-seconds=300
    - --default-unreachable-toleration-seconds=300
    - --enable-admission-plugins=NodeRestriction   # List of enabled admission controllers
    - --enable-aggregator-routing=False
    - --enable-bootstrap-token-auth=true
    - --endpoint-reconciler-type=lease
    - --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt
    ......
```
Find the --enable-admission-plugins parameter and add the PodNodeSelector and PodTolerationRestriction admission controllers (separated by commas). Refer to the following:
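After adding the two controllers, the flag would look roughly like this (plugin order does not matter):

```yaml
- --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction
```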
Add toleration annotations to the namespace where the platform components are located
After enabling the admission controllers, you need to add toleration annotations to the namespace where the platform components are located to ensure the high availability of the platform components.
The system component namespaces for AI platform are as follows:
Check whether these namespaces exist in the current cluster, and add the annotation scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]' to each of them by executing the following command.
Please make sure to replace <namespace-name> with the name of the platform namespace you want to add the annotation to.
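The exact command is not preserved in this extract; a typical kubectl form, assuming the annotation above, would be:

```shell
# Apply the toleration annotation to one platform namespace (repeat for each namespace)
kubectl annotate namespace <namespace-name> \
  'scheduler.alpha.kubernetes.io/defaultTolerations=[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]'
```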
Use the interface to set exclusive nodes for the namespace
After confirming that the PodNodeSelector and PodTolerationRestriction admission controllers on the cluster API server have been enabled, please follow the steps below to use the AI platform UI management interface to set exclusive nodes for the namespace.
Click the cluster name in the cluster list page, then click Namespace in the left navigation bar.
Click the namespace name, then click the Exclusive Node tab, and click Add Node on the bottom right.
Select which nodes you want to be exclusive to this namespace on the left side of the page. On the right side, you can clear or delete a selected node. Finally, click OK at the bottom.
You can view the current exclusive nodes for this namespace in the list. You can choose to Stop Exclusivity on the right side of the node.
After cancelling exclusivity, Pods from other namespaces can also be scheduled to this node.
"},{"location":"en/end-user/kpanda/namespaces/exclusive.html#enable-namespace-exclusive-nodes-on-non-global-clusters","title":"Enable Namespace Exclusive Nodes on Non-Global Clusters","text":"
To enable namespace exclusive nodes on non-Global clusters, please follow the steps below:
Enable the PodNodeSelector and PodTolerationRestriction admission controllers for the kube-apiserver of the current cluster
Note
If the cluster has already enabled the above two admission controllers, please skip this step and go directly to using the interface to set exclusive nodes for the namespace.
Go to any Master node in the current cluster to modify the kube-apiserver.yaml configuration file, or execute the following command on the Master node for configuration:
```shell
[root@g-master1 ~]# vi /etc/kubernetes/manifests/kube-apiserver.yaml
```

The file content is similar to the following:

```yaml
apiVersion: v1
kind: Pod
metadata:
  ......
spec:
  containers:
  - command:
    - kube-apiserver
    ......
    - --default-not-ready-toleration-seconds=300
    - --default-unreachable-toleration-seconds=300
    - --enable-admission-plugins=NodeRestriction   # List of enabled admission controllers
    - --enable-aggregator-routing=False
    - --enable-bootstrap-token-auth=true
    - --endpoint-reconciler-type=lease
    - --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt
    ......
```
Find the --enable-admission-plugins parameter and add the PodNodeSelector and PodTolerationRestriction admission controllers (separated by commas), as shown in the example for the Global cluster above.
Use the interface to set exclusive nodes for the namespace
After confirming that the PodNodeSelector and PodTolerationRestriction admission controllers on the cluster API server have been enabled, please follow the steps below to use the AI platform UI management interface to set exclusive nodes for the namespace.
Click the cluster name in the cluster list page, then click Namespace in the left navigation bar.
Click the namespace name, then click the Exclusive Node tab, and click Add Node on the bottom right.
Select which nodes you want to be exclusive to this namespace on the left side of the page. On the right side, you can clear or delete a selected node. Finally, click OK at the bottom.
You can view the current exclusive nodes for this namespace in the list. You can choose to Stop Exclusivity on the right side of the node.
After cancelling exclusivity, Pods from other namespaces can also be scheduled to this node.
Add toleration annotations to the namespace where the components that need high availability are located (optional)
Add the annotation scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]' to the namespace where the components that need high availability are located. The same kubectl annotate form shown earlier can be used.
Pod security policies in a Kubernetes cluster allow you to control the behavior of Pods in various aspects of security by configuring different levels and modes for specific namespaces. Only Pods that meet certain conditions will be accepted by the system. It sets three levels and three modes, allowing users to choose the most suitable scheme to set restriction policies according to their needs.
Note
Only one security policy can be configured for one security mode. Please be careful when configuring the enforce security mode for a namespace, as violations will prevent Pods from being created.
This section will introduce how to configure Pod security policies for namespaces through the container management interface.
The container management module has integrated a Kubernetes cluster or created a Kubernetes cluster. The cluster version needs to be v1.22 or above, and you should be able to access the cluster's UI interface.
A namespace has been created, a user has been created, and the user has been granted NS Admin or higher permissions. For details, refer to Namespace Authorization.
"},{"location":"en/end-user/kpanda/namespaces/podsecurity.html#configure-pod-security-policies-for-namespace","title":"Configure Pod Security Policies for Namespace","text":"
Select the namespace for which you want to configure Pod security policies and go to the details page. Click Configure Policy on the Pod Security Policy page to go to the configuration page.
Click Add Policy on the configuration page, and a policy will appear, including security level and security mode. The following is a detailed introduction to the security level and security policy.
| Security Level | Description |
| --- | --- |
| Privileged | An unrestricted policy that provides the maximum possible range of permissions. This policy allows known privilege escalations. |
| Baseline | A minimally restrictive policy that prohibits known privilege escalations. Allows the use of the default (minimally specified) Pod configuration. |
| Restricted | A highly restrictive policy that follows current best practices for protecting Pods. |

| Security Mode | Description |
| --- | --- |
| Audit | Violations of the specified policy add new audit events to the audit log, and the Pod can still be created. |
| Warn | Violations of the specified policy return user-visible warnings, and the Pod can still be created. |
| Enforce | Violations of the specified policy prevent the Pod from being created. |
Different security levels correspond to different check items. If you don't know how to configure your namespace, you can click Policy ConfigMap Explanation at the top right corner of the page to view detailed information.
Click Confirm. If the creation is successful, the security policy you configured will appear on the page.
Click ⋮ to edit or delete the security policy you configured.
"},{"location":"en/end-user/kpanda/network/create-ingress.html","title":"Create an Ingress","text":"
In a Kubernetes cluster, an Ingress exposes HTTP and HTTPS routes from outside the cluster to Services inside the cluster. Traffic routing is controlled by rules defined on the Ingress resource. Here's an example of a simple Ingress that sends all traffic to the same Service:
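The original example is not preserved in this extract; a minimal sketch with placeholder names could be:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress          # placeholder name
spec:
  defaultBackend:                # send all traffic to one Service
    service:
      name: test-service         # placeholder Service name
      port:
        number: 80
```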
Ingress is an API object that manages external access to services in the cluster, and the typical access method is HTTP. Ingress can provide load balancing, SSL termination, and name-based virtual hosting.
The container management module has integrated a Kubernetes cluster or created a Kubernetes cluster, and you can access the cluster's UI interface.
A namespace and a user have been created, and the user has been granted the NS Editor role. For details, refer to Namespace Authorization.
An Ingress instance has been created, an application workload has been deployed, and the proper Service has been created.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
After successfully logging in as the NS Editor user, click Clusters in the upper left corner to enter the Clusters page. In the list of clusters, click a cluster name.
In the left navigation bar, click Container Network -> Ingress to enter the service list, and click the Create Ingress button in the upper right corner.
Note
It is also possible to Create from YAML .
Open Create Ingress page to configure. There are two protocol types to choose from, refer to the following two parameter tables for configuration.
"},{"location":"en/end-user/kpanda/network/create-ingress.html#create-http-protocol-ingress","title":"Create HTTP protocol ingress","text":"Parameter Description Example value Ingress name [Type] Required[Meaning] Enter the name of the new ingress. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter, lowercase English letters or numbers. Ing-01 Namespace [Type] Required[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. default Protocol [Type] Required [Meaning] Refers to the protocol that authorizes inbound access to the cluster service, and supports HTTP (no identity authentication required) or HTTPS (identity authentication needs to be configured) protocol. Here select the ingress of HTTP protocol. HTTP Domain Name [Type] Required [Meaning] Use the domain name to provide external access services. The default is the domain name of the cluster testing.daocloud.io LB Type [Type] Required [Meaning] The usage range of the Ingress instance. Scope of use of Ingress Platform-level load balancer : In the same cluster, share the same Ingress instance, where all Pods can receive requests distributed by the load balancer. Tenant-level load balancer : Tenant load balancer, the Ingress instance belongs exclusively to the current namespace, or belongs to a certain workspace, and the set workspace includes the current namespace, and all Pods can receive it Requests distributed by this load balancer. Platform Level Load Balancer Ingress Class [Type] Optional[Meaning] Select the proper Ingress instance, and import traffic to the specified Ingress instance after selection. When it is None, the default DefaultClass is used. Please set the DefaultClass when creating an Ingress instance. For more information, refer to Ingress Class< br /> Ngnix Session persistence [Type] Optional[Meaning] Session persistence is divided into three types: L4 source address hash , Cookie Key , L7 Header Name . Keep L4 Source Address Hash : : When enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\(binary_remote_addr\"<br /> __Cookie Key__ : When enabled, the connection from a specific client will be passed to the same Pod. After enabled, the following parameters are added to the Annotation by default:<br /> nginx.ingress.kubernetes.io/affinity: \"cookie\"<br /> nginx.ingress.kubernetes .io/affinity-mode: persistent<br /> __L7 Header Name__ : After enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\)http_x_forwarded_for\" Close Path Rewriting [Type] Optional [Meaning] rewrite-target , in some cases, the URL exposed by the backend service is different from the path specified in the Ingress rule. If no URL rewriting configuration is performed, There will be an error when accessing. close Redirect [Type] Optional[Meaning] permanent-redirect , permanent redirection, after entering the rewriting path, the access path will be redirected to the set address. close Traffic Distribution [Type] Optional[Meaning] After enabled and set, traffic distribution will be performed according to the set conditions. 
Based on weight : After setting the weight, add the following Annotation to the created Ingress: nginx.ingress.kubernetes.io/canary-weight: \"10\" Based on Cookie : set After the cookie rules, the traffic will be distributed according to the set cookie conditions Based on Header : After setting the header rules, the traffic will be distributed according to the set header conditions Close Labels [Type] Optional [Meaning] Add a label for the ingress - Annotations [Type] Optional [Meaning] Add annotation for ingress -"},{"location":"en/end-user/kpanda/network/create-ingress.html#create-https-protocol-ingress","title":"Create HTTPS protocol ingress","text":"Parameter Description Example value Ingress name [Type] Required[Meaning] Enter the name of the new ingress. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter, lowercase English letters or numbers. Ing-01 Namespace [Type] Required[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. default Protocol [Type] Required [Meaning] Refers to the protocol that authorizes inbound access to the cluster service, and supports HTTP (no identity authentication required) or HTTPS (identity authentication needs to be configured) protocol. Here select the ingress of HTTPS protocol. HTTPS Domain Name [Type] Required [Meaning] Use the domain name to provide external access services. The default is the domain name of the cluster testing.daocloud.io Secret [Type] Required [Meaning] Https TLS certificate, Create Secret. Forwarding policy [Type] Optional[Meaning] Specify the access policy of Ingress. Path: Specifies the URL path for service access, the default is the root path/directoryTarget service: Service name for ingressTarget service port: Port exposed by the service LB Type [Type] Required [Meaning] The usage range of the Ingress instance. Platform-level load balancer : In the same cluster, the same Ingress instance is shared, and all Pods can receive requests distributed by the load balancer. Tenant-level load balancer : Tenant load balancer, the Ingress instance belongs exclusively to the current namespace or to a certain workspace. This workspace contains the current namespace, and all Pods can receive the workload from this Balanced distribution of requests. Platform Level Load Balancer Ingress Class [Type] Optional[Meaning] Select the proper Ingress instance, and import traffic to the specified Ingress instance after selection. When it is None, the default DefaultClass is used. Please set the DefaultClass when creating an Ingress instance. For more information, refer to Ingress Class< br /> None Session persistence [Type] Optional[Meaning] Session persistence is divided into three types: L4 source address hash , Cookie Key , L7 Header Name . Keep L4 Source Address Hash : : When enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\(binary_remote_addr\"<br /> __Cookie Key__ : When enabled, the connection from a specific client will be passed to the same Pod. 
After enabled, the following parameters are added to the Annotation by default:<br /> nginx.ingress.kubernetes.io/affinity: \"cookie\"<br /> nginx.ingress.kubernetes .io/affinity-mode: persistent<br /> __L7 Header Name__ : After enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\)http_x_forwarded_for\" Close Labels [Type] Optional [Meaning] Add a label for the ingress Annotations [Type] Optional[Meaning] Add annotation for ingress"},{"location":"en/end-user/kpanda/network/create-ingress.html#create-ingress-successfully","title":"Create ingress successfully","text":"
After configuring all the parameters, click the OK button to return to the ingress list automatically. On the right side of the list, click ⋮ to modify or delete the selected ingress.
"},{"location":"en/end-user/kpanda/network/create-services.html","title":"Create a Service","text":"
In a Kubernetes cluster, each Pod has an internal independent IP address, but Pods in the workload may be created and deleted at any time, and directly using the Pod IP address cannot provide external services.
This requires creating a service through which you get a fixed IP address, decoupling the front-end and back-end of the workload, and allowing external users to access the service. At the same time, the service also provides the Load Balancer feature, enabling users to access workloads from the public network.
The container management module has integrated a Kubernetes cluster or created a Kubernetes cluster, and you can access the cluster's UI interface.
A namespace and a user have been created, and the user has been granted the NS Editor role. For details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
After successfully logging in as the NS Editor user, click Clusters in the upper left corner to enter the Clusters page. In the list of clusters, click a cluster name.
In the left navigation bar, click Container Network -> Service to enter the service list, and click the Create Service button in the upper right corner.
Tip

It is also possible to create a service via YAML.
Open the Create Service page, select an access type, and refer to the following three parameter tables for configuration.
Click Intra-Cluster Access (ClusterIP) , which refers to exposing services through the internal IP of the cluster. The services selected for this option can only be accessed within the cluster. This is the default service type. Refer to the configuration parameters in the table below.
| Parameter | Description | Example value |
| --- | --- | --- |
| Access type | (Required) Specify the method of Pod service discovery. Here, select intra-cluster access (ClusterIP). | ClusterIP |
| Service Name | (Required) The name of the new service. Enter a string of 4 to 63 characters that can contain lowercase letters, numbers, and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | (Required) The namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. Same naming rule as above. | default |
| Label selector | (Required) Add a label; the Service selects Pods according to the label. Click "Add" after filling it in. You can also reference the label of an existing workload: click Reference workload label, select the workload in the pop-up window, and the system uses the selected workload's label as the selector by default. | app:job01 |
| Port configuration | (Required) To add a protocol port for the service, first select the port protocol type; currently TCP and UDP are supported. Port name: the name of the custom port. Service port (port): the access port through which the Pod provides services externally. Container port (targetPort): the port the workload actually listens on, used to expose the service within the cluster. | - |
| Session Persistence | (Optional) When enabled, requests from the same client are forwarded to the same Pod. | Enabled |
| Maximum session hold time | (Optional) After session persistence is enabled, the maximum hold time; 30 seconds by default. | 30 seconds |
| Annotation | (Optional) Add annotations for the service. | - |
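As a rough illustration, the ClusterIP form fields above correspond to a Service manifest like the following sketch; the names, labels, and ports are placeholders.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: svc-01
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: job01            # label selector matching the workload's Pods
  ports:
  - name: http
    protocol: TCP
    port: 80              # service port
    targetPort: 8080      # container port the workload listens on
```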
Create NodePort service

Click NodePort, which means exposing the service via an IP and a static port (NodePort) on each node. A NodePort service is routed to the automatically created ClusterIP service. You can access a NodePort service from outside the cluster by requesting <NodeIP>:<NodePort>. Refer to the configuration parameters in the table below.

| Parameter | Description | Example value |
| --- | --- | --- |
| Access type | (Required) Specify the method of Pod service discovery. Here, select node access (NodePort). | NodePort |
| Service Name | (Required) The name of the new service. Enter a string of 4 to 63 characters that can contain lowercase letters, numbers, and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | (Required) The namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. Same naming rule as above. | default |
| Label selector | (Required) Add a label; the Service selects Pods according to the label. Click "Add" after filling it in. You can also reference the label of an existing workload: click Reference workload label, select the workload in the pop-up window, and the system uses the selected workload's label as the selector by default. | - |
| Port configuration | (Required) To add a protocol port for the service, first select the port protocol type; currently TCP and UDP are supported. Port name: the name of the custom port. Service port (port): the access port through which the Pod provides services externally; by default it is set to the same value as the container port for convenience. Container port (targetPort): the port the workload actually listens on. Node port (nodePort): the port on the node that serves as the entry point for external traffic and routes it to the Service. | - |
| Session Persistence | (Optional) When enabled, requests from the same client are forwarded to the same Pod; the Service's .spec.sessionAffinity is set to ClientIP. For details, refer to Session Affinity for Services. | Enabled |
| Maximum session hold time | (Optional) After session persistence is enabled, the maximum hold time; the default timeout is 30 seconds (.spec.sessionAffinityConfig.clientIP.timeoutSeconds is set to 30 by default). | 30 seconds |
| Annotation | (Optional) Add annotations for the service. | - |

Create LoadBalancer service
Click Load Balancer , which refers to using the cloud provider's load balancer to expose services to the outside. External load balancers can route traffic to automatically created NodePort services and ClusterIP services. Refer to the configuration parameters in the table below.
| Parameter | Description | Example value |
| --- | --- | --- |
| Access type | (Required) Specify the method of Pod service discovery. Here, select load balancer access (LoadBalancer). | LoadBalancer |
| Service Name | (Required) The name of the new service. Enter a string of 4 to 63 characters that can contain lowercase letters, numbers, and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | (Required) The namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. Same naming rule as above. | default |
| External Traffic Policy | (Required) Set the external traffic policy. Cluster: traffic can be forwarded to Pods on all nodes in the cluster. Local: traffic is only sent to Pods on the current node. | - |
| Label selector | (Required) Add a label; the Service selects Pods according to the label. Click "Add" after filling it in. You can also reference the label of an existing workload: click Reference workload label, select the workload in the pop-up window, and the system uses the selected workload's label as the selector by default. | - |
| Load balancing type | (Required) The type of load balancing used; currently MetalLB and others are supported. | MetalLB |
| IP Pool | (Required) When the selected load balancing type is MetalLB, the LoadBalancer Service allocates IP addresses from this pool by default and announces all IP addresses in this pool via ARP. For details, refer to Install MetalLB. | - |
| Load balancing address | (Required) 1. If you use a public cloud CloudProvider, fill in the load balancing address provided by the cloud provider here. 2. If the load balancing type is MetalLB, the IP is obtained from the IP pool above by default; if left empty, it is obtained automatically. | - |
| Port configuration | (Required) To add a protocol port for the service, first select the port protocol type; currently TCP and UDP are supported. Port name: the name of the custom port. Service port (port): the access port through which the Pod provides services externally; by default it is set to the same value as the container port for convenience. Container port (targetPort): the port the workload actually listens on. Node port (nodePort): the port on the node that serves as the entry point for external traffic. | - |
| Annotation | (Optional) Add annotations for the service. | - |

Complete service creation
After configuring all parameters, click the OK button to return to the service list automatically. On the right side of the list, click ⋮ to modify or delete the selected service.
Network policies in Kubernetes allow you to control network traffic at the IP address or port level (OSI layer 3 or layer 4). The container management module currently supports creating network policies based on Pods or namespaces, using label selectors to specify which traffic can enter or leave Pods with specific labels.
For more details on network policies, refer to the official Kubernetes documentation on Network Policies.
Currently, there are two methods available for creating network policies: YAML and form-based creation. Each method has its advantages and disadvantages, catering to different user needs.
YAML creation requires fewer steps and is more efficient, but it has a higher learning curve as it requires familiarity with configuring network policy YAML files.
Form-based creation is more intuitive and straightforward. Users can simply fill in the proper values based on the prompts. However, this method involves more steps.
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies -> Create with YAML in the left navigation bar.
In the pop-up dialog, enter or paste the pre-prepared YAML file, then click OK at the bottom of the dialog.
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies -> Create Policy in the left navigation bar.
Fill in the basic information.
The name and namespace cannot be changed after creation.
Fill in the policy configuration.
The policy configuration includes ingress and egress policies. To establish a successful connection from a source Pod to a target Pod, both the egress policy of the source Pod and the ingress policy of the target Pod need to allow the connection. If either side does not allow the connection, the connection will fail.
Ingress Policy: Click ➕ to begin configuring the policy. Multiple policies can be configured. The effects of multiple network policies are cumulative. Only when all network policies are satisfied simultaneously can a connection be successfully established.
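For illustration, a sketch of a NetworkPolicy with one ingress rule is shown below; the names, labels, and port are placeholders, not values from the original document.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend        # placeholder name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend            # Pods this policy applies to
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend       # only allow traffic from these Pods
    ports:
    - protocol: TCP
      port: 8080
```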
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies . Click the name of the network policy.
View the basic configuration, associated instances, ingress policies, and egress policies of the policy.
Info
Under the \"Associated Instances\" tab, you can view instance monitoring, logs, container lists, YAML files, events, and more.
There are two ways to update network policies. You can either update them through the form or by using a YAML file.
On the network policy list page, find the policy you want to update, and choose Update in the action column on the right to update it via the form. Choose Edit YAML to update it using a YAML file.
Click the name of the network policy, then choose Update in the top right corner of the policy details page to update it via the form. Choose Edit YAML to update it using a YAML file.
There are two ways to delete network policies. You can delete network policies either through the form or by using a YAML file.
On the network policy list page, find the policy you want to delete, and choose Delete in the action column on the right to delete it via the form. Choose Edit YAML to delete it using a YAML file.
Click the name of the network policy, then choose Delete in the top right corner of the policy details page to delete it via the form. Choose Edit YAML to delete it using a YAML file.
As the number of business applications continues to grow, the resources of the cluster become increasingly tight. At this point, you can expand the cluster nodes based on kubean. After the expansion, applications can run on the newly added nodes, alleviating resource pressure.
Only clusters created through the container management module support node scaling; clusters integrated from outside do not support this operation. This article mainly introduces how to expand worker nodes of the same architecture in a worker cluster. If you need to add control nodes or heterogeneous worker nodes to the cluster, refer to: Expanding the control node of the worker cluster, Adding heterogeneous nodes to the worker cluster, Expanding the worker node of the global service cluster.
On the Clusters page, click the name of the target cluster.
If the Cluster Type contains the label Integrated Cluster, it means that the cluster does not support node scaling.
Click Nodes in the left navigation bar, and then click Integrate Node in the upper right corner of the page.
Enter the host name and node IP and click OK.
Click ➕ Add Worker Node to continue adding more nodes.
Note
Accessing the node takes about 20 minutes, please be patient.
When the peak business period is over, you can reduce the size of the cluster and remove redundant nodes to save resource costs, that is, scale the cluster down. After a node is removed, applications can no longer run on it.
The current operating user has the Cluster Admin role authorization.
Only clusters created through the container management module support node scaling; clusters integrated from outside do not support this operation.
Before uninstalling a node, you need to pause scheduling the node, and expel the applications on the node to other nodes.
Eviction method: log in to the controller node, and use the kubectl drain command to evict all Pods on the node. The safe eviction method allows the containers in the pod to terminate gracefully.
When scaling down the cluster, nodes can only be removed one by one, not in batches.
If you need to uninstall cluster controller nodes, you need to ensure that the final number of controller nodes is an odd number.
The first controller node cannot be offline when the cluster node scales down. If it is necessary to perform this operation, please contact the after-sales engineer.
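For the eviction step mentioned in the notes above, a typical kubectl drain invocation looks like the following sketch; the node name is a placeholder.

```shell
# Cordon the node and safely evict its Pods before removing it from the cluster
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```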
On the Clusters page, click the name of the target cluster.
If the Cluster Type has the label Integrated Cluster, it means that the cluster does not support node scaling.
Click Nodes on the left navigation bar, find the node to be removed, click ⋮ and select Remove.
Enter the node name, and click Delete to confirm.
"},{"location":"en/end-user/kpanda/nodes/labels-annotations.html","title":"Labels and Annotations","text":"
Labels are identifying key-value pairs added to Kubernetes objects such as Pods, nodes, and clusters, which can be combined with label selectors to find and filter Kubernetes objects that meet certain conditions. Each key must be unique for a given object.
Annotations, like labels, are key/value pairs, but they do not have identification or filtering features. Annotations can be used to add arbitrary metadata to nodes. Annotation keys usually use the format prefix(optional)/name(required), for example nfd.node.kubernetes.io/extended-resources. If the prefix is omitted, it means that the annotation key is private to the user.
For more information about labels and annotations, refer to the official Kubernetes documentation on Labels and Selectors or Annotations.
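Besides the UI steps below, the same result can be achieved with kubectl, as in this sketch; the node name, keys, and values are placeholders.

```shell
# Add a label and an annotation to a node
kubectl label node <node-name> disktype=ssd
kubectl annotate node <node-name> example.com/notes="maintained by team-a"

# Remove them again (trailing dash deletes the key)
kubectl label node <node-name> disktype-
kubectl annotate node <node-name> example.com/notes-
```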
The steps to add/delete labels and annotations are as follows:
On the Clusters page, click the name of the target cluster.
Click Nodes on the left navigation bar, click the ⋮ operation icon on the right side of the node, and click Edit Labels or Edit Annotations.
Click ➕ Add to add labels or annotations, click X to delete labels or annotations, and finally click OK.
"},{"location":"en/end-user/kpanda/nodes/node-authentication.html","title":"Node Authentication","text":""},{"location":"en/end-user/kpanda/nodes/node-authentication.html#authenticate-nodes-using-ssh-keys","title":"Authenticate Nodes Using SSH Keys","text":"
If you choose to authenticate the nodes of the cluster-to-be-created using SSH keys, you need to configure the public and private keys according to the following instructions.
Run the following command on any node within the management cluster of the cluster-to-be-created to generate the public and private keys.
```shell
cd /root/.ssh
ssh-keygen -t rsa
```
Run the ls command to check if the keys have been successfully created in the management cluster. The correct output should be as follows:
```shell
ls
id_rsa  id_rsa.pub  known_hosts
```
The file named id_rsa is the private key, and the file named id_rsa.pub is the public key.
Run the following command to load the public key file id_rsa.pub onto all the nodes of the cluster-to-be-created.
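The exact command is not preserved in this extract; copying the public key is typically done with ssh-copy-id, for example:

```shell
# Copy the public key to one node of the cluster-to-be-created
ssh-copy-id -i /root/.ssh/id_rsa.pub <user>@<node-ip>
```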
Replace the user account and node IP in the above command with the username and IP of the nodes in the cluster-to-be-created. The same operation needs to be performed on every node in the cluster-to-be-created.
Run the following command to view the private key file id_rsa created in step 1.
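The command itself is not shown in this extract; viewing the key is typically just:

```shell
cat /root/.ssh/id_rsa
```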
Copy the content of the private key and paste it into the interface's key input field.
"},{"location":"en/end-user/kpanda/nodes/node-check.html","title":"Create a cluster node availability check","text":"
When creating a cluster or adding nodes to an existing cluster, refer to the table below to check the node configuration to avoid cluster creation or expansion failure due to wrong node configuration.
| Check Item | Description |
| --- | --- |
| OS | Refer to Supported Architectures and Operating Systems |
| SELinux | Off |
| Firewall | Off |
| Architecture Consistency | Consistent CPU architecture between nodes (such as ARM or x86) |
| Host Time | The time difference between all hosts is within 10 seconds |
| Network Connectivity | The node and its SSH port can be accessed normally by the platform |
| CPU | Available CPU resources are greater than 4 Cores |
| Memory | Available memory resources are greater than 8 GB |

Supported architectures and operating systems

| Architecture | Operating System | Remarks |
| --- | --- | --- |
| ARM | Kylin Linux Advanced Server release V10 (Sword) SP2 | Recommended |
| ARM | UOS Linux | |
| ARM | openEuler | |
| x86 | CentOS 7.x | Recommended |
| x86 | Redhat 7.x | Recommended |
| x86 | Redhat 8.x | Recommended |
| x86 | Flatcar Container Linux by Kinvolk | |
| x86 | Debian Bullseye, Buster, Jessie, Stretch | |
| x86 | Ubuntu 16.04, 18.04, 20.04, 22.04 | |
| x86 | Fedora 35, 36 | |
| x86 | Fedora CoreOS | |
| x86 | openSUSE Leap 15.x/Tumbleweed | |
| x86 | Oracle Linux 7, 8, 9 | |
| x86 | Alma Linux 8, 9 | |
| x86 | Rocky Linux 8, 9 | |
| x86 | Amazon Linux 2 | |
| x86 | Kylin Linux Advanced Server release V10 (Sword) - SP2 Haiguang | |
| x86 | UOS Linux | |
| x86 | openEuler | |

Node Details
After accessing or creating a cluster, you can view the information of each node in the cluster, including node status, labels, resource usage, Pod, monitoring information, etc.
On the Clusters page, click the name of the target cluster.
Click Nodes on the left navigation bar to view the node status, role, label, CPU/memory usage, IP address, and creation time.
Click the node name to enter the node details page to view more information, including overview information, pod information, label annotation information, event list, status, etc.
In addition, you can also view the node's YAML file, monitoring information, labels and annotations, etc.
Supports suspending or resuming scheduling of nodes. Pausing scheduling means stopping the scheduling of Pods to the node. Resuming scheduling means that Pods can be scheduled to that node.
On the Clusters page, click the name of the target cluster.
Click Nodes on the left navigation bar, click the ⋮ operation icon on the right side of the node, and click the Cordon button to suspend scheduling on the node.
Click the ⋮ operation icon on the right side of the node, and click the Uncordon button to resume scheduling on the node.
The node scheduling status may be delayed due to network conditions. Click the refresh icon on the right side of the search box to refresh the node scheduling status.
A taint enables a node to exclude certain Pods and prevents them from being scheduled onto that node. One or more taints can be applied to each node, and Pods that cannot tolerate these taints will not be scheduled on that node.
Find the target cluster on the Clusters page, and click the cluster name to enter the Cluster page.
In the left navigation bar, click Nodes, find the node whose taints need to be modified, click the ⋮ operation icon on the right, and click the Edit Taints button.
Enter the key value information of the taint in the pop-up box, select the taint effect, and click OK .
Click ➕ Add to add multiple taints to the node, and click X on the right side of a taint effect to delete the taint.
Currently supports three taint effects:
NoExecute: This affects pods that are already running on the node as follows:
Pods that do not tolerate the taint are evicted immediately
Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever
Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time. After that time elapses, the node lifecycle controller evicts the Pods from the node.
NoSchedule: No new Pods will be scheduled on the tainted node unless they have a matching toleration. Pods currently running on the node are not evicted.
PreferNoSchedule: This is a \"preference\" or \"soft\" version of NoSchedule. The control plane will try to avoid placing a Pod that does not tolerate the taint on the node, but it is not guaranteed, so this taint is not recommended to use in a production environment.
For more details about taints, refer to the Kubernetes documentation Taints and Tolerance.
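Equivalently, taints can also be managed with kubectl; a sketch with placeholder values:

```shell
# Add a NoSchedule taint to a node
kubectl taint nodes <node-name> key1=value1:NoSchedule

# Remove the same taint (trailing dash)
kubectl taint nodes <node-name> key1=value1:NoSchedule-
```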
The current cluster is connected to the container management and the Global cluster has installed the kolm component (search for helm templates for kolm).
The current cluster has the olm component installed with a version of 0.2.4 or higher (search for helm templates for olm).
Go to Container Management -> Select the current cluster -> Helm Apps -> View the olm component -> Plugin Settings , and find the images needed for the opm, minio, minio bundle, and minio operator in the subsequent steps.
Using the screenshot as an example, the four image addresses are as follows:

```shell
# opm image
10.5.14.200/quay.m.daocloud.io/operator-framework/opm:v1.29.0

# minio image
10.5.14.200/quay.m.daocloud.io/minio/minio:RELEASE.2023-03-24T21-41-23Z

# minio bundle image
10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3

# minio operator image
10.5.14.200/quay.m.daocloud.io/minio/operator:v5.0.3
```
Run the opm command to get the operators included in the offline bundle image.
Replace all image addresses in the minio-operator/manifests/minio-operator.clusterserviceversion.yaml file with the image addresses from the offline container registry.
Before replacement:
After replacement:
Generate a Dockerfile for building the bundle image.
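The exact command is not preserved in this extract. One common way to generate such a Dockerfile is opm's index add subcommand with the --generate flag; this is an assumption, not necessarily the command used in the original document:

```shell
# Assumed sketch: generate index.Dockerfile from the offline minio bundle image
./opm index add \
  --bundles 10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3 \
  --generate \
  --out-dockerfile index.Dockerfile
```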
```shell
# Set the new catalog image
export OFFLINE_CATALOG_IMG=10.5.14.200/release.daocloud.io/operator-framework/system-operator-index:v0.1.0-offline

$ docker build . -f index.Dockerfile -t ${OFFLINE_CATALOG_IMG}

$ docker push ${OFFLINE_CATALOG_IMG}
```
Go to Container Management and update the built-in catsrc image for the Helm App olm (enter the catalog image specified in the construction of the catalog image, ${catalog-image} ).
After the update is successful, the minio-operator component will appear in the Operator Hub.
"},{"location":"en/end-user/kpanda/permissions/cluster-ns-auth.html","title":"Cluster and Namespace Authorization","text":"
Container management implements authorization based on global permission management and global user/group management. If you need to grant users the highest authority for container management (being able to create, manage, and delete all clusters), refer to What is Access Control.
After the user logs in to the platform, click Privilege Management under Container Management on the left menu bar, which is located on the Cluster Permissions tab by default.
Click the Add Authorization button.
On the Add Cluster Permission page, select the target cluster, the user/group to be authorized, and click OK .
Currently, the only cluster role supported is Cluster Admin . For details about permissions, refer to Permission Description. If you need to authorize multiple users/groups at the same time, you can click Add User Permissions to add multiple times.
Return to the cluster permission management page, and a message appears on the screen: Cluster permission added successfully .
After the user logs in to the platform, click Permissions under Container Management on the left menu bar, and click the Namespace Permissions tab.
Click the Add Authorization button. On the Add Namespace Permission page, select the target cluster, target namespace, and user/group to be authorized, and click OK .
The currently supported namespace roles are NS Admin, NS Editor, and NS Viewer. For details about permissions, refer to Permission Description. If you need to authorize multiple users/groups at the same time, you can click Add User Permission to add multiple times. Click OK to complete the permission authorization.
Return to the namespace permission management page, and a message appears on the screen: Namespace permission added successfully.
Tip
If you need to delete or edit permissions later, you can click ⋮ on the right side of the list and select Edit or Delete.
"},{"location":"en/end-user/kpanda/permissions/custom-kpanda-role.html","title":"Adding RBAC Rules to System Roles","text":"
In the past, the RBAC rules for those system roles in container management were pre-defined and could not be modified by users. To support more flexible permission settings and to meet the customized needs for system roles, now you can modify RBAC rules for system roles such as cluster admin, ns admin, ns editor, ns viewer.
The following example demonstrates how to add a new ns-view rule, granting the authority to delete workload deployments. Similar operations can be performed for other rules.
Before adding RBAC rules to system roles, the following prerequisites must be met:
Container management v0.27.0 and above.
Integrated Kubernetes cluster or created Kubernetes cluster, and able to access the cluster's UI interface.
Completed creation of a namespace and user account, and the granting of NS Viewer. For details, refer to namespace authorization.
Note
RBAC rules only need to be added in the Global Cluster, and the Kpanda controller will synchronize those added rules to all integrated subclusters. Synchronization may take some time to complete.
RBAC rules can only be added in the Global Cluster. RBAC rules added in subclusters will be overridden by the system role permissions of the Global Cluster.
Only ClusterRoles with fixed Label are supported for adding rules. Replacing or deleting rules is not supported, nor is adding rules by using role. The correspondence between built-in roles and ClusterRole Label created by users is as follows.
Create a deployment by a user with admin or cluster admin permissions.
Grant a user the ns-viewer role to provide them with the ns-view permission.
Switch the login user to ns-viewer, open the console to get the token for the ns-viewer user, and use curl to request and delete the nginx deployment mentioned above. However, a prompt appears as below, indicating the user doesn't have permission to delete it.
```shell
[root@master-01 ~]# curl -k -X DELETE 'https://${URL}/apis/kpanda.io/v1alpha1/clusters/cluster-member/namespaces/default/deployments/nginx' -H 'authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJOU044MG9BclBRMzUwZ2VVU2ZyNy1xMEREVWY4MmEtZmJqR05uRE1sd1lFIn0.eyJleHAiOjE3MTU3NjY1NzksImlhdCI6MTcxNTY4MDE3OSwiYXV0aF90aW1lIjoxNzE1NjgwMTc3LCJqdGkiOiIxZjI3MzJlNC1jYjFhLTQ4OTktYjBiZC1iN2IxZWY1MzAxNDEiLCJpc3MiOiJodHRwczovLzEwLjYuMjAxLjIwMTozMDE0Ny9hdXRoL3JlYWxtcy9naGlwcG8iLCJhdWQiOiJfX2ludGVybmFsLWdoaXBwbyIsInN1YiI6ImMxZmMxM2ViLTAwZGUtNDFiYS05ZTllLWE5OGU2OGM0MmVmMCIsInR5cCI6IklEIiwiYXpwIjoiX19pbnRlcm5hbC1naGlwcG8iLCJzZXNzaW9uX3N0YXRlIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiYXRfaGFzaCI6IlJhTHoyQjlKQ2FNc1RrbGVMR3V6blEiLCJhY3IiOiIwIiwic2lkIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiZW1haWxfdmVyaWZpZWQiOmZhbHNlLCJncm91cHMiOltdLCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJucy12aWV3ZXIiLCJsb2NhbGUiOiIifQ.As2ipMjfvzvgONAGlc9RnqOd3zMwAj82VXlcqcR74ZK9tAq3Q4ruQ1a6WuIfqiq8Kq4F77ljwwzYUuunfBli2zhU2II8zyxVhLoCEBu4pBVBd_oJyUycXuNa6HfQGnl36E1M7-_QG8b-_T51wFxxVb5b7SEDE1AvIf54NAlAr-rhDmGRdOK1c9CohQcS00ab52MD3IPiFFZ8_Iljnii-RpXKZoTjdcULJVn_uZNk_SzSUK-7MVWmPBK15m6sNktOMSf0pCObKWRqHd15JSe-2aA2PKBo1jBH3tHbOgZyMPdsLI0QdmEnKB5FiiOeMpwn_oHnT6IjT-BZlB18VkW8rA'

{"code":7,"message":"[RBAC] delete resources(deployments: nginx) is forbidden for user(ns-viewer) in cluster(cluster-member)","details":[]}
[root@master-01 ~]#
```
Create a ClusterRole on the global cluster, as shown in the yaml below.
This field value can be arbitrarily specified, as long as it is not duplicated and complies with the Kubernetes resource naming conventions.
When adding rules to different roles, make sure to apply different labels.
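The YAML itself is not preserved in this extract; a hedged sketch of what such a ClusterRole could look like follows. The label key and value used to associate the rule with the ns-viewer role are assumptions here; use the label from the correspondence table mentioned above.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: append-ns-view-deployment-delete      # arbitrary, non-duplicated name
  labels:
    rbac.kpanda.io/role-template: ns-view     # assumed label; replace with the label matching ns-viewer
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["delete"]                           # grants the ability to delete deployments
```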
Wait for the kpanda controller to synchronize the user-created rule into the built-in ns-viewer role, then check whether the rule added in the previous step is present for ns-viewer.
When using curl again to request the deletion of the aforementioned nginx deployment, this time the deletion was successful. This means that ns-viewer has successfully added the rule to delete deployments.
Container management permissions are based on a multi-dimensional permission management system created by global permission management and Kubernetes RBAC permission management. It supports cluster-level and namespace-level permission control, helping users to conveniently and flexibly set different operation permissions for IAM users and user groups (collections of users) under a tenant.
Cluster permissions are authorized based on Kubernetes RBAC's ClusterRoleBinding, allowing users/user groups to have cluster-related permissions. The current default cluster role is Cluster Admin (does not have the permission to create or delete clusters).
Namespace permissions are authorized based on Kubernetes RBAC capabilities, allowing different users/user groups to have different operation permissions on resources under a namespace (including Kubernetes API permissions). For details, refer to: Kubernetes RBAC. Currently, the default roles for container management are: NS Admin, NS Editor, NS Viewer.
What is the relationship between global permissions and container management permissions?
Answer: Global permissions only authorize coarse-grained permissions, which can manage the creation, editing, and deletion of all clusters; while for fine-grained permissions, such as the management permissions of a single cluster, the management, editing, and deletion permissions of a single namespace, they need to be implemented based on Kubernetes RBAC container management permissions. Generally, users only need to be authorized in container management.
Currently, only four default roles are supported. Can the RoleBinding and ClusterRoleBinding (Kubernetes fine-grained RBAC) for custom roles also take effect?
Answer: Currently, custom permissions cannot be managed through the graphical interface, but the permission rules created using kubectl can still take effect.
Suanova AI platform supports elastic scaling of Pod resources based on metrics (Horizontal Pod Autoscaling, HPA). Users can dynamically adjust the number of Pod replicas by setting CPU utilization, memory usage, and custom metrics. For example, after setting an auto scaling policy based on the CPU utilization metric for a workload, when the CPU utilization of the Pods exceeds or falls below the threshold you set, the workload controller automatically increases or decreases the number of Pod replicas.
This page describes how to configure auto scaling based on built-in metrics and custom metrics for workloads.
Note
HPA is only applicable to Deployment and StatefulSet, and only one HPA can be created per workload.
If you create an HPA policy based on CPU utilization, you must set the resource limit (Limit) for the workload in advance, otherwise the CPU utilization cannot be calculated.
If built-in metrics and multiple custom metrics are used at the same time, HPA calculates the required number of replicas for each metric separately and takes the largest value (but not exceeding the maximum number of replicas configured in the HPA policy) for elastic scaling.
Refer to the following steps to configure a built-in metric auto scaling policy for the workload.
Click Clusters on the left navigation bar to enter the cluster list page. Click a cluster name to enter the Cluster Details page.
On the cluster details page, click Workload in the left navigation bar to enter the workload list, and then click a workload name to enter the Workload Details page.
Click the Auto Scaling tab to view the auto scaling configuration of the current cluster.
After confirming that the cluster has installed the metrics-server plug-in, and the plug-in is running normally, you can click the New Scaling button.
Configure the auto scaling policy parameters.
Policy name: Enter the name of the auto scaling policy. Note that the name can contain up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as hpa-my-dep.
Namespace: The namespace where the workload resides.
Workload: The workload object that performs auto scaling.
Target CPU Utilization: The CPU utilization of the Pods under the workload, calculated as the actual CPU usage of all Pods under the workload divided by the sum of their requested (request) CPU. When the actual CPU utilization is higher or lower than the target value, the system automatically increases or decreases the number of Pod replicas.
Target Memory Usage: The memory usage of the Pods under the workload. When the actual memory usage is higher or lower than the target value, the system automatically increases or decreases the number of Pod replicas.
Replica range: the elastic scaling range of the number of Pod replicas. The default interval is 1 - 10.
After completing the parameter configuration, click the OK button to automatically return to the elastic scaling details page. Click ┇ on the right side of the list to edit, delete, and view related events.
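For reference, a minimal sketch of an equivalent HPA object created through the steps above; the workload name my-dep and the 80% CPU target are assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-my-dep          # policy name, e.g. hpa-my-dep
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-dep            # the workload to scale (hypothetical name)
  minReplicas: 1            # replica range lower bound
  maxReplicas: 10           # replica range upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # target CPU utilization (%)
```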
The Vertical Pod Autoscaler (VPA) calculates the most suitable CPU and memory request values for a Pod by monitoring the Pod's resource requests and usage over a period of time. Using VPA can allocate resources to each Pod in the cluster more reasonably, improve the overall resource utilization of the cluster, and avoid wasting cluster resources.
The AI platform supports container-level VPA. Based on this feature, the Pod request values can be dynamically adjusted according to container resource usage. The AI platform supports manual and automatic modification of resource request values, and you can configure them according to actual needs.
This page describes how to configure VPA for deployment.
Warning
Using VPA to modify a Pod resource request will trigger a Pod restart. Due to the limitations of Kubernetes itself, Pods may be scheduled to other nodes after restarting.
Refer to the following steps to configure a vertical scaling (VPA) policy for the deployment.
Find the current cluster in Clusters , and click the name of the target cluster.
Click Deployments in the left navigation bar, find the deployment that needs to create a VPA, and click the name of the deployment.
Click the Auto Scaling tab to view the auto scaling configuration of the current cluster, and confirm that the relevant plug-ins have been installed and are running normally.
Click the Create Autoscaler button and configure the VPA vertical scaling policy parameters.
Policy name: Enter the name of the vertical scaling policy. Note that the name can contain up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as vpa-my-dep.
Scaling mode: The method used to modify the CPU and memory request values. Currently, vertical scaling supports manual and automatic scaling modes.
Manual scaling: After the vertical scaling policy calculates the recommended resource configuration value, the user needs to manually modify the resource quota of the application.
Auto-scaling: The vertical scaling policy automatically calculates and modifies the resource quota of the application.
Target container: Select the container to be scaled vertically.
After completing the parameter configuration, click the OK button to automatically return to the elastic scaling details page. Click ┇ on the right side of the list to perform edit and delete operations.
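For reference, a minimal sketch of the VPA object that corresponds to the policy configured above, assuming a Deployment named my-dep; updateMode "Off" roughly corresponds to the manual mode (recommendations only) and "Auto" to the automatic mode:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-my-dep
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-dep             # the deployment to scale vertically (hypothetical name)
  updatePolicy:
    updateMode: "Auto"       # "Off" = recommendations only (manual), "Auto" = apply automatically
  resourcePolicy:
    containerPolicies:
      - containerName: "*"   # target container(s) to scale vertically
        controlledResources: ["cpu", "memory"]
```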
"},{"location":"en/end-user/kpanda/scale/custom-hpa.html","title":"Creating HPA Based on Custom Metrics","text":"
When the built-in CPU and memory metrics in the system do not meet your business needs, you can add custom metrics by configuring ServiceMonitoring and achieve auto-scaling based on these custom metrics. This article will introduce how to configure auto-scaling for workloads based on custom metrics.
Note
HPA is only applicable to Deployment and StatefulSet, and each workload can only create one HPA.
If both built-in metrics and multiple custom metrics are used, HPA will calculate the required number of scaled replicas based on multiple metrics respectively, and take the larger value (but not exceeding the maximum number of replicas configured when setting the HPA policy) for scaling.
Refer to the following steps to configure the auto-scaling policy based on metrics for workloads.
Click Clusters in the left navigation bar to enter the clusters page. Click a cluster name to enter the Cluster Overview page.
On the Cluster Details page, click Workloads in the left navigation bar to enter the workload list, and click a workload name to enter the Workload Details page.
Click the Auto Scaling tab to view the current autoscaling configuration of the cluster.
Confirm that the cluster has installed metrics-server, Insight, and Prometheus-adapter plugins, and that the plugins are running normally, then click the Create AutoScaler button.
Note
If the related plugins are not installed or the plugins are in an abnormal state, you will not be able to see the entry for creating custom metrics auto-scaling on the page.
Policy Name: Enter the name of the auto-scaling policy. Note that the name can be up to 63 characters long, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, e.g., hpa-my-dep.
Namespace: The namespace where the workload is located.
Workload: The workload object that performs auto-scaling.
Resource Type: The type of custom metric being monitored, including Pod and Service types.
Metric: The name of the custom metric created using ServiceMonitoring or the name of the system-built custom metric.
Data Type: The method used to calculate the metric value, including target value and target average value. When the resource type is Pod, only the target average value can be used.
This case takes a Golang business program as an example. The example program exposes the httpserver_requests_total metric and records HTTP requests. This metric can be used to calculate the QPS value of the business program.
"},{"location":"en/end-user/kpanda/scale/custom-hpa.html#deploy-business-program","title":"Deploy Business Program","text":"
"},{"location":"en/end-user/kpanda/scale/custom-hpa.html#prometheus-collects-business-monitoring","title":"Prometheus Collects Business Monitoring","text":"
If the insight-agent is installed, Prometheus can be configured by creating a ServiceMonitor CRD object.
Operation steps: In Cluster Details -> Custom Resources, search for "servicemonitors.monitoring.coreos.com", click the name to enter the details. Create the following example CRD in the httpserver namespace via YAML:
If Prometheus is installed via insight, the serviceMonitor must be labeled with operator.insight.io/managed-by: insight. If installed by other means, this label is not required.
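A minimal sketch of such a ServiceMonitor for the example program, including the label mentioned above. The port name, metrics path, and Service labels are assumptions and must match your business Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: httpserver
  namespace: httpserver
  labels:
    operator.insight.io/managed-by: insight   # required when Prometheus is installed via insight
spec:
  endpoints:
    - interval: 30s
      port: http                # must match the port name in the business Service (assumption)
      path: /metrics            # path where httpserver_requests_total is exposed (assumption)
  namespaceSelector:
    matchNames:
      - httpserver
  selector:
    matchLabels:
      app: httpserver           # must match the labels of the business Service (assumption)
```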
"},{"location":"en/end-user/kpanda/scale/custom-hpa.html#configure-metric-rules-in-prometheus-adapter","title":"Configure Metric Rules in Prometheus-adapter","text":"
Steps: In Clusters -> Helm Apps, search for "prometheus-adapter", enter the update page through the action bar, and configure custom metrics in YAML as follows:
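A minimal sketch of a prometheus-adapter custom metric rule for httpserver_requests_total, placed under the chart's rules.custom values; the derived metric name httpserver_requests_qps and the 3m rate window are assumptions and can be adjusted:

```yaml
rules:
  custom:
    - seriesQuery: 'httpserver_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "httpserver_requests_total"
        as: "httpserver_requests_qps"        # metric name exposed to HPA (assumption)
      # Convert the counter into a per-second rate so HPA can use it as QPS.
      metricsQuery: sum(rate(<<.Series>>{<<.LabelMatchers>>}[3m])) by (<<.GroupBy>>)
```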
Following the steps above, find the httpserver Deployment and create auto-scaling for it based on the custom metric.
"},{"location":"en/end-user/kpanda/scale/hpa-cronhpa-compatibility-rules.html","title":"Compatibility Rules for HPA and CronHPA","text":"
HPA stands for HorizontalPodAutoscaler, which refers to horizontal pod auto-scaling.
CronHPA stands for Cron HorizontalPodAutoscaler, which refers to scheduled horizontal pod auto-scaling.
"},{"location":"en/end-user/kpanda/scale/hpa-cronhpa-compatibility-rules.html#conflict-between-cronhpa-and-hpa","title":"Conflict Between CronHPA and HPA","text":"
Scheduled scaling with CronHPA triggers horizontal pod scaling at specified times. To prevent sudden traffic surges, you may have configured HPA to ensure the normal operation of your application. If both HPA and CronHPA are detected simultaneously, conflicts arise because CronHPA and HPA operate independently without awareness of each other. Consequently, the actions performed last will override those executed first.
By comparing the definition templates of CronHPA and HPA, the following points can be observed:
Both CronHPA and HPA use the scaleTargetRef field to identify the scaling target.
CronHPA schedules the number of replicas to scale based on crontab rules in jobs.
HPA determines scaling based on resource utilization.
Note
If both CronHPA and HPA are set, there will be scenarios where CronHPA and HPA simultaneously operate on a single scaleTargetRef.
"},{"location":"en/end-user/kpanda/scale/hpa-cronhpa-compatibility-rules.html#compatibility-solution-for-cronhpa-and-hpa","title":"Compatibility Solution for CronHPA and HPA","text":"
As noted above, the fundamental reason that simultaneous use of CronHPA and HPA results in the later action overriding the earlier one is that the two controllers cannot sense each other. Therefore, the conflict can be resolved by enabling CronHPA to be aware of HPA's current state.
The system will treat HPA as the scaling object for CronHPA, thus achieving scheduled scaling for the Deployment object defined by the HPA.
HPA's definition configures the Deployment in the scaleTargetRef field, and then the Deployment uses its definition to locate the ReplicaSet, which ultimately adjusts the actual number of replicas.
In AI platform, the scaleTargetRef in CronHPA is set to the HPA object, and it uses the HPA object to find the actual scaleTargetRef, allowing CronHPA to be aware of HPA's current state.
CronHPA senses HPA by adjusting HPA. CronHPA determines whether scaling is needed and modifies the HPA upper limit by comparing the target number of replicas with the current number of replicas, choosing the larger value. Similarly, CronHPA determines whether to modify the HPA lower limit by comparing the target number of replicas from CronHPA with the configuration in HPA, choosing the smaller value.
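For illustration, a sketch of a CronHPA object whose scaleTargetRef points at an HPA instead of a Deployment. The names, schedules, and target sizes are assumptions, and the exact API group/version should be checked against the installed kubernetes-cronhpa-controller:

```yaml
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: cronhpa-my-dep
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: hpa-my-dep               # CronHPA adjusts the HPA object rather than the Deployment directly
  jobs:
    - name: scale-up-before-peak
      schedule: "0 0 8 * * *"      # 6-field cron (with seconds): every day at 08:00
      targetSize: 10
    - name: scale-down-after-peak
      schedule: "0 0 20 * * *"     # every day at 20:00
      targetSize: 2
```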
The scheduled horizontal pod autoscaling policy (CronHPA) provides a stable compute resource guarantee for periodically high-concurrency applications, and kubernetes-cronhpa-controller is the key component that implements CronHPA.
This section describes how to install the kubernetes-cronhpa-controller plugin.
Note
To use CronHPA, you need to install not only the kubernetes-cronhpa-controller plugin but also the metrics-server plugin.
Refer to the following steps to install the kubernetes-cronhpa-controller plugin for the cluster.
On the Clusters page, find the target cluster where the plugin needs to be installed, click the name of the cluster, then click Workloads -> Deployments on the left, and click the name of the target workload.
On the workload details page, click the Auto Scaling tab, and click Install on the right side of CronHPA .
Read the relevant introduction of the plug-in, select the version and click the Install button. It is recommended to install 1.3.0 or later.
Refer to the following instructions to configure the parameters.
Name: Enter the plugin name. Note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as kubernetes-cronhpa-controller.
Namespace: Select which namespace the plugin will be installed in, here we take default as an example.
Version: The version of the plugin, here we take the 1.3.0 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be in the ready state before marking the application installation as successful.
Failed to delete: If the plugin installation fails, delete the associated resources that have already been installed. When enabled, Wait will be enabled synchronously by default.
Detailed log: When enabled, a detailed log of the installation process will be recorded.
Note
After enabling Ready Wait and/or Failed to delete, it may take some time for the application to be marked as "running".
Click OK in the lower right corner of the page, and the system will automatically jump to the Helm Apps list page. Wait a few minutes and refresh the page to see the application you just installed.
Warning
If you need to delete the kubernetes-cronhpa-controller plugin, you should go to the Helm Apps list page to delete it completely.
If you delete the plugin under the Auto Scaling tab of the workload, only the workload copy of the plugin is deleted; the plugin itself is not deleted, and an error will be prompted when the plugin is reinstalled later.
Go back to the Auto Scaling tab under the workload details page, and you can see that the interface displays Plug-in installed . Now it's time to start creating CronHPA policies.
metrics-server is the built-in resource usage metrics collection component of Kubernetes. You can automatically scale Pod replicas horizontally for workload resources by configuring HPA policies.
This section describes how to install metrics-server .
Please perform the following steps to install the metrics-server plugin for the cluster.
On the Auto Scaling page under workload details, click the Install button to enter the metrics-server plug-in installation interface.
Read the introduction of the metrics-server plugin, select the version, and click the Install button. This page uses version 3.8.2 as an example; version 3.8.2 or later is recommended.
Configure basic parameters on the installation configuration interface.
Name: Enter the plugin name. Note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as metrics-server-01.
Namespace: Select the namespace for plugin installation, here we take default as an example.
Version: The version of the plugin, here we take 3.8.2 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be ready before marking the application installation as successful.
Failed to delete: When enabled, Ready Wait is enabled synchronously by default. If the installation fails, the resources related to the installation will be removed.
Verbose log: Turn on the verbose output of the installation process log.
Note
After enabling Wait and/or Failed to delete, it may take some time for the app to be marked as Running.
Advanced parameter configuration
If the cluster network cannot access the k8s.gcr.io repository, please try to modify the repository parameter to repository: k8s.m.daocloud.io/metrics-server/metrics-server .
An SSL certificate is also required to install the metrics-server plugin. To bypass certificate verification, add the - --kubelet-insecure-tls parameter under defaultArgs: .
Click to view and use the YAML parameters to replace the default YAML
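For reference, a hedged sketch of the Helm values overrides described above; the defaultArgs entries other than --kubelet-insecure-tls mirror typical chart defaults and may differ between chart versions:

```yaml
image:
  # Mirror repository for clusters without access to k8s.gcr.io
  repository: k8s.m.daocloud.io/metrics-server/metrics-server
defaultArgs:
  - --cert-dir=/tmp
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  - --kubelet-insecure-tls        # bypass kubelet certificate verification
```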
Click the OK button to complete the installation of the metrics-server plug-in, and then the system will automatically jump to the Helm Apps list page. After a few minutes, refresh the page and you will see the newly installed Applications.
Note
When deleting the metrics-server plugin, the plugin can only be completely deleted on the Helm Apps list page. If you only delete metrics-server on the workload page, only the workload copy of the application is deleted; the application itself is not deleted, and an error will be prompted when you reinstall the plugin later.
The Vertical Pod Autoscaler (VPA) makes cluster resource allocation more reasonable and avoids wasting cluster resources. VPA is the key component for vertical container autoscaling.
This section describes how to install the VPA plugin.
To use VPA policies, you need to install not only the vpa plugin but also the metrics-server plugin.
Refer to the following steps to install the vpa plugin for the cluster.
On the Clusters page, find the target cluster where the plugin needs to be installed, click the name of the cluster, then click Workloads -> Deployments on the left, and click the name of the target workload.
On the workload details page, click the Auto Scaling tab, and click Install on the right side of VPA .
Read the relevant introduction of the plug-in, select the version and click the Install button. It is recommended to install 1.5.0 or later.
Review the configuration parameters described below.
Name: Enter the plugin name. Note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as kubernetes-cronhpa-controller.
Namespace: Select which namespace the plugin will be installed in, here we take default as an example.
Version: The version of the plugin, here we take the 1.5.0 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be in the ready state before marking the application installation as successful.
Failed to delete: If the plugin installation fails, delete the associated resources that have already been installed. When enabled, Wait will be enabled synchronously by default.
Detailed log: When enabled, a detailed log of the installation process will be recorded.
Note
After enabling Wait and/or Failed to delete, it may take some time for the application to be marked as running.
Click OK in the lower right corner of the page, and the system will automatically jump to the Helm Apps list page. Wait a few minutes and refresh the page to see the application you just installed.
Warning
If you need to delete the vpa plugin, you should go to the Helm Apps list page to delete it completely.
If you delete the plugin under the Auto Scaling tab of the workload, only the workload copy of the plugin is deleted; the plugin itself is not deleted, and an error will be prompted when the plugin is reinstalled later.
Go back to the Auto Scaling tab under the workload details page, and you can see that the interface displays Plugin installed. Now you can start creating VPA policies.
Log in to the cluster, click the sidebar Helm Apps → Helm Charts, enter knative in the search box at the top right, and then press the Enter key to search.
Click the knative-operator to enter the installation configuration interface. You can view the available versions and the Parameters optional items of Helm values on this interface.
After clicking the install button, you will enter the installation configuration interface.
Enter the name and installation tenant; it is recommended to check Wait and Detailed Logs.
In the settings below, you can tick Serving and enter the installation tenant of the Knative Serving component, which will deploy the Knative Serving component after installation. This component is managed by the Knative Operator.
Knative provides a higher level of abstraction, simplifying and speeding up the process of building, deploying, and managing applications on Kubernetes. It allows developers to focus more on implementing business logic, while leaving most of the infrastructure and operations work to Knative, significantly improving productivity.
| Component | Features |
| --- | --- |
| Activator | Queues requests (if a Knative Service has scaled to zero). Calls the autoscaler to bring back services that have scaled down to zero and forward queued requests. The Activator can also act as a request buffer, handling bursts of traffic. |
| Autoscaler | Responsible for scaling Knative services based on configuration, metrics, and incoming requests. |
| Controller | Manages the state of Knative CRs. It monitors multiple objects, manages the lifecycle of dependent resources, and updates resource status. |
| Queue-Proxy | Sidecar container injected into each Knative Service. Responsible for collecting traffic data and reporting it to the Autoscaler, which then initiates scaling requests based on this data and preset rules. |
| Webhooks | Knative Serving has several Webhooks responsible for validating and mutating Knative resources. |

Ingress Traffic Entry Solutions

| Solution | Use Case |
| --- | --- |
| Istio | If Istio is already in use, it can be chosen as the traffic entry solution. |
| Contour | If Contour has been enabled in the cluster, it can be chosen as the traffic entry solution. |
| Kourier | If neither of the above two Ingress components is present, Knative's Envoy-based Kourier Ingress can be used as the traffic entry solution. |

Autoscaler Solutions Comparison

| Autoscaler Type | Core Part of Knative Serving | Default Enabled | Scale to Zero Support | CPU-based Autoscaling Support |
| --- | --- | --- | --- | --- |
| Knative Pod Autoscaler (KPA) | Yes | Yes | Yes | No |
| Horizontal Pod Autoscaler (HPA) | No | Needs to be enabled after installing Knative Serving | No | Yes |

CRD

| Resource Type | API Name | Description |
| --- | --- | --- |
| Services | service.serving.knative.dev | Automatically manages the entire lifecycle of workloads, controls the creation of other objects, and ensures applications have Routes, Configurations, and new revisions with each update. |
| Routes | route.serving.knative.dev | Maps network endpoints to one or more revision versions, supports traffic distribution and version routing. |
| Configurations | configuration.serving.knative.dev | Maintains the desired state of deployments, provides separation between code and configuration, follows the Twelve-Factor App methodology; modifying configurations creates new revisions. |
| Revisions | revision.serving.knative.dev | Snapshot of the workload at each modification point; an immutable object that automatically scales based on traffic. |

Knative Practices
In this section, we will delve into learning Knative through several practical exercises.
Case 1: When there is low traffic or no traffic, traffic is routed to the Activator.
Case 2: When there is high traffic, traffic is routed directly to the Pods only once it exceeds the target-burst-capacity.
When set to 0, the Activator is only in the request path when scaling from zero.
When set to -1, the Activator is always present in the request path.
When set to > 0, it is the number of additional concurrent requests the system can handle before triggering scaling.
Case 3: When traffic decreases again, traffic is routed back to the Activator once current_demand + target-burst-capacity > (number of Pods × concurrency-target).
That is, traffic returns to the Activator when the total number of pending requests plus the number of requests allowed to exceed the target concurrency is greater than the target concurrency per Pod multiplied by the number of Pods.
"},{"location":"en/end-user/kpanda/scale/knative/playground.html#case-2-based-on-concurrent-elastic-scaling","title":"case 2 - Based on Concurrent Elastic Scaling","text":"
We first apply the following YAML definition under the cluster.
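A minimal sketch of such a Knative Service scaled on concurrency; the image and the per-Pod target of 10 are assumptions:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go            # hypothetical service name
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "10"      # target concurrent requests per Pod
    spec:
      containers:
        - image: ghcr.io/knative/autoscale-go:latest   # placeholder sample image
```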
"},{"location":"en/end-user/kpanda/scale/knative/playground.html#case-3-based-on-concurrent-elastic-scaling-scale-out-in-advance-to-reach-a-specific-ratio","title":"case 3 - Based on concurrent elastic scaling, scale out in advance to reach a specific ratio.","text":"
We can easily achieve this, for example, by limiting the concurrency to 10 per container and setting autoscaling.knative.dev/target-utilization-percentage: 70, so that Pods start to scale out once 70% of the target concurrency is reached.
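For example, the revision template annotations could look like the following fragment (the values are assumptions):

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "10"                          # hard per-Pod concurrency target
        autoscaling.knative.dev/target-utilization-percentage: "70"   # start scaling out at 70% of the target
```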
"},{"location":"en/end-user/kpanda/security/index.html","title":"Types of Security Scans","text":"
AI platform Container Management provides three types of security scans:
Compliance Scan: Conducts security scans on cluster nodes based on CIS Benchmark.
Authorization Scan: Checks for security and compliance issues in the Kubernetes cluster, records and verifies authorized access, object changes, events, and other activities related to the Kubernetes API.
Vulnerability Scan: Scans the Kubernetes cluster for potential vulnerabilities and risks, such as unauthorized access, sensitive information leakage, weak authentication, container escape, etc.
The object of compliance scanning is the cluster node. The scan result lists the scan items and results and provides repair suggestions for any failed scan items. For specific security rules used during scanning, refer to the CIS Kubernetes Benchmark.
The focus of the scan varies when checking different types of nodes.
Scan the control plane node (Controller)
Focus on the security of system components such as API Server , controller-manager , scheduler , kubelet , etc.
Check the security configuration of the Etcd database.
Verify whether the cluster's authentication mechanism, authorization policy, and network security configuration meet security standards.
Scan worker nodes
Check if the configuration of container runtimes such as kubelet and Docker meets security standards.
Verify whether the container image has been trusted and verified.
Check if the network security configuration of the node meets security standards.
Tip
To use compliance scanning, you need to create a scan configuration first, and then create a scan policy based on that configuration. After executing the scan policy, you can view the scan report.
Authorization scanning focuses on security vulnerabilities caused by authorization issues. Authorization scans help users identify security threats in Kubernetes clusters and determine which resources need further review and protection measures. By performing these checks, users can gain a clearer and more comprehensive understanding of their Kubernetes environment and ensure that the cluster environment meets Kubernetes best practices and security standards.
Specifically, authorization scanning supports the following operations:
Scans the health status of all nodes in the cluster.
Scans the running state of components in the cluster, such as kube-apiserver , kube-controller-manager , kube-scheduler , etc.
API security: whether unsafe API versions are enabled, whether appropriate RBAC roles and permission restrictions are set, etc.
Container security: whether insecure images are used, whether privileged mode is enabled, whether appropriate security context is set, etc.
Network security: whether appropriate network policy is enabled to restrict traffic, whether TLS encryption is used, etc.
Storage security: whether appropriate encryption and access controls are enabled.
Application security: whether necessary security measures are in place, such as password management, cross-site scripting attack defense, etc.
Provides warnings and suggestions: Security best practices that cluster administrators should perform, such as regularly rotating certificates, using strong passwords, restricting network access, etc.
Tip
To use authorization scanning, you need to create a scan policy first. After executing the scan policy, you can view the scan report. For details, refer to Security Scanning.
Vulnerability scanning focuses on scanning potential malicious attacks and security vulnerabilities, such as remote code execution, SQL injection, XSS attacks, and some attacks specific to Kubernetes. The final scan report lists the security vulnerabilities in the cluster and provides repair suggestions.
Tip
To use vulnerability scanning, you need to create a scan policy first. After executing the scan policy, you can view the scan report. For details, refer to Vulnerability Scan.
To use the Permission Scan feature, you need to create a scan policy first. After executing the policy, a scan report will be automatically generated for viewing.
"},{"location":"en/end-user/kpanda/security/audit.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
On the left navigation bar of the homepage in the Container Management module, click Security Management .
Click Permission Scan on the left navigation bar, then click the Scan Policy tab and click Create Scan Policy on the right.
Fill in the configuration according to the following instructions, and then click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to keep. When the specified retention quantity is exceeded, the earliest reports are deleted.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ┇ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
For one-time scan policies: Only support the Delete operation.
To use the Vulnerability Scan feature, you need to create a scan policy first. After executing the policy, a scan report will be automatically generated for viewing.
"},{"location":"en/end-user/kpanda/security/hunter.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
On the left navigation bar of the homepage in the Container Management module, click Security Management .
Click Vulnerability Scan on the left navigation bar, then click the Scan Policy tab and click Create Scan Policy on the right.
Fill in the configuration according to the following instructions, and then click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to keep. When the specified retention quantity is exceeded, the earliest reports are deleted.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ┇ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
For one-time scan policies: Only support the Delete operation.
The first step in using CIS Scanning is to create a scan configuration. Based on the scan configuration, you can then create scan policies, execute scan policies, and finally view scan results.
"},{"location":"en/end-user/kpanda/security/cis/config.html#create-a-scan-configuration","title":"Create a Scan Configuration","text":"
The steps for creating a scan configuration are as follows:
Click Security Management in the left navigation bar of the homepage of the container management module.
By default, enter the Compliance Scanning page, click the Scan Configuration tab, and then click Create Scan Configuration in the upper-right corner.
Fill in the configuration name, select the configuration template, and optionally check the scan items, then click OK .
Scan Template: Currently, two templates are provided. The kubeadm template is suitable for general Kubernetes clusters. The daocloud template ignores scan items that are not applicable to AI platform based on the kubeadm template and the platform design of AI platform.
Under the scan configuration tab, clicking the name of a scan configuration displays the type of the configuration, the number of scan items, the creation time, the configuration template, and the specific scan items enabled for the configuration.
After a scan configuration has been successfully created, it can be updated or deleted according to your needs.
Under the scan configuration tab, click the ┇ action button to the right of a configuration:
Select Edit to update the configuration. You can update the description, template, and scan items. The configuration name cannot be changed.
Select Delete to delete the configuration.
"},{"location":"en/end-user/kpanda/security/cis/policy.html","title":"Scan Policy","text":""},{"location":"en/end-user/kpanda/security/cis/policy.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
After creating a scan configuration, you can create a scan policy based on the configuration.
Under the Security Management -> Compliance Scanning page, click the Scan Policy tab on the right to create a scan policy.
Fill in the configuration according to the following instructions and click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Configuration: Select a pre-created scan configuration. The scan configuration determines which specific scan items need to be performed.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to keep. When the specified retention quantity is exceeded, the earliest reports are deleted.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ┇ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
For one-time scan policies: Only support the Delete operation.
After executing a scan policy, a scan report will be generated automatically. You can view the scan report online or download it to your local computer.
Download and View
Under the Security Management -> Compliance Scanning page, click the Scan Report tab, then click the ┇ action button to the right of a report and select Download.
View Online
Clicking the name of a report allows you to view its content online, which includes:
The target cluster scanned.
The scan policy and scan configuration used.
The start time of the scan.
The total number of scan items, the number passed, and the number failed.
For failed scan items, repair suggestions are provided.
For passed scan items, more secure operational suggestions are provided.
A data volume (PersistentVolume, PV) is a piece of storage in the cluster, which can be prepared in advance by the administrator, or dynamically prepared using a storage class (Storage Class). PV is a cluster resource, but it has an independent life cycle and will not be deleted when the Pod process ends. Mounting PVs to workloads can achieve data persistence for workloads. The PV holds the data directory that can be accessed by the containers in the Pod.
"},{"location":"en/end-user/kpanda/storage/pv.html#create-data-volume","title":"Create data volume","text":"
Currently, there are two ways to create data volumes: YAML and form. These two ways have their own advantages and disadvantages, and can meet the needs of different users.
There are fewer steps and more efficient creation through YAML, but the threshold requirement is high, and you need to be familiar with the YAML file configuration of the data volume.
It is more intuitive and easier to create through the form; just fill in the proper values according to the prompts, but the steps are more cumbersome.
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
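For reference, a minimal sketch of a local PV created through YAML; the capacity, path, and node name are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-local-example
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  local:
    path: /mnt/disks/vol1            # hypothetical local path on the node
  nodeAffinity:                       # local volumes require node affinity
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-01             # hypothetical node name
```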
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) -> Create Data Volume (PV) in the left navigation bar.
Fill in the basic information.
The data volume name, data volume type, mount path, volume mode, and node affinity cannot be changed after creation.
Data volume type: For a detailed introduction to volume types, refer to the official Kubernetes document Volumes.
Local: The local storage of the node is packaged into a PVC interface, and the container uses the PVC directly without paying attention to the underlying storage type. Local volumes do not support dynamic provisioning, but they support node affinity configuration, which can limit which nodes can access the data volume.
HostPath: Uses files or directories on the node's file system as data volumes; Pod scheduling based on node affinity is not supported.
Mount path: mount the data volume to a specific directory in the container.
Access mode:
ReadWriteOnce: The data volume can be mounted by a node in read-write mode.
ReadWriteMany: The data volume can be mounted by multiple nodes in read-write mode.
ReadOnlyMany: The data volume can be mounted read-only by multiple nodes.
ReadWriteOncePod: The data volume can be mounted read-write by a single Pod.
Reclaim policy:
Retain: The PV is not deleted; its status is only changed to released, and it needs to be reclaimed manually by the user. For how to manually reclaim, refer to Persistent Volume.
Recycle: Keep the PV but empty its data by performing a basic wipe ( rm -rf /thevolume/* ).
Delete: The PV and its data are deleted together.
Volume mode:
File system: The data volume will be mounted to a directory by the Pod. If the data volume is backed by a device and the device is currently empty, a file system is created on the device before the volume is mounted for the first time.
Block: Use the data volume as a raw block device. This type of volume is given to the Pod as a block device without any file system on it, allowing the Pod to access the data volume faster.
Node affinity:
"},{"location":"en/end-user/kpanda/storage/pv.html#view-data-volume","title":"View data volume","text":"
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) in the left navigation bar.
On this page, you can view all data volumes in the current cluster, as well as information such as the status, capacity, and namespace of each data volume.
Supports sequential or reverse sorting according to the name, status, namespace, and creation time of data volumes.
Click the name of a data volume to view the basic configuration, StorageClass information, labels, comments, etc. of the data volume.
"},{"location":"en/end-user/kpanda/storage/pv.html#clone-data-volume","title":"Clone data volume","text":"
By cloning a data volume, a new data volume can be recreated based on the configuration of the cloned data volume.
Enter the clone page
On the data volume list page, find the data volume to be cloned, and select Clone under the operation bar on the right.
You can also click the name of the data volume, click the operation button in the upper right corner of the details page and select Clone .
Use the original configuration directly, or modify it as needed, and click OK at the bottom of the page.
"},{"location":"en/end-user/kpanda/storage/pv.html#update-data-volume","title":"Update data volume","text":"
There are two ways to update data volumes: via the form or via a YAML file.
Note
Only updating the alias, capacity, access mode, reclamation policy, label, and comment of the data volume is supported.
On the data volume list page, find the data volume that needs to be updated, then select Update under the operation bar on the right to update through the form, or select Edit YAML to update through YAML.
Alternatively, click the name of the data volume to enter its details page, then select Update in the upper right corner of the page to update through the form, or select Edit YAML to update through YAML.
"},{"location":"en/end-user/kpanda/storage/pv.html#delete-data-volume","title":"Delete data volume","text":"
On the data volume list page, find the data to be deleted, and select Delete in the operation column on the right.
You can also click the name of the data volume, click the operation button in the upper right corner of the details page and select Delete .
A persistent volume claim (PersistentVolumeClaim, PVC) expresses a user's request for storage. PVC consumes PV resources and claims a data volume with a specific size and specific access mode. For example, the PV volume is required to be mounted in ReadWriteOnce, ReadOnlyMany or ReadWriteMany modes.
"},{"location":"en/end-user/kpanda/storage/pvc.html#create-data-volume-statement","title":"Create data volume statement","text":"
Currently, there are two ways to create data volume declarations: YAML and form. These two ways have their own advantages and disadvantages, and can meet the needs of different users.
There are fewer steps and more efficient creation through YAML, but the threshold requirement is high, and you need to be familiar with the YAML file configuration of the data volume declaration.
It is more intuitive and easier to create through the form; just fill in the proper values according to the prompts, but the steps are more cumbersome.
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
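For reference, a minimal sketch of a PVC created through YAML; the capacity and StorageClass name are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-example
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
  storageClassName: hwameistor-storage-lvm-hdd   # hypothetical StorageClass name
```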
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) -> Create Data Volume Declaration (PVC) in the left navigation bar.
Fill in the basic information.
The name, namespace, creation method, data volume, capacity, and access mode of the data volume declaration cannot be changed after creation.
Creation method: dynamically create a new data volume claim in an existing StorageClass or data volume, or create a new data volume claim based on a snapshot of a data volume claim.
The declared capacity of the data volume cannot be modified when the snapshot is created, and can be modified after the creation is complete.
After selecting the creation method, select the desired StorageClass/data volume/snapshot from the drop-down list.
Access mode:
ReadWriteOnce: the data volume declaration can be mounted by a node in read-write mode.
ReadWriteMany: the data volume declaration can be mounted by multiple nodes in read-write mode.
ReadOnlyMany: the data volume declaration can be mounted read-only by multiple nodes.
ReadWriteOncePod: the data volume declaration can be mounted by a single Pod in read-write mode.
"},{"location":"en/end-user/kpanda/storage/pvc.html#view-data-volume-statement","title":"View data volume statement","text":"
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) in the left navigation bar.
On this page, you can view all data volume declarations in the current cluster, as well as information such as the status, capacity, and namespace of each data volume declaration.
Supports sorting in sequential or reverse order according to the declared name, status, namespace, and creation time of the data volume.
Click the name of the data volume declaration to view the basic configuration, StorageClass information, labels, comments and other information of the data volume declaration.
"},{"location":"en/end-user/kpanda/storage/pvc.html#expansion-data-volume-statement","title":"Expansion data volume statement","text":"
In the left navigation bar, click Container Storage -> Data Volume Declaration (PVC) , and find the data volume declaration whose capacity you want to adjust.
Click the name of the data volume declaration, and then click the operation button in the upper right corner of the page and select Expansion .
Enter the target capacity and click OK .
"},{"location":"en/end-user/kpanda/storage/pvc.html#clone-data-volume-statement","title":"Clone data volume statement","text":"
By cloning a data volume claim, a new data volume claim can be recreated based on the configuration of the cloned data volume claim.
Enter the clone page
On the data volume declaration list page, find the data volume declaration that needs to be cloned, and select Clone under the operation bar on the right.
You can also click the name of the data volume declaration, click the operation button in the upper right corner of the details page and select Clone .
Use the original configuration directly, or modify it as needed, and click OK at the bottom of the page.
"},{"location":"en/end-user/kpanda/storage/pvc.html#update-data-volume-statement","title":"Update data volume statement","text":"
There are two ways to update data volume claims: via the form or via a YAML file.
Note
Only the aliases, labels, and annotations of data volume claims can be updated.
On the data volume list page, find the data volume declaration that needs to be updated, then select Update in the operation bar on the right to update it through the form, or select Edit YAML to update it through YAML.
Alternatively, click the name of the data volume declaration to enter its details page, then select Update in the upper right corner of the page to update through the form, or select Edit YAML to update through YAML.
"},{"location":"en/end-user/kpanda/storage/pvc.html#delete-data-volume-statement","title":"Delete data volume statement","text":"
On the data volume declaration list page, find the data to be deleted, and select Delete in the operation column on the right.
You can also click the name of the data volume statement, click the operation button in the upper right corner of the details page and select Delete .
If there is no optional StorageClass or data volume in the list, you can Create a StorageClass or Create a data volume.
If there is no optional snapshot in the list, you can enter the details page of the data volume declaration and create a snapshot in the upper right corner.
If the StorageClass (SC) used by the data volume declaration is not enabled for snapshots, snapshots cannot be made, and the page will not display the "Make Snapshot" option.
If the StorageClass (SC) used by the data volume declaration does not have the capacity expansion feature enabled, the data volume does not support capacity expansion, and the page will not display the capacity expansion option.
A StorageClass refers to a large storage resource pool composed of many physical disks. This platform supports the creation of block StorageClass, local StorageClass, and custom StorageClass after accessing various storage vendors, and then dynamically configures data volumes for workloads.
Currently, it supports creating StorageClass through YAML and forms. These two methods have their own advantages and disadvantages, and can meet the needs of different users.
There are fewer steps and more efficient creation through YAML, but the threshold requirement is high, and you need to be familiar with the YAML file configuration of the StorageClass.
It is more intuitive and easier to create through the form; just fill in the proper values according to the prompts, but the steps are more cumbersome.
Click the name of the target cluster in the cluster list, and then click Container Storage -> StorageClass (SC) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
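For reference, a minimal sketch of a StorageClass created through YAML, using the rancher.io/local-path driver mentioned below as an example; the name and binding mode are assumptions:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: sc-local-path
provisioner: rancher.io/local-path       # CSI/provisioner driver in the vendor-specified format
reclaimPolicy: Retain                     # keep or delete data when the volume is deleted
allowVolumeExpansion: true                # enable expansion if the underlying driver supports it
volumeBindingMode: WaitForFirstConsumer
```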
Click the name of the target cluster in the cluster list, and then click Container Storage -> StorageClass (SC) -> Create StorageClass (SC) in the left navigation bar.
Fill in the basic information and click OK at the bottom.
Custom storage system
The StorageClass name, driver, and reclamation policy cannot be modified after creation.
CSI storage driver: A standard Kubernetes-based container storage interface plug-in, which must comply with the format specified by the storage manufacturer, such as rancher.io/local-path .
For how to fill in the CSI drivers provided by different vendors, refer to the official Kubernetes document Storage Class.
Reclaim policy: When deleting a data volume, keep the data in the data volume or delete the data in it.
Snapshot/Expansion: When enabled, data volumes and data volume declarations based on this StorageClass support snapshot and expansion features, provided the underlying storage driver supports them.
HwameiStor storage system
The StorageClass name, driver, and reclamation policy cannot be modified after creation.
Storage system: HwameiStor storage system.
Storage type: LVM and raw disk types are supported.
LVM type: The usage method recommended by HwameiStor, which can use highly available data volumes; the proper CSI storage driver is lvm.hwameistor.io .
Raw disk data volume: Suitable for scenarios that do not require high availability, as it has no high availability capability; the proper CSI driver is hdd.hwameistor.io .
High Availability Mode: Before using the high availability capability, please make sure the DRBD component has been installed. After the high availability mode is turned on, the number of data volume replicas can be set to 1 or 2, and a replica count of 1 can be converted to 2 if needed.
Reclaim policy: When deleting a data volume, keep the data in the data volume or delete the data in it.
Snapshot/Expansion: After it is enabled, the data volume/data volume declaration based on the StorageClass can support the expansion and snapshot features, but the premise is that the underlying storage driver supports the snapshot and expansion features.
On the StorageClass list page, find the StorageClass that needs to be updated, and select Edit under the operation bar on the right to update the StorageClass.
Info
Select View YAML to view the YAML file of the StorageClass, but editing is not supported.
This page introduces how to create a CronJob through images and YAML files.
CronJobs are suitable for performing periodic operations, such as backup and report generation. These jobs can be configured to repeat periodically (for example: daily/weekly/monthly), and the time interval at which the job starts to run can be defined.
Before creating a CronJob, the following prerequisites need to be met:
The Kubernetes cluster has been integrated or created in the Container Management module (Integrate Kubernetes Cluster or Create Kubernetes Cluster), and the cluster UI interface is accessible.
A namespace and a user have been created.
The current operating user should have NS Editor or higher permissions; for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/end-user/kpanda/workloads/create-cronjob.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a CronJob using the image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> CronJobs in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings, CronJob Settings, Advanced Configuration, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the CronJobs list. Click ┇ on the right side of the list to perform operations such as updating, deleting, and restarting the CronJob.
On the Create CronJobs page, enter the information according to the table below, and click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number. Workloads of the same type in the same namespace cannot have the same name, and the name of a workload cannot be changed after it is created.
Namespace: Select the namespace in which to deploy the newly created CronJob; the default namespace is used by default. If you can't find the desired namespace, you can create a new namespace according to the prompt on the page.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container setting is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the configuration with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers, and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull the image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, the local image is used first, and the image is pulled from the container registry only when it does not exist locally. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports assigning the container either an entire GPU or part of a vGPU. For example, for an 8-core GPU, entering 8 lets the container use the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plugin on the cluster nodes in advance and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass configuration to the Pod, etc. For details, refer to Container environment variable configuration.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
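For reference, the sketch below shows how such a UID setting maps to the securityContext fields of a Pod spec; the Pod name, image, and ID values are placeholders, not values generated by the page.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo   # placeholder name
spec:
  securityContext:
    runAsUser: 1000     # UID for all container processes; 0 would mean root
    runAsGroup: 3000    # primary GID for the processes
    fsGroup: 2000       # group that owns mounted volumes
  containers:
    - name: demo
      image: busybox    # placeholder image
      command: ["sh", "-c", "id && sleep 3600"]  # prints the effective UID/GID, then idles
```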
Concurrency Policy: Whether to allow multiple Job jobs to run in parallel.
Allow : A new job can be created even if the previous job has not completed, and multiple jobs can run in parallel. Too many jobs may occupy cluster resources.
Forbid : Before the previous job is completed, a new job cannot be created. If the execution time of the new job is up and the previous job has not been completed, CronJob will ignore the execution of the new job.
Replace : If the execution time of the new job is up, but the previous job has not been completed, the new job will replace the previous job.
The above rules only apply to multiple jobs created by the same CronJob. Multiple jobs created by multiple CronJobs are always allowed to run concurrently.
Policy Settings: Set the time period for job execution based on minutes, hours, days, weeks, and months. Support custom Cron expressions with numbers and * , after inputting the expression, the meaning of the current expression will be prompted. For detailed expression syntax rules, refer to Cron Schedule Syntax.
Job Records: Set how many records of successful or failed jobs to keep. 0 means do not keep.
Timeout: When this time is exceeded, the job will be marked as failed to execute, and all Pods under the job will be deleted. When it is empty, it means that no timeout is set. The default is 360 s.
Retries: the number of times the job can be retried, the default value is 6.
Restart Policy: Set whether to restart the Pod when the job fails.
The advanced configuration of CronJobs mainly involves labels and annotations.
You can click the Add button to add labels and annotations to the workload instance Pod.
"},{"location":"en/end-user/kpanda/workloads/create-cronjob.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating by image, you can also create CronJobs more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> CronJobs in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
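For reference, a minimal CronJob manifest is sketched below; the name, schedule, and image are placeholders, and the fields correspond to the settings described above (concurrency policy, job records, timeout, retries, restart policy).

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report-generator            # placeholder name
spec:
  schedule: "0 2 * * *"             # standard cron syntax: every day at 02:00
  concurrencyPolicy: Forbid         # Allow / Forbid / Replace
  successfulJobsHistoryLimit: 3     # job records kept for successful runs
  failedJobsHistoryLimit: 1         # job records kept for failed runs
  jobTemplate:
    spec:
      activeDeadlineSeconds: 360    # timeout for each job run
      backoffLimit: 6               # retries before the job is marked as failed
      template:
        spec:
          containers:
            - name: report
              image: busybox        # placeholder image
              command: ["sh", "-c", "date; echo generating report"]
          restartPolicy: OnFailure  # restart the Pod when the job fails
```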
This page introduces how to create a DaemonSet through images and YAML files.
A DaemonSet uses node affinity and tolerations to ensure that a replica of a Pod runs on all (or some) of the nodes. For nodes that newly join the cluster, the DaemonSet automatically deploys the proper Pod on the new node and tracks the running status of the Pod. When a node is removed, the DaemonSet deletes all Pods it created.
Common cases for daemons include:
Run cluster daemons on each node.
Run a log collection daemon on each node.
Run a monitoring daemon on each node.
In a simple case, one DaemonSet covering all nodes can be used for each type of daemon. For finer and more advanced daemon management, you can also deploy multiple DaemonSets for the same daemon. Each DaemonSet can have different flags and different memory and CPU requirements for different hardware types.
Before creating a DaemonSet, the following prerequisites need to be met:
The Kubernetes cluster has been integrated or created in the Container Management module (refer to Integrate Kubernetes Cluster or Create Kubernetes Cluster), and the cluster UI is accessible.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/end-user/kpanda/workloads/create-daemonset.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a DaemonSet using an image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> DaemonSets in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings, Service Settings, Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the DaemonSets list. Click ┇ on the right side of the list to perform operations such as updating, deleting, and restarting the DaemonSet.
On the Create DaemonSets page, after entering the information according to the table below, click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select which namespace to deploy the newly created DaemonSet in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container setting is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers, and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, the local image is used first, and the image is pulled from the container registry only when it does not exist locally. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports assigning the container either an entire GPU or part of a vGPU. For example, for an 8-core GPU, entering 8 lets the container use the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plugin on the cluster nodes in advance and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes four parts: network settings, upgrade policy, scheduling policy, and labels and annotations. You can click the tabs below to view the requirements of each part.
Network Configuration / Upgrade Policy / Scheduling Policies / Labels and Annotations
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related settings options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: Make the container use the domain name resolution file pointed to by the --resolv-conf parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: The application uses the domain name resolution file of the host it runs on.
ClusterFirst: The application uses Kube-DNS/CoreDNS for domain name resolution.
None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting to None, dnsConfig must be set; the container's domain name resolution file is then generated entirely from the dnsConfig settings.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Configuration options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options in dnsConfig conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
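A hedged sketch of how these DNS options appear in a Pod spec; the nameserver address, search domains, and host alias values are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-example            # placeholder name
spec:
  dnsPolicy: "None"            # dnsConfig must be set when the policy is None
  dnsConfig:
    nameservers:
      - 10.6.175.20            # domain name server address
    searches:                  # search domains (Kubernetes allows up to 6)
      - ns1.svc.cluster.local
      - my.dns.search.suffix
    options:
      - name: ndots            # each option has a required name and an optional value
        value: "2"
  hostAliases:
    - ip: "10.6.175.30"        # host alias entry written into /etc/hosts
      hostnames:
        - "example.internal"
  containers:
    - name: app
      image: busybox           # placeholder image
      command: ["sleep", "3600"]
```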
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Max Unavailable Pods: Specify the maximum value or ratio of unavailable pods during the workload update process, the default is 25%. If it is equal to the number of instances, there is a risk of service interruption.
Max Surge: The maximum or ratio of the total number of Pods exceeding the desired replica count of Pods during a Pod update. Default is 25%.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Minimum Ready: The minimum time for a Pod to be ready. Only after this time is the Pod considered available. The default is 0 seconds.
Upgrade Max Duration: If the deployment is not successful after the set time, the workload will be marked as failed. Default is 600 seconds.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds.
Node affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
Topology domain: namely topologyKey, which specifies a group of nodes that can be scheduled. For example, kubernetes.io/os means that as long as a node's operating system label meets the labelSelector conditions, Pods can be scheduled to that node.
For details, refer to Scheduling Policy.
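As an illustration (not a setting generated by the page), the fragment below sketches how node affinity, workload anti-affinity, and a topology key might look under a Pod template's spec; the label keys and values are placeholders.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: disktype                      # placeholder node label
              operator: In
              values: ["ssd"]
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: log-agent                       # placeholder Pod label
        topologyKey: kubernetes.io/hostname      # topology domain
```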
You can click the Add button to add tags and annotations to workloads and pods.
"},{"location":"en/end-user/kpanda/workloads/create-daemonset.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating by image, you can also create DaemonSets more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> DaemonSets in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a DaemonSet
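The original example is collapsed on the page; a representative DaemonSet manifest might look like the sketch below (the name, labels, and log-collector image are placeholders).

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector                # placeholder name
  namespace: default
  labels:
    app: log-collector
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane   # also run on control-plane nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: log-collector
          image: fluentd:v1.16       # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
            limits:
              memory: 200Mi
```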
This page describes how to create deployments through images and YAML files.
Deployment is a common resource in Kubernetes. It provides declarative updates for Pods and ReplicaSets and supports elastic scaling, rolling upgrades, and version rollback. Declare the desired Pod state in the Deployment, and the Deployment Controller will modify the current state through the ReplicaSet until it reaches the declared desired state. Deployment is stateless and does not support data persistence. It is suitable for deploying stateless applications that do not need to save data and can be restarted and rolled back at any time.
Through the container management module of AI platform, workloads on multicloud and multiclusters can be easily managed based on proper role permissions, covering the full life cycle of deployments, including creation, update, deletion, elastic scaling, restart, and version rollback.
Before using image to create deployments, the following prerequisites need to be met:
The Kubernetes cluster has been integrated or created in the Container Management module (refer to Integrate Kubernetes Cluster or Create Kubernetes Cluster), and the cluster UI is accessible.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/end-user/kpanda/workloads/create-deployment.html#create-by-image","title":"Create by image","text":"
Follow the steps below to create a deployment by image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> Deployments in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Setting, Service Setting, Advanced Setting in turn, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the Deployments list. Click ┇ on the right side of the list to perform operations such as update, delete, elastic scaling, restart, and version rollback on the workload. If the workload status is abnormal, please check the specific abnormal information; refer to Workload Status.
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number, such as deployment-01. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select the namespace where the newly created workload will be deployed. The default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Pods: Enter the number of Pod instances for the workload; one Pod instance is created by default.
Description: Enter the description information of the workload and customize the content. The number of characters cannot exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container setting is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, it is essential to correctly fill in the container name and image parameters; otherwise, you will not be able to proceed to the next step. After filling in the configuration according to the following requirements, click OK.
Container Type: The default is Work Container. For information on init containers, see the [K8s Official Documentation](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/).
Container Name: No more than 63 characters, supporting lowercase letters, numbers, and separators ("-"). It must start and end with a lowercase letter or number, for example, nginx-01.
Image:
Image: Select an appropriate image from the list. When entering the image name, the default is to pull the image from the official DockerHub.
Image Version: Select an appropriate version from the dropdown list.
Image Pull Policy: By checking Always pull the image, the image will be pulled from the repository each time the workload restarts/upgrades. If unchecked, it will only pull the local image, and will pull from the repository only if the image does not exist locally. For more details, refer to Image Pull Policy.
Registry Secret: Optional. If the target repository requires a Secret to access, you need to create a Secret first.
Privileged Container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and has all the privileges of running processes on the host.
CPU/Memory Request: The request value (the minimum resource needed) and the limit value (the maximum resource allowed) for CPU/memory resources. Configure resources for the container as needed to avoid resource waste and system failures caused by container resource overages. Default values are shown in the figure.
GPU Configuration: Configure GPU usage for the container, supporting only positive integers. The GPU quota setting supports assigning the container either an entire GPU or part of a vGPU. For example, for an 8-core GPU, entering 8 means the container uses the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting the GPU, the administrator needs to pre-install the GPU and driver plugin on the cluster node and enable the GPU feature in the Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Setting.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Setting.
Configure container parameters within the Pod, add environment variables or pass setting to the Pod, etc. For details, refer to Container environment variable setting.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Setting.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes four parts: Network Settings, Upgrade Policy, Scheduling Policies, Labels and Annotations. You can click the tabs below to view the setting requirements of each part.
Network Settings / Upgrade Policy / Scheduling Policies / Labels and Annotations
For container NIC setting, refer to Workload Usage IP Pool
DNS setting
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related setting options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: Make the container use the domain name resolution file pointed to by the --resolv-conf parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: The application uses the domain name resolution file of the host it runs on.
ClusterFirst: The application uses Kube-DNS/CoreDNS for domain name resolution.
None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the setting of dnsConfig.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Setting options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options in dnsConfig conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Max Unavailable: Specify the maximum value or ratio of unavailable pods during the workload update process, the default is 25%. If it is equal to the number of instances, there is a risk of service interruption.
Max Surge: The maximum or ratio of the total number of Pods exceeding the desired replica count of Pods during a Pod update. Default is 25%.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Minimum Ready: The minimum time for a Pod to be ready. Only after this time is the Pod considered available. The default is 0 seconds.
Upgrade Max Duration: If the deployment is not successful after the set time, the workload will be marked as failed. Default is 600 seconds.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds.
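As a reference, the upgrade settings above roughly correspond to the following fields of a Deployment spec; this is a sketch using the default values listed above, not output generated by the page.

```yaml
spec:
  replicas: 3
  revisionHistoryLimit: 10            # old versions kept for rollback
  minReadySeconds: 0                  # minimum ready time before a Pod counts as available
  progressDeadlineSeconds: 600        # upgrade max duration before the rollout is marked failed
  strategy:
    type: RollingUpdate               # or Recreate for rebuild-and-upgrade
    rollingUpdate:
      maxUnavailable: 25%             # max unavailable Pods during the update
      maxSurge: 25%                   # max Pods above the desired replica count
  template:
    spec:
      terminationGracePeriodSeconds: 30   # graceful period before the Pod stops
```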
Node Affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload Anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
For details, refer to Scheduling Policy.
You can click the Add button to add tags and annotations to workloads and pods.
"},{"location":"en/end-user/kpanda/workloads/create-deployment.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating by image, you can also create deployments more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> Deployments in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a deployment
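The example is collapsed on the page; a representative Deployment manifest might look like the sketch below (the name, labels, and nginx image are placeholders).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment        # placeholder name
  namespace: default
  labels:
    app: nginx
spec:
  replicas: 2                   # number of Pod instances
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25     # placeholder image
          ports:
            - containerPort: 80
          resources:
            requests:           # minimum resources to be used
              cpu: 100m
              memory: 128Mi
            limits:             # maximum resources allowed
              cpu: 250m
              memory: 256Mi
```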
This page introduces how to create a job through images and YAML files.
Job is suitable for performing one-time jobs. A Job creates one or more Pods, and the Job keeps retrying to run Pods until a certain number of Pods are successfully terminated. A Job ends when the specified number of Pods are successfully terminated. When a Job is deleted, all Pods created by the Job will be cleared. When a Job is paused, all active Pods in the Job are deleted until the Job is resumed. For more information about jobs, refer to Job.
The Kubernetes cluster has been integrated or created in the Container Management module (refer to Integrate Kubernetes Cluster or Create Kubernetes Cluster), and the cluster UI is accessible.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/end-user/kpanda/workloads/create-job.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a job using an image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> Jobs in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings and Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the job list. Click ┇ on the right side of the list to perform operations such as updating, deleting, and restarting the job.
On the Create Jobs page, enter the basic information according to the table below, and click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select which namespace to deploy the newly created job in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Number of Instances: Enter the number of Pod instances for the workload. By default, 1 Pod instance is created.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the setting requirements of each part.
Container settings is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers, and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, the local image is used first, and the image is pulled from the container registry only when it does not exist locally. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports assigning the container either an entire GPU or part of a vGPU. For example, for an 8-core GPU, entering 8 lets the container use the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plugin on the cluster nodes in advance and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle settings.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check settings.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage settings.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes job settings, labels and annotations.
Job Settings / Labels and Annotations
Parallel Pods: the maximum number of Pods that can be created at the same time during job execution, and the parallel number should not be greater than the total number of Pods. Default is 1.
Timeout: When this time is exceeded, the job will be marked as failed to execute, and all Pods under the job will be deleted. When it is empty, it means that no timeout is set.
Restart Policy: Whether to restart the Pod when the job fails.
You can click the Add button to add labels and annotations to the workload instance Pod.
"},{"location":"en/end-user/kpanda/workloads/create-job.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating by image, jobs can also be created more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> Jobs in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
This page describes how to create a StatefulSet through image and YAML files.
StatefulSet, like Deployment, is a common resource in Kubernetes, mainly used to manage the deployment and scaling of a set of Pods. The main difference between the two is that Deployment is stateless and does not save data, while StatefulSet is stateful and is mainly used to manage stateful applications. In addition, Pods in a StatefulSet have a persistent ID, which makes it easy to identify the proper Pod when matching storage volumes.
Through the container management module of AI platform, workloads on multicloud and multiclusters can be easily managed based on proper role permissions, covering the full life cycle of StatefulSets, including creation, update, deletion, elastic scaling, restart, and version rollback.
Before using image to create StatefulSets, the following prerequisites need to be met:
The Kubernetes cluster has been integrated or created in the Container Management module (refer to Integrate Kubernetes Cluster or Create Kubernetes Cluster), and the cluster UI is accessible.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/end-user/kpanda/workloads/create-statefulset.html#create-by-image","title":"Create by image","text":"
Follow the steps below to create a StatefulSet using an image.
Click Clusters on the left navigation bar, then click the name of the target cluster to enter Cluster Details.
Click Workloads -> StatefulSets in the left navigation bar, and then click the Create by Image button in the upper right corner.
Fill in Basic Information, Container Settings, Service Settings, Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the list of StatefulSets , and wait for the status of the workload to become running . If the workload status is abnormal, refer to Workload Status for specific exception information.
Click ┇ on the right side of the target workload to perform operations such as update, delete, elastic scaling, restart, and version rollback.
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number, such as deployment-01. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select the namespace where the newly created workload will be deployed. The default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Pods: Enter the number of Pod instances for the workload; one Pod instance is created by default.
Description: Enter the description information of the workload and customize the content. The number of characters cannot exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container settings is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers, and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, the local image is used first, and the image is pulled from the container registry only when it does not exist locally. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports assigning the container either an entire GPU or part of a vGPU. For example, for an 8-core GPU, entering 8 lets the container use the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plugin on the cluster nodes in advance and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes five parts: network settings, upgrade policy, container management policies, scheduling policy, and labels and annotations. You can click the tabs below to view the requirements of each part.
Network Configuration / Upgrade Policy / Container Management Policies / Scheduling Policies / Labels and Annotations
For container NIC settings, refer to Workload Usage IP Pool
DNS settings
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related settings options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: Make the container use the domain name resolution file pointed to by the --resolv-conf parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: The application uses the domain name resolution file of the host it runs on.
ClusterFirst: The application uses Kube-DNS/CoreDNS for domain name resolution.
None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the settings of dnsConfig.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Configuration options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options in dnsConfig conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
Kubernetes v1.7 and later versions can set Pod management policies through .spec.podManagementPolicy , which supports the following two methods:
OrderedReady : The default Pod management policy, which means that Pods are deployed in order. Only after the deployment of the previous Pod is successfully completed, the statefulset will start to deploy the next Pod. Pods are deleted in reverse order, with the last created being deleted first.
Parallel : Create or delete containers in parallel, just like Pods of the Deployment type. The StatefulSet controller starts or terminates all containers in parallel. There is no need to wait for a Pod to enter the Running and ready state or to stop completely before starting or terminating other Pods. This option only affects the behavior of scaling operations, not the order of updates.
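For reference, a sketch of how this policy is expressed in the StatefulSet spec (field names from the apps/v1 API; the values shown are illustrative, not defaults applied by the page):

```yaml
spec:
  podManagementPolicy: Parallel    # OrderedReady (default) or Parallel
  updateStrategy:
    type: RollingUpdate            # Pods are updated in reverse ordinal order
    rollingUpdate:
      partition: 0                 # only Pods with an ordinal >= partition are updated
```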
Tolerance time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds.
Node affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
Topology domain: namely topologyKey, which specifies a group of nodes that can be scheduled. For example, kubernetes.io/os means that as long as a node's operating system label meets the labelSelector conditions, Pods can be scheduled to that node.
For details, refer to Scheduling Policy.
You can click the Add button to add tags and annotations to workloads and pods.
"},{"location":"en/end-user/kpanda/workloads/create-statefulset.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating by image, you can also create StatefulSets more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> StatefulSets in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a statefulSet
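The example is collapsed on the page; a representative StatefulSet manifest with a headless Service name and a volume claim template might look like the sketch below (names, image, and StorageClass are placeholders).

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                          # placeholder name
spec:
  serviceName: "nginx"               # headless Service providing stable network identities
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25          # placeholder image
          ports:
            - containerPort: 80
          volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:              # each Pod gets its own PersistentVolumeClaim
    - metadata:
        name: www
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-path # placeholder StorageClass
        resources:
          requests:
            storage: 1Gi
```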
An environment variable refers to a variable set in the container running environment, which is used to add environment flags to Pods or transfer configurations, etc. It supports configuring environment variables for Pods in the form of key-value pairs.
Suanova container management adds a graphical interface to configure environment variables for Pods on the basis of native Kubernetes, and supports the following configuration methods:
Key-value pair (Key/Value Pair): Use a custom key-value pair as the environment variable of the container
Resource reference (Resource): Use the fields defined by Container as the value of environment variables, such as the memory limit of the container, the number of copies, etc.
Variable/Variable Reference (Pod Field): Use the Pod field as the value of an environment variable, such as the name of the Pod
ConfigMap key value import (ConfigMap key): Import the value of a key in the ConfigMap as the value of an environment variable
Secret key import (Secret Key): Use the value of a single key in a Secret as the value of an environment variable
Secret import (Secret): Import all key-value pairs in a Secret as environment variables
ConfigMap import (ConfigMap): Import all key-value pairs in a ConfigMap as environment variables
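For reference, these options map to the env and envFrom fields of a container spec, as in the sketch below; the ConfigMap, Secret, and key names are placeholders.

```yaml
containers:
  - name: app
    image: busybox                     # placeholder image
    env:
      - name: GREETING                 # key-value pair
        value: "hello"
      - name: POD_NAME                 # variable reference (Pod field)
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: MEM_LIMIT                # resource reference
        valueFrom:
          resourceFieldRef:
            containerName: app
            resource: limits.memory
      - name: LOG_LEVEL                # ConfigMap key import
        valueFrom:
          configMapKeyRef:
            name: app-config           # placeholder ConfigMap
            key: log-level
      - name: DB_PASSWORD              # Secret key import
        valueFrom:
          secretKeyRef:
            name: app-secret           # placeholder Secret
            key: password
    envFrom:
      - configMapRef:
          name: app-config             # ConfigMap import: all keys become variables
      - secretRef:
          name: app-secret             # Secret import: all keys become variables
```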
"},{"location":"en/end-user/kpanda/workloads/pod-config/health-check.html","title":"Container health check","text":"
Container health check checks the health status of containers according to user requirements. After configuration, if the application in the container is abnormal, the container will automatically restart and recover. Kubernetes provides Liveness checks, Readiness checks, and Startup checks.
LivenessProbe can detect application deadlock (the application is running, but cannot continue to run the following steps). Restarting containers in this state can help improve the availability of applications, even if there are bugs in them.
ReadinessProbe can detect when a container is ready to accept request traffic. A Pod can only be considered ready when all containers in a Pod are ready. One use of this signal is to control which Pod is used as the backend of the Service. If the Pod is not ready, it will be removed from the Service's load balancer.
Startup check (StartupProbe) can know when the application container is started. After configuration, it can control the container to check the viability and readiness after it starts successfully, so as to ensure that these liveness and readiness probes will not affect the start of the application. Startup detection can be used to perform liveness checks on slow-starting containers, preventing them from being killed before they start running.
"},{"location":"en/end-user/kpanda/workloads/pod-config/health-check.html#liveness-and-readiness-checks","title":"Liveness and readiness checks","text":"
The configuration of a ReadinessProbe is similar to that of a LivenessProbe; the only difference is that the readinessProbe field is used instead of the livenessProbe field.
HTTP GET parameter description:
| Parameter | Description |
| --- | --- |
| Path (Path) | The requested path for access, such as the /healthz path in the example. |
| Port (Port) | Service listening port, such as port 8080 in the example. |
| Protocol (protocol) | Access protocol, HTTP or HTTPS. |
| Delay time (initialDelaySeconds) | Delay check time, in seconds. This setting is related to the normal startup time of business programs. For example, if it is set to 30, the health check starts 30 seconds after the container is started, which is the time reserved for business program startup. |
| Timeout (timeoutSeconds) | Timeout, in seconds. For example, if it is set to 10, the timeout waiting period for executing the health check is 10 seconds; if this time is exceeded, the health check is regarded as a failure. If set to 0 or not set, the default timeout waiting time is 1 second. |
| Success threshold (successThreshold) | The minimum number of consecutive successes that are considered successful after a probe fails. The default value is 1, and the minimum value is 1. This value must be 1 for liveness and startup probes. |
| Maximum number of failures (failureThreshold) | The number of retries when the probe fails. Giving up in case of a liveness probe means restarting the container. Pods that are abandoned due to readiness probes are marked as not ready. The default value is 3; the minimum value is 1. |

Check with HTTP GET request
YAML example:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness              # Container name
    image: k8s.gcr.io/liveness  # Container image
    args:
    - /server                   # Arguments to pass to the container
    livenessProbe:
      httpGet:
        path: /healthz          # Access request path
        port: 8080              # Service listening port
        httpHeaders:
        - name: Custom-Header   # Custom header name
          value: Awesome        # Custom header value
      initialDelaySeconds: 3    # Wait 3 seconds before the first probe
      periodSeconds: 3          # Perform liveness detection every 3 seconds
According to the set rules, Kubelet sends an HTTP GET request to the service running in the container (the service is listening on port 8080) to perform the detection. The kubelet considers the container alive if the handler under the /healthz path on the server returns a success code. If the handler returns a failure code, the kubelet kills the container and restarts it. Any return code greater than or equal to 200 and less than 400 indicates success, and any other return code indicates failure. The /healthz handler returns a 200 status code for the first 10 seconds of the container's lifetime. The handler then returns a status code of 500.
"},{"location":"en/end-user/kpanda/workloads/pod-config/health-check.html#use-tcp-port-check","title":"Use TCP port check","text":"
TCP port parameter description:
| Parameter | Description |
| --- | --- |
| Port (Port) | Service listening port, such as port 8080 in the example. |
| Delay time (initialDelaySeconds) | Delay check time, in seconds. This setting is related to the normal startup time of business programs. For example, if it is set to 30, the health check starts 30 seconds after the container is started, which is the time reserved for business program startup. |
| Timeout (timeoutSeconds) | Timeout, in seconds. For example, if it is set to 10, the timeout waiting period for executing the health check is 10 seconds; if this time is exceeded, the health check is regarded as a failure. If set to 0 or not set, the default timeout waiting time is 1 second. |
For a container that provides TCP communication services, based on this configuration, the cluster establishes a TCP connection to the container according to the set rules. If the connection is successful, it proves that the detection is successful, otherwise the detection fails. If you choose the TCP port detection method, you must specify the port that the container listens to.
This example uses both readiness and liveness probes. The kubelet sends the first readiness probe 5 seconds after the container starts, attempting to connect to port 8080 of the goproxy container. If the probe succeeds, the Pod is marked as ready, and the kubelet continues to run the check every 10 seconds.
In addition to the readiness probe, this configuration includes a liveness probe. The kubelet performs the first liveness probe 15 seconds after the container starts. Like the readiness probe, it attempts to connect to the goproxy container on port 8080. If the liveness probe fails, the container will be restarted.
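The manifest matching this description is not included in this extract; a sketch based on the standard goproxy example from the Kubernetes documentation follows (the image tag is an assumption).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
    - name: goproxy
      image: registry.k8s.io/goproxy:0.1   # assumed image tag
      ports:
        - containerPort: 8080
      readinessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 5      # first readiness probe 5 seconds after start
        periodSeconds: 10           # then every 10 seconds
      livenessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 15     # first liveness probe 15 seconds after start
        periodSeconds: 20
```

The next example checks liveness by running a command inside the container instead of probing a port.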
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness              # Container name
    image: k8s.gcr.io/busybox   # Container image
    args:
    - /bin/sh                   # Command to run
    - -c                        # Pass the following string as a command
    - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600   # Command to execute
    livenessProbe:
      exec:
        command:
        - cat                   # Command to check liveness
        - /tmp/healthy          # File to check
      initialDelaySeconds: 5    # Wait 5 seconds before the first probe
      periodSeconds: 5          # Perform liveness detection every 5 seconds
The periodSeconds field specifies that the kubelet performs a liveness probe every 5 seconds, and the initialDelaySeconds field specifies that the kubelet waits for 5 seconds before performing the first probe. According to the set rules, the cluster periodically executes the command cat /tmp/healthy in the container through the kubelet to detect. If the command executes successfully and the return value is 0, the kubelet considers the container to be healthy and alive. If this command returns a non-zero value, the kubelet will kill the container and restart it.
"},{"location":"en/end-user/kpanda/workloads/pod-config/health-check.html#protect-slow-starting-containers-with-pre-start-checks","title":"Protect slow-starting containers with pre-start checks","text":"
Some applications require a long initialization time at startup. In this case, you can configure a startup probe that uses the same command as the liveness probe. For HTTP or TCP detection, set the failureThreshold * periodSeconds parameters to a value long enough to cover the worst-case startup time.
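A hedged sketch of such a startup probe; the path and port are placeholders, and the threshold and period values (30 and 10) produce the 300-second window described below.

```yaml
startupProbe:
  httpGet:
    path: /healthz          # placeholder path
    port: 8080              # placeholder port
  failureThreshold: 30      # up to 30 failed probes are tolerated ...
  periodSeconds: 10         # ... probed every 10 s, i.e. 300 s to finish starting
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10         # takes over once the startup probe succeeds
```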
With the above settings, the application will have up to 5 minutes (30 × 10 = 300 s) to complete the startup process. Once the startup probe succeeds, the liveness probe takes over and responds quickly to container deadlocks. If the startup probe never succeeds, the container is killed after 300 seconds and handled according to the restartPolicy .
"},{"location":"en/end-user/kpanda/workloads/pod-config/job-parameters.html","title":"Description of job parameters","text":"
According to the settings of .spec.completions and .spec.parallelism , jobs (Job) can be divided into the following types:
Job Type / Description
Non-parallel Job: creates one Pod until the Job completes successfully.
Parallel Job with a deterministic completion count: the Job is considered complete when the number of successful Pods reaches .spec.completions.
Parallel Job: creates one or more Pods until one finishes successfully.
Parameter / Description
RestartPolicy: the restart policy applied to the Job's Pods; for a Job it can only be Never or OnFailure.
.spec.completions: the number of Pods that must finish successfully for the Job to complete; defaults to 1.
.spec.parallelism: the number of Pods running in parallel; defaults to 1.
.spec.backoffLimit: the maximum number of retries for failed Pods; once exceeded, no further retries are made.
.spec.activeDeadlineSeconds: the maximum running time of the Job. Once this time is reached, the Job and all of its Pods are stopped. activeDeadlineSeconds takes precedence over backoffLimit: a Job that reaches activeDeadlineSeconds ignores its backoffLimit setting.
The following is an example Job configuration, saved in myjob.yaml, which calculates π to 2000 digits and prints the output.
apiVersion: batch/v1\nkind: Job #The type of the current resource\nmetadata:\n name: myjob\nspec:\n completions: 50 # Job needs to run 50 Pods at the end, in this example it prints \u03c0 50 times\n parallelism: 5 # 5 Pods in parallel\n backoffLimit: 5 # retry up to 5 times\n template:\n spec:\n containers:\n - name: pi\n image: perl\n command: [\"perl\", \"-Mbignum=bpi\", \"-wle\", \"print bpi(2000)\"]\n restartPolicy: Never #restart policy\n
Related commands
kubectl apply -f myjob.yaml # Start the job\nkubectl get job # View this job\nkubectl logs myjob-1122dswzs # View the Job's Pod logs\n
"},{"location":"en/end-user/kpanda/workloads/pod-config/lifecycle.html","title":"Configure the container lifecycle","text":"
Pods follow a predefined lifecycle, starting in the Pending phase and entering the Running state if at least one container in the Pod starts normally. If any container in the Pod ends in a failed state, the state becomes Failed . The following phase field values indicate which phase of the lifecycle a Pod is in.
Value / Description
Pending: the Pod has been accepted by the system, but one or more containers have not yet been created or started. This phase includes the time waiting for the Pod to be scheduled and downloading images over the network.
Running: the Pod has been bound to a node and all containers in the Pod have been created. At least one container is still running, or is starting or restarting.
Succeeded: all containers in the Pod terminated successfully and will not be restarted.
Failed: all containers in the Pod have terminated, and at least one container terminated due to failure, i.e. exited with a non-zero status or was terminated by the system.
Unknown: the status of the Pod cannot be obtained for some reason, usually due to a communication failure with the host where the Pod resides.
When creating a workload in Suanova container management, images are usually used to specify the running environment in the container. By default, when building an image, the Entrypoint and CMD fields can be used to define the commands and parameters to be executed when the container is running. If you need to change the commands and parameters of the container image before starting, after starting, and before stopping, you can override the default commands and parameters in the image by setting the lifecycle event commands and parameters of the container.
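For reference, a minimal sketch of a Pod that overrides the start command and sets a post-start and a pre-stop hook (the nginx image and the echo / nginx -s quit commands are illustrative assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: lifecycle-demo-container
    image: nginx                                        # assumed image
    command: ["nginx", "-g", "daemon off;"]             # start command, overrides the image default
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo started > /usr/share/message"]   # post-start command
      preStop:
        exec:
          command: ["/bin/sh", "-c", "nginx -s quit; sleep 5"]              # pre-stop command, drains before shutdown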
Configure the startup command, post-start command, and pre-stop command of the container according to business needs.
Parameter / Description / Example value
Start command (optional): the container is started according to this command. Example value: -
Post-start command (optional): the command to run after the container has started. Example value: -
Pre-stop command (optional): the command executed when the container receives a stop signal, used to ensure that services running in the instance can be drained before the instance is upgraded or deleted. Example value: -

Start command
Configure the startup command according to the table below.
Parameter / Description / Example value
Run command (required): enter an executable command and separate multiple commands with spaces. If a command itself contains spaces, wrap it in quotation marks (""). When there are multiple commands, it is recommended to run them with /bin/sh or another shell and pass the remaining commands in as arguments. Example value: /run/server
Run parameters (optional): enter the parameters of the container run command. Example value: port=8080

Post-start commands
Suanova provides two processing types, command line script and HTTP request, to configure post-start commands. You can choose the configuration method that suits you according to the table below.
Command line script configuration
Parameter / Description / Example value
Run command (optional): enter an executable command and separate multiple commands with spaces. If a command itself contains spaces, wrap it in quotation marks (""). When there are multiple commands, it is recommended to run them with /bin/sh or another shell and pass the remaining commands in as arguments. Example value: /run/server
Run parameters (optional): enter the parameters of the container run command. Example value: port=8080

Pre-stop command
Suanova provides two processing types, command line script and HTTP request, to configure the pre-stop command. You can choose the configuration method that suits you according to the table below.
HTTP request configuration
Parameter / Description / Example value
URL path (optional): the URL path of the request. Example value: /run/server
Port (required): the port of the request. Example value: 8080
Node address (optional): the IP address of the request; defaults to the IP of the node where the container is located. Example value: -

Scheduling Policy
In a Kubernetes cluster, like many other Kubernetes objects, nodes have labels. You can manually add labels. Kubernetes also adds some standard labels to all nodes in the cluster. See Common Labels, Annotations, and Taints for common node labels. By adding labels to nodes, you can have pods scheduled on specific nodes or groups of nodes. You can use this feature to ensure that specific Pods can only run on nodes with certain isolation, security or governance properties.
nodeSelector is the simplest recommended form of a node selection constraint. You can add a nodeSelector field to the Pod's spec to set the node label. Kubernetes will only schedule pods on nodes with each label specified. nodeSelector provides one of the easiest ways to constrain Pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define. Some benefits of using affinity and anti-affinity are:
The affinity and anti-affinity language is more expressive. nodeSelector can only select nodes that have all the specified labels, while affinity and anti-affinity give you greater control over the selection logic.
You can mark a rule as a "soft requirement" or "preference", so that the scheduler still schedules the Pod even if no matching node can be found.
You can constrain scheduling using the labels of other Pods running on a node (or in another topology domain), instead of only the labels of the node itself. This lets you define rules for which Pods should be placed together.
You can choose which node the Pod will deploy to by setting affinity and anti-affinity.
When the node where a workload instance is located becomes unavailable, this is the period after which the system reschedules the instance to another available node. The default is 300 seconds.
Node affinity is conceptually similar to nodeSelector , which allows you to constrain which nodes Pods can be scheduled on based on the labels on the nodes. There are two types of node affinity:
Must be satisfied ( requiredDuringSchedulingIgnoredDuringExecution ): the scheduler schedules the Pod only onto nodes that satisfy the rules. This works like nodeSelector but with a more expressive syntax. You can define multiple hard constraint rules, and only one of them needs to be satisfied.
Satisfy as much as possible ( preferredDuringSchedulingIgnoredDuringExecution ): the scheduler tries to find nodes that meet the rules, but still schedules the Pod if no matching node is found. You can also set weights for soft constraint rules; when several nodes meet the conditions, the node matching the rule with the highest weight is scheduled first. You can likewise define multiple soft constraint rules, and only one of them needs to be satisfied.
Weight can only be set for "satisfy as much as possible" rules and can be understood as scheduling priority: rules with higher weight are preferred first. The value range is 1 to 100. A sketch of both rule types is shown below.
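A minimal sketch of the affinity section of a Pod spec with one hard rule and one weighted soft rule (the disktype and zone label keys and values are assumptions; substitute the labels that actually exist on your nodes):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:     # hard rule: node must match
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype                                 # assumed node label
          operator: In
          values: ["ssd"]
    preferredDuringSchedulingIgnoredDuringExecution:    # soft rule: preferred, not mandatory
    - weight: 80                                        # 1-100, higher weight is preferred first
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["zone-a"]                            # assumed zone name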
Similar to node affinity, there are two types of workload affinity:
Must be satisfied (requiredDuringSchedulingIgnoredDuringExecution): the scheduler schedules the Pod only when the rules are satisfied. This works like nodeSelector but with a more expressive syntax. You can define multiple hard constraint rules, and only one of them needs to be satisfied.
Satisfy as much as possible (preferredDuringSchedulingIgnoredDuringExecution): the scheduler tries to find nodes that meet the rules, but still schedules the Pod if no matching node is found. You can also set weights for soft constraint rules; when several nodes meet the conditions, the node matching the rule with the highest weight is scheduled first. You can likewise define multiple soft constraint rules, and only one of them needs to be satisfied.
The affinity of the workload is mainly used to determine which Pods of the workload can be deployed in the same topology domain. For example, services that communicate with each other can be deployed in the same topology domain (such as the same availability zone) by applying affinity scheduling to reduce the network delay between them.
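A minimal sketch of workload (Pod) affinity that co-locates Pods with a given label in the same availability zone (the app=backend label is an assumption; use the labels of the Pods you want this workload placed near):

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app                                      # assumed label on the target Pods
          operator: In
          values: ["backend"]
      topologyKey: topology.kubernetes.io/zone          # "same topology domain" = same zone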
Similar to node affinity, there are two types of anti-affinity for workloads:
Must be satisfied (requiredDuringSchedulingIgnoredDuringExecution): the scheduler schedules the Pod only when the rules are satisfied. This works like nodeSelector but with a more expressive syntax. You can define multiple hard constraint rules, and only one of them needs to be satisfied.
Satisfy as much as possible (preferredDuringSchedulingIgnoredDuringExecution): the scheduler tries to find nodes that meet the rules, but still schedules the Pod if no matching node is found. You can also set weights for soft constraint rules; when several nodes meet the conditions, the node matching the rule with the highest weight is scheduled first. You can likewise define multiple soft constraint rules, and only one of them needs to be satisfied.
Workload anti-affinity is mainly used to decide which Pods of a workload must not be deployed in the same topology domain. For example, spreading the Pods of the same workload across different topology domains (such as different hosts) improves the stability of the workload itself.
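A minimal sketch of workload anti-affinity that spreads a workload's Pods across different hosts (the app=web label is an assumption; it should match the labels that the workload's own Pods carry):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app                                    # assumed label shared by this workload's Pods
            operator: In
            values: ["web"]
        topologyKey: kubernetes.io/hostname             # different topology domain = different host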
A workload is an application running on Kubernetes. Whether your application consists of a single component or of many components working together, you can run it with a set of Pods. Kubernetes provides five built-in workload resources to manage Pods:
Deployment
StatefulSet
DaemonSet
Job
CronJob
You can also extend workload resources by defining Custom Resources (CRDs). The fifth-generation container management module supports full lifecycle management of workloads, including creation, update, scaling, monitoring, logging, deletion, and version management.
A Pod is the smallest compute unit created and managed in Kubernetes: a collection of containers that share storage, networking, and the policies that control how the containers run. Pods are typically created not directly by users but through workload resources. Pods follow a predefined lifecycle, starting in the Pending phase; if at least one of the primary containers starts normally the Pod enters Running , and it then moves to Succeeded or Failed depending on whether any container in the Pod ended in a failed state.
The fifth-generation container management module defines a built-in set of workload lifecycle statuses based on factors such as Pod status and the number of replicas, so that users get a more realistic view of how workloads are running. Because different workload types (such as Deployments and Jobs) manage Pods differently, they show different lifecycle statuses while running, as shown in the following tables.
"},{"location":"en/end-user/kpanda/workloads/pod-config/workload-status.html#deployment-statefulset-damemonset-status","title":"Deployment, StatefulSet, DamemonSet Status","text":"Status Description Waiting 1. A workload is in this status while its creation is in progress. 2. After an upgrade or rollback action is triggered, the workload is in this status. 3. Trigger operations such as pausing/scaling, and the workload is in this status. Running This status occurs when all instances under the workload are running and the number of replicas matches the user-defined number. Deleting When a delete operation is performed, the payload is in this status until the delete is complete. Exception Unable to get the status of the workload for some reason. This usually occurs because communication with the pod's host has failed. Not Ready When the container is in an abnormal, pending status, this status is displayed when the workload cannot be started due to an unknown error"},{"location":"en/end-user/kpanda/workloads/pod-config/workload-status.html#job-status","title":"Job Status","text":"Status Description Waiting The workload is in this status while Job creation is in progress. Executing The Job is in progress and the workload is in this status. Execution Complete The Job execution is complete and the workload is in this status. Deleting A delete operation is triggered and the workload is in this status. Exception Pod status could not be obtained for some reason. This usually occurs because communication with the pod's host has failed."},{"location":"en/end-user/kpanda/workloads/pod-config/workload-status.html#cronjob-status","title":"CronJob status","text":"Status Description Waiting The CronJob is in this status when it is being created. Started After the CronJob is successfully created, the CronJob is in this status when it is running normally or when the paused task is started. Stopped The CronJob is in this status when the stop task operation is performed. Deleting The deletion operation is triggered, and the CronJob is in this status.
When a workload is in an abnormal or not-ready status, you can hover the mouse over the workload's status value and the system shows more detailed error information in a tooltip. You can also view the logs or events to obtain related runtime information about the workload.
Notebook usually refers to Jupyter Notebook or similar interactive computing environments. It is a very popular tool widely used in fields such as data science, machine learning, and deep learning. This page explains how to use Notebook in the AI platform.
Enter a name, select the cluster, namespace, choose the queue just created, and click One-Click Initialization.
Select the Notebook type, configure memory, CPU, enable GPU, create and configure PVC:
Enable SSH external network access:
You will be automatically redirected to the Notebook instance list, click the instance name.
Enter the Notebook instance detail page and click the Open button in the upper right corner.
You have entered the Notebook development environment, where a persistent volume is mounted in the /home/jovyan directory. You can clone code through git, upload data after connecting via SSH, etc.
"},{"location":"en/end-user/share/notebook.html#accessing-notebook-instances-via-ssh","title":"Accessing Notebook Instances via SSH","text":"
Generate an SSH key pair on your own computer.
Open the command line on your computer, for example, open git bash on Windows, enter ssh-keygen.exe -t rsa, and press enter through the prompts.
Use commands like cat ~/.ssh/id_rsa.pub to view and copy the public key.
Log into the AI platform as a user, click Personal Center -> SSH Public Key -> Import SSH Public Key in the upper right corner.
Enter the detail page of the Notebook instance and copy the SSH link.
Use SSH to access the Notebook instance from the client.
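For example, a sketch of the client-side command (the username, host, and port are placeholders; use the exact values from the SSH link copied on the instance detail page):

ssh <username>@<notebook-ssh-host> -p <port>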
Next step: Create Training Job
"},{"location":"en/end-user/share/workload.html","title":"Creating AI Workloads Using GPU Resources","text":"
After the administrator allocates resource quotas for the workspace, users can create AI workloads to utilize GPU computing resources.
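As a reference, a minimal sketch of a Deployment that requests one whole NVIDIA GPU (the image, command, and names are illustrative assumptions; the resource key differs for vGPU modes or other GPU vendors):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-train-demo                 # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-train-demo
  template:
    metadata:
      labels:
        app: gpu-train-demo
    spec:
      containers:
      - name: trainer
        image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime   # assumed image
        command: ["python", "train.py"]                         # assumed entrypoint
        resources:
          limits:
            nvidia.com/gpu: 1          # request one whole NVIDIA GPU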
Access keys can be used to access the OpenAPI and for continuous publishing. You can follow the steps below to obtain your key in your personal center and use it to access the API.
Log in to the AI platform, find Personal Center in the dropdown menu at the top right corner, and manage your account's access keys on the Access Keys page.
Info
Access key information is displayed only once. If you forget the access key information, you will need to create a new access key.
"},{"location":"en/openapi/index.html#using-the-key-to-access-the-api","title":"Using the Key to Access the API","text":"
When accessing the AI platform's OpenAPI, include the request header Authorization:Bearer ${token} in the request to identify the visitor's identity, where ${token} is the key obtained in the previous step.
Request Example
curl -X GET -H 'Authorization:Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IkRKVjlBTHRBLXZ4MmtQUC1TQnVGS0dCSWc1cnBfdkxiQVVqM2U3RVByWnMiLCJ0eXAiOiJKV1QifQ.eyJleHAiOjE2NjE0MTU5NjksImlhdCI6MTY2MDgxMTE2OSwiaXNzIjoiZ2hpcHBvLmlvIiwic3ViIjoiZjdjOGIxZjUtMTc2MS00NjYwLTg2MWQtOWI3MmI0MzJmNGViIiwicHJlZmVycmVkX3VzZXJuYW1lIjoiYWRtaW4iLCJncm91cHMiOltdfQ.RsUcrAYkQQ7C6BxMOrdD3qbBRUt0VVxynIGeq4wyIgye6R8Ma4cjxG5CbU1WyiHKpvIKJDJbeFQHro2euQyVde3ygA672ozkwLTnx3Tu-_mB1BubvWCBsDdUjIhCQfT39rk6EQozMjb-1X1sbLwzkfzKMls-oxkjagI_RFrYlTVPwT3Oaw-qOyulRSw7Dxd7jb0vINPq84vmlQIsI3UuTZSNO5BCgHpubcWwBss-Aon_DmYA-Et_-QtmPBA3k8E2hzDSzc7eqK0I68P25r9rwQ3DeKwD1dbRyndqWORRnz8TLEXSiCFXdZT2oiMrcJtO188Ph4eLGut1-4PzKhwgrQ' https://demo-dev.daocloud.io/apis/ghippo.io/v1alpha1/users?page=1&pageSize=10 -k\n
"},{"location":"en/openapi/baize/index.html","title":"AI Lab OpenAPI Docs","text":""},{"location":"en/openapi/ghippo/index.html","title":"Global Management OpenAPI Docs","text":""},{"location":"en/openapi/insight/index.html","title":"Insight OpenAPI Docs","text":""},{"location":"en/openapi/kpanda/index.html","title":"Container Management OpenAPI Docs","text":""},{"location":"en/openapi/virtnest/index.html","title":"Cloud Host OpenAPI Docs","text":""}]}
\ No newline at end of file
+{"config":{"lang":["en","zh"],"separator":"[\\s\\u200b\\u3000\\-\u3001\u3002\uff0c\uff0e\uff1f\uff01\uff1b]+","pipeline":["stemmer"]},"docs":[{"location":"index.html","title":"\u8c50\u6536\u4e8c\u865f\u6a94\u6848\u7ad9","text":"
\u9019\u662f\u8c50\u6536\u4e8c\u865f AI \u7b97\u529b\u4e2d\u5fc3\u7684\u6a94\u6848\u7ad9\u3002
\u7d42\u7aef\u7528\u6236\u624b\u518a\uff1a\u5728\u5bb9\u5668\u5316\u74b0\u5883\u4e2d\uff0c\u4f7f\u7528\u96f2\u4e3b\u6a5f\uff0c\u958b\u767c AI \u7b97\u6cd5\uff0c\u69cb\u5efa\u8a13\u7df4\u548c\u63a8\u7406\u4efb\u52d9
5.0 AI Lab provides a job scheduler to help you manage jobs better. Besides the basic scheduler, user-defined schedulers are also supported.
In Kubernetes, the scheduler is responsible for deciding which node a Pod is assigned to run on. It considers many factors, such as resource requirements, hardware/software constraints, affinity/anti-affinity rules, and data locality.
The default scheduler is a core component of a Kubernetes cluster and is responsible for deciding which node a Pod runs on. Let us look more closely at how it works, its features, and how it is configured.
The scheduler iterates over all nodes and filters out those that do not meet the Pod's requirements, based on factors such as resource requirements and scheduling constraints.
The above describes how to configure and use the scheduler option for jobs in AI Lab; a minimal sketch of how a scheduler is selected at the Pod level follows.
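At the Kubernetes level, selecting a particular scheduler for a job's Pods comes down to the Pod spec's schedulerName field. A minimal sketch (my-custom-scheduler is a hypothetical name; leaving the field out uses the default scheduler):

apiVersion: v1
kind: Pod
metadata:
  name: scheduler-demo
spec:
  schedulerName: my-custom-scheduler   # hypothetical custom scheduler; omit for default-scheduler
  containers:
  - name: app
    image: nginx                       # assumed image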
CPU: at least 8 cores, 16 cores recommended
Memory: 64 GB, 128 GB recommended
Info
Before starting, check that the AI Computing Platform and AI Lab are deployed correctly, that the GPU queue resources have been initialized successfully, and that computing resources are sufficient.
AI Lab performs fully automatic data preheating in the background so that subsequent jobs can access the data quickly.
AI Lab provides environment management, decoupling Python dependency management from development tools and job images, which solves problems such as chaotic dependency management and inconsistent environments.
Here we use the environment management feature of AI Lab to create the environment required for ChatGLM3 fine-tuning, for later use.
AI Lab provides Notebook as an IDE, letting users write code, run it, and view the results directly in the browser. It is well suited to development in fields such as data analysis, machine learning, and deep learning.
You can use the JupyterLab Notebook provided by AI Lab to run the ChatGLM3 fine-tuning job.
This article takes ChatGLM3 as an example to help you quickly understand and get started with model fine-tuning in AI Lab, fine-tuning the ChatGLM3 model with LoRA.
AI Lab provides a rich set of features to help model developers quickly carry out model development, fine-tuning, inference, and other tasks, and it also provides rich OpenAPI interfaces that integrate easily with third-party application ecosystems.
Label Studio is an open-source data annotation tool for all kinds of machine learning and artificial intelligence tasks. The following is a brief introduction to Label Studio:
With its flexibility and rich feature set, Label Studio provides data scientists and machine learning engineers with a powerful data annotation solution.
"},{"location":"admin/baize/best-practice/label-studio.html#ai","title":"\u90e8\u7f72\u5230 AI \u7b97\u529b\u5e73\u53f0","text":"
\u8981\u60f3\u5728 AI Lab \u4e2d\u4f7f\u7528 Label Studio\uff0c\u9700\u5c06\u5176\u90e8\u7f72\u5230\u5168\u5c40\u670d\u52a1\u96c6\u7fa4\uff0c \u4f60\u53ef\u4ee5\u901a\u8fc7 Helm \u7684\u65b9\u5f0f\u5feb\u901f\u90e8\u7f72\u3002
Note
\u66f4\u591a\u90e8\u7f72\u8be6\u60c5\uff0c\u8bf7\u53c2\u9605 Deploy Label Studio on Kubernetes\u3002
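For reference, a typical Helm installation looks roughly like the following (the repository URL and chart name come from the upstream Label Studio chart and may change; follow the linked guide for the authoritative steps and values):

helm repo add heartex https://charts.heartex.com/
helm repo update
helm install label-studio heartex/label-studio -n label-studio --create-namespace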
To add Label Studio to the navigation bar, you can follow the Global Management OEM IN approach. The following example adds it to the AI Lab secondary navigation.
The above covers how to add Label Studio and use it as the annotation component of AI Lab: by adding the annotated data to AI Lab datasets and linking it with algorithm development, the algorithm development workflow is completed. For subsequent usage, refer to the other documentation.
The Development Console is where developers carry out daily tasks such as AI inference and large model training.
This article provides a simple operation manual so that users can use AI Lab for the whole development and training workflow of datasets, Notebooks, and training jobs.
After the environment has been preheated successfully, simply mount this environment into Notebooks and training jobs and use the base images provided by AI Lab.
AI Lab provides all the dataset management functions needed for model development, training, and inference, and currently supports unified access to multiple kinds of data sources.
With simple configuration, data sources can be connected to AI Lab, enabling unified data governance, preheating, dataset management, and other functions.
This article explains how to manage your environment dependency libraries in AI Lab; the specific steps and notes follow.
With the rapid iteration of AI Lab, inference services for many kinds of models are now supported; you can find the supported model information here.
AI Lab v0.3.0 introduced the model inference service for traditional deep learning models, so that users can use AI Lab's inference service directly without worrying about model deployment and maintenance.
AI Lab v0.6.0 supports the full vLLM inference capability, covering many large language models such as LLama, Qwen, and ChatGLM.
In AI Lab you can use the GPU types that have been validated on the Suanova AI Computing Platform; for more details, refer to the GPU Support Matrix.
Triton Inference Server provides good support for traditional deep learning models; we currently support the mainstream inference backends.
AI Lab currently provides Triton and vLLM as inference frameworks; with simple configuration users can quickly launch a high-performance inference service.
Requests can be authenticated with an API key, and users can add custom authentication parameters.
After the inference service has been created, click the inference service name to open its details and view the API call method. Verify the result by calling the API with curl, Python, Node.js, and so on; a hedged curl sketch follows.
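For example, a sketch of calling a Triton-style endpoint with curl, following the open inference (v2) protocol (the endpoint URL, model name, tensor name, and data are hypothetical; use the exact call shown on the service's detail page):

curl -X POST http://<inference-endpoint>/v2/models/<model-name>/infer \
  -H 'Content-Type: application/json' \
  -d '{"inputs": [{"name": "INPUT0", "shape": [1, 3], "datatype": "FP32", "data": [1.0, 2.0, 3.0]}]}'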
Similarly, you can open the job details to view resource usage and the log output of each Pod.
The AI Lab module provides important visual analysis tools for the model development process, used to display the training process and results of machine learning models. This article introduces the basic concepts of Job Analysis (Tensorboard), how to use it in the AI Lab system, and how to configure the log content of datasets.
In the AI Lab system we provide a convenient way to create and manage Tensorboard. The specific steps are as follows:
Create a distributed job: create a new distributed training job on the AI Lab platform.
Similarly, you can open the job details to view resource usage and the log output of each Pod.
jovyan@19d0197587cc:/$ baizectl\nAI platform management tool\n\nUsage:\n baizectl [command]\n\nAvailable Commands:\n completion Generate the autocompletion script for the specified shell\n data Management datasets\n help Help about any command\n job Manage jobs\n login Login to the platform\n version Show cli version\n\nFlags:\n --cluster string Cluster name to operate\n -h, --help help for baizectl\n --mode string Connection mode: auto, api, notebook (default \"auto\")\n -n, --namespace string Namespace to use for the operation. If not set, the default Namespace will be used.\n -s, --server string \u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0 access base url\n --skip-tls-verify Skip TLS certificate verification\n --token string \u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0 access token\n -w, --workspace int32 Workspace ID to use for the operation\n\nUse \"baizectl [command] --help\" for more information about a command.\n
jovyan@19d0197587cc:/$ baizectl job\nManage jobs\n\nUsage:\n baizectl job [command]\n\nAvailable Commands:\n delete Delete a job\n logs Show logs of a job\n ls List jobs\n restart restart a job\n submit Submit a job\n\nFlags:\n -h, --help help for job\n -o, --output string Output format. One of: table, json, yaml (default \"table\")\n --page int Page number (default 1)\n --page-size int Page size (default -1)\n --search string Search query\n --sort string Sort order\n --truncate int Truncate output to the given length, 0 means no truncation (default 50)\n\nUse \"baizectl job [command] --help\" for more information about a command.\n
(base) jovyan@den-0:~$ baizectl job submit --help\nSubmit a job\n\nUsage:\n baizectl job submit [flags] -- command ...\n\nAliases:\n submit, create\n\nExamples:\n# Submit a job to run the command \"torchrun python train.py\"\nbaizectl job submit -- torchrun python train.py\n# Submit a job with 2 workers(each pod use 4 gpus) to run the command \"torchrun python train.py\" and use the image \"pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime\"\nbaizectl job submit --image pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime --workers 2 --resources nvidia.com/gpu=4 -- torchrun python train.py\n# Submit a tensorflow job to run the command \"python train.py\"\nbaizectl job submit --tensorflow -- python train.py\n\n\nFlags:\n --annotations stringArray The annotations of the job, the format is key=value\n --auto-load-env It only takes effect when executed in Notebook, the environment variables of the current environment will be automatically read and set to the environment variables of the Job, the specific environment variables to be read can be specified using the BAIZE_MAPPING_ENVS environment variable, the default is PATH,CONDA_*,*PYTHON*,NCCL_*, if set to false, the environment variables of the current environment will not be read. (default true)\n --commands stringArray The default command of the job\n -d, --datasets stringArray The dataset bind to the job, the format is datasetName:mountPath, e.g. mnist:/data/mnist\n -e, --envs stringArray The environment variables of the job, the format is key=value\n -x, --from-notebook string Define whether to read the configuration of the current Notebook and directly create tasks, including images, resources, Dataset, etc.\n auto: Automatically determine the mode according to the current environment. If the current environment is a Notebook, it will be set to notebook mode.\n false: Do not read the configuration of the current Notebook.\n true: Read the configuration of the current Notebook. (default \"auto\")\n -h, --help help for submit\n --image string The image of the job, it must be specified if fromNotebook is false.\n -t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default \"PYTORCH\")\n --labels stringArray The labels of the job, the format is key=value\n --max-retries int32 number of retries before marking this job failed\n --max-run-duration int Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it\n --name string The name of the job, if empty, the name will be generated automatically.\n --paddle PaddlePaddle Job, has higher priority than --job-type\n --priority string The priority of the job, current support baize-medium-priority, baize-low-priority, baize-high-priority\n --pvcs stringArray The pvcs bind to the job, the format is pvcName:mountPath, e.g. 
mnist:/data/mnist\n --pytorch Pytorch Job, has higher priority than --job-type\n --queue string The queue to used\n --requests-resources stringArray Similar to resources, but sets the resources of requests\n --resources stringArray The resources of the job, it is a string in the format of cpu=1,memory=1Gi,nvidia.com/gpu=1, it will be set to the limits and requests of the container.\n --restart-policy string The job restart policy (default \"on-failure\")\n --runtime-envs baizectl data ls --runtime-env The runtime environment to use for the job, you can use baizectl data ls --runtime-env to get the runtime environment\n --shm-size int32 The shared memory size of the job, default is 0, which means no shared memory, if set to more than 0, the job will use the shared memory, the unit is MiB\n --tensorboard-log-dir string The tensorboard log directory, if set, the job will automatically start tensorboard, else not. The format is /path/to/log, you can use relative path in notebook.\n --tensorflow Tensorflow Job, has higher priority than --job-type\n --workers int The workers of the job, default is 1, which means single worker, if set to more than 1, the job will be distributed. (default 1)\n --working-dir string The working directory of job container, if in notebook mode, the default is the directory of the current file\n
(base) jovyan@den-0:~$ baizectl job logs --help\nShow logs of a job\n\nUsage:\n baizectl job logs <job-name> [pod-name] [flags]\n\nAliases:\n logs, log\n\nFlags:\n -f, --follow Specify if the logs should be streamed.\n -h, --help help for logs\n -t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default \"PYTORCH\")\n --paddle PaddlePaddle Job, has higher priority than --job-type\n --pytorch Pytorch Job, has higher priority than --job-type\n --tail int Lines of recent log file to display.\n --tensorflow Tensorflow Job, has higher priority than --job-type\n --timestamps Show timestamps\n
(base) jovyan@den-0:~$ baizectl job log -t TENSORFLOW tf-sample-job-v2-202406161632-evgrbrhn -f\n2024-06-16 08:33:06.083766: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n2024-06-16 08:33:06.086189: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n2024-06-16 08:33:06.132416: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n2024-06-16 08:33:06.132903: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\nTo enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2024-06-16 08:33:07.223046: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\nModel: \"sequential\"\n_________________________________________________________________\n Layer (type) Output Shape Param # \n=================================================================\n Conv1 (Conv2D) (None, 13, 13, 8) 80 \n\n flatten (Flatten) (None, 1352) 0 \n\n Softmax (Dense) (None, 10) 13530 \n\n=================================================================\nTotal params: 13610 (53.16 KB)\nTrainable params: 13610 (53.16 KB)\nNon-trainable params: 0 (0.00 Byte)\n...\n
(base) jovyan@den-0:~$ baizectl data \nManagement datasets\n\nUsage:\n baizectl data [flags]\n baizectl data [command]\n\nAliases:\n data, dataset, datasets, envs, runtime-envs\n\nAvailable Commands:\n ls List datasets\n\nFlags:\n -h, --help help for data\n -o, --output string Output format. One of: table, json, yaml (default \"table\")\n --page int Page number (default 1)\n --page-size int Page size (default -1)\n --search string Search query\n --sort string Sort order\n --truncate int Truncate output to the given length, 0 means no truncation (default 50)\n\nUse \"baizectl data [command] --help\" for more information about a command.\n
baizectl data supports listing datasets with the ls command. The default output format is table; you can specify another output format with the -o parameter.
(base) jovyan@den-0:~$ baizectl data ls\n NAME TYPE URI PHASE \n fashion-mnist GIT https://gitee.com/samzong_lu/fashion-mnist.git READY \n sample-code GIT https://gitee.com/samzong_lu/training-sample-code.... READY \n training-output PVC pvc://training-output READY \n
jovyan@19d0197587cc:/$ baizess\nsource switch tool\n\nUsage:\n baizess [command] [package-manager]\n\nAvailable Commands:\n set Switch the source of specified package manager to current fastest source\n reset Reset the source of specified package manager to default source\n\nAvailable Package-managers:\n apt (require root privilege)\n conda\n pip\n
Notebook provides an online, interactive web programming environment that makes it easy for developers to quickly run data science and machine learning experiments.
When the system prompts you to "Enter a file in which to save the key", you can press Enter to use the default path or specify a new path.
Log in to the Suanova AI Computing Platform, click your account in the upper right corner, and select Personal Center.
Click Next, and PyCharm will try to connect to the remote server. If the connection succeeds, you will be asked to enter a password or select a private key file.
Operations Management is the space where IT operators manage IT resources and handle their daily work.
This article continuously collects and organizes errors that may occur while using AI Lab due to environment issues or non-standard operations, together with analysis and solutions for these errors.
Warning
This document applies only to the AI Computing Center version. If you run into problems using AI Lab, consult this troubleshooting manual first.
AI Lab (module name baize in the AI Computing Center) provides one-stop model training, inference, model management, and related functions.
In the AI Lab Development Console or Operations Console, the cluster you want cannot be found in the drop-down list of the cluster search condition of a functional module.
In AI Lab, if the cluster you want is missing from the cluster drop-down list, it may be caused by the following reasons:
baize-agent is not installed or its installation failed, so AI Lab cannot obtain the cluster information
The cluster name was not configured when installing baize-agent, so AI Lab cannot obtain the cluster information
AI Lab has some basic components that must be installed in every worker cluster. If baize-agent is not installed in a worker cluster, you can choose to install it from the UI; without it, some unexpected errors may occur.
If the observability components in the cluster are abnormal, AI Lab may be unable to obtain the cluster information. Check whether the platform's observability services are running and configured correctly.
In AI Lab, if the specified namespace has no LocalQueue when you create a service, you will be prompted to initialize the queue first.
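LocalQueue is a Kueue resource. Assuming the platform's queues are backed by Kueue, a minimal manifest looks roughly like the following (the queue and ClusterQueue names are hypothetical; the one-click initialization in the UI creates the equivalent object for you):

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default                 # hypothetical queue name
  namespace: my-namespace       # the namespace reported as missing a LocalQueue
spec:
  clusterQueue: cluster-queue   # hypothetical ClusterQueue this LocalQueue points to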
The Suanova AI Computing Platform provides preset system roles to help users simplify the steps of using role permissions.
Note
The Suanova AI Computing Platform provides three types of system roles: platform roles, workspace roles, and folder roles.
IAM (Identity and Access Management) is an important module of Global Management. Through the IAM module you can create, manage, and delete users (and user groups), and use system roles and custom roles to control other users' permissions on the Suanova AI Computing Platform.
If you want your employees to log in to the Suanova AI Computing Platform with your company's internal authentication system, without creating corresponding users on the platform, you can use the identity provider feature of IAM to establish a trust relationship between your enterprise and the Suanova AI Computing Platform. Through federated authentication, employees log in to the platform directly with their existing enterprise accounts, achieving single sign-on.
Global Management supports single sign-on based on the LDAP and OIDC protocols. If your enterprise or organization already has its own account system and you want its members to use Suanova AI Computing Platform resources, you can use the identity provider feature of Global Management instead of creating a username/password on the platform for every member. You can then grant these external user identities permission to use platform resources.
Identity provider (IdP): the service that collects and stores user identity information, usernames, passwords, and so on, and authenticates users when they log in. In the identity authentication process between an enterprise and the Suanova AI Computing Platform, the identity provider is the enterprise's own identity provider.
Service provider (SP): the service provider establishes a trust relationship with the identity provider and uses the user information provided by the IdP to deliver concrete services to users. In the identity authentication process between an enterprise and the Suanova AI Computing Platform, the service provider is the Suanova AI Computing Platform.
Administrators do not need to re-create users on the Suanova AI Computing Platform.
Before identity provider authentication is used, administrators must create accounts for users both in the enterprise management system and on the Suanova AI Computing Platform. After identity provider authentication is enabled, the enterprise administrator only needs to create accounts in the enterprise management system and users can access both systems, which lowers personnel management costs.
Before identity provider authentication is used, users must log in with separate accounts to access the enterprise management system and the Suanova AI Computing Platform. After identity provider authentication is enabled, users can access both systems by logging in to their own enterprise management system.
LDAP stands for Lightweight Directory Access Protocol. It is an open, vendor-neutral industry-standard application protocol that provides access control and maintains directory information for distributed information over IP.
If your enterprise or organization already has its own account system and your user management system supports the LDAP protocol, you can use the LDAP-based identity provider feature of Global Management instead of creating a username/password on the Suanova AI Computing Platform for every member of your organization. You can grant these external user identities permission to use platform resources.
After you establish a trust relationship between the enterprise user management system and the Suanova AI Computing Platform over the LDAP protocol, you can synchronize users or user groups from the enterprise user management system to the platform in one pass, either manually or automatically.
After synchronization, administrators can grant permissions to users or user groups in batches, and users can log in to the AI Computing Platform with the account/password they use in the enterprise user management system.
If the members of your enterprise or organization are all managed in WeCom (Enterprise WeChat), you can use the OAuth 2.0-based identity provider feature of Global Management instead of creating a username/password on the Suanova AI Computing Platform for every member of your organization. You can grant these external user identities permission to use platform resources.

Field / Description
Corp ID: the ID of the WeCom (Enterprise WeChat) corporation
Agent ID: the ID of the self-built application
ClientSecret: the Secret of the self-built application

Field / Description
Provider name: shown on the login page as the entry point of the identity provider
Authentication method: the client authentication method. If the JWT is signed with a private key, select JWT signed with private key from the drop-down. For details, refer to Client Authentication.
Client ID: the ID of the client
Client secret: the client password
Client URL: the login URL, Token URL, user info URL, and logout URL can be fetched in one click through the identity provider's well-known endpoint
Auto-link: when enabled, if an identity provider username/email duplicates an existing Suanova AI Computing Platform username/email, the two are linked automatically
Note
Only after a user logs in to the Suanova AI Computing Platform for the first time through the enterprise user management system will the user's information be synchronized to the platform's IAM -> User List.
The Suanova AI Computing Platform has three role scopes, which flexibly and effectively address your permission needs:
Platform roles are coarse-grained permissions that apply to all relevant resources on the platform. A platform role can grant a user create/delete/update/view permissions on all clusters, all workspaces, and so on, but not on one specific cluster or workspace. The Suanova AI Computing Platform provides five preset platform roles that can be used directly:
Admin
Kpanda Owner
Workspace and Folder Owner
IAM Owner
Audit Owner
The Suanova AI Computing Platform also supports user-created custom platform roles, whose contents can be customized as needed. For example, you can create a platform role that contains all function permissions of the Workbench. Because the Workbench depends on workspaces, the platform checks the workspace view permission by default; do not uncheck it manually. If user A is granted this Workbench role, user A automatically gets create/delete/update/view permissions for Workbench-related functions in all workspaces.
The permission granularity of folder roles lies between platform roles and workspace roles. A folder role can grant a user management or view permissions on a folder, its subfolders, and all workspaces under that folder, which fits departmental scenarios in enterprises well. For example, user B is the leader of a first-level department and usually manages that department, all second-level departments under it, and the projects in them. In this scenario, granting user B administrator permission on the first-level folder also gives user B the corresponding permissions on the second-level folders and workspaces below it. The Suanova AI Computing Platform provides three preset folder roles that can be used directly:
Folder Admin
Folder Editor
Folder Viewer
The Suanova AI Computing Platform also supports user-created custom folder roles, whose contents can be customized as needed. For example, you can create a folder role that contains all function permissions of the Workbench. If user A is granted this role in folder 01, user A gets create/delete/update/view permissions for Workbench-related functions in all workspaces under that folder.
After the Suanova AI Computing Platform is integrated with a customer's system, it can create Webhooks that send message notifications when a user is created, updated, or deleted, or logs in or out.
The source application (the Suanova AI Computing Platform) performs a specific operation or event.
Method: choose the method that fits your situation; for example, WeCom (Enterprise WeChat) recommends the POST method.
The Suanova AI Computing Platform predefines some variables that you can use in the message body as needed.
"},{"location":"admin/ghippo/audit/open-audit.html#ai","title":"\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u5b89\u88c5\u5b8c\u6210\u65f6\u72b6\u6001","text":"
\u5ba1\u8ba1\u65e5\u5fd7\u6e90 IP \u5728\u7cfb\u7edf\u548c\u7f51\u7edc\u7ba1\u7406\u4e2d\u626e\u6f14\u7740\u5173\u952e\u89d2\u8272\uff0c\u5b83\u6709\u52a9\u4e8e\u8ffd\u8e2a\u6d3b\u52a8\u3001\u7ef4\u62a4\u5b89\u5168\u3001\u89e3\u51b3\u95ee\u9898\u5e76\u786e\u4fdd\u7cfb\u7edf\u5408\u89c4\u6027\u3002 \u4f46\u662f\u83b7\u53d6\u6e90 IP \u4f1a\u5e26\u6765\u4e00\u5b9a\u7684\u6027\u80fd\u635f\u8017\uff0c\u6240\u4ee5\u5728\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u4e2d\u5ba1\u8ba1\u65e5\u5fd7\u5e76\u4e0d\u603b\u662f\u5f00\u542f\u7684\uff0c \u5728\u4e0d\u540c\u7684\u5b89\u88c5\u6a21\u5f0f\u4e0b\uff0c\u5ba1\u8ba1\u65e5\u5fd7\u6e90 IP \u7684\u9ed8\u8ba4\u5f00\u542f\u60c5\u51b5\u4e0d\u540c\uff0c\u5e76\u4e14\u5f00\u542f\u7684\u65b9\u5f0f\u4e0d\u540c\u3002 \u4e0b\u9762\u4f1a\u6839\u636e\u5b89\u88c5\u6a21\u5f0f\u5206\u522b\u4ecb\u7ecd\u5ba1\u8ba1\u65e5\u5fd7\u6e90 IP \u7684\u9ed8\u8ba4\u5f00\u542f\u60c5\u51b5\u4ee5\u53ca\u5982\u4f55\u5f00\u542f\u3002
\u8be5\u6a21\u5f0f\u5b89\u88c5\u4e0b\uff0c\u5ba1\u8ba1\u65e5\u5fd7\u6e90 IP \u9ed8\u8ba4\u662f\u5173\u95ed\u7684\uff0c\u5f00\u542f\u6b65\u9aa4\u5982\u4e0b\uff1a
\u666e\u901a\u7528\u6237\u662f\u6307\u80fd\u591f\u4f7f\u7528 AI \u7b97\u529b\u4e2d\u5fc3\u5927\u90e8\u5206\u4ea7\u54c1\u6a21\u5757\u53ca\u529f\u80fd\uff08\u7ba1\u7406\u529f\u80fd\u9664\u5916\uff09\uff0c\u5bf9\u6743\u9650\u8303\u56f4\u5185\u7684\u8d44\u6e90\u6709\u4e00\u5b9a\u7684\u64cd\u4f5c\u6743\u9650\uff0c\u80fd\u591f\u72ec\u7acb\u4f7f\u7528\u8d44\u6e90\u90e8\u7f72\u5e94\u7528\u3002
AI \u7b97\u529b\u4e2d\u5fc3\u4e3a\u6b64\u5f15\u5165\u4e86\u5de5\u4f5c\u7a7a\u95f4\u7684\u6982\u5ff5\u3002\u5de5\u4f5c\u7a7a\u95f4\u901a\u8fc7\u5171\u4eab\u8d44\u6e90\u53ef\u4ee5\u63d0\u4f9b\u66f4\u9ad8\u7ef4\u5ea6\u7684\u8d44\u6e90\u9650\u989d\u80fd\u529b\uff0c\u5b9e\u73b0\u5de5\u4f5c\u7a7a\u95f4\uff08\u79df\u6237\uff09\u5728\u8d44\u6e90\u9650\u989d\u4e0b\u81ea\u52a9\u5f0f\u521b\u5efa Kubernetes \u547d\u540d\u7a7a\u95f4\u7684\u80fd\u529b\u3002
\u9996\u5148\u8981\u6309\u7167\u73b0\u6709\u7684\u4f01\u4e1a\u5c42\u7ea7\u7ed3\u6784\uff0c\u6784\u5efa\u4e0e\u4f01\u4e1a\u76f8\u540c\u7684\u6587\u4ef6\u5939\u5c42\u7ea7\u3002 AI \u7b97\u529b\u4e2d\u5fc3\u652f\u6301 5 \u7ea7\u6587\u4ef6\u5939\uff0c\u53ef\u4ee5\u6839\u636e\u4f01\u4e1a\u5b9e\u9645\u60c5\u51b5\u81ea\u7531\u7ec4\u5408\uff0c\u5c06\u6587\u4ef6\u5939\u548c\u5de5\u4f5c\u7a7a\u95f4\u6620\u5c04\u4e3a\u4f01\u4e1a\u4e2d\u7684\u90e8\u95e8\u3001\u9879\u76ee\u3001\u4f9b\u5e94\u5546\u7b49\u5b9e\u4f53\u3002
\u4f34\u968f\u4e1a\u52a1\u7684\u6301\u7eed\u6269\u5f20\uff0c\u516c\u53f8\u89c4\u6a21\u4e0d\u65ad\u58ee\u5927\uff0c\u5b50\u516c\u53f8\u3001\u5206\u516c\u53f8\u7eb7\u7eb7\u8bbe\u7acb\uff0c\u6709\u7684\u5b50\u516c\u53f8\u8fd8\u8fdb\u4e00\u6b65\u8bbe\u7acb\u5b59\u516c\u53f8\uff0c \u539f\u5148\u7684\u5927\u90e8\u95e8\u4e5f\u9010\u6e10\u7ec6\u5206\u6210\u591a\u4e2a\u5c0f\u90e8\u95e8\uff0c\u4ece\u800c\u4f7f\u5f97\u7ec4\u7ec7\u7ed3\u6784\u7684\u5c42\u7ea7\u65e5\u76ca\u589e\u591a\u3002\u8fd9\u79cd\u7ec4\u7ec7\u7ed3\u6784\u7684\u53d8\u5316\uff0c\u4e5f\u5bf9 IT \u6cbb\u7406\u67b6\u6784\u4ea7\u751f\u4e86\u5f71\u54cd\u3002
\u7cfb\u7edf\u6d88\u606f\u7528\u4e8e\u901a\u77e5\u6240\u6709\u7528\u6237\uff0c\u7c7b\u4f3c\u4e8e\u7cfb\u7edf\u516c\u544a\uff0c\u4f1a\u5728\u7279\u5b9a\u65f6\u95f4\u663e\u793a\u5728 AI \u7b97\u529b\u4e2d\u5fc3UI \u7684\u9876\u90e8\u680f\u3002
\u6700\u4f73\u5b9e\u8df5\uff1a\u8fd0\u7ef4\u90e8\u95e8\u624b\u4e2d\u6709\u4e00\u4e2a\u9ad8\u53ef\u7528\u96c6\u7fa4 01\uff0c\u60f3\u8981\u5206\u914d\u7ed9\u90e8\u95e8 A\uff08\u5de5\u4f5c\u7a7a\u95f4 A\uff09\u548c\u90e8\u95e8 B\uff08\u5de5\u4f5c\u7a7a\u95f4 B\uff09\u4f7f\u7528\uff0c\u5176\u4e2d\u90e8\u95e8 A \u5206\u914d CPU 50 \u6838\uff0c\u90e8\u95e8 B \u5206\u914d CPU 100 \u6838\u3002 \u90a3\u4e48\u53ef\u4ee5\u501f\u7528\u5171\u4eab\u8d44\u6e90\u7684\u6982\u5ff5\uff0c\u5c06\u96c6\u7fa4 01 \u5206\u522b\u5171\u4eab\u7ed9\u90e8\u95e8 A \u548c\u90e8\u95e8 B\uff0c\u5e76\u9650\u5236\u90e8\u95e8 A \u7684 CPU \u4f7f\u7528\u989d\u5ea6\u4e3a 50\uff0c\u90e8\u95e8 B \u7684 CPU \u4f7f\u7528\u989d\u5ea6\u4e3a 100\u3002 \u90a3\u4e48\u90e8\u95e8 A \u7684\u7ba1\u7406\u5458\uff08\u5de5\u4f5c\u7a7a\u95f4 A Admin\uff09\u80fd\u591f\u5728\u5e94\u7528\u5de5\u4f5c\u53f0\u521b\u5efa\u5e76\u4f7f\u7528\u547d\u540d\u7a7a\u95f4\uff0c\u5176\u4e2d\u547d\u540d\u7a7a\u95f4\u989d\u5ea6\u603b\u548c\u4e0d\u80fd\u8d85\u8fc7 50 \u6838\uff0c\u90e8\u95e8 B \u7684\u7ba1\u7406\u5458\uff08\u5de5\u4f5c\u7a7a\u95f4 B Admin\uff09\u80fd\u591f\u5728\u5e94\u7528\u5de5\u4f5c\u53f0\u521b\u5efa\u5e76\u4f7f\u7528\u547d\u540d\u7a7a\u95f4\uff0c\u5176\u4e2d\u547d\u540d\u7a7a\u95f4\u989d\u5ea6\u603b\u548c\u4e0d\u80fd\u8d85\u8fc7 100 \u6838\u3002 \u90e8\u95e8 A \u7684\u7ba1\u7406\u5458\u548c\u90e8\u95e8 B \u7ba1\u7406\u5458\u521b\u5efa\u7684\u547d\u540d\u7a7a\u95f4\u4f1a\u88ab\u81ea\u52a8\u7ed1\u5b9a\u5728\u8be5\u90e8\u95e8\uff0c\u90e8\u95e8\u4e2d\u7684\u5176\u4ed6\u6210\u5458\u5c06\u5bf9\u5e94\u7684\u62e5\u6709\u547d\u540d\u7a7a\u95f4\u7684 Namesapce Admin\u3001Namesapce Edit\u3001Namesapce View \u89d2\u8272\uff08\u8fd9\u91cc\u90e8\u95e8\u6307\u7684\u662f\u5de5\u4f5c\u7a7a\u95f4\uff0c\u5de5\u4f5c\u7a7a\u95f4\u8fd8\u53ef\u4ee5\u6620\u5c04\u4e3a\u7ec4\u7ec7\u3001\u4f9b\u5e94\u5546\u7b49\u5176\u4ed6\u6982\u5ff5\uff09\u3002\u6574\u4e2a\u8fc7\u7a0b\u5982\u4e0b\u8868\uff1a
\u90e8\u95e8 \u89d2\u8272 \u5171\u4eab\u96c6\u7fa4 Cluster \u8d44\u6e90\u914d\u989d \u90e8\u95e8\u7ba1\u7406\u5458 A Workspace Admin \u96c6\u7fa4 01 CPU 50 \u6838 \u90e8\u95e8\u7ba1\u7406\u5458 B Workspace Admin \u96c6\u7fa4 01 CPU 100 \u6838
"},{"location":"admin/ghippo/best-practice/ws-best-practice.html#ai","title":"\u5de5\u4f5c\u7a7a\u95f4\u5bf9 AI \u7b97\u529b\u4e2d\u5fc3\u5404\u6a21\u5757\u7684\u4f5c\u7528","text":"
GProduct \u662f AI \u7b97\u529b\u4e2d\u5fc3\u4e2d\u9664\u5168\u5c40\u7ba1\u7406\u5916\u7684\u6240\u6709\u5176\u4ed6\u6a21\u5757\u7684\u7edf\u79f0\uff0c\u8fd9\u4e9b\u6a21\u5757\u9700\u8981\u4e0e\u5168\u5c40\u7ba1\u7406\u5bf9\u63a5\u540e\u624d\u80fd\u52a0\u5165\u5230 AI \u7b97\u529b\u4e2d\u5fc3\u4e2d\u3002
"},{"location":"admin/ghippo/best-practice/oem/custom-idp.html","title":"\u5b9a\u5236 AI \u7b97\u529b\u4e2d\u5fc3\u5bf9\u63a5\u5916\u90e8\u8eab\u4efd\u63d0\u4f9b\u5546 (IdP)","text":"
\u8eab\u4efd\u63d0\u4f9b\u5546\uff08IdP, Identity Provider\uff09\uff1a\u5f53 AI \u7b97\u529b\u4e2d\u5fc3\u9700\u8981\u4f7f\u7528\u5ba2\u6237\u7cfb\u7edf\u4f5c\u4e3a\u7528\u6237\u6e90\uff0c \u4f7f\u7528\u5ba2\u6237\u7cfb\u7edf\u767b\u5f55\u754c\u9762\u6765\u8fdb\u884c\u767b\u5f55\u8ba4\u8bc1\u65f6\uff0c\u8be5\u5ba2\u6237\u7cfb\u7edf\u88ab\u79f0\u4e3a AI \u7b97\u529b\u4e2d\u5fc3\u7684\u8eab\u4efd\u63d0\u4f9b\u5546
cd quarkus\nmvn -f ../pom.xml clean install -DskipTestsuite -DskipExamples -DskipTests\n
"},{"location":"admin/ghippo/best-practice/oem/keycloak-idp.html#ide","title":"\u4ece IDE \u8fd0\u884c","text":""},{"location":"admin/ghippo/best-practice/oem/keycloak-idp.html#service","title":"\u6dfb\u52a0 service \u4ee3\u7801","text":""},{"location":"admin/ghippo/best-practice/oem/keycloak-idp.html#keycloak","title":"\u5982\u679c\u53ef\u4ece keycloak \u7ee7\u627f\u90e8\u5206\u529f\u80fd","text":"
"},{"location":"admin/ghippo/best-practice/oem/oem-in.html","title":"\u5982\u4f55\u5c06\u5ba2\u6237\u7cfb\u7edf\u96c6\u6210\u5230 AI \u7b97\u529b\u4e2d\u5fc3\uff08OEM IN\uff09","text":"
OEM IN \u662f\u6307\u5408\u4f5c\u4f19\u4f34\u7684\u5e73\u53f0\u4f5c\u4e3a\u5b50\u6a21\u5757\u5d4c\u5165 AI \u7b97\u529b\u4e2d\u5fc3\uff0c\u51fa\u73b0\u5728 AI \u7b97\u529b\u4e2d\u5fc3\u4e00\u7ea7\u5bfc\u822a\u680f\u3002 \u7528\u6237\u901a\u8fc7 AI \u7b97\u529b\u4e2d\u5fc3\u8fdb\u884c\u767b\u5f55\u548c\u7edf\u4e00\u7ba1\u7406\u3002\u5b9e\u73b0 OEM IN \u5171\u5206\u4e3a 5 \u6b65\uff0c\u5206\u522b\u662f\uff1a
\u4ee5\u4e0b\u4f7f\u7528\u5f00\u6e90\u8f6f\u4ef6 Label Studio \u6765\u505a\u5d4c\u5957\u6f14\u793a\u3002\u5b9e\u9645\u573a\u666f\u9700\u8981\u81ea\u5df1\u89e3\u51b3\u5ba2\u6237\u7cfb\u7edf\u7684\u95ee\u9898\uff1a
\u4f8b\u5982\u5ba2\u6237\u7cfb\u7edf\u9700\u8981\u81ea\u5df1\u6dfb\u52a0\u4e00\u4e2a Subpath\uff0c\u7528\u4e8e\u533a\u5206\u54ea\u4e9b\u662f AI \u7b97\u529b\u4e2d\u5fc3\u7684\u670d\u52a1\uff0c\u54ea\u4e9b\u662f\u5ba2\u6237\u7cfb\u7edf\u7684\u670d\u52a1\u3002
\u5c06\u5ba2\u6237\u7cfb\u7edf\u4e0e AI \u7b97\u529b\u4e2d\u5fc3\u5e73\u53f0\u901a\u8fc7 OIDC/OAUTH \u7b49\u534f\u8bae\u5bf9\u63a5\uff0c\u4f7f\u7528\u6237\u767b\u5f55 AI \u7b97\u529b\u4e2d\u5fc3\u5e73\u53f0\u540e\u8fdb\u5165\u5ba2\u6237\u7cfb\u7edf\u65f6\u65e0\u9700\u518d\u6b21\u767b\u5f55\u3002
Note
Two AI Computing Center deployments are connected to each other for this demonstration, covering two scenarios: using the AI Computing Center as the user source to log in to the customer platform, and using the customer platform as the user source to log in to the AI Computing Center.
AI Computing Center as the user source, logging in to the customer platform: first use the first AI Computing Center as the user source. After integration, users in the first AI Computing Center can log in to the second AI Computing Center directly via OIDC, without creating users again in the second one. In the first AI Computing Center, create an SSO integration via Global Management -> Users and Access Control -> Access Management.
Customer platform as the user source, logging in to the AI Computing Center: fill in the client ID, client secret, single sign-on URL, and other values generated by the first AI Computing Center into Global Management -> Users and Access Control -> Identity Providers -> OIDC of the second AI Computing Center to complete the user integration. After integration, users in the first AI Computing Center can log in to the second one directly via OIDC, without creating users again in the second one.
After the integration is complete, the login page of the second AI Computing Center will show an OIDC option. On first login, choose to log in via OIDC (the name is customizable; here it is loginname); subsequent logins go straight in without choosing again.
Note
Using two AI Computing Center deployments shows that as long as the customer supports the OIDC protocol, both scenarios are supported: the AI Computing Center as the user source, or the "customer platform" as the user source.
Refer to the tar package at the bottom of this document to implement an empty-shell frontend sub-application, and place the customer system into that shell application as an iframe.
After the integration is complete, Customer System appears in the AI Computing Center's top-level navigation bar; click it to enter the customer system.
The AI Computing Center supports appearance customization by writing CSS. How the customer system implements appearance customization in practice depends on its actual situation.
"},{"location":"admin/ghippo/best-practice/oem/oem-in.html#anyproduct-ai","title":"AnyProduct uses other capabilities of the AI Computing Center (optional)","text":"
This is done by calling the AI Computing Center OpenAPI.
OEM OUT means integrating the AI Computing Center into another product as a submodule so that it appears in that product's menu. After logging in to the other product, users can jump directly to the AI Computing Center without logging in again. Implementing OEM OUT takes five steps:
Deploy the AI Computing Center (assume the access address after deployment is https://10.6.8.2:30343/).
An nginx reverse proxy can be placed in front of the customer system and the AI Computing Center to achieve same-domain access: route / to the customer system and /dce5 (the subpath) to the AI Computing Center. An example of vi /etc/nginx/conf.d/default.conf is shown below:
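The following is a minimal sketch of such a reverse proxy configuration. The listen port, server name, and the customer system's upstream address are placeholders for illustration; the /dce5 subpath and the AI Computing Center address reuse the values mentioned above.

```nginx
server {
    listen 80;
    server_name demo.example.com;            # hypothetical domain

    # Route / to the customer system (upstream address assumed for illustration)
    location / {
        proxy_pass http://10.6.8.1:8080/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    # Route the /dce5 subpath to the AI Computing Center
    location /dce5/ {
        proxy_pass https://10.6.8.2:30343/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```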
Integrate the customer system with the AI Computing Center platform via protocols such as OIDC/OAuth, so that after logging in to the customer system a user can enter the AI Computing Center without logging in again. After obtaining the customer system's OIDC information, fill it in under Global Management -> Users and Access Control -> Identity Providers.
After the integration is complete, the AI Computing Center login page shows an OIDC (custom-named) option. When entering the AI Computing Center from the customer system for the first time, choose to log in via OIDC; afterwards you go directly into the AI Computing Center without choosing again.
Integrating the navigation bar means the AI Computing Center appears in the customer system's menu, and clicking the corresponding menu item takes the user directly into the AI Computing Center. Navigation bar integration therefore depends on the customer system, and different platforms must handle it according to their specific situation.
After the GM (Chinese national cryptography) gateway is deployed successfully, customize the AI Computing Center reverse proxy server address.
In the AI Computing Center, an administrator creates new users in Users and Access Control, assigning each user an account with certain permissions. All actions performed by that user are associated with their own account.
"},{"location":"admin/ghippo/install/reverse-proxy.html","title":"Customizing the AI Computing Center reverse proxy server address","text":"
Access keys can be used to access the open API and for continuous delivery. Users can obtain a key in the Personal Center by following the steps below and then use it to access the API.
Log in to the AI Computing Platform with your username and password, then click Global Management at the bottom of the left navigation bar.
"},{"location":"admin/ghippo/personal-center/ssh-key.html#4-ai","title":"Step 4: Set the public key on the Suanfeng AI Computing Platform","text":"
Log in to the Suanfeng AI Computing Platform UI and select Personal Center -> SSH Public Key in the upper-right corner of the page.
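If you still need the public key content to paste into this page, one way to print it is shown below (a sketch; the file name assumes the default key pair generated in the earlier steps):

```bash
# Print the public key so it can be copied into Personal Center -> SSH Public Key
cat ~/.ssh/id_rsa.pub
```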
In the Suanfeng AI Computing Platform, Appearance Customization lets you change the login page, the top navigation bar, and the copyright and ICP filing information at the bottom, helping users better identify the product.
When a user forgets their password, the Suanfeng AI Computing Platform sends an email to verify the email address and confirm that it is the user themselves. For the platform to be able to send email, you must first provide your mail server address.
The Suanfeng AI Computing Platform provides password-based and access-control-based security policies in the graphical interface.
Session timeout policy: if a user performs no operations within x hours, the current account is logged out.
Supports customizing the billing units for CPU, memory, storage, and GPU, as well as the currency unit.
Supports displaying the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization.
Cluster report: shows the maximum, minimum, and average CPU, memory, storage, and GPU memory utilization of all clusters over a period of time, as well as the number of nodes in each cluster during that period. Click the node count to jump directly to the node report and view node usage in that cluster during the period.
Node report: shows the maximum, minimum, and average CPU, memory, storage, and GPU memory utilization of all nodes over a period of time, along with each node's IP, type, and owning cluster.
Pod report: shows the maximum, minimum, and average CPU, memory, storage, and GPU memory utilization of all pods over a period of time, along with each pod's namespace, cluster, and workspace.
Workspace report: shows the maximum, minimum, and average CPU, memory, storage, and GPU memory utilization of all workspaces over a period of time, along with the number of namespaces and pods. Click the namespace count to jump directly to the namespace report and view namespace usage in that workspace during the period; pod usage in the workspace can be viewed in the same way.
Namespace report: shows the maximum, minimum, and average CPU, memory, storage, and GPU memory utilization of all namespaces over a period of time, along with the pod count, owning cluster, and owning workspace. Click the pod count to jump directly to the pod report and view pod usage in that namespace during the period.
This problem occurs because the MySQL database that ghippo-keycloak connects to has failed, causing the OIDC public keys to be reset.
Delete the keycloak database and recreate it, for example with CREATE DATABASE IF NOT EXISTS keycloak CHARACTER SET utf8.
Restart the Keycloak Pod to resolve the problem.
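A sketch of the two steps above; the MySQL connection details and the Keycloak namespace and label are placeholders, not values from the original document:

```bash
# Recreate the keycloak database (run against the MySQL instance used by ghippo-keycloak)
mysql -u root -p -e 'DROP DATABASE IF EXISTS keycloak; CREATE DATABASE IF NOT EXISTS keycloak CHARACTER SET utf8;'

# Restart the Keycloak Pod so it re-initializes against the fresh database
# (namespace and label selector are illustrative)
kubectl delete pod -n ghippo-system -l app.kubernetes.io/name=keycloak
```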
"},{"location":"admin/ghippo/troubleshooting/ghippo03.html#cpu-does-not-support-86-64-v2","title":"CPU does not support \u00d786-64-v2","text":""},{"location":"admin/ghippo/troubleshooting/ghippo03.html#_5","title":"\u6545\u969c\u8868\u73b0","text":"
keycloak \u65e0\u6cd5\u6b63\u5e38\u542f\u52a8\uff0ckeycloak pod \u8fd0\u884c\u72b6\u6001\u4e3a CrashLoopBackOff \u5e76\u4e14 keycloak \u7684 log \u51fa\u73b0\u5982\u4e0b\u56fe\u6240\u793a\u7684\u4fe1\u606f
\u8fd0\u884c\u4e0b\u9762\u7684\u68c0\u67e5\u811a\u672c\uff0c\u67e5\u8be2\u5f53\u524d\u8282\u70b9 cpu \u7684 x86-64\u67b6\u6784\u7684\u7279\u5f81\u7ea7\u522b
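A sketch of such a check script (the script in the original document may differ); it inspects the CPU flags in /proc/cpuinfo and reports the highest x86-64 microarchitecture level supported:

```bash
#!/usr/bin/env bash
# Report the x86-64 microarchitecture feature level (v1-v4) of the current CPU.
flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)

has() { for f in "$@"; do [[ " $flags " == *" $f "* ]] || return 1; done; }

level=0
has lm cmov cx8 fpu fxsr mmx syscall sse2 && level=1
[ "$level" -eq 1 ] && has cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3 && level=2
[ "$level" -eq 2 ] && has avx avx2 bmi1 bmi2 f16c fma abm movbe xsave && level=3
[ "$level" -eq 3 ] && has avx512f avx512bw avx512cd avx512dq avx512vl && level=4
echo "CPU supports x86-64-v${level}"
```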
Run the following command to view the current CPU's features; if the output contains sse4_2, your processor supports SSE 4.2.
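For example (a sketch; the exact command in the original document may differ):

```bash
# If this prints sse4_2, the processor supports SSE 4.2
grep -o 'sse4_2' /proc/cpuinfo | sort -u
```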
You need to upgrade your virtual machine or physical machine CPU to support x86-64-v2 or above, ensuring that the x86 CPU instruction set supports SSE 4.2. For how to upgrade, consult your cloud provider or hardware vendor.
Log in to the AI Computing Platform as an end user, navigate to the corresponding service, and check the access port.
The Alert Center is an important feature of the AI Computing Platform. It lets users conveniently view all active and historical alerts by cluster and namespace through the graphical interface, and search alerts by severity (Critical, Warning, Info).
All alerts are triggered by threshold conditions defined in preset alert rules. The AI Computing Platform ships with some built-in global alert policies, and you can also create or delete alert policies at any time and configure the following metrics:
The AI Computing Platform Alert Center is a powerful alert management platform that helps users promptly discover and resolve problems in clusters, improving business stability and availability and facilitating cluster inspection and troubleshooting.
Switch to the insight-system -> Fluent Bit dashboard.
At the top of the Fluent Bit dashboard there are several drop-down boxes for selecting the log collection plugin, log filter plugin, log output plugin, and the cluster name.
This article uses the AI Computing Center as an example to explain how to use Insight to find abnormal components in the AI Computing Center and analyze the root cause of the abnormality.
Following the official Grafana documentation (current image version 9.3.14), configure an external database according to the steps below, using MySQL as the example:
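As a sketch of what the result can look like when Grafana runs in Kubernetes, the database settings can be supplied through GF_DATABASE_* environment variables on the Grafana container; the host, database name, and credentials below are placeholders, not values from the original document:

```yaml
# Excerpt of the Grafana container spec; values are illustrative only.
env:
  - name: GF_DATABASE_TYPE
    value: mysql
  - name: GF_DATABASE_HOST
    value: mysql.example.svc:3306
  - name: GF_DATABASE_NAME
    value: grafana
  - name: GF_DATABASE_USER
    value: grafana
  - name: GF_DATABASE_PASSWORD
    value: changeme
```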
AI Computing Center Insight currently recommends tail sampling and supports it as the preferred option.
PodMonitor: in the Kubernetes ecosystem, scrapes monitoring data from Pods based on Prometheus Operator.
ServiceMonitor: in the Kubernetes ecosystem, scrapes monitoring data from the Endpoints behind a Service based on Prometheus Operator.
Because ICMP requires higher privileges, we also need to elevate the Pod's privileges; otherwise an operation not permitted error occurs. There are two ways to elevate privileges:
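A sketch of the two options as container securityContext settings (the snippet is illustrative; apply it to the container that performs the ICMP probe):

```yaml
# Option 1: grant only the NET_RAW capability needed for raw ICMP sockets
securityContext:
  capabilities:
    add: ["NET_RAW"]

# Option 2: run the probing container as privileged
securityContext:
  privileged: true
```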
port: specifies the port through which data is scraped; set it to the name configured on the Service port being scraped.
This defines the scope of Services to be discovered. namespaceSelector contains two mutually exclusive fields with the following meanings:
any: has only one possible value, true. When this field is set, changes to all Services matching the Selector filter conditions are watched.
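A minimal ServiceMonitor sketch illustrating the port and namespaceSelector fields described above; the resource name, namespace, and labels are placeholders, not values from the original document:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: insight-system
spec:
  endpoints:
    - port: metrics        # must match the `name` of the Service port being scraped
      interval: 30s
  namespaceSelector:
    any: true              # watch matching Services in every namespace
  selector:
    matchLabels:
      app: example-app
```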
Resource consumption: view the resource trend over the last hour for the top 5 clusters and nodes, ranked by CPU utilization, memory utilization, or disk utilization.
Sorted by CPU utilization by default. You can switch the metric to change how clusters and nodes are ranked.
Resource trend: view the node count trend over the last 15 days and the Pod running trend over the last hour.
Use logical operators (AND, OR, NOT, "") to query multiple keywords, for example: keyword1 AND (keyword2 OR keyword3) NOT keyword4.
kubectl get secrets -n mcamel-system mcamel-common-es-cluster-masters-es-elastic-user -o jsonpath=\"{.data.elastic}\" |base64 -d\n
Go to Kibana -> Stack Management -> Index Management and enable the Include hidden indices option to see all indices. Based on the index sequence numbers, keep the indices with larger numbers and delete the ones with smaller numbers.
Restart the Pod; after it returns to the running state, Fluent Bit will no longer collect logs from the containers inside this Pod.
"},{"location":"admin/insight/infra/cluster.html#_4","title":"Metric reference","text":"

| Metric | Description |
| --- | --- |
| CPU utilization | The ratio of the actual CPU usage of all Pods in the cluster to the total CPU of all nodes. |
| CPU allocation rate | The ratio of the sum of CPU requests of all Pods in the cluster to the total CPU of all nodes. |
| Memory utilization | The ratio of the actual memory usage of all Pods in the cluster to the total memory of all nodes. |
| Memory allocation rate | The ratio of the sum of memory requests of all Pods in the cluster to the total memory of all nodes. |

"},{"location":"admin/insight/infra/container.html","title":"Container Monitoring","text":"
"},{"location":"admin/insight/infra/container.html#_4","title":"Metric reference","text":"

| Metric | Description |
| --- | --- |
| CPU usage | Sum of the CPU usage of all pods under the workload. |
| CPU requests | Sum of the CPU requests of all pods under the workload. |
| CPU limits | Sum of the CPU limits of all pods under the workload. |
| Memory usage | Sum of the memory usage of all pods under the workload. |
| Memory requests | Sum of the memory requests of all pods under the workload. |
| Memory limits | Sum of the memory limits of all pods under the workload. |
| Disk read/write rate | Total of continuous disk reads and writes per second over the specified time range, a performance measure of disk read and write operations per second. |
| Network send/receive rate | Inbound and outbound network traffic rates, aggregated by workload, over the specified time range. |

"},{"location":"admin/insight/infra/event.html","title":"Event Query","text":"
AI Computing Platform Insight supports querying events by cluster and namespace, and provides an event status distribution chart with statistics on important events.
With the important-event statistics, you can easily see the number of image pull failures, health check failures, Pod run failures, Pod scheduling failures, container OOM (out-of-memory) events, volume mount failures, and the total number of all events. These events are usually divided into two categories: Warning and Normal.
"},{"location":"admin/insight/infra/namespace.html#_4","title":"Metric description","text":"

| Metric | Description |
| --- | --- |
| CPU usage | Sum of the CPU usage of pods in the selected namespace |
| Memory usage | Sum of the memory usage of pods in the selected namespace |
| Pod CPU usage | CPU usage of each pod in the namespace |
| Pod memory usage | Memory usage of each pod in the namespace |

"},{"location":"admin/insight/infra/node.html","title":"Node Monitoring","text":"

| Metric | Description |
| --- | --- |
| Current Status Response | The response status code of the HTTP probe request. |
| Ping Status | Whether the probe request succeeded: 1 means success, 0 means failure. |
| IP Protocol | The IP protocol version used by the probe request. |
| SSL Expiry | The earliest expiry time of the SSL/TLS certificate. |
| DNS Response (Latency) | The duration of the entire probe process, in seconds. |
| HTTP Duration | The time of the entire process from sending the request to receiving the complete response. |

"},{"location":"admin/insight/infra/probe.html#_7","title":"Delete a probe task","text":"
The AI Computing Center platform manages multiple clouds and clusters and supports creating clusters. On this basis, the observability module Insight serves as a unified multi-cluster observability solution: it collects observability data from multiple clusters by deploying the insight-agent plugin, and supports querying metrics, logs, and trace data through the AI Computing Center observability product.
"},{"location":"admin/insight/quickstart/install/gethosturl.html#insight-agent_1","title":"Install insight-agent in other clusters","text":""},{"location":"admin/insight/quickstart/install/gethosturl.html#insight-server","title":"Get the address via the API provided by Insight Server","text":"
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})\ncurl --location --request POST 'http://'\"${INSIGHT_SERVER_IP}\"'/apis/insight.io/v1alpha1/agentinstallparam'\n
When calling the API, you also need to pass the IP of any externally accessible node in the cluster; this IP is used to assemble the complete access address of the corresponding service.
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})\ncurl --location --request POST 'http://'\"${INSIGHT_SERVER_IP}\"'/apis/insight.io/v1alpha1/agentinstallparam' --data '{\"extra\": {\"EXPORTER_EXTERNAL_IP\": \"10.5.14.51\"}}'\n
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})\ncurl --location --request POST 'http://'\"${INSIGHT_SERVER_IP}\"'/apis/insight.io/v1alpha1/agentinstallparam'\n
Once the above is ready, you can onboard your application to distributed tracing via annotations; OTel currently supports onboarding traces through annotations. Depending on the service language, different pod annotations need to be added. Each service can add one of two kinds of annotations:
Because Go auto-instrumentation requires OTEL_GO_AUTO_TARGET_EXE to be set, you must provide a valid executable path via the annotation or the Instrumentation resource. If this value is not set, Go auto-instrumentation injection aborts and trace onboarding fails.
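A sketch of supplying these values through pod annotations; the Instrumentation resource reference and the binary path are placeholders, not values from the original document:

```yaml
# Pod template annotations; the target executable must point at the compiled
# binary inside the container.
annotations:
  instrumentation.opentelemetry.io/inject-go: "insight-system/insight-opentelemetry-autoinstrumentation"
  instrumentation.opentelemetry.io/otel-go-auto-target-exe: "/app/my-service"
```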
Go auto-instrumentation also requires elevated privileges. The following permissions are set automatically and are required.
Manual instrumentation, using Go as an example: enhance a Go application with the OpenTelemetry SDK.
Use eBPF to implement a non-intrusive probe for Go (experimental feature).
OpenTelemetry, often abbreviated OTel, is an open-source observability framework that helps generate and collect telemetry data in Go applications: traces, metrics, and logs.
This article mainly explains how to instrument a Go application with the OpenTelemetry Go SDK and onboard it to trace monitoring.
"},{"location":"admin/insight/quickstart/otel/golang/golang.html#otel-sdk-go_1","title":"Enhance a Go application with the OTel SDK","text":""},{"location":"admin/insight/quickstart/otel/golang/golang.html#_1","title":"Install dependencies","text":"
OTEL_SERVICE_NAME=my-golang-app OTEL_EXPORTER_OTLP_ENDPOINT=http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 go run main.go...\n
# Add one line to your import() stanza depending upon your request router:\nmiddleware \"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin\"\n
# Add one line to your import() stanza depending upon your request router:\nmiddleware \"go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux\"\n
Create a meter provider and specify Prometheus as the exporter.
/*\n* Copyright The OpenTelemetry Authors\n* SPDX-License-Identifier: Apache-2.0\n*/\n\npackage io.opentelemetry.example.prometheus;\n\nimport io.opentelemetry.api.metrics.MeterProvider;\nimport io.opentelemetry.exporter.prometheus.PrometheusHttpServer;\nimport io.opentelemetry.sdk.metrics.SdkMeterProvider;\nimport io.opentelemetry.sdk.metrics.export.MetricReader;\n\npublic final class ExampleConfiguration {\n\n /**\n * Initializes the Meter SDK and configures the prometheus collector with all default settings.\n *\n * @param prometheusPort the port to open up for scraping.\n * @return A MeterProvider for use in instrumentation.\n */\n static MeterProvider initializeOpenTelemetry(int prometheusPort) {\n MetricReader prometheusReader = PrometheusHttpServer.builder().setPort(prometheusPort).build();\n\n return SdkMeterProvider.builder().registerMetricReader(prometheusReader).build();\n }\n}\n
Define a custom meter and start the HTTP server.
package io.opentelemetry.example.prometheus;\n\nimport io.opentelemetry.api.common.Attributes;\nimport io.opentelemetry.api.metrics.Meter;\nimport io.opentelemetry.api.metrics.MeterProvider;\nimport java.util.concurrent.ThreadLocalRandom;\n\n/**\n* Example of using the PrometheusHttpServer to convert OTel metrics to Prometheus format and expose\n* these to a Prometheus instance via a HttpServer exporter.\n*\n* <p>A Gauge is used to periodically measure how many incoming messages are awaiting processing.\n* The Gauge callback gets executed every collection interval.\n*/\npublic final class PrometheusExample {\n private long incomingMessageCount;\n\n public PrometheusExample(MeterProvider meterProvider) {\n Meter meter = meterProvider.get(\"PrometheusExample\");\n meter\n .gaugeBuilder(\"incoming.messages\")\n .setDescription(\"No of incoming messages awaiting processing\")\n .setUnit(\"message\")\n .buildWithCallback(result -> result.record(incomingMessageCount, Attributes.empty()));\n }\n\n void simulate() {\n for (int i = 500; i > 0; i--) {\n try {\n System.out.println(\n i + \" Iterations to go, current incomingMessageCount is: \" + incomingMessageCount);\n incomingMessageCount = ThreadLocalRandom.current().nextLong(100);\n Thread.sleep(1000);\n } catch (InterruptedException e) {\n // ignored here\n }\n }\n }\n\n public static void main(String[] args) {\n int prometheusPort = 8888;\n\n // it is important to initialize the OpenTelemetry SDK as early as possible in your process.\n MeterProvider meterProvider = ExampleConfiguration.initializeOpenTelemetry(prometheusPort);\n\n PrometheusExample prometheusExample = new PrometheusExample(meterProvider);\n\n prometheusExample.simulate();\n\n System.out.println(\"Exiting\");\n }\n}\n
<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<configuration>\n <appender name=\"CONSOLE\" class=\"ch.qos.logback.core.ConsoleAppender\">\n <encoder>\n <pattern>%d{HH:mm:ss.SSS} trace_id=%X{trace_id} span_id=%X{span_id} trace_flags=%X{trace_flags} %msg%n</pattern>\n </encoder>\n </appender>\n\n <!-- Just wrap your logging appender, for example ConsoleAppender, with OpenTelemetryAppender -->\n <appender name=\"OTEL\" class=\"io.opentelemetry.instrumentation.logback.mdc.v1_0.OpenTelemetryAppender\">\n <appender-ref ref=\"CONSOLE\"/>\n </appender>\n\n <!-- Use the wrapped \"OTEL\" appender instead of the original \"CONSOLE\" one -->\n <root level=\"INFO\">\n <appender-ref ref=\"OTEL\"/>\n </root>\n\n</configuration>\n
Here the second approach is used: when starting the JVM you need to specify the JMX Exporter jar file and configuration file. The jar is a binary file that is awkward to mount via a ConfigMap, and the configuration file rarely needs modification, so the recommendation is to package both the JMX Exporter jar and its configuration file directly into the application container image.
For the second approach, we can either put the JMX Exporter jar file into the application image or mount it at deployment time. Both options are described below:
"},{"location":"admin/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html#jmx-exporter-jar","title":"Option 1: Build the JMX Exporter JAR file into the application image","text":"
Then prepare the jar file: the download address of the latest jar can be found on the jmx_exporter GitHub page. Refer to the following Dockerfile:
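A sketch of such a Dockerfile; the base image, jar version, exposed port, and file paths are placeholders, not values from the original document:

```dockerfile
FROM openjdk:11-jre-slim
WORKDIR /app

# JMX Exporter javaagent jar and its config file, baked into the image
COPY jmx_prometheus_javaagent-0.20.0.jar /app/jmx_prometheus_javaagent.jar
COPY prometheus-jmx-config.yaml          /app/prometheus-jmx-config.yaml
COPY my-service.jar                      /app/my-service.jar

# Expose JMX metrics on :8088 via the javaagent
ENV JAVA_TOOL_OPTIONS="-javaagent:/app/jmx_prometheus_javaagent.jar=8088:/app/prometheus-jmx-config.yaml"
EXPOSE 8088

CMD ["java", "-jar", "/app/my-service.jar"]
```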
Without service mesh enabled, tests show the relationship between system Job metric volume and Pods to be: number of series = 800 * number of Pods.
With service mesh enabled, the volume of Istio-related metrics produced by Pods is on the order of: number of series = 768 * number of Pods.
The Pod count in the table refers to Pods running stably in the cluster. If many Pods restart, the metric volume spikes over a short period, and resources must be scaled up accordingly.
Disk usage = instantaneous metric volume x 2 x disk usage per data point x 60 x 24 x retention (days)
Retention (days) x 60 x 24 converts the time (days) into minutes for calculating disk usage.
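A worked example under assumed values (not taken from the original document): with 100,000 instantaneous series, roughly 1 byte of disk per data point, and 7 days of retention, disk usage ≈ 100,000 x 2 x 1 B x 60 x 24 x 7 ≈ 2.0 GB.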
In the example below, Alertmanager assigns all alerts whose CPU utilization is above the threshold to a policy named "critical_alerts".
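A sketch of what such a route can look like in the Alertmanager configuration; the receiver definitions and the label matcher are illustrative, not values from the original document:

```yaml
route:
  receiver: default
  routes:
    - receiver: critical_alerts
      matchers:
        - alertname = "HighCPUUsage"   # alerts fired by the CPU-usage rule
receivers:
  - name: default
  - name: critical_alerts
```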
For example: field:[value TO ] means the value of field ranges from value to positive infinity, and field:[ TO value] means the value of field ranges from negative infinity to value.
Each Pod can run multiple sidecar containers, separated by ;, so that different sidecar containers collect multiple files into multiple volumes.
Restart the Pod; once its status becomes Running, you can search the in-container logs of that Pod on the Log Query page.
In the left navigation, select Index Lifecycle Policies, find the policy insight-es-k8s-logs-policy, and click it to open the details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period; the default retention is 7d.
After making the changes, click Save policy at the bottom of the page to apply them.
In the left navigation, select Index Lifecycle Policies, find the policy jaeger-ilm-policy, and click it to open the details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period; the default retention is 7d.
After making the changes, click Save policy at the bottom of the page to apply them.
A Kubernetes cluster is deployed to support efficient AI computing power scheduling and management, enable elastic scaling, and provide high availability, thereby optimizing the model training and inference process.
At this point the cluster has been created successfully, and you can view the nodes it contains. You can now create AI workloads and use GPUs.
Next step: Create an AI workload
Backups are generally divided into three types: full, incremental, and differential. The Suanfeng AI Computing Platform currently supports full and incremental backups.
The backup and restore capability provided by the Suanfeng AI Computing Platform falls into two kinds, application backup and etcd backup, and supports manual backup or scheduled automatic backup based on a CronJob.
This article describes how to back up an application on the Suanfeng AI Computing Platform. The demo application used in this tutorial is dao-2048, a stateless workload.
CA certificate: you can view the certificate with the following command, then copy and paste its content into the corresponding field:
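One possible form of that command (a sketch; the exact command in the original document may differ):

```bash
# Print the cluster CA certificate so it can be pasted into the CA certificate field
kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d
```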
S3 region: the geographic region of the cloud storage. Defaults to us-east-1; provided by the system administrator.
S3 force path style: keep the default value, true.
S3 server URL: the console access address of the object storage (MinIO). MinIO generally provides two services, UI access and console access; use the console access address here.
A worker cluster has already been created through the AI Computing Center platform; refer to the documentation on creating a worker cluster.
This article describes how to manually scale out the worker nodes of the global service cluster in offline mode. By default, scaling out the global service cluster after deploying the AI Computing Center is not recommended; plan your resources before deploying the AI Computing Center.
The AI Computing Center platform has been deployed via the bootstrap node, and the kind cluster on the bootstrap node is running normally.
[root@localhost ~]# podman ps\n\n# Expected output:\nCONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n220d662b1b6a docker.m.daocloud.io/kindest/node:v1.26.2 2 weeks ago Up 2 weeks 0.0.0.0:443->30443/tcp, 0.0.0.0:8081->30081/tcp, 0.0.0.0:9000-9001->32000-32001/tcp, 0.0.0.0:36674->6443/tcp my-cluster-installer-control-plane\n
"},{"location":"admin/kpanda/best-practice/add-worker-node-on-global.html#kind-ai","title":"\u5c06 kind \u96c6\u7fa4\u63a5\u5165 AI \u7b97\u529b\u4e2d\u5fc3\u96c6\u7fa4\u5217\u8868","text":"
\u767b\u5f55 AI \u7b97\u529b\u4e2d\u5fc3\uff0c\u8fdb\u5165\u5bb9\u5668\u7ba1\u7406\uff0c\u5728\u96c6\u7fa4\u5217\u8868\u9875\u53f3\u4fa7\u70b9\u51fb \u63a5\u5165\u96c6\u7fa4 \u6309\u94ae\uff0c\u8fdb\u5165\u63a5\u5165\u96c6\u7fa4\u9875\u9762\u3002
\u672c\u6b21\u6f14\u793a\u5c06\u57fa\u4e8e AI \u7b97\u529b\u4e2d\u5fc3\u7684\u5e94\u7528\u5907\u4efd\u529f\u80fd\uff0c\u5b9e\u73b0\u4e00\u4e2a\u6709\u72b6\u6001\u5e94\u7528\u7684\u8de8\u96c6\u7fa4\u5907\u4efd\u8fc1\u79fb\u3002
Note
The current operator should have AI Computing Center platform administrator permissions.
Check the NFS Pod status and wait for it to become running (this takes about 2 minutes).
kubectl get pod -n nfs-system -owide\n
The expected output is:
[root@g-master1 ~]# kubectl get pod -owide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\nnfs-provisioner-7dfb9bcc45-74ws2 1/1 Running 0 4m45s 10.6.175.100 g-master1 <none> <none>\n
"},{"location":"admin/kpanda/best-practice/backup-mysql-on-nfs.html#mysql_1","title":"\u90e8\u7f72 MySQL \u5e94\u7528","text":"
\u4e3a MySQL \u5e94\u7528\u51c6\u5907\u57fa\u4e8e NFS \u5b58\u50a8\u7684 PVC\uff0c\u7528\u6765\u5b58\u50a8 MySQL \u670d\u52a1\u5185\u7684\u6570\u636e\u3002
\u4f7f\u7528 vi pvc.yaml \u547d\u4ee4\u5728\u8282\u70b9\u4e0a\u521b\u5efa\u540d\u4e3a pvc.yaml \u7684\u6587\u4ef6\uff0c\u5c06\u4e0b\u9762\u7684 YAML \u5185\u5bb9\u590d\u5236\u5230 pvc.yaml \u6587\u4ef6\u5185\u3002
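A sketch of what pvc.yaml can contain; the PVC name mydata matches the label commands used later in this walkthrough, while the storage class name and requested size are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mydata
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nfs-storage   # assumed name of the NFS StorageClass created earlier
  resources:
    requests:
      storage: 1Gi
```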
Run kubectl get pod | grep mysql to check the MySQL Pod status and wait for it to become running (this takes about 2 minutes).
The expected output is:
[root@g-master1 ~]# kubectl get pod |grep mysql\nmysql-deploy-5d6f94cb5c-gkrks 1/1 Running 0 2m53s\n
Note
If the MySQL Pod stays in a non-running state for a long time, it is usually because the NFS dependencies are not installed on all nodes of the cluster.
Run kubectl describe pod ${mysql pod name} to view the Pod's details.
If the error contains a message like MountVolume.SetUp failed for volume "pvc-4ad70cc6-df37-4253-b0c9-8cb86518ccf8" : mount failed: exit status 32, run kubectl delete -f nfs.yaml/pvc.yaml/mysql.yaml to delete the previously created resources, then start over from deploying the NFS service.
Write data to the MySQL application.
To make it easier to verify later whether the data migration succeeded, you can use a script to write test data into the MySQL application.
Use the vi insert.sh command to create a script named insert.sh on the node, and copy the content below into the script.
insert.sh
#!/bin/bash\n\nfunction rand(){\n min=$1\n max=$(($2-$min+1))\n num=$(date +%s%N)\n echo $(($num%$max+$min))\n}\n\nfunction insert(){\n user=$(date +%s%N | md5sum | cut -c 1-9)\n age=$(rand 1 100)\n\n sql=\"INSERT INTO test.users(user_name, age)VALUES('${user}', ${age});\"\n echo -e ${sql}\n\n kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"${sql}\"\n\n}\n\nkubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"CREATE DATABASE IF NOT EXISTS test;\"\nkubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"CREATE TABLE IF NOT EXISTS test.users(user_name VARCHAR(10) NOT NULL,age INT UNSIGNED)ENGINE=InnoDB DEFAULT CHARSET=utf8;\"\n\nwhile true;do\n insert\n sleep 1\ndone\n
mysql: [Warning] Using a password on the command line interface can be insecure.\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('dc09195ba', 10);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('80ab6aa28', 70);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('f488e3d46', 23);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('e6098695c', 93);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('eda563e7d', 63);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('a4d1b8d68', 17);\nmysql: [Warning] Using a password on the command line interface can be insecure.\n
Press Control and C on the keyboard at the same time to stop the script.
Go to the MySQL Pod to view the data written into MySQL.
kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"SELECT * FROM test.users;\"\n
The expected output is:
mysql: [Warning] Using a password on the command line interface can be insecure.\nuser_name age\ndc09195ba 10\n80ab6aa28 70\nf488e3d46 23\ne6098695c 93\neda563e7d 63\na4d1b8d68 17\nea47546d9 86\na34311f2e 47\n740cefe17 33\nede85ea28 65\nb6d0d6a0e 46\nf0eb38e50 44\nc9d2f28f5 72\n8ddaafc6f 31\n3ae078d0e 23\n6e041631e 96\n
mysql> set global read_only=1; # 1 means read-only, 0 means read-write\nmysql> show global variables like \"%read_only%\"; # check the status\n
Add a dedicated label, backup=mysql, to the MySQL application and its PVC data so that the resources can be selected during backup.
kubectl label deploy mysql-deploy backup=mysql # label the __mysql-deploy__ workload\nkubectl label pod mysql-deploy-5d6f94cb5c-gkrks backup=mysql # label the mysql pod\nkubectl label pvc mydata backup=mysql # label the mysql PVC\n
"},{"location":"admin/kpanda/best-practice/backup-mysql-on-nfs.html#mysql_3","title":"\u8de8\u96c6\u7fa4\u6062\u590d MySQL \u5e94\u7528\u53ca\u6570\u636e","text":"
\u767b\u5f55 AI \u7b97\u529b\u4e2d\u5fc3\u5e73\u53f0\uff0c\u5728\u5de6\u4fa7\u5bfc\u822a\u9009\u62e9 \u5bb9\u5668\u7ba1\u7406 -> \u5907\u4efd\u6062\u590d -> \u5e94\u7528\u5907\u4efd \u3002
NAME READY STATUS RESTARTS AGE\nmysql-deploy-5798f5d4b8-62k6c 1/1 Running 0 24h\n
Check whether the data in the MySQL table has been restored successfully.
kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"SELECT * FROM test.users;\"\n
The expected output is as follows:
mysql: [Warning] Using a password on the command line interface can be insecure.\nuser_name age\ndc09195ba 10\n80ab6aa28 70\nf488e3d46 23\ne6098695c 93\neda563e7d 63\na4d1b8d68 17\nea47546d9 86\na34311f2e 47\n740cefe17 33\nede85ea28 65\nb6d0d6a0e 46\nf0eb38e50 44\nc9d2f28f5 72\n8ddaafc6f 31\n3ae078d0e 23\n6e041631e 96\n
Success
As you can see, the data in the Pod is identical to the data in the Pod on main-cluster. This shows that the MySQL application and its data have been successfully restored across clusters from main-cluster to the recovery-cluster cluster.
This article applies only to offline mode, where a worker cluster is created with the AI Computing Center platform and both the management platform and the worker cluster to be created use the AMD architecture. Heterogeneous (mixed AMD and ARM) deployment is not supported when creating a cluster; after the cluster is created, you can manage a mixed deployment by attaching heterogeneous nodes.
An AI Computing Center full-mode environment has been deployed and the bootstrap node is still alive; for deployment, refer to Offline Installation of the AI Computing Center Commercial Edition.
Make sure you are logged in to the bootstrap node, and that the clusterConfig.yaml file used when the AI Computing Center was deployed is still available.
Download the required RedHat OS package and ISO offline packages:
| Resource | Description | Download URL |
| --- | --- | --- |
| os-pkgs-redhat9-v0.9.3.tar.gz | RedHat 9.2 OS-package bundle | https://github.com/kubean-io/kubean/releases/download/v0.9.3/os-pkgs-redhat9-v0.9.3.tar.gz |
| ISO offline package | ISO image | Log in and download from the official RedHat site |
| import-iso | Script for importing the ISO into the bootstrap node | https://github.com/kubean-io/kubean/releases/download/v0.9.3/import_iso.sh |

Import the OS Package Offline Bundle into the MinIO on the Bootstrap Node
Extract the RedHat OS package offline bundle
Run the following command to extract the downloaded OS package offline bundle. Here we use the RedHat OS package offline bundle downloaded above.
tar -xvf os-pkgs-redhat9-v0.9.3.tar.gz
The extracted contents of the os package are as follows:
os-pkgs
 ├── import_ospkgs.sh       # script used to import os packages into the MinIO file service
 ├── os-pkgs-amd64.tar.gz   # os packages for the amd64 architecture
 ├── os-pkgs-arm64.tar.gz   # os packages for the arm64 architecture
 └── os-pkgs.sha256sum.txt  # sha256sum checksum file for the os packages
Import the OS Package into the MinIO on the Bootstrap Node

Run the following command to import the os packages into the MinIO file service.

Import the ISO Offline Package into the MinIO on the Bootstrap Node

Run the following command to import the ISO package into the MinIO file service.
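The two import commands themselves are not included in this excerpt. Based on the import_ospkgs.sh script shipped inside the os-pkgs bundle and the import_iso.sh script downloaded above, they typically look like the following sketch; both scripts are assumed to read the MinIO credentials from environment variables and to take the MinIO address plus the file to import as arguments (address, credentials, and file names are placeholders):

```bash
# Minimal sketch, adjust the MinIO address, credentials, and file names to your environment.
# Import the os packages bundle into MinIO:
MINIO_USER=rootuser MINIO_PASS=rootpass123 \
  ./import_ospkgs.sh http://127.0.0.1:9000 os-pkgs-amd64.tar.gz
# Import the downloaded ISO into MinIO:
MINIO_USER=rootuser MINIO_PASS=rootpass123 \
  ./import_iso.sh http://127.0.0.1:9000 rhel-9.2-x86_64-dvd.iso
```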
This article applies only to offline mode, where a worker cluster is created with the AI Computing Center platform and both the management platform and the worker cluster to be created use the AMD architecture. Heterogeneous (mixed AMD and ARM) deployment is not supported when creating a cluster; after the cluster is created, you can manage a mixed deployment by attaching heterogeneous nodes.
An AI Computing Center full-mode environment has been deployed and the bootstrap node is still alive; for deployment, refer to Offline Installation of the AI Computing Center Commercial Edition.
Make sure you are logged in to the bootstrap node, and that the clusterConfig.yaml file used when the AI Computing Center was deployed is still available.
Download the required Ubuntu OS package and ISO offline packages:
| Resource | Description | Download URL |
| --- | --- | --- |
| os-pkgs-ubuntu2204-v0.18.2.tar.gz | Ubuntu 22.04 OS-package bundle | https://github.com/kubean-io/kubean/releases/download/v0.18.2/os-pkgs-ubuntu2204-v0.18.2.tar.gz |
| ISO offline package | ISO image | http://mirrors.melbourne.co.uk/ubuntu-releases/ |

Import the OS Package and ISO Offline Packages into the MinIO on the Bootstrap Node
ETCD backup and restore in the AI Computing Center is limited to backing up and restoring within the same cluster (the number of nodes and their IP addresses must not change). For example, after backing up the etcd data of cluster A, the backup can only be restored to cluster A, not to cluster B.
The AI Computing Center backup is a full data backup; a restore will restore the full data of the last backup.
{"level":"warn","ts":"2023-03-29T17:51:50.817+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001ba000/controller-node-1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.5.14.31:2379: connect: connection refused\""}
Failed to get the status of endpoint controller-node-1:2379 (context deadline exceeded)
{"level":"warn","ts":"2023-03-29T17:51:55.818+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001ba000/controller-node-2:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.5.14.32:2379: connect: connection refused\""}
Failed to get the status of endpoint controller-node-2:2379 (context deadline exceeded)
{"level":"warn","ts":"2023-03-29T17:52:00.820+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001ba000/controller-node-1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.5.14.33:2379: connect: connection refused\""}
Failed to get the status of endpoint controller-node-3:2379 (context deadline exceeded)
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
--initial-advertise-peer-urls: the address used for access between etcd cluster members. It must be consistent with the etcd configuration.
The expected output is as follows:
INFO[0000] Finding latest set of snapshot to recover from...
INFO[0000] Restoring from base snapshot: Full-00000000-00111147-1679991074  actor=restorer
INFO[0001] successfully fetched data of base snapshot in 1.241380207 seconds  actor=restorer
{"level":"info","ts":1680011221.2511616,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":110327}
{"level":"info","ts":1680011221.3045986,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"66638454b9dd7b8a","local-member-id":"0","added-peer-id":"123c2503a378fc46","added-peer-peer-urls":["https://10.6.212.10:2380"]}
INFO[0001] Starting embedded etcd server...  actor=restorer
....

{"level":"info","ts":"2023-03-28T13:47:02.922Z","caller":"embed/etcd.go:565","msg":"stopped serving peer traffic","address":"127.0.0.1:37161"}
{"level":"info","ts":"2023-03-28T13:47:02.922Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"default","data-dir":"/var/lib/etcd","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
INFO[0003] Successfully restored the etcd data directory.
etcdctl member list -w table \
--cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
--cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
--key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
The expected output is as follows:
+------------------+---------+-------------------+--------------------------+--------------------------+------------+
|        ID        | STATUS  |       NAME        |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+------------------+---------+-------------------+--------------------------+--------------------------+------------+
| 123c2503a378fc46 | started | controller-node-1 | https://10.6.212.10:2380 | https://10.6.212.10:2379 |      false |
+------------------+---------+-------------------+--------------------------+--------------------------+------------+
Check the status of controller-node-1:
etcdctl endpoint status --endpoints=controller-node-1:2379 -w table \
--cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
--cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
--key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
The expected output is as follows:
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| controller-node-1:2379 | 123c2503a378fc46 |   3.5.6 |   15 MB |      true |      false |         3 |       1200 |               1199 |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Restore data on the other nodes
The steps above have restored the data on node 01. To restore the data on the other nodes, you only need to start the etcd Pods again and let etcd synchronize the data by itself.
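In a kubeadm-style cluster where the etcd static Pod manifests were moved out of /etc/kubernetes/manifests during the restore, starting the Pods again usually just means moving the manifests back and letting the kubelet recreate them. A minimal sketch, assuming the manifest was parked in a temporary directory (use whatever location you actually moved it to):

```bash
# Run on each remaining control-plane node: restore the etcd static Pod manifest so the kubelet
# recreates the etcd Pod, then watch the etcd Pods come back and sync data from the restored member.
mv /etc/kubernetes/tmp/etcd.yaml /etc/kubernetes/manifests/etcd.yaml   # parked-manifest path is an assumption
kubectl -n kube-system get pod -l component=etcd -w
```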
etcdctl member list -w table \
--cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
--cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
--key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
The expected output is as follows:
+------------------+---------+-------------------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |       NAME        |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+-------------------+-------------------------+-------------------------+------------+
| 6ea47110c5a87c03 | started | controller-node-1 | https://10.5.14.31:2380 | https://10.5.14.31:2379 |      false |
| e222e199f1e318c4 | started | controller-node-2 | https://10.5.14.32:2380 | https://10.5.14.32:2379 |      false |
| f64eeda321aabe2d | started | controller-node-3 | https://10.5.14.33:2380 | https://10.5.14.33:2379 |      false |
+------------------+---------+-------------------+-------------------------+-------------------------+------------+
Check whether the three member nodes are healthy:
etcdctl endpoint status --endpoints=controller-node-1:2379,controller-node-2:2379,controller-node-3:2379 -w table \
--cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
--cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
--key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
The AI Computing Center also allows advanced parameters to be configured through the UI; add custom parameters in the last step of creating a cluster:
| Protocol | Port | Source | Destination | Description |
| --- | --- | --- | --- | --- |
| TCP | 2379-2380 | Servers | Servers | Required for HA with embedded etcd |
| TCP | 6443 | Agents | Servers | K3s supervisor and Kubernetes API Server |
| UDP | 8472 | All nodes | All nodes | Required only for Flannel VXLAN |
| TCP | 10250 | All nodes | All nodes | Kubelet metrics |
| UDP | 51820 | All nodes | All nodes | Required only for Flannel Wireguard with IPv4 |
| UDP | 51821 | All nodes | All nodes | Required only for Flannel Wireguard with IPv6 |
| TCP | 5001 | All nodes | All nodes | Required only for the embedded distributed registry (Spegel) |
| TCP | 6443 | All nodes | All nodes | Required only for the embedded distributed registry (Spegel) |
$ export K3S_VERSION=v1.30.3+k3s1
$ bash k3slcm
* Copying ./v1.30.3/k3s-airgap-images-amd64.tar.zst to 172.30.41.5
* Copying ./v1.30.3/k3s to 172.30.41.5
* Copying ./v1.30.3/k3s-install.sh to 172.30.41.5
* Copying ./v1.30.3/k3s-airgap-images-amd64.tar.zst to 172.30.41.6
* Copying ./v1.30.3/k3s to 172.30.41.6
* Copying ./v1.30.3/k3s-install.sh to 172.30.41.6
* Copying ./v1.30.3/k3s-airgap-images-amd64.tar.zst to 172.30.41.7
* Copying ./v1.30.3/k3s to 172.30.41.7
* Copying ./v1.30.3/k3s-install.sh to 172.30.41.7
* Installing on first server node [172.30.41.5]
[INFO] Skipping k3s download and verify
[INFO] Skipping installation of SELinux RPM
[INFO] Skipping /usr/local/bin/kubectl symlink to k3s, already exists
[INFO] Skipping /usr/local/bin/crictl symlink to k3s, already exists
[INFO] Skipping /usr/local/bin/ctr symlink to k3s, already exists
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] No change detected so skipping service start
* Installing on other server node [172.30.41.6]
......
Currently, the supported version range for self-built worker clusters is v1.26-v1.28; refer to the AI Computing Center cluster version support matrix.
$ cat /etc/fstab

# /etc/fstab
# Created by anaconda on Thu Mar 19 11:32:59 2020
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/centos-root /                       xfs     defaults        0 0
UUID=3ed01f0e-67a1-4083-943a-343b7fed1708 /boot xfs     defaults        0 0
/dev/mapper/centos-swap swap                    swap    defaults        0 0
This article applies only to adding heterogeneous nodes, in offline mode, to worker clusters created with the AI Computing Center platform; it does not cover integrated clusters.
An AI Computing Center full-mode environment has been deployed and the bootstrap node is still alive; for deployment, refer to Offline Installation of the AI Computing Center Commercial Edition.
A worker cluster with the AMD architecture and CentOS 7.9 as the operating system has already been created through the AI Computing Center platform; for creation, refer to Creating a Worker Cluster.
The ARM architecture with the Kylin v10 sp2 operating system is used as an example.
Make sure you are logged in to the bootstrap node, and that the clusterConfig.yaml file used when the AI Computing Center was deployed is still available.
kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath="{.spec.hostsConfRef}{'\n'}"
{"name":"mini-1-hosts-conf","namespace":"kubean-system"}

kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath="{.spec.varsConfRef}{'\n'}"
{"name":"mini-1-vars-conf","namespace":"kubean-system"}
Following this guide, users can deploy a Kubernetes version other than the ones recommended in the UI for clusters created with the AI Computing Center platform.
Users can upgrade the Kubernetes version of worker clusters created with the AI Computing Center platform by building an incremental offline package.
Log in to the AI Computing Center UI management interface, where you can continue with the following operations:
This article describes how to create a worker cluster, in offline mode, on an OS that is not declared as supported. For the range of operating systems declared as supported by the AI Computing Center, refer to Operating Systems Supported by the AI Computing Center.
The main workflow for creating a worker cluster on an undeclared OS in offline mode is shown in the figure below:
An AI Computing Center full-mode environment has been deployed; for deployment, refer to Offline Installation of the AI Computing Center Commercial Edition.
Find an online environment whose node architecture and OS are identical to those of the cluster to be created; this article uses AnolisOS 8.8 GA as an example. Run the following command to generate the offline os-pkgs package.
After the above command finishes, wait for the prompt All packages for node (X.X.X.X) have been installed, which indicates that the installation is complete.
The Request and Limit of LSE/LSR Pods must be equal, and the CPU value must be an integer multiple of 1000.
The CPUs allocated to an LSE Pod are fully exclusive and must not be shared. If the node uses a hyper-threaded architecture, isolation is only guaranteed at the logical-core level, but better isolation can be obtained with the CPUBindPolicyFullPCPUs policy.
The CPUs allocated to an LSR Pod can only be shared with BE Pods.
LS Pods are bound to the shared CPU pool outside the CPUs exclusively occupied by LSE/LSR Pods.
BE Pods are bound to all CPUs on the node except those exclusively occupied by LSE Pods.
The following example creates four Deployments with one replica each, with the QoS classes set to LSE, LSR, LS, and BE. After the Pods are created, observe the CPU allocation of each Pod.
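The four Deployment manifests are not included in this excerpt. A minimal sketch of the LSE one, assuming Koordinator's koordinator.sh/qosClass Pod label is used to declare the QoS class (the LSR/LS/BE variants would only change that label value), might look like this:

```bash
# A minimal sketch: an LSE Deployment whose Pods request exactly 2 whole CPUs (request == limit).
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-lse
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-lse
  template:
    metadata:
      labels:
        app: nginx-lse
        koordinator.sh/qosClass: LSE   # assumed QoS label; use LSR/LS/BE for the other Deployments
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: "2"
            memory: 256Mi
          limits:
            cpu: "2"      # request and limit must be equal, and the CPU value a whole number of cores
            memory: 256Mi
EOF
```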
[root@controller-node-1 ~]# ./get_cpuset.sh nginx-lse-56c9cd77f5-cdqbd
CPU set for Pod nginx-lse-56c9cd77f5-cdqbd (Burstable QoS): 0-1
A Pod with QoS type LSR is bound to CPU cores 2-3, which can be shared with BE Pods.
[root@controller-node-1 ~]# ./get_cpuset.sh nginx-lsr-c7fdb97d8-b58h8
CPU set for Pod nginx-lsr-c7fdb97d8-b58h8 (Burstable QoS): 2-3
A Pod with QoS type LS uses CPU cores 4-15 and is bound to the shared CPU pool outside the CPUs exclusively occupied by LSE/LSR Pods.
[root@controller-node-1 ~]# ./get_cpuset.sh nginx-ls-54746c8cf8-rh4b7
CPU set for Pod nginx-ls-54746c8cf8-rh4b7 (Burstable QoS): 4-15
A Pod with QoS type BE can use the CPUs other than those exclusively occupied by LSE Pods.
[root@controller-node-1 ~]# ./get_cpuset.sh nginx-be-577c946b89-js2qn
CPU set for Pod nginx-be-577c946b89-js2qn (BestEffort QoS): 2,4-12
Check whether the ratio between the Pod resource requests and limits of the workload complies with the overcommit ratio.
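One quick way to inspect this, assuming you know the workload's namespace and Pod name (both placeholders below), is to print each container's requests and limits with kubectl:

```bash
# A minimal sketch: print the CPU/memory requests and limits of every container in a Pod.
kubectl -n default get pod <pod-name> -o jsonpath='{range .spec.containers[*]}{.name}{"\tcpu req="}{.resources.requests.cpu}{"\tcpu limit="}{.resources.limits.cpu}{"\tmem req="}{.resources.requests.memory}{"\tmem limit="}{.resources.limits.memory}{"\n"}{end}'
```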
Clusters created by or integrated into the container management module of the AI computing platform can be accessed not only directly through the UI, but also controlled in two other ways:
The AI computing platform classifies clusters into roles based on their functional positioning, helping users manage their IT infrastructure better.
This cluster runs the AI computing platform components, such as container management, global management, observability, and the image registry. It generally does not carry business workloads.
In the AI computing platform, integrated clusters and self-built clusters use different version support mechanisms.
For example, if the version range supported by the community is 1.25, 1.26, and 1.27, then the version range for creating worker clusters through the UI in the AI computing platform is 1.24, 1.25, and 1.26, and a stable version such as 1.24.7 is recommended to users.
In addition, the version range for creating worker clusters through the UI stays closely in sync with the community: when the community releases a newer version, the version range for creating worker clusters through the UI also moves up by one version.

Kubernetes Version Support Range

Kubernetes community version range | Self-built worker cluster version range | Recommended self-built worker cluster version | AI computing platform installer | Release date
In the container management module of the AI computing platform, clusters are divided into four roles: global service cluster, management cluster, worker cluster, and integrated cluster. Integrated clusters can only be integrated from third-party vendors; see Integrating a Cluster.
This page describes how to create a worker cluster. By default, the OS type and CPU architecture of the worker nodes of a new worker cluster must match those of the global service cluster. To create a cluster with nodes whose OS or architecture differs from the global service cluster, refer to Creating an Ubuntu Worker Cluster on a CentOS Management Platform.
It is recommended to create clusters with the operating systems supported by the AI computing platform. If your local nodes are outside the supported range, refer to Creating a Cluster on a Non-Mainstream Operating System.
Prepare a certain number of nodes according to business needs, with consistent OS types and CPU architectures.
The recommended Kubernetes version is 1.29.5; for the specific version range, see the AI computing platform cluster version support matrix. The currently supported version range for self-built worker clusters is v1.28.0-v1.30.2. To create a cluster with a lower version, refer to the cluster version support range and Deploying and Upgrading Kubean for Backward-Compatible Versions.
The target host must allow IPv4 forwarding. If Pods and Services use IPv6, the target server must also allow IPv6 forwarding.
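For reference, IPv4 forwarding can be checked and enabled on a Linux host as follows (a persistent setting would additionally go into /etc/sysctl.conf or a drop-in file):

```bash
# Check the current value; 1 means forwarding is enabled
sysctl net.ipv4.ip_forward
# Enable it for the running system
sysctl -w net.ipv4.ip_forward=1
# Enable IPv6 forwarding as well if Pods/Services use IPv6
sysctl -w net.ipv6.conf.all.forwarding=1
```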
The AI computing platform does not currently provide firewall management, so you need to define the firewall rules of the target host in advance. To avoid problems during cluster creation, it is recommended to disable the firewall on the target host.
Service CIDR: the address range used by Service resources when containers within the same cluster access each other; it determines the upper limit on Service resources and cannot be modified after creation.
To completely delete an integrated cluster, you need to do so on the platform where the cluster was originally created; the AI computing platform does not support deleting integrated clusters.
In the AI computing platform, the difference between Uninstall Cluster and Remove Cluster is:

Step 3: Integrate the Cluster in the AI Computing Platform UI
CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Dec 14, 2024 07:26 UTC   204d                                    no
apiserver                  Dec 14, 2024 07:26 UTC   204d            ca                      no
apiserver-etcd-client      Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
apiserver-kubelet-client   Dec 14, 2024 07:26 UTC   204d            ca                      no
controller-manager.conf    Dec 14, 2024 07:26 UTC   204d                                    no
etcd-healthcheck-client    Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
etcd-peer                  Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
etcd-server                Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
front-proxy-client         Dec 14, 2024 07:26 UTC   204d            front-proxy-ca          no
scheduler.conf             Dec 14, 2024 07:26 UTC   204d                                    no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Dec 12, 2033 07:26 UTC   9y              no
etcd-ca                 Dec 12, 2033 07:26 UTC   9y              no
front-proxy-ca          Dec 12, 2033 07:26 UTC   9y              no
Static Pods are managed by the local kubelet rather than by the API server, so kubectl cannot be used to delete or restart them.
If a Pod's manifest file is not in the manifest directory, the kubelet will terminate it. After another fileCheckFrequency period you can move the file back, the kubelet will recreate the Pod, and the certificate update for the component is completed.
Kubernetes versions are expressed as x.y.z, where x is the major version, y is the minor version, and z is the patch version.
When the secret type is TLS (kubernetes.io/tls): fill in the certificate credential and the private key data. The certificate is a self-signed or CA-signed credential used for identity authentication; a certificate signing request is a request for that signature and must be signed with the private key.

Use a Secret as a Data Volume for a Pod

Operate via the Graphical Interface
This article introduces the unified operations and management capabilities of the AI computing container management platform for heterogeneous resources, represented by GPUs.
With the rapid development of emerging technologies such as AI applications, large models, artificial intelligence, and autonomous driving, enterprises face a growing number of compute-intensive tasks and data processing needs. Traditional computing architectures represented by the CPU can no longer satisfy these growing demands. Heterogeneous computing represented by the GPU, with its unique advantages in processing large-scale data, performing complex computation, and real-time graphics rendering, is widely adopted.
Meanwhile, the lack of experience and professional solutions for scheduling and managing heterogeneous resources has kept the utilization of GPU devices extremely low, imposing high AI production costs on enterprises. Cutting costs, improving efficiency, and raising the utilization of heterogeneous resources such as GPUs has become an urgent challenge for many enterprises.
The AI computing container management platform supports unified scheduling and operations management of heterogeneous resources such as GPUs and NPUs, fully unlocking GPU compute power and accelerating the development of enterprise AI and other emerging applications. Its GPU management capabilities are as follows:
The AI computing platform container management platform has been deployed and is running normally.
Physical card count (iluvatar.ai/vcuda-core): indicates how many physical cards the current Pod needs to mount; the value must be an integer and less than or equal to the number of cards on the host.
After the adjustment, check the GPU resources allocated in the Pod:
Through the steps above, you can dynamically adjust the compute power and GPU memory of a vGPU Pod without restarting it, meeting business needs more flexibly and improving resource utilization.
This page describes the matrix of GPUs and operating systems supported by the AI computing platform.
Spread: multiple Pods are spread across different GPU cards on the node, suitable for high-availability scenarios and avoiding single-card failures.
Binpack: multiple Pods prefer the same node, suitable for improving GPU utilization and reducing resource fragmentation.
Spread: multiple Pods are spread across different nodes, suitable for high-availability scenarios and avoiding single-node failures.
Physical card count (nvidia.com/vgpu): indicates how many physical cards the current Pod needs to mount; the value must be less than or equal to the number of cards on the host.
Physical card count (huawei.com/Ascend910): indicates how many physical cards the current Pod needs to mount; the value must be an integer and less than or equal to the number of cards on the host.
The AI computing platform container management platform has been deployed and is running normally.
GPU compute power (cambricon.com/mlu.smlu.vcore): indicates the percentage of cores the current Pod needs to use.
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  containers:
    - image: ubuntu:16.04
      name: pod1-ctr
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          cambricon.com/mlu: "1" # use this when device type is not enabled, else delete this line.
          #cambricon.com/mlu: "1" #uncomment to use when device type is enabled
          #cambricon.com/mlu.share: "1" #uncomment to use device with env-share mode
          #cambricon.com/mlu.mim-2m.8gb: "1" #uncomment to use device with mim mode
          #cambricon.com/mlu.smlu.vcore: "100" #uncomment to use device with mim mode
          #cambricon.com/mlu.smlu.vmemory: "1024" #uncomment to use device with mim mode
To expose "fully identical" MIG device types across all products, create identical GIs and CIs.
The AI computing platform container management platform has been deployed and is running normally.
This article uses CentOS 7.9 (3.10.0-1160) on the AMD architecture for the demonstration. To deploy with Red Hat 8.4, refer to Uploading the Red Hat GPU Operator Offline Image to the Bootstrap Node Registry and Building an Offline Yum Source for Red Hat 8.4.
When using a built-in operating system version there is no need to modify the image version; for other operating system versions, refer to Uploading Images to the Bootstrap Node Registry. Note that the operating system name (Ubuntu, CentOS, Red Hat, and so on) does not need to be appended after the version number; if the official image carries an operating system suffix, remove it manually.

Upload the Red Hat GPU Operator Offline Image to the Bootstrap Node Registry

This article uses the Red Hat 8.4 offline driver image nvcr.io/nvidia/driver:525.105.17-rhel8.4 as an example to describe how to upload an offline image to the bootstrap node registry.
ctr i pull nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i export --all-platforms driver.tar.gz nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04

ctr i import driver.tar.gz
ctr i tag nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 {bootstrap-node-registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i push {bootstrap-node-registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 --skip-verify=true
When the kernel version of a worker node is inconsistent with the kernel version or OS type of the control nodes of the global service cluster, you need to build an offline yum source manually.

Check the OS and Kernel Version of the Cluster Nodes

Run the following command on a control node of the global service cluster and on the node where the GPU Operator is to be deployed. If the OS and kernel versions of the two nodes are identical, there is no need to build a yum source and you can install the GPU Operator directly by following Offline Installation of the GPU Operator; if the OS or kernel versions differ, continue with the next step.
In the current directory on the node, run the following command to connect the node-local mc command-line tool to the MinIO server.
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123
The expected output is as follows:
Added `minio` successfully.
The mc command-line tool is the client command-line tool provided by the MinIO file server; for details, refer to MinIO Client.
The OS of the cluster nodes where the GPU Operator is to be deployed must be Red Hat 8.4, and the kernel versions must be exactly the same.
This article uses a Red Hat 8.4 node with kernel 4.18.0-305.el8.x86_64 as an example to describe how to build an offline yum source package for Red Hat 8.4 on any node of the global service cluster, and how to use it through the RepoConfig.ConfigMapName parameter when installing the GPU Operator.
In the current directory on the node, run the following command to connect the node-local mc command-line tool to the MinIO server.
mc config host add minio <file-server-address> <username> <password>
For example:
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123
The expected output is as follows:
Added `minio` successfully.
The mc command-line tool is the client command-line tool provided by the MinIO file server; for details, refer to MinIO Client.

Build an Offline Yum Source for Red Hat 7.9

Usage Scenarios

The AI computing platform ships with a preset GPU Operator offline package for CentOS 7.9 with kernel 3.10.0-1160. For other OS types or kernels, users need to build an offline yum source manually.
This article describes how to build an offline yum source package for Red Hat 7.9 on any node of the global service cluster, and how to use it through the RepoConfig.ConfigMapName parameter when installing the GPU Operator.
The OS of the cluster nodes where the GPU Operator is to be deployed must be Red Hat 7.9, and the kernel versions must be exactly the same.

2. Download the Offline Driver Image for Red Hat 7.9 OS

3. Upload the Red Hat GPU Operator Offline Image to the Bootstrap Node Registry

Refer to Uploading the Red Hat GPU Operator Offline Image to the Bootstrap Node Registry.
SM: Streaming Multiprocessor, the core compute unit of a GPU, responsible for executing graphics rendering and general-purpose computing tasks. Each SM contains a group of CUDA cores as well as shared memory, register files, and other resources, and can execute multiple threads concurrently. Each MIG instance owns a certain number of SMs and other related resources, together with the GPU memory partitioned for it.
GPU SM Slice: a GPU SM slice is the smallest compute unit of the SMs on a GPU. When configured in MIG mode, a GPU SM slice is roughly one seventh of the total number of SMs available in the GPU.
Compute Instance: the compute slices of a GPU instance can be further subdivided into multiple Compute Instances (CIs). The CIs share the engines and memory of their parent GI, but each CI has dedicated SM resources.
The compute slices of a GPU Instance (GI) can be further subdivided into multiple Compute Instances (CIs). The CIs share the engines and memory of their parent GI, but each CI has dedicated SM resources. Using the same 4g.20gb example as above, a CI can be created to use only the first compute slice with the 1c.4g.20gb compute configuration, as shown in the blue part of the figure below:
The AI computing platform container management platform has been deployed and is running normally.
Physical card count (nvidia.com/vgpu): indicates how many physical cards the current Pod needs to mount; the value must be an integer and less than or equal to the number of cards on the host.
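Besides the UI form, the same resource can be requested directly in a workload manifest. A minimal sketch (the Pod name and image are illustrative, and additional vGPU compute/memory resources are omitted) might look like this:

```bash
# A minimal sketch: a Pod requesting one physical NVIDIA card via the nvidia.com/vgpu resource.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/vgpu: 1   # must be an integer no greater than the number of cards on the host
EOF
```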
A NUMA node is a basic building block of the Non-Uniform Memory Access (NUMA) architecture; a host node is a collection of multiple NUMA nodes. Memory access across NUMA nodes incurs extra latency, so developers can improve memory access efficiency and overall performance by optimizing task scheduling and memory allocation strategies.
Typical scenarios for NUMA affinity scheduling are compute-intensive jobs that are sensitive to CPU parameters or scheduling latency, such as scientific computing, video decoding, animation rendering, and offline big data processing.
For the NUMA placement policies that can be used when scheduling Pods and the scheduling behavior of each policy, see the Pod Scheduling Behavior description.
single-numa-node: when the Pod is scheduled, a node is selected from the node pools whose topology manager policy is set to single-numa-node, and the CPUs must be placed on the same NUMA node; if no node in the pool meets the conditions, the Pod cannot be scheduled.
restricted: when the Pod is scheduled, a node is selected from the node pools whose topology manager policy is set to restricted, and the CPUs must be placed on the same set of NUMA nodes; if no node in the pool meets the conditions, the Pod cannot be scheduled.
best-effort: when the Pod is scheduled, a node is selected from the node pools whose topology manager policy is set to best-effort, and the CPUs are placed on the same NUMA node as far as possible; if no node can satisfy this, the most suitable node is chosen for placement.
When a Pod sets a topology policy, Volcano predicts the list of matching nodes according to that policy. The scheduling process is as follows:
1. Based on the Volcano topology policy set on the Pod, filter the nodes that are configured with the same policy.
2. Among the nodes configured with the same policy, further filter the nodes whose CPU topology satisfies the policy and schedule onto them.
| Topology policy configurable on the Pod | 1. Filter schedulable nodes by the topology policy set on the Pod | 2. Further filter nodes whose CPU topology satisfies the policy |
| --- | --- | --- |
| none | No filtering is applied for nodes configured with any of the following policies. none: schedulable; best-effort: schedulable; restricted: schedulable; single-numa-node: schedulable | - |
| best-effort | Filter nodes whose topology policy is also "best-effort". none: not schedulable; best-effort: schedulable; restricted: not schedulable; single-numa-node: not schedulable | Schedule while satisfying the policy as far as possible: prefer a single NUMA node; if a single NUMA node cannot satisfy the requested CPUs, scheduling across multiple NUMA nodes is allowed. |
| restricted | Filter nodes whose topology policy is also "restricted". none: not schedulable; best-effort: not schedulable; restricted: schedulable; single-numa-node: not schedulable | Strictly restricted scheduling: when the CPU capacity of a single NUMA node is greater than or equal to the requested CPUs, scheduling is only allowed onto a single NUMA node; if the remaining available CPUs on that NUMA node are insufficient, the Pod cannot be scheduled. When the CPU capacity of a single NUMA node is less than the requested CPUs, scheduling across multiple NUMA nodes is allowed. |
| single-numa-node | Filter nodes whose topology policy is also "single-numa-node". none: not schedulable; best-effort: not schedulable; restricted: not schedulable; single-numa-node: schedulable | Scheduling is only allowed onto a single NUMA node. |

Configure a NUMA Affinity Scheduling Policy
Suppose the NUMA nodes are as follows:

| Worker node | Topology manager policy on the node | Allocatable CPUs on NUMA node 0 | Allocatable CPUs on NUMA node 1 |
| --- | --- | --- | --- |
| node-1 | single-numa-node | 16U | 16U |
| node-2 | best-effort | 16U | 16U |
| node-3 | best-effort | 20U | 20U |
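The Pod specifications used in the two examples below are not included in this excerpt. A minimal sketch for example one (2 CPUs, single-numa-node policy), assuming Volcano's volcano.sh/numa-topology-policy Pod annotation declares the policy and that Guaranteed QoS is required so the CPUs can be pinned, might look like this:

```bash
# A minimal sketch (name and image are illustrative): ask Volcano to place the Pod's 2 CPUs on a single NUMA node.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: numa-demo
  annotations:
    volcano.sh/numa-topology-policy: single-numa-node   # assumed annotation key for the topology policy
spec:
  schedulerName: volcano
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "2"
        memory: 512Mi
      limits:
        cpu: "2"        # equal requests and limits (Guaranteed QoS) so the CPUs can be pinned
        memory: 512Mi
EOF
```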
In example one, the Pod requests 2U of CPU and its topology policy is set to "single-numa-node", so it will be scheduled to node-1, which uses the same policy.
In example two, the Pod requests 20U of CPU and its topology policy is set to "best-effort"; it will be scheduled to node-3, because node-3 can allocate the Pod's requested CPUs on a single NUMA node, whereas node-2 would need two NUMA nodes to do so.

View the CPU Overview of the Current Node
You can view the CPU overview of the current node with the lscpu command:

View the CPU Allocation of the Current Node

Then check the NUMA node usage:
# View the CPU allocation of the current node
cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0,10-15,25-31","entries":{"777870b5-c64f-42f5-9296-688b9dc212ba":{"container-1":"16-24"},"fb15e10a-b6a5-4aaa-8fcd-76c1aa64e6fd":{"container-1":"1-9"}},"checksum":318470969}
The example above shows that two containers are running on the node: one occupies cores 1-9 of NUMA node0, and the other occupies cores 16-24 of NUMA node1.

Use Volcano's Gang Scheduler

The Gang scheduling policy is one of the core scheduling algorithms of volcano-scheduler. It satisfies the "All or nothing" requirement during scheduling and prevents arbitrary scheduling of Pods from wasting cluster resources. Concretely, the algorithm watches whether the number of already-scheduled Pods under a Job has reached the minimum run count; only when the Job's minimum run count is satisfied are all Pods under the Job scheduled, otherwise none are.
The Gang scheduling algorithm, built on the concept of a Pod group, fits scenarios that need multi-process cooperation very well. AI scenarios often involve complex workflows, such as data ingestion, data analysis, data splitting, training, serving, and logging, which require a group of containers to work together and are therefore well suited to the Pod-group-based Gang scheduling policy. Multi-threaded parallel computing and communication scenarios under the MPI framework, where master and worker processes must cooperate, are also a natural fit for Gang scheduling. Containers within a Pod group are highly correlated and may contend for resources; scheduling and allocating them as a whole effectively avoids deadlocks.
When Binpack scores a node, it combines the weight of the Binpack plugin itself with the weights configured for each resource. First, each resource requested by the Pod is scored in turn; taking CPU as an example, the CPU score of a candidate node is calculated as:
CPU.weight * (request + used) / allocatable
That is, the higher the CPU weight, the higher the score; and the more fully the node's resources are used, the higher the score. Memory, GPU, and other resources follow the same principle. Where:
CPU.weight is the CPU weight configured by the user
request is the amount of CPU requested by the current Pod
used is the amount of CPU already allocated and in use on the current node
allocatable is the total allocatable CPU of the current node
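As a quick worked example with made-up numbers: if CPU.weight is 1, the Pod requests 2 cores, 6 cores are already in use, and the node has 16 allocatable cores, the CPU score of that node is 1 * (2 + 6) / 16 = 0.5. A node that is already half full therefore scores higher than an empty one, which is exactly the packing behavior Binpack aims for.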
The priority is determined by the Value in the configured PriorityClass; the larger the value, the higher the priority. It is enabled by default and does not need to be modified. You can confirm or modify it with the following command.
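The command itself is not included in this excerpt; listing and editing the PriorityClass objects with kubectl is one way to check or adjust the values:

```bash
# List the PriorityClass objects and their values
kubectl get priorityclass
# Inspect or adjust the value of a specific PriorityClass (name is a placeholder)
kubectl edit priorityclass <priority-class-name>
```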
Check the Pod status with kubectl get pod; because the cluster resources are insufficient, the Pod is in the Pending state:
In addition, Volcano integrates seamlessly with mainstream computing frameworks such as Spark, TensorFlow, and PyTorch, and supports mixed scheduling of heterogeneous devices such as CPUs and GPUs, providing comprehensive optimization support for AI computing tasks.
Next, we will describe how to install and use Volcano so that you can take full advantage of its scheduling policies to optimize AI computing tasks.
I1222 15:01:47.119777    8743 sync.go:45] Using config file: "examples/sync-dao-2048.yaml"
W1222 15:01:47.234238    8743 syncer.go:263] Ignoring skipDependencies option as dependency sync is not supported if container image relocation is true or syncing from/to intermediate directory
I1222 15:01:47.234685    8743 sync.go:58] There is 1 chart out of sync!
I1222 15:01:47.234706    8743 sync.go:66] Syncing "dao-2048_1.4.1" chart...
.relok8s-images.yaml hints file found
Computing relocation...

Relocating dao-2048@1.4.1...
Pushing 10.5.14.40/daocloud/dao-2048:v1.4.1...
Done
Done moving /var/folders/vm/08vw0t3j68z9z_4lcqyhg8nm0000gn/T/charts-syncer869598676/dao-2048-1.4.1.tgz
The cluster inspection feature provided by the container management module of the Suanova AI Computing Platform supports custom inspection items across three dimensions: cluster, node, and pod. After an inspection finishes, a visual inspection report is generated automatically.
Pod dimension: checks the CPU and memory usage of Pods, their running status, the status of PVs and PVCs, and so on.
To learn about or perform security-related inspections, refer to the security scan types supported by the Suanova AI Computing Platform.
The container management module of the Suanova AI Computing Platform provides a cluster inspection feature that supports inspections at the cluster, node, and pod dimensions.
Pod dimension: checks the CPU and memory usage of Pods, their running status, the status of PVs and PVCs, and so on.
Applications and services that were already running on a node before it is set as a dedicated node are not affected and continue to run normally on that node; only when these Pods are deleted or rebuilt will they be scheduled onto other, non-dedicated nodes.
Because platform base components such as kpanda, ghippo, and insight run on the global service cluster, enabling namespace-dedicated nodes on the Global cluster may prevent system components from being scheduled back onto the dedicated nodes after they restart, affecting the overall high availability of the system. Therefore, in general we do not recommend enabling the namespace-dedicated-node feature on the global service cluster.
After dedication is cancelled, Pods from other namespaces can also be scheduled onto the node.
A pod security policy means that, in a Kubernetes cluster, Pod behavior can be controlled across all aspects of security by configuring different levels and modes for specified namespaces; only Pods that meet the required conditions are accepted by the system. It defines three levels and three modes, and users can choose the combination that best fits their needs to set the restriction policy.
Note
Each security mode can be configured with only one security policy. Also, be cautious when configuring the enforce security mode for a namespace: violations will prevent Pods from being created.
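A minimal sketch of what such a configuration can look like when implemented with Kubernetes Pod Security Admission namespace labels (the namespace name and the chosen levels are examples only):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo-ns                                   # example namespace
  labels:
    # enforce rejects violating Pods, so apply it with caution
    pod-security.kubernetes.io/enforce: baseline
    # warn only reports violations and is safer to roll out first
    pod-security.kubernetes.io/warn: restricted
```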
An Ingress instance has been created, an application workload has been deployed, and the corresponding Service has been created.
In a Kubernetes cluster, each Pod has its own internal IP address, but Pods in a workload may be created and deleted at any time, so using Pod IP addresses directly cannot provide a stable service to the outside.
This is why you create a Service: through a Service you obtain a fixed IP address, decoupling the frontend and backend of a workload and allowing external users to access the service. A Service also provides load balancing (LoadBalancer), enabling users to access the workload from the public network.
Select Intra-Cluster Access (ClusterIP), which exposes the Service through the cluster's internal IP; a Service of this type can only be accessed within the cluster. This is the default Service type. Configure the parameters according to the table below.
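For reference, an equivalent ClusterIP Service written directly as YAML might look like the sketch below; all names and ports are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-svc             # example name
  namespace: default
spec:
  type: ClusterIP            # default type; reachable only inside the cluster
  selector:
    app: demo                # must match the labels of the workload's Pods
  ports:
    - name: http
      port: 80               # Service port exposed on the cluster-internal IP
      targetPort: 8080       # container port that receives the traffic
```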
Policy configuration is divided into ingress policies and egress policies. For a source Pod to connect successfully to a target Pod, both the source Pod's egress policy and the target Pod's ingress policy must allow the connection. If either side does not allow it, the connection fails.
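A minimal sketch of an ingress policy that allows one set of Pods to reach another (all labels, names, and the port are placeholders); note that any egress policy applied to the source Pods must also permit this traffic:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend   # example name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend                  # target Pods (ingress side)
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend         # source Pods allowed to connect
      ports:
        - protocol: TCP
          port: 8080
```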
Run the ls command to check whether the key has been created successfully on the management cluster. The correct output is as follows:
| Check Item | Description |
| --- | --- |
| Operating system | See the supported architectures and operating systems below |
| SELinux | Disabled |
| Firewall | Disabled |
| Architecture consistency | Nodes use a consistent CPU architecture (for example, all ARM or all x86) |
| Host time | The synchronization error between all hosts is less than 10 seconds |
| Network connectivity | The node and its SSH port can be reached normally by the platform |
| CPU | More than 4 cores of CPU available |
| Memory | More than 8 GB of memory available |

Supported Architectures and Operating Systems

| Architecture | Operating System | Notes |
| --- | --- | --- |
| ARM | Kylin Linux Advanced Server release V10 (Sword) SP2 | Recommended |
| ARM | UOS Linux | |
| ARM | openEuler | |
| x86 | CentOS 7.x | Recommended |
| x86 | Redhat 7.x | Recommended |
| x86 | Redhat 8.x | Recommended |
| x86 | Flatcar Container Linux by Kinvolk | |
| x86 | Debian Bullseye, Buster, Jessie, Stretch | |
| x86 | Ubuntu 16.04, 18.04, 20.04, 22.04 | |
| x86 | Fedora 35, 36 | |
| x86 | Fedora CoreOS | |
| x86 | openSUSE Leap 15.x/Tumbleweed | |
| x86 | Oracle Linux 7, 8, 9 | |
| x86 | Alma Linux 8, 9 | |
| x86 | Rocky Linux 8, 9 | |
| x86 | Amazon Linux 2 | |
| x86 | Kylin Linux Advanced Server release V10 (Sword) - SP2 | Hygon |
| x86 | UOS Linux | |
| x86 | openEuler | |

Node Details
Nodes can be cordoned (scheduling paused) or uncordoned (scheduling resumed). Pausing scheduling means Pods are no longer scheduled onto the node; resuming scheduling means Pods can again be scheduled onto the node.
A taint allows a node to repel a certain class of Pods and prevents them from being scheduled onto that node. One or more taints can be applied to each node, and Pods that cannot tolerate those taints will not be scheduled onto the node.
After a taint is added to a node, only Pods that can tolerate the taint can be scheduled onto it.
NoSchedule: new Pods will not be scheduled onto a node with this taint unless they have a matching toleration. Pods currently running on the node are not evicted.
If a Pod cannot tolerate this taint, it is evicted immediately.
If a Pod can tolerate the taint but its toleration does not specify tolerationSeconds, the Pod keeps running on the node indefinitely.
If a Pod can tolerate the taint and specifies tolerationSeconds, it can continue running on the node for the specified duration, after which it is evicted.
PreferNoSchedule: this is a "soft" version of NoSchedule. The control plane will **try** to avoid scheduling Pods that do not tolerate this taint onto the node, but cannot guarantee it. Therefore, try to avoid using this taint where possible.
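A minimal sketch of adding and removing a taint with kubectl, and of a matching toleration in a Pod spec (the node name, key, and value are examples only):

```bash
# Add a NoSchedule taint to a node
kubectl taint nodes node-1 dedicated=gpu:NoSchedule

# Remove the same taint (note the trailing "-")
kubectl taint nodes node-1 dedicated=gpu:NoSchedule-
```

```yaml
# Toleration in the Pod spec that matches the taint above;
# for NoExecute taints, tolerationSeconds can additionally be set
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```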
The current cluster has been integrated into container management, and the kolm component has been installed on the global service cluster (search for kolm in the Helm templates).
You only need to add permission points in the Global Cluster; the Kpanda controller synchronizes the permission points added in the Global Cluster to all integrated sub-clusters. Synchronization takes some time to complete.
Permission points can only be added in the Global Cluster; permission points added in a sub-cluster will be overwritten by the built-in role permission points of the Global Cluster.
Only ClusterRoles with the fixed labels can be used to append permissions; replacing or deleting permissions is not supported, nor can a Role be used to append permissions. The mapping between built-in roles and the labels of user-created ClusterRoles is as follows.
The Suanova AI Computing Platform supports metric-based horizontal pod autoscaling (HPA) of Pod resources. Users can dynamically adjust the number of Pod replicas by setting CPU utilization, memory usage, and custom metrics. For example, after setting an elastic scaling policy based on the CPU utilization metric for a workload, when the Pod's CPU utilization rises above or falls below the threshold you set, the workload controller automatically increases or decreases the number of Pod replicas.
If you create an HPA policy based on CPU utilization, you must set resource limits (Limit) for the workload in advance; otherwise, CPU utilization cannot be calculated.
The system has CPU and memory built in as scaling metrics to cover users' basic business scenarios.
Target CPU utilization: the CPU utilization of the Pods under the workload resource, calculated as the CPU usage of all Pods under the workload divided by the workload's requested (request) value. When the actual CPU usage is above/below the target value, the system automatically increases/decreases the number of Pod replicas.
Target memory usage: the memory usage of the Pods under the workload resource. When the actual memory usage is above/below the target value, the system automatically increases/decreases the number of Pod replicas.
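A minimal HPA sketch for the CPU-utilization case described above (all names and thresholds are examples):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-hpa                   # example name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-deployment          # workload to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # target CPU utilization (%)
```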
The Vertical Pod Autoscaler (VPA) monitors a Pod's resource requests and usage over a period of time and calculates the most suitable CPU and memory request values for that Pod. Using VPA allows resources to be allocated more reasonably to every Pod in the cluster, improving overall resource utilization and avoiding waste.
The Suanova AI Computing Platform supports the Vertical Pod Autoscaler (VPA), which dynamically adjusts Pod request values based on container resource usage. The platform supports modifying resource request values either manually or automatically; you can configure this according to your actual needs.
This document describes how to configure vertical Pod autoscaling for a workload.
Warning
Using VPA to modify a Pod's resource requests triggers a Pod restart. Due to limitations in Kubernetes itself, the Pod may be scheduled onto a different node after restarting.
Scaling mode: how the CPU and memory request values are modified. Vertical scaling currently supports manual and automatic modes.
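For the automatic mode, a minimal sketch assuming the Vertical Pod Autoscaler CRDs are installed in the cluster (all names are examples):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: demo-vpa                  # example name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-deployment         # workload whose requests are adjusted
  updatePolicy:
    updateMode: "Auto"            # "Off" only recommends; "Auto" applies changes and restarts Pods
```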
Resource type: the type of custom metric being monitored, either Pod or Service.
Data type: the method used to calculate the metric value, either target value or target average value. When the resource type is Pod, only the target average value is supported.
Total number of pending requests + number of requests that can be accepted beyond the target concurrency > target concurrency per Pod × number of Pods
Run the command below to test, and use kubectl get pods -A -w to watch the Pods being scaled out.
API security: whether insecure API versions are enabled, whether appropriate RBAC roles and permission restrictions are set, and so on.
A PersistentVolume (PV) is a piece of storage in the cluster that can be provisioned in advance by an administrator or provisioned dynamically through a StorageClass. A PV is a cluster resource, but it has an independent lifecycle and is not deleted when the Pod process ends. Mounting a PV to a workload provides data persistence for the workload. A PV holds the data directories that the containers in a Pod can access.
HostPath: uses a file or directory on the node's filesystem as the volume; Pod scheduling based on node affinity is not supported.
ReadWriteOncePod: the volume can be mounted read-write by a single Pod.
Reclaim policy:
Retain: the PV is not deleted; its status merely changes to released, and it must be reclaimed manually by the user. For how to reclaim it manually, refer to Persistent Volumes.
Filesystem: the volume is mounted by the Pod into a directory. If the volume's storage is backed by a block device and that device is currently empty, a filesystem is created on the device before the volume is mounted for the first time.
Block: the volume is used as a raw block device. Such a volume is handed to the Pod as a block device with no filesystem on it, which allows the Pod to access the volume faster.
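A minimal PV sketch combining the options above (Retain reclaim policy, Filesystem volume mode); the hostPath backend and all names are examples only:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv                           # example name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem                  # or Block for a raw block device
  persistentVolumeReclaimPolicy: Retain   # released PVs must be reclaimed manually
  hostPath:
    path: /data/demo-pv                   # example backing path; hostPath is for testing only
```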
The container management module of the Suanova AI Computing Platform supports sharing a storage pool with multiple namespaces to improve resource utilization.
Configure the container parameters within the Pod, add environment variables to the Pod, or pass in configuration, etc. For details, refer to Container Environment Variable Configuration.
Timeout: when this time is exceeded, the job is marked as failed and all Pods under the job are deleted. Leave it empty to set no timeout. The default value is 360 s.
A DaemonSet uses node affinity and taints to ensure that a replica of a Pod runs on all (or some) nodes. For nodes newly added to the cluster, the DaemonSet automatically deploys the corresponding Pod on the new node and tracks the Pod's running status. When a node is removed, the DaemonSet deletes all Pods it created.
Configure the container parameters within the Pod, add environment variables to the Pod, or pass in configuration, etc. For details, refer to Container Environment Variable Configuration.
In some scenarios applications generate redundant DNS queries. Kubernetes provides DNS-related configuration options for applications, which can effectively reduce redundant DNS queries in those scenarios and increase business throughput.
DNS policy
Default: the container uses the domain name resolution file pointed to by the kubelet's --resolv-conf parameter. This configuration can only resolve external domain names registered on the internet and cannot resolve cluster-internal domain names, but it produces no invalid DNS queries.
Maximum unavailable Pods: the maximum number or percentage of Pods that may be unavailable during a workload update; the default is 25%. If this equals the number of replicas, there is a risk of service interruption.
Maximum surge: the maximum number or percentage by which the total number of Pods may exceed the desired number of replicas during a Pod update. The default is 25%.
Minimum ready time: the minimum time a Pod must be ready before it is considered available; only after this time has elapsed is the Pod considered available. The default is 0 seconds.
Node affinity: constrains which nodes a Pod can be scheduled onto, based on labels on the nodes.
Workload affinity: constrains which nodes a Pod can be scheduled onto, based on the labels of Pods already running on those nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto, based on the labels of Pods already running on those nodes.
A stateless workload (Deployment) is a common resource in Kubernetes. It mainly provides declarative updates for Pods and ReplicaSets, and supports elastic scaling, rolling upgrades, and version rollbacks. In a Deployment you declare the desired Pod state, and the Deployment Controller modifies the current state through ReplicaSets until it reaches the declared desired state. A Deployment is stateless and does not support data persistence; it is suitable for deploying stateless applications that do not need to save data and can be restarted or rolled back at any time.
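A minimal Deployment sketch (the names and image are examples):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-deployment           # example name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: web
          image: nginx:1.25       # example image
          ports:
            - containerPort: 80
```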
Through the container management module of the Suanova AI Computing Platform, you can easily manage workloads across multiple clouds and clusters based on the corresponding role permissions, including full lifecycle management of stateless workloads: creation, update, deletion, elastic scaling, restart, version rollback, and more.
Configure the container parameters within the Pod, add environment variables to the Pod, or pass in configuration, etc. For details, refer to Container Environment Variable Configuration.
DNS configuration: in some scenarios applications generate redundant DNS queries. Kubernetes provides DNS-related configuration options for applications, which can effectively reduce redundant DNS queries in those scenarios and increase business throughput.
DNS policy
Default: the container uses the domain name resolution file pointed to by the kubelet's --resolv-conf parameter. This configuration can only resolve external domain names registered on the internet and cannot resolve cluster-internal domain names, but it produces no invalid DNS queries.
Maximum unavailable: the maximum number or percentage of Pods that may be unavailable during a workload update; the default is 25%. If this equals the number of replicas, there is a risk of service interruption.
Maximum surge: the maximum number or percentage by which the total number of Pods may exceed the desired number of replicas during a Pod update. The default is 25%.
Minimum ready time: the minimum time a Pod must be ready before it is considered available; only after this time has elapsed is the Pod considered available. The default is 0 seconds.
Node affinity: constrains which nodes a Pod can be scheduled onto, based on labels on the nodes.
Workload affinity: constrains which nodes a Pod can be scheduled onto, based on the labels of Pods already running on those nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto, based on the labels of Pods already running on those nodes.
Replicas: enter the number of Pod instances for the workload. By default, 1 Pod instance is created.
Configure the container parameters within the Pod, add environment variables to the Pod, or pass in configuration, etc. For details, refer to Container Environment Variable Configuration.
Parallelism: the maximum number of Pods allowed to be created simultaneously while the job runs; parallelism should not exceed the total number of Pods. The default is 1.
Timeout: when this time is exceeded, the job is marked as failed and all Pods under the job are deleted. Leave it empty to set no timeout.
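A minimal Job sketch combining the parallelism and timeout settings above (names, image, and counts are examples):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: demo-job                   # example name
spec:
  parallelism: 2                   # Pods allowed to run at the same time
  completions: 4                   # total Pods that must complete successfully
  activeDeadlineSeconds: 360       # job is marked failed after this timeout
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36      # example image
          command: ["sh", "-c", "echo processing && sleep 10"]
```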
A stateful workload (StatefulSet) is a common resource in Kubernetes. Like a stateless workload (Deployment), it is mainly used to manage the deployment and scaling of a set of Pods. The main difference between the two is that a Deployment is stateless and does not save data, while a StatefulSet is stateful and is mainly used to manage stateful applications. In addition, Pods in a StatefulSet have permanent, unchanging IDs, which makes it easy to identify the corresponding Pod when matching storage volumes.
Through the container management module of the Suanova AI Computing Platform, you can easily manage workloads across multiple clouds and clusters based on the corresponding role permissions, including full lifecycle management of stateful workloads: creation, update, deletion, elastic scaling, restart, version rollback, and more.
Configure the container parameters within the Pod, add environment variables to the Pod, or pass in configuration, etc. For details, refer to Container Environment Variable Configuration.
DNS configuration: in some scenarios applications generate redundant DNS queries. Kubernetes provides DNS-related configuration options for applications, which can effectively reduce redundant DNS queries in those scenarios and increase business throughput.
DNS policy
Default: the container uses the domain name resolution file pointed to by the kubelet's --resolv-conf parameter. This configuration can only resolve external domain names registered on the internet and cannot resolve cluster-internal domain names, but it produces no invalid DNS queries.
Kubernetes v1.7 and later can set the Pod management policy through .spec.podManagementPolicy, which supports the following two modes:
OrderedReady: the default Pod management policy, meaning Pods are deployed in order; the stateful workload only starts deploying the next Pod after the previous one has been deployed successfully. When deleting Pods, the reverse order is used: the most recently created Pod is deleted first.
Parallel: containers are created or deleted in parallel, just like Pods of the Deployment type. The StatefulSet controller launches or terminates all containers in parallel, without waiting for a Pod to become Running and Ready, or to stop completely, before launching or terminating other Pods. This option only affects scaling behavior; it does not affect the order during updates.
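A minimal sketch showing where podManagementPolicy sits in a StatefulSet spec (the names and image are examples; the headless Service is assumed to exist):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-statefulset          # example name
spec:
  serviceName: demo-headless      # headless Service assumed to exist
  replicas: 3
  podManagementPolicy: Parallel   # default is OrderedReady
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: app
          image: nginx:1.25       # example image
```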
Node affinity: constrains which nodes a Pod can be scheduled onto, based on labels on the nodes.
Workload affinity: constrains which nodes a Pod can be scheduled onto, based on the labels of Pods already running on those nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto, based on the labels of Pods already running on those nodes.
An environment variable is a variable set in the container's runtime environment, used to add environment flags to a Pod or pass in configuration; environment variables can be configured for a Pod as key-value pairs.
On top of native Kubernetes, the container management module of the Suanova AI Computing Platform adds a graphical interface for configuring environment variables for Pods, supporting the following configuration methods:
Variable/Variable Reference (Pod Field): uses a Pod field as the value of the environment variable, such as the Pod's name.
A readiness probe (ReadinessProbe) detects when a container is ready to accept request traffic; a Pod is considered ready only when all containers within it are ready. One use of this signal is to control which Pods serve as backends for a Service. If a Pod is not yet ready, it is removed from the Service's load balancers.
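A minimal readiness probe sketch inside a container spec (the path, port, and timings are examples):

```yaml
containers:
  - name: web
    image: nginx:1.25             # example image
    readinessProbe:
      httpGet:
        path: /healthz            # example health endpoint
        port: 80
      initialDelaySeconds: 5      # wait before the first probe
      periodSeconds: 10           # probe interval
      failureThreshold: 3         # mark not-ready after 3 consecutive failures
```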
A Pod follows a predefined lifecycle, starting in the Pending phase and entering the Running state if at least one of its containers starts normally. If any container in the Pod ends in a failed state, the status becomes Failed. The following phase field values indicate which stage of the lifecycle a Pod is in.

| Value | Description |
| --- | --- |
| Pending | The Pod has been accepted by the system, but one or more containers have not yet been created or run. This phase includes the time spent waiting for the Pod to be scheduled and the time spent downloading images over the network. |
| Running | The Pod has been bound to a node, and all containers in the Pod have been created. At least one container is still running, or is starting or restarting. |
| Succeeded | All containers in the Pod have terminated successfully and will not be restarted. |
| Failed | All containers in the Pod have terminated, and at least one container terminated due to failure; that is, a container exited with a non-zero status or was terminated by the system. |
| Unknown | The state of the Pod cannot be obtained for some reason, usually because communication with the host where the Pod is located failed. |

When creating a workload in the container management module of the Suanova AI Computing Platform, an image is usually used to specify the runtime environment in the container. By default, when building an image, the Entrypoint and CMD fields define the command and arguments executed when the container runs. If you need to change the commands and arguments run before the container image starts, after it starts, or before it stops, you can override the image's default commands and arguments by setting the container's lifecycle event commands and arguments.
The Suanova AI Computing Platform provides two handler types, command-line script and HTTP request, for configuring the post-start command. You can choose the configuration method that suits you based on the table below.
The Suanova AI Computing Platform provides two handler types, command-line script and HTTP request, for configuring the pre-stop command. You can choose the configuration method that suits you based on the table below.
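A minimal sketch of the two handler types applied to postStart and preStop in a container spec (the image, command, and endpoint are examples):

```yaml
containers:
  - name: web
    image: nginx:1.25                       # example image
    lifecycle:
      postStart:
        exec:                               # command-line script handler
          command: ["/bin/sh", "-c", "echo started > /tmp/started"]
      preStop:
        httpGet:                            # HTTP request handler
          path: /shutdown                   # example endpoint
          port: 80
```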
In a Kubernetes cluster, nodes also have labels. You can add labels manually, and Kubernetes also adds some standard labels to all nodes in the cluster. See Well-Known Labels, Annotations and Taints for common node labels. By adding labels to nodes, you can have Pods scheduled onto specific nodes or node groups. You can use this feature to ensure that certain Pods only run on nodes with specific isolation, security, or regulatory properties.
nodeSelector is the simplest recommended form of node selection constraint. You can add the nodeSelector field to a Pod's spec and set the node labels that you want the target node to have. Kubernetes will only schedule the Pod onto nodes that have every one of the specified labels. nodeSelector provides the simplest way to constrain Pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define. Some of the benefits of affinity and anti-affinity are:
You can mark a rule as a "soft requirement" or "preference", so that when the scheduler cannot find a matching node, it ignores the affinity/anti-affinity rule and still schedules the Pod successfully.
You can enforce scheduling constraints using the labels of other Pods running on a node (or in another topology domain), rather than only the labels of the node itself. This capability lets you define rules for which Pods may be placed together.
You can choose the nodes on which Pods are deployed by setting affinity and anti-affinity.
Workload affinity is mainly used to decide which Pods the workload's Pods may be deployed with in the same topology domain. For example, services that communicate with each other can be deployed into the same topology domain (such as the same availability zone) through application affinity scheduling, reducing the network latency between them.
Workload anti-affinity is mainly used to decide which Pods the workload's Pods may not be deployed with in the same topology domain. For example, spreading identical Pods of one workload across different topology domains (such as different hosts) improves the stability of the workload itself.
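A minimal sketch combining node affinity with workload (Pod) anti-affinity in a Pod spec (the labels and topology key are examples):

```yaml
affinity:
  nodeAffinity:                                  # schedule only onto nodes with a given label
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: disktype
              operator: In
              values: ["ssd"]                    # example node label
  podAntiAffinity:                               # spread identical Pods across hosts
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: demo                            # example Pod label
        topologyKey: kubernetes.io/hostname
```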
A Pod is the smallest unit of computing that can be created and managed in Kubernetes; it is a collection of containers. These containers share storage and networking, as well as a specification of the policies that govern how the containers run. Pods are usually not created directly by users, but through workload resources. A Pod follows a predefined lifecycle, starting in the Pending phase, entering Running if at least one of its primary containers starts normally, and then, depending on whether any container in the Pod ended in failure, entering the Succeeded or Failed phase.
Based on factors such as Pod status and replica count, the fifth-generation container management module provides a built-in set of workload lifecycle states so that users can perceive the actual running state of workloads more accurately. Because different workload types (such as stateless workloads and jobs) manage Pods differently, different workloads present different lifecycle states while running, as shown in the following table:
Congratulations, you have successfully entered the AI computing platform and can now begin your AI journey.
The Suanova AI Computing Platform provides comprehensive, automated security for containers, Pods, images, runtimes, and microservices. The following table lists some security features that have been implemented or are being implemented.
Log in to the AI computing platform as the user and check whether the test-ns-1 namespace has been assigned to them.
Next step: Create an AI workload that uses GPU resources
"},{"location":"admin/share/workload.html","title":"\u521b\u5efa AI \u8d1f\u8f7d\u4f7f\u7528 GPU \u8d44\u6e90","text":"
\u7ba1\u7406\u5458\u4e3a\u5de5\u4f5c\u7a7a\u95f4\u5206\u914d\u8d44\u6e90\u914d\u989d\u540e\uff0c\u7528\u6237\u5c31\u53ef\u4ee5\u521b\u5efa AI \u5de5\u4f5c\u8d1f\u8f7d\u6765\u4f7f\u7528 GPU \u7b97\u529b\u8d44\u6e90\u3002
"},{"location":"admin/virtnest/best-practice/import-ubuntu.html","title":"\u5982\u4f55\u4ece VMWare \u5bfc\u5165\u4f20\u7edf Linux \u4e91\u4e3b\u673a\u5230\u4e91\u539f\u751f\u4e91\u4e3b\u673a\u5e73\u53f0","text":"
\u672c\u6587\u5c06\u8be6\u7ec6\u4ecb\u7ecd\u5982\u4f55\u901a\u8fc7\u547d\u4ee4\u884c\u5c06\u5916\u90e8\u5e73\u53f0 VMware \u4e0a\u7684 Linux \u4e91\u4e3b\u673a\u5bfc\u5165\u5230 AI \u7b97\u529b\u4e2d\u5fc3\u7684\u4e91\u4e3b\u673a\u4e2d\u3002
```
Can't use SSL_get_servername
depth=0 CN = vcsa.daocloud.io
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = vcsa.daocloud.io
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN = vcsa.daocloud.io
verify return:1
DONE
sha1 Fingerprint=C3:9D:D7:55:6A:43:11:2B:DE:BA:27:EA:3B:C2:13:AF:E4:12:62:4D # required value
```
Different information must be configured depending on the network mode. If a fixed IP is required, the Bridge network mode must be selected.
How to Import a Traditional Windows Virtual Machine from VMware into the Cloud-Native VM Platform
This document describes in detail how to import a virtual machine from the external VMware platform into the AI computing center's virtual machines via the command line.
Check the Windows Boot Type
When importing a virtual machine from an external platform into the AI computing center's virtualization platform, you must configure it according to the VM's boot type (BIOS or UEFI) to ensure that the VM can start and run correctly.
"},{"location":"admin/virtnest/best-practice/import-windows.html#linux-windows","title":"\u5bf9\u6bd4\u5bfc\u5165 Linux \u548c Windows \u4e91\u4e3b\u673a\u7684\u5dee\u5f02","text":"
Windows \u53ef\u80fd\u9700\u8981 UEFI \u914d\u7f6e\u3002
Windows \u901a\u5e38\u9700\u8981\u5b89\u88c5 VirtIO \u9a71\u52a8\u3002
Windows \u591a\u78c1\u76d8\u5bfc\u5165\u901a\u5e38\u4e0d\u9700\u8981\u91cd\u65b0\u6302\u8f7d\u78c1\u76d8\u3002
"},{"location":"admin/virtnest/best-practice/vm-windows.html","title":"\u521b\u5efa Windows \u4e91\u4e3b\u673a","text":"
\u672c\u6587\u5c06\u4ecb\u7ecd\u5982\u4f55\u901a\u8fc7\u547d\u4ee4\u884c\u521b\u5efa Windows \u4e91\u4e3b\u673a\u3002
\u521b\u5efa Windows \u4e91\u4e3b\u673a\u4e4b\u524d\uff0c\u9700\u8981\u5148\u53c2\u8003\u5b89\u88c5\u4e91\u4e3b\u673a\u6a21\u5757\u7684\u4f9d\u8d56\u548c\u524d\u63d0\u786e\u5b9a\u60a8\u7684\u73af\u5883\u5df2\u7ecf\u51c6\u5907\u5c31\u7eea\u3002
\u521b\u5efa\u8fc7\u7a0b\u5efa\u8bae\u53c2\u8003\u5b98\u65b9\u6587\u6863\uff1a\u5b89\u88c5 windows \u7684\u6587\u6863\u3001 \u5b89\u88c5 Windows \u76f8\u5173\u9a71\u52a8\u7a0b\u5e8f\u3002
Windows \u4e91\u4e3b\u673a\u5efa\u8bae\u4f7f\u7528 VNC \u7684\u8bbf\u95ee\u65b9\u5f0f\u3002
"},{"location":"admin/virtnest/best-practice/vm-windows.html#iso","title":"\u5bfc\u5165 ISO \u955c\u50cf","text":"
\u200b\u521b\u5efa Windows \u4e91\u4e3b\u673a\u9700\u8981\u5bfc\u5165 ISO \u955c\u50cf\u7684\u4e3b\u8981\u539f\u56e0\u662f\u4e3a\u4e86\u5b89\u88c5 Windows \u64cd\u4f5c\u7cfb\u7edf\u3002 \u4e0e Linux \u64cd\u4f5c\u7cfb\u7edf\u4e0d\u540c\uff0cWindows \u64cd\u4f5c\u7cfb\u7edf\u5b89\u88c5\u8fc7\u7a0b\u901a\u5e38\u9700\u8981\u4ece\u5b89\u88c5\u5149\u76d8\u6216 ISO \u955c\u50cf\u6587\u4ef6\u4e2d\u5f15\u5bfc\u3002 \u56e0\u6b64\uff0c\u5728\u521b\u5efa Windows \u4e91\u4e3b\u673a\u65f6\uff0c\u9700\u8981\u5148\u5bfc\u5165 Windows \u64cd\u4f5c\u7cfb\u7edf\u7684\u5b89\u88c5 ISO \u955c\u50cf\u6587\u4ef6\uff0c\u4ee5\u4fbf\u4e91\u4e3b\u673a\u80fd\u591f\u6b63\u5e38\u5b89\u88c5\u3002
\u4ee5\u4e0b\u4ecb\u7ecd\u4e24\u4e2a\u5bfc\u5165 ISO \u955c\u50cf\u7684\u529e\u6cd5\uff1a
Windows \u7248\u672c\u7684\u4e91\u4e3b\u673a\u5927\u591a\u6570\u60c5\u51b5\u662f\u9700\u8981\u8fdc\u7a0b\u684c\u9762\u63a7\u5236\u8bbf\u95ee\u7684\uff0c\u5efa\u8bae\u4f7f\u7528 Microsoft \u8fdc\u7a0b\u684c\u9762\u63a7\u5236\u60a8\u7684\u4e91\u4e3b\u673a\u3002
Note
Your Windows edition must support remote desktop control in order to use Microsoft Remote Desktop.
The Windows firewall must be turned off.
Adding a data disk to a Windows virtual machine works the same way as for a Linux virtual machine. You can refer to the YAML example below:
These capabilities are the same as for Linux virtual machines; you can directly refer to the way Linux virtual machines are configured.
Access a Windows Virtual Machine
```
[root@master ~]# helm search repo virtnest-release/virtnest --versions
NAME                        CHART VERSION   APP VERSION   DESCRIPTION
virtnest-release/virtnest   0.6.0           v0.6.0        A Helm chart for virtnest
```
```
# Successful case
QEMU: Checking for hardware virtualization                 : PASS
QEMU: Checking if device /dev/kvm exists                   : PASS
QEMU: Checking if device /dev/kvm is accessible            : PASS
QEMU: Checking if device /dev/vhost-net exists             : PASS
QEMU: Checking if device /dev/net/tun exists               : PASS
QEMU: Checking for cgroup 'cpu' controller support         : PASS
QEMU: Checking for cgroup 'cpuacct' controller support     : PASS
QEMU: Checking for cgroup 'cpuset' controller support      : PASS
QEMU: Checking for cgroup 'memory' controller support      : PASS
QEMU: Checking for cgroup 'devices' controller support     : PASS
QEMU: Checking for cgroup 'blkio' controller support       : PASS
QEMU: Checking for device assignment IOMMU support         : PASS
QEMU: Checking if IOMMU is enabled by kernel               : PASS
QEMU: Checking for secure guest support                    : WARN (Unknown if this platform has Secure Guest support)

# Failure case
QEMU: Checking for hardware virtualization                 : FAIL (Only emulated CPUs are available, performance will be significantly limited)
QEMU: Checking if device /dev/vhost-net exists             : PASS
QEMU: Checking if device /dev/net/tun exists               : PASS
QEMU: Checking for cgroup 'memory' controller support      : PASS
QEMU: Checking for cgroup 'memory' controller mount-point  : PASS
QEMU: Checking for cgroup 'cpu' controller support         : PASS
QEMU: Checking for cgroup 'cpu' controller mount-point     : PASS
QEMU: Checking for cgroup 'cpuacct' controller support     : PASS
QEMU: Checking for cgroup 'cpuacct' controller mount-point : PASS
QEMU: Checking for cgroup 'cpuset' controller support      : PASS
QEMU: Checking for cgroup 'cpuset' controller mount-point  : PASS
QEMU: Checking for cgroup 'devices' controller support     : PASS
QEMU: Checking for cgroup 'devices' controller mount-point : PASS
QEMU: Checking for cgroup 'blkio' controller support       : PASS
QEMU: Checking for cgroup 'blkio' controller mount-point   : PASS
WARN (Unknown if this platform has IOMMU support)
```
```
[root@master ~]# helm search repo virtnest/virtnest --versions
NAME                CHART VERSION   APP VERSION   DESCRIPTION
virtnest/virtnest   0.2.0           v0.2.0        A Helm chart for virtnest
...
```
In Passt (passthrough) / Bridge mode, network interfaces can be added manually. Click Add NIC to configure the NIC's IP pool. Select a Multus CR that matches the network mode; if none exists, you need to create one yourself.
If the Use Default IP Pool switch is turned on, the default IP pool in the Multus CR configuration is used. If the switch is turned off, select an IP pool manually.
IP: the IP address of the virtual machine. Virtual machines with multiple NICs added are assigned multiple IP addresses.
The virtual machine has not written data to disk, or it uses Rook-Ceph or HwameiStor HA mode as the storage system.
Check the status of the virtual machine's launcher pod:
```
kubectl get pod
```
Check whether the launcher pod is in the Terminating state.
Force delete the launcher pod:
If the launcher pod status is Terminating, you can run the following command to force delete it:
```
kubectl delete pod <launcher pod> --force
```
Replace <launcher pod> with the name of your launcher pod.
After force deleting the pod, wait about six minutes for the launcher pod to start, or start the pod immediately with the following command:
```
kubectl get pv | grep <vm name>
kubectl get VolumeAttachment | grep <pv name>
```
Activate the VMExport feature gate by running the following command in the original cluster; refer to How to activate a feature gate.
In Bridge mode, network interfaces can be added manually. Click Add NIC to configure the NIC's IP pool. Select a Multus CR that matches the network mode; if none exists, you need to create one yourself.
If the Use Default IP Pool switch is turned on, the default IP pool in the Multus CR configuration is used. If the switch is turned off, select an IP pool manually.
AI Lab provides all the dataset management capabilities needed for model development, training, and inference, and currently supports unified access to multiple data sources.
With simple configuration, data sources can be connected to AI Lab, enabling unified data governance, preheating, dataset management, and other functions.
This document explains how to manage your environment dependency libraries in AI Lab; the specific steps and notes are as follows.
With the rapid iteration of AI Lab, we now support inference services for a variety of models; you can see the supported model information here.
AI Lab v0.3.0 launched the model inference service for traditional deep learning models, allowing users to use AI Lab's inference service directly without worrying about model deployment and maintenance.
AI Lab v0.6.0 supports the full version of the vLLM inference capability, supporting many large language models such as LLama, Qwen, and ChatGLM.
In AI Lab you can use GPU types that have been validated on the Suanova AI Computing Platform; for more details, see the GPU support matrix.
Triton Inference Server provides good support for traditional deep learning models; we currently support the mainstream inference backend services:
AI Lab currently provides Triton and vLLM as inference frameworks; with simple configuration, users can quickly launch a high-performance inference service.
API-key request authentication is supported, and users can add custom authentication parameters.
After the inference service is created, click the inference service name to open its details and view the API invocation method. Verify the execution results using Curl, Python, Node.js, and similar tools.
Similarly, you can open the job details to view resource usage and the log output of each Pod.
The AI Lab module provides an important visual analysis tool for the model development process, used to display the training process and results of machine learning models. This document introduces the basic concepts of Job Analysis (Tensorboard), how to use it in the AI Lab system, and how to configure the log content of datasets.
The AI Lab system provides a convenient way to create and manage Tensorboard. The specific steps are as follows:
Create a distributed job: create a new distributed training job on the AI Lab platform.
Similarly, you can open the job details to view resource usage and the log output of each Pod.
An access key can be used to access the open API and continuous delivery. Users can follow the steps below in the Personal Center to obtain a key and access the API.
Log in to the AI computing platform with your username/password, then click Global Management at the bottom of the left navigation bar.
Step 4: Set the Public Key on the Suanova AI Computing Platform
Log in to the Suanova AI Computing Platform UI page and, in the upper-right corner, select Personal Center -> SSH Public Key.
Log in to the AI computing platform as an end user, navigate to the corresponding service, and check the access port.
The Alert Center is an important feature provided by the AI computing platform. It lets users conveniently view all active and historical alerts by cluster and namespace through a graphical interface, and search alerts by severity (Critical, Warning, Info).
All alerts are triggered by threshold conditions defined in preset alert rules. The AI computing platform has some built-in global alert policies, and you can also create and delete alert policies at any time to configure the following metrics:
The AI computing platform Alert Center is a powerful alert management platform that helps users discover and resolve problems in the cluster in a timely manner, improving business stability and availability and facilitating cluster inspection and troubleshooting.
Pod Monitor\uff1a\u5728 K8S \u751f\u6001\u4e0b\uff0c\u57fa\u4e8e Prometheus Operator \u6765\u6293\u53d6 Pod \u4e0a\u5bf9\u5e94\u7684\u76d1\u63a7\u6570\u636e\u3002
Service Monitor\uff1a\u5728 K8S \u751f\u6001\u4e0b\uff0c\u57fa\u4e8e Prometheus Operator \u6765\u6293\u53d6 Service \u5bf9\u5e94 Endpoints \u4e0a\u7684\u76d1\u63a7\u6570\u636e\u3002
\u7531\u4e8e ICMP \u9700\u8981\u66f4\u9ad8\u6743\u9650\uff0c\u56e0\u6b64\uff0c\u6211\u4eec\u8fd8\u9700\u8981\u63d0\u5347 Pod \u6743\u9650\uff0c\u5426\u5219\u4f1a\u51fa\u73b0 operation not permitted \u7684\u9519\u8bef\u3002\u6709\u4ee5\u4e0b\u4e24\u79cd\u65b9\u5f0f\u63d0\u5347\u6743\u9650\uff1a
port: specifies the port through which the metrics are scraped; it must be set to the name of the port defined on the Service being scraped.
This defines the scope of Services to be discovered. namespaceSelector contains two mutually exclusive fields, with the following meanings:
any: has one and only one value, true. When this field is set, changes to all Services that match the Selector filter are watched across all namespaces. A sketch combining these fields is shown below.
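A minimal ServiceMonitor sketch illustrating the port and namespaceSelector fields described above; the resource name, labels, and scrape interval are placeholders rather than values from this document:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # placeholder name
  namespace: insight-system
spec:
  endpoints:
    - port: metrics            # must match the *name* of the Service port being scraped
      interval: 30s
  namespaceSelector:
    any: true                  # watch matching Services in all namespaces
    # matchNames: ["default"]  # the mutually exclusive alternative
  selector:
    matchLabels:
      app: example-app         # placeholder label
```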
Resource consumption: view the trend of the top 5 clusters and nodes over the past hour, ranked separately by CPU utilization, memory utilization, and disk utilization.
By default the ranking is sorted by CPU utilization. You can switch the metric to change how clusters and nodes are sorted.
Resource trends: view the node count trend over the last 15 days and the Pod running trend over the last hour.
Use logical operators (AND, OR, NOT, "") to query multiple keywords, for example: keyword1 AND (keyword2 OR keyword3) NOT keyword4.
"},{"location":"end-user/insight/infra/cluster.html#_4","title":"\u53c2\u8003\u6307\u6807\u8bf4\u660e","text":"\u6307\u6807\u540d \u8bf4\u660e CPU \u4f7f\u7528\u7387 \u8be5\u6307\u6807\u662f\u6307\u96c6\u7fa4\u4e2d\u6240\u6709 Pod \u8d44\u6e90\u7684\u5b9e\u9645 CPU \u7528\u91cf\u4e0e\u6240\u6709\u8282\u70b9\u7684 CPU \u603b\u91cf\u7684\u6bd4\u7387\u3002 CPU \u5206\u914d\u7387 \u8be5\u6307\u6807\u662f\u6307\u96c6\u7fa4\u4e2d\u6240\u6709 Pod \u7684 CPU \u8bf7\u6c42\u91cf\u7684\u603b\u548c\u4e0e\u6240\u6709\u8282\u70b9\u7684 CPU \u603b\u91cf\u7684\u6bd4\u7387\u3002 \u5185\u5b58\u4f7f\u7528\u7387 \u8be5\u6307\u6807\u662f\u6307\u96c6\u7fa4\u4e2d\u6240\u6709 Pod \u8d44\u6e90\u7684\u5b9e\u9645\u5185\u5b58\u7528\u91cf\u4e0e\u6240\u6709\u8282\u70b9\u7684\u5185\u5b58\u603b\u91cf\u7684\u6bd4\u7387\u3002 \u5185\u5b58\u5206\u914d\u7387 \u8be5\u6307\u6807\u662f\u6307\u96c6\u7fa4\u4e2d\u6240\u6709 Pod \u7684\u5185\u5b58\u8bf7\u6c42\u91cf\u7684\u603b\u548c\u4e0e\u6240\u6709\u8282\u70b9\u7684\u5185\u5b58\u603b\u91cf\u7684\u6bd4\u7387\u3002"},{"location":"end-user/insight/infra/container.html","title":"\u5bb9\u5668\u76d1\u63a7","text":"
"},{"location":"end-user/insight/infra/container.html#_4","title":"\u6307\u6807\u53c2\u8003\u8bf4\u660e","text":"\u6307\u6807\u540d\u79f0 \u8bf4\u660e CPU \u4f7f\u7528\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684 CPU \u4f7f\u7528\u91cf\u4e4b\u548c\u3002 CPU \u8bf7\u6c42\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684 CPU \u8bf7\u6c42\u91cf\u4e4b\u548c\u3002 CPU \u9650\u5236\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684 CPU \u9650\u5236\u91cf\u4e4b\u548c\u3002 \u5185\u5b58\u4f7f\u7528\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684\u5185\u5b58\u4f7f\u7528\u91cf\u4e4b\u548c\u3002 \u5185\u5b58\u8bf7\u6c42\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684\u5185\u5b58\u4f7f\u7528\u91cf\u4e4b\u548c\u3002 \u5185\u5b58\u9650\u5236\u91cf \u5de5\u4f5c\u8d1f\u8f7d\u4e0b\u6240\u6709\u5bb9\u5668\u7ec4\u7684\u5185\u5b58\u9650\u5236\u91cf\u4e4b\u548c\u3002 \u78c1\u76d8\u8bfb\u5199\u901f\u7387 \u6307\u5b9a\u65f6\u95f4\u8303\u56f4\u5185\u78c1\u76d8\u6bcf\u79d2\u8fde\u7eed\u8bfb\u53d6\u548c\u5199\u5165\u7684\u603b\u548c\uff0c\u8868\u793a\u78c1\u76d8\u6bcf\u79d2\u8bfb\u53d6\u548c\u5199\u5165\u64cd\u4f5c\u6570\u7684\u6027\u80fd\u5ea6\u91cf\u3002 \u7f51\u7edc\u53d1\u9001\u63a5\u6536\u901f\u7387 \u6307\u5b9a\u65f6\u95f4\u8303\u56f4\u5185\uff0c\u6309\u5de5\u4f5c\u8d1f\u8f7d\u7edf\u8ba1\u7684\u7f51\u7edc\u6d41\u91cf\u7684\u6d41\u5165\u3001\u6d41\u51fa\u901f\u7387\u3002"},{"location":"end-user/insight/infra/event.html","title":"\u4e8b\u4ef6\u67e5\u8be2","text":"
AI computing platform Insight supports querying events by cluster and namespace, and provides an event status distribution chart along with statistics on important events.
Through the important event statistics, you can easily see the number of image pull failures, health check failures, Pod run failures, Pod scheduling failures, container OOM (out-of-memory) events, and volume mount failures, as well as the total number of events. These events are usually classified into two categories, Warning and Normal.

Metric Descriptions

| Metric | Description |
| --- | --- |
| CPU Usage | Sum of the CPU usage of the pods in the selected namespace. |
| Memory Usage | Sum of the memory usage of the pods in the selected namespace. |
| Pod CPU Usage | CPU usage of each pod in the namespace. |
| Pod Memory Usage | Memory usage of each pod in the namespace. |

Node Monitoring

| Metric | Description |
| --- | --- |
| Current Status Response | Response status code of the HTTP probe request. |
| Ping Status | Whether the probe request succeeded; 1 means success, 0 means failure. |
| IP Protocol | IP protocol version used by the probe request. |
| SSL Expiry | Earliest expiration time of the SSL/TLS certificate. |
| DNS Response (Latency) | Duration of the entire probe process, in seconds. |
| HTTP Duration | Total time from sending the request to receiving the complete response. |

Delete a Probe Task
The AI Computing Center platform manages multiple clouds and clusters and supports creating clusters. On this basis, Insight serves as the unified observability solution across clusters: the insight-agent plugin is deployed to collect observability data from each cluster, and metrics, logs, and trace data can then be queried through the AI Computing Center observability product.

Install insight-agent in Other Clusters

Get the Address via the API Provided by Insight Server
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam'

When calling this API, you also need to pass in the IP of any node in the cluster that is reachable from outside; this IP is used to assemble the full access address of the corresponding service.

export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam' --data '{"extra": {"EXPORTER_EXTERNAL_IP": "10.5.14.51"}}'

export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam'
Once the above is ready, you can enable tracing for your applications via annotations; OTel currently supports annotation-based integration. Depending on the service language, different pod annotations need to be added. Each service can add one of two types of annotations:
Because Go auto-instrumentation requires OTEL_GO_AUTO_TARGET_EXE to be set, you must provide a valid executable path via an annotation or via the Instrumentation resource. Leaving this value unset aborts Go auto-instrumentation injection, and tracing integration fails.
Go auto-instrumentation also requires elevated privileges. The following permissions are set automatically and are required.
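For illustration, a minimal sketch of what Go annotation-based injection can look like on a workload's pod template; the Instrumentation name, workload name, image, and the binary path /app/server are placeholders and assumptions, not values taken from this document:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-go-app                  # placeholder workload name
spec:
  selector:
    matchLabels:
      app: my-go-app
  template:
    metadata:
      labels:
        app: my-go-app
      annotations:
        # enable Go auto-instrumentation injection (value may also be "true")
        instrumentation.opentelemetry.io/inject-go: "insight-opentelemetry-autoinstrumentation"
        # required: path of the target executable inside the container
        instrumentation.opentelemetry.io/otel-go-auto-target-exe: "/app/server"
    spec:
      containers:
        - name: my-go-app
          image: my-go-app:latest  # placeholder image
```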
Manual instrumentation with Go as an example: enhance a Go application using the OpenTelemetry SDK.
Non-intrusive instrumentation for Go based on eBPF (experimental feature).
OpenTelemetry, also abbreviated as OTel, is an open-source observability framework that helps generate and collect telemetry data in Go applications: traces, metrics, and logs.
This article mainly explains how to enhance a Go application and integrate trace monitoring through the OpenTelemetry Go SDK.

Enhance Go Applications with the OTel SDK

Install Dependencies
OTEL_SERVICE_NAME=my-golang-app OTEL_EXPORTER_OTLP_ENDPOINT=http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 go run main.go...

# Add one line to your import() stanza depending upon your request router:
middleware "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"

# Add one line to your import() stanza depending upon your request router:
middleware "go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux"
Create a meter provider and specify Prometheus as the exporter.
/*
* Copyright The OpenTelemetry Authors
* SPDX-License-Identifier: Apache-2.0
*/

package io.opentelemetry.example.prometheus;

import io.opentelemetry.api.metrics.MeterProvider;
import io.opentelemetry.exporter.prometheus.PrometheusHttpServer;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.MetricReader;

public final class ExampleConfiguration {

  /**
   * Initializes the Meter SDK and configures the prometheus collector with all default settings.
   *
   * @param prometheusPort the port to open up for scraping.
   * @return A MeterProvider for use in instrumentation.
   */
  static MeterProvider initializeOpenTelemetry(int prometheusPort) {
    MetricReader prometheusReader = PrometheusHttpServer.builder().setPort(prometheusPort).build();

    return SdkMeterProvider.builder().registerMetricReader(prometheusReader).build();
  }
}
Create a custom meter and start an HTTP server.
package io.opentelemetry.example.prometheus;

import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.metrics.MeterProvider;
import java.util.concurrent.ThreadLocalRandom;

/**
* Example of using the PrometheusHttpServer to convert OTel metrics to Prometheus format and expose
* these to a Prometheus instance via a HttpServer exporter.
*
* <p>A Gauge is used to periodically measure how many incoming messages are awaiting processing.
* The Gauge callback gets executed every collection interval.
*/
public final class PrometheusExample {
  private long incomingMessageCount;

  public PrometheusExample(MeterProvider meterProvider) {
    Meter meter = meterProvider.get("PrometheusExample");
    meter
        .gaugeBuilder("incoming.messages")
        .setDescription("No of incoming messages awaiting processing")
        .setUnit("message")
        .buildWithCallback(result -> result.record(incomingMessageCount, Attributes.empty()));
  }

  void simulate() {
    for (int i = 500; i > 0; i--) {
      try {
        System.out.println(
            i + " Iterations to go, current incomingMessageCount is: " + incomingMessageCount);
        incomingMessageCount = ThreadLocalRandom.current().nextLong(100);
        Thread.sleep(1000);
      } catch (InterruptedException e) {
        // ignored here
      }
    }
  }

  public static void main(String[] args) {
    int prometheusPort = 8888;

    // it is important to initialize the OpenTelemetry SDK as early as possible in your process.
    MeterProvider meterProvider = ExampleConfiguration.initializeOpenTelemetry(prometheusPort);

    PrometheusExample prometheusExample = new PrometheusExample(meterProvider);

    prometheusExample.simulate();

    System.out.println("Exiting");
  }
}
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} trace_id=%X{trace_id} span_id=%X{span_id} trace_flags=%X{trace_flags} %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Just wrap your logging appender, for example ConsoleAppender, with OpenTelemetryAppender -->
  <appender name="OTEL" class="io.opentelemetry.instrumentation.logback.mdc.v1_0.OpenTelemetryAppender">
    <appender-ref ref="CONSOLE"/>
  </appender>

  <!-- Use the wrapped "OTEL" appender instead of the original "CONSOLE" one -->
  <root level="INFO">
    <appender-ref ref="OTEL"/>
  </root>

</configuration>
Here the second approach is used: when starting the JVM, both the JMX Exporter jar file and its config file must be specified. The jar is a binary file that does not lend itself to mounting via a ConfigMap, and the config file rarely needs to change, so the recommendation is to package both the JMX Exporter jar and its config file directly into the business container image.
For the second approach, the JMX Exporter jar can either be placed in the business application image or mounted in at deployment time. Both options are described below:

Method 1: Build the JMX Exporter JAR File into the Business Image

Then prepare the jar file. You can find the download URL of the latest jar on the jmx_exporter GitHub page, then refer to the following Dockerfile:
With the service mesh disabled, test statistics show the relationship between system Job metrics and Pods: series count = 800 × Pod count.
With the service mesh enabled, the volume of Istio-related metrics produced by Pods is: series count = 768 × Pod count.
The Pod count in the table refers to Pods running more or less stably in the cluster. If a large number of Pods restart, the metric count spikes for a short period and resources must be scaled up accordingly.
Disk usage = instantaneous metric count × 2 × disk space per data point × 60 × 24 × retention period (days)
Retention (days) × 60 × 24 converts the retention period from days into minutes so that the disk usage can be calculated.
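As a worked example of the formula above, assume an illustrative 100,000 instantaneous series, 0.5 bytes of disk per data point, and a 7-day retention period (all three numbers are assumptions for illustration, not measurements from this document): disk usage = 100,000 × 2 × 0.5 B × 60 × 24 × 7 ≈ 1.0 × 10^9 B, roughly 1 GB.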
In the left navigation, select Index Lifecycle Policies, find the policy insight-es-k8s-logs-policy, and click it to open its details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period; the default retention is 7d.
After making the change, click Save policy at the bottom of the page to apply it.
In the left navigation, select Index Lifecycle Policies, find the policy jaeger-ilm-policy, and click it to open its details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period; the default retention is 7d.
After making the change, click Save policy at the bottom of the page to apply it.
Kubernetes clusters are deployed to support efficient AI compute scheduling and management, elastic scaling, and high availability, thereby optimizing model training and inference.
At this point the cluster has been created successfully. You can check the nodes it contains, and you can now create AI workloads and use GPUs.
Next step: Create an AI Workload
Backups are usually divided into full backups, incremental backups, and differential backups. The 算丰 AI Computing Platform currently supports full backups and incremental backups.
The backup and restore capability provided by the 算丰 AI Computing Platform comes in two kinds, application backup and ETCD backup, and supports manual backups as well as scheduled automatic backups based on CronJobs.
This article describes how to back up an application on the 算丰 AI Computing Platform. The demo application used in this tutorial is called dao-2048 and is a stateless workload.
CA certificate: you can view the certificate with the following command, then copy and paste its content into the corresponding field:
S3 region: the geographic region of the cloud storage. Defaults to us-east-1; provided by the system administrator.
S3 force path style: keep the default value true.
S3 server URL: the console access address of the object storage (MinIO). MinIO generally provides both a UI access service and a console access service; use the console access address here. A sketch of how these fields typically come together follows.
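For reference, a minimal sketch assuming the backup storage is declared through Velero's BackupStorageLocation resource; the namespace, bucket name, and endpoint below are placeholders, not values from this document:

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero                    # placeholder namespace
spec:
  provider: aws                        # MinIO is addressed through the S3-compatible provider
  objectStorage:
    bucket: backup-bucket              # placeholder bucket name
  config:
    region: us-east-1                  # S3 region from above
    s3ForcePathStyle: "true"           # S3 force path style
    s3Url: http://10.5.14.200:9000     # placeholder console/S3 access address
```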
Check whether the ratio of the Pod resource requests to the limits of the workload complies with the overcommit ratio.
Clusters integrated into or created by the 算丰 AI Computing Platform container management module can not only be accessed directly through the UI, but also through two other access-control methods:
The 算丰 AI Computing Platform classifies clusters into roles based on their functional positioning, helping users better manage IT infrastructure.
This cluster runs the 算丰 AI Computing Platform components, for example container management, global management, observability, and the image registry. It generally does not carry business workloads.
In the 算丰 AI Computing Platform, integrated clusters and self-built clusters use different version support mechanisms.
For example, if the version range supported by the community is 1.25, 1.26, and 1.27, the version range for creating a work cluster through the UI in the 算丰 AI Computing Platform is 1.24, 1.25, and 1.26, and a stable version such as 1.24.7 is recommended to the user.
In addition, the version range for creating work clusters through the UI in the 算丰 AI Computing Platform stays closely in sync with the community: when the community version increments, the version range available for creating work clusters through the UI also increments by one version.
"},{"location":"end-user/kpanda/clusters/cluster-version.html#kubernetes","title":"Kubernetes \u7248\u672c\u652f\u6301\u8303\u56f4","text":"Kubernetes \u793e\u533a\u7248\u672c\u8303\u56f4 \u81ea\u5efa\u5de5\u4f5c\u96c6\u7fa4\u7248\u672c\u8303\u56f4 \u81ea\u5efa\u5de5\u4f5c\u96c6\u7fa4\u63a8\u8350\u7248\u672c \u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u5b89\u88c5\u5668 \u53d1\u5e03\u65f6\u95f4
In the container management module of the 算丰 AI Computing Platform, clusters fall into four roles: global service cluster, management cluster, work cluster, and integrated cluster. An integrated cluster can only be integrated from a third-party vendor; see Integrate a Cluster.
This page describes how to create a work cluster. By default, the OS type and CPU architecture of a new work cluster's worker nodes must match those of the global service cluster. To create a cluster from nodes whose OS or architecture differs from the global service cluster, see Creating an Ubuntu Work Cluster on a CentOS Management Platform.
It is recommended to create clusters using operating systems supported by the 算丰 AI Computing Platform. If your local nodes are outside the supported range, see Creating a Cluster on a Non-Mainstream Operating System.
Prepare a certain number of nodes according to business needs, making sure the nodes share the same OS type and CPU architecture.
Kubernetes 1.29.5 is recommended. For the exact version range, see the 算丰 AI Computing Platform Cluster Version Support Policy; the platform currently supports self-built work clusters in the range v1.28.0-v1.30.2. To create a cluster with a lower version, refer to the Cluster Version Support Range and to Deploying and Upgrading Kubean for Backward Compatibility.
The target hosts must allow IPv4 forwarding. If Pods and Services use IPv6, the target servers must also allow IPv6 forwarding.
The 算丰 AI Computing Platform does not yet provide firewall management; you need to define the firewall rules of the target hosts yourself in advance. To avoid problems during cluster creation, it is recommended to disable the firewall on the target hosts.
Service CIDR: the CIDR used by Service resources when containers in the same cluster access each other; it determines the upper limit of Service resources and cannot be modified after creation.
To completely delete an integrated cluster, you must do so on the platform where the cluster was originally created; the 算丰 AI Computing Platform does not support deleting integrated clusters.
In the 算丰 AI Computing Platform, the difference between Uninstall Cluster and Remove Integration is:
"},{"location":"end-user/kpanda/clusters/integrate-rancher-cluster.html#ai","title":"\u6b65\u9aa4\u4e09\uff1a\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u754c\u9762\u63a5\u5165\u96c6\u7fa4","text":"
CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Dec 14, 2024 07:26 UTC   204d                                    no
apiserver                  Dec 14, 2024 07:26 UTC   204d            ca                      no
apiserver-etcd-client      Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
apiserver-kubelet-client   Dec 14, 2024 07:26 UTC   204d            ca                      no
controller-manager.conf    Dec 14, 2024 07:26 UTC   204d                                    no
etcd-healthcheck-client    Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
etcd-peer                  Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
etcd-server                Dec 14, 2024 07:26 UTC   204d            etcd-ca                 no
front-proxy-client         Dec 14, 2024 07:26 UTC   204d            front-proxy-ca          no
scheduler.conf             Dec 14, 2024 07:26 UTC   204d                                    no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Dec 12, 2033 07:26 UTC   9y              no
etcd-ca                 Dec 12, 2033 07:26 UTC   9y              no
front-proxy-ca          Dec 12, 2033 07:26 UTC   9y              no
Static Pods are managed by the local kubelet rather than the API server, so kubectl cannot be used to delete or restart them.
If a Pod's manifest is no longer in the manifest directory, the kubelet terminates it. After another fileCheckFrequency period you can move the file back, the kubelet re-creates the Pod, and the certificate renewal for that component is thereby completed.
Kubernetes versions are expressed as x.y.z, where x is the major version, y is the minor version, and z is the patch version.
When the secret type is TLS (kubernetes.io/tls): the certificate credential and the private key data must be filled in. The certificate is a self-signed or CA-signed credential used for identity authentication; a certificate request is a request for a signature and must be signed with the private key.
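A minimal sketch of a kubernetes.io/tls secret, assuming the certificate and key are already base64-encoded; the secret name and the angle-bracket placeholders are illustrative only:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: example-tls                         # placeholder secret name
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded certificate>     # certificate credential
  tls.key: <base64-encoded private key>     # private key data
```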
"},{"location":"end-user/kpanda/configmaps-secrets/use-secret.html#pod","title":"\u4f7f\u7528\u5bc6\u94a5\u4f5c\u4e3a Pod \u7684\u6570\u636e\u5377","text":""},{"location":"end-user/kpanda/configmaps-secrets/use-secret.html#_6","title":"\u56fe\u5f62\u754c\u9762\u64cd\u4f5c","text":"
This article introduces the unified operations and management capability of the 算丰 AI computing container management platform for heterogeneous resources, with GPUs as the representative example.
With the rapid development of emerging technologies such as AI applications, large models, artificial intelligence, and autonomous driving, enterprises face more and more compute-intensive tasks and data processing demands. Traditional compute architectures represented by the CPU can no longer meet the growing computing needs of enterprises. Heterogeneous computing represented by the GPU has been widely adopted because of its unique advantages in processing large-scale data, performing complex computation, and real-time graphics rendering.
At the same time, the lack of experience and professional solutions for scheduling and managing heterogeneous resources has led to extremely low utilization of GPU devices, imposing high AI production costs on enterprises. How to cut costs, improve efficiency, and raise the utilization of GPUs and other heterogeneous resources has become an urgent challenge for many enterprises.
The 算丰 AI computing container management platform supports unified scheduling and operations management of GPUs, NPUs, and other heterogeneous resources, fully unlocking GPU compute power and accelerating the development of AI and other emerging applications. Its GPU management capabilities are as follows:
The 算丰 AI Computing Platform container management platform has been deployed and is running properly.
Number of physical GPUs (iluvatar.ai/vcuda-core): indicates how many physical GPUs the current Pod needs to mount. The input value must be an integer and less than or equal to the number of GPUs on the host.
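A minimal sketch of how this resource can be requested in a Pod spec; the pod name and image are placeholders, and the example assumes one physical GPU is requested:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: iluvatar-demo              # placeholder name
spec:
  containers:
    - name: demo
      image: ubuntu:22.04          # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          iluvatar.ai/vcuda-core: "1"   # number of physical GPUs, integer <= GPUs on the host
```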
After the adjustment, check the GPU resources allocated in the Pod:
Through the steps above, you can dynamically adjust the compute power and GPU memory resources of a vGPU Pod without restarting it, meeting business needs more flexibly and optimizing resource utilization.
This page lists the matrix of GPUs and operating systems supported by the 算丰 AI Computing Platform.
Spread: multiple Pods are spread across different GPU cards on a node; suitable for high-availability scenarios to avoid single-card failures.
Binpack: multiple Pods preferentially choose the same GPU card; suitable for improving GPU utilization and reducing resource fragmentation.
Spread: multiple Pods are spread across different nodes; suitable for high-availability scenarios to avoid single-node failures.
Number of physical GPUs (nvidia.com/vgpu): indicates how many physical GPUs the current Pod needs to mount; the value must be less than or equal to the number of GPUs on the host.
Number of physical GPUs (huawei.com/Ascend910): indicates how many physical GPUs the current Pod needs to mount. The input value must be an integer and less than or equal to the number of GPUs on the host.
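A minimal sketch of requesting this resource in a Pod spec; the pod name and image are placeholders, and the example assumes one Ascend 910 card is requested:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ascend-demo                          # placeholder name
spec:
  containers:
    - name: demo
      image: ascend-demo:latest              # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          huawei.com/Ascend910: "1"          # number of physical cards, integer <= cards on the host
```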
The 算丰 AI Computing Platform container management platform has been deployed and is running properly.
GPU compute (cambricon.com/mlu.smlu.vcore): indicates the percentage of cores the current Pod needs to use.
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  containers:
    - image: ubuntu:16.04
      name: pod1-ctr
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          cambricon.com/mlu: "1" # use this when device type is not enabled, else delete this line.
          #cambricon.com/mlu: "1" #uncomment to use when device type is enabled
          #cambricon.com/mlu.share: "1" #uncomment to use device with env-share mode
          #cambricon.com/mlu.mim-2m.8gb: "1" #uncomment to use device with mim mode
          #cambricon.com/mlu.smlu.vcore: "100" #uncomment to use device with mim mode
          #cambricon.com/mlu.smlu.vmemory: "1024" #uncomment to use device with mim mode
To expose "fully identical" MIG device types across all products, create identical GIs and CIs.
The 算丰 AI Computing Platform container management platform has been deployed and is running properly.
This article uses CentOS 7.9 (kernel 3.10.0-1160) on the AMD64 architecture for the demonstration. To deploy with Red Hat 8.4, refer to Uploading the Red Hat GPU Operator Offline Image to the Bootstrap Node Repository and Building a Red Hat 8.4 Offline Yum Source.
When using the built-in operating system version, there is no need to modify the image version; for other operating system versions, refer to Uploading an Image to the Bootstrap Node Repository. Note that the OS name (Ubuntu, CentOS, Red Hat, and so on) does not need to be appended after the version number; if the official image carries an OS suffix, remove it manually.
"},{"location":"end-user/kpanda/gpu/nvidia/push_image_to_repo.html","title":"\u5411\u706b\u79cd\u8282\u70b9\u4ed3\u5e93\u4e0a\u4f20 Red Hat GPU Opreator \u79bb\u7ebf\u955c\u50cf","text":"
\u672c\u6587\u4ee5 Red Hat 8.4 \u7684 nvcr.io/nvidia/driver:525.105.17-rhel8.4 \u79bb\u7ebf\u9a71\u52a8\u955c\u50cf\u4e3a\u4f8b\uff0c\u4ecb\u7ecd\u5982\u4f55\u5411\u706b\u79cd\u8282\u70b9\u4ed3\u5e93\u4e0a\u4f20\u79bb\u7ebf\u955c\u50cf\u3002
ctr i pull nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i export --all-platforms driver.tar.gz nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i import driver.tar.gz
ctr i tag nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 {bootstrap-node-registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i push {bootstrap-node-registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 --skip-verify=true
When the kernel version or OS type of the worker nodes differs from that of the control nodes of the global service cluster, the user needs to build an offline yum source manually.

Check the OS and Kernel Version of the Cluster Nodes

Run the following commands on a control node of the global service cluster and on the node where the GPU Operator is to be deployed. If the OS and kernel versions of the two nodes are the same, there is no need to build a yum source; you can install the GPU Operator directly by following the Offline Installation of GPU Operator documentation. If the OS or kernel versions differ, proceed to the next step.
In the current path on the node, run the following command to connect the local mc command-line tool to the MinIO server.

mc config host add minio http://10.5.14.200:9000 rootuser rootpass123

The expected output is as follows:

Added `minio` successfully.

The mc command-line tool is the client command-line tool provided by the MinIO file server; for details, see MinIO Client.
The OS of the cluster nodes where the GPU Operator is to be deployed must be Red Hat 8.4, and their kernel versions must be exactly the same.
This article uses a Red Hat 8.4 4.18.0-305.el8.x86_64 node as an example to describe how to build a Red Hat 8.4 offline yum source package based on any node of the global service cluster, and how to use it via the RepoConfig.ConfigMapName parameter when installing the GPU Operator.
In the current path on the node, run the following command to connect the local mc command-line tool to the MinIO server.

mc config host add minio <file-server-access-address> <username> <password>

For example:

mc config host add minio http://10.5.14.200:9000 rootuser rootpass123

The expected output is as follows:

Added `minio` successfully.

The mc command-line tool is the client command-line tool provided by the MinIO file server; for details, see MinIO Client.
"},{"location":"end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html","title":"\u6784\u5efa Red Hat 7.9 \u79bb\u7ebf yum \u6e90","text":""},{"location":"end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#_1","title":"\u4f7f\u7528\u573a\u666f\u4ecb\u7ecd","text":"
\u7b97\u4e30 AI \u7b97\u529b\u5e73\u53f0\u9884\u7f6e\u4e86 CentOS 7.9\uff0c\u5185\u6838\u4e3a 3.10.0-1160 \u7684 GPU Operator \u79bb\u7ebf\u5305\u3002\u5176\u5b83 OS \u7c7b\u578b\u7684\u8282\u70b9\u6216\u5185\u6838\u9700\u8981\u7528\u6237\u624b\u52a8\u6784\u5efa\u79bb\u7ebf yum \u6e90\u3002
\u672c\u6587\u4ecb\u7ecd\u5982\u4f55\u57fa\u4e8e\u5168\u5c40\u670d\u52a1\u96c6\u7fa4\u4efb\u610f\u8282\u70b9\u6784\u5efa Red Hat 7.9 \u79bb\u7ebf yum \u6e90\u5305\uff0c\u5e76\u5728\u5b89\u88c5 Gpu Operator \u65f6\u4f7f\u7528 RepoConfig.ConfigMapName \u53c2\u6570\u3002
\u5f85\u90e8\u7f72 GPU Operator \u7684\u96c6\u7fa4\u8282\u70b9 OS \u5fc5\u987b\u4e3a Red Hat 7.9\uff0c\u4e14\u5185\u6838\u7248\u672c\u5b8c\u5168\u4e00\u81f4
"},{"location":"end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#2-red-hat-79-os","title":"2. \u4e0b\u8f7d Red Hat 7.9 OS \u7684\u79bb\u7ebf\u9a71\u52a8\u955c\u50cf","text":"
"},{"location":"end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#3-red-hat-gpu-opreator","title":"3. \u5411\u706b\u79cd\u8282\u70b9\u4ed3\u5e93\u4e0a\u4f20 Red Hat GPU Opreator \u79bb\u7ebf\u955c\u50cf","text":"
\u53c2\u8003\u5411\u706b\u79cd\u8282\u70b9\u4ed3\u5e93\u4e0a\u4f20 Red Hat GPU Opreator \u79bb\u7ebf\u955c\u50cf\u3002
SM: Streaming Multiprocessor, the core compute unit of a GPU, responsible for executing graphics rendering and general-purpose computing tasks. Each SM contains a group of CUDA cores as well as shared memory, register files, and other resources, and can execute multiple threads concurrently. Each MIG instance owns a certain number of SMs and other related resources, together with the memory partitioned for it.
GPU SM Slice: a GPU SM slice is the smallest compute unit of the SMs on a GPU. When configured in MIG mode, a GPU SM slice is roughly one seventh of the total number of SMs available in the GPU.
Compute Instance: the compute slice of a GPU instance can be further subdivided into multiple Compute Instances (CIs); the CIs share the parent GI's engines and memory, but each CI has dedicated SM resources.
The compute slice of a GPU Instance (GI) can be further subdivided into multiple Compute Instances (CIs); the CIs share the parent GI's engines and memory, but each CI has dedicated SM resources. Using the same 4g.20gb example above, a CI can be created to use only the first compute slice with the 1c.4g.20gb compute configuration, as shown in blue in the figure below:
The 算丰 AI Computing Platform container management platform has been deployed and is running properly.
Number of physical GPUs (nvidia.com/vgpu): indicates how many physical GPUs the current Pod needs to mount. The input value must be an integer and less than or equal to the number of GPUs on the host.
A NUMA node is a basic building block of the Non-Uniform Memory Access (NUMA) architecture; a Node (host) is a collection of multiple NUMA nodes. Memory access across NUMA nodes incurs latency, so developers can improve memory access efficiency and overall performance by optimizing task scheduling and memory allocation strategies.
Common scenarios for NUMA affinity scheduling are compute-intensive jobs that are sensitive to CPU parameters or scheduling latency, such as scientific computing, video decoding, animation rendering, and offline big-data processing.
These are the NUMA placement policies that can be used when scheduling a Pod; for the scheduling behavior corresponding to each policy, see the Pod Scheduling Behavior description.
single-numa-node: when scheduling, the Pod only selects nodes in a node pool whose topology manager policy is set to single-numa-node, and the CPUs must be placed under the same NUMA node; if no node in the pool satisfies this, the Pod cannot be scheduled.
restricted: when scheduling, the Pod only selects nodes in a node pool whose topology manager policy is set to restricted, and the CPUs must be placed within the same NUMA set; if no node in the pool satisfies this, the Pod cannot be scheduled.
best-effort: when scheduling, the Pod only selects nodes in a node pool whose topology manager policy is set to best-effort, and the CPUs are placed under the same NUMA node as far as possible; if no node satisfies this, the best-fitting node is chosen for placement.
When a Pod sets a topology policy, Volcano predicts the list of matching nodes based on that policy. The scheduling process is as follows:
Based on the Volcano topology policy set by the Pod, filter the nodes that have the same policy.
Among the nodes with the same policy, filter the nodes whose CPU topology satisfies the policy's requirements for scheduling.
| Topology policy configurable on the Pod | 1. Filter schedulable nodes by the Pod's topology policy | 2. Further filter nodes whose CPU topology satisfies the policy |
| --- | --- | --- |
| none | No filtering against the node policies: none: schedulable; best-effort: schedulable; restricted: schedulable; single-numa-node: schedulable | - |
| best-effort | Filter nodes whose topology policy is also best-effort: none: not schedulable; best-effort: schedulable; restricted: not schedulable; single-numa-node: not schedulable | Satisfy the policy as far as possible: prefer scheduling onto a single NUMA node; if a single NUMA node cannot satisfy the CPU request, scheduling across multiple NUMA nodes is allowed. |
| restricted | Filter nodes whose topology policy is also restricted: none: not schedulable; best-effort: not schedulable; restricted: schedulable; single-numa-node: not schedulable | Strictly restricted scheduling: when the CPU capacity of a single NUMA node is greater than or equal to the CPU request, only scheduling onto a single NUMA node is allowed; in that case, if the remaining available CPU on the NUMA node is insufficient, the Pod cannot be scheduled. When the CPU capacity of a single NUMA node is less than the CPU request, scheduling across multiple NUMA nodes is allowed. |
| single-numa-node | Filter nodes whose topology policy is also single-numa-node: none: not schedulable; best-effort: not schedulable; restricted: not schedulable; single-numa-node: schedulable | Only scheduling onto a single NUMA node is allowed. |

Configure a NUMA Affinity Scheduling Policy
Assume the NUMA nodes are as follows:

| Worker node | Node topology manager policy | Allocatable CPU on NUMA node 0 | Allocatable CPU on NUMA node 1 |
| --- | --- | --- | --- |
| node-1 | single-numa-node | 16U | 16U |
| node-2 | best-effort | 16U | 16U |
| node-3 | best-effort | 20U | 20U |

In example one, the Pod's CPU request is 2U and its topology policy is set to single-numa-node, so it is scheduled to node-1, which has the same policy.
In example two, the Pod's CPU request is 20U and its topology policy is set to best-effort. It is scheduled to node-3, because node-3 can allocate the Pod's CPU request on a single NUMA node, while node-2 would have to do so across two NUMA nodes. A sketch of how such a policy can be declared on a Pod follows.
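A minimal sketch, assuming Volcano's numa-aware plugin reads the volcano.sh/numa-topology-policy annotation; the pod name, image, and CPU values are illustrative and taken only loosely from example one above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: numa-demo                                        # placeholder name
  annotations:
    volcano.sh/numa-topology-policy: single-numa-node    # topology policy for this Pod
spec:
  schedulerName: volcano                                 # schedule through Volcano
  containers:
    - name: demo
      image: ubuntu:22.04                                # placeholder image
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "2"
        limits:
          cpu: "2"
```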
"},{"location":"end-user/kpanda/gpu/volcano/numa.html#cpu","title":"\u67e5\u770b\u5f53\u524d\u8282\u70b9\u7684 CPU \u6982\u51b5","text":"
\u60a8\u53ef\u4ee5\u901a\u8fc7 lscpu \u547d\u4ee4\u67e5\u770b\u5f53\u524d\u8282\u70b9\u7684 CPU \u6982\u51b5\uff1a
"},{"location":"end-user/kpanda/gpu/volcano/numa.html#cpu_1","title":"\u67e5\u770b\u5f53\u524d\u8282\u70b9\u7684 CPU \u5206\u914d","text":"
\u7136\u540e\u67e5\u770b NUMA \u8282\u70b9\u4f7f\u7528\u60c5\u51b5\uff1a
# \u67e5\u770b\u5f53\u524d\u8282\u70b9\u7684 CPU \u5206\u914d\ncat /var/lib/kubelet/cpu_manager_state\n{\"policyName\":\"static\",\"defaultCpuSet\":\"0,10-15,25-31\",\"entries\":{\"777870b5-c64f-42f5-9296-688b9dc212ba\":{\"container-1\":\"16-24\"},\"fb15e10a-b6a5-4aaa-8fcd-76c1aa64e6fd\":{\"container-1\":\"1-9\"}},\"checksum\":318470969}\n
\u4ee5\u4e0a\u793a\u4f8b\u4e2d\u8868\u793a\uff0c\u8282\u70b9\u4e0a\u8fd0\u884c\u4e86\u4e24\u4e2a\u5bb9\u5668\uff0c\u4e00\u4e2a\u5360\u7528\u4e86 NUMA node0 \u76841-9 \u6838\uff0c\u53e6\u4e00\u4e2a\u5360\u7528\u4e86 NUMA node1 \u7684 16-24 \u6838\u3002
"},{"location":"end-user/kpanda/gpu/volcano/volcano-gang-scheduler.html","title":"\u4f7f\u7528 Volcano \u7684 Gang Scheduler","text":"
The Gang scheduling policy is one of the core scheduling algorithms of volcano-scheduler. It satisfies the "all or nothing" scheduling requirement and prevents arbitrary scheduling of Pods from wasting cluster resources. Concretely, the algorithm checks whether the number of already-scheduled Pods under a Job has reached the minimum run count; when the Job's minimum run count is met, scheduling is performed for all Pods under the Job, otherwise it is not.
The Gang scheduling algorithm, built around the pod group concept, is well suited to scenarios requiring multi-process collaboration. AI scenarios often involve complex workflows such as data ingestion, data analysis, data splitting, training, serving, and logging, which need a group of containers to work together; these fit pod-group-based Gang scheduling well. Multi-threaded parallel computing and communication under the MPI framework is also a good fit, because master and worker processes need to cooperate. Containers in a pod group are highly correlated and may contend for resources, and scheduling and allocating them as a whole effectively resolves deadlocks.
When Binpack scores a node, it combines the Binpack plugin's own weight with the weight configured for each resource. First, each resource requested by the Pod is scored in turn. Taking CPU as an example, the CPU score of a candidate node is computed as follows:

CPU.weight * (request + used) / allocatable
That is, the higher the CPU weight, the higher the score, and the more heavily used the node's resources are, the higher the score. Memory, GPU, and other resources follow the same principle. In the formula (a worked example follows this list):
CPU.weight is the CPU weight configured by the user
request is the amount of CPU requested by the current Pod
used is the amount of CPU already allocated and in use on the current node
allocatable is the total allocatable CPU of the current node
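A worked example under assumed numbers (not taken from this document): with CPU.weight = 1, a Pod request of 2 cores, 6 cores already used on the node, and 16 allocatable cores, the CPU score is 1 × (2 + 6) / 16 = 0.5; on an otherwise identical node with 12 cores already used, the score rises to 1 × (2 + 12) / 16 = 0.875, so Binpack prefers the fuller node and reduces fragmentation.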
The priority is determined by the Value field of the configured PriorityClass; the larger the value, the higher the priority. This is enabled by default and does not need modification. You can confirm or modify it with the following command.
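For reference, a minimal PriorityClass sketch; the name and value are illustrative, not defaults from this document:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority              # placeholder name
value: 1000000                     # larger value = higher priority
globalDefault: false
description: "Priority class for latency-sensitive workloads (illustrative)."
```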
Use kubectl get pod to view the Pod's running information; when cluster resources are insufficient, the Pod stays in the Pending state:
In addition, Volcano integrates seamlessly with mainstream computing frameworks such as Spark, TensorFlow, and PyTorch, and supports mixed scheduling of heterogeneous devices such as CPUs and GPUs, providing comprehensive optimization support for AI computing tasks.
Next, we describe how to install and use Volcano so that you can take full advantage of its scheduling policies to optimize AI computing tasks.
I1222 15:01:47.119777 8743 sync.go:45] Using config file: "examples/sync-dao-2048.yaml"
W1222 15:01:47.234238 8743 syncer.go:263] Ignoring skipDependencies option as dependency sync is not supported if container image relocation is true or syncing from/to intermediate directory
I1222 15:01:47.234685 8743 sync.go:58] There is 1 chart out of sync!
I1222 15:01:47.234706 8743 sync.go:66] Syncing "dao-2048_1.4.1" chart...
.relok8s-images.yaml hints file found
Computing relocation...

Relocating dao-2048@1.4.1...
Pushing 10.5.14.40/daocloud/dao-2048:v1.4.1...
Done
Done moving /var/folders/vm/08vw0t3j68z9z_4lcqyhg8nm0000gn/T/charts-syncer869598676/dao-2048-1.4.1.tgz
The cluster inspection feature provided by the 算丰 AI Computing Platform container management module supports custom inspection items across three dimensions: cluster, node, and pod. After an inspection finishes, a visual inspection report is generated automatically.
Pod dimension: checks the CPU and memory usage of Pods, their running status, the status of PVs and PVCs, and so on.
To learn about or perform security inspections, see the security scan types supported by the 算丰 AI Computing Platform.
The container management module of the 算丰 AI Computing Platform provides cluster inspection, supporting inspection at the cluster, node, and pod dimensions.
Pod dimension: checks the CPU and memory usage of Pods, their running status, the status of PVs and PVCs, and so on.
Applications and services already running on a node before it is set as a dedicated node are not affected and continue to run normally on that node; only when those Pods are deleted or recreated will they be scheduled onto other, non-dedicated nodes.
Because the global service cluster runs platform base components such as kpanda, ghippo, and insight, enabling namespace-dedicated nodes on the Global cluster may prevent those system components from being scheduled back onto the dedicated nodes after a restart, which impairs the overall high availability of the system. Therefore, we generally do not recommend enabling the namespace-dedicated node feature on the global service cluster.
After dedication is canceled, Pods from other namespaces can also be scheduled onto that node.
A pod security policy controls Pod behavior across security dimensions by configuring different levels and modes for specified namespaces in a Kubernetes cluster; only Pods that meet the configured conditions are accepted by the system. Three levels and three modes are provided, and users can choose the combination that best fits their needs when setting restriction policies.
Note
Each security mode can be configured with only one security policy. Also, be cautious when configuring the enforce mode for a namespace: violations will prevent Pods from being created.
An Ingress instance has been created, an application workload has been deployed, and the corresponding Service has been created.
In a Kubernetes cluster, every Pod has its own internal IP address, but the Pods in a workload may be created and deleted at any time, so using Pod IP addresses directly cannot provide a stable service to the outside world.
This is why you create a Service: a Service gives you a fixed IP address, decoupling the frontend and backend of the workload and allowing external users to access it. A Service also provides load balancing (LoadBalancer), enabling users to reach the workload from the public network.
Select Intra-cluster access (ClusterIP), which exposes the Service through the cluster's internal IP; a Service of this type is only reachable from within the cluster. This is the default Service type. Configure its parameters by referring to the table below.
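A minimal ClusterIP Service sketch is shown below for reference; the Service name, selector label, and ports are illustrative assumptions rather than values from this page:

```yaml
# Illustrative ClusterIP Service; name, labels, and ports are assumptions for demonstration.
apiVersion: v1
kind: Service
metadata:
  name: demo-backend          # hypothetical Service name
spec:
  type: ClusterIP             # default type: reachable only inside the cluster
  selector:
    app: demo-backend         # matches Pods carrying this label
  ports:
    - port: 80                # port exposed by the Service
      targetPort: 8080        # container port the traffic is forwarded to
```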
Policy configuration is divided into ingress and egress policies. For a source Pod to connect to a target Pod successfully, both the source Pod's egress policy and the target Pod's ingress policy must allow the connection; if either side denies it, the connection fails.
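As a hedged illustration of an ingress rule, the sketch below allows traffic to target Pods only from Pods carrying a specific label; the names, labels, and port are assumptions for demonstration:

```yaml
# Illustrative NetworkPolicy; namespace, labels, and port are assumptions for demonstration.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend   # hypothetical policy name
spec:
  podSelector:
    matchLabels:
      app: backend                  # ingress rules apply to these target Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend         # only Pods with this label may connect
      ports:
        - protocol: TCP
          port: 8080
```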
Run the ls command to check whether the key was created successfully on the management cluster. The expected output is as follows:
| Check Item | Description |
| --- | --- |
| Operating system | Refer to Supported Architectures and Operating Systems |
| SELinux | Disabled |
| Firewall | Disabled |
| Architecture consistency | CPU architecture is consistent across nodes (e.g., all ARM or all x86) |
| Host time | Time synchronization error among all hosts is less than 10 seconds |
| Network connectivity | The node and its SSH port can be accessed normally by the platform |
| CPU | Available CPU resources greater than 4 cores |
| Memory | Available memory resources greater than 8 GB |

Supported Architectures and Operating Systems

| Architecture | Operating System | Notes |
| --- | --- | --- |
| ARM | Kylin Linux Advanced Server release V10 (Sword) SP2 | Recommended |
| ARM | UOS Linux | |
| ARM | openEuler | |
| x86 | CentOS 7.x | Recommended |
| x86 | Redhat 7.x | Recommended |
| x86 | Redhat 8.x | Recommended |
| x86 | Flatcar Container Linux by Kinvolk | |
| x86 | Debian Bullseye, Buster, Jessie, Stretch | |
| x86 | Ubuntu 16.04, 18.04, 20.04, 22.04 | |
| x86 | Fedora 35, 36 | |
| x86 | Fedora CoreOS | |
| x86 | openSUSE Leap 15.x/Tumbleweed | |
| x86 | Oracle Linux 7, 8, 9 | |
| x86 | Alma Linux 8, 9 | |
| x86 | Rocky Linux 8, 9 | |
| x86 | Amazon Linux 2 | |
| x86 | Kylin Linux Advanced Server release V10 (Sword) - SP2 | Hygon |
| x86 | UOS Linux | |
| x86 | openEuler | |

Node Details
Nodes can be paused for scheduling or resumed. Pausing scheduling means new Pods will no longer be scheduled onto the node; resuming scheduling means Pods can be scheduled onto the node again.
A taint allows a node to repel a certain class of Pods, preventing them from being scheduled onto that node. One or more taints can be applied to each node, and Pods that do not tolerate those taints will not be scheduled onto it.
After a taint is added to a node, only Pods that tolerate the taint can be scheduled onto that node.
NoSchedule: new Pods are not scheduled onto a node with this taint unless they have a matching toleration. Pods already running on the node are not evicted.
NoExecute: affects Pods that are already running on the node as well as new Pods:
If a Pod does not tolerate this taint, it is evicted immediately.
If a Pod tolerates this taint but does not specify tolerationSeconds in its toleration, it keeps running on the node indefinitely.
If a Pod tolerates this taint and specifies tolerationSeconds, it continues running on the node for the specified duration, after which it is evicted.
PreferNoSchedule: a "soft" version of NoSchedule. The control plane will try to avoid scheduling Pods that do not tolerate this taint onto the node, but complete avoidance is not guaranteed, so avoid relying on this taint where possible.
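To make the effects above concrete, the hedged sketch below shows a NoSchedule taint on a node together with a Pod toleration that matches it; the node name, key, value, and image are assumptions for demonstration only:

```yaml
# Illustrative node taint and Pod toleration; node name, key, value, and image are assumptions.
apiVersion: v1
kind: Node
metadata:
  name: worker-node-1           # hypothetical node
spec:
  taints:
    - key: dedicated
      value: gpu
      effect: NoSchedule        # new Pods without a matching toleration are not scheduled here
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-tolerating-pod
spec:
  containers:
    - name: app
      image: nginx:1.25         # placeholder image
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule        # this Pod tolerates the taint and may be scheduled onto the node
      # for NoExecute taints, tolerationSeconds can limit how long the Pod keeps running after the taint appears
```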
The current cluster has been integrated into container management, and the kolm component has been installed on the global service cluster (search for kolm among the Helm templates).
You only need to add permission points on the Global Cluster; the Kpanda controller synchronizes the permission points added on the Global Cluster to all integrated sub-clusters. Synchronization takes some time to complete.
Permission points can only be added on the Global Cluster; permission points added on a sub-cluster are overwritten by the built-in role permission points of the Global Cluster.
Only ClusterRoles with a fixed Label are supported for appending permissions; replacing or deleting permissions is not supported, nor is appending permissions via a Role. The mapping between built-in roles and the Labels of user-created ClusterRoles is as follows:
The AI platform supports metric-based horizontal pod autoscaling (HPA). Users can dynamically adjust the number of Pod replicas by setting CPU utilization, memory usage, and custom metrics. For example, after setting an elastic scaling policy based on CPU utilization for a workload, when the Pods' CPU utilization exceeds or falls below the threshold you set, the workload controller automatically increases or decreases the number of Pod replicas.
If an HPA policy is created based on CPU utilization, resource limits (Limit) must be configured for the workload in advance; otherwise CPU utilization cannot be calculated.
The system has two built-in elastic scaling metrics, CPU and memory, to cover users' basic business scenarios.
Target CPU utilization: the CPU utilization of Pods under the workload, calculated as the CPU usage of all Pods under the workload divided by the workload's requested (request) value. When the actual CPU usage is greater/less than the target value, the system automatically increases/decreases the number of Pod replicas.
Target memory usage: the memory usage of Pods under the workload. When the actual memory usage is greater/less than the target value, the system automatically increases/decreases the number of Pod replicas.
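A minimal sketch of the underlying Kubernetes object is shown below, assuming a hypothetical Deployment named demo-app whose containers define CPU requests/limits; the replica bounds and threshold are illustrative:

```yaml
# Illustrative HorizontalPodAutoscaler targeting CPU utilization; names and numbers are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app              # hypothetical workload; its containers must define CPU requests/limits
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out when average CPU utilization exceeds 60%
```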
The Vertical Pod Autoscaler (VPA) monitors a Pod's resource requests and actual usage over a period of time and computes the CPU and memory request values best suited for that Pod. Using VPA allocates resources to each Pod in the cluster more reasonably, improving overall cluster resource utilization and avoiding waste.
The AI platform supports the Vertical Pod Autoscaler (VPA), which dynamically adjusts Pod request values based on container resource usage. The platform supports both manual and automatic modes for modifying resource requests, and you can configure them according to your actual needs.
This page describes how to configure vertical Pod autoscaling for a workload.
Warning
Modifying Pod resource requests with VPA triggers a Pod restart. Due to limitations of Kubernetes itself, a Pod may be scheduled onto a different node after restarting.
Scaling mode: how the CPU and memory request values are modified. Vertical scaling currently supports manual and automatic modes.
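For reference, a hedged sketch of the corresponding VerticalPodAutoscaler object is shown below; it assumes the VPA components are installed, and the target workload name and update mode are illustrative assumptions:

```yaml
# Illustrative VerticalPodAutoscaler; requires the VPA CRDs/components to be installed.
# The target Deployment name and update mode are assumptions for demonstration.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: demo-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app            # hypothetical workload to scale vertically
  updatePolicy:
    updateMode: "Auto"        # "Off"/"Initial"/"Auto"; Auto lets VPA apply new requests (restarting Pods)
```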
Resource type: the type of custom metric to monitor, either Pod or Service.
Data type: the method used to calculate the metric value, either target value or target average value; when the resource type is Pod, only the target average value can be used.
Total number of pending requests + number of requests that can be accepted beyond the target concurrency > target concurrency per Pod × number of Pods
Run the command below to test, and observe the scaled-out Pods with kubectl get pods -A -w.
API security: whether insecure API versions are enabled, whether appropriate RBAC roles and permission restrictions are set, etc.
A PersistentVolume (PV) is a piece of storage in the cluster that can be provisioned in advance by an administrator or provisioned dynamically using a StorageClass. A PV is a cluster resource with an independent lifecycle; it is not deleted when the Pod process ends. Mounting a PV into a workload makes the workload's data persistent. A PV holds the data directory that can be accessed by the containers in a Pod.
HostPath: uses a file or directory on the node's filesystem as the volume. It does not support Pod scheduling based on node affinity.
ReadWriteOncePod: the volume can be mounted read-write by a single Pod.
Reclaim policy:
Retain: the PV is not deleted; its status only changes to released, and the user must reclaim it manually. For how to reclaim manually, refer to Persistent Volumes.
Filesystem: the volume is mounted into a directory by the Pod. If the volume is backed by a block device and the device is currently empty, a filesystem is created on the device before the volume is mounted for the first time.
Block: the volume is used as a raw block device. This kind of volume is presented to the Pod as a block device with no filesystem on it, allowing the Pod to access the volume faster.
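A minimal PV sketch tying these fields together is shown below; the NFS server address, path, and capacity are assumptions for demonstration only:

```yaml
# Illustrative PersistentVolume; the NFS server, path, and capacity are assumptions for demonstration.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: demo-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem                  # or Block for a raw block device
  accessModes:
    - ReadWriteMany                       # e.g. ReadWriteOnce / ReadOnlyMany / ReadWriteOncePod
  persistentVolumeReclaimPolicy: Retain   # a released PV must be reclaimed manually
  nfs:
    server: 10.0.0.10                     # hypothetical NFS server
    path: /nfsdata
```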
The AI platform's container management module supports sharing a storage pool with multiple namespaces to improve resource utilization.
Configure the parameters of the containers in the Pod, add environment variables to the Pod, pass configuration, and so on. For details, refer to Configuring Container Environment Variables.
Timeout: when this time is exceeded, the job is marked as failed and all Pods under the job are deleted. Leaving it empty means no timeout is set. The default value is 360 s.
A DaemonSet ensures that a replica of a Pod runs on all (or some) nodes by using node affinity and taints. For nodes newly joining the cluster, the DaemonSet automatically deploys the corresponding Pod on the new node and tracks the Pod's running status. When a node is removed, the DaemonSet deletes all Pods it created.
Configure the parameters of the containers in the Pod, add environment variables to the Pod, pass configuration, and so on. For details, refer to Configuring Container Environment Variables.
In some scenarios, an application produces redundant DNS queries. Kubernetes provides DNS-related configuration options that, in certain scenarios, can effectively reduce redundant DNS queries and increase business throughput.
DNS policy
Default: the container uses the domain name resolution file pointed to by kubelet's --resolv-conf parameter. This configuration can only resolve external domain names registered on the Internet, cannot resolve cluster-internal domain names, and produces no invalid DNS queries.
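A hedged sketch of a Pod-level DNS configuration is shown below; the policy choice, nameserver address, and options are illustrative assumptions, not recommended values:

```yaml
# Illustrative Pod-level DNS configuration; the policy and options are assumptions for demonstration.
apiVersion: v1
kind: Pod
metadata:
  name: demo-dns-pod
spec:
  containers:
    - name: app
      image: nginx:1.25             # placeholder image
  dnsPolicy: "None"                 # other values include Default, ClusterFirst, ClusterFirstWithHostNet
  dnsConfig:
    nameservers:
      - 10.96.0.10                  # hypothetical cluster DNS address
    searches:
      - default.svc.cluster.local
    options:
      - name: ndots
        value: "2"                  # lowering ndots can reduce redundant DNS lookups
```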
Maximum unavailable Pods: the maximum number or percentage of Pods that may be unavailable during a workload update; the default is 25%. If it equals the number of instances, there is a risk of service interruption.
Maximum surge: the maximum number or percentage by which the total number of Pods may exceed the desired replica count during the update; the default is 25%.
Minimum ready time: the minimum time a Pod must be ready before it is considered available; the default is 0 seconds.
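The same three settings map onto a Deployment's rolling-update strategy; the sketch below uses a hypothetical workload name and placeholder image:

```yaml
# Illustrative rolling-update settings on a Deployment; the workload name and values are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 4
  minReadySeconds: 10          # a Pod counts as available only after being ready for 10s
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%      # at most 25% of desired Pods may be unavailable during the update
      maxSurge: 25%            # at most 25% extra Pods may be created above the desired count
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: app
          image: nginx:1.25    # placeholder image
```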
Node affinity: constrains which nodes a Pod can be scheduled onto based on node labels.
Workload affinity: constrains which nodes a Pod can be scheduled onto based on the labels of Pods already running on those nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto based on the labels of Pods already running on those nodes.
A stateless workload (Deployment) is a common resource in Kubernetes. It mainly provides declarative updates for Pods and ReplicaSets, supporting elastic scaling, rolling upgrades, and version rollback. You declare the desired Pod state in the Deployment, and the Deployment Controller modifies the current state through ReplicaSets until it matches the declared desired state. A Deployment is stateless and does not support data persistence; it is suitable for deploying stateless applications that do not need to save data and can be restarted or rolled back at any time.
Through the AI platform's container management module, workloads across multiple clouds and clusters can be easily managed based on the corresponding role permissions, covering the full lifecycle of stateless workloads: creation, update, deletion, elastic scaling, restart, version rollback, and more.
Configure the parameters of the containers in the Pod, add environment variables to the Pod, pass configuration, and so on. For details, refer to Configuring Container Environment Variables.
DNS configuration: in some scenarios, an application produces redundant DNS queries. Kubernetes provides DNS-related configuration options that, in certain scenarios, can effectively reduce redundant DNS queries and increase business throughput.
DNS policy
Default: the container uses the domain name resolution file pointed to by kubelet's --resolv-conf parameter. This configuration can only resolve external domain names registered on the Internet, cannot resolve cluster-internal domain names, and produces no invalid DNS queries.
Maximum unavailable: the maximum number or percentage of Pods that may be unavailable during a workload update; the default is 25%. If it equals the number of instances, there is a risk of service interruption.
Maximum surge: the maximum number or percentage by which the total number of Pods may exceed the desired replica count during the update; the default is 25%.
Minimum ready time: the minimum time a Pod must be ready before it is considered available; the default is 0 seconds.
Node affinity: constrains which nodes a Pod can be scheduled onto based on node labels.
Workload affinity: constrains which nodes a Pod can be scheduled onto based on the labels of Pods already running on those nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto based on the labels of Pods already running on those nodes.
Instances: enter the number of Pod instances for the workload. One Pod instance is created by default.
Configure the parameters of the containers in the Pod, add environment variables to the Pod, pass configuration, and so on. For details, refer to Configuring Container Environment Variables.
Parallelism: the maximum number of Pods allowed to be created simultaneously while the job runs; the parallelism should not exceed the total number of Pods. The default is 1.
Timeout: when this time is exceeded, the job is marked as failed and all Pods under the job are deleted. Leaving it empty means no timeout is set.
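In a Kubernetes Job these fields correspond to parallelism and activeDeadlineSeconds, as sketched below; the job name, image, and numbers are assumptions for demonstration:

```yaml
# Illustrative Job showing parallelism and a timeout; the image and numbers are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: demo-job
spec:
  parallelism: 2                  # at most 2 Pods run at the same time
  completions: 4                  # the Job succeeds after 4 Pods complete
  activeDeadlineSeconds: 360      # the timeout described above; exceeding it marks the Job as failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36     # placeholder image
          command: ["sh", "-c", "echo processing && sleep 30"]
```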
A stateful workload (StatefulSet) is a common resource in Kubernetes. Like a Deployment, it is mainly used to manage the deployment and scaling of a set of Pods. The main difference is that a Deployment is stateless and does not save data, whereas a StatefulSet is stateful and is mainly used to manage stateful applications. In addition, Pods in a StatefulSet have permanent, unchanging IDs, which makes it easy to identify the corresponding Pod when matching storage volumes.
Through the AI platform's container management module, workloads across multiple clouds and clusters can be easily managed based on the corresponding role permissions, covering the full lifecycle of stateful workloads: creation, update, deletion, elastic scaling, restart, version rollback, and more.
Configure the parameters of the containers in the Pod, add environment variables to the Pod, pass configuration, and so on. For details, refer to Configuring Container Environment Variables.
DNS configuration: in some scenarios, an application produces redundant DNS queries. Kubernetes provides DNS-related configuration options that, in certain scenarios, can effectively reduce redundant DNS queries and increase business throughput.
DNS policy
Default: the container uses the domain name resolution file pointed to by kubelet's --resolv-conf parameter. This configuration can only resolve external domain names registered on the Internet, cannot resolve cluster-internal domain names, and produces no invalid DNS queries.
Kubernetes v1.7 and later allow you to set the Pod management policy via .spec.podManagementPolicy, which supports the following two modes:
Ordered policy (OrderedReady): the default Pod management policy. Pods are deployed in order; the stateful workload only starts deploying the next Pod after the previous Pod has been deployed successfully. Pods are deleted in reverse order, with the most recently created deleted first.
Parallel policy (Parallel): creates or deletes containers in parallel, just like Pods of the Deployment type. The StatefulSet controller starts or terminates all containers in parallel, without waiting for a Pod to become Running and Ready or to be fully stopped before starting or terminating other Pods. This option only affects the behavior of scaling operations; it does not affect the order during updates.
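For reference, a hedged StatefulSet sketch with the Parallel policy is shown below; the workload name, headless Service, and image are illustrative assumptions:

```yaml
# Illustrative StatefulSet with the Parallel Pod management policy; names and image are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-statefulset
spec:
  serviceName: demo-headless       # hypothetical headless Service
  replicas: 3
  podManagementPolicy: Parallel    # default is OrderedReady; Parallel only affects scaling order
  selector:
    matchLabels:
      app: demo-statefulset
  template:
    metadata:
      labels:
        app: demo-statefulset
    spec:
      containers:
        - name: app
          image: nginx:1.25        # placeholder image
```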
Node affinity: constrains which nodes a Pod can be scheduled onto based on node labels.
Workload affinity: constrains which nodes a Pod can be scheduled onto based on the labels of Pods already running on those nodes.
Workload anti-affinity: constrains which nodes a Pod cannot be scheduled onto based on the labels of Pods already running on those nodes.
An environment variable is a variable set in the container's runtime environment, used to add environment flags to a Pod or pass configuration. Environment variables can be configured for a Pod in the form of key-value pairs.
The AI platform's container management adds a graphical interface for configuring environment variables for Pods on top of native Kubernetes, supporting the following configuration methods:
Variable/variable reference (Pod Field): uses a Pod field as the value of the environment variable, for example the Pod's name.
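Underneath, a Pod Field reference maps to a fieldRef in the container spec, as in the hedged sketch below; the variable names and image are assumptions for demonstration:

```yaml
# Illustrative container environment variables, including a Pod Field reference; names are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: demo-env-pod
spec:
  containers:
    - name: app
      image: nginx:1.25                 # placeholder image
      env:
        - name: LOG_LEVEL               # plain key-value variable
          value: "info"
        - name: MY_POD_NAME             # variable reference (Pod Field): injects the Pod's own name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
```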
A readiness probe (ReadinessProbe) detects when a container is ready to receive request traffic; a Pod is considered ready only when all containers in it are ready. One use of this signal is to control which Pods serve as backends for a Service: a Pod that is not yet ready is removed from the Service's load balancer.
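A minimal readiness-probe sketch is shown below; the health endpoint, port, and timings are illustrative assumptions:

```yaml
# Illustrative readiness probe; the path, port, and timings are assumptions for demonstration.
apiVersion: v1
kind: Pod
metadata:
  name: demo-readiness-pod
spec:
  containers:
    - name: app
      image: nginx:1.25              # placeholder image
      readinessProbe:
        httpGet:
          path: /healthz             # hypothetical health endpoint
          port: 80
        initialDelaySeconds: 5       # wait 5s after start before probing
        periodSeconds: 10            # probe every 10s; failures remove the Pod from Service endpoints
```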
A Pod follows a predefined lifecycle, starting in the Pending phase; if at least one container in the Pod starts normally, it enters the Running phase. If any container in the Pod terminates in failure, the phase becomes Failed. The following phase values indicate which stage of the lifecycle a Pod is in.

| Value | Description |
| --- | --- |
| Pending | The Pod has been accepted by the system, but one or more containers have not yet been created or run. This phase includes the time spent waiting for the Pod to be scheduled and the time spent downloading images over the network. |
| Running | The Pod has been bound to a node and all containers in the Pod have been created. At least one container is still running, or is starting or restarting. |
| Succeeded | All containers in the Pod have terminated successfully and will not be restarted. |
| Failed | All containers in the Pod have terminated, and at least one container terminated in failure, i.e., it exited with a non-zero status or was terminated by the system. |
| Unknown | The Pod's state cannot be obtained for some reason, usually because communication with the Pod's host failed. |
When creating a workload in the AI platform's container management, an image is usually used to specify the running environment of the container. By default, when an image is built, the Entrypoint and CMD fields define the command and arguments executed when the container runs. If you need to change the command and arguments run before the container image starts, after it starts, or before it stops, you can override the image's defaults by setting the container's lifecycle event commands and arguments.
The AI platform provides two processing types, command-line script and HTTP request, for configuring the post-start command. You can choose the configuration method that suits you from the table below.
The AI platform provides two processing types, command-line script and HTTP request, for configuring the pre-stop command. You can choose the configuration method that suits you from the table below.
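As a hedged illustration of the two processing types, the sketch below configures a command-line postStart hook and an HTTP preStop hook; the commands, endpoint path, and image are assumptions for demonstration only:

```yaml
# Illustrative postStart and preStop hooks; the commands, URL path, and image are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: demo-lifecycle-pod
spec:
  containers:
    - name: app
      image: nginx:1.25                        # placeholder image
      lifecycle:
        postStart:
          exec:                                # command-line script type
            command: ["sh", "-c", "echo started > /tmp/started"]
        preStop:
          httpGet:                             # HTTP request type
            path: /shutdown                    # hypothetical endpoint called before the container stops
            port: 80
```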
In a Kubernetes cluster, nodes also have labels. You can add labels manually, and Kubernetes adds some standard labels to every node in the cluster; see Well-Known Labels, Annotations and Taints for common node labels. By adding labels to nodes, you can have Pods scheduled onto specific nodes or node groups. You can use this feature to ensure that particular Pods run only on nodes with certain isolation, security, or regulatory properties.
nodeSelector is the simplest recommended form of node selection constraint. You can add the nodeSelector field to a Pod's spec and set the node labels you want the target node to have; Kubernetes only schedules the Pod onto nodes that carry every specified label. nodeSelector provides the simplest way to constrain a Pod to nodes with specific labels, while affinity and anti-affinity extend the types of constraints you can define. Some benefits of using affinity and anti-affinity are:
You can mark a rule as a "soft requirement" or "preference", so that the scheduler ignores the affinity/anti-affinity rule when no matching node can be found and still schedules the Pod.
You can enforce scheduling constraints using the labels of other Pods running on a node (or in another topology domain), instead of only the node's own labels. This lets you define rules about which Pods can be placed together.
You can choose the nodes on which a Pod is deployed by setting affinity and anti-affinity.
Workload affinity mainly determines with which Pods a workload's Pods may be deployed in the same topology domain. For example, services that communicate with each other can be deployed into the same topology domain (such as the same availability zone) through application affinity scheduling, reducing the network latency between them.
Workload anti-affinity mainly determines with which Pods a workload's Pods may not be deployed in the same topology domain. For example, spreading identical Pods of a workload across different topology domains (such as different hosts) improves the stability of the workload itself.
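The hedged sketch below combines a nodeSelector with workload anti-affinity that spreads replicas across hosts; the labels, topology key choice, and image are assumptions for demonstration:

```yaml
# Illustrative nodeSelector plus workload anti-affinity; labels and topology key are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-web
  template:
    metadata:
      labels:
        app: demo-web
    spec:
      nodeSelector:
        disktype: ssd                      # schedule only onto nodes carrying this label
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: demo-web            # spread replicas of this workload ...
              topologyKey: kubernetes.io/hostname   # ... across different hosts
      containers:
        - name: app
          image: nginx:1.25                # placeholder image
```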
A Pod is the smallest computing unit created and managed in Kubernetes, i.e., a collection of containers. These containers share storage and network, as well as the policies that control how the containers run. Pods are usually not created directly by users but through workload resources. A Pod follows a predefined lifecycle, starting in the Pending phase; if at least one of its primary containers starts normally, it enters Running, and afterwards, depending on whether any container in the Pod terminated in failure, it enters the Succeeded or Failed phase.
Based on factors such as Pod status and replica count, the fifth-generation container management module designed a built-in set of workload lifecycle states so that users can perceive the running status of workloads more accurately. Because different workload types (for example stateless workloads and jobs) manage Pods differently, different workloads show different lifecycle states while running, as shown in the following table:
Congratulations, you have successfully entered the AI platform; you can now start your AI journey.
Create an AI Workload Using GPU Resources
After the administrator allocates resource quotas to the workspace, users can create AI workloads to use GPU computing resources.
AI Lab provides a job scheduler to help you better manage jobs. In addition to the basic scheduler, it also supports custom schedulers.
"},{"location":"en/admin/baize/best-practice/add-scheduler.html#introduction-to-job-scheduler","title":"Introduction to Job Scheduler","text":"
In Kubernetes, the job scheduler is responsible for deciding which node to assign a Pod to. It considers various factors such as resource requirements, hardware/software constraints, affinity/anti-affinity rules, and data locality.
The default scheduler is a core component in a Kubernetes cluster that decides which node a Pod should run on. Let's delve into its working principles, features, and configuration methods.
In addition to basic job scheduling capabilities, we also support the use of Scheduler Plugins: Kubernetes SIG Scheduling, which maintains a set of scheduler plugins including Coscheduling (Gang Scheduling) and other features.
To deploy a secondary scheduler plugin in a worker cluster, refer to Deploying Secondary Scheduler Plugin.
"},{"location":"en/admin/baize/best-practice/add-scheduler.html#enable-scheduler-plugins-in-ai-lab","title":"Enable Scheduler Plugins in AI Lab","text":"
Danger
Improper operations when adding scheduler plugins may affect the stability of the entire cluster. It is recommended to test in a test environment or contact our technical support team.
Note that if you wish to use more scheduler plugins in training jobs, you need to manually install them successfully in the worker cluster first. Then, when deploying the baize-agent in the cluster, add the proper scheduler plugin configuration.
Through the Helm Apps UI provided by container management, you can easily deploy scheduler plugins in the cluster.
Then, click Install in the top right corner. (If the baize-agent has already been deployed, you can update it in the Helm App list.) Add the scheduler.
Note the parameter hierarchy of the scheduler. After adding, click OK .
Note: Do not omit this configuration when updating the baize-agent in the future.
"},{"location":"en/admin/baize/best-practice/add-scheduler.html#specify-scheduler-when-creating-a-job","title":"Specify Scheduler When Creating a Job","text":"
Once you have successfully deployed the proper scheduler plugin in the cluster and correctly added the proper scheduler configuration in the baize-agent, you can specify the scheduler when creating a job.
If everything is set up correctly, you will see the scheduler plugin you deployed in the scheduler dropdown menu.
This concludes the instructions for configuring and using the scheduler options in AI Lab.
In the Notebook, multiple available base images are provided by default for developers to choose from. In most cases, this will meet the developers' needs.
DaoCloud provides a default Notebook image that contains all necessary development tools and resources.
```
baize/baize-notebook
```
This Notebook includes basic development tools. Taking baize-notebook:v0.5.0 (May 30, 2024) as an example, the relevant dependencies and versions are as follows:
| Dependency | Version | Description |
| --- | --- | --- |
| Ubuntu | 22.04.3 | Default OS |
| Python | 3.11.6 | Default Python version |
| pip | 23.3.1 | |
| conda (mamba) | 23.3.1 | |
| jupyterlab | 3.6.6 | JupyterLab image, providing a complete Notebook experience |
| codeserver | v4.89.1 | Mainstream Code development tool for a familiar experience |
| *baizectl | v0.5.0 | DaoCloud built-in CLI task management tool |
| *SSH | - | Supports local SSH direct access to the Notebook container |
| *kubectl | v1.27 | Kubernetes CLI for managing container resources within the Notebook |
Note
With each version iteration, the AI platform proactively maintains and updates this image.
However, sometimes users may need custom images. This page explains how to update images and add them to the Notebook creation interface for selection.
Building a new image requires using baize-notebook as the base image to ensure the Notebook runs properly.
When building a custom image, it is recommended to first understand the Dockerfile of the baize-notebook image to better understand how to build a custom image.
"},{"location":"en/admin/baize/best-practice/change-notebook-image.html#dockerfile-for-baize-notebook","title":"Dockerfile for baize-notebook","text":"
"},{"location":"en/admin/baize/best-practice/change-notebook-image.html#add-to-the-notebook-image-list-helm","title":"Add to the Notebook Image List (Helm)","text":"
Warning
Note that this must be done by the platform administrator. Be cautious with changes.
Currently, the image selector needs to be modified by updating the Helm parameters of baize. The specific steps are as follows:
In the Helm Apps list of the kpanda-global-cluster global management cluster, find baize, enter the update page, and modify the Notebook image in the YAML parameters:
Note the parameter modification path global.config.notebook_images:
```yaml
...
global:
  ...
  config:
    notebook_images:
      ...
      names: release.daocloud.io/baize/baize-notebook:v0.5.0
      # Add your image information here
```
After the update is completed and the Helm App restarts successfully, you can see the new image in the Notebook creation interface image selection.
"},{"location":"en/admin/baize/best-practice/checkpoint.html","title":"Checkpoint Mechanism and Usage","text":"
In practical deep learning scenarios, model training typically runs for an extended period, which places higher demands on the stability and efficiency of distributed training jobs. Moreover, unexpected interruptions during training can cause the model state to be lost and force the training process to start over. This wastes time and resources, which is especially costly in LLM training, and can also hurt the resulting model quality.
The ability to save the model state during training, so that it can be restored in case of an interruption, becomes crucial. Checkpointing is the mainstream solution to this problem. This article will introduce the basic concepts of the Checkpoint mechanism and its usage in PyTorch and TensorFlow.
"},{"location":"en/admin/baize/best-practice/checkpoint.html#what-is-a-checkpoint","title":"What is a Checkpoint?","text":"
A checkpoint is a mechanism for saving the state of a model during training. By periodically saving checkpoints, you can restore the model in the following situations:
Training interruption (e.g., system crash or manual interruption)
TensorFlow provides the tf.train.Checkpoint class to manage the saving and restoring of models and optimizers.
"},{"location":"en/admin/baize/best-practice/checkpoint.html#save-checkpoints-in-tensorflow","title":"Save Checkpoints in TensorFlow","text":"
Here is an example of saving a checkpoint in TensorFlow:
```python
import tensorflow as tf

# Assume you have a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, input_shape=(10,))
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Define checkpoint
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
checkpoint_dir = './checkpoints'
checkpoint_prefix = f'{checkpoint_dir}/ckpt'

# Train the model...
# Save checkpoint
checkpoint.save(file_prefix=checkpoint_prefix)
```
Note
Users of AI Lab can directly mount high-performance storage as the checkpoint directory to improve the speed of saving and restoring checkpoints.
"},{"location":"en/admin/baize/best-practice/checkpoint.html#restore-checkpoints-in-tensorflow","title":"Restore Checkpoints in TensorFlow","text":"
Load the checkpoint and restore the model and optimizer state:
```python
# Restore checkpoint
latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
checkpoint.restore(latest_checkpoint)

# Continue training or inference...
```
"},{"location":"en/admin/baize/best-practice/checkpoint.html#manage-checkpoints-in-distributed-training-with-tensorflow","title":"Manage Checkpoints in Distributed Training with TensorFlow","text":"
In distributed training, TensorFlow manages checkpoints primarily through the following methods:
Using tf.train.Checkpoint and tf.train.CheckpointManager
Regular Saving : Determine a suitable saving frequency based on training time and resource consumption, such as every epoch or every few training steps.
Save Multiple Checkpoints : Keep the latest few checkpoints to prevent issues like file corruption or inapplicability.
Record Metadata : Save additional information in the checkpoint, such as the epoch number and loss value, to better restore the training state.
Use Version Control : Save checkpoints for different experiments to facilitate comparison and reuse.
Validation and Testing : Use checkpoints for validation and testing at different training stages to ensure model performance and stability.
The checkpoint mechanism plays a crucial role in deep learning training. By effectively using the checkpoint features in PyTorch and TensorFlow, you can significantly improve the reliability and efficiency of training. The methods and best practices described in this article should help you better manage the training process of deep learning models.
"},{"location":"en/admin/baize/best-practice/deploy-nfs-in-worker.html","title":"Deploy NFS for Preloading Dataset","text":"
A Network File System (NFS) allows remote hosts to mount file systems over a network and interact with those file systems as though they are mounted locally. This enables system administrators to consolidate resources onto centralized servers on the network.
Dataset is a core feature provided by AI Lab. By abstracting the dependency on data throughout the entire lifecycle of MLOps into datasets, users can manage various types of data in datasets so that training tasks can directly use the data in the dataset.
When remote data is not within the worker cluster, datasets provide the capability to automatically preheat data, supporting data preloading from sources such as Git, S3, and HTTP to the local cluster.
A storage service supporting the ReadWriteMany mode is needed for preloading remote data for the dataset, and it is recommended to deploy NFS within the cluster.
This article mainly introduces how to quickly deploy an NFS service and add it as a StorageClass for the cluster.
Installing csi-driver-nfs requires the use of Helm, please ensure it is installed beforehand.
```shell
# Add Helm repository
helm repo add csi-driver-nfs https://mirror.ghproxy.com/https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm repo update csi-driver-nfs

# Deploy csi-driver-nfs
# The parameters here mainly optimize the image address to accelerate downloads in China
helm upgrade --install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
    --set image.nfs.repository=k8s.m.daocloud.io/sig-storage/nfsplugin \
    --set image.csiProvisioner.repository=k8s.m.daocloud.io/sig-storage/csi-provisioner \
    --set image.livenessProbe.repository=k8s.m.daocloud.io/sig-storage/livenessprobe \
    --set image.nodeDriverRegistrar.repository=k8s.m.daocloud.io/sig-storage/csi-node-driver-registrar \
    --namespace nfs \
    --version v4.5.0
```
Warning
Not all images of csi-nfs-controller support helm parameters, so the image field of the deployment needs to be manually modified. Change image: registry.k8s.io to image: k8s.dockerproxy.com to accelerate downloads in China.
Create a dataset and set the dataset's associated storage class and preloading method to NFS to preheat remote data into the cluster.
After the dataset is successfully created, you can see that the dataset's status is preloading, and you can start using it after the preloading is completed.
Run the following command to install the NFS client:
```shell
sudo yum install nfs-utils
```
Check the NFS server configuration to ensure that the NFS server is running and configured correctly. You can try mounting manually to test:
```shell
sudo mkdir -p /mnt/test
sudo mount -t nfs <nfs-server>:/nfsdata /mnt/test
```
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html","title":"Fine-tune the ChatGLM3 Model by Using AI Lab","text":"
This page uses the ChatGLM3 model as an example to demonstrate how to use LoRA (Low-Rank Adaptation) to fine-tune the ChatGLM3 model within the AI Lab environment. The demo program is from the ChatGLM3 official example.
GPU with at least 20GB memory, recommended RTX4090 or NVIDIA A/H series
At least 200GB of available disk space
At least 8-core CPU, recommended 16-core
64GB RAM, recommended 128GB
Info
Before starting, ensure AI platform and AI Lab are correctly installed, GPU queue resources are successfully initialized, and computing resources are sufficient.
Utilize the dataset management feature provided by AI Lab to quickly preheat and persist the data required for fine-tuning large models, reducing GPU resource occupation due to data preparation, and improving resource utilization efficiency.
Create the required data resources on the dataset list page. These resources include the ChatGLM3 code and data files, all of which can be managed uniformly through the dataset list.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#code-and-model-files","title":"Code and Model Files","text":"
ChatGLM3 is a dialogue pre-training model jointly released by zhipuai.cn and Tsinghua University KEG Lab.
First, pull the ChatGLM3 code repository and download the pre-training model for subsequent fine-tuning tasks.
AI Lab will automatically preheat the data in the background to ensure quick data access for subsequent tasks.
You also need to prepare an empty dataset to store the model files output after the fine-tuning task is completed. Here, create an empty dataset, using PVC as an example.
Warning
Ensure to use a storage type that supports ReadWriteMany to allow quick access to resources for subsequent tasks.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#set-up-environment","title":"Set up Environment","text":"
For model developers, preparing the Python environment dependencies required for model development is crucial. Traditionally, environment dependencies are either packaged directly into the development tool's image or installed in the local environment, which can lead to inconsistency in environment dependencies and difficulties in managing and updating dependencies.
AI Lab provides environment management capabilities, decoupling Python environment dependency package management from development tools and task images, solving dependency management chaos and environment inconsistency issues.
Here, use the environment management feature provided by AI Lab to create the environment required for ChatGLM3 fine-tuning for subsequent use.
Warning
The ChatGLM3 repository contains a requirements.txt file that includes the environment dependencies required for ChatGLM3 fine-tuning.
This fine-tuning does not use the deepspeed and mpi4py packages. It is recommended to comment them out in the requirements.txt file to avoid compilation failures.
In the environment management list, you can quickly create a Python environment and complete the environment creation through a simple form configuration; a Python 3.11.x environment is required here.
Since CUDA is required for this experiment, GPU resources need to be configured here to preheat the necessary resource dependencies.
Creating the environment involves downloading a series of Python dependencies, and download speeds may vary based on your location. Using a domestic mirror for acceleration can speed up the download.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#use-notebook-as-ide","title":"Use Notebook as IDE","text":"
AI Lab provides Notebook as an IDE feature, allowing users to write, run, and view code results directly in the browser. This is very suitable for development in data analysis, machine learning, and deep learning fields.
You can use the JupyterLab Notebook provided by AI Lab for the ChatGLM3 fine-tuning task.
In the Notebook list, you can create a Notebook according to the page operation guide. Note that you need to configure the proper Notebook resource parameters according to the resource requirements mentioned earlier to avoid resource issues affecting the fine-tuning process.
Note
When creating a Notebook, you can directly mount the preloaded model code dataset and environment, greatly saving data preparation time.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#mount-dataset-and-code","title":"Mount Dataset and Code","text":"
Note: The ChatGLM3 code files are mounted to the /home/jovyan/ChatGLM3 directory, and you also need to mount the AdvertiseGen dataset to the /home/jovyan/ChatGLM3/finetune_demo/data/AdvertiseGen directory to allow the fine-tuning task to access the data.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#mount-pvc-to-model-output-folder","title":"Mount PVC to Model Output Folder","text":"
The model output location used this time is the /home/jovyan/ChatGLM3/finetune_demo/output directory. You can mount the previously created PVC dataset to this directory, so the trained model can be saved to the dataset for subsequent inference tasks.
After creation, you can see the Notebook interface where you can write, run, and view code results directly in the Notebook.
Once in the Notebook, you can find the previously mounted dataset and code in the File Browser option in the Notebook sidebar. Locate the ChatGLM3 folder.
You will find the fine-tuning code for ChatGLM3 in the finetune_demo folder. Open the lora_finetune.ipynb file, which contains the fine-tuning code for ChatGLM3.
First, follow the instructions in the README.md file to understand the entire fine-tuning process. It is recommended to read it thoroughly to ensure that the basic environment dependencies and data preparation work are completed.
Open the terminal and use conda to switch to the preheated environment, ensuring consistency with the JupyterLab Kernel for subsequent code execution.
First, preprocess the AdvertiseGen dataset, standardizing the data to meet the LoRA fine-tuning format requirements. Save the processed data to the AdvertiseGen_fix folder.
import json\nfrom typing import Union\nfrom pathlib import Path\n\ndef _resolve_path(path: Union[str, Path]) -> Path:\n return Path(path).expanduser().resolve()\n\ndef _mkdir(dir_name: Union[str, Path]):\n dir_name = _resolve_path(dir_name)\n if not dir_name.is_dir():\n dir_name.mkdir(parents=True, exist_ok=False)\n\ndef convert_adgen(data_dir: Union[str, Path], save_dir: Union[str, Path]):\n def _convert(in_file: Path, out_file: Path):\n _mkdir(out_file.parent)\n with open(in_file, encoding='utf-8') as fin:\n with open(out_file, 'wt', encoding='utf-8') as fout:\n for line in fin:\n dct = json.loads(line)\n sample = {'conversations': [{'role': 'user', 'content': dct['content']},\n {'role': 'assistant', 'content': dct['summary']}]}\n fout.write(json.dumps(sample, ensure_ascii=False) + '\\n')\n\n data_dir = _resolve_path(data_dir)\n save_dir = _resolve_path(save_dir)\n\n train_file = data_dir / 'train.json'\n if train_file.is_file():\n out_file = save_dir / train_file.relative_to(data_dir)\n _convert(train_file, out_file)\n\n dev_file = data_dir / 'dev.json'\n if dev_file.is_file():\n out_file = save_dir / dev_file.relative_to(data_dir)\n _convert(dev_file, out_file)\n\nconvert_adgen('data/AdvertiseGen', 'data/AdvertiseGen_fix')\n
To save debugging time, you can reduce the number of entries in /home/jovyan/ChatGLM3/finetune_demo/data/AdvertiseGen_fix/dev.json to 50. The data is in JSON format, making it easy to process.
After preprocessing the data, you can proceed with the fine-tuning test. Configure the fine-tuning parameters in the /home/jovyan/ChatGLM3/finetune_demo/configs/lora.yaml file and pay particular attention to the key training parameters in that file.
Open a new terminal window and use the following command for local fine-tuning testing. Ensure that the parameter configurations and paths are correct:
finetune_hf.py is the fine-tuning script in the ChatGLM3 code
data/AdvertiseGen_fix is your preprocessed dataset
./chatglm3-6b is your pre-trained model path
configs/lora.yaml is the fine-tuning configuration file
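Putting these pieces together, a sketch of the local test command is shown below. It assumes the terminal is opened in the finetune_demo directory and that the ChatGLM3-6B weights are available at ./chatglm3-6b; follow the repository's README.md for the authoritative invocation:

```bash
cd /home/jovyan/ChatGLM3/finetune_demo
python finetune_hf.py data/AdvertiseGen_fix ./chatglm3-6b configs/lora.yaml
```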
During fine-tuning, you can use the nvidia-smi command to check GPU memory usage:
After fine-tuning is complete, an output directory will be generated in the finetune_demo directory, containing the fine-tuned model files. This way, the fine-tuned model files are saved to the previously created PVC dataset.
After completing the local fine-tuning test and ensuring that your code and data are correct, you can submit the fine-tuning task to the AI Lab for large-scale training and fine-tuning tasks.
Note
This is the recommended model development and fine-tuning process: first, conduct local fine-tuning tests to ensure that the code and data are correct.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#submit-fine-tuning-tasks-via-ui","title":"Submit Fine-tuning Tasks via UI","text":"
Use Pytorch to create a fine-tuning task. Select the cluster resources you need based on your actual situation, ensuring they meet the resource requirements mentioned earlier.
Image: You can directly use the model image provided by baizectl.
Startup command: Based on your experience using LoRA fine-tuning in the Notebook, the code files and data are in the /home/jovyan/ChatGLM3/finetune_demo directory, so you can directly use this path:
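For example, a sketch of the startup command that mirrors the local test command (the model path and config file are the same assumptions as above; adjust them to your environment):

```bash
bash -c "cd /home/jovyan/ChatGLM3/finetune_demo && python finetune_hf.py data/AdvertiseGen_fix ./chatglm3-6b configs/lora.yaml"
```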
After successfully submitting the task, you can view the training progress of the task in real-time in the task list. You can see the task status, resource usage, logs, and other information.
View task logs
After the task is completed, you can view the fine-tuned model files in the data output dataset for subsequent inference tasks.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#submit-tasks-via-baizectl","title":"Submit Tasks via baizectl","text":"
AI Lab's Notebook supports using the baizectl command-line tool without authentication. If you prefer using CLI, you can directly use the baizectl command-line tool to submit tasks.
After completing the fine-tuning task, you can use the fine-tuned model for inference tasks. Here, you can use the inference service provided by AI Lab to create an inference service with the output model.
In the inference service list, you can create a new inference service. When selecting the model, choose the previously output dataset and configure the model path.
Regarding model resource requirements and GPU resource requirements for inference services, configure them based on the model size and inference concurrency. Refer to the resource configuration of the previous fine-tuning tasks.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#configure-model-runtime","title":"Configure Model Runtime","text":"
Configuring the model runtime is crucial. Currently, AI Lab supports vLLM as the model inference service runtime, which can be directly selected.
Tip
vLLM supports a wide range of large language models. Visit vLLM for more information. These models can be easily used within AI Lab.
After creation, you can see the created inference service in the inference service list. The model service list allows you to get the model's access address directly.
"},{"location":"en/admin/baize/best-practice/finetunel-llm.html#test-the-model-service","title":"Test the Model Service","text":"
Try using the curl command in the terminal to test the model service. Here, you can see the returned results, enabling you to use the model service for inference tasks.
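Besides curl, a minimal Python sketch is shown below. It assumes the inference service exposes the OpenAI-compatible chat completions endpoint provided by vLLM; the service address and model name are placeholders to replace with the values shown in the model service list:

```python
import requests

# Placeholder address; use the access address from the inference service details
url = "http://<inference-service-address>/v1/chat/completions"

payload = {
    "model": "chatglm3-6b",  # the model name configured for the inference service
    "messages": [{"role": "user", "content": "Hello, please introduce yourself."}],
}

resp = requests.post(url, json=payload, timeout=60)
print(resp.json())
```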
This page used ChatGLM3 as an example to get you started with AI Lab for model fine-tuning, walking through fine-tuning the ChatGLM3 model with LoRA.
AI Lab provides a wealth of features to help model developers quickly conduct model development, fine-tuning, and inference tasks. It also offers rich OpenAPI interfaces, facilitating integration with third-party application ecosystems.
Refer to the video tutorial: Data Labeling and Dataset Usage Instructions
Label Studio is an open-source data labeling tool used for various machine learning and artificial intelligence jobs. Here is a brief introduction to Label Studio:
Supports labeling of various data types including images, audio, video, and text
Can be used for jobs such as object detection, image classification, speech transcription, and named entity recognition
Provides a customizable labeling interface
Supports various labeling formats and export options
Label Studio offers a powerful data labeling solution for data scientists and machine learning engineers due to its flexibility and rich features.
"},{"location":"en/admin/baize/best-practice/label-studio.html#deploy-to-ai-platform","title":"Deploy to AI platform","text":"
To use Label Studio in AI Lab, it needs to be deployed to the Global Service Cluster. You can quickly deploy it using Helm.
Note
For more deployment details, refer to Deploy Label Studio on Kubernetes.
Enter the Global Service Cluster, find Helm Apps -> Helm Repositories from the left navigation bar, click the Create Repository button, and fill in the following parameters:
After successfully adding the repository, click the ┇ on the right side of the list and select Sync Repository. Wait a moment to complete the synchronization. (This sync operation will also be used for future updates of Label Studio).
Then navigate to the Helm Charts page, search for label-studio, and click the card.
Choose the latest version and configure the installation parameters as shown below, naming it label-studio. It is recommended to create a new namespace. Switch the parameters to YAML and modify the configuration according to the instructions.
global:\n image:\n repository: heartexlabs/label-studio # Configure proxy address here if docker.io is inaccessible\n extraEnvironmentVars:\n LABEL_STUDIO_HOST: https://{Access_Address}/label-studio # Use the AI platform login address, refer to the current webpage URL\n LABEL_STUDIO_USERNAME: {User_Email} # Must be an email, replace with your own\n LABEL_STUDIO_PASSWORD: {User_Password}\napp:\n nginx:\n livenessProbe:\n path: /label-studio/nginx_health\n readinessProbe:\n path: /label-studio/version\n
At this point, the installation of Label Studio is complete.
Warning
By default, PostgreSQL will be installed as the data service middleware. If the image pull fails, it may be because docker.io is inaccessible. Ensure to switch to an available proxy.
If you have your own PostgreSQL data service middleware, you can use the following parameters:
global:\n image:\n repository: heartexlabs/label-studio # Configure proxy address here if docker.io is inaccessible\n extraEnvironmentVars:\n LABEL_STUDIO_HOST: https://{Access_Address}/label-studio # Use the AI platform login address, refer to the current webpage URL\n LABEL_STUDIO_USERNAME: {User_Email} # Must be an email, replace with your own\n LABEL_STUDIO_PASSWORD: {User_Password}\napp:\n nginx:\n livenessProbe:\n path: /label-studio/nginx_health\n readinessProbe:\n path: /label-studio/version\npostgresql:\n enabled: false # Disable the built-in PostgreSQL\nexternalPostgresql:\n host: \"postgres-postgresql\" # PostgreSQL address\n port: 5432\n username: \"label_studio\" # PostgreSQL username\n password: \"your_label_studio_password\" # PostgreSQL password\n database: \"label_studio\" # PostgreSQL database name\n
"},{"location":"en/admin/baize/best-practice/label-studio.html#add-gproduct-to-navigation-bar","title":"Add GProduct to Navigation Bar","text":"
To add Label Studio to the AI platform navigation bar, you can refer to the method in Global Management OEM IN. The following example shows how to add it to the secondary navigation of AI Lab.
The above describes how to add Label Studio and integrate it as a labeling component in AI Lab. By adding labels to the datasets in AI Lab, you can associate labeled data with algorithm development and improve the algorithm development process. For further usage, refer to the relevant documentation.
"},{"location":"en/admin/baize/best-practice/train-with-deepspeed.html","title":"Submit a DeepSpeed Training Task","text":"
According to the DeepSpeed official documentation, it is recommended to modify your code to implement the training task.
Specifically, you can use deepspeed.init_distributed() instead of torch.distributed.init_process_group(...). Then run the command using torchrun to submit it as a PyTorch distributed task, which will allow you to run a DeepSpeed task.
You can use torchrun to run your DeepSpeed training script. torchrun is a utility provided by PyTorch for distributed training. You can combine torchrun with the DeepSpeed API to start your training task.
Below is an example of running a DeepSpeed training script using torchrun:
Write the training script:
train.py
import torch\nimport deepspeed\nfrom torch.utils.data import DataLoader\n\n# Load model and data\nmodel = YourModel()\ntrain_dataset = YourDataset()\ntrain_dataloader = DataLoader(train_dataset, batch_size=32)\n\n# Configure file path\ndeepspeed_config = \"deepspeed_config.json\"\n\n# Create DeepSpeed training engine\nmodel_engine, optimizer, _, _ = deepspeed.initialize(\n model=model,\n model_parameters=model.parameters(),\n config_params=deepspeed_config\n)\n\n# Training loop\nfor batch in train_dataloader:\n loss = model_engine(batch)\n model_engine.backward(loss)\n model_engine.step()\n
Run the training script using torchrun or baizectl:
torchrun train.py\n
In this way, you can combine PyTorch's distributed training capabilities with DeepSpeed's optimization technologies for more efficient training. You can use the baizectl command to submit a job in a notebook:
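A sketch of such a submission is shown below, based on the baizectl flags documented later in this guide; the worker count, GPU resources, and script name are examples to adjust for your cluster:

```bash
baizectl job submit --pytorch \
    --workers 2 \
    --resources nvidia.com/gpu=1 \
    -- torchrun train.py
```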
This document provides a simple guide for users to use the AI Lab platform for the entire development and training process of datasets, Notebooks, and job training.
Click Data Management -> Datasets in the navigation bar, then click Create. Create three datasets as follows:
Code: https://github.com/d-run/drun-samples
For faster access in China, use Gitee: https://gitee.com/samzong_lu/training-sample-code.git
Dataset: the Fashion-MNIST dataset. For faster access in China, use Gitee: https://gitee.com/samzong_lu/fashion-mnist.git
Empty PVC: Create an empty PVC to output the trained model and logs after training.
Note
Currently, only StorageClass with ReadWriteMany mode is supported. Please use NFS or the recommended JuiceFS.
Prepare the development environment by clicking Notebooks in the navigation bar, then click Create. Associate the three datasets created in the previous step and fill in the mount paths as shown in the image below:
Wait for the Notebook to be created successfully, click the access link in the list to enter the Notebook. Execute the following command in the Notebook terminal to start the job training.
Click Job Center -> Jobs in the navigation bar, create a Tensorflow Single job. Refer to the image below for job configuration and enable the Job Analysis (Tensorboard) feature. Click Create and wait for the status to complete.
For large datasets or models, it is recommended to enable GPU configuration in the resource configuration step.
In the job created in the previous step, you can click the specific job analysis to view the job status and optimize the job training.
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html","title":"Create, Use and Delete Datasets","text":"
AI Lab provides comprehensive dataset management functions needed for model development, training, and inference processes. Currently, it supports unified access to various data sources.
With simple configurations, you can connect data sources to AI Lab, achieving unified data management, preloading, dataset management, and other functionalities.
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#create-a-dataset","title":"Create a Dataset","text":"
In the left navigation bar, click Data Management -> Dataset List, and then click the Create button on the right.
Select the worker cluster and namespace to which the dataset belongs, then click Next.
Configure the data source type for the target data, then click OK.
Currently supported data sources include:
GIT: Supports repositories such as GitHub, GitLab, and Gitee
Upon successful creation, the dataset will be returned to the dataset list. You can perform more actions by clicking ┇ on the right.
Info
The system will automatically perform a one-time data preloading after the dataset is successfully created; the dataset cannot be used until the preloading is complete.
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#use-a-dataset","title":"Use a Dataset","text":"
Once the dataset is successfully created, it can be used in tasks such as model training and inference.
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#use-in-notebook","title":"Use in Notebook","text":"
In creating a Notebook, you can directly use the dataset; the usage is as follows:
Use the dataset as training data mount
Use the dataset as code mount
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#use-in-training-obs","title":"Use in Training obs","text":"
Use the dataset to specify job output
Use the dataset to specify job input
Use the dataset to specify TensorBoard output
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#use-in-inference-services","title":"Use in Inference Services","text":"
Use the dataset to mount a model
"},{"location":"en/admin/baize/developer/dataset/create-use-delete.html#delete-a-dataset","title":"Delete a Dataset","text":"
If you find a dataset to be redundant, expired, or no longer needed, you can delete it from the dataset list.
Click the ┇ on the right side of the dataset list, then choose Delete from the dropdown menu.
In the pop-up window, confirm the dataset you want to delete, enter the dataset name, and then click Delete.
A confirmation message will appear indicating successful deletion, and the dataset will disappear from the list.
Caution
Once a dataset is deleted, it cannot be recovered, so please proceed with caution.
Traditionally, Python environment dependencies are built into an image, which includes the Python version and dependency packages. This approach has high maintenance costs and is inconvenient to update, often requiring a complete rebuild of the image.
In AI Lab, users can manage pure environment dependencies through the Environment Management module, decoupling this part from the image. The advantages include:
One environment can be used in multiple places, such as in Notebooks, distributed training tasks, and even inference services.
Updating dependency packages is more convenient; you only need to update the environment dependencies without rebuilding the image.
The main components of the environment management are:
Cluster : Select the cluster to operate on.
Namespace : Select the namespace to limit the scope of operations.
Environment List : Displays all environments and their statuses under the current cluster and namespace.
"},{"location":"en/admin/baize/developer/dataset/environments.html#explanation-of-environment-list-fields","title":"Explanation of Environment List Fields","text":"
Name : The name of the environment.
Status : The current status of the environment (normal or failed). New environments undergo a warming-up process, after which they can be used in other tasks.
Creation Time : The time the environment was created.
"},{"location":"en/admin/baize/developer/dataset/environments.html#creat-new-environment","title":"Creat New Environment","text":"
On the Environment Management interface, click the Create button at the top right to enter the environment creation process.
Fill in the following basic information:
Name : Enter the environment name, with a length of 2-63 characters, starting and ending with lowercase letters or numbers.
Deployment Location:
Cluster : Select the cluster to deploy, such as gpu-cluster.
Namespace : Select the namespace, such as default.
Remarks (optional): Enter remarks.
Labels (optional): Add labels to the environment.
Annotations (optional): Add annotations to the environment. After completing the information, click Next to proceed to environment configuration.
Python Version : Select the required Python version, such as 3.12.3.
Package Manager : Choose the package management tool, either PIP or CONDA.
Environment Data :
If PIP is selected: Enter the dependency package list in requirements.txt format in the editor below.
If CONDA is selected: Enter the dependency package list in environment.yaml format in the editor below.
Other Options (optional):
Additional pip Index URLs : Configure additional pip index URLs; suitable for internal enterprise private repositories or PIP acceleration sites.
GPU Configuration : Enable or disable GPU configuration; some GPU-related dependency packages need GPU resources configured during preloading.
Associated Storage : Select the associated storage configuration; environment dependency packages will be stored in the associated storage. Note: Storage must support ReadWriteMany.
After configuration, click the Create button, and the system will automatically create and configure the new Python environment.
Verify that the Python version and package manager configuration are correct.
Ensure the selected cluster and namespace are available.
If dependency preloading fails:
Check if the requirements.txt or environment.yaml file format is correct.
Verify that the dependency package names and versions are correct. If other issues arise, contact the platform administrator or refer to the platform help documentation for more support.
These are the basic steps and considerations for managing Python dependencies in AI Lab.
With the rapid iteration of AI Lab, we have now supported various model inference services. Here, you can see information about the supported models.
AI Lab v0.3.0 launched model inference services, facilitating users to directly use the inference services of AI Lab without worrying about model deployment and maintenance for traditional deep learning models.
AI Lab v0.6.0 supports the complete version of vLLM inference capabilities, supporting many large language models such as LLama, Qwen, ChatGLM, and more.
Note
The support for inference capabilities is related to the version of AI Lab.
You can use GPU types that have been verified by AI platform in AI Lab. For more details, refer to the GPU Support Matrix.
Through the Triton Inference Server, traditional deep learning models can be well supported. Currently, AI Lab supports mainstream inference backend services:
The use of Triton's Backend vLLM method has been deprecated. It is recommended to use the latest support for vLLM to deploy your large language models.
With vLLM, we can quickly use large language models. Here, you can see the list of models we support, which generally aligns with the vLLM Support Models.
HuggingFace Models: We support most of HuggingFace's models. You can see more models at the HuggingFace Model Hub.
The vLLM Supported Models list includes supported large language models and vision-language models.
Models fine-tuned using the vLLM support framework.
"},{"location":"en/admin/baize/developer/inference/models.html#new-features-of-vllm","title":"New Features of vLLM","text":"
Currently, AI Lab also supports some new features when using vLLM as an inference tool:
Enable LoRA adapters to optimize model inference services during inference.
Provide an OpenAI-compatible API, making it easy for users to switch to local inference services at a low cost and quickly transition.
"},{"location":"en/admin/baize/developer/inference/triton-inference.html","title":"Create Inference Service Using Triton Framework","text":"
The AI Lab currently offers Triton and vLLM as inference frameworks. Users can quickly start a high-performance inference service with simple configurations.
Danger
The use of Triton's Backend vLLM method has been deprecated. It is recommended to use the latest support for vLLM to deploy your large language models.
"},{"location":"en/admin/baize/developer/inference/triton-inference.html#introduction-to-triton","title":"Introduction to Triton","text":"
Triton is an open-source inference server developed by NVIDIA, designed to simplify the deployment and inference of machine learning models. It supports a variety of deep learning frameworks, including TensorFlow and PyTorch, enabling users to easily manage and deploy different types of models.
Prepare model data: Manage the model code in dataset management and ensure that the data is successfully preloaded. The following example illustrates the PyTorch model for mnist handwritten digit recognition.
Note
The model to be inferred must adhere to the following directory structure within the dataset:
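As a sketch, for the mnist example used below, the dataset layout would need to match the model path configured later; the version directory 1 and the model file name follow Triton's model repository convention:

```
model-repo/
└── mnist-cnn/
    └── 1/
        └── model.pt
```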
Currently, form-based creation is supported, allowing you to create services with field prompts in the interface.
"},{"location":"en/admin/baize/developer/inference/triton-inference.html#configure-model-path","title":"Configure Model Path","text":"
The model path model-repo/mnist-cnn/1/model.pt must be consistent with the directory structure of the dataset.
"},{"location":"en/admin/baize/developer/inference/triton-inference.html#model-configuration","title":"Model Configuration","text":""},{"location":"en/admin/baize/developer/inference/triton-inference.html#configure-input-and-output-parameters","title":"Configure Input and Output Parameters","text":"
Note
The first dimension of the input and output parameters defaults to batchsize, setting it to -1 allows for the automatic calculation of the batchsize based on the input inference data. The remaining dimensions and data type must match the model's input.
Send HTTP POST Request: Use tools like curl or HTTP client libraries (e.g., Python's requests library) to send POST requests to the Triton Server.
Set HTTP Headers: These are generated automatically based on user settings and include metadata about the model inputs and outputs in the HTTP headers.
Construct Request Body: The request body usually contains the input data for inference and model-specific metadata.
<ip> is the host address where the Triton Inference Server is running.
<port> is the port where the Triton Inference Server is running.
<inference-name> is the name of the inference service that has been created.
\"name\" must match the name of the input parameter in the model configuration.
\"shape\" must match the dims of the input parameter in the model configuration.
\"datatype\" must match the Data Type of the input parameter in the model configuration.
\"data\" should be replaced with the actual inference data.
Please note that the above example code needs to be adjusted according to your specific model and environment. The format and content of the input data must also comply with the model's requirements.
"},{"location":"en/admin/baize/developer/inference/vllm-inference.html","title":"Create Inference Service Using vLLM Framework","text":"
AI Lab supports using vLLM as an inference service, offering all the capabilities of vLLM while fully adapting to the OpenAI interface definition.
"},{"location":"en/admin/baize/developer/inference/vllm-inference.html#introduction-to-vllm","title":"Introduction to vLLM","text":"
vLLM is a fast and easy-to-use library for inference and services. It aims to significantly improve the throughput and memory efficiency of language model services in real-time scenarios. vLLM boasts several features in terms of speed and flexibility:
Continuous batching of incoming requests.
Efficiently manages attention keys and values memory using PagedAttention.
Seamless integration with popular HuggingFace models.
Select the vLLM inference framework. In the model module selection, choose the pre-created model dataset hdd-models and fill in the path information where the model is located within the dataset.
This guide uses the ChatGLM3 model for creating the inference service.
Configure the resources for the inference service and adjust the parameters for running the inference service.
| Parameter Name | Description |
| --- | --- |
| GPU Resources | Configure GPU resources for inference based on the model scale and cluster resources. |
| Allow Remote Code | Controls whether vLLM trusts and executes code from remote sources. |
| LoRA | LoRA is a parameter-efficient fine-tuning technique for deep learning models. It reduces the number of parameters and computational complexity by decomposing the original model parameter matrix into low-rank matrices. 1. --lora-modules: Specifies specific modules or layers for low-rank approximation. 2. max_loras_rank: Specifies the maximum rank for each adapter layer in the LoRA model. For simpler tasks, a smaller rank value can be chosen, while more complex tasks may require a larger rank value to ensure model performance. 3. max_loras: Indicates the maximum number of LoRA layers that can be included in the model, customized based on model size and inference complexity. 4. max_cpu_loras: Specifies the maximum number of LoRA layers that can be handled in a CPU environment. |
| Associated Environment | Selects predefined environment dependencies required for inference. |
Info
For models that support LoRA parameters, refer to vLLM Supported Models.
In the Advanced Configuration , support is provided for automated affinity scheduling based on GPU resources and other node configurations. Users can also customize scheduling policies.
Once the inference service is created, click the name of the inference service to enter the details and view the API call methods. Verify the execution results using Curl, Python, and Node.js.
Copy the curl command from the details and execute it in the terminal to send a model inference request. If the service is running correctly, the command returns the model's generated reply.
Job management refers to the functionality of creating and managing job lifecycles through job scheduling and control components.
AI platform Smart Computing Capability adopts Kubernetes' Job mechanism to schedule various AI inference and training jobs.
Click Job Center -> Jobs in the left navigation bar to enter the job list. Click the Create button on the right.
The system will pre-fill basic configuration data, including the cluster, namespace, type, queue, and priority. Adjust these parameters and click Next.
Configure the URL, runtime parameters, and associated datasets, then click Next.
Optionally add labels, annotations, runtime env variables, and other job parameters. Select a scheduling policy and click Confirm.
After the job is successfully created, it will have several running statuses:
Pytorch is an open-source deep learning framework that provides a flexible environment for training and deployment. A Pytorch job is a job that uses the Pytorch framework.
In the AI Lab platform, we provide support and adaptation for Pytorch jobs. Through a graphical interface, you can quickly create Pytorch jobs and perform model training.
Here we use the baize-notebook base image and the associated environment as the basic runtime environment for the job.
To learn how to create an environment, refer to Environments.
"},{"location":"en/admin/baize/developer/jobs/pytorch.html#create-jobs","title":"Create Jobs","text":""},{"location":"en/admin/baize/developer/jobs/pytorch.html#pytorch-single-jobs","title":"Pytorch Single Jobs","text":"
Log in to the AI Lab platform, click Job Center in the left navigation bar to enter the Jobs page.
Click the Create button in the upper right corner to enter the job creation page.
Select the job type as Pytorch Single and click Next .
Fill in the job name and description, then click OK .
Once the job is successfully submitted, we can enter the job details to see the resource usage. From the upper right corner, go to Workload Details to view the log output during the training process.
import os\nimport torch\nimport torch.distributed as dist\nimport torch.nn as nn\nimport torch.optim as optim\nfrom torch.nn.parallel import DistributedDataParallel as DDP\n\nclass SimpleModel(nn.Module):\n def __init__(self):\n super(SimpleModel, self).__init__()\n self.fc = nn.Linear(10, 1)\n\n def forward(self, x):\n return self.fc(x)\n\ndef train():\n # Print environment information\n print(f'PyTorch version: {torch.__version__}')\n print(f'CUDA available: {torch.cuda.is_available()}')\n if torch.cuda.is_available():\n print(f'CUDA version: {torch.version.cuda}')\n print(f'CUDA device count: {torch.cuda.device_count()}')\n\n rank = int(os.environ.get('RANK', '0'))\n world_size = int(os.environ.get('WORLD_SIZE', '1'))\n\n print(f'Rank: {rank}, World Size: {world_size}')\n\n # Initialize distributed environment\n try:\n if world_size > 1:\n dist.init_process_group('nccl')\n print('Distributed process group initialized successfully')\n else:\n print('Running in non-distributed mode')\n except Exception as e:\n print(f'Error initializing process group: {e}')\n return\n\n # Set device\n try:\n if torch.cuda.is_available():\n device = torch.device(f'cuda:{rank % torch.cuda.device_count()}')\n print(f'Using CUDA device: {device}')\n else:\n device = torch.device('cpu')\n print('CUDA not available, using CPU')\n except Exception as e:\n print(f'Error setting device: {e}')\n device = torch.device('cpu')\n print('Falling back to CPU')\n\n try:\n model = SimpleModel().to(device)\n print('Model moved to device successfully')\n except Exception as e:\n print(f'Error moving model to device: {e}')\n return\n\n try:\n if world_size > 1:\n ddp_model = DDP(model, device_ids=[rank % torch.cuda.device_count()] if torch.cuda.is_available() else None)\n print('DDP model created successfully')\n else:\n ddp_model = model\n print('Using non-distributed model')\n except Exception as e:\n print(f'Error creating DDP model: {e}')\n return\n\n loss_fn = nn.MSELoss()\n optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)\n\n # Generate some random data\n try:\n data = torch.randn(100, 10, device=device)\n labels = torch.randn(100, 1, device=device)\n print('Data generated and moved to device successfully')\n except Exception as e:\n print(f'Error generating or moving data to device: {e}')\n return\n\n for epoch in range(10):\n try:\n ddp_model.train()\n outputs = ddp_model(data)\n loss = loss_fn(outputs, labels)\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n if rank == 0:\n print(f'Epoch {epoch}, Loss: {loss.item():.4f}')\n except Exception as e:\n print(f'Error during training epoch {epoch}: {e}')\n break\n\n if world_size > 1:\n dist.destroy_process_group()\n\nif __name__ == '__main__':\n train()\n
"},{"location":"en/admin/baize/developer/jobs/pytorch.html#number-of-job-replicas","title":"Number of Job Replicas","text":"
Note that Pytorch Distributed training jobs will create a group of Master and Worker training Pods, where the Master is responsible for coordinating the training job, and the Worker is responsible for the actual training work.
Note
In this demonstration: the Master replica count is 1 and the Worker replica count is 2. Therefore, we need to set the replica count to 3 in the Job Configuration , which is the sum of the Master and Worker replica counts. Pytorch will automatically assign the Master and Worker roles.
AI Lab provides important visualization analysis tools for the model development process, used to display the training process and results of machine learning models. This document introduces the basic concepts of Job Analysis (Tensorboard), its usage in the AI Lab system, and how to configure the log content of datasets.
Note
Tensorboard is a visualization tool provided by TensorFlow, used to display the training process and results of machine learning models. It can help developers more intuitively understand the training dynamics of their models, analyze model performance, debug issues, and more.
The role and advantages of Tensorboard in the model development process:
Visualize Training Process : Display metrics such as training and validation loss, and accuracy through charts, helping developers intuitively observe the training effects of the model.
Debug and Optimize Models : By viewing the weights and gradient distributions of different layers, help developers discover and fix issues in the model.
Compare Different Experiments : Simultaneously display the results of multiple experiments, making it convenient for developers to compare the effects of different models and hyperparameter configurations.
Track Training Data : Record the datasets and parameters used during training to ensure the reproducibility of experiments.
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#how-to-create-tensorboard","title":"How to Create Tensorboard","text":"
In the AI Lab system, we provide a convenient way to create and manage Tensorboard. Here are the specific steps:
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#enable-tensorboard-when-creating-a-notebook","title":"Enable Tensorboard When Creating a Notebook","text":"
Create a Notebook : Create a new Notebook on the AI Lab platform.
Enable Tensorboard : On the Notebook creation page, enable the Tensorboard option and specify the dataset and log path.
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#enable-tensorboard-after-creating-and-completing-a-distributed-job","title":"Enable Tensorboard After Creating and Completing a Distributed Job","text":"
Create a Distributed Job : Create a new distributed training job on the AI Lab platform.
Configure Tensorboard : On the job configuration page, enable the Tensorboard option and specify the dataset and log path.
View Tensorboard After Job Completion : After the job is completed, you can view the Tensorboard link on the job details page. Click the link to see the visualized results of the training process.
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#directly-reference-tensorboard-in-a-notebook","title":"Directly Reference Tensorboard in a Notebook","text":"
In a Notebook, you can directly start Tensorboard through code. Here is a sample code snippet:
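A minimal sketch using the Jupyter TensorBoard extension is shown below; the log directory is an example and should point at the directory where your training writes TensorBoard logs:

```python
# Run in a Notebook cell: load the TensorBoard extension and start it against a log directory
%load_ext tensorboard
%tensorboard --logdir logs/gradient_tape
```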
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#how-to-configure-dataset-log-content","title":"How to Configure Dataset Log Content","text":"
When using Tensorboard, you can record and configure different datasets and log content. Here are some common configuration methods:
"},{"location":"en/admin/baize/developer/jobs/tensorboard.html#configure-training-and-validation-dataset-logs","title":"Configure Training and Validation Dataset Logs","text":"
While training the model, you can use TensorFlow's tf.summary API to record logs for the training and validation datasets. Here is a sample code snippet:
# Import necessary libraries\nimport tensorflow as tf\n\n# Create log directories\ntrain_log_dir = 'logs/gradient_tape/train'\nval_log_dir = 'logs/gradient_tape/val'\ntrain_summary_writer = tf.summary.create_file_writer(train_log_dir)\nval_summary_writer = tf.summary.create_file_writer(val_log_dir)\n\n# Train model and record logs\nfor epoch in range(EPOCHS):\n for (x_train, y_train) in train_dataset:\n # Training step\n train_step(x_train, y_train)\n with train_summary_writer.as_default():\n tf.summary.scalar('loss', train_loss.result(), step=epoch)\n tf.summary.scalar('accuracy', train_accuracy.result(), step=epoch)\n\n for (x_val, y_val) in val_dataset:\n # Validation step\n val_step(x_val, y_val)\n with val_summary_writer.as_default():\n tf.summary.scalar('loss', val_loss.result(), step=epoch)\n tf.summary.scalar('accuracy', val_accuracy.result(), step=epoch)\n
In addition to logs for training and validation datasets, you can also record other custom log content such as learning rate and gradient distribution. Here is a sample code snippet:
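A sketch of recording such custom logs with the same tf.summary API is shown below. It assumes the optimizer, the gradients computed via tf.GradientTape, and EPOCHS from the previous snippet already exist in your training loop:

```python
import tensorflow as tf

# Create a separate writer for custom logs
custom_log_dir = 'logs/gradient_tape/custom'
custom_summary_writer = tf.summary.create_file_writer(custom_log_dir)

for epoch in range(EPOCHS):
    with custom_summary_writer.as_default():
        # Record the current learning rate
        tf.summary.scalar('learning_rate', optimizer.learning_rate, step=epoch)
        # Record the distribution of each gradient tensor
        for i, grad in enumerate(gradients):
            tf.summary.histogram(f'gradient_{i}', grad, step=epoch)
```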
In AI Lab, Tensorboards created through various methods are uniformly displayed on the job analysis page, making it convenient for users to view and manage.
Users can view information such as the link, status, and creation time of Tensorboard on the job analysis page and directly access the visualized results of Tensorboard through the link.
Tensorflow, along with Pytorch, is a highly active open-source deep learning framework that provides a flexible environment for training and deployment.
AI Lab provides support and adaptation for the Tensorflow framework. You can quickly create Tensorflow jobs and conduct model training through graphical operations.
Here, we use the baize-notebook base image and the associated environment as the basic runtime environment for jobs.
For information on how to create an environment, refer to Environment List.
"},{"location":"en/admin/baize/developer/jobs/tensorflow.html#creating-a-job","title":"Creating a Job","text":""},{"location":"en/admin/baize/developer/jobs/tensorflow.html#example-tfjob-single","title":"Example TFJob Single","text":"
Log in to the AI Lab platform and click Job Center in the left navigation bar to enter the Jobs page.
Click the Create button in the upper right corner to enter the job creation page.
Select the job type as Tensorflow Single and click Next .
Fill in the job name and description, then click OK .
"},{"location":"en/admin/baize/developer/jobs/tensorflow.html#pre-warming-the-code-repository","title":"Pre-warming the Code Repository","text":"
Use AI Lab -> Dataset List to create a dataset and pull the code from a remote GitHub repository into the dataset. This way, when creating a job, you can directly select the dataset and mount the code into the job.
Command parameters: Use python /code/tensorflow/tf-single.py
\"\"\"\n pip install tensorflow numpy\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\n\n# Create some random data\nx = np.random.rand(100, 1)\ny = 2 * x + 1 + np.random.rand(100, 1) * 0.1\n\n# Create a simple model\nmodel = tf.keras.Sequential([\n tf.keras.layers.Dense(1, input_shape=(1,))\n])\n\n# Compile the model\nmodel.compile(optimizer='adam', loss='mse')\n\n# Train the model, setting epochs to 10\nhistory = model.fit(x, y, epochs=10, verbose=1)\n\n# Print the final loss\nprint('Final loss: {' + str(history.history['loss'][-1]) +'}')\n\n# Use the model to make predictions\ntest_x = np.array([[0.5]])\nprediction = model.predict(test_x)\nprint(f'Prediction for x=0.5: {prediction[0][0]}')\n
After the job is successfully submitted, you can enter the job details to see the resource usage. From the upper right corner, navigate to Workload Details to view log outputs during the training process.
Once a job is created, it will be displayed in the job list.
In the job list, click the ┇ on the right side of a job and select Job Workload Details .
A pop-up window will appear asking you to choose which Pod to view. Click Enter .
You will be redirected to the container management interface, where you can view the container's working status, labels and annotations, and any events that have occurred.
You can also view detailed logs of the current Pod for the recent period. By default, 100 lines of logs are displayed. To view more detailed logs or to download logs, click the blue Insight text at the top.
Additionally, you can use the ... in the upper right corner to view the current Pod's YAML, and to upload or download files. Below is an example of a Pod's YAML.
baizectl is a command line tool specifically designed for model developers and data scientists within the AI Lab module. It provides a series of commands to help users manage distributed training jobs, check job statuses, manage datasets, and more. It also supports connecting to Kubernetes worker clusters and AI platform workspaces, aiding users in efficiently using and managing Kubernetes platform resources.
The basic format of the baizectl command is as follows:
jovyan@19d0197587cc:/$ baizectl\nAI platform management tool\n\nUsage:\n baizectl [command]\n\nAvailable Commands:\n completion Generate the autocompletion script for the specified shell\n data Management datasets\n help Help about any command\n job Manage jobs\n login Login to the platform\n version Show cli version\n\nFlags:\n --cluster string Cluster name to operate\n -h, --help help for baizectl\n --mode string Connection mode: auto, api, notebook (default \"auto\")\n -n, --namespace string Namespace to use for the operation. If not set, the default Namespace will be used.\n -s, --server string access base url\n --skip-tls-verify Skip TLS certificate verification\n --token string access token\n -w, --workspace int32 Workspace ID to use for the operation\n\nUse \"baizectl [command] --help\" for more information about a command.\n
The above provides basic information about baizectl. Users can view the help information using baizectl --help, or view the help information for specific commands using baizectl [command] --help.
The basic format of the baizectl command is as follows:
baizectl [command] [flags]\n
Here, [command] refers to the specific operation command, such as data and job, and [flags] are optional parameters used to specify detailed information about the operation.
baizectl provides a series of commands to manage distributed training jobs, including viewing job lists, submitting jobs, viewing logs, restarting jobs, deleting jobs, and more.
jovyan@19d0197587cc:/$ baizectl job\nManage jobs\n\nUsage:\n baizectl job [command]\n\nAvailable Commands:\n delete Delete a job\n logs Show logs of a job\n ls List jobs\n restart restart a job\n submit Submit a job\n\nFlags:\n -h, --help help for job\n -o, --output string Output format. One of: table, json, yaml (default \"table\")\n --page int Page number (default 1)\n --page-size int Page size (default -1)\n --search string Search query\n --sort string Sort order\n --truncate int Truncate output to the given length, 0 means no truncation (default 50)\n\nUse \"baizectl job [command] --help\" for more information about a command.\n
"},{"location":"en/admin/baize/developer/notebooks/baizectl.html#submit-training-jobs","title":"Submit Training Jobs","text":"
baizectl supports submitting a job using the submit command. You can view detailed information by using baizectl job submit --help.
(base) jovyan@den-0:~$ baizectl job submit --help\nSubmit a job\n\nUsage:\n baizectl job submit [flags] -- command ...\n\nAliases:\n submit, create\n\nExamples:\n# Submit a job to run the command \"torchrun python train.py\"\nbaizectl job submit -- torchrun python train.py\n# Submit a job with 2 workers(each pod use 4 gpus) to run the command \"torchrun python train.py\" and use the image \"pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime\"\nbaizectl job submit --image pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime --workers 2 --resources nvidia.com/gpu=4 -- torchrun python train.py\n# Submit a tensorflow job to run the command \"python train.py\"\nbaizectl job submit --tensorflow -- python train.py\n\n\nFlags:\n --annotations stringArray The annotations of the job, the format is key=value\n --auto-load-env It only takes effect when executed in Notebook, the environment variables of the current environment will be automatically read and set to the environment variables of the Job, the specific environment variables to be read can be specified using the BAIZE_MAPPING_ENVS environment variable, the default is PATH,CONDA_*,*PYTHON*,NCCL_*, if set to false, the environment variables of the current environment will not be read. (default true)\n --commands stringArray The default command of the job\n -d, --datasets stringArray The dataset bind to the job, the format is datasetName:mountPath, e.g. mnist:/data/mnist\n -e, --envs stringArray The environment variables of the job, the format is key=value\n -x, --from-notebook string Define whether to read the configuration of the current Notebook and directly create tasks, including images, resources, and dataset.\n auto: Automatically determine the mode according to the current environment. If the current environment is a Notebook, it will be set to notebook mode.\n false: Do not read the configuration of the current Notebook.\n true: Read the configuration of the current Notebook. (default \"auto\")\n -h, --help help for submit\n --image string The image of the job, it must be specified if fromNotebook is false.\n -t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default \"PYTORCH\")\n --labels stringArray The labels of the job, the format is key=value\n --max-retries int32 number of retries before marking this job failed\n --max-run-duration int Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it\n --name string The name of the job, if empty, the name will be generated automatically.\n --paddle PaddlePaddle Job, has higher priority than --job-type\n --priority string The priority of the job, current support baize-medium-priority, baize-low-priority, baize-high-priority\n --pvcs stringArray The pvcs bind to the job, the format is pvcName:mountPath, e.g. 
mnist:/data/mnist\n --pytorch Pytorch Job, has higher priority than --job-type\n --queue string The queue to used\n --requests-resources stringArray Similar to resources, but sets the resources of requests\n --resources stringArray The resources of the job, it is a string in the format of cpu=1,memory=1Gi,nvidia.com/gpu=1, it will be set to the limits and requests of the container.\n --restart-policy string The job restart policy (default \"on-failure\")\n --runtime-envs baizectl data ls --runtime-env The runtime environment to use for the job, you can use baizectl data ls --runtime-env to get the runtime environment\n --shm-size int32 The shared memory size of the job, default is 0, which means no shared memory, if set to more than 0, the job will use the shared memory, the unit is MiB\n --tensorboard-log-dir string The tensorboard log directory, if set, the job will automatically start tensorboard, else not. The format is /path/to/log, you can use relative path in notebook.\n --tensorflow Tensorflow Job, has higher priority than --job-type\n --workers int The workers of the job, default is 1, which means single worker, if set to more than 1, the job will be distributed. (default 1)\n --working-dir string The working directory of job container, if in notebook mode, the default is the directory of the current file\n
Note
Explanation of command parameters for submitting jobs:
--name: Job name. If empty, it will be auto-generated.
--resources: Job resources, formatted as cpu=1 memory=1Gi,nvidia.com/gpu=1.
--workers: Number of job worker nodes. The default is 1. When set to greater than 1, the job will run in a distributed manner.
--queue: Job queue. Queue resources need to be created in advance.
--working-dir: Working directory. In Notebook mode, the current file directory will be used by default.
--datasets: Dataset, formatted as datasetName:mountPath, for example mnist:/data/mnist.
--shm-size: Shared memory size. This can be enabled for distributed training jobs, indicating the use of shared memory, with units in MiB.
--labels: Job labels, formatted as key=value.
--max-retries: Maximum retry count. The number of times to retry the job upon failure. The job will restart upon failure. Default is unlimited.
--max-run-duration: Maximum run duration. The job will be terminated by the system if it exceeds the specified run time. Default is unlimited.
--restart-policy: Restart policy, supporting on-failure, never, always. The default is on-failure.
--from-notebook: Whether to read configurations from the Notebook. Supports auto, true, false, with the default being auto.
"},{"location":"en/admin/baize/developer/notebooks/baizectl.html#example-of-a-pytorch-single-node-job","title":"Example of a PyTorch Single-Node Job","text":"
Example of submitting a training job. Users can modify parameters based on their actual needs. Below is an example of creating a PyTorch job:
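A sketch of a single-node submission is shown below; the job name, image, resources, and training script are placeholders to replace with your own:

```bash
baizectl job submit --name demo-pytorch-single \
    --image pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime \
    --resources cpu=2,memory=4Gi \
    -- python train.py
```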
"},{"location":"en/admin/baize/developer/notebooks/baizectl.html#example-of-a-distributed-pytorch-job","title":"Example of a Distributed PyTorch Job","text":"
Example of submitting a training job. You can modify parameters based on their actual needs. Below is an example of creating a distributed PyTorch job:
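A sketch of a distributed submission with two workers, each requesting one GPU, is shown below; the job name, image, resources, and script are again placeholders:

```bash
baizectl job submit --name demo-pytorch-ddp \
    --image pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime \
    --workers 2 \
    --resources nvidia.com/gpu=1 \
    --shm-size 1024 \
    -- torchrun train.py
```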
baizectl job supports viewing the job list using the ls command. By default, it displays pytorch jobs, but users can specify the job type using the -t parameter.
(base) jovyan@den-0:~$ baizectl job ls # View pytorch jobs by default\n NAME TYPE PHASE DURATION COMMAND \n demong PYTORCH SUCCEEDED 1m2s sleep 60 \n demo-sleep PYTORCH RUNNING 1h25m28s sleep 7200 \n(base) jovyan@den-0:~$ baizectl job ls demo-sleep # View a specific job\n NAME TYPE PHASE DURATION COMMAND \n demo-sleep PYTORCH RUNNING 1h25m28s sleep 7200 \n(base) jovyan@den-0:~$ baizectl job ls -t TENSORFLOW # View tensorflow jobs\n NAME TYPE PHASE DURATION COMMAND \n demotfjob TENSORFLOW CREATED 0s sleep 1000 \n
The job list uses table as the default display format. If you want to view more information, you can use the json or yaml format, which can be specified using the -o parameter.
baizectl job supports viewing job logs using the logs command. You can view detailed information by using baizectl job logs --help.
(base) jovyan@den-0:~$ baizectl job logs --help\nShow logs of a job\n\nUsage:\n baizectl job logs <job-name> [pod-name] [flags]\n\nAliases:\n logs, log\n\nFlags:\n -f, --follow Specify if the logs should be streamed.\n -h, --help help for logs\n -t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default \"PYTORCH\")\n --paddle PaddlePaddle Job, has higher priority than --job-type\n --pytorch Pytorch Job, has higher priority than --job-type\n --tail int Lines of recent log file to display.\n --tensorflow Tensorflow Job, has higher priority than --job-type\n --timestamps Show timestamps\n
Note
The --follow parameter allows for real-time log viewing.
The --tail parameter specifies the number of log lines to view, with a default of 50 lines.
The --timestamps parameter displays timestamps.
Example of viewing job logs:
(base) jovyan@den-0:~$ baizectl job log -t TENSORFLOW tf-sample-job-v2-202406161632-evgrbrhn -f\n2024-06-16 08:33:06.083766: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n2024-06-16 08:33:06.086189: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n2024-06-16 08:33:06.132416: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n2024-06-16 08:33:06.132903: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\nTo enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n2024-06-16 08:33:07.223046: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\nModel: \"sequential\"\n_________________________________________________________________\n Layer (type) Output Shape Param # \n=================================================================\n Conv1 (Conv2D) (None, 13, 13, 8) 80 \n\n flatten (Flatten) (None, 1352) 0 \n\n Softmax (Dense) (None, 10) 13530 \n\n=================================================================\nTotal params: 13610 (53.16 KB)\nTrainable params: 13610 (53.16 KB)\nNon-trainable params: 0 (0.00 Byte)\n...\n
baizectl supports managing datasets. Currently, it supports viewing the dataset list, making it convenient to quickly bind datasets during job training.
(base) jovyan@den-0:~$ baizectl data \nManagement datasets\n\nUsage:\n baizectl data [flags]\n baizectl data [command]\n\nAliases:\n data, dataset, datasets, envs, runtime-envs\n\nAvailable Commands:\n ls List datasets\n\nFlags:\n -h, --help help for data\n -o, --output string Output format. One of: table, json, yaml (default \"table\")\n --page int Page number (default 1)\n --page-size int Page size (default -1)\n --search string Search query\n --sort string Sort order\n --truncate int Truncate output to the given length, 0 means no truncation (default 50)\n\nUse \"baizectl data [command] --help\" for more information about a command.\n
baizectl data supports viewing the datasets using the ls command. By default, it displays in table format, but users can specify the output format using the -o parameter.
(base) jovyan@den-0:~$ baizectl data ls\n NAME TYPE URI PHASE \n fashion-mnist GIT https://gitee.com/samzong_lu/fashion-mnist.git READY \n sample-code GIT https://gitee.com/samzong_lu/training-sample-code.... READY \n training-output PVC pvc://training-output READY \n
When submitting a training job, you can specify the dataset using the -d or --datasets parameter, for example:
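For example, a sketch where the dataset name comes from baizectl data ls and the training script path is a placeholder:

```bash
baizectl job submit --datasets fashion-mnist:/data/mnist -- python train.py
```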
The environment runtime-env is a unique environment management capability of Suanova. By decoupling the dependencies required for model development, training tasks, and inference, it offers a more flexible way to manage dependencies without the need to repeatedly build complex Docker images. You simply need to select the appropriate environment.
Additionally, runtime-env supports hot updates and dynamic upgrades, allowing you to update environment dependencies without rebuilding the image.
baizectl data supports viewing the environment list using the runtime-env command. By default, it displays in table format, but users can specify the output format using the -o parameter.
(base) jovyan@den-0:~$ baizectl data ls --runtime-env \n NAME TYPE URI PHASE \n fashion-mnist GIT https://gitee.com/samzong_lu/fashion-mnist.git READY \n sample-code GIT https://gitee.com/samzong_lu/training-sample-code.... READY \n training-output PVC pvc://training-output READY \n tensorflow-sample CONDA conda://python?version=3.12.3 PROCESSING \n
When submitting a training job, you can specify the environment using the --runtime-envs parameter:
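For example, a sketch where the environment name comes from baizectl data ls --runtime-env and the training script path is a placeholder:

```bash
baizectl job submit --runtime-envs tensorflow-sample -- python train.py
```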
baizectl supports more advanced usage, such as generating auto-completion scripts, using specific clusters and namespaces, and using specific workspaces.
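For example, a sketch of generating a bash auto-completion script; writing to /etc/bash_completion.d typically requires root privileges:

```bash
baizectl completion bash > /etc/bash_completion.d/baizectl
```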
The above command generates an auto-completion script for bash and saves it to /etc/bash_completion.d/baizectl. You can load the auto-completion script by running source /etc/bash_completion.d/baizectl.
"},{"location":"en/admin/baize/developer/notebooks/baizectl.html#using-specific-clusters-and-namespaces","title":"Using Specific Clusters and Namespaces","text":"
baizectl job ls --cluster my-cluster --namespace my-namespace
This command will list all jobs in the my-namespace namespace within the my-cluster cluster.
"},{"location":"en/admin/baize/developer/notebooks/baizectl.html#using-specific-workspaces","title":"Using Specific Workspaces","text":"
Question: What should I do if the connection to the server fails?

Solution: Check if the --server parameter is set correctly and ensure that the network connection is stable. If the server uses a self-signed certificate, you can use --skip-tls-verify to skip TLS certificate verification.
Question: How can I resolve insufficient permissions issues?
Solution: Ensure that you are using the correct --token parameter to log in and check if the current user has the necessary permissions for the operation.
Question: Why can't I list the datasets?
Solution: Check if the namespace and workspace are set correctly and ensure that the current user has permission to access these resources.
With this guide, you can quickly get started with baizectl commands and efficiently manage AI platform resources in practical applications. If you have any questions or issues, it is recommended to use baizectl [command] --help to check more detailed information.
baizess is a built-in, out-of-the-box source-switching tool in the Notebooks of the AI Lab module. It provides a streamlined command-line interface to facilitate the management of package sources for various programming environments. With baizess, users can easily switch sources for commonly used package managers, ensuring seamless access to the latest libraries and dependencies. This tool enhances the efficiency of developers and data scientists by simplifying the process of managing package sources.
The basic information of the baizess command is as follows:
jovyan@19d0197587cc:/$ baizess
source switch tool

Usage:
  baizess [command] [package-manager]

Available Commands:
  set     Switch the source of specified package manager to current fastest source
  reset   Reset the source of specified package manager to default source

Available Package-managers:
  apt (require root privilege)
  conda
  pip
set: Back up the current source, run a speed test, and switch the specified package manager's source to the fastest domestic source based on the speed test result.
reset: Reset the specified package manager to its default source.
Notebook provides an online web interactive programming environment, making it convenient for developers to quickly conduct data science and machine learning experiments.
Upon entering the developer console, developers can create and manage Notebooks in different clusters and namespaces.
Click Notebooks in the left navigation bar to enter the Notebook list. Click the Create button on the right.
The system will pre-fill basic configuration data, including the cluster, namespace, queue, priority, resources, and job arguments. Adjust these arguments and click OK.
The newly created Notebook will initially be in the Pending state, and will change to Running after a moment, with the latest one appearing at the top of the list by default.
Click the ⋮ on the right side to perform more actions: update arguments, start/stop, clone Notebook, view workload details, and delete.
Note
If you choose pure CPU resources and find that all GPUs on the node are mounted, you can try adding the following container environment variable to resolve this issue:
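The variable is not spelled out here; for NVIDIA GPUs exposed through the NVIDIA container runtime, a commonly used setting is the following (treat this as an assumption and confirm it matches your GPU stack):

NVIDIA_VISIBLE_DEVICES=none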
If you find a Notebook to be redundant, expired, or no longer needed for any other reason, you can delete it from the Notebook list.
Click the ⋮ on the right side of the Notebook in the Notebook list, then choose Delete from the dropdown menu.
In the pop-up window, confirm the Notebook you want to delete, enter the Notebook name, and then click Delete.
A confirmation message will appear indicating successful deletion, and the Notebook will disappear from the list.
Caution
Once a Notebook is deleted, it cannot be recovered, so please proceed with caution.
"},{"location":"en/admin/baize/developer/notebooks/notebook-auto-close.html","title":"Automatic Shutdown of Idle Notebooks","text":"
To optimize resource usage, the smart computing system automatically shuts down idle notebooks after a period of inactivity. This helps free up resources when a notebook is not in use.
Advantages: This feature significantly reduces resource waste from long periods of inactivity, enhancing overall efficiency.
Disadvantages: Without proper backup strategies in place, this may lead to potential data loss.
Note
This feature is enabled by default at the cluster level, with a default timeout of 30 minutes.
Currently, configuration changes must be made manually, but more convenient options will be available in the future.
To modify the deployment parameters of baize-agent in your worker cluster, update the Helm App.
"},{"location":"en/admin/baize/developer/notebooks/notebook-auto-close.html#modify-on-ui","title":"Modify on UI","text":"
On the clusters page, locate your worker cluster, go to its details, select Helm Apps, and find baize-agent under the baize-system namespace. Then click Update in the upper right corner.
"},{"location":"en/admin/baize/developer/notebooks/notebook-auto-close.html#modify-on-cli","title":"Modify on CLI","text":"
In the console, use the helm upgrade command to change the configuration:
# Set version number
export VERSION=0.8.0

# Update Helm Chart
# culling_enabled: enable automatic shutdown (default: true)
# cull_idle_time: idle timeout in minutes (default: 30)
# idleness_check_period: check interval in minutes (default: 1)
helm upgrade --install baize-agent baize/baize-agent \
  --namespace baize-system \
  --create-namespace \
  --set global.imageRegistry=release.daocloud.io \
  --set notebook-controller.culling_enabled=true \
  --set notebook-controller.cull_idle_time=120 \
  --set notebook-controller.idleness_check_period=1 \
  --version=$VERSION
Note
To prevent data loss after an automatic shutdown, upgrade to v0.8.0 or higher and enable the auto-save feature in your notebook configuration.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html","title":"Use Environments in Notebooks","text":"
Environment management is one of the key features of AI Lab. By associating an environment with a Notebook, you can quickly switch between different environments, making it easier to develop and debug.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#select-an-environment-when-creating-a-notebook","title":"Select an Environment When Creating a Notebook","text":"
When creating a Notebook, you can select one or more environments. If there isn't a suitable environment, you can create a new one in Environments.
For instructions on how to create an environment, refer to Environments.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#use-environments-in-notebooks_1","title":"Use Environments in Notebooks","text":"
Note
In the Notebook, both conda and mamba are provided as environment management tools. You can choose the appropriate tool based on your needs.
In AI Lab, you can use the conda environment management tool. You can view the list of current environments in the Notebook by using the command !conda env list.
This command lists all conda environments and adds an asterisk (*) before the currently activated environment.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#manage-kernel-environment-in-jupyterlab","title":"Manage Kernel Environment in JupyterLab","text":"
In JupyterLab, the environments associated with the Notebook are automatically bound to the Kernel list, allowing you to quickly switch environments through the Kernel.
With this method, you can simultaneously write and debug algorithms in a single Notebook.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#switch-environments-in-a-terminal","title":"Switch Environments in a Terminal","text":"
The Notebook for AI Lab now also supports VSCode.
If you prefer managing and switching environments in the Terminal, you can follow these steps:
Upon first starting and using the Notebook, you need to execute conda init, and then run conda activate <env_name> to switch to the proper environment.
(base) jovyan@chuanjia-jupyter-0:~/yolov8$ conda init bash # Initialize bash environment, only needed for the first use\nno change /opt/conda/condabin/conda\n change /opt/conda/bin/conda\n change /opt/conda/bin/conda-env\n change /opt/conda/bin/activate\n change /opt/conda/bin/deactivate\n change /opt/conda/etc/profile.d/conda.sh\n change /opt/conda/etc/fish/conf.d/conda.fish\n change /opt/conda/shell/condabin/Conda.psm1\n change /opt/conda/shell/condabin/conda-hook.ps1\n change /opt/conda/lib/python3.11/site-packages/xontrib/conda.xsh\n change /opt/conda/etc/profile.d/conda.csh\n change /home/jovyan/.bashrc\n action taken.\nAdded mamba to /home/jovyan/.bashrc\n\n==> For changes to take effect, close and re-open your current shell. <==\n\n(base) jovyan@chuanjia-jupyter-0:~/yolov8$ source ~/.bashrc # Reload bash environment\n(base) jovyan@chuanjia-jupyter-0:~/yolov8$ conda activate python-3.10 # Switch to python-3.10 environment\n(python-3.10) jovyan@chuanjia-jupyter-0:~/yolov8$ conda env list\n\n mamba version : 1.5.1\n# conda environments:\n#\ndkj-python312-pure /opt/baize-runtime-env/dkj-python312-pure/conda/envs/dkj-python312-pure\npython-3.10 * /opt/baize-runtime-env/python-3.10/conda/envs/python-3.10 # Currently activated environment\ntorch-smaple /opt/baize-runtime-env/torch-smaple/conda/envs/torch-smaple\nbase /opt/conda\nbaize-base /opt/conda/envs/baize-base\n
If you prefer to use mamba, you will need to use mamba init and mamba activate <env_name>.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#view-packages-in-environment","title":"View Packages in Environment","text":"
An important benefit of environment management is that you can quickly switch environments within a Notebook to work with different sets of packages.
You can use the command below to view all packages in the current environment using conda.
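The command referred to is the standard conda listing command; run it in a Notebook cell (in a terminal, drop the leading !):

!conda list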
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-envs.html#update-packages-in-environment","title":"Update Packages in Environment","text":"
Currently, you can update the packages in the environment through the Environment Management UI in AI Lab.
Notebooks provided by AI Lab support access from your local machine via SSH.
With simple configuration, you can use SSH to access the Jupyter Notebook. Whether you are using Windows, Mac, or Linux operating systems, you can follow the steps below.
First, you need to generate an SSH public and private key pair on your computer. This key pair will be used for the authentication process to ensure secure access.
Mac/Linux
Open the terminal.
Enter the command:
ssh-keygen -t rsa -b 4096
When prompted with "Enter a file in which to save the key," you can press Enter to use the default path or specify a new path.
Next, you will be prompted to enter a passphrase (optional), which adds an extra layer of security. If you choose to enter a passphrase, remember it as you will need it each time you use the key.
Windows

Install Git Bash (if you haven't already).
Open Git Bash.
Enter the command:
ssh-keygen -t rsa -b 4096
Follow the same steps as Mac/Linux.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-ssh.html#add-ssh-public-key-to-personal-center-optional","title":"Add SSH Public Key to Personal Center (Optional)","text":"
Open the generated public key file, usually located at ~/.ssh/id_rsa.pub (if you did not change the default path).
Copy the public key content (see the example command after these steps).
Log in to the system's personal center.
Look for the SSH public key configuration area and paste the copied public key into the designated location.
Save the changes.
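To copy the key, you can print it in the terminal first; this assumes the default key path from the generation step:

cat ~/.ssh/id_rsa.pub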
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-ssh.html#enable-ssh-in-notebook","title":"Enable SSH in Notebook","text":"
Log in to the Jupyter Notebook web interface.
Find the Notebook for which you want to enable SSH.
In the Notebook's settings or details page, find the option Enable SSH and enable it.
Record or copy the displayed SSH access command. This command will be used in subsequent steps for SSH connection.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-ssh.html#ssh-in-different-environments","title":"SSH in Different Environments","text":""},{"location":"en/admin/baize/developer/notebooks/notebook-with-ssh.html#example","title":"Example","text":"
Assume the SSH command you obtained is as follows:
ssh username@mockhost -p 2222
Replace username with your username, mockhost with the actual hostname, and 2222 with the actual port number.
If prompted to accept the host's identity, type yes.
"},{"location":"en/admin/baize/developer/notebooks/notebook-with-ssh.html#remote-development-with-ide","title":"Remote Development with IDE","text":"
In addition to using command line tools for SSH connection, you can also utilize modern IDEs such as Visual Studio Code (VSCode) and PyCharm's SSH remote connection feature to develop locally while utilizing remote server resources.
Using SSH in VSCode
VSCode supports SSH remote connection through the Remote - SSH extension, allowing you to edit files on the remote server directly in the local VSCode environment and run commands.
Steps:
Ensure you have installed VSCode and the Remote - SSH extension.
Open VSCode and click the remote resource manager icon at the bottom of the left activity bar.
Select Remote-SSH: Connect to Host... and then click + Add New SSH Host...
Enter the SSH connection command, for example:
ssh username@mockhost -p 2222
Press Enter. Replace username, mockhost, and 2222 with your actual username, hostname, and port number.
Select a configuration file to save this SSH host, usually the default is fine.
After completing these steps, your SSH host will be added to the SSH target list. Click your host to connect. On the first connection, you may be prompted to verify the host's fingerprint; after accepting, you will be asked to enter the passphrase (if your SSH key has one). Once connected successfully, you can edit remote files in VSCode and use remote resources just as if you were developing locally.
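For reference, the entry that the Remote - SSH extension typically writes to your SSH configuration file (~/.ssh/config) looks roughly like this, using the placeholder values from the example command:

Host mockhost
    HostName mockhost
    Port 2222
    User username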
Using SSH in PyCharm

PyCharm Professional Edition supports connecting to remote servers via SSH and directly developing in the local PyCharm.
Steps:
Open PyCharm and open or create a project.
Select File -> Settings (on Mac, it's PyCharm -> Preferences).
In the settings window, navigate to Project: YourProjectName -> Python Interpreter.
Click the gear icon in the upper right corner and select Add...
In the pop-up window, select SSH Interpreter.
Enter the remote host information: hostname (mockhost), port number (2222), username (username). Replace these placeholders with your actual information.
Click Next. PyCharm will attempt to connect to the remote server. If the connection is successful, you will be asked to enter the passphrase or select the private key file.
Once configured, click Finish. Now, your PyCharm will use the Python interpreter on the remote server.
Within the same Workspace, any user can log in to a Notebook with SSH enabled using their own SSH credentials. This means that as long as users have configured their SSH public key in the personal center and the Notebook has enabled SSH, they can use SSH for a secure connection.
Note that permissions for different users may vary depending on the Workspace configuration. Ensure you understand and comply with your organization's security and access policies.
By following the above steps, you should be able to successfully configure and use SSH to access the Jupyter Notebook. If you encounter any issues, refer to the system help documentation or contact the system administrator.
"},{"location":"en/admin/baize/developer/notebooks/start-pause.html","title":"Start and Stop Notebook","text":"
After a Notebook is successfully created, it typically has several states:
Pending
Running
Stopped
If a Notebook is in the Stopped state, click the ⋮ on the right side in the list, then choose Start from the dropdown menu.
This Notebook will move into the running queue, and its status will change to Pending. If everything is normal, its status will change to Running after a moment.
If you have finished using the Notebook, you can choose Stop from the menu to change its status to Stopped.
GPU Allocated: Statistics on the GPU allocation status of all unfinished tasks in the current cluster, calculating the ratio between requested GPUs (Request) and total resources (Total).
GPU Utilization: Statistics on the actual resource utilization of all running tasks in the current cluster, calculating the ratio between the GPUs actually used (Usage) and the total resources (Total).
Automatically consolidate GPU resource information across the entire platform, providing detailed GPU device information display, and allowing you to view workload statistics and task execution information for various GPUs.
After entering Operator, click Resource Management -> GPU Management in the left navigation bar to view GPU and task information.
In the Operator mode, queues can be used to schedule and optimize batch job workloads, effectively managing multiple tasks running on a cluster and optimizing resource utilization through a queue system.
Click Queue Management in the left navigation bar, then click the Create button on the right.
The system will pre-fill basic setup data, including the cluster to deploy to, workspace, and queuing policy. Click OK after adjusting these parameters.
A confirmation message will appear upon creation, returning you to the queue management list. Click the ⋮ on the right side of the list to perform additional operations such as update or delete.
This document will continuously compile and organize errors that may arise from environmental issues or improper operations during the use of AI Lab, as well as analyze and provide solutions for certain errors encountered during use.
Warning
This documentation applies only to the AI platform. If you encounter issues when using AI Lab, please refer to this troubleshooting guide first.
In AI platform, the module name for AI Lab is baize, which offers one-stop solutions for model training, inference, model management, and more.
"},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html","title":"Cluster Not Found in Drop-Down List","text":""},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html#symptom","title":"Symptom","text":"
In the AI Lab Developer and Operator UI, the desired cluster cannot be found in the drop-down list while you search for a cluster.
If the desired cluster is missing from the cluster drop-down list in AI Lab, it could be due to the following reasons:
The baize-agent is not installed or failed to install, causing AI Lab to be unable to retrieve cluster information.
The cluster name was not configured when installing baize-agent, causing AI Lab to be unable to retrieve cluster information.
Observable components within the worker cluster are abnormal, leading to the inability to collect metrics information from the cluster.
"},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html#solution","title":"Solution","text":""},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html#baize-agent-not-installed-or-failed-to-install","title":"baize-agent not installed or failed to install","text":"
AI Lab requires some basic components to be installed in each worker cluster. If the baize-agent is not installed in the worker cluster, you can choose to install it via UI, which might lead to some unexpected errors.
Therefore, to ensure a good user experience, the selectable cluster range only includes clusters where the baize-agent has been successfully installed.
If the issue is due to the baize-agent not being installed or installation failure, use the following steps:
Container Management -> Clusters -> Helm Apps -> Helm Charts , find baize-agent and install it.
Note
You can jump directly to this address: https://<host>/kpanda/clusters/<cluster_name>/helm/charts/addon/baize-agent. Replace <host> with the actual console address and <cluster_name> with the actual cluster name.
"},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html#cluster-name-not-configured-in-the-process-of-installing-baize-agent","title":"Cluster name not configured in the process of installing baize-agent","text":"
When installing baize-agent, make sure to configure the cluster name. This name is used for Insight metrics collection; it is empty by default and must be configured manually.
"},{"location":"en/admin/baize/troubleshoot/cluster-not-found.html#insight-components-in-the-worker-cluster-are-abnormal","title":"Insight components in the worker cluster are abnormal","text":"
If the Insight components in the cluster are abnormal, it might cause AI Lab to be unable to retrieve cluster information. Check if the platform's Insight services are running and configured correctly.
Check if the insight-server component is running properly in the Global Service Cluster.
Check if the insight-agent component is running properly in the worker cluster.
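As a sketch of how these checks might be done from the command line (the insight-system namespace is an assumption; adjust it to your actual deployment):

kubectl get pods -n insight-system | grep insight-server   # run against the Global Service Cluster
kubectl get pods -n insight-system | grep insight-agent    # run against the worker cluster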
When creating a Notebook, training task, or inference service, if the queue is being used for the first time in that namespace, there will be a prompt to initialize the queue with one click. However, the initialization fails.
In the AI Lab environment, the queue management capability is provided by Kueue. Kueue provides two types of queue management resources:
ClusterQueue: A cluster-level queue mainly used to manage resource quotas within the queue, including CPU, memory, and GPU.
LocalQueue: A namespace-level queue that needs to point to a ClusterQueue for resource allocation within the queue.
In the AI Lab environment, if a service is created and the specified namespace does not have a LocalQueue, there will be a prompt to initialize the queue.
In rare cases, the LocalQueue initialization might fail due to special reasons.
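For reference, a minimal Kueue LocalQueue manifest looks like the following; the namespace and queue names are placeholders, and the one-click initialization normally creates this object for you:

kubectl apply -f - <<EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default
  namespace: demo-namespace
spec:
  clusterQueue: demo-cluster-queue
EOF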
"},{"location":"en/admin/baize/troubleshoot/notebook-not-controlled-by-quotas.html","title":"Notebook Not Controlled by Queue Quota","text":"
In the AI Lab module, when users create a Notebook, they find that even if the selected queue lacks resources, the Notebook can still be created successfully.
The queue management capability in AI Lab is provided by Kueue, and the Notebook service is provided through JupyterHub. JupyterHub has high requirements for the Kubernetes version. For versions below v1.27, even if queue quotas are set in AI platform, and users select the quota when creating a Notebook, the Notebook will not actually be restricted by the queue quota.
Solution: Plan in advance. It is recommended to use Kubernetes version v1.27 or above in the production environment.
Reference: Jupyter Notebook Documentation
"},{"location":"en/admin/baize/troubleshoot/notebook-not-controlled-by-quotas.html#issue-02-configuration-not-enabled","title":"Issue 02: Configuration Not Enabled","text":"
Analysis:
When the Kubernetes cluster version is greater than v1.27, the Notebook still cannot be restricted by the queue quota.
This is because Kueue needs to have support for enablePlainPod enabled to take effect for the Notebook service.
Solution: When deploying baize-agent in the worker cluster, enable Kueue support for enablePlainPod.
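A hedged sketch of what that might look like when upgrading the baize-agent chart; the exact values path for enablePlainPod is an assumption and should be checked against the chart's values.yaml:

helm upgrade --install baize-agent baize/baize-agent \
  --namespace baize-system \
  --set kueue.enablePlainPod=true   # hypothetical values key, verify in the chart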
If you forget your password, you can reset it by following the instructions on this page.
"},{"location":"en/admin/ghippo/password.html#steps-to-reset-password","title":"Steps to Reset Password","text":"
When an administrator creates a user, they set a username and password for that user. After logging in, the user should fill in their email address and change the password in Personal Center. If the user has not set an email address, they can only contact the administrator to reset the password.
If you forget your password, you can click Forgot your password? on the login interface.
Enter your login email and click Submit .
Find the password reset email in the mailbox, and click the link in your email. The link is effective for 5 minutes.
Install applications that support 2FA dynamic password generation (such as Google Authenticator) on mobile phone or other devices. Set up a dynamic password to activate your account, and click Submit .
Set a new password and click Submit . The requirements for setting a new password are consistent with the password rules when creating an account.
The password is successfully reset, and you enter the home page directly.
The flow of the password reset process is as follows.
graph TB\n\npass[Forgot password] --> usern[Enter username]\n--> button[Click button to send a mail] --> judge1[Check your username is correct or not]\n\n judge1 -.Correct.-> judge2[Check if you have bounded a mail]\n judge1 -.Wrong.-> tip1[Error of incorrect username]\n\n judge2 -.A mail has been bounded.-> send[Send a reset mail]\n judge2 -.No any mail bounded.-> tip2[No any mail bounded<br>Contact admin to reset password]\n\nsend --> click[Click the mail link] --> config[Config dynamic password] --> reset[Reset password]\n--> success[Successfully reset]\n\nclassDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000;\nclassDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff;\nclassDef cluster fill:#fff,stroke:#bbb,stroke-width:1px,color:#326ce5;\n\nclass pass,usern,button,tip1,send,tip2,send,click,config,reset,success plain;\nclass judge1,judge2 k8s
AI platform supports the creation of three scopes of custom roles:
The permissions of Platform Role take effect on all relevant resources of the platform
The permissions of workspace role take effect on the resources under the workspace where the user is located
The permissions of folder role take effect on the folder where the user is located and the subfolders and workspace resources under it
"},{"location":"en/admin/ghippo/access-control/custom-role.html#create-a-platform-role","title":"Create a platform role","text":"
A platform role refers to a role that can manipulate features related to a certain module of AI platform (such as container management, microservice engine, Multicloud Management, service mesh, Container registry, Workbench, and global management).
From the left navigation bar, click Global Management -> Access Control -> Roles , and click Create Custom Role .
Enter the name and description, select Platform Role , check the role permissions and click OK .
Return to the role list, search for the custom role you just created, and click ⋮ on the right to perform operations such as copying, editing, and deleting.
After the platform role is successfully created, you can go to User/group to add users and groups for this role.
"},{"location":"en/admin/ghippo/access-control/custom-role.html#create-a-workspace-role","title":"Create a workspace role","text":"
A workspace role refers to a role that can manipulate features related to a module (such as container management, microservice engine, Multicloud Management, service mesh, container registry, Workbench, and global management) according to the workspace.
From the left navigation bar, click Global Management -> Access Control -> Roles , and click Create Custom Role .
Enter the name and description, select Workspace role , check the role permissions and click OK .
Return to the role list, search for the custom role you just created, and click ⋮ on the right to perform operations such as copying, editing, and deleting.
After the workspace role is successfully created, you can go to Workspace to authorize and set which workspaces this role can manage.
A folder role refers to a role that can manipulate the relevant features of an AI platform module (such as container management, microservice engine, Multicloud Management, service mesh, container registry, Workbench, and global management) within a folder and its subfolders.
From the left navigation bar, click Global Management -> Access Control -> Roles , and click Create Custom Role .
Enter the name and description, select Folder Role , check the role permissions and click OK .
Return to the role list, search for the custom role you just created, and click ⋮ on the right to perform operations such as copying, editing, and deleting.
After the folder role is successfully created, you can go to Folder to authorize and set which folders this role can manage.
When two or more platforms need to integrate or embed with each other, user system integration is usually required. During the process of user system integration, the Docking Portal mainly provides SSO (Single Sign-On) capability. If you want to integrate AI platform as a user source into a client platform, you can achieve it by docking a product through Docking Portal .
"},{"location":"en/admin/ghippo/access-control/docking.html#docking-a-product","title":"Docking a product","text":"
Prerequisite: Administrator privileges for the platform or IAM Owner privileges for access control.
Log in with an admin, navigate to Access Control , select Docking Portal , enter the Docking Portal list, and click Create SSO Profile in the upper right corner.
On the Create SSO Profile page, fill in the Client ID.
After successfully creating the SSO access, in the Docking Portal list, click the just created Client ID to enter the details, copy the Client ID, Secret Key, and Single Sign-On URL information, and fill them in the client platform to complete the user system integration.
AI platform provides predefined system roles to help users simplify the process of role permission usage.
Note
AI platform provides three types of system roles: platform role, workspace role, and folder role.
Platform role: has proper permissions for all related resources on the platform. Please go to user/group page for authorization.
Workspace role: has proper permissions for a specific workspace. Please go to the specific workspace page for authorization.
Folder role: has proper permissions for a specific folder, subfolder, and resources under its workspace. Please go to the specific folder page for authorization.
Five system roles are predefined in Access Control: Admin, IAM Owner, Audit Owner, Kpanda Owner, and Workspace and Folder Owner. These five roles are created by the system and cannot be modified by users. The proper permissions of each role are as follows:
| Role Name | Role Type | Module | Role Permissions |
| --- | --- | --- | --- |
| Admin | System role | All | Platform administrator, manages all platform resources, represents the highest authority of the platform. |
| IAM Owner | System role | Access Control | Administrator of Access Control, has all permissions under this service, such as managing users/groups and authorization. |
| Audit Owner | System role | Audit Log | Administrator of Audit Log, has all permissions under this service, such as setting audit log policies and exporting audit logs. |
| Kpanda Owner | System role | Container Management | Administrator of Container Management, has all permissions under this service, such as creating/accessing clusters, deploying applications, granting cluster/namespace-related permissions to users/groups. |
| Workspace and Folder Owner | System role | Workspace and Folder | Administrator of Workspace and Folder, has all permissions under this service, such as creating folders/workspaces, authorizing folder/workspace-related permissions to users/groups, using features such as Workbench and microservice engine under the workspace. |

Workspace Roles
Three system roles are predefined in Access Control: Workspace Admin, Workspace Editor, and Workspace Viewer. These three roles are created by the system and cannot be modified by users. The proper permissions of each role are as follows:
| Role Name | Role Type | Module | Role Permissions |
| --- | --- | --- | --- |
| Workspace Admin | System role | Workspace | Administrator of a workspace, with management permission of the workspace. |
| Workspace Editor | System role | Workspace | Editor of a workspace, with editing permission of the workspace. |
| Workspace Viewer | System role | Workspace | Viewer of a workspace, with readonly permission of the workspace. |

Folder Roles
Three system roles are predefined in Access Control: Folder Admin, Folder Editor, and Folder Viewer. These three roles are created by the system and cannot be modified by users. The proper permissions of each role are as follows:
| Role Name | Role Type | Module | Role Permissions |
| --- | --- | --- | --- |
| Folder Admin | System role | Workspace | Administrator of a folder and its subfolders/workspaces, with management permission. |
| Folder Editor | System role | Workspace | Editor of a folder and its subfolders/workspaces, with editing permission. |
| Folder Viewer | System role | Workspace | Viewer of a folder and its subfolders/workspaces, with readonly permission. |

Group
A group is a collection of users. By joining a group, a user can inherit the role permissions of the group. Authorize users in batches through groups to better manage users and their permissions.
Enter Access Control, select Groups to enter the list of groups, and click Create a group in the upper right corner.
Fill in the group information on the Create group page.
Click OK , the group is created successfully, and you will return to the group list page. The first line in the list is the newly created group.
"},{"location":"en/admin/ghippo/access-control/group.html#add-permissions-to-a-group","title":"Add permissions to a group","text":"
Prerequisite: The group already exists.
Enter Access Control, select Groups to enter the list of groups, and click ⋮ -> Add permissions.
On the Add permissions page, check the required role permissions (multiple choices are allowed).
Click OK to add permissions to the group. Automatically return to the group list, click a group to view the permissions granted to the group.
"},{"location":"en/admin/ghippo/access-control/group.html#add-users-to-a-group","title":"Add users to a group","text":"
Enter Access Control, select Groups to display the group list, and on the right side of a group, click ⋮ -> Add Members.
On the Add Group Members page, click the user to be added (multiple choices are allowed). If there is no user available, click Create a new user , first go to create a user, and then return to this page and click the refresh icon to display the newly created user.
Click OK to finish adding users to the group.
Note
Users in the group will inherit the permissions of the group; users who join the group can be viewed in the group details.
Note: Deleting a group will not delete the users in the group, but the users in the group will no longer be able to inherit the permissions of the group
The administrator enters Access Control, selects Groups to enter the group list, and on the right side of a group, clicks ⋮ -> Delete.
Click Delete to delete the group.
Return to the group list, and the screen will prompt that the deletion is successful.
Note
Deleting a group will not delete the users in the group, but the users in the group will no longer be able to inherit the permissions from the group.
"},{"location":"en/admin/ghippo/access-control/iam.html","title":"What is IAM","text":"
IAM (Identity and Access Management) is an important module of global management. Through the access control module, you can create, manage, and destroy users (and groups), and use system roles and custom roles to control other users' access to the AI platform.
Structures and roles within an enterprise can be complex, with the management of projects, work groups, and mandates constantly changing. Access control uses a clear and tidy interface to map the authorization relationships between users, groups, and roles, granting permissions to users (and groups) with the shortest possible path.
Appropriate role
Access control predefines an administrator role for each submodule, so no user maintenance is required: you can directly grant the platform's predefined system roles to users to achieve modular management of the platform. For fine-grained permissions, please refer to Permission Management.
Enterprise-grade access control
When you want your company's employees to use the company's internal authentication system to log in to the AI platform without creating proper users on the AI platform, you can use the identity provider feature of access control to establish a trust relationship between your company and Suanova. Through federated authentication, employees can log in to the AI platform directly with their existing enterprise accounts, realizing single sign-on.
Global management supports single sign-on based on the LDAP and OIDC protocols. If your enterprise or organization has its own account system and you want to let members of the organization use AI platform resources, you can use the identity provider feature provided by global management instead of creating a username/password for every organization member in your AI platform. You can grant permissions to use AI platform resources to these external user identities.
Identity Provider (IdP)

Responsible for collecting and storing user identity information, usernames, and passwords, and for authenticating users when they log in. In the identity authentication process between an enterprise and the AI platform, the identity provider refers to the enterprise's own identity provider.
Service Provider (SP)
The service provider establishes a trust relationship with the identity provider IdP, and uses the user information provided by the IDP to provide users with specific services. In the process of enterprise authentication with AI platform, the service provider refers to AI platform.
LDAP
LDAP stands for Lightweight Directory Access Protocol and is often used for single sign-on, that is, users can log in to multiple services with one account and password. Global management supports LDAP for identity authentication, so an enterprise IdP that establishes identity authentication with the AI platform through the LDAP protocol must support the LDAP protocol. For a detailed description of LDAP, please refer to: Welcome to LDAP.
OIDC
OIDC, short for OpenID Connect, is an identity authentication standard protocol based on the OAuth 2.0 protocol. Global management supports the OIDC protocol for identity authentication, so the enterprise IdP that establishes identity authentication with AI platform through the OIDC protocol must support the OIDC protocol. For a detailed description of OIDC, please refer to: Welcome to OpenID Connect.
OAuth 2.0
OAuth 2.0 is short for Open Authorization 2.0, an open authorization protocol. The authorization framework allows third-party applications to obtain access permissions in their own name.
Administrators do not need to recreate AI platform users
Before using an identity provider for identity authentication, the administrator needs to create an account for each user in both the enterprise management system and the AI platform. After using an identity provider, the enterprise administrator only needs to create an account for the user in the enterprise management system, and the user can access both systems, reducing personnel management costs.
Users do not need to remember two sets of platform accounts
Before using an identity provider for identity authentication, users need to log in with separate accounts for the two systems to access the enterprise management system and the AI platform. After using an identity provider, users can access both systems simply by logging in to the enterprise management system.
The full name of LDAP is Lightweight Directory Access Protocol, which is an open and neutral industry-standard application protocol that provides access control and maintains directories for distributed information through the IP protocol.
If your enterprise or organization has its own account system, and your enterprise user management system supports the LDAP protocol, you can use the identity provider feature based on the LDAP protocol provided by the Global Management instead of creating usernames/passwords for each member in AI platform. You can grant permissions to use AI platform resources to these external user identities.
In Global Management, the operation steps are as follows:
Log in to AI platform as a user with admin role. Click Global Management -> Access Control in the lower left corner of the left navigation bar.
Click Identity Provider on the left nav bar, click Create an Identity Provider button.
In the LDAP tab, fill in the following fields and click Save to establish a trust relationship with the identity provider and a user mapping relationship.
| Field | Description |
| --- | --- |
| Vendor | Supports LDAP (Lightweight Directory Access Protocol) and AD (Active Directory) |
| Identity Provider Name (UI display name) | Used to distinguish different identity providers |
| Connection URL | The address and port number of the LDAP service, e.g., ldap://10.6.165.2:30061 |
| Bind DN | The DN of the LDAP administrator, which Keycloak will use to access the LDAP server |
| Bind credentials | The password of the LDAP administrator. This field can retrieve its value from a vault using the ${vault.ID} format. |
| Users DN | The full DN of the LDAP tree where your users are located. This DN is the parent of the LDAP users. For example, if the DN of a typical user is similar to "uid='john',ou=users,dc=example,dc=com", it can be "ou=users,dc=example,dc=com". |
| User Object Classes | All values of the LDAP objectClass attribute for users in LDAP, separated by commas. For example: "inetOrgPerson,organizationalPerson". New Keycloak users will be written to LDAP with all of these object classes, and existing LDAP user records will be found if they contain all of these object classes. |
| Enable StartTLS | Encrypts the connection between AI platform and LDAP when enabled |
| Default Permission | Users/groups have no permissions by default after synchronization |
| Full name mapping | Proper First name and Last Name |
| User Name Mapping | The unique username for the user |
| Mailbox Mapping | User email |
Advanced Config
| Field | Description |
| --- | --- |
| Enable or not | Enabled by default. When disabled, this LDAP configuration will not take effect. |
| Periodic full sync | Disabled by default. When enabled, a sync period can be configured, such as syncing once every hour. |
| Edit mode | Read-only mode will not modify the source data in LDAP. Write mode will sync data back to LDAP after user information is edited on the platform. |
| Read timeout | Adjusting this value can effectively avoid interface timeouts when the amount of LDAP data is large. |
| User LDAP filter | An additional LDAP filter used to filter the search for users. Leave it empty if no additional filter is needed. Ensure it starts with "(" and ends with ")". |
| Username LDAP attribute | The name of the LDAP attribute that maps to the Keycloak username. For many LDAP server vendors, it can be "uid". For Active Directory, it can be "sAMAccountName" or "cn". This attribute should be filled in for all LDAP user records you want to import into Keycloak. |
| RDN LDAP attribute | The name of the LDAP attribute that serves as the RDN (top-level attribute) of the typical user DN. It is usually the same as the Username LDAP attribute, but this is not required. For example, for Active Directory, when the username attribute might be "sAMAccountName", "cn" is often used as the RDN attribute. |
| UUID LDAP attribute | The name of the LDAP attribute used as the unique object identifier (UUID) for objects in LDAP. For many LDAP server vendors, it is "entryUUID". However, some may differ. For example, for Active Directory, it should be "objectGUID". If your LDAP server does not support the UUID concept, you can use any other attribute that should be unique among LDAP users in the tree, such as "uid" or "entryDN". |
On the Sync Groups tab, fill in the following fields to configure the mapping relationship of groups, and click Save again.
| Field | Description | Example |
| --- | --- | --- |
| base DN | Location of the group in the LDAP tree | ou=groups,dc=example,dc=org |
| Usergroup Object Filter | Object classes for usergroups, separated by commas if more classes are required. In a typical LDAP deployment, usually "groupOfNames"; the system fills this in automatically, and you can edit it if needed. * means all. | * |
| group name | cn | Unchangeable |
Note
After you have established a trust relationship between the enterprise user management system and AI platform through the LDAP protocol, you can synchronize the users or groups in the enterprise user management system to AI platform at one time through auto/manual synchronization.
After synchronization, the administrator can authorize groups/groups in batches, and users can log in to AI platform through the account/password in the enterprise user management system.
See the LDAP Operations Demo Video for a hands-on tutorial.
If all members in your enterprise or organization are managed in WeCom, you can use the identity provider feature based on the OAuth 2.0 protocol provided by Global Management, without the need to create a username/password for each organization member in AI platform. You can grant these external user identities permission to use AI platform resources.
Log in to AI platform with a user who has the admin role. Click Global Management -> Access Control at the bottom of the left navigation bar.
Select Identity Providers on the left navigation bar, and click the OAuth 2.0 tab. Fill in the form fields and establish a trust relationship with WeCom, then click Save.
"},{"location":"en/admin/ghippo/access-control/oauth2.0.html#proper-fields-in-wecom","title":"proper fields in WeCom","text":"
Note
Before integration, you need to create a custom application in the WeCom management console. Refer to How to create a custom application.
| Field | Description |
| --- | --- |
| Corp ID | ID of WeCom |
| Agent ID | ID of the custom application |
| ClientSecret | Secret of the custom application |
WeCom ID:
Agent ID and ClientSecret:
"},{"location":"en/admin/ghippo/access-control/oidc.html","title":"Create and Manage OIDC","text":"
OIDC (OpenID Connect) is an identity layer based on OAuth 2.0 and an identity authentication standard protocol based on the OAuth2 protocol.
If your enterprise or organization already has its own account system, and your enterprise user management system supports the OIDC protocol, you can use the OIDC protocol-based identity provider feature provided by the Global Management instead of creating usernames/passwords for each member in AI platform. You can grant permissions to use AI platform resources to these external user identities.
The specific operation steps are as follows.
Log in to AI platform as a user with admin role. Click Global Management -> Access Control at the bottom of the left navigation bar.
On the left nav bar select Identity Provider , click OIDC -> Create an Identity Provider
After completing the form fields and establishing a trust relationship with the identity provider, click Save .
| Field | Description |
| --- | --- |
| Provider Name | Displayed on the login page and is the entry point for the identity provider |
| Authentication Method | Client authentication method. If the JWT is signed with a private key, select JWT signed with private key from the dropdown. For details, refer to Client Authentication. |
| Client ID | Client ID |
| Client Secret | Client Secret |
| Client URL | One-click access to the login URL, Token URL, user information URL, and logout URL through the identity provider's well-known interface |
| Auto-associate | After it is turned on, when the identity provider username/email is duplicated with the AI platform username/email, the two will be automatically associated |
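For example, the Client URL field typically points at the provider's OIDC discovery document; a purely hypothetical value would be:

https://idp.example.com/.well-known/openid-configuration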
Note
After the user completes the first login to AI platform through the enterprise user management system, the user information will be synchronized to Access Control -> User List of AI platform.
Users who log in for the first time will not be given any default permissions and need to be authorized by an administrator (the administrator can be a platform administrator, submodule administrator or resource administrator).
For practical tutorials, please refer to OIDC Operation Video Tutorials, or refer to Azure OpenID Connect (OIDC) Access Process.
The interactive process of user authentication is as follows:
Use a browser to initiate a single sign-on request for AI platform.
According to the information carried in the login link, AI platform searches for the proper configuration information in Global Management -> Access Control -> Identity Provider , constructs an OIDC authorization Request, and sends it to the browser.
After the browser receives the request, it forwards the OIDC authorization Request to the enterprise IdP.
Enter the username and password on the login page of the enterprise IdP. The enterprise IdP verifies the provided identity information, constructs an ID token carrying user information, and sends an OIDC authorization response to the browser.
After the browser receives the response, it forwards the OIDC authorization Response to the AI platform.
AI platform takes the ID Token from the OIDC Authorization Response, maps it to a specific user list according to the configured identity conversion rules, and issues the Token.
Complete single sign-on to access AI platform.
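For reference, the OIDC authorization Request constructed by the AI platform is a standard OAuth 2.0/OIDC authorization request; with purely hypothetical values it looks like this:

https://idp.example.com/oauth2/authorize?response_type=code&client_id=ai-platform&redirect_uri=https%3A%2F%2Fconsole.example.com%2Fcallback&scope=openid+profile+email&state=xyz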
"},{"location":"en/admin/ghippo/access-control/role.html","title":"Role and Permission Management","text":"
A role corresponds to a set of permissions that determine the actions that can be performed on resources. Granting a user a role means granting all the permissions included in that role.
AI platform platform provides three levels of roles, which effectively solve your permission-related issues:
Platform roles are coarse-grained permissions that grant proper permissions to all relevant resources on the platform. By assigning platform roles, users can have permissions to create, delete, modify, and view all clusters and workspaces, but not specifically to a particular cluster or workspace. AI platform provides 5 pre-defined platform roles that users can directly use:
Admin
Kpanda Owner
Workspace and Folder Owner
IAM Owner
Audit Owner
Additionally, AI platform supports the creation of custom platform roles with customized content as needed. For example, creating a platform role that includes all functional permissions in the Workbench. Since the Workbench depends on workspaces, the platform will automatically select the \"view\" permission for workspaces by default. Please do not manually deselect it. If User A is granted this Workbench role, they will automatically have all functional permissions related to the Workbench in all workspaces.
"},{"location":"en/admin/ghippo/access-control/role.html#platform-role-authorization-methods","title":"Platform Role Authorization Methods","text":"
There are three ways to authorize platform roles:
In the Global Management -> Access Control -> Users section, find the user in the user list, click ... , select Authorization , and grant platform role permissions to the user.
In the Global Management -> Access Control -> Groups section, create a group in the group list, add the user to the group, and grant authorization to the group (the specific operation is: find the group in the group list, click ... , select Add Permissions , and grant platform roles to the group).
In the Global Management -> Access Control -> Roles section, find the proper platform role in the role list, click the role name to access details, click the Related Members button, select the user or group, and click OK .
Workspace roles are fine-grained roles that grant users management permissions, view permissions, or Workbench-related permissions for a specific workspace. Users with these roles can only manage the assigned workspace and cannot access other workspaces. AI platform provides 3 pre-defined workspace roles that users can directly use:
Workspace Admin
Workspace Editor
Workspace Viewer
Moreover, AI platform supports the creation of custom workspace roles with customized content as needed. For example, creating a workspace role that includes all functional permissions in the Workbench. Since the Workbench depends on workspaces, the platform will automatically select the \"view\" permission for workspaces by default. Please do not manually deselect it. If User A is granted this role in Workspace 01, they will have all functional permissions related to the Workbench in Workspace 01.
Note
Unlike platform roles, workspace roles need to be used within the workspace. Once authorized, users will only have the functional permissions of that role within the assigned workspace.
"},{"location":"en/admin/ghippo/access-control/role.html#workspace-role-authorization-methods","title":"Workspace Role Authorization Methods","text":"
In the Global Management -> Workspace and Folder list, find the workspace, click Authorization , and grant workspace role permissions to the user.
Folder roles have permissions granularity between platform roles and workspace roles. They grant users management permissions and view permissions for a specific folder and its sub-folders, as well as all workspaces within that folder. Folder roles are commonly used in departmental scenarios in enterprises. For example, User B is a leader of a first-level department and usually has management permissions over the first-level department, all second-level departments under it, and projects within those departments. In this scenario, User B is granted admin permissions for the first-level folder, which also grants proper permissions for the second-level folders and workspaces below them. AI platform provides 3 pre-defined folder roles that users can directly use:
Folder Admin
Folder Editor
Folder Viewer
Additionally, AI platform supports the creation of custom folder roles with customized content as needed. For example, creating a folder role that includes all functional permissions in the Workbench. If User A is granted this role in Folder 01, they will have all functional permissions related to the Workbench in all workspaces within Folder 01.
Note
The functionality of modules depends on workspaces, and folders provide further grouping mechanisms with permission inheritance capabilities. Therefore, folder permissions not only include the folder itself but also its sub-folders and workspaces.
"},{"location":"en/admin/ghippo/access-control/role.html#folder-role-authorization-methods","title":"Folder Role Authorization Methods","text":"
In the Global Management -> Workspace and Folder list, find the folder, click Authorization , and grant folder role permissions to the user.
A user refers to a user created by the platform administrator Admin or the access control administrator IAM Owner on the Global Management -> Access Control -> Users page, or a user connected through LDAP / OIDC . The username represents the account, and the user logs in to the Suanova Enterprise platform through the username and password.
Having a user account is a prerequisite for users to access the platform. The newly created user does not have any permissions by default. For example, you need to assign proper role permissions to users, such as granting administrator permissions to submodules in User List or User Details . The sub-module administrator has the highest authority of the sub-module, and can create, manage, and delete all resources of the module. If a user needs to be granted permission for a specific resource, such as the permission to use a certain resource, please see Resource Authorization Description.
This page introduces operations such as creating, authorizing, disabling, enabling, and deleting users.
Prerequisite: You have the platform administrator Admin permission or the access control administrator IAM Owner permission.
The administrator enters Access Control , selects Users , enters the user list, and clicks Create User on the upper right.
Fill in the username and login password on the Create User page. If you need to create multiple users at once, you can click Create User to create them in batches, up to 5 users at a time. Based on your actual situation, decide whether to require the user to reset the password at first login.
Click OK , the user is successfully created and returns to the user list page.
Note
The username and password set here will be used to log in to the platform.
"},{"location":"en/admin/ghippo/access-control/user.html#authorize-for-user","title":"Authorize for User","text":"
Prerequisite: The user already exists.
The administrator enters Access Control , selects Users , enters the user list, and clicks ┇ -> Authorization .
On the Authorization page, check the required role permissions (multiple choices are allowed).
Click OK to complete the authorization for the user.
Note
In the user list, click a user to enter the user details page.
"},{"location":"en/admin/ghippo/access-control/user.html#add-user-to-group","title":"Add user to group","text":"
The administrator enters Access Control , selects Users , enters the user list, and clicks ┇ -> Add to Group .
On the Add to Group page, select the groups to join (multiple choices are allowed). If no group is available, click Create a new group to create one, then return to this page and click the Refresh button to display the newly created group.
Click OK to add the user to the group.
Note
The user will inherit the permissions of the group, and you can view the groups that the user has joined in User Details .
Once a user is deactivated, that user will no longer be able to access the platform. Unlike deletion, a disabled user can be enabled again as needed. It is recommended to disable a user before deleting it, to ensure that no critical service is still using keys created by that user.
The administrator enters Access Control , selects Users , enters the user list, and clicks a username to enter user details.
Click Edit in the upper right corner and turn off the status toggle so that it turns gray and inactive.
Prerequisite: The user's email address needs to be set. There are two ways to set the user's email address.
On the user details page, the administrator clicks Edit , enters the user's email address in the pop-up box, and clicks OK to complete the email setting.
Users can also enter the Personal Center and set the email address on the Security Settings page.
If the user forgets the password when logging in, please refer to Reset Password.
After a user is deleted, that user will no longer be able to access platform resources in any way, so delete with caution. Before deleting a user, make sure your key programs no longer use keys created by that user. If you are unsure, it is recommended to disable the user before deleting. If you delete a user and then create a new user with the same name, the new user is considered a new, separate identity that does not inherit the deleted user's roles.
The administrator enters Access Control , selects Users , enters the user list, and clicks ┇ -> Delete .
With AI platform integrated into the client's system, you can create Webhooks to send message notifications when users are created, updated, deleted, logged in, or logged out.
A Webhook is a mechanism for implementing real-time event notifications. It allows an application to push data or events to another application without polling or continuous querying. By configuring Webhooks, you can specify that the target application receives and processes notifications when a certain event occurs.
The working principle of Webhook is as follows:
The source application (AI platform) performs a specific operation or event.
The source application packages the relevant data and information into an HTTP request and sends it to the URL specified by the target application (e.g., enterprise WeChat group robot).
The target application receives the request and processes it based on the data and information provided.
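For illustration only, such a request can be simulated with curl. The URL and the payload fields ( event , username ) below are hypothetical; the actual target URL and payload depend on the Webhook configuration in AI platform and on the receiving application:
curl -X POST 'https://example.com/webhook' -H 'Content-Type: application/json' -d '{\"event\": \"user.created\", \"username\": \"user01\"}'\n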
By using Webhooks, you can achieve the following functionalities:
Real-time notification: Notify other applications in a timely manner when a specific event occurs.
Automation: The target application can automatically trigger predefined operations based on the received Webhook requests, eliminating the need for manual intervention.
Data synchronization: Use Webhooks to pass data from one application to another, enabling synchronized updates.
Common use cases include:
Version control systems (e.g., GitHub, GitLab): Automatically trigger build and deployment operations when code repositories change.
E-commerce platforms: Send update notifications to logistics systems when order statuses change.
Chatbot platforms: Push messages to target servers via Webhooks for processing when user messages are received.
Audit logs help you monitor and record the activities of each user, and provide features for collecting, storing and querying security-related records arranged in chronological order. With the audit log service, you can continuously monitor and retain user behaviors in the Global Management module, including but not limited to user creation, user login/logout, user authorization, and user operations related to Kubernetes.
The audit log feature has the following characteristics:
Out of the box: When installing and using the platform, the audit log feature will be enabled by default, automatically recording various user-related actions, such as creating users, authorization, and login/logout. By default, 365 days of user behavior can be viewed within the platform.
Security analysis: The audit log will record user operations in detail and provide an export function. Through these events, you can judge whether the account is at risk.
Real-time recording: Quickly collect operation events, and trace back in the audit log list after user operations, so that suspicious behavior can be found at any time.
Convenient and reliable: The audit log supports manual cleaning and automatic cleaning, and the cleaning policy can be configured according to your storage size.
On the Settings tab, you can clean up audit logs for user operations and system operations.
You can manually clean up the logs, but it is recommended to export and save them before cleaning. You can also set the maximum retention time for the logs to automatically clean them up.
Note
The audit logs related to Kubernetes in the auditing module are provided by the Insight module. To reduce the storage pressure of the audit logs, Global Management by default does not collect Kubernetes-related logs. If you need to record them, please refer to Enabling K8s Audit Logs. Once enabled, the cleanup function is consistent with the Global Management cleanup function, but they do not affect each other.
"},{"location":"en/admin/ghippo/audit/open-audit.html","title":"Enable/Disable collection of audit logs","text":"
Kubernetes Audit Logs: Kubernetes itself generates audit logs. When this feature is enabled, audit log files for Kubernetes will be created in the specified directory.
Collecting Kubernetes Audit Logs: The log files mentioned above are collected using the Insight Agent. The prerequisites for collecting Kubernetes audit logs are that the cluster has enabled Kubernetes audit logs, the export of audit logs has been allowed, and the collection of audit logs has been turned on.
Run the following command to check if audit logs are generated under the /var/log/kubernetes/audit directory. If they exist, it means that Kubernetes audit logs are successfully enabled.
ls /var/log/kubernetes/audit\n
If they are not enabled, please refer to the documentation on enabling/disabling Kubernetes audit logs.
"},{"location":"en/admin/ghippo/audit/open-audit.html#enable-collection-of-kubernetes-audit-logs-process","title":"Enable Collection of Kubernetes Audit Logs Process","text":"
Modify the IP address in this command to the IP address of the Spark node.
Note
If using a self-built Harbor repository, please modify the chart repo URL in the first step to the insight-agent chart URL of the self-built repository.
Save the current Insight Agent helm values.
helm get values insight-agent -n insight-system -o yaml > insight-agent-values-bak.yaml\n
Get the current version number ${insight_version_code}.
insight_version_code=`helm list -n insight-system |grep insight-agent | awk {'print $10'}`\n
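Update the helm values and upgrade insight-agent to enable the collection of Kubernetes audit logs. The following is only a sketch: the chart repo name and the value key global.exporters.auditLog.kubeAudit are assumptions and may differ in your environment, so check them against your insight-agent chart before running the command.
helm upgrade insight-agent insight-release/insight-agent -n insight-system -f insight-agent-values-bak.yaml --set global.exporters.auditLog.kubeAudit.enabled=true --version ${insight_version_code}\n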
Restart all fluentBit pods under the insight-system namespace.
fluent_pod=`kubectl get pod -n insight-system | grep insight-agent-fluent-bit | awk {'print $1'} | xargs`\nkubectl delete pod ${fluent_pod} -n insight-system\n
"},{"location":"en/admin/ghippo/audit/open-audit.html#disable-collection-of-kubernetes-audit-logs","title":"Disable Collection of Kubernetes Audit Logs","text":"
The remaining steps are the same as enabling the collection of Kubernetes audit logs, with only a modification in the previous section's step 4: updating the helm value configuration.
"},{"location":"en/admin/ghippo/audit/open-audit.html#ai-community-online-installation-environment","title":"AI Community Online Installation Environment","text":"
Note
If installing AI Community in a Kind cluster, perform the following steps inside the Kind container.
Run the following command to check if audit logs are generated under the /var/log/kubernetes/audit directory. If they exist, it means that Kubernetes audit logs are successfully enabled.
ls /var/log/kubernetes/audit\n
If they are not enabled, please refer to the documentation on enabling/disabling Kubernetes audit logs.
"},{"location":"en/admin/ghippo/audit/open-audit.html#enable-collection-of-kubernetes-audit-logs-process_1","title":"Enable Collection of Kubernetes Audit Logs Process","text":"
Save the current values.
helm get values insight-agent -n insight-system -o yaml > insight-agent-values-bak.yaml\n
Get the current version number ${insight_version_code} and update the configuration.
insight_version_code=`helm list -n insight-system |grep insight-agent | awk {'print $10'}`\n
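As in the previous section, the configuration update itself is a helm upgrade. The sketch below reuses the insight-release repo referenced in this environment; the audit log value key is an assumption to verify against your chart values:
helm upgrade insight-agent insight-release/insight-agent -n insight-system -f insight-agent-values-bak.yaml --set global.exporters.auditLog.kubeAudit.enabled=true --version ${insight_version_code}\n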
If the upgrade fails due to an unsupported version, check whether the helm repo used in the command contains that version. If not, retry after updating the helm repo.
helm repo update insight-release\n
Restart all fluentBit pods under the insight-system namespace.
fluent_pod=`kubectl get pod -n insight-system | grep insight-agent-fluent-bit | awk {'print $1'} | xargs`\nkubectl delete pod ${fluent_pod} -n insight-system\n
"},{"location":"en/admin/ghippo/audit/open-audit.html#disable-collection-of-kubernetes-audit-logs_1","title":"Disable Collection of Kubernetes Audit Logs","text":"
The remaining steps are the same as enabling the collection of Kubernetes audit logs, with only a modification in the previous section's step 3: updating the helm value configuration.
Each worker cluster is independent and can be turned on as needed.
"},{"location":"en/admin/ghippo/audit/open-audit.html#steps-to-enable-audit-log-collection-when-creating-a-cluster","title":"Steps to Enable Audit Log Collection When Creating a Cluster","text":"
By default, the collection of K8s audit logs is turned off. If you need to enable it, you can follow these steps:
Set the switch to the enabled state to enable the collection of K8s audit logs.
When creating a worker cluster via AI platform, ensure that the K8s audit log option for the cluster is set to 'true' so that the created worker cluster will have audit logs enabled.
After the cluster creation is successful, the K8s audit logs for that worker cluster will be collected.
"},{"location":"en/admin/ghippo/audit/open-audit.html#steps-to-enabledisable-after-accessing-or-creating-the-cluster","title":"Steps to Enable/Disable After Accessing or Creating the Cluster","text":""},{"location":"en/admin/ghippo/audit/open-audit.html#confirm-enabling-k8s-audit-logs","title":"Confirm Enabling K8s Audit Logs","text":"
Run the following command to check if audit logs are generated under the /var/log/kubernetes/audit directory. If they exist, it means that K8s audit logs are successfully enabled.
ls /var/log/kubernetes/audit\n
If they are not enabled, please refer to the documentation on enabling/disabling K8s audit logs.
"},{"location":"en/admin/ghippo/audit/open-audit.html#enable-collection-of-k8s-audit-logs","title":"Enable Collection of K8s Audit Logs","text":"
The collection of K8s audit logs is disabled by default. To enable it, follow these steps:
Select the cluster that has been accessed and needs to enable the collection of K8s audit logs.
Go to the Helm App management page and update the insight-agent configuration (if insight-agent is not installed, you can install it).
Enable/Disable the collection of K8s audit logs switch.
After enabling/disabling the switch, the fluent-bit pod needs to be restarted for the changes to take effect.
By default, the Kubernetes cluster does not generate audit log information. Through the following configuration, you can enable the audit log feature of Kubernetes.
Note
In a public cloud environment, it may not be possible to control the output and output path of Kubernetes audit logs.
Prepare the Policy file for the audit log
Configure the API server, and enable audit logs
Reboot and verify
"},{"location":"en/admin/ghippo/audit/open-k8s-audit.html#prepare-audit-log-policy-file","title":"Prepare audit log Policy file","text":"Click to view Policy YAML for audit log policy.yaml
Put the above audit policy file in the /etc/kubernetes/audit-policy/ folder and name it apiserver-audit-policy.yaml .
"},{"location":"en/admin/ghippo/audit/open-k8s-audit.html#configure-the-api-server","title":"Configure the API server","text":"
Open the configuration file kube-apiserver.yaml of the API server, usually in the /etc/kubernetes/manifests/ folder, and add the following configuration information:
Please back up kube-apiserver.yaml before this step. The backup file must not be placed in /etc/kubernetes/manifests/ ; it is recommended to put it in /etc/kubernetes/tmp .
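As a hedged sketch (file paths follow the directories used in this document; retention values are examples to adjust), the additions to kube-apiserver.yaml typically consist of audit flags under spec.containers.command plus hostPath mounts for the policy and log directories:
spec:\n  containers:\n    - command:\n        - kube-apiserver\n        # ... existing flags ...\n        - --audit-policy-file=/etc/kubernetes/audit-policy/apiserver-audit-policy.yaml\n        - --audit-log-path=/var/log/kubernetes/audit/audit.log\n        - --audit-log-maxage=30\n        - --audit-log-maxbackup=10\n        - --audit-log-maxsize=100\n      volumeMounts:\n        # ... existing mounts ...\n        - mountPath: /etc/kubernetes/audit-policy\n          name: audit-policy\n          readOnly: true\n        - mountPath: /var/log/kubernetes/audit\n          name: audit-logs\n  volumes:\n    # ... existing volumes ...\n    - hostPath:\n        path: /etc/kubernetes/audit-policy\n        type: DirectoryOrCreate\n      name: audit-policy\n    - hostPath:\n        path: /var/log/kubernetes/audit\n        type: DirectoryOrCreate\n      name: audit-logs\n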
"},{"location":"en/admin/ghippo/audit/open-k8s-audit.html#test-and-verify","title":"Test and verify","text":"
After a while, the API server will restart automatically. Run the following command to check whether audit logs are generated in the /var/log/kubernetes/audit directory. If so, the K8s audit log has been enabled successfully.
ls /var/log/kubernetes/audit\n
If you want to disable it, just remove the related flags from spec.containers.command .
"},{"location":"en/admin/ghippo/audit/source-ip.html","title":"Get Source IP in Audit Logs","text":"
The source IP in audit logs plays a critical role in system and network management. It helps track activities, maintain security, resolve issues, and ensure system compliance. However, obtaining the source IP incurs some performance overhead, so it is not always enabled in AI platform. The default enablement of the source IP in audit logs, and the methods to enable it, vary depending on the installation mode. The following sections explain, for each installation mode, whether it is enabled by default and how to enable the source IP in audit logs.
Note
Enabling the source IP in audit logs will modify the replica count of the istio-ingressgateway, resulting in a certain performance overhead. It also requires disabling LoadBalance of kube-proxy and Topology Aware Routing, which can have a certain impact on cluster performance. After enabling it, you must ensure that the istio-ingressgateway runs on the node corresponding to the access IP. If the istio-ingressgateway drifts away due to node health or other issues, it needs to be manually rescheduled back to that node; otherwise, the normal operation of AI platform will be affected.
"},{"location":"en/admin/ghippo/audit/source-ip.html#determine-the-installation-mode","title":"Determine the Installation Mode","text":"
kubectl get pod -n metallb-system\n
Run the above command in the cluster. If the result is as follows, it means that the cluster is not in the MetalLB installation mode:
No resources found in metallb-system namespace.\n
In this mode, the source IP in audit logs is obtained by default after installation. For more information, refer to MetalLB Source IP.
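For contrast, if the cluster was installed in MetalLB mode, the same command typically lists the MetalLB controller and speaker pods, similar to the illustrative output below:
NAME                          READY   STATUS    RESTARTS   AGE\ncontroller-xxxxxxxxxx-xxxxx   1/1     Running   0          10d\nspeaker-xxxxx                 1/1     Running   0          10d\n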
"},{"location":"en/admin/ghippo/audit/gproduct-audit/ghippo.html","title":"Audit Items of Global Management","text":"Events Resource Type Notes UpdateEmail-Account Account UpdatePassword-Account Account CreateAccessKeys-Account Account UpdateAccessKeys-Account Account DeleteAccessKeys-Account Account Create-User User Delete-User User Update-User User UpdateRoles-User User UpdatePassword-User User CreateAccessKeys-User User UpdateAccessKeys-User User DeleteAccessKeys-User User Create-Group Group Delete-Group Group Update-Group Group AddUserTo-Group Group RemoveUserFrom-Group Group UpdateRoles-Group Group UpdateRoles-User User Create-LADP LADP Update-LADP LADP Delete-LADP LADP Unable to audit through API server for OIDC Login-User User Logout-User User UpdatePassword-SecurityPolicy SecurityPolicy UpdateSessionTimeout-SecurityPolicy SecurityPolicy UpdateAccountLockout-SecurityPolicy SecurityPolicy UpdateLogout-SecurityPolicy SecurityPolicy MailServer-SecurityPolicy SecurityPolicy CustomAppearance-SecurityPolicy SecurityPolicy OfficialAuthz-SecurityPolicy SecurityPolicy Create-Workspace Workspace Delete-Workspace Workspace BindResourceTo-Workspace Workspace UnBindResource-Workspace Workspace BindShared-Workspace Workspace SetQuota-Workspace Workspace Authorize-Workspace Workspace DeAuthorize-Workspace Workspace UpdateDeAuthorize-Workspace Workspace Update-Workspace Workspace Create-Folder Folder Delete-Folder Folder UpdateAuthorize-Folder Folder Update-Folder Folder Authorize-Folder Folder DeAuthorize-Folder Folder AutoCleanup-Audit Audit ManualCleanup-Audit Audit Export-Audit Audit"},{"location":"en/admin/ghippo/audit/gproduct-audit/insight.html","title":"Insight Audit Items","text":"Events Resource Type Notes Create-ProbeJob ProbeJob Update-ProbeJob ProbeJob Delete-ProbeJob ProbeJob Create-AlertPolicy AlertPolicy Update-AlertPolicy AlertPolicy Delete-AlertPolicy AlertPolicy Import-AlertPolicy AlertPolicy Create-AlertRule AlertRule Update-AlertRule AlertRule Delete-AlertRule AlertRule Create-RuleTemplate RuleTemplate Update-RuleTemplate RuleTemplate Delete-RuleTemplate RuleTemplate Create-email email Update-email email Delete-Receiver Receiver Create-dingtalk dingtalk Update-dingtalk dingtalk Delete-Receiver Receiver Create-wecom wecom Update-wecom wecom Delete-Receiver Receiver Create-webhook webhook Update-webhook webhook Delete-Receiver Receiver Create-sms sms Update-sms sms Delete-Receiver Receiver Create-aliyun(tencent,custom) aliyun, tencent, custom Update-aliyun(tencent,custom) aliyun, tencent, custom Delete-SMSserver SMSserver Create-MessageTemplate MessageTemplate Update-MessageTemplate MessageTemplate Delete-MessageTemplate MessageTemplate Create-AlertSilence AlertSilence Update-AlertSilence AlertSilence Delete-AlertSilence AlertSilence Create-AlertInhibition AlertInhibition Update-AlertInhibition AlertInhibition Delete-AlertInhibition AlertInhibition Update-SystemSettings SystemSettings"},{"location":"en/admin/ghippo/audit/gproduct-audit/kpanda.html","title":"Audit Items of Container Management","text":"Events Resource Types Create-Cluster Cluster Delete-Cluster Cluster Integrate-Cluster Cluster Remove-Cluster Cluster Upgrade-Cluster Cluster Integrate-Node Node Remove-Node Node Update-NodeGPUMode NodeGPUMode Create-HelmRepo HelmRepo Create-HelmApp HelmApp Delete-HelmApp HelmApp Create-Deployment Deployment Delete-Deployment Deployment Create-DaemonSet DaemonSet Delete-DaemonSet DaemonSet Create-StatefulSet StatefulSet Delete-StatefulSet StatefulSet Create-Job Job Delete-Job Job 
Create-CronJob CronJob Delete-CronJob CronJob Delete-Pod Pod Create-Service Service Delete-Service Service Create-Ingress Ingress Delete-Ingress Ingress Create-StorageClass StorageClass Delete-StorageClass StorageClass Create-PersistentVolume PersistentVolume Delete-PersistentVolume PersistentVolume Create-PersistentVolumeClaim PersistentVolumeClaim Delete-PersistentVolumeClaim PersistentVolumeClaim Delete-ReplicaSet ReplicaSet BindResourceTo-Workspace Workspace UnBindResource-Workspace Workspace BindResourceTo-Workspace Workspace UnBindResource-Workspace Workspace Create-CloudShell CloudShell Delete-CloudShell CloudShell"},{"location":"en/admin/ghippo/audit/gproduct-audit/virtnest.html","title":"Audit Items of Virtual Machine","text":"Events Resource Type Notes Restart-VMs VM ConvertToTemplate-VMs VM Edit-VMs VM Update-VMs VM Restore-VMs VM Power on-VMs VM LiveMigrate-VMs VM Delete-VMs VM Delete-VM Template VM Template Create-VMs VM CreateSnapshot-VMs VM Power off-VMs VM Clone-VMs VM"},{"location":"en/admin/ghippo/best-practice/authz-plan.html","title":"Ordinary user authorization plan","text":"
Ordinary users are those who can use most product modules and features (except management features), have certain operation permissions on resources within their scope of authority, and can independently use resources to deploy applications.
The authorization and resource planning process for such users is shown in the following figure.
graph TB\n\n start([Start]) --> user[1. Create User]\n user --> ns[2. Prepare Kubernetes Namespace]\n ns --> ws[3. Prepare Workspace]\n ws --> ws-to-ns[4. Bind a workspace to namespace]\n ws-to-ns --> authu[5. Authorize a user with Workspace Editor]\n authu --> complete([End])\n\nclick user \"https://docs.daocloud.io/en/ghippo/access-control/user/\"\nclick ns \"https://docs.daocloud.io/en/kpanda/namespaces/createns/\"\nclick ws \"https://docs.daocloud.io/en/ghippo/workspace/workspace/\"\nclick ws-to-ns \"https://docs.daocloud.io/en/ghippo/workspace/ws-to-ns-across-clus/\"\nclick authu \"https://docs.daocloud.io/en/ghippo/workspace/wspermission/\"\n\n classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;\n classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;\n classDef cluster fill:#fff,stroke:#bbb,stroke-width:1px,color:#326ce5;\n class user,ns,ws,ws-to-ns,authu cluster;\n class start,complete plain;
"},{"location":"en/admin/ghippo/best-practice/cluster-for-multiws.html","title":"Assign a Cluster to Multiple Workspaces (Tenants)","text":"
Cluster resources are typically managed by operations personnel. When allocating resources, they need to create namespaces to isolate resources and set resource quotas. This method has a drawback: if the business volume of the enterprise is large, manually allocating resources requires a significant amount of work, and flexibly adjusting resource quotas can also be challenging.
To address this, the AI platform introduces the concept of workspaces. By sharing resources, workspaces can provide higher-dimensional resource quota capabilities, allowing workspaces (tenants) to self-create Kubernetes namespaces under resource quotas.
For example, if you want several departments to share different clusters:
|  | Cluster01 (Normal) | Cluster02 (High Availability) |
| --- | --- | --- |
| Department (Workspace) A | 50 quota | 10 quota |
| Department (Workspace) B | 100 quota | 20 quota |
You can follow the process below to share clusters with multiple departments/workspaces/tenants:
"},{"location":"en/admin/ghippo/best-practice/cluster-for-multiws.html#prepare-a-workspace","title":"Prepare a Workspace","text":"
Workspaces are designed to meet multi-tenant usage scenarios, forming isolated resource environments based on clusters, cluster namespaces, meshes, mesh namespaces, multicloud, multicloud namespaces, and other resources. Workspaces can be mapped to various concepts such as projects, tenants, enterprises, and suppliers.
Log in to AI platform with a user having the admin/folder admin role and click Global Management at the bottom of the left navigation bar.
Click Workspaces and Folders in the left navigation bar, then click the Create Workspace button at the top right.
Fill in the workspace name, folder, and other information, then click OK to complete the creation of the workspace.
"},{"location":"en/admin/ghippo/best-practice/cluster-for-multiws.html#prepare-a-cluster","title":"Prepare a Cluster","text":"
Follow these steps to prepare a cluster.
Click Container Management at the bottom of the left navigation bar, then select Clusters .
Click Create Cluster to create a cluster or click Integrate Cluster to integrate a cluster.
"},{"location":"en/admin/ghippo/best-practice/cluster-for-multiws.html#add-cluster-to-workspace","title":"Add Cluster to Workspace","text":"
Return to Global Management to add clusters to the workspace.
Click Global Management -> Workspaces and Folders -> Shared Resources, then click a workspace name and click the New Shared Resource button.
Select the cluster, fill in the resource quota, and click OK .
"},{"location":"en/admin/ghippo/best-practice/folder-practice.html","title":"Folder Best Practices","text":"
A folder represents an organizational unit (such as a department) and is a node in the resource hierarchy.
A folder can contain workspaces, subfolders, or a combination of both. It provides identity management, multi-level and permission mapping capabilities, and can map the role of a user/group in a folder to its subfolders, workspaces and resources. Therefore, with the help of folders, enterprise managers can centrally manage and control all resources.
Build corporate hierarchy
First, build a folder hierarchy that mirrors the enterprise's existing organizational structure. AI platform supports up to 5 levels of folders, which can be combined freely according to the actual situation of the enterprise; folders and workspaces are mapped to entities such as departments, projects, and suppliers in the enterprise.
Folders are not directly linked to resources, but indirectly achieve resource grouping through workspaces.
User identity management
A folder provides three roles: Folder Admin, Folder Editor, and Folder Viewer. See role permissions for details. You can grant different roles to users/groups in the same folder through Authorization .
Role and permission mapping
Enterprise administrator: Grant the Folder Admin role on the root folder. They will then have administrative authority over all departments, projects, and their resources.
Department manager: Grant management permissions separately on each subfolder and workspace.
Project members: Grant management permissions separately at the workspace and resource levels.
"},{"location":"en/admin/ghippo/best-practice/super-group.html","title":"Architecture Management of Large Enterprises","text":"
As the business continues to scale, the company grows, subsidiaries and branches are established one after another, and some subsidiaries even set up their own subsidiaries. The original large departments are gradually subdivided into multiple smaller departments, leading to more and more levels in the organizational structure. This change in organizational structure also affects the IT governance architecture.
The specific operational steps are as follows:
Enable Isolation Mode between Folder/WS
Please refer to Enable Isolation Mode between Folder/WS.
Plan Enterprise Architecture according to the Actual Situation
Under a multi-level organizational structure, it is recommended to use the second-level folder as an isolation unit to isolate users/user groups/resources between \"sub-companies\". After isolation, users/user groups/resources between \"sub-companies\" are not visible to each other.
Create Users/Integrate User Systems
The main platform administrator Admin can create users on the platform or integrate users through LDAP/OIDC/OAuth2.0 and other identity providers to AI platform.
Create Folder Roles
In the Folder/WS isolation mode, the platform administrator Admin needs to first authorize users so as to invite them into the various sub-companies, after which the \"sub-company administrators (Folder Admin)\" can manage these users, for example through secondary authorization or permission edits. To simplify the work of the platform administrator Admin, it is recommended to create a role without actual permissions that is only used to invite users into sub-companies through \"authorization\"; the actual permissions of sub-company users are then delegated to the sub-company administrators (Folder Admin) to manage independently. (The following demonstrates how to create a resource-bound role without actual permissions, i.e., minirole.)
Note
Resource-bound permissions used alone do not take effect. This satisfies the requirement of inviting users into sub-companies through \"authorization\" and then having them managed by the sub-company administrators (Folder Admin).
Authorize Users
The platform administrator invites users to various sub-companies according to the actual situation and appoints sub-company administrators.
Authorize regular sub-company users as minirole (the role without actual permissions created in step 4), and authorize sub-company administrators as Folder Admin.
Sub-company Administrators Manage Users/User Groups Independently
After logging into the platform, the sub-company administrator (Folder Admin) can only see their own \"Sub-company 2\". They can adjust the architecture by creating folders and workspaces, and assign other permissions to the users in Sub-company 2 by adding authorization or editing permissions.
When adding authorization, the sub-company administrator (Folder Admin) can only see the users invited by the platform administrator through \"authorization\", not all users on the platform, thereby achieving user isolation between Folders/Workspaces; the same applies to user groups (the platform administrator can see and authorize all users and user groups on the platform).
Note
The main difference between large enterprises and small/medium-sized enterprises lies in whether users/user groups in folders and workspaces are visible to each other. In large enterprises, users/user groups between subsidiaries are not visible to each other and permissions are isolated; in small/medium-sized enterprises, users between departments are visible to each other while permissions are still isolated.
System messages are used to notify all users, similar to system announcements, and will be displayed at the top bar of the AI platform UI at specific times.
"},{"location":"en/admin/ghippo/best-practice/system-message.html#configure-system-messages","title":"Configure System Messages","text":"
You can create a system message by applying the system message YAML in the Global Service Cluster. The display time of the message is determined by the time fields in the YAML. System messages will only be displayed within the time range configured by the start and end fields.
In the Clusters list, click the name of the Global Service Cluster to enter the Global Service Cluster.
Select CRDs from the left navigation bar, search for ghippoconfig, and click the ghippoconfigs.ghippo.io that appears in the search results.
Click Create from YAML or modify an existing YAML.
A sample YAML is as follows:
apiVersion: ghippo.io/v1alpha1\nkind: GhippoConfig\nmetadata:\n name: system-message\nspec:\n message: \"this is a message\"\n start: 2024-01-02T15:04:05+08:00\n end: 2024-07-24T17:26:05+08:00\n
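If you prefer the command line to the CRD page, the same YAML can also be applied with kubectl (a sketch; this assumes the YAML above has been saved as system-message.yaml and that your kubeconfig points to the Global Service Cluster):
kubectl apply -f system-message.yaml\n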
"},{"location":"en/admin/ghippo/best-practice/ws-best-practice.html","title":"Workspace Best Practices","text":"
A workspace is a resource grouping unit, and most resources can be bound to a certain workspace. The workspace can realize the binding relationship between users and roles through authorization and resource binding, and apply it to all resources in the workspace at one time.
Through the workspace, you can easily manage teams and resources, and solve cross-module and cross-cluster resource authorization issues.
A workspace consists of three features: authorization, resource groups, and shared resources. It mainly solves the problems of unified authorization of resources, resource grouping and resource quota.
Authorization: Grant users/groups different roles in the workspace, and apply the roles to the resources in the workspace.
Best practice: When ordinary users want to use Workbench, microservice engine, service mesh, and middleware module features, or need to have permission to use container management and some resources in the service mesh, the administrator needs to grant the workspace permissions (Workspace Admin, Workspace Edit, Workspace View). The administrator here can be the Admin role, the Workspace Admin role of the workspace, or the Folder Admin role above the workspace. See Relationship between Folder and Workspace.
Resource group: Resource group and shared resource are two resource management modes of the workspace.
Resource groups support four resource types: Cluster, Cluster-Namespace (cross-cluster), Mesh, and Mesh-Namespace. A resource can only be bound to one resource group. After a resource is bound to a resource group, the owner of the workspace will have all the management rights of the resource, which is equivalent to the owner of the resource, so it is not limited by the resource quota.
Best practice: The workspace can grant different role permissions to department members through the \"authorization\" function, and the workspace can apply the authorization relationship between people and roles to all resources in the workspace at one time. Therefore, the operation and maintenance personnel only need to bind resources to resource groups, and add different roles in the department to different resource groups to ensure that resource permissions are assigned correctly.
| Department | Role | Cluster | Cluster-Namespace (cross-cluster) | Mesh | Mesh-Namespace |
| --- | --- | --- | --- | --- | --- |
| Department Admin | Workspace Admin | ✓ | ✓ | ✓ | ✓ |
| Department Core Members | Workspace Edit | ✓ | ✗ | ✓ | ✗ |
| Other Members | Workspace View | ✓ | ✗ | ✗ | ✗ |
Shared resources: The shared resource feature is mainly for cluster resources.
A cluster can be shared by multiple workspaces (referring to the shared resource feature in the workspace); a workspace can also use the resources of multiple clusters at the same time. However, resource sharing does not mean that the sharer (workspace) can use the shared resource (cluster) without restriction, so the resource quota that the sharer (workspace) can use is usually limited.
At the same time, unlike resource groups, workspace members are only users of shared resources and can use the resources in the cluster within the resource quota. For example, they can go to the Workbench to create namespaces and deploy applications, but they do not have management authority over the cluster. With this restriction, the total resource quota of the namespaces created/bound under this workspace cannot exceed the quota set for this workspace on the cluster.
Best practice: The operation and maintenance department has a high-availability cluster 01 and wants to allocate it to department A (workspace A) and department B (workspace B), allocating 50 CPU cores to department A and 100 CPU cores to department B. In this case you can borrow the concept of shared resources: share cluster 01 with department A and department B respectively, and limit the CPU quota of department A to 50 cores and that of department B to 100 cores. The administrator of department A (workspace A Admin) can then create and use namespaces in the Workbench, with the sum of the namespace quotas not exceeding 50 cores; likewise, the administrator of department B (workspace B Admin) can create and use namespaces in the Workbench, with the sum of the namespace quotas not exceeding 100 cores. The namespaces created by the administrators of department A and department B are automatically bound to the department, and other members of the department will have the proper Namespace Admin, Namespace Edit, or Namespace View role for those namespaces (the department here refers to the workspace; a workspace can also be mapped to other concepts such as an organization or a supplier). The whole process is as follows:
| Department | Role | Cluster | Resource Quota |
| --- | --- | --- | --- |
| Department Administrator A | Workspace Admin | CPU 50 cores | CPU 50 cores |
| Department Administrator B | Workspace Admin | CPU 100 cores | CPU 100 cores |
| Other Members of the Department | Namespace Admin / Namespace Edit / Namespace View | Assign as Needed | Assign as Needed |
"},{"location":"en/admin/ghippo/best-practice/ws-best-practice.html#the-effect-of-the-workspace-on-the-ai-platfrom","title":"The effect of the workspace on the AI platfrom","text":"
Module name: Container Management
Due to the particularity of functional modules, resources created in the container management module will not be automatically bound to a certain workspace.
If you need to perform unified authorization management of people and resources through workspaces, you can manually bind the required resources to a certain workspace, so that the roles users hold in this workspace apply to those resources (the resources here can be cross-cluster).
In addition, there is a slight difference between container management and service mesh in terms of the resource binding entry. The workspace provides the binding entry for Cluster and Cluster-Namespace resources in container management, but has not yet opened the binding entry for the service mesh's Mesh and Mesh-Namespace resources.
For Mesh and Mesh-Namespace resources, you can manually bind them in the resource list of the service mesh.
"},{"location":"en/admin/ghippo/best-practice/ws-best-practice.html#use-cases-of-workspace","title":"Use Cases of Workspace","text":"
Mapping to concepts such as different departments, projects, and organizations. At the same time, the roles of Workspace Admin, Workspace Edit, and Workspace View in the workspace can be mapped to different roles in departments, projects, and organizations
Add resources for different purposes to different workspaces for separate management and use
Set up completely independent administrators for different workspaces to realize user and authority management within the scope of the workspace
Share resources to different workspaces, and limit the upper limit of resources that can be used by workspaces
"},{"location":"en/admin/ghippo/best-practice/ws-to-ns.html","title":"Workspaces (tenants) bind namespaces across clusters","text":"
Namespaces from different clusters are bound under the workspace (tenant), which enables the workspace (tenant) to flexibly manage the Kubernetes Namespace under any cluster on the platform. At the same time, the platform provides permission mapping capabilities, which can map the user's permissions in the workspace to the bound namespace.
When one or more cross-cluster namespaces are bound under the workspace (tenant), the administrator does not need to authorize the members in the workspace again. The roles of members in the workspace will be automatically mapped according to the following mapping relationship to complete the authorization, avoiding repeated operations of multiple authorizations:
Workspace Admin corresponds to Namespace Admin
Workspace Editor corresponds to Namespace Editor
Workspace Viewer corresponds to Namespace Viewer
Here is an example:
| User | Workspace | Role |
| --- | --- | --- |
| User A | Workspace01 | Workspace Admin |
After binding a namespace to a workspace:
| User | Category | Role |
| --- | --- | --- |
| User A | Workspace01 | Workspace Admin |
| User A | Namespace01 | Namespace Admin |
"},{"location":"en/admin/ghippo/best-practice/ws-to-ns.html#implementation-plan","title":"Implementation plan","text":"
Bind different namespaces from different clusters to the same workspace (tenant), and use the process for members under the workspace (tenant) as shown in the figure.
graph TB\n\npreparews[prepare workspace] --> preparens[prepare namespace] --> judge([whether the namespace is bound to another workspace])\njudge -.unbound.->nstows[bind namespace to workspace] -->wsperm[manage workspace access]\njudge -.bound.->createns[Create a new namespace]\n\nclassDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000;\nclassDef k8s fill: #326ce5, stroke: #fff, stroke-width: 1px, color: #fff;\nclassDef cluster fill:#fff,stroke:#bbb,stroke-width:1px,color:#326ce5;\n\nclass preparews, preparens, createns, nstows, wsperm cluster;\nclass judge plain\n\nclick preparews \"https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_3\"\nclick preparens \"https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_4\"\nclick nstows \"https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_5\"\nclick wsperm \"https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_6\"\nclick createns \"https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_4\"
In order to meet multi-tenant use cases, the workspace forms an isolated resource environment based on multiple resources such as clusters, cluster namespaces, meshes, mesh namespaces, multicloud, and multicloud namespaces. Workspaces can be mapped to various concepts such as projects, tenants, enterprises, and suppliers.
Log in to AI platform as a user with the admin/folder admin role, and click Global Management at the bottom of the left navigation bar.
Click Workspace and Folder in the left navigation bar, and click the Create Workspace button in the upper right corner.
After filling in the workspace name, folder and other information, click OK to complete the creation of the workspace.
Tip: If the namespace to be bound has already been created in the platform, click the workspace, and under the Resource Group tab, click Bind Resource to bind the namespace directly.
"},{"location":"en/admin/ghippo/best-practice/ws-to-ns.html#prepare-the-namespace","title":"Prepare the namespace","text":"
A namespace is a smaller unit of resource isolation that can be managed and used by members of a workspace after it is bound to a workspace.
Follow the steps below to prepare a namespace that is not yet bound to any workspace.
Click Container Management at the bottom of the left navigation bar.
Click the name of the target cluster to enter Cluster Details .
Click Namespace on the left navigation bar to enter the namespace management page, and click the Create button on the right side of the page.
Fill in the name of the namespace, configure the workspace and tags (optional settings), and click OK .
Info
Workspaces are primarily used to divide groups of resources and grant users (groups of users) different access rights to that resource. For a detailed description of the workspace, please refer to Workspace and Folder.
Click OK to complete the creation of the namespace. On the right side of the namespace list, click ┇ , and you can select Bind Workspace from the pop-up menu.
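If you prefer the command line, the namespace itself can also be created with kubectl and then bound to the workspace through the Bind Workspace menu described above (a sketch with a hypothetical namespace name):
kubectl create namespace ns-demo-01\n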
"},{"location":"en/admin/ghippo/best-practice/ws-to-ns.html#bind-the-namespace-to-the-workspace","title":"Bind the namespace to the workspace","text":"
In addition to binding in the namespace list, you can also return to global management , follow the steps below to bind the workspace.
Click Global Management -> Workspace and Folder -> Resource Group , click a workspace name, and click the Bind Resource button.
Select the namespace to be bound (multiple choices are allowed), and click OK to complete the binding.
"},{"location":"en/admin/ghippo/best-practice/ws-to-ns.html#add-members-to-the-workspace-and-authorize","title":"Add members to the workspace and authorize","text":"
In Workspace and Folder -> Authorization , click the name of a workspace, and click the Add Authorization button.
After selecting the User/group and Role to be authorized, click OK to complete the authorization.
"},{"location":"en/admin/ghippo/best-practice/gproduct/intro.html","title":"How GProduct connects to global management","text":"
GProduct is the general term for all other modules in AI platform except the global management. These modules need to be connected with the global management before they can be added to AI platform.
"},{"location":"en/admin/ghippo/best-practice/gproduct/intro.html#what-to-be-docking","title":"What to be docking","text":"
Docking Navigation Bar
The entrances are unified on the left navigation bar.
Access Routing and AuthN
Unify the IP or domain name, and unify the routing entry through the globally managed Istio Gateway.
Unified login / unified AuthN authentication
The login page is unified using the global management (Keycloak) login page, and the API authn token verification uses Istio Gateway. After GProduct is connected to the global management, there is no need to pay attention to how to implement login and authentication.
Only support one of overview, workbench, container, microservice, data service, and management
The larger the number, the higher it is ranked
The configuration for the global management navigation bar category is stored in a ConfigMap and cannot be added through registration at present. Please contact the global management team to add it.
The kpanda front-end is integrated into the AI platform parent application Anakin as a micro-frontend.
AI platform frontend uses qiankun to connect the sub-applications UI. See getting started.
After registering the GProductNavigator CR, the proper registration information will be generated for the front-end parent application. For example, kpanda will generate the following registration information:
{\n \"id\": \"kpanda\",\n \"title\": \"\u5bb9\u5668\u7ba1\u7406\",\n \"url\": \"/kpanda\",\n \"uiAssetsUrl\": \"/ui/kpanda/\", // The trailing / is required\n \"needImportLicense\": false\n},\n
The proper relation between the above registration and the qiankun sub-application fields is:
container and loader are provided by the frontend parent application; the sub-application does not need to concern itself with them. props will provide a pinia store containing basic user information and sub-product registration information.
qiankun will use the following parameters on startup:
start({\n sandbox: {\n experimentalStyleIsolation: true,\n },\n // Remove the favicon in the sub-application to prevent it from overwriting the parent application's favicon in Firefox\n getTemplate: (template) => template.replaceAll(/<link\\s* rel=\"[\\w\\s]*icon[\\w\\s]*\"\\s*( href=\".*?\")?\\s*\\/?>/g, ''),\n});\n
Refer to the GProduct docking demo tar package provided by the frontend team.
"},{"location":"en/admin/ghippo/best-practice/gproduct/route-auth.html","title":"Access routing and login authentication","text":"
After docking, login and password verification are unified. The effect is as follows:
The API bearer token verification of each GProduct module goes through the Istio Gateway.
Take kpanda as an example to register GProductProxy CR.
# GProductProxy CR example, including routing and login authentication\n\n# spec.proxies: The route written later cannot be a subset of the route written first, and vice versa\n# spec.proxies.match.uri.prefix: If it is a backend api, it is recommended to add \"/\" at the end of the prefix to indicate the end of this path (special requirements can not be added)\n# spec.proxies.match.uri: supports prefix and exact modes; Prefix and Exact can only choose 1 out of 2; Prefix has a higher priority than Exact\n\napiVersion: ghippo.io/v1alpha1\nkind: GProductProxy\nmetadata:\n name: kpanda # (1)\nspec:\n gproduct: kpanda # (2)\n proxies:\n - labels:\n kind: UIEntry\n match:\n uri:\n prefix: /kpanda # (3)\n rewrite:\n uri: /index.html\n destination:\n host: ghippo-anakin.ghippo-system.svc.cluster.local\n port: 80\n authnCheck: false # (4)\n - labels:\n kind: UIAssets\n match:\n uri:\n prefix: /ui/kpanda/ # (5)\n destination:\n host: kpanda-ui.kpanda-system.svc.cluster.local\n port: 80\n authnCheck: false\n - match:\n uri:\n prefix: /apis/kpanda.io/v1/a\n destination:\n host: kpanda-service.kpanda-system.svc.cluster.local\n port: 80\n authnCheck: false\n - match:\n uri:\n prefix: /apis/kpanda.io/v1 # (6)\n destination:\n host: kpanda-service.kpanda-system.svc.cluster.local\n port: 80\n authnCheck: true\n
Cluster-level CRDs
You need to specify the GProduct name in lowercase
Can also support exact
Whether istio-gateway is required to perform AuthN Token authentication for this routing API, false means to skip authentication
For UIAssets , it is recommended to add / at the end to indicate the end of the path (otherwise there may be problems in the frontend)
The route written later cannot be a subset of the route written earlier, and vice versa
"},{"location":"en/admin/ghippo/best-practice/menu/menu-display-or-hiding.html","title":"Display/Hide Navigation Bar Menu Based on Permissions","text":"
Under the current permission system, Global Management has the capability to regulate the visibility of navigation bar menus according to user permissions. However, due to the authorization information of Container Management not being synchronized with Global Management, Global Management cannot accurately determine whether to display the Container Management menu.
Through configuration, this document achieves the following: by default, the Container Management and Insight menus are not displayed in cases where Global Management cannot make a judgment, and a whitelist authorization strategy is used to manage their visibility. (Permissions for clusters or namespaces granted through the Container Management page cannot be perceived or judged by Global Management.)
For example, if User A holds the Cluster Admin role for cluster A in Container Management, Global Management cannot determine whether to display the Container Management menu. After the configuration described in this document, User A will not see the Container Management menu by default. They will need to have explicit permission in Global Management to access the Container Management menu.
The feature to show/hide menus based on permissions must be enabled. The methods to enable this are as follows:
For new installation environments, add the --set global.navigatorVisibleDependency=true parameter when using helm install.
For existing environments, back up values using helm get values ghippo -n ghippo-system -o yaml, then modify bak.yaml and add global.navigatorVisibleDependency: true.
Then upgrade the Global Management using the following command:
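A sketch of the upgrade command, assuming the release is named ghippo in the ghippo-system namespace and the chart comes from a repo named ghippo-release (adjust the repo and chart version to your environment); bak.yaml is the values file modified above:
# pin --version to the currently installed chart version if needed\nhelm upgrade ghippo ghippo-release/ghippo -n ghippo-system -f bak.yaml\n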
"},{"location":"en/admin/ghippo/best-practice/menu/menu-display-or-hiding.html#configure-the-navigation-bar","title":"Configure the Navigation Bar","text":"
Apply the following YAML in kpanda-global-cluster:
"},{"location":"en/admin/ghippo/best-practice/menu/menu-display-or-hiding.html#achieve-the-above-effect-through-custom-roles","title":"Achieve the Above Effect Through Custom Roles","text":"
Note
Only the menus of the Container Management module need separately configured menu permissions. Other modules automatically show/hide their menus based on user permissions
Create a custom role that includes the permission to view the Container Management menu, and then grant this role to users who need access to the Container Management menu.
After granting this role, users can see the navigation bar menus for container management and observability. The result is as follows:
"},{"location":"en/admin/ghippo/best-practice/oem/custom-idp.html","title":"Customizing AI platform Integration with IdP","text":"
Identity Provider (IdP): In AI platform, when a client system needs to be used as the user source and user authentication is performed through the client system's login interface, the client system is referred to as the Identity Provider for AI platform.
If there is a high customization requirement for the Ghippo login IdP, such as supporting WeCom, WeChat, or other social login methods, please refer to this document for implementation.
Upgrade Ghippo to v0.15.0 or above. You can also directly install and deploy Ghippo v0.15.0, but make sure to manually record the following information.
After a successful upgrade, an installation command should be run manually. The parameter values set via --set should be obtained from the previously saved content, together with the following additional parameter values:
global.idpPlugin.enabled: Whether to enable the custom plugin, default is disabled.
global.idpPlugin.image.repository: The image address used by the initContainer to initialize the custom plugin.
global.idpPlugin.image.tag: The image tag used by the initContainer to initialize the custom plugin.
global.idpPlugin.path: The directory file of the custom plugin within the above image.
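Putting these parameters together, the installation command typically looks like the sketch below. The release and repo names are assumptions, the angle-bracket values are placeholders for your plugin image and file path, and any previously recorded --set values should be appended as well:
helm upgrade ghippo ghippo-release/ghippo -n ghippo-system --set global.idpPlugin.enabled=true --set global.idpPlugin.image.repository=<plugin-image-repo> --set global.idpPlugin.image.tag=<plugin-image-tag> --set global.idpPlugin.path=<plugin-file-path-in-image>\n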
Known issue in Keycloak >= v21: support for old version themes has been removed, which may be fixed in v22. See Issue #15344.
This demo uses Keycloak v20.0.5.
"},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#source-based-development","title":"Source-based Development","text":""},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#configure-the-environment","title":"Configure the Environment","text":"
Refer to keycloak/building.md for environment configuration.
Run the following commands based on keycloak/README.md:
cd quarkus\nmvn -f ../pom.xml clean install -DskipTestsuite -DskipExamples -DskipTests\n
"},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#run-from-ide","title":"Run from IDE","text":""},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#add-service-code","title":"Add Service Code","text":""},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#if-inheriting-some-functionality-from-keycloak","title":"If inheriting some functionality from Keycloak","text":"
Add files under the directory services/src/main/java/org/keycloak/broker :
The file names should be xxxProvider.java and xxxProviderFactory.java .
xxxProviderFactory.java example:
Pay attention to the variable PROVIDER_ID = \"oauth\"; , as it will be used in the HTML definition later.
xxxProvider.java example:
"},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#if-unable-to-inherit-functionality-from-keycloak","title":"If unable to inherit functionality from Keycloak","text":"
Refer to the three files in the image below to write your own code:
Add xxxProviderFactory to resource service
Add xxxProviderFactory to services/src/main/resources/META-INF/services/org.keycloak.broker.provider.IdentityProviderFactory so that the newly added code can work:
Add HTML file
Copy the file themes/src/main/resources/theme/base/admin/resources/partials/realm-identity-provider-oidc.html and rename it as realm-identity-provider-oauth.html (remember the variable to pay attention to from earlier).
Place the copied file in themes/src/main/resources/theme/base/admin/resources/partials/realm-identity-provider-oauth.html .
All the necessary files have been added. Now you can start debugging the functionality.
"},{"location":"en/admin/ghippo/best-practice/oem/keycloak-idp.html#packaging-as-a-jar-plugin","title":"Packaging as a JAR Plugin","text":"
Create a new Java project and copy the above code into the project, as shown below:
Refer to pom.xml.
Run mvn clean package to package the code, resulting in the xxx-jar-with-dependencies.jar file.
Download Keycloak Release 20.0.5 zip package and extract it.
Copy the xxx-jar-with-dependencies.jar file to the keycloak-20.0.5/providers directory.
Run the following command to check if the functionality is working correctly:
bin/kc.sh start-dev\n
"},{"location":"en/admin/ghippo/best-practice/oem/oem-in.html","title":"Integrating Customer Systems into AI platform (OEM IN)","text":"
OEM IN refers to the partner's platform being embedded as a submodule in AI platform, appearing in the primary navigation bar of AI platform. Users can log in and manage it uniformly through AI platform. The implementation of OEM IN is divided into 5 steps:
Unify Domain
Integrate User Systems
Integrate Navigation Bar
Customize Appearance
Integrate Permission System (Optional)
For specific operational demonstrations, refer to the OEM IN Best Practices Video Tutorial.
Note
The open source software Label Studio is used for nested demonstrations below. In actual scenarios, you need to solve the following issues in the customer system:
The customer system needs to add a Subpath to distinguish which services belong to AI platform and which belong to the customer system.
Adjust the operations on the customer system during the application according to the actual situation.
Plan the Subpath path of the customer system: http://10.6.202.177:30123/label-studio (It is recommended to use a recognizable name as the Subpath, which should not conflict with the HTTP router of the main AI platform). Ensure that users can access the customer system through http://10.6.202.177:30123/label-studio.
"},{"location":"en/admin/ghippo/best-practice/oem/oem-in.html#unify-domain-name-and-port","title":"Unify Domain Name and Port","text":"
SSH into the AI platform server.
ssh root@10.6.202.177\n
Create the label-studio.yaml file using the vim command.
vim label-studio.yaml\n
label-studio.yaml
apiVersion: networking.istio.io/v1beta1\nkind: ServiceEntry\nmetadata:\n name: label-studio\n namespace: ghippo-system\nspec:\n exportTo:\n - \"*\"\n hosts:\n - label-studio.svc.external\n ports:\n # Add a virtual port\n - number: 80\n name: http\n protocol: HTTP\n location: MESH_EXTERNAL\n resolution: STATIC\n endpoints:\n # Change to the domain name (or IP) of the customer system\n - address: 10.6.202.177\n ports:\n # Change to the port number of the customer system\n http: 30123\n---\napiVersion: networking.istio.io/v1alpha3\nkind: VirtualService\nmetadata:\n # Change to the name of the customer system\n name: label-studio\n namespace: ghippo-system\nspec:\n exportTo:\n - \"*\"\n hosts:\n - \"*\"\n gateways:\n - ghippo-gateway\n http:\n - match:\n - uri:\n exact: /label-studio # Change to the routing address of the customer system in the Web UI entry\n - uri:\n prefix: /label-studio/ # Change to the routing address of the customer system in the Web UI entry\n route:\n - destination:\n # Change to the value of spec.hosts in the ServiceEntry above\n host: label-studio.svc.external\n port:\n # Change to the value of spec.ports in the ServiceEntry above\n number: 80\n---\napiVersion: security.istio.io/v1beta1\nkind: AuthorizationPolicy\nmetadata:\n # Change to the name of the customer system\n name: label-studio\n namespace: istio-system\nspec:\n action: ALLOW\n selector:\n matchLabels:\n app: istio-ingressgateway\n rules:\n - from:\n - source:\n requestPrincipals:\n - '*'\n - to:\n - operation:\n paths:\n - /label-studio # Change to the value of spec.http.match.uri.prefix in VirtualService\n - /label-studio/* # Change to the value of spec.http.match.uri.prefix in VirtualService (Note: add \"*\" at the end)\n
Apply the label-studio.yaml using the kubectl command:
kubectl apply -f label-studio.yaml\n
Verify if the IP and port of the Label Studio UI are consistent:
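As a hedged command-line check that the route works, using the example addresses from this document (the direct customer-system address and the AI platform gateway address with the Subpath):
# Direct access to the customer system\ncurl -I http://10.6.202.177:30123/label-studio\n\n# Access through the AI platform gateway (Subpath route)\ncurl -kI https://10.6.202.177:30443/label-studio/\n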
"},{"location":"en/admin/ghippo/best-practice/oem/oem-in.html#integrate-user-systems","title":"Integrate User Systems","text":"
Integrate the customer system with AI platform through protocols such as OIDC/OAuth, allowing users to enter the customer system without logging in again after logging into AI platform.
In the scenario where the customer system is another AI platform instance (i.e., two AI platform instances), you can create SSO access through Global Management -> Access Control -> Docking Portal.
After creation, fill in details such as the Client ID, Client Secret, and Login URL in the customer system under Global Management -> Access Control -> Identity Provider -> OIDC to complete the user integration.
After integration, the customer system login page will display the OIDC (Custom) option. The first time you enter the customer system from AI platform, select OIDC login; afterwards, you will enter the customer system directly without selecting again.
Refer to the tar package at the bottom of the document to implement an empty frontend sub-application, and embed the customer system into this empty shell application in the form of an iframe.
Download the gproduct-demo-main.tar.gz file and change the value of the src attribute in App-iframe.vue under the src folder (the address from which users enter the customer system):
The absolute address: src=\"https://10.6.202.177:30443/label-studio\" (AI platform address + Subpath)
The relative address, such as src=\"./external-anyproduct/insight\"
Delete the App.vue and main.ts files under the src folder, and rename:
Rename App-iframe.vue to App.vue
Rename main-iframe.ts to main.ts
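A minimal shell sketch of the delete-and-rename steps above, assuming you are in the root directory of the extracted gproduct-demo-main package:
cd src\n# Remove the original entry files\nrm App.vue main.ts\n# Use the iframe variants as the new entry files\nmv App-iframe.vue App.vue\nmv main-iframe.ts main.ts\n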
Build the image following the steps in the readme (Note: before executing the last step, replace the image address in demo.yaml with the built image address)
After integration, the Customer System will appear in the primary navigation bar of AI platform, and clicking it will allow users to enter the customer system.
AI platform supports customizing the appearance by writing CSS. How the customer system implements appearance customization in practice must be handled according to its own situation.
Log in to the customer system, and through Global Management -> Settings -> Appearance, you can customize platform background colors, logos, and names. For specific operations, please refer to Appearance Customization.
"},{"location":"en/admin/ghippo/best-practice/oem/oem-in.html#integrate-permission-system-optional","title":"Integrate Permission System (Optional)","text":"
Method One:
The partner's team can implement a custom module: AI platform will notify this module of each user login event via webhook, and the module can then call the OpenAPIs of AnyProduct and AI platform to synchronize the user's permission information.
Method Two:
Notify AnyProduct of each authorization change via webhook (this can be implemented later if required).
"},{"location":"en/admin/ghippo/best-practice/oem/oem-in.html#use-other-capabilities-of-ai-platform-in-anyproduct-optional","title":"Use Other Capabilities of AI platform in AnyProduct (Optional)","text":"
Download the tar package for gProduct-demo-main integration
"},{"location":"en/admin/ghippo/best-practice/oem/oem-out.html","title":"Integrate AI platform into Customer System (OEM OUT)","text":"
OEM OUT refers to integrating AI platform as a sub-module into other products, appearing in their menus. You can directly access AI platform without logging in again after logging into other products. The OEM OUT integration involves 5 steps:
Deploy AI platform (Assuming the access address after deployment is https://10.6.8.2:30343/).
To achieve cross-domain access between the customer system and AI platform, you can use an nginx reverse proxy. Use the following example configuration in vi /etc/nginx/conf.d/default.conf :
server {\n listen 80;\n server_name localhost;\n\n location /dce5/ {\n proxy_pass https://10.6.8.2:30343/;\n proxy_http_version 1.1;\n proxy_read_timeout 300s; # This line is required for using kpanda cloudtty, otherwise it can be removed\n proxy_send_timeout 300s; # This line is required for using kpanda cloudtty, otherwise it can be removed\n\n proxy_set_header Host $host;\n proxy_set_header X-Real-IP $remote_addr;\n proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n\n proxy_set_header Upgrade $http_upgrade; # This line is required for using kpanda cloudtty, otherwise it can be removed\n proxy_set_header Connection $connection_upgrade; # This line is required for using kpanda cloudtty, otherwise it can be removed\n }\n\n location / {\n proxy_pass https://10.6.165.50:30443/; # Assuming this is the customer system address (e.g., Yiyun)\n proxy_http_version 1.1;\n\n proxy_set_header Host $host;\n proxy_set_header X-Real-IP $remote_addr;\n proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n }\n}\n
Assuming the nginx entry address is 10.6.165.50, follow the Customize AI platform Reverse Proxy Server Address to set the AI_PROXY reverse proxy as http://10.6.165.50/dce5. Ensure that AI platform can be accessed via http://10.6.165.50/dce5. The customer system also needs to configure the reverse proxy based on its specific requirements.
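A hedged check that the reverse proxy responds, using the example nginx entry address above:
curl -I http://10.6.165.50/dce5/\n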
"},{"location":"en/admin/ghippo/best-practice/oem/oem-out.html#user-system-integration","title":"User System Integration","text":"
Integrate the customer system with AI platform using protocols like OIDC/OAUTH, allowing users to access AI platform without logging in again after logging into the customer system. Fill in the OIDC information of the customer system in Global Management -> Access Control -> Identity Provider .
After integration, the AI platform login page will display the OIDC (custom) option. When accessing AI platform from the customer system for the first time, select OIDC login, and subsequent logins will directly enter AI platform without needing to choose again.
"},{"location":"en/admin/ghippo/best-practice/oem/oem-out.html#navigation-bar-integration","title":"Navigation Bar Integration","text":"
Navigation bar integration means adding AI platform to the menu of the customer system. You can directly access AI platform by clicking the proper menu item. The navigation bar integration depends on the customer system and needs to be handled based on specific circumstances.
Use Global Management -> Settings -> Appearance to customize the platform's background color, logo, and name. For detailed instructions, refer to Appearance Customization.
"},{"location":"en/admin/ghippo/best-practice/oem/oem-out.html#permission-system-integration-optional","title":"Permission System Integration (optional)","text":"
Permission system integration is complex. If you have such requirements, please contact the Global Management team.
Tengine: Tengine is a web server project initiated by taobao.com. Based on Nginx, it adds many advanced features to meet the needs of high-traffic websites.
Tongsuo: Formerly known as BabaSSL, Tongsuo is an open-source cryptographic library that offers a range of modern cryptographic algorithms and secure communication protocols. It is designed to support a variety of use cases, including storage, network security, key management, and privacy computing. By providing foundational cryptographic capabilities, Tongsuo ensures the privacy, integrity, and authenticity of data during transmission, storage, and usage. It also enhances security throughout the data lifecycle, offering robust privacy protection and security features.
You can refer to the Tongsuo official documentation to use OpenSSL to generate SM2 certificates, or visit Guomi SSL Laboratory to apply for SM2 certificates.
In the end, we will get the following files:
-rw-r--r-- 1 root root 749 Dec 8 02:59 sm2.*.enc.crt.pem\n-rw-r--r-- 1 root root 258 Dec 8 02:59 sm2.*.enc.key.pem\n-rw-r--r-- 1 root root 749 Dec 8 02:59 sm2.*.sig.crt.pem\n-rw-r--r-- 1 root root 258 Dec 8 02:59 sm2.*.sig.key.pem\n
-rw-r--r-- 1 root root 216 Dec 8 03:21 rsa.*.crt.pem\n-rw-r--r-- 1 root root 4096 Dec 8 02:59 rsa.*.key.pem\n
"},{"location":"en/admin/ghippo/install/gm-gateway.html#configure-sm2-and-rsa-tls-certificates-for-the-guomi-gateway","title":"Configure SM2 and RSA TLS Certificates for the Guomi Gateway","text":"
The Guomi gateway used in this article supports SM2 and RSA TLS certificates. The advantage of dual certificates is that when the browser does not support SM2 TLS certificates, it automatically switches to RSA TLS certificates.
For more detailed configurations, please refer to the Tongsuo official documentation.
Enter the Tengine container:
# Go to the nginx configuration file directory\ncd /usr/local/nginx/conf\n\n# Create the cert folder to store TLS certificates\nmkdir cert\n\n# Copy the SM2 and RSA TLS certificates to the `/usr/local/nginx/conf/cert` directory\ncp sm2.*.enc.crt.pem sm2.*.enc.key.pem sm2.*.sig.crt.pem sm2.*.sig.key.pem /usr/local/nginx/conf/cert\ncp rsa.*.crt.pem rsa.*.key.pem /usr/local/nginx/conf/cert\n\n# Edit the nginx.conf configuration\nvim nginx.conf\n...\nserver {\n listen 443 ssl;\n proxy_http_version 1.1;\n # Enable Guomi function to support SM2 TLS certificates\n enable_ntls on;\n\n # RSA certificate\n # If your browser does not support Guomi certificates, you can enable this option, and Tengine will automatically recognize the user's browser and use RSA certificates for fallback\n ssl_certificate /usr/local/nginx/conf/cert/rsa.*.crt.pem;\n ssl_certificate_key /usr/local/nginx/conf/cert/rsa.*.key.pem;\n\n # Configure two pairs of SM2 certificates for encryption and signature\n # SM2 signature certificate\n ssl_sign_certificate /usr/local/nginx/conf/cert/sm2.*.sig.crt.pem;\n ssl_sign_certificate_key /usr/local/nginx/conf/cert/sm2.*.sig.key.pem;\n # SM2 encryption certificate\n ssl_enc_certificate /usr/local/nginx/conf/cert/sm2.*.enc.crt.pem;\n ssl_enc_certificate_key /usr/local/nginx/conf/cert/sm2.*.enc.key.pem;\n ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;\n\n location / {\n proxy_set_header Host $http_host;\n proxy_set_header X-Real-IP $remote_addr;\n proxy_set_header REMOTE-HOST $remote_addr;\n proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n # You need to modify the address here to the address of the Istio ingress gateway\n # For example, proxy_pass https://istio-ingressgateway.istio-system.svc.cluster.local\n # Or proxy_pass https://demo-dev.daocloud.io\n proxy_pass https://istio-ingressgateway.istio-system.svc.cluster.local;\n }\n}\n
"},{"location":"en/admin/ghippo/install/gm-gateway.html#reload-the-configuration-of-the-guomi-gateway","title":"Reload the Configuration of the Guomi Gateway","text":"
You can deploy a web browser that supports Guomi certificates. For example, Samarium Browser, and then access the UI interface through Tengine to verify if the Guomi certificate is effective.
Before a user starts using a new system, the system has no data about them and cannot identify them. To identify the user and bind user data, the user needs an account that uniquely identifies their identity.
AI platform assigns the user an account with certain permissions when an administrator creates a new user in User and Access Control. All actions performed by this user are associated with their own account.
The user logs in with the account and password, and the system verifies the identity. If verification succeeds, the user is logged in.
Note
If the user does not perform any operation within 24 hours after logging in, the login session will automatically expire. If the logged-in user remains active, the session will persist.
The simple process of user login is shown in the figure below.
Set environment variables for easier use in the following steps.
# Your reverse proxy address, for example `export Suanova_PROXY=\"https://demo-alpha.daocloud.io\"` \nexport Suanova_PROXY=\"https://domain:port\"\n\n# Helm --set parameter backup file\nexport GHIPPO_VALUES_BAK=\"ghippo-values-bak.yaml\"\n\n# Get the current version of ghippo\nexport GHIPPO_HELM_VERSION=$(helm get notes ghippo -n ghippo-system | grep \"Chart Version\" | awk -F ': ' '{ print $2 }')\n
Backup the --set parameters.
helm get values ghippo -n ghippo-system -o yaml > ${GHIPPO_VALUES_BAK}\n
Set environment variables for easier use in the following steps.
# Your reverse proxy address, for example `export Suanova_PROXY=\"https://demo-alpha.daocloud.io\"` \nexport Suanova_PROXY=\"https://domain:port\"\n\n# Helm --set parameter backup file\nexport GHIPPO_VALUES_BAK=\"ghippo-values-bak.yaml\"\n\n# Get the current version of ghippo\nexport GHIPPO_HELM_VERSION=$(helm get notes ghippo -n ghippo-system | grep \"Chart Version\" | awk -F ': ' '{ print $2 }')\n
Backup the --set parameters.
helm get values ghippo -n ghippo-system -o yaml > ${GHIPPO_VALUES_BAK}\n
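The backup and environment variables are then typically used in a helm upgrade to apply the reverse proxy address. A hedged sketch of that step (the value key global.reverseProxy is an assumption; adjust it to your actual chart values):
helm upgrade ghippo ghippo/ghippo \\\n-n ghippo-system \\\n-f ${GHIPPO_VALUES_BAK} \\\n--set global.reverseProxy=${Suanova_PROXY} \\\n--version ${GHIPPO_HELM_VERSION}\n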
The access key can be used to access the OpenAPI and for continuous delivery. Users can obtain the key in the Personal Center and access the API by following the steps below.
Log in to AI platform, find Personal Center in the drop-down menu in the upper right corner, and you can manage the access key of the account on the Access Keys page.
Info
Access key is displayed only once. If you forget your access key, you will need to create a new key.
"},{"location":"en/admin/ghippo/personal-center/accesstoken.html#use-the-key-to-access-api","title":"Use the key to access API","text":"
When accessing the AI platform OpenAPI, add the header Authorization: Bearer ${token} to the request to identify the visitor, where ${token} is the key obtained in the previous step. Request example:
curl -X GET -H 'Authorization:Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IkRKVjlBTHRBLXZ4MmtQUC1TQnVGS0dCSWc1cnBfdkxiQVVqM2U3RVByWnMiLCJ0eXAiOiJKV1QifQ.eyJleHAiOjE2NjE0MTU5NjksImlhdCI6MTY2MDgxMTE2OSwiaXNzIjoiZ2hpcHBvLmlvIiwic3ViIjoiZjdjOGIxZjUtMTc2MS00NjYwLTg2MWQtOWI3MmI0MzJmNGViIiwicHJlZmVycmVkX3VzZXJuYW1lIjoiYWRtaW4iLCJncm91cHMiOltdfQ.RsUcrAYkQQ7C6BxMOrdD3qbBRUt0VVxynIGeq4wyIgye6R8Ma4cjxG5CbU1WyiHKpvIKJDJbeFQHro2euQyVde3ygA672ozkwLTnx3Tu-_mB1BubvWCBsDdUjIhCQfT39rk6EQozMjb-1X1sbLwzkfzKMls-oxkjagI_RFrYlTVPwT3Oaw-qOyulRSw7Dxd7jb0vINPq84vmlQIsI3UuTZSNO5BCgHpubcWwBss-Aon_DmYA-Et_-QtmPBA3k8E2hzDSzc7eqK0I68P25r9rwQ3DeKwD1dbRyndqWORRnz8TLEXSiCFXdZT2oiMrcJtO188Ph4eLGut1-4PzKhwgrQ' 'https://demo-dev.daocloud.io/apis/ghippo.io/v1alpha1/users?page=1&pageSize=10' -k\n
This section explains how to set the interface language. Currently, Chinese and English are supported.
Language setting is the entry point through which the platform provides multilingual services. The platform is displayed in Chinese by default. Users can switch the platform language to English, or have it automatically follow the browser language preference, as needed. Each user's language setting is independent, and switching it does not affect other users.
The platform provides three language options: Chinese, English, and automatic detection of the browser language preference.
The operation steps are as follows.
Log in to the AI platform with your username/password. Click Global Management at the bottom of the left navigation bar.
Click the username in the upper right corner and select Personal Center .
Function description: used to set the email address and change the login password.
Email: after the administrator configures the mail server address, users can click the Forgot Password button on the login page and enter their email address there to retrieve the password.
Password: the password used to log in to the platform. It is recommended to change it regularly.
The specific operation steps are as follows:
Click the username in the upper right corner and select Personal Center .
Click the Security Settings tab. Fill in your email address or change the login password.
"},{"location":"en/admin/ghippo/personal-center/ssh-key.html","title":"Configuring SSH Public Key","text":"
This article explains how to configure SSH public key.
Before generating a new SSH key, please check whether you need to use an existing SSH key stored in the home directory of the local user. For Linux and Mac, use the following command to view existing public keys. Windows users can use the following command in WSL (requires Windows 10 or above) or Git Bash to view the generated public keys.
ED25519 Algorithm:
cat ~/.ssh/id_ed25519.pub\n
RSA Algorithm:
cat ~/.ssh/id_rsa.pub\n
If a long string starting with ssh-ed25519 or ssh-rsa is returned, it means that a local public key already exists. You can skip Step 2 Generate SSH Key and proceed directly to Step 3.
If Step 1 does not return the specified content string, it means that there is no available SSH key locally and a new SSH key needs to be generated. Please follow these steps:
Access the terminal (Windows users please use WSL or Git Bash), and run ssh-keygen -t.
Enter the key algorithm type and an optional comment.
The comment will appear in the .pub file and can generally use the email address as the comment content.
To generate a key pair based on the ED25519 algorithm, use the following command:
ssh-keygen -t ed25519 -C \"<comment>\"\n
To generate a key pair based on the RSA algorithm, use the following command:
ssh-keygen -t rsa -C \"<comment>\"\n
Press Enter to choose the SSH key generation path.
Taking the ED25519 algorithm as an example, the default path is as follows:
Generating public/private ed25519 key pair.\nEnter file in which to save the key (/home/user/.ssh/id_ed25519):\n
The default key generation path is /home/user/.ssh/id_ed25519, and the proper public key is /home/user/.ssh/id_ed25519.pub.
Set a passphrase for the key.
Enter passphrase (empty for no passphrase):\nEnter same passphrase again:\n
The passphrase is empty by default, and you can choose to use a passphrase to protect the private key file. If you do not want to enter a passphrase every time you access the repository using the SSH protocol, you can enter an empty passphrase when creating the key.
Press Enter to complete the key pair creation.
"},{"location":"en/admin/ghippo/personal-center/ssh-key.html#step-3-copy-the-public-key","title":"Step 3. Copy the Public Key","text":"
In addition to manually copying the generated public key information printed on the command line, you can use the following commands to copy the public key to the clipboard, depending on the operating system.
Windows (in WSL or Git Bash):
cat ~/.ssh/id_ed25519.pub | clip\n
Mac:
tr -d '\\n' < ~/.ssh/id_ed25519.pub | pbcopy\n
GNU/Linux (requires xclip):
xclip -sel clip < ~/.ssh/id_ed25519.pub\n
"},{"location":"en/admin/ghippo/personal-center/ssh-key.html#step-4-set-the-public-key-on-ai-platform-platform","title":"Step 4. Set the Public Key on AI platform Platform","text":"
Log in to the AI platform UI page and select Profile -> SSH Public Key in the upper right corner of the page.
Add the generated SSH public key information.
SSH public key content.
Public key title: Supports customizing the public key name for management differentiation.
Expiration: Set the expiration period for the public key. After it expires, the public key will be automatically invalidated and cannot be used. If not set, it will be permanently valid.
The About page primarily showcases the latest versions of each module, highlights the open source software used, and expresses gratitude to the technical team via an animated video.
Steps to view are as follows:
Log in to AI platform as a user with Admin role. Click Global Management at the bottom of the left navigation bar.
Click Settings , select About , and check the product version, open source software statement, and development teams.
In AI platform, you have the option to customize the appearance of the login page, top navigation bar, bottom copyright and ICP registration to enhance your product recognition.
"},{"location":"en/admin/ghippo/platform-setting/appearance.html#customizing-login-page-and-top-navigation-bar","title":"Customizing Login Page and Top Navigation Bar","text":"
To get started, log in to AI platform as a user with the admin role and navigate to Global Management -> Settings found at the bottom of the left navigation bar.
Select Appearance . On the Custom your login page tab, modify the icon and text of the login page as needed, then click Save .
Log out and refresh the login page to see the configured effect.
On the Advanced customization tab, you can modify the login page, navigation bar, copyright, and ICP registration with CSS.
Note
If you wish to restore the default settings, simply click Revert . This action will discard all customized settings.
Advanced customization allows you to modify the color, font spacing, and font size of the entire container platform using CSS styles. Please note that familiarity with CSS syntax is required.
To reset any advanced customizations, delete the contents of the black input box or click the Revert button.
If a user forgets their password, AI platform sends an email to the user to verify the email address and ensure that the user is acting in person. For AI platform to be able to send email, you first need to provide your mail server address.
The specific operation steps are as follows:
Log in to AI platform as a user with admin role. Click Global Management at the bottom of the left navigation bar.
Click Settings , select Mail Server Settings .
Complete the following fields to configure the mail server:
| Field | Description | Example |
| --- | --- | --- |
| SMTP server address | SMTP server address that can provide mail service | smtp.163.com |
| SMTP server port | Port for sending mail | 25 |
| Username | Name of the SMTP user | test@163.com |
| Password | Password for the SMTP account | 123456 |
| Sender's email address | Sender's email address | test@163.com |
| Use SSL secure connection | SSL can be used to encrypt emails, thereby improving the security of information transmitted via emails, usually need to configure a certificate for the mail server | Disable |
After the configuration is complete, click Save , and click Test Mail Server .
A message indicating that the mail has been successfully sent appears in the upper right corner of the screen, indicating that the mail server has been successfully set up.
Q: Why can a user still not retrieve the password after the mail server is set up?
A: The user may not have an email address set, or may have set a wrong one. In this case, a user with the admin role can find the user by username in Global Management -> Access Control and set a new login password for that user.
If the mail server is not connected, please check whether the mail server address, username and password are correct.
New passwords must differ from the most recent historical password.
Users are required to change their passwords upon expiration.
Passwords must not match the username.
Passwords cannot be the same as the user's email address.
Customizable password rules.
Customizable minimum password length.
"},{"location":"en/admin/ghippo/platform-setting/security.html#access-control-policy","title":"Access Control Policy","text":"
Session Timeout Policy: Users will be automatically logged out after a period of inactivity lasting x hours.
Account Lockout Policy: Accounts will be locked after multiple failed login attempts within a specified time frame.
Login/Logout Policy: Users will be logged out when closing the browser.
To configure the password and access control policies, navigate to global management, then click Settings -> Security Policy in the left navigation bar.
Operation Management provides a visual representation of the total usage and utilization rates of CPU, memory, storage and GPU across various dimensions such as cluster, node, namespace, pod, and workspace within a specified time range on the platform. It also automatically calculates platform consumption information based on usage, usage time, and unit price. By default, the module enables all report statistics, but platform administrators can manually enable or disable individual reports. After enabling or disabling, the platform will start or stop collecting report data within a maximum of 20 minutes. Previously collected data will still be displayed normally. Operation Management data can be retained on the platform for up to 365 days. Statistical data exceeding this retention period will be automatically deleted. You can also download reports in CSV or Excel format for further statistics and analysis.
Operation Management is available only for the Standard Edition and above. It is not supported in the Community Edition.
You need to install or upgrade the Operations Management module first, and then you can experience report management and billing metering.
Report Management provides data statistics for cluster, node, pods, workspace, and namespace across five dimensions: CPU Utilization, Memory Utilization, Storage Utilization, GPU Utilization, and GPU Memory Utilization. It also integrates with the audit and alert modules to support the statistical management of audit and alert data, supporting a total of seven types of reports.
Accounting & Billing provides billing statistics for clusters, nodes, pods, namespaces, and workspaces on the platform. It calculates the consumption for each resource during the statistical period based on the usage of CPU, memory, storage and GPU, as well as user-configured prices and currency units. Depending on the selected time span, such as monthly, quarterly, or annually, it can quickly calculate the actual consumption for that period.
Accounting and billing further process the usage data of resources based on reports. You can manually set the unit price and currency unit for CPU, memory, GPU and storage. After setting, the system will automatically calculate the expenses of clusters, nodes, pods, namespaces, and workspaces over a period. You can adjust the period freely and export billing reports in Excel or CSV format after filtering by week, month, quarter, or year.
"},{"location":"en/admin/ghippo/report-billing/billing.html#billing-rules-and-effective-time","title":"Billing Rules and Effective Time","text":"
Billing Rules: Default billing is based on the maximum value of request and usage.
Effective Time: changes take effect the next day. The fees incurred on a given day are calculated based on the unit price and quantity obtained at midnight of the following day.
Support customizing the billing unit for CPU, memory, storage and GPU, as well as the currency unit.
Support custom querying of billing data within a year, automatically calculating the billing situation for the selected time period.
Support exporting billing reports in CSV and Excel formats.
Support enabling/disabling individual billing reports. After enabling/disabling, the platform will start/stop collecting data within 20 minutes, and past collected data will still be displayed normally.
Support selective display of billing data for CPU, total memory, storage, GPU and total.
Cluster Billing Report: Displays the CPU billing, memory billing, storage billing, GPU billing and overall billing situation for all clusters within a certain period, as well as the number of nodes in that cluster. By clicking the number of nodes, you can quickly enter the node billing report and view the billing situation of nodes in that cluster during that time period.
Node Billing Report: Displays the CPU billing, memory billing, storage billing, GPU billing and overall billing situation for all nodes within a certain period, as well as the IP, type, and belonging cluster of nodes.
Pod Report: Displays the CPU billing, memory billing, storage billing, GPU billing and overall billing situation for all pods within a certain period, as well as the namespace, cluster, and workspace to which the pod belongs.
Workspace Billing Report: Displays the CPU billing, memory billing, storage billing, GPU billing and overall billing situation for all workspaces within a certain period, as well as the number of namespaces and pods. By clicking the number of namespaces, you can quickly enter the namespace billing report and view the billing situation of namespaces in that workspace during that time period; the same method can be used to view the billing situation of pods in that workspace during that time period.
Namespace Billing Report: Displays the CPU billing, memory billing, storage billing, GPU billing and overall billing situation for all namespaces within a certain period, as well as the number of pods, the belonging cluster, and workspace. By clicking the number of pods, you can quickly enter the pod billing report and view the billing situation of pods in that namespace during that time period.
Report management visually displays statistical data across clusters, nodes, pods, workspaces, namespaces, audits, and alarms. This data provides a reliable foundation for platform billing and utilization optimization.
Supports custom queries for statistical data within a year
Allows exporting reports in CSV and Excel formats
Supports enabling/disabling individual reports; once toggled, the platform will start/stop data collection within 20 minutes, but previously collected data will still be displayed.
Displays maximum, minimum, and average values for CPU utilization, memory utilization, storage utilization, and GPU memory utilization
Cluster Report: Displays the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization for all clusters during a specific time period, as well as the number of nodes under the cluster. You can quickly access the node report by clicking on the node count and view the utilization of nodes under the cluster during that period.
Node Report: Displays the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization for all nodes during a specific time period, along with the node's IP, type, and affiliated cluster.
Pod Report: Shows the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization for all pods during a specific time period, as well as the pod's namespace, affiliated cluster, and workspace.
Workspace Report: Displays the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization for all workspaces during a specific time period, along with the number of namespaces and pods. You can quickly access the namespace report by clicking on the namespace count and view the utilization of namespaces under the workspace during that period; similarly, you can view the utilization of pods under the workspace.
Namespace Report: Displays the maximum, minimum, and average values of CPU utilization, memory utilization, storage utilization, and GPU memory utilization for all namespaces during a specific time period, as well as the number of pods, affiliated clusters, and workspaces. You can quickly access the pod report by clicking on the pod count and view the utilization of pods within the namespace during that period.
Audit Report: divided into user actions and resource operations. The user action report mainly counts the number of operations by a single user during a period, including successful and failed attempts; the resource operation report mainly counts the number of operations on a type of resource by all users.
Alarm Report: Displays the number of alarms for all nodes during a specific period, including the occurrences of fatal, severe, and warning alarms.
Log in to AI platform as a user with the Admin role. Click Global Management -> Operations Management at the bottom of the left sidebar.
After entering Operations Management, switch between different menus to view reports on clusters, nodes, and pods.
"},{"location":"en/admin/ghippo/troubleshooting/ghippo01.html","title":"Unable to start istio-ingressgateway when restarting the cluster (virtual machine)?","text":"
The error message is as shown in the following image:
Possible cause: The jwtsUri address of the RequestAuthentication CR cannot be accessed, causing istiod to be unable to push the configuration to istio-ingressgateway (This bug can be avoided in Istio 1.15: https://github.com/istio/istio/pull/39341/).
Solution:
Backup the RequestAuthentication ghippo CR.
kubectl get RequestAuthentication ghippo -n istio-system -o yaml > ghippo-ra.yaml\n
Before applying the RequestAuthentication ghippo CR, make sure that ghippo-apiserver and ghippo-keycloak are started correctly.
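A hedged sketch of the delete-and-reapply sequence this workaround implies (the exact procedure may vary by version; ghippo-ra.yaml is the backup file created above):
# Temporarily remove the CR so that istiod can push configuration to istio-ingressgateway\nkubectl delete RequestAuthentication ghippo -n istio-system\n\n# After ghippo-apiserver and ghippo-keycloak are running correctly, restore the CR from the backup\nkubectl apply -f ghippo-ra.yaml\n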
"},{"location":"en/admin/ghippo/troubleshooting/ghippo02.html","title":"Login loop with error 401 or 403","text":"
This issue occurs when the MySQL database connected to ghippo-keycloak encounters a failure, causing the OIDC Public keys to be reset.
For Global Management version 0.11.1 and above, you can follow these steps to restore normal operation by updating the Global Management configuration file using helm .
# Update helm repository\nhelm repo update ghippo\n\n# Backup ghippo parameters\nhelm get values ghippo -n ghippo-system -o yaml > ghippo-values-bak.yaml\n\n# Get the current deployed ghippo version\nversion=$(helm get notes ghippo -n ghippo-system | grep \"Chart Version\" | awk -F ': ' '{ print $2 }')\n\n# Perform the update operation to make the configuration file take effect\nhelm upgrade ghippo ghippo/ghippo \\\n-n ghippo-system \\\n-f ./ghippo-values-bak.yaml \\\n--version ${version}\n
"},{"location":"en/admin/ghippo/troubleshooting/ghippo03.html","title":"Keycloak Unable to Start","text":""},{"location":"en/admin/ghippo/troubleshooting/ghippo03.html#common-issues","title":"Common Issues","text":""},{"location":"en/admin/ghippo/troubleshooting/ghippo03.html#symptoms","title":"Symptoms","text":"
MySQL is ready with no errors. After installing Global Management, Keycloak fails to start (restarting more than 10 times).
If the database is MySQL, check if the Keycloak database encoding is UTF8.
Check the network connection from Keycloak to the database, ensure the database resources are sufficient, including but not limited to resource limits, storage space, and physical machine resources.
Check if MySQL resource usage has reached the limit
Check if the number of tables in the MySQL database keycloak is 95. (The number of tables may vary across Keycloak versions, so you can compare it with the number of tables in a Keycloak database of the same version in a development or testing environment.) If the number is lower, the database table initialization may have failed (the command to check the number of tables is show tables; see the one-liner after this list).
Delete and recreate the Keycloak database with the command CREATE DATABASE IF NOT EXISTS keycloak CHARACTER SET utf8
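A hedged one-liner for the table count check mentioned in this list, assuming the MySQL instance is reachable from your shell (host and credentials are placeholders):
mysql -h <mysql-host> -u <user> -p -N -e 'USE keycloak; SHOW TABLES;' | wc -l\n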
If the failure is caused by the CPU, you need to upgrade your virtual machine or physical machine CPU to x86-64-v2 or above, ensuring that the x86 CPU instruction set supports SSE4.2. For details on how to upgrade, consult your virtualization platform provider or your physical machine vendor.
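A hedged way to check on Linux whether the host CPU supports the SSE4.2 instruction set:
grep -m1 -o 'sse4_2' /proc/cpuinfo || echo \"sse4_2 not supported\"\n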
For more information, see: https://github.com/keycloak/keycloak/issues/17290
"},{"location":"en/admin/ghippo/troubleshooting/ghippo04.html","title":"Failure to Upgrade Global Management Separately","text":"
If the upgrade fails and includes the following message, you can refer to the Offline Upgrade section to complete the installation of CRDs by following the steps for updating the ghippo crd.
ensure CRDs are installed first\n
"},{"location":"en/admin/ghippo/workspace/folder-permission.html","title":"Description of folder permissions","text":"
Folders have permission mapping capabilities, which can map the permissions of users/groups in this folder to subfolders, workspaces and resources under it.
If a user/group has the Folder Admin role in this folder, it remains Folder Admin when mapped to a subfolder, and is mapped to Workspace Admin for the workspaces under it. If a namespace is bound under Workspace and Folder -> Resource Group, the user/group also becomes Namespace Admin after the mapping.
Note
The permission mapping capability of folders will not be applied to shared resources, because sharing is to share the use permissions of the cluster to multiple workspaces, rather than assigning management permissions to workspaces, so permission inheritance and role mapping will not be implemented.
Folders are hierarchical, so when folders are mapped to departments/suppliers/projects in the enterprise:
If a user/group has administrative authority (Admin) in the first-level department, the second-level, third-level, and fourth-level departments or projects under it also have administrative authority;
If a user/group has access rights (Editor) in the first-level department, the second-, third-, and fourth-level departments or projects under it also have access rights;
If a user/group has read-only permission (Viewer) in the first-level department, the second-level, third-level, and fourth-level departments or projects under it also have read-only permission.
| Objects | Actions | Folder Admin | Folder Editor | Folder Viewer |
| --- | --- | --- | --- | --- |
| The folder itself | View | ✓ | ✓ | ✓ |
| | Authorization | ✓ | ✗ | ✗ |
| | Modify Alias | ✓ | ✗ | ✗ |
| Subfolders under it | Create | ✓ | ✗ | ✗ |
| | View | ✓ | ✓ | ✓ |
| | Authorization | ✓ | ✗ | ✗ |
| | Modify Alias | ✓ | ✗ | ✗ |
| Workspaces under it | Create | ✓ | ✗ | ✗ |
| | View | ✓ | ✓ | ✓ |
| | Authorization | ✓ | ✗ | ✗ |
| | Modify Alias | ✓ | ✗ | ✗ |
| Workspaces under it - Resource Group | View | ✓ | ✓ | ✓ |
| | Resource Binding | ✓ | ✗ | ✗ |
| | Unbind | ✓ | ✗ | ✗ |
| Workspaces under it - Shared Resources | View | ✓ | ✓ | ✓ |
| | New Share | ✓ | ✗ | ✗ |
| | Unshare | ✓ | ✗ | ✗ |
| | Resource Quota | ✓ | ✗ | ✗ |
"},{"location":"en/admin/ghippo/workspace/folders.html","title":"Create/Delete Folders","text":"
Folders have the capability to map permissions, allowing users/user groups to have their permissions in the folder mapped to its sub-folders, workspaces, and resources.
Follow the steps below to create a folder:
Log in to AI platform with a user account having the admin/folder admin role. Click Global Management -> Workspace and Folder at the bottom of the left navigation bar.
Click the Create Folder button in the top right corner.
Fill in the folder name, parent folder, and other information, then click OK to complete creating the folder.
Tip
After successful creation, the folder name will be displayed in the left tree structure, represented by different icons for workspaces and folders.
Note
To edit or delete a specific folder, select it and click ┇ on the right side.
If there are resources bound to the resource group or shared resources within the folder, the folder cannot be deleted. All resources need to be unbound before deleting.
If there are registry resources accessed by the microservice engine module within the folder, the folder cannot be deleted. All access to the registry needs to be removed before deleting the folder.
Shared resources do not necessarily mean that the shared users can use the shared resources without any restrictions. Admin, Kpanda Owner, and Workspace Admin can limit the maximum usage quota of a user through the Resource Quota feature in shared resources. If no restrictions are set, it means the usage is unlimited.
CPU Request (Core)
CPU Limit (Core)
Memory Request (MB)
Memory Limit (MB)
Total Storage Request (GB)
Persistent Volume Claims (PVC)
GPU Type, Spec, Quantity (including but not limited to Nvidia, Ascend, ILLUVATAR, and other GPUs)
A resource (cluster) can be shared among multiple workspaces, and a workspace can use resources from multiple shared clusters simultaneously.
"},{"location":"en/admin/ghippo/workspace/quota.html#resource-groups-and-shared-resources","title":"Resource Groups and Shared Resources","text":"
Cluster resources in both shared resources and resource groups are derived from Container Management. However, different effects will occur when binding a cluster to a workspace or sharing it with a workspace.
Binding Resources
Users/User groups in the workspace will have full management and usage permissions for the cluster. Workspace Admin will be mapped as Cluster Admin. Workspace Admin can access the Container Management module to manage the cluster.
Note
As of now, there are no Cluster Editor and Cluster Viewer roles in the Container Management module. Therefore, Workspace Editor and Workspace Viewer cannot be mapped.
Adding Shared Resources
Users/User groups in the workspace will have usage permissions for the cluster resources.
Unlike resource groups, when sharing a cluster with a workspace, the roles of the users in the workspace will not be mapped to the resources. Therefore, Workspace Admin will not be mapped as Cluster Admin.
This section demonstrates three scenarios related to resource quotas.
Select workspace ws01 and the shared cluster in Workbench, and create a namespace ns01 .
If no resource quotas are set in the shared cluster, there is no need to set resource quotas when creating the namespace.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the CPU request for the namespace must be less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful creation.
"},{"location":"en/admin/ghippo/workspace/quota.html#bind-namespace-to-workspace","title":"Bind Namespace to Workspace","text":"
Prerequisite: Workspace ws01 has added a shared cluster, and the operator has the Workspace Admin + Kpanda Owner or Admin role.
The two methods of binding have the same effect.
Bind the created namespace ns01 to ws01 in Container Management.
If no resource quotas are set in the shared cluster, the namespace ns01 can be successfully bound regardless of whether resource quotas are set.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the namespace ns01 must meet the requirement of CPU requests less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful binding.
Bind the namespace ns01 to ws01 in Global Management.
If no resource quotas are set in the shared cluster, the namespace ns01 can be successfully bound regardless of whether resource quotas are set.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the namespace ns01 must meet the requirement of CPU requests less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful binding.
"},{"location":"en/admin/ghippo/workspace/quota.html#unbind-namespace-from-workspace","title":"Unbind Namespace from Workspace","text":"
The two methods of unbinding have the same effect.
Unbind the namespace ns01 from workspace ws01 in Container Management.
If no resource quotas are set in the shared cluster, unbinding the namespace ns01 will not affect the resource quotas, regardless of whether resource quotas were set for the namespace.
If resource quotas (CPU Request = 100 cores) are set in the shared cluster and the namespace ns01 has its own resource quotas, unbinding will release the proper resource quota.
Unbind the namespace ns01 from workspace ws01 in Global Management.
If no resource quotas are set in the shared cluster, unbinding the namespace ns01 will not affect the resource quotas, regardless of whether resource quotas were set for the namespace.
If resource quotas (CPU Request = 100 cores) are set in the shared cluster and the namespace ns01 has its own resource quotas, unbinding will release the proper resource quota.
"},{"location":"en/admin/ghippo/workspace/res-gp-and-shared-res.html","title":"Differences between Resource Groups and Shared Resources","text":"
Both resource groups and shared resources support cluster binding, but they have significant differences in usage.
"},{"location":"en/admin/ghippo/workspace/res-gp-and-shared-res.html#differences-in-usage-scenarios","title":"Differences in Usage Scenarios","text":"
Cluster Binding for Resource Groups: Resource groups are usually used for batch authorization. After binding a resource group to a cluster, the workspace administrator will be mapped as a cluster administrator and able to manage and use cluster resources.
Cluster Binding for Shared Resources: Shared resources are usually used for resource quotas. A typical scenario is that the platform administrator assigns a cluster to a first-level supplier, who then assigns the cluster to a second-level supplier and sets resource quotas for the second-level supplier.
Note: In this scenario, the platform administrator needs to impose resource restrictions on secondary suppliers. Currently, it is not supported for the primary supplier to limit the cluster quota of secondary suppliers.
"},{"location":"en/admin/ghippo/workspace/res-gp-and-shared-res.html#differences-in-cluster-quota-usage","title":"Differences in Cluster Quota Usage","text":"
Cluster Binding for Resource Groups: The workspace administrator is mapped as the administrator of the cluster and is equivalent to being granted the Cluster Admin role in Container Management-Permission Management. They can have unrestricted access to cluster resources, manage important content such as management nodes, and cannot be subject to resource quotas.
Cluster Binding for Shared Resources: The workspace administrator can only use the quota in the cluster to create namespaces in the Workbench and does not have cluster management permissions. If the workspace is restricted by a quota, the workspace administrator can only create and use namespaces within the quota range.
"},{"location":"en/admin/ghippo/workspace/res-gp-and-shared-res.html#differences-in-resource-types","title":"Differences in Resource Types","text":"
Resource Groups: Can bind to clusters, cluster-namespaces, multiclouds, multicloud namespaces, meshes, and mesh-namespaces.
Shared Resources: Can only bind to clusters.
"},{"location":"en/admin/ghippo/workspace/res-gp-and-shared-res.html#similarities-between-resource-groups-and-shared-resources","title":"Similarities between Resource Groups and Shared Resources","text":"
After binding to a cluster, both resource groups and shared resources can go to the Workbench to create namespaces, which will be automatically bound to the workspace.
A workspace is a resource category that represents a hierarchical relationship of resources. A workspace can contain resources such as clusters, namespaces, and registries. Typically, each workspace corresponds to a project and different resources can be allocated, and different users and user groups can be assigned to each workspace.
Follow the steps below to create a workspace:
Log in to AI platform with a user account having the admin/folder admin role. Click Global Management -> Workspace and Folder at the bottom of the left navigation bar.
Click the Create Workspace button in the top right corner.
Fill in the workspace name, folder assignment, and other information, then click OK to complete creating the workspace.
Tip
After successful creation, the workspace name will be displayed in the left tree structure, represented by different icons for folders and workspaces.
Note
To edit or delete a specific workspace or folder, select it and click ... on the right side.
If resource groups and shared resources have resources under the workspace, the workspace cannot be deleted. All resources need to be unbound before deletion of the workspace.
If Microservices Engine has Integrated Registry under the workspace, the workspace cannot be deleted. Integrated Registry needs to be removed before deletion of the workspace.
If Container Registry has Registry Space or Integrated Registry under the workspace, the workspace cannot be deleted. Registry Space needs to be removed, and Integrated Registry needs to be deleted before deletion of the workspace.
"},{"location":"en/admin/ghippo/workspace/ws-folder.html","title":"Workspace and Folder","text":"
Workspace and Folder is a feature that provides resource isolation and grouping, addressing issues related to unified authorization, resource grouping, and resource quotas.
Workspace and Folder involves two concepts: workspaces and folders.
Workspaces allow the management of resources through Authorization , Resource Group , and Shared Resource , enabling users (and user groups) to share resources within the workspace.
Resources
Resources are at the lowest level of the hierarchy in the resource management module. They include clusters, namespaces, pipelines, gateways, and more. All these resources can only have workspaces as their parent level. Workspaces act as containers for grouping resources.
Workspace
A workspace usually refers to a project or environment, and the resources in each workspace are logically isolated from those in other workspaces. You can grant users (groups of users) different access rights to the same set of resources through authorization in the workspace.
Workspaces are at the first level, counting from the bottom of the hierarchy, and contain resources. All resources except shared resources have one and only one parent. All workspaces also have one and only one parent folder.
Resources are grouped by workspace, and there are two grouping modes in workspace, namely Resource Group and Shared Resource .
Resource group
A resource can only be added to one resource group, and resource groups correspond one-to-one with workspaces. After a resource is added to a resource group, the Workspace Admin obtains the management authority of the resource, equivalent to being the owner of the resource.
Shared resource
With shared resources, multiple workspaces can share one or more resources. Resource owners can choose to share their own resources with workspaces. Generally, when sharing, the resource owner will limit the amount of resources that the shared workspace is allowed to use. After resources are shared, Workspace Admin only has usage rights within the resource limit, and cannot manage the resources or adjust the amount of resources that the workspace can use.
At the same time, shared resources have certain requirements for the resources themselves: only Cluster resources can be shared. A Cluster Admin can share Cluster resources with different workspaces and limit how much of the Cluster each workspace can use.
Workspace Admin can create multiple Namespaces within the resource quota, but the sum of the resource quotas of the Namespaces cannot exceed the resource quota of the Cluster in the workspace. For Kubernetes resources, the only resource type that can be shared currently is Cluster.
Folders can be used to build enterprise business hierarchy relationships.
Folders are a further grouping mechanism based on workspaces and have a hierarchical structure. A folder can contain workspaces, other folders, or a combination of both, forming a tree-like organizational relationship.
Folders allow you to map your business hierarchy and group workspaces by department. Folders are not directly linked to resources, but indirectly achieve resource grouping through workspaces.
A folder has one and only one parent folder, and the root folder is the highest level of the hierarchy. The root folder has no parent, and folders and workspaces are attached to the root folder.
In addition, users (groups) in folders can inherit permissions from their parents through a hierarchical structure. The permissions of the user in the hierarchical structure come from the combination of the permissions of the current level and the permissions inherited from its parents. The permissions are additive and there is no mutual exclusion.
"},{"location":"en/admin/ghippo/workspace/ws-permission.html","title":"Description of workspace permissions","text":"
The workspace has permission mapping and resource isolation capabilities, and can map the permissions of users/groups in the workspace to the resources under it. If the user/group has the Workspace Admin role in the workspace and the resource Namespace is bound to the workspace-resource group, the user/group will become Namespace Admin after mapping.
Note
The permission mapping capability of the workspace will not be applied to shared resources, because sharing is to share the cluster usage permissions to multiple workspaces, rather than assigning management permissions to the workspaces, so permission inheritance and role mapping will not be implemented.
Resource isolation is achieved by binding resources to different workspaces. Therefore, resources can be flexibly allocated to each workspace (tenant) with the help of permission mapping, resource isolation, and resource sharing capabilities.
Generally applicable to the following two use cases:
Cluster one-to-one
| Cluster | Department/Tenant (Workspace) | Purpose |
| --- | --- | --- |
| Cluster 01 | A | Administration and Usage |
| Cluster 02 | B | Administration and Usage |
Cluster one-to-many
| Cluster | Department/Tenant (Workspace) | Resource Quota |
| --- | --- | --- |
| Cluster 01 | A | 100-core CPU |
| Cluster 01 | B | 50-core CPU |
Authorized users can go to modules such as workbench, microservice engine, middleware, multicloud orchestration, and service mesh to use resources in the workspace. For the operation scope of the roles of Workspace Admin, Workspace Editor, and Workspace Viewer in each module, please refer to the permission description:
Suppose a user John (\"John\" represents any user who needs to bind resources) has been assigned the Workspace Admin role, or has been granted the workspace's \"Resource Binding\" permission through a custom role, and wants to bind a specific cluster or namespace to the workspace.
To bind cluster or namespace resources to a workspace, the user needs not only the workspace's \"Resource Binding\" permission but also Cluster Admin permission.
"},{"location":"en/admin/ghippo/workspace/wsbind-permission.html#granting-authorization-to-john","title":"Granting Authorization to John","text":"
Using the Platform Admin Role, grant John the role of Workspace Admin on the Workspace -> Authorization page.
Then, on the Container Management -> Permissions page, authorize John as a Cluster Admin by Add Permission.
"},{"location":"en/admin/ghippo/workspace/wsbind-permission.html#binding-to-workspace","title":"Binding to Workspace","text":"
Using John's account to log in to AI platform, on the Container Management -> Clusters page, John can bind the specified cluster to his own workspace by using the Bind Workspace button.
Note
John can only bind clusters or namespaces to a specific workspace in the Container Management module, and cannot perform this operation in the Global Management module.
To bind a namespace to a workspace, you must have at least Workspace Admin and Cluster Admin permissions.
"},{"location":"en/admin/host/createhost.html","title":"Create and Start a Cloud Host","text":"
After the user completes registration and is assigned a workspace, namespace, and resources, they can create and start a cloud host.
"},{"location":"en/admin/host/usehost.html#steps-to-follow","title":"Steps to Follow","text":"
Log in to the AI platform as an administrator.
Navigate to Container Management -> Container Network -> Services, click the service name to enter the service details page, and click Update at the top right corner.
Change the port range to 30900-30999, ensuring there are no conflicts.
Log in to the AI platform as an end user, navigate to the proper service, and check the access port.
Use an SSH client to log in to the cloud host from the external network (see the example after these steps).
At this point, you can perform various operations on the cloud host.
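A hedged example of the SSH login (the username, node IP, and port are placeholders; use the access port found in the previous step):
ssh <username>@<node-external-ip> -p <nodeport>\n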
Next step: Cloud Resource Sharing: Quota Management
The Alert Center is an important feature provided by AI platform that allows users to easily view all active and historical alerts by cluster and namespace through a graphical interface, and search alerts based on severity level (critical, warning, info).
All alerts are triggered based on the threshold conditions set in the preset alert rules. In AI platform, some global alert policies are built-in, but users can also create or delete alert policies at any time, and set thresholds for the following metrics:
CPU usage
Memory usage
Disk usage
Disk reads per second
Disk writes per second
Cluster disk read throughput
Cluster disk write throughput
Network send rate
Network receive rate
Users can also add labels and annotations to alert rules. Alert rules can be classified as active or expired, and certain rules can be enabled/disabled to achieve silent alerts.
When the threshold condition is met, users can configure how they want to be notified, including email, DingTalk, WeCom, webhook, and SMS notifications. All notification message templates can be customized and all messages are sent at specified intervals.
In addition, the Alert Center supports sending alert messages to designated users through SMS services provided by Alibaba Cloud and Tencent Cloud, with more providers to be added, enabling multiple channels of alert notification.
AI platform Alert Center is a powerful alert management platform that helps users quickly detect and resolve problems in the cluster, improve business stability and availability, and facilitate cluster inspection and troubleshooting.
In addition to the built-in alert policies, AI platform allows users to create custom alert policies. Each alert policy is a collection of alert rules that can be set for clusters, nodes, and workloads. When an alert object reaches the threshold set by any of the rules in the policy, an alert is automatically triggered and a notification is sent.
Taking the built-in alerts as an example, click the first alert policy alertmanager.rules .
You can see that some alert rules have been set under it. You can add more rules under this policy, or edit or delete them at any time. You can also view the historical and active alerts related to this alert policy and edit the notification configuration.
Select Alert Center -> Alert Policies , and click the Create Alert Policy button.
Fill in the basic information, select one or more clusters, nodes, or workloads as the alert objects, and click Next .
The list must have at least one rule. If the list is empty, please Add Rule .
Create an alert rule in the pop-up window, fill in the parameters, and click OK .
Template rules: Pre-defined basic metrics that can monitor CPU, memory, disk, and network.
PromQL rules: Enter a PromQL expression; refer to Prometheus query expressions for the syntax (see the example after this list).
Duration: When the rule's condition persists for the set duration, the alert policy enters the triggered state.
Alert level: Including emergency, warning, and information levels.
Advanced settings: Custom tags and annotations.
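For reference, the sketch below shows what a threshold-style PromQL rule expression might look like. The metric names assume that node-exporter metrics are being collected in your cluster; adjust them to whatever metrics are actually available in your environment.

```promql
# Fires when a node's memory usage exceeds 80%
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
```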
After clicking Next , configure notifications.
After the configuration is complete, click the OK button to return to the Alert Policy list.
Tip
The newly created alert policy is in the Not Triggered state. Once the threshold conditions and duration specified in the rules are met, it will change to the Triggered state.
After filling in the basic information, click Add Rule and select Log Rule as the rule type.
Creating log rules is supported only when the resource object is selected as a node or workload.
Field Explanation:
Filter Condition : Field used to query log content, supports four filtering conditions: AND, OR, regular expression matching, and fuzzy matching.
Condition : Based on the filter condition, enter keywords or matching conditions.
Time Range : Time range for log queries.
Threshold Condition : Enter the alert threshold value in the input box. When the set threshold is reached, an alert will be triggered. Supported comparison operators are: >, ≥, =, ≤, <.
Alert Level : Select the alert level to indicate the severity of the alert.
After filling in the basic information, click Add Rule and select Event Rule as the rule type.
Creating event rules is supported only when the resource object is selected as a workload.
Field Explanation:
Event Rule : Only supports selecting the workload as the resource object.
Event Reason : Event reasons differ by workload type; the selected event reasons are combined with an "AND" relationship.
Time Range : Detect data generated within this time range. If the threshold condition is reached, an alert event will be triggered.
Threshold Condition : When the generated events reach the set threshold, an alert event will be triggered.
Trend Chart : By default, it queries the trend of event changes within the last 10 minutes. The value at each point represents the total number of occurrences within the configured time range ending at that point.
Click ┇ at the right side of the list, then choose Delete from the pop-up menu to delete an alert policy. By clicking on the policy name, you can enter the policy details where you can add, edit, or delete the alert rules under it.
Warning
Deleted alert policies will be permanently removed, so please proceed with caution.
The Alert Template feature allows platform administrators to create alert templates and rules, and business units can use these templates directly to create alert policies. This reduces the effort business personnel spend managing alert rules while still allowing alert thresholds to be adjusted to the actual environment.
In the navigation bar, select Alert -> Alert Policy, and click Alert Template at the top.
Click Create Alert Template, and set the name, description, and other information for the Alert template.
| Parameter | Description |
| --- | --- |
| Template Name | The name can only contain lowercase letters, numbers, and hyphens (-), must start and end with a lowercase letter or number, and can be up to 63 characters long. |
| Description | The description can contain any characters and can be up to 256 characters long. |
| Resource Type | Used to specify the matching type of the Alert template. |
| Alert Rule | Supports pre-defining multiple Alert rules, including template rules and PromQL rules. |
Click OK to complete the creation and return to the Alert template list. Click the template name to view the template details.
Alert Inhibition is mainly a mechanism for temporarily hiding or reducing the priority of alerts that do not need immediate attention. The purpose of this feature is to reduce unnecessary alert information that may disturb operations personnel, allowing them to focus on more critical issues.
Alert inhibition recognizes and ignores certain alerts by defining a set of rules to deal with specific conditions. There are mainly the following conditions:
Parent-child inhibition: when a parent alert (for example, a crash on a node) is triggered, all child alerts caused by it (for example, a crash of a container running on that node) are inhibited.
Similar alert inhibition: When alerts have the same characteristics (for example, the same problem on the same instance), multiple alerts are inhibited.
In the left navigation bar, select Alert -> Noise Reduction, and click Inhibition at the top.
Click Create Inhibition, and set the name and rules for the inhibition.
Note
Inhibition avoids flooding users with multiple similar or related alerts that may be triggered by the same issue. It works by defining, through Rule Details and Alert Details, a set of rules that identify and ignore certain alerts.
| Parameter | Description |
| --- | --- |
| Name | The name can only contain lowercase letters, numbers, and hyphens (-), must start and end with a lowercase letter or number, and can be up to 63 characters long. |
| Description | The description can contain any characters and can be up to 256 characters long. |
| Cluster | The cluster where the inhibition rule applies. |
| Namespace | The namespace where the inhibition rule applies. |
| Source Alert | Matching alerts by label conditions. It compares alerts that meet all label conditions with those that meet inhibition conditions, and alerts that do not meet inhibition conditions will be sent to the user as usual. Value range explanation: Alert Level: the level of metric or event alerts, can be set as Critical, Major, Minor; Resource Type: the resource type of the alert object, can be set as Cluster, Node, StatefulSet, Deployment, DaemonSet, Pod; Labels: alert identification attributes, consisting of label name and label value, supports user-defined values. |
| Inhibition | Specifies the matching conditions for the target alert (the alert to be inhibited). Alerts that meet all the conditions will no longer be sent to the user. |
| Equal | Specifies the list of labels to compare to determine if the source alert and target alert match. Inhibition is triggered only when the values of the labels specified in equal are exactly the same in the source and target alerts. The equal field is optional. If the equal field is omitted, all labels are used for matching. |
Click OK to complete the creation and return to Inhibition list. Click the inhibition rule name to view the rule details.
After entering Insight , click Alert Center -> Notification Settings in the left navigation bar. By default, the email notification object is selected. Click Add email group and add one or more email addresses.
Multiple email addresses can be added.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list to edit or delete the email group.
In the left navigation bar, click Alert Center -> Notification Settings -> WeCom . Click Add Group Robot and add one or more group robots.
For the URL of the WeCom group robot, please refer to the official document of WeCom: How to use group robots.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information , and you can also edit or delete the group robot.
In the left navigation bar, click Alert Center -> Notification Settings -> DingTalk . Click Add Group Robot and add one or more group robots.
For the URL of the DingTalk group robot, please refer to the official document of DingTalk: Custom Robot Access.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information , and you can also edit or delete the group robot.
In the left navigation bar, click Alert Center -> Notification Settings -> Lark . Click Add Group Bot and add one or more group bots.
Note
When signature verification is required in Lark's group bot, you need to fill in the specific signature key when enabling notifications. Refer to Customizing Bot User Guide.
After configuration, you will be automatically redirected to the list page. Click ┇ on the right side of the list and select Send Test Message . You can edit or delete group bots.
In the left navigation bar, click Alert Center -> Notification Settings -> Webhook . Click New Webhook and add one or more Webhooks.
For the Webhook URL and more configuration methods, please refer to the webhook document.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information , and you can also edit or delete the Webhook.
In the left navigation bar, click Alert Center -> Notification Settings -> SMS . Click Add SMS Group and add one or more SMS groups.
Enter the name, the object receiving the message, phone number, and notification server in the pop-up window.
The notification server needs to be created in advance under Notification Settings -> Notification Server . Currently, two cloud SMS providers are supported: Alibaba Cloud and Tencent Cloud. Refer to your own cloud account information for the specific configuration parameters.
After the SMS group is successfully added, the notification list will automatically return. Click ┇ on the right side of the list to edit or delete the SMS group.
The message template feature supports customizing the content of message templates and can notify specified objects in the form of email, WeCom, DingTalk, Webhook, and SMS.
"},{"location":"en/admin/insight/alert-center/msg-template.html#creating-a-message-template","title":"Creating a Message Template","text":"
In the left navigation bar, select Alert -> Message Template .
Insight comes with two default built-in templates in both Chinese and English for user convenience.
Fill in the template content.
Info
Observability comes with predefined message templates. If you need to define the content of the templates, refer to Configure Notification Templates.
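As an illustration only, a template body assembled from the variables listed in the table later on this page might look like the following; the exact layout is up to you.

```text
[{{ .Labels.severity }}] Alert: {{ .Labels.alertname }}
Cluster: {{ .Labels.cluster }}
Namespace: {{ .Labels.namespace }}
Target: {{ .Labels.target_type }}/{{ .Labels.target }}
Started at: {{ .StartsAt }}
Description: {{ .Annotations.description }}
```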
Click the name of a message template to view the details of the message template in the right slider.
| Parameters | Variable | Description |
| --- | --- | --- |
| ruleName | {{ .Labels.alertname }} | The name of the rule that triggered the alert |
| groupName | {{ .Labels.alertgroup }} | The name of the alert policy to which the alert rule belongs |
| severity | {{ .Labels.severity }} | The level of the alert that was triggered |
| cluster | {{ .Labels.cluster }} | The cluster where the resource that triggered the alert is located |
| namespace | {{ .Labels.namespace }} | The namespace where the resource that triggered the alert is located |
| node | {{ .Labels.node }} | The node where the resource that triggered the alert is located |
| targetType | {{ .Labels.target_type }} | The resource type of the alert target |
| target | {{ .Labels.target }} | The name of the object that triggered the alert |
| value | {{ .Annotations.value }} | The metric value at the time the alert notification was triggered |
| startsAt | {{ .StartsAt }} | The time when the alert started to occur |
| endsAt | {{ .EndsAt }} | The time when the alert ended |
| description | {{ .Annotations.description }} | A detailed description of the alert |
| labels | {{ for .labels }} {{ end }} | All labels of the alert; use the for function to iterate through the labels list to get all label contents. |

Editing or Deleting a Message Template
Click ┇ on the right side of the list and select Edit or Delete from the pop-up menu to modify or delete the message template.
Warning
Once a template is deleted, it cannot be recovered, so please use caution when deleting templates.
Alert silence is a feature that allows alerts meeting certain criteria to be temporarily disabled from sending notifications within a specific time range. This feature helps operations personnel avoid receiving too many noisy alerts during certain operations or events, while also allowing for more precise handling of real issues that need to be addressed.
On the Alert Silence page, you can see two tabs: Active Rule and Expired Rule. The former presents the rules currently in effect, while the latter presents those that were defined in the past but have now expired (or have been deleted by the user).
"},{"location":"en/admin/insight/alert-center/silent.html#creating-a-silent-rule","title":"Creating a Silent Rule","text":"
In the left navigation bar, select Alert -> Noise Reduction -> Alert Silence , and click the Create Silence Rule button.
Fill in the parameters for the silent rule, such as cluster, namespace, tags, and time, to define the scope and effective time of the rule, and then click OK .
Return to the rule list, and on the right side of the list, click ┇ to edit or delete a silent rule.
Through the Alert Silence feature, you can flexibly control which alerts should be ignored and when they should be effective, thereby improving operational efficiency and reducing the possibility of false alerts.
Insight supports SMS notifications and currently sends alert messages using integrated Alibaba Cloud and Tencent Cloud SMS services. This article explains how to configure the SMS notification server in Insight. The variables supported in the SMS signature are the default variables in the message template. As the number of SMS characters is limited, it is recommended to choose more explicit variables.
For information on how to configure SMS recipients, refer to the document: Configure SMS Notification Group.
Go to Alert Center -> Notification Settings -> Notification Server .
Click Add Notification Server .
Configure Alibaba Cloud server.
To apply for Alibaba Cloud SMS service, please refer to Alibaba Cloud SMS Service.
Field descriptions:
AccessKey ID : Parameter used by Alibaba Cloud to identify the user.
AccessKey Secret : Key used by Alibaba Cloud to authenticate the user. AccessKey Secret must be kept confidential.
SMS Signature : The SMS service supports creating signatures that meet the requirements according to user needs. When sending SMS, the SMS platform will add the approved SMS signature to the SMS content before sending it to the SMS recipient.
Template CODE : The SMS template is the specific content of the SMS to be sent.
Parameter Template : The SMS body template can contain variables. Users can use variables to customize the SMS content.
Please refer to Alibaba Cloud Variable Specification.
Note
Example: The template content defined in Alibaba Cloud is: ${severity}: ${alertname} triggered at ${startat}. Refer to the configuration in the parameter template.
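The exact field format that Insight expects in the parameter template is not reproduced here. As a purely illustrative sketch, mapping each Alibaba Cloud template variable to the message-template variables described on this page could look something like this:

```json
{
  "severity": "{{ .Labels.severity }}",
  "alertname": "{{ .Labels.alertname }}",
  "startat": "{{ .StartsAt }}"
}
```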
Configure Tencent Cloud server.
To apply for Tencent Cloud SMS service, please refer to Tencent Cloud SMS.
Field descriptions:
Secret ID : Parameter used by Tencent Cloud to identify the API caller.
SecretKey : Parameter used by Tencent Cloud to authenticate the API caller.
SMS Template ID : The SMS template ID automatically generated by Tencent Cloud system.
Signature Content : The SMS signature content, which is the full name or abbreviation of the actual website name defined in the Tencent Cloud SMS signature.
SdkAppId : SMS SdkAppId, the actual SdkAppId generated after adding the application in the Tencent Cloud SMS console.
Parameter Template : The SMS body template can contain variables. Users can use variables to customize the SMS content. Please refer to: Tencent Cloud Variable Specification.
Note
Example: The template content defined in Tencent Cloud is: {1}: {2} triggered at {3}. Refer to the configuration in the parameter template.
After installing the insight-agent in the cluster, Fluent Bit in insight-agent will collect logs in the cluster by default, including Kubernetes event logs, node logs, and container logs. Fluent Bit has already configured various log collection plugins, related filter plugins, and log output plugins. The working status of these plugins determines whether log collection is normal. Below is a dashboard for Fluent Bit that monitors the working conditions of each Fluent Bit in the cluster and the collection, processing, and export of plugin logs.
In the AI platform, enter Insight and select Dashboard in the left navigation bar.
Click the dashboard title Overview .
Switch to the insight-system -> Fluent Bit dashboard.
There are several check boxes above the Fluent Bit dashboard to select the input plugin, filter plugin, output plugin, and cluster in which it is located.
| Filter Plugin | Plugin Description |
| --- | --- |
| Lua.audit_log.k8s | Use lua to filter Kubernetes audit logs that meet certain conditions |
Note
There are more filter plugins than Lua.audit_log.k8s; the table above only introduces the filters that may discard logs.
Log Output Plugin
| Output Plugin | Plugin Description |
| --- | --- |
| es.kube.kubeevent.syslog | Write Kubernetes audit logs, event logs, and syslog logs to the ElasticSearch cluster |
| forward.audit_log | Send Kubernetes audit logs and global management audit logs to Global Management |
| es.skoala | Write request logs and instance logs of the microservice gateway to the ElasticSearch cluster |

Trace Collection Troubleshooting Guide
Before attempting to troubleshoot issues with trace data collection, you need to understand the transmission path of trace data. The following is a schematic diagram of the transmission of trace data:
As shown in the figure above, a failure at any step of the transmission will make trace data impossible to query. If you find that there is no trace data after completing the application trace enhancement, perform the following steps:
In the AI platform, enter Insight and select Dashboard in the left navigation bar.
Click the dashboard title Overview .
Switch to the insight-system -> insight tracing debug dashboard.
You can see that this dashboard is composed of three blocks, each responsible for monitoring the data transmission of different clusters and components. Check whether there are problems with trace data transmission through the generated time series chart.
Displays the opentelemetry collector in each worker cluster receiving trace data from language probes/SDKs and sending the aggregated trace data. You can select the cluster using the Cluster selection box in the upper left corner.
Note
Based on these four time series charts, you can determine whether the opentelemetry collector in this cluster is running normally.
global opentelemetry collector
Display the opentelemetry collector in the Global Service Cluster receiving trace data from the worker cluster's opentelemetry collector and sending aggregated trace data.
Note
The opentelemetry collector in the Global Management Cluster is also responsible for sending audit logs of all worker clusters' global management module and Kubernetes audit logs (not collected by default) to the audit server component of the global management module.
global jaeger collector
Display the jaeger collector in the Global Management Cluster receiving data from the otel collector in the Global Management Cluster and sending trace data to the ElasticSearch cluster.
"},{"location":"en/admin/insight/best-practice/find_root_cause.html","title":"Troubleshooting Service Issues with Insight","text":"
This article serves as a guide on using Insight to identify and analyze abnormal components in AI platform and determine the root causes of component exceptions.
Please note that this article assumes you have a basic understanding of Insight's product features.
"},{"location":"en/admin/insight/best-practice/find_root_cause.html#service-map-identifying-abnormalities-on-a-macro-level","title":"Service Map - Identifying Abnormalities on a Macro Level","text":"
In enterprise microservice architectures, managing a large number of services with complex interdependencies can be challenging. Insight offers service map monitoring, allowing users to gain a high-level overview of the running microservices in the system.
In the example below, you observe that the node insight-server is highlighted in red/yellow on the service map. By hovering over the node, you can see the error rate associated with it. To investigate further and understand why the error rate is not 0 , you can explore more detailed information:
Alternatively, clicking on the service name at the top will take you to the service's overview UI:
"},{"location":"en/admin/insight/best-practice/find_root_cause.html#service-overview-delving-into-detailed-analysis","title":"Service Overview - Delving into Detailed Analysis","text":"
When it becomes necessary to analyze inbound and outbound traffic separately, you can use the filter in the upper right corner to refine the data. After applying the filter, you can observe that the service has multiple operations with a non-zero error rate. To investigate further, you can inspect the traces generated by these operations during a specific time period by clicking on "View Traces":
"},{"location":"en/admin/insight/best-practice/find_root_cause.html#trace-details-identifying-and-eliminating-root-causes-of-errors","title":"Trace Details - Identifying and Eliminating Root Causes of Errors","text":"
In the trace list, you can easily identify traces marked as error (circled in red in the figure above) and examine their details by clicking on the proper trace. The following figure illustrates the trace details:
Within the trace diagram, you can quickly locate the last piece of data in an error state. Expanding the associated logs section reveals the cause of the request error:
Following the above analysis method, you can also identify traces related to other operation errors:
"},{"location":"en/admin/insight/best-practice/find_root_cause.html#lets-get-started-with-your-analysis","title":"Let's Get Started with Your Analysis!","text":""},{"location":"en/admin/insight/best-practice/insight-kafka.html","title":"Kafka + Elasticsearch Stream Architecture for Handling Large-Scale Logs","text":"
As businesses grow, the amount of log data generated by applications increases significantly. To ensure that systems can properly collect and analyze massive amounts of log data, it is common practice to introduce a streaming architecture using Kafka to handle asynchronous data collection. The collected log data flows through Kafka and is consumed by proper components, which then store the data into Elasticsearch for visualization and analysis using Insight.
This article will introduce two solutions:
Fluentbit + Kafka + Logstash + Elasticsearch
Fluentbit + Kafka + Vector + Elasticsearch
Once we integrate Kafka into the logging system, the data flow diagram looks as follows:
Both solutions share similarities but differ in the component used to consume Kafka data. To ensure compatibility with Insight's data analysis, the format of the data consumed from Kafka and written into Elasticsearch should be consistent with the data directly written by Fluentbit to Elasticsearch.
Let's first see how Fluentbit writes logs to Kafka:
Once the Kafka cluster is ready, we need to modify the content of the insight-system namespace's ConfigMap . We will add three Kafka outputs and comment out the original three Elasticsearch outputs:
Assuming the Kafka Brokers address is: insight-kafka.insight-system.svc.cluster.local:9092
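As a minimal sketch of one such output, a Fluent Bit Kafka output section could look like the block below. The Match pattern and the topic name insight-logs are assumptions for illustration; repeat the block (with matching patterns and topics) for the other two log streams that were previously written to Elasticsearch.

```ini
[OUTPUT]
    Name            kafka
    Match           kube.*
    Brokers         insight-kafka.insight-system.svc.cluster.local:9092
    Topics          insight-logs
    Format          json
    timestamp_key   @timestamp
```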
Next, let's discuss the subtle differences in consuming Kafka data and writing it to Elasticsearch. As mentioned at the beginning of this article, we will explore Logstash and Vector as two ways to consume Kafka data.
"},{"location":"en/admin/insight/best-practice/insight-kafka.html#consuming-kafka-and-writing-to-elasticsearch","title":"Consuming Kafka and Writing to Elasticsearch","text":"
Assuming the Elasticsearch address is: https://mcamel-common-es-cluster-es-http.mcamel-system:9200
"},{"location":"en/admin/insight/best-practice/insight-kafka.html#using-logstash-for-consumption","title":"Using Logstash for Consumption","text":"
If you are familiar with the Logstash technology stack, you can continue using this approach.
When deploying Logstash via Helm, you can add the following pipeline in the logstashPipeline section:
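The pipeline below is a minimal sketch, assuming the Kafka brokers and Elasticsearch address mentioned above and a hypothetical topic name insight-logs; adjust the topics, credentials, and index name to match your actual setup.

```yaml
logstashPipeline:
  logstash.conf: |
    input {
      kafka {
        bootstrap_servers => "insight-kafka.insight-system.svc.cluster.local:9092"
        topics => ["insight-logs"]
        codec => "json"
      }
    }
    output {
      elasticsearch {
        hosts => ["https://mcamel-common-es-cluster-es-http.mcamel-system:9200"]
        user => "elastic"
        password => "your-password"                 # replace with the actual Elasticsearch credentials
        index => "insight-logs-%{+YYYY.MM.dd}"      # keep the index naming consistent with what Fluent Bit used to write
        ssl_certificate_verification => false       # only if the cluster uses a self-signed certificate
      }
    }
```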
"},{"location":"en/admin/insight/best-practice/insight-kafka.html#checking-if-its-working-properly","title":"Checking if it's Working Properly","text":"
You can verify if the configuration is successful by checking if there are new data in the Insight log query interface or observing an increase in the number of indices in Elasticsearch.
Understand and meet the DeepFlow runtime permissions and kernel requirements
Storage volume is ready
"},{"location":"en/admin/insight/best-practice/integration_deepflow.html#install-deepflow-and-configure-insight","title":"Install DeepFlow and Configure Insight","text":"
Installing DeepFlow components requires two charts:
deepflow: includes components such as deepflow-app, deepflow-server, deepflow-clickhouse, and deepflow-agent. Generally, deepflow is deployed in the global service cluster, so it also installs deepflow-agent together.
deepflow-agent: only includes the deepflow-agent component, used to collect eBPF data and send it to deepflow-server.
DeepFlow needs to be installed in the global service cluster.
Go to the kpanda-global-cluster cluster and click Helm Apps -> Helm Charts in the left navigation bar, select community as the repository, and search for deepflow in the search box:
Click the deepflow card to enter the details page:
Click Install to enter the installation page:
Most values have defaults. ClickHouse and MySQL require storage volumes, with a default size of 10Gi each. You can search for and modify the relevant configurations using the persistence keyword.
After configuring, click OK to start the installation.
DeepFlow Agent is installed in the sub-cluster using the deepflow-agent chart. It is used to collect eBPF observability data from the sub-cluster and report it to the global service cluster. Similar to installing deepflow, go to Helm Apps -> Helm Charts, select community as the repository, and search for deepflow-agent in the search box. Follow the process to enter the installation page.
Parameter Explanation (a sample values sketch follows this list):
DeployComponent : deployment mode, default is daemonset.
timezone : timezone, default is Asia/Shanghai.
DeepflowServerNodeIPS : addresses of the nodes where deepflow server is installed.
deepflowK8sClusterID : cluster UUID.
agentGroupID : agent group ID.
controllerPort : data reporting port of deepflow server, can be left blank, default is 30035.
clusterNAME : cluster name.
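A values sketch assembled from the parameters above might look like the following. The key nesting in the actual deepflow-agent chart may differ, and every placeholder must be replaced with values from your environment.

```yaml
DeployComponent: daemonset              # deployment mode
timezone: "Asia/Shanghai"               # timezone
DeepflowServerNodeIPS:                  # addresses of the nodes where deepflow-server is installed
  - <deepflow-server-node-ip>
deepflowK8sClusterID: "<cluster-uuid>"  # cluster UUID
agentGroupID: "<agent-group-id>"        # agent group ID
controllerPort: 30035                   # data reporting port of deepflow-server
clusterNAME: "<cluster-name>"           # cluster name
```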
After configuring, click OK to complete the installation.
After correctly installing DeepFlow, click Network Observability to enter the DeepFlow Grafana UI. It contains a large number of dashboards for viewing and helping analyze issues. Click DeepFlow Templates to browse all available dashboards:
"},{"location":"en/admin/insight/best-practice/sw-to-otel.html","title":"Simplifying Trace Data Integration with OpenTelemetry and SkyWalking","text":"
This article explains how to seamlessly integrate trace data from SkyWalking into the Insight platform, using OpenTelemetry. With zero code modification required, you can transform your existing SkyWalking trace data and leverage Insight's capabilities.
"},{"location":"en/admin/insight/best-practice/sw-to-otel.html#understanding-the-code","title":"Understanding the Code","text":"
To ensure compatibility with different distributed tracing implementations, OpenTelemetry provides a way to incorporate components that standardize data processing and output to various backends. While Jaeger and Zipkin are already available, we have contributed the SkyWalkingReceiver to the OpenTelemetry community. This receiver has been refined and is now suitable for use in production environments without any modifications to your application's code.
Although SkyWalking and OpenTelemetry share similarities, such as using Trace to define a trace and Span to mark the smallest granularity, there are differences in certain details and implementations:
|  | SkyWalking | OpenTelemetry |
| --- | --- | --- |
| Data Structure | Span -> Segment -> Trace | Span -> Trace |
| Attribute Information | Tags | Attributes |
| Application Time | Logs | Events |
| Reference Relationship | References | Links |
Now, let's discuss the steps involved in converting SkyWalking Trace to OpenTelemetry Trace. The main tasks include:
Constructing OpenTelemetry's TraceId and SpanId
Constructing OpenTelemetry's ParentSpanId
Retaining SkyWalking's original TraceId, SegmentId, and SpanId in OpenTelemetry Spans
First, let's look at how to construct the TraceId and SpanId for OpenTelemetry. Both SkyWalking and OpenTelemetry use TraceId to connect distributed service calls and use SpanId to mark each Span, but there are significant differences in the implementation specifications:
Info
View GitHub for the code implementation:
Skywalking Receiver
PR: Create skywalking component folder/structure
PR: add Skywalking tracing receiver impl
Specifically, the possible formats for SkyWalking TraceId and SegmentId are as follows:
In the OpenTelemetry protocol, a Span is unique across all Traces, while in SkyWalking, a Span is only unique within each Segment. This means that to uniquely identify a Span in SkyWalking, it is necessary to combine the SegmentId and SpanId, and convert it to the SpanId in OpenTelemetry.
Info
View GitHub for the code implementation:
Skywalking Receiver
PR: Fix skywalking traceid and spanid convertion
Next, let's see how to construct the ParentSpanId for OpenTelemetry. Within a Segment, the ParentSpanId field in SkyWalking can be directly used to construct the ParentSpanId field in OpenTelemetry. However, when a Trace spans multiple Segments, SkyWalking uses the association information represented by ParentTraceSegmentId and ParentSpanId in the Reference. In this case, the ParentSpanId in OpenTelemetry needs to be constructed using the information in the Reference.
Code implementation can be found on GitHub: Skywalking Receiver
Finally, let's see how to preserve the original TraceId, SegmentId, and SpanId from SkyWalking in the OpenTelemetry Span. We carry this original information along so that the OpenTelemetry TraceId and SpanId displayed in the distributed tracing backend can be associated with the SkyWalking TraceId, SegmentId, and SpanId in the application logs. We choose to carry the original TraceId, SegmentId, and ParentSegmentId from SkyWalking as OpenTelemetry Attributes.
Info
View GitHub for the code implementation:
Skywalking Receiver
Add extra link attributes from skywalking ref
After this series of conversions, we have fully transformed the SkyWalking Segment Object into an OpenTelemetry Trace, as shown in the following diagram:
"},{"location":"en/admin/insight/best-practice/sw-to-otel.html#deploying-the-demo","title":"Deploying the Demo","text":"
To demonstrate the complete process of collecting and displaying SkyWalking tracing data using OpenTelemetry, we will use a demo application.
First, deploy the OpenTelemetry Agent and enable the following configuration to ensure compatibility with the SkyWalking protocol:
# otel-agent config
receivers:
  skywalking:
    protocols:
      grpc:
        endpoint: 0.0.0.0:11800 # Receive trace data reported by the SkyWalking Agent
      http:
        endpoint: 0.0.0.0:12800 # Receive trace data reported from the front-end / nginx or other HTTP protocols
service:
  pipelines:
    traces:
      receivers: [skywalking]

# otel-agent service yaml
spec:
  ports:
    - name: sw-http
      port: 12800
      protocol: TCP
      targetPort: 12800
    - name: sw-grpc
      port: 11800
      protocol: TCP
      targetPort: 11800
Next, modify the connection of your business application from the SkyWalking OAP Service (e.g., oap:11800) to the OpenTelemetry Agent Service (e.g., otel-agent:11800). This will allow you to start receiving trace data from the SkyWalking probe using OpenTelemetry.
To demonstrate the entire process, we will use the SkyWalking-showcase Demo. This demo utilizes the SkyWalking Agent for tracing, and after being processed by OpenTelemetry, the final results are presented using Jaeger:
From the architecture diagram of the SkyWalking Showcase, we can observe that the data remains intact even after standardization by OpenTelemetry. In this trace, the request starts from app/homepage, then two requests /rcmd and /songs/top are initiated simultaneously within the app, distributed to the recommendation and songs services, and finally reach the database for querying, completing the entire request chain.
Additionally, you can view the original SkyWalking Id information on the Jaeger page, which facilitates correlation with application logs:
By following these steps, you can seamlessly integrate SkyWalking trace data into OpenTelemetry and leverage the capabilities of the Insight platform.
"},{"location":"en/admin/insight/best-practice/tail-based-sampling.html","title":"About Trace Sampling and Configuration","text":"
Using distributed tracing, you can observe how requests flow through various systems in a distributed system. Undeniably, it is very useful for understanding service connections, diagnosing latency issues, and providing many other benefits.
However, if most of your requests are successful and there are no unacceptable delays or errors, do you really need all this data? In most cases you can obtain the right insights through appropriate data sampling rather than collecting a large amount of, or complete, data.
The idea behind sampling is to control the traces sent to the observability collector, thereby reducing collection costs. Different organizations have different reasons for sampling, including why they want to sample and what types of data they wish to sample. Therefore, we need to customize the sampling strategy:
Cost Management: If a large amount of telemetry data needs to be stored, it incurs higher computational and storage costs.
Focus on Interesting Traces: Different organizations prioritize different data types.
Filter Out Noise: For example, you may want to filter out health checks.
It is important to use consistent terminology when discussing sampling. A Trace or Span is considered sampled or unsampled:
Sampled: A Trace or Span that is processed and stored. It is chosen by the sampler to represent the overall data, so it is considered sampled.
Unsampled: A Trace or Span that is not processed or stored. Because it was not selected by the sampler, it is considered unsampled.
"},{"location":"en/admin/insight/best-practice/tail-based-sampling.html#what-are-the-sampling-options","title":"What Are the Sampling Options?","text":""},{"location":"en/admin/insight/best-practice/tail-based-sampling.html#head-sampling","title":"Head Sampling","text":"
Head sampling is a sampling technique used to make a sampling decision as early as possible. A decision to sample or drop a span or trace is not made by inspecting the trace as a whole.
For example, the most common form of head sampling is Consistent Probability Sampling. This is also referred to as Deterministic Sampling. In this case, a sampling decision is made based on the trace ID and the desired percentage of traces to sample. This ensures that whole traces are sampled - no missing spans - at a consistent rate, such as 5% of all traces.
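For illustration, in an OpenTelemetry Collector this kind of head sampling is typically implemented with the probabilistic_sampler processor. The fragment below is a minimal sketch (receivers and exporters are omitted) that keeps roughly 5% of traces:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 5   # keep roughly 5% of traces, decided from the trace ID

service:
  pipelines:
    traces:
      processors: [probabilistic_sampler]
```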
The upsides to head sampling are:

- Easy to understand
- Easy to configure
- Efficient
- Can be done at any point in the trace collection pipeline
The primary downside to head sampling is that it is not possible to make a sampling decision based on data in the entire trace. This means that while head sampling is effective as a blunt instrument, it is completely insufficient for sampling strategies that must consider information from the entire system. For example, you cannot ensure that all traces with an error within them are sampled with head sampling alone. For this situation and many others, you need tail sampling.
Tail sampling is where the decision to sample a trace takes place by considering all or most of the spans within the trace. Tail Sampling gives you the option to sample your traces based on specific criteria derived from different parts of a trace, which isn't an option with Head Sampling.
Some examples of how to use tail sampling include:
Always sampling traces that contain an error
Sampling traces based on overall latency
Sampling traces based on the presence or value of specific attributes on one or more spans in a trace; for example, sampling more traces originating from a newly deployed service
Applying different sampling rates to traces based on certain criteria, such as when traces only come from low-volume services versus traces with high-volume services
As you can see, tail sampling allows for a much higher degree of sophistication in how you sample data. For larger systems that must sample telemetry, it is almost always necessary to use Tail Sampling to balance data volume with the usefulness of that data.
There are three primary downsides to tail sampling today:
Tail sampling can be difficult to implement. Depending on the kind of sampling techniques available to you, it is not always a "set and forget" kind of thing. As your systems change, so too will your sampling strategies. For a large and sophisticated distributed system, rules that implement sampling strategies can also be large and sophisticated.
Tail sampling can be difficult to operate. The component(s) that implement tail sampling must be stateful systems that can accept and store a large amount of data. Depending on traffic patterns, this can require dozens or even hundreds of compute nodes that all utilize resources differently. Furthermore, a tail sampler might need to "fall back" to less computationally intensive sampling techniques if it is unable to keep up with the volume of data it is receiving. Because of these factors, it is critical to monitor tail-sampling components to ensure that they have the resources they need to make the correct sampling decisions.
Tail samplers often end up as vendor-specific technology today. If you're using a paid vendor for Observability, the most effective tail sampling options available to you might be limited to what the vendor offers.
Finally, for some systems, tail sampling might be used in conjunction with Head Sampling. For example, a set of services that produce an extremely high volume of trace data might first use head sampling to sample only a small percentage of traces, and then later in the telemetry pipeline use tail sampling to make more sophisticated sampling decisions before exporting to a backend. This is often done in the interest of protecting the telemetry pipeline from being overloaded.
Insight currently recommends using tail sampling and prioritizes support for tail sampling.
The tail sampling processor samples traces based on a defined set of strategies. However, all spans of a trace must be received by the same collector instance to make effective sampling decisions.
Therefore, adjustments need to be made to the Global OpenTelemetry Collector architecture of Insight to implement the tail sampling strategy.
"},{"location":"en/admin/insight/best-practice/tail-based-sampling.html#specific-changes-to-insight","title":"Specific Changes to Insight","text":"
Introduce an Opentelemetry Collector Gateway component with load balancing capabilities in front of the insight-opentelemetry-collector in the Global cluster, allowing the same group of Traces to be routed to the same Opentelemetry Collector instance based on the TraceID.
Deploy an OTEL COL Gateway component with load balancing capabilities.
If you are using Insight V0.25.x, you can quickly enable this by using the Helm Upgrade parameter --set opentelemetry-collector-gateway.enabled=true, thereby skipping the deployment process described below.
Refer to the following YAML to deploy the component.
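The full deployment YAML is not reproduced here. As a rough sketch of the key part (the backend Service name and namespace below are assumptions), the gateway's collector configuration routes spans to backend collector instances by TraceID using the loadbalancing exporter:

```yaml
exporters:
  loadbalancing:
    routing_key: "traceID"        # route all spans of a trace to the same backend instance
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: insight-opentelemetry-collector.insight-system   # backend collector Service

service:
  pipelines:
    traces:
      exporters: [loadbalancing]
```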
Tail sampling rules need to be added to the existing insight-otel-collector-config configmap configuration group.
Add the following content in the processors section, adjusting the specific rules as needed; refer to the OTel official example.
........\ntail_sampling:\n decision_wait: 10s # Wait for 10 seconds, traces older than 10 seconds will no longer be processed\n num_traces: 1500000 # Number of traces saved in memory, assuming 1000 traces per second, should not be less than 1000 * decision_wait * 2;\n # Setting it too large may consume too much memory resources, setting it too small may cause some traces to be dropped\n expected_new_traces_per_sec: 10\n policies: # Reporting policies\n [\n {\n name: latency-policy,\n type: latency, # Report traces that exceed 500ms\n latency: {threshold_ms: 500}\n },\n {\n name: status_code-policy,\n type: status_code, # Report traces with ERROR status code\n status_code: {status_codes: [ ERROR ]}\n }\n ]\n......\ntail_sampling: # Composite sampling\n decision_wait: 10s # Wait for 10 seconds, traces older than 10 seconds will no longer be processed\n num_traces: 1500000 # Number of traces saved in memory, assuming 1000 traces per second, should not be less than 1000 * decision_wait * 2;\n # Setting it too large may consume too much memory resources, setting it too small may cause some traces to be dropped\n expected_new_traces_per_sec: 10\n policies: [\n {\n name: debug-worker-cluster-sample-policy,\n type: and,\n and:\n {\n and_sub_policy:\n [\n {\n name: service-name-policy,\n type: string_attribute,\n string_attribute:\n { key: k8s.cluster.id, values: [xxxxxxx] },\n },\n {\n name: trace-status-policy,\n type: status_code,\n status_code: { status_codes: [ERROR] },\n },\n {\n name: probabilistic-policy,\n type: probabilistic,\n probabilistic: { sampling_percentage: 1 },\n }\n ]\n }\n }\n ]\n
Activate the processor in the otel col pipeline within the insight-otel-collector-config configmap:
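The exact pipeline definition depends on your existing configuration; as a sketch, the tail_sampling processor just needs to appear in the traces pipeline's processor list (the other processors shown are placeholders for whatever your chain already contains):

```yaml
service:
  pipelines:
    traces:
      processors: [memory_limiter, tail_sampling, batch]   # insert tail_sampling into the existing processor chain
```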
"},{"location":"en/admin/insight/collection-manag/agent-status.html","title":"insight-agent Component Status Explanation","text":"
In AI platform, Insight acts as a multi-cluster observability product. To achieve unified data collection across multiple clusters, users need to install the Helm App insight-agent (installed by default in the insight-system namespace). Refer to How to Install insight-agent .
In the \"Observability\" -> \"Collection Management\" section, you can view the installation status of insight-agent in each cluster.
Not Installed : insight-agent is not installed in the insight-system namespace of the cluster.
Running : insight-agent is successfully installed in the cluster, and all deployed components are running.
Error : If insight-agent is in this state, it indicates that the helm deployment failed or there are components deployed that are not in a running state.
You can troubleshoot using the following steps:
Run the following command. If the status is deployed , proceed to the next step. If it is failed , it is recommended to uninstall and reinstall it from Container Management -> Helm Apps as it may affect application upgrades:
helm list -n insight-system
Run the following command or check the status of the deployed components in Insight -> Data Collection . If there are Pods not in the Running state, restart the containers in an abnormal state.
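For example, a quick way to list the component Pods and spot any that are not Running is:

```shell
kubectl get pods -n insight-system
```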
The resource consumption of the Prometheus metric collection component in insight-agent is directly proportional to the number of Pods running in the cluster. Please adjust the resources for Prometheus according to the cluster size. Refer to Prometheus Resource Planning.
The storage capacity of the vmstorage metric storage component in the global service cluster is directly proportional to the total number of Pods in the clusters.
Please contact the platform administrator to adjust the disk capacity of vmstorage based on the cluster size. Refer to vmstorage Disk Capacity Planning.
Adjust the vmstorage disk based on multi-cluster scale. Refer to vmstorage Disk Expansion.
Data Collection provides a central place to manage and display the status of the collection plug-in insight-agent installed in each cluster. It helps users quickly view the health of the cluster's collection plug-in and provides a quick entry point for configuring collection rules.
The specific operation steps are as follows:
Click in the upper left corner and select Insight -> Data Collection .
You can view the status of all cluster collection plug-ins.
When the cluster is connected to insight-agent and it is running, click a cluster name to enter the details.
In the Service Monitor tab, click the shortcut link to jump to Container Management -> CRD to add service discovery rules.
Prometheus primarily uses the Pull approach to retrieve monitoring metrics from target services' exposed endpoints. Therefore, it requires configuring proper scraping jobs to request monitoring data and write it into the storage provided by Prometheus. Currently, Prometheus offers several configurations for these jobs:
Native Job Configuration: This provides native Prometheus job configuration for scraping.
Pod Monitor: In the Kubernetes ecosystem, it allows scraping of monitoring data from Pods using Prometheus Operator.
Service Monitor: In the Kubernetes ecosystem, it allows scraping monitoring data from Endpoints of Services using Prometheus Operator.
# Name of the scraping job, also adds a label (job=job_name) to the scraped metrics\njob_name: <job_name>\n\n# Time interval between scrapes\n[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]\n\n# Timeout for scrape requests\n[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]\n\n# URI path for the scrape request\n[ metrics_path: <path> | default = /metrics ]\n\n# Handling of label conflicts between scraped labels and labels added by the backend Prometheus.\n# true: Retains the scraped labels and ignores conflicting labels from the backend Prometheus.\n# false: Adds an \"exported_<original-label>\" prefix to the scraped labels and includes the additional labels added by the backend Prometheus.\n[ honor_labels: <boolean> | default = false ]\n\n# Whether to use the timestamp generated by the target being scraped.\n# true: Uses the timestamp from the target if available.\n# false: Ignores the timestamp from the target.\n[ honor_timestamps: <boolean> | default = true ]\n\n# Protocol for the scrape request: http or https\n[ scheme: <scheme> | default = http ]\n\n# URL parameters for the scrape request\nparams:\n [ <string>: [<string>, ...] ]\n\n# Set the value of the `Authorization` header in the scrape request through basic authentication. password/password_file are mutually exclusive, with password_file taking precedence.\nbasic_auth:\n [ username: <string> ]\n [ password: <secret> ]\n [ password_file: <string> ]\n\n# Set the value of the `Authorization` header in the scrape request through bearer token authentication. bearer_token/bearer_token_file are mutually exclusive, with bearer_token taking precedence.\n[ bearer_token: <secret> ]\n\n# Set the value of the `Authorization` header in the scrape request through bearer token authentication. bearer_token/bearer_token_file are mutually exclusive, with bearer_token taking precedence.\n[ bearer_token_file: <filename> ]\n\n# Whether the scrape connection should use a TLS secure channel, configure the proper TLS parameters\ntls_config:\n [ <tls_config> ]\n\n# Use a proxy service to scrape the metrics from the target, specify the address of the proxy service.\n[ proxy_url: <string> ]\n\n# Specify the targets using static configuration, see explanation below.\nstatic_configs:\n [ - <static_config> ... ]\n\n# CVM service discovery configuration, see explanation below.\ncvm_sd_configs:\n [ - <cvm_sd_config> ... ]\n\n# After scraping the data, rewrite the labels of the proper target using the relabel mechanism. Executes multiple relabel rules in order.\n# See explanation below for relabel_config.\nrelabel_configs:\n [ - <relabel_config> ... ]\n\n# Before writing the scraped data, rewrite the values of the labels using the relabel mechanism. Executes multiple relabel rules in order.\n# See explanation below for relabel_config.\nmetric_relabel_configs:\n [ - <relabel_config> ... ]\n\n# Limit the number of data points per scrape, 0: no limit, default is 0\n[ sample_limit: <int> | default = 0 ]\n\n# Limit the number of targets per scrape, 0: no limit, default is 0\n[ target_limit: <int> | default = 0 ]\n
The explanation for the proper configmaps is as follows:
# Prometheus Operator CRD version\napiVersion: monitoring.coreos.com/v1\n# proper Kubernetes resource type, here it is PodMonitor\nkind: PodMonitor\n# proper Kubernetes Metadata, only the name needs to be concerned. If jobLabel is not specified, the value of the job label in the scraped metrics will be <namespace>/<name>\nmetadata:\n name: redis-exporter # Specify a unique name\n namespace: cm-prometheus # Fixed namespace, no need to modify\n# Describes the selection and configuration of the target Pods to be scraped\n labels:\n operator.insight.io/managed-by: insight # Label indicating managed by Insight\nspec:\n # Specify the label of the proper Pod, pod monitor will use this value as the job label value.\n # If viewing the Pod YAML, use the values in pod.metadata.labels.\n # If viewing Deployment/Daemonset/Statefulset, use spec.template.metadata.labels.\n [ jobLabel: string ]\n # Adds the proper Pod's Labels to the Target's Labels\n [ podTargetLabels: []string ]\n # Limit the number of data points per scrape, 0: no limit, default is 0\n [ sampleLimit: uint64 ]\n # Limit the number of targets per scrape, 0: no limit, default is 0\n [ targetLimit: uint64 ]\n # Configure the Prometheus HTTP endpoints that need to be scraped and exposed. Multiple endpoints can be configured.\n podMetricsEndpoints:\n [ - <endpoint_config> ... ] # See explanation below for endpoint\n # Select the namespaces where the monitored Pods are located. Leave it blank to select all namespaces.\n [ namespaceSelector: ]\n # Select all namespaces\n [ any: bool ]\n # Specify the list of namespaces to be selected\n [ matchNames: []string ]\n # Specify the Label values of the Pods to be monitored in order to locate the target Pods [K8S metav1.LabelSelector](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#labelselector-v1-meta)\n selector:\n [ matchExpressions: array ]\n [ example: - {key: tier, operator: In, values: [cache]} ]\n [ matchLabels: object ]\n [ example: k8s-app: redis-exporter ]\n
apiVersion: monitoring.coreos.com/v1\nkind: PodMonitor\nmetadata:\n name: redis-exporter # Specify a unique name\n namespace: cm-prometheus # Fixed namespace, do not modify\n labels:\n operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.\nspec:\n podMetricsEndpoints:\n - interval: 30s\n port: metric-port # Specify the Port Name proper to Prometheus Exporter in the pod YAML\n path: /metrics # Specify the value of the Path proper to Prometheus Exporter, if not specified, default is /metrics\n relabelings:\n - action: replace\n sourceLabels:\n - instance\n regex: (.*)\n targetLabel: instance\n replacement: \"crs-xxxxxx\" # Adjust to the proper Redis instance ID\n - action: replace\n sourceLabels:\n - instance\n regex: (.*)\n targetLabel: ip\n replacement: \"1.x.x.x\" # Adjust to the proper Redis instance IP\n namespaceSelector: # Select the namespaces where the monitored Pods are located\n matchNames:\n - redis-test\n selector: # Specify the Label values of the Pods to be monitored in order to locate the target pods\n matchLabels:\n k8s-app: redis-exporter\n
The explanation for the proper configmaps is as follows:
# Prometheus Operator CRD version\napiVersion: monitoring.coreos.com/v1\n# proper Kubernetes resource type, here it is ServiceMonitor\nkind: ServiceMonitor\n# proper Kubernetes Metadata, only the name needs to be concerned. If jobLabel is not specified, the value of the job label in the scraped metrics will be the name of the Service.\nmetadata:\n name: redis-exporter # Specify a unique name\n namespace: cm-prometheus # Fixed namespace, no need to modify\n# Describes the selection and configuration of the target Pods to be scraped\n labels:\n operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.\nspec:\n # Specify the label(metadata/labels) of the proper Pod, service monitor will use this value as the job label value.\n [ jobLabel: string ]\n # Adds the Labels of the proper service to the Target's Labels\n [ targetLabels: []string ]\n # Adds the Labels of the proper Pod to the Target's Labels\n [ podTargetLabels: []string ]\n # Limit the number of data points per scrape, 0: no limit, default is 0\n [ sampleLimit: uint64 ]\n # Limit the number of targets per scrape, 0: no limit, default is 0\n [ targetLimit: uint64 ]\n # Configure the Prometheus HTTP endpoints that need to be scraped and exposed. Multiple endpoints can be configured.\n endpoints:\n [ - <endpoint_config> ... ] # See explanation below for endpoint\n # Select the namespaces where the monitored Pods are located. Leave it blank to select all namespaces.\n [ namespaceSelector: ]\n # Select all namespaces\n [ any: bool ]\n # Specify the list of namespaces to be selected\n [ matchNames: []string ]\n # Specify the Label values of the Pods to be monitored in order to locate the target Pods [K8S metav1.LabelSelector](https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#labelselector-v1-meta)\n selector:\n [ matchExpressions: array ]\n [ example: - {key: tier, operator: In, values: [cache]} ]\n [ matchLabels: object ]\n [ example: k8s-app: redis-exporter ]\n
apiVersion: monitoring.coreos.com/v1\nkind: ServiceMonitor\nmetadata:\n name: go-demo # Specify a unique name\n namespace: cm-prometheus # Fixed namespace, do not modify\n labels:\n operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.\nspec:\n endpoints:\n - interval: 30s\n # Specify the Port Name proper to Prometheus Exporter in the service YAML\n port: 8080-8080-tcp\n # Specify the value of the Path proper to Prometheus Exporter, if not specified, default is /metrics\n path: /metrics\n relabelings:\n # ** There must be a label named 'application', assuming there is a label named 'app' in k8s,\n # we replace it with 'application' using the relabel 'replace' action\n - action: replace\n sourceLabels: [__meta_kubernetes_pod_label_app]\n targetLabel: application\n # Select the namespace where the monitored service is located\n namespaceSelector:\n matchNames:\n - golang-demo\n # Specify the Label values of the service to be monitored in order to locate the target service\n selector:\n matchLabels:\n app: golang-app-demo\n
The explanation for the proper configmaps is as follows:
# The name of the proper port. Please note that it's not the actual port number.\n# Default: 80. Possible values are as follows:\n# ServiceMonitor: corresponds to Service>spec/ports/name;\n# PodMonitor: explained as follows:\n# If viewing the Pod YAML, take the value from pod.spec.containers.ports.name.\n# If viewing Deployment/DaemonSet/StatefulSet, take the value from spec.template.spec.containers.ports.name.\n[ port: string | default = 80]\n# The URI path for the scrape request.\n[ path: string | default = /metrics ]\n# The protocol for the scrape: http or https.\n[ scheme: string | default = http]\n# URL parameters for the scrape request.\n[ params: map[string][]string]\n# The interval between scrape requests.\n[ interval: string | default = 30s ]\n# The timeout for the scrape request.\n[ scrapeTimeout: string | default = 30s]\n# Whether the scrape connection should be made over a secure TLS channel, and the TLS configuration.\n[ tlsConfig: TLSConfig ]\n# Read the bearer token value from the specified file and include it in the headers of the scrape request.\n[ bearerTokenFile: string ]\n# Read the bearer token from the specified K8S secret key. Note that the secret namespace must match the PodMonitor/ServiceMonitor.\n[ bearerTokenSecret: string ]\n# Handling conflicts when scraped labels conflict with labels added by the backend Prometheus.\n# true: Keep the scraped labels and ignore the conflicting labels from the backend Prometheus.\n# false: For conflicting labels, prefix the scraped label with 'exported_<original-label>' and add the labels added by the backend Prometheus.\n[ honorLabels: bool | default = false ]\n# Whether to use the timestamp generated on the target during the scrape.\n# true: Use the timestamp on the target if available.\n# false: Ignore the timestamp on the target.\n[ honorTimestamps: bool | default = true ]\n# Basic authentication credentials. Fill in the values of username/password from the proper K8S secret key. Note that the secret namespace must match the PodMonitor/ServiceMonitor.\n[ basicAuth: BasicAuth ]\n# Scrape the metrics from the target through a proxy server. Specify the address of the proxy server.\n[ proxyUrl: string ]\n# After scraping the data, rewrite the values of the labels on the target using the relabeling mechanism. Multiple relabel rules are executed in order.\n# See explanation below for relabel_config\nrelabelings:\n[ - <relabel_config> ...]\n# Before writing the scraped data, rewrite the values of the proper labels on the target using the relabeling mechanism. Multiple relabel rules are executed in order.\n# See explanation below for relabel_config\nmetricRelabelings:\n[ - <relabel_config> ...]\n
The corresponding configuration items are explained as follows:
# Specifies which labels to take from the original labels for relabeling. The values taken are concatenated using the separator defined in the configuration.
# For PodMonitor/ServiceMonitor, the corresponding field is sourceLabels.
[ source_labels: '[' <labelname> [, ...] ']' ]
# Defines the character used to concatenate the values of the labels to be relabeled. Default is ';'.
[ separator: <string> | default = ; ]

# When the action is replace/hashmod, target_label is used to specify the corresponding label name.
# For PodMonitor/ServiceMonitor, the corresponding field is targetLabel.
[ target_label: <labelname> ]

# Regular expression used to match the values of the source labels.
[ regex: <regex> | default = (.*) ]

# Used when action is hashmod, it takes the modulus value based on the MD5 hash of the source label's value.
[ modulus: <int> ]

# Used when action is replace, it defines the expression to replace when the regex matches. It can use regular expression replacement with regex.
[ replacement: <string> | default = $1 ]

# Actions performed based on the matched values of regex. The available actions are as follows, with replace being the default:
# replace: If the regex matches, replace the corresponding value with the value defined in replacement. Set the value using target_label and add the corresponding label.
# keep: If the regex doesn't match, discard the value.
# drop: If the regex matches, discard the value.
# hashmod: Take the modulus of the MD5 hash of the source label's value based on the value specified in modulus.
#          Add a new label with a label name specified by target_label.
# labelmap: If the regex matches, replace the corresponding label name with the value specified in replacement.
# labeldrop: If the regex matches, delete the corresponding label.
# labelkeep: If the regex doesn't match, delete the corresponding label.
[ action: <relabel_action> | default = replace ]
Insight uses the Blackbox Exporter provided by Prometheus as a blackbox monitoring solution, allowing detection of target instances via HTTP, HTTPS, DNS, ICMP, TCP, and gRPC. It can be used in the following scenarios:
HTTP/HTTPS: URL/API availability monitoring
ICMP: Host availability monitoring
TCP: Port availability monitoring
DNS: Domain name resolution
In this page, we will explain how to configure custom probers in an existing Blackbox ConfigMap.
The ICMP prober is not enabled by default in Insight because it requires higher permissions. Therefore, we will use the HTTP prober as an example to demonstrate how to modify the ConfigMap to achieve custom HTTP probing.
modules:
  ICMP: # Example of ICMP prober configuration
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
  icmp_example: # Example 2 of ICMP prober configuration
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
      source_ip_address: "127.0.0.1"
Since ICMP requires higher permissions, we also need to elevate the pod permissions. Otherwise, an operation not permitted error will occur. There are two ways to elevate permissions:
Directly edit the BlackBox Exporter deployment file to enable it
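For the first method, a minimal sketch of the change is shown below. It assumes the Deployment is named insight-agent-prometheus-blackbox-exporter and the container is named blackbox-exporter; verify both names in your cluster before applying it. The idea is to grant the container the NET_RAW capability or run it as root:

```yaml
# Sketch only: elevate permissions so the exporter can send ICMP packets
spec:
  template:
    spec:
      containers:
        - name: blackbox-exporter
          securityContext:
            runAsUser: 0            # run as root, or keep non-root and only add the capability below
            capabilities:
              add: ["NET_RAW"]      # capability required for raw ICMP sockets
```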
The following YAML file contains various probers such as HTTP, TCP, SMTP, ICMP, and DNS. You can modify the configuration file of insight-agent-prometheus-blackbox-exporter according to your needs.
Click to view the complete YAML file
kind: ConfigMap\napiVersion: v1\nmetadata:\n name: insight-agent-prometheus-blackbox-exporter\n namespace: insight-system\n labels:\n app.kubernetes.io/instance: insight-agent\n app.kubernetes.io/managed-by: Helm\n app.kubernetes.io/name: prometheus-blackbox-exporter\n app.kubernetes.io/version: v0.24.0\n helm.sh/chart: prometheus-blackbox-exporter-8.8.0\n annotations:\n meta.helm.sh/release-name: insight-agent\n meta.helm.sh/release-namespace: insight-system\ndata:\n blackbox.yaml: |\n modules:\n HTTP_GET:\n prober: http\n timeout: 5s\n http:\n method: GET\n valid_http_versions: [\"HTTP/1.1\", \"HTTP/2.0\"]\n follow_redirects: true\n preferred_ip_protocol: \"ip4\"\n HTTP_POST:\n prober: http\n timeout: 5s\n http:\n method: POST\n body_size_limit: 1MB\n TCP:\n prober: tcp\n timeout: 5s\n # Not enabled by default:\n # ICMP:\n # prober: icmp\n # timeout: 5s\n # icmp:\n # preferred_ip_protocol: ip4\n SSH:\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - expect: \"^SSH-2.0-\"\n POP3S:\n prober: tcp\n tcp:\n query_response:\n - expect: \"^+OK\"\n tls: true\n tls_config:\n insecure_skip_verify: false\n http_2xx_example: # http prober example\n prober: http\n timeout: 5s # probe timeout\n http:\n valid_http_versions: [\"HTTP/1.1\", \"HTTP/2.0\"] # Version in the response, usually default\n valid_status_codes: [] # Defaults to 2xx # Valid range of response codes, probe successful if within this range\n method: GET # request method\n headers: # request headers\n Host: vhost.example.com\n Accept-Language: en-US\n Origin: example.com\n no_follow_redirects: false # allow redirects\n fail_if_ssl: false \n fail_if_not_ssl: false\n fail_if_body_matches_regexp:\n - \"Could not connect to database\"\n fail_if_body_not_matches_regexp:\n - \"Download the latest version here\"\n fail_if_header_matches: # Verifies that no cookies are set\n - header: Set-Cookie\n allow_missing: true\n regexp: '.*'\n fail_if_header_not_matches:\n - header: Access-Control-Allow-Origin\n regexp: '(\\*|example\\.com)'\n tls_config: # tls configuration for https requests\n insecure_skip_verify: false\n preferred_ip_protocol: \"ip4\" # defaults to \"ip6\" # Preferred IP protocol version\n ip_protocol_fallback: false # no fallback to \"ip6\" \n http_post_2xx: # http prober example with body\n prober: http\n timeout: 5s\n http:\n method: POST # probe request method\n headers:\n Content-Type: application/json\n body: '{\"username\":\"admin\",\"password\":\"123456\"}' # body carried during probe\n http_basic_auth_example: # prober example with username and password\n prober: http\n timeout: 5s\n http:\n method: POST\n headers:\n Host: \"login.example.com\"\n basic_auth: # username and password to be added during probe\n username: \"username\"\n password: \"mysecret\"\n http_custom_ca_example:\n prober: http\n http:\n method: GET\n tls_config: # root certificate used during probe\n ca_file: \"/certs/my_cert.crt\"\n http_gzip:\n prober: http\n http:\n method: GET\n compression: gzip # compression method used during probe\n http_gzip_with_accept_encoding:\n prober: http\n http:\n method: GET\n compression: gzip\n headers:\n Accept-Encoding: gzip\n tls_connect: # TCP prober example\n prober: tcp\n timeout: 5s\n tcp:\n tls: true # use TLS\n tcp_connect_example:\n prober: tcp\n timeout: 5s\n imap_starttls: # IMAP email server probe configuration example\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - expect: \"OK.*STARTTLS\"\n - send: \". STARTTLS\"\n - expect: \"OK\"\n - starttls: true\n - send: \". 
capability\"\n - expect: \"CAPABILITY IMAP4rev1\"\n smtp_starttls: # SMTP email server probe configuration example\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - expect: \"^220 ([^ ]+) ESMTP (.+)$\"\n - send: \"EHLO prober\\r\"\n - expect: \"^250-STARTTLS\"\n - send: \"STARTTLS\\r\"\n - expect: \"^220\"\n - starttls: true\n - send: \"EHLO prober\\r\"\n - expect: \"^250-AUTH\"\n - send: \"QUIT\\r\"\n irc_banner_example:\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - send: \"NICK prober\"\n - send: \"USER prober prober prober :prober\"\n - expect: \"PING :([^ ]+)\"\n send: \"PONG ${1}\"\n - expect: \"^:[^ ]+ 001\"\n # icmp_example: # ICMP prober configuration example\n # prober: icmp\n # timeout: 5s\n # icmp:\n # preferred_ip_protocol: \"ip4\"\n # source_ip_address: \"127.0.0.1\"\n dns_udp_example: # DNS query example using UDP\n prober: dns\n timeout: 5s\n dns:\n query_name: \"www.prometheus.io\" # domain name to resolve\n query_type: \"A\" # type proper to this domain\n valid_rcodes:\n - NOERROR\n validate_answer_rrs:\n fail_if_matches_regexp:\n - \".*127.0.0.1\"\n fail_if_all_match_regexp:\n - \".*127.0.0.1\"\n fail_if_not_matches_regexp:\n - \"www.prometheus.io.\\t300\\tIN\\tA\\t127.0.0.1\"\n fail_if_none_matches_regexp:\n - \"127.0.0.1\"\n validate_authority_rrs:\n fail_if_matches_regexp:\n - \".*127.0.0.1\"\n validate_additional_rrs:\n fail_if_matches_regexp:\n - \".*127.0.0.1\"\n dns_soa:\n prober: dns\n dns:\n query_name: \"prometheus.io\"\n query_type: \"SOA\"\n dns_tcp_example: # DNS query example using TCP\n prober: dns\n dns:\n transport_protocol: \"tcp\" # defaults to \"udp\"\n preferred_ip_protocol: \"ip4\" # defaults to \"ip6\"\n query_name: \"www.prometheus.io\"\n
"},{"location":"en/admin/insight/collection-manag/service-monitor.html","title":"Configure service discovery rules","text":"
Insight supports creating the ServiceMonitor CRD through Container Management to meet custom service discovery and collection requirements. Users can use a ServiceMonitor to define the namespaces in which Pods are discovered and select the monitored Services through matchLabel .
This is the service endpoint, which represents the address where Prometheus collects Metrics. endpoints is an array, and multiple endpoints can be created at the same time. Each endpoint contains three fields, and the meaning of each field is as follows:
interval : Specifies the collection cycle of Prometheus for the current endpoint . The unit is seconds, set to 15s in this example.
path : Specifies the collection path of Prometheus. In this example, it is specified as /actuator/prometheus .
port : Specifies the port used for collection. Its value is the name of the port defined in the Service being collected.
This is the scope of the Service that needs to be discovered. namespaceSelector contains two mutually exclusive fields, and the meaning of the fields is as follows:
any : Takes only one value, true . When this field is set, changes of all Services that match the selector will be watched, regardless of namespace.
matchNames : An array value that specifies the scope of namespace to be monitored. For example, if you only want to monitor the Services in two namespaces, default and insight-system, the matchNames are set as follows:
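A minimal sketch of that selector is shown below:

```yaml
namespaceSelector:
  matchNames:
    - default
    - insight-system
```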
| Test Case | Test Method | OCP 4.10 (K8s 1.23.0) | Remarks |
| --- | --- | --- | --- |
| Collect and query web application metrics | Manual | ✅ | - |
| Add custom metric collection | Manual | ✅ | - |
| Query real-time metrics | Manual | ✅ | - |
| Instantaneous index query | Manual | ✅ | - |
| Instantaneous metric API field verification | Manual | ✅ | - |
| Query metrics over a period of time | Manual | ✅ | - |
| Metric API field verification within a period of time | Manual | ✅ | - |
| Batch query cluster CPU, memory usage, total cluster CPU, cluster memory usage, total number of cluster nodes | Manual | ✅ | - |
| Batch query node CPU, memory usage, total node CPU, node memory usage | Manual | ✅ | - |
| Batch query cluster metrics within a period of time | Manual | ✅ | - |
| Metric API field verification within a period of time | Manual | ✅ | - |
| Query Pod log | Manual | ✅ | - |
| Query SVC log | Manual | ✅ | - |
| Query statefulset logs | Manual | ✅ | - |
| Query Deployment Logs | Manual | ✅ | - |
| Query NPD log | Manual | ✅ | - |
| Log Filtering | Manual | ✅ | - |
| Log fuzzy query - workloadSearch | Manual | ✅ | - |
| Log fuzzy query - podSearch | Manual | ✅ | - |
| Log fuzzy query - containerSearch | Manual | ✅ | - |
| Log Accurate Query - cluster | Manual | ✅ | - |
| Log Accurate Query - namespace | Manual | ✅ | - |
| Log query API field verification | Manual | ✅ | - |
| Alert Rule - CRUD operations | Manual | ✅ | - |
| Alert Template - CRUD operations | Manual | ✅ | - |
| Notification Method - CRUD operations | Manual | ✅ | - |
| Link Query | Manual | ✅ | - |
| Topology Query | Manual | ✅ | - |
The table above shows the OpenShift 4.x cluster compatibility test. It lists the test cases, their test method (manual), and the test results for OCP 4.10 (Kubernetes 1.23.0).
Please note that this is not an exhaustive list, and additional test scenarios may exist.
| Test Scenario | Test Method | Rancher rke2c1 (K8s 1.24.11) | Notes |
| --- | --- | --- | --- |
| Collect and query web application metrics | Manual | ✅ | - |
| Add custom metric collection | Manual | ✅ | - |
| Query real-time metrics | Manual | ✅ | - |
| Instantaneous index query | Manual | ✅ | - |
| Instantaneous metric API field verification | Manual | ✅ | - |
| Query metrics over a period of time | Manual | ✅ | - |
| Metric API field verification within a period of time | Manual | ✅ | - |
| Batch query cluster CPU, memory usage, total cluster CPU, cluster memory usage, total number of cluster nodes | Manual | ✅ | - |
| Batch query node CPU, memory usage, total node CPU, node memory usage | Manual | ✅ | - |
| Batch query cluster metrics within a period of time | Manual | ✅ | - |
| Metric API field verification within a period of time | Manual | ✅ | - |
| Query Pod log | Manual | ✅ | - |
| Query SVC log | Manual | ✅ | - |
| Query statefulset logs | Manual | ✅ | - |
| Query Deployment Logs | Manual | ✅ | - |
| Query NPD log | Manual | ✅ | - |
| Log Filtering | Manual | ✅ | - |
| Log fuzzy query - workloadSearch | Manual | ✅ | - |
| Log fuzzy query - podSearch | Manual | ✅ | - |
| Log fuzzy query - containerSearch | Manual | ✅ | - |
| Log Accurate Query - cluster | Manual | ✅ | - |
| Log Accurate Query - namespace | Manual | ✅ | - |
| Log query API field verification | Manual | ✅ | - |
| Alert Rule - CRUD operations | Manual | ✅ | - |
| Alert Template - CRUD operations | Manual | ✅ | - |
| Notification Method - CRUD operations | Manual | ✅ | - |
| Link Query | Manual | ✅ | - |
| Topology Query | Manual | ✅ | - |

Dashboard
Grafana is a cross-platform open source visual analysis tool. Insight uses open source Grafana to provide monitoring services, and supports viewing resource consumption from multiple dimensions such as clusters, nodes, and namespaces.
For more information on open source Grafana, see Grafana Official Documentation.
In the Insight / Overview dashboard, you can view the resource usage of multiple clusters and analyze resource usage, network, storage, and more based on dimensions such as namespaces and Pods.
Click the dropdown menu in the upper-left corner of the dashboard to switch between clusters.
Click the lower-right corner of the dashboard to switch the time range for queries.
Insight provides several recommended dashboards that allow monitoring from different dimensions such as nodes, namespaces, and workloads. Switch between dashboards by clicking the insight-system / Insight / Overview section.
Note
For accessing Grafana UI, refer to Access Native Grafana.
For importing custom dashboards, refer to Importing Custom Dashboards.
By using Grafana CRD, you can incorporate the management and deployment of dashboards into the lifecycle management of Kubernetes. This enables version control, automated deployment, and cluster-level management of dashboards. This page describes how to import custom dashboards using CRD and the UI interface.
Log in to the AI platform and go to Container Management . Select the kpanda-global-cluster from the cluster list.
Choose Custom Resources from the left navigation bar. Look for grafanadashboards.integreatly.org in the list and click it to view the details.
Click YAML Create and use the following template. Replace the dashboard JSON in the Json field.
namespace : Specify the target namespace.
name : Provide a name for the dashboard.
label : Mandatory. Set the label as operator.insight.io/managed-by: insight .
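A sketch of such a template is shown below. It assumes the grafana-operator GrafanaDashboard CRD with apiVersion integreatly.org/v1alpha1 (the exact apiVersion may differ depending on the installed operator version), and the JSON body is a placeholder to replace with your dashboard JSON:

```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: my-dashboard                  # name of the dashboard
  namespace: insight-system           # target namespace
  labels:
    operator.insight.io/managed-by: insight   # mandatory label
spec:
  json: |
    {
      "title": "My Dashboard",
      "panels": []
    }
```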
Insight only collects data from clusters that have insight-agent installed and running in a normal state. The overview provides an overview of resources across multiple clusters:
Alert Statistics: Provides statistics on active alerts across all clusters.
Resource Consumption: Displays the resource usage trends for the top 5 clusters and nodes in the past hour, based on CPU usage, memory usage, and disk usage.
By default, the sorting is based on CPU usage. You can switch the metric to sort clusters and nodes.
Resource Trends: Shows the trends in the number of nodes over the past 15 days and the running trend of pods in the last hour.
Service Requests Ranking: Displays the top 5 services with the highest request latency and error rates, along with their respective clusters and namespaces in the multi-cluster environment.
By default, Insight collects node logs, container logs, and Kubernetes audit logs. In the log query page, you can search for standard output (stdout) logs within the permissions of your login account. This includes node logs, product logs, and Kubernetes audit logs. You can quickly find the desired logs among a large volume of logs. Additionally, you can use the source information and contextual raw data of the logs to assist in troubleshooting and issue resolution.
In the left navigation bar, select Data Query -> Log Query .
After selecting the query criteria, click Search , and the log records in the form of graphs will be displayed. The most recent logs are displayed on top.
In the Filter panel, switch Type and select Node to check the logs of all nodes in the cluster.
In the Filter panel, switch Type and select Event to view the logs generated by all Kubernetes events in the cluster.
Lucene Syntax Explanation:
Use logical operators (AND, OR, NOT, "") to query multiple keywords. For example: keyword1 AND (keyword2 OR keyword3) NOT keyword4.
Use a tilde (~) for fuzzy queries. You can optionally specify a parameter after the "~" to control the similarity of the fuzzy query. If not specified, it defaults to 0.5. For example: error~.
Use wildcards as placeholders: ? matches a single character, and * matches zero or more characters.
Use square brackets [ ] or curly braces { } for range queries. Square brackets [ ] represent a closed interval and include the boundary values. Curly braces { } represent an open interval and exclude the boundary values. Range queries are applicable only to fields that can be sorted, such as numeric fields and date fields. For example timestamp:[2022-01-01 TO 2022-01-31].
For more information, please refer to the Lucene Syntax Explanation.
Clicking on the button next to a log will slide out a panel on the right side where you can view the default 100 lines of context for that log. You can switch the Display Rows option to view more contextual content.
Metric query supports querying the metric data of each container resource, and you can view the trend of monitoring metrics over time. Advanced query also supports native PromQL statements for metric queries.
In the left navigation bar, click Data Query -> Metrics .
After selecting query conditions such as cluster, type, node, and metric name, click Search , and the proper metric chart and data details will be displayed on the right side of the screen.
Tip
A custom time range is supported. You can manually click the Refresh icon or select a default time interval to refresh.
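As an illustration of the advanced query mode mentioned above, the following PromQL statement is a generic sketch based on the standard cAdvisor metric (not an Insight-specific metric name); it sums the CPU usage rate of each Pod in a namespace:

```promql
sum(rate(container_cpu_usage_seconds_total{namespace="insight-system"}[5m])) by (pod)
```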
Modify the Kibana Service to be exposed as a NodePort for access:
kubectl patch svc -n mcamel-system mcamel-common-es-cluster-masters-kb-http -p '{"spec":{"type":"NodePort"}}'

# After modification, check the NodePort. For example, if the port is 30128, the access URL will be https://{NodeIP in the cluster}:30128
[root@insight-master1 ~]# kubectl get svc -n mcamel-system | grep mcamel-common-es-cluster-masters-kb-http
mcamel-common-es-cluster-masters-kb-http   NodePort   10.233.51.174   <none>   5601:30128/TCP   108m
Retrieve the ElasticSearch Secret to log in to Kibana (username is elastic):
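A sketch of retrieving the password is shown below. It assumes the default ECK-style secret name derived from the Elasticsearch cluster name (mcamel-common-es-cluster-masters); adjust the secret name if your cluster is named differently:

```shell
kubectl get secret -n mcamel-system mcamel-common-es-cluster-masters-es-elastic-user \
  -o jsonpath='{.data.elastic}' | base64 -d
```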
Go to Kibana -> Stack Management -> Index Management and enable the Include hidden indices option to see all indexes. Based on the index sequence numbers, keep the indexes with larger numbers and delete the ones with smaller numbers.
"},{"location":"en/admin/insight/faq/traceclockskew.html","title":"Clock offset in trace data","text":"
In a distributed system, time drift exists between different hosts due to clock skew. Generally speaking, the system clocks of different hosts deviate slightly from one another at any given moment.
The tracing system is a typical distributed system, and its collection of time data is also affected by this phenomenon. For example, within a trace, the start time of the server-side span may appear earlier than that of the client-side span. This cannot happen logically, but because of clock skew, the system time of the hosts deviates at the moment the trace data is collected in each service, which eventually leads to the phenomenon shown in the following figure:
The phenomenon in the above figure cannot be eliminated theoretically. However, this phenomenon is rare, and even if it occurs, it will not affect the calling relationship between services.
Currently Insight uses the Jaeger UI to display trace data, and the UI shows a warning when it encounters such a trace:
The Jaeger community is currently trying to mitigate this problem at the UI level.
Through cluster monitoring, you can view the basic information of the cluster, the resource consumption and the trend of resource consumption over a period of time.
Select Infrastructure > Clusters from the left navigation bar. On this page, you can view the following information:
Resource Overview: Provides statistics on the number of normal/all nodes and workloads across multiple clusters.
Fault: Displays the number of alerts generated in the current cluster.
Resource Consumption: Shows the actual usage and total capacity of CPU, memory, and disk for the selected cluster.
Metric Explanations: Describes the trends in CPU, memory, disk I/O, and network bandwidth.
Click Resource Level Monitor to view more metrics of the current cluster.
"},{"location":"en/admin/insight/infra/cluster.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The ratio of the actual CPU usage of all pod resources in the cluster to the total CPU capacity of all nodes. CPU Allocation The ratio of the sum of CPU requests of all pods in the cluster to the total CPU capacity of all nodes. Memory Usage The ratio of the actual memory usage of all pod resources in the cluster to the total memory capacity of all nodes. Memory Allocation The ratio of the sum of memory requests of all pods in the cluster to the total memory capacity of all nodes."},{"location":"en/admin/insight/infra/container.html","title":"Container Insight","text":"
Container insight is the process of monitoring workloads in cluster management. In the list, you can view basic information and status of workloads. On the Workloads details page, you can see the number of active alerts and the trend of resource consumption such as CPU and memory.
Follow these steps to view service monitoring metrics:
Go to the Insight product module.
Select Infrastructure > Workloads from the left navigation bar.
Switch between tabs at the top to view data for different types of workloads.
Click the target workload name to view the details.
Faults: Displays the total number of active alerts for the workload.
Resource Consumption: Shows the CPU, memory, and network usage of the workload.
Monitoring Metrics: Provides the trends of CPU, Memory, Network, and disk usage for the workload over the past hour.
Switch to the Pods tab to view the status of various pods for the workload, including their nodes, restart counts, and other information.
Switch to the JVM monitor tab to view the JVM metrics of each pod.
Note
The JVM monitoring feature only supports the Java language.
To enable the JVM monitoring feature, refer to Getting Started with Monitoring Java Applications.
"},{"location":"en/admin/insight/infra/container.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The sum of CPU usage for all pods under the workload. CPU Requests The sum of CPU requests for all pods under the workload. CPU Limits The sum of CPU limits for all pods under the workload. Memory Usage The sum of memory usage for all pods under the workload. Memory Requests The sum of memory requests for all pods under the workload. Memory Limits The sum of memory limits for all pods under the workload. Disk Read/Write Rate The total number of continuous disk reads and writes per second within the specified time range, representing a performance measure of the number of read and write operations per second on the disk. Network Send/Receive Rate The incoming and outgoing rates of network traffic, aggregated by workload, within the specified time range."},{"location":"en/admin/insight/infra/event.html","title":"Event Query","text":"
AI platform Insight supports event querying by cluster and namespace.
"},{"location":"en/admin/insight/infra/event.html#event-status-distribution","title":"Event Status Distribution","text":"
By default, the events that occurred within the last 12 hours are displayed. You can select a different time range in the upper right corner to view longer or shorter periods. You can also customize the sampling interval from 1 minute to 5 hours.
The event status distribution chart provides a visual representation of the intensity and dispersion of events. This helps in evaluating and preparing for subsequent cluster operations and maintenance tasks. If events are densely concentrated during specific time periods, you may need to allocate more resources or take proper measures to ensure cluster stability and high availability. On the other hand, if events are dispersed, you can effectively schedule other maintenance tasks such as system optimization, upgrades, or handling other tasks during this period.
By considering the event status distribution chart and the selected time range, you can better plan and manage your cluster operations and maintenance work, ensuring system stability and reliability.
"},{"location":"en/admin/insight/infra/event.html#event-count-and-statistics","title":"Event Count and Statistics","text":"
Through important event statistics, you can easily understand the number of image pull failures, health check failures, Pod execution failures, Pod scheduling failures, container OOM (Out-of-Memory) occurrences, volume mounting failures, and the total count of all events. These events are typically categorized as "Warning" and "Normal".
Select Infrastructure -> Namespaces from the left navigation bar. On this page, you can view the following information:
Switch Namespace: Switch between clusters or namespaces at the top.
Resource Overview: Provides statistics on the number of normal and total workloads within the selected namespace.
Incidents: Displays the number of alerts generated within the selected namespace.
Events: Shows the number of Warning level events within the selected namespace in the past 24 hours.
Resource Consumption: Provides the sum of CPU and memory usage for Pods within the selected namespace, along with the CPU and memory quota information.
"},{"location":"en/admin/insight/infra/namespace.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The sum of CPU usage for Pods within the selected namespace. Memory Usage The sum of memory usage for Pods within the selected namespace. Pod CPU Usage The CPU usage for each Pod within the selected namespace. Pod Memory Usage The memory usage for each Pod within the selected namespace."},{"location":"en/admin/insight/infra/node.html","title":"Node Monitoring","text":"
Through node monitoring, you can get an overview of the current health status of the nodes in the selected cluster and the number of abnormal pods; on the node details page, you can view the number of alerts and the trend of resource consumption such as CPU, memory, and disk.
Probe refers to the use of black-box monitoring to regularly test the connectivity of targets through HTTP, TCP, and other methods, enabling quick detection of ongoing faults.
Insight uses the Prometheus Blackbox Exporter tool to probe the network using protocols such as HTTP, HTTPS, DNS, TCP, and ICMP, and returns the probe results to understand the network status.
Select Infrastructure -> Probes in the left navigation bar.
Click the cluster or namespace dropdown in the table to switch between clusters and namespaces.
The list displays the name, probe method, probe target, connectivity status, and creation time of the probes by default.
The connectivity status can be:
Normal: The probe successfully connects to the target, and the target returns the expected response.
Abnormal: The probe fails to connect to the target, or the target does not return the expected response.
Pending: The probe is attempting to connect to the target.
Supports fuzzy search of probe names.
"},{"location":"en/admin/insight/infra/probe.html#create-a-probe","title":"Create a Probe","text":"
Click Create Probe .
Fill in the basic information and click Next .
Name: The name can only contain lowercase letters, numbers, and hyphens (-), and must start and end with a lowercase letter or number, with a maximum length of 63 characters.
Cluster: Select the cluster for the probe task.
Namespace: The namespace where the probe task is located.
Configure the probe parameters.
Blackbox Instance: Select the blackbox instance responsible for the probe.
Probe Method:
HTTP: Sends HTTP or HTTPS requests to the target URL to check its connectivity and response time. This can be used to monitor the availability and performance of websites or web applications.
TCP: Establishes a TCP connection to the target host and port to check its connectivity and response time. This can be used to monitor TCP-based services such as web servers and database servers.
Other: Supports custom probe methods by configuring ConfigMap. For more information, refer to: Custom Probe Methods
Probe Target: The target address of the probe, supports domain names or IP addresses.
Labels: Custom labels that will be automatically added to Prometheus' labels.
Probe Interval: The interval between probes.
Probe Timeout: The maximum waiting time when probing the target.
After configuring, click OK to complete the creation.
Warning
After the probe task is created, it takes about 3 minutes to synchronize the configuration. During this period, no probes will be performed, and probe results cannot be viewed.
Click ┇ in the operations column and click View Monitoring Dashboard .
| Metric Name | Description |
| --- | --- |
| Current Status Response | Represents the response status code of the HTTP probe request. |
| Ping Status | Indicates whether the probe request was successful. 1 indicates a successful probe request, and 0 indicates a failed probe request. |
| IP Protocol | Indicates the IP protocol version used in the probe request. |
| SSL Expiry | Represents the earliest expiration time of the SSL/TLS certificate. |
| DNS Response (Latency) | Represents the duration of the entire probe process in seconds. |
| HTTP Duration | Represents the duration of the entire process from sending the request to receiving the complete response. |

Edit a Probe
Click ┇ in the operations column and click Edit .
"},{"location":"en/admin/insight/infra/probe.html#delete-a-probe","title":"Delete a Probe","text":"
Click ┇ in the operations column and click Delete .
The AI platform enables the management and creation of multicloud and multiple clusters. Building upon this capability, Insight serves as a unified observability solution for multiple clusters. It collects observability data from multiple clusters by deploying the insight-agent plugin and allows querying of metrics, logs, and trace data through AI platform Insight.
insight-agent is a tool that facilitates the collection of observability data from multiple clusters. Once installed, it automatically collects metrics, logs, and trace data without any modifications.
Clusters created through Container Management come pre-installed with insight-agent. Hence, this guide specifically provides instructions on enabling observability for integrated clusters.
Install insight-agent online
As a unified observability platform for multiple clusters, the resource consumption of certain Insight components is closely related to the scale of the created clusters and the number of integrated clusters. When installing insight-agent, adjust the resources of the corresponding components based on the cluster size.
Adjust the CPU and memory resources of the Prometheus collection component in insight-agent according to the size of the cluster created or integrated. Please refer to Prometheus resource planning.
As the metric data from multiple clusters is stored centrally, AI platform administrators need to adjust the disk space of vmstorage based on the cluster size. Please refer to vmstorage disk capacity planning.
For instructions on adjusting the disk space of vmstorage, please refer to Expanding vmstorage disk.
Since AI platform supports the management of multicloud and multiple clusters, insight-agent has undergone partial verification. However, there are known conflicts with monitoring components when installing insight-agent in Suanova 4.0 clusters and Openshift 4.x clusters. If you encounter similar issues, please refer to the following documents:
Install insight-agent in Suanova 4.0.x
Install insight-agent in Openshift 4.x
Currently, the insight-agent collection component has undergone functional testing for popular versions of Kubernetes. Please refer to:
Kubernetes cluster compatibility testing
Openshift 4.x cluster compatibility testing
Rancher cluster compatibility testing
"},{"location":"en/admin/insight/quickstart/install/big-log-and-trace.html","title":"Enable Big Log and Big Trace Modes","text":"
The Insight Module supports switching log to Big Log mode and trace to Big Trace mode, in order to enhance data writing capabilities in large-scale environments. This page introduces following methods for enabling these modes:
Enable or upgrade to Big Log and Big Trace modes through the installer (controlled by the same parameter value in manifest.yaml)
Manually enable Big Log and Big Trace modes through Helm commands
This mode is referred to as the Kafka mode, and the data flow diagram is shown below:
"},{"location":"en/admin/insight/quickstart/install/big-log-and-trace.html#enabling-via-installer","title":"Enabling via Installer","text":"
When deploying/upgrading AI platform using the installer, the manifest.yaml file includes the infrastructures.kafka field. To enable observable Big Log and Big Trace modes, Kafka must be activated:
When using a manifest.yaml that enables kafka during installation, Kafka middleware will be installed by default, and Big Log and Big Trace modes will be enabled automatically. The installation command is:
The upgrade also involves modifying the kafka field. However, note that since the old environment was installed with kafka: false, Kafka is not present in the environment. Therefore, you need to specify the upgrade for middleware to install Kafka middleware simultaneously. The upgrade command is:
In the Container Management module, find the cluster, select Helm Apps from the left navigation bar, and find and update the insight-agent.
In Trace Settings, select kafka for output and fill in the correct brokers address.
Note that after the upgrade is complete, you need to manually restart the insight-agent-opentelemetry-collector and insight-opentelemetry-collector components.
When deploying Insight to a Kubernetes environment, proper resource management and optimization are crucial. Insight includes several core components such as Prometheus, OpenTelemetry, FluentBit, Vector, and Elasticsearch. These components, during their operation, may negatively impact the performance of other pods within the cluster due to resource consumption issues. To effectively manage resources and optimize cluster operations, node affinity becomes an important option.
This page describes how to add taints and node affinity to ensure that each component runs on the appropriate nodes, avoiding resource competition or contention and thereby guaranteeing the stability and efficiency of the entire Kubernetes cluster.
"},{"location":"en/admin/insight/quickstart/install/component-scheduling.html#configure-dedicated-nodes-for-insight-using-taints","title":"Configure dedicated nodes for Insight using taints","text":"
Since the Insight Agent includes DaemonSet components, the configuration method described in this section is to have all components except the Insight DaemonSet run on dedicated nodes.
This is achieved by adding taints to the dedicated nodes and using tolerations to match them. More details can be found in the Kubernetes official documentation.
You can refer to the following commands to add and remove taints on nodes:
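For example, a sketch using the node.daocloud.io=insight-only taint referenced below (replace node1 with your node name):

```shell
# Add the taint so that only pods tolerating it are scheduled to node1
kubectl taint nodes node1 node.daocloud.io=insight-only:NoSchedule

# Remove the taint from node1
kubectl taint nodes node1 node.daocloud.io=insight-only:NoSchedule-
```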
There are two ways to schedule Insight components to dedicated nodes:
"},{"location":"en/admin/insight/quickstart/install/component-scheduling.html#1-add-tolerations-for-each-component","title":"1. Add tolerations for each component","text":"
Configure the tolerations for the insight-server and insight-agent Charts respectively:
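A minimal sketch of the values override is shown below; the exact values path for tolerations may differ between chart versions, so treat this only as an illustration:

```yaml
# values override sketch for the insight-server / insight-agent charts
tolerations:
  - key: "node.daocloud.io"
    operator: "Equal"
    value: "insight-only"
    effect: "NoSchedule"
```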
"},{"location":"en/admin/insight/quickstart/install/component-scheduling.html#2-configure-at-the-namespace-level","title":"2. Configure at the namespace level","text":"
Allow pods in the insight-system namespace to tolerate the node.daocloud.io=insight-only taint.
Adjust the apiserver configuration file /etc/kubernetes/manifests/kube-apiserver.yaml to include PodTolerationRestriction,PodNodeSelector. See the following picture:
Add an annotation to the insight-system namespace:
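A sketch of the annotation, using the defaultTolerations annotation read by the PodTolerationRestriction admission plugin enabled above:

```shell
kubectl annotate namespace insight-system \
  scheduler.alpha.kubernetes.io/defaultTolerations='[{"operator":"Equal","effect":"NoSchedule","key":"node.daocloud.io","value":"insight-only"}]'
```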
Restart the components under the insight-system namespace to allow normal scheduling of pods under the insight-system.
"},{"location":"en/admin/insight/quickstart/install/component-scheduling.html#use-node-labels-and-node-affinity-to-manage-component-scheduling","title":"Use node labels and node affinity to manage component scheduling","text":"
Info
Node affinity is conceptually similar to nodeSelector, allowing you to constrain which nodes a pod can be scheduled on based on labels on the nodes. There are two types of node affinity:
requiredDuringSchedulingIgnoredDuringExecution: The scheduler will only schedule the pod if the rules are met. This feature is similar to nodeSelector but has more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution: The scheduler will try to find nodes that meet the rules. If no matching nodes are found, the scheduler will still schedule the Pod.
For more details, please refer to the Kubernetes official documentation.
To meet different user needs for scheduling Insight components, Insight provides fine-grained labels for different components' scheduling policies. Below is a description of the labels and their associated components:
| Label Key | Label Value | Description |
| --- | --- | --- |
| node.daocloud.io/insight-any | Any value, recommended to use true | Represents that all Insight components prefer nodes with this label |
| node.daocloud.io/insight-prometheus | Any value, recommended to use true | Specifically for Prometheus components |
| node.daocloud.io/insight-vmstorage | Any value, recommended to use true | Specifically for VictoriaMetrics vmstorage components |
| node.daocloud.io/insight-vector | Any value, recommended to use true | Specifically for Vector components |
| node.daocloud.io/insight-otel-col | Any value, recommended to use true | Specifically for OpenTelemetry components |
You can refer to the following commands to add and remove labels on nodes:
# Add label to node8, prioritizing scheduling insight-prometheus to node8
kubectl label nodes node8 node.daocloud.io/insight-prometheus=true

# Remove the node.daocloud.io/insight-prometheus label from node8
kubectl label nodes node8 node.daocloud.io/insight-prometheus-
Below is the default affinity preference for the insight-prometheus component during deployment:
Prioritize scheduling insight-prometheus to nodes with the node.daocloud.io/insight-prometheus label
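A sketch of what this preference looks like in the Pod spec (assuming a preferred node affinity on the node.daocloud.io/insight-prometheus label):

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node.daocloud.io/insight-prometheus
              operator: Exists
```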
"},{"location":"en/admin/insight/quickstart/install/gethosturl.html","title":"Get Data Storage Address of Global Service Cluster","text":"
Insight is a product for unified observation of multiple clusters. To achieve unified storage and querying of observation data from multiple clusters, sub-clusters need to report the collected observation data to the global service cluster for unified storage. This document provides the required address of the storage component when installing the collection component insight-agent.
"},{"location":"en/admin/insight/quickstart/install/gethosturl.html#install-insight-agent-in-global-service-cluster","title":"Install insight-agent in Global Service Cluster","text":"
If installing insight-agent in the global service cluster, it is recommended to access the cluster via domain name:
"},{"location":"en/admin/insight/quickstart/install/gethosturl.html#install-insight-agent-in-other-clusters","title":"Install insight-agent in Other Clusters","text":""},{"location":"en/admin/insight/quickstart/install/gethosturl.html#get-address-via-interface-provided-by-insight-server","title":"Get Address via Interface Provided by Insight Server","text":"
The management cluster uses the default LoadBalancer mode for exposure.
Log in to the console of the global service cluster and run the following command:
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam'
Note
Please replace the ${INSIGHT_SERVER_IP} parameter in the command.
global.exporters.logging.host is the log service address, no need to set the proper service port, the default value will be used.
global.exporters.metric.host is the metrics service address.
global.exporters.trace.host is the trace service address.
global.exporters.auditLog.host is the audit log service address (same service as trace but different port).
Management cluster disables LoadBalancer
When calling the interface, you need to additionally pass an externally accessible node IP from the cluster, which will be used to construct the complete access address of the proper service.
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam' --data '{"extra": {"EXPORTER_EXTERNAL_IP": "10.5.14.51"}}'
global.exporters.logging.host is the log service address.
global.exporters.logging.port is the NodePort exposed by the log service.
global.exporters.metric.host is the metrics service address.
global.exporters.metric.port is the NodePort exposed by the metrics service.
global.exporters.trace.host is the trace service address.
global.exporters.trace.port is the NodePort exposed by the trace service.
global.exporters.auditLog.host is the audit log service address (same service as trace but different port).
global.exporters.auditLog.port is the NodePort exposed by the audit log service.
"},{"location":"en/admin/insight/quickstart/install/gethosturl.html#connect-via-loadbalancer","title":"Connect via LoadBalancer","text":"
If LoadBalancer is enabled in the cluster and a VIP is set for Insight, you can manually execute the following command to obtain the address information for vminsert and opentelemetry-collector:
$ kubectl get service -n insight-system | grep lb
lb-insight-opentelemetry-collector                LoadBalancer   10.233.23.12   <pending>   4317:31286/TCP,8006:31351/TCP   24d
lb-vminsert-insight-victoria-metrics-k8s-stack    LoadBalancer   10.233.63.67   <pending>   8480:31629/TCP                  24d
lb-vminsert-insight-victoria-metrics-k8s-stack is the address for the metrics service.
lb-insight-opentelemetry-collector is the address for the tracing service.
Execute the following command to obtain the address information for elasticsearch:
$ kubectl get service -n mcamel-system | grep es
mcamel-common-es-cluster-masters-es-http   NodePort   10.233.16.120   <none>   9200:30465/TCP   47d
mcamel-common-es-cluster-masters-es-http is the address for the logging service.
"},{"location":"en/admin/insight/quickstart/install/gethosturl.html#connect-via-nodeport","title":"Connect via NodePort","text":"
The LoadBalancer feature is disabled in the global service cluster.
In this case, the LoadBalancer resources mentioned above will not be created by default. The relevant service names are:
insight-agent is a plugin for collecting Insight data, supporting unified observation of metrics, traces, and log data. This page describes how to install insight-agent in an online environment for an integrated cluster.
Enter Container Management from the left navigation bar, and enter Clusters . Find the cluster where you want to install insight-agent.
Choose Install now to jump to the installation page, or click the cluster, then click Helm Apps -> Helm Templates in the left navigation bar, search for insight-agent in the search box, and click it to view the details.
Select the appropriate version and click Install .
Fill in the name, select the namespace and version, and fill in the addresses of logging, metric, audit, and trace reporting data in the yaml file. The system has filled in the address of the component for data reporting by default, please check it before clicking OK to install.
If you need to modify the data reporting address, please refer to Get Data Reporting Address.
The system will automatically return to Helm Apps . When the application status changes from Unknown to Deployed , it means that insight-agent is installed successfully.
Note
Click ┇ on the far right, and you can perform more operations such as Update , View YAML and Delete in the pop-up menu.
For a practical installation demo, watch Video demo of installing insight-agent
This page lists some issues related to the installation and uninstallation of Insight Agent and their workarounds.
"},{"location":"en/admin/insight/quickstart/install/knownissues.html#uninstallation-failure-of-insight-agent","title":"Uninstallation Failure of Insight Agent","text":"
When you run the following command to uninstall Insight Agent,
helm uninstall insight-agent -n insight-system
The TLS secret used by otel-operator fails to be uninstalled.
Due to the "reuse TLS secret" logic in the following code of otel-operator, it checks whether a MutationConfiguration exists and reuses the CA certificate bound in the MutationConfiguration. However, since helm uninstall has already removed the MutationConfiguration, this results in a null value.
Therefore, please manually delete the proper secret using one of the following methods:
Delete via command line: Log in to the console of the target cluster and run the following command:
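A sketch of the command, using the secret name mentioned in the UI method below (the secret normally lives in the insight-system namespace):

```shell
kubectl -n insight-system delete secret insight-agent-opentelemetry-operator-controller-manager-service-cert
```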
Delete via UI: Log in to AI platform container management, select the target cluster, select Secret from the left menu, input insight-agent-opentelemetry-operator-controller-manager-service-cert, then select Delete.
"},{"location":"en/admin/insight/quickstart/install/knownissues.html#insight-agent_1","title":"Insight Agent","text":""},{"location":"en/admin/insight/quickstart/install/knownissues.html#log-collection-endpoint-not-updated-when-upgrading-insight-agent","title":"Log Collection Endpoint Not Updated When Upgrading Insight Agent","text":"
When updating the log configuration of the insight-agent from Elasticsearch to Kafka or from Kafka to Elasticsearch, the changes do not take effect and the agent continues to use the previous configuration.
Solution :
Manually restart Fluent Bit in the cluster.
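For example (assuming the Fluent Bit DaemonSet created by insight-agent is named insight-agent-fluent-bit; confirm the name in your cluster first):

```shell
kubectl -n insight-system rollout restart daemonset/insight-agent-fluent-bit
```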
"},{"location":"en/admin/insight/quickstart/install/knownissues.html#podmonitor-collects-multiple-sets-of-jvm-metrics","title":"PodMonitor Collects Multiple Sets of JVM Metrics","text":"
In this version, there is a defect in PodMonitor/insight-kubernetes-pod: it incorrectly creates scrape jobs for all containers in Pods marked with insight.opentelemetry.io/metric-scrape=true, instead of only the containers corresponding to insight.opentelemetry.io/metric-port.
After a PodMonitor is declared, PrometheusOperator pre-configures some service discovery settings. Considering CRD compatibility, configuring the collection tasks through annotations has been abandoned.
Use the additional scrape config mechanism provided by Prometheus to configure the service discovery rules in a secret and introduce them into Prometheus.
Therefore:
Delete the current PodMonitor for insight-kubernetes-pod
Use a new rule
In the new rule, action: keepequal is used to compare the consistency between source_labels and target_label to determine whether to create collection tasks for the ports of a container. Note that this feature is only available in Prometheus v2.41.0 (2022-12-20) and higher.
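A sketch of such a rule is shown below. The label names are illustrative assumptions rather than the exact rule shipped by Insight:

```yaml
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_insight_opentelemetry_io_metric_port]
    target_label: __meta_kubernetes_pod_container_port_number
    action: keepequal   # keep the target only when the two values are equal
```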
This page provides some considerations for upgrading insight-server and insight-agent.
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v028x-or-lower-to-v029x","title":"Upgrade from v0.28.x (or lower) to v0.29.x","text":"
Due to the upgrade of the Opentelemetry community operator chart version in v0.29.0, the supported values for featureGates in the values file have changed. Therefore, before upgrading, you need to set the value of featureGates to empty, as follows:
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v026x-or-lower-to-v027x-or-higher","title":"Upgrade from v0.26.x (or lower) to v0.27.x or higher","text":"
In v0.27.x, the switch for the vector component has been separated. If the existing environment has vector enabled, you need to specify --set vector.enabled=true when upgrading the insight-server.
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v019x-or-lower-to-020x","title":"Upgrade from v0.19.x (or lower) to 0.20.x","text":"
Before upgrading Insight , you need to manually delete the jaeger-collector and jaeger-query deployments by running the following command:
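A sketch of the deletion is shown below; the deployment names are assumptions based on the default release naming and may carry a different prefix in your environment:

```shell
kubectl -n insight-system delete deployment insight-jaeger-collector insight-jaeger-query
```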
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v017x-or-lower-to-v018x","title":"Upgrade from v0.17.x (or lower) to v0.18.x","text":"
In v0.18.x, there have been updates to the Jaeger-related deployment files, so you need to manually run the following commands before upgrading insight-server:
There have been changes to metric names in v0.18.x, so after upgrading insight-server, insight-agent should also be upgraded.
In addition, the parameters for enabling the tracing module and adjusting the ElasticSearch connection have been modified. Refer to the following parameters:
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v015x-or-lower-to-v016x","title":"Upgrade from v0.15.x (or lower) to v0.16.x","text":"
In v0.16.x, a new feature parameter disableRouteContinueEnforce in the vmalertmanagers CRD is used. Therefore, you need to manually run the following command before upgrading insight-server:
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v023x-or-lower-to-v024x","title":"Upgrade from v0.23.x (or lower) to v0.24.x","text":"
In v0.24.x, CRDs have been added to the OTEL operator chart. However, helm upgrade does not update CRDs, so you need to manually run the following command:
If you are performing an offline installation, you can find the above CRD yaml file after extracting the insight-agent offline package. After extracting the insight-agent Chart, manually run the following command:
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v019x-or-lower-to-v020x","title":"Upgrade from v0.19.x (or lower) to v0.20.x","text":"
In v0.20.x, Kafka log export configuration has been added, and there have been some adjustments to the log export configuration. Before upgrading insight-agent , please note the parameter changes. The previous logging configuration has been moved to the logging.elasticsearch configuration:
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v017x-or-lower-to-v018x_1","title":"Upgrade from v0.17.x (or lower) to v0.18.x","text":"
Due to the updated deployment files for Jaeger in v0.18.x, it is important to note the parameter changes before upgrading the insight-agent.
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v016x-or-lower-to-v017x","title":"Upgrade from v0.16.x (or lower) to v0.17.x","text":"
In v0.17.x, the kube-prometheus-stack chart version was upgraded from 41.9.1 to 45.28.1, and some fields in the CRDs used were also upgraded, such as the attachMetadata field of ServiceMonitor. Therefore, the following command needs to be run before upgrading the insight-agent:
If you are performing an offline installation, you can find the yaml for the above CRD in insight-agent/dependency-crds after extracting the insight-agent offline package.
"},{"location":"en/admin/insight/quickstart/install/upgrade-note.html#upgrade-from-v011x-or-earlier-to-v012x","title":"Upgrade from v0.11.x (or earlier) to v0.12.x","text":"
v0.12.x upgrades the kube-prometheus-stack chart from 39.6.0 to 41.9.1, including prometheus-operator to v0.60.1 and the prometheus-node-exporter chart to v4.3.0. After the upgrade, prometheus-node-exporter uses the Kubernetes recommended labels, so you need to delete the node-exporter DaemonSet. prometheus-operator has also updated its CRDs, so you need to run the following command before upgrading the insight-agent:
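A sketch of the preparation steps is shown below. The DaemonSet name assumes the default insight-agent release name, and the CRD directory assumes the layout of the extracted insight-agent chart; adjust both to your environment:

```shell
# Delete the old node-exporter DaemonSet (its selector labels changed and cannot be patched in place)
kubectl -n insight-system delete daemonset insight-agent-prometheus-node-exporter

# Apply the updated prometheus-operator CRDs bundled with the new chart
kubectl apply --server-side --force-conflicts -f insight-agent/dependency-crds/
```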
Please ensure that the insight-agent is ready. If not, please refer to Install insight-agent for data collection and make sure the following three items are ready:
Enable trace functionality for insight-agent
Check if the address and port for trace data are correctly filled
Ensure that the Pods proper to deployment/insight-agent-opentelemetry-operator and deployment/insight-agent-opentelemetry-collector are ready
"},{"location":"en/admin/insight/quickstart/otel/operator.html#works-with-the-service-mesh-product-mspider","title":"Works with the Service Mesh Product (Mspider)","text":"
If you enable the tracing capability of the Mspider(Service Mesh), you need to add an additional environment variable injection configuration:
"},{"location":"en/admin/insight/quickstart/otel/operator.html#the-operation-steps-are-as-follows","title":"The operation steps are as follows","text":"
Log in to AI platform, then enter Container Management and select the target cluster.
Click CRDs in the left navigation bar, find instrumentations.opentelemetry.io, and enter the details page.
Select the insight-system namespace, then edit insight-opentelemetry-autoinstrumentation, and add the following content under spec.env:
"},{"location":"en/admin/insight/quickstart/otel/operator.html#add-annotations-to-automatically-access-traces","title":"Add annotations to automatically access traces","text":"
After the above is ready, you can enable tracing for an application through Pod annotations. OTel currently supports enabling tracing via annotations. Depending on the service language, different Pod annotations need to be added. Each service can add one of two types of annotations:
Only inject environment variable annotations
There is only one such annotation, which is used to add OTel-related environment variables, such as the trace reporting address, the ID of the cluster where the container is located, and the namespace (this annotation is very useful when the application's language has no automatic probe support).
The value is divided into two parts by /, the first value (insight-system) is the namespace of the CR installed in the previous step, and the second value (insight-opentelemetry-autoinstrumentation) is the name of the CR.
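A sketch of this annotation on a Pod template (inject-sdk is the OpenTelemetry Operator annotation that injects only the environment variables):

```yaml
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-sdk: "insight-system/insight-opentelemetry-autoinstrumentation"
```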
Automatic probe injection and environment variable injection annotations
There are currently four such annotations, corresponding to four different programming languages: java, nodejs, python, and dotnet. After using one of them, the automatic probe and the default OTel environment variables will be injected into the first container in the Pod spec:
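A sketch of the Java variant is shown below; the other languages use inject-nodejs, inject-python, and inject-dotnet in the same way:

```yaml
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: "insight-system/insight-opentelemetry-autoinstrumentation"
```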
Since Go's automatic detection requires the setting of OTEL_GO_AUTO_TARGET_EXE, you must provide a valid executable path through annotations or Instrumentation resources. Failure to set this value will result in the termination of Go's automatic detection injection, leading to a failure in the connection trace.
The OpenTelemetry Operator automatically adds some OTel-related environment variables when injecting probes and also supports overriding these variables. The priority order for overriding these environment variables is as follows:
original container env vars -> language specific env vars -> common env vars -> instrument spec configs' vars
However, it is important to avoid manually overriding OTEL_RESOURCE_ATTRIBUTES_NODE_NAME . This variable serves as an identifier within the operator to determine if a pod has already been injected with a probe. Manually adding this variable may prevent the probe from being injected successfully.
How to query the connected services, refer to Trace Query.
"},{"location":"en/admin/insight/quickstart/otel/otel.html","title":"Use OTel to provide the application observability","text":"
Enhancement is the process of enabling application code to generate telemetry data, that is, data that helps you monitor or measure the performance and status of your application.
OpenTelemetry is a leading open source project providing instrumentation libraries for major programming languages and popular frameworks. It is a project under the Cloud Native Computing Foundation and is supported by the vast resources of the community. It provides a standardized data format for collected data without the need to integrate specific vendors.
Insight supports OpenTelemetry for application instrumentation to enhance your applications.
This guide introduces the basic concepts of telemetry enhancement using OpenTelemetry. OpenTelemetry also has an ecosystem of libraries, plugins, integrations, and other useful tools to extend it. You can find these resources at the OTel Registry.
You can use any open standard library for telemetry enhancement and use Insight as an observability backend to ingest, analyze, and visualize data.
To enhance your code, you can use the enhanced operations provided by OpenTelemetry for specific languages:
Insight currently provides an easy way to enhance .NET, NodeJS, Java, Python, and Golang applications with OpenTelemetry. Please follow the guidelines below.
Best practice for integrating traces: Application Non-Intrusive Enhancement via Operator
Manual instrumentation with Go language as an example: Enhance Go application with OpenTelemetry SDK
Using eBPF to implement non-intrusive auto-instrumentation for Go (experimental feature)
"},{"location":"en/admin/insight/quickstart/otel/send_tracing_to_insight.html","title":"Sending Trace Data to Insight","text":"
This document describes how customers can send trace data to Insight on their own. It mainly includes the following two scenarios:
Customer apps report traces to Insight through OTEL Agent/SDK
Forwarding traces to Insight through Opentelemetry Collector (OTEL COL)
In each cluster where Insight Agent is installed, there is an insight-agent-otel-col component that is used to receive trace data from that cluster. Therefore, this component serves as the entry point for user access and needs to obtain its address first. You can get the address of the Opentelemetry Collector in the cluster through the AI platform interface, such as insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 :
In addition, there are some slight differences for different reporting methods:
"},{"location":"en/admin/insight/quickstart/otel/send_tracing_to_insight.html#customer-apps-report-traces-to-insight-through-otel-agentsdk","title":"Customer apps report traces to Insight through OTEL Agent/SDK","text":"
To successfully report trace data to Insight and display it properly, it is recommended to provide the required metadata (Resource Attributes) for OTLP through the following environment variables. There are two ways to achieve this:
Manually add them to the deployment YAML file, for example:
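A sketch of such a deployment snippet; the service name, namespace, and workload name below are placeholders for your own application:

spec:
  containers:
    - name: my-app
      env:
        - name: OTEL_SERVICE_NAME
          value: "my-app"
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317"
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "k8s.namespace.name=my-namespace,k8s.deployment.name=my-app"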
"},{"location":"en/admin/insight/quickstart/otel/send_tracing_to_insight.html#forwarding-traces-to-insight-through-opentelemetry-collector","title":"Forwarding traces to Insight through Opentelemetry Collector","text":"
After ensuring that the application has added the metadata mentioned above, you only need to add an OTLP Exporter in your customer's Opentelemetry Collector to forward the trace data to Insight Agent Opentelemetry Collector. Below is an example Opentelemetry Collector configuration file:
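A sketch of the relevant part of such a configuration, assuming an otlp receiver is already configured in your collector; the exporter name and pipeline layout are illustrative:

exporters:
  otlp/insight:
    endpoint: "insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317"
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/insight]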
Enhancing Applications Non-intrusively with the Operator
Achieving Observability with OTel
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html","title":"Enhance Go applications with OTel SDK","text":"
This page contains instructions on how to set up OpenTelemetry enhancements in a Go application.
OpenTelemetry, also known simply as OTel, is an open-source observability framework that helps generate and collect telemetry data: traces, metrics, and logs in Go apps.
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#enhance-go-apps-with-the-opentelemetry-sdk","title":"Enhance Go apps with the OpenTelemetry SDK","text":""},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#install-related-dependencies","title":"Install related dependencies","text":"
Dependencies related to the OpenTelemetry exporter and SDK must be installed first. If you are using another request router, please refer to request routing. After switching to the application source folder, run the following command:
go get go.opentelemetry.io/otel@v1.8.0 \\\n go.opentelemetry.io/otel/trace@v1.8.0 \\\n go.opentelemetry.io/otel/sdk@v1.8.0 \\\n go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin@v0.33.0 \\\n go.opentelemetry.io/otel/exporters/otlp/otlptrace@v1.7.0 \\\n go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.4.1\n
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#create-an-initialization-feature-using-the-opentelemetry-sdk","title":"Create an initialization feature using the OpenTelemetry SDK","text":"
In order for an application to be able to send data, a function is required to initialize OpenTelemetry. Add the following code snippet to the main.go file:
import (\n \"context\"\n \"os\"\n \"time\"\n\n \"go.opentelemetry.io/otel\"\n \"go.opentelemetry.io/otel/exporters/otlp/otlptrace\"\n \"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc\"\n \"go.opentelemetry.io/otel/propagation\"\n \"go.opentelemetry.io/otel/sdk/resource\"\n sdktrace \"go.opentelemetry.io/otel/sdk/trace\"\n semconv \"go.opentelemetry.io/otel/semconv/v1.7.0\"\n \"go.uber.org/zap\"\n \"google.golang.org/grpc\"\n)\n\nvar tracerExp *otlptrace.Exporter\n\nfunc retryInitTracer() func() {\n var shutdown func()\n go func() {\n for {\n // otel will reconnected and re-send spans when otel col recover. so, we don't need to re-init tracer exporter.\n if tracerExp == nil {\n shutdown = initTracer()\n } else {\n break\n }\n time.Sleep(time.Minute * 5)\n }\n }()\n return shutdown\n}\n\nfunc initTracer() func() {\n // temporarily set timeout to 10s\n ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)\n defer cancel()\n\n serviceName, ok := os.LookupEnv(\"OTEL_SERVICE_NAME\")\n if !ok {\n serviceName = \"server_name\"\n os.Setenv(\"OTEL_SERVICE_NAME\", serviceName)\n }\n otelAgentAddr, ok := os.LookupEnv(\"OTEL_EXPORTER_OTLP_ENDPOINT\")\n if !ok {\n otelAgentAddr = \"http://localhost:4317\"\n os.Setenv(\"OTEL_EXPORTER_OTLP_ENDPOINT\", otelAgentAddr)\n }\n zap.S().Infof(\"OTLP Trace connect to: %s with service name: %s\", otelAgentAddr, serviceName)\n\n traceExporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure(), otlptracegrpc.WithDialOption(grpc.WithBlock()))\n if err != nil {\n handleErr(err, \"OTLP Trace gRPC Creation\")\n return nil\n }\n\n tracerProvider := sdktrace.NewTracerProvider(\n sdktrace.WithBatcher(traceExporter),\n sdktrace.WithSampler(sdktrace.AlwaysSample()),\n sdktrace.WithResource(resource.NewWithAttributes(semconv.SchemaURL)))\n\n otel.SetTracerProvider(tracerProvider)\n otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))\n\n tracerExp = traceExporter\n return func() {\n // Shutdown will flush any remaining spans and shut down the exporter.\n handleErr(tracerProvider.Shutdown(ctx), \"failed to shutdown TracerProvider\")\n }\n}\n\nfunc handleErr(err error, message string) {\n if err != nil {\n zap.S().Errorf(\"%s: %v\", message, err)\n }\n}\n
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#initialize-tracker-in-maingo","title":"Initialize tracker in main.go","text":"
Modify the main function in main.go to initialize the tracer. When your service shuts down, you should also call TracerProvider.Shutdown() to ensure all spans are exported. The service makes this call as a deferred function in the main function:
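A minimal sketch of how the main function might wire this up, using the retryInitTracer helper defined above (the rest of the server setup is omitted):

func main() {
    // Initialize the tracer; flush any remaining spans on exit.
    shutdown := retryInitTracer()
    defer func() {
        if shutdown != nil {
            shutdown()
        }
    }()

    // ... set up routes and start the HTTP server here ...
}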
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#add-opentelemetry-gin-middleware-to-the-application","title":"Add OpenTelemetry Gin middleware to the application","text":"
Configure Gin to use the middleware by adding the following line to main.go :
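A sketch, assuming the otelgin package is imported under the alias middleware (see the import stanza below) and the Gin engine variable is named router; the argument is the service name recorded on spans:

router := gin.Default()
// Register the OpenTelemetry middleware for Gin
router.Use(middleware.Middleware("my-golang-app"))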
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#run-the-application","title":"Run the application","text":"
Local debugging and running
Note: This step is only used for local development and debugging. In the production environment, the Operator will automatically complete the injection of the following environment variables.
The above steps have completed the initialization of the SDK. If you now need to develop and debug locally, you need to obtain the address of insight-agent-opentelemetry-collector in the insight-system namespace in advance, for example: insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 .
Therefore, you can add the following environment variables when you start the application locally:
OTEL_SERVICE_NAME=my-golang-app OTEL_EXPORTER_OTLP_ENDPOINT=http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 go run main.go...\n
Running in a production environment
Please refer to the introduction of Only injecting environment variable annotations in Achieving non-intrusive enhancement of applications through Operators to add annotations to deployment yaml:
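A sketch of the pod-template annotation; the value follows the namespace/name format of the Instrumentation CR described in that document:

spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-sdk: "insight-system/insight-opentelemetry-autoinstrumentation"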
# Add one line to your import() stanza depending upon your request router:\nmiddleware \"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin\"\n
# Add one line to your import() stanza depending upon your request router:\nmiddleware \"go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux\"\n
The OpenTelemetry community has also developed middleware for database access libraries, such as Gorm:
import (\n \"github.com/uptrace/opentelemetry-go-extra/otelgorm\"\n \"gorm.io/driver/sqlite\"\n \"gorm.io/gorm\"\n)\n\ndb, err := gorm.Open(sqlite.Open(\"file::memory:?cache=shared\"), &gorm.Config{})\nif err != nil {\n panic(err)\n}\n\notelPlugin := otelgorm.NewPlugin(otelgorm.WithDBName(\"mydb\"), # Missing this can lead to incomplete display of database related topology\n otelgorm.WithAttributes(semconv.ServerAddress(\"memory\"))) # Missing this can lead to incomplete display of database related topology\nif err := db.Use(otelPlugin); err != nil {\n panic(err)\n}\n
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#add-custom-properties-and-custom-events-to-span","title":"Add custom properties and custom events to span","text":"
It is also possible to set custom attributes, tags, and events on a span. To add custom properties and events, follow these steps:
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#import-tracking-and-property-libraries","title":"Import Tracking and Property Libraries","text":"
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#get-the-current-span-from-the-context","title":"Get the current Span from the context","text":"
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#set-properties-in-the-current-span","title":"Set properties in the current Span","text":"
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#add-an-event-to-the-current-span","title":"Add an Event to the current Span","text":"
Adding span events is done using AddEvent on the span object.
span.AddEvent(msg)\n
"},{"location":"en/admin/insight/quickstart/otel/golang/golang.html#log-errors-and-exceptions","title":"Log errors and exceptions","text":"
import \"go.opentelemetry.io/otel/codes\"\n\n// Get the current span\nspan := trace.SpanFromContext(ctx)\n\n// RecordError will automatically convert an error into a span even\nspan.RecordError(err)\n\n// Flag this span as an error\nspan.SetStatus(codes.Error, \"internal error\")\n
Navigate to your application\u2019s source folder and run the following command:
go get go.opentelemetry.io/otel \\\n go.opentelemetry.io/otel/attribute \\\n go.opentelemetry.io/otel/exporters/prometheus \\\n go.opentelemetry.io/otel/metric/global \\\n go.opentelemetry.io/otel/metric/instrument \\\n go.opentelemetry.io/otel/sdk/metric\n
"},{"location":"en/admin/insight/quickstart/otel/golang/meter.html#create-an-initialization-function-using-otel-sdk","title":"Create an Initialization Function Using OTel SDK","text":"
For Java applications, you can directly expose JVM-related metrics by using the OpenTelemetry agent with the following environment variable:
OTEL_METRICS_EXPORTER=prometheus\n
You can then check your metrics at http://localhost:8888/metrics.
Next, combine it with a Prometheus ServiceMonitor to complete the metrics integration. If you want to expose custom metrics, please refer to opentelemetry-java-docs/prometheus.
The process is mainly divided into two steps:
Create a meter provider and specify Prometheus as the exporter.
/*\n * Copyright The OpenTelemetry Authors\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage io.opentelemetry.example.prometheus;\n\nimport io.opentelemetry.api.metrics.MeterProvider;\nimport io.opentelemetry.exporter.prometheus.PrometheusHttpServer;\nimport io.opentelemetry.sdk.metrics.SdkMeterProvider;\nimport io.opentelemetry.sdk.metrics.export.MetricReader;\n\npublic final class ExampleConfiguration {\n\n /**\n * Initializes the Meter SDK and configures the Prometheus collector with all default settings.\n *\n * @param prometheusPort the port to open up for scraping.\n * @return A MeterProvider for use in instrumentation.\n */\n static MeterProvider initializeOpenTelemetry(int prometheusPort) {\n MetricReader prometheusReader = PrometheusHttpServer.builder().setPort(prometheusPort).build();\n\n return SdkMeterProvider.builder().registerMetricReader(prometheusReader).build();\n }\n}\n
Create a custom meter and start the HTTP server.
package io.opentelemetry.example.prometheus;\n\nimport io.opentelemetry.api.common.Attributes;\nimport io.opentelemetry.api.metrics.Meter;\nimport io.opentelemetry.api.metrics.MeterProvider;\nimport java.util.concurrent.ThreadLocalRandom;\n\n/**\n * Example of using the PrometheusHttpServer to convert OTel metrics to Prometheus format and expose\n * these to a Prometheus instance via a HttpServer exporter.\n *\n * <p>A Gauge is used to periodically measure how many incoming messages are awaiting processing.\n * The Gauge callback gets executed every collection interval.\n */\npublic final class PrometheusExample {\n private long incomingMessageCount;\n\n public PrometheusExample(MeterProvider meterProvider) {\n Meter meter = meterProvider.get(\"PrometheusExample\");\n meter\n .gaugeBuilder(\"incoming.messages\")\n .setDescription(\"No of incoming messages awaiting processing\")\n .setUnit(\"message\")\n .buildWithCallback(result -> result.record(incomingMessageCount, Attributes.empty()));\n }\n\n void simulate() {\n for (int i = 500; i > 0; i--) {\n try {\n System.out.println(\n i + \" Iterations to go, current incomingMessageCount is: \" + incomingMessageCount);\n incomingMessageCount = ThreadLocalRandom.current().nextLong(100);\n Thread.sleep(1000);\n } catch (InterruptedException e) {\n // ignored here\n }\n }\n }\n\n public static void main(String[] args) {\n int prometheusPort = 8888;\n\n // It is important to initialize the OpenTelemetry SDK as early as possible in your process.\n MeterProvider meterProvider = ExampleConfiguration.initializeOpenTelemetry(prometheusPort);\n\n PrometheusExample prometheusExample = new PrometheusExample(meterProvider);\n\n prometheusExample.simulate();\n\n System.out.println(\"Exiting\");\n }\n}\n
After running the Java application, you can check if your metrics are working correctly by visiting http://localhost:8888/metrics.
For instrumenting and monitoring traces of Java applications, please refer to the document Implementing Non-Intrusive Enhancements for Applications via Operator, which explains how to automatically integrate traces through annotations.
Monitoring the JVM of Java applications: how Java applications that have already exposed JVM metrics, and those that have not, can be connected to Insight observability.
If your Java application has not yet started exposing JVM metrics, you can refer to the following documents:
Exposing JVM Monitoring Metrics Using JMX Exporter
Exposing JVM Monitoring Metrics Using OpenTelemetry Java Agent
If your Java application has already exposed JVM metrics, you can refer to the following document:
Connecting Existing JVM Metrics of Java Applications to Observability
Writing TraceId and SpanId into Java Application Logs to correlate trace data with log data.
"},{"location":"en/admin/insight/quickstart/otel/java/mdc.html","title":"Writing TraceId and SpanId into Java Application Logs","text":"
This article explains how to automatically write TraceId and SpanId into Java application logs using OpenTelemetry. By including TraceId and SpanId in your logs, you can correlate distributed tracing data with log data, enabling more efficient fault diagnosis and performance analysis.
Spring Boot projects come with a built-in logging framework and use Logback as the default logging implementation. If your Java project is a Spring Boot project, you can write TraceId into logs with minimal configuration.
Set logging.pattern.level in application.properties, adding %mdc{trace_id} and %mdc{span_id} to the logs.
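For example, a sketch in application.properties; the surrounding pattern is illustrative and assumes the MDC keys trace_id and span_id are populated by the OpenTelemetry instrumentation:

logging.pattern.level=%5p [trace_id=%mdc{trace_id}] [span_id=%mdc{span_id}]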
Modify the log4j2.xml configuration, adding %X{trace_id} and %X{span_id} in the pattern to automatically write TraceId and SpanId into the logs:
<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<configuration>\n <appender name=\"CONSOLE\" class=\"ch.qos.logback.core.ConsoleAppender\">\n <encoder>\n <pattern>%d{HH:mm:ss.SSS} trace_id=%X{trace_id} span_id=%X{span_id} trace_flags=%X{trace_flags} %msg%n</pattern>\n </encoder>\n </appender>\n\n <!-- Just wrap your logging appender, for example ConsoleAppender, with OpenTelemetryAppender -->\n <appender name=\"OTEL\" class=\"io.opentelemetry.instrumentation.logback.mdc.v1_0.OpenTelemetryAppender\">\n <appender-ref ref=\"CONSOLE\"/>\n </appender>\n\n <!-- Use the wrapped \"OTEL\" appender instead of the original \"CONSOLE\" one -->\n <root level=\"INFO\">\n <appender-ref ref=\"OTEL\"/>\n </root>\n\n</configuration>\n
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html","title":"Exposing JVM Monitoring Metrics Using JMX Exporter","text":"
JMX Exporter provides two usage methods:
Standalone Process: Specify parameters when starting the JVM to expose a JMX RMI interface. The JMX Exporter calls RMI to obtain the JVM runtime state data, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape.
In-Process (JVM process): Specify parameters when starting the JVM to run the JMX Exporter jar file as a javaagent. This method reads the JVM runtime state data in-process, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape.
Note
The official recommendation is not to use the first method due to its complex configuration and the requirement for a separate process, which introduces additional monitoring challenges. Therefore, this article focuses on the second method, detailing how to use JMX Exporter to expose JVM monitoring metrics in a Kubernetes environment.
In this method, you need to specify the JMX Exporter jar file and configuration file when starting the JVM. Since the jar file is a binary file that is not ideal for mounting via a configmap, and the configuration file typically does not require modifications, it is recommended to package both the JMX Exporter jar file and the configuration file directly into the business container image.
For the second method, you can choose to include the JMX Exporter jar file in the application image or mount it during deployment. Below are explanations for both approaches:
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html#method-1-building-jmx-exporter-jar-file-into-the-business-image","title":"Method 1: Building JMX Exporter JAR File into the Business Image","text":"
The content of prometheus-jmx-config.yaml is as follows:
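A minimal example of such a configuration, which simply exposes all MBeans (adjust the rules to your needs):

rules:
  - pattern: ".*"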
The format of the startup parameter is: -javaagent:<jmx-exporter-jar-path>=<port>:<config-file-path>
Here, port 8088 is used to expose JVM monitoring metrics; you may change it if it conflicts with the Java application.
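For example, assuming the jar and the configuration file are packaged under /app in the image, the JVM could be started as follows:

java -javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml -jar my-app.jar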
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html#method-2-mounting-via-init-container","title":"Method 2: Mounting via Init Container","text":"
First, we need to create a Docker image for the JMX Exporter. The following Dockerfile is for reference:
FROM alpine/curl:3.14\nWORKDIR /app/\n# Copy the previously created config file into the image\nCOPY prometheus-jmx-config.yaml ./\n# Download the jmx prometheus javaagent jar online\nRUN set -ex; \\\n curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar;\n
Build the image using the above Dockerfile: docker build -t my-jmx-exporter .
Add the following init container to the Java application deployment YAML:
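A sketch of such a deployment, assuming the image built above is named my-jmx-exporter and the application image is my-demo-app-image; a shared emptyDir volume carries the agent jar and config file into the business container:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-demo-app
spec:
  selector:
    matchLabels:
      app: my-demo-app
  template:
    metadata:
      labels:
        app: my-demo-app
    spec:
      initContainers:
        - name: jmx-exporter
          image: my-jmx-exporter
          # Copy the agent jar and config file into the shared volume
          command: ["sh", "-c", "cp /app/* /jmx/"]
          volumeMounts:
            - name: jmx-agent
              mountPath: /jmx
      containers:
        - name: my-demo-app
          image: my-demo-app-image
          env:
            # Attach the JMX Exporter javaagent; port 8088 exposes the metrics
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/jmx/jmx_prometheus_javaagent-0.17.2.jar=8088:/jmx/prometheus-jmx-config.yaml"
          ports:
            - containerPort: 8088
          volumeMounts:
            - name: jmx-agent
              mountPath: /jmx
      volumes:
        - name: jmx-agent
          emptyDir: {}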
With the above modifications, the example application my-demo-app now has the capability to expose JVM metrics. After running the service, you can access the Prometheus formatted metrics at http://localhost:8088.
Next, you can refer to Connecting Existing JVM Metrics of Java Applications to Observability.
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/legacy-jvm.html","title":"Integrating Existing JVM Metrics of Java Applications with Observability","text":"
If your Java application exposes JVM monitoring metrics through other means (such as Spring Boot Actuator), you will need to ensure that the monitoring data is collected. You can achieve this by adding annotations (Kubernetes Annotations) to your workload to allow Insight to scrape the existing JVM metrics:
annotations: \n insight.opentelemetry.io/metric-scrape: \"true\" # Whether to scrape\n insight.opentelemetry.io/metric-path: \"/\" # Path to scrape metrics\n insight.opentelemetry.io/metric-port: \"9464\" # Port to scrape metrics\n
For example, to add annotations to the my-deployment-app:
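A sketch of the relevant part of the Deployment, using the Spring Boot Actuator endpoint mentioned below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment-app
spec:
  template:
    metadata:
      annotations:
        insight.opentelemetry.io/metric-scrape: "true"                # Whether to scrape
        insight.opentelemetry.io/metric-path: "/actuator/prometheus"  # Path to scrape metrics
        insight.opentelemetry.io/metric-port: "8080"                  # Port to scrape metrics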
In the above example, Insight will scrape the Prometheus metrics exposed through Spring Boot Actuator via http://<service-ip>:8080/actuator/prometheus.
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.html","title":"Exposing JVM Metrics Using OpenTelemetry Java Agent","text":"
Starting from OpenTelemetry Agent v1.20.0 and later, the OpenTelemetry Agent has introduced the JMX Metric Insight module. If your application is already integrated with the OpenTelemetry Agent for tracing, you no longer need to introduce another agent to expose JMX metrics for your application. The OpenTelemetry Agent collects and exposes metrics by detecting the locally available MBeans in the application.
The OpenTelemetry Agent also provides built-in monitoring examples for common Java servers or frameworks. Please refer to the Predefined Metrics.
When using the OpenTelemetry Java Agent, you also need to consider how to mount the JAR into the container. In addition to the methods for mounting the JAR file as described with the JMX Exporter, you can leverage the capabilities provided by the OpenTelemetry Operator to automatically enable JVM metrics exposure for your application.
If your application is already integrated with the OpenTelemetry Agent for tracing, you do not need to introduce another agent to expose JMX metrics. The OpenTelemetry Agent can now locally collect and expose metrics interfaces by detecting the locally available MBeans in the application.
However, as of the current version, you still need to manually add the appropriate annotations to your application for the JVM data to be collected by Insight. For specific annotation content, please refer to Integrating Existing JVM Metrics of Java Applications with Observability.
"},{"location":"en/admin/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.html#exposing-metrics-for-java-middleware","title":"Exposing Metrics for Java Middleware","text":"
The OpenTelemetry Agent also includes built-in examples for monitoring middleware. Please refer to the Predefined Metrics.
By default, no specific types are designated; you need to specify them using the -Dotel.jmx.target.system JVM options, for example, -Dotel.jmx.target.system=jetty,kafka-broker.
Although the OpenShift system comes with its own monitoring system, we will still install Insight Agent because of certain conventions in its data collection.
In addition to the basic installation configuration, the following parameters need to be added during helm install:
## Parameters related to fluentbit;\n--set fluent-bit.ocp.enabled=true \\\n--set fluent-bit.serviceAccount.create=false \\\n--set fluent-bit.securityContext.runAsUser=0 \\\n--set fluent-bit.securityContext.seLinuxOptions.type=spc_t \\\n--set fluent-bit.securityContext.readOnlyRootFilesystem=false \\\n--set fluent-bit.securityContext.allowPrivilegeEscalation=false \\\n\n## Enable Prometheus(CR) for OpenShift4.x\n--set compatibility.openshift.prometheus.enabled=true \\\n\n## Disable the Prometheus instance of the higher version\n--set kube-prometheus-stack.prometheus.enabled=false \\\n--set kube-prometheus-stack.kubeApiServer.enabled=false \\\n--set kube-prometheus-stack.kubelet.enabled=false \\\n--set kube-prometheus-stack.kubeControllerManager.enabled=false \\\n--set kube-prometheus-stack.coreDns.enabled=false \\\n--set kube-prometheus-stack.kubeDns.enabled=false \\\n--set kube-prometheus-stack.kubeEtcd.enabled=false \\\n--set kube-prometheus-stack.kubeScheduler.enabled=false \\\n--set kube-prometheus-stack.kubeStateMetrics.enabled=false \\\n--set kube-prometheus-stack.nodeExporter.enabled=false \\\n\n## Limit the namespace processed by PrometheusOperator to avoid competition with OpenShift's own PrometheusOperator\n--set kube-prometheus-stack.prometheusOperator.kubeletService.namespace=\"insight-system\" \\\n--set kube-prometheus-stack.prometheusOperator.prometheusInstanceNamespaces=\"insight-system\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[0]=\"openshift-monitoring\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[1]=\"openshift-user-workload-monitoring\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[2]=\"openshift-customer-monitoring\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[3]=\"openshift-route-monitor-operator\" \\\n
"},{"location":"en/admin/insight/quickstart/other/install-agent-on-ocp.html#write-system-monitoring-data-into-prometheus-through-openshifts-own-mechanism","title":"Write system monitoring data into Prometheus through OpenShift's own mechanism","text":"
"},{"location":"en/admin/insight/quickstart/res-plan/modify-vms-disk.html","title":"vmstorage Disk Expansion","text":"
This article describes the method for expanding the vmstorage disk. Please refer to the vmstorage disk capacity planning for the specifications of the vmstorage disk.
Log in to the AI platform as a global service cluster administrator. Click Container Management -> Clusters and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu Container Storage -> PVCs and find the PVC bound to the vmstorage.
Click a vmstorage PVC to enter the details of the volume claim for vmstorage and confirm the StorageClass that the PVC is bound to.
Select the left navigation menu Container Storage -> Storage Class and find local-path . Click the \u2507 on the right side of the target and select Edit in the popup menu.
Enable Scale Up and click OK .
"},{"location":"en/admin/insight/quickstart/res-plan/modify-vms-disk.html#modify-the-disk-capacity-of-vmstorage","title":"Modify the disk capacity of vmstorage","text":"
Log in to the AI platform as a global service cluster administrator and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu CRDs and find the custom resource for vmcluster .
Click the custom resource for vmcluster to enter the details page, switch to the insight-system namespace, and select Edit YAML from the right menu of insight-victoria-metrics-k8s-stack .
Modify according to the legend and click OK .
Select the left navigation menu Container Storage -> PVCs again and find the volume claim bound to vmstorage. Confirm that the modification has taken effect. In the details page of a PVC, click the associated storage source (PV).
Open the volume details page and click the Update button in the upper right corner.
After modifying the Capacity , click OK and wait for a moment until the expansion is successful.
"},{"location":"en/admin/insight/quickstart/res-plan/modify-vms-disk.html#clone-the-storage-volume","title":"Clone the storage volume","text":"
If the storage volume expansion fails, you can refer to the following method to clone the storage volume.
Log in to the AI platform as a global service cluster administrator and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu Workloads -> StatefulSets and find the statefulset for vmstorage . Click the \u2507 on the right side of the target and select Status -> Stop -> OK in the popup menu.
After logging into the master node of the kpanda-global-cluster cluster in the command line, run the following command to copy the vm-data directory in the vmstorage container to store the metric information locally:
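A sketch of the export command; the pod name, container name, and data path are assumptions, so adjust them to your environment:

kubectl cp -n insight-system -c vmstorage vmstorage-insight-victoria-metrics-k8s-stack-0:vm-data ./vm-data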
Log in to the AI platform and go to the details of the kpanda-global-cluster cluster. Select the left navigation menu Container Storage -> PVs , click Clone in the upper right corner, and modify the capacity of the volume.
Delete the previous data volume of vmstorage.
Wait for a moment until the volume claim is bound to the cloned data volume, then run the following command to import the data exported in step 3 into the corresponding container, and then start the previously paused vmstorage .
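A sketch of the import command, mirroring the export above (pod name, container name, and path assumed):

kubectl cp -n insight-system -c vmstorage ./vm-data vmstorage-insight-victoria-metrics-k8s-stack-0:vm-data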
In actual use, the CPU, memory, and other resource usage of Prometheus may exceed the configured resources, depending on the number of containers in the cluster and on whether Istio is enabled.
In order to ensure the normal operation of Prometheus in clusters of different sizes, it is necessary to adjust the resources of Prometheus according to the actual size of the cluster.
When the service mesh is not enabled, test statistics show that the relationship between the number of series produced by system jobs and the number of pods is: Series count = 800 x Pod count
When the service mesh is enabled, the magnitude of the Istio-related series generated by the pods is: Series count = 768 x Pod count
"},{"location":"en/admin/insight/quickstart/res-plan/prometheus-res.html#when-the-service-mesh-is-not-enabled","title":"When the service mesh is not enabled","text":"
The following resource planning is recommended by Prometheus when the service mesh is not enabled :
Pod count in the table refers to the pod count that is basically running stably in the cluster. If a large number of pods are restarted, the number of series will increase sharply in a short period of time, and resources need to be adjusted accordingly.
Prometheus stores two hours of data in memory by default, and when the Remote Write feature is enabled in the cluster, a certain amount of additional memory is occupied, so a resource surge ratio of 2 is recommended.
The data in the table are recommended values, applicable to general situations. If the environment has precise resource requirements, it is recommended to check the resource usage of the corresponding Prometheus after the cluster has been running for a period of time and configure it precisely.
"},{"location":"en/admin/insight/quickstart/res-plan/vms-res-plan.html","title":"vmstorage disk capacity planning","text":"
vmstorage is responsible for storing multicluster metrics for observability. In order to ensure the stability of vmstorage, it is necessary to adjust the disk capacity of vmstorage according to the number of clusters and the size of the cluster. For more information, please refer to vmstorage retention period and disk space.
After observing the vmstorage disks of clusters of different sizes for 14 days, we found that the disk usage of vmstorage was positively correlated with the number of metrics it stores and with the disk usage of individual data points.
Instantaneous metric count: use increase(vm_rows{type != \"indexdb\"}[30s]) to obtain the number of metrics ingested within 30s
Disk usage of a single data point: sum(vm_data_size_bytes{type!=\"indexdb\"}) / sum(vm_rows{type != \"indexdb\"})
Disk usage = instantaneous metric count x 2 x disk usage per data point x 60 x 24 x retention period (days)
Parameter Description:
The unit of disk usage is Byte .
The factor 60 x 24 converts the retention period from days into minutes so that disk usage can be calculated.
The default collection interval of Prometheus in Insight Agent is 30s, so each metric produces two data points per minute (hence the factor of 2).
The default storage duration in vmstorage is 1 month, please refer to Modify System Configuration to modify the configuration.
Warning
This formula is a general solution, and it is recommended to reserve redundant disk capacity on the calculation result to ensure the normal operation of vmstorage.
The data in the table is calculated based on the default retention period of one month (30 days), and the disk usage of a single data point (datapoint) is taken as 0.9 bytes. In a multicluster scenario, the number of Pods represents the sum of the number of Pods across the clusters.
"},{"location":"en/admin/insight/quickstart/res-plan/vms-res-plan.html#when-the-service-mesh-is-not-enabled","title":"When the service mesh is not enabled","text":"Cluster size (number of Pods) Metrics Disk capacity 100 8W 6 GiB 200 16W 12 GiB 300 24w 18 GiB 400 32w 24 GiB 500 40w 30 GiB 800 64w 48 GiB 1000 80W 60 GiB 2000 160w 120 GiB 3000 240w 180 GiB"},{"location":"en/admin/insight/quickstart/res-plan/vms-res-plan.html#when-the-service-mesh-is-enabled","title":"When the service mesh is enabled","text":"Cluster size (number of Pods) Metrics Disk capacity 100 15W 12 GiB 200 31w 24 GiB 300 46w 36 GiB 400 62w 48 GiB 500 78w 60 GiB 800 125w 94 GiB 1000 156w 120 GiB 2000 312w 235 GiB 3000 468w 350 GiB"},{"location":"en/admin/insight/quickstart/res-plan/vms-res-plan.html#example","title":"Example","text":"
There are two clusters in the AI platform platform, of which 500 Pods are running in the global management cluster (service mesh is turned on), and 1000 Pods are running in the worker cluster (service mesh is not turned on), and the expected metrics are stored for 30 days.
The number of metrics in the global management cluster is 800 x 500 + 768 x 500 = 784,000
The number of metrics in the worker cluster is 800 x 1000 = 800,000
Then the vmstorage disk capacity should be set to (784,000 + 800,000) x 2 x 0.9 x 60 x 24 x 30 = 123,171,840,000 bytes, which is about 115 GiB
Note
For the relationship between the number of metrics and the number of Pods in the cluster, please refer to Prometheus Resource Planning.
"},{"location":"en/admin/insight/reference/alertnotification.html","title":"Alert Notification Process Description","text":"
When configuring an alert policy in Insight, you have the ability to set different notification sending intervals for alerts triggered at different levels within the same policy. However, due to the presence of two parameters, group_interval and repeat_interval , in the native Alertmanager configuration, the actual intervals for sending alert notifications may deviate.
group_wait : Specifies the waiting time before sending the first alert notification for a new group. When Alertmanager receives a group of alerts, it waits for the duration specified by group_wait to collect additional alerts with the same labels and content, and then includes all qualifying alerts in the same notification.
group_interval : Determines the waiting time before merging a group of alerts into a single notification. If no more alerts from the same group are received during this period, Alertmanager sends a notification containing all received alerts.
repeat_interval : Sets the interval for resending alert notifications. After Alertmanager sends an alert notification to a receiver, if it continues to receive alerts with the same labels and content within the duration specified by repeat_interval , Alertmanager will resend the alert notification.
When the group_wait , group_interval , and repeat_interval parameters are set simultaneously, Alertmanager handles alert notifications under the same group as follows:
When Alertmanager receives qualifying alerts, it waits for at least the duration specified in the group_wait parameter to collect additional alerts with the same labels and content. It includes all qualifying alerts in the same notification.
If no further alerts are received during the group_wait duration, Alertmanager sends all received alerts to the receiver after that time. If additional qualifying alerts arrive during this period, Alertmanager continues to wait until all alerts are collected or a timeout occurs.
If more alerts with the same labels and content are received within the group_interval parameter, these new alerts are merged into the previous notification and sent together. If there are still unsent alerts after the group_interval duration, Alertmanager starts a new timing cycle and waits for more alerts until the group_interval duration is reached again or new alerts are received.
If Alertmanager keeps receiving alerts with the same labels and content within the duration specified by repeat_interval , it will resend the previously sent alert notifications. When resending alert notifications, Alertmanager does not wait for group_wait or group_interval , but sends notifications repeatedly according to the time interval specified by repeat_interval .
If there are still unsent alerts after the repeat_interval duration, Alertmanager starts a new timing cycle and continues to wait for new alerts with the same labels and content. This process continues until there are no new alerts or Alertmanager is stopped.
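The example below corresponds roughly to an Alertmanager route configured as follows (the grouping labels and values are shown for illustration only):

route:
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h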
When Alertmanager receives an alert, it waits for at least 30 seconds to collect additional alerts with the same labels and content, and includes them in the same notification.
If more alerts with the same labels and content are received within 5 minutes, these new alerts are merged into the previous notification and sent together. If there are still unsent alerts after 15 minutes, Alertmanager starts a new timing cycle and waits for more alerts until 5 minutes have passed or new alerts are received.
If Alertmanager continues to receive alerts with the same labels and content within 1 hour, it will resend the previously sent alert notifications.
"},{"location":"en/admin/insight/reference/lucene.html","title":"Lucene Syntax Usage","text":""},{"location":"en/admin/insight/reference/lucene.html#introduction-to-lucene","title":"Introduction to Lucene","text":"
Lucene is a subproject of Apache Software Foundation's Jakarta project and is an open-source full-text search engine toolkit. The purpose of Lucene is to provide software developers with a simple and easy-to-use toolkit for implementing full-text search functionality in their target systems.
Lucene's syntax allows you to construct search queries in a flexible way to meet different search requirements. Here is a detailed explanation of Lucene's syntax:
To perform searches with multiple keywords using Lucene syntax, you can use Boolean logical operators to combine multiple keywords. Lucene supports the following operators:
AND operator
Use AND or && to represent the logical AND relationship.
Example: term1 AND term2 or term1 && term2
OR operator
Use OR or || to represent the logical OR relationship.
Example: term1 OR term2 or term1 || term2
NOT operator
Use NOT or - to represent the logical NOT relationship.
Example: term1 NOT term2 or term1 -term2
Quotes
You can enclose a phrase in quotes for exact matching.
In Lucene, fuzzy queries can be performed using the tilde ( ~ ) operator for approximate matching. You can specify an edit distance to limit the degree of similarity in the matches.
term~\n
In the above example, term is the keyword to perform a fuzzy match on.
Please note the following:
After the tilde ( ~ ), you can optionally specify a parameter to control the similarity of the fuzzy query.
The parameter value ranges from 0 to 2, where 0 represents an exact match, 1 allows for one edit operation (such as adding, deleting, or replacing characters) to match, and 2 allows for two edit operations to match.
If no parameter value is specified, the default similarity threshold used is 0.5.
Fuzzy queries will return documents that are similar to the given keyword but may incur some performance overhead, especially for larger indexes.
In the above example, te?t represents a word that starts with \"te\", followed by any single character, and ends with \"t\". This query can match words like \"test\", \"text\", and \"tent\".
It is important to note that the question mark ( ? ) represents only a single character. If you want to match multiple characters or varying lengths of characters, you can use the asterisk ( * ) for multi-character wildcard matching. Additionally, the question mark will not match an empty string.
To summarize, in Lucene syntax, the question mark ( ? ) is used as a single-character wildcard to match any single character. By using the question mark in your search keywords, you can perform more flexible and specific pattern matching.
Lucene syntax supports range queries, where you can use square brackets [ ] or curly braces { } to represent a range. Here are examples of range queries:
Inclusive boundary range query:
Square brackets [ ] indicate a closed interval that includes the boundary values.
Example: field:[value1 TO value2] represents the range of values for field , including both value1 and value2 .
Exclusive boundary range query:
Curly braces { } indicate an open interval that excludes the boundary values.
Example: field:{value1 TO value2} represents the range of values for field between value1 and value2 , excluding both.
Omitted boundary range query:
You can omit one or both boundary values to specify an infinite range.
Example: field:[value TO ] represents the range of values for field from value to positive infinity, and field:[ TO value] represents the range of values for field from negative infinity to value .
Note
Please note that range queries are applicable only to fields that can be sorted, such as numeric fields and date fields. Also, ensure that you correctly specify the boundary values as the actual value type of the field in your query. If you want to perform a range query across the entire index without specifying a specific field, you can use the wildcard query * instead of a field name.
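For example, a range query on a date field (the field name timestamp is assumed):

timestamp:[2022-01-01 TO 2022-01-31]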
This will retrieve data where the timestamp field falls within the range from January 1, 2022, to January 31, 2022.
Without specifying a field
*:[value1 TO value2]\n
This will search the entire index for documents with values ranging from value1 to value2 .
"},{"location":"en/admin/insight/reference/lucene.html#insight-common-keywords","title":"Insight Common Keywords","text":""},{"location":"en/admin/insight/reference/lucene.html#container-logs","title":"Container Logs","text":"
The toClusterName function retrieves the \"cluster name\" based on the \"cluster unique identifier (ID)\". If no matching cluster is found, it directly returns the passed-in cluster unique identifier.
The toClusterId function retrieves the \"cluster unique identifier (ID)\" based on the \"cluster name\". If no matching cluster is found, it directly returns the passed-in cluster name.
Because Insight combines messages generated by the same rule at the same time when sending alert messages, email subjects are different from the four templates above and only use the content of commonLabels in the alert message to render the template. The default template is as follows:
Other fields that can be used as email subjects are as follows:
{{ .status }} Triggering status of the alert message\n{{ .alertgroup }} Name of the policy to which the alert belongs\n{{ .alertname }} Name of the rule to which the alert belongs\n{{ .severity }} Severity level of the alert\n{{ .target_type }} Type of resource for which the alert is raised\n{{ .target }} Resource object for which the alert is raised\n{{ .Custom label key for other rules }}\n
"},{"location":"en/admin/insight/reference/tailing-sidecar.html","title":"Collecting Container Logs through Sidecar","text":"
Tailing Sidecar is a Kubernetes cluster-level logging proxy that acts as a streaming sidecar container. It allows automatic collection and summarization of log files within containers, even when the container cannot write to standard output or standard error streams.
Insight supports log collection through the Sidecar mode, which involves running a Sidecar container alongside each Pod to output log data to the standard output stream. This enables FluentBit to collect container logs effectively.
The Insight Agent comes with the tailing-sidecar operator installed by default. To enable file log collection within a container, you can add annotations to the Pod, which will automatically inject the Tailing Sidecar container. The injected Sidecar container reads the files in the business container and outputs them to the standard output stream.
Here are the specific steps to follow:
Modify the YAML file of the Pod and add the following parameters in the annotation field:
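A sketch of the annotation; the angle-bracket placeholders are explained below:

metadata:
  annotations:
    tailing-sidecar: <sidecar-name-0>:<volume-name-0>:<path-to-tail-0>;<sidecar-name-1>:<volume-name-1>:<path-to-tail-1>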
sidecar-name-0 : Name for the Tailing Sidecar container (optional; a container name will be created automatically if not specified, starting with the prefix \"tailing-sidecar\").
volume-name-0 : Name of the storage volume.
path-to-tail-0 : File path to tail.
Note
Each Pod can run multiple sidecar containers, separated by ; . This allows different sidecar containers to collect multiple files and store them in various volumes.
Restart the Pod. Once the Pod's status changes to Running , you can use the Log Query interface to search for logs within the container of the Pod.
The metrics in this article are organized based on the community's kube-prometheus framework. Currently, it covers metrics from multiple levels, including Cluster, Node, Namespace, and Workload. This article lists some commonly used metrics, their descriptions, and units for easy reference.
"},{"location":"en/admin/insight/reference/used-metric-in-insight.html#cluster","title":"Cluster","text":"Metric Name Description Unit cluster_cpu_utilization Cluster CPU Utilization cluster_cpu_total Total CPU in Cluster Core cluster_cpu_usage CPU Used in Cluster Core cluster_cpu_requests_commitment CPU Allocation Rate in Cluster cluster_memory_utilization Cluster Memory Utilization cluster_memory_usage Memory Usage in Cluster Byte cluster_memory_available Available Memory in Cluster Byte cluster_memory_requests_commitment Memory Allocation Rate in Cluster cluster_memory_total Total Memory in Cluster Byte cluster_net_utilization Network Data Transfer Rate in Cluster Byte/s cluster_net_bytes_transmitted Network Data Transmitted in Cluster (Upstream) Byte/s cluster_net_bytes_received Network Data Received in Cluster (Downstream) Byte/s cluster_disk_read_iops Disk Read IOPS in Cluster times/s cluster_disk_write_iops Disk Write IOPS in Cluster times/s cluster_disk_read_throughput Disk Read Throughput in Cluster Byte/s cluster_disk_write_throughput Disk Write Throughput in Cluster Byte/s cluster_disk_size_capacity Total Disk Capacity in Cluster Byte cluster_disk_size_available Available Disk Size in Cluster Byte cluster_disk_size_usage Disk Usage in Cluster Byte cluster_disk_size_utilization Disk Utilization in Cluster cluster_node_total Total Nodes in Cluster units cluster_node_online Online Nodes in Cluster units cluster_node_offline_count Count of Offline Nodes in Cluster units cluster_pod_count Total Pods in Cluster units cluster_pod_running_count Count of Running Pods in Cluster units cluster_pod_abnormal_count Count of Abnormal Pods in Cluster units cluster_deployment_count Total Deployments in Cluster units cluster_deployment_normal_count Count of Normal Deployments in Cluster units cluster_deployment_abnormal_count Count of Abnormal Deployments in Cluster units cluster_statefulset_count Count of StatefulSets in Cluster units cluster_statefulset_normal_count Count of Normal StatefulSets in Cluster units cluster_statefulset_abnormal_count Count of Abnormal StatefulSets in Cluster units cluster_daemonset_count Count of DaemonSets in Cluster units cluster_daemonset_normal_count Count of Normal DaemonSets in Cluster units cluster_daemonset_abnormal_count Count of Abnormal DaemonSets in Cluster units cluster_job_count Total Jobs in Cluster units cluster_job_normal_count Count of Normal Jobs in Cluster units cluster_job_abnormal_count Count of Abnormal Jobs in Cluster units
Tip
Utilization is generally a number in the range (0,1] (e.g., 0.21, not 21%)
"},{"location":"en/admin/insight/reference/used-metric-in-insight.html#node","title":"Node","text":"Metric Name Description Unit node_cpu_utilization Node CPU Utilization node_cpu_total Total CPU in Node Core node_cpu_usage CPU Usage in Node Core node_cpu_requests_commitment CPU Allocation Rate in Node node_memory_utilization Node Memory Utilization node_memory_usage Memory Usage in Node Byte node_memory_requests_commitment Memory Allocation Rate in Node node_memory_available Available Memory in Node Byte node_memory_total Total Memory in Node Byte node_net_utilization Network Data Transfer Rate in Node Byte/s node_net_bytes_transmitted Network Data Transmitted in Node (Upstream) Byte/s node_net_bytes_received Network Data Received in Node (Downstream) Byte/s node_disk_read_iops Disk Read IOPS in Node times/s node_disk_write_iops Disk Write IOPS in Node times/s node_disk_read_throughput Disk Read Throughput in Node Byte/s node_disk_write_throughput Disk Write Throughput in Node Byte/s node_disk_size_capacity Total Disk Capacity in Node Byte node_disk_size_available Available Disk Size in Node Byte node_disk_size_usage Disk Usage in Node Byte node_disk_size_utilization Disk Utilization in Node"},{"location":"en/admin/insight/reference/used-metric-in-insight.html#workload","title":"Workload","text":"
The currently supported workload types include: Deployment, StatefulSet, DaemonSet, Job, and CronJob.
Metric Name Description Unit workload_cpu_usage Workload CPU Usage Core workload_cpu_limits Workload CPU Limit Core workload_cpu_requests Workload CPU Requests Core workload_cpu_utilization Workload CPU Utilization workload_memory_usage Workload Memory Usage Byte workload_memory_limits Workload Memory Limit Byte workload_memory_requests Workload Memory Requests Byte workload_memory_utilization Workload Memory Utilization workload_memory_usage_cached Workload Memory Usage (including cache) Byte workload_net_bytes_transmitted Workload Network Data Transmitted Rate Byte/s workload_net_bytes_received Workload Network Data Received Rate Byte/s workload_disk_read_throughput Workload Disk Read Throughput Byte/s workload_disk_write_throughput Workload Disk Write Throughput Byte/s
The metrics here are aggregated over the entire workload (summed across all of its Pods).
Metrics can be obtained using workload_cpu_usage{workload_type=\"deployment\", workload=\"prometheus\"}.
Calculation rule for workload_pod_utilization: workload_pod_usage / workload_pod_request.
"},{"location":"en/admin/insight/reference/used-metric-in-insight.html#pod","title":"Pod","text":"Metric Name Description Unit pod_cpu_usage Pod CPU Usage Core pod_cpu_limits Pod CPU Limit Core pod_cpu_requests Pod CPU Requests Core pod_cpu_utilization Pod CPU Utilization pod_memory_usage Pod Memory Usage Byte pod_memory_limits Pod Memory Limit Byte pod_memory_requests Pod Memory Requests Byte pod_memory_utilization Pod Memory Utilization pod_memory_usage_cached Pod Memory Usage (including cache) Byte pod_net_bytes_transmitted Pod Network Data Transmitted Rate Byte/s pod_net_bytes_received Pod Network Data Received Rate Byte/s pod_disk_read_throughput Pod Disk Read Throughput Byte/s pod_disk_write_throughput Pod Disk Write Throughput Byte/s
You can obtain the CPU usage of all Pods belonging to the Deployment named prometheus by using pod_cpu_usage{workload_type=\"deployment\", workload=\"prometheus\"}.
"},{"location":"en/admin/insight/reference/used-metric-in-insight.html#span-metrics","title":"Span Metrics","text":"Metric Name Description Unit calls_total Total Service Requests duration_milliseconds_bucket Service Latency Histogram duration_milliseconds_sum Total Service Latency ms duration_milliseconds_count Number of Latency Records otelcol_processor_groupbytrace_spans_released Number of Collected Spans otelcol_processor_groupbytrace_traces_released Number of Collected Traces traces_service_graph_request_total Total Service Requests (Topology Feature) traces_service_graph_request_server_seconds_sum Total Latency (Topology Feature) ms traces_service_graph_request_server_seconds_bucket Service Latency Histogram (Topology Feature) traces_service_graph_request_server_seconds_count Total Service Requests (Topology Feature)"},{"location":"en/admin/insight/system-config/modify-config.html","title":"Modify system configuration","text":"
Observability persists metric, log, and trace data by default. Users can modify the system configuration as described on this page.
"},{"location":"en/admin/insight/system-config/modify-config.html#how-to-modify-the-metric-data-retention-period","title":"How to modify the metric data retention period","text":"
Refer to the following steps to modify the metric data retention period.
After saving the modification, the pod of the component responsible for storing the metrics will automatically restart; just wait a moment.
"},{"location":"en/admin/insight/system-config/modify-config.html#how-to-modify-the-log-data-storage-duration","title":"How to modify the log data storage duration","text":"
Refer to the following steps to modify the log data retention period:
"},{"location":"en/admin/insight/system-config/modify-config.html#method-1-modify-the-json-file","title":"Method 1: Modify the Json file","text":"
Modify the max_age parameter in the rollover field in the following files and set the retention period. The default retention period is 7d . Change http://localhost:9200 to the access address of Elasticsearch.
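A sketch of such a request against the log index lifecycle policy. The policy body below is only illustrative; retrieve your existing policy first and change only max_age, keeping the other fields as they are:

curl -X PUT "http://localhost:9200/_ilm/policy/insight-es-k8s-logs-policy" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": { "max_age": "8d", "max_size": "50gb" }
          }
        },
        "delete": {
          "min_age": "0ms",
          "actions": { "delete": {} }
        }
      }
    }
  }'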
After modification, run the above command. It will print out the content as shown below, then the modification is successful.
{\n\"acknowledged\": true\n}\n
"},{"location":"en/admin/insight/system-config/modify-config.html#method-2-modify-from-the-ui","title":"Method 2: Modify from the UI","text":"
Log in to Kibana and select Stack Management in the left navigation bar.
Select Index Lifecycle Policies in the left navigation, find the index insight-es-k8s-logs-policy , and click to enter the details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period. The default storage period is 7d .
After modification, click Save policy at the bottom of the page to complete the modification.
"},{"location":"en/admin/insight/system-config/modify-config.html#how-to-modify-the-trace-data-storage-duration","title":"How to modify the trace data storage duration","text":"
Refer to the following steps to modify the trace data retention period:
"},{"location":"en/admin/insight/system-config/modify-config.html#method-1-modify-the-json-file_1","title":"Method 1: Modify the Json file","text":"
Modify the max_age parameter in the rollover field in the following files and set the retention period. The default retention period is 7d . At the same time, change http://localhost:9200 to the access address of Elasticsearch.
On the system component page, you can quickly view the running status of the system components in Insight. When a system component fails, some features in Insight will be unavailable.
Go to Insight product module,
In the left navigation bar, select System Management -> System Components .
"},{"location":"en/admin/insight/system-config/system-component.html#component-description","title":"Component description","text":"Module Component Name Description Metrics vminsert-insight-victoria-metrics-k8s-stack Responsible for writing the metric data collected by Prometheus in each cluster to the storage component. If this component is abnormal, the metric data of the worker cluster cannot be written. Metrics vmalert-insight-victoria-metrics-k8s-stack Responsible for taking effect of the recording and alert rules configured in the VM Rule, and sending the triggered alert rules to alertmanager. Metrics vmalertmanager-insight-victoria-metrics-k8s-stack is responsible for sending messages when alerts are triggered. If this component is abnormal, the alert information cannot be sent. Metrics vmselect-insight-victoria-metrics-k8s-stack Responsible for querying metrics data. If this component is abnormal, the metric cannot be queried. Metrics vmstorage-insight-victoria-metrics-k8s-stack Responsible for storing multicluster metrics data. Dashboard grafana-deployment Provide monitoring panel capability. The exception of this component will make it impossible to view the built-in dashboard. Link insight-jaeger-collector Responsible for receiving trace data in opentelemetry-collector and storing it. Link insight-jaeger-query Responsible for querying the trace data collected in each cluster. Link insight-opentelemetry-collector Responsible for receiving trace data forwarded by each sub-cluster Log elasticsearch Responsible for storing the log data of each cluster."},{"location":"en/admin/insight/system-config/system-config.html","title":"System Settings","text":"
System Settings displays the default storage duration of metrics, logs, and traces, as well as the default Apdex threshold.
Click the right navigation bar and select System Settings .
Currently, only the storage duration of historical alerts can be modified. Click Edit and enter the target duration.
When the storage duration is set to \"0\", the historical alerts will not be cleared.
Note
To modify other settings, refer to How to modify the system settings?
In Insight, a service refers to a group of workloads that provide the same behavior for incoming requests. Service insight helps you observe the performance and status of applications at runtime by using the OpenTelemetry SDK.
For how to use OpenTelemetry, please refer to: Using OTel to give your application insight.
Service: A service represents a group of workloads that provide the same behavior for incoming requests. You can define the service name when using the OpenTelemetry SDK or use the name defined in Istio.
Operation: An operation refers to a specific request or action handled by a service. Each span has an operation name.
Outbound Traffic: Outbound traffic refers to all the traffic generated by the current service when making requests.
Inbound Traffic: Inbound traffic refers to all the traffic initiated by the upstream service targeting the current service.
The Services List page displays key metrics such as throughput rate, error rate, and request latency for all services that have been instrumented with distributed tracing. You can filter services based on clusters or namespaces and sort the list by throughput rate, error rate, or request latency. By default, the data displayed in the list is for the last hour, but you can customize the time range.
Follow these steps to view service insight metrics:
Go to the Insight product module.
Select Trace Tracking -> Services from the left navigation bar.
Attention
If the namespace of a service in the list is unknown , it means that the service has not been properly instrumented. We recommend reconfiguring the instrumentation.
If multiple services have the same name and none of them have the correct Namespace environment variable configured, the metrics displayed in the list and service details page will be aggregated for all those services.
Click a service name (taking insight-system as an example) to view the detailed metrics and operation metrics for that service.
In the Service Topology section, you can view the service topology one layer above or below the current service. When you hover over a node, you can see its information.
In the Traffic Metrics section, you can view the monitoring metrics for all requests to the service within the past hour (including inbound and outbound traffic).
You can use the time selector in the upper right corner to quickly select a time range or specify a custom time range.
Sorting is available for throughput, error rate, and request latency in the operation metrics.
Clicking on the icon next to an individual operation will take you to the Traces page to quickly search for related traces.
"},{"location":"en/admin/insight/trace/service.html#service-metric-explanations","title":"Service Metric Explanations","text":"Metric Description Throughput Rate The number of requests processed within a unit of time. Error Rate The ratio of erroneous requests to the total number of requests within the specified time range. P50 Request Latency The response time within which 50% of requests complete. P95 Request Latency The response time within which 95% of requests complete. P99 Request Latency The response time within which 99% of requests complete."},{"location":"en/admin/insight/trace/topology.html","title":"Service Map","text":"
Service map is a visual representation of the connections, communication, and dependencies between services. It provides insights into the service-to-service interactions, allowing you to view the calls and performance of services within a specified time range. The connections between nodes in the topology map represent the existence of service-to-service calls during the queried time period.
Select Tracing -> Service Map from the left navigation bar.
In the Service Map, you can perform the following actions:
Click a node to slide out the details of the service on the right side. Here, you can view metrics such as request latency, throughput, and error rate for the service. Clicking on the service name takes you to the service details page.
Hover over the connections to view the traffic metrics between the two services.
Click Display Settings , you can configure the display elements in the service map.
In the Service Map, there can be nodes that are not part of the cluster. These external nodes can be categorized into three types:
Database
Message Queue
Virtual Node
If a service makes a request to a Database or Message Queue, these two types of nodes will be displayed by default in the topology map. However, Virtual Nodes represent nodes outside the cluster or services not integrated into the trace, and they will not be displayed by default in the map.
When a service makes a request to MySQL, PostgreSQL, or Oracle Database, the detailed database type can be seen in the map.
TraceID: Used to identify a complete request call trace.
Operation: Describes the specific operation or event represented by a Span.
Entry Span: The entry Span represents the first request of the entire call.
Latency: The duration from receiving the request to completing the response for the entire call trace.
Span: The number of Spans included in the entire trace.
Start Time: The time when the current trace starts.
Tag: A collection of key-value pairs that constitute Span tags. Tags are used to annotate and supplement Spans, and each Span can have multiple key-value tag pairs.
Click the icon on the right side of the trace data to search for associated logs.
By default, it queries the log data within the duration of the trace and one minute after its completion.
The queried logs include those with the trace's TraceID in their log text and container logs related to the trace invocation process.
Click View More to jump to the Associated Log page with conditions.
By default, all logs are searched, but you can filter by the TraceID or the relevant container logs from the trace call process using the dropdown.
Note
Since a trace may span multiple clusters or namespaces, users without sufficient permissions will be unable to query the associated logs for that trace.
"},{"location":"en/admin/k8s/add-node.html#steps-to-add-nodes","title":"Steps to Add Nodes","text":"
Log in to the AI platform as an administrator.
Navigate to Container Management -> Clusters, and click the name of the target cluster.
On the cluster overview page, click Node Management, and then click the Add Node button on the right side.
Follow the wizard, fill in the required parameters, and then click OK.
Basic Information / Parameter Configuration
Click OK in the popup window.
Return to the node list. The status of the newly added node will be Pending. After a few minutes, if the status changes to Running, it indicates that the node has been successfully added.
Tip
For nodes that have just been successfully added, it may take an additional 2-3 minutes for the GPU to be recognized.
"},{"location":"en/admin/k8s/create-k8s.html","title":"Creating a Kubernetes Cluster on the Cloud","text":"
Deploying a Kubernetes cluster is aimed at supporting efficient AI computing resource scheduling and management, achieving elastic scalability, providing high availability, and optimizing the model training and inference processes.
After configuring the node information, click Start Check.
By default, each node can run 110 Pods (container groups). If the node has higher specifications, this can be adjusted to 200 or 300 Pods.
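For reference, this per-node Pod capacity corresponds to the kubelet maxPods setting. The minimal sketch below is illustrative only; the platform normally configures this for you during cluster creation:
apiVersion: kubelet.config.k8s.io/v1beta1\nkind: KubeletConfiguration\nmaxPods: 200\n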
Wait for the cluster creation to complete.
In the cluster list, find the newly created cluster, click the cluster name, navigate to Helm Apps -> Helm Charts, and search for metax-gpu-extensions in the search box, then click the card.
Click the Install button on the right to begin installing the GPU plugin.
The cost of GPU resources is relatively high. If you temporarily do not need a GPU, you can remove the worker nodes with GPUs. The following steps are also applicable for removing regular worker nodes.
Navigate to Container Management -> Clusters, and click the name of the target cluster.
On the cluster overview page, click Nodes, find the node you want to remove, click the ┇ on the right side of the list, and select Remove Node from the pop-up menu.
In the pop-up window, enter the node name, and after confirming it is correct, click Delete.
You will automatically return to the node list, and the status will be Removing. After a few minutes, refresh the page; if the node is no longer there, it indicates that the node has been successfully removed.
After the node is removed from the UI list, log in to the host of the removed node via SSH and run the shutdown command to power it off.
Tip
After removing the node from the UI and shutting it down, the data on the node is not immediately deleted; the node's data will be retained for a period of time.
"},{"location":"en/admin/kpanda/backup/index.html","title":"Backup and Restore","text":"
Backup and restore are essential aspects of system management. In practice, it is important to first back up the data of the system at a specific point in time and securely store the backup. In case of incidents such as data corruption, loss, or accidental deletion, the system can be quickly restored based on the previous backup data, reducing downtime and minimizing losses.
In real production environments, services may be deployed across different clouds, regions, or availability zones. If one infrastructure faces a failure, organizations need to quickly restore applications in other available environments. In such cases, cross-cloud or cross-cluster backup and restore become crucial.
Large-scale systems often involve multiple roles and users with complex permission management systems. With many operators involved, accidents caused by human error can lead to system failures. In such scenarios, the ability to roll back the system quickly using previously backed-up data is necessary. Relying solely on manual troubleshooting, fault repair, and system recovery can be time-consuming, resulting in prolonged system unavailability and increased losses for organizations.
Additionally, factors like network attacks, natural disasters, and equipment malfunctions can also cause data accidents.
Therefore, backup and restore are vital as the last line of defense for maintaining system stability and ensuring data security.
Backups are typically classified into three types: full backups, incremental backups, and differential backups. Currently, AI platform supports full backups and incremental backups.
The backup and restore provided by AI platform can be divided into two categories: Application Backup and ETCD Backup. It supports both manual backups and scheduled automatic backups using CronJobs.
Application Backup
Application backup refers to backing up data of a specific workload in the cluster and then restoring that data either within the same cluster or in another cluster. It supports backing up all resources under a namespace or filtering resources by specific labels.
Application backup also supports cross-cluster backup of stateful applications. For detailed steps, refer to the Backup and Restore MySQL Applications and Data Across Clusters guide.
etcd Backup
etcd is the data storage component of Kubernetes. Kubernetes stores its own component's data and application data in etcd. Therefore, backing up etcd is equivalent to backing up the entire cluster's data, allowing quick restoration of the cluster to a previous state in case of failures.
It's worth noting that currently, restoring etcd backup data is only supported within the same cluster (the original cluster). To learn more about related best practices, refer to the ETCD Backup and Restore guide.
This article explains how to back up applications in AI platform. The demo application used in this tutorial is called dao-2048, which is a deployment.
Before backing up a deployment, the following prerequisites must be met:
Integrate a Kubernetes cluster or create one in the Container Management module, and make sure you can access the cluster's UI.
Create a Namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
Install the velero component and ensure it is running properly.
Create a deployment (the workload in this tutorial is named dao-2048 ), and label the deployment with app: dao-2048 .
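If the label was not set in the workload manifest, a hedged example of adding it afterwards with kubectl:
kubectl label deployment dao-2048 app=dao-2048\n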
Follow the steps below to backup the deployment dao-2048 .
Enter the Container Management module, click Backup Recovery -> Application Backup on the left navigation bar, and enter the Application Backup list page.
On the Application Backup list page, select the cluster where velero and the dao-2048 application have been installed. Click Backup Plan in the upper right corner to create a new backup plan.
Refer to the instructions below to fill in the backup configuration.
Name: The name of the new backup plan.
Source Cluster: The cluster where the application backup plan is to be executed.
Object Storage Location: The access path of the object storage configured when installing velero on the source cluster.
Namespace: The namespaces that need to be backed up, multiple selections are supported.
Advanced Configuration: Use resource labels to back up only specific resources in the namespace (such as a particular application), or to exclude specific resources from the backup.
Refer to the instructions below to set the backup execution frequency, and then click Next .
Backup Frequency: Set the execution schedule in minutes, hours, days, weeks, or months. Custom Cron expressions using numbers and * are supported; after entering an expression, its meaning is displayed. For detailed expression syntax rules, refer to Cron Schedule Syntax.
Retention Time (days): Set how long backup resources are stored. The default is 30 days; backups are deleted after expiration.
Backup Data Volume (PV): Whether to back up the data in the data volume (PV). Direct replication and CSI snapshots are supported.
Direct Replication: Directly copy the data in the data volume (PV) for backup;
Use CSI snapshots: Use CSI snapshots to back up data volumes (PVs). This requires CSI snapshot support to be available in the cluster.
Click OK , the page will automatically return to the application backup plan list, find the newly created dao-2048 backup plan, and perform the Immediate Execution operation.
At this point, the Last Execution State of the backup plan will change to in progress. After the backup is complete, you can click the name of the backup plan to view its details.
etcd backup takes the cluster data as its core. In cases such as hardware damage or misconfiguration during development and testing, cluster data can be restored through an etcd backup.
This section introduces how to perform etcd backups for clusters. Also see etcd Backup and Restore Best Practices.
Go to the Container Management -> Backup Recovery -> etcd Backup page to see all current backup policies, then click Create Backup Policy on the right.
Fill in the Basic Information. Then, click Next to automatically verify the connectivity of etcd. If the verification passes, proceed to the next step.
First select the cluster to back up and log in to its terminal.
Enter the etcd access address in the format https://${NodeIP}:${Port}.
In a standard Kubernetes cluster, the default port for etcd is 2379.
In a Suanova 4.0 cluster, the default port for etcd is 12379.
In a public cloud managed cluster, you need to contact the relevant developers to obtain the etcd port number. This is because the control plane components of public cloud clusters are maintained and managed by the cloud service provider. Users cannot directly access or view these components, nor can they obtain control plane port information through regular commands (such as kubectl).
Ways to obtain port number
Find the etcd Pod in the kube-system namespace
kubectl get po -n kube-system | grep etcd\n
Get the port number from the listen-client-urls of the etcd Pod
kubectl get po -n kube-system ${etcd_pod_name} -oyaml | grep listen-client-urls # (1)!\n
Replace etcd_pod_name with the actual Pod name
The expected output is as follows, where the number after the node IP is the port number:
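For illustration only (the node IP below is a placeholder):
- --listen-client-urls=https://127.0.0.1:2379,https://10.6.212.10:2379\n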
Fill in the CA certificate. You can use the following command to view the certificate content, then copy and paste it into the proper location:
Standard Kubernetes ClusterSuanova 4.0 Cluster
cat /etc/kubernetes/ssl/etcd/ca.crt\n
cat /etc/daocloud/dce/certs/ca.crt\n
Fill in the Cert certificate. You can use the following command to view the content of the certificate, then copy and paste it into the proper location:
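A hedged example for a standard Kubernetes cluster, assuming the apiserver-etcd-client certificate path used elsewhere in this document:
cat /etc/kubernetes/ssl/apiserver-etcd-client.crt\n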
Click How to get below the input box to see how to obtain the proper information on the UI page.
Refer to the following information to fill in the Backup Policy.
Backup Method: Choose either manual backup or scheduled backup
Manual Backup: Immediately perform a full backup of etcd data based on the backup configuration.
Scheduled Backup: Periodically perform full backups of etcd data according to the set backup frequency.
Backup Chain Length: The maximum number of backups to retain. The default is 30.
Backup Frequency: Hourly, daily, weekly, or monthly schedules are supported, and a custom schedule can also be set.
Refer to the following information to fill in the Storage Path.
Storage Provider: Default is S3 storage
Object Storage Access Address: The access address of MinIO
Bucket: Create a Bucket in MinIO and fill in the Bucket name
Username: The login username for MinIO
Password: The login password for MinIO
After clicking OK , the page will automatically redirect to the backup policy list, where you can view all the currently created ones.
Click the ┇ action button on the right side of the policy to view logs, view YAML, update the policy, stop the policy, or execute the policy immediately.
When the backup method is manual, you can click Execute Now to perform the backup.
When the backup method is scheduled, the backup will be performed according to the configured time.
Click Logs to view the log content. By default, 100 lines are displayed. If you want to see more log information or download the logs, you can follow the prompts above the logs to go to the observability module.
Go to Container Management -> Backup Recovery -> etcd Backup, and click the Recovery Point tab.
After selecting the target cluster, you can view all the backup information under that cluster.
Each time a backup is executed, a proper recovery point is generated, which can be used to quickly restore the application from a successful recovery point.
"},{"location":"en/admin/kpanda/backup/install-velero.html","title":"Install the Velero Plugin","text":"
velero is an open source tool for backing up and restoring Kubernetes cluster resources. It can back up resources in a Kubernetes cluster to cloud storage services, local storage, or other locations, and restore those resources to the same or a different cluster when needed.
This section introduces how to deploy the Velero plugin in AI platform using the Helm Apps.
Please perform the following steps to install the velero plugin for your cluster.
On the cluster list page, find the target cluster on which the velero plugin needs to be installed, click the cluster name, click Helm Apps -> Helm chart in the left navigation bar, and enter velero in the search bar to search.
Read the introduction of the velero plugin, select a version, and click the Install button. This page uses version 5.2.0 as an example; installing version 5.2.0 or later is recommended.
Configure basic info .
Name: Enter the plugin name. The name can be up to 63 characters, may only contain lowercase letters, numbers, and hyphens (\"-\"), and must start and end with a lowercase letter or number, for example metrics-server-01.
Namespace: Select the namespace for plugin installation; it must be the velero namespace.
Version: The version of the plugin, here we take 5.2.0 version as an example.
Wait: When enabled, it will wait for all associated resources under the application to be ready before marking the application installation as successful.
Deletion Failed: When enabled, sync is enabled by default together with ready wait. If the installation fails, the installation-related resources will be removed.
Detailed Logs: Turn on the verbose output of the installation process log.
Note
After enabling Ready Wait and/or Failed Delete , it takes a long time for the app to be marked as Running .
Configure Velero chart Parameter Settings according to the following instructions
S3 Credentials: Configure the authentication information of object storage (minio).
Use secret: Keep the default configuration true.
Secret name: Keep the default configuration velero-s3-credential.
SecretContents.aws_access_key_id: Configure the username for accessing the object storage; replace it with the actual value.
SecretContents.aws_secret_access_key: Configure the password for accessing the object storage; replace it with the actual value.
Use existing secret parameter example is as follows:
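The original example is not reproduced here; a hedged sketch of the credentials file content such a secret typically carries (the values are placeholders for your object storage username and password):
[default]\naws_access_key_id = <object-storage-username>\naws_secret_access_key = <object-storage-password>\n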
BackupStorageLocation: The location where Velero backs up data.
S3 bucket: The name of the storage bucket used to save backup data (must be a real storage bucket that already exists in minio).
Is default BackupStorage: Keep the default configuration true.
S3 access mode: Velero's access mode to the data, which can be one of the following:
ReadWrite: Allow Velero to read and write backup data;
ReadOnly: Allow Velero to read backup data but not modify it;
WriteOnly: Allow Velero only to write backup data, not read it.
S3 Configs: Detailed configuration of S3 storage (minio).
S3 region: The geographical region of cloud storage. The default is to use the us-east-1 parameter, which is provided by the system administrator.
S3 force path style: Keep the default configuration true.
S3 server URL: The console access address of object storage (minio). Minio generally provides two services, UI access and console access. Please use the console access address here.
Click the OK button to complete the installation of the Velero plugin. The system will automatically jump to the Helm Apps list page. After waiting for a few minutes, refresh the page, and you can see the application just installed.
"},{"location":"en/admin/kpanda/best-practice/add-master-node.html","title":"Scaling Controller Nodes in a Worker Cluster","text":"
This article provides a step-by-step guide on how to manually scale the control nodes in a worker cluster to achieve high availability for self-built clusters.
Note
It is recommended to enable high availability mode when creating the worker cluster in the interface. Manually scaling the control nodes of the worker cluster involves certain operational risks, so please proceed with caution.
A worker cluster has been created using the AI platform. You can refer to the documentation on Creating a Worker Cluster.
The managed cluster associated with the worker cluster exists in the current platform and is running normally.
Note
Managed cluster refers to the cluster specified during the creation of the worker cluster, which provides capabilities such as Kubernetes version upgrades, node scaling, uninstallation, and operation records for the current cluster.
"},{"location":"en/admin/kpanda/best-practice/add-master-node.html#modify-the-host-manifest","title":"Modify the Host manifest","text":"
Log in to the container management platform and go to the overview page of the cluster where you want to scale the control nodes. In the Basic Information section, locate the Managed Cluster of the current cluster and click its name to enter the overview page.
In the overview page of the managed cluster, click Console to open the cloud terminal console. Run the following command to find the host manifest of the worker cluster that needs to be scaled.
kubectl get cm -n kubean-system ${ClusterName}-hosts-conf -oyaml\n
${ClusterName} is the name of the worker cluster to be scaled.
Modify the host manifest file based on the example below and add information for the controller nodes.
all.hosts.node1: Existing master node in the original cluster
all.hosts.node2, all.hosts.node3: Control nodes to be added during cluster scaling
all.children.kube_control_plane.hosts: Control plane group in the cluster
all.children.kube_node.hosts: Worker node group in the cluster
all.children.etcd.hosts: ETCD node group in the cluster
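A hedged sketch of what the modified hosts manifest can look like; the node IPs, passwords, and exact structure are assumptions based on the standard kubean/kubespray inventory layout and should be adapted to your environment:
apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: ${ClusterName}-hosts-conf\n  namespace: kubean-system\ndata:\n  hosts.yml: |\n    all:\n      hosts:\n        node1:\n          ip: 10.6.175.10\n          access_ip: 10.6.175.10\n          ansible_host: 10.6.175.10\n          ansible_connection: ssh\n          ansible_user: root\n          ansible_password: <password>\n        node2:\n          ip: 10.6.175.20\n          access_ip: 10.6.175.20\n          ansible_host: 10.6.175.20\n          ansible_connection: ssh\n          ansible_user: root\n          ansible_password: <password>\n        node3:\n          ip: 10.6.175.30\n          access_ip: 10.6.175.30\n          ansible_host: 10.6.175.30\n          ansible_connection: ssh\n          ansible_user: root\n          ansible_password: <password>\n      children:\n        kube_control_plane:\n          hosts:\n            node1:\n            node2:\n            node3:\n        kube_node:\n          hosts:\n            node1:\n            node2:\n            node3:\n        etcd:\n          hosts:\n            node1:\n            node2:\n            node3:\n        k8s_cluster:\n          children:\n            kube_control_plane:\n            kube_node:\n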
"},{"location":"en/admin/kpanda/best-practice/add-master-node.html#add-expansion-task-scale-master-node-opsyaml-using-the-clusteroperationyml-template","title":"Add Expansion Task \"scale-master-node-ops.yaml\" using the ClusterOperation.yml Template","text":"
Use the following ClusterOperation.yml template to add a cluster control node expansion task called \"scale-master-node-ops.yaml\".
ClusterOperation.yml
apiVersion: kubean.io/v1alpha1\nkind: ClusterOperation\nmetadata:\n name: cluster1-online-install-ops\nspec:\n cluster: ${cluster-name} # Specify cluster name\n image: ghcr.m.daocloud.io/kubean-io/spray-job:v0.18.0 # Specify the image for the kubean job\n actionType: playbook\n action: cluster.yml\n extraArgs: --limit=etcd,kube_control_plane -e ignore_assert_errors=yes\n preHook:\n - actionType: playbook\n action: ping.yml\n - actionType: playbook\n action: disable-firewalld.yml\n - actionType: playbook\n action: enable-repo.yml # In an offline environment, you need to add this yaml and\n # set the correct repo-list (for installing operating system packages).\n # The following parameter values are for reference only.\n extraArgs: |\n -e \"{repo_list: ['http://172.30.41.0:9000/kubean/centos/\\$releasever/os/\\$basearch','http://172.30.41.0:9000/kubean/centos-iso/\\$releasever/os/\\$basearch']}\"\n postHook:\n - actionType: playbook\n action: upgrade-cluster.yml\n extraArgs: --limit=etcd,kube_control_plane -e ignore_assert_errors=yes\n - actionType: playbook\n action: kubeconfig.yml\n - actionType: playbook\n action: cluster-info.yml\n
Note
spec.image: The image address should be consistent with the image used in the previously deployed job.
spec.action: Set to cluster.yml. If three or more master (etcd) nodes are added at once, add the extra parameter -e etcd_retries=10 to increase the etcd node join retries.
spec.extraArgs: Set to --limit=etcd,kube_control_plane -e ignore_assert_errors=yes.
In an offline environment, spec.preHook must include enable-repo.yml, and its extraArgs must specify the correct repo_list for the relevant OS.
spec.postHook.action: Should include upgrade-cluster.yml, with extraArgs set to --limit=etcd,kube_control_plane -e ignore_assert_errors=yes.
Create and deploy scale-master-node-ops.yaml based on the above configuration.
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html","title":"Scaling the Worker Nodes of the Global Service Cluster","text":"
This page introduces how to manually scale the worker nodes of the global service cluster in offline mode. By default, it is not recommended to scale the global service cluster after deploying AI platform. Please ensure proper resource planning before deploying AI platform.
Note
The controller nodes of the global service cluster do not support scaling.
The AI platform deployment has been completed through the bootstrap node, and the kind cluster on the bootstrap node is running normally.
You must log in with a user account that has admin privileges on the platform.
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#get-kubeconfig-for-the-kind-cluster-on-the-bootstrap-node","title":"Get kubeconfig for the kind cluster on the bootstrap node","text":"
Run the following command to log in to the bootstrap node:
ssh root@bootstrap-node-ip-address\n
On the bootstrap node, run the following command to get the CONTAINER ID of the kind cluster:
[root@localhost ~]# podman ps\n\n# Expected output:\nCONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\n220d662b1b6a docker.m.daocloud.io/kindest/node:v1.26.2 2 weeks ago Up 2 weeks 0.0.0.0:443->30443/tcp, 0.0.0.0:8081->30081/tcp, 0.0.0.0:9000-9001->32000-32001/tcp, 0.0.0.0:36674->6443/tcp my-cluster-installer-control-plane\n
Run the following command to enter a container in the kind cluster:
podman exec -it {CONTAINER ID} bash\n
Replace {CONTAINER ID} with your actual container ID.
Inside the container of the kind cluster, run the following command to get the kubeconfig information for the kind cluster:
kubectl config view --minify --flatten --raw\n
After the console output, copy the kubeconfig information of the kind cluster for the next step.
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#create-clusterkubeanio-resources-in-the-kind-cluster-on-the-bootstrap-node","title":"Create cluster.kubean.io resources in the kind cluster on the bootstrap node","text":"
Use the command podman exec -it {CONTAINER ID} bash to enter the kind cluster container.
Inside the kind cluster container, run the following command to get the kind cluster name:
kubectl get clusters\n
Copy and run the following command within the kind cluster to create the cluster.kubean.io resource:
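A hedged sketch of what this command can look like, based on the kubean Cluster API; the referenced ConfigMap and Secret names are assumptions derived from the my-cluster default described below:
kubectl apply -f - <<EOF\napiVersion: kubean.io/v1alpha1\nkind: Cluster\nmetadata:\n  name: kpanda-global-cluster\nspec:\n  hostsConfRef:\n    namespace: kubean-system\n    name: my-cluster-hosts-conf\n  kubeconfRef:\n    namespace: kubean-system\n    name: my-cluster-kubeconf\n  varsConfRef:\n    namespace: kubean-system\n    name: my-cluster-vars-conf\nEOF\n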
The default cluster name for spec.hostsConfRef.name, spec.kubeconfRef.name, and spec.varsConfRef.name is my-cluster. Please replace it with the kind cluster name obtained in the previous step.
Run the following command in the kind cluster to verify if the cluster.kubean.io resource is created successfully:
kubectl get clusters\n
Expected output is:
NAME AGE\nkpanda-global-cluster 3s\nmy-cluster 16d\n
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#update-the-containerd-configuration-in-the-kind-cluster-on-the-bootstrap-node","title":"Update the containerd configuration in the kind cluster on the bootstrap node","text":"
Run the following command to log in to one of the controller nodes of the global service cluster:
On the global service cluster controller node, run the following command to copy the containerd configuration file config.toml from the controller node to the bootstrap node:
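A hedged example of the copy step (the bootstrap node IP is a placeholder):
scp /etc/containerd/config.toml root@<bootstrap-node-ip>:/root/config.toml\n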
On the bootstrap node, select the insecure registry section from the containerd configuration file config.toml that was copied from the controller node, and add it to the config.toml in the kind cluster.
An example of the insecure registry section is as follows:
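The original snippet is not reproduced here; a hedged example of what such a section can look like in containerd's config.toml (the registry address is a placeholder):
[plugins.\"io.containerd.grpc.v1.cri\".registry]\n  [plugins.\"io.containerd.grpc.v1.cri\".registry.mirrors]\n    [plugins.\"io.containerd.grpc.v1.cri\".registry.mirrors.\"10.6.175.10:8443\"]\n      endpoint = [\"https://10.6.175.10:8443\"]\n  [plugins.\"io.containerd.grpc.v1.cri\".registry.configs.\"10.6.175.10:8443\".tls]\n    insecure_skip_verify = true\n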
Since the config.toml file in the kind cluster cannot be modified directly, you can first copy the file out to modify it and then copy it back to the kind cluster. The steps are as follows:
Run the following command on the bootstrap node to copy the file out:
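A hedged sketch of copying the file out and back:
podman cp {CONTAINER ID}:/etc/containerd/config.toml ./config.toml\n# edit ./config.toml, then copy it back into the kind cluster:\npodman cp ./config.toml {CONTAINER ID}:/etc/containerd/config.toml\n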
{CONTAINER ID} should be replaced with your actual container ID.
Run the following command within the kind cluster to restart the containerd service:
systemctl restart containerd\n
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#integrate-a-kind-cluster-into-the-ai-platform-cluster-list","title":"Integrate a Kind cluster into the AI platform cluster list","text":"
Log in to AI platform, navigate to Container Management, and on the right side of the cluster list, click the Integrate Cluster button.
In the integration configuration section, fill in and edit the kubeconfig of the Kind cluster.
Skip TLS verification; this line needs to be added manually.
Replace it with the IP of the Kind node, and change port 6443 to the port mapped to the node (you can run the command podman ps|grep 6443 to check the mapped port).
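A hedged sketch of the relevant part of the edited kubeconfig; the IP is a placeholder, and the port 36674 reuses the mapping shown in the earlier podman ps example:
clusters:\n- cluster:\n    insecure-skip-tls-verify: true # added manually to skip TLS verification\n    server: https://10.6.175.10:36674 # kind node IP and the port mapped to 6443\n  name: my-cluster\n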
Click the OK to complete the integration of the Kind cluster.
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#add-labels-to-the-global-service-cluster","title":"Add Labels to the Global Service Cluster","text":"
Log in to AI platform, navigate to Container Management, find the kpanda-global-cluster, and in the menu on the right side, select Basic Configuration.
In the Basic Configuration page, add the label kpanda.io/managed-by=my-cluster for the global service cluster:
Note
The value in the label kpanda.io/managed-by=my-cluster corresponds to the name of the cluster specified during the integration process, which defaults to my-cluster. Please adjust this according to your actual situation.
"},{"location":"en/admin/kpanda/best-practice/add-worker-node-on-global.html#add-nodes-to-the-global-service-cluster","title":"Add nodes to the global service cluster","text":"
Go to the node list page of the global service cluster, find the Integrate Node button on the right side of the node list, and click to enter the node configuration page.
After filling in the IP and authentication information of the node to be integrated, click Start Check . Once the node check is completed, click Next .
Add the following custom parameters in the Custom Parameters section:
Click the OK button and wait for the node to be added.
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html","title":"Cross-Cluster Backup and Recovery of MySQL Application and Data","text":"
This demonstration will show how to use the application backup feature in AI platform to perform cross-cluster backup migration for a stateful application.
Note
The current operator should have admin privileges on the AI platform.
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#prepare-the-demonstration-environment","title":"Prepare the Demonstration Environment","text":""},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#prepare-two-clusters","title":"Prepare Two Clusters","text":"
main-cluster will be the source cluster for backup data, and recovery-cluster will be the target cluster for data recovery.
Cluster IP Nodes main-cluster 10.6.175.100 1 node recovery-cluster 10.6.175.110 1 node"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#set-up-minio-configuration","title":"Set Up MinIO Configuration","text":"MinIO Server Address Bucket Username Password http://10.7.209.110:9000 mysql-demo root dangerous"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#deploy-nfs-storage-service-in-both-clusters","title":"Deploy NFS Storage Service in Both Clusters","text":"
Note
NFS storage service needs to be deployed on all nodes in both the source and target clusters.
Install the dependencies required for NFS on all nodes in both clusters.
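The exact command depends on the node operating system; a hedged example:
yum install -y nfs-utils    # CentOS / RHEL\n# or, on Ubuntu / Debian:\napt-get install -y nfs-common\n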
Prepare NFS storage service for the MySQL application.
Log in to a control node of each of main-cluster and recovery-cluster. On each node, use the command vi nfs.yaml to create a file named nfs.yaml, and copy the following YAML content into it.
Run the nfs.yaml file on the control nodes of both clusters.
kubectl apply -f nfs.yaml\n
Check the status of the NFS Pod and wait for its status to become running (approximately 2 minutes).
kubectl get pod -n nfs-system -owide\n
Expected output
[root@g-master1 ~]# kubectl get pod -owide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\nnfs-provisioner-7dfb9bcc45-74ws2 1/1 Running 0 4m45s 10.6.175.100 g-master1 <none> <none>\n
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#deploy-mysql-application","title":"Deploy MySQL Application","text":"
Prepare a PVC (Persistent Volume Claim) based on NFS storage for the MySQL application to store its data.
Use the command vi pvc.yaml to create a file named pvc.yaml on the node, and copy the following YAML content into the pvc.yaml file.
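The original YAML is not reproduced here; a hedged sketch of a pvc.yaml that fits this walkthrough. The PVC name mydata matches the labeling step later in this guide, while the storage class name and size are assumptions:
apiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n  name: mydata\nspec:\n  accessModes:\n    - ReadWriteOnce\n  storageClassName: nfs # assumed to match the storage class created by nfs.yaml\n  resources:\n    requests:\n      storage: 1Gi\n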
Run kubectl get pod | grep mysql to view the status of the MySQL Pod and wait for its status to become running (approximately 2 minutes).
Expected output
[root@g-master1 ~]# kubectl get pod |grep mysql\nmysql-deploy-5d6f94cb5c-gkrks 1/1 Running 0 2m53s\n
Note
If the MySQL Pod remains in a non-running state for a long time, it is usually because NFS dependencies are not installed on all nodes in the cluster.
Run kubectl describe pod ${mysql pod name} to view detailed information about the Pod.
If there is an error message like MountVolume.SetUp failed for volume \"pvc-4ad70cc6-df37-4253-b0c9-8cb86518ccf8\" : mount failed: exit status 32 , please delete the previous resources by executing kubectl delete -f nfs.yaml/pvc.yaml/mysql.yaml and start from deploying the NFS service again.
Write data to the MySQL application.
To verify the success of the data migration later, you can use a script to write test data to the MySQL application.
Use the command vi insert.sh to create a script named insert.sh on the node, and copy the following content into the script.
insert.sh
#!/bin/bash\n\nfunction rand(){\n min=$1\n max=$(($2-$min+1))\n num=$(date +%s%N)\n echo $(($num%$max+$min))\n}\n\nfunction insert(){\n user=$(date +%s%N | md5sum | cut -c 1-9)\n age=$(rand 1 100)\n\n sql=\"INSERT INTO test.users(user_name, age)VALUES('${user}', ${age});\"\n echo -e ${sql}\n\n kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"${sql}\"\n\n}\n\nkubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"CREATE DATABASE IF NOT EXISTS test;\"\nkubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"CREATE TABLE IF NOT EXISTS test.users(user_name VARCHAR(10) NOT NULL,age INT UNSIGNED)ENGINE=InnoDB DEFAULT CHARSET=utf8;\"\n\nwhile true;do\n insert\n sleep 1\ndone\n
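Run the script, for example:
bash insert.sh\n
Output similar to the following indicates that test data is being written: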
mysql: [Warning] Using a password on the command line interface can be insecure.\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('dc09195ba', 10);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('80ab6aa28', 70);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('f488e3d46', 23);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('e6098695c', 93);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('eda563e7d', 63);\nmysql: [Warning] Using a password on the command line interface can be insecure.\nINSERT INTO test.users(user_name, age)VALUES('a4d1b8d68', 17);\nmysql: [Warning] Using a password on the command line interface can be insecure.\n
Press Ctrl + C on the keyboard simultaneously to pause the script execution.
Go to the MySQL Pod and check the data written in MySQL.
kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"SELECT * FROM test.users;\"\n
Expected output
[root@g-master1 ~]# kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"SELECT * FROM test.users;\"\nmysql: [Warning] Using a password on the command line interface can be insecure.\nuser_name age\ndc09195ba 10\n80ab6aa28 70\nf488e3d46 23\ne6098695c 93\neda563e7d 63\na4d1b8d68 17\nea47546d9 86\na34311f2e 47\n740cefe17 33\nede85ea28 65\nb6d0d6a0e 46\nf0eb38e50 44\nc9d2f28f5 72\n8ddaafc6f 31\n3ae078d0e 23\n6e041631e 96\n
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#install-velero-plugin-on-both-clusters","title":"Install Velero Plugin on Both Clusters","text":"
Note
The velero plugin needs to be installed on both the source and target clusters.
Refer to the Install Velero Plugin documentation and the MinIO configuration below to install the velero plugin on the main-cluster and recovery-cluster .
MinIO Server Address Bucket Username Password http://10.7.209.110:9000 mysql-demo root dangerous
Note
When installing the plugin, replace S3url with the MinIO server address prepared for this demonstration, and replace the bucket with an existing bucket in MinIO.
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#backup-mysql-application-and-data","title":"Backup MySQL Application and Data","text":"
Add a unique label, backup=mysql , to the MySQL application and PVC data. This will facilitate resource selection during backup.
kubectl label deploy mysql-deploy backup=mysql # Add label to mysql-deploy\nkubectl label pod mysql-deploy-5d6f94cb5c-gkrks backup=mysql # Add label to mysql pod\nkubectl label pvc mydata backup=mysql # Add label to mysql pvc\n
Refer to the steps described in Application Backup and the parameters below to create an application backup.
After creating the backup plan, the page will automatically return to the backup plan list. Find the newly created backup plan backup-mysql and click the more options button ... on the right of the plan. Select \"Run Now\" to execute the newly created backup plan.
Wait for the backup plan execution to complete before proceeding with the next steps.
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#cross-cluster-recovery-of-mysql-application-and-data","title":"Cross-Cluster Recovery of MySQL Application and Data","text":"
Log in to the AI platform and select Container Management -> Backup & Restore -> Application Backup from the left navigation menu.
Select Recovery in the left-side toolbar, then click Restore Backup on the right side.
Fill in the parameters based on the following instructions:
Name: restore-mysql (can be customized)
Backup Source Cluster: main-cluster
Backup Plan: backup-mysql
Backup Point: default
Recovery Target Cluster: recovery-cluster
Refresh the backup plan list and wait for the backup plan execution to complete.
"},{"location":"en/admin/kpanda/best-practice/backup-mysql-on-nfs.html#check-if-the-data-is-restored-successfully","title":"Check if the data is restored successfully","text":"
Log in to the control plane of recovery-cluster and check whether mysql-deploy has been successfully restored in the current cluster.
kubectl get pod\n
Expected output is as follows:
NAME READY STATUS RESTARTS AGE\nmysql-deploy-5798f5d4b8-62k6c 1/1 Running 0 24h\n
Check whether the data in the MySQL table has been restored.
kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"SELECT * FROM test.users;\"\n
Expected output is as follows:
[root@g-master1 ~]# kubectl exec deploy/mysql-deploy -- mysql -uroot -pdangerous -e \"SELECT * FROM test.users;\"\nmysql: [Warning] Using a password on the command line interface can be insecure.\nuser_name age\ndc09195ba 10\n80ab6aa28 70\nf488e3d46 23\ne6098695c 93\neda563e7d 63\na4d1b8d68 17\nea47546d9 86\na34311f2e 47\n740cefe17 33\nede85ea28 65\nb6d0d6a0e 46\nf0eb38e50 44\nc9d2f28f5 72\n8ddaafc6f 31\n3ae078d0e 23\n6e041631e 96\n
Success
As you can see, the data in the Pod is consistent with the data inside the Pods in the main-cluster . This indicates that the MySQL application and its data from the main-cluster have been successfully recovered to the recovery-cluster cluster.
"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html","title":"Create a RedHat 9.2 Worker Cluster on a CentOS Management Platform","text":"
This article explains how to create a RedHat 9.2 worker cluster on an existing CentOS management platform.
Note
This article only applies to offline mode, using the AI platform to create a worker cluster. The architectures of the management platform and of the cluster to be created are both AMD. Heterogeneous deployment (mixing AMD and ARM) is not supported when creating a cluster. After the cluster is created, you can achieve mixed deployment and management by connecting heterogeneous nodes.
AI platform has been deployed in full mode, and the spark node is still alive. For deployment, see the document Offline Install AI platform Enterprise.
"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#download-and-import-redhat-offline-packages","title":"Download and Import RedHat Offline Packages","text":"
Make sure you are logged in to the spark node and that the clusterConfig.yaml file used when deploying AI platform is still available.
"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#download-the-relevant-redhat-offline-packages","title":"Download the Relevant RedHat Offline Packages","text":"
Download the required RedHat OS package and ISO offline packages:
Resource Name Description Download Link os-pkgs-redhat9-v0.9.3.tar.gz RedHat 9.2 OS package Download ISO Offline Package RedHat 9.2 ISO package Go to RedHat Official Download Site import-iso ISO import script Download"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#import-the-os-package-to-the-minio-of-the-spark-node","title":"Import the OS Package to the MinIO of the Spark Node","text":"
Extract the RedHat OS package
Execute the following command to extract the downloaded RedHat OS package.
tar -xvf os-pkgs-redhat9-v0.9.3.tar.gz\n
The contents of the extracted OS package are as follows:
os-pkgs\n ├── import_ospkgs.sh # This script is used to import OS packages into the MinIO file service\n ├── os-pkgs-amd64.tar.gz # OS packages for the amd64 architecture\n ├── os-pkgs-arm64.tar.gz # OS packages for the arm64 architecture\n └── os-pkgs.sha256sum.txt # sha256sum verification file of the OS packages\n
Import the OS Package to the MinIO of the Spark Node
Execute the following command to import the OS packages to the MinIO file service:
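A hedged example of the invocation; the exact arguments of import_ospkgs.sh may differ between versions, so check the script's usage help:
MINIO_USER=rootuser MINIO_PASS=rootpass123 ./import_ospkgs.sh http://127.0.0.1:9000 os-pkgs-redhat9-v0.9.3.tar.gz\n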
The above command is only applicable to the MinIO service built into the spark node. If an external MinIO is used, replace http://127.0.0.1:9000 with the access address of the external MinIO. \"rootuser\" and \"rootpass123\" are the default account and password of the MinIO service built into the spark node. \"os-pkgs-redhat9-v0.9.3.tar.gz\" is the name of the downloaded offline OS package.
"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#import-the-iso-offline-package-to-the-minio-of-the-spark-node","title":"Import the ISO Offline Package to the MinIO of the Spark Node","text":"
Execute the following command to import the ISO package to the MinIO file service:
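A hedged example of the invocation; the exact arguments of the import-iso script may differ between versions, so check the script's usage help:
MINIO_USER=rootuser MINIO_PASS=rootpass123 ./import_iso.sh http://127.0.0.1:9000 rhel-9.2-x86_64-dvd.iso\n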
The above command is only applicable to the MinIO service built into the spark node. If an external MinIO is used, replace http://127.0.0.1:9000 with the access address of the external MinIO. \"rootuser\" and \"rootpass123\" are the default account and password of the MinIO service built into the spark node. \"rhel-9.2-x86_64-dvd.iso\" is the name of the downloaded ISO offline package.
"},{"location":"en/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.html#create-the-cluster-in-the-ui","title":"Create the Cluster in the UI","text":"
Refer to the document Creating a Worker Cluster to create a RedHat 9.2 cluster.
"},{"location":"en/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.html","title":"Create an Ubuntu Worker Cluster on CentOS","text":"
This page explains how to create an Ubuntu worker cluster on an existing CentOS management platform.
Note
This page is specifically for the offline mode, using the AI platform platform to create a worker cluster, where both the CentOS platform and the worker cluster to be created are based on AMD architecture. Heterogeneous (mixed AMD and ARM) deployments are not supported during cluster creation; however, after the cluster is created, you can manage a mixed deployment by adding heterogeneous nodes.
A fully deployed AI platform system, with the bootstrap node still active. For deployment reference, see the documentation Offline Install AI platform Enterprise.
"},{"location":"en/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.html#download-and-import-ubuntu-offline-packages","title":"Download and Import Ubuntu Offline Packages","text":"
Please ensure you are logged into the bootstrap node! Also, make sure that the clusterConfig.yaml file used during the AI platform deployment is still available.
Download the required Ubuntu OS packages and ISO offline packages:
Resource Name Description Download Link os-pkgs-ubuntu2204-v0.18.2.tar.gz Ubuntu 22.04 OS package https://github.com/kubean-io/kubean/releases/download/v0.18.2/os-pkgs-ubuntu2204-v0.18.2.tar.gz ISO Offline Package ISO Package http://mirrors.melbourne.co.uk/ubuntu-releases/"},{"location":"en/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.html#import-os-and-iso-packages-into-minio-on-the-bootstrap-node","title":"Import OS and ISO Packages into MinIO on the Bootstrap Node","text":"
Refer to the documentation Importing Offline Resources to import offline resources into MinIO on the bootstrap node.
"},{"location":"en/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.html#create-cluster-on-ui","title":"Create Cluster on UI","text":"
Refer to the documentation Creating a Worker Cluster to create the Ubuntu cluster.
"},{"location":"en/admin/kpanda/best-practice/etcd-backup.html","title":"etcd Backup and Restore","text":"
Using the ETCD backup feature to create a backup policy, you can back up the etcd data of a specified cluster to S3 storage on a scheduled basis. This page focuses on how to restore the data that has been backed up to the current cluster.
Note
AI platform ETCD backup restores are limited to backups and restores for the same cluster (with no change in the number of nodes and IP addresses). For example, after the etcd data of Cluster A is backed up, the backup data can only be restored to Cluster A, not to Cluster B.
For cross-cluster backups and restores, the application backup and restore feature is recommended.
First, create a backup policy to back up the current cluster state. For details, refer to ETCD Backup.
The following is a specific case to illustrate the whole process of backup and restore.
Begin with basic information about the target cluster and the S3 storage used for the restore. Here, MinIO is used as the S3 storage, and the cluster has three control plane nodes (three etcd replicas).
IP Host Role Remarks 10.6.212.10 host01 k8s-master01 k8s node 1 10.6.212.11 host02 k8s-master02 k8s node 2 10.6.212.12 host03 k8s-master03 k8s node 3 10.6.212.13 host04 minio minio service"},{"location":"en/admin/kpanda/best-practice/etcd-backup.html#prerequisites","title":"Prerequisites","text":""},{"location":"en/admin/kpanda/best-practice/etcd-backup.html#install-the-etcdbrctl-tool","title":"Install the etcdbrctl tool","text":"
To implement etcd data backup and restore, you need to install the etcdbrctl open source tool on any of the above Kubernetes nodes. This tool does not currently provide binary releases and needs to be compiled from source; refer to its build instructions.
After installation, use the following command to check whether the tool is available:
etcdbrctl -v\n
The expected output is as follows:
INFO[0000] etcd-backup-restore Version: v0.23.0-dev\nINFO[0000] Git SHA: b980beec\nINFO[0000] Go Version: go1.19.3\nINFO[0000] Go OS/Arch: linux/amd64\n
"},{"location":"en/admin/kpanda/best-practice/etcd-backup.html#check-the-backup-data","title":"Check the backup data","text":"
You need to check the following before restoring:
Whether the data has been successfully backed up in AI platform
Whether the backup data exists in the S3 storage
Note
The AI platform backup is a full data backup; when restoring, the full data from the most recent backup will be restored.
"},{"location":"en/admin/kpanda/best-practice/etcd-backup.html#shut-down-the-cluster","title":"Shut down the cluster","text":"
Before restoring, the cluster must be shut down. By default, etcd and kube-apiserver are started as static Pods. Shutting down the cluster here means moving the static Pod manifest files out of the /etc/kubernetes/manifests directory, after which kubelet removes the Pods and the services stop.
First, remove the data from the previous backup attempt. Removing the data here does not delete the existing etcd data; it means renaming the etcd data directory. Only delete the renamed directory after the backup has been successfully restored, so that the current cluster can still be recovered if the etcd restore fails. This step needs to be performed on each node.
rm -rf /var/lib/etcd_bak\n
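Then rename the current etcd data directory as described above, assuming the default data directory /var/lib/etcd; this also needs to be performed on each node:
mv /var/lib/etcd /var/lib/etcd_bak\n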
Next, shut down the kube-apiserver service to ensure that there are no new changes to the etcd data. This step needs to be performed on each node.
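A hedged sketch of stopping the static Pods and verifying that etcd is no longer reachable; the backup directory is an assumption, and the certificate paths follow the defaults used later in this guide:
mkdir -p /tmp/manifests-backup\nmv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/manifests-backup/\nmv /etc/kubernetes/manifests/etcd.yaml /tmp/manifests-backup/\n\n# verify that etcd is no longer reachable\netcdctl endpoint status --endpoints=controller-node-1:2379,controller-node-2:2379,controller-node-3:2379 -w table \\\n--cacert=\"/etc/kubernetes/ssl/etcd/ca.crt\" \\\n--cert=\"/etc/kubernetes/ssl/apiserver-etcd-client.crt\" \\\n--key=\"/etc/kubernetes/ssl/apiserver-etcd-client.key\"\n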
The expected output is as follows, indicating that all etcd nodes have been destroyed:
{\"level\":\"warn\",\"ts\":\"2023-03-29T17:51:50.817+0800\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.6/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001ba000/controller-node-1:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.5.14.31:2379: connect: connection refused\\\"\"}\nFailed to get the status of endpoint controller-node-1:2379 (context deadline exceeded)\n{\"level\":\"warn\",\"ts\":\"2023-03-29T17:51:55.818+0800\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.6/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001ba000/controller-node-2:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.5.14.32:2379: connect: connection refused\\\"\"}\nFailed to get the status of endpoint controller-node-2:2379 (context deadline exceeded)\n{\"level\":\"warn\",\"ts\":\"2023-03-29T17:52:00.820+0800\",\"logger\":\"etcd-client\",\"caller\":\"v3@v3.5.6/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0xc0001ba000/controller-node-1:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.5.14.33:2379: connect: connection refused\\\"\"}\nFailed to get the status of endpoint controller-node-3:2379 (context deadline exceeded)\n+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |\n+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n
"},{"location":"en/admin/kpanda/best-practice/etcd-backup.html#restore-the-backup","title":"Restore the backup","text":"
You only need to restore the data of one node, and the etcd data of other nodes will be automatically synchronized.
Set environment variables
Before restoring the data using etcdbrctl, run the following command to set the S3 connection authentication information as environment variables:
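A hedged sketch, assuming etcdbrctl's S3-compatible (ECS) storage provider and the MinIO address from the environment table above; the variable names are assumptions and should be verified against the etcdbrctl documentation:
export ECS_ENDPOINT=http://10.6.212.13:9000\nexport ECS_ACCESS_KEY_ID=<minio-username>\nexport ECS_SECRET_ACCESS_KEY=<minio-password>\n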
--data-dir: The etcd data directory. It must be consistent with the original etcd data directory so that etcd can load the data normally.
--store-container: The S3 storage location, that is, the bucket in MinIO. It must correspond to the bucket where the backup data is stored.
--initial-cluster: The initial etcd cluster configuration. The etcd member names must be the same as in the original cluster.
--initial-advertise-peer-urls: The peer access address of the etcd member. It must be consistent with the etcd configuration.
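Putting the flags above together, a hedged sketch of the restore command for node 01; the bucket name and storage provider are assumptions, while the member name and peer URL follow the example environment:
etcdbrctl restore --storage-provider=ECS --store-container=\"etcd-backup\" --data-dir=/var/lib/etcd --initial-cluster=controller-node-1=https://10.6.212.10:2380 --initial-advertise-peer-urls=https://10.6.212.10:2380\n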
The expected output is as follows:
INFO[0000] Finding latest set of snapshot to recover from...\nINFO[0000] Restoring from base snapshot: Full-00000000-00111147-1679991074 actor=restorer\nINFO[0001] successfully fetched data of base snapshot in 1.241380207 seconds actor=restorer\n{\"level\":\"info\",\"ts\":1680011221.2511616,\"caller\":\"mvcc/kvstore.go:380\",\"msg\":\"restored last compact revision\",\"meta-bucket-name\":\"meta\",\"meta-bucket-name-key\":\"finishedCompactRev\",\"restored-compact-revision\":110327}\n{\"level\":\"info\",\"ts\":1680011221.3045986,\"caller\":\"membership/cluster.go:392\",\"msg\":\"added member\",\"cluster-id\":\"66638454b9dd7b8a\",\"local-member-id\":\"0\",\"added-peer-id\":\"123c2503a378fc46\",\"added-peer-peer-urls\":[\"https://10.6.212.10:2380\"]}\nINFO[0001] Starting embedded etcd server... actor=restorer\n\n....\n\n{\"level\":\"info\",\"ts\":\"2023-03-28T13:47:02.922Z\",\"caller\":\"embed/etcd.go:565\",\"msg\":\"stopped serving peer traffic\",\"address\":\"127.0.0.1:37161\"}\n{\"level\":\"info\",\"ts\":\"2023-03-28T13:47:02.922Z\",\"caller\":\"embed/etcd.go:367\",\"msg\":\"closed etcd server\",\"name\":\"default\",\"data-dir\":\"/var/lib/etcd\",\"advertise-peer-urls\":[\"http://localhost:0\"],\"advertise-client-urls\":[\"http://localhost:0\"]}\nINFO[0003] Successfully restored the etcd data directory.\n
!!! note "You can check the YAML file of etcd for comparison to avoid configuration errors"
Then wait for the etcd service to finish starting and check the etcd status. The default directory for etcd-related certificates is /etc/kubernetes/ssl. If the cluster certificates are stored elsewhere, specify the proper path.
Check the etcd cluster list:
etcdctl member list -w table \\\n--cacert=\"/etc/kubernetes/ssl/etcd/ca.crt\" \\\n--cert=\"/etc/kubernetes/ssl/apiserver-etcd-client.crt\" \\\n--key=\"/etc/kubernetes/ssl/apiserver-etcd-client.key\" \n
The expected output is as follows:
+------------------+---------+-------------------+--------------------------+--------------------------+------------+\n| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |\n+------------------+---------+-------------------+--------------------------+--------------------------+------------+\n| 123c2503a378fc46 | started | controller-node-1 | https://10.6.212.10:2380 | https://10.6.212.10:2379 | false |\n+------------------+---------+-------------------+--------------------------+--------------------------+------------+\n
To view the status of controller-node-1:
etcdctl endpoint status --endpoints=controller-node-1:2379 -w table \\\n--cacert=\"/etc/kubernetes/ssl/etcd/ca.crt\" \\\n--cert=\"/etc/kubernetes/ssl/apiserver-etcd-client.crt\" \\\n--key=\"/etc/kubernetes/ssl/apiserver-etcd-client.key\"\n
The expected output is as follows:
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |\n+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n| controller-node-1:2379 | 123c2503a378fc46 | 3.5.6 | 15 MB | true | false | 3 | 1200 | 1199 | |\n+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+\n
Restore other node data
The above steps have restored the data of the first node. To restore the data of the other nodes, you only need to start their etcd Pods and let etcd complete the data synchronization by itself.
Data synchronization between etcd members takes some time. You can check the etcd cluster status to ensure that all etcd members are normal:
Check whether the etcd cluster status is normal:
etcdctl member list -w table \\\n--cacert=\"/etc/kubernetes/ssl/etcd/ca.crt\" \\\n--cert=\"/etc/kubernetes/ssl/apiserver-etcd-client.crt\" \\\n--key=\"/etc/kubernetes/ssl/apiserver-etcd-client.key\"\n
The expected output is as follows:
+------------------+---------+-------------------+-------------------------+-------------------------+------------+\n| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |\n+------------------+---------+-------------------+-------------------------+-------------------------+------------+\n| 6ea47110c5a87c03 | started | controller-node-1 | https://10.5.14.31:2380 | https://10.5.14.31:2379 | false |\n| e222e199f1e318c4 | started | controller-node-2 | https://10.5.14.32:2380 | https://10.5.14.32:2379 | false |\n| f64eeda321aabe2d | started | controller-node-3 | https://10.5.14.33:2380 | https://10.5.14.33:2379 | false |\n+------------------+---------+-------------------+-------------------------+-------------------------+------------+\n
Check whether the three member nodes are normal:
etcdctl endpoint status --endpoints=controller-node-1:2379,controller-node-2:2379,controller-node-3:2379 -w table \\\n--cacert=\"/etc/kubernetes/ssl/etcd/ca.crt\" \\\n--cert=\"/etc/kubernetes/ssl/apiserver-etcd-client.crt\" \\\n--key=\"/etc/kubernetes/ssl/apiserver-etcd-client.key\"\n
After kubelet starts kube-apiserver, check whether the restored k8s data is normal:
kubectl get nodes\n
The expected output is as follows:
NAME STATUS ROLES AGE VERSION\ncontroller-node-1 Ready <none> 3h30m v1.25.4\ncontroller-node-2 Ready control-plane 3h29m v1.25.4\ncontroller-node-3 Ready control-plane 3h28m v1.25.4\n
"},{"location":"en/admin/kpanda/best-practice/hardening-cluster.html","title":"How to Harden a Self-built Work Cluster","text":"
In AI platform, when running a CIS Benchmark (CIS) scan on a worker cluster created through the user interface, some scan items fail. This article provides hardening instructions based on different versions of the CIS Benchmark.
"},{"location":"en/admin/kpanda/best-practice/hardening-cluster.html#hardening-configuration-to-pass-cis-scan","title":"Hardening Configuration to Pass CIS Scan","text":"
To address these security scan issues, kubespray added default values in v2.22 that resolve some of the problems. For more details, refer to the kubespray hardening documentation.
Add parameters by modifying the kubean var-config configuration file:
In AI platform, there is also a feature to configure advanced parameters through the user interface. Add custom parameters in the last step of cluster creation:
After setting the custom parameters, the following parameters are added to the var-config configmap in kubean:
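For reference, a hedged subset of such hardening parameters might look like the following; the exact variables and values should be taken from the kubespray hardening documentation for your version:
kubernetes_audit: true\nkube_controller_manager_bind_address: 127.0.0.1\nkube_scheduler_bind_address: 127.0.0.1\nkube_read_only_port: 0\nkubelet_rotate_server_certificates: true\nkubelet_protect_kernel_defaults: true\n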
Perform a scan after installing the cluster:
After the scan, all scan items pass (WARN and INFO are counted as PASS). Note that this document only applies to CIS Benchmark 1.27, as the CIS Benchmark is continuously updated.
"},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html","title":"Deploy and Upgrade Compatible Versions of Kubean in Offline Scenarios","text":"
To meet customer demand for building Kubernetes (K8s) clusters with lower versions, Kubean provides compatibility with lower versions and can create K8s clusters with those versions.
Currently, the supported versions for self-built worker clusters range from v1.26.0 to v1.28. Refer to the AI platform Cluster Version Support System for more information.
This article will demonstrate how to deploy a K8s cluster with a lower version.
Prepare a management cluster where kubean resides, with the podman, skopeo, and minio client commands already available in the current environment. If they are not available, you can install these dependencies through a script; see Installing Prerequisite Dependencies.
Go to kubean to view the released artifacts, and choose the specific artifact version based on the actual situation. The currently supported artifact versions and their proper cluster version ranges are as follows:
| Artifact Version | Cluster Range | AI platform Support |
| --- | --- | --- |
| release-2.21 | v1.23.0 ~ v1.25.6 | Supported since installer v0.14.0 |
| release-2.22 | v1.24.0 ~ v1.26.9 | Supported since installer v0.15.0 |
| release-2.23 | v1.25.0 ~ v1.27.7 | Expected to be supported from installer v0.16.0 |
This article demonstrates the offline deployment of a K8s cluster with version 1.23.0 and the offline upgrade of a K8s cluster from version 1.23.0 to 1.24.0, so we choose the artifact release-2.21.
"},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html#prepare-the-relevant-artifacts-for-the-lower-version-of-kubespray-release","title":"Prepare the Relevant Artifacts for the Lower Version of Kubespray Release","text":"
Import the spray-job image into the registry of the offline environment.
# Assuming the registry address in the bootstrap cluster is 172.30.41.200\nREGISTRY_ADDR=\"172.30.41.200\"\n\n# The image spray-job can use the accelerator address here, and the image address is determined based on the selected artifact version\nSPRAY_IMG_ADDR=\"ghcr.m.daocloud.io/kubean-io/spray-job:2.21-d6f688f\"\n\n# skopeo parameters\nSKOPEO_PARAMS=\" --insecure-policy -a --dest-tls-verify=false --retry-times=3 \"\n\n# Online environment: Export the spray-job image of version release-2.21 and transfer it to the offline environment\nskopeo copy docker://${SPRAY_IMG_ADDR} docker-archive:spray-job-2.21.tar\n\n# Offline environment: Import the spray-job image of version release-2.21 into the bootstrap registry\nskopeo copy ${SKOPEO_PARAMS} docker-archive:spray-job-2.21.tar docker://${REGISTRY_ADDR}/${SPRAY_IMG_ADDR}\n
"},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html#create-offline-resources-for-the-earlier-versions-of-k8s","title":"Create Offline Resources for the Earlier Versions of K8s","text":"
Prepare the manifest.yml file.
cat > \"manifest.yml\" <<EOF\nimage_arch:\n - \"amd64\" ## \"arm64\"\nkube_version: ## Fill in the cluster version according to the actual scenario\n - \"v1.23.0\"\n - \"v1.24.0\"\nEOF\n
Create the offline incremental package.
# Create the data directory\nmkdir data\n# Create the offline package\nAIRGAP_IMG_ADDR=\"ghcr.m.daocloud.io/kubean-io/airgap-patch:2.21-d6f688f\" # (1)\npodman run --rm -v $(pwd)/manifest.yml:/manifest.yml -v $(pwd)/data:/data -e ZONE=CN -e MODE=FULL ${AIRGAP_IMG_ADDR}\n
The airgap-patch image can use the accelerator address here, and the image address is determined based on the selected artifact version
Import the offline images and binary packages for the proper K8s version.
# Import the binaries from the data directory to the minio in the bootstrap node\ncd ./data/amd64/files/\nMINIO_ADDR=\"http://172.30.41.200:9000\" # Replace IP with the actual repository url\nMINIO_USER=rootuser MINIO_PASS=rootpass123 ./import_files.sh ${MINIO_ADDR}\n\n# Import the images from the data directory to the image repository in the bootstrap node\ncd ./data/amd64/images/\nREGISTRY_ADDR=\"172.30.41.200\" ./import_images.sh # Replace IP with the actual repository url\n
Deploy the manifest and localartifactset.cr.yaml custom resources to the management cluster where kubean resides or the Global cluster. In this example, we use the Global cluster.
# Deploy the localArtifactSet resources in the data directory\ncd ./data\nkubectl apply -f localartifactset.cr.yaml\n\n# Download the manifest resources proper to release-2.21\nwget https://raw.githubusercontent.com/kubean-io/kubean-manifest/main/manifests/manifest-2.21-d6f688f.yml\n\n# Deploy the manifest resources proper to release-2.21\nkubectl apply -f manifest-2.21-d6f688f.yml\n
"},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html#deployment-and-upgrade-legacy-k8s-cluster","title":"Deployment and Upgrade Legacy K8s Cluster","text":""},{"location":"en/admin/kpanda/best-practice/kubean-low-version.html#deploy","title":"Deploy","text":"
Go to Container Management and click the Create Cluster button on the Clusters page.
For the Managed By parameter, choose the cluster where the manifest and localartifactset.cr.yaml custom resources were deployed. In this example, we use the Global cluster.
Refer to Creating a Cluster for the remaining parameters.
Select the newly created cluster and go to the details page.
Click Cluster Operations in the left navigation bar, then click Cluster Upgrade on the top right of the page.
Select the available version for the upgrade.
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html","title":"How to Add Heterogeneous Nodes to a Worker Cluster","text":"
This page explains how to add ARM architecture nodes with Kylin v10 sp2 operating system to an AMD architecture worker cluster with CentOS 7.9 operating system.
Note
This page is only applicable to adding heterogeneous nodes to a worker cluster created using the AI platform in offline mode, excluding integrated clusters.
An AI platform Full Mode deployment has been successfully completed, and the bootstrap node is still alive. Refer to the documentation Offline Installation of AI platform Enterprise for the deployment process.
A worker cluster with AMD architecture and CentOS 7.9 operating system has been created through the AI platform. Refer to the documentation Creating a Worker Cluster for the creation process.
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/best-practice/multi-arch.html#download-and-import-offline-packages","title":"Download and Import Offline Packages","text":"
Take ARM architecture and Kylin v10 sp2 operating system as examples.
Make sure you are logged into the bootstrap node! Also, make sure the clusterConfig.yaml file used during the AI platform deployment is available.
The latest version can be downloaded from the Download Center.
| CPU Architecture | Version | Download Link |
| --- | --- | --- |
| AMD64 | v0.18.0 | https://qiniu-download-public.daocloud.io/DaoCloud_Enterprise/dce5/offline-v0.18.0-amd64.tar |
| ARM64 | v0.18.0 | https://qiniu-download-public.daocloud.io/DaoCloud_Enterprise/dce5/offline-v0.18.0-arm64.tar |
After downloading, extract the offline package:
tar -xvf offline-v0.18.0-arm64.tar\n
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#iso-offline-package-kylin-v10-sp2","title":"ISO Offline Package (Kylin v10 sp2)","text":"CPU Architecture Operating System Version Download Link ARM64 Kylin Linux Advanced Server release V10 (Sword) SP2 https://www.kylinos.cn/support/trial.html
Note
Kylin operating system requires personal information to be provided for downloading and usage. Select V10 (Sword) SP2 when downloading.
The Kubean project provides osPackage offline packages for different operating systems. Visit https://github.com/kubean-io/kubean/releases to view the available packages.
| Operating System Version | Download Link |
| --- | --- |
| Kylin Linux Advanced Server release V10 (Sword) SP2 | https://github.com/kubean-io/kubean/releases/download/v0.16.3/os-pkgs-kylinv10-v0.16.3.tar.gz |
Note
Check the specific version of the osPackage offline package in the offline/sample/clusterConfig.yaml file of the offline image package.
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#import-offline-packages-to-the-bootstrap-node","title":"Import Offline Packages to the Bootstrap Node","text":"
To add information about the newly added worker nodes according to the above comments:
kubectl edit cm ${cluster-name}-hosts-conf -n kubean-system\n
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#add-expansion-tasks-through-clusteroperationyml","title":"Add Expansion Tasks through ClusterOperation.yml","text":"
Ensure the spec.image image address matches the image used in the previous deployment job.
Set spec.action to scale.yml .
Set spec.extraArgs to --limit=g-worker .
Fill in the correct repo_list parameter for the relevant OS in spec.preHook 's enable-repo.yml script.
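Based on these settings, a hedged sketch of join-node-ops.yaml is shown below; the resource name, spray-job image address, and repo_list URL are placeholders that must be replaced with the values actually used in your environment:
apiVersion: kubean.io/v1alpha1\nkind: ClusterOperation\nmetadata:\n  name: cluster1-add-worker-node-ops # placeholder name\nspec:\n  cluster: <cluster-name> # name of the target cluster\n  image: <spray-job image used by the previous deployment job>\n  actionType: playbook\n  action: scale.yml\n  extraArgs: --limit=g-worker\n  preHook:\n    - actionType: playbook\n      action: enable-repo.yml # required in offline environments\n      extraArgs: |\n        -e \"{repo_list: ['http://<minio-address>:9000/kubean/kylinv10/\\$releasever/os/\\$basearch']}\" # placeholder repo_list for the Kylin OS\n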
To create and deploy join-node-ops.yaml according to the above configuration:
vi join-node-ops.yaml\nkubectl apply -f join-node-ops.yaml -n kubean-system\n
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#check-the-status-of-the-task-execution","title":"Check the status of the task execution","text":"
kubectl -n kubean-system get pod | grep add-worker-node\n
To check the progress of the scaling task, you can view the logs of the proper pod.
"},{"location":"en/admin/kpanda/best-practice/multi-arch.html#verify-in-ui","title":"Verify in UI","text":"
Go to Container Management -> Clusters -> Nodes .
Click the newly added node to view details.
"},{"location":"en/admin/kpanda/best-practice/replace-first-master-node.html","title":"Replace the first master node of the worker cluster","text":"
This page takes a highly available worker cluster with three master nodes as an example and explains how to replace or reintroduce the first master node when it fails or malfunctions.
The example uses a highly available cluster with the following three master nodes:
node1 (172.30.41.161)
node2 (172.30.41.162)
node3 (172.30.41.163)
Assuming node1 is down, the following steps will explain how to reintroduce the recovered node1 back into the worker cluster.
Before performing the replacement operation, first obtain basic information about the cluster resources, which will be used when modifying related configurations.
Note
The following commands to obtain cluster resource information are executed in the management cluster.
Get the cluster name
Run the following command to find the clusters.kubean.io resource proper to the cluster:
# For example, if the resource name of clusters.kubean.io is cluster-mini-1\n# Get the name of the cluster\nCLUSTER_NAME=$(kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath=\"{.metadata.name}{'\\n'}\")\n
Get the host list configmap of the cluster
kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath=\"{.spec.hostsConfRef}{'\\n'}\"\n{\"name\":\"mini-1-hosts-conf\",\"namespace\":\"kubean-system\"}\n
Get the configuration parameters configmap of the cluster
kubectl get clusters.kubean.io cluster-mini-1 -o=jsonpath=\"{.spec.varsConfRef}{'\\n'}\"\n{\"name\":\"mini-1-vars-conf\",\"namespace\":\"kubean-system\"}\n
Reset the node1 node to the state it was in before the cluster was installed (or use a new node), while maintaining network connectivity to node1.
Adjust the order of the node1 node in the kube_control_plane, kube_node, and etcd sections in the host list (node1/node2/node3 -> node2/node3/node1):
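A hedged sketch of the relevant part of the host list after reordering is shown below (edit it with kubectl -n kubean-system edit cm mini-1-hosts-conf); the exact field layout may differ slightly in your configmap:
children:\n  kube_control_plane:\n    hosts:\n      node2:\n      node3:\n      node1:\n  kube_node:\n    hosts:\n      node2:\n      node3:\n      node1:\n  etcd:\n    hosts:\n      node2:\n      node3:\n      node1:\n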
Manually modify the cluster configuration, edit and update cluster-info
# Edit cluster-info\nkubectl -n kube-public edit cm cluster-info\n\n# 1. If the ca.crt certificate is updated, the content of the certificate-authority-data field needs to be updated\n# View the base64 encoding of the ca certificate:\ncat /etc/kubernetes/ssl/ca.crt | base64 | tr -d '\\n'\n\n# 2. Change the IP address in the server field to the new first master IP, this document will use the IP address of node2, 172.30.41.162\n
Manually modify the cluster configuration, edit and update kubeadm-config
# Edit kubeadm-config\nkubectl -n kube-system edit cm kubeadm-config\n\n# Change controlPlaneEndpoint to the new first master IP,\n# this document will use the IP address of node2, 172.30.41.162\n
Scale up the master node and update the cluster
Note
Use --limit to limit the update operation to only affect the etcd and kube_control_plane node groups.
In an offline environment, spec.preHook needs to include enable-repo.yml, and the extraArgs parameter should specify the correct repo_list for the related OS.
cat << EOF | kubectl apply -f -\n---\napiVersion: kubean.io/v1alpha1\nkind: ClusterOperation\nmetadata:\n name: cluster-mini-1-update-cluster-ops\nspec:\n cluster: ${CLUSTER_NAME}\n image: ${SPRAY_IMG_ADDR}:${SPRAY_RLS_2_22_TAG}\n actionType: playbook\n action: cluster.yml\n extraArgs: --limit=etcd,kube_control_plane -e kube_version=${KUBE_VERSION}\n preHook:\n - actionType: playbook\n action: enable-repo.yml # This yaml needs to be added in an offline environment,\n # and set the correct repo-list (install operating system packages),\n # the following parameter values are for reference only\n extraArgs: |\n -e \"{repo_list: ['http://172.30.41.0:9000/kubean/centos/\\$releasever/os/\\$basearch','http://172.30.41.0:9000/kubean/centos-iso/\\$releasever/os/\\$basearch']}\"\n postHook:\n - actionType: playbook\n action: cluster-info.yml\nEOF\n
You have now completed the replacement of the first master node.
"},{"location":"en/admin/kpanda/best-practice/update-offline-cluster.html","title":"Offline Deployment/Upgrade Guide for Worker Clusters","text":"
Note
This document is specifically designed for deploying or upgrading the Kubernetes version of worker clusters created on the AI platform in offline mode. It does not cover the deployment or upgrade of other Kubernetes components.
This guide is applicable to the following offline scenarios:
You can follow the operational guidelines to deploy the recommended Kubernetes version in a non-GUI environment created by the AI platform.
You can upgrade the Kubernetes version of worker clusters created using the AI platform by generating incremental offline packages.
The overall approach is as follows:
Build the offline package on an integrated node.
Import the offline package to the bootstrap node.
Update the Kubernetes version manifest for the global service cluster.
Use the AI platform UI to create or upgrade the Kubernetes version of the worker cluster.
Note
For a list of currently supported offline Kubernetes versions, refer to the list of Kubernetes versions supported by Kubean.
"},{"location":"en/admin/kpanda/best-practice/update-offline-cluster.html#building-the-offline-package-on-an-integrated-node","title":"Building the Offline Package on an Integrated Node","text":"
Since the offline environment cannot connect to the internet, you need to prepare an integrated node in advance to build the incremental offline package and start Docker or Podman services on this node. Refer to How to Install Docker?
Check the status of the Docker service on the integrated node.
Create a file named manifest.yaml in the /root directory of the integrated node with the following command:
vi manifest.yaml\n
The content of manifest.yaml should be as follows:
manifest.yaml
image_arch:\n- \"amd64\"\nkube_version: # Specify the version of the cluster to be upgraded\n- \"v1.28.0\"\n
image_arch specifies the CPU architecture type, with options for amd64 and arm64.
kube_version indicates the version of the Kubernetes offline package to be built. You can refer to the supported offline Kubernetes versions mentioned earlier.
Create a folder named data in the /root directory to store the incremental offline package.
mkdir data\n
Run the following command to generate the offline package using the kubean airgap-patch image. Make sure the tag of the airgap-patch image matches the Kubean version, and that the Kubean version covers the Kubernetes version you wish to upgrade.
# Assuming the Kubean version is v0.13.9\ndocker run --rm -v $(pwd)/manifest.yaml:/manifest.yaml -v $(pwd)/data:/data ghcr.m.daocloud.io/kubean-io/airgap-patch:v0.13.9\n
After the container finishes running, check the files in the data folder. The folder structure should look like this:
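Based on the import steps later in this document, the generated directory should look roughly like the following; the exact contents depend on the versions selected in manifest.yaml:
data\n├── amd64\n│   ├── files\n│   │   └── import_files.sh\n│   └── images\n│       └── import_images.sh\n└── localartifactset.cr.yaml\n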
"},{"location":"en/admin/kpanda/best-practice/update-offline-cluster.html#importing-the-offline-package-to-the-bootstrap-node","title":"Importing the Offline Package to the Bootstrap Node","text":"
Copy the data folder from the integrated node to the /root directory of the bootstrap node. On the integrated node, run the following command:
scp -r data root@x.x.x.x:/root\n
Replace x.x.x.x with the IP address of the bootstrap node.
On the bootstrap node, copy the image files in the /data folder to the built-in Docker registry of the bootstrap node. After logging into the bootstrap node, run the following commands:
Navigate to the directory where the image files are located.
cd data/amd64/images\n
Run the import_images.sh script to import the images into the built-in Docker Registry of the bootstrap node.
REGISTRY_ADDR=\"127.0.0.1\" ./import_images.sh\n
Note
The above command is only applicable to the built-in Docker Registry of the bootstrap node. If you are using an external registry, use the following command:
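A hedged example is shown below, assuming the import_images.sh script reads the external registry address and credentials from environment variables; the variable names and values are illustrative and may differ in your version of the script:
REGISTRY_SCHEME=https REGISTRY_ADDR=<external-registry-address> REGISTRY_USER=<username> REGISTRY_PASS=<password> ./import_images.sh\n
Next, import the binary files from the data folder into the built-in MinIO of the bootstrap node. A hedged example based on the import_files.sh usage shown earlier in this document:
cd /root/data/amd64/files\nMINIO_USER=rootuser MINIO_PASS=rootpass123 ./import_files.sh http://127.0.0.1:9000\n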
The above command is only applicable to the built-in Minio service of the bootstrap node. If you are using an external Minio, replace http://127.0.0.1:9000 with the access address of the external Minio. \"rootuser\" and \"rootpass123\" are the default account and password for the built-in Minio service of the bootstrap node.
"},{"location":"en/admin/kpanda/best-practice/update-offline-cluster.html#updating-the-kubernetes-version-manifest-for-the-global-service-cluster","title":"Updating the Kubernetes Version Manifest for the Global Service Cluster","text":"
Run the following command on the bootstrap node to deploy the localartifactset resource to the global service cluster:
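A hedged example, assuming the offline package was copied to /root on the bootstrap node and that kubectl there points at the global service cluster:
cd /root\nkubectl apply -f data/localartifactset.cr.yaml\n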
Log into the AI platform UI management interface to continue with the following actions:
Refer to the Creating Cluster Documentation to create a worker cluster, where you can select the incremental version of Kubernetes.
Refer to the Upgrading Cluster Documentation to upgrade your self-built worker cluster.
"},{"location":"en/admin/kpanda/best-practice/use-otherlinux-create-custer.html","title":"Creating a Cluster on Non-Supported Operating Systems","text":"
This document outlines how to create a worker cluster on an unsupported OS in offline mode. For the range of OS supported by AI platform, refer to AI platform Supported Operating Systems.
The main process for creating a worker cluster on an unsupported OS in offline mode is illustrated in the diagram below:
Next, we will use the openAnolis operating system as an example to demonstrate how to create a cluster on a non-mainstream operating system.
AI platform Full Mode has been deployed following the documentation: Offline Installation of AI platform Enterprise.
At least one node with the same architecture and version that can connect to the internet.
"},{"location":"en/admin/kpanda/best-practice/use-otherlinux-create-custer.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/best-practice/use-otherlinux-create-custer.html#online-node-building-an-offline-package","title":"Online Node - Building an Offline Package","text":"
Find an online environment with the same architecture and OS as the nodes in the target cluster. In this example, we will use AnolisOS 8.8 GA. Run the following command to generate an offline os-pkgs package:
# Download relevant scripts and build os packages package\n$ curl -Lo ./pkgs.yml https://raw.githubusercontent.com/kubean-io/kubean/main/build/os-packages/others/pkgs.yml\n$ curl -Lo ./other_os_pkgs.sh https://raw.githubusercontent.com/kubean-io/kubean/main/build/os-packages/others/other_os_pkgs.sh && chmod +x other_os_pkgs.sh\n$ ./other_os_pkgs.sh build # Build the offline package\n
After executing the above command, you should have a compressed package named os-pkgs-anolis-8.8.tar.gz in the current directory. The file structure in the current directory should look like this:
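For reference, the directory should contain roughly the following files (the names are taken from the steps below):
.\n├── other_os_pkgs.sh\n├── pkgs.yml\n└── os-pkgs-anolis-8.8.tar.gz\n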
"},{"location":"en/admin/kpanda/best-practice/use-otherlinux-create-custer.html#offline-node-installing-the-offline-package","title":"Offline Node - Installing the Offline Package","text":"
Copy the three files generated on the online node ( other_os_pkgs.sh , pkgs.yml , and os-pkgs-anolis-8.8.tar.gz ) to all nodes in the target cluster in the offline environment.
Log in to any one of the nodes in the offline environment that is part of the target cluster, and run the following command to install the os-pkg package on the node:
# Configure environment variables\n$ export PKGS_YML_PATH=/root/workspace/os-pkgs/pkgs.yml # Path to the pkgs.yml file on the current offline node\n$ export PKGS_TAR_PATH=/root/workspace/os-pkgs/os-pkgs-anolis-8.8.tar.gz # Path to the os-pkgs-anolis-8.8.tar.gz file on the current offline node\n$ export SSH_USER=root # Username for the current offline node\n$ export SSH_PASS=dangerous # Password for the current offline node\n$ export HOST_IPS='172.30.41.168' # IP address of the current offline node\n$ ./other_os_pkgs.sh install # Install the offline package\n
After executing the above command, wait for the interface to prompt: All packages for node (X.X.X.X) have been installed , which indicates that the installation is complete.
"},{"location":"en/admin/kpanda/best-practice/use-otherlinux-create-custer.html#go-to-the-user-interface-to-create-cluster","title":"Go to the User Interface to Create Cluster","text":"
Refer to the documentation on Creating a Worker Cluster to create an openAnolis cluster.
"},{"location":"en/admin/kpanda/clusterops/cluster-oversold.html","title":"Dynamic Resource Overprovision in the Cluster","text":"
Currently, many businesses experience peaks and valleys in demand. To ensure service performance and stability, resources are typically allocated based on peak demand when deploying services. However, peak periods may be very short, resulting in resource waste during off-peak times. Cluster resource overprovision utilizes these allocated but unused resources (i.e., the difference between allocation and usage) to enhance cluster resource utilization and reduce waste.
This article mainly introduces how to use the cluster dynamic resource overprovision feature.
The container management module has been integrated with a Kubernetes cluster or a Kubernetes cluster has been created, and access to the cluster's UI interface is available.
A namespace has been created, and the user has been granted Cluster Admin permissions. For details, refer to Cluster Authorization.
Once the cluster dynamic resource overprovision ratio is set, it will take effect while workloads are running. The following example uses nginx to validate the use of resource overprovision capabilities.
Create a workload (nginx) and set the proper resource limits. For the creation process, refer to Creating Stateless Workloads (Deployment).
Check whether the ratio of the Pod's resource requests to limits meets the overprovision ratio.
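For example, assuming a CPU overprovision ratio of 2 and a workload limit of 2 CPU cores, the Pod's CPU request would be expected to show 1 core (this interpretation of the ratio is an assumption). One way to inspect a Pod's requests and limits, with the namespace and Pod name as placeholders:
kubectl -n <namespace> get pod <nginx-pod-name> -o jsonpath='{.spec.containers[0].resources}'\n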
Cluster settings are used to customize advanced feature settings for your cluster, including whether to enable GPU, helm repo refresh cycle, Helm operation record retention, etc.
Enable GPU: GPUs and proper driver plug-ins need to be installed on the cluster in advance.
Click the name of the target cluster, and click Operations and Maintenance -> Cluster Settings -> Addons in the left navigation bar.
These settings include the Helm operation base image, the registry refresh cycle, the number of operation records retained, and whether to enable cluster deletion protection (the cluster cannot be deleted directly after it is enabled).
On this page, you can view the recent cluster operation records and Helm operation records, as well as the YAML files and logs of each operation, and you can also delete a certain record.
Set the number of reserved entries for Helm operations:
By default, the system keeps the last 100 Helm operation records. Keeping too many records may cause data redundancy, while keeping too few may cause you to lose the key operation records you need. Set a reasonable retention count based on the actual situation. Specific steps are as follows:
Click the name of the target cluster, and click Recent Operations -> Helm Operations -> Set Number of Retained Items in the left navigation bar.
Set how many Helm operation records need to be kept, and click OK .
Clusters integrated or created using the AI platform Container Management platform can be accessed not only through the UI interface but also in two other ways for access control:
Access online via CloudShell
Access via kubectl after downloading the cluster certificate
Note
When accessing the cluster, the user should have Cluster Admin permission or higher.
"},{"location":"en/admin/kpanda/clusters/access-cluster.html#access-via-cloudshell","title":"Access via CloudShell","text":"
Enter Clusters page, select the cluster you want to access via CloudShell, click the ... icon on the right, and then click Console from the dropdown list.
Run kubectl get node command in the Console to verify the connectivity between CloudShell and the cluster. If the console returns node information of the cluster, you can access and manage the cluster through CloudShell.
"},{"location":"en/admin/kpanda/clusters/access-cluster.html#access-via-kubectl","title":"Access via kubectl","text":"
If you want to access and manage remote clusters from a local node, make sure you have met these prerequisites:
Your local node and the cloud cluster are in a connected network.
The cluster certificate has been downloaded to the local node.
The kubectl tool has been installed on the local node. For detailed installation guides, see Installing tools.
If everything is in place, follow these steps to access a cloud cluster from your local environment.
Enter Clusters page, find your target cluster, click ... on the right, and select Download kubeconfig in the drop-down list.
Set the Kubeconfig period and click Download .
Open the downloaded certificate and copy its content to the config file of the local node.
By default, the kubectl tool will look for a file named config in the $HOME/.kube directory on the local node. This file stores access credentials of clusters. Kubectl can access the cluster with that configuration file.
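Alternatively, you can point kubectl at the downloaded file directly instead of merging it into $HOME/.kube/config; the file path below is a placeholder:
export KUBECONFIG=/path/to/downloaded-kubeconfig.yaml\n# Or specify the file for a single command\nkubectl --kubeconfig /path/to/downloaded-kubeconfig.yaml get pod -n default\n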
Run the following command on the local node to verify its connectivity with the cluster:
kubectl get pod -n default\n
An expected output is as follows:
NAME READY STATUS RESTARTS AGE\ndao-2048-2048-58c7f7fc5-mq7h4 1/1 Running 0 30h\n
Now you can access and manage the cluster locally with kubectl.
Suanova AI platform categorizes clusters based on different functionalities to help users better manage IT infrastructure.
"},{"location":"en/admin/kpanda/clusters/cluster-role.html#global-service-cluster","title":"Global Service Cluster","text":"
This cluster is used to run AI platform components such as Container Management, Global Management, Insight. It generally does not carry business workloads.
This cluster is used to manage worker clusters and generally does not carry business workloads.
Classic Mode deploys the global service cluster and management cluster in different clusters, suitable for multi-data center, multi-architecture enterprise scenarios.
Simple Mode deploys the management cluster and global service cluster in the same cluster.
This is a cluster created using Container Management and is mainly used to carry business workloads. This cluster is managed by the management cluster.
| Supported Features | Description |
| --- | --- |
| K8s Version | Supports K8s 1.22 and above |
| Operating System | RedHat 7.6 x86/ARM, RedHat 7.9 x86, RedHat 8.4 x86/ARM, RedHat 8.6 x86; Ubuntu 18.04 x86, Ubuntu 20.04 x86; CentOS 7.6 x86/AMD, CentOS 7.9 x86/AMD |
| Full Lifecycle Management | Supported |
| K8s Resource Management | Supported |
| Cloud Native Storage | Supported |
| Cloud Native Network | Calico, Cilium, Multus, and other CNIs |
| Policy Management | Supports network policies, quota policies, resource limits, disaster recovery policies, security policies |
"},{"location":"en/admin/kpanda/clusters/cluster-role.html#integrated-cluster","title":"Integrated Cluster","text":"
This cluster is used to integrate existing standard K8s clusters, including but not limited to self-built clusters in local data centers, clusters provided by public cloud vendors, clusters provided by private cloud vendors, edge clusters, Xinchuang clusters, heterogeneous clusters, and different Suanova clusters. It is mainly used to carry business workloads.
| Supported Features | Description |
| --- | --- |
| K8s Version | 1.18+ |
| Supported Vendors | VMware Tanzu, Amazon EKS, Redhat Openshift, SUSE Rancher, Alibaba ACK, Huawei CCE, Tencent TKE, Standard K8s Cluster, Suanova |
| Full Lifecycle Management | Not Supported |
| K8s Resource Management | Supported |
| Cloud Native Storage | Supported |
| Cloud Native Network | Depends on the network mode of the integrated cluster's kernel |
| Policy Management | Supports network policies, quota policies, resource limits, disaster recovery policies, security policies |
Note
A cluster can have multiple cluster roles. For example, a cluster can be both a global service cluster and a management cluster or a worker cluster.
"},{"location":"en/admin/kpanda/clusters/cluster-scheduler-plugin.html","title":"Deploy Second Scheduler scheduler-plugins in a Cluster","text":"
This page describes how to deploy a second scheduler-plugins in a cluster.
"},{"location":"en/admin/kpanda/clusters/cluster-scheduler-plugin.html#why-do-we-need-scheduler-plugins","title":"Why do we need scheduler-plugins?","text":"
Clusters created through the platform install the native K8s scheduler, but the native scheduler has many limitations:
The native scheduler cannot meet all scheduling requirements, so you may need CoScheduling, CapacityScheduling, or other types of scheduler plugins.
In special scenarios, a new scheduler-plugin is needed to complete scheduling tasks without affecting the process of the native scheduler-plugin.
Distinguish scheduler-plugins with different functionalities and achieve different scheduling scenarios by switching scheduler-plugin names.
This page takes the scenario of using the vgpu scheduler-plugin while combining the coscheduling plugin capability of scheduler-plugins as an example to introduce how to install and use scheduler-plugins.
This capability is newly introduced in kubean v0.13.0; please ensure that your kubean version is v0.13.0 or higher.
The installed version of scheduler-plugins is v0.27.8; please ensure that your cluster version is compatible with it. Refer to the document Compatibility Matrix.
scheduler_plugins_enabled: Set to true to enable the scheduler-plugins capability.
You can enable or disable certain plugins by setting the scheduler_plugins_enabled_plugins or scheduler_plugins_disabled_plugins options. See K8s Official Plugin Names for reference.
If you need to set parameters for custom plugins, please configure scheduler_plugins_plugin_config, for example: set the permitWaitingTimeoutSeconds parameter for coscheduling. See K8s Official Plugin Configuration for reference.
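Putting these options together, a hedged example of the custom parameters might look as follows; the plugin list and the permitWaitingTimeoutSeconds value are illustrative only:
scheduler_plugins_enabled: true\nscheduler_plugins_enabled_plugins:\n  - Coscheduling\nscheduler_plugins_plugin_config:\n  - name: Coscheduling\n    args:\n      permitWaitingTimeoutSeconds: 10\n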
After successful cluster creation, the system will automatically install the scheduler-plugins and controller component loads. You can check the workload status in the proper cluster's deployment.
Here is an example of how to use scheduler-plugins by demonstrating a scenario where the vgpu scheduler is used in combination with the coscheduling plugin capability of scheduler-plugins.
Install vgpu in the Helm Charts and set the values.yaml parameters.
schedulerName: scheduler-plugins-scheduler: This is the scheduler name for scheduler-plugins installed by kubean, and currently cannot be modified.
scheduler.kubeScheduler.enabled: false: Do not install kube-scheduler and use vgpu-scheduler as a separate extender.
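Based on the two parameters above, a minimal sketch of the relevant part of values.yaml is shown below; other values keep their defaults:
schedulerName: scheduler-plugins-scheduler\nscheduler:\n  kubeScheduler:\n    enabled: false\n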
Extend vgpu-scheduler on scheduler-plugins.
[root@master01 charts]# kubectl get cm -n scheduler-plugins scheduler-config -ojsonpath=\"{.data.scheduler-config\\.yaml}\"\n
After installing vgpu-scheduler, the system will automatically create a service (svc), and the urlPrefix specifies the URL of the svc.
Note
The svc refers to the pod service load. You can use the following command in the namespace where the nvidia-vgpu plugin is installed to get the external access information for port 443.
kubectl get svc -n ${namespace}\n
The urlPrefix format is https://${ip address}:${port}
Restart the scheduler pod of scheduler-plugins to load the new configuration file.
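For example, you might restart it as follows; the deployment name is an assumption, so first check the actual workload name in the scheduler-plugins namespace:
kubectl -n scheduler-plugins rollout restart deployment scheduler-plugins-scheduler\n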
Note
When creating a vgpu application, you do not need to specify the name of a scheduler-plugin. The vgpu-scheduler webhook will automatically change the scheduler's name to \"scheduler-plugins-scheduler\" without manual specification.
AI platform Container Management module can manage two types of clusters: integrated clusters and created clusters.
Integrated clusters: clusters created in other platforms and now integrated into AI platform.
Created clusters: clusters created in AI platform.
For more information about cluster types, see Cluster Role.
We designed several status for these two clusters.
"},{"location":"en/admin/kpanda/clusters/cluster-status.html#integrated-clusters","title":"Integrated Clusters","text":"Status Description Integrating The cluster is being integrated into AI platform. Removing The cluster is being removed from AI platform. Running The cluster is running as expected. Unknown The cluster is lost. Data displayed in the AI platform UI is the cached data before the disconnection, which does not represent real-time data. Any operation during this status will not take effect. You should check cluster network connectivity or host status."},{"location":"en/admin/kpanda/clusters/cluster-status.html#created-clusters","title":"Created Clusters","text":"Status Description Creating The cluster is being created. Updating The Kubernetes version of the cluster is being operating. Deleting The cluster is being deleted. Running The cluster is running as expected. Unknown The cluster is lost. Data displayed in the AI platform UI is the cached data before the disconnection, which does not represent real-time data. Any operation during this status will not take effect. You should check cluster network connectivity or host status. Failed The cluster creation is failed. You should check the logs for detailed reasons."},{"location":"en/admin/kpanda/clusters/cluster-version.html","title":"Supported Kubernetes Versions","text":"
In AI platform, the integrated clusters and created clusters have different version support mechanisms.
This page focuses on the version support mechanism for created clusters.
The Kubernetes community maintains three version ranges at a time, such as 1.26, 1.27, and 1.28. When a new version is released by the community, the supported version range shifts up by one. For example, if the latest version released by the community is 1.29, the community-supported version range becomes 1.27, 1.28, and 1.29.
To ensure the security and stability of the clusters, when creating clusters in AI platform, the supported version range will always be one version lower than the community's version.
For instance, if the Kubernetes community supports v1.25, v1.26, and v1.27, then the version range for creating worker clusters in AI platform will be v1.24, v1.25, and v1.26. Additionally, a stable version, such as 1.24.7, will be recommended to users.
Furthermore, the version range for creating worker clusters in AI platform will remain highly synchronized with the community. When the community version increases incrementally, the version range for creating worker clusters in AI platform will also increase by one version.
"},{"location":"en/admin/kpanda/clusters/cluster-version.html#supported-kubernetes-versions_1","title":"Supported Kubernetes Versions","text":"Kubernetes Community Versions Created Worker Cluster Versions Recommended Versions for Created Worker Cluster AI platform Installer Release Date
In AI platform Container Management, clusters can have four roles: global service cluster, management cluster, worker cluster, and integrated cluster. An integrated cluster can only be integrated from third-party vendors (see Integrate Cluster).
This page explains how to create a Worker Cluster. By default, when creating a new Worker Cluster, the operating system type and CPU architecture of the worker nodes should be consistent with the Global Service Cluster. If you want to create a cluster with a different operating system or architecture than the Global Management Cluster, refer to Creating an Ubuntu Worker Cluster on a CentOS Management Platform for instructions.
It is recommended to use the supported operating systems in AI platform to create the cluster. If your local nodes are not within the supported range, you can refer to Creating a Cluster on Non-Mainstream Operating Systems for instructions.
Certain prerequisites must be met before creating a cluster:
Prepare enough nodes to be joined into the cluster.
It is recommended to use Kubernetes version 1.25.7. For the specific version range, refer to the AI platform Cluster Version Support System. Currently, the supported version range for created worker clusters is v1.26.0 to v1.28. If you need to create a cluster with a lower version, refer to Supported Cluster Versions.
The target host must allow IPv4 forwarding. If using IPv6 in Pods and Services, the target server needs to allow IPv6 forwarding.
AI platform does not provide firewall management. You need to pre-define the firewall rules of the target host by yourself. To avoid errors during cluster creation, it is recommended to disable the firewall of the target host.
Enter the Container Management module, click Create Cluster on the upper right corner of the Clusters page.
Fill in the basic information by referring to the following instructions.
Cluster Name: can only contain lowercase letters, numbers, and hyphens ("-"); must start and end with a lowercase letter or number; and can be at most 63 characters in total.
Managed By: Choose a cluster to manage this new cluster through its lifecycle, such as creating, upgrading, node scaling, deleting the new cluster, etc.
Runtime: Select the runtime environment of the cluster. Currently support containerd and docker (see How to Choose Container Runtime).
Kubernetes Version: a span of three minor versions is allowed, such as 1.23 to 1.25, subject to the versions supported by the management cluster.
Fill in the node configuration information and click Node Check .
High Availability: When enabled, at least 3 controller nodes are required. When disabled, only 1 controller node is needed.
It is recommended to use High Availability mode in production environments.
Credential Type: Choose whether to access nodes using username/password or public/private keys.
If using public/private key authentication, SSH keys for the nodes need to be configured in advance. Refer to Using SSH Key Authentication for Nodes.
Same Password: When enabled, all nodes in the cluster will have the same access password. Enter the unified password for accessing all nodes in the field below. If disabled, you can set separate usernames and passwords for each node.
Node Information: Set node names and IPs.
NTP Time Synchronization: When enabled, time will be automatically synchronized across all nodes. Provide the NTP server address.
If the node check passes, click Next . If the check fails, update Node Information and check again.
Fill in the network configuration and click Next .
CNI: Provides network services for Pods in the cluster. The CNI cannot be changed after the cluster is created. Supports cilium and calico. Setting none means no CNI will be installed when creating the cluster; you may install a CNI later.
For CNI configuration details, see Cilium Installation Parameters or Calico Installation Parameters.
Container IP Range: Set an IP range for allocating IPs for containers in the cluster. IP range determines the max number of containers allowed in the cluster. Cannot be modified after creation.
Service IP Range: Set an IP range for allocating IPs for container Services in the cluster. This range determines the max number of container Services that can be created in the cluster. Cannot be modified after creation.
Fill in the plug-in configuration and click Next .
Fill in advanced settings and click OK .
kubelet_max_pods : Set the maximum number of Pods per node. The default is 110.
hostname_override : Reset the hostname (not recommended).
kubernetes_audit : Kubernetes audit log, enabled by default.
auto_renew_certificate : Automatically renew the certificate of the control plane on the first Monday of each month, enabled by default.
disable_firewalld&ufw : Disable the firewall to prevent the node from being inaccessible during installation.
Insecure_registries : Set the address of your private container registry. If you use a private container registry, filling in its address allows the container engine to bypass certificate authentication and pull images.
yum_repos : Fill in the Yum source registry address.
Success
After correctly filling in the above information, the page will prompt that the cluster is being created.
Creating a cluster takes a long time, so you need to wait patiently. You can click the Back to Clusters button to let it run in the background.
To view the current status, click Real-time Log .
Note
When the cluster is in an unknown state, it means that the current cluster has been disconnected.
The data displayed by the system is the cached data before the disconnection, which does not represent real data.
Any operations performed in the disconnected state will not take effect. Please check the cluster network connectivity or Host Status.
Clusters created in AI platform Container Management can be either deleted or removed. Clusters integrated into AI platform can only be removed.
Info
If you want to delete an integrated cluster, you should delete it in the platform where it is created.
In AI platform, the difference between Delete and Remove is:
Delete will destroy the cluster and reset the data of all nodes under the cluster. All data will be totally cleared and lost. Making a backup before deleting a cluster is a recommended best practice. You can no longer use that cluster anymore.
Remove just removes the cluster from AI platform. It will not destroy the cluster and no data will be lost. You can still use the cluster in other platforms or re-integrate it into AI platform later if needed.
Note
You should have Admin or Kpanda Owner permissions to perform delete or remove operations.
Before deleting a cluster, you should turn off Cluster Deletion Protection in Cluster Settings -> Advanced Settings , otherwise the Delete Cluster option will not be displayed.
The global service cluster cannot be deleted or removed.
Enter the Container Management module, find your target cluster, click ... on the right, and select Delete cluster / Remove in the drop-down list.
Enter the cluster name to confirm and click Delete .
You will be automatically redirected to the cluster list. The status of this cluster will change to Deleting . It may take a while to delete/remove a cluster.
With the features of integrating clusters, AI platform allows you to manage on-premise and cloud clusters of various providers in a unified manner. This is quite important in avoiding the risk of being locked in by a certain providers, helping enterprises safely migrate their business to the cloud.
In AI platform Container Management module, you can integrate a cluster of the following providers: standard Kubernetes clusters, Redhat Openshift, SUSE Rancher, VMware Tanzu, Amazon EKS, Aliyun ACK, Huawei CCE, Tencent TKE, etc.
Enter Container Management module, and click Integrate Cluster in the upper right corner.
Fill in the basic information by referring to the following instructions.
Cluster Name: It should be unique and cannot be changed after the integration. Maximum 63 characters, can only contain lowercase letters, numbers, and a separator (\"-\"), and must start and end with a lowercase letter or number.
Cluster Alias: Enter any characters, no more than 60 characters.
Release Distribution: the cluster provider; supports the mainstream vendors listed at the beginning.
Fill in the KubeConfig of the target cluster and click Verify Config . The cluster can be successfully connected only after the verification is passed.
Click How do I get the KubeConfig? to see the specific steps for getting this file.
Confirm that all parameters are filled in correctly and click OK in the lower right corner of the page.
Note
The status of the newly integrated cluster is Integrating , which will become Running after the integration succeeds.
"},{"location":"en/admin/kpanda/clusters/integrate-rancher-cluster.html","title":"Integrate the Rancher Cluster","text":"
This page explains how to integrate a Rancher cluster.
Prepare a Rancher cluster with administrator privileges and ensure network connectivity between the container management cluster and the target cluster.
Be equipped with permissions not lower than kpanda owner.
"},{"location":"en/admin/kpanda/clusters/integrate-rancher-cluster.html#steps","title":"Steps","text":""},{"location":"en/admin/kpanda/clusters/integrate-rancher-cluster.html#step-1-create-a-serviceaccount-user-with-administrator-privileges-in-the-rancher-cluster","title":"Step 1: Create a ServiceAccount user with administrator privileges in the Rancher cluster","text":"
Log in to the Rancher cluster with a role that has administrator privileges, and create a file named sa.yaml using the terminal.
vi sa.yaml\n
Press the i key to enter insert mode, then copy and paste the following content:
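A hedged sketch of sa.yaml that creates a rancher-rke ServiceAccount and binds it to the cluster-admin ClusterRole; the kube-system namespace is an assumption, and on Kubernetes 1.24+ you may also need to create a token Secret for the ServiceAccount:
apiVersion: v1\nkind: ServiceAccount\nmetadata:\n  name: rancher-rke\n  namespace: kube-system\n---\napiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRoleBinding\nmetadata:\n  name: rancher-rke\nroleRef:\n  apiGroup: rbac.authorization.k8s.io\n  kind: ClusterRole\n  name: cluster-admin\nsubjects:\n  - kind: ServiceAccount\n    name: rancher-rke\n    namespace: kube-system\n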
"},{"location":"en/admin/kpanda/clusters/integrate-rancher-cluster.html#step-2-update-kubeconfig-with-the-rancher-rke-sa-authentication-on-your-local-machine","title":"Step 2: Update kubeconfig with the rancher-rke SA authentication on your local machine","text":"
Perform the following steps on any local node where kubelet is installed:
{cluster-name} : the name of your Rancher cluster.
{APIServer} : the access address of the cluster, usually referring to the IP address of the control node + port \"6443\", such as https://10.X.X.X:6443 .
"},{"location":"en/admin/kpanda/clusters/integrate-rancher-cluster.html#step-3-connect-the-cluster-in-the-suanova-interface","title":"Step 3: Connect the cluster in the Suanova Interface","text":"
Using the kubeconfig file fetched earlier, refer to the Integrate Cluster documentation to integrate the Rancher cluster to the global cluster.
"},{"location":"en/admin/kpanda/clusters/runtime.html","title":"How to choose the container runtime","text":"
The container runtime is an important component in kubernetes to manage the life cycle of containers and container images. Kubernetes made containerd the default container runtime in version 1.19, and removed support for the Dockershim component in version 1.24.
Therefore, compared to the Docker runtime, we recommend using the lightweight containerd as your container runtime, as it has become the current mainstream runtime choice.
In addition, some operating system distribution vendors are not friendly enough for Docker runtime compatibility. The runtime support of different operating systems is as follows:
The Kubernetes community releases a minor version every quarter, and the maintenance cycle of each version is only about 9 months. Major bugs or security vulnerabilities will no longer be fixed after a version reaches end of maintenance. Manually upgrading clusters is cumbersome and places a huge workload on administrators.
In Suanova, you can upgrade the Kubernetes cluster with one click through the web UI interface.
Danger
After the version is upgraded, it will not be possible to roll back to the previous version, please proceed with caution.
Note
Kubernetes versions are denoted as x.y.z , where x is the major version, y is the minor version, and z is the patch version.
Skipping minor versions during a cluster upgrade is not allowed, e.g. a direct upgrade from 1.23 to 1.25 is not possible.
**Integrated clusters do not support version upgrades. If there is no Cluster Upgrade option in the left navigation bar, check whether the cluster is an integrated cluster.**
The global service cluster can only be upgraded through the terminal.
When upgrading a worker cluster, the Management Cluster of the worker cluster should have been connected to the container management module and be running normally.
Click the name of the target cluster in the cluster list.
Then click Cluster Operation and Maintenance -> Cluster Upgrade in the left navigation bar, and click Version Upgrade in the upper right corner of the page.
Select the version that can be upgraded, and enter the cluster name to confirm.
After clicking OK , you can see the upgrade progress of the cluster.
The cluster upgrade is expected to take 30 minutes. You can click the Real-time Log button to view the detailed log of the cluster upgrade.
ConfigMaps store non-confidential data in the form of key-value pairs to achieve the effect of mutual decoupling of configuration data and application code. ConfigMaps can be used as environment variables for containers, command-line parameters, or configuration files in storage volumes.
Note
The data saved in ConfigMaps cannot exceed 1 MiB. If you need to store larger volumes of data, it is recommended to mount a storage volume or use an independent database or file service.
ConfigMaps do not provide confidentiality or encryption. If you want to store encrypted data, it is recommended to use secret, or other third-party tools to ensure the privacy of data.
Click the name of a cluster on the Clusters page to enter Cluster Details .
In the left navigation bar, click ConfigMap and Secret -> ConfigMap , and click the YAML Create button in the upper right corner.
Fill in or paste the configuration file prepared in advance, and then click OK in the lower right corner of the pop-up box.
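For example, a minimal ConfigMap manifest might look like this; the name, namespace, and data values are illustrative:
apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: my-config\n  namespace: default\ndata:\n  SPECIAL_LEVEL: very\n  SPECIAL_TYPE: charm\n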
!!! note
- Click __Import__ to import an existing file locally to quickly create ConfigMaps.\n - After filling in the data, click __Download__ to save the configuration file locally.\n
After the creation is complete, click More on the right side of the ConfigMap to edit YAML, update, export, delete and other operations.
A secret is a resource object used to store and manage sensitive information such as passwords, OAuth tokens, SSH and TLS credentials. Using secrets means you don't need to include sensitive data in your application code.
Secrets can be used in the following scenarios:
Used as an environment variable of the container to provide some necessary information required during the running of the container.
Use secrets as pod data volumes.
As the identity authentication credential for the container registry when the kubelet pulls the container image.
A Kubernetes cluster has been integrated or created, and you can access the cluster's UI interface.
Created a namespace, user, and authorized the user as NS Editor. For details, refer to Namespace Authorization.
"},{"location":"en/admin/kpanda/configmaps-secrets/create-secret.html#create-secret-with-wizard","title":"Create secret with wizard","text":"
Click the name of a cluster on the Clusters page to enter Cluster Details .
In the left navigation bar, click ConfigMap and Secret -> Secret , and click the Create Secret button in the upper right corner.
Fill in the configuration information on the Create Secret page, and click OK .
Note when filling in the configuration:
The name of the secret must be unique within the same namespace.
Secret type:
Default (Opaque): the default Kubernetes secret type, which supports arbitrary user-defined data.
TLS (kubernetes.io/tls): credentials for TLS client or server data access.
Container registry information (kubernetes.io/dockerconfigjson): credentials for container registry access.
Username and password (kubernetes.io/basic-auth): credentials for basic authentication.
Custom: a type customized by the user according to business needs.
Secret data: the data stored in the secret; the parameters to fill in differ by secret type.
When the secret type is default (Opaque)/custom: multiple key-value pairs can be filled in.
When the secret type is TLS (kubernetes.io/tls): you need to fill in the certificate and private key data. Certificates are self-signed or CA-signed credentials used for authentication. A certificate request is a request for a signature and needs to be signed with a private key.
When the secret type is container registry information (kubernetes.io/dockerconfigjson): you need to fill in the account and password of the private container registry.
When the secret type is username and password (kubernetes.io/basic-auth): a username and password need to be specified.
A ConfigMap is a Kubernetes API object used to store non-confidential data as key-value pairs, holding configuration that other objects need to use. A container can consume it as environment variables, command-line arguments, or configuration files in a storage volume. By using ConfigMaps, configuration data and application code can be separated, providing a more flexible way to modify application configuration.
Note
ConfigMaps do not provide confidentiality or encryption. If the data to be stored is confidential, use a Secret or other third-party tools instead of a ConfigMap. In addition, when using a ConfigMap in a container, the container and the ConfigMap must be in the same cluster namespace.
Use cases
You can use ConfigMaps in Pods. There are many use cases, mainly including:
Use ConfigMaps to set the environment variables of the container
Use ConfigMaps to set the command line parameters of the container
Use ConfigMaps as container data volumes
"},{"location":"en/admin/kpanda/configmaps-secrets/use-configmap.html#set-the-environment-variables-of-the-container","title":"Set the environment variables of the container","text":"
You can use the ConfigMap as the environment variable of the container through the graphical interface or the terminal command line.
Note
Import ConfigMaps uses the entire ConfigMap as environment variables; Import ConfigMap Key Values uses a specific key in the ConfigMap as the value of an environment variable.
When creating a workload through an image, you can set environment variables for the container by selecting Import ConfigMaps or Import ConfigMap Key Values on the Environment Variables interface.
Go to the Image Creation Workload page, in the Container Configuration step, select the Environment Variables configuration, and click the Add Environment Variable button.
Select ConfigMap Import or ConfigMap Key Value Import in the environment variable type.
When the environment variable type is selected as ConfigMap import , enter variable name , prefix name, ConfigMap name in sequence.
When the environment variable type is selected as ConfigMap key-value import , enter the variable name, the ConfigMap name, and the key name in sequence.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-configmap.html#command-line-operation","title":"Command line operation","text":"
You can set ConfigMaps as environment variables when creating a workload, using the valueFrom parameter to refer to the Key/Value in the ConfigMap.
Use valueFrom to specify the value of the env reference ConfigMap
Referenced configuration file name
Referenced ConfigMap key
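A minimal sketch of such a Pod, assuming a ConfigMap named special-config with a key special.how (both names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: configmap-env-demo
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "env"]
      env:
        - name: SPECIAL_LEVEL_KEY
          valueFrom:                  # use valueFrom so the env value references a ConfigMap
            configMapKeyRef:
              name: special-config    # referenced ConfigMap name
              key: special.how        # referenced ConfigMap key
```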
"},{"location":"en/admin/kpanda/configmaps-secrets/use-configmap.html#set-the-command-line-parameters-of-the-container","title":"Set the command line parameters of the container","text":"
You can use ConfigMaps to set the command or parameter values in a container through the environment variable substitution syntax $(VAR_NAME), as shown in the sketch below.
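A minimal sketch, again assuming the hypothetical ConfigMap special-config with the key special.how:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: configmap-cmd-demo
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      # $(SPECIAL_LEVEL_KEY) is substituted with the value injected from the ConfigMap
      command: ["sh", "-c", "echo $(SPECIAL_LEVEL_KEY)"]
      env:
        - name: SPECIAL_LEVEL_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: special.how
```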
When creating a workload through an image, you can use a ConfigMap as the data volume of the container by selecting ConfigMap as the storage type on the Data Storage interface.
Go to the Image Creation Workload page, in the Container Configuration step, select the Data Storage configuration, and click the __Add__ button in the __Node Path Mapping__ list.
Select ConfigMap in the storage type, and enter container path , subpath and other information in sequence.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-configmap.html#command-line-operation_1","title":"Command line operation","text":"
To use a ConfigMap in a Pod's storage volume.
Here is an example Pod that mounts a ConfigMap as a volume:
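A minimal sketch, assuming the hypothetical ConfigMap special-config:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: configmap-volume-demo
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: config-volume
          mountPath: /etc/config   # each key in the ConfigMap appears as a read-only file here
  volumes:
    - name: config-volume
      configMap:
        name: special-config       # hypothetical ConfigMap name
```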
If there are multiple containers in a Pod, each container needs its own volumeMounts block, but you only need to set one spec.volumes block per ConfigMap.
Note
When a ConfigMap is used as a data volume mounted on a container, the ConfigMap can only be read as a read-only file.
A secret is a resource object used to store and manage sensitive information such as passwords, OAuth tokens, SSH keys, and TLS certificates. Using secrets means you do not need to include sensitive data in your application code.
Use cases
You can use secrets in Pods in a variety of use cases, mainly including:
As container environment variables, providing information required while the container is running.
As Pod data volumes.
As the authentication credential for the container registry when the kubelet pulls container images.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#use-the-key-to-set-the-environment-variable-of-the-container","title":"Use the key to set the environment variable of the container","text":"
You can use the key as the environment variable of the container through the GUI or the terminal command line.
Note
Key Import uses the entire secret as environment variables; Key Key Value Import uses a specific key in the secret as the value of an environment variable.
When creating a workload from an image, you can set environment variables for the container by selecting Key Import or Key Key Value Import on the Environment Variables interface.
Go to the Image Creation Workload page.
Select the Environment Variables configuration in Container Configuration , and click the Add Environment Variable button.
Select Key Import or Key Key Value Import in the environment variable type.
When the environment variable type is selected as Key Import , enter Variable Name , Prefix , and Secret in sequence.
When the environment variable type is selected as Key Key Value Import , enter the variable name, the Secret name, and the key name in sequence.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#command-line-operation","title":"Command line operation","text":"
As shown in the example below, you can set the secret as an environment variable when creating the workload, using the valueFrom parameter to refer to the Key/Value in the Secret.
In the example, the Secret "mysecret" must already exist and contain keys named "username" and "password".
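A minimal sketch of such a Pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-env-demo
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "env"]
      env:
        - name: SECRET_USERNAME
          valueFrom:
            secretKeyRef:
              name: mysecret   # the Secret "mysecret" must exist and contain a key named "username"
              key: username
        - name: SECRET_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysecret   # the Secret "mysecret" must exist and contain a key named "password"
              key: password
```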
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#use-the-key-as-the-pods-data-volume","title":"Use the key as the pod's data volume","text":""},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#graphical-interface-operation_1","title":"Graphical interface operation","text":"
When creating a workload through an image, you can use the key as the data volume of the container by selecting the storage type as \"key\" on the \"data storage\" interface.
Go to the Image Creation Workload page.
In the Container Configuration , select the Data Storage configuration, and click the Add button in the Node Path Mapping list.
Select Secret in the storage type, and enter container path , subpath and other information in sequence.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#command-line-operation_1","title":"Command line operation","text":"
The following is an example of a Pod that mounts a Secret named mysecret via a data volume:
In the example, the Secret "mysecret" must already exist.
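A minimal sketch of such a Pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-volume-demo
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: secret-volume
          mountPath: /etc/secret
          readOnly: true
  volumes:
    - name: secret-volume
      secret:
        secretName: mysecret   # the Secret "mysecret" must already exist
```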
If the Pod contains multiple containers, each container needs its own volumeMounts block, but only one .spec.volumes setting is required for each Secret.
"},{"location":"en/admin/kpanda/configmaps-secrets/use-secret.html#used-as-the-identity-authentication-credential-for-the-container-registry-when-the-kubelet-pulls-the-container-image","title":"Used as the identity authentication credential for the container registry when the kubelet pulls the container image","text":"
You can use the key as the identity authentication credential for the Container registry through the GUI or the terminal command line.
When creating a workload through an image, you can use the key as the data volume of the container by selecting the storage type as \"key\" on the \"data storage\" interface.
Go to the Image Creation Workload page.
In the second step of Container Configuration , select the Basic Information configuration, and click the Select Image button.
Select the name of the private container registry in the Container Registry drop-down list in the pop-up box. For details on creating a secret for a private registry, see Create Secret.
Enter the image name in the private registry, click OK to complete the image selection.
Note
When creating a secret, make sure you enter the correct container registry address, username, and password, and select the correct image name; otherwise, you will not be able to pull the image from the container registry.
In Kubernetes, all objects are abstracted as resources; Pod, Deployment, Service, Volume, and so on are the default resources provided by Kubernetes. These cover most daily operation and management needs, but in some special cases the existing preset resources cannot meet business requirements. To extend the capabilities of the Kubernetes API, CustomResourceDefinition (CRD) was introduced.
The container management module supports interface-based management of custom resources, and its main features are as follows:
Obtain the list and detailed information of custom resources under the cluster
Create custom resources based on YAML
Create a custom resource example CR (Custom Resource) based on YAML
"},{"location":"en/admin/kpanda/custom-resources/create.html#create-a-custom-resource-example-via-yaml","title":"Create a custom resource example via YAML","text":"
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Custom Resource , and click the YAML Create button in the upper right corner.
Click the custom resource named crontabs.stable.example.com , enter the details, and click the YAML Create button in the upper right corner.
On the Create with YAML page, fill in the YAML statement and click OK .
Return to the details page of crontabs.stable.example.com , and you can view the custom resource named my-new-cron-object just created.
"},{"location":"en/admin/kpanda/gpu/index.html","title":"Overview of GPU Management","text":"
This article introduces the capability of Suanova container management platform in unified operations and management of heterogeneous resources, with a focus on GPUs.
With the rapid development of emerging technologies such as AI applications, large-scale models, artificial intelligence, and autonomous driving, enterprises are facing an increasing demand for compute-intensive tasks and data processing. Traditional compute architectures represented by CPUs can no longer meet the growing computational requirements of enterprises. At this point, heterogeneous computing represented by GPUs has been widely applied due to its unique advantages in processing large-scale data, performing complex calculations, and real-time graphics rendering.
Meanwhile, due to the lack of experience and professional solutions in scheduling and managing heterogeneous resources, the utilization efficiency of GPU devices is extremely low, resulting in high AI production costs for enterprises. The challenge of reducing costs, increasing efficiency, and improving the utilization of GPUs and other heterogeneous resources has become a pressing issue for many enterprises.
"},{"location":"en/admin/kpanda/gpu/index.html#introduction-to-gpu-capabilities","title":"Introduction to GPU Capabilities","text":"
The Suanova container management platform supports unified scheduling and operations management of GPUs, NPUs, and other heterogeneous resources, fully unleashing the computational power of GPU resources, and accelerating the development of enterprise AI and other emerging applications. The GPU management capabilities of Suanova are as follows:
Support for unified management of heterogeneous computing resources from domestic and foreign manufacturers such as NVIDIA, Huawei Ascend, and Iluvatar.
Support for multi-card heterogeneous scheduling within the same cluster, with automatic recognition of GPUs in the cluster.
Support for native management solutions for NVIDIA GPUs, vGPUs, and MIG, with cloud native capabilities.
Support for partitioning a single physical card for use by different tenants, allocating GPU resources to tenants and containers based on computing power and memory quotas.
Support for multi-dimensional GPU resource monitoring at the cluster, node, and application levels, assisting operators in managing GPU resources.
Compatibility with various training frameworks such as TensorFlow and PyTorch.
"},{"location":"en/admin/kpanda/gpu/index.html#introduction-to-gpu-operator","title":"Introduction to GPU Operator","text":"
Similar to regular computer hardware, NVIDIA GPUs, as physical devices, need to have the NVIDIA GPU driver installed in order to be used. To reduce the cost of using GPUs on Kubernetes, NVIDIA provides the NVIDIA GPU Operator component to manage various components required for using NVIDIA GPUs. These components include the NVIDIA driver (for enabling CUDA), NVIDIA container runtime, GPU node labeling, DCGM-based monitoring, and more. In theory, users only need to plug the GPU into a compute device managed by Kubernetes, and they can use all the capabilities of NVIDIA GPUs through the GPU Operator. For more information about NVIDIA GPU Operator, refer to the NVIDIA official documentation. For deployment instructions, refer to Offline Installation of GPU Operator.
Architecture diagram of NVIDIA GPU Operator:
"},{"location":"en/admin/kpanda/gpu/FAQ.html","title":"GPU FAQs","text":""},{"location":"en/admin/kpanda/gpu/FAQ.html#gpu-processes-are-not-visible-while-running-nvidia-smi-inside-a-pod","title":"GPU processes are not visible while running nvidia-smi inside a pod","text":"
Q: When running the nvidia-smi command inside a GPU-utilizing pod, no GPU process information is visible in the full-card mode and vGPU mode.
A: Due to PID namespace isolation, GPU processes are not visible inside the Pod. To view GPU processes, you can use one of the following methods:
Configure the workload using the GPU with hostPID: true to enable viewing PIDs on the host.
Run the nvidia-smi command in the driver pod of the gpu-operator to view processes.
Run the chroot /run/nvidia/driver nvidia-smi command on the host to view processes.
"},{"location":"en/admin/kpanda/gpu/Iluvatar_usage.html","title":"How to Use Iluvatar GPU in Applications","text":"
This section describes how to use Iluvatar virtual GPU on AI platform.
The AI platform container management platform has been deployed and is running smoothly.
The container management module has been integrated with a Kubernetes cluster or a Kubernetes cluster has been created, and the UI interface of the cluster can be accessed.
The Iluvatar GPU driver has been installed on the current cluster. Refer to the Iluvatar official documentation for driver installation instructions, or contact the Suanova ecosystem team for enterprise-level support at peg-pem@daocloud.io.
The GPUs in the current cluster have not undergone any virtualization operations and are not occupied by other applications.
"},{"location":"en/admin/kpanda/gpu/Iluvatar_usage.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/gpu/Iluvatar_usage.html#configuration-via-user-interface","title":"Configuration via User Interface","text":"
Check if the GPU in the cluster has been detected. Click Clusters -> Cluster Settings -> Addon Plugins , and check if the proper GPU type has been automatically enabled and detected. Currently, the cluster will automatically enable GPU and set the GPU type as Iluvatar .
Deploy a workload. Click Clusters -> Workloads and deploy a workload using the image. After selecting the type as (Iluvatar) , configure the GPU resources used by the application:
Physical Card Count (iluvatar.ai/vcuda-core): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host machine.
Memory Usage (iluvatar.ai/vcuda-memory): Indicates the amount of GPU memory occupied by each card. The value is in MB, with a minimum value of 1 and a maximum value equal to the entire memory of the card.
If there are any issues with the configuration values, scheduling failures or resource allocation failures may occur.
"},{"location":"en/admin/kpanda/gpu/Iluvatar_usage.html#configuration-via-yaml","title":"Configuration via YAML","text":"
To request GPU resources for a workload, add iluvatar.ai/vcuda-core: 1 and iluvatar.ai/vcuda-memory: 200 to both the requests and limits sections. These parameters configure the application's use of the physical card resources.
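A minimal sketch of such a Pod (the container image is a placeholder; only the resource keys come from the text above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: iluvatar-vgpu-demo
spec:
  containers:
    - name: demo
      image: nginx   # placeholder image
      resources:
        requests:
          iluvatar.ai/vcuda-core: 1
          iluvatar.ai/vcuda-memory: 200
        limits:
          iluvatar.ai/vcuda-core: 1
          iluvatar.ai/vcuda-memory: 200
```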
"},{"location":"en/admin/kpanda/gpu/dynamic-regulation.html","title":"GPU Scheduling Configuration (Binpack and Spread)","text":"
This page introduces how to reduce GPU resource fragmentation and prevent single points of failure through Binpack and Spread when using NVIDIA vGPU, achieving advanced scheduling for vGPU. The AI platform provides Binpack and Spread scheduling policies across two dimensions, clusters and workloads, meeting different usage requirements in various scenarios.
Scheduling policy based on GPU dimension
Binpack: Prioritizes using the same GPU on a node, suitable for increasing GPU utilization and reducing resource fragmentation.
Spread: Multiple Pods are distributed across different GPUs on nodes, suitable for high availability scenarios to avoid single card failures.
Scheduling policy based on node dimension
Binpack: Multiple Pods prioritize using the same node, suitable for increasing GPU utilization and reducing resource fragmentation.
Spread: Multiple Pods are distributed across different nodes, suitable for high availability scenarios to avoid single node failures.
"},{"location":"en/admin/kpanda/gpu/dynamic-regulation.html#use-binpack-and-spread-at-cluster-level","title":"Use Binpack and Spread at Cluster-Level","text":"
Note
By default, workloads will follow the cluster-level Binpack and Spread. If a workload sets its own Binpack and Spread scheduling policies that differ from the cluster, the workload will prioritize its own scheduling policy.
On the Clusters page, select the cluster for which you want to adjust the Binpack and Spread scheduling policies. Click the ┇ icon on the right and select GPU Scheduling Configuration from the dropdown list.
Adjust the GPU scheduling configuration according to your business scenario, and click OK to save.
"},{"location":"en/admin/kpanda/gpu/dynamic-regulation.html#use-binpack-and-spread-at-workload-level","title":"Use Binpack and Spread at Workload-Level","text":"
Note
When the Binpack and Spread scheduling policies at the workload level conflict with the cluster-level configuration, the workload-level configuration takes precedence.
Follow the steps below to create a deployment using an image and configure Binpack and Spread scheduling policies within the workload.
Click Clusters in the left navigation bar, then click the name of the target cluster to enter the Cluster Details page.
On the Cluster Details page, click Workloads -> Deployments in the left navigation bar, then click the Create by Image button in the upper right corner of the page.
Fill in the Basic Information and Container Settings in sequence. In the Container Configuration section, enable GPU configuration and select NVIDIA vGPU as the GPU type. Click Advanced Settings, enable the Binpack / Spread scheduling policy, and adjust the GPU scheduling configuration according to the business scenario. After configuration, click Next to proceed to Service Settings and Advanced Settings. Finally, click OK at the bottom right of the page to complete the creation.
"},{"location":"en/admin/kpanda/gpu/gpu_matrix.html","title":"GPU Support Matrix","text":"
This page explains the matrix of supported GPUs and operating systems for AI platform.
"},{"location":"en/admin/kpanda/gpu/gpu_matrix.html#nvidia-gpu","title":"NVIDIA GPU","text":"GPU Manufacturer and Type Supported GPU Models Compatible Operating System (Online) Recommended Kernel Recommended Operating System and Kernel Installation Documentation NVIDIA GPU (Full Card/vGPU)
NVIDIA Fermi (2.1) Architecture:
NVIDIA GeForce 400 Series
NVIDIA Quadro 4000 Series
NVIDIA Tesla 20 Series
NVIDIA Ampere Architecture Series (A100; A800; H100)
CentOS 7
Kernel 3.10.0-123 ~ 3.10.0-1160
Kernel Reference Document
Recommended Operating System with Proper Kernel Version
This document mainly introduces the configuration of GPU scheduling, which can implement advanced scheduling policies. Currently, the primary implementation is the vgpu scheduling policy.
vGPU provides two policies for resource usage: binpack and spread. These correspond to node-level and GPU-level dimensions, respectively. The use case is whether you want to distribute workloads more sparsely across different nodes and GPUs or concentrate them on the same node and GPU, thereby making resource utilization more efficient and reducing resource fragmentation.
You can modify the scheduling policy in your cluster by following these steps:
Go to the cluster management list in the container management interface.
Click the settings button ... next to the cluster.
Click GPU Scheduling Configuration.
Toggle the scheduling policy between node-level and GPU-level. By default, the node-level policy is binpack, and the GPU-level policy is spread.
The above steps modify the cluster-level scheduling policy. Users can also specify their own scheduling policy at the workload level to change the scheduling results. Below is an example of modifying the scheduling policy at the workload level:
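A sketch of a Deployment that sets its own policy through Pod annotations; the annotation keys below are assumptions based on the HAMi vGPU scheduler, so verify the keys your platform version actually supports:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vgpu-binpack-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vgpu-binpack-demo
  template:
    metadata:
      labels:
        app: vgpu-binpack-demo
      annotations:
        hami.io/node-scheduler-policy: "binpack"   # assumed key: node-level policy
        hami.io/gpu-scheduler-policy: "binpack"    # assumed key: GPU-level policy
    spec:
      containers:
        - name: demo
          image: nginx   # placeholder image
          resources:
            limits:
              nvidia.com/vgpu: 1
```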
In this example, both the node- and GPU-level scheduling policies are set to binpack. This ensures that the workload is scheduled to maximize resource utilization and reduce fragmentation.
Follow these steps to manage GPU quotas in AI platform:
Go to Namespaces and click Quota Management to configure the GPU resources that can be used by a specific namespace.
The currently supported card types for quota management in a namespace are: NVIDIA vGPU, NVIDIA MIG, Iluvatar, and Ascend.
NVIDIA vGPU Quota Management: Configure the specific quota that can be used. This will create a ResourcesQuota CR.
- Physical Card Count (nvidia.com/vgpu): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and **less than or equal to** the number of cards on the host machine.
- GPU Core Count (nvidia.com/gpucores): Indicates the GPU compute power occupied by each card. The value ranges from 0 to 100. If configured as 0, it is considered not to enforce isolation. If configured as 100, it is considered to exclusively occupy the entire card.
- GPU Memory Usage (nvidia.com/gpumem): Indicates the amount of GPU memory occupied by each card. The value is in MB, with a minimum value of 1 and a maximum value equal to the entire memory of the card.
"},{"location":"en/admin/kpanda/gpu/ascend/ascend_driver_install.html","title":"Installation of Ascend NPU Components","text":"
This chapter provides installation guidance for Ascend NPU drivers, Device Plugin, NPU-Exporter, and other components.
Before using NPU resources, you need to complete the firmware installation, NPU driver installation, Docker Runtime installation, user creation, log directory creation, and NPU Device Plugin installation. Refer to the following steps for details.
Confirm that the kernel version is within the range proper to the "binary installation" method, and then you can directly install the NPU driver firmware.
For firmware and driver downloads, refer to: Firmware Download Link
For firmware installation, refer to: Install NPU Driver Firmware
If the driver is not installed, refer to the official Ascend documentation for installation. For example, for Ascend910, refer to: 910 Driver Installation Document.
Run the command npu-smi info, and if the NPU information is returned normally, it indicates that the NPU driver and firmware are ready.
Create the parent directory for component logs and the log directories for each component on the proper node, and set the appropriate owner and permissions for the directories. Execute the following command to create the parent directory for component logs.
Please create the proper log directory for each required component. In this example, only the Device Plugin component is needed. For other component requirements, refer to the official documentation
Refer to the following commands to create labels on the proper nodes:
# Create these labels on computing nodes where the driver is installed
kubectl label node {nodename} huawei.com.ascend/Driver=installed
kubectl label node {nodename} node-role.kubernetes.io/worker=worker
kubectl label node {nodename} workerselector=dls-worker-node
kubectl label node {nodename} host-arch=huawei-arm    # or host-arch=huawei-x86, select according to the actual situation
kubectl label node {nodename} accelerator=huawei-Ascend910    # select according to the actual situation
# Create this label on control nodes
kubectl label node {nodename} masterselector=dls-master-node
"},{"location":"en/admin/kpanda/gpu/ascend/ascend_driver_install.html#install-device-plugin-and-npuexporter","title":"Install Device Plugin and NpuExporter","text":"
Functional module path: Container Management -> Cluster, click the name of the target cluster, then click Helm Apps -> Helm Charts from the left navigation bar, and search for ascend-mindxdl.
DevicePlugin: Provides a general device plugin mechanism and standard device API interface for Kubernetes to use devices. It is recommended to use the default image and version.
NpuExporter: Based on the Prometheus/Telegraf ecosystem, this component provides interfaces to help users monitor the Ascend series AI processors and container-level allocation status. It is recommended to use the default image and version.
ServiceMonitor: Disabled by default. If enabled, you can view NPU-related monitoring in the observability module. To enable, ensure that the insight-agent is installed and running, otherwise, the ascend-mindxdl installation will fail.
isVirtualMachine: Disabled by default. If the NPU node is a virtual machine scenario, enable the isVirtualMachine parameter.
After a successful installation, two components will appear under the proper namespace, as shown below:
At the same time, the proper NPU information will also appear on the node information:
Once everything is ready, you can select the proper NPU device when creating a workload through the page, as shown below:
Note
For detailed information of how to use, refer to Using Ascend (Ascend) NPU.
Ascend virtualization is divided into dynamic virtualization and static virtualization. This document describes how to enable and use Ascend static virtualization capabilities.
To enable virtualization capabilities, you need to manually modify the startup parameters of the ascend-device-plugin-daemonset component. Refer to the following command:
After splitting the instance, manually restart the device-plugin pod, then use the kubectl describe command to check the resources of the registered node:
kubectl describe node {{nodename}}\n
"},{"location":"en/admin/kpanda/gpu/ascend/vnpu.html#how-to-use-the-device","title":"How to Use the Device","text":"
When creating an application, specify the resource key as shown in the following YAML:
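A minimal sketch; the resource key huawei.com/Ascend910-4c is only an example, since the exact key depends on the vNPU split template registered by the device plugin (check the resources shown by kubectl describe node), and the image is a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vnpu-demo
spec:
  containers:
    - name: demo
      image: ubuntu:20.04   # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          huawei.com/Ascend910-4c: 1   # example vNPU resource key
```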
"},{"location":"en/admin/kpanda/gpu/metax/usemetax.html","title":"MetaX GPU Component Installation and Usage","text":"
This chapter provides installation guidance for MetaX's gpu-extensions, gpu-operator, and other components, as well as usage methods for both the full GPU and vGPU modes.
The required tar package has been downloaded and installed from the MetaX Software Center. This article uses metax-gpu-k8s-package.0.7.10.tar.gz as an example.
Metax provides two helm-chart packages: metax-extensions and gpu-operator. Depending on the usage scenario, different components can be selected for installation.
Metax-extensions: Includes two components, gpu-device and gpu-label. When using the Metax-extensions solution, the user's application container image needs to be built based on the MXMACA® base image. Moreover, Metax-extensions is only suitable for scenarios using the full GPU.
gpu-operator: Includes components such as gpu-device, gpu-label, driver-manager, container-runtime, and operator-controller. When using the gpu-operator solution, users can choose to create application container images that do not include the MXMACA® SDK. The gpu-operator is suitable for both full GPU and vGPU scenarios.
If no content is displayed, it indicates that the software package has not been installed. If content is displayed, it indicates that the software package has been installed.
When using metax-operator, it is not recommended to pre-install the MXMACA kernel driver on worker nodes; if it has already been installed, there is no need to uninstall it.
The images for the components metax-operator, gpu-label, gpu-device, and container-runtime must have the amd64 suffix.
The image for the metax-maca component is not included in the metax-k8s-images.0.7.13.run package and needs to be separately downloaded, such as maca-mxc500-2.23.0.23-ubuntu20.04-x86_64.tar.xz. After loading it, the image for the metax-maca component needs to be modified again.
The image for the metax-driver component needs to be downloaded from https://pub-docstore.metax-tech.com:7001 as the k8s-driver-image.2.23.0.25.run file, and then execute the command k8s-driver-image.2.23.0.25.run push {registry}/metax to push the image to the image repository. After pushing, modify the image address for the metax-driver component.
The Suanova AI computing platform's container management platform has been deployed and is running normally.
The container management module has either integrated with a Kubernetes cluster or created a Kubernetes cluster, and is able to access the cluster's UI interface.
The current cluster has installed the Cambricon firmware, drivers, and DevicePlugin components. For installation details, please refer to the official documentation:
Driver Firmware Installation
DevicePlugin Installation
When installing DevicePlugin, please disable the --enable-device-type parameter; otherwise, the Suanova AI computing platform will not be able to correctly recognize the Cambricon GPU.
"},{"location":"en/admin/kpanda/gpu/mlu/use-mlu.html#introduction-to-cambricon-gpu-modes","title":"Introduction to Cambricon GPU Modes","text":"
Cambricon GPUs have the following modes:
Full Card Mode: Register the Cambricon GPU as a whole card for use in the cluster.
Share Mode: Allows one Cambricon GPU to be shared among multiple Pods, with the number of shareable containers set by the virtualization-num parameter.
Dynamic SMLU Mode: Further refines resource allocation, allowing control over the size of memory and computing power allocated to containers.
MIM Mode: Allows the Cambricon GPU to be divided into multiple GPUs of fixed specifications for use.
"},{"location":"en/admin/kpanda/gpu/mlu/use-mlu.html#using-cambricon-in-suanova-ai-computing-platform","title":"Using Cambricon in Suanova AI Computing Platform","text":"
Here, we take the Dynamic SMLU mode as an example:
After correctly installing the DevicePlugin and other components, click the proper Cluster -> Cluster Maintenance -> Cluster Settings -> Addon Plugins to check whether the proper GPU type has been automatically enabled and detected.
Click the node management page to check if the nodes have correctly recognized the proper GPU type.
Deploy workloads. Click the proper Cluster -> Workloads, and deploy workloads using images. After selecting the type (MLU VGPU), you need to configure the GPU resources used by the App:
GPU Computing Power (cambricon.com/mlu.smlu.vcore): Indicates the percentage of cores the current Pod needs to use.
GPU Memory (cambricon.com/mlu.smlu.vmemory): Indicates the size of memory the current Pod needs to use, in MB.
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  containers:
    - image: ubuntu:16.04
      name: pod1-ctr
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          cambricon.com/mlu: "1" # use this when device type is not enabled, else delete this line.
          #cambricon.com/mlu: "1" # uncomment to use when device type is enabled
          #cambricon.com/mlu.share: "1" # uncomment to use the device in env-share mode
          #cambricon.com/mlu.mim-2m.8gb: "1" # uncomment to use the device in MIM mode
          #cambricon.com/mlu.smlu.vcore: "100" # uncomment to use the device in dynamic SMLU mode
          #cambricon.com/mlu.smlu.vmemory: "1024" # uncomment to use the device in dynamic SMLU mode
NVIDIA, as a well-known graphics computing provider, offers various software and hardware solutions to enhance computational power. Among them, NVIDIA provides the following three solutions for GPU usage:
Full GPU refers to allocating the entire NVIDIA GPU to a single user or application. In this configuration, the application can fully occupy all the resources of the GPU and achieve maximum computational performance. Full GPU is suitable for workloads that require a large amount of computational resources and memory, such as deep learning training, scientific computing, etc.
vGPU is a virtualization technology that allows one physical GPU to be partitioned into multiple virtual GPUs, with each virtual GPU assigned to different virtual machines or users. vGPU enables multiple users to share the same physical GPU and independently use GPU resources in their respective virtual environments. Each virtual GPU can access a certain amount of compute power and memory capacity. vGPU is suitable for virtualized environments and cloud computing scenarios, providing higher resource utilization and flexibility.
MIG is a feature introduced by the NVIDIA Ampere architecture that allows one physical GPU to be divided into multiple physical GPU instances, each of which can be independently allocated to different users or workloads. Each MIG instance has its own compute resources, memory, and PCIe bandwidth, just like an independent virtual GPU. MIG provides finer-grained GPU resource allocation and management and allows dynamic adjustment of the number and size of instances based on demand. MIG is suitable for multi-tenant environments, containerized applications, batch jobs, and other scenarios.
Whether using vGPU in a virtualized environment or MIG on a physical GPU, NVIDIA provides users with more choices and optimized ways to utilize GPU resources. The Suanova container management platform fully supports the above NVIDIA capabilities. Users can easily access the full computational power of NVIDIA GPUs through simple UI operations, thereby improving resource utilization and reducing costs.
Single Mode: The node only exposes a single type of MIG device on all its GPUs. All GPUs on the node must:
Be of the same model (e.g., A100-SXM-40GB), with matching MIG profiles only for GPUs of the same model.
Have MIG configuration enabled, which requires a machine reboot to take effect.
Create identical GI and CI for exposing "identical" MIG devices across all products.
Mixed Mode: The node exposes mixed MIG device types on all its GPUs. Requesting a specific MIG device type requires the number of compute slices and total memory provided by the device type.
All GPUs on the node must: Be in the same product line (e.g., A100-SXM-40GB).
Each GPU can enable or disable MIG individually and freely configure any available mixture of MIG device types.
The k8s-device-plugin running on the node will:
Expose any GPUs not in MIG mode using the traditional nvidia.com/gpu resource type.
Expose individual MIG devices using resource types that follow the pattern nvidia.com/mig-<slice_count>g.<memory_size>gb .
For detailed instructions on enabling these configurations, refer to Offline Installation of GPU Operator.
"},{"location":"en/admin/kpanda/gpu/nvidia/index.html#how-to-use","title":"How to Use","text":"
You can refer to the following links to quickly start using Suanova's management capabilities for NVIDIA GPUs.
Using Full NVIDIA GPU
Using NVIDIA vGPU
Using NVIDIA MIG
"},{"location":"en/admin/kpanda/gpu/nvidia/full_gpu_userguide.html","title":"Using the Whole NVIDIA GPU for an Application","text":"
This section describes how to allocate an entire NVIDIA GPU to a single application on the AI platform.
AI platform container management platform has been deployed and is running properly.
The container management module has been connected to a Kubernetes cluster or a Kubernetes cluster has been created, and you can access the UI interface of the cluster.
GPU Operator has been offline installed and NVIDIA DevicePlugin has been enabled on the current cluster. Refer to Offline Installation of GPU Operator for instructions.
The GPU in the current cluster has not undergone any virtualization operations or been occupied by other applications.
"},{"location":"en/admin/kpanda/gpu/nvidia/full_gpu_userguide.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/gpu/nvidia/full_gpu_userguide.html#configuring-via-the-user-interface","title":"Configuring via the User Interface","text":"
Check if the cluster has detected the GPUs. Click Clusters -> Cluster Settings -> Addon Plugins to see if it has automatically enabled and detected the proper GPU types. Currently, the cluster will automatically enable GPU and set the GPU Type as Nvidia GPU .
Deploy a workload. Click Clusters -> Workloads , and deploy the workload using the image method. After selecting the type ( Nvidia GPU ), configure the number of physical cards used by the application:
Physical Card Count (nvidia.com/gpu): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host machine.
If the above value is configured incorrectly, scheduling failures and resource allocation issues may occur.
"},{"location":"en/admin/kpanda/gpu/nvidia/full_gpu_userguide.html#configuring-via-yaml","title":"Configuring via YAML","text":"
To request GPU resources for a workload, add the nvidia.com/gpu: 1 parameter to the resource request and limit configuration in the YAML file. This parameter configures the number of physical cards used by the application.
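A minimal sketch of such a Pod (the container image is only an example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: full-gpu-demo
spec:
  containers:
    - name: demo
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example CUDA image
      command: ["sleep", "3600"]
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
```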
AI platform comes with pre-installed driver images for the following three operating systems: Ubuntu 22.04, Ubuntu 20.04, and CentOS 7.9. The driver version is 535.104.12. Additionally, it includes the required Toolkit images for each operating system, so users no longer need to manually provide offline toolkit images.
This page demonstrates using AMD architecture with CentOS 7.9 (3.10.0-1160). If you need to deploy on Red Hat 8.4, refer to Uploading Red Hat gpu-operator Offline Image to the Bootstrap Node Repository and Building Offline Yum Source for Red Hat 8.4.
The kernel version of the cluster nodes where the gpu-operator is to be deployed must be completely consistent. The distribution and GPU model of the nodes must fall within the scope specified in the GPU Support Matrix.
When installing the gpu-operator, select v23.9.0+2 or above.
systemOS : Select the operating system for the host. The current options are Ubuntu 22.04, Ubuntu 20.04, Centos 7.9, and other. Please choose the correct operating system.
Namespace : Select the namespace for installing the plugin
Version: The version of the plugin. Here, we use version v23.9.0+2 as an example.
Failure Deletion: If the installation fails, it will delete the already installed associated resources. When enabled, Ready Wait will also be enabled by default.
Ready Wait: When enabled, the application will be marked as successfully installed only when all associated resources are in a ready state.
Detailed Logs: When enabled, detailed logs of the installation process will be recorded.
Driver.enable : Configure whether to deploy the NVIDIA driver on the node, default is enabled. If you have already deployed the NVIDIA driver on the node before using the gpu-operator, please disable this.
Driver.repository : Repository where the GPU driver image is located, default is nvidia's nvcr.io repository.
Driver.usePrecompiled : Enable the precompiled mode to install the driver.
Driver.version : Version of the GPU driver image, use default parameters for offline deployment. Configuration is only required for online installation. Different versions of the Driver image exist for different types of operating systems. For more details, refer to Nvidia GPU Driver Versions. Examples of Driver Version for different operating systems are as follows:
Note
When using the built-in operating system version, there is no need to modify the image version. For other operating system versions, refer to Uploading Images to the Bootstrap Node Repository. Note that the version number should not include an operating system name such as Ubuntu, CentOS, or Red Hat; if the official image contains an operating system suffix, remove it manually.
For Red Hat systems, for example, 525.105.17
For Ubuntu systems, for example, 535-5.15.0-1043-nvidia
For CentOS systems, for example, 525.147.05
Driver.RepoConfig.ConfigMapName : Used to record the name of the offline yum repository configuration file for the gpu-operator. When using the pre-packaged offline bundle, refer to the following documents for different types of operating systems.
For detailed configuration methods, refer to Enabling MIG Functionality.
MigManager.Config.name : The name of the MIG split configuration file, used to define the MIG (GI, CI) split policy. The default is default-mig-parted-config . For custom parameters, refer to Enabling MIG Functionality.
After completing the configuration and creation of the above parameters:
If using full-card mode , GPU resources can be used when creating applications.
If using vGPU mode , after completing the above configuration and creation, proceed to vGPU Addon Installation.
If using MIG mode , you can specify a split specification for individual GPU nodes as needed; otherwise, GPUs are split according to the default value in MigManager.Config.
After splitting, applications can use MIG GPU resources.
"},{"location":"en/admin/kpanda/gpu/nvidia/push_image_to_repo.html","title":"Uploading Red Hat GPU Operator Offline Image to Bootstrap Repository","text":"
This guide explains how to upload an offline image to the bootstrap repository using the nvcr.io/nvidia/driver:525.105.17-rhel8.4 offline driver image for Red Hat 8.4 as an example.
The bootstrap node and its components are running properly.
Prepare a node that has internet access and can access the bootstrap node. Docker should also be installed on this node. You can refer to Installing Docker for installation instructions.
"},{"location":"en/admin/kpanda/gpu/nvidia/push_image_to_repo.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/gpu/nvidia/push_image_to_repo.html#step-1-obtain-the-offline-image-on-an-internet-connected-node","title":"Step 1: Obtain the Offline Image on an Internet-Connected Node","text":"
Perform the following steps on the internet-connected node:
Pull the nvcr.io/nvidia/driver:525.105.17-rhel8.4 offline driver image:
Once the image is pulled, save it as a compressed archive named nvidia-driver.tar :
docker save nvcr.io/nvidia/driver:525.105.17-rhel8.4 > nvidia-driver.tar\n
Copy the compressed image archive nvidia-driver.tar to the bootstrap node:
scp nvidia-driver.tar user@ip:/root\n
For example:
scp nvidia-driver.tar root@10.6.175.10:/root\n
"},{"location":"en/admin/kpanda/gpu/nvidia/push_image_to_repo.html#step-2-push-the-image-to-the-bootstrap-repository","title":"Step 2: Push the Image to the Bootstrap Repository","text":"
Perform the following steps on the bootstrap node:
Log in to the bootstrap node and import the compressed image archive nvidia-driver.tar :
docker load -i nvidia-driver.tar\n
View the imported image:
docker images -a | grep nvidia\n
Expected output:
nvcr.io/nvidia/driver e3ed7dee73e9 1 days ago 1.02GB\n
Retag the image to correspond to the target repository in the remote Registry repository:
docker tag <image-name> <registry-url>/<repository-name>:<tag>\n
Replace <image-name> with the name of the Nvidia image from the previous step, <registry-url> with the address of the Registry service on the bootstrap node, <repository-name> with the name of the repository you want to push the image to, and <tag> with the desired tag for the image.
For example:
docker tag nvcr.io/nvidia/driver 10.6.10.5/nvcr.io/nvidia/driver:525.105.17-rhel8.4\n
Check the GPU Driver image version applicable to your kernel at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags. Use your kernel version to find the matching image version, then pull and save the image using ctr.
ctr i pull nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i export --all-platforms driver.tar.gz nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
Import the image into the cluster's container registry
ctr i import driver.tar.gz
ctr i tag nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 {your_registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04
ctr i push {your_registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 --skip-verify=true
"},{"location":"en/admin/kpanda/gpu/nvidia/ubuntu22.04_offline_install_driver.html#install-the-driver","title":"Install the Driver","text":"
Install the gpu-operator addon and set driver.usePrecompiled=true
Set driver.version=535, note that it should be 535, not 535.104.12
The AI platform comes with a pre-installed GPU Operator offline package for CentOS 7.9 with kernel version 3.10.0-1160. For other OS types or kernel versions, users need to manually build an offline yum source.
This guide explains how to build an offline yum source for CentOS 7.9 with a specific kernel version and use it when installing the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
The user has already installed the v0.12.0 or later version of the addon offline package on the platform.
Prepare a file server that is accessible from the cluster network, such as Nginx or MinIO.
Prepare a node that has internet access, can access the cluster where the GPU Operator will be deployed, and can access the file server. Docker should also be installed on this node. You can refer to Installing Docker for installation instructions.
This guide uses CentOS 7.9 with kernel version 3.10.0-1160.95.1.el7.x86_64 as an example to explain how to upgrade the pre-installed GPU Operator offline package's yum source.
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#check-os-and-kernel-versions-of-cluster-nodes","title":"Check OS and Kernel Versions of Cluster Nodes","text":"
Run the following commands on both the control node of the Global cluster and the node where GPU Operator will be deployed. If the OS and kernel versions of the two nodes are consistent, there is no need to build a yum source. You can directly refer to the Offline Installation of GPU Operator document for installation. If the OS or kernel versions of the two nodes are not consistent, please proceed to the next step.
Run the following command to view the distribution name and version of the node where GPU Operator will be deployed in the cluster.
cat /etc/redhat-release\n
Expected output:
CentOS Linux release 7.9 (Core)\n
The output shows the current node's OS version as CentOS 7.9.
Run the following command to view the kernel version of the node where GPU Operator will be deployed in the cluster.
uname -a\n
Expected output:
Linux localhost.localdomain 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux\n
The output shows the current node's kernel version as 3.10.0-1160.95.1.el7.x86_64.
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#create-the-offline-yum-source","title":"Create the Offline Yum Source","text":"
Perform the following steps on a node that has internet access and can access the file server:
Create a script file named yum.sh by running the following command:
vi yum.sh\n
Then press the i key to enter insert mode and enter the following content:
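A sketch of what yum.sh could contain; the exact package list is an assumption and should be adjusted to what the GPU Operator driver build requires in your environment:

```bash
#!/bin/bash
# Build an offline yum repo named centos-base for the kernel version passed as $1
set -e
TARGET_KERNEL_VERSION=$1   # e.g. 3.10.0-1160.95.1

yum install -y yum-utils createrepo
mkdir -p centos-base

# Download the kernel development packages matching the target kernel (package list is an assumption)
yumdownloader --resolve --destdir=./centos-base \
  kernel-devel-${TARGET_KERNEL_VERSION}.el7.x86_64 \
  kernel-headers-${TARGET_KERNEL_VERSION}.el7.x86_64 \
  gcc make elfutils-libelf-devel

# Generate the yum repository metadata
createrepo ./centos-base
```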
Press the Esc key to exit insert mode, then enter :wq to save and exit.
Run the yum.sh file:
bash -x yum.sh TARGET_KERNEL_VERSION\n
The TARGET_KERNEL_VERSION parameter is used to specify the kernel version of the cluster nodes.
Note: You don't need to include the distribution identifier (e.g., __.el7.x86_64__ ). For example:
bash -x yum.sh 3.10.0-1160.95.1\n
Now you have generated an offline yum source, centos-base , for the kernel version 3.10.0-1160.95.1.el7.x86_64 .
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#upload-the-offline-yum-source-to-the-file-server","title":"Upload the Offline Yum Source to the File Server","text":"
Perform the following steps on a node that has internet access and can access the file server. This step is used to upload the generated yum source from the previous step to a file server that can be accessed by the cluster where the GPU Operator will be deployed. The file server can be Nginx, MinIO, or any other file server that supports the HTTP protocol.
In this example, we will use the built-in MinIO as the file server. The MinIO details are as follows:
Run the following command in the current directory of the node to establish a connection between the node's local mc command-line tool and the MinIO server:
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123\n
The expected output should resemble the following:
Added __minio__ successfully.\n
mc is the command-line tool provided by MinIO for interacting with the MinIO server. For more details, refer to the MinIO Client documentation.
In the current directory of the node, create a bucket named centos-base :
mc mb -p minio/centos-base\n
The expected output should resemble the following:
Bucket created successfully __minio/centos-base__ .\n
Set the access policy of the bucket centos-base to allow public download. This will enable access during the installation of the GPU Operator:
mc anonymous set download minio/centos-base\n
The expected output should resemble the following:
Access permission for __minio/centos-base__ is set to __download__ \n
In the current directory of the node, copy the generated centos-base offline yum source to the minio/centos-base bucket on the MinIO server:
mc cp centos-base minio/centos-base --recursive\n
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#create-a-configmap-to-store-the-yum-source-info-in-the-cluster","title":"Create a ConfigMap to Store the Yum Source Info in the Cluster","text":"
Perform the following steps on the control node of the cluster where the GPU Operator will be deployed.
Run the following command to create a file named CentOS-Base.repo that specifies the configmap for the yum source storage:
# The file name must be CentOS-Base.repo, otherwise it cannot be recognized during the installation of the GPU Operator
cat > CentOS-Base.repo << EOF
[extension-0]
baseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address where the yum source is placed in step 3
gpgcheck = 0
name = kubean extension 0

[extension-1]
baseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address where the yum source is placed in step 3
gpgcheck = 0
name = kubean extension 1
EOF
Based on the created CentOS-Base.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
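For example, assuming CentOS-Base.repo is in the current directory, the ConfigMap can be created like this:

```bash
kubectl create configmap local-repo-config -n gpu-operator \
  --from-file=CentOS-Base.repo=./CentOS-Base.repo
```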
The expected output should resemble the following:
configmap/local-repo-config created\n
The local-repo-config configmap will be used to provide the value for the RepoConfig.ConfigMapName parameter during the installation of the GPU Operator. You can customize the configuration file name.
View the content of the local-repo-config configmap:
kubectl get configmap local-repo-config -n gpu-operator -oyaml\n
The expected output should resemble the following:
apiVersion: v1
data:
  CentOS-Base.repo: "[extension-0]\nbaseurl = http://10.6.232.5:32618/centos-base # The file server path where the yum source is placed in step 2\ngpgcheck = 0\nname = kubean extension 0\n \n[extension-1]\nbaseurl = http://10.6.232.5:32618/centos-base # The file server path where the yum source is placed in step 2\ngpgcheck = 0\nname = kubean extension 1\n"
kind: ConfigMap
metadata:
  creationTimestamp: "2023-10-18T01:59:02Z"
  name: local-repo-config
  namespace: gpu-operator
  resourceVersion: "59445080"
  uid: c5f0ebab-046f-442c-b932-f9003e014387
You have successfully created an offline yum source configuration file for the cluster where the GPU Operator will be deployed. You can use it during the offline installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html","title":"Building Red Hat 8.4 Offline Yum Source","text":"
The AI platform comes with pre-installed CentOS v7.9 and GPU Operator offline packages with kernel v3.10.0-1160. For other OS types or nodes with different kernels, users need to manually build the offline yum source.
This guide explains how to build an offline yum source package for Red Hat 8.4 based on any node in the Global cluster. It also demonstrates how to use it during the installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
The user has already installed the addon offline package v0.12.0 or higher on the platform.
The OS of the cluster nodes where the GPU Operator will be deployed must be Red Hat v8.4, and the kernel version must be identical.
Prepare a file server that can communicate with the cluster network where the GPU Operator will be deployed, such as Nginx or MinIO.
Prepare a node that can access the internet, the cluster where the GPU Operator will be deployed, and the file server. Ensure that Docker is already installed on this node.
The nodes in the Global cluster must be Red Hat 8.4 4.18.0-305.el8.x86_64.
This guide uses a node with Red Hat 8.4 4.18.0-305.el8.x86_64 as an example to demonstrate how to build an offline yum source package for Red Hat 8.4 based on any node in the Global cluster. It also explains how to use it during the installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-1-download-the-yum-source-from-the-bootstrap-node","title":"Step 1: Download the Yum Source from the Bootstrap Node","text":"
Perform the following steps on the master node of the Global cluster.
Use SSH or any other method to access any node in the Global cluster and run the following command:
cat /etc/yum.repos.d/extension.repo # View the contents of extension.repo
The expected output should resemble the following:
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-2-download-the-elfutils-libelf-devel-0187-4el8x86_64rpm-package","title":"Step 2: Download the elfutils-libelf-devel-0.187-4.el8.x86_64.rpm Package","text":"
Perform the following steps on a node with internet access. Before proceeding, ensure that there is network connectivity between the node with internet access and the master node of the Global cluster.
Run the following command on the node with internet access to download the elfutils-libelf-devel-0.187-4.el8.x86_64.rpm package:
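One way to do this, assuming the internet-connected node is itself an EL8 machine with yum-utils installed, is yumdownloader; otherwise, download the RPM from any trusted Red Hat 8.4 mirror:

yum install -y yum-utils
yumdownloader --resolve elfutils-libelf-devel-0.187-4.el8.x86_64   # downloads the RPM to the current directory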
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-3-generate-the-local-yum-repository","title":"Step 3: Generate the Local Yum Repository","text":"
Perform the following steps on the master node of the Global cluster mentioned in Step 1.
Enter the yum repository directories:
cd ~/redhat-base-repo/extension-1/Packages
cd ~/redhat-base-repo/extension-2/Packages

Generate the repository index in each of the directories:

createrepo_c ./
You have now generated the offline yum source named redhat-base-repo for kernel version 4.18.0-305.el8.x86_64 .
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-4-upload-the-local-yum-repository-to-the-file-server","title":"Step 4: Upload the Local Yum Repository to the File Server","text":"
In this example, we will use Minio, which is built-in as the file server in the bootstrap node. However, you can choose any file server that suits your needs. Here are the details for Minio:
Access URL: http://10.5.14.200:9000 (usually the {bootstrap-node-IP} + {port-9000})
Login username: rootuser
Login password: rootpass123
On the current node, establish a connection between the local mc command-line tool and the Minio server by running the following command:
mc config host add minio <file_server_access_url> <username> <password>

For example:

mc config host add minio http://10.5.14.200:9000 rootuser rootpass123

The expected output should be similar to:

Added minio successfully.
The mc command-line tool is provided by the Minio file server as a client command-line tool. For more details, refer to the MinIO Client documentation.
Create a bucket named redhat-base in the current location:
mc mb -p minio/redhat-base

The expected output should be similar to:

Bucket created successfully minio/redhat-base.

Set the access policy of the redhat-base bucket to allow public downloads so that it can be accessed during the installation of the GPU Operator:

mc anonymous set download minio/redhat-base

The expected output should be similar to:

Access permission for minio/redhat-base is set to download.

Copy the offline yum repository files ( redhat-base-repo ) from the current location to the Minio server's minio/redhat-base bucket:

mc cp redhat-base-repo minio/redhat-base --recursive
"},{"location":"en/admin/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-5-create-a-configmap-to-store-yum-repository-information-in-the-cluster","title":"Step 5: Create a ConfigMap to Store Yum Repository Information in the Cluster","text":"
Perform the following steps on the control node of the cluster where you will deploy the GPU Operator.
Run the following command to create a file named redhat.repo , which specifies the configuration information for the yum repository storage:
# The file name must be redhat.repo, otherwise it won't be recognized when installing gpu-operator
cat > redhat.repo << EOF
[extension-0]
baseurl = http://10.5.14.200:9000/redhat-base/redhat-base-repo/Packages # The file server address where the yum source is stored in Step 1
gpgcheck = 0
name = kubean extension 0

[extension-1]
baseurl = http://10.5.14.200:9000/redhat-base/redhat-base-repo/Packages # The file server address where the yum source is stored in Step 1
gpgcheck = 0
name = kubean extension 1
EOF
Based on the created redhat.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
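A typical command for this step, assuming the redhat.repo file is in the current directory and the gpu-operator namespace already exists:

kubectl create configmap local-repo-config -n gpu-operator --from-file=./redhat.repo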
The local-repo-config configuration file is used to provide the value for the RepoConfig.ConfigMapName parameter during the installation of the GPU Operator. You can choose a different name for the configuration file.
View the contents of the local-repo-config configuration file:
kubectl get configmap local-repo-config -n gpu-operator -oyaml
You have successfully created the offline yum source configuration file for the cluster where the GPU Operator will be deployed. You can use it by specifying the RepoConfig.ConfigMapName parameter during the offline installation of the GPU Operator.
"},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html","title":"Build an Offline Yum Repository for Red Hat 7.9","text":""},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#introduction","title":"Introduction","text":"
The AI platform ships with a pre-built GPU Operator offline package for CentOS 7.9 with kernel 3.10.0-1160. You need to manually build an offline yum repository for other OS types or nodes with different kernel versions.
This page explains how to build an offline yum repository for Red Hat 7.9 based on any node in the Global cluster, and how to use the RepoConfig.ConfigMapName parameter when installing the GPU Operator.
The cluster nodes where the GPU Operator is to be deployed must be Red Hat 7.9 with the exact same kernel version.
Prepare a file server that can be connected to the cluster network where the GPU Operator is to be deployed, such as nginx or minio.
Prepare a node that can access the internet, the cluster where the GPU Operator is to be deployed, and the file server. Docker installation must be completed on this node.
The nodes in the global service cluster must be Red Hat 7.9.
"},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#steps","title":"Steps","text":""},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#1-build-offline-yum-repo-for-relevant-kernel","title":"1. Build Offline Yum Repo for Relevant Kernel","text":"
Download rhel7.9 ISO
Download the rhel7.9 ospackage that corresponds to your Kubean version.
Find the version number of Kubean in the Container Management section of the Global cluster under Helm Apps.
Download the rhel7.9 ospackage for that version from the Kubean repository.
Import offline resources using the installer.
Refer to the Import Offline Resources document.
"},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#2-download-offline-driver-image-for-red-hat-79-os","title":"2. Download Offline Driver Image for Red Hat 7.9 OS","text":"
Click here to view the download url.
"},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#3-upload-red-hat-gpu-operator-offline-image-to-boostrap-node-repository","title":"3. Upload Red Hat GPU Operator Offline Image to Boostrap Node Repository","text":"
Refer to Upload Red Hat GPU Operator Offline Image to Bootstrap Node Repository.
Note
This reference is based on rhel8.4, so make sure to modify it for rhel7.9.
"},{"location":"en/admin/kpanda/gpu/nvidia/yum_source_redhat7_9.html#4-create-configmaps-in-the-cluster-to-save-yum-repository-information","title":"4. Create ConfigMaps in the Cluster to Save Yum Repository Information","text":"
Run the following command on the control node of the cluster where the GPU Operator is to be deployed.
Run the following command to create a file named CentOS-Base.repo to specify the configuration information where the yum repository is stored.
# The file name must be CentOS-Base.repo, otherwise it will not be recognized when installing gpu-operator
cat > CentOS-Base.repo << EOF
[extension-0]
baseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address of the bootstrap node, usually {bootstrap node IP} + {port 9000}
gpgcheck = 0
name = kubean extension 0

[extension-1]
baseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address of the bootstrap node, usually {bootstrap node IP} + {port 9000}
gpgcheck = 0
name = kubean extension 1
EOF
Based on the created CentOS-Base.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
The local-repo-config configmap is used to provide the value of the RepoConfig.ConfigMapName parameter when installing gpu-operator, and its name can be customized by the user.
View the contents of the local-repo-config configmap:
kubectl get configmap local-repo-config -n gpu-operator -oyaml
The expected output is as follows:
local-repo-config.yaml
apiVersion: v1
data:
  CentOS-Base.repo: |
    [extension-0]
    baseurl = http://10.6.232.5:32618/centos-base # The file path where the yum repository is placed in Step 2
    gpgcheck = 0
    name = kubean extension 0

    [extension-1]
    baseurl = http://10.6.232.5:32618/centos-base # The file path where the yum repository is placed in Step 2
    gpgcheck = 0
    name = kubean extension 1
kind: ConfigMap
metadata:
  creationTimestamp: "2023-10-18T01:59:02Z"
  name: local-repo-config
  namespace: gpu-operator
  resourceVersion: "59445080"
  uid: c5f0ebab-046f-442c-b932-f9003e014387
At this point, you have successfully created the offline yum repository configuration file for the cluster where the GPU Operator is to be deployed. You can use it by specifying the RepoConfig.ConfigMapName parameter during the Offline Installation of GPU Operator.
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/index.html","title":"Overview of NVIDIA Multi-Instance GPU (MIG)","text":""},{"location":"en/admin/kpanda/gpu/nvidia/mig/index.html#mig-scenarios","title":"MIG Scenarios","text":"
Multi-Tenant Cloud Environments:
MIG allows cloud service providers to partition a physical GPU into multiple independent GPU instances, which can be allocated to different tenants. This enables resource isolation and independence, meeting the GPU computing needs of multiple tenants.
Containerized Applications:
MIG enables finer-grained GPU resource management in containerized environments. By partitioning a physical GPU into multiple MIG instances, each container can be assigned dedicated GPU compute resources, providing better performance isolation and resource utilization.
Batch Processing Jobs:
For batch processing jobs requiring large-scale parallel computing, MIG provides higher computational performance and larger memory capacity. Each MIG instance can utilize a portion of the physical GPU's compute resources, accelerating the processing of large-scale computational tasks.
AI/Machine Learning Training:
MIG offers increased compute power and memory capacity for training large-scale deep learning models. By partitioning the physical GPU into multiple MIG instances, each instance can independently carry out model training, improving training efficiency and throughput.
In general, NVIDIA MIG is suitable for scenarios that require finer-grained allocation and management of GPU resources. It enables resource isolation, improved performance utilization, and meets the GPU computing needs of multiple users or applications.
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/index.html#overview-of-mig","title":"Overview of MIG","text":"
NVIDIA Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA on H100, A100, and A30 series GPUs. Its purpose is to divide a physical GPU into multiple GPU instances to provide finer-grained resource sharing and isolation. MIG can split a GPU into up to seven GPU instances, allowing a single physical GPU to provide separate GPU resources to multiple users, maximizing GPU utilization.
This feature enables multiple applications or users to share GPU resources simultaneously, improving the utilization of computational resources and increasing system scalability.
With MIG, each GPU instance's processor has an independent and isolated path throughout the entire memory system, including cross-switch ports on the chip, L2 cache groups, memory controllers, and DRAM address buses, all uniquely allocated to a single instance.
This ensures that the workload of individual users can run with predictable throughput and latency, along with identical L2 cache allocation and DRAM bandwidth. MIG can partition available GPU compute resources (such as streaming multiprocessors or SMs and GPU engines like copy engines or decoders) to provide defined quality of service (QoS) and fault isolation for different clients such as virtual machines, containers, or processes. MIG enables multiple GPU instances to run in parallel on a single physical GPU.
MIG allows multiple vGPUs (and virtual machines) to run in parallel on a single GPU instance while retaining the isolation guarantees provided by vGPU. For more details on using vGPU and MIG for GPU partitioning, refer to NVIDIA Multi-Instance GPU and NVIDIA Virtual Compute Server.
The following diagram provides an overview of MIG, illustrating how it virtualizes one physical GPU into seven GPU instances that can be used by multiple users.
SM (Streaming Multiprocessor): The core computational unit of a GPU responsible for executing graphics rendering and general-purpose computing tasks. Each SM contains a group of CUDA cores, as well as shared memory, register files, and other resources, capable of executing multiple threads concurrently. Each MIG instance has a certain number of SMs and other related resources, along with the allocated memory slices.
GPU Memory Slice : The smallest portion of GPU memory, including the proper memory controller and cache. A GPU memory slice is approximately one-eighth of the total GPU memory resources in terms of capacity and bandwidth.
GPU SM Slice : The smallest computational unit of SMs on a GPU. When configuring in MIG mode, the GPU SM slice is approximately one-seventh of the total available SMs in the GPU.
GPU Slice : The GPU slice represents the smallest portion of the GPU, consisting of a single GPU memory slice and a single GPU SM slice combined together.
GPU Instance (GI): A GPU instance is the combination of a GPU slice and GPU engines (DMA, NVDEC, etc.). Anything within a GPU instance always shares all GPU memory slices and other GPU engines, but its SM slice can be further subdivided into Compute Instances (CIs). A GPU instance provides memory QoS. Each GPU slice contains dedicated GPU memory resources, limiting available capacity and bandwidth while providing memory QoS. Each GPU memory slice gets one-eighth of the total GPU memory resources, and each GPU SM slice gets one-seventh of the total SM count.
Compute Instance (CI): A Compute Instance represents the smallest computational unit within a GPU instance. It consists of a subset of SMs, along with dedicated register files, shared memory, and other resources. Each CI has its own CUDA context and can run independent CUDA kernels. The number of CIs in a GPU instance depends on the number of available SMs and the configuration chosen during MIG setup.
Instance Slice : An Instance Slice represents a single CI within a GPU instance. It is the combination of a subset of SMs and a portion of the GPU memory slice. Each Instance Slice provides isolation and resource allocation for individual applications or users running on the GPU instance.
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/index.html#key-benefits-of-mig","title":"Key Benefits of MIG","text":"
Resource Sharing: MIG allows a single physical GPU to be divided into multiple GPU instances, providing efficient sharing of GPU resources among different users or applications. This maximizes GPU utilization and enables improved performance isolation.
Fine-Grained Resource Allocation: With MIG, GPU resources can be allocated at a finer granularity, allowing for more precise partitioning and allocation of compute power and memory capacity.
Improved Performance Isolation: Each MIG instance operates independently with its dedicated resources, ensuring predictable throughput and latency for individual users or applications. This improves performance isolation and prevents interference between different workloads running on the same GPU.
Enhanced Security and Fault Isolation: MIG provides better security and fault isolation by ensuring that each user or application has its dedicated GPU resources. This prevents unauthorized access to data and mitigates the impact of faults or errors in one instance on others.
Increased Scalability: MIG enables the simultaneous usage of GPU resources by multiple users or applications, increasing system scalability and accommodating the needs of various workloads.
Efficient Containerization: By using MIG in containerized environments, GPU resources can be effectively allocated to different containers, improving performance isolation and resource utilization.
Overall, MIG offers significant advantages in terms of resource sharing, fine-grained allocation, performance isolation, security, scalability, and containerization, making it a valuable feature for various GPU computing scenarios.
Check the system requirements for the GPU driver installation on the target node: GPU Support Matrix
Ensure that the cluster nodes have GPUs of the proper models (NVIDIA H100, A100, and A30 Tensor Core GPUs). For more information, see the GPU Support Matrix.
All GPUs on the nodes must belong to the same product line (e.g., A100-SXM-40GB).
When installing the Operator, you need to set the MigManager Config parameter accordingly. The default setting is default-mig-parted-config. You can also customize the sharding policy configuration file:
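For reference, a custom sharding policy file follows the same layout as default-mig-parted-config. The snippet below is only an illustrative sketch: the profile names and the 1g.5gb device layout are examples, not values required by the platform.

version: v1
mig-configs:
  all-disabled:              # a profile that keeps MIG turned off
    - devices: all
      mig-enabled: false
  all-1g.5gb:                # split every GPU into seven 1g.5gb instances
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7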
After successfully installing the GPU operator, the node is in full card mode by default. There will be an indicator on the node management page, as shown below:
Click the ┇ at the right side of the node list, select a GPU mode to switch, and then choose the proper MIG mode and sharding policy. Here, we take MIXED mode as an example:
There are two configurations here:
MIG Policy: Mixed and Single.
Sharding Policy: The policy here needs to match the key in the default-mig-parted-config (or user-defined sharding policy) configuration file.
After clicking the OK button, wait about a minute and then refresh the page. The node will have switched to the selected MIG mode:
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/mig_command.html","title":"MIG Related Commands","text":"
GI Related Commands:
| Subcommand | Description |
| --- | --- |
| nvidia-smi mig -lgi | View the list of created GI instances |
| nvidia-smi mig -dgi -gi | Delete a specific GI instance |
| nvidia-smi mig -lgip | View the profiles of GI |
| nvidia-smi mig -cgi | Create a GI using the specified profile ID |
CI Related Commands:
| Subcommand | Description |
| --- | --- |
| nvidia-smi mig -lcip { -gi {gi Instance ID}} | View the profiles of CI; specifying -gi shows the CIs that can be created for a particular GI instance |
| nvidia-smi mig -lci | View the list of created CI instances |
| nvidia-smi mig -cci {profile id} -gi {gi instance id} | Create a CI instance for the specified GI |
| nvidia-smi mig -dci -ci | Delete a specific CI instance |
GI+CI Related Commands:
| Subcommand | Description |
| --- | --- |
| nvidia-smi mig -i 0 -cgi {gi profile id} -C {ci profile id} | Create a GI + CI instance directly |

"},{"location":"en/admin/kpanda/gpu/nvidia/mig/mig_usage.html","title":"Using MIG GPU Resources","text":"
This section explains how applications can use MIG GPU resources.
The AI platform container management platform has been deployed and is running.
The container management module is integrated with a Kubernetes cluster or a Kubernetes cluster is created, and the UI interface of the cluster can be accessed.
NVIDIA DevicePlugin and MIG capabilities are enabled. Refer to Offline installation of GPU Operator for details.
The nodes in the cluster have GPUs of the proper models.
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/mig_usage.html#using-mig-gpu-through-the-ui","title":"Using MIG GPU through the UI","text":"
Confirm if the cluster has recognized the GPU type.
Go to Cluster Details -> Nodes and check if it has been correctly recognized as MIG.
When deploying an application using an image, you can select and use NVIDIA MIG resources.
Example of MIG Single Mode (used in the same way as a full GPU):
Note
The MIG single policy allows users to request and use GPU resources in the same way as a full GPU (nvidia.com/gpu). The difference is that these resources can be a portion of the GPU (MIG device) rather than the entire GPU. Learn more from the GPU MIG Mode Design.
MIG Mixed Mode
"},{"location":"en/admin/kpanda/gpu/nvidia/mig/mig_usage.html#using-mig-through-yaml-configuration","title":"Using MIG through YAML Configuration","text":"
Expose MIG devices through resource types of the form nvidia.com/mig-<g>g.<gb>gb (for example, nvidia.com/mig-1g.5gb).
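A minimal sketch of a Pod requesting a MIG device in mixed mode follows; the Pod name, image, and the 1g.5gb profile are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: mig-demo                                   # example name
spec:
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1                 # request one 1g.5gb MIG device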
After entering the container, you can check if only one MIG device is being used:
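For example, nvidia-smi -L inside the container should list only the allocated MIG device; the output below is abbreviated and illustrative:

nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
  MIG 1g.5gb Device 0: (UUID: MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)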
"},{"location":"en/admin/kpanda/gpu/nvidia/vgpu/hami.html","title":"Build a vGPU Memory Oversubscription Image","text":"
The vGPU memory oversubscription feature is no longer available in the HAMi project. To use this feature, you need to rebuild the image with a libvgpu.so file that supports memory oversubscription.
Dockerfile
FROM docker.m.daocloud.io/projecthami/hami:v2.3.11
COPY libvgpu.so /k8s-vgpu/lib/nvidia/
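A typical build command for this Dockerfile might be the following, run from the directory containing the Dockerfile and libvgpu.so; the image tag is an illustrative assumption:

docker build -t projecthami/hami:v2.3.11-oversub .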
To virtualize a single NVIDIA GPU into multiple virtual GPUs and allocate them to different virtual machines or users, you can use NVIDIA's vGPU capability. This section explains how to install the vGPU plugin in the AI platform, which is a prerequisite for using NVIDIA vGPU capability.
During the installation of vGPU, several basic modification parameters are provided. If you need to modify advanced parameters, click the YAML column to make changes:
deviceMemoryScaling : NVIDIA device memory scaling factor, the input value must be an integer, with a default value of 1. It can be greater than 1 (enabling virtual memory, experimental feature). For an NVIDIA GPU with a memory size of M, if we configure the devicePlugin.deviceMemoryScaling parameter as S, in a Kubernetes cluster where we have deployed our device plugin, the vGPUs assigned from this GPU will have a total memory of S * M .
deviceSplitCount : An integer type, with a default value of 10. Number of GPU splits, each GPU cannot be assigned more tasks than its configuration count. If configured as N, each GPU can have up to N tasks simultaneously.
Resources : Represents the resource usage of the vgpu-device-plugin and vgpu-schedule pods.
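As a rough reference, the parameters above map to a YAML snippet like the one below when editing the addon's advanced configuration; this is a sketch based on the descriptions above, and the exact key layout may differ between addon versions:

devicePlugin:
  deviceMemoryScaling: 1    # vGPU total memory = 1 x physical GPU memory
  deviceSplitCount: 10      # each physical GPU can carry at most 10 concurrent tasks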
After a successful installation, you will see two types of pods in the specified namespace, indicating that the NVIDIA vGPU plugin has been successfully installed:
After a successful installation, you can deploy applications using vGPU resources.
Note
NVIDIA vGPU Addon does not support upgrading directly from the older v2.0.0 to the latest v2.0.0+1; To upgrade, please uninstall the older version and then reinstall the latest version.
"},{"location":"en/admin/kpanda/gpu/nvidia/vgpu/vgpu_user.html","title":"Using NVIDIA vGPU in Applications","text":"
This section explains how to use the vGPU capability in the AI platform.
The nodes in the cluster have GPUs of the proper models.
vGPU Addon has been successfully installed. Refer to Installing GPU Addon for details.
GPU Operator is installed, and the Nvidia.DevicePlugin capability is disabled. Refer to Offline Installation of GPU Operator for details.
"},{"location":"en/admin/kpanda/gpu/nvidia/vgpu/vgpu_user.html#procedure","title":"Procedure","text":""},{"location":"en/admin/kpanda/gpu/nvidia/vgpu/vgpu_user.html#using-vgpu-through-the-ui","title":"Using vGPU through the UI","text":"
Confirm if the cluster has detected GPUs. Click the Clusters -> Cluster Settings -> Addon Plugins and check if the GPU plugin has been automatically enabled and the proper GPU type has been detected. Currently, the cluster will automatically enable the GPU addon and set the GPU Type as Nvidia vGPU .
Deploy a workload by clicking Clusters -> Workloads . When deploying a workload using an image, select the type Nvidia vGPU , and you will be prompted with the following parameters:
Number of Physical Cards (nvidia.com/vgpu) : Indicates how many physical cards need to be mounted by the current pod. The input value must be an integer and less than or equal to the number of cards on the host machine.
GPU Cores (nvidia.com/gpucores): Indicates the GPU cores utilized by each card, with a value range from 0 to 100. Setting it to 0 means no enforced isolation, while setting it to 100 means exclusive use of the entire card.
GPU Memory (nvidia.com/gpumem): Indicates the GPU memory occupied by each card, with a value in MB. The minimum value is 1, and the maximum value is the total memory of the card.
If there are issues with the configuration values above, it may result in scheduling failure or inability to allocate resources.
"},{"location":"en/admin/kpanda/gpu/nvidia/vgpu/vgpu_user.html#using-vgpu-through-yaml-configuration","title":"Using vGPU through YAML Configuration","text":"
Refer to the following workload configuration and add the parameter nvidia.com/vgpu: '1' in the resource requests and limits section to configure the number of physical cards used by the application.
This YAML configuration requests the application to use vGPU resources. It specifies that each card should utilize 20% of GPU cores, 200MB of GPU memory, and requests 1 GPU.
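A minimal sketch of such a workload follows; the Deployment name and image are illustrative placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vgpu-demo                # example name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vgpu-demo
  template:
    metadata:
      labels:
        app: vgpu-demo
    spec:
      containers:
        - name: cuda-test
          image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/vgpu: "1"        # number of physical GPUs to mount
              nvidia.com/gpucores: "20"   # use 20% of the GPU cores per GPU
              nvidia.com/gpumem: "200"    # use 200 MB of GPU memory per GPU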
"},{"location":"en/admin/kpanda/gpu/volcano/volcano-gang-scheduler.html","title":"Using Volcano's Gang Scheduler","text":"
The Gang scheduling policy is one of the core scheduling algorithms of the volcano-scheduler. It satisfies the "All or nothing" scheduling requirement during the scheduling process, preventing arbitrary scheduling of Pods that could waste cluster resources. The specific algorithm observes whether the number of scheduled Pods under a Job meets the minimum running quantity. When the Job's minimum running quantity is satisfied, scheduling actions are performed for all Pods under the Job; otherwise, no actions are taken.
The Gang scheduling algorithm, based on the concept of a Pod group, is particularly suitable for scenarios that require multi-process collaboration. AI scenarios often involve complex workflows, such as Data Ingestion, Data Analysis, Data Splitting, Training, Serving, and Logging, which require a group of containers to work together. This makes the Gang scheduling policy based on pods very appropriate.
In multi-threaded parallel computing communication scenarios under the MPI computation framework, Gang scheduling is also very suitable because it requires master and slave processes to work together. High relevance among containers in a pod may lead to resource contention, and overall scheduling allocation can effectively resolve deadlocks.
In scenarios with insufficient cluster resources, the Gang scheduling policy significantly improves the utilization of cluster resources. For example, if the cluster can currently accommodate only 2 Pods, but the minimum number of Pods required for scheduling is 3, then all Pods of this Job will remain pending until the cluster can accommodate 3 Pods, at which point the Pods will be scheduled. This effectively prevents the partial scheduling of Pods, which would not meet the requirements and would occupy resources, making other Jobs unable to run.
The Gang Scheduler is the core scheduling plugin of Volcano, and it is enabled by default upon installing Volcano. When creating a workload, you only need to specify the scheduler name as Volcano.
Volcano schedules based on PodGroups. When creating a workload, there is no need to manually create PodGroup resources; Volcano will automatically create them based on the workload information. Below is an example of a PodGroup:
Represents the minimum number of Pods or jobs that need to run under this PodGroup. If the cluster resources do not meet the requirements to run the number of jobs specified by minMember, the scheduler will not schedule any jobs within this PodGroup.
Represents the minimum resources required to run this PodGroup. If the allocatable resources of the cluster do not meet the minResources, the scheduler will not schedule any jobs within this PodGroup.
Represents the priority of this PodGroup, used by the scheduler to sort all PodGroups within the queue during scheduling. system-node-critical and system-cluster-critical are two reserved values indicating the highest priority. If not specifically designated, the default priority or zero priority is used.
Represents the queue to which this PodGroup belongs. The queue must be pre-created and in the open state.
In a multi-threaded parallel computing communication scenario under the MPI computation framework, we need to ensure that all Pods can be successfully scheduled to ensure the job is completed correctly. Setting minAvailable to 4 means that 1 mpimaster and 3 mpiworkers are required to run.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  annotations:
  creationTimestamp: "2024-05-28T09:18:50Z"
  generation: 5
  labels:
    volcano.sh/job-type: MPI
  name: lm-mpi-job-9c571015-37c7-4a1a-9604-eaa2248613f2
  namespace: default
  ownerReferences:
  - apiVersion: batch.volcano.sh/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: lm-mpi-job
    uid: 9c571015-37c7-4a1a-9604-eaa2248613f2
  resourceVersion: "25173454"
  uid: 7b04632e-7cff-4884-8e9a-035b7649d33b
spec:
  minMember: 4
  minResources:
    count/pods: "4"
    cpu: 3500m
    limits.cpu: 3500m
    pods: "4"
    requests.cpu: 3500m
  minTaskMember:
    mpimaster: 1
    mpiworker: 3
  queue: default
status:
  conditions:
  - lastTransitionTime: "2024-05-28T09:19:01Z"
    message: '3/4 tasks in gang unschedulable: pod group is not ready, 1 Succeeded, 3 Releasing, 4 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: f875efa5-0358-4363-9300-06cebc0e7466
    type: Unschedulable
  - lastTransitionTime: "2024-05-28T09:18:53Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 5a7708c8-7d42-4c33-9d97-0581f7c06dab
    type: Scheduled
  phase: Pending
  succeeded: 1
From the PodGroup, it can be seen that it is associated with the workload through ownerReferences and sets the minimum number of running Pods to 4.
"},{"location":"en/admin/kpanda/gpu/volcano/volcano_user_guide.html","title":"Use Volcano for AI Compute","text":""},{"location":"en/admin/kpanda/gpu/volcano/volcano_user_guide.html#usage-scenarios","title":"Usage Scenarios","text":"
Kubernetes has become the de facto standard for orchestrating and managing cloud-native applications, and an increasing number of applications are choosing to migrate to K8s. The fields of artificial intelligence and machine learning inherently involve a large number of compute-intensive tasks, and developers are very willing to build AI platforms based on Kubernetes to fully leverage its resource management, application orchestration, and operations monitoring capabilities. However, the default Kubernetes scheduler was initially designed primarily for long-running services and has many shortcomings in batch and elastic scheduling for AI and big data tasks. For example, resource contention issues:
Take TensorFlow job scenarios as an example. TensorFlow jobs include two different roles, PS and Worker, and the Pods for these two roles need to work together to complete the entire job. If only one type of role Pod is running, the entire job cannot be executed properly. The default scheduler schedules Pods one by one and is unaware of the PS and Worker roles in a Kubeflow TFJob. In a high-load cluster (insufficient resources), multiple jobs may each be allocated some resources to run a portion of their Pods, but the jobs cannot complete successfully, leading to resource waste. For instance, if a cluster has 4 GPUs and both TFJob1 and TFJob2 each have 4 Workers, TFJob1 and TFJob2 might each be allocated 2 GPUs. However, both TFJob1 and TFJob2 require 4 GPUs to run. This mutual waiting for resource release creates a deadlock situation, resulting in GPU resource waste.
Volcano is the first Kubernetes-based container batch computing platform under CNCF, focusing on high-performance computing scenarios. It fills in the missing functionalities of Kubernetes in fields such as machine learning, big data, and scientific computing, providing essential support for these high-performance workloads. Additionally, Volcano seamlessly integrates with mainstream computing frameworks like Spark, TensorFlow, and PyTorch, and supports hybrid scheduling of heterogeneous devices, including CPUs and GPUs, effectively resolving the deadlock issues mentioned above.
The following sections will introduce how to install and use Volcano.
Find Volcano in Cluster Details -> Helm Apps -> Helm Charts and install it.
Check and confirm whether Volcano is installed successfully, that is, whether the components volcano-admission, volcano-controllers, and volcano-scheduler are running properly.
Typically, Volcano is used in conjunction with the AI Lab to achieve an effective closed-loop process for the development and training of datasets, Notebooks, and task training.
"},{"location":"en/admin/kpanda/gpu/volcano/volcano_user_guide.html#volcano-use-cases","title":"Volcano Use Cases","text":"
Volcano is a standalone scheduler. To enable the Volcano scheduler when creating workloads, simply specify the scheduler's name (schedulerName: volcano).
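For example, a minimal sketch of a Deployment handed to Volcano; the name and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: volcano-demo             # example name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: volcano-demo
  template:
    metadata:
      labels:
        app: volcano-demo
    spec:
      schedulerName: volcano     # let Volcano schedule these Pods
      containers:
        - name: app
          image: nginx:1.25      # illustrative image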
The VolcanoJob resource is an extension of the Job in Volcano. It breaks the Job down into smaller working units called tasks, which can interact with each other.
"},{"location":"en/admin/kpanda/gpu/volcano/volcano_user_guide.html#parallel-computing-with-mpi","title":"Parallel Computing with MPI","text":"
In multi-threaded parallel computing communication scenarios under the MPI computing framework, we need to ensure that all Pods are successfully scheduled to guarantee the task's proper completion. Setting minAvailable to 4 indicates that 1 mpimaster and 3 mpiworkers are required to run. By simply setting the schedulerName field value to volcano, you can enable the Volcano scheduler.
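A trimmed sketch of such an MPI job is shown below; the image, plugin list, and task commands are illustrative and follow the public Volcano MPI example rather than a platform-specific spec:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: lm-mpi-job
spec:
  schedulerName: volcano         # enable the Volcano scheduler
  minAvailable: 4                # 1 mpimaster + 3 mpiworkers must all be schedulable
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 1
      name: mpimaster
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpimaster
              image: volcanosh/example-mpi:0.0.1   # illustrative image
    - replicas: 3
      name: mpiworker
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpiworker
              image: volcanosh/example-mpi:0.0.1   # illustrative image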
Helm is a package management tool for Kubernetes, which makes it easy for users to quickly discover, share and use applications built with Kubernetes. The Container Management Module provides hundreds of Helm charts, covering storage, network, monitoring, database and other main cases. With these templates, you can quickly deploy and easily manage Helm apps through the UI interface. In addition, it supports adding more personalized templates through Add Helm repository to meet various needs.
Key Concepts:
There are a few key concepts to understand when using Helm:
Chart: A Helm installation package, which contains the images, dependencies, and resource definitions required to run an application, and may also contain service definitions in the Kubernetes cluster, similar to the formula in Homebrew, dpkg in APT, or rpm files in Yum. Charts are called Helm Charts in AI platform.
Release: A Chart instance running on the Kubernetes cluster. A Chart can be installed multiple times in the same cluster, and each installation will create a new Release. Release is called Helm Apps in AI platform.
Repository: A repository for publishing and storing Charts. Repository is called Helm Repositories in AI platform.
For more details, refer to Helm official website.
Related operations:
Manage Helm apps, including installing, updating, uninstalling Helm apps, viewing Helm operation records, etc.
Manage Helm repository, including installing, updating, deleting Helm repository, etc.
"},{"location":"en/admin/kpanda/helm/Import-addon.html","title":"Import Custom Helm Apps into Built-in Addons","text":"
This article explains how to import Helm apps into the system's built-in addons in both offline and online environments.
charts-syncer is available and running. If not, you can click here to download.
The Helm Chart has been adapted for charts-syncer, which means a .relok8s-images.yaml file has been added to the Helm Chart. This file should list all the images used by the Chart, including images that are not directly referenced in the Chart but are used indirectly, for example by an Operator.
Note
Refer to image-hints-file for instructions on how to write a Chart. It is required to separate the registry and repository of the image because the registry/repository needs to be replaced or modified when loading the image.
The installer's fire cluster has charts-syncer installed. If you are importing a custom Helm app into the installer's fire cluster, you can skip the download and proceed to the adaptation. If the charts-syncer binary is not installed, you can download it immediately.
Go to Container Management -> Helm Apps -> Helm Repositories , search for the addon, and obtain the built-in repository address and username/password (the default username/password for the system's built-in repository is rootuser/rootpass123).
Sync the Helm Chart to the built-in repository addon of the container management system
Write the following configuration file, modify it according to your specific configuration, and save it as sync-dao-2048.yaml .
source: # helm charts source information
  repo:
    kind: HARBOR # It can also be any other supported Helm Chart repository type, such as CHARTMUSEUM
    url: https://release-ci.daocloud.io/chartrepo/community # Change to the chart repo URL
    #auth: # username/password, if no password is set, leave it blank
    #username: "admin"
    #password: "Harbor12345"
charts: # charts to sync
  - name: dao-2048 # helm charts information, if not specified, sync all charts in the source helm repo
    versions:
      - 1.4.1
target: # helm charts target information
  containerRegistry: 10.5.14.40 # image repository URL
  repo:
    kind: CHARTMUSEUM # It can also be any other supported Helm Chart repository type, such as HARBOR
    url: http://10.5.14.40:8081 # Change to the correct chart repo URL, you can verify the address by using helm repo add $HELM-REPO
    auth: # username/password, if no password is set, leave it blank
      username: "rootuser"
      password: "rootpass123"
  containers:
    # kind: HARBOR # If the image repository is HARBOR and you want charts-syncer to automatically create an image repository, fill in this field
    # auth: # username/password, if no password is set, leave it blank
    #   username: "admin"
    #   password: "Harbor12345"

# leverage .relok8s-images.yaml file inside the Charts to move the container images too
relocateContainerImages: true
Run the charts-syncer command to sync the Chart and its included images
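Assuming the charts-syncer binary is on your PATH and the configuration file saved above is in the current directory, the sync command typically looks like this:

charts-syncer sync --config sync-dao-2048.yaml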
I1222 15:01:47.119777 8743 sync.go:45] Using config file: "examples/sync-dao-2048.yaml"
W1222 15:01:47.234238 8743 syncer.go:263] Ignoring skipDependencies option as dependency sync is not supported if container image relocation is true or syncing from/to intermediate directory
I1222 15:01:47.234685 8743 sync.go:58] There is 1 chart out of sync!
I1222 15:01:47.234706 8743 sync.go:66] Syncing "dao-2048_1.4.1" chart...
.relok8s-images.yaml hints file found
Computing relocation...

Relocating dao-2048@1.4.1...
Pushing 10.5.14.40/daocloud/dao-2048:v1.4.1...
Done
Done moving /var/folders/vm/08vw0t3j68z9z_4lcqyhg8nm0000gn/T/charts-syncer869598676/dao-2048-1.4.1.tgz
Once the previous step is completed, go to Container Management -> Helm Apps -> Helm Repositories , find the proper addon, click Sync Repository in the action column, and you will see the uploaded Helm apps in the Helm template.
You can then proceed with normal installation, upgrade, and uninstallation.
The Helm Repo address for the online environment is release.daocloud.io. If the user does not have permission to add a Helm Repo, they will not be able to import custom Helm apps into the system's built-in addons. You can add your own Helm repository and then integrate it into the platform using the same steps as syncing a Helm Chart in the offline environment.
The container management module supports interface-based management of Helm, including creating Helm instances using Helm charts, customizing Helm instance arguments, and managing the full lifecycle of Helm instances.
This section will take cert-manager as an example to introduce how to create and manage Helm apps through the container management interface.
Integrated the Kubernetes cluster or created the Kubernetes cluster, and you can access the UI interface of the cluster.
Created a namespace, user, and granted NS Admin or higher permissions to the user. For details, refer to Namespace Authorization.
"},{"location":"en/admin/kpanda/helm/helm-app.html#install-the-helm-app","title":"Install the Helm app","text":"
Follow the steps below to install the Helm app.
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Helm Apps -> Helm Chart to enter the Helm chart page.
On the Helm chart page, select the Helm repository named addon , and all the Helm chart templates under the addon repository will be displayed on the interface. Click the Chart named cert-manager .
On the installation page, you can see the relevant detailed information of the Chart, select the version to be installed in the upper right corner of the interface, and click the Install button. Here select v1.9.1 version for installation.
Configure Name , Namespace and Version Information . You can also customize arguments by modifying YAML in the argument Configuration area below. Click OK .
The system will automatically return to the list of Helm apps, and the status of the newly created Helm app is Installing , and the status will change to Running after a period of time.
"},{"location":"en/admin/kpanda/helm/helm-app.html#update-the-helm-app","title":"Update the Helm app","text":"
After we have completed the installation of a Helm app through the interface, we can perform an update operation on the Helm app. Note: Update operations using the UI are only supported for Helm apps installed via the UI.
Follow the steps below to update the Helm app.
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Helm Apps to enter the Helm app list page.
On the Helm app list page, select the Helm app that needs to be updated, click the ... operation button on the right side of the list, and select the Update operation in the drop-down selection.
After clicking the Update button, the system will jump to the update interface, where you can update the Helm app as needed. Here we take updating the http port of the dao-2048 application as an example.
After modifying the relevant arguments, you can click the Change button under the argument configuration to compare the files before and after the modification. After confirming that there are no errors, click the OK button at the bottom to complete the update of the Helm app.
The system will automatically return to the Helm app list, and a pop-up window in the upper right corner will prompt update successful .
Every installation, update, and deletion of Helm apps has detailed operation records and logs for viewing.
In the left navigation bar, click Cluster Operations -> Recent Operations , and then select the Helm Operations tab at the top of the page. Each record corresponds to an install/update/delete operation.
To view the detailed log of each operation: Click ┇ on the right side of the list, and select Log from the pop-up menu.
At this point, the detailed operation log will be displayed in the form of console at the bottom of the page.
"},{"location":"en/admin/kpanda/helm/helm-app.html#delete-the-helm-app","title":"Delete the Helm app","text":"
Follow the steps below to delete the Helm app.
Find the cluster where the Helm app to be deleted resides, click the cluster name, and enter Cluster Details .
In the left navigation bar, click Helm Apps to enter the Helm app list page.
On the Helm app list page, select the Helm app you want to delete, click the ... operation button on the right side of the list, and select Delete from the drop-down selection.
Enter the name of the Helm app in the pop-up window to confirm, and then click the Delete button.
The Helm repository is a repository for storing and publishing Charts. The Helm App module supports HTTP(s) protocol to access Chart packages in the repository. By default, the system has 4 built-in helm repos as shown in the table below to meet common needs in the production process of enterprises.
| Repository | Description | Example |
| --- | --- | --- |
| partner | Various high-quality feature Charts provided by ecological partners | tidb |
| system | Charts that system core functional components and some advanced features must rely on. For example, insight-agent must be installed to obtain cluster monitoring information | Insight |
| addon | Common Charts in business cases | cert-manager |
| community | The most popular open source component Charts in the Kubernetes community | Istio |
In addition to the above preset repositories, you can also add third-party Helm repositories yourself. This page will introduce how to add and update third-party Helm repositories.
The following takes the public container repository of Kubevela as an example to introduce and manage the helm repo.
Find the cluster that needs to be imported into the third-party helm repo, click the cluster name, and enter cluster details.
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo page.
Click the Create Repository button on the helm repo page to enter the Create repository page, and configure relevant arguments according to the table below.
Repository Name: Set the repository name. It can be up to 63 characters long and may only include lowercase letters, numbers, and separators -. It must start and end with a lowercase letter or number, for example, kubevela.
Repository URL: The HTTP(S) address pointing to the target Helm repository. For example, https://charts.kubevela.net/core.
Skip TLS Verification: If the added Helm repository uses an HTTPS address and requires skipping TLS verification, you can check this option. The default is unchecked.
Authentication Method: The method used for identity verification after connecting to the repository URL. For public repositories, you can select None. For private repositories, you need to enter a username/password for identity verification.
Labels: Add labels to this Helm repository. For example, key: repo4; value: Kubevela.
Annotations: Add annotations to this Helm repository. For example, key: repo4; value: Kubevela.
Description: Add a description for this Helm repository. For example: This is a Kubevela public Helm repository.
Click OK to complete the creation of the Helm repository. The page will automatically jump to the list of Helm repositories.
"},{"location":"en/admin/kpanda/helm/helm-repo.html#update-the-helm-repository","title":"Update the Helm repository","text":"
When the address information of the helm repo changes, the address, authentication method, label, annotation, and description information of the helm repo can be updated.
Find the cluster where the repository to be updated is located, click the cluster name, and enter cluster details .
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo list page.
Find the Helm repository that needs to be updated on the repository list page, click the ┇ button on the right side of the list, and click Update in the pop-up menu.
Update on the Update Helm Repository page, and click OK when finished.
Return to the helm repo list, and the screen prompts that the update is successful.
"},{"location":"en/admin/kpanda/helm/helm-repo.html#delete-the-helm-repository","title":"Delete the Helm repository","text":"
In addition to importing and updating repositories, you can also delete unnecessary repositories, including system preset repositories and third-party repositories.
Find the cluster where the repository to be deleted is located, click the cluster name, and enter cluster details .
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo list page.
Find the Helm repository that needs to be deleted on the repository list page, click the ┇ button on the right side of the list, and click Delete in the pop-up menu.
Enter the repository name to confirm, and click Delete .
Return to the list of Helm repositories, and the screen prompts that the deletion is successful.
"},{"location":"en/admin/kpanda/helm/multi-archi-helm.html","title":"Import and Upgrade Multi-Arch Helm Apps","text":"
In a multi-arch cluster, it is common to use Helm charts that support multiple architectures to address deployment issues caused by architectural differences. This guide will explain how to integrate single-arch Helm apps into multi-arch deployments and how to integrate multi-arch Helm apps.
The offline package is quite large and requires sufficient space for decompression and loading of images. Otherwise, it may interrupt the process with a \"no space left\" error.
"},{"location":"en/admin/kpanda/helm/multi-archi-helm.html#retry-after-failure","title":"Retry after Failure","text":"
If the multi-arch fusion step fails, you need to clean up the residue before retrying:
If the offline package for fusion contains registry spaces that are inconsistent with the imported offline package, an error may occur during the fusion process due to the non-existence of the registry spaces:
Solution: Simply create the registry space before the fusion. For example, in the above error, creating the registry space \"localhost\" in advance can prevent the error.
When upgrading to a version lower than 0.12.0 of the addon, the charts-syncer in the target offline package does not check the existence of the image before pushing, so it will recombine the multi-arch into a single architecture during the upgrade process. For example, if the addon is implemented as a multi-arch in v0.10, upgrading to v0.11 will overwrite the multi-arch addon with a single architecture. However, upgrading to v0.12.0 or above can still maintain the multi-arch.
This article explains how to upload Helm charts. See the steps below.
Add a Helm repository, refer to Adding a Third-Party Helm Repository for the procedure.
Upload the Helm Chart to the Helm repository.
Upload with ClientUpload with Web Page
Note
This method is suitable for Harbor, ChartMuseum, JFrog type repositories.
Log in to a node that can access the Helm repository, upload the Helm binary to the node, and install the cm-push plugin (VPN is needed and Git should be installed in advance).
Refer to the plugin installation process.
Push the Helm Chart to the Helm repository by executing the following command:
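The push command generally takes the following form; substitute your own chart directory, repository URL, and credentials:

helm cm-push ${charts-dir} ${HELM_REPO_URL} --username ${username} --password ${password}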
charts-dir: The directory of the Helm Chart, or the packaged Chart (i.e., .tgz file).
HELM_REPO_URL: The URL of the Helm repository.
username/password: The username and password for the Helm repository with push permissions.
If you want to access via HTTPS and skip the certificate verification, you can add the argument --insecure.
Note
This method is only applicable to Harbor repositories.
Log into the Harbor repository, ensuring the logged-in user has permissions to push;
Go to the relevant project, select the Helm Charts tab, click the Upload button on the page to upload the Helm Chart.
Sync Remote Repository Data
Manual SyncAuto Sync
By default, the cluster does not enable Helm Repository Auto-Refresh, so you need to perform a manual sync operation. The general steps are:
Go to Helm Apps -> Helm Repositories, click the ┇ button on the right side of the repository list, and select Sync Repository to complete the repository data synchronization.
If you need to enable the Helm repository auto-sync feature, you can go to Cluster Maintenance -> Cluster Settings -> Advanced Settings and turn on the Helm repository auto-refresh switch.
Cluster inspection allows administrators to regularly or ad-hoc check the overall health of the cluster, giving them proactive control over ensuring cluster security. With a well-planned inspection schedule, this proactive cluster check allows administrators to monitor the cluster status at any time and address potential issues in advance. It eliminates the previous dilemma of passive troubleshooting during failures, enabling proactive monitoring and prevention.
The cluster inspection feature provided by AI platform's container management module supports custom inspection items at the cluster, node, and pod levels. After the inspection is completed, it automatically generates visual inspection reports.
Cluster Level: Checks the running status of system components in the cluster, including cluster status, resource usage, and specific inspection items for control nodes, such as the status of kube-apiserver and etcd .
Node Level: Includes common inspection items for both control nodes and worker nodes, such as node resource usage, handle counts, PID status, and network status.
pod Level: Checks the CPU and memory usage, running status of pods, and the status of PV (Persistent Volume) and PVC (PersistentVolumeClaim).
For information on security inspections or executing security-related inspections, refer to the supported security scan types in AI platform.
AI platform Container Management module provides cluster inspection functionality, which supports inspection at the cluster, node, and pod levels.
Cluster level: Check the running status of system components in the cluster, including cluster status, resource usage, and specific inspection items for control nodes such as kube-apiserver and etcd .
Node level: Includes common inspection items for both control nodes and worker nodes, such as node resource usage, handle count, PID status, and network status.
Pod level: Check the CPU and memory usage, running status, PV and PVC status of Pods.
Here's how to create an inspection configuration.
Click Cluster Inspection in the left navigation bar.
On the right side of the page, click Inspection Configuration .
Fill in the inspection configuration based on the following instructions, then click OK at the bottom of the page.
Cluster: Select the clusters that you want to inspect from the dropdown list. If you select multiple clusters, multiple inspection configurations will be automatically generated (only the inspected cluster differs; all other configurations are identical).
Scheduled Inspection: When enabled, it allows for regular automatic execution of cluster inspections based on a pre-set inspection frequency.
Inspection Frequency: Set the interval for automatic inspections, e.g., every Tuesday at 10 AM. Custom cron expressions are supported (see the sample expression after this list); refer to Cron Schedule Syntax for more information.
Number of Inspection Records to Retain: Specifies the maximum number of inspection records to be retained, including all inspection records for each cluster.
Parameter Configuration: The parameter configuration is divided into three parts: cluster level, node level, and pod level. You can enable or disable specific inspection items based on your requirements.
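For example, the "every Tuesday at 10 AM" schedule mentioned above corresponds to the following standard five-field cron expression:

0 10 * * 2    # minute hour day-of-month month day-of-week (2 = Tuesday)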
After creating the inspection configuration, it will be automatically displayed in the inspection configuration list. Click the more options button on the right of the configuration to immediately perform an inspection, modify the inspection configuration or delete the inspection configuration and reports.
Click Inspection to perform an inspection once based on the configuration.
Click Inspection Configuration to modify the inspection configuration.
Click Delete to delete the inspection configuration and reports.
Note
After creating the inspection configuration, if the Scheduled Inspection configuration is enabled, inspections will be automatically executed at the specified time.
If Scheduled Inspection configuration is not enabled, you need to manually trigger the inspection.
After creating an inspection configuration, if the Scheduled Inspection configuration is enabled, inspections will be automatically executed at the specified time. If the Scheduled Inspection configuration is not enabled, you need to manually trigger the inspection.
This page explains how to manually perform a cluster inspection.
When performing an inspection, you can choose to inspect multiple clusters in batches or perform a separate inspection for a specific cluster.
Batch InspectionIndividual Inspection
Click Cluster Inspection in the top-level navigation bar of the Container Management module, then click Inspection on the right side of the page.
Select the clusters you want to inspect, then click OK at the bottom of the page.
If you choose to inspect multiple clusters at the same time, the system will perform inspections based on different inspection configurations for each cluster.
If no inspection configuration is set for a cluster, the system will use the default configuration.
Individual Inspection

Go to the Cluster Inspection page.
Click the more options button ( ┇ ) on the right of the proper inspection configuration, then select Inspection from the popup menu.
Go to the Cluster Inspection page and click the name of the target inspection cluster.
Click the name of the inspection record you want to view.
Each inspection execution generates an inspection record.
When the number of inspection records exceeds the maximum retention specified in the inspection configuration, the oldest records (by execution time) are deleted first.
View the detailed information of the inspection, which may include an overview of cluster resources and the running status of system components.
You can download the inspection report or delete the inspection report from the top right corner of the page.
Namespaces are an abstraction used in Kubernetes for resource isolation. A cluster can contain multiple namespaces with different names, and the resources in each namespace are isolated from each other. For a detailed introduction to namespaces, refer to Namespaces.
This page introduces operations related to namespaces.
"},{"location":"en/admin/kpanda/namespaces/createns.html#create-a-namespace","title":"Create a namespace","text":"
Namespaces can be created easily through a form, or quickly by writing or importing YAML files.
Note
Before creating a namespace, you need to Integrate a Kubernetes cluster or Create a Kubernetes cluster in the container management module.
The default namespace default is usually generated automatically after cluster initialization. For production clusters, however, it is recommended to create separate namespaces rather than using the default namespace directly, for ease of management.
"},{"location":"en/admin/kpanda/namespaces/createns.html#create-with-form","title":"Create with form","text":"
On the cluster list page, click the name of the target cluster.
Click Namespace in the left navigation bar, then click the Create button on the right side of the page.
Fill in the name of the namespace, configure the workspace and labels (optional), and then click OK.
Info
After binding a namespace to a workspace, the resources of that namespace will be shared with the bound workspace. For a detailed explanation of workspaces, refer to Workspaces and Hierarchies.
After the namespace is created, you can still bind/unbind the workspace.
Click OK to complete the creation of the namespace. On the right side of the namespace list, click ┇ to select update, bind/unbind workspace, quota management, delete, and more from the pop-up menu.
"},{"location":"en/admin/kpanda/namespaces/createns.html#create-from-yaml","title":"Create from YAML","text":"
On the Clusters page, click the name of the target cluster.
Click Namespace in the left navigation bar, then click the YAML Create button on the right side of the page.
Enter or paste the prepared YAML content, or import an existing local YAML file (a minimal example is shown after these steps).
After entering the YAML content, click Download to save the YAML file locally.
Finally, click OK in the lower right corner of the pop-up box.
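For reference, a minimal manifest that could be pasted into the YAML creation dialog might look like the following; the name and label are illustrative only:

```yaml
# A minimal Namespace manifest (illustrative name and label)
apiVersion: v1
kind: Namespace
metadata:
  name: demo-ns
  labels:
    environment: test
```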
Namespace exclusive nodes in a Kubernetes cluster allow a specific namespace to have exclusive access to the CPU, memory, and other resources of one or more nodes through taints and tolerations. Once exclusive nodes are configured for a specific namespace, applications and services from other namespaces cannot run on those nodes. Using exclusive nodes allows important applications to have exclusive access to some computing resources, achieving physical isolation from other applications.
Note
Applications and services running on a node before it is set to be an exclusive node will not be affected and will continue to run normally on that node. Only when these Pods are deleted or rebuilt will they be scheduled to other non-exclusive nodes.
Check whether the kube-apiserver of the current cluster has enabled the PodNodeSelector and PodTolerationRestriction admission controllers.
The use of namespace exclusive nodes requires users to enable the PodNodeSelector and PodTolerationRestriction admission controllers on the kube-apiserver. For more information about admission controllers, refer to Kubernetes Admission Controllers Reference.
You can go to any Master node in the current cluster to check whether these two features are enabled in the kube-apiserver.yaml file, or you can execute the following command on the Master node for a quick check:
```bash
[root@g-master1 ~]# cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep enable-admission-plugins

# The expected output is as follows:
- --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction
```
"},{"location":"en/admin/kpanda/namespaces/exclusive.html#enable-namespace-exclusive-nodes-on-global-cluster","title":"Enable Namespace Exclusive Nodes on Global Cluster","text":"
Since the Global cluster runs platform basic components such as kpanda, ghippo, and insight, enabling namespace exclusive nodes on Global may prevent system components from being scheduled to the exclusive nodes when they restart, affecting the overall high availability of the system. Therefore, we generally do not recommend enabling the namespace exclusive node feature on the Global cluster.
If you do need to enable namespace exclusive nodes on the Global cluster, please follow the steps below:
Enable the PodNodeSelector and PodTolerationRestriction admission controllers for the kube-apiserver of the Global cluster
Note
If the cluster has already enabled the above two admission controllers, please skip this step and go directly to configure system component tolerations.
Go to any Master node in the current cluster to modify the kube-apiserver.yaml configuration file, or execute the following command on the Master node for configuration:
```bash
[root@g-master1 ~]# vi /etc/kubernetes/manifests/kube-apiserver.yaml

# The expected output is as follows:
apiVersion: v1
kind: Pod
metadata:
  ......
spec:
  containers:
  - command:
      - kube-apiserver
      ......
      - --default-not-ready-toleration-seconds=300
      - --default-unreachable-toleration-seconds=300
      - --enable-admission-plugins=NodeRestriction   # List of enabled admission controllers
      - --enable-aggregator-routing=False
      - --enable-bootstrap-token-auth=true
      - --endpoint-reconciler-type=lease
      - --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt
      ......
```
Find the --enable-admission-plugins parameter and add the PodNodeSelector and PodTolerationRestriction admission controllers (separated by commas). Refer to the following:
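For example, based on the default flag shown earlier, the modified parameter would look like this:

```yaml
- --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction
```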
Add toleration annotations to the namespace where the platform components are located
After enabling the admission controllers, you need to add toleration annotations to the namespace where the platform components are located to ensure the high availability of the platform components.
The system component namespaces for AI platform are as follows:
Check whether the above namespaces exist in the current cluster, then execute the following command to add the annotation scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]' to each of these namespaces.
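For example, assuming kubectl access to the cluster, the annotation can be added with a command like the following (repeat for each platform namespace):

```bash
# Replace <namespace-name> with the platform namespace to be annotated
kubectl annotate namespace <namespace-name> \
  scheduler.alpha.kubernetes.io/defaultTolerations='[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]'
```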
Please make sure to replace <namespace-name> with the name of the platform namespace you want to add the annotation to.
Use the interface to set exclusive nodes for the namespace
After confirming that the PodNodeSelector and PodTolerationRestriction admission controllers on the cluster API server have been enabled, please follow the steps below to use the AI platform UI management interface to set exclusive nodes for the namespace.
Click the cluster name in the cluster list page, then click Namespace in the left navigation bar.
Click the namespace name, then click the Exclusive Node tab, and click Add Node on the bottom right.
Select which nodes you want to be exclusive to this namespace on the left side of the page. On the right side, you can clear or delete a selected node. Finally, click OK at the bottom.
You can view the current exclusive nodes for this namespace in the list. You can choose to Stop Exclusivity on the right side of the node.
After cancelling exclusivity, Pods from other namespaces can also be scheduled to this node.
"},{"location":"en/admin/kpanda/namespaces/exclusive.html#enable-namespace-exclusive-nodes-on-non-global-clusters","title":"Enable Namespace Exclusive Nodes on Non-Global Clusters","text":"
To enable namespace exclusive nodes on non-Global clusters, please follow the steps below:
Enable the PodNodeSelector and PodTolerationRestriction admission controllers for the kube-apiserver of the current cluster
Note
If the cluster has already enabled the above two admission controllers, please skip this step and go directly to using the interface to set exclusive nodes for the namespace.
Go to any Master node in the current cluster to modify the kube-apiserver.yaml configuration file, or execute the following command on the Master node for configuration:
```bash
[root@g-master1 ~]# vi /etc/kubernetes/manifests/kube-apiserver.yaml

# The expected output is as follows:
apiVersion: v1
kind: Pod
metadata:
  ......
spec:
  containers:
  - command:
      - kube-apiserver
      ......
      - --default-not-ready-toleration-seconds=300
      - --default-unreachable-toleration-seconds=300
      - --enable-admission-plugins=NodeRestriction   # List of enabled admission controllers
      - --enable-aggregator-routing=False
      - --enable-bootstrap-token-auth=true
      - --endpoint-reconciler-type=lease
      - --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt
      ......
```
Find the --enable-admission-plugins parameter and add the PodNodeSelector and PodTolerationRestriction admission controllers (separated by commas). Refer to the following:
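As in the Global cluster procedure above, the modified parameter would look like this:

```yaml
- --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction
```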
Use the interface to set exclusive nodes for the namespace
After confirming that the PodNodeSelector and PodTolerationRestriction admission controllers on the cluster API server have been enabled, please follow the steps below to use the AI platform UI management interface to set exclusive nodes for the namespace.
Click the cluster name in the cluster list page, then click Namespace in the left navigation bar.
Click the namespace name, then click the Exclusive Node tab, and click Add Node on the bottom right.
Select which nodes you want to be exclusive to this namespace on the left side of the page. On the right side, you can clear or delete a selected node. Finally, click OK at the bottom.
You can view the current exclusive nodes for this namespace in the list. You can choose to Stop Exclusivity on the right side of the node.
After cancelling exclusivity, Pods from other namespaces can also be scheduled to this node.
Add toleration annotations to the namespace where the components that need high availability are located (optional)
Execute the following command to add the annotation scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]' to the namespace where the components that need high availability are located.
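For example, assuming kubectl access to the cluster:

```bash
# Replace <namespace-name> with the namespace of the components that need high availability
kubectl annotate namespace <namespace-name> \
  scheduler.alpha.kubernetes.io/defaultTolerations='[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]'
```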
Pod security policies in a Kubernetes cluster allow you to control the behavior of Pods in various aspects of security by configuring different levels and modes for specific namespaces. Only Pods that meet certain conditions will be accepted by the system. It sets three levels and three modes, allowing users to choose the most suitable scheme to set restriction policies according to their needs.
Note
Only one security policy can be configured for one security mode. Please be careful when configuring the enforce security mode for a namespace, as violations will prevent Pods from being created.
This section will introduce how to configure Pod security policies for namespaces through the container management interface.
The container management module has integrated or created a Kubernetes cluster of version v1.22 or above, and the cluster's UI interface is accessible.
A namespace has been created, a user has been created, and the user has been granted NS Admin or higher permissions. For details, refer to Namespace Authorization.
"},{"location":"en/admin/kpanda/namespaces/podsecurity.html#configure-pod-security-policies-for-namespace","title":"Configure Pod Security Policies for Namespace","text":"
Select the namespace for which you want to configure Pod security policies and go to the details page. Click Configure Policy on the Pod Security Policy page to go to the configuration page.
Click Add Policy on the configuration page, and a policy will appear, including security level and security mode. The following is a detailed introduction to the security level and security policy.
| Security Level | Description |
| --- | --- |
| Privileged | An unrestricted policy that provides the maximum possible range of permissions. This policy allows known privilege escalations. |
| Baseline | A minimally restrictive policy that prohibits known privilege escalations. Allows the default (minimally specified) Pod configuration. |
| Restricted | A highly restrictive policy that follows current best practices for hardening Pods. |

| Security Mode | Description |
| --- | --- |
| Audit | Violations of the specified policy add a new audit event to the audit log, and the Pod can still be created. |
| Warn | Violations of the specified policy return user-visible warning information, and the Pod can still be created. |
| Enforce | Violations of the specified policy prevent the Pod from being created. |
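Under the hood, these levels and modes presumably map to the standard Pod Security Admission labels on the namespace; this mapping is an assumption based on upstream Kubernetes behavior rather than something stated here, but a sketch of such labels would look like this:

```yaml
# Sketch: namespace labels for a "baseline/enforce" plus "restricted/warn"
# configuration, assuming the platform uses the standard Pod Security Admission labels
apiVersion: v1
kind: Namespace
metadata:
  name: demo-ns
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
```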
Different security levels correspond to different check items. If you are not sure how to configure your namespace, you can click Policy ConfigMap Explanation at the top right corner of the page to view detailed information.
Click Confirm. If the creation is successful, the security policy you configured will appear on the page.
Click ┇ to edit or delete the security policy you configured.
"},{"location":"en/admin/kpanda/network/create-ingress.html","title":"Create an Ingress","text":"
In a Kubernetes cluster, an Ingress exposes HTTP and HTTPS routes from outside the cluster to Services inside the cluster. Traffic routing is controlled by rules defined on the Ingress resource. Here's an example of a simple Ingress that sends all traffic to the same Service:
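The example referenced above is not included in this extract; a minimal Ingress of this kind, with illustrative names and host, could look like the following:

```yaml
# Minimal Ingress that routes all traffic for one host to a single Service
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  namespace: default
spec:
  ingressClassName: nginx        # must match an existing Ingress instance in the cluster
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80
```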
Ingress is an API object that manages external access to services in the cluster, and the typical access method is HTTP. Ingress can provide load balancing, SSL termination, and name-based virtual hosting.
The container management module has integrated or created a Kubernetes cluster, and the cluster's UI interface is accessible.
A namespace and a user have been created, and the user has been granted the NS Editor role. For details, refer to Namespace Authorization.
An Ingress instance has been created, an application workload has been deployed, and the proper Service has been created.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
After successfully logging in as the NS Editor user, click Clusters in the upper left corner to enter the Clusters page. In the list of clusters, click a cluster name.
In the left navigation bar, click Container Network -> Ingress to enter the service list, and click the Create Ingress button in the upper right corner.
Note
It is also possible to Create from YAML .
Open Create Ingress page to configure. There are two protocol types to choose from, refer to the following two parameter tables for configuration.
"},{"location":"en/admin/kpanda/network/create-ingress.html#create-http-protocol-ingress","title":"Create HTTP protocol ingress","text":"Parameter Description Example value Ingress name [Type] Required[Meaning] Enter the name of the new ingress. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter, lowercase English letters or numbers. Ing-01 Namespace [Type] Required[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. default Protocol [Type] Required [Meaning] Refers to the protocol that authorizes inbound access to the cluster service, and supports HTTP (no identity authentication required) or HTTPS (identity authentication needs to be configured) protocol. Here select the ingress of HTTP protocol. HTTP Domain Name [Type] Required [Meaning] Use the domain name to provide external access services. The default is the domain name of the cluster testing.daocloud.io LB Type [Type] Required [Meaning] The usage range of the Ingress instance. Scope of use of Ingress Platform-level load balancer : In the same cluster, share the same Ingress instance, where all Pods can receive requests distributed by the load balancer. Tenant-level load balancer : Tenant load balancer, the Ingress instance belongs exclusively to the current namespace, or belongs to a certain workspace, and the set workspace includes the current namespace, and all Pods can receive it Requests distributed by this load balancer. Platform Level Load Balancer Ingress Class [Type] Optional[Meaning] Select the proper Ingress instance, and import traffic to the specified Ingress instance after selection. When it is None, the default DefaultClass is used. Please set the DefaultClass when creating an Ingress instance. For more information, refer to Ingress Class< br /> Ngnix Session persistence [Type] Optional[Meaning] Session persistence is divided into three types: L4 source address hash , Cookie Key , L7 Header Name . Keep L4 Source Address Hash : : When enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\(binary_remote_addr\"<br /> __Cookie Key__ : When enabled, the connection from a specific client will be passed to the same Pod. After enabled, the following parameters are added to the Annotation by default:<br /> nginx.ingress.kubernetes.io/affinity: \"cookie\"<br /> nginx.ingress.kubernetes .io/affinity-mode: persistent<br /> __L7 Header Name__ : After enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\)http_x_forwarded_for\" Close Path Rewriting [Type] Optional [Meaning] rewrite-target , in some cases, the URL exposed by the backend service is different from the path specified in the Ingress rule. If no URL rewriting configuration is performed, There will be an error when accessing. close Redirect [Type] Optional[Meaning] permanent-redirect , permanent redirection, after entering the rewriting path, the access path will be redirected to the set address. close Traffic Distribution [Type] Optional[Meaning] After enabled and set, traffic distribution will be performed according to the set conditions. 
Based on weight : After setting the weight, add the following Annotation to the created Ingress: nginx.ingress.kubernetes.io/canary-weight: \"10\" Based on Cookie : set After the cookie rules, the traffic will be distributed according to the set cookie conditions Based on Header : After setting the header rules, the traffic will be distributed according to the set header conditions Close Labels [Type] Optional [Meaning] Add a label for the ingress - Annotations [Type] Optional [Meaning] Add annotation for ingress -"},{"location":"en/admin/kpanda/network/create-ingress.html#create-https-protocol-ingress","title":"Create HTTPS protocol ingress","text":"Parameter Description Example value Ingress name [Type] Required[Meaning] Enter the name of the new ingress. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter, lowercase English letters or numbers. Ing-01 Namespace [Type] Required[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. default Protocol [Type] Required [Meaning] Refers to the protocol that authorizes inbound access to the cluster service, and supports HTTP (no identity authentication required) or HTTPS (identity authentication needs to be configured) protocol. Here select the ingress of HTTPS protocol. HTTPS Domain Name [Type] Required [Meaning] Use the domain name to provide external access services. The default is the domain name of the cluster testing.daocloud.io Secret [Type] Required [Meaning] Https TLS certificate, Create Secret. Forwarding policy [Type] Optional[Meaning] Specify the access policy of Ingress. Path: Specifies the URL path for service access, the default is the root path/directoryTarget service: Service name for ingressTarget service port: Port exposed by the service LB Type [Type] Required [Meaning] The usage range of the Ingress instance. Platform-level load balancer : In the same cluster, the same Ingress instance is shared, and all Pods can receive requests distributed by the load balancer. Tenant-level load balancer : Tenant load balancer, the Ingress instance belongs exclusively to the current namespace or to a certain workspace. This workspace contains the current namespace, and all Pods can receive the workload from this Balanced distribution of requests. Platform Level Load Balancer Ingress Class [Type] Optional[Meaning] Select the proper Ingress instance, and import traffic to the specified Ingress instance after selection. When it is None, the default DefaultClass is used. Please set the DefaultClass when creating an Ingress instance. For more information, refer to Ingress Class< br /> None Session persistence [Type] Optional[Meaning] Session persistence is divided into three types: L4 source address hash , Cookie Key , L7 Header Name . Keep L4 Source Address Hash : : When enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\(binary_remote_addr\"<br /> __Cookie Key__ : When enabled, the connection from a specific client will be passed to the same Pod. 
After enabled, the following parameters are added to the Annotation by default:<br /> nginx.ingress.kubernetes.io/affinity: \"cookie\"<br /> nginx.ingress.kubernetes .io/affinity-mode: persistent<br /> __L7 Header Name__ : After enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\)http_x_forwarded_for\" Close Labels [Type] Optional [Meaning] Add a label for the ingress Annotations [Type] Optional[Meaning] Add annotation for ingress"},{"location":"en/admin/kpanda/network/create-ingress.html#create-ingress-successfully","title":"Create ingress successfully","text":"
After configuring all the parameters, click the OK button to return to the ingress list automatically. On the right side of the list, click ┇ to modify or delete the selected ingress.
"},{"location":"en/admin/kpanda/network/create-services.html","title":"Create a Service","text":"
In a Kubernetes cluster, each Pod has an internal independent IP address, but Pods in the workload may be created and deleted at any time, and directly using the Pod IP address cannot provide external services.
This requires creating a service through which you get a fixed IP address, decoupling the front-end and back-end of the workload, and allowing external users to access the service. At the same time, the service also provides the Load Balancer feature, enabling users to access workloads from the public network.
The container management module has integrated or created a Kubernetes cluster, and the cluster's UI interface is accessible.
A namespace and a user have been created, and the user has been granted the NS Editor role. For details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
After successfully logging in as the NS Editor user, click Clusters in the upper left corner to enter the Clusters page. In the list of clusters, click a cluster name.
In the left navigation bar, click Container Network -> Service to enter the service list, and click the Create Service button in the upper right corner.
!!! tip

    It is also possible to create a service via __YAML__.
Open the Create Service page, select an access type, and refer to the following three parameter tables for configuration.
Click Intra-Cluster Access (ClusterIP) , which refers to exposing services through the internal IP of the cluster. The services selected for this option can only be accessed within the cluster. This is the default service type. Refer to the configuration parameters in the table below.
| Parameter | Description | Example value |
| --- | --- | --- |
| Access type | Required. Specify the method of Pod service discovery; here select intra-cluster access (ClusterIP). | ClusterIP |
| Service Name | Required. The name of the new service: a string of 4 to 63 characters that can contain lowercase letters, numbers, and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | Required. The namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. | default |
| Label selector | Required. Add a label; the Service selects Pods according to the label; click "Add" after filling it in. You can also reference the label of an existing workload: click Reference workload label, select the workload in the pop-up window, and the system uses the selected workload's label as the selector by default. | app:job01 |
| Port configuration | Required. Add a protocol port for the service; first select the port protocol type (TCP and UDP are currently supported). Port Name: the name of the custom port. Service port (port): the access port the Pod uses to provide external services. Container port (targetport): the port the workload actually listens on, used to expose the service within the cluster. | - |
| Session Persistence | Optional. When enabled, requests from the same client are forwarded to the same Pod. | Enabled |
| Maximum session hold time | Optional. After session persistence is enabled, the maximum hold time; 30 seconds by default. | 30 seconds |
| Annotation | Optional. Add annotations to the service. | - |

Create NodePort service
Click NodePort , which means exposing the service via an IP and static port ( NodePort ) on each node. A NodePort service is routed to the automatically created ClusterIP service. You can access a NodePort service from outside the cluster by requesting `<NodeIP>:<NodePort>`. Refer to the configuration parameters in the table below.

| Parameter | Description | Example value |
| --- | --- | --- |
| Access type | Required. Specify the method of Pod service discovery; here select node access (NodePort). | NodePort |
| Service Name | Required. The name of the new service: a string of 4 to 63 characters that can contain lowercase letters, numbers, and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | Required. The namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. | default |
| Label selector | Required. Add a label; the Service selects Pods according to the label; click "Add" after filling it in. You can also reference the label of an existing workload: click Reference workload label, select the workload in the pop-up window, and the system uses the selected workload's label as the selector by default. | - |
| Port configuration | Required. Add a protocol port for the service; first select the port protocol type (TCP and UDP are currently supported). Port Name: the name of the custom port. Service port (port): the access port the Pod uses to provide external services; by default, it is set to the same value as the container port field for convenience. Container port (targetport): the port the workload actually listens on. Node port (nodeport): the node's port, which receives traffic forwarded from the ClusterIP and serves as the entrance for external traffic. | - |
| Session Persistence | Optional. When enabled, requests from the same client are forwarded to the same Pod. When enabled, .spec.sessionAffinity of the Service is ClientIP; for details, refer to Session Affinity for Service. | Enabled |
| Maximum session hold time | Optional. After session persistence is enabled, the maximum hold time; .spec.sessionAffinityConfig.clientIP.timeoutSeconds is set to 30 seconds by default. | 30 seconds |
| Annotation | Optional. Add annotations to the service. | - |

Create LoadBalancer service
Click Load Balancer , which refers to using the cloud provider's load balancer to expose the service externally. External load balancers can route traffic to the automatically created NodePort and ClusterIP services. Refer to the configuration parameters in the table below.

| Parameter | Description | Example value |
| --- | --- | --- |
| Access type | Required. Specify the method of Pod service discovery; here select load balancing (LoadBalancer). | LoadBalancer |
| Service Name | Required. The name of the new service: a string of 4 to 63 characters that can contain lowercase letters, numbers, and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | Required. The namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. | default |
| External Traffic Policy | Required. Set the external traffic policy. Cluster: traffic can be forwarded to Pods on all nodes in the cluster. Local: traffic is only sent to Pods on the current node. | - |
| Label selector | Required. Add a label; the Service selects Pods according to the label; click "Add" after filling it in. You can also reference the label of an existing workload: click Reference workload label, select the workload in the pop-up window, and the system uses the selected workload's label as the selector by default. | - |
| Load balancing type | Required. The type of load balancing used; MetalLB and others are currently supported. | MetalLB |
| IP Pool | Required. When the selected load balancing type is MetalLB, the LoadBalancer Service allocates IP addresses from this pool by default and announces all IP addresses in this pool through ARP. For details, refer to Install MetalLB. | - |
| Load balancing address | Required. 1. If you are using a public cloud CloudProvider, fill in the load balancing address provided by the cloud provider. 2. If the load balancing type is MetalLB, the IP is obtained from the IP pool above by default; if left empty, it is obtained automatically. | - |
| Port configuration | Required. Add a protocol port for the service; first select the port protocol type (TCP and UDP are currently supported). Port Name: the name of the custom port. Service port (port): the access port the Pod uses to provide external services; by default, it is set to the same value as the container port field for convenience. Container port (targetport): the port the workload actually listens on. Node port (nodeport): the node's port, which receives traffic forwarded from the ClusterIP and serves as the entrance for external traffic. | - |
| Annotation | Optional. Add annotations to the service. | - |

Complete service creation
After configuring all parameters, click the OK button to return to the service list automatically. On the right side of the list, click ┇ to modify or delete the selected service.
Network policies in Kubernetes allow you to control network traffic at the IP address or port level (OSI layer 3 or layer 4). The container management module currently supports creating network policies based on Pods or namespaces, using label selectors to specify which traffic can enter or leave Pods with specific labels.
For more details on network policies, refer to the official Kubernetes documentation on Network Policies.
Currently, there are two methods available for creating network policies: YAML and form-based creation. Each method has its advantages and disadvantages, catering to different user needs.
YAML creation requires fewer steps and is more efficient, but it has a higher learning curve as it requires familiarity with configuring network policy YAML files.
Form-based creation is more intuitive and straightforward. Users can simply fill in the proper values based on the prompts. However, this method involves more steps.
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies -> Create with YAML in the left navigation bar.
In the pop-up dialog, enter or paste the pre-prepared YAML file, then click OK at the bottom of the dialog.
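As an illustration of what such a YAML file might contain (names, labels, and port are placeholders):

```yaml
# Example NetworkPolicy: allow ingress to Pods labeled app=web only from Pods
# labeled app=frontend, on TCP port 80
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 80
```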
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies -> Create Policy in the left navigation bar.
Fill in the basic information.
The name and namespace cannot be changed after creation.
Fill in the policy configuration.
The policy configuration includes ingress and egress policies. To establish a successful connection from a source Pod to a target Pod, both the egress policy of the source Pod and the ingress policy of the target Pod need to allow the connection. If either side does not allow the connection, the connection will fail.
Ingress Policy: Click ➕ to begin configuring the policy. Multiple policies can be configured. The effects of multiple network policies are cumulative. Only when all network policies are satisfied simultaneously can a connection be successfully established.
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies . Click the name of the network policy.
View the basic configuration, associated instances, ingress policies, and egress policies of the policy.
Info
Under the \"Associated Instances\" tab, you can view instance monitoring, logs, container lists, YAML files, events, and more.
There are two ways to update network policies. You can either update them through the form or by using a YAML file.
On the network policy list page, find the policy you want to update, and choose Update in the action column on the right to update it via the form. Choose Edit YAML to update it using a YAML file.
Click the name of the network policy, then choose Update in the top right corner of the policy details page to update it via the form. Choose Edit YAML to update it using a YAML file.
There are two ways to delete network policies. You can delete network policies either through the form or by using a YAML file.
On the network policy list page, find the policy you want to delete, and choose Delete in the action column on the right to delete it via the form. Choose Edit YAML to delete it using a YAML file.
Click the name of the network policy, then choose Delete in the top right corner of the policy details page to delete it via the form. Choose Edit YAML to delete it using a YAML file.
As the number of business applications continues to grow, the resources of the cluster become increasingly tight. At this point, you can expand the cluster nodes based on kubean. After the expansion, applications can run on the newly added nodes, alleviating resource pressure.
Only clusters created through the container management module support node scaling; clusters integrated from outside do not support this operation. This article mainly introduces scaling out worker nodes in a worker cluster of the same architecture. If you need to add control nodes or heterogeneous worker nodes to the cluster, refer to: Expanding the control node of the work cluster, Adding heterogeneous nodes to the work cluster, Expanding the worker node of the global service cluster.
On the Clusters page, click the name of the target cluster.
If the Cluster Type contains the label Integrated Cluster, it means that the cluster does not support node scaling.
Click Nodes in the left navigation bar, and then click Integrate Node in the upper right corner of the page.
Enter the host name and node IP and click OK.
Click ➕ Add Worker Node to continue accessing more nodes.
Note
Accessing the node takes about 20 minutes, please be patient.
When the peak business period is over, you can reduce the size of the cluster and remove redundant nodes to save resource costs; this is node scale-down. After a node is removed, applications can no longer run on it.
The current operating user has the Cluster Admin role authorization.
Only clusters created through the container management module support node scaling; clusters integrated from outside do not support this operation.
Before removing a node, you need to pause scheduling on the node and evict the applications running on it to other nodes.
Eviction method: log in to a controller node and use the kubectl drain command to evict all Pods on the node. This safe eviction method allows the containers in the Pods to terminate gracefully (a sketch is shown after these notes).
When scaling down a cluster, nodes can only be removed one by one, not in batches.
If you need to remove cluster controller nodes, ensure that the final number of controller nodes is an odd number.
The first controller node cannot be taken offline during node scale-down. If this operation is necessary, please contact the after-sales engineer.
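As a sketch of the eviction step mentioned above (the node name is a placeholder):

```bash
# Cordon the node first, then safely evict its Pods
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```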
On the Clusters page, click the name of the target cluster.
If the Cluster Type has the tag Integrated Cluster , it means that the cluster does not support node scaling.
Click Nodes on the left navigation bar, find the node to be removed, click ┇ and select Remove .
Enter the node name, and click Delete to confirm.
"},{"location":"en/admin/kpanda/nodes/labels-annotations.html","title":"Labels and Annotations","text":"
Labels are identifying key-value pairs added to Kubernetes objects such as Pods, nodes, and clusters, which can be combined with label selectors to find and filter Kubernetes objects that meet certain conditions. Each key must be unique for a given object.
Annotations, like labels, are key/value pairs, but they do not have identification or filtering features. Annotations can be used to add arbitrary metadata to nodes. Annotation keys usually use the format prefix(optional)/name(required) , for example nfd.node.kubernetes.io/extended-resources . If the prefix is omitted, the annotation key is considered private to the user.
For more information about labels and annotations, refer to the official Kubernetes documentation on labels and selectors or annotations.
The steps to add or delete labels and annotations are as follows:
On the Clusters page, click the name of the target cluster.
Click Nodes on the left navigation bar, click the ┇ operation icon on the right side of the node, and click Edit Labels or Edit Annotations .
Click ➕ Add to add labels or annotations, click X to delete labels or annotations, and finally click OK .
"},{"location":"en/admin/kpanda/nodes/node-authentication.html","title":"Node Authentication","text":""},{"location":"en/admin/kpanda/nodes/node-authentication.html#authenticate-nodes-using-ssh-keys","title":"Authenticate Nodes Using SSH Keys","text":"
If you choose to authenticate the nodes of the cluster-to-be-created using SSH keys, you need to configure the public and private keys according to the following instructions.
Run the following command on any node within the management cluster of the cluster-to-be-created to generate the public and private keys.
```bash
cd /root/.ssh
ssh-keygen -t rsa
```
Run the ls command to check if the keys have been successfully created in the management cluster. The correct output should be as follows:
```bash
ls
id_rsa  id_rsa.pub  known_hosts
```
The file named id_rsa is the private key, and the file named id_rsa.pub is the public key.
Run the following command to load the public key file id_rsa.pub onto all the nodes of the cluster-to-be-created.
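The exact command is not shown in this extract; one common way to do this is ssh-copy-id, for example (the account and IP below are placeholders):

```bash
ssh-copy-id -i /root/.ssh/id_rsa.pub root@10.0.0.10
```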
Replace the user account and node IP in the above command with the username and IP of the nodes in the cluster-to-be-created. The same operation needs to be performed on every node in the cluster-to-be-created.
Run the following command to view the private key file id_rsa created in step 1.
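For example:

```bash
cat /root/.ssh/id_rsa
```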
Copy the content of the private key and paste it into the interface's key input field.
"},{"location":"en/admin/kpanda/nodes/node-check.html","title":"Create a cluster node availability check","text":"
When creating a cluster or adding nodes to an existing cluster, refer to the table below to check the node configuration to avoid cluster creation or expansion failure due to wrong node configuration.
| Check Item | Description |
| --- | --- |
| OS | Refer to Supported Architectures and Operating Systems |
| SELinux | Off |
| Firewall | Off |
| Architecture Consistency | Consistent CPU architecture between nodes (such as ARM or x86) |
| Host Time | The time difference between all hosts is within 10 seconds |
| Network Connectivity | The node and its SSH port can be accessed normally by the platform |
| CPU | Available CPU resources are greater than 4 Cores |
| Memory | Available memory resources are greater than 8 GB |

Supported architectures and operating systems

| Architecture | Operating System | Remarks |
| --- | --- | --- |
| ARM | Kylin Linux Advanced Server release V10 (Sword) SP2 | Recommended |
| ARM | UOS Linux | |
| ARM | openEuler | |
| x86 | CentOS 7.x | Recommended |
| x86 | Redhat 7.x | Recommended |
| x86 | Redhat 8.x | Recommended |
| x86 | Flatcar Container Linux by Kinvolk | |
| x86 | Debian Bullseye, Buster, Jessie, Stretch | |
| x86 | Ubuntu 16.04, 18.04, 20.04, 22.04 | |
| x86 | Fedora 35, 36 | |
| x86 | Fedora CoreOS | |
| x86 | openSUSE Leap 15.x/Tumbleweed | |
| x86 | Oracle Linux 7, 8, 9 | |
| x86 | Alma Linux 8, 9 | |
| x86 | Rocky Linux 8, 9 | |
| x86 | Amazon Linux 2 | |
| x86 | Kylin Linux Advanced Server release V10 (Sword) - SP2 | Haiguang |
| x86 | UOS Linux | |
| x86 | openEuler | |

Node Details
After accessing or creating a cluster, you can view the information of each node in the cluster, including node status, labels, resource usage, Pod, monitoring information, etc.
On the Clusters page, click the name of the target cluster.
Click Nodes on the left navigation bar to view the node status, role, label, CPU/memory usage, IP address, and creation time.
Click the node name to enter the node details page to view more information, including overview information, pod information, label annotation information, event list, status, etc.
In addition, you can also view the node's YAML file, monitoring information, labels and annotations, etc.
Supports suspending or resuming scheduling of nodes. Pausing scheduling means stopping the scheduling of Pods to the node. Resuming scheduling means that Pods can be scheduled to that node.
On the Clusters page, click the name of the target cluster.
Click Nodes on the left navigation bar, click the ┇ operation icon on the right side of the node, and click the Cordon button to suspend scheduling the node.
Click the ┇ operation icon on the right side of the node, and click the Uncordon button to resume scheduling the node.
The node scheduling status may be delayed due to network conditions. Click the refresh icon on the right side of the search box to refresh the node scheduling status.
A taint allows a node to repel a certain type of Pod and prevent such Pods from being scheduled on that node. One or more taints can be applied to each node, and Pods that cannot tolerate these taints will not be scheduled on that node.
Find the target cluster on the Clusters page, and click the cluster name to enter the Cluster page.
In the left navigation bar, click Nodes , find the node that needs to modify the taint, click the ┇ operation icon on the right and click the Edit Taints button.
Enter the key value information of the taint in the pop-up box, select the taint effect, and click OK .
Click ➕ Add to add multiple taints to the node, and click X on the right side of the taint effect to delete the taint.
Currently supports three taint effects:
NoExecute: This affects pods that are already running on the node as follows:
Pods that do not tolerate the taint are evicted immediately
Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever
Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time. After that time elapses, the node lifecycle controller evicts the Pods from the node.
NoSchedule: No new Pods will be scheduled on the tainted node unless they have a matching toleration. Pods currently running on the node are not evicted.
PreferNoSchedule: This is a \"preference\" or \"soft\" version of NoSchedule. The control plane will try to avoid placing a Pod that does not tolerate the taint on the node, but it is not guaranteed, so this taint is not recommended to use in a production environment.
For more details about taints, refer to the Kubernetes documentation on Taints and Tolerations.
The current cluster has been integrated into container management, and the Global cluster has the kolm component installed (search the Helm templates for kolm).
The current cluster has the olm component installed with a version of 0.2.4 or higher (search for helm templates for olm).
Go to Container Management -> Select the current cluster -> Helm Apps -> View the olm component -> Plugin Settings , and find the images needed for the opm, minio, minio bundle, and minio operator in the subsequent steps.
Using the screenshot as an example, the four image addresses are as follows:

```bash
# opm image
10.5.14.200/quay.m.daocloud.io/operator-framework/opm:v1.29.0

# minio image
10.5.14.200/quay.m.daocloud.io/minio/minio:RELEASE.2023-03-24T21-41-23Z

# minio bundle image
10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3

# minio operator image
10.5.14.200/quay.m.daocloud.io/minio/operator:v5.0.3
```
Run the opm command to get the operators included in the offline bundle image.
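The exact invocation is not included in this extract; as one possible sketch, opm's render subcommand can be used to inspect the bundle image listed above (this assumes the opm v1.29.0 binary is available locally and the offline registry is reachable):

```bash
# Render the offline minio bundle image to see the operators it contains
opm render 10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3 --output=yaml
```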
Replace all image addresses in the minio-operator/manifests/minio-operator.clusterserviceversion.yaml file with the image addresses from the offline container registry.
Before replacement:
After replacement:
Generate a Dockerfile for building the bundle image.
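As a sketch, opm can generate such a Dockerfile from the bundle image; the flags below assume opm's index add subcommand, and the bundle address should be adjusted to your environment:

```bash
# Generate index.Dockerfile for the new catalog image instead of building it directly
opm index add \
  --bundles 10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3 \
  --generate \
  --out-dockerfile index.Dockerfile
```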
```bash
# Set the new catalog image
export OFFLINE_CATALOG_IMG=10.5.14.200/release.daocloud.io/operator-framework/system-operator-index:v0.1.0-offline

$ docker build . -f index.Dockerfile -t ${OFFLINE_CATALOG_IMG}

$ docker push ${OFFLINE_CATALOG_IMG}
```
Go to Container Management and update the built-in catsrc image for the olm Helm App (enter the catalog image built in the previous step, ${catalog-image} ).
After the update is successful, the minio-operator component will appear in the Operator Hub.
"},{"location":"en/admin/kpanda/permissions/cluster-ns-auth.html","title":"Cluster and Namespace Authorization","text":"
Container management implements authorization based on global authority management and global user/group management. If you need to grant users the highest authority for container management (the ability to create, manage, and delete all clusters), refer to What is Access Control.
After the user logs in to the platform, click Privilege Management under Container Management on the left menu bar, which is located on the Cluster Permissions tab by default.
Click the Add Authorization button.
On the Add Cluster Permission page, select the target cluster, the user/group to be authorized, and click OK .
Currently, the only cluster role supported is Cluster Admin . For details about permissions, refer to Permission Description. If you need to authorize multiple users/groups at the same time, you can click Add User Permissions to add multiple times.
Return to the cluster permission management page, and a message appears on the screen: Cluster permission added successfully .
After the user logs in to the platform, click Permissions under Container Management on the left menu bar, and click the Namespace Permissions tab.
Click the Add Authorization button. On the Add Namespace Permission page, select the target cluster, target namespace, and user/group to be authorized, and click OK .
The currently supported namespace roles are NS Admin, NS Editor, and NS Viewer. For details about permissions, refer to Permission Description. If you need to authorize multiple users/groups at the same time, you can click Add User Permission to add multiple times. Click OK to complete the permission authorization.
Return to the namespace permission management page, and a message appears on the screen: Namespace permission added successfully .
Tip
If you need to delete or edit permissions later, you can click ┇ on the right side of the list and select Edit or Delete .
"},{"location":"en/admin/kpanda/permissions/custom-kpanda-role.html","title":"Adding RBAC Rules to System Roles","text":"
In the past, the RBAC rules for those system roles in container management were pre-defined and could not be modified by users. To support more flexible permission settings and to meet the customized needs for system roles, now you can modify RBAC rules for system roles such as cluster admin, ns admin, ns editor, ns viewer.
The following example demonstrates how to add a new ns-view rule, granting the authority to delete workload deployments. Similar operations can be performed for other rules.
Before adding RBAC rules to system roles, the following prerequisites must be met:
Container management v0.27.0 and above.
Integrated Kubernetes cluster or created Kubernetes cluster, and able to access the cluster's UI interface.
A namespace and a user account have been created, and the user has been granted the NS Viewer role. For details, refer to Namespace Authorization.
Note
RBAC rules only need to be added in the Global Cluster, and the Kpanda controller will synchronize those added rules to all integrated subclusters. Synchronization may take some time to complete.
RBAC rules can only be added in the Global Cluster. RBAC rules added in subclusters will be overridden by the system role permissions of the Global Cluster.
Only ClusterRoles with a fixed label are supported for adding rules. Replacing or deleting rules is not supported, nor is adding rules by using a Role. The correspondence between the built-in roles and the labels of user-created ClusterRoles is as follows.
Create a deployment by a user with admin or cluster admin permissions.
Grant a user the ns-viewer role to provide them with the ns-view permission.
Switch the login user to ns-viewer, open the console to get the token for the ns-viewer user, and use curl to request and delete the nginx deployment mentioned above. However, a prompt appears as below, indicating the user doesn't have permission to delete it.
```bash
[root@master-01 ~]# curl -k -X DELETE 'https://${URL}/apis/kpanda.io/v1alpha1/clusters/cluster-member/namespaces/default/deployments/nginx' -H 'authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJOU044MG9BclBRMzUwZ2VVU2ZyNy1xMEREVWY4MmEtZmJqR05uRE1sd1lFIn0.eyJleHAiOjE3MTU3NjY1NzksImlhdCI6MTcxNTY4MDE3OSwiYXV0aF90aW1lIjoxNzE1NjgwMTc3LCJqdGkiOiIxZjI3MzJlNC1jYjFhLTQ4OTktYjBiZC1iN2IxZWY1MzAxNDEiLCJpc3MiOiJodHRwczovLzEwLjYuMjAxLjIwMTozMDE0Ny9hdXRoL3JlYWxtcy9naGlwcG8iLCJhdWQiOiJfX2ludGVybmFsLWdoaXBwbyIsInN1YiI6ImMxZmMxM2ViLTAwZGUtNDFiYS05ZTllLWE5OGU2OGM0MmVmMCIsInR5cCI6IklEIiwiYXpwIjoiX19pbnRlcm5hbC1naGlwcG8iLCJzZXNzaW9uX3N0YXRlIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiYXRfaGFzaCI6IlJhTHoyQjlKQ2FNc1RrbGVMR3V6blEiLCJhY3IiOiIwIiwic2lkIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiZW1haWxfdmVyaWZpZWQiOmZhbHNlLCJncm91cHMiOltdLCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJucy12aWV3ZXIiLCJsb2NhbGUiOiIifQ.As2ipMjfvzvgONAGlc9RnqOd3zMwAj82VXlcqcR74ZK9tAq3Q4ruQ1a6WuIfqiq8Kq4F77ljwwzYUuunfBli2zhU2II8zyxVhLoCEBu4pBVBd_oJyUycXuNa6HfQGnl36E1M7-_QG8b-_T51wFxxVb5b7SEDE1AvIf54NAlAr-rhDmGRdOK1c9CohQcS00ab52MD3IPiFFZ8_Iljnii-RpXKZoTjdcULJVn_uZNk_SzSUK-7MVWmPBK15m6sNktOMSf0pCObKWRqHd15JSe-2aA2PKBo1jBH3tHbOgZyMPdsLI0QdmEnKB5FiiOeMpwn_oHnT6IjT-BZlB18VkW8rA'

{"code":7,"message":"[RBAC] delete resources(deployments: nginx) is forbidden for user(ns-viewer) in cluster(cluster-member)","details":[]}
```
Create a ClusterRole on the global cluster, as shown in the yaml below.
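The YAML itself is not included in this extract; a sketch of such a ClusterRole follows. The label key and value are placeholders -- use the fixed label that corresponds to the ns-viewer role in your environment (see the correspondence described above):

```yaml
# Sketch: ClusterRole that adds the "delete deployments" rule to ns-viewer
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: append-ns-viewer              # any unique name that follows Kubernetes naming rules
  labels:
    rbac.kpanda.io/role: ns-viewer    # placeholder label; replace with the documented fixed label
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["delete"]
```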
The name field value can be arbitrarily specified, as long as it is not duplicated and complies with Kubernetes resource naming conventions.
When adding rules to different roles, make sure to apply different labels.
Wait for the Kpanda controller to merge the user-created rule into the built-in ns-viewer role, then check whether the rule added in the previous step is present for ns-viewer.
When you use curl again to request deletion of the aforementioned nginx deployment, the deletion succeeds this time, which means ns-viewer has successfully gained the rule to delete deployments.
Container management permissions are based on a multi-dimensional permission management system created by global permission management and Kubernetes RBAC permission management. It supports cluster-level and namespace-level permission control, helping users to conveniently and flexibly set different operation permissions for IAM users and groups (collections of users) under a tenant.
Cluster permissions are authorized based on Kubernetes RBAC's ClusterRoleBinding, allowing users/groups to have cluster-related permissions. The current default cluster role is Cluster Admin (does not have the permission to create or delete clusters).
Namespace permissions are authorized based on Kubernetes RBAC capabilities, allowing different users/groups to have different operation permissions on resources under a namespace (including Kubernetes API permissions). For details, refer to: Kubernetes RBAC. Currently, the default roles for container management are: NS Admin, NS Editor, NS Viewer.
What is the relationship between global permissions and container management permissions?
Answer: Global permissions only authorize coarse-grained permissions, which can manage the creation, editing, and deletion of all clusters; while for fine-grained permissions, such as the management permissions of a single cluster, the management, editing, and deletion permissions of a single namespace, they need to be implemented based on Kubernetes RBAC container management permissions. Generally, users only need to be authorized in container management.
Currently, only four default roles are supported. Can the RoleBinding and ClusterRoleBinding (Kubernetes fine-grained RBAC) for custom roles also take effect?
Answer: Currently, custom permissions cannot be managed through the graphical interface, but the permission rules created using kubectl can still take effect.
Suanova AI platform supports elastic scaling of Pod resources based on metrics (Horizontal Pod Autoscaling, HPA). Users can dynamically adjust the number of Pod replicas by setting CPU utilization, memory usage, and custom metrics. For example, after setting an auto scaling policy based on the CPU utilization metric for a workload, when the CPU utilization of the Pods rises above or falls below the threshold you set, the workload controller automatically increases or decreases the number of Pod replicas accordingly.
This page describes how to configure auto scaling based on built-in metrics and custom metrics for workloads.
Note
HPA is only applicable to Deployment and StatefulSet, and only one HPA can be created per workload.
If you create an HPA policy based on CPU utilization, you must set the configuration limit (Limit) for the workload in advance, otherwise the CPU utilization cannot be calculated.
If built-in metrics and multiple custom metrics are used at the same time, HPA calculates the number of replicas required for each metric separately and takes the largest value (without exceeding the maximum number of replicas configured in the HPA policy) for elastic scaling.
Refer to the following steps to configure a built-in metric auto scaling policy for the workload.
Click Clusters on the left navigation bar to enter the cluster list page. Click a cluster name to enter the Cluster Details page.
On the cluster details page, click Workload in the left navigation bar to enter the workload list, and then click a workload name to enter the Workload Details page.
Click the Auto Scaling tab to view the auto scaling configuration of the current cluster.
After confirming that the cluster has installed the metrics-server plug-in, and the plug-in is running normally, you can click the New Scaling button.
Configure the auto scaling policy parameters.
Policy name: Enter the name of the auto scaling policy. Note that the name can contain up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as hpa-my-dep.
Namespace: The namespace where the workload resides.
Workload: The workload object that performs auto scaling.
Target CPU Utilization: The CPU utilization of the Pods under the workload, calculated as the actual CPU usage divided by the sum of the CPU request values of all Pods under the workload. When the actual CPU utilization is higher or lower than the target value, the system automatically increases or decreases the number of Pod replicas accordingly.
Target Memory Usage: The memory usage of the Pods under the workload. When the actual memory usage is higher or lower than the target value, the system automatically increases or decreases the number of Pod replicas accordingly.
Replica range: the elastic scaling range of the number of Pod replicas. The default interval is 1 - 10.
After completing the parameter configuration, click the OK button to automatically return to the elastic scaling details page. Click ┇ on the right side of the list to edit, delete, and view related events.
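For reference, the form above roughly corresponds to a standard Kubernetes HorizontalPodAutoscaler object like the sketch below; the names and the 60% target are placeholders.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-my-dep          # policy name
  namespace: default        # namespace where the workload resides
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-dep            # workload to scale
  minReplicas: 1            # replica range lower bound
  maxReplicas: 10           # replica range upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # target CPU utilization (%)
```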
The Vertical Pod Autoscaler (VPA) calculates the most suitable CPU and memory request values for a Pod by monitoring the Pod's resource requests and actual usage over a period of time. Using VPA allocates resources to each Pod in the cluster more reasonably, improves the overall resource utilization of the cluster, and avoids wasting cluster resources.
The AI platform supports VPA for containers. Based on this feature, Pod request values can be dynamically adjusted according to container resource usage. The platform supports both manual and automatic modification of resource request values, and you can configure them according to actual needs.
This page describes how to configure VPA for deployment.
Warning
Using VPA to modify a Pod resource request will trigger a Pod restart. Due to the limitations of Kubernetes itself, Pods may be scheduled to other nodes after restarting.
Refer to the following steps to configure a vertical scaling (VPA) policy for the deployment.
Find the current cluster in Clusters , and click the name of the target cluster.
Click Deployments in the left navigation bar, find the deployment that needs to create a VPA, and click the name of the deployment.
Click the Auto Scaling tab to view the auto scaling configuration of the current cluster, and confirm that the relevant plug-ins have been installed and are running normally.
Click the Create Autoscaler button and configure the VPA vertical scaling policy parameters.
Policy name: Enter the name of the vertical scaling policy. Note that the name can contain up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as vpa-my-dep.
Scaling mode: The method used to modify the CPU and memory request values. Vertical scaling currently supports manual and automatic scaling modes.
Manual scaling: After the vertical scaling policy calculates the recommended resource configuration value, the user needs to manually modify the resource quota of the application.
Auto-scaling: The vertical scaling policy automatically calculates and modifies the resource quota of the application.
Target container: Select the container to be scaled vertically.
After completing the parameter configuration, click the OK button to automatically return to the elastic scaling details page. Click ┇ on the right side of the list to perform edit and delete operations.
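As a rough equivalent of the form, a VerticalPodAutoscaler object looks like the sketch below; the names are placeholders, and updateMode "Off" corresponds to manual scaling (recommendations only) while "Auto" corresponds to automatic scaling.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-my-dep
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-dep
  updatePolicy:
    updateMode: "Off"        # "Off" = recommendations only (manual); "Auto" = apply automatically
  resourcePolicy:
    containerPolicies:
      - containerName: "*"   # target container; "*" applies to all containers in the Pod
```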
"},{"location":"en/admin/kpanda/scale/custom-hpa.html","title":"Creating HPA Based on Custom Metrics","text":"
When the built-in CPU and memory metrics in the system do not meet your business needs, you can add custom metrics by configuring ServiceMonitoring and achieve auto-scaling based on these custom metrics. This article will introduce how to configure auto-scaling for workloads based on custom metrics.
Note
HPA is only applicable to Deployment and StatefulSet, and each workload can only create one HPA.
If both built-in metrics and multiple custom metrics are used, HPA will calculate the required number of scaled replicas based on multiple metrics respectively, and take the larger value (but not exceeding the maximum number of replicas configured when setting the HPA policy) for scaling.
Refer to the following steps to configure the auto-scaling policy based on metrics for workloads.
Click Clusters in the left navigation bar to enter the clusters page. Click a cluster name to enter the Cluster Overview page.
On the Cluster Details page, click Workloads in the left navigation bar to enter the workload list, and click a workload name to enter the Workload Details page.
Click the Auto Scaling tab to view the current autoscaling configuration of the cluster.
Confirm that the cluster has installed metrics-server, Insight, and Prometheus-adapter plugins, and that the plugins are running normally, then click the Create AutoScaler button.
Note
If the related plugins are not installed or the plugins are in an abnormal state, you will not be able to see the entry for creating custom metrics auto-scaling on the page.
Policy Name: Enter the name of the auto-scaling policy. Note that the name can be up to 63 characters long, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, e.g., hpa-my-dep.
Namespace: The namespace where the workload is located.
Workload: The workload object that performs auto-scaling.
Resource Type: The type of custom metric being monitored, including Pod and Service types.
Metric: The name of the custom metric created using ServiceMonitoring or the name of the system-built custom metric.
Data Type: The method used to calculate the metric value, including target value and target average value. When the resource type is Pod, only the target average value can be used.
This case takes a Golang business program as an example. The example program exposes the httpserver_requests_total metric and records HTTP requests. This metric can be used to calculate the QPS value of the business program.
"},{"location":"en/admin/kpanda/scale/custom-hpa.html#deploy-business-program","title":"Deploy Business Program","text":"
"},{"location":"en/admin/kpanda/scale/custom-hpa.html#prometheus-collects-business-monitoring","title":"Prometheus Collects Business Monitoring","text":"
If the insight-agent is installed, Prometheus can be configured by creating a ServiceMonitor CRD object.
Operation steps: In Cluster Details -> Custom Resources, search for "servicemonitors.monitoring.coreos.com" and click the name to enter the details page. Create the following example CRD object in the httpserver namespace via YAML:
If Prometheus is installed via insight, the ServiceMonitor must carry the label operator.insight.io/managed-by: insight. If Prometheus is installed by other means, this label is not required.
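A minimal ServiceMonitor sketch for this scenario is shown below; the port name, label selector, and scrape interval are assumptions and must match the Service that actually exposes the httpserver_requests_total metric.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: httpserver
  namespace: httpserver
  labels:
    operator.insight.io/managed-by: insight   # required only when Prometheus is installed via insight
spec:
  endpoints:
    - port: http          # must match the port name defined in the business Service
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - httpserver
  selector:
    matchLabels:
      app: httpserver     # must match the labels on the business Service
```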
"},{"location":"en/admin/kpanda/scale/custom-hpa.html#configure-metric-rules-in-prometheus-adapter","title":"Configure Metric Rules in Prometheus-adapter","text":"
Operation steps: In Clusters -> Helm Apps, search for "prometheus-adapter", enter the update page through the action bar, and configure the custom metric rules in YAML:
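A sketch of such a custom rule in the prometheus-adapter Helm values is shown below; the exposed metric name httpserver_requests_qps and the 2m rate window are illustrative assumptions.

```yaml
rules:
  custom:
    - seriesQuery: 'httpserver_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: { resource: "namespace" }
          pod: { resource: "pod" }
      name:
        matches: "httpserver_requests_total"
        as: "httpserver_requests_qps"        # metric name exposed to the HPA
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```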
Following the steps above, find the httpserver application under Deployments and create an auto-scaling policy based on the custom metric.
"},{"location":"en/admin/kpanda/scale/hpa-cronhpa-compatibility-rules.html","title":"Compatibility Rules for HPA and CronHPA","text":"
HPA stands for HorizontalPodAutoscaler, which refers to horizontal pod auto-scaling.
CronHPA stands for Cron HorizontalPodAutoscaler, which refers to scheduled horizontal pod auto-scaling.
"},{"location":"en/admin/kpanda/scale/hpa-cronhpa-compatibility-rules.html#conflict-between-cronhpa-and-hpa","title":"Conflict Between CronHPA and HPA","text":"
Scheduled scaling with CronHPA triggers horizontal pod scaling at specified times. To handle sudden traffic surges, you may also have configured HPA to keep your application running normally. When both HPA and CronHPA exist for the same workload, conflicts arise because CronHPA and HPA operate independently and are unaware of each other, so the action performed last overrides the one performed first.
By comparing the definition templates of CronHPA and HPA, the following points can be observed:
Both CronHPA and HPA use the scaleTargetRef field to identify the scaling target.
CronHPA schedules the number of replicas to scale based on crontab rules in jobs.
HPA determines scaling based on resource utilization.
Note
If both CronHPA and HPA are set, there will be scenarios where CronHPA and HPA simultaneously operate on a single scaleTargetRef.
"},{"location":"en/admin/kpanda/scale/hpa-cronhpa-compatibility-rules.html#compatibility-solution-for-cronhpa-and-hpa","title":"Compatibility Solution for CronHPA and HPA","text":"
As noted above, the fundamental reason that simultaneous use of CronHPA and HPA results in the later action overriding the earlier one is that the two controllers cannot sense each other. Therefore, the conflict can be resolved by enabling CronHPA to be aware of HPA's current state.
The system will treat HPA as the scaling object for CronHPA, thus achieving scheduled scaling for the Deployment object defined by the HPA.
HPA's definition configures the Deployment in the scaleTargetRef field, and then the Deployment uses its definition to locate the ReplicaSet, which ultimately adjusts the actual number of replicas.
In AI platform, the scaleTargetRef in CronHPA is set to the HPA object, and it uses the HPA object to find the actual scaleTargetRef, allowing CronHPA to be aware of HPA's current state.
CronHPA senses HPA by adjusting HPA. CronHPA determines whether scaling is needed and modifies the HPA upper limit by comparing the target number of replicas with the current number of replicas, choosing the larger value. Similarly, CronHPA determines whether to modify the HPA lower limit by comparing the target number of replicas from CronHPA with the configuration in HPA, choosing the smaller value.
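A sketch of this pattern is shown below, assuming the kubernetes-cronhpa-controller CRD (API group autoscaling.alibabacloud.com); the names, schedules, and target sizes are placeholders, and the exact API version may differ between plugin releases.

```yaml
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: cronhpa-sample
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler   # CronHPA points at the HPA, not the Deployment
    name: hpa-my-dep
  jobs:
    - name: scale-up-before-peak
      schedule: "0 0 8 * * *"       # crontab with a seconds field: every day at 08:00
      targetSize: 10
    - name: scale-down-after-peak
      schedule: "0 0 22 * * *"      # every day at 22:00
      targetSize: 2
```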
The scheduled horizontal pod autoscaling policy (CronHPA) provides a stable compute resource guarantee for applications with periodic high concurrency, and kubernetes-cronhpa-controller is the key component that implements CronHPA.
This section describes how to install the kubernetes-cronhpa-controller plugin.
Note
To use CronHPA, you must install the metrics-server plugin in addition to the kubernetes-cronhpa-controller plugin.
Refer to the following steps to install the kubernetes-cronhpa-controller plugin for the cluster.
On the Clusters page, find the target cluster where the plugin needs to be installed, click the name of the cluster, then click Workloads -> Deployments on the left, and click the name of the target workload.
On the workload details page, click the Auto Scaling tab, and click Install on the right side of CronHPA .
Read the relevant introduction of the plug-in, select the version and click the Install button. It is recommended to install 1.3.0 or later.
Refer to the following instructions to configure the parameters.
Name: Enter the plugin name. Note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as kubernetes-cronhpa-controller.
Namespace: Select which namespace the plugin will be installed in, here we take default as an example.
Version: The version of the plugin, here we take the 1.3.0 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be in the ready state before marking the application installation as successful.
Failed to delete: If the plugin installation fails, delete the associated resources that have already been installed. When enabled, Wait will be enabled synchronously by default.
Detailed log: When enabled, a detailed log of the installation process will be recorded.
Note
After enabling Ready Wait and/or Failed to delete, it takes longer for the application to be marked as "running".
Click OK in the lower right corner of the page, and the system will automatically jump to the Helm Apps list page. Wait a few minutes and refresh the page to see the application you just installed.
Warning
If you need to delete the kubernetes-cronhpa-controller plugin, you should go to the Helm Apps list page to delete it completely.
If you delete the plugin under the Auto Scaling tab of the workload, only the workload's record of the plugin is removed; the plugin itself is not deleted, and an error will be reported when the plugin is reinstalled later.
Go back to the Auto Scaling tab under the workload details page, and you can see that the interface displays Plug-in installed . Now it's time to start creating CronHPA policies.
metrics-server is the built-in resource usage metrics collection component of Kubernetes. By configuring HPA policies, you can automatically scale Pod replicas horizontally for workload resources.
This section describes how to install metrics-server .
Please perform the following steps to install the metrics-server plugin for the cluster.
On the Auto Scaling page under workload details, click the Install button to enter the metrics-server plug-in installation interface.
Read the introduction of the metrics-server plugin, select the version and click the Install button. This page will use the 3.8.2 version as an example to install, and it is recommended that you install 3.8.2 and later versions.
Configure basic parameters on the installation configuration interface.
Name: Enter the plugin name. Note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as metrics-server-01.
Namespace: Select the namespace for plugin installation, here we take default as an example.
Version: The version of the plugin, here we take 3.8.2 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be ready before marking the application installation as successful.
Failed to delete: If the installation fails, the associated resources that have already been installed are removed. When enabled, Ready Wait is also enabled by default.
Verbose log: Turn on the verbose output of the installation process log.
Note
After enabling Ready Wait and/or Failed to delete, it takes longer for the app to be marked as Running.
Advanced parameter configuration
If the cluster network cannot access the k8s.gcr.io registry, try modifying the repository parameter to repository: k8s.m.daocloud.io/metrics-server/metrics-server.
An SSL certificate is also required to install the metrics-server plugin. To bypass certificate verification, add the - --kubelet-insecure-tls parameter under defaultArgs:.
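A sketch of the relevant Helm values is shown below, assuming the upstream metrics-server chart layout; the mirror address follows the example above, and the remaining defaultArgs mirror the chart defaults.

```yaml
image:
  repository: k8s.m.daocloud.io/metrics-server/metrics-server   # mirror for clusters that cannot reach k8s.gcr.io
defaultArgs:
  - --cert-dir=/tmp
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  - --kubelet-insecure-tls        # bypass kubelet certificate verification
```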
Click to view and use the YAML parameters to replace the default YAML
Click the OK button to complete the installation of the metrics-server plug-in, and then the system will automatically jump to the Helm Apps list page. After a few minutes, refresh the page and you will see the newly installed Applications.
Note
The metrics-server plugin can only be completely deleted from the Helm Apps list page. If you only delete metrics-server on the workload page, only the workload's record of the application is removed; the application itself is not deleted, and an error will be reported when you reinstall the plugin later.
The Vertical Pod Autoscaler (VPA) makes cluster resource allocation more reasonable and avoids wasting cluster resources. vpa is the key component for implementing vertical container autoscaling.
This section describes how to install the vpa plugin.
To use VPA policies, you must install the metrics-server plugin in addition to the vpa plugin.
Refer to the following steps to install the vpa plugin for the cluster.
On the Clusters page, find the target cluster where the plugin needs to be installed, click the name of the cluster, then click Workloads -> Deployments on the left, and click the name of the target workload.
On the workload details page, click the Auto Scaling tab, and click Install on the right side of VPA .
Read the relevant introduction of the plug-in, select the version and click the Install button. It is recommended to install 1.5.0 or later.
Review the configuration parameters described below.
Name: Enter the plugin name. Note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as kubernetes-cronhpa-controller.
Namespace: Select which namespace the plugin will be installed in, here we take default as an example.
Version: The version of the plugin, here we take the 1.5.0 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be in the ready state before marking the application installation as successful.
Failed to delete: If the plugin installation fails, delete the associated resources that have already been installed. When enabled, Wait will be enabled synchronously by default.
Detailed log: When enabled, a detailed log of the installation process will be recorded.
Note
After enabling Wait and/or Deletion failed , it takes a long time for the application to be marked as running .
Click OK in the lower right corner of the page, and the system will automatically jump to the Helm Apps list page. Wait a few minutes and refresh the page to see the application you just installed.
Warning
If you need to delete the vpa plugin, you should go to the Helm Apps list page to delete it completely.
If you delete the plugin under the Auto Scaling tab of the workload, only the workload's record of the plugin is removed; the plugin itself is not deleted, and an error will be reported when the plugin is reinstalled later.
Go back to the Auto Scaling tab under the workload details page, and you can see that the interface displays Plug-in installed. Now you can start creating VPA policies.
Log in to the cluster, click the sidebar Helm Apps -> Helm Charts, enter knative in the search box at the top right, and then press Enter to search.
Click the knative-operator to enter the installation configuration interface. You can view the available versions and the Parameters optional items of Helm values on this interface.
After clicking the install button, you will enter the installation configuration interface.
Enter the name and installation tenant; it is recommended to check Wait and Detailed Logs.
In the settings below, you can tick Serving and enter the installation tenant of the Knative Serving component, which will deploy the Knative Serving component after installation. This component is managed by the Knative Operator.
Knative provides a higher level of abstraction, simplifying and speeding up the process of building, deploying, and managing applications on Kubernetes. It allows developers to focus more on implementing business logic, while leaving most of the infrastructure and operations work to Knative, significantly improving productivity.
| Component | Features |
| --- | --- |
| Activator | Queues requests (if a Knative Service has scaled to zero). Calls the autoscaler to bring back services that have scaled down to zero and forwards queued requests. The Activator can also act as a request buffer, handling bursts of traffic. |
| Autoscaler | Responsible for scaling Knative services based on configuration, metrics, and incoming requests. |
| Controller | Manages the state of Knative CRs. It monitors multiple objects, manages the lifecycle of dependent resources, and updates resource status. |
| Queue-Proxy | Sidecar container injected into each Knative Service. Responsible for collecting traffic data and reporting it to the Autoscaler, which then initiates scaling requests based on this data and preset rules. |
| Webhooks | Knative Serving has several Webhooks responsible for validating and mutating Knative resources. |

Ingress Traffic Entry Solutions

| Solution | Use Case |
| --- | --- |
| Istio | If Istio is already in use, it can be chosen as the traffic entry solution. |
| Contour | If Contour has been enabled in the cluster, it can be chosen as the traffic entry solution. |
| Kourier | If neither of the above two Ingress components is present, Knative's Envoy-based Kourier Ingress can be used as the traffic entry solution. |

Autoscaler Solutions Comparison

| Autoscaler Type | Core Part of Knative Serving | Default Enabled | Scale to Zero Support | CPU-based Autoscaling Support |
| --- | --- | --- | --- | --- |
| Knative Pod Autoscaler (KPA) | Yes | Yes | Yes | No |
| Horizontal Pod Autoscaler (HPA) | No | Needs to be enabled after installing Knative Serving | No | Yes |

CRD

| Resource Type | API Name | Description |
| --- | --- | --- |
| Services | service.serving.knative.dev | Automatically manages the entire lifecycle of workloads, controls the creation of other objects, and ensures applications have Routes, Configurations, and a new revision with each update. |
| Routes | route.serving.knative.dev | Maps network endpoints to one or more revisions and supports traffic distribution and version routing. |
| Configurations | configuration.serving.knative.dev | Maintains the desired state of deployments, provides separation between code and configuration, and follows the Twelve-Factor App methodology; modifying a configuration creates a new revision. |
| Revisions | revision.serving.knative.dev | A snapshot of the workload at each modification point in time; an immutable object that automatically scales based on traffic. |

Knative Practices
In this section, we will delve into learning Knative through several practical exercises.
case 1: When there is low or no traffic, requests are routed through the activator.
case 2: When traffic is high, requests are routed directly to the Pods only once they exceed the target-burst-capacity.
Configured as 0: scaling from zero is the only scenario in which the activator sits in the request path.
Configured as -1: the activator is always present in the request path.
Configured as > 0: the number of additional concurrent requests the system can absorb before scaling is triggered.
case 3: When traffic decreases again, requests are routed back through the activator once current_demand + target-burst-capacity > (pods * concurrency-target).
That is, the total number of in-flight requests plus the permitted burst exceeds the per-Pod target concurrency multiplied by the number of Pods.
"},{"location":"en/admin/kpanda/scale/knative/playground.html#case-2-based-on-concurrent-elastic-scaling","title":"case 2 - Based on Concurrent Elastic Scaling","text":"
We first apply the following YAML definition under the cluster.
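The original YAML is not reproduced here; a minimal Knative Service sketch for concurrency-based scaling looks like the following, where the service name, image, and the target of 10 concurrent requests per Pod are placeholders.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"   # scale on in-flight requests
        autoscaling.knative.dev/target: "10"            # target 10 concurrent requests per Pod
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest   # placeholder sample image
          ports:
            - containerPort: 8080
```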
"},{"location":"en/admin/kpanda/scale/knative/playground.html#case-3-based-on-concurrent-elastic-scaling-scale-out-in-advance-to-reach-a-specific-ratio","title":"case 3 - Based on concurrent elastic scaling, scale out in advance to reach a specific ratio.","text":"
This can be achieved easily, for example, by limiting the concurrency to 10 per container and setting autoscaling.knative.dev/target-utilization-percentage: 70, so that Pods start scaling out once 70% of the target concurrency is reached.
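A sketch of the revision template annotations for this case (values are illustrative):

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "10"                          # 10 concurrent requests per Pod
        autoscaling.knative.dev/target-utilization-percentage: "70"   # start scaling out at 70% of the target
```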
"},{"location":"en/admin/kpanda/security/index.html","title":"Types of Security Scans","text":"
AI platform Container Management provides three types of security scans:
Compliance Scan: Conducts security scans on cluster nodes based on CIS Benchmark.
Authorization Scan: Checks for security and compliance issues in the Kubernetes cluster, records and verifies authorized access, object changes, events, and other activities related to the Kubernetes API.
Vulnerability Scan: Scans the Kubernetes cluster for potential vulnerabilities and risks, such as unauthorized access, sensitive information leakage, weak authentication, container escape, etc.
The object of compliance scanning is the cluster node. The scan result lists the scan items and results and provides repair suggestions for any failed scan items. For specific security rules used during scanning, refer to the CIS Kubernetes Benchmark.
The focus of the scan varies when checking different types of nodes.
Scan the control plane node (Controller)
Focus on the security of system components such as API Server , controller-manager , scheduler , kubelet , etc.
Check the security configuration of the Etcd database.
Verify whether the cluster's authentication mechanism, authorization policy, and network security configuration meet security standards.
Scan worker nodes
Check if the configuration of container runtimes such as kubelet and Docker meets security standards.
Verify whether the container image has been trusted and verified.
Check if the network security configuration of the node meets security standards.
Tip
To use compliance scanning, you need to create a scan configuration first, and then create a scan policy based on that configuration. After executing the scan policy, you can view the scan report.
Authorization scanning focuses on security vulnerabilities caused by authorization issues. Authorization scans can help users identify security threats in Kubernetes clusters, identify which resources need further review and protection measures. By performing these checks, users can gain a clearer and more comprehensive understanding of their Kubernetes environment and ensure that the cluster environment meets Kubernetes' best practices and security standards.
Specifically, authorization scanning supports the following operations:
Scans the health status of all nodes in the cluster.
Scans the running state of components in the cluster, such as kube-apiserver , kube-controller-manager , kube-scheduler , etc.
API security: whether unsafe API versions are enabled, whether appropriate RBAC roles and permission restrictions are set, etc.
Container security: whether insecure images are used, whether privileged mode is enabled, whether appropriate security context is set, etc.
Network security: whether appropriate network policy is enabled to restrict traffic, whether TLS encryption is used, etc.
Storage security: whether appropriate encryption and access controls are enabled.
Application security: whether necessary security measures are in place, such as password management, cross-site scripting attack defense, etc.
Provides warnings and suggestions: Security best practices that cluster administrators should perform, such as regularly rotating certificates, using strong passwords, restricting network access, etc.
Tip
To use authorization scanning, you need to create a scan policy first. After executing the scan policy, you can view the scan report. For details, refer to Security Scanning.
Vulnerability scanning focuses on scanning potential malicious attacks and security vulnerabilities, such as remote code execution, SQL injection, XSS attacks, and some attacks specific to Kubernetes. The final scan report lists the security vulnerabilities in the cluster and provides repair suggestions.
Tip
To use vulnerability scanning, you need to create a scan policy first. After executing the scan policy, you can view the scan report. For details, refer to Vulnerability Scan.
To use the Permission Scan feature, you need to create a scan policy first. After executing the policy, a scan report will be automatically generated for viewing.
"},{"location":"en/admin/kpanda/security/audit.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
On the left navigation bar of the homepage in the Container Management module, click Security Management .
Click Permission Scan on the left navigation bar, then click the Scan Policy tab and click Create Scan Policy on the right.
Fill in the configuration according to the following instructions, and then click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to retain. When the retention limit is exceeded, the oldest reports are deleted first.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ┇ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
For one-time scan policies: Only support the Delete operation.
To use the Vulnerability Scan feature, you need to create a scan policy first. After executing the policy, a scan report will be automatically generated for viewing.
"},{"location":"en/admin/kpanda/security/hunter.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
On the left navigation bar of the homepage in the Container Management module, click Security Management .
Click Vulnerability Scan on the left navigation bar, then click the Scan Policy tab and click Create Scan Policy on the right.
Fill in the configuration according to the following instructions, and then click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to retain. When the retention limit is exceeded, the oldest reports are deleted first.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ┇ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
For one-time scan policies: Only support the Delete operation.
The first step in using CIS Scanning is to create a scan configuration. Based on the scan configuration, you can then create scan policies, execute scan policies, and finally view scan results.
"},{"location":"en/admin/kpanda/security/cis/config.html#create-a-scan-configuration","title":"Create a Scan Configuration","text":"
The steps for creating a scan configuration are as follows:
Click Security Management in the left navigation bar of the homepage of the container management module.
By default, enter the Compliance Scanning page, click the Scan Configuration tab, and then click Create Scan Configuration in the upper-right corner.
Fill in the configuration name, select the configuration template, and optionally check the scan items, then click OK .
Scan Template: Currently, two templates are provided. The kubeadm template is suitable for general Kubernetes clusters. The daocloud template ignores scan items that are not applicable to AI platform based on the kubeadm template and the platform design of AI platform.
Under the scan configuration tab, clicking the name of a scan configuration displays the type of the configuration, the number of scan items, the creation time, the configuration template, and the specific scan items enabled for the configuration.
After a scan configuration has been successfully created, it can be updated or deleted according to your needs.
Under the scan configuration tab, click the ┇ action button to the right of a configuration:
Select Edit to update the configuration. You can update the description, template, and scan items. The configuration name cannot be changed.
Select Delete to delete the configuration.
"},{"location":"en/admin/kpanda/security/cis/policy.html","title":"Scan Policy","text":""},{"location":"en/admin/kpanda/security/cis/policy.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
After creating a scan configuration, you can create a scan policy based on the configuration.
Under the Security Management -> Compliance Scanning page, click the Scan Policy tab on the right to create a scan policy.
Fill in the configuration according to the following instructions and click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Configuration: Select a pre-created scan configuration. The scan configuration determines which specific scan items need to be performed.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to retain. When the retention limit is exceeded, the oldest reports are deleted first.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ┇ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
For one-time scan policies: Only support the Delete operation.
After executing a scan policy, a scan report will be generated automatically. You can view the scan report online or download it to your local computer.
Download and View
Under the Security Management -> Compliance Scanning page, click the Scan Report tab, then click the ┇ action button to the right of a report and select Download.
View Online
Clicking the name of a report allows you to view its content online, which includes:
The target cluster scanned.
The scan policy and scan configuration used.
The start time of the scan.
The total number of scan items, the number passed, and the number failed.
For failed scan items, repair suggestions are provided.
For passed scan items, more secure operational suggestions are provided.
A data volume (PersistentVolume, PV) is a piece of storage in the cluster, which can be prepared in advance by the administrator, or dynamically prepared using a storage class (Storage Class). PV is a cluster resource, but it has an independent life cycle and will not be deleted when the Pod process ends. Mounting PVs to workloads can achieve data persistence for workloads. The PV holds the data directory that can be accessed by the containers in the Pod.
"},{"location":"en/admin/kpanda/storage/pv.html#create-data-volume","title":"Create data volume","text":"
Currently, there are two ways to create data volumes: YAML and form. These two ways have their own advantages and disadvantages, and can meet the needs of different users.
Creating via YAML involves fewer steps and is more efficient, but it has a higher barrier to entry: you need to be familiar with the YAML configuration of data volumes.
Creating via the form is more intuitive and easier; just fill in the proper values according to the prompts, but the steps are more cumbersome.
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
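As an example of what could be pasted here, the sketch below defines a Local data volume; the name, capacity, path, and node name are placeholders.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-local-example
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
  local:
    path: /mnt/disks/vol1          # directory that must already exist on the node
  nodeAffinity:                    # node affinity is required for Local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-01          # node that owns the local disk
```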
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) -> Create Data Volume (PV) in the left navigation bar.
Fill in the basic information.
The data volume name, data volume type, mount path, volume mode, and node affinity cannot be changed after creation.
Data volume type: For a detailed introduction to volume types, refer to the official Kubernetes document Volumes.
Local: Packages the node's local storage behind a PVC interface; the container uses the PVC directly without caring about the underlying storage type. Local volumes do not support dynamic provisioning, but they support node affinity, which can limit which nodes can access the data volume.
HostPath: Uses files or directories on the node's file system as the data volume, and does not support Pod scheduling based on node affinity.
Mount path: mount the data volume to a specific directory in the container.
access mode:
ReadWriteOnce: The data volume can be mounted by a node in read-write mode.
ReadWriteMany: The data volume can be mounted by multiple nodes in read-write mode.
ReadOnlyMany: The data volume can be mounted read-only by multiple nodes.
ReadWriteOncePod: The data volume can be mounted read-write by a single Pod.
Recycling policy:
Retain: The PV is not deleted, but its status is only changed to released , which needs to be manually recycled by the user. For how to manually reclaim, refer to Persistent Volume.
Recycle: keep the PV but empty its data, perform a basic wipe ( rm -rf /thevolume/* ).
Delete: The PV and its data are deleted.
Volume mode:
File system: The data volume will be mounted to a certain directory by the Pod. If the data volume is stored from a device and the device is currently empty, a file system is created on the device before the volume is mounted for the first time.
Block: Use the data volume as a raw block device. This type of volume is given to the Pod as a block device without any file system on it, allowing the Pod to access the data volume faster.
Node affinity:
"},{"location":"en/admin/kpanda/storage/pv.html#view-data-volume","title":"View data volume","text":"
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) in the left navigation bar.
On this page, you can view all data volumes in the current cluster, as well as information such as the status, capacity, and namespace of each data volume.
Supports sequential or reverse sorting according to the name, status, namespace, and creation time of data volumes.
Click the name of a data volume to view the basic configuration, StorageClass information, labels, comments, etc. of the data volume.
"},{"location":"en/admin/kpanda/storage/pv.html#clone-data-volume","title":"Clone data volume","text":"
By cloning a data volume, a new data volume can be recreated based on the configuration of the cloned data volume.
Enter the clone page
On the data volume list page, find the data volume to be cloned, and select Clone under the operation bar on the right.
You can also click the name of the data volume, click the operation button in the upper right corner of the details page and select Clone .
Use the original configuration directly, or modify it as needed, and click OK at the bottom of the page.
"},{"location":"en/admin/kpanda/storage/pv.html#update-data-volume","title":"Update data volume","text":"
There are two ways to update data volumes. Support for updating data volumes via forms or YAML files.
Note
Only updating the alias, capacity, access mode, reclamation policy, label, and comment of the data volume is supported.
On the data volume list page, find the data volume that needs to be updated, select Update under the operation bar on the right to update through the form, select Edit YAML to update through YAML.
Click the name of the data volume to enter the details page of the data volume, select Update in the upper right corner of the page to update through the form, select Edit YAML to update through YAML.
"},{"location":"en/admin/kpanda/storage/pv.html#delete-data-volume","title":"Delete data volume","text":"
On the data volume list page, find the data to be deleted, and select Delete in the operation column on the right.
You can also click the name of the data volume, click the operation button in the upper right corner of the details page and select Delete .
A persistent volume claim (PersistentVolumeClaim, PVC) expresses a user's request for storage. PVC consumes PV resources and claims a data volume with a specific size and specific access mode. For example, the PV volume is required to be mounted in ReadWriteOnce, ReadOnlyMany or ReadWriteMany modes.
"},{"location":"en/admin/kpanda/storage/pvc.html#create-data-volume-statement","title":"Create data volume statement","text":"
Currently, there are two ways to create data volume declarations: YAML and form. These two ways have their own advantages and disadvantages, and can meet the needs of different users.
Creating via YAML involves fewer steps and is more efficient, but it has a higher barrier to entry: you need to be familiar with the YAML configuration of data volume declarations.
Creating via the form is more intuitive and easier; just fill in the proper values according to the prompts, but the steps are more cumbersome.
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
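As an example of what could be pasted here, the sketch below dynamically requests a volume from a StorageClass; the name, capacity, and StorageClass name are placeholders.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-example
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: hwameistor-storage-lvm-hdd   # placeholder StorageClass name
```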
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) -> Create Data Volume Declaration (PVC) in the left navigation bar.
Fill in the basic information.
The name, namespace, creation method, data volume, capacity, and access mode of the data volume declaration cannot be changed after creation.
Creation method: dynamically create a new data volume claim in an existing StorageClass or data volume, or create a new data volume claim based on a snapshot of a data volume claim.
The declared capacity of the data volume cannot be modified when the snapshot is created, and can be modified after the creation is complete.
After selecting the creation method, select the desired StorageClass/data volume/snapshot from the drop-down list.
access mode:
ReadWriteOnce, the data volume declaration can be mounted by a node in read-write mode.
ReadWriteMany, the data volume declaration can be mounted by multiple nodes in read-write mode.
ReadOnlyMany, the data volume declaration can be mounted read-only by multiple nodes.
ReadWriteOncePod, the data volume declaration can be mounted by a single Pod in read-write mode.
"},{"location":"en/admin/kpanda/storage/pvc.html#view-data-volume-statement","title":"View data volume statement","text":"
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) in the left navigation bar.
On this page, you can view all data volume declarations in the current cluster, as well as information such as the status, capacity, and namespace of each data volume declaration.
Supports sorting in sequential or reverse order according to the declared name, status, namespace, and creation time of the data volume.
Click the name of the data volume declaration to view the basic configuration, StorageClass information, labels, comments and other information of the data volume declaration.
"},{"location":"en/admin/kpanda/storage/pvc.html#expansion-data-volume-statement","title":"Expansion data volume statement","text":"
In the left navigation bar, click Container Storage -> Data Volume Declaration (PVC) , and find the data volume declaration whose capacity you want to adjust.
Click the name of the data volume declaration, and then click the operation button in the upper right corner of the page and select Expansion .
Enter the target capacity and click OK .
"},{"location":"en/admin/kpanda/storage/pvc.html#clone-data-volume-statement","title":"Clone data volume statement","text":"
By cloning a data volume claim, a new data volume claim can be recreated based on the configuration of the cloned data volume claim.
Enter the clone page
On the data volume declaration list page, find the data volume declaration that needs to be cloned, and select Clone under the operation bar on the right.
You can also click the name of the data volume declaration, click the operation button in the upper right corner of the details page and select Clone .
Use the original configuration directly, or modify it as needed, and click OK at the bottom of the page.
"},{"location":"en/admin/kpanda/storage/pvc.html#update-data-volume-statement","title":"Update data volume statement","text":"
There are two ways to update data volume claims. Support for updating data volume claims via form or YAML file.
Note
Only the alias, labels, and annotations of a data volume claim can be updated.
On the data volume list page, find the data volume declaration that needs to be updated, select Update in the operation bar on the right to update it through the form, and select Edit YAML to update it through YAML.
Click the name of the data volume declaration, enter the details page of the data volume declaration, select Update in the upper right corner of the page to update through the form, select Edit YAML to update through YAML.
"},{"location":"en/admin/kpanda/storage/pvc.html#delete-data-volume-statement","title":"Delete data volume statement","text":"
On the data volume declaration list page, find the data to be deleted, and select Delete in the operation column on the right.
You can also click the name of the data volume statement, click the operation button in the upper right corner of the details page and select Delete .
If there is no optional StorageClass or data volume in the list, you can Create a StorageClass or Create a data volume.
If there is no optional snapshot in the list, you can enter the details page of the data volume declaration and create a snapshot in the upper right corner.
If the StorageClass (SC) used by the data volume declaration does not have snapshots enabled, snapshots cannot be taken, and the page will not display the "Make Snapshot" option.
If the StorageClass (SC) used by the data volume declaration does not have the capacity expansion feature enabled, the data volume does not support capacity expansion, and the page will not display the capacity expansion option.
A StorageClass refers to a large storage resource pool composed of many physical disks. This platform supports the creation of block StorageClass, local StorageClass, and custom StorageClass after accessing various storage vendors, and then dynamically configures data volumes for workloads.
Currently, it supports creating StorageClass through YAML and forms. These two methods have their own advantages and disadvantages, and can meet the needs of different users.
Creating via YAML involves fewer steps and is more efficient, but it has a higher barrier to entry: you need to be familiar with the YAML configuration of the StorageClass.
Creating via the form is more intuitive and easier; just fill in the proper values according to the prompts, but the steps are more cumbersome.
Click the name of the target cluster in the cluster list, and then click Container Storage -> StorageClass (SC) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
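As an example of what could be pasted here, the sketch below uses the rancher.io/local-path driver mentioned later in this section; the name and policies are placeholders.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path       # driver in the format specified by the storage vendor
reclaimPolicy: Delete                    # or Retain, to keep data when the volume is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false              # set to true only if the underlying driver supports expansion
```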
Click the name of the target cluster in the cluster list, and then click Container Storage -> StorageClass (SC) -> Create StorageClass (SC) in the left navigation bar.
Fill in the basic information and click OK at the bottom.
CUSTOM STORAGE SYSTEM
The StorageClass name, driver, and reclamation policy cannot be modified after creation.
CSI storage driver: A standard Kubernetes-based container storage interface plug-in, which must comply with the format specified by the storage manufacturer, such as rancher.io/local-path .
For how to fill in the CSI drivers provided by different vendors, refer to the official Kubernetes document Storage Class.
Recycling policy: When deleting a data volume, keep the data in the data volume or delete the data in it.
Snapshot/Expansion: After it is enabled, the data volume/data volume declaration based on the StorageClass can support the expansion and snapshot features, but the premise is that the underlying storage driver supports the snapshot and expansion features.
HwameiStor storage system
The StorageClass name, driver, and reclamation policy cannot be modified after creation.
Storage system: HwameiStor storage system.
Storage type: LVM and raw disk types are supported.
LVM type : HwameiStor recommended usage method, which can use highly available data volumes, and the proper CSI storage driver is lvm.hwameistor.io .
Raw disk data volume : suitable for high availability cases, without high availability capability, the proper CSI driver is hdd.hwameistor.io .
High Availability Mode: Before using the high availability capability, make sure the DRBD component has been installed. After high availability mode is turned on, the number of data volume replicas can be set to 1 or 2, and a replica count of 1 can be converted to 2 later if needed.
Recycling policy: When deleting a data volume, keep the data in the data volume or delete the data in it.
Snapshot/Expansion: After it is enabled, the data volume/data volume declaration based on the StorageClass can support the expansion and snapshot features, but the premise is that the underlying storage driver supports the snapshot and expansion features.
On the StorageClass list page, find the StorageClass that needs to be updated, and select Edit under the operation bar on the right to update the StorageClass.
Info
Select View YAML to view the YAML file of the StorageClass, but editing is not supported.
This page introduces how to create a CronJob through images and YAML files.
CronJobs are suitable for performing periodic operations, such as backup and report generation. These jobs can be configured to repeat periodically (for example: daily/weekly/monthly), and the time interval at which the job starts to run can be defined.
Before creating a CronJob, the following prerequisites need to be met:
In the Container Management module Integrate Kubernetes Cluster or Create Kubernetes Cluster, and can access the cluster UI interface.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/admin/kpanda/workloads/create-cronjob.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a CronJob using the image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> CronJobs in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings, CronJob Settings, Advanced Configuration, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the CronJobs list. Click ┇ on the right side of the list to perform operations such as updating, deleting, and restarting the CronJob.
On the Create CronJobs page, enter the information according to the table below, and click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select which namespace to deploy the newly created CronJob in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container setting is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the configuration with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers, and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g., nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull the image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be used, and the image will be re-pulled from the container registry only when it does not exist locally. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports exclusive use of an entire GPU or part of a vGPU by the container. For example, for an 8-core GPU, enter 8 to let the container exclusively use the entire GPU, or enter 1 to configure a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plug-in on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
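As a rough sketch, a whole-GPU request in a Pod spec might look like the following, assuming the standard NVIDIA device plugin resource name nvidia.com/gpu ; the actual resource name and vGPU syntax depend on the GPU vendor and the device plugin installed in the cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                               # hypothetical Pod name
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # hypothetical CUDA base image
      command: ["nvidia-smi"]                     # print GPU information and exit
      resources:
        limits:
          nvidia.com/gpu: 1                       # request exclusive use of one whole GPU
```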
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass configuration to the Pod, etc. For details, refer to Container environment variable configuration.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Concurrency Policy: Whether to allow multiple Job jobs to run in parallel.
Allow : A new CronJob can be created before the previous job is completed, and multiple jobs can be parallelized. Too many jobs may occupy cluster resources.
Forbid : Before the previous job is completed, a new job cannot be created. If the execution time of the new job is up and the previous job has not been completed, CronJob will ignore the execution of the new job.
Replace : If the execution time of the new job is up, but the previous job has not been completed, the new job will replace the previous job.
The above rules only apply to multiple jobs created by the same CronJob. Multiple jobs created by multiple CronJobs are always allowed to run concurrently.
Policy Settings: Set the time period for job execution based on minutes, hours, days, weeks, and months. Support custom Cron expressions with numbers and * , after inputting the expression, the meaning of the current expression will be prompted. For detailed expression syntax rules, refer to Cron Schedule Syntax.
Job Records: Set how many records of successful or failed jobs to keep. 0 means do not keep.
Timeout: When this time is exceeded, the job will be marked as failed to execute, and all Pods under the job will be deleted. When it is empty, it means that no timeout is set. The default is 360 s.
Retries: the number of times the job can be retried, the default value is 6.
Restart Policy: Set whether to restart the Pod when the job fails.
The advanced configuration of CronJobs mainly involves labels and annotations.
You can click the Add button to add labels and annotations to the workload instance Pod.
"},{"location":"en/admin/kpanda/workloads/create-cronjob.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating from an image, you can also create CronJobs more quickly from YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> CronJobs in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
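For reference, the settings described above roughly correspond to the following CronJob manifest. This is a minimal sketch; the name, schedule, and image are illustrative assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-cron                    # hypothetical CronJob name
  namespace: default
spec:
  schedule: "0 2 * * *"               # policy settings: run at 02:00 every day
  concurrencyPolicy: Forbid           # concurrency policy: no new job before the previous one finishes
  successfulJobsHistoryLimit: 3       # job records: keep the last 3 successful jobs
  failedJobsHistoryLimit: 1           # job records: keep the last failed job
  jobTemplate:
    spec:
      activeDeadlineSeconds: 360      # timeout: mark the job as failed after 360 s
      backoffLimit: 6                 # retries: retry a failed Pod up to 6 times
      template:
        spec:
          containers:
            - name: hello
              image: busybox
              command: ["/bin/sh", "-c", "date; echo Hello"]
          restartPolicy: OnFailure    # restart policy when the job fails
```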
This page introduces how to create a DaemonSet through images and YAML files.
A DaemonSet uses node affinity and tolerations to ensure that a replica of a Pod runs on all (or some) nodes. For nodes that newly join the cluster, the DaemonSet automatically deploys the proper Pod on the new node and tracks the running status of the Pod. When a node is removed, the DaemonSet deletes all Pods it created.
Common cases for daemons include:
Run cluster daemons on each node.
Run a log collection daemon on each node.
Run a monitoring daemon on each node.
In a simple case, one DaemonSet covering all nodes can be used for each type of daemon. For finer-grained or more advanced daemon management, you can also deploy multiple DaemonSets for the same daemon, each with different flags and different memory and CPU requirements for different hardware types.
Before creating a DaemonSet, the following prerequisites need to be met:
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and be able to access the cluster's UI.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/admin/kpanda/workloads/create-daemonset.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a daemon using the image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> DaemonSets in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings, Service Settings, Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return the list of DaemonSets . Click ┇ on the right side of the list to perform operations such as updating, deleting, and restarting the DaemonSet.
On the Create DaemonSets page, after entering the information according to the table below, click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select which namespace to deploy the newly created DaemonSet in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container settings are configured for a single container. To add more containers to a pod, click + on the right.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers, and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g., nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be pulled, and only when the image does not exist locally, it will be re-pulled from the container registry. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports exclusive use of an entire GPU or part of a vGPU by the container. For example, for an 8-core GPU, enter 8 to let the container exclusively use the entire GPU, or enter 1 to configure a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plug-in on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced settings include four parts: network settings, upgrade policy, scheduling policy, and labels and annotations. You can click the tabs below to view the requirements of each part.
Tabs: Network Configuration, Upgrade Policy, Scheduling Policies, Labels and Annotations
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related settings options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: Make the container use the domain name resolution file pointed to by the --resolv-conf parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: The application uses the domain name resolution file of the host it runs on.
ClusterFirst: The application uses the cluster's Kube-DNS/CoreDNS for resolution.
None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting it to None, dnsConfig must be set; the container's domain name resolution file will then be generated entirely from the dnsConfig settings.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Configuration options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig options conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host. A combined sketch of these DNS and host alias settings follows this list.
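A minimal sketch of how the DNS policy, dnsConfig, and host alias described above can be expressed in a Pod spec; the Pod name, addresses, and domains are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-example                # hypothetical Pod name
spec:
  dnsPolicy: "None"                # resolution file is generated entirely from dnsConfig
  dnsConfig:
    nameservers:
      - 10.6.175.20                # DNS server address
    searches:                      # search domains merged into the resolution file
      - ns1.svc.cluster.local
      - my.dns.search.suffix
    options:
      - name: ndots
        value: "2"
  hostAliases:                     # host aliases written into /etc/hosts
    - ip: "10.1.2.3"
      hostnames:
        - "foo.local"
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
```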
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Max Unavailable Pods: Specify the maximum value or ratio of unavailable pods during the workload update process, the default is 25%. If it is equal to the number of instances, there is a risk of service interruption.
Max Surge: The maximum or ratio of the total number of Pods exceeding the desired replica count of Pods during a Pod update. Default is 25%.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Minimum Ready: The minimum time for a Pod to be ready. Only after this time is the Pod considered available. The default is 0 seconds.
Upgrade Max Duration: If the deployment is not successful after the set time, the workload will be marked as failed. Default is 600 seconds.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds.
Node affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
Topology domain: that is, topologyKey, used to specify the group of nodes that can be scheduled to. For example, kubernetes.io/os indicates that as long as a node with that operating system label meets the labelSelector conditions, Pods can be scheduled to it.
For details, refer to Scheduling Policy. A combined example of these constraints is sketched below.
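A rough sketch of how these scheduling settings can appear in a Pod spec; the labels, topology key, and toleration are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-example                           # hypothetical Pod name
spec:
  affinity:
    nodeAffinity:                                  # node affinity: schedule only onto matching nodes
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values: ["linux"]
    podAntiAffinity:                               # workload anti-affinity: avoid nodes already running app=my-app
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: my-app                          # hypothetical label of already running Pods
          topologyKey: kubernetes.io/hostname      # topology domain
  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300                       # toleration time before rescheduling
  containers:
    - name: app
      image: nginx
```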
You can click the Add button to add labels and annotations to workloads and pods.
"},{"location":"en/admin/kpanda/workloads/create-daemonset.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating from an image, you can also create DaemonSets more quickly from YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> DaemonSets in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a DaemonSet
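A representative sketch of a log-collection DaemonSet; the name, image, and tolerations are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-logging                 # hypothetical log collection daemon
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-logging
  template:
    metadata:
      labels:
        name: fluentd-logging
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane   # also run on control-plane nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: fluentd
          image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2   # hypothetical image
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
            limits:
              memory: 200Mi
```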
This page describes how to create deployments through images and YAML files.
Deployment is a common resource in Kubernetes. It mainly provides declarative updates for Pods and ReplicaSets and supports elastic scaling, rolling upgrades, and version rollbacks. Declare the desired Pod state in the Deployment, and the Deployment Controller will modify the current state through the ReplicaSet until it reaches the declared desired state. A Deployment is stateless and does not support data persistence; it is suitable for deploying stateless applications that do not need to save data and can be restarted and rolled back at any time.
Through the container management module of AI platform, workloads on multicloud and multiclusters can be easily managed based on proper role permissions, covering the full lifecycle of deployments, including creation, update, deletion, elastic scaling, restart, and version rollback.
Before using image to create deployments, the following prerequisites need to be met:
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and be able to access the cluster's UI.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/admin/kpanda/workloads/create-deployment.html#create-by-image","title":"Create by image","text":"
Follow the steps below to create a deployment by image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> Deployments in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Setting, Service Setting, Advanced Setting in turn, click OK in the lower right corner of the page to complete the creation.
The system will automatically return the list of Deployments . Click ┇ on the right side of the list to perform operations such as update, delete, elastic scaling, restart, and version rollback on the workload. If the workload status is abnormal, please check the specific abnormal information; refer to Workload Status.
Workload Name: can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number, such as deployment-01. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select the namespace where the newly created workload will be deployed. The default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Pods: Enter the number of Pod instances for the workload; one Pod instance is created by default.
Description: Enter the description information of the workload and customize the content. The number of characters cannot exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container settings are configured for a single container. To add more containers to a pod, click + on the right.
When configuring container-related parameters, it is essential to correctly fill in the container name and image parameters; otherwise, you will not be able to proceed to the next step. After filling in the configuration according to the following requirements, click OK.
Container Type: The default is Work Container. For information on init containers, see the [K8s Official Documentation](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/).
Container Name: No more than 63 characters, supporting lowercase letters, numbers, and separators ("-"). It must start and end with a lowercase letter or number, for example, nginx-01.
Image:
Image: Select an appropriate image from the list. When entering the image name, the default is to pull the image from the official DockerHub.
Image Version: Select an appropriate version from the dropdown list.
Image Pull Policy: By checking Always pull the image, the image will be pulled from the repository each time the workload restarts/upgrades. If unchecked, it will only pull the local image, and will pull from the repository only if the image does not exist locally. For more details, refer to Image Pull Policy.
Registry Secret: Optional. If the target repository requires a Secret to access, you need to create secret first.
Privileged Container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and has all the privileges of running processes on the host.
CPU/Memory Request: The request value (the minimum resource needed) and the limit value (the maximum resource allowed) for CPU/memory resources. Configure resources for the container as needed to avoid resource waste and system failures caused by container resource overages. Default values are shown in the figure.
GPU Configuration: Configure GPU usage for the container, supporting only positive integers. The GPU quota setting supports configuring the container to exclusively use an entire GPU or part of a vGPU. For example, for a GPU with 8 cores, entering 8 means the container exclusively uses the entire GPU, and entering 1 means configuring a 1-core vGPU for the container.
Before setting the GPU, the administrator needs to pre-install the GPU and driver plugin on the cluster node and enable the GPU feature in the Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Setting.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Setting.
Configure container parameters within the Pod, add environment variables or pass setting to the Pod, etc. For details, refer to Container environment variable setting.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Setting.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes four parts: Network Settings, Upgrade Policy, Scheduling Policies, Labels and Annotations. You can click the tabs below to view the setting requirements of each part.
Tabs: Network Settings, Upgrade Policy, Scheduling Policies, Labels and Annotations
For container NIC setting, refer to Workload Usage IP Pool
DNS setting
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related setting options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: The container uses the domain name resolution file pointed to by the --resolv-conf parameter of kubelet. This setting can only resolve external domain names registered on the Internet, not cluster-internal domain names, and there are no invalid DNS queries.
ClusterFirstWithHostNet: The application uses the domain name resolution file of the host it runs on.
ClusterFirst: The application uses the cluster's Kube-DNS/CoreDNS for resolution.
None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the setting of dnsConfig.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Setting options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig options conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Max Unavailable: Specify the maximum value or ratio of unavailable pods during the workload update process, the default is 25%. If it is equal to the number of instances, there is a risk of service interruption.
Max Surge: The maximum or ratio of the total number of Pods exceeding the desired replica count of Pods during a Pod update. Default is 25%.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Minimum Ready: The minimum time for a Pod to be ready. Only after this time is the Pod considered available. The default is 0 seconds.
Upgrade Max Duration: If the deployment is not successful after the set time, the workload will be marked as failed. Default is 600 seconds.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds.
Node Affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload Anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
For details, refer to Scheduling Policy.
You can click the Add button to add labels and annotations to workloads and pods.
"},{"location":"en/admin/kpanda/workloads/create-deployment.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating from an image, you can also create deployments more quickly from YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> Deployments in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a deployment
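A representative sketch of a Deployment manifest; the name, image, replica count, and resource values are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment          # hypothetical Deployment name
  namespace: default
  labels:
    app: nginx
spec:
  replicas: 3                     # number of Pod instances
  selector:
    matchLabels:
      app: nginx
  strategy:
    type: RollingUpdate           # rolling upgrade
    rollingUpdate:
      maxUnavailable: 25%         # max unavailable Pods during the update
      maxSurge: 25%               # max Pods above the desired replica count
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi
```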
This page introduces how to create a job through images and YAML files.
Job is suitable for performing one-time jobs. A Job creates one or more Pods, and the Job keeps retrying to run Pods until a certain number of Pods are successfully terminated. A Job ends when the specified number of Pods are successfully terminated. When a Job is deleted, all Pods created by the Job will be cleared. When a Job is paused, all active Pods in the Job are deleted until the Job is resumed. For more information about jobs, refer to Job.
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and be able to access the cluster's UI.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/admin/kpanda/workloads/create-job.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a job using an image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> Jobs in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings and Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the job list. Click ┇ on the right side of the list to perform operations such as updating, deleting, and restarting the job.
On the Create Jobs page, enter the basic information according to the table below, and click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select which namespace to deploy the newly created job in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Number of Instances: Enter the number of Pod instances for the workload. By default, 1 Pod instance is created.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the setting requirements of each part.
Container settings are configured for a single container. To add more containers to a pod, click + on the right.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers, and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g., nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be pulled, and only when the image does not exist locally, it will be re-pulled from the container registry. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports exclusive use of an entire GPU or part of a vGPU by the container. For example, for an 8-core GPU, enter 8 to let the container exclusively use the entire GPU, or enter 1 to configure a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plug-in on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle settings.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check settings.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage settings.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes job settings, labels and annotations.
Tabs: Job Settings, Labels and Annotations
Parallel Pods: the maximum number of Pods that can be created at the same time during job execution, and the parallel number should not be greater than the total number of Pods. Default is 1.
Timeout: When this time is exceeded, the job will be marked as failed to execute, and all Pods under the job will be deleted. When it is empty, it means that no timeout is set.
Restart Policy: Whether to restart the Pod when the job fails.
You can click the Add button to add labels and annotations to the workload instance Pod.
"},{"location":"en/admin/kpanda/workloads/create-job.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating from an image, jobs can also be created more quickly from YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> Jobs in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
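A minimal Job sketch; the name, image, and command are illustrative assumptions (a fuller, annotated example appears in the job parameters section below):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-once                  # hypothetical Job name
spec:
  completions: 1                 # number of Pods that must finish successfully
  parallelism: 1                 # number of Pods running in parallel
  backoffLimit: 4                # retry a failed Pod up to 4 times
  template:
    spec:
      containers:
        - name: pi
          image: perl
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(100)"]
      restartPolicy: Never
```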
This page describes how to create a StatefulSet through images and YAML files.
StatefulSet, like Deployment, is a common resource in Kubernetes, mainly used to manage the deployment and scaling of a set of Pods. The main difference between the two is that a Deployment is stateless and does not save data, while a StatefulSet is stateful and is mainly used to manage stateful applications. In addition, Pods in a StatefulSet have a persistent ID, which makes it easy to match each Pod with its storage volumes.
Through the container management module of AI platform, workloads on multicloud and multiclusters can be easily managed based on proper role permissions, covering the full lifecycle of StatefulSets, including creation, update, deletion, elastic scaling, restart, and version rollback.
Before using image to create StatefulSets, the following prerequisites need to be met:
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and be able to access the cluster's UI.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/admin/kpanda/workloads/create-statefulset.html#create-by-image","title":"Create by image","text":"
Follow the steps below to create a StatefulSet using an image.
Click Clusters on the left navigation bar, then click the name of the target cluster to enter Cluster Details.
Click Workloads -> StatefulSets in the left navigation bar, and then click the Create by Image button in the upper right corner.
Fill in Basic Information, Container Settings, Service Settings, Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the list of StatefulSets , and wait for the status of the workload to become running . If the workload status is abnormal, refer to Workload Status for specific exception information.
Click ┇ on the right side of the newly created workload in the list to perform operations such as update, delete, elastic scaling, restart, and version rollback on the workload.
Workload Name: can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number, such as deployment-01. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
Namespace: Select the namespace where the newly created workload will be deployed. The default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Pods: Enter the number of Pod instances for the workload; one Pod instance is created by default.
Description: Enter the description information of the workload and customize the content. The number of characters cannot exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container settings are configured for a single container. To add more containers to a pod, click + on the right.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers, and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g., nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be pulled, and only when the image does not exist locally, it will be re-pulled from the container registry. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports exclusive use of an entire GPU or part of a vGPU by the container. For example, for an 8-core GPU, enter 8 to let the container exclusively use the entire GPU, or enter 1 to configure a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plug-in on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
Used to judge the health status of containers and applications. Helps improve app usability. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced settings include five parts: network settings, upgrade policy, container management policies, scheduling policy, and labels and annotations. You can click the tabs below to view the requirements of each part.
Tabs: Network Configuration, Upgrade Policy, Container Management Policies, Scheduling Policies, Labels and Annotations
For container NIC settings, refer to Workload Usage IP Pool
DNS settings
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related settings options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: Make the container use the domain name resolution file pointed to by the --resolv-conf parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: The application uses the domain name resolution file of the host it runs on.
ClusterFirst: The application uses the cluster's Kube-DNS/CoreDNS for resolution.
None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the settings of dnsConfig.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Configuration options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig options conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
Kubernetes v1.7 and later versions can set the Pod management policy through .spec.podManagementPolicy , which supports the following two methods (see the fragment after this list):
OrderedReady : The default Pod management policy, which means that Pods are deployed in order. Only after the deployment of the previous Pod is successfully completed, the statefulset will start to deploy the next Pod. Pods are deleted in reverse order, with the last created being deleted first.
Parallel : Create or delete containers in parallel, just like Pods of the Deployment type. The StatefulSet controller starts or terminates all containers in parallel. There is no need to wait for a Pod to enter the Running and ready state or to stop completely before starting or terminating other Pods. This option only affects the behavior of scaling operations, not the order of updates.
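A minimal fragment showing where this field sits in a StatefulSet manifest:

```yaml
# Fragment of a StatefulSet spec; only the relevant field is shown.
spec:
  podManagementPolicy: Parallel   # start/terminate Pods in parallel; the default is OrderedReady
```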
Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes; the default is 300 seconds.
Node affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
Topology domain: that is, topologyKey, used to specify the group of nodes that can be scheduled to. For example, kubernetes.io/os indicates that as long as a node with that operating system label meets the labelSelector conditions, Pods can be scheduled to it.
For details, refer to Scheduling Policy.
You can click the Add button to add labels and annotations to workloads and pods.
"},{"location":"en/admin/kpanda/workloads/create-statefulset.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating from an image, you can also create StatefulSets more quickly from YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> StatefulSets in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a StatefulSet
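A representative sketch of a StatefulSet manifest with a volume claim template; the name, image, headless Service name, and storage size are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                          # hypothetical StatefulSet name
spec:
  serviceName: "nginx"               # headless Service that provides the stable network identity
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:              # each Pod gets its own PersistentVolumeClaim
    - metadata:
        name: www
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```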
An environment variable refers to a variable set in the container running environment, which is used to add environment flags to Pods or transfer configurations, etc. It supports configuring environment variables for Pods in the form of key-value pairs.
Suanova container management adds a graphical interface to configure environment variables for Pods on the basis of native Kubernetes, and supports the following configuration methods:
Key-value pair (Key/Value Pair): Use a custom key-value pair as the environment variable of the container.
Resource reference (Resource): Use a field defined by the container, such as its memory limit or the number of replicas, as the value of an environment variable.
Pod field reference (Pod Field): Use a Pod field, such as the name of the Pod, as the value of an environment variable.
ConfigMap key import (ConfigMap Key): Import the value of a key in a ConfigMap as the value of an environment variable.
Secret key import (Secret Key): Use the value of a key in a Secret as the value of an environment variable.
Secret import (Secret): Import all key-value pairs in a Secret as environment variables.
ConfigMap import (ConfigMap): Import all key-value pairs in a ConfigMap as environment variables. A combined sketch of these methods follows this list.
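A combined sketch of these configuration methods in a Pod spec; the variable names, ConfigMap name, and Secret name are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: env-example                    # hypothetical Pod name
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "env && sleep 3600"]
      resources:
        limits:
          memory: 64Mi
      env:
        - name: GREETING               # key-value pair
          value: "hello"
        - name: POD_NAME               # Pod field reference
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MEM_LIMIT              # container resource reference
          valueFrom:
            resourceFieldRef:
              containerName: app
              resource: limits.memory
        - name: APP_MODE               # single ConfigMap key
          valueFrom:
            configMapKeyRef:
              name: app-config         # hypothetical ConfigMap
              key: mode
        - name: DB_PASSWORD            # single Secret key
          valueFrom:
            secretKeyRef:
              name: db-secret          # hypothetical Secret
              key: password
      envFrom:
        - configMapRef:
            name: app-config           # import all keys from the ConfigMap
        - secretRef:
            name: db-secret            # import all keys from the Secret
```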
"},{"location":"en/admin/kpanda/workloads/pod-config/health-check.html","title":"Container health check","text":"
Container health check checks the health status of containers according to user requirements. After configuration, if the application in the container is abnormal, the container will automatically restart and recover. Kubernetes provides Liveness checks, Readiness checks, and Startup checks.
LivenessProbe can detect application deadlock (the application is running, but cannot continue to run the following steps). Restarting containers in this state can help improve the availability of applications, even if there are bugs in them.
ReadinessProbe can detect when a container is ready to accept request traffic. A Pod can only be considered ready when all containers in a Pod are ready. One use of this signal is to control which Pod is used as the backend of the Service. If the Pod is not ready, it will be removed from the Service's load balancer.
A startup check (StartupProbe) indicates when the application inside the container has started. Once configured, liveness and readiness checks are held back until the startup check succeeds, ensuring that those probes do not interfere with application startup. This can be used to perform liveness checks on slow-starting containers, preventing them from being killed before they are up and running.
"},{"location":"en/admin/kpanda/workloads/pod-config/health-check.html#liveness-and-readiness-checks","title":"Liveness and readiness checks","text":"
The configuration of LivenessProbe is similar to that of ReadinessProbe; the only difference is to use the readinessProbe field instead of the livenessProbe field.
HTTP GET parameter description:
Path (path): The request path to access, such as the /healthz path in the example.
Port (port): The port the service listens on, such as port 8080 in the example.
Protocol: The access protocol, HTTP or HTTPS.
Delay time (initialDelaySeconds): Delay before the first check, in seconds. This setting is related to the normal startup time of the business program. For example, if it is set to 30, the health check starts 30 seconds after the container starts, which is the time reserved for the business program to start.
Timeout (timeoutSeconds): Timeout, in seconds. For example, if it is set to 10, the health check must complete within 10 seconds; otherwise it is regarded as failed. If set to 0 or not set, the default timeout is 1 second.
Success threshold (successThreshold): The minimum number of consecutive successes required for the probe to be considered successful after having failed. The default value is 1 and the minimum value is 1. This value must be 1 for liveness and startup probes.
Maximum number of failures (failureThreshold): The number of retries when the probe fails. Giving up on a liveness probe means restarting the container; a Pod given up on by a readiness probe is marked as not ready. The default value is 3 and the minimum value is 1.
Check with HTTP GET request
YAML example:
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
    - name: liveness                      # Container name
      image: k8s.gcr.io/liveness          # Container image
      args:
        - /server                         # Arguments to pass to the container
      livenessProbe:
        httpGet:
          path: /healthz                  # Access request path
          port: 8080                      # Service listening port
          httpHeaders:
            - name: Custom-Header         # Custom header name
              value: Awesome              # Custom header value
        initialDelaySeconds: 3            # Wait 3 seconds before the first probe
        periodSeconds: 3                  # Perform liveness detection every 3 seconds
```
According to the set rules, Kubelet sends an HTTP GET request to the service running in the container (the service is listening on port 8080) to perform the detection. The kubelet considers the container alive if the handler under the /healthz path on the server returns a success code. If the handler returns a failure code, the kubelet kills the container and restarts it. Any return code greater than or equal to 200 and less than 400 indicates success, and any other return code indicates failure. The /healthz handler returns a 200 status code for the first 10 seconds of the container's lifetime. The handler then returns a status code of 500.
"},{"location":"en/admin/kpanda/workloads/pod-config/health-check.html#use-tcp-port-check","title":"Use TCP port check","text":"
TCP port parameter description:
Port (port): The port the service listens on, such as port 8080 in the example.
Delay time (initialDelaySeconds): Delay before the first check, in seconds. This setting is related to the normal startup time of the business program. For example, if it is set to 30, the health check starts 30 seconds after the container starts, which is the time reserved for the business program to start.
Timeout (timeoutSeconds): Timeout, in seconds. For example, if it is set to 10, the health check must complete within 10 seconds; otherwise it is regarded as failed. If set to 0 or not set, the default timeout is 1 second.
For a container that provides TCP communication services, based on this configuration, the cluster establishes a TCP connection to the container according to the set rules. If the connection is successful, it proves that the detection is successful, otherwise the detection fails. If you choose the TCP port detection method, you must specify the port that the container listens to.
This example uses both readiness and liveness probes. The kubelet sends the first readiness probe 5 seconds after the container starts, attempting to connect to port 8080 of the goproxy container. If the probe succeeds, the Pod is marked as ready and the kubelet continues to run the check every 10 seconds.
In addition to the readiness probe, this configuration includes a liveness probe. The kubelet runs the first liveness probe 15 seconds after the container starts. Like the readiness probe, it attempts to connect to the goproxy container on port 8080. If the liveness probe fails, the container is restarted.
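A sketch of the TCP probe configuration described above, following the common goproxy example; the image tag and timings are assumptions consistent with the description. The example after it, by contrast, checks liveness by executing a command inside the container.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
    - name: goproxy
      image: k8s.gcr.io/goproxy:0.1     # assumed image providing a TCP service on port 8080
      ports:
        - containerPort: 8080
      readinessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 5          # first readiness probe 5 seconds after the container starts
        periodSeconds: 10               # then repeat every 10 seconds
      livenessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 15         # first liveness probe 15 seconds after the container starts
        periodSeconds: 20
```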
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
    - name: liveness                      # Container name
      image: k8s.gcr.io/busybox           # Container image
      args:
        - /bin/sh                         # Command to run
        - -c                              # Pass the following string as a command
        - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600   # Command to execute
      livenessProbe:
        exec:
          command:
            - cat                         # Command to check liveness
            - /tmp/healthy                # File to check
        initialDelaySeconds: 5            # Wait 5 seconds before the first probe
        periodSeconds: 5                  # Perform liveness detection every 5 seconds
```
The periodSeconds field specifies that the kubelet performs a liveness probe every 5 seconds, and the initialDelaySeconds field specifies that the kubelet waits for 5 seconds before performing the first probe. According to the set rules, the cluster periodically executes the command cat /tmp/healthy in the container through the kubelet to detect. If the command executes successfully and the return value is 0, the kubelet considers the container to be healthy and alive. If this command returns a non-zero value, the kubelet will kill the container and restart it.
"},{"location":"en/admin/kpanda/workloads/pod-config/health-check.html#protect-slow-starting-containers-with-pre-start-checks","title":"Protect slow-starting containers with pre-start checks","text":"
Some applications require a long initialization time at startup. In such cases, you can configure a startup probe using the same command as the liveness check. For HTTP or TCP probes, you can set the failureThreshold * periodSeconds product to a value long enough to cover the worst-case startup time.
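A sketch of such a startup probe, assuming an HTTP endpoint at /healthz on port 8080; with failureThreshold: 30 and periodSeconds: 10 the container gets up to 300 seconds to finish starting:

```yaml
# Fragment of a container spec; the liveness probe only takes effect once the startup probe succeeds.
livenessProbe:
  httpGet:
    path: /healthz        # assumed health endpoint
    port: 8080
  failureThreshold: 1
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30    # 30 * 10 s = 300 s allowed for startup
  periodSeconds: 10
```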
With the above settings, the application has up to 5 minutes (30 × 10 = 300 s) to complete its startup. Once the startup probe succeeds, the liveness probe takes over detection of the container and responds quickly to container deadlocks. If the startup probe never succeeds, the container is killed after 300 seconds and handled according to the restartPolicy .
"},{"location":"en/admin/kpanda/workloads/pod-config/job-parameters.html","title":"Description of job parameters","text":"
According to the settings of .spec.completions and .spec.parallelism, Jobs can be divided into the following types:
| Job Type | Description |
| --- | --- |
| Non-parallel Job | Creates one Pod until its Job completes successfully |
| Parallel Job with a deterministic completion count | The Job is considered complete when the number of successful Pods reaches .spec.completions |
| Parallel Job | Creates one or more Pods until one finishes successfully |
| Parameter | Description |
| --- | --- |
| RestartPolicy | Restart policy of the Pods; for a Job, only Never or OnFailure are supported |
| .spec.completions | The number of Pods that must complete successfully for the Job to finish; the default is 1 |
| .spec.parallelism | The number of Pods running in parallel; the default is 1 |
| .spec.backoffLimit | The maximum number of retries for a failed Pod, beyond which no more retries are made |
| .spec.activeDeadlineSeconds | The maximum running time of the Pods. Once this time is reached, the Job (that is, all its Pods) is stopped. activeDeadlineSeconds has a higher priority than backoffLimit: a Job that reaches activeDeadlineSeconds ignores the backoffLimit setting. |
The following is an example Job configuration, saved in myjob.yaml, which calculates π to 2000 digits and prints the output.
```yaml
apiVersion: batch/v1
kind: Job # The type of the current resource
metadata:
  name: myjob
spec:
  completions: 50 # The Job needs to run 50 Pods in total; in this example it prints π 50 times
  parallelism: 5 # 5 Pods in parallel
  backoffLimit: 5 # Retry up to 5 times
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never # Restart policy
```
Related commands
```bash
kubectl apply -f myjob.yaml      # Start the Job
kubectl get job                  # View this Job
kubectl logs myjob-1122dswzs     # View Job Pod logs
```
"},{"location":"en/admin/kpanda/workloads/pod-config/lifecycle.html","title":"Configure the container lifecycle","text":"
Pods follow a predefined lifecycle, starting in the Pending phase and entering the Running state if at least one container in the Pod starts normally. If any container in the Pod ends in a failed state, the state becomes Failed. The following phase field values indicate which phase of the lifecycle a Pod is in.
| Value | Description |
| --- | --- |
| Pending | The Pod has been accepted by the system, but one or more containers have not yet been created or run. This phase includes the time spent waiting for the Pod to be scheduled and downloading images over the network. |
| Running | The Pod has been bound to a node and all containers in the Pod have been created. At least one container is still running, or is in the process of starting or restarting. |
| Succeeded | All containers in the Pod terminated successfully and will not be restarted. |
| Failed | All containers in the Pod have terminated, and at least one container terminated due to failure, that is, the container exited with a non-zero status or was terminated by the system. |
| Unknown | The status of the Pod cannot be obtained for some reason, usually due to a communication failure with the host where the Pod resides. |
When creating a workload in Suanova container management, images are usually used to specify the running environment in the container. By default, when building an image, the Entrypoint and CMD fields can be used to define the commands and parameters to be executed when the container is running. If you need to change the commands and parameters of the container image before starting, after starting, and before stopping, you can override the default commands and parameters in the image by setting the lifecycle event commands and parameters of the container.
Configure the startup command, post-start command, and pre-stop command of the container according to business needs.
| Parameter | Description | Example value |
| --- | --- | --- |
| Start command | Type: Optional. Meaning: the container will be started according to the start command. | - |
| Post-start command | Type: Optional. Meaning: command executed after the container starts. | - |
| Pre-stop command | Type: Optional. Meaning: command executed by the container after receiving the stop command. It ensures that the services running in the instance can be drained in advance when the instance is upgraded or deleted. | - |

### Start Command
Configure the startup command according to the table below.
| Parameter | Description | Example value |
| --- | --- | --- |
| Run command | Type: Required. Meaning: enter an executable command and separate multiple commands with spaces; if a command itself contains spaces, wrap it in quotes (""). When there are multiple commands, it is recommended to use /bin/sh or another shell to run them and pass all other commands in as parameters. | /run/server |
| Running parameters | Type: Optional. Meaning: enter the parameters of the container run command. | port=8080 |

### Post-start Commands
Suanova provides two processing types, command line script and HTTP request, to configure post-start commands. You can choose the configuration method that suits you according to the table below.
Command line script configuration
| Parameter | Description | Example value |
| --- | --- | --- |
| Run command | Type: Optional. Meaning: enter an executable command and separate multiple commands with spaces; if a command itself contains spaces, wrap it in quotes (""). When there are multiple commands, it is recommended to use /bin/sh or another shell to run them and pass all other commands in as parameters. | /run/server |
| Running parameters | Type: Optional. Meaning: enter the parameters of the container run command. | port=8080 |

### Pre-stop Command
Suanova provides two processing types, command line script and HTTP request, to configure the pre-stop command. You can choose the configuration method that suits you according to the table below.
HTTP request configuration
| Parameter | Description | Example value |
| --- | --- | --- |
| URL Path | Type: Optional. Meaning: the requested URL path. | /run/server |
| Port | Type: Required. Meaning: the requested port. | port=8080 |
| Node Address | Type: Optional. Meaning: the requested IP address; defaults to the node IP where the container is located. | - |

## Scheduling Policy
In a Kubernetes cluster, like many other Kubernetes objects, nodes have labels. You can manually add labels. Kubernetes also adds some standard labels to all nodes in the cluster. See Common Labels, Annotations, and Taints for common node labels. By adding labels to nodes, you can have pods scheduled on specific nodes or groups of nodes. You can use this feature to ensure that specific Pods can only run on nodes with certain isolation, security or governance properties.
nodeSelector is the simplest recommended form of a node selection constraint. You can add a nodeSelector field to the Pod's spec to set the node label. Kubernetes will only schedule pods on nodes with each label specified. nodeSelector provides one of the easiest ways to constrain Pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define. Some benefits of using affinity and anti-affinity are:
Affinity and anti-affinity languages are more expressive. nodeSelector can only select nodes that have all the specified labels. Affinity, anti-affinity give you greater control over selection logic.
You can mark a rule as \"soft demand\" or \"preference\", so that the scheduler will still schedule the Pod if no matching node can be found.
You can use the labels of other Pods running on the node (or in other topological domains) to enforce scheduling constraints, instead of only using the labels of the node itself. This capability allows you to define rules which allow Pods to be placed together.
You can choose which node the Pod will deploy to by setting affinity and anti-affinity.
Toleration time: when the node where the workload instance is located becomes unavailable, the period after which the system reschedules the instance to other available nodes. The default is 300 seconds.
Node affinity is conceptually similar to nodeSelector , which allows you to constrain which nodes Pods can be scheduled on based on the labels on the nodes. There are two types of node affinity:
Must be satisfied (requiredDuringSchedulingIgnoredDuringExecution): the scheduler only schedules the Pod when the rule is satisfied. This works like nodeSelector but with a more expressive syntax. You can define multiple hard constraint rules; only one of them needs to be satisfied.
Satisfy as much as possible (preferredDuringSchedulingIgnoredDuringExecution): the scheduler tries to find nodes that meet the rules. If no matching node is found, the scheduler still schedules the Pod. You can set weights for soft constraint rules; if multiple nodes meet the conditions, the node with the highest weight is scheduled first. You can also define multiple rules of this type; only one of them needs to be satisfied.
Weight: can only be set for the "satisfy as much as possible" policy and can be understood as scheduling priority; nodes with the highest weight are scheduled first. The value range is 1 to 100.
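For reference, a minimal sketch of what these two rule types look like in a Pod spec (the label keys disktype and zone are hypothetical examples):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # hard constraint (must be satisfied)
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values: ["ssd"]
    preferredDuringSchedulingIgnoredDuringExecution:  # soft constraint with weight
    - weight: 80
      preference:
        matchExpressions:
        - key: zone
          operator: In
          values: ["zone-a"]
```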
Similar to node affinity, there are two types of workload affinity:
Must be satisfied (requiredDuringSchedulingIgnoredDuringExecution): the scheduler only schedules the Pod when the rule is satisfied. This works like nodeSelector but with a more expressive syntax. You can define multiple hard constraint rules; only one of them needs to be satisfied.
Satisfy as much as possible (preferredDuringSchedulingIgnoredDuringExecution): the scheduler tries to find nodes that meet the rules. If no matching node is found, the scheduler still schedules the Pod. You can set weights for soft constraint rules; if multiple nodes meet the conditions, the node with the highest weight is scheduled first. You can also define multiple rules of this type; only one of them needs to be satisfied.
The affinity of the workload is mainly used to determine which Pods of the workload can be deployed in the same topology domain. For example, services that communicate with each other can be deployed in the same topology domain (such as the same availability zone) by applying affinity scheduling to reduce the network delay between them.
Similar to node affinity, there are two types of anti-affinity for workloads:
Must be satisfied (requiredDuringSchedulingIgnoredDuringExecution): the scheduler only schedules the Pod when the rule is satisfied. This works like nodeSelector but with a more expressive syntax. You can define multiple hard constraint rules; only one of them needs to be satisfied.
Satisfy as much as possible (preferredDuringSchedulingIgnoredDuringExecution): the scheduler tries to find nodes that meet the rules. If no matching node is found, the scheduler still schedules the Pod. You can set weights for soft constraint rules; if multiple nodes meet the conditions, the node with the highest weight is scheduled first. You can also define multiple rules of this type; only one of them needs to be satisfied.
The anti-affinity of the workload is mainly used to determine which Pods of the workload cannot be deployed in the same topology domain. For example, Pods of the same workload can be distributed across different topology domains (such as different hosts) to improve the stability of the workload itself.
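For reference, a minimal sketch of a hard anti-affinity rule that spreads Pods of one workload across hosts (the app: my-service label is a hypothetical example):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-service
      topologyKey: kubernetes.io/hostname   # at most one matching Pod per host
```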
A workload is an application running on Kubernetes. In Kubernetes, whether your application is composed of a single component or of many different components, you can run it with a set of Pods. Kubernetes provides five built-in workload resources to manage Pods:
Deployment
StatefulSet
DaemonSet
Job
CronJob
You can also extend workload resources by defining Custom Resource Definitions (CRDs). Fifth-generation container management supports full lifecycle management of workloads, including creation, update, scaling, monitoring, logging, deletion, and version management.
A Pod is the smallest computing unit created and managed in Kubernetes, that is, a collection of containers. These containers share storage, networking, and management policies that control how they run. Pods are typically not created directly by users but through workload resources. Pods follow a predefined lifecycle: they start in the Pending phase, enter Running if at least one of the primary containers starts normally, and then enter Succeeded or Failed depending on whether any container in the Pod ends in a failed state.
The fifth-generation container management module provides a built-in set of workload lifecycle statuses based on factors such as Pod status and number of replicas, so that users can perceive the actual running state of workloads. Because different workload types (such as Deployments and Jobs) manage Pods differently, they show different lifecycle statuses during operation, as described in the following tables.
"},{"location":"en/admin/kpanda/workloads/pod-config/workload-status.html#deployment-statefulset-damemonset-status","title":"Deployment, StatefulSet, DamemonSet Status","text":"Status Description Waiting 1. A workload is in this status while its creation is in progress. 2. After an upgrade or rollback action is triggered, the workload is in this status. 3. Trigger operations such as pausing/scaling, and the workload is in this status. Running This status occurs when all instances under the workload are running and the number of replicas matches the user-defined number. Deleting When a delete operation is performed, the payload is in this status until the delete is complete. Exception Unable to get the status of the workload for some reason. This usually occurs because communication with the pod's host has failed. Not Ready When the container is in an abnormal, pending status, this status is displayed when the workload cannot be started due to an unknown error"},{"location":"en/admin/kpanda/workloads/pod-config/workload-status.html#job-status","title":"Job Status","text":"Status Description Waiting The workload is in this status while Job creation is in progress. Executing The Job is in progress and the workload is in this status. Execution Complete The Job execution is complete and the workload is in this status. Deleting A delete operation is triggered and the workload is in this status. Exception Pod status could not be obtained for some reason. This usually occurs because communication with the pod's host has failed."},{"location":"en/admin/kpanda/workloads/pod-config/workload-status.html#cronjob-status","title":"CronJob status","text":"Status Description Waiting The CronJob is in this status when it is being created. Started After the CronJob is successfully created, the CronJob is in this status when it is running normally or when the paused task is started. Stopped The CronJob is in this status when the stop task operation is performed. Deleting The deletion operation is triggered, and the CronJob is in this status.
When the workload is in an abnormal or unready status, you can move the mouse over the status value of the load, and the system will display more detailed error information through a prompt box. You can also view the log or events to obtain related running information of the workload.
"},{"location":"en/admin/register/bindws.html#steps-to-follow","title":"Steps to Follow","text":"
Log in to the AI platform as an administrator.
Navigate to Global Management -> Workspace and Folder, and click Create Workspace.
Enter the workspace name, select a folder, and click OK to create a workspace.
Bind resources to the workspace.
On this interface, you can click Create Namespace to create a namespace.
Add authorization: Assign the user to the workspace.
The user logs in to the AI platform to check if they have permissions for the workspace and namespace. The administrator can perform more actions through the ┇ on the right side.
Next step: Allocate Resources for the Workspace
"},{"location":"en/admin/register/wsres.html","title":"Allocate Resources to the Workspace","text":"
After binding a user to a workspace, it is necessary to allocate appropriate resources to the workspace.
Navigate to Global Management -> Workspace and Folder, find the workspace to which you want to add resources, and click Add Shared Resources.
Select the cluster, set the appropriate resource quota, and then click OK
Return to the shared resources page. Resources have been successfully allocated to the workspace, and the administrator can modify them at any time using the ┇ on the right side.
AI platform provides a fully automated security implementation for containers, Pods, images, runtimes, and microservices. The following table lists some of the security features that have been implemented or are in the process of being implemented.
| Security Features | Specific Items | Description |
| --- | --- | --- |
| Image security | Trusted image distribution | Key pairs and signature information are required to achieve secure transport of images. A key can be selected for image signing during image transmission. |
| Runtime security | Event correlation analysis | Supports correlation and risk analysis of security events detected at runtime to enhance attack traceability. Supports converging alerts, reducing invalid alerts, and improving event response efficiency. |
| - | Container decoy repository | The container decoy repository is equipped with common decoys, including but not limited to: unauthorized access vulnerabilities, code execution vulnerabilities, local file reading vulnerabilities, remote command execution (RCE) vulnerabilities, and other container decoys. |
| - | Container decoy deployment | Supports custom decoy containers, including service names, service locations, etc. |
| - | Container decoy alerting | Supports alerting on suspicious behavior in container decoys. |
| - | Offset detection | While scanning the image, learns all the binary information in the image and forms a "whitelist", allowing only the binaries in the "whitelist" to run after the container goes online. This ensures the container cannot run unauthorized (for example, illegally downloaded) executable files. |
| Micro-isolation | Intelligent recommendation of isolation policies | Supports recording historical access traffic to resources and intelligently recommending isolation policies based on that traffic when configuring policies for resources. |
| - | Tenant isolation | Supports isolation control of tenants in Kubernetes clusters, with the ability to set different network security groups for different tenants, and supports tenant-level security policies to achieve inter-tenant network access and isolation. |
| Microservices security | Service and API security scanning | Supports automatic, manual, and periodic scanning of services and APIs within a cluster. Supports all traditional web scanning items, including XSS vulnerabilities, SQL injection, command/code injection, directory enumeration, path traversal, XML entity injection, PoC, file upload, weak passwords, JSONP, SSRF, arbitrary redirects, CRLF injection, and other risks. For vulnerabilities found in the container environment, supports displaying the vulnerability type, URL, parameters, danger level, test method, etc. |

## What is Falco-exporter
Falco-exporter is a Prometheus Metrics exporter for Falco output events.
Falco-exporter is deployed as a DaemonSet on a Kubernetes cluster. If Prometheus is installed and running in the cluster, metrics provided by Falco-exporter will be automatically discovered.
This section describes how to install Falco-exporter.
Note
Before installing and using Falco-exporter, you need to install and run Falco with gRPC output enabled (enabled via Unix sockets by default). For more information on enabling gRPC output in the Falco Helm Chart, see Enabling gRPC.
Please confirm that your cluster has successfully connected to the Container Management platform, and then perform the following steps to install Falco-exporter.
Click Container Management->Clusters in the left navigation bar, then find the cluster name where you want to install Falco-exporter.
In the left navigation bar, select Helm Releases -> Helm Charts, and then find and click falco-exporter.
Select the version you want to install in Version and click Install.
On the installation screen, fill in the required installation parameters.
Fill in application name, namespace, version, etc.
Fill in the following parameters:
Falco Prometheus Exporter -> Image Settings -> Registry: set the repository address of the falco-exporter image, which is already filled with the available online repositories by default. If it is a private environment, you can change it to a private repository address.
Falco Prometheus Exporter -> Image Settings -> Repository: set the falco-exporter image name.
Falco Prometheus Exporter -> Prometheus ServiceMonitor Settings -> Install ServiceMonitor: install Prometheus Operator service monitor. It is enabled by default.
Falco Prometheus Exporter -> Prometheus ServiceMonitor Settings -> Scrape Interval: user-defined interval; if not specified, the Prometheus default interval is used.
Falco Prometheus Exporter -> Prometheus ServiceMonitor Settings -> Scrape Timeout: user-defined scrape timeout; if not specified, the Prometheus default scrape timeout is used.
In the screen as above, fill in the following parameters:
Falco Prometheus Exporter -> Prometheus prometheusRules -> Install prometheusRules: create PrometheusRules to alert on priority events. It is enabled by default.
Falco Prometheus Exporter -> Prometheus prometheusRules -> Alerts settings: set whether alerts are enabled for different levels of log events, the interval between alerts, and the threshold for alerts.
Click the OK button at the bottom right corner to complete the installation.
Please confirm that your cluster has successfully connected to the Container Management platform, and then perform the following steps to install Falco.
Click Container Management->Clusters in the left navigation bar, then find the cluster name where you want to install Falco.
In the left navigation bar, select Helm Releases -> Helm Charts, and then find and click Falco.
Select the version you want to install in Version, and click Install.
On the installation page, fill in the required installation parameters.
Fill in the application name, namespace, version, etc.
Fill in the following parameters:
Falco -> Image Settings -> Registry: set the repository address of the Falco image, which is already filled with the available online repositories by default. If it is a private environment, you can change it to a private repository address.
Falco -> Image Settings -> Repository: set the Falco image name.
Falco -> Falco Driver -> Image Settings -> Registry: set the repository address of the Falco Driver image, which is already filled with available online repositories by default. If it is a private environment, you can change it to a private repository address.
Falco -> Falco Driver -> Image Settings -> Repository: set the Falco Driver image name.
Falco -> Falco Driver -> Image Settings -> Driver Kind: set the Driver Kind, providing the following two options.
ebpf: use ebpf to detect events, which requires the Linux kernel to support ebpf and enable CONFIG_BPF_JIT and sysctl net.core.bpf_jit_enable=1.
module: use kernel module detection with limited OS version support. Refer to module support system version.
Falco -> Falco Driver -> Image Settings -> Log Level: the minimum log level to be included in the log.
Click the OK button in the bottom right corner to complete the installation.
"},{"location":"en/admin/security/falco.html","title":"What is Falco","text":"
Falco is a cloud-native runtime security tool designed to detect anomalous activity in applications, and can be used to monitor the runtime security of Kubernetes applications and internal components. With only a set of rules, Falco can continuously monitor and watch for anomalous activity in containers, applications, hosts, and networks.
"},{"location":"en/admin/security/falco.html#what-does-falco-detect","title":"What does Falco detect?","text":"
Falco can detect and alert on any behavior involving Linux system calls. Falco alerts can be triggered using specific system calls, parameters, and properties of the calling process. For example, Falco can easily detect events including but not limited to the following:
A shell is running inside a container or pod in Kubernetes.
A container is running in privileged mode or mounting a sensitive path, such as /proc, from the host.
A server process is spawning a child process of an unexpected type.
A sensitive file, such as /etc/shadow, is being read unexpectedly.
A non-device file is being written to /dev.
A standard system binary, such as ls, is making an outbound network connection.
A privileged pod is started in a Kubernetes cluster.
For more information on the default rules that come with Falco, see the Rules documentation.
"},{"location":"en/admin/security/falco.html#what-are-falco-rules","title":"What are Falco rules?","text":"
Falco rules define the behavior and events that Falco should monitor. Rules can be written in the Falco rules file or in a generic configuration file. For more information on writing, managing and deploying rules, see Falco Rules.
"},{"location":"en/admin/security/falco.html#what-are-falco-alerts","title":"What are Falco Alerts?","text":"
Alerts are configurable downstream actions that can be as simple as logging to STDOUT or as complex as delivering a gRPC call to a client. For more information on configuring, understanding, and developing alerts, see Falco Alerts. Falco can send alerts to:
Standard output
A file
A system log
A spawned program
An HTTP[s] endpoint
A client via the gRPC API
"},{"location":"en/admin/security/falco.html#what-are-the-components-of-falco","title":"What are the components of Falco?","text":"
Falco consists of the following main components:
Userspace program: a CLI tool that can be used to interact with Falco. The userspace program handles signals, parses messages from a Falco driver, and sends alerts.
Configuration: define how Falco is run, what rules to assert, and how to perform alerts. For more information, see Configuration.
Driver: a software that adheres to the Falco driver specification and sends a stream of system call information. You cannot run Falco without installing a driver. Currently, Falco supports the following drivers:
Kernel module built on libscap and libsinsp C++ libraries (default)
BPF probe built from the same modules
Userspace instrumentation
For more information, see Falco drivers.
Plugins: allow users to extend the functionality of falco libraries/falco executable by adding new event sources and new fields that can extract information from events. For more information, see Plugins.
Notebook typically refers to Jupyter Notebook or similar interactive computing environments. This is a very popular tool widely used in fields such as data science, machine learning, and deep learning. This page explains how to use Notebook on the Canfeng AI platform.
Enter a name, select a cluster, namespace, choose the newly created queue, and click One-click Initialization.
Select Notebook type, configure memory and CPU, enable GPU, create and configure PVC:
Enable SSH external access:
You will be automatically redirected to the Notebook instance list; click the instance name.
Enter the Notebook instance details page and click the Open button in the upper right corner.
You will enter the Notebook development environment, where a persistent volume is mounted in the /home/jovyan directory. You can clone code using git and upload data after connecting via SSH, etc.
"},{"location":"en/admin/share/notebook.html#accessing-notebook-instances-via-ssh","title":"Accessing Notebook Instances via SSH","text":"
Generate an SSH key pair on your own computer.
Open the command line on your computer, for example, open git bash on Windows, and enter ssh-keygen.exe -t rsa, then press enter until completion.
Use commands like cat ~/.ssh/id_rsa.pub to view and copy the public key.
Log in to the AI platform as a user, click Personal Center in the upper right corner -> SSH Public Key -> Import SSH Public Key.
Go to the details page of the Notebook instance and copy the SSH link.
Use SSH to access the Notebook instance from the client.
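As a sketch, the client-side steps might look as follows (the host and port are placeholders; use the SSH link copied from the instance details page):

```bash
# Generate a key pair if you do not already have one
ssh-keygen -t rsa
# View and copy the public key, then import it via Personal Center -> SSH Public Key
cat ~/.ssh/id_rsa.pub
# Connect to the Notebook instance (replace host and port with the values from your SSH link)
ssh jovyan@<notebook-host> -p <port>
```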
Administrator has assigned a workspace to the user
Resource quotas have been set for the workspace
A cluster has been created
"},{"location":"en/admin/share/workload.html#steps-to-create-ai-workloads","title":"Steps to Create AI Workloads","text":"
Log in to the AI platform as a User.
Navigate to Container Management, select a namespace, then click Workloads -> Deployments, and then click the Create from Image button on the right.
After configuring the parameters, click OK.
Basic InformationContainer ConfigurationOthers
Select your own namespace.
Set the image, configure resources such as CPU, memory, and GPU, and set the startup command.
Service configuration and advanced settings can use default configurations.
Automatically return to the stateless workload list and click the workload name.
Enter the details page to view the GPU quota.
You can also enter the console and run the mx-smi command to check the GPU resources.
Next step: Using Notebook
"},{"location":"en/admin/virtnest/best-practice/import-ubuntu.html","title":"Import a Linux Virtual Machine with Ubuntu from an External Platform","text":"
This page provides a detailed introduction on how to import Linux virtual machines from the external platform VMware into the virtual machines of AI platform through the command line.
Info
The external virtual platform in this document is VMware vSphere Client, abbreviated as vSphere. Technically, the import relies on kubevirt cdi. Before proceeding, the virtual machine to be imported from vSphere needs to be shut down. This page takes a virtual machine running the Ubuntu operating system as an example.
"},{"location":"en/admin/virtnest/best-practice/import-ubuntu.html#fetch-basic-information-of-vsphere-virtual-machine","title":"Fetch Basic Information of vSphere Virtual Machine","text":"
vSphere URL: Fetch information on the URL of the target platform
vSphere SSL Certificate Thumbprint: needs to be fetched using openssl
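One way to obtain it (the hostname below is the vCenter address used in this example; replace it with your own):

```bash
openssl s_client -connect vcsa.daocloud.io:443 </dev/null \
  | openssl x509 -noout -fingerprint -sha1
```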
```text
Can't use SSL_get_servername
depth=0 CN = vcsa.daocloud.io
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = vcsa.daocloud.io
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN = vcsa.daocloud.io
verify return:1
DONE
sha1 Fingerprint=C3:9D:D7:55:6A:43:11:2B:DE:BA:27:EA:3B:C2:13:AF:E4:12:62:4D # Value needed
```
vSphere Account: Fetch account information for vSphere, and pay attention to permissions
vSphere Password: Fetch password information for vSphere
UUID of the virtual machine to be imported: Need to be fetched on the web page of vSphere
Access the vSphere page, go to the details page of the virtual machine to be imported, click Edit Settings, open the browser's developer console at this point, click Network -> Headers, and find the URL as shown in the image below.
Click Response , locate vmConfigContext -> config , and finally find the target value uuid .
Path of the vmdk file of the virtual machine to be imported
Different information needs to be configured based on the chosen network mode. If a fixed IP address is required, you should select the Bridge network mode.
Create a Multus CR of the ovs type. Refer to Creating a Multus CR.
Create subnets and IP pools. Refer to Creating Subnets and IP Pools.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: vsphere # Can be changed
  labels:
    app: containerized-data-importer # Do not change
type: Opaque
data:
  accessKeyId: "username-base64"
  secretKey: "password-base64"
```
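The accessKeyId and secretKey values must be Base64-encoded; for example (placeholder credentials):

```bash
echo -n "your-vsphere-username" | base64
echo -n "your-vsphere-password" | base64
```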
"},{"location":"en/admin/virtnest/best-practice/import-ubuntu.html#write-a-kubevirt-vm-yaml-to-create-vm","title":"Write a KubeVirt VM YAML to create VM","text":"
Tip
If a fixed IP address is required, the YAML configuration differs slightly from the one used for the default network. These differences have been highlighted.
"},{"location":"en/admin/virtnest/best-practice/import-ubuntu.html#access-vnc-to-verify-successful-operation","title":"Access VNC to verify successful operation","text":"
Modify the network configuration of the virtual machine
Check the current network
When the actual import is completed, the configuration shown in the image below has been completed. However, it should be noted that the enp1s0 interface does not contain the inet field, so it cannot connect to the external network.
Configure netplan
In the configuration shown in the image above, change the objects in ethernets to enp1s0 and obtain an IP address using DHCP.
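For example, a minimal netplan configuration of this kind might look as follows (the file name under /etc/netplan/ may differ on your system):

```yaml
network:
  version: 2
  ethernets:
    enp1s0:
      dhcp4: true   # obtain an IP address via DHCP
```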
Apply the netplan configuration to the system network configuration
```bash
sudo netplan apply
```
Perform a ping test on the external network
Access the virtual machine on the node via SSH.
"},{"location":"en/admin/virtnest/best-practice/import-windows.html","title":"Import a Windows Virtual Machine from the External Platform","text":"
This page provides a detailed introduction on how to import virtual machines from an external platform -- VMware, into the virtual machines of AI platform using the command line.
Info
The external virtual platform on this page is VMware vSphere Client, abbreviated as vSphere. Technically, it relies on kubevirt cdi for implementation. Before proceeding, the virtual machine imported on vSphere needs to be shut down. Take a virtual machine of the Windows operating system as an example.
Before importing, refer to the Network Configuration to prepare the environment.
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#fetch-information-of-the-windows-virtual-machine","title":"Fetch Information of the Windows Virtual Machine","text":"
Similar to importing a virtual machine with a Linux operating system, refer to Importing a Linux Virtual Machine with Ubuntu from an External Platform to get the following information:
vSphere account and password
vSphere virtual machine information
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#check-the-boot-type-of-windows","title":"Check the Boot Type of Windows","text":"
When importing a virtual machine from an external platform into the AI platform virtualization platform, you need to configure it according to the boot type (BIOS or UEFI) to ensure it can boot and run correctly.
You can check whether Windows uses BIOS or UEFI through \"System Summary.\" If it uses UEFI, you need to add the relevant information in the YAML file.
Prepare the window.yaml file and pay attention to the following configuration items:
PVC booting Virtio drivers
Disk bus type, set to SATA or Virtio depending on the boot type
UEFI configuration (if UEFI is used)
Click to view the window.yaml example:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    virtnest.io/os-family: windows
    virtnest.io/os-version: "server2019"
  name: export-window-21
  namespace: default
spec:
  dataVolumeTemplates:
  - metadata:
      name: export-window-21-rootdisk
    spec:
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 22Gi
        storageClassName: local-path
      source:
        vddk:
          backingFile: "[A05-09-ShangPu-Local-DataStore] virtnest-export-window/virtnest-export-window.vmdk"
          url: "https://10.64.56.21"
          uuid: "421d40f2-21a2-cfeb-d5c9-e7f8abfc2faa"
          thumbprint: "D7:C4:22:E3:6F:69:DA:72:50:81:12:FA:42:18:3F:29:5C:7F:41:CA"
          secretRef: "vsphere21"
          initImageURL: "release.daocloud.io/virtnest/vddk:v8"
  - metadata:
      name: export-window-21-datadisk
    spec:
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
        storageClassName: local-path
      source:
        vddk:
          backingFile: "[A05-09-ShangPu-Local-DataStore] virtnest-export-window/virtnest-export-window_1.vmdk"
          url: "https://10.64.56.21"
          uuid: "421d40f2-21a2-cfeb-d5c9-e7f8abfc2faa"
          thumbprint: "D7:C4:22:E3:6F:69:DA:72:50:81:12:FA:42:18:3F:29:5C:7F:41:CA"
          secretRef: "vsphere21"
          initImageURL: "release.daocloud.io/virtnest/vddk:v8"
  # <1> PVC for booting Virtio drivers
  # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
  - metadata:
      name: virtio-disk
    spec:
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Mi
        storageClassName: local-path
      source:
        blank: {}
  # ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
  running: true
  template:
    metadata:
      annotations:
        ipam.spidernet.io/ippools: '[{"cleangateway":false,"ipv4":["test86"]}]'
    spec:
      dnsConfig:
        nameservers:
        - 223.5.5.5
      domain:
        cpu:
          cores: 2
        memory:
          guest: 4Gi
        devices:
          disks:
          - bootOrder: 1
            disk:
              bus: sata # <2> Disk bus type, set to SATA or Virtio depending on the boot type
            name: rootdisk
          - bootOrder: 2
            disk:
              bus: sata # <2> Disk bus type, set to SATA or Virtio depending on the boot type
            name: datadisk
          # <1> disk for booting Virtio drivers
          # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
          - bootOrder: 3
            disk:
              bus: virtio
            name: virtdisk
          - bootOrder: 4
            cdrom:
              bus: sata
            name: virtiocontainerdisk
          # ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
          interfaces:
          - bridge: {}
            name: ovs-bridge0
        # <3> See the section "Check the Boot Type of Windows" above
        # If using UEFI, add the following information
        # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
        features:
          smm:
            enabled: true
        firmware:
          bootloader:
            efi:
              secureBoot: false
        # ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
        machine:
          type: q35
        resources:
          requests:
            memory: 4Gi
      networks:
      - multus:
          default: true
          networkName: kube-system/test1
        name: ovs-bridge0
      volumes:
      - dataVolume:
          name: export-window-21-rootdisk
        name: rootdisk
      - dataVolume:
          name: export-window-21-datadisk
        name: datadisk
      # <1> Volumes for booting Virtio drivers
      # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
      - dataVolume:
          name: virtio-disk
        name: virtdisk
      - containerDisk:
          image: release-ci.daocloud.io/virtnest/kubevirt/virtio-win:v4.12.12-5
        name: virtiocontainerdisk
      # ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
```
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#install-virtio-drivers-via-vnc","title":"Install VirtIO Drivers via VNC","text":"
Access and connect to the virtual machine via VNC.
Download and install the appropriate VirtIO drivers based on the Windows version.
Enable Remote Desktop to facilitate future connections via RDP.
After installation, update the YAML file and reboot the virtual machine.
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#update-yaml-after-reboot","title":"Update YAML After Reboot","text":"Click to view the modified `window.yaml` example window.yaml
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#access-and-verify-via-rdp","title":"Access and Verify via RDP","text":"
Use an RDP client to connect to the virtual machine. Log in with the default account admin and password dangerous!123.
Verify network access and data disk data
"},{"location":"en/admin/virtnest/best-practice/import-windows.html#differences-between-importing-linux-and-windows-virtual-machines","title":"Differences Between Importing Linux and Windows Virtual Machines","text":"
Windows may require UEFI configuration.
Windows typically requires the installation of VirtIO drivers.
Windows multi-disk imports usually do not require re-mounting of disks.
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html","title":"Create a Windows Virtual Machine","text":"
This document will explain how to create a Windows virtual machine via the command line.
Before creating a Windows virtual machine, it is recommended to first refer to installing dependencies and prerequisites for the virtual machine module to ensure that your environment is ready.
During the creation process, it is recommended to refer to the official documentation: Installing Windows documentation, Installing Windows related drivers.
It is recommended to access the Windows virtual machine using the VNC method.
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html#import-an-iso-image","title":"Import an ISO Image","text":"
Creating a Windows virtual machine requires importing an ISO image primarily to install the Windows operating system. Unlike Linux operating systems, the Windows installation process usually involves booting from an installation disc or ISO image file. Therefore, when creating a Windows virtual machine, it is necessary to first import the installation ISO image of the Windows operating system so that the virtual machine can be installed properly.
Here are two methods for importing ISO images:
(Recommended) Creating a Docker image. It is recommended to refer to building images.
(Not recommended) Using virtctl to import the image into a Persistent Volume Claim (PVC).
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html#create-a-windows-virtual-machine-using-yaml","title":"Create a Windows Virtual Machine Using YAML","text":"
Creating a Windows virtual machine using YAML is more flexible and easier to write and maintain. Below are three reference YAML examples:
(Recommended) Using Virtio drivers + Docker image:
If you need to use storage capabilities - mount disks, please install viostor drivers.
If you need to use network capabilities, please install NetKVM drivers.
(Not recommended) Without Virtio drivers: import the image into a Persistent Volume Claim (PVC) using the virtctl tool. The virtual machine may use other types of drivers or the default drivers to operate disk and network devices.
For Windows virtual machines, remote desktop control access is often required. It is recommended to use Microsoft Remote Desktop to control your virtual machine.
Note
Your Windows version must support remote desktop control to use Microsoft Remote Desktop.
You need to disable the Windows firewall.
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html#add-data-disks","title":"Add Data Disks","text":"
Adding a data disk to a Windows virtual machine follows the same process as adding one to a Linux virtual machine. You can refer to the provided YAML example for guidance.
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html#snapshots-cloning-live-migration","title":"Snapshots, Cloning, Live Migration","text":"
These capabilities are consistent with Linux virtual machines and can be configured using the same methods.
"},{"location":"en/admin/virtnest/best-practice/vm-windows.html#access-your-windows-virtual-machine","title":"Access Your Windows Virtual Machine","text":"
After successful creation, access the virtual machine list page to confirm that the virtual machine is running properly.
Click the console access (VNC) to access it successfully.
"},{"location":"en/admin/virtnest/gpu/vm-gpu.html","title":"Configure GPU Passthrough for Virtual Machines","text":"
This page will explain the prerequisites for configuring GPU when creating a virtual machine.
The key to configuring GPU for virtual machines is to configure the GPU Operator to deploy different software components on the worker nodes, depending on the GPU workload configuration. Here are three example nodes:
The controller-node-1 node is configured to run containers.
The work-node-1 node is configured to run virtual machines with GPU passthrough.
The work-node-2 node is configured to run virtual machines with vGPU.
"},{"location":"en/admin/virtnest/gpu/vm-gpu.html#assumptions-limitations-and-dependencies","title":"Assumptions, Limitations, and Dependencies","text":"
The worker nodes can run GPU-accelerated containers, virtual machines with GPU passthrough, or virtual machines with vGPU. However, a combination of any of these is not supported.
The cluster administrator or developer needs to have prior knowledge of the cluster and correctly label the nodes to indicate the type of GPU workload they will run.
The worker node that runs a GPU-accelerated virtual machine with GPU passthrough or vGPU is assumed to be a bare metal machine. If the worker node is a virtual machine, the GPU passthrough feature needs to be enabled on the virtual machine platform. Please consult your virtual machine platform provider for guidance.
Nvidia MIG is not supported for vGPU.
The GPU Operator does not automatically install GPU drivers in the virtual machine.
To enable GPU passthrough, the cluster nodes need to have IOMMU enabled. Refer to How to Enable IOMMU. If your cluster is running on a virtual machine, please consult your virtual machine platform provider.
"},{"location":"en/admin/virtnest/gpu/vm-gpu.html#label-the-cluster-nodes","title":"Label the Cluster Nodes","text":"
Go to Container Management, select your worker cluster, click Node Management, and then click Modify Labels in the action bar to add labels to the nodes. Each node can only have one label.
You can assign the following values to the labels: container, vm-passthrough, and vm-vgpu.
Go to Container Management, select your worker cluster, click Helm Apps -> Helm Chart , and choose and install gpu-operator. Modify the relevant fields in the yaml.
```yaml
gpu-operator.sandboxWorkloads.enabled=true
gpu-operator.vfioManager.enabled=true
gpu-operator.sandboxDevicePlugin.enabled=true
gpu-operator.sandboxDevicePlugin.version=v1.2.4 // version should be >= v1.2.4
gpu-operator.toolkit.version=v1.14.3-ubuntu20.04
```
Wait for the installation to succeed, as shown in the following image:
"},{"location":"en/admin/virtnest/gpu/vm-gpu.html#install-virtnest-agent-and-configure-cr","title":"Install virtnest-agent and Configure CR","text":"
Install virtnest-agent, refer to Install virtnest-agent.
Add vGPU and GPU passthrough to the Virtnest Kubevirt CR. The following example shows the relevant yaml after adding vGPU and GPU passthrough:
GPU information registered by GPU Operator on the node
Example of obtaining vGPU information (only applicable to vGPU): view the node information on a node labeled nvidia.com/gpu.workload.config=vm-vgpu, such as work-node-2. In the Capacity section, nvidia.com/GRID_P4-1Q: 8 indicates that 8 vGPUs are available:
In this case, the mdevNameSelector should be \"GRID P4-1Q\" and the resourceName should be \"GRID_P4-1Q\".
Get GPU passthrough information: on the node labeled nvidia.com/gpu.workload.config=vm-passthrough (work-node-1 in this example), view the node information. In the Capacity section, nvidia.com/GP104GL_TESLA_P4: 2 indicates that 2 GPUs are available for passthrough:
In this case, the resourceName should be "GP104GL_TESLA_P4". How do you obtain the pciVendorSelector? SSH into the target node work-node-1 and use the lspci -nnk -d 10de: command to obtain the NVIDIA GPU PCI information, as shown below: the red box indicates the pciVendorSelector information.
Edit the kubevirt CR. Note: if there are multiple GPUs of the same model, you only need to register one entry in the CR; there is no need to list every GPU.
GPU passthrough: in the example above, there are two Tesla P4 GPUs, so only one needs to be registered here.
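A minimal sketch of the corresponding permittedHostDevices section in the kubevirt CR, assuming the device and resource names gathered above (adjust the PCI vendor/device ID and resource names to your environment):

```yaml
spec:
  configuration:
    permittedHostDevices:
      mediatedDevices:                     # vGPU
      - mdevNameSelector: "GRID P4-1Q"
        resourceName: "nvidia.com/GRID_P4-1Q"
        externalResourceProvider: true
      pciHostDevices:                      # GPU passthrough
      - pciVendorSelector: "10DE:1BB3"     # example vendor:device ID for Tesla P4
        resourceName: "nvidia.com/GP104GL_TESLA_P4"
        externalResourceProvider: true
```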
"},{"location":"en/admin/virtnest/gpu/vm-gpu.html#create-vm-using-yaml-and-enable-gpu-acceleration","title":"Create VM Using YAML and Enable GPU Acceleration","text":"
The only difference from a regular virtual machine is adding GPU-related information in the devices section.
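For example, a sketch of the devices section for GPU passthrough (the deviceName must match the resourceName registered in the kubevirt CR above):

```yaml
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
          - deviceName: nvidia.com/GP104GL_TESLA_P4
            name: gpu1
```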
"},{"location":"en/admin/virtnest/gpu/vm-vgpu.html","title":"Configure GPU (vGPU) for Virtual Machines","text":"
This page will explain the prerequisites for configuring GPU when creating a virtual machine.
The key to configuring GPU for virtual machines is to configure the GPU Operator to deploy different software components on the worker nodes, depending on the GPU workload configuration. Here are three example nodes:
The controller-node-1 node is configured to run containers.
The work-node-1 node is configured to run virtual machines with GPU passthrough.
The work-node-2 node is configured to run virtual machines with vGPU.
"},{"location":"en/admin/virtnest/gpu/vm-vgpu.html#assumptions-limitations-and-dependencies","title":"Assumptions, Limitations, and Dependencies","text":"
The worker nodes can run GPU-accelerated containers, virtual machines with GPU passthrough, or virtual machines with vGPU. However, a combination of any of these is not supported.
The cluster administrator or developer needs to have prior knowledge of the cluster and correctly label the nodes to indicate the type of GPU workload they will run.
The worker node that runs a GPU-accelerated virtual machine with GPU passthrough or vGPU is assumed to be a bare metal machine. If the worker node is a virtual machine, the GPU passthrough feature needs to be enabled on the virtual machine platform. Please consult your virtual machine platform provider for guidance.
Nvidia MIG is not supported for vGPU.
The GPU Operator does not automatically install GPU drivers in the virtual machine.
To enable GPU passthrough, the cluster nodes need to have IOMMU enabled. Please refer to How to Enable IOMMU. If your cluster is running on a virtual machine, please consult your virtual machine platform provider.
Note: This step is only required when using NVIDIA vGPU. If you plan to use GPU passthrough only, skip this section.
Follow these steps to build the vGPU Manager image and push it to the container registry:
Download the vGPU software from the NVIDIA Licensing Portal.
Log in to the NVIDIA Licensing Portal and go to the Software Downloads page.
The NVIDIA vGPU software is located in the Driver downloads tab on the Software Downloads page.
Select VGPU + Linux in the filter criteria and click Download to get the Linux KVM package. Unzip the downloaded file (NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run).
Open a terminal and clone the container-images/driver repository.
```bash
git clone https://gitlab.com/nvidia/container-images/driver
cd driver
```
Switch to the vgpu-manager directory that matches your operating system.
```bash
cd vgpu-manager/<your-os>
```
Copy the .run file extracted in step 1 to the current directory.
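A rough sketch of the remaining build-and-push steps; the exact build arguments depend on the version of the cloned repository, so treat the following as an assumption and check the repository README:

```bash
# Placeholders: <path-to>, <version>, <your-os>, and <your-registry> must be replaced
cp <path-to>/NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run .
docker build --build-arg DRIVER_VERSION=<version> \
  -t <your-registry>/vgpu-manager:<version>-<your-os> .
docker push <your-registry>/vgpu-manager:<version>-<your-os>
```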
"},{"location":"en/admin/virtnest/gpu/vm-vgpu.html#label-the-cluster-nodes","title":"Label the Cluster Nodes","text":"
Go to Container Management, select your worker cluster, click Node Management, and then click Modify Labels in the action bar to add labels to the nodes. Each node can only have one label.
You can assign the following values to the labels: container, vm-passthrough, and vm-vgpu.
Go to Container Management, select your worker cluster, click Helm Apps -> Helm Chart, and choose and install gpu-operator. Modify the relevant fields in the yaml.
GPU information registered by the GPU Operator on the node
Example of obtaining vGPU information (only applicable to vGPU): view the node information on a node labeled nvidia.com/gpu.workload.config=vm-vgpu, such as work-node-2. In the Capacity section, nvidia.com/GRID_P4-1Q: 8 indicates that 8 vGPUs are available:
In this case, the mdevNameSelector should be \"GRID P4-1Q\" and the resourceName should be \"GRID_P4-1Q\".
Get GPU passthrough information: on the node labeled nvidia.com/gpu.workload.config=vm-passthrough (work-node-1 in this example), view the node information. In the Capacity section, nvidia.com/GP104GL_TESLA_P4: 2 indicates that 2 GPUs are available for passthrough:
In this case, the resourceName should be "GP104GL_TESLA_P4". How do you obtain the pciVendorSelector? SSH into the target node work-node-1 and use the lspci -nnk -d 10de: command to obtain the NVIDIA GPU PCI information, as shown below: the red box indicates the pciVendorSelector information.
Edit the kubevirt CR. Note: if there are multiple GPUs of the same model, you only need to register one entry in the CR; there is no need to list every GPU.
GPU passthrough: in the example above, there are two Tesla P4 GPUs, so only one needs to be registered here.
"},{"location":"en/admin/virtnest/gpu/vm-vgpu.html#create-vm-using-yaml-and-enable-gpu-acceleration","title":"Create VM Using YAML and Enable GPU Acceleration","text":"
The only difference from a regular virtual machine is adding the gpu-related information in the devices section.
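For example, a sketch of the devices section for vGPU (the deviceName must match the vGPU resourceName registered in the kubevirt CR above):

```yaml
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
          - deviceName: nvidia.com/GRID_P4-1Q
            name: vgpu1
```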
If you want to experience the latest development version of virtnest, then please add the following repository address (the development version of virtnest is extremely unstable).
"},{"location":"en/admin/virtnest/install/index.html#choose-a-version-that-you-want-to-install","title":"Choose a Version that You Want to Install","text":"
It is recommended to install the latest version.
```console
[root@master ~]# helm search repo virtnest-release/virtnest --versions
NAME                       CHART VERSION  APP VERSION  DESCRIPTION
virtnest-release/virtnest  0.6.0          v0.6.0       A Helm chart for virtnest
```
"},{"location":"en/admin/virtnest/install/index.html#create-a-namespace","title":"Create a Namespace","text":"
"},{"location":"en/admin/virtnest/install/index.html#upgrade","title":"Upgrade","text":""},{"location":"en/admin/virtnest/install/index.html#update-the-virtnest-helm-repository","title":"Update the virtnest Helm Repository","text":"
```bash
helm repo update virtnest-release
```
"},{"location":"en/admin/virtnest/install/index.html#back-up-the-set-parameters","title":"Back up the --set Parameters","text":"
Before upgrading the virtnest version, we recommend executing the following command to backup the --set parameters of the previous version
```bash
helm get values virtnest -n virtnest-system -o yaml > bak.yaml
```
"},{"location":"en/admin/virtnest/install/install-dependency.html","title":"Dependencies and Prerequisites","text":"
This page explains the dependencies and prerequisites for installing the virtual machine module.
Info
The term virtnest mentioned in the commands or scripts below is the internal development code name for the Virtual Machine module.
"},{"location":"en/admin/virtnest/install/install-dependency.html#prerequisites","title":"Prerequisites","text":""},{"location":"en/admin/virtnest/install/install-dependency.html#kernel-version-being-above-v411","title":"Kernel version being above v4.11","text":"
The kernel version of all nodes in the target cluster needs to be higher than v4.11. For detailed information, see the kubevirt issue. Run the following command to see the version:
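For example:

```bash
uname -r
```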
"},{"location":"en/admin/virtnest/install/install-dependency.html#cpu-supporting-x86-64-v2-instruction-set-or-higher","title":"CPU supporting x86-64-v2 instruction set or higher","text":"
You can use the following script to check if the current node's CPU is usable:
Note
If you encounter a message like the one shown below, you can safely ignore it as it does not impact the final result.
Example
```console
$ sh detect-cpu.sh
detect-cpu.sh: line 3: fpu: command not found
```
"},{"location":"en/admin/virtnest/install/install-dependency.html#all-nodes-having-hardware-virtualization-nested-virtualization-enabled","title":"All Nodes having hardware virtualization (nested virtualization) enabled","text":"
Run the following command to check if it has been achieved:
```bash
virt-host-validate qemu
```
```text
# Successful case
QEMU: Checking for hardware virtualization : PASS
QEMU: Checking if device /dev/kvm exists : PASS
QEMU: Checking if device /dev/kvm is accessible : PASS
QEMU: Checking if device /dev/vhost-net exists : PASS
QEMU: Checking if device /dev/net/tun exists : PASS
QEMU: Checking for cgroup 'cpu' controller support : PASS
QEMU: Checking for cgroup 'cpuacct' controller support : PASS
QEMU: Checking for cgroup 'cpuset' controller support : PASS
QEMU: Checking for cgroup 'memory' controller support : PASS
QEMU: Checking for cgroup 'devices' controller support : PASS
QEMU: Checking for cgroup 'blkio' controller support : PASS
QEMU: Checking for device assignment IOMMU support : PASS
QEMU: Checking if IOMMU is enabled by kernel : PASS
QEMU: Checking for secure guest support : WARN (Unknown if this platform has Secure Guest support)

# Failure case
QEMU: Checking for hardware virtualization : FAIL (Only emulated CPUs are available, performance will be significantly limited)
QEMU: Checking if device /dev/vhost-net exists : PASS
QEMU: Checking if device /dev/net/tun exists : PASS
QEMU: Checking for cgroup 'memory' controller support : PASS
QEMU: Checking for cgroup 'memory' controller mount-point : PASS
QEMU: Checking for cgroup 'cpu' controller support : PASS
QEMU: Checking for cgroup 'cpu' controller mount-point : PASS
QEMU: Checking for cgroup 'cpuacct' controller support : PASS
QEMU: Checking for cgroup 'cpuacct' controller mount-point : PASS
QEMU: Checking for cgroup 'cpuset' controller support : PASS
QEMU: Checking for cgroup 'cpuset' controller mount-point : PASS
QEMU: Checking for cgroup 'devices' controller support : PASS
QEMU: Checking for cgroup 'devices' controller mount-point : PASS
QEMU: Checking for cgroup 'blkio' controller support : PASS
QEMU: Checking for cgroup 'blkio' controller mount-point : PASS
WARN (Unknown if this platform has IOMMU support)
```
Methods for enabling nested virtualization vary across platforms; this page takes vSphere as an example. See the VMware website for details.
"},{"location":"en/admin/virtnest/install/install-dependency.html#if-using-docker-engine-as-the-container-runtime","title":"If using Docker Engine as the container runtime","text":"
If Docker Engine is used as the container runtime, it must be higher than v20.10.10.
"},{"location":"en/admin/virtnest/install/install-dependency.html#enabling-iommu-is-recommended","title":"Enabling IOMMU is recommended","text":"
To prepare for future features, it is recommended to enable IOMMU.
"},{"location":"en/admin/virtnest/install/offline-install.html","title":"Offline Upgrade of the Virtual Machine Module","text":"
This page explains how to install or upgrade the Virtual Machine module after downloading it from the Download Center.
Info
The term \"virtnest\" appearing in the following commands or scripts is the internal development code name for the Virtual Machine module.
"},{"location":"en/admin/virtnest/install/offline-install.html#load-images-from-the-installation-package","title":"Load Images from the Installation Package","text":"
You can load the images using one of the following two methods. When a container registry is available in your environment, it is recommended to use the chart-syncer method to synchronize the images to the container registry, as it is more efficient and convenient.
"},{"location":"en/admin/virtnest/install/offline-install.html#synchronize-images-to-the-container-registry-using-chart-syncer","title":"Synchronize Images to the container registry using chart-syncer","text":"
Create load-image.yaml file.
Note
All parameters in this YAML file are mandatory. You need a private container registry and modify the relevant configurations.
Chart Repo Installed / Chart Repo Not Installed
If the chart repo is already installed in your environment, chart-syncer also supports exporting the chart as a tgz file.
The relative path to run the charts-syncer command, not the relative path between this YAML file and the offline package.
Change to your container registry URL.
Change to your container registry.
It can also be any other supported Helm Chart repository type.
Change to the chart repo URL.
Your container registry username.
Your container registry password.
Your container registry username.
Your container registry password.
If the chart repo is not installed in your environment, chart-syncer also supports exporting the chart as a tgz file and storing it in the specified path.
The relative path to run the charts-syncer command, not the relative path between this YAML file and the offline package.
Change to your container registry URL.
Change to your container registry.
Local path of the chart.
Your container registry username.
Your container registry password.
Run the command to synchronize the images.
charts-syncer sync --config load-image.yaml\n
"},{"location":"en/admin/virtnest/install/offline-install.html#load-images-directly-using-docker-or-containerd","title":"Load Images Directly using Docker or containerd","text":"
Unpack and load the image files.
Unpack the tar archive.
tar xvf virtnest.bundle.tar\n
After successful extraction, you will have three files:
hints.yaml
images.tar
original-chart
Load the images from the local file to Docker or containerd.
Docker / containerd
docker load -i images.tar\n
ctr -n k8s.io image import images.tar\n
Note
Perform the Docker or containerd image loading operation on each node. After loading is complete, tag the images to match the Registry and Repository used during installation.
If the helm version is too low, this step may fail. If it fails, try running helm repo update.
Choose the version of the Virtual Machine you want to install (it is recommended to install the latest version).
helm search repo virtnest/virtnest --versions\n
[root@master ~]# helm search repo virtnest/virtnest --versions\nNAME CHART VERSION APP VERSION DESCRIPTION\nvirtnest/virtnest 0.2.0 v0.2.0 A Helm chart for virtnest\n...\n
Back up the --set parameters.
Before upgrading the Virtual Machine version, it is recommended to run the following command to backup the --set parameters of the previous version.
helm get values virtnest -n virtnest-system -o yaml > bak.yaml\n
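After backing up the values, a typical upgrade command looks like the following (the chart version is illustrative; use the version selected above and reuse the backed-up values):
helm upgrade virtnest virtnest/virtnest -n virtnest-system -f bak.yaml --version 0.2.0\n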
To utilize the Virtual Machine (VM), the virtnest-agent component needs to be installed in the cluster using Helm.
Click Container Management in the left navigation menu, then click Virtual Machines . If the virtnest-agent component is not installed, you will not be able to use the VM. The interface will display a reminder for you to install within the required cluster.
Select the desired cluster, click Helm Apps in the left navigation menu, then click Helm Charts to view the template list.
Search for the virtnest-agent component and click it to see the details. Select the appropriate version and click the Install button to install it.
On the installation page, fill in the required information, and click OK to finish the installation.
Go back to the Virtual Machines in the navigation menu. If the installation is successful, you will see the VM list, and you can now use the VM.
This article will explain how to create a virtual machine using two methods: image and YAML file.
The virtual machine module, based on KubeVirt, manages virtual machines as cloud-native applications and integrates seamlessly with containers. This allows users to easily deploy virtual machine applications and enjoy an experience similar to that of containerized applications.
Before creating a virtual machine, make sure you meet the following prerequisites:
Expose hardware-assisted virtualization to the user operating system.
Install virtnest-agent on the specified cluster; the operating system kernel version must be 3.15 or higher.
Create a namespace and user.
Prepare the image in advance. The platform comes with three built-in images (as shown below). If you need to create your own image, refer to creating from an image with KubeVirt.
When configuring the network, if you choose to use the Passt network mode, you need to upgrade to Version 0.4.0 or higher.
Follow the steps below to create a virtual machine using an image.
Click Container Management on the left navigation bar, then click Virtual Machines to enter the VM page.
On the virtual machine list page, click Create VMs and select Create with Image.
Fill the basic information, image settings, storage and network, login settings, and click OK at the bottom right corner to complete the creation.
The system will automatically return to the virtual machine list. By clicking the \u2507 button on the right side of the list, you can perform operations such as power on/off, restart, clone, update, create snapshots, console access (VNC), and delete virtual machines. Cloning and snapshot capabilities depend on the selected StorageClass.
In the Create VMs page, enter the information according to the table below and click Next.
Name: Up to 63 characters, can only contain lowercase letters, numbers, and hyphens ( - ), and must start and end with a lowercase letter or number. The name must be unique within the namespace, and cannot be changed once the virtual machine is created.
Alias: Allows any characters, up to 60 characters.
Cluster: Select the cluster to deploy the newly created virtual machine.
Namespace: Select the namespace to deploy the newly created virtual machine. If the desired namespace is not found, you can create a new namespace according to the prompts on the page.
Label/Annotation: Select the desired labels/annotations to add to the virtual machine.
Fill in the image-related information according to the table below, then click Next.
Image Source: Supports three types of sources.
Registry: Images stored in the container registry. You can select images from the registry as needed.
HTTP: Images stored in a file server using the HTTP protocol, supporting both HTTPS:// and HTTP:// prefixes.
Object Storage (S3): Virtual machine images obtained through the object storage protocol (S3). For non-authenticated object storage files, please use the HTTP source.
The following are the built-in images provided by the platform, including the operating system, version, and the image URL. Custom virtual machine images are also supported.
| Operating System | Version | Image Address |
| --- | --- | --- |
| CentOS | CentOS 7.9 | release-ci.daocloud.io/virtnest/system-images/centos-7.9-x86_64:v1 |
| Ubuntu | Ubuntu 22.04 | release-ci.daocloud.io/virtnest/system-images/ubuntu-22.04-x86_64:v1 |
| Debian | Debian 12 | release-ci.daocloud.io/virtnest/system-images/debian-12-x86_64:v1 |
Image Secret: Only the default (Opaque) type of secret is supported. For specific operations, refer to Create Secret.
The built-in images are stored in the bootstrap cluster, and the bootstrap cluster's container registry is not encrypted, so there is no need to select a secret when using a built-in image.
Note
The hot-plug configuration for CPU and memory requires virtnest v0.10.0 or higher, and virtnest-agent v0.7.0 or higher.
Resource Config: For CPU, it is recommended to use whole numbers. If a decimal is entered, it will be rounded up. The hot-plug configuration for CPU and memory is supported.
GPU Configuration: Enabling GPU functionality requires meeting certain prerequisites. For details, refer to Configuring GPU for Virtual Machines (Nvidia). Virtual machines support two types of Nvidia GPUs: Nvidia-GPU and Nvidia-vGPU. After selecting the desired type, you will need to choose the proper GPU model and the number of cards.
"},{"location":"en/admin/virtnest/quickstart/index.html#storage-and-network","title":"Storage and Network","text":"
Storage:
Storage is closely related to virtual machine functionality. By using Kubernetes persistent volumes and storage classes, it provides flexible and scalable storage capabilities for virtual machines. For example, the virtual machine image is stored in a PVC, which supports operations such as cloning and snapshotting together with other data.
System Disk: The system automatically creates a VirtIO type rootfs system disk for storing the operating system and data.
Data Disk: The data disk is a storage device in the virtual machine used to store user data, application data, or other non-operating system related files. Compared with the system disk, the data disk is optional and can be dynamically added or removed as needed. The capacity of the data disk can also be flexibly configured according to demand.
Block storage is used by default. If you need to use the clone and snapshot functions, make sure that your storage pool has created the proper VolumeSnapshotClass, which you can refer to the following example. If you need to use the live migration function, make sure your storage supports and selects the ReadWriteMany access mode.
In most cases, the storage will not automatically create such a VolumeSnapshotClass during the installation process, so you need to manually create a VolumeSnapshotClass. The following is an example of HwameiStor creating a VolumeSnapshotClass:
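A minimal sketch of such a VolumeSnapshotClass is shown below; the name is illustrative, and the driver must match the CSI driver (provisioner) used by your HwameiStor StorageClass:
apiVersion: snapshot.storage.k8s.io/v1\nkind: VolumeSnapshotClass\nmetadata:\n  name: hwameistor-storage-lvm-snapshot  # illustrative name\ndriver: lvm.hwameistor.io  # must match the provisioner of the HwameiStor StorageClass\ndeletionPolicy: Delete\n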
Run the following command to check if the VolumeSnapshotClass was created successfully.
kubectl get VolumeSnapshotClass\n
View the created VolumeSnapshotClass and confirm that its driver matches the provisioner of the StorageClass used by the storage pool.
Network:
Network setting can be combined as needed according to the table information.
| Network Mode | CNI | Install Spiderpool | Network Cards | Fixed IP | Live Migration |
| --- | --- | --- | --- | --- | --- |
| Masquerade (NAT) | Calico | ❌ | Single | ❌ | ✅ |
| Masquerade (NAT) | Cilium | ❌ | Single | ❌ | ✅ |
| Masquerade (NAT) | Flannel | ❌ | Single | ❌ | ✅ |
| Bridge | OVS | ✅ | Multiple | ✅ | ✅ |
Network modes are divided into Masquerade (NAT) and Bridge; the Bridge mode can only be used after the spiderpool component has been installed.
The network mode of Masquerade (NAT) is selected by default, using the default network card eth0.
If the spiderpool component is installed in the cluster, you can choose the Bridge mode, and the Bridge mode supports multiple NICs.
Add Network Card
Passthrough / Bridge mode supports manual addition of network cards. Click Add NIC to configure the network card IP pool. Choose the Multus CR that matches the network mode, if not, you need to create it yourself.
If you turn on the Use Default IP Pool switch, use the default IP pool in the multus CR setting. If the switch is off, manually select the IP pool.
Accessing virtual machines through the terminal provides more flexibility and lightweight access. However, it does not directly display the graphical interface, has limited interactivity, and does not support multiple concurrent terminal sessions.
Click Container Management in the left navigation bar, then click Virtual Machines to access the list page. Click the \u2507 button on the right side of the list to access the virtual machine via the terminal.
Accessing virtual machines through VNC allows you to access and control the full graphical interface of the remote computer. It provides a more interactive experience and allows intuitive operation of the remote device. However, it may have some performance impact, and it does not support multiple concurrent terminal sessions.
Choose VNC for Windows systems.
Click Container Management in the left navigation bar, then click Virtual Machines to access the list page. Click the \u2507 button on the right side of the list to access the virtual machine via Console Access (VNC).
After successfully creating a virtual machine, you can enter the VM Detail page to view Basic Information, Settings, GPU Settings, Overview, Storage, Network, Snapshot and Event List.
Click Container Management in the left navigation bar, then click Clusters to enter the page of the cluster where the virtual machine is located. Click the VM Name to view the virtual machine details.
Operating System: The operating system installed on the virtual machine to execute programs.
Image Address: A link to a virtual hard disk file or operating system installation media, which is used to load and install the operating system in the virtual machine software.
Network Mode: The network mode configured for the virtual machine, including Bridge or Masquerade(NAT).
CPU & Memory: The resources allocated to the virtual machine.
GPU Settings includes: GPU Type, GPU Model and GPU Counts
"},{"location":"en/admin/virtnest/quickstart/detail.html#other-information","title":"Other Information","text":"OverviewStorageNetworkSnapshotsEvent List
The Overview tab shows the virtual machine's observability data from Insight. Note that if insight-agent is not installed, overview information cannot be obtained.
It displays the storage used by the virtual machine, including information about the system disk and data disk.
It displays the network settings of the virtual machine, including Multus CR, NIC Name, IP Address and so on.
If you have created snapshots, this part will display relative information. Restoring the virtual machine from snapshots is supported.
The event list includes various state changes, operation records, and system messages during the lifecycle of the virtual machine.
"},{"location":"en/admin/virtnest/quickstart/nodeport.html","title":"Accessing Virtual Machine via NodePort","text":"
This page explains how to access a virtual machine using NodePort.
"},{"location":"en/admin/virtnest/quickstart/nodeport.html#limitations-of-existing-access-methods","title":"Limitations of Existing Access Methods","text":"
Virtual machines support access via VNC or console, but both methods have a limitation: they do not allow multiple terminals to be simultaneously online.
Using a Service of type NodePort can help solve this problem.
"},{"location":"en/admin/virtnest/quickstart/nodeport.html#create-a-service","title":"Create a Service","text":"
Using the Container Management Page
Select the cluster page where the target virtual machine is located and create a Service.
Select the access type as NodePort.
Choose the namespace (the namespace where the virtual machine resides).
Fill in the label selector as vm.kubevirt.io/name: your-vm-name.
Port Configuration: Choose TCP for the protocol, provide a custom port name, and set the service port and container port to 22.
After successful creation, you can access the virtual machine by using ssh username@nodeip -p port.
"},{"location":"en/admin/virtnest/quickstart/nodeport.html#create-the-service-via-kubectl","title":"Create the Service via kubectl","text":"
On this page, Alias , Label and Annotation can be updated, while other information cannot. After completing the updates, click Next to proceed to the Image Settings page.
On this page, parameters such as Image Address, Operating System, and Version cannot be changed once selected. Users are allowed to update the GPU Quota, including enabling or disabling GPU support, selecting the GPU type, specifying the required model, and configuring the number of GPUs. A restart is required for the changes to take effect. After completing the updates, click Next to proceed to the Storage and Network page.
"},{"location":"en/admin/virtnest/quickstart/update.html#storage-and-network","title":"Storage and Network","text":"
On the Storage and Network page, the StorageClass and PVC Mode of the System Disk cannot be changed once selected. You can increase the disk capacity, but reducing it is not supported. Data disks can be freely added or removed. Network updates are not supported. After completing the updates, click Next to proceed to the Login Settings page.
Note
It is recommended to restart the virtual machine after modifying storage capacity or adding data disks to ensure the configuration takes effect.
On the Login Settings page, Username, Password, and SSH cannot be changed once set. After confirming your login information is correct, click OK to complete the update process.
In addition to updating the virtual machine via forms, you can also quickly update it using a YAML file.
Go to the virtual machine list page and click the Edit YAML button.
"},{"location":"en/admin/virtnest/template/index.html","title":"Create Virtual Machines via Templates","text":"
This guide explains how to create virtual machines using templates.
With internal templates and custom templates, users can easily create new virtual machines. Additionally, we provide the ability to convert existing virtual machines into templates, allowing users to manage and utilize resources more flexibly.
"},{"location":"en/admin/virtnest/template/index.html#create-with-template","title":"Create with Template","text":"
Follow these steps to create a virtual machine using a template.
Click Container Management in the left navigation menu, then click Virtual Machines to access the Virtual Machine Management page. On the virtual machine list page, click Create Virtual Machine and select Create with Template .
On the template creation page, fill in the required information, including Basic Information, Template Config, Storage and Network, and Login Settings. Then, click OK in the bottom-right corner to complete the creation.
The system will automatically return to the virtual machine list. By clicking \u2507 on the right side of the list, you can perform operations such as power off/restart, clone, update, create snapshot, convert to template, console access (VNC), and delete. The ability to clone and create snapshots depends on the selected storage pool.
On the Create VMs page, enter the information according to the table below and click Next .
Name: Can contain up to 63 characters and can only include lowercase letters, numbers, and hyphens ( - ). The name must start and end with a lowercase letter or number. Names must be unique within the same namespace, and the name cannot be changed after the virtual machine is created.
Alias: Can include any characters, up to 60 characters in length.
Cluster: Select the cluster where the new virtual machine will be deployed.
Namespace: Select the namespace where the new virtual machine will be deployed. If the desired namespace is not found, you can follow the instructions on the page to create a new namespace.
The template list will appear, and you can choose either an internal template or a custom template based on your needs.
Select an Internal Template: AI platform Virtual Machine provides several standard templates that cannot be edited or deleted. When selecting an internal template, the image source, operating system, image address, and other information will be based on the template and cannot be modified. GPU quota will also be based on the template but can be modified.
Select a Custom Template: These templates are created from virtual machine configurations and can be edited or deleted. When using a custom template, you can modify the image source and other information based on your specific requirements.
"},{"location":"en/admin/virtnest/template/index.html#storage-and-network","title":"Storage and Network","text":"
Storage: By default, the system creates a rootfs system disk of VirtIO type for storing the operating system and data. Block storage is used by default. If you need to use clone and snapshot functionality, make sure your storage pool (StorageClass) supports the VolumeSnapshots feature and that a proper VolumeSnapshotClass has been created for it. Please note that the storage pool (SC) has additional prerequisites that need to be met.
Prerequisites:
KubeVirt utilizes the VolumeSnapshot feature of the Kubernetes CSI driver to capture the persistent state of virtual machines. Therefore, you need to ensure that your virtual machine uses a StorageClass that supports VolumeSnapshots and is configured with the correct VolumeSnapshotClass.
Check the created SnapshotClass and confirm that the provisioner property matches the Driver property in the storage pool.
Supports adding one system disk and multiple data disks.
Network: If no configuration is made, the system will create a VirtIO type network by default.
This guide explains the usage of internal VM templates and custom VM templates.
Using both internal and custom templates, users can easily create new VMs. Additionally, we provide the ability to convert existing VMs into VM templates, allowing users to manage and utilize resources more flexibly.
Click Container Management in the left navigation menu, then click VM Template to access the VM Template page. If the template is converted from a virtual machine configured with a GPU, the template will also include GPU information and will be displayed in the template list.
Click the \u2507 on the right side of a template in the list. For internal templates, you can create VM and view YAML. For custom templates, you can create VM, edit YAML and delete template.
Custom templates are created from VM configurations. The following steps explain how to convert a VM configuration into a template.
Click Container Management in the left navigation menu, then click Virtual Machines to access the list page. Click the \u2507 on the right side of a VM in the list to convert the configuration into a template. Only running or stopped VMs can be converted.
Provide a name for the new template. A notification will indicate that the original VM will be preserved and remain available. After a successful conversion, a new entry will be added to the template list.
After successfully creating a template, you can click the template name to view the details of the VM, including Basic Information, GPU Settings, Storage, Network, and more. If you need to quickly deploy a new VM based on that template, simply click the Create VM button in the upper right corner of the page for easy operation.
"},{"location":"en/admin/virtnest/vm/auto-migrate.html","title":"Automatic VM Drifting","text":"
This article will explain how to seamlessly migrate running virtual machines to other nodes when a node in the cluster becomes inaccessible due to power outages or network failures, ensuring business continuity and data security.
Compared to automatic drifting, live migration requires you to manually initiate the migration process through the interface, rather than having the system automatically trigger it.
Check the status of the virtual machine launcher pod:
kubectl get pod\n
Check if the launcher pod is in a Terminating state.
Force delete the launcher pod:
If the launcher pod is in a Terminating state, you can force delete it with the following command:
kubectl delete pod <launcher pod> --force\n
Replace <launcher pod> with the name of your launcher pod.
Wait for recreation and check the status:
After deletion, the system will automatically recreate the launcher pod. Wait for its status to become running, then refresh the virtual machine list to see if the VM has successfully migrated to the new node.
If using rook-ceph as storage, it needs to be configured in ReadWriteOnce mode:
After force deleting the pod, you need to wait approximately six minutes for the launcher pod to start, or you can immediately start the pod using the following commands:
kubectl get pv | grep <vm name>\nkubectl get VolumeAttachment | grep <pv name>\n
Replace <vm name> and <pv name> with your virtual machine name and persistent volume name.
Then delete the proper VolumeAttachment with the following command:
kubectl delete VolumeAttachment <va name>\n
Replace <va name> with the name of the VolumeAttachment found in the previous step.
"},{"location":"en/admin/virtnest/vm/clone.html","title":"Cloning a Cloud Host","text":"
This article will introduce how to clone a new cloud host.
Users can clone a new cloud host, which will have the same operating system and system configuration as the original cloud host. This enables quick deployment and scaling, allowing for the rapid creation of new cloud hosts with similar configurations without the need to install from scratch.
Before using the cloning feature, the following prerequisites must be met (which are the same as those for the snapshot feature):
Only cloud hosts that are not in an error state can use the cloning feature.
Install Snapshot CRDs, Snapshot Controller, and CSI Driver. For specific installation steps, refer to CSI Snapshotter.
Wait for the snapshot-controller component to be ready. This component will monitor events related to VolumeSnapshot and VolumeSnapshotContent and trigger related operations.
Wait for the CSI Driver to be ready, ensuring that the csi-snapshotter sidecar is running in the CSI Driver. The csi-snapshotter sidecar will monitor events related to VolumeSnapshotContent and trigger related operations.
If the storage is Rook-Ceph, refer to ceph-csi-snapshot
If the storage is HwameiStor, refer to hwameistor-snapshot
"},{"location":"en/admin/virtnest/vm/clone.html#cloning-a-cloud-host_1","title":"Cloning a Cloud Host","text":"
Click Container Management in the left navigation bar, then click Cloud Hosts to enter the list page. Click the ┇ on the right side of the list to perform the clone operation on a cloud host that is not in an error state.
A popup will appear, requiring you to fill in the name and description for the new cloud host being cloned. The cloning operation may take some time, depending on the size of the cloud host and storage performance.
After a successful clone, you can view the new cloud host in the cloud host list. The newly created cloud host will be in a powered-off state and will need to be manually powered on if required.
It is recommended to take a snapshot of the original cloud host before cloning. If you encounter issues during the cloning process, please check whether the prerequisites are met and try to execute the cloning operation again.
When creating a virtual machine using Object Storage (S3) as the image source, sometimes you need to fill in a secret to get through S3's verification. The following will introduce how to create a secret that meets the requirements of the virtual machine.
Click Container Management in the left navigation bar, then click Clusters , enter the details of the cluster where the virtual machine is located, click ConfigMaps & Secrets , select the Secrets , and click Create Secret .
Enter the creation page, fill in the secret name, select the same namespace as the virtual machine, and note that you must select the default Opaque type. The secret data must follow the principles below (see the example after this list).
accessKeyId: Data represented in Base64 encoding
secretKey: Data represented in Base64 encoding
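For example, a Secret of the following shape would satisfy these requirements (the name and the encoded values are placeholders):
apiVersion: v1\nkind: Secret\nmetadata:\n  name: s3-image-secret  # placeholder name\n  namespace: default     # same namespace as the virtual machine\ntype: Opaque\ndata:\n  accessKeyId: bXlBY2Nlc3NLZXlJZA==  # base64 of \"myAccessKeyId\" (placeholder)\n  secretKey: bXlTZWNyZXRLZXk=  # base64 of \"mySecretKey\" (placeholder)\n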
After successful creation, you can use the required secret when creating a virtual machine.
"},{"location":"en/admin/virtnest/vm/cross-cluster-migrate.html","title":"Migrate VM across Clusters","text":"
This feature currently does not have a UI, so you can follow the steps in the documentation.
A VM needs to be migrated to another cluster when the original cluster experiences a failure or performance degradation that makes the VM inaccessible.
A VM needs to be migrated to another cluster when perform planned maintenance or upgrades on the cluster.
A VM needs to be migrated to another cluster to match more appropriate resource configurations when the performance requirements of specific applications change and resource allocation needs to be adjusted.
Before performing migration of a VM across cluster, the following prerequisites must be met:
Cluster network connectivity: Ensure that the network between the original cluster and the target migration cluster is accessible.
Same storage type: The target migration cluster must support the same storage type as the original cluster. For example, if the exporting cluster uses rook-ceph-block type StorageClass, the importing cluster must also support this type.
Enable VMExport Feature Gate in KubeVirt of the original cluster.
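One way to enable the feature gate is to edit the KubeVirt custom resource and add VMExport to the developer feature gates; the resource name and namespace below are the common defaults and may differ in your installation:
kubectl -n kubevirt edit kubevirt kubevirt\n# Add VMExport under the existing feature gates:\n# spec:\n#   configuration:\n#     developerConfiguration:\n#       featureGates:\n#         - VMExport\n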
"},{"location":"en/admin/virtnest/vm/cross-cluster-migrate.html#configure-ingress-for-the-original-cluster","title":"Configure Ingress for the Original Cluster","text":"
Using Nginx Ingress as an example, configure Ingress to point to the virt-exportproxy Service:
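A sketch of such an Ingress is shown below; the host name is a placeholder, and the namespace must be the one where the virt-exportproxy Service actually runs:
apiVersion: networking.k8s.io/v1\nkind: Ingress\nmetadata:\n  name: virt-exportproxy  # placeholder name\n  namespace: kubevirt     # adjust to the namespace of the virt-exportproxy Service\nspec:\n  ingressClassName: nginx\n  rules:\n    - host: vm-export.example.com  # placeholder host\n      http:\n        paths:\n          - path: /\n            pathType: Prefix\n            backend:\n              service:\n                name: virt-exportproxy\n                port:\n                  number: 443\n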
If cold migration is performed while the VM is powered off :
apiVersion: v1\nkind: Secret\nmetadata:\n name: example-token # Export Token used by the VM\n namespace: default # Namespace where the VM resides\nstringData:\n token: 1234567890ab # Export the used Token (Modifiable)\n\n---\napiVersion: export.kubevirt.io/v1alpha1\nkind: VirtualMachineExport\nmetadata:\n name: example-export # Export name (Modifiable)\n namespace: default # Namespace where the VM resides\nspec:\n tokenSecretRef: example-token # Must match the name of the token created above\n source:\n apiGroup: \"kubevirt.io\"\n kind: VirtualMachine\n name: testvm # VM name\n
If hot migration is performed using a VM snapshot while the VM is powered on :
apiVersion: v1\nkind: Secret\nmetadata:\n name: example-token # Export Token used by VM\n namespace: default # Namespace where the VM resides\nstringData:\n token: 1234567890ab # Export the used Token (Modifiable)\n\n---\napiVersion: export.kubevirt.io/v1alpha1\nkind: VirtualMachineExport\nmetadata:\n name: export-snapshot # Export name (Modifiable)\n namespace: default # Namespace where the VM resides\nspec:\n tokenSecretRef: example-token # Must match the name of the token created above\n source:\n apiGroup: \"snapshot.kubevirt.io\"\n kind: VirtualMachineSnapshot\n name: export-snap-202407191524 # Name of the proper VM snapshot\n
Check if the VirtualMachineExport is ready:
# Replace example-export with the name of the created VirtualMachineExport\nkubectl get VirtualMachineExport example-export -n default\n\nNAME SOURCEKIND SOURCENAME PHASE\nexample-export VirtualMachine testvm Ready\n
Once the VirtualMachineExport is ready, export the VM YAML.
If virtctl is installed, you can use the following command to export the VM YAML:
# Replace example-export with the name of the created VirtualMachineExport\n# Specify the namespace with -n\nvirtctl vmexport download example-export --manifest --include-secret --output=manifest.yaml\n
If virtctl is not installed, you can use the following commands to export the VM YAML:
# Replace example-export with the name and namespace of the created VirtualMachineExport\nmanifesturl=$(kubectl get VirtualMachineExport example-export -n default -o=jsonpath='{.status.links.internal.manifests[0].url}')\nsecreturl=$(kubectl get VirtualMachineExport example-export -n default -o=jsonpath='{.status.links.internal.manifests[1].url}')\n# Replace with the secret name and namespace\ntoken=$(kubectl get secret example-token -n default -o=jsonpath='{.data.token}' | base64 -d)\n\ncurl -H \"Accept: application/yaml\" -H \"x-kubevirt-export-token: $token\" --insecure $secreturl > manifest.yaml\ncurl -H \"Accept: application/yaml\" -H \"x-kubevirt-export-token: $token\" --insecure $manifesturl >> manifest.yaml\n
Import VM.
Copy the exported manifest.yaml to the target migration cluster and run the following command (if the namespace does not exist, it needs to be created in advance):
kubectl apply -f manifest.yaml\n
After the VM is successfully created, you need to restart it. Once the new VM is running successfully, delete the original VM in the original cluster (do not delete the original VM if the new one has not started successfully).
When configuring the liveness and readiness probes for a cloud host, the process is similar to that of Kubernetes configuration. This article will introduce how to configure health check parameters for a cloud host using YAML.
However, it is important to note that the configuration must be done when the cloud host has been successfully created and is in a powered-off state.
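For example, a TCP readiness probe can be added under the VM's spec.template.spec while the cloud host is powered off; the port and timings below are illustrative:
spec:\n  template:\n    spec:\n      readinessProbe:\n        tcpSocket:\n          port: 22\n        initialDelaySeconds: 120\n        periodSeconds: 20\n        timeoutSeconds: 10\n        failureThreshold: 3\n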
The configuration of userData may vary depending on the operating system (such as Ubuntu/Debian or CentOS). The main differences are:
Package manager:
Ubuntu/Debian uses apt-get as the package manager. CentOS uses yum as the package manager.
SSH service restart command:
Ubuntu/Debian uses systemctl restart ssh.service. CentOS uses systemctl restart sshd.service (note that for CentOS 7 and earlier versions, it uses service sshd restart).
Installed packages:
Ubuntu/Debian installs ncat. CentOS installs nmap-ncat (because ncat may not be available in the default repository for CentOS).
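A sketch of the two userData variants (cloud-init cloud-config; package names and commands follow the differences listed above):
# Ubuntu/Debian variant\nuserData: |\n  #cloud-config\n  packages:\n    - ncat\n  runcmd:\n    - systemctl restart ssh.service\n\n# CentOS variant\nuserData: |\n  #cloud-config\n  packages:\n    - nmap-ncat\n  runcmd:\n    - systemctl restart sshd.service\n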
This article will explain how to migrate a virtual machine from one node to another.
When a node needs maintenance or upgrades, users can seamlessly migrate running virtual machines to other nodes while ensuring business continuity and data security.
Click Container Management on the left navigation bar, then click Virtual Machines to enter the list page. Click \u2507 on the right side of the list to migrate running virtual machines. Currently, the virtual machine is on the node controller-node-1 .
A pop-up box will appear, indicating that during live migration, the running virtual machine instances will be migrated to another node, but the target node cannot be predetermined. Please ensure that other nodes have sufficient resources.
After a successful migration, you can view the node information in the virtual machine list. At this time, the node has been migrated to controller-node-2 .
"},{"location":"en/admin/virtnest/vm/migratiom.html","title":"Cold Migration within the Cluster","text":"
This article will introduce how to move a cloud host from one node to another within the same cluster while it is powered off.
The main feature of cold migration is that the cloud host will be offline during the migration process, which may impact business continuity. Therefore, careful planning of the migration time window is necessary, taking into account business needs and system availability. Typically, cold migration is suitable for scenarios where downtime requirements are not very strict.
Click Container Management in the left navigation bar, then click Cloud Hosts to enter the list page. Click the ┇ on the right side of the list to initiate the migration action for a cloud host that is in a powered-off state. The current node of the cloud host cannot be viewed while it is powered off, so prior planning or checking while it is powered on is required.
Note
If you have used local-path in the storage pool of the original node, there may be issues during cross-node migration. Please choose carefully.
After clicking migrate, a prompt will appear allowing you to choose to migrate to a specific node or randomly. If you need to change the storage pool, ensure that there is an available storage pool in the target node. Also, ensure that the target node has sufficient resources. The migration process may take a significant amount of time, so please be patient.
The migration will take some time, so please be patient. After it completes, power on the cloud host to verify that the migration was successful. In this example, the cloud host has already been powered on to check the result.
The virtual machine's monitoring is based on the Grafana Dashboard open-sourced by Kubevirt, which generates monitoring dashboards for each virtual machine.
Monitoring information of the virtual machine provides better insight into its resource consumption, such as CPU, memory, storage, and network usage. This information helps optimize and plan resources, improving overall resource utilization efficiency.
Navigate to the VM Detail page and click Overview to view the monitoring content of the virtual machine. Please note that without the insight-agent component installed, monitoring information cannot be obtained. Below are the detailed information:
Total CPU, CPU Usage, Memory Total, Memory Usage.
CPU Utilization: the percentage of CPU resources currently used by the virtual machine.
Memory Utilization: the percentage of memory resources currently used by the virtual machine out of the total available memory.
Network Traffic by Virtual Machines: the amount of network data sent and received by the virtual machine during a specific time period.
Network Packet Loss Rate: the proportion of lost data packets out of the total packets sent during data transmission.
Network Packet Error Rate: the rate of errors that occur during network transmission.
Storage Traffic: the speed and volume at which the virtual machine system reads from and writes to the disk within a certain time period.
Storage IOPS: the number of input/output operations the virtual machine system performs per second.
Storage Delay: the latency experienced by the virtual machine system when performing disk read and write operations.
This article introduces how to create snapshots for VMs on a schedule.
You can create scheduled snapshots for VMs, providing continuous protection for data and ensuring effective data recovery in case of data loss, corruption, or deletion.
In the left navigation bar, click Container Management -> Clusters to select the proper cluster where the target VM is located. After entering the cluster, click Workloads -> CronJobs, and choose Create from YAML to create a scheduled task. Refer to the following YAML example to create snapshots for the specified VM on a schedule.
Click to view the YAML example for creating a scheduled task
apiVersion: batch/v1\nkind: CronJob\nmetadata:\n name: xxxxx-xxxxx-cronjob # Scheduled task name (Customizable)\n namespace: virtnest-system # Do not modify the namespace\nspec:\n schedule: \"5 * * * *\" # Modify the scheduled task execution interval as needed\n concurrencyPolicy: Allow\n suspend: false\n successfulJobsHistoryLimit: 10\n failedJobsHistoryLimit: 3\n startingDeadlineSeconds: 60\n jobTemplate:\n spec:\n template:\n metadata:\n labels:\n virtnest.io/vm: xxxx # Modify to the name of the VM that needs to be snapshotted\n virtnest.io/namespace: xxxx # Modify to the namespace where the VM is located\n spec:\n serviceAccountName: kubevirt-operator\n containers:\n - name: snapshot-job\n image: release.daocloud.io/virtnest/tools:v0.1.5 # For offline environments, modify the registry address to the proper registry address of the cluster\n imagePullPolicy: IfNotPresent\n env:\n - name: NS\n valueFrom:\n fieldRef:\n fieldPath: metadata.labels['virtnest.io/namespace']\n - name: VM\n valueFrom:\n fieldRef:\n fieldPath: metadata.labels['virtnest.io/vm']\n command:\n - /bin/sh\n - -c\n - |\n export SUFFIX=$(date +\"%Y%m%d-%H%M%S\")\n cat <<EOF | kubectl apply -f -\n apiVersion: snapshot.kubevirt.io/v1alpha1\n kind: VirtualMachineSnapshot\n metadata:\n name: $(VM)-snapshot-$SUFFIX\n namespace: $(NS)\n spec:\n source:\n apiGroup: kubevirt.io\n kind: VirtualMachine\n name: $(VM)\n EOF\n restartPolicy: OnFailure\n
After creating the scheduled task and running it successfully, you can click Virtual Machines in the list page to select the target VM. After entering the details, you can view the snapshot list.
This guide explains how to create snapshots for virtual machines and restore them.
You can create snapshots for virtual machines to save the current state of the virtual machine. A snapshot can be restored multiple times, and each time the virtual machine will be reverted to the state when the snapshot was created. Snapshots are commonly used for backup, recovery and rollback.
Before using the snapshots, the following prerequisites need to be met:
Only virtual machines in a non-error state can use the snapshot function.
Install Snapshot CRDs, Snapshot Controller, and CSI Driver. For detailed installation steps, refer to CSI Snapshotter.
Wait for the snapshot-controller component to be ready. This component monitors events related to VolumeSnapshot and VolumeSnapshotContent and triggers specific actions.
Wait for the CSI Driver to be ready. Ensure that the csi-snapshotter sidecar is running within the CSI Driver. The csi-snapshotter sidecar monitors events related to VolumeSnapshotContent and triggers specific actions.
If the storage is rook-ceph, refer to ceph-csi-snapshot.
If the storage is HwameiStor, refer to hwameistor-snapshot.
"},{"location":"en/admin/virtnest/vm/snapshot.html#create-a-snapshot","title":"Create a Snapshot","text":"
Click Container Management in the left navigation menu, then click Virtual Machines to access the list page. Click the \u2507 on the right side of the list for a virtual machine to perform snapshot operations (only available for non-error state virtual machines).
A dialog box will pop up, prompting you to input a name and description for the snapshot. Please note that the creation process may take a few minutes, during which you won't be able to perform any operations on the virtual machine.
After successfully creating the snapshot, you can view its details within the virtual machine's information section. Here, you have the option to edit the description, recover from the snapshot, delete it, among other operations.
"},{"location":"en/admin/virtnest/vm/snapshot.html#restore-from-a-snapshot","title":"Restore from a Snapshot","text":"
Click Restore from Snapshot and provide a name for the virtual machine recovery record. The recovery operation may take some time to complete, depending on the size of the snapshot and other factors. After a successful recovery, the virtual machine will be restored to the state when the snapshot was created.
After some time, you can scroll down to the snapshot information to view all recovery records for the current snapshot, and locate where each recovery was made.
This article will introduce how to configure network information when creating virtual machines.
In virtual machines, network management is a crucial part that allows us to manage and configure network connections for virtual machines in a Kubernetes environment. It can be configured according to different needs and scenarios, achieving a more flexible and diverse network architecture.
Single NIC Scenario: For simple applications that only require basic network connectivity or when there are resource constraints, using a single NIC can save network resources and prevent waste of resources.
Multiple NIC Scenario: When security isolation between different network environments needs to be achieved, multiple NICs can be used to divide different network areas. It also allows for control and management of traffic.
When selecting the Bridge network mode, some information needs to be configured in advance:
Install and run Open vSwitch on the host nodes. See Ovs-cni Quick Start.
Configure Open vSwitch bridge on the host nodes. See vswitch for instructions.
Install Spiderpool. See installing spiderpool for instructions. By default, Spiderpool will install both Multus CNI and Ovs CNI.
Create a Multus CR of type ovs. You can create a custom Multus CR or use YAML to create it; see the sketch after this list.
Create a subnet and IP pool. See creating subnets and IP pools .
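A minimal sketch of an ovs-type Multus CR (NetworkAttachmentDefinition); the name and bridge name are assumptions, and the ipam section assumes Spiderpool is used for address management:
apiVersion: k8s.cni.cncf.io/v1\nkind: NetworkAttachmentDefinition\nmetadata:\n  name: ovs-vlan0  # placeholder name\n  namespace: kube-system\nspec:\n  config: |\n    {\n      \"cniVersion\": \"0.3.1\",\n      \"type\": \"ovs\",\n      \"bridge\": \"br1\",\n      \"ipam\": {\n        \"type\": \"spiderpool\"\n      }\n    }\n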
Network configuration can be combined according to the table information.
| Network Mode | CNI | Spiderpool Installed | NIC Mode | Fixed IP | Live Migration |
| --- | --- | --- | --- | --- | --- |
| Masquerade (NAT) | Calico | ❌ | Single NIC | ❌ | ✅ |
| Masquerade (NAT) | Cilium | ❌ | Single NIC | ❌ | ✅ |
| Masquerade (NAT) | Flannel | ❌ | Single NIC | ❌ | ✅ |
| Bridge | OVS | ✅ | Multiple NIC | ✅ | ✅ |
Network Mode: There are two modes - Masquerade (NAT) and Bridge. Bridge mode requires the installation of the spiderpool component.
The default selection is Masquerade (NAT) network mode using the eth0 default NIC.
If the cluster has the spiderpool component installed, then Bridge mode can be selected. The Bridge mode supports multiple NICs.
Ensure all prerequisites are met before selecting the Bridge mode.
Adding NICs
Bridge modes support manually adding NICs. Click Add NIC to configure the NIC IP pool. Choose a Multus CR that matches the network mode, if not available, it needs to be created manually.
If the Use Default IP Pool switch is turned on, it will use the default IP pool in the multus CR configuration. If turned off, manually select the IP pool.
"},{"location":"en/admin/virtnest/vm/vm-network.html#network-configuration","title":"Network Configuration","text":""},{"location":"en/admin/virtnest/vm/vm-sc.html","title":"Storage for Virtual Machine","text":"
This article will introduce how to configure storage when creating a virtual machine.
Storage and virtual machine functionality are closely related, mainly providing flexible and scalable storage capabilities through Kubernetes persistent volumes and storage classes. For example, the virtual machine image is stored in a PVC, which supports operations such as cloning and snapshotting together with other data.
"},{"location":"en/admin/virtnest/vm/vm-sc.html#deploying-different-storage","title":"Deploying Different Storage","text":"
Before using virtual machine storage functionality, different storage needs to be deployed according to requirements:
Refer to Deploying hwameistor, or install hwameistor-operator in the Helm template of the container management module.
Refer to Deploying rook-ceph
Deploy local-path, using the command kubectl apply -f to create the required YAML:
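One common way to obtain and apply that manifest is to use the upstream local-path-provisioner deployment directly (check the repository for the current version before applying):
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml\n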
System Disk: By default, a VirtIO type rootfs system disk is created for the system to store the operating system and data.
Data Disk: The data disk is a storage device in the virtual machine used to store user data, application data, or other files unrelated to the operating system. Compared to the system disk, the data disk is optional and can be dynamically added or removed as needed. The capacity of the data disk can also be flexibly configured according to requirements.
Block storage is used by default. If you need to use the cloning and snapshot functions, make sure that your storage pool has created the proper VolumeSnapshotClass, as shown in the example below. If you need to use live migration, make sure your storage supports and has selected the ReadWriteMany access mode.
In most cases, such VolumeSnapshotClass is not automatically created during the installation process, so you need to manually create VolumeSnapshotClass. Here is an example of creating a VolumeSnapshotClass in HwameiStor:
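A minimal sketch (the name is illustrative; the driver must match the provisioner of your HwameiStor StorageClass):
apiVersion: snapshot.storage.k8s.io/v1\nkind: VolumeSnapshotClass\nmetadata:\n  name: hwameistor-storage-lvm-snapshot  # illustrative name\ndriver: lvm.hwameistor.io  # must match the provisioner of the HwameiStor StorageClass\ndeletionPolicy: Delete\n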
This document will explain how to build the required virtual machine images.
A virtual machine image is essentially a replica file, which is a disk partition with an installed operating system. Common image file formats include raw, qcow2, vmdk, etc.
"},{"location":"en/admin/virtnest/vm-image/index.html#build-an-image","title":"Build an Image","text":"
Below are some detailed steps for building virtual machine images:
Download System Images
Before building virtual machine images, you need to download the required system images. We recommend using images in qcow2, raw, or vmdk formats. You can visit the following links to get CentOS and Fedora images:
CentOS Cloud Images: Obtain CentOS images from the official CentOS project or other sources. Make sure to choose a version compatible with your virtualization platform.
Fedora Cloud Images: Get images from the official Fedora project. Choose the appropriate version based on your requirements.
Build a Docker Image and Push it to a Container Registry
In this step, we will use Docker to build an image and push it to a container registry for easy deployment and usage when needed.
Create a Dockerfile
FROM scratch\nADD --chown=107:107 CentOS-7-x86_64-GenericCloud.qcow2 /disk/\n
The Dockerfile above adds a file named CentOS-7-x86_64-GenericCloud.qcow2 to the image being built from a scratch base image and places it in the /disk/ directory within the image. This operation includes the file in the image, allowing it to provide a CentOS 7 x86_64 operating system environment when used to create a virtual machine.
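For example, from the directory containing the Dockerfile and the qcow2 file, the image can be built with:
docker build -t release-ci.daocloud.io/ghippo/kubevirt-demo/centos7:v1 .\n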
The above command builds an image named release-ci.daocloud.io/ghippo/kubevirt-demo/centos7:v1 using the instructions in the Dockerfile. You can modify the image name according to your project requirements.
Push the Image to the Container Registry
Use the following command to push the built image to the release-ci.daocloud.io container registry. You can modify the repository name and address as needed.
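For example:
docker push release-ci.daocloud.io/ghippo/kubevirt-demo/centos7:v1\n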
These are the detailed steps and instructions for building virtual machine images. By following these steps, you will be able to successfully build and push images for virtual machines to meet your usage needs.
Naming convention: composed of lowercase \"spec.name\" and \"-custom\"
If used for modifying the category
This field must be true
Define the Chinese and English names of the category
The higher the number, the higher its position in the sorting order
After writing the YAML file, you can see the newly added or modified navigation bar categories by executing the following command and refreshing the page:
kubectl apply -f xxx.yaml\n
"},{"location":"en/admin/ghippo/best-practice/navigator.html#navigation-bar-menus","title":"Navigation Bar Menus","text":"
To add or reorder navigation bar menus, you can achieve it by adding a navigator YAML.
Note
If you need to edit an existing navigation bar menu (not a custom menu added by the user), you need to set the \"gproduct\" field of the new custom menu the same as the \"gproduct\" field of the menu to be overridden. The new navigation bar menu will overwrite the parts with the same \"name\" in the \"menus\" section, and perform an addition operation for the parts with different \"name\".
Naming convention: composed of lowercase \"spec.gproduct\" and \"-custom\"
Define the Chinese and English names of the menu
Either \"category\" or \"parentGProduct\" can be used to distinguish between first-level and second-level menus, and it should match the \"spec.name\" field of NavigatorCategory to complete the matching
Second-level menus
The lower the number, the higher its position in the sorting order
Define the identifier of the menu, used for linkage with the parentGProduct field to establish the parent-child relationship.
Set whether the menu is visible, default is true
This field must be true
The higher the number, the higher its position in the sorting order
Naming convention: composed of lowercase \"spec.gproduct\" and \"-custom\"
Define the Chinese and English names of the menu
Either \"category\" or \"parentGProduct\" can be used to distinguish between first-level and second-level menus. If this field is added, it will ignore the \"menus\" field and insert this menu as a second-level menu under the first-level menu with the \"gproduct\" of \"ghippo\"
Define the identifier of the menu, used for linkage with the parentGProduct field to establish the parent-child relationship.
Set whether the menu is visible, default is true
This field must be true
The higher the number, the higher its position in the sorting order
Insight is a multicluster observability product in AI platform. To enable unified collection of observability data across multiple clusters, users need to install the Helm App insight-agent (installed in the insight-system namespace by default). See How to install insight-agent.
In Insight -> Data Collection section, you can view the status of insight-agent installed in each cluster.
not installed : insight-agent is not installed under the insight-system namespace in this cluster
Running : insight-agent is successfully installed in the cluster, and all deployed components are running
Exception : If insight-agent is in this state, it means that the helm deployment failed or the deployed components are not running
This can be checked as follows:
Run the following command. If the status is deployed, proceed to the next step. If the status is failed, it will affect the application upgrade, so it is recommended to uninstall and reinstall it via Container Management -> Helm Apps:
helm list -n insight-system\n
Run the following command, or check the status of the components deployed in the cluster in Insight -> Data Collection. If any pod is not in the Running state, please restart the abnormal pod.
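The insight-agent components are installed in the insight-system namespace by default, so the check is typically:
kubectl get pods -n insight-system\n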
The resource consumption of the metric collection component Prometheus in insight-agent is directly proportional to the number of pods running in the cluster. Adjust Prometheus resources according to the cluster size, please refer to Prometheus Resource Planning.
The storage capacity of the metric storage component vmstorage in the global service cluster is directly proportional to the total number of pods across all clusters.
Please contact the platform administrator to adjust the disk capacity of vmstorage according to the cluster size, see vmstorage disk capacity planning.
Adjust the vmstorage disk according to the multicluster size, see vmstorage disk expansion.
"},{"location":"en/admin/insight/quickstart/jvm-monitor/jmx-exporter.html","title":"Use JMX Exporter to expose JVM monitoring metrics","text":"
JMX-Exporter provides two usages:
Start a standalone process. Specify parameters when the JVM starts, expose the RMI interface of JMX, JMX Exporter calls RMI to obtain the JVM runtime status data, Convert to Prometheus metrics format, and expose ports for Prometheus to collect.
Start the JVM in-process. Specify parameters when the JVM starts, and run the jar package of JMX-Exporter in the form of javaagent. Read the JVM runtime status data in the process, convert it into Prometheus metrics format, and expose the port for Prometheus to collect.
Note
The first method is not officially recommended: the configuration is complicated, and it requires a separate process whose monitoring in turn becomes a new problem. This page therefore focuses on the second usage and how to use JMX Exporter to expose JVM monitoring metrics in a Kubernetes environment.
The second usage is used here: the JMX Exporter jar file and configuration file need to be specified when starting the JVM. The jar package is a binary file, so it is not easy to mount it through a ConfigMap, and the configuration file hardly needs to be modified, so the suggestion is to package both the JMX Exporter jar and its configuration file directly into the business container image.
For the second usage, you can either build the JMX Exporter jar file into the business application image or mount it during deployment. Here is an introduction to the two methods:
"},{"location":"en/admin/insight/quickstart/jvm-monitor/jmx-exporter.html#method-1-build-the-jmx-exporter-jar-file-into-the-business-image","title":"Method 1: Build the JMX Exporter JAR file into the business image","text":"
The content of prometheus-jmx-config.yaml is as follows:
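A minimal sketch of such a configuration, which simply exposes all MBean metrics (real deployments usually add more specific rewrite rules), might be:
lowercaseOutputName: false\nlowercaseOutputLabelNames: false\nrules:\n  - pattern: \".*\"\n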
For more configuration options, please refer to the introduction at the bottom or the Prometheus official documentation.
Then prepare the jar package file, you can find the latest jar package download address on the Github page of jmx_exporter and refer to the following Dockerfile:
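A sketch of such a Dockerfile follows; the base image, application jar name, and metrics port are assumptions, and the agent version matches the one referenced later on this page:
FROM openjdk:11-jre\nWORKDIR /app/\n# Copy the JMX Exporter config file created above into the image\nCOPY prometheus-jmx-config.yaml ./\n# Download the jmx_prometheus_javaagent jar\nADD https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar ./\n# my-app.jar is a placeholder for the business application jar\nCOPY my-app.jar ./\n# Expose JVM metrics on port 8088 through the javaagent\nENV JAVA_TOOL_OPTIONS=\"-javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml\"\nENTRYPOINT [\"java\", \"-jar\", \"/app/my-app.jar\"]\n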
Port 8088 is used here to expose the monitoring metrics of the JVM; if it conflicts with the Java application, you can change it.
"},{"location":"en/admin/insight/quickstart/jvm-monitor/jmx-exporter.html#method-2-mount-via-init-container-container","title":"Method 2: mount via init container container","text":"
We need to make the JMX exporter into a Docker image first, the following Dockerfile is for reference only:
FROM alpine/curl:3.14\nWORKDIR /app/\n# Copy the previously created config file to the mirror\nCOPY prometheus-jmx-config.yaml ./\n# Download jmx prometheus javaagent jar online\nRUN set -ex; \\\n curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar;\n
Build the image according to the above Dockerfile: docker build -t my-jmx-exporter .
Add the following init container to the Java application's deployment YAML:
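A sketch of such a deployment fragment, assuming the my-jmx-exporter image built above and a shared emptyDir volume used to hand the JAR and config file over to the business container (names and versions are examples):
spec:\n  template:\n    spec:\n      initContainers:\n        - name: jmx-exporter\n          image: my-jmx-exporter\n          # Copy the JMX Exporter JAR and config file into the shared volume\n          command: [\"sh\", \"-c\", \"cp /app/* /jmx-exporter/\"]\n          volumeMounts:\n            - name: jmx-exporter\n              mountPath: /jmx-exporter\n      containers:\n        - name: my-demo-app\n          image: my-demo-app:latest\n          env:\n            # Attach the javaagent and expose metrics on port 8088\n            - name: JAVA_TOOL_OPTIONS\n              value: \"-javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.17.2.jar=8088:/jmx-exporter/prometheus-jmx-config.yaml\"\n          ports:\n            - containerPort: 8088\n          volumeMounts:\n            - name: jmx-exporter\n              mountPath: /jmx-exporter\n      volumes:\n        - name: jmx-exporter\n          emptyDir: {}\n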
After the above modification, the sample application my-demo-app is able to expose JVM metrics. After the service is running, you can access the Prometheus-format metrics exposed by the service at http://localhost:8088.
Then, you can refer to Java Application Docking Observability with JVM Metrics.
This document mainly describes how to monitor the JVM of your Java applications. It explains how Java applications that already expose JVM metrics, and those that do not yet, can interface with Insight.
If your Java application does not yet expose JVM metrics, you can refer to the following documents:
Expose JVM monitoring metrics with JMX Exporter
Expose JVM monitoring metrics using OpenTelemetry Java Agent
If your Java application has exposed JVM metrics, you can refer to the following documents:
Java application docking observability with existing JVM metrics
"},{"location":"en/admin/insight/quickstart/jvm-monitor/legacy-jvm.html","title":"Java Application with JVM Metrics to Dock Insight","text":"
If your Java application exposes JVM monitoring metrics through other means (such as Spring Boot Actuator), you need to make the monitoring data collectable. You can let Insight collect the existing JVM metrics by adding Kubernetes annotations to the workload:
annotations:\n insight.opentelemetry.io/metric-scrape: \"true\" # whether to collect\n insight.opentelemetry.io/metric-path: \"/\" # path to collect metrics\n insight.opentelemetry.io/metric-port: \"9464\" # port for collecting metrics\n
YAML example of adding annotations to the my-deployment-app workload:
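A sketch of such a Deployment fragment, assuming the Actuator endpoint is exposed on port 8080 under /actuator/prometheus as described below:
apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: my-deployment-app\nspec:\n  template:\n    metadata:\n      annotations:\n        insight.opentelemetry.io/metric-scrape: \"true\"   # whether to collect\n        insight.opentelemetry.io/metric-path: \"/actuator/prometheus\"   # path to collect metrics\n        insight.opentelemetry.io/metric-port: \"8080\"   # port for collecting metrics\n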
In the above example, Insight will use :8080/actuator/prometheus to get the Prometheus metrics exposed through Spring Boot Actuator.
"},{"location":"en/admin/insight/quickstart/jvm-monitor/otel-java-agent.html","title":"Use OpenTelemetry Java Agent to expose JVM monitoring metrics","text":"
In OpenTelemetry Agent v1.20.0 and above, the OpenTelemetry Agent includes the JMX Metric Insight module. If your application has already integrated the OpenTelemetry Agent to collect application traces, you no longer need to introduce another agent to expose JMX metrics. The OpenTelemetry Agent collects and exposes metrics by instrumenting the MBeans locally available in the application.
The OpenTelemetry Agent also has built-in monitoring samples for common Java servers and frameworks; please refer to Predefined Metrics.
Using the OpenTelemetry Java Agent still requires considering how to mount the JAR into the container. In addition to mounting the JAR file as described for the JMX Exporter above, you can also use the Operator capabilities provided by OpenTelemetry to automatically enable JVM metric exposure for your applications:
However, in the current version, you still need to manually add the proper annotations to the workload before the JVM data will be collected by Insight.
"},{"location":"en/admin/insight/quickstart/jvm-monitor/otel-java-agent.html#expose-metrics-for-java-middleware","title":"Expose metrics for Java middleware","text":"
The OpenTelemetry Agent also has built-in monitoring samples for some common middleware; please refer to Predefined Metrics.
By default, no target system is specified; it needs to be set through the -Dotel.jmx.target.system JVM option, for example -Dotel.jmx.target.system=jetty,kafka-broker .
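For example, assuming the OpenTelemetry Java Agent JAR has been mounted at /otel/opentelemetry-javaagent.jar (the path is a placeholder), the option can be passed to the workload through an environment variable:
env:\n  - name: JAVA_TOOL_OPTIONS\n    value: \"-javaagent:/otel/opentelemetry-javaagent.jar -Dotel.jmx.target.system=jetty,kafka-broker\"\n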
Gaining JMX Metric Insights with the OpenTelemetry Java Agent
Otel jmx metrics
"},{"location":"en/admin/insight/quickstart/otel/golang-ebpf.html","title":"Enhance Go apps with OTel auto-instrumentation","text":"
If you don't want to manually change the application code, you can try the eBPF-based automatic enhancement method described on this page. This feature is currently under review for donation to the OpenTelemetry community and does not yet support Operator injection through annotations (this will be supported in the future), so you need to manually change the Deployment YAML or use a patch.
Install it under the insight-system namespace; skip this step if it has already been installed.
Note: This CR currently only supports the injection of environment variables (including service name and trace address) required to connect to Insight, and will support the injection of Golang probes in the future.
"},{"location":"en/admin/insight/quickstart/otel/golang-ebpf.html#change-the-application-deployment-file","title":"Change the application deployment file","text":"
Add environment variable annotations
There is only one such annotation, which is used to add OpenTelemetry-related environment variables, such as the trace reporting address, the cluster ID where the container is located, and the namespace:
The value is divided into two parts by / : the first part, insight-system, is the namespace of the CR installed in the second step, and the second part, insight-opentelemetry-autoinstrumentation, is the name of the CR.
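A sketch of such an annotation on the workload's Pod template, assuming the standard OpenTelemetry Operator inject-sdk annotation is used:
spec:\n  template:\n    metadata:\n      annotations:\n        instrumentation.opentelemetry.io/inject-sdk: \"insight-system/insight-opentelemetry-autoinstrumentation\"\n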
Getting Started with Go OpenTelemetry Automatic Instrumentation
Donating ebpf based instrumentation
"},{"location":"en/admin/insight/quickstart/other/install-agentindce.html","title":"Install insight-agent in Suanova 4.0","text":"
In the AI platform, an existing Suanova 4.0 cluster can be accessed as a subcluster. This guide covers potential issues and their solutions when installing insight-agent in a Suanova 4.0 cluster.
Most Suanova 4.0 clusters already have dx-insight installed as the monitoring system, so installing insight-agent will conflict with the existing Prometheus Operator in the cluster and the installation cannot proceed smoothly.
Adjust the Prometheus Operator parameters so that the Prometheus Operator in dx-insight is retained and is compatible with the Prometheus Operator in insight-agent in 5.0.
Set the --deny-namespaces parameter on each of the two Prometheus Operators.
Run the following command (for reference only; replace the Prometheus Operator name and namespace with the actual values).
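A sketch of the commands, assuming each Prometheus Operator runs as a Deployment (the deployment names below are placeholders):
# Edit the Prometheus Operator in dx-insight and add --deny-namespaces=insight-system to the container args\nkubectl -n dx-insight edit deploy <dx-insight-prometheus-operator>\n# Edit the Prometheus Operator in insight-agent and add --deny-namespaces=dx-insight to the container args\nkubectl -n insight-system edit deploy <insight-agent-prometheus-operator>\n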
The dx-insight component is deployed in the dx-insight namespace, and insight-agent is deployed in the insight-system namespace. Add --deny-namespaces=insight-system to the Prometheus Operator in dx-insight, and add --deny-namespaces=dx-insight to the Prometheus Operator in insight-agent.
Only the denied namespaces are excluded; both Prometheus Operators can continue to watch other namespaces, and the collection resources under kube-system or business namespaces are not affected.
Pay attention to the node-exporter port conflict issue.
The open-source node-exporter enables hostNetwork by default and uses port 9100 by default. If the cluster's monitoring system has already installed node-exporter, installing insight-agent will cause a node-exporter port conflict and it will not run normally.
Note
Insight's node-exporter enables some extra features to collect special metrics, so installing it is recommended.
Currently, modifying the port in the installation command is not supported. After helm install insight-agent, you need to manually modify the related ports of the Insight node-exporter DaemonSet and Service.
The Docker storage directory in Suanova 4.0 is /var/lib/containers, which differs from the path configured in insight-agent, so logs are not collected.
"},{"location":"en/admin/insight/trace/topology-helper.html","title":"Service Topology Element Explanations","text":"
The service topology provided by Observability allows you to quickly identify the request relationships between services and determine the health status of services based on different colors. The health status is determined based on the request latency and error rate of the service's overall traffic. This article explains the elements in the service topology.
"},{"location":"en/admin/insight/trace/topology-helper.html#node-status-explanation","title":"Node Status Explanation","text":"
The node health status is determined based on the error rate and request latency of the service's overall traffic, following these rules:
| Color | Status | Rules |
| ----- | ------ | ----- |
| Gray | Healthy | Error rate equals 0% and request latency is less than 100ms |
| Orange | Warning | Error rate (0, 5%) or request latency (100ms, 200ms) |
| Red | Abnormal | Error rate (5%, 100%) or request latency (200ms, +Infinity) |
"},{"location":"en/admin/insight/trace/topology-helper.html#connection-status-explanation","title":"Connection Status Explanation","text":"
| Color | Status | Rules |
| ----- | ------ | ----- |
| Green | Healthy | Error rate equals 0% and request latency is less than 100ms |
| Orange | Warning | Error rate (0, 5%) or request latency (100ms, 200ms) |
| Red | Abnormal | Error rate (5%, 100%) or request latency (200ms, +Infinity) |
"},{"location":"en/admin/kpanda/gpu/gpu-metrics.html","title":"GPU Metrics","text":"
This page lists some commonly used GPU metrics.
"},{"location":"en/admin/kpanda/gpu/gpu-metrics.html#cluster-level","title":"Cluster Level","text":"Metric Name Description Number of GPUs Total number of GPUs in the cluster Average GPU Utilization Average compute utilization of all GPUs in the cluster Average GPU Memory Utilization Average memory utilization of all GPUs in the cluster GPU Power Power consumption of all GPUs in the cluster GPU Temperature Temperature of all GPUs in the cluster GPU Utilization Details 24-hour usage details of all GPUs in the cluster (includes max, avg, current) GPU Memory Usage Details 24-hour memory usage details of all GPUs in the cluster (includes min, max, avg, current) GPU Memory Bandwidth Utilization For example, an Nvidia V100 GPU has a maximum memory bandwidth of 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the utilization is 50%"},{"location":"en/admin/kpanda/gpu/gpu-metrics.html#node-level","title":"Node Level","text":"Metric Name Description GPU Mode Usage mode of GPUs on the node, including full-card mode, MIG mode, vGPU mode Number of Physical GPUs Total number of physical GPUs on the node Number of Virtual GPUs Number of vGPU devices created on the node Number of MIG Instances Number of MIG instances created on the node GPU Memory Allocation Rate Memory allocation rate of all GPUs on the node Average GPU Utilization Average compute utilization of all GPUs on the node Average GPU Memory Utilization Average memory utilization of all GPUs on the node GPU Driver Version Driver version information of GPUs on the node GPU Utilization Details 24-hour usage details of each GPU on the node (includes max, avg, current) GPU Memory Usage Details 24-hour memory usage details of each GPU on the node (includes min, max, avg, current)"},{"location":"en/admin/kpanda/gpu/gpu-metrics.html#pod-level","title":"Pod Level","text":"Category Metric Name Description Application Overview GPU - Compute & Memory Pod GPU Utilization Compute utilization of the GPUs used by the current Pod Pod GPU Memory Utilization Memory utilization of the GPUs used by the current Pod Pod GPU Memory Usage Memory usage of the GPUs used by the current Pod Memory Allocation Memory allocation of the GPUs used by the current Pod Pod GPU Memory Copy Ratio Memory copy ratio of the GPUs used by the current Pod GPU - Engine Overview GPU Graphics Engine Activity Percentage Percentage of time the Graphics or Compute engine is active during a monitoring cycle GPU Memory Bandwidth Utilization Memory bandwidth utilization (Memory BW Utilization) indicates the fraction of cycles during which data is sent to or received from the device memory. This value represents the average over the interval, not an instantaneous value. A higher value indicates higher utilization of device memory.A value of 1 (100%) indicates that a DRAM instruction is executed every cycle during the interval (in practice, a peak of about 0.8 (80%) is the maximum achievable).A value of 0.2 (20%) indicates that 20% of the cycles during the interval are spent reading from or writing to device memory. 
Tensor Core Utilization Percentage of time the Tensor Core pipeline is active during a monitoring cycle FP16 Engine Utilization Percentage of time the FP16 pipeline is active during a monitoring cycle FP32 Engine Utilization Percentage of time the FP32 pipeline is active during a monitoring cycle FP64 Engine Utilization Percentage of time the FP64 pipeline is active during a monitoring cycle GPU Decode Utilization Decode engine utilization of the GPU GPU Encode Utilization Encode engine utilization of the GPU GPU - Temperature & Power GPU Temperature Temperature of all GPUs in the cluster GPU Power Power consumption of all GPUs in the cluster GPU Total Power Consumption Total power consumption of the GPUs GPU - Clock GPU Memory Clock Memory clock frequency GPU Application SM Clock Application SM clock frequency GPU Application Memory Clock Application memory clock frequency GPU Video Engine Clock Video engine clock frequency GPU Throttle Reasons Reasons for GPU throttling GPU - Other Details PCIe Transfer Rate Data transfer rate of the GPU through the PCIe bus PCIe Receive Rate Data receive rate of the GPU through the PCIe bus"},{"location":"en/admin/kpanda/gpu/ascend/Ascend_usage.html","title":"Use Ascend NPU","text":"
This section explains how to use Ascend NPUs on the AI platform.
This document uses the AscendCL Image Classification Application example from the Ascend sample library.
Download the Ascend repository
Run the following command to download the Ascend demo repository, and remember the storage location of the code for subsequent use.
git clone https://gitee.com/ascend/samples.git\n
Prepare the base image
This example uses the Ascend-pytorch base image, which can be obtained from the Ascend Container Registry.
Prepare the YAML file
ascend-demo.yaml
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: resnetinfer1-1-1usoc\nspec:\n template:\n spec:\n containers:\n - image: ascendhub.huawei.com/public-ascendhub/ascend-pytorch:23.0.RC2-ubuntu18.04 # Inference image name\n imagePullPolicy: IfNotPresent\n name: resnet50infer\n securityContext:\n runAsUser: 0\n command:\n - \"/bin/bash\"\n - \"-c\"\n - |\n source /usr/local/Ascend/ascend-toolkit/set_env.sh &&\n TEMP_DIR=/root/samples_copy_$(date '+%Y%m%d_%H%M%S_%N') &&\n cp -r /root/samples \"$TEMP_DIR\" &&\n cd \"$TEMP_DIR\"/inference/modelInference/sampleResnetQuickStart/python/model &&\n wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/003_Atc_Models/resnet50/resnet50.onnx &&\n atc --model=resnet50.onnx --framework=5 --output=resnet50 --input_shape=\"actual_input_1:1,3,224,224\" --soc_version=Ascend910 &&\n cd ../data &&\n wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/models/aclsample/dog1_1024_683.jpg &&\n cd ../scripts &&\n bash sample_run.sh\n resources:\n requests:\n huawei.com/Ascend910: 1 # Number of the Ascend 910 Processors\n limits:\n huawei.com/Ascend910: 1 # The value should be the same as that of requests\n volumeMounts:\n - name: hiai-driver\n mountPath: /usr/local/Ascend/driver\n readOnly: true\n - name: slog\n mountPath: /var/log/npu/conf/slog/slog.conf\n - name: localtime # The container time must be the same as the host time\n mountPath: /etc/localtime\n - name: dmp\n mountPath: /var/dmp_daemon\n - name: slogd\n mountPath: /var/slogd\n - name: hbasic\n mountPath: /etc/hdcBasic.cfg\n - name: sys-version\n mountPath: /etc/sys_version.conf\n - name: aicpu\n mountPath: /usr/lib64/aicpu_kernels\n - name: tfso\n mountPath: /usr/lib64/libtensorflow.so\n - name: sample-path\n mountPath: /root/samples\n volumes:\n - name: hiai-driver\n hostPath:\n path: /usr/local/Ascend/driver\n - name: slog\n hostPath:\n path: /var/log/npu/conf/slog/slog.conf\n - name: localtime\n hostPath:\n path: /etc/localtime\n - name: dmp\n hostPath:\n path: /var/dmp_daemon\n - name: slogd\n hostPath:\n path: /var/slogd\n - name: hbasic\n hostPath:\n path: /etc/hdcBasic.cfg\n - name: sys-version\n hostPath:\n path: /etc/sys_version.conf\n - name: aicpu\n hostPath:\n path: /usr/lib64/aicpu_kernels\n - name: tfso\n hostPath:\n path: /usr/lib64/libtensorflow.so\n - name: sample-path\n hostPath:\n path: /root/samples\n restartPolicy: OnFailure\n
Some fields in the above YAML need to be modified according to the actual situation:
atc ... --soc_version=Ascend910 uses Ascend910; adjust this field depending on your actual situation. You can use the npu-smi info command to check the NPU model and add the Ascend prefix.
samples-path should be adjusted according to the actual situation.
resources should be adjusted according to the actual situation.
Deploy a Job and check its results
Use the following command to create a Job:
kubectl apply -f ascend-demo.yaml\n
Check the Pod running status:
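For example (the Pod name generated by the Job is a placeholder):
kubectl get pod | grep resnetinfer\n# After the Pod reaches the Completed status, view its logs\nkubectl logs <pod-name>\n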
After the Pod runs successfully, check the log results. The key prompt information on the screen is shown in the figure below. The Label indicates the category identifier, Conf indicates the maximum confidence of the classification, and Class indicates the belonging category. These values may vary depending on the version and environment, so please refer to the actual situation:
Confirm whether the cluster has detected the GPU. Click Clusters -> Cluster Settings -> Addon Plugins , and check whether the proper GPU type is automatically enabled and detected. Currently, the cluster will automatically enable GPU and set the GPU type to Ascend .
Deploy the workload. Click Clusters -> Workloads , deploy the workload through an image, select the type (Ascend), and then configure the number of physical cards used by the application:
Number of Physical Cards (huawei.com/Ascend910) : This indicates how many physical cards the current Pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host.
If there is an issue with the above configuration, it will result in scheduling failure and resource allocation issues.
This article will introduce the prerequisites for configuring GPU when creating a virtual machine.
The key point of configuring GPU for virtual machines is to configure the GPU Operator so that different software components can be deployed on working nodes depending on the GPU workloads configured on these nodes. Taking the following three nodes as examples:
The controller-node-1 node is configured to run containers.
The work-node-1 node is configured to run virtual machines with direct GPUs.
The work-node-2 node is configured to run virtual machines with virtual vGPUs.
"},{"location":"en/admin/virtnest/vm/vm-gpu.html#assumptions-limitations-and-dependencies","title":"Assumptions, Limitations, and Dependencies","text":"
Working nodes can run GPU-accelerated containers, virtual machines with direct GPUs, or virtual machines with vGPUs, but a single node cannot run a combination of these.
Cluster administrators or developers need to understand the cluster situation in advance and correctly label the nodes to indicate the type of GPU workload they will run.
The working nodes running virtual machines with direct GPUs or vGPUs are assumed to be bare metal. If the working nodes are virtual machines, the GPU direct pass-through feature needs to be enabled on the virtual machine platform. Please consult the virtual machine platform provider.
Nvidia MIG vGPU is not supported.
The GPU Operator will not automatically install GPU drivers in virtual machines.
To enable the GPU direct pass-through feature, the cluster nodes need to enable IOMMU. Please refer to How to Enable IOMMU. If your cluster is running on a virtual machine, consult your virtual machine platform provider.
Note: Building a vGPU Manager image is only required when using NVIDIA vGPUs. If you plan to use only GPU direct pass-through, skip this section.
The following are the steps to build the vGPU Manager image and push it to the container registry:
Download the vGPU software from the NVIDIA Licensing Portal.
Log in to the NVIDIA Licensing Portal and go to the Software Downloads page.
The NVIDIA vGPU software is located in the Driver downloads tab on the Software Downloads page.
Select VGPU + Linux in the filter criteria and click Download to get the software package for Linux KVM. Unzip the downloaded file (NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run).
Clone the container-images/driver repository in the terminal
git clone https://gitlab.com/nvidia/container-images/driver\ncd driver\n
Switch to the vgpu-manager directory for your operating system
cd vgpu-manager/<your-os>\n
Copy the .run file extracted in step 1 to the current directory
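A sketch of the copy, build, and push commands; the registry address, driver version, and build arguments below are placeholders, so refer to the README in the vgpu-manager directory for the exact values:
# Copy the .run file extracted in step 1 into the current directory\ncp ~/NVIDIA-Linux-x86_64-<version>-vgpu-kvm.run .\nexport PRIVATE_REGISTRY=registry.example.com/nvidia\nexport VERSION=<version>\n# Build and push the vGPU Manager image\ndocker build --build-arg DRIVER_VERSION=${VERSION} -t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION} .\ndocker push ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}\n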
Go to Container Management , select your worker cluster, and click Nodes. On the right of the list, click ┇ and select Edit Labels to add a label to the node. Each node can only be assigned one of these labels.
The label key is nvidia.com/gpu.workload.config, and the value can be container, vm-passthrough, or vm-vgpu.
Go to Container Management , select your worker cluster, click Helm Apps -> Helm Charts , choose and install gpu-operator. You need to modify some fields in the yaml.
Fill in the container registry address referred to in the "Build vGPU Manager Image" step.
Fill in the VERSION referred to in the "Build vGPU Manager Image" step.
Wait for the installation to be successful, as shown in the image below:
"},{"location":"en/admin/virtnest/vm/vm-gpu.html#install-virtnest-agent-and-configure-cr","title":"Install virtnest-agent and Configure CR","text":"
Install virtnest-agent, refer to Install virtnest-agent.
Add vGPU and GPU direct pass-through to the Virtnest Kubevirt CR. The following example shows the key yaml after adding vGPU and GPU direct pass-through:
spec:\n configuration:\n developerConfiguration:\n featureGates:\n - GPU\n - DisableMDEVConfiguration\n # Fill in the information below\n permittedHostDevices:\n mediatedDevices: # vGPU\n - mdevNameSelector: GRID P4-1Q\n resourceName: nvidia.com/GRID_P4-1Q\n pciHostDevices: # GPU direct pass-through\n - externalResourceProvider: true\n pciVendorSelector: 10DE:1BB3\n resourceName: nvidia.com/GP104GL_TESLA_P4\n
In the kubevirt CR yaml, permittedHostDevices is used to import VM devices, and vGPU should be added in mediatedDevices with the following structure:
mediatedDevices: \n- mdevNameSelector: GRID P4-1Q # Device Name\n resourceName: nvidia.com/GRID_P4-1Q # vGPU information registered by GPU Operator to the node\n
GPU direct pass-through should be added in pciHostDevices under permittedHostDevices with the following structure:
pciHostDevices: \n- externalResourceProvider: true # Do not change by default\n pciVendorSelector: 10DE:1BB3 # Vendor id of the current pci device\n resourceName: nvidia.com/GP104GL_TESLA_P4 # GPU information registered by GPU Operator to the node\n
Example of obtaining vGPU information (only applicable to vGPU): View node information on a node marked as nvidia.com/gpu.workload.config=vm-vgpu (e.g., work-node-2), and the nvidia.com/GRID_P4-1Q: 8 in Capacity indicates available vGPUs:
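For example (output abbreviated; the values come from the node described above):
kubectl describe node work-node-2\n# ...\n# Capacity:\n#   nvidia.com/GRID_P4-1Q:  8\n# ...\n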
So the mdevNameSelector should be \"GRID P4-1Q\" and the resourceName should be \"GRID_P4-1Q\".
Obtain GPU direct pass-through information: On a node marked as nvidia.com/gpu.workload.config=vm-passthrough (e.g., work-node-1), view the node information, and nvidia.com/GP104GL_TESLA_P4: 2 in Capacity indicates the available pass-through GPUs:
So the resourceName should be "GP104GL_TESLA_P4". To obtain the pciVendorSelector, SSH into the target node work-node-1 and use the command "lspci -nnk -d 10de:" to get the NVIDIA GPU PCI information, as shown in the image above.
Note on editing the kubevirt CR: if there are multiple GPUs of the same model, only one entry needs to be written in the CR; listing each GPU is not necessary.
# kubectl -n virtnest-system edit kubevirt kubevirt\nspec:\n configuration:\n developerConfiguration:\n featureGates:\n - GPU\n - DisableMDEVConfiguration\n # Fill in the information below\n permittedHostDevices:\n mediatedDevices: # vGPU\n - mdevNameSelector: GRID P4-1Q\n resourceName: nvidia.com/GRID_P4-1Q\n pciHostDevices: # GPU direct pass-through, in the above example, TESLA P4 has two GPUs, only register one here\n - externalResourceProvider: true\n pciVendorSelector: 10DE:1BB3\n resourceName: nvidia.com/GP104GL_TESLA_P4 \n
"},{"location":"en/admin/virtnest/vm/vm-gpu.html#create-vm-yaml-and-use-gpu-acceleration","title":"Create VM YAML and Use GPU Acceleration","text":"
The only difference from a regular virtual machine is adding GPU-related information in the devices section.
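A minimal sketch of that devices section, assuming the resource names registered in the kubevirt CR above (the name field is a custom identifier):
apiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          gpus:\n            # GPU direct pass-through\n            - deviceName: nvidia.com/GP104GL_TESLA_P4\n              name: gpu1\n            # or, for a vGPU:\n            # - deviceName: nvidia.com/GRID_P4-1Q\n            #   name: vgpu1\n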
"},{"location":"en/end-user/index.html","title":"Suanova AI Platform - End User","text":"
This is the user documentation for the Suanova AI Platform aimed at end users.
User Registration
User registration is the first step to using the AI platform.
User Registration
Cloud Host
A cloud host is a virtual machine deployed in the cloud.
Create Cloud Host
Use Cloud Host
Container Management
Container management is the core module of the AI computing center.
K8s Clusters on Cloud
Node Management
Workloads
Helm Apps and Templates
AI Lab
Manage datasets and run AI training and inference jobs.
Create AI Workloads
Use Notebook
Create Training Jobs
Create Inference Services
Insight
Monitor the status of clusters, nodes, and workloads through dashboards.
Monitor Clusters/Nodes
Metrics
Logs
Tracing
Personal Center
Set password, keys, and language in the personal center.
Security Settings
Access Keys
Language Settings
"},{"location":"en/end-user/baize/dataset/create-use-delete.html","title":"Create, Use and Delete Datasets","text":"
AI Lab provides comprehensive dataset management functions needed for model development, training, and inference processes. Currently, it supports unified access to various data sources.
With simple configurations, you can connect data sources to AI Lab, achieving unified data management, preloading, dataset management, and other functionalities.
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#create-a-dataset","title":"Create a Dataset","text":"
In the left navigation bar, click Data Management -> Dataset List, and then click the Create button on the right.
Select the worker cluster and namespace to which the dataset belongs, then click Next.
Configure the data source type for the target data, then click OK.
Currently supported data sources include:
GIT: Supports repositories such as GitHub, GitLab, and Gitee
Upon successful creation, the dataset will appear in the dataset list. You can perform more actions by clicking ┇ on the right.
Info
The system will automatically perform a one-time data preloading after the dataset is successfully created; the dataset cannot be used until the preloading is complete.
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#use-a-dataset","title":"Use a Dataset","text":"
Once the dataset is successfully created, it can be used in tasks such as model training and inference.
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#use-in-notebook","title":"Use in Notebook","text":"
In creating a Notebook, you can directly use the dataset; the usage is as follows:
Use the dataset as training data mount
Use the dataset as code mount
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#use-in-training-obs","title":"Use in Training obs","text":"
Use the dataset to specify job output
Use the dataset to specify job input
Use the dataset to specify TensorBoard output
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#use-in-inference-services","title":"Use in Inference Services","text":"
Use the dataset to mount a model
"},{"location":"en/end-user/baize/dataset/create-use-delete.html#delete-a-dataset","title":"Delete a Dataset","text":"
If you find a dataset to be redundant, expired, or no longer needed, you can delete it from the dataset list.
Click the ┇ on the right side of the dataset list, then choose Delete from the dropdown menu.
In the pop-up window, confirm the dataset you want to delete, enter the dataset name, and then click Delete.
A confirmation message will appear indicating successful deletion, and the dataset will disappear from the list.
Caution
Once a dataset is deleted, it cannot be recovered, so please proceed with caution.
Traditionally, Python environment dependencies are built into an image, which includes the Python version and dependency packages. This approach has high maintenance costs and is inconvenient to update, often requiring a complete rebuild of the image.
In AI Lab, users can manage pure environment dependencies through the Environment Management module, decoupling this part from the image. The advantages include:
One environment can be used in multiple places, such as in Notebooks, distributed training tasks, and even inference services.
Updating dependency packages is more convenient; you only need to update the environment dependencies without rebuilding the image.
The main components of the environment management are:
Cluster : Select the cluster to operate on.
Namespace : Select the namespace to limit the scope of operations.
Environment List : Displays all environments and their statuses under the current cluster and namespace.
"},{"location":"en/end-user/baize/dataset/environments.html#explanation-of-environment-list-fields","title":"Explanation of Environment List Fields","text":"
Name : The name of the environment.
Status : The current status of the environment (normal or failed). New environments undergo a warming-up process, after which they can be used in other tasks.
Creation Time : The time the environment was created.
"},{"location":"en/end-user/baize/dataset/environments.html#creat-new-environment","title":"Creat New Environment","text":"
On the Environment Management interface, click the Create button at the top right to enter the environment creation process.
Fill in the following basic information:
Name : Enter the environment name, with a length of 2-63 characters, starting and ending with lowercase letters or numbers.
Deployment Location:
Cluster : Select the cluster to deploy, such as gpu-cluster.
Namespace : Select the namespace, such as default.
Remarks (optional): Enter remarks.
Labels (optional): Add labels to the environment.
Annotations (optional): Add annotations to the environment. After completing the information, click Next to proceed to environment configuration.
Python Version : Select the required Python version, such as 3.12.3.
Package Manager : Choose the package management tool, either PIP or CONDA.
Environment Data :
If PIP is selected: Enter the dependency package list in requirements.txt format in the editor below.
If CONDA is selected: Enter the dependency package list in environment.yaml format in the editor below.
Other Options (optional):
Additional pip Index URLs : Configure additional pip index URLs; suitable for internal enterprise private repositories or PIP acceleration sites.
GPU Configuration : Enable or disable GPU configuration; some GPU-related dependency packages need GPU resources configured during preloading.
Associated Storage : Select the associated storage configuration; environment dependency packages will be stored in the associated storage. Note: Storage must support ReadWriteMany.
After configuration, click the Create button, and the system will automatically create and configure the new Python environment.
Verify that the Python version and package manager configuration are correct.
Ensure the selected cluster and namespace are available.
If dependency preloading fails:
Check if the requirements.txt or environment.yaml file format is correct.
Verify that the dependency package names and versions are correct. If other issues arise, contact the platform administrator or refer to the platform help documentation for more support.
These are the basic steps and considerations for managing Python dependencies in AI Lab.
With the rapid iteration of AI Lab, we have now supported various model inference services. Here, you can see information about the supported models.
AI Lab v0.3.0 launched model inference services, facilitating users to directly use the inference services of AI Lab without worrying about model deployment and maintenance for traditional deep learning models.
AI Lab v0.6.0 supports the complete version of vLLM inference capabilities, supporting many large language models such as LLama, Qwen, ChatGLM, and more.
Note
The support for inference capabilities is related to the version of AI Lab.
You can use GPU types that have been verified by AI platform in AI Lab. For more details, refer to the GPU Support Matrix.
Through the Triton Inference Server, traditional deep learning models can be well supported. Currently, AI Lab supports mainstream inference backend services:
The use of Triton's Backend vLLM method has been deprecated. It is recommended to use the latest support for vLLM to deploy your large language models.
With vLLM, we can quickly use large language models. Here, you can see the list of models we support, which generally aligns with the vLLM Support Models.
HuggingFace Models: We support most of HuggingFace's models. You can see more models at the HuggingFace Model Hub.
The vLLM Supported Models list includes supported large language models and vision-language models.
Models fine-tuned using the vLLM support framework.
"},{"location":"en/end-user/baize/inference/models.html#new-features-of-vllm","title":"New Features of vLLM","text":"
Currently, AI Lab also supports some new features when using vLLM as an inference tool:
Enable Lora Adapter to optimize model inference services during inference.
Provides an OpenAI-compatible API interface, making it easy for users to switch to local inference services at a low cost and transition quickly.
"},{"location":"en/end-user/baize/inference/triton-inference.html","title":"Create Inference Service Using Triton Framework","text":"
The AI Lab currently offers Triton and vLLM as inference frameworks. Users can quickly start a high-performance inference service with simple configurations.
Danger
The use of Triton's Backend vLLM method has been deprecated. It is recommended to use the latest support for vLLM to deploy your large language models.
"},{"location":"en/end-user/baize/inference/triton-inference.html#introduction-to-triton","title":"Introduction to Triton","text":"
Triton is an open-source inference server developed by NVIDIA, designed to simplify the deployment and inference of machine learning models. It supports a variety of deep learning frameworks, including TensorFlow and PyTorch, enabling users to easily manage and deploy different types of models.
Prepare model data: Manage the model code in dataset management and ensure that the data is successfully preloaded. The following example illustrates the PyTorch model for mnist handwritten digit recognition.
Note
The model to be inferred must adhere to the following directory structure within the dataset:
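For example, for the mnist model referenced below, the dataset would contain (a sketch; the version directory 1 and the file name model.pt follow the Triton model repository convention):
model-repo\n└── mnist-cnn\n    └── 1\n        └── model.pt\n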
Currently, form-based creation is supported, allowing you to create services with field prompts in the interface.
"},{"location":"en/end-user/baize/inference/triton-inference.html#configure-model-path","title":"Configure Model Path","text":"
The model path model-repo/mnist-cnn/1/model.pt must be consistent with the directory structure of the dataset.
"},{"location":"en/end-user/baize/inference/triton-inference.html#model-configuration","title":"Model Configuration","text":""},{"location":"en/end-user/baize/inference/triton-inference.html#configure-input-and-output-parameters","title":"Configure Input and Output Parameters","text":"
Note
The first dimension of the input and output parameters defaults to batchsize, setting it to -1 allows for the automatic calculation of the batchsize based on the input inference data. The remaining dimensions and data type must match the model's input.
Send HTTP POST Request: Use tools like curl or HTTP client libraries (e.g., Python's requests library) to send POST requests to the Triton Server.
Set HTTP Headers: Configuration generated automatically based on user settings, include metadata about the model inputs and outputs in the HTTP headers.
Construct Request Body: The request body usually contains the input data for inference and model-specific metadata.
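A sketch of such a request using curl against the Triton v2 inference API; the input name, shape, and datatype below are examples that must match your model configuration, and [...] is a placeholder for the actual flattened input values (1×1×28×28 = 784 numbers for this example):
curl -X POST http://<ip>:<port>/v2/models/<inference-name>/infer \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"inputs\": [{\"name\": \"INPUT__0\", \"shape\": [1, 1, 28, 28], \"datatype\": \"FP32\", \"data\": [...]}]}'\n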
<ip> is the host address where the Triton Inference Server is running.
<port> is the port where the Triton Inference Server is running.
<inference-name> is the name of the inference service that has been created.
\"name\" must match the name of the input parameter in the model configuration.
\"shape\" must match the dims of the input parameter in the model configuration.
\"datatype\" must match the Data Type of the input parameter in the model configuration.
\"data\" should be replaced with the actual inference data.
Please note that the above example code needs to be adjusted according to your specific model and environment. The format and content of the input data must also comply with the model's requirements.
"},{"location":"en/end-user/baize/inference/vllm-inference.html","title":"Create Inference Service Using vLLM Framework","text":"
AI Lab supports using vLLM as an inference service, offering all the capabilities of vLLM while fully adapting to the OpenAI interface definition.
"},{"location":"en/end-user/baize/inference/vllm-inference.html#introduction-to-vllm","title":"Introduction to vLLM","text":"
vLLM is a fast and easy-to-use library for inference and services. It aims to significantly improve the throughput and memory efficiency of language model services in real-time scenarios. vLLM boasts several features in terms of speed and flexibility:
Continuous batching of incoming requests.
Efficiently manages attention keys and values memory using PagedAttention.
Seamless integration with popular HuggingFace models.
Select the vLLM inference framework. In the model module selection, choose the pre-created model dataset hdd-models and fill in the path information where the model is located within the dataset.
This guide uses the ChatGLM3 model for creating the inference service.
Configure the resources for the inference service and adjust the parameters for running the inference service.
| Parameter Name | Description |
| -------------- | ----------- |
| GPU Resources | Configure GPU resources for inference based on the model scale and cluster resources. |
| Allow Remote Code | Controls whether vLLM trusts and executes code from remote sources. |
| LoRA | LoRA is a parameter-efficient fine-tuning technique for deep learning models. It reduces the number of parameters and computational complexity by decomposing the original model parameter matrix into low-rank matrices. 1. --lora-modules: Specifies specific modules or layers for low-rank approximation. 2. max_loras_rank: Specifies the maximum rank for each adapter layer in the LoRA model. For simpler tasks, a smaller rank value can be chosen, while more complex tasks may require a larger rank value to ensure model performance. 3. max_loras: Indicates the maximum number of LoRA layers that can be included in the model, customized based on model size and inference complexity. 4. max_cpu_loras: Specifies the maximum number of LoRA layers that can be handled in a CPU environment. |
| Associated Environment | Selects predefined environment dependencies required for inference. |
Info
For models that support LoRA parameters, refer to vLLM Supported Models.
In the Advanced Configuration , support is provided for automated affinity scheduling based on GPU resources and other node configurations. Users can also customize scheduling policies.
Once the inference service is created, click the name of the inference service to enter the details and view the API call methods. Verify the execution results using Curl, Python, and Node.js.
Copy the curl command from the details and execute it in the terminal to send a model inference request; an OpenAI-style chat completion response is expected.
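A sketch of such a request against the OpenAI-compatible endpoint exposed by the service (the service address and model name are placeholders); the response is an OpenAI-style chat.completion JSON object:
curl -X POST http://<inference-service-address>/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"chatglm3-6b\", \"messages\": [{\"role\": \"user\", \"content\": \"hello\"}]}'\n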
Job management refers to the functionality of creating and managing job lifecycles through job scheduling and control components.
AI platform Smart Computing Capability adopts Kubernetes' Job mechanism to schedule various AI inference and training jobs.
Click Job Center -> Jobs in the left navigation bar to enter the job list. Click the Create button on the right.
The system will pre-fill basic configuration data, including the cluster, namespace, type, queue, and priority. Adjust these parameters and click Next.
Configure the URL, runtime parameters, and associated datasets, then click Next.
Optionally add labels, annotations, runtime env variables, and other job parameters. Select a scheduling policy and click Confirm.
After the job is successfully created, it will have several running statuses:
Pytorch is an open-source deep learning framework that provides a flexible environment for training and deployment. A Pytorch job is a job that uses the Pytorch framework.
In the AI Lab platform, we provide support and adaptation for Pytorch jobs. Through a graphical interface, you can quickly create Pytorch jobs and perform model training.
Here we use the baize-notebook base image and the associated environment as the basic runtime environment for the job.
To learn how to create an environment, refer to Environments.
"},{"location":"en/end-user/baize/jobs/pytorch.html#create-jobs","title":"Create Jobs","text":""},{"location":"en/end-user/baize/jobs/pytorch.html#pytorch-single-jobs","title":"Pytorch Single Jobs","text":"
Log in to the AI Lab platform, click Job Center in the left navigation bar to enter the Jobs page.
Click the Create button in the upper right corner to enter the job creation page.
Select the job type as Pytorch Single and click Next .
Fill in the job name and description, then click OK .
Once the job is successfully submitted, we can enter the job details to see the resource usage. From the upper right corner, go to Workload Details to view the log output during the training process.
import os\nimport torch\nimport torch.distributed as dist\nimport torch.nn as nn\nimport torch.optim as optim\nfrom torch.nn.parallel import DistributedDataParallel as DDP\n\nclass SimpleModel(nn.Module):\n def __init__(self):\n super(SimpleModel, self).__init__()\n self.fc = nn.Linear(10, 1)\n\n def forward(self, x):\n return self.fc(x)\n\ndef train():\n # Print environment information\n print(f'PyTorch version: {torch.__version__}')\n print(f'CUDA available: {torch.cuda.is_available()}')\n if torch.cuda.is_available():\n print(f'CUDA version: {torch.version.cuda}')\n print(f'CUDA device count: {torch.cuda.device_count()}')\n\n rank = int(os.environ.get('RANK', '0'))\n world_size = int(os.environ.get('WORLD_SIZE', '1'))\n\n print(f'Rank: {rank}, World Size: {world_size}')\n\n # Initialize distributed environment\n try:\n if world_size > 1:\n dist.init_process_group('nccl')\n print('Distributed process group initialized successfully')\n else:\n print('Running in non-distributed mode')\n except Exception as e:\n print(f'Error initializing process group: {e}')\n return\n\n # Set device\n try:\n if torch.cuda.is_available():\n device = torch.device(f'cuda:{rank % torch.cuda.device_count()}')\n print(f'Using CUDA device: {device}')\n else:\n device = torch.device('cpu')\n print('CUDA not available, using CPU')\n except Exception as e:\n print(f'Error setting device: {e}')\n device = torch.device('cpu')\n print('Falling back to CPU')\n\n try:\n model = SimpleModel().to(device)\n print('Model moved to device successfully')\n except Exception as e:\n print(f'Error moving model to device: {e}')\n return\n\n try:\n if world_size > 1:\n ddp_model = DDP(model, device_ids=[rank % torch.cuda.device_count()] if torch.cuda.is_available() else None)\n print('DDP model created successfully')\n else:\n ddp_model = model\n print('Using non-distributed model')\n except Exception as e:\n print(f'Error creating DDP model: {e}')\n return\n\n loss_fn = nn.MSELoss()\n optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)\n\n # Generate some random data\n try:\n data = torch.randn(100, 10, device=device)\n labels = torch.randn(100, 1, device=device)\n print('Data generated and moved to device successfully')\n except Exception as e:\n print(f'Error generating or moving data to device: {e}')\n return\n\n for epoch in range(10):\n try:\n ddp_model.train()\n outputs = ddp_model(data)\n loss = loss_fn(outputs, labels)\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n if rank == 0:\n print(f'Epoch {epoch}, Loss: {loss.item():.4f}')\n except Exception as e:\n print(f'Error during training epoch {epoch}: {e}')\n break\n\n if world_size > 1:\n dist.destroy_process_group()\n\nif __name__ == '__main__':\n train()\n
"},{"location":"en/end-user/baize/jobs/pytorch.html#number-of-job-replicas","title":"Number of Job Replicas","text":"
Note that Pytorch Distributed training jobs will create a group of Master and Worker training Pods, where the Master is responsible for coordinating the training job, and the Worker is responsible for the actual training work.
Note
In this demonstration: the Master replica count is 1 and the Worker replica count is 2. Therefore, we need to set the replica count to 3 in the Job Configuration , which is the sum of the Master and Worker replica counts. Pytorch will automatically assign the Master and Worker roles.
AI Lab provides important visualization analysis tools for the model development process, used to display the training process and results of machine learning models. This document introduces the basic concepts of Job Analysis (Tensorboard), its usage in the AI Lab system, and how to configure dataset log content.
Note
Tensorboard is a visualization tool provided by TensorFlow, used to display the training process and results of machine learning models. It can help developers more intuitively understand the training dynamics of their models, analyze model performance, debug issues, and more.
The role and advantages of Tensorboard in the model development process:
Visualize Training Process : Display metrics such as training and validation loss, and accuracy through charts, helping developers intuitively observe the training effects of the model.
Debug and Optimize Models : By viewing the weights and gradient distributions of different layers, help developers discover and fix issues in the model.
Compare Different Experiments : Simultaneously display the results of multiple experiments, making it convenient for developers to compare the effects of different models and hyperparameter configurations.
Track Training Data : Record the datasets and parameters used during training to ensure the reproducibility of experiments.
"},{"location":"en/end-user/baize/jobs/tensorboard.html#how-to-create-tensorboard","title":"How to Create Tensorboard","text":"
In the AI Lab system, we provide a convenient way to create and manage Tensorboard. Here are the specific steps:
"},{"location":"en/end-user/baize/jobs/tensorboard.html#enable-tensorboard-when-creating-a-notebook","title":"Enable Tensorboard When Creating a Notebook","text":"
Create a Notebook : Create a new Notebook on the AI Lab platform.
Enable Tensorboard : On the Notebook creation page, enable the Tensorboard option and specify the dataset and log path.
"},{"location":"en/end-user/baize/jobs/tensorboard.html#enable-tensorboard-after-creating-and-completing-a-distributed-job","title":"Enable Tensorboard After Creating and Completing a Distributed Job","text":"
Create a Distributed Job : Create a new distributed training job on the AI Lab platform.
Configure Tensorboard : On the job configuration page, enable the Tensorboard option and specify the dataset and log path.
View Tensorboard After Job Completion : After the job is completed, you can view the Tensorboard link on the job details page. Click the link to see the visualized results of the training process.
"},{"location":"en/end-user/baize/jobs/tensorboard.html#directly-reference-tensorboard-in-a-notebook","title":"Directly Reference Tensorboard in a Notebook","text":"
In a Notebook, you can directly start Tensorboard through code. Here is a sample code snippet:
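A minimal sketch, assuming the training logs are written to a logs directory accessible from the Notebook (the path is an example):
# Load the TensorBoard notebook extension and point it at the log directory\n%load_ext tensorboard\n%tensorboard --logdir ./logs\n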
"},{"location":"en/end-user/baize/jobs/tensorboard.html#how-to-configure-dataset-log-content","title":"How to Configure Dataset Log Content","text":"
When using Tensorboard, you can record and configure different datasets and log content. Here are some common configuration methods:
"},{"location":"en/end-user/baize/jobs/tensorboard.html#configure-training-and-validation-dataset-logs","title":"Configure Training and Validation Dataset Logs","text":"
While training the model, you can use TensorFlow's tf.summary API to record logs for the training and validation datasets. Here is a sample code snippet:
# Import necessary libraries\nimport tensorflow as tf\n\n# Create log directories\ntrain_log_dir = 'logs/gradient_tape/train'\nval_log_dir = 'logs/gradient_tape/val'\ntrain_summary_writer = tf.summary.create_file_writer(train_log_dir)\nval_summary_writer = tf.summary.create_file_writer(val_log_dir)\n\n# Train model and record logs\nfor epoch in range(EPOCHS):\n for (x_train, y_train) in train_dataset:\n # Training step\n train_step(x_train, y_train)\n with train_summary_writer.as_default():\n tf.summary.scalar('loss', train_loss.result(), step=epoch)\n tf.summary.scalar('accuracy', train_accuracy.result(), step=epoch)\n\n for (x_val, y_val) in val_dataset:\n # Validation step\n val_step(x_val, y_val)\n with val_summary_writer.as_default():\n tf.summary.scalar('loss', val_loss.result(), step=epoch)\n tf.summary.scalar('accuracy', val_accuracy.result(), step=epoch)\n
In addition to logs for training and validation datasets, you can also record other custom log content such as learning rate and gradient distribution. Here is a sample code snippet:
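A sketch of recording such custom logs, reusing the writer pattern from the example above (EPOCHS, optimizer, and gradients are assumed to come from your training loop):
import tensorflow as tf\n\n# Create a writer for custom logs\ncustom_log_dir = 'logs/gradient_tape/custom'\ncustom_summary_writer = tf.summary.create_file_writer(custom_log_dir)\n\nfor epoch in range(EPOCHS):\n    # ... run the training step, obtaining the gradients of the model ...\n    with custom_summary_writer.as_default():\n        # Record the current learning rate as a scalar\n        tf.summary.scalar('learning_rate', optimizer.learning_rate, step=epoch)\n        # Record the distribution of the first layer's gradients as a histogram\n        tf.summary.histogram('gradients/layer0', gradients[0], step=epoch)\n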
In AI Lab, Tensorboards created through various methods are uniformly displayed on the job analysis page, making it convenient for users to view and manage.
Users can view information such as the link, status, and creation time of Tensorboard on the job analysis page and directly access the visualized results of Tensorboard through the link.
Tensorflow, along with Pytorch, is a highly active open-source deep learning framework that provides a flexible environment for training and deployment.
AI Lab provides support and adaptation for the Tensorflow framework. You can quickly create Tensorflow jobs and conduct model training through graphical operations.
Here, we use the baize-notebook base image and the associated environment as the basic runtime environment for jobs.
For information on how to create an environment, refer to Environment List.
"},{"location":"en/end-user/baize/jobs/tensorflow.html#creating-a-job","title":"Creating a Job","text":""},{"location":"en/end-user/baize/jobs/tensorflow.html#example-tfjob-single","title":"Example TFJob Single","text":"
Log in to the AI Lab platform and click Job Center in the left navigation bar to enter the Jobs page.
Click the Create button in the upper right corner to enter the job creation page.
Select the job type as Tensorflow Single and click Next .
Fill in the job name and description, then click OK .
"},{"location":"en/end-user/baize/jobs/tensorflow.html#pre-warming-the-code-repository","title":"Pre-warming the Code Repository","text":"
Use AI Lab -> Dataset List to create a dataset and pull the code from a remote GitHub repository into the dataset. This way, when creating a job, you can directly select the dataset and mount the code into the job.
Command parameters: Use python /code/tensorflow/tf-single.py
\"\"\"\n pip install tensorflow numpy\n\"\"\"\n\nimport tensorflow as tf\nimport numpy as np\n\n# Create some random data\nx = np.random.rand(100, 1)\ny = 2 * x + 1 + np.random.rand(100, 1) * 0.1\n\n# Create a simple model\nmodel = tf.keras.Sequential([\n tf.keras.layers.Dense(1, input_shape=(1,))\n])\n\n# Compile the model\nmodel.compile(optimizer='adam', loss='mse')\n\n# Train the model, setting epochs to 10\nhistory = model.fit(x, y, epochs=10, verbose=1)\n\n# Print the final loss\nprint('Final loss: {' + str(history.history['loss'][-1]) +'}')\n\n# Use the model to make predictions\ntest_x = np.array([[0.5]])\nprediction = model.predict(test_x)\nprint(f'Prediction for x=0.5: {prediction[0][0]}')\n
After the job is successfully submitted, you can enter the job details to see the resource usage. From the upper right corner, navigate to Workload Details to view log outputs during the training process.
Once a job is created, it will be displayed in the job list.
In the job list, click the ┇ on the right side of a job and select Job Workload Details .
A pop-up window will appear asking you to choose which Pod to view. Click Enter .
You will be redirected to the container management interface, where you can view the container's working status, labels and annotations, and any events that have occurred.
You can also view detailed logs of the current Pod for the recent period. By default, 100 lines of logs are displayed. To view more detailed logs or to download logs, click the blue Insight text at the top.
Additionally, you can use the ... in the upper right corner to view the current Pod's YAML, and to upload or download files. Below is an example of a Pod's YAML.
The access key can be used to access the OpenAPI and for continuous delivery. Users can obtain the key and access the API by referring to the following steps in the personal center.
Log in to AI platform, find Personal Center in the drop-down menu in the upper right corner, and you can manage the access key of the account on the Access Keys page.
Info
Access key is displayed only once. If you forget your access key, you will need to create a new key.
"},{"location":"en/end-user/ghippo/personal-center/accesstoken.html#use-the-key-to-access-api","title":"Use the key to access API","text":"
When accessing AI platform openAPI, add the header Authorization:Bearer ${token} to the request to identify the visitor, where ${token} is the key obtained in the previous step. For the specific API, see OpenAPI Documentation.
Request Example
curl -X GET -H 'Authorization:Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IkRKVjlBTHRBLXZ4MmtQUC1TQnVGS0dCSWc1cnBfdkxiQVVqM2U3RVByWnMiLCJ0eXAiOiJKV1QifQ.eyJleHAiOjE2NjE0MTU5NjksImlhdCI6MTY2MDgxMTE2OSwiaXNzIjoiZ2hpcHBvLmlvIiwic3ViIjoiZjdjOGIxZjUtMTc2MS00NjYwLTg2MWQtOWI3MmI0MzJmNGViIiwicHJlZmVycmVkX3VzZXJuYW1lIjoiYWRtaW4iLCJncm91cHMiOltdfQ.RsUcrAYkQQ7C6BxMOrdD3qbBRUt0VVxynIGeq4wyIgye6R8Ma4cjxG5CbU1WyiHKpvIKJDJbeFQHro2euQyVde3ygA672ozkwLTnx3Tu-_mB1BubvWCBsDdUjIhCQfT39rk6EQozMjb-1X1sbLwzkfzKMls-oxkjagI_RFrYlTVPwT3Oaw-qOyulRSw7Dxd7jb0vINPq84vmlQIsI3UuTZSNO5BCgHpubcWwBss-Aon_DmYA-Et_-QtmPBA3k8E2hzDSzc7eqK0I68P25r9rwQ3DeKwD1dbRyndqWORRnz8TLEXSiCFXdZT2oiMrcJtO188Ph4eLGut1-4PzKhwgrQ' https://demo-dev.daocloud.io/apis/ghippo.io/v1alpha1/users?page=1&pageSize=10 -k\n
This section explains how to set the interface language. Currently, two languages are supported: Chinese and English.
The language setting is the entry point for the platform's multilingual services. The platform is displayed in Chinese by default. Users can switch the platform language by selecting English or by automatically detecting the browser language preference according to their needs. Each user's language setting is independent, and switching will not affect other users.
The platform provides three ways to switch languages: Chinese, English, and automatically detecting your browser language preference.
The operation steps are as follows.
Log in to the AI platform with your username/password. Click Global Management at the bottom of the left navigation bar.
Click the username in the upper right corner and select Personal Center .
Function description: It is used to fill in the email address and modify the login password.
Email: After the administrator configures the email server address, users can click the Forgot Password button on the login page and enter their email address there to retrieve the password.
Password: The password used to log in to the platform; it is recommended to change it regularly.
The specific operation steps are as follows:
Click the username in the upper right corner and select Personal Center .
Click the Security Settings tab. Fill in your email address or change the login password.
"},{"location":"en/end-user/ghippo/personal-center/ssh-key.html","title":"Configuring SSH Public Key","text":"
This article explains how to configure SSH public key.
Before generating a new SSH key, please check whether you can use an existing SSH key stored in the home directory of the local user. For Linux and Mac, use the following command to view existing public keys. Windows users can use the following command in WSL (requires Windows 10 or above) or Git Bash to view the generated public keys.
ED25519 Algorithm:
cat ~/.ssh/id_ed25519.pub\n
RSA Algorithm:
cat ~/.ssh/id_rsa.pub\n
If a long string starting with ssh-ed25519 or ssh-rsa is returned, it means that a local public key already exists. You can skip Step 2 Generate SSH Key and proceed directly to Step 3.
If Step 1 does not return the specified content string, it means that there is no available SSH key locally and a new SSH key needs to be generated. Please follow these steps:
Access the terminal (Windows users please use WSL or Git Bash), and run ssh-keygen -t.
Enter the key algorithm type and an optional comment.
The comment appears in the .pub file; an email address is commonly used as the comment content.
To generate a key pair based on the ED25519 algorithm, use the following command:
ssh-keygen -t ed25519 -C \"<comment>\"\n
To generate a key pair based on the RSA algorithm, use the following command:
ssh-keygen -t rsa -C \"<comment>\"\n
Press Enter to choose the SSH key generation path.
Taking the ED25519 algorithm as an example, the default path is as follows:
Generating public/private ed25519 key pair.\nEnter file in which to save the key (/home/user/.ssh/id_ed25519):\n
The default key generation path is /home/user/.ssh/id_ed25519, and the proper public key is /home/user/.ssh/id_ed25519.pub.
Set a passphrase for the key.
Enter passphrase (empty for no passphrase):\nEnter same passphrase again:\n
The passphrase is empty by default, and you can choose to use a passphrase to protect the private key file. If you do not want to enter a passphrase every time you access the repository using the SSH protocol, you can enter an empty passphrase when creating the key.
Press Enter to complete the key pair creation.
Step 3. Copy the Public Key
In addition to manually copying the generated public key information printed on the command line, you can use the following commands to copy the public key to the clipboard, depending on the operating system.
Windows (in WSL or Git Bash):
cat ~/.ssh/id_ed25519.pub | clip\n
Mac:
tr -d '\\n'< ~/.ssh/id_ed25519.pub | pbcopy\n
GNU/Linux (requires xclip):
xclip -sel clip < ~/.ssh/id_ed25519.pub\n
Step 4. Set the Public Key on AI platform
Log in to the AI platform UI page and select Profile -> SSH Public Key in the upper right corner of the page.
Add the generated SSH public key information.
SSH public key content.
Public key title: Supports customizing the public key name for management differentiation.
Expiration: Set the expiration period for the public key. After it expires, the public key will be automatically invalidated and cannot be used. If not set, it will be permanently valid.
Description of folder permissions
Folders have permission mapping capabilities, which can map the permissions of users/groups in this folder to subfolders, workspaces and resources under it.
If a user/group has the Folder Admin role in this folder, it remains Folder Admin when mapped to a subfolder and is mapped to Workspace Admin in the workspaces under it. If a namespace is bound in Workspace and Folder -> Resource Group , the user/group also becomes Namespace Admin after mapping.
Note
The permission mapping capability of folders does not apply to shared resources, because sharing grants the usage permissions of a cluster to multiple workspaces rather than assigning management permissions to them, so permission inheritance and role mapping are not implemented for shared resources.
Folders are hierarchical, so when folders are mapped to departments/suppliers/projects in the enterprise:
If a user/group has administrative authority (Admin) in the first-level department, the second-level, third-level, and fourth-level departments or projects under it also have administrative authority;
If a user/group has editing permission (Editor) in the first-level department, the second-, third-, and fourth-level departments or projects under it also have editing permission;
If a user/group has read-only permission (Viewer) in the first-level department, the second-level, third-level, and fourth-level departments or projects under it also have read-only permission.
| Objects | Actions | Folder Admin | Folder Editor | Folder Viewer |
| --- | --- | --- | --- | --- |
| The folder itself | View | ✓ | ✓ | ✓ |
| | Authorization | ✓ | ✗ | ✗ |
| | Modify Alias | ✓ | ✗ | ✗ |
| Subfolders | Create | ✓ | ✗ | ✗ |
| | View | ✓ | ✓ | ✓ |
| | Authorization | ✓ | ✗ | ✗ |
| | Modify Alias | ✓ | ✗ | ✗ |
| Workspaces under it | Create | ✓ | ✗ | ✗ |
| | View | ✓ | ✓ | ✓ |
| | Authorization | ✓ | ✗ | ✗ |
| | Modify Alias | ✓ | ✗ | ✗ |
| Workspaces under it - Resource Group | View | ✓ | ✓ | ✓ |
| | Resource Binding | ✓ | ✗ | ✗ |
| | Unbind | ✓ | ✗ | ✗ |
| Workspaces under it - Shared Resources | View | ✓ | ✓ | ✓ |
| | New Share | ✓ | ✗ | ✗ |
| | Unshare | ✓ | ✗ | ✗ |
| | Resource Quota | ✓ | ✗ | ✗ |
Create/Delete Folders
Folders have the capability to map permissions, allowing users/user groups to have their permissions in the folder mapped to its sub-folders, workspaces, and resources.
Follow the steps below to create a folder:
Log in to AI platform with a user account having the admin/folder admin role. Click Global Management -> Workspace and Folder at the bottom of the left navigation bar.
Click the Create Folder button in the top right corner.
Fill in the folder name, parent folder, and other information, then click OK to complete creating the folder.
Tip
After successful creation, the folder name will be displayed in the left tree structure, represented by different icons for workspaces and folders.
Note
To edit or delete a specific folder, select it and click ┇ on the right side.
If there are resources bound to the resource group or shared resources within the folder, the folder cannot be deleted. All resources need to be unbound before deleting.
If there are registry resources accessed by the microservice engine module within the folder, the folder cannot be deleted. All access to the registry needs to be removed before deleting the folder.
Shared resources do not necessarily mean that the shared users can use the shared resources without any restrictions. Admin, Kpanda Owner, and Workspace Admin can limit the maximum usage quota of a user through the Resource Quota feature in shared resources. If no restrictions are set, it means the usage is unlimited.
CPU Request (Core)
CPU Limit (Core)
Memory Request (MB)
Memory Limit (MB)
Total Storage Request (GB)
Persistent Volume Claims (PVC)
GPU Type, Spec, Quantity (including but not limited to Nvidia, Ascend, Iluvatar, and other GPUs)
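For reference, these dimensions map naturally onto a standard Kubernetes ResourceQuota applied to a namespace created from the shared cluster. Below is a minimal sketch, assuming a hypothetical namespace ns01; the object name and all values are illustrative, not generated by the platform:

```yaml
# Hypothetical ResourceQuota covering the dimensions listed above
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ws01-quota              # illustrative name
  namespace: ns01               # illustrative namespace
spec:
  hard:
    requests.cpu: "10"              # CPU Request (Core)
    limits.cpu: "20"                # CPU Limit (Core)
    requests.memory: 10240Mi        # Memory Request (MB)
    limits.memory: 20480Mi          # Memory Limit (MB)
    requests.storage: 100Gi         # Total Storage Request (GB)
    persistentvolumeclaims: "10"    # Persistent Volume Claims (PVC)
    requests.nvidia.com/gpu: "2"    # GPU quantity; the resource name depends on the GPU vendor/type
```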
A resource (cluster) can be shared among multiple workspaces, and a workspace can use resources from multiple shared clusters simultaneously.
Resource Groups and Shared Resources
Cluster resources in both shared resources and resource groups are derived from Container Management. However, different effects will occur when binding a cluster to a workspace or sharing it with a workspace.
Binding Resources
Users/User groups in the workspace will have full management and usage permissions for the cluster. Workspace Admin will be mapped as Cluster Admin. Workspace Admin can access the Container Management module to manage the cluster.
Note
As of now, there are no Cluster Editor and Cluster Viewer roles in the Container Management module. Therefore, Workspace Editor and Workspace Viewer cannot be mapped.
Adding Shared Resources
Users/User groups in the workspace will have usage permissions for the cluster resources.
Unlike resource groups, when sharing a cluster with a workspace, the roles of the users in the workspace will not be mapped to the resources. Therefore, Workspace Admin will not be mapped as Cluster Admin.
This section demonstrates three scenarios related to resource quotas.
Select workspace ws01 and the shared cluster in Workbench, and create a namespace ns01 .
If no resource quotas are set in the shared cluster, there is no need to set resource quotas when creating the namespace.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the CPU request for the namespace must be less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful creation.
Bind Namespace to Workspace
Prerequisite: Workspace ws01 has added a shared cluster, and the operator has the Workspace Admin + Kpanda Owner or Admin role.
The two methods of binding have the same effect.
Bind the created namespace ns01 to ws01 in Container Management.
If no resource quotas are set in the shared cluster, the namespace ns01 can be successfully bound regardless of whether resource quotas are set.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the namespace ns01 must meet the requirement of CPU requests less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful binding.
Bind the namespace ns01 to ws01 in Global Management.
If no resource quotas are set in the shared cluster, the namespace ns01 can be successfully bound regardless of whether resource quotas are set.
If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the namespace ns01 must meet the requirement of CPU requests less than or equal to 100 cores (CPU Request ≤ 100 cores) for successful binding.
Unbind Namespace from Workspace
The two methods of unbinding have the same effect.
Unbind the namespace ns01 from workspace ws01 in Container Management.
If no resource quotas are set in the shared cluster, unbinding the namespace ns01 will not affect the resource quotas, regardless of whether resource quotas were set for the namespace.
If resource quotas (CPU Request = 100 cores) are set in the shared cluster and the namespace ns01 has its own resource quotas, unbinding will release the proper resource quota.
Unbind the namespace ns01 from workspace ws01 in Global Management.
If no resource quotas are set in the shared cluster, unbinding the namespace ns01 will not affect the resource quotas, regardless of whether resource quotas were set for the namespace.
If resource quotas (CPU Request = 100 cores) are set in the shared cluster and the namespace ns01 has its own resource quotas, unbinding will release the proper resource quota.
Differences between Resource Groups and Shared Resources
Both resource groups and shared resources support cluster binding, but they have significant differences in usage.
Differences in Usage Scenarios
Cluster Binding for Resource Groups: Resource groups are usually used for batch authorization. After binding a resource group to a cluster, the workspace administrator will be mapped as a cluster administrator and able to manage and use cluster resources.
Cluster Binding for Shared Resources: Shared resources are usually used for resource quotas. A typical scenario is that the platform administrator assigns a cluster to a first-level supplier, who then assigns the cluster to a second-level supplier and sets resource quotas for the second-level supplier.
Note: In this scenario, the platform administrator needs to impose resource restrictions on secondary suppliers. Currently, the primary supplier cannot limit the cluster quota of secondary suppliers.
Differences in Cluster Quota Usage
Cluster Binding for Resource Groups: The workspace administrator is mapped as the administrator of the cluster, which is equivalent to being granted the Cluster Admin role in Container Management -> Permission Management. They have unrestricted access to cluster resources, can manage critical content such as management nodes, and are not subject to resource quotas.
Cluster Binding for Shared Resources: The workspace administrator can only use the quota in the cluster to create namespaces in the Workbench and does not have cluster management permissions. If the workspace is restricted by a quota, the workspace administrator can only create and use namespaces within the quota range.
Differences in Resource Types
Resource Groups: Can bind to clusters, cluster-namespaces, multiclouds, multicloud namespaces, meshes, and mesh-namespaces.
Shared Resources: Can only bind to clusters.
Similarities between Resource Groups and Shared Resources
After binding to a cluster, both resource groups and shared resources can go to the Workbench to create namespaces, which will be automatically bound to the workspace.
A workspace is a resource category that represents a hierarchical relationship of resources. A workspace can contain resources such as clusters, namespaces, and registries. Typically, each workspace corresponds to a project and different resources can be allocated, and different users and user groups can be assigned to each workspace.
Follow the steps below to create a workspace:
Log in to AI platform with a user account having the admin/folder admin role. Click Global Management -> Workspace and Folder at the bottom of the left navigation bar.
Click the Create Workspace button in the top right corner.
Fill in the workspace name, folder assignment, and other information, then click OK to complete creating the workspace.
Tip
After successful creation, the workspace name will be displayed in the left tree structure, represented by different icons for folders and workspaces.
Note
To edit or delete a specific workspace or folder, select it and click ... on the right side.
If resource groups and shared resources have resources under the workspace, the workspace cannot be deleted. All resources need to be unbound before deletion of the workspace.
If Microservices Engine has Integrated Registry under the workspace, the workspace cannot be deleted. Integrated Registry needs to be removed before deletion of the workspace.
If Container Registry has Registry Space or Integrated Registry under the workspace, the workspace cannot be deleted. Registry Space needs to be removed, and Integrated Registry needs to be deleted before deletion of the workspace.
Workspace and Folder
Workspace and Folder is a feature that provides resource isolation and grouping, addressing issues related to unified authorization, resource grouping, and resource quotas.
Workspace and Folder involves two concepts: workspaces and folders.
Workspaces allow the management of resources through Authorization , Resource Group , and Shared Resource , enabling users (and user groups) to share resources within the workspace.
Resources
Resources are at the lowest level of the hierarchy in the resource management module. They include clusters, namespaces, pipelines, gateways, and more. All these resources can only have workspaces as their parent level. Workspaces act as containers for grouping resources.
Workspace
A workspace usually refers to a project or environment, and the resources in each workspace are logically isolated from those in other workspaces. You can grant users (groups of users) different access rights to the same set of resources through authorization in the workspace.
Workspaces are at the first level, counting from the bottom of the hierarchy, and contain resources. All resources except shared resources have one and only one parent. All workspaces also have one and only one parent folder.
Resources are grouped by workspace, and there are two grouping modes in workspace, namely Resource Group and Shared Resource .
Resource group
A resource can only be added to one resource group, and resource groups correspond to workspaces one-to-one. After a resource is added to a resource group, Workspace Admin obtains management authority over the resource, equivalent to being the owner of the resource.
Shared resource
For shared resources, multiple workspaces can share one or more resources. Resource owners can choose to share their own resources with the workspace. Generally, when sharing, the resource owner will limit the amount of resources that can be used by the shared workspace. After resources are shared, Workspace Admin only has resource usage rights under the resource limit, and cannot manage resources or adjust the amount of resources that can be used by the workspace.
At the same time, shared resources also have certain requirements for the resources themselves: currently, only Cluster resources can be shared. Cluster Admin can share Cluster resources with different workspaces and limit how much of the Cluster each workspace can use.
Workspace Admin can create multiple Namespaces within the resource quota, but the sum of the resource quotas of the Namespaces cannot exceed the resource quota of the Cluster in the workspace. For Kubernetes resources, the only resource type that can be shared currently is Cluster.
Folders can be used to build enterprise business hierarchy relationships.
Folders are a further grouping mechanism based on workspaces and have a hierarchical structure. A folder can contain workspaces, other folders, or a combination of both, forming a tree-like organizational relationship.
Folders allow you to map your business hierarchy and group workspaces by department. Folders are not directly linked to resources, but indirectly achieve resource grouping through workspaces.
A folder has one and only one parent folder, and the root folder is the highest level of the hierarchy. The root folder has no parent, and folders and workspaces are attached to the root folder.
In addition, users (groups) in folders can inherit permissions from their parents through a hierarchical structure. The permissions of the user in the hierarchical structure come from the combination of the permissions of the current level and the permissions inherited from its parents. The permissions are additive and there is no mutual exclusion.
Description of workspace permissions
The workspace has permission mapping and resource isolation capabilities, and can map the permissions of users/groups in the workspace to the resources under it. If the user/group has the Workspace Admin role in the workspace and the resource Namespace is bound to the workspace-resource group, the user/group will become Namespace Admin after mapping.
Note
The permission mapping capability of the workspace does not apply to shared resources, because sharing grants the cluster usage permissions to multiple workspaces rather than assigning management permissions to the workspaces, so permission inheritance and role mapping are not implemented for shared resources.
Resource isolation is achieved by binding resources to different workspaces. Therefore, resources can be flexibly allocated to each workspace (tenant) with the help of permission mapping, resource isolation, and resource sharing capabilities.
Generally applicable to the following two use cases:
Cluster one-to-one
| Ordinary Cluster | Department/Tenant (Workspace) | Purpose |
| --- | --- | --- |
| Cluster 01 | A | Administration and Usage |
| Cluster 02 | B | Administration and Usage |
Cluster one-to-many
| Cluster | Department/Tenant (Workspace) | Resource Quota |
| --- | --- | --- |
| Cluster 01 | A | 100-core CPU |
| Cluster 01 | B | 50-core CPU |
Authorized users can go to modules such as workbench, microservice engine, middleware, multicloud orchestration, and service mesh to use resources in the workspace. For the operation scope of the roles of Workspace Admin, Workspace Editor, and Workspace Viewer in each module, please refer to the permission description:
Suppose a user John ("John" represents any user who needs to bind resources) has been assigned the Workspace Admin role, or has been granted the workspace's "Resource Binding" permission through a custom role, and wants to bind a specific cluster or namespace to the workspace.
To bind cluster/namespace resources to a workspace, the Cluster Admin permission is required in addition to the workspace's "Resource Binding" permission.
Granting Authorization to John
Using the Platform Admin Role, grant John the role of Workspace Admin on the Workspace -> Authorization page.
Then, on the Container Management -> Permissions page, authorize John as a Cluster Admin using Add Permission .
Binding to Workspace
Log in to AI platform using John's account. On the Container Management -> Clusters page, John can bind the specified cluster to his own workspace using the Bind Workspace button.
Note
John can only bind clusters or namespaces to a specific workspace in the Container Management module, and cannot perform this operation in the Global Management module.
To bind a namespace to a workspace, you must have at least Workspace Admin and Cluster Admin permissions.
Creating and Starting a Cloud Host
Once the user completes registration and is assigned a workspace, namespace, and resources, they can create and start a cloud host.
Steps to Operate
Log into the AI platform as an administrator.
Navigate to Container Management -> Container Network -> Services, click the service name to enter the service details page, and click Update in the upper right corner.
Change the port range to 30900-30999, ensuring there are no conflicts.
Log into the AI platform as an end user, navigate to the proper service, and check the access ports.
Use an SSH client to log into the cloud host from the external network (see the example after these steps).
At this point, you can perform various operations on the cloud host.
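As referenced in the SSH step above, the login command typically looks like the following; the node address 10.20.30.40, the port 30901, and the root user are all illustrative values that depend on your environment and the cloud host image:

```shell
# Connect to the cloud host through the NodePort exposed by the service
ssh root@10.20.30.40 -p 30901
```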
The Alert Center is an important feature provided by AI platform that allows users to easily view all active and historical alerts by cluster and namespace through a graphical interface, and search alerts based on severity level (critical, warning, info).
All alerts are triggered based on the threshold conditions set in the preset alert rules. In AI platform, some global alert policies are built-in, but users can also create or delete alert policies at any time, and set thresholds for the following metrics:
CPU usage
Memory usage
Disk usage
Disk reads per second
Disk writes per second
Cluster disk read throughput
Cluster disk write throughput
Network send rate
Network receive rate
Users can also add labels and annotations to alert rules. Alert rules can be classified as active or expired, and certain rules can be enabled/disabled to achieve silent alerts.
When the threshold condition is met, users can configure how they want to be notified, including email, DingTalk, WeCom, webhook, and SMS notifications. All notification message templates can be customized and all messages are sent at specified intervals.
In addition, the Alert Center also supports sending alert messages to designated users through short message services provided by Alibaba Cloud, Tencent Cloud, and more platforms that will be added soon, enabling multiple ways of alert notification.
AI platform Alert Center is a powerful alert management platform that helps users quickly detect and resolve problems in the cluster, improve business stability and availability, and facilitate cluster inspection and troubleshooting.
In addition to the built-in alert policies, AI platform allows users to create custom alert policies. Each alert policy is a collection of alert rules that can be set for clusters, nodes, and workloads. When an alert object reaches the threshold set by any of the rules in the policy, an alert is automatically triggered and a notification is sent.
Taking the built-in alerts as an example, click the first alert policy alertmanager.rules .
You can see that some alert rules have been set under it. You can add more rules under this policy, or edit or delete them at any time. You can also view the historical and active alerts related to this alert policy and edit the notification configuration.
Select Alert Center -> Alert Policies , and click the Create Alert Policy button.
Fill in the basic information, select one or more clusters, nodes, or workloads as the alert objects, and click Next .
The list must have at least one rule. If the list is empty, please Add Rule .
Create an alert rule in the pop-up window, fill in the parameters, and click OK .
Template rules: Pre-defined basic metrics that can monitor CPU, memory, disk, and network.
PromQL rules: Enter a PromQL expression; for syntax, refer to Prometheus expression queries. See the example after this list.
Duration: After the alert condition is met and persists for the set duration, the alert policy will enter the triggered state.
Alert level: Including emergency, warning, and information levels.
Advanced settings: Custom tags and annotations.
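For the PromQL rules option mentioned above, the expression below is a hedged example of what a CPU-usage rule could look like; the metric name follows common node-exporter conventions and may differ in your environment:

```promql
# Fires when average node CPU usage exceeds 80% over the last 5 minutes
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
```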
After clicking Next , configure notifications.
After the configuration is complete, click the OK button to return to the Alert Policy list.
Tip
The newly created alert policy is in the Not Triggered state. Once the threshold conditions and duration specified in the rules are met, it will change to the Triggered state.
After filling in the basic information, click Add Rule and select Log Rule as the rule type.
Creating log rules is supported only when the resource object is selected as a node or workload.
Field Explanation:
Filter Condition : Field used to query log content, supports four filtering conditions: AND, OR, regular expression matching, and fuzzy matching.
Condition : Based on the filter condition, enter keywords or matching conditions.
Time Range : Time range for log queries.
Threshold Condition : Enter the alert threshold value in the input box. When the set threshold is reached, an alert will be triggered. Supported comparison operators are: >, ≥, =, ≤, <.
Alert Level : Select the alert level to indicate the severity of the alert.
After filling in the basic information, click Add Rule and select Event Rule as the rule type.
Creating event rules is supported only when the resource object is selected as a workload.
Field Explanation:
Event Rule : Only supports selecting the workload as the resource object.
Event Reason : Event reasons differ by workload type; multiple event reasons are combined with an "AND" relationship.
Time Range : Detect data generated within this time range. If the threshold condition is reached, an alert event will be triggered.
Threshold Condition : When the generated events reach the set threshold, an alert event will be triggered.
Trend Chart : By default, it shows the trend of event changes within the last 10 minutes. The value at each point represents the total number of occurrences within the configured time range, counted backward from that point in time.
Click ┇ at the right side of the list, then choose Delete from the pop-up menu to delete an alert policy. By clicking on the policy name, you can enter the policy details where you can add, edit, or delete the alert rules under it.
Warning
Deleted alert policies will be permanently removed, so please proceed with caution.
The Alert Template feature allows platform administrators to create alert templates and rules that business units can use directly to create alert policies. This reduces the amount of alert rule management required of business personnel while still allowing alert thresholds to be adjusted to the actual environment.
In the navigation bar, select Alert -> Alert Policy, and click Alert Template at the top.
Click Create Alert Template, and set the name, description, and other information for the Alert template.
| Parameter | Description |
| --- | --- |
| Template Name | The name can only contain lowercase letters, numbers, and hyphens (-), must start and end with a lowercase letter or number, and can be up to 63 characters long. |
| Description | The description can contain any characters and can be up to 256 characters long. |
| Resource Type | Used to specify the matching type of the Alert template. |
| Alert Rule | Supports pre-defining multiple alert rules, including template rules and PromQL rules. |
Click OK to complete the creation and return to the Alert template list. Click the template name to view the template details.
Alert Inhibition is mainly a mechanism for temporarily hiding or reducing the priority of alerts that do not need immediate attention. The purpose of this feature is to reduce unnecessary alert information that may disturb operations personnel, allowing them to focus on more critical issues.
Alert inhibition recognizes and ignores certain alerts by defining a set of rules to deal with specific conditions. There are mainly the following conditions:
Parent-child inhibition: when a parent alert (for example, a crash on a node) is triggered, all child alerts caused by it (for example, a crash on a container running on that node) are inhibited.
Similar alert inhibition: When alerts have the same characteristics (for example, the same problem on the same instance), multiple alerts are inhibited.
In the left navigation bar, select Alert -> Noise Reduction, and click Inhibition at the top.
Click Create Inhibition, and set the name and rules for the inhibition.
Note
By defining a set of rules through Rule Details and Alert Details, inhibition identifies and ignores certain alerts, avoiding a flood of multiple similar or related alerts triggered by the same issue.
| Parameter | Description |
| --- | --- |
| Name | The name can only contain lowercase letters, numbers, and hyphens (-), must start and end with a lowercase letter or number, and can be up to 63 characters long. |
| Description | The description can contain any characters and can be up to 256 characters long. |
| Cluster | The cluster where the inhibition rule applies. |
| Namespace | The namespace where the inhibition rule applies. |
| Source Alert | Matching alerts by label conditions. It compares alerts that meet all label conditions with those that meet inhibition conditions, and alerts that do not meet inhibition conditions will be sent to the user as usual. Value range explanation: Alert Level: the level of metric or event alerts, can be set as Critical, Major, Minor. Resource Type: the resource type specific for the alert object, can be set as Cluster, Node, StatefulSet, Deployment, DaemonSet, Pod. Labels: alert identification attributes, consisting of label name and label value, supports user-defined values. |
| Inhibition | Specifies the matching conditions for the target alert (the alert to be inhibited). Alerts that meet all the conditions will no longer be sent to the user. |
| Equal | Specifies the list of labels to compare to determine if the source alert and target alert match. Inhibition is triggered only when the values of the labels specified in equal are exactly the same in the source and target alerts. The equal field is optional. If the equal field is omitted, all labels are used for matching. |
Click OK to complete the creation and return to Inhibition list. Click the inhibition rule name to view the rule details.
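For readers familiar with Prometheus Alertmanager, the Source Alert, Inhibition, and Equal parameters correspond conceptually to an Alertmanager inhibit_rules entry. The snippet below only illustrates that mapping; it is not the configuration Insight actually generates:

```yaml
inhibit_rules:
  - source_matchers:           # "Source Alert": the higher-level alert that triggers inhibition
      - severity="critical"
      - alertname="NodeDown"
    target_matchers:           # "Inhibition": alerts matching these conditions are suppressed
      - severity="warning"
    equal:                     # "Equal": labels whose values must match in both alerts
      - cluster
      - node
```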
After entering Insight , click Alert Center -> Notification Settings in the left navigation bar. By default, the email notification object is selected. Click Add email group and add one or more email addresses.
Multiple email addresses can be added.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list to edit or delete the email group.
In the left navigation bar, click Alert Center -> Notification Settings -> WeCom . Click Add Group Robot and add one or more group robots.
For the URL of the WeCom group robot, please refer to the official document of WeCom: How to use group robots.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information , and you can also edit or delete the group robot.
In the left navigation bar, click Alert Center -> Notification Settings -> DingTalk . Click Add Group Robot and add one or more group robots.
For the URL of the DingTalk group robot, please refer to the official document of DingTalk: Custom Robot Access.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information , and you can also edit or delete the group robot.
In the left navigation bar, click Alert Center -> Notification Settings -> Lark . Click Add Group Bot and add one or more group bots.
Note
When signature verification is required in Lark's group bot, you need to fill in the specific signature key when enabling notifications. Refer to Customizing Bot User Guide.
After configuration, you will be automatically redirected to the list page. Click ┇ on the right side of the list and select Send Test Message . You can edit or delete group bots.
In the left navigation bar, click Alert Center -> Notification Settings -> Webhook . Click New Webhook and add one or more Webhooks.
For the Webhook URL and more configuration methods, please refer to the webhook document.
After the configuration is complete, the notification list will automatically return. Click ┇ on the right side of the list, select Send Test Information , and you can also edit or delete the Webhook.
In the left navigation bar, click Alert Center -> Notification Settings -> SMS . Click Add SMS Group and add one or more SMS groups.
Enter the name, the object receiving the message, phone number, and notification server in the pop-up window.
The notification server needs to be created in advance under Notification Settings -> Notification Server . Currently, SMS services from two cloud providers, Alibaba Cloud and Tencent Cloud, are supported. Please refer to your own cloud service information for the specific configuration parameters.
After the SMS group is successfully added, the notification list will automatically return. Click ┇ on the right side of the list to edit or delete the SMS group.
The message template feature supports customizing the content of message templates and can notify specified objects in the form of email, WeCom, DingTalk, Webhook, and SMS.
Creating a Message Template
In the left navigation bar, select Alert -> Message Template .
Insight comes with two default built-in templates in both Chinese and English for user convenience.
Fill in the template content.
Info
Observability comes with predefined message templates. If you need to define the content of the templates, refer to Configure Notification Templates.
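As an illustration, template content might combine several of the variables described in the table below; the wording here is arbitrary and only shows how the variables are referenced:

```
[{{ .Labels.severity }}] {{ .Labels.alertname }} in cluster {{ .Labels.cluster }}
Target: {{ .Labels.target_type }}/{{ .Labels.target }} (namespace: {{ .Labels.namespace }})
Started at: {{ .StartsAt }}
Description: {{ .Annotations.description }}
```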
Click the name of a message template to view the details of the message template in the right slider.
| Parameters | Variable | Description |
| --- | --- | --- |
| ruleName | {{ .Labels.alertname }} | The name of the rule that triggered the alert |
| groupName | {{ .Labels.alertgroup }} | The name of the alert policy to which the alert rule belongs |
| severity | {{ .Labels.severity }} | The level of the alert that was triggered |
| cluster | {{ .Labels.cluster }} | The cluster where the resource that triggered the alert is located |
| namespace | {{ .Labels.namespace }} | The namespace where the resource that triggered the alert is located |
| node | {{ .Labels.node }} | The node where the resource that triggered the alert is located |
| targetType | {{ .Labels.target_type }} | The resource type of the alert target |
| target | {{ .Labels.target }} | The name of the object that triggered the alert |
| value | {{ .Annotations.value }} | The metric value at the time the alert notification was triggered |
| startsAt | {{ .StartsAt }} | The time when the alert started to occur |
| endsAt | {{ .EndsAt }} | The time when the alert ended |
| description | {{ .Annotations.description }} | A detailed description of the alert |
| labels | {{ for .labels }} {{ end }} | All labels of the alert; use the for function to iterate through the labels list to get all label contents. |
Editing or Deleting a Message Template
Click ┇ on the right side of the list and select Edit or Delete from the pop-up menu to modify or delete the message template.
Warning
Once a template is deleted, it cannot be recovered, so please use caution when deleting templates.
Alert silence is a feature that allows alerts meeting certain criteria to be temporarily disabled from sending notifications within a specific time range. This feature helps operations personnel avoid receiving too many noisy alerts during certain operations or events, while also allowing for more precise handling of real issues that need to be addressed.
On the Alert Silence page, you can see two tabs: Active Rule and Expired Rule. The former presents the rules currently in effect, while the latter presents those that were defined in the past but have now expired (or have been deleted by the user).
Creating a Silent Rule
In the left navigation bar, select Alert -> Noise Reduction -> Alert Silence , and click the Create Silence Rule button.
Fill in the parameters for the silent rule, such as cluster, namespace, tags, and time, to define the scope and effective time of the rule, and then click OK .
Return to the rule list, and on the right side of the list, click ┇ to edit or delete a silent rule.
Through the Alert Silence feature, you can flexibly control which alerts should be ignored and when they should be effective, thereby improving operational efficiency and reducing the possibility of false alerts.
Insight supports SMS notifications and currently sends alert messages using integrated Alibaba Cloud and Tencent Cloud SMS services. This article explains how to configure the SMS notification server in Insight. The variables supported in the SMS signature are the default variables in the message template. As the number of SMS characters is limited, it is recommended to choose more explicit variables.
For information on how to configure SMS recipients, refer to the document: Configure SMS Notification Group.
Go to Alert Center -> Notification Settings -> Notification Server .
Click Add Notification Server .
Configure Alibaba Cloud server.
To apply for Alibaba Cloud SMS service, refer to Alibaba Cloud SMS Service.
Field descriptions:
AccessKey ID : Parameter used by Alibaba Cloud to identify the user.
AccessKey Secret : Key used by Alibaba Cloud to authenticate the user. AccessKey Secret must be kept confidential.
SMS Signature : The SMS service supports creating signatures that meet the requirements according to user needs. When sending SMS, the SMS platform will add the approved SMS signature to the SMS content before sending it to the SMS recipient.
Template CODE : The SMS template is the specific content of the SMS to be sent.
Parameter Template : The SMS body template can contain variables. Users can use variables to customize the SMS content.
Please refer to Alibaba Cloud Variable Specification.
Note
Example: The template content defined in Alibaba Cloud is: ${severity}: ${alertname} triggered at ${startat}. Refer to the configuration in the parameter template.
Configure Tencent Cloud server.
To apply for Tencent Cloud SMS service, please refer to Tencent Cloud SMS.
Field descriptions:
Secret ID : Parameter used by Tencent Cloud to identify the API caller.
SecretKey : Parameter used by Tencent Cloud to authenticate the API caller.
SMS Template ID : The SMS template ID automatically generated by Tencent Cloud system.
Signature Content : The SMS signature content, which is the full name or abbreviation of the actual website name defined in the Tencent Cloud SMS signature.
SdkAppId : SMS SdkAppId, the actual SdkAppId generated after adding the application in the Tencent Cloud SMS console.
Parameter Template : The SMS body template can contain variables. Users can use variables to customize the SMS content. Please refer to: Tencent Cloud Variable Specification.
Note
Example: The template content defined in Tencent Cloud is: {1}: {2} triggered at {3}. Refer to the configuration in the parameter template.
insight-agent Component Status Explanation
In AI platform, Insight acts as a multi-cluster observability product. To achieve unified data collection across multiple clusters, users need to install the Helm App insight-agent (installed by default in the insight-system namespace). Refer to How to Install insight-agent .
In the \"Observability\" -> \"Collection Management\" section, you can view the installation status of insight-agent in each cluster.
Not Installed : insight-agent is not installed in the insight-system namespace of the cluster.
Running : insight-agent is successfully installed in the cluster, and all deployed components are running.
Error : If insight-agent is in this state, it indicates that the helm deployment failed or there are components deployed that are not in a running state.
You can troubleshoot using the following steps:
Run the following command. If the status is deployed , proceed to the next step. If it is failed , it is recommended to uninstall and reinstall it from Container Management -> Helm Apps as it may affect application upgrades:
helm list -n insight-system\n
Run the following command or check the status of the deployed components in Insight -> Data Collection . If there are Pods not in the Running state, restart the containers in an abnormal state.
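The command itself is not shown in the original text; a common equivalent is to list the insight-agent Pods directly (the pod name in the second command is a placeholder):

```shell
# Confirm all insight-agent components are Running
kubectl get pods -n insight-system

# Restart a component stuck in an abnormal state by deleting its Pod
kubectl delete pod <pod-name> -n insight-system
```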
The resource consumption of the Prometheus metric collection component in insight-agent is directly proportional to the number of Pods running in the cluster. Please adjust the resources for Prometheus according to the cluster size. Refer to Prometheus Resource Planning.
The storage capacity of the vmstorage metric storage component in the global service cluster is directly proportional to the total number of Pods in the clusters.
Please contact the platform administrator to adjust the disk capacity of vmstorage based on the cluster size. Refer to vmstorage Disk Capacity Planning.
Adjust the vmstorage disk based on the multi-cluster scale. Refer to vmstorage Disk Expansion.
Data Collection provides a central place to manage and display the installation of the cluster collection plug-in insight-agent. It helps users quickly view the health status of the collection plug-in in each cluster and provides a quick entry for configuring collection rules.
The specific operation steps are as follows:
Click in the upper left corner and select Insight -> Data Collection .
You can view the status of all cluster collection plug-ins.
When the cluster has insight-agent installed and running, click the cluster name to enter the details page.
In the Service Monitor tab, click the shortcut link to jump to Container Management -> CRD to add service discovery rules.
Prometheus primarily uses the Pull approach to retrieve monitoring metrics from target services' exposed endpoints. Therefore, it requires configuring proper scraping jobs to request monitoring data and write it into the storage provided by Prometheus. Currently, Prometheus offers several configurations for these jobs:
Native Job Configuration: This provides native Prometheus job configuration for scraping.
Pod Monitor: In the Kubernetes ecosystem, it allows scraping of monitoring data from Pods using Prometheus Operator.
Service Monitor: In the Kubernetes ecosystem, it allows scraping monitoring data from Endpoints of Services using Prometheus Operator.
# Name of the scraping job, also adds a label (job=job_name) to the scraped metrics\njob_name: <job_name>\n\n# Time interval between scrapes\n[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]\n\n# Timeout for scrape requests\n[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]\n\n# URI path for the scrape request\n[ metrics_path: <path> | default = /metrics ]\n\n# Handling of label conflicts between scraped labels and labels added by the backend Prometheus.\n# true: Retains the scraped labels and ignores conflicting labels from the backend Prometheus.\n# false: Adds an \"exported_<original-label>\" prefix to the scraped labels and includes the additional labels added by the backend Prometheus.\n[ honor_labels: <boolean> | default = false ]\n\n# Whether to use the timestamp generated by the target being scraped.\n# true: Uses the timestamp from the target if available.\n# false: Ignores the timestamp from the target.\n[ honor_timestamps: <boolean> | default = true ]\n\n# Protocol for the scrape request: http or https\n[ scheme: <scheme> | default = http ]\n\n# URL parameters for the scrape request\nparams:\n [ <string>: [<string>, ...] ]\n\n# Set the value of the `Authorization` header in the scrape request through basic authentication. password/password_file are mutually exclusive, with password_file taking precedence.\nbasic_auth:\n [ username: <string> ]\n [ password: <secret> ]\n [ password_file: <string> ]\n\n# Set the value of the `Authorization` header in the scrape request through bearer token authentication. bearer_token/bearer_token_file are mutually exclusive, with bearer_token taking precedence.\n[ bearer_token: <secret> ]\n\n# Set the value of the `Authorization` header in the scrape request through bearer token authentication. bearer_token/bearer_token_file are mutually exclusive, with bearer_token taking precedence.\n[ bearer_token_file: <filename> ]\n\n# Whether the scrape connection should use a TLS secure channel, configure the proper TLS parameters\ntls_config:\n [ <tls_config> ]\n\n# Use a proxy service to scrape the metrics from the target, specify the address of the proxy service.\n[ proxy_url: <string> ]\n\n# Specify the targets using static configuration, see explanation below.\nstatic_configs:\n [ - <static_config> ... ]\n\n# CVM service discovery configuration, see explanation below.\ncvm_sd_configs:\n [ - <cvm_sd_config> ... ]\n\n# After scraping the data, rewrite the labels of the proper target using the relabel mechanism. Executes multiple relabel rules in order.\n# See explanation below for relabel_config.\nrelabel_configs:\n [ - <relabel_config> ... ]\n\n# Before writing the scraped data, rewrite the values of the labels using the relabel mechanism. Executes multiple relabel rules in order.\n# See explanation below for relabel_config.\nmetric_relabel_configs:\n [ - <relabel_config> ... ]\n\n# Limit the number of data points per scrape, 0: no limit, default is 0\n[ sample_limit: <int> | default = 0 ]\n\n# Limit the number of targets per scrape, 0: no limit, default is 0\n[ target_limit: <int> | default = 0 ]\n
The explanation for the proper configmaps is as follows:
# Prometheus Operator CRD version\napiVersion: monitoring.coreos.com/v1\n# proper Kubernetes resource type, here it is PodMonitor\nkind: PodMonitor\n# proper Kubernetes Metadata, only the name needs to be concerned. If jobLabel is not specified, the value of the job label in the scraped metrics will be <namespace>/<name>\nmetadata:\n name: redis-exporter # Specify a unique name\n namespace: cm-prometheus # Fixed namespace, no need to modify\n# Describes the selection and configuration of the target Pods to be scraped\n labels:\n operator.insight.io/managed-by: insight # Label indicating managed by Insight\nspec:\n # Specify the label of the proper Pod, pod monitor will use this value as the job label value.\n # If viewing the Pod YAML, use the values in pod.metadata.labels.\n # If viewing Deployment/Daemonset/Statefulset, use spec.template.metadata.labels.\n [ jobLabel: string ]\n # Adds the proper Pod's Labels to the Target's Labels\n [ podTargetLabels: []string ]\n # Limit the number of data points per scrape, 0: no limit, default is 0\n [ sampleLimit: uint64 ]\n # Limit the number of targets per scrape, 0: no limit, default is 0\n [ targetLimit: uint64 ]\n # Configure the Prometheus HTTP endpoints that need to be scraped and exposed. Multiple endpoints can be configured.\n podMetricsEndpoints:\n [ - <endpoint_config> ... ] # See explanation below for endpoint\n # Select the namespaces where the monitored Pods are located. Leave it blank to select all namespaces.\n [ namespaceSelector: ]\n # Select all namespaces\n [ any: bool ]\n # Specify the list of namespaces to be selected\n [ matchNames: []string ]\n # Specify the Label values of the Pods to be monitored in order to locate the target Pods [K8S metav1.LabelSelector](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#labelselector-v1-meta)\n selector:\n [ matchExpressions: array ]\n [ example: - {key: tier, operator: In, values: [cache]} ]\n [ matchLabels: object ]\n [ example: k8s-app: redis-exporter ]\n
apiVersion: monitoring.coreos.com/v1\nkind: PodMonitor\nmetadata:\n name: redis-exporter # Specify a unique name\n namespace: cm-prometheus # Fixed namespace, do not modify\n labels:\n operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.\nspec:\n podMetricsEndpoints:\n - interval: 30s\n port: metric-port # Specify the Port Name proper to Prometheus Exporter in the pod YAML\n path: /metrics # Specify the value of the Path proper to Prometheus Exporter, if not specified, default is /metrics\n relabelings:\n - action: replace\n sourceLabels:\n - instance\n regex: (.*)\n targetLabel: instance\n replacement: \"crs-xxxxxx\" # Adjust to the proper Redis instance ID\n - action: replace\n sourceLabels:\n - instance\n regex: (.*)\n targetLabel: ip\n replacement: \"1.x.x.x\" # Adjust to the proper Redis instance IP\n namespaceSelector: # Select the namespaces where the monitored Pods are located\n matchNames:\n - redis-test\n selector: # Specify the Label values of the Pods to be monitored in order to locate the target pods\n matchLabels:\n k8s-app: redis-exporter\n
The explanation for the proper configmaps is as follows:
# Prometheus Operator CRD version\napiVersion: monitoring.coreos.com/v1\n# proper Kubernetes resource type, here it is ServiceMonitor\nkind: ServiceMonitor\n# proper Kubernetes Metadata, only the name needs to be concerned. If jobLabel is not specified, the value of the job label in the scraped metrics will be the name of the Service.\nmetadata:\n name: redis-exporter # Specify a unique name\n namespace: cm-prometheus # Fixed namespace, no need to modify\n# Describes the selection and configuration of the target Pods to be scraped\n labels:\n operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.\nspec:\n # Specify the label(metadata/labels) of the proper Pod, service monitor will use this value as the job label value.\n [ jobLabel: string ]\n # Adds the Labels of the proper service to the Target's Labels\n [ targetLabels: []string ]\n # Adds the Labels of the proper Pod to the Target's Labels\n [ podTargetLabels: []string ]\n # Limit the number of data points per scrape, 0: no limit, default is 0\n [ sampleLimit: uint64 ]\n # Limit the number of targets per scrape, 0: no limit, default is 0\n [ targetLimit: uint64 ]\n # Configure the Prometheus HTTP endpoints that need to be scraped and exposed. Multiple endpoints can be configured.\n endpoints:\n [ - <endpoint_config> ... ] # See explanation below for endpoint\n # Select the namespaces where the monitored Pods are located. Leave it blank to select all namespaces.\n [ namespaceSelector: ]\n # Select all namespaces\n [ any: bool ]\n # Specify the list of namespaces to be selected\n [ matchNames: []string ]\n # Specify the Label values of the Pods to be monitored in order to locate the target Pods [K8S metav1.LabelSelector](https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#labelselector-v1-meta)\n selector:\n [ matchExpressions: array ]\n [ example: - {key: tier, operator: In, values: [cache]} ]\n [ matchLabels: object ]\n [ example: k8s-app: redis-exporter ]\n
apiVersion: monitoring.coreos.com/v1\nkind: ServiceMonitor\nmetadata:\n name: go-demo # Specify a unique name\n namespace: cm-prometheus # Fixed namespace, do not modify\n labels:\n operator.insight.io/managed-by: insight # Label indicating managed by Insight, required.\nspec:\n endpoints:\n - interval: 30s\n # Specify the Port Name proper to Prometheus Exporter in the service YAML\n port: 8080-8080-tcp\n # Specify the value of the Path proper to Prometheus Exporter, if not specified, default is /metrics\n path: /metrics\n relabelings:\n # ** There must be a label named 'application', assuming there is a label named 'app' in k8s,\n # we replace it with 'application' using the relabel 'replace' action\n - action: replace\n sourceLabels: [__meta_kubernetes_pod_label_app]\n targetLabel: application\n # Select the namespace where the monitored service is located\n namespaceSelector:\n matchNames:\n - golang-demo\n # Specify the Label values of the service to be monitored in order to locate the target service\n selector:\n matchLabels:\n app: golang-app-demo\n
The explanation for the proper configmaps is as follows:
# The name of the proper port. Please note that it's not the actual port number.\n# Default: 80. Possible values are as follows:\n# ServiceMonitor: corresponds to Service>spec/ports/name;\n# PodMonitor: explained as follows:\n# If viewing the Pod YAML, take the value from pod.spec.containers.ports.name.\n# If viewing Deployment/DaemonSet/StatefulSet, take the value from spec.template.spec.containers.ports.name.\n[ port: string | default = 80]\n# The URI path for the scrape request.\n[ path: string | default = /metrics ]\n# The protocol for the scrape: http or https.\n[ scheme: string | default = http]\n# URL parameters for the scrape request.\n[ params: map[string][]string]\n# The interval between scrape requests.\n[ interval: string | default = 30s ]\n# The timeout for the scrape request.\n[ scrapeTimeout: string | default = 30s]\n# Whether the scrape connection should be made over a secure TLS channel, and the TLS configuration.\n[ tlsConfig: TLSConfig ]\n# Read the bearer token value from the specified file and include it in the headers of the scrape request.\n[ bearerTokenFile: string ]\n# Read the bearer token from the specified K8S secret key. Note that the secret namespace must match the PodMonitor/ServiceMonitor.\n[ bearerTokenSecret: string ]\n# Handling conflicts when scraped labels conflict with labels added by the backend Prometheus.\n# true: Keep the scraped labels and ignore the conflicting labels from the backend Prometheus.\n# false: For conflicting labels, prefix the scraped label with 'exported_<original-label>' and add the labels added by the backend Prometheus.\n[ honorLabels: bool | default = false ]\n# Whether to use the timestamp generated on the target during the scrape.\n# true: Use the timestamp on the target if available.\n# false: Ignore the timestamp on the target.\n[ honorTimestamps: bool | default = true ]\n# Basic authentication credentials. Fill in the values of username/password from the proper K8S secret key. Note that the secret namespace must match the PodMonitor/ServiceMonitor.\n[ basicAuth: BasicAuth ]\n# Scrape the metrics from the target through a proxy server. Specify the address of the proxy server.\n[ proxyUrl: string ]\n# After scraping the data, rewrite the values of the labels on the target using the relabeling mechanism. Multiple relabel rules are executed in order.\n# See explanation below for relabel_config\nrelabelings:\n[ - <relabel_config> ...]\n# Before writing the scraped data, rewrite the values of the proper labels on the target using the relabeling mechanism. Multiple relabel rules are executed in order.\n# See explanation below for relabel_config\nmetricRelabelings:\n[ - <relabel_config> ...]\n
The corresponding configuration items are explained as follows:
```yaml
# Specifies which labels to take from the original labels for relabeling. The values taken are concatenated using the separator defined in the configuration.
# For PodMonitor/ServiceMonitor, the corresponding configuration item is sourceLabels.
[ source_labels: '[' <labelname> [, ...] ']' ]
# Defines the character used to concatenate the values of the labels to be relabeled. Default is ';'.
[ separator: <string> | default = ; ]

# When the action is replace/hashmod, target_label is used to specify the corresponding label name.
# For PodMonitor/ServiceMonitor, the corresponding configuration item is targetLabel.
[ target_label: <labelname> ]

# Regular expression used to match the values of the source labels.
[ regex: <regex> | default = (.*) ]

# Used when action is hashmod; it takes the modulus value based on the MD5 hash of the source label's value.
[ modulus: <int> ]

# Used when action is replace; it defines the expression to replace when the regex matches. It can use regular expression replacement with regex.
[ replacement: <string> | default = $1 ]

# Actions performed based on the matched values of regex. The available actions are as follows, with replace being the default:
# replace: If the regex matches, replace the corresponding value with the value defined in replacement. Set the value using target_label and add the corresponding label.
# keep: If the regex doesn't match, discard the value.
# drop: If the regex matches, discard the value.
# hashmod: Take the modulus of the MD5 hash of the source label's value based on the value specified in modulus.
#   Add a new label with a label name specified by target_label.
# labelmap: If the regex matches, replace the corresponding label name with the value specified in replacement.
# labeldrop: If the regex matches, delete the corresponding label.
# labelkeep: If the regex doesn't match, delete the corresponding label.
[ action: <relabel_action> | default = replace ]
```
Insight uses the Blackbox Exporter provided by Prometheus as a blackbox monitoring solution, allowing detection of target instances via HTTP, HTTPS, DNS, ICMP, TCP, and gRPC. It can be used in the following scenarios:
HTTP/HTTPS: URL/API availability monitoring
ICMP: Host availability monitoring
TCP: Port availability monitoring
DNS: Domain name resolution
This page explains how to configure custom probers in an existing Blackbox ConfigMap.
The ICMP prober is not enabled by default in Insight because it requires higher permissions. Therefore, we will use the HTTP prober as an example to demonstrate how to modify the ConfigMap to achieve custom HTTP probing.
```yaml
module:
  ICMP: # Example of ICMP prober configuration
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
  icmp_example: # Example 2 of ICMP prober configuration
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
      source_ip_address: "127.0.0.1"
```
Since ICMP requires higher permissions, we also need to elevate the pod permissions. Otherwise, an operation not permitted error will occur. There are two ways to elevate permissions:
Directly edit the BlackBox Exporter deployment file to enable it
The following YAML file contains various probers such as HTTP, TCP, SMTP, ICMP, and DNS. You can modify the configuration file of insight-agent-prometheus-blackbox-exporter according to your needs.
Click to view the complete YAML file
kind: ConfigMap\napiVersion: v1\nmetadata:\n name: insight-agent-prometheus-blackbox-exporter\n namespace: insight-system\n labels:\n app.kubernetes.io/instance: insight-agent\n app.kubernetes.io/managed-by: Helm\n app.kubernetes.io/name: prometheus-blackbox-exporter\n app.kubernetes.io/version: v0.24.0\n helm.sh/chart: prometheus-blackbox-exporter-8.8.0\n annotations:\n meta.helm.sh/release-name: insight-agent\n meta.helm.sh/release-namespace: insight-system\ndata:\n blackbox.yaml: |\n modules:\n HTTP_GET:\n prober: http\n timeout: 5s\n http:\n method: GET\n valid_http_versions: [\"HTTP/1.1\", \"HTTP/2.0\"]\n follow_redirects: true\n preferred_ip_protocol: \"ip4\"\n HTTP_POST:\n prober: http\n timeout: 5s\n http:\n method: POST\n body_size_limit: 1MB\n TCP:\n prober: tcp\n timeout: 5s\n # Not enabled by default:\n # ICMP:\n # prober: icmp\n # timeout: 5s\n # icmp:\n # preferred_ip_protocol: ip4\n SSH:\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - expect: \"^SSH-2.0-\"\n POP3S:\n prober: tcp\n tcp:\n query_response:\n - expect: \"^+OK\"\n tls: true\n tls_config:\n insecure_skip_verify: false\n http_2xx_example: # http prober example\n prober: http\n timeout: 5s # probe timeout\n http:\n valid_http_versions: [\"HTTP/1.1\", \"HTTP/2.0\"] # Version in the response, usually default\n valid_status_codes: [] # Defaults to 2xx # Valid range of response codes, probe successful if within this range\n method: GET # request method\n headers: # request headers\n Host: vhost.example.com\n Accept-Language: en-US\n Origin: example.com\n no_follow_redirects: false # allow redirects\n fail_if_ssl: false \n fail_if_not_ssl: false\n fail_if_body_matches_regexp:\n - \"Could not connect to database\"\n fail_if_body_not_matches_regexp:\n - \"Download the latest version here\"\n fail_if_header_matches: # Verifies that no cookies are set\n - header: Set-Cookie\n allow_missing: true\n regexp: '.*'\n fail_if_header_not_matches:\n - header: Access-Control-Allow-Origin\n regexp: '(\\*|example\\.com)'\n tls_config: # tls configuration for https requests\n insecure_skip_verify: false\n preferred_ip_protocol: \"ip4\" # defaults to \"ip6\" # Preferred IP protocol version\n ip_protocol_fallback: false # no fallback to \"ip6\" \n http_post_2xx: # http prober example with body\n prober: http\n timeout: 5s\n http:\n method: POST # probe request method\n headers:\n Content-Type: application/json\n body: '{\"username\":\"admin\",\"password\":\"123456\"}' # body carried during probe\n http_basic_auth_example: # prober example with username and password\n prober: http\n timeout: 5s\n http:\n method: POST\n headers:\n Host: \"login.example.com\"\n basic_auth: # username and password to be added during probe\n username: \"username\"\n password: \"mysecret\"\n http_custom_ca_example:\n prober: http\n http:\n method: GET\n tls_config: # root certificate used during probe\n ca_file: \"/certs/my_cert.crt\"\n http_gzip:\n prober: http\n http:\n method: GET\n compression: gzip # compression method used during probe\n http_gzip_with_accept_encoding:\n prober: http\n http:\n method: GET\n compression: gzip\n headers:\n Accept-Encoding: gzip\n tls_connect: # TCP prober example\n prober: tcp\n timeout: 5s\n tcp:\n tls: true # use TLS\n tcp_connect_example:\n prober: tcp\n timeout: 5s\n imap_starttls: # IMAP email server probe configuration example\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - expect: \"OK.*STARTTLS\"\n - send: \". STARTTLS\"\n - expect: \"OK\"\n - starttls: true\n - send: \". 
capability\"\n - expect: \"CAPABILITY IMAP4rev1\"\n smtp_starttls: # SMTP email server probe configuration example\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - expect: \"^220 ([^ ]+) ESMTP (.+)$\"\n - send: \"EHLO prober\\r\"\n - expect: \"^250-STARTTLS\"\n - send: \"STARTTLS\\r\"\n - expect: \"^220\"\n - starttls: true\n - send: \"EHLO prober\\r\"\n - expect: \"^250-AUTH\"\n - send: \"QUIT\\r\"\n irc_banner_example:\n prober: tcp\n timeout: 5s\n tcp:\n query_response:\n - send: \"NICK prober\"\n - send: \"USER prober prober prober :prober\"\n - expect: \"PING :([^ ]+)\"\n send: \"PONG ${1}\"\n - expect: \"^:[^ ]+ 001\"\n # icmp_example: # ICMP prober configuration example\n # prober: icmp\n # timeout: 5s\n # icmp:\n # preferred_ip_protocol: \"ip4\"\n # source_ip_address: \"127.0.0.1\"\n dns_udp_example: # DNS query example using UDP\n prober: dns\n timeout: 5s\n dns:\n query_name: \"www.prometheus.io\" # domain name to resolve\n query_type: \"A\" # type proper to this domain\n valid_rcodes:\n - NOERROR\n validate_answer_rrs:\n fail_if_matches_regexp:\n - \".*127.0.0.1\"\n fail_if_all_match_regexp:\n - \".*127.0.0.1\"\n fail_if_not_matches_regexp:\n - \"www.prometheus.io.\\t300\\tIN\\tA\\t127.0.0.1\"\n fail_if_none_matches_regexp:\n - \"127.0.0.1\"\n validate_authority_rrs:\n fail_if_matches_regexp:\n - \".*127.0.0.1\"\n validate_additional_rrs:\n fail_if_matches_regexp:\n - \".*127.0.0.1\"\n dns_soa:\n prober: dns\n dns:\n query_name: \"prometheus.io\"\n query_type: \"SOA\"\n dns_tcp_example: # DNS query example using TCP\n prober: dns\n dns:\n transport_protocol: \"tcp\" # defaults to \"udp\"\n preferred_ip_protocol: \"ip4\" # defaults to \"ip6\"\n query_name: \"www.prometheus.io\"\n
"},{"location":"en/end-user/insight/collection-manag/service-monitor.html","title":"Configure service discovery rules","text":"
Insight supports creating the ServiceMonitor CRD through Container Management to meet custom service discovery collection requirements. Users can use a ServiceMonitor to define the namespace scope for Pod discovery and select the monitored Service through matchLabel .
This is the service endpoint, which represents the address where Prometheus collects Metrics. endpoints is an array, and multiple endpoints can be created at the same time. Each endpoint contains three fields, and the meaning of each field is as follows:
interval : Specifies the collection cycle of Prometheus for the current endpoint . The unit is seconds, set to 15s in this example.
path : Specifies the collection path of Prometheus. In this example, it is specified as /actuator/prometheus .
port : Specifies the port used to collect data. Its value is the name of the port defined in the Service being collected.
This is the scope of the Service that needs to be discovered. namespaceSelector contains two mutually exclusive fields, and the meaning of the fields is as follows:
any : Only one value true , when this field is set, it will listen to changes of all Services that meet the Selector filtering conditions.
matchNames : An array value that specifies the scope of namespace to be monitored. For example, if you only want to monitor the Services in two namespaces, default and insight-system, the matchNames are set as follows:
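For example, a sketch of the relevant ServiceMonitor fragment for these two namespaces:

```yaml
namespaceSelector:
  matchNames:
    - default
    - insight-system
```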
Grafana is a cross-platform open source visual analysis tool. Insight uses open source Grafana to provide monitoring services, and supports viewing resource consumption from multiple dimensions such as clusters, nodes, and namespaces.
For more information on open source Grafana, see Grafana Official Documentation.
In the Insight / Overview dashboard, you can view the resource usage of multiple clusters and analyze resource usage, network, storage, and more based on dimensions such as namespaces and Pods.
Click the dropdown menu in the upper-left corner of the dashboard to switch between clusters.
Click the lower-right corner of the dashboard to switch the time range for queries.
Insight provides several recommended dashboards that allow monitoring from different dimensions such as nodes, namespaces, and workloads. Switch between dashboards by clicking the insight-system / Insight / Overview section.
Note
For accessing Grafana UI, refer to Access Native Grafana.
For importing custom dashboards, refer to Importing Custom Dashboards.
By using Grafana CRD, you can incorporate the management and deployment of dashboards into the lifecycle management of Kubernetes. This enables version control, automated deployment, and cluster-level management of dashboards. This page describes how to import custom dashboards using CRD and the UI interface.
Log in to the AI platform and go to Container Management . Select kpanda-global-cluster from the cluster list.
Choose Custom Resources from the left navigation bar. Look for grafanadashboards.integreatly.org in the list and click it to view the details.
Click YAML Create and use the following template. Replace the dashboard JSON in the Json field.
namespace : Specify the target namespace.
name : Provide a name for the dashboard.
label : Mandatory. Set the label as operator.insight.io/managed-by: insight .
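A sketch of such a template, assuming the grafana-operator integreatly.org/v1alpha1 API version (the dashboard name, namespace, and JSON content here are placeholders; replace the json field with your dashboard JSON):

```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: sample-dashboard              # name of the dashboard
  namespace: insight-system           # target namespace; adjust as needed
  labels:
    operator.insight.io/managed-by: insight   # mandatory label
spec:
  json: |
    {
      "title": "Sample Dashboard",
      "panels": []
    }
```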
Insight only collects data from clusters that have insight-agent installed and running in a normal state. The overview provides an overview of resources across multiple clusters:
Alert Statistics: Provides statistics on active alerts across all clusters.
Resource Consumption: Displays the resource usage trends for the top 5 clusters and nodes in the past hour, based on CPU usage, memory usage, and disk usage.
By default, the sorting is based on CPU usage. You can switch the metric to sort clusters and nodes.
Resource Trends: Shows the trends in the number of nodes over the past 15 days and the running trend of pods in the last hour.
Service Requests Ranking: Displays the top 5 services with the highest request latency and error rates, along with their respective clusters and namespaces in the multi-cluster environment.
By default, Insight collects node logs, container logs, and Kubernetes audit logs. In the log query page, you can search for standard output (stdout) logs within the permissions of your login account. This includes node logs, product logs, and Kubernetes audit logs. You can quickly find the desired logs among a large volume of logs. Additionally, you can use the source information and contextual raw data of the logs to assist in troubleshooting and issue resolution.
In the left navigation bar, select Data Query -> Log Query .
After selecting the query criteria, click Search , and the log records in the form of graphs will be displayed. The most recent logs are displayed on top.
In the Filter panel, switch Type and select Node to check the logs of all nodes in the cluster.
In the Filter panel, switch Type and select Event to view the logs generated by all Kubernetes events in the cluster.
Lucene Syntax Explanation:
Use logical operators (AND, OR, NOT, "") to query multiple keywords. For example: keyword1 AND (keyword2 OR keyword3) NOT keyword4.
Use a tilde (~) for fuzzy queries. You can optionally specify a parameter after the "~" to control the similarity of the fuzzy query. If not specified, it defaults to 0.5. For example: error~.
Use wildcards: ? is a single-character placeholder, and * matches zero or more characters.
Use square brackets [ ] or curly braces { } for range queries. Square brackets [ ] represent a closed interval and include the boundary values. Curly braces { } represent an open interval and exclude the boundary values. Range queries are applicable only to fields that can be sorted, such as numeric fields and date fields. For example timestamp:[2022-01-01 TO 2022-01-31].
Clicking on the button next to a log will slide out a panel on the right side where you can view the default 100 lines of context for that log. You can switch the Display Rows option to view more contextual content.
Metric query supports querying the metric data of each container resource, and you can view the trend of the monitoring metrics. In addition, advanced query supports native PromQL statements for metric queries.
In the left navigation bar, click Data Query -> Metric Query .
After selecting query conditions such as cluster, type, node, and metric name, click Search , and the corresponding metric chart and data details will be displayed on the right side of the screen.
Tip
Supports a custom time range. You can manually click the Refresh icon or select a preset time interval to refresh.
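For the advanced query mode, a hypothetical PromQL example (the namespace label value is a placeholder) that returns the per-pod CPU usage rate within a namespace:

```promql
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)
```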
Through cluster monitoring, you can view the basic information of the cluster, the resource consumption and the trend of resource consumption over a period of time.
Select Infrastructure > Clusters from the left navigation bar. On this page, you can view the following information:
Resource Overview: Provides statistics on the number of normal/all nodes and workloads across multiple clusters.
Fault: Displays the number of alerts generated in the current cluster.
Resource Consumption: Shows the actual usage and total capacity of CPU, memory, and disk for the selected cluster.
Metric Explanations: Describes the trends in CPU, memory, disk I/O, and network bandwidth.
Click Resource Level Monitor to view more metrics of the current cluster.
"},{"location":"en/end-user/insight/infra/cluster.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The ratio of the actual CPU usage of all pod resources in the cluster to the total CPU capacity of all nodes. CPU Allocation The ratio of the sum of CPU requests of all pods in the cluster to the total CPU capacity of all nodes. Memory Usage The ratio of the actual memory usage of all pod resources in the cluster to the total memory capacity of all nodes. Memory Allocation The ratio of the sum of memory requests of all pods in the cluster to the total memory capacity of all nodes."},{"location":"en/end-user/insight/infra/container.html","title":"Container Insight","text":"
Container insight is the process of monitoring workloads in cluster management. In the list, you can view basic information and status of workloads. On the Workloads details page, you can see the number of active alerts and the trend of resource consumption such as CPU and memory.
Follow these steps to view service monitoring metrics:
Go to the Insight product module.
Select Infrastructure > Workloads from the left navigation bar.
Switch between tabs at the top to view data for different types of workloads.
Click the target workload name to view the details.
Faults: Displays the total number of active alerts for the workload.
Resource Consumption: Shows the CPU, memory, and network usage of the workload.
Monitoring Metrics: Provides the trends of CPU, Memory, Network, and disk usage for the workload over the past hour.
Switch to the Pods tab to view the status of various pods for the workload, including their nodes, restart counts, and other information.
Switch to the JVM monitor tab to view the JVM metrics for each pod.
Note
The JVM monitoring feature only supports the Java language.
To enable the JVM monitoring feature, refer to Getting Started with Monitoring Java Applications.
"},{"location":"en/end-user/insight/infra/container.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The sum of CPU usage for all pods under the workload. CPU Requests The sum of CPU requests for all pods under the workload. CPU Limits The sum of CPU limits for all pods under the workload. Memory Usage The sum of memory usage for all pods under the workload. Memory Requests The sum of memory requests for all pods under the workload. Memory Limits The sum of memory limits for all pods under the workload. Disk Read/Write Rate The total number of continuous disk reads and writes per second within the specified time range, representing a performance measure of the number of read and write operations per second on the disk. Network Send/Receive Rate The incoming and outgoing rates of network traffic, aggregated by workload, within the specified time range."},{"location":"en/end-user/insight/infra/event.html","title":"Event Query","text":"
AI platform Insight supports event querying by cluster and namespace.
"},{"location":"en/end-user/insight/infra/event.html#event-status-distribution","title":"Event Status Distribution","text":"
By default, the events that occurred within the last 12 hours are displayed. You can select a different time range in the upper right corner to view longer or shorter periods. You can also customize the sampling interval from 1 minute to 5 hours.
The event status distribution chart provides a visual representation of the intensity and dispersion of events. This helps in evaluating and preparing for subsequent cluster operations and maintenance tasks. If events are densely concentrated during specific time periods, you may need to allocate more resources or take proper measures to ensure cluster stability and high availability. On the other hand, if events are dispersed, you can effectively schedule other maintenance tasks such as system optimization, upgrades, or handling other tasks during this period.
By considering the event status distribution chart and the selected time range, you can better plan and manage your cluster operations and maintenance work, ensuring system stability and reliability.
"},{"location":"en/end-user/insight/infra/event.html#event-count-and-statistics","title":"Event Count and Statistics","text":"
Through important event statistics, you can easily understand the number of image pull failures, health check failures, Pod execution failures, Pod scheduling failures, container OOM (Out-of-Memory) occurrences, volume mounting failures, and the total count of all events. These events are typically categorized as "Warning" and "Normal".
Select Infrastructure -> Namespaces from the left navigation bar. On this page, you can view the following information:
Switch Namespace: Switch between clusters or namespaces at the top.
Resource Overview: Provides statistics on the number of normal and total workloads within the selected namespace.
Incidents: Displays the number of alerts generated within the selected namespace.
Events: Shows the number of Warning level events within the selected namespace in the past 24 hours.
Resource Consumption: Provides the sum of CPU and memory usage for Pods within the selected namespace, along with the CPU and memory quota information.
"},{"location":"en/end-user/insight/infra/namespace.html#metric-explanations","title":"Metric Explanations","text":"Metric Name Description CPU Usage The sum of CPU usage for Pods within the selected namespace. Memory Usage The sum of memory usage for Pods within the selected namespace. Pod CPU Usage The CPU usage for each Pod within the selected namespace. Pod Memory Usage The memory usage for each Pod within the selected namespace."},{"location":"en/end-user/insight/infra/node.html","title":"Node Monitoring","text":"
Through node monitoring, you can get an overview of the current health status of the nodes in the selected cluster and the number of abnormal pods. On the node details page, you can view the number of alerts and the trend of resource consumption such as CPU, memory, and disk.
Probe refers to the use of black-box monitoring to regularly test the connectivity of targets through HTTP, TCP, and other methods, enabling quick detection of ongoing faults.
Insight uses the Prometheus Blackbox Exporter tool to probe the network using protocols such as HTTP, HTTPS, DNS, TCP, and ICMP, and returns the probe results to understand the network status.
Select Infrastructure -> Probes in the left navigation bar.
Click the cluster or namespace dropdown in the table to switch between clusters and namespaces.
The list displays the name, probe method, probe target, connectivity status, and creation time of the probes by default.
The connectivity status can be:
Normal: The probe successfully connects to the target, and the target returns the expected response.
Abnormal: The probe fails to connect to the target, or the target does not return the expected response.
Pending: The probe is attempting to connect to the target.
Supports fuzzy search of probe names.
"},{"location":"en/end-user/insight/infra/probe.html#create-a-probe","title":"Create a Probe","text":"
Click Create Probe .
Fill in the basic information and click Next .
Name: The name can only contain lowercase letters, numbers, and hyphens (-), and must start and end with a lowercase letter or number, with a maximum length of 63 characters.
Cluster: Select the cluster for the probe task.
Namespace: The namespace where the probe task is located.
Configure the probe parameters.
Blackbox Instance: Select the blackbox instance responsible for the probe.
Probe Method:
HTTP: Sends HTTP or HTTPS requests to the target URL to check its connectivity and response time. This can be used to monitor the availability and performance of websites or web applications.
TCP: Establishes a TCP connection to the target host and port to check its connectivity and response time. This can be used to monitor TCP-based services such as web servers and database servers.
Other: Supports custom probe methods by configuring ConfigMap. For more information, refer to: Custom Probe Methods
Probe Target: The target address of the probe, supports domain names or IP addresses.
Labels: Custom labels that will be automatically added to Prometheus' labels.
Probe Interval: The interval between probes.
Probe Timeout: The maximum waiting time when probing the target.
After configuring, click OK to complete the creation.
Warning
After the probe task is created, it takes about 3 minutes to synchronize the configuration. During this period, no probes will be performed, and probe results cannot be viewed.
Click ┇ in the operations column and click View Monitoring Dashboard .
| Metric Name | Description |
| ----------- | ----------- |
| Current Status Response | Represents the response status code of the HTTP probe request. |
| Ping Status | Indicates whether the probe request was successful. 1 indicates a successful probe request, and 0 indicates a failed probe request. |
| IP Protocol | Indicates the IP protocol version used in the probe request. |
| SSL Expiry | Represents the earliest expiration time of the SSL/TLS certificate. |
| DNS Response (Latency) | Represents the duration of the entire probe process in seconds. |
| HTTP Duration | Represents the duration of the entire process from sending the request to receiving the complete response. |

Edit a Probe
Click ┇ in the operations column and click Edit .
"},{"location":"en/end-user/insight/infra/probe.html#delete-a-probe","title":"Delete a Probe","text":"
Click ┇ in the operations column and click Delete .
Insight is a multicluster observation product in AI platform. In order to realize the unified collection of multicluster observation data, users need to install the Helm App insight-agent (Installed in insight-system namespace by default). See How to install insight-agent .
In Insight -> Data Collection section, you can view the status of insight-agent installed in each cluster.
not installed : insight-agent is not installed under the insight-system namespace in this cluster
Running : insight-agent is successfully installed in the cluster, and all deployed components are running
Exception : If insight-agent is in this state, it means that the helm deployment failed or the deployed components are not running
The status can be checked as follows:
Run the following command. If the status is deployed , go to the next step. If it is failed , it will affect the application upgrade, so it is recommended to uninstall it in Container Management -> Helm Apps and reinstall it:
```shell
helm list -n insight-system
```
Run the following command, or check the status of the components deployed in the cluster in Insight -> Data Collection . If any pod is not in the Running state, please restart the abnormal pod.
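For example, a quick way to check the component status from the command line (restart any pod that is not in the Running state):

```shell
kubectl get pods -n insight-system
```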
The resource consumption of the metric collection component Prometheus in insight-agent is directly proportional to the number of pods running in the cluster. Adjust Prometheus resources according to the cluster size, please refer to Prometheus Resource Planning.
The storage capacity of the metric storage component vmstorage in the global service cluster is directly proportional to the total number of pods across all clusters.
Please contact the platform administrator to adjust the disk capacity of vmstorage according to the cluster size, see vmstorage disk capacity planning.
Adjust the vmstorage disk according to the multicluster size; see vmstorage disk expansion.
The AI platform enables the management and creation of multicloud and multiple clusters. Building upon this capability, Insight serves as a unified observability solution for multiple clusters. It collects observability data from multiple clusters by deploying the insight-agent plugin and allows querying of metrics, logs, and trace data through AI platform Insight.
insight-agent is a tool that facilitates the collection of observability data from multiple clusters. Once installed, it automatically collects metrics, logs, and trace data without any modifications.
Clusters created through Container Management come pre-installed with insight-agent. Hence, this guide specifically provides instructions on enabling observability for integrated clusters.
Install insight-agent online
As a unified observability platform for multiple clusters, the resource consumption of certain Insight components is closely related to the scale of the created clusters and the number of integrated clusters. When installing insight-agent, adjust the resources of the corresponding components based on the cluster size.
Adjust the CPU and memory resources of the Prometheus collection component in insight-agent according to the size of the cluster created or integrated. Please refer to Prometheus resource planning.
As the metric data from multiple clusters is stored centrally, AI platform administrators need to adjust the disk space of vmstorage based on the cluster size. Please refer to vmstorage disk capacity planning.
For instructions on adjusting the disk space of vmstorage, please refer to Expanding vmstorage disk.
Since AI platform supports the management of multicloud and multiple clusters, insight-agent has undergone partial verification. However, there are known conflicts with monitoring components when installing insight-agent in Suanova 4.0 clusters and Openshift 4.x clusters. If you encounter similar issues, please refer to the following documents:
Install insight-agent in Openshift 4.x
Currently, the insight-agent collection component has undergone functional testing for popular versions of Kubernetes. Please refer to:
Kubernetes cluster compatibility testing
Openshift 4.x cluster compatibility testing
Rancher cluster compatibility testing
"},{"location":"en/end-user/insight/quickstart/install/big-log-and-trace.html","title":"Enable Big Log and Big Trace Modes","text":"
The Insight module supports switching logs to Big Log mode and traces to Big Trace mode, in order to enhance data writing capabilities in large-scale environments. This page introduces the following methods for enabling these modes:
Enable or upgrade to Big Log and Big Trace modes through the installer (controlled by the same parameter value in manifest.yaml)
Manually enable Big Log and Big Trace modes through Helm commands
This mode is referred to as the Kafka mode, and the data flow diagram is shown below:
"},{"location":"en/end-user/insight/quickstart/install/big-log-and-trace.html#enabling-via-installer","title":"Enabling via Installer","text":"
When deploying/upgrading AI platform using the installer, the manifest.yaml file includes the infrastructures.kafka field. To enable observable Big Log and Big Trace modes, Kafka must be activated:
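A minimal sketch of the relevant field; the infrastructures.kafka path comes from the text above, while the name of the switch underneath it (enable) is an assumption to be checked against your manifest.yaml:

```yaml
infrastructures:
  kafka:
    enable: true   # assumed switch name; enabling Kafka turns on Big Log and Big Trace modes
```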
When using a manifest.yaml that enables kafka during installation, Kafka middleware will be installed by default, and Big Log and Big Trace modes will be enabled automatically. The installation command is:
The upgrade also involves modifying the kafka field. However, note that since the old environment was installed with kafka: false, Kafka is not present in the environment. Therefore, you need to specify the upgrade for middleware to install Kafka middleware simultaneously. The upgrade command is:
After the upgrade is complete, you need to manually restart the following components:
insight-agent-fluent-bit
insight-agent-opentelemetry-collector
insight-opentelemetry-collector
"},{"location":"en/end-user/insight/quickstart/install/big-log-and-trace.html#enabling-via-helm-commands","title":"Enabling via Helm Commands","text":"
Prerequisites: Ensure that there is a usable Kafka and that the address is accessible.
Use the following commands to retrieve the values of the old versions of Insight and insight-agent (it's recommended to back them up):
```shell
helm get values insight -n insight-system -o yaml > insight.yaml
helm get values insight-agent -n insight-system -o yaml > insight-agent.yaml
```
"},{"location":"en/end-user/insight/quickstart/install/big-log-and-trace.html#enabling-big-log","title":"Enabling Big Log","text":"
There are several ways to enable or upgrade to Big Log mode:
- Use --set in the helm upgrade command
- Modify the YAML and run helm upgrade
- Upgrade via the Container Management UI
First, run the following Insight upgrade command, ensuring the Kafka brokers address is correct:
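A sketch of what this upgrade command might look like; the chart reference and the values keys (global.kafka.enabled, global.kafka.brokers, vector.enabled) are assumptions here, so check your Insight chart's values.yaml and repository for the actual names and version:

```shell
helm upgrade insight insight/insight \
  -n insight-system \
  -f ./insight.yaml \
  --set global.kafka.enabled=true \
  --set global.kafka.brokers="<kafka-broker-address>:<port>" \
  --set vector.enabled=true \
  --version <insight-version>
```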
In the Container Management module, find the cluster, select Helm Apps from the left navigation bar, and find and update the insight-agent.
In Trace Settings, select kafka for output and fill in the correct brokers address.
Note that after the upgrade is complete, you need to manually restart the insight-agent-opentelemetry-collector and insight-opentelemetry-collector components.
When deploying Insight to a Kubernetes environment, proper resource management and optimization are crucial. Insight includes several core components such as Prometheus, OpenTelemetry, FluentBit, Vector, and Elasticsearch. These components, during their operation, may negatively impact the performance of other pods within the cluster due to resource consumption issues. To effectively manage resources and optimize cluster operations, node affinity becomes an important option.
This page describes how to add taints and node affinity to ensure that each component runs on the appropriate nodes, avoiding resource competition or contention and thereby guaranteeing the stability and efficiency of the entire Kubernetes cluster.
"},{"location":"en/end-user/insight/quickstart/install/component-scheduling.html#configure-dedicated-nodes-for-insight-using-taints","title":"Configure dedicated nodes for Insight using taints","text":"
Since the Insight Agent includes DaemonSet components, the configuration method described in this section is to have all components except the Insight DaemonSet run on dedicated nodes.
This is achieved by adding taints to the dedicated nodes and using tolerations to match them. More details can be found in the Kubernetes official documentation.
You can refer to the following commands to add and remove taints on nodes:
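A sketch of these commands, assuming a node named node1 and the node.daocloud.io=insight-only taint used in the namespace-level configuration below:

```shell
# Add the taint to a dedicated node
kubectl taint nodes node1 node.daocloud.io=insight-only:NoSchedule

# Remove the taint from the node
kubectl taint nodes node1 node.daocloud.io=insight-only:NoSchedule-
```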
There are two ways to schedule Insight components to dedicated nodes:
"},{"location":"en/end-user/insight/quickstart/install/component-scheduling.html#1-add-tolerations-for-each-component","title":"1. Add tolerations for each component","text":"
Configure the tolerations for the insight-server and insight-agent Charts respectively:
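A sketch of the values fragment to add; the taint key and value follow the node.daocloud.io=insight-only example in this section, while the exact location of tolerations inside each chart's values.yaml may differ:

```yaml
tolerations:
  - key: "node.daocloud.io"
    operator: "Equal"
    value: "insight-only"
    effect: "NoSchedule"
```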
"},{"location":"en/end-user/insight/quickstart/install/component-scheduling.html#2-configure-at-the-namespace-level","title":"2. Configure at the namespace level","text":"
Allow pods in the insight-system namespace to tolerate the node.daocloud.io=insight-only taint.
Adjust the apiserver configuration file /etc/kubernetes/manifests/kube-apiserver.yaml to include PodTolerationRestriction and PodNodeSelector in the enabled admission plugins, as sketched below:
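A sketch of the relevant apiserver flag; the other plugins listed are only an example of what may already be present in your cluster:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
    - command:
        - kube-apiserver
        - --enable-admission-plugins=NodeRestriction,PodTolerationRestriction,PodNodeSelector
```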
Add an annotation to the insight-system namespace:
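For example, using the scheduler.alpha.kubernetes.io/defaultTolerations annotation read by the PodTolerationRestriction admission plugin (the taint key and value are the insight-only example used above):

```shell
kubectl annotate namespace insight-system \
  scheduler.alpha.kubernetes.io/defaultTolerations='[{"operator": "Equal", "effect": "NoSchedule", "key": "node.daocloud.io", "value": "insight-only"}]'
```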
Restart the components under the insight-system namespace to allow normal scheduling of pods under the insight-system.
"},{"location":"en/end-user/insight/quickstart/install/component-scheduling.html#use-node-labels-and-node-affinity-to-manage-component-scheduling","title":"Use node labels and node affinity to manage component scheduling","text":"
Info
Node affinity is conceptually similar to nodeSelector, allowing you to constrain which nodes a pod can be scheduled on based on labels on the nodes. There are two types of node affinity:
requiredDuringSchedulingIgnoredDuringExecution: The scheduler will only schedule the pod if the rules are met. This feature is similar to nodeSelector but has more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution: The scheduler will try to find nodes that meet the rules. If no matching nodes are found, the scheduler will still schedule the Pod.
For more details, please refer to the Kubernetes official documentation.
To meet different user needs for scheduling Insight components, Insight provides fine-grained labels for different components' scheduling policies. Below is a description of the labels and their associated components:
| Label Key | Label Value | Description |
| --------- | ----------- | ----------- |
| node.daocloud.io/insight-any | Any value, recommended to use true | Represents that all Insight components prefer nodes with this label |
| node.daocloud.io/insight-prometheus | Any value, recommended to use true | Specifically for Prometheus components |
| node.daocloud.io/insight-vmstorage | Any value, recommended to use true | Specifically for VictoriaMetrics vmstorage components |
| node.daocloud.io/insight-vector | Any value, recommended to use true | Specifically for Vector components |
| node.daocloud.io/insight-otel-col | Any value, recommended to use true | Specifically for OpenTelemetry components |
You can refer to the following commands to add and remove labels on nodes:
```shell
# Add the label to node8, prioritizing scheduling insight-prometheus to node8
kubectl label nodes node8 node.daocloud.io/insight-prometheus=true

# Remove the node.daocloud.io/insight-prometheus label from node8
kubectl label nodes node8 node.daocloud.io/insight-prometheus-
```
Below is the default affinity preference for the insight-prometheus component during deployment:
Prioritize scheduling insight-prometheus to nodes with the node.daocloud.io/insight-prometheus label
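A sketch of what this preference looks like in the Pod spec; the label key comes from the table above, while the weight value is an assumption:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 70
        preference:
          matchExpressions:
            - key: node.daocloud.io/insight-prometheus
              operator: Exists
```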
"},{"location":"en/end-user/insight/quickstart/install/gethosturl.html","title":"Get Data Storage Address of Global Service Cluster","text":"
Insight is a product for unified observation of multiple clusters. To achieve unified storage and querying of observation data from multiple clusters, sub-clusters need to report the collected observation data to the global service cluster for unified storage. This document provides the required address of the storage component when installing the collection component insight-agent.
"},{"location":"en/end-user/insight/quickstart/install/gethosturl.html#install-insight-agent-in-global-service-cluster","title":"Install insight-agent in Global Service Cluster","text":"
If installing insight-agent in the global service cluster, it is recommended to access the cluster via domain name:
"},{"location":"en/end-user/insight/quickstart/install/gethosturl.html#install-insight-agent-in-other-clusters","title":"Install insight-agent in Other Clusters","text":""},{"location":"en/end-user/insight/quickstart/install/gethosturl.html#get-address-via-interface-provided-by-insight-server","title":"Get Address via Interface Provided by Insight Server","text":"
The management cluster uses the default LoadBalancer mode for exposure.
Log in to the console of the global service cluster and run the following command:
```shell
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam'
```
Note
Please replace the ${INSIGHT_SERVER_IP} parameter in the command.
global.exporters.logging.host is the log service address; there is no need to set the corresponding service port, as the default value will be used.
global.exporters.metric.host is the metrics service address.
global.exporters.trace.host is the trace service address.
global.exporters.auditLog.host is the audit log service address (same service as trace but different port).
When the management cluster has LoadBalancer disabled
When calling the interface, you need to additionally pass an externally accessible node IP from the cluster, which will be used to construct the complete access address of the proper service.
```shell
export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP})
curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam' --data '{"extra": {"EXPORTER_EXTERNAL_IP": "10.5.14.51"}}'
```
global.exporters.logging.host is the log service address.
global.exporters.logging.port is the NodePort exposed by the log service.
global.exporters.metric.host is the metrics service address.
global.exporters.metric.port is the NodePort exposed by the metrics service.
global.exporters.trace.host is the trace service address.
global.exporters.trace.port is the NodePort exposed by the trace service.
global.exporters.auditLog.host is the audit log service address (same service as trace but different port).
global.exporters.auditLog.port is the NodePort exposed by the audit log service.
"},{"location":"en/end-user/insight/quickstart/install/gethosturl.html#connect-via-loadbalancer","title":"Connect via LoadBalancer","text":"
If LoadBalancer is enabled in the cluster and a VIP is set for Insight, you can manually execute the following command to obtain the address information for vminsert and opentelemetry-collector:
```shell
$ kubectl get service -n insight-system | grep lb
lb-insight-opentelemetry-collector               LoadBalancer   10.233.23.12   <pending>   4317:31286/TCP,8006:31351/TCP   24d
lb-vminsert-insight-victoria-metrics-k8s-stack   LoadBalancer   10.233.63.67   <pending>   8480:31629/TCP                  24d
```
lb-vminsert-insight-victoria-metrics-k8s-stack is the address for the metrics service.
lb-insight-opentelemetry-collector is the address for the tracing service.
Execute the following command to obtain the address information for elasticsearch:
```shell
$ kubectl get service -n mcamel-system | grep es
mcamel-common-es-cluster-masters-es-http   NodePort   10.233.16.120   <none>   9200:30465/TCP   47d
```
mcamel-common-es-cluster-masters-es-http is the address for the logging service.
"},{"location":"en/end-user/insight/quickstart/install/gethosturl.html#connect-via-nodeport","title":"Connect via NodePort","text":"
The LoadBalancer feature is disabled in the global service cluster.
In this case, the LoadBalancer resources mentioned above will not be created by default. The relevant service names are:
insight-agent is a plugin for collecting Insight data, supporting unified observation of metrics, traces, and log data. This article describes how to install insight-agent in an online environment for an integrated cluster.
Enter Container Management from the left navigation bar, and enter Clusters . Find the cluster where you want to install insight-agent.
Choose Install now to jump, or click the cluster and click Helm Apps -> Helm Templates in the left navigation bar, search for insight-agent in the search box, and click it for details.
Select the appropriate version and click Install .
Fill in the name, select the namespace and version, and fill in the addresses of logging, metric, audit, and trace reporting data in the yaml file. The system has filled in the address of the component for data reporting by default, please check it before clicking OK to install.
If you need to modify the data reporting address, please refer to Get Data Reporting Address.
The system will automatically return to Helm Apps . When the application status changes from Unknown to Deployed , it means that insight-agent is installed successfully.
Note
Click ┇ on the far right, and you can perform more operations such as Update , View YAML and Delete in the pop-up menu.
For a practical installation demo, watch Video demo of installing insight-agent
This page lists some issues related to the installation and uninstallation of Insight Agent and their workarounds.
"},{"location":"en/end-user/insight/quickstart/install/knownissues.html#uninstallation-failure-of-insight-agent","title":"Uninstallation Failure of Insight Agent","text":"
When you run the following command to uninstall Insight Agent,
```shell
helm uninstall insight-agent -n insight-system
```
the TLS secret used by otel-operator fails to be uninstalled.
Due to the "reusing TLS secret" logic in the otel-operator code, it checks whether the MutationConfiguration exists and reuses the CA cert bound in the MutationConfiguration. However, since helm uninstall has already removed the MutationConfiguration, this results in a null value.
Therefore, please manually delete the corresponding secret using one of the following methods:
Delete via command line: Log in to the console of the target cluster and run the following command:
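A sketch of the command, assuming insight-agent is installed in the default insight-system namespace:

```shell
kubectl -n insight-system delete secret insight-agent-opentelemetry-operator-controller-manager-service-cert
```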
Delete via UI: Log in to AI platform container management, select the target cluster, select Secret from the left menu, input insight-agent-opentelemetry-operator-controller-manager-service-cert, then select Delete.
"},{"location":"en/end-user/insight/quickstart/install/knownissues.html#insight-agent_1","title":"Insight Agent","text":""},{"location":"en/end-user/insight/quickstart/install/knownissues.html#log-collection-endpoint-not-updated-when-upgrading-insight-agent","title":"Log Collection Endpoint Not Updated When Upgrading Insight Agent","text":"
When updating the log configuration of the insight-agent from Elasticsearch to Kafka or from Kafka to Elasticsearch, the changes do not take effect and the agent continues to use the previous configuration.
Solution :
Manually restart Fluent Bit in the cluster.
"},{"location":"en/end-user/insight/quickstart/install/knownissues.html#podmonitor-collects-multiple-sets-of-jvm-metrics","title":"PodMonitor Collects Multiple Sets of JVM Metrics","text":"
In this version, there is a defect in PodMonitor/insight-kubernetes-pod: it will incorrectly create jobs to collect metrics for all containers in Pods that are marked with insight.opentelemetry.io/metric-scrape=true, instead of only the containers corresponding to insight.opentelemetry.io/metric-port.
After a PodMonitor is declared, Prometheus Operator pre-configures some service discovery configurations. Considering CRD compatibility, configuring collection tasks through annotations has been abandoned.
Use the additional scrape config mechanism provided by Prometheus to configure the service discovery rules in a secret and introduce them into Prometheus.
Therefore:
Delete the current PodMonitor for insight-kubernetes-pod
Use a new rule
In the new rule, action: keepequal is used to compare the consistency between source_labels and target_label to determine whether to create collection tasks for the ports of a container. Note that this feature is only available in Prometheus v2.41.0 (2022-12-20) and higher.
This page provides some considerations for upgrading insight-server and insight-agent.
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v028x-or-lower-to-v029x","title":"Upgrade from v0.28.x (or lower) to v0.29.x","text":"
Due to the upgrade of the Opentelemetry community operator chart version in v0.29.0, the supported values for featureGates in the values file have changed. Therefore, before upgrading, you need to set the value of featureGates to empty, as follows:
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v026x-or-lower-to-v027x-or-higher","title":"Upgrade from v0.26.x (or lower) to v0.27.x or higher","text":"
In v0.27.x, the switch for the vector component has been separated. If the existing environment has vector enabled, you need to specify --set vector.enabled=true when upgrading the insight-server.
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v019x-or-lower-to-020x","title":"Upgrade from v0.19.x (or lower) to 0.20.x","text":"
Before upgrading Insight , you need to manually delete the jaeger-collector and jaeger-query deployments by running the following command:
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v017x-or-lower-to-v018x","title":"Upgrade from v0.17.x (or lower) to v0.18.x","text":"
In v0.18.x, there have been updates to the Jaeger-related deployment files, so you need to manually run the following commands before upgrading insight-server:
There have been changes to metric names in v0.18.x, so after upgrading insight-server, insight-agent should also be upgraded.
In addition, the parameters for enabling the tracing module and adjusting the ElasticSearch connection have been modified. Refer to the following parameters:
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v015x-or-lower-to-v016x","title":"Upgrade from v0.15.x (or lower) to v0.16.x","text":"
In v0.16.x, a new feature parameter disableRouteContinueEnforce in the vmalertmanagers CRD is used. Therefore, you need to manually run the following command before upgrading insight-server:
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v023x-or-lower-to-v024x","title":"Upgrade from v0.23.x (or lower) to v0.24.x","text":"
In v0.24.x, CRDs have been added to the OTEL operator chart. However, helm upgrade does not update CRDs, so you need to manually run the following command:
If you are performing an offline installation, you can find the above CRD yaml file after extracting the insight-agent offline package. After extracting the insight-agent Chart, manually run the following command:
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v019x-or-lower-to-v020x","title":"Upgrade from v0.19.x (or lower) to v0.20.x","text":"
In v0.20.x, Kafka log export configuration has been added, and there have been some adjustments to the log export configuration. Before upgrading insight-agent , please note the parameter changes. The previous logging configuration has been moved to the logging.elasticsearch configuration:
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v017x-or-lower-to-v018x_1","title":"Upgrade from v0.17.x (or lower) to v0.18.x","text":"
Due to the updated Jaeger deployment files in v0.18.x, note the parameter changes before upgrading insight-agent.
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v016x-or-lower-to-v017x","title":"Upgrade from v0.16.x (or lower) to v0.17.x","text":"
In v0.17.x, the kube-prometheus-stack chart version was upgraded from 41.9.1 to 45.28.1, and there were also some field upgrades in the CRDs used, such as the attachMetadata field of ServiceMonitor. Therefore, the following command needs to be run before upgrading insight-agent:
If you are performing an offline installation, you can find the yaml for the above CRD in insight-agent/dependency-crds after extracting the insight-agent offline package.
"},{"location":"en/end-user/insight/quickstart/install/upgrade-note.html#upgrade-from-v011x-or-earlier-to-v012x","title":"Upgrade from v0.11.x (or earlier) to v0.12.x","text":"
v0.12.x upgrades the kube-prometheus-stack chart from 39.6.0 to 41.9.1, including prometheus-operator to v0.60.1 and the prometheus-node-exporter chart to v4.3.0. prometheus-node-exporter uses the Kubernetes recommended labels after upgrading, so you need to delete the node-exporter DaemonSet. prometheus-operator has updated its CRDs, so you need to run the following command before upgrading insight-agent:
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/jmx-exporter.html","title":"Use JMX Exporter to expose JVM monitoring metrics","text":"
JMX-Exporter provides two usages:
Start a standalone process. Specify parameters when the JVM starts to expose the JMX RMI interface; JMX Exporter calls RMI to obtain the JVM runtime status data, converts it to the Prometheus metrics format, and exposes a port for Prometheus to scrape.
Run in-process inside the JVM. Specify parameters when the JVM starts to run the JMX Exporter JAR as a javaagent; it reads the JVM runtime status data in-process, converts it into the Prometheus metrics format, and exposes a port for Prometheus to scrape.
Note
The first method is not officially recommended: the configuration is complicated, and it requires a separate process whose own monitoring becomes a new problem. This page therefore focuses on the second usage and explains how to use JMX Exporter to expose JVM monitoring metrics in a Kubernetes environment.
With the second usage, the JMX Exporter JAR file and configuration file must be specified when starting the JVM. Because the JAR is a binary file that is awkward to mount through a ConfigMap, and the configuration file rarely needs to be modified, the recommendation is to package both the JMX Exporter JAR and its configuration file directly into the business container image.
Alternatively, the JMX Exporter JAR can either be built into the business application image or mounted during deployment. The two methods are described below:
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/jmx-exporter.html#method-1-build-the-jmx-exporter-jar-file-into-the-business-image","title":"Method 1: Build the JMX Exporter JAR file into the business image","text":"
The content of prometheus-jmx-config.yaml is as follows:
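A minimal sketch of such a configuration; the option names are standard jmx_exporter options, but the rule pattern here simply exposes all MBean metrics and should be tuned to your needs:

```yaml
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: ".*"   # expose all MBean metrics; narrow this pattern as needed
```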
For more configuration options, refer to the introduction at the bottom of this page or the Prometheus official documentation.
Then prepare the JAR package file. You can find the latest JAR download address on the GitHub page of jmx_exporter and refer to the following Dockerfile:
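A sketch of such a Dockerfile, assuming a my-demo-app.jar application JAR, the jmx_prometheus_javaagent 0.17.2 release, and an OpenJDK base image; adjust the base image, file names, and versions to your environment:

```dockerfile
# Hypothetical base image; use your application's actual base image
FROM openjdk:11-jre
WORKDIR /app
# Copy the JMX Exporter javaagent JAR and its config file into the image
COPY jmx_prometheus_javaagent-0.17.2.jar ./
COPY prometheus-jmx-config.yaml ./
# Copy the application JAR (name is illustrative)
COPY my-demo-app.jar ./
# Start the application with JMX Exporter attached as a javaagent, exposing metrics on port 8088
ENTRYPOINT ["java", "-javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml", "-jar", "/app/my-demo-app.jar"]
```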
Port 8088 is used here to expose the JVM monitoring metrics. If it conflicts with your Java application, you can change it.
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/jmx-exporter.html#method-2-mount-via-init-container-container","title":"Method 2: mount via init container container","text":"
First, build the JMX Exporter into a Docker image. The following Dockerfile is for reference only:
```dockerfile
FROM alpine/curl:3.14
WORKDIR /app/
# Copy the previously created config file into the image
COPY prometheus-jmx-config.yaml ./
# Download the jmx prometheus javaagent jar online
RUN set -ex; \
    curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar;
```
Build the image according to the above Dockerfile: docker build -t my-jmx-exporter .
Add the following init container to the Java application deployment Yaml:
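A sketch of the relevant Deployment fragment, assuming the my-jmx-exporter image built above and an application container named my-demo-app; container names, mount paths, and the JAR version are illustrative:

```yaml
spec:
  template:
    spec:
      initContainers:
        # Copy the JMX Exporter JAR and config from the my-jmx-exporter image into a shared volume
        - name: jmx-exporter-init
          image: my-jmx-exporter
          command: ["sh", "-c", "cp /app/* /jmx/"]
          volumeMounts:
            - name: jmx
              mountPath: /jmx
      containers:
        - name: my-demo-app
          image: my-demo-app
          env:
            # Attach the agent via JAVA_TOOL_OPTIONS and expose metrics on port 8088
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/jmx/jmx_prometheus_javaagent-0.17.2.jar=8088:/jmx/prometheus-jmx-config.yaml"
          ports:
            - containerPort: 8088
              name: metrics
          volumeMounts:
            - name: jmx
              mountPath: /jmx
      volumes:
        - name: jmx
          emptyDir: {}
```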
After the above modification, the sample application my-demo-app can expose JVM metrics. After running the service, we can access the Prometheus-format metrics exposed by the service at http://localhost:8088.
Then, you can refer to Java Application Docking Observability with JVM Metrics.
This document describes how to monitor the JVM of your Java application, covering how Java applications that already expose JVM metrics and those that do not interface with Insight.
If your Java application does not yet expose JVM metrics, you can refer to the following documents:
Expose JVM monitoring metrics with JMX Exporter
Expose JVM monitoring metrics using OpenTelemetry Java Agent
If your Java application has exposed JVM metrics, you can refer to the following documents:
Java application docking observability with existing JVM metrics
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/legacy-jvm.html","title":"Java Application with JVM Metrics to Dock Insight","text":"
If your Java application exposes JVM monitoring metrics through other means (such as Spring Boot Actuator), the monitoring data still needs to be collected. You can let Insight collect the existing JVM metrics by adding Kubernetes annotations to the workload:
```yaml
annotations:
  insight.opentelemetry.io/metric-scrape: "true"  # whether to collect
  insight.opentelemetry.io/metric-path: "/"       # path to collect metrics
  insight.opentelemetry.io/metric-port: "9464"    # port for collecting metrics
```
YAML example of adding annotations to the my-deployment-app workload:
In the above example, Insight scrapes the Prometheus metrics exposed through Spring Boot Actuator on port 8080 at the path /actuator/prometheus.
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/otel-java-agent.html","title":"Use OpenTelemetry Java Agent to expose JVM monitoring metrics","text":"
OpenTelemetry Agent v1.20.0 and above includes the JMX Metric Insight module. If your application already integrates the OpenTelemetry Agent to collect application traces, you no longer need to introduce another agent to expose JMX metrics. The OpenTelemetry Agent collects and exposes metrics by instrumenting the MBeans locally available in the application.
Opentelemetry Agent also has some built-in monitoring samples for common Java Servers or frameworks, please refer to predefined metrics.
Using the OpenTelemetry Java Agent also requires considering how to mount the JAR into the container. In addition to mounting the JAR file as described for the JMX Exporter above, you can use the Operator capabilities provided by OpenTelemetry to automatically enable JVM metric exposure for your applications:
If your application is already integrated with the OpenTelemetry Agent to collect application traces, you no longer need to introduce another agent to expose JMX metrics. The OpenTelemetry Agent can now natively collect and expose metrics by instrumenting the MBeans available locally in the application.
However, in the current version, you still need to manually add the proper annotations to the workload before the JVM data will be collected by Insight.
"},{"location":"en/end-user/insight/quickstart/jvm-monitor/otel-java-agent.html#expose-metrics-for-java-middleware","title":"Expose metrics for Java middleware","text":"
Opentelemetry Agent also has some built-in middleware monitoring samples, please refer to Predefined Metrics.
By default, no type is specified; you need to specify it through the -Dotel.jmx.target.system JVM option, for example -Dotel.jmx.target.system=jetty,kafka-broker.
Gaining JMX Metric Insights with the OpenTelemetry Java Agent
Otel jmx metrics
"},{"location":"en/end-user/insight/quickstart/otel/golang-ebpf.html","title":"Enhance Go apps with OTel auto-instrumentation","text":"
If you don't want to manually change the application code, you can try this page's eBPF-based automatic enhancement method. This feature is currently in the review stage of being donated to the OpenTelemetry community and does not yet support Operator injection through annotations (it will be supported in the future), so you need to manually change the Deployment YAML or use a patch.
Install it under the insight-system namespace; skip this step if it has already been installed.
Note: This CR currently only supports the injection of environment variables (including service name and trace address) required to connect to Insight, and will support the injection of Golang probes in the future.
"},{"location":"en/end-user/insight/quickstart/otel/golang-ebpf.html#change-the-application-deployment-file","title":"Change the application deployment file","text":"
Add environment variable annotations
There is only one such annotation, which is used to add OpenTelemetry-related environment variables, such as the trace reporting address, the ID of the cluster where the container is located, and the namespace:
The value is divided into two parts by /: the first value, insight-system, is the namespace of the CR installed in the second step, and the second value, insight-opentelemetry-autoinstrumentation, is the name of the CR.
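As a rough sketch (the annotation key follows the OpenTelemetry Operator convention and should be verified against your Operator version), the pod template annotation would look something like this:

spec:
  template:
    metadata:
      annotations:
        # only injects the OTel environment variables, no probe
        instrumentation.opentelemetry.io/inject-sdk: "insight-system/insight-opentelemetry-autoinstrumentation"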
Please ensure that the insight-agent is ready. If not, please refer to Install insight-agent for data collection and make sure the following three items are ready:
Enable trace functionality for insight-agent
Check if the address and port for trace data are correctly filled
Ensure that the pods corresponding to deployment/insight-agent-opentelemetry-operator and deployment/insight-agent-opentelemetry-collector are ready.
"},{"location":"en/end-user/insight/quickstart/otel/operator.html#works-with-the-service-mesh-product-mspider","title":"Works with the Service Mesh Product (Mspider)","text":"
If you enable the tracing capability of Mspider (Service Mesh), you need to add an additional environment variable injection configuration:
"},{"location":"en/end-user/insight/quickstart/otel/operator.html#the-operation-steps-are-as-follows","title":"The operation steps are as follows","text":"
Log in to AI platform, then enter Container Management and select the target cluster.
Click CRDs in the left navigation bar, find instrumentations.opentelemetry.io, and enter the details page.
Select the insight-system namespace, then edit insight-opentelemetry-autoinstrumentation, and add the following content under spec.env:
"},{"location":"en/end-user/insight/quickstart/otel/operator.html#add-annotations-to-automatically-access-traces","title":"Add annotations to automatically access traces","text":"
Once the above is ready, you can integrate traces for the application through annotations. The OTel Operator currently supports trace integration via annotations. Depending on the service language, different pod annotations need to be added. Each service can add one of two types of annotations:
Only inject environment variable annotations
There is only one such annotation, which is used to add OTel-related environment variables, such as the trace reporting address, the ID of the cluster where the container is located, and the namespace (this annotation is very useful when the application's language has no automatic probe support).
The value is divided into two parts by /, the first value (insight-system) is the namespace of the CR installed in the previous step, and the second value (insight-opentelemetry-autoinstrumentation) is the name of the CR.
Automatic probe injection and environment variable injection annotations
There are currently 4 such annotations, corresponding to 4 different programming languages: java, nodejs, python, dotnet. After one is added, the automatic probe and the default OTel environment variables will be injected into the first container under the pod spec:
Since Go auto-instrumentation requires OTEL_GO_AUTO_TARGET_EXE to be set, you must provide a valid executable path through annotations or the Instrumentation resource. Failing to set this value terminates Go auto-instrumentation injection, and the trace integration fails.
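For illustration only, a sketch of both annotation styles on a Deployment pod template (the annotation keys follow the OpenTelemetry Operator naming convention, and the executable path is a placeholder):

# Java: inject the automatic probe plus the default OTel environment variables
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "insight-system/insight-opentelemetry-autoinstrumentation"
---
# Go: auto-instrumentation additionally needs the target executable path
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-go: "insight-system/insight-opentelemetry-autoinstrumentation"
        instrumentation.opentelemetry.io/otel-go-auto-target-exe: "/path/to/your/app-binary"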
The OpenTelemetry Operator automatically adds some OTel-related environment variables when injecting probes and also supports overriding these variables. The priority order for overriding these environment variables is as follows:
original container env vars -> language specific env vars -> common env vars -> instrument spec configs' vars\n
However, it is important to avoid manually overriding OTEL_RESOURCE_ATTRIBUTES_NODE_NAME . This variable serves as an identifier within the operator to determine if a pod has already been injected with a probe. Manually adding this variable may prevent the probe from being injected successfully.
How to query the connected services, refer to Trace Query.
"},{"location":"en/end-user/insight/quickstart/otel/otel.html","title":"Use OTel to provide the application observability","text":"
Enhancement is the process of enabling application code to generate telemetry data. i.e. something that helps you monitor or measure the performance and status of your application.
OpenTelemetry is a leading open source project providing instrumentation libraries for major programming languages and popular frameworks. It is a project under the Cloud Native Computing Foundation and is supported by the vast resources of the community. It provides a standardized data format for collected data without the need to integrate specific vendors.
Insight supports OpenTelemetry for application instrumentation to enhance your applications.
This guide introduces the basic concepts of telemetry enhancement using OpenTelemetry. OpenTelemetry also has an ecosystem of libraries, plugins, integrations, and other useful tools to extend it. You can find these resources at the OTel Registry.
You can use any open standard library for telemetry enhancement and use Insight as an observability backend to ingest, analyze, and visualize data.
To enhance your code, you can use the enhanced operations provided by OpenTelemetry for specific languages:
Insight currently provides an easy way to enhance .NET, NodeJS, Java, Python, and Golang applications with OpenTelemetry. Please follow the guidelines below.
Best practice for integrating traces: Application Non-Intrusive Enhancement via Operator
Manual instrumentation with Go language as an example: Enhance Go application with OpenTelemetry SDK
Using eBPF to implement non-intrusive auto-instrumentation in Go (experimental feature)
"},{"location":"en/end-user/insight/quickstart/otel/send_tracing_to_insight.html","title":"Sending Trace Data to Insight","text":"
This document describes how customers can send trace data to Insight on their own. It mainly includes the following two scenarios:
Customer apps report traces to Insight through OTEL Agent/SDK
Forwarding traces to Insight through Opentelemetry Collector (OTEL COL)
In each cluster where Insight Agent is installed, there is an insight-agent-otel-col component that is used to receive trace data from that cluster. This component therefore serves as the entry point for user access, and its address needs to be obtained first. You can get the address of the OpenTelemetry Collector in the cluster through the AI platform interface, for example insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317:
In addition, there are some slight differences for different reporting methods:
"},{"location":"en/end-user/insight/quickstart/otel/send_tracing_to_insight.html#customer-apps-report-traces-to-insight-through-otel-agentsdk","title":"Customer apps report traces to Insight through OTEL Agent/SDK","text":"
To successfully report trace data to Insight and display it properly, it is recommended to provide the required metadata (Resource Attributes) for OTLP through the following environment variables. There are two ways to achieve this:
Manually add them to the deployment YAML file, for example:
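For instance, a minimal sketch of the container env section (the service name, namespace, and attribute list here are placeholders; adapt them to your own workload):

containers:
  - name: my-app
    image: my-app:latest          # placeholder image
    env:
      - name: OTEL_SERVICE_NAME
        value: my-app
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: OTEL_RESOURCE_ATTRIBUTES
        value: k8s.namespace.name=my-namespace,k8s.pod.name=$(POD_NAME)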
"},{"location":"en/end-user/insight/quickstart/otel/send_tracing_to_insight.html#forwarding-traces-to-insight-through-opentelemetry-collector","title":"Forwarding traces to Insight through Opentelemetry Collector","text":"
After ensuring that the application has added the metadata mentioned above, you only need to add an OTLP Exporter in your customer's Opentelemetry Collector to forward the trace data to Insight Agent Opentelemetry Collector. Below is an example Opentelemetry Collector configuration file:
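A minimal sketch of such a configuration (exporter and pipeline names are illustrative; keep your existing receivers and processors in place):

receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlp/insight:
    # address of the Insight Agent OpenTelemetry Collector in the target cluster
    endpoint: insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/insight]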
Enhancing Applications Non-intrusively with the Operator
Achieving Observability with OTel
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html","title":"Enhance Go applications with OTel SDK","text":"
This page contains instructions on how to set up OpenTelemetry enhancements in a Go application.
OpenTelemetry, also known simply as OTel, is an open-source observability framework that helps generate and collect telemetry data: traces, metrics, and logs in Go apps.
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#enhance-go-apps-with-the-opentelemetry-sdk","title":"Enhance Go apps with the OpenTelemetry SDK","text":""},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#install-related-dependencies","title":"Install related dependencies","text":"
Dependencies related to the OpenTelemetry exporter and SDK must be installed first. If you are using another request router, please refer to request routing. After navigating to the application source folder, run the following command:
go get go.opentelemetry.io/otel@v1.8.0 \\\n go.opentelemetry.io/otel/trace@v1.8.0 \\\n go.opentelemetry.io/otel/sdk@v1.8.0 \\\n go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin@v0.33.0 \\\n go.opentelemetry.io/otel/exporters/otlp/otlptrace@v1.7.0 \\\n go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.4.1\n
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#create-an-initialization-feature-using-the-opentelemetry-sdk","title":"Create an initialization feature using the OpenTelemetry SDK","text":"
In order for an application to be able to send data, a feature is required to initialize OpenTelemetry. Add the following code snippet to the main.go file:
import (\n \"context\"\n \"os\"\n \"time\"\n\n \"go.opentelemetry.io/otel\"\n \"go.opentelemetry.io/otel/exporters/otlp/otlptrace\"\n \"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc\"\n \"go.opentelemetry.io/otel/propagation\"\n \"go.opentelemetry.io/otel/sdk/resource\"\n sdktrace \"go.opentelemetry.io/otel/sdk/trace\"\n semconv \"go.opentelemetry.io/otel/semconv/v1.7.0\"\n \"go.uber.org/zap\"\n \"google.golang.org/grpc\"\n)\n\nvar tracerExp *otlptrace.Exporter\n\nfunc retryInitTracer() func() {\n var shutdown func()\n go func() {\n for {\n // otel will reconnected and re-send spans when otel col recover. so, we don't need to re-init tracer exporter.\n if tracerExp == nil {\n shutdown = initTracer()\n } else {\n break\n }\n time.Sleep(time.Minute * 5)\n }\n }()\n return shutdown\n}\n\nfunc initTracer() func() {\n // temporarily set timeout to 10s\n ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)\n defer cancel()\n\n serviceName, ok := os.LookupEnv(\"OTEL_SERVICE_NAME\")\n if !ok {\n serviceName = \"server_name\"\n os.Setenv(\"OTEL_SERVICE_NAME\", serviceName)\n }\n otelAgentAddr, ok := os.LookupEnv(\"OTEL_EXPORTER_OTLP_ENDPOINT\")\n if !ok {\n otelAgentAddr = \"http://localhost:4317\"\n os.Setenv(\"OTEL_EXPORTER_OTLP_ENDPOINT\", otelAgentAddr)\n }\n zap.S().Infof(\"OTLP Trace connect to: %s with service name: %s\", otelAgentAddr, serviceName)\n\n traceExporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure(), otlptracegrpc.WithDialOption(grpc.WithBlock()))\n if err != nil {\n handleErr(err, \"OTLP Trace gRPC Creation\")\n return nil\n }\n\n tracerProvider := sdktrace.NewTracerProvider(\n sdktrace.WithBatcher(traceExporter),\n sdktrace.WithSampler(sdktrace.AlwaysSample()),\n sdktrace.WithResource(resource.NewWithAttributes(semconv.SchemaURL)))\n\n otel.SetTracerProvider(tracerProvider)\n otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))\n\n tracerExp = traceExporter\n return func() {\n // Shutdown will flush any remaining spans and shut down the exporter.\n handleErr(tracerProvider.Shutdown(ctx), \"failed to shutdown TracerProvider\")\n }\n}\n\nfunc handleErr(err error, message string) {\n if err != nil {\n zap.S().Errorf(\"%s: %v\", message, err)\n }\n}\n
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#initialize-tracker-in-maingo","title":"Initialize tracker in main.go","text":"
Modify the main function in main.go to initialize the tracer. Also, when your service shuts down, you should call TracerProvider.Shutdown() to ensure all spans are exported. The service makes this call as a deferred function in main:
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#add-opentelemetry-gin-middleware-to-the-application","title":"Add OpenTelemetry Gin middleware to the application","text":"
Configure Gin to use the middleware by adding the following line to main.go :
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#run-the-application","title":"Run the application","text":"
Local debugging and running
Note: This step is only used for local development and debugging. In the production environment, the Operator will automatically complete the injection of the following environment variables.
The above steps have completed the initialization of the SDK. If you now need to develop and debug locally, you need to obtain the address of insight-agent-opentelemetry-collector in the insight-system namespace in advance, for example: insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317.
Therefore, you can add the following environment variables when you start the application locally:
OTEL_SERVICE_NAME=my-golang-app OTEL_EXPORTER_OTLP_ENDPOINT=http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 go run main.go...\n
Running in a production environment
Please refer to Only injecting environment variable annotations in Achieving non-intrusive enhancement of applications through Operators to add annotations to the deployment YAML:
# Add one line to your import() stanza depending upon your request router:\nmiddleware \"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin\"\n
# Add one line to your import() stanza depending upon your request router:\nmiddleware \"go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux\"\n
The OpenTelemetry community has also developed middleware for database access libraries, such as Gorm:
import (\n \"github.com/uptrace/opentelemetry-go-extra/otelgorm\"\n \"gorm.io/driver/sqlite\"\n \"gorm.io/gorm\"\n)\n\ndb, err := gorm.Open(sqlite.Open(\"file::memory:?cache=shared\"), &gorm.Config{})\nif err != nil {\n panic(err)\n}\n\notelPlugin := otelgorm.NewPlugin(otelgorm.WithDBName(\"mydb\"), # Missing this can lead to incomplete display of database related topology\n otelgorm.WithAttributes(semconv.ServerAddress(\"memory\"))) # Missing this can lead to incomplete display of database related topology\nif err := db.Use(otelPlugin); err != nil {\n panic(err)\n}\n
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#add-custom-properties-and-custom-events-to-span","title":"Add custom properties and custom events to span","text":"
It is also possible to set a custom attribute or tag as a span. To add custom properties and events, follow these steps:
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#import-tracking-and-property-libraries","title":"Import Tracking and Property Libraries","text":"
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#get-the-current-span-from-the-context","title":"Get the current Span from the context","text":"
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#set-properties-in-the-current-span","title":"Set properties in the current Span","text":"
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#add-an-event-to-the-current-span","title":"Add an Event to the current Span","text":"
Adding span events is done using AddEvent on the span object.
span.AddEvent(msg)\n
"},{"location":"en/end-user/insight/quickstart/otel/golang/golang.html#log-errors-and-exceptions","title":"Log errors and exceptions","text":"
import \"go.opentelemetry.io/otel/codes\"\n\n// Get the current span\nspan := trace.SpanFromContext(ctx)\n\n// RecordError will automatically convert an error into a span even\nspan.RecordError(err)\n\n// Flag this span as an error\nspan.SetStatus(codes.Error, \"internal error\")\n
Navigate to your application's source folder and run the following command:
go get go.opentelemetry.io/otel \\\n go.opentelemetry.io/otel/attribute \\\n go.opentelemetry.io/otel/exporters/prometheus \\\n go.opentelemetry.io/otel/metric/global \\\n go.opentelemetry.io/otel/metric/instrument \\\n go.opentelemetry.io/otel/sdk/metric\n
"},{"location":"en/end-user/insight/quickstart/otel/golang/meter.html#create-an-initialization-function-using-otel-sdk","title":"Create an Initialization Function Using OTel SDK","text":"
For Java applications, you can directly expose JVM-related metrics by using the OpenTelemetry agent with the following environment variable:
OTEL_METRICS_EXPORTER=prometheus\n
You can then check your metrics at http://localhost:8888/metrics.
Next, combine it with a Prometheus ServiceMonitor to complete the metrics integration. If you want to expose custom metrics, please refer to opentelemetry-java-docs/prometheus.
The process is mainly divided into two steps:
Create a meter provider and specify Prometheus as the exporter.
/*\n * Copyright The OpenTelemetry Authors\n * SPDX-License-Identifier: Apache-2.0\n */\n\npackage io.opentelemetry.example.prometheus;\n\nimport io.opentelemetry.api.metrics.MeterProvider;\nimport io.opentelemetry.exporter.prometheus.PrometheusHttpServer;\nimport io.opentelemetry.sdk.metrics.SdkMeterProvider;\nimport io.opentelemetry.sdk.metrics.export.MetricReader;\n\npublic final class ExampleConfiguration {\n\n /**\n * Initializes the Meter SDK and configures the Prometheus collector with all default settings.\n *\n * @param prometheusPort the port to open up for scraping.\n * @return A MeterProvider for use in instrumentation.\n */\n static MeterProvider initializeOpenTelemetry(int prometheusPort) {\n MetricReader prometheusReader = PrometheusHttpServer.builder().setPort(prometheusPort).build();\n\n return SdkMeterProvider.builder().registerMetricReader(prometheusReader).build();\n }\n}\n
Create a custom meter and start the HTTP server.
package io.opentelemetry.example.prometheus;\n\nimport io.opentelemetry.api.common.Attributes;\nimport io.opentelemetry.api.metrics.Meter;\nimport io.opentelemetry.api.metrics.MeterProvider;\nimport java.util.concurrent.ThreadLocalRandom;\n\n/**\n * Example of using the PrometheusHttpServer to convert OTel metrics to Prometheus format and expose\n * these to a Prometheus instance via a HttpServer exporter.\n *\n * <p>A Gauge is used to periodically measure how many incoming messages are awaiting processing.\n * The Gauge callback gets executed every collection interval.\n */\npublic final class PrometheusExample {\n private long incomingMessageCount;\n\n public PrometheusExample(MeterProvider meterProvider) {\n Meter meter = meterProvider.get(\"PrometheusExample\");\n meter\n .gaugeBuilder(\"incoming.messages\")\n .setDescription(\"No of incoming messages awaiting processing\")\n .setUnit(\"message\")\n .buildWithCallback(result -> result.record(incomingMessageCount, Attributes.empty()));\n }\n\n void simulate() {\n for (int i = 500; i > 0; i--) {\n try {\n System.out.println(\n i + \" Iterations to go, current incomingMessageCount is: \" + incomingMessageCount);\n incomingMessageCount = ThreadLocalRandom.current().nextLong(100);\n Thread.sleep(1000);\n } catch (InterruptedException e) {\n // ignored here\n }\n }\n }\n\n public static void main(String[] args) {\n int prometheusPort = 8888;\n\n // It is important to initialize the OpenTelemetry SDK as early as possible in your process.\n MeterProvider meterProvider = ExampleConfiguration.initializeOpenTelemetry(prometheusPort);\n\n PrometheusExample prometheusExample = new PrometheusExample(meterProvider);\n\n prometheusExample.simulate();\n\n System.out.println(\"Exiting\");\n }\n}\n
After running the Java application, you can check if your metrics are working correctly by visiting http://localhost:8888/metrics.
For integrating and monitoring Java application traces, please refer to the document Implementing Non-Intrusive Enhancements for Applications via Operator, which explains how to automatically integrate traces through annotations.
Monitoring the JVM of Java applications: how Java applications that already expose JVM metrics, and those that do not, can connect with Insight observability.
If your Java application has not yet started exposing JVM metrics, you can refer to the following documents:
Exposing JVM Monitoring Metrics Using JMX Exporter
Exposing JVM Monitoring Metrics Using OpenTelemetry Java Agent
If your Java application has already exposed JVM metrics, you can refer to the following document:
Connecting Existing JVM Metrics of Java Applications to Observability
Writing TraceId and SpanId into Java Application Logs to correlate trace data with log data.
"},{"location":"en/end-user/insight/quickstart/otel/java/mdc.html","title":"Writing TraceId and SpanId into Java Application Logs","text":"
This article explains how to automatically write TraceId and SpanId into Java application logs using OpenTelemetry. By including TraceId and SpanId in your logs, you can correlate distributed tracing data with log data, enabling more efficient fault diagnosis and performance analysis.
Spring Boot projects come with a built-in logging framework and use Logback as the default logging implementation. If your Java project is a Spring Boot project, you can write TraceId into logs with minimal configuration.
Set logging.pattern.level in application.properties, adding %mdc{trace_id} and %mdc{span_id} to the logs.
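If you use the application.yml form instead of application.properties, the equivalent setting might look like the sketch below (the surrounding text of the pattern is only an example):

logging:
  pattern:
    # %mdc{key:-} prints the MDC value or an empty string if it is absent
    level: "%5p [trace_id=%mdc{trace_id:-} span_id=%mdc{span_id:-}]"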
Modify the log4j2.xml configuration, adding %X{trace_id} and %X{span_id} in the pattern to automatically write TraceId and SpanId into the logs:
<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<configuration>\n <appender name=\"CONSOLE\" class=\"ch.qos.logback.core.ConsoleAppender\">\n <encoder>\n <pattern>%d{HH:mm:ss.SSS} trace_id=%X{trace_id} span_id=%X{span_id} trace_flags=%X{trace_flags} %msg%n</pattern>\n </encoder>\n </appender>\n\n <!-- Just wrap your logging appender, for example ConsoleAppender, with OpenTelemetryAppender -->\n <appender name=\"OTEL\" class=\"io.opentelemetry.instrumentation.logback.mdc.v1_0.OpenTelemetryAppender\">\n <appender-ref ref=\"CONSOLE\"/>\n </appender>\n\n <!-- Use the wrapped \"OTEL\" appender instead of the original \"CONSOLE\" one -->\n <root level=\"INFO\">\n <appender-ref ref=\"OTEL\"/>\n </root>\n\n</configuration>\n
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html","title":"Exposing JVM Monitoring Metrics Using JMX Exporter","text":"
JMX Exporter provides two usage methods:
Standalone Process: Specify parameters when starting the JVM to expose a JMX RMI interface. The JMX Exporter calls RMI to obtain the JVM runtime state data, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape.
In-Process (JVM process): Specify parameters when starting the JVM to run the JMX Exporter jar file as a javaagent. This method reads the JVM runtime state data in-process, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape.
Note
The official recommendation is not to use the first method due to its complex configuration and the requirement for a separate process, which introduces additional monitoring challenges. Therefore, this article focuses on the second method, detailing how to use JMX Exporter to expose JVM monitoring metrics in a Kubernetes environment.
In this method, you need to specify the JMX Exporter jar file and configuration file when starting the JVM. Since the jar file is a binary file that is not ideal for mounting via a configmap, and the configuration file typically does not require modifications, it is recommended to package both the JMX Exporter jar file and the configuration file directly into the business container image.
For the second method, you can choose to include the JMX Exporter jar file in the application image or mount it during deployment. Below are explanations for both approaches:
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html#method-1-building-jmx-exporter-jar-file-into-the-business-image","title":"Method 1: Building JMX Exporter JAR File into the Business Image","text":"
The content of prometheus-jmx-config.yaml is as follows:
The format for the startup parameter is: -javaagent:<jmx-exporter-jar-path>=<port>:<config-file-path>
Here, port 8088 is used to expose JVM monitoring metrics; you may change it if it conflicts with the Java application.
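As an illustration, one way to supply this parameter without changing the image ENTRYPOINT is through JAVA_TOOL_OPTIONS in the container spec; the /app/ paths below assume where the jar and config were placed in your business image:

env:
  - name: JAVA_TOOL_OPTIONS   # read automatically by the JVM at startup
    value: "-javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml"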
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.html#method-2-mounting-via-init-container","title":"Method 2: Mounting via Init Container","text":"
First, we need to create a Docker image for the JMX Exporter. The following Dockerfile is for reference:
FROM alpine/curl:3.14\nWORKDIR /app/\n# Copy the previously created config file into the image\nCOPY prometheus-jmx-config.yaml ./\n# Download the jmx prometheus javaagent jar online\nRUN set -ex; \\\n curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar;\n
Build the image using the above Dockerfile: docker build -t my-jmx-exporter .
Add the following init container to the Java application deployment YAML:
With the above modifications, the example application my-demo-app now has the capability to expose JVM metrics. After running the service, you can access the Prometheus formatted metrics at http://localhost:8088.
Next, you can refer to Connecting Existing JVM Metrics of Java Applications to Observability.
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/legacy-jvm.html","title":"Integrating Existing JVM Metrics of Java Applications with Observability","text":"
If your Java application exposes JVM monitoring metrics through other means (such as Spring Boot Actuator), you will need to ensure that the monitoring data is collected. You can achieve this by adding annotations (Kubernetes Annotations) to your workload to allow Insight to scrape the existing JVM metrics:
annotations: \n insight.opentelemetry.io/metric-scrape: \"true\" # Whether to scrape\n insight.opentelemetry.io/metric-path: \"/\" # Path to scrape metrics\n insight.opentelemetry.io/metric-port: \"9464\" # Port to scrape metrics\n
For example, to add annotations to the my-deployment-app:
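A sketch of what those annotations might look like on the pod template, using the Actuator path and port mentioned below (adjust them to where your application actually exposes metrics):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment-app
spec:
  template:
    metadata:
      annotations:
        insight.opentelemetry.io/metric-scrape: "true"                 # whether to scrape
        insight.opentelemetry.io/metric-path: "/actuator/prometheus"   # path to scrape metrics
        insight.opentelemetry.io/metric-port: "8080"                   # port to scrape metrics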
In the above example, Insight will scrape the Prometheus metrics exposed through Spring Boot Actuator via http://<service-ip>:8080/actuator/prometheus.
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.html","title":"Exposing JVM Metrics Using OpenTelemetry Java Agent","text":"
Starting from OpenTelemetry Agent v1.20.0 and later, the OpenTelemetry Agent has introduced the JMX Metric Insight module. If your application is already integrated with the OpenTelemetry Agent for tracing, you no longer need to introduce another agent to expose JMX metrics for your application. The OpenTelemetry Agent collects and exposes metrics by detecting the locally available MBeans in the application.
The OpenTelemetry Agent also provides built-in monitoring examples for common Java servers or frameworks. Please refer to the Predefined Metrics.
When using the OpenTelemetry Java Agent, you also need to consider how to mount the JAR into the container. In addition to the methods for mounting the JAR file as described with the JMX Exporter, you can leverage the capabilities provided by the OpenTelemetry Operator to automatically enable JVM metrics exposure for your application.
If your application is already integrated with the OpenTelemetry Agent for tracing, you do not need to introduce another agent to expose JMX metrics. The OpenTelemetry Agent can now locally collect and expose metrics interfaces by detecting the locally available MBeans in the application.
However, as of the current version, you still need to manually add the appropriate annotations to your application for the JVM data to be collected by Insight. For specific annotation content, please refer to Integrating Existing JVM Metrics of Java Applications with Observability.
"},{"location":"en/end-user/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.html#exposing-metrics-for-java-middleware","title":"Exposing Metrics for Java Middleware","text":"
The OpenTelemetry Agent also includes built-in examples for monitoring middleware. Please refer to the Predefined Metrics.
By default, no specific types are designated; you need to specify them using the -Dotel.jmx.target.system JVM options, for example, -Dotel.jmx.target.system=jetty,kafka-broker.
Although the OpenShift system comes with its own monitoring system, we will still install Insight Agent because of certain rules in the data collection conventions.
In addition to the basic installation configuration, the following parameters need to be added during helm install:
## Parameters related to fluentbit\n--set fluent-bit.ocp.enabled=true \\\n--set fluent-bit.serviceAccount.create=false \\\n--set fluent-bit.securityContext.runAsUser=0 \\\n--set fluent-bit.securityContext.seLinuxOptions.type=spc_t \\\n--set fluent-bit.securityContext.readOnlyRootFilesystem=false \\\n--set fluent-bit.securityContext.allowPrivilegeEscalation=false \\\n\n## Enable Prometheus(CR) for OpenShift4.x\n--set compatibility.openshift.prometheus.enabled=true \\\n\n## Disable the Prometheus instance shipped with kube-prometheus-stack\n--set kube-prometheus-stack.prometheus.enabled=false \\\n--set kube-prometheus-stack.kubeApiServer.enabled=false \\\n--set kube-prometheus-stack.kubelet.enabled=false \\\n--set kube-prometheus-stack.kubeControllerManager.enabled=false \\\n--set kube-prometheus-stack.coreDns.enabled=false \\\n--set kube-prometheus-stack.kubeDns.enabled=false \\\n--set kube-prometheus-stack.kubeEtcd.enabled=false \\\n--set kube-prometheus-stack.kubeScheduler.enabled=false \\\n--set kube-prometheus-stack.kubeStateMetrics.enabled=false \\\n--set kube-prometheus-stack.nodeExporter.enabled=false \\\n\n## Limit the namespaces processed by the Prometheus Operator to avoid competition with OpenShift's own Prometheus Operator\n--set kube-prometheus-stack.prometheusOperator.kubeletService.namespace=\"insight-system\" \\\n--set kube-prometheus-stack.prometheusOperator.prometheusInstanceNamespaces=\"insight-system\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[0]=\"openshift-monitoring\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[1]=\"openshift-user-workload-monitoring\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[2]=\"openshift-customer-monitoring\" \\\n--set kube-prometheus-stack.prometheusOperator.denyNamespaces[3]=\"openshift-route-monitor-operator\" \\\n
"},{"location":"en/end-user/insight/quickstart/other/install-agent-on-ocp.html#write-system-monitoring-data-into-prometheus-through-openshifts-own-mechanism","title":"Write system monitoring data into Prometheus through OpenShift's own mechanism","text":"
"},{"location":"en/end-user/insight/quickstart/other/install-agentindce.html","title":"Install insight-agent in Suanova 4.0","text":"
In AI platform, an earlier Suanova 4.0 cluster can be managed as a subcluster. This guide covers potential issues and solutions when installing insight-agent in a Suanova 4.0 cluster.
Since most Suanova 4.0 clusters already have dx-insight installed as the monitoring system, installing insight-agent will conflict with the existing Prometheus Operator in the cluster, preventing a smooth installation.
Solution: configure the Prometheus Operator parameters so that the Prometheus Operator in dx-insight is retained and coexists with the Prometheus Operator in insight-agent (5.0).
Specifically, enable the --deny-namespaces parameter on each of the two Prometheus Operators.
Run the following command (for reference only; replace the Prometheus Operator name and namespace with the actual values in your environment). A sketch of the resulting Deployment args is shown after this explanation.
As shown in the figure above, the dx-insight components are deployed under the dx-insight tenant, and insight-agent is deployed under the insight-system tenant. Add --deny-namespaces=insight-system to the Prometheus Operator in dx-insight, and add --deny-namespaces=dx-insight to the Prometheus Operator in insight-agent.
Only the deny namespaces are added; both Prometheus Operators can continue to scan other namespaces, and the related collection resources under kube-system or customer business namespaces are not affected.
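For reference, the relevant args fragment of each Prometheus Operator Deployment might look like the sketch below (the container name is an assumption; keep all existing args unchanged):

# Prometheus Operator in dx-insight: ignore the insight-system namespace
spec:
  template:
    spec:
      containers:
        - name: prometheus-operator
          args:
            - --deny-namespaces=insight-system
            # ...existing args remain unchanged
---
# Prometheus Operator in insight-agent: ignore the dx-insight namespace
spec:
  template:
    spec:
      containers:
        - name: prometheus-operator
          args:
            - --deny-namespaces=dx-insight
            # ...existing args remain unchanged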
Please pay attention to the problem of node exporter port conflict.
The open-source node-exporter enables hostNetwork by default and uses port 9100. If the cluster's existing monitoring system has already installed node-exporter, installing insight-agent will cause a node-exporter port conflict, and it will not run normally.
Note
Insight's node-exporter enables some additional features to collect special metrics, so installing it is recommended.
Currently, modifying the port in the installation command is not supported. After helm install insight-agent, you need to manually modify the related ports of the insight node-exporter DaemonSet and Service (see the sketch below).
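A rough sketch of the fields to change is shown below; the object names, container name, and the new port 9101 are assumptions, so check the objects actually created by your insight-agent release:

# DaemonSet insight-agent-node-exporter: move off the conflicting default port 9100
spec:
  template:
    spec:
      containers:
        - name: node-exporter
          args:
            - --web.listen-address=:9101
          ports:
            - name: metrics
              containerPort: 9101
              hostPort: 9101
---
# Service insight-agent-node-exporter: keep port and targetPort in sync with the DaemonSet
spec:
  ports:
    - name: metrics
      port: 9101
      targetPort: 9101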
The docker storage directory of Suanova 4.0 is /var/lib/containers, which differs from the path in the insight-agent configuration, so those logs are not collected.
"},{"location":"en/end-user/insight/quickstart/res-plan/modify-vms-disk.html","title":"vmstorage Disk Expansion","text":"
This article describes the method for expanding the vmstorage disk. Please refer to the vmstorage disk capacity planning for the specifications of the vmstorage disk.
Log in to the AI platform as a global service cluster administrator. Click Container Management -> Clusters and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu Container Storage -> PVCs and find the PVC bound to the vmstorage.
Click a vmstorage PVC to enter the details of the volume claim for vmstorage and confirm the StorageClass that the PVC is bound to.
Select the left navigation menu Container Storage -> Storage Class and find local-path. Click the ┇ on the right side of the target and select Edit in the popup menu.
Enable Scale Up and click OK .
"},{"location":"en/end-user/insight/quickstart/res-plan/modify-vms-disk.html#modify-the-disk-capacity-of-vmstorage","title":"Modify the disk capacity of vmstorage","text":"
Log in to the AI platform as a global service cluster administrator and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu CRDs and find the custom resource for vmcluster .
Click the custom resource for vmcluster to enter the details page, switch to the insight-system namespace, and select Edit YAML from the right menu of insight-victoria-metrics-k8s-stack .
Modify according to the legend and click OK .
Select the left navigation menu Container Storage -> PVCs again and find the volume claim bound to vmstorage. Confirm that the modification has taken effect. In the details page of a PVC, click the associated storage source (PV).
Open the volume details page and click the Update button in the upper right corner.
After modifying the Capacity , click OK and wait for a moment until the expansion is successful.
"},{"location":"en/end-user/insight/quickstart/res-plan/modify-vms-disk.html#clone-the-storage-volume","title":"Clone the storage volume","text":"
If the storage volume expansion fails, you can refer to the following method to clone the storage volume.
Log in to the AI platform as a global service cluster administrator and go to the details of the kpanda-global-cluster cluster.
Select the left navigation menu Workloads -> StatefulSets and find the StatefulSet for vmstorage. Click the ┇ on the right side of the target and select Status -> Stop -> OK in the popup menu.
After logging into the master node of the kpanda-global-cluster cluster in the command line, run the following command to copy the vm-data directory in the vmstorage container to store the metric information locally:
Log in to the AI platform and go to the details of the kpanda-global-cluster cluster. Select the left navigation menu Container Storage -> PVs, click Clone in the upper right corner, and modify the capacity of the volume.
Delete the previous data volume of vmstorage.
Wait for a moment until the volume claim is bound to the cloned data volume, then run the following command to import the data exported in step 3 into the corresponding container, and then start the previously paused vmstorage.
In actual use, Prometheus's CPU, memory, and other resource usage is affected by the number of containers in the cluster and by whether Istio is enabled, and may exceed the configured resources.
To ensure that Prometheus runs normally in clusters of different sizes, you need to adjust Prometheus's resources according to the actual size of the cluster.
When the service mesh is not enabled, test statistics show that the relationship between system job metrics and pods is: Series count = 800 * pod count.
When the service mesh is enabled, the magnitude of the Istio-related metrics generated by pods is: Series count = 768 * pod count.
"},{"location":"en/end-user/insight/quickstart/res-plan/prometheus-res.html#when-the-service-mesh-is-not-enabled","title":"When the service mesh is not enabled","text":"
The following resource planning is recommended by Prometheus when the service mesh is not enabled :
Pod count in the table refers to the number of pods running stably in the cluster. If a large number of pods are restarted, the series count will increase sharply in a short period of time, and resources need to be adjusted accordingly.
Prometheus keeps two hours of data in memory by default, and when the Remote Write feature is enabled in the cluster, it occupies a certain amount of additional memory; a resource surge ratio of 2 is recommended.
The data in the table are recommended values applicable to general situations. If the environment has precise resource requirements, check the resource usage of the corresponding Prometheus after the cluster has been running for a period of time and configure it precisely.
"},{"location":"en/end-user/insight/quickstart/res-plan/vms-res-plan.html","title":"vmstorage disk capacity planning","text":"
vmstorage is responsible for storing multicluster metrics for observability. In order to ensure the stability of vmstorage, it is necessary to adjust the disk capacity of vmstorage according to the number of clusters and the size of the cluster. For more information, please refer to vmstorage retention period and disk space.
After observing the vmstorage disks of clusters of different sizes for 14 days, we found that vmstorage disk usage is positively correlated with the amount of metrics stored and the disk usage of a single data point.
Instantaneous amount of metrics stored: use increase(vm_rows{type != "indexdb"}[30s]) to obtain the number of metrics added within 30s.
Disk usage of a single data point: sum(vm_data_size_bytes{type!=\"indexdb\"}) / sum(vm_rows{type != \"indexdb\"})
Disk usage = Instantaneous metrics x 2 x disk usage for a single data point x 60 x 24 x storage time (days)
Parameter Description:
The unit of disk usage is Byte .
Storage duration (days) x 60 x 24 converts time (days) into minutes to calculate disk usage.
The default collection time of Prometheus in Insight Agent is 30s, so twice the amount of metrics will be generated within 1 minute.
The default storage duration in vmstorage is 1 month, please refer to Modify System Configuration to modify the configuration.
Warning
This formula is a general solution, and it is recommended to reserve redundant disk capacity on the calculation result to ensure the normal operation of vmstorage.
The data in the table is calculated based on the default storage time of one month (30 days), and the disk usage of a single data point (datapoint) is calculated as 0.9. In a multicluster scenario, the number of Pods represents the sum of the number of Pods in the multicluster.
"},{"location":"en/end-user/insight/quickstart/res-plan/vms-res-plan.html#when-the-service-mesh-is-not-enabled","title":"When the service mesh is not enabled","text":"Cluster size (number of Pods) Metrics Disk capacity 100 8W 6 GiB 200 16W 12 GiB 300 24w 18 GiB 400 32w 24 GiB 500 40w 30 GiB 800 64w 48 GiB 1000 80W 60 GiB 2000 160w 120 GiB 3000 240w 180 GiB"},{"location":"en/end-user/insight/quickstart/res-plan/vms-res-plan.html#when-the-service-mesh-is-enabled","title":"When the service mesh is enabled","text":"Cluster size (number of Pods) Metrics Disk capacity 100 15W 12 GiB 200 31w 24 GiB 300 46w 36 GiB 400 62w 48 GiB 500 78w 60 GiB 800 125w 94 GiB 1000 156w 120 GiB 2000 312w 235 GiB 3000 468w 350 GiB"},{"location":"en/end-user/insight/quickstart/res-plan/vms-res-plan.html#example","title":"Example","text":"
There are two clusters in the AI platform, of which 500 Pods are running in the global management cluster (service mesh enabled) and 1000 Pods are running in the worker cluster (service mesh not enabled), and the metrics are expected to be stored for 30 days.
The number of metrics in the global management cluster is 800x500 + 768x500 = 784000
Worker cluster metrics are 800x1000 = 800000
Then the current vmstorage disk usage should be set to (784000 + 800000) x 2 x 0.9 x 60 x 24 x 30 ≈ 123,171,840,000 bytes ≈ 115 GiB
Note
For the relationship between the number of metrics and the number of Pods in the cluster, please refer to Prometheus Resource Planning.
"},{"location":"en/end-user/insight/system-config/modify-config.html","title":"Modify system configuration","text":"
Observability persists metric, log, and trace data by default. Users can modify the system configuration as described on this page.
"},{"location":"en/end-user/insight/system-config/modify-config.html#how-to-modify-the-metric-data-retention-period","title":"How to modify the metric data retention period","text":"
Refer to the following steps to modify the metric data retention period.
After saving the modification, the pod of the component responsible for storing the metrics will automatically restart, just wait for a while.
"},{"location":"en/end-user/insight/system-config/modify-config.html#how-to-modify-the-log-data-storage-duration","title":"How to modify the log data storage duration","text":"
Refer to the following steps to modify the log data retention period:
"},{"location":"en/end-user/insight/system-config/modify-config.html#method-1-modify-the-json-file","title":"Method 1: Modify the Json file","text":"
Modify the max_age parameter in the rollover field in the following files and set the retention period. The default storage period is 7d. Change http://localhost:9200 to the access address of Elasticsearch.
After making the modification, run the command. If it prints the following content, the modification was successful.
{\n\"acknowledged\": true\n}\n
"},{"location":"en/end-user/insight/system-config/modify-config.html#method-2-modify-from-the-ui","title":"Method 2: Modify from the UI","text":"
Log in to Kibana and select Stack Management in the left navigation bar.
Select Index Lifecycle Policies in the left navigation, find the index insight-es-k8s-logs-policy, and click to enter its details.
Expand the Hot phase configuration panel, modify the Maximum age parameter, and set the retention period. The default storage period is 7d .
After modification, click Save policy at the bottom of the page to complete the modification.
"},{"location":"en/end-user/insight/system-config/modify-config.html#how-to-modify-the-trace-data-storage-duration","title":"How to modify the trace data storage duration","text":"
Refer to the following steps to modify the trace data retention period:
"},{"location":"en/end-user/insight/system-config/modify-config.html#method-1-modify-the-json-file_1","title":"Method 1: Modify the Json file","text":"
Modify the max_age parameter in the rollover field in the following files and set the retention period. The default storage period is 7d. Also change http://localhost:9200 to the access address of Elasticsearch.
On the system component page, you can quickly view the running status of the system components in Insight. When a system component fails, some features in Insight will be unavailable.
Go to the Insight product module.
In the left navigation bar, select System Management -> System Components .
"},{"location":"en/end-user/insight/system-config/system-component.html#component-description","title":"Component description","text":"Module Component Name Description Metrics vminsert-insight-victoria-metrics-k8s-stack Responsible for writing the metric data collected by Prometheus in each cluster to the storage component. If this component is abnormal, the metric data of the worker cluster cannot be written. Metrics vmalert-insight-victoria-metrics-k8s-stack Responsible for taking effect of the recording and alert rules configured in the VM Rule, and sending the triggered alert rules to alertmanager. Metrics vmalertmanager-insight-victoria-metrics-k8s-stack is responsible for sending messages when alerts are triggered. If this component is abnormal, the alert information cannot be sent. Metrics vmselect-insight-victoria-metrics-k8s-stack Responsible for querying metrics data. If this component is abnormal, the metric cannot be queried. Metrics vmstorage-insight-victoria-metrics-k8s-stack Responsible for storing multicluster metrics data. Dashboard grafana-deployment Provide monitoring panel capability. The exception of this component will make it impossible to view the built-in dashboard. Link insight-jaeger-collector Responsible for receiving trace data in opentelemetry-collector and storing it. Link insight-jaeger-query Responsible for querying the trace data collected in each cluster. Link insight-opentelemetry-collector Responsible for receiving trace data forwarded by each sub-cluster Log elasticsearch Responsible for storing the log data of each cluster."},{"location":"en/end-user/insight/system-config/system-config.html","title":"System Configuration","text":"
System Configuration displays the default storage time of metrics, logs, traces and the default Apdex threshold.
Click the right navigation bar and select System Configuration .
Currently only supports modifying the storage duration of historical alerts, click Edit to enter the target duration.
When the storage duration is set to \"0\", the historical alerts will not be cleared.
Note
To modify other configurations, please click to view How to modify the system configuration?
In Insight , a service refers to a group of workloads that provide the same behavior for incoming requests. Service insight helps observe the performance and status of applications during the operation process by using the OpenTelemetry SDK.
For how to use OpenTelemetry, please refer to: Using OTel to give your application insight.
Service: A service represents a group of workloads that provide the same behavior for incoming requests. You can define the service name when using the OpenTelemetry SDK or use the name defined in Istio.
Operation: An operation refers to a specific request or action handled by a service. Each span has an operation name.
Outbound Traffic: Outbound traffic refers to all the traffic generated by the current service when making requests.
Inbound Traffic: Inbound traffic refers to all the traffic initiated by the upstream service targeting the current service.
The Services List page displays key metrics such as throughput rate, error rate, and request latency for all services that have been instrumented with distributed tracing. You can filter services based on clusters or namespaces and sort the list by throughput rate, error rate, or request latency. By default, the data displayed in the list is for the last hour, but you can customize the time range.
Follow these steps to view service insight metrics:
Go to the Insight product module.
Select Trace Tracking -> Services from the left navigation bar.
Attention
If the namespace of a service in the list is unknown , it means that the service has not been properly instrumented. We recommend reconfiguring the instrumentation.
If multiple services have the same name and none of them have the correct Namespace environment variable configured, the metrics displayed in the list and service details page will be aggregated for all those services.
Click a service name (taking insight-system as an example) to view the detailed metrics and operation metrics for that service.
In the Service Topology section, you can view the service topology one layer above or below the current service. When you hover over a node, you can see its information.
In the Traffic Metrics section, you can view the monitoring metrics for all requests to the service within the past hour (including inbound and outbound traffic).
You can use the time selector in the upper right corner to quickly select a time range or specify a custom time range.
Sorting is available for throughput, error rate, and request latency in the operation metrics.
Clicking on the icon next to an individual operation will take you to the Traces page to quickly search for related traces.
"},{"location":"en/end-user/insight/trace/service.html#service-metric-explanations","title":"Service Metric Explanations","text":"Metric Description Throughput Rate The number of requests processed within a unit of time. Error Rate The ratio of erroneous requests to the total number of requests within the specified time range. P50 Request Latency The response time within which 50% of requests complete. P95 Request Latency The response time within which 95% of requests complete. P99 Request Latency The response time within which 99% of requests complete."},{"location":"en/end-user/insight/trace/topology-helper.html","title":"Service Topology Element Explanations","text":"
The service topology provided by Observability allows you to quickly identify the request relationships between services and determine the health status of services based on different colors. The health status is determined based on the request latency and error rate of the service's overall traffic. This article explains the elements in the service topology.
"},{"location":"en/end-user/insight/trace/topology-helper.html#node-status-explanation","title":"Node Status Explanation","text":"
The node health status is determined based on the error rate and request latency of the service's overall traffic, following these rules:
| Color | Status | Rules |
|---|---|---|
| Gray | Healthy | Error rate equals 0% and request latency is less than 100ms |
| Orange | Warning | Error rate in (0, 5%) or request latency in (100ms, 200ms) |
| Red | Abnormal | Error rate in (5%, 100%) or request latency in (200ms, +Infinity) |

"},{"location":"en/end-user/insight/trace/topology-helper.html#connection-status-explanation","title":"Connection Status Explanation","text":"

| Color | Status | Rules |
|---|---|---|
| Green | Healthy | Error rate equals 0% and request latency is less than 100ms |
| Orange | Warning | Error rate in (0, 5%) or request latency in (100ms, 200ms) |
| Red | Abnormal | Error rate in (5%, 100%) or request latency in (200ms, +Infinity) |

"},{"location":"en/end-user/insight/trace/topology.html","title":"Service Map","text":"
Service map is a visual representation of the connections, communication, and dependencies between services. It provides insights into the service-to-service interactions, allowing you to view the calls and performance of services within a specified time range. The connections between nodes in the topology map represent the existence of service-to-service calls during the queried time period.
Select Tracing -> Service Map from the left navigation bar.
In the Service Map, you can perform the following actions:
Click a node to slide out the details of the service on the right side. Here, you can view metrics such as request latency, throughput, and error rate for the service. Clicking on the service name takes you to the service details page.
Hover over the connections to view the traffic metrics between the two services.
Click Display Settings , you can configure the display elements in the service map.
In the Service Map, there can be nodes that are not part of the cluster. These external nodes can be categorized into three types:
Database
Message Queue
Virtual Node
If a service makes a request to a Database or Message Queue, these two types of nodes will be displayed by default in the topology map. However, Virtual Nodes represent nodes outside the cluster or services not integrated into the trace, and they will not be displayed by default in the map.
When a service makes a request to MySQL, PostgreSQL, or Oracle Database, the detailed database type can be seen in the map.
TraceID: Used to identify a complete request call trace.
Operation: Describes the specific operation or event represented by a Span.
Entry Span: The entry Span represents the first request of the entire call.
Latency: The duration from receiving the request to completing the response for the entire call trace.
Span: The number of Spans included in the entire trace.
Start Time: The time when the current trace starts.
Tag: A collection of key-value pairs that constitute Span tags. Tags are used to annotate and supplement Spans, and each Span can have multiple key-value tag pairs.
Click the icon on the right side of the trace data to search for associated logs.
By default, it queries the log data within the duration of the trace and one minute after its completion.
The queried logs include those with the trace's TraceID in their log text and container logs related to the trace invocation process.
Click View More to jump to the Associated Log page with conditions.
By default, all logs are searched, but you can filter by the TraceID or the relevant container logs from the trace call process using the dropdown.
Note
Since trace may span across clusters or namespaces, if the user does not have sufficient permissions, they will be unable to query the associated logs for that trace.
"},{"location":"en/end-user/k8s/add-node.html#steps-to-add-nodes","title":"Steps to Add Nodes","text":"
Log into the AI platform as an administrator.
Navigate to Container Management -> Clusters, and click the name of the target cluster.
On the cluster overview page, click Nodes, then click the Add Node button on the right.
Follow the wizard to fill in the parameters and click OK.
In the pop-up window, click OK.
Return to the node list; the status of the newly added node will be Connecting. After a few minutes, when the status changes to Running, it indicates that the connection was successful.
Tip
For newly connected nodes, it may take an additional 2-3 minutes to recognize the GPU.
"},{"location":"en/end-user/k8s/create-k8s.html","title":"Creating a Kubernetes Cluster in the Cloud","text":"
Deploying a Kubernetes cluster is aimed at supporting efficient AI computing resource scheduling and management, achieving elastic scaling, providing high availability, and optimizing the model training and inference processes.
Two IP address segments are allocated (an 18-bit Pod CIDR and an 18-bit SVC CIDR); they must not conflict with existing networks.
"},{"location":"en/end-user/k8s/create-k8s.html#steps-to-create","title":"Steps to Create","text":"
Log into the AI platform as an administrator.
Create and launch 3 cloud hosts without GPU to serve as Master nodes for the cluster.
Configure resources: 16 CPU cores, 32 GB RAM, 200 GB system disk (ReadWriteOnce)
Select Bridge network mode
Set the root password or add an SSH public key for SSH connection
Record the IPs of the 3 hosts
Navigate to Container Management -> Clusters, and click the Create Cluster button on the right.
Follow the wizard to configure various parameters of the cluster.
Wait for the cluster creation to complete.
In the cluster list, find the newly created cluster, click the cluster name, navigate to Helm Apps -> Helm Charts, and search for metax-gpu-extensions in the search box, then click the card.
Click the Install button on the right to start installing the GPU plugin.
Automatically return to the Helm App list and wait for the status of metax-gpu-extensions to change to Deployed.
At this point, the cluster has been successfully created. You can check the nodes included in the cluster. You can now create AI workloads and use GPUs.
The cost of GPU resources is relatively high. If GPUs are not needed temporarily, you can remove the worker nodes with GPUs. The following steps also apply to removing regular worker nodes.
"},{"location":"en/end-user/k8s/remove-node.html#steps-to-remove","title":"Steps to Remove","text":"
Log into the AI platform as an administrator.
Navigate to Container Management -> Clusters, and click the name of the target cluster.
Enter the cluster overview page, click Nodes, find the node to be removed, click the ┇ on the right side of the list, and select Remove Node from the pop-up menu.
In the pop-up window, enter the node name, and after confirming it is correct, click Delete.
You will automatically return to the node list, where the status will be Removing. After a few minutes, refresh the page, and if the node is no longer present, it indicates that the node has been successfully removed.
After removing the node from the UI list, SSH into the removed node's host and execute the shutdown command.
Tip
After removing the node in the UI and shutting it down, the data on the node is not immediately deleted; the node's data will be retained for a period of time.
"},{"location":"en/end-user/kpanda/backup/index.html","title":"Backup and Restore","text":"
Backup and restore are essential aspects of system management. In practice, it is important to first back up the data of the system at a specific point in time and securely store the backup. In case of incidents such as data corruption, loss, or accidental deletion, the system can be quickly restored based on the previous backup data, reducing downtime and minimizing losses.
In real production environments, services may be deployed across different clouds, regions, or availability zones. If one infrastructure faces a failure, organizations need to quickly restore applications in other available environments. In such cases, cross-cloud or cross-cluster backup and restore become crucial.
Large-scale systems often involve multiple roles and users with complex permission management systems. With many operators involved, accidents caused by human error can lead to system failures. In such scenarios, the ability to roll back the system quickly using previously backed-up data is necessary. Relying solely on manual troubleshooting, fault repair, and system recovery can be time-consuming, resulting in prolonged system unavailability and increased losses for organizations.
Additionally, factors like network attacks, natural disasters, and equipment malfunctions can also cause data accidents.
Therefore, backup and restore are vital as the last line of defense for maintaining system stability and ensuring data security.
Backups are typically classified into three types: full backups, incremental backups, and differential backups. Currently, AI platform supports full backups and incremental backups.
The backup and restore provided by AI platform can be divided into two categories: Application Backup and ETCD Backup. It supports both manual backups and scheduled automatic backups using CronJobs.
Application Backup
Application backup refers to backing up data of a specific workload in the cluster and then restoring that data either within the same cluster or in another cluster. It supports backing up all resources under a namespace or filtering resources by specific labels.
Application backup also supports cross-cluster backup of stateful applications. For detailed steps, refer to the Backup and Restore MySQL Applications and Data Across Clusters guide.
etcd Backup
etcd is the data storage component of Kubernetes. Kubernetes stores its own component's data and application data in etcd. Therefore, backing up etcd is equivalent to backing up the entire cluster's data, allowing quick restoration of the cluster to a previous state in case of failures.
It's worth noting that currently, restoring etcd backup data is only supported within the same cluster (the original cluster). To learn more about related best practices, refer to the ETCD Backup and Restore guide.
This article explains how to backup applications in AI platform. The demo application used in this tutorial is called dao-2048 , which is a deployment.
Before backing up a deployment, the following prerequisites must be met:
Integrate a Kubernetes cluster or create a Kubernetes cluster in the Container Management module, and be able to access the UI interface of the cluster.
Create a Namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
Install the velero component, and ensure the velero component is running properly.
Create a deployment (the workload in this tutorial is named dao-2048 ), and label the deployment with app: dao-2048 .
Follow the steps below to backup the deployment dao-2048 .
Enter the Container Management module, click Backup Recovery -> Application Backup on the left navigation bar, and enter the Application Backup list page.
On the Application Backup list page, select the cluster where the velero and dao-2048 applications have been installed. Click Backup Plan in the upper right corner to create a new backup plan.
Refer to the instructions below to fill in the backup configuration.
Name: The name of the new backup plan.
Source Cluster: The cluster where the application backup plan is to be executed.
Object Storage Location: The access path of the object storage configured when installing velero on the source cluster.
Namespace: The namespaces that need to be backed up, multiple selections are supported.
Advanced Configuration: Use resource labels to back up only specific resources in the namespace (such as a particular application), or to exclude specific resources from the backup.
Refer to the instructions below to set the backup execution frequency, and then click Next .
Backup Frequency: Set the execution schedule by minute, hour, day, week, or month. Custom Cron expressions with numbers and * are supported (see the example after this list); after entering an expression, its meaning is displayed. For detailed expression syntax rules, refer to Cron Schedule Syntax.
Retention Time (days): Set how long backup resources are stored. The default is 30 days; backups are deleted after expiration.
Backup Data Volume (PV): Whether to back up the data in the data volume (PV); supports direct replication and CSI snapshots.
Direct Replication: directly copy the data in the data volume (PV) for backup;
Use CSI snapshots: Use CSI snapshots to back up data volumes (PVs). Requires a CSI snapshot type available for backup in the cluster.
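For reference, a couple of illustrative Cron expressions that use only numbers and * (the schedules themselves are examples, not recommendations):

```
0 2 * * *    # run at 02:00 every day
0 0 1 * *    # run at 00:00 on the first day of every month
```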
Click OK . The page will automatically return to the application backup plan list; find the newly created dao-2048 backup plan and click Immediate Execution .
At this point, the Last Execution State of the backup plan will change to in progress . After the backup is complete, you can click the name of the backup plan to view its details.
etcd backup takes the cluster data as its core. In scenarios such as hardware damage or misconfiguration in development and test environments, cluster data can be restored through an etcd backup.
This section will introduce how to realize the etcd backup for clusters. Also see etcd Backup and Restore Best Practices.
Enter Container Management -> Backup Recovery -> etcd Backup page, you can see all the current backup policies. Click Create Backup Policy on the right.
Fill in the Basic Information. Then, click Next to automatically verify the connectivity of etcd. If the verification passes, proceed to the next step.
First select the backup cluster and log in to its terminal.
Enter the etcd address in the format https://${NodeIP}:${Port}.
In a standard Kubernetes cluster, the default port for etcd is 2379.
In a Suanova 4.0 cluster, the default port for etcd is 12379.
In a public cloud managed cluster, you need to contact the relevant developers to obtain the etcd port number. This is because the control plane components of public cloud clusters are maintained and managed by the cloud service provider. Users cannot directly access or view these components, nor can they obtain control plane port information through regular commands (such as kubectl).
Ways to obtain port number
Find the etcd Pod in the kube-system namespace
kubectl get po -n kube-system | grep etcd\n
Get the port number from the listen-client-urls of the etcd Pod
kubectl get po -n kube-system ${etcd_pod_name} -oyaml | grep listen-client-urls # (1)!\n
Replace etcd_pod_name with the actual Pod name
The expected output is as follows, where the number after the node IP is the port number:
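The original sample output is not included here; as an illustration only, the matched line typically looks like the following (the node IP is an example), where 2379 after the node IP is the port number:

```
- --listen-client-urls=https://127.0.0.1:2379,https://10.6.10.10:2379
```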
Fill in the CA certificate. You can use the following command to view the certificate content, then copy and paste it into the proper field:
Standard Kubernetes Cluster:
cat /etc/kubernetes/ssl/etcd/ca.crt
Suanova 4.0 Cluster:
cat /etc/daocloud/dce/certs/ca.crt
Fill in the Cert certificate. You can use the following command to view the content of the certificate, then copy and paste it into the proper field:
Click How to get below the input box to see how to obtain the proper information on the UI page.
Refer to the following information to fill in the Backup Policy.
Backup Method: Choose either manual backup or scheduled backup
Manual Backup: Immediately perform a full backup of etcd data based on the backup configuration.
Scheduled Backup: Periodically perform full backups of etcd data according to the set backup frequency.
Backup Chain Length: the maximum number of backups to retain. The default is 30.
Backup Frequency: hourly, daily, weekly, or monthly; a custom frequency is also supported.
Refer to the following information to fill in the Storage Path.
Storage Provider: Default is S3 storage
Object Storage Access Address: The access address of MinIO
Bucket: Create a Bucket in MinIO and fill in the Bucket name
Username: The login username for MinIO
Password: The login password for MinIO
After clicking OK , the page will automatically redirect to the backup policy list, where you can view all currently created policies.
Click the ┇ action button on the right side of the policy to view logs, view YAML, update the policy, stop the policy, or execute the policy immediately.
When the backup method is manual, you can click Execute Now to perform the backup.
When the backup method is scheduled, the backup will be performed according to the configured time.
Click Logs to view the log content. By default, 100 lines are displayed. If you want to see more log information or download the logs, you can follow the prompts above the logs to go to the observability module.
Go to Container Management -> Backup Recovery -> etcd Backup, and click the Recovery Point tab.
After selecting the target cluster, you can view all the backup information under that cluster.
Each time a backup is executed, a proper recovery point is generated, which can be used to quickly restore the application from a successful recovery point.
"},{"location":"en/end-user/kpanda/backup/install-velero.html","title":"Install the Velero Plugin","text":"
velero is an open source tool for backing up and restoring Kubernetes cluster resources. It can back up resources in a Kubernetes cluster to cloud storage services, local storage, or other locations, and restore those resources to the same or a different cluster when needed.
This section introduces how to deploy the Velero plugin in AI platform using the Helm Apps.
Please perform the following steps to install the velero plugin for your cluster.
On the cluster list page, find the target cluster that needs the velero plugin, click the name of the cluster, click Helm Apps -> Helm chart in the left navigation bar, and enter velero in the search bar.
Read the introduction of the velero plugin, select the version, and click the Install button. This page uses version 5.2.0 as an example; it is recommended to install version 5.2.0 or later.
Configure basic info .
Name: Enter the plugin name. The name can be up to 63 characters, may only contain lowercase letters, numbers, and hyphens ("-"), and must start and end with a lowercase letter or number, for example metrics-server-01.
Namespace: Select the namespace for plugin installation; it must be the velero namespace.
Version: The version of the plugin; version 5.2.0 is used as an example here.
Wait: When enabled, it will wait for all associated resources under the application to be ready before marking the application installation as successful.
Deletion Failed: When enabled, Wait is also enabled by default. If the installation fails, the installation-related resources will be removed.
Detailed Logs: Turn on the verbose output of the installation process log.
!!! note
    After enabling __Wait__ and/or __Deletion Failed__ , it takes a long time for the app to be marked as __Running__ .
Configure Velero chart Parameter Settings according to the following instructions
S3 Credentials: Configure the authentication information of object storage (minio).
Use secret: Keep the default configuration true.
Secret name: Keep the default configuration velero-s3-credential.
SecretContents.aws_access_key_id = : Configure the username for accessing object storage; replace it with the actual value.
SecretContents.aws_secret_access_key = : Configure the password for accessing object storage; replace it with the actual value.
An example of the Use existing secret parameter is as follows:
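A minimal sketch of such a secret, assuming the default name velero-s3-credential in the velero namespace and the conventional cloud key used by Velero; replace the placeholders with your actual MinIO credentials:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: velero-s3-credential   # default secret name referenced above
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id = <MINIO_ACCESS_KEY>        # username for object storage
    aws_secret_access_key = <MINIO_SECRET_KEY>    # password for object storage
```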
BackupStorageLocation: The location where Velero backs up data.
S3 bucket: The name of the storage bucket used to save backup data (must be a real storage bucket that already exists in minio).
Is default BackupStorage: Keep the default configuration true.
S3 access mode: Velero's access mode to the data; one of the following can be selected:
ReadWrite: Allow Velero to read and write backup data;
ReadOnly: Allow Velero to read backup data, but cannot modify backup data;
WriteOnly: Only allow Velero to write backup data, and cannot read backup data.
S3 Configs: Detailed configuration of S3 storage (minio).
S3 region: The geographical region of cloud storage. The default is to use the us-east-1 parameter, which is provided by the system administrator.
S3 force path style: Keep the default configuration true.
S3 server URL: The console access address of object storage (minio). Minio generally provides two services, UI access and console access. Please use the console access address here.
Click the OK button to complete the installation of the Velero plugin. The system will automatically jump to the Helm Apps list page. After waiting for a few minutes, refresh the page, and you can see the application just installed.
Cluster settings are used to customize advanced feature settings for your cluster, including whether to enable GPU, helm repo refresh cycle, Helm operation record retention, etc.
Enable GPU: GPUs and proper driver plug-ins need to be installed on the cluster in advance.
Click the name of the target cluster, and click Operations and Maintenance -> Cluster Settings -> Addons in the left navigation bar.
Here you can configure the Helm operation base image, the registry refresh cycle, the number of operation records retained, and whether to enable cluster deletion protection (when enabled, the cluster cannot be deleted directly).
On this page, you can view the recent cluster operation records and Helm operation records, as well as the YAML files and logs of each operation, and you can also delete a certain record.
Set the number of reserved entries for Helm operations:
By default, the system keeps the last 100 Helm operation records. Keeping too many records may cause data redundancy, while keeping too few may cause you to lose key operation records you need. Set a reasonable number based on the actual situation. Specific steps are as follows:
Click the name of the target cluster, and click Recent Operations -> Helm Operations -> Set Number of Retained Items in the left navigation bar.
Set how many Helm operation records need to be kept, and click OK .
Clusters integrated into or created in the AI platform Container Management module can be accessed not only through the UI but also in two other ways:
Access online via CloudShell
Access via kubectl after downloading the cluster certificate
Note
When accessing the cluster, the user should have Cluster Admin permission or higher.
"},{"location":"en/end-user/kpanda/clusters/access-cluster.html#access-via-cloudshell","title":"Access via CloudShell","text":"
Enter Clusters page, select the cluster you want to access via CloudShell, click the ... icon on the right, and then click Console from the dropdown list.
Run the kubectl get node command in the Console to verify the connectivity between CloudShell and the cluster. If the console returns node information of the cluster, you can access and manage the cluster through CloudShell.
"},{"location":"en/end-user/kpanda/clusters/access-cluster.html#access-via-kubectl","title":"Access via kubectl","text":"
If you want to access and manage remote clusters from a local node, make sure you have met these prerequisites:
Your local node and the cloud cluster are in a connected network.
The cluster certificate has been downloaded to the local node.
The kubectl tool has been installed on the local node. For detailed installation guides, see Installing tools.
If everything is in place, follow these steps to access a cloud cluster from your local environment.
Enter Clusters page, find your target cluster, click ... on the right, and select Download kubeconfig in the drop-down list.
Set the Kubeconfig period and click Download .
Open the downloaded certificate and copy its content to the config file of the local node.
By default, the kubectl tool will look for a file named config in the $HOME/.kube directory on the local node. This file stores access credentials of clusters. Kubectl can access the cluster with that configuration file.
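For example, assuming the downloaded file is named cluster.kubeconfig (an illustrative name), you could place it in the default location like this:

```bash
# Create the default kubectl config directory and copy the downloaded kubeconfig into it
mkdir -p $HOME/.kube
cp cluster.kubeconfig $HOME/.kube/config
```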
Run the following command on the local node to verify its connectivity with the cluster:
kubectl get pod -n default\n
An expected output is as follows:
NAME                            READY   STATUS    RESTARTS   AGE
dao-2048-2048-58c7f7fc5-mq7h4   1/1     Running   0          30h
Now you can access and manage the cluster locally with kubectl.
This is a cluster created using Container Management and is mainly used to carry business workloads. This cluster is managed by the management cluster.
| Supported Features | Description |
| --- | --- |
| K8s Version | Supports K8s 1.22 and above |
| Operating System | RedHat 7.6 x86/ARM, RedHat 7.9 x86, RedHat 8.4 x86/ARM, RedHat 8.6 x86; Ubuntu 18.04 x86, Ubuntu 20.04 x86; CentOS 7.6 x86/AMD, CentOS 7.9 x86/AMD |
| Full Lifecycle Management | Supported |
| K8s Resource Management | Supported |
| Cloud Native Storage | Supported |
| Cloud Native Network | Calico, Cilium, Multus, and other CNIs |
| Policy Management | Supports network policies, quota policies, resource limits, disaster recovery policies, security policies |
"},{"location":"en/end-user/kpanda/clusters/cluster-role.html#integrated-cluster","title":"Integrated Cluster","text":"
This cluster is used to integrate existing standard K8s clusters, including but not limited to self-built clusters in local data centers, clusters provided by public cloud vendors, clusters provided by private cloud vendors, edge clusters, Xinchuang clusters, heterogeneous clusters, and different Suanova clusters. It is mainly used to carry business workloads.
| Supported Features | Description |
| --- | --- |
| K8s Version | 1.18+ |
| Supported Vendors | VMware Tanzu, Amazon EKS, Redhat Openshift, SUSE Rancher, Alibaba ACK, Huawei CCE, Tencent TKE, Standard K8s Cluster, Suanova |
| Full Lifecycle Management | Not Supported |
| K8s Resource Management | Supported |
| Cloud Native Storage | Supported |
| Cloud Native Network | Depends on the network mode of the integrated cluster's kernel |
| Policy Management | Supports network policies, quota policies, resource limits, disaster recovery policies, security policies |
Note
A cluster can have multiple cluster roles. For example, a cluster can be both a global service cluster and a management cluster or a worker cluster.
"},{"location":"en/end-user/kpanda/clusters/cluster-scheduler-plugin.html","title":"Deploy Second Scheduler scheduler-plugins in a Cluster","text":"
This page describes how to deploy a second scheduler-plugins in a cluster.
"},{"location":"en/end-user/kpanda/clusters/cluster-scheduler-plugin.html#why-do-we-need-scheduler-plugins","title":"Why do we need scheduler-plugins?","text":"
The cluster created through the platform installs the native K8s scheduler, but the native scheduler has several limitations:
The native scheduler cannot meet certain scheduling requirements; in that case you can use scheduler-plugins such as CoScheduling or CapacityScheduling.
In special scenarios, a new scheduler is needed to complete scheduling tasks without affecting the workflow of the native scheduler.
Schedulers with different functionalities can be distinguished, and different scheduling scenarios achieved, by switching scheduler names.
This page uses the scenario of combining the vgpu scheduler with the coscheduling plugin capability of scheduler-plugins as an example to introduce how to install and use scheduler-plugins.
kubean is a new feature introduced in v0.13.0; please ensure that your version is v0.13.0 or higher.
The installation version of scheduler-plugins is v0.27.8, please ensure that the cluster version is compatible with it. Refer to the document Compatibility Matrix.
scheduler_plugins_enabled: Set to true to enable the scheduler-plugins capability.
You can enable or disable certain plugins by setting the scheduler_plugins_enabled_plugins or scheduler_plugins_disabled_plugins options. See K8s Official Plugin Names for reference.
If you need to set parameters for custom plugins, please configure scheduler_plugins_plugin_config, for example setting the permitWaitingTimeoutSeconds parameter for coscheduling, as sketched below. See K8s Official Plugin Configuration for reference.
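A minimal sketch of these options in the cluster creation configuration, assuming the plugin config follows the kube-scheduler name/args structure (all values are illustrative):

```yaml
scheduler_plugins_enabled: true          # enable the scheduler-plugins capability
scheduler_plugins_enabled_plugins:
  - Coscheduling                         # enable the coscheduling plugin
scheduler_plugins_plugin_config:
  - name: Coscheduling
    args:
      permitWaitingTimeoutSeconds: 10    # example custom parameter
```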
After successful cluster creation, the system will automatically install the scheduler-plugins and controller component loads. You can check the workload status in the proper cluster's deployment.
Here is an example of how to use scheduler-plugins by demonstrating a scenario where the vgpu scheduler is used in combination with the coscheduling plugin capability of scheduler-plugins.
Install vgpu in the Helm Charts and set the following values.yaml parameters (a sketch is shown after the list):
schedulerName: scheduler-plugins-scheduler: This is the scheduler name for scheduler-plugins installed by kubean, and currently cannot be modified.
scheduler.kubeScheduler.enabled: false: Do not install kube-scheduler and use vgpu-scheduler as a separate extender.
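A sketch of the relevant values.yaml fields, assuming the field paths are exactly as named above:

```yaml
schedulerName: scheduler-plugins-scheduler   # scheduler name installed by kubean; currently cannot be modified
scheduler:
  kubeScheduler:
    enabled: false                           # do not install kube-scheduler; use vgpu-scheduler as a separate extender
```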
Extend vgpu-scheduler on scheduler-plugins.
[root@master01 charts]# kubectl get cm -n scheduler-plugins scheduler-config -ojsonpath=\"{.data.scheduler-config\\.yaml}\"\n
After installing vgpu-scheduler, the system will automatically create a service (svc), and the urlPrefix specifies the URL of the svc.
Note
The svc refers to the pod service load. You can use the following command in the namespace where the nvidia-vgpu plugin is installed to get the external access information for port 443.
kubectl get svc -n ${namespace}\n
The urlPrefix format is https://${ip address}:${port}
Restart the scheduler pod of scheduler-plugins to load the new configuration file.
Note
When creating a vgpu application, you do not need to specify the name of a scheduler-plugin. The vgpu-scheduler webhook will automatically change the scheduler's name to \"scheduler-plugins-scheduler\" without manual specification.
AI platform Container Management module can manage two types of clusters: integrated clusters and created clusters.
Integrated clusters: clusters created in other platforms and now integrated into AI platform.
Created clusters: clusters created in AI platform.
For more information about cluster types, see Cluster Role.
We designed several statuses for these two kinds of clusters.
"},{"location":"en/end-user/kpanda/clusters/cluster-status.html#integrated-clusters","title":"Integrated Clusters","text":"Status Description Integrating The cluster is being integrated into AI platform. Removing The cluster is being removed from AI platform. Running The cluster is running as expected. Unknown The cluster is lost. Data displayed in the AI platform UI is the cached data before the disconnection, which does not represent real-time data. Any operation during this status will not take effect. You should check cluster network connectivity or host status."},{"location":"en/end-user/kpanda/clusters/cluster-status.html#created-clusters","title":"Created Clusters","text":"Status Description Creating The cluster is being created. Updating The Kubernetes version of the cluster is being operating. Deleting The cluster is being deleted. Running The cluster is running as expected. Unknown The cluster is lost. Data displayed in the AI platform UI is the cached data before the disconnection, which does not represent real-time data. Any operation during this status will not take effect. You should check cluster network connectivity or host status. Failed The cluster creation is failed. You should check the logs for detailed reasons."},{"location":"en/end-user/kpanda/clusters/cluster-version.html","title":"Supported Kubernetes Versions","text":"
In AI platform, the integrated clusters and created clusters have different version support mechanisms.
This page focuses on the version support mechanism for created clusters.
The Kubernetes community maintains three version ranges, such as 1.26, 1.27, and 1.28. When a new version is released by the community, the supported version range shifts up by one. For example, when the community releases 1.29, its supported version range becomes 1.27, 1.28, and 1.29.
To ensure the security and stability of the clusters, when creating clusters in AI platform, the supported version range will always be one version lower than the community's version.
For instance, if the Kubernetes community supports v1.25, v1.26, and v1.27, then the version range for creating worker clusters in AI platform will be v1.24, v1.25, and v1.26. Additionally, a stable version, such as 1.24.7, will be recommended to users.
Furthermore, the version range for creating worker clusters in AI platform will remain highly synchronized with the community. When the community version increases incrementally, the version range for creating worker clusters in AI platform will also increase by one version.
"},{"location":"en/end-user/kpanda/clusters/cluster-version.html#supported-kubernetes-versions_1","title":"Supported Kubernetes Versions","text":"Kubernetes Community Versions Created Worker Cluster Versions Recommended Versions for Created Worker Cluster AI platform Installer Release Date
In AI platform Container Management, clusters can have four roles: global service cluster, management cluster, worker cluster, and integrated cluster. An integrated cluster can only be integrated from third-party vendors (see Integrate Cluster).
This page explains how to create a Worker Cluster. By default, when creating a new Worker Cluster, the operating system type and CPU architecture of the worker nodes should be consistent with the global service cluster. If you want to create a cluster with a different operating system or architecture than the global service cluster, refer to Creating an Ubuntu Worker Cluster on a CentOS Management Platform for instructions.
It is recommended to use the supported operating systems in AI platform to create the cluster. If your local nodes are not within the supported range, you can refer to Creating a Cluster on Non-Mainstream Operating Systems for instructions.
Certain prerequisites must be met before creating a cluster:
Prepare enough nodes to be joined into the cluster.
It is recommended to use Kubernetes version 1.25.7. For the specific version range, refer to the AI platform Cluster Version Support System. Currently, the supported version range for created worker clusters is v1.26.0-v1.28. If you need to create a cluster with a lower version, refer to the Supported Cluster Versions.
The target host must allow IPv4 forwarding. If using IPv6 in Pods and Services, the target server needs to allow IPv6 forwarding.
AI platform does not provide firewall management. You need to pre-define the firewall rules of the target host by yourself. To avoid errors during cluster creation, it is recommended to disable the firewall of the target host.
Enter the Container Management module, click Create Cluster on the upper right corner of the Clusters page.
Fill in the basic information by referring to the following instructions.
Cluster Name: may only contain lowercase letters, numbers, and hyphens ("-"); it must start and end with a lowercase letter or number and be at most 63 characters in total.
Managed By: Choose a cluster to manage this new cluster through its lifecycle, such as creating, upgrading, node scaling, deleting the new cluster, etc.
Runtime: Select the runtime environment of the cluster. Currently support containerd and docker (see How to Choose Container Runtime).
Kubernetes Version: A span of three minor versions is allowed, for example 1.23-1.25, subject to the versions supported by the management cluster.
Fill in the node configuration information and click Node Check .
High Availability: When enabled, at least 3 controller nodes are required. When disabled, only 1 controller node is needed.
It is recommended to use High Availability mode in production environments.
Credential Type: Choose whether to access nodes using username/password or public/private keys.
If using public/private key authentication, SSH keys for the nodes need to be configured in advance. Refer to Using SSH Key Authentication for Nodes.
Same Password: When enabled, all nodes in the cluster will have the same access password. Enter the unified password for accessing all nodes in the field below. If disabled, you can set separate usernames and passwords for each node.
Node Information: Set node names and IPs.
NTP Time Synchronization: When enabled, time will be automatically synchronized across all nodes. Provide the NTP server address.
If node check is passed, click Next . If the check failed, update Node Information and check again.
Fill in the network configuration and click Next .
CNI: Provide network services for Pods in the cluster. The CNI cannot be changed after the cluster is created. Supports cilium and calico. Setting it to none means no CNI will be installed when creating the cluster; you may install a CNI later.
For CNI configuration details, see Cilium Installation Parameters or Calico Installation Parameters.
Container IP Range: Set an IP range for allocating IPs for containers in the cluster. IP range determines the max number of containers allowed in the cluster. Cannot be modified after creation.
Service IP Range: Set an IP range for allocating IPs for container Services in the cluster. This range determines the max number of container Services that can be created in the cluster. Cannot be modified after creation.
Fill in the plug-in configuration and click Next .
Fill in advanced settings and click OK .
kubelet_max_pods : Set the maximum number of Pods per node. The default is 110.
hostname_override : Reset the hostname (not recommended).
kubernetes_audit : Kubernetes audit log, enabled by default.
auto_renew_certificate : Automatically renew the certificate of the control plane on the first Monday of each month, enabled by default.
disable_firewalld&ufw : Disable the firewall to prevent the node from being inaccessible during installation.
Insecure_registries : Set the address of your private container registry. If you use a private container registry, filling in its address bypasses the container engine's certificate authentication so that images can be pulled.
yum_repos : Fill in the Yum source registry address.
Success
After correctly filling in the above information, the page will prompt that the cluster is being created.
Creating a cluster takes a long time, so you need to wait patiently. You can click the Back to Clusters button to let it run in the background.
To view the current status, click Real-time Log .
Note
When the cluster is in an unknown state, it means that the current cluster has been disconnected.
The data displayed by the system is the cached data before the disconnection, which does not represent real data.
Any operations performed in the disconnected state will not take effect. Please check the cluster network connectivity or Host Status.
Clusters created in AI platform Container Management can be either deleted or removed. Clusters integrated into AI platform can only be removed.
Info
If you want to delete an integrated cluster, you should delete it in the platform where it is created.
In AI platform, the difference between Delete and Remove is:
Delete will destroy the cluster and reset the data of all nodes under the cluster. All data will be completely cleared and lost. Making a backup before deleting a cluster is a recommended best practice. You will no longer be able to use that cluster.
Remove just removes the cluster from AI platform. It will not destroy the cluster and no data will be lost. You can still use the cluster in other platforms or re-integrate it into AI platform later if needed.
Note
You should have Admin or Kpanda Owner permissions to perform delete or remove operations.
Before deleting a cluster, you should turn off Cluster Deletion Protection in Cluster Settings -> Advanced Settings , otherwise the Delete Cluster option will not be displayed.
The global service cluster cannot be deleted or removed.
Enter the Container Management module, find your target cluster, click __...__ on the right, and select Delete cluster / Remove in the drop-down list.
Enter the cluster name to confirm and click Delete .
You will be automatically directed to the cluster list. The status of this cluster will change to Deleting . It may take a while to delete/remove a cluster.
With the cluster integration feature, AI platform allows you to manage on-premise and cloud clusters of various providers in a unified manner. This is quite important in avoiding the risk of being locked in by a certain provider, helping enterprises safely migrate their business to the cloud.
In AI platform Container Management module, you can integrate a cluster of the following providers: standard Kubernetes clusters, Redhat Openshift, SUSE Rancher, VMware Tanzu, Amazon EKS, Aliyun ACK, Huawei CCE, Tencent TKE, etc.
Enter Container Management module, and click Integrate Cluster in the upper right corner.
Fill in the basic information by referring to the following instructions.
Cluster Name: It should be unique and cannot be changed after the integration. Maximum 63 characters, can only contain lowercase letters, numbers, and a separator (\"-\"), and must start and end with a lowercase letter or number.
Cluster Alias: Enter any characters, no more than 60 characters.
Release Distribution: the cluster provider; the mainstream vendors listed above are supported.
Fill in the KubeConfig of the target cluster and click Verify Config . The cluster can be successfully connected only after the verification is passed.
Click How do I get the KubeConfig? to see the specific steps for getting this file.
Confirm that all parameters are filled in correctly and click OK in the lower right corner of the page.
Note
The status of the newly integrated cluster is Integrating , which will become Running after the integration succeeds.
"},{"location":"en/end-user/kpanda/clusters/integrate-rancher-cluster.html","title":"Integrate the Rancher Cluster","text":"
This page explains how to integrate a Rancher cluster.
Prepare a Rancher cluster with administrator privileges and ensure network connectivity between the container management cluster and the target cluster.
You should have permissions not lower than Kpanda Owner.
"},{"location":"en/end-user/kpanda/clusters/integrate-rancher-cluster.html#steps","title":"Steps","text":""},{"location":"en/end-user/kpanda/clusters/integrate-rancher-cluster.html#step-1-create-a-serviceaccount-user-with-administrator-privileges-in-the-rancher-cluster","title":"Step 1: Create a ServiceAccount user with administrator privileges in the Rancher cluster","text":"
Log in to the Rancher cluster with a role that has administrator privileges, and create a file named sa.yaml using the terminal.
vi sa.yaml\n
Press the i key to enter insert mode, then copy and paste the following content:
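The original file content is not reproduced here; a minimal sketch of what sa.yaml could contain is shown below, assuming the ServiceAccount is named rancher-rke, lives in the kube-system namespace, and is bound to the cluster-admin ClusterRole:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rancher-rke
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rancher-rke
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin        # grants administrator privileges
subjects:
  - kind: ServiceAccount
    name: rancher-rke
    namespace: kube-system
```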
"},{"location":"en/end-user/kpanda/clusters/integrate-rancher-cluster.html#step-2-update-kubeconfig-with-the-rancher-rke-sa-authentication-on-your-local-machine","title":"Step 2: Update kubeconfig with the rancher-rke SA authentication on your local machine","text":"
Perform the following steps on any local node where kubelet is installed:
{cluster-name} : the name of your Rancher cluster.
{APIServer} : the access address of the cluster, usually referring to the IP address of the control node + port "6443", such as https://10.X.X.X:6443 .
"},{"location":"en/end-user/kpanda/clusters/integrate-rancher-cluster.html#step-3-connect-the-cluster-in-the-suanova-interface","title":"Step 3: Connect the cluster in the Suanova Interface","text":"
Using the kubeconfig file fetched earlier, refer to the Integrate Cluster documentation to integrate the Rancher cluster to the global cluster.
"},{"location":"en/end-user/kpanda/clusters/runtime.html","title":"How to choose the container runtime","text":"
The container runtime is an important component in Kubernetes that manages the life cycle of containers and container images. Kubernetes made containerd the default container runtime in version 1.19 and removed support for the Dockershim component in version 1.24.
Therefore, compared to the Docker runtime, we recommend using the lightweight containerd as your container runtime, as it has become the current mainstream runtime choice.
In addition, some operating system distribution vendors are not friendly enough for Docker runtime compatibility. The runtime support of different operating systems is as follows:
The Kubernetes community releases a minor version roughly every quarter, and each version is maintained for only about 9 months. Major bugs or security vulnerabilities are no longer fixed after a version leaves maintenance, and manually upgrading clusters is cumbersome and places a huge workload on administrators.
In Suanova, you can upgrade the Kubernetes cluster with one click through the web UI interface.
Danger
After the version is upgraded, it will not be possible to roll back to the previous version, please proceed with caution.
Note
Kubernetes versions are denoted as x.y.z , where x is the major version, y is the minor version, and z is the patch version.
Cluster upgrades cannot skip minor versions, e.g. a direct upgrade from 1.23 to 1.25 is not possible.
**Integrated clusters do not support version upgrades. If there is no "Cluster Upgrade" option in the left navigation bar, please check whether the cluster is an integrated cluster.**
The global service cluster can only be upgraded through the terminal.
When upgrading a worker cluster, the Management Cluster of the worker cluster should have been connected to the container management module and be running normally.
Click the name of the target cluster in the cluster list.
Then click Cluster Operation and Maintenance -> Cluster Upgrade in the left navigation bar, and click Version Upgrade in the upper right corner of the page.
Select the version that can be upgraded, and enter the cluster name to confirm.
After clicking OK , you can see the upgrade progress of the cluster.
The cluster upgrade is expected to take 30 minutes. You can click the Real-time Log button to view the detailed log of the cluster upgrade.
ConfigMaps store non-confidential data in the form of key-value pairs to achieve the effect of mutual decoupling of configuration data and application code. ConfigMaps can be used as environment variables for containers, command-line parameters, or configuration files in storage volumes.
Note
The data saved in ConfigMaps cannot exceed 1 MiB. If you need to store larger volumes of data, it is recommended to mount a storage volume or use an independent database or file service.
ConfigMaps do not provide confidentiality or encryption. If you want to store encrypted data, it is recommended to use secret, or other third-party tools to ensure the privacy of data.
Click the name of a cluster on the Clusters page to enter Cluster Details .
In the left navigation bar, click ConfigMap and Secret -> ConfigMap , and click the YAML Create button in the upper right corner.
Fill in or paste the configuration file prepared in advance (a minimal example is shown after the note below), and then click OK in the lower right corner of the pop-up box.
!!! note
    - Click __Import__ to import an existing local file to quickly create ConfigMaps.
    - After filling in the data, click __Download__ to save the configuration file locally.
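For reference, a minimal ConfigMap manifest might look like the following (the name, namespace, and data are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config          # illustrative name
  namespace: default
data:
  database.host: "mysql.default.svc"
  log.level: "info"
```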
After the creation is complete, click More on the right side of the ConfigMap to perform operations such as editing the YAML, updating, exporting, and deleting.
A secret is a resource object used to store and manage sensitive information such as passwords, OAuth tokens, SSH and TLS credentials, etc. Using secrets means you don't need to include sensitive data in your application code.
Secrets can be used in some cases:
Used as an environment variable of the container to provide some necessary information required during the running of the container.
Use secrets as pod data volumes.
As the identity authentication credential for the container registry when the kubelet pulls the container image.
Integrated the Kubernetes cluster or created the Kubernetes cluster, and you can access the UI interface of the cluster
Created a namespace, user, and authorized the user as NS Editor. For details, refer to Namespace Authorization.
"},{"location":"en/end-user/kpanda/configmaps-secrets/create-secret.html#create-secret-with-wizard","title":"Create secret with wizard","text":"
Click the name of a cluster on the Clusters page to enter Cluster Details .
In the left navigation bar, click ConfigMap and Secret -> Secret , and click the Create Secret button in the upper right corner.
Fill in the configuration information on the Create Secret page, and click OK .
Note when filling in the configuration:
The name of the secret must be unique within the same namespace.
Secret type:
Default (Opaque): Kubernetes default secret type, which supports arbitrary user-defined data.
TLS (kubernetes.io/tls): credentials for TLS client or server data access.
Container registry information (kubernetes.io/dockerconfigjson): credentials for container registry access.
Username and password (kubernetes.io/basic-auth): credentials for basic authentication.
Custom: a type customized by the user according to business needs.
Secret data: the data stored in the secret; the fields to fill in differ by data type.
When the secret type is default (Opaque) or custom: multiple key-value pairs can be filled in.
When the secret type is TLS (kubernetes.io/tls): you need to fill in the certificate and private key data. Certificates are self-signed or CA-signed credentials used for authentication. A certificate request is a request for a signature and needs to be signed with a private key.
When the secret type is container registry information (kubernetes.io/dockerconfigjson): you need to fill in the account and password of the private container registry.
When the secret type is username and password (kubernetes.io/basic-auth): a username and password need to be specified.
ConfigMap (ConfigMap) is an API object of Kubernetes, which is used to save non-confidential data into key-value pairs, and can store configurations that other objects need to use. When used, the container can use it as an environment variable, a command-line argument, or a configuration file in a storage volume. By using ConfigMaps, configuration data and application code can be separated, providing a more flexible way to modify application configuration.
Note
ConfigMaps do not provide confidentiality or encryption. If the data to be stored is confidential, please use secret, or use other third-party tools to ensure the privacy of the data instead of ConfigMaps. In addition, when using ConfigMaps in containers, the container and ConfigMaps must be in the same cluster namespace.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#scenes-to-be-used","title":"scenes to be used","text":"
You can use ConfigMaps in Pods. There are many use cases, mainly including:
Use ConfigMaps to set the environment variables of the container
Use ConfigMaps to set the command line parameters of the container
Use ConfigMaps as container data volumes
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#set-the-environment-variables-of-the-container","title":"Set the environment variables of the container","text":"
You can use the ConfigMap as the environment variable of the container through the graphical interface or the terminal command line.
Note
The ConfigMap import is to use the ConfigMap as the value of the environment variable; the ConfigMap key value import is to use a certain parameter in the ConfigMap as the value of the environment variable.
When creating a workload through an image, you can set environment variables for the container by selecting Import ConfigMaps or Import ConfigMap Key Values on the Environment Variables interface.
Go to the Image Creation Workload page, in the Container Configuration step, select the Environment Variables configuration, and click the Add Environment Variable button.
Select ConfigMap Import or ConfigMap Key Value Import in the environment variable type.
When the environment variable type is selected as ConfigMap import , enter variable name , prefix name, ConfigMap name in sequence.
When the environment variable type is selected as ConfigMap key-value import , enter the variable name, ConfigMap name, and key name in sequence.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#command-line-operation","title":"Command line operation","text":"
You can set ConfigMaps as environment variables when creating a workload, using the valueFrom parameter to refer to a Key/Value in the ConfigMap, as shown in the sketch below.
In that example, valueFrom specifies that the env value comes from a ConfigMap, name references the ConfigMap, and key references the specific entry within it.
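A minimal sketch of a Pod that does this, assuming a ConfigMap named my-config with a key log.level (both names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: configmap-env-demo           # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "env"]
      env:
        - name: LOG_LEVEL
          valueFrom:                 # the value of the env comes from a ConfigMap
            configMapKeyRef:
              name: my-config        # referenced ConfigMap name
              key: log.level         # referenced key within the ConfigMap
```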
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#set-the-command-line-parameters-of-the-container","title":"Set the command line parameters of the container","text":"
You can use ConfigMaps to set the command or parameter value in the container, and use the environment variable substitution syntax $(VAR_NAME) to do so. As follows.
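A minimal sketch, reusing the illustrative my-config ConfigMap from above and substituting the variable into the container command:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: configmap-cmd-demo                                      # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "echo log level is $(LOG_LEVEL)"]   # $(VAR_NAME) substitution
      env:
        - name: LOG_LEVEL
          valueFrom:
            configMapKeyRef:
              name: my-config
              key: log.level
```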
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#used-as-container-data-volume","title":"Used as container data volume","text":"
You can use the ConfigMap as the data volume of the container through the graphical interface or the terminal command line.
When creating a workload through an image, you can use the ConfigMap as the data volume of the container by selecting the storage type as \"ConfigMap\" on the \"Data Storage\" interface.
Go to the Image Creation Workload page, in the Container Configuration step, select the Data Storage configuration, and click the __Add__ button in the __Node Path Mapping__ list.
Select ConfigMap in the storage type, and enter container path , subpath and other information in sequence.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-configmap.html#command-line-operation_1","title":"Command line operation","text":"
To use a ConfigMap in a Pod's storage volume.
Here is an example Pod that mounts a ConfigMap as a volume:
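The original example is not reproduced here; the following is a minimal sketch, again assuming an illustrative ConfigMap named my-config:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: configmap-volume-demo        # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "ls /etc/config && sleep 3600"]
      volumeMounts:
        - name: config-volume
          mountPath: /etc/config     # each key appears as a read-only file under this path
  volumes:
    - name: config-volume
      configMap:
        name: my-config              # the ConfigMap to mount
```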
If there are multiple containers in a Pod, each container needs its own volumeMounts block, but you only need to set one spec.volumes block per ConfigMap.
Note
When a ConfigMap is used as a data volume mounted on a container, the ConfigMap can only be read as a read-only file.
A secret is a resource object used to store and manage sensitive information such as passwords, OAuth tokens, SSH and TLS credentials, etc. Using secrets means you don't need to include sensitive data in your application code.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#scenes-to-be-used","title":"scenes to be used","text":"
You can use keys in Pods in a variety of use cases, mainly including:
Used as an environment variable of the container to provide some necessary information required during the running of the container.
Use secrets as pod data volumes.
Used as the identity authentication credential for the container registry when the kubelet pulls the container image.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#use-the-key-to-set-the-environment-variable-of-the-container","title":"Use the key to set the environment variable of the container","text":"
You can use the key as the environment variable of the container through the GUI or the terminal command line.
Note
Key Import uses the entire secret as the value of an environment variable; Key Key Value Import uses a single key in the secret as the value of an environment variable.
When creating a workload from an image, you can set environment variables for the container by selecting Key Import or Key Key Value Import on the Environment Variables interface.
Go to the Image Creation Workload page.
Select the Environment Variables configuration in Container Configuration , and click the Add Environment Variable button.
Select Key Import or Key Key Value Import in the environment variable type.
When the environment variable type is selected as Key Import , enter Variable Name , Prefix , and Secret in sequence.
When the environment variable type is selected as Key Key Value Import , enter the variable name, Secret name, and key name in sequence.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#command-line-operation","title":"Command line operation","text":"
As shown in the sketch below, you can set the secret as an environment variable when creating the workload, using the valueFrom parameter to refer to a Key/Value in the Secret.
In this example, the Secret "mysecret" must already exist and contain keys named "username" and "password".
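A minimal sketch of such a Pod (the Pod and container names are illustrative; the Secret name and keys follow the note above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-env-demo              # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "env"]
      env:
        - name: SECRET_USERNAME
          valueFrom:
            secretKeyRef:
              name: mysecret         # must exist and contain a key named "username"
              key: username
        - name: SECRET_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysecret         # must exist and contain a key named "password"
              key: password
```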
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#use-the-key-as-the-pods-data-volume","title":"Use the key as the pod's data volume","text":""},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#graphical-interface-operation_1","title":"Graphical interface operation","text":"
When creating a workload through an image, you can use the key as the data volume of the container by selecting the storage type as \"key\" on the \"data storage\" interface.
Go to the Image Creation Workload page.
In the Container Configuration , select the Data Storage configuration, and click the Add button in the Node Path Mapping list.
Select Secret in the storage type, and enter container path , subpath and other information in sequence.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#command-line-operation_1","title":"Command line operation","text":"
The following is an example of a Pod that mounts a Secret named mysecret via a data volume:
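The original example is not reproduced here; a minimal sketch follows (the Pod name and mount path are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-volume-demo           # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      command: ["sh", "-c", "ls /etc/secret && sleep 3600"]
      volumeMounts:
        - name: secret-volume
          mountPath: /etc/secret     # each key appears as a file under this path
          readOnly: true
  volumes:
    - name: secret-volume
      secret:
        secretName: mysecret         # the Secret to mount
```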
By default, this means the Secret "mysecret" must already exist before the Pod references it.
If the Pod contains multiple containers, each container needs its own volumeMounts block, but only one .spec.volumes setting is required for each Secret.
"},{"location":"en/end-user/kpanda/configmaps-secrets/use-secret.html#used-as-the-identity-authentication-credential-for-the-container-registry-when-the-kubelet-pulls-the-container-image","title":"Used as the identity authentication credential for the container registry when the kubelet pulls the container image","text":"
You can use the key as the identity authentication credential for the Container registry through the GUI or the terminal command line.
When creating a workload through an image, you can use the secret as the authentication credential for the container registry when selecting an image from a private registry.
Go to the Image Creation Workload page.
In the second step of Container Configuration , select the Basic Information configuration, and click the Select Image button.
Select the name of the private container registry in the drop-down list of "container registry" in the pop-up box. Please see Create Secret for details on private image secret creation.
Enter the image name in the private registry, click OK to complete the image selection.
Note
When creating a secret, you need to ensure that you enter the correct container registry address, username, and password, and select the correct image name; otherwise, you will not be able to pull the image from the container registry.
In Kubernetes, all objects are abstracted as resources, such as Pod, Deployment, Service, Volume, etc. are the default resources provided by Kubernetes. This provides important support for our daily operation and maintenance and management work, but in some special cases, the existing preset resources cannot meet the needs of the business. Therefore, we hope to expand the capabilities of the Kubernetes API, and CustomResourceDefinition (CRD) was born based on this requirement.
The container management module supports interface-based management of custom resources, and its main features are as follows:
Obtain the list and detailed information of custom resources under the cluster
Create custom resources based on YAML
Create a custom resource example CR (Custom Resource) based on YAML
"},{"location":"en/end-user/kpanda/custom-resources/create.html#create-a-custom-resource-example-via-yaml","title":"Create a custom resource example via YAML","text":"
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Custom Resource , and click the YAML Create button in the upper right corner.
Click the custom resource named crontabs.stable.example.com , enter the details, and click the YAML Create button in the upper right corner.
On the Create with YAML page, fill in the YAML statement and click OK .
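The YAML itself is not included in this page; as an illustration, assuming the crontabs.stable.example.com CRD defines a CronTab kind with the spec fields used in the upstream Kubernetes CRD example, the custom resource could look like this:

```yaml
apiVersion: stable.example.com/v1
kind: CronTab                        # kind assumed to be defined by the CRD
metadata:
  name: my-new-cron-object
spec:
  cronSpec: "* * * * */5"            # illustrative field from the upstream example
  image: my-awesome-cron-image       # illustrative field from the upstream example
```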
Return to the details page of crontabs.stable.example.com , and you can view the custom resource named my-new-cron-object just created.
"},{"location":"en/end-user/kpanda/gpu/index.html","title":"Overview of GPU Management","text":"
This article introduces the capability of Suanova container management platform in unified operations and management of heterogeneous resources, with a focus on GPUs.
With the rapid development of emerging technologies such as large AI models, machine learning, and autonomous driving, enterprises face a growing volume of compute-intensive tasks and data processing. Traditional CPU-based compute architectures can no longer meet these growing computational requirements. Heterogeneous computing represented by GPUs has therefore been widely adopted, thanks to its unique advantages in processing large-scale data, performing complex calculations, and real-time graphics rendering.
Meanwhile, due to a lack of experience and mature solutions for scheduling and managing heterogeneous resources, GPU utilization is often very low, driving up AI production costs. Reducing costs, increasing efficiency, and improving the utilization of GPUs and other heterogeneous resources has become a pressing challenge for many enterprises.
"},{"location":"en/end-user/kpanda/gpu/index.html#introduction-to-gpu-capabilities","title":"Introduction to GPU Capabilities","text":"
The Suanova container management platform supports unified scheduling and operations management of GPUs, NPUs, and other heterogeneous resources, fully unleashing the computational power of GPU resources, and accelerating the development of enterprise AI and other emerging applications. The GPU management capabilities of Suanova are as follows:
Support for unified management of heterogeneous computing resources from manufacturers such as NVIDIA, Huawei Ascend, and Iluvatar.
Support for multi-card heterogeneous scheduling within the same cluster, with automatic recognition of GPUs in the cluster.
Support for native management solutions for NVIDIA GPUs, vGPUs, and MIG, with cloud native capabilities.
Support for partitioning a single physical GPU for use by different tenants, and for allocating GPU resources to tenants and containers based on compute power and memory quotas.
Support for multi-dimensional GPU resource monitoring at the cluster, node, and application levels, assisting operators in managing GPU resources.
Compatibility with various training frameworks such as TensorFlow and PyTorch.
"},{"location":"en/end-user/kpanda/gpu/index.html#introduction-to-gpu-operator","title":"Introduction to GPU Operator","text":"
Similar to regular computer hardware, NVIDIA GPUs, as physical devices, need to have the NVIDIA GPU driver installed in order to be used. To reduce the cost of using GPUs on Kubernetes, NVIDIA provides the NVIDIA GPU Operator component to manage various components required for using NVIDIA GPUs. These components include the NVIDIA driver (for enabling CUDA), NVIDIA container runtime, GPU node labeling, DCGM-based monitoring, and more. In theory, users only need to plug the GPU into a compute device managed by Kubernetes, and they can use all the capabilities of NVIDIA GPUs through the GPU Operator. For more information about NVIDIA GPU Operator, refer to the NVIDIA official documentation. For deployment instructions, refer to Offline Installation of GPU Operator.
Architecture diagram of NVIDIA GPU Operator:
"},{"location":"en/end-user/kpanda/gpu/FAQ.html","title":"GPU FAQs","text":""},{"location":"en/end-user/kpanda/gpu/FAQ.html#gpu-processes-are-not-visible-while-running-nvidia-smi-inside-a-pod","title":"GPU processes are not visible while running nvidia-smi inside a pod","text":"
Q: When running the nvidia-smi command inside a GPU-utilizing pod, no GPU process information is visible in the full-card mode and vGPU mode.
A: Due to PID namespace isolation, GPU processes are not visible inside the Pod. To view GPU processes, you can use one of the following methods:
Configure hostPID: true for the workload that uses the GPU, so that the container shares the host PID namespace and GPU processes become visible (see the sketch after this list).
Run the nvidia-smi command in the driver pod of the gpu-operator to view processes.
Run the chroot /run/nvidia/driver nvidia-smi command on the host to view processes.
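A minimal sketch of the first method, assuming a generic CUDA base image (the image tag is an assumption; any GPU workload image works):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-hostpid-demo
spec:
  hostPID: true               # share the host PID namespace so GPU processes are visible to nvidia-smi
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1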
"},{"location":"en/end-user/kpanda/gpu/Iluvatar_usage.html","title":"How to Use Iluvatar GPU in Applications","text":"
This section describes how to use Iluvatar virtual GPU on AI platform.
The AI platform container management platform has been deployed and is running properly.
The container management module has been integrated with a Kubernetes cluster or a Kubernetes cluster has been created, and the cluster UI can be accessed.
The Iluvatar GPU driver has been installed on the current cluster. Refer to the Iluvatar official documentation for driver installation instructions, or contact the Suanova ecosystem team for enterprise-level support at peg-pem@daocloud.io.
The GPUs in the current cluster have not undergone any virtualization operations and are not occupied by other applications.
"},{"location":"en/end-user/kpanda/gpu/Iluvatar_usage.html#procedure","title":"Procedure","text":""},{"location":"en/end-user/kpanda/gpu/Iluvatar_usage.html#configuration-via-user-interface","title":"Configuration via User Interface","text":"
Check if the GPU in the cluster has been detected. Click Clusters -> Cluster Settings -> Addon Plugins , and check if the proper GPU type has been automatically enabled and detected. Currently, the cluster will automatically enable GPU and set the GPU type as Iluvatar .
Deploy a workload. Click Clusters -> Workloads and deploy a workload using the image. After selecting the type as (Iluvatar) , configure the GPU resources used by the application:
Physical Card Count (iluvatar.ai/vcuda-core): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host machine.
Memory Usage (iluvatar.ai/vcuda-memory): Indicates the amount of GPU memory occupied by each card. The value is in MB, with a minimum value of 1 and a maximum value equal to the entire memory of the card.
If there are any issues with the configuration values, scheduling failures or resource allocation failures may occur.
"},{"location":"en/end-user/kpanda/gpu/Iluvatar_usage.html#configuration-via-yaml","title":"Configuration via YAML","text":"
To request GPU resources for a workload, add iluvatar.ai/vcuda-core: 1 and iluvatar.ai/vcuda-memory: 200 to both the requests and limits of the container. These parameters configure the application to use the physical GPU resources, as shown in the sketch below.
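A minimal sketch, assuming a hypothetical application image; the resource values follow the parameters described above:

apiVersion: v1
kind: Pod
metadata:
  name: iluvatar-vgpu-demo
spec:
  containers:
    - name: app
      image: demo-app:latest              # hypothetical image
      resources:
        requests:
          iluvatar.ai/vcuda-core: 1       # number of physical cards to mount
          iluvatar.ai/vcuda-memory: 200   # GPU memory per card, in MB
        limits:
          iluvatar.ai/vcuda-core: 1
          iluvatar.ai/vcuda-memory: 200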
"},{"location":"en/end-user/kpanda/gpu/dynamic-regulation.html","title":"GPU Scheduling Configuration (Binpack and Spread)","text":"
This page introduces how to use Binpack and Spread scheduling with NVIDIA vGPU to reduce GPU resource fragmentation and avoid single points of failure, enabling advanced scheduling for vGPU. The AI platform provides Binpack and Spread scheduling policies at two levels, cluster and workload, meeting different requirements in various scenarios.
Scheduling policy based on GPU dimension
Binpack: Prioritizes using the same GPU on a node, suitable for increasing GPU utilization and reducing resource fragmentation.
Spread: Multiple Pods are distributed across different GPUs on nodes, suitable for high availability scenarios to avoid single card failures.
Scheduling policy based on node dimension
Binpack: Multiple Pods prioritize using the same node, suitable for increasing GPU utilization and reducing resource fragmentation.
Spread: Multiple Pods are distributed across different nodes, suitable for high availability scenarios to avoid single node failures.
"},{"location":"en/end-user/kpanda/gpu/dynamic-regulation.html#use-binpack-and-spread-at-cluster-level","title":"Use Binpack and Spread at Cluster-Level","text":"
Note
By default, workloads will follow the cluster-level Binpack and Spread. If a workload sets its own Binpack and Spread scheduling policies that differ from the cluster, the workload will prioritize its own scheduling policy.
On the Clusters page, select the cluster for which you want to adjust the Binpack and Spread scheduling policies. Click the ┇ icon on the right and select GPU Scheduling Configuration from the dropdown list.
Adjust the GPU scheduling configuration according to your business scenario, and click OK to save.
"},{"location":"en/end-user/kpanda/gpu/dynamic-regulation.html#use-binpack-and-spread-at-workload-level","title":"Use Binpack and Spread at Workload-Level","text":"
Note
When the Binpack and Spread scheduling policies at the workload level conflict with the cluster-level configuration, the workload-level configuration takes precedence.
Follow the steps below to create a deployment using an image and configure Binpack and Spread scheduling policies within the workload.
Click Clusters in the left navigation bar, then click the name of the target cluster to enter the Cluster Details page.
On the Cluster Details page, click Workloads -> Deployments in the left navigation bar, then click the Create by Image button in the upper right corner of the page.
Fill in the Basic Information and Container Settings in order. In the Container Configuration section, enable GPU configuration and select NVIDIA vGPU as the GPU type. Click Advanced Settings, enable the Binpack / Spread scheduling policy, and adjust the GPU scheduling configuration according to your business scenario. After configuration, click Next to proceed to Service Settings and Advanced Settings. Finally, click OK at the bottom right of the page to complete the creation.
"},{"location":"en/end-user/kpanda/gpu/gpu-metrics.html#cluster-level","title":"Cluster Level","text":"
| Metric Name | Description |
| --- | --- |
| Number of GPUs | Total number of GPUs in the cluster |
| Average GPU Utilization | Average compute utilization of all GPUs in the cluster |
| Average GPU Memory Utilization | Average memory utilization of all GPUs in the cluster |
| GPU Power | Power consumption of all GPUs in the cluster |
| GPU Temperature | Temperature of all GPUs in the cluster |
| GPU Utilization Details | 24-hour usage details of all GPUs in the cluster (includes max, avg, current) |
| GPU Memory Usage Details | 24-hour memory usage details of all GPUs in the cluster (includes min, max, avg, current) |
| GPU Memory Bandwidth Utilization | For example, an Nvidia V100 GPU has a maximum memory bandwidth of 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the utilization is 50% |
"},{"location":"en/end-user/kpanda/gpu/gpu-metrics.html#node-level","title":"Node Level","text":"
| Metric Name | Description |
| --- | --- |
| GPU Mode | Usage mode of GPUs on the node, including full-card mode, MIG mode, and vGPU mode |
| Number of Physical GPUs | Total number of physical GPUs on the node |
| Number of Virtual GPUs | Number of vGPU devices created on the node |
| Number of MIG Instances | Number of MIG instances created on the node |
| GPU Memory Allocation Rate | Memory allocation rate of all GPUs on the node |
| Average GPU Utilization | Average compute utilization of all GPUs on the node |
| Average GPU Memory Utilization | Average memory utilization of all GPUs on the node |
| GPU Driver Version | Driver version information of GPUs on the node |
| GPU Utilization Details | 24-hour usage details of each GPU on the node (includes max, avg, current) |
| GPU Memory Usage Details | 24-hour memory usage details of each GPU on the node (includes min, max, avg, current) |
"},{"location":"en/end-user/kpanda/gpu/gpu-metrics.html#pod-level","title":"Pod Level","text":"
| Category | Metric Name | Description |
| --- | --- | --- |
| Application Overview GPU - Compute & Memory | Pod GPU Utilization | Compute utilization of the GPUs used by the current Pod |
| Application Overview GPU - Compute & Memory | Pod GPU Memory Utilization | Memory utilization of the GPUs used by the current Pod |
| Application Overview GPU - Compute & Memory | Pod GPU Memory Usage | Memory usage of the GPUs used by the current Pod |
| Application Overview GPU - Compute & Memory | Memory Allocation | Memory allocation of the GPUs used by the current Pod |
| Application Overview GPU - Compute & Memory | Pod GPU Memory Copy Ratio | Memory copy ratio of the GPUs used by the current Pod |
| GPU - Engine Overview | GPU Graphics Engine Activity Percentage | Percentage of time the Graphics or Compute engine is active during a monitoring cycle |
| GPU - Engine Overview | GPU Memory Bandwidth Utilization | Memory bandwidth utilization (Memory BW Utilization) indicates the fraction of cycles during which data is sent to or received from the device memory. This value represents the average over the interval, not an instantaneous value. A higher value indicates higher utilization of device memory. A value of 1 (100%) indicates that a DRAM instruction is executed every cycle during the interval (in practice, a peak of about 0.8 (80%) is the maximum achievable). A value of 0.2 (20%) indicates that 20% of the cycles during the interval are spent reading from or writing to device memory. |
| GPU - Engine Overview | Tensor Core Utilization | Percentage of time the Tensor Core pipeline is active during a monitoring cycle |
| GPU - Engine Overview | FP16 Engine Utilization | Percentage of time the FP16 pipeline is active during a monitoring cycle |
| GPU - Engine Overview | FP32 Engine Utilization | Percentage of time the FP32 pipeline is active during a monitoring cycle |
| GPU - Engine Overview | FP64 Engine Utilization | Percentage of time the FP64 pipeline is active during a monitoring cycle |
| GPU - Engine Overview | GPU Decode Utilization | Decode engine utilization of the GPU |
| GPU - Engine Overview | GPU Encode Utilization | Encode engine utilization of the GPU |
| GPU - Temperature & Power | GPU Temperature | Temperature of all GPUs in the cluster |
| GPU - Temperature & Power | GPU Power | Power consumption of all GPUs in the cluster |
| GPU - Temperature & Power | GPU Total Power Consumption | Total power consumption of the GPUs |
| GPU - Clock | GPU Memory Clock | Memory clock frequency |
| GPU - Clock | GPU Application SM Clock | Application SM clock frequency |
| GPU - Clock | GPU Application Memory Clock | Application memory clock frequency |
| GPU - Clock | GPU Video Engine Clock | Video engine clock frequency |
| GPU - Clock | GPU Throttle Reasons | Reasons for GPU throttling |
| GPU - Other Details | PCIe Transfer Rate | Data transfer rate of the GPU through the PCIe bus |
| GPU - Other Details | PCIe Receive Rate | Data receive rate of the GPU through the PCIe bus |
"},{"location":"en/end-user/kpanda/gpu/gpu_matrix.html","title":"GPU Support Matrix","text":"
This page explains the matrix of supported GPUs and operating systems for AI platform.
"},{"location":"en/end-user/kpanda/gpu/gpu_matrix.html#nvidia-gpu","title":"NVIDIA GPU","text":"GPU Manufacturer and Type Supported GPU Models Compatible Operating System (Online) Recommended Kernel Recommended Operating System and Kernel Installation Documentation NVIDIA GPU (Full Card/vGPU)
NVIDIA Fermi (2.1) Architecture:
NVIDIA GeForce 400 Series
NVIDIA Quadro 4000 Series
NVIDIA Tesla 20 Series
NVIDIA Ampere Architecture Series (A100; A800; H100)
CentOS 7
Kernel 3.10.0-123 ~ 3.10.0-1160
Kernel Reference Document
Recommended Operating System with Proper Kernel Version
This document mainly introduces the configuration of GPU scheduling, which can implement advanced scheduling policies. Currently, the primary implementation is the vgpu scheduling policy.
vGPU provides two resource usage policies: binpack and spread. Both can be applied at the node dimension and at the GPU dimension. The choice depends on whether you want to spread workloads across different nodes and GPUs, or concentrate them on the same node and GPU to improve resource utilization and reduce fragmentation.
You can modify the scheduling policy in your cluster by following these steps:
Go to the cluster management list in the container management interface.
Click the settings button ... next to the cluster.
Click GPU Scheduling Configuration.
Toggle the scheduling policy between node-level and GPU-level. By default, the node-level policy is binpack, and the GPU-level policy is spread.
The above steps modify the cluster-level scheduling policy. Users can also specify their own scheduling policy at the workload level to change the scheduling results. Below is an example of modifying the scheduling policy at the workload level:
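A minimal sketch of a workload that pins both dimensions to binpack. The annotation keys shown here (hami.io/node-scheduler-policy and hami.io/gpu-scheduler-policy) are an assumption based on the HAMi vGPU scheduler and may differ in your environment; check your scheduler's documentation for the exact names:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vgpu-binpack-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vgpu-binpack-demo
  template:
    metadata:
      labels:
        app: vgpu-binpack-demo
      annotations:
        hami.io/node-scheduler-policy: "binpack"   # assumed key: node-dimension policy
        hami.io/gpu-scheduler-policy: "binpack"    # assumed key: GPU-dimension policy
    spec:
      containers:
        - name: app
          image: demo-app:latest                   # hypothetical image
          resources:
            limits:
              nvidia.com/vgpu: 1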
In this example, both the node- and GPU-level scheduling policies are set to binpack. This ensures that the workload is scheduled to maximize resource utilization and reduce fragmentation.
Follow these steps to manage GPU quotas in AI platform:
Go to Namespaces and click Quota Management to configure the GPU resources that can be used by a specific namespace.
The currently supported card types for quota management in a namespace are: NVIDIA vGPU, NVIDIA MIG, Iluvatar, and Ascend.
NVIDIA vGPU Quota Management: Configure the specific quota that can be used. This will create a ResourcesQuota CR.
- Physical Card Count (nvidia.com/vgpu): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and **less than or equal to** the number of cards on the host machine.
- GPU Core Count (nvidia.com/gpucores): Indicates the GPU compute power occupied by each card. The value ranges from 0 to 100. If configured as 0, it is considered not to enforce isolation. If configured as 100, it is considered to exclusively occupy the entire card.
- GPU Memory Usage (nvidia.com/gpumem): Indicates the amount of GPU memory occupied by each card. The value is in MB, with a minimum value of 1 and a maximum value equal to the entire memory of the card.
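For illustration, a sketch of a container requesting vGPU resources with these names (the values are examples only and must fit within the namespace quota):

apiVersion: v1
kind: Pod
metadata:
  name: vgpu-quota-demo
spec:
  containers:
    - name: app
      image: demo-app:latest        # hypothetical image
      resources:
        limits:
          nvidia.com/vgpu: 1        # one physical card
          nvidia.com/gpucores: 50   # 50% of the card's compute power
          nvidia.com/gpumem: 4096   # 4096 MB of GPU memory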
This document uses the AscendCL Image Classification Application example from the Ascend sample library.
Download the Ascend repository
Run the following command to download the Ascend demo repository, and remember the storage location of the code for subsequent use.
git clone https://gitee.com/ascend/samples.git\n
Prepare the base image
This example uses the Ascend-pytorch base image, which can be obtained from the Ascend Container Registry.
Prepare the YAML file
ascend-demo.yaml
apiVersion: batch/v1\nkind: Job\nmetadata:\n name: resnetinfer1-1-1usoc\nspec:\n template:\n spec:\n containers:\n - image: ascendhub.huawei.com/public-ascendhub/ascend-pytorch:23.0.RC2-ubuntu18.04 # Inference image name\n imagePullPolicy: IfNotPresent\n name: resnet50infer\n securityContext:\n runAsUser: 0\n command:\n - \"/bin/bash\"\n - \"-c\"\n - |\n source /usr/local/Ascend/ascend-toolkit/set_env.sh &&\n TEMP_DIR=/root/samples_copy_$(date '+%Y%m%d_%H%M%S_%N') &&\n cp -r /root/samples \"$TEMP_DIR\" &&\n cd \"$TEMP_DIR\"/inference/modelInference/sampleResnetQuickStart/python/model &&\n wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/003_Atc_Models/resnet50/resnet50.onnx &&\n atc --model=resnet50.onnx --framework=5 --output=resnet50 --input_shape=\"actual_input_1:1,3,224,224\" --soc_version=Ascend910 &&\n cd ../data &&\n wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/models/aclsample/dog1_1024_683.jpg &&\n cd ../scripts &&\n bash sample_run.sh\n resources:\n requests:\n huawei.com/Ascend910: 1 # Number of the Ascend 910 Processors\n limits:\n huawei.com/Ascend910: 1 # The value should be the same as that of requests\n volumeMounts:\n - name: hiai-driver\n mountPath: /usr/local/Ascend/driver\n readOnly: true\n - name: slog\n mountPath: /var/log/npu/conf/slog/slog.conf\n - name: localtime # The container time must be the same as the host time\n mountPath: /etc/localtime\n - name: dmp\n mountPath: /var/dmp_daemon\n - name: slogd\n mountPath: /var/slogd\n - name: hbasic\n mountPath: /etc/hdcBasic.cfg\n - name: sys-version\n mountPath: /etc/sys_version.conf\n - name: aicpu\n mountPath: /usr/lib64/aicpu_kernels\n - name: tfso\n mountPath: /usr/lib64/libtensorflow.so\n - name: sample-path\n mountPath: /root/samples\n volumes:\n - name: hiai-driver\n hostPath:\n path: /usr/local/Ascend/driver\n - name: slog\n hostPath:\n path: /var/log/npu/conf/slog/slog.conf\n - name: localtime\n hostPath:\n path: /etc/localtime\n - name: dmp\n hostPath:\n path: /var/dmp_daemon\n - name: slogd\n hostPath:\n path: /var/slogd\n - name: hbasic\n hostPath:\n path: /etc/hdcBasic.cfg\n - name: sys-version\n hostPath:\n path: /etc/sys_version.conf\n - name: aicpu\n hostPath:\n path: /usr/lib64/aicpu_kernels\n - name: tfso\n hostPath:\n path: /usr/lib64/libtensorflow.so\n - name: sample-path\n hostPath:\n path: /root/samples\n restartPolicy: OnFailure\n
Some fields in the above YAML need to be modified according to the actual situation:
atc ... --soc_version=Ascend910 uses Ascend910, adjust this field depending on your actual situation. You can use the npu-smi info command to check the GPU model and add the Ascend prefix.
The sample-path volume should be adjusted according to the actual situation.
resources should be adjusted according to the actual situation.
Deploy a Job and check its results
Use the following command to create a Job:
kubectl apply -f ascend-demo.yaml\n
Check the Pod running status:
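A minimal sketch of the commands (the Pod name carries a suffix generated from the Job resnetinfer1-1-1usoc; replace <pod-name> accordingly):

kubectl get pod | grep resnetinfer1-1-1usoc
# Once the Pod is Running or Completed, view its logs:
kubectl logs -f <pod-name>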
After the Pod runs successfully, check the log results. The key prompt information on the screen is shown in the figure below. The Label indicates the category identifier, Conf indicates the maximum confidence of the classification, and Class indicates the belonging category. These values may vary depending on the version and environment, so please refer to the actual situation:
Confirm whether the cluster has detected the GPU. Click Clusters -> Cluster Settings -> Addon Plugins , and check whether the proper GPU type is automatically enabled and detected. Currently, the cluster will automatically enable GPU and set the GPU type to Ascend .
Deploy the workload. Click Clusters -> Workloads , deploy the workload through an image, select the type (Ascend), and then configure the number of physical cards used by the application:
Number of Physical Cards (huawei.com/Ascend910) : This indicates how many physical cards the current Pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host.
If there is an issue with the above configuration, it will result in scheduling failure and resource allocation issues.
"},{"location":"en/end-user/kpanda/gpu/ascend/ascend_driver_install.html","title":"Installation of Ascend NPU Components","text":"
This chapter provides installation guidance for Ascend NPU drivers, Device Plugin, NPU-Exporter, and other components.
Before using NPU resources, you need to complete the firmware installation, NPU driver installation, Docker Runtime installation, user creation, log directory creation, and NPU Device Plugin installation. Refer to the following steps for details.
Confirm that the kernel version is within the range proper to the \"binary installation\" method, and then you can directly install the NPU driver firmware.
For firmware and driver downloads, refer to: Firmware Download Link
For firmware installation, refer to: Install NPU Driver Firmware
If the driver is not installed, refer to the official Ascend documentation for installation. For example, for Ascend910, refer to: 910 Driver Installation Document.
Run the command npu-smi info, and if the NPU information is returned normally, it indicates that the NPU driver and firmware are ready.
Create the parent directory for component logs and the log directory for each component on the proper node, and set the appropriate owner and permissions for the directories.
Create the proper log directory for each required component. In this example, only the Device Plugin component is needed. For other component requirements, refer to the official documentation. The commands are sketched below.
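A minimal sketch, assuming the MindX DL default log path /var/log/mindx-dl (the paths and permission modes are assumptions; adjust them per the official Ascend documentation):

# Parent directory for component logs (assumed path)
mkdir -m 755 /var/log/mindx-dl
chown root:root /var/log/mindx-dl
# Log directory for the Device Plugin component (assumed path)
mkdir -m 750 /var/log/mindx-dl/devicePlugin
chown root:root /var/log/mindx-dl/devicePlugin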
Refer to the following commands to create labels on the proper nodes:
# Create this label on computing nodes where the driver is installed\nkubectl label node {nodename} huawei.com.ascend/Driver=installed\nkubectl label node {nodename} node-role.kubernetes.io/worker=worker\nkubectl label node {nodename} workerselector=dls-worker-node\nkubectl label node {nodename} host-arch=huawei-arm # or host-arch=huawei-x86, select according to the actual situation\nkubectl label node {nodename} accelerator=huawei-Ascend910 # select according to the actual situation\n# Create this label on control nodes\nkubectl label node {nodename} masterselector=dls-master-node\n
"},{"location":"en/end-user/kpanda/gpu/ascend/ascend_driver_install.html#install-device-plugin-and-npuexporter","title":"Install Device Plugin and NpuExporter","text":"
Functional module path: Container Management -> Cluster, click the name of the target cluster, then click Helm Apps -> Helm Charts from the left navigation bar, and search for ascend-mindxdl.
DevicePlugin: Provides a general device plugin mechanism and standard device API interface for Kubernetes to use devices. It is recommended to use the default image and version.
NpuExporter: Based on the Prometheus/Telegraf ecosystem, this component provides interfaces to help users monitor the Ascend series AI processors and container-level allocation status. It is recommended to use the default image and version.
ServiceMonitor: Disabled by default. If enabled, you can view NPU-related monitoring in the observability module. To enable, ensure that the insight-agent is installed and running, otherwise, the ascend-mindxdl installation will fail.
isVirtualMachine: Disabled by default. If the NPU node is a virtual machine scenario, enable the isVirtualMachine parameter.
After a successful installation, two components will appear under the proper namespace, as shown below:
At the same time, the proper NPU information will also appear on the node information:
Once everything is ready, you can select the proper NPU device when creating a workload through the page, as shown below:
Note
For detailed usage information, refer to Using Ascend NPU.
Ascend virtualization is divided into dynamic virtualization and static virtualization. This document describes how to enable and use Ascend static virtualization capabilities.
To enable virtualization capabilities, you need to manually modify the startup parameters of the ascend-device-plugin-daemonset component. Refer to the following command:
After splitting the instance, manually restart the device-plugin pod, then use the kubectl describe command to check the resources of the registered node:
kubectl describe node {{nodename}}\n
"},{"location":"en/end-user/kpanda/gpu/ascend/vnpu.html#how-to-use-the-device","title":"How to Use the Device","text":"
When creating an application, specify the resource key as shown in the following YAML:
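The original YAML is not included here; a minimal sketch follows. The resource key below ( huawei.com/Ascend910-4c , i.e. a 4-core vNPU slice) is an assumption — the actual key depends on the NPU model and the virtualization template you created:

apiVersion: v1
kind: Pod
metadata:
  name: vnpu-demo
spec:
  containers:
    - name: app
      image: demo-app:latest             # hypothetical image
      resources:
        requests:
          huawei.com/Ascend910-4c: 1     # assumed vNPU resource key
        limits:
          huawei.com/Ascend910-4c: 1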
NVIDIA, as a well-known graphics computing provider, offers various software and hardware solutions to enhance computational power. Among them, NVIDIA provides the following three solutions for GPU usage:
Full GPU refers to allocating the entire NVIDIA GPU to a single user or application. In this configuration, the application can fully occupy all the resources of the GPU and achieve maximum computational performance. Full GPU is suitable for workloads that require a large amount of computational resources and memory, such as deep learning training, scientific computing, etc.
vGPU is a virtualization technology that allows one physical GPU to be partitioned into multiple virtual GPUs, with each virtual GPU assigned to different virtual machines or users. vGPU enables multiple users to share the same physical GPU and independently use GPU resources in their respective virtual environments. Each virtual GPU can access a certain amount of compute power and memory capacity. vGPU is suitable for virtualized environments and cloud computing scenarios, providing higher resource utilization and flexibility.
MIG is a feature introduced by the NVIDIA Ampere architecture that allows one physical GPU to be divided into multiple physical GPU instances, each of which can be independently allocated to different users or workloads. Each MIG instance has its own compute resources, memory, and PCIe bandwidth, just like an independent virtual GPU. MIG provides finer-grained GPU resource allocation and management and allows dynamic adjustment of the number and size of instances based on demand. MIG is suitable for multi-tenant environments, containerized applications, batch jobs, and other scenarios.
Whether using vGPU in a virtualized environment or MIG on a physical GPU, NVIDIA provides users with more choices and optimized ways to utilize GPU resources. The Suanova container management platform fully supports the above NVIDIA capabilities. Users can easily access the full computational power of NVIDIA GPUs through simple UI operations, thereby improving resource utilization and reducing costs.
Single Mode: The node only exposes a single type of MIG device on all its GPUs. All GPUs on the node must:
Be of the same model (e.g., A100-SXM-40GB), with matching MIG profiles only for GPUs of the same model.
Have MIG configuration enabled, which requires a machine reboot to take effect.
Create identical GI and CI for exposing \"identical\" MIG devices across all products.
Mixed Mode: The node exposes mixed MIG device types on all its GPUs. Requesting a specific MIG device type requires specifying the number of compute slices and the total memory provided by that device type.
All GPUs on the node must: Be in the same product line (e.g., A100-SXM-40GB).
Each GPU can enable or disable MIG individually and freely configure any available mixture of MIG device types.
The k8s-device-plugin running on the node will:
Expose any GPUs not in MIG mode using the traditional nvidia.com/gpu resource type.
Expose individual MIG devices using resource types that follow the pattern nvidia.com/mig-<slice_count>g.<memory_size>gb (see the sketch after this list).
For detailed instructions on enabling these configurations, refer to Offline Installation of GPU Operator.
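As an illustration of that naming pattern, a container could request a single 1g.5gb MIG instance as follows (a sketch; the profile name assumes an A100 40GB-style card and mixed mode, and the image is an assumption):

apiVersion: v1
kind: Pod
metadata:
  name: mig-demo
spec:
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed image
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG instance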
"},{"location":"en/end-user/kpanda/gpu/nvidia/index.html#how-to-use","title":"How to Use","text":"
You can refer to the following links to quickly start using Suanova's management capabilities for NVIDIA GPUs.
Using Full NVIDIA GPU
Using NVIDIA vGPU
Using NVIDIA MIG
"},{"location":"en/end-user/kpanda/gpu/nvidia/full_gpu_userguide.html","title":"Using the Whole NVIDIA GPU for an Application","text":"
This section describes how to allocate an entire NVIDIA GPU to a single application on the AI platform.
AI platform container management platform has been deployed and is running properly.
The container management module has been connected to a Kubernetes cluster or a Kubernetes cluster has been created, and the cluster UI can be accessed.
GPU Operator has been installed offline and the NVIDIA DevicePlugin has been enabled on the current cluster. Refer to Offline Installation of GPU Operator for instructions.
The GPU in the current cluster has not undergone any virtualization operations or been occupied by other applications.
"},{"location":"en/end-user/kpanda/gpu/nvidia/full_gpu_userguide.html#procedure","title":"Procedure","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/full_gpu_userguide.html#configuring-via-the-user-interface","title":"Configuring via the User Interface","text":"
Check if the cluster has detected the GPUs. Click Clusters -> Cluster Settings -> Addon Plugins to see if it has automatically enabled and detected the proper GPU types. Currently, the cluster will automatically enable GPU and set the GPU Type as Nvidia GPU .
Deploy a workload. Click Clusters -> Workloads , and deploy the workload using the image method. After selecting the type ( Nvidia GPU ), configure the number of physical cards used by the application:
Physical Card Count (nvidia.com/gpu): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and less than or equal to the number of cards on the host machine.
If the above value is configured incorrectly, scheduling failures and resource allocation issues may occur.
"},{"location":"en/end-user/kpanda/gpu/nvidia/full_gpu_userguide.html#configuring-via-yaml","title":"Configuring via YAML","text":"
To request GPU resources for a workload, add the nvidia.com/gpu: 1 parameter to the resource request and limit configuration in the YAML file. This parameter configures the number of physical cards used by the application.
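A minimal sketch, assuming a generic CUDA base image (the image tag is an assumption):

apiVersion: v1
kind: Pod
metadata:
  name: full-gpu-demo
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed image
      command: ["nvidia-smi"]
      resources:
        requests:
          nvidia.com/gpu: 1   # number of physical cards
        limits:
          nvidia.com/gpu: 1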
AI platform comes with pre-installed driver images for the following three operating systems: Ubuntu 22.04, Ubuntu 20.04, and CentOS 7.9. The driver version is 535.104.12. Additionally, it includes the required Toolkit images for each operating system, so users no longer need to manually provide offline toolkit images.
This page uses the AMD64 architecture with CentOS 7.9 (kernel 3.10.0-1160) as an example. If you need to deploy on Red Hat 8.4, refer to Uploading Red Hat gpu-operator Offline Image to the Bootstrap Node Repository and Building Offline Yum Source for Red Hat 8.4.
The kernel versions of the cluster nodes where the gpu-operator is to be deployed must be identical. The distribution and GPU model of the nodes must fall within the scope specified in the GPU Support Matrix.
When installing the gpu-operator, select v23.9.0+2 or above.
systemOS : Select the operating system of the host. The current options are Ubuntu 22.04, Ubuntu 20.04, CentOS 7.9, and Other. Please choose the correct operating system.
Namespace : Select the namespace for installing the plugin
Version: The version of the plugin. Here, we use version v23.9.0+2 as an example.
Failure Deletion: If the installation fails, it will delete the already installed associated resources. When enabled, Ready Wait will also be enabled by default.
Ready Wait: When enabled, the application will be marked as successfully installed only when all associated resources are in a ready state.
Detailed Logs: When enabled, detailed logs of the installation process will be recorded.
Driver.enable : Configure whether to deploy the NVIDIA driver on the node, default is enabled. If you have already deployed the NVIDIA driver on the node before using the gpu-operator, please disable this.
Driver.repository : Repository where the GPU driver image is located, default is nvidia's nvcr.io repository.
Driver.usePrecompiled : Enable the precompiled mode to install the driver.
Driver.version : Version of the GPU driver image, use default parameters for offline deployment. Configuration is only required for online installation. Different versions of the Driver image exist for different types of operating systems. For more details, refer to Nvidia GPU Driver Versions. Examples of Driver Version for different operating systems are as follows:
Note
When using the built-in operating system version, there is no need to modify the image version. For other operating system versions, refer to Uploading Images to the Bootstrap Node Repository. Note that there is no need to include the operating system name such as Ubuntu, CentOS, or Red Hat in the version number. If the official image contains an operating system suffix, manually remove it.
For Red Hat systems, for example, 525.105.17
For Ubuntu systems, for example, 535-5.15.0-1043-nvidia
For CentOS systems, for example, 525.147.05
Driver.RepoConfig.ConfigMapName : Used to record the name of the offline yum repository configuration file for the gpu-operator. When using the pre-packaged offline bundle, refer to the following documents for different types of operating systems.
For detailed configuration methods, refer to Enabling MIG Functionality.
MigManager.Config.name : The name of the MIG split configuration file, used to define the MIG (GI, CI) split policy. The default is default-mig-parted-config . For custom parameters, refer to Enabling MIG Functionality.
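Putting the parameters above together, a hedged sketch of the corresponding Helm values (the key names follow the parameters described on this page; the exact keys may differ slightly in the chart version you install, and the values shown are examples only):

driver:
  enable: true                        # deploy the NVIDIA driver on the node
  repository: nvcr.io/nvidia          # example registry for the driver image
  usePrecompiled: false
  version: "535.104.12"               # example; keep the default for offline deployment
  repoConfig:
    configMapName: local-repo-config  # offline yum source configmap (RepoConfig.ConfigMapName)
migManager:
  config:
    name: default-mig-parted-config   # MIG split configuration file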
After completing the configuration and creation of the above parameters:
If using full-card mode , GPU resources can be used when creating applications.
If using vGPU mode , after completing the above configuration and creation, proceed to vGPU Addon Installation.
If using MIG mode , configure a specific split specification for individual GPU nodes if needed; otherwise, GPUs are split according to the default value in MigManager.Config .
After splitting, applications can use MIG GPU resources.
"},{"location":"en/end-user/kpanda/gpu/nvidia/push_image_to_repo.html","title":"Uploading Red Hat GPU Operator Offline Image to Bootstrap Repository","text":"
This guide explains how to upload an offline image to the bootstrap repository using the nvcr.io/nvidia/driver:525.105.17-rhel8.4 offline driver image for Red Hat 8.4 as an example.
The bootstrap node and its components are running properly.
Prepare a node that has internet access and can access the bootstrap node. Docker should also be installed on this node. You can refer to Installing Docker for installation instructions.
"},{"location":"en/end-user/kpanda/gpu/nvidia/push_image_to_repo.html#procedure","title":"Procedure","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/push_image_to_repo.html#step-1-obtain-the-offline-image-on-an-internet-connected-node","title":"Step 1: Obtain the Offline Image on an Internet-Connected Node","text":"
Perform the following steps on the internet-connected node:
Pull the nvcr.io/nvidia/driver:525.105.17-rhel8.4 offline driver image:
Once the image is pulled, save it as a compressed archive named nvidia-driver.tar :
docker save nvcr.io/nvidia/driver:525.105.17-rhel8.4 > nvidia-driver.tar\n
Copy the compressed image archive nvidia-driver.tar to the bootstrap node:
scp nvidia-driver.tar user@ip:/root\n
For example:
scp nvidia-driver.tar root@10.6.175.10:/root\n
"},{"location":"en/end-user/kpanda/gpu/nvidia/push_image_to_repo.html#step-2-push-the-image-to-the-bootstrap-repository","title":"Step 2: Push the Image to the Bootstrap Repository","text":"
Perform the following steps on the bootstrap node:
Log in to the bootstrap node and import the compressed image archive nvidia-driver.tar :
docker load -i nvidia-driver.tar\n
View the imported image:
docker images -a | grep nvidia\n
Expected output:
nvcr.io/nvidia/driver e3ed7dee73e9 1 days ago 1.02GB\n
Retag the image to correspond to the target repository in the remote Registry repository:
docker tag <image-name> <registry-url>/<repository-name>:<tag>\n
Replace <image-name> with the name of the NVIDIA image from the previous step, <registry-url> with the address of the Registry service on the bootstrap node, <repository-name> with the name of the repository you want to push the image to, and <tag> with the desired tag for the image.
For example:
docker tag nvcr.io/nvidia/driver 10.6.10.5/nvcr.io/nvidia/driver:525.105.17-rhel8.4\n
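After retagging, the image still needs to be pushed to the bootstrap registry (a sketch using the example address above; add login or credential flags if your registry requires them):

docker push 10.6.10.5/nvcr.io/nvidia/driver:525.105.17-rhel8.4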
Check the GPU driver image version applicable to your kernel at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags. Use the kernel version to query the image version, and save the image using ctr export.
ctr i pull nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04\nctr i export --all-platforms driver.tar.gz nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 \n
Import the image into the cluster's container registry
ctr i import driver.tar.gz\nctr i tag nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 {your_registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04\nctr i push {your_registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 --skip-verify=true\n
"},{"location":"en/end-user/kpanda/gpu/nvidia/ubuntu22.04_offline_install_driver.html#install-the-driver","title":"Install the Driver","text":"
Install the gpu-operator addon and set driver.usePrecompiled=true
Set driver.version=535, note that it should be 535, not 535.104.12
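A minimal sketch of the two settings expressed as Helm values (the exact key names should match the gpu-operator chart version you install):

driver:
  usePrecompiled: true
  version: "535"   # major version only, not 535.104.12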
The AI platform comes with a pre-installed GPU Operator offline package for CentOS 7.9 with kernel version 3.10.0-1160. For other OS types or kernel versions, users need to manually build an offline yum source.
This guide explains how to build an offline yum source for CentOS 7.9 with a specific kernel version and use it when installing the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
The user has already installed the v0.12.0 or later version of the addon offline package on the platform.
Prepare a file server that is accessible from the cluster network, such as Nginx or MinIO.
Prepare a node that has internet access, can access the cluster where the GPU Operator will be deployed, and can access the file server. Docker should also be installed on this node. You can refer to Installing Docker for installation instructions.
This guide uses CentOS 7.9 with kernel version 3.10.0-1160.95.1.el7.x86_64 as an example to explain how to upgrade the pre-installed GPU Operator offline package's yum source.
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#check-os-and-kernel-versions-of-cluster-nodes","title":"Check OS and Kernel Versions of Cluster Nodes","text":"
Run the following commands on both the control node of the Global cluster and the node where GPU Operator will be deployed. If the OS and kernel versions of the two nodes are consistent, there is no need to build a yum source. You can directly refer to the Offline Installation of GPU Operator document for installation. If the OS or kernel versions of the two nodes are not consistent, please proceed to the next step.
Run the following command to view the distribution name and version of the node where GPU Operator will be deployed in the cluster.
cat /etc/redhat-release\n
Expected output:
CentOS Linux release 7.9 (Core)\n
The output shows the current node's OS version as CentOS 7.9.
Run the following command to view the kernel version of the node where GPU Operator will be deployed in the cluster.
uname -a\n
Expected output:
Linux localhost.localdomain 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux\n
The output shows the current node's kernel version as 3.10.0-1160.95.1.el7.x86_64.
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#create-the-offline-yum-source","title":"Create the Offline Yum Source","text":"
Perform the following steps on a node that has internet access and can access the file server:
Create a script file named yum.sh by running the following command:
vi yum.sh\n
Then press the i key to enter insert mode and enter the following content:
Press the Esc key to exit insert mode, then enter :wq to save and exit.
Run the yum.sh file:
bash -x yum.sh TARGET_KERNEL_VERSION\n
The TARGET_KERNEL_VERSION parameter is used to specify the kernel version of the cluster nodes.
Note: You don't need to include the distribution identifier (e.g., .el7.x86_64 ). For example:
bash -x yum.sh 3.10.0-1160.95.1\n
Now you have generated an offline yum source, centos-base , for the kernel version 3.10.0-1160.95.1.el7.x86_64 .
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#upload-the-offline-yum-source-to-the-file-server","title":"Upload the Offline Yum Source to the File Server","text":"
Perform the following steps on a node that has internet access and can access the file server. This step is used to upload the generated yum source from the previous step to a file server that can be accessed by the cluster where the GPU Operator will be deployed. The file server can be Nginx, MinIO, or any other file server that supports the HTTP protocol.
In this example, we will use the built-in MinIO as the file server. The MinIO details are as follows:
Run the following command in the current directory of the node to establish a connection between the node's local mc command-line tool and the MinIO server:
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123\n
The expected output should resemble the following:
Added __minio__ successfully.\n
mc is the command-line tool provided by MinIO for interacting with the MinIO server. For more details, refer to the MinIO Client documentation.
In the current directory of the node, create a bucket named centos-base :
mc mb -p minio/centos-base\n
The expected output should resemble the following:
Bucket created successfully __minio/centos-base__ .\n
Set the access policy of the bucket centos-base to allow public download. This will enable access during the installation of the GPU Operator:
mc anonymous set download minio/centos-base\n
The expected output should resemble the following:
Access permission for __minio/centos-base__ is set to __download__ \n
In the current directory of the node, copy the generated centos-base offline yum source to the minio/centos-base bucket on the MinIO server:
mc cp centos-base minio/centos-base --recursive\n
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.html#create-a-configmap-to-store-the-yum-source-info-in-the-cluster","title":"Create a ConfigMap to Store the Yum Source Info in the Cluster","text":"
Perform the following steps on the control node of the cluster where the GPU Operator will be deployed.
Run the following command to create a file named CentOS-Base.repo that specifies the configmap for the yum source storage:
# The file name must be CentOS-Base.repo, otherwise it cannot be recognized during the installation of the GPU Operator\ncat > CentOS-Base.repo << EOF\n[extension-0]\nbaseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address where the yum source is placed in step 3\ngpgcheck = 0\nname = kubean extension 0\n\n[extension-1]\nbaseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address where the yum source is placed in step 3\ngpgcheck = 0\nname = kubean extension 1\nEOF\n
Based on the created CentOS-Base.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
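A minimal sketch of the command, assuming CentOS-Base.repo is in the current directory:

kubectl create configmap local-repo-config -n gpu-operator --from-file=CentOS-Base.repo=./CentOS-Base.repo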
The expected output should resemble the following:
configmap/local-repo-config created\n
The local-repo-config configmap will be used to provide the value for the RepoConfig.ConfigMapName parameter during the installation of the GPU Operator. You can customize the configuration file name.
View the content of the local-repo-config configmap:
kubectl get configmap local-repo-config -n gpu-operator -oyaml\n
The expected output should resemble the following:
apiVersion: v1\ndata:\nCentOS-Base.repo: \"[extension-0]\\nbaseurl = http://10.6.232.5:32618/centos-base# The file server path where the yum source is placed in step 2\\ngpgcheck = 0\\nname = kubean extension 0\\n \\n[extension-1]\\nbaseurl = http://10.6.232.5:32618/centos-base # The file server path where the yum source is placed in step 2\\ngpgcheck = 0\\nname = kubean extension 1\\n\"\nkind: ConfigMap\nmetadata:\ncreationTimestamp: \"2023-10-18T01:59:02Z\"\nname: local-repo-config\nnamespace: gpu-operator\nresourceVersion: \"59445080\"\nuid: c5f0ebab-046f-442c-b932-f9003e014387\n
You have successfully created an offline yum source configuration file for the cluster where the GPU Operator will be deployed. You can use it during the offline installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html","title":"Building Red Hat 8.4 Offline Yum Source","text":"
The AI platform comes with pre-installed CentOS v7.9 and GPU Operator offline packages with kernel v3.10.0-1160. For other OS types or nodes with different kernels, users need to manually build the offline yum source.
This guide explains how to build an offline yum source package for Red Hat 8.4 based on any node in the Global cluster. It also demonstrates how to use it during the installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
The user has already installed the addon offline package v0.12.0 or higher on the platform.
The OS of the cluster nodes where the GPU Operator will be deployed must be Red Hat v8.4, and the kernel version must be identical.
Prepare a file server that can communicate with the cluster network where the GPU Operator will be deployed, such as Nginx or MinIO.
Prepare a node that can access the internet, the cluster where the GPU Operator will be deployed, and the file server. Ensure that Docker is already installed on this node.
The nodes in the Global cluster must be Red Hat 8.4 4.18.0-305.el8.x86_64.
This guide uses a node with Red Hat 8.4 4.18.0-305.el8.x86_64 as an example to demonstrate how to build an offline yum source package for Red Hat 8.4 based on any node in the Global cluster. It also explains how to use it during the installation of the GPU Operator by specifying the RepoConfig.ConfigMapName parameter.
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-1-download-the-yum-source-from-the-bootstrap-node","title":"Step 1: Download the Yum Source from the Bootstrap Node","text":"
Perform the following steps on the master node of the Global cluster.
Use SSH or any other method to access any node in the Global cluster and run the following command:
cat /etc/yum.repos.d/extension.repo # View the contents of extension.repo.\n
The expected output should resemble the following:
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-2-download-the-elfutils-libelf-devel-0187-4el8x86_64rpm-package","title":"Step 2: Download the elfutils-libelf-devel-0.187-4.el8.x86_64.rpm Package","text":"
Perform the following steps on a node with internet access. Before proceeding, ensure that there is network connectivity between the node with internet access and the master node of the Global cluster.
Run the following command on the node with internet access to download the elfutils-libelf-devel-0.187-4.el8.x86_64.rpm package:
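The original download command is not shown here; a minimal sketch, assuming the node's configured repositories contain the package and that yum-utils (which provides yumdownloader) is installed:

yum install -y yum-utils
yumdownloader --destdir=. elfutils-libelf-devel-0.187-4.el8.x86_64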
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-3-generate-the-local-yum-repository","title":"Step 3: Generate the Local Yum Repository","text":"
Perform the following steps on the master node of the Global cluster mentioned in Step 1.
Enter the yum repository directories:
cd ~/redhat-base-repo/extension-1/Packages\ncd ~/redhat-base-repo/extension-2/Packages\n
Generate the repository index for the directories:
createrepo_c ./\n
You have now generated the offline yum source named redhat-base-repo for kernel version 4.18.0-305.el8.x86_64 .
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-4-upload-the-local-yum-repository-to-the-file-server","title":"Step 4: Upload the Local Yum Repository to the File Server","text":"
In this example, we will use Minio, which is built-in as the file server in the bootstrap node. However, you can choose any file server that suits your needs. Here are the details for Minio:
Access URL: http://10.5.14.200:9000 (usually the {bootstrap-node-IP} + {port-9000})
Login username: rootuser
Login password: rootpass123
On the current node, establish a connection between the local mc command-line tool and the Minio server by running the following command:
mc config host add minio <file_server_access_url> <username> <password>\n
For example:
mc config host add minio http://10.5.14.200:9000 rootuser rootpass123\n
The expected output should be similar to:
Added __minio__ successfully.\n
The mc command-line tool is provided by the Minio file server as a client command-line tool. For more details, refer to the MinIO Client documentation.
Create a bucket named redhat-base in the current location:
mc mb -p minio/redhat-base\n
The expected output should be similar to:
Bucket created successfully __minio/redhat-base__ .\n
Set the access policy of the redhat-base bucket to allow public downloads so that it can be accessed during the installation of the GPU Operator:
mc anonymous set download minio/redhat-base\n
The expected output should be similar to:
Access permission for __minio/redhat-base__ is set to __download__ \n
Copy the offline yum repository files ( redhat-base-repo ) from the current location to the Minio server's minio/redhat-base bucket:
mc cp redhat-base-repo minio/redhat-base --recursive\n
"},{"location":"en/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.html#step-5-create-a-configmap-to-store-yum-repository-information-in-the-cluster","title":"Step 5: Create a ConfigMap to Store Yum Repository Information in the Cluster","text":"
Perform the following steps on the control node of the cluster where you will deploy the GPU Operator.
Run the following command to create a file named redhat.repo , which specifies the configuration information for the yum repository storage:
# The file name must be redhat.repo, otherwise it won't be recognized when installing gpu-operator\ncat > redhat.repo << EOF\n[extension-0]\nbaseurl = http://10.5.14.200:9000/redhat-base/redhat-base-repo/Packages # The file server address where the yum source is stored in Step 1\ngpgcheck = 0\nname = kubean extension 0\n\n[extension-1]\nbaseurl = http://10.5.14.200:9000/redhat-base/redhat-base-repo/Packages # The file server address where the yum source is stored in Step 1\ngpgcheck = 0\nname = kubean extension 1\nEOF\n
Based on the created redhat.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
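A minimal sketch of the command, assuming redhat.repo is in the current directory:

kubectl create configmap local-repo-config -n gpu-operator --from-file=redhat.repo=./redhat.repo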
The local-repo-config configuration file is used to provide the value for the RepoConfig.ConfigMapName parameter during the installation of the GPU Operator. You can choose a different name for the configuration file.
View the contents of the local-repo-config configuration file:
kubectl get configmap local-repo-config -n gpu-operator -oyaml\n
You have successfully created the offline yum source configuration file for the cluster where the GPU Operator will be deployed. You can use it by specifying the RepoConfig.ConfigMapName parameter during the offline installation of the GPU Operator.
"},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html","title":"Build an Offline Yum Repository for Red Hat 7.9","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#introduction","title":"Introduction","text":"
AI platform comes with a pre-installed CentOS 7.9 with GPU Operator offline package for kernel 3.10.0-1160. You need to manually build an offline yum repository for other OS types or nodes with different kernels.
This page explains how to build an offline yum repository for Red Hat 7.9 based on any node in the Global cluster, and how to use the RepoConfig.ConfigMapName parameter when installing the GPU Operator.
The cluster nodes where the GPU Operator is to be deployed must be Red Hat 7.9 with the exact same kernel version.
Prepare a file server that can be connected to the cluster network where the GPU Operator is to be deployed, such as nginx or minio.
Prepare a node that can access the internet, the cluster where the GPU Operator is to be deployed, and the file server. Docker installation must be completed on this node.
The nodes in the global service cluster must be Red Hat 7.9.
"},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#steps","title":"Steps","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#1-build-offline-yum-repo-for-relevant-kernel","title":"1. Build Offline Yum Repo for Relevant Kernel","text":"
Download rhel7.9 ISO
Download the rhel7.9 ospackage that corresponds to your Kubean version.
Find the version number of Kubean in the Container Management section of the Global cluster under Helm Apps.
Download the rhel7.9 ospackage for that version from the Kubean repository.
Import offline resources using the installer.
Refer to the Import Offline Resources document.
"},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#2-download-offline-driver-image-for-red-hat-79-os","title":"2. Download Offline Driver Image for Red Hat 7.9 OS","text":"
Click here to view the download url.
"},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#3-upload-red-hat-gpu-operator-offline-image-to-boostrap-node-repository","title":"3. Upload Red Hat GPU Operator Offline Image to Boostrap Node Repository","text":"
Refer to Upload Red Hat GPU Operator Offline Image to Bootstrap Node Repository.
Note
This reference is based on rhel8.4, so make sure to modify it for rhel7.9.
"},{"location":"en/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.html#4-create-configmaps-in-the-cluster-to-save-yum-repository-information","title":"4. Create ConfigMaps in the Cluster to Save Yum Repository Information","text":"
Run the following command on the control node of the cluster where the GPU Operator is to be deployed.
Run the following command to create a file named CentOS-Base.repo to specify the configuration information where the yum repository is stored.
# The file name must be CentOS-Base.repo, otherwise it will not be recognized when installing gpu-operator\ncat > CentOS-Base.repo << EOF\n[extension-0]\nbaseurl = http://10.5.14.200:9000/centos-base/centos-base # The server file address of the boostrap node, usually {boostrap node IP} + {9000 port}\ngpgcheck = 0\nname = kubean extension 0\n\n[extension-1]\nbaseurl = http://10.5.14.200:9000/centos-base/centos-base # The server file address of the boostrap node, usually {boostrap node IP} + {9000 port}\ngpgcheck = 0\nname = kubean extension 1\nEOF\n
Based on the created CentOS-Base.repo file, create a configmap named local-repo-config in the gpu-operator namespace:
The local-repo-config configmap is used to provide the value of the RepoConfig.ConfigMapName parameter when installing gpu-operator, and its name can be customized by the user.
View the contents of the local-repo-config configmap:
kubectl get configmap local-repo-config -n gpu-operator -oyaml\n
The expected output is as follows:
local-repo-config.yaml
apiVersion: v1\ndata:\n CentOS-Base.repo: \"[extension-0]\\nbaseurl = http://10.6.232.5:32618/centos-base # The file path where yum repository is placed in Step 2 \\ngpgcheck = 0\\nname = kubean extension 0\\n \\n[extension-1]\\nbaseurl\n = http://10.6.232.5:32618/centos-base # The file path where yum repository is placed in Step 2 \\ngpgcheck = 0\\nname\n = kubean extension 1\\n\"\nkind: ConfigMap\nmetadata:\n creationTimestamp: \"2023-10-18T01:59:02Z\"\n name: local-repo-config\n namespace: gpu-operator\n resourceVersion: \"59445080\"\n uid: c5f0ebab-046f-442c-b932-f9003e014387\n
At this point, you have successfully created the offline yum repository configuration for the cluster where the GPU Operator is to be deployed. The RepoConfig.ConfigMapName parameter is referenced during the Offline Installation of GPU Operator.
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/index.html","title":"Overview of NVIDIA Multi-Instance GPU (MIG)","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/mig/index.html#mig-scenarios","title":"MIG Scenarios","text":"
Multi-Tenant Cloud Environments:
MIG allows cloud service providers to partition a physical GPU into multiple independent GPU instances, which can be allocated to different tenants. This enables resource isolation and independence, meeting the GPU computing needs of multiple tenants.
Containerized Applications:
MIG enables finer-grained GPU resource management in containerized environments. By partitioning a physical GPU into multiple MIG instances, each container can be assigned dedicated GPU compute resources, providing better performance isolation and resource utilization.
Batch Processing Jobs:
For batch processing jobs requiring large-scale parallel computing, MIG provides higher computational performance and larger memory capacity. Each MIG instance can utilize a portion of the physical GPU's compute resources, accelerating the processing of large-scale computational tasks.
AI/Machine Learning Training:
MIG offers increased compute power and memory capacity for training large-scale deep learning models. By partitioning the physical GPU into multiple MIG instances, each instance can independently carry out model training, improving training efficiency and throughput.
In general, NVIDIA MIG is suitable for scenarios that require finer-grained allocation and management of GPU resources. It enables resource isolation, improved performance utilization, and meets the GPU computing needs of multiple users or applications.
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/index.html#overview-of-mig","title":"Overview of MIG","text":"
NVIDIA Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA on H100, A100, and A30 series GPUs. Its purpose is to divide a physical GPU into multiple GPU instances to provide finer-grained resource sharing and isolation. MIG can split a GPU into up to seven GPU instances, allowing a single physical GPU to provide separate GPU resources to multiple users, maximizing GPU utilization.
This feature enables multiple applications or users to share GPU resources simultaneously, improving the utilization of computational resources and increasing system scalability.
With MIG, each GPU instance's processor has an independent and isolated path throughout the entire memory system, including cross-switch ports on the chip, L2 cache groups, memory controllers, and DRAM address buses, all uniquely allocated to a single instance.
This ensures that the workload of individual users can run with predictable throughput and latency, along with identical L2 cache allocation and DRAM bandwidth. MIG can partition available GPU compute resources (such as streaming multiprocessors or SMs and GPU engines like copy engines or decoders) to provide defined quality of service (QoS) and fault isolation for different clients such as virtual machines, containers, or processes. MIG enables multiple GPU instances to run in parallel on a single physical GPU.
MIG allows multiple vGPUs (and virtual machines) to run in parallel on a single GPU instance while retaining the isolation guarantees provided by vGPU. For more details on using vGPU and MIG for GPU partitioning, refer to NVIDIA Multi-Instance GPU and NVIDIA Virtual Compute Server.
The following diagram provides an overview of MIG, illustrating how it virtualizes one physical GPU into seven GPU instances that can be used by multiple users.
SM (Streaming Multiprocessor): The core computational unit of a GPU responsible for executing graphics rendering and general-purpose computing tasks. Each SM contains a group of CUDA cores, as well as shared memory, register files, and other resources, capable of executing multiple threads concurrently. Each MIG instance has a certain number of SMs and other related resources, along with the allocated memory slices.
GPU Memory Slice : The smallest portion of GPU memory, including the proper memory controller and cache. A GPU memory slice is approximately one-eighth of the total GPU memory resources in terms of capacity and bandwidth.
GPU SM Slice : The smallest computational unit of SMs on a GPU. When configuring in MIG mode, the GPU SM slice is approximately one-seventh of the total available SMs in the GPU.
GPU Slice : The GPU slice represents the smallest portion of the GPU, consisting of a single GPU memory slice and a single GPU SM slice combined together.
GPU Instance (GI): A GPU instance is the combination of a GPU slice and GPU engines (DMA, NVDEC, etc.). Anything within a GPU instance always shares all GPU memory slices and other GPU engines, but its SM slice can be further subdivided into Compute Instances (CIs). A GPU instance provides memory QoS. Each GPU slice contains dedicated GPU memory resources, limiting available capacity and bandwidth while providing memory QoS. Each GPU memory slice gets one-eighth of the total GPU memory resources, and each GPU SM slice gets one-seventh of the total SM count.
Compute Instance (CI): A Compute Instance represents the smallest computational unit within a GPU instance. It consists of a subset of SMs, along with dedicated register files, shared memory, and other resources. Each CI has its own CUDA context and can run independent CUDA kernels. The number of CIs in a GPU instance depends on the number of available SMs and the configuration chosen during MIG setup.
Instance Slice : An Instance Slice represents a single CI within a GPU instance. It is the combination of a subset of SMs and a portion of the GPU memory slice. Each Instance Slice provides isolation and resource allocation for individual applications or users running on the GPU instance.
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/index.html#key-benefits-of-mig","title":"Key Benefits of MIG","text":"
Resource Sharing: MIG allows a single physical GPU to be divided into multiple GPU instances, providing efficient sharing of GPU resources among different users or applications. This maximizes GPU utilization and enables improved performance isolation.
Fine-Grained Resource Allocation: With MIG, GPU resources can be allocated at a finer granularity, allowing for more precise partitioning and allocation of compute power and memory capacity.
Improved Performance Isolation: Each MIG instance operates independently with its dedicated resources, ensuring predictable throughput and latency for individual users or applications. This improves performance isolation and prevents interference between different workloads running on the same GPU.
Enhanced Security and Fault Isolation: MIG provides better security and fault isolation by ensuring that each user or application has its dedicated GPU resources. This prevents unauthorized access to data and mitigates the impact of faults or errors in one instance on others.
Increased Scalability: MIG enables the simultaneous usage of GPU resources by multiple users or applications, increasing system scalability and accommodating the needs of various workloads.
Efficient Containerization: By using MIG in containerized environments, GPU resources can be effectively allocated to different containers, improving performance isolation and resource utilization.
Overall, MIG offers significant advantages in terms of resource sharing, fine-grained allocation, performance isolation, security, scalability, and containerization, making it a valuable feature for various GPU computing scenarios.
Check the system requirements for the GPU driver installation on the target node: GPU Support Matrix
Ensure that the cluster nodes have GPUs of the proper models (NVIDIA H100, A100, and A30 Tensor Core GPUs). For more information, see the GPU Support Matrix.
All GPUs on the nodes must belong to the same product line (e.g., A100-SXM-40GB).
When installing the Operator, you need to set the MigManager Config parameter accordingly. The default setting is default-mig-parted-config. You can also customize the sharding policy configuration file:
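For reference, a sharding policy file follows the NVIDIA mig-parted config format. The sketch below is an assumed example; the key name all-1g.10gb and the 1g.10gb profile are placeholders and must match profiles actually supported by your GPU model:

```yaml
version: v1
mig-configs:
  # Key names such as "all-1g.10gb" are what you later select as the sharding policy
  all-disabled:
    - devices: all
      mig-enabled: false
  all-1g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7   # create seven 1g.10gb instances per GPU (adjust to your hardware)
```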
After successfully installing the GPU Operator, the node is in full GPU mode by default. There will be an indicator on the node management page, as shown below:
Click the ┇ at the right side of the node list, select a GPU mode to switch, and then choose the proper MIG mode and sharding policy. Here, we take MIXED mode as an example:
There are two configurations here:
MIG Policy: Mixed and Single.
Sharding Policy: The policy here needs to match the key in the default-mig-parted-config (or user-defined sharding policy) configuration file.
After clicking the OK button, wait for about a minute and refresh the page. The MIG mode will be switched to:
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/mig_command.html","title":"MIG Related Commands","text":"
GI Related Commands:
| Subcommand | Description |
| --- | --- |
| nvidia-smi mig -lgi | View the list of created GI instances |
| nvidia-smi mig -dgi -gi | Delete a specific GI instance |
| nvidia-smi mig -lgip | View the profile of GI |
| nvidia-smi mig -cgi | Create a GI using the specified profile ID |
CI Related Commands:
| Subcommand | Description |
| --- | --- |
| nvidia-smi mig -lcip { -gi {gi Instance ID}} | View the profile of CI, specifying -gi will show the CIs that can be created for a particular GI instance |
| nvidia-smi mig -lci | View the list of created CI instances |
| nvidia-smi mig -cci {profile id} -gi {gi instance id} | Create a CI instance with the specified GI |
| nvidia-smi mig -dci -ci | Delete a specific CI instance |
GI+CI Related Commands:
| Subcommand | Description |
| --- | --- |
| nvidia-smi mig -i 0 -cgi {gi profile id} -C {ci profile id} | Create a GI + CI instance directly |
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/mig_usage.html","title":"Using MIG GPU Resources","text":"
This section explains how applications can use MIG GPU resources.
The AI platform container management module is deployed and running successfully.
The container management module has integrated a Kubernetes cluster or a Kubernetes cluster has been created, and the cluster UI can be accessed.
NVIDIA DevicePlugin and MIG capabilities are enabled. Refer to Offline installation of GPU Operator for details.
The nodes in the cluster have GPUs of the proper models.
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/mig_usage.html#using-mig-gpu-through-the-ui","title":"Using MIG GPU through the UI","text":"
Confirm if the cluster has recognized the GPU type.
Go to Cluster Details -> Nodes and check if it has been correctly recognized as MIG.
When deploying an application using an image, you can select and use NVIDIA MIG resources.
Example of MIG Single Mode (used in the same way as a full GPU):
Note
The MIG single policy allows users to request and use GPU resources in the same way as a full GPU (nvidia.com/gpu). The difference is that these resources can be a portion of the GPU (MIG device) rather than the entire GPU. Learn more from the GPU MIG Mode Design.
MIG Mixed Mode
"},{"location":"en/end-user/kpanda/gpu/nvidia/mig/mig_usage.html#using-mig-through-yaml-configuration","title":"Using MIG through YAML Configuration","text":"
Expose MIG devices through resource types of the form nvidia.com/mig-<g>g.<gb>gb.
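The exact resource name depends on the MIG profiles created on your nodes. As a sketch, assuming a 1g.10gb profile has been exposed, a Pod could request it as follows (the Pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-mixed-demo
spec:
  containers:
    - name: cuda-test
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # request one MIG device of the 1g.10gb profile
```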
After entering the container, you can check if only one MIG device is being used:
"},{"location":"en/end-user/kpanda/gpu/nvidia/vgpu/hami.html","title":"Build a vGPU Memory Oversubscription Image","text":"
The vGPU memory oversubscription feature has been removed from the HAMi project. To use this feature, you need to rebuild the image with a libvgpu.so file that supports memory oversubscription.
Dockerfile
FROM docker.m.daocloud.io/projecthami/hami:v2.3.11\nCOPY libvgpu.so /k8s-vgpu/lib/nvidia/\n
To virtualize a single NVIDIA GPU into multiple virtual GPUs and allocate them to different virtual machines or users, you can use NVIDIA's vGPU capability. This section explains how to install the vGPU plugin on the AI platform, which is a prerequisite for using the NVIDIA vGPU capability.
During the installation of vGPU, several basic parameters can be modified. If you need to modify advanced parameters, click the YAML column to make changes; a sketch of such a YAML override follows the parameter list below:
deviceMemoryScaling : NVIDIA device memory scaling factor; the input value must be an integer, with a default value of 1. It can be set greater than 1 (enabling virtual device memory, an experimental feature). For an NVIDIA GPU with memory size M, if the devicePlugin.deviceMemoryScaling parameter is configured as S, the vGPUs split from this GPU will provide a total of S * M memory in the Kubernetes cluster where the device plugin is deployed.
deviceSplitCount : An integer with a default value of 10 that controls how many ways each GPU is split. A GPU cannot be assigned more tasks than this count; if configured as N, each GPU can run up to N tasks simultaneously.
Resources : Represents the resource usage of the vgpu-device-plugin and vgpu-schedule pods.
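As a sketch of what such a YAML override might look like, assuming the fields above live under devicePlugin as referenced in the deviceMemoryScaling description (the exact nesting may differ in your chart version):

```yaml
devicePlugin:
  deviceMemoryScaling: 1   # >1 enables device memory oversubscription (experimental feature)
  deviceSplitCount: 10     # each GPU can run at most this many tasks simultaneously
```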
After a successful installation, you will see two types of pods in the specified namespace, indicating that the NVIDIA vGPU plugin has been successfully installed:
After a successful installation, you can deploy applications using vGPU resources.
Note
The NVIDIA vGPU Addon does not support upgrading directly from the older v2.0.0 to the latest v2.0.0+1. To upgrade, please uninstall the older version and then install the latest version.
"},{"location":"en/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.html","title":"Using NVIDIA vGPU in Applications","text":"
This section explains how to use the vGPU capability on the AI platform.
The nodes in the cluster have GPUs of the proper models.
vGPU Addon has been successfully installed. Refer to Installing GPU Addon for details.
GPU Operator is installed, and the Nvidia.DevicePlugin capability is disabled. Refer to Offline Installation of GPU Operator for details.
"},{"location":"en/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.html#procedure","title":"Procedure","text":""},{"location":"en/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.html#using-vgpu-through-the-ui","title":"Using vGPU through the UI","text":"
Confirm if the cluster has detected GPUs. Click the Clusters -> Cluster Settings -> Addon Plugins and check if the GPU plugin has been automatically enabled and the proper GPU type has been detected. Currently, the cluster will automatically enable the GPU addon and set the GPU Type as Nvidia vGPU .
Deploy a workload by clicking Clusters -> Workloads . When deploying a workload using an image, select the type Nvidia vGPU , and you will be prompted with the following parameters:
Number of Physical Cards (nvidia.com/vgpu) : Indicates how many physical cards need to be mounted by the current pod. The input value must be an integer and less than or equal to the number of cards on the host machine.
GPU Cores (nvidia.com/gpucores): Indicates the percentage of GPU cores utilized by each card, with a value range from 0 to 100. Setting it to 0 means no enforced isolation, while setting it to 100 means exclusive use of the entire card.
GPU Memory (nvidia.com/gpumem): Indicates the GPU memory occupied by each card, with a value in MB. The minimum value is 1, and the maximum value is the total memory of the card.
If there are issues with the configuration values above, it may result in scheduling failure or inability to allocate resources.
"},{"location":"en/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.html#using-vgpu-through-yaml-configuration","title":"Using vGPU through YAML Configuration","text":"
Refer to the following workload configuration and add the parameter nvidia.com/vgpu: '1' in the resource requests and limits section to configure the number of physical cards used by the application.
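The full workload manifest is not reproduced here; the following is a minimal sketch matching the description below (the Pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  containers:
    - name: cuda-test
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
      command: ["sleep", "infinity"]
      resources:
        requests:
          nvidia.com/vgpu: '1'       # number of physical GPUs mounted by the Pod
          nvidia.com/gpucores: '20'  # 20% of GPU cores per card
          nvidia.com/gpumem: '200'   # 200 MB of GPU memory per card
        limits:
          nvidia.com/vgpu: '1'
          nvidia.com/gpucores: '20'
          nvidia.com/gpumem: '200'
```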
This YAML configuration requests the application to use vGPU resources. It specifies that each card should utilize 20% of GPU cores, 200MB of GPU memory, and requests 1 GPU.
"},{"location":"en/end-user/kpanda/gpu/volcano/volcano-gang-scheduler.html","title":"Using Volcano's Gang Scheduler","text":"
The Gang scheduling policy is one of the core scheduling algorithms of the volcano-scheduler. It satisfies the \"All or nothing\" scheduling requirement during the scheduling process, preventing arbitrary scheduling of Pods that could waste cluster resources. The specific algorithm observes whether the number of scheduled Pods under a Job meets the minimum running quantity. When the Job's minimum running quantity is satisfied, scheduling actions are performed for all Pods under the Job; otherwise, no actions are taken.
The Gang scheduling algorithm, based on the concept of a Pod group, is particularly suitable for scenarios that require multi-process collaboration. AI scenarios often involve complex workflows, such as Data Ingestion, Data Analysis, Data Splitting, Training, Serving, and Logging, which require a group of containers to work together. This makes the Gang scheduling policy based on pods very appropriate.
In multi-threaded parallel computing communication scenarios under the MPI computation framework, Gang scheduling is also very suitable because it requires master and slave processes to work together. Because the containers in such a Pod group are highly interdependent, scheduling only part of them can lead to resource contention; scheduling and allocating them as a whole effectively avoids deadlocks.
In scenarios with insufficient cluster resources, the Gang scheduling policy significantly improves the utilization of cluster resources. For example, if the cluster can currently accommodate only 2 Pods, but the minimum number of Pods required for scheduling is 3, then all Pods of this Job will remain pending until the cluster can accommodate 3 Pods, at which point the Pods will be scheduled. This effectively prevents the partial scheduling of Pods, which would not meet the requirements and would occupy resources, making other Jobs unable to run.
The Gang Scheduler is the core scheduling plugin of Volcano, and it is enabled by default upon installing Volcano. When creating a workload, you only need to specify the scheduler name as Volcano.
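For example, a minimal sketch of a workload handed to the Volcano scheduler (the names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: volcano-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: volcano-demo
  template:
    metadata:
      labels:
        app: volcano-demo
    spec:
      schedulerName: volcano   # hand scheduling of these Pods to Volcano
      containers:
        - name: nginx
          image: nginx:latest   # placeholder image
          resources:
            requests:
              cpu: 100m
```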
Volcano schedules based on PodGroups. When creating a workload, there is no need to manually create PodGroup resources; Volcano will automatically create them based on the workload information. Below is an example of a PodGroup:
minMember : The minimum number of Pods or tasks that must run under this PodGroup. If the cluster resources cannot satisfy running the number of tasks specified by minMember, the scheduler will not schedule any task within this PodGroup.
minResources : The minimum resources required to run this PodGroup. If the allocatable resources of the cluster do not meet minResources, the scheduler will not schedule any task within this PodGroup.
priorityClassName : The priority of this PodGroup, used by the scheduler to sort all PodGroups within the queue during scheduling. system-node-critical and system-cluster-critical are two reserved values indicating the highest priority. If not specified, the default priority or zero priority is used.
queue : The queue to which this PodGroup belongs. The queue must be created in advance and be in the open state.
In a multi-threaded parallel computing communication scenario under the MPI computation framework, all Pods must be scheduled successfully for the job to complete correctly. Setting minAvailable to 4 means that 1 mpimaster and 3 mpiworkers are required to run.
apiVersion: scheduling.volcano.sh/v1beta1\nkind: PodGroup\nmetadata:\n annotations:\n creationTimestamp: \"2024-05-28T09:18:50Z\"\n generation: 5\n labels:\n volcano.sh/job-type: MPI\n name: lm-mpi-job-9c571015-37c7-4a1a-9604-eaa2248613f2\n namespace: default\n ownerReferences:\n - apiVersion: batch.volcano.sh/v1alpha1\n blockOwnerDeletion: true\n controller: true\n kind: Job\n name: lm-mpi-job\n uid: 9c571015-37c7-4a1a-9604-eaa2248613f2\n resourceVersion: \"25173454\"\n uid: 7b04632e-7cff-4884-8e9a-035b7649d33b\nspec:\n minMember: 4\n minResources:\n count/pods: \"4\"\n cpu: 3500m\n limits.cpu: 3500m\n pods: \"4\"\n requests.cpu: 3500m\n minTaskMember:\n mpimaster: 1\n mpiworker: 3\n queue: default\nstatus:\n conditions:\n - lastTransitionTime: \"2024-05-28T09:19:01Z\"\n message: '3/4 tasks in gang unschedulable: pod group is not ready, 1 Succeeded,\n 3 Releasing, 4 minAvailable'\n reason: NotEnoughResources\n status: \"True\"\n transitionID: f875efa5-0358-4363-9300-06cebc0e7466\n type: Unschedulable\n - lastTransitionTime: \"2024-05-28T09:18:53Z\"\n reason: tasks in gang are ready to be scheduled\n status: \"True\"\n transitionID: 5a7708c8-7d42-4c33-9d97-0581f7c06dab\n type: Scheduled\n phase: Pending\n succeeded: 1\n
From the PodGroup, it can be seen that it is associated with the workload through ownerReferences and sets the minimum number of running Pods to 4.
"},{"location":"en/end-user/kpanda/gpu/volcano/volcano_user_guide.html","title":"Use Volcano for AI Compute","text":""},{"location":"en/end-user/kpanda/gpu/volcano/volcano_user_guide.html#usage-scenarios","title":"Usage Scenarios","text":"
Kubernetes has become the de facto standard for orchestrating and managing cloud-native applications, and an increasing number of applications are choosing to migrate to K8s. The fields of artificial intelligence and machine learning inherently involve a large number of compute-intensive tasks, and developers are very willing to build AI platforms based on Kubernetes to fully leverage its resource management, application orchestration, and operations monitoring capabilities. However, the default Kubernetes scheduler was initially designed primarily for long-running services and has many shortcomings in batch and elastic scheduling for AI and big data tasks. For example, resource contention issues:
Take TensorFlow job scenarios as an example. TensorFlow jobs include two different roles, PS and Worker, and the Pods for these two roles need to work together to complete the entire job. If only one type of role Pod is running, the entire job cannot be executed properly. The default scheduler schedules Pods one by one and is unaware of the PS and Worker roles in a Kubeflow TFJob. In a high-load cluster (insufficient resources), multiple jobs may each be allocated some resources to run a portion of their Pods, but the jobs cannot complete successfully, leading to resource waste. For instance, if a cluster has 4 GPUs and both TFJob1 and TFJob2 each have 4 Workers, TFJob1 and TFJob2 might each be allocated 2 GPUs. However, both TFJob1 and TFJob2 require 4 GPUs to run. This mutual waiting for resource release creates a deadlock situation, resulting in GPU resource waste.
Volcano is the first Kubernetes-based container batch computing platform under CNCF, focusing on high-performance computing scenarios. It fills in the missing functionalities of Kubernetes in fields such as machine learning, big data, and scientific computing, providing essential support for these high-performance workloads. Additionally, Volcano seamlessly integrates with mainstream computing frameworks like Spark, TensorFlow, and PyTorch, and supports hybrid scheduling of heterogeneous devices, including CPUs and GPUs, effectively resolving the deadlock issues mentioned above.
The following sections will introduce how to install and use Volcano.
Find Volcano in Cluster Details -> Helm Apps -> Helm Charts and install it.
Check and confirm whether Volcano is installed successfully, that is, whether the components volcano-admission, volcano-controllers, and volcano-scheduler are running properly.
Typically, Volcano is used in conjunction with AI Lab to form an effective closed loop across dataset management, Notebook development, and training jobs.
"},{"location":"en/end-user/kpanda/gpu/volcano/volcano_user_guide.html#volcano-use-cases","title":"Volcano Use Cases","text":"
Volcano is a standalone scheduler. To enable the Volcano scheduler when creating workloads, simply specify the scheduler's name (schedulerName: volcano).
The VolcanoJob (vcjob) resource is Volcano's extension of the Job, breaking a Job down into smaller working units called tasks, which can interact with each other.
"},{"location":"en/end-user/kpanda/gpu/volcano/volcano_user_guide.html#parallel-computing-with-mpi","title":"Parallel Computing with MPI","text":"
In multi-threaded parallel computing communication scenarios under the MPI computing framework, we need to ensure that all Pods are successfully scheduled to guarantee the task's proper completion. Setting minAvailable to 4 indicates that 1 mpimaster and 3 mpiworkers are required to run. By simply setting the schedulerName field to volcano, you can enable the Volcano scheduler.
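A trimmed sketch of such a VolcanoJob is shown below. The images are placeholders, and the ssh/svc plugin settings are assumptions based on the typical Volcano MPI example; the job name matches the PodGroup example shown earlier:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: lm-mpi-job
spec:
  schedulerName: volcano
  minAvailable: 4          # 1 mpimaster + 3 mpiworkers must be schedulable together
  plugins:
    ssh: []                # assumed: inject SSH keys for MPI communication
    svc: []                # assumed: create a headless service for the tasks
  tasks:
    - name: mpimaster
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpimaster
              image: <your-mpi-image>   # placeholder
    - name: mpiworker
      replicas: 3
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpiworker
              image: <your-mpi-image>   # placeholder
```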
Helm is a package management tool for Kubernetes, which makes it easy for users to quickly discover, share and use applications built with Kubernetes. Container Management provides hundreds of Helm charts, covering storage, network, monitoring, database and other main cases. With these templates, you can quickly deploy and easily manage Helm apps through the UI interface. In addition, it supports adding more personalized templates through Add Helm repository to meet various needs.
Key Concepts:
There are a few key concepts to understand when using Helm:
Chart: A Helm installation package, which contains the images, dependencies, and resource definitions required to run an application, and may also contain service definitions in the Kubernetes cluster, similar to the formula in Homebrew, dpkg in APT, or rpm files in Yum. Charts are called Helm Charts in AI platform.
Release: A Chart instance running on the Kubernetes cluster. A Chart can be installed multiple times in the same cluster, and each installation will create a new Release. Release is called Helm Apps in AI platform.
Repository: A repository for publishing and storing Charts. Repository is called Helm Repositories in AI platform.
For more details, refer to Helm official website.
Related operations:
Manage Helm apps, including installing, updating, uninstalling Helm apps, viewing Helm operation records, etc.
Manage Helm repository, including installing, updating, deleting Helm repository, etc.
"},{"location":"en/end-user/kpanda/helm/Import-addon.html","title":"Import Custom Helm Apps into Built-in Addons","text":"
This article explains how to import Helm apps into the system's built-in addons in both offline and online environments.
charts-syncer is available and running. If not, you can click here to download.
The Helm Chart has been adapted for charts-syncer. This means adding a .relok8s-images.yaml file to the Helm Chart. This file should list all the images used by the Chart, including any images that are not directly referenced in the Chart but are used in a similar way, such as images used by an Operator.
Note
Refer to image-hints-file for instructions on how to write a Chart. It is required to separate the registry and repository of the image because the registry/repository needs to be replaced or modified when loading the image.
The installer's fire cluster has charts-syncer installed. If you are importing custom Helm apps into the installer's fire cluster, you can skip the download and proceed to the adaptation. If the charts-syncer binary is not installed, you can download it immediately.
Go to Container Management -> Helm Apps -> Helm Repositories , search for the addon, and obtain the built-in repository address and username/password (the default username/password for the system's built-in repository is rootuser/rootpass123).
Sync the Helm Chart to the built-in repository addon of the container management system
Write the following configuration file, modify it according to your specific configuration, and save it as sync-dao-2048.yaml .
source: # helm charts source information\n repo:\n kind: HARBOR # It can also be any other supported Helm Chart repository type, such as CHARTMUSEUM\n url: https://release-ci.daocloud.io/chartrepo/community # Change to the chart repo URL\n #auth: # username/password, if no password is set, leave it blank\n #username: \"admin\"\n #password: \"Harbor12345\"\ncharts: # charts to sync\n - name: dao-2048 # helm charts information, if not specified, sync all charts in the source helm repo\n versions:\n - 1.4.1\ntarget: # helm charts target information\n containerRegistry: 10.5.14.40 # image repository URL\n repo:\n kind: CHARTMUSEUM # It can also be any other supported Helm Chart repository type, such as HARBOR\n url: http://10.5.14.40:8081 # Change to the correct chart repo URL, you can verify the address by using helm repo add $HELM-REPO\n auth: # username/password, if no password is set, leave it blank\n username: \"rootuser\"\n password: \"rootpass123\"\n containers:\n # kind: HARBOR # If the image repository is HARBOR and you want charts-syncer to automatically create an image repository, fill in this field\n # auth: # username/password, if no password is set, leave it blank\n # username: \"admin\"\n # password: \"Harbor12345\"\n\n# leverage .relok8s-images.yaml file inside the Charts to move the container images too\nrelocateContainerImages: true\n
Run the charts-syncer command to sync the Chart and its included images
I1222 15:01:47.119777 8743 sync.go:45] Using config file: \"examples/sync-dao-2048.yaml\"\nW1222 15:01:47.234238 8743 syncer.go:263] Ignoring skipDependencies option as dependency sync is not supported if container image relocation is true or syncing from/to intermediate directory\nI1222 15:01:47.234685 8743 sync.go:58] There is 1 chart out of sync!\nI1222 15:01:47.234706 8743 sync.go:66] Syncing \"dao-2048_1.4.1\" chart...\n.relok8s-images.yaml hints file found\nComputing relocation...\n\nRelocating dao-2048@1.4.1...\nPushing 10.5.14.40/daocloud/dao-2048:v1.4.1...\nDone\nDone moving /var/folders/vm/08vw0t3j68z9z_4lcqyhg8nm0000gn/T/charts-syncer869598676/dao-2048-1.4.1.tgz\n
Once the previous step is completed, go to Container Management -> Helm Apps -> Helm Repositories , find the proper addon, click Sync Repository in the action column, and you will see the uploaded Helm apps in the Helm template.
You can then proceed with normal installation, upgrade, and uninstallation.
The Helm Repo address for the online environment is release.daocloud.io . If the user does not have permission to add a Helm Repo, they will not be able to import custom Helm apps into the system's built-in addons. You can add your own Helm repository and then integrate your Helm repository into the platform using the same steps as syncing Helm Charts in the offline environment.
The container management module supports interface-based management of Helm, including creating Helm instances using Helm charts, customizing Helm instance arguments, and managing the full lifecycle of Helm instances.
This section will take cert-manager as an example to introduce how to create and manage Helm apps through the container management interface.
A Kubernetes cluster has been integrated or created, and the cluster UI can be accessed.
Created a namespace, user, and granted NS Admin or higher permissions to the user. For details, refer to Namespace Authorization.
"},{"location":"en/end-user/kpanda/helm/helm-app.html#install-the-helm-app","title":"Install the Helm app","text":"
Follow the steps below to install the Helm app.
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Helm Apps -> Helm Chart to enter the Helm chart page.
On the Helm chart page, select the Helm repository named addon , and all the Helm chart templates under the addon repository will be displayed on the interface. Click the Chart named cert-manager .
On the installation page, you can see the relevant detailed information of the Chart, select the version to be installed in the upper right corner of the interface, and click the Install button. Here select v1.9.1 version for installation.
Configure Name , Namespace and Version Information . You can also customize arguments by modifying YAML in the argument Configuration area below. Click OK .
The system will automatically return to the list of Helm apps, and the status of the newly created Helm app is Installing , and the status will change to Running after a period of time.
"},{"location":"en/end-user/kpanda/helm/helm-app.html#update-the-helm-app","title":"Update the Helm app","text":"
After we have completed the installation of a Helm app through the interface, we can perform an update operation on the Helm app. Note: Update operations using the UI are only supported for Helm apps installed via the UI.
Follow the steps below to update the Helm app.
Click a cluster name to enter Cluster Details .
In the left navigation bar, click Helm Apps to enter the Helm app list page.
On the Helm app list page, select the Helm app that needs to be updated, click the ... operation button on the right side of the list, and select the Update operation in the drop-down selection.
After clicking the Update button, the system will jump to the update interface, where you can update the Helm app as needed. Here we take updating the http port of the dao-2048 application as an example.
After modifying the proper arguments, you can click the Change button under the argument configuration to compare the files before and after the modification. After confirming that there is no error, click the OK button at the bottom to complete the update of the Helm app.
The system will automatically return to the Helm app list, and a pop-up window in the upper right corner will prompt update successful .
Every installation, update, and deletion of Helm apps has detailed operation records and logs for viewing.
In the left navigation bar, click Cluster Operations -> Recent Operations , and then select the Helm Operations tab at the top of the page. Each record corresponds to an install/update/delete operation.
To view the detailed log of each operation: Click ┇ on the right side of the list, and select Log from the pop-up menu.
At this point, the detailed operation log will be displayed in the form of console at the bottom of the page.
"},{"location":"en/end-user/kpanda/helm/helm-app.html#delete-the-helm-app","title":"Delete the Helm app","text":"
Follow the steps below to delete the Helm app.
Find the cluster where the Helm app to be deleted resides, click the cluster name, and enter Cluster Details .
In the left navigation bar, click Helm Apps to enter the Helm app list page.
On the Helm app list page, select the Helm app you want to delete, click the ... operation button on the right side of the list, and select Delete from the drop-down selection.
Enter the name of the Helm app in the pop-up window to confirm, and then click the Delete button.
The Helm repository is a repository for storing and publishing Charts. The Helm App module supports HTTP(s) protocol to access Chart packages in the repository. By default, the system has 4 built-in helm repos as shown in the table below to meet common needs in the production process of enterprises.
| Repository | Description | Example |
| --- | --- | --- |
| partner | Charts of various high-quality features provided by ecosystem partners | tidb |
| system | Charts that core system components and some advanced features must rely on. For example, insight-agent must be installed to obtain cluster monitoring information | Insight |
| addon | Common Charts for business cases | cert-manager |
| community | Charts of the most popular open source components in the Kubernetes community | Istio |
In addition to the above preset repositories, you can also add third-party Helm repositories yourself. This page will introduce how to add and update third-party Helm repositories.
The following takes the public container repository of Kubevela as an example to introduce and manage the helm repo.
Find the cluster that needs to be imported into the third-party helm repo, click the cluster name, and enter cluster details.
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo page.
Click the Create Repository button on the helm repo page to enter the Create repository page, and configure relevant arguments according to the table below.
Repository Name: Set the repository name. It can be up to 63 characters long and may only include lowercase letters, numbers, and separators -. It must start and end with a lowercase letter or number, for example, kubevela.
Repository URL: The HTTP(S) address pointing to the target Helm repository. For example, https://charts.kubevela.net/core.
Skip TLS Verification: If the added Helm repository uses an HTTPS address and requires skipping TLS verification, you can check this option. The default is unchecked.
Authentication Method: The method used for identity verification after connecting to the repository URL. For public repositories, you can select None. For private repositories, you need to enter a username/password for identity verification.
Labels: Add labels to this Helm repository. For example, key: repo4; value: Kubevela.
Annotations: Add annotations to this Helm repository. For example, key: repo4; value: Kubevela.
Description: Add a description for this Helm repository. For example: This is a Kubevela public Helm repository.
Click OK to complete the creation of the Helm repository. The page will automatically jump to the list of Helm repositories.
"},{"location":"en/end-user/kpanda/helm/helm-repo.html#update-the-helm-repository","title":"Update the Helm repository","text":"
When the address information of the helm repo changes, the address, authentication method, label, annotation, and description information of the helm repo can be updated.
Find the cluster where the repository to be updated is located, click the cluster name, and enter cluster details .
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo list page.
Find the Helm repository that needs to be updated on the repository list page, click the ┇ button on the right side of the list, and click Update in the pop-up menu.
Make the updates on the Update Helm Repository page, and click OK when finished.
Return to the helm repo list, and the screen prompts that the update is successful.
"},{"location":"en/end-user/kpanda/helm/helm-repo.html#delete-the-helm-repository","title":"Delete the Helm repository","text":"
In addition to importing and updating repositories, you can also delete unnecessary repositories, including system preset repositories and third-party repositories.
Find the cluster where the repository to be deleted is located, click the cluster name, and enter cluster details .
In the left navigation bar, click Helm Apps -> Helm Repositories to enter the helm repo list page.
Find the Helm repository that needs to be deleted on the repository list page, click the ┇ button on the right side of the list, and click Delete in the pop-up menu.
Enter the repository name to confirm, and click Delete .
Return to the list of Helm repositories, and the screen prompts that the deletion is successful.
"},{"location":"en/end-user/kpanda/helm/multi-archi-helm.html","title":"Import and Upgrade Multi-Arch Helm Apps","text":"
In a multi-arch cluster, it is common to use Helm charts that support multiple architectures to address deployment issues caused by architectural differences. This guide will explain how to integrate single-arch Helm apps into multi-arch deployments and how to integrate multi-arch Helm apps.
The offline package is quite large and requires sufficient space for decompression and loading of images. Otherwise, it may interrupt the process with a "no space left" error.
"},{"location":"en/end-user/kpanda/helm/multi-archi-helm.html#retry-after-failure","title":"Retry after Failure","text":"
If the multi-arch fusion step fails, you need to clean up the residue before retrying:
If the offline package for fusion contains registry spaces that are inconsistent with the imported offline package, an error may occur during the fusion process due to the non-existence of the registry spaces:
Solution: Simply create the registry space before the fusion. For example, in the above error, creating the registry space "localhost" in advance can prevent the error.
When upgrading the addon to a version lower than 0.12.0, the charts-syncer in the target offline package does not check whether an image already exists before pushing, so the multi-arch images will be recombined into a single architecture during the upgrade process. For example, if the addon was multi-arch in v0.10, upgrading to v0.11 will overwrite it with a single architecture; however, upgrading to v0.12.0 or above will maintain the multi-arch.
This article explains how to upload Helm charts. See the steps below.
Add a Helm repository, refer to Adding a Third-Party Helm Repository for the procedure.
Upload the Helm Chart to the Helm repository.
Upload with ClientUpload with Web Page
Note
This method is suitable for Harbor, ChartMuseum, JFrog type repositories.
Log in to a node that can access the Helm repository, upload the Helm binary to the node, and install the cm-push plugin (VPN is needed and Git should be installed in advance).
Refer to the plugin installation process.
Push the Helm Chart to the Helm repository by executing the following command:
charts-dir: The directory of the Helm Chart, or the packaged Chart (i.e., .tgz file).
HELM_REPO_URL: The URL of the Helm repository.
username/password: The username and password for the Helm repository with push permissions.
If you want to access via HTTPS and skip the certificate verification, you can add the argument --insecure.
Note
This method is only applicable to Harbor repositories.
Log into the Harbor repository, ensuring the logged-in user has permissions to push;
Go to the relevant project, select the Helm Charts tab, click the Upload button on the page to upload the Helm Chart.
Sync Remote Repository Data
Manual SyncAuto Sync
By default, the cluster does not enable Helm Repository Auto-Refresh, so you need to perform a manual sync operation. The general steps are:
Go to Helm Apps -> Helm Repositories, click the ┇ button on the right side of the repository list, and select Sync Repository to complete the repository data synchronization.
If you need to enable the Helm repository auto-sync feature, you can go to Cluster Maintenance -> Cluster Settings -> Advanced Settings and turn on the Helm repository auto-refresh switch.
Cluster inspection allows administrators to regularly or ad-hoc check the overall health of the cluster, giving them proactive control over ensuring cluster security. With a well-planned inspection schedule, this proactive cluster check allows administrators to monitor the cluster status at any time and address potential issues in advance. It eliminates the previous dilemma of passive troubleshooting during failures, enabling proactive monitoring and prevention.
The cluster inspection feature provided by AI platform's container management module supports custom inspection items at the cluster, node, and pod levels. After the inspection is completed, it automatically generates visual inspection reports.
Cluster Level: Checks the running status of system components in the cluster, including cluster status, resource usage, and specific inspection items for control nodes, such as the status of kube-apiserver and etcd .
Node Level: Includes common inspection items for both control nodes and worker nodes, such as node resource usage, handle counts, PID status, and network status.
Pod Level: Checks the CPU and memory usage, running status of pods, and the status of PV (PersistentVolume) and PVC (PersistentVolumeClaim).
For information on security inspections or executing security-related inspections, refer to the supported security scan types in AI platform.
AI platform Container Management module provides cluster inspection functionality, which supports inspection at the cluster, node, and pod levels.
Cluster level: Check the running status of system components in the cluster, including cluster status, resource usage, and specific inspection items for control nodes such as kube-apiserver and etcd .
Node level: Includes common inspection items for both control nodes and worker nodes, such as node resource usage, handle count, PID status, and network status.
Pod level: Check the CPU and memory usage, running status, PV and PVC status of Pods.
Here's how to create an inspection configuration.
Click Cluster Inspection in the left navigation bar.
On the right side of the page, click Inspection Configuration .
Fill in the inspection configuration based on the following instructions, then click OK at the bottom of the page.
Cluster: Select the clusters that you want to inspect from the dropdown list. If you select multiple clusters, multiple inspection configurations will be automatically generated (only the inspected clusters are inconsistent, all other configurations are identical).
Scheduled Inspection: When enabled, it allows for regular automatic execution of cluster inspections based on a pre-set inspection frequency.
Inspection Frequency: Set the interval for automatic inspections, e.g., every Tuesday at 10 AM. Custom cron expressions are supported; refer to Cron Schedule Syntax for more information.
Number of Inspection Records to Retain: Specifies the maximum number of inspection records to be retained, including all inspection records for each cluster.
Parameter Configuration: The parameter configuration is divided into three parts: cluster level, node level, and pod level. You can enable or disable specific inspection items based on your requirements.
After creating the inspection configuration, it will be automatically displayed in the inspection configuration list. Click the more options button on the right of the configuration to immediately perform an inspection, modify the inspection configuration or delete the inspection configuration and reports.
Click Inspection to perform an inspection once based on the configuration.
Click Inspection Configuration to modify the inspection configuration.
Click Delete to delete the inspection configuration and reports.
Note
After creating the inspection configuration, if the Scheduled Inspection configuration is enabled, inspections will be automatically executed at the specified time.
If Scheduled Inspection configuration is not enabled, you need to manually trigger the inspection.
After creating an inspection configuration, if the Scheduled Inspection configuration is enabled, inspections will be automatically executed at the specified time. If the Scheduled Inspection configuration is not enabled, you need to manually trigger the inspection.
This page explains how to manually perform a cluster inspection.
When performing an inspection, you can choose to inspect multiple clusters in batches or perform a separate inspection for a specific cluster.
Batch InspectionIndividual Inspection
Click Cluster Inspection in the top-level navigation bar of the Container Management module, then click Inspection on the right side of the page.
Select the clusters you want to inspect, then click OK at the bottom of the page.
If you choose to inspect multiple clusters at the same time, the system will perform inspections based on different inspection configurations for each cluster.
If no inspection configuration is set for a cluster, the system will use the default configuration.
Go to the Cluster Inspection page.
Click the more options button ( ┇ ) on the right of the proper inspection configuration, then select Inspection from the popup menu.
Go to the Cluster Inspection page and click the name of the target inspection cluster.
Click the name of the inspection record you want to view.
Each inspection execution generates an inspection record.
When the number of inspection records exceeds the maximum retention specified in the inspection configuration, the earliest record will be deleted starting from the execution time.
View the detailed information of the inspection, which may include an overview of cluster resources and the running status of system components.
You can download the inspection report or delete the inspection report from the top right corner of the page.
Namespaces are an abstraction used in Kubernetes for resource isolation. A cluster can contain multiple namespaces with different names, and the resources in each namespace are isolated from each other. For a detailed introduction to namespaces, refer to Namespaces.
This page will introduce the related operations of the namespace.
"},{"location":"en/end-user/kpanda/namespaces/createns.html#create-a-namespace","title":"Create a namespace","text":"
Supports easy creation of namespaces through forms, and quick creation of namespaces by writing or importing YAML files.
Note
Before creating a namespace, you need to Integrate a Kubernetes cluster or Create a Kubernetes cluster in the container management module.
The default namespace default is usually automatically generated after cluster initialization. But for production clusters, for ease of management, it is recommended to create other namespaces instead of using the default namespace directly.
"},{"location":"en/end-user/kpanda/namespaces/createns.html#create-with-form","title":"Create with form","text":"
On the cluster list page, click the name of the target cluster.
Click Namespace in the left navigation bar, then click the Create button on the right side of the page.
Fill in the name of the namespace, configure the workspace and labels (optional), and then click OK.
Info
After binding a namespace to a workspace, the resources of that namespace will be shared with the bound workspace. For a detailed explanation of workspaces, refer to Workspaces and Hierarchies.
After the namespace is created, you can still bind/unbind the workspace.
Click OK to complete the creation of the namespace. On the right side of the namespace list, click ┇ to select update, bind/unbind workspace, quota management, delete, and more from the pop-up menu.
"},{"location":"en/end-user/kpanda/namespaces/createns.html#create-from-yaml","title":"Create from YAML","text":"
On the Clusters page, click the name of the target cluster.
Click Namespace in the left navigation bar, then click the YAML Create button on the right side of the page.
Enter or paste the prepared YAML content, or directly import an existing YAML file locally.
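For example, a minimal namespace manifest might look like the following sketch (the name and label are placeholders):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo-ns          # placeholder name
  labels:
    environment: test    # optional labels
```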
After entering the YAML content, click Download to save the YAML file locally.
Finally, click OK in the lower right corner of the pop-up box.
Namespace exclusive nodes in a Kubernetes cluster allow a specific namespace to have exclusive access to one or more node's CPU, memory, and other resources through taints and tolerations. Once exclusive nodes are configured for a specific namespace, applications and services from other namespaces cannot run on the exclusive nodes. Using exclusive nodes allows important applications to have exclusive access to some computing resources, achieving physical isolation from other applications.
Note
Applications and services running on a node before it is set to be an exclusive node will not be affected and will continue to run normally on that node. Only when these Pods are deleted or rebuilt will they be scheduled to other non-exclusive nodes.
Check whether the kube-apiserver of the current cluster has enabled the PodNodeSelector and PodTolerationRestriction admission controllers.
The use of namespace exclusive nodes requires users to enable the PodNodeSelector and PodTolerationRestriction admission controllers on the kube-apiserver. For more information about admission controllers, refer to Kubernetes Admission Controllers Reference.
You can go to any Master node in the current cluster to check whether these two features are enabled in the kube-apiserver.yaml file, or you can execute the following command on the Master node for a quick check:
[root@g-master1 ~]# cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep enable-admission-plugins\n\n# The expected output is as follows:\n- --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction\n
"},{"location":"en/end-user/kpanda/namespaces/exclusive.html#enable-namespace-exclusive-nodes-on-global-cluster","title":"Enable Namespace Exclusive Nodes on Global Cluster","text":"
Since the Global cluster runs platform basic components such as kpanda, ghippo, and insight, enabling namespace exclusive nodes on Global may cause system components to not be scheduled to the exclusive nodes when they restart, affecting the overall high availability of the system. Therefore, we generally do not recommend users to enable the namespace exclusive node feature on the Global cluster.
If you do need to enable namespace exclusive nodes on the Global cluster, please follow the steps below:
Enable the PodNodeSelector and PodTolerationRestriction admission controllers for the kube-apiserver of the Global cluster
Note
If the cluster has already enabled the above two admission controllers, please skip this step and go directly to configure system component tolerations.
Go to any Master node in the current cluster to modify the kube-apiserver.yaml configuration file, or execute the following command on the Master node for configuration:
[root@g-master1 ~]# vi /etc/kubernetes/manifests/kube-apiserver.yaml\n\n# The expected output is as follows:\napiVersion: v1\nkind: Pod\nmetadata:\n ......\nspec:\ncontainers:\n- command:\n - kube-apiserver\n ......\n - --default-not-ready-toleration-seconds=300\n - --default-unreachable-toleration-seconds=300\n - --enable-admission-plugins=NodeRestriction # List of enabled admission controllers\n - --enable-aggregator-routing=False\n - --enable-bootstrap-token-auth=true\n - --endpoint-reconciler-type=lease\n - --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt\n ......\n
Find the --enable-admission-plugins parameter and add the PodNodeSelector and PodTolerationRestriction admission controllers (separated by commas). Refer to the following:
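For example, the relevant line in the kube-apiserver Pod spec would look like this after the change, matching the expected output shown earlier:

```yaml
# excerpt from /etc/kubernetes/manifests/kube-apiserver.yaml
    - --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction
```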
Add toleration annotations to the namespace where the platform components are located
After enabling the admission controllers, you need to add toleration annotations to the namespace where the platform components are located to ensure the high availability of the platform components.
The system component namespaces for AI platform are as follows:
Check whether there are the above namespaces in the current cluster, execute the following command, and add the annotation: scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]' for each namespace.
Please make sure to replace <namespace-name> with the name of the platform namespace you want to add the annotation to.
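As a sketch, the annotated namespace would look like this (replace <namespace-name> accordingly):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: <namespace-name>
  annotations:
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]'
```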
Use the interface to set exclusive nodes for the namespace
After confirming that the PodNodeSelector and PodTolerationRestriction admission controllers on the cluster API server have been enabled, please follow the steps below to use the AI platform UI management interface to set exclusive nodes for the namespace.
Click the cluster name in the cluster list page, then click Namespace in the left navigation bar.
Click the namespace name, then click the Exclusive Node tab, and click Add Node on the bottom right.
Select which nodes you want to be exclusive to this namespace on the left side of the page. On the right side, you can clear or delete a selected node. Finally, click OK at the bottom.
You can view the current exclusive nodes for this namespace in the list. You can choose to Stop Exclusivity on the right side of the node.
After cancelling exclusivity, Pods from other namespaces can also be scheduled to this node.
"},{"location":"en/end-user/kpanda/namespaces/exclusive.html#enable-namespace-exclusive-nodes-on-non-global-clusters","title":"Enable Namespace Exclusive Nodes on Non-Global Clusters","text":"
To enable namespace exclusive nodes on non-Global clusters, please follow the steps below:
Enable the PodNodeSelector and PodTolerationRestriction admission controllers for the kube-apiserver of the current cluster
Note
If the cluster has already enabled the above two admission controllers, please skip this step and go directly to using the interface to set exclusive nodes for the namespace.
Go to any Master node in the current cluster to modify the kube-apiserver.yaml configuration file, or execute the following command on the Master node for configuration:
[root@g-master1 ~]# vi /etc/kubernetes/manifests/kube-apiserver.yaml\n\n# The expected output is as follows:\napiVersion: v1\nkind: Pod\nmetadata:\n ......\nspec:\ncontainers:\n- command:\n - kube-apiserver\n ......\n - --default-not-ready-toleration-seconds=300\n - --default-unreachable-toleration-seconds=300\n - --enable-admission-plugins=NodeRestriction #List of enabled admission controllers\n - --enable-aggregator-routing=False\n - --enable-bootstrap-token-auth=true\n - --endpoint-reconciler-type=lease\n - --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt\n ......\n
Find the --enable-admission-plugins parameter and add the PodNodeSelector and PodTolerationRestriction admission controllers (separated by commas). Refer to the following:
Use the interface to set exclusive nodes for the namespace
After confirming that the PodNodeSelector and PodTolerationRestriction admission controllers on the cluster API server have been enabled, please follow the steps below to use the AI platform UI management interface to set exclusive nodes for the namespace.
Click the cluster name in the cluster list page, then click Namespace in the left navigation bar.
Click the namespace name, then click the Exclusive Node tab, and click Add Node on the bottom right.
Select which nodes you want to be exclusive to this namespace on the left side of the page. On the right side, you can clear or delete a selected node. Finally, click OK at the bottom.
You can view the current exclusive nodes for this namespace in the list. You can choose to Stop Exclusivity on the right side of the node.
After cancelling exclusivity, Pods from other namespaces can also be scheduled to this node.
Add toleration annotations to the namespace where the components that need high availability are located (optional)
Execute the following command to add the annotation scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]' to the namespace where the components that need high availability are located.
Pod security policies in a Kubernetes cluster allow you to control the behavior of Pods in various security aspects by configuring different levels and modes for specific namespaces. Only Pods that meet certain conditions will be accepted by the system. The feature provides three levels and three modes, allowing users to choose the most suitable scheme to set restriction policies according to their needs.
Note
Only one security policy can be configured for one security mode. Please be careful when configuring the enforce security mode for a namespace, as violations will prevent Pods from being created.
This section will introduce how to configure Pod security policies for namespaces through the container management interface.
The container management module has integrated or created a Kubernetes cluster. The cluster version must be v1.22 or above, and you should be able to access the cluster's UI interface.
A namespace has been created, a user has been created, and the user has been granted NS Admin or higher permissions. For details, refer to Namespace Authorization.
"},{"location":"en/end-user/kpanda/namespaces/podsecurity.html#configure-pod-security-policies-for-namespace","title":"Configure Pod Security Policies for Namespace","text":"
Select the namespace for which you want to configure Pod security policies and go to the details page. Click Configure Policy on the Pod Security Policy page to go to the configuration page.
Click Add Policy on the configuration page, and a policy will appear, including security level and security mode. The following is a detailed introduction to the security level and security policy.
| Security Level | Description |
| --- | --- |
| Privileged | An unrestricted policy that provides the maximum possible range of permissions. This policy allows known privilege elevations. |
| Baseline | The least restrictive policy that prohibits known privilege elevations. Allows the use of default (minimally specified) Pod configurations. |
| Restricted | A highly restrictive policy that follows current best practices for protecting Pods. |

| Security Mode | Description |
| --- | --- |
| Audit | Violations of the specified policy add new audit events to the audit log, and the Pod can be created. |
| Warn | Violations of the specified policy return user-visible warning information, and the Pod can be created. |
| Enforce | Violations of the specified policy prevent the Pod from being created. |
Different security levels correspond to different check items. If you don't know how to configure your namespace, you can click Policy ConfigMap Explanation in the top right corner of the page to view detailed information.
Click Confirm. If the creation is successful, the security policy you configured will appear on the page.
Click ┇ to edit or delete the security policy you configured.
"},{"location":"en/end-user/kpanda/network/create-ingress.html","title":"Create an Ingress","text":"
In a Kubernetes cluster, Ingress exposes HTTP and HTTPS routes from outside the cluster to Services within the cluster. Traffic routing is controlled by rules defined on the Ingress resource. Here's an example of a simple Ingress that sends all traffic to the same Service:
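A minimal sketch of such an Ingress, assuming a Service named test already exists in the same namespace (names, path, and port are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      - path: /testpath
        pathType: Prefix
        backend:
          service:
            name: test   # all matching traffic goes to this Service
            port:
              number: 80
```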
Ingress is an API object that manages external access to services in the cluster, and the typical access method is HTTP. Ingress can provide load balancing, SSL termination, and name-based virtual hosting.
The container management module has integrated or created a Kubernetes cluster, and you can access the cluster's UI interface.
Completed the creation of a namespace and a user, and granted the user the NS Editor role. For details, refer to Namespace Authorization.
Completed the creation of an Ingress instance, deployed an application workload, and created the proper Service.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
After successfully logging in as the NS Editor user, click Clusters in the upper left corner to enter the Clusters page. In the list of clusters, click a cluster name.
In the left navigation bar, click Container Network -> Ingress to enter the service list, and click the Create Ingress button in the upper right corner.
Note
It is also possible to Create from YAML .
Open Create Ingress page to configure. There are two protocol types to choose from, refer to the following two parameter tables for configuration.
"},{"location":"en/end-user/kpanda/network/create-ingress.html#create-http-protocol-ingress","title":"Create HTTP protocol ingress","text":"Parameter Description Example value Ingress name [Type] Required[Meaning] Enter the name of the new ingress. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter, lowercase English letters or numbers. Ing-01 Namespace [Type] Required[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. default Protocol [Type] Required [Meaning] Refers to the protocol that authorizes inbound access to the cluster service, and supports HTTP (no identity authentication required) or HTTPS (identity authentication needs to be configured) protocol. Here select the ingress of HTTP protocol. HTTP Domain Name [Type] Required [Meaning] Use the domain name to provide external access services. The default is the domain name of the cluster testing.daocloud.io LB Type [Type] Required [Meaning] The usage range of the Ingress instance. Scope of use of Ingress Platform-level load balancer : In the same cluster, share the same Ingress instance, where all Pods can receive requests distributed by the load balancer. Tenant-level load balancer : Tenant load balancer, the Ingress instance belongs exclusively to the current namespace, or belongs to a certain workspace, and the set workspace includes the current namespace, and all Pods can receive it Requests distributed by this load balancer. Platform Level Load Balancer Ingress Class [Type] Optional[Meaning] Select the proper Ingress instance, and import traffic to the specified Ingress instance after selection. When it is None, the default DefaultClass is used. Please set the DefaultClass when creating an Ingress instance. For more information, refer to Ingress Class< br /> Ngnix Session persistence [Type] Optional[Meaning] Session persistence is divided into three types: L4 source address hash , Cookie Key , L7 Header Name . Keep L4 Source Address Hash : : When enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\(binary_remote_addr\"<br /> __Cookie Key__ : When enabled, the connection from a specific client will be passed to the same Pod. After enabled, the following parameters are added to the Annotation by default:<br /> nginx.ingress.kubernetes.io/affinity: \"cookie\"<br /> nginx.ingress.kubernetes .io/affinity-mode: persistent<br /> __L7 Header Name__ : After enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\)http_x_forwarded_for\" Close Path Rewriting [Type] Optional [Meaning] rewrite-target , in some cases, the URL exposed by the backend service is different from the path specified in the Ingress rule. If no URL rewriting configuration is performed, There will be an error when accessing. close Redirect [Type] Optional[Meaning] permanent-redirect , permanent redirection, after entering the rewriting path, the access path will be redirected to the set address. close Traffic Distribution [Type] Optional[Meaning] After enabled and set, traffic distribution will be performed according to the set conditions. 
Based on weight : After setting the weight, add the following Annotation to the created Ingress: nginx.ingress.kubernetes.io/canary-weight: \"10\" Based on Cookie : set After the cookie rules, the traffic will be distributed according to the set cookie conditions Based on Header : After setting the header rules, the traffic will be distributed according to the set header conditions Close Labels [Type] Optional [Meaning] Add a label for the ingress - Annotations [Type] Optional [Meaning] Add annotation for ingress -"},{"location":"en/end-user/kpanda/network/create-ingress.html#create-https-protocol-ingress","title":"Create HTTPS protocol ingress","text":"Parameter Description Example value Ingress name [Type] Required[Meaning] Enter the name of the new ingress. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter, lowercase English letters or numbers. Ing-01 Namespace [Type] Required[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. default Protocol [Type] Required [Meaning] Refers to the protocol that authorizes inbound access to the cluster service, and supports HTTP (no identity authentication required) or HTTPS (identity authentication needs to be configured) protocol. Here select the ingress of HTTPS protocol. HTTPS Domain Name [Type] Required [Meaning] Use the domain name to provide external access services. The default is the domain name of the cluster testing.daocloud.io Secret [Type] Required [Meaning] Https TLS certificate, Create Secret. Forwarding policy [Type] Optional[Meaning] Specify the access policy of Ingress. Path: Specifies the URL path for service access, the default is the root path/directoryTarget service: Service name for ingressTarget service port: Port exposed by the service LB Type [Type] Required [Meaning] The usage range of the Ingress instance. Platform-level load balancer : In the same cluster, the same Ingress instance is shared, and all Pods can receive requests distributed by the load balancer. Tenant-level load balancer : Tenant load balancer, the Ingress instance belongs exclusively to the current namespace or to a certain workspace. This workspace contains the current namespace, and all Pods can receive the workload from this Balanced distribution of requests. Platform Level Load Balancer Ingress Class [Type] Optional[Meaning] Select the proper Ingress instance, and import traffic to the specified Ingress instance after selection. When it is None, the default DefaultClass is used. Please set the DefaultClass when creating an Ingress instance. For more information, refer to Ingress Class< br /> None Session persistence [Type] Optional[Meaning] Session persistence is divided into three types: L4 source address hash , Cookie Key , L7 Header Name . Keep L4 Source Address Hash : : When enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\(binary_remote_addr\"<br /> __Cookie Key__ : When enabled, the connection from a specific client will be passed to the same Pod. 
After enabled, the following parameters are added to the Annotation by default:<br /> nginx.ingress.kubernetes.io/affinity: \"cookie\"<br /> nginx.ingress.kubernetes .io/affinity-mode: persistent<br /> __L7 Header Name__ : After enabled, the following tag is added to the Annotation by default: nginx.ingress.kubernetes.io/upstream-hash-by: \"\\)http_x_forwarded_for\" Close Labels [Type] Optional [Meaning] Add a label for the ingress Annotations [Type] Optional[Meaning] Add annotation for ingress"},{"location":"en/end-user/kpanda/network/create-ingress.html#create-ingress-successfully","title":"Create ingress successfully","text":"
After configuring all the parameters, click the OK button to return to the ingress list automatically. On the right side of the list, click ┇ to modify or delete the selected ingress.
"},{"location":"en/end-user/kpanda/network/create-services.html","title":"Create a Service","text":"
In a Kubernetes cluster, each Pod has an internal independent IP address, but Pods in the workload may be created and deleted at any time, and directly using the Pod IP address cannot provide external services.
This requires creating a service through which you get a fixed IP address, decoupling the front-end and back-end of the workload, and allowing external users to access the service. At the same time, the service also provides the Load Balancer feature, enabling users to access workloads from the public network.
The container management module has integrated or created a Kubernetes cluster, and you can access the cluster's UI interface.
Completed the creation of a namespace and a user, and granted the user the NS Editor role. For details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
After successfully logging in as the NS Editor user, click Clusters in the upper left corner to enter the Clusters page. In the list of clusters, click a cluster name.
In the left navigation bar, click Container Network -> Service to enter the service list, and click the Create Service button in the upper right corner.
!!! tip

    It is also possible to create a service via __YAML__.
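For reference, a minimal sketch of a ClusterIP Service created via YAML, assuming a workload whose Pods carry the label app: job01 (the label and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: svc-01
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: job01        # must match the Pod labels of the target workload
  ports:
  - name: http
    protocol: TCP
    port: 80          # service port exposed inside the cluster
    targetPort: 8080  # container port the workload actually listens on
```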
Open the Create Service page, select an access type, and refer to the following three parameter tables for configuration.
Click Intra-Cluster Access (ClusterIP) , which refers to exposing services through the internal IP of the cluster. The services selected for this option can only be accessed within the cluster. This is the default service type. Refer to the configuration parameters in the table below.
| Parameter | Description | Example value |
| --- | --- | --- |
| Access type | [Type] Required. [Meaning] Specify the method of Pod service discovery; here select intra-cluster access (ClusterIP). | ClusterIP |
| Service Name | [Type] Required. [Meaning] Enter the name of the new service. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase letters, numbers and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | [Type] Required. [Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase letters, numbers and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | default |
| Label selector | [Type] Required. [Meaning] Add a label; the Service selects Pods according to the label. Click "Add" after filling it in. You can also reference the labels of an existing workload: click Reference workload label, select the workload in the pop-up window, and the system will use the selected workload's labels as the selector by default. | app:job01 |
| Port configuration | [Type] Required. [Meaning] To add a protocol port for the service, first select the port protocol type; currently TCP and UDP are supported. Port Name: enter the name of the custom port. Service port (port): the access port for the Pod to provide external services. Container port (targetport): the container port that the workload actually listens on, used to expose the service to the cluster. | - |
| Session Persistence | [Type] Optional. [Meaning] When enabled, requests from the same client will be forwarded to the same Pod. | Enabled |
| Maximum session hold time | [Type] Optional. [Meaning] After session persistence is enabled, the maximum hold time; 30 seconds by default. | 30 seconds |
| Annotation | [Type] Optional. [Meaning] Add annotations for the service. | - |

Create NodePort service
Click NodePort, which means exposing the service via the IP and a static port (NodePort) on each node. A NodePort service is routed to the automatically created ClusterIP service. You can access a NodePort service from outside the cluster by requesting <NodeIP>:<NodePort>. Refer to the configuration parameters in the table below.

| Parameter | Description | Example value |
| --- | --- | --- |
| Access type | [Type] Required. [Meaning] Specify the method of Pod service discovery; here select node access (NodePort). | NodePort |
| Service Name | [Type] Required. [Meaning] Enter the name of the new service. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase letters, numbers and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | [Type] Required. [Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase letters, numbers and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | default |
| Label selector | [Type] Required. [Meaning] Add a label; the Service selects Pods according to the label. Click "Add" after filling it in. You can also reference the labels of an existing workload: click Reference workload label, select the workload in the pop-up window, and the system will use the selected workload's labels as the selector by default. | - |
| Port configuration | [Type] Required. [Meaning] To add a protocol port for the service, first select the port protocol type; currently TCP and UDP are supported. Port Name: enter the name of the custom port. Service port (port): the access port for the Pod to provide external services; by default, the service port is set to the same value as the container port field for convenience. Container port (targetport): the container port actually listened on by the workload. Node port (nodeport): the port of the node, which receives traffic forwarded from the ClusterIP and is used as the entrance for external traffic access. | - |
| Session Persistence | [Type] Optional. [Meaning] When enabled, requests from the same client will be forwarded to the same Pod. After being enabled, .spec.sessionAffinity of the Service is ClientIP; for details, refer to Session Affinity for Service. | Enabled |
| Maximum session hold time | [Type] Optional. [Meaning] After session persistence is enabled, the maximum hold time; the default timeout is 30 seconds, and .spec.sessionAffinityConfig.clientIP.timeoutSeconds is set to 30 seconds by default. | 30 seconds |
| Annotation | [Type] Optional. [Meaning] Add annotations for the service. | - |

Create LoadBalancer service
Click Load Balancer , which refers to using the cloud provider's load balancer to expose services to the outside. External load balancers can route traffic to automatically created NodePort services and ClusterIP services. Refer to the configuration parameters in the table below.
| Parameter | Description | Example value |
| --- | --- | --- |
| Access type | [Type] Required. [Meaning] Specify the method of Pod service discovery; here select load balancer access (LoadBalancer). | LoadBalancer |
| Service Name | [Type] Required. [Meaning] Enter the name of the new service. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase letters, numbers and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | Svc-01 |
| Namespace | [Type] Required. [Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to Namespace Overview. [Note] Please enter a string of 4 to 63 characters, which can contain lowercase letters, numbers and dashes (-), and must start with a lowercase letter and end with a lowercase letter or number. | default |
| External Traffic Policy | [Type] Required. [Meaning] Set the external traffic policy. Cluster: traffic can be forwarded to Pods on all nodes in the cluster. Local: traffic is only sent to Pods on this node. | - |
| Label selector | [Type] Required. [Meaning] Add a label; the Service selects Pods according to the label. Click "Add" after filling it in. You can also reference the labels of an existing workload: click Reference workload label, select the workload in the pop-up window, and the system will use the selected workload's labels as the selector by default. | - |
| Load balancing type | [Type] Required. [Meaning] The type of load balancing used; currently MetalLB and others are supported. | MetalLB |
| IP Pool | [Type] Required. [Meaning] When the selected load balancing type is MetalLB, the LoadBalancer Service allocates IP addresses from this pool by default and announces all IP addresses in this pool through ARP. For details, refer to Install MetalLB. | - |
| Load balancing address | [Type] Required. [Meaning] 1. If you are using a public cloud CloudProvider, fill in the load balancing address provided by the cloud provider here. 2. If the load balancing type is MetalLB, the IP is obtained from the IP pool above by default; if not filled in, it will be obtained automatically. | - |
| Port configuration | [Type] Required. [Meaning] To add a protocol port for the service, first select the port protocol type; currently TCP and UDP are supported. Port Name: enter the name of the custom port. Service port (port): the access port for the Pod to provide external services; by default, the service port is set to the same value as the container port field for convenience. Container port (targetport): the container port actually listened on by the workload. Node port (nodeport): the port of the node, which receives traffic forwarded from the ClusterIP and is used as the entrance for external traffic access. | - |
| Annotation | [Type] Optional. [Meaning] Add annotations for the service. | - |

Complete service creation
After configuring all parameters, click the OK button to return to the service list automatically. On the right side of the list, click ┇ to modify or delete the selected service.
Network policies in Kubernetes allow you to control network traffic at the IP address or port level (OSI layer 3 or layer 4). The container management module currently supports creating network policies based on Pods or namespaces, using label selectors to specify which traffic can enter or leave Pods with specific labels.
For more details on network policies, refer to the official Kubernetes documentation on Network Policies.
Currently, there are two methods available for creating network policies: YAML and form-based creation. Each method has its advantages and disadvantages, catering to different user needs.
YAML creation requires fewer steps and is more efficient, but it has a higher learning curve as it requires familiarity with configuring network policy YAML files.
Form-based creation is more intuitive and straightforward. Users can simply fill in the proper values based on the prompts. However, this method involves more steps.
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies -> Create with YAML in the left navigation bar.
In the pop-up dialog, enter or paste the pre-prepared YAML file, then click OK at the bottom of the dialog.
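A minimal sketch of such a pre-prepared policy, assuming you want to allow ingress to Pods labeled app: web only from Pods labeled role: frontend in the same namespace (all names and ports are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web            # the Pods this policy applies to
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend  # only these Pods may connect
    ports:
    - protocol: TCP
      port: 80
```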
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies -> Create Policy in the left navigation bar.
Fill in the basic information.
The name and namespace cannot be changed after creation.
Fill in the policy configuration.
The policy configuration includes ingress and egress policies. To establish a successful connection from a source Pod to a target Pod, both the egress policy of the source Pod and the ingress policy of the target Pod need to allow the connection. If either side does not allow the connection, the connection will fail.
Ingress Policy: Click ➕ to begin configuring the policy. Multiple policies can be configured. The effects of multiple network policies are cumulative; only when all network policies are satisfied simultaneously can a connection be successfully established.
In the cluster list, click the name of the target cluster, then navigate to Container Network -> Network Policies . Click the name of the network policy.
View the basic configuration, associated instances, ingress policies, and egress policies of the policy.
Info
Under the \"Associated Instances\" tab, you can view instance monitoring, logs, container lists, YAML files, events, and more.
There are two ways to update network policies. You can either update them through the form or by using a YAML file.
On the network policy list page, find the policy you want to update, and choose Update in the action column on the right to update it via the form. Choose Edit YAML to update it using a YAML file.
Click the name of the network policy, then choose Update in the top right corner of the policy details page to update it via the form. Choose Edit YAML to update it using a YAML file.
There are two ways to delete network policies. You can delete network policies either through the form or by using a YAML file.
On the network policy list page, find the policy you want to delete, and choose Delete in the action column on the right to delete it via the form. Choose Edit YAML to delete it using a YAML file.
Click the name of the network policy, then choose Delete in the top right corner of the policy details page to delete it via the form. Choose Edit YAML to delete it using a YAML file.
As the number of business applications continues to grow, the resources of the cluster become increasingly tight. At this point, you can expand the cluster nodes based on kubean. After the expansion, applications can run on the newly added nodes, alleviating resource pressure.
Only clusters created through the container management module support node scaling; clusters integrated from outside do not support this operation. This article mainly introduces how to expand worker nodes of the same architecture in a worker cluster. If you need to add control nodes or heterogeneous worker nodes to the cluster, refer to: Expanding the control node of the worker cluster, Adding heterogeneous nodes to the worker cluster, and Expanding the worker node of the global service cluster.
On the Clusters page, click the name of the target cluster.
If the Cluster Type contains the label Integrated Cluster, it means that the cluster does not support node scaling.
Click Nodes in the left navigation bar, and then click Integrate Node in the upper right corner of the page.
Enter the host name and node IP and click OK.
Click ➕ Add Worker Node to continue adding more nodes.
Note
Accessing the node takes about 20 minutes, please be patient.
When the peak business period is over, in order to save resource costs, you can reduce the size of the cluster and unload redundant nodes, that is, node scaling. After a node is uninstalled, applications cannot continue to run on the node.
The current operating user has the Cluster Admin role authorization.
Only clusters created through the container management module support node scaling; clusters integrated from outside do not support this operation.
Before uninstalling a node, you need to pause scheduling on the node and evict the applications on it to other nodes.
Eviction method: log in to a controller node and use the kubectl drain command to evict all Pods on the node (see the example after this list). The safe eviction method allows the containers in the Pods to terminate gracefully.
When scaling down cluster nodes, they can only be uninstalled one by one, not in batches.
If you need to uninstall cluster controller nodes, you need to ensure that the final number of controller nodes is an odd number.
The first controller node cannot be offline when the cluster node scales down. If it is necessary to perform this operation, please contact the after-sales engineer.
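A hedged sketch of the cordon-and-drain step mentioned above; the node name is a placeholder, and flag names may vary slightly between kubectl versions:

```bash
# Pause scheduling on the node, then safely evict its Pods to other nodes
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```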
On the Clusters page, click the name of the target cluster.
If the Cluster Type contains the label Integrated Cluster, it means that the cluster does not support node scaling.
Click Nodes in the left navigation bar, find the node to be uninstalled, click ┇ and select Remove.
Enter the node name, and click Delete to confirm.
"},{"location":"en/end-user/kpanda/nodes/labels-annotations.html","title":"Labels and Annotations","text":"
Labels are identifying key-value pairs added to Kubernetes objects such as Pods, nodes, and clusters, which can be combined with label selectors to find and filter Kubernetes objects that meet certain conditions. Each key must be unique for a given object.
Annotations, like labels, are key/value pairs, but they do not have identification or filtering features. Annotations can be used to add arbitrary metadata to nodes. Annotation keys usually use the format prefix(optional)/name(required), for example nfd.node.kubernetes.io/extended-resources. If the prefix is omitted, it means that the annotation key is private to the user.
For more information about labels and annotations, refer to the official Kubernetes documentation on labels and selectors or annotations.
The steps to add/delete labels and annotations are as follows:
On the Clusters page, click the name of the target cluster.
Click Nodes in the left navigation bar, click the ┇ operation icon on the right side of the node, and click Edit Labels or Edit Annotations.
Click ➕ Add to add labels or annotations, click X to delete labels or annotations, and finally click OK.
"},{"location":"en/end-user/kpanda/nodes/node-authentication.html","title":"Node Authentication","text":""},{"location":"en/end-user/kpanda/nodes/node-authentication.html#authenticate-nodes-using-ssh-keys","title":"Authenticate Nodes Using SSH Keys","text":"
If you choose to authenticate the nodes of the cluster-to-be-created using SSH keys, you need to configure the public and private keys according to the following instructions.
Run the following command on any node within the management cluster of the cluster-to-be-created to generate the public and private keys.
```bash
cd /root/.ssh
ssh-keygen -t rsa
```
Run the ls command to check if the keys have been successfully created in the management cluster. The correct output should be as follows:
```bash
ls
id_rsa  id_rsa.pub  known_hosts
```
The file named id_rsa is the private key, and the file named id_rsa.pub is the public key.
Run the following command to load the public key file id_rsa.pub onto all the nodes of the cluster-to-be-created.
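A hedged example of distributing the public key, assuming root SSH access to the nodes (the user and IP are placeholders):

```bash
# Copy the public key to one node of the cluster-to-be-created; repeat for every node
ssh-copy-id -i /root/.ssh/id_rsa.pub root@<node-ip>
```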
Replace the user account and node IP in the above command with the username and IP of the nodes in the cluster-to-be-created. The same operation needs to be performed on every node in the cluster-to-be-created.
Run the following command to view the private key file id_rsa created in step 1.
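For example (a minimal sketch):

```bash
cat /root/.ssh/id_rsa
```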
Copy the content of the private key and paste it into the interface's key input field.
"},{"location":"en/end-user/kpanda/nodes/node-check.html","title":"Create a cluster node availability check","text":"
When creating a cluster or adding nodes to an existing cluster, refer to the table below to check the node configuration to avoid cluster creation or expansion failure due to wrong node configuration.
| Check Item | Description |
| --- | --- |
| OS | Refer to Supported Architectures and Operating Systems |
| SELinux | Off |
| Firewall | Off |
| Architecture Consistency | Consistent CPU architecture between nodes (such as ARM or x86) |
| Host Time | The time difference between all hosts is within 10 seconds |
| Network Connectivity | The node and its SSH port can be accessed normally by the platform |
| CPU | Available CPU resources are greater than 4 Cores |
| Memory | Available memory resources are greater than 8 GB |

Supported architectures and operating systems

| Architecture | Operating System | Remarks |
| --- | --- | --- |
| ARM | Kylin Linux Advanced Server release V10 (Sword) SP2 | Recommended |
| ARM | UOS Linux | |
| ARM | openEuler | |
| x86 | CentOS 7.x | Recommended |
| x86 | Redhat 7.x | Recommended |
| x86 | Redhat 8.x | Recommended |
| x86 | Flatcar Container Linux by Kinvolk | |
| x86 | Debian Bullseye, Buster, Jessie, Stretch | |
| x86 | Ubuntu 16.04, 18.04, 20.04, 22.04 | |
| x86 | Fedora 35, 36 | |
| x86 | Fedora CoreOS | |
| x86 | openSUSE Leap 15.x/Tumbleweed | |
| x86 | Oracle Linux 7, 8, 9 | |
| x86 | Alma Linux 8, 9 | |
| x86 | Rocky Linux 8, 9 | |
| x86 | Amazon Linux 2 | |
| x86 | Kylin Linux Advanced Server release V10 (Sword) - SP2 | Haiguang |
| x86 | UOS Linux | |
| x86 | openEuler | |

Node Details
After accessing or creating a cluster, you can view the information of each node in the cluster, including node status, labels, resource usage, Pod, monitoring information, etc.
On the Clusters page, click the name of the target cluster.
Click Nodes on the left navigation bar to view the node status, role, label, CPU/memory usage, IP address, and creation time.
Click the node name to enter the node details page to view more information, including overview information, pod information, label annotation information, event list, status, etc.
In addition, you can also view the node's YAML file, monitoring information, labels and annotations, etc.
The platform supports suspending or resuming scheduling of nodes. Pausing scheduling means stopping the scheduling of Pods to the node; resuming scheduling means that Pods can be scheduled to that node.
On the Clusters page, click the name of the target cluster.
Click Nodes in the left navigation bar, click the ┇ operation icon on the right side of the node, and click the Cordon button to suspend scheduling on the node.
Click the ┇ operation icon on the right side of the node, and click the Uncordon button to resume scheduling on the node.
The node scheduling status may be delayed due to network conditions. Click the refresh icon on the right side of the search box to refresh the node scheduling status.
A taint enables a node to repel a certain type of Pods, preventing them from being scheduled on the node. One or more taints can be applied to each node, and Pods that cannot tolerate these taints will not be scheduled on that node.
Find the target cluster on the Clusters page, and click the cluster name to enter the Cluster page.
In the left navigation bar, click Nodes, find the node whose taints need to be modified, click the ┇ operation icon on the right, and click the Edit Taints button.
Enter the key-value information of the taint in the pop-up box, select the taint effect, and click OK.
Click ➕ Add to add multiple taints to the node, and click X on the right side of a taint effect to delete the taint.
Currently supports three taint effects:
NoExecute: This affects pods that are already running on the node as follows:
Pods that do not tolerate the taint are evicted immediately
Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever
Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time. After that time elapses, the node lifecycle controller evicts the Pods from the node.
NoSchedule: No new Pods will be scheduled on the tainted node unless they have a matching toleration. Pods currently running on the node are not evicted.
PreferNoSchedule: This is a \"preference\" or \"soft\" version of NoSchedule. The control plane will try to avoid placing a Pod that does not tolerate the taint on the node, but it is not guaranteed, so this taint is not recommended to use in a production environment.
For more details about taints, refer to the Kubernetes documentation on Taints and Tolerations.
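For reference, the same effect can also be achieved from the command line; a minimal sketch (the key, value, and node name are illustrative):

```bash
# Add a NoSchedule taint to a node, then remove it (trailing "-" removes the taint)
kubectl taint nodes <node-name> key1=value1:NoSchedule
kubectl taint nodes <node-name> key1=value1:NoSchedule-
```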
The current cluster is connected to the container management and the Global cluster has installed the kolm component (search for helm templates for kolm).
The current cluster has the olm component installed with a version of 0.2.4 or higher (search for helm templates for olm).
Go to Container Management -> Select the current cluster -> Helm Apps -> View the olm component -> Plugin Settings , and find the images needed for the opm, minio, minio bundle, and minio operator in the subsequent steps.
Using the screenshot as an example, the four image addresses are as follows:

```bash
# opm image
10.5.14.200/quay.m.daocloud.io/operator-framework/opm:v1.29.0

# minio image
10.5.14.200/quay.m.daocloud.io/minio/minio:RELEASE.2023-03-24T21-41-23Z

# minio bundle image
10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3

# minio operator image
10.5.14.200/quay.m.daocloud.io/minio/operator:v5.0.3
```
Run the opm command to get the operators included in the offline bundle image.
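The exact subcommand depends on the opm version in use; as a hedged sketch, one way to list the operators contained in the offline image (assuming the opm binary is available locally) is:

```bash
# Render the offline image and list the packages/bundles it contains
opm render 10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3
```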
Replace all image addresses in the minio-operator/manifests/minio-operator.clusterserviceversion.yaml file with the image addresses from the offline container registry.
Before replacement:
After replacement:
Generate a Dockerfile for building the bundle image.
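A hedged sketch of one way to do this with opm (the flags are assumptions and may differ by opm version); it produces the index.Dockerfile used in the next step:

```bash
# Generate an index (catalog) Dockerfile that references the rebuilt minio-operator bundle
opm index add \
  --bundles 10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3 \
  --generate \
  --out-dockerfile index.Dockerfile
```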
```bash
# Set the new catalog image
export OFFLINE_CATALOG_IMG=10.5.14.200/release.daocloud.io/operator-framework/system-operator-index:v0.1.0-offline

$ docker build . -f index.Dockerfile -t ${OFFLINE_CATALOG_IMG}

$ docker push ${OFFLINE_CATALOG_IMG}
```
Go to Container Management and update the built-in catsrc image for the Helm App olm (enter the catalog image specified in the construction of the catalog image, ${catalog-image} ).
After the update is successful, the minio-operator component will appear in the Operator Hub.
"},{"location":"en/end-user/kpanda/permissions/cluster-ns-auth.html","title":"Cluster and Namespace Authorization","text":"
Container management implements authorization based on global authority management and global user/group management. If you need to grant users the highest authority for container management (can create, manage, and delete all clusters), refer to What are Access Control.
After the user logs in to the platform, click Privilege Management under Container Management on the left menu bar, which is located on the Cluster Permissions tab by default.
Click the Add Authorization button.
On the Add Cluster Permission page, select the target cluster, the user/group to be authorized, and click OK .
Currently, the only cluster role supported is Cluster Admin . For details about permissions, refer to Permission Description. If you need to authorize multiple users/groups at the same time, you can click Add User Permissions to add multiple times.
Return to the cluster permission management page, and a message appears on the screen: Cluster permission added successfully .
After the user logs in to the platform, click Permissions under Container Management on the left menu bar, and click the Namespace Permissions tab.
Click the Add Authorization button. On the Add Namespace Permission page, select the target cluster, target namespace, and user/group to be authorized, and click OK .
The currently supported namespace roles are NS Admin, NS Editor, and NS Viewer. For details about permissions, refer to Permission Description. If you need to authorize multiple users/groups at the same time, you can click Add User Permission to add multiple times. Click OK to complete the permission authorization.
Return to the namespace permission management page, and a message appears on the screen: Namespace permission added successfully.
Tip
If you need to delete or edit permissions later, you can click ┇ on the right side of the list and select Edit or Delete.
"},{"location":"en/end-user/kpanda/permissions/custom-kpanda-role.html","title":"Adding RBAC Rules to System Roles","text":"
In the past, the RBAC rules for those system roles in container management were pre-defined and could not be modified by users. To support more flexible permission settings and to meet the customized needs for system roles, now you can modify RBAC rules for system roles such as cluster admin, ns admin, ns editor, ns viewer.
The following example demonstrates how to add a new ns-view rule, granting the authority to delete workload deployments. Similar operations can be performed for other rules.
Before adding RBAC rules to system roles, the following prerequisites must be met:
Container management v0.27.0 and above.
Integrated Kubernetes cluster or created Kubernetes cluster, and able to access the cluster's UI interface.
Completed creation of a namespace and user account, and the granting of NS Viewer. For details, refer to namespace authorization.
Note
RBAC rules only need to be added in the Global Cluster, and the Kpanda controller will synchronize those added rules to all integrated subclusters. Synchronization may take some time to complete.
RBAC rules can only be added in the Global Cluster. RBAC rules added in subclusters will be overridden by the system role permissions of the Global Cluster.
Only ClusterRoles with fixed Label are supported for adding rules. Replacing or deleting rules is not supported, nor is adding rules by using role. The correspondence between built-in roles and ClusterRole Label created by users is as follows.
Create a deployment by a user with admin or cluster admin permissions.
Grant a user the ns-viewer role to provide them with the ns-view permission.
Switch the login user to ns-viewer, open the console to get the token for the ns-viewer user, and use curl to request and delete the nginx deployment mentioned above. However, a prompt appears as below, indicating the user doesn't have permission to delete it.
```bash
[root@master-01 ~]# curl -k -X DELETE 'https://${URL}/apis/kpanda.io/v1alpha1/clusters/cluster-member/namespaces/default/deployments/nginx' -H 'authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJOU044MG9BclBRMzUwZ2VVU2ZyNy1xMEREVWY4MmEtZmJqR05uRE1sd1lFIn0.eyJleHAiOjE3MTU3NjY1NzksImlhdCI6MTcxNTY4MDE3OSwiYXV0aF90aW1lIjoxNzE1NjgwMTc3LCJqdGkiOiIxZjI3MzJlNC1jYjFhLTQ4OTktYjBiZC1iN2IxZWY1MzAxNDEiLCJpc3MiOiJodHRwczovLzEwLjYuMjAxLjIwMTozMDE0Ny9hdXRoL3JlYWxtcy9naGlwcG8iLCJhdWQiOiJfX2ludGVybmFsLWdoaXBwbyIsInN1YiI6ImMxZmMxM2ViLTAwZGUtNDFiYS05ZTllLWE5OGU2OGM0MmVmMCIsInR5cCI6IklEIiwiYXpwIjoiX19pbnRlcm5hbC1naGlwcG8iLCJzZXNzaW9uX3N0YXRlIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiYXRfaGFzaCI6IlJhTHoyQjlKQ2FNc1RrbGVMR3V6blEiLCJhY3IiOiIwIiwic2lkIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiZW1haWxfdmVyaWZpZWQiOmZhbHNlLCJncm91cHMiOltdLCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJucy12aWV3ZXIiLCJsb2NhbGUiOiIifQ.As2ipMjfvzvgONAGlc9RnqOd3zMwAj82VXlcqcR74ZK9tAq3Q4ruQ1a6WuIfqiq8Kq4F77ljwwzYUuunfBli2zhU2II8zyxVhLoCEBu4pBVBd_oJyUycXuNa6HfQGnl36E1M7-_QG8b-_T51wFxxVb5b7SEDE1AvIf54NAlAr-rhDmGRdOK1c9CohQcS00ab52MD3IPiFFZ8_Iljnii-RpXKZoTjdcULJVn_uZNk_SzSUK-7MVWmPBK15m6sNktOMSf0pCObKWRqHd15JSe-2aA2PKBo1jBH3tHbOgZyMPdsLI0QdmEnKB5FiiOeMpwn_oHnT6IjT-BZlB18VkW8rA'
{"code":7,"message":"[RBAC] delete resources(deployments: nginx) is forbidden for user(ns-viewer) in cluster(cluster-member)","details":[]}
```
Create a ClusterRole on the global cluster, as shown in the yaml below.
This field value can be arbitrarily specified, as long as it is not duplicated and complies with the Kubernetes resource naming conventions.
When adding rules to different roles, make sure to apply different labels.
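A hedged sketch of such a ClusterRole. The metadata.name is arbitrary, and the label key/value shown here is only a placeholder for the fixed label that maps the rule to the built-in ns-viewer role; use the actual label from the correspondence table for built-in roles:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: append-ns-viewer-deployment-delete     # any non-duplicated, valid resource name
  labels:
    rbac.kpanda.io/role-template: ns-viewer    # placeholder: replace with the fixed label for ns-viewer
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["delete"]                            # grants the additional delete permission
```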
Wait for the kpanda controller to add a rule of user creation to the built-in role: ns-viewer, then you can check if the rules added in the previous step are present for ns-viewer.
When using curl again to request the deletion of the aforementioned nginx deployment, this time the deletion was successful. This means that ns-viewer has successfully added the rule to delete deployments.
Container management permissions are based on a multi-dimensional permission management system created by global permission management and Kubernetes RBAC permission management. It supports cluster-level and namespace-level permission control, helping users to conveniently and flexibly set different operation permissions for IAM users and user groups (collections of users) under a tenant.
Cluster permissions are authorized based on Kubernetes RBAC's ClusterRoleBinding, allowing users/user groups to have cluster-related permissions. The current default cluster role is Cluster Admin (does not have the permission to create or delete clusters).
Namespace permissions are authorized based on Kubernetes RBAC capabilities, allowing different users/user groups to have different operation permissions on resources under a namespace (including Kubernetes API permissions). For details, refer to: Kubernetes RBAC. Currently, the default roles for container management are: NS Admin, NS Editor, NS Viewer.
What is the relationship between global permissions and container management permissions?
Answer: Global permissions only authorize coarse-grained permissions, which can manage the creation, editing, and deletion of all clusters; while for fine-grained permissions, such as the management permissions of a single cluster, the management, editing, and deletion permissions of a single namespace, they need to be implemented based on Kubernetes RBAC container management permissions. Generally, users only need to be authorized in container management.
Currently, only four default roles are supported. Can the RoleBinding and ClusterRoleBinding (Kubernetes fine-grained RBAC) for custom roles also take effect?
Answer: Currently, custom permissions cannot be managed through the graphical interface, but the permission rules created using kubectl can still take effect.
Suanova AI platform supports elastic scaling of Pod resources based on metrics (Horizontal Pod Autoscaling, HPA). Users can dynamically adjust the number of Pod replicas by setting CPU utilization, memory usage, and custom metrics. For example, after setting an auto scaling policy based on the CPU utilization metric for a workload, when the CPU utilization of the Pods rises above or falls below the threshold you set, the workload controller automatically increases or decreases the number of Pod replicas.
This page describes how to configure auto scaling based on built-in metrics and custom metrics for workloads.
Note
HPA is only applicable to Deployment and StatefulSet, and only one HPA can be created per workload.
If you create an HPA policy based on CPU utilization, you must set the configuration limit (Limit) for the workload in advance, otherwise the CPU utilization cannot be calculated.
If built-in metrics and multiple custom metrics are used at the same time, HPA calculates the required number of replicas based on each metric separately and takes the larger value (but not exceeding the maximum number of replicas configured when setting the HPA policy) for elastic scaling.
Refer to the following steps to configure a built-in metric auto scaling policy for the workload.
Click Clusters on the left navigation bar to enter the cluster list page. Click a cluster name to enter the Cluster Details page.
On the cluster details page, click Workload in the left navigation bar to enter the workload list, and then click a workload name to enter the Workload Details page.
Click the Auto Scaling tab to view the auto scaling configuration of the current cluster.
After confirming that the cluster has installed the metrics-server plug-in, and the plug-in is running normally, you can click the New Scaling button.
Create custom metric auto scaling policy parameters.
Policy name: Enter the name of the auto scaling policy. Please note that the name can contain up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as hpa-my-dep.
Namespace: The namespace where the workload resides.
Workload: The workload object that performs auto scaling.
Target CPU Utilization: The CPU usage of the Pods under the workload resource, calculated as the actual CPU usage of the Pods divided by the total CPU request value of all Pods under the workload. When the actual CPU usage is greater or lower than the target value, the system automatically increases or decreases the number of Pod replicas accordingly.
Target Memory Usage: The memory usage of the Pods under the workload resource. When the actual memory usage is greater or lower than the target value, the system automatically increases or decreases the number of Pod replicas accordingly.
Replica range: the elastic scaling range of the number of Pod replicas. The default interval is 1 - 10.
After completing the parameter configuration, click the OK button to automatically return to the elastic scaling details page. Click ┇ on the right side of the list to edit, delete, and view related events.
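For reference, a policy like the one configured above corresponds roughly to the following standard Kubernetes HorizontalPodAutoscaler manifest; this is a minimal sketch, and the names and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-my-dep
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-dep            # the workload to scale
  minReplicas: 1            # replica range lower bound
  maxReplicas: 10           # replica range upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # target CPU utilization (%)
```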
The Vertical Pod Autoscaler (VPA) calculates the most suitable CPU and memory request values for a Pod by monitoring the Pod's resource requests and usage over a period of time. Using VPA can allocate resources to each Pod in the cluster more reasonably, improve the overall resource utilization of the cluster, and avoid wasting cluster resources.
AI platform supports VPA through containers. Based on this feature, the Pod request value can be dynamically adjusted according to the usage of container resources. AI platform supports manual and automatic modification of resource request values, and you can configure them according to actual needs.
This page describes how to configure VPA for deployment.
Warning
Using VPA to modify a Pod resource request will trigger a Pod restart. Due to the limitations of Kubernetes itself, Pods may be scheduled to other nodes after restarting.
Refer to the following steps to configure a vertical scaling policy for the deployment.
Find the current cluster in Clusters , and click the name of the target cluster.
Click Deployments in the left navigation bar, find the deployment that needs to create a VPA, and click the name of the deployment.
Click the Auto Scaling tab to view the auto scaling configuration of the current cluster, and confirm that the relevant plug-ins have been installed and are running normally.
Click the Create Autoscaler button and configure the VPA vertical scaling policy parameters.
Policy name: Enter the name of the vertical scaling policy. Please note that the name can contain up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as vpa-my-dep.
Scaling mode: The method of modifying the CPU and memory request values. Currently, vertical scaling supports manual and automatic scaling modes.
Manual scaling: After the vertical scaling policy calculates the recommended resource configuration value, the user needs to manually modify the resource quota of the application.
Auto-scaling: The vertical scaling policy automatically calculates and modifies the resource quota of the application.
Target container: Select the container to be scaled vertically.
After completing the parameter configuration, click the OK button to automatically return to the elastic scaling details page. Click ┇ on the right side of the list to perform edit and delete operations.
"},{"location":"en/end-user/kpanda/scale/custom-hpa.html","title":"Creating HPA Based on Custom Metrics","text":"
When the built-in CPU and memory metrics in the system do not meet your business needs, you can add custom metrics by configuring ServiceMonitoring and achieve auto-scaling based on these custom metrics. This article will introduce how to configure auto-scaling for workloads based on custom metrics.
Note
HPA is only applicable to Deployment and StatefulSet, and each workload can only create one HPA.
If both built-in metrics and multiple custom metrics are used, HPA will calculate the required number of scaled replicas based on multiple metrics respectively, and take the larger value (but not exceeding the maximum number of replicas configured when setting the HPA policy) for scaling.
Refer to the following steps to configure the auto-scaling policy based on metrics for workloads.
Click Clusters in the left navigation bar to enter the clusters page. Click a cluster name to enter the Cluster Overview page.
On the Cluster Details page, click Workloads in the left navigation bar to enter the workload list, and click a workload name to enter the Workload Details page.
Click the Auto Scaling tab to view the current autoscaling configuration of the cluster.
Confirm that the cluster has installed metrics-server, Insight, and Prometheus-adapter plugins, and that the plugins are running normally, then click the Create AutoScaler button.
Note
If the related plugins are not installed or the plugins are in an abnormal state, you will not be able to see the entry for creating custom metrics auto-scaling on the page.
Policy Name: Enter the name of the auto-scaling policy. Note that the name can be up to 63 characters long, can only contain lowercase letters, numbers, and separators (\"-\"), and must start and end with a lowercase letter or number, e.g., hpa-my-dep.
Namespace: The namespace where the workload is located.
Workload: The workload object that performs auto-scaling.
Resource Type: The type of custom metric being monitored, including Pod and Service types.
Metric: The name of the custom metric created using ServiceMonitoring or the name of the system-built custom metric.
Data Type: The method used to calculate the metric value, including target value and target average value. When the resource type is Pod, only the target average value can be used.
This case takes a Golang business program as an example. The example program exposes the httpserver_requests_total metric and records HTTP requests. This metric can be used to calculate the QPS value of the business program.
"},{"location":"en/end-user/kpanda/scale/custom-hpa.html#deploy-business-program","title":"Deploy Business Program","text":"
"},{"location":"en/end-user/kpanda/scale/custom-hpa.html#prometheus-collects-business-monitoring","title":"Prometheus Collects Business Monitoring","text":"
If the insight-agent is installed, Prometheus can be configured by creating a ServiceMonitor CRD object.
Operation steps: In Cluster Details -> Custom Resources, search for “servicemonitors.monitoring.coreos.com”, and click the name to enter the details. Create the following example CRD in the httpserver namespace via YAML:
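A minimal sketch of such a ServiceMonitor, assuming the business Service is labeled app: httpserver and exposes its metrics on a port named http (those names are assumptions; the managed-by label is explained in the note below):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: httpserver
  namespace: httpserver
  labels:
    operator.insight.io/managed-by: insight   # required only when Prometheus is installed via insight (see note below)
spec:
  endpoints:
  - interval: 30s
    port: http          # assumption: the Service port exposing /metrics is named "http"
    path: /metrics
  namespaceSelector:
    matchNames:
    - httpserver
  selector:
    matchLabels:
      app: httpserver   # assumption: label on the business Service
```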
If Prometheus is installed via insight, the serviceMonitor must be labeled with operator.insight.io/managed-by: insight. If installed by other means, this label is not required.
"},{"location":"en/end-user/kpanda/scale/custom-hpa.html#configure-metric-rules-in-prometheus-adapter","title":"Configure Metric Rules in Prometheus-adapter","text":"
Steps: In Clusters -> Helm Apps, search for “prometheus-adapter”, enter the update page through the action bar, and configure custom metrics in YAML as follows:
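A hedged sketch of what such a rule could look like in the prometheus-adapter values; the metric name comes from the example program, while the query and the exposed metric name are assumptions:

```yaml
rules:
  custom:
  - seriesQuery: 'httpserver_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "httpserver_requests_total"
      as: "httpserver_requests_qps"            # name under which the custom metric is exposed
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```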
Follow the above steps to find the application httpserver in the Deployment and create auto-scaling via custom metrics.
"},{"location":"en/end-user/kpanda/scale/hpa-cronhpa-compatibility-rules.html","title":"Compatibility Rules for HPA and CronHPA","text":"
HPA stands for HorizontalPodAutoscaler, which refers to horizontal pod auto-scaling.
CronHPA stands for Cron HorizontalPodAutoscaler, which refers to scheduled horizontal pod auto-scaling.
"},{"location":"en/end-user/kpanda/scale/hpa-cronhpa-compatibility-rules.html#conflict-between-cronhpa-and-hpa","title":"Conflict Between CronHPA and HPA","text":"
Scheduled scaling with CronHPA triggers horizontal pod scaling at specified times. To cope with sudden traffic surges, you may also have configured HPA to ensure the normal operation of your application. If both HPA and CronHPA exist at the same time, conflicts arise because CronHPA and HPA operate independently without awareness of each other, so the action performed last overrides the one executed first.
By comparing the definition templates of CronHPA and HPA, the following points can be observed:
Both CronHPA and HPA use the scaleTargetRef field to identify the scaling target.
CronHPA schedules the number of replicas to scale based on crontab rules in jobs.
HPA determines scaling based on resource utilization.
Note
If both CronHPA and HPA are set, there will be scenarios where CronHPA and HPA simultaneously operate on a single scaleTargetRef.
"},{"location":"en/end-user/kpanda/scale/hpa-cronhpa-compatibility-rules.html#compatibility-solution-for-cronhpa-and-hpa","title":"Compatibility Solution for CronHPA and HPA","text":"
As noted above, the fundamental reason that simultaneous use of CronHPA and HPA results in the later action overriding the earlier one is that the two controllers cannot sense each other. Therefore, the conflict can be resolved by enabling CronHPA to be aware of HPA's current state.
The system will treat HPA as the scaling object for CronHPA, thus achieving scheduled scaling for the Deployment object defined by the HPA.
HPA's definition configures the Deployment in the scaleTargetRef field, and then the Deployment uses its definition to locate the ReplicaSet, which ultimately adjusts the actual number of replicas.
In AI platform, the scaleTargetRef in CronHPA is set to the HPA object, and it uses the HPA object to find the actual scaleTargetRef, allowing CronHPA to be aware of HPA's current state.
CronHPA senses HPA by adjusting the HPA object itself. CronHPA determines whether scaling is needed and modifies the HPA upper limit by comparing its target number of replicas with the current number of replicas and choosing the larger value. Similarly, CronHPA determines whether to modify the HPA lower limit by comparing its target number of replicas with the configuration in HPA and choosing the smaller value.
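As a sketch of this relationship, a CronHPA object from kubernetes-cronhpa-controller can reference the HPA directly in its scaleTargetRef. The HPA name, job names, schedules, and target sizes below are hypothetical:

```yaml
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: cronhpa-sample
  namespace: default
spec:
  scaleTargetRef:                  # points to the HPA object instead of the Deployment
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: hpa-httpserver           # hypothetical HPA name
  jobs:
    - name: scale-up-morning
      schedule: "0 0 8 * * *"      # crontab-style rule: every day at 08:00
      targetSize: 6                # desired replicas at the scheduled time
    - name: scale-down-night
      schedule: "0 0 22 * * *"
      targetSize: 2
```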
The scheduled horizontal pod autoscaling policy (CronHPA) provides a stable compute resource guarantee for applications with periodic high concurrency, and kubernetes-cronhpa-controller is the key component that implements CronHPA.
This section describes how to install the kubernetes-cronhpa-controller plugin.
Note
To use CronHPA, you need to install not only the kubernetes-cronhpa-controller plugin but also the metrics-server plugin.
Refer to the following steps to install the kubernetes-cronhpa-controller plugin for the cluster.
On the Clusters page, find the target cluster where the plugin needs to be installed, click the name of the cluster, then click Workloads -> Deployments on the left, and click the name of the target workload.
On the workload details page, click the Auto Scaling tab, and click Install on the right side of CronHPA .
Read the relevant introduction of the plug-in, select the version and click the Install button. It is recommended to install 1.3.0 or later.
Refer to the following instructions to configure the parameters.
Name: Enter the plugin name. Note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as kubernetes-cronhpa-controller.
Namespace: Select which namespace the plugin will be installed in, here we take default as an example.
Version: The version of the plugin, here we take the 1.3.0 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be in the ready state before marking the application installation as successful.
Failed to delete: If the plugin installation fails, delete the associated resources that have already been installed. When enabled, Ready Wait is enabled synchronously by default.
Detailed log: When enabled, a detailed log of the installation process will be recorded.
Note
After enabling Ready Wait and/or Failed to delete, it takes a longer time for the application to be marked as running.
Click OK in the lower right corner of the page, and the system will automatically jump to the Helm Apps list page. Wait a few minutes and refresh the page to see the application you just installed.
Warning
If you need to delete the kubernetes-cronhpa-controller plugin, you should go to the Helm Apps list page to delete it completely.
If you delete the plugin under the Auto Scaling tab of the workload, only the workload's copy of the plugin is deleted; the plugin itself remains, and an error will be reported when the plugin is reinstalled later.
Go back to the Auto Scaling tab under the workload details page, and you can see that the interface displays Plugin installed. Now you can start creating CronHPA policies.
metrics-server is Kubernetes' built-in component for collecting resource usage metrics. By configuring HPA policies, you can automatically scale the Pod replicas of workload resources horizontally.
This section describes how to install metrics-server .
Please perform the following steps to install the metrics-server plugin for the cluster.
On the Auto Scaling page under workload details, click the Install button to enter the metrics-server plug-in installation interface.
Read the introduction of the metrics-server plugin, select the version and click the Install button. This page uses version 3.8.2 as an example; it is recommended to install version 3.8.2 or later.
Configure basic parameters on the installation configuration interface.
Name: Enter the plugin name. Note that the name can be up to 63 characters, can only contain lowercase letters, numbers and separators ("-"), and must start and end with a lowercase letter or number, such as metrics-server-01.
Namespace: Select the namespace for plugin installation, here we take default as an example.
Version: The version of the plugin, here we take 3.8.2 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be ready before marking the application installation as successful.
Failed to delete: If the installation fails, the associated resources that have already been installed will be removed. When enabled, Ready Wait is enabled synchronously by default.
Verbose log: Turn on the verbose output of the installation process log.
Note
After enabling Ready Wait and/or Failed to delete, it takes a longer time for the app to be marked as Running.
Advanced parameter configuration
If the cluster network cannot access the k8s.gcr.io repository, try modifying the repository parameter to repository: k8s.m.daocloud.io/metrics-server/metrics-server .
An SSL certificate is also required to install the metrics-server plugin. To bypass certificate verification, add the - --kubelet-insecure-tls parameter under defaultArgs: .
Click to view and use the YAML parameters to replace the default YAML
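The full default YAML is long; the following sketch shows only the two fragments mentioned above as they would appear in the metrics-server Helm values (the other defaultArgs entries follow the upstream chart defaults and may differ by version):

```yaml
# Fragment of the metrics-server Helm values (a sketch)
image:
  repository: k8s.m.daocloud.io/metrics-server/metrics-server   # replaces the default k8s.gcr.io repository
defaultArgs:
  - --cert-dir=/tmp
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  - --kubelet-insecure-tls          # bypass kubelet certificate verification
```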
Click the OK button to complete the installation of the metrics-server plugin, and the system will automatically jump to the Helm Apps list page. After a few minutes, refresh the page and you will see the newly installed application.
Note
When deleting the metrics-server plugin, the plugin can only be completely deleted on the Helm Apps list page. If you only delete metrics-server on the workload page, only the workload's copy of the application is deleted; the application itself remains, and an error will be reported when you reinstall the plugin later.
The Vertical Pod Autoscaler (VPA) makes the cluster's resource allocation more reasonable and avoids wasting cluster resources. vpa is the key component for implementing vertical autoscaling of containers.
This section describes how to install the vpa plugin.
To use VPA policies, you need to install not only the vpa plugin but also the metrics-server plugin.
Refer to the following steps to install the vpa plugin for the cluster.
On the Clusters page, find the target cluster where the plugin needs to be installed, click the name of the cluster, then click Workloads -> Deployments on the left, and click the name of the target workload.
On the workload details page, click the Auto Scaling tab, and click Install on the right side of VPA .
Read the relevant introduction of the plug-in, select the version and click the Install button. It is recommended to install 1.5.0 or later.
Review the configuration parameters described below.
Name: Enter the plugin name. Note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as vpa-01.
Namespace: Select which namespace the plugin will be installed in, here we take default as an example.
Version: The version of the plugin, here we take the 1.5.0 version as an example.
Ready Wait: When enabled, it will wait for all associated resources under the application to be in the ready state before marking the application installation as successful.
Failed to delete: If the plugin installation fails, delete the associated resources that have already been installed. When enabled, Wait will be enabled synchronously by default.
Detailed log: When enabled, a detailed log of the installation process will be recorded.
Note
After enabling Ready Wait and/or Failed to delete, it takes a longer time for the application to be marked as running.
Click OK in the lower right corner of the page, and the system will automatically jump to the Helm Apps list page. Wait a few minutes and refresh the page to see the application you just installed.
Warning
If you need to delete the vpa plugin, you should go to the Helm Apps list page to delete it completely.
If you delete the plugin under the Auto Scaling tab of the workload, only the workload's copy of the plugin is deleted; the plugin itself remains, and an error will be reported when the plugin is reinstalled later.
Go back to the Auto Scaling tab under the workload details page, and you can see that the interface displays Plugin installed. Now you can start creating VPA policies.
Log in to the cluster, click the sidebar Helm Apps -> Helm Charts, enter knative in the search box at the top right, and then press the Enter key to search.
Click the knative-operator to enter the installation configuration interface. You can view the available versions and the Parameters optional items of Helm values on this interface.
After clicking the install button, you will enter the installation configuration interface.
Enter the name and installation tenant; it is recommended to check Wait and Detailed Logs.
In the settings below, you can tick Serving and enter the installation tenant of the Knative Serving component, which will deploy the Knative Serving component after installation. This component is managed by the Knative Operator.
Knative provides a higher level of abstraction, simplifying and speeding up the process of building, deploying, and managing applications on Kubernetes. It allows developers to focus more on implementing business logic, while leaving most of the infrastructure and operations work to Knative, significantly improving productivity.
| Component | Features |
| --- | --- |
| Activator | Queues requests (if a Knative Service has scaled to zero). Calls the autoscaler to bring back services that have scaled down to zero and forwards queued requests. The Activator can also act as a request buffer, handling bursts of traffic. |
| Autoscaler | Responsible for scaling Knative services based on configuration, metrics, and incoming requests. |
| Controller | Manages the state of Knative CRs. It monitors multiple objects, manages the lifecycle of dependent resources, and updates resource status. |
| Queue-Proxy | Sidecar container injected into each Knative Service. Responsible for collecting traffic data and reporting it to the Autoscaler, which then initiates scaling requests based on this data and preset rules. |
| Webhooks | Knative Serving has several Webhooks responsible for validating and mutating Knative resources. |

Ingress Traffic Entry Solutions

| Solution | Use Case |
| --- | --- |
| Istio | If Istio is already in use, it can be chosen as the traffic entry solution. |
| Contour | If Contour has been enabled in the cluster, it can be chosen as the traffic entry solution. |
| Kourier | If neither of the above two Ingress components is present, Knative's Envoy-based Kourier Ingress can be used as the traffic entry solution. |

Autoscaler Solutions Comparison

| Autoscaler Type | Core Part of Knative Serving | Default Enabled | Scale to Zero Support | CPU-based Autoscaling Support |
| --- | --- | --- | --- | --- |
| Knative Pod Autoscaler (KPA) | Yes | Yes | Yes | No |
| Horizontal Pod Autoscaler (HPA) | No | Needs to be enabled after installing Knative Serving | No | Yes |

CRD

| Resource Type | API Name | Description |
| --- | --- | --- |
| Services | service.serving.knative.dev | Automatically manages the entire lifecycle of workloads, controls the creation of other objects, and ensures applications have Routes, Configurations, and a new revision with each update. |
| Routes | route.serving.knative.dev | Maps network endpoints to one or more revision versions and supports traffic distribution and version routing. |
| Configurations | configuration.serving.knative.dev | Maintains the desired state of deployments, provides separation between code and configuration, and follows the Twelve-Factor App methodology; modifying a configuration creates a new revision. |
| Revisions | revision.serving.knative.dev | A snapshot of the workload at each modification point in time; an immutable object that automatically scales based on traffic. |

Knative Practices
In this section, we will delve into learning Knative through several practical exercises.
Case 1: When there is low or no traffic, traffic is routed to the Activator.
Case 2: When there is high traffic, traffic is routed directly to the Pods only when it exceeds the target-burst-capacity.
Configured as 0: the Activator is only in the request path when scaling from zero.
Configured as -1: the Activator is always present in the request path.
Configured as >0: the number of additional concurrent requests the system can handle before triggering scaling.
Case 3: When the traffic decreases again, traffic is routed back to the Activator once current_demand + target-burst-capacity > (pods * concurrency-target).
That is, the total number of pending requests plus the number of requests allowed to exceed the target concurrency is greater than the target concurrency per Pod times the number of Pods.
"},{"location":"en/end-user/kpanda/scale/knative/playground.html#case-2-based-on-concurrent-elastic-scaling","title":"case 2 - Based on Concurrent Elastic Scaling","text":"
We first apply the following YAML definition under the cluster.
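As a hedged sketch of such a definition, the following Knative Service scales on concurrency; the service name, image, and target of 10 concurrent requests per Pod are assumptions:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev   # use the Knative Pod Autoscaler
        autoscaling.knative.dev/metric: concurrency                  # scale on concurrent requests
        autoscaling.knative.dev/target: "10"                         # target 10 concurrent requests per Pod (assumption)
    spec:
      containers:
        - image: ghcr.io/knative/helloworld-go:latest                # stand-in image (assumption)
```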
"},{"location":"en/end-user/kpanda/scale/knative/playground.html#case-3-based-on-concurrent-elastic-scaling-scale-out-in-advance-to-reach-a-specific-ratio","title":"case 3 - Based on concurrent elastic scaling, scale out in advance to reach a specific ratio.","text":"
We can easily achieve this, for example, by limiting the concurrency to 10 per container. This can be implemented through autoscaling.knative.dev/target-utilization-percentage: 70, starting to scale out the Pods when 70% is reached.
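A sketch of the relevant fragment of the Knative Service template, assuming a hard limit of 10 concurrent requests per container:

```yaml
# Fragment of a Knative Service (a sketch)
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target-utilization-percentage: "70"   # start scaling out at 70% of the concurrency limit
    spec:
      containerConcurrency: 10      # hard limit of 10 concurrent requests per container (assumption)
      containers:
        - image: ghcr.io/knative/helloworld-go:latest                 # stand-in image (assumption)
```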
"},{"location":"en/end-user/kpanda/security/index.html","title":"Types of Security Scans","text":"
AI platform Container Management provides three types of security scans:
Compliance Scan: Conducts security scans on cluster nodes based on CIS Benchmark.
Authorization Scan: Checks for security and compliance issues in the Kubernetes cluster, records and verifies authorized access, object changes, events, and other activities related to the Kubernetes API.
Vulnerability Scan: Scans the Kubernetes cluster for potential vulnerabilities and risks, such as unauthorized access, sensitive information leakage, weak authentication, container escape, etc.
The object of compliance scanning is the cluster node. The scan result lists the scan items and results and provides repair suggestions for any failed scan items. For specific security rules used during scanning, refer to the CIS Kubernetes Benchmark.
The focus of the scan varies when checking different types of nodes.
Scan the control plane node (Controller)
Focus on the security of system components such as API Server , controller-manager , scheduler , kubelet , etc.
Check the security configuration of the Etcd database.
Verify whether the cluster's authentication mechanism, authorization policy, and network security configuration meet security standards.
Scan worker nodes
Check whether the configuration of the kubelet and container runtimes such as Docker meets security standards.
Verify whether the container image has been trusted and verified.
Check if the network security configuration of the node meets security standards.
Tip
To use compliance scanning, you need to create a scan configuration first, and then create a scan policy based on that configuration. After executing the scan policy, you can view the scan report.
Authorization scanning focuses on security vulnerabilities caused by authorization issues. Authorization scans help users identify security threats in Kubernetes clusters and identify which resources need further review and protection measures. By performing these checks, users can gain a clearer and more comprehensive understanding of their Kubernetes environment and ensure that the cluster environment meets Kubernetes' best practices and security standards.
Specifically, authorization scanning supports the following operations:
Scans the health status of all nodes in the cluster.
Scans the running state of components in the cluster, such as kube-apiserver , kube-controller-manager , kube-scheduler , etc.
API security: whether unsafe API versions are enabled, whether appropriate RBAC roles and permission restrictions are set, etc.
Container security: whether insecure images are used, whether privileged mode is enabled, whether appropriate security context is set, etc.
Network security: whether appropriate network policy is enabled to restrict traffic, whether TLS encryption is used, etc.
Storage security: whether appropriate encryption and access controls are enabled.
Application security: whether necessary security measures are in place, such as password management, cross-site scripting attack defense, etc.
Provides warnings and suggestions: Security best practices that cluster administrators should perform, such as regularly rotating certificates, using strong passwords, restricting network access, etc.
Tip
To use authorization scanning, you need to create a scan policy first. After executing the scan policy, you can view the scan report. For details, refer to Security Scanning.
Vulnerability scanning focuses on scanning potential malicious attacks and security vulnerabilities, such as remote code execution, SQL injection, XSS attacks, and some attacks specific to Kubernetes. The final scan report lists the security vulnerabilities in the cluster and provides repair suggestions.
Tip
To use vulnerability scanning, you need to create a scan policy first. After executing the scan policy, you can view the scan report. For details, refer to Vulnerability Scan.
To use the Permission Scan feature, you need to create a scan policy first. After executing the policy, a scan report will be automatically generated for viewing.
"},{"location":"en/end-user/kpanda/security/audit.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
On the left navigation bar of the homepage in the Container Management module, click Security Management .
Click Permission Scan on the left navigation bar, then click the Scan Policy tab and click Create Scan Policy on the right.
Fill in the configuration according to the following instructions, and then click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to keep. When the specified retention quantity is exceeded, the earliest reports are deleted first.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ⋮ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
One-time scan policies only support the Delete operation.
To use the Vulnerability Scan feature, you need to create a scan policy first. After executing the policy, a scan report will be automatically generated for viewing.
"},{"location":"en/end-user/kpanda/security/hunter.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
On the left navigation bar of the homepage in the Container Management module, click Security Management .
Click Vulnerability Scan on the left navigation bar, then click the Scan Policy tab and click Create Scan Policy on the right.
Fill in the configuration according to the following instructions, and then click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to keep. When the specified retention quantity is exceeded, the earliest reports are deleted first.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ⋮ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
One-time scan policies only support the Delete operation.
The first step in using CIS Scanning is to create a scan configuration. Based on the scan configuration, you can then create scan policies, execute scan policies, and finally view scan results.
"},{"location":"en/end-user/kpanda/security/cis/config.html#create-a-scan-configuration","title":"Create a Scan Configuration","text":"
The steps for creating a scan configuration are as follows:
Click Security Management in the left navigation bar of the homepage of the container management module.
The Compliance Scanning page is displayed by default. Click the Scan Configuration tab, and then click Create Scan Configuration in the upper-right corner.
Fill in the configuration name, select the configuration template, and optionally check the scan items, then click OK .
Scan Template: Currently, two templates are provided. The kubeadm template is suitable for general Kubernetes clusters. The daocloud template ignores scan items that are not applicable to AI platform based on the kubeadm template and the platform design of AI platform.
Under the scan configuration tab, clicking the name of a scan configuration displays the type of the configuration, the number of scan items, the creation time, the configuration template, and the specific scan items enabled for the configuration.
After a scan configuration has been successfully created, it can be updated or deleted according to your needs.
Under the scan configuration tab, click the ⋮ action button to the right of a configuration:
Select Edit to update the configuration. You can update the description, template, and scan items. The configuration name cannot be changed.
Select Delete to delete the configuration.
"},{"location":"en/end-user/kpanda/security/cis/policy.html","title":"Scan Policy","text":""},{"location":"en/end-user/kpanda/security/cis/policy.html#create-a-scan-policy","title":"Create a Scan Policy","text":"
After creating a scan configuration, you can create a scan policy based on the configuration.
Under the Security Management -> Compliance Scanning page, click the Scan Policy tab on the right to create a scan policy.
Fill in the configuration according to the following instructions and click OK .
Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the Container Management module. If the desired cluster is not available, you can access or create a cluster in the Container Management module.
Scan Configuration: Select a pre-created scan configuration. The scan configuration determines which specific scan items need to be performed.
Scan Type:
Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later.
Scheduled scan: Automatically repeat the scan at scheduled intervals.
Number of Scan Reports to Keep: Set the maximum number of scan reports to keep. When the specified retention quantity is exceeded, the earliest reports are deleted first.
After creating a scan policy, you can update or delete it as needed.
Under the Scan Policy tab, click the ⋮ action button to the right of a configuration:
For periodic scan policies:
Select Execute Immediately to perform an additional scan outside the regular schedule.
Select Disable to interrupt the scanning plan until Enable is clicked to resume executing the scan policy according to the scheduling plan.
Select Edit to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed.
Select Delete to delete the configuration.
One-time scan policies only support the Delete operation.
After executing a scan policy, a scan report will be generated automatically. You can view the scan report online or download it to your local computer.
Download and View
Under the Security Management -> Compliance Scanning page, click the Scan Report tab, then click the ⋮ action button to the right of a report and select Download .
View Online
Clicking the name of a report allows you to view its content online, which includes:
The target cluster scanned.
The scan policy and scan configuration used.
The start time of the scan.
The total number of scan items, the number passed, and the number failed.
For failed scan items, repair suggestions are provided.
For passed scan items, more secure operational suggestions are provided.
A data volume (PersistentVolume, PV) is a piece of storage in the cluster, which can be prepared in advance by the administrator, or dynamically prepared using a storage class (Storage Class). PV is a cluster resource, but it has an independent life cycle and will not be deleted when the Pod process ends. Mounting PVs to workloads can achieve data persistence for workloads. The PV holds the data directory that can be accessed by the containers in the Pod.
"},{"location":"en/end-user/kpanda/storage/pv.html#create-data-volume","title":"Create data volume","text":"
Currently, there are two ways to create data volumes: YAML and form. These two ways have their own advantages and disadvantages, and can meet the needs of different users.
Creating via YAML involves fewer steps and is more efficient, but it has a higher threshold: you need to be familiar with the YAML configuration of the data volume.
Creating via the form is more intuitive and easier: just fill in the proper values according to the prompts, but the steps are more cumbersome.
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
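As an illustration, a minimal PersistentVolume that could be pasted into the pop-up box might look like the following; the name, capacity, and hostPath directory are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-demo                 # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce             # mounted read-write by a single node
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
  hostPath:
    path: /data/pv-demo         # directory on the node's file system (assumption)
```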
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) -> Create Data Volume (PV) in the left navigation bar.
Fill in the basic information.
The data volume name, data volume type, mount path, volume mode, and node affinity cannot be changed after creation.
Data volume type: For a detailed introduction to volume types, refer to the official Kubernetes document Volumes.
Local: The node's local storage is exposed through the PVC interface, and the container uses the PVC directly without needing to care about the underlying storage type. Local volumes do not support dynamic provisioning, but they support node affinity configuration, which can limit which nodes can access the data volume.
HostPath: Uses files or directories on the node's file system as data volumes; Pod scheduling based on node affinity is not supported.
Mount path: mount the data volume to a specific directory in the container.
Access mode:
ReadWriteOnce: The data volume can be mounted by a node in read-write mode.
ReadWriteMany: The data volume can be mounted by multiple nodes in read-write mode.
ReadOnlyMany: The data volume can be mounted read-only by multiple nodes.
ReadWriteOncePod: The data volume can be mounted read-write by a single Pod.
Recycling policy:
Retain: The PV is not deleted; its status is only changed to released, and it needs to be manually reclaimed by the user. For how to manually reclaim it, refer to Persistent Volume.
Recycle: Keep the PV but empty its data by performing a basic wipe ( rm -rf /thevolume/* ).
Delete: The PV and its data are deleted together.
Volume mode:
File system: The data volume will be mounted to a certain directory by the Pod. If the data volume is stored from a device and the device is currently empty, a file system is created on the device before the volume is mounted for the first time.
Block: Use the data volume as a raw block device. This type of volume is given to the Pod as a block device without any file system on it, allowing the Pod to access the data volume faster.
Node affinity: Limits which nodes can access the data volume (required for local volumes); see the sketch after this list.
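Putting these form fields together, a local PV with node affinity might look like the following sketch; the volume name, disk path, and node name are hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-local-demo
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce             # mounted read-write by a single node
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
  local:
    path: /mnt/disks/ssd1       # local disk path on the node (assumption)
  nodeAffinity:                 # limits which nodes can access this local volume
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1   # hypothetical node name
```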
"},{"location":"en/end-user/kpanda/storage/pv.html#view-data-volume","title":"View data volume","text":"
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume (PV) in the left navigation bar.
On this page, you can view all data volumes in the current cluster, as well as information such as the status, capacity, and namespace of each data volume.
Supports sequential or reverse sorting according to the name, status, namespace, and creation time of data volumes.
Click the name of a data volume to view the basic configuration, StorageClass information, labels, comments, etc. of the data volume.
"},{"location":"en/end-user/kpanda/storage/pv.html#clone-data-volume","title":"Clone data volume","text":"
By cloning a data volume, a new data volume can be recreated based on the configuration of the cloned data volume.
Enter the clone page
On the data volume list page, find the data volume to be cloned, and select Clone under the operation bar on the right.
You can also click the name of the data volume, click the operation button in the upper right corner of the details page and select Clone .
Use the original configuration directly, or modify it as needed, and click OK at the bottom of the page.
"},{"location":"en/end-user/kpanda/storage/pv.html#update-data-volume","title":"Update data volume","text":"
There are two ways to update data volumes: via the form or via the YAML file.
Note
Only updating the alias, capacity, access mode, reclamation policy, label, and comment of the data volume is supported.
On the data volume list page, find the data volume that needs to be updated, then select Update under the operation bar on the right to update it through the form, or select Edit YAML to update it through YAML.
Alternatively, click the name of the data volume to enter its details page, then select Update in the upper right corner of the page to update it through the form, or select Edit YAML to update it through YAML.
"},{"location":"en/end-user/kpanda/storage/pv.html#delete-data-volume","title":"Delete data volume","text":"
On the data volume list page, find the data to be deleted, and select Delete in the operation column on the right.
You can also click the name of the data volume, click the operation button in the upper right corner of the details page and select Delete .
A persistent volume claim (PersistentVolumeClaim, PVC) expresses a user's request for storage. PVC consumes PV resources and claims a data volume with a specific size and specific access mode. For example, the PV volume is required to be mounted in ReadWriteOnce, ReadOnlyMany or ReadWriteMany modes.
"},{"location":"en/end-user/kpanda/storage/pvc.html#create-data-volume-statement","title":"Create data volume statement","text":"
Currently, there are two ways to create data volume declarations: YAML and form. These two ways have their own advantages and disadvantages, and can meet the needs of different users.
Creating via YAML involves fewer steps and is more efficient, but it has a higher threshold: you need to be familiar with the YAML configuration of the data volume declaration.
Creating via the form is more intuitive and easier: just fill in the proper values according to the prompts, but the steps are more cumbersome.
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) -> Create Data Volume Declaration (PVC) in the left navigation bar.
Fill in the basic information.
The name, namespace, creation method, data volume, capacity, and access mode of the data volume declaration cannot be changed after creation.
Creation method: dynamically create a new data volume claim in an existing StorageClass or data volume, or create a new data volume claim based on a snapshot of a data volume claim.
The declared capacity of the data volume cannot be modified when the snapshot is created, and can be modified after the creation is complete.
After selecting the creation method, select the desired StorageClass/data volume/snapshot from the drop-down list.
Access mode:
ReadWriteOnce, the data volume declaration can be mounted by a node in read-write mode.
ReadWriteMany, the data volume declaration can be mounted by multiple nodes in read-write mode.
ReadOnlyMany, the data volume declaration can be mounted read-only by multiple nodes.
ReadWriteOncePod, the data volume declaration can be mounted by a single Pod in read-write mode.
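As an illustration, a PVC that dynamically requests storage from a StorageClass might look like the following sketch; the claim name, StorageClass name, and capacity are assumptions:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
  namespace: default
spec:
  storageClassName: hwameistor-storage-lvm-hdd   # hypothetical StorageClass name
  accessModes:
    - ReadWriteOnce             # mounted read-write by a single node
  resources:
    requests:
      storage: 5Gi
```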
"},{"location":"en/end-user/kpanda/storage/pvc.html#view-data-volume-statement","title":"View data volume statement","text":"
Click the name of the target cluster in the cluster list, and then click Container Storage -> Data Volume Declaration (PVC) in the left navigation bar.
On this page, you can view all data volume declarations in the current cluster, as well as information such as the status, capacity, and namespace of each data volume declaration.
Supports sorting in sequential or reverse order according to the declared name, status, namespace, and creation time of the data volume.
Click the name of the data volume declaration to view the basic configuration, StorageClass information, labels, comments and other information of the data volume declaration.
"},{"location":"en/end-user/kpanda/storage/pvc.html#expansion-data-volume-statement","title":"Expansion data volume statement","text":"
In the left navigation bar, click Container Storage -> Data Volume Declaration (PVC) , and find the data volume declaration whose capacity you want to adjust.
Click the name of the data volume declaration, and then click the operation button in the upper right corner of the page and select Expansion .
Enter the target capacity and click OK .
"},{"location":"en/end-user/kpanda/storage/pvc.html#clone-data-volume-statement","title":"Clone data volume statement","text":"
By cloning a data volume claim, a new data volume claim can be recreated based on the configuration of the cloned data volume claim.
Enter the clone page
On the data volume declaration list page, find the data volume declaration that needs to be cloned, and select Clone under the operation bar on the right.
You can also click the name of the data volume declaration, click the operation button in the upper right corner of the details page and select Clone .
Use the original configuration directly, or modify it as needed, and click OK at the bottom of the page.
"},{"location":"en/end-user/kpanda/storage/pvc.html#update-data-volume-statement","title":"Update data volume statement","text":"
There are two ways to update data volume claims: via the form or via the YAML file.
Note
Only the alias, labels, and annotations of a data volume claim can be updated.
On the data volume declaration list page, find the data volume declaration that needs to be updated, then select Update in the operation bar on the right to update it through the form, or select Edit YAML to update it through YAML.
Alternatively, click the name of the data volume declaration to enter its details page, then select Update in the upper right corner of the page to update it through the form, or select Edit YAML to update it through YAML.
"},{"location":"en/end-user/kpanda/storage/pvc.html#delete-data-volume-statement","title":"Delete data volume statement","text":"
On the data volume declaration list page, find the data to be deleted, and select Delete in the operation column on the right.
You can also click the name of the data volume statement, click the operation button in the upper right corner of the details page and select Delete .
If there is no optional StorageClass or data volume in the list, you can Create a StorageClass or Create a data volume.
If there is no optional snapshot in the list, you can enter the details page of the data volume declaration and create a snapshot in the upper right corner.
If the StorageClass (SC) used by the data volume declaration does not have snapshots enabled, snapshots cannot be made, and the page will not display the "Make Snapshot" option.
If the StorageClass (SC) used by the data volume declaration does not have the capacity expansion feature enabled, the data volume does not support capacity expansion, and the page will not display the capacity expansion option.
A StorageClass refers to a large storage resource pool composed of many physical disks. This platform supports the creation of block StorageClass, local StorageClass, and custom StorageClass after accessing various storage vendors, and then dynamically configures data volumes for workloads.
Currently, StorageClasses can be created via YAML or via the form. These two methods have their own advantages and disadvantages and can meet the needs of different users.
Creating via YAML involves fewer steps and is more efficient, but it has a higher threshold: you need to be familiar with the YAML configuration of the StorageClass.
Creating via the form is more intuitive and easier: just fill in the proper values according to the prompts, but the steps are more cumbersome.
Click the name of the target cluster in the cluster list, and then click Container Storage -> StorageClass (SC) -> Create with YAML in the left navigation bar.
Enter or paste the prepared YAML file in the pop-up box, and click OK at the bottom of the pop-up box.
Supports importing YAML files from local or downloading and saving filled files to local.
Click the name of the target cluster in the cluster list, and then click Container Storage -> StorageClass (SC) -> Create StorageClass (SC) in the left navigation bar.
Fill in the basic information and click OK at the bottom.
Custom storage system
The StorageClass name, driver, and reclamation policy cannot be modified after creation.
CSI storage driver: A standard Kubernetes-based container storage interface plug-in, which must comply with the format specified by the storage manufacturer, such as rancher.io/local-path .
For how to fill in the CSI drivers provided by different vendors, refer to the official Kubernetes document Storage Class.
Recycling policy: When deleting a data volume, keep the data in the data volume or delete the data in it.
Snapshot/Expansion: After it is enabled, the data volume/data volume declaration based on the StorageClass can support the expansion and snapshot features, but the premise is that the underlying storage driver supports the snapshot and expansion features.
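As an illustration, a custom StorageClass using the rancher.io/local-path driver mentioned above might look like the following sketch; the name, reclaim policy, and binding mode are assumptions:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: sc-local-path               # hypothetical name
provisioner: rancher.io/local-path  # driver in the format specified by the storage vendor
reclaimPolicy: Delete               # delete the data when the data volume is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true          # enables expansion if the underlying driver supports it
```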
HwameiStor storage system
The StorageClass name, driver, and reclamation policy cannot be modified after creation.
Storage system: HwameiStor storage system.
Storage type: Supports LVM and raw disk types.
LVM type: The usage method recommended by HwameiStor, which supports highly available data volumes; the corresponding CSI storage driver is lvm.hwameistor.io .
Raw disk data volume: Suitable for scenarios that do not require high availability, as it has no high availability capability; the corresponding CSI driver is hdd.hwameistor.io .
High Availability Mode: Before using the high availability capability, please make sure the DRBD component has been installed. After high availability mode is turned on, the number of data volume replicas can be set to 1 or 2, and a replica count of 1 can be converted to 2 if needed.
Recycling policy: When deleting a data volume, keep the data in the data volume or delete the data in it.
Snapshot/Expansion: After it is enabled, the data volume/data volume declaration based on the StorageClass can support the expansion and snapshot features, but the premise is that the underlying storage driver supports the snapshot and expansion features.
On the StorageClass list page, find the StorageClass that needs to be updated, and select Edit under the operation bar on the right to update the StorageClass.
Info
Select View YAML to view the YAML file of the StorageClass, but editing is not supported.
This page introduces how to create a CronJob through images and YAML files.
CronJobs are suitable for performing periodic operations, such as backup and report generation. These jobs can be configured to repeat periodically (for example: daily/weekly/monthly), and the time interval at which the job starts to run can be defined.
Before creating a CronJob, the following prerequisites need to be met:
In the Container Management module Integrate Kubernetes Cluster or Create Kubernetes Cluster, and can access the cluster UI interface.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/end-user/kpanda/workloads/create-cronjob.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a CronJob using the image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> CronJobs in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings, CronJob Settings, Advanced Configuration, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the CronJobs list. Click ⋮ on the right side of the list to perform operations such as updating, deleting, and restarting the CronJob.
On the Create CronJobs page, enter the information according to the table below, and click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and the separator ("-"), and must start and end with a lowercase letter or number. Workloads of the same type in the same namespace cannot have the same name, and the name cannot be changed after the workload is created.
Namespace: Select which namespace to deploy the newly created CronJob in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container setting is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the configuration with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull the image , the image is pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image is used, and the image is pulled from the container registry only when it does not exist locally. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports setting exclusive use of an entire GPU or part of a vGPU for the container. For example, for an 8-core GPU, enter 8 to let the container exclusively use the entire GPU, or enter 1 to configure a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plug-in on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass configuration to the Pod, etc. For details, refer to Container environment variable configuration.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Concurrency Policy: Whether to allow multiple Job jobs to run in parallel.
Allow : A new CronJob can be created before the previous job is completed, and multiple jobs can be parallelized. Too many jobs may occupy cluster resources.
Forbid : Before the previous job is completed, a new job cannot be created. If the execution time of the new job is up and the previous job has not been completed, CronJob will ignore the execution of the new job.
Replace : If the execution time of the new job is up, but the previous job has not been completed, the new job will replace the previous job.
The above rules only apply to multiple jobs created by the same CronJob. Multiple jobs created by multiple CronJobs are always allowed to run concurrently.
Policy Settings: Set the time period for job execution in minutes, hours, days, weeks, and months. Custom Cron expressions with numbers and * are supported; after entering an expression, its meaning will be shown as a prompt. For detailed expression syntax rules, refer to Cron Schedule Syntax.
Job Records: Set how many records of successful or failed jobs to keep. 0 means do not keep.
Timeout: When this time is exceeded, the job will be marked as failed to execute, and all Pods under the job will be deleted. When it is empty, it means that no timeout is set. The default is 360 s.
Retries: the number of times the job can be retried, the default value is 6.
Restart Policy: Set whether to restart the Pod when the job fails.
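Taken together, these settings correspond to a standard Kubernetes CronJob object. The following is a hedged sketch; the job name, schedule, image, and command are assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-job                    # hypothetical name
  namespace: default
spec:
  schedule: "0 2 * * *"               # policy settings: run at 02:00 every day
  concurrencyPolicy: Forbid           # do not start a new job before the previous one finishes
  successfulJobsHistoryLimit: 3       # job records: successful jobs to keep
  failedJobsHistoryLimit: 1           # job records: failed jobs to keep
  jobTemplate:
    spec:
      activeDeadlineSeconds: 360      # timeout: mark the job as failed after 360 s
      backoffLimit: 6                 # retries
      template:
        spec:
          restartPolicy: Never        # restart policy when the job fails
          containers:
            - name: backup
              image: busybox:1.36     # stand-in image (assumption)
              command: ["sh", "-c", "echo backup && sleep 10"]
```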
The advanced configuration of CronJobs mainly involves labels and annotations.
You can click the Add button to add labels and annotations to the workload instance Pod.
"},{"location":"en/end-user/kpanda/workloads/create-cronjob.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating from an image, you can also create CronJobs more quickly through YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> CronJobs in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
This page introduces how to create a daemonSet through image and YAML files.
A DaemonSet ensures, through node affinity and taint tolerations, that a replica of a Pod runs on all (or some) nodes. For nodes newly joined to the cluster, the DaemonSet automatically deploys the proper Pod on the new node and tracks the running status of the Pod. When a node is removed, the DaemonSet deletes all Pods it created.
Common cases for daemons include:
Run cluster daemons on each node.
Run a log collection daemon on each node.
Run a monitoring daemon on each node.
For simplicity, one DaemonSet can be started on each node for each type of daemon. For finer and more advanced daemon management, you can also deploy multiple DaemonSets for the same daemon; each DaemonSet can have different flags and different memory and CPU requirements for different hardware types.
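As an illustration, a DaemonSet that runs a log collection daemon on every node might look like the following sketch; the name, image, toleration, and resource values are assumptions:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector              # hypothetical name
  namespace: default
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      tolerations:                 # optionally run on control plane nodes as well (assumption)
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: log-collector
          image: fluent/fluentd:v1.16-1   # stand-in log collector image (assumption)
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
            limits:
              memory: 200Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
      volumes:
        - name: varlog
          hostPath:
            path: /var/log         # collect logs from the node's file system
```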
Before creating a DaemonSet, the following prerequisites need to be met:
In the Container Management module Integrate Kubernetes Cluster or Create Kubernetes Cluster, and can access the cluster UI interface.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/end-user/kpanda/workloads/create-daemonset.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a DaemonSet using an image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> DaemonSets in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings, Service Settings, Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the list of DaemonSets. Click ⋮ on the right side of the list to perform operations such as updating, deleting, and restarting the DaemonSet.
On the Create DaemonSets page, after entering the information according to the table below, click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and the separator ("-"), and must start and end with a lowercase letter or number. Workloads of the same type in the same namespace cannot have the same name, and the name cannot be changed after the workload is created.
Namespace: Select which namespace to deploy the newly created DaemonSet in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container setting is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image is pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image is used, and the image is pulled from the container registry only when it does not exist locally. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host.
CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure.
GPU Exclusive: Configure the GPU usage for the container, only positive integers are supported. The GPU quota setting supports setting exclusive use of the entire GPU or part of the vGPU for the container. For example, for an 8-core GPU, enter the number 8 to let the container exclusively use the entire length of the card, and enter the number 1 to configure a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plug-in on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced settings include four parts: network settings, upgrade policy, scheduling policy, and labels and annotations. You can click the tabs below to view the requirements of each part.
Network Configuration | Upgrade Policy | Scheduling Policies | Labels and Annotations
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related settings options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: Make the container use the domain name resolution file pointed to by the --resolv-conf parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: The application uses the domain name resolution file of the host it runs on.
ClusterFirst: The application uses Kube-DNS/CoreDNS for domain name resolution.
None: An option introduced in Kubernetes v1.9 (Beta in v1.10). After setting the policy to None, dnsConfig must be set; the container's domain name resolution file is then generated entirely from the dnsConfig settings.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Configuration options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig options conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
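For reference, the DNS options above map to the Pod spec roughly as in the following minimal sketch. The Pod name, nameserver address, search domains, and host alias entry are illustrative assumptions, not values required by the platform.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-example                # hypothetical Pod name
spec:
  dnsPolicy: "None"                # with None, resolution comes entirely from dnsConfig
  dnsConfig:
    nameservers:
      - 10.6.175.20                # domain name server address
    searches:
      - ns1.svc.cluster.local      # example search domains (up to 6 allowed)
      - my.company.internal
    options:
      - name: ndots                # each option has a required name and an optional value
        value: "2"
  hostAliases:
    - ip: "10.6.175.20"            # host alias: extra entry written to the Pod's /etc/hosts
      hostnames:
        - "example.local"
  containers:
    - name: app
      image: nginx
```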
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Max Unavailable Pods: Specify the maximum value or ratio of unavailable pods during the workload update process, the default is 25%. If it is equal to the number of instances, there is a risk of service interruption.
Max Surge: The maximum or ratio of the total number of Pods exceeding the desired replica count of Pods during a Pod update. Default is 25%.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Minimum Ready: The minimum time for a Pod to be ready. Only after this time is the Pod considered available. The default is 0 seconds.
Upgrade Max Duration: If the deployment is not successful after the set time, the workload will be marked as failed. Default is 600 seconds.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds.
Node affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
Topology domain: that is, topologyKey, used to specify a group of nodes that can be scheduled to. For example, kubernetes.io/os means that as long as a node with that operating system label meets the labelSelector conditions, Pods can be scheduled to it.
For details, refer to Scheduling Policy.
You can click the Add button to add tags and annotations to workloads and pods.
"},{"location":"en/end-user/kpanda/workloads/create-daemonset.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating from an image, you can also create DaemonSets more quickly from YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> DaemonSets in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a DaemonSet
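The built-in example is shown in the console when you click the link above. For reference, a minimal DaemonSet manifest could look like the following sketch; the name, labels, image, and resource values are placeholders.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-log-agent            # hypothetical name
  namespace: default
  labels:
    app: node-log-agent
spec:
  selector:
    matchLabels:
      app: node-log-agent
  template:
    metadata:
      labels:
        app: node-log-agent
    spec:
      containers:
        - name: agent
          image: fluentd:v1.16    # placeholder image; one Pod runs on every schedulable node
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
            limits:
              memory: 200Mi
```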
This page describes how to create deployments through images and YAML files.
Deployment is a common resource in Kubernetes, mainly used to provide declarative updates for Pods and ReplicaSets. It supports elastic scaling, rolling upgrades, and version rollback. Declare the desired Pod state in the Deployment, and the Deployment Controller will modify the current state through the ReplicaSet until it reaches the declared desired state. Deployment is stateless and does not support data persistence; it is suitable for deploying stateless applications that do not need to save data and can be restarted and rolled back at any time.
Through the container management module of AI platform, workloads across multiple clouds and clusters can be easily managed based on proper role permissions, including full lifecycle management of deployments such as creation, update, deletion, elastic scaling, restart, and version rollback.
Before using image to create deployments, the following prerequisites need to be met:
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and make sure that the cluster UI can be accessed.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/end-user/kpanda/workloads/create-deployment.html#create-by-image","title":"Create by image","text":"
Follow the steps below to create a deployment by image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> Deployments in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Setting, Service Setting, Advanced Setting in turn, click OK in the lower right corner of the page to complete the creation.
The system will automatically return the list of Deployments. Click ┇ on the right side of the list to perform operations such as update, delete, elastic scaling, restart, and version rollback on the workload. If the workload status is abnormal, please check the specific error message and refer to Workload Status.
Workload Name: can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number, such as deployment-01. Workloads of the same type in the same namespace cannot have duplicate names, and the name of the workload cannot be changed after it is created.
Namespace: Select the namespace where the newly created workload will be deployed. The default namespace is used by default. If you can't find the desired namespace, you can go to create a new namespace according to the prompt on the page.
Pods: Enter the number of Pod instances for the workload; one Pod instance is created by default.
Description: Enter the description information of the workload and customize the content. The number of characters cannot exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container setting is only configured for a single container. To add multiple containers to a pod, click + on the right to add multiple containers.
When configuring container-related parameters, it is essential to correctly fill in the container name and image parameters; otherwise, you will not be able to proceed to the next step. After filling in the configuration according to the following requirements, click OK.
Container Type: The default is Work Container. For information on init containers, see the [K8s Official Documentation](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/).
Container Name: No more than 63 characters, supporting lowercase letters, numbers, and separators ("-"). It must start and end with a lowercase letter or number, for example, nginx-01.
Image:
Image: Select an appropriate image from the list. When entering the image name, the default is to pull the image from the official DockerHub.
Image Version: Select an appropriate version from the dropdown list.
Image Pull Policy: By checking Always pull the image, the image will be pulled from the repository each time the workload restarts/upgrades. If unchecked, it will only pull the local image, and will pull from the repository only if the image does not exist locally. For more details, refer to Image Pull Policy.
Registry Secret: Optional. If the target repository requires a Secret to access, you need to create secret first.
Privileged Container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and has all the privileges of running processes on the host.
CPU/Memory Request: The request value (the minimum resource needed) and the limit value (the maximum resource allowed) for CPU/memory resources. Configure resources for the container as needed to avoid resource waste and system failures caused by container resource overages. Default values are shown in the figure.
GPU Configuration: Configure GPU usage for the container, supporting only positive integers. The GPU quota setting supports configuring the container to exclusively use an entire GPU or part of a vGPU. For example, for a GPU with 8 cores, entering 8 means the container exclusively uses the entire GPU, and entering 1 means configuring 1 core of vGPU for the container (see the YAML sketch after these settings for how the quotas map to the container spec).
Before setting the GPU, the administrator needs to pre-install the GPU and driver plugin on the cluster node and enable the GPU feature in the Cluster Settings.
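As a rough illustration of how the request, limit, and GPU quota fields above end up in the container spec, consider the following minimal sketch. The values are arbitrary, and the GPU resource name (shown commented out) depends on the GPU vendor and device plugin installed in the cluster, so treat it as an assumption rather than the exact key used by the platform.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: quota-demo                 # hypothetical name
spec:
  containers:
    - name: nginx-01
      image: nginx:1.25
      resources:
        requests:                  # request: minimum resources reserved for the container
          cpu: "250m"
          memory: 256Mi
        limits:                    # limit: maximum resources the container may use
          cpu: "500m"
          memory: 512Mi
          # nvidia.com/gpu: 1      # GPU quota; the actual resource name depends on the
          #                        # installed GPU device plugin (vendor-specific)
```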
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Setting.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check Setting.
Configure container parameters within the Pod, add environment variables or pass setting to the Pod, etc. For details, refer to Container environment variable setting.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Setting.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes four parts: Network Settings, Upgrade Policy, Scheduling Policies, Labels and Annotations. You can click the tabs below to view the setting requirements of each part.
Network Settings | Upgrade Policy | Scheduling Policies | Labels and Annotations
For container NIC setting, refer to Workload Usage IP Pool
DNS setting
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related setting options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: Make the container use the domain name resolution file pointed to by kubelet's --resolv-conf parameter. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: The application uses the domain name resolution file of the host it runs on.
ClusterFirst: The application uses Kube-DNS/CoreDNS for domain name resolution.
None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the setting of dnsConfig.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Setting options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig options conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Max Unavailable: Specify the maximum value or ratio of unavailable pods during the workload update process, the default is 25%. If it is equal to the number of instances, there is a risk of service interruption.
Max Surge: The maximum or ratio of the total number of Pods exceeding the desired replica count of Pods during a Pod update. Default is 25%.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Minimum Ready: The minimum time for a Pod to be ready. Only after this time is the Pod considered available. The default is 0 seconds.
Upgrade Max Duration: If the deployment is not successful after the set time, the workload will be marked as failed. Default is 600 seconds.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
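For reference, the upgrade-policy fields above correspond roughly to the following Deployment fields. This is a minimal sketch with placeholder names and images; a rebuild upgrade would use type: Recreate instead of RollingUpdate.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment               # hypothetical name
spec:
  replicas: 4
  revisionHistoryLimit: 10             # old revisions kept for rollback
  minReadySeconds: 0                   # minimum ready time
  progressDeadlineSeconds: 600         # upgrade max duration before the rollout is marked failed
  strategy:
    type: RollingUpdate                # use Recreate for rebuild and upgrade
    rollingUpdate:
      maxUnavailable: 25%              # max unavailable Pods during the update
      maxSurge: 25%                    # max extra Pods beyond the desired replica count
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 30   # graceful period before stop
      containers:
        - name: nginx
          image: nginx:1.25
```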
Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds.
Node Affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload Anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
For details, refer to Scheduling Policy.
You can click the Add button to add tags and annotations to workloads and pods.
"},{"location":"en/end-user/kpanda/workloads/create-deployment.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating from an image, you can also create deployments more quickly from YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> Deployments in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a deployment
This page introduces how to create a job through image and YAML file.
Job is suitable for performing one-time jobs. A Job creates one or more Pods, and the Job keeps retrying to run Pods until a certain number of Pods are successfully terminated. A Job ends when the specified number of Pods are successfully terminated. When a Job is deleted, all Pods created by the Job will be cleared. When a Job is paused, all active Pods in the Job are deleted until the Job is resumed. For more information about jobs, refer to Job.
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and make sure that the cluster UI can be accessed.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/end-user/kpanda/workloads/create-job.html#create-by-image","title":"Create by image","text":"
Refer to the following steps to create a job using an image.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> Jobs in the left navigation bar, and then click the Create by Image button in the upper right corner of the page.
Fill in Basic Information, Container Settings and Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the job list. Click ┇ on the right side of the list to perform operations such as updating, deleting, and restarting the job.
On the Create Jobs page, enter the basic information according to the table below, and click Next .
Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. Workloads of the same type in the same namespace cannot have duplicate names, and the name of the workload cannot be changed after it is created.
Namespace: Select which namespace to deploy the newly created job in, and the default namespace is used by default. If you can't find the desired namespace, you can go to Create a new namespace according to the prompt on the page.
Number of Instances: Enter the number of Pod instances for the workload. By default, 1 Pod instance is created.
Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the setting requirements of each part.
Container settings are configured for a single container only. To add multiple containers to a pod, click + on the right to add more containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers, and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image is pulled from the registry every time the workload restarts or upgrades. If it is not checked, the local image is used first, and the image is re-pulled from the container registry only when it does not exist locally. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling privileged mode, the container can access all devices on the host and has all the privileges of processes running on the host.
CPU/Memory Quota: Requested value (minimum resources to be used) and limit value (maximum resources allowed) of CPU/Memory. Configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resource usage. The default values are shown on the page.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports exclusive use of an entire GPU or part of a vGPU by the container. For example, for an 8-core GPU, entering 8 lets the container use the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plug-in on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle settings.
It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to Container Health Check settings.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage settings.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced setting includes job settings, labels and annotations.
Job Settings | Labels and Annotations
Parallel Pods: the maximum number of Pods that can be created at the same time during job execution, and the parallel number should not be greater than the total number of Pods. Default is 1.
Timeout: When this time is exceeded, the job will be marked as failed to execute, and all Pods under the job will be deleted. When it is empty, it means that no timeout is set.
Restart Policy: Whether to restart the Pod when it fails to run.
You can click the Add button to add labels and annotations to the workload instance Pod.
"},{"location":"en/end-user/kpanda/workloads/create-job.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating from an image, jobs can also be created more quickly from YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the cluster details page.
On the cluster details page, click Workloads -> Jobs in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
This page describes how to create a StatefulSet through image and YAML files.
StatefulSet, like Deployment, is a common resource in Kubernetes, mainly used to manage the deployment and scaling of a set of Pods. The main difference between the two is that Deployment is stateless and does not save data, while StatefulSet is stateful and is mainly used to manage stateful applications. In addition, Pods in a StatefulSet have a persistent ID, which makes it easy to match the proper Pod to its storage volumes.
Through the container management module of AI platform, workloads across multiple clouds and clusters can be easily managed based on proper role permissions, including full lifecycle management of StatefulSets such as creation, update, deletion, elastic scaling, restart, and version rollback.
Before using image to create StatefulSets, the following prerequisites need to be met:
In the Container Management module, Integrate Kubernetes Cluster or Create Kubernetes Cluster, and make sure that the cluster UI can be accessed.
Create a namespace and a user.
The current operating user should have NS Editor or higher permissions, for details, refer to Namespace Authorization.
When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail.
"},{"location":"en/end-user/kpanda/workloads/create-statefulset.html#create-by-image","title":"Create by image","text":"
Follow the steps below to create a StatefulSet using an image.
Click Clusters on the left navigation bar, then click the name of the target cluster to enter Cluster Details.
Click Workloads -> StatefulSets in the left navigation bar, and then click the Create by Image button in the upper right corner.
Fill in Basic Information, Container Settings, Service Settings, Advanced Settings, click OK in the lower right corner of the page to complete the creation.
The system will automatically return to the list of StatefulSets , and wait for the status of the workload to become running . If the workload status is abnormal, refer to Workload Status for specific exception information.
Click ┇ on the right side of the workload in the list to perform operations such as update, delete, elastic scaling, restart, and version rollback on the workload.
Workload Name: can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number, such as deployment-01. Workloads of the same type in the same namespace cannot have duplicate names, and the name of the workload cannot be changed after it is created.
Namespace: Select the namespace where the newly created workload will be deployed. The default namespace is used by default. If you can't find the desired namespace, you can go to create a new namespace according to the prompt on the page.
Pods: Enter the number of Pod instances for the workload; one Pod instance is created by default.
Description: Enter the description information of the workload and customize the content. The number of characters cannot exceed 512.
Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
Container settings are configured for a single container only. To add multiple containers to a pod, click + on the right to add more containers.
When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click OK .
Container Name: Up to 63 characters; lowercase letters, numbers, and separators ("-") are supported. Must start and end with a lowercase letter or number, e.g. nginx-01.
Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official DockerHub by default.
Image Pull Policy: After checking Always pull image , the image is pulled from the registry every time the workload restarts or upgrades. If it is not checked, the local image is used first, and the image is re-pulled from the container registry only when it does not exist locally. For more details, refer to Image Pull Policy.
Privileged container: By default, the container cannot access any device on the host. After enabling privileged mode, the container can access all devices on the host and has all the privileges of processes running on the host.
CPU/Memory Quota: Requested value (minimum resources to be used) and limit value (maximum resources allowed) of CPU/Memory. Configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resource usage. The default values are shown on the page.
GPU Exclusive: Configure the GPU usage for the container; only positive integers are supported. The GPU quota setting supports exclusive use of an entire GPU or part of a vGPU by the container. For example, for an 8-core GPU, entering 8 lets the container use the entire GPU exclusively, and entering 1 configures a 1-core vGPU for the container.
Before setting exclusive GPU, the administrator needs to install the GPU and driver plug-in on the cluster nodes in advance, and enable the GPU feature in Cluster Settings.
Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to Container Lifecycle Configuration.
It is used to judge the health status of containers and applications, which helps to improve application availability. For details, refer to Container Health Check Configuration.
Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to Container environment variable settings.
Configure the settings for container mounting data volumes and data persistence. For details, refer to Container Data Storage Configuration.
Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter 0 to use the privileges of the root account.
Advanced settings include five parts: network settings, upgrade policy, container management policy, scheduling policy, and labels and annotations. You can click the tabs below to view the requirements of each part.
Network Configuration | Upgrade Policy | Container Management Policies | Scheduling Policies | Labels and Annotations
For container NIC settings, refer to Workload Usage IP Pool
DNS settings
In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related settings options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
DNS Policy
Default: Make the container use the domain name resolution file pointed to by the --resolv-conf parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
ClusterFirstWithHostNet: The application uses the domain name resolution file of the host it runs on.
ClusterFirst: The application uses Kube-DNS/CoreDNS for domain name resolution.
None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the settings of dnsConfig.
Nameservers: fill in the address of the domain name server, such as 10.6.175.20 .
Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
Options: Configuration options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig options conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
Host Alias: the alias set for the host.
Upgrade Mode: Rolling upgrade refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. Rebuild and upgrade refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
Kubernetes v1.7 and later versions can set Pod management policies through .spec.podManagementPolicy , which supports the following two methods:
OrderedReady : The default Pod management policy, which means that Pods are deployed in order. Only after the deployment of the previous Pod is successfully completed, the statefulset will start to deploy the next Pod. Pods are deleted in reverse order, with the last created being deleted first.
Parallel : Create or delete containers in parallel, just like Pods of the Deployment type. The StatefulSet controller starts or terminates all containers in parallel. There is no need to wait for a Pod to enter the Running and ready state or to stop completely before starting or terminating other Pods. This option only affects the behavior of scaling operations, not the order of updates.
Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes; the default is 300 seconds.
Node affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on.
Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node.
Workload anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node.
Topology domain: that is, topologyKey, used to specify a group of nodes that can be scheduled to. For example, kubernetes.io/os means that as long as a node with that operating system label meets the labelSelector conditions, Pods can be scheduled to it.
For details, refer to Scheduling Policy.
You can click the Add button to add tags and annotations to workloads and pods.
"},{"location":"en/end-user/kpanda/workloads/create-statefulset.html#create-from-yaml","title":"Create from YAML","text":"
In addition to creating from an image, you can also create StatefulSets more quickly from YAML files.
Click Clusters on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page.
On the cluster details page, click Workloads -> StatefulSets in the left navigation bar, and then click the Create from YAML button in the upper right corner of the page.
Enter or paste the YAML file prepared in advance, click OK to complete the creation.
Click to see an example YAML for creating a StatefulSet
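The built-in example is shown in the console when you click the link above. For reference, a minimal StatefulSet manifest could look like the following sketch; the name, labels, image, and storage size are placeholders, and a matching headless Service named nginx is assumed to exist.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                            # hypothetical name
spec:
  serviceName: "nginx"                 # assumed headless Service
  replicas: 3
  podManagementPolicy: OrderedReady    # or Parallel, see the policy description above
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:                # each Pod gets its own PersistentVolumeClaim
    - metadata:
        name: www
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```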
An environment variable refers to a variable set in the container running environment, which is used to add environment flags to Pods or transfer configurations, etc. It supports configuring environment variables for Pods in the form of key-value pairs.
Suanova container management adds a graphical interface to configure environment variables for Pods on the basis of native Kubernetes, and supports the following configuration methods:
Key-value pair (Key/Value Pair): Use a custom key-value pair as the environment variable of the container
Resource reference (Resource): Use a field defined by the container, such as the container's memory limit or CPU limit, as the value of an environment variable.
Variable/Variable Reference (Pod Field): Use the Pod field as the value of an environment variable, such as the name of the Pod
ConfigMap key value import (ConfigMap key): Import the value of a key in the ConfigMap as the value of an environment variable
Secret key value import (Secret Key): Use the value of a key in a Secret as the value of an environment variable.
Secret import (Secret): Import all key-value pairs in a Secret as environment variables.
ConfigMap import (ConfigMap): Import all key-value pairs in a ConfigMap as environment variables.
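For reference, the configuration methods above correspond to the following Pod spec fields. This is a minimal sketch; the Pod name and the referenced ConfigMap (app-config) and Secret (db-secret) are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: env-demo                          # hypothetical name
spec:
  containers:
    - name: app
      image: nginx
      env:
        - name: GREETING                  # key-value pair
          value: "hello"
        - name: POD_NAME                  # variable reference (Pod field)
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MEM_LIMIT                 # resource reference (container field)
          valueFrom:
            resourceFieldRef:
              containerName: app
              resource: limits.memory
        - name: APP_MODE                  # ConfigMap key value import
          valueFrom:
            configMapKeyRef:
              name: app-config            # assumed ConfigMap
              key: mode
        - name: DB_PASSWORD               # Secret key value import
          valueFrom:
            secretKeyRef:
              name: db-secret             # assumed Secret
              key: password
      envFrom:
        - configMapRef:
            name: app-config              # ConfigMap import: all key-value pairs
        - secretRef:
            name: db-secret               # Secret import: all key-value pairs
```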
"},{"location":"en/end-user/kpanda/workloads/pod-config/health-check.html","title":"Container health check","text":"
Container health check checks the health status of containers according to user requirements. After configuration, if the application in the container is abnormal, the container will automatically restart and recover. Kubernetes provides Liveness checks, Readiness checks, and Startup checks.
LivenessProbe can detect application deadlock (the application is running, but cannot continue to run the following steps). Restarting containers in this state can help improve the availability of applications, even if there are bugs in them.
ReadinessProbe can detect when a container is ready to accept request traffic. A Pod can only be considered ready when all containers in a Pod are ready. One use of this signal is to control which Pod is used as the backend of the Service. If the Pod is not ready, it will be removed from the Service's load balancer.
Startup check (StartupProbe) can know when the application container is started. After configuration, it can control the container to check the viability and readiness after it starts successfully, so as to ensure that these liveness and readiness probes will not affect the start of the application. Startup detection can be used to perform liveness checks on slow-starting containers, preventing them from being killed before they start running.
"},{"location":"en/end-user/kpanda/workloads/pod-config/health-check.html#liveness-and-readiness-checks","title":"Liveness and readiness checks","text":"
The configuration of ReadinessProbe is similar to that of LivenessProbe; the only difference is to use the readinessProbe field instead of the livenessProbe field.
HTTP GET parameter description:
| Parameter | Description |
| --- | --- |
| Path | The request path to be accessed, such as the /healthz path in the example. |
| Port | The service listening port, such as port 8080 in the example. |
| Protocol | The access protocol, HTTP or HTTPS. |
| Delay time (initialDelaySeconds) | Delay before the first check, in seconds. This setting is related to the normal startup time of the business program. For example, if it is set to 30, the health check starts 30 seconds after the container starts, which is the time reserved for the business program to start. |
| Timeout (timeoutSeconds) | Timeout, in seconds. For example, if it is set to 10, the health check must complete within 10 seconds; otherwise it is regarded as failed. If set to 0 or not set, the default timeout is 1 second. |
| Success threshold (successThreshold) | The minimum number of consecutive successes required for the probe to be considered successful after a failure. The default value is 1 and the minimum value is 1. This value must be 1 for liveness and startup probes. |
| Maximum number of failures (failureThreshold) | The number of retries when the probe fails. For a liveness probe, giving up means restarting the container; Pods abandoned due to readiness probes are marked as not ready. The default value is 3 and the minimum value is 1. |
"},{"location":"en/end-user/kpanda/workloads/pod-config/health-check.html#check-with-http-get-request","title":"Check with HTTP GET request","text":"
YAML example:
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
    - name: liveness                   # Container name
      image: k8s.gcr.io/liveness       # Container image
      args:
        - /server                      # Arguments to pass to the container
      livenessProbe:
        httpGet:
          path: /healthz               # Access request path
          port: 8080                   # Service listening port
          httpHeaders:
            - name: Custom-Header      # Custom header name
              value: Awesome           # Custom header value
        initialDelaySeconds: 3         # Wait 3 seconds before the first probe
        periodSeconds: 3               # Perform liveness detection every 3 seconds
```
According to the set rules, Kubelet sends an HTTP GET request to the service running in the container (the service is listening on port 8080) to perform the detection. The kubelet considers the container alive if the handler under the /healthz path on the server returns a success code. If the handler returns a failure code, the kubelet kills the container and restarts it. Any return code greater than or equal to 200 and less than 400 indicates success, and any other return code indicates failure. The /healthz handler returns a 200 status code for the first 10 seconds of the container's lifetime. The handler then returns a status code of 500.
"},{"location":"en/end-user/kpanda/workloads/pod-config/health-check.html#use-tcp-port-check","title":"Use TCP port check","text":"
TCP port parameter description:
| Parameter | Description |
| --- | --- |
| Port | The service listening port, such as port 8080 in the example. |
| Delay time (initialDelaySeconds) | Delay before the first check, in seconds. This setting is related to the normal startup time of the business program. For example, if it is set to 30, the health check starts 30 seconds after the container starts, which is the time reserved for the business program to start. |
| Timeout (timeoutSeconds) | Timeout, in seconds. For example, if it is set to 10, the health check must complete within 10 seconds; otherwise it is regarded as failed. If set to 0 or not set, the default timeout is 1 second. |
For a container that provides TCP communication services, based on this configuration, the cluster establishes a TCP connection to the container according to the set rules. If the connection is successful, it proves that the detection is successful, otherwise the detection fails. If you choose the TCP port detection method, you must specify the port that the container listens to.
This example uses both readiness and liveness probes. The kubelet sends the first readiness probe 5 seconds after the container starts, attempting to connect to port 8080 of the goproxy container. If the probe is successful, the Pod is marked as ready, and the kubelet continues to run the check every 10 seconds.
In addition to the readiness probe, this configuration includes a liveness probe. The kubelet performs the first liveness probe 15 seconds after the container starts; like the readiness probe, it attempts to connect to the goproxy container on port 8080. If the liveness probe fails, the container is restarted.
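A manifest consistent with the timings described above might look like the following sketch; the image tag and the liveness probe interval are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
    - name: goproxy
      image: registry.k8s.io/goproxy:0.1   # assumed image
      ports:
        - containerPort: 8080
      readinessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 5             # first readiness probe 5 seconds after start
        periodSeconds: 10                  # then every 10 seconds
      livenessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 15            # first liveness probe 15 seconds after start
        periodSeconds: 20                  # assumed liveness check interval
```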
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
    - name: liveness                   # Container name
      image: k8s.gcr.io/busybox        # Container image
      args:
        - /bin/sh                      # Command to run
        - -c                           # Pass the following string as a command
        - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600  # Command to execute
      livenessProbe:
        exec:
          command:
            - cat                      # Command to check liveness
            - /tmp/healthy             # File to check
        initialDelaySeconds: 5         # Wait 5 seconds before the first probe
        periodSeconds: 5               # Perform liveness detection every 5 seconds
```
The periodSeconds field specifies that the kubelet performs a liveness probe every 5 seconds, and the initialDelaySeconds field specifies that the kubelet waits for 5 seconds before performing the first probe. According to the set rules, the cluster periodically executes the command cat /tmp/healthy in the container through the kubelet to detect. If the command executes successfully and the return value is 0, the kubelet considers the container to be healthy and alive. If this command returns a non-zero value, the kubelet will kill the container and restart it.
"},{"location":"en/end-user/kpanda/workloads/pod-config/health-check.html#protect-slow-starting-containers-with-pre-start-checks","title":"Protect slow-starting containers with pre-start checks","text":"
Some applications require a long initialization time at startup. In this case, you can configure a startup probe using the same command as the liveness check. For HTTP or TCP probes, set failureThreshold * periodSeconds to a value long enough to cover the worst-case startup time.
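A sketch of such a configuration follows; it assumes the application serves /healthz on port 8080 and uses the 30 × 10 s budget referred to below.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-start-demo            # hypothetical name
spec:
  containers:
    - name: app
      image: my-app:latest         # assumed image serving /healthz on port 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30       # up to 30 failed attempts are tolerated
        periodSeconds: 10          # probed every 10 seconds, so 30 * 10 = 300 s to finish starting
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10          # takes over only after the startup probe succeeds
```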
With the above settings, the application has up to 5 minutes (30 * 10 = 300 s) to complete its startup. Once the startup probe succeeds, the liveness probe takes over and responds quickly to container deadlocks. If the startup probe never succeeds, the container is killed after 300 seconds and handled according to the restartPolicy .
"},{"location":"en/end-user/kpanda/workloads/pod-config/job-parameters.html","title":"Description of job parameters","text":"
According to the settings of .spec.completions and .spec.parallelism , jobs (Job) can be divided into the following types:
| Job Type | Description |
| --- | --- |
| Non-parallel Job | Creates a Pod until its Job completes successfully. |
| Parallel Job with a deterministic completion count | The Job is considered complete when the number of successful Pods reaches .spec.completions. |
| Parallel Job | Creates one or more Pods until one finishes successfully. |
| Parameter | Description |
| --- | --- |
| RestartPolicy | The restart policy for Pods created by the Job; only Never or OnFailure are supported. |
| .spec.completions | The number of Pods that need to run successfully for the Job to finish; the default is 1. |
| .spec.parallelism | The number of Pods running in parallel; the default is 1. |
| spec.backoffLimit | The maximum number of retries for failed Pods; no more retries are made once this limit is reached. |
| .spec.activeDeadlineSeconds | The maximum running time of the Job's Pods. Once this time is reached, the Job and all of its Pods are stopped. activeDeadlineSeconds has a higher priority than backoffLimit: a Job that reaches activeDeadlineSeconds ignores the backoffLimit setting. |
The following is an example Job configuration, saved in myjob.yaml, which calculates π to 2000 digits and prints the output.
```yaml
apiVersion: batch/v1
kind: Job                  # The type of the current resource
metadata:
  name: myjob
spec:
  completions: 50          # The Job needs to run 50 Pods to completion; in this example it prints π 50 times
  parallelism: 5           # 5 Pods run in parallel
  backoffLimit: 5          # Retry up to 5 times
  template:
    spec:
      containers:
        - name: pi
          image: perl
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never # Restart policy
```
Related commands
```bash
kubectl apply -f myjob.yaml    # Start the job
kubectl get job                # View this job
kubectl logs myjob-1122dswzs   # View Job Pod logs
```
"},{"location":"en/end-user/kpanda/workloads/pod-config/lifecycle.html","title":"Configure the container lifecycle","text":"
Pods follow a predefined lifecycle, starting in the Pending phase and entering the Running state if at least one container in the Pod starts normally. If any container in the Pod ends in a failed state, the state becomes Failed . The following phase field values indicate which phase of the lifecycle a Pod is in.
| Value | Description |
| --- | --- |
| Pending | The Pod has been accepted by the system, but one or more containers have not yet been created or run. This phase includes the time waiting for the Pod to be scheduled and the time spent downloading images over the network. |
| Running | The Pod has been bound to a node, and all containers in the Pod have been created. At least one container is still running, or is in the process of starting or restarting. |
| Succeeded | All containers in the Pod terminated successfully and will not be restarted. |
| Failed | All containers in the Pod have terminated, and at least one container terminated due to failure, that is, it exited with a non-zero status or was terminated by the system. |
| Unknown | The status of the Pod cannot be obtained for some reason, usually due to a communication failure with the host where the Pod resides. |
When creating a workload in Suanova container management, images are usually used to specify the running environment in the container. By default, when building an image, the Entrypoint and CMD fields can be used to define the commands and parameters to be executed when the container is running. If you need to change the commands and parameters of the container image before starting, after starting, and before stopping, you can override the default commands and parameters in the image by setting the lifecycle event commands and parameters of the container.
Configure the startup command, post-start command, and pre-stop command of the container according to business needs.
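As a rough illustration, the three lifecycle hooks map to the Pod spec as in the sketch below; the commands and image are placeholders, not the platform's defaults.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo                     # hypothetical name
spec:
  containers:
    - name: app
      image: nginx
      command: ["/bin/sh", "-c"]           # start command (overrides the image default)
      args: ["nginx -g 'daemon off;'"]
      lifecycle:
        postStart:                         # post-start command
          exec:
            command: ["/bin/sh", "-c", "echo started > /tmp/started"]
        preStop:                           # pre-stop command, used to drain the service
          exec:
            command: ["/bin/sh", "-c", "nginx -s quit; sleep 5"]
```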
| Parameter | Description | Example value |
| --- | --- | --- |
| Start command | Type: Optional. Meaning: the container is started according to this command. | - |
| Post-start command | Type: Optional. Meaning: the command executed after the container starts. | - |
| Pre-stop command | Type: Optional. Meaning: the command executed by the container after receiving the stop command, ensuring that the services running in the instance can be drained before the instance is upgraded or deleted. | - |
"},{"location":"en/end-user/kpanda/workloads/pod-config/lifecycle.html#start-command","title":"start command","text":"
Configure the startup command according to the table below.
| Parameter | Description | Example value |
| --- | --- | --- |
| Run command | Type: Required. Meaning: enter an executable command, separating multiple commands with spaces; if the command itself contains spaces, wrap it in quotes (""). When there are multiple commands, it is recommended to use /bin/sh or another shell to run them, passing all other commands in as parameters. | /run/server |
| Running parameters | Type: Optional. Meaning: enter the parameters of the run command. | port=8080 |
"},{"location":"en/end-user/kpanda/workloads/pod-config/lifecycle.html#post-start-commands","title":"Post-start commands","text":"
Suanova provides two processing types, command line script and HTTP request, to configure post-start commands. You can choose the configuration method that suits you according to the table below.
Command line script configuration
| Parameter | Description | Example value |
| --- | --- | --- |
| Run command | Type: Optional. Meaning: enter an executable command, separating multiple commands with spaces; if the command itself contains spaces, wrap it in quotes (""). When there are multiple commands, it is recommended to use /bin/sh or another shell to run them, passing all other commands in as parameters. | /run/server |
| Running parameters | Type: Optional. Meaning: enter the parameters of the run command. | port=8080 |
"},{"location":"en/end-user/kpanda/workloads/pod-config/lifecycle.html#stop-pre-command","title":"stop pre-command","text":"
Suanova provides two processing types, command line script and HTTP request, to configure the pre-stop command. You can choose the configuration method that suits you according to the table below.
HTTP request configuration
| Parameter | Description | Example value |
| --- | --- | --- |
| URL Path | Type: Optional. Meaning: the requested URL path. | /run/server |
| Port | Type: Required. Meaning: the requested port. | port=8080 |
| Node Address | Type: Optional. Meaning: the requested IP address; the default is the node IP where the container is located. | - |
"},{"location":"en/end-user/kpanda/workloads/pod-config/scheduling-policy.html","title":"Scheduling Policy","text":"
In a Kubernetes cluster, like many other Kubernetes objects, nodes have labels. You can manually add labels. Kubernetes also adds some standard labels to all nodes in the cluster. See Common Labels, Annotations, and Taints for common node labels. By adding labels to nodes, you can have pods scheduled on specific nodes or groups of nodes. You can use this feature to ensure that specific Pods can only run on nodes with certain isolation, security or governance properties.
nodeSelector is the simplest recommended form of a node selection constraint. You can add a nodeSelector field to the Pod's spec to set the node label. Kubernetes will only schedule pods on nodes with each label specified. nodeSelector provides one of the easiest ways to constrain Pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define. Some benefits of using affinity and anti-affinity are:
Affinity and anti-affinity languages are more expressive. nodeSelector can only select nodes that have all the specified labels. Affinity, anti-affinity give you greater control over selection logic.
You can mark a rule as \"soft demand\" or \"preference\", so that the scheduler will still schedule the Pod if no matching node can be found.
You can use the labels of other Pods running on the node (or in other topological domains) to enforce scheduling constraints, instead of only using the labels of the node itself. This capability allows you to define rules which allow Pods to be placed together.
You can choose which node the Pod will deploy to by setting affinity and anti-affinity.
When the node where the workload instance is located is unavailable, the period for the system to reschedule the instance to other available nodes. The default is 300 seconds.
Node affinity is conceptually similar to nodeSelector , which allows you to constrain which nodes Pods can be scheduled on based on the labels on the nodes. There are two types of node affinity:
Must be satisfied: ( requiredDuringSchedulingIgnoredDuringExecution ) The scheduler can only run scheduling when the rules are satisfied. This functionality is similar to nodeSelector , but with a more expressive syntax. You can define multiple hard constraint rules, but only one of them must be satisfied.
Satisfy as much as possible: ( preferredDuringSchedulingIgnoredDuringExecution ) The scheduler will try to find nodes that meet the proper rules. If no matching node is found, the scheduler will still schedule the Pod. You can also set weights for soft constraint rules. During specific scheduling, if there are multiple nodes that meet the conditions, the node with the highest weight will be scheduled first. At the same time, you can also define multiple hard constraint rules, but only one of them needs to be satisfied.
Weight can only be set for the "satisfy as much as possible" policy. It can be understood as the scheduling priority: nodes with the highest weight are scheduled first. The value range is 1 to 100.
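A sketch combining one hard rule with one weighted soft rule; the label keys and values (kubernetes.io/os, disktype) are illustrative only:

```yaml
# Fragment of a Pod spec
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # must be satisfied
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/os
              operator: In
              values: ["linux"]
    preferredDuringSchedulingIgnoredDuringExecution:  # satisfy as much as possible
      - weight: 80                                    # 1-100; the highest-weight node wins when several qualify
        preference:
          matchExpressions:
            - key: disktype                           # illustrative label
              operator: In
              values: ["ssd"]
```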
Similar to node affinity, there are two types of workload affinity:
Must be satisfied (requiredDuringSchedulingIgnoredDuringExecution): the scheduler can schedule the Pod only when the rules are satisfied. This functionality is similar to nodeSelector, but with a more expressive syntax. You can define multiple hard constraint rules, and only one of them needs to be satisfied.
Satisfy as much as possible (preferredDuringSchedulingIgnoredDuringExecution): the scheduler tries to find nodes that meet the rules. If no matching node is found, the scheduler still schedules the Pod. You can also set weights for soft constraint rules; during scheduling, if multiple nodes meet the conditions, the node with the highest weight is scheduled first. You can likewise define multiple rules at the same time, and only one of them needs to be satisfied.
Workload affinity is mainly used to determine which Pods of a workload can be deployed in the same topology domain. For example, services that communicate with each other can be deployed in the same topology domain (such as the same availability zone) through affinity scheduling, reducing the network latency between them.
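For instance, the fragment below (assuming the peer Pods carry an app: backend label) asks the scheduler to place this workload's Pods in the same availability zone as those Pods:

```yaml
# Fragment of a Pod spec
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app                              # label carried by the Pods to co-locate with (assumed)
              operator: In
              values: ["backend"]
        topologyKey: topology.kubernetes.io/zone    # "same topology domain" = same availability zone
```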
Similar to node affinity, there are two types of anti-affinity for workloads:
Must be satisfied (requiredDuringSchedulingIgnoredDuringExecution): the scheduler can schedule the Pod only when the rules are satisfied. This functionality is similar to nodeSelector, but with a more expressive syntax. You can define multiple hard constraint rules, and only one of them needs to be satisfied.
Satisfy as much as possible (preferredDuringSchedulingIgnoredDuringExecution): the scheduler tries to find nodes that meet the rules. If no matching node is found, the scheduler still schedules the Pod. You can also set weights for soft constraint rules; during scheduling, if multiple nodes meet the conditions, the node with the highest weight is scheduled first. You can likewise define multiple rules at the same time, and only one of them needs to be satisfied.
Workload anti-affinity is mainly used to determine which Pods of a workload cannot be deployed in the same topology domain. For example, the Pods of the same workload can be spread across different topology domains (such as different hosts) to improve the stability of the workload itself.
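A sketch of a soft anti-affinity rule that spreads a workload's own replicas (assumed to carry an app: web label) across different hosts:

```yaml
# Fragment of a Pod spec
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app                         # label carried by this workload's own Pods (assumed)
                operator: In
                values: ["web"]
          topologyKey: kubernetes.io/hostname    # a different topology domain here means a different host
```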
A workload is an application running on Kubernetes. Whether your application is composed of a single component or of many components working together, you can run it in a set of Pods. Kubernetes provides five built-in workload resources to manage Pods:
Deployment
StatefulSet
DaemonSet
Job
CronJob
You can also extend workload resources by defining CRDs (Custom Resource Definitions). The fifth-generation container management supports full lifecycle management of workloads, including creation, update, scaling, monitoring, logging, deletion, and version management.
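As a point of reference, a minimal Deployment that manages a set of Pods might look like the sketch below; the name and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-demo              # hypothetical workload name
spec:
  replicas: 3                 # the Deployment keeps three Pod replicas running
  selector:
    matchLabels:
      app: web-demo
  template:
    metadata:
      labels:
        app: web-demo
    spec:
      containers:
        - name: web
          image: nginx:1.25   # placeholder image
```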
A Pod is the smallest computing unit created and managed in Kubernetes, that is, a collection of one or more containers. These containers share storage, networking, and the management policies that control how they run. Pods are typically not created directly by users but through workload resources. Pods follow a predefined lifecycle: they start in the Pending phase; once at least one of the primary containers starts normally, they enter Running; they then move to Succeeded or Failed depending on whether any container in the Pod has terminated with a failure.
The fifth-generation container management module provides a built-in set of workload lifecycle statuses based on factors such as Pod status and the number of replicas, so that users can perceive the real running status of their workloads more accurately. Because different workload types (such as Deployments and Jobs) manage Pods differently, they display different lifecycle statuses while running, as shown in the following tables.
"},{"location":"en/end-user/kpanda/workloads/pod-config/workload-status.html#deployment-statefulset-damemonset-status","title":"Deployment, StatefulSet, DamemonSet Status","text":"Status Description Waiting 1. A workload is in this status while its creation is in progress. 2. After an upgrade or rollback action is triggered, the workload is in this status. 3. Trigger operations such as pausing/scaling, and the workload is in this status. Running This status occurs when all instances under the workload are running and the number of replicas matches the user-defined number. Deleting When a delete operation is performed, the payload is in this status until the delete is complete. Exception Unable to get the status of the workload for some reason. This usually occurs because communication with the pod's host has failed. Not Ready When the container is in an abnormal, pending status, this status is displayed when the workload cannot be started due to an unknown error"},{"location":"en/end-user/kpanda/workloads/pod-config/workload-status.html#job-status","title":"Job Status","text":"Status Description Waiting The workload is in this status while Job creation is in progress. Executing The Job is in progress and the workload is in this status. Execution Complete The Job execution is complete and the workload is in this status. Deleting A delete operation is triggered and the workload is in this status. Exception Pod status could not be obtained for some reason. This usually occurs because communication with the pod's host has failed."},{"location":"en/end-user/kpanda/workloads/pod-config/workload-status.html#cronjob-status","title":"CronJob status","text":"Status Description Waiting The CronJob is in this status when it is being created. Started After the CronJob is successfully created, the CronJob is in this status when it is running normally or when the paused task is started. Stopped The CronJob is in this status when the stop task operation is performed. Deleting The deletion operation is triggered, and the CronJob is in this status.
When a workload is in an abnormal or not-ready status, you can hover the mouse over the workload's status value, and the system will display more detailed error information in a tooltip. You can also view the logs or events to obtain related running information about the workload.
Notebook usually refers to Jupyter Notebook or similar interactive computing environments. It is a very popular tool widely used in fields such as data science, machine learning, and deep learning. This page explains how to use Notebook in the AI platform.
Enter a name, select the cluster, namespace, choose the queue just created, and click One-Click Initialization.
Select the Notebook type, configure memory and CPU, enable GPU, and create and configure a PVC:
Enable SSH external network access:
You will be automatically redirected to the Notebook instance list, click the instance name.
Enter the Notebook instance detail page and click the Open button in the upper right corner.
You have entered the Notebook development environment, where a persistent volume is mounted in the /home/jovyan directory. You can clone code through git, upload data after connecting via SSH, etc.
"},{"location":"en/end-user/share/notebook.html#accessing-notebook-instances-via-ssh","title":"Accessing Notebook Instances via SSH","text":"
Generate an SSH key pair on your own computer.
Open the command line on your computer, for example, open git bash on Windows, enter ssh-keygen.exe -t rsa, and press enter through the prompts.
Use commands like cat ~/.ssh/id_rsa.pub to view and copy the public key.
Log into the AI platform as a user, click Personal Center -> SSH Public Key -> Import SSH Public Key in the upper right corner.
Enter the detail page of the Notebook instance and copy the SSH link.
Use SSH to access the Notebook instance from the client.
Next step: Create Training Job
"},{"location":"en/end-user/share/workload.html","title":"Creating AI Workloads Using GPU Resources","text":"
After the administrator allocates resource quotas for the workspace, users can create AI workloads to utilize GPU computing resources.
Access keys can be used to access the OpenAPI and for continuous publishing. You can follow the steps below to obtain your key in your Personal Center and use it to access the API.
Log in to the AI platform, find Personal Center in the dropdown menu at the top right corner, and manage your account's access keys on the Access Keys page.
Info
Access key information is displayed only once. If you forget it, you will need to create a new access key.
"},{"location":"en/openapi/index.html#using-the-key-to-access-the-api","title":"Using the Key to Access the API","text":"
When accessing the AI platform's OpenAPI, include the request header Authorization:Bearer ${token} in the request to identify the visitor's identity, where ${token} is the key obtained in the previous step.
Request Example
```bash
curl -X GET \
  -H 'Authorization:Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IkRKVjlBTHRBLXZ4MmtQUC1TQnVGS0dCSWc1cnBfdkxiQVVqM2U3RVByWnMiLCJ0eXAiOiJKV1QifQ.eyJleHAiOjE2NjE0MTU5NjksImlhdCI6MTY2MDgxMTE2OSwiaXNzIjoiZ2hpcHBvLmlvIiwic3ViIjoiZjdjOGIxZjUtMTc2MS00NjYwLTg2MWQtOWI3MmI0MzJmNGViIiwicHJlZmVycmVkX3VzZXJuYW1lIjoiYWRtaW4iLCJncm91cHMiOltdfQ.RsUcrAYkQQ7C6BxMOrdD3qbBRUt0VVxynIGeq4wyIgye6R8Ma4cjxG5CbU1WyiHKpvIKJDJbeFQHro2euQyVde3ygA672ozkwLTnx3Tu-_mB1BubvWCBsDdUjIhCQfT39rk6EQozMjb-1X1sbLwzkfzKMls-oxkjagI_RFrYlTVPwT3Oaw-qOyulRSw7Dxd7jb0vINPq84vmlQIsI3UuTZSNO5BCgHpubcWwBss-Aon_DmYA-Et_-QtmPBA3k8E2hzDSzc7eqK0I68P25r9rwQ3DeKwD1dbRyndqWORRnz8TLEXSiCFXdZT2oiMrcJtO188Ph4eLGut1-4PzKhwgrQ' \
  'https://demo-dev.daocloud.io/apis/ghippo.io/v1alpha1/users?page=1&pageSize=10' \
  -k
```