diff --git a/docs/en/docs/admin/baize/best-practice/add-scheduler.md b/docs/en/docs/admin/baize/best-practice/add-scheduler.md index e839d332df..11e640b248 100644 --- a/docs/en/docs/admin/baize/best-practice/add-scheduler.md +++ b/docs/en/docs/admin/baize/best-practice/add-scheduler.md @@ -5,7 +5,7 @@ Date: 2024-07-30 # Add Job Scheduler -DCE 5.0 AI Lab provides a job scheduler to help you better manage jobs. +AI Lab provides a job scheduler to help you better manage jobs. In addition to the basic scheduler, it also supports custom schedulers. ## Introduction to Job Scheduler @@ -54,7 +54,7 @@ including `Coscheduling (Gang Scheduling)` and other features. ### Deploy Scheduler Plugins To deploy a secondary scheduler plugin in a worker cluster, refer to -[Deploying Secondary Scheduler Plugin](../../kpanda/user-guide/clusters/cluster-scheduler-plugin.md). +[Deploying Secondary Scheduler Plugin](../../kpanda/clusters/cluster-scheduler-plugin.md). ### Enable Scheduler Plugins in AI Lab diff --git a/docs/en/docs/admin/baize/best-practice/change-notebook-image.md b/docs/en/docs/admin/baize/best-practice/change-notebook-image.md index 8d40a69b20..d545e9793b 100644 --- a/docs/en/docs/admin/baize/best-practice/change-notebook-image.md +++ b/docs/en/docs/admin/baize/best-practice/change-notebook-image.md @@ -27,7 +27,7 @@ This Notebook includes basic development tools. Taking `baize-notebook:v0.5.0` ( !!! note - With each version iteration, DCE 5.0 will proactively maintain and update. + With each version iteration, AI platform will proactively maintain and update. However, sometimes users may need custom images. This page explains how to update images and add them to the Notebook creation interface for selection. diff --git a/docs/en/docs/admin/baize/best-practice/checkpoint.md b/docs/en/docs/admin/baize/best-practice/checkpoint.md index 0ea5139af9..bb02bc7701 100644 --- a/docs/en/docs/admin/baize/best-practice/checkpoint.md +++ b/docs/en/docs/admin/baize/best-practice/checkpoint.md @@ -99,7 +99,7 @@ checkpoint.save(file_prefix=checkpoint_prefix) !!!note - Users of DCE 5.0 AI Lab can directly mount high-performance storage as the checkpoint directory to improve the speed of saving and restoring checkpoints. + Users of AI Lab can directly mount high-performance storage as the checkpoint directory to improve the speed of saving and restoring checkpoints. ### Restore Checkpoints in TensorFlow diff --git a/docs/en/docs/admin/baize/best-practice/deploy-nfs-in-worker.md b/docs/en/docs/admin/baize/best-practice/deploy-nfs-in-worker.md index 96e371161d..5fa57761fb 100644 --- a/docs/en/docs/admin/baize/best-practice/deploy-nfs-in-worker.md +++ b/docs/en/docs/admin/baize/best-practice/deploy-nfs-in-worker.md @@ -4,7 +4,7 @@ A **Network File System (NFS)** allows remote hosts to mount file systems over a interact with those file systems as though they are mounted locally. This enables system administrators to consolidate resources onto centralized servers on the network. -**Dataset** is a core feature provided by DCE 5.0 AI Lab. +**Dataset** is a core feature provided by AI Lab. By abstracting the dependency on data throughout the entire lifecycle of MLOps into datasets, users can manage various types of data in datasets so that training tasks can directly use the data in the dataset. 
diff --git a/docs/en/docs/admin/baize/best-practice/finetunel-llm.md b/docs/en/docs/admin/baize/best-practice/finetunel-llm.md index aec6983df9..4c2117f861 100644 --- a/docs/en/docs/admin/baize/best-practice/finetunel-llm.md +++ b/docs/en/docs/admin/baize/best-practice/finetunel-llm.md @@ -1,7 +1,7 @@ # Fine-tune the ChatGLM3 Model by Using AI Lab This page uses the `ChatGLM3` model as an example to demonstrate how to use LoRA (Low-Rank Adaptation) -to fine-tune the ChatGLM3 model within the DCE 5.0 AI Lab environment. The demo program is from the +to fine-tune the ChatGLM3 model within the AI Lab environment. The demo program is from the [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/finetune_demo/lora_finetune.ipynb) official example. The general process of fine-tuning is as follows: @@ -17,12 +17,12 @@ The general process of fine-tuning is as follows: !!! info - Before starting, ensure DCE 5.0 and [AI Lab](../intro/install.md) are correctly installed, + Before starting, ensure AI platform and [AI Lab](../intro/install.md) are correctly installed, GPU queue resources are successfully initialized, and computing resources are sufficient. ## Prepare Data -Utilize the dataset management feature provided by DCE 5.0 AI Lab to quickly preheat +Utilize the dataset management feature provided by AI Lab to quickly preheat and persist the data required for fine-tuning large models, reducing GPU resource occupation due to data preparation, and improving resource utilization efficiency. @@ -40,7 +40,7 @@ First, pull the ChatGLM3 code repository and download the pre-training model for -DCE 5.0 AI Lab will automatically preheat the data in the background to ensure +AI Lab will automatically preheat the data in the background to ensure quick data access for subsequent tasks. ### AdvertiseGen Dataset @@ -68,11 +68,11 @@ Traditionally, environment dependencies are either packaged directly into the de installed in the local environment, which can lead to inconsistency in environment dependencies and difficulties in managing and updating dependencies. -DCE 5.0 AI Lab provides environment management capabilities, decoupling Python environment +AI Lab provides environment management capabilities, decoupling Python environment dependency package management from development tools and task images, solving dependency management chaos and environment inconsistency issues. -Here, use the environment management feature provided by DCE 5.0 AI Lab to +Here, use the environment management feature provided by AI Lab to create the environment required for ChatGLM3 fine-tuning for subsequent use. !!! warning @@ -97,7 +97,7 @@ may vary based on your location. Using a domestic mirror for acceleration can sp ## Use Notebook as IDE -DCE 5.0 AI Lab provides Notebook as an IDE feature, allowing users to write, run, and view +AI Lab provides Notebook as an IDE feature, allowing users to write, run, and view code results directly in the browser. This is very suitable for development in data analysis, machine learning, and deep learning fields. @@ -289,7 +289,7 @@ data output dataset for subsequent inference tasks. ### Submit Tasks via `baizectl` -DCE 5.0 AI Lab's Notebook supports using the `baizectl` command-line tool without authentication. +AI Lab's Notebook supports using the `baizectl` command-line tool without authentication. If you prefer using CLI, you can directly use the `baizectl` command-line tool to submit tasks. ```bash @@ -329,7 +329,7 @@ the resource configuration of the previous fine-tuning tasks. 
### Configure Model Runtime -Configuring the model runtime is crucial. Currently, DCE 5.0 AI Lab supports +Configuring the model runtime is crucial. Currently, AI Lab supports `vLLM` as the model inference service runtime, which can be directly selected. !!! tip @@ -359,6 +359,6 @@ curl -X POST http://10.20.100.210:31118/v2/models/chatglm3-6b/generate \ This page used `ChatGLM3` as an example to quickly introduce and get you started with the **AI Lab** for model fine-tuning, using `LoRA` to fine-tune the ChatGLM3 model. -DCE 5.0 AI Lab provides a wealth of features to help model developers quickly conduct +AI Lab provides a wealth of features to help model developers quickly conduct model development, fine-tuning, and inference tasks. It also offers rich OpenAPI interfaces, facilitating integration with third-party application ecosystems. diff --git a/docs/en/docs/admin/baize/best-practice/label-studio.md b/docs/en/docs/admin/baize/best-practice/label-studio.md index 2c716f1b00..df82b211d5 100644 --- a/docs/en/docs/admin/baize/best-practice/label-studio.md +++ b/docs/en/docs/admin/baize/best-practice/label-studio.md @@ -21,10 +21,10 @@ machine learning and artificial intelligence jobs. Here is a brief introduction Label Studio offers a powerful data labeling solution for data scientists and machine learning engineers due to its flexibility and rich features. -## Deploy to DCE 5.0 +## Deploy to AI platform To use Label Studio in AI Lab, it needs to be deployed to the -[Global Service Cluster](../../kpanda/user-guide/clusters/cluster-role.md#global-service-cluster). +[Global Service Cluster](../../kpanda/clusters/cluster-role.md#global-service-cluster). You can quickly deploy it using Helm. !!! note @@ -57,7 +57,7 @@ You can quickly deploy it using Helm. image: repository: heartexlabs/label-studio # Configure proxy address here if docker.io is inaccessible extraEnvironmentVars: - LABEL_STUDIO_HOST: https://{DCE_Access_Address}/label-studio # Use the DCE 5.0 login address, refer to the current webpage URL + LABEL_STUDIO_HOST: https://{Access_Address}/label-studio # Use the AI platform login address, refer to the current webpage URL LABEL_STUDIO_USERNAME: {User_Email} # Must be an email, replace with your own LABEL_STUDIO_PASSWORD: {User_Password} app: @@ -84,7 +84,7 @@ global: image: repository: heartexlabs/label-studio # Configure proxy address here if docker.io is inaccessible extraEnvironmentVars: - LABEL_STUDIO_HOST: https://{DCE_Access_Address}/label-studio # Use the DCE 5.0 login address, refer to the current webpage URL + LABEL_STUDIO_HOST: https://{Access_Address}/label-studio # Use the AI platform login address, refer to the current webpage URL LABEL_STUDIO_USERNAME: {User_Email} # Must be an email, replace with your own LABEL_STUDIO_PASSWORD: {User_Password} app: @@ -105,7 +105,7 @@ externalPostgresql: ## Add GProduct to Navigation Bar -To add Label Studio to the DCE 5.0 navigation bar, you can refer to the method in +To add Label Studio to the AI platform navigation bar, you can refer to the method in [Global Management OEM IN](../../ghippo/best-practice/oem/oem-in.md). The following example shows how to add it to the secondary navigation of AI Lab. 
@@ -178,7 +178,7 @@ spec: name: label-studio order: 1 target: blank # Control new blank page - url: https://{DCE_Access_Address}/label-studio # url to access + url: https://{Access_Address}/label-studio # url to access visible: true # End adding name: AI Lab diff --git a/docs/en/docs/admin/baize/developer/inference/models.md b/docs/en/docs/admin/baize/developer/inference/models.md index 3a14090b66..60e2b1a420 100644 --- a/docs/en/docs/admin/baize/developer/inference/models.md +++ b/docs/en/docs/admin/baize/developer/inference/models.md @@ -15,7 +15,7 @@ Here, you can see information about the supported models. Refer to the [Release Notes](../../intro/release-notes.md) to understand the latest version and update timely. You can use GPU types that have been verified by AI platform in AI Lab. -For more details, refer to the [GPU Support Matrix](../../../kpanda/user-guide/gpu/gpu_matrix.md). +For more details, refer to the [GPU Support Matrix](../../../kpanda/gpu/gpu_matrix.md). ![Click to Create](../../images/inference-interface.png) diff --git a/docs/en/docs/admin/baize/troubleshoot/cluster-not-found.md b/docs/en/docs/admin/baize/troubleshoot/cluster-not-found.md index f49ca72eb1..dbffdb8004 100644 --- a/docs/en/docs/admin/baize/troubleshoot/cluster-not-found.md +++ b/docs/en/docs/admin/baize/troubleshoot/cluster-not-found.md @@ -41,8 +41,8 @@ find `baize-agent` and install it. !!! note - Quickly jump to this address: `https:///kpanda/clusters//helm/charts/addon/baize-agent`. - Note to replace `` with the actual DCE console address, and `` with the actual cluster name. + Quickly jump to this address: `https:///kpanda/clusters//helm/charts/addon/baize-agent`. + Note to replace `` with the actual console address, and `` with the actual cluster name. ### Cluster name not configured in the process of installing `baize-agent` @@ -58,6 +58,6 @@ to be unable to retrieve cluster information. Check if the platform's Insight se are running and configured correctly. - Check if the insight-server component is running properly in the - [Global Service Cluster](../../kpanda/user-guide/clusters/cluster-role.md#global-service-cluster). + [Global Service Cluster](../../kpanda/clusters/cluster-role.md#global-service-cluster). - Check if the insight-agent component is running properly in the - [worker cluster](../../kpanda/user-guide/clusters/cluster-role.md#worker-cluster). + [worker cluster](../../kpanda/clusters/cluster-role.md#worker-cluster). diff --git a/docs/en/docs/admin/baize/troubleshoot/index.md b/docs/en/docs/admin/baize/troubleshoot/index.md index ade6f9aec9..c930ee7d2b 100644 --- a/docs/en/docs/admin/baize/troubleshoot/index.md +++ b/docs/en/docs/admin/baize/troubleshoot/index.md @@ -13,10 +13,10 @@ solutions for certain errors encountered during use. !!! warning - This documentation is only applicable to version DCE 5.0. If you encounter issues with + This documentation is only applicable to AI platform. If you encounter issues with the use of AI Lab, please refer to this troubleshooting guide first. -In DCE 5.0, the module name for AI Lab is `baize`, +In AI platform, the module name for AI Lab is `baize`, which offers one-stop solutions for model training, inference, model management, and more. 

## Common Troubleshooting Cases diff --git a/docs/en/docs/admin/baize/troubleshoot/notebook-not-controlled-by-quotas.md b/docs/en/docs/admin/baize/troubleshoot/notebook-not-controlled-by-quotas.md index c531a26d0f..95c6acdc73 100644 --- a/docs/en/docs/admin/baize/troubleshoot/notebook-not-controlled-by-quotas.md +++ b/docs/en/docs/admin/baize/troubleshoot/notebook-not-controlled-by-quotas.md @@ -14,7 +14,7 @@ they find that even if the selected queue lacks resources, the Notebook can stil The queue management capability in AI Lab is provided by [Kueue](https://kueue.sigs.k8s.io/), and the Notebook service is provided through [JupyterHub](https://jupyter.org/hub). JupyterHub has high - requirements for the Kubernetes version. For versions below v1.27, even if queue quotas are set in DCE 5.0, + requirements for the Kubernetes version. For versions below v1.27, even if queue quotas are set in AI platform, and users select the quota when creating a Notebook, the Notebook will not actually be restricted by the queue quota. ![local-queue-initialization-failed](./images/kueue-k8s127.png) diff --git a/docs/en/docs/admin/ghippo/access-control/iam.md b/docs/en/docs/admin/ghippo/access-control/iam.md index 63de633539..e02a14429f 100644 --- a/docs/en/docs/admin/ghippo/access-control/iam.md +++ b/docs/en/docs/admin/ghippo/access-control/iam.md @@ -34,9 +34,9 @@ graph TD class login,user,auth,group,role,id cluster; click login "https://docs.daocloud.io/en/ghippo/install/login.html" -click user "https://docs.daocloud.io/en/ghippo/user-guide/access-control/user.html" -click auth "https://docs.daocloud.io/en/ghippo/user-guide/access-control/role.html" -click group "https://docs.daocloud.io/en/ghippo/user-guide/access-control/group.html" -click role "https://docs.daocloud.io/en/ghippo/user-guide/access-control/custom-role.html" -click id "https://docs.daocloud.io/en/ghippo/user-guide/access-control/idprovider.html" +click user "https://docs.daocloud.io/en/ghippo/access-control/user.html" +click auth "https://docs.daocloud.io/en/ghippo/access-control/role.html" +click group "https://docs.daocloud.io/en/ghippo/access-control/group.html" +click role "https://docs.daocloud.io/en/ghippo/access-control/custom-role.html" +click id "https://docs.daocloud.io/en/ghippo/access-control/idprovider.html" ``` diff --git a/docs/en/docs/admin/ghippo/access-control/user.md b/docs/en/docs/admin/ghippo/access-control/user.md index 00db92771c..1b3da1848f 100644 --- a/docs/en/docs/admin/ghippo/access-control/user.md +++ b/docs/en/docs/admin/ghippo/access-control/user.md @@ -36,11 +36,11 @@ Prerequisite: The user already exists. 1. The administrator enters __Access Control__ , selects __Users__ , enters the user list, and clicks __┇__ -> __Authorization__ . - ![Menu](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/authorize01.png) + ![Menu](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/authorize01.png) 2. On the __Authorization__ page, check the required role permissions (multiple choices are allowed). - ![Interface](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/authorize02.png) + ![Interface](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/authorize02.png) 3. Click __OK__ to complete the authorization for the user. @@ -52,11 +52,11 @@ Prerequisite: The user already exists. 1. The administrator enters __Access Control__ , selects __Users__ , enters the user list, and clicks __┇__ -> __Add to Group__ . 
- ![Add group menu](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/joingroup01.png) + ![Add group menu](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/joingroup01.png) 2. On the __Add to Group__ page, check the groups to be joined (multiple choices are allowed). If there is no optional group, click __Create a new group__ to create a group, and then return to this page and click the __Refresh__ button to display the newly created group. - ![Add group interface](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/joingroup02.png) + ![Add group interface](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/joingroup02.png) 3. Click __OK__ to add the user to the group. @@ -70,11 +70,11 @@ Once a user is deactivated, that user will no longer be able to access the Platf 1. The administrator enters __Access Control__ , selects __Users__ , enters the user list, and clicks a username to enter user details. - ![User details](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/createuser03.png) + ![User details](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/createuser03.png) 2. Click __Edit__ on the upper right, turn off the status button, and make the button gray and inactive. - ![Edit](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/enableuser01.png) + ![Edit](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/enableuser01.png) 3. Click __OK__ to finish disabling the user. @@ -84,11 +84,11 @@ Premise: User mailboxes need to be set. There are two ways to set user mailboxes - On the user details page, the administrator clicks __Edit__ , enters the user's email address in the pop-up box, and clicks __OK__ to complete the email setting. - ![Edit](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/enableuser02.png) + ![Edit](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/enableuser02.png) - Users can also enter the __Personal Center__ and set the email address on the __Security Settings__ page. - ![User center](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/mailbox.png) + ![User center](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/mailbox.png) If the user forgets the password when logging in, please refer to [Reset Password](../password.md). @@ -103,8 +103,8 @@ If the user forgets the password when logging in, please refer to [Reset Passwor 1. The administrator enters __Access Control__ , selects __Users__ , enters the user list, and clicks __┇__ -> __Delete__ . - ![Delete user](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/deleteuser01.png) + ![Delete user](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/deleteuser01.png) 2. Click __Delete__ to finish deleting the user. 
- ![Confirm deletion](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/deleteuser02.png) + ![Confirm deletion](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/deleteuser02.png) diff --git a/docs/en/docs/admin/ghippo/best-practice/authz-plan.md b/docs/en/docs/admin/ghippo/best-practice/authz-plan.md index d21cb176d7..27f69bcdc5 100644 --- a/docs/en/docs/admin/ghippo/best-practice/authz-plan.md +++ b/docs/en/docs/admin/ghippo/best-practice/authz-plan.md @@ -1,6 +1,6 @@ # Ordinary user authorization plan -Ordinary users refer to those who can use most of DCE's product modules and features (except management features), have certain operation rights to resources within the scope of authority, and can independently use resources to deploy applications. +Ordinary users refer to those who can use most product modules and features (except management features), have certain operation rights to resources within the scope of authority, and can independently use resources to deploy applications. The authorization and resource planning process for such users is shown in the following figure. @@ -14,11 +14,11 @@ graph TB ws-to-ns --> authu[5. Authorize a user with Workspace Editor] authu --> complete([End]) -click user "https://docs.daocloud.io/en/ghippo/user-guide/access-control/user/" -click ns "https://docs.daocloud.io/en/kpanda/user-guide/namespaces/createns/" -click ws "https://docs.daocloud.io/en/ghippo/user-guide/workspace/workspace/" -click ws-to-ns "https://docs.daocloud.io/en/ghippo/user-guide/workspace/ws-to-ns-across-clus/" -click authu "https://docs.daocloud.io/en/ghippo/user-guide/workspace/wspermission/" +click user "https://docs.daocloud.io/en/ghippo/access-control/user/" +click ns "https://docs.daocloud.io/en/kpanda/namespaces/createns/" +click ws "https://docs.daocloud.io/en/ghippo/workspace/workspace/" +click ws-to-ns "https://docs.daocloud.io/en/ghippo/workspace/ws-to-ns-across-clus/" +click authu "https://docs.daocloud.io/en/ghippo/workspace/wspermission/" classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000; classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff; diff --git a/docs/en/docs/admin/ghippo/best-practice/cluster-for-multiws.md b/docs/en/docs/admin/ghippo/best-practice/cluster-for-multiws.md index ae115feb77..b484bc1b96 100644 --- a/docs/en/docs/admin/ghippo/best-practice/cluster-for-multiws.md +++ b/docs/en/docs/admin/ghippo/best-practice/cluster-for-multiws.md @@ -11,7 +11,7 @@ This method has a drawback: if the business volume of the enterprise is large, manually allocating resources requires a significant amount of work, and flexibly adjusting resource quotas can also be challenging. -To address this, DCE introduces the concept of workspaces. By sharing resources, +To address this, the AI platform introduces the concept of workspaces. By sharing resources, workspaces can provide higher-dimensional resource quota capabilities, allowing workspaces (tenants) to self-create Kubernetes namespaces under resource quotas. 
@@ -41,9 +41,9 @@ class preparews,preparecs,share, cluster; class judge plain class modifyns,createns k8s -click preparews "https://docs.daocloud.io/en/ghippo/user-guide/workspace/cluster-for-multiws/#prepare-a-workspace" -click preparecs "https://docs.daocloud.io/en/ghippo/user-guide/workspace/cluster-for-multiws/#prepare-a-cluster" -click share "https://docs.daocloud.io/en/ghippo/user-guide/workspace/cluster-for-multiws/#add-a-cluster-to-the-workspace" +click preparews "https://docs.daocloud.io/en/ghippo/workspace/cluster-for-multiws/#prepare-a-workspace" +click preparecs "https://docs.daocloud.io/en/ghippo/workspace/cluster-for-multiws/#prepare-a-cluster" +click share "https://docs.daocloud.io/en/ghippo/workspace/cluster-for-multiws/#add-a-cluster-to-the-workspace" click createns "https://docs.daocloud.io/en/amamba/user-guide/namespace/namespace/#create-a-namespace" click modifyns "https://docs.daocloud.io/en/amamba/user-guide/namespace/namespace/#namespace-quotas" ``` @@ -54,7 +54,7 @@ Workspaces are designed to meet multi-tenant usage scenarios, forming isolated r cluster namespaces, meshes, mesh namespaces, multicloud, multicloud namespaces, and other resources. Workspaces can be mapped to various concepts such as projects, tenants, enterprises, and suppliers. -1. Log in to DCE 5.0 with a user having the admin/folder admin role and click __Global Management__ at the bottom of the left navigation bar. +1. Log in to AI platform with a user having the admin/folder admin role and click __Global Management__ at the bottom of the left navigation bar. ![Global Management](../images/ws01.png) @@ -76,7 +76,7 @@ Follow these steps to prepare a cluster. ![Container Management](../images/clusterlist01.png) -1. Click __Create Cluster__ to [create a cluster](../../kpanda/user-guide/clusters/create-cluster.md) or click __Integrate Cluster__ to [integrate a cluster](../../kpanda/user-guide/clusters/integrate-cluster.md). +1. Click __Create Cluster__ to [create a cluster](../../kpanda/clusters/create-cluster.md) or click __Integrate Cluster__ to [integrate a cluster](../../kpanda/clusters/integrate-cluster.md). ## Add Cluster to Workspace diff --git a/docs/en/docs/admin/ghippo/best-practice/folder-practice.md b/docs/en/docs/admin/ghippo/best-practice/folder-practice.md index 556e086b61..1945e87f9f 100644 --- a/docs/en/docs/admin/ghippo/best-practice/folder-practice.md +++ b/docs/en/docs/admin/ghippo/best-practice/folder-practice.md @@ -14,7 +14,7 @@ Therefore, with the help of folders, enterprise managers can centrally manage an 1. Build corporate hierarchy First of all, according to the existing enterprise hierarchy structure, build the same folder hierarchy as the enterprise. - DCE supports 5-level folders, which can be freely combined according to the actual situation of the enterprise, and folders and workspaces are mapped to entities such as departments, projects, and suppliers in the enterprise. + The AI platform supports 5-level folders, which can be freely combined according to the actual situation of the enterprise, and folders and workspaces are mapped to entities such as departments, projects, and suppliers in the enterprise. Folders are not directly linked to resources, but indirectly achieve resource grouping through workspaces. @@ -23,7 +23,7 @@ Therefore, with the help of folders, enterprise managers can centrally manage an 2. User identity management Folder provides three roles: Folder Admin, Folder Editor, and Folder Viewer. 
- [View role permissions](../user-guide/access-control/role.md), you can grant different roles to users/groups in the same folder through [Authorization](../user-guide/access-control/role.md). + [View role permissions](../access-control/role.md), you can grant different roles to users/groups in the same folder through [Authorization](../access-control/role.md). 3. Role and permission mapping diff --git a/docs/en/docs/admin/ghippo/best-practice/gproduct/intro.md b/docs/en/docs/admin/ghippo/best-practice/gproduct/intro.md index 5ccc1c9f23..c2977bbbd9 100644 --- a/docs/en/docs/admin/ghippo/best-practice/gproduct/intro.md +++ b/docs/en/docs/admin/ghippo/best-practice/gproduct/intro.md @@ -5,7 +5,7 @@ hide: # How GProduct connects to global management -GProduct is the general term for all other modules in DCE 5.0 except the global management. These modules need to be connected with the global management before they can be added to DCE 5.0. +GProduct is the general term for all other modules in AI platform except the global management. These modules need to be connected with the global management before they can be added to AI platform. ## What to be docking diff --git a/docs/en/docs/admin/ghippo/best-practice/gproduct/nav.md b/docs/en/docs/admin/ghippo/best-practice/gproduct/nav.md index cf88d1142c..43a44f56dc 100644 --- a/docs/en/docs/admin/ghippo/best-practice/gproduct/nav.md +++ b/docs/en/docs/admin/ghippo/best-practice/gproduct/nav.md @@ -44,9 +44,9 @@ Refer to the following steps to dock the GProduct: The configuration for the global management navigation bar __category__ is stored in a ConfigMap and cannot be added through registration at present. Please contact the global management team to add it. -2. The `kpanda` front-end is integrated into the DCE 5.0 parent application `Anakin` as a micro-frontend. +2. The `kpanda` front-end is integrated into the AI platform parent application `Anakin` as a micro-frontend. - DCE 5.0 frontend uses [qiankun](https://qiankun.umijs.org) to connect the sub-applications UI. + AI platform frontend uses [qiankun](https://qiankun.umijs.org) to connect the sub-applications UI. See [getting started](https://qiankun.umijs.org/guide/getting-started). After registering the GProductNavigator CR, the corresponding registration information will be generated for the front-end parent application. For example, `kpanda` will generate the following registration information: diff --git a/docs/en/docs/admin/ghippo/best-practice/oem/custom-idp.md b/docs/en/docs/admin/ghippo/best-practice/oem/custom-idp.md index c450c7effe..09fb85dc7e 100644 --- a/docs/en/docs/admin/ghippo/best-practice/oem/custom-idp.md +++ b/docs/en/docs/admin/ghippo/best-practice/oem/custom-idp.md @@ -1,8 +1,8 @@ -# Customizing DCE 5.0 Integration with IdP +# Customizing AI platform Integration with IdP -Identity Provider (IdP): In DCE 5.0, when a client system needs to be used as the user source and user +Identity Provider (IdP): In AI platform, when a client system needs to be used as the user source and user authentication is performed through the client system's login interface, the client system is referred to -as the Identity Provider for DCE 5.0. +as the Identity Provider for AI platform. 
## Use Cases diff --git a/docs/en/docs/admin/ghippo/best-practice/oem/oem-in.md b/docs/en/docs/admin/ghippo/best-practice/oem/oem-in.md index 8cb33b5a83..0c79fcab5e 100644 --- a/docs/en/docs/admin/ghippo/best-practice/oem/oem-in.md +++ b/docs/en/docs/admin/ghippo/best-practice/oem/oem-in.md @@ -3,10 +3,10 @@ MTPE: WANG0608GitHub Date: 2024-08-15 --- -# Integrating Customer Systems into DCE 5.0 (OEM IN) +# Integrating Customer Systems into AI platform (OEM IN) -OEM IN refers to the partner's platform being embedded as a submodule in DCE 5.0, appearing in the primary -navigation bar of DCE 5.0. Users can log in and manage it uniformly through DCE 5.0. The implementation +OEM IN refers to the partner's platform being embedded as a submodule in AI platform, appearing in the primary +navigation bar of AI platform. Users can log in and manage it uniformly through AI platform. The implementation of OEM IN is divided into 5 steps: 1. [Unify Domain](#unify-domain-name-and-port) @@ -22,14 +22,14 @@ For specific operational demonstrations, refer to the [OEM IN Best Practices Vid The open source software Label Studio is used for nested demonstrations below. In actual scenarios, you need to solve the following issues in the customer system: - The customer system needs to add a Subpath to distinguish which services belong to DCE 5.0 + The customer system needs to add a Subpath to distinguish which services belong to AI platform and which belong to the customer system. ## Environment Preparation -1. Deploy the DCE 5.0 environment: +1. Deploy the AI platform environment: - `https://10.6.202.177:30443` as DCE 5.0 + `https://10.6.202.177:30443` as AI platform @@ -40,14 +40,14 @@ For specific operational demonstrations, refer to the [OEM IN Best Practices Vid Adjust the operations on the customer system during the application according to the actual situation. 1. Plan the Subpath path of the customer system: `http://10.6.202.177:30123/label-studio` (It is recommended to use a recognizable name as the Subpath, which should not conflict with - the HTTP router of the main DCE 5.0). Ensure that users can access the customer system through + the HTTP router of the main AI platform). Ensure that users can access the customer system through `http://10.6.202.177:30123/label-studio`. ## Unify Domain Name and Port -1. SSH into the DCE 5.0 server. +1. SSH into the AI platform server. ```bash ssh root@10.6.202.177 @@ -100,9 +100,9 @@ For specific operational demonstrations, refer to the [OEM IN Best Practices Vid http: - match: - uri: - exact: /label-studio # Change to the routing address of the customer system in the DCE5.0 Web UI entry + exact: /label-studio # Change to the routing address of the customer system in the Web UI entry - uri: - prefix: /label-studio/ # Change to the routing address of the customer system in the DCE5.0 Web UI entry + prefix: /label-studio/ # Change to the routing address of the customer system in the Web UI entry route: - destination: # Change to the value of spec.hosts in the ServiceEntry above @@ -146,10 +146,10 @@ For specific operational demonstrations, refer to the [OEM IN Best Practices Vid ## Integrate User Systems -Integrate the customer system with the DCE 5.0 platform through protocols like OIDC/OAUTH, -allowing users to enter the customer system without logging in again after logging into the DCE 5.0 platform. 
+Integrate the customer system with the AI platform through protocols like OIDC/OAUTH, +allowing users to enter the customer system without logging in again after logging into the AI platform. -1. In the scenario of two DCE 5.0, you can create SSO access through __Global Management__ -> __Access Control__ -> __Docking Portal__. +1. In the scenario of two AI platform instances, you can create SSO access through __Global Management__ -> __Access Control__ -> __Docking Portal__. @@ -162,7 +162,7 @@ allowing users to enter the customer system without logging in again after loggi 3. After integration, the customer system login page will display the OIDC (Custom) option. - Select to log in via OIDC the first time entering the customer system from the DCE 5.0 platform, + Select to log in via OIDC the first time entering the customer system from the AI platform, and subsequently, you will directly enter the customer system without selecting again. @@ -174,7 +174,7 @@ and embed the customer system into this empty shell application in the form of a 1. Download the gproduct-demo-main.tar.gz file and change the value of the src attribute in App-iframe.vue under the src folder (the user entering the customer system): - - The absolute address: `src="https://10.6.202.177:30443/label-studio" (DCE 5.0 address + Subpath)` + - The absolute address: `src="https://10.6.202.177:30443/label-studio" (AI platform address + Subpath)` - The relative address, such as `src="./external-anyproduct/insight"` ```html title="App-iframe.vue" @@ -244,7 +244,7 @@ and embed the customer system into this empty shell application in the form of a ... ``` -After integration, the __Customer System__ will appear in the primary navigation bar of DCE 5.0, +After integration, the __Customer System__ will appear in the primary navigation bar of AI platform, and clicking it will allow users to enter the customer system. @@ -253,28 +253,28 @@ and clicking it will allow users to enter the customer system. !!! note - DCE 5.0 supports customizing the appearance by writing CSS. How the customer system implements + AI platform supports customizing the appearance by writing CSS. How the customer system implements appearance customization in actual applications needs to be handled according to the actual situation. Log in to the customer system, and through __Global Management__ -> __Settings__ -> __Appearance__, you can customize platform background colors, logos, and names. For specific operations, please refer to -[Appearance Customization](../../user-guide/platform-setting/appearance.md). +[Appearance Customization](../../platform-setting/appearance.md). ## Integrate Permission System (Optional) **Method One:** -Customized teams can implement a customized module that DCE 5 will notify each user login event to +Customized teams can implement a customized module; AI platform will notify each user login event to the customized module via Webhook, and the customized module can call the [OpenAPI](https://docs.daocloud.io/openapi/index.html) -of AnyProduct and DCE 5.0 to synchronize the user's permission information. +of AnyProduct and AI platform to synchronize the user's permission information. **Method Two:** Through Webhook, notify AnyProduct of each authorization change (if required, it can be implemented later). 

-### Use Other Capabilities of DCE 5.0 in AnyProduct (Optional) +### Use Other Capabilities of AI platform in AnyProduct (Optional) -The method is to call the DCE 5.0 [OpenAPI](https://docs.daocloud.io/openapi/index.html). +The method is to call the AI platform [OpenAPI](https://docs.daocloud.io/openapi/index.html). ## References diff --git a/docs/en/docs/admin/ghippo/best-practice/oem/oem-out.md b/docs/en/docs/admin/ghippo/best-practice/oem/oem-out.md index b95d155dc1..bb5b544f86 100644 --- a/docs/en/docs/admin/ghippo/best-practice/oem/oem-out.md +++ b/docs/en/docs/admin/ghippo/best-practice/oem/oem-out.md @@ -3,10 +3,10 @@ MTPE: WANG0608GitHub Date: 2024-08-15 --- -# Integrate DCE 5.0 into Customer System (OEM OUT) +# Integrate AI platform into Customer System (OEM OUT) -OEM OUT refers to integrating DCE 5.0 as a sub-module into other products, appearing in their menus. -You can directly access DCE 5.0 without logging in again after logging into other products. +OEM OUT refers to integrating AI platform as a sub-module into other products, appearing in their menus. +You can directly access AI platform without logging in again after logging into other products. The OEM OUT integration involves 5 steps: 1. [Unify domain name](#unify-domain-name) @@ -19,9 +19,9 @@ For detailed instructions, refer to the [OEM OUT Best Practices video tutorial]( ## Unify Domain Name -1. Deploy DCE 5.0 (Assuming the access address after deployment is `https://10.6.8.2:30343/`). +1. Deploy AI platform (Assuming the access address after deployment is `https://10.6.8.2:30343/`). -2. To achieve cross-domain access between the customer system and DCE 5.0, you can use an nginx reverse proxy. +2. To achieve cross-domain access between the customer system and AI platform, you can use an nginx reverse proxy. Use the following example configuration in __vi /etc/nginx/conf.d/default.conf__ : ```nginx @@ -55,8 +55,8 @@ For detailed instructions, refer to the [OEM OUT Best Practices video tutorial]( ``` 3. Assuming the nginx entry address is 10.6.165.50, follow the - [Customize DCE 5.0 Reverse Proxy Server Address](../install/reverse-proxy.md) to - set the DCE_PROXY reverse proxy as `http://10.6.165.50/dce5`. Ensure that DCE 5.0 + [Customize AI platform Reverse Proxy Server Address](../install/reverse-proxy.md) to + set the AI_PROXY reverse proxy as `http://10.6.165.50/dce5`. Ensure that AI platform can be accessed via `http://10.6.165.50/dce5`. The customer system also needs to configure the reverse proxy based on its specific requirements. @@ -64,26 +64,26 @@ For detailed instructions, refer to the [OEM OUT Best Practices video tutorial]( ## User System Integration -Integrate the customer system with DCE 5.0 using protocols like OIDC/OAUTH, -allowing users to access DCE 5.0 without logging in again after logging into +Integrate the customer system with AI platform using protocols like OIDC/OAUTH, +allowing users to access AI platform without logging in again after logging into the customer system. Fill in the OIDC information of the customer system in __Global Management__ -> __Access Control__ -> __Identity Provider__ . -After integration, the DCE 5.0 login page will display the OIDC (custom) option. -When accessing DCE 5.0 from the customer system for the first time, -select OIDC login, and subsequent logins will directly enter DCE 5.0 without needing to choose again. +After integration, the AI platform login page will display the OIDC (custom) option. 
+When accessing AI platform from the customer system for the first time, +select OIDC login, and subsequent logins will directly enter AI platform without needing to choose again. ## Navigation Bar Integration -Navigation bar integration means adding DCE 5.0 to the menu of the customer system. -You can directly access DCE 5.0 by clicking the proper menu item. The navigation bar +Navigation bar integration means adding AI platform to the menu of the customer system. +You can directly access AI platform by clicking the proper menu item. The navigation bar integration depends on the customer system and needs to be handled based on specific circumstances. ## Customizie Appearance Use __Global Management__ -> __Settings__ -> __Appearance__ to customize the platform's background color, logo, and name. For detailed instructions, -refer to [Appearance Customization](../user-guide/platform-setting/appearance.md). +refer to [Appearance Customization](../platform-setting/appearance.md). ## Permission System Integration (optional) diff --git a/docs/en/docs/admin/ghippo/best-practice/super-group.md b/docs/en/docs/admin/ghippo/best-practice/super-group.md index df22dbd9a7..6e943ee889 100644 --- a/docs/en/docs/admin/ghippo/best-practice/super-group.md +++ b/docs/en/docs/admin/ghippo/best-practice/super-group.md @@ -20,7 +20,7 @@ The specific operational steps are as follows: 3. Create Users/Integrate User Systems - The main platform administrator Admin can [create users](../user-guide/access-control/user.md) on the platform or integrate users through LDAP/OIDC/OAuth2.0 and other [identity providers](../user-guide/access-control/ldap.md) to DCE 5.0. + The main platform administrator Admin can [create users](../access-control/user.md) on the platform or integrate users through LDAP/OIDC/OAuth2.0 and other [identity providers](../access-control/ldap.md) to AI platform. 4. Create Folder Roles diff --git a/docs/en/docs/admin/ghippo/best-practice/system-message.md b/docs/en/docs/admin/ghippo/best-practice/system-message.md index b7f7a8963e..ec6cb475fd 100644 --- a/docs/en/docs/admin/ghippo/best-practice/system-message.md +++ b/docs/en/docs/admin/ghippo/best-practice/system-message.md @@ -1,12 +1,12 @@ # System Messages System messages are used to notify all users, similar to system announcements, and will be displayed -at the top bar of the DCE 5.0 UI at specific times. +at the top bar of the AI platform UI at specific times. ## Configure System Messages You can create a system message by applying the YAML for the system message in the -[Cluster Roles](../../kpanda/user-guide/clusters/cluster-role.md). The display time of the message is determined by +[Cluster Roles](../../kpanda/clusters/cluster-role.md). The display time of the message is determined by the time fields in the YAML. System messages will only be displayed within the time range configured by the start and end fields. 
diff --git a/docs/en/docs/admin/ghippo/best-practice/ws-best-practice.md b/docs/en/docs/admin/ghippo/best-practice/ws-best-practice.md index e301e8b9fd..219b2b05da 100644 --- a/docs/en/docs/admin/ghippo/best-practice/ws-best-practice.md +++ b/docs/en/docs/admin/ghippo/best-practice/ws-best-practice.md @@ -15,7 +15,7 @@ A workspace consists of three features: authorization, resource groups, and shar Best practice: When ordinary users want to use Workbench, microservice engine, service mesh, and middleware module features, or need to have permission to use container management and some resources in the service mesh, the administrator needs to grant the workspace permissions (Workspace Admin, Workspace Edit, Workspace View). The administrator here can be the Admin role, the Workspace Admin role of the workspace, or the Folder Admin role above the workspace. - See [Relationship between Folder and Workspace](../user-guide/workspace/ws-folder.md). + See [Relationship between Folder and Workspace](../workspace/ws-folder.md). 2. Resource group: Resource group and shared resource are two resource management modes of the workspace. @@ -48,27 +48,17 @@ A workspace consists of three features: authorization, resource groups, and shar | Department Administrator B | Workspace Admin | CPU 100 cores | CPU 100 cores | | Other Members of the Department | Namesapce Admin
Namesapce Edit
Namesapce View | Assign as Needed | Assign as Needed | -## The effect of the workspace on the DCE module +## The effect of the workspace on the AI platform -1. Module name: [Workbench](../../amamba/intro/index.md), [Microservice Engine](../../skoala/intro/index.md), [Service Mesh](../../mspider/intro/index.md), [Middleware](../../middleware/index.md) +Module name: [Container Management](../../kpanda/intro/index.md) - The premise of entering the above modules is to have the permission of a certain workspace, so you must have the Admin role or have certain role permissions of a certain workspace before using the module features. +Due to the particularity of functional modules, resources created in the container management module will not be automatically bound to a certain workspace. - - The roles of the workspace are automatically applied to the resources contained in the workspace. For example, if you have the Workspace Admin role of workspace A, then you are the Admin role for all resources in this workspace; - - If you are a Workspace Edit, you are the Edit role for all resources in the workspace; - - If you are Workspace View, you are View role for all resources in the workspace. +If you need to perform unified authorization management on people and resources through workspaces, you can manually bind the required resources to a certain workspace, to apply the roles of users in this workspace to resources (resources here can be cross-clustered). - In addition, the resources you create in these modules will also be automatically bound to the corresponding workspace without any additional operations. +In addition, container management and service mesh differ slightly in their resource binding entries. The workspace provides binding entries for Cluster and Cluster-Namespace resources in container management, but has not yet opened binding entries for the service mesh's Mesh and Mesh-Namespace resources. -2. Module name: [Container Management](../../kpanda/intro/index.md), [Service Mesh](../../mspider/intro/index.md) - - Due to the particularity of functional modules, resources created in the container management module will not be automatically bound to a certain workspace. - - If you need to perform unified authorization management on people and resources through workspaces, you can manually bind the required resources to a certain workspace, to apply the roles of users in this workspace to resources (resources here can be cross- clustered). - - In addition, there is a slight difference between container management and service mesh in terms of resource binding entry. The workspace provides the binding entry of Cluster and Cluster-Namespace resources in container management, but has not opened the Mesh and Mesh-Namespace for service mesh. Bindings for Namespace resources. - - For Mesh and Mesh-Namespace resources, you can manually bind them in the resource list of the service mesh. +For Mesh and Mesh-Namespace resources, you can manually bind them in the resource list of the service mesh. 

## Use Cases of Workspace diff --git a/docs/en/docs/admin/ghippo/best-practice/ws-to-ns.md b/docs/en/docs/admin/ghippo/best-practice/ws-to-ns.md index 27de607a13..b86f12daed 100644 --- a/docs/en/docs/admin/ghippo/best-practice/ws-to-ns.md +++ b/docs/en/docs/admin/ghippo/best-practice/ws-to-ns.md @@ -42,11 +42,11 @@ classDef cluster fill:#fff,stroke:#bbb,stroke-width:1px,color:#326ce5; class preparews, preparens, createns, nstows, wsperm cluster; class judge plain -click preparews "https://docs.daocloud.io/ghippo/user-guide/workspace/ws-to-ns-across-clus/#_3" -click prepares "https://docs.daocloud.io/ghippo/user-guide/workspace/ws-to-ns-across-clus/#_4" -click nstows "https://docs.daocloud.io/ghippo/user-guide/workspace/ws-to-ns-across-clus/#_5" -click wsperm "https://docs.daocloud.io/ghippo/user-guide/workspace/ws-to-ns-across-clus/#_6" -click creates "https://docs.daocloud.io/ghippo/user-guide/workspace/ws-to-ns-across-clus/#_4" +click preparews "https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_3" +click prepares "https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_4" +click nstows "https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_5" +click wsperm "https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_6" +click creates "https://docs.daocloud.io/ghippo/workspace/ws-to-ns-across-clus/#_4" ``` !!! tip @@ -58,7 +58,7 @@ click creates "https://docs.daocloud.io/ghippo/user-guide/workspace/ws-to-ns-acr In order to meet the multi-tenant use cases, the workspace forms an isolated resource environment based on multiple resources such as clusters, cluster namespaces, meshs, mesh namespaces, multicloud, and multicloud namespaces. Workspaces can be mapped to various concepts such as projects, tenants, enterprises, and suppliers. -1. Log in to DCE 5.0 as a user with the admin/folder admin role, and click __Global Management__ at the bottom of the left navigation bar. +1. Log in to AI platform as a user with the admin/folder admin role, and click __Global Management__ at the bottom of the left navigation bar. ![Global Management](../../images/ws01.png) @@ -96,7 +96,7 @@ Follow the steps below to prepare a namespace that is not yet bound to any works !!! info - Workspaces are primarily used to divide groups of resources and grant users (groups of users) different access rights to that resource. For a detailed description of the workspace, please refer to [Workspace and Folder](../user-guide/workspace/workspace.md). + Workspaces are primarily used to divide groups of resources and grant users (groups of users) different access rights to that resource. For a detailed description of the workspace, please refer to [Workspace and Folder](../workspace/workspace.md). ![fill](../../images/ns02.png) diff --git a/docs/en/docs/admin/ghippo/permissions/amamba.md b/docs/en/docs/admin/ghippo/permissions/amamba.md deleted file mode 100644 index 080a23f6dc..0000000000 --- a/docs/en/docs/admin/ghippo/permissions/amamba.md +++ /dev/null @@ -1,72 +0,0 @@ ---- -hide: - - toc ---- - -# Workbench Permissions - -[Workbench](../../amamba/intro/index.md) supports three user roles: - -- Workspace Admin -- Workspace Editor -- Workspace Viewer - -Each role has different permissions, which are described below. 
- - - -| Menu Objects | Actions | Workspace Admin | Workspace Editor | Workspace Viewer | -| -------- | ----------- | -------------- | ---------------- | ---------------- | -| Apps | View App List | ✓ | ✓ | ✓ | -| | View details (jump to container management) | ✓ | ✓ | ✓ | -| | View application logs (jump to observables) | ✓ | ✓ | ✓ | -| | View application monitoring (jump to observables) | ✓ | ✓ | ✓ | -| | View RabbitMQ Details - Basic Information | ✓ | ✓ | ✓ | -| | View service mesh (jump to service mesh) | ✓ | ✓ | ✓ | -| | View microservice engine (jump to microservice engine) | ✓ | ✓ | ✓ | -| | Create App | ✓ | ✓ | ✗ | -| | Edit YAML | ✓ | ✓ | ✗ | -| | Update replica count | ✓ | ✓ | ✗ | -| | Update container image | ✓ | ✓ | ✗ | -| | Edit Pipeline | ✓ | ✓ | ✗ | -| | App Grouping | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| Namespace | View | ✓ | ✓ | ✓ | -| | Create | ✓ | ✗ | ✗ | -| | Edit Tab | ✓ | ✗ | ✗ | -| | Edit Resource Quota | ✓ | ✗ | ✗ | -| | delete | ✓ | ✗ | ✗ | -| Pipeline | View Pipeline | ✓ | ✓ | ✓ | -| | View running records | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | run | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| | Copy | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | cancel run | ✓ | ✓ | ✗ | -| Credentials | View | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| Continuous Deployment | View | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | sync | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✓ | -| Code repository | View | ✓ | ✓ | ✗ | -| | import | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| Grayscale release | View | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | Post | ✓ | ✓ | ✗ | -| | Continue posting | ✓ | ✓ | ✗ | -| | End of publication | ✓ | ✓ | ✗ | -| | Update | ✓ | ✓ | ✗ | -| | rollback | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | - -!!! note - - For a complete introduction to role and access management, please refer to [Role and Access Management](../access-control/role.md). \ No newline at end of file diff --git a/docs/en/docs/admin/ghippo/permissions/mcamel.md b/docs/en/docs/admin/ghippo/permissions/mcamel.md deleted file mode 100644 index 4f9739814d..0000000000 --- a/docs/en/docs/admin/ghippo/permissions/mcamel.md +++ /dev/null @@ -1,88 +0,0 @@ -# Middleware Permissions - -[Middleware](../../middleware/index.md) includes selected middleware: -MySQL, Redis, MongoDB, PostgreSQL, Elasticsearch, Kafka, RabbitMQ, RocketMQ, MinIO. -Middleware supports three user roles: - -- Workspace Admin -- Workspace Editor -- Workspace Viewer - -Each role has different permissions, which are described below. 
- - - -## Middleware Permissions - -| Middleware Modules | Menu Objects | Actions | Workspace Admin | Workspace Editor | Workspace Viewer | -| ------- | ----------------- | ------- | --------------- | ---------------- | --------------- | -| MySQL | MySQL Instance List | View List | ✓ | ✓ | ✓ | -| | | instance name search | ✓ | ✓ | ✓ | -| | | Create instance | ✓ | ✓ | ✗ | -| | | Update Instance Configuration | ✓ | ✓ | ✗ | -| | | delete instance | ✓ | ✗ | ✗ | -| | MySQL Instance Details | Instance Overview | ✓ | ✓ | ✓ | -| | | Instance Monitoring | ✓ | ✓ | ✓ | -| | | View Instance Configuration Parameters | ✓ | ✓ | ✓ | -| | | Modify instance configuration parameters | ✓ | ✓ | ✗ | -| | | View Instance Access Password | ✓ | ✓ | ✗ | -| | | View instance backup list | ✓ | ✓ | ✓ | -| | | instance creation backup | ✓ | ✓ | ✗ | -| | | Instance modification automatic backup task | ✓ | ✓ | ✗ | -| | | Create new instance with backup | ✓ | ✓ | ✗ | -| | Backup configuration management | Backup configuration list | ✓ | ✗ | ✗ | -| | | Create Backup Configuration | ✓ | ✗ | ✗ | -| | | Modify backup configuration | ✓ | ✗ | ✗ | -| | | delete backup configuration | ✓ | ✗ | ✗ | -| RabbitMQ | RabbitMQ Instance List | View List | ✓ | ✓ | ✓ | -| | | instance name search | ✓ | ✓ | ✓ | -| | | Create instance | ✓ | ✓ | ✗ | -| | | Update Instance Configuration | ✓ | ✓ | ✗ | -| | | delete instance | ✓ | ✗ | ✗ | -| | RabbitMQ Instance Details | Instance Overview | ✓ | ✓ | ✓ | -| | | Instance Monitoring | ✓ | ✓ | ✓ | -| | | View Instance Configuration Parameters | ✓ | ✓ | ✓ | -| | | Modify instance configuration parameters | ✓ | ✓ | ✗ | -| | | View Instance Access Password | ✓ | ✓ | ✗ | -| Elasticsearch | Elasticsearch Instance List | View List | ✓ | ✓ | ✓ | -| | | instance name search | ✓ | ✓ | ✓ | -| | | Create instance | ✓ | ✓ | ✗ | -| | | Update Instance Configuration | ✓ | ✓ | ✗ | -| | | delete instance | ✓ | ✗ | ✗ | -| | Elasticsearch Instance Details | Instance Overview | ✓ | ✓ | ✓ | -| | | Instance Monitoring | ✓ | ✓ | ✓ | -| | | View Instance Configuration Parameters | ✓ | ✓ | ✓ | -| | | Modify instance configuration parameters | ✓ | ✓ | ✗ | -| | | View Instance Access Password | ✓ | ✓ | ✗ | -| Redis | Redis Instance List | View List | ✓ | ✓ | ✓ | -| | | instance name search | ✓ | ✓ | ✓ | -| | | Create instance | ✓ | ✓ | ✗ | -| | | Update Instance Configuration | ✓ | ✓ | ✗ | -| | | delete instance | ✓ | ✗ | ✗ | -| | Redis Instance Details | Instance Overview | ✓ | ✓ | ✓ | -| | | Instance Monitoring | ✓ | ✓ | ✓ | -| | | View Instance Configuration Parameters | ✓ | ✓ | ✓ | -| | | Modify instance configuration parameters | ✓ | ✓ | ✗ | -| | | View Instance Access Password | ✓ | ✓ | ✗ | -| Kafka | Kafka instance list | View list | ✓ | ✓ | ✓ | -| | | instance name search | ✓ | ✓ | ✓ | -| | | Create instance | ✓ | ✓ | ✗ | -| | | Update Instance Configuration | ✓ | ✓ | ✗ | -| | | delete instance | ✓ | ✗ | ✗ | -| | Kafka Instance Details | Instance Overview | ✓ | ✓ | ✓ | -| | | Instance Monitoring | ✓ | ✓ | ✓ | -| | | View Instance Configuration Parameters | ✓ | ✓ | ✓ | -| | | Modify instance configuration parameters | ✓ | ✓ | ✗ | -| | | View Instance Access Password | ✓ | ✓ | ✗ | -| MinIO | MinIO Instance List | View List | ✓ | ✓ | ✓ | -| | | instance name search | ✓ | ✓ | ✓ | -| | | Create instance | ✓ | ✓ | ✗ | -| | | Update Instance Configuration | ✓ | ✓ | ✗ | -| | | delete instance | ✓ | ✗ | ✗ | -| | MinIO Instance Details | Instance Overview | ✓ | ✓ | ✓ | -| | | Instance Monitoring | ✓ | ✓ | ✓ | -| | | View 
Instance Configuration Parameters | ✓ | ✓ | ✓ | -| | | Modify instance configuration parameters | ✓ | ✓ | ✗ | -| | | View Instance Access Password | ✓ | ✓ | ✗ | \ No newline at end of file diff --git a/docs/en/docs/admin/ghippo/permissions/mspider.md b/docs/en/docs/admin/ghippo/permissions/mspider.md deleted file mode 100644 index 7dc6c8a495..0000000000 --- a/docs/en/docs/admin/ghippo/permissions/mspider.md +++ /dev/null @@ -1,96 +0,0 @@ ---- -hide: - - toc ---- - -# Service Mesh Permissions - -[Service Mesh](../../mspider/intro/index.md) supports several user roles: - -- Admin -- Workspace Admin -- Workspace Editor -- Workspace Viewer - - - -The specific permissions for each role are shown in the following table. - -| Menu Object | Action | Admin | Workspace Admin | Workspace Editor | Workspace Viewer | -| ----------- | ------ | ----- | --------------- | ---------------- | ---------------- | -| Service Mesh List | [Create Mesh](../../mspider/user-guide/service-mesh/README.md) | ✓ | ✗ | ✗ | ✗ | -| | Edit Mesh | ✓ | ✓ | ✗ | ✗ | -| | [Delete Mesh](../../mspider/user-guide/service-mesh/delete.md) | ✓ | ✗ | ✗ | ✗ | -| | [View Mesh](../../mspider/user-guide/service-mesh/README.md) | ✓ | ✓ | ✓ | ✓ | -| Mesh Overview | View | ✓ | ✓ | ✓ | ✓ | -| Service List | View | ✓ | ✓ | ✓ | ✓ | -| | Create VM | ✓ | ✓ | ✓ | ✗ | -| | Delete VM | ✓ | ✓ | ✓ | ✗ | -| Service Entry | Create | ✓ | ✓ | ✓ | ✗ | -| | Edit | ✓ | ✓ | ✓ | ✗ | -| | Delete | ✓ | ✓ | ✓ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Virtual Service | Create | ✓ | ✓ | ✓ | ✗ | -| | Edit | ✓ | ✓ | ✓ | ✗ | -| | Delete | ✓ | ✓ | ✓ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Destination Rule | Create | ✓ | ✓ | ✓ | ✗ | -| | Edit | ✓ | ✓ | ✓ | ✗ | -| | Delete | ✓ | ✓ | ✓ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Gateway Rule | Create | ✓ | ✓ | ✓ | ✗ | -| | Edit | ✓ | ✓ | ✓ | ✗ | -| | Delete | ✓ | ✓ | ✓ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Peer Authentication | Create | ✓ | ✓ | ✓ | ✗ | -| | Edit | ✓ | ✓ | ✓ | ✗ | -| | Delete | ✓ | ✓ | ✓ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Request Authentication | Create | ✓ | ✓ | ✓ | ✗ | -| | Edit | ✓ | ✓ | ✓ | ✗ | -| | Delete | ✓ | ✓ | ✓ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Authorization Policy | Create | ✓ | ✓ | ✓ | ✗ | -| | Edit | ✓ | ✓ | ✓ | ✗ | -| | Delete | ✓ | ✓ | ✓ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Namespace Sidecar Management | Enable Injection | ✓ | ✓ | ✓ | ✗ | -| | Disable Injection | ✓ | ✓ | ✓ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| | Sidecar Service Discovery Scope | ✓ | ✓ | ✓ | ✗ | -| Workload Sidecar Management | Enable Injection | ✓ | ✓ | ✓ | ✗ | -| | Disable Injection | ✓ | ✓ | ✓ | ✗ | -| | Configure Sidecar Resources | ✓ | ✓ | ✓ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Global Sidecar Injection | Enable Injection | ✓ | ✓ | ✗ | ✗ | -| | Disable Injection | ✓ | ✓ | ✗ | ✗ | -| | Configure Sidecar Resources | ✓ | ✓ | ✗ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Cluster Management (for Hosted Mesh only) | Join Cluster | ✓ | ✓ | ✗ | ✗ | -| | Leave Cluster | ✓ | ✓ | ✗ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Mesh Gateway Management | Create | ✓ | ✓ | ✗ | ✗ | -| | Edit | ✓ | ✓ | ✓ | ✗ | -| | Delete | ✓ | ✓ | ✗ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Istio Resource Management | Create | ✓ | ✓ | ✗ | ✗ | -| | Edit | ✓ | ✓ | ✓ | ✗ | -| | Delete | ✓ | ✓ | ✗ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| TLS Certificate Management | Create | ✓ | ✓ | ✓ | ✗ | -| | Edit | ✓ | ✓ | ✓ | ✗ | -| | Delete | ✓ | ✓ | ✓ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Multicloud Network Interconnection | Enable | ✓ | ✓ | ✗ | ✗ | -| | View | ✓ | ✓ | ✗ | ✗ | -| | Edit | ✓ | ✓ | ✗ | ✗ | 
-| | Delete | ✓ | ✓ | ✗ | ✗ | -| | Disable | ✓ | ✓ | ✗ | ✗ | -| System Upgrade | Istio Upgrade | ✓ | ✓ | ✗ | ✗ | -| | Sidecar Upgrade | ✓ | ✓ | ✗ | ✗ | -| | View | ✓ | ✓ | ✓ | ✓ | -| Workspace Management | Bind | ✓ | ✗ | ✗ | ✗ | -| | Unbind | ✓ | ✗ | ✗ | ✗ | -| | View | ✓ | ✗ | ✗ | ✗ | diff --git a/docs/en/docs/admin/ghippo/permissions/skoala.md b/docs/en/docs/admin/ghippo/permissions/skoala.md deleted file mode 100644 index c0d6891af7..0000000000 --- a/docs/en/docs/admin/ghippo/permissions/skoala.md +++ /dev/null @@ -1,125 +0,0 @@ -# Microservice Engine Permissions - -[Microservice engine](../../skoala/intro/index.md) includes two parts: microservice management center and microservice gateway. The microservice engine supports three user roles: - -- Workspace Admin -- Workspace Editor -- Workspace Viewer - -Each role has different permissions, which are described below. - - - -## Microservice Governance Permissions - -| Menu Objects | Actions | Workspace Admin | Workspace Editor | Workspace Viewer | -| ------------ | ------- | --------------- | ---------------- | ---------------- | -| Hosted Registry List | View List | ✓ | ✓ | ✓ | -| Hosted Registry | View Basic Information | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | restart | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✗ | ✗ | -| | On/Off | ✓ | ✓ | ✗ | -| Microservice Namespace | View | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| Microservice List | View | ✓ | ✓ | ✓ | -| | filter namespace | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | Governance | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| Service Governance Rules-Sentinel | View | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| Service Governance Rules-Mesh | Governance | ✓ | ✓ | ✗ | -| Instance List | View | ✓ | ✓ | ✓ | -| | On/Off | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| Service Governance Policy-Sentinel | View | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| Service Governance Policy-Mesh | View | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | Create with YAML | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | YAML Editing | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| Microservice Configuration List | View | ✓ | ✓ | ✓ | -| | filter namespace | ✓ | ✓ | ✓ | -| | Batch delete | ✓ | ✓ | ✗ | -| | Export/Import | ✓ | ✓ | ✗ | -| | Create | ✓ | ✓ | ✗ | -| | Clone | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | History query | ✓ | ✓ | ✓ | -| | rollback | ✓ | ✓ | ✗ | -| | listen query | ✓ | ✓ | ✓ | -| Business Monitor | View | ✓ | ✓ | ✓ | -| Resource Monitor | View | ✓ | ✓ | ✓ | -| Request Log | View | ✓ | ✓ | ✓ | -| Instance Log | View | ✓ | ✓ | ✓ | -| Plugin Center | View | ✓ | ✓ | ✓ | -| | Open | ✓ | ✓ | ✗ | -| | Close | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | View Details | ✓ | ✓ | ✓ | -| access registry list | view | ✓ | ✓ | ✓ | -| | Access | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | Remove | ✓ | ✗ | ✗ | -| Microservices | View List | ✓ | ✓ | ✓ | -| | View Details | ✓ | ✓ | ✓ | -| | Governance | ✓ | ✓ | ✗ | -| Service Governance Policy-Mesh | View | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | Create with YAML | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | YAML Editing | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | - -## Microservice Gateway Permissions - -| Objects | Actions | Workspace Admin | Workspace Editor | Workspace Viewer | -| ------- | ---- | --------------- | ---------------- | ---------------- | -| Gateway List | View | ✓ | ✓ | ✓ | -| Gateway instance | View | ✓ | ✓ | ✓ 
| -| | Create | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✗ | ✗ | -| Diagnostic Mode | View | ✓ | ✓ | ✓ | -| | debug | ✓ | ✓ | ✗ | -| Service List | View | ✓ | ✓ | ✓ | -| | Add | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| Service Details | View | ✓ | ✓ | ✓ | -| Service Source Management | View | ✓ | ✓ | ✓ | -| | Add | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| API List | View | ✓ | ✓ | ✓ | -| | Create | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✗ | -| Request Log | View | ✓ | ✓ | ✓ | -| Instance Log | View | ✓ | ✓ | ✓ | -| Plugin Center | View | ✓ | ✓ | ✓ | -| | enable | ✓ | ✓ | ✗ | -| | disabled | ✓ | ✓ | ✗ | -| Plugin Configuration | View | ✓ | ✓ | ✓ | -| | enable | ✓ | ✓ | ✗ | -| Domain List | View | ✓ | ✓ | ✓ | -| | Add | ✓ | ✓ | ✗ | -| | edit | ✓ | ✓ | ✗ | -| | delete | ✓ | ✓ | ✓ | -| Monitor Alert | View | ✓ | ✓ | ✓ | - -!!! note - - For a complete introduction to role and access management, please refer to [Role and Access Management](../access-control/role.md). \ No newline at end of file diff --git a/docs/en/docs/admin/ghippo/personal-center/security-setting.md b/docs/en/docs/admin/ghippo/personal-center/security-setting.md index 2ba9b407ad..66d1af4950 100644 --- a/docs/en/docs/admin/ghippo/personal-center/security-setting.md +++ b/docs/en/docs/admin/ghippo/personal-center/security-setting.md @@ -14,8 +14,8 @@ The specific operation steps are as follows: 1. Click the username in the upper right corner and select __Personal Center__ . - ![Personal Center](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/lang01.png) + ![Personal Center](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/lang01.png) 2. Click the __Security Settings__ tab. Fill in your email address or change the login password. - ![Security Settings](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/security01.png) + ![Security Settings](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/security01.png) diff --git a/docs/en/docs/admin/ghippo/troubleshooting/ghippo03.md b/docs/en/docs/admin/ghippo/troubleshooting/ghippo03.md index a7f6fec143..dc40b3c405 100644 --- a/docs/en/docs/admin/ghippo/troubleshooting/ghippo03.md +++ b/docs/en/docs/admin/ghippo/troubleshooting/ghippo03.md @@ -1,6 +1,6 @@ # Keycloak Unable to Start -*[Ghippo]: The dev code for DCE 5.0 Global Management +*[Ghippo]: The dev code for AI platform Global Management ## Common Issues diff --git a/docs/en/docs/admin/ghippo/workspace/folders.md b/docs/en/docs/admin/ghippo/workspace/folders.md index bbcbc0ef8f..e552ffc12f 100644 --- a/docs/en/docs/admin/ghippo/workspace/folders.md +++ b/docs/en/docs/admin/ghippo/workspace/folders.md @@ -16,7 +16,7 @@ Follow the steps below to create a folder: 2. Click the __Create Folder__ button in the top right corner. - ![Create Folder](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/user-guide/images/ws02.png) + ![Create Folder](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/ws02.png) 3. Fill in the folder name, parent folder, and other information, then click __OK__ to complete creating the folder. 
diff --git a/docs/en/docs/admin/ghippo/workspace/quota.md b/docs/en/docs/admin/ghippo/workspace/quota.md index 90c328194a..c172fa7cf8 100644 --- a/docs/en/docs/admin/ghippo/workspace/quota.md +++ b/docs/en/docs/admin/ghippo/workspace/quota.md @@ -29,7 +29,7 @@ Cluster resources in both shared resources and resource groups are derived from Users/User groups in the workspace will have full management and usage permissions for the cluster. Workspace Admin will be mapped as Cluster Admin. - Workspace Admin can access the [Container Management module](../../../kpanda/user-guide/permissions/permission-brief.md) + Workspace Admin can access the [Container Management module](../../../kpanda/permissions/permission-brief.md) to manage the cluster. ![Resource Group](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/quota01.png) diff --git a/docs/en/docs/admin/ghippo/workspace/ws-permission.md b/docs/en/docs/admin/ghippo/workspace/ws-permission.md index 1bc938ea19..f325bdb13d 100644 --- a/docs/en/docs/admin/ghippo/workspace/ws-permission.md +++ b/docs/en/docs/admin/ghippo/workspace/ws-permission.md @@ -51,4 +51,4 @@ Generally applicable to the following two use cases: - [Service Mesh Permissions](../../permissions/mspider.md) - [Middleware permissions](../../permissions/mcamel.md) - [Microservice Engine Permissions](../../permissions/skoala.md) - - [Container Management Permissions](../../../kpanda/user-guide/permissions/permission-brief.md) \ No newline at end of file + - [Container Management Permissions](../../../kpanda/permissions/permission-brief.md) \ No newline at end of file diff --git a/docs/en/docs/admin/ghippo/workspace/wsbind-permission.md b/docs/en/docs/admin/ghippo/workspace/wsbind-permission.md index 9b28ad7915..45e89e922f 100644 --- a/docs/en/docs/admin/ghippo/workspace/wsbind-permission.md +++ b/docs/en/docs/admin/ghippo/workspace/wsbind-permission.md @@ -10,7 +10,7 @@ the [Workspace Admin role](../access-control/role.md#workspace-role-authorizatio which includes the [Workspace's "Resource Binding" Permissions](./ws-permission.md#description-of-workspace-permissions), and wants to bind a specific cluster or namespace to the workspace. To bind cluster/namespace resources to a workspace, not only the [workspace's "Resource Binding" permissions](./ws-permission.md#description-of-workspace-permissions) are required, -but also the permissions of [Cluster Admin](../../../kpanda/user-guide/permissions/permission-brief.md#cluster-admin). +but also the permissions of [Cluster Admin](../../../kpanda/permissions/permission-brief.md#cluster-admin). ## Granting Authorization to John diff --git a/docs/en/docs/admin/host/createhost.md b/docs/en/docs/admin/host/createhost.md index 801c8dc7c7..827a468d9b 100644 --- a/docs/en/docs/admin/host/createhost.md +++ b/docs/en/docs/admin/host/createhost.md @@ -1,41 +1,42 @@ -# 创建和启动云主机 +# Create and Start a Cloud Host -用户完成注册,为其分配了工作空间、命名空间和资源后,即可以创建并启动云主机。 +After the user completes registration and is assigned a workspace, namespace, and resources, they can create and start a cloud host. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- [用户已成功注册](../register/index.md) -- [为用户绑定了工作空间](../register/bindws.md) -- [为工作空间分配了资源](../register/wsres.md) +- AI platform installed +- [User has successfully registered](../register/index.md) +- [Workspace has been bound to the user](../register/bindws.md) +- [Resources have been allocated to the workspace](../register/wsres.md) -## 操作步骤 +## Steps -1. 用户登录 AI 算力平台 -1. 点击 **创建云主机** -> **通过模板创建** +1. 
User logs into the AI platform. +2. Click **Create Cloud Host** -> **Create from Template** ![create](../images/host01.png) -1. 定义的云主机各项配置后点击 **下一步** +3. After defining all configurations for the cloud host, click **Next** - === "基本配置" + === "Basic Configuration" ![basic](../images/host02.png) - === "模板配置" + === "Template Configuration" ![template](../images/host03.png) - === "存储与网络" + === "Storage and Network" ![storage](../images/host05.png) -1. 配置 root 密码或 ssh 密钥后点击 **确定** +4. After configuring the root password or SSH key, click **Confirm** ![pass](../images/host06.png) -1. 返回主机列表,等待状态变为 **运行中** 之后,可以通过右侧的 **┇** 启动主机。 +5. Return to the host list and wait for the status to change to **Running**. + After that, you can start the host by clicking the **┇** on the right side. ![pass](../images/host07.png) -下一步:[使用云主机](./usehost.md) +Next step: [Use the Cloud Host](./usehost.md) diff --git a/docs/en/docs/admin/host/usehost.md b/docs/en/docs/admin/host/usehost.md index ffba6ebe30..7748d0d962 100644 --- a/docs/en/docs/admin/host/usehost.md +++ b/docs/en/docs/admin/host/usehost.md @@ -1,31 +1,31 @@ -# 使用云主机 +# Using Cloud Host -创建并启动云主机之后,用户就可以开始使用云主机。 +After creating and starting the cloud host, users can begin using it. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- [用户已创建并启动云主机](./createhost.md) +- AI platform is installed +- [User has created and started a cloud host](./createhost.md) -## 操作步骤 +## Steps to Follow -1. 以管理员身份登录 AI 算力平台 -1. 导航到 **容器管理** -> **容器网络** -> **服务** ,点击服务的名称,进入服务详情页,在右上角点击 **更新** +1. Log in to the AI platform as an administrator. +2. Navigate to **Container Management** -> **Container Network** -> **Services**, click on the service name to enter the service details page, and click **Update** at the top right corner. ![service](../images/usehost01.png) -1. 更改端口范围为 30900-30999,但不能冲突。 +3. Change the port range to 30900-30999, ensuring there are no conflicts. ![port](../images/usehost02.png) -1. 以终端用户登录 AI 算力平台,导航到对应的服务,查看访问端口。 +4. Log in to the AI platform as an end user, navigate to the corresponding service, and check the access port. ![port](../images/usehost03.png) -1. 在外网使用 SSH 客户端登录云主机 +5. Use an SSH client to log in to the cloud host from the external network. ![ssh](../images/usehost04.png) -1. 至此,你可以在云主机上执行各项操作。 +6. At this point, you can perform various operations on the cloud host. -下一步:[云资源共享:配额管理](../share/quota.md) +Next step: [Cloud Resource Sharing: Quota Management](../share/quota.md) diff --git a/docs/en/docs/admin/insight/alert-center/alert-template.md b/docs/en/docs/admin/insight/alert-center/alert-template.md index a3a2e8671f..3810e9da6e 100644 --- a/docs/en/docs/admin/insight/alert-center/alert-template.md +++ b/docs/en/docs/admin/insight/alert-center/alert-template.md @@ -14,13 +14,13 @@ Alert thresholds based on actual environment conditions. 1. In the navigation bar, select **Alert** -> **Alert Policy**, and click **Alert Template** at the top. - ![Alert Template](../../user-guide/images/template01.png){ width=1000px} + ![Alert Template](../../images/template01.png){ width=1000px} 2. Click **Create Alert Template**, and set the name, description, and other information for the Alert template. 
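For the cloud host SSH access steps above (external login through the 30900-30999 NodePort range), a minimal connection sketch may help; the node address, port number, and key path below are placeholders rather than values from this guide:

```bash
# Replace the placeholders with a node's externally reachable IP and the
# NodePort (within 30900-30999) shown on the service details page.
ssh root@<node-external-ip> -p <nodeport>

# If an SSH key was configured when creating the cloud host,
# point to the matching private key instead of using a password.
ssh -i ~/.ssh/id_rsa root@<node-external-ip> -p <nodeport>
```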
- ![Basic Information](../../user-guide/images/template02.png){ width=1000px} + ![Basic Information](../../images/template02.png){ width=1000px} - ![Alert Rule](../../user-guide/images/template03.png){ width=1000px} + ![Alert Rule](../../images/template03.png){ width=1000px} | Parameter | Description | | ---- | ---- | @@ -36,11 +36,11 @@ Alert thresholds based on actual environment conditions. Click **┇** next to the target rule, then click **Edit** to enter the editing page for the suppression rule. - ![Edit](../../user-guide/images/template04.png){ width=1000px} + ![Edit](../../images/template04.png){ width=1000px} ## Delete Alert Template Click **┇** next to the target template, then click **Delete**. Enter the name of the Alert template in the input box to confirm deletion. - ![Delete](../../user-guide/images/template05.png){ width=1000px} + ![Delete](../../images/template05.png){ width=1000px} diff --git a/docs/en/docs/admin/insight/alert-center/inhibition.md b/docs/en/docs/admin/insight/alert-center/inhibition.md index 0fbe9f2f83..ed9a679658 100644 --- a/docs/en/docs/admin/insight/alert-center/inhibition.md +++ b/docs/en/docs/admin/insight/alert-center/inhibition.md @@ -21,7 +21,7 @@ There are mainly the following conditions: 1. In the left navigation bar, select **Alert** -> **Noise Reduction**, and click **Inhibition** at the top. - ![Inhibition](../../user-guide/images/inhibition01.png){ width=1000px} + ![Inhibition](../../images/inhibition01.png){ width=1000px} 2. Click **Create Inhibition**, and set the name and rules for the inhibition. @@ -31,7 +31,7 @@ There are mainly the following conditions: by defining a set of rules to identify and ignore certain alerts through [Rule Details](#view-rule-details) and [Alert Details](#view-alert-details). - ![Create Inhibition](../../user-guide/images/inhibition02.png){ width=1000px} + ![Create Inhibition](../../images/inhibition02.png){ width=1000px} | Parameter | Description | | ---- | ---- | @@ -69,11 +69,11 @@ In the left navigation bar, select **Alert** -> **Alerts**, and click the policy Click **┇** next to the target rule, then click **Edit** to enter the editing page for the inhibition rule. -![Edit Rules](../../user-guide/images/inhibition03.png){ width=1000px} +![Edit Rules](../../images/inhibition03.png){ width=1000px} ## Delete Inhibition Rule Click **┇** next to the target rule, then click **Delete**. Enter the name of the inhibition rule in the input box to confirm deletion. -![Delete Rules](../../user-guide/images/inhibition04.png){ width=1000px} +![Delete Rules](../../images/inhibition04.png){ width=1000px} diff --git a/docs/en/docs/admin/insight/alert-center/sms-provider.md b/docs/en/docs/admin/insight/alert-center/sms-provider.md index cc13156955..48fe7ffc7a 100644 --- a/docs/en/docs/admin/insight/alert-center/sms-provider.md +++ b/docs/en/docs/admin/insight/alert-center/sms-provider.md @@ -2,7 +2,7 @@ Insight supports SMS notifications and currently sends alert messages using integrated Alibaba Cloud and Tencent Cloud SMS services. This article explains how to configure the SMS notification server in Insight. The variables supported in the SMS signature are the default variables in the message template. As the number of SMS characters is limited, it is recommended to choose more explicit variables. -> For information on how to configure SMS recipients, refer to the document: [Configure SMS Notification Group](../../user-guide/alert-center/message.md). 
+> For information on how to configure SMS recipients, refer to the document: [Configure SMS Notification Group](../../alert-center/message.md).

 ## Procedure

diff --git a/docs/en/docs/admin/insight/best-practice/debug-log.md b/docs/en/docs/admin/insight/best-practice/debug-log.md
index 9b30a61dba..c39a3a0fcc 100644
--- a/docs/en/docs/admin/insight/best-practice/debug-log.md
+++ b/docs/en/docs/admin/insight/best-practice/debug-log.md
@@ -7,7 +7,7 @@ date: 2024-06-11

 After installing the __insight-agent__ in the cluster, __Fluent Bit__ in __insight-agent__ will collect logs in the cluster by default, including Kubernetes event logs, node logs, and container logs. __Fluent Bit__ has already configured various log collection plugins, related filter plugins, and log output plugins. The working status of these plugins determines whether log collection is normal. Below is a dashboard for __Fluent Bit__ that monitors the working conditions of each __Fluent Bit__ in the cluster and the collection, processing, and export of plugin logs.

-1. Use DCE 5.0 platform, enter __Insight__ , and select the __Dashboard__ in the left navigation bar.
+1. Use the AI platform, enter __Insight__ , and select the __Dashboard__ in the left navigation bar.

    ![nav](./images/insight01.png)

@@ -56,5 +56,5 @@ Here are some plugins for __Fluent Bit__ .

 | Output Plugin | Plugin Description |
 | ------------- | ------------------ |
 | es.kube.kubeevent.syslog | Write Kubernetes audit logs, event logs, and syslog logs to [ElasticSearch cluster](../../middleware/elasticsearch/intro/index.md) |
-| forward.audit_log | Send Kubernetes audit logs and [global management audit logs](../../ghippo/user-guide/audit/audit-log.md) to __Global Management__ |
+| forward.audit_log | Send Kubernetes audit logs and [global management audit logs](../../ghippo/audit/audit-log.md) to __Global Management__ |
 | es.skoala | Write [request logs](../../skoala/gateway/logs/reqlog.md) and [instance logs](../../skoala/gateway/logs/inslog.md) of microservice gateway to ElasticSearch cluster |
diff --git a/docs/en/docs/admin/insight/best-practice/debug-trace.md b/docs/en/docs/admin/insight/best-practice/debug-trace.md
index 92c07662c3..20430ec06d 100644
--- a/docs/en/docs/admin/insight/best-practice/debug-trace.md
+++ b/docs/en/docs/admin/insight/best-practice/debug-trace.md
@@ -26,7 +26,7 @@ class sdk,workload,otel,jaeger,es cluster

 As shown in the above figure, any transmission failure at any step will result in the inability to query trace data. If you find that there is no trace data after completing the application trace enhancement, please perform the following steps:

-1. Use DCE 5.0 platform, enter __Insight__ , and select the __Dashboard__ in the left navigation bar.
+1. Use the AI platform, enter __Insight__ , and select the __Dashboard__ in the left navigation bar.
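When the dashboard shows no data at all, it can be worth first confirming that the collectors themselves are running; the namespace and pod name filter below are assumptions about a typical insight-agent installation, not values stated in this guide:

```bash
# Assumption: insight-agent components run in the insight-system namespace
# and the Fluent Bit pods include "fluent-bit" in their names; adjust both as needed.
kubectl -n insight-system get pods | grep fluent-bit

# Tail one collector's logs to check that input, filter, and output plugins load without errors.
kubectl -n insight-system logs <fluent-bit-pod-name> --tail=100
```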
![nav](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/insight01.png) diff --git a/docs/en/docs/admin/insight/best-practice/find_root_cause.md b/docs/en/docs/admin/insight/best-practice/find_root_cause.md index 0b3705b7fc..b7b4ed482b 100644 --- a/docs/en/docs/admin/insight/best-practice/find_root_cause.md +++ b/docs/en/docs/admin/insight/best-practice/find_root_cause.md @@ -5,7 +5,7 @@ DATE: 2024-07-18 # Troubleshooting Service Issues with Insight -This article serves as a guide on using Insight to identify and analyze abnormal components in DCE 5.0 and determine +This article serves as a guide on using Insight to identify and analyze abnormal components in AI platform and determine the root causes of component exceptions. Please note that this post assumes you have a basic understanding of Insight's product features or vision. diff --git a/docs/en/docs/admin/insight/best-practice/tail-based-sampling.md b/docs/en/docs/admin/insight/best-practice/tail-based-sampling.md index f51340e40e..1866237199 100644 --- a/docs/en/docs/admin/insight/best-practice/tail-based-sampling.md +++ b/docs/en/docs/admin/insight/best-practice/tail-based-sampling.md @@ -91,7 +91,7 @@ sample only a small percentage of traces, and then later in the telemetry pipeli make more sophisticated sampling decisions before exporting to a backend. This is often done in the interest of protecting the telemetry pipeline from being overloaded. -**DCE5 Insight currently recommends using tail sampling and prioritizes support for tail sampling.** +**Insight currently recommends using tail sampling and prioritizes support for tail sampling.** The tail sampling processor samples traces based on a defined set of strategies. However, all spans of a trace must be received by the same collector instance to make effective sampling decisions. diff --git a/docs/en/docs/admin/insight/dashboard/dashboard.md b/docs/en/docs/admin/insight/dashboard/dashboard.md index d24c444761..9196316f23 100644 --- a/docs/en/docs/admin/insight/dashboard/dashboard.md +++ b/docs/en/docs/admin/insight/dashboard/dashboard.md @@ -32,6 +32,6 @@ For more information on open source Grafana, see !!! note - 1. For accessing Grafana UI, refer to [Access Native Grafana](../../user-guide/dashboard/login-grafana.md). + 1. For accessing Grafana UI, refer to [Access Native Grafana](../../dashboard/login-grafana.md). 2. For importing custom dashboards, refer to [Importing Custom Dashboards](./import-dashboard.md). diff --git a/docs/en/docs/admin/insight/trace/topology.md b/docs/en/docs/admin/insight/trace/topology.md index ffccfd257c..fd951914ef 100644 --- a/docs/en/docs/admin/insight/trace/topology.md +++ b/docs/en/docs/admin/insight/trace/topology.md @@ -25,7 +25,7 @@ service-to-service calls during the queried time period. - Hover over the connections to view the traffic metrics between the two services. - Click __Display Settings__ , you can configure the display elements in the service map. - ![Servicemap](../../user-guide/images/servicemap.png) + ![Servicemap](../../images/servicemap.png) ### Other Nodes diff --git a/docs/en/docs/admin/k8s/add-node.md b/docs/en/docs/admin/k8s/add-node.md index b740ffa18f..5187eb496f 100644 --- a/docs/en/docs/admin/k8s/add-node.md +++ b/docs/en/docs/admin/k8s/add-node.md @@ -1,43 +1,43 @@ -# 添加工作节点 +# Adding Worker Nodes -如果节点不够用了,可以添加更多节点到集群中。 +If there are not enough nodes, you can add more nodes to the cluster. 
-## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- 有一个管理员帐号 -- [已创建带 GPU 节点的集群](./create-k8s.md) -- [准备一台云主机](../host/createhost.md) +- AI platform is installed +- An administrator account is available +- [A cluster with GPU nodes has been created](./create-k8s.md) +- [A cloud host has been prepared](../host/createhost.md) -## 添加步骤 +## Steps to Add Nodes -1. 以 **管理员身份** 登录 AI 算力平台 -1. 导航至 **容器管理** -> **集群列表** ,点击目标集群的名称 +1. Log in to the AI platform as an **administrator**. +2. Navigate to **Container Management** -> **Clusters**, and click on the name of the target cluster. ![clusters](../images/remove01.png) -1. 进入集群概览页,点击 **节点管理** ,点击右侧的 **接入节点** 按钮 +3. On the cluster overview page, click **Node Management**, and then click the **Add Node** button on the right side. ![add](../images/add01.png) -1. 按照向导,填写各项参数后点击 **确定** +4. Follow the wizard, fill in the required parameters, and then click **OK**. - === "基本信息" + === "Basic Information" ![basic](../images/add02.png) - === "参数配置" + === "Parameter Configuration" ![arguments](../images/add03.png) -1. 在弹窗中点击 **确定** +5. Click **OK** in the popup window. ![ok](../images/add04.png) -1. 返回节点列表,新接入的节点状态为 **接入中** ,等待几分钟后状态变为 **健康** 则表示接入成功。 +6. Return to the node list. The status of the newly added node will be **Pending**. After a few minutes, if the status changes to **Running**, it indicates that the node has been successfully added. ![success](../images/add05.png) !!! tip - 对于刚接入成功的节点,可能还要等 2-3 分钟才能识别出 GPU。 + For nodes that have just been successfully added, it may take an additional 2-3 minutes for the GPU to be recognized. diff --git a/docs/en/docs/admin/k8s/create-k8s.md b/docs/en/docs/admin/k8s/create-k8s.md index 16674d7ea1..634971a44e 100644 --- a/docs/en/docs/admin/k8s/create-k8s.md +++ b/docs/en/docs/admin/k8s/create-k8s.md @@ -1,80 +1,80 @@ -# 创建云上 Kubernetes 集群 +# Creating a Kubernetes Cluster on the Cloud -部署 Kubernetes 集群是为了支持高效的 AI 算力调度和管理,实现弹性伸缩,提供高可用性,从而优化模型训练和推理过程。 +Deploying a Kubernetes cluster is aimed at supporting efficient AI computing resource scheduling and management, achieving elastic scalability, providing high availability, and optimizing the model training and inference processes. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台已 -- 有一个管理员权限的账号 -- 准备一台带 GPU 的物理机 -- 分配两段 IP 地址(Pod CIDR 18 位、SVC CIDR 18 位,不能与现有网段冲突) +- An AI platform is installed +- An administrator account is available +- A physical machine with a GPU is prepared +- Two segments of IP addresses are allocated (Pod CIDR 18 bits, SVC CIDR 18 bits, must not conflict with existing network segments) -## 创建步骤 +## Steps to Create the Cluster -1. 以 **管理员身份** 登录 AI 算力平台 -1. [创建并启动 3 台不带 GPU 的云主机](../host/createhost.md)用作集群的 Master 节点 +1. Log in to the AI platform as an **administrator**. +2. [Create and launch 3 cloud hosts without GPUs](../host/createhost.md) to serve as the Master nodes for the cluster. - - 配置资源,CPU 16 核,内存 32 GB,系统盘 200 GB(ReadWriteOnce) - - 网络模式选择 **Bridge(桥接)** - - 设置 root 密码或添加 SSH 公钥,方便以 SSH 连接 - - 记录好 3 台主机的 IP + - Configure resources: 16 CPU cores, 32 GB memory, 200 GB system disk (ReadWriteOnce) + - Select **Bridge** network mode + - Set the root password or add an SSH public key for SSH connection + - Take note of the IP addresses of the 3 hosts -1. 导航至 **容器管理** -> **集群列表** ,点击右侧的 **创建集群** 按钮 -1. 按照向导,配置集群的各项参数 +3. Navigate to **Container Management** -> **Clusters**, and click the **Create Cluster** button on the right side. +4. Follow the wizard to configure the various parameters of the cluster. 
- === "基本信息" + === "Basic Information" ![basic](../images/k8s01.png) - === "节点配置" + === "Node Configuration" - 配置完节点信息后,点击 **开始检查** , + After configuring the node information, click **Start Check**. ![node](../images/k8s02.png) ![node](../images/k8s03.png) - === "网络配置" + === "Network Configuration" ![network](../images/k8s04.png) - === "Addon 配置" + === "Addon Configuration" ![addon](../images/k8s05.png) - === "高级配置" + === "Advanced Configuration" - 每个节点默认可运行 110 个 Pod(容器组),如果节点配置比较高,可以调整到 200 或 300 个 Pod。 + Each node can run a default of 110 Pods (container groups). If the node configuration is higher, it can be adjusted to 200 or 300 Pods. ![basic](../images/k8s06.png) -1. 等待集群创建完成。 +5. Wait for the cluster creation to complete. ![done](../images/k8s08.png) -1. 在集群列表中,找到刚创建的集群,点击集群名称,导航到 **Helm 应用** -> **Helm 模板** ,在搜索框内搜索 metax-gpu-extensions,点击卡片 +6. In the cluster list, find the newly created cluster, click on the cluster name, navigate to **Helm Apps** -> **Helm Charts**, and search for `metax-gpu-extensions` in the search box, then click the card. ![cluster](../images/k8s09.png) ![helm](../images/k8s10.png) -1. 点击右侧的 **安装** 按钮,开始安装 GPU 插件 +7. Click the **Install** button on the right to begin installing the GPU plugin. - === "应用设置" + === "Application Settings" - 输入名称,选择命名空间,在 YAMl 中修改镜像地址: + Enter a name, select a namespace, and modify the image address in the YAML: ![app settings](../images/k8s11.png) - === "Kubernetes 编排确认" + === "Kubernetes Orchestration Confirmation" ![confirm](../images/k8s12.png) -1. 自动返回 Helm 应用列表,等待 metax-gpu-extensions 状态变为 **已部署** +8. You will automatically return to the Helm application list. Wait for the status of `metax-gpu-extensions` to change to **Deployed**. ![deployed](../images/k8s13.png) -1. 到此集群创建成功,可以去查看集群所包含的节点。你可以去[创建 AI 工作负载并使用 GPU 了](../share/workload.md)。 +9. The cluster has been successfully created. You can now check the nodes included in the cluster. You can [create AI workloads and use the GPU](../share/workload.md). ![nodes](../images/k8s14.png) -下一步:[创建 AI 工作负载](../share/workload.md) +Next step: [Create AI Workloads](../share/workload.md) diff --git a/docs/en/docs/admin/k8s/remove-node.md b/docs/en/docs/admin/k8s/remove-node.md index 5f5bd822e7..3b78247801 100644 --- a/docs/en/docs/admin/k8s/remove-node.md +++ b/docs/en/docs/admin/k8s/remove-node.md @@ -1,37 +1,36 @@ -# 移除 GPU 工作节点 +# Removing GPU Worker Nodes -GPU 资源的成本相对较高,如果暂时用不到 GPU,可以将带 GPU 的工作节点移除。 -以下步骤也同样适用于移除普通工作节点。 +The cost of GPU resources is relatively high. If you temporarily do not need a GPU, you can remove the worker nodes with GPUs. The following steps are also applicable for removing regular worker nodes. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- 有一个管理员帐号 -- [已创建带 GPU 节点的集群](./create-k8s.md) +- AI platform installed +- An administrator account +- [A cluster with GPU nodes created](./create-k8s.md) -## 移除步骤 +## Removal Steps -1. 以 **管理员身份** 登录 AI 算力平台 -1. 导航至 **容器管理** -> **集群列表** ,点击目标集群的名称 +1. Log in to the AI platform as an **administrator**. +2. Navigate to **Container Management** -> **Clusters**, and click on the name of the target cluster. ![clusters](../images/remove01.png) -1. 进入集群概览页,点击 **节点管理** ,找到要移除的节点,点击列表右侧的 __┇__ ,在弹出菜单中选择 **移除节点** +3. On the cluster overview page, click **Nodes**, find the node you want to remove, click the __┇__ on the right side of the list, and select **Remove Node** from the pop-up menu. ![remove](../images/remove02.png) -1. 在弹框中输入节点名称,确认无误后点击 **删除** +4. 
In the pop-up window, enter the node name, and after confirming it is correct, click **Delete**.

    ![confirm](../images/remove03.png)

-1. 自动返回节点列表，状态为 **移除中** ，几分钟后刷新页面，节点不在了，说明节点被成功移除
+5. You will automatically return to the node list, and the status will be **Removing**. After a few minutes, refresh the page; if the node is no longer there, it indicates that the node has been successfully removed.

    ![removed](../images/remove04.png)

-1. 从 UI 列表移除节点后，通过 SSH 登录到已移除的节点主机，执行关机命令。
+6. After removing the node from the UI list, log in to the host of the removed node via SSH and execute the shutdown command.

    ![shutdown](../images/remove05.png)

!!! tip

-    在 UI 上移除节点并将其关机后，节点上的数据并未被立即删除，节点数据会被保留一段时间。
+    After removing the node from the UI and shutting it down, the data on the node is not immediately deleted; the node's data will be retained for a period of time.
diff --git a/docs/en/docs/admin/kpanda/best-practice/add-master-node.md b/docs/en/docs/admin/kpanda/best-practice/add-master-node.md
index 30f70b6fd4..312cfbf458 100644
--- a/docs/en/docs/admin/kpanda/best-practice/add-master-node.md
+++ b/docs/en/docs/admin/kpanda/best-practice/add-master-node.md
@@ -8,7 +8,7 @@ This article provides a step-by-step guide on how to manually scale the control

 ## Prerequisites

-- A worker cluster has been created using the DCE 5.0 platform. You can refer to the documentation on [Creating a Worker Cluster](../user-guide/clusters/create-cluster.md).
+- A worker cluster has been created using the AI platform. You can refer to the documentation on [Creating a Worker Cluster](../clusters/create-cluster.md).
 - The managed cluster associated with the worker cluster exists in the current platform and is running normally.

 !!! note
diff --git a/docs/en/docs/admin/kpanda/best-practice/add-worker-node-on-global.md b/docs/en/docs/admin/kpanda/best-practice/add-worker-node-on-global.md
index 155a0812c7..83064e34f9 100644
--- a/docs/en/docs/admin/kpanda/best-practice/add-worker-node-on-global.md
+++ b/docs/en/docs/admin/kpanda/best-practice/add-worker-node-on-global.md
@@ -1,8 +1,8 @@
 # Scaling the Worker Nodes of the Global Service Cluster

 This page introduces how to manually scale the worker nodes of the global service cluster in offline mode.
-By default, it is not recommended to scale the [global service cluster](../user-guide/clusters/cluster-role.md#global-service-cluster) after deploying DCE 5.0.
-Please ensure proper resource planning before deploying DCE 5.0.
+By default, it is not recommended to scale the [global service cluster](../clusters/cluster-role.md#global-service-cluster) after deploying AI platform.
+Please ensure proper resource planning before deploying AI platform.

 !!! note

@@ -10,7 +10,7 @@ Please ensure proper resource planning before deploying DCE 5.0.

 ## Prerequisites

-- DCE platform deployment has been completed through [bootstrap node](../../install/commercial/deploy-arch.md),
+- The AI platform deployment has been completed through [bootstrap node](../../install/commercial/deploy-arch.md),
   and the kind cluster on the bootstrap node is running normally.
 - You must log in with a user account that has admin privileges on the platform.

@@ -162,9 +162,9 @@ Please ensure proper resource planning before deploying DCE 5.0.

    systemctl restart containerd
    ```

-## Integrate a Kind cluster into the DCE 5.0 cluster list
+## Integrate a Kind cluster into the AI platform cluster list

-1. Log in to DCE 5.0, navigate to Container Management, and on the right side of the cluster list,
+1. 
Log in to AI platform, navigate to Container Management, and on the right side of the cluster list, click the __Integrate Cluster__ button. 2. In the integration configuration section, fill in and edit the kubeconfig of the Kind cluster. @@ -197,7 +197,7 @@ Please ensure proper resource planning before deploying DCE 5.0. ## Add Labels to the Global Service Cluster -1. Log in to DCE 5.0, navigate to Container Management, find the __kapnda-global-cluster__ , +1. Log in to AI platform, navigate to Container Management, find the __kapnda-global-cluster__ , and in the right-side, find the __Basic Configuration__ menu options. 2. In the Basic Configuration page, add the label `kpanda.io/managed-by=my-cluster` for the global service cluster: diff --git a/docs/en/docs/admin/kpanda/best-practice/backup-mysql-on-nfs.md b/docs/en/docs/admin/kpanda/best-practice/backup-mysql-on-nfs.md index f037c29794..173436c19d 100644 --- a/docs/en/docs/admin/kpanda/best-practice/backup-mysql-on-nfs.md +++ b/docs/en/docs/admin/kpanda/best-practice/backup-mysql-on-nfs.md @@ -1,11 +1,11 @@ # Cross-Cluster Backup and Recovery of MySQL Application and Data -This demonstration will show how to use the application backup feature in DCE 5.0 to +This demonstration will show how to use the application backup feature in AI platform to perform cross-cluster backup migration for a stateful application. !!! note - The current operator should have admin privileges on the DCE 5.0 platform. + The current operator should have admin privileges on the AI platform platform. ## Prepare the Demonstration Environment @@ -528,7 +528,7 @@ perform cross-cluster backup migration for a stateful application. The velero plugin needs to be installed on **both the source and target clusters**. -Refer to the [Install Velero Plugin](../user-guide/backup/install-velero.md) documentation and the MinIO configuration below to install the velero plugin on the __main-cluster__ and __recovery-cluster__ . +Refer to the [Install Velero Plugin](../backup/install-velero.md) documentation and the MinIO configuration below to install the velero plugin on the __main-cluster__ and __recovery-cluster__ . | MinIO Server Address | Bucket | Username | Password | | ------------------------| ----------- | ---------| ----------| @@ -548,7 +548,7 @@ Refer to the [Install Velero Plugin](../user-guide/backup/install-velero.md) doc kubectl label pvc mydata backup=mysql #为 mysql 的 pvc 添加标签 ``` -2. Refer to the steps described in [Application Backup](../user-guide/backup/deployment.md#application-backup) and the parameters below to create an application backup. +2. Refer to the steps described in [Application Backup](../backup/deployment.md#application-backup) and the parameters below to create an application backup. - Name: __backup-mysql__ (can be customized) - Source Cluster: __main-cluster__ @@ -565,7 +565,7 @@ Refer to the [Install Velero Plugin](../user-guide/backup/install-velero.md) doc ## Cross-Cluster Recovery of MySQL Application and Data -1. Log in to the DCE 5.0 platform and select __Container Management__ -> __Backup & Restore__ -> __Application Backup__ from the left navigation menu. +1. Log in to the AI platform platform and select __Container Management__ -> __Backup & Restore__ -> __Application Backup__ from the left navigation menu. 
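Before running the restore, it can be useful to confirm from the command line that the backup actually completed on the source cluster; the namespace below assumes the velero plugin was installed into its default `velero` namespace:

```bash
# List Velero Backup resources on the source cluster and check that
# backup-mysql (the backup created above) reports a Completed phase.
kubectl -n velero get backups.velero.io
kubectl -n velero describe backup.velero.io backup-mysql

# After triggering the restore, the equivalent check on the target cluster
# is against the Restore resources.
kubectl -n velero get restores.velero.io
```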
![img](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/images/mysql06.png) diff --git a/docs/en/docs/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.md b/docs/en/docs/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.md index 459700c351..a2f74e5f7b 100644 --- a/docs/en/docs/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.md +++ b/docs/en/docs/admin/kpanda/best-practice/create-redhat9.2-on-centos-platform.md @@ -4,16 +4,16 @@ This article explains how to create a RedHat 9.2 worker cluster on an existing C !!! note - This article only applies to the offline mode, using the DCE 5.0 platform to create a worker cluster. The architecture of the management platform and the cluster to be created are both AMD. + This article only applies to the offline mode, using the AI platform platform to create a worker cluster. The architecture of the management platform and the cluster to be created are both AMD. When creating a cluster, heterogeneous deployment (mixing AMD and ARM) is not supported. After the cluster is created, you can use the method of connecting heterogeneous nodes to achieve mixed deployment and management of the cluster. ## Prerequisites -A DCE 5.0 full-mode has been deployed, and the spark node is still alive. For deployment, see the document [Offline Install DCE 5.0 Enterprise](../../install/commercial/start-install.md). +A AI platform full-mode has been deployed, and the spark node is still alive. For deployment, see the document [Offline Install AI platform Enterprise](../../install/commercial/start-install.md). ## Download and Import RedHat Offline Packages -Make sure you are logged into the spark node! And the clusterConfig.yaml file used when deploying DCE 5.0 is still available. +Make sure you are logged into the spark node! And the clusterConfig.yaml file used when deploying AI platform is still available. ### Download the Relevant RedHat Offline Packages @@ -75,4 +75,4 @@ MINIO_USER=rootuser MINIO_PASS=rootpass123 ./import_iso.sh http://127.0.0.1:9000 ## Create the Cluster in the UI -Refer to the document [Creating a Worker Cluster](../user-guide/clusters/create-cluster.md) to create a RedHat 9.2 cluster. +Refer to the document [Creating a Worker Cluster](../clusters/create-cluster.md) to create a RedHat 9.2 cluster. diff --git a/docs/en/docs/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.md b/docs/en/docs/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.md index 23a7422deb..85c3cb706f 100644 --- a/docs/en/docs/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.md +++ b/docs/en/docs/admin/kpanda/best-practice/create-ubuntu-on-centos-platform.md @@ -9,19 +9,19 @@ This page explains how to create an Ubuntu worker cluster on an existing CentOS. !!! note - This page is specifically for the offline mode, using the DCE 5.0 platform to create a worker cluster, + This page is specifically for the offline mode, using the AI platform platform to create a worker cluster, where both the CentOS platform and the worker cluster to be created are based on AMD architecture. Heterogeneous (mixed AMD and ARM) deployments are not supported during cluster creation; however, after the cluster is created, you can manage a mixed deployment by adding heterogeneous nodes. ## Prerequisite -- A fully deployed DCE 5.0 system, with the bootstrap node still active. For deployment reference, see the documentation [Offline Install DCE 5.0 Enterprise](../../install/commercial/start-install.md). 
+- A fully deployed AI platform system, with the bootstrap node still active. For deployment reference, see the documentation [Offline Install AI platform Enterprise](../../install/commercial/start-install.md). ## Download and Import Ubuntu Offline Packages Please ensure you are logged into the bootstrap node! Also, make sure that the -clusterConfig.yaml file used during the DCE 5.0 deployment is still available. +clusterConfig.yaml file used during the AI platform deployment is still available. ### Download Ubuntu Offline Packages @@ -39,5 +39,5 @@ to import offline resources into MinIO on the bootstrap node. ## Create Cluster on UI -Refer to the documentation [Creating a Worker Cluster](../user-guide/clusters/create-cluster.md) +Refer to the documentation [Creating a Worker Cluster](../clusters/create-cluster.md) to create the Ubuntu cluster. diff --git a/docs/en/docs/admin/kpanda/best-practice/dce4-5-migration.md b/docs/en/docs/admin/kpanda/best-practice/dce4-5-migration.md deleted file mode 100644 index 00fd59ef0b..0000000000 --- a/docs/en/docs/admin/kpanda/best-practice/dce4-5-migration.md +++ /dev/null @@ -1,244 +0,0 @@ -# Limited Scenario Migration from DCE 4.0 to DCE 5.0 - -## Environment Preparation - -1. Available DCE 4.0 environment. -2. Available DCE 5.0 environment. -3. Kubernetes cluster for data restoration, referred to as the __restoration cluster__. - -## Prerequisites - -1. Install the CoreDNS plugin on DCE 4.0. Refer to [Install CoreDNS](https://dwiki.daocloud.io/pages/viewpage.action?pageId=36668076) for installation steps. - -2. Migrate DCE 4.0 to DCE 5.0. Refer to [Integrate Cluster](../user-guide/clusters/integrate-cluster.md) for migration steps. The DCE 4.0 cluster migrated to DCE 5.0 is referred to as the __backup cluster__. - - !!! note - - When integrating the cluster, select __DaoCloud DCE4__ as the distribution. - -3. Install Velero on the managed DCE 4.0 cluster. Refer to [Install Velero](../user-guide/backup/install-velero.md) for installation steps. - -4. Migrate the restoration cluster to DCE 5.0, which can be done by creating a cluster or integrating it. - -5. Install Velero on the restoration cluster. Refer to [Install Velero](../user-guide/backup/install-velero.md) for installation steps. - -!!! note - - - The object storage configuration must be consistent between the managed DCE 4.0 cluster and the restoration cluster. - - If you need to perform pod migration, open the __Migration Plugin Configuration__ switch in the form parameters (supported in Velero version 5.2.0+). - - -## Optional Configuration - -If you need to perform pod migration, follow these steps after completing the prerequisites. These steps can be ignored for non-pod migration scenarios. - -!!! note - - These steps are executed in the restoration cluster managed by DCE 5.0. - -### Configure the Velero Plugin - -1. After the Velero plugin is installed, you can use the following YAML file to configure the Velero plugin. - - !!! note - - When installing the Velero plugin, make sure to open the __Migration Plugin Configuration__ switch in the form parameters. - - ```yaml - apiVersion: v1 - kind: ConfigMap - metadata: - # any name can be used; Velero uses the labels (below) to identify it rather than the name - name: velero-plugin-for-migration - # must be in the velero namespace - namespace: velero - # the below labels should be used verbatim in your ConfigMap. - labels: - # this value-less label identifies the ConfigMap as - # config for a plugin (i.e. 
the built-in restore item action plugin) - velero.io/plugin-config: "velero-plugin-for-migration" - # this label identifies the name and kind of plugin that this ConfigMap is for. - velero.io/velero-plugin-for-migration: RestoreItemAction - data: - velero-plugin-for-migration: '{"resourcesSelector":{"includedNamespaces":["kube-system"],"excludedNamespaces":["default"],"includedResources":["pods","deployments","ingress"],"excludedResources":["secrets"],"skipRestoreKinds":["endpointslice"],"labelSelector":"app:dao-2048"},"resourcesConverter":[{"ingress":{"enabled":true,"apiVersion":"extensions/v1beat1"}}],"resourcesOperation":[{"kinds":["pod"],"domain":"labels","operation":{"add":{"key1":"values","key2":""},"remove":{"key3":"values","key4":""},"replace":{"key5":["source","dest"],"key6":["","dest"],"key7":["source",""]}}},{"kinds":["deployment","daemonset"],"domain":"annotations","scope":"resourceSpec","operation":{"add":{"key1":"values","key2":""},"remove":{"key3":"values","key4":""},"replace":{"key5":["source","dest"],"key6":["","dest"],"key7":["source",""]}}}]}' - ``` - - !!! note - - - Do not modify the name of the plugin configuration ConfigMap, and it must be created in the velero namespace. - - Pay attention to whether you are filling in resource resources or kind in the plugin configuration. - - After modifying the plugin configuration, restart the Velero pod. - - The YAML below is a display style for the plugin configuration, and it needs to be converted to JSON and added to the ConfigMap. - - Refer to the following YAML and comments for how to configure __velero-plugin-for-migration__: - - ```yaml - resourcesSelector: # The resources that the plugin needs to handle or ignore - includedNamespaces: # Exclude the namespaces included in the backup - - kube-system - excludedNamespaces: # Do not handle the namespaces included in the backup - - default - includedResources: # Handle the resources included in the backup - - pods - - deployments - - ingress - excludedResources: # Do not handle the resources included in the backup - - secrets - skipRestoreKinds: - - endpointslice # The restore plugin skips the resources included in the backup, that is, - # it does not perform the restore operation. This resource needs to be included in - # includedResources to be captured by the plugin. This field needs to be filled - # with the resource kind, case insensitive. 
- labelSelector: 'app:dao-2048' - resourcesConverter: # The resources that the restore plugin needs to convert, does not support - # configuring specific resource field conversions - - ingress: - enabled: true - apiVersion: extensions/v1beat1 - resourcesOperation: # The restore plugin modifies the annotations/labels of the resource/template - - kinds: ['pod'] # Fill in the resource kind included in the backup, case insensitive - domain: labels # Handle the resources labels - operation: - add: - key1: values # Add labels key1:values - key2: '' - remove: - key3: values # Remove lables key3:values, match key,values - key4: '' # Remove lables key4, only match key, not match values - replace: - key5: # Replace lables key5:source -> key5:dest - - source - - dest - key6: # Replace lables key6: -> key6:dest, not match key6 values - - "" - - dest - key7: # Replace lables key7:source -> key7:"" - - source - - "" - - kinds: ['deployment', 'daemonset'] # Fill in the resource kind included in the backup, case insensitive - domain: annotations # Handle the resources template annotations - scope: resourceSpec # Handle the resources template spec annotations or labels, depending on the domain configuration - operation: - add: - key1: values # Add annotations key1:values - key2: '' - remove: - key3: values # Remove annotations key3:values, match key,values - key4: '' # Remove annotations key4, only match key, not match values - replace: - key5: # Replace annotations key5:source -> key5:dest - - source - - dest - key6: # Replace annotations key6: -> key6:dest, not match key6 values - - "" - - dest - key7: # Replace annotations key7:source -> key7:"" - - source - - "" - ``` - -2. After obtaining the velero-plugin-for-dce plugin configuration, perform chained operations on resources according to the configuration. For example, after ingress is processed by resourcesConverter, it will be processed by resourcesOperation. - -### Image Repository Migration - -The following steps describe how to migrate images between image repositories. - -1. Integrate the DCE 4.0 image repository with the Kangaroo repository integration (admin). - Refer to [Repository Integration](../../kangaroo/integrate/integrate-admin/integrate-admin.md) for the operation steps. - - - !!! note - - - Use the VIP address IP of dce-registry as the repository address. - - Use the account and password of the DCE 4.0 administrator. - -2. Create or integrate a Harbor repository in the administrator interface to migrate the source images. - - -3. Configure the target repository and synchronization rules in the Harbor repository instance. After the rules are triggered, Harbor will automatically pull images from dce-registry. - - -4. Click the name of the synchronization rule to view whether the image synchronization is successful. - - -### Network Policy Migration - -#### Calico Network Policy Migration - -Refer to the resource and data migration process to migrate the Calico service from DCE 4.0 to DCE 5.0. -Due to the different ippool names, services may be abnormal after migration. Manually delete the annotations in the service YAML after migration to ensure that the service starts properly. - -!!! note - - - In DCE 4.0, the name is default-ipv4-ippool. - - In DCE 5.0, the name is default-pool. - -```yaml -annotations: - dce.daocloud.io/parcel.net.type: calico - dce.daocloud.io/parcel.net.type: default-ipv4-ippool -``` - -#### Parcel Underlay Network Policy Migration - -The following steps describe how to migrate Parcel underlay network policies. 
- -1. Install the spiderpool Helm application in the restoration cluster. Refer to [Install Spiderpool](../user-guide/helm/helm-app.md) for installation steps. - - -2. Go to the details page of the restoration cluster and select __Container Network__ -> __Network Configuration__ from the left menu. - - -3. Create a subnet and reserve IPs in the __Static IP Pool__. - - -4. Create a Multus CR with the same IP (the default pool can be left blank, and the port is consistent with the actual port). - - -5. Create a Velero DCE plugin configmap. - - ```yaml - --- - resourcesSelector: - includedResources: - - pods - - deployments - resourcesConverter: - resourcesOperation: - - kinds: - - pod - domain: annotations - operation: - replace: - cni.projectcalico.org/ipv4pools: - - '["default-ipv4-ippool"]' - - default-pool - - kinds: - - deployment - domain: annotations - scope: resourceSpec - operation: - remove: - dce.daocloud.io/parcel.egress.burst: - dce.daocloud.io/parcel.egress.rate: - dce.daocloud.io/parcel.ingress.burst: - dce.daocloud.io/parcel.ingress.rate: - dce.daocloud.io/parcel.net.type: - dce.daocloud.io/parcel.net.value: - dce.daocloud.io/parcel.ovs.network.status: - add: - ipam.spidernet.io/subnets: ' [ { "interface": "eth0", "ipv4": ["d5"] } ]' - v1.multus-cni.io/default-network: kube-system/d5multus - ``` - -6. Verify whether the migration is successful. - - 1. Check whether there are annotations in the application YAML. - - ```yaml - annotations: - ipam.spidernet.io/subnets: ' [ { "interface": "eth0", "ipv4": ["d5"] } ]' - v1.multus-cni.io/default-network: kube-system/d5multus - ``` - - 1. Check whether the pod IP is within the configured IP pool. diff --git a/docs/en/docs/admin/kpanda/best-practice/etcd-backup.md b/docs/en/docs/admin/kpanda/best-practice/etcd-backup.md index ae62e82fd3..2b87a599ff 100644 --- a/docs/en/docs/admin/kpanda/best-practice/etcd-backup.md +++ b/docs/en/docs/admin/kpanda/best-practice/etcd-backup.md @@ -9,9 +9,9 @@ Using the ETCD backup feature to create a backup policy, you can back up the etc !!! note - - DCE 5.0 ETCD backup restores are limited to backups and restores for the same cluster (with no change in the number of nodes and IP addresses). For example, after the etcd data of Cluster A is backed up, the backup data can only be restored to Cluster A, not to Cluster B. - - The feature is recommended [app backup and restore](../user-guide/backup/deployment.md) for cross-cluster backups and restores. - - First, create a backup policy to back up the current status. It is recommended to refer to the [ETCD backup](../user-guide/backup/etcd-backup.md). + - AI platform ETCD backup restores are limited to backups and restores for the same cluster (with no change in the number of nodes and IP addresses). For example, after the etcd data of Cluster A is backed up, the backup data can only be restored to Cluster A, not to Cluster B. + - The feature is recommended [app backup and restore](../backup/deployment.md) for cross-cluster backups and restores. + - First, create a backup policy to back up the current status. It is recommended to refer to the [ETCD backup](../backup/etcd-backup.md). The following is a specific case to illustrate the whole process of backup and restore. @@ -51,12 +51,12 @@ INFO[0000] Go OS/Arch: linux/amd64 You need to check the following before restoring: -- Have you successfully backed up your data in DCE 5.0 +- Have you successfully backed up your data in AI platform - Check if backup data exists in S3 storage !!! 
note - The backup of DCE 5.0 is a full data backup, and the full data of the last backup will be restored when restoring. + The backup of AI platform is a full data backup, and the full data of the last backup will be restored when restoring. ### Shut down the cluster diff --git a/docs/en/docs/admin/kpanda/best-practice/hardening-cluster.md b/docs/en/docs/admin/kpanda/best-practice/hardening-cluster.md index d23c4bd2b6..3a53e8380b 100644 --- a/docs/en/docs/admin/kpanda/best-practice/hardening-cluster.md +++ b/docs/en/docs/admin/kpanda/best-practice/hardening-cluster.md @@ -1,6 +1,6 @@ # How to Harden a Self-built Work Cluster -In DCE 5.0, when using the CIS Benchmark (CIS) scan on a work cluster created using the user interface, some scan items did not pass the scan. This article provides hardening instructions based on different versions of CIS Benchmark. +In AI platform, when using the CIS Benchmark (CIS) scan on a work cluster created using the user interface, some scan items did not pass the scan. This article provides hardening instructions based on different versions of CIS Benchmark. ## CIS Benchmark 1.27 @@ -59,7 +59,7 @@ To address these security scan issues, kubespray has added default values in v2. kubelet_rotate_server_certificates: true ``` -- In DCE 5.0, there is also a feature to configure advanced parameters through the user interface. Add custom parameters in the last step of cluster creation: +- In AI platform, there is also a feature to configure advanced parameters through the user interface. Add custom parameters in the last step of cluster creation: ![img](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/images/hardening05.png) diff --git a/docs/en/docs/admin/kpanda/best-practice/kubean-low-version.md b/docs/en/docs/admin/kpanda/best-practice/kubean-low-version.md index d9ad36467d..c3bb8c43e4 100644 --- a/docs/en/docs/admin/kpanda/best-practice/kubean-low-version.md +++ b/docs/en/docs/admin/kpanda/best-practice/kubean-low-version.md @@ -9,7 +9,7 @@ In order to meet the customer's demand for building Kubernetes (K8s) clusters wi Kubean provides the capability to be compatible with lower versions and create K8s clusters with those versions. Currently, the supported versions for self-built worker clusters range from `1.26.0-v1.28`. -Refer to the [DCE 5.0 Cluster Version Support System](./cluster-version.md) for more information. +Refer to the [AI platform Cluster Version Support System](./cluster-version.md) for more information. This article will demonstrate how to deploy a K8s cluster with a lower version. @@ -31,7 +31,7 @@ This article will demonstrate how to deploy a K8s cluster with a lower version. version based on the actual situation. The currently supported artifact versions and their corresponding cluster version ranges are as follows: - | Artifact Version | Cluster Range | DCE 5.0 Support | + | Artifact Version | Cluster Range | AI platform Support | | ----------- | ----------- | ------ | | release-2.21 | v1.23.0 ~ v1.25.6 | Supported since installer v0.14.0 | | release-2.22 | v1.24.0 ~ v1.26.9 | Supported since installer v0.15.0 | @@ -125,7 +125,7 @@ skopeo copy ${SKOPEO_PARAMS} docker-archive:spray-job-2.21.tar docker://${REGIST 2. Choose the `manifest` and `localartifactset.cr.yaml` custom resources deployed cluster as the `Managed` parameter. In this example, we use the Global cluster. -3. Refer to [Creating a Cluster](../user-guide/clusters/create-cluster.md) for the remaining parameters. +3. 
Refer to [Creating a Cluster](../clusters/create-cluster.md) for the remaining parameters. #### Upgrade diff --git a/docs/en/docs/admin/kpanda/best-practice/multi-arch.md b/docs/en/docs/admin/kpanda/best-practice/multi-arch.md index d1f083bbab..488dbfd048 100644 --- a/docs/en/docs/admin/kpanda/best-practice/multi-arch.md +++ b/docs/en/docs/admin/kpanda/best-practice/multi-arch.md @@ -11,15 +11,15 @@ to an AMD architecture worker cluster with CentOS 7.9 operating system. !!! note This page is only applicable to adding heterogeneous nodes to a worker cluster created - using the DCE 5.0 platform in offline mode, excluding connected clusters. + using the AI platform platform in offline mode, excluding connected clusters. ## Prerequisites -- A DCE 5.0 Full Mode deployment has been successfully completed, and the bootstrap node is still alive. - Refer to the documentation [Offline Installation of DCE 5.0 Enterprise](../../install/commercial/start-install.md) for the deployment process. +- A AI platform Full Mode deployment has been successfully completed, and the bootstrap node is still alive. + Refer to the documentation [Offline Installation of AI platform Enterprise](../../install/commercial/start-install.md) for the deployment process. - A worker cluster with AMD architecture and CentOS 7.9 operating system has been created through the - DCE 5.0 platform. Refer to the documentation - [Creating a Worker Cluster](../user-guide/clusters/create-cluster.md) for the creation process. + AI platform platform. Refer to the documentation + [Creating a Worker Cluster](../clusters/create-cluster.md) for the creation process. ## Procedure @@ -28,7 +28,7 @@ to an AMD architecture worker cluster with CentOS 7.9 operating system. Take ARM architecture and Kylin v10 sp2 operating system as examples. Make sure you are logged into the bootstrap node! Also, make sure the -__clusterConfig.yaml__ file used during the DCE 5.0 deployment is available. +__clusterConfig.yaml__ file used during the AI platform deployment is available. #### Offline Image Package @@ -86,7 +86,7 @@ Run the import-artifact command: Parameter Explanation: - - __-c clusterConfig.yaml__ specifies the clusterConfig.yaml file used during the previous DCE 5.0 deployment. + - __-c clusterConfig.yaml__ specifies the clusterConfig.yaml file used during the previous AI platform deployment. - __--offline-path__ specifies the file path of the downloaded offline image package. - __--iso-path__ specifies the file path of the downloaded ISO operating system image. - __--os-pkgs-path__ specifies the file path of the downloaded osPackage offline package. @@ -95,15 +95,8 @@ After a successful import command execution, the offline package will be uploade ### Add Heterogeneous Worker Nodes -!!! note - - If the version of DCE 5.0 you have installed is higher than (inclusive of) - [DCE5.0-20230731](../../dce/dce-rn/20230731.md), after completing the above steps, - you can directly integrate nodes via UI; if not, you will need to continue with - the following steps to integrate heterogeneous nodes. - -Make sure you are logged into the management node of the DCE 5.0 -[Global Service Cluster](../user-guide/clusters/cluster-role.md#global-service-cluster). +Make sure you are logged into the management node of the AI platform +[Global Service Cluster](../clusters/cluster-role.md#global-service-cluster). 
#### Modify the Host Manifest diff --git a/docs/en/docs/admin/kpanda/best-practice/update-offline-cluster.md b/docs/en/docs/admin/kpanda/best-practice/update-offline-cluster.md index f6c54da907..ea19b9203c 100644 --- a/docs/en/docs/admin/kpanda/best-practice/update-offline-cluster.md +++ b/docs/en/docs/admin/kpanda/best-practice/update-offline-cluster.md @@ -7,19 +7,19 @@ Date: 2024-10-25 !!! note - This document is specifically designed for deploying or upgrading the Kubernetes version of worker clusters created on the DCE 5.0 platform in offline mode. It does not cover the deployment or upgrade of other Kubernetes components. + This document is specifically designed for deploying or upgrading the Kubernetes version of worker clusters created on the AI platform platform in offline mode. It does not cover the deployment or upgrade of other Kubernetes components. This guide is applicable to the following offline scenarios: -- You can follow the operational guidelines to deploy the recommended Kubernetes version in a non-GUI environment created by the DCE 5.0 platform. -- You can upgrade the Kubernetes version of worker clusters created using the DCE 5.0 platform by generating incremental offline packages. +- You can follow the operational guidelines to deploy the recommended Kubernetes version in a non-GUI environment created by the AI platform platform. +- You can upgrade the Kubernetes version of worker clusters created using the AI platform platform by generating incremental offline packages. The overall approach is as follows: 1. Build the offline package on an integrated node. 2. Import the offline package to the bootstrap node. -3. Update the Kubernetes version manifest for the [global service cluster](../user-guide/clusters/cluster-role.md#global-service-cluster). -4. Use the DCE 5.0 UI to create or upgrade the Kubernetes version of the worker cluster. +3. Update the Kubernetes version manifest for the [global service cluster](../clusters/cluster-role.md#global-service-cluster). +4. Use the AI platform UI to create or upgrade the Kubernetes version of the worker cluster. !!! note @@ -153,8 +153,8 @@ kubectl apply -f data/kubeanofflineversion.cr.patch.yaml ## Next Steps -Log into the DCE 5.0 UI management interface to continue with the following actions: +Log into the AI platform UI management interface to continue with the following actions: -1. Refer to the [Creating Cluster Documentation](../user-guide/clusters/create-cluster.md) to create a worker cluster, where you can select the incremental version of Kubernetes. +1. Refer to the [Creating Cluster Documentation](../clusters/create-cluster.md) to create a worker cluster, where you can select the incremental version of Kubernetes. -2. Refer to the [Upgrading Cluster Documentation](../user-guide/clusters/upgrade-cluster.md) to upgrade your self-built worker cluster. +2. Refer to the [Upgrading Cluster Documentation](../clusters/upgrade-cluster.md) to upgrade your self-built worker cluster. diff --git a/docs/en/docs/admin/kpanda/best-practice/use-otherlinux-create-custer.md b/docs/en/docs/admin/kpanda/best-practice/use-otherlinux-create-custer.md index 88e7fa64a8..1135c491bc 100644 --- a/docs/en/docs/admin/kpanda/best-practice/use-otherlinux-create-custer.md +++ b/docs/en/docs/admin/kpanda/best-practice/use-otherlinux-create-custer.md @@ -1,6 +1,6 @@ # Creating a Cluster on Non-Supported Operating Systems -This document outlines how to create a worker cluster on an **unsupported OS** in offline mode. 
For the range of OS supported by DCE 5.0, refer to [DCE 5.0 Supported Operating Systems](../../install/commercial/deploy-requirements.md). +This document outlines how to create a worker cluster on an **unsupported OS** in offline mode. For the range of OS supported by AI platform, refer to [AI platform Supported Operating Systems](../../install/commercial/deploy-requirements.md). The main process for creating a worker cluster on an unsupported OS in offline mode is illustrated in the diagram below: @@ -10,7 +10,7 @@ Next, we will use the openAnolis operating system as an example to demonstrate h ## Prerequisites -- DCE 5.0 Full Mode has been deployed following the documentation: [Offline Installation of DCE 5.0 Enterprise](../../install/commercial/start-install.md). +- AI platform Full Mode has been deployed following the documentation: [Offline Installation of AI platform Enterprise](../../install/commercial/start-install.md). - At least one node with the same architecture and version that can connect to the internet. ## Procedure @@ -55,4 +55,4 @@ After executing the above command, wait for the interface to prompt: __All packa ### Go to the User Interface to Create Cluster -Refer to the documentation on [Creating a Worker Cluster](../user-guide/clusters/create-cluster.md) to create an openAnolis cluster. +Refer to the documentation on [Creating a Worker Cluster](../clusters/create-cluster.md) to create an openAnolis cluster. diff --git a/docs/en/docs/admin/kpanda/gpu/ascend/Ascend_usage.md b/docs/en/docs/admin/kpanda/gpu/ascend/Ascend_usage.md index f00f8084f2..94c0918cbf 100644 --- a/docs/en/docs/admin/kpanda/gpu/ascend/Ascend_usage.md +++ b/docs/en/docs/admin/kpanda/gpu/ascend/Ascend_usage.md @@ -143,18 +143,18 @@ This document uses the [AscentCL Image Classification Application](https://gitee kubectl apply -f ascend-demo.yaml ``` - Check the Pod running status: ![Ascend Pod Status](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/user-guide/gpu/images/ascend-demo-pod-status.png) + Check the Pod running status: ![Ascend Pod Status](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/ascend-demo-pod-status.png) After the Pod runs successfully, check the log results. The key prompt information on the screen is shown in the figure below. The Label indicates the category identifier, Conf indicates the maximum confidence of the classification, and Class indicates the belonging category. These values may vary depending on the version and environment, so please refer to the actual situation: - ![Ascend demo running result](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/user-guide/gpu/images/ascend-demo-pod-result.png) + ![Ascend demo running result](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/ascend-demo-pod-result.png) Result image display: - ![Ascend demo running result image](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/user-guide/gpu/images/ascend-demo-infer-result.png) + ![Ascend demo running result image](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/ascend-demo-infer-result.png) ## UI Usage @@ -162,7 +162,7 @@ This document uses the [AscentCL Image Classification Application](https://gitee and check whether the proper GPU type is automatically enabled and detected. Currently, the cluster will automatically enable __GPU__ and set the __GPU__ type to __Ascend__ . 
- ![Cluster Settings](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/user-guide/gpu/images/cluster-setting-ascend-gpu.jpg) + ![Cluster Settings](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/cluster-setting-ascend-gpu.jpg) 2. Deploy the workload. Click __Clusters__ -> __Workloads__ , deploy the workload through an image, select the type (Ascend), and then configure the number of physical cards used by the application: @@ -171,7 +171,7 @@ This document uses the [AscentCL Image Classification Application](https://gitee the current Pod needs to mount. The input value must be an integer and **less than or equal to** the number of cards on the host. - ![Workload Usage](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/user-guide/gpu/images/workload_ascendgpu_userguide.jpg) + ![Workload Usage](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/workload_ascendgpu_userguide.jpg) > If there is an issue with the above configuration, it will result in > scheduling failure and resource allocation issues. diff --git a/docs/en/docs/admin/kpanda/gpu/ascend/ascend_driver_install.md b/docs/en/docs/admin/kpanda/gpu/ascend/ascend_driver_install.md index 21fcb3864f..48334699b3 100644 --- a/docs/en/docs/admin/kpanda/gpu/ascend/ascend_driver_install.md +++ b/docs/en/docs/admin/kpanda/gpu/ascend/ascend_driver_install.md @@ -170,4 +170,4 @@ Once everything is ready, you can select the corresponding NPU device when creat !!! note - For detailed information of how to use, refer to [Using Ascend (Ascend) NPU](https://docs.daocloud.io/kpanda/user-guide/gpu/Ascend_usage/). + For detailed information of how to use, refer to [Using Ascend (Ascend) NPU](https://docs.daocloud.io/kpanda/gpu/Ascend_usage/). diff --git a/docs/en/docs/admin/kpanda/gpu/nvidia/vgpu/vgpu_addon.md b/docs/en/docs/admin/kpanda/gpu/nvidia/vgpu/vgpu_addon.md index e8c69c183c..5405cd08fc 100644 --- a/docs/en/docs/admin/kpanda/gpu/nvidia/vgpu/vgpu_addon.md +++ b/docs/en/docs/admin/kpanda/gpu/nvidia/vgpu/vgpu_addon.md @@ -26,7 +26,7 @@ This section explains how to install the vGPU plugin in the AI platform platform 3. After a successful installation, you will see two types of pods in the specified namespace, indicating that the NVIDIA vGPU plugin has been successfully installed: - ![Alt text](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/user-guide/gpu/images/vgpu-pod.png) + ![Alt text](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/vgpu-pod.png) After a successful installation, you can [deploy applications using vGPU resources](vgpu_user.md). diff --git a/docs/en/docs/admin/kpanda/network/create-services.md b/docs/en/docs/admin/kpanda/network/create-services.md index c735e07918..28bc6197b8 100644 --- a/docs/en/docs/admin/kpanda/network/create-services.md +++ b/docs/en/docs/admin/kpanda/network/create-services.md @@ -40,7 +40,7 @@ Click __Intra-Cluster Access (ClusterIP)__ , which refers to exposing services t | Service Name | [Type] Required
[Meaning] Enter the name of the new service.
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | svc-01 | | Namespace | [Type] Required
[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to [Namespace Overview](../namespaces/createns.md).
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | default | | Label selector | [Type] Required
[Meaning] Add a label; the Service selects Pods according to this label. Click "Add" after filling it in. You can also reference the label of an existing workload: click __Reference workload label__ , select the workload in the pop-up window, and the system will use the selected workload's label as the selector by default. | app:job01 | -| Port configuration| [Type] Required
[Meaning] To add a protocol port for a service, you need to select the port protocol type first. Currently, it supports TCP and UDP. For more information about the protocol, refer to [Protocol Overview](../../../dce/index.md).
**Port Name**: Enter the name of the custom port.
**Service port (port)**: The access port for Pod to provide external services.
**Container port (targetport)**: The container port that the workload actually monitors, used to expose services to the cluster. | | +| Port configuration| [Type] Required
[Meaning] To add a protocol port for a service, you need to select the port protocol type first. Currently, it supports TCP and UDP.
**Port Name**: Enter the name of the custom port.
**Service port (port)**: The access port for Pod to provide external services.
**Container port (targetport)**: The container port that the workload actually listens on, used to expose the service within the cluster. | | | Session Persistence | [Type] Optional
[Meaning] When enabled, requests from the same client will be forwarded to the same Pod | Enabled | | Maximum session hold time | [Type] Optional
[Meaning] After session hold is enabled, the maximum hold time is 30 seconds by default | 30 seconds | | Annotation | [Type] Optional
[Meaning] Add annotation for service
| | @@ -55,7 +55,7 @@ Click __NodePort__ , which means exposing the service via IP and static port ( _ | Service Name | [Type] Required
[Meaning] Enter the name of the new service.
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | svc-01 | | Namespace | [Type] Required
[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to [Namespace Overview](../namespaces/createns.md).
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | default | | Label selector | [Type] Required
[Meaning] Add a label; the Service selects Pods according to this label. Click "Add" after filling it in. You can also reference the label of an existing workload: click __Reference workload label__ , select the workload in the pop-up window, and the system will use the selected workload's label as the selector by default. | | -| Port configuration| [Type] Required
[Meaning] To add a protocol port for a service, you need to select the port protocol type first. Currently, it supports TCP and UDP. For more information about the protocol, refer to [Protocol Overview](../../../dce/index.md).
**Port Name**: Enter the name of the custom port.
**Service port (port)**: The access port for Pod to provide external services. *By default, the service port is set to the same value as the container port field for convenience. *
**Container port (targetport)**: The container port actually monitored by the workload.
**Node port (nodeport)**: The port of the node, which receives traffic from ClusterIP transmission. It is used as the entrance for external traffic access. | | +| Port configuration| [Type] Required
[Meaning] To add a protocol port for a service, you need to select the port protocol type first. Currently, it supports TCP and UDP.
**Port Name**: Enter the name of the custom port.
**Service port (port)**: The access port for Pod to provide external services. *By default, the service port is set to the same value as the container port field for convenience.*
**Container port (targetport)**: The container port actually listened on by the workload.
**Node port (nodeport)**: The port on the node that receives traffic forwarded from the ClusterIP. It serves as the entry point for external traffic. | | | Session Persistence| [Type] Optional
[Meaning] When enabled, requests from the same client will be forwarded to the same Pod
When enabled, the Service's __.spec.sessionAffinity__ is set to __ClientIP__ . For details, refer to [Session Affinity for Service](https://kubernetes.io/docs/reference/networking/virtual-ips/#session-affinity) | Enabled | | Maximum session hold time| [Type] Optional
[Meaning] The maximum hold time after session persistence is enabled; the default timeout is 30 seconds
.spec.sessionAffinityConfig.clientIP.timeoutSeconds is set to 30 seconds by default | 30 seconds | | Annotation | [Type] Optional
[Meaning] Add annotation for service
| | @@ -74,7 +74,7 @@ Click __Load Balancer__ , which refers to using the cloud provider's load balanc | Load balancing type | [Type] Required
[Meaning] The type of load balancing to use; currently MetalLB and others are supported. | | | | MetalLB IP Pool| [Type] Required
[Meaning] When the selected load balancing type is MetalLB, the LoadBalancer Service will allocate IP addresses from this pool by default and announce all IP addresses in this pool via ARP. For details, refer to: [Install MetalLB](../../../network/modules/metallb/install.md) | | | | Load balancing address| [Type] Required
[Meaning]
1. If you are using a public cloud CloudProvider, fill in the load balancing address provided by the cloud provider here;
2. If MetalLB is selected as the load balancing type above, the IP will be obtained from the above IP pool by default; if left blank, it will be assigned automatically. | | | -| Port configuration| [Type] Required
[Meaning] To add a protocol port for a service, you need to select the port protocol type first. Currently, it supports TCP and UDP. For more information about the protocol, refer to [Protocol Overview](../../../dce/index.md).
**Port Name**: Enter the name of the custom port.
**Service port (port)**: The access port for Pod to provide external services. By default, the service port is set to the same value as the container port field for convenience.
**Container port (targetport)**: The container port actually monitored by the workload.
**Node port (nodeport)**: The port of the node, which receives traffic from ClusterIP transmission. It is used as the entrance for external traffic access. | | | +| Port configuration| [Type] Required
[Meaning] To add a protocol port for a service, you need to select the port protocol type first. Currently, it supports TCP and UDP.
**Port Name**: Enter the name of the custom port.
**Service port (port)**: The access port for Pod to provide external services. By default, the service port is set to the same value as the container port field for convenience.
**Container port (targetport)**: The container port actually listened on by the workload.
**Node port (nodeport)**: The port on the node that receives traffic forwarded from the ClusterIP. It serves as the entry point for external traffic. | | | | Annotation | [Type] Optional
[Meaning] Add annotation for service
| | | ### Complete service creation diff --git a/docs/en/docs/admin/kpanda/workloads/create-deployment.md b/docs/en/docs/admin/kpanda/workloads/create-deployment.md index 654ed77351..1e19aea87d 100644 --- a/docs/en/docs/admin/kpanda/workloads/create-deployment.md +++ b/docs/en/docs/admin/kpanda/workloads/create-deployment.md @@ -9,7 +9,7 @@ This page describes how to create deployments through images and YAML files. [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) is a common resource in Kubernetes, mainly [Pod](https://kubernetes.io/docs/concepts/workloads/pods/) and [ReplicaSet](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/) provide declarative updates, support elastic scaling, rolling upgrades, and version rollbacks features. Declare the desired Pod state in the Deployment, and the Deployment Controller will modify the current state through the ReplicaSet to make it reach the pre-declared desired state. Deployment is stateless and does not support data persistence. It is suitable for deploying stateless applications that do not need to save data and can be restarted and rolled back at any time. -Through the container management module of [AI platform](../../../dce/index.md), workloads on multicloud and multiclusters can be easily managed based on corresponding role permissions, including the creation of deployments, Full life cycle management such as update, deletion, elastic scaling, restart, and version rollback. +Through the container management module of AI platform, workloads on multicloud and multiclusters can be easily managed based on corresponding role permissions, including the creation of deployments, Full life cycle management such as update, deletion, elastic scaling, restart, and version rollback. ## Prerequisites diff --git a/docs/en/docs/admin/kpanda/workloads/create-statefulset.md b/docs/en/docs/admin/kpanda/workloads/create-statefulset.md index 95090f6f45..cc0f3bacfe 100644 --- a/docs/en/docs/admin/kpanda/workloads/create-statefulset.md +++ b/docs/en/docs/admin/kpanda/workloads/create-statefulset.md @@ -9,7 +9,7 @@ This page describes how to create a StatefulSet through image and YAML files. [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) is a common resource in Kubernetes, and [Deployment](create-deployment.md), mainly used to manage the deployment and scaling of Pod collections. The main difference between the two is that Deployment is stateless and does not save data, while StatefulSet is stateful and is mainly used to manage stateful applications. In addition, Pods in a StatefulSet have a persistent ID, which makes it easy to identify the corresponding Pod when matching storage volumes. -Through the container management module of [AI platform](../../../dce/index.md), workloads on multicloud and multiclusters can be easily managed based on corresponding role permissions, including the creation of StatefulSets, update, delete, elastic scaling, restart, version rollback and other full life cycle management. +Through the container management module of AI platform, workloads on multicloud and multiclusters can be easily managed based on corresponding role permissions, including the creation of StatefulSets, update, delete, elastic scaling, restart, version rollback and other full life cycle management. 
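To make the behavior described above concrete, the following is a minimal sketch of a StatefulSet manifest; it is not taken from the platform, and the image name, StorageClass, and sizes are placeholder assumptions. It illustrates the stable Pod identity and the per-Pod persistent storage (via `volumeClaimTemplates`) that distinguish a StatefulSet from a Deployment.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web-headless        # headless Service that gives each Pod a stable DNS name
  replicas: 3                      # Pods are named web-0, web-1, web-2 and keep these identities
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25        # placeholder image
          ports:
            - containerPort: 80
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:            # one PVC is created per Pod and follows that Pod's identity
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-path   # placeholder StorageClass
        resources:
          requests:
            storage: 1Gi
```

A manifest like this can be used with the YAML creation entry of the container management module when creating a StatefulSet.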
## Prerequisites diff --git a/docs/en/docs/admin/register/bindws.md b/docs/en/docs/admin/register/bindws.md index d71a16804f..3b20bf10d4 100644 --- a/docs/en/docs/admin/register/bindws.md +++ b/docs/en/docs/admin/register/bindws.md @@ -1,38 +1,38 @@ -# 为用户绑定工作空间 +# Binding a Workspace for the User -用户成功注册之后,需要为其绑定一个工作空间。 +After a user successfully registers, a workspace needs to be bound to them. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- [用户已成功注册](index.md) -- 有一个可用的管理员账号 +- AI platform installed +- [User has successfully registered](index.md) +- An available administrator account -## 操作步骤 +## Steps to Follow -1. 以管理员身份登录 AI 算力平台 -1. 导航切换至 **全局管理** -> **工作空间与层级** ,点击 **创建工作空间** +1. Log in to the AI platform as an administrator. +2. Navigate to **Global Management** -> **Workspace and Folder**, and click **Create Workspace**. ![workspace](../images/bindws01.png) -1. 输入名称,选择文件夹后点击 **确定** ,创建一个工作空间 +3. Enter the workspace name, select a folder, and click **OK** to create a workspace. ![create ws](../images/bindws02.png) -1. 给工作空间绑定资源 +4. Bind resources to the workspace. ![bind resource](../images/bindws07.png) - 可以在这个界面上点击 **创建集群-命名空间** 来创建一个命名空间。 + On this interface, you can click **Create Namespace** to create a namespace. -1. 添加授权:将用户分配至工作空间 +5. Add authorization: Assign the user to the workspace. - ![授权1](../images/bindws08.png) - ![授权2](../images/bindws09.png) + ![authorization1](../images/bindws08.png) + ![authorization2](../images/bindws09.png) -1. 用户登录 AI 算力平台,查看是否具有工作空间及命名空间的权限。 - 管理员可以通过右侧的 **┇** 执行更多操作。 +6. The user logs in to the AI platform to check if they have permissions for the workspace and namespace. + The administrator can perform more actions through the **┇** on the right side. - ![确认](../images/bindws11.png) + ![confirmation](../images/bindws11.png) -下一步:[为工作空间分配资源](./wsres.md) +Next step: [Allocate Resources for the Workspace](./wsres.md) diff --git a/docs/en/docs/admin/register/index.md b/docs/en/docs/admin/register/index.md index 09f9816086..3bc22889bd 100644 --- a/docs/en/docs/admin/register/index.md +++ b/docs/en/docs/admin/register/index.md @@ -1,33 +1,33 @@ -# 用户注册 +# User Registration -新用户首次使用 AI 算力平台需要进行注册。 +New users need to register to use the AI platform for the first time. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- 已开启邮箱注册功能 -- 有一个可用的邮箱 +- The AI platform is installed +- Email registration feature is enabled +- An available email address -## 邮箱注册步骤 +## Email Registration Steps -1. 打开 AI 算力平台首页 ,点击 **注册** +1. Open the AI platform homepage at and click **Register**. ![home](../../images/regis01.PNG) -1. 键入用户名、密码、邮箱后点击 **注册** +2. Enter your username, password, and email, then click **Register**. ![to register](../../images/regis02.PNG) -1. 系统提示发送了一封邮件到您的邮箱。 +3. The system will prompt that an email has been sent to your inbox. ![to register](../../images/regis03.PNG) -1. 登录自己的邮箱,找到邮件,点击链接。 +4. Log in to your email, find the email, and click the link. ![email](../../images/regis04.PNG) -1. 恭喜,您成功进入了 AI 算力平台,现在可以开始您的 AI 之旅了。 +5. Congratulations, you have successfully accessed the AI platform, and you can now begin your AI journey. 
![verify](../../images/regis05.PNG) -下一步:[为用户绑定工作空间](bindws.md) +Next step: [Bind a Workspace for the User](bindws.md) diff --git a/docs/en/docs/admin/register/wsres.md b/docs/en/docs/admin/register/wsres.md index ac575afcd3..413f69973e 100644 --- a/docs/en/docs/admin/register/wsres.md +++ b/docs/en/docs/admin/register/wsres.md @@ -1,26 +1,26 @@ -# 为工作空间分配资源 +# Allocate Resources to the Workspace -将[用户绑定到工作空间](./bindws.md)后,需要给工作空间分配合适的资源。 +After [binding a user to a workspace](./bindws.md), it is necessary to allocate appropriate resources to the workspace. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- 有一个可用的管理员账号 -- 工作空间已创建且绑定了命名空间 +- The AI platform is installed +- An available administrator account +- The workspace has been created and bound to a namespace -## 操作步骤 +## Steps -1. 以管理员身份登录 AI 算力平台 -1. 导航到 **全局管理** -> **工作空间与层级**,找到要添加资源的工作空间,点击 **新增共享资源** +1. Log in to the AI platform as an administrator. +2. Navigate to **Global Management** -> **Workspace and Folder**, find the workspace to which you want to add resources, and click **Add Shared Resources**. - ![点击按钮](../images/wsres01.png) + ![Click Button](../images/wsres01.png) -1. 选择集群,设置合适的资源配额后,点击 **确定** +3. Select the cluster, set the appropriate resource quota, and then click **OK** - ![配置](../images/wsres02.png) + ![Configuration](../images/wsres02.png) -1. 返回共享资源页,为工作空间成功分配了资源,管理员可以通过右侧的 **┇** 随时修改。 +4. Return to the shared resources page. Resources have been successfully allocated to the workspace, and the administrator can modify them at any time using the **┇** on the right side. - ![成功](../images/wsres03.png) + ![Success](../images/wsres03.png) -下一步:[创建云主机](../host/createhost.md) +Next step: [Create a Cloud Host](../host/createhost.md) diff --git a/docs/en/docs/admin/share/infer.md b/docs/en/docs/admin/share/infer.md deleted file mode 100644 index 7909df9e5c..0000000000 --- a/docs/en/docs/admin/share/infer.md +++ /dev/null @@ -1,2 +0,0 @@ -# 创建推理服务 - diff --git a/docs/en/docs/admin/share/job.md b/docs/en/docs/admin/share/job.md deleted file mode 100644 index c37b349178..0000000000 --- a/docs/en/docs/admin/share/job.md +++ /dev/null @@ -1 +0,0 @@ -# 创建训练任务 diff --git a/docs/en/docs/admin/share/notebook.md b/docs/en/docs/admin/share/notebook.md index 1285232f0e..8b6f32e35e 100644 --- a/docs/en/docs/admin/share/notebook.md +++ b/docs/en/docs/admin/share/notebook.md @@ -1,85 +1,83 @@ -# 使用 Notebook +# Using Notebook -Notebook 通常指的是 Jupyter Notebook 或类似的交互式计算环境。 -这是一种非常流行的工具,广泛用于数据科学、机器学习和深度学习等领域。 -本页说明如何在算丰 AI 算力平台中使用 Notebook。 +Notebook typically refers to Jupyter Notebook or similar interactive computing environments. This is a very popular tool widely used in fields such as data science, machine learning, and deep learning. This page explains how to use Notebook on the Canfeng AI platform. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- [用户已成功注册](../register/index.md) -- [管理员为用户分配了工作空间](../register/bindws.md) -- 已准备好数据集(代码、数据等) +- The AI platform is installed +- [The user has successfully registered](../register/index.md) +- [The administrator has assigned a workspace to the user](../register/bindws.md) +- A dataset (code, data, etc.) is prepared -## 创建和使用 Notebook 实例 +## Creating and Using Notebook Instances -1. 以 **管理员身份** 登录 AI 算力平台 -1. 导航至 **AI Lab** -> **运维管理** -> **队列管理** ,点击右侧的 **创建** 按钮 +1. Log in to the AI platform as an **Administrator**. +2. Navigate to **AI Lab** -> **Queue Management**, and click the **Create** button on the right. ![create queue](../images/notebook01.png) -1. 键入名称,选择集群、工作空间和配额后,点击 **确定** +3. 
After entering a name, selecting a cluster, workspace, and quota, click **OK**. ![ok](../images/notebook02.png) -1. 以 **用户身份** 登录 AI 算力平台,导航至 **AI Lab** -> **Notebook** ,点击右侧的 **创建** 按钮 +4. Log in to the AI platform as a **User**, navigate to **AI Lab** -> **Notebook**, and click the **Create** button on the right. ![create notebook](../images/notebook03.png) -1. 配置各项参数后点击 **确定** +5. After configuring the parameters, click **OK**. - === "基本信息" + === "Basic Information" - 键入名称,选择集群、命名空间,选择刚创建的队列,点击 **一键初始化** + Enter a name, select a cluster, namespace, choose the newly created queue, and click **One-click Initialization**. ![basic](../images/notebook04.png) - === "资源配置" + === "Resource Configuration" - 选择 Notebook 类型,配置内存、CPU,开启 GPU,创建和配置 PVC: + Select Notebook type, configure memory and CPU, enable GPU, create and configure PVC: ![resource](../images/notebook05.png) - === "高级配置" + === "Advanced Configuration" - 开启 SSH 外网访问: + Enable SSH external access: ![advanced](../images/notebook06.png) -1. 自动跳转到 Notebook 实例列表,点击实例名称 +6. You will be automatically redirected to the Notebook instance list; click on the instance name. ![click name](../images/notebook07.png) -1. 进入 Notebook 实例详情页,点击右上角的 **打开** 按钮 +7. Enter the Notebook instance details page and click the **Open** button in the upper right corner. ![open](../images/notebook08.png) -1. 进入了 Notebook 开发环境,比如在 `/home/jovyan` 目录挂载了持久卷,可以通过 git 克隆代码,通过 SSH 连接后上传数据等。 +8. You will enter the Notebook development environment, where a persistent volume is mounted in the `/home/jovyan` directory. You can clone code using git and upload data after connecting via SSH, etc. ![notebook](../images/notebook09.png) -## 通过 SSH 访问 Notebook 实例 +## Accessing Notebook Instances via SSH -1. 在自己的电脑上生成 SSH 密钥对 +1. Generate an SSH key pair on your own computer. - 在自己电脑上打开命令行,比如在 Windows 上打开 git bash,输入 `ssh-keygen.exe -t rsa`,然后一路回车。 + Open the command line on your computer, for example, open git bash on Windows, and enter `ssh-keygen.exe -t rsa`, then press enter until completion. ![generate](../images/ssh01.png) -1. 通过 `cat ~/.ssh/id_rsa.pub` 等命令查看并复制公钥 +2. Use commands like `cat ~/.ssh/id_rsa.pub` to view and copy the public key. ![copy key](../images/ssh02.png) -1. 以用户身份登录 AI 算力平台,在右上角点击 **个人中心** -> **SSH 公钥** -> **导入 SSH 公钥** +3. Log in to the AI platform as a user, click on **Personal Center** in the upper right corner -> **SSH Public Key** -> **Import SSH Public Key**. ![import](../images/ssh03.png) -1. 进入 Notebook 实例的详情页,复制 SSH 的链接 +4. Go to the details page of the Notebook instance and copy the SSH link. ![copy link](../images/ssh04.png) -1. 在客户端使用 SSH 访问 Notebook 实例 +5. Use SSH to access the Notebook instance from the client. ![ssh](../images/ssh05.png) -下一步:[创建训练任务](../baize/developer/jobs/create.md) +Next step: [Create Training Jobs](../baize/developer/jobs/create.md) diff --git a/docs/en/docs/admin/share/quota.md b/docs/en/docs/admin/share/quota.md index 8a020d10ac..2b491ba154 100644 --- a/docs/en/docs/admin/share/quota.md +++ b/docs/en/docs/admin/share/quota.md @@ -1,26 +1,26 @@ -# 配额管理 +# Quota Management -用户被绑定到工作空间后,即可为工作空间分配资源,管理资源配额。 +Once a user is bound to a workspace, resources can be allocated to the workspace, and resource quotas can be managed. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- 有一个可用的管理员账号 +- The AI platform is installed +- There is an available administrator account -## 创建和管理配额 +## Creating and Managing Quotas -1. 以 **管理员身份** 登录 AI 算力平台 -1. [创建工作空间和命名空间,并绑定用户](../register/bindws.md) -1. [为工作空间分配资源配额](../register/wsres.md#quota) +1. 
Log in to the AI platform as an **Administrator**. +2. [Create a workspace and namespace, and bind users](../register/bindws.md). +3. [Allocate resource quotas to the workspace](../register/wsres.md#quota). ![quota to ws](../images/quota01.png) -1. 管理命名空间 `test-ns-1` 的资源配额,其数值不能超过工作空间的配额。 +4. Manage the resource quotas for the namespace `test-ns-1`, ensuring that the values do not exceed the workspace's quota. ![quota to ns](../images/quota02.png) -1. 以 **用户身份** 登录 AI 算力平台,查看其是否被分配了 `test-ns-1` 命名空间。 +5. Log in to the AI platform as a **User** to check if they have been assigned the `test-ns-1` namespace. ![check ns](../images/quota03.png) -下一步:[创建 AI 负载使用 GPU 资源](./workload.md) +Next step: [Create AI Workloads Using GPUs](./workload.md) diff --git a/docs/en/docs/admin/share/workload.md b/docs/en/docs/admin/share/workload.md index 041f5f1dca..62f779fb40 100644 --- a/docs/en/docs/admin/share/workload.md +++ b/docs/en/docs/admin/share/workload.md @@ -1,51 +1,50 @@ -# 创建 AI 负载使用 GPU 资源 +# Creating AI Workloads Using GPU Resources -管理员为工作空间分配资源配额后,用户就可以创建 AI 工作负载来使用 GPU 算力资源。 +After the administrator allocates resource quotas for the workspace, users can create AI workloads to utilize GPU computing resources. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- [用户已成功注册](../register/index.md) -- [管理员为用户分配了工作空间](../register/bindws.md) -- [为工作空间设置了资源配额](./quota.md) -- [已经创建了一个集群](../k8s/create-k8s.md) +- The AI platform is installed +- [User has successfully registered](../register/index.md) +- [Administrator has assigned a workspace to the user](../register/bindws.md) +- [Resource quotas have been set for the workspace](./quota.md) +- [A cluster has been created](../k8s/create-k8s.md) -## 创建 AI 负载步骤 +## Steps to Create AI Workloads -1. 以用户身份登录 AI 算力平台 -1. 导航至 **容器管理** ,选择一个命名空间,点击 **工作负载** -> **无状态负载** , - 点击右侧的 **镜像创建** 按钮 +1. Log in to the AI platform as a **User**. +2. Navigate to **Container Management**, select a namespace, then click **Workloads** -> **Deployments**, and then click the **Create from Image** button on the right. ![button](../images/workload01.png) -1. 配置各项参数后点击 **确定** +3. After configuring the parameters, click **OK**. - === "基本信息" + === "Basic Information" - 选择自己的命名空间。 + Select your own namespace. ![basic](../images/workload02.png) - === "容器配置" + === "Container Configuration" - 设置镜像,配置 CPU、内存、GPU 等资源,设置启动命令。 + Set the image, configure resources such as CPU, memory, and GPU, and set the startup command. ![container](../images/workload03.png) - === "其他" + === "Others" - 服务配置和高级配置可以使用默认配置。 + Service configuration and advanced settings can use default configurations. -1. 自动返回无状态负载列表,点击负载名称 +4. Automatically return to the stateless workload list and click on the workload name. ![click name](../images/workload04.png) -1. 进入详情页,可以看到 GPU 配额 +5. Enter the details page to view the GPU quota. ![check gpu](../images/workload05.png) -1. 你还可以进入控制台,运行 `mx-smi` 命令查看 GPU 资源 +6. You can also enter the console and run the `mx-smi` command to check the GPU resources. 
![check gpu](../images/workload06.png) -下一步:[使用 Notebook](./notebook.md) +Next step: [Using Notebook](./notebook.md) diff --git a/docs/en/docs/admin/virtnest/best-practice/import-ubuntu.md b/docs/en/docs/admin/virtnest/best-practice/import-ubuntu.md index 732ea9f6bf..00bfd4a123 100644 --- a/docs/en/docs/admin/virtnest/best-practice/import-ubuntu.md +++ b/docs/en/docs/admin/virtnest/best-practice/import-ubuntu.md @@ -6,7 +6,7 @@ DATE: 2024-07-12 # Import a Linux Virtual Machine with Ubuntu from an External Platform This page provides a detailed introduction on how to import Linux virtual machines from the external platform VMware -into the virtual machines of DCE 5.0 through the command line. +into the virtual machines of AI platform through the command line. !!! info diff --git a/docs/en/docs/admin/virtnest/best-practice/import-windows.md b/docs/en/docs/admin/virtnest/best-practice/import-windows.md index 541ab9b0d2..d31fb33a8c 100644 --- a/docs/en/docs/admin/virtnest/best-practice/import-windows.md +++ b/docs/en/docs/admin/virtnest/best-practice/import-windows.md @@ -5,7 +5,7 @@ DATE: 2024-07-30 # Import a Windows Virtual Machine from the External Platform -This page provides a detailed introduction on how to import virtual machines from an external platform -- VMware, into the virtual machines of DCE 5.0 using the command line. +This page provides a detailed introduction on how to import virtual machines from an external platform -- VMware, into the virtual machines of AI platform using the command line. !!! info @@ -27,7 +27,7 @@ Similar to importing a virtual machine with a Linux operating system, refer to [ ### Check the Boot Type of Windows -When importing a virtual machine from an external platform into the DCE 5.0 virtualization platform, +When importing a virtual machine from an external platform into the AI platform virtualization platform, you need to configure it according to the boot type (BIOS or UEFI) to ensure it can boot and run correctly. You can check whether Windows uses BIOS or UEFI through "System Summary." If it uses UEFI, you need to diff --git a/docs/en/docs/end-user/baize/dataset/create-use-delete.md b/docs/en/docs/end-user/baize/dataset/create-use-delete.md new file mode 100644 index 0000000000..b0de9cc15f --- /dev/null +++ b/docs/en/docs/end-user/baize/dataset/create-use-delete.md @@ -0,0 +1,89 @@ +--- +MTPE: windsonsea +date: 2024-05-21 +--- + +# Create, Use and Delete Datasets + +AI Lab provides comprehensive dataset management functions needed for model development, +training, and inference processes. Currently, it supports unified access to various data sources. + +With simple configurations, you can connect data sources to AI Lab, achieving unified data management, +preloading, dataset management, and other functionalities. + +## Create a Dataset + +1. In the left navigation bar, click **Data Management** -> **Dataset List**, and then click the **Create** button + on the right. + + ![Click Create](../../images/dataset01.png) + +2. Select the worker cluster and namespace to which the dataset belongs, then click **Next**. + + ![Fill in Parameters](../../images/dataset02.png) + +3. Configure the data source type for the target data, then click **OK**. 
+ + ![Task Resource Configuration](../../images/dataset03.png) + + Currently supported data sources include: + + - GIT: Supports repositories such as GitHub, GitLab, and Gitee + - S3: Supports object storage like Amazon Cloud + - HTTP: Directly input a valid HTTP URL + - PVC: Supports pre-created Kubernetes PersistentVolumeClaim + - NFS: Supports NFS shared storage + +4. Upon successful creation, the dataset will be returned to the dataset list. + You can perform more actions by clicking **┇** on the right. + + ![Dataset List](../../images/dataset04.png) + +!!! info + + The system will automatically perform a one-time data preloading after the dataset is successfully created; the dataset cannot be used until the preloading is complete. + +## Use a Dataset + +Once the dataset is successfully created, it can be used in tasks such as model training and inference. + +### Use in Notebook + +In creating a Notebook, you can directly use the dataset; the usage is as follows: + +- Use the dataset as training data mount +- Use the dataset as code mount + +![Dataset List](../../images/dataset05.png) + +### Use in Training obs + +- Use the dataset to specify job output +- Use the dataset to specify job input +- Use the dataset to specify TensorBoard output + +![jobs](../../images/dataset06.png) + +### Use in Inference Services + +- Use the dataset to mount a model + +![Inference Service](../../images/dataset07.png) + +## Delete a Dataset + +If you find a dataset to be redundant, expired, or no longer needed, you can delete it from the dataset list. + +1. Click the **┇** on the right side of the dataset list, then choose **Delete** from the dropdown menu. + + ![Delete](../../images/ds-delete01.png) + +2. In the pop-up window, confirm the dataset you want to delete, enter the dataset name, and then click **Delete**. + + ![Confirm](../../images/ds-delete02.png) + +3. A confirmation message will appear indicating successful deletion, and the dataset will disappear from the list. + +!!! caution + + Once a dataset is deleted, it cannot be recovered, so please proceed with caution. diff --git a/docs/en/docs/end-user/baize/dataset/environments.md b/docs/en/docs/end-user/baize/dataset/environments.md new file mode 100644 index 0000000000..6c41bf62c0 --- /dev/null +++ b/docs/en/docs/end-user/baize/dataset/environments.md @@ -0,0 +1,95 @@ +--- +MTPE: windsonsea +date: 2024-06-17 +--- + +# Manage Python Environment Dependencies + +This document aims to guide users on managing environment dependencies using AI platform. Below are the specific steps and considerations. + +1. [Overview of Environment Management](#overview) +2. [Create New Environment](#creat-new-environment) +3. [Configure Environment](#configure-environment) +4. [Troubleshooting](#troubleshooting) + +## Overview + +Traditionally, Python environment dependencies are built into an image, which includes the Python version +and dependency packages. This approach has high maintenance costs and is inconvenient to update, often requiring a complete rebuild of the image. + +In AI Lab, users can manage pure environment dependencies through the +**Environment Management** module, decoupling this part from the image. The advantages include: + +- One environment can be used in multiple places, such as in Notebooks, distributed training tasks, and even inference services. +- Updating dependency packages is more convenient; you only need to update the environment dependencies without rebuilding the image. 
+ +The main components of the environment management are: + +- **Cluster** : Select the cluster to operate on. +- **Namespace** : Select the namespace to limit the scope of operations. +- **Environment List** : Displays all environments and their statuses under the current cluster and namespace. + +![Environment Management](../../images/conda01.png) + +### Explanation of Environment List Fields + +- **Name** : The name of the environment. +- **Status** : The current status of the environment (normal or failed). New environments undergo a warming-up process, after which they can be used in other tasks. +- **Creation Time** : The time the environment was created. + +## Creat New Environment + +On the **Environment Management** interface, click the **Create** button at the top right +to enter the environment creation process. + +![Environment Management](../../images/conda02.png) + +Fill in the following basic information: + +- **Name** : Enter the environment name, with a length of 2-63 characters, + starting and ending with lowercase letters or numbers. +- **Deployment Location**: + - **Cluster** : Select the cluster to deploy, such as `gpu-cluster`. + - **Namespace** : Select the namespace, such as `default`. +- **Remarks** (optional): Enter remarks. +- **Labels** (optional): Add labels to the environment. +- **Annotations** (optional): Add annotations to the environment. After completing the information, + click **Next** to proceed to environment configuration. + +## Configure Environment + +In the environment configuration step, users need to configure the Python version and dependency management tool. + +![Environment Management](../../images/conda03.png) + +### Environment Settings + +- **Python Version** : Select the required Python version, such as `3.12.3`. +- **Package Manager** : Choose the package management tool, either `PIP` or `CONDA`. +- **Environment Data** : + - If `PIP` is selected: Enter the dependency package list in `requirements.txt` format in the editor below. + - If `CONDA` is selected: Enter the dependency package list in `environment.yaml` format in the editor below. +- **Other Options** (optional): + - **Additional pip Index URLs** : Configure additional pip index URLs; suitable for internal enterprise private repositories or PIP acceleration sites. + - **GPU Configuration** : Enable or disable GPU configuration; some GPU-related dependency packages + need GPU resources configured during preloading. + - **Associated Storage** : Select the associated storage configuration; environment dependency packages + will be stored in the associated storage. **Note: Storage must support `ReadWriteMany`.** + +After configuration, click the **Create** button, and the system will automatically create and configure the new Python environment. + +## Troubleshooting + +- If environment creation fails: + - Check if the network connection is normal. + - Verify that the Python version and package manager configuration are correct. + - Ensure the selected cluster and namespace are available. + +- If dependency preloading fails: + - Check if the `requirements.txt` or `environment.yaml` file format is correct. + - Verify that the dependency package names and versions are correct. If other issues arise, + contact the platform administrator or refer to the platform help documentation for more support. + +--- + +These are the basic steps and considerations for managing Python dependencies in AI Lab. 
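As a concrete illustration of the environment data formats mentioned above, here is a hypothetical `environment.yaml` for the `CONDA` package manager; the package names and versions are only examples and are not prescribed by AI Lab:

```yaml
name: demo-env                # example environment name
channels:
  - conda-forge
dependencies:
  - python=3.12.3             # should match the Python version selected in the environment settings
  - pip
  - numpy
  - pandas
  - pip:
      - torch==2.3.0          # example pip packages installed inside the conda environment
      - transformers==4.40.0
```

If `PIP` is selected instead, only the package list is needed, written line by line in the usual `requirements.txt` style (for example `torch==2.3.0`).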
diff --git a/docs/en/docs/end-user/baize/images/agent-helm.png b/docs/en/docs/end-user/baize/images/agent-helm.png new file mode 100644 index 0000000000..7a98add93e Binary files /dev/null and b/docs/en/docs/end-user/baize/images/agent-helm.png differ diff --git a/docs/en/docs/end-user/baize/images/bind01.png b/docs/en/docs/end-user/baize/images/bind01.png new file mode 100644 index 0000000000..e36807afe2 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/bind01.png differ diff --git a/docs/en/docs/end-user/baize/images/bind02.png b/docs/en/docs/end-user/baize/images/bind02.png new file mode 100644 index 0000000000..eea10f9093 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/bind02.png differ diff --git a/docs/en/docs/end-user/baize/images/bind03.png b/docs/en/docs/end-user/baize/images/bind03.png new file mode 100644 index 0000000000..3f0fbfd1ea Binary files /dev/null and b/docs/en/docs/end-user/baize/images/bind03.png differ diff --git a/docs/en/docs/end-user/baize/images/bind04.png b/docs/en/docs/end-user/baize/images/bind04.png new file mode 100644 index 0000000000..17b47e9a3f Binary files /dev/null and b/docs/en/docs/end-user/baize/images/bind04.png differ diff --git a/docs/en/docs/end-user/baize/images/change-ws.png b/docs/en/docs/end-user/baize/images/change-ws.png new file mode 100644 index 0000000000..cd6430ab39 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/change-ws.png differ diff --git a/docs/en/docs/end-user/baize/images/cluster.png b/docs/en/docs/end-user/baize/images/cluster.png new file mode 100644 index 0000000000..8a86a73415 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/cluster.png differ diff --git a/docs/en/docs/end-user/baize/images/conda01.png b/docs/en/docs/end-user/baize/images/conda01.png new file mode 100644 index 0000000000..c32d768a69 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/conda01.png differ diff --git a/docs/en/docs/end-user/baize/images/conda02.png b/docs/en/docs/end-user/baize/images/conda02.png new file mode 100644 index 0000000000..b840730057 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/conda02.png differ diff --git a/docs/en/docs/end-user/baize/images/conda03.png b/docs/en/docs/end-user/baize/images/conda03.png new file mode 100644 index 0000000000..7357355c74 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/conda03.png differ diff --git a/docs/en/docs/end-user/baize/images/dataset01.png b/docs/en/docs/end-user/baize/images/dataset01.png new file mode 100644 index 0000000000..8bb75feac2 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/dataset01.png differ diff --git a/docs/en/docs/end-user/baize/images/dataset02.png b/docs/en/docs/end-user/baize/images/dataset02.png new file mode 100644 index 0000000000..81ddd67d36 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/dataset02.png differ diff --git a/docs/en/docs/end-user/baize/images/dataset03.png b/docs/en/docs/end-user/baize/images/dataset03.png new file mode 100644 index 0000000000..785587c174 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/dataset03.png differ diff --git a/docs/en/docs/end-user/baize/images/dataset04.png b/docs/en/docs/end-user/baize/images/dataset04.png new file mode 100644 index 0000000000..269323d89a Binary files /dev/null and b/docs/en/docs/end-user/baize/images/dataset04.png differ diff --git a/docs/en/docs/end-user/baize/images/dataset05.png b/docs/en/docs/end-user/baize/images/dataset05.png new file mode 100644 index 
0000000000..e2fb6a1cbb Binary files /dev/null and b/docs/en/docs/end-user/baize/images/dataset05.png differ diff --git a/docs/en/docs/end-user/baize/images/dataset06.png b/docs/en/docs/end-user/baize/images/dataset06.png new file mode 100644 index 0000000000..9cd8904735 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/dataset06.png differ diff --git a/docs/en/docs/end-user/baize/images/dataset07.png b/docs/en/docs/end-user/baize/images/dataset07.png new file mode 100644 index 0000000000..dc6b8623c5 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/dataset07.png differ diff --git a/docs/en/docs/end-user/baize/images/delete02.png b/docs/en/docs/end-user/baize/images/delete02.png new file mode 100644 index 0000000000..d2d397ded7 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/delete02.png differ diff --git a/docs/en/docs/end-user/baize/images/ds-delete01.png b/docs/en/docs/end-user/baize/images/ds-delete01.png new file mode 100644 index 0000000000..7a52e757b3 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/ds-delete01.png differ diff --git a/docs/en/docs/end-user/baize/images/ds-delete02.png b/docs/en/docs/end-user/baize/images/ds-delete02.png new file mode 100644 index 0000000000..4bef572c40 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/ds-delete02.png differ diff --git a/docs/en/docs/end-user/baize/images/ds-delete03.png b/docs/en/docs/end-user/baize/images/ds-delete03.png new file mode 100644 index 0000000000..8dfcf026df Binary files /dev/null and b/docs/en/docs/end-user/baize/images/ds-delete03.png differ diff --git a/docs/en/docs/end-user/baize/images/inference-interface.png b/docs/en/docs/end-user/baize/images/inference-interface.png new file mode 100644 index 0000000000..a989517651 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/inference-interface.png differ diff --git a/docs/en/docs/end-user/baize/images/job01.png b/docs/en/docs/end-user/baize/images/job01.png new file mode 100644 index 0000000000..d3e5dec8fe Binary files /dev/null and b/docs/en/docs/end-user/baize/images/job01.png differ diff --git a/docs/en/docs/end-user/baize/images/job02.png b/docs/en/docs/end-user/baize/images/job02.png new file mode 100644 index 0000000000..c6bf3a13d6 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/job02.png differ diff --git a/docs/en/docs/end-user/baize/images/job03.png b/docs/en/docs/end-user/baize/images/job03.png new file mode 100644 index 0000000000..e78e9cad6f Binary files /dev/null and b/docs/en/docs/end-user/baize/images/job03.png differ diff --git a/docs/en/docs/end-user/baize/images/job04.png b/docs/en/docs/end-user/baize/images/job04.png new file mode 100644 index 0000000000..7e83937eda Binary files /dev/null and b/docs/en/docs/end-user/baize/images/job04.png differ diff --git a/docs/en/docs/end-user/baize/images/notebook-idle.png b/docs/en/docs/end-user/baize/images/notebook-idle.png new file mode 100644 index 0000000000..9f483e8be8 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/notebook-idle.png differ diff --git a/docs/en/docs/end-user/baize/images/notebook-idle02.png b/docs/en/docs/end-user/baize/images/notebook-idle02.png new file mode 100644 index 0000000000..748e697f3a Binary files /dev/null and b/docs/en/docs/end-user/baize/images/notebook-idle02.png differ diff --git a/docs/en/docs/end-user/baize/images/notebook-images.png b/docs/en/docs/end-user/baize/images/notebook-images.png new file mode 100644 index 0000000000..12bb1ac5eb Binary files 
/dev/null and b/docs/en/docs/end-user/baize/images/notebook-images.png differ diff --git a/docs/en/docs/end-user/baize/images/notebook01.png b/docs/en/docs/end-user/baize/images/notebook01.png new file mode 100644 index 0000000000..284b1d691d Binary files /dev/null and b/docs/en/docs/end-user/baize/images/notebook01.png differ diff --git a/docs/en/docs/end-user/baize/images/notebook02.png b/docs/en/docs/end-user/baize/images/notebook02.png new file mode 100644 index 0000000000..0748948727 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/notebook02.png differ diff --git a/docs/en/docs/end-user/baize/images/notebook03.png b/docs/en/docs/end-user/baize/images/notebook03.png new file mode 100644 index 0000000000..11579db91c Binary files /dev/null and b/docs/en/docs/end-user/baize/images/notebook03.png differ diff --git a/docs/en/docs/end-user/baize/images/notebook04.png b/docs/en/docs/end-user/baize/images/notebook04.png new file mode 100644 index 0000000000..984e6a0b84 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/notebook04.png differ diff --git a/docs/en/docs/end-user/baize/images/oam-overview.png b/docs/en/docs/end-user/baize/images/oam-overview.png new file mode 100644 index 0000000000..ade3fefaa2 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/oam-overview.png differ diff --git a/docs/en/docs/end-user/baize/images/q-delete01.png b/docs/en/docs/end-user/baize/images/q-delete01.png new file mode 100644 index 0000000000..e5a03fc4b1 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/q-delete01.png differ diff --git a/docs/en/docs/end-user/baize/images/queue01.png b/docs/en/docs/end-user/baize/images/queue01.png new file mode 100644 index 0000000000..8b2a38c8b7 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/queue01.png differ diff --git a/docs/en/docs/end-user/baize/images/queue02.png b/docs/en/docs/end-user/baize/images/queue02.png new file mode 100644 index 0000000000..4c803009c6 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/queue02.png differ diff --git a/docs/en/docs/end-user/baize/images/queue03.png b/docs/en/docs/end-user/baize/images/queue03.png new file mode 100644 index 0000000000..40f0b24604 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/queue03.png differ diff --git a/docs/en/docs/end-user/baize/images/resource.png b/docs/en/docs/end-user/baize/images/resource.png new file mode 100644 index 0000000000..67bcd8dbba Binary files /dev/null and b/docs/en/docs/end-user/baize/images/resource.png differ diff --git a/docs/en/docs/end-user/baize/images/triton-infer-0.png b/docs/en/docs/end-user/baize/images/triton-infer-0.png new file mode 100644 index 0000000000..df0aae12e4 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/triton-infer-0.png differ diff --git a/docs/en/docs/end-user/baize/images/triton-infer-1.png b/docs/en/docs/end-user/baize/images/triton-infer-1.png new file mode 100644 index 0000000000..188d06183d Binary files /dev/null and b/docs/en/docs/end-user/baize/images/triton-infer-1.png differ diff --git a/docs/en/docs/end-user/baize/images/triton-infer-2.png b/docs/en/docs/end-user/baize/images/triton-infer-2.png new file mode 100644 index 0000000000..e46b611466 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/triton-infer-2.png differ diff --git a/docs/en/docs/end-user/baize/images/triton-infer-3.png b/docs/en/docs/end-user/baize/images/triton-infer-3.png new file mode 100644 index 0000000000..bea80f9d21 Binary files /dev/null and 
b/docs/en/docs/end-user/baize/images/triton-infer-3.png differ diff --git a/docs/en/docs/end-user/baize/images/update-baize.png b/docs/en/docs/end-user/baize/images/update-baize.png new file mode 100644 index 0000000000..73b43f5130 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/update-baize.png differ diff --git a/docs/en/docs/end-user/baize/images/view-wl01.png b/docs/en/docs/end-user/baize/images/view-wl01.png new file mode 100644 index 0000000000..fd7467f67b Binary files /dev/null and b/docs/en/docs/end-user/baize/images/view-wl01.png differ diff --git a/docs/en/docs/end-user/baize/images/view-wl02.png b/docs/en/docs/end-user/baize/images/view-wl02.png new file mode 100644 index 0000000000..60536d3760 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/view-wl02.png differ diff --git a/docs/en/docs/end-user/baize/images/view-wl03.png b/docs/en/docs/end-user/baize/images/view-wl03.png new file mode 100644 index 0000000000..a9f5ebfea5 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/view-wl03.png differ diff --git a/docs/en/docs/end-user/baize/images/view-wl04.png b/docs/en/docs/end-user/baize/images/view-wl04.png new file mode 100644 index 0000000000..4a95a931d8 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/view-wl04.png differ diff --git a/docs/en/docs/end-user/baize/images/workspace.png b/docs/en/docs/end-user/baize/images/workspace.png new file mode 100644 index 0000000000..1fec80d372 Binary files /dev/null and b/docs/en/docs/end-user/baize/images/workspace.png differ diff --git a/docs/en/docs/end-user/baize/inference/models.md b/docs/en/docs/end-user/baize/inference/models.md new file mode 100644 index 0000000000..60e2b1a420 --- /dev/null +++ b/docs/en/docs/end-user/baize/inference/models.md @@ -0,0 +1,55 @@ +# Model Support + +With the rapid iteration of AI Lab, we have now supported various model inference services. +Here, you can see information about the supported models. + +- AI Lab v0.3.0 launched model inference services, facilitating users to directly use + the inference services of AI Lab without worrying about model deployment and maintenance + for traditional deep learning models. +- AI Lab v0.6.0 supports the complete version of vLLM inference capabilities, + supporting many large language models such as `LLama`, `Qwen`, `ChatGLM`, and more. + +!!! note + + The support for inference capabilities is related to the version of AI Lab. + Refer to the [Release Notes](../../intro/release-notes.md) to understand the latest version and update timely. + +You can use GPU types that have been verified by AI platform in AI Lab. +For more details, refer to the [GPU Support Matrix](../../../kpanda/gpu/gpu_matrix.md). + +![Click to Create](../../images/inference-interface.png) + +## Triton Inference Server + +Through the Triton Inference Server, traditional deep learning models can be well supported. +Currently, AI Lab supports mainstream inference backend services: + +| Backend | Supported Model Formats | Description | +| ------- | ----------------------- | ----------- | +| pytorch | TorchScript, PyTorch 2.0 formats | [triton-inference-server/pytorch_backend](https://github.com/triton-inference-server/pytorch_backend) | +| tensorflow | TensorFlow 2.x | [triton-inference-server/tensorflow_backend](https://github.com/triton-inference-server/tensorflow_backend) | +| vLLM (Deprecated) | TensorFlow 2.x | [triton-inference-server/tensorflow_backend](https://github.com/triton-inference-server/tensorflow_backend) | + +!!! 
danger + + The use of Triton's Backend vLLM method has been deprecated. + It is recommended to use the latest support for vLLM to deploy your large language models. + +## vLLM + +With vLLM, we can quickly use large language models. Here, +you can see the list of models we support, which generally aligns with the `vLLM Support Models`. + +- HuggingFace Models: We support most of HuggingFace's models. You can see more models at the + [HuggingFace Model Hub](https://huggingface.co/models). +- The [vLLM Supported Models](https://docs.vllm.ai/en/stable/models/supported_models.html) + list includes supported large language models and vision-language models. +- Models fine-tuned using the vLLM support framework. + +### New Features of vLLM + +Currently, AI Lab also supports some new features when using vLLM as an inference tool: + +- Enable `Lora Adapter` to optimize model inference services during inference. +- Provide a compatible `OpenAPI` interface with `OpenAI`, making it easy for users + to switch to local inference services at a low cost and quickly transition. diff --git a/docs/en/docs/end-user/baize/inference/triton-inference.md b/docs/en/docs/end-user/baize/inference/triton-inference.md new file mode 100644 index 0000000000..8d4cd9605b --- /dev/null +++ b/docs/en/docs/end-user/baize/inference/triton-inference.md @@ -0,0 +1,183 @@ +# Create Inference Service Using Triton Framework + +The AI Lab currently offers Triton and vLLM as inference frameworks. Users can quickly start a high-performance inference service with simple configurations. + +!!! danger + + The use of Triton's Backend vLLM method has been deprecated. + It is recommended to use the latest support for vLLM to deploy your large language models. + +## Introduction to Triton + +Triton is an open-source inference server developed by NVIDIA, designed to simplify the deployment and inference of machine learning models. It supports a variety of deep learning frameworks, including TensorFlow and PyTorch, enabling users to easily manage and deploy different types of models. + +## Prerequisites + +Prepare model data: Manage the model code in dataset management and ensure that the data is successfully preloaded. The following example illustrates the PyTorch model for mnist handwritten digit recognition. + +!!! note + + The model to be inferred must adhere to the following directory structure within the dataset: + + ```bash + + └── + └── + └── + ``` + +The directory structure in this example is as follows: + +```bash + model-repo + └── mnist-cnn + └── 1 + └── model.pt +``` + +## Create Inference Service + +Currently, form-based creation is supported, allowing you to create services with field prompts in the interface. + +![点击创建](../../images/triton-infer-0.png) + +### Configure Model Path + +The model path `model-repo/mnist-cnn/1/model.pt` must be consistent with the directory structure of the dataset. + +## Model Configuration + +![点击创建](../../images/triton-infer-1.png) + +### Configure Input and Output Parameters + +!!! note + + The first dimension of the input and output parameters defaults to `batchsize`, setting it to `-1` allows for the automatic calculation of the batchsize based on the input inference data. The remaining dimensions and data type must match the model's input. + +### Configure Environment + +You can import the environment created in [Manage Python Environment Dependencies](../dataset/environments.md) to serve as the runtime environment for inference. 
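+
+For reference, one way to produce a servable `model.pt` is to export the PyTorch model to TorchScript before uploading it to the dataset. The sketch below is only illustrative: the CNN architecture is a placeholder, and the example input shape matches the FP32 `[1, 1, 32, 32]` input used in the access example later on.
+
+```python
+# Minimal sketch: export a placeholder MNIST-style CNN to TorchScript so it can be
+# served by the Triton `pytorch` backend from model-repo/mnist-cnn/1/model.pt.
+import os
+
+import torch
+import torch.nn as nn
+
+class MnistCNN(nn.Module):  # placeholder architecture, not the actual demo model
+    def __init__(self):
+        super().__init__()
+        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
+        self.fc = nn.Linear(8 * 32 * 32, 10)
+
+    def forward(self, x):
+        x = torch.relu(self.conv(x))
+        return self.fc(x.flatten(1))
+
+model = MnistCNN().eval()
+example = torch.randn(1, 1, 32, 32)       # same shape as the inference request example
+traced = torch.jit.trace(model, example)  # convert the model to TorchScript
+
+os.makedirs("model-repo/mnist-cnn/1", exist_ok=True)
+traced.save("model-repo/mnist-cnn/1/model.pt")
+```
+
+The exported file is then placed under the versioned directory (`model-repo/mnist-cnn/1/`) so that the model path configured above matches the dataset layout.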
+ +## Advanced Settings + +![点击创建](../../images/triton-infer-2.png) + +### Configure Authentication Policy + +Supports API key-based request authentication. Users can customize and add authentication parameters. + +### Affinity Scheduling + +Supports automated affinity scheduling based on GPU resources and other node configurations. It also allows users to customize scheduling policies. + +## Access + +![点击创建](../../images/triton-infer-3.png) + + + +### API Access + +- Triton provides a REST-based API, allowing clients to perform model inference via HTTP POST requests. +- Clients can send requests with JSON-formatted bodies containing input data and related metadata. + +#### HTTP Access + +1. **Send HTTP POST Request**: Use tools like `curl` or HTTP client libraries (e.g., Python's `requests` library) to send POST requests to the Triton Server. + +2. **Set HTTP Headers**: Configuration generated automatically based on user settings, include metadata about the model inputs and outputs in the HTTP headers. + +3. **Construct Request Body**: The request body usually contains the input data for inference and model-specific metadata. + + +##### Example curl Command + +```bash + curl -X POST "http://:/v2/models//infer" \ + -H "Content-Type: application/json" \ + -d '{ + "inputs": [ + { + "name": "model_input", + "shape": [1, 1, 32, 32], + "datatype": "FP32", + "data": [ + [0.1234, 0.5678, 0.9101, ... ] + ] + } + ] + }' +``` + +- `` is the host address where the Triton Inference Server is running. +- `` is the port where the Triton Inference Server is running. +- `` is the name of the inference service that has been created. +- `"name"` must match the `name` of the input parameter in the model configuration. +- `"shape"` must match the `dims` of the input parameter in the model configuration. +- `"datatype"` must match the `Data Type` of the input parameter in the model configuration. +- `"data"` should be replaced with the actual inference data. + + + +Please note that the above example code needs to be adjusted according to your specific model and environment. The format and content of the input data must also comply with the model's requirements. + + \ No newline at end of file diff --git a/docs/en/docs/end-user/baize/inference/vllm-inference.md b/docs/en/docs/end-user/baize/inference/vllm-inference.md new file mode 100644 index 0000000000..b67d5a5acd --- /dev/null +++ b/docs/en/docs/end-user/baize/inference/vllm-inference.md @@ -0,0 +1,49 @@ +# Create Inference Service Using vLLM Framework + +AI Lab supports using vLLM as an inference service, offering all the capabilities of vLLM while fully adapting to the OpenAI interface definition. + +## Introduction to vLLM + +vLLM is a fast and easy-to-use library for inference and services. It aims to significantly improve the throughput and memory efficiency of language model services in real-time scenarios. vLLM boasts several features in terms of speed and flexibility: + +- Continuous batching of incoming requests. +- Efficiently manages attention keys and values memory using PagedAttention. +- Seamless integration with popular HuggingFace models. +- Compatible with OpenAI's API server. + +## Prerequisites + +Prepare model data: Manage the model code in dataset management and ensure that the data is successfully preloaded. + +## Create Inference Service + +1. Select the `vLLM` inference framework. 
In the model module selection, choose the pre-created model dataset `hdd-models` and fill in the `path` information where the model is located within the dataset. + + This guide uses the ChatGLM3 model for creating the inference service. + + + +2. Configure the resources for the inference service and adjust the parameters for running the inference service. + + + + | Parameter Name | Description | + | -- | -- | + | GPU Resources | Configure GPU resources for inference based on the model scale and cluster resources. | + | Allow Remote Code | Controls whether vLLM trusts and executes code from remote sources. | + | LoRA | **LoRA** is a parameter-efficient fine-tuning technique for deep learning models. It reduces the number of parameters and computational complexity by decomposing the original model parameter matrix into low-rank matrices.

1. `--lora-modules`: Specifies the LoRA adapters to load for inference, typically given as `name=path` pairs.
2. `max_lora_rank`: Specifies the maximum rank allowed for each LoRA adapter. For simpler tasks, a smaller rank value can be chosen, while more complex tasks may require a larger rank value to ensure model performance.
3. `max_loras`: Indicates the maximum number of LoRA adapters that can be loaded at the same time, customized based on model size and inference complexity.
4. `max_cpu_loras`: Specifies the maximum number of LoRA layers that can be handled in a CPU environment. | + | Associated Environment | Selects predefined environment dependencies required for inference. | + + !!! info + + For models that support LoRA parameters, refer to [vLLM Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html). + +3. In the **Advanced Configuration** , support is provided for automated affinity scheduling based on GPU resources and other node configurations. Users can also customize scheduling policies. + +## Verify Inference Service + +Once the inference service is created, click the name of the inference service to enter the details and view the API call methods. Verify the execution results using Curl, Python, and Node.js. + +Copy the `curl` command from the details and execute it in the terminal to send a model inference request. The expected output should be: + + diff --git a/docs/en/docs/end-user/baize/jobs/create.md b/docs/en/docs/end-user/baize/jobs/create.md new file mode 100644 index 0000000000..4244ee58c5 --- /dev/null +++ b/docs/en/docs/end-user/baize/jobs/create.md @@ -0,0 +1,45 @@ +--- +MTPE: ModetaNiu +Date: 2024-07-08 +hide: + - toc +--- + +# Create Job + +Job management refers to the functionality of creating and managing job lifecycles through job scheduling +and control components. + +AI platform Smart Computing Capability adopts Kubernetes' Job mechanism to schedule various AI inference and +training jobs. + +1. Click **Job Center** -> **Jobs** in the left navigation bar to enter the job list. Click the **Create** button + on the right. + + ![Create a Job](../../images/job01.png) + +2. The system will pre-fill basic configuration data, including the cluster, namespace, type, queue, and priority. + Adjust these parameters and click **Next**. + + ![Bacis Info](../../images/job02.png) + +3. Configure the URL, runtime parameters, and associated datasets, then click **Next**. + + ![Resource config](../../images/job03.png) + +4. Optionally add labels, annotations, runtime env variables, and other job parameters. Select a scheduling policy + and click **Confirm**. + + ![Advanced settings](../../images/job04.png) + +5. After the job is successfully created, it will have several running statuses: + + - Running + - Queued + - Submission successful, Submission failed + - Successful, Failed + +## Next Steps + +- [View Job Load](./view.md) +- [Delete Job](./delete.md) diff --git a/docs/en/docs/end-user/baize/jobs/delete.md b/docs/en/docs/end-user/baize/jobs/delete.md new file mode 100644 index 0000000000..13f92674ff --- /dev/null +++ b/docs/en/docs/end-user/baize/jobs/delete.md @@ -0,0 +1,24 @@ +--- +hide: + - toc +--- + +# Delete Job + +If you find a job to be redundant, expired, or no longer needed for any other reason, you can delete it from the job list. + +1. Click the **┇** on the right side of the job in the job list, then choose **Delete** from the dropdown menu. + + + +2. In the pop-up window, confirm the job you want to delete, enter the job name, and then click **Delete**. + + + +3. A confirmation message will appear indicating successful deletion, and the job will disappear from the list. + + + +!!! caution + + Once a job is deleted, it cannot be recovered, so please proceed with caution. 
diff --git a/docs/en/docs/end-user/baize/jobs/pytorch.md b/docs/en/docs/end-user/baize/jobs/pytorch.md new file mode 100644 index 0000000000..feb2b91bfe --- /dev/null +++ b/docs/en/docs/end-user/baize/jobs/pytorch.md @@ -0,0 +1,227 @@ +# Pytorch Jobs + +Pytorch is an open-source deep learning framework that provides a flexible environment for training and deployment. +A Pytorch job is a job that uses the Pytorch framework. + +In the AI Lab platform, we provide support and adaptation for Pytorch jobs. Through a graphical interface, you can quickly create Pytorch jobs and perform model training. + +## Job Configuration + +- Job types support both `Pytorch Single` and `Pytorch Distributed` modes. +- The runtime image already supports the Pytorch framework by default, so no additional installation is required. + +## Job Runtime Environment + +Here we use the `baize-notebook` base image and the `associated environment` as the basic runtime environment for the job. + +> To learn how to create an environment, refer to [Environments](../dataset/environments.md). + +## Create Jobs + +### Pytorch Single Jobs + + + +1. Log in to the AI Lab platform, click **Job Center** in the left navigation bar to enter the **Jobs** page. +2. Click the **Create** button in the upper right corner to enter the job creation page. +3. Select the job type as `Pytorch Single` and click **Next** . +4. Fill in the job name and description, then click **OK** . + +#### Parameters + +- Start command: `bash` +- Command parameters: + +```python +import torch +import torch.nn as nn +import torch.optim as optim + +# Define a simple neural network +class SimpleNet(nn.Module): + def __init__(self): + super(SimpleNet, self).__init__() + self.fc = nn.Linear(10, 1) + + def forward(self, x): + return self.fc(x) + +# Create model, loss function, and optimizer +model = SimpleNet() +criterion = nn.MSELoss() +optimizer = optim.SGD(model.parameters(), lr=0.01) + +# Generate some random data +x = torch.randn(100, 10) +y = torch.randn(100, 1) + +# Train the model +for epoch in range(100): + # Forward pass + outputs = model(x) + loss = criterion(outputs, y) + + # Backward pass and optimization + optimizer.zero_grad() + loss.backward() + optimizer.step() + + if (epoch + 1) % 10 == 0: + print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}') + +print('Training finished.') +``` + +#### Results + +Once the job is successfully submitted, we can enter the job details to see the resource usage. From the upper right corner, go to **Workload Details** to view the log output during the training process. + +```bash +[HAMI-core Warn(1:140244541377408:utils.c:183)]: get default cuda from (null) +[HAMI-core Msg(1:140244541377408:libvgpu.c:855)]: Initialized +Epoch [10/100], Loss: 1.1248 +Epoch [20/100], Loss: 1.0486 +Epoch [30/100], Loss: 0.9969 +Epoch [40/100], Loss: 0.9611 +Epoch [50/100], Loss: 0.9360 +Epoch [60/100], Loss: 0.9182 +Epoch [70/100], Loss: 0.9053 +Epoch [80/100], Loss: 0.8960 +Epoch [90/100], Loss: 0.8891 +Epoch [100/100], Loss: 0.8841 +Training finished. +[HAMI-core Msg(1:140244541377408:multiprocess_memory_limit.c:468)]: Calling exit handler 1 +``` + +### Pytorch Distributed Jobs + +1. Log in to the AI Lab platform, click **Job Center** in the left navigation bar to enter the **Jobs** page. +2. Click the **Create** button in the upper right corner to enter the job creation page. +3. Select the job type as `Pytorch Distributed` and click **Next**. +4. Fill in the job name and description, then click **OK**. 
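+
+Before filling in the runtime parameters, note that the distributed rendezvous environment variables are expected to be injected into each replica by the PyTorchJob controller; the sample script below reads `RANK` and `WORLD_SIZE` from them. A quick way to confirm them from inside any replica is shown here (the `MASTER_ADDR`/`MASTER_PORT` names follow the standard PyTorch distributed convention and are assumptions, not values documented by the platform):
+
+```python
+# Print the distributed-training environment variables injected into each replica.
+# MASTER_ADDR / MASTER_PORT follow the standard PyTorch distributed convention;
+# treat them as assumptions rather than guaranteed platform behavior.
+import os
+
+for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
+    print(f"{var}={os.environ.get(var, '<not set>')}")
+```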
+ +#### Parameters + +- Start command: `bash` +- Command parameters: + +```python +import os +import torch +import torch.distributed as dist +import torch.nn as nn +import torch.optim as optim +from torch.nn.parallel import DistributedDataParallel as DDP + +class SimpleModel(nn.Module): + def __init__(self): + super(SimpleModel, self).__init__() + self.fc = nn.Linear(10, 1) + + def forward(self, x): + return self.fc(x) + +def train(): + # Print environment information + print(f'PyTorch version: {torch.__version__}') + print(f'CUDA available: {torch.cuda.is_available()}') + if torch.cuda.is_available(): + print(f'CUDA version: {torch.version.cuda}') + print(f'CUDA device count: {torch.cuda.device_count()}') + + rank = int(os.environ.get('RANK', '0')) + world_size = int(os.environ.get('WORLD_SIZE', '1')) + + print(f'Rank: {rank}, World Size: {world_size}') + + # Initialize distributed environment + try: + if world_size > 1: + dist.init_process_group('nccl') + print('Distributed process group initialized successfully') + else: + print('Running in non-distributed mode') + except Exception as e: + print(f'Error initializing process group: {e}') + return + + # Set device + try: + if torch.cuda.is_available(): + device = torch.device(f'cuda:{rank % torch.cuda.device_count()}') + print(f'Using CUDA device: {device}') + else: + device = torch.device('cpu') + print('CUDA not available, using CPU') + except Exception as e: + print(f'Error setting device: {e}') + device = torch.device('cpu') + print('Falling back to CPU') + + try: + model = SimpleModel().to(device) + print('Model moved to device successfully') + except Exception as e: + print(f'Error moving model to device: {e}') + return + + try: + if world_size > 1: + ddp_model = DDP(model, device_ids=[rank % torch.cuda.device_count()] if torch.cuda.is_available() else None) + print('DDP model created successfully') + else: + ddp_model = model + print('Using non-distributed model') + except Exception as e: + print(f'Error creating DDP model: {e}') + return + + loss_fn = nn.MSELoss() + optimizer = optim.SGD(ddp_model.parameters(), lr=0.001) + + # Generate some random data + try: + data = torch.randn(100, 10, device=device) + labels = torch.randn(100, 1, device=device) + print('Data generated and moved to device successfully') + except Exception as e: + print(f'Error generating or moving data to device: {e}') + return + + for epoch in range(10): + try: + ddp_model.train() + outputs = ddp_model(data) + loss = loss_fn(outputs, labels) + optimizer.zero_grad() + loss.backward() + optimizer.step() + + if rank == 0: + print(f'Epoch {epoch}, Loss: {loss.item():.4f}') + except Exception as e: + print(f'Error during training epoch {epoch}: {e}') + break + + if world_size > 1: + dist.destroy_process_group() + +if __name__ == '__main__': + train() +``` + +#### Number of Job Replicas + +Note that `Pytorch Distributed` training jobs will create a group of `Master` and `Worker` training Pods, +where the `Master` is responsible for coordinating the training job, and the `Worker` is responsible for the actual training work. + +!!! note + + In this demonstration: `Master` replica count is 1, `Worker` replica count is 2; + Therefore, we need to set the replica count to 3 in the **Job Configuration** , + which is the sum of `Master` and `Worker` replica counts. + Pytorch will automatically tune the roles of `Master` and `Worker`. + +#### Results + +Similarly, we can enter the job details to view the resource usage and the log output of each Pod. 
diff --git a/docs/en/docs/end-user/baize/jobs/tensorboard.md b/docs/en/docs/end-user/baize/jobs/tensorboard.md new file mode 100644 index 0000000000..f7f27a7200 --- /dev/null +++ b/docs/en/docs/end-user/baize/jobs/tensorboard.md @@ -0,0 +1,140 @@ +# Job Analysis + +AI Lab provides important visualization analysis tools provided for the model development +process, used to display the training process and results of machine learning models. This document will +introduce the basic concepts of Job Analysis (Tensorboard), its usage in the AI Lab system, +and how to configure the log content of datasets. + +!!! note + + Tensorboard is a visualization tool provided by TensorFlow, used to display the + training process and results of machine learning models. + It can help developers more intuitively understand the training dynamics of + their models, analyze model performance, debug issues, and more. + + + +The role and advantages of Tensorboard in the model development process: + +- **Visualize Training Process** : Display metrics such as training and validation loss, and accuracy + through charts, helping developers intuitively observe the training effects of the model. +- **Debug and Optimize Models** : By viewing the weights and gradient distributions of different layers, + help developers discover and fix issues in the model. +- **Compare Different Experiments** : Simultaneously display the results of multiple experiments, + making it convenient for developers to compare the effects of different models and hyperparameter configurations. +- **Track Training Data** : Record the datasets and parameters used during training to + ensure the reproducibility of experiments. + +## How to Create Tensorboard + +In the AI Lab system, we provide a convenient way to create and manage Tensorboard. +Here are the specific steps: + +### Enable Tensorboard When Creating a Notebook + +1. **Create a Notebook** : Create a new Notebook on the AI Lab platform. +2. **Enable Tensorboard** : On the Notebook creation page, enable the **Tensorboard** + option and specify the dataset and log path. + + + +### Enable Tensorboard After Creating and Completing a Distributed Job + +1. **Create a Distributed Job** : Create a new distributed training job on the AI Lab platform. +2. **Configure Tensorboard** : On the job configuration page, enable the **Tensorboard** + option and specify the dataset and log path. +3. **View Tensorboard After Job Completion** : After the job is completed, you can view + the Tensorboard link on the job details page. Click the link to see the visualized results + of the training process. + + + +### Directly Reference Tensorboard in a Notebook + +In a Notebook, you can directly start Tensorboard through code. 
Here is a sample code snippet: + +```python +# Import necessary libraries +import tensorflow as tf +import datetime + +# Define log directory +log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + +# Create Tensorboard callback +tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1) + +# Build and compile model +model = tf.keras.models.Sequential([ + tf.keras.layers.Flatten(input_shape=(28, 28)), + tf.keras.layers.Dense(512, activation='relu'), + tf.keras.layers.Dropout(0.2), + tf.keras.layers.Dense(10, activation='softmax') +]) + +model.compile(optimizer='adam', + loss='sparse_categorical_crossentropy', + metrics=['accuracy']) + +# Train model and enable Tensorboard callback +model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test), callbacks=[tensorboard_callback]) +``` + +## How to Configure Dataset Log Content + +When using Tensorboard, you can record and configure different datasets and log content. +Here are some common configuration methods: + +### Configure Training and Validation Dataset Logs + +While training the model, you can use TensorFlow's `tf.summary` API to record logs +for the training and validation datasets. Here is a sample code snippet: + +```python +# Import necessary libraries +import tensorflow as tf + +# Create log directories +train_log_dir = 'logs/gradient_tape/train' +val_log_dir = 'logs/gradient_tape/val' +train_summary_writer = tf.summary.create_file_writer(train_log_dir) +val_summary_writer = tf.summary.create_file_writer(val_log_dir) + +# Train model and record logs +for epoch in range(EPOCHS): + for (x_train, y_train) in train_dataset: + # Training step + train_step(x_train, y_train) + with train_summary_writer.as_default(): + tf.summary.scalar('loss', train_loss.result(), step=epoch) + tf.summary.scalar('accuracy', train_accuracy.result(), step=epoch) + + for (x_val, y_val) in val_dataset: + # Validation step + val_step(x_val, y_val) + with val_summary_writer.as_default(): + tf.summary.scalar('loss', val_loss.result(), step=epoch) + tf.summary.scalar('accuracy', val_accuracy.result(), step=epoch) +``` + +### Configure Custom Logs + +In addition to logs for training and validation datasets, you can also record other +custom log content such as learning rate and gradient distribution. Here is a sample code snippet: + +```python +# Record custom logs +with train_summary_writer.as_default(): + tf.summary.scalar('learning_rate', learning_rate, step=epoch) + tf.summary.histogram('gradients', gradients, step=epoch) +``` + +## Tensorboard Management + +In AI Lab, Tensorboards created through various methods are uniformly +displayed on the job analysis page, making it convenient for users to view and manage. + + + +Users can view information such as the link, status, and creation time of Tensorboard +on the job analysis page and directly access the visualized results of Tensorboard through the link. diff --git a/docs/en/docs/end-user/baize/jobs/tensorflow.md b/docs/en/docs/end-user/baize/jobs/tensorflow.md new file mode 100644 index 0000000000..c3cf2b475a --- /dev/null +++ b/docs/en/docs/end-user/baize/jobs/tensorflow.md @@ -0,0 +1,165 @@ +# Tensorflow Jobs + +Tensorflow, along with Pytorch, is a highly active open-source deep learning framework +that provides a flexible environment for training and deployment. + +AI Lab provides support and adaptation for the Tensorflow framework. +You can quickly create Tensorflow jobs and conduct model training through graphical operations. 
+ +## Job Configuration + +- The job types support both `Tensorflow Single` and `Tensorflow Distributed` modes. +- The runtime image already supports the Tensorflow framework by default, + so no additional installation is required. + +## Job Runtime Environment + +Here, we use the `baize-notebook` base image and the `associated environment` as the basic runtime environment for jobs. + +> For information on how to create an environment, refer to [Environment List](../dataset/environments.md). + +## Creating a Job + +### Example TFJob Single + + + +1. Log in to the AI Lab platform and click **Job Center** in the left navigation bar + to enter the **Jobs** page. +2. Click the **Create** button in the upper right corner to enter the job creation page. +3. Select the job type as `Tensorflow Single` and click **Next** . +4. Fill in the job name and description, then click **OK** . + +#### Pre-warming the Code Repository + +Use **AI Lab** -> **Dataset List** to create a dataset and pull the code from a remote GitHub repository into the dataset. +This way, when creating a job, you can directly select the dataset and mount the code into the job. + +Demo code repository address: [https://github.com/d-run/training-sample-code/](https://github.com/d-run/training-sample-code/) + +#### Parameters + +- Launch command: Use `bash` +- Command parameters: Use `python /code/tensorflow/tf-single.py` + +```python +""" + pip install tensorflow numpy +""" + +import tensorflow as tf +import numpy as np + +# Create some random data +x = np.random.rand(100, 1) +y = 2 * x + 1 + np.random.rand(100, 1) * 0.1 + +# Create a simple model +model = tf.keras.Sequential([ + tf.keras.layers.Dense(1, input_shape=(1,)) +]) + +# Compile the model +model.compile(optimizer='adam', loss='mse') + +# Train the model, setting epochs to 10 +history = model.fit(x, y, epochs=10, verbose=1) + +# Print the final loss +print('Final loss: {' + str(history.history['loss'][-1]) +'}') + +# Use the model to make predictions +test_x = np.array([[0.5]]) +prediction = model.predict(test_x) +print(f'Prediction for x=0.5: {prediction[0][0]}') +``` + +#### Results + +After the job is successfully submitted, you can enter the job details to see the resource usage. From the upper right corner, navigate to **Workload Details** to view log outputs during the training process. + +### TFJob Distributed Job + +1. Log in to **AI Lab** and click **Job Center** in the left navigation bar to enter the **Jobs** page. +2. Click the **Create** button in the upper right corner to enter the job creation page. +3. Select the job type as `Tensorflow Distributed` and click **Next**. +4. Fill in the job name and description, then click **OK**. + +#### Example Job Introduction + + + +This job includes three roles: `Chief`, `Worker`, and `Parameter Server (PS)`. + +- Chief: Responsible for coordinating the training process and saving model checkpoints. +- Worker: Executes the actual model training. +- PS: Used in asynchronous training to store and update model parameters. + +Different resources are allocated to different roles. `Chief` and `Worker` use GPUs, +while `PS` uses CPUs and larger memory. 
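+
+These roles are wired together through the `TF_CONFIG` environment variable that the training operator typically injects into each Pod; the sample script below reads it to determine its own role. An illustrative value for the layout above is sketched here (host names and ports are placeholders, not values documented by the platform):
+
+```python
+# Illustrative TF_CONFIG for a 1x Chief, 2x Worker, 1x PS layout.
+# Host names and ports are placeholders; in a real job the platform injects TF_CONFIG.
+import json
+import os
+
+example_tf_config = {
+    "cluster": {
+        "chief":  ["tfjob-demo-chief-0:2222"],
+        "worker": ["tfjob-demo-worker-0:2222", "tfjob-demo-worker-1:2222"],
+        "ps":     ["tfjob-demo-ps-0:2222"],
+    },
+    "task": {"type": "worker", "index": 0},  # this Pod's own role and index
+}
+
+os.environ.setdefault("TF_CONFIG", json.dumps(example_tf_config))
+tf_config = json.loads(os.environ["TF_CONFIG"])
+print(tf_config["task"]["type"], tf_config["task"]["index"])
+```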
+ +#### Parameters + +- Launch command: Use `bash` +- Command parameters: Use `python /code/tensorflow/tensorflow-distributed.py` + +```python +import os +import json +import tensorflow as tf + +class SimpleModel(tf.keras.Model): + def __init__(self): + super(SimpleModel, self).__init__() + self.fc = tf.keras.layers.Dense(1, input_shape=(10,)) + + def call(self, x): + return self.fc(x) + +def train(): + # Print environment information + print(f"TensorFlow version: {tf.__version__}") + print(f"GPU available: {tf.test.is_gpu_available()}") + if tf.test.is_gpu_available(): + print(f"GPU device count: {len(tf.config.list_physical_devices('GPU'))}") + + # Retrieve distributed training information + tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}') + job_type = tf_config.get('job', {}).get('type') + job_id = tf_config.get('job', {}).get('index') + + print(f"Job type: {job_type}, Job ID: {job_id}") + + # Set up distributed strategy + strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() + + with strategy.scope(): + model = SimpleModel() + loss_fn = tf.keras.losses.MeanSquaredError() + optimizer = tf.keras.optimizers.SGD(learning_rate=0.001) + + # Generate some random data + data = tf.random.normal((100, 10)) + labels = tf.random.normal((100, 1)) + + @tf.function + def train_step(inputs, labels): + with tf.GradientTape() as tape: + predictions = model(inputs) + loss = loss_fn(labels, predictions) + gradients = tape.gradient(loss, model.trainable_variables) + optimizer.apply_gradients(zip(gradients, model.trainable_variables)) + return loss + + for epoch in range(10): + loss = train_step(data, labels) + if job_type == 'chief': + print(f'Epoch {epoch}, Loss: {loss.numpy():.4f}') + +if __name__ == '__main__': + train() +``` + +#### Results + +Similarly, you can enter the job details to view the resource usage and log outputs of each Pod. diff --git a/docs/en/docs/end-user/baize/jobs/view.md b/docs/en/docs/end-user/baize/jobs/view.md new file mode 100644 index 0000000000..cf8937cef5 --- /dev/null +++ b/docs/en/docs/end-user/baize/jobs/view.md @@ -0,0 +1,212 @@ +--- +hide: + - toc +--- + +# View Job Workloads + +Once a job is created, it will be displayed in the job list. + +1. In the job list, click the **┇** on the right side of a job and select **Job Workload Details** . + + ![Click Menu Item](../../images/view-wl01.png) + +2. A pop-up window will appear asking you to choose which Pod to view. Click **Enter** . + + ![Pop-up Enter](../../images/view-wl02.png) + +3. You will be redirected to the container management interface, where you can view the container’s working status, labels and annotations, and any events that have occurred. + + ![View Details](../../images/view-wl03.png) + +4. You can also view detailed logs of the current Pod for the recent period. + By default, 100 lines of logs are displayed. To view more detailed logs or to download logs, click the blue **Insight** text at the top. + + ![Logs](../../images/view-wl04.png) + +5. Additionally, you can use the **...** in the upper right corner to view the current Pod's YAML, and to upload or download files. + Below is an example of a Pod's YAML. 
+ +```yaml +kind: Pod +apiVersion: v1 +metadata: + name: neko-tensorboard-job-test-202404181843-skxivllb-worker-0 + namespace: default + uid: ddedb6ff-c278-47eb-ae1e-0de9b7c62f8c + resourceVersion: '41092552' + creationTimestamp: '2024-04-18T10:43:36Z' + labels: + training.kubeflow.org/job-name: neko-tensorboard-job-test-202404181843-skxivllb + training.kubeflow.org/operator-name: pytorchjob-controller + training.kubeflow.org/replica-index: '0' + training.kubeflow.org/replica-type: worker + annotations: + cni.projectcalico.org/containerID: 0cfbb9af257d5e69027c603c6cb2d3890a17c4ae1a145748d5aef73a10d7fbe1 + cni.projectcalico.org/podIP: '' + cni.projectcalico.org/podIPs: '' + hami.io/bind-phase: success + hami.io/bind-time: '1713437016' + hami.io/vgpu-devices-allocated: GPU-29d5fa0d-935b-2966-aff8-483a174d61d1,NVIDIA,1024,20:; + hami.io/vgpu-devices-to-allocate: ; + hami.io/vgpu-node: worker-a800-1 + hami.io/vgpu-time: '1713437016' + k8s.v1.cni.cncf.io/network-status: |- + [{ + "name": "kube-system/calico", + "ips": [ + "10.233.97.184" + ], + "default": true, + "dns": {} + }] + k8s.v1.cni.cncf.io/networks-status: |- + [{ + "name": "kube-system/calico", + "ips": [ + "10.233.97.184" + ], + "default": true, + "dns": {} + }] + ownerReferences: + - apiVersion: kubeflow.org/v1 + kind: PyTorchJob + name: neko-tensorboard-job-test-202404181843-skxivllb + uid: e5a8b05d-1f03-4717-8e1c-4ec928014b7b + controller: true + blockOwnerDeletion: true +spec: + volumes: + - name: 0-dataset-pytorch-examples + persistentVolumeClaim: + claimName: pytorch-examples + - name: kube-api-access-wh9rh + projected: + sources: + - serviceAccountToken: + expirationSeconds: 3607 + path: token + - configMap: + name: kube-root-ca.crt + items: + - key: ca.crt + path: ca.crt + - downwardAPI: + items: + - path: namespace + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + defaultMode: 420 + containers: + - name: pytorch + image: m.daocloud.io/docker.io/pytorch/pytorch + command: + - bash + args: + - '-c' + - >- + ls -la /root && which pip && pip install pytorch_lightning tensorboard + && python /root/Git/pytorch/examples/mnist/main.py + ports: + - name: pytorchjob-port + containerPort: 23456 + protocol: TCP + env: + - name: PYTHONUNBUFFERED + value: '1' + - name: PET_NNODES + value: '1' + resources: + limits: + cpu: '4' + memory: 8Gi + nvidia.com/gpucores: '20' + nvidia.com/gpumem: '1024' + nvidia.com/vgpu: '1' + requests: + cpu: '4' + memory: 8Gi + nvidia.com/gpucores: '20' + nvidia.com/gpumem: '1024' + nvidia.com/vgpu: '1' + volumeMounts: + - name: 0-dataset-pytorch-examples + mountPath: /root/Git/pytorch/examples + - name: kube-api-access-wh9rh + readOnly: true + mountPath: /var/run/secrets/kubernetes.io/serviceaccount + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: Always + restartPolicy: Never + terminationGracePeriodSeconds: 30 + dnsPolicy: ClusterFirst + serviceAccountName: default + serviceAccount: default + nodeName: worker-a800-1 + securityContext: {} + affinity: {} + schedulerName: hami-scheduler + tolerations: + - key: node.kubernetes.io/not-ready + operator: Exists + effect: NoExecute + tolerationSeconds: 300 + - key: node.kubernetes.io/unreachable + operator: Exists + effect: NoExecute + tolerationSeconds: 300 + priorityClassName: baize-high-priority + priority: 100000 + enableServiceLinks: true + preemptionPolicy: PreemptLowerPriority +status: + phase: Succeeded + conditions: + - type: Initialized + status: 'True' + lastProbeTime: null + 
lastTransitionTime: '2024-04-18T10:43:36Z' + reason: PodCompleted + - type: Ready + status: 'False' + lastProbeTime: null + lastTransitionTime: '2024-04-18T10:46:34Z' + reason: PodCompleted + - type: ContainersReady + status: 'False' + lastProbeTime: null + lastTransitionTime: '2024-04-18T10:46:34Z' + reason: PodCompleted + - type: PodScheduled + status: 'True' + lastProbeTime: null + lastTransitionTime: '2024-04-18T10:43:36Z' + hostIP: 10.20.100.211 + podIP: 10.233.97.184 + podIPs: + - ip: 10.233.97.184 + startTime: '2024-04-18T10:43:36Z' + containerStatuses: + - name: pytorch + state: + terminated: + exitCode: 0 + reason: Completed + startedAt: '2024-04-18T10:43:39Z' + finishedAt: '2024-04-18T10:46:34Z' + containerID: >- + containerd://09010214bcf3315e81d38fba50de3943c9d2b48f50a6cc2e83f8ef0e5c6eeec1 + lastState: {} + ready: false + restartCount: 0 + image: m.daocloud.io/docker.io/pytorch/pytorch:latest + imageID: >- + m.daocloud.io/docker.io/pytorch/pytorch@sha256:11691e035a3651d25a87116b4f6adc113a27a29d8f5a6a583f8569e0ee5ff897 + containerID: >- + containerd://09010214bcf3315e81d38fba50de3943c9d2b48f50a6cc2e83f8ef0e5c6eeec1 + started: false + qosClass: Guaranteed +``` diff --git a/docs/en/docs/end-user/ghippo/images/14.png b/docs/en/docs/end-user/ghippo/images/14.png new file mode 100644 index 0000000000..9aa7d08e9e Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/14.png differ diff --git a/docs/en/docs/end-user/ghippo/images/access.png b/docs/en/docs/end-user/ghippo/images/access.png new file mode 100644 index 0000000000..9da2789e9e Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/access.png differ diff --git a/docs/en/docs/end-user/ghippo/images/addcluster01.png b/docs/en/docs/end-user/ghippo/images/addcluster01.png new file mode 100644 index 0000000000..8adb495227 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/addcluster01.png differ diff --git a/docs/en/docs/end-user/ghippo/images/addcluster02.png b/docs/en/docs/end-user/ghippo/images/addcluster02.png new file mode 100644 index 0000000000..022c6c902a Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/addcluster02.png differ diff --git a/docs/en/docs/end-user/ghippo/images/agent.png b/docs/en/docs/end-user/ghippo/images/agent.png new file mode 100644 index 0000000000..f2c5101dc2 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/agent.png differ diff --git a/docs/en/docs/end-user/ghippo/images/clusterlist01.png b/docs/en/docs/end-user/ghippo/images/clusterlist01.png new file mode 100644 index 0000000000..4add7ef42b Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/clusterlist01.png differ diff --git a/docs/en/docs/end-user/ghippo/images/gmagpiereport.png b/docs/en/docs/end-user/ghippo/images/gmagpiereport.png new file mode 100644 index 0000000000..19bf5b5bfa Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/gmagpiereport.png differ diff --git a/docs/en/docs/end-user/ghippo/images/gproduct03.png b/docs/en/docs/end-user/ghippo/images/gproduct03.png new file mode 100644 index 0000000000..9c5019c7c8 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/gproduct03.png differ diff --git a/docs/en/docs/end-user/ghippo/images/ldap00.png b/docs/en/docs/end-user/ghippo/images/ldap00.png new file mode 100644 index 0000000000..21f94b74f5 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/ldap00.png differ diff --git a/docs/en/docs/end-user/ghippo/images/ldap01.png b/docs/en/docs/end-user/ghippo/images/ldap01.png new 
file mode 100644 index 0000000000..74e37a9940 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/ldap01.png differ diff --git a/docs/en/docs/end-user/ghippo/images/login02.png b/docs/en/docs/end-user/ghippo/images/login02.png new file mode 100644 index 0000000000..0509b1a313 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/login02.png differ diff --git a/docs/en/docs/end-user/ghippo/images/logindesign.png b/docs/en/docs/end-user/ghippo/images/logindesign.png new file mode 100644 index 0000000000..4fe84d2fe2 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/logindesign.png differ diff --git a/docs/en/docs/end-user/ghippo/images/menu1.png b/docs/en/docs/end-user/ghippo/images/menu1.png new file mode 100644 index 0000000000..ec4813f2d3 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/menu1.png differ diff --git a/docs/en/docs/end-user/ghippo/images/menu2.png b/docs/en/docs/end-user/ghippo/images/menu2.png new file mode 100644 index 0000000000..2bae9ed2ed Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/menu2.png differ diff --git a/docs/en/docs/end-user/ghippo/images/menu3.png b/docs/en/docs/end-user/ghippo/images/menu3.png new file mode 100644 index 0000000000..8d9090ba52 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/menu3.png differ diff --git a/docs/en/docs/end-user/ghippo/images/menu4.png b/docs/en/docs/end-user/ghippo/images/menu4.png new file mode 100644 index 0000000000..b7a8489572 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/menu4.png differ diff --git a/docs/en/docs/end-user/ghippo/images/mybusiness.png b/docs/en/docs/end-user/ghippo/images/mybusiness.png new file mode 100644 index 0000000000..8317dd59e0 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/mybusiness.png differ diff --git a/docs/en/docs/end-user/ghippo/images/nav01.png b/docs/en/docs/end-user/ghippo/images/nav01.png new file mode 100644 index 0000000000..64d5446a3e Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/nav01.png differ diff --git a/docs/en/docs/end-user/ghippo/images/nav02.png b/docs/en/docs/end-user/ghippo/images/nav02.png new file mode 100644 index 0000000000..3ed6bcc7bb Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/nav02.png differ diff --git a/docs/en/docs/end-user/ghippo/images/note.svg b/docs/en/docs/end-user/ghippo/images/note.svg new file mode 100644 index 0000000000..5e473aeb7d --- /dev/null +++ b/docs/en/docs/end-user/ghippo/images/note.svg @@ -0,0 +1,18 @@ + + + + Icon/16/Prompt备份@0.5x + Created with Sketch. 
+ + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/en/docs/end-user/ghippo/images/oauth2.png b/docs/en/docs/end-user/ghippo/images/oauth2.png new file mode 100644 index 0000000000..5014a5b6e9 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/oauth2.png differ diff --git a/docs/en/docs/end-user/ghippo/images/oidc-button.png b/docs/en/docs/end-user/ghippo/images/oidc-button.png new file mode 100644 index 0000000000..ed70422c56 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/oidc-button.png differ diff --git a/docs/en/docs/end-user/ghippo/images/oidc01.png b/docs/en/docs/end-user/ghippo/images/oidc01.png new file mode 100644 index 0000000000..f1b73f3bbc Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/oidc01.png differ diff --git a/docs/en/docs/end-user/ghippo/images/password01en.png b/docs/en/docs/end-user/ghippo/images/password01en.png new file mode 100644 index 0000000000..88172a2d22 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/password01en.png differ diff --git a/docs/en/docs/end-user/ghippo/images/password02en.png b/docs/en/docs/end-user/ghippo/images/password02en.png new file mode 100644 index 0000000000..5984daa99b Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/password02en.png differ diff --git a/docs/en/docs/end-user/ghippo/images/password03en.png b/docs/en/docs/end-user/ghippo/images/password03en.png new file mode 100644 index 0000000000..ad24586214 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/password03en.png differ diff --git a/docs/en/docs/end-user/ghippo/images/password04.png b/docs/en/docs/end-user/ghippo/images/password04.png new file mode 100644 index 0000000000..293579d9d4 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/password04.png differ diff --git a/docs/en/docs/end-user/ghippo/images/password04en.png b/docs/en/docs/end-user/ghippo/images/password04en.png new file mode 100644 index 0000000000..c8d30ca792 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/password04en.png differ diff --git a/docs/en/docs/end-user/ghippo/images/platform02.png b/docs/en/docs/end-user/ghippo/images/platform02.png new file mode 100644 index 0000000000..d8b33925ef Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/platform02.png differ diff --git a/docs/en/docs/end-user/ghippo/images/platform03.png b/docs/en/docs/end-user/ghippo/images/platform03.png new file mode 100644 index 0000000000..73d189ec45 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/platform03.png differ diff --git a/docs/en/docs/end-user/ghippo/images/report01.png b/docs/en/docs/end-user/ghippo/images/report01.png new file mode 100644 index 0000000000..c38ee377eb Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/report01.png differ diff --git a/docs/en/docs/end-user/ghippo/images/security-policy.png b/docs/en/docs/end-user/ghippo/images/security-policy.png new file mode 100644 index 0000000000..17378ba3dc Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/security-policy.png differ diff --git a/docs/en/docs/end-user/ghippo/images/selfapplication.png b/docs/en/docs/end-user/ghippo/images/selfapplication.png new file mode 100644 index 0000000000..64129f857f Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/selfapplication.png differ diff --git a/docs/en/docs/end-user/ghippo/images/sso1.png b/docs/en/docs/end-user/ghippo/images/sso1.png new file mode 100644 index 0000000000..17f1dc247b Binary files 
/dev/null and b/docs/en/docs/end-user/ghippo/images/sso1.png differ diff --git a/docs/en/docs/end-user/ghippo/images/sso2.png b/docs/en/docs/end-user/ghippo/images/sso2.png new file mode 100644 index 0000000000..f732bc3337 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/sso2.png differ diff --git a/docs/en/docs/end-user/ghippo/images/sso3.png b/docs/en/docs/end-user/ghippo/images/sso3.png new file mode 100644 index 0000000000..d42376f947 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/sso3.png differ diff --git a/docs/en/docs/end-user/ghippo/images/system-message1.png b/docs/en/docs/end-user/ghippo/images/system-message1.png new file mode 100644 index 0000000000..72cd0a5309 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/system-message1.png differ diff --git a/docs/en/docs/end-user/ghippo/images/system-message2.png b/docs/en/docs/end-user/ghippo/images/system-message2.png new file mode 100644 index 0000000000..124a000599 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/system-message2.png differ diff --git a/docs/en/docs/end-user/ghippo/images/system-message3.png b/docs/en/docs/end-user/ghippo/images/system-message3.png new file mode 100644 index 0000000000..e8a42ae3bd Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/system-message3.png differ diff --git a/docs/en/docs/end-user/ghippo/images/ws01.png b/docs/en/docs/end-user/ghippo/images/ws01.png new file mode 100644 index 0000000000..dabf343889 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/ws01.png differ diff --git a/docs/en/docs/end-user/ghippo/images/ws02.png b/docs/en/docs/end-user/ghippo/images/ws02.png new file mode 100644 index 0000000000..35b4d467e8 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/ws02.png differ diff --git a/docs/en/docs/end-user/ghippo/images/ws03.png b/docs/en/docs/end-user/ghippo/images/ws03.png new file mode 100644 index 0000000000..3e8b1e09f5 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/ws03.png differ diff --git a/docs/en/docs/end-user/ghippo/images/wsbind1.png b/docs/en/docs/end-user/ghippo/images/wsbind1.png new file mode 100644 index 0000000000..f3699f2430 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/wsbind1.png differ diff --git a/docs/en/docs/end-user/ghippo/images/wsbind2.png b/docs/en/docs/end-user/ghippo/images/wsbind2.png new file mode 100644 index 0000000000..23e411a44c Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/wsbind2.png differ diff --git a/docs/en/docs/end-user/ghippo/images/wsbind3.png b/docs/en/docs/end-user/ghippo/images/wsbind3.png new file mode 100644 index 0000000000..d52bbd41a1 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/wsbind3.png differ diff --git a/docs/en/docs/end-user/ghippo/images/wsbind4.png b/docs/en/docs/end-user/ghippo/images/wsbind4.png new file mode 100644 index 0000000000..319f60e05f Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/wsbind4.png differ diff --git a/docs/en/docs/end-user/ghippo/images/wsbind5.png b/docs/en/docs/end-user/ghippo/images/wsbind5.png new file mode 100644 index 0000000000..4b6367bcd0 Binary files /dev/null and b/docs/en/docs/end-user/ghippo/images/wsbind5.png differ diff --git a/docs/en/docs/end-user/ghippo/personal-center/accesstoken.md b/docs/en/docs/end-user/ghippo/personal-center/accesstoken.md new file mode 100644 index 0000000000..a7c6a1b0c2 --- /dev/null +++ b/docs/en/docs/end-user/ghippo/personal-center/accesstoken.md @@ -0,0 
+1,58 @@ +--- +MTPE: windsonsea +date: 2024-01-11 +--- + +# Access key + +The access key can be used to access the openAPI and continuous delivery. Users can obtain the key and access the API by referring to the following steps in the personal center. + +## Get key + +Log in to AI platform, find __Personal Center__ in the drop-down menu in the upper right corner, and you can manage the access key of the account on the __Access Keys__ page. + +![key list](../../images/platform02.png) + +![created a key](../../images/platform03.png) + +!!! info + + Access key is displayed only once. If you forget your access key, + you will need to create a new key. + +## Use the key to access API + +When accessing AI platform openAPI, add the header `Authorization:Bearer ${token}` to the request to identify the visitor, where `${token}` is the key obtained in the previous step. For the specific API, see [OpenAPI Documentation](https://docs.daocloud.io/openapi/). + +**Request Example** + +```bash +curl -X GET -H 'Authorization:Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IkRKVjlBTHRBLXZ4MmtQUC1TQnVGS0dCSWc1cnBfdkxiQVVqM2U3RVByWnMiLCJ0eXAiOiJKV1QifQ.eyJleHAiOjE2NjE0MTU5NjksImlhdCI6MTY2MDgxMTE2OSwiaXNzIjoiZ2hpcHBvLmlvIiwic3ViIjoiZjdjOGIxZjUtMTc2MS00NjYwLTg2MWQtOWI3MmI0MzJmNGViIiwicHJlZmVycmVkX3VzZXJuYW1lIjoiYWRtaW4iLCJncm91cHMiOltdfQ.RsUcrAYkQQ7C6BxMOrdD3qbBRUt0VVxynIGeq4wyIgye6R8Ma4cjxG5CbU1WyiHKpvIKJDJbeFQHro2euQyVde3ygA672ozkwLTnx3Tu-_mB1BubvWCBsDdUjIhCQfT39rk6EQozMjb-1X1sbLwzkfzKMls-oxkjagI_RFrYlTVPwT3Oaw-qOyulRSw7Dxd7jb0vINPq84vmlQIsI3UuTZSNO5BCgHpubcWwBss-Aon_DmYA-Et_-QtmPBA3k8E2hzDSzc7eqK0I68P25r9rwQ3DeKwD1dbRyndqWORRnz8TLEXSiCFXdZT2oiMrcJtO188Ph4eLGut1-4PzKhwgrQ' https://demo-dev.daocloud.io/apis/ghippo.io/v1alpha1/users?page=1&pageSize=10 -k +``` + +**Request result** + +```json +{ + "items": [ + { + "id": "a7cfd010-ebbe-4601-987f-d098d9ef766e", + "name": "a", + "email": "", + "description": "", + "firstname": "", + "lastname": "", + "source": "locale", + "enabled": true, + "createdAt": "1660632794800", + "updatedAt": "0", + "lastLoginAt": "" + } + ], + "pagination": { + "page": 1, + "pageSize": 10, + "total": 1 + } +} +``` diff --git a/docs/en/docs/end-user/ghippo/personal-center/language.md b/docs/en/docs/end-user/ghippo/personal-center/language.md new file mode 100644 index 0000000000..e4ac00caf3 --- /dev/null +++ b/docs/en/docs/end-user/ghippo/personal-center/language.md @@ -0,0 +1,31 @@ +--- +hide: + - toc +--- + +# language settings + +This section explains how to set the interface language. Currently supports Chinese, English two languages. + +Language setting is the portal for the platform to provide multilingual services. The platform is displayed in Chinese by default. Users can switch the platform language by selecting English or automatically detecting the browser language preference according to their needs. +Each user's multilingual service is independent of each other, and switching will not affect other users. + +The platform provides three ways to switch languages: Chinese, English-English, and automatically detect your browser language preference. + +The operation steps are as follows. + +1. Log in to the AI platform with your username/password. Click __Global Management__ at the bottom of the left navigation bar. + + + +2. Click the username in the upper right corner and select __Personal Center__ . + + + +3. Click the __Language Settings__ tab. + + + +4. Toggle the language option. 
+ + \ No newline at end of file diff --git a/docs/en/docs/end-user/ghippo/personal-center/security-setting.md b/docs/en/docs/end-user/ghippo/personal-center/security-setting.md new file mode 100644 index 0000000000..66d1af4950 --- /dev/null +++ b/docs/en/docs/end-user/ghippo/personal-center/security-setting.md @@ -0,0 +1,21 @@ +--- +hide: + - toc +--- + +# Security Settings + +Function description: It is used to fill in the email address and modify the login password. + +- Email: After the administrator configures the email server address, the user can click the Forget Password button on the login page to fill in the email address there to retrieve the password. +- Password: The password used to log in to the platform, it is recommended to change the password regularly. + +The specific operation steps are as follows: + +1. Click the username in the upper right corner and select __Personal Center__ . + + ![Personal Center](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/lang01.png) + +2. Click the __Security Settings__ tab. Fill in your email address or change the login password. + + ![Security Settings](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/security01.png) diff --git a/docs/en/docs/end-user/ghippo/personal-center/ssh-key.md b/docs/en/docs/end-user/ghippo/personal-center/ssh-key.md new file mode 100644 index 0000000000..f79fe21f37 --- /dev/null +++ b/docs/en/docs/end-user/ghippo/personal-center/ssh-key.md @@ -0,0 +1,106 @@ +# Configuring SSH Public Key + +This article explains how to configure SSH public key. + +## Step 1. View Existing SSH Keys + +Before generating a new SSH key, please check if you need to use an existing SSH key stored in the root directory of the local user. +For Linux and Mac, use the following command to view existing public keys. Windows users can use the +following command in WSL (requires Windows 10 or above) or Git Bash to view the generated public keys. + +- **ED25519 Algorithm:** + + ```bash + cat ~/.ssh/id_ed25519.pub + ``` + +- **RSA Algorithm:** + + ```bash + cat ~/.ssh/id_rsa.pub + ``` + +If a long string starting with ssh-ed25519 or ssh-rsa is returned, it means that a local public key already exists. +You can skip [Step 2 Generate SSH Key](#step-2-generate-ssh-key) and proceed directly to [Step 3](#step-3-copy-the-public-key). + +## Step 2. Generate SSH Key + +If [Step 1](#step-1-view-existing-ssh-keys) does not return the specified content string, it means that +there is no available SSH key locally and a new SSH key needs to be generated. Please follow these steps: + +1. Access the terminal (Windows users please use [WSL](https://docs.microsoft.com/en-us/windows/wsl/install) or [Git Bash](https://gitforwindows.org/)), and run `ssh-keygen -t`. + +2. Enter the key algorithm type and an optional comment. + + The comment will appear in the .pub file and can generally use the email address as the comment content. + + - To generate a key pair based on the `ED25519` algorithm, use the following command: + + ```bash + ssh-keygen -t ed25519 -C "" + ``` + + - To generate a key pair based on the `RSA` algorithm, use the following command: + + ```bash + ssh-keygen -t rsa -C "" + ``` + +3. Press Enter to choose the SSH key generation path. + + Taking the ED25519 algorithm as an example, the default path is as follows: + + ```console + Generating public/private ed25519 key pair. 
+ Enter file in which to save the key (/home/user/.ssh/id_ed25519): + ``` + + The default key generation path is `/home/user/.ssh/id_ed25519`, and the corresponding public key is `/home/user/.ssh/id_ed25519.pub`. + +4. Set a passphrase for the key. + + ```console + Enter passphrase (empty for no passphrase): + Enter same passphrase again: + ``` + + The passphrase is empty by default, and you can choose to use a passphrase to protect the private key file. + If you do not want to enter a passphrase every time you access the repository using the SSH protocol, + you can enter an empty passphrase when creating the key. + +5. Press Enter to complete the key pair creation. + +## Step 3. Copy the Public Key + +In addition to manually copying the generated public key information printed on the command line, you can use the following commands to copy the public key to the clipboard, depending on the operating system. + +- Windows (in [WSL](https://docs.microsoft.com/en-us/windows/wsl/install) or [Git Bash](https://gitforwindows.org/)): + + ```bash + cat ~/.ssh/id_ed25519.pub | clip + ``` + +- Mac: + + ```bash + tr -d '\n'< ~/.ssh/id_ed25519.pub | pbcopy + ``` + +- GNU/Linux (requires xclip): + + ```bash + xclip -sel clip < ~/.ssh/id_ed25519.pub + ``` + +## Step 4. Set the Public Key on AI platform Platform + +1. Log in to the AI platform UI page and select **Profile** -> **SSH Public Key** in the upper right corner of the page. + +2. Add the generated SSH public key information. + + 1. SSH public key content. + + 2. Public key title: Supports customizing the public key name for management differentiation. + + 3. Expiration: Set the expiration period for the public key. After it expires, + the public key will be automatically invalidated and cannot be used. If not set, it will be permanently valid. diff --git a/docs/en/docs/end-user/ghippo/workspace/folder-permission.md b/docs/en/docs/end-user/ghippo/workspace/folder-permission.md new file mode 100644 index 0000000000..882e8fc799 --- /dev/null +++ b/docs/en/docs/end-user/ghippo/workspace/folder-permission.md @@ -0,0 +1,39 @@ +# Description of folder permissions + +Folders have permission mapping capabilities, which can map the permissions of users/groups in this folder to subfolders, workspaces and resources under it. + +If the user/group is Folder Admin role in this folder, it is still Folder Admin role when mapped to a subfolder, and Workspace Admin is mapped to the workspace under it; +If a Namespace is bound in __Workspace and Folder__ -> __Resource Group__ , the user/group is also a Namespace Admin after mapping. + +!!! note + + The permission mapping capability of folders will not be applied to shared resources, because sharing is to share the use permissions of the cluster to multiple workspaces, rather than assigning management permissions to workspaces, so permission inheritance and role mapping will not be implemented. 
+ +## Use cases + +Folders have hierarchical capabilities, so when folders are mapped to departments/suppliers/projects in the enterprise, + +- If a user/group has administrative authority (Admin) in the first-level department, the second-level, third-level, and fourth-level departments or projects under it also have administrative authority; +- If a user/group has access rights (Editor) in the first-level department, the second-, third-, and fourth-level departments or projects under it also have access rights; +- If a user/group has read-only permission (Viewer) in the first-level department, the second-level, third-level, and fourth-level departments or projects under it also have read-only permission. + +| Objects | Actions | Folder Admin | Folder Editor | Folder Viewer | +| --------------------------- | -------- | ------------ | ------------- | ------------- | +| on the folder itself | view | ✓ | ✓ | ✓ | +| | Authorization | ✓ | ✗ | ✗ | +| | Modify Alias ​​| ✓ | ✗ | ✗ | +| To Subfolder | Create | ✓ | ✗ | ✗ | +| | View | ✓ | ✓ | ✓ | +| | Authorization | ✓ | ✗ | ✗ | +| | Modify Alias ​​| ✓ | ✗ | ✗ | +| workspace under it | create | ✓ | ✗ | ✗ | +| | View | ✓ | ✓ | ✓ | +| | Authorization | ✓ | ✗ | ✗ | +| | Modify Alias ​​| ✓ | ✗ | ✗ | +| Workspace under it - Resource Group | View | ✓ | ✓ | ✓ | +| | resource binding | ✓ | ✗ | ✗ | +| | unbind | ✓ | ✗ | ✗ | +| Workspaces under it - Shared Resources | View | ✓ | ✓ | ✓ | +| | New share | ✓ | ✗ | ✗ | +| | Unshare | ✓ | ✗ | ✗ | +| | Resource Quota | ✓ | ✗ | ✗ | \ No newline at end of file diff --git a/docs/en/docs/end-user/ghippo/workspace/folders.md b/docs/en/docs/end-user/ghippo/workspace/folders.md new file mode 100644 index 0000000000..e552ffc12f --- /dev/null +++ b/docs/en/docs/end-user/ghippo/workspace/folders.md @@ -0,0 +1,37 @@ +--- +hide: + - toc +--- + +# Create/Delete Folders + +Folders have the capability to map permissions, allowing users/user groups to have their permissions in the folder mapped to its sub-folders, workspaces, and resources. + +Follow the steps below to create a folder: + +1. Log in to AI platform with a user account having the admin/folder admin role. + Click __Global Management__ -> __Workspace and Folder__ at the bottom of the left navigation bar. + + ![Global Management](../images/ws01.png) + +2. Click the __Create Folder__ button in the top right corner. + + ![Create Folder](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/ws02.png) + +3. Fill in the folder name, parent folder, and other information, then click __OK__ to complete creating the folder. + + ![Confirm](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/fd03.png) + +!!! tip + + After successful creation, the folder name will be displayed in the left tree structure, represented by different icons for workspaces and folders. + + ![Workspaces and Folders](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/fd04.png) + +!!! note + + To edit or delete a specific folder, select it and Click __┇__ on the right side. + + - If there are resources bound to the resource group or shared resources within the folder, the folder cannot be deleted. All resources need to be unbound before deleting. + + - If there are registry resources accessed by the microservice engine module within the folder, the folder cannot be deleted. All access to the registry needs to be removed before deleting the folder. 
diff --git a/docs/en/docs/end-user/ghippo/workspace/quota.md b/docs/en/docs/end-user/ghippo/workspace/quota.md new file mode 100644 index 0000000000..c172fa7cf8 --- /dev/null +++ b/docs/en/docs/end-user/ghippo/workspace/quota.md @@ -0,0 +1,119 @@ +--- +MTPE: WANG0608GitHub +Date: 2024-09-02 +--- + +# Resource Quota + +Shared resources do not necessarily mean that the shared users can use the shared resources without +any restrictions. Admin, Kpanda Owner, and Workspace Admin can limit the maximum usage quota of a user +through the __Resource Quota__ feature in shared resources. If no restrictions are set, it means the +usage is unlimited. + +- CPU Request (Core) +- CPU Limit (Core) +- Memory Request (MB) +- Memory Limit (MB) +- Total Storage Request (GB) +- Persistent Volume Claims (PVC) +- GPU Type, Spec, Quantity (including but not limited to Nvidia, Ascend, ILLUVATAR, and other GPUs) + +A resource (cluster) can be shared among multiple workspaces, and a workspace can use resources from +multiple shared clusters simultaneously. + +## Resource Groups and Shared Resources + +Cluster resources in both shared resources and resource groups are derived from [Container Management](../../../kpanda/intro/index.md). However, different effects will occur when binding a cluster to a workspace or sharing it with a workspace. + +1. Binding Resources + + Users/User groups in the workspace will have full management and usage permissions for the cluster. + Workspace Admin will be mapped as Cluster Admin. + Workspace Admin can access the [Container Management module](../../../kpanda/permissions/permission-brief.md) + to manage the cluster. + + ![Resource Group](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/quota01.png) + + !!! note + + As of now, there are no Cluster Editor and Cluster Viewer roles in the Container Management module. + Therefore, Workspace Editor and Workspace Viewer cannot be mapped. + +2. Adding Shared Resources + + Users/User groups in the workspace will have usage permissions for the cluster resources, which can be + + used when [creating namespaces](../../../amamba/user-guide/namespace/namespace.md). + + ![Shared Resources](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/quota02.png) + + Unlike resource groups, when sharing a cluster with a workspace, the roles of the users in the workspace + will not be mapped to the resources. Therefore, Workspace Admin will not be mapped as Cluster Admin. + +This section demonstrates three scenarios related to resource quotas. + +## Create Namespaces + +Creating a namespace involves resource quotas. + +1. Add a shared cluster to workspace __ws01__ . + + ![Add Shared Cluster](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/quota03.png) + +2. Select workspace __ws01__ and the shared cluster in Workbench, and create a namespace __ns01__ . + + ![Create Namespace](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/quota04.png) + + - If no resource quotas are set in the shared cluster, there is no need to set resource quotas when creating + the namespace. + - If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the CPU request for the + namespace must be less than or equal to 100 cores (__CPU Request ≤ 100 core__) for successful creation. + +## Bind Namespace to Workspace + +Prerequisite: Workspace ws01 has added a shared cluster, and the operator has the Workspace Admin + Kpanda Owner +or Admin role. 
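Throughout the scenarios on this page, the quota that actually lands on a namespace can be double-checked from the cluster side with standard kubectl commands. The following is only a minimal sketch: it assumes the example namespace `ns01` used above and that you have kubectl access to the shared cluster.

```bash
# List the ResourceQuota objects in the example namespace, with their hard limits and current usage.
kubectl get resourcequota -n ns01

# Show full details, e.g. to compare requests.cpu with the 100-core example quota above.
kubectl describe resourcequota -n ns01
```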
+ +The two methods of binding have the same effect. + +- Bind the created namespace ns01 to ws01 in Container Management. + + ![Bind to Workspace](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/quota05.png) + + - If no resource quotas are set in the shared cluster, the namespace ns01 can be successfully bound regardless + of whether resource quotas are set. + - If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the namespace ns01 + must meet the requirement of CPU requests less than or equal to 100 cores (__CPU Request ≤ 100 core__) + for successful binding. + +- Bind the namespace ns01 to ws01 in Global Management. + + ![Bind to Workspace](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/quota06.png) + + - If no resource quotas are set in the shared cluster, the namespace ns01 can be successfully bound + regardless of whether resource quotas are set. + - If resource quotas are set in the shared cluster (e.g., CPU Request = 100 cores), the namespace ns01 + must meet the requirement of CPU requests less than or equal to 100 cores (__CPU Request ≤ 100 core__) + for successful binding. + +## Unbind Namespace from Workspace + +The two methods of unbinding have the same effect. + +- Unbind the namespace ns01 from workspace ws01 in Container Management. + + ![Bind to Workspace](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/quota07.png) + + - If no resource quotas are set in the shared cluster, unbinding the namespace ns01 will not affect the + resource quotas, regardless of whether resource quotas were set for the namespace. + - If resource quotas (__CPU Request = 100 cores__) are set in the shared cluster and the namespace ns01 + has its own resource quotas, unbinding will release the corresponding resource quota. + +- Unbind the namespace ns01 from workspace ws01 in Global Management. + + ![Bind to Workspace](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/quota08.png) + + - If no resource quotas are set in the shared cluster, unbinding the namespace ns01 will not affect the + resource quotas, regardless of whether resource quotas were set for the namespace. + - If resource quotas (__CPU Request = 100 cores__) are set in the shared cluster and the namespace ns01 + has its own resource quotas, unbinding will release the corresponding resource quota. diff --git a/docs/en/docs/end-user/ghippo/workspace/res-gp-and-shared-res.md b/docs/en/docs/end-user/ghippo/workspace/res-gp-and-shared-res.md new file mode 100644 index 0000000000..2fc90529ed --- /dev/null +++ b/docs/en/docs/end-user/ghippo/workspace/res-gp-and-shared-res.md @@ -0,0 +1,28 @@ +# Differences between Resource Groups and Shared Resources + +Both resource groups and shared resources support cluster binding, but they have significant differences in usage. + +## Differences in Usage Scenarios + +- Cluster Binding for Resource Groups: Resource groups are usually used for batch authorization. After binding a resource group to a cluster, + the workspace administrator will be mapped as a cluster administrator and able to manage and use cluster resources. +- Cluster Binding for Shared Resources: Shared resources are usually used for resource quotas. A typical scenario is that the platform administrator assigns a cluster to a first-level supplier, who then assigns the cluster to a second-level supplier and sets resource quotas for the second-level supplier. 
+ +![diff](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/res-gp01.png) + +Note: In this scenario, the platform administrator needs to impose resource restrictions on secondary suppliers. +Currently, it is not supported to limit the cluster quota of secondary suppliers by the primary supplier. + +## Differences in Cluster Quota Usage + +- Cluster Binding for Resource Groups: The workspace administrator is mapped as the administrator of the cluster and is equivalent to being granted the Cluster Admin role in Container Management-Permission Management. They can have unrestricted access to cluster resources, manage important content such as management nodes, and cannot be subject to resource quotas. +- Cluster Binding for Shared Resources: The workspace administrator can only use the quota in the cluster to create namespaces in the Workbench and does not have cluster management permissions. If the workspace is restricted by a quota, the workspace administrator can only create and use namespaces within the quota range. + +## Differences in Resource Types + +- Resource Groups: Can bind to clusters, cluster-namespaces, multiclouds, multicloud namespaces, meshs, and mesh-namespaces. +- Shared Resources: Can only bind to clusters. + +## Similarities between Resource Groups and Shared Resources + +After binding to a cluster, both resource groups and shared resources can go to the Workbench to create namespaces, which will be automatically bound to the workspace. diff --git a/docs/en/docs/end-user/ghippo/workspace/workspace.md b/docs/en/docs/end-user/ghippo/workspace/workspace.md new file mode 100644 index 0000000000..4b8b01ab21 --- /dev/null +++ b/docs/en/docs/end-user/ghippo/workspace/workspace.md @@ -0,0 +1,42 @@ +--- +hide: + - toc +--- + +# Creating/Deleting Workspaces + +A workspace is a resource category that represents a hierarchical relationship of resources. +A workspace can contain resources such as clusters, namespaces, and registries. Typically, +each workspace corresponds to a project and different resources can be allocated, and +different users and user groups can be assigned to each workspace. + +Follow the steps below to create a workspace: + +1. Log in to AI platform with a user account having the admin/folder admin role. + Click __Global Management__ -> __Workspace and Folder__ at the bottom of the left navigation bar. + + ![Global Management](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/ws01.png) + +3. Click the __Create Workspace__ button in the top right corner. + + ![Create Workspace](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/ws02.png) + +4. Fill in the workspace name, folder assignment, and other information, then click __OK__ to complete creating the workspace. + + ![Confirm](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/ws03.png) + +!!! tip + + After successful creation, the workspace name will be displayed in the left tree structure, represented by different icons for folders and workspaces. + + ![Folders and Workspaces](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/ws04.png) + +!!! note + + To edit or delete a specific workspace or folder, select it and click __...__ on the right side. + + - If resource groups and shared resources have resources under the workspace, the workspace cannot be deleted. All resources need to be unbound before deletion of the workspace. 
+ + - If Microservices Engine has Integrated Registry under the workspace, the workspace cannot be deleted. Integrated Registry needs to be removed before deletion of the workspace. + + - If Container Registry has Registry Space or Integrated Registry under the workspace, the workspace cannot be deleted. Registry Space needs to be removed, and Integrated Registry needs to be deleted before deletion of the workspace. diff --git a/docs/en/docs/end-user/ghippo/workspace/ws-folder.md b/docs/en/docs/end-user/ghippo/workspace/ws-folder.md new file mode 100644 index 0000000000..fe149394f8 --- /dev/null +++ b/docs/en/docs/end-user/ghippo/workspace/ws-folder.md @@ -0,0 +1,65 @@ +--- +hide: + - toc +--- + +# Workspace and Folder + +Workspace and Folder is a feature that provides resource isolation and grouping, addressing issues +related to unified authorization, resource grouping, and resource quotas. + +Workspace and Folder involves two concepts: workspaces and folders. + +## Workspaces + +Workspaces allow the management of resources through __Authorization__ , __Resource Group__ , and __Shared Resource__ , +enabling users (and user groups) to share resources within the workspace. + +![Workspaces](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/ghippo/images/wsfd01.png) + +- Resources + + Resources are at the lowest level of the hierarchy in the resource management module. They include clusters, namespaces, pipelines, gateways, and more. All these resources can only have workspaces as their parent level. Workspaces act as containers for grouping resources. + +- Workspace + + A workspace usually refers to a project or environment, and the resources in each workspace are logically isolated from those in other workspaces. + You can grant users (groups of users) different access rights to the same set of resources through authorization in the workspace. + + Workspaces are at the first level, counting from the bottom of the hierarchy, and contain resources. + All resources except shared resources have one and only one parent. All workspaces also have one and only one parent folder. + + Resources are grouped by workspace, and there are two grouping modes in workspace, namely __Resource Group__ and __Shared Resource__ . + +- Resource group + + A resource can only be added to one resource group, and resource groups correspond to workspaces one by one. + After a resource is added to a resource group, Workspace Admin will obtain the management authority of the resource, which is equivalent to the owner of the resource. + +- Share resource + + For shared resources, multiple workspaces can share one or more resources. + Resource owners can choose to share their own resources with the workspace. Generally, when sharing, the resource owner will limit the amount of resources that can be used by the shared workspace. + After resources are shared, Workspace Admin only has resource usage rights under the resource limit, and cannot manage resources or adjust the amount of resources that can be used by the workspace. + + At the same time, shared resources also have certain requirements for the resources themselves. Only Cluster (cluster) resources can be shared. + Cluster Admin can share Cluster resources to different workspaces, and limit the use of workspaces on this Cluster. + + Workspace Admin can create multiple Namespaces within the resource quota, but the sum of the resource quotas of the Namespaces cannot exceed the resource quota of the Cluster in the workspace. 
+ For Kubernetes resources, the only resource type that can be shared currently is Cluster. + +## Folder + +Folders can be used to build enterprise business hierarchy relationships. + +- Folders are a further grouping mechanism based on workspaces and have a hierarchical structure. + A folder can contain workspaces, other folders, or a combination of both, forming a tree-like organizational relationship. + +- Folders allow you to map your business hierarchy and group workspaces by department. + Folders are not directly linked to resources, but indirectly achieve resource grouping through workspaces. + +- A folder has one and only one parent folder, and the root folder is the highest level of the hierarchy. + The root folder has no parent, and folders and workspaces are attached to the root folder. + +In addition, users (groups) in folders can inherit permissions from their parents through a hierarchical structure. +The permissions of the user in the hierarchical structure come from the combination of the permissions of the current level and the permissions inherited from its parents. The permissions are additive and there is no mutual exclusion. diff --git a/docs/en/docs/end-user/ghippo/workspace/ws-permission.md b/docs/en/docs/end-user/ghippo/workspace/ws-permission.md new file mode 100644 index 0000000000..f325bdb13d --- /dev/null +++ b/docs/en/docs/end-user/ghippo/workspace/ws-permission.md @@ -0,0 +1,54 @@ +# Description of workspace permissions + +The workspace has permission mapping and resource isolation capabilities, and can map the permissions of users/groups in the workspace to the resources under it. +If the user/group has the Workspace Admin role in the workspace and the resource Namespace is bound to the workspace-resource group, the user/group will become Namespace Admin after mapping. + +!!! note + + The permission mapping capability of the workspace will not be applied to shared resources, because sharing is to share the cluster usage permissions to multiple workspaces, rather than assigning management permissions to the workspaces, so permission inheritance and role mapping will not be implemented. + +## Use cases + +Resource isolation is achieved by binding resources to different workspaces. Therefore, resources can be flexibly allocated to each workspace (tenant) with the help of permission mapping, resource isolation, and resource sharing capabilities. 
+ +Generally applicable to the following two use cases: + +- Cluster one-to-one + + | Ordinary Cluster | Department/Tenant (Workspace) | Purpose | + | -------- | ---------------- | -------- | + | Cluster 01 | A | Administration and Usage | + | Cluster 02 | B | Administration and Usage | + +- Cluster one-to-many + + | Cluster | Department/Tenant (Workspace) | Resource Quota | + | ------- | ---------------- | ---------- | + | Cluster 01 | A | 100 core CPU | + | | B | 50-core CPU | + +## Permission description + +| Action Objects | Operations | Workspace Admin | Workspace Editor | Workspace Viewer | +| :------- | :---------------- | :-------------- | :-------------- | :--------------- | +| itself | view | ✓ | ✓ | ✓ | +| - | Authorization | ✓ | ✗ | ✗ | +| - | Modify Alias | ✓ | ✓ | ✗ | +| Resource Group | View | ✓ | ✓ | ✓ | +| - | resource binding | ✓ | ✗ | ✗ | +| - | unbind | ✓ | ✗ | ✗ | +| Shared Resources | View | ✓ | ✓ | ✓ | +| - | Add Share | ✓ | ✗ | ✗ | +| - | Unshare | ✓ | ✗ | ✗ | +| - | Resource Quota | ✓ | ✗ | ✗ | +| - | Using Shared Resources [^1] | ✓ | ✗ | ✗ | + +[^1]: + Authorized users can go to modules such as workbench, microservice engine, middleware, multicloud orchestration, and service mesh to use resources in the workspace. + For the operation scope of the roles of Workspace Admin, Workspace Editor, and Workspace Viewer in each module, please refer to the permission description: + + - [Workbench Permissions](../../permissions/amamba.md) + - [Service Mesh Permissions](../../permissions/mspider.md) + - [Middleware permissions](../../permissions/mcamel.md) + - [Microservice Engine Permissions](../../permissions/skoala.md) + - [Container Management Permissions](../../../kpanda/permissions/permission-brief.md) \ No newline at end of file diff --git a/docs/en/docs/end-user/ghippo/workspace/wsbind-permission.md b/docs/en/docs/end-user/ghippo/workspace/wsbind-permission.md new file mode 100644 index 0000000000..45e89e922f --- /dev/null +++ b/docs/en/docs/end-user/ghippo/workspace/wsbind-permission.md @@ -0,0 +1,41 @@ +--- +MTPE: WANG0608GitHub +date: 2024-07-16 +--- + +# Resource Binding Permission Instructions + +If a user John ("John" represents any user who is required to bind resources) has +the [Workspace Admin role](../access-control/role.md#workspace-role-authorization-methods) assigned or has been granted proper permissions through a [custom role](../access-control/custom-role.md), +which includes the [Workspace's "Resource Binding" Permissions](./ws-permission.md#description-of-workspace-permissions), and wants to bind a specific cluster or namespace to the workspace. + +To bind cluster/namespace resources to a workspace, not only the [workspace's "Resource Binding" permissions](./ws-permission.md#description-of-workspace-permissions) are required, +but also the permissions of [Cluster Admin](../../../kpanda/permissions/permission-brief.md#cluster-admin). + +## Granting Authorization to John + +1. Using the [Platform Admin Role](../access-control/role.md#workspace-role-authorization-methods), + grant John the role of Workspace Admin on the **Workspace** -> **Authorization** page. + + ![Resource Binding](../../images/wsbind1.png) + +1. Then, on the **Container Management** -> **Permissions** page, authorize John as a Cluster Admin by **Add Permission**. 
+ + ![Cluster Permissions1](../../images/wsbind2.png) + + ![Cluster Permissions2](../../images/wsbind3.png) + +## Binding to Workspace + +Using John's account to log in to AI platform, on the **Container Management** -> **Clusters** page, + John can bind the specified cluster to his own workspace by using the **Bind Workspace** button. + +!!! note + + John can only bind clusters or namespaces to a specific workspace in the [Container Management module](../../../kpanda/intro/index.md), and cannot perform this operation in the Global Management module. + +![cluster banding](../../images/wsbind4.png) + +To bind a namespace to a workspace, you must have at least Workspace Admin and Cluster Admin permissions. + +![cluster banding](../../images/wsbind5.png) \ No newline at end of file diff --git a/docs/en/docs/end-user/host/createhost.md b/docs/en/docs/end-user/host/createhost.md index 5747f91472..55b53e18b1 100644 --- a/docs/en/docs/end-user/host/createhost.md +++ b/docs/en/docs/end-user/host/createhost.md @@ -1,41 +1,41 @@ -# 创建和启动云主机 +# Creating and Starting a Cloud Host -用户完成注册,为其分配了工作空间、命名空间和资源后,即可以创建并启动云主机。 +Once the user completes registration and is assigned a workspace, namespace, and resources, they can create and start a cloud host. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- [用户已成功注册](../register/index.md) -- 管理员为用户绑定了工作空间 -- 管理员为工作空间分配了资源 +- The AI platform is installed +- [User has successfully registered](../register/index.md) +- Administrator has bound the workspace to the user +- Administrator has allocated resources for the workspace -## 操作步骤 +## Steps to Operate -1. 用户登录 AI 算力平台 -1. 点击 **创建云主机** -> **通过模板创建** +1. User logs into the AI platform. +2. Click **Create Cloud Host** -> **Create from Template**. ![create](../images/host01.png) -1. 定义的云主机各项配置后点击 **下一步** +3. After defining the configurations for the cloud host, click **Next**. - === "基本配置" + === "Basic Configuration" ![basic](../images/host02.png) - === "模板配置" + === "Template Configuration" ![template](../images/host03.png) - === "存储与网络" + === "Storage and Network" ![storage](../images/host05.png) -1. 配置 root 密码或 ssh 密钥后点击 **确定** +4. After configuring the root password or SSH key, click **OK**. ![pass](../images/host06.png) -1. 返回主机列表,等待状态变为 **运行中** 之后,可以通过右侧的 **┇** 启动主机。 +5. Return to the host list and wait for the status to change to **Running**. Then, you can start the host using the **┇** button on the right. ![pass](../images/host07.png) -下一步:[使用云主机](./usehost.md) +Next step: [Using the Cloud Host](./usehost.md) diff --git a/docs/en/docs/end-user/host/usehost.md b/docs/en/docs/end-user/host/usehost.md index 6513d8deea..1e70538b6d 100644 --- a/docs/en/docs/end-user/host/usehost.md +++ b/docs/en/docs/end-user/host/usehost.md @@ -1,31 +1,31 @@ -# 使用云主机 +# Using the Cloud Host -创建并启动云主机之后,用户就可以开始使用云主机。 +After creating and starting the cloud host, the user can begin using the cloud host. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- [用户已创建并启动云主机](./createhost.md) +- The AI platform is installed +- [User has created and started the cloud host](./createhost.md) -## 操作步骤 +## Steps to Operate -1. 以管理员身份登录 AI 算力平台 -1. 导航到 **容器管理** -> **容器网络** -> **服务** ,点击服务的名称,进入服务详情页,在右上角点击 **更新** +1. Log into the AI platform as an administrator. +2. Navigate to **Container Management** -> **Container Network** -> **Services**, click on the service name to enter the service details page, and click **Update** in the upper right corner. ![service](../images/usehost01.png) -1. 更改端口范围为 30900-30999,但不能冲突。 +3. 
Change the port range to 30900-30999, ensuring there are no conflicts. ![port](../images/usehost02.png) -1. 以终端用户登录 AI 算力平台,导航到对应的服务,查看访问端口。 +4. Log into the AI platform as an end user, navigate to the corresponding service, and check the access ports. ![port](../images/usehost03.png) -1. 在外网使用 SSH 客户端登录云主机 +5. Use an SSH client to log into the cloud host from the external network. ![ssh](../images/usehost04.png) -1. 至此,你可以在云主机上执行各项操作。 +6. At this point, you can perform various operations on the cloud host. -下一步:[使用 Notebook](../share/notebook.md) +Next step: [Using Notebook](../share/notebook.md) diff --git a/docs/en/docs/end-user/images/add01.png b/docs/en/docs/end-user/images/add01.png new file mode 100644 index 0000000000..9abedc0fa7 Binary files /dev/null and b/docs/en/docs/end-user/images/add01.png differ diff --git a/docs/en/docs/end-user/images/add02.png b/docs/en/docs/end-user/images/add02.png new file mode 100644 index 0000000000..ff6dca87c6 Binary files /dev/null and b/docs/en/docs/end-user/images/add02.png differ diff --git a/docs/en/docs/end-user/images/add03.png b/docs/en/docs/end-user/images/add03.png new file mode 100644 index 0000000000..4e6172fbdb Binary files /dev/null and b/docs/en/docs/end-user/images/add03.png differ diff --git a/docs/en/docs/end-user/images/add04.png b/docs/en/docs/end-user/images/add04.png new file mode 100644 index 0000000000..03e968d899 Binary files /dev/null and b/docs/en/docs/end-user/images/add04.png differ diff --git a/docs/en/docs/end-user/images/add05.png b/docs/en/docs/end-user/images/add05.png new file mode 100644 index 0000000000..b5018eb9e8 Binary files /dev/null and b/docs/en/docs/end-user/images/add05.png differ diff --git a/docs/en/docs/end-user/images/bindws01.png b/docs/en/docs/end-user/images/bindws01.png new file mode 100644 index 0000000000..52c64ac85a Binary files /dev/null and b/docs/en/docs/end-user/images/bindws01.png differ diff --git a/docs/en/docs/end-user/images/bindws02.png b/docs/en/docs/end-user/images/bindws02.png new file mode 100644 index 0000000000..cc53748cc8 Binary files /dev/null and b/docs/en/docs/end-user/images/bindws02.png differ diff --git a/docs/en/docs/end-user/images/bindws03.png b/docs/en/docs/end-user/images/bindws03.png new file mode 100644 index 0000000000..78d2bf4100 Binary files /dev/null and b/docs/en/docs/end-user/images/bindws03.png differ diff --git a/docs/en/docs/end-user/images/bindws04.png b/docs/en/docs/end-user/images/bindws04.png new file mode 100644 index 0000000000..91fc55f76c Binary files /dev/null and b/docs/en/docs/end-user/images/bindws04.png differ diff --git a/docs/en/docs/end-user/images/bindws05.png b/docs/en/docs/end-user/images/bindws05.png new file mode 100644 index 0000000000..a1bbe217bf Binary files /dev/null and b/docs/en/docs/end-user/images/bindws05.png differ diff --git a/docs/en/docs/end-user/images/bindws06.png b/docs/en/docs/end-user/images/bindws06.png new file mode 100644 index 0000000000..c157644f0d Binary files /dev/null and b/docs/en/docs/end-user/images/bindws06.png differ diff --git a/docs/en/docs/end-user/images/bindws07.png b/docs/en/docs/end-user/images/bindws07.png new file mode 100644 index 0000000000..0239d61d53 Binary files /dev/null and b/docs/en/docs/end-user/images/bindws07.png differ diff --git a/docs/en/docs/end-user/images/bindws08.png b/docs/en/docs/end-user/images/bindws08.png new file mode 100644 index 0000000000..bc5881fc1d Binary files /dev/null and b/docs/en/docs/end-user/images/bindws08.png differ diff --git 
a/docs/en/docs/end-user/images/bindws09.png b/docs/en/docs/end-user/images/bindws09.png new file mode 100644 index 0000000000..ee587b030c Binary files /dev/null and b/docs/en/docs/end-user/images/bindws09.png differ diff --git a/docs/en/docs/end-user/images/bindws10.png b/docs/en/docs/end-user/images/bindws10.png new file mode 100644 index 0000000000..1e99f7e174 Binary files /dev/null and b/docs/en/docs/end-user/images/bindws10.png differ diff --git a/docs/en/docs/end-user/images/bindws11.png b/docs/en/docs/end-user/images/bindws11.png new file mode 100644 index 0000000000..53fa67dd31 Binary files /dev/null and b/docs/en/docs/end-user/images/bindws11.png differ diff --git a/docs/en/docs/end-user/images/home.png b/docs/en/docs/end-user/images/home.png new file mode 100644 index 0000000000..2477dbc86e Binary files /dev/null and b/docs/en/docs/end-user/images/home.png differ diff --git a/docs/en/docs/end-user/images/host01.png b/docs/en/docs/end-user/images/host01.png new file mode 100644 index 0000000000..692e2219d0 Binary files /dev/null and b/docs/en/docs/end-user/images/host01.png differ diff --git a/docs/en/docs/end-user/images/host02.png b/docs/en/docs/end-user/images/host02.png new file mode 100644 index 0000000000..60914151da Binary files /dev/null and b/docs/en/docs/end-user/images/host02.png differ diff --git a/docs/en/docs/end-user/images/host03.png b/docs/en/docs/end-user/images/host03.png new file mode 100644 index 0000000000..2eecbeda8d Binary files /dev/null and b/docs/en/docs/end-user/images/host03.png differ diff --git a/docs/en/docs/end-user/images/host04.png b/docs/en/docs/end-user/images/host04.png new file mode 100644 index 0000000000..b1d0c2ea60 Binary files /dev/null and b/docs/en/docs/end-user/images/host04.png differ diff --git a/docs/en/docs/end-user/images/host05.png b/docs/en/docs/end-user/images/host05.png new file mode 100644 index 0000000000..67806a5059 Binary files /dev/null and b/docs/en/docs/end-user/images/host05.png differ diff --git a/docs/en/docs/end-user/images/host06.png b/docs/en/docs/end-user/images/host06.png new file mode 100644 index 0000000000..dae145fcf2 Binary files /dev/null and b/docs/en/docs/end-user/images/host06.png differ diff --git a/docs/en/docs/end-user/images/host07.png b/docs/en/docs/end-user/images/host07.png new file mode 100644 index 0000000000..e94e450ca2 Binary files /dev/null and b/docs/en/docs/end-user/images/host07.png differ diff --git a/docs/en/docs/end-user/images/k8s01.png b/docs/en/docs/end-user/images/k8s01.png new file mode 100644 index 0000000000..a06d26aa92 Binary files /dev/null and b/docs/en/docs/end-user/images/k8s01.png differ diff --git a/docs/en/docs/end-user/images/k8s02.png b/docs/en/docs/end-user/images/k8s02.png new file mode 100644 index 0000000000..4ae0a2d0a7 Binary files /dev/null and b/docs/en/docs/end-user/images/k8s02.png differ diff --git a/docs/en/docs/end-user/images/k8s03.png b/docs/en/docs/end-user/images/k8s03.png new file mode 100644 index 0000000000..f6e2fc523c Binary files /dev/null and b/docs/en/docs/end-user/images/k8s03.png differ diff --git a/docs/en/docs/end-user/images/k8s04.png b/docs/en/docs/end-user/images/k8s04.png new file mode 100644 index 0000000000..579e978026 Binary files /dev/null and b/docs/en/docs/end-user/images/k8s04.png differ diff --git a/docs/en/docs/end-user/images/k8s05.png b/docs/en/docs/end-user/images/k8s05.png new file mode 100644 index 0000000000..54e89eefd0 Binary files /dev/null and b/docs/en/docs/end-user/images/k8s05.png differ diff --git 
a/docs/en/docs/end-user/images/k8s06.png b/docs/en/docs/end-user/images/k8s06.png new file mode 100644 index 0000000000..87a9478b8e Binary files /dev/null and b/docs/en/docs/end-user/images/k8s06.png differ diff --git a/docs/en/docs/end-user/images/k8s07.png b/docs/en/docs/end-user/images/k8s07.png new file mode 100644 index 0000000000..512762beb2 Binary files /dev/null and b/docs/en/docs/end-user/images/k8s07.png differ diff --git a/docs/en/docs/end-user/images/k8s08.png b/docs/en/docs/end-user/images/k8s08.png new file mode 100644 index 0000000000..8407336365 Binary files /dev/null and b/docs/en/docs/end-user/images/k8s08.png differ diff --git a/docs/en/docs/end-user/images/k8s09.png b/docs/en/docs/end-user/images/k8s09.png new file mode 100644 index 0000000000..a27d8fda14 Binary files /dev/null and b/docs/en/docs/end-user/images/k8s09.png differ diff --git a/docs/en/docs/end-user/images/k8s10.png b/docs/en/docs/end-user/images/k8s10.png new file mode 100644 index 0000000000..d6e5c7801d Binary files /dev/null and b/docs/en/docs/end-user/images/k8s10.png differ diff --git a/docs/en/docs/end-user/images/k8s11.png b/docs/en/docs/end-user/images/k8s11.png new file mode 100644 index 0000000000..0965c31a2a Binary files /dev/null and b/docs/en/docs/end-user/images/k8s11.png differ diff --git a/docs/en/docs/end-user/images/k8s12.png b/docs/en/docs/end-user/images/k8s12.png new file mode 100644 index 0000000000..63592d1780 Binary files /dev/null and b/docs/en/docs/end-user/images/k8s12.png differ diff --git a/docs/en/docs/end-user/images/k8s13.png b/docs/en/docs/end-user/images/k8s13.png new file mode 100644 index 0000000000..da719560f8 Binary files /dev/null and b/docs/en/docs/end-user/images/k8s13.png differ diff --git a/docs/en/docs/end-user/images/k8s14.png b/docs/en/docs/end-user/images/k8s14.png new file mode 100644 index 0000000000..ef075306f3 Binary files /dev/null and b/docs/en/docs/end-user/images/k8s14.png differ diff --git a/docs/en/docs/end-user/images/notebook01.png b/docs/en/docs/end-user/images/notebook01.png new file mode 100644 index 0000000000..4888a935ef Binary files /dev/null and b/docs/en/docs/end-user/images/notebook01.png differ diff --git a/docs/en/docs/end-user/images/notebook02.png b/docs/en/docs/end-user/images/notebook02.png new file mode 100644 index 0000000000..79699cb776 Binary files /dev/null and b/docs/en/docs/end-user/images/notebook02.png differ diff --git a/docs/en/docs/end-user/images/notebook03.png b/docs/en/docs/end-user/images/notebook03.png new file mode 100644 index 0000000000..309832266d Binary files /dev/null and b/docs/en/docs/end-user/images/notebook03.png differ diff --git a/docs/en/docs/end-user/images/notebook04.png b/docs/en/docs/end-user/images/notebook04.png new file mode 100644 index 0000000000..b34a57e4d7 Binary files /dev/null and b/docs/en/docs/end-user/images/notebook04.png differ diff --git a/docs/en/docs/end-user/images/notebook05.png b/docs/en/docs/end-user/images/notebook05.png new file mode 100644 index 0000000000..07329e4721 Binary files /dev/null and b/docs/en/docs/end-user/images/notebook05.png differ diff --git a/docs/en/docs/end-user/images/notebook06.png b/docs/en/docs/end-user/images/notebook06.png new file mode 100644 index 0000000000..50d43ff890 Binary files /dev/null and b/docs/en/docs/end-user/images/notebook06.png differ diff --git a/docs/en/docs/end-user/images/notebook07.png b/docs/en/docs/end-user/images/notebook07.png new file mode 100644 index 0000000000..18c9e6d50c Binary files /dev/null and 
b/docs/en/docs/end-user/images/notebook07.png differ diff --git a/docs/en/docs/end-user/images/notebook08.png b/docs/en/docs/end-user/images/notebook08.png new file mode 100644 index 0000000000..015c025620 Binary files /dev/null and b/docs/en/docs/end-user/images/notebook08.png differ diff --git a/docs/en/docs/end-user/images/notebook09.png b/docs/en/docs/end-user/images/notebook09.png new file mode 100644 index 0000000000..bb7fcc5158 Binary files /dev/null and b/docs/en/docs/end-user/images/notebook09.png differ diff --git a/docs/en/docs/end-user/images/quota01.png b/docs/en/docs/end-user/images/quota01.png new file mode 100644 index 0000000000..389b47d75e Binary files /dev/null and b/docs/en/docs/end-user/images/quota01.png differ diff --git a/docs/en/docs/end-user/images/quota02.png b/docs/en/docs/end-user/images/quota02.png new file mode 100644 index 0000000000..ac37f401a7 Binary files /dev/null and b/docs/en/docs/end-user/images/quota02.png differ diff --git a/docs/en/docs/end-user/images/quota03.png b/docs/en/docs/end-user/images/quota03.png new file mode 100644 index 0000000000..2635d7c9c2 Binary files /dev/null and b/docs/en/docs/end-user/images/quota03.png differ diff --git a/docs/en/docs/end-user/images/remove01.png b/docs/en/docs/end-user/images/remove01.png new file mode 100644 index 0000000000..f3491b22db Binary files /dev/null and b/docs/en/docs/end-user/images/remove01.png differ diff --git a/docs/en/docs/end-user/images/remove02.png b/docs/en/docs/end-user/images/remove02.png new file mode 100644 index 0000000000..8160244d3c Binary files /dev/null and b/docs/en/docs/end-user/images/remove02.png differ diff --git a/docs/en/docs/end-user/images/remove03.png b/docs/en/docs/end-user/images/remove03.png new file mode 100644 index 0000000000..1a80f89532 Binary files /dev/null and b/docs/en/docs/end-user/images/remove03.png differ diff --git a/docs/en/docs/end-user/images/remove04.png b/docs/en/docs/end-user/images/remove04.png new file mode 100644 index 0000000000..8cf2feb981 Binary files /dev/null and b/docs/en/docs/end-user/images/remove04.png differ diff --git a/docs/en/docs/end-user/images/remove05.png b/docs/en/docs/end-user/images/remove05.png new file mode 100644 index 0000000000..a7b59fa05f Binary files /dev/null and b/docs/en/docs/end-user/images/remove05.png differ diff --git a/docs/en/docs/end-user/images/ssh01.png b/docs/en/docs/end-user/images/ssh01.png new file mode 100644 index 0000000000..738455475c Binary files /dev/null and b/docs/en/docs/end-user/images/ssh01.png differ diff --git a/docs/en/docs/end-user/images/ssh02.png b/docs/en/docs/end-user/images/ssh02.png new file mode 100644 index 0000000000..293ceffbf3 Binary files /dev/null and b/docs/en/docs/end-user/images/ssh02.png differ diff --git a/docs/en/docs/end-user/images/ssh03.png b/docs/en/docs/end-user/images/ssh03.png new file mode 100644 index 0000000000..268c245d1f Binary files /dev/null and b/docs/en/docs/end-user/images/ssh03.png differ diff --git a/docs/en/docs/end-user/images/ssh04.png b/docs/en/docs/end-user/images/ssh04.png new file mode 100644 index 0000000000..2f9ebd9115 Binary files /dev/null and b/docs/en/docs/end-user/images/ssh04.png differ diff --git a/docs/en/docs/end-user/images/ssh05.png b/docs/en/docs/end-user/images/ssh05.png new file mode 100644 index 0000000000..b34d72a8fe Binary files /dev/null and b/docs/en/docs/end-user/images/ssh05.png differ diff --git a/docs/en/docs/end-user/images/usehost01.png b/docs/en/docs/end-user/images/usehost01.png new file mode 100644 index 
0000000000..c4a5a39193 Binary files /dev/null and b/docs/en/docs/end-user/images/usehost01.png differ diff --git a/docs/en/docs/end-user/images/usehost02.png b/docs/en/docs/end-user/images/usehost02.png new file mode 100644 index 0000000000..a8f46c63b9 Binary files /dev/null and b/docs/en/docs/end-user/images/usehost02.png differ diff --git a/docs/en/docs/end-user/images/usehost03.png b/docs/en/docs/end-user/images/usehost03.png new file mode 100644 index 0000000000..cc36100d87 Binary files /dev/null and b/docs/en/docs/end-user/images/usehost03.png differ diff --git a/docs/en/docs/end-user/images/usehost04.png b/docs/en/docs/end-user/images/usehost04.png new file mode 100644 index 0000000000..caa09c2b65 Binary files /dev/null and b/docs/en/docs/end-user/images/usehost04.png differ diff --git a/docs/en/docs/end-user/images/workload01.png b/docs/en/docs/end-user/images/workload01.png new file mode 100644 index 0000000000..1db6174153 Binary files /dev/null and b/docs/en/docs/end-user/images/workload01.png differ diff --git a/docs/en/docs/end-user/images/workload02.png b/docs/en/docs/end-user/images/workload02.png new file mode 100644 index 0000000000..27ab3afeae Binary files /dev/null and b/docs/en/docs/end-user/images/workload02.png differ diff --git a/docs/en/docs/end-user/images/workload03.png b/docs/en/docs/end-user/images/workload03.png new file mode 100644 index 0000000000..e517c14a66 Binary files /dev/null and b/docs/en/docs/end-user/images/workload03.png differ diff --git a/docs/en/docs/end-user/images/workload04.png b/docs/en/docs/end-user/images/workload04.png new file mode 100644 index 0000000000..5abe7a4057 Binary files /dev/null and b/docs/en/docs/end-user/images/workload04.png differ diff --git a/docs/en/docs/end-user/images/workload05.png b/docs/en/docs/end-user/images/workload05.png new file mode 100644 index 0000000000..26efad6955 Binary files /dev/null and b/docs/en/docs/end-user/images/workload05.png differ diff --git a/docs/en/docs/end-user/images/workload06.png b/docs/en/docs/end-user/images/workload06.png new file mode 100644 index 0000000000..9c3bd869d3 Binary files /dev/null and b/docs/en/docs/end-user/images/workload06.png differ diff --git a/docs/en/docs/end-user/images/wsres01.png b/docs/en/docs/end-user/images/wsres01.png new file mode 100644 index 0000000000..779e2e291c Binary files /dev/null and b/docs/en/docs/end-user/images/wsres01.png differ diff --git a/docs/en/docs/end-user/images/wsres02.png b/docs/en/docs/end-user/images/wsres02.png new file mode 100644 index 0000000000..bc2094c181 Binary files /dev/null and b/docs/en/docs/end-user/images/wsres02.png differ diff --git a/docs/en/docs/end-user/images/wsres03.png b/docs/en/docs/end-user/images/wsres03.png new file mode 100644 index 0000000000..563cd3ac8a Binary files /dev/null and b/docs/en/docs/end-user/images/wsres03.png differ diff --git a/docs/en/docs/end-user/insight/alert-center/alert-policy.md b/docs/en/docs/end-user/insight/alert-center/alert-policy.md new file mode 100644 index 0000000000..93167f5b87 --- /dev/null +++ b/docs/en/docs/end-user/insight/alert-center/alert-policy.md @@ -0,0 +1,83 @@ +# Alert Policies + +In addition to the built-in alert policies, AI platform allows users to create custom alert policies. Each alert policy is a collection of alert rules that can be set for clusters, nodes, and workloads. When an alert object reaches the threshold set by any of the rules in the policy, an alert is automatically triggered and a notification is sent. 
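For readers who want to see what such a threshold condition looks like outside the UI, the sketch below evaluates a common node-CPU-usage expression against the Prometheus HTTP API. It is illustrative only: the `localhost:9090` address is a placeholder for whatever Prometheus endpoint backs your monitoring stack, and the 80% threshold is an arbitrary example. The same kind of expression can be entered directly when using the PromQL rule type described later on this page.

```bash
# Evaluate a threshold-style PromQL expression via the Prometheus HTTP API.
# Instances returned by the query are currently above the example 80% CPU-usage threshold.
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80'
```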
+ +Taking the built-in alerts as an example, click the first alert policy __alertmanager.rules__ . + +![alert policy](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/alert-policy01.png) + +You can see that some alert rules have been set under it. You can add more rules under this policy, or edit or delete them at any time. You can also view the historical and active alerts related to this alert policy and edit the notification configuration. + +![alertmanager.rules](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/alert-policy02.png) + +## Create Alert Policies + +1. Select __Alert Center__ -> __Alert Policies__ , and click the __Create Alert Policy__ button. + + ![alert policy](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/alert-policy01.png) + +2. Fill in the basic information, select one or more clusters, nodes, or workloads as the alert objects, and click __Next__ . + + ![basic information](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/alert-policy03.png) + +3. The list must have at least one rule. If the list is empty, please __Add Rule__ . + + ![add rule](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/alert-policy04.png) + + Create an alert rule in the pop-up window, fill in the parameters, and click __OK__ . + + ![create rule](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/alert-policy05.png) + + - Template rules: Pre-defined basic metrics that can monitor CPU, memory, disk, and network. + - PromQL rules: Input a PromQL expression, please [query Prometheus expressions](https://prometheus.io/docs/prometheus/latest/querying/basics/). + - Duration: After the alert is triggered and the duration reaches the set value, the alert policy will become a triggered state. + - Alert level: Including emergency, warning, and information levels. + - Advanced settings: Custom tags and annotations. + +4. After clicking __Next__ , configure notifications. + + ![notification configuration](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/alert-policy06.png) + +5. After the configuration is complete, click the __OK__ button to return to the Alert Policy list. + +!!! tip + + The newly created alert policy is in the __Not Triggered__ state. Once the threshold conditions and duration specified in the rules are met, it will change to the __Triggered__ state. + +### Create Log Rules + +After filling in the basic information, click __Add Rule__ and select __Log Rule__ as the rule type. + +Creating log rules is supported only when the resource object is selected as a node or workload. + +**Field Explanation:** + +- __Filter Condition__ : Field used to query log content, supports four filtering conditions: AND, OR, regular expression matching, and fuzzy matching. +- __Condition__ : Based on the filter condition, enter keywords or matching conditions. +- __Time Range__ : Time range for log queries. +- __Threshold Condition__ : Enter the alert threshold value in the input box. When the set threshold is reached, an alert will be triggered. Supported comparison operators are: >, ≥, =, ≤, <. +- __Alert Level__ : Select the alert level to indicate the severity of the alert. + +### Create Event Rules + +After filling in the basic information, click __Add Rule__ and select __Event Rule__ as the rule type. + +Creating event rules is supported only when the resource object is selected as a workload. 
+ +**Field Explanation:** + +- __Event Rule__ : Only supports selecting the workload as the resource object. +- __Event Reason__ : Different event reasons for different types of workloads, where the event reasons are combined with "AND" relationship. +- __Time Range__ : Detect data generated within this time range. If the threshold condition is reached, an alert event will be triggered. +- __Threshold Condition__ : When the generated events reach the set threshold, an alert event will be triggered. +- __Trend Chart__ : By default, it queries the trend of event changes within the last 10 minutes. The value at each point represents the total number of occurrences within a certain period of time (time range) from the current time point to a previous time. + +## Other Operations + +Click __┇__ at the right side of the list, then choose __Delete__ from the pop-up menu to delete an alert policy. By clicking on the policy name, you can enter the policy details where you can add, edit, or delete the alert rules under it. + +![alert rule](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/alert-policy07.png) + +!!! warning + + Deleted alert strategies will be permanently removed, so please proceed with caution. diff --git a/docs/en/docs/end-user/insight/alert-center/alert-template.md b/docs/en/docs/end-user/insight/alert-center/alert-template.md new file mode 100644 index 0000000000..3810e9da6e --- /dev/null +++ b/docs/en/docs/end-user/insight/alert-center/alert-template.md @@ -0,0 +1,46 @@ +--- +MTPE: ModetaNiu +date: 2024-07-01 +--- + +# Alert Template + +The Alert template allows platform administrators to create Alert templates and rules, and +business units can directly use Alert templates to create Alert policies. This feature can +reduce the management of Alert rules by business personnel and allow for modification of +Alert thresholds based on actual environment conditions. + +## Create Alert Template + +1. In the navigation bar, select **Alert** -> **Alert Policy**, and click **Alert Template** at the top. + + ![Alert Template](../../images/template01.png){ width=1000px} + +2. Click **Create Alert Template**, and set the name, description, and other information for the Alert template. + + ![Basic Information](../../images/template02.png){ width=1000px} + + ![Alert Rule](../../images/template03.png){ width=1000px} + + | Parameter | Description | + | ---- | ---- | + | Template Name | The name can only contain lowercase letters, numbers, and hyphens (-), must start and end with a lowercase letter or number, and can be up to 63 characters long. | + | Description | The description can contain any characters and can be up to 256 characters long. | + | Resource Type | Used to specify the matching type of the Alert template. | + | Alert Rule | Supports pre-defined multiple Alert rules, including template rules and PromQL rules. | + +3. Click **OK** to complete the creation and return to the Alert template list. Click the template name + to view the template details. + +## Edit Alert Template + +Click **┇** next to the target rule, then click **Edit** to enter the editing page for the suppression rule. + + ![Edit](../../images/template04.png){ width=1000px} + +## Delete Alert Template + +Click **┇** next to the target template, then click **Delete**. Enter the name of the Alert template +in the input box to confirm deletion. 
+ + ![Delete](../../images/template05.png){ width=1000px} diff --git a/docs/en/docs/end-user/insight/alert-center/index.md b/docs/en/docs/end-user/insight/alert-center/index.md new file mode 100644 index 0000000000..d2644eaef5 --- /dev/null +++ b/docs/en/docs/end-user/insight/alert-center/index.md @@ -0,0 +1,36 @@ +# Alert Center + +The Alert Center is an important feature provided by AI platform that allows users +to easily view all active and historical alerts by cluster and namespace through +a graphical interface, and search alerts based on severity level (critical, warning, info). + +![alert list](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/alert01.png) + +All alerts are triggered based on the threshold conditions set in the preset alert rules. +In AI platform, some global alert policies are built-in, but users can also create or delete +alert policies at any time, and set thresholds for the following metrics: + +- CPU usage +- Memory usage +- Disk usage +- Disk reads per second +- Disk writes per second +- Cluster disk read throughput +- Cluster disk write throughput +- Network send rate +- Network receive rate + +Users can also add labels and annotations to alert rules. Alert rules can be classified as +active or expired, and certain rules can be enabled/disabled to achieve silent alerts. + +When the threshold condition is met, users can configure how they want to be notified, +including email, DingTalk, WeCom, webhook, and SMS notifications. All notification +message templates can be customized and all messages are sent at specified intervals. + +In addition, the Alert Center also supports sending alert messages to designated users +through short message services provided by Alibaba Cloud, Tencent Cloud, and more platforms +that will be added soon, enabling multiple ways of alert notification. + +AI platform Alert Center is a powerful alert management platform that helps users +quickly detect and resolve problems in the cluster, improve business stability and availability, +and facilitate cluster inspection and troubleshooting. diff --git a/docs/en/docs/end-user/insight/alert-center/inhibition.md b/docs/en/docs/end-user/insight/alert-center/inhibition.md new file mode 100644 index 0000000000..ed9a679658 --- /dev/null +++ b/docs/en/docs/end-user/insight/alert-center/inhibition.md @@ -0,0 +1,79 @@ +--- +MTPE: ModetaNiu +Date: 2024-07-01 +--- + +# Alert Inhibition + +Alert Inhibition is mainly a mechanism for temporarily hiding or reducing the priority of alerts that do not need +immediate attention. The purpose of this feature is to reduce unnecessary alert information that may disturb +operations personnel, allowing them to focus on more critical issues. + +Alert inhibition recognizes and ignores certain alerts by defining a set of rules to deal with specific conditions. +There are mainly the following conditions: + +- Parent-child inhibition: when a parent alert (for example, a crash on a node) is triggered, all child alerts aroused by + it (for example, a crash on a container running on that node) are inhibited. +- Similar alert inhibition: When alerts have the same characteristics (for example, the same problem on the same instance), + multiple alerts are inhibited. + +## Create Inhibition + +1. In the left navigation bar, select **Alert** -> **Noise Reduction**, and click **Inhibition** at the top. + + ![Inhibition](../../images/inhibition01.png){ width=1000px} + +2. Click **Create Inhibition**, and set the name and rules for the inhibition. + + !!! 
note
+
+        To avoid multiple similar or related alerts being triggered by the same issue, the rule defines a set of
+        conditions that identify and ignore certain alerts. The fields involved are described in
+        [Rule Details](#view-rule-details) and [Alert Details](#view-alert-details).
+
+    ![Create Inhibition](../../images/inhibition02.png){ width=1000px}
+
+    | Parameter | Description |
+    | ---- | ---- |
+    | Name | The name can only contain lowercase letters, numbers, and hyphens (-), must start and end with a lowercase letter or number, and can be up to 63 characters long. |
+    | Description | The description can contain any characters and can be up to 256 characters long. |
+    | Cluster | The cluster where the inhibition rule applies. |
+    | Namespace | The namespace where the inhibition rule applies. |
+    | Source Alert | Matches alerts by label conditions. Alerts that meet all label conditions are compared against the inhibition conditions; alerts that do not meet the inhibition conditions are sent to the user as usual.<br />Value range explanation:<br />- **Alert Level**: The level of metric or event alerts; can be set to Critical, Major, or Minor.<br />- **Resource Type**: The resource type of the alert object; can be set to Cluster, Node, StatefulSet, Deployment, DaemonSet, or Pod.<br />
- **Labels**: Alert identification attributes, consisting of label name and label value, supports user-defined values. | + | Inhibition | Specifies the matching conditions for the target alert (the alert to be inhibited). Alerts that meet all the conditions will no longer be sent to the user. | + | Equal | Specifies the list of labels to compare to determine if the source alert and target alert match. Inhibition is triggered only when the values of the labels specified in `equal` are exactly the same in the source and target alerts. The `equal` field is optional. If the `equal` field is omitted, all labels are used for matching. | + +3. Click **OK** to complete the creation and return to Inhibition list. Click the inhibition rule name to view the rule details. + +### View Rule Details + +In the left navigation bar, select **Alert** -> **Alert Policy**, and click the policy name to view the rule details. + + ![Rule details](../../image/inhibition.png) + + !!! note + + You can add cuntom tags when adding rules. + +### View Alert Details + +In the left navigation bar, select **Alert** -> **Alerts**, and click the policy name to view details. + + ![Alert details](../../image/inhibition-01.png) + + !!! note + + Alert details show information and settings for creating inhibitions. + +## Edit Inhibition Rule + +Click **┇** next to the target rule, then click **Edit** to enter the editing page for the inhibition rule. + +![Edit Rules](../../images/inhibition03.png){ width=1000px} + +## Delete Inhibition Rule + +Click **┇** next to the target rule, then click **Delete**. Enter the name of the inhibition rule in the input box +to confirm deletion. + +![Delete Rules](../../images/inhibition04.png){ width=1000px} diff --git a/docs/en/docs/end-user/insight/alert-center/message.md b/docs/en/docs/end-user/insight/alert-center/message.md new file mode 100644 index 0000000000..95da06e6cf --- /dev/null +++ b/docs/en/docs/end-user/insight/alert-center/message.md @@ -0,0 +1,88 @@ +# Notification Settings + +On the __Notification Settings__ page, you can configure how to send messages to users through email, WeCom, DingTalk, Webhook, and SMS. + +## Email Group + +1. After entering __Insight__ , click __Alert Center__ -> __Notification Settings__ in the left navigation bar. + By default, the email notification object is selected. Click __Add email group__ and add one or more email addresses. + +2. Multiple email addresses can be added. + + ![WeCom](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/notify02.png) + +3. After the configuration is complete, the notification list will automatically return. Click __┇__ on the right side + of the list to edit or delete the email group. + +## WeCom + +1. In the left navigation bar, click __Alert Center__ -> __Notification Settings__ -> __WeCom__ . Click __Add Group Robot__ and add one or more group robots. + + For the URL of the WeCom group robot, please refer to the [official document of WeCom: How to use group robots](https://developers.weixin.qq.com/doc/offiaccount/Getting_Started/Overview.html). + +2. After the configuration is complete, the notification list will automatically return. Click __┇__ on the right side + of the list, select __Send Test Information__ , and you can also edit or delete the group robot. + +## DingTalk + +1. In the left navigation bar, click __Alert Center__ -> __Notification Settings__ -> __DingTalk__ . + Click __Add Group Robot__ and add one or more group robots. 
+ + For the URL of the DingTalk group robot, please refer to the [official document of DingTalk: Custom Robot Access](https://developers.dingtalk.com/document/robots/custom-robot-access). + +2. After the configuration is complete, the notification list will automatically return. Click __┇__ on the right + side of the list, select __Send Test Information__ , and you can also edit or delete the group robot. + +## Lark + +1. In the left navigation bar, click __Alert Center__ -> __Notification Settings__ -> __Lark__ . Click __Add Group Bot__ + and add one or more group bots. + + ![Lark](../../image/notify-01.png) + + !!! note + + When signature verification is required in Lark's group bot, you need to fill in the specific signature key + when enabling notifications. Refer to [Customizing Bot User Guide](https://open.feishu.cn/document/client-docs/bot-v3/add-custom-bot). + +2. After configuration, you will be automatically redirected to the list page. Click __┇__ on the right side of the list + and select __Send Test Message__ . You can edit or delete group bots. + +## Webhook + +1. In the left navigation bar, click __Alert Center__ -> __Notification Settings__ -> __Webhook__ . + Click __New Webhook__ and add one or more Webhooks. + + For the Webhook URL and more configuration methods, please refer to the [webhook document](https://github.com/webhooksite/webhook.site). + +2. After the configuration is complete, the notification list will automatically return. Click __┇__ on the right side + of the list, select __Send Test Information__ , and you can also edit or delete the Webhook. + +## Message + +!!! note + + Alert messages are sent to the personal Message sector and notifications can be viewed by clicking 🔔 at the top. + +1. In the left navigation bar, click __Alert Center__ -> __Notification Settings__ -> __Message__,click __Create Message__ . + + You can add and notify multiple users for a message. + + ![message](../../image/notify-02.png) + +2. After configuration, you will be automatically redirected to the list page. Click __┇__ on the right side of + the list and select __Send Test Message__ . + +## SMS Group + +1. In the left navigation bar, click __Alert Center__ -> __Notification Settings__ -> __SMS__ . Click __Add SMS Group__ + and add one or more SMS groups. + +2. Enter the name, the object receiving the message, phone number, and notification server in the pop-up window. + + The notification server needs to be created in advance under __Notification Settings__ -> __Notification Server__ . + Currently, two cloud servers, Alibaba Cloud and Tencent Cloud, are supported. Please refer to your own + cloud server information for the specific configuration parameters. + +3. After the SMS group is successfully added, the notification list will automatically return. Click __┇__ on the + right side of the list to edit or delete the SMS group. diff --git a/docs/en/docs/end-user/insight/alert-center/msg-template.md b/docs/en/docs/end-user/insight/alert-center/msg-template.md new file mode 100644 index 0000000000..c4903daeb3 --- /dev/null +++ b/docs/en/docs/end-user/insight/alert-center/msg-template.md @@ -0,0 +1,49 @@ +# Message Templates + +The message template feature supports customizing the content of message templates and can notify specified objects in the form of email, WeCom, DingTalk, Webhook, and SMS. + +## Creating a Message Template + +1. In the left navigation bar, select __Alert__ -> __Message Template__ . 
+ + Insight comes with two default built-in templates in both Chinese and English for user convenience. + + ![Click button](../images/template00.png) + +2. Fill in the template content. + + ![message template](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/template02.png) + +!!! info + + Observability comes with predefined message templates. If you need to define the content of the templates, refer to [Configure Notification Templates](../../reference/notify-helper.md). + +## Message Template Details + +Click the name of a message template to view the details of the message template in the right slider. + +![Message Template](../images/msg-detail.png) + +| Parameters | Variable | Description | +|------------|----------|-------------| +| ruleName | {{ .Labels.alertname }} | The name of the rule that triggered the alert | +| groupName | {{ .Labels.alertgroup }} | The name of the alert policy to which the alert rule belongs | +| severity | {{ .Labels.severity }} | The level of the alert that was triggered | +| cluster | {{ .Labels.cluster }} | The cluster where the resource that triggered the alert is located | +| namespace | {{ .Labels.namespace }} | The namespace where the resource that triggered the alert is located | +| node | {{ .Labels.node }} | The node where the resource that triggered the alert is located | +| targetType | {{ .Labels.target_type }} | The resource type of the alert target | +| target | {{ .Labels.target }} | The name of the object that triggered the alert | +| value | {{ .Annotations.value }} | The metric value at the time the alert notification was triggered | +| startsAt | {{ .StartsAt }} | The time when the alert started to occur | +| endsAt | {{ .EndsAt }} | The time when the alert ended | +| description | {{ .Annotations.description }} | A detailed description of the alert | +| labels | {{ for .labels }} {{ end }} | All labels of the alert use the `for` function to iterate through the labels list to get all label contents. | + +## Editing or Deleting a Message Template + +Click __┇__ on the right side of the list and select __Edit__ or __Delete__ from the pop-up menu to modify or delete the message template. + +!!! warning + + Once a template is deleted, it cannot be recovered, so please use caution when deleting templates. diff --git a/docs/en/docs/end-user/insight/alert-center/silent.md b/docs/en/docs/end-user/insight/alert-center/silent.md new file mode 100644 index 0000000000..59141de568 --- /dev/null +++ b/docs/en/docs/end-user/insight/alert-center/silent.md @@ -0,0 +1,30 @@ +--- +MTPE: ModetaNiu +Date: 2024-07-01 +--- + +# Alert Silence + +Alert silence is a feature that allows alerts meeting certain criteria to be temporarily disabled from +sending notifications within a specific time range. This feature helps operations personnel avoid receiving +too many noisy alerts during certain operations or events, while also allowing for more precise handling of real issues +that need to be addressed. + +On the Alert Silence page, you can see two tabs: Active Rule and Expired Rule. The former presents the rules currently in effect, +while the latter presents those that were defined in the past but have now expired (or have been deleted by the user). + +## Creating a Silent Rule + +1. In the left navigation bar, select __Alert__ -> __Noice Reduction__ -> __Alert Silence__ , and click the __Create Silence Rule__ button. + + ![click button](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/silent01.png) + +2. 
Fill in the parameters for the silent rule, such as cluster, namespace, tags, and time, to define the scope + and effective time of the rule, and then click __OK__ . + + ![silent rule](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/silent02.png) + +3. Return to the rule list, and on the right side of the list, click __┇__ to edit or delete a silent rule. + +Through the Alert Silence feature, you can flexibly control which alerts should be ignored and when they should be effective, +thereby improving operational efficiency and reducing the possibility of false alerts. diff --git a/docs/en/docs/end-user/insight/alert-center/sms-provider.md b/docs/en/docs/end-user/insight/alert-center/sms-provider.md new file mode 100644 index 0000000000..48fe7ffc7a --- /dev/null +++ b/docs/en/docs/end-user/insight/alert-center/sms-provider.md @@ -0,0 +1,52 @@ +# Configure Notification Server + +Insight supports SMS notifications and currently sends alert messages using integrated Alibaba Cloud and Tencent Cloud SMS services. This article explains how to configure the SMS notification server in Insight. The variables supported in the SMS signature are the default variables in the message template. As the number of SMS characters is limited, it is recommended to choose more explicit variables. + +> For information on how to configure SMS recipients, refer to the document: [Configure SMS Notification Group](../../alert-center/message.md). + +## Procedure + +1. Go to __Alert Center__ -> __Notification Settings__ -> __Notification Server__ . + + ![Notification Server](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/sms01.png) + +2. Click __Add Notification Server__ . + + - Configure Alibaba Cloud server. + + > To apply for Alibaba Cloud SMS service, please refer to [Alibaba Cloud SMS Service](https://help.aliyun.com/document_detail/108062.html?spm=a2c4g.57535.0.0.2cec637ffna8ye). + + Field descriptions: + + - __AccessKey ID__ : Parameter used by Alibaba Cloud to identify the user. + - __AccessKey Secret__ : Key used by Alibaba Cloud to authenticate the user. AccessKey Secret must be kept confidential. + - __SMS Signature__ : The SMS service supports creating signatures that meet the requirements according to user needs. When sending SMS, the SMS platform will add the approved SMS signature to the SMS content before sending it to the SMS recipient. + - __Template CODE__ : The SMS template is the specific content of the SMS to be sent. + - __Parameter Template__ : The SMS body template can contain variables. Users can use variables to customize the SMS content. + + Please refer to [Alibaba Cloud Variable Specification](https://help.aliyun.com/document_detail/463270.html). + + ![Notification Server](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/sms02.png) + + !!! note + + Example: The template content defined in Alibaba Cloud is: ${severity}: ${alertname} triggered at ${startat}. Refer to the configuration in the parameter template. + + - Configure Tencent Cloud server. + + > To apply for Tencent Cloud SMS service, please refer to [Tencent Cloud SMS](https://cloud.tencent.com/document/product/382/37794). + + Field descriptions: + + - __Secret ID__ : Parameter used by Tencent Cloud to identify the API caller. + - __SecretKey__ : Parameter used by Tencent Cloud to authenticate the API caller. + - __SMS Template ID__ : The SMS template ID automatically generated by Tencent Cloud system. 
+ - __Signature Content__ : The SMS signature content, which is the full name or abbreviation of the actual website name defined in the Tencent Cloud SMS signature. + - __SdkAppId__ : SMS SdkAppId, the actual SdkAppId generated after adding the application in the Tencent Cloud SMS console. + - __Parameter Template__ : The SMS body template can contain variables. Users can use variables to customize the SMS content. Please refer to: [Tencent Cloud Variable Specification](https://cloud.tencent.com/document/product/382/39023#.E5.8F.98.E9.87.8F.E8.A7.84.E8.8C.83.3Ca-id.3D.22variable.22.3E.3C.2Fa.3E). + + ![Notification Server](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/sms03.png) + + !!! note + + Example: The template content defined in Tencent Cloud is: {1}: {2} triggered at {3}. Refer to the configuration in the parameter template. diff --git a/docs/en/docs/end-user/insight/collection-manag/agent-status.md b/docs/en/docs/end-user/insight/collection-manag/agent-status.md new file mode 100644 index 0000000000..a213cceeef --- /dev/null +++ b/docs/en/docs/end-user/insight/collection-manag/agent-status.md @@ -0,0 +1,49 @@ +# insight-agent Component Status Explanation + +In AI platform, Insight acts as a multi-cluster observability product. +To achieve unified data collection across multiple clusters, users need to install +the Helm application __insight-agent__ (installed by default in the __insight-system__ namespace). +Refer to [How to Install __insight-agent__ ](../../quickstart/install/install-agent.md). + +## Status Explanation + +In the "Observability" -> "Collection Management" section, you can view the installation status +of __insight-agent__ in each cluster. + +- __Not Installed__ : __insight-agent__ is not installed in the __insight-system__ namespace of the cluster. +- __Running__ : __insight-agent__ is successfully installed in the cluster, and all deployed components are running. +- __Error__ : If __insight-agent__ is in this state, it indicates that the helm deployment failed or + there are components deployed that are not in a running state. + +You can troubleshoot using the following steps: + +1. Run the following command. If the status is __deployed__ , proceed to the next step. + If it is __failed__ , it is recommended to uninstall and reinstall it from + __Container Management__ -> __Helm Apps__ as it may affect application upgrades: + + ```bash + helm list -n insight-system + ``` + +2. Run the following command or check the status of the deployed components in + __Insight__ -> __Data Collection__ . If there are Pods not in the __Running__ state, + restart the containers in an abnormal state. + + ```bash + kubectl get pods -n insight-system + ``` + +## Additional Notes + +1. The resource consumption of the Prometheus metric collection component in __insight-agent__ + is directly proportional to the number of Pods running in the cluster. + Please adjust the resources for Prometheus according to the cluster size. + Refer to [Prometheus Resource Planning](../../quickstart/res-plan/prometheus-res.md). + +2. The storage capacity of the vmstorage metric storage component in the global service cluster + is directly proportional to the total number of Pods in the clusters. + + - Please contact the platform administrator to adjust the disk capacity of vmstorage + based on the cluster size. Refer to [vmstorage Disk Capacity Planning](../../quickstart/res-plan/vms-res-plan.md). + - Adjust vmstorage disk based on multi-cluster scale. 
+ Refer to [vmstorge Disk Expansion](../../quickstart/res-plan/modify-vms-disk.md). diff --git a/docs/en/docs/end-user/insight/collection-manag/collection-manag.md b/docs/en/docs/end-user/insight/collection-manag/collection-manag.md new file mode 100644 index 0000000000..e2c0e2ef1d --- /dev/null +++ b/docs/en/docs/end-user/insight/collection-manag/collection-manag.md @@ -0,0 +1,31 @@ +--- +hide: + - toc +--- + +# Data Collection + + __Data Collection__ is mainly to centrally manage and display the entrance of the +cluster installation collection plug-in __insight-agent__ , which helps users quickly +view the health status of the cluster collection plug-in, and provides a quick entry +to configure collection rules. + +The specific operation steps are as follows: + +1. Click in the upper left corner and select __Insight__ -> __Data Collection__ . + + ![Data Collection](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/collectmanage01.png) + +2. You can view the status of all cluster collection plug-ins. + + ![Data Collection](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/collectmanage02.png) + +3. When the cluster is connected to __insight-agent__ and is running, click a cluster name + to enter the details。 + + ![Data Collection](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/collectmanage03.png) + +4. In the __Service Monitor__ tab, click the shortcut link to jump to __Container Management__ -> __CRD__ + to add service discovery rules. + + ![Data Collection](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/collectmanage04.png) diff --git a/docs/en/docs/end-user/insight/collection-manag/metric-collect.md b/docs/en/docs/end-user/insight/collection-manag/metric-collect.md new file mode 100644 index 0000000000..ab504aaf0f --- /dev/null +++ b/docs/en/docs/end-user/insight/collection-manag/metric-collect.md @@ -0,0 +1,342 @@ +# Metrics Retrieval Methods + +Prometheus primarily uses the Pull approach to retrieve monitoring metrics from target services' exposed endpoints. Therefore, it requires configuring corresponding scraping jobs to request monitoring data and write it into the storage provided by Prometheus. Currently, Prometheus offers several configurations for these jobs: + +- Native Job Configuration: This provides native Prometheus job configuration for scraping. +- Pod Monitor: In the Kubernetes ecosystem, it allows scraping of monitoring data from Pods using Prometheus Operator. +- Service Monitor: In the Kubernetes ecosystem, it allows scraping monitoring data from Endpoints of Services using Prometheus Operator. + +!!! note + + `[ ]` indicates optional configmaps. + +## Native Job Configuration + +The corresponding configmaps are explained as follows: + +```yaml +# Name of the scraping job, also adds a label (job=job_name) to the scraped metrics +job_name: + +# Time interval between scrapes +[ scrape_interval: | default = ] + +# Timeout for scrape requests +[ scrape_timeout: | default = ] + +# URI path for the scrape request +[ metrics_path: | default = /metrics ] + +# Handling of label conflicts between scraped labels and labels added by the backend Prometheus. +# true: Retains the scraped labels and ignores conflicting labels from the backend Prometheus. +# false: Adds an "exported_" prefix to the scraped labels and includes the additional labels added by the backend Prometheus. 
+[ honor_labels: | default = false ] + +# Whether to use the timestamp generated by the target being scraped. +# true: Uses the timestamp from the target if available. +# false: Ignores the timestamp from the target. +[ honor_timestamps: | default = true ] + +# Protocol for the scrape request: http or https +[ scheme: | default = http ] + +# URL parameters for the scrape request +params: + [ : [, ...] ] + +# Set the value of the `Authorization` header in the scrape request through basic authentication. password/password_file are mutually exclusive, with password_file taking precedence. +basic_auth: + [ username: ] + [ password: ] + [ password_file: ] + +# Set the value of the `Authorization` header in the scrape request through bearer token authentication. bearer_token/bearer_token_file are mutually exclusive, with bearer_token taking precedence. +[ bearer_token: ] + +# Set the value of the `Authorization` header in the scrape request through bearer token authentication. bearer_token/bearer_token_file are mutually exclusive, with bearer_token taking precedence. +[ bearer_token_file: ] + +# Whether the scrape connection should use a TLS secure channel, configure the corresponding TLS parameters +tls_config: + [ ] + +# Use a proxy service to scrape the metrics from the target, specify the address of the proxy service. +[ proxy_url: ] + +# Specify the targets using static configuration, see explanation below. +static_configs: + [ - ... ] + +# CVM service discovery configuration, see explanation below. +cvm_sd_configs: + [ - ... ] + +# After scraping the data, rewrite the labels of the corresponding target using the relabel mechanism. Executes multiple relabel rules in order. +# See explanation below for relabel_config. +relabel_configs: + [ - ... ] + +# Before writing the scraped data, rewrite the values of the labels using the relabel mechanism. Executes multiple relabel rules in order. +# See explanation below for relabel_config. +metric_relabel_configs: + [ - ... ] + +# Limit the number of data points per scrape, 0: no limit, default is 0 +[ sample_limit: | default = 0 ] + +# Limit the number of targets per scrape, 0: no limit, default is 0 +[ target_limit: | default = 0 ] +``` + +## Pod Monitor + +The explanation for the corresponding configmaps is as follows: + +```yaml +# Prometheus Operator CRD version +apiVersion: monitoring.coreos.com/v1 +# Corresponding Kubernetes resource type, here it is PodMonitor +kind: PodMonitor +# Corresponding Kubernetes Metadata, only the name needs to be concerned. If jobLabel is not specified, the value of the job label in the scraped metrics will be / +metadata: + name: redis-exporter # Specify a unique name + namespace: cm-prometheus # Fixed namespace, no need to modify +# Describes the selection and configuration of the target Pods to be scraped + labels: + operator.insight.io/managed-by: insight # Label indicating managed by Insight +spec: + # Specify the label of the corresponding Pod, pod monitor will use this value as the job label value. + # If viewing the Pod YAML, use the values in pod.metadata.labels. + # If viewing Deployment/Daemonset/Statefulset, use spec.template.metadata.labels. 
+ [ jobLabel: string ] + # Adds the corresponding Pod's Labels to the Target's Labels + [ podTargetLabels: []string ] + # Limit the number of data points per scrape, 0: no limit, default is 0 + [ sampleLimit: uint64 ] + # Limit the number of targets per scrape, 0: no limit, default is 0 + [ targetLimit: uint64 ] + # Configure the Prometheus HTTP endpoints that need to be scraped and exposed. Multiple endpoints can be configured. + podMetricsEndpoints: + [ - ... ] # See explanation below for endpoint + # Select the namespaces where the monitored Pods are located. Leave it blank to select all namespaces. + [ namespaceSelector: ] + # Select all namespaces + [ any: bool ] + # Specify the list of namespaces to be selected + [ matchNames: []string ] + # Specify the Label values of the Pods to be monitored in order to locate the target Pods [K8S metav1.LabelSelector](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#labelselector-v1-meta) + selector: + [ matchExpressions: array ] + [ example: - {key: tier, operator: In, values: [cache]} ] + [ matchLabels: object ] + [ example: k8s-app: redis-exporter ] +``` + +### Example 1 + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: PodMonitor +metadata: + name: redis-exporter # Specify a unique name + namespace: cm-prometheus # Fixed namespace, do not modify + labels: + operator.insight.io/managed-by: insight # Label indicating managed by Insight, required. +spec: + podMetricsEndpoints: + - interval: 30s + port: metric-port # Specify the Port Name corresponding to Prometheus Exporter in the pod YAML + path: /metrics # Specify the value of the Path corresponding to Prometheus Exporter, if not specified, default is /metrics + relabelings: + - action: replace + sourceLabels: + - instance + regex: (.*) + targetLabel: instance + replacement: "crs-xxxxxx" # Adjust to the corresponding Redis instance ID + - action: replace + sourceLabels: + - instance + regex: (.*) + targetLabel: ip + replacement: "1.x.x.x" # Adjust to the corresponding Redis instance IP + namespaceSelector: # Select the namespaces where the monitored Pods are located + matchNames: + - redis-test + selector: # Specify the Label values of the Pods to be monitored in order to locate the target pods + matchLabels: + k8s-app: redis-exporter +``` + +### Example 2 + +```yaml +job_name: prometheus +scrape_interval: 30s +static_configs: +- targets: + - 127.0.0.1:9090 +``` + +## Service Monitor + +The explanation for the corresponding configmaps is as follows: + +```yaml +# Prometheus Operator CRD version +apiVersion: monitoring.coreos.com/v1 +# Corresponding Kubernetes resource type, here it is ServiceMonitor +kind: ServiceMonitor +# Corresponding Kubernetes Metadata, only the name needs to be concerned. If jobLabel is not specified, the value of the job label in the scraped metrics will be the name of the Service. +metadata: + name: redis-exporter # Specify a unique name + namespace: cm-prometheus # Fixed namespace, no need to modify +# Describes the selection and configuration of the target Pods to be scraped + labels: + operator.insight.io/managed-by: insight # Label indicating managed by Insight, required. +spec: + # Specify the label(metadata/labels) of the corresponding Pod, service monitor will use this value as the job label value. 
+ [ jobLabel: string ] + # Adds the Labels of the corresponding service to the Target's Labels + [ targetLabels: []string ] + # Adds the Labels of the corresponding Pod to the Target's Labels + [ podTargetLabels: []string ] + # Limit the number of data points per scrape, 0: no limit, default is 0 + [ sampleLimit: uint64 ] + # Limit the number of targets per scrape, 0: no limit, default is 0 + [ targetLimit: uint64 ] + # Configure the Prometheus HTTP endpoints that need to be scraped and exposed. Multiple endpoints can be configured. + endpoints: + [ - ... ] # See explanation below for endpoint + # Select the namespaces where the monitored Pods are located. Leave it blank to select all namespaces. + [ namespaceSelector: ] + # Select all namespaces + [ any: bool ] + # Specify the list of namespaces to be selected + [ matchNames: []string ] + # Specify the Label values of the Pods to be monitored in order to locate the target Pods [K8S metav1.LabelSelector](https://v1-17.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#labelselector-v1-meta) + selector: + [ matchExpressions: array ] + [ example: - {key: tier, operator: In, values: [cache]} ] + [ matchLabels: object ] + [ example: k8s-app: redis-exporter ] +``` + +### Example + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: go-demo # Specify a unique name + namespace: cm-prometheus # Fixed namespace, do not modify + labels: + operator.insight.io/managed-by: insight # Label indicating managed by Insight, required. +spec: + endpoints: + - interval: 30s + # Specify the Port Name corresponding to Prometheus Exporter in the service YAML + port: 8080-8080-tcp + # Specify the value of the Path corresponding to Prometheus Exporter, if not specified, default is /metrics + path: /metrics + relabelings: + # ** There must be a label named 'application', assuming there is a label named 'app' in k8s, + # we replace it with 'application' using the relabel 'replace' action + - action: replace + sourceLabels: [__meta_kubernetes_pod_label_app] + targetLabel: application + # Select the namespace where the monitored service is located + namespaceSelector: + matchNames: + - golang-demo + # Specify the Label values of the service to be monitored in order to locate the target service + selector: + matchLabels: + app: golang-app-demo +``` + +### endpoint_config + +The explanation for the corresponding configmaps is as follows: + +```yaml +# The name of the corresponding port. Please note that it's not the actual port number. +# Default: 80. Possible values are as follows: +# ServiceMonitor: corresponds to Service>spec/ports/name; +# PodMonitor: explained as follows: +# If viewing the Pod YAML, take the value from pod.spec.containers.ports.name. +# If viewing Deployment/DaemonSet/StatefulSet, take the value from spec.template.spec.containers.ports.name. +[ port: string | default = 80] +# The URI path for the scrape request. +[ path: string | default = /metrics ] +# The protocol for the scrape: http or https. +[ scheme: string | default = http] +# URL parameters for the scrape request. +[ params: map[string][]string] +# The interval between scrape requests. +[ interval: string | default = 30s ] +# The timeout for the scrape request. +[ scrapeTimeout: string | default = 30s] +# Whether the scrape connection should be made over a secure TLS channel, and the TLS configuration. +[ tlsConfig: TLSConfig ] +# Read the bearer token value from the specified file and include it in the headers of the scrape request. 
+[ bearerTokenFile: string ] +# Read the bearer token from the specified K8S secret key. Note that the secret namespace must match the PodMonitor/ServiceMonitor. +[ bearerTokenSecret: string ] +# Handling conflicts when scraped labels conflict with labels added by the backend Prometheus. +# true: Keep the scraped labels and ignore the conflicting labels from the backend Prometheus. +# false: For conflicting labels, prefix the scraped label with 'exported_' and add the labels added by the backend Prometheus. +[ honorLabels: bool | default = false ] +# Whether to use the timestamp generated on the target during the scrape. +# true: Use the timestamp on the target if available. +# false: Ignore the timestamp on the target. +[ honorTimestamps: bool | default = true ] +# Basic authentication credentials. Fill in the values of username/password from the corresponding K8S secret key. Note that the secret namespace must match the PodMonitor/ServiceMonitor. +[ basicAuth: BasicAuth ] +# Scrape the metrics from the target through a proxy server. Specify the address of the proxy server. +[ proxyUrl: string ] +# After scraping the data, rewrite the values of the labels on the target using the relabeling mechanism. Multiple relabel rules are executed in order. +# See explanation below for relabel_config +relabelings: +[ - ...] +# Before writing the scraped data, rewrite the values of the corresponding labels on the target using the relabeling mechanism. Multiple relabel rules are executed in order. +# See explanation below for relabel_config +metricRelabelings: +[ - ...] +``` + +### relabel_config + +The explanation for the corresponding configmaps is as follows: + +```yaml +# Specifies which labels to take from the original labels for relabeling. The values taken are concatenated using the separator defined in the configuration. +# For PodMonitor/ServiceMonitor, the corresponding configmap is sourceLabels. +[ source_labels: '[' [, ...] ']' ] +# Defines the character used to concatenate the values of the labels to be relabeled. Default is ';'. +[ separator: | default = ; ] + +# When the action is replace/hashmod, target_label is used to specify the corresponding label name. +# For PodMonitor/ServiceMonitor, the corresponding configmap is targetLabel. +[ target_label: ] + +# Regular expression used to match the values of the source labels. +[ regex: | default = (.*) ] + +# Used when action is hashmod, it takes the modulus value based on the MD5 hash of the source label's value. +[ modulus: ] + +# Used when action is replace, it defines the expression to replace when the regex matches. It can use regular expression replacement with regex. +[ replacement: | default = $1 ] + +# Actions performed based on the matched values of regex. The available actions are as follows, with replace being the default: +# replace: If the regex matches, replace the corresponding value with the value defined in replacement. Set the value using target_label and add the corresponding label. +# keep: If the regex doesn't match, discard the value. +# drop: If the regex matches, discard the value. +# hashmod: Take the modulus of the MD5 hash of the source label's value based on the value specified in modulus. +# Add a new label with a label name specified by target_label. +# labelmap: If the regex matches, replace the corresponding label name with the value specified in replacement. +# labeldrop: If the regex matches, delete the corresponding label. +# labelkeep: If the regex doesn't match, delete the corresponding label. 
+[ action: | default = replace ] +``` diff --git a/docs/en/docs/end-user/insight/collection-manag/probe-module.md b/docs/en/docs/end-user/insight/collection-manag/probe-module.md new file mode 100644 index 0000000000..56388c42e4 --- /dev/null +++ b/docs/en/docs/end-user/insight/collection-manag/probe-module.md @@ -0,0 +1,313 @@ +--- +MTPE: WANG0608GitHub +Date: 2024-09-23 +--- + +# Custom probers + +Insight uses the Blackbox Exporter provided by Prometheus as a blackbox monitoring solution, allowing detection of target instances via HTTP, HTTPS, DNS, ICMP, TCP, and gRPC. It can be used in the following scenarios: + +- HTTP/HTTPS: URL/API availability monitoring +- ICMP: Host availability monitoring +- TCP: Port availability monitoring +- DNS: Domain name resolution + +In this page, we will explain how to configure custom probers in an existing Blackbox ConfigMap. + +ICMP prober is not enabled by default in Insight because it requires higher permissions. Therfore We will use the HTTP prober as an example to demonstrate how to modify the ConfigMap to achieve custom HTTP probing. + +## Procedure + +1. Go to __Clusters__ in __Container Management__ and enter the details of the target cluster. +2. Click the left navigation bar and select __ConfigMaps & Secrets__ -> __ConfigMaps__ . +3. Find the ConfigMap named __insight-agent-prometheus-blackbox-exporter__ and click __Edit YAML__ . + + Add custom probers under __modules__ : + +=== "HTTP Prober" + ```yaml + module: + http_2xx: + prober: http + timeout: 5s + http: + valid_http_versions: [HTTP/1.1, HTTP/2] + valid_status_codes: [] # Defaults to 2xx + method: GET + ``` + +=== "ICMP Prober" + + ```yaml + module: + ICMP: # Example of ICMP prober configuration + prober: icmp + timeout: 5s + icmp: + preferred_ip_protocol: ip4 + icmp_example: # Example 2 of ICMP prober configuration + prober: icmp + timeout: 5s + icmp: + preferred_ip_protocol: "ip4" + source_ip_address: "127.0.0.1" + ``` + Since ICMP requires higher permissions, we also need to elevate the pod permissions. Otherwise, an `operation not permitted` error will occur. There are two ways to elevate permissions: + + - Directly edit the `BlackBox Exporter` deployment file to enable it + + ```yaml + apiVersion: apps/v1 + kind: Deployment + metadata: + name: insight-agent-prometheus-blackbox-exporter + namespace: insight-system + spec: + template: + spec: + containers: + - name: blackbox-exporter + image: # ... (image, args, ports, etc. remain unchanged) + imagePullPolicy: IfNotPresent + securityContext: + allowPrivilegeEscalation: false + capabilities: + add: + - NET_RAW + drop: + - ALL + readOnlyRootFilesystem: true + runAsGroup: 0 + runAsNonRoot: false + runAsUser: 0 + ``` + + - Elevate permissions via `helm upgrade` + + ```diff + prometheus-blackbox-exporter: + enabled: true + securityContext: + runAsUser: 0 + runAsGroup: 0 + readOnlyRootFilesystem: true + runAsNonRoot: false + allowPrivilegeEscalation: false + capabilities: + add: ["NET_RAW"] + ``` + +!!! info + + For more probers, refer to [blackbox_exporter Configuration](https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md). + +## Other References + +The following YAML file contains various probers such as HTTP, TCP, SMTP, ICMP, and DNS. You can modify the configuration file of `insight-agent-prometheus-blackbox-exporter` according to your needs. + +??? 
note "Click to view the complete YAML file" + + ```yaml + kind: ConfigMap + apiVersion: v1 + metadata: + name: insight-agent-prometheus-blackbox-exporter + namespace: insight-system + labels: + app.kubernetes.io/instance: insight-agent + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: prometheus-blackbox-exporter + app.kubernetes.io/version: v0.24.0 + helm.sh/chart: prometheus-blackbox-exporter-8.8.0 + annotations: + meta.helm.sh/release-name: insight-agent + meta.helm.sh/release-namespace: insight-system + data: + blackbox.yaml: | + modules: + HTTP_GET: + prober: http + timeout: 5s + http: + method: GET + valid_http_versions: ["HTTP/1.1", "HTTP/2.0"] + follow_redirects: true + preferred_ip_protocol: "ip4" + HTTP_POST: + prober: http + timeout: 5s + http: + method: POST + body_size_limit: 1MB + TCP: + prober: tcp + timeout: 5s + # Not enabled by default: + # ICMP: + # prober: icmp + # timeout: 5s + # icmp: + # preferred_ip_protocol: ip4 + SSH: + prober: tcp + timeout: 5s + tcp: + query_response: + - expect: "^SSH-2.0-" + POP3S: + prober: tcp + tcp: + query_response: + - expect: "^+OK" + tls: true + tls_config: + insecure_skip_verify: false + http_2xx_example: # http prober example + prober: http + timeout: 5s # probe timeout + http: + valid_http_versions: ["HTTP/1.1", "HTTP/2.0"] # Version in the response, usually default + valid_status_codes: [] # Defaults to 2xx # Valid range of response codes, probe successful if within this range + method: GET # request method + headers: # request headers + Host: vhost.example.com + Accept-Language: en-US + Origin: example.com + no_follow_redirects: false # allow redirects + fail_if_ssl: false + fail_if_not_ssl: false + fail_if_body_matches_regexp: + - "Could not connect to database" + fail_if_body_not_matches_regexp: + - "Download the latest version here" + fail_if_header_matches: # Verifies that no cookies are set + - header: Set-Cookie + allow_missing: true + regexp: '.*' + fail_if_header_not_matches: + - header: Access-Control-Allow-Origin + regexp: '(\*|example\.com)' + tls_config: # tls configuration for https requests + insecure_skip_verify: false + preferred_ip_protocol: "ip4" # defaults to "ip6" # Preferred IP protocol version + ip_protocol_fallback: false # no fallback to "ip6" + http_post_2xx: # http prober example with body + prober: http + timeout: 5s + http: + method: POST # probe request method + headers: + Content-Type: application/json + body: '{"username":"admin","password":"123456"}' # body carried during probe + http_basic_auth_example: # prober example with username and password + prober: http + timeout: 5s + http: + method: POST + headers: + Host: "login.example.com" + basic_auth: # username and password to be added during probe + username: "username" + password: "mysecret" + http_custom_ca_example: + prober: http + http: + method: GET + tls_config: # root certificate used during probe + ca_file: "/certs/my_cert.crt" + http_gzip: + prober: http + http: + method: GET + compression: gzip # compression method used during probe + http_gzip_with_accept_encoding: + prober: http + http: + method: GET + compression: gzip + headers: + Accept-Encoding: gzip + tls_connect: # TCP prober example + prober: tcp + timeout: 5s + tcp: + tls: true # use TLS + tcp_connect_example: + prober: tcp + timeout: 5s + imap_starttls: # IMAP email server probe configuration example + prober: tcp + timeout: 5s + tcp: + query_response: + - expect: "OK.*STARTTLS" + - send: ". STARTTLS" + - expect: "OK" + - starttls: true + - send: ". 
capability" + - expect: "CAPABILITY IMAP4rev1" + smtp_starttls: # SMTP email server probe configuration example + prober: tcp + timeout: 5s + tcp: + query_response: + - expect: "^220 ([^ ]+) ESMTP (.+)$" + - send: "EHLO prober\r" + - expect: "^250-STARTTLS" + - send: "STARTTLS\r" + - expect: "^220" + - starttls: true + - send: "EHLO prober\r" + - expect: "^250-AUTH" + - send: "QUIT\r" + irc_banner_example: + prober: tcp + timeout: 5s + tcp: + query_response: + - send: "NICK prober" + - send: "USER prober prober prober :prober" + - expect: "PING :([^ ]+)" + send: "PONG ${1}" + - expect: "^:[^ ]+ 001" + # icmp_example: # ICMP prober configuration example + # prober: icmp + # timeout: 5s + # icmp: + # preferred_ip_protocol: "ip4" + # source_ip_address: "127.0.0.1" + dns_udp_example: # DNS query example using UDP + prober: dns + timeout: 5s + dns: + query_name: "www.prometheus.io" # domain name to resolve + query_type: "A" # type corresponding to this domain + valid_rcodes: + - NOERROR + validate_answer_rrs: + fail_if_matches_regexp: + - ".*127.0.0.1" + fail_if_all_match_regexp: + - ".*127.0.0.1" + fail_if_not_matches_regexp: + - "www.prometheus.io.\t300\tIN\tA\t127.0.0.1" + fail_if_none_matches_regexp: + - "127.0.0.1" + validate_authority_rrs: + fail_if_matches_regexp: + - ".*127.0.0.1" + validate_additional_rrs: + fail_if_matches_regexp: + - ".*127.0.0.1" + dns_soa: + prober: dns + dns: + query_name: "prometheus.io" + query_type: "SOA" + dns_tcp_example: # DNS query example using TCP + prober: dns + dns: + transport_protocol: "tcp" # defaults to "udp" + preferred_ip_protocol: "ip4" # defaults to "ip6" + query_name: "www.prometheus.io" + ``` diff --git a/docs/en/docs/end-user/insight/collection-manag/service-monitor.md b/docs/en/docs/end-user/insight/collection-manag/service-monitor.md new file mode 100644 index 0000000000..c3e65b1e3f --- /dev/null +++ b/docs/en/docs/end-user/insight/collection-manag/service-monitor.md @@ -0,0 +1,74 @@ +# Configure service discovery rules + +Observable Insight supports the way of creating CRD ServiceMonitor through __container management__ to meet your collection requirements for custom service discovery. +Users can use ServiceMonitor to define the scope of the Namespace discovered by the Pod and select the monitored Service through __matchLabel__ . + +## Prerequisites + +The cluster has the Helm application __insight-agent__ installed and in the __running__ state. + +## Steps + +1. Select __Data Collection__ on the left navigation bar to view the status of all cluster collection plug-ins. + + ![Data Collection](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/collectmanage01.png) + +2. Click a cluster name to enter the collection configuration details. + + ![Data Collection](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/collectmanage02.png) + +3. Click the link to jump to __Container Management__ to create a Service Monitor. + + ```yaml + apiVersion: monitoring.coreos.com/v1 + kind: ServiceMonitor + metadata: + name: micrometer-demo # (1) + namespace: insight-system # (2) + labels: + operator.insight.io/managed-by: insight + spec: + endpoints: # (3) + - honorLabels: true + interval: 15s + path: /actuator/prometheus + port: http + namespaceSelector: # (4) + matchNames: + - insight-system # (5) + selector: # (6) + matchLabels: + micrometer-prometheus-discovery: "true" + ``` + + 1. Specify the name of the ServiceMonitor. + 2. Specify the namespace of the ServiceMonitor. + 3. 
This is the service endpoint, which represents the address where Prometheus collects Metrics. + __endpoints__ is an array, and multiple __endpoints__ can be created at the same time. + Each __endpoint__ contains three fields, and the meaning of each field is as follows: + + - __interval__ : Specifies the collection cycle of Prometheus for the current __endpoint__ . + The unit is seconds, set to __15s__ in this example. + - __path__ : Specifies the collection path of Prometheus. + In this example, it is specified as __/actuator/prometheus__ . + - __port__ : Specifies the port through which the collected data needs to pass. + The set port is the __name__ set by the port of the Service being collected. + + 4. This is the scope of the Service that needs to be discovered. + __namespaceSelector__ contains two mutually exclusive fields, and the meaning of the fields is as follows: + + - __any__ : Only one value __true__ , when this field is set, it will listen to changes + of all Services that meet the Selector filtering conditions. + - __matchNames__ : An array value that specifies the scope of __namespace__ to be monitored. + For example, if you only want to monitor the Services in two namespaces, default and + insight-system, the __matchNames__ are set as follows: + + ```yaml + namespaceSelector: + matchNames: + - default + - insight-system + ``` + + 5. The namespace where the application that needs to expose metrics is located + 5. Used to select the Service diff --git a/docs/en/docs/end-user/insight/dashboard/dashboard.md b/docs/en/docs/end-user/insight/dashboard/dashboard.md new file mode 100644 index 0000000000..9196316f23 --- /dev/null +++ b/docs/en/docs/end-user/insight/dashboard/dashboard.md @@ -0,0 +1,37 @@ +--- +hide: + - toc +--- + +# Dashboard + +Grafana is a cross-platform open source visual analysis tool. Insight uses open source Grafana +to provide monitoring services, and supports viewing resource consumption from multiple dimensions +such as clusters, nodes, and namespaces. + +For more information on open source Grafana, see +[Grafana Official Documentation](https://grafana.com/docs/grafana/latest/getting-started/?spm=a2c4g.11186623.0.0.1f34de53ksAH9a). + +## Steps + +1. Select __Dashboard__ from the left navigation bar . + + - In the __Insight / Overview__ dashboard, you can view the resource usage of multiple clusters and analyze resource usage, network, storage, and more based on dimensions such as namespaces and Pods. + + - Click the dropdown menu in the upper-left corner of the dashboard to switch between clusters. + + - Click the lower-right corner of the dashboard to switch the time range for queries. + + ![Dashboard](../images/dashboard00.png) + +2. Insight provides several recommended dashboards that allow monitoring from different dimensions + such as nodes, namespaces, and workloads. Switch between dashboards by clicking the + __insight-system / Insight / Overview__ section. + + ![Overview](../images/dashboard01.png) + +!!! note + + 1. For accessing Grafana UI, refer to [Access Native Grafana](../../dashboard/login-grafana.md). + + 2. For importing custom dashboards, refer to [Importing Custom Dashboards](./import-dashboard.md). 
diff --git a/docs/en/docs/end-user/insight/dashboard/import-dashboard.md b/docs/en/docs/end-user/insight/dashboard/import-dashboard.md new file mode 100644 index 0000000000..19a1e04fa5 --- /dev/null +++ b/docs/en/docs/end-user/insight/dashboard/import-dashboard.md @@ -0,0 +1,65 @@ +# Import Custom Dashboards + +By using Grafana CRD, you can incorporate the management and deployment of dashboards into the lifecycle management of Kubernetes. This enables version control, automated deployment, and cluster-level management of dashboards. This page describes how to import custom dashboards using CRD and the UI interface. + +## Steps + +1. Log in to the AI platform platform and go to __Container Management__ . Select the __kpanda-global-cluster__ from the cluster list. + +2. Choose __Custom Resources__ from the left navigation bar. Look for the __grafanadashboards.integreatly.org__ + file in the list and click it to view the details. + +3. Click __YAML Create__ and use the following template. Replace the dashboard JSON in the __Json__ field. + + - __namespace__ : Specify the target namespace. + - __name__ : Provide a name for the dashboard. + - __label__ : Mandatory. Set the label as __operator.insight.io/managed-by: insight__ . + + ```yaml + apiVersion: integreatly.org/v1alpha1 + kind: GrafanaDashboard + metadata: + labels: + app: insight-grafana-operator + operator.insight.io/managed-by: insight + name: sample-dashboard + namespace: insight-system + spec: + json: > + { + "id": null, + "title": "Simple Dashboard", + "tags": [], + "style": "dark", + "timezone": "browser", + "editable": true, + "hideControls": false, + "graphTooltip": 1, + "panels": [], + "time": { + "from": "now-6h", + "to": "now" + }, + "timepicker": { + "time_options": [], + "refresh_intervals": [] + }, + "templating": { + "list": [] + }, + "annotations": { + "list": [] + }, + "refresh": "5s", + "schemaVersion": 17, + "version": 0, + "links": [] + } + ``` + +4. After clicking __OK__ , wait for a while to view the newly imported dashboard in __Dashboard__ . + +!!! info + + If you need to customize the dashboard, refer to + [Add Dashboard Panel](https://grafana.com/docs/grafana/latest/dashboards/add-organize-panels/). diff --git a/docs/en/docs/end-user/insight/dashboard/login-grafana.md b/docs/en/docs/end-user/insight/dashboard/login-grafana.md new file mode 100644 index 0000000000..47de1b18d7 --- /dev/null +++ b/docs/en/docs/end-user/insight/dashboard/login-grafana.md @@ -0,0 +1,24 @@ +--- +hide: + - toc +--- + +# Access Native Grafana + +Please make sure that the Helm application __Insight__ in your global management cluster is in __Running__ state. + +The specific operation steps are as follows: + +1. Log in to the console to access native Grafana. + + Access address: `http://ip:port/ui/insight-grafana` + + For example: `http://10.6.10.233:30209/ui/insight-grafana` + +2. Click Login in the lower right corner, and use the default username and password to log in. + + - Default username: admin + + - Default password: admin + +3. Click __Log in__ to complete the login. diff --git a/docs/en/docs/end-user/insight/dashboard/overview.md b/docs/en/docs/end-user/insight/dashboard/overview.md new file mode 100644 index 0000000000..f24bf68dea --- /dev/null +++ b/docs/en/docs/end-user/insight/dashboard/overview.md @@ -0,0 +1,21 @@ +--- +MTPE: WANG0608GitHub +date: 2024-07-09 +--- + +# Overview + +__Insight__ only collects data from clusters that have __insight-agent__ installed and running in a normal state. 
The overview provides an overview of resources across multiple clusters: + +- Alert Statistics: Provides statistics on active alerts across all clusters. +- Resource Consumption: Displays the resource usage trends for the top 5 clusters and nodes in the past hour, based on CPU usage, memory usage, and disk usage. +- By default, the sorting is based on CPU usage. You can switch the metric to sort clusters and nodes. +- Resource Trends: Shows the trends in the number of nodes over the past 15 days and the running trend of pods in the last hour. +- Service Requests Ranking: Displays the top 5 services with the highest request latency and error rates, along with their respective clusters and namespaces in the multi-cluster environment. + +## Operation procedure + +Select __Overview__ in the left navigation bar to enter the details page. + +![overview](../../image/overview.png){ width="1000"} + diff --git a/docs/en/docs/end-user/insight/data-query/log.md b/docs/en/docs/end-user/insight/data-query/log.md new file mode 100644 index 0000000000..65c203ed35 --- /dev/null +++ b/docs/en/docs/end-user/insight/data-query/log.md @@ -0,0 +1,53 @@ +# Log query + +By default, Insight collects node logs, container logs, and Kubernetes audit logs. +In the log query page, you can search for standard output (stdout) logs within the permissions +of your login account. This includes node logs, product logs, and Kubernetes audit logs. +You can quickly find the desired logs among a large volume of logs. Additionally, you can +use the source information and contextual raw data of the logs to assist in troubleshooting and issue resolution. + +## Prerequisites + +The cluster has [insight-agent installed](../../quickstart/install/install-agent.md) +and the application is in __running__ state. + +## Query log + +1. In the left navigation bar, select __Data Query__ -> __Log Query__ . + + ![Log Query](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/log01.png) + +2. After selecting the query criteria, click __Search__ , and the log records in the form of graphs will be displayed. The most recent logs are displayed on top. + + ![Search](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/log02.png) + +3. In the __Filter__ panel, switch __Type__ and select __Node__ to check the logs of all nodes in the cluster. + + ![Node](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/log03.png) + +4. In the __Filter__ panel, switch __Type__ and select __Event__ to view the logs generated by all Kubernetes events in the cluster. + +**Lucene Syntax Explanation:** + +1. Use logical operators (AND, OR, NOT, "") to query multiple keywords. For example: keyword1 AND (keyword2 OR keyword3) NOT keyword4. +2. Use a tilde (~) for fuzzy queries. You can optionally specify a parameter after the "~" to control the similarity of the fuzzy query. If not specified, it defaults to 0.5. For example: error~. +3. Use wildcards (*, ?) as single-character placeholders to match any character. +4. Use square brackets [ ] or curly braces { } for range queries. Square brackets [ ] represent a closed interval and include the boundary values. Curly braces { } represent an open interval and exclude the boundary values. Range queries are applicable only to fields that can be sorted, such as numeric fields and date fields. For example `timestamp:[2022-01-01 TO 2022-01-31]`. +5. For more information, please refer to the [Lucene Syntax Explanation](../../reference/lucene.md). 
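+
+The rules above can be combined. The queries below are illustrative examples only; adapt the keywords, and the sample `timestamp` field, to your own logs:
+
+```
+error AND (timeout OR "connection refused") NOT debug
+error~0.8
+err*
+timestamp:[2022-01-01 TO 2022-01-31]
+```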
+ +## View log context + +Clicking on the button next to a log will slide out a panel on the right side where you can view the +default 100 lines of context for that log. You can switch the __Display Rows__ option to view more contextual content. + +![view](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/log06.png) + +## Export log + +Click the download button located in the upper right corner of the list. + +- You can configure the exported log fields. The available fields may vary depending on the log type, + with the __Log Content__ field being mandatory. +- You can export the log query results in **.txt** or **.csv** format. + +![export](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/log05.png) diff --git a/docs/en/docs/end-user/insight/data-query/metric.md b/docs/en/docs/end-user/insight/data-query/metric.md new file mode 100644 index 0000000000..90436a1c16 --- /dev/null +++ b/docs/en/docs/end-user/insight/data-query/metric.md @@ -0,0 +1,33 @@ +# Metric query + +Metric query supports querying the index data of each container resource, and you can view the trend changes of the monitoring index. At the same time, advanced query supports native PromQL statements for Metric query. + +## Prerequisites + +- The cluster has [insight-agent installed](../../quickstart/install/install-agent.md) and the application is in __running__ state. + +## Common query + +1. In the left navigation bar, click __Data Query__ -> __metric Query__ . + +2. After selecting query conditions such as cluster, type, node, and metric name, click __Search__ , + and the corresponding metric chart and data details will be displayed on the right side of the screen. + + + +!!! tip + + Support custom time range. You can manually click the __Refresh__ icon or select a default time interval to refresh. + +## Advanced Search + +1. In the left navigation bar, click __Data Query__ -> __metric Query__ , + click the __Advanced Query__ tab to switch to the advanced query page. + + + +2. Enter a PromQL statement + (see [PromQL Syntax](https://prometheus.io/docs/prometheus/latest/querying/basics/)), + click __Query__ , and the query metric chart and data details will be displayed. 
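+
+    For example, the following statement is a minimal illustration (it assumes the standard cAdvisor metric
+    `container_cpu_usage_seconds_total` is collected by insight-agent) that queries the per-Pod CPU usage rate
+    over the last 5 minutes in the `insight-system` namespace:
+
+    ```promql
+    sum(rate(container_cpu_usage_seconds_total{namespace="insight-system"}[5m])) by (pod)
+    ```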
+ + \ No newline at end of file diff --git a/docs/en/docs/end-user/insight/images/big-log01.png b/docs/en/docs/end-user/insight/images/big-log01.png new file mode 100644 index 0000000000..b178b19ed9 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/big-log01.png differ diff --git a/docs/en/docs/end-user/insight/images/big-log02.png b/docs/en/docs/end-user/insight/images/big-log02.png new file mode 100644 index 0000000000..a28c76a96d Binary files /dev/null and b/docs/en/docs/end-user/insight/images/big-log02.png differ diff --git a/docs/en/docs/end-user/insight/images/big-log03.png b/docs/en/docs/end-user/insight/images/big-log03.png new file mode 100644 index 0000000000..f6616d8100 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/big-log03.png differ diff --git a/docs/en/docs/end-user/insight/images/big-log04.png b/docs/en/docs/end-user/insight/images/big-log04.png new file mode 100644 index 0000000000..bac3d7226e Binary files /dev/null and b/docs/en/docs/end-user/insight/images/big-log04.png differ diff --git a/docs/en/docs/end-user/insight/images/cluster.png b/docs/en/docs/end-user/insight/images/cluster.png new file mode 100644 index 0000000000..565754c04b Binary files /dev/null and b/docs/en/docs/end-user/insight/images/cluster.png differ diff --git a/docs/en/docs/end-user/insight/images/dashboard00.png b/docs/en/docs/end-user/insight/images/dashboard00.png new file mode 100644 index 0000000000..46a541b155 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/dashboard00.png differ diff --git a/docs/en/docs/end-user/insight/images/dashboard01.png b/docs/en/docs/end-user/insight/images/dashboard01.png new file mode 100644 index 0000000000..e4de048efa Binary files /dev/null and b/docs/en/docs/end-user/insight/images/dashboard01.png differ diff --git a/docs/en/docs/end-user/insight/images/inhibition-01.png b/docs/en/docs/end-user/insight/images/inhibition-01.png new file mode 100644 index 0000000000..53f0ba80d4 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/inhibition-01.png differ diff --git a/docs/en/docs/end-user/insight/images/inhibition.png b/docs/en/docs/end-user/insight/images/inhibition.png new file mode 100644 index 0000000000..ab8626b185 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/inhibition.png differ diff --git a/docs/en/docs/end-user/insight/images/inhibition01.png b/docs/en/docs/end-user/insight/images/inhibition01.png new file mode 100644 index 0000000000..f4067db1db Binary files /dev/null and b/docs/en/docs/end-user/insight/images/inhibition01.png differ diff --git a/docs/en/docs/end-user/insight/images/inhibition02.png b/docs/en/docs/end-user/insight/images/inhibition02.png new file mode 100644 index 0000000000..b75788b7aa Binary files /dev/null and b/docs/en/docs/end-user/insight/images/inhibition02.png differ diff --git a/docs/en/docs/end-user/insight/images/inhibition03.png b/docs/en/docs/end-user/insight/images/inhibition03.png new file mode 100644 index 0000000000..a38d4a7716 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/inhibition03.png differ diff --git a/docs/en/docs/end-user/insight/images/inhibition04.png b/docs/en/docs/end-user/insight/images/inhibition04.png new file mode 100644 index 0000000000..8411f3c3d5 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/inhibition04.png differ diff --git a/docs/en/docs/end-user/insight/images/insight-ns-toleration.png b/docs/en/docs/end-user/insight/images/insight-ns-toleration.png new file mode 100644 
index 0000000000..b2eaa81ea8 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/insight-ns-toleration.png differ diff --git a/docs/en/docs/end-user/insight/images/msg-detail.png b/docs/en/docs/end-user/insight/images/msg-detail.png new file mode 100644 index 0000000000..bf069e087c Binary files /dev/null and b/docs/en/docs/end-user/insight/images/msg-detail.png differ diff --git a/docs/en/docs/end-user/insight/images/node.png b/docs/en/docs/end-user/insight/images/node.png new file mode 100644 index 0000000000..bcc3449049 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/node.png differ diff --git a/docs/en/docs/end-user/insight/images/notify-01.png b/docs/en/docs/end-user/insight/images/notify-01.png new file mode 100644 index 0000000000..0dd518362b Binary files /dev/null and b/docs/en/docs/end-user/insight/images/notify-01.png differ diff --git a/docs/en/docs/end-user/insight/images/notify-02.png b/docs/en/docs/end-user/insight/images/notify-02.png new file mode 100644 index 0000000000..e84ce39fe5 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/notify-02.png differ diff --git a/docs/en/docs/end-user/insight/images/overview.png b/docs/en/docs/end-user/insight/images/overview.png new file mode 100644 index 0000000000..859a2cc294 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/overview.png differ diff --git a/docs/en/docs/end-user/insight/images/service00.png b/docs/en/docs/end-user/insight/images/service00.png new file mode 100644 index 0000000000..6ab222dbf3 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/service00.png differ diff --git a/docs/en/docs/end-user/insight/images/service01.png b/docs/en/docs/end-user/insight/images/service01.png new file mode 100644 index 0000000000..c16c2584ec Binary files /dev/null and b/docs/en/docs/end-user/insight/images/service01.png differ diff --git a/docs/en/docs/end-user/insight/images/servicemap.png b/docs/en/docs/end-user/insight/images/servicemap.png new file mode 100644 index 0000000000..8c6d448e03 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/servicemap.png differ diff --git a/docs/en/docs/end-user/insight/images/template00.png b/docs/en/docs/end-user/insight/images/template00.png new file mode 100644 index 0000000000..5391031ffd Binary files /dev/null and b/docs/en/docs/end-user/insight/images/template00.png differ diff --git a/docs/en/docs/end-user/insight/images/template01.png b/docs/en/docs/end-user/insight/images/template01.png new file mode 100644 index 0000000000..619afc902b Binary files /dev/null and b/docs/en/docs/end-user/insight/images/template01.png differ diff --git a/docs/en/docs/end-user/insight/images/template02.png b/docs/en/docs/end-user/insight/images/template02.png new file mode 100644 index 0000000000..e007e60b3a Binary files /dev/null and b/docs/en/docs/end-user/insight/images/template02.png differ diff --git a/docs/en/docs/end-user/insight/images/template03.png b/docs/en/docs/end-user/insight/images/template03.png new file mode 100644 index 0000000000..67a3449d7e Binary files /dev/null and b/docs/en/docs/end-user/insight/images/template03.png differ diff --git a/docs/en/docs/end-user/insight/images/template04.png b/docs/en/docs/end-user/insight/images/template04.png new file mode 100644 index 0000000000..564cf2f179 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/template04.png differ diff --git a/docs/en/docs/end-user/insight/images/template05.png b/docs/en/docs/end-user/insight/images/template05.png 
new file mode 100644 index 0000000000..fef14daa0b Binary files /dev/null and b/docs/en/docs/end-user/insight/images/template05.png differ diff --git a/docs/en/docs/end-user/insight/images/trace00.png b/docs/en/docs/end-user/insight/images/trace00.png new file mode 100644 index 0000000000..f54b62e334 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/trace00.png differ diff --git a/docs/en/docs/end-user/insight/images/tracelog.png b/docs/en/docs/end-user/insight/images/tracelog.png new file mode 100644 index 0000000000..bb4c1fbf24 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/tracelog.png differ diff --git a/docs/en/docs/end-user/insight/images/workload-1.png b/docs/en/docs/end-user/insight/images/workload-1.png new file mode 100644 index 0000000000..699f4b9600 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/workload-1.png differ diff --git a/docs/en/docs/end-user/insight/images/workload.png b/docs/en/docs/end-user/insight/images/workload.png new file mode 100644 index 0000000000..cc052da14f Binary files /dev/null and b/docs/en/docs/end-user/insight/images/workload.png differ diff --git a/docs/en/docs/end-user/insight/images/workload00.png b/docs/en/docs/end-user/insight/images/workload00.png new file mode 100644 index 0000000000..909ad43fa0 Binary files /dev/null and b/docs/en/docs/end-user/insight/images/workload00.png differ diff --git a/docs/en/docs/end-user/insight/infra/cluster.md b/docs/en/docs/end-user/insight/infra/cluster.md new file mode 100644 index 0000000000..06f053ecce --- /dev/null +++ b/docs/en/docs/end-user/insight/infra/cluster.md @@ -0,0 +1,39 @@ +--- +MTPE: ModetaNiu +DATE: 2024-08-29 +--- + +# Cluster Monitoring + +Through cluster monitoring, you can view the basic information of the cluster, the resource consumption +and the trend of resource consumption over a period of time. + +## Prerequisites + +The cluster has [insight-agent installed](../../quickstart/install/install-agent.md) and the application +is in __running__ state. + +## Steps + +1. Go to the __Insight__ product module. + +2. Select __Infrastructure__ > __Clusters__ from the left navigation bar. On this page, you can view + the following information: + + - **Resource Overview**: Provides statistics on the number of normal/all nodes and workloads across multiple clusters. + - **Fault**: Displays the number of alerts generated in the current cluster. + - **Resource Consumption**: Shows the actual usage and total capacity of CPU, memory, and disk for the selected cluster. + - **Metric Explanations**: Describes the trends in CPU, memory, disk I/O, and network bandwidth. + + ![Monitor](../../image/cluster.png){ width="1000"} + +3. Click __Resource Level Monitor__, you can view more metrics of the current cluster. 
+ +### Metric Explanations + +| Metric Name | Description | +| -- | -- | +| CPU Usage | The ratio of the actual CPU usage of all pod resources in the cluster to the total CPU capacity of all nodes.| +| CPU Allocation | The ratio of the sum of CPU requests of all pods in the cluster to the total CPU capacity of all nodes.| +| Memory Usage | The ratio of the actual memory usage of all pod resources in the cluster to the total memory capacity of all nodes.| +| Memory Allocation | The ratio of the sum of memory requests of all pods in the cluster to the total memory capacity of all nodes.| diff --git a/docs/en/docs/end-user/insight/infra/container.md b/docs/en/docs/end-user/insight/infra/container.md new file mode 100644 index 0000000000..8dff288ed3 --- /dev/null +++ b/docs/en/docs/end-user/insight/infra/container.md @@ -0,0 +1,64 @@ +--- +MTPE: WANG0608GitHub +Date: 2024-09-25 +hide: + - toc +--- + +# Container Insight + +Container insight monitors the workloads in cluster management. In the list, +you can view the basic information and status of workloads. On the workload details page, you can +see the number of active alerts and the trend of resource consumption such as CPU and memory. + +## Prerequisites + +- The cluster has insight-agent installed, and all pods are in the __Running__ state. + +- To install insight-agent, please refer to: [Installing insight-agent online](../../quickstart/install/install-agent.md) or [Offline upgrade of insight-agent](../../quickstart/install/offline-install.md). + +## Steps + +Follow these steps to view workload monitoring metrics: + +1. Go to the __Insight__ product module. + +2. Select __Infrastructure__ > __Workloads__ from the left navigation bar. + +3. Switch between tabs at the top to view data for different types of workloads. + + ![container insight](../../image/workload00.png){ width="1000"} + +4. Click the target workload name to view the details. + + 1. Faults: Displays the total number of active alerts for the workload. + 2. Resource Consumption: Shows the CPU, memory, and network usage of the workload. + 3. Monitoring Metrics: Provides the trends of CPU, memory, network, and disk usage for the workload over the past hour. + + ![container insight](../../image/workload.png){ width="1000"} + +5. Switch to the __Pods__ tab to view the status of the various pods for the workload, including their nodes, restart counts, and other information. + + ![container insight](../../image/workload-1.png){ width="1000"} + +6. Switch to the __JVM monitor__ tab to view the JVM metrics for each pod. + + + + !!! note + + 1. The JVM monitoring feature only supports the Java language. + 2. To enable the JVM monitoring feature, refer to [Getting Started with Monitoring Java Applications](../../quickstart/otel/java/index.md).
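+
+For reference, metric collection for application pods (including JVM metrics) relies on the pod annotations used by insight-agent; the annotation keys below (`insight.opentelemetry.io/metric-scrape` and `insight.opentelemetry.io/metric-port`) are the ones referenced elsewhere in this documentation. The following is a minimal, illustrative sketch only; the pod name, image, and port value are assumptions, and the complete setup is described in the Java monitoring guide linked in the note above:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: demo-java-app                                # illustrative name
+  annotations:
+    insight.opentelemetry.io/metric-scrape: "true"   # mark the pod for metric collection
+    insight.opentelemetry.io/metric-port: "8080"     # port exposing the metrics (illustrative)
+spec:
+  containers:
+    - name: app
+      image: demo-java-app:latest                    # illustrative image
+      ports:
+        - containerPort: 8080
+```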
+ +### Metric Explanations + +| **Metric Name** | **Description** | +| -- | -- | +| CPU Usage | The sum of CPU usage for all pods under the workload.| +| CPU Requests | The sum of CPU requests for all pods under the workload.| +| CPU Limits | The sum of CPU limits for all pods under the workload.| +| Memory Usage | The sum of memory usage for all pods under the workload.| +| Memory Requests | The sum of memory requests for all pods under the workload.| +| Memory Limits | The sum of memory limits for all pods under the workload.| +| Disk Read/Write Rate | The number of continuous disk read and write operations per second within the specified time range, a performance measure of disk I/O.| +| Network Send/Receive Rate | The incoming and outgoing rates of network traffic, aggregated by workload, within the specified time range.| diff --git a/docs/en/docs/end-user/insight/infra/event.md b/docs/en/docs/end-user/insight/infra/event.md new file mode 100644 index 0000000000..e68c14e805 --- /dev/null +++ b/docs/en/docs/end-user/insight/infra/event.md @@ -0,0 +1,41 @@ +# Event Query + +AI platform Insight supports event querying by cluster and namespace. + +![event](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/event01.png) + +## Event Status Distribution + +By default, the events that occurred within the last 12 hours are displayed. +You can select a different time range in the upper right corner to view longer or shorter periods. +You can also customize the sampling interval from 1 minute to 5 hours. + +The event status distribution chart provides a visual representation of the intensity and dispersion of events. +This helps in evaluating and preparing for subsequent cluster operations and maintenance tasks. +If events are densely concentrated during specific time periods, you may need to allocate more resources or take corresponding measures to ensure cluster stability and high availability. +On the other hand, if events are dispersed, you can effectively schedule other maintenance tasks such as system optimization and upgrades during this period. + +By considering the event status distribution chart and the selected time range, you can better plan and manage your cluster operations and maintenance work, ensuring system stability and reliability. + +## Event Count and Statistics + +Through important event statistics, you can easily understand the number of image pull failures, health check failures, Pod execution failures, Pod scheduling failures, container OOM (Out-of-Memory) occurrences, volume mounting failures, and the total count of all events. These events are typically categorized as "Warning" and "Normal". + +## Event List + +The event list is presented in chronological order. You can sort the events by __Last Occurred At__ and __Type__ . + +By clicking on the ⚙️ icon on the right side, you can customize the displayed columns according to your preferences and needs. + +Additionally, you can click the refresh icon to update the current event list when needed. + +![list](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/event02.png) + +In the operation column on the right, clicking the icon allows you to view the history of a specific event.
+ +![history](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/images/event03.png) + +## Reference + +For detailed meanings of the built-in Events in the system, refer to the +[Kubernetes API Event List](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/). diff --git a/docs/en/docs/end-user/insight/infra/namespace.md b/docs/en/docs/end-user/insight/infra/namespace.md new file mode 100644 index 0000000000..3eb7a8f7d3 --- /dev/null +++ b/docs/en/docs/end-user/insight/infra/namespace.md @@ -0,0 +1,34 @@ +--- +hide: + - toc +--- + +# Namespace Monitoring + +With namespaces as the dimension, you can quickly query resource consumption and trends within a namespace. + +## Prerequisites + +- Insight Agent is [installed](../../quickstart/install/install-agent.md) in the cluster and the applications are in the __Running__ state. + +## Steps + +1. Go to the __Insight__ product module. + +2. Select __Infrastructure__ -> __Namespaces__ from the left navigation bar. On this page, you can view the following information: + + 1. **Switch Namespace**: Switch between clusters or namespaces at the top. + 2. **Resource Overview**: Provides statistics on the number of normal and total workloads within the selected namespace. + 3. **Incidents**: Displays the number of alerts generated within the selected namespace. + 4. **Events**: Shows the number of Warning level events within the selected namespace in the past 24 hours. + 5. **Resource Consumption**: Provides the sum of CPU and memory usage for Pods within the selected namespace, along with the CPU and memory quota information. + + +### Metric Explanations + +| Metric Name | Description | +| -- | -- | +| CPU Usage | The sum of CPU usage for Pods within the selected namespace. | +| Memory Usage | The sum of memory usage for Pods within the selected namespace. | +| Pod CPU Usage | The CPU usage for each Pod within the selected namespace. | +| Pod Memory Usage | The memory usage for each Pod within the selected namespace. | diff --git a/docs/en/docs/end-user/insight/infra/node.md b/docs/en/docs/end-user/insight/infra/node.md new file mode 100644 index 0000000000..6b12e06f27 --- /dev/null +++ b/docs/en/docs/end-user/insight/infra/node.md @@ -0,0 +1,30 @@ +--- +MTPE: ModetaNiu +DATE: 2024-08-29 +--- + +# Node Monitoring + +Through node monitoring, you can get an overview of the current health status of the nodes in the selected cluster +and the number of abnormal pod; on the current node details page, you can view the number of alerts and +the trend of resource consumption such as CPU, memory, and disk. + +## Prerequisites + +- The cluster has [insight-agent installed](../../quickstart/install/install-agent.md) and the application is in __running__ state. + +## Steps + +1. Go to the __Insight__ product module. + +2. Select __Infrastructure__ -> __Nodes__ from the left navigation bar. On this page, you can view the following information: + + - **Cluster**: Uses the dropdown at the top to switch between clusters. + - **Nodes**: Shows a list of nodes within the selected cluster. Click a specific node to view detailed information. + - **Alert**: Displays the number of alerts generated in the current cluster. + - **Resource Consumption**: Shows the actual usage and total capacity of CPU, memory, and disk for the selected node. + - **Metric Explanations**: Describes the trends in CPU, memory, disk I/O, and network traffic for the selected node. + + ![Node Monitoring](../../image/node.png){ width="1000"} + +3. 
Click __Resource Level Monitor__ to view more metrics of the current node. diff --git a/docs/en/docs/end-user/insight/infra/probe.md b/docs/en/docs/end-user/insight/infra/probe.md new file mode 100644 index 0000000000..c358e814b7 --- /dev/null +++ b/docs/en/docs/end-user/insight/infra/probe.md @@ -0,0 +1,73 @@ +# Probe + +A probe uses black-box monitoring to regularly test the connectivity of targets over HTTP, TCP, and other protocols, enabling quick detection of ongoing faults. + +Insight uses the Prometheus Blackbox Exporter tool to probe the network using protocols such as HTTP, HTTPS, DNS, TCP, and ICMP, and returns the probe results to understand the network status. + +## Prerequisites + +The __insight-agent__ has been successfully deployed in the target cluster and is in the __Running__ state. + +## View Probes + +1. Go to the __Insight__ product module. +2. Select __Infrastructure__ -> __Probes__ in the left navigation bar. + + - Click the cluster or namespace dropdown in the table to switch between clusters and namespaces. + - The list displays the name, probe method, probe target, connectivity status, and creation time of the probes by default. + - The connectivity status can be: + - Normal: The probe successfully connects to the target, and the target returns the expected response. + - Abnormal: The probe fails to connect to the target, or the target does not return the expected response. + - Pending: The probe is attempting to connect to the target. + - Fuzzy search of probe names is supported. + + +## Create a Probe + +1. Click __Create Probe__ . +2. Fill in the basic information and click __Next__ . + + - Name: The name can only contain lowercase letters, numbers, and hyphens (-), and must start and end with a lowercase letter or number, with a maximum length of 63 characters. + - Cluster: Select the cluster for the probe task. + - Namespace: The namespace where the probe task is located. + + +3. Configure the probe parameters. + + - Blackbox Instance: Select the blackbox instance responsible for the probe. + - Probe Method: + - HTTP: Sends HTTP or HTTPS requests to the target URL to check its connectivity and response time. This can be used to monitor the availability and performance of websites or web applications. + - TCP: Establishes a TCP connection to the target host and port to check its connectivity and response time. This can be used to monitor TCP-based services such as web servers and database servers. + - Other: Supports custom probe methods by configuring ConfigMap. For more information, refer to: [Custom Probe Methods](../collection-manag/probe-module.md) + - Probe Target: The target address of the probe; domain names and IP addresses are supported. + - Labels: Custom labels that will be automatically added as Prometheus labels. + - Probe Interval: The interval between probes. + - Probe Timeout: The maximum waiting time when probing the target. + +4. After configuring, click **OK** to complete the creation. + +!!! warning + + After the probe task is created, it takes about 3 minutes to synchronize the configuration. During this period, no probes will be performed, and probe results cannot be viewed. + +## View Monitoring Dashboards + +Click __ ...__ in the operations column and click __View Monitoring Dashboard__ . + +| Metric Name | Description | +| -- | -- | +| Current Status Response | Represents the response status code of the HTTP probe request. | +| Ping Status | Indicates whether the probe request was successful.
1 indicates a successful probe request, and 0 indicates a failed probe request. | +| IP Protocol | Indicates the IP protocol version used in the probe request. | +| SSL Expiry | Represents the earliest expiration time of the SSL/TLS certificate. | +| DNS Response (Latency) | Represents the duration of the entire probe process in seconds. | +| HTTP Duration | Represents the duration of the entire process from sending the request to receiving the complete response. | + +## Edit a Probe + +Click __ ...__ in the operations column and click __Edit__ . + + +## Delete a Probe + +Click __ ...__ in the operations column and click __Delete__ . diff --git a/docs/en/docs/end-user/insight/quickstart/agent-status.md b/docs/en/docs/end-user/insight/quickstart/agent-status.md new file mode 100644 index 0000000000..1f8e3a3c89 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/agent-status.md @@ -0,0 +1,42 @@ +# Insight-agent component status + +Insight is a multicluster observability product in AI platform. To collect observability data from multiple clusters in a unified way, you need to install the Helm application __insight-agent__ +(installed in the insight-system namespace by default). See [How to install __insight-agent__ ](install/install-agent.md). + +## Status description + +In the __Insight__ -> __Data Collection__ section, you can view the status of __insight-agent__ installed in each cluster. + +- __not installed__ : __insight-agent__ is not installed under the insight-system namespace in this cluster +- __Running__ : __insight-agent__ is successfully installed in the cluster, and all deployed components are running +- __Exception__ : insight-agent is in this state when the Helm deployment failed or some deployed components are not running + +You can check the status as follows: + +1. Run the following command. If the status is __deployed__ , go to the next step. + If it is __failed__ , it will affect the upgrade of the application, so it is recommended + to uninstall and reinstall it via __Container Management__ -> __Helm Apps__ : + + ```bash + helm list -n insight-system + ``` + +2. Run the following command, or check the status of the components deployed in the cluster in + __Insight__ -> __Data Collection__ . If there is a pod that is not in the __Running__ state, please restart the abnormal pod (see the example at the end of this page). + + ```bash + kubectl get pods -n insight-system + ``` + +## Supplementary instructions + +1. The resource consumption of the metric collection component Prometheus in __insight-agent__ is directly proportional + to the number of pods running in the cluster. Adjust Prometheus resources according to the cluster size, + please refer to [Prometheus Resource Planning](./res-plan/prometheus-res.md). + +2. The storage capacity of the metric storage component vmstorage in the global service cluster + is directly proportional to the total number of pods across all clusters. + + - Please contact the platform administrator to adjust the disk capacity of vmstorage according to the cluster size, + see [vmstorage disk capacity planning](./res-plan/vms-res-plan.md). + - To adjust the vmstorage disk according to the multicluster size, see [vmstorage disk expansion](./res-plan/modify-vms-disk.md).
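+
+As a reference, the following is a minimal sketch of the checks and the restart step described above; `<pod-name>` is a placeholder for the abnormal pod reported by the second command:
+
+```bash
+# Check the Helm release status of insight-agent (expected status: deployed)
+helm list -n insight-system
+
+# List the insight-agent components and look for pods that are not Running
+kubectl get pods -n insight-system
+
+# Restart an abnormal pod by deleting it so that its controller recreates it
+kubectl delete pod <pod-name> -n insight-system
+```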
diff --git a/docs/en/docs/end-user/insight/quickstart/install/big-log-and-trace.md b/docs/en/docs/end-user/insight/quickstart/install/big-log-and-trace.md new file mode 100644 index 0000000000..20681ddace --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/install/big-log-and-trace.md @@ -0,0 +1,285 @@ +--- +MTPE: ModetaNiu +DATE: 2024-09-14 +--- + +# Enable Big Log and Big Trace Modes + +The Insight Module supports switching log to **Big Log** mode and trace to **Big Trace** mode, in order to +enhance data writing capabilities in large-scale environments. +This page introduces following methods for enabling these modes: + +- Enable or upgrade to Big Log and Big Trace modes [through the installer](#enabling-via-installer) (controlled by the same parameter value in `manifest.yaml`) +- Manually enable Big Log and Big Trace modes [through Helm commands](#enabling-via-helm-commands) + +## Logs + +This section explains the differences between the normal log mode and the Big Log mode. + +### Log Mode + +Components: Fluentbit + Elasticsearch + +This mode is referred to as the ES mode, and the data flow diagram is shown below: + +![Log Mode](../../images/big-log01.png) + +### Big Log Mode + +Components: Fluentbit + **Kafka** + **Vector** + Elasticsearch + +This mode is referred to as the Kafka mode, and the data flow diagram is shown below: + +![Big Log Mode](../../images/big-log02.png) + +## Traces + +This section explains the differences between the normal trace mode and the Big Trace mode. + +### Trace Mode + +Components: Agent opentelemetry-collector + Global opentelemetry-collector + Jaeger-collector + Elasticsearch + +This mode is referred to as the OTlp mode, and the data flow diagram is shown below: + +![Trace Mode](../../images/big-log03.png) + +### Big Trace Mode + +Components: Agent opentelemetry-collector + Kafka + Global opentelemetry-collector + Jaeger-collector + Elasticsearch + +This mode is referred to as the Kafka mode, and the data flow diagram is shown below: + +![Big Trace Mode](../../images/big-log04.png) + +## Enabling via Installer + +When deploying/upgrading AI platform using the installer, the `manifest.yaml` file includes the `infrastructures.kafka` field. +To enable observable Big Log and Big Trace modes, Kafka must be activated: + +```yaml title="manifest.yaml" +apiVersion: manifest.daocloud.io/v1alpha1 +kind: SuanovaManifest +... +infrastructures: + ... + kafka: + enable: true # Default is false + cpuLimit: 1 + memLimit: 2Gi + pvcSize: 15Gi +``` + +### Enable + +When using a `manifest.yaml` that enables `kafka` during installation, Kafka middleware will be installed by default, +and Big Log and Big Trace modes will be enabled automatically. The installation command is: + +```bash +./dce5-installer cluster-create -c clusterConfig.yaml -m manifest.yaml +``` + +### Upgrade + +The upgrade also involves modifying the `kafka` field. However, note that since the old environment was installed +with `kafka: false`, Kafka is not present in the environment. Therefore, you need to specify the upgrade +for `middleware` to install Kafka middleware simultaneously. The upgrade command is: + +```bash +./dce5-installer cluster-create -c clusterConfig.yaml -m manifest.yaml -u gproduct,middleware +``` + +!!! 
note + + After the upgrade is complete, you need to manually restart the following components: + + - insight-agent-fluent-bit + - insight-agent-opentelemetry-collector + - insight-opentelemetry-collector + +## Enabling via Helm Commands + +Prerequisites: Ensure that there is a **usable Kafka** and that the address is accessible. + +Use the following commands to retrieve the values of the old versions of Insight and insight-agent (it's recommended to back them up): + +```bash +helm get values insight -n insight-system -o yaml > insight.yaml +helm get values insight-agent -n insight-system -o yaml > insight-agent.yaml +``` + +### Enabling Big Log + +There are several ways to enable or upgrade to Big Log mode: + +=== "Use `--set` in the `helm upgrade` command" + + First, run the following Insight upgrade command, ensuring the Kafka brokers address is correct: + + ```bash + helm upgrade insight insight-release/insight \ + -n insight-system \ + -f ./insight.yaml \ + --set global.kafka.brokers="10.6.216.111:30592" \ + --set global.kafka.enabled=true \ + --set vector.enabled=true \ + --version 0.30.1 + ``` + + Then, run the following insight-agent upgrade command, ensuring the Kafka brokers address is correct: + + ```bash + helm upgrade insight-agent insight-release/insight-agent \ + -n insight-system \ + -f ./insight-agent.yaml \ + --set global.exporters.logging.kafka.brokers="10.6.216.111:30592" \ + --set global.exporters.logging.output=kafka \ + --version 0.30.1 + ``` + +=== "Modify YAML and run helm upgrade" + + Follow these steps to modify the YAML and then run the `helm upgrade` command: + + 1. Modify `insight.yaml` + + ```yaml title="insight.yaml" + global: + ... + kafka: + brokers: 10.6.216.111:30592 + enabled: true + ... + vector: + enabled: true + ``` + + 1. Upgrade the Insight component: + + ```bash + helm upgrade insight insight-release/insight \ + -n insight-system \ + -f ./insight.yaml \ + --version 0.30.1 + ``` + + 1. Modify `insight-agent.yaml` + + ```yaml title="insight-agent.yaml" + global: + ... + exporters: + ... + logging: + ... + kafka: + brokers: 10.6.216.111:30592 + output: kafka + ``` + + 1. Upgrade the insight-agent: + + ```bash + helm upgrade insight-agent insight-release/insight-agent \ + -n insight-system \ + -f ./insight-agent.yaml \ + --version 0.30.1 + ``` + +=== "Upgrade via Container Management UI" + + In the Container Management module, find the cluster, select **Helm Apps** from the left navigation bar, + and find and update the insight-agent. + + In **Logging Settings**, select **kafka** for **output** and fill in the correct **brokers** address. + + Note that after the upgrade is complete, you need to manually restart the **insight-agent-fluent-bit** component. 
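+
+As a reference, the restart mentioned above can also be performed from the command line. This is a sketch that assumes __insight-agent-fluent-bit__ runs as a DaemonSet in the `insight-system` namespace; verify the actual resource type and name first:
+
+```bash
+# Confirm how the Fluent Bit component is deployed
+kubectl get daemonset,deployment -n insight-system | grep fluent-bit
+
+# Restart it so that the new Kafka output configuration takes effect
+kubectl rollout restart daemonset insight-agent-fluent-bit -n insight-system
+```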
+ +### Enabling Big Trace + +There are several ways to enable or upgrade to Big Trace mode: + +=== "Using --set in the `helm upgrade` command" + + First, run the following Insight upgrade command, ensuring the Kafka brokers address is correct: + + ```bash + helm upgrade insight insight-release/insight \ + -n insight-system \ + -f ./insight.yaml \ + --set global.kafka.brokers="10.6.216.111:30592" \ + --set global.kafka.enabled=true \ + --set global.tracing.kafkaReceiver.enabled=true \ + --version 0.30.1 + ``` + + Then, run the following insight-agent upgrade command, ensuring the Kafka brokers address is correct: + + ```bash + helm upgrade insight-agent insight-release/insight-agent \ + -n insight-system \ + -f ./insight-agent.yaml \ + --set global.exporters.trace.kafka.brokers="10.6.216.111:30592" \ + --set global.exporters.trace.output=kafka \ + --version 0.30.1 + ``` + +=== "Modify YAML and run helm upgrade" + + Follow these steps to modify the YAML and then run the `helm upgrade` command: + + 1. Modify `insight.yaml` + + ```yaml title="insight.yaml" + global: + ... + kafka: + brokers: 10.6.216.111:30592 + enabled: true + ... + tracing: + ... + kafkaReceiver: + enabled: true + ``` + + 1. Upgrade the Insight component: + + ```bash + helm upgrade insight insight-release/insight \ + -n insight-system \ + -f ./insight.yaml \ + --version 0.30.1 + ``` + + 1. Modify `insight-agent.yaml` + + ```yaml title="insight-agent.yaml" + global: + ... + exporters: + ... + trace: + ... + kafka: + brokers: 10.6.216.111:30592 + output: kafka + ``` + + 1. Upgrade the insight-agent: + + ```bash + helm upgrade insight-agent insight-release/insight-agent \ + -n insight-system \ + -f ./insight-agent.yaml \ + --version 0.30.1 + ``` + +=== "Upgrade via Container Management UI" + + In the Container Management module, find the cluster, select **Helm Apps** from the left navigation bar, + and find and update the insight-agent. + + In **Trace Settings**, select **kafka** for **output** and fill in the correct **brokers** address. + + Note that after the upgrade is complete, you need to manually **restart the insight-agent-opentelemetry-collector** and **insight-opentelemetry-collector** components. \ No newline at end of file diff --git a/docs/en/docs/end-user/insight/quickstart/install/component-scheduling.md b/docs/en/docs/end-user/insight/quickstart/install/component-scheduling.md new file mode 100644 index 0000000000..5fd83a04b8 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/install/component-scheduling.md @@ -0,0 +1,332 @@ +--- +MTPE: WANG0608GitHub +Date: 2024-10-08 +--- + +# Custom Insight Component Scheduling Policy + +When deploying Insight to a Kubernetes environment, proper resource management and optimization are crucial. +Insight includes several core components such as Prometheus, OpenTelemetry, FluentBit, Vector, and Elasticsearch. +These components, during their operation, may negatively impact the performance of other pods within the cluster +due to resource consumption issues. To effectively manage resources and optimize cluster operations, +node affinity becomes an important option. + +This page is about how to add [taints](#configure-dedicated-nodes-for-insight-using-taints) +and [node affinity](#use-node-labels-and-node-affinity-to-manage-component-scheduling) to ensure that each component +runs on the appropriate nodes, avoiding resource competition or contention, thereby guranttee the stability and efficiency +of the entire Kubernetes cluster. 
+ +## Configure dedicated nodes for Insight using taints + +Since the Insight Agent includes DaemonSet components, the configuration method described in this section +is to have all components except the Insight DaemonSet run on dedicated nodes. + +This is achieved by adding taints to the dedicated nodes and using tolerations to match them. More details can be +found in the [Kubernetes official documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/). + +You can refer to the following commands to add and remove taints on nodes: + +```bash +# Add taint +kubectl taint nodes worker1 node.daocloud.io=insight-only:NoSchedule + +# Remove taint +kubectl taint nodes worker1 node.daocloud.io:NoSchedule- +``` + +There are two ways to schedule Insight components to dedicated nodes: + +### 1. Add tolerations for each component + +Configure the tolerations for the `insight-server` and `insight-agent` Charts respectively: + +=== "insight-server Chart" + + ```yaml + server: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + + ui: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + + runbook: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + + # mysql: + victoria-metrics-k8s-stack: + victoria-metrics-operator: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + vmcluster: + spec: + vmstorage: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + vmselect: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + vminsert: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + vmalert: + spec: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + alertmanager: + spec: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + + jaeger: + collector: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + query: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + + opentelemetry-collector-aggregator: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + + opentelemetry-collector: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + + grafana-operator: + operator: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + grafana: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + kibana: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + + elastic-alert: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + + vector: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + ``` + +=== "insight-agent Chart" + + ```yaml + kube-prometheus-stack: + prometheus: + prometheusSpec: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: 
"insight-only" + effect: "NoSchedule" + prometheus-node-exporter: + tolerations: + - effect: NoSchedule + operator: Exists + prometheusOperator: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + + kube-state-metrics: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + opentelemetry-operator: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + opentelemetry-collector: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + tailing-sidecar-operator: + operator: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + opentelemetry-kubernetes-collector: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + prometheus-blackbox-exporter: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + etcd-exporter: + tolerations: + - key: "node.daocloud.io" + operator: "Equal" + value: "insight-only" + effect: "NoSchedule" + ``` + +### 2. Configure at the namespace level + +Allow pods in the `insight-system` namespace to tolerate the `node.daocloud.io=insight-only` taint. + +1. Adjust the `apiserver` configuration file `/etc/kubernetes/manifests/kube-apiserver.yaml` to include + `PodTolerationRestriction,PodNodeSelector`. See the following picture: + + ![insight-ns-toleration](../../images/insight-ns-toleration.png) + +2. Add an annotation to the `insight-system` namespace: + + ```yaml + apiVersion: v1 + kind: Namespace + metadata: + name: insight-system + annotations: + scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Equal", "effect": "NoSchedule", "key": "node.daocloud.io", "value": "insight-only"}]' + ``` + +Restart the components under the insight-system namespace to allow normal scheduling of pods under the insight-system. + +## Use node labels and node affinity to manage component scheduling + +!!! info + + Node affinity is conceptually similar to `nodeSelector`, allowing you to constrain + which nodes a pod can be scheduled on based on **labels** on the nodes. + There are two types of node affinity: + + 1. requiredDuringSchedulingIgnoredDuringExecution: The scheduler will only schedule the pod + if the rules are met. This feature is similar to nodeSelector but has more expressive syntax. + 2. preferredDuringSchedulingIgnoredDuringExecution: The scheduler will try to find nodes that + meet the rules. If no matching nodes are found, the scheduler will still schedule the Pod. + + For more details, please refer to the [Kubernetes official documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity). + +To meet different user needs for scheduling Insight components, Insight provides fine-grained labels for +different components' scheduling policies. 
Below is a description of the labels and their associated components: + +| Label Key | Label Value | Description | +| --- | ------- | ------------ | +| `node.daocloud.io/insight-any` | Any value, recommended to use `true` | Represents that all Insight components prefer nodes with this label | +| `node.daocloud.io/insight-prometheus` | Any value, recommended to use `true` | Specifically for Prometheus components | +| `node.daocloud.io/insight-vmstorage` | Any value, recommended to use `true` | Specifically for VictoriaMetrics vmstorage components | +| `node.daocloud.io/insight-vector` | Any value, recommended to use `true` | Specifically for Vector components | +| `node.daocloud.io/insight-otel-col` | Any value, recommended to use `true` | Specifically for OpenTelemetry components | + +You can refer to the following commands to add and remove labels on nodes: + +```bash +# Add label to node8, prioritizing scheduling insight-prometheus to node8 +kubectl label nodes node8 node.daocloud.io/insight-prometheus=true + +# Remove the node.daocloud.io/insight-prometheus label from node8 +kubectl label nodes node8 node.daocloud.io/insight-prometheus- +``` + +Below is the default affinity preference for the insight-prometheus component during deployment: + +```yaml +affinity: + nodeAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - preference: + matchExpressions: + - key: node-role.kubernetes.io/control-plane + operator: DoesNotExist + weight: 1 + - preference: + matchExpressions: + - key: node.daocloud.io/insight-prometheus # (1)! + operator: Exists + weight: 2 + - preference: + matchExpressions: + - key: node.daocloud.io/insight-any + operator: Exists + weight: 3 + podAntiAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 1 + podAffinityTerm: + topologyKey: kubernetes.io/hostname + labelSelector: + matchExpressions: + - key: app.kubernetes.io/instance + operator: In + values: + - insight-agent-kube-prometh-prometheus +``` + +1. Prioritize scheduling insight-prometheus to nodes with the node.daocloud.io/insight-prometheus label diff --git a/docs/en/docs/end-user/insight/quickstart/install/gethosturl.md b/docs/en/docs/end-user/insight/quickstart/install/gethosturl.md new file mode 100644 index 0000000000..061d571ac3 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/install/gethosturl.md @@ -0,0 +1,188 @@ +--- +MTPE: ModetaNiu +DATE: 2024-07-24 +--- + +# Get Data Storage Address of Global Service Cluster + +Insight is a product for unified observation of multiple clusters. To achieve unified storage and +querying of observation data from multiple clusters, sub-clusters need to report the collected observation data to the +[global service cluster](../../../kpanda/clusters/cluster-role.md#global-service-cluster) +for unified storage. This document provides the required address of the storage component when +installing the collection component insight-agent. + +## Install insight-agent in Global Service Cluster + +If installing insight-agent in the global service cluster, it is recommended to access the cluster via domain name: + +```shell +export vminsert_host="vminsert-insight-victoria-metrics-k8s-stack.insight-system.svc.cluster.local" # (1)! +export es_host="insight-es-master.insight-system.svc.cluster.local" # (2)! +export otel_col_host="insight-opentelemetry-collector.insight-system.svc.cluster.local" # (3)! +``` + +## Install insight-agent in Other Clusters + +### Get Address via Interface Provided by Insight Server + +1. 
The [management cluster](../../../kpanda/clusters/cluster-role.md#management-clusters) + uses the default LoadBalancer mode for exposure. + + Log in to the console of the global service cluster and run the following command: + + + ```bash + export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP}) + curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam' + ``` + + !!! note + + Please replace the `${INSIGHT_SERVER_IP}` parameter in the command. + + You will get the following response: + + ```json + { + "values": { + "global": { + "exporters": { + "logging": { + "host": "10.6.182.32" + }, + "metric": { + "host": "10.6.182.32" + }, + "auditLog": { + "host": "10.6.182.32" + }, + "trace": { + "host": "10.6.182.32" + } + } + }, + "opentelemetry-operator": { + "enabled": true + }, + "opentelemetry-collector": { + "enabled": true + } + } + } + ``` + + - `global.exporters.logging.host` is the log service address, no need to set the proper service port, + the default value will be used. + - `global.exporters.metric.host` is the metrics service address. + - `global.exporters.trace.host` is the trace service address. + - `global.exporters.auditLog.host` is the audit log service address (same service as trace but different port). + +1. Management cluster disables LoadBalancer + + When calling the interface, you need to additionally pass an externally accessible node IP from the cluster, + which will be used to construct the complete access address of the proper service. + + ```bash + export INSIGHT_SERVER_IP=$(kubectl get service insight-server -n insight-system --output=jsonpath={.spec.clusterIP}) + curl --location --request POST 'http://'"${INSIGHT_SERVER_IP}"'/apis/insight.io/v1alpha1/agentinstallparam' --data '{"extra": {"EXPORTER_EXTERNAL_IP": "10.5.14.51"}}' + ``` + + You will get the following response: + + ```json + { + "values": { + "global": { + "exporters": { + "logging": { + "scheme": "https", + "host": "10.5.14.51", + "port": 32007, + "user": "elastic", + "password": "j8V1oVoM1184HvQ1F3C8Pom2" + }, + "metric": { + "host": "10.5.14.51", + "port": 30683 + }, + "auditLog": { + "host": "10.5.14.51", + "port": 30884 + }, + "trace": { + "host": "10.5.14.51", + "port": 30274 + } + } + }, + "opentelemetry-operator": { + "enabled": true + }, + "opentelemetry-collector": { + "enabled": true + } + } + } + ``` + + - `global.exporters.logging.host` is the log service address. + - `global.exporters.logging.port` is the NodePort exposed by the log service. + - `global.exporters.metric.host` is the metrics service address. + - `global.exporters.metric.port` is the NodePort exposed by the metrics service. + - `global.exporters.trace.host` is the trace service address. + - `global.exporters.trace.port` is the NodePort exposed by the trace service. + - `global.exporters.auditLog.host` is the audit log service address (same service as trace but different port). + - `global.exporters.auditLog.port` is the NodePort exposed by the audit log service. + +### Connect via LoadBalancer + +1. 
If `LoadBalancer` is enabled in the cluster and a `VIP` is set for Insight, you can manually execute + the following command to obtain the address information for `vminsert` and `opentelemetry-collector`: + + ```shell + $ kubectl get service -n insight-system | grep lb + lb-insight-opentelemetry-collector LoadBalancer 10.233.23.12 4317:31286/TCP,8006:31351/TCP 24d + lb-vminsert-insight-victoria-metrics-k8s-stack LoadBalancer 10.233.63.67 8480:31629/TCP 24d + ``` + + - `lb-vminsert-insight-victoria-metrics-k8s-stack` is the address for the metrics service. + - `lb-insight-opentelemetry-collector` is the address for the tracing service. + +2. Execute the following command to obtain the address information for `elasticsearch`: + + ```shell + $ kubectl get service -n mcamel-system | grep es + mcamel-common-es-cluster-masters-es-http NodePort 10.233.16.120 9200:30465/TCP 47d + ``` + + `mcamel-common-es-cluster-masters-es-http` is the address for the logging service. + +### Connect via NodePort + +The LoadBalancer feature is disabled in the global service cluster. + +In this case, the LoadBalancer resources mentioned above will not be created by default. The relevant service names are: + +- vminsert-insight-victoria-metrics-k8s-stack (metrics service) +- common-es (logging service) +- insight-opentelemetry-collector (tracing service) + +After obtaining the corresponding port information for the services in the above two scenarios, make the following settings: + +```shell +--set global.exporters.logging.host= # (1)! +--set global.exporters.logging.port= # (2)! +--set global.exporters.metric.host= # (3)! +--set global.exporters.metric.port= # (4)! +--set global.exporters.trace.host= # (5)! +--set global.exporters.trace.port= # (6)! +--set global.exporters.auditLog.host= # (7)! +``` + +1. NodeIP of the externally accessible management cluster +2. NodePort of the logging service port 9200 +3. NodeIP of the externally accessible management cluster +4. NodePort of the metrics service port 8480 +5. NodeIP of the externally accessible management cluster +6. NodePort of the tracing service port 4317 +7. NodeIP of the externally accessible management cluster diff --git a/docs/en/docs/end-user/insight/quickstart/install/index.md b/docs/en/docs/end-user/insight/quickstart/install/index.md new file mode 100644 index 0000000000..4c1744d4b9 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/install/index.md @@ -0,0 +1,48 @@ +--- +MTPE: ModetaNiu +DATE: 2024-07-19 +--- + +# Start Observing + +AI platform platform enables the management and creation of multicloud and multiple clusters. +Building upon this capability, Insight serves as a unified observability solution for +multiple clusters. It collects observability data from multiple clusters by deploying the insight-agent +plugin and allows querying of metrics, logs, and trace data through the AI platform Insight. + + __insight-agent__ is a tool that facilitates the collection of observability data from multiple clusters. +Once installed, it automatically collects metrics, logs, and trace data without any modifications. + +Clusters created through __Container Management__ come pre-installed with insight-agent. Hence, +this guide specifically provides instructions on enabling observability for integrated clusters. 
- [Install insight-agent online](install-agent.md) + +As a unified observability platform for multiple clusters, the resource consumption of certain Insight components +is closely related to the size of the created clusters and the number of integrated clusters. +When installing insight-agent, it is necessary to adjust the resources of the corresponding components based on the cluster size. + +1. Adjust the CPU and memory resources of the __Prometheus__ collection component in insight-agent + according to the size of the cluster created or integrated. Please refer to + [Prometheus resource planning](../res-plan/prometheus-res.md). + +2. As the metric data from multiple clusters is stored centrally, AI platform administrators + need to adjust the disk space of __vmstorage__ based on the cluster size. + Please refer to [vmstorage disk capacity planning](../res-plan/vms-res-plan.md). + +- For instructions on adjusting the disk space of vmstorage, please refer to + [Expanding vmstorage disk](../res-plan/modify-vms-disk.md). + +Since AI platform supports the management of multicloud and multiple clusters, +insight-agent has undergone partial verification. However, there are known conflicts +with monitoring components when installing insight-agent in Suanova 4.0 clusters and +Openshift 4.x clusters. If you encounter similar issues, please refer to the following documents: + +- [Install insight-agent in Openshift 4.x](../other/install-agent-on-ocp.md) + +Currently, the insight-agent collection component has undergone functional testing +for popular versions of Kubernetes. Please refer to: + +- [Kubernetes cluster compatibility testing](../../compati-test/k8s-compatibility.md) +- [Openshift 4.x cluster compatibility testing](../../compati-test/ocp-compatibility.md) +- [Rancher cluster compatibility testing](../../compati-test/rancher-compatibility.md) diff --git a/docs/en/docs/end-user/insight/quickstart/install/install-agent.md b/docs/en/docs/end-user/insight/quickstart/install/install-agent.md new file mode 100644 index 0000000000..2fceea5f20 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/install/install-agent.md @@ -0,0 +1,44 @@ +--- +date: 2022-11-17 +hide: + - toc +--- + +# Install insight-agent + +insight-agent is a plugin for collecting insight data, supporting unified observation of metrics, traces, and log data. This page describes how to install insight-agent online for an integrated cluster. + +## Prerequisites + +Please confirm that your cluster has successfully connected to the __container management__ platform. You can refer to [Integrate Clusters](../../../kpanda/clusters/integrate-cluster.md) for details. + +## Steps + +1. Enter __Container Management__ from the left navigation bar, and enter __Clusters__ . Find the cluster where you want to install insight-agent. + + ![Find Cluster](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/insight-agent01.png) + +1. Click __Install now__ to jump directly, or click the cluster, select __Helm Applications__ -> __Helm Templates__ in the left navigation bar, search for __insight-agent__ in the search box, and click it for details. + + ![Search insight-agent](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/insight-agent02.png) + +1. Select the appropriate version and click __Install__ . + + ![Install](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/insight-agent03.png) + +1.
Fill in the name, select the namespace and version, and fill in the reporting addresses for logging, metric, audit, and trace data in the YAML file. The system fills in the default data reporting addresses; please check them before clicking __OK__ to install. + + If you need to modify the data reporting address, please refer to [Get Data Reporting Address](./gethosturl.md). + + ![Sheet Fill1](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/insight-agent04-1.png) + + ![Sheet Fill2](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/insight-agent04-2.png) + +1. The system will automatically return to __Helm Apps__ . When the application status changes from __Unknown__ to __Deployed__ , it means that insight-agent is installed successfully. + + ![Finish Page](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/insight-agent05.png) + + !!! note + + - Click __┇__ on the far right, and you can perform more operations such as __Update__ , __View YAML__ and __Delete__ in the pop-up menu. + - For a practical installation demo, watch [Video demo of installing insight-agent](../../../videos/insight.md#install-insight-agent) diff --git a/docs/en/docs/end-user/insight/quickstart/install/knownissues.md b/docs/en/docs/end-user/insight/quickstart/install/knownissues.md new file mode 100644 index 0000000000..fb7096a25b --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/install/knownissues.md @@ -0,0 +1,83 @@ +--- +MTPE: windsonsea +date: 2024-02-26 +--- + +# Known Issues + +This page lists some issues related to the installation and uninstallation of Insight Agent and their workarounds. + +## v0.23.0 + +### Insight Agent + +#### Uninstallation Failure of Insight Agent + +When you run the following command to uninstall Insight Agent, + +```sh +helm uninstall insight-agent -n insight-system +``` + +the `tls secret` used by `otel-operator` fails to be deleted. + +Due to the logic of reusing the tls secret in the `otel-operator` code, +it checks whether `MutationConfiguration` exists and reuses the CA cert bound in +MutationConfiguration. However, since `helm uninstall` has removed the `MutationConfiguration`, +it results in a null value. + +Therefore, please manually delete the corresponding `secret` using one of the following methods: + +- **Delete via command line**: Log in to the console of the target cluster and run the following command: + + ```sh + kubectl -n insight-system delete secret insight-agent-opentelemetry-operator-controller-manager-service-cert + ``` + +- **Delete via UI**: Log in to AI platform container management, select the target cluster, select **Secret** + from the left menu, input `insight-agent-opentelemetry-operator-controller-manager-service-cert`, + then select `Delete`. + +### Insight Agent + +#### Log Collection Endpoint Not Updated When Upgrading Insight Agent + +When updating the log configuration of the insight-agent from Elasticsearch to Kafka or from Kafka +to Elasticsearch, the changes do not take effect and the agent continues to use the previous configuration. + +**Solution** : + +Manually restart Fluent Bit in the cluster. + +## v0.21.0 + +### Insight Agent + +#### PodMonitor Collects Multiple Sets of JVM Metrics + +1.
In this version, there is a defect in **PodMonitor/insight-kubernetes-pod**: it will incorrectly + create Jobs to collect metrics for all containers in Pods that are marked with + `insight.opentelemetry.io/metric-scrape=true`, instead of only the containers corresponding + to `insight.opentelemetry.io/metric-port`. + +2. After PodMonitor is declared, **PrometheusOperator** will pre-configure some service discovery configurations. + Considering the compatibility of CRDs, it is abandoned to configure the collection tasks through **annotations**. + +3. Use the additional scrape config mechanism provided by Prometheus to configure the service discovery rules + in a secret and introduce them into Prometheus. + +Therefore: + +1. Delete the current **PodMonitor** for **insight-kubernetes-pod** +2. Use a new rule + +In the new rule, **action: keepequal** is used to compare the consistency between **source_labels** +and **target_label** to determine whether to create collection tasks for the ports of a container. +Note that this feature is only available in Prometheus v2.41.0 (2022-12-20) and higher. + +```diff ++ - source_labels: [__meta_kubernetes_pod_annotation_insight_opentelemetry_io_metric_port] ++ separator: ; ++ target_label: __meta_kubernetes_pod_container_port_number ++ action: keepequal +``` diff --git a/docs/en/docs/end-user/insight/quickstart/install/upgrade-note.md b/docs/en/docs/end-user/insight/quickstart/install/upgrade-note.md new file mode 100644 index 0000000000..109010a879 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/install/upgrade-note.md @@ -0,0 +1,166 @@ +--- +MTPE: WANG0608GitHub +Date: 2024-09-24 +--- + +# Upgrade Notes + +This page provides some considerations for upgrading insight-server and insight-agent. + +## insight-agent + +### Upgrade from v0.28.x (or lower) to v0.29.x + +Due to the upgrade of the Opentelemetry community operator chart version in v0.29.0, the supported values for `featureGates` in the values file have changed. Therefore, before upgrading, you need to set the value of `featureGates` to empty, as follows: + +```diff +- --set opentelemetry-operator.manager.featureGates="+operator.autoinstrumentation.go,+operator.autoinstrumentation.multi-instrumentation,+operator.autoinstrumentation.nginx" \ ++ --set opentelemetry-operator.manager.featureGates="" +``` + +## insight-server + +### Upgrade from v0.26.x (or lower) to v0.27.x or higher + +In v0.27.x, the switch for the vector component has been separated. If the existing environment has vector enabled, you need to specify `--set vector.enabled=true` when upgrading the insight-server. + +### Upgrade from v0.19.x (or lower) to 0.20.x + +Before upgrading __Insight__ , you need to manually delete the __jaeger-collector__ and +__jaeger-query__ deployments by running the following command: + +```bash +kubectl -n insight-system delete deployment insight-jaeger-collector +kubectl -n insight-system delete deployment insight-jaeger-query +``` + +### Upgrade from v0.17.x (or lower) to v0.18.x + +In v0.18.x, there have been updates to the Jaeger-related deployment files, +so you need to manually run the following commands before upgrading insight-server: + +```bash +kubectl -n insight-system delete deployment insight-jaeger-collector +kubectl -n insight-system delete deployment insight-jaeger-query +``` + +There have been changes to metric names in v0.18.x, so after upgrading insight-server, +insight-agent should also be upgraded. 
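+
+For example, after upgrading insight-server you can confirm the currently deployed versions and then upgrade insight-agent to match. This is only a sketch; the chart source and values file names below are placeholders, so substitute the ones used in your environment:
+
+```bash
+# Confirm the currently deployed insight-server / insight-agent releases and versions
+helm -n insight-system list
+
+# Upgrade insight-agent to a matching version (chart source and values file are placeholders)
+helm -n insight-system upgrade insight-agent <your-insight-agent-chart> \
+  --version <target-version> \
+  -f <your-values.yaml>
+```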
+ +In addition, the parameters for enabling the tracing module and adjusting the ElasticSearch connection +have been modified. Refer to the following parameters: + +```diff ++ --set global.tracing.enable=true \ +- --set jaeger.collector.enabled=true \ +- --set jaeger.query.enabled=true \ ++ --set global.elasticsearch.scheme=${your-external-elasticsearch-scheme} \ ++ --set global.elasticsearch.host=${your-external-elasticsearch-host} \ ++ --set global.elasticsearch.port=${your-external-elasticsearch-port} \ ++ --set global.elasticsearch.user=${your-external-elasticsearch-username} \ ++ --set global.elasticsearch.password=${your-external-elasticsearch-password} \ +- --set jaeger.storage.elasticsearch.scheme=${your-external-elasticsearch-scheme} \ +- --set jaeger.storage.elasticsearch.host=${your-external-elasticsearch-host} \ +- --set jaeger.storage.elasticsearch.port=${your-external-elasticsearch-port} \ +- --set jaeger.storage.elasticsearch.user=${your-external-elasticsearch-username} \ +- --set jaeger.storage.elasticsearch.password=${your-external-elasticsearch-password} \ +``` + +### Upgrade from v0.15.x (or lower) to v0.16.x + +In v0.16.x, a new feature parameter `disableRouteContinueEnforce` in the `vmalertmanagers CRD` +is used. Therefore, you need to manually run the following command before upgrading insight-server: + +```shell +kubectl apply --server-side -f https://raw.githubusercontent.com/VictoriaMetrics/operator/v0.33.0/config/crd/bases/operator.victoriametrics.com_vmalertmanagers.yaml --force-conflicts +``` + +!!! note + + If you are performing an offline installation, after extracting the insight offline package, + please run the following command to update CRDs. + + ```shell + kubectl apply --server-side -f insight/dependency-crds --force-conflicts + ``` + +## insight-agent + +### Upgrade from v0.23.x (or lower) to v0.24.x + +In v0.24.x, CRDs have been added to the `OTEL operator chart`. However, +helm upgrade does not update CRDs, so you need to manually run the following command: + +```shell +kubectl apply -f https://raw.githubusercontent.com/open-telemetry/opentelemetry-helm-charts/main/charts/opentelemetry-operator/crds/crd-opentelemetry.io_opampbridges.yaml +``` + +If you are performing an offline installation, you can find the above CRD yaml file after extracting the +insight-agent offline package. After extracting the insight-agent Chart, manually run the following command: + +```shell +kubectl apply -f charts/agent/crds/crd-opentelemetry.io_opampbridges.yaml +``` + +### Upgrade from v0.19.x (or lower) to v0.20.x + +In v0.20.x, Kafka log export configuration has been added, and there have been some adjustments +to the log export configuration. Before upgrading __insight-agent__ , please note the parameter changes. +The previous logging configuration has been moved to the logging.elasticsearch configuration: + +```diff +- --set global.exporters.logging.host \ +- --set global.exporters.logging.port \ ++ --set global.exporters.logging.elasticsearch.host \ ++ --set global.exporters.logging.elasticsearch.port \ +``` + +### Upgrade from v0.17.x (or lower) to v0.18.x + +Due to the updated deployment files for Jaeger In v0.18.x, it is important to +note the changes in parameters before upgrading the insight-agent. 
+
+```diff
++ --set global.exporters.trace.enable=true \
+- --set opentelemetry-collector.enabled=true \
+- --set opentelemetry-operator.enabled=true \
+```
+
+### Upgrade from v0.16.x (or lower) to v0.17.x
+
+In v0.17.x, the kube-prometheus-stack chart version was upgraded from 41.9.1 to 45.28.1, and
+some fields in the CRDs used were also upgraded, such as the __attachMetadata__ field of
+servicemonitor. Therefore, the following command needs to be run before upgrading the insight-agent:
+
+```bash
+kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.65.1/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml --force-conflicts
+```
+
+If you are performing an offline installation, you can find the YAML for the above CRD in
+insight-agent/dependency-crds after extracting the insight-agent offline package.
+
+### Upgrade from v0.11.x (or earlier) to v0.12.x
+
+v0.12.x upgrades the kube-prometheus-stack chart from 39.6.0 to 41.9.1, which includes prometheus-operator v0.60.1 and the prometheus-node-exporter chart v4.3.0.
+After the upgrade, prometheus-node-exporter uses the [Kubernetes recommended labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/), so you need to delete the __node-exporter__ daemonset.
+prometheus-operator has updated its CRDs, so you need to run the following commands before upgrading the insight-agent:
+
+```shell linenums="1"
+kubectl delete daemonset insight-agent-prometheus-node-exporter -n insight-system
+kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml --force-conflicts
+kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml --force-conflicts
+kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml --force-conflicts
+kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml --force-conflicts
+kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml --force-conflicts
+kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml --force-conflicts
+kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml --force-conflicts
+kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.1/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml --force-conflicts
+```
+
+!!! note
+
+    If you are installing offline, you can run the following command to update the CRDs after decompressing the insight-agent offline package.
+
+    ```shell
+    kubectl apply --server-side -f insight-agent/dependency-crds --force-conflicts
+    ```
diff --git a/docs/en/docs/end-user/insight/quickstart/jvm-monitor/jmx-exporter.md b/docs/en/docs/end-user/insight/quickstart/jvm-monitor/jmx-exporter.md
new file mode 100644
index 0000000000..935d3724b3
--- /dev/null
+++ b/docs/en/docs/end-user/insight/quickstart/jvm-monitor/jmx-exporter.md
@@ -0,0 +1,136 @@
+# Use JMX Exporter to expose JVM monitoring metrics
+
+JMX Exporter can be used in two ways:
+
+1. As a standalone process. Specify parameters when the JVM starts to expose a JMX RMI interface. JMX Exporter calls RMI to obtain the JVM runtime status data,
+   converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape.
+2. In-process, as a JVM javaagent. Specify parameters when the JVM starts to run the JMX Exporter jar package in the form of a javaagent.
+   It reads the JVM runtime status data in-process, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape.
+
+!!! note
+
+    The first method is not officially recommended: the configuration is more complicated, it requires a separate process, and monitoring that process in turn becomes a new problem.
+    Therefore, this page focuses on the second usage and explains how to use JMX Exporter to expose JVM monitoring metrics in a Kubernetes environment.
+
+With the second usage, the JMX Exporter jar package and its configuration file need to be specified when starting the JVM.
+Because the jar package is a binary file that is not convenient to mount through a ConfigMap, and the configuration file rarely needs to be modified,
+the suggestion is to package both the JMX Exporter jar package and the configuration file directly into the business container image.
+
+For this second usage, you can either put the JMX Exporter jar file into the business application image,
+or mount it during deployment. The two approaches are introduced below:
+
+## Method 1: Build the JMX Exporter JAR file into the business image
+
+The content of prometheus-jmx-config.yaml is as follows:
+
+```yaml title="prometheus-jmx-config.yaml"
+...
+ssl: false
+lowercaseOutputName: false
+lowercaseOutputLabelNames: false
+rules:
+- pattern: ".*"
+```
+
+!!! note
+
+    For more configuration options, please refer to the introduction at the bottom or the [Prometheus official documentation](https://github.com/prometheus/jmx_exporter#configuration).
+
+Then prepare the jar package file. You can find the latest jar package download address on the GitHub page of [jmx_exporter](https://github.com/prometheus/jmx_exporter) and refer to the following Dockerfile:
+
+```shell
+FROM openjdk:11.0.15-jre
+WORKDIR /app/
+COPY target/my-app.jar ./
+COPY prometheus-jmx-config.yaml ./
+RUN set -ex; \
+    curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar;
+ENV JAVA_TOOL_OPTIONS=-javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml
+EXPOSE 8081 8999 8080 8888
+ENTRYPOINT java $JAVA_OPTS -jar my-app.jar
+```
+
+Notice:
+
+- Startup parameter format: `-javaagent:<path-to-jmx-exporter-jar>=<port>:<path-to-config-file>`
+- Port 8088 is used here to expose the monitoring metrics of the JVM.
If it conflicts with Java applications, you can change it yourself + +## Method 2: mount via init container container + +We need to make the JMX exporter into a Docker image first, the following Dockerfile is for reference only: + +```shell +FROM alpine/curl:3.14 +WORKDIR /app/ +# Copy the previously created config file to the mirror +COPY prometheus-jmx-config.yaml ./ +# Download jmx prometheus javaagent jar online +RUN set -ex; \ + curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar; +``` + +Build the image according to the above Dockerfile: __docker build -t my-jmx-exporter .__ + +Add the following init container to the Java application deployment Yaml: + +??? note "Click to view YAML file" + + ```yaml + apiVersion: apps/v1 + kind: Deployment + metadata: + name: my-demo-app + labels: + app: my-demo-app + spec: + selector: + matchLabels: + app: my-demo-app + template: + metadata: + labels: + app: my-demo-app + spec: + imagePullSecrets: + - name: registry-pull + initContainers: + - name: jmx-sidecar + image: my-jmx-exporter + command: ["cp", "-r", "/app/jmx_prometheus_javaagent-0.17.2.jar", "/target/jmx_prometheus_javaagent-0.17.2.jar"] ➊ + volumeMounts: + - name: sidecar + mountPath: /target + containers: + - image: my-demo-app-image + name: my-demo-app + resources: + requests: + memory: "1000Mi" + cpu: "500m" + limits: + memory: "1000Mi" + cpu: "500m" + ports: + - containerPort: 18083 + env: + - name: JAVA_TOOL_OPTIONS + value: "-javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml" ➋ + volumeMounts: + - name: host-time + mountPath: /etc/localtime + readOnly: true + - name: sidecar + mountPath: /sidecar + volumes: + - name: host-time + hostPath: + path: /etc/localtime + - name: sidecar # Share the agent folder + emptyDir: {} + restartPolicy: Always + ``` + +After the above modification, the sample application my-demo-app has the ability to expose JVM metrics. +After running the service, we can access the prometheus format metrics exposed by the service through `http://lcoalhost:8088`. + +Then, you can refer to [Java Application Docking Observability with JVM Metrics](./legacy-jvm.md). diff --git a/docs/en/docs/end-user/insight/quickstart/jvm-monitor/jvm-catelogy.md b/docs/en/docs/end-user/insight/quickstart/jvm-monitor/jvm-catelogy.md new file mode 100644 index 0000000000..e0d6ef1e36 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/jvm-monitor/jvm-catelogy.md @@ -0,0 +1,13 @@ +# Start monitoring Java applications + +This document mainly describes how to monitor the JVM of the customer's Java application. +It describes how Java applications that have exposed JVM metrics, and those that have not, interface with Insight. 
+ +If your Java application does not start exposing JVM metrics, you can refer to the following documents: + +- [Expose JVM monitoring metrics with JMX Exporter](./jmx-exporter.md) +- [Expose JVM monitoring metrics using OpenTelemetry Java Agent](./otel-java-agent.md) + +If your Java application has exposed JVM metrics, you can refer to the following documents: + +- [Java application docking observability with existing JVM metrics](./legacy-jvm.md) \ No newline at end of file diff --git a/docs/en/docs/end-user/insight/quickstart/jvm-monitor/legacy-jvm.md b/docs/en/docs/end-user/insight/quickstart/jvm-monitor/legacy-jvm.md new file mode 100644 index 0000000000..b03b28ad5e --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/jvm-monitor/legacy-jvm.md @@ -0,0 +1,94 @@ +--- +MTPE: ModetaNiu +DATE: 2024-08-14 +--- + +# Java Application with JVM Metrics to Dock Insight + +If your Java application exposes JVM monitoring metrics through other means (such as Spring Boot Actuator), +We need to allow monitoring data to be collected. You can let Insight collect existing JVM metrics by +adding Kubernetes Annotations to the workload: + +```yaml +annatation: + insight.opentelemetry.io/metric-scrape: "true" # whether to collect + insight.opentelemetry.io/metric-path: "/" # path to collect metrics + insight.opentelemetry.io/metric-port: "9464" # port for collecting metrics +``` + +YAML Example to add annotations for __my-deployment-app__ workload: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: my-deployment-app +spec: + selector: + matchLabels: + app: my-deployment-app + app.kubernetes.io/name: my-deployment-app + replicas: 1 + template: + metadata: + labels: + app: my-deployment-app + app.kubernetes.io/name: my-deployment-app + annotations: + insight.opentelemetry.io/metric-scrape: "true" # whether to collect + insight.opentelemetry.io/metric-path: "/" # path to collect metrics + insight.opentelemetry.io/metric-port: "9464" # port for collecting metrics +``` + +The following shows the complete YAML: + +```yaml +--- +apiVersion: v1 +kind: Service +metadata: + name: spring-boot-actuator-prometheus-metrics-demo +spec: + type: NodePort + selector: + #app: my-deployment-with-aotu-instrumentation-app + app.kubernetes.io/name: spring-boot-actuator-prometheus-metrics-demo + ports: + - name: http + port: 8080 +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: spring-boot-actuator-prometheus-metrics-demo +spec: + selector: + matchLabels: + #app: my-deployment-with-aotu-instrumentation-app + app.kubernetes.io/name: spring-boot-actuator-prometheus-metrics-demo + replicas: 1 + template: + metadata: + labels: + app.kubernetes.io/name: spring-boot-actuator-prometheus-metrics-demo + annotations: + insight.opentelemetry.io/metric-scrape: "true" # whether to collect + insight.opentelemetry.io/metric-path: "/actuator/prometheus" # path to collect metrics + insight.opentelemetry.io/metric-port: "8080" # port for collecting metrics + spec: + containers: + - name: myapp + image: docker.m.daocloud.io/wutang/spring-boot-actuator-prometheus-metrics-demo + ports: + - name: http + containerPort: 8080 + resources: + limits: + cpu: 500m + memory: 800Mi + requests: + cpu: 200m + memory: 400Mi +``` + +In the above example,Insight will use __:8080//actuator/prometheus__ to get Prometheus metrics exposed through *Spring Boot Actuator* . 
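+
+Before relying on Insight to scrape these metrics, you can quickly verify that the endpoint responds. The following sketch reuses the Service name, port, and path from the example above:
+
+```bash
+# Forward the example Service port to your local machine
+kubectl port-forward svc/spring-boot-actuator-prometheus-metrics-demo 8080:8080 &
+
+# The path matches the insight.opentelemetry.io/metric-path annotation
+curl -s http://localhost:8080/actuator/prometheus | head
+```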
diff --git a/docs/en/docs/end-user/insight/quickstart/jvm-monitor/otel-java-agent.md b/docs/en/docs/end-user/insight/quickstart/jvm-monitor/otel-java-agent.md
new file mode 100644
index 0000000000..b97086dcaa
--- /dev/null
+++ b/docs/en/docs/end-user/insight/quickstart/jvm-monitor/otel-java-agent.md
@@ -0,0 +1,23 @@
+# Use OpenTelemetry Java Agent to expose JVM monitoring metrics
+
+In OpenTelemetry Agent v1.20.0 and above, the agent includes the JMX Metric Insight module. If your application has already integrated the OpenTelemetry Agent to collect application traces, you no longer need to introduce another agent to expose JMX metrics. The OpenTelemetry Agent collects and exposes metrics by instrumenting the MBeans locally available in the application.
+
+The OpenTelemetry Agent also provides some built-in monitoring samples for common Java servers and frameworks, please refer to the [predefined metrics](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/runtime-environment-metrics.md#jvm-metrics).
+
+When using the OpenTelemetry Java Agent, you also need to consider how to mount the JAR into the container. In addition to mounting the JAR file as described for the JMX Exporter above, you can use the Operator capabilities provided by OpenTelemetry to automatically enable JVM metric exposure for your applications.
+
+However, for the current version, you still need to manually add the [corresponding annotations](./legacy-jvm.md) to the workload before the JVM data will be collected by Insight.
+
+## Expose metrics for Java middleware
+
+The OpenTelemetry Agent also has some built-in middleware monitoring samples, please refer to the [Predefined Metrics](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/jmx-metrics/javaagent/README.md#predefined-metrics).
+
+By default, no type is specified, and it needs to be specified through the __-Dotel.jmx.target.system__ JVM option, such as __-Dotel.jmx.target.system=jetty,kafka-broker__ .
+
+## Reference
+
+- [Gaining JMX Metric Insights with the OpenTelemetry Java Agent](https://opentelemetry.io/blog/2023/jmx-metric-insight/)
+
+- [Otel jmx metrics](https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/jmx-metrics)
\ No newline at end of file
diff --git a/docs/en/docs/end-user/insight/quickstart/otel/golang-ebpf.md b/docs/en/docs/end-user/insight/quickstart/otel/golang-ebpf.md
new file mode 100644
index 0000000000..1dc5d92df4
--- /dev/null
+++ b/docs/en/docs/end-user/insight/quickstart/otel/golang-ebpf.md
@@ -0,0 +1,284 @@
+---
+MTPE: windsonsea
+Date: 2024-10-16
+---
+
+# Enhance Go apps with OTel auto-instrumentation
+
+If you don't want to manually change the application code, you can try the eBPF-based automatic enhancement method described on this page.
+This feature is currently under review for donation to the OpenTelemetry community and does not yet support Operator injection through annotations (this will be supported in the future), so you need to manually change the Deployment YAML or use a patch.
+
+## Prerequisites
+
+Make sure Insight Agent is ready.
If not, see [Install insight-agent to collect data](../install/install-agent.md) and make sure the following three items are in place: + +- Enable trace feature for Insight-agent +- Whether the address and port of the trace data are filled in correctly +- Pods corresponding to deployment/opentelemetry-operator-controller-manager and deployment/insight-agent-opentelemetry-collector are ready + +## Install Instrumentation CR + +Install under the Insight-system namespace, skip this step if it has already been installed. + +Note: This CR currently only supports the injection of environment variables (including service name and trace address) required to connect to Insight, and will support the injection of Golang probes in the future. + +```bash +kubectl apply -f - <- + http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 + - name: OTEL_EXPORTER_OTLP_TIMEOUT + value: '200' + - name: SPLUNK_TRACE_RESPONSE_HEADER_ENABLED + value: 'true' + - name: OTEL_SERVICE_NAME + value: voting + - name: OTEL_RESOURCE_ATTRIBUTES_POD_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.name + - name: OTEL_RESOURCE_ATTRIBUTES_POD_UID + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.uid + - name: OTEL_RESOURCE_ATTRIBUTES_NODE_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: spec.nodeName + - name: OTEL_PROPAGATORS + value: jaeger,b3 + - name: OTEL_TRACES_SAMPLER + value: always_on + - name: OTEL_RESOURCE_ATTRIBUTES + value: >- + k8s.container.name=voting-svc,k8s.deployment.name=voting,k8s.deployment.uid=79e015e2-4643-44c0-993c-e486aebaba10,k8s.namespace.name=default,k8s.node.name=$(OTEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.name=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME),k8s.pod.uid=$(OTEL_RESOURCE_ATTRIBUTES_POD_UID),k8s.replicaset.name=voting-84b696c897,k8s.replicaset.uid=63f56167-6632-415d-8b01-43a3db9891ff + resources: + requests: + cpu: 100m + volumeMounts: + - name: launcherdir + mountPath: /odigos-launcher + - name: kube-api-access-gwj5v + readOnly: true + mountPath: /var/run/secrets/kubernetes.io/serviceaccount + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: IfNotPresent + - name: emojivoto-voting-instrumentation + image: keyval/otel-go-agent:v0.6.0 + env: + - name: OTEL_TARGET_EXE + value: /usr/local/bin/emojivoto-voting-svc + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: jaeger:4317 + - name: OTEL_SERVICE_NAME + value: emojivoto-voting + resources: {} + volumeMounts: + - name: kernel-debug + mountPath: /sys/kernel/debug + - name: kube-api-access-gwj5v + readOnly: true + mountPath: /var/run/secrets/kubernetes.io/serviceaccount + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: IfNotPresent + securityContext: + capabilities: + add: + - SYS_PTRACE + privileged: true + runAsUser: 0 +······ +``` + +## Reference + +- [Getting Started with Go OpenTelemetry Automatic Instrumentation](https://github.com/keyval-dev/opentelemetry-go-instrumentation/blob/master/docs/getting-started/README.md) +- [Donating ebpf based instrumentation](https://github.com/open-telemetry/opentelemetry-go-instrumentation/pull/4) diff --git a/docs/en/docs/end-user/insight/quickstart/otel/golang/golang.md b/docs/en/docs/end-user/insight/quickstart/otel/golang/golang.md new file mode 100644 index 0000000000..9a6536f01b --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/otel/golang/golang.md @@ -0,0 +1,364 @@ +--- +MTPE: windsonsea +Date: 2024-10-16 +--- + +# Enhance Go 
applications with OTel SDK + +This page contains instructions on how to set up OpenTelemetry enhancements in a Go application. + +OpenTelemetry, also known simply as OTel, is an open-source observability framework that helps generate and collect telemetry data: traces, metrics, and logs in Go apps. + +## Enhance Go apps with the OpenTelemetry SDK + +### Install related dependencies + +Dependencies related to the OpenTelemetry exporter and SDK must be installed first. If you are using another request router, please refer to [request routing](#request-routing). +After switching/going into the application source folder run the following command: + +```golang +go get go.opentelemetry.io/otel@v1.8.0 \ + go.opentelemetry.io/otel/trace@v1.8.0 \ + go.opentelemetry.io/otel/sdk@v1.8.0 \ + go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin@v0.33.0 \ + go.opentelemetry.io/otel/exporters/otlp/otlptrace@v1.7.0 \ + go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.4.1 +``` + +### Create an initialization feature using the OpenTelemetry SDK + +In order for an application to be able to send data, a feature is required to initialize OpenTelemetry. Add the following code snippet to the __main.go__ file: + +```golang +import ( + "context" + "os" + "time" + + "go.opentelemetry.io/otel" + "go.opentelemetry.io/otel/exporters/otlp/otlptrace" + "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc" + "go.opentelemetry.io/otel/propagation" + "go.opentelemetry.io/otel/sdk/resource" + sdktrace "go.opentelemetry.io/otel/sdk/trace" + semconv "go.opentelemetry.io/otel/semconv/v1.7.0" + "go.uber.org/zap" + "google.golang.org/grpc" +) + +var tracerExp *otlptrace.Exporter + +func retryInitTracer() func() { + var shutdown func() + go func() { + for { + // otel will reconnected and re-send spans when otel col recover. so, we don't need to re-init tracer exporter. + if tracerExp == nil { + shutdown = initTracer() + } else { + break + } + time.Sleep(time.Minute * 5) + } + }() + return shutdown +} + +func initTracer() func() { + // temporarily set timeout to 10s + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + serviceName, ok := os.LookupEnv("OTEL_SERVICE_NAME") + if !ok { + serviceName = "server_name" + os.Setenv("OTEL_SERVICE_NAME", serviceName) + } + otelAgentAddr, ok := os.LookupEnv("OTEL_EXPORTER_OTLP_ENDPOINT") + if !ok { + otelAgentAddr = "http://localhost:4317" + os.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", otelAgentAddr) + } + zap.S().Infof("OTLP Trace connect to: %s with service name: %s", otelAgentAddr, serviceName) + + traceExporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure(), otlptracegrpc.WithDialOption(grpc.WithBlock())) + if err != nil { + handleErr(err, "OTLP Trace gRPC Creation") + return nil + } + + tracerProvider := sdktrace.NewTracerProvider( + sdktrace.WithBatcher(traceExporter), + sdktrace.WithSampler(sdktrace.AlwaysSample()), + sdktrace.WithResource(resource.NewWithAttributes(semconv.SchemaURL))) + + otel.SetTracerProvider(tracerProvider) + otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{})) + + tracerExp = traceExporter + return func() { + // Shutdown will flush any remaining spans and shut down the exporter. 
+ handleErr(tracerProvider.Shutdown(ctx), "failed to shutdown TracerProvider") + } +} + +func handleErr(err error, message string) { + if err != nil { + zap.S().Errorf("%s: %v", message, err) + } +} +``` + +### Initialize tracker in main.go + +Modify the main feature to initialize the tracker in main.go. Also when your service shuts down, you should call __TracerProvider.Shutdown()__ to ensure all spans are exported. The service makes the call as a deferred feature in the main function: + +```golang +func main() { + // start otel tracing + if shutdown := retryInitTracer(); shutdown != nil { + defer shutdown() + } + ...... +} +``` + +### Add OpenTelemetry Gin middleware to the application + +Configure Gin to use the middleware by adding the following line to __main.go__ : + +```golang +import ( + .... + "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin" +) + +func main() { + ...... + r := gin.Default() + r.Use(otelgin.Middleware("my-app")) + ...... +} +``` + +### Run the application + +- Local debugging and running + + > Note: This step is only used for local development and debugging. In the production environment, the Operator will automatically complete the injection of the following environment variables. + + The above steps have completed the work of initializing the SDK. Now if you need to develop and debug locally, you need to obtain the address of insight-agent-opentelemerty-collector in the insight-system namespace in advance, assuming: __insight-agent-opentelemetry-collector .insight-system.svc.cluster.local:4317__ . + + Therefore, you can add the following environment variables when you start the application locally: + + ```bash + OTEL_SERVICE_NAME=my-golang-app OTEL_EXPORTER_OTLP_ENDPOINT=http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317 go run main.go... + ``` + +- Running in a production environment + + Please refer to the introduction of __Only injecting environment variable annotations__ in [Achieving non-intrusive enhancement of applications through Operators](../operator.md) to add annotations to deployment yaml: + + ```console + instrumentation.opentelemetry.io/inject-sdk: "insight-system/insight-opentelemetry-autoinstrumentation" + ``` + + If you cannot use annotations, you can manually add the following environment variables to the deployment yaml: + +```yaml +······ +env: + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: 'http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317' + - name: OTEL_SERVICE_NAME + value: "your depolyment name" # modify it. + - name: OTEL_K8S_NAMESPACE + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + - name: OTEL_RESOURCE_ATTRIBUTES_NODE_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: spec.nodeName + - name: OTEL_RESOURCE_ATTRIBUTES_POD_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.name + - name: OTEL_RESOURCE_ATTRIBUTES + value: 'k8s.namespace.name=$(OTEL_K8S_NAMESPACE),k8s.node.name=$(OTEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.name=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME)' +······ +``` + +## Request Routing + +### OpenTelemetry gin/gonic enhancements + +```golang +# Add one line to your import() stanza depending upon your request router: +middleware "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin" +``` + +Then inject the OpenTelemetry middleware: + +```golang +router. Use(middleware. 
Middleware("my-app")) +``` + +### OpenTelemetry gorillamux enhancements + +```golang +# Add one line to your import() stanza depending upon your request router: +middleware "go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux" +``` + +Then inject the OpenTelemetry middleware: + +```golang +router. Use(middleware. Middleware("my-app")) +``` + +### gRPC enhancements + +Likewise, OpenTelemetry can help you auto-detect gRPC requests. To detect any gRPC server you have, add the interceptor to the server's instantiation. + +```golang +import ( + grpcotel "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc" +) +func main() { + [...] + + s := grpc.NewServer( + grpc.UnaryInterceptor(grpcotel.UnaryServerInterceptor()), + grpc.StreamInterceptor(grpcotel.StreamServerInterceptor()), + ) +} +``` + +It should be noted that if your program uses Grpc Client to call third-party services, you also need to add an interceptor to Grpc Client: + +```golang + [...] + + conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()), + grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()), + grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor()), + ) +``` + +### If not using request routing + +```golang +import ( + "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp" +) +``` + +Everywhere you pass http.Handler to ServeMux you will wrap the handler function. For example, the following replacements would be made: + +```golang +- mux.Handle("/path", h) ++ mux.Handle("/path", otelhttp.NewHandler(h, "description of path")) +--- +- mux.Handle("/path", http.HandlerFunc(f)) ++ mux.Handle("/path", otelhttp.NewHandler(http.HandlerFunc(f), "description of path")) +``` + +In this way, you can ensure that each feature wrapped with othttp will automatically collect its metadata and start the corresponding trace. + +## database enhancements + +### Golang Gorm + +The OpenTelemetry community has also developed middleware for database access libraries, such as Gorm: +```golang +import ( + "github.com/uptrace/opentelemetry-go-extra/otelgorm" + "gorm.io/driver/sqlite" + "gorm.io/gorm" +) + +db, err := gorm.Open(sqlite.Open("file::memory:?cache=shared"), &gorm.Config{}) +if err != nil { + panic(err) +} + +otelPlugin := otelgorm.NewPlugin(otelgorm.WithDBName("mydb"), # Missing this can lead to incomplete display of database related topology + otelgorm.WithAttributes(semconv.ServerAddress("memory"))) # Missing this can lead to incomplete display of database related topology +if err := db.Use(otelPlugin); err != nil { + panic(err) +} +``` + +### Custom Span + +In many cases, the middleware provided by OpenTelemetry cannot help us record more internally called features, and we need to customize Span to record + +```golang + ······ + _, span := otel.Tracer("GetServiceDetail").Start(ctx, + "spanMetricDao.GetServiceDetail", + trace.WithSpanKind(trace.SpanKindInternal)) + defer span.End() + ······ +``` + +### Add custom properties and custom events to span + +It is also possible to set a custom attribute or tag as a span. To add custom properties and events, follow these steps: + +### Import Tracking and Property Libraries + +```golang +import ( + ... + "go.opentelemetry.io/otel/attribute" + "go.opentelemetry.io/otel/trace" +) +``` + +### Get the current Span from the context + +```golang +span := trace.SpanFromContext(c.Request.Context()) +``` + +### Set properties in the current Span + +```golang +span.SetAttributes(attribute. 
String("controller", "books")) +``` + +### Add an Event to the current Span + +Adding span events is done using __AddEvent__ on the span object. + +```golang +span.AddEvent(msg) +``` + +## Log errors and exceptions + +```golang +import "go.opentelemetry.io/otel/codes" + +// Get the current span +span := trace.SpanFromContext(ctx) + +// RecordError will automatically convert an error into a span even +span.RecordError(err) + +// Flag this span as an error +span.SetStatus(codes.Error, "internal error") +``` + +## References + +For the Demo presentation, please refer to: + +- [otel-grpc-examples](https://github.com/openinsight-proj/otel-grpc-examples/tree/no-metadata-grpcgateway-v1.11.1) +- [opentelemetry-demo/productcatalogservice](https://github.com/open-telemetry/opentelemetry-demo/tree/main/src/productcatalogservice) +- [opentelemetry-collector-contrib/demo](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/examples/demo) diff --git a/docs/en/docs/end-user/insight/quickstart/otel/golang/meter.md b/docs/en/docs/end-user/insight/quickstart/otel/golang/meter.md new file mode 100644 index 0000000000..7393d775c0 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/otel/golang/meter.md @@ -0,0 +1,259 @@ +--- +MTPE: windsonsea +Date: 2024-10-16 +--- + +# Exposing Metrics for Applications Using OpenTelemetry SDK + +> This article is intended for users who wish to evaluate or explore the developing OTLP metrics. + +The OpenTelemetry project requires that APIs and SDKs must emit data in the OpenTelemetry Protocol (OTLP) for supported languages. + +## For Golang Applications + +Golang can expose runtime metrics through the SDK by adding the following methods to enable the metrics exporter within the application: + +### Install Required Dependencies + +Navigate to your application’s source folder and run the following command: + +```bash +go get go.opentelemetry.io/otel \ + go.opentelemetry.io/otel/attribute \ + go.opentelemetry.io/otel/exporters/prometheus \ + go.opentelemetry.io/otel/metric/global \ + go.opentelemetry.io/otel/metric/instrument \ + go.opentelemetry.io/otel/sdk/metric +``` + +### Create an Initialization Function Using OTel SDK + +```go +import ( + ..... 
+ + "go.opentelemetry.io/otel/attribute" + otelPrometheus "go.opentelemetry.io/otel/exporters/prometheus" + "go.opentelemetry.io/otel/metric/global" + "go.opentelemetry.io/otel/metric/instrument" + "go.opentelemetry.io/otel/sdk/metric/aggregator/histogram" + controller "go.opentelemetry.io/otel/sdk/metric/controller/basic" + "go.opentelemetry.io/otel/sdk/metric/export/aggregation" + processor "go.opentelemetry.io/otel/sdk/metric/processor/basic" + selector "go.opentelemetry.io/otel/sdk/metric/selector/simple" +) + +func (s *insightServer) initMeter() *otelPrometheus.Exporter { + s.meter = global.Meter("xxx") + + config := otelPrometheus.Config{ + DefaultHistogramBoundaries: []float64{1, 2, 5, 10, 20, 50}, + Gatherer: prometheus.DefaultGatherer, + Registry: prometheus.NewRegistry(), + Registerer: prometheus.DefaultRegisterer, + } + + c := controller.New( + processor.NewFactory( + selector.NewWithHistogramDistribution( + histogram.WithExplicitBoundaries(config.DefaultHistogramBoundaries), + ), + aggregation.CumulativeTemporalitySelector(), + processor.WithMemory(true), + ), + ) + + exporter, err := otelPrometheus.New(config, c) + if err != nil { + zap.S().Panicf("failed to initialize prometheus exporter %v", err) + } + + global.SetMeterProvider(exporter.MeterProvider()) + + http.HandleFunc("/metrics", exporter.ServeHTTP) + + go func() { + _ = http.ListenAndServe(fmt.Sprintf(":%d", 8888), nil) + }() + + zap.S().Info("Prometheus server running on ", fmt.Sprintf(":%d", port)) + return exporter +} +``` + +The above method will expose a metrics endpoint for your application at: `http://localhost:8888/metrics`. + +Next, initialize it in `main.go`: + +```go +func main() { + // ... + tp := initMeter() + // ... +} +``` + +If you want to add custom metrics, you can refer to the following: + +```go +// exposeClusterMetric exposes a metric like "insight_logging_count{} 1" +func (s *insightServer) exposeLoggingMetric(lserver *log.LogService) { + s.meter = global.Meter("insight.io/basic") + + var lock sync.Mutex + logCounter, err := s.meter.AsyncFloat64().Counter("insight_log_total") + if err != nil { + zap.S().Panicf("failed to initialize instrument: %v", err) + } + + _ = s.meter.RegisterCallback([]instrument.Asynchronous{logCounter}, func(ctx context.Context) { + lock.Lock() + defer lock.Unlock() + count, err := lserver.Count(ctx) + if err == nil || count != -1 { + logCounter.Observe(ctx, float64(count)) + } + }) +} +``` + +Then, call this method in `main.go`: + +```go +// ... +s.exposeLoggingMetric(lservice) +// ... +``` + +You can check if your metrics are working correctly by visiting `http://localhost:8888/metrics`. + +## For Java Applications + +For Java applications, you can directly expose JVM-related metrics by using the OpenTelemetry agent with the following environment variable: + +```bash +OTEL_METRICS_EXPORTER=prometheus +``` + +You can then check your metrics at `http://localhost:8888/metrics`. + +Next, combine it with a Prometheus `ServiceMonitor` to complete the metrics integration. If you want to expose custom metrics, please refer to [opentelemetry-java-docs/prometheus](https://github.com/open-telemetry/opentelemetry-java-docs/blob/main/prometheus/README.md). + +The process is mainly divided into two steps: + +- Create a meter provider and specify Prometheus as the exporter. 
+ +```java +/* + * Copyright The OpenTelemetry Authors + * SPDX-License-Identifier: Apache-2.0 + */ + +package io.opentelemetry.example.prometheus; + +import io.opentelemetry.api.metrics.MeterProvider; +import io.opentelemetry.exporter.prometheus.PrometheusHttpServer; +import io.opentelemetry.sdk.metrics.SdkMeterProvider; +import io.opentelemetry.sdk.metrics.export.MetricReader; + +public final class ExampleConfiguration { + + /** + * Initializes the Meter SDK and configures the Prometheus collector with all default settings. + * + * @param prometheusPort the port to open up for scraping. + * @return A MeterProvider for use in instrumentation. + */ + static MeterProvider initializeOpenTelemetry(int prometheusPort) { + MetricReader prometheusReader = PrometheusHttpServer.builder().setPort(prometheusPort).build(); + + return SdkMeterProvider.builder().registerMetricReader(prometheusReader).build(); + } +} +``` + +- Create a custom meter and start the HTTP server. + +```java +package io.opentelemetry.example.prometheus; + +import io.opentelemetry.api.common.Attributes; +import io.opentelemetry.api.metrics.Meter; +import io.opentelemetry.api.metrics.MeterProvider; +import java.util.concurrent.ThreadLocalRandom; + +/** + * Example of using the PrometheusHttpServer to convert OTel metrics to Prometheus format and expose + * these to a Prometheus instance via a HttpServer exporter. + * + *

A Gauge is used to periodically measure how many incoming messages are awaiting processing. + * The Gauge callback gets executed every collection interval. + */ +public final class PrometheusExample { + private long incomingMessageCount; + + public PrometheusExample(MeterProvider meterProvider) { + Meter meter = meterProvider.get("PrometheusExample"); + meter + .gaugeBuilder("incoming.messages") + .setDescription("No of incoming messages awaiting processing") + .setUnit("message") + .buildWithCallback(result -> result.record(incomingMessageCount, Attributes.empty())); + } + + void simulate() { + for (int i = 500; i > 0; i--) { + try { + System.out.println( + i + " Iterations to go, current incomingMessageCount is: " + incomingMessageCount); + incomingMessageCount = ThreadLocalRandom.current().nextLong(100); + Thread.sleep(1000); + } catch (InterruptedException e) { + // ignored here + } + } + } + + public static void main(String[] args) { + int prometheusPort = 8888; + + // It is important to initialize the OpenTelemetry SDK as early as possible in your process. + MeterProvider meterProvider = ExampleConfiguration.initializeOpenTelemetry(prometheusPort); + + PrometheusExample prometheusExample = new PrometheusExample(meterProvider); + + prometheusExample.simulate(); + + System.out.println("Exiting"); + } +} +``` + +After running the Java application, you can check if your metrics are working correctly by visiting `http://localhost:8888/metrics`. + +## Insight Collecting Metrics + +Lastly, it is important to note that you have exposed metrics in your application, and now you need Insight to collect those metrics. + +The recommended way to expose metrics is via [ServiceMonitor](https://github.com/prometheus-operator/prometheus-operator/blob/501d079e3d3769b94dca6684cf155034e468829a/Documentation/design.md#servicemonitor) or PodMonitor. + +### Creating ServiceMonitor/PodMonitor + +The added ServiceMonitor/PodMonitor needs to have the label `operator.insight.io/managed-by: insight` for the Operator to recognize it: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: example-app + labels: + operator.insight.io/managed-by: insight +spec: + selector: + matchLabels: + app: example-app + endpoints: + - port: web + namespaceSelector: + any: true +``` diff --git a/docs/en/docs/end-user/insight/quickstart/otel/java/index.md b/docs/en/docs/end-user/insight/quickstart/otel/java/index.md new file mode 100644 index 0000000000..25f95da940 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/otel/java/index.md @@ -0,0 +1,21 @@ +--- +MTPE: windsonsea +Date: 2024-10-16 +--- + +# Start Monitoring Java Applications + +1. For accessing and monitoring Java application links, please refer to the document [Implementing Non-Intrusive Enhancements for Applications via Operator](../operator.md), which explains how to automatically integrate links through annotations. + +2. Monitoring the JVM of Java applications: How Java applications that have already exposed JVM metrics and those that have not yet exposed JVM metrics can connect with observability Insight. 
+ +- If your Java application has not yet started exposing JVM metrics, you can refer to the following documents: + + - [Exposing JVM Monitoring Metrics Using JMX Exporter](./jvm-monitor/jmx-exporter.md) + - [Exposing JVM Monitoring Metrics Using OpenTelemetry Java Agent](./jvm-monitor/otel-java-agent.md) + +- If your Java application has already exposed JVM metrics, you can refer to the following document: + + - [Connecting Existing JVM Metrics of Java Applications to Observability](./jvm-monitor/legacy-jvm.md) + +3. [Writing TraceId and SpanId into Java Application Logs](./mdc.md) to correlate link data with log data. diff --git a/docs/en/docs/end-user/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.md b/docs/en/docs/end-user/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.md new file mode 100644 index 0000000000..7dd4877707 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/otel/java/jvm-monitor/jmx-exporter.md @@ -0,0 +1,134 @@ +--- +MTPE: windsonsea +Date: 2024-10-16 +--- + +# Exposing JVM Monitoring Metrics Using JMX Exporter + +JMX Exporter provides two usage methods: + +1. **Standalone Process**: Specify parameters when starting the JVM to expose a JMX RMI interface. The JMX Exporter calls RMI to obtain the JVM runtime state data, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape. +2. **In-Process (JVM process)**: Specify parameters when starting the JVM to run the JMX Exporter jar file as a javaagent. This method reads the JVM runtime state data in-process, converts it into Prometheus metrics format, and exposes a port for Prometheus to scrape. + +!!! note + + The official recommendation is not to use the first method due to its complex configuration and the requirement for a separate process, which introduces additional monitoring challenges. Therefore, this article focuses on the second method, detailing how to use JMX Exporter to expose JVM monitoring metrics in a Kubernetes environment. + +In this method, you need to specify the JMX Exporter jar file and configuration file when starting the JVM. Since the jar file is a binary file that is not ideal for mounting via a configmap, and the configuration file typically does not require modifications, it is recommended to package both the JMX Exporter jar file and the configuration file directly into the business container image. + +For the second method, you can choose to include the JMX Exporter jar file in the application image or mount it during deployment. Below are explanations for both approaches: + +## Method 1: Building JMX Exporter JAR File into the Business Image + +The content of `prometheus-jmx-config.yaml` is as follows: + +```yaml title="prometheus-jmx-config.yaml" +... +ssl: false +lowercaseOutputName: false +lowercaseOutputLabelNames: false +rules: +- pattern: ".*" +``` + +!!! note + + For more configuration options, please refer to the introduction at the bottom or [Prometheus official documentation](https://github.com/prometheus/jmx_exporter#configuration). + +Next, prepare the jar file. 
You can find the latest jar download link on the [jmx_exporter](https://github.com/prometheus/jmx_exporter) GitHub page and refer to the following Dockerfile: + +```shell +FROM openjdk:11.0.15-jre +WORKDIR /app/ +COPY target/my-app.jar ./ +COPY prometheus-jmx-config.yaml ./ +RUN set -ex; \ + curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar; +ENV JAVA_TOOL_OPTIONS=-javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml +EXPOSE 8081 8999 8080 8888 +ENTRYPOINT java $JAVA_OPTS -jar my-app.jar +``` + +Note: + +- The format for the startup parameter is: `-javaagent:=:` +- Here, port 8088 is used to expose JVM monitoring metrics; you may change it if it conflicts with the Java application. + +## Method 2: Mounting via Init Container + +First, we need to create a Docker image for the JMX Exporter. The following Dockerfile is for reference: + +```shell +FROM alpine/curl:3.14 +WORKDIR /app/ +# Copy the previously created config file into the image +COPY prometheus-jmx-config.yaml ./ +# Download the jmx prometheus javaagent jar online +RUN set -ex; \ + curl -L -O https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.2/jmx_prometheus_javaagent-0.17.2.jar; +``` + +Build the image using the above Dockerfile: `docker build -t my-jmx-exporter .` + +Add the following init container to the Java application deployment YAML: + +??? note "Click to expand YAML file" + + ```yaml + apiVersion: apps/v1 + kind: Deployment + metadata: + name: my-demo-app + labels: + app: my-demo-app + spec: + selector: + matchLabels: + app: my-demo-app + template: + metadata: + labels: + app: my-demo-app + spec: + imagePullSecrets: + - name: registry-pull + initContainers: + - name: jmx-sidecar + image: my-jmx-exporter + command: ["cp", "-r", "/app/jmx_prometheus_javaagent-0.17.2.jar", "/target/jmx_prometheus_javaagent-0.17.2.jar"] ➊ + volumeMounts: + - name: sidecar + mountPath: /target + containers: + - image: my-demo-app-image + name: my-demo-app + resources: + requests: + memory: "1000Mi" + cpu: "500m" + limits: + memory: "1000Mi" + cpu: "500m" + ports: + - containerPort: 18083 + env: + - name: JAVA_TOOL_OPTIONS + value: "-javaagent:/app/jmx_prometheus_javaagent-0.17.2.jar=8088:/app/prometheus-jmx-config.yaml" ➋ + volumeMounts: + - name: host-time + mountPath: /etc/localtime + readOnly: true + - name: sidecar + mountPath: /sidecar + volumes: + - name: host-time + hostPath: + path: /etc/localtime + - name: sidecar # Shared agent folder + emptyDir: {} + restartPolicy: Always + ``` + +With the above modifications, the example application `my-demo-app` now has the capability to expose JVM metrics. After running the service, you can access the Prometheus formatted metrics at `http://localhost:8088`. + +Next, you can refer to [Connecting Existing JVM Metrics of Java Applications to Observability](./legacy-jvm.md). 
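+
+If you want to sanity-check the agent before deploying, you can run a quick local smoke test. This sketch assumes `my-app.jar`, the downloaded agent jar, and `prometheus-jmx-config.yaml` are all in the current directory:
+
+```shell
+# Start the application with the JMX Exporter javaagent attached (same format as JAVA_TOOL_OPTIONS above)
+java -javaagent:./jmx_prometheus_javaagent-0.17.2.jar=8088:./prometheus-jmx-config.yaml -jar my-app.jar &
+
+# The agent serves Prometheus-formatted metrics on port 8088
+curl -s http://localhost:8088/metrics | head
+```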
diff --git a/docs/en/docs/end-user/insight/quickstart/otel/java/jvm-monitor/legacy-jvm.md b/docs/en/docs/end-user/insight/quickstart/otel/java/jvm-monitor/legacy-jvm.md new file mode 100644 index 0000000000..d52c973eea --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/otel/java/jvm-monitor/legacy-jvm.md @@ -0,0 +1,90 @@ +--- +MTPE: windsonsea +Date: 2024-10-16 +--- + +# Integrating Existing JVM Metrics of Java Applications with Observability + +If your Java application exposes JVM monitoring metrics through other means (such as Spring Boot Actuator), you will need to ensure that the monitoring data is collected. You can achieve this by adding annotations (Kubernetes Annotations) to your workload to allow Insight to scrape the existing JVM metrics: + +```yaml +annotations: + insight.opentelemetry.io/metric-scrape: "true" # Whether to scrape + insight.opentelemetry.io/metric-path: "/" # Path to scrape metrics + insight.opentelemetry.io/metric-port: "9464" # Port to scrape metrics +``` + +For example, to add annotations to the **my-deployment-app**: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: my-deployment-app +spec: + selector: + matchLabels: + app: my-deployment-app + app.kubernetes.io/name: my-deployment-app + replicas: 1 + template: + metadata: + labels: + app: my-deployment-app + app.kubernetes.io/name: my-deployment-app + annotations: + insight.opentelemetry.io/metric-scrape: "true" # Whether to scrape + insight.opentelemetry.io/metric-path: "/" # Path to scrape metrics + insight.opentelemetry.io/metric-port: "9464" # Port to scrape metrics +``` + +Here is a complete example: + +```yaml +--- +apiVersion: v1 +kind: Service +metadata: + name: spring-boot-actuator-prometheus-metrics-demo +spec: + type: NodePort + selector: + app.kubernetes.io/name: spring-boot-actuator-prometheus-metrics-demo + ports: + - name: http + port: 8080 +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: spring-boot-actuator-prometheus-metrics-demo +spec: + selector: + matchLabels: + app.kubernetes.io/name: spring-boot-actuator-prometheus-metrics-demo + replicas: 1 + template: + metadata: + labels: + app.kubernetes.io/name: spring-boot-actuator-prometheus-metrics-demo + annotations: + insight.opentelemetry.io/metric-scrape: "true" # Whether to scrape + insight.opentelemetry.io/metric-path: "/actuator/prometheus" # Path to scrape metrics + insight.opentelemetry.io/metric-port: "8080" # Port to scrape metrics + spec: + containers: + - name: myapp + image: docker.m.daocloud.io/wutang/spring-boot-actuator-prometheus-metrics-demo + ports: + - name: http + containerPort: 8080 + resources: + limits: + cpu: 500m + memory: 800Mi + requests: + cpu: 200m + memory: 400Mi +``` + +In the above example, Insight will scrape the Prometheus metrics exposed through **Spring Boot Actuator** via `http://:8080/actuator/prometheus`. diff --git a/docs/en/docs/end-user/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.md b/docs/en/docs/end-user/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.md new file mode 100644 index 0000000000..df4c10dbe7 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/otel/java/jvm-monitor/otel-java-agent.md @@ -0,0 +1,28 @@ +--- +MTPE: windsonsea +Date: 2024-10-16 +--- + +# Exposing JVM Metrics Using OpenTelemetry Java Agent + +Starting from OpenTelemetry Agent v1.20.0 and later, the OpenTelemetry Agent has introduced the JMX Metric Insight module. 
If your application is already integrated with the OpenTelemetry Agent for tracing, you no longer need to introduce another agent to expose JMX metrics for your application. The OpenTelemetry Agent collects and exposes metrics by detecting the locally available MBeans in the application. + +The OpenTelemetry Agent also provides built-in monitoring examples for common Java servers or frameworks. Please refer to the [Predefined Metrics](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/semantic-conventions.md). + +When using the OpenTelemetry Java Agent, you also need to consider how to mount the JAR into the container. In addition to the methods for mounting the JAR file as described with the JMX Exporter, you can leverage the capabilities provided by the OpenTelemetry Operator to automatically enable JVM metrics exposure for your application. + +If your application is already integrated with the OpenTelemetry Agent for tracing, you do not need to introduce another agent to expose JMX metrics. The OpenTelemetry Agent can now locally collect and expose metrics interfaces by detecting the locally available MBeans in the application. + +However, as of the current version, you still need to manually add the appropriate annotations to your application for the JVM data to be collected by Insight. For specific annotation content, please refer to [Integrating Existing JVM Metrics of Java Applications with Observability](./legacy-jvm.md). + +## Exposing Metrics for Java Middleware + +The OpenTelemetry Agent also includes built-in examples for monitoring middleware. Please refer to the [Predefined Metrics](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/jmx-metrics/javaagent/README.md#predefined-metrics). + +By default, no specific types are designated; you need to specify them using the `-Dotel.jmx.target.system` JVM options, for example, `-Dotel.jmx.target.system=jetty,kafka-broker`. + +## References + +- [Gaining JMX Metric Insights with the OpenTelemetry Java Agent](https://opentelemetry.io/blog/2023/jmx-metric-insight/) + +- [Otel JMX Metrics](https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/jmx-metrics) diff --git a/docs/en/docs/end-user/insight/quickstart/otel/java/mdc.md b/docs/en/docs/end-user/insight/quickstart/otel/java/mdc.md new file mode 100644 index 0000000000..e40cbd7c84 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/otel/java/mdc.md @@ -0,0 +1,115 @@ +--- +MTPE: windsonsea +Date: 2024-10-16 +--- + +# Writing TraceId and SpanId into Java Application Logs + +This article explains how to automatically write TraceId and SpanId into Java application logs using OpenTelemetry. By including TraceId and SpanId in your logs, you can correlate distributed tracing data with log data, enabling more efficient fault diagnosis and performance analysis. + +## Supported Logging Libraries + +For more information, please refer to the [Logger MDC auto-instrumentation](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/logger-mdc-instrumentation.md). 
+ +| Logging Framework | Supported Automatic Instrumentation Versions | Dependencies Required for Manual Instrumentation | +| ------------------ | ----------------------------------------- | ----------------------------------------------- | +| Log4j 1 | 1.2+ | None | +| Log4j 2 | 2.7+ | [opentelemetry-log4j-context-data-2.17-autoconfigure](https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/log4j/log4j-context-data/log4j-context-data-2.17/library-autoconfigure) | +| Logback | 1.0+ | [opentelemetry-logback-mdc-1.0](https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-mdc-1.0/library) | + +## Using Logback (Spring Boot Project) + +Spring Boot projects come with a built-in logging framework and use Logback as the default logging implementation. If your Java project is a Spring Boot project, you can write TraceId into logs with minimal configuration. + +Set `logging.pattern.level` in `application.properties`, adding `%mdc{trace_id}` and `%mdc{span_id}` to the logs. + +```bash +logging.pattern.level=trace_id=%mdc{trace_id} span_id=%mdc{span_id} %5p ....omited... +``` + +Here is an example of the logs: + +```console +2024-06-26 10:56:31.200 trace_id=8f7ebd8a73f9a8f50e6a00a87a20952a span_id=1b08f18b8858bb9a INFO 53724 --- [nio-8081-exec-1] o.a.c.c.C.[Tomcat].[localhost].[/] : Initializing Spring DispatcherServlet 'dispatcherServlet' +2024-06-26 10:56:31.201 trace_id=8f7ebd8a73f9a8f50e6a00a87a20952a span_id=1b08f18b8858bb9a INFO 53724 --- [nio-8081-exec-1] o.s.web.servlet.DispatcherServlet : Initializing Servlet 'dispatcherServlet' +2024-06-26 10:56:31.209 trace_id=8f7ebd8a73f9a8f50e6a00a87a20952a span_id=1b08f18b8858bb9a INFO 53724 --- [nio-8081-exec-1] o.s.web.servlet.DispatcherServlet : Completed initialization in 8 ms +2024-06-26 10:56:31.296 trace_id=8f7ebd8a73f9a8f50e6a00a87a20952a span_id=5743699405074f4e INFO 53724 --- [nio-8081-exec-1] com.example.httpserver.ot.OTServer : hello world +``` + +## Using Log4j2 + +1. Add `OpenTelemetry Log4j2` dependency in `pom.xml`: + + !!! tip + + Please replace `OPENTELEMETRY_VERSION` with the [latest version](https://central.sonatype.com/artifact/io.opentelemetry.instrumentation/opentelemetry-log4j-context-data-2.17-autoconfigure/versions). + + ```xml + + + io.opentelemetry.instrumentation + opentelemetry-log4j-context-data-2.17-autoconfigure + OPENTELEMETRY_VERSION + runtime + + + ``` + +2. Modify the `log4j2.xml` configuration, adding `%X{trace_id}` and `%X{span_id}` in the `pattern` to automatically write TraceId and SpanId into the logs: + + ```xml + + + + + + + + + + + + + + ``` + +3. If using Logback, add `OpenTelemetry Logback` dependency in `pom.xml`. + + !!! tip + + Please replace `OPENTELEMETRY_VERSION` with the [latest version](https://central.sonatype.com/artifact/io.opentelemetry.instrumentation/opentelemetry-log4j-context-data-2.17-autoconfigure/versions). + + ```xml + + + io.opentelemetry.instrumentation + opentelemetry-logback-mdc-1.0 + OPENTELEMETRY_VERSION + + + ``` + +4. 
Modify the `log4j2.xml` configuration, adding `%X{trace_id}` and `%X{span_id}` in the `pattern` to automatically write TraceId and SpanId into the logs: + + ```xml + + + + + %d{HH:mm:ss.SSS} trace_id=%X{trace_id} span_id=%X{span_id} trace_flags=%X{trace_flags} %msg%n + + + + + + + + + + + + + + + ``` diff --git a/docs/en/docs/end-user/insight/quickstart/otel/operator.md b/docs/en/docs/end-user/insight/quickstart/otel/operator.md new file mode 100644 index 0000000000..f593456754 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/otel/operator.md @@ -0,0 +1,569 @@ +--- +MTPE: windsonsea +Date: 2024-10-16 +--- + +# Enhance Applications Non-Intrusively with Operators + +Currently, only Java, Node.js, Python, .NET, and Golang support non-intrusive integration through the Operator approach. + +## Prerequisites + +Please ensure that the insight-agent is ready. If not, please refer to +[Install insight-agent for data collection](../install/install-agent.md) +and make sure the following three items are ready: + +- Enable trace functionality for insight-agent +- Check if the address and port for trace data are correctly filled +- Ensure that the Pods corresponding to deployment/insight-agent-opentelemetry-operator and + deployment/insight-agent-opentelemetry-collector are ready + +## Install Instrumentation CR + +!!! tip + + Starting from Insight v0.22.0, there is no longer a need to manually install the Instrumentation CR. + +Install it in the insight-system namespace. There are some minor differences between different versions. + +=== "Insight v0.21.x" + + ```bash + K8S_CLUSTER_UID=$(kubectl get namespace kube-system -o jsonpath='{.metadata.uid}') + kubectl apply -f - < language specific env vars -> common env vars -> instrument spec configs' vars + ``` + + However, it is important to avoid manually overriding __OTEL_RESOURCE_ATTRIBUTES_NODE_NAME__ . + This variable serves as an identifier within the operator to determine if a pod has already + been injected with a probe. Manually adding this variable may prevent the probe from being + injected successfully. + +## Automatic injection Demo + +Note that the `annotation` is added under spec.annotations. 
+ +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: my-app + labels: + app: my-app +spec: + selector: + matchLabels: + app: my-app + replicas: 1 + template: + metadata: + labels: + app: my-app + annotations: + instrumentation.opentelemetry.io/inject-java: "insight-system/insight-opentelemetry-autoinstrumentation" + spec: + containers: + - name: myapp + image: jaegertracing/vertx-create-span:operator-e2e-tests + ports: + - containerPort: 8080 + protocol: TCP +``` + +The final generated YAML is as follows: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: my-deployment-with-sidecar-565bd877dd-nqkk6 + generateName: my-deployment-with-sidecar-565bd877dd- + namespace: default + uid: aa89ca0d-620c-4d20-8bc1-37d67bad4ea4 + resourceVersion: '2668986' + creationTimestamp: '2022-04-08T05:58:48Z' + labels: + app: my-pod-with-sidecar + pod-template-hash: 565bd877dd + annotations: + cni.projectcalico.org/containerID: 234eae5e55ea53db2a4bc2c0384b9a1021ed3908f82a675e4a92a49a7e80dd61 + cni.projectcalico.org/podIP: 192.168.134.133/32 + cni.projectcalico.org/podIPs: 192.168.134.133/32 + instrumentation.opentelemetry.io/inject-java: "insight-system/insight-opentelemetry-autoinstrumentation" +spec: + volumes: + - name: kube-api-access-sp2mz + projected: + sources: + - serviceAccountToken: + expirationSeconds: 3607 + path: token + - configMap: + name: kube-root-ca.crt + items: + - key: ca.crt + path: ca.crt + - downwardAPI: + items: + - path: namespace + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + defaultMode: 420 + - name: opentelemetry-auto-instrumentation + emptyDir: {} + initContainers: + - name: opentelemetry-auto-instrumentation + image: >- + ghcr.m.daocloud.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java + command: + - cp + - /javaagent.jar + - /otel-auto-instrumentation/javaagent.jar + resources: {} + volumeMounts: + - name: opentelemetry-auto-instrumentation + mountPath: /otel-auto-instrumentation + - name: kube-api-access-sp2mz + readOnly: true + mountPath: /var/run/secrets/kubernetes.io/serviceaccount + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: Always + containers: + - name: myapp + image: ghcr.io/pavolloffay/spring-petclinic:latest + env: + - name: OTEL_JAVAAGENT_DEBUG + value: 'true' + - name: OTEL_INSTRUMENTATION_JDBC_ENABLED + value: 'true' + - name: SPLUNK_PROFILER_ENABLED + value: 'false' + - name: JAVA_TOOL_OPTIONS + value: ' -javaagent:/otel-auto-instrumentation/javaagent.jar' + - name: OTEL_TRACES_EXPORTER + value: otlp + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: http://insight-agent-opentelemetry-collector.svc.cluster.local:4317 + - name: OTEL_EXPORTER_OTLP_TIMEOUT + value: '20' + - name: OTEL_TRACES_SAMPLER + value: parentbased_traceidratio + - name: OTEL_TRACES_SAMPLER_ARG + value: '0.85' + - name: SPLUNK_TRACE_RESPONSE_HEADER_ENABLED + value: 'true' + - name: OTEL_SERVICE_NAME + value: my-deployment-with-sidecar + - name: OTEL_RESOURCE_ATTRIBUTES_POD_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.name + - name: OTEL_RESOURCE_ATTRIBUTES_POD_UID + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.uid + - name: OTEL_RESOURCE_ATTRIBUTES_NODE_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: spec.nodeName + - name: OTEL_RESOURCE_ATTRIBUTES + value: >- + 
k8s.container.name=myapp,k8s.deployment.name=my-deployment-with-sidecar,k8s.deployment.uid=8de6929d-dda0-436c-bca1-604e9ca7ea4e,k8s.namespace.name=default,k8s.node.name=$(OTEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.name=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME),k8s.pod.uid=$(OTEL_RESOURCE_ATTRIBUTES_POD_UID),k8s.replicaset.name=my-deployment-with-sidecar-565bd877dd,k8s.replicaset.uid=190d5f6e-ba7f-4794-b2e6-390b5879a6c4 + - name: OTEL_PROPAGATORS + value: jaeger,b3 + resources: {} + volumeMounts: + - name: kube-api-access-sp2mz + readOnly: true + mountPath: /var/run/secrets/kubernetes.io/serviceaccount + - name: opentelemetry-auto-instrumentation + mountPath: /otel-auto-instrumentation + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: Always + restartPolicy: Always + terminationGracePeriodSeconds: 30 + dnsPolicy: ClusterFirst + serviceAccountName: default + serviceAccount: default + nodeName: k8s-master3 + securityContext: + runAsUser: 1000 + runAsGroup: 3000 + fsGroup: 2000 + schedulerName: default-scheduler + tolerations: + - key: node.kubernetes.io/not-ready + operator: Exists + effect: NoExecute + tolerationSeconds: 300 + - key: node.kubernetes.io/unreachable + operator: Exists + effect: NoExecute + tolerationSeconds: 300 + priority: 0 + enableServiceLinks: true + preemptionPolicy: PreemptLowerPriority +``` + +## Trace query + +How to query the connected services, refer to [Trace Query](../../trace/trace.md). diff --git a/docs/en/docs/end-user/insight/quickstart/otel/otel.md b/docs/en/docs/end-user/insight/quickstart/otel/otel.md new file mode 100644 index 0000000000..8739712399 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/otel/otel.md @@ -0,0 +1,35 @@ +--- +MTPE: windsonsea +Date: 2024-10-16 +--- + +# Use OTel to provide the application observability + +> Enhancement is the process of enabling application code to generate telemetry data. i.e. something that helps you monitor or measure the performance and status of your application. + +OpenTelemetry is a leading open source project providing instrumentation libraries for major programming languages ​​and popular frameworks. It is a project under the Cloud Native Computing Foundation and is supported by the vast resources of the community. +It provides a standardized data format for collected data without the need to integrate specific vendors. + +Insight supports OpenTelemetry for application instrumentation to enhance your applications. + +This guide introduces the basic concepts of telemetry enhancement using OpenTelemetry. +OpenTelemetry also has an ecosystem of libraries, plugins, integrations, and other useful tools to extend it. +You can find these resources at the [OTel Registry](https://opentelemetry.io/registry/). + +You can use any open standard library for telemetry enhancement and use Insight as an observability backend to ingest, analyze, and visualize data. + +To enhance your code, you can use the enhanced operations provided by OpenTelemetry for specific languages: + +Insight currently provides an easy way to enhance .Net NodeJS, Java, Python and Golang applications with OpenTelemetry. Please follow the guidelines below. 
+
+## Trace Enhancement
+
+- Best practices for integrating traces: [Application Non-Intrusive Enhancement via Operator](./operator.md)
+- Manual instrumentation with Go language as an example: [Enhance Go application with OpenTelemetry SDK](golang/golang.md)
+- [Using ebpf to implement non-intrusive auto-instrumentation in Go language](./golang-ebpf.md) (experimental feature)
+
+ 
\ No newline at end of file
diff --git a/docs/en/docs/end-user/insight/quickstart/otel/send_tracing_to_insight.md b/docs/en/docs/end-user/insight/quickstart/otel/send_tracing_to_insight.md
new file mode 100644
index 0000000000..3ec7fd26ec
--- /dev/null
+++ b/docs/en/docs/end-user/insight/quickstart/otel/send_tracing_to_insight.md
@@ -0,0 +1,102 @@
+# Sending Trace Data to Insight
+
+This document describes how customers can send trace data to Insight on their own. It mainly includes the following two scenarios:
+
+1. Customer apps report traces to Insight through OTEL Agent/SDK
+2. Forwarding traces to Insight through Opentelemetry Collector (OTEL COL)
+
+In each cluster where Insight Agent is installed, there is an __insight-agent-otel-col__ component
+that is used to receive trace data from that cluster. Therefore, this component serves as the
+entry point for user access, and you need to obtain its address first. You can get the address of
+the Opentelemetry Collector in the cluster through the AI platform interface, for example
+ __insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317__ .
+
+In addition, there are some slight differences for different reporting methods:
+
+## Customer apps report traces to Insight through OTEL Agent/SDK
+
+To successfully report trace data to Insight and display it properly, it is recommended to provide
+the required metadata (Resource Attributes) for OTLP through the following environment variables.
+There are two ways to achieve this:
+
+- Manually add them to the deployment YAML file, for example:
+
+    ```yaml
+    ...
+ - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: "http://insight-agent-opentelemetry-collector.insight-system.svc.cluster.local:4317" + - name: "OTEL_SERVICE_NAME" + value: my-java-app-name + - name: "OTEL_K8S_NAMESPACE" + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + - name: OTEL_RESOURCE_ATTRIBUTES_NODE_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: spec.nodeName + - name: OTEL_RESOURCE_ATTRIBUTES_POD_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.name + - name: OTEL_RESOURCE_ATTRIBUTES + value: "k8s.namespace.name=$(OTEL_K8S_NAMESPACE),k8s.node.name=$(OTEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.name=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME)" + ``` + +- Use the automatic injection capability of Insight Agent to inject the metadata (Resource Attributes) + + Ensure that Insight Agent is working properly and after [installing the Instrumentation CR](./operator.md#instrumentation-cr), + you only need to add the following annotation to the Pod: + + ```console + instrumentation.opentelemetry.io/inject-sdk: "insight-system/insight-opentelemetry-autoinstrumentation" + ``` + + For example: + + ```yaml + apiVersion: apps/v1 + kind: Deployment + metadata: + name: my-deployment-with-aotu-instrumentation + spec: + selector: + matchLabels: + app.kubernetes.io/name: my-deployment-with-aotu-instrumentation-kuberntes + replicas: 1 + template: + metadata: + labels: + app.kubernetes.io/name: my-deployment-with-aotu-instrumentation-kuberntes + annotations: + sidecar.opentelemetry.io/inject: "false" + instrumentation.opentelemetry.io/inject-sdk: "insight-system/insight-opentelemetry-autoinstrumentation" + ``` + +## Forwarding traces to Insight through Opentelemetry Collector + +After ensuring that the application has added the metadata mentioned above, you only need to add +an OTLP Exporter in your customer's Opentelemetry Collector to forward the trace data to +Insight Agent Opentelemetry Collector. Below is an example Opentelemetry Collector configuration file: + +```yaml +... +exporters: + otlp/insight: + endpoint: insight-opentelemetry-collector.insight-system.svc.cluster.local:4317 +service: +... +pipelines: +... +traces: + exporters: + - otlp/insight +``` + +## References + +- [Enhancing Applications Non-intrusively with the Operator](./operator.md) +- [Achieving Observability with OTel](./otel.md) diff --git a/docs/en/docs/end-user/insight/quickstart/other/install-agent-on-ocp.md b/docs/en/docs/end-user/insight/quickstart/other/install-agent-on-ocp.md new file mode 100644 index 0000000000..d1bc7ff2ae --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/other/install-agent-on-ocp.md @@ -0,0 +1,68 @@ +# OpenShift Install Insight Agent + +Although the OpenShift system comes with a monitoring system, we will still install Insight Agent because of some rules in the data collection agreement. 
+ +Among them, in addition to the basic installation configuration, the following parameters need to be added during helm install: + +```bash +## Parameters related to fluentbit; +--set fluent-bit.ocp.enabled=true \ +--set fluent-bit.serviceAccount.create=false \ +--set fluent-bit.securityContext.runAsUser=0 \ +--set fluent-bit.securityContext.seLinuxOptions.type=spc_t \ +--set fluent-bit.securityContext.readOnlyRootFilesystem=false \ +--set fluent-bit.securityContext.allowPrivilegeEscalation=false \ + +## Enable Prometheus(CR) for OpenShift4.x +--set compatibility.openshift.prometheus.enabled=true \ + +## Close the Prometheus instance of the higher version +--set kube-prometheus-stack.prometheus.enabled=false \ +--set kube-prometheus-stack.kubeApiServer.enabled=false \ +--set kube-prometheus-stack.kubelet.enabled=false \ +--set kube-prometheus-stack.kubeControllerManager.enabled=false \ +--set kube-prometheus-stack.coreDns.enabled=false \ +--set kube-prometheus-stack.kubeDns.enabled=false \ +--set kube-prometheus-stack.kubeEtcd.enabled=false \ +--set kube-prometheus-stack.kubeEtcd.enabled=false \ +--set kube-prometheus-stack.kubeScheduler.enabled=false \ +--set kube-prometheus-stack.kubeStateMetrics.enabled=false \ +--set kube-prometheus-stack.nodeExporter.enabled=false \ + +## Limit the namespace processed by PrometheusOperator to avoid competition with OpenShift's own PrometheusOperator +--set kube-prometheus-stack.prometheusOperator.kubeletService.namespace="insight-system" \ +--set kube-prometheus-stack.prometheusOperator.prometheusInstanceNamespaces="insight-system" \ +--set kube-prometheus-stack.prometheusOperator.denyNamespaces[0]="openshift-monitoring" \ +--set kube-prometheus-stack.prometheusOperator.denyNamespaces[1]="openshift-user-workload-monitoring" \ +--set kube-prometheus-stack.prometheusOperator.denyNamespaces[2]="openshift-customer-monitoring" \ +--set kube-prometheus-stack.prometheusOperator.denyNamespaces[3]="openshift-route-monitor-operator" \ +``` + +### Write system monitoring data into Prometheus through OpenShift's own mechanism + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: cluster-monitoring-config + namespace: openshift-monitoring +data: + config.yaml: | + prometheusK8s: + remoteWrite: + - queueConfig: + batchSendDeadline: 60s + maxBackoff: 5s + minBackoff: 30ms + minShards: 1 + capacity: 5000 + maxSamplesPerSend: 1000 + maxShards: 100 + remoteTimeout: 30s + url: http://insight-agent-prometheus.insight-system.svc.cluster.local:9090/api/v1/write + writeRelabelConfigs: + - action: keep + regex: etcd|kubelet|node-exporter|apiserver|kube-state-metrics + sourceLabels: + - job +``` diff --git a/docs/en/docs/end-user/insight/quickstart/other/install-agentindce.md b/docs/en/docs/end-user/insight/quickstart/other/install-agentindce.md new file mode 100644 index 0000000000..37da126762 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/other/install-agentindce.md @@ -0,0 +1,70 @@ +# Install insight-agent in Suanova 4.0 + +In AI platform, previous Suanova 4.0 can be accessed as a subcluster. This guide provides potential issues and solutions when installing insight-agent in a Suanova 4.0 cluster. + +## Issue One + +Since most Suanova 4.0 clusters have installed dx-insight as the monitoring system, installing insight-agent at this time will conflict with the existing prometheus operator in the cluster, making it impossible to install smoothly. 
+ +### Solution + +Enable the parameters of the prometheus operator, retain the prometheus operator in dx-insight, and make it compatible with the prometheus operator in insight-agent in 5.0. + +### Steps + +1. Log in to the console. +2. Enable the __--deny-namespaces__ parameter in the two prometheus operators respectively. +3. Run the following command (the following command is for reference only, the actual command needs to replace the prometheus operator name and namespace in the command). + + ```bash + kubectl edit deploy insight-agent-kube-prometh-operator -n insight-system + ``` + + ![operatoryaml](https://docs.daocloud.io/daocloud-docs-images/docs/insight/images/promerator.png) + +!!! note + - As shown in the figure above, the dx-insight component is deployed under the dx-insight tenant, and the insight-agent is deployed under the insight-system tenant. + Add __--deny-namespaces=insight-system__ in the prometheus operator in dx-insight, + Add __--deny-namespaces=dx-insight__ in the prometheus operator in insight-agent. + - Just add deny namespace, both prometheus operators can continue to scan other namespaces, and the related collection resources under kube-system or customer business namespaces are not affected. + - Please pay attention to the problem of node exporter port conflict. + +### Supplementary Explanation + +The open-source __node-exporter__ turns on hostnetwork by default and the default port is 9100. +If the monitoring system of the cluster has installed __node-exporter__ , then installing __insight-agent__ at this time will cause node-exporter port conflict and it cannot run normally. + +!!! note + Insight's __node exporter__ will enable some features to collect special indicators, so it is recommended to install. + +Currently, it does not support modifying the port in the installation command. After __helm install insight-agent__ , you need to manually modify the related ports of the insight node-exporter daemonset and svc. + +## Issue Two + +After Insight Agent is successfully deployed, fluentbit does not collect logs of Suanova 4.0. + +### Solution + +The docker storage directory of Suanova 4.0 is __/var/lib/containers__ , which is different from the path in the configuration of insigh-agent, so the logs are not collected. + +### Steps + +1. Log in to the console. +2. Modify the following parameters in the insight-agent Chart. + + ```diff + fluent-bit: + daemonSetVolumeMounts: + - name: varlog + mountPath: /var/log + - name: varlibdockercontainers + - mountPath: /var/lib/docker/containers + + mountPath: /var/lib/containers/docker/containers + readOnly: true + - name: etcmachineid + mountPath: /etc/machine-id + readOnly: true + - name: dmesg + mountPath: /var/log/dmesg + readOnly: true + ``` diff --git a/docs/en/docs/end-user/insight/quickstart/res-plan/modify-vms-disk.md b/docs/en/docs/end-user/insight/quickstart/res-plan/modify-vms-disk.md new file mode 100644 index 0000000000..bc7db3e7dc --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/res-plan/modify-vms-disk.md @@ -0,0 +1,85 @@ +# vmstorage Disk Expansion + +This article describes the method for expanding the vmstorage disk. Please refer to the [vmstorage disk capacity planning](../res-plan/vms-res-plan.md) for the specifications of the vmstorage disk. + +## Procedure + +### Enable StorageClass expansion + +1. Log in to the AI platform platform as a global service cluster administrator. Click __Container Management__ -> __Clusters__ and go to the details of the __kpanda-global-cluster__ cluster. 
+ +2. Select the left navigation menu __Container Storage__ -> __PVCs__ and find the PVC bound to the vmstorage. + + ![Find vmstorage](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk01.png) + +3. Click a vmstorage PVC to enter the details of the volume claim for vmstorage and confirm the StorageClass that the PVC is bound to. + + ![Modify Disk](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk02.png) + +4. Select the left navigation menu __Container Storage__ -> __Storage Class__ and find __local-path__ . Click the __┇__ on the right side of the target and select __Edit__ in the popup menu. + + ![Edit StorageClass](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk03.png) + +5. Enable __Scale Up__ and click __OK__ . + + ![Scale Up](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk04.png) + +### Modify the disk capacity of vmstorage + +1. Log in to the AI platform platform as a global service cluster administrator and go to the details of the __kpanda-global-cluster__ cluster. + +2. Select the left navigation menu __CRDs__ and find the custom resource for __vmcluster__ . + + ![vmcluster](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk05.png) + +3. Click the custom resource for vmcluster to enter the details page, switch to the __insight-system__ namespace, and select __Edit YAML__ from the right menu of __insight-victoria-metrics-k8s-stack__ . + + ![Edit YAML](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk06.png) + +4. Modify according to the legend and click __OK__ . + + ![Confirm Edit](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk07.png) + +5. Select the left navigation menu __Container Storage__ -> __PVCs__ again and find the volume claim bound to vmstorage. Confirm that the modification has taken effect. In the details page of a PVC, click the associated storage source (PV). + + ![Relate PV](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk08.png) + +6. Open the volume details page and click the __Update__ button in the upper right corner. + + ![Update](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk09.png) + +7. After modifying the __Capacity__ , click __OK__ and wait for a moment until the expansion is successful. + + ![Edit Storage](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk10.png) + +### Clone the storage volume + +If the storage volume expansion fails, you can refer to the following method to clone the storage volume. + +1. Log in to the AI platform platform as a global service cluster administrator and go to the details of the __kpanda-global-cluster__ cluster. + +2. Select the left navigation menu __Workloads__ -> __StatefulSets__ and find the statefulset for __vmstorage__ . Click the __┇__ on the right side of the target and select __Status__ -> __Stop__ -> __OK__ in the popup menu. + + ![Stop Status](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk11.png) + +3. 
After logging into the __master__ node of the __kpanda-global-cluster__ cluster in the command line, run the following command to copy the vm-data directory in the vmstorage container to store the metric information locally: + + ```bash + kubectl cp -n insight-system vmstorage-insight-victoria-metrics-k8s-stack-1:vm-data ./vm-data + ``` + +4. Log in to the AI platform platform and go to the details of the __kpanda-global-cluster__ cluster. Select the left navigation menu __Container Storage__ -> __PVs__ , click __Clone__ in the upper right corner, and modify the capacity of the volume. + + ![Clone](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk12.png) + + ![Edit Storage](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk13.png) + +5. Delete the previous data volume of vmstorage. + + ![Delete vmstorage](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/insight/quickstart/images/vmdisk14.png) + +6. Wait for a moment until the volume claim is bound to the cloned data volume, then run the following command to import the exported data from step 3 into the corresponding container, and then start the previously paused __vmstorage__ . + + ```bash + kubectl cp -n insight-system ./vm-data vmstorage-insight-victoria-metrics-k8s-stack-1:vm-data + ``` diff --git a/docs/en/docs/end-user/insight/quickstart/res-plan/prometheus-res.md b/docs/en/docs/end-user/insight/quickstart/res-plan/prometheus-res.md new file mode 100644 index 0000000000..22b1921b8a --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/res-plan/prometheus-res.md @@ -0,0 +1,64 @@ +--- +MTPE: WANG0608GitHub +Date: 2024-09-24 +--- + +# Prometheus Resource Planning + +In the actual use of Prometheus, affected by the number of cluster containers and the opening of Istio, +the CPU, memory and other resource usage of Prometheus will exceed the set resources. + +In order to ensure the normal operation of Prometheus in clusters of different sizes, +it is necessary to adjust the resources of Prometheus according to the actual size of the cluster. + +## Reference resource planning + +In the case that the mesh is not enabled, the test statistics show that the relationship +between the system Job index and pods is **Series count = 800 \* pod count** + +When the service mesh is enabled, the magnitude of the Istio-related metrics generated +by the pod after the feature is enabled is **Series count = 768 \* pod count** + +### When the service mesh is not enabled + +The following resource planning is recommended by Prometheus when **the service mesh is not enabled** : + +| Cluster size (pod count) | Metrics (service mesh is not enabled) | CPU (core) | Memory (GB) | +| ---------------- | ---------------------- | ------------------ | ------------------------ | +| 100 | 8w | Request: 0.5
Limit: 1 | Request: 2GB
Limit: 4GB | +| 200 | 16w | Request: 1
Limit: 1.5 | Request: 3GB
Limit: 6GB | +| 300 | 24w | Request: 1
Limit: 2 | Request: 3GB
Limit: 6GB | +| 400 | 32w | Request: 1
Limit: 2 | Request: 4GB
Limit: 8GB | +| 500 | 40w | Request: 1.5
Limit: 3 | Request: 5GB
Limit: 10GB | +| 800 | 64w | Request: 2
Limit: 4 | Request: 8GB
Limit: 16GB | +| 1000 | 80w | Request: 2.5
Limit: 5 | Request: 9GB
Limit: 18GB | +| 2000 | 160w | Request: 3.5
Limit: 7 | Request: 20GB
Limit: 40GB | +| 3000 | 240w | Request: 4
Limit: 8 | Request: 33GB
Limit: 66GB | + +### When the service mesh feature is enabled + +The following resource planning is recommended by Prometheus in the scenario of **starting the service mesh**: + +| Cluster size (pod count) | metric volume (service mesh enabled) | CPU (core) | Memory (GB) | +| ---------------- | -------------------- | --------------------- | ------------------------ | +| 100 | 15w | Request: 1
Limit: 2 | Request: 3GB
Limit: 6GB | +| 200 | 31w | Request: 2
Limit: 3 | Request: 5GB
Limit: 10GB | +| 300 | 46w | Request: 2
Limit: 4 | Request: 6GB
Limit: 12GB | +| 400 | 62w | Request: 2
Limit: 4 | Request: 8GB
Limit: 16GB | +| 500 | 78w | Request: 3
Limit: 6 | Request: 10GB
Limit: 20GB | +| 800 | 125w | Request: 4
Limit: 8 | Request: 15GB
Limit: 30GB | +| 1000 | 156w | Request: 5
Limit: 10 | Request: 18GB
Limit: 36GB | +| 2000 | 312w | Request: 7
Limit: 14 | Request: 40GB
Limit: 80GB | +| 3000 | 468w | Request: 8
Limit: 16 | Request: 65GB
Limit: 130GB | + +!!! note + + 1. __Pod count__ in the table refers to the pod count that is basically running stably in the cluster. + If a large number of pods are restarted, the index will increase sharply in a short period of time. + At this time, resources need to be adjusted accordingly. + 2. Prometheus stores two hours of data by default in memory, and when the + [Remote Write function](https://prometheus.io/docs/practices/remote_write/#memory-usage) is enabled in the cluster, + a certain amount of memory will be occupied, and resources surge ratio is recommended to be set to 2. + 3. The data in the table are recommended values, applicable to general situations. + If the environment has precise resource requirements, it is recommended to check the resource usage of + the corresponding Prometheus after the cluster has been running for a period of time for precise configuration. diff --git a/docs/en/docs/end-user/insight/quickstart/res-plan/vms-res-plan.md b/docs/en/docs/end-user/insight/quickstart/res-plan/vms-res-plan.md new file mode 100644 index 0000000000..de6b201d68 --- /dev/null +++ b/docs/en/docs/end-user/insight/quickstart/res-plan/vms-res-plan.md @@ -0,0 +1,83 @@ +# vmstorage disk capacity planning + +vmstorage is responsible for storing multicluster metrics for observability. +In order to ensure the stability of vmstorage, it is necessary to adjust the disk capacity +of vmstorage according to the number of clusters and the size of the cluster. +For more information, please refer to [vmstorage retention period and disk space](https://docs.victoriametrics.com/guides/understand-your-setup-size.html?highlight=datapoint#retention-perioddisk-space). + +## Test Results + +After 14 days of disk observation of vmstorage of clusters of different sizes, +We found that the disk usage of vmstorage was positively correlated with the +amount of metrics it stored and the disk usage of individual data points. + +1. The amount of metrics stored instantaneously __increase(vm_rows{ type != "indexdb"}[30s])__ + to obtain the increased amount of metrics within 30s +2. Disk usage of a single data point: __sum(vm_data_size_bytes{type!="indexdb"}) / sum(vm_rows{type != "indexdb"})__ + +## calculation method + +**Disk usage** = Instantaneous metrics x 2 x disk usage for a single data point x 60 x 24 x storage time (days) + +**Parameter Description:** + +1. The unit of disk usage is __Byte__ . +2. __Storage duration (days) x 60 x 24__ converts time (days) into minutes to calculate disk usage. +3. The default collection time of Prometheus in Insight Agent is 30s, so twice the amount of metrics + will be generated within 1 minute. +4. The default storage duration in vmstorage is 1 month, please refer to + [Modify System Configuration](../../system-config/modify-config.md) to modify the configuration. + +!!! warning + + This formula is a general solution, and it is recommended to reserve redundant disk + capacity on the calculation result to ensure the normal operation of vmstorage. + +## reference capacity + +The data in the table is calculated based on the default storage time of one month (30 days), +and the disk usage of a single data point (datapoint) is calculated as 0.9. +In a multicluster scenario, the number of Pods represents the sum of the number of Pods in the multicluster. 
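+
+As a quick sanity check, the rows in the tables below can be reproduced from the formula above. The following is a minimal sketch (the 1000-Pod, no-service-mesh, 30-day case is just an assumed example; 0.9 is the per-datapoint byte figure used above):
+
+```bash
+# capacity (bytes) = metrics x 2 x 0.9 x 60 x 24 x retention days
+# metrics for 1000 Pods without service mesh = 800 x 1000 = 800000
+echo "800 * 1000 * 2 * 0.9 * 60 * 24 * 30 / 1024 / 1024 / 1024" | bc -l
+# ~= 57.9 GiB, in line with the 60 GiB reserved for 1000 Pods in the table below
+```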
+ +### When the service mesh is not enabled + +| Cluster size (number of Pods) | Metrics | Disk capacity | +| ----------------- | ------ | -------- | +| 100 | 8W | 6 GiB | +| 200 | 16W | 12 GiB | +| 300 | 24w | 18 GiB | +| 400 | 32w | 24 GiB | +| 500 | 40w | 30 GiB | +| 800 | 64w | 48 GiB | +| 1000 | 80W | 60 GiB | +| 2000 | 160w | 120 GiB | +| 3000 | 240w | 180 GiB | + +### When the service mesh is enabled + +| Cluster size (number of Pods) | Metrics | Disk capacity | +| ----------------- | ------ | -------- | +| 100 | 15W | 12 GiB | +| 200 | 31w | 24 GiB | +| 300 | 46w | 36 GiB | +| 400 | 62w | 48 GiB | +| 500 | 78w | 60 GiB | +| 800 | 125w | 94 GiB | +| 1000 | 156w | 120 GiB | +| 2000 | 312w | 235 GiB | +| 3000 | 468w | 350 GiB | + +### Example + +There are two clusters in the AI platform platform, of which 500 Pods are running in the global management cluster +(service mesh is turned on), and 1000 Pods are running in the worker cluster (service mesh is not turned on), and the expected metrics are stored for 30 days. + +- The number of metrics in the global management cluster is 800x500 + 768x500 = 784000 +- Worker cluster metrics are 800x1000 = 800000 + +Then the current vmstorage disk usage should be set to (784000+80000)x2x0.9x60x24x31 =124384896000 byte = 116 GiB + +!!! note + + For the relationship between the number of metrics and the number of Pods in the cluster, + please refer to [Prometheus Resource Planning](./prometheus-res.md). diff --git a/docs/en/docs/end-user/insight/system-config/modify-config.md b/docs/en/docs/end-user/insight/system-config/modify-config.md new file mode 100644 index 0000000000..f744393b24 --- /dev/null +++ b/docs/en/docs/end-user/insight/system-config/modify-config.md @@ -0,0 +1,190 @@ +# Modify system configuration + +Observability will persist the data of metrics, logs, and traces by default. Users can modify the system configuration according to This page. + +## How to modify the metric data retention period + +Refer to the following steps to modify the metric data retention period. + +1. run the following command: + + ```sh + kubectl edit vmcluster insight-victoria-metrics-k8s-stack -n insight-system + ``` + +2. In the Yaml file, the default value of __retentionPeriod__ is __14__ , and the unit is __day__ . You can modify the parameters according to your needs. + + ```Yaml + apiVersion: operator.victoriametrics.com/v1beta1 + kind: VMCluster + metadata: + annotations: + meta.helm.sh/release-name: insight + meta.helm.sh/release-namespace: insight-system + creationTimestamp: "2022-08-25T04:31:02Z" + finalizers: + - apps.victoriametrics.com/finalizer + generation: 2 + labels: + app.kubernetes.io/instance: insight + app.kubernetes.io/managed-by: Helm + app.kubernetes.io/name: victoria-metrics-k8s-stack + app.kubernetes.io/version: 1.77.2 + helm.sh/chart: victoria-metrics-k8s-stack-0.9.3 + name: insight-victoria-metrics-k8s-stack + namespace: insight-system + resourceVersion: "123007381" + uid: 55cee8d6-c651-404b-b2c9-50603b405b54 + spec: + replicationFactor: 1 + retentionPeriod: "14" + vminsert: + extraArgs: + maxLabelsPerTimeseries: "45" + image: + repository: docker.m.daocloud.io/victoriametrics/vminsert + tag: v1.80.0-cluster + replicaCount: 1 + ``` + +3. After saving the modification, the pod of the component responsible for storing the metrics will automatically restart, just wait for a while. 
+ +## How to modify the log data storage duration + +Refer to the following steps to modify the log data retention period: + +### Method 1: Modify the Json file + +1. Modify the __max_age__ parameter in the __rollover__ field in the following files, and set the retention period. The default storage period is __7d__ . Change `http://localhost:9200` to the address of __elastic__ . + + ```json + curl -X PUT "http://localhost:9200/_ilm/policy/insight-es-k8s-logs-policy?pretty" -H 'Content-Type: application/json' -d' + { + "policy": { + "phases": { + "hot": { + "min_age": "0ms", + "actions": { + "set_priority": { + "priority": 100 + }, + "rollover": { + "max_age": "7d", + "max_size": "10gb" + } + } + }, + "warm": { + "min_age": "10d", + "actions": { + "forcemerge": { + "max_num_segments": 1 + } + } + }, + "delete": { + "min_age": "30d", + "actions": { + "delete": {} + } + } + } + } + } + ``` + +2. After modification, run the above command. It will print out the content as shown below, then the modification is successful. + + ```json + { + "acknowledged": true + } + ``` + +### Method 2: Modify from the UI + +1. Log in __kibana__ , select __Stack Management__ in the left navigation bar. + + + +2. Select the left navigation __Index Lifecycle Polices__ , and find the index __insight-es-k8s-logs-policy__ , click to enter the details. + + + +3. Expand the __Hot phase__ configuration panel, modify the __Maximum age__ parameter, and set the retention period. The default storage period is __7d__ . + + + +4. After modification, click __Save policy__ at the bottom of the page to complete the modification. + + + +## How to modify the trace data storage duration + +Refer to the following steps to modify the trace data retention period: + +### Method 1: Modify the Json file + +1. Modify the __max_age__ parameter in the __rollover__ field in the following files, and set the retention period. The default storage period is __7d__ . At the same time, modify `http://localhost:9200` to the access address of __elastic__ . + + ```json + curl -X PUT "http://localhost:9200/_ilm/policy/jaeger-ilm-policy?pretty" -H 'Content-Type: application/json' -d' + { + "policy": { + "phases": { + "hot": { + "min_age": "0ms", + "actions": { + "set_priority": { + "priority": 100 + }, + "rollover": { + "max_age": "7d", + "max_size": "10gb" + } + } + }, + "warm": { + "min_age": "10d", + "actions": { + "forcemerge": { + "max_num_segments": 1 + } + } + }, + "delete": { + "min_age": "30d", + "actions": { + "delete": {} + } + } + } + } + } + ``` + +2. After modification, run the above command on the console. It will print out the content as shown below, then the modification is successful. + + ```json + { + "acknowledged": true + } + ``` + +### Method 2: Modify from the UI + +1. Log in __kibana__ , select __Stack Management__ in the left navigation bar. + + + +2. Select the left navigation __Index Lifecycle Polices__ , and find the index __jaeger-ilm-policy__ , click to enter the details. + + + +3. Expand the __Hot phase__ configuration panel, modify the __Maximum age__ parameter, and set the retention period. The default storage period is __7d__ . + + + +4. After modification, click __Save policy__ at the bottom of the page to complete the modification. 
+ + \ No newline at end of file diff --git a/docs/en/docs/end-user/insight/system-config/system-component.md b/docs/en/docs/end-user/insight/system-config/system-component.md new file mode 100644 index 0000000000..7a170fbad3 --- /dev/null +++ b/docs/en/docs/end-user/insight/system-config/system-component.md @@ -0,0 +1,21 @@ +# System Components + +On the system component page, you can quickly view the running status of the system components in Insight. When a system component fails, some features in Insight will be unavailable. + +1. Go to __Insight__ product module, +2. In the left navigation bar, select __System Management -> System Components__ . + +## Component description + +|Module| Component Name | Description | +| ----- | ------------- | ----------- | +|Metrics| vminsert-insight-victoria-metrics-k8s-stack | Responsible for writing the metric data collected by Prometheus in each cluster to the storage component. If this component is abnormal, the metric data of the worker cluster cannot be written. | +|Metrics| vmalert-insight-victoria-metrics-k8s-stack | Responsible for taking effect of the recording and alert rules configured in the VM Rule, and sending the triggered alert rules to alertmanager. | +|Metrics| vmalertmanager-insight-victoria-metrics-k8s-stack| is responsible for sending messages when alerts are triggered. If this component is abnormal, the alert information cannot be sent. | +|Metrics| vmselect-insight-victoria-metrics-k8s-stack | Responsible for querying metrics data. If this component is abnormal, the metric cannot be queried. | +|Metrics| vmstorage-insight-victoria-metrics-k8s-stack | Responsible for storing multicluster metrics data. | +| Dashboard | grafana-deployment | Provide monitoring panel capability. The exception of this component will make it impossible to view the built-in dashboard. | +|Link| insight-jaeger-collector | Responsible for receiving trace data in opentelemetry-collector and storing it. | +|Link| insight-jaeger-query | Responsible for querying the trace data collected in each cluster. | +|Link| insight-opentelemetry-collector | Responsible for receiving trace data forwarded by each sub-cluster | +|Log| elasticsearch | Responsible for storing the log data of each cluster. | \ No newline at end of file diff --git a/docs/en/docs/end-user/insight/system-config/system-config.md b/docs/en/docs/end-user/insight/system-config/system-config.md new file mode 100644 index 0000000000..924919adf6 --- /dev/null +++ b/docs/en/docs/end-user/insight/system-config/system-config.md @@ -0,0 +1,22 @@ +--- +hide: + - toc +--- + +# System Configuration + + __System Configuration__ displays the default storage time of metrics, logs, traces and the default Apdex threshold. + +1. Click the right navigation bar and select __System Configuration__ . + + + +2. Currently only supports modifying the storage duration of historical alerts, click __Edit__ to enter the target duration. + + When the storage duration is set to "0", the historical alerts will not be cleared. + + + +!!! note + + To modify other configurations, please click to view [How to modify the system configuration? 
](modify-config.md) diff --git a/docs/en/docs/end-user/insight/trace/service.md b/docs/en/docs/end-user/insight/trace/service.md new file mode 100644 index 0000000000..b91006e39f --- /dev/null +++ b/docs/en/docs/end-user/insight/trace/service.md @@ -0,0 +1,64 @@ +--- +MTPE: FanLin +Date: 2024-01-23 +--- + +# Service Insight + +In __Insight__ , a service refers to a group of workloads that provide the same behavior for incoming requests. +Service insight helps observe the performance and status of applications during the operation process by +using the OpenTelemetry SDK. + +For how to use OpenTelemetry, please refer to: [Using OTel to give your application insight](../../quickstart/otel/otel.md). + +## Glossary + +- **Service**: A service represents a group of workloads that provide the same behavior for incoming requests. + You can define the service name when using the OpenTelemetry SDK or use the name defined in Istio. +- **Operation**: An operation refers to a specific request or action handled by a service. Each span has an operation name. +- **Outbound Traffic**: Outbound traffic refers to all the traffic generated by the current service when making requests. +- **Inbound Traffic**: Inbound traffic refers to all the traffic initiated by the upstream service targeting the current service. + +## Steps + +The Services List page displays key metrics such as throughput rate, error rate, and request latency for all services +that have been instrumented with distributed tracing. You can filter services based on clusters or namespaces and sort +the list by throughput rate, error rate, or request latency. By default, the data displayed in the list is for the last hour, +but you can customize the time range. + +Follow these steps to view service insight metrics: + +1. Go to the __Insight__ product module. + +2. Select __Trace Tracking__ -> __Services__ from the left navigation bar. + + ![Trace Tracking](../images/service00.png){ width="1000"} + + !!! attention + + 1. If the namespace of a service in the list is __unknown__ , it means that the service has not been properly instrumented. + We recommend reconfiguring the instrumentation. + 2. If multiple services have the same name and none of them have the correct __Namespace__ environment variable configured, + the metrics displayed in the list and service details page will be aggregated for all those services. + +3. Click a service name (taking insight-system as an example) to view the detailed metrics and operation metrics for that service. + + 1. In the Service Topology section, you can view the service topology one layer above or below the current service. + When you hover over a node, you can see its information. + 2. In the Traffic Metrics section, you can view the monitoring metrics for all requests to the service within + the past hour (including inbound and outbound traffic). + 3. You can use the time selector in the upper right corner to quickly select a time range or specify a custom time range. + 4. Sorting is available for throughput, error rate, and request latency in the operation metrics. + 5. Clicking on the icon next to an individual operation will take you to the __Traces__ page to quickly search for related traces. + + ![Service Monitoring](../images/service01.png){: width="1000px"} + +### Service Metric Explanations + +| Metric | Description | +| ------ | ----------- | +| Throughput Rate | The number of requests processed within a unit of time. 
| +| Error Rate | The ratio of erroneous requests to the total number of requests within the specified time range. | +| P50 Request Latency | The response time within which 50% of requests complete. | +| P95 Request Latency | The response time within which 95% of requests complete. | +| P99 Request Latency | The response time within which 99% of requests complete. | diff --git a/docs/en/docs/end-user/insight/trace/topology-helper.md b/docs/en/docs/end-user/insight/trace/topology-helper.md new file mode 100644 index 0000000000..e96faa84bd --- /dev/null +++ b/docs/en/docs/end-user/insight/trace/topology-helper.md @@ -0,0 +1,21 @@ +# Service Topology Element Explanations + +The service topology provided by Observability allows you to quickly identify the request relationships between services and determine the health status of services based on different colors. The health status is determined based on the request latency and error rate of the service's overall traffic. This article explains the elements in the service topology. + +## Node Status Explanation + +The node health status is determined based on the error rate and request latency of the service's overall traffic, following these rules: + +| Color | Status | Rules | +| ----- | ------ | ----- | +| Gray | Healthy | Error rate equals 0% and request latency is less than 100ms | +| Orange | Warning | Error rate (0, 5%] or request latency (100ms, 200ms] | +| Red | Abnormal | Error rate (5%, 100%] or request latency (200ms, +Infinity) | + +## Connection Status Explanation + +| Color | Status | Rules | +| ----- | ------ | ----- | +| Green | Healthy | Error rate equals 0% and request latency is less than 100ms | +| Orange | Warning | Error rate (0, 5%] or request latency (100ms, 200ms] | +| Red | Abnormal | Error rate (5%, 100%] or request latency (200ms, +Infinity) | diff --git a/docs/en/docs/end-user/insight/trace/topology.md b/docs/en/docs/end-user/insight/trace/topology.md new file mode 100644 index 0000000000..fd951914ef --- /dev/null +++ b/docs/en/docs/end-user/insight/trace/topology.md @@ -0,0 +1,51 @@ +# Service Map + +Service map is a visual representation of the connections, communication, and dependencies between services. +It provides insights into the service-to-service interactions, allowing you to view the calls and performance of +services within a specified time range. The connections between nodes in the topology map represent the existence of +service-to-service calls during the queried time period. + +## Prerequisites + +1. Insight Agent is [installed](../../quickstart/install/install-agent.md) in the cluster and the applications are in the __Running__ state. +2. Services have been instrumented for distributed tracing using + [Operator](../../quickstart/otel/operator.md) or [OpenTelemetry SDK](../../quickstart/otel/golang/golang.md). + +## Steps + +1. Go to the __Insight__ product module. + +2. Select __Tracing__ -> __Service Map__ from the left navigation bar. + +3. In the Service Map, you can perform the following actions: + + - Click a node to slide out the details of the service on the right side. Here, + you can view metrics such as request latency, throughput, and error rate for the service. + Clicking on the service name takes you to the service details page. + - Hover over the connections to view the traffic metrics between the two services. + - Click __Display Settings__ , you can configure the display elements in the service map. 
+ + ![Servicemap](../../images/servicemap.png) + +### Other Nodes + +In the Service Map, there can be nodes that are not part of the cluster. These external nodes can be categorized into three types: + +- Database +- Message Queue +- Virtual Node + +1. If a service makes a request to a Database or Message Queue, these two types of nodes will be displayed + by default in the topology map. However, Virtual Nodes represent nodes outside the cluster or services + not integrated into the trace, and they will not be displayed by default in the map. + +2. When a service makes a request to MySQL, PostgreSQL, or Oracle Database, the detailed database type + can be seen in the map. + +#### Enabling Virtual Nodes + +1. Update the `insight-server` chart values, locate the parameter shown in the image below, and change `false` to `true`. + + ![change-parameters](../../image/servicemap.png) + +2. In the display settings of the service map, check the `Virtual Services` option to enable it. diff --git a/docs/en/docs/end-user/insight/trace/trace.md b/docs/en/docs/end-user/insight/trace/trace.md new file mode 100644 index 0000000000..dc33aabb36 --- /dev/null +++ b/docs/en/docs/end-user/insight/trace/trace.md @@ -0,0 +1,56 @@ +# Trace Query + +On the trace query page, you can query detailed information about a call trace by TraceID or filter call traces based on various conditions. + +## Glossary + +- TraceID: Used to identify a complete request call trace. +- Operation: Describes the specific operation or event represented by a Span. +- Entry Span: The entry Span represents the first request of the entire call. +- Latency: The duration from receiving the request to completing the response for the entire call trace. +- Span: The number of Spans included in the entire trace. +- Start Time: The time when the current trace starts. +- Tag: A collection of key-value pairs that constitute Span tags. Tags are used to annotate and supplement Spans, and each Span can have multiple key-value tag pairs. + +## Steps + +Please follow these steps to search for a trace: + +1. Go to the __Insight__ product module. +2. Select __Tracing__ -> __Traces__ from the left navigation bar. + + ![jaeger](../../image/trace00.png) + + !!! note + + Sorting by Span, Latency, and Start At is supported in the list. + +3. Click the __TraceID Query__ in the filter bar to switch to TraceID search. + + - To search using TraceID, please enter the complete TraceID. + + + +## Other Operations + +### View Trace Details + +1. Click the TraceID of a trace in the trace list to view its detailed call information. + + ![jaeger](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/insight/images/trace03.png) + +### Associated Logs + +1. Click the icon on the right side of the trace data to search for associated logs. + + - By default, it queries the log data within the duration of the trace and one minute after its completion. + - The queried logs include those with the trace's TraceID in their log text and container logs related to the trace invocation process. + +2. Click __View More__ to jump to the __Associated Log__ page with conditions. +3. By default, all logs are searched, but you can filter by the TraceID or the relevant container logs from the trace call process using the dropdown. + + ![tracelog](../../image/tracelog.png) + + !!! note + + Since trace may span across clusters or namespaces, if the user does not have sufficient permissions, they will be unable to query the associated logs for that trace. 
diff --git a/docs/en/docs/end-user/k8s/add-node.md b/docs/en/docs/end-user/k8s/add-node.md index b740ffa18f..057df2b8f3 100644 --- a/docs/en/docs/end-user/k8s/add-node.md +++ b/docs/en/docs/end-user/k8s/add-node.md @@ -1,43 +1,43 @@ -# 添加工作节点 +# Adding Worker Nodes -如果节点不够用了,可以添加更多节点到集群中。 +If there are not enough nodes, you can add more nodes to the cluster. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- 有一个管理员帐号 -- [已创建带 GPU 节点的集群](./create-k8s.md) -- [准备一台云主机](../host/createhost.md) +- The AI platform is installed +- An administrator account is available +- [A cluster with GPU nodes has been created](./create-k8s.md) +- [A cloud host has been prepared](../host/createhost.md) -## 添加步骤 +## Steps to Add Nodes -1. 以 **管理员身份** 登录 AI 算力平台 -1. 导航至 **容器管理** -> **集群列表** ,点击目标集群的名称 +1. Log into the AI platform as an **administrator**. +2. Navigate to **Container Management** -> **Clusters**, and click on the name of the target cluster. ![clusters](../images/remove01.png) -1. 进入集群概览页,点击 **节点管理** ,点击右侧的 **接入节点** 按钮 +3. On the cluster overview page, click on **Nodes**, then click the **Add Node** button on the right. ![add](../images/add01.png) -1. 按照向导,填写各项参数后点击 **确定** +4. Follow the wizard to fill in the parameters and click **OK**. - === "基本信息" + === "Basic Information" ![basic](../images/add02.png) - === "参数配置" + === "Parameter Configuration" ![arguments](../images/add03.png) -1. 在弹窗中点击 **确定** +5. In the pop-up window, click **OK**. ![ok](../images/add04.png) -1. 返回节点列表,新接入的节点状态为 **接入中** ,等待几分钟后状态变为 **健康** 则表示接入成功。 +6. Return to the node list; the status of the newly added node will be **Connecting**. After a few minutes, when the status changes to **Running**, it indicates that the connection was successful. ![success](../images/add05.png) !!! tip - 对于刚接入成功的节点,可能还要等 2-3 分钟才能识别出 GPU。 + For newly connected nodes, it may take an additional 2-3 minutes to recognize the GPU. diff --git a/docs/en/docs/end-user/k8s/create-k8s.md b/docs/en/docs/end-user/k8s/create-k8s.md index 16674d7ea1..fd43843049 100644 --- a/docs/en/docs/end-user/k8s/create-k8s.md +++ b/docs/en/docs/end-user/k8s/create-k8s.md @@ -1,80 +1,81 @@ -# 创建云上 Kubernetes 集群 +# Creating a Kubernetes Cluster in the Cloud -部署 Kubernetes 集群是为了支持高效的 AI 算力调度和管理,实现弹性伸缩,提供高可用性,从而优化模型训练和推理过程。 +Deploying a Kubernetes cluster is aimed at supporting efficient AI computing resource scheduling and management, achieving elastic scaling, providing high availability, and optimizing the model training and inference processes. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台已 -- 有一个管理员权限的账号 -- 准备一台带 GPU 的物理机 -- 分配两段 IP 地址(Pod CIDR 18 位、SVC CIDR 18 位,不能与现有网段冲突) +- The AI platform is installed +- An administrator account is available +- A physical machine with a GPU is prepared +- Two segments of IP addresses are allocated (Pod CIDR 18 bits, SVC CIDR 18 bits, must not conflict with existing networks) -## 创建步骤 +## Steps to Create -1. 以 **管理员身份** 登录 AI 算力平台 -1. [创建并启动 3 台不带 GPU 的云主机](../host/createhost.md)用作集群的 Master 节点 +1. Log into the AI platform as an **administrator**. +2. [Create and launch 3 cloud hosts without GPU](../host/createhost.md) to serve as Master nodes for the cluster. - - 配置资源,CPU 16 核,内存 32 GB,系统盘 200 GB(ReadWriteOnce) - - 网络模式选择 **Bridge(桥接)** - - 设置 root 密码或添加 SSH 公钥,方便以 SSH 连接 - - 记录好 3 台主机的 IP + - Configure resources: 16 CPU cores, 32 GB RAM, 200 GB system disk (ReadWriteOnce) + - Select **Bridge** network mode + - Set the root password or add an SSH public key for SSH connection + - Record the IPs of the 3 hosts -1. 
导航至 **容器管理** -> **集群列表** ,点击右侧的 **创建集群** 按钮 -1. 按照向导,配置集群的各项参数 +3. Navigate to **Container Management** -> **Clusters**, and click the **Create Cluster** button on the right. +4. Follow the wizard to configure various parameters of the cluster. - === "基本信息" + === "Basic Information" ![basic](../images/k8s01.png) - === "节点配置" + === "Node Configuration" - 配置完节点信息后,点击 **开始检查** , + After configuring the node information, click **Start Check**. ![node](../images/k8s02.png) + ![node](../images/k8s03.png) - === "网络配置" + === "Network Configuration" ![network](../images/k8s04.png) - === "Addon 配置" + === "Addon Configuration" ![addon](../images/k8s05.png) - === "高级配置" + === "Advanced Configuration" - 每个节点默认可运行 110 个 Pod(容器组),如果节点配置比较高,可以调整到 200 或 300 个 Pod。 + Each node can run 110 Pods (container groups) by default. If the node configuration is higher, it can be adjusted to 200 or 300 Pods. ![basic](../images/k8s06.png) -1. 等待集群创建完成。 +5. Wait for the cluster creation to complete. ![done](../images/k8s08.png) -1. 在集群列表中,找到刚创建的集群,点击集群名称,导航到 **Helm 应用** -> **Helm 模板** ,在搜索框内搜索 metax-gpu-extensions,点击卡片 +6. In the cluster list, find the newly created cluster, click on the cluster name, navigate to **Helm Apps** -> **Helm Charts**, and search for metax-gpu-extensions in the search box, then click the card. ![cluster](../images/k8s09.png) ![helm](../images/k8s10.png) -1. 点击右侧的 **安装** 按钮,开始安装 GPU 插件 +7. Click the **Install** button on the right to start installing the GPU plugin. - === "应用设置" + === "Application Settings" - 输入名称,选择命名空间,在 YAMl 中修改镜像地址: + Enter a name, select a namespace, and modify the image address in the YAML: ![app settings](../images/k8s11.png) - === "Kubernetes 编排确认" + === "Kubernetes Orchestration Confirmation" ![confirm](../images/k8s12.png) -1. 自动返回 Helm 应用列表,等待 metax-gpu-extensions 状态变为 **已部署** +8. Automatically return to the Helm application list and wait for the status of metax-gpu-extensions to change to **Deployed**. ![deployed](../images/k8s13.png) -1. 到此集群创建成功,可以去查看集群所包含的节点。你可以去[创建 AI 工作负载并使用 GPU 了](../share/workload.md)。 +9. At this point, the cluster has been successfully created. You can check the nodes included in the cluster. You can now [create AI workloads and use GPUs](../share/workload.md). ![nodes](../images/k8s14.png) -下一步:[创建 AI 工作负载](../share/workload.md) +Next step: [Create AI Workloads](../share/workload.md) diff --git a/docs/en/docs/end-user/k8s/remove-node.md b/docs/en/docs/end-user/k8s/remove-node.md index 5f5bd822e7..8f400f2aa1 100644 --- a/docs/en/docs/end-user/k8s/remove-node.md +++ b/docs/en/docs/end-user/k8s/remove-node.md @@ -1,37 +1,36 @@ -# 移除 GPU 工作节点 +# Removing GPU Worker Nodes -GPU 资源的成本相对较高,如果暂时用不到 GPU,可以将带 GPU 的工作节点移除。 -以下步骤也同样适用于移除普通工作节点。 +The cost of GPU resources is relatively high. If GPUs are not needed temporarily, you can remove the worker nodes with GPUs. The following steps also apply to removing regular worker nodes. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- 有一个管理员帐号 -- [已创建带 GPU 节点的集群](./create-k8s.md) +- The AI platform is installed +- An administrator account is available +- [A cluster with GPU nodes has been created](./create-k8s.md) -## 移除步骤 +## Steps to Remove -1. 以 **管理员身份** 登录 AI 算力平台 -1. 导航至 **容器管理** -> **集群列表** ,点击目标集群的名称 +1. Log into the AI platform as an **administrator**. +2. Navigate to **Container Management** -> **Clusters**, and click on the name of the target cluster. ![clusters](../images/remove01.png) -1. 进入集群概览页,点击 **节点管理** ,找到要移除的节点,点击列表右侧的 __┇__ ,在弹出菜单中选择 **移除节点** +3. 
Enter the cluster overview page, click on **Nodes**, find the node to be removed, click on the __┇__ on the right side of the list, and select **Remove Node** from the pop-up menu. ![remove](../images/remove02.png) -1. 在弹框中输入节点名称,确认无误后点击 **删除** +4. In the pop-up window, enter the node name, and after confirming it is correct, click **Delete**. ![confirm](../images/remove03.png) -1. 自动返回节点列表,状态为 **移除中** ,几分钟后刷新页面,节点不在了,说明节点被成功移除 +5. You will automatically return to the node list, where the status will be **Removing**. After a few minutes, refresh the page, and if the node is no longer present, it indicates that the node has been successfully removed. ![removed](../images/remove04.png) -1. 从 UI 列表移除节点后,通过 SSH 登录到已移除的节点主机,执行关机命令。 +6. After removing the node from the UI list, SSH into the removed node's host and execute the shutdown command. ![shutdown](../images/remove05.png) !!! tip - 在 UI 上移除节点并将其关机后,节点上的数据并未被立即删除,节点数据会被保留一段时间。 + After removing the node in the UI and shutting it down, the data on the node is not immediately deleted; the node's data will be retained for a period of time. diff --git a/docs/en/docs/end-user/kpanda/backup/deployment.md b/docs/en/docs/end-user/kpanda/backup/deployment.md new file mode 100644 index 0000000000..60f35a5e59 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/backup/deployment.md @@ -0,0 +1,62 @@ +--- +MTPE: FanLin +Date: 2024-01-19 +--- + +# Application Backup + +This article explains how to backup applications in AI platform. The demo application used in this tutorial is called __dao-2048__ , which is a deployment. + +## Prerequisites + +Before backing up a deployment, the following prerequisites must be met: + +- [Integrate a Kubernetes cluster](../clusters/integrate-cluster.md) or [create a Kubernetes cluster](../clusters/create-cluster.md) in the [Container Management](../../intro/index.md) module, and be able to access the UI interface of the cluster. + +- Create a [Namespace](../namespaces/createns.md) and a [User](../../../ghippo/access-control/user.md). + +- The current operating user should have [NS Editor](../permissions/permission-brief.md#ns-editor) or higher permissions, for details, refer to [Namespace Authorization](../namespaces/createns.md). + +- [Install the velero component](install-velero.md), and ensure the velero component is running properly. + +- [Create a deployment](../workloads/create-deployment.md) (the workload in this tutorial is named __dao-2048__ ), and label the deployment with __app: dao-2048__ . + +## Backup workload + +Follow the steps below to backup the deployment __dao-2048__ . + +1. Enter the Container Management module, click __Backup Recovery__ -> __Application Backup__ on the left navigation bar, and enter the __Application Backup__ list page. + + ![Cluster List](../../images/backupd20481.png) + +2. On the __Application Backup__ list page, select the cluster where the velero and __dao-2048__ applications have been installed. Click __Backup Plan__ in the upper right corner to create a new backup cluster. + + ![Application Backup](../../images/backupd20482.png) + +3. Refer to the instructions below to fill in the backup configuration. + + - Name: The name of the new backup plan. + - Source Cluster: The cluster where the application backup plan is to be executed. + - Object Storage Location: The access path of the object storage configured when installing velero on the source cluster. + - Namespace: The namespaces that need to be backed up, multiple selections are supported. 
+ - Advanced Configuration: Back up specific resources in the namespace based on resource labels, such as an application, or do not back up specific resources in the namespace based on resource labels during backup. + + ![Backup Resource](../../images/backupd20483.png) + +4. Refer to the instructions below to set the backup execution frequency, and then click __Next__ . + + - Backup Frequency: Set the time period for task execution based on minutes, hours, days, weeks, and months. Support custom Cron expressions with numbers and `*` , **after inputting the expression, the meaning of the current expression will be prompted**. For detailed expression syntax rules, refer to [Cron Schedule Syntax](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax). + - Retention Time (days): Set the storage time of backup resources, the default is 30 days, and will be deleted after expiration. + - Backup Data Volume (PV): Whether to back up the data in the data volume (PV), support direct copy and use CSI snapshot. + - Direct Replication: directly copy the data in the data volume (PV) for backup; + - Use CSI snapshots: Use CSI snapshots to back up data volumes (PVs). Requires a CSI snapshot type available for backup in the cluster. + + ![Backup Policy](../../images/backupd20484.png) + +5. Click __OK__ , the page will automatically return to the application backup plan list, find the newly created __dao-2048__ backup plan, and perform the __Immediate Execution__ operation. + + ![Immediate Execution](../../images/backupd20485.png) + +6. At this point, the __Last Execution State__ of the cluster will change to __in progress__ . After the backup is complete, you can click the name of the backup plan to view the details of the backup plan. + + ![Backup Details](../../images/backupd20486.png) diff --git a/docs/en/docs/end-user/kpanda/backup/etcd-backup.md b/docs/en/docs/end-user/kpanda/backup/etcd-backup.md new file mode 100644 index 0000000000..4c576220a4 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/backup/etcd-backup.md @@ -0,0 +1,159 @@ +--- +MTPE: ModetaNiu +Date: 2024-06-05 +--- + +# etcd backup + +etcd backup is based on cluster data as the core backup. In cases such as hardware device damage, development and test configuration errors, etc., the backup cluster data can be restored through etcd backup. + +This section will introduce how to realize the etcd backup for clusters. +Also see [etcd Backup and Restore Best Practices](../../best-practice/etcd-backup.md). + +## Prerequisites + +- [Integrated the Kubernetes cluster](../clusters/integrate-cluster.md) or + [created the Kubernetes cluster](../clusters/create-cluster.md), + and you can access the UI interface of the cluster. + +- Created a [namespace](../namespaces/createns.md), + [user](../../../ghippo/access-control/user.md), + and granted [`NS Admin`](../permissions/permission-brief.md#ns-admin) or higher permissions to the user. + For details, refer to [Namespace Authorization](../permissions/cluster-ns-auth.md). + +- Prepared a MinIO instance. It is recommended to create it through AI platform's MinIO middleware. + For specific steps, refer to [MinIO Object Storage](../../../middleware/minio/user-guide/create.md). + +## Create etcd backup + +Follow the steps below to create an etcd backup. + +1. Enter __Container Management__ -> __Backup Recovery__ -> __etcd Backup__ page, you can see all the current + backup policies. Click __Create Backup Policy__ on the right. 
+ + ![Backup policy list](../images/etcd01.png) + +2. Fill in the __Basic Information__. Then, click __Next__ to automatically verify the connectivity of etcd. If + the verification passes, proceed to the next step. + + - First select the backup cluster and log in to the terminal + - Enter etcd, and the format is `https://${NodeIP}:${Port}`. + + - In a standard Kubernetes cluster, the default port for etcd is __2379__. + - In a Suanova 4.0 cluster, the default port for etcd is __12379__. + - In a public cloud managed cluster, you need to contact the relevant developers to obtain the etcd port number. + This is because the control plane components of public cloud clusters are maintained and managed by + the cloud service provider. Users cannot directly access or view these components, nor can they obtain + control plane port information through regular commands (such as kubectl). + + ??? note "Ways to obtain port number" + + 1. Find the etcd Pod in the __kube-system__ namespace + + ```shell + kubectl get po -n kube-system | grep etcd + ``` + + 2. Get the port number from the __listen-client-urls__ of the etcd Pod + + ```shell + kubectl get po -n kube-system ${etcd_pod_name} -oyaml | grep listen-client-urls # (1)! + ``` + + 1. Replace __etcd_pod_name__ with the actual Pod name + + The expected output is as follows, where the number after the node IP is the port number: + + ```shell + - --listen-client-urls=https://127.0.0.1:2379,https://10.6.229.191:2379 + ``` + + - Fill in the CA certificate, you can use the following command to view the certificate content. + Then, copy and paste it to the proper location: + + === "Standard Kubernetes Cluster" + + ```shell + cat /etc/kubernetes/ssl/etcd/ca.crt + ``` + + === "Suanova 4.0 Cluster" + + ```shell + cat /etc/daocloud/dce/certs/ca.crt + ``` + + - Fill in the Cert certificate, you can use the following command to view the content of the certificate. Then, copy and paste it to the proper location: + + === "Standard Kubernetes Cluster" + + ```shell + cat /etc/kubernetes/ssl/apiserver-etcd-client.crt + ``` + + === "Suanova 4.0 Cluster" + + ```shell + cat /etc/daocloud/dce/certs/etcd/server.crt + ``` + + - Fill in the Key, you can use the following command to view the content of the certificate and copy and paste it to the proper location: + + === "Standard Kubernetes Cluster" + + ```shell + cat /etc/kubernetes/ssl/apiserver-etcd-client.key + ``` + + === "Suanova 4.0 Cluster" + + ```shell + cat /etc/daocloud/dce/certs/etcd/server.key + ``` + + ![Create Basic Information](../images/etcd-get01.png) + + !!! note + + Click __How to get__ below the input box to see how to obtain the proper information on the UI page. + +3. Refer to the following information to fill in the __Backup Policy__. + + - Backup Method: Choose either manual backup or scheduled backup + + - Manual Backup: Immediately perform a full backup of etcd data based on the backup configuration. + - Scheduled Backup: Periodically perform full backups of etcd data according to the set backup frequency. + + - Backup Chain Length: the maximum number of backup data to retain. The default is 30. + - Backup Frequency: it can be per hour, per day, per week or per month, and can also be customized. + +4. Refer to the following information to fill in the __Storage Path__. 
+ + - Storage Provider: Default is S3 storage + - Object Storage Access Address: The access address of MinIO + - Bucket: Create a Bucket in MinIO and fill in the Bucket name + - Username: The login username for MinIO + - Password: The login password for MinIO + +5. After clicking __OK__ , the page will automatically redirect to the backup policy list, where you can + view all the currently created ones. + + - Click the __┇__ action button on the right side of the policy to view logs, view YAML, update the policy, stop the policy, or execute the policy immediately. + - When the backup method is manual, you can click __Execute Now__ to perform the backup. + - When the backup method is scheduled, the backup will be performed according to the configured time. + +## View Backup Policy Logs + +Click __Logs__ to view the log content. By default, 100 lines are displayed. If you want to see more log information or download the logs, you can follow the prompts above the logs to go to the observability module. + +## View Backup POlicy Details + +Go to __Container Management__ -> __Backup Recovery__ -> __etcd Backup__ , click the __Backup Policy__ tab, and then click the policy to view the details. + +## View Recovery Point + +1. Go to __Container Management__ -> __Backup Recovery__ -> __etcd Backup__, and click the __Recovery Point__ tab. +2. After selecting the target cluster, you can view all the backup information under that cluster. + + Each time a backup is executed, a corresponding recovery point is generated, which can be used to quickly restore + the application from a successful recovery point. diff --git a/docs/en/docs/end-user/kpanda/backup/index.md b/docs/en/docs/end-user/kpanda/backup/index.md new file mode 100644 index 0000000000..146f792fb1 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/backup/index.md @@ -0,0 +1,52 @@ +--- +hide: + - toc +--- + +# Backup and Restore + +Backup and restore are essential aspects of system management. In practice, it is important to +first back up the data of the system at a specific point in time and securely store the backup. +In case of incidents such as data corruption, loss, or accidental deletion, the system can be +quickly restored based on the previous backup data, reducing downtime and minimizing losses. + +- In real production environments, services may be deployed across different clouds, regions, + or availability zones. If one infrastructure faces a failure, organizations need to quickly + restore applications in other available environments. In such cases, cross-cloud or cross-cluster + backup and restore become crucial. +- Large-scale systems often involve multiple roles and users with complex permission management systems. + With many operators involved, accidents caused by human error can lead to system failures. + In such scenarios, the ability to roll back the system quickly using previously backed-up data + is necessary. Relying solely on manual troubleshooting, fault repair, and system recovery can + be time-consuming, resulting in prolonged system unavailability and increased losses for organizations. +- Additionally, factors like network attacks, natural disasters, and equipment malfunctions can + also cause data accidents. + +Therefore, backup and restore are vital as the last line of defense for maintaining system stability +and ensuring data security. + +Backups are typically classified into three types: full backups, incremental backups, +and differential backups. 
Currently, AI platform supports full backups and incremental backups. + +The backup and restore provided by AI platform can be divided into two categories: +**Application Backup** and **ETCD Backup**. It supports both manual backups +and scheduled automatic backups using CronJobs. + +- Application Backup + + Application backup refers to backing up data of a specific workload in the cluster and then + restoring that data either within the same cluster or in another cluster. It supports backing up + all resources under a namespace or filtering resources by specific labels. + + Application backup also supports cross-cluster backup of stateful applications. + For detailed steps, refer to the [Backup and Restore MySQL Applications and Data Across Clusters](../../best-practice/backup-mysql-on-nfs.md) guide. + +- etcd Backup + + etcd is the data storage component of Kubernetes. Kubernetes stores its own component's data + and application data in etcd. Therefore, backing up etcd is equivalent to backing up the + entire cluster's data, allowing quick restoration of the cluster to a previous state in case of failures. + + It's worth noting that currently, restoring etcd backup data is only supported within the same + cluster (the original cluster). To learn more about related best practices, refer to the + [ETCD Backup and Restore](../../best-practice/etcd-backup.md) guide. diff --git a/docs/en/docs/end-user/kpanda/backup/install-velero.md b/docs/en/docs/end-user/kpanda/backup/install-velero.md new file mode 100644 index 0000000000..2143ff33c8 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/backup/install-velero.md @@ -0,0 +1,83 @@ +--- +MTPE: FanLin +Date: 2024-01-18 +--- + +# Install the Velero Plugin + +velero is an open source tool for backing up and restoring Kubernetes cluster resources. It can back up resources in a Kubernetes cluster to cloud storage services, local storage, or other locations, and restore those resources to the same or a different cluster when needed. + +This section introduces how to deploy the Velero plugin in AI platform using the __Helm Apps__. + +## Prerequisites + +Before installing the __velero__ plugin, the following prerequisites need to be met: + +- [Integrated the Kubernetes cluster](../clusters/integrate-cluster.md) or + [created the Kubernetes cluster](../clusters/create-cluster.md), + and you can access the UI interface of the cluster. +- Created a __velero__ [namespace](../namespaces/createns.md). +- You should have permissions not lower than + [NS Editor](../permissions/permission-brief.md#ns-editor). + For details, refer to [Namespace Authorization](../namespaces/createns.md). + +## Steps + +Please perform the following steps to install the __velero__ plugin for your cluster. + +1. On the cluster list page, find the target cluster that needs to install the __velero__ plugin, click the name of the cluster, click __Helm Apps__ -> __Helm chart__ in the left navigation bar, and enter __velero__ in the search bar to search . + + ![Find velero](../../images/backup1.png) + +2. Read the introduction of the __velero__ plugin, select the version and click the __Install__ button. This page will take __5.2.0__ version as an example to install, and it is recommended that you install __5.2.0__ and later versions. + + ![Install velero](../../images/backup2.png) + +3. Configure __basic info__ . 
+ + - Name: Enter the plugin name, please note that the name can be up to 63 characters, can only contain lowercase letters, numbers and separators ("-"), and must start and end with lowercase letters or numbers, such as metrics-server-01. + - Namespace: Select the namespace for plugin installation, it must be __velero__ namespace. + - Version: The version of the plugin, here we take __5.2.0__ version as an example. + - Wait: When enabled, it will wait for all associated resources under the application to be ready before marking the application installation as successful. + - Deletion Failed: After it is enabled, the synchronization will be enabled by default and ready to wait. If the installation fails, the installation-related resources will be removed. + - Detailed Logs: Turn on the verbose output of the installation process log. + + ![Basic Info](../../images/backup3.png) + + !!! note + + After enabling __Ready Wait__ and/or __Failed Delete__ , it takes a long time for the app to be marked as __Running__ . + +4. Configure Velero chart __Parameter Settings__ according to the following instructions + + - __S3 Credentials__: Configure the authentication information of object storage (minio). + + - __Use secret__: Keep the default configuration __true__. + - __Secret name__: Keep the default configuration __velero-s3-credential__. + - __SecretContents.aws_access_key_id = __: Configure the username for accessing object storage, replace ____ with the actual parameter. + - __SecretContents.aws_secret_access_key = __: Configure the password for accessing object storage, replace ____ with the actual parameter. + + !!! note " __Use existing secret__ parameter example is as follows:" + + ```config + [default] + aws_access_key_id = minio + aws_secret_access_key = minio123 + ``` + + - __BackupStorageLocation__: The location where Velero backs up data. + - __S3 bucket__: The name of the storage bucket used to save backup data (must be a real storage bucket that already exists in minio). + - __Is default BackupStorage__: Keep the default configuration __true__. + - __S3 access mode__: The access mode of Velero to data, which can be selected + - __ReadWrite__: Allow Velero to read and write backup data; + - __ReadOnly__: Allow Velero to read backup data, but cannot modify backup data; + - __WriteOnly__: Only allow Velero to write backup data, and cannot read backup data. + + - __S3 Configs__: Detailed configuration of S3 storage (minio). + - __S3 region__: The geographical region of cloud storage. The default is to use the __us-east-1__ parameter, which is provided by the system administrator. + - __S3 force path style__: Keep the default configuration __true__. + - __S3 server URL__: The console access address of object storage (minio). Minio generally provides two services, UI access and console access. Please use the console access address here. + + ![Parameter Settings](../../images/backup4.png) + +5. Click the __OK__ button to complete the installation of the __Velero__ plugin. The system will automatically jump to the __Helm Apps__ list page. After waiting for a few minutes, refresh the page, and you can see the application just installed. 
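+
+As a quick sanity check after installation, you can verify from the command line that the velero workloads are running and that the backup storage location is usable. This is only a suggested check, assuming velero was installed into the __velero__ namespace as described above.
+
+```bash
+# The velero server (and node-agent, if enabled) pods should be Running.
+kubectl get pods -n velero
+
+# The BackupStorageLocation created from the chart values should report phase "Available".
+kubectl get backupstoragelocations -n velero
+```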
diff --git a/docs/en/docs/end-user/kpanda/clusterops/cluster-settings.md b/docs/en/docs/end-user/kpanda/clusterops/cluster-settings.md new file mode 100644 index 0000000000..0bdd180630 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusterops/cluster-settings.md @@ -0,0 +1,18 @@ +--- +MTPE: FanLin +Date: 2024-02-26 +--- + +# Cluster Settings + +Cluster settings are used to customize advanced feature settings for your cluster, including whether to enable GPU, helm repo refresh cycle, Helm operation record retention, etc. + +- Enable GPU: GPUs and proper driver plug-ins need to be installed on the cluster in advance. + + Click the name of the target cluster, and click __Operations and Maintenance__ -> __Cluster Settings__ -> __Addons__ in the left navigation bar. + + ![Config GPU](../images/settings01.png) + +- Helm operation basic image, registry refresh cycle, number of operation records retained, whether to enable cluster deletion protection (the cluster cannot be uninstalled directly after enabling) + + ![Advanced Settings](../images/settings02.png) diff --git a/docs/en/docs/end-user/kpanda/clusterops/latest-operations.md b/docs/en/docs/end-user/kpanda/clusterops/latest-operations.md new file mode 100644 index 0000000000..0a5359cc9e --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusterops/latest-operations.md @@ -0,0 +1,22 @@ +--- +hide: + - toc +--- + +# recent operations + +On this page, you can view the recent cluster operation records and Helm operation records, as well as the YAML files and logs of each operation, and you can also delete a certain record. + + + +Set the number of reserved entries for Helm operations: + +By default, the system keeps the last 100 Helm operation records. If you keep too many entries, it may cause data redundancy, and if you keep too few entries, you may lose the key operation records you need. A reasonable reserved quantity needs to be set according to the actual situation. Specific steps are as follows: + +1. Click the name of the target cluster, and click __Recent Operations__ -> __Helm Operations__ -> __Set Number of Retained Items__ in the left navigation bar. + + + +2. Set how many Helm operation records need to be kept, and click __OK__ . + + \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/clusters/access-cluster.md b/docs/en/docs/end-user/kpanda/clusters/access-cluster.md new file mode 100644 index 0000000000..c0deb19ccc --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/access-cluster.md @@ -0,0 +1,62 @@ +--- +MTPE: ModetaNiu +Date: 2024-06-06 +--- + +# Access Clusters + +Clusters integrated or created using the AI platform [Container Management](../../intro/index.md) platform can be accessed not only through the UI interface but also in two other ways for access control: + +- Access online via CloudShell +- Access via kubectl after downloading the cluster certificate + +!!! note + + When accessing the cluster, the user should have [Cluster Admin](../permissions/permission-brief.md) permission or higher. + +## Access via CloudShell + +1. Enter __Clusters__ page, select the cluster you want to access via CloudShell, click the __...__ icon on the right, and then click __Console__ from the dropdown list. + + ![screen](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/cluster-access01.png) + +2. Run __kubectl get node__ command in the Console to verify the connectivity between CloudShell and the cluster. 
If the console returns node information of the cluster, you can access and manage the cluster through CloudShell. + + + +## Access via kubectl + +If you want to access and manage remote clusters from a local node, make sure you have met these prerequisites: + +- Your local node and the cloud cluster are in a connected network. +- The cluster certificate has been downloaded to the local node. +- The kubectl tool has been installed on the local node. For detailed installation guides, see [Installing tools](https://kubernetes.io/docs/tasks/tools/). + +If everything is in place, follow these steps to access a cloud cluster from your local environment. + +1. Enter __Clusters__ page, find your target cluster, click __...__ on the right, and select __Download kubeconfig__ in the drop-down list. + + ![Enter the page of downloading certificates](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/cluster-access02.png) + +2. Set the Kubeconfig period and click __Download__ . + + ![Download certificates](../../images/access-download-cert.png) + +3. Open the downloaded certificate and copy its content to the __config__ file of the local node. + + By default, the kubectl tool will look for a file named __config__ in the __$HOME/.kube__ directory on the local node. This file stores access credentials of clusters. Kubectl can access the cluster with that configuration file. + +4. Run the following command on the local node to verify its connectivity with the cluster: + + ```sh + kubectl get pod -n default + ``` + + An expected output is as follows: + + ```none + NAME READY STATUS RESTARTS AGE + dao-2048-2048-58c7f7fc5-mq7h4 1/1 Running 0 30h + ``` + +Now you can access and manage the cluster locally with kubectl. diff --git a/docs/en/docs/end-user/kpanda/clusters/cluster-role.md b/docs/en/docs/end-user/kpanda/clusters/cluster-role.md new file mode 100644 index 0000000000..31a597097a --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/cluster-role.md @@ -0,0 +1,81 @@ +--- +MTPE: windsonsea +Date: 2024-07-19 +--- + +# Cluster Roles + +Suanova AI platform categorizes clusters based on different functionalities to help users better manage IT infrastructure. + +## Global Service Cluster + +This cluster is used to run AI platform components such as +[Container Management](../../intro/index.md), [Global Management](../../../ghippo/intro/index.md), +[Insight](../../../insight/intro/index.md), [Container Registry](../../../kangaroo/intro/index.md). +It generally does not carry business workloads. + +| Supported Features | Description | +| ------------------ | ----------- | +| K8s Version | 1.22+ | +| Operating System | RedHat 7.6 x86/ARM, RedHat 7.9 x86, RedHat 8.4 x86/ARM, RedHat 8.6 x86;
Ubuntu 18.04 x86, Ubuntu 20.04 x86;
CentOS 7.6 x86/AMD, CentOS 7.9 x86/AMD | +| Full Lifecycle Management | Supported | +| K8s Resource Management | Supported | +| Cloud Native Storage | Supported | +| Cloud Native Network | Calico, Cillium, Multus, and other CNIs | +| Policy Management | Supports network policies, quota policies, resource limits, disaster recovery policies, security policies | + +## Management Cluster + +This cluster is used to manage worker clusters and generally does not carry business workloads. + +- [Classic Mode](../../../install/commercial/deploy-requirements.md) deploys the global service cluster + and management cluster in different clusters, suitable for multi-data center, multi-architecture enterprise scenarios. +- [Simple Mode](../../../install/commercial/deploy-requirements.md) deploys the management cluster and + global service cluster in the same cluster. + +| Supported Features | Description | +| ------------------ | ----------- | +| K8s Version | 1.22+ | +| Operating System | RedHat 7.6 x86/ARM, RedHat 7.9 x86, RedHat 8.4 x86/ARM, RedHat 8.6 x86;
Ubuntu 18.04 x86, Ubuntu 20.04 x86;
CentOS 7.6 x86/AMD, CentOS 7.9 x86/AMD | +| Full Lifecycle Management | Supported | +| K8s Resource Management | Supported | +| Cloud Native Storage | Supported | +| Cloud Native Network | Calico, Cillium, Multus, and other CNIs | +| Policy Management | Supports network policies, quota policies, resource limits, disaster recovery policies, security policies | + +## Worker Cluster + +This is a cluster created using [Container Management](../../intro/index.md) and is mainly used to +carry business workloads. This cluster is managed by the management cluster. + +| Supported Features | Description | +| ------------------ | ----------- | +| K8s Version | Supports K8s 1.22 and above | +| Operating System | RedHat 7.6 x86/ARM, RedHat 7.9 x86, RedHat 8.4 x86/ARM, RedHat 8.6 x86;
Ubuntu 18.04 x86, Ubuntu 20.04 x86;
CentOS 7.6 x86/AMD, CentOS 7.9 x86/AMD | +| Full Lifecycle Management | Supported | +| K8s Resource Management | Supported | +| Cloud Native Storage | Supported | +| Cloud Native Network | Calico, Cillium, Multus, and other CNIs | +| Policy Management | Supports network policies, quota policies, resource limits, disaster recovery policies, security policies | + +## Integrated Cluster + +This cluster is used to integrate existing standard K8s clusters, including but not limited to self-built clusters +in local data centers, clusters provided by public cloud vendors, clusters provided by private cloud vendors, +edge clusters, Xinchuang clusters, heterogeneous clusters, and different Suanova clusters. +It is mainly used to carry business workloads. + +| Supported Features | Description | +| ------------------ | ----------- | +| K8s Version | 1.18+ | +| Supported Vendors | VMware Tanzu, Amazon EKS, Redhat Openshift, SUSE Rancher, Alibaba ACK, Huawei CCE, Tencent TKE, Standard K8s Cluster, Suanova | +| Full Lifecycle Management | Not Supported | +| K8s Resource Management | Supported | +| Cloud Native Storage | Supported | +| Cloud Native Network | Depends on the network mode of the integrated cluster's kernel | +| Policy Management | Supports network policies, quota policies, resource limits, disaster recovery policies, security policies | + +!!! note + + A cluster can have multiple cluster roles. For example, a cluster can be both + a global service cluster and a management cluster or a worker cluster. diff --git a/docs/en/docs/end-user/kpanda/clusters/cluster-scheduler-plugin.md b/docs/en/docs/end-user/kpanda/clusters/cluster-scheduler-plugin.md new file mode 100644 index 0000000000..d84d5a6179 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/cluster-scheduler-plugin.md @@ -0,0 +1,178 @@ +--- +MTPE: ModetaNiu +Date: 2024-06-06 +--- + +# Deploy Second Scheduler scheduler-plugins in a Cluster + +This page describes how to deploy a second scheduler-plugins in a cluster. + +## Why do we need scheduler-plugins? + +The cluster created through the platform will install the native K8s scheduler-plugin, but the native scheduler-plugin +has many limitations: + +- The native scheduler-plugin cannot meet scheduling requirements, so you can use either + [CoScheduling](https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling), + [CapacityScheduling](https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/capacityscheduling) + or other types of scheduler-plugins. +- In special scenarios, a new scheduler-plugin is needed to complete scheduling tasks without affecting the process of + the native scheduler-plugin. +- Distinguish scheduler-plugins with different functionalities and achieve different scheduling scenarios by switching + scheduler-plugin names. + +This page takes the scenario of using the vgpu scheduler-plugin while combining the coscheduling plugin capability +of scheduler-plugins as an example to introduce how to install and use scheduler-plugins. + +## Installing scheduler-plugins + +### Prerequisites + +- kubean is a new feature introduced in v0.13.0, please ensure that your version is v0.13.0 or higher. +- The installation version of scheduler-plugins is v0.27.8, please ensure that the cluster version is compatible with it. + Refer to the document [Compatibility Matrix](https://github.com/kubernetes-sigs/scheduler-plugins/tree/master?tab=readme-ov-file#compatibility-matrix). + +### Installation Process + +1. 
Add the scheduler-plugins parameters in **Create Cluster** -> **Advanced Settings** -> **Custom Parameters**.
+
+    ```yaml
+    scheduler_plugins_enabled: true
+    scheduler_plugins_plugin_config:
+      - name: Coscheduling
+        args:
+          permitWaitingTimeSeconds: 10 # default is 60
+    ```
+
+    Parameters:
+
+    - `scheduler_plugins_enabled`: Set to `true` to enable the scheduler-plugins capability.
+    - You can enable or disable certain plugins by setting the `scheduler_plugins_enabled_plugins` or
+      `scheduler_plugins_disabled_plugins` options.
+      See [K8s Official Plugin Names](https://github.com/kubernetes-sigs/scheduler-plugins?tab=readme-ov-file#plugins)
+      for reference.
+    - If you need to set parameters for custom plugins, configure `scheduler_plugins_plugin_config`,
+      for example, set the `permitWaitingTimeSeconds` parameter for Coscheduling as shown above.
+      See [K8s Official Plugin Configuration](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/manifests/coscheduling/scheduler-config.yaml) for reference.
+
+
+
+2. After successful cluster creation, the system will automatically install the scheduler-plugins and
+   controller workloads. You can check the workload status in the deployments of the corresponding cluster.
+
+
+
+## Using scheduler-plugins
+
+The following example demonstrates a scenario where the vgpu scheduler is used in combination with the coscheduling plugin capability of scheduler-plugins.
+
+1. Install vgpu in the Helm Charts and set the values.yaml parameters.
+
+    - `schedulerName: scheduler-plugins-scheduler`: This is the scheduler name for scheduler-plugins installed by
+      kubean, and currently cannot be modified.
+    - `scheduler.kubeScheduler.enabled: false`: Do not install kube-scheduler and use vgpu-scheduler as a separate extender.
+
+
+
+1. Extend vgpu-scheduler on scheduler-plugins. 
+ + ```bash + [root@master01 charts]# kubectl get cm -n scheduler-plugins scheduler-config -ojsonpath="{.data.scheduler-config\.yaml}" + ``` + + ```yaml + apiVersion: kubescheduler.config.k8s.io/v1 + kind: KubeSchedulerConfiguration + leaderElection: + leaderElect: false + profiles: + # Compose all plugins in one profile + - schedulerName: scheduler-plugins-scheduler + plugins: + multiPoint: + enabled: + - name: Coscheduling + - name: CapacityScheduling + - name: NodeResourceTopologyMatch + - name: NodeResourcesAllocatable + disabled: + - name: PrioritySort + pluginConfig: + - args: + permitWaitingTimeSeconds: 10 + name: Coscheduling + ``` + + Modify configmap of scheduler-config for scheduler-plugins: + + ```bash + [root@master01 charts]# kubectl get cm -n scheduler-plugins scheduler-config -ojsonpath="{.data.scheduler-config\.yaml}" + ``` + + ```yaml + apiVersion: kubescheduler.config.k8s.io/v1 + kind: KubeSchedulerConfiguration + leaderElection: + leaderElect: false + profiles: + # Compose all plugins in one profile + - schedulerName: scheduler-plugins-scheduler + plugins: + multiPoint: + enabled: + - name: Coscheduling + - name: CapacityScheduling + - name: NodeResourceTopologyMatch + - name: NodeResourcesAllocatable + disabled: + - name: PrioritySort + pluginConfig: + - args: + permitWaitingTimeSeconds: 10 + name: Coscheduling + extenders: + - urlPrefix: "${urlPrefix}" + filterVerb: filter + bindVerb: bind + nodeCacheCapable: true + ignorable: true + httpTimeout: 30s + weight: 1 + enableHTTPS: true + tlsConfig: + insecure: true + managedResources: + - name: nvidia.com/vgpu + ignoredByScheduler: true + - name: nvidia.com/gpumem + ignoredByScheduler: true + - name: nvidia.com/gpucores + ignoredByScheduler: true + - name: nvidia.com/gpumem-percentage + ignoredByScheduler: true + - name: nvidia.com/priority + ignoredByScheduler: true + - name: cambricon.com/mlunum + ignoredByScheduler: true + ``` + +1. After installing vgpu-scheduler, the system will automatically create a service (svc), and the urlPrefix + specifies the URL of the svc. + + !!! note + + - The svc refers to the pod service load. You can use the following command in the namespace where the + nvidia-vgpu plugin is installed to get the external access information for port 443. + + ```shell + kubectl get svc -n ${namespace} + ``` + + - The urlPrefix format is `https://${ip address}:${port}` + +1. Restart the scheduler pod of scheduler-plugins to load the new configuration file. + + !!! note + + When creating a vgpu application, you do not need to specify the name of a scheduler-plugin. The vgpu-scheduler webhook + will automatically change the scheduler's name to "scheduler-plugins-scheduler" without manual specification. diff --git a/docs/en/docs/end-user/kpanda/clusters/cluster-status.md b/docs/en/docs/end-user/kpanda/clusters/cluster-status.md new file mode 100644 index 0000000000..a488884773 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/cluster-status.md @@ -0,0 +1,35 @@ +--- +MTPE: ModetaNiu +date: 2024-06-06 +--- + +# Cluster Status + +AI platform Container Management module can manage two types of clusters: integrated clusters and created clusters. + +- Integrated clusters: clusters created in other platforms and now integrated into AI platform. +- Created clusters: clusters created in AI platform. + +For more information about cluster types, see [Cluster Role](cluster-role.md). + +We designed several status for these two clusters. 
+ +## Integrated Clusters + +| Status | Description | +| ------ | ----------- | +| Integrating | The cluster is being integrated into AI platform. | +| Removing | The cluster is being removed from AI platform. | +| Running | The cluster is running as expected. | +| Unknown | The cluster is lost. Data displayed in the AI platform UI is the cached data before the disconnection, which does not represent real-time data. Any operation during this status will not take effect. You should check cluster network connectivity or host status. | + +## Created Clusters + +| Status | Description | +| ------ | ----------- | +| Creating | The cluster is being created. | +| Updating | The Kubernetes version of the cluster is being operating. | +| Deleting | The cluster is being deleted. | +| Running | The cluster is running as expected. | +| Unknown | The cluster is lost. Data displayed in the AI platform UI is the cached data before the disconnection, which does not represent real-time data. Any operation during this status will not take effect. You should check cluster network connectivity or host status. | +| Failed | The cluster creation is failed. You should check the logs for detailed reasons. | diff --git a/docs/en/docs/end-user/kpanda/clusters/cluster-version.md b/docs/en/docs/end-user/kpanda/clusters/cluster-version.md new file mode 100644 index 0000000000..b0fc5e6c69 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/cluster-version.md @@ -0,0 +1,63 @@ +--- +MTPE: windsonsea +date: 2024-01-08 +--- + +# Supported Kubernetes Versions + +In AI platform, the [integrated clusters](cluster-status.md) and [created clusters](./cluster-status.md) have different version support mechanisms. + +This page focuses on the version support mechanism for created clusters. + +The Kubernetes community supports three version ranges: 1.26, 1.27, and 1.28. When a new version +is released by the community, the supported version range is incremented. For example, if the +latest version released by the community is 1.27, the supported version range by the community +will be 1.27, 1.28, and 1.29. + +To ensure the security and stability of the clusters, when creating clusters +in AI platform, the supported version range will always be one version lower than the community's +version. + +For instance, if the Kubernetes community supports v1.25, v1.26, and v1.27, then the +version range for creating worker clusters in AI platform will be +v1.24, v1.25, and v1.26. Additionally, a stable version, such as 1.24.7, will be recommended to users. + +Furthermore, the version range for creating worker clusters in AI platform +will remain highly synchronized with the community. When the community version increases +incrementally, the version range for creating worker clusters in +AI platform will also increase by one version. + +## Supported Kubernetes Versions + + + + + + + + + + + + + + + + + + + + +
+| Kubernetes Community Versions | Created Worker Cluster Versions | Recommended Versions for Created Worker Cluster | AI platform Installer | Release Date |
+| ----------------------------- | ------------------------------- | ----------------------------------------------- | --------------------- | ------------ |
+| 1.26, 1.27, 1.28 | 1.26, 1.27, 1.28 | 1.27.5 | v0.13.0 | 2023.11.30 |
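+
+To confirm which Kubernetes version a created worker cluster is actually running, you can query it with kubectl once you have the cluster's kubeconfig (see [Access Clusters](access-cluster.md)); this is just a convenience check, not a required step.
+
+```bash
+# Version reported by the API server (and the local kubectl client).
+kubectl version
+
+# The VERSION column shows the kubelet version on each node.
+kubectl get nodes -o wide
+```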
diff --git a/docs/en/docs/end-user/kpanda/clusters/create-cluster.md b/docs/en/docs/end-user/kpanda/clusters/create-cluster.md new file mode 100644 index 0000000000..3698c58201 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/create-cluster.md @@ -0,0 +1,96 @@ +--- +date: 2023-08-08 +hide: + - toc +--- + +# Create Worker Clusters + +In AI platform Container Management, clusters can have four [roles](./cluster-role.md): +global service cluster, management cluster, worker cluster, and integrated cluster. +An integrated cluster can only be integrated from third-party vendors (see [Integrate Cluster](./integrate-cluster.md)). + +This page explains how to create a Worker Cluster. By default, when creating a new Worker Cluster, the operating system type and CPU architecture of the worker nodes should be consistent with the Global Service Cluster. If you want to create a cluster with a different operating system or architecture than the Global Management Cluster, refer to [Creating an Ubuntu Worker Cluster on a CentOS Management Platform](../../best-practice/create-ubuntu-on-centos-platform.md) for instructions. + +It is recommended to use the [supported operating systems in AI platform](../../../install/commercial/deploy-requirements.md) to create the cluster. If your local nodes are not within the supported range, you can refer to [Creating a Cluster on Non-Mainstream Operating Systems](../../best-practice/use-otherlinux-create-custer.md) for instructions. + +## Prerequisites + +Certain prerequisites must be met before creating a cluster: + +- Prepare enough nodes to be joined into the cluster. +- It is recommended to use Kubernetes version 1.25.7. For the specific version range, refer to the + [AI platform Cluster Version Support System](./cluster-version.md). Currently, the supported version + range for created worker clusters is `v1.26.0-v1.28`. If you need to create a cluster with a + lower version, refer to the [Supporterd Cluster Versions](./cluster-version.md). +- The target host must allow IPv4 forwarding. If using IPv6 in Pods and Services, + the target server needs to allow IPv6 forwarding. +- AI platform does not provide firewall management. You need to pre-define the firewall rules of + the target host by yourself. To avoid errors during cluster creation, it is recommended + to disable the firewall of the target host. +- See [Node Availability Check](../nodes/node-check.md). + +## Steps + +1. Enter the Container Management module, click __Create Cluster__ on the upper right corner of the __Clusters__ page. + + ![click create button](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/cluster-create01.png) + +2. Fill in the basic information by referring to the following instructions. + + - Cluster Name: only contain lowercase letters, numbers, and hyphens ("-"). Must start and end with a lowercase letter or number and totally up to 63 characters. + - Managed By: Choose a cluster to manage this new cluster through its lifecycle, such as creating, upgrading, node scaling, deleting the new cluster, etc. + - Runtime: Select the runtime environment of the cluster. Currently support containerd and docker (see [How to Choose Container Runtime](runtime.md)). + - Kubernetes Version: Allow span of three major versions, such as from 1.23-1.25, subject to the versions supported by the management cluster. + + ![basic info](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/cluster-create02.png) + +3. 
Fill in the node configuration information and click __Node Check__ .
+
+    - High Availability: When enabled, at least 3 controller nodes are required. When disabled, only 1 controller node is needed.
+
+      > It is recommended to use High Availability mode in production environments.
+
+    - Credential Type: Choose whether to access nodes using username/password or public/private keys.
+
+      > If using public/private key authentication, SSH keys for the nodes need to be configured in advance. Refer to [Using SSH Key Authentication for Nodes](../nodes/node-authentication.md).
+
+    - Same Password: When enabled, all nodes in the cluster will have the same access password. Enter the unified password for accessing all nodes in the field below. If disabled, you can set separate usernames and passwords for each node.
+    - Node Information: Set node names and IPs.
+    - NTP Time Synchronization: When enabled, time will be automatically synchronized across all nodes. Provide the NTP server address.
+
+    ![node check](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/cluster-create03.png)
+
+4. If the node check passes, click __Next__ . If the check fails, update the __Node Information__ and check again.
+5. Fill in the network configuration and click __Next__ .
+
+    - CNI: Provides network services for Pods in the cluster. The CNI cannot be changed after the cluster is created. Supports cilium and calico. Selecting __none__ means no CNI will be installed when creating the cluster; you may install a CNI later.
+
+      > For CNI configuration details, see [Cilium Installation Parameters](../../../network/modules/cilium/install.md) or [Calico Installation Parameters](../../../network/modules/calico/install.md).
+
+    - Container IP Range: Set an IP range for allocating IPs to containers in the cluster. This range determines the max number of containers allowed in the cluster. It cannot be modified after creation.
+    - Service IP Range: Set an IP range for allocating IPs to container Services in the cluster. This range determines the max number of container Services that can be created in the cluster. It cannot be modified after creation.
+
+6. Fill in the plug-in configuration and click __Next__ .
+
+7. Fill in advanced settings and click __OK__ .
+
+    - __kubelet_max_pods__ : Set the maximum number of Pods per node. The default is 110.
+    - __hostname_override__ : Reset the hostname (not recommended).
+    - __kubernetes_audit__ : Kubernetes audit log, enabled by default.
+    - __auto_renew_certificate__ : Automatically renew the certificate of the control plane on the first Monday of each month, enabled by default.
+    - __disable_firewalld&ufw__ : Disable the firewall to prevent the node from being inaccessible during installation.
+    - __Insecure_registries__ : Set the address of your private container registry. If you use a private container registry, filling in its address here lets the container engine bypass certificate authentication and pull the image.
+    - __yum_repos__ : Fill in the Yum source registry address.
+
+!!! success
+
+    - After correctly filling in the above information, the page will prompt that the cluster is being created.
+    - Creating a cluster takes a long time, so you need to wait patiently. You can click the __Back to Clusters__ button to let it run in the background.
+    - To view the current status, click __Real-time Log__ .
+
+!!! note
+
+    - When the cluster is in an unknown state, it means that the current cluster has been disconnected. 
+ - The data displayed by the system is the cached data before the disconnection, which does not represent real data. + - Any operations performed in the disconnected state will not take effect. Please check the cluster network connectivity or Host Status. diff --git a/docs/en/docs/end-user/kpanda/clusters/delete-cluster.md b/docs/en/docs/end-user/kpanda/clusters/delete-cluster.md new file mode 100644 index 0000000000..579ac44bd6 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/delete-cluster.md @@ -0,0 +1,37 @@ +--- +hide: + - toc +--- + +# Delete/Remove Clusters + +Clusters created in AI platform [Container Management](../../intro/index.md) can be either deleted or removed. Clusters integrated into AI platform can only be removed. + +!!! Info + + If you want to delete an integrated cluster, you should delete it in the platform where it is created. + +In AI platform, the difference between __Delete__ and __Remove__ is: + +- __Delete__ will destroy the cluster and reset the data of all nodes under the cluster. All data will be totally cleared and lost. Making a backup before deleting a cluster is a recommended best practice. You can no longer use that cluster anymore. +- __Remove__ just removes the cluster from AI platform. It will not destroy the cluster and no data will be lost. You can still use the cluster in other platforms or re-integrate it into AI platform later if needed. + +!!! note + + - You should have [Admin](../../../ghippo/access-control/role.md) + or [Kpanda Owner](../../../ghippo/access-control/global.md) permissions + to perform delete or remove operations. + - Before deleting a cluster, you should turn off __Cluster Deletion Protection__ in + __Cluster Settings__ -> __Advanced Settings__ , otherwise the __Delete Cluster__ option will not be displayed. + - The __global service cluster__ cannot be deleted or removed. + +1. Enter the Container Management module, find your target cluster, click __ ...__ on the right, + and select __Delete cluster__ / __Remove__ in the drop-down list. + + ![screen](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/cluster-delete01.png) + +2. Enter the cluster name to confirm and click __Delete__ . + + ![screen](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/cluster-delete02.png) + +3. You will be auto directed to cluster lists. The status of this cluster will changed to __Deleting__ . It may take a while to delete/remove a cluster. diff --git a/docs/en/docs/end-user/kpanda/clusters/integrate-cluster.md b/docs/en/docs/end-user/kpanda/clusters/integrate-cluster.md new file mode 100644 index 0000000000..b172d633ff --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/integrate-cluster.md @@ -0,0 +1,41 @@ +--- +MTPE: windsonsea +Date: 2024-07-19 +hide: + - toc +--- + +# Integrate Clusters + +With the features of integrating clusters, AI platform allows you to manage on-premise and cloud clusters of various providers in a unified manner. This is quite important in avoiding the risk of being locked in by a certain providers, helping enterprises safely migrate their business to the cloud. + +In AI platform Container Management module, you can integrate a cluster of the following providers: standard Kubernetes clusters, Redhat Openshift, SUSE Rancher, VMware Tanzu, Amazon EKS, Aliyun ACK, Huawei CCE, Tencent TKE, etc. + +## Prerequisites + +- Prepare a cluster of K8s v1.22+ and ensure its network connectivity. 
+- The operator should have the [NS Editor](../permissions/permission-brief.md) or higher permissions. + +## Steps + +1. Enter Container Management module, and click __Integrate Cluster__ in the upper right corner. + + ![screen](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/cluster-integrate01.png) + +2. Fill in the basic information by referring to the following instructions. + + - Cluster Name: It should be unique and cannot be changed after the integration. Maximum 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. + - Cluster Alias: Enter any characters, no more than 60 characters. + - Release Distribution: the cluster provider, support mainstream vendors listed at the beginning. + +3. Fill in the KubeConfig of the target cluster and click __Verify Config__ . The cluster can be successfully connected only after the verification is passed. + + > Click __How do I get the KubeConfig?__ to see the specific steps for getting this file. + + ![screen](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/cluster-integrate03.png) + +4. Confirm that all parameters are filled in correctly and click __OK__ in the lower right corner of the page. + +!!! note + + The status of the newly integrated cluster is __Integrating__ , which will become __Running__ after the integration succeeds. diff --git a/docs/en/docs/end-user/kpanda/clusters/integrate-rancher-cluster.md b/docs/en/docs/end-user/kpanda/clusters/integrate-rancher-cluster.md new file mode 100644 index 0000000000..019159b3db --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/integrate-rancher-cluster.md @@ -0,0 +1,222 @@ +--- +MTPE: ModetaNiu +date: 2024-06-06 +--- + +# Integrate the Rancher Cluster + +This page explains how to integrate a Rancher cluster. + +## Prerequisites + +- Prepare a Rancher cluster with administrator privileges and ensure network connectivity between the container management cluster and the target cluster. +- Be equipped with permissions not lower than [kpanda owner](../permissions/permission-brief.md). + +## Steps + +### Step 1: Create a ServiceAccount user with administrator privileges in the Rancher cluster + +1. Log in to the Rancher cluster with a role that has administrator privileges, and create a file named __sa.yaml__ + using the terminal. + + ```bash + vi sa.yaml + ``` + + Press the __i__ key to enter insert mode, then copy and paste the following content: + + ```yaml title="sa.yaml" + apiVersion: rbac.authorization.k8s.io/v1 + kind: ClusterRole + metadata: + name: rancher-rke + rules: + - apiGroups: + - '*' + resources: + - '*' + verbs: + - '*' + - nonResourceURLs: + - '*' + verbs: + - '*' + --- + apiVersion: rbac.authorization.k8s.io/v1 + kind: ClusterRoleBinding + metadata: + name: rancher-rke + roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: rancher-rke + subjects: + - kind: ServiceAccount + name: rancher-rke + namespace: kube-system + --- + apiVersion: v1 + kind: ServiceAccount + metadata: + name: rancher-rke + namespace: kube-system + ``` + + Press the __Esc__ key to exit insert mode, then type __:wq__ to save and exit. + +2. 
Run the following command in the current directory to create a ServiceAccount named __rancher-rke__
+    (referred to as __SA__ for short):
+
+    ```bash
+    kubectl apply -f sa.yaml
+    ```
+
+    The expected output is as follows:
+
+    ```console
+    clusterrole.rbac.authorization.k8s.io/rancher-rke created
+    clusterrolebinding.rbac.authorization.k8s.io/rancher-rke created
+    serviceaccount/rancher-rke created
+    ```
+
+3. Create a secret named __rancher-rke-secret__ and bind the secret to the __rancher-rke__ SA.
+
+    ```bash
+    kubectl apply -f - <<EOF
+    apiVersion: v1
+    kind: Secret
+    metadata:
+      name: rancher-rke-secret
+      namespace: kube-system
+      annotations:
+        kubernetes.io/service-account.name: rancher-rke
+    type: kubernetes.io/service-account-token
+    EOF
+    ```
+
+    Check the secret created for the __rancher-rke__ SA:
+
+    ```bash
+    kubectl -n kube-system describe secret rancher-rke-secret
+    ```
+
+    The expected output is as follows:
+
+    ```console
+    Name:         rancher-rke-secret
+    Namespace:    kube-system
+    Labels:       <none>
+    Annotations:  kubernetes.io/service-account.name: rancher-rke
+                  kubernetes.io/service-account.uid: d83df5d9-bd7d-488d-a046-b740618a0174
+
+    Type: kubernetes.io/service-account-token
+
+    Data
+    ====
+    ca.crt: 570 bytes
+    namespace: 11 bytes
+    token: eyJhbGciOiJSUzI1NiIsImtpZCI6IjUtNE9nUWZLRzVpbEJORkZaNmtCQXhqVzRsZHU4MHhHcDBfb0VCaUo0V1kifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJyYW5jaGVyLXJrZS1zZWNyZXQiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicmFuY2hlci1ya2UiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJkODNkZjVkOS1iZDdkLTQ4OGQtYTA0Ni1iNzQwNjE4YTAxNzQiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZS1zeXN0ZW06cmFuY2hlci1ya2UifQ.VNsMtPEFOdDDeGt_8VHblcMRvjOwPXMM-79o9UooHx6q-VkHOcIOp3FOT2hnEdNnIsyODZVKCpEdCgyozX-3y5x2cZSZpocnkMcBbQm-qfTyUcUhAY7N5gcYUtHUhvRAsNWJcsDCn6d96gT_qo-ddo_cT8Ri39Lc123FDYOnYG-YGFKSgRQVy7Vyv34HIajZCCjZzy7i--eE_7o4DXeTjNqAFMFstUxxHBOXI3Rdn1zKQKqh5Jhg4ES7X-edSviSUfJUX-QV_LlAw5DuAyGPH7bDH4QaQ5k-p6cIctmpWZE-9wRDlKA4LYRblKE7MJcI6OmM4ldlMM0Jc8N-gCtl4w
+    ```
+
+### Step 2: Update kubeconfig with the rancher-rke SA authentication on your local machine
+
+Perform the following steps on any local node where __kubelet__ is installed:
+
+1. Configure the kubelet token.
+
+    ```bash
+    kubectl config set-credentials rancher-rke --token=<token value in __rancher-rke-secret__>
+    ```
+
+    For example,
+
+    ```
+    kubectl config set-credentials eks-admin --token=eyJhbGciOiJSUzI1NiIsImtpZCI6IjUtNE9nUWZLRzVpbEJORkZaNmtCQXhqVzRsZHU4MHhHcDBfb0VCaUo0V1kifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJyYW5jaGVyLXJrZS1zZWNyZXQiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicmFuY2hlci1ya2UiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJkODNkZjVkOS1iZDdkLTQ4OGQtYTA0Ni1iNzQwNjE4YTAxNzQiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZS1zeXN0ZW06cmFuY2hlci1ya2UifQ.VNsMtPEFOdDDeGt_8VHblcMRvjOwPXMM-79o9UooHx6q-VkHOcIOp3FOT2hnEdNnIsyODZVKCpEdCgyozX-3y5x2cZSZpocnkMcBbQm-qfTyUcUhAY7N5gcYUtHUhvRAsNWJcsDCn6d96gT_qo-ddo_cT8Ri39Lc123FDYOnYG-YGFKSgRQVy7Vyv34HIajZCCjZzy7i--eE_7o4DXeTjNqAFMFstUxxHBOXI3Rdn1zKQKqh5Jhg4ES7X-edSviSUfJUX-QV_LlAw5DuAyGPH7bDH4QaQ5k-p6cIctmpWZE-9wRDlKA4LYRblKE7MJcI6OmM4ldlMM0Jc8N-gCtl4w
+    ```
+
+2. Configure the kubelet APIServer information.
+
+    ```bash
+    kubectl config set-cluster {cluster-name} --insecure-skip-tls-verify=true --server={APIServer}
+    ```
+
+    - __{cluster-name}__ : the name of your Rancher cluster.
+    - __{APIServer}__ : the access address of the cluster, usually referring to the IP address of the control node
+      and port "6443", such as `https://10.X.X.X:6443`.
+ + For example, + + ```bash + kubectl config set-cluster rancher-rke --insecure-skip-tls-verify=true --server=https://10.X.X.X:6443 + ``` + +3. Configure the kubelet context. + + ```bash + kubectl config set-context {context-name} --cluster={cluster-name} --user={SA-usename} + ``` + + For example, + + ```bash + kubectl config set-context rancher-rke-context --cluster=rancher-rke --user=rancher-rke + ``` + +4. Specify the newly created context __rancher-rke-context__ in kubelet. + + ```bash + kubectl config use-context rancher-rke-context + ``` + +5. Fetch the kubeconfig information for the context __rancher-rke-context__ . + + ```bash + kubectl config view --minify --flatten --raw + ``` + + The output is expected to be: + + ```yaml + apiVersion: v1 + clusters: + - cluster: + insecure-skip-tls-verify: true + server: https://77C321BCF072682C70C8665ED4BFA10D.gr7.ap-southeast-1.eks.amazonaws.com + name: joincluster + contexts: + - context: + cluster: joincluster + user: eks-admin + name: ekscontext + current-context: ekscontext + kind: Config + preferences: {} + users: + - name: eks-admin + user: + token: eyJhbGciOiJSUzI1NiIsImtpZCI6ImcxTjJwNkktWm5IbmRJU1RFRExvdWY1TGFWVUtGQ3VIejFtNlFQcUNFalEifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2V + +### Step 3: Connect the cluster in the Suanova Interface + +Using the kubeconfig file fetched earlier, refer to the [Integrate Cluster](./integrate-cluster.md) documentation to integrate the Rancher cluster to the global cluster. diff --git a/docs/en/docs/end-user/kpanda/clusters/runtime.md b/docs/en/docs/end-user/kpanda/clusters/runtime.md new file mode 100644 index 0000000000..60e22b2112 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/runtime.md @@ -0,0 +1,19 @@ +# How to choose the container runtime + +The container runtime is an important component in kubernetes to manage the life cycle of containers and container images. Kubernetes made containerd the default container runtime in version 1.19, and removed support for the Dockershim component in version 1.24. + +Therefore, compared to the Docker runtime, we **recommend you to use the lightweight containerd as your container runtime**, because this has become the current mainstream runtime choice. + +In addition, some operating system distribution vendors are not friendly enough for Docker runtime compatibility. The runtime support of different operating systems is as follows: + +## Operating systems and supported runtimes + +| Operating System | Supported containerd Versions | Supported Docker Versions | +|--------------|---------------|------------| +| CentOS | 1.5.5, 1.5.7, 1.5.8, 1.5.9, 1.5.10, 1.5.11, 1.5.12, 1.5.13, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 1.6.4, 1.6.5, 1.6.6, 1.6.7, 1.6.8, 1.6.9, 1.6.10, 1.6.11, 1.6.12, 1.6.13, 1.6.14, 1.6.15 (default) | 18.09, 19.03, 20.10 (default) | +| RedHatOS | 1.5.5, 1.5.7, 1.5.8, 1.5.9, 1.5.10, 1.5.11, 1.5.12, 1.5.13, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 1.6.4, 1.6.5, 1.6.6, 1.6.7, 1.6.8, 1.6.9, 1.6.10, 1.6.11, 1.6.12, 1.6.13, 1.6.14, 1.6.15 (default) | 18.09, 19.03, 20.10 (default) | +| KylinOS | 1.5.5, 1.5.7, 1.5.8, 1.5.9, 1.5.10, 1.5.11, 1.5.12, 1.5.13, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 1.6.4, 1.6.5, 1.6.6, 1.6.7, 1.6.8, 1.6.9, 1.6.10, 1.6.11, 1.6.12, 1.6.13, 1.6.14, 1.6.15 (default) | 19.03 (Only supported by ARM architecture, Docker is not supported as a runtime under x86 architecture)| + +!!! 
note + + In the offline installation mode, you need to prepare the runtime offline package of the relevant operating system in advance. diff --git a/docs/en/docs/end-user/kpanda/clusters/upgrade-cluster.md b/docs/en/docs/end-user/kpanda/clusters/upgrade-cluster.md new file mode 100644 index 0000000000..b29c5cbfc4 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/clusters/upgrade-cluster.md @@ -0,0 +1,43 @@ +--- +date: 2022-11-17 +hide: + - toc +--- + +# Cluster Upgrade + +The Kubernetes Community packages a small version every quarter, and the maintenance cycle of each version is only about 9 months. Some major bugs or security holes will not be updated after the version stops maintenance. Manually upgrading cluster operations is cumbersome and places a huge workload on administrators. + +In Suanova, you can upgrade the Kubernetes cluster with one click through the web UI interface. + +!!! danger + + After the version is upgraded, it will not be possible to roll back to the previous version, please proceed with caution. + +!!! note + + - Kubernetes versions are denoted as __x.y.z__ , where __x__ is the major version, __y__ is the minor version, and __z__ is the patch version. + - Cluster upgrades across minor versions are not allowed, e.g. a direct upgrade from 1.23 to 1.25 is not possible. + - **Access clusters do not support version upgrades. If there is no "cluster upgrade" in the left navigation bar, please check whether the cluster is an access cluster. ** + - The global service cluster can only be upgraded through the terminal. + - When upgrading a worker cluster, the [Management Cluster](cluster-role.md#management-clusters) of the worker cluster should have been connected to the container management module and be running normally. + +1. Click the name of the target cluster in the cluster list. + + + +2. Then click __Cluster Operation and Maintenance__ -> __Cluster Upgrade__ in the left navigation bar, and click __Version Upgrade__ in the upper right corner of the page. + + + +3. Select the version that can be upgraded, and enter the cluster name to confirm. + + + +4. After clicking __OK__ , you can see the upgrade progress of the cluster. + + + +5. The cluster upgrade is expected to take 30 minutes. You can click the __Real-time Log__ button to view the detailed log of the cluster upgrade. + + \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/configmaps-secrets/create-configmap.md b/docs/en/docs/end-user/kpanda/configmaps-secrets/create-configmap.md new file mode 100644 index 0000000000..fb2db59944 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/configmaps-secrets/create-configmap.md @@ -0,0 +1,90 @@ +# Create ConfigMaps + +ConfigMaps store non-confidential data in the form of key-value pairs to achieve the effect of +mutual decoupling of configuration data and application code. ConfigMaps can be used as +environment variables for containers, command-line parameters, or configuration files in storage volumes. + +!!! note + + - The data saved in ConfigMaps cannot exceed 1 MiB. If you need to store larger volumes of data, + it is recommended to mount a storage volume or use an independent database or file service. + + - ConfigMaps do not provide confidentiality or encryption. If you want to store encrypted data, + it is recommended to use [secret](use-secret.md), or other third-party tools to ensure the + privacy of data. 
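+Before walking through the two creation methods below, it may help to see the same kind of object
+created directly with `kubectl`. This is only a quick sketch for reference; the ConfigMap name, keys,
+and values are illustrative and not part of the console workflow described on this page.
+
+```bash
+# Create a ConfigMap from literal key-value pairs (name and keys are examples)
+kubectl create configmap demo-config \
+  --from-literal=LOG_LEVEL=info \
+  --from-literal=FEATURE_FLAG=enabled
+
+# Review the generated object; the same YAML can be pasted into the YAML creation flow below
+kubectl get configmap demo-config -o yaml
+```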
+ +You can create ConfigMaps with two methods: + +- Graphical form creation +- YAML creation + +## Prerequisites + +- [Integrated the Kubernetes cluster](../clusters/integrate-cluster.md) or + [created the Kubernetes cluster](../clusters/create-cluster.md), + and you can access the UI interface of the cluster. + +- Created a [namespace](../namespaces/createns.md), + [user](../../../ghippo/access-control/user.md), + and authorized the user as [NS Editor](../permissions/permission-brief.md#ns-editor). + For details, refer to [Namespace Authorization](../permissions/cluster-ns-auth.md). + +## Graphical form creation + +1. Click the name of a cluster on the __Clusters__ page to enter __Cluster Details__ . + + + +2. In the left navigation bar, click __ConfigMap and Secret__ -> __ConfigMap__ , and click the __Create ConfigMap__ button in the upper right corner. + + + +3. Fill in the configuration information on the __Create ConfigMap__ page, and click __OK__ . + + !!! note + + Click __Upload File__ to import an existing file locally to quickly create ConfigMaps. + + + +4. After the creation is complete, click More on the right side of the ConfigMap to edit YAML, update, export, delete and other operations. + + + +## YAML creation + +1. Click the name of a cluster on the __Clusters__ page to enter __Cluster Details__ . + + + +2. In the left navigation bar, click __ConfigMap and Secret__ -> __ConfigMap__ , and click the __YAML Create__ button in the upper right corner. + + + +3. Fill in or paste the configuration file prepared in advance, and then click __OK__ in the lower right corner of the pop-up box. + + !!! note + + - Click __Import__ to import an existing file locally to quickly create ConfigMaps. + - After filling in the data, click __Download__ to save the configuration file locally. + + + +4. After the creation is complete, click More on the right side of the ConfigMap to edit YAML, update, export, delete and other operations. + + + +## ConfigMap YAML example + + ```yaml + kind: ConfigMap + apiVersion: v1 + metadata: + name: kube-root-ca.crt + namespace: default + annotations: + data: + version: '1.0' + ``` + +[Next step: Use ConfigMaps](use-configmap.md){ .md-button .md-button--primary } \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/configmaps-secrets/create-secret.md b/docs/en/docs/end-user/kpanda/configmaps-secrets/create-secret.md new file mode 100644 index 0000000000..c569f9e4bc --- /dev/null +++ b/docs/en/docs/end-user/kpanda/configmaps-secrets/create-secret.md @@ -0,0 +1,89 @@ +# Create Secret + +A secret is a resource object used to store and manage sensitive information such as passwords, +OAuth tokens, SSH, TLS credentials, etc. Using keys means you don't need to include sensitive secrets +in your application code. + +Secrets can be used in some cases: + +- Used as an environment variable of the container to provide some necessary information + required during the running of the container. +- Use secrets as pod data volumes. +- As the identity authentication credential for the container registry + when the kubelet pulls the container image. 
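+As a quick reference before the console-based methods below, an equivalent secret can also be created
+directly with `kubectl`. This is only a sketch; the names, credentials, and registry address are
+placeholders, not values required by this page.
+
+```bash
+# Create a generic (Opaque) secret from literal values
+kubectl create secret generic db-credentials \
+  --from-literal=username=admin \
+  --from-literal=password='S3cureP@ss'
+
+# Create an image pull secret for a private container registry
+kubectl create secret docker-registry my-registry-secret \
+  --docker-server=registry.example.com \
+  --docker-username=admin \
+  --docker-password='S3cureP@ss'
+```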
+ +You can create ConfigMaps with two methods: + +- Graphical form creation +- YAML creation + +## Prerequisites + +- [Integrated the Kubernetes cluster](../clusters/integrate-cluster.md) or + [created the Kubernetes cluster](../clusters/create-cluster.md), + and you can access the UI interface of the cluster + +- Created a [namespace](../namespaces/createns.md), + [user](../../../ghippo/access-control/user.md), + and authorized the user as [NS Editor](../permissions/permission-brief.md#ns-editor). + For details, refer to [Namespace Authorization](../permissions/cluster-ns-auth.md). + +## Create secret with wizard + +1. Click the name of a cluster on the __Clusters__ page to enter __Cluster Details__ . + + + +2. In the left navigation bar, click __ConfigMap and Secret__ -> __Secret__ , and click the __Create Secret__ button in the upper right corner. + + + +3. Fill in the configuration information on the __Create Secret__ page, and click __OK__ . + + + + Note when filling in the configuration: + + - The name of the key must be unique within the same namespace + - Key type: + - Default (Opaque): Kubernetes default key type, which supports arbitrary data defined by users. + - TLS (kubernetes.io/tls): credentials for TLS client or server data access. + - Container registry information (kubernetes.io/dockerconfigjson): Credentials for Container registry access. + - username and password (kubernetes.io/basic-auth): Credentials for basic authentication. + - Custom: the type customized by the user according to business needs. + - Key data: the data stored in the key, the parameters that need to be filled in are different for different data + - When the key type is default (Opaque)/custom: multiple key-value pairs can be filled in. + - When the key type is TLS (kubernetes.io/tls): you need to fill in the certificate certificate and private key data. Certificates are self-signed or CA-signed credentials used for authentication. A certificate request is a request for a signature and needs to be signed with a private key. + - When the key type is container registry information (kubernetes.io/dockerconfigjson): you need to fill in the account and password of the private container registry. + - When the key type is username and password (kubernetes.io/basic-auth): Username and password need to be specified. + +## YAML creation + +1. Click the name of a cluster on the __Clusters__ page to enter __Cluster Details__ . + + + +2. In the left navigation bar, click __ConfigMap and Secret__ -> __Secret__ , and click the __YAML Create__ button in the upper right corner. + + + +3. Fill in the YAML configuration on the __Create with YAML__ page, and click __OK__ . + + > Supports importing YAML files from local or downloading and saving filled files to local. + + + +## key YAML example + + ```yaml + apiVersion: v1 + kind: Secret + metadata: + name: secretdemo + type: Opaque + data: + username: **** + password: **** + ``` + +[Next step: use secret](use-secret.md){ .md-button .md-button--primary } \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/configmaps-secrets/use-configmap.md b/docs/en/docs/end-user/kpanda/configmaps-secrets/use-configmap.md new file mode 100644 index 0000000000..29578da8a0 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/configmaps-secrets/use-configmap.md @@ -0,0 +1,147 @@ +# Use ConfigMaps + +ConfigMap (ConfigMap) is an API object of Kubernetes, which is used to save non-confidential data into key-value pairs, and can store configurations that other objects need to use. 
+When used, the container can use it as an environment variable, a command-line argument, or a configuration file in a storage volume. By using ConfigMaps, configuration data and application code can be separated, providing a more flexible way to modify application configuration. + +!!! note + + ConfigMaps do not provide confidentiality or encryption. If the data to be stored is confidential, please use [secret](use-secret.md), or use other third-party tools to ensure the privacy of the data instead of ConfigMaps. + In addition, when using ConfigMaps in containers, the container and ConfigMaps must be in the same cluster namespace. + +## scenes to be used + +You can use ConfigMaps in Pods. There are many use cases, mainly including: + +- Use ConfigMaps to set the environment variables of the container + +- Use ConfigMaps to set the command line parameters of the container + +- Use ConfigMaps as container data volumes + +## Set the environment variables of the container + +You can use the ConfigMap as the environment variable of the container through the graphical interface or the terminal command line. + +!!! note + + The ConfigMap import is to use the ConfigMap as the value of the environment variable; the ConfigMap key value import is to use a certain parameter in the ConfigMap as the value of the environment variable. + +### Graphical interface operation + +When creating a workload through an image, you can set environment variables for the container by selecting __Import ConfigMaps__ or __Import ConfigMap Key Values__ on the __Environment Variables__ interface. + +1. Go to the [Image Creation Workload](../workloads/create-deployment.md) page, in the __Container Configuration__ step, select the __Environment Variables__ configuration, and click the __Add Environment Variable__ button. + + + +2. Select __ConfigMap Import__ or __ConfigMap Key Value Import__ in the environment variable type. + + - When the environment variable type is selected as __ConfigMap import__ , enter __variable name__ , __prefix__ name, __ConfigMap__ name in sequence. + + - When the environment variable type is selected as __ConfigMap key-value import__ , enter __variable name__ , __ConfigMap__ name, and __Secret__ name in sequence. + +### Command line operation + +You can set ConfigMaps as environment variables when creating a workload, using the valueFrom parameter to refer to the Key/Value in the ConfigMap. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: configmap-pod-1 +spec: + containers: + - name: test-container + image: busybox + command: [ "/bin/sh", "-c", "env" ] + env: + - name: SPECIAL_LEVEL_KEY + valueFrom: # (1) + configMapKeyRef: + name: kpanda-configmap # (2) + key: SPECIAL_LEVEL # (3) + restartPolicy: Never +``` + +1. Use __valueFrom__ to specify the value of the env reference ConfigMap +2. Referenced configuration file name +3. Referenced ConfigMap key + +## Set the command line parameters of the container + +You can use ConfigMaps to set the command or parameter value in the container, and use the environment variable substitution syntax __$(VAR_NAME)__ to do so. As follows. 
+ +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: configmap-pod-3 +spec: + containers: + - name: test-container + image: busybox + command: [ "/bin/sh", "-c", "echo $(SPECIAL_LEVEL_KEY) $(SPECIAL_TYPE_KEY)" ] + env: + - name: SPECIAL_LEVEL_KEY + valueFrom: + configMapKeyRef: + name: kpanda-configmap + key: SPECIAL_LEVEL + - name: SPECIAL_TYPE_KEY + valueFrom: + configMapKeyRef: + name: kpanda-configmap + key: SPECIAL_TYPE + restartPolicy: Never +``` + +After the Pod runs, the output is as follows. + +```none +Hello Kpanda +``` + +## Used as container data volume + +You can use the ConfigMap as the environment variable of the container through the graphical interface or the terminal command line. + +### Graphical operation + +When creating a workload through an image, you can use the ConfigMap as the data volume of the container by selecting the storage type as "ConfigMap" on the "Data Storage" interface. + +1. Go to the [Image Creation Workload](../workloads/create-deployment.md) page, in the __Container Configuration__ step, select the __Data Storage__ configuration, and click __Add in the __ Node Path Mapping __ list __ button. + + + +2. Select __ConfigMap__ in the storage type, and enter __container path__ , __subpath__ and other information in sequence. + +### Command line operation + +To use a ConfigMap in a Pod's storage volume. + +Here is an example Pod that mounts a ConfigMap as a volume: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: mypod +spec: + containers: + -name: mypod + image: redis + volumeMounts: + - name: foo + mountPath: "/etc/foo" + readOnly: true + volumes: + - name: foo + configMap: + name: myconfigmap +``` + +If there are multiple containers in a Pod, each container needs its own __volumeMounts__ block, but you only need to set one __spec.volumes__ block per ConfigMap. + +!!! note + + When a ConfigMap is used as a data volume mounted on a container, the ConfigMap can only be read as a read-only file. \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/configmaps-secrets/use-secret.md b/docs/en/docs/end-user/kpanda/configmaps-secrets/use-secret.md new file mode 100644 index 0000000000..35b39d811f --- /dev/null +++ b/docs/en/docs/end-user/kpanda/configmaps-secrets/use-secret.md @@ -0,0 +1,141 @@ +# use key + +A secret is a resource object used to store and manage sensitive information such as passwords, OAuth tokens, SSH, TLS credentials, etc. Using keys means you don't need to include sensitive secrets in your application code. + +## scenes to be used + +You can use keys in Pods in a variety of use cases, mainly including: + +- Used as an environment variable of the container to provide some necessary information required during the running of the container. +- Use secrets as pod data volumes. +- Used as the identity authentication credential for the container registry when the kubelet pulls the container image. + +## Use the key to set the environment variable of the container + +You can use the key as the environment variable of the container through the GUI or the terminal command line. + +!!! note + + Key import is to use the key as the value of an environment variable; key key value import is to use a parameter in the key as the value of an environment variable. + +### Graphical interface operation + +When creating a workload from an image, you can set environment variables for the container by selecting __Key Import__ or __Key Key Value Import__ on the __Environment Variables__ interface. + +1. 
Go to the [Image Creation Workload](../workloads/create-deployment.md) page. + + + +2. Select the __Environment Variables__ configuration in __Container Configuration__ , and click the __Add Environment Variable__ button. + + + +3. Select __Key Import__ or __Key Key Value Import__ in the environment variable type. + + + + - When the environment variable type is selected as __Key Import__ , enter __Variable Name__ , __Prefix__ , and __Secret__ in sequence. + + - When the environment variable type is selected as __key key value import__ , enter __variable name__ , __Secret__ , __Secret__ name in sequence. + +### Command line operation + +As shown in the example below, you can set the secret as an environment variable when creating the workload, using the __valueFrom__ parameter to refer to the Key/Value in the Secret. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: secret-env-pod +spec: + containers: + -name: mycontainer + image: redis + env: + - name: SECRET_USERNAME + valueFrom: + secretKeyRef: + name: mysecret + key: username + optional: false # (1) + - name: SECRET_PASSWORD + valueFrom: + secretKeyRef: + name: mysecret + key: password + optional: false # (2) + +``` + +1. This value is the default; means "mysecret", which must exist and contain a primary key named "username" +2. This value is the default; means "mysecret", which must exist and contain a primary key named "password" + +## Use the key as the pod's data volume + +### Graphical interface operation + +When creating a workload through an image, you can use the key as the data volume of the container by selecting the storage type as "key" on the "data storage" interface. + +1. Go to the [Image Creation Workload](../workloads/create-deployment.md) page. + + + +2. In the __Container Configuration__ , select the __Data Storage__ configuration, and click the __Add__ button in the __Node Path Mapping__ list. + + + +3. Select __Secret__ in the storage type, and enter __container path__ , __subpath__ and other information in sequence. + +### Command line operation + +The following is an example of a Pod that mounts a Secret named __mysecret__ via a data volume: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: mypod +spec: + containers: + -name: mypod + image: redis + volumeMounts: + - name: foo + mountPath: "/etc/foo" + readOnly: true + volumes: + - name: foo + secret: + secretName: mysecret + optional: false # (1) +``` + +1. Default setting, means "mysecret" must already exist + +If the Pod contains multiple containers, each container needs its own __volumeMounts__ block, but only one __.spec.volumes__ setting is required for each Secret. + +## Used as the identity authentication credential for the container registry when the kubelet pulls the container image + +You can use the key as the identity authentication credential for the Container registry through the GUI or the terminal command line. + +### Graphical operation + +When creating a workload through an image, you can use the key as the data volume of the container by selecting the storage type as "key" on the "data storage" interface. + +1. Go to the [Image Creation Workload](../workloads/create-deployment.md) page. + + + +2. In the second step of __Container Configuration__ , select the __Basic Information__ configuration, and click the __Select Image__ button. + + + +3. Select the name of the private container registry in the drop-down list of `container registry' in the pop-up box. 
Please see [Create Secret](create-secret.md) for details on private image secret creation. + + + +4. Enter the image name in the private registry, click __OK__ to complete the image selection. + +!!! note + + When creating a key, you need to ensure that you enter the correct container registry address, username, password, and select the correct mirror name, otherwise you will not be able to obtain the mirror image in the container registry. \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/custom-resources/create.md b/docs/en/docs/end-user/kpanda/custom-resources/create.md new file mode 100644 index 0000000000..5e5dbcf4c3 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/custom-resources/create.md @@ -0,0 +1,107 @@ +# CustomResourceDefinition (CRD) + +In Kubernetes, all objects are abstracted as resources, such as Pod, Deployment, Service, Volume, etc. are the default resources provided by Kubernetes. +This provides important support for our daily operation and maintenance and management work, but in some special cases, the existing preset resources cannot meet the needs of the business. +Therefore, we hope to expand the capabilities of the Kubernetes API, and CustomResourceDefinition (CRD) was born based on this requirement. + +The container management module supports interface-based management of custom resources, and its main features are as follows: + +- Obtain the list and detailed information of custom resources under the cluster +- Create custom resources based on YAML +- Create a custom resource example CR (Custom Resource) based on YAML +- Delete custom resources + +## Prerequisites + +- [Integrated the Kubernetes cluster](../clusters/integrate-cluster.md) or + [created Kubernetes](../clusters/create-cluster.md), and you can access the cluster UI interface. + +- Created a [namespace](../namespaces/createns.md), + [user](../../../ghippo/access-control/user.md), + and authorized the user as [`Cluster Admin`](../permissions/permission-brief.md#cluster-admin) + For details, refer to [Namespace Authorization](../permissions/cluster-ns-auth.md). + +## Create CRD via YAML + +1. Click a cluster name to enter __Cluster Details__ . + + + +2. In the left navigation bar, click __Custom Resource__ , and click the __YAML Create__ button in the upper right corner. + + + +3. On the __Create with YAML__ page, fill in the YAML statement and click __OK__ . + + + +4. Return to the custom resource list page, and you can view the custom resource named `crontabs.stable.example.com` just created. + + + +**Custom resource example:** + +```yaml title="CRD example" +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: crontabs.stable.example.com +spec: + group: stable.example.com + versions: + - name: v1 + served: true + storage: true + schema: + openAPIV3Schema: + type: object + properties: + spec: + type: object + properties: + cronSpec: + type: string + image: + type: string + replicas: + type: integer + scope: Namespaced + names: + plural: crontabs + singular: crontab + kind: CronTab + shortNames: + - ct +``` + +## Create a custom resource example via YAML + +1. Click a cluster name to enter __Cluster Details__ . + + + +2. In the left navigation bar, click __Custom Resource__ , and click the __YAML Create__ button in the upper right corner. + + + +3. Click the custom resource named `crontabs.stable.example.com` , enter the details, and click the __YAML Create__ button in the upper right corner. + + + +4. 
On the __Create with YAML__ page, fill in the YAML statement and click __OK__ . + + + +5. Return to the details page of `crontabs.stable.example.com` , and you can view the custom resource named __my-new-cron-object__ just created. + +**CR Example:** + +```yaml title="CR example" +apiVersion: "stable.example.com/v1" +kind: CronTab +metadata: + name: my-new-cron-object +spec: + cronSpec: "* * * * */5" + image: my-awesome-cron-image +``` diff --git a/docs/en/docs/end-user/kpanda/gpu/FAQ.md b/docs/en/docs/end-user/kpanda/gpu/FAQ.md new file mode 100644 index 0000000000..0d0cfb4a3d --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/FAQ.md @@ -0,0 +1,18 @@ +--- +hide: + - toc +--- + +# GPU FAQs + +## GPU processes are not visible while running nvidia-smi inside a pod + +Q: When running the `nvidia-smi` command inside a GPU-utilizing pod, +no GPU process information is visible in the full-card mode and vGPU mode. + +A: Due to `PID namespace` isolation, GPU processes are not visible inside the Pod. +To view GPU processes, you can use one of the following methods: + +- Configure the workload using the GPU with `hostPID: true` to enable viewing PIDs on the host. +- Run the `nvidia-smi` command in the driver pod of the gpu-operator to view processes. +- Run the `chroot /run/nvidia/driver nvidia-smi` command on the host to view processes. diff --git a/docs/en/docs/end-user/kpanda/gpu/Iluvatar_usage.md b/docs/en/docs/end-user/kpanda/gpu/Iluvatar_usage.md new file mode 100644 index 0000000000..5dba06c80f --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/Iluvatar_usage.md @@ -0,0 +1,65 @@ +# How to Use Iluvatar GPU in Applications + +This section describes how to use Iluvatar virtual GPU on AI platform. + +## Prerequisites + +- Deployed AI platform container management platform and it is running smoothly. +- The container management module has been integrated with a Kubernetes cluster or a Kubernetes cluster has been created, and the UI interface of the cluster can be accessed. +- The Iluvatar GPU driver has been installed on the current cluster. Refer to the [Iluvatar official documentation](https://support.iluvatar.com/#/login) for driver installation instructions, or contact the Suanova ecosystem team for enterprise-level support at peg-pem@daocloud.io. +- The GPUs in the current cluster have not undergone any virtualization operations and not been occupied by other applications. + +## Procedure + +### Configuration via User Interface + +1. Check if the GPU card in the cluster has been detected. Click __Clusters__ -> __Cluster Settings__ -> __Addon Plugins__ , and check if the corresponding GPU type has been automatically enabled and detected. + Currently, the cluster will automatically enable __GPU__ and set the GPU type as __Iluvatar__ . + + + +2. Deploy a workload. Click __Clusters__ -> __Workloads__ and deploy a workload using the image. After selecting the type as __(Iluvatar)__ , configure the GPU resources used by the application: + + - Physical Card Count (iluvatar.ai/vcuda-core): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and **less than or equal to** the number of cards on the host machine. + + - Memory Usage (iluvatar.ai/vcuda-memory): Indicates the amount of GPU memory occupied by each card. The value is in MB, with a minimum value of 1 and a maximum value equal to the entire memory of the card. 
+ + + > If there are any issues with the configuration values, scheduling failures or resource allocation failures may occur. + +### Configuration via YAML + +To request GPU resources for a workload, add the __iluvatar.ai/vcuda-core: 1__ and __iluvatar.ai/vcuda-memory: 200__ to the requests and limits. +These parameters configure the application to use the physical card resources. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: full-iluvatar-gpu-demo + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: full-iluvatar-gpu-demo + template: + metadata: + labels: + app: full-iluvatar-gpu-demo + spec: + containers: + - image: nginx:perl + name: container-0 + resources: + limits: + cpu: 250m + iluvatar.ai/vcuda-core: '1' + iluvatar.ai/vcuda-memory: '200' + memory: 512Mi + requests: + cpu: 250m + memory: 512Mi + imagePullSecrets: + - name: default-secret +``` diff --git a/docs/en/docs/end-user/kpanda/gpu/ascend/Ascend_usage.md b/docs/en/docs/end-user/kpanda/gpu/ascend/Ascend_usage.md new file mode 100644 index 0000000000..94c0918cbf --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/ascend/Ascend_usage.md @@ -0,0 +1,177 @@ +--- +MTPE: windsonsea +Date: 2024-07-30 +--- + +# Use Ascend NPU + +This section explains how to use Ascend NPU on the AI platform platform. + +## Prerequisites + +- The current NPU node has the Ascend driver installed. +- The current NPU node has the Ascend-Docker-Runtime component installed. +- The NPU MindX DL suite is installed on the current cluster. +- No virtualization is performed on the NPU card in the current cluster, + and it is not occupied by other applications. + +Refer to the [Ascend NPU Component Installation Document](ascend_driver_install.md) +to install the basic environment. + +## Quick Start + +This document uses the [AscentCL Image Classification Application](https://gitee.com/ascend/samples/tree/master/inference/modelInference/sampleResnetQuickStart/python) example from the Ascend sample library. + +1. Download the Ascend repository + + Run the following command to download the Ascend demo repository, + and remember the storage location of the code for subsequent use. + + ```git + git clone https://gitee.com/ascend/samples.git + ``` + +2. Prepare the base image + + This example uses the Ascent-pytorch base image, which can be obtained from the + [Ascend Container Registry](https://www.hiascend.com/developer/ascendhub). + +3. 
Prepare the YAML file + + ```yaml title="ascend-demo.yaml" + apiVersion: batch/v1 + kind: Job + metadata: + name: resnetinfer1-1-1usoc + spec: + template: + spec: + containers: + - image: ascendhub.huawei.com/public-ascendhub/ascend-pytorch:23.0.RC2-ubuntu18.04 # Inference image name + imagePullPolicy: IfNotPresent + name: resnet50infer + securityContext: + runAsUser: 0 + command: + - "/bin/bash" + - "-c" + - | + source /usr/local/Ascend/ascend-toolkit/set_env.sh && + TEMP_DIR=/root/samples_copy_$(date '+%Y%m%d_%H%M%S_%N') && + cp -r /root/samples "$TEMP_DIR" && + cd "$TEMP_DIR"/inference/modelInference/sampleResnetQuickStart/python/model && + wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/003_Atc_Models/resnet50/resnet50.onnx && + atc --model=resnet50.onnx --framework=5 --output=resnet50 --input_shape="actual_input_1:1,3,224,224" --soc_version=Ascend910 && + cd ../data && + wget https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/models/aclsample/dog1_1024_683.jpg && + cd ../scripts && + bash sample_run.sh + resources: + requests: + huawei.com/Ascend910: 1 # Number of the Ascend 910 Processors + limits: + huawei.com/Ascend910: 1 # The value should be the same as that of requests + volumeMounts: + - name: hiai-driver + mountPath: /usr/local/Ascend/driver + readOnly: true + - name: slog + mountPath: /var/log/npu/conf/slog/slog.conf + - name: localtime # The container time must be the same as the host time + mountPath: /etc/localtime + - name: dmp + mountPath: /var/dmp_daemon + - name: slogd + mountPath: /var/slogd + - name: hbasic + mountPath: /etc/hdcBasic.cfg + - name: sys-version + mountPath: /etc/sys_version.conf + - name: aicpu + mountPath: /usr/lib64/aicpu_kernels + - name: tfso + mountPath: /usr/lib64/libtensorflow.so + - name: sample-path + mountPath: /root/samples + volumes: + - name: hiai-driver + hostPath: + path: /usr/local/Ascend/driver + - name: slog + hostPath: + path: /var/log/npu/conf/slog/slog.conf + - name: localtime + hostPath: + path: /etc/localtime + - name: dmp + hostPath: + path: /var/dmp_daemon + - name: slogd + hostPath: + path: /var/slogd + - name: hbasic + hostPath: + path: /etc/hdcBasic.cfg + - name: sys-version + hostPath: + path: /etc/sys_version.conf + - name: aicpu + hostPath: + path: /usr/lib64/aicpu_kernels + - name: tfso + hostPath: + path: /usr/lib64/libtensorflow.so + - name: sample-path + hostPath: + path: /root/samples + restartPolicy: OnFailure + ``` + + Some fields in the above YAML need to be modified according to the actual situation: + + 1. __atc ... --soc_version=Ascend910__ uses __Ascend910__, adjust this field depending on + your actual situation. You can use the __npu-smi info__ command to check the GPU model + and add the Ascend prefix. + 2. __samples-path__ should be adjusted according to the actual situation. + 3. __resources__ should be adjusted according to the actual situation. + +4. Deploy a Job and check its results + + Use the following command to create a Job: + + ```shell + kubectl apply -f ascend-demo.yaml + ``` + + Check the Pod running status: ![Ascend Pod Status](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/ascend-demo-pod-status.png) + + After the Pod runs successfully, check the log results. The key prompt information on the screen is shown in + the figure below. The Label indicates the category identifier, Conf indicates the maximum confidence of + the classification, and Class indicates the belonging category. 
These values may vary depending on the + version and environment, so please refer to the actual situation: + + ![Ascend demo running result](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/ascend-demo-pod-result.png) + + Result image display: + + ![Ascend demo running result image](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/ascend-demo-infer-result.png) + +## UI Usage + +1. Confirm whether the cluster has detected the GPU card. Click __Clusters__ -> __Cluster Settings__ -> __Addon Plugins__ , + and check whether the proper GPU type is automatically enabled and detected. + Currently, the cluster will automatically enable __GPU__ and set the __GPU__ type to __Ascend__ . + + ![Cluster Settings](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/cluster-setting-ascend-gpu.jpg) + +2. Deploy the workload. Click __Clusters__ -> __Workloads__ , deploy the workload through an image, + select the type (Ascend), and then configure the number of physical cards used by the application: + + **Number of Physical Cards (huawei.com/Ascend910)** : This indicates how many physical cards + the current Pod needs to mount. The input value must be an integer and **less than or equal to** + the number of cards on the host. + + ![Workload Usage](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/workload_ascendgpu_userguide.jpg) + + > If there is an issue with the above configuration, it will result in + > scheduling failure and resource allocation issues. diff --git a/docs/en/docs/end-user/kpanda/gpu/ascend/ascend_driver_install.md b/docs/en/docs/end-user/kpanda/gpu/ascend/ascend_driver_install.md new file mode 100644 index 0000000000..48334699b3 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/ascend/ascend_driver_install.md @@ -0,0 +1,173 @@ +--- +MTPE: ModetaNiu +Date: 2024-07-02 +--- + +# Installation of Ascend NPU Components + +This chapter provides installation guidance for Ascend NPU drivers, Device Plugin, NPU-Exporter, and other components. + +## Prerequisites + +1. Before installation, confirm the supported NPU models. For details, refer to the [Ascend NPU Matrix](../gpu_matrix.md). +2. Ensure that the kernel version required for the corresponding NPU model is compatible. For more details, + refer to the [Ascend NPU Matrix](../gpu_matrix.md). +3. Prepare the basic Kubernetes environment. + +## Installation Steps + +Before using NPU resources, you need to complete the firmware installation, NPU driver installation, +Docker Runtime installation, user creation, log directory creation, and NPU Device Plugin installation. +Refer to the following steps for details. + +### Install Firmware + +1. Confirm that the kernel version is within the range corresponding to the "binary installation" method, + and then you can directly install the NPU driver firmware. +2. For firmware and driver downloads, refer to: [Firmware Download Link](https://www.hiascend.com/zh/hardware/firmware-drivers/community?product=2&model=15&cann=6.3.RC2.alpha005&driver=1.0.20.alpha) +3. For firmware installation, refer to: [Install NPU Driver Firmware](https://www.hiascend.com/document/detail/zh/quick-installation/23.0.RC2/quickinstg/800_3000/quickinstg_800_3000_0001.html) + +### Install NPU Driver + +1. If the driver is not installed, refer to the official Ascend documentation for installation. 
For example, + for Ascend910, refer to: [910 Driver Installation Document](https://www.hiascend.com/document/detail/zh/Atlas%20200I%20A2/23.0.RC3/EP/installationguide/Install_87.html). +2. Run the command `npu-smi info`, and if the NPU information is returned normally, it indicates that the NPU driver + and firmware are ready. + +![Ascend-mindxdl Information](../images/npu-smi-info.png) + +### Install Docker Runtime + +1. Download Ascend Docker Runtime + + Community edition download link: https://www.hiascend.com/zh/software/mindx-dl/community + + ```sh + wget -c https://mindx.obs.cn-south-1.myhuaweicloud.com/OpenSource/MindX/MindX%205.0.RC2/MindX%20DL%205.0.RC2/Ascend-docker-runtime_5.0.RC2_linux-x86_64.run + ``` + + Install to the specified path by executing the following two commands in order, with parameters specifying the installation path: + + ```sh + chmod u+x Ascend-docker-runtime_5.0.RC2_linux-x86_64.run + ./Ascend-docker-runtime_{version}_linux-{arch}.run --install --install-path= + ``` + +2. Modify the containerd configuration file + + If containerd has no default configuration file, execute the following three commands in order to create the configuration file: + + ```bash + mkdir /etc/containerd + containerd config default > /etc/containerd/config.toml + vim /etc/containerd/config.toml + ``` + + If containerd has a configuration file: + + ```bash + vim /etc/containerd/config.toml + ``` + + Modify the runtime installation path according to the actual situation, mainly modifying the runtime field: + + ```toml + ... + [plugins."io.containerd.monitor.v1.cgroups"] + no_prometheus = false + [plugins."io.containerd.runtime.v1.linux"] + shim = "containerd-shim" + runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime" + runtime_root = "" + no_shim = false + shim_debug = false + [plugins."io.containerd.runtime.v2.task"] + platforms = ["linux/amd64"] + ... + ``` + + Execute the following command to restart containerd: + + ```bash + systemctl restart containerd + ``` + +### Create a User + +Execute the following commands on the node where the components are installed to create a user. + +```sh +# Ubuntu operating system +useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX +usermod -a -G HwHiAiUser hwMindX +# CentOS operating system +useradd -d /home/hwMindX -u 9000 -m -s /sbin/nologin hwMindX +usermod -a -G HwHiAiUser hwMindX +``` + +### Create Log Directory + +Create the parent directory for component logs and the log directories for each component on the corresponding node, +and set the appropriate owner and permissions for the directories. Execute the following command to create +the parent directory for component logs. + +```bash +mkdir -m 755 /var/log/mindx-dl +chown root:root /var/log/mindx-dl +``` + +Execute the following command to create the Device Plugin component log directory. + +```bash +mkdir -m 750 /var/log/mindx-dl/devicePlugin +chown root:root /var/log/mindx-dl/devicePlugin +``` + +!!! note + + Please create the corresponding log directory for each required component. In this example, only the Device Plugin component is needed. 
+ For other component requirements, refer to the [official documentation](https://www.hiascend.com/document/detail/zh/mindx-dl/50rc3/clusterscheduling/clusterschedulingig/dlug_installation_016.html) + +### Create Node Labels + +Refer to the following commands to create labels on the corresponding nodes: + +```shell +# Create this label on computing nodes where the driver is installed +kubectl label node {nodename} huawei.com.ascend/Driver=installed +kubectl label node {nodename} node-role.kubernetes.io/worker=worker +kubectl label node {nodename} workerselector=dls-worker-node +kubectl label node {nodename} host-arch=huawei-arm // or host-arch=huawei-x86, select according to the actual situation +kubectl label node {nodename} accelerator=huawei-Ascend910 // select according to the actual situation +# Create this label on control nodes +kubectl label node {nodename} masterselector=dls-master-node +``` + +### Install Device Plugin and NpuExporter + +Functional module path: __Container Management__ -> __Cluster__, click the name of the target cluster, then click __Helm Apps__ -> __Helm Charts__ from the left navigation bar, and search for __ascend-mindxdl__. + +![Find ascend-mindxdl](../images/ascend-mindxdl.png) + +![Ascend-mindxdl](../images/detail-ascend.png) + +- __DevicePlugin__: Provides a general device plugin mechanism and standard device API interface for Kubernetes to use devices. It is recommended to use the default image and version. +- __NpuExporter__: Based on the Prometheus/Telegraf ecosystem, this component provides interfaces to help users monitor the Ascend series AI processors and container-level allocation status. It is recommended to use the default image and version. +- __ServiceMonitor__: Disabled by default. If enabled, you can view NPU-related monitoring in the observability module. To enable, ensure that the insight-agent is installed and running, otherwise, the ascend-mindxdl installation will fail. +- __isVirtualMachine__: Disabled by default. If the NPU node is a virtual machine scenario, enable the isVirtualMachine parameter. + +After a successful installation, two components will appear under the corresponding namespace, as shown below: + +![List of ascend-mindxdl](../images/list-ascend-mindxdl.png) + +At the same time, the corresponding NPU information will also appear on the node information: + +![Node labels](../images/label-ascend-mindxdl.png) + +Once everything is ready, you can select the corresponding NPU device when creating a workload through the page, as shown below: + + + +!!! note + + For detailed information of how to use, refer to [Using Ascend (Ascend) NPU](https://docs.daocloud.io/kpanda/gpu/Ascend_usage/). diff --git a/docs/en/docs/end-user/kpanda/gpu/ascend/vnpu.md b/docs/en/docs/end-user/kpanda/gpu/ascend/vnpu.md new file mode 100644 index 0000000000..527f7f3502 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/ascend/vnpu.md @@ -0,0 +1,87 @@ +--- +MTPE: windsonsea +Date: 2024-07-12 +--- + +# Enable Ascend Virtualization + +Ascend virtualization is divided into dynamic virtualization and static virtualization. +This document describes how to enable and use Ascend static virtualization capabilities. + +## Prerequisites + +- Setup of Kubernetes cluster environment. +- The current NPU node has the Ascend driver installed. +- The current NPU node has the Ascend-Docker-Runtime component installed. +- The NPU MindX DL suite is installed on the current cluster. 
+- Supported NPU models: + + - Ascend 310P, verified + - Ascend 910b (20 cores), verified + - Ascend 910 (32 cores), officially supported but not verified + - Ascend 910 (30 cores), officially supported but not verified + + For more details, refer to the [official virtualization hardware documentation](https://www.hiascend.com/document/detail/zh/mindx-dl/50rc1/AVI/cpaug/cpaug_0005.html). + +Refer to the [Ascend NPU Component Installation Documentation](./ascend_driver_install.md) +for the basic environment setup. + +## Enable Virtualization Capabilities + +To enable virtualization capabilities, you need to manually modify the startup parameters +of the `ascend-device-plugin-daemonset` component. Refer to the following command: + +```init +- device-plugin -useAscendDocker=true -volcanoType=false -presetVirtualDevice=true +- logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0 +``` + +### Split VNPU Instances + +Static virtualization requires manually splitting VNPU instances. Refer to the following command: + +```bash +npu-smi set -t create-vnpu -i 13 -c 0 -f vir02 +``` + +- `i` refers to the card id. +- `c` refers to the chip id. +- `vir02` refers to the split specification template. + +Card id and chip id can be queried using `npu-smi info`. The split specifications can be found in the +[Ascend official templates](https://www.hiascend.com/document/detail/zh/mindx-dl/500/AVI/cpaug/cpaug_006.html). + +After splitting the instance, you can query the split results using the following command: + +```bash +npu-smi info -t info-vnpu -i 13 -c 0 +``` + +The query result is as follows: + +![vnpu1](../images/vnpu1.png) + +### Restart `ascend-device-plugin-daemonset` + +After splitting the instance, manually restart the `device-plugin` pod, +then use the `kubectl describe` command to check the resources of the registered node: + +```bash +kubectl describe node {{nodename}} +``` + +![vnpu2](../images/vnpu2.png) + +## How to Use the Device + +When creating an application, specify the resource key as shown in the following YAML: + +```yaml +...... +resources: + requests: + huawei.com/Ascend310P-2c: 1 + limits: + huawei.com/Ascend310P-2c: 1 +...... +``` diff --git a/docs/en/docs/end-user/kpanda/gpu/dynamic-regulation.md b/docs/en/docs/end-user/kpanda/gpu/dynamic-regulation.md new file mode 100644 index 0000000000..d1c0af296d --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/dynamic-regulation.md @@ -0,0 +1,72 @@ +# GPU Scheduling Configuration (Binpack and Spread) + +This page introduces how to reduce GPU resource fragmentation and prevent single points of failure through +Binpack and Spread when using NVIDIA vGPU, achieving advanced scheduling for vGPU. The AI platform platform +provides Binpack and Spread scheduling policies across two dimensions: clusters and workloads, +meeting different usage requirements in various scenarios. + +## Prerequisites + +- GPU devices are correctly installed on the cluster nodes. +- The [gpu-operator component](./nvidia/install_nvidia_driver_of_operator.md) + and [Nvidia-vgpu component](./nvidia/vgpu/vgpu_addon.md) are correctly installed in the cluster. +- The NVIDIA-vGPU type exists in the GPU mode in the node list in the cluster. + +## Use Cases + +- Scheduling policy based on GPU dimension + + - Binpack: Prioritizes using the same GPU on a node, suitable for increasing GPU utilization and reducing resource fragmentation. 
+ - Spread: Multiple Pods are distributed across different GPUs on nodes, suitable for high availability scenarios to avoid single card failures. + +- Scheduling policy based on node dimension + + - Binpack: Multiple Pods prioritize using the same node, suitable for increasing GPU utilization and reducing resource fragmentation. + - Spread: Multiple Pods are distributed across different nodes, suitable for high availability scenarios to avoid single node failures. + +## Use Binpack and Spread at Cluster-Level + +!!! note + + By default, workloads will follow the cluster-level Binpack and Spread. If a workload sets its + own Binpack and Spread scheduling policies that differ from the cluster, the workload will prioritize + its own scheduling policy. + +1. On the __Clusters__ page, select the cluster for which you want to adjust the Binpack and Spread scheduling + policies. Click the __┇__ icon on the right and select __GPU Scheduling Configuration__ from the dropdown list. + + ![Cluster List](images/gpu-scheduler-clusterlist.png) + +2. Adjust the GPU scheduling configuration according to your business scenario, and click __OK__ to save. + + ![Binpack Configuration](images/gpu-scheduler-clusterrule.png) + +## Use Binpack and Spread at Workload-Level + +!!! note + + When the Binpack and Spread scheduling policies at the workload level conflict with the + cluster-level configuration, the workload-level configuration takes precedence. + +Follow the steps below to create a deployment using an image and configure Binpack and Spread +scheduling policies within the workload. + +1. Click __Clusters__ in the left navigation bar, then click the name of the target cluster to + enter the __Cluster Details__ page. + + ![Cluster List](images/clusterlist1.png) + +2. On the Cluster Details page, click __Workloads__ -> __Deployments__ in the left navigation bar, + then click the __Create by Image__ button in the upper right corner of the page. + + ![Create Workload](images/gpu-createdeploy.png) + +3. Sequentially fill in the [Basic Information](../workloads/create-deployment.md#basic-information), + [Container Settings](../workloads/create-deployment.md#container-settings), + and in the __Container Configuration__ section, enable GPU configuration, selecting the GPU type as NVIDIA vGPU. + Click __Advanced Settings__, enable the Binpack / Spread scheduling policy, and adjust the GPU scheduling + configuration according to the business scenario. After configuration, click __Next__ to proceed to + [Service Settings](../workloads/create-deployment.md#service-settings) + and [Advanced Settings](../workloads/create-deployment.md#advanced-settings). + Finally, click __OK__ at the bottom right of the page to complete the creation. + diff --git a/docs/en/docs/end-user/kpanda/gpu/gpu-metrics.md b/docs/en/docs/end-user/kpanda/gpu/gpu-metrics.md new file mode 100644 index 0000000000..9278339d7c --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/gpu-metrics.md @@ -0,0 +1,59 @@ +# GPU Metrics + +This page lists some commonly used GPU metrics. 
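+If your monitoring stack exposes these metrics through a Prometheus-compatible endpoint (for example,
+via the DCGM exporter deployed alongside the gpu-operator — an assumption about your setup, not a
+requirement of this page), a cluster-level value such as average GPU utilization can be sampled as in
+the sketch below. The endpoint URL and metric name are placeholders and may differ in your environment.
+
+```bash
+# Query the cluster-average GPU utilization from a Prometheus HTTP API
+PROM_URL=http://prometheus.example.com:9090
+curl -sG "${PROM_URL}/api/v1/query" \
+  --data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL)'
+```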
+ +## Cluster Level + +| Metric Name | Description | +| ----------- | ----------- | +| Number of GPUs | Total number of GPUs in the cluster | +| Average GPU Utilization | Average compute utilization of all GPUs in the cluster | +| Average GPU Memory Utilization | Average memory utilization of all GPUs in the cluster | +| GPU Power | Power consumption of all GPUs in the cluster | +| GPU Temperature | Temperature of all GPUs in the cluster | +| GPU Utilization Details | 24-hour usage details of all GPUs in the cluster (includes max, avg, current) | +| GPU Memory Usage Details | 24-hour memory usage details of all GPUs in the cluster (includes min, max, avg, current) | +| GPU Memory Bandwidth Utilization | For example, an Nvidia V100 GPU has a maximum memory bandwidth of 900 GB/sec. If the current memory bandwidth is 450 GB/sec, the utilization is 50% | + +## Node Level + +| Metric Name | Description | +| ----------- | ----------- | +| GPU Mode | Usage mode of GPUs on the node, including full-card mode, MIG mode, vGPU mode | +| Number of Physical GPUs | Total number of physical GPUs on the node | +| Number of Virtual GPUs | Number of vGPU devices created on the node | +| Number of MIG Instances | Number of MIG instances created on the node | +| GPU Memory Allocation Rate | Memory allocation rate of all GPUs on the node | +| Average GPU Utilization | Average compute utilization of all GPUs on the node | +| Average GPU Memory Utilization | Average memory utilization of all GPUs on the node | +| GPU Driver Version | Driver version information of GPUs on the node | +| GPU Utilization Details | 24-hour usage details of each GPU on the node (includes max, avg, current) | +| GPU Memory Usage Details | 24-hour memory usage details of each GPU on the node (includes min, max, avg, current) | + +## Pod Level + +| Category | Metric Name | Description | +| -------- | ----------- | ----------- | +| Application Overview GPU - Compute & Memory | Pod GPU Utilization | Compute utilization of the GPUs used by the current Pod | +| | Pod GPU Memory Utilization | Memory utilization of the GPUs used by the current Pod | +| | Pod GPU Memory Usage | Memory usage of the GPUs used by the current Pod | +| | Memory Allocation | Memory allocation of the GPUs used by the current Pod | +| | Pod GPU Memory Copy Ratio | Memory copy ratio of the GPUs used by the current Pod | +| GPU - Engine Overview | GPU Graphics Engine Activity Percentage | Percentage of time the Graphics or Compute engine is active during a monitoring cycle | +| | GPU Memory Bandwidth Utilization | Memory bandwidth utilization (Memory BW Utilization) indicates the fraction of cycles during which data is sent to or received from the device memory. This value represents the average over the interval, not an instantaneous value. A higher value indicates higher utilization of device memory.
A value of 1 (100%) indicates that a DRAM instruction is executed every cycle during the interval (in practice, a peak of about 0.8 (80%) is the maximum achievable).
A value of 0.2 (20%) indicates that 20% of the cycles during the interval are spent reading from or writing to device memory. | +| | Tensor Core Utilization | Percentage of time the Tensor Core pipeline is active during a monitoring cycle | +| | FP16 Engine Utilization | Percentage of time the FP16 pipeline is active during a monitoring cycle | +| | FP32 Engine Utilization | Percentage of time the FP32 pipeline is active during a monitoring cycle | +| | FP64 Engine Utilization | Percentage of time the FP64 pipeline is active during a monitoring cycle | +| | GPU Decode Utilization | Decode engine utilization of the GPU | +| | GPU Encode Utilization | Encode engine utilization of the GPU | +| GPU - Temperature & Power | GPU Temperature | Temperature of all GPUs in the cluster | +| | GPU Power | Power consumption of all GPUs in the cluster | +| | GPU Total Power Consumption | Total power consumption of the GPUs | +| GPU - Clock | GPU Memory Clock | Memory clock frequency | +| | GPU Application SM Clock | Application SM clock frequency | +| | GPU Application Memory Clock | Application memory clock frequency | +| | GPU Video Engine Clock | Video engine clock frequency | +| | GPU Throttle Reasons | Reasons for GPU throttling | +| GPU - Other Details | PCIe Transfer Rate | Data transfer rate of the GPU through the PCIe bus | +| | PCIe Receive Rate | Data receive rate of the GPU through the PCIe bus | diff --git a/docs/en/docs/end-user/kpanda/gpu/gpu_matrix.md b/docs/en/docs/end-user/kpanda/gpu/gpu_matrix.md new file mode 100644 index 0000000000..1a35f92d59 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/gpu_matrix.md @@ -0,0 +1,318 @@ +--- +hide: + - toc +--- + +# GPU Support Matrix + +This page explains the matrix of supported GPUs and operating systems for AI platform. + +## NVIDIA GPU + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| GPU Manufacturer and Type | Supported GPU Models | Compatible Operating System (Online) | Recommended Kernel | Recommended Operating System and Kernel | Installation Documentation |
+| --- | --- | --- | --- | --- | --- |
+| NVIDIA GPU (Full Card/vGPU) | NVIDIA Fermi (2.1) Architecture:<br>• NVIDIA GeForce 400 Series<br>• NVIDIA Quadro 4000 Series<br>• NVIDIA Tesla 20 Series<br>• NVIDIA Ampere Architecture Series (A100; A800; H100) | CentOS 7 | | Operating System: CentOS 7.9;<br>Kernel Version: 3.10.0-1160 | Offline Installation with GPU Operator |
+| | | CentOS 8 | Kernel 4.18.0-80 ~ 4.18.0-348 | | |
+| | | Ubuntu 20.04 | Kernel 5.4 | | |
+| | | Ubuntu 22.04 | Kernel 5.19 | | |
+| | | RHEL 7 | Kernel 3.10.0-123 ~ 3.10.0-1160 | | |
+| | | RHEL 8 | Kernel 4.18.0-80 ~ 4.18.0-348 | | |
+| NVIDIA MIG | Ampere Architecture Series:<br>• A100<br>• A800<br>• H100 | CentOS 7 | Kernel 3.10.0-123 ~ 3.10.0-1160 | Operating System: CentOS 7.9;<br>Kernel Version: 3.10.0-1160 | Offline Installation with GPU Operator |
+| | | CentOS 8 | Kernel 4.18.0-80 ~ 4.18.0-348 | | |
+| | | Ubuntu 20.04 | Kernel 5.4 | | |
+| | | Ubuntu 22.04 | Kernel 5.19 | | |
+| | | RHEL 7 | Kernel 3.10.0-123 ~ 3.10.0-1160 | | |
+| | | RHEL 8 | Kernel 4.18.0-80 ~ 4.18.0-348 | | |
+
+## Ascend NPU
+
+| GPU Manufacturer and Type | Supported NPU Models | Compatible Operating System (Online) | Recommended Kernel | Recommended Operating System and Kernel | Installation Documentation |
+| --- | --- | --- | --- | --- | --- |
+| Ascend (Ascend 310) | • Ascend 310<br>• Ascend 310P | Ubuntu 20.04 | For details, refer to the Kernel Version Requirements | Operating System: CentOS 7.9;<br>Kernel Version: 3.10.0-1160 | 300 and 310P Driver Documentation |
+| | | CentOS 7.6 | | | |
+| | | CentOS 8.2 | | | |
+| | | Kylin V10 SP1 | | | |
+| | | openEuler | | | |
+| Ascend (Ascend 910P) | Ascend 910 | Ubuntu 20.04 | For details, refer to the Kernel Version Requirements | Operating System: CentOS 7.9;<br>Kernel Version: 3.10.0-1160 | 910 Driver Documentation |
+| | | CentOS 7.6 | | | |
+| | | CentOS 8.2 | | | |
+| | | Kylin V10 SP1 | | | |
+| | | openEuler | | | |
+
+## Iluvatar GPU
+
+| GPU Manufacturer and Type | Supported GPU Models | Compatible Operating System (Online) | Recommended Kernel | Recommended Operating System and Kernel | Installation Documentation |
+| --- | --- | --- | --- | --- | --- |
+| Iluvatar (Iluvatar vGPU) | • BI100<br>• MR100 | CentOS 7 | Kernel 3.10.0-957.el7.x86_64 ~ 3.10.0-1160.42.2.el7.x86_64 | Operating System: CentOS 7.9;<br>Kernel Version: 3.10.0-1160 | Coming Soon |
+| | | CentOS 8 | Kernel 4.18.0-80.el8.x86_64 ~ 4.18.0-305.19.1.el8_4.x86_64 | | |
+| | | Ubuntu 20.04 | • Kernel 4.15.0-20-generic ~ 4.15.0-160-generic<br>• Kernel 5.4.0-26-generic ~ 5.4.0-89-generic<br>• Kernel 5.8.0-23-generic ~ 5.8.0-63-generic | | |
+| | | Ubuntu 21.04 | • Kernel 4.15.0-20-generic ~ 4.15.0-160-generic<br>• Kernel 5.4.0-26-generic ~ 5.4.0-89-generic<br>• Kernel 5.8.0-23-generic ~ 5.8.0-63-generic | | |
+| | | openEuler 22.03 LTS | Kernel version >= 5.1 and <= 5.10 | | |
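+
+When checking a node against the matrix above, you can read its distribution and kernel directly on
+the node. This is only a quick sketch using standard Linux commands:
+
+```bash
+# Print the distribution and kernel version of the current node so they can be
+# compared against the support matrix above.
+grep -E '^(NAME|VERSION)=' /etc/os-release
+uname -r
+```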
diff --git a/docs/en/docs/end-user/kpanda/gpu/gpu_scheduler_config.md b/docs/en/docs/end-user/kpanda/gpu/gpu_scheduler_config.md new file mode 100644 index 0000000000..afe54963cb --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/gpu_scheduler_config.md @@ -0,0 +1,48 @@ +# GPU Scheduling Configuration + +This document mainly introduces the configuration of `GPU` scheduling, which can implement +advanced scheduling policies. Currently, the primary implementation is the `vgpu` scheduling policy. + +## vGPU Resource Scheduling Configuration + +`vGPU` provides two policies for resource usage: `binpack` and `spread`. These correspond to node-level +and GPU-level dimensions, respectively. The use case is whether you want to distribute workloads more +sparsely across different nodes and GPUs or concentrate them on the same node and GPU, +thereby making resource utilization more efficient and reducing resource fragmentation. + +You can modify the scheduling policy in your cluster by following these steps: + +1. Go to the cluster management list in the container management interface. +2. Click the settings button **...** next to the cluster. +3. Click **GPU Scheduling Configuration**. +4. Toggle the scheduling policy between node-level and GPU-level. By default, + the node-level policy is `binpack`, and the GPU-level policy is `spread`. + +![vgpu-scheduler](./images/vgpu-sc.png) + +The above steps modify the cluster-level scheduling policy. Users can also specify their own +scheduling policy at the workload level to change the scheduling results. Below is an example +of modifying the scheduling policy at the workload level: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: gpu-pod + annotations: + hami.io/node-scheduler-policy: "binpack" + hami.io/gpu-scheduler-policy: "binpack" +spec: + containers: + - name: ubuntu-container + image: ubuntu:18.04 + command: ["bash", "-c", "sleep 86400"] + resources: + limits: + nvidia.com/gpu: 1 + nvidia.com/gpumem: 3000 + nvidia.com/gpucores: 30 +``` + +In this example, both the node- and GPU-level scheduling policies are set to `binpack`. +This ensures that the workload is scheduled to maximize resource utilization and reduce fragmentation. 
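+
+To confirm where the example Pod above actually landed once the annotations take effect, a quick
+check such as the following sketch can help; it only uses standard kubectl commands and the Pod
+name from the example:
+
+```bash
+# Show which node the example Pod was scheduled to, and the scheduling-policy
+# annotations it carries.
+kubectl get pod gpu-pod -o wide
+kubectl get pod gpu-pod -o jsonpath='{.metadata.annotations}'
+```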
diff --git a/docs/en/docs/end-user/kpanda/gpu/images/ascend-mindxdl.png b/docs/en/docs/end-user/kpanda/gpu/images/ascend-mindxdl.png new file mode 100644 index 0000000000..b7b16fdcff Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/ascend-mindxdl.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/cluster-ns.png b/docs/en/docs/end-user/kpanda/gpu/images/cluster-ns.png new file mode 100644 index 0000000000..a3d649dea1 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/cluster-ns.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/clusterlist1.png b/docs/en/docs/end-user/kpanda/gpu/images/clusterlist1.png new file mode 100644 index 0000000000..eeb5fdd36f Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/clusterlist1.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/detail-ascend.png b/docs/en/docs/end-user/kpanda/gpu/images/detail-ascend.png new file mode 100644 index 0000000000..8d5b006f32 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/detail-ascend.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/driveimage.png b/docs/en/docs/end-user/kpanda/gpu/images/driveimage.png new file mode 100644 index 0000000000..e057a26c3c Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/driveimage.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/driver.jpg b/docs/en/docs/end-user/kpanda/gpu/images/driver.jpg new file mode 100644 index 0000000000..7e595830e4 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/driver.jpg differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/gpu-createdeploy.png b/docs/en/docs/end-user/kpanda/gpu/images/gpu-createdeploy.png new file mode 100644 index 0000000000..91baba82ec Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/gpu-createdeploy.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/gpu-operator-mig.png b/docs/en/docs/end-user/kpanda/gpu/images/gpu-operator-mig.png new file mode 100644 index 0000000000..34df624785 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/gpu-operator-mig.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/gpu-scheduler-clusterlist.png b/docs/en/docs/end-user/kpanda/gpu/images/gpu-scheduler-clusterlist.png new file mode 100644 index 0000000000..bdf2bbf698 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/gpu-scheduler-clusterlist.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/gpu-scheduler-clusterrule.png b/docs/en/docs/end-user/kpanda/gpu/images/gpu-scheduler-clusterrule.png new file mode 100644 index 0000000000..ddfc201b24 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/gpu-scheduler-clusterrule.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/image-1.png b/docs/en/docs/end-user/kpanda/gpu/images/image-1.png new file mode 100644 index 0000000000..6e759868e7 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/image-1.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/image-2.png b/docs/en/docs/end-user/kpanda/gpu/images/image-2.png new file mode 100644 index 0000000000..1399f9184c Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/image-2.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/image.png b/docs/en/docs/end-user/kpanda/gpu/images/image.png new file mode 100644 index 0000000000..090d76322d Binary files /dev/null and 
b/docs/en/docs/end-user/kpanda/gpu/images/image.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/label-ascend-mindxdl.png b/docs/en/docs/end-user/kpanda/gpu/images/label-ascend-mindxdl.png new file mode 100644 index 0000000000..edea9287eb Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/label-ascend-mindxdl.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/list-ascend-mindxdl.png b/docs/en/docs/end-user/kpanda/gpu/images/list-ascend-mindxdl.png new file mode 100644 index 0000000000..7a229f769d Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/list-ascend-mindxdl.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/mig-select.png b/docs/en/docs/end-user/kpanda/gpu/images/mig-select.png new file mode 100644 index 0000000000..3649d23011 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/mig-select.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/mig2c.4g.20gb.png b/docs/en/docs/end-user/kpanda/gpu/images/mig2c.4g.20gb.png new file mode 100644 index 0000000000..13b5887646 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/mig2c.4g.20gb.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/mig_1c.4g.20gb.png b/docs/en/docs/end-user/kpanda/gpu/images/mig_1c.4g.20gb.png new file mode 100644 index 0000000000..4233ecb225 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/mig_1c.4g.20gb.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/mig_1g5gb.png b/docs/en/docs/end-user/kpanda/gpu/images/mig_1g5gb.png new file mode 100644 index 0000000000..a8a88e9f43 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/mig_1g5gb.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/mig_4g20gb.png b/docs/en/docs/end-user/kpanda/gpu/images/mig_4g20gb.png new file mode 100644 index 0000000000..f8b81516ae Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/mig_4g20gb.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/mig_7m.png b/docs/en/docs/end-user/kpanda/gpu/images/mig_7m.png new file mode 100644 index 0000000000..2122431bc8 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/mig_7m.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/mig_overview.png b/docs/en/docs/end-user/kpanda/gpu/images/mig_overview.png new file mode 100644 index 0000000000..6b5e4d5807 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/mig_overview.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/mixed.png b/docs/en/docs/end-user/kpanda/gpu/images/mixed.png new file mode 100644 index 0000000000..91b16e79c8 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/mixed.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/mixed02.png b/docs/en/docs/end-user/kpanda/gpu/images/mixed02.png new file mode 100644 index 0000000000..c7b29d133e Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/mixed02.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/node-gpu.png b/docs/en/docs/end-user/kpanda/gpu/images/node-gpu.png new file mode 100644 index 0000000000..198b5790a3 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/node-gpu.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/npu-smi-info.png b/docs/en/docs/end-user/kpanda/gpu/images/npu-smi-info.png new file mode 100644 index 0000000000..b8a7177df9 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/npu-smi-info.png 
differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/operator-mig.png b/docs/en/docs/end-user/kpanda/gpu/images/operator-mig.png new file mode 100644 index 0000000000..a69f4713e8 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/operator-mig.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/redhat0.12.2.png b/docs/en/docs/end-user/kpanda/gpu/images/redhat0.12.2.png new file mode 100644 index 0000000000..d0b45aaaf0 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/redhat0.12.2.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/rhel7.9.png b/docs/en/docs/end-user/kpanda/gpu/images/rhel7.9.png new file mode 100644 index 0000000000..85d11bce0f Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/rhel7.9.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/single01.png b/docs/en/docs/end-user/kpanda/gpu/images/single01.png new file mode 100644 index 0000000000..16c96a60b9 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/single01.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/single02.png b/docs/en/docs/end-user/kpanda/gpu/images/single02.png new file mode 100644 index 0000000000..82c59feb0d Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/single02.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/vgpu-addon.png b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-addon.png new file mode 100644 index 0000000000..ccbadb69f1 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-addon.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/vgpu-cluster.png b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-cluster.png new file mode 100644 index 0000000000..5de4d3a0c0 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-cluster.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/vgpu-deployment.png b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-deployment.png new file mode 100644 index 0000000000..552e4693c1 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-deployment.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/vgpu-pararm.png b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-pararm.png new file mode 100644 index 0000000000..f2018dd6b2 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-pararm.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/vgpu-pod.png b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-pod.png new file mode 100644 index 0000000000..c8ec011f00 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-pod.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/vgpu-quota.png b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-quota.png new file mode 100644 index 0000000000..a8944ef83b Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-quota.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/vgpu-sc.png b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-sc.png new file mode 100644 index 0000000000..59ccfd9c3d Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/vgpu-sc.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/vnpu1.png b/docs/en/docs/end-user/kpanda/gpu/images/vnpu1.png new file mode 100644 index 0000000000..0b8e88623a Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/vnpu1.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/images/vnpu2.png 
b/docs/en/docs/end-user/kpanda/gpu/images/vnpu2.png new file mode 100644 index 0000000000..4ae7318541 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/gpu/images/vnpu2.png differ diff --git a/docs/en/docs/end-user/kpanda/gpu/index.md b/docs/en/docs/end-user/kpanda/gpu/index.md new file mode 100644 index 0000000000..f9ad6d2b36 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/index.md @@ -0,0 +1,33 @@ +--- +hide: + - toc +--- + +# Overview of GPU Management + +This article introduces the capability of Suanova container management platform in unified operations and management of heterogeneous resources, with a focus on GPUs. + +## Background + +With the rapid development of emerging technologies such as AI applications, large-scale models, artificial intelligence, and autonomous driving, enterprises are facing an increasing demand for compute-intensive tasks and data processing. Traditional compute architectures represented by CPUs can no longer meet the growing computational requirements of enterprises. At this point, heterogeneous computing represented by GPUs has been widely applied due to its unique advantages in processing large-scale data, performing complex calculations, and real-time graphics rendering. + +Meanwhile, due to the lack of experience and professional solutions in scheduling and managing heterogeneous resources, the utilization efficiency of GPU devices is extremely low, resulting in high AI production costs for enterprises. The challenge of reducing costs, increasing efficiency, and improving the utilization of GPUs and other heterogeneous resources has become a pressing issue for many enterprises. + +## Introduction to GPU Capabilities + +The Suanova container management platform supports unified scheduling and operations management of GPUs, NPUs, and other heterogeneous resources, fully unleashing the computational power of GPU resources, and accelerating the development of enterprise AI and other emerging applications. The GPU management capabilities of Suanova are as follows: + +- Support for unified management of heterogeneous computing resources from domestic and foreign manufacturers such as NVIDIA, Huawei Ascend, and Days. +- Support for multi-card heterogeneous scheduling within the same cluster, with automatic recognition of GPUs in the cluster. +- Support for native management solutions for NVIDIA GPUs, vGPUs, and MIG, with cloud native capabilities. +- Support for partitioning a single physical card for use by different tenants, and allocate GPU resources to tenants and containers based on computing power and memory quotas. +- Support for multi-dimensional GPU resource monitoring at the cluster, node, and application levels, assisting operators in managing GPU resources. +- Compatibility with various training frameworks such as TensorFlow and PyTorch. + +## Introduction to GPU Operator + +Similar to regular computer hardware, NVIDIA GPUs, as physical devices, need to have the NVIDIA GPU driver installed in order to be used. To reduce the cost of using GPUs on Kubernetes, NVIDIA provides the NVIDIA GPU Operator component to manage various components required for using NVIDIA GPUs. These components include the NVIDIA driver (for enabling CUDA), NVIDIA container runtime, GPU node labeling, DCGM-based monitoring, and more. In theory, users only need to plug the GPU card into a compute device managed by Kubernetes, and they can use all the capabilities of NVIDIA GPUs through the GPU Operator. 
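+
+As a rough illustration of what the Operator takes over, after installation its components run as
+ordinary workloads in the namespace chosen at install time. The sketch below assumes that namespace
+is `gpu-operator`; replace it with your own:
+
+```bash
+# List the GPU Operator's components; typical Pods include the driver daemonset,
+# container toolkit, device plugin, GPU feature discovery, and DCGM exporter.
+kubectl get pods -n gpu-operator
+```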
For more information about NVIDIA GPU Operator, refer to the [NVIDIA official documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html). For deployment instructions, refer to [Offline Installation of GPU Operator](nvidia/install_nvidia_driver_of_operator.md). + +Architecture diagram of NVIDIA GPU Operator: + + diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/full_gpu_userguide.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/full_gpu_userguide.md new file mode 100644 index 0000000000..48422dfcb9 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/full_gpu_userguide.md @@ -0,0 +1,67 @@ +# Using the Whole NVIDIA GPU Card for an Application + +This section describes how to allocate the entire NVIDIA GPU card to a single application on the AI platform platform. + +## Prerequisites + +- AI platform container management platform has been [deployed](https://docs.daocloud.io/install/index.html) and is running properly. +- The container management module has been [connected to a Kubernetes cluster](../../clusters/integrate-cluster.md) or a Kubernetes cluster has been [created](../../clusters/create-cluster.md), and you can access the UI interface of the cluster. +- GPU Operator has been offline installed and NVIDIA DevicePlugin has been enabled on the current cluster. Refer to [Offline Installation of GPU Operator](install_nvidia_driver_of_operator.md) for instructions. +- The GPU card in the current cluster has not undergone any virtualization operations or been occupied by other applications. + +## Procedure + +### Configuring via the User Interface + +1. Check if the cluster has detected the GPUs. Click __Clusters__ -> __Cluster Settings__ -> __Addon Plugins__ to see if it has automatically enabled and detected the proper GPU types. + Currently, the cluster will automatically enable __GPU__ and set the __GPU Type__ as __Nvidia GPU__ . + + + +2. Deploy a workload. Click __Clusters__ -> __Workloads__ , and deploy the workload using the image method. After selecting the type ( __Nvidia GPU__ ), configure the number of physical cards used by the application: + + **Physical Card Count (nvidia.com/gpu)**: Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and **less than or equal to** the number of cards on the host machine. + + + + > If the above value is configured incorrectly, scheduling failures and resource allocation issues may occur. + +### Configuring via YAML + +To request GPU resources for a workload, add the __nvidia.com/gpu: 1__ parameter to the resource request and limit configuration in the YAML file. This parameter configures the number of physical cards used by the application. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: full-gpu-demo + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: full-gpu-demo + template: + metadata: + labels: + app: full-gpu-demo + spec: + containers: + - image: chrstnhntschl/gpu_burn + name: container-0 + resources: + requests: + cpu: 250m + memory: 512Mi + nvidia.com/gpu: 1 # Number of GPUs requested + limits: + cpu: 250m + memory: 512Mi + nvidia.com/gpu: 1 # Upper limit of GPU usage + imagePullSecrets: + - name: default-secret +``` + +!!! note + + When using the `nvidia.com/gpu` parameter to specify the number of GPUs, the values for requests and limits must be consistent. 
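+
+Once the example Deployment above is running, you can verify from inside its Pod that a full card is
+visible. This is only a sketch based on the example's name and namespace; the `nvidia-smi` binary is
+mounted into the container by the NVIDIA container runtime:
+
+```bash
+# Run nvidia-smi inside the example Deployment's Pod; exactly one GPU should be listed.
+kubectl -n default exec deploy/full-gpu-demo -- nvidia-smi
+```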
diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/index.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/index.md new file mode 100644 index 0000000000..e6ed20588d --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/index.md @@ -0,0 +1,38 @@ +# NVIDIA GPU Card Usage Modes + +NVIDIA, as a well-known graphics computing provider, offers various software and hardware solutions to enhance computational power. Among them, NVIDIA provides the following three solutions for GPU usage: + +#### Full GPU + +Full GPU refers to allocating the entire NVIDIA GPU to a single user or application. In this configuration, the application can fully occupy all the resources of the GPU and achieve maximum computational performance. Full GPU is suitable for workloads that require a large amount of computational resources and memory, such as deep learning training, scientific computing, etc. + +#### vGPU (Virtual GPU) + +vGPU is a virtualization technology that allows one physical GPU to be partitioned into multiple virtual GPUs, with each virtual GPU assigned to different virtual machines or users. vGPU enables multiple users to share the same physical GPU and independently use GPU resources in their respective virtual environments. Each virtual GPU can access a certain amount of compute power and memory capacity. vGPU is suitable for virtualized environments and cloud computing scenarios, providing higher resource utilization and flexibility. + +#### MIG (Multi-Instance GPU) + +MIG is a feature introduced by the NVIDIA Ampere architecture that allows one physical GPU to be divided into multiple physical GPU instances, each of which can be independently allocated to different users or workloads. Each MIG instance has its own compute resources, memory, and PCIe bandwidth, just like an independent virtual GPU. MIG provides finer-grained GPU resource allocation and management and allows dynamic adjustment of the number and size of instances based on demand. MIG is suitable for multi-tenant environments, containerized applications, batch jobs, and other scenarios. + +Whether using vGPU in a virtualized environment or MIG on a physical GPU, NVIDIA provides users with more choices and optimized ways to utilize GPU resources. The Suanova container management platform fully supports the above NVIDIA capabilities. Users can easily access the full computational power of NVIDIA GPUs through simple UI operations, thereby improving resource utilization and reducing costs. + +- **Single Mode**: The node only exposes a single type of MIG device on all its GPUs. All GPUs on the node must: + - Be of the same model (e.g., A100-SXM-40GB), with matching MIG profiles only for GPUs of the same model. + - Have MIG configuration enabled, which requires a machine reboot to take effect. + - Create identical GI and CI for exposing "identical" MIG devices across all products. +- **Mixed Mode**: The node exposes mixed MIG device types on all its GPUs. Requesting a specific MIG device type requires the number of compute slices and total memory provided by the device type. + - All GPUs on the node must: Be in the same product line (e.g., A100-SXM-40GB). + - Each GPU can enable or disable MIG individually and freely configure any available mixture of MIG device types. + - The k8s-device-plugin running on the node will: + - Expose any GPUs not in MIG mode using the traditional `nvidia.com/gpu` resource type. + - Expose individual MIG devices using resource types that follow the pattern `nvidia.com/mig-g.gb` . 
+ +For detailed instructions on enabling these configurations, refer to [Offline Installation of GPU Operator](install_nvidia_driver_of_operator.md). + +## How to Use + +You can refer to the following links to quickly start using Suanova's management capabilities for NVIDIA GPUs. + +- **[Using Full NVIDIA GPU](full_gpu_userguide.md)** +- **[Using NVIDIA vGPU](vgpu/vgpu_user.md)** +- **[Using NVIDIA MIG](mig/mig_usage.md)** diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/install_nvidia_driver_of_operator.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/install_nvidia_driver_of_operator.md new file mode 100644 index 0000000000..30a5b3dab0 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/install_nvidia_driver_of_operator.md @@ -0,0 +1,124 @@ +--- +MTPE: Fan-Lin +Date: 2024-01-24 +--- + +# Offline Install gpu-operator + +AI platform comes with pre-installed `driver` images for the following three operating systems: Ubuntu 22.04, Ubuntu 20.04, +and CentOS 7.9. The driver version is `535.104.12`. Additionally, it includes the required `Toolkit` images for each +operating system, so users no longer need to manually provide offline `toolkit` images. + +This page demonstrates using AMD architecture with CentOS 7.9 (3.10.0-1160). If you need to deploy on Red Hat 8.4, refer to +[Uploading Red Hat gpu-operator Offline Image to the Bootstrap Node Repository](./push_image_to_repo.md) +and [Building Offline Yum Source for Red Hat 8.4](./upgrade_yum_source_redhat8_4.md). + +## Prerequisites + +- The kernel version of the cluster nodes where the gpu-operator is to be deployed must be + completely consistent. The distribution and GPU card model of the nodes must fall within + the scope specified in the [GPU Support Matrix](../gpu_matrix.md). +- When installing the gpu-operator, select v23.9.0+2 or above. + +## Steps + +To install the gpu-operator plugin for your cluster, follow these steps: + +1. Log in to the platform and go to __Container Management__ -> __Clusters__ , check cluster eetails. + +2. On the __Helm Charts__ page, select __All Repositories__ and search for __gpu-operator__ . + +3. Select __gpu-operator__ and click __Install__ . + +4. Configure the installation parameters for __gpu-operator__ based on the instructions below to complete the installation. + +## Configure parameters + +- __systemOS__ : Select the operating system for the host. The current options are + `Ubuntu 22.04`, `Ubuntu 20.04`, `Centos 7.9`, and `other`. Please choose the correct operating system. + +### Basic information + +- __Name__ : Enter the plugin name +- __Namespace__ : Select the namespace for installing the plugin +- **Version**: The version of the plugin. Here, we use version **v23.9.0+2** as an example. +- **Failure Deletion**: If the installation fails, it will delete the already installed associated + resources. When enabled, **Ready Wait** will also be enabled by default. +- **Ready Wait**: When enabled, the application will be marked as successfully installed only + when all associated resources are in a ready state. +- **Detailed Logs**: When enabled, detailed logs of the installation process will be recorded. 
+ +### Advanced settings + +#### Operator parameters + +- __InitContainer.image__ : Configure the CUDA image, recommended default image: __nvidia/cuda__ +- __InitContainer.repository__ : Repository where the CUDA image is located, defaults to __nvcr.m.daocloud.io__ repository +- __InitContainer.version__ : Version of the CUDA image, please use the default parameter + +#### Driver parameters + +- __Driver.enable__ : Configure whether to deploy the NVIDIA driver on the node, default is enabled. If you have already deployed the NVIDIA driver on the node before using the gpu-operator, please disable this. +- __Driver.image__ : Configure the GPU driver image, recommended default image: __nvidia/driver__ . +- __Driver.repository__ : Repository where the GPU driver image is located, default is nvidia's __nvcr.io__ repository. +- __Driver.usePrecompiled__ : Enable the precompiled mode to install the driver. +- __Driver.version__ : Version of the GPU driver image, use default parameters for offline deployment. + Configuration is only required for online installation. Different versions of the Driver image exist for + different types of operating systems. For more details, refer to + [Nvidia GPU Driver Versions](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags). + Examples of `Driver Version` for different operating systems are as follows: + + !!! note + + When using the built-in operating system version, there is no need to modify the image version. For other operating system versions, please refer to [Uploading Images to the Bootstrap Node Repository](./push_image_to_repo.md). + note that there is no need to include the operating system name such as Ubuntu, CentOS, or Red Hat in the version number. If the official image contains an operating system suffix, please manually remove it. + + - For Red Hat systems, for example, `525.105.17` + - For Ubuntu systems, for example, `535-5.15.0-1043-nvidia` + - For CentOS systems, for example, `525.147.05` + +- __Driver.RepoConfig.ConfigMapName__ : Used to record the name of the offline yum repository configuration file + for the gpu-operator. When using the pre-packaged offline bundle, refer to the following documents for + different types of operating systems. + + - [Building CentOS 7.9 Offline Yum Repository](./upgrade_yum_source_centos7_9.md) + - [Building Red Hat 8.4 Offline Yum Repository](./upgrade_yum_source_redhat8_4.md) + +#### Toolkit parameters + +__Toolkit.enable__ : Enabled by default. This component allows containerd/docker +to support running containers that require GPUs. + +#### MIG parameters + +For detailed configuration methods, refer to [Enabling MIG Functionality](mig/create_mig.md). + +**MigManager.Config.name** : The name of the MIG split configuration file, used to define the MIG (GI, CI) +split policy. The default is __default-mig-parted-config__ . For custom parameters, refer to +[Enabling MIG Functionality](mig/create_mig.md). + +### Next Steps + +After completing the configuration and creation of the above parameters: + +- If using **full-card mode** , [GPU resources can be used when creating applications](full_gpu_userguide.md). + +- If using **vGPU mode** , after completing the above configuration and creation, + proceed to [vGPU Addon Installation](vgpu/vgpu_addon.md). + +- If using **MIG mode** and you need to use a specific split specification for individual GPU nodes, + otherwise, split according to the __default__ value in `MigManager.Config`. 
+ + - For **single** mode, add label to nodes as follows: + + ```sh + kubectl label nodes {node} nvidia.com/mig.config="all-1g.10gb" --overwrite + ``` + + - For **mixed** mode, add label to nodes as follows: + + ```sh + kubectl label nodes {node} nvidia.com/mig.config="custom-config" --overwrite + ``` + + After spliting, applications can [use MIG GPU resources](mig/mig_usage.md). diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/create_mig.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/create_mig.md new file mode 100644 index 0000000000..54047bb09d --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/create_mig.md @@ -0,0 +1,124 @@ +--- +MTPE: Fan-Lin +Date: 2024-01-23 +--- + +# Enabling MIG Features + +This section describes how to enable NVIDIA MIG features. NVIDIA currently provides two strategies for exposing MIG devices on Kubernetes nodes: + +- **Single mode** : Nodes expose a single type of MIG device on all their GPUs. +- **Mixed mode** : Nodes expose a mixture of MIG device types on all their GPUs. + +For more details, refer to the [NVIDIA GPU Card Usage Modes](../index.md). + +## Prerequisites + +- Check the system requirements for the GPU driver installation on the target node: [GPU Support Matrix](../../gpu_matrix.md) +- Ensure that the cluster nodes have GPUs of the corresponding models + ([NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/), + [A100](https://www.nvidia.com/en-us/data-center/a100/), + and [A30](https://www.nvidia.com/en-us/data-center/products/a30-gpu/) Tensor Core GPUs). + For more information, see the [GPU Support Matrix](gpu_matrix.md). +- All GPUs on the nodes must belong to the same product line (e.g., A100-SXM-40GB). + +## Install GPU Operator Addon + +### Parameter Configuration + +When [installing the Operator](../install_nvidia_driver_of_operator.md), you need to set the MigManager Config parameter accordingly. The default setting is **default-mig-parted-config**. You can also customize the sharding policy configuration file: + +![single](../../images/gpu-operator-mig.png) + +### Custom Sharding Policy + +```yaml + ## Custom GI Instance Configuration + all-disabled: + - devices: all + mig-enabled: false + all-enabled: + - devices: all + mig-enabled: true + mig-devices: {} + all-1g.10gb: + - devices: all + mig-enabled: true + mig-devices: + 1g.5gb: 7 + all-1g.10gb.me: + - devices: all + mig-enabled: true + mig-devices: + 1g.10gb+me: 1 + all-1g.20gb: + - devices: all + mig-enabled: true + mig-devices: + 1g.20gb: 4 + all-2g.20gb: + - devices: all + mig-enabled: true + mig-devices: + 2g.20gb: 3 + all-3g.40gb: + - devices: all + mig-enabled: true + mig-devices: + 3g.40gb: 2 + all-4g.40gb: + - devices: all + mig-enabled: true + mig-devices: + 4g.40gb: 1 + all-7g.80gb: + - devices: all + mig-enabled: true + mig-devices: + 7g.80gb: 1 + all-balanced: + - device-filter: ["0x233110DE", "0x232210DE", "0x20B210DE", "0x20B510DE", "0x20F310DE", "0x20F510DE"] + devices: all + mig-enabled: true + mig-devices: + 1g.10gb: 2 + 2g.20gb: 1 + 3g.40gb: 1 + # After setting, CI instances will be partitioned according to the specified configuration + custom-config: + - devices: all + mig-enabled: true + mig-devices: + 3g.40gb: 2 +``` + +In the above **YAML**, set **custom-config** to partition **CI** instances according to the specifications. 
+ +```yaml +custom-config: + - devices: all + mig-enabled: true + mig-devices: + 1c.3g.40gb: 6 +``` + +After completing the settings, you can [use GPU MIG resources](mig_usage.md) when confirming the deployment of the application. + +## Switch Node GPU Mode + +After successfully installing the GPU operator, the node is in full card mode by default. There will be an indicator on the node management page, as shown below: + +![mixed](../../images/node-gpu.png) + +Click the __┇__ at the right side of the node list, select a GPU mode to switch, +and then choose the proper MIG mode and sharding policy. Here, we take MIXED mode as an example: + +![mig](../../images/mig-select.png) + +There are two configurations here: + +1. MIG Policy: Mixed and Single. +2. Sharding Policy: The policy here needs to match the key in the **default-mig-parted-config** (or user-defined sharding policy) configuration file. + +After clicking **OK** button, wait for about a minute and refresh the page. The MIG mode will be switched to: + diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/index.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/index.md new file mode 100644 index 0000000000..a7c2df15d5 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/index.md @@ -0,0 +1,65 @@ +# Overview of NVIDIA Multi-Instance GPU (MIG) + +## MIG Scenarios + +- **Multi-Tenant Cloud Environments**: + + MIG allows cloud service providers to partition a physical GPU into multiple independent GPU instances, which can be allocated to different tenants. This enables resource isolation and independence, meeting the GPU computing needs of multiple tenants. + +- **Containerized Applications**: + + MIG enables finer-grained GPU resource management in containerized environments. By partitioning a physical GPU into multiple MIG instances, each container can be assigned with dedicated GPU compute resources, providing better performance isolation and resource utilization. + +- **Batch Processing Jobs**: + + For batch processing jobs requiring large-scale parallel computing, MIG provides higher computational performance and larger memory capacity. Each MIG instance can utilize a portion of the physical GPU's compute resources, accelerating the processing of large-scale computational tasks. + +- **AI/Machine Learning Training**: + + MIG offers increased compute power and memory capacity for training large-scale deep learning models. By partitioning the physical GPU into multiple MIG instances, each instance can independently carry out model training, improving training efficiency and throughput. + +In general, NVIDIA MIG is suitable for scenarios that require finer-grained allocation and management of GPU resources. It enables resource isolation, improved performance utilization, and meets the GPU computing needs of multiple users or applications. + +## Overview of MIG + +NVIDIA Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA on H100, A100, and A30 series GPUs. Its purpose is to divide a physical GPU into multiple GPU instances to provide finer-grained resource sharing and isolation. MIG can split a GPU into up to seven GPU instances, allowing a single physical GPU card to provide separate GPU resources to multiple users, maximizing GPU utilization. + +This feature enables multiple applications or users to share GPU resources simultaneously, improving the utilization of computational resources and increasing system scalability. 
+ +With MIG, each GPU instance's processor has an independent and isolated path throughout the entire memory system, including cross-switch ports on the chip, L2 cache groups, memory controllers, and DRAM address buses, all uniquely allocated to a single instance. + +This ensures that the workload of individual users can run with predictable throughput and latency, along with identical L2 cache allocation and DRAM bandwidth. MIG can partition available GPU compute resources (such as streaming multiprocessors or SMs and GPU engines like copy engines or decoders) to provide defined quality of service (QoS) and fault isolation for different clients such as virtual machines, containers, or processes. MIG enables multiple GPU instances to run in parallel on a single physical GPU. + +MIG allows multiple vGPUs (and virtual machines) to run in parallel on a single GPU instance while retaining the isolation guarantees provided by vGPU. For more details on using vGPU and MIG for GPU partitioning, refer to [NVIDIA Multi-Instance GPU and NVIDIA Virtual Compute Server](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/solutions/resources/documents1/TB-10226-001_v01.pdf). + +## MIG Architecture + +The following diagram provides an overview of MIG, illustrating how it virtualizes one physical GPU card into seven GPU instances that can be used by multiple users. + + + +## Important Concepts + +* __SM__ (Streaming Multiprocessor): The core computational unit of a GPU responsible for executing graphics rendering and general-purpose computing tasks. Each SM contains a group of CUDA cores, as well as shared memory, register files, and other resources, capable of executing multiple threads concurrently. Each MIG instance has a certain number of SMs and other related resources, along with the allocated memory slices. +* __GPU Memory Slice__ : The smallest portion of GPU memory, including the corresponding memory controller and cache. A GPU memory slice is approximately one-eighth of the total GPU memory resources in terms of capacity and bandwidth. +* __GPU SM Slice__ : The smallest computational unit of SMs on a GPU. When configuring in MIG mode, the GPU SM slice is approximately one-seventh of the total available SMs in the GPU. +* __GPU Slice__ : The GPU slice represents the smallest portion of the GPU, consisting of a single GPU memory slice and a single GPU SM slice combined together. +* __GPU Instance__ (GI): A GPU instance is the combination of a GPU slice and GPU engines (DMA, NVDEC, etc.). Anything within a GPU instance always shares all GPU memory slices and other GPU engines, but its SM slice can be further subdivided into Compute Instances (CIs). A GPU instance provides memory QoS. Each GPU slice contains dedicated GPU memory resources, limiting available capacity and bandwidth while providing memory QoS. Each GPU memory slice gets one-eighth of the total GPU memory resources, and each GPU SM slice gets one-seventh of the total SM count. +* __Compute Instance__ (CI): A Compute Instance represents the smallest computational unit within a GPU instance. It consists of a subset of SMs, along with dedicated register files, shared memory, and other resources. Each CI has its own CUDA context and can run independent CUDA kernels. The number of CIs in a GPU instance depends on the number of available SMs and the configuration chosen during MIG setup. +* __Instance Slice__ : An Instance Slice represents a single CI within a GPU instance. 
It is the combination of a subset of SMs and a portion of the GPU memory slice. Each Instance Slice provides isolation and resource allocation for individual applications or users running on the GPU instance. + +## Key Benefits of MIG + +- **Resource Sharing**: MIG allows a single physical GPU to be divided into multiple GPU instances, providing efficient sharing of GPU resources among different users or applications. This maximizes GPU utilization and enables improved performance isolation. + +- **Fine-Grained Resource Allocation**: With MIG, GPU resources can be allocated at a finer granularity, allowing for more precise partitioning and allocation of compute power and memory capacity. + +- **Improved Performance Isolation**: Each MIG instance operates independently with its dedicated resources, ensuring predictable throughput and latency for individual users or applications. This improves performance isolation and prevents interference between different workloads running on the same GPU. + +- **Enhanced Security and Fault Isolation**: MIG provides better security and fault isolation by ensuring that each user or application has its dedicated GPU resources. This prevents unauthorized access to data and mitigates the impact of faults or errors in one instance on others. + +- **Increased Scalability**: MIG enables the simultaneous usage of GPU resources by multiple users or applications, increasing system scalability and accommodating the needs of various workloads. + +- **Efficient Containerization**: By using MIG in containerized environments, GPU resources can be effectively allocated to different containers, improving performance isolation and resource utilization. + +Overall, MIG offers significant advantages in terms of resource sharing, fine-grained allocation, performance isolation, security, scalability, and containerization, making it a valuable feature for various GPU computing scenarios. 
diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/mig_command.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/mig_command.md new file mode 100644 index 0000000000..ac5ed00564 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/mig_command.md @@ -0,0 +1,25 @@ +# MIG Related Commands + +GI Related Commands: + +| Subcommand | Description | +| --------------------------------------- | ----------------------------- | +| nvidia-smi mig -lgi | View the list of created GI instances | +| nvidia-smi mig -dgi -gi {Instance ID} | Delete a specific GI instance | +| nvidia-smi mig -lgip | View the profile of GI | +| nvidia-smi mig -cgi {profile id} | Create a GI using the specified profile ID | + +CI Related Commands: + +| Subcommand | Description | +| ------------------------------------------------------- | ------------------------------------------------------------ | +| nvidia-smi mig -lcip { -gi {gi Instance ID}} | View the profile of CI, specifying __-gi__ will show the CIs that can be created for a particular GI instance | +| nvidia-smi mig -lci | View the list of created CI instances | +| nvidia-smi mig -cci {profile id} -gi {gi instance id} | Create a CI instance with the specified GI | +| nvidia-smi mig -dci -ci {ci instance id} | Delete a specific CI instance | + +GI+CI Related Commands: + +| Subcommand | Description | +| ------------------------------------------------------------ | -------------------- | +| nvidia-smi mig -i 0 -cgi {gi profile id} -C {ci profile id} | Create a GI + CI instance directly | diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/mig_usage.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/mig_usage.md new file mode 100644 index 0000000000..bc3551d052 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/mig/mig_usage.md @@ -0,0 +1,94 @@ +# Using MIG GPU Resources + +This section explains how applications can use MIG GPU resources. + +## Prerequisites + +- AI platform container management platform is deployed and running successfully. +- The container management module is integrated with a Kubernetes cluster or a Kubernetes cluster is created, and the UI interface of the cluster can be accessed. +- NVIDIA DevicePlugin and MIG capabilities are enabled. Refer to [Offline installation of GPU Operator](../install_nvidia_driver_of_operator.md) for details. +- The nodes in the cluster have GPUs of the corresponding models. + +## Using MIG GPU through the UI + +1. Confirm if the cluster has recognized the GPU card type. + + Go to __Cluster Details__ -> __Nodes__ and check if it has been correctly recognized as MIG. + + + +2. When deploying an application using an image, you can select and use NVIDIA MIG resources. + +- Example of MIG Single Mode (used in the same way as a full GPU card): + + !!! note + + The MIG single policy allows users to request and use GPU resources in the same way as a full GPU card (`nvidia.com/gpu`). The difference is that these resources can be a portion of the GPU (MIG device) rather than the entire GPU. Learn more from the [GPU MIG Mode Design](https://docs.google.com/document/d/1bshSIcWNYRZGfywgwRHa07C0qRyOYKxWYxClbeJM-WM/edit#heading=h.jklusl667vn2). 
+ +- MIG Mixed Mode + +## Using MIG through YAML Configuration + +__MIG Single__ mode: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: mig-demo + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: mig-demo + template: + metadata: + creationTimestamp: null + labels: + app: mig-demo + spec: + containers: + - name: mig-demo1 + image: chrstnhntschl/gpu_burn + resources: + limits: + nvidia.com/gpu: 2 # (1)! + imagePullPolicy: Always + restartPolicy: Always +``` + +1. Number of MIG GPUs to request + +__MIG Mixed__ mode: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: mig-demo + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: mig-demo + template: + metadata: + creationTimestamp: null + labels: + app: mig-demo + spec: + containers: + - name: mig-demo1 + image: chrstnhntschl/gpu_burn + resources: + limits: + nvidia.com/mig-4g.20gb: 1 # (1)! + imagePullPolicy: Always + restartPolicy: Always +``` + +1. Expose MIG device through nvidia.com/mig-g.gb resource type + +After entering the container, you can check if only one MIG device is being used: diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/push_image_to_repo.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/push_image_to_repo.md new file mode 100644 index 0000000000..699723c7ae --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/push_image_to_repo.md @@ -0,0 +1,84 @@ +# Uploading Red Hat GPU Operator Offline Image to Bootstrap Repository + +This guide explains how to upload an offline image to the bootstrap repository using the __nvcr.io/nvidia/driver:525.105.17-rhel8.4__ offline driver image for Red Hat 8.4 as an example. + +## Prerequisites + +1. The bootstrap node and its components are running properly. +2. Prepare a node that has internet access and can access the bootstrap node. Docker should also be installed on this node. You can refer to [Installing Docker](../../../../install/community/kind/online.md) for installation instructions. + +## Procedure + +### Step 1: Obtain the Offline Image on an Internet-Connected Node + +Perform the following steps on the internet-connected node: + +1. Pull the __nvcr.io/nvidia/driver:525.105.17-rhel8.4__ offline driver image: + + ```bash + docker pull nvcr.io/nvidia/driver:525.105.17-rhel8.4 + ``` + +2. Once the image is pulled, save it as a compressed archive named __nvidia-driver.tar__ : + + ```bash + docker save nvcr.io/nvidia/driver:525.105.17-rhel8.4 > nvidia-driver.tar + ``` + +3. Copy the compressed image archive __nvidia-driver.tar__ to the bootstrap node: + + ```bash + scp nvidia-driver.tar user@ip:/root + ``` + + For example: + + ```bash + scp nvidia-driver.tar root@10.6.175.10:/root + ``` + +### Step 2: Push the Image to the Bootstrap Repository + +Perform the following steps on the bootstrap node: + +1. Log in to the bootstrap node and import the compressed image archive __nvidia-driver.tar__ : + + ```bash + docker load -i nvidia-driver.tar + ``` + +2. View the imported image: + + ```bash + docker images -a | grep nvidia + ``` + + Expected output: + + ```bash + nvcr.io/nvidia/driver e3ed7dee73e9 1 days ago 1.02GB + ``` + +3. 
Retag the image to correspond to the target repository in the remote Registry repository: + + ```bash + docker tag /: + ``` + + Replace ____ with the name of the Nvidia image from the previous step, ____ with the address of the Registry service on the bootstrap node, ____ with the name of the repository you want to push the image to, and ____ with the desired tag for the image. + + For example: + + ```bash + docker tag nvcr.io/nvidia/driver 10.6.10.5/nvcr.io/nvidia/driver:525.105.17-rhel8.4 + ``` + +4. Push the image to the bootstrap repository: + + ```bash + docker push {ip}/nvcr.io/nvidia/driver:525.105.17-rhel8.4 + ``` + +## What's Next + +Refer to [Building Red Hat 8.4 Offline Yum Source](./upgrade_yum_source_redhat8_4.md) and [Offline Installation of GPU Operator](./install_nvidia_driver_of_operator.md) to deploy the GPU Operator to your cluster. diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/ubuntu22.04_offline_install_driver.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/ubuntu22.04_offline_install_driver.md new file mode 100644 index 0000000000..564517022e --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/ubuntu22.04_offline_install_driver.md @@ -0,0 +1,36 @@ +# Offline Install gpu-operator Driver on Ubuntu 22.04 + +Prerequisite: Installed gpu-operator v23.9.0+2 or higher versions + +## Prepare Offline Image + +1. Check the kernel version + + ```bash + $ uname -r + 5.15.0-78-generic + ``` + +1. Check the GPU Driver image version applicable to your kernel, + at `https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags`. + Use the kernel to query the image version and save the image using `ctr export`. + + ```bash + ctr i pull nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 + ctr i export --all-platforms driver.tar.gz nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 + ``` + +1. Import the image into the cluster's container registry + + ```bash + ctr i import driver.tar.gz + ctr i tag nvcr.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 {your_registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 + ctr i push {your_registry}/nvcr.m.daocloud.io/nvidia/driver:535-5.15.0-78-generic-ubuntu22.04 --skip-verify=true + ``` + +## Install the Driver + +1. Install the gpu-operator addon and set `driver.usePrecompiled=true` +2. Set `driver.version=535`, note that it should be 535, not 535.104.12 + +![Install Driver](../images/driver.jpg) diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.md new file mode 100644 index 0000000000..d196edc25d --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/upgrade_yum_source_centos7_9.md @@ -0,0 +1,254 @@ +# Build CentOS 7.9 Offline Yum Source + +The AI platform comes with a pre-installed GPU Operator offline package for CentOS 7.9 with kernel version 3.10.0-1160. +or other OS types or kernel versions, users need to manually build an offline yum source. + +This guide explains how to build an offline yum source for CentOS 7.9 with a specific kernel version and use it when installing the GPU Operator by specifying the __RepoConfig.ConfigMapName__ parameter. + +## Prerequisites + +1. The user has already installed the v0.12.0 or later version of the addon offline package on the platform. +1. Prepare a file server that is accessible from the cluster network, such as Nginx or MinIO. +1. 
Prepare a node that has internet access, can access the cluster where the GPU Operator will + be deployed, and can access the file server. Docker should also be installed on this node. + You can refer to [Installing Docker](../../../../install/community/kind/online.md#install-docker) for installation instructions. + +## Procedure + +This guide uses CentOS 7.9 with kernel version 3.10.0-1160.95.1.el7.x86_64 as an example to explain how to upgrade the pre-installed GPU Operator offline package's yum source. + +### Check OS and Kernel Versions of Cluster Nodes + +Run the following commands on both the control node of the Global cluster and the node where +GPU Operator will be deployed. If the OS and kernel versions of the two nodes are consistent, +there is no need to build a yum source. You can directly refer to the +[Offline Installation of GPU Operator](./install_nvidia_driver_of_operator.md) document for +installation. If the OS or kernel versions of the two nodes are not consistent, +please proceed to the [next step](#create-the-offline-yum-source). + +1. Run the following command to view the distribution name and version of the node where GPU Operator will be deployed in the cluster. + + ```bash + cat /etc/redhat-release + ``` + + Expected output: + + ``` + CentOS Linux release 7.9 (Core) + ``` + + The output shows the current node's OS version as `CentOS 7.9`. + +2. Run the following command to view the kernel version of the node where GPU Operator will be deployed in the cluster. + + ```bash + uname -a + ``` + + Expected output: + + ``` + Linux localhost.localdomain 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux + ``` + + The output shows the current node's kernel version as `3.10.0-1160.el7.x86_64`. + +### Create the Offline Yum Source + +Perform the following steps on a node that has internet access and can access the file server: + +1. Create a script file named __yum.sh__ by running the following command: + + ```bash + vi yum.sh + ``` + + Then press the **i** key to enter insert mode and enter the following content: + + ```bash + export TARGET_KERNEL_VERSION=$1 + + cat >> run.sh << \EOF + #! 
/bin/bash + echo "start install kernel repo" + echo ${KERNEL_VERSION} + mkdir centos-base + + if [ "$OS" -eq 7 ]; then + yum install --downloadonly --downloaddir=./centos-base perl + yum install --downloadonly --downloaddir=./centos-base elfutils-libelf.x86_64 + yum install --downloadonly --downloaddir=./redhat-base elfutils-libelf-devel.x86_64 + yum install --downloadonly --downloaddir=./centos-base kernel-headers-${KERNEL_VERSION}.el7.x86_64 + yum install --downloadonly --downloaddir=./centos-base kernel-devel-${KERNEL_VERSION}.el7.x86_64 + yum install --downloadonly --downloaddir=./centos-base kernel-${KERNEL_VERSION}.el7.x86_64 + yum install -y --downloadonly --downloaddir=./centos-base groff-base + elif [ "$OS" -eq 8 ]; then + yum install --downloadonly --downloaddir=./centos-base perl + yum install --downloadonly --downloaddir=./centos-base elfutils-libelf.x86_64 + yum install --downloadonly --downloaddir=./redhat-base elfutils-libelf-devel.x86_64 + yum install --downloadonly --downloaddir=./centos-base kernel-headers-${KERNEL_VERSION}.el8.x86_64 + yum install --downloadonly --downloaddir=./centos-base kernel-devel-${KERNEL_VERSION}.el8.x86_64 + yum install --downloadonly --downloaddir=./centos-base kernel-${KERNEL_VERSION}.el8.x86_64 + yum install -y --downloadonly --downloaddir=./centos-base groff-base + else + echo "Error os version" + fi + + createrepo centos-base/ + ls -lh centos-base/ + tar -zcf centos-base.tar.gz centos-base/ + echo "end install kernel repo" + EOF + + cat >> Dockerfile << EOF + FROM centos:7 + ENV KERNEL_VERSION="" + ENV OS=7 + RUN yum install -y createrepo + COPY run.sh . + ENTRYPOINT ["/bin/bash","run.sh"] + EOF + + docker build -t test:v1 -f Dockerfile . + docker run -e KERNEL_VERSION=$TARGET_KERNEL_VERSION --name centos7.9 test:v1 + docker cp centos7.9:/centos-base.tar.gz . + tar -xzf centos-base.tar.gz + ``` + + Press the __Esc__ key to exit insert mode, then enter __:wq__ to save and exit. + +2. Run the __yum.sh__ file: + + ```bash + bash -x yum.sh TARGET_KERNEL_VERSION + ``` + + The `TARGET_KERNEL_VERSION` parameter is used to specify the kernel version of the cluster nodes. + + Note: You don't need to include the distribution identifier (e.g., __ .el7.x86_64__ ). + For example: + + ```bash + bash -x yum.sh 3.10.0-1160.95.1 + ``` + +Now you have generated an offline yum source, __centos-base__ , +for the kernel version __3.10.0-1160.95.1.el7.x86_64__ . + +### Upload the Offline Yum Source to the File Server + +Perform the following steps on a node that has internet access and can access the file server. +This step is used to upload the generated yum source from the previous step to a file server +that can be accessed by the cluster where the GPU Operator will be deployed. The file server +can be Nginx, MinIO, or any other file server that supports the HTTP protocol. + +In this example, we will use the built-in MinIO as the file server. The MinIO details are as follows: + +- Access URL: `http://10.5.14.200:9000` (usually __{bootstrap-node IP} + {port-9000}__ ) +- Login username: rootuser +- Login password: rootpass123 + +1. Run the following command in the current directory of the node to establish a connection between the node's local __mc__ command-line tool and the MinIO server: + + ```bash + mc config host add minio http://10.5.14.200:9000 rootuser rootpass123 + ``` + + The expected output should resemble the following: + + ```bash + Added __minio__ successfully. 
+ ``` + + __mc__ is the command-line tool provided by MinIO for interacting with the MinIO server. For more details, refer to the [MinIO Client](https://min.io/docs/minio/linux/reference/minio-mc.html) documentation. + +2. In the current directory of the node, create a bucket named __centos-base__ : + + ```bash + mc mb -p minio/centos-base + ``` + + The expected output should resemble the following: + + ```bash + Bucket created successfully __minio/centos-base__ . + ``` + +3. Set the access policy of the bucket __centos-base__ to allow public download. This will enable access during the installation of the GPU Operator: + + ```bash + mc anonymous set download minio/centos-base + ``` + + The expected output should resemble the following: + + ```bash + Access permission for __minio/centos-base__ is set to __download__ + ``` + +4. In the current directory of the node, copy the generated __centos-base__ offline yum source to the __minio/centos-base__ bucket on the MinIO server: + + ```bash + mc cp centos-base minio/centos-base --recursive + ``` + +### Create a ConfigMap to Store the Yum Source Info in the Cluster + +Perform the following steps on the control node of the cluster where the GPU Operator will be deployed. + +1. Run the following command to create a file named __CentOS-Base.repo__ that specifies the configmap for the yum source storage: + + ```bash + # The file name must be CentOS-Base.repo, otherwise it cannot be recognized during the installation of the GPU Operator + cat > CentOS-Base.repo << EOF + [extension-0] + baseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address where the yum source is placed in step 3 + gpgcheck = 0 + name = kubean extension 0 + + [extension-1] + baseurl = http://10.5.14.200:9000/centos-base/centos-base # The file server address where the yum source is placed in step 3 + gpgcheck = 0 + name = kubean extension 1 + EOF + ``` + +2. Based on the created __CentOS-Base.repo__ file, create a configmap named __local-repo-config__ in the __gpu-operator__ namespace: + + ```bash + kubectl create configmap local-repo-config -n gpu-operator --from-file=CentOS-Base.repo=/etc/yum.repos.d/extension.repo + ``` + + The expected output should resemble the following: + + ``` + configmap/local-repo-config created + ``` + + The __local-repo-config__ configmap will be used to provide the value for the __RepoConfig.ConfigMapName__ parameter during the installation of the GPU Operator. You can customize the configuration file name. + +3. View the content of the __local-repo-config__ configmap: + + ```bash + kubectl get configmap local-repo-config -n gpu-operator -oyaml + ``` + + The expected output should resemble the following: + + ```yaml + apiVersion: v1 + data: + CentOS-Base.repo: "[extension-0]\nbaseurl = http://10.6.232.5:32618/centos-base# The file server path where the yum source is placed in step 2\ngpgcheck = 0\nname = kubean extension 0\n \n[extension-1]\nbaseurl = http://10.6.232.5:32618/centos-base # The file server path where the yum source is placed in step 2\ngpgcheck = 0\nname = kubean extension 1\n" + kind: ConfigMap + metadata: + creationTimestamp: "2023-10-18T01:59:02Z" + name: local-repo-config + namespace: gpu-operator + resourceVersion: "59445080" + uid: c5f0ebab-046f-442c-b932-f9003e014387 + ``` + +You have successfully created an offline yum source configuration file for the cluster where the +GPU Operator will be deployed. 
You can use it during the [offline installation of the GPU Operator](./install_nvidia_driver_of_operator.md) +by specifying the __RepoConfig.ConfigMapName__ parameter. diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.md new file mode 100644 index 0000000000..78d61e949f --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/upgrade_yum_source_redhat8_4.md @@ -0,0 +1,208 @@ +# Building Red Hat 8.4 Offline Yum Source + +The AI platform comes with pre-installed CentOS v7.9 and GPU Operator offline packages with kernel v3.10.0-1160. For other OS types or nodes with different kernels, users need to manually build the offline yum source. + +This guide explains how to build an offline yum source package for Red Hat 8.4 based on any node in the Global cluster. It also demonstrates how to use it during the installation of the GPU Operator by specifying the __RepoConfig.ConfigMapName__ parameter. + +## Prerequisites + +1. The user has already installed the addon offline package v0.12.0 or higher on the platform. +2. The OS of the cluster nodes where the GPU Operator will be deployed must be Red Hat v8.4, and the kernel version must be identical. +3. Prepare a file server that can communicate with the cluster network where the GPU Operator will be deployed, such as Nginx or MinIO. +4. Prepare a node that can access the internet, the cluster where the GPU Operator will be deployed, and the file server. Ensure that Docker is already installed on this node. +5. The nodes in the Global cluster must be Red Hat 8.4 4.18.0-305.el8.x86_64. + +## Procedure + +This guide uses a node with Red Hat 8.4 4.18.0-305.el8.x86_64 as an example to demonstrate how to build an offline yum source package for Red Hat 8.4 based on any node in the Global cluster. It also explains how to use it during the installation of the GPU Operator by specifying the __RepoConfig.ConfigMapName__ parameter. + +### Step 1: Download the Yum Source from the Bootstrap Node + +Perform the following steps on the master node of the Global cluster. + +1. Use SSH or any other method to access any node in the Global cluster and run the following command: + + ```bash + cat /etc/yum.repos.d/extension.repo # View the contents of extension.repo. + ``` + + The expected output should resemble the following: + + ```ini + [extension-0] + baseurl = http://10.5.14.200:9000/kubean/redhat/$releasever/os/$basearch + gpgcheck = 0 + name = kubean extension 0 + + [extension-1] + baseurl = http://10.5.14.200:9000/kubean/redhat-iso/$releasever/os/$basearch/AppStream + gpgcheck = 0 + name = kubean extension 1 + + [extension-2] + baseurl = http://10.5.14.200:9000/kubean/redhat-iso/$releasever/os/$basearch/BaseOS + gpgcheck = 0 + name = kubean extension 2 + ``` + +2. Create a folder named __redhat-base-repo__ under the root directory: + + ```bash + mkdir redhat-base-repo + ``` + +3. Download the RPM packages from the yum source to your local machine: + + Download the RPM packages from __extension-1__ : + + ```bash + reposync -p redhat-base-repo -n --repoid=extension-1 + ``` + + Download the RPM packages from __extension-2__ : + + ```bash + reposync -p redhat-base-repo -n --repoid=extension-2 + ``` + +### Step 2: Download the __elfutils-libelf-devel-0.187-4.el8.x86_64.rpm__ Package + +Perform the following steps on a node with internet access. 
Before proceeding, ensure that there is network connectivity between the node with internet access and the master node of the Global cluster. + +1. Run the following command on the node with internet access to download the __elfutils-libelf-devel-0.187-4.el8.x86_64.rpm__ package: + + ```bash + wget https://rpmfind.net/linux/centos/8-stream/BaseOS/x86_64/os/Packages/elfutils-libelf-devel-0.187-4.el8.x86_64.rpm + ``` + +2. Transfer the __elfutils-libelf-devel-0.187-4.el8.x86_64.rpm__ package from the current directory to the node mentioned in step 1: + + ```bash + scp elfutils-libelf-devel-0.187-4.el8.x86_64.rpm user@ip:~/redhat-base-repo/extension-2/Packages/ + ``` + + For example: + + ```bash + scp elfutils-libelf-devel-0.187-4.el8.x86_64.rpm root@10.6.175.10:~/redhat-base-repo/extension-2/Packages/ + ``` + +### Step 3: Generate the Local Yum Repository + +Perform the following steps on the master node of the Global cluster mentioned in Step 1. + +1. Enter the yum repository directories: + + ```bash + cd ~/redhat-base-repo/extension-1/Packages + cd ~/redhat-base-repo/extension-2/Packages + ``` + +2. Generate the repository index for the directories: + + ```bash + createrepo_c ./ + ``` + +You have now generated the offline yum source named __redhat-base-repo__ for kernel version __4.18.0-305.el8.x86_64__ . + +### Step 4: Upload the Local Yum Repository to the File Server + +In this example, we will use Minio, which is built-in as the file server in the bootstrap node. However, you can choose any file server that suits your needs. Here are the details for Minio: + +- Access URL: `http://10.5.14.200:9000` (usually the {bootstrap-node-IP} + {port-9000}) +- Login username: rootuser +- Login password: rootpass123 + +1. On the current node, establish a connection between the local __mc__ command-line tool and the Minio server by running the following command: + + ```bash + mc config host add minio + ``` + + For example: + + ```bash + mc config host add minio http://10.5.14.200:9000 rootuser rootpass123 + ``` + + The expected output should be similar to: + + ```bash + Added __minio__ successfully. + ``` + + The __mc__ command-line tool is provided by the Minio file server as a client command-line tool. For more details, refer to the [MinIO Client](https://min.io/docs/minio/linux/reference/minio-mc.html) documentation. + +2. Create a bucket named __redhat-base__ in the current location: + + ```bash + mc mb -p minio/redhat-base + ``` + + The expected output should be similar to: + + ```bash + Bucket created successfully __minio/redhat-base__ . + ``` + +3. Set the access policy of the __redhat-base__ bucket to allow public downloads so that it can be accessed during the installation of the GPU Operator: + + ```bash + mc anonymous set download minio/redhat-base + ``` + + The expected output should be similar to: + + ```bash + Access permission for __minio/redhat-base__ is set to __download__ + ``` + +4. Copy the offline yum repository files ( __redhat-base-repo__ ) from the current location to the Minio server's __minio/redhat-base__ bucket: + + ```bash + mc cp redhat-base-repo minio/redhat-base --recursive + ``` + +### Step 5: Create a ConfigMap to Store Yum Repository Information in the Cluster + +Perform the following steps on the control node of the cluster where you will deploy the GPU Operator. + +1. 
Run the following command to create a file named __redhat.repo__ , which specifies the configuration information for the yum repository storage: + + ```bash + # The file name must be redhat.repo, otherwise it won't be recognized when installing gpu-operator + cat > redhat.repo << EOF + [extension-0] + baseurl = http://10.5.14.200:9000/redhat-base/redhat-base-repo/Packages # The file server address where the yum source is stored in Step 1 + gpgcheck = 0 + name = kubean extension 0 + + [extension-1] + baseurl = http://10.5.14.200:9000/redhat-base/redhat-base-repo/Packages # The file server address where the yum source is stored in Step 1 + gpgcheck = 0 + name = kubean extension 1 + EOF + ``` + +2. Based on the created __redhat.repo__ file, create a configmap named __local-repo-config__ in the __gpu-operator__ namespace: + + ```bash + kubectl create configmap local-repo-config -n gpu-operator --from-file=./redhat.repo + ``` + + The expected output should be similar to: + + ``` + configmap/local-repo-config created + ``` + + The __local-repo-config__ configuration file is used to provide the value for the __RepoConfig.ConfigMapName__ parameter during the installation of the GPU Operator. You can choose a different name for the configuration file. + +3. View the contents of the __local-repo-config__ configuration file: + + ```bash + kubectl get configmap local-repo-config -n gpu-operator -oyaml + ``` + +You have successfully created the offline yum source configuration file for the cluster where the GPU Operator will be deployed. You can use it by specifying the __RepoConfig.ConfigMapName__ parameter during the [offline installation of the GPU Operator](./install_nvidia_driver_of_operator.md). diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/vgpu/hami.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/vgpu/hami.md new file mode 100644 index 0000000000..e7b393f3df --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/vgpu/hami.md @@ -0,0 +1,23 @@ +--- +hide: + - toc +--- + +# Build a vGPU Memory Oversubscription Image + +The vGPU memory oversubscription feature in the [Hami Project](https://github.com/Project-HAMi/HAMi) +no longer exists. To use this feature, you need to rebuild with the `libvgpu.so` file +that supports memory oversubscription. + +```bash title="Dockerfile" +FROM docker.m.daocloud.io/projecthami/hami:v2.3.11 +COPY libvgpu.so /k8s-vgpu/lib/nvidia/ +``` + +Run the following command to build the image: + +```bash +docker build -t release.daocloud.io/projecthami/hami:v2.3.11 -f Dockerfile . +``` + +Then, push the image to `release.daocloud.io`. diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/vgpu/vgpu_addon.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/vgpu/vgpu_addon.md new file mode 100644 index 0000000000..5405cd08fc --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/vgpu/vgpu_addon.md @@ -0,0 +1,36 @@ +# Installing NVIDIA vGPU Addon + +To virtualize a single NVIDIA GPU into multiple virtual GPUs and allocate them to different virtual machines or users, you can use NVIDIA's vGPU capability. +This section explains how to install the vGPU plugin in the AI platform platform, which is a prerequisite for using NVIDIA vGPU capability. + +## Prerequisites + +- Refer to the [GPU Support Matrix](../../gpu_matrix.md) to confirm that the nodes in the cluster have GPUs of the corresponding models. +- The current cluster has deployed NVIDIA drivers through the Operator. 
For specific instructions, refer to [Offline Installation of GPU Operator](../install_nvidia_driver_of_operator.md). + +## Procedure + +1. Path: __Container Management__ -> __Cluster Management__ -> Click the target cluster -> __Helm Apps__ -> __Helm Charts__ -> Search for __nvidia-vgpu__ . + + ![Alt text](../../images/vgpu-addon.png) + +2. During the installation of vGPU, several basic modification parameters are provided. If you need to modify advanced parameters, click the YAML column to make changes: + + - __deviceMemoryScaling__ : NVIDIA device memory scaling factor, the input value must be an integer, with a default value of 1. It can be greater than 1 (enabling virtual memory, experimental feature). For an NVIDIA GPU with a memory size of M, if we configure the __devicePlugin.deviceMemoryScaling__ parameter as S, in a Kubernetes cluster where we have deployed our device plugin, the vGPUs assigned from this GPU will have a total memory of __S * M__ . + + - __deviceSplitCount__ : An integer type, with a default value of 10. Number of GPU splits, each GPU cannot be assigned more tasks than its configuration count. If configured as N, each GPU can have up to N tasks simultaneously. + + - __Resources__ : Represents the resource usage of the vgpu-device-plugin and vgpu-schedule pods. + + ![Alt text](../../images/vgpu-pararm.png) + +3. After a successful installation, you will see two types of pods in the specified namespace, indicating that the NVIDIA vGPU plugin has been successfully installed: + + ![Alt text](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/gpu/images/vgpu-pod.png) + +After a successful installation, you can [deploy applications using vGPU resources](vgpu_user.md). + +!!! note + + NVIDIA vGPU Addon does not support upgrading directly from the older v2.0.0 to the + latest v2.0.0+1; To upgrade, please uninstall the older version and then reinstall the latest version. diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.md new file mode 100644 index 0000000000..f6cf167548 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/vgpu/vgpu_user.md @@ -0,0 +1,63 @@ +# Using NVIDIA vGPU in Applications + +This section explains how to use the vGPU capability in the AI platform platform. + +## Prerequisites + +- The nodes in the cluster have GPUs of the corresponding models. +- vGPU Addon has been successfully installed. Refer to [Installing GPU Addon](vgpu_addon.md) for details. +- GPU Operator is installed, and the __Nvidia.DevicePlugin__ capability is **disabled**. Refer to [Offline Installation of GPU Operator](../install_nvidia_driver_of_operator.md) for details. + +## Procedure + +### Using vGPU through the UI + +1. Confirm if the cluster has detected GPUs. Click the __Clusters__ -> __Cluster Settings__ -> __Addon Plugins__ and check if the GPU plugin has been automatically enabled and the corresponding GPU type has been detected. Currently, the cluster will automatically enable the __GPU__ addon and set the __GPU Type__ as __Nvidia vGPU__ . + + + +2. Deploy a workload by clicking __Clusters__ -> __Workloads__ . When deploying a workload using an image, select the type __Nvidia vGPU__ , and you will be prompted with the following parameters: + + - **Number of Physical Cards (nvidia.com/vgpu)** : Indicates how many physical cards need to be mounted by the current pod. The input value must be an integer and **less than or equal to** the number of cards on the host machine. 
+ - **GPU Cores (nvidia.com/gpucores)**: Indicates the GPU cores utilized by each card, with a value range from 0 to 100. + Setting it to 0 means no enforced isolation, while setting it to 100 means exclusive use of the entire card. + - **GPU Memory (nvidia.com/gpumem)**: Indicates the GPU memory occupied by each card, with a value in MB. The minimum value is 1, and the maximum value is the total memory of the card. + + > If there are issues with the configuration values above, it may result in scheduling failure or inability to allocate resources. + + + +### Using vGPU through YAML Configuration + +Refer to the following workload configuration and add the parameter __nvidia.com/vgpu: '1'__ in the resource requests and limits section to configure the number of physical cards used by the application. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: full-vgpu-demo + namespace: default +spec: + replicas: 1 + selector: + matchLabels: + app: full-vgpu-demo + template: + metadata: + creationTimestamp: null + labels: + app: full-vgpu-demo + spec: + containers: + - name: full-vgpu-demo1 + image: chrstnhntschl/gpu_burn + resources: + limits: + nvidia.com/gpucores: '20' # Request 20% of GPU cores for each card + nvidia.com/gpumem: '200' # Request 200MB of GPU memory for each card + nvidia.com/vgpu: '1' # Request 1 GPU card + imagePullPolicy: Always + restartPolicy: Always +``` + +This YAML configuration requests the application to use vGPU resources. It specifies that each card should utilize 20% of GPU cores, 200MB of GPU memory, and requests 1 GPU card. diff --git a/docs/en/docs/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.md b/docs/en/docs/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.md new file mode 100644 index 0000000000..c4313dc435 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/nvidia/yum_source_redhat7_9.md @@ -0,0 +1,120 @@ +--- +MTPE: windsonsea +date: 2024-06-14 +--- + +# Build an Offline Yum Repository for Red Hat 7.9 + +## Introduction + +AI platform comes with a pre-installed CentOS 7.9 with GPU Operator offline package for kernel 3.10.0-1160. +You need to manually build an offline yum repository for other OS types or nodes with different kernels. + +This page explains how to build an offline yum repository for Red Hat 7.9 based on any node in the Global cluster, and how to use the `RepoConfig.ConfigMapName` parameter when installing the GPU Operator. + +## Prerequisites + +1. The cluster nodes where the GPU Operator is to be deployed must be Red Hat 7.9 with the exact same kernel version. +1. Prepare a file server that can be connected to the cluster network where the GPU Operator is to be deployed, such as nginx or minio. +1. Prepare a node that can access the internet, the cluster where the GPU Operator is to be deployed, + and the file server. [Docker installation](../../../../install/community/kind/online.md#install-docker) must be completed on this node. +1. The nodes in the global service cluster must be Red Hat 7.9. + +## Steps + +### 1. Build Offline Yum Repo for Relevant Kernel + +1. [Download rhel7.9 ISO](https://developers.redhat.com/products/rhel/download#assembly-field-downloads-page-content-61451) + + ![Download rhel7.9 ISO](../images/rhel7.9.png) + +2. Download the [rhel7.9 ospackage](https://github.com/kubean-io/kubean/releases) that corresponds to your Kubean version. + + Find the version number of Kubean in the **Container Management** section of the Global cluster under **Helm Apps**. 
+ + + + Download the rhel7.9 ospackage for that version from the + [Kubean repository](https://github.com/kubean-io/kubean/releases). + + ![Kubean repository](../images/redhat0.12.2.png) + +3. Import offline resources using the installer. + + Refer to the [Import Offline Resources document](../../../../install/import.md). + +### 2. Download Offline Driver Image for Red Hat 7.9 OS + +[Click here to view the download url](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags). + +![Driver image](../images/driveimage.png) + +### 3. Upload Red Hat GPU Operator Offline Image to Boostrap Node Repository + +Refer to [Upload Red Hat GPU Operator Offline Image to Boostrap Node Repository](./push_image_to_repo.md). + +!!! note + + This reference is based on rhel8.4, so make sure to modify it for rhel7.9. + +### 4. Create ConfigMaps in the Cluster to Save Yum Repository Information + +Run the following command on the control node of the cluster where the GPU Operator is to be deployed. + +1. Run the following command to create a file named __CentOS-Base.repo__ to specify the configuration information where the yum repository is stored. + + ```bash + # The file name must be CentOS-Base.repo, otherwise it will not be recognized when installing gpu-operator + cat > CentOS-Base.repo << EOF + [extension-0] + baseurl = http://10.5.14.200:9000/centos-base/centos-base # The server file address of the boostrap node, usually {boostrap node IP} + {9000 port} + gpgcheck = 0 + name = kubean extension 0 + + [extension-1] + baseurl = http://10.5.14.200:9000/centos-base/centos-base # The server file address of the boostrap node, usually {boostrap node IP} + {9000 port} + gpgcheck = 0 + name = kubean extension 1 + EOF + ``` + +2. Based on the created __CentOS-Base.repo__ file, create a profile named __local-repo-config__ in the gpu-operator namespace: + + ```bash + kubectl create configmap local-repo-config -n gpu-operator --from-file=CentOS-Base.repo=/etc/yum.repos.d/extension.repo + ``` + + The expected output is as follows: + + ```console + configmap/local-repo-config created + ``` + + The __local-repo-config__ profile is used to provide the value of the `RepoConfig.ConfigMapName` parameter when installing gpu-operator, and the profile name can be customized by the user. + +3. View the contents of the __local-repo-config__ profile: + + ```bash + kubectl get configmap local-repo-config -n gpu-operator -oyaml + ``` + + The expected output is as follows: + + ```yaml title="local-repo-config.yaml" + apiVersion: v1 + data: + CentOS-Base.repo: "[extension-0]\nbaseurl = http://10.6.232.5:32618/centos-base # The file path where yum repository is placed in Step 2 \ngpgcheck = 0\nname = kubean extension 0\n \n[extension-1]\nbaseurl + = http://10.6.232.5:32618/centos-base # The file path where yum repository is placed in Step 2 \ngpgcheck = 0\nname + = kubean extension 1\n" + kind: ConfigMap + metadata: + creationTimestamp: "2023-10-18T01:59:02Z" + name: local-repo-config + namespace: gpu-operator + resourceVersion: "59445080" + uid: c5f0ebab-046f-442c-b932-f9003e014387 + ``` + +At this point, you have successfully created the offline yum repository profile for the cluster +where the GPU Operator is to be deployed. The `RepoConfig.ConfigMapName` parameter was used during the +[Offline Installation of GPU Operator](./install_nvidia_driver_of_operator.md). 
diff --git a/docs/en/docs/end-user/kpanda/gpu/vgpu_quota.md b/docs/en/docs/end-user/kpanda/gpu/vgpu_quota.md new file mode 100644 index 0000000000..c33803209e --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/vgpu_quota.md @@ -0,0 +1,25 @@ +# GPU Quota Management + +This section describes how to use vGPU capabilities on the AI platform platform. + +## Prerequisites + +The corresponding GPU driver (NVIDIA GPU, NVIDIA MIG, Days, Ascend) has been deployed on the current cluster either through an Operator or manually. + +## Procedure + +Follow these steps to manage GPU quotas in AI platform: + +1. Go to Namespaces and click __Quota Management__ to configure the GPU resources that can be used by a specific namespace. + + + +2. The currently supported card types for quota management in a namespace are: NVIDIA vGPU, NVIDIA MIG, Days, and Ascend. + + - **NVIDIA vGPU Quota Management**: Configure the specific quota that can be used. This will create a ResourcesQuota CR. + + - Physical Card Count (nvidia.com/vgpu): Indicates the number of physical cards that the current pod needs to mount. The input value must be an integer and **less than or equal to** the number of cards on the host machine. + - GPU Core Count (nvidia.com/gpucores): Indicates the GPU compute power occupied by each card. The value ranges from 0 to 100. If configured as 0, it is considered not to enforce isolation. If configured as 100, it is considered to exclusively occupy the entire card. + - GPU Memory Usage (nvidia.com/gpumem): Indicates the amount of GPU memory occupied by each card. The value is in MB, with a minimum value of 1 and a maximum value equal to the entire memory of the card. + + diff --git a/docs/en/docs/end-user/kpanda/gpu/volcano/volcano-gang-scheduler.md b/docs/en/docs/end-user/kpanda/gpu/volcano/volcano-gang-scheduler.md new file mode 100644 index 0000000000..7996e47db9 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/volcano/volcano-gang-scheduler.md @@ -0,0 +1,197 @@ +--- +MTPE: ModetaNiu +date: 2024-06-12 +--- + +# Using Volcano's Gang Scheduler + +The Gang scheduling policy is one of the core scheduling algorithms of the volcano-scheduler. +It satisfies the "All or nothing" scheduling requirement during the scheduling process, preventing +arbitrary scheduling of Pods that could waste cluster resources. The specific algorithm observes +whether the number of scheduled Pods under a Job meets the minimum running quantity. +When the Job's minimum running quantity is satisfied, scheduling actions are performed for all Pods under the Job; +otherwise, no actions are taken. + +## Use Cases + +The Gang scheduling algorithm, based on the concept of a Pod group, is particularly suitable for scenarios +that require multi-process collaboration. AI scenarios often involve complex workflows, such as Data Ingestion, +Data Analysis, Data Splitting, Training, Serving, and Logging, which require a group of containers to work together. +This makes the Gang scheduling policy based on pods very appropriate. + +In multi-threaded parallel computing communication scenarios under the MPI computation framework, +Gang scheduling is also very suitable because it requires master and slave processes to work together. +High relevance among containers in a pod may lead to resource contention, and overall scheduling allocation +can effectively resolve deadlocks. + +In scenarios with insufficient cluster resources, the Gang scheduling policy significantly improves +the utilization of cluster resources. 
For example, if the cluster can currently accommodate only 2 Pods, +but the minimum number of Pods required for scheduling is 3, then all Pods of this Job will remain pending until +the cluster can accommodate 3 Pods, at which point the Pods will be scheduled. This effectively prevents the +partial scheduling of Pods, which would not meet the requirements and would occupy resources, making other Jobs unable to run. + +## Concept Explanation + +The Gang Scheduler is the core scheduling plugin of Volcano, and it is enabled by default upon installing Volcano. +When creating a workload, you only need to specify the scheduler name as Volcano. + +Volcano schedules based on PodGroups. When creating a workload, there is no need to manually create PodGroup resources; +Volcano will automatically create them based on the workload information. Below is an example of a PodGroup: + +```yaml +apiVersion: scheduling.volcano.sh/v1beta1 +kind: PodGroup +metadata: + name: test + namespace: default +spec: + minMember: 1 # (1)! + minResources: # (2)! + cpu: "3" + memory: "2048Mi" + priorityClassName: high-prority # (3)! + queue: default # (4)! +``` + +1. Represents the **minimum** number of Pods or jobs that need to run under this PodGroup. If the cluster resources + do not meet the requirements to run the number of jobs specified by miniMember, the scheduler will not + schedule any jobs within this PodGroup. +2. Represents the minimum resources required to run this PodGroup. If the allocatable resources of the cluster + do not meet the minResources, the scheduler will not schedule any jobs within this PodGroup. +3. Represents the priority of this PodGroup, used by the scheduler to sort all PodGroups within the queue during scheduling. + **system-node-critical** and **system-cluster-critical** are two reserved values indicating the highest priority. + If not specifically designated, the default priority or zero priority is used. +4. Represents the queue to which this PodGroup belongs. The queue must be pre-created and in the open state. + +## Use Case + +In a multi-threaded parallel computing communication scenario under the MPI computation framework, we need to ensure +that all Pods can be successfully scheduled to ensure the job is completed correctly. Setting minAvailable to 4 +means that 1 mpimaster and 3 mpiworkers are required to run. 
+ +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: lm-mpi-job + labels: + "volcano.sh/job-type": "MPI" +spec: + minAvailable: 4 + schedulerName: volcano + plugins: + ssh: [] + svc: [] + policies: + - event: PodEvicted + action: RestartJob + tasks: + - replicas: 1 + name: mpimaster + policies: + - event: TaskCompleted + action: CompleteJob + template: + spec: + containers: + - command: + - /bin/sh + - -c + - | + MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`; + mkdir -p /var/run/sshd; /usr/sbin/sshd; + mpiexec --allow-run-as-root --host ${MPI_HOST} -np 3 mpi_hello_world; + image: docker.m.daocloud.io/volcanosh/example-mpi:0.0.1 + name: mpimaster + ports: + - containerPort: 22 + name: mpijob-port + workingDir: /home + resources: + requests: + cpu: "500m" + limits: + cpu: "500m" + restartPolicy: OnFailure + imagePullSecrets: + - name: default-secret + - replicas: 3 + name: mpiworker + template: + spec: + containers: + - command: + - /bin/sh + - -c + - | + mkdir -p /var/run/sshd; /usr/sbin/sshd -D; + image: docker.m.daocloud.io/volcanosh/example-mpi:0.0.1 + name: mpiworker + ports: + - containerPort: 22 + name: mpijob-port + workingDir: /home + resources: + requests: + cpu: "1000m" + limits: + cpu: "1000m" + restartPolicy: OnFailure + imagePullSecrets: + - name: default-secret +``` + +Generate the resources for PodGroup: + +```yaml +apiVersion: scheduling.volcano.sh/v1beta1 +kind: PodGroup +metadata: + annotations: + creationTimestamp: "2024-05-28T09:18:50Z" + generation: 5 + labels: + volcano.sh/job-type: MPI + name: lm-mpi-job-9c571015-37c7-4a1a-9604-eaa2248613f2 + namespace: default + ownerReferences: + - apiVersion: batch.volcano.sh/v1alpha1 + blockOwnerDeletion: true + controller: true + kind: Job + name: lm-mpi-job + uid: 9c571015-37c7-4a1a-9604-eaa2248613f2 + resourceVersion: "25173454" + uid: 7b04632e-7cff-4884-8e9a-035b7649d33b +spec: + minMember: 4 + minResources: + count/pods: "4" + cpu: 3500m + limits.cpu: 3500m + pods: "4" + requests.cpu: 3500m + minTaskMember: + mpimaster: 1 + mpiworker: 3 + queue: default +status: + conditions: + - lastTransitionTime: "2024-05-28T09:19:01Z" + message: '3/4 tasks in gang unschedulable: pod group is not ready, 1 Succeeded, + 3 Releasing, 4 minAvailable' + reason: NotEnoughResources + status: "True" + transitionID: f875efa5-0358-4363-9300-06cebc0e7466 + type: Unschedulable + - lastTransitionTime: "2024-05-28T09:18:53Z" + reason: tasks in gang are ready to be scheduled + status: "True" + transitionID: 5a7708c8-7d42-4c33-9d97-0581f7c06dab + type: Scheduled + phase: Pending + succeeded: 1 +``` + +From the PodGroup, it can be seen that it is associated with the workload through ownerReferences and +sets the minimum number of running Pods to 4. diff --git a/docs/en/docs/end-user/kpanda/gpu/volcano/volcano_user_guide.md b/docs/en/docs/end-user/kpanda/gpu/volcano/volcano_user_guide.md new file mode 100644 index 0000000000..b5e6f97f7e --- /dev/null +++ b/docs/en/docs/end-user/kpanda/gpu/volcano/volcano_user_guide.md @@ -0,0 +1,282 @@ +--- +MTPE: windsonsea +date: 2024-06-25 +--- + +# Use Volcano for AI Compute + +## Usage Scenarios + +Kubernetes has become the de facto standard for orchestrating and managing cloud-native applications, +and an increasing number of applications are choosing to migrate to K8s. 
The fields of artificial intelligence +and machine learning inherently involve a large number of compute-intensive tasks, and developers are very willing +to build AI platforms based on Kubernetes to fully leverage its resource management, application orchestration, +and operations monitoring capabilities. However, the default Kubernetes scheduler was initially designed +primarily for long-running services and has many shortcomings in batch and elastic scheduling for AI and +big data tasks. For example, resource contention issues: + +Take TensorFlow job scenarios as an example. TensorFlow jobs include two different roles, PS and Worker, +and the Pods for these two roles need to work together to complete the entire job. If only one type of +role Pod is running, the entire job cannot be executed properly. The default scheduler schedules Pods +one by one and is unaware of the PS and Worker roles in a Kubeflow TFJob. In a high-load cluster +(insufficient resources), multiple jobs may each be allocated some resources to run a portion of +their Pods, but the jobs cannot complete successfully, leading to resource waste. For instance, +if a cluster has 4 GPUs and both TFJob1 and TFJob2 each have 4 Workers, TFJob1 and TFJob2 +might each be allocated 2 GPUs. However, both TFJob1 and TFJob2 require 4 GPUs to run. +This mutual waiting for resource release creates a deadlock situation, resulting in GPU resource waste. + +## Volcano Batch Scheduling System + +Volcano is the first Kubernetes-based container batch computing platform under CNCF, focusing on +high-performance computing scenarios. It fills in the missing functionalities of Kubernetes +in fields such as machine learning, big data, and scientific computing, providing essential +support for these high-performance workloads. Additionally, Volcano seamlessly integrates +with mainstream computing frameworks like Spark, TensorFlow, and PyTorch, and supports +hybrid scheduling of heterogeneous devices, including CPUs and GPUs, effectively resolving +the deadlock issues mentioned above. + +The following sections will introduce how to install and use Volcano. + +## Install Volcano + +1. Find Volcano in **Cluster Details** -> **Helm Apps** -> **Helm Charts** and install it. + + ![Volcano helm chart](../../images/volcano-01.png) + + ![Install Volcano](../../images/volcano-02.png) + +2. Check and confirm whether Volcano is installed successfully, that is, whether the components volcano-admission, + volcano-controllers, and volcano-scheduler are running properly. + + ![Volcano components](../../images/volcano-03.png) + +Typically, Volcano is used in conjunction with the [AI Lab](../../../../baize/intro/index.md) +to achieve an effective closed-loop process for the development and training of datasets, Notebooks, and task training. + +## Volcano Use Cases + +- Volcano is a standalone scheduler. To enable the Volcano scheduler when creating workloads, + simply specify the scheduler's name (`schedulerName: volcano`). +- The `volcanoJob` resource is an extension of the Job in Volcano, + breaking the Job down into smaller working units called tasks, which can interact with each other. 
+ +### Volcano Supports TensorFlow + +Here is an example: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: tensorflow-benchmark + labels: + "volcano.sh/job-type": "Tensorflow" +spec: + minAvailable: 3 + schedulerName: volcano + plugins: + env: [] + svc: [] + policies: + - event: PodEvicted + action: RestartJob + tasks: + - replicas: 1 + name: ps + template: + spec: + imagePullSecrets: + - name: default-secret + containers: + - command: + - sh + - -c + - | + PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" ","`; + WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" ","`; + python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=cpu --data_format=NHWC --job_name=ps --task_index=${VK_TASK_INDEX} --ps_hosts=${PS_HOST} --worker_hosts=${WORKER_HOST} + image: docker.m.daocloud.io/volcanosh/example-tf:0.0.1 + name: tensorflow + ports: + - containerPort: 2222 + name: tfjob-port + resources: + requests: + cpu: "1000m" + memory: "2048Mi" + limits: + cpu: "1000m" + memory: "2048Mi" + workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks + restartPolicy: OnFailure + - replicas: 2 + name: worker + policies: + - event: TaskCompleted + action: CompleteJob + template: + spec: + imagePullSecrets: + - name: default-secret + containers: + - command: + - sh + - -c + - | + PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" ","`; + WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" ","`; + python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=cpu --data_format=NHWC --job_name=worker --task_index=${VK_TASK_INDEX} --ps_hosts=${PS_HOST} --worker_hosts=${WORKER_HOST} + image: docker.m.daocloud.io/volcanosh/example-tf:0.0.1 + name: tensorflow + ports: + - containerPort: 2222 + name: tfjob-port + resources: + requests: + cpu: "2000m" + memory: "2048Mi" + limits: + cpu: "2000m" + memory: "4096Mi" + workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks + restartPolicy: OnFailure +``` + +### Parallel Computing with MPI + +In multi-threaded parallel computing communication scenarios under the MPI computing framework, +we need to ensure that all Pods are successfully scheduled to guarantee the task's proper completion. +Setting `minAvailable` to 4 indicates that 1 `mpimaster` and 3 `mpiworkers` are required to run. +By simply setting the `schedulerName` field value to "volcano," you can enable the Volcano scheduler. 
+ +Here is an example: + +```yaml +apiVersion: batch.volcano.sh/v1alpha1 +kind: Job +metadata: + name: lm-mpi-job + labels: + "volcano.sh/job-type": "MPI" +spec: + minAvailable: 4 + schedulerName: volcano + plugins: + ssh: [] + svc: [] + policies: + - event: PodEvicted + action: RestartJob + tasks: + - replicas: 1 + name: mpimaster + policies: + - event: TaskCompleted + action: CompleteJob + template: + spec: + containers: + - command: + - /bin/sh + - -c + - | + MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`; + mkdir -p /var/run/sshd; /usr/sbin/sshd; + mpiexec --allow-run-as-root --host ${MPI_HOST} -np 3 mpi_hello_world; + image: docker.m.daocloud.io/volcanosh/example-mpi:0.0.1 + name: mpimaster + ports: + - containerPort: 22 + name: mpijob-port + workingDir: /home + resources: + requests: + cpu: "500m" + limits: + cpu: "500m" + restartPolicy: OnFailure + imagePullSecrets: + - name: default-secret + - replicas: 3 + name: mpiworker + template: + spec: + containers: + - command: + - /bin/sh + - -c + - | + mkdir -p /var/run/sshd; /usr/sbin/sshd -D; + image: docker.m.daocloud.io/volcanosh/example-mpi:0.0.1 + name: mpiworker + ports: + - containerPort: 22 + name: mpijob-port + workingDir: /home + resources: + requests: + cpu: "1000m" + limits: + cpu: "1000m" + restartPolicy: OnFailure + imagePullSecrets: + - name: default-secret +``` + +Resources to generate PodGroup: + +```yaml +apiVersion: scheduling.volcano.sh/v1beta1 +kind: PodGroup +metadata: + annotations: + creationTimestamp: "2024-05-28T09:18:50Z" + generation: 5 + labels: + volcano.sh/job-type: MPI + name: lm-mpi-job-9c571015-37c7-4a1a-9604-eaa2248613f2 + namespace: default + ownerReferences: + - apiVersion: batch.volcano.sh/v1alpha1 + blockOwnerDeletion: true + controller: true + kind: Job + name: lm-mpi-job + uid: 9c571015-37c7-4a1a-9604-eaa2248613f2 + resourceVersion: "25173454" + uid: 7b04632e-7cff-4884-8e9a-035b7649d33b +spec: + minMember: 4 + minResources: + count/pods: "4" + cpu: 3500m + limits.cpu: 3500m + pods: "4" + requests.cpu: 3500m + minTaskMember: + mpimaster: 1 + mpiworker: 3 + queue: default +status: + conditions: + - lastTransitionTime: "2024-05-28T09:19:01Z" + message: '3/4 tasks in gang unschedulable: pod group is not ready, 1 Succeeded, + 3 Releasing, 4 minAvailable' + reason: NotEnoughResources + status: "True" + transitionID: f875efa5-0358-4363-9300-06cebc0e7466 + type: Unschedulable + - lastTransitionTime: "2024-05-28T09:18:53Z" + reason: tasks in gang are ready to be scheduled + status: "True" + transitionID: 5a7708c8-7d42-4c33-9d97-0581f7c06dab + type: Scheduled + phase: Pending + succeeded: 1 +``` + +From the PodGroup, it can be seen that it is associated with the workload through +`ownerReferences` and sets the minimum number of running Pods to 4. + +If you want to learn more about the features and usage scenarios of Volcano, +refer to [Volcano Introduction](https://volcano.sh/en/docs/). diff --git a/docs/en/docs/end-user/kpanda/helm/Import-addon.md b/docs/en/docs/end-user/kpanda/helm/Import-addon.md new file mode 100644 index 0000000000..66780218e6 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/helm/Import-addon.md @@ -0,0 +1,97 @@ +# Import Custom Helm Apps into Built-in Addons + +This article explains how to import Helm appss into the system's built-in addons in both offline and online environments. + +## Offline Environment + +An offline environment refers to an environment that cannot connect to the internet or is a closed private network environment. 
+ +### Prerequisites + +- [charts-syncer](https://github.com/DaoCloud/charts-syncer) is available and running. + If not, you can [click here to download](https://github.com/DaoCloud/charts-syncer/releases). +- The Helm Chart has been adapted for [charts-syncer](https://github.com/DaoCloud/charts-syncer). + This means adding a __.relok8s-images.yaml__ file to the Helm Chart. This file should include all the images used in the Chart, + including any images that are not directly used in the Chart but are used similar to images used in an Operator. + +!!! note + + - Refer to [image-hints-file](https://github.com/vmware-tanzu/asset-relocation-tool-for-kubernetes#image-hints-file) for instructions on how to write a Chart. + It is required to separate the registry and repository of the image because the registry/repository needs to be replaced or modified when loading the image. + - The installer's fire cluster has [charts-syncer](https://github.com/DaoCloud/charts-syncer) installed. + If you are importing a custom Helm apps into the installer's fire cluster, you can skip the download and proceed to the adaptation. + If [charts-syncer](https://github.com/DaoCloud/charts-syncer) binary is not installed, you can [download it immediately](https://github.com/DaoCloud/charts-syncer/releases). + +### Sync Helm Chart + +1. Go to __Container Management__ -> __Helm Apps__ -> __Helm Repositories__ , search for the addon, and obtain the built-in repository address and username/password (the default username/password for the system's built-in repository is rootuser/rootpass123). + + +1. Sync the Helm Chart to the built-in repository addon of the container management system + + * Write the following configuration file, modify it according to your specific configuration, and save it as __sync-dao-2048.yaml__ . 
+ + ```yaml + source: # helm charts source information + repo: + kind: HARBOR # It can also be any other supported Helm Chart repository type, such as CHARTMUSEUM + url: https://release-ci.daocloud.io/chartrepo/community # Change to the chart repo URL + #auth: # username/password, if no password is set, leave it blank + #username: "admin" + #password: "Harbor12345" + charts: # charts to sync + - name: dao-2048 # helm charts information, if not specified, sync all charts in the source helm repo + versions: + - 1.4.1 + target: # helm charts target information + containerRegistry: 10.5.14.40 # image repository URL + repo: + kind: CHARTMUSEUM # It can also be any other supported Helm Chart repository type, such as HARBOR + url: http://10.5.14.40:8081 # Change to the correct chart repo URL, you can verify the address by using helm repo add $HELM-REPO + auth: # username/password, if no password is set, leave it blank + username: "rootuser" + password: "rootpass123" + containers: + # kind: HARBOR # If the image repository is HARBOR and you want charts-syncer to automatically create an image repository, fill in this field + # auth: # username/password, if no password is set, leave it blank + # username: "admin" + # password: "Harbor12345" + + # leverage .relok8s-images.yaml file inside the Charts to move the container images too + relocateContainerImages: true + ``` + + * Run the charts-syncer command to sync the Chart and its included images + + ```sh + charts-syncer sync --config sync-dao-2048.yaml --insecure --auto-create-repository + ``` + + The expected output is: + + ```console + I1222 15:01:47.119777 8743 sync.go:45] Using config file: "examples/sync-dao-2048.yaml" + W1222 15:01:47.234238 8743 syncer.go:263] Ignoring skipDependencies option as dependency sync is not supported if container image relocation is true or syncing from/to intermediate directory + I1222 15:01:47.234685 8743 sync.go:58] There is 1 chart out of sync! + I1222 15:01:47.234706 8743 sync.go:66] Syncing "dao-2048_1.4.1" chart... + .relok8s-images.yaml hints file found + Computing relocation... + + Relocating dao-2048@1.4.1... + Pushing 10.5.14.40/daocloud/dao-2048:v1.4.1... + Done + Done moving /var/folders/vm/08vw0t3j68z9z_4lcqyhg8nm0000gn/T/charts-syncer869598676/dao-2048-1.4.1.tgz + ``` + +1. Once the previous step is completed, go to __Container Management__ -> __Helm Apps__ -> __Helm Repositories__ , find the corresponding addon, + click __Sync Repository__ in the action column, and you will see the uploaded Helm apps in the Helm template. + +1. You can then proceed with normal installation, upgrade, and uninstallation. + + + +## Online Environment + +The Helm Repo address for the online environment is __release.daocloud.io__ . +If the user does not have permission to add Helm Repo, they will not be able to import custom Helm appss into the system's built-in addons. +You can add your own Helm repository and then integrate your Helm repository into the platform using the same steps as syncing Helm Chart in the offline environment. diff --git a/docs/en/docs/end-user/kpanda/helm/README.md b/docs/en/docs/end-user/kpanda/helm/README.md new file mode 100644 index 0000000000..381d20b860 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/helm/README.md @@ -0,0 +1,27 @@ +--- +hide: + - toc +--- + +# Helm Charts + +Helm is a package management tool for Kubernetes, which makes it easy for users to quickly discover, share and use applications built with Kubernetes. 
The fifth generation [Container Management Module](../../intro/index.md) provides hundreds of Helm charts, covering storage, network, monitoring, database and other main cases. With these templates, you can quickly deploy and easily manage Helm apps through the UI interface. In addition, it supports adding more personalized templates through [Add Helm repository](helm-repo.md) to meet various needs. + +![Helm Charts](../images/helm14.png) + +**Key Concepts**: + +There are a few key concepts to understand when using Helm: + +- Chart: A Helm installation package, which contains the images, dependencies, and resource definitions required to run an application, and may also contain service definitions in the Kubernetes cluster, similar to the formula in Homebrew, dpkg in APT, or rpm files in Yum. Charts are called __Helm Charts__ in AI platform. + +- Release: A Chart instance running on the Kubernetes cluster. A Chart can be installed multiple times in the same cluster, and each installation will create a new Release. Release is called __Helm Apps__ in AI platform. + +- Repository: A repository for publishing and storing Charts. Repository is called __Helm Repositories__ in AI platform. + +For more details, refer to [Helm official website](https://helm.sh/). + +**Related operations**: + +- [Manage Helm apps](helm-app.md), including installing, updating, uninstalling Helm apps, viewing Helm operation records, etc. +- [Manage Helm repository](helm-repo.md), including installing, updating, deleting Helm repository, etc. diff --git a/docs/en/docs/end-user/kpanda/helm/helm-app.md b/docs/en/docs/end-user/kpanda/helm/helm-app.md new file mode 100644 index 0000000000..65b92ac870 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/helm/helm-app.md @@ -0,0 +1,105 @@ +# Manage Helm Apps + +The container management module supports interface-based management of Helm, including creating Helm instances using Helm charts, customizing Helm instance arguments, and managing the full lifecycle of Helm instances. + +This section will take [cert-manager](https://cert-manager.io/docs/) as an example to introduce how to create and manage Helm apps through the container management interface. + +## Prerequisites + +- [Integrated the Kubernetes cluster](../clusters/integrate-cluster.md) or + [created the Kubernetes cluster](../clusters/create-cluster.md), + and you can access the UI interface of the cluster. + +- Created a [namespace](../namespaces/createns.md), + [user](../../../ghippo/access-control/user.md), + and granted [`NS Admin`](../permissions/permission-brief.md#ns-admin) or higher permissions to the user. + For details, refer to [Namespace Authorization](../permissions/cluster-ns-auth.md). + +## Install the Helm app + +Follow the steps below to install the Helm app. + +1. Click a cluster name to enter __Cluster Details__ . + + + +2. In the left navigation bar, click __Helm Apps__ -> __Helm Chart__ to enter the Helm chart page. + + On the Helm chart page, select the [Helm repository](helm-repo.md) named __addon__ , and all the Helm chart templates under the __addon__ repository will be displayed on the interface. + Click the Chart named __cert-manager__ . + + + +3. On the installation page, you can see the relevant detailed information of the Chart, select the version to be installed in the upper right corner of the interface, and click the __Install__ button. Here select v1.9.1 version for installation. + + + +4. Configure __Name__ , __Namespace__ and __Version Information__ . 
+
+## Update the Helm app
+
+After a Helm app has been installed through the interface, it can also be updated through the interface. Note: update operations using the UI are only supported for Helm apps installed via the UI.
+
+Follow the steps below to update the Helm app.
+
+1. Click a cluster name to enter __Cluster Details__ .
+
+2. In the left navigation bar, click __Helm Apps__ to enter the Helm app list page.
+
+    On the Helm app list page, select the Helm app that needs to be updated, click the __...__ operation button on the right side of the list, and select __Update__ from the drop-down menu.
+
+3. After clicking the __Update__ button, the system will jump to the update interface, where you can update the Helm app as needed. Here we take updating the http port of the __dao-2048__ application as an example.
+
+4. After modifying the corresponding arguments, you can click the __Change__ button under the argument configuration to compare the files before and after the modification. After confirming that there is no error, click the __OK__ button at the bottom to complete the update of the Helm app.
+
+5. The system will automatically return to the Helm app list, and a pop-up window in the upper right corner will prompt __update successful__ .
+
+## View Helm operation records
+
+Every installation, update, and deletion of a Helm app has detailed operation records and logs for viewing.
+
+1. In the left navigation bar, click __Cluster Operations__ -> __Recent Operations__ , and then select the __Helm Operations__ tab at the top of the page. Each record corresponds to an install/update/delete operation.
+
+2. To view the detailed log of an operation, click __┇__ on the right side of the list, and select __Log__ from the pop-up menu.
+
+3. The detailed operation log will be displayed in the form of a console at the bottom of the page.
+
+## Delete the Helm app
+
+Follow the steps below to delete the Helm app.
+
+1. Find the cluster where the Helm app to be deleted resides, click the cluster name, and enter __Cluster Details__ .
+
+2. In the left navigation bar, click __Helm Apps__ to enter the Helm app list page.
+
+    On the Helm app list page, select the Helm app you want to delete, click the __...__ operation button on the right side of the list, and select __Delete__ from the drop-down menu.
+
+3. Enter the name of the Helm app in the pop-up window to confirm, and then click the __Delete__ button.
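+
+If you also have `kubectl`/`helm` access to the cluster, you can cross-check what the interface shows from the command line. This is optional and read-only; the release name and namespace below assume the cert-manager example used on this page and should be replaced with the values you actually chose:
+
+```sh
+helm list -A                                 # all releases in the cluster
+helm status cert-manager -n cert-manager     # current state of a single release
+helm history cert-manager -n cert-manager    # its revision (install/update) history
+```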
\ No newline at end of file
diff --git a/docs/en/docs/end-user/kpanda/helm/helm-repo.md b/docs/en/docs/end-user/kpanda/helm/helm-repo.md
new file mode 100644
index 0000000000..e0c812a5c3
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/helm/helm-repo.md
@@ -0,0 +1,106 @@
+---
+MTPE: FanLin
+Date: 2024-02-29
+---
+
+# Manage Helm Repository
+
+A Helm repository is used to store and publish Charts. The Helm app module supports accessing Chart packages in repositories over HTTP(S). By default, the system has four built-in Helm repositories, shown in the table below, to meet common needs in enterprise production.
+
+| Repository | Description | Example |
+| --------- | ------------ | ------- |
+| partner | High-quality Helm charts provided by ecosystem partners | tidb |
+| system | Charts required by core system components and some advanced features. For example, insight-agent must be installed to obtain cluster monitoring information | Insight |
+| addon | Common charts for business scenarios | cert-manager |
+| community | Charts for the most popular open source components in the Kubernetes community | Istio |
+
+In addition to the above preset repositories, you can also add third-party Helm repositories yourself. This page explains how to add and update third-party Helm repositories.
+
+## Prerequisites
+
+- [Integrated the Kubernetes cluster](../clusters/integrate-cluster.md) or
+  [created the Kubernetes cluster](../clusters/create-cluster.md),
+  and you can access the UI interface of the cluster.
+
+- Created a [namespace](../namespaces/createns.md),
+  [user](../../../ghippo/access-control/user.md),
+  and granted [`NS Admin`](../permissions/permission-brief.md#ns-admin) or higher permissions to the user.
+  For details, refer to [Namespace Authorization](../permissions/cluster-ns-auth.md).
+
+- If using a private repository, you should have read and write permissions to the repository.
+
+## Introduce third-party Helm repository
+
+The following takes the public Helm repository of Kubevela as an example to show how to add and manage a third-party Helm repository.
+
+1. Find the cluster into which you want to import the third-party Helm repository, click the cluster name, and enter cluster details.
+
+    ![Clusters](../images/crd01.png)
+
+2. In the left navigation bar, click __Helm Apps__ -> __Helm Repositories__ to enter the Helm repository page.
+
+    ![Helm Repo](../images/helmrepo01.png)
+
+3. Click the __Create Repository__ button on the Helm repository page to enter the Create Repository page, and configure the relevant arguments according to the table below.
+
+    - Repository Name: Set the repository name. It can be up to 63 characters long and may only include lowercase letters,
+      numbers, and the separator __-__. It must start and end with a lowercase letter or number, for example, kubevela.
+    - Repository URL: The HTTP(S) address pointing to the target Helm repository.
+    - Skip TLS Verification: If the added Helm repository uses an HTTPS address and requires skipping TLS verification,
+      you can check this option. The default is unchecked.
+    - Authentication Method: The method used for identity verification after connecting to the repository URL.
+      For public repositories, you can select __None__. For private repositories, you need to enter a
+      username/password for identity verification.
+    - Labels: Add labels to this Helm repository. For example, key: repo4; value: Kubevela.
+    - Annotations: Add annotations to this Helm repository. For example, key: repo4; value: Kubevela.
+    - Description: Add a description for this Helm repository. For example: This is a Kubevela public Helm repository.
+
+    ![Config](../images/helmrepo02.png)
+
+4. Click __OK__ to complete the creation of the Helm repository. The page will automatically jump to the list of Helm repositories.
+
+    ![Confirm](../images/helmrepo03.png)
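+
+If you are unsure whether a repository address and its credentials are valid, you can optionally verify them from any machine with the Helm CLI before (or after) creating the repository in the UI. The repository name, URL, and credentials below are placeholders; substitute your own values:
+
+```sh
+# Add the repository locally and confirm that its index can be fetched
+helm repo add my-repo https://charts.example.com --username <username> --password <password>
+helm repo update
+helm search repo my-repo
+```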
+
+## Update the Helm repository
+
+When the address information of a Helm repository changes, you can update its address, authentication method, labels, annotations, and description.
+
+1. Find the cluster where the repository to be updated is located, click the cluster name, and enter cluster details.
+
+    ![Clusters](../images/crd01.png)
+
+2. In the left navigation bar, click __Helm Apps__ -> __Helm Repositories__ to enter the Helm repository list page.
+
+    ![Helm Repo](../images/helmrepo01.png)
+
+3. Find the Helm repository that needs to be updated on the repository list page, click the __┇__ button on the right side of the list, and click __Update__ in the pop-up menu.
+
+    ![Update](../images/helmrepo04.png)
+
+4. Make the changes on the __Update Helm Repository__ page, and click __OK__ when finished.
+
+    ![Confirm](../images/helmrepo05.png)
+
+5. Return to the Helm repository list; a message indicates that the update was successful.
+
+## Delete the Helm repository
+
+In addition to importing and updating repositories, you can also delete unnecessary repositories, including system preset repositories and third-party repositories.
+
+1. Find the cluster where the repository to be deleted is located, click the cluster name, and enter cluster details.
+
+    ![Clusters](../images/crd01.png)
+
+2. In the left navigation bar, click __Helm Apps__ -> __Helm Repositories__ to enter the Helm repository list page.
+
+    ![Helm Repo](../images/helmrepo01.png)
+
+3. Find the Helm repository that needs to be deleted on the repository list page, click the __┇__ button on the right side of the list, and click __Delete__ in the pop-up menu.
+
+    ![Delete](../images/helmrepo07.png)
+
+4. Enter the repository name to confirm, and click __Delete__ .
+
+    ![Confirm](../images/helmrepo08.png)
+
+5. Return to the list of Helm repositories; a message indicates that the deletion was successful.
diff --git a/docs/en/docs/end-user/kpanda/helm/multi-archi-helm.md b/docs/en/docs/end-user/kpanda/helm/multi-archi-helm.md
new file mode 100644
index 0000000000..63ad81538d
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/helm/multi-archi-helm.md
@@ -0,0 +1,93 @@
+---
+MTPE: windsonsea
+date: 2024-01-17
+---
+
+# Import and Upgrade Multi-Arch Helm Apps
+
+In a multi-arch cluster, it is common to use Helm charts that support multiple architectures to address deployment issues caused by architectural differences. This guide explains how to integrate single-arch Helm apps into a multi-arch deployment and how to integrate multi-arch Helm apps.
+
+## Import
+
+### Import Single-arch
+
+Prepare the offline package `addon-offline-full-package-${version}-${arch}.tar.gz`.
+
+Specify the path in the __clusterConfig.yml__ configuration file, for example:
+
+```yaml
+addonPackage:
+  path: "/home/addon-offline-full-package-v0.9.0-amd64.tar.gz"
+```
+
+Then run the import command:
+
+```shell
+~/dce5-installer cluster-create -c /home/dce5/sample/clusterConfig.yaml -m /home/dce5/sample/manifest.yaml -d -j13
+```
+
+### Integrate Multi-arch
+
+Prepare the offline package `addon-offline-full-package-${version}-${arch}.tar.gz`.
+
+Take `addon-offline-full-package-v0.9.0-arm64.tar.gz` as an example and run the import command:
+
+```shell
+~/dce5-installer import-addon -c /home/dce5/sample/clusterConfig.yaml --addon-path=/home/addon-offline-full-package-v0.9.0-arm64.tar.gz
+```
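+
+After the import, you can optionally confirm that a given addon image now carries both architectures by inspecting its manifest list from any host that can reach the registry (you may need to run `docker login` first, and add `--insecure` for registries with self-signed certificates). The image reference below is a placeholder; substitute an addon image from your own registry. A multi-arch image shows one entry per architecture under `manifests`:
+
+```shell
+docker manifest inspect <registry-address>/<namespace>/<image>:<tag>
+```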
+
+## Upgrade
+
+### Upgrade Single-arch
+
+Prepare the offline package `addon-offline-full-package-${version}-${arch}.tar.gz`.
+
+Specify the path in the __clusterConfig.yml__ configuration file, for example:
+
+```yaml
+addonPackage:
+  path: "/home/addon-offline-full-package-v0.11.0-amd64.tar.gz"
+```
+
+Then run the import command:
+
+```shell
+~/dce5-installer cluster-create -c /home/dce5/sample/clusterConfig.yaml -m /home/dce5/sample/manifest.yaml -d -j13
+```
+
+### Multi-arch Integration
+
+Prepare the offline package `addon-offline-full-package-${version}-${arch}.tar.gz`.
+
+Take `addon-offline-full-package-v0.11.0-arm64.tar.gz` as an example and run the import command:
+
+```shell
+~/dce5-installer import-addon -c /home/dce5/sample/clusterConfig.yaml --addon-path=/home/addon-offline-full-package-v0.11.0-arm64.tar.gz
+```
+
+## Notes
+
+### Disk Space
+
+The offline package is quite large and requires sufficient space for decompressing and loading the images. Otherwise, the process may be interrupted with a "no space left" error.
+
+### Retry after Failure
+
+If the multi-arch integration step fails, you need to clean up the residue before retrying:
+
+```shell
+rm -rf addon-offline-target-package
+```
+
+### Registry Space
+
+If the offline package used for integration contains registry spaces that are inconsistent with those in the previously imported offline package, an error may occur during integration because the registry spaces do not exist:
+
+![helm](https://docs.daocloud.io/daocloud-docs-images/docs/zh/docs/kpanda/images/multi-arch-helm.png)
+
+Solution: Create the registry space before the integration. For example, in the above error, creating the registry space "localhost" in advance prevents the error.
+
+### Architecture Conflict
+
+When upgrading to an addon version lower than 0.12.0, the charts-syncer in the target offline package does not check whether an image already exists before pushing it, so the upgrade process will recombine the multi-arch addon into a single architecture.
+For example, if the addon was implemented as multi-arch in v0.10, upgrading to v0.11 will overwrite the multi-arch addon with a single architecture. However, upgrading to v0.12.0 or above still maintains the multi-arch.
diff --git a/docs/en/docs/end-user/kpanda/helm/upload-helm.md b/docs/en/docs/end-user/kpanda/helm/upload-helm.md
new file mode 100644
index 0000000000..3e50d510bf
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/helm/upload-helm.md
@@ -0,0 +1,66 @@
+---
+MTPE: ModetaNiu
+Date: 2024-05-27
+hide:
+  - toc
+---
+
+# Upload Helm Charts
+
+This article explains how to upload Helm charts. See the steps below.
+
+1. Add a Helm repository; refer to [Adding a Third-Party Helm Repository](./helm-repo.md) for the procedure.
+
+2. Upload the Helm Chart to the Helm repository.
+
+    === "Upload with Client"
+
+        !!! note
+
+            This method is suitable for Harbor, ChartMuseum, and JFrog type repositories.
+
+        1. Log in to a node that can access the Helm repository, upload the Helm binary to the node,
+           and install the cm-push plugin (VPN is needed and [Git](https://git-scm.com/downloads) should be installed in advance).
+
+            Refer to the [plugin installation process](https://github.com/chartmuseum/helm-push).
+
+        2. Push the Helm Chart to the Helm repository by executing the following command:
+
+            ```shell
+            helm cm-push ${charts-dir} ${HELM_REPO_URL} --username ${username} --password ${password}
+            ```
+
+            Argument descriptions:
+
+            - `charts-dir`: The directory of the Helm Chart, or the packaged Chart (i.e., .tgz file).
+            - `HELM_REPO_URL`: The URL of the Helm repository.
+ - `username`/`password`: The username and password for the Helm repository with push permissions. + - If you want to access via HTTPS and skip the certificate verification, you can add the argument `--insecure`. + + === "Upload with Web Page" + + !!! note + + This method is only applicable to Harbor repositories. + + 1. Log into the Harbor repository, ensuring the logged-in user has permissions to push; + + 2. Go to the relevant project, select the __Helm Charts__ tab, click the __Upload__ button on the page to upload the Helm Chart. + + ![upload Helm Chart](../../images/upload-helm-01.png) + +3. Sync Remote Repository Data + + === "Manual Sync" + + By default, the cluster does not enable **Helm Repository Auto-Refresh**, so you need to perform a manual sync operation. The general steps are: + + Go to **Helm Applications** -> **Helm Repositories**, click the **┇** button on the right side of the repository list, and select **Sync Repository** to complete the repository data synchronization. + + ![Upload Helm Chart](../../images/upload-helm-02.png) + + === "Auto Sync" + + If you need to enable the Helm repository auto-sync feature, you can go to **Cluster Maintenance** -> **Cluster Settings** -> **Advanced Settings** and turn on the Helm repository auto-refresh switch. + + diff --git a/docs/en/docs/end-user/kpanda/images/access-download-cert.png b/docs/en/docs/end-user/kpanda/images/access-download-cert.png new file mode 100644 index 0000000000..bf9946e8c3 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/access-download-cert.png differ diff --git a/docs/en/docs/end-user/kpanda/images/add-global-node01.png b/docs/en/docs/end-user/kpanda/images/add-global-node01.png new file mode 100644 index 0000000000..8f172cddfc Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/add-global-node01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/add-global-node02.png b/docs/en/docs/end-user/kpanda/images/add-global-node02.png new file mode 100644 index 0000000000..e74a1c3b68 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/add-global-node02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/addnode01.png b/docs/en/docs/end-user/kpanda/images/addnode01.png new file mode 100644 index 0000000000..a84af0fa08 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/addnode01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/addnode02.png b/docs/en/docs/end-user/kpanda/images/addnode02.png new file mode 100644 index 0000000000..6f3159eb4e Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/addnode02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/addnode03.png b/docs/en/docs/end-user/kpanda/images/addnode03.png new file mode 100644 index 0000000000..1ae6dbe023 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/addnode03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/autoscaling01.png b/docs/en/docs/end-user/kpanda/images/autoscaling01.png new file mode 100644 index 0000000000..7642ebb917 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/autoscaling01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/autoscaling02.png b/docs/en/docs/end-user/kpanda/images/autoscaling02.png new file mode 100644 index 0000000000..939398f76f Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/autoscaling02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/autoscaling03.png b/docs/en/docs/end-user/kpanda/images/autoscaling03.png new file mode 100644 
index 0000000000..972b96bb04 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/autoscaling03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/autoscaling04.png b/docs/en/docs/end-user/kpanda/images/autoscaling04.png new file mode 100644 index 0000000000..f855c91047 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/autoscaling04.png differ diff --git a/docs/en/docs/end-user/kpanda/images/backup1.png b/docs/en/docs/end-user/kpanda/images/backup1.png new file mode 100644 index 0000000000..d24e5c61b8 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/backup1.png differ diff --git a/docs/en/docs/end-user/kpanda/images/backup2.png b/docs/en/docs/end-user/kpanda/images/backup2.png new file mode 100644 index 0000000000..919b89dd05 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/backup2.png differ diff --git a/docs/en/docs/end-user/kpanda/images/backup3.png b/docs/en/docs/end-user/kpanda/images/backup3.png new file mode 100644 index 0000000000..eb71286cb5 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/backup3.png differ diff --git a/docs/en/docs/end-user/kpanda/images/backup4.png b/docs/en/docs/end-user/kpanda/images/backup4.png new file mode 100644 index 0000000000..a259ad289f Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/backup4.png differ diff --git a/docs/en/docs/end-user/kpanda/images/backupd20481.png b/docs/en/docs/end-user/kpanda/images/backupd20481.png new file mode 100644 index 0000000000..66998aaebb Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/backupd20481.png differ diff --git a/docs/en/docs/end-user/kpanda/images/backupd20482.png b/docs/en/docs/end-user/kpanda/images/backupd20482.png new file mode 100644 index 0000000000..b75c0bb6c0 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/backupd20482.png differ diff --git a/docs/en/docs/end-user/kpanda/images/backupd20483.png b/docs/en/docs/end-user/kpanda/images/backupd20483.png new file mode 100644 index 0000000000..2349a396d8 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/backupd20483.png differ diff --git a/docs/en/docs/end-user/kpanda/images/backupd20484.png b/docs/en/docs/end-user/kpanda/images/backupd20484.png new file mode 100644 index 0000000000..9562f3df11 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/backupd20484.png differ diff --git a/docs/en/docs/end-user/kpanda/images/backupd20485.png b/docs/en/docs/end-user/kpanda/images/backupd20485.png new file mode 100644 index 0000000000..9d7431d7ae Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/backupd20485.png differ diff --git a/docs/en/docs/end-user/kpanda/images/backupd20486.png b/docs/en/docs/end-user/kpanda/images/backupd20486.png new file mode 100644 index 0000000000..9aab903c5c Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/backupd20486.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cluster-access03.png b/docs/en/docs/end-user/kpanda/images/cluster-access03.png new file mode 100644 index 0000000000..7cf35061b2 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cluster-access03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cluster-integrate02.png b/docs/en/docs/end-user/kpanda/images/cluster-integrate02.png new file mode 100644 index 0000000000..b09bbbf465 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cluster-integrate02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/clusterlist.png 
b/docs/en/docs/end-user/kpanda/images/clusterlist.png new file mode 100644 index 0000000000..d65b7ce95d Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/clusterlist.png differ diff --git a/docs/en/docs/end-user/kpanda/images/crd01.png b/docs/en/docs/end-user/kpanda/images/crd01.png new file mode 100644 index 0000000000..911190b464 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/crd01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/create-depolyment.png b/docs/en/docs/end-user/kpanda/images/create-depolyment.png new file mode 100644 index 0000000000..c6cc672a16 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/create-depolyment.png differ diff --git a/docs/en/docs/end-user/kpanda/images/createScale.png b/docs/en/docs/end-user/kpanda/images/createScale.png new file mode 100644 index 0000000000..3ea61648a4 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/createScale.png differ diff --git a/docs/en/docs/end-user/kpanda/images/createScale04.png b/docs/en/docs/end-user/kpanda/images/createScale04.png new file mode 100644 index 0000000000..add3efdb67 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/createScale04.png differ diff --git a/docs/en/docs/end-user/kpanda/images/createScale05.png b/docs/en/docs/end-user/kpanda/images/createScale05.png new file mode 100644 index 0000000000..5d33b2a2b6 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/createScale05.png differ diff --git a/docs/en/docs/end-user/kpanda/images/createScale06.png b/docs/en/docs/end-user/kpanda/images/createScale06.png new file mode 100644 index 0000000000..023250d41a Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/createScale06.png differ diff --git a/docs/en/docs/end-user/kpanda/images/createVpaScale.png b/docs/en/docs/end-user/kpanda/images/createVpaScale.png new file mode 100644 index 0000000000..fa2699df13 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/createVpaScale.png differ diff --git a/docs/en/docs/end-user/kpanda/images/createVpaScale01.png b/docs/en/docs/end-user/kpanda/images/createVpaScale01.png new file mode 100644 index 0000000000..00bf4c926e Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/createVpaScale01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/createVpaScale02.png b/docs/en/docs/end-user/kpanda/images/createVpaScale02.png new file mode 100644 index 0000000000..7c64108da9 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/createVpaScale02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cronjob01.png b/docs/en/docs/end-user/kpanda/images/cronjob01.png new file mode 100644 index 0000000000..8aee70ff15 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cronjob01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cronjob02.png b/docs/en/docs/end-user/kpanda/images/cronjob02.png new file mode 100644 index 0000000000..6d914f0733 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cronjob02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cronjob03.png b/docs/en/docs/end-user/kpanda/images/cronjob03.png new file mode 100644 index 0000000000..da22a19540 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cronjob03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cronjob04.png b/docs/en/docs/end-user/kpanda/images/cronjob04.png new file mode 100644 index 0000000000..372b8b9940 Binary files /dev/null and 
b/docs/en/docs/end-user/kpanda/images/cronjob04.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cronjob05.png b/docs/en/docs/end-user/kpanda/images/cronjob05.png new file mode 100644 index 0000000000..ca500ac956 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cronjob05.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cronjob06.png b/docs/en/docs/end-user/kpanda/images/cronjob06.png new file mode 100644 index 0000000000..d0feb4400b Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cronjob06.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cronjob07.png b/docs/en/docs/end-user/kpanda/images/cronjob07.png new file mode 100644 index 0000000000..94d4bd2b02 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cronjob07.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cronjob08.png b/docs/en/docs/end-user/kpanda/images/cronjob08.png new file mode 100644 index 0000000000..40550fe432 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cronjob08.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cronjob09.png b/docs/en/docs/end-user/kpanda/images/cronjob09.png new file mode 100644 index 0000000000..cd4a9605d5 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cronjob09.png differ diff --git a/docs/en/docs/end-user/kpanda/images/cronjob10.png b/docs/en/docs/end-user/kpanda/images/cronjob10.png new file mode 100644 index 0000000000..b75cb10a6e Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/cronjob10.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon006.png b/docs/en/docs/end-user/kpanda/images/daemon006.png new file mode 100644 index 0000000000..dec749813f Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon006.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon01.png b/docs/en/docs/end-user/kpanda/images/daemon01.png new file mode 100644 index 0000000000..34b12b4366 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon02.png b/docs/en/docs/end-user/kpanda/images/daemon02.png new file mode 100644 index 0000000000..c789766238 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon02Yaml.png b/docs/en/docs/end-user/kpanda/images/daemon02Yaml.png new file mode 100644 index 0000000000..3777342d2e Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon02Yaml.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon03Yaml.png b/docs/en/docs/end-user/kpanda/images/daemon03Yaml.png new file mode 100644 index 0000000000..045b7ae5a2 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon03Yaml.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon05.png b/docs/en/docs/end-user/kpanda/images/daemon05.png new file mode 100644 index 0000000000..a4d8568c3f Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon05.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon06.png b/docs/en/docs/end-user/kpanda/images/daemon06.png new file mode 100644 index 0000000000..df6024c4f6 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon06.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon12.png b/docs/en/docs/end-user/kpanda/images/daemon12.png new file mode 100644 index 0000000000..cd9bb5ebe9 Binary files /dev/null and 
b/docs/en/docs/end-user/kpanda/images/daemon12.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon14.png b/docs/en/docs/end-user/kpanda/images/daemon14.png new file mode 100644 index 0000000000..992bc11c3f Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon14.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon15.png b/docs/en/docs/end-user/kpanda/images/daemon15.png new file mode 100644 index 0000000000..7dfb36d490 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon15.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon16.png b/docs/en/docs/end-user/kpanda/images/daemon16.png new file mode 100644 index 0000000000..57a2688426 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon16.png differ diff --git a/docs/en/docs/end-user/kpanda/images/daemon17.png b/docs/en/docs/end-user/kpanda/images/daemon17.png new file mode 100644 index 0000000000..e4e4acbca1 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/daemon17.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deletenode01.png b/docs/en/docs/end-user/kpanda/images/deletenode01.png new file mode 100644 index 0000000000..a97d115ced Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deletenode01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deletenode02.png b/docs/en/docs/end-user/kpanda/images/deletenode02.png new file mode 100644 index 0000000000..42346a769c Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deletenode02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy01.png b/docs/en/docs/end-user/kpanda/images/deploy01.png new file mode 100644 index 0000000000..ec90e8c8cb Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy02.png b/docs/en/docs/end-user/kpanda/images/deploy02.png new file mode 100644 index 0000000000..3d93974188 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy02Yaml.png b/docs/en/docs/end-user/kpanda/images/deploy02Yaml.png new file mode 100644 index 0000000000..5c6ba51235 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy02Yaml.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy03Yaml.png b/docs/en/docs/end-user/kpanda/images/deploy03Yaml.png new file mode 100644 index 0000000000..1b0723da38 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy03Yaml.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy04.png b/docs/en/docs/end-user/kpanda/images/deploy04.png new file mode 100644 index 0000000000..9e7ccd51f2 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy04.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy05.png b/docs/en/docs/end-user/kpanda/images/deploy05.png new file mode 100644 index 0000000000..e809cc2cae Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy05.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy06.png b/docs/en/docs/end-user/kpanda/images/deploy06.png new file mode 100644 index 0000000000..cf269bd51e Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy06.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy07.png b/docs/en/docs/end-user/kpanda/images/deploy07.png new file mode 100644 index 0000000000..91b5a75977 Binary files /dev/null and 
b/docs/en/docs/end-user/kpanda/images/deploy07.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy08.png b/docs/en/docs/end-user/kpanda/images/deploy08.png new file mode 100644 index 0000000000..77787a8247 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy08.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy09.png b/docs/en/docs/end-user/kpanda/images/deploy09.png new file mode 100644 index 0000000000..c93abe0098 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy09.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy10.png b/docs/en/docs/end-user/kpanda/images/deploy10.png new file mode 100644 index 0000000000..93227731cb Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy10.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy12.png b/docs/en/docs/end-user/kpanda/images/deploy12.png new file mode 100644 index 0000000000..fb622bc637 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy12.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy13.png b/docs/en/docs/end-user/kpanda/images/deploy13.png new file mode 100644 index 0000000000..52fa4a469a Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy13.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy14.png b/docs/en/docs/end-user/kpanda/images/deploy14.png new file mode 100644 index 0000000000..b08f31ff7d Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy14.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy15.png b/docs/en/docs/end-user/kpanda/images/deploy15.png new file mode 100644 index 0000000000..ae36dc56cb Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy15.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy16.png b/docs/en/docs/end-user/kpanda/images/deploy16.png new file mode 100644 index 0000000000..d21ed6abb6 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy16.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy17.png b/docs/en/docs/end-user/kpanda/images/deploy17.png new file mode 100644 index 0000000000..90ecd59621 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy17.png differ diff --git a/docs/en/docs/end-user/kpanda/images/deploy18.png b/docs/en/docs/end-user/kpanda/images/deploy18.png new file mode 100644 index 0000000000..3afaf85e73 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/deploy18.png differ diff --git a/docs/en/docs/end-user/kpanda/images/etcd-get01.png b/docs/en/docs/end-user/kpanda/images/etcd-get01.png new file mode 100644 index 0000000000..db3a66e2d5 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/etcd-get01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/etcd01.png b/docs/en/docs/end-user/kpanda/images/etcd01.png new file mode 100644 index 0000000000..4c6fdafce8 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/etcd01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/exclusive01.png b/docs/en/docs/end-user/kpanda/images/exclusive01.png new file mode 100644 index 0000000000..f7bc3985ac Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/exclusive01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/exclusive02.png b/docs/en/docs/end-user/kpanda/images/exclusive02.png new file mode 100644 index 0000000000..9407b885b4 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/exclusive02.png differ diff 
--git a/docs/en/docs/end-user/kpanda/images/exclusive03.png b/docs/en/docs/end-user/kpanda/images/exclusive03.png new file mode 100644 index 0000000000..2cd1bf043a Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/exclusive03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/exclusive04.png b/docs/en/docs/end-user/kpanda/images/exclusive04.png new file mode 100644 index 0000000000..6651cddb67 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/exclusive04.png differ diff --git a/docs/en/docs/end-user/kpanda/images/exclusive05.png b/docs/en/docs/end-user/kpanda/images/exclusive05.png new file mode 100644 index 0000000000..07ab612bf5 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/exclusive05.png differ diff --git a/docs/en/docs/end-user/kpanda/images/exclusive06.png b/docs/en/docs/end-user/kpanda/images/exclusive06.png new file mode 100644 index 0000000000..9bd3e61399 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/exclusive06.png differ diff --git a/docs/en/docs/end-user/kpanda/images/exclusive07.png b/docs/en/docs/end-user/kpanda/images/exclusive07.png new file mode 100644 index 0000000000..efba2da803 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/exclusive07.png differ diff --git a/docs/en/docs/end-user/kpanda/images/exclusive08.png b/docs/en/docs/end-user/kpanda/images/exclusive08.png new file mode 100644 index 0000000000..9f114e2251 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/exclusive08.png differ diff --git a/docs/en/docs/end-user/kpanda/images/faq01.png b/docs/en/docs/end-user/kpanda/images/faq01.png new file mode 100644 index 0000000000..225d370a32 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/faq01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/faq02.png b/docs/en/docs/end-user/kpanda/images/faq02.png new file mode 100644 index 0000000000..f26153570d Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/faq02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/gpu_mig01.jpg b/docs/en/docs/end-user/kpanda/images/gpu_mig01.jpg new file mode 100644 index 0000000000..ab2858b5bc Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/gpu_mig01.jpg differ diff --git a/docs/en/docs/end-user/kpanda/images/gpu_mig02.jpg b/docs/en/docs/end-user/kpanda/images/gpu_mig02.jpg new file mode 100644 index 0000000000..ff54af6ba7 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/gpu_mig02.jpg differ diff --git a/docs/en/docs/end-user/kpanda/images/gpu_mig03.png b/docs/en/docs/end-user/kpanda/images/gpu_mig03.png new file mode 100644 index 0000000000..1ad5d4dfcf Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/gpu_mig03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/gpu_mig04.png b/docs/en/docs/end-user/kpanda/images/gpu_mig04.png new file mode 100644 index 0000000000..a01cdabcf2 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/gpu_mig04.png differ diff --git a/docs/en/docs/end-user/kpanda/images/helm14.png b/docs/en/docs/end-user/kpanda/images/helm14.png new file mode 100644 index 0000000000..e417320980 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/helm14.png differ diff --git a/docs/en/docs/end-user/kpanda/images/helmrepo01.png b/docs/en/docs/end-user/kpanda/images/helmrepo01.png new file mode 100644 index 0000000000..bde8cff1e9 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/helmrepo01.png differ diff --git 
a/docs/en/docs/end-user/kpanda/images/helmrepo02.png b/docs/en/docs/end-user/kpanda/images/helmrepo02.png new file mode 100644 index 0000000000..20c792162b Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/helmrepo02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/helmrepo03.png b/docs/en/docs/end-user/kpanda/images/helmrepo03.png new file mode 100644 index 0000000000..6b6f464b3a Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/helmrepo03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/helmrepo04.png b/docs/en/docs/end-user/kpanda/images/helmrepo04.png new file mode 100644 index 0000000000..692ca29d88 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/helmrepo04.png differ diff --git a/docs/en/docs/end-user/kpanda/images/helmrepo05.png b/docs/en/docs/end-user/kpanda/images/helmrepo05.png new file mode 100644 index 0000000000..ed8ff0c08d Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/helmrepo05.png differ diff --git a/docs/en/docs/end-user/kpanda/images/helmrepo07.png b/docs/en/docs/end-user/kpanda/images/helmrepo07.png new file mode 100644 index 0000000000..e37311a646 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/helmrepo07.png differ diff --git a/docs/en/docs/end-user/kpanda/images/helmrepo08.png b/docs/en/docs/end-user/kpanda/images/helmrepo08.png new file mode 100644 index 0000000000..93c3905681 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/helmrepo08.png differ diff --git a/docs/en/docs/end-user/kpanda/images/imagequest.png b/docs/en/docs/end-user/kpanda/images/imagequest.png new file mode 100644 index 0000000000..8a9ccc8bec Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/imagequest.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ingress01.png b/docs/en/docs/end-user/kpanda/images/ingress01.png new file mode 100644 index 0000000000..f8df33ad00 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ingress01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ingress02.png b/docs/en/docs/end-user/kpanda/images/ingress02.png new file mode 100644 index 0000000000..7118e8cf57 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ingress02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ingress03.png b/docs/en/docs/end-user/kpanda/images/ingress03.png new file mode 100644 index 0000000000..151adc860f Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ingress03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/inspection-home.png b/docs/en/docs/end-user/kpanda/images/inspection-home.png new file mode 100644 index 0000000000..4d519c8b91 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/inspection-home.png differ diff --git a/docs/en/docs/end-user/kpanda/images/inspection-list-more.png b/docs/en/docs/end-user/kpanda/images/inspection-list-more.png new file mode 100644 index 0000000000..c398c10fd7 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/inspection-list-more.png differ diff --git a/docs/en/docs/end-user/kpanda/images/inspection-report-01.png b/docs/en/docs/end-user/kpanda/images/inspection-report-01.png new file mode 100644 index 0000000000..fbf42190ca Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/inspection-report-01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/inspection-report-02.png b/docs/en/docs/end-user/kpanda/images/inspection-report-02.png new file mode 100644 index 0000000000..f882c8f06c Binary 
files /dev/null and b/docs/en/docs/end-user/kpanda/images/inspection-report-02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/inspection-report-03.png b/docs/en/docs/end-user/kpanda/images/inspection-report-03.png new file mode 100644 index 0000000000..a6ab83af18 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/inspection-report-03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/inspection-start-alone.png b/docs/en/docs/end-user/kpanda/images/inspection-start-alone.png new file mode 100644 index 0000000000..46f6e4ecdf Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/inspection-start-alone.png differ diff --git a/docs/en/docs/end-user/kpanda/images/inspection-start.png b/docs/en/docs/end-user/kpanda/images/inspection-start.png new file mode 100644 index 0000000000..5fe79f37fe Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/inspection-start.png differ diff --git a/docs/en/docs/end-user/kpanda/images/installcronhpa.png b/docs/en/docs/end-user/kpanda/images/installcronhpa.png new file mode 100644 index 0000000000..f599efc7b3 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/installcronhpa.png differ diff --git a/docs/en/docs/end-user/kpanda/images/installcronhpa1.png b/docs/en/docs/end-user/kpanda/images/installcronhpa1.png new file mode 100644 index 0000000000..80e7400906 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/installcronhpa1.png differ diff --git a/docs/en/docs/end-user/kpanda/images/installcronhpa2.png b/docs/en/docs/end-user/kpanda/images/installcronhpa2.png new file mode 100644 index 0000000000..5724365498 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/installcronhpa2.png differ diff --git a/docs/en/docs/end-user/kpanda/images/installvpa.png b/docs/en/docs/end-user/kpanda/images/installvpa.png new file mode 100644 index 0000000000..5730007a06 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/installvpa.png differ diff --git a/docs/en/docs/end-user/kpanda/images/installvpa1.png b/docs/en/docs/end-user/kpanda/images/installvpa1.png new file mode 100644 index 0000000000..b195312175 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/installvpa1.png differ diff --git a/docs/en/docs/end-user/kpanda/images/installvpa2.png b/docs/en/docs/end-user/kpanda/images/installvpa2.png new file mode 100644 index 0000000000..fa44d57f63 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/installvpa2.png differ diff --git a/docs/en/docs/end-user/kpanda/images/installvpa3.png b/docs/en/docs/end-user/kpanda/images/installvpa3.png new file mode 100644 index 0000000000..e844f92eb7 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/installvpa3.png differ diff --git a/docs/en/docs/end-user/kpanda/images/job01.png b/docs/en/docs/end-user/kpanda/images/job01.png new file mode 100644 index 0000000000..35c24d5022 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/job01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/job02-1.png b/docs/en/docs/end-user/kpanda/images/job02-1.png new file mode 100644 index 0000000000..22ae7510a6 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/job02-1.png differ diff --git a/docs/en/docs/end-user/kpanda/images/job02.png b/docs/en/docs/end-user/kpanda/images/job02.png new file mode 100644 index 0000000000..19912828f2 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/job02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/job03.png 
b/docs/en/docs/end-user/kpanda/images/job03.png new file mode 100644 index 0000000000..0835f18f71 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/job03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/job04.png b/docs/en/docs/end-user/kpanda/images/job04.png new file mode 100644 index 0000000000..1ca56a0bea Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/job04.png differ diff --git a/docs/en/docs/end-user/kpanda/images/job06.png b/docs/en/docs/end-user/kpanda/images/job06.png new file mode 100644 index 0000000000..9121fb9ec8 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/job06.png differ diff --git a/docs/en/docs/end-user/kpanda/images/job07.png b/docs/en/docs/end-user/kpanda/images/job07.png new file mode 100644 index 0000000000..6487c15b01 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/job07.png differ diff --git a/docs/en/docs/end-user/kpanda/images/job08.png b/docs/en/docs/end-user/kpanda/images/job08.png new file mode 100644 index 0000000000..8b616fcf66 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/job08.png differ diff --git a/docs/en/docs/end-user/kpanda/images/job09.png b/docs/en/docs/end-user/kpanda/images/job09.png new file mode 100644 index 0000000000..f929a25eac Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/job09.png differ diff --git a/docs/en/docs/end-user/kpanda/images/knative-install-1.png b/docs/en/docs/end-user/kpanda/images/knative-install-1.png new file mode 100644 index 0000000000..5d9fd3fd3b Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/knative-install-1.png differ diff --git a/docs/en/docs/end-user/kpanda/images/knative-install-2.png b/docs/en/docs/end-user/kpanda/images/knative-install-2.png new file mode 100644 index 0000000000..238abee36f Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/knative-install-2.png differ diff --git a/docs/en/docs/end-user/kpanda/images/knative-install-3.png b/docs/en/docs/end-user/kpanda/images/knative-install-3.png new file mode 100644 index 0000000000..7dc5084298 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/knative-install-3.png differ diff --git a/docs/en/docs/end-user/kpanda/images/knative-request-flow.png b/docs/en/docs/end-user/kpanda/images/knative-request-flow.png new file mode 100644 index 0000000000..dfa6eb40e2 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/knative-request-flow.png differ diff --git a/docs/en/docs/end-user/kpanda/images/labels01.png b/docs/en/docs/end-user/kpanda/images/labels01.png new file mode 100644 index 0000000000..2372d15fa0 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/labels01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/labels02.png b/docs/en/docs/end-user/kpanda/images/labels02.png new file mode 100644 index 0000000000..1edf7f1410 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/labels02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/node-details01.png b/docs/en/docs/end-user/kpanda/images/node-details01.png new file mode 100644 index 0000000000..4c69151a7c Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/node-details01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/node-details02.png b/docs/en/docs/end-user/kpanda/images/node-details02.png new file mode 100644 index 0000000000..30e92ca1b6 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/node-details02.png differ diff --git 
a/docs/en/docs/end-user/kpanda/images/node-details03.png b/docs/en/docs/end-user/kpanda/images/node-details03.png new file mode 100644 index 0000000000..ae52531df0 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/node-details03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/note.svg b/docs/en/docs/end-user/kpanda/images/note.svg new file mode 100644 index 0000000000..5e473aeb7d --- /dev/null +++ b/docs/en/docs/end-user/kpanda/images/note.svg @@ -0,0 +1,18 @@ + + + + Icon/16/Prompt备份@0.5x + Created with Sketch. + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/images/ns00.png b/docs/en/docs/end-user/kpanda/images/ns00.png new file mode 100644 index 0000000000..5e5e193846 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ns00.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ns01.png b/docs/en/docs/end-user/kpanda/images/ns01.png new file mode 100644 index 0000000000..3595b46829 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ns01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ns02.png b/docs/en/docs/end-user/kpanda/images/ns02.png new file mode 100644 index 0000000000..732b9fef75 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ns02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ns03.png b/docs/en/docs/end-user/kpanda/images/ns03.png new file mode 100644 index 0000000000..1be3a0f923 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ns03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ns04.png b/docs/en/docs/end-user/kpanda/images/ns04.png new file mode 100644 index 0000000000..21e3250f51 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ns04.png differ diff --git a/docs/en/docs/end-user/kpanda/images/permisson02.png b/docs/en/docs/end-user/kpanda/images/permisson02.png new file mode 100644 index 0000000000..90a57439f8 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/permisson02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ps01.png b/docs/en/docs/end-user/kpanda/images/ps01.png new file mode 100644 index 0000000000..c73ed64746 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ps01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ps02.png b/docs/en/docs/end-user/kpanda/images/ps02.png new file mode 100644 index 0000000000..f2572bfa67 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ps02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ps03.png b/docs/en/docs/end-user/kpanda/images/ps03.png new file mode 100644 index 0000000000..cdc0ee253a Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ps03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ps04.png b/docs/en/docs/end-user/kpanda/images/ps04.png new file mode 100644 index 0000000000..072f42560b Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ps04.png differ diff --git a/docs/en/docs/end-user/kpanda/images/ps05.png b/docs/en/docs/end-user/kpanda/images/ps05.png new file mode 100644 index 0000000000..44f0432fb4 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/ps05.png differ diff --git a/docs/en/docs/end-user/kpanda/images/schedule01.png b/docs/en/docs/end-user/kpanda/images/schedule01.png new file mode 100644 index 0000000000..2d39b1f4ba Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/schedule01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/schedule02.png 
b/docs/en/docs/end-user/kpanda/images/schedule02.png new file mode 100644 index 0000000000..447356e0b1 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/schedule02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/schedule03.png b/docs/en/docs/end-user/kpanda/images/schedule03.png new file mode 100644 index 0000000000..c6c6254bb7 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/schedule03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/settings01.png b/docs/en/docs/end-user/kpanda/images/settings01.png new file mode 100644 index 0000000000..094a88379b Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/settings01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/settings02.png b/docs/en/docs/end-user/kpanda/images/settings02.png new file mode 100644 index 0000000000..b4925e730c Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/settings02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state01.png b/docs/en/docs/end-user/kpanda/images/state01.png new file mode 100644 index 0000000000..a156720198 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state02.png b/docs/en/docs/end-user/kpanda/images/state02.png new file mode 100644 index 0000000000..1a3ecd43c6 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state02Yaml.png b/docs/en/docs/end-user/kpanda/images/state02Yaml.png new file mode 100644 index 0000000000..562a3e3273 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state02Yaml.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state03yaml.png b/docs/en/docs/end-user/kpanda/images/state03yaml.png new file mode 100644 index 0000000000..7758d16660 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state03yaml.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state05.png b/docs/en/docs/end-user/kpanda/images/state05.png new file mode 100644 index 0000000000..8fcd4c8e1b Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state05.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state06.png b/docs/en/docs/end-user/kpanda/images/state06.png new file mode 100644 index 0000000000..f2e70c3195 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state06.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state09.png b/docs/en/docs/end-user/kpanda/images/state09.png new file mode 100644 index 0000000000..c9ab34d334 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state09.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state10.png b/docs/en/docs/end-user/kpanda/images/state10.png new file mode 100644 index 0000000000..09bf045181 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state10.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state11.png b/docs/en/docs/end-user/kpanda/images/state11.png new file mode 100644 index 0000000000..078b8b0d23 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state11.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state12.png b/docs/en/docs/end-user/kpanda/images/state12.png new file mode 100644 index 0000000000..4fd887ce7c Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state12.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state14.png b/docs/en/docs/end-user/kpanda/images/state14.png new file mode 
100644 index 0000000000..ffc9ed6036 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state14.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state15.png b/docs/en/docs/end-user/kpanda/images/state15.png new file mode 100644 index 0000000000..8434900762 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state15.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state16.png b/docs/en/docs/end-user/kpanda/images/state16.png new file mode 100644 index 0000000000..dcd0a04fab Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state16.png differ diff --git a/docs/en/docs/end-user/kpanda/images/state17.png b/docs/en/docs/end-user/kpanda/images/state17.png new file mode 100644 index 0000000000..e99b0d4b0c Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/state17.png differ diff --git a/docs/en/docs/end-user/kpanda/images/taint01.png b/docs/en/docs/end-user/kpanda/images/taint01.png new file mode 100644 index 0000000000..5acdbf4adc Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/taint01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/taint02.png b/docs/en/docs/end-user/kpanda/images/taint02.png new file mode 100644 index 0000000000..2b8a89d16b Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/taint02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/taint03.png b/docs/en/docs/end-user/kpanda/images/taint03.png new file mode 100644 index 0000000000..6107b634fd Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/taint03.png differ diff --git a/docs/en/docs/end-user/kpanda/images/update-kpanda.png b/docs/en/docs/end-user/kpanda/images/update-kpanda.png new file mode 100644 index 0000000000..ddc50c44b7 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/update-kpanda.png differ diff --git a/docs/en/docs/end-user/kpanda/images/upload-helm-01.png b/docs/en/docs/end-user/kpanda/images/upload-helm-01.png new file mode 100644 index 0000000000..40ad73490d Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/upload-helm-01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/upload-helm-02.png b/docs/en/docs/end-user/kpanda/images/upload-helm-02.png new file mode 100644 index 0000000000..7039a6306b Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/upload-helm-02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/volcano-01.png b/docs/en/docs/end-user/kpanda/images/volcano-01.png new file mode 100644 index 0000000000..565baa0b00 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/volcano-01.png differ diff --git a/docs/en/docs/end-user/kpanda/images/volcano-02.png b/docs/en/docs/end-user/kpanda/images/volcano-02.png new file mode 100644 index 0000000000..03f674e0d0 Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/volcano-02.png differ diff --git a/docs/en/docs/end-user/kpanda/images/volcano-03.png b/docs/en/docs/end-user/kpanda/images/volcano-03.png new file mode 100644 index 0000000000..e91654812d Binary files /dev/null and b/docs/en/docs/end-user/kpanda/images/volcano-03.png differ diff --git a/docs/en/docs/end-user/kpanda/inspect/config.md b/docs/en/docs/end-user/kpanda/inspect/config.md new file mode 100644 index 0000000000..9cbe61530d --- /dev/null +++ b/docs/en/docs/end-user/kpanda/inspect/config.md @@ -0,0 +1,45 @@ +--- +hide: + - toc +--- + +# Creating Inspection Configuration + +AI platform Container Management module provides cluster inspection functionality, which supports 
inspection at the cluster, node, and pod levels. + +- Cluster level: Check the running status of system components in the cluster, including cluster status, resource usage, and specific inspection items for control nodes such as __kube-apiserver__ and __etcd__ . +- Node level: Includes common inspection items for both control nodes and worker nodes, such as node resource usage, handle count, PID status, and network status. +- Pod level: Check the CPU and memory usage, running status, PV and PVC status of Pods. + +Here's how to create an inspection configuration. + +1. Click __Cluster Inspection__ in the left navigation bar. + + ![nav](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/inspect01.png) + +2. On the right side of the page, click __Inspection Configuration__ . + + ![create](../images/inspection-home.png) + +3. Fill in the inspection configuration based on the following instructions, then click __OK__ at the bottom of the page. + + - Cluster: Select the clusters that you want to inspect from the dropdown list. **If you select multiple clusters, multiple inspection configurations will be automatically generated (only the inspected clusters are inconsistent, all other configurations are identical).** + - Scheduled Inspection: When enabled, it allows for regular automatic execution of cluster inspections based on a pre-set inspection frequency. + - Inspection Frequency: Set the interval for automatic inspections, e.g., every Tuesday at 10 AM. It supports custom CronExpressios, refer to [Cron Schedule Syntax](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax) for more information. + - Number of Inspection Records to Retain: Specifies the maximum number of inspection records to be retained, including all inspection records for each cluster. + - Parameter Configuration: The parameter configuration is divided into three parts: cluster level, node level, and pod level. You can enable or disable specific inspection items based on your requirements. + + ![basic](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/inspect03.png) + +After creating the inspection configuration, it will be automatically displayed in the inspection configuration list. Click the more options button on the right of the configuration to immediately perform an inspection, modify the inspection configuration or delete the inspection configuration and reports. + +- Click __Inspection__ to perform an inspection once based on the configuration. +- Click __Inspection Configuration__ to modify the inspection configuration. +- Click __Delete__ to delete the inspection configuration and reports. + + ![basic](../images/inspection-list-more.png) + +!!! note + + - After creating the inspection configuration, if the __Scheduled Inspection__ configuration is enabled, inspections will be automatically executed at the specified time. + - If __Scheduled Inspection__ configuration is not enabled, you need to manually [trigger the inspection](inspect.md). \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/inspect/index.md b/docs/en/docs/end-user/kpanda/inspect/index.md new file mode 100644 index 0000000000..cafc067de2 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/inspect/index.md @@ -0,0 +1,27 @@ +--- +hide: + - toc +--- + +# Cluster Inspection + +Cluster inspection allows administrators to regularly or ad-hoc check the overall health of the cluster, +giving them proactive control over ensuring cluster security. 
With a well-planned inspection schedule, +this proactive cluster check allows administrators to monitor the cluster status at any time and address +potential issues in advance. It eliminates the previous dilemma of passive troubleshooting during failures, +enabling proactive monitoring and prevention. + +The cluster inspection feature provided by AI platform's container management module supports custom inspection +items at the cluster, node, and pod levels. After the inspection is completed, +it automatically generates visual inspection reports. + +- Cluster Level: Checks the running status of system components in the cluster, including cluster status, + resource usage, and specific inspection items for control nodes, such as the status of + __kube-apiserver__ and __etcd__ . +- Node Level: Includes common inspection items for both control nodes and worker nodes, + such as node resource usage, handle counts, PID status, and network status. +- pod Level: Checks the CPU and memory usage, running status of pods, + and the status of PV (Persistent Volume) and PVC (PersistentVolumeClaim). + +For information on security inspections or executing security-related inspections, +refer to the [supported security scan types](../security/index.md) in AI platform. diff --git a/docs/en/docs/end-user/kpanda/inspect/inspect.md b/docs/en/docs/end-user/kpanda/inspect/inspect.md new file mode 100644 index 0000000000..dc6fcccd93 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/inspect/inspect.md @@ -0,0 +1,40 @@ +--- +hide: + - toc +--- + +# Start Cluster Inspection + +After creating an inspection configuration, if the __Scheduled Inspection__ configuration is enabled, inspections will be automatically executed at the specified time. If the __Scheduled Inspection__ configuration is not enabled, you need to manually trigger the inspection. + +This page explains how to manually perform a cluster inspection. + +## Prerequisites + +- [Integrate](../clusters/integrate-cluster.md) or [create](../clusters/create-cluster.md) a cluster in the Container Management module. +- Create an [inspection configuration](config.md). +- The selected cluster is in the __Running__ state and the insight component has been [installed in the cluster](../../../insight/quickstart/install/install-agent.md). + +## Steps + +When performing an inspection, you can choose to inspect multiple clusters in batches or perform a separate inspection for a specific cluster. + +=== "Batch Inspection" + + 1. Click __Cluster Inspection__ in the top-level navigation bar of the Container Management module, then click __Inspection__ on the right side of the page. + + ![start](../images/inspection-start.png) + + 2. Select the clusters you want to inspect, then click __OK__ at the bottom of the page. + + - If you choose to inspect multiple clusters at the same time, the system will perform inspections based on different inspection configurations for each cluster. + - If no inspection configuration is set for a cluster, the system will use the default configuration. + + ![start](https://docs.daocloud.io/daocloud-docs-images/docs/en/docs/kpanda/images/inspect05.png) + +=== "Individual Inspection" + + 1. Go to the Cluster Inspection page. + 2. Click the more options button ( __┇__ ) on the right of the corresponding inspection configuration, then select __Inspection__ from the popup menu. 
+ + ![basic](../images/inspection-start-alone.png) diff --git a/docs/en/docs/end-user/kpanda/inspect/report.md b/docs/en/docs/end-user/kpanda/inspect/report.md new file mode 100644 index 0000000000..49dc4438d8 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/inspect/report.md @@ -0,0 +1,32 @@ +--- +hide: + - toc +--- + +# Check Inspection Reports + +After the [inspection execution](inspect.md) is completed, you can view the inspection records and detailed inspection reports. + +## Prerequisites + +- [Create an inspection configuration](config.md). +- Perform at least one inspection [execution](inspect.md). + +## Steps + +1. Go to the Cluster Inspection page and click the name of the target inspection cluster. + + ![start](../images/inspection-report-01.png) + +2. Click the name of the inspection record you want to view. + + - Each inspection execution generates an inspection record. + - When the number of inspection records exceeds the maximum retention specified in the [inspection configuration](config.md), the earliest record will be deleted starting from the execution time. + + ![start](../images/inspection-report-02.png) + +3. View the detailed information of the inspection, which may include an overview of cluster resources and the running status of system components. + + You can download the inspection report or delete the inspection report from the top right corner of the page. + + ![start](../images/inspection-report-03.png) diff --git a/docs/en/docs/end-user/kpanda/namespaces/createns.md b/docs/en/docs/end-user/kpanda/namespaces/createns.md new file mode 100644 index 0000000000..4839e6bf75 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/namespaces/createns.md @@ -0,0 +1,57 @@ +# Namespaces + +Namespaces are an abstraction used in Kubernetes for resource isolation. A cluster can contain multiple namespaces with different names, and the resources in each namespace are isolated from each other. For a detailed introduction to namespaces, refer to [Namespaces](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/). + +This page will introduce the related operations of the namespace. + +## Create a namespace + +Supports easy creation of namespaces through forms, and quick creation of namespaces by writing or importing YAML files. + +!!! note + + - Before creating a namespace, you need to [Integrate a Kubernetes cluster](../clusters/integrate-cluster.md) or [Create a Kubernetes cluster](../clusters/create-cluster.md) in the container management module. + - The default namespace __default__ is usually automatically generated after cluster initialization. But for production clusters, for ease of management, it is recommended to create other namespaces instead of using the __default__ namespace directly. + +### Create with form + +1. On the cluster list page, click the name of the target cluster. + + ![Cluster Details](../images/crd01.png) + +2. Click __Namespace__ in the left navigation bar, then click the __Create__ button on the right side of the page. + + ![Click to Create](../images/ns01.png) + +3. Fill in the name of the namespace, configure the workspace and labels (optional), and then click __OK__. + + !!! info + + - After binding a namespace to a workspace, the resources of that namespace will be shared with the bound workspace. For a detailed explanation of workspaces, refer to [Workspaces and Hierarchies](../../../ghippo/workspace/workspace.md). + + - After the namespace is created, you can still bind/unbind the workspace. 
+ + ![Fill the Form](../images/ns02.png) + +4. Click __OK__ to complete the creation of the namespace. On the right side of the namespace list, click __┇__ to select update, bind/unbind workspace, quota management, delete, and more from the pop-up menu. + + ![More Operations](../images/ns03.png) + + +### Create from YAML + +1. On the __Cluster List__ page, click the name of the target cluster. + + ![Cluster Details](../images/crd01.png) + +2. Click __Namespace__ in the left navigation bar, then click the __YAML Create__ button on the right side of the page. + + ![Click to Create](../images/ns00.png) + +3. Enter or paste the prepared YAML content, or directly import an existing YAML file locally. + + > After entering the YAML content, click __Download__ to save the YAML file locally. + + ![Click to Create](../images/ns04.png) + +4. Finally, click __OK__ in the lower right corner of the pop-up box. diff --git a/docs/en/docs/end-user/kpanda/namespaces/exclusive.md b/docs/en/docs/end-user/kpanda/namespaces/exclusive.md new file mode 100644 index 0000000000..5a362ab156 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/namespaces/exclusive.md @@ -0,0 +1,208 @@ +--- +MTPE: FanLin +Date: 2024-02-27 +--- + +# Namespace Exclusive Nodes + +Namespace exclusive nodes in a Kubernetes cluster allow a specific namespace to have exclusive access to one or more node's CPU, memory, and other resources through taints and tolerations. Once exclusive nodes are configured for a specific namespace, applications and services from other namespaces cannot run on the exclusive nodes. Using exclusive nodes allows important applications to have exclusive access to some computing resources, achieving physical isolation from other applications. + +!!! note + + Applications and services running on a node before it is set to be an exclusive node will not be affected and will continue to run normally on that node. Only when these Pods are deleted or rebuilt will they be scheduled to other non-exclusive nodes. + +## Preparation + +Check whether the kube-apiserver of the current cluster has enabled the __PodNodeSelector__ and __PodTolerationRestriction__ admission controllers. + +The use of namespace exclusive nodes requires users to enable the __PodNodeSelector__ and __PodTolerationRestriction__ admission controllers on the kube-apiserver. For more information about admission controllers, refer to [Kubernetes Admission Controllers Reference](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/). + +You can go to any Master node in the current cluster to check whether these two features are enabled in the __kube-apiserver.yaml__ file, or you can execute the following command on the Master node for a quick check: + +```bash +[root@g-master1 ~]# cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep enable-admission-plugins + +# The expected output is as follows: +- --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction +``` + +## Enable Namespace Exclusive Nodes on Global Cluster + +Since the Global cluster runs platform basic components such as kpanda, ghippo, and insight, enabling namespace exclusive nodes on Global may cause system components to not be scheduled to the exclusive nodes when they restart, affecting the overall high availability of the system. Therefore, **we generally do not recommend users to enable the namespace exclusive node feature on the Global cluster**. 
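For background, the exclusive-node mechanism relies on namespace-level annotations consumed by the __PodNodeSelector__ and __PodTolerationRestriction__ admission controllers, combined with taints and labels on the chosen nodes. The sketch below only illustrates the idea with hypothetical names (namespace __ns-01__ , node __worker-01__ ); the platform UI performs the equivalent steps for you, so treat it as reference rather than a required procedure.

```bash
# Taint the node so that Pods without a matching toleration cannot be scheduled onto it
kubectl taint node worker-01 ExclusiveNamespace=ns-01:NoSchedule

# Label the node and let the PodNodeSelector admission controller pin Pods of ns-01 to it
kubectl label node worker-01 exclusive-ns=ns-01
kubectl annotate ns ns-01 scheduler.alpha.kubernetes.io/node-selector="exclusive-ns=ns-01"

# Let the PodTolerationRestriction admission controller inject the matching toleration by default
kubectl annotate ns ns-01 scheduler.alpha.kubernetes.io/defaultTolerations='[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]'
```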
+ +If you do need to enable namespace exclusive nodes on the Global cluster, please follow the steps below: + +1. Enable the __PodNodeSelector__ and __PodTolerationRestriction__ admission controllers for the kube-apiserver of the Global cluster + + !!! note + + If the cluster has already enabled the above two admission controllers, please skip this step and go directly to configure system component tolerations. + + Go to any Master node in the current cluster to modify the __kube-apiserver.yaml__ configuration file, or execute the following command on the Master node for configuration: + + ```bash + [root@g-master1 ~]# vi /etc/kubernetes/manifests/kube-apiserver.yaml + + # The expected output is as follows: + apiVersion: v1 + kind: Pod + metadata: + ...... + spec: + containers: + - command: + - kube-apiserver + ...... + - --default-not-ready-toleration-seconds=300 + - --default-unreachable-toleration-seconds=300 + - --enable-admission-plugins=NodeRestriction #List of enabled admission controllers + - --enable-aggregator-routing=False + - --enable-bootstrap-token-auth=true + - --endpoint-reconciler-type=lease + - --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt + ...... + ``` + + Find the __--enable-admission-plugins__ parameter and add the __PodNodeSelector__ and __PodTolerationRestriction__ admission controllers (separated by commas). Refer to the following: + + ```bash + # Add __ ,PodNodeSelector,PodTolerationRestriction__ + - --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction + ``` + +2. Add toleration annotations to the namespace where the platform components are located + + After enabling the admission controllers, you need to add toleration annotations to the namespace where the platform components are located to ensure the high availability of the platform components. + + The system component namespaces for AI platform are as follows: + + | Namespace | System Components Included | + | ------------------- | ------------------------------------------------------------ | + | kpanda-system | kpanda | + | hwameiStor-system | hwameiStor | + | istio-system | istio | + | metallb-system | metallb | + | cert-manager-system | cert-manager | + | contour-system | contour | + | kubean-system | kubean | + | ghippo-system | ghippo | + | kcoral-system | kcoral | + | kcollie-system | kcollie | + | insight-system | insight, insight-agent: | + | ipavo-system | ipavo | + | kairship-system | kairship | + | karmada-system | karmada | + | amamba-system | amamba, jenkins | + | skoala-system | skoala | + | mspider-system | mspider | + | mcamel-system | mcamel-rabbitmq, mcamel-elasticsearch, mcamel-mysql, mcamel-redis, mcamel-kafka, mcamel-minio, mcamel-postgresql | + | spidernet-system | spidernet | + | kangaroo-system | kangaroo | + | gmagpie-system | gmagpie | + | dowl-system | dowl | + + Check whether there are the above namespaces in the current cluster, execute the following command, and add the annotation: `scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "ExclusiveNamespace"}]'` for each namespace. + + ```bash + kubectl annotate ns scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": + "NoSchedule", "key": "ExclusiveNamespace"}]' + ``` + Please make sure to replace `` with the name of the platform namespace you want to add the annotation to. + +3. 
Use the interface to set exclusive nodes for the namespace + + After confirming that the __PodNodeSelector__ and __PodTolerationRestriction__ admission controllers on the cluster API server have been enabled, please follow the steps below to use the AI platform UI management interface to set exclusive nodes for the namespace. + + 1. Click the cluster name in the cluster list page, then click __Namespace__ in the left navigation bar. + + ![Namespace](../../images/exclusive01.png) + + 2. Click the namespace name, then click the __Exclusive Node__ tab, and click __Add Node__ on the bottom right. + + ![Add Node](../../images/exclusive02.png) + + 3. Select which nodes you want to be exclusive to this namespace on the left side of the page. On the right side, you can clear or delete a selected node. Finally, click __OK__ at the bottom. + + ![Confirm](../../images/exclusive03.png) + + 4. You can view the current exclusive nodes for this namespace in the list. You can choose to __Stop Exclusivity__ on the right side of the node. + + > After cancelling exclusivity, Pods from other namespaces can also be scheduled to this node. + + ![Cancel Exclusivity](../../images/exclusive04.png) + +## Enable Namespace Exclusive Nodes on Non-Global Clusters + +To enable namespace exclusive nodes on non-Global clusters, please follow the steps below: + +1. Enable the __PodNodeSelector__ and __PodTolerationRestriction__ admission controllers for the kube-apiserver of the current cluster + + !!! note + + If the cluster has already enabled the above two admission controllers, please skip this step and go directly to using the interface to set exclusive nodes for the namespace. + + Go to any Master node in the current cluster to modify the __kube-apiserver.yaml__ configuration file, or execute the following command on the Master node for configuration: + + ```bash + [root@g-master1 ~]# vi /etc/kubernetes/manifests/kube-apiserver.yaml + + # The expected output is as follows: + apiVersion: v1 + kind: Pod + metadata: + ...... + spec: + containers: + - command: + - kube-apiserver + ...... + - --default-not-ready-toleration-seconds=300 + - --default-unreachable-toleration-seconds=300 + - --enable-admission-plugins=NodeRestriction #List of enabled admission controllers + - --enable-aggregator-routing=False + - --enable-bootstrap-token-auth=true + - --endpoint-reconciler-type=lease + - --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt + ...... + ``` + + Find the __--enable-admission-plugins__ parameter and add the __PodNodeSelector__ and __PodTolerationRestriction__ admission controllers (separated by commas). Refer to the following: + + ```bash + # Add __ ,PodNodeSelector,PodTolerationRestriction__ + - --enable-admission-plugins=NodeRestriction,PodNodeSelector,PodTolerationRestriction + ``` + +2. Use the interface to set exclusive nodes for the namespace + + After confirming that the __PodNodeSelector__ and __PodTolerationRestriction__ admission controllers on the cluster API server have been enabled, please follow the steps below to use the AI platform UI management interface to set exclusive nodes for the namespace. + + 1. Click the cluster name in the cluster list page, then click __Namespace__ in the left navigation bar. + + ![Namespace](../../images/exclusive05.png) + + 2. Click the namespace name, then click the __Exclusive Node__ tab, and click __Add Node__ on the bottom right. + + ![Add Node](../../images/exclusive02.png) + + 3. Select which nodes you want to be exclusive to this namespace on the left side of the page. 
On the right side, you can clear or delete a selected node. Finally, click __OK__ at the bottom. + + ![Confirm](../../images/exclusive07.png) + + 4. You can view the current exclusive nodes for this namespace in the list. You can choose to __Stop Exclusivity__ on the right side of the node. + + > After cancelling exclusivity, Pods from other namespaces can also be scheduled to this node. + + ![Cancel Exclusivity](../../images/exclusive08.png) + +3. Add toleration annotations to the namespace where the components that need high availability are located (optional) + + Execute the following command to add the annotation: `scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": + "NoSchedule", "key": "ExclusiveNamespace"}]'` to the namespace where the components that need high availability are located. + + ```bash + kubectl annotate ns scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": + "NoSchedule", "key": "ExclusiveNamespace"}]' + ``` + + Please make sure to replace `` with the name of the platform namespace you want to add the annotation to. diff --git a/docs/en/docs/end-user/kpanda/namespaces/podsecurity.md b/docs/en/docs/end-user/kpanda/namespaces/podsecurity.md new file mode 100644 index 0000000000..6805af35c8 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/namespaces/podsecurity.md @@ -0,0 +1,54 @@ +--- +MTPE: FanLin +Date: 2024-02-27 +--- + +# Pod Security Policy + +Pod security policies in a Kubernetes cluster allow you to control the behavior of Pods in various aspects of security by configuring different levels and modes for specific namespaces. Only Pods that meet certain conditions will be accepted by the system. It sets three levels and three modes, allowing users to choose the most suitable scheme to set restriction policies according to their needs. + +!!! note + + Only one security policy can be configured for one security mode. Please be careful when configuring the enforce security mode for a namespace, as violations will prevent Pods from being created. + +This section will introduce how to configure Pod security policies for namespaces through the container management interface. + +## Prerequisites + +- The container management module has [integrated a Kubernetes cluster](../clusters/integrate-cluster.md) or [created a Kubernetes cluster](../clusters/create-cluster.md). The cluster version needs to be v1.22 or above, and you should be able to access the cluster's UI interface. + +- A [namespace has been created](../namespaces/createns.md), a [user has been created](../../../ghippo/access-control/user.md), and the user has been granted [NS Admin](../permissions/permission-brief.md) or higher permissions. For details, refer to [Namespace Authorization](../permissions/cluster-ns-auth.md). + +## Configure Pod Security Policies for Namespace + +1. Select the namespace for which you want to configure Pod security policies and go to the details page. Click __Configure Policy__ on the __Pod Security Policy__ page to go to the configuration page. + + ![Configure Policy List](../images/ps01.png) + +2. Click __Add Policy__ on the configuration page, and a policy will appear, including security level and security mode. The following is a detailed introduction to the security level and security policy. + + | Security Level | Description | + | ---------- | ------------------------------------------------------------ | + | Privileged | An unrestricted policy that provides the maximum possible range of permissions. 
This policy allows known privilege elevations. | + | Baseline | The least restrictive policy that prohibits known privilege elevations. Allows the use of default (minimum specified) Pod configurations. | + | Restricted | A highly restrictive policy that follows current best practices for protecting Pods. | + + | Security Mode | Description | + | -------- | ------------------------------------------------------------ | + | Audit | Violations of the specified policy will add new audit events in the audit log, and the Pod can be created. | + | Warn | Violations of the specified policy will return user-visible warning information, and the Pod can be created. | + | Enforce | Violations of the specified policy will prevent the Pod from being created. | + + ![Add Policy](../images/ps02.png) + +3. Different security levels correspond to different check items. If you don't know how to configure your namespace, you can __Policy ConfigMap Explanation__ at the top right corner of the page to view detailed information. + + ![ConfigMap Explanation01](../images/ps03.png) + +4. Click Confirm. If the creation is successful, the security policy you configured will appear on the page. + + ![Creation Success](../images/ps04.png) + +5. Click __┇__ to edit or delete the security policy you configured. + + ![Operation](../images/ps05.png) diff --git a/docs/en/docs/end-user/kpanda/network/create-ingress.md b/docs/en/docs/end-user/kpanda/network/create-ingress.md new file mode 100644 index 0000000000..3cac721083 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/network/create-ingress.md @@ -0,0 +1,75 @@ +--- +MTPE: windsonsea +Date: 2024-10-15 +--- + +# Create an Ingress + +In a Kubernetes cluster, [Ingress](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#ingress-v1beta1-networking-k8s-io) exposes services from outside the cluster to inside the cluster HTTP and HTTPS ingress. +Traffic ingress is controlled by rules defined on the Ingress resource. Here's an example of a simple Ingress that sends all traffic to the same Service: + +![ingress-diagram](https://docs.daocloud.io/daocloud-docs-images/docs/kpanda/images/ingress.svg) + +Ingress is an API object that manages external access to services in the cluster, and the typical access method is HTTP. Ingress can provide load balancing, SSL termination, and name-based virtual hosting. + +## Prerequisites + +- Container management module [connected to Kubernetes cluster](../clusters/integrate-cluster.md) or [created Kubernetes](../clusters/create-cluster.md), and can access the cluster UI interface. +- Completed a [namespace creation](../namespaces/createns.md), [user creation](../../../ghippo/access-control/user.md), and authorize the user as [NS Editor](../permissions/permission-brief.md#ns-editor) role, for details, refer to [Namespace Authorization](../permissions/cluster-ns-auth.md). +- Completed [Create Ingress Instance](../../../network/modules/ingress-nginx/install.md), [Deploy Application Workload](../workloads/create-deployment.md), and have [created the corresponding Service](create-services.md) +- When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail. + +## Create ingress + +1. After successfully logging in as the __NS Editor__ user, click __Clusters__ in the upper left corner to enter the __Clusters__ page. In the list of clusters, click a cluster name. + + ![Clusters](../../images/ingress01.png) + +2. 
In the left navigation bar, click __Container Network__ -> __Ingress__ to enter the ingress list, and click the __Create Ingress__ button in the upper right corner. + + ![Ingress](../../images/ingress02.png) + + !!! note + + It is also possible to __Create from YAML__ . + +3. Open the __Create Ingress__ page to configure it. There are two protocol types to choose from; refer to the following two parameter tables for configuration. + +### Create HTTP protocol ingress + +| Parameter | Description | Example value | +| --------- | ----------- | ------------- | +| Ingress name | [Type] Required
[Meaning] Enter the name of the new ingress.
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | Ing-01 | +| Namespace | [Type] Required
[Meaning] Select the namespace where the new ingress is located. For more information about namespaces, refer to [Namespace Overview](../namespaces/createns.md).
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | default | +| Protocol | [Type] Required
[Meaning] The protocol used for inbound access to the cluster service. Both HTTP (no identity authentication required) and HTTPS (identity authentication must be configured) are supported. Select HTTP here. | HTTP | +| Domain Name | [Type] Required
[Meaning] Use the domain name to provide external access services. The default is the domain name of the cluster | testing.daocloud.io | +| LB Type | [Type] Required
[Meaning] The usage range of the Ingress instance. [Scope of use of Ingress](../../../network/modules/ingress-nginx/scope.md)
__Platform-level load balancer__ : In the same cluster, share the same Ingress instance, where all Pods can receive requests distributed by the load balancer.
__Tenant-level load balancer__ : The Ingress instance belongs exclusively to the current namespace, or to a workspace that includes the current namespace; all Pods in that scope can receive requests distributed by this load balancer. | Platform Level Load Balancer | +| Ingress Class | [Type] Optional
[Meaning] Select the corresponding Ingress instance; after selection, traffic is routed to the specified Ingress instance. When set to None, the default DefaultClass is used; set the DefaultClass when creating an Ingress instance. For more information, refer to [Ingress Class](../../../network/modules/ingress-nginx/ingressclass.md). | Nginx | +| Session persistence| [Type] Optional
[Meaning] Session persistence is divided into three types: __L4 source address hash__ , __Cookie Key__ , and __L7 Header Name__ .
__L4 Source Address Hash__ : When enabled, the following annotation is added by default: nginx.ingress.kubernetes.io/upstream-hash-by: "$binary_remote_addr"
__Cookie Key__ : When enabled, the connection from a specific client will be passed to the same Pod. After enabled, the following parameters are added to the Annotation by default:
nginx.ingress.kubernetes.io/affinity: "cookie"
nginx.ingress.kubernetes.io/affinity-mode: persistent
__L7 Header Name__ : When enabled, the following annotation is added by default: nginx.ingress.kubernetes.io/upstream-hash-by: "$http_x_forwarded_for" | Close | +| Path Rewriting| [Type] Optional
[Meaning] __rewrite-target__ : in some cases, the URL exposed by the backend service differs from the path specified in the Ingress rule. Without URL rewriting configuration, access requests will return an error. | close | +| Redirect | [Type] Optional
[Meaning] __permanent-redirect__ : permanent redirection. After the rewrite path is entered, access requests are redirected to the configured address. | close | +| Traffic Distribution | [Type] Optional
[Meaning] After enabled and set, traffic distribution will be performed according to the set conditions.
__Based on weight__ : After setting the weight, add the following Annotation to the created Ingress: __nginx.ingress.kubernetes.io/canary-weight: "10"__
__Based on Cookie__ : after the cookie rules are set, the traffic will be distributed according to the configured cookie conditions
__Based on Header__ : After setting the header rules, the traffic will be distributed according to the set header conditions | Close | +| Labels | [Type] Optional
[Meaning] Add a label for the ingress
| - | +| Annotations | [Type] Optional
[Meaning] Add annotation for ingress
| - | + +### Create HTTPS protocol ingress + +| Parameter | Description | Example value | +| --------- | ----------- | ------------- | +| Ingress name | [Type] Required
[Meaning] Enter the name of the new ingress.
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | Ing-01 | +| Namespace | [Type] Required
[Meaning] Select the namespace where the new ingress is located. For more information about namespaces, refer to [Namespace Overview](../namespaces/createns.md).
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | default | +| Protocol | [Type] Required
[Meaning] The protocol used for inbound access to the cluster service. Both HTTP (no identity authentication required) and HTTPS (identity authentication must be configured) are supported. Select HTTPS here. | HTTPS | +| Domain Name | [Type] Required
[Meaning] Use the domain name to provide external access services. The default is the domain name of the cluster | testing.daocloud.io | +| Secret | [Type] Required
[Meaning] The HTTPS TLS certificate; refer to [Create Secret](../configmaps-secrets/create-secret.md). | | +| Forwarding policy | [Type] Optional
[Meaning] Specify the access policy of Ingress.
**Path**: Specifies the URL path for service access; the default is the root path /
**Target service**: The name of the service for the ingress
**Target service port**: Port exposed by the service | | +| LB Type | [Type] Required
[Meaning] The usage range of the Ingress instance.
__Platform-level load balancer__ : In the same cluster, the same Ingress instance is shared, and all Pods can receive requests distributed by the load balancer.
__Tenant-level load balancer__ : The Ingress instance belongs exclusively to the current namespace, or to a workspace that contains the current namespace; all Pods in that scope can receive requests distributed by this load balancer. | Platform Level Load Balancer | +| Ingress Class | [Type] Optional
[Meaning] Select the corresponding Ingress instance; after selection, traffic is routed to the specified Ingress instance. When set to None, the default DefaultClass is used; set the DefaultClass when creating an Ingress instance. For more information, refer to [Ingress Class](../../../network/modules/ingress-nginx/ingressclass.md). | None | +| Session persistence| [Type] Optional
[Meaning] Session persistence is divided into three types: __L4 source address hash__ , __Cookie Key__ , and __L7 Header Name__ .
__L4 Source Address Hash__ : When enabled, the following annotation is added by default: nginx.ingress.kubernetes.io/upstream-hash-by: "$binary_remote_addr"
__Cookie Key__ : When enabled, the connection from a specific client will be passed to the same Pod. After enabled, the following parameters are added to the Annotation by default:
nginx.ingress.kubernetes.io/affinity: "cookie"
nginx.ingress.kubernetes.io/affinity-mode: persistent
__L7 Header Name__ : When enabled, the following annotation is added by default: nginx.ingress.kubernetes.io/upstream-hash-by: "$http_x_forwarded_for" | Close | +| Labels | [Type] Optional
[Meaning] Add a label for the ingress | | +| Annotations | [Type] Optional
[Meaning] Add annotation for ingress | | + +### Create ingress successfully + +After configuring all the parameters, click the __OK__ button to return to the ingress list automatically. On the right side of the list, click __┇__ to modify or delete the selected ingress. + +![Ingress List](../../images/ingress03.png) diff --git a/docs/en/docs/end-user/kpanda/network/create-services.md b/docs/en/docs/end-user/kpanda/network/create-services.md new file mode 100644 index 0000000000..28bc6197b8 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/network/create-services.md @@ -0,0 +1,82 @@ +# Create a Service + +In a Kubernetes cluster, each Pod has an internal independent IP address, but Pods in the workload may be created and deleted at any time, and directly using the Pod IP address cannot provide external services. + +This requires creating a service through which you get a fixed IP address, decoupling the front-end and back-end of the workload, and allowing external users to access the service. At the same time, the service also provides the Load Balancer feature, enabling users to access workloads from the public network. + +## Prerequisites + +- Container management module [connected to Kubernetes cluster](../clusters/integrate-cluster.md) or [created Kubernetes](../clusters/create-cluster.md), and can access the cluster UI interface. + +- Completed a [namespace creation](../namespaces/createns.md), [user creation](../../../ghippo/access-control/user.md), and authorize the user as [NS Editor](../permissions/permission-brief.md#ns-editor) role, for details, refer to [Namespace Authorization](../permissions/cluster-ns-auth.md). + +- When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail. + +## Create service + +1. After successfully logging in as the __NS Editor__ user, click __Clusters__ in the upper left corner to enter the __Clusters__ page. In the list of clusters, click a cluster name. + + + +2. In the left navigation bar, click __Container Network__ -> __Service__ to enter the service list, and click the __Create Service__ button in the upper right corner. + + + + !!! tip + + It is also possible to create a service via __YAML__ . + +3. Open the __Create Service__ page, select an access type, and refer to the following three parameter tables for configuration. + + + +### Create ClusterIP service + +Click __Intra-Cluster Access (ClusterIP)__ , which refers to exposing services through the internal IP of the cluster. The services selected for this option can only be accessed within the cluster. This is the default service type. Refer to the configuration parameters in the table below. + +| Parameter | Description | Example value | +| --------- | ----------- | ------------- | +| Access type | [Type] Required
[Meaning] Specify the method of Pod service discovery, here select intra-cluster access (ClusterIP). | ClusterIP | +| Service Name | [Type] Required
[Meaning] Enter the name of the new service.
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | Svc-01 | +| Namespace | [Type] Required
[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to [Namespace Overview](../namespaces/createns.md).
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | default | +| Label selector | [Type] Required
[Meaning] Add a label, the Service selects a Pod according to the label, and click "Add" after filling. You can also refer to the label of an existing workload. Click __Reference workload label__ , select the workload in the pop-up window, and the system will use the selected workload label as the selector by default. | app:job01 | +| Port configuration| [Type] Required
[Meaning] To add a protocol port for a service, you need to select the port protocol type first. Currently, it supports TCP and UDP.
**Port Name**: Enter the name of the custom port.
**Service port (port)**: The access port for Pod to provide external services.
**Container port (targetport)**: The container port that the workload actually listens on, used to expose the service within the cluster. | | +| Session Persistence | [Type] Optional
[Meaning] When enabled, requests from the same client will be forwarded to the same Pod | Enabled | +| Maximum session hold time | [Type] Optional
[Meaning] After session hold is enabled, the maximum hold time is 30 seconds by default | 30 seconds | +| Annotation | [Type] Optional
[Meaning] Add annotation for service
| | + +### Create NodePort service + +Click __NodePort__ , which means exposing the service via IP and static port ( __NodePort__ ) on each node. The __NodePort__ service is routed to the automatically created __ClusterIP__ service. You can access a __NodePort__ service from outside the cluster by requesting __:__ . Refer to the configuration parameters in the table below. + +| Parameter | Description | Example value | +| --------- | ----------- | ------------- | +| Access type | [Type] Required
[Meaning] Specify the method of Pod service discovery, here select node access (NodePort). | NodePort | +| Service Name | [Type] Required
[Meaning] Enter the name of the new service.
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | Svc-01 | +| Namespace | [Type] Required
[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to [Namespace Overview](../namespaces/createns.md).
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | default | +| Label selector | [Type] Required
[Meaning] Add a label, the Service selects a Pod according to the label, and click "Add" after filling. You can also refer to the label of an existing workload. Click __Reference workload label__ , select the workload in the pop-up window, and the system will use the selected workload label as the selector by default. | | +| Port configuration| [Type] Required
[Meaning] To add a protocol port for a service, you need to select the port protocol type first. Currently, it supports TCP and UDP.
**Port Name**: Enter the name of the custom port.
**Service port (port)**: The access port for the Pod to provide external services. By default, the service port is set to the same value as the container port field for convenience.
**Container port (targetport)**: The container port that the workload actually listens on.
**Node port (nodeport)**: The port of the node, which receives traffic from ClusterIP transmission. It is used as the entrance for external traffic access. | | +| Session Persistence| [Type] Optional
[Meaning] When enabled, requests from the same client will be forwarded to the same Pod
When enabled, the Service's __.spec.sessionAffinity__ is set to __ClientIP__ ; for details, refer to [Session Affinity for Service](https://kubernetes.io/docs/reference/networking/virtual-ips/#session-affinity) | Enabled | +| Maximum session hold time| [Type] Optional
[Meaning] The maximum hold time after session persistence is enabled; the default timeout is 30 seconds
.spec.sessionAffinityConfig.clientIP.timeoutSeconds is set to 30 seconds by default | 30 seconds | +| Annotation | [Type] Optional
[Meaning] Add annotation for service
| | + +### Create LoadBalancer service + +Click __Load Balancer__ , which refers to using the cloud provider's load balancer to expose services to the outside. External load balancers can route traffic to automatically created __NodePort__ services and __ClusterIP__ services. Refer to the configuration parameters in the table below. + +| Parameter | Description | Example value | +| --------- | ----------- | ------------- | +| Access type | [Type] Required
[Meaning] Specify the method of Pod service discovery, here select load balancer access (LoadBalancer). | LoadBalancer | | +| Service Name | [Type] Required
[Meaning] Enter the name of the new service.
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | Svc-01 | | +| Namespace | [Type] Required
[Meaning] Select the namespace where the new service is located. For more information about namespaces, refer to [Namespace Overview](../namespaces/createns.md).
[Note] Please enter a string of 4 to 63 characters, which can contain lowercase English letters, numbers and dashes (-), and start with a lowercase English letter and end with a lowercase English letter or number. | default | | +| External Traffic Policy | [Type] Required
[Meaning] Set external traffic policy.
**Cluster**: Traffic can be forwarded to Pods on all nodes in the cluster.
**Local**: Traffic is only sent to Pods on this node.
| | | +| Label selector | [Type] Required
[Meaning] Add a label; the Service selects Pods according to the label. Fill it in and click "Add". You can also refer to the label of an existing workload: click __Reference workload label__ , select the workload in the pop-up window, and the system will use the selected workload's label as the selector by default. | | | +| Load balancing type | [Type] Required
[Meaning] The type of load balancing used, currently supports MetalLB and others. | | | +| MetalLB IP Pool| [Type] Required
[Meaning] When the selected load balancing type is MetalLB, the LoadBalancer Service allocates IP addresses from this pool by default and announces all IP addresses in this pool via ARP. For details, refer to [Install MetalLB](../../../network/modules/metallb/install.md) | | | +| Load balancing address| [Type] Required
[Meaning]
1. If you are using a public cloud CloudProvider, fill in the load balancing address provided by the cloud provider here;
2. If MetalLB is selected as the load balancing type above, the IP is obtained from the IP pool above by default; if left empty, it is assigned automatically. | | | +| Port configuration| [Type] Required
[Meaning] To add a protocol port for a service, you need to select the port protocol type first. Currently, it supports TCP and UDP.
**Port Name**: Enter the name of the custom port.
**Service port (port)**: The access port for Pod to provide external services. By default, the service port is set to the same value as the container port field for convenience.
**Container port (targetport)**: The container port that the workload actually listens on.
**Node port (nodeport)**: The port of the node, which receives traffic from ClusterIP transmission. It is used as the entrance for external traffic access. | | | +| Annotation | [Type] Optional
[Meaning] Add annotation for service
| | | + +### Complete service creation + +After configuring all parameters, click the __OK__ button to return to the service list automatically. On the right side of the list, click __┇__ to modify or delete the selected service. diff --git a/docs/en/docs/end-user/kpanda/network/network-policy.md b/docs/en/docs/end-user/kpanda/network/network-policy.md new file mode 100644 index 0000000000..78686cc398 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/network/network-policy.md @@ -0,0 +1,70 @@ +# Network Policies + +Network policies in Kubernetes allow you to control network traffic at the IP address or port level (OSI layer 3 or layer 4). The container management module currently supports creating network policies based on Pods or namespaces, using label selectors to specify which traffic can enter or leave Pods with specific labels. + +For more details on network policies, refer to the official Kubernetes documentation on [Network Policies](https://kubernetes.io/docs/concepts/services-networking/network-policies/). + +## Creating Network Policies + +Currently, there are two methods available for creating network policies: YAML and form-based creation. Each method has its advantages and disadvantages, catering to different user needs. + +YAML creation requires fewer steps and is more efficient, but it has a higher learning curve as it requires familiarity with configuring network policy YAML files. + +Form-based creation is more intuitive and straightforward. Users can simply fill in the corresponding values based on the prompts. However, this method involves more steps. + +### YAML Creation + +1. In the cluster list, click the name of the target cluster, then navigate to __Container Network__ -> __Network Policies__ -> __Create with YAML__ in the left navigation bar. + + +2. In the pop-up dialog, enter or paste the pre-prepared YAML file, then click __OK__ at the bottom of the dialog. + + +### Form-Based Creation + +1. In the cluster list, click the name of the target cluster, then navigate to __Container Network__ -> __Network Policies__ -> __Create Policy__ in the left navigation bar. + + +2. Fill in the basic information. + + The name and namespace cannot be changed after creation. + + +3. Fill in the policy configuration. + + The policy configuration includes ingress and egress policies. To establish a successful connection from a source Pod to a target Pod, both the egress policy of the source Pod and the ingress policy of the target Pod need to allow the connection. If either side does not allow the connection, the connection will fail. + + - Ingress Policy: Click __➕__ to begin configuring the policy. Multiple policies can be configured. The effects of multiple network policies are cumulative. Only when all network policies are satisfied simultaneously can a connection be successfully established. + + - Egress Policy + +## Viewing Network Policies + +1. In the cluster list, click the name of the target cluster, then navigate to __Container Network__ -> __Network Policies__ . Click the name of the network policy. + + +2. View the basic configuration, associated instances, ingress policies, and egress policies of the policy. + + +!!! info + + Under the "Associated Instances" tab, you can view instance monitoring, logs, container lists, YAML files, events, and more. + + +## Updating Network Policies + +There are two ways to update network policies. You can either update them through the form or by using a YAML file. 
+ +- On the network policy list page, find the policy you want to update, and choose __Update__ in the action column on the right to update it via the form. Choose __Edit YAML__ to update it using a YAML file. + + +- Click the name of the network policy, then choose __Update__ in the top right corner of the policy details page to update it via the form. Choose __Edit YAML__ to update it using a YAML file. + +## Deleting Network Policies + +There are two ways to delete network policies. You can delete network policies either through the form or by using a YAML file. + +- On the network policy list page, find the policy you want to delete, and choose __Delete__ in the action column on the right to delete it via the form. Choose __Edit YAML__ to delete it using a YAML file. + + +- Click the name of the network policy, then choose __Delete__ in the top right corner of the policy details page to delete it via the form. Choose __Edit YAML__ to delete it using a YAML file. diff --git a/docs/en/docs/end-user/kpanda/nodes/add-node.md b/docs/en/docs/end-user/kpanda/nodes/add-node.md new file mode 100644 index 0000000000..b9c9bf7c26 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/nodes/add-node.md @@ -0,0 +1,30 @@ +--- +MTPE: FanLin +Date: 2024-02-27 +--- + +# Cluster Node Expansion + +As the number of business applications continues to grow, the resources of the cluster become increasingly tight. At this point, you can expand the cluster nodes based on kubean. After the expansion, applications can run on the newly added nodes, alleviating resource pressure. + +Only clusters [created through the container management module](../clusters/create-cluster.md) support node autoscaling. Clusters accessed from the outside do not support this operation. This article mainly introduces the expansion of **worker nodes** in the same architecture work cluster. If you need to add control nodes or heterogeneous work nodes to the cluster, refer to: [Expanding the control node of the work cluster](../../best-practice/add-master-node.md), [Adding heterogeneous nodes to the work cluster](../../best-practice/multi-arch.md), [Expanding the worker node of the global service cluster](../../best-practice/add-worker-node-on-global.md). + +1. On the __Clusters__ page, click the name of the target cluster. + + If the __Cluster Type__ contains the label __Integrated Cluster__, it means that the cluster does not support node autoscaling. + + ![Enter the cluster list page](../images/addnode01.png) + +2. Click __Nodes__ in the left navigation bar, and then click __Integrate Node__ in the upper right corner of the page. + + ![Integrate Node](../images/addnode02.png) + +3. Enter the host name and node IP and click __OK__. + + Click __➕ Add Worker Node__ to continue accessing more nodes. + + ![Node Check](../images/addnode03.png) + +!!! note + + Accessing the node takes about 20 minutes, please be patient. diff --git a/docs/en/docs/end-user/kpanda/nodes/delete-node.md b/docs/en/docs/end-user/kpanda/nodes/delete-node.md new file mode 100644 index 0000000000..dce0f8e24a --- /dev/null +++ b/docs/en/docs/end-user/kpanda/nodes/delete-node.md @@ -0,0 +1,34 @@ +# Node Scales Down + +When the peak business period is over, in order to save resource costs, you can reduce the size of the cluster and unload redundant nodes, that is, node scaling. After a node is uninstalled, applications cannot continue to run on the node. 
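Before a node is removed, it is typically cordoned and drained first so that the Pods running on it are evicted and rescheduled onto other nodes (see the prerequisites below). A minimal command-line sketch, assuming a hypothetical worker node named __work-node-1__ and a recent kubectl version:

```bash
# Mark the node unschedulable so that no new Pods land on it
kubectl cordon work-node-1

# Gracefully evict the existing Pods; DaemonSet Pods are skipped,
# and Pods that only use emptyDir volumes are evicted despite losing that local data
kubectl drain work-node-1 --ignore-daemonsets --delete-emptydir-data

# Verify that only DaemonSet-managed Pods remain on the node
kubectl get pods --all-namespaces --field-selector spec.nodeName=work-node-1
```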
+ +## Prerequisites + +- The current operating user has the [`Cluster Admin`](../permissions/permission-brief.md) role authorization. +- Only through the container management module [created cluster](../clusters/create-cluster.md) can node autoscaling be supported, and the cluster accessed from the outside does not support this operation. +- Before uninstalling a node, you need to [pause scheduling the node](schedule.md), and expel the applications on the node to other nodes. +- Eviction method: log in to the controller node, and use the kubectl drain command to evict all Pods on the node. The safe eviction method allows the containers in the pod to terminate gracefully. + +## Precautions + +1. When cluster nodes scales down, they can only be uninstalled one by one, not in batches. + +2. If you need to uninstall cluster controller nodes, you need to ensure that the final number of controller nodes is an **odd number**. + +3. The **first controller** node cannot be offline when the cluster node scales down. If it is necessary to perform this operation, please contact the after-sales engineer. + +## Steps + +1. On the __Clusters__ page, click the name of the target cluster. + + If the __Cluster Type__ has the tag __Integrate Cluster__ , it means that the cluster does not support node autoscaling. + + ![Clusters](../images/addnode01.png) + +2. Click __Nodes__ on the left navigation bar, find the node to be uninstalled, click __┇__ and select __Remove__ . + + ![Remove Nodes](../images/deletenode01.png) + +3. Enter the node name, and click __Delete__ to confirm. + + ![Delete](../images/deletenode02.png) diff --git a/docs/en/docs/end-user/kpanda/nodes/labels-annotations.md b/docs/en/docs/end-user/kpanda/nodes/labels-annotations.md new file mode 100644 index 0000000000..d89638588f --- /dev/null +++ b/docs/en/docs/end-user/kpanda/nodes/labels-annotations.md @@ -0,0 +1,29 @@ +--- +MTPE: FanLin +Date: 2024-02-27 +--- + +# Labels and Annotations + +Labels are identifying key-value pairs added to Kubernetes objects such as Pods, nodes, and clusters, which can be combined with label selectors to find and filter Kubernetes objects that meet certain conditions. Each key must be unique for a given object. + +Annotations, like tags, are key/value pairs, but they do not have identification or filtering features. +Annotations can be used to add arbitrary metadata to nodes. +Annotation keys usually use the format __prefix(optional)/name(required)__ , for example __nfd.node.kubernetes.io/extended-resources__ . +If the prefix is ​​omitted, it means that the annotation key is private to the user. + +For more information about labels and annotations, refer to the official Kubernetes documentation [labels and selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) Or [Annotations](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/). + +The steps to add/delete tags and annotations are as follows: + +1. On the __Clusters__ page, click the name of the target cluster. + + ![Clusters](../images/schedule01.png) + +2. Click __Nodes__ on the left navigation bar, click the __┇__ operation icon on the right side of the node, and click __Edit Labels__ or __Edit Annotations__ . + + ![暂停调度](../images/labels01.png) + +3. Click __➕ Add__ to add tags or annotations, click __X__ to delete tags or annotations, and finally click __OK__ . 
+ + ![节点管理](../images/labels02.png) diff --git a/docs/en/docs/end-user/kpanda/nodes/node-authentication.md b/docs/en/docs/end-user/kpanda/nodes/node-authentication.md new file mode 100644 index 0000000000..e5de285be7 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/nodes/node-authentication.md @@ -0,0 +1,68 @@ +# Node Authentication + +## Authenticate Nodes Using SSH Keys + +If you choose to authenticate the nodes of the cluster-to-be-created using SSH keys, you need to configure the public and private keys according to the following instructions. + +1. Run the following command on **any node within the management cluster of the cluster-to-be-created** to generate the public and private keys. + + ```shell + cd /root/.ssh + ssh-keygen -t rsa + ``` + +2. Run the __ls__ command to check if the keys have been successfully created in the management cluster. The correct output should be as follows: + + ```shell + ls + id_rsa id_rsa.pub known_hosts + ``` + + The file named __id_rsa__ is the private key, and the file named __id_rsa.pub__ is the public key. + +3. Run the following command to load the public key file __id_rsa.pub__ onto all the nodes of the cluster-to-be-created. + + ```shell + ssh-copy-id -i /root/.ssh/id_rsa.pub root@10.0.0.0 + ``` + + Replace the user account and node IP in the above command with the username and IP of the nodes in the cluster-to-be-created. **The same operation needs to be performed on every node in the cluster-to-be-created**. + +4. Run the following command to view the private key file __id_rsa__ created in step 1. + + ```shell + cat /root/.ssh/id_rsa + ``` + + The output should be as follows: + + ```bash + -----BEGIN RSA PRIVATE KEY----- + MIIEpQIBAAKCAQEA3UvyKINzY5BFuemQ+uJ6q+GqgfvnWwNC8HzZhpcMSjJy26MM + UtBEBJxy8fMi57XcjYxPibXW/wnd+32ICCycqCwByUmuXeCC1cjlCQDqjcAvXae7 + Y54IXGF7wm2IsMNwf0kjFEXjuS48FLDA0mGRaN3BG+Up5geXcHckg3K5LD8kXFFx + dEmSIjdyw55NaUitmEdHzN7cIdfi6Z56jcV8dcFBgWKUx+ebiyPmZBkXToz6GnMF + rswzzZCl+G6Jb2xTGy7g7ozb4BoZd1IpSD5EhDanRrESVE0C5YuJ5zUAC0CvVd1l + v67AK8Ko6MXToHp01/bcsvlM6cqgwUFXZKVeOwIDAQABAoIBAQCO36GQlo3BEjxy + M2HvGJmqrx+unDxafliRe4nVY2AD515Qf4xNSzke4QM1QoyenMOwf446krQkJPK0 + k+9nl6Xszby5gGCbK4BNFk8I6RaGPjZWeRx6zGUJf8avWJiPxx6yjz2esSC9RiR0 + F0nmiiefVMyAfgv2/5++dK2WUFNNRKLgSRRpP5bRaD5wMzzxtSSXrUon6217HO8p + 3RoWsI51MbVzhdVgpHUNABcoa0rpr9svT6XLKZxY8mxpKFYjM0Wv2JIDABg3kBvh + QbJ7kStCO3naZjKMU9UuSqVJs06cflGYw7Or8/tABR3LErNQKPjkhAQqt0DXw7Iw + 3tKdTAJBAoGBAP687U7JAOqQkcphek2E/A/sbO/d37ix7Z3vNOy065STrA+ZWMZn + pZ6Ui1B/oJpoZssnfvIoz9sn559X0j67TljFALFd2ZGS0Fqh9KVCqDvfk+Vst1dq + +3r/yZdTOyswoccxkJiC/GDwZGK0amJWqvob39JCZhDAKIGLbGMmjdAHAoGBAN5k + m1WGnni1nZ+3dryIwgB6z1hWcnLTamzSET6KhSuo946ET0IRG9xtlheCx6dqICbr + Vk1Y4NtRZjK/p/YGx59rDWf7E3I8ZMgR7mjieOcUZ4lUlA4l7ZIlW/2WZHW+nUXO + Ti20fqJ8qSp4BUvOvuth1pz2GLUHe2/Fxjf7HIstAoGBAPHpPr9r+TfIlPsJeRj2 + 6lzA3G8qWFRQfGRYjv0fjv0pA+RIb1rzgP/I90g5+63G6Z+R4WdcxI/OJJNY1iuG + uw9n/pFxm7U4JC990BPE6nj5iLz+clpNGYckNDBF9VG9vFSrSDLdaYkxoVNvG/xJ + a9Na90H4lm7f3VewrPy310KvAoGAZr+mwNoEh5Kpc6xo8Gxi7aPP/mlaUVD6X7Ki + gvmu02AqmC7rC4QqEiqTaONkaSXwGusqIWxJ3yp5hELmUBYLzszAEeV/s4zRp1oZ + g133LBRSTbHFAdBmNdqK6Nu+KGRb92980UMOKvZbliKDl+W6cbfvVu+gtKrzTc3b + aevb4TUCgYEAnJAxyVYDP1nJf7bjBSHXQu1E/DMwbtrqw7dylRJ8cAzI7IxfSCez + 7BYWq41PqVd9/zrb3Pbh2phiVzKe783igAIMqummcjo/kZyCwFsYBzK77max1jF5 + aPQsLbRS2aDz8kIH6jHPZ/R+15EROmdtLmA7vIJZGerWWQR0dUU+XXA= + ``` + +Copy the content of the private key and paste it into the interface's key input field. 
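+
+Optionally, before continuing, you can verify from the management cluster node that key-based login
+works for the nodes of the cluster-to-be-created. A minimal check, reusing the placeholder node IP
+__10.0.0.0__ from the command above:
+
+```shell
+# Should print the remote hostname without prompting for a password
+ssh -i /root/.ssh/id_rsa root@10.0.0.0 hostname
+```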
diff --git a/docs/en/docs/end-user/kpanda/nodes/node-check.md b/docs/en/docs/end-user/kpanda/nodes/node-check.md
new file mode 100644
index 0000000000..54cdd24c59
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/nodes/node-check.md
@@ -0,0 +1,38 @@
+# Cluster Node Availability Check
+
+When creating a cluster or adding nodes to an existing cluster, check the node configuration against the table below to avoid cluster creation or expansion failures caused by incorrect node configuration.
+
+| Check Item | Description |
+| ---------- | ----------- |
+| OS | Refer to [Supported Architectures and Operating Systems](#supported-architectures-and-operating-systems) |
+| SELinux | Off |
+| Firewall | Off |
+| Architecture Consistency | Consistent CPU architecture between nodes (such as ARM or x86) |
+| Host Time | The time difference between all hosts is within 10 seconds. |
+| Network Connectivity | The node and its SSH port can be accessed normally by the platform. |
+| CPU | Available CPU resources are greater than 4 Cores |
+| Memory | Available memory resources are greater than 8 GB |
+
+## Supported Architectures and Operating Systems
+
+| Architecture | Operating System | Remarks |
+| ---- | ------------------------ | ---- |
+| ARM | Kylin Linux Advanced Server release V10 (Sword) SP2 | Recommended |
+| ARM | UOS Linux | |
+| ARM | openEuler | |
+| x86 | CentOS 7.x | Recommended |
+| x86 | Redhat 7.x | Recommended |
+| x86 | Redhat 8.x | Recommended |
+| x86 | Flatcar Container Linux by Kinvolk | |
+| x86 | Debian Bullseye, Buster, Jessie, Stretch | |
+| x86 | Ubuntu 16.04, 18.04, 20.04, 22.04 | |
+| x86 | Fedora 35, 36 | |
+| x86 | Fedora CoreOS | |
+| x86 | openSUSE Leap 15.x/Tumbleweed | |
+| x86 | Oracle Linux 7, 8, 9 | |
+| x86 | Alma Linux 8, 9 | |
+| x86 | Rocky Linux 8, 9 | |
+| x86 | Amazon Linux 2 | |
+| x86 | Kylin Linux Advanced Server release V10 (Sword) - SP2 Haiguang | |
+| x86 | UOS Linux | |
+| x86 | openEuler | |
diff --git a/docs/en/docs/end-user/kpanda/nodes/node-details.md b/docs/en/docs/end-user/kpanda/nodes/node-details.md
new file mode 100644
index 0000000000..e499d75307
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/nodes/node-details.md
@@ -0,0 +1,24 @@
+---
+MTPE: FanLin
+Date: 2024-02-27
+---
+
+# Node Details
+
+After accessing or creating a cluster, you can view the information of each node in the cluster, including node status, labels, resource usage, Pods, and monitoring information.
+
+1. On the __Clusters__ page, click the name of the target cluster.
+
+    ![Clusters](../images/schedule01.png)
+
+2. Click __Nodes__ on the left navigation bar to view the node status, role, label, CPU/memory usage, IP address, and creation time.
+
+    ![Nodes](../images/node-details01.png)
+
+3. Click the node name to enter the node details page to view more information, including overview information, Pod information, label and annotation information, event list, and status.
+
+    ![Node Details](../images/node-details02.png)
+
+    In addition, you can also view the node's YAML file, monitoring information, labels, and annotations.
+
+    ![Edit](../images/node-details03.png)
diff --git a/docs/en/docs/end-user/kpanda/nodes/schedule.md b/docs/en/docs/end-user/kpanda/nodes/schedule.md
new file mode 100644
index 0000000000..d79fb8d79f
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/nodes/schedule.md
@@ -0,0 +1,24 @@
+---
+MTPE: FanLin
+Date: 2024-02-27
+---
+
+# Node Scheduling
+
+The platform supports suspending or resuming scheduling on nodes. Pausing scheduling means stopping the scheduling of Pods to the node. Resuming scheduling means that Pods can be scheduled to that node.
+
+1. On the __Clusters__ page, click the name of the target cluster.
+
+    ![Clusters](../images/taint01.png)
+
+2. Click __Nodes__ on the left navigation bar, click the __┇__ operation icon on the right side of the node, and click the __Cordon__ button to suspend scheduling on the node.
+
+    ![Cordon](../images/schedule01.png)
+
+3. Click the __┇__ operation icon on the right side of the node, and click the __Uncordon__ button to resume scheduling on the node.
+
+    ![Uncordon](../images/schedule02.png)
+
+The node scheduling status may be delayed due to network conditions. Click the __refresh icon__ on the right side of the search box to refresh the node scheduling status.
+
+![Refresh](../images/schedule03.png)
diff --git a/docs/en/docs/end-user/kpanda/nodes/taints.md b/docs/en/docs/end-user/kpanda/nodes/taints.md
new file mode 100644
index 0000000000..745ccd0435
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/nodes/taints.md
@@ -0,0 +1,51 @@
+---
+MTPE: FanLin
+Date: 2024-02-27
+---
+
+# Node Taints
+
+A taint allows a node to repel a certain type of Pod and prevents such Pods from being scheduled on the node.
+One or more taints can be applied to each node, and Pods that cannot tolerate these taints will not be scheduled on that node.
+
+## Precautions
+
+1. The current operating user should have [NS Editor](../permissions/permission-brief.md) role authorization or other higher permissions.
+2. After adding a taint to a node, only Pods that can tolerate the taint can be scheduled to the node.
+
+## Steps
+
+1. Find the target cluster on the __Clusters__ page, and click the cluster name to enter the cluster page.
+
+    ![Clusters](../images/taint01.png)
+
+2. In the left navigation bar, click __Nodes__ , find the node whose taints need to be modified, click the __┇__ operation icon on the right, and click the __Edit Taints__ button.
+
+    ![Edit Taints](../images/taint02.png)
+
+3. Enter the key-value information of the taint in the pop-up box, select the taint effect, and click __OK__ .
+
+    Click __➕ Add__ to add multiple taints to the node, and click __X__ on the right side of the taint effect to delete the taint.
+
+    Three taint effects are currently supported:
+
+    - `NoExecute`: This affects Pods that are already running on the node as follows:
+
+        - Pods that do not tolerate the taint are evicted immediately
+        - Pods that tolerate the taint without specifying `tolerationSeconds` in
+          their toleration specification remain bound forever
+        - Pods that tolerate the taint with a specified `tolerationSeconds` remain
+          bound for the specified amount of time. After that time elapses, the node
+          lifecycle controller evicts the Pods from the node.
+
+    - `NoSchedule`: No new Pods will be scheduled on the tainted node unless they have a matching
+      toleration. Pods currently running on the node are **not** evicted.
+
+    - `PreferNoSchedule`: This is a "preference" or "soft" version of `NoSchedule`.
+      The control plane will *try* to avoid placing a Pod that does not tolerate
+      the taint on the node, but it is not guaranteed, so this taint is not recommended for use in a production environment.
+
+    ![Config](../images/taint03.png)
+
+For more details about taints, refer to the Kubernetes documentation on
+[Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).
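+
+As a supplement to the UI steps above, the same kind of taint can also be applied with
+`kubectl taint nodes <node-name> key=value:NoSchedule`, and a Pod that still needs to run on the
+tainted node must declare a matching toleration. The key/value pair `dedicated=gpu` below is only an
+illustrative example, not a value required by the platform:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: toleration-example
+spec:
+  containers:
+    - name: app
+      image: nginx:latest
+  tolerations:
+    - key: "dedicated"      # must match the taint key on the node
+      operator: "Equal"
+      value: "gpu"          # must match the taint value
+      effect: "NoSchedule"  # must match the taint effect
+```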
diff --git a/docs/en/docs/end-user/kpanda/olm/import-miniooperator.md b/docs/en/docs/end-user/kpanda/olm/import-miniooperator.md new file mode 100644 index 0000000000..339624f460 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/olm/import-miniooperator.md @@ -0,0 +1,153 @@ +# Importing MinIo Operator Offline + +This guide explains how to import the MinIo Operator offline in an environment without internet access. + +## Prerequisites + +- The current cluster is connected to the container management and the Global cluster has installed the __kolm__ component (search for helm templates for kolm). +- The current cluster has the __olm__ component installed with a version of 0.2.4 or higher (search for helm templates for olm). +- Ability to execute Docker commands. +- Prepare a container registry. + +## Steps + +1. Set the environment variables in the execution environment and use them in the subsequent steps by running the following command: + + ```bash + export OPM_IMG=10.5.14.200/quay.m.daocloud.io/operator-framework/opm:v1.29.0 + export BUNDLE_IMG=10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3 + ``` + + How to get the above image addresses: + + Go to __Container Management__ -> Select the current cluster -> __Helm Applications__ -> View the __olm__ component -> __Plugin Settings__ , and find the images needed for the opm, minio, minio bundle, and minio operator in the subsequent steps. + + + ```bash + Using the screenshot as an example, the four image addresses are as follows: + + # opm image + 10.5.14.200/quay.m.daocloud.io/operator-framework/opm:v1.29.0 + + # minio image + 10.5.14.200/quay.m.daocloud.io/minio/minio:RELEASE.2023-03-24T21-41-23Z + + # minio bundle image + 10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3 + + # minio operator image + 10.5.14.200/quay.m.daocloud.io/minio/operator:v5.0.3 + ``` + +2. Run the opm command to get the operators included in the offline bundle image. + + ```bash + # Create the operator directory + $ mkdir minio-operator && cd minio-operator + + # Get the operator yaml + $ docker run --user root -v $PWD/minio-operator:/minio-operator ${OPM_IMG} alpha bundle unpack --skip-tls-verify -v -d ${BUNDLE_IMG} -o ./minio-operator + + # Expected result + . + └── minio-operator + ├── manifests + │ ├── console-env_v1_configmap.yaml + │ ├── console-sa-secret_v1_secret.yaml + │ ├── console_v1_service.yaml + │ ├── minio-operator.clusterserviceversion.yaml + │ ├── minio.min.io_tenants.yaml + │ ├── operator_v1_service.yaml + │ ├── sts.min.io_policybindings.yaml + │ └── sts_v1_service.yaml + └── metadata + └── annotations.yaml + + 3 directories, 9 files + ``` + +3. Replace all image addresses in the __minio-operator/manifests/minio-operator.clusterserviceversion.yaml__ file with the image addresses from the offline container registry. + + Before replacement: + + + After replacement: + + +4. Generate a Dockerfile for building the bundle image. + + ```bash + $ docker run --user root -v $PWD:/minio-operator -w /minio-operator ${OPM_IMG} alpha bundle generate --channels stable,beta -d /minio-operator/minio-operator/manifests -e stable -p minio-operator   + + # Expected result + . 
+ ├── bundle.Dockerfile + └── minio-operator + ├── manifests + │ ├── console-env_v1_configmap.yaml + │ ├── console-sa-secret_v1_secret.yaml + │ ├── console_v1_service.yaml + │ ├── minio-operator.clusterserviceversion.yaml + │ ├── minio.min.io_tenants.yaml + │ ├── operator_v1_service.yaml + │ ├── sts.min.io_policybindings.yaml + │ └── sts_v1_service.yaml + └── metadata + └── annotations.yaml + + 3 directories, 10 files + ``` + +5. Build the bundle image and push it to the offline registry. + + ```bash + # Set the new bundle image + export OFFLINE_BUNDLE_IMG=10.5.14.200/quay.m.daocloud.io/operatorhubio/minio-operator:v5.0.3-offline + + $ docker build . -f bundle.Dockerfile -t ${OFFLINE_BUNDLE_IMG}   + + $ docker push ${OFFLINE_BUNDLE_IMG} + ``` + +6. Generate a Dockerfile for building the catalog image. + + ```bash + $ docker run --user root -v $PWD:/minio-operator -w /minio-operator ${OPM_IMG} index add --bundles ${OFFLINE_BUNDLE_IMG} --generate --binary-image ${OPM_IMG} --skip-tls-verify + + # Expected result + . + ├── bundle.Dockerfile + ├── database + │ └── index.db + ├── index.Dockerfile + └── minio-operator + ├── manifests + │ ├── console-env_v1_configmap.yaml + │ ├── console-sa-secret_v1_secret.yaml + │ ├── console_v1_service.yaml + │ ├── minio.min.io_tenants.yaml + │ ├── minio-operator.clusterserviceversion.yaml + │ ├── operator_v1_service.yaml + │ ├── sts.min.io_policybindings.yaml + │ └── sts_v1_service.yaml + └── metadata + └── annotations.yaml + + 4 directories, 12 files + ``` + +7. Build the catalog image. + + ```bash + # Set the new catalog image + export OFFLINE_CATALOG_IMG=10.5.14.200/release.daocloud.io/operator-framework/system-operator-index:v0.1.0-offline + + $ docker build . -f index.Dockerfile -t ${OFFLINE_CATALOG_IMG} + + $ docker push ${OFFLINE_CATALOG_IMG} + ``` + +8. Go to Container Management and update the built-in catsrc image for the helm application __olm__ (enter the catalog image specified in the construction of the catalog image, __${catalog-image}__ ). + +9. After the update is successful, the __minio-operator__ component will appear in the Operator Hub. + diff --git a/docs/en/docs/end-user/kpanda/permissions/cluster-ns-auth.md b/docs/en/docs/end-user/kpanda/permissions/cluster-ns-auth.md new file mode 100644 index 0000000000..e5176b991e --- /dev/null +++ b/docs/en/docs/end-user/kpanda/permissions/cluster-ns-auth.md @@ -0,0 +1,55 @@ +# Cluster and Namespace Authorization + +Container management implements authorization based on global authority management and global user/group management. If you need to grant users the highest authority for container management (can create, manage, and delete all clusters), refer to [What are Access Control](../../../ghippo/access-control/iam.md). + +## Prerequisites + +Before authorizing users/groups, complete the following preparations: + +- The user/group to be authorized has been created in the global management, refer to [User](../../../ghippo/access-control/user.md). + +- Only [ __Kpanda Owner__ ](../../../ghippo/access-control/global.md) and [`Cluster Admin`](permission-brief.md) of the current cluster have Cluster authorization capability. For details, refer to [Permission Description](permission-brief.md). + +- only [ __Kpanda Owner__ ](../../../ghippo/access-control/global.md), [`Cluster Admin`](permission-brief.md) for the current cluster, [`NS Admin`](permission-brief.md) of the current namespace has namespace authorization capability. + +## Cluster Authorization + +1. 
After the user logs in to the platform, click __Privilege Management__ under __Container Management__ on the left menu bar, which is located on the __Cluster Permissions__ tab by default. + + + +2. Click the __Add Authorization__ button. + + + +3. On the __Add Cluster Permission__ page, select the target cluster, the user/group to be authorized, and click __OK__ . + + Currently, the only cluster role supported is __Cluster Admin__ . For details about permissions, refer to [Permission Description](permission-brief.md). If you need to authorize multiple users/groups at the same time, you can click __Add User Permissions__ to add multiple times. + + + +4. Return to the cluster permission management page, and a message appears on the screen: __Cluster permission added successfully__ . + + + +## Namespace Authorization + +1. After the user logs in to the platform, click __Permissions__ under __Container Management__ on the left menu bar, and click the __Namespace Permissions__ tab. + + + +2. Click the __Add Authorization__ button. On the __Add Namespace Permission__ page, select the target cluster, target namespace, and user/group to be authorized, and click __OK__ . + + The currently supported namespace roles are NS Admin, NS Editor, and NS Viewer. For details about permissions, refer to [Permission Description](permission-brief.md). If you need to authorize multiple users/groups at the same time, you can click __Add User Permission__ to add multiple times. Click __OK__ to complete the permission authorization. + + + +3. Return to the namespace permission management page, and a message appears on the screen: __Cluster permission added successfully__ . + + + + !!! tip + + If you need to delete or edit permissions later, you can click __┇__ on the right side of the list and select __Edit__ or __Delete__ . + + \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/permissions/custom-kpanda-role.md b/docs/en/docs/end-user/kpanda/permissions/custom-kpanda-role.md new file mode 100644 index 0000000000..b61def9f89 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/permissions/custom-kpanda-role.md @@ -0,0 +1,107 @@ +--- +MTPE: ModetaNiu +Date: 2024-05-30 +--- + +# Adding RBAC Rules to System Roles + +*[kpanda]: A development codename for container management + +In the past, the RBAC rules for those system roles in container management were pre-defined and could +not be modified by users. +To support more flexible permission settings and to meet the customized needs for system roles, +now you can modify RBAC rules for system roles such as cluster admin, ns admin, ns editor, ns viewer. + +The following example demonstrates how to add a new ns-view rule, granting the authority to delete +workload deployments. Similar operations can be performed for other rules. + +## Prerequisites + +Before adding RBAC rules to system roles, the following prerequisites must be met: + +- Container management v0.27.0 and above. +- [Integrated Kubernetes cluster](../clusters/integrate-cluster.md) or + [created Kubernetes cluster](../clusters/create-cluster.md), and able to access the cluster's UI interface. +- Completed creation of a [namespace](../namespaces/createns.md) and [user account](../../../ghippo/access-control/user.md), + and the granting of [NS Viewer](./permission-brief.md#ns-viewer). + For details, refer to [namespace authorization](./cluster-ns-auth.md). + +!!! 
note + + - RBAC rules **only need to be added** in the Global Cluster, and the Kpanda controller will synchronize + those added rules to all integrated subclusters. Synchronization may take some time to complete. + - RBAC rules **can only be added** in the Global Cluster. RBAC rules added in subclusters + will be overridden by the system role permissions of the Global Cluster. + - Only ClusterRoles with fixed Label are supported for adding rules. Replacing or deleting rules + is not supported, nor is adding rules by using role. The correspondence between built-in roles and + ClusterRole Label created by users is as follows. + + ```output + cluster-admin: rbac.kpanda.io/role-template-cluster-admin: "true" + cluster-edit: rbac.kpanda.io/role-template-cluster-edit: "true" + cluster-view: rbac.kpanda.io/role-template-cluster-view: "true" + ns-admin: rbac.kpanda.io/role-template-ns-admin: "true" + ns-edit: rbac.kpanda.io/role-template-ns-edit: "true" + ns-view: rbac.kpanda.io/role-template-ns-view: "true" + ``` + +## Steps + +1. [Create a deployment](../workloads/create-deployment.md) by a user with `admin` or `cluster admin` permissions. + + ![image-20240514112742395](../images/create-depolyment.png) + +1. Grant a user the `ns-viewer` role to provide them with the `ns-view` permission. + + ![image-20240514113009311](../images/permisson02.png) + +1. Switch the login user to ns-viewer, open the console to get the token for the ns-viewer user, + and use `curl` to request and delete the nginx deployment mentioned above. However, + a prompt appears as below, indicating the user doesn't have permission to delete it. + + ```bash + [root@master-01 ~]# curl -k -X DELETE 'https://${URL}/apis/kpanda.io/v1alpha1/clusters/cluster-member/namespaces/default/deployments/nginx' -H 'authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJOU044MG9BclBRMzUwZ2VVU2ZyNy1xMEREVWY4MmEtZmJqR05uRE1sd1lFIn0.eyJleHAiOjE3MTU3NjY1NzksImlhdCI6MTcxNTY4MDE3OSwiYXV0aF90aW1lIjoxNzE1NjgwMTc3LCJqdGkiOiIxZjI3MzJlNC1jYjFhLTQ4OTktYjBiZC1iN2IxZWY1MzAxNDEiLCJpc3MiOiJodHRwczovLzEwLjYuMjAxLjIwMTozMDE0Ny9hdXRoL3JlYWxtcy9naGlwcG8iLCJhdWQiOiJfX2ludGVybmFsLWdoaXBwbyIsInN1YiI6ImMxZmMxM2ViLTAwZGUtNDFiYS05ZTllLWE5OGU2OGM0MmVmMCIsInR5cCI6IklEIiwiYXpwIjoiX19pbnRlcm5hbC1naGlwcG8iLCJzZXNzaW9uX3N0YXRlIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiYXRfaGFzaCI6IlJhTHoyQjlKQ2FNc1RrbGVMR3V6blEiLCJhY3IiOiIwIiwic2lkIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiZW1haWxfdmVyaWZpZWQiOmZhbHNlLCJncm91cHMiOltdLCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJucy12aWV3ZXIiLCJsb2NhbGUiOiIifQ.As2ipMjfvzvgONAGlc9RnqOd3zMwAj82VXlcqcR74ZK9tAq3Q4ruQ1a6WuIfqiq8Kq4F77ljwwzYUuunfBli2zhU2II8zyxVhLoCEBu4pBVBd_oJyUycXuNa6HfQGnl36E1M7-_QG8b-_T51wFxxVb5b7SEDE1AvIf54NAlAr-rhDmGRdOK1c9CohQcS00ab52MD3IPiFFZ8_Iljnii-RpXKZoTjdcULJVn_uZNk_SzSUK-7MVWmPBK15m6sNktOMSf0pCObKWRqHd15JSe-2aA2PKBo1jBH3tHbOgZyMPdsLI0QdmEnKB5FiiOeMpwn_oHnT6IjT-BZlB18VkW8rA' + {"code":7,"message":"[RBAC] delete resources(deployments: nginx) is forbidden for user(ns-viewer) in cluster(cluster-member)","details":[]}[root@master-01 ~]# + [root@master-01 ~]# + ``` + +1. Create a ClusterRole on the global cluster, as shown in the yaml below. + + ```yaml + apiVersion: rbac.authorization.k8s.io/v1 + kind: ClusterRole + metadata: + name: append-ns-view # (1)! + labels: + rbac.kpanda.io/role-template-ns-view: "true" # (2)! + rules: + - apiGroups: [ "apps" ] + resources: [ "deployments" ] + verbs: [ "delete" ] + ``` + + 1. 
This field value can be arbitrarily specified, as long as it is not duplicated and complies with + the Kubernetes resource naming conventions. + 2. When adding rules to different roles, make sure to apply different labels. + +1. Wait for the kpanda controller to add a rule of user creation to the built-in role: ns-viewer, + then you can check if the rules added in the previous step are present for ns-viewer. + + ```bash + [root@master-01 ~]# kubectl get clusterrole role-template-ns-view -oyaml|grep deployments -C 10|tail -n 6 + ``` + ```yaml + - apiGroups: + - apps + resources: + - deployments + verbs: + - delete + ``` + +1. When using curl again to request the deletion of the aforementioned nginx deployment, this time the deletion + was successful. This means that ns-viewer has successfully added the rule to delete deployments. + + ```bash + [root@master-01 ~]# curl -k -X DELETE 'https://${URL}/apis/kpanda.io/v1alpha1/clusters/cluster-member/namespaces/default/deployments/nginx' -H 'authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJOU044MG9BclBRMzUwZ2VVU2ZyNy1xMEREVWY4MmEtZmJqR05uRE1sd1lFIn0.eyJleHAiOjE3MTU3NjY1NzksImlhdCI6MTcxNTY4MDE3OSwiYXV0aF90aW1lIjoxNzE1NjgwMTc3LCJqdGkiOiIxZjI3MzJlNC1jYjFhLTQ4OTktYjBiZC1iN2IxZWY1MzAxNDEiLCJpc3MiOiJodHRwczovLzEwLjYuMjAxLjIwMTozMDE0Ny9hdXRoL3JlYWxtcy9naGlwcG8iLCJhdWQiOiJfX2ludGVybmFsLWdoaXBwbyIsInN1YiI6ImMxZmMxM2ViLTAwZGUtNDFiYS05ZTllLWE5OGU2OGM0MmVmMCIsInR5cCI6IklEIiwiYXpwIjoiX19pbnRlcm5hbC1naGlwcG8iLCJzZXNzaW9uX3N0YXRlIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiYXRfaGFzaCI6IlJhTHoyQjlKQ2FNc1RrbGVMR3V6blEiLCJhY3IiOiIwIiwic2lkIjoiMGJjZWRjZTctMTliYS00NmU1LTkwYmUtOTliMWY2MWEyNzI0IiwiZW1haWxfdmVyaWZpZWQiOmZhbHNlLCJncm91cHMiOltdLCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJucy12aWV3ZXIiLCJsb2NhbGUiOiIifQ.As2ipMjfvzvgONAGlc9RnqOd3zMwAj82VXlcqcR74ZK9tAq3Q4ruQ1a6WuIfqiq8Kq4F77ljwwzYUuunfBli2zhU2II8zyxVhLoCEBu4pBVBd_oJyUycXuNa6HfQGnl36E1M7-_QG8b-_T51wFxxVb5b7SEDE1AvIf54NAlAr-rhDmGRdOK1c9CohQcS00ab52MD3IPiFFZ8_Iljnii-RpXKZoTjdcULJVn_uZNk_SzSUK-7MVWmPBK15m6sNktOMSf0pCObKWRqHd15JSe-2aA2PKBo1jBH3tHbOgZyMPdsLI0QdmEnKB5FiiOeMpwn_oHnT6IjT-BZlB18VkW8rA' + ``` diff --git a/docs/en/docs/end-user/kpanda/permissions/permission-brief.md b/docs/en/docs/end-user/kpanda/permissions/permission-brief.md new file mode 100644 index 0000000000..c157825fc0 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/permissions/permission-brief.md @@ -0,0 +1,362 @@ +--- +MTPE: windsonsea +date: 2024-05-13 +--- + +# Container Management Permissions + +Container management permissions are based on a multi-dimensional permission management system created by global permission management and Kubernetes RBAC permission management. It supports cluster-level and namespace-level permission control, helping users to conveniently and flexibly set different operation permissions for IAM users and user groups (collections of users) under a tenant. + +## Cluster Permissions + +Cluster permissions are authorized based on Kubernetes RBAC's ClusterRoleBinding, allowing users/user groups to have cluster-related permissions. The current default cluster role is __Cluster Admin__ (does not have the permission to create or delete clusters). 
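+
+On the Kubernetes side, granting a user a cluster role corresponds to a standard ClusterRoleBinding
+that references the built-in role template. The following is only a simplified sketch of such a
+binding; the binding name and user name are placeholders, and the objects actually created by the
+platform may carry additional platform-specific labels and annotations:
+
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: example-cluster-admin-binding   # placeholder name
+subjects:
+  - kind: User
+    name: example-user                  # the user being authorized
+    apiGroup: rbac.authorization.k8s.io
+roleRef:
+  kind: ClusterRole
+  name: role-template-cluster-admin     # built-in role, see the YAML below
+  apiGroup: rbac.authorization.k8s.io
+```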
+ +### __Cluster Admin__ + +__Cluster Admin__ has the following permissions: + +- Can manage, edit, and view the corresponding cluster +- Manage, edit, and view all workloads and all resources within the namespace +- Can authorize users for roles within the cluster (Cluster Admin, NS Admin, NS Editor, NS Viewer) + +The YAML example for this cluster role is as follows: + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + annotations: + kpanda.io/creator: system + creationTimestamp: "2022-06-16T09:42:49Z" + labels: + iam.kpanda.io/role-template: "true" + name: role-template-cluster-admin + resourceVersion: "15168" + uid: f8f86d42-d5ef-47aa-b284-097615795076 +rules: +- apiGroups: + - '*' + resources: + - '*' + verbs: + - '*' +- nonResourceURLs: + - '*' + verbs: + - '*' +``` + +## Namespace Permissions + +Namespace permissions are authorized based on Kubernetes RBAC capabilities, allowing different users/user groups to have different operation permissions on resources under a namespace (including Kubernetes API permissions). For details, refer to: [Kubernetes RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/). Currently, the default roles for container management are: NS Admin, NS Editor, NS Viewer. + +### __NS Admin__ + +__NS Admin__ has the following permissions: + +- Can view the corresponding namespace +- Manage, edit, and view all workloads and custom resources within the namespace +- Can authorize users for corresponding namespace roles (NS Editor, NS Viewer) + +The YAML example for this cluster role is as follows: + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + annotations: + kpanda.io/creator: system + creationTimestamp: "2022-06-16T09:42:49Z" + labels: + iam.kpanda.io/role-template: "true" + name: role-template-ns-admin + resourceVersion: "15173" + uid: 69f64c7e-70e7-4c7c-a3e0-053f507f2bc3 +rules: +- apiGroups: + - '*' + resources: + - '*' + verbs: + - '*' +- nonResourceURLs: + - '*' + verbs: + - '*' +``` + +### __NS Editor__ + +__NS Editor__ has the following permissions: + +- Can view corresponding namespaces where permissions are granted +- Manage, edit, and view all workloads within the namespace + +??? 
note "Click to view the YAML example of the cluster role" + + ```yaml + apiVersion: rbac.authorization.k8s.io/v1 + kind: ClusterRole + metadata: + annotations: + kpanda.io/creator: system + creationTimestamp: "2022-06-16T09:42:50Z" + labels: + iam.kpanda.io/role-template: "true" + name: role-template-ns-edit + resourceVersion: "15175" + uid: ca9e690e-96c0-4978-8915-6e4c00c748fe + rules: + - apiGroups: + - "" + resources: + - configmaps + - endpoints + - persistentvolumeclaims + - persistentvolumeclaims/status + - pods + - replicationcontrollers + - replicationcontrollers/scale + - serviceaccounts + - services + - services/status + verbs: + - '*' + - apiGroups: + - "" + resources: + - bindings + - events + - limitranges + - namespaces/status + - pods/log + - pods/status + - replicationcontrollers/status + - resourcequotas + - resourcequotas/status + verbs: + - '*' + - apiGroups: + - "" + resources: + - namespaces + verbs: + - '*' + - apiGroups: + - apps + resources: + - controllerrevisions + - daemonsets + - daemonsets/status + - deployments + - deployments/scale + - deployments/status + - replicasets + - replicasets/scale + - replicasets/status + - statefulsets + - statefulsets/scale + - statefulsets/status + verbs: + - '*' + - apiGroups: + - autoscaling + resources: + - horizontalpodautoscalers + - horizontalpodautoscalers/status + verbs: + - '*' + - apiGroups: + - batch + resources: + - cronjobs + - cronjobs/status + - jobs + - jobs/status + verbs: + - '*' + - apiGroups: + - extensions + resources: + - daemonsets + - daemonsets/status + - deployments + - deployments/scale + - deployments/status + - ingresses + - ingresses/status + - networkpolicies + - replicasets + - replicasets/scale + - replicasets/status + - replicationcontrollers/scale + verbs: + - '*' + - apiGroups: + - policy + resources: + - poddisruptionbudgets + - poddisruptionbudgets/status + verbs: + - '*' + - apiGroups: + - networking.k8s.io + resources: + - ingresses + - ingresses/status + - networkpolicies + verbs: + - '*' + ``` + +### __NS Viewer__ + +__NS Viewer__ has the following permissions: + +- Can view the corresponding namespace +- Can view all workloads and custom resources within the corresponding namespace + +??? 
note "Click to view the YAML example of the cluster role" + + ```yaml + apiVersion: rbac.authorization.k8s.io/v1 + kind: ClusterRole + metadata: + annotations: + kpanda.io/creator: system + creationTimestamp: "2022-06-16T09:42:50Z" + labels: + iam.kpanda.io/role-template: "true" + name: role-template-ns-view + resourceVersion: "15183" + uid: 853888fd-6ee8-42ac-b91e-63923918baf8 + rules: + - apiGroups: + - "" + resources: + - configmaps + - endpoints + - persistentvolumeclaims + - persistentvolumeclaims/status + - pods + - replicationcontrollers + - replicationcontrollers/scale + - serviceaccounts + - services + - services/status + verbs: + - get + - list + - watch + - apiGroups: + - "" + resources: + - bindings + - events + - limitranges + - namespaces/status + - pods/log + - pods/status + - replicationcontrollers/status + - resourcequotas + - resourcequotas/status + verbs: + - get + - list + - watch + - apiGroups: + - "" + resources: + - namespaces + verbs: + - get + - list + - watch + - apiGroups: + - apps + resources: + - controllerrevisions + - daemonsets + - daemonsets/status + - deployments + - deployments/scale + - deployments/status + - replicasets + - replicasets/scale + - replicasets/status + - statefulsets + - statefulsets/scale + - statefulsets/status + verbs: + - get + - list + - watch + - apiGroups: + - autoscaling + resources: + - horizontalpodautoscalers + - horizontalpodautoscalers/status + verbs: + - get + - list + - watch + - apiGroups: + - batch + resources: + - cronjobs + - cronjobs/status + - jobs + - jobs/status + verbs: + - get + - list + - watch + - apiGroups: + - extensions + resources: + - daemonsets + - daemonsets/status + - deployments + - deployments/scale + - deployments/status + - ingresses + - ingresses/status + - networkpolicies + - replicasets + - replicasets/scale + - replicasets/status + - replicationcontrollers/scale + verbs: + - get + - list + - watch + - apiGroups: + - policy + resources: + - poddisruptionbudgets + - poddisruptionbudgets/status + verbs: + - get + - list + - watch + - apiGroups: + - networking.k8s.io + resources: + - ingresses + - ingresses/status + - networkpolicies + verbs: + - get + - list + - watch + ``` + +## Permissions FAQ + +1. What is the relationship between global permissions and container management permissions? + + Answer: Global permissions only authorize coarse-grained permissions, which can manage the creation, editing, and deletion of all clusters; while for fine-grained permissions, such as the management permissions of a single cluster, the management, editing, and deletion permissions of a single namespace, they need to be implemented based on Kubernetes RBAC container management permissions. Generally, users only need to be authorized in container management. + +2. Currently, only four default roles are supported. Can the __RoleBinding__ and __ClusterRoleBinding__ (Kubernetes fine-grained RBAC) for custom roles also take effect? + + Answer: Currently, custom permissions cannot be managed through the graphical interface, but the permission rules created using kubectl can still take effect. diff --git a/docs/en/docs/end-user/kpanda/scale/create-hpa.md b/docs/en/docs/end-user/kpanda/scale/create-hpa.md new file mode 100644 index 0000000000..3388489939 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/scale/create-hpa.md @@ -0,0 +1,68 @@ +# Create HPA + +Suanova AI platform supports elastic scaling of Pod resources based on metrics (Horizontal Pod Autoscaling, HPA). 
+Users can dynamically adjust the number of Pod replicas by setting CPU utilization, memory usage, and custom metrics.
+For example, after setting an auto scaling policy based on the CPU utilization metric for a workload,
+when the CPU utilization of the Pods exceeds or falls below the metric threshold you set, the workload controller
+will automatically increase or decrease the number of Pod replicas.
+
+This page describes how to configure auto scaling based on built-in metrics and custom metrics for workloads.
+
+!!! note
+
+    1. HPA is only applicable to Deployment and StatefulSet, and only one HPA can be created per workload.
+    2. If you create an HPA policy based on CPU utilization, you must set the resource limit (Limit) for the workload in advance, otherwise the CPU utilization cannot be calculated.
+    3. If built-in metrics and multiple custom metrics are used at the same time, HPA will calculate the required number of replicas based on each metric separately and take the larger value (but not exceeding the maximum number of replicas configured in the HPA policy) for scaling.
+
+## Built-in Metric Auto Scaling Policy
+
+The system has two built-in auto scaling metrics, CPU and memory, to meet users' basic business cases.
+
+### Prerequisites
+
+Before configuring a built-in metric auto scaling policy for the workload, the following prerequisites need to be met:
+
+- [Integrated the Kubernetes cluster](../clusters/integrate-cluster.md) or
+  [created the Kubernetes cluster](../clusters/create-cluster.md),
+  and you can access the UI interface of the cluster.
+
+- Created a [namespace](../namespaces/createns.md), [deployment](../workloads/create-deployment.md)
+  or [statefulset](../workloads/create-statefulset.md).
+
+- You should have permissions not lower than [NS Editor](../permissions/permission-brief.md#ns-editor).
+  For details, refer to [Namespace Authorization](../namespaces/createns.md).
+
+- Installed the [metrics-server plugin](install-metrics-server.md).
+
+### Steps
+
+Refer to the following steps to configure a built-in metric auto scaling policy for the workload.
+
+1. Click __Clusters__ on the left navigation bar to enter the cluster list page. Click a cluster name to enter the __Cluster Details__ page.
+
+2. On the cluster details page, click __Workload__ in the left navigation bar to enter the workload list, and then click a workload name to enter the __Workload Details__ page.
+
+3. Click the __Auto Scaling__ tab to view the auto scaling configuration of the current cluster.
+
+4. After confirming that the cluster has [installed the __metrics-server__ plugin](install-metrics-server.md) and that the plugin is running normally, you can click the __New Scaling__ button.
+
+5. Configure the parameters of the built-in metric auto scaling policy.
+
+    - Policy name: Enter the name of the auto scaling policy. Please note that the name can contain up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as hpa-my-dep.
+    - Namespace: The namespace where the workload resides.
+    - Workload: The workload object that performs auto scaling.
+    - Target CPU Utilization: The CPU usage of the Pods under the workload resource, calculated as the actual CPU usage of all Pods under the workload divided by their total requested (request) value. When the actual CPU utilization is greater/lower than the target value, the system automatically increases/reduces the number of Pod replicas.
+    - Target Memory Usage: The memory usage of the Pods under the workload resource. When the actual memory usage is greater/lower than the target value, the system automatically increases/reduces the number of Pod replicas.
+    - Replica range: the scaling range of the number of Pod replicas. The default interval is 1 - 10.
+
+6. After completing the parameter configuration, click the __OK__ button to automatically return to the auto scaling details page. Click __┇__ on the right side of the list to edit, delete, and view related events.
diff --git a/docs/en/docs/end-user/kpanda/scale/create-vpa.md b/docs/en/docs/end-user/kpanda/scale/create-vpa.md
new file mode 100644
index 0000000000..959ce6745a
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/scale/create-vpa.md
@@ -0,0 +1,58 @@
+---
+MTPE: FanLin
+Date: 2024-02-23
+---
+
+# Create VPAs
+
+The Vertical Pod Autoscaler (VPA) calculates the most suitable CPU and memory request values for a Pod by monitoring the Pod's resource requests and usage over a period of time. Using VPA can allocate resources to each Pod in the cluster more reasonably, improve the overall resource utilization of the cluster, and avoid wasting cluster resources.
+
+AI platform supports VPA for containers. Based on this feature, Pod request values can be dynamically adjusted according to container resource usage. AI platform supports both manual and automatic modification of resource request values, and you can configure them according to actual needs.
+
+This page describes how to configure VPA for deployments.
+
+!!! warning
+
+    Using VPA to modify a Pod resource request will trigger a Pod restart. Due to the limitations of Kubernetes itself, Pods may be scheduled to other nodes after restarting.
+
+## Prerequisites
+
+Before configuring a vertical scaling policy for a deployment, the following prerequisites must be met:
+
+- In the [Container Management](../../intro/index.md) module, you have [accessed a Kubernetes cluster](../clusters/integrate-cluster.md) or [created a Kubernetes cluster](../clusters/create-cluster.md), and can access the cluster UI interface.
+
+- Created a [namespace](../namespaces/createns.md), [user](../../../ghippo/access-control/user.md), [deployment](../workloads/create-deployment.md) or [statefulset](../workloads/create-statefulset.md).
+
+- The current operating user should have [NS Editor](../permissions/permission-brief.md#ns-editor) or higher permissions. For details, refer to [Namespace Authorization](../namespaces/createns.md).
+
+- The current cluster has installed the [ __metrics-server__ ](install-metrics-server.md) and [ __VPA__ ](install-vpa.md) plugins.
+
+## Steps
+
+Refer to the following steps to configure a vertical scaling (VPA) policy for the deployment.
+
+1. Find the current cluster in __Clusters__ , and click the name of the target cluster.
+
+    ![Clusters](../images/deploy01.png)
+
+2. Click __Deployments__ in the left navigation bar, find the deployment that needs to create a VPA, and click the name of the deployment.
+
+    ![Deployments](../images/createScale.png)
+
+3. Click the __Auto Scaling__ tab to view the auto scaling configuration of the current cluster, and confirm that the relevant plugins have been installed and are running normally.
+
+    ![VPA](../images/createVpaScale.png)
+
+4. 
Click the __Create Autoscaler__ button and configure the VPA vertical scaling policy parameters. + + ![Create Autoscaler](../images/createVpaScale01.png) + + - Policy name: Enter the name of the vertical scaling policy. Please note that the name can contain up to 63 characters, and can only contain lowercase letters, numbers, and separators ("-"), and must start and end with lowercase letters or numbers, such as vpa- my-dep. + - Scaling mode: Run the method of modifying the CPU and memory request values. Currently, vertical scaling supports manual and automatic scaling modes. + - Manual scaling: After the vertical scaling policy calculates the recommended resource configuration value, the user needs to manually modify the resource quota of the application. + - Auto-scaling: The vertical scaling policy automatically calculates and modifies the resource quota of the application. + - Target container: Select the container to be scaled vertically. + +5. After completing the parameter configuration, click the __OK__ button to automatically return to the elastic scaling details page. Click __┇__ on the right side of the list to perform edit and delete operations. + + ![Successfully Configurate](../images/createVpaScale02.png) diff --git a/docs/en/docs/end-user/kpanda/scale/custom-hpa.md b/docs/en/docs/end-user/kpanda/scale/custom-hpa.md new file mode 100644 index 0000000000..3e3e56b6d4 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/scale/custom-hpa.md @@ -0,0 +1,182 @@ +# Creating HPA Based on Custom Metrics + +When the built-in CPU and memory metrics in the system do not meet your business needs, +you can add custom metrics by configuring ServiceMonitoring and achieve auto-scaling based +on these custom metrics. This article will introduce how to configure auto-scaling for +workloads based on custom metrics. + +!!! note + + 1. HPA is only applicable to Deployment and StatefulSet, and each workload can only create one HPA. + 2. If both built-in metrics and multiple custom metrics are used, HPA will calculate the required number + of scaled replicas based on multiple metrics respectively, and take the larger value + (but not exceeding the maximum number of replicas configured when setting the HPA policy) for scaling. + +## Prerequisites + +Before configuring the custom metrics auto-scaling policy for workloads, the following prerequisites must be met: + +- [Integrated Kubernetes cluster](../clusters/integrate-cluster.md) or + [created Kubernetes cluster](../clusters/create-cluster.md), and able to access the cluster's UI interface. +- Completed creation of a [namespace](../namespaces/createns.md), [deployment](../workloads/create-deployment.md), + or [statefulSet](../workloads/create-statefulset.md). +- The current user should have permissions higher than [NS Editor](../permissions/permission-brief.md#ns-editor). + For details, refer to [namespace authorization](../namespaces/createns.md). +- [metrics-server plugin](install-metrics-server.md) has been installed. +- [insight-agent plugin](../../../insight/quickstart/install/install-agent.md) has been installed. +- Prometheus-adapter plugin has been installed. + +## Steps + +Refer to the following steps to configure the auto-scaling policy based on metrics for workloads. + +1. Click __Clusters__ in the left navigation bar to enter the clusters page. + Click a cluster name to enter the __Cluster Overview__ page. + + ![Select a cluster](../images/autoscaling01.png) + +2. 
On the Cluster Details page, click __Workloads__ in the left navigation bar to enter the workload list, + and click a workload name to enter the __Workload Details__ page. + + ![Select a workload](../images/autoscaling02.png) + +3. Click the __Auto Scaling__ tab to view the current autoscaling configuration of the cluster. + + ![Autoscaling](../images/autoscaling03.png) + +4. Confirm that the cluster has [installed metrics-server](install-metrics-server.md), Insight, + and Prometheus-adapter plugins, and that the plugins are running normally, then click the __Create AutoScaler__ button. + + !!! note + + If the related plugins are not installed or the plugins are in an abnormal state, + you will not be able to see the entry for creating custom metrics auto-scaling on the page. + + ![Autoscaling](../images/autoscaling04.png) + +5. Create custom metrics auto-scaling policy parameters. + + + + - Policy Name: Enter the name of the auto-scaling policy. Note that the name can be up to 63 characters long, + can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter + or number, e.g., hpa-my-dep. + - Namespace: The namespace where the workload is located. + - Workload: The workload object that performs auto-scaling. + - Resource Type: The type of custom metric being monitored, including Pod and Service types. + - Metric: The name of the custom metric created using ServiceMonitoring or the name of the system-built custom metric. + - Data Type: The method used to calculate the metric value, including target value and target average value. + When the resource type is Pod, only the target average value can be used. + +## Operation Example + +This case takes a Golang business program as an example. The example program exposes the +`httpserver_requests_total` metric and records HTTP requests. This metric can be used to +calculate the QPS value of the business program. + +### Deploy Business Program + +Use Deployment to deploy the business program: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: httpserver + namespace: httpserver +spec: + replicas: 1 + selector: + matchLabels: + app: httpserver + template: + metadata: + labels: + app: httpserver + spec: + containers: + - name: httpserver + image: registry.imroc.cc/test/httpserver:custom-metrics + imagePullPolicy: Always +--- + +apiVersion: v1 +kind: Service +metadata: + name: httpserver + namespace: httpserver + labels: + app: httpserver + annotations: + prometheus.io/scrape: "true" + prometheus.io/path: "/metrics" + prometheus.io/port: "http" +spec: + type: ClusterIP + ports: + - port: 80 + protocol: TCP + name: http + selector: + app: httpserver +``` + +### Prometheus Collects Business Monitoring + +If the insight-agent is installed, Prometheus can be configured by creating a ServiceMonitor CRD object. + +Operation steps: In **Cluster Details** -> **Custom Resources**, search for “servicemonitors.monitoring.coreos.com", +click the name to enter the details. Create the following example CRD in the **httpserver** namespace via YAML: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: httpserver + namespace: httpserver + labels: + operator.insight.io/managed-by: insight +spec: + endpoints: + - port: http + interval: 5s + namespaceSelector: + matchNames: + - httpserver + selector: + matchLabels: + app: httpserver +``` + + + +!!! 
note + + If Prometheus is installed via insight, the serviceMonitor must be labeled with + `operator.insight.io/managed-by: insight`. If installed by other means, this label is not required. + +### Configure Metric Rules in Prometheus-adapter + +steps: In **Clusters** -> **Helm Apps**, search for “prometheus-adapter",enter the update page through the action bar, +and configure custom metrics in YAML as follows: + +```yaml +rules: + custom: + - metricsQuery: sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>) + name: + as: httpserver_requests_qps + matches: httpserver_requests_total + resources: + template: <<.Resource>> + seriesQuery: httpserver_requests_total +``` + + + +### Create Custom Metrics Auto-scaling Policy Parameters + +Follow the above steps to find the application httpserver in the Deployment +and create auto-scaling via custom metrics. + + diff --git a/docs/en/docs/end-user/kpanda/scale/hpa-cronhpa-compatibility-rules.md b/docs/en/docs/end-user/kpanda/scale/hpa-cronhpa-compatibility-rules.md new file mode 100644 index 0000000000..ccfe8b4171 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/scale/hpa-cronhpa-compatibility-rules.md @@ -0,0 +1,47 @@ +# Compatibility Rules for HPA and CronHPA + +HPA stands for HorizontalPodAutoscaler, which refers to horizontal pod auto-scaling. + +CronHPA stands for Cron HorizontalPodAutoscaler, which refers to scheduled horizontal pod auto-scaling. + +## Conflict Between CronHPA and HPA + +Scheduled scaling with CronHPA triggers horizontal pod scaling at specified times. +To prevent sudden traffic surges, you may have configured HPA to ensure the normal operation +of your application. If both HPA and CronHPA are detected simultaneously, conflicts arise +because CronHPA and HPA operate independently without awareness of each other. +Consequently, the actions performed last will override those executed first. + +By comparing the definition templates of CronHPA and HPA, the following points can be observed: + +- Both CronHPA and HPA use the `scaleTargetRef` field to identify the scaling target. +- CronHPA schedules the number of replicas to scale based on crontab rules in jobs. +- HPA determines scaling based on resource utilization. + +!!! note + + If both CronHPA and HPA are set, there will be scenarios where CronHPA and HPA + simultaneously operate on a single `scaleTargetRef`. + +## Compatibility Solution for CronHPA and HPA + +As noted above, the fundamental reason that simultaneous use of CronHPA and HPA results in +the later action overriding the earlier one is that the two controllers cannot sense each other. +Therefore, the conflict can be resolved by enabling CronHPA to be aware of HPA's current state. + +The system will treat HPA as the scaling object for CronHPA, thus achieving scheduled scaling +for the Deployment object defined by the HPA. + +HPA's definition configures the Deployment in the `scaleTargetRef` field, and then the Deployment +uses its definition to locate the ReplicaSet, which ultimately adjusts the actual number of replicas. + +In AI platform, the `scaleTargetRef` in CronHPA is set to the HPA object, and it uses the HPA object +to find the actual `scaleTargetRef`, allowing CronHPA to be aware of HPA's current state. + + + +CronHPA senses HPA by adjusting HPA. CronHPA determines whether scaling is needed and modifies +the HPA upper limit by comparing the target number of replicas with the current number of replicas, +choosing the larger value. 
Similarly, CronHPA determines whether to modify the HPA lower limit by
+comparing the target number of replicas from CronHPA with the configuration in HPA,
+choosing the smaller value.
diff --git a/docs/en/docs/end-user/kpanda/scale/install-cronhpa.md b/docs/en/docs/end-user/kpanda/scale/install-cronhpa.md
new file mode 100644
index 0000000000..b4d26e7c15
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/scale/install-cronhpa.md
@@ -0,0 +1,63 @@
+---
+MTPE: FanLin
+Date: 2024-02-29
+---
+
+# Install kubernetes-cronhpa-controller
+
+The Cron Horizontal Pod Autoscaler (CronHPA) scales Pod replicas on a schedule, providing a stable compute resource guarantee for periodic high-concurrency applications. __kubernetes-cronhpa-controller__ is the key component that implements CronHPA.
+
+This section describes how to install the __kubernetes-cronhpa-controller__ plugin.
+
+!!! note
+
+    To use CronHPA, you need to install not only the __kubernetes-cronhpa-controller__ plugin but also the [__metrics-server__ plugin](install-metrics-server.md).
+
+## Prerequisites
+
+Before installing the __kubernetes-cronhpa-controller__ plugin, the following prerequisites need to be met:
+
+- In the [Container Management](../../intro/index.md) module, you have [accessed a Kubernetes cluster](../clusters/integrate-cluster.md) or [created a Kubernetes cluster](../clusters/create-cluster.md), and can access the cluster UI interface.
+
+- Created a [namespace](../namespaces/createns.md).
+
+- The current operating user should have [NS Editor](../permissions/permission-brief.md#ns-editor) or higher permissions. For details, refer to [Namespace Authorization](../namespaces/createns.md).
+
+## Steps
+
+Refer to the following steps to install the __kubernetes-cronhpa-controller__ plugin for the cluster.
+
+1. On the __Clusters__ page, find the target cluster where the plugin needs to be installed, click the name of the cluster, then click __Workloads__ -> __Deployments__ on the left, and click the name of the target workload.
+
+2. On the workload details page, click the __Auto Scaling__ tab, and click __Install__ on the right side of __CronHPA__ .
+
+    ![Auto Scaling](../images/installcronhpa.png)
+
+3. Read the introduction of the plugin, select the version and click the __Install__ button. It is recommended to install __1.3.0__ or later.
+
+    ![Install](../images/installcronhpa1.png)
+
+4. Refer to the following instructions to configure the parameters.
+
+    ![Config](../images/installcronhpa2.png)
+
+    - Name: Enter the plugin name. Please note that the name can be up to 63 characters, can only contain lowercase letters, numbers, and separators ("-"), and must start and end with a lowercase letter or number, such as kubernetes-cronhpa-controller.
+    - Namespace: Select which namespace the plugin will be installed in, here we take __default__ as an example.
+    - Version: The version of the plugin, here we take the __1.3.0__ version as an example.
+    - Ready Wait: When enabled, it will wait for all associated resources under the application to be in the ready state before marking the application installation as successful.
+    - Failed to delete: If the plugin installation fails, delete the associated resources that have already been installed. When enabled, __Ready Wait__ will be enabled synchronously by default.
+    - Detailed log: When enabled, a detailed log of the installation process will be recorded.
+
+    !!! 
note + + After enabling __ready wait__ and/or __failed deletion__ , it takes a long time for the application to be marked as "running". + +5. Click __OK__ in the lower right corner of the page, and the system will automatically jump to the __Helm Apps__ list page. Wait a few minutes and refresh the page to see the application you just installed. + + !!! warning + + If you need to delete the __kubernetes-cronhpa-controller__ plugin, you should go to the __Helm Apps__ list page to delete it completely. + + If you delete the plug-in under the __Auto Scaling__ tab of the workload, this only deletes the workload copy of the plug-in, and the plug-in itself is still not deleted, and an error will be prompted when the plug-in is reinstalled later. + +6. Go back to the __Auto Scaling__ tab under the workload details page, and you can see that the interface displays __Plug-in installed__ . Now it's time to start creating CronHPA policies. diff --git a/docs/en/docs/end-user/kpanda/scale/install-metrics-server.md b/docs/en/docs/end-user/kpanda/scale/install-metrics-server.md new file mode 100644 index 0000000000..e22ea16859 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/scale/install-metrics-server.md @@ -0,0 +1,143 @@ +--- +MTPE: FanLin +Date: 2024-02-29 +--- + +# Install metrics-server + +__metrics-server__ is the built-in resource usage metrics collection component of Kubernetes. +You can automatically scale Pod copies horizontally for workload resources by configuring HPA policies. + +This section describes how to install __metrics-server__ . + +## Prerequisites + +Before installing the __metrics-server__ plugin, the following prerequisites need to be met: + +- [Integrated the Kubernetes cluster](../clusters/integrate-cluster.md) or + [created the Kubernetes cluster](../clusters/create-cluster.md), + and you can access the UI interface of the cluster. + +- Created a [namespace](../namespaces/createns.md). + +- You should have permissions not lower than [NS Editor](../permissions/permission-brief.md#ns-editor). + For details, refer to [Namespace Authorization](../namespaces/createns.md). + +## Steps + +Please perform the following steps to install the __metrics-server__ plugin for the cluster. + +1. On the Auto Scaling page under workload details, click the __Install__ button to enter the __metrics-server__ plug-in installation interface. + + ![metrics-server](../images/createScale04.png) + +2. Read the introduction of the __metrics-server__ plugin, select the version and click the __Install__ button. This page will use the __3.8.2__ version as an example to install, and it is recommended that you install __3.8.2__ and later versions. + + ![Install](../images/createScale05.png) + +3. Configure basic parameters on the installation configuration interface. + + ![Config](../images/createScale06.png) + + - Name: Enter the plugin name, please note that the name can be up to 63 characters, can only contain lowercase letters, numbers and separators ("-"), and must start and end with lowercase letters or numbers, such as metrics-server-01. + - Namespace: Select the namespace for plugin installation, here we take __default__ as an example. + - Version: The version of the plugin, here we take __3.8.2__ version as an example. + - Ready Wait: When enabled, it will wait for all associated resources under the application to be ready before marking the application installation as successful. + - Failed to delete: After it is enabled, the synchronization will be enabled by default and ready to wait. 
If the installation fails, the installation-related resources will be removed. + - Verbose log: Turn on the verbose output of the installation process log. + + !!! note + + After enabling __Wait__ and/or __Deletion failed__ , it takes a long time for the app to be marked as __Running__ . + +4. Advanced parameter configuration + + - If the cluster network cannot access the __k8s.gcr.io__ repository, please try to modify the __repositort__ parameter to __repository: k8s.m.daocloud.io/metrics-server/metrics-server__ . + + - An SSL certificate is also required to install the __metrics-server__ plugin. To bypass certificate verification, you need to add __- --kubelet-insecure-tls__ parameter at __defaultArgs:__ . + + ??? note "Click to view and use the YAML parameters to replace the default __YAML__ " + + ```yaml + image: + repository: k8s.m.daocloud.io/metrics-server/metrics-server # Change the registry source address to k8s.m.daocloud.io + tag: '' + pullPolicy: IfNotPresent + imagePullSecrets: [] + nameOverride: '' + fullnameOverride: '' + serviceAccount: + create: true + annotations: {} + name: '' + rbac: + create: true + pspEnabled: false + apiService: + create: true + podLabels: {} + podAnnotations: {} + podSecurityContext: {} + securityContext: + allowPrivilegeEscalation: false + readOnlyRootFilesystem: true + runAsNonRoot: true + runAsUser: 1000 + priorityClassName: system-cluster-critical + containerPort: 4443 + hostNetwork: + enabled: false + replicas: 1 + updateStrategy: {} + podDisruptionBudget: + enabled: false + minAvailable: null + maxUnavailable: null + defaultArgs: + - '--cert-dir=/tmp' + - '--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname' + - '--kubelet-use-node-status-port' + - '--metric-resolution=15s' + - --kubelet-insecure-tls # Bypass certificate verification + args: [] + livenessProbe: + httpGet: + path: /livez + port:https + scheme: HTTPS + initialDelaySeconds: 0 + periodSeconds: 10 + failureThreshold: 3 + readinessProbe: + httpGet: + path: /readyz + port:https + scheme: HTTPS + initialDelaySeconds: 20 + periodSeconds: 10 + failureThreshold: 3 + service: + type: ClusterIP + port: 443 + annotations: {} + labels: {} + metrics: + enabled: false + serviceMonitor: + enabled: false + additionalLabels: {} + interval: 1m + scrapeTimeout: 10s + resources: {} + extraVolumeMounts: [] + extraVolumes: [] + nodeSelector: {} + tolerations: [] + affinity: {} + ``` + +5. Click the __OK__ button to complete the installation of the __metrics-server__ plug-in, and then the system will automatically jump to the __Helm Apps__ list page. After a few minutes, refresh the page and you will see the newly installed Applications. + +!!! note + + When deleting the __metrics-server__ plugin, the plugin can only be completely deleted on the __Helm Applications__ list page. If you only delete __metrics-server__ on the workload page, this only deletes the workload copy of the application, the application itself is still not deleted, and an error will be prompted when you reinstall the plugin later. \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/scale/install-vpa.md b/docs/en/docs/end-user/kpanda/scale/install-vpa.md new file mode 100644 index 0000000000..7e8bcb52c5 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/scale/install-vpa.md @@ -0,0 +1,63 @@ +--- +MTPE: FanLin +Date: 2024-02-29 +--- + +# Install vpa + +The Vertical Pod Autoscaler, VPA, can make the resource allocation of the cluster more reasonable and avoid the waste of cluster resources. 
__vpa__ is the key component that implements vertical autoscaling for containers.

This section describes how to install the __vpa__ plugin.

!!! note

    To use VPA policies, you must install the [__metrics-server__ plugin](install-metrics-server.md) in addition to the __vpa__ plugin.

## Prerequisites

Before installing the __vpa__ plugin, make sure the following prerequisites are met:

- The cluster has been [integrated](../clusters/integrate-cluster.md) or [created](../clusters/create-cluster.md) in the [Container Management](../../intro/index.md) module, and its UI is accessible.

- A [namespace](../namespaces/createns.md) has been created.

- The current user has [NS Editor](../permissions/permission-brief.md#ns-editor) or higher permissions. For details, refer to [Namespace Authorization](../namespaces/createns.md).

## Steps

Refer to the following steps to install the __vpa__ plugin for the cluster.

1. On the __Clusters__ page, find the target cluster, click its name, click __Workloads__ -> __Deployments__ in the left navigation bar, and then click the name of the target workload.

2. On the workload details page, click the __Auto Scaling__ tab, and click __Install__ on the right side of __VPA__ .

    ![vpa](../images/installvpa.png)

3. Read the plugin introduction, select a version, and click the __Install__ button. Version __1.5.0__ or later is recommended.

    ![Install](../images/installvpa1.png)

4. Configure the parameters as described below.

    ![Config](../images/installvpa2.png)

    - Name: Enter the plugin name. The name can contain up to 63 characters, may only include lowercase letters, numbers, and hyphens ("-"), and must start and end with a lowercase letter or number, for example vpa.
    - Namespace: Select the namespace in which the plugin will be installed; __default__ is used here as an example.
    - Version: The plugin version; __1.5.0__ is used here as an example.
    - Ready Wait: When enabled, the application is marked as successfully installed only after all of its associated resources are in the ready state.
    - Failed to delete: If the installation fails, delete the associated resources that were already installed. When this is enabled, __Ready Wait__ is enabled synchronously by default.
    - Detailed log: When enabled, a detailed log of the installation process is recorded.

    !!! note

        After enabling __Ready Wait__ and/or __Failed to delete__ , it may take a while for the application to be marked as __running__ .

5. Click __OK__ in the lower right corner of the page. The system automatically redirects to the __Helm Apps__ list page. Wait a few minutes and refresh the page to see the application you just installed.

    !!! warning

        To delete the __vpa__ plugin, delete it completely from the __Helm Apps__ list page.

        If you delete the plugin under the __Auto Scaling__ tab of the workload, only the workload's copy of the plugin is removed; the plugin itself remains, and an error will be reported when you try to reinstall it later.

6. Go back to the __Auto Scaling__ tab on the workload details page; the interface now shows __Plug-in installed__ . You can start creating a [VPA policy](create-vpa.md); a minimal manifest sketch is shown below for reference.
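The following is an illustrative `VerticalPodAutoscaler` object only: the Deployment name `my-app`, the namespace, and the resource bounds are placeholders, and the platform's __Create VPA__ page can generate an equivalent object for you, so treat this as a reference sketch rather than the exact manifest the platform applies.

```yaml
# Sketch of a VPA targeting a Deployment named "my-app" (placeholder).
# updateMode "Off" only produces recommendations without evicting Pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"        # switch to "Auto" to let VPA evict and resize Pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "1"
          memory: 1Gi
```

After applying such an object, the current recommendations can be read back with `kubectl describe vpa my-app-vpa -n default`.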
+ + ![Finish Creation](../images/installvpa3.png) diff --git a/docs/en/docs/end-user/kpanda/scale/knative/install.md b/docs/en/docs/end-user/kpanda/scale/knative/install.md new file mode 100644 index 0000000000..d0b245d8d2 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/scale/knative/install.md @@ -0,0 +1,26 @@ +--- +MTPE: FanLin +Date: 2024-02-29 +--- + +# Installation + +Knative is a platform-agnostic solution for running serverless deployments. + +## Steps + +1. Log in to the cluster, click the sidebar __Helm Apps__ → __Helm Charts__ , enter __knative__ in the search box at the top right, and then press the enter key to search. + + ![Install-1](../../images/knative-install-1.png) + +2. Click the __knative-operator__ to enter the installation configuration interface. You can view the available versions and the Parameters optional items of Helm values on this interface. + + ![Install-2](../../images/knative-install-2.png) + +3. After clicking the install button, you will enter the installation configuration interface. + + ![Install-3](../../images/knative-install-3.png) + +4. Enter the name, installation tenant, and it is recommended to check __Wait__ and __Detailed Logs__ . + +5. In the settings below, you can tick __Serving__ and enter the installation tenant of the Knative Serving component, which will deploy the Knative Serving component after installation. This component is managed by the Knative Operator. diff --git a/docs/en/docs/end-user/kpanda/scale/knative/knative.md b/docs/en/docs/end-user/kpanda/scale/knative/knative.md new file mode 100644 index 0000000000..30a771e86e --- /dev/null +++ b/docs/en/docs/end-user/kpanda/scale/knative/knative.md @@ -0,0 +1,62 @@ +--- +MTPE: windsonsea +Date: 2024-07-19 +--- + +# Knative Introduction + +Knative provides a higher level of abstraction, simplifying and speeding up the process of building, deploying, and managing applications on Kubernetes. It allows developers to focus more on implementing business logic, while leaving most of the infrastructure and operations work to Knative, significantly improving productivity. + +## Components + +The Knative operator runs the following components. + +```shell +knative-operator knative-operator-58f7d7db5c-7f6r5 1/1 Running 0 6m55s +knative-operator operator-webhook-667dc67bc-qvrv4 1/1 Running 0 6m55s +``` + +The Knative serving components are as follows. + +```shell +knative-serving 3scale-kourier-gateway-d69fbfbd-bd8d8 1/1 Running 0 7m13s +knative-serving activator-7c6fddd698-wdlng 1/1 Running 0 7m3s +knative-serving autoscaler-8f4b876bb-kd25p 1/1 Running 0 7m17s +knative-serving autoscaler-hpa-5f7f74679c-vkc7p 1/1 Running 0 7m15s +knative-serving controller-789c896c46-tfvsv 1/1 Running 0 7m17s +knative-serving net-kourier-controller-7db578c889-7gd5l 1/1 Running 0 7m14s +knative-serving webhook-5c88b94c5-78x7m 1/1 Running 0 7m1s +knative-serving storage-version-migration-serving-serving-1.12.2-t7zvd 0/1 Completed 0 7m15s +``` + +| Component | Features | +|----------|-------------| +| Activator | Queues requests (if a Knative Service has scaled to zero). Calls the autoscaler to bring back services that have scaled down to zero and forward queued requests. The Activator can also act as a request buffer, handling bursts of traffic. | +| Autoscaler | Responsible for scaling Knative services based on configuration, metrics, and incoming requests. | +| Controller | Manages the state of Knative CRs. It monitors multiple objects, manages the lifecycle of dependent resources, and updates resource status. 
| +| Queue-Proxy | Sidecar container injected into each Knative Service. Responsible for collecting traffic data and reporting it to the Autoscaler, which then initiates scaling requests based on this data and preset rules. | +| Webhooks | Knative Serving has several Webhooks responsible for validating and mutating Knative resources. | + +## Ingress Traffic Entry Solutions + +| Solution | Use Case | +|----------|-------------| +| Istio | If Istio is already in use, it can be chosen as the traffic entry solution. | +| Contour | If Contour has been enabled in the cluster, it can be chosen as the traffic entry solution. | +| Kourier | If neither of the above two Ingress components are present, Knative's Envoy-based Kourier Ingress can be used as the traffic entry solution. | + +## Autoscaler Solutions Comparison + +| Autoscaler Type | Core Part of Knative Serving | Default Enabled | Scale to Zero Support | CPU-based Autoscaling Support | +| -------------- | ---------- | ------------ | ------------------ | -------------- | +| Knative Pod Autoscaler (KPA) | Yes | Yes | Yes | No | +| Horizontal Pod Autoscaler (HPA) | No | Needs to be enabled after installing Knative Serving | No | Yes | + +## CRD + +| Resource Type | API Name | Description | +| ------------- | -------- | ----------- | +| Services | `service.serving.knative.dev` | Automatically manages the entire lifecycle of Workloads, controls the creation of other objects, ensures applications have Routes, Configurations, and new revisions with each update. | +| Routes | `route.serving.knative.dev` | Maps network endpoints to one or more revision versions, supports traffic distribution and version routing. | +| Configurations | `configuration.serving.knative.dev` | Maintains the desired state of deployments, provides separation between code and configuration, follows the Twelve-Factor App methodology, modifying configurations creates new revisions. | +| Revisions | `revision.serving.knative.dev` | Snapshot of the workload at each modification time point, immutable object, automatically scales based on traffic. | diff --git a/docs/en/docs/end-user/kpanda/scale/knative/playground.md b/docs/en/docs/end-user/kpanda/scale/knative/playground.md new file mode 100644 index 0000000000..c7ee68cb04 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/scale/knative/playground.md @@ -0,0 +1,145 @@ +# Knative Practices + +In this section, we will delve into learning Knative through several practical exercises. + +## case 1 - Hello World + +```yaml +apiVersion: serving.knative.dev/v1 +kind: Service +metadata: + name: hello +spec: + template: + spec: + containers: + - image: m.daocloud.io/ghcr.io/knative/helloworld-go:latest + ports: + - containerPort: 8080 + env: + - name: TARGET + value: "World" +``` + +You can use `kubectl` to check the status of a deployed application that has been automatically configured with ingress and scalers by Knative. + +```shell +~ kubectl get service.serving.knative.dev/hello +NAME URL LATESTCREATED LATESTREADY READY REASON +hello http://hello.knative-serving.knative.loulan.me hello-00001 hello-00001 True +``` + +The deployed Pod YAML is as follows, consisting of two Pods: `user-container` and `queue-proxy`. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: hello-00003-deployment-5fcb8ccbf-7qjfk +spec: + containers: + - name: user-container + - name: queue-proxy +``` + +![knative-request-flow](../../../images/knative-request-flow.png) + +Request Flow: + +1. 
case1 When there is low traffic or no traffic, traffic will be routed to the activator. +2. case2 When there is high traffic, traffic will be routed directly to the Pod only if it exceeds the target-burst-capacity. + 1. Configured as 0, expansion from 0 is the only scenario. + 2. Configured as -1, the activator will always be present in the request path. + 3. Configured as >0, the number of additional concurrent requests that the system can handle before triggering scaling. +3. case3 When the traffic decreases again, traffic will be routed back to the activator if the traffic is lower than current_demand + target-burst-capacity > (pods * concurrency-target). + + The total number of pending requests + the number of requests that can exceed the target concurrency > the target concurrency per Pod * number of Pods. + +## case 2 - Based on Concurrent Elastic Scaling + +We first apply the following YAML definition under the cluster. + +```yaml +apiVersion: serving.knative.dev/v1 +kind: Service +metadata: + name: hello +spec: + template: + metadata: + annotations: + autoscaling.knative.dev/target: "1" + autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev" + spec: + containers: + - image: m.daocloud.io/ghcr.io/knative/helloworld-go:latest + ports: + - containerPort: 8080 + env: + - name: TARGET + value: "World" +``` + +Execute the following command for testing, and you can observe the scaling of the Pods by using `kubectl get pods -A -w`. + +```shell +wrk -t2 -c4 -d6s http://hello.knative-serving.knative.daocloud.io/ +``` + +## case 3 - Based on concurrent elastic scaling, scale out in advance to reach a specific ratio. + +We can easily achieve this, for example, by limiting the concurrency to 10 per container. This can be implemented through `autoscaling.knative.dev/target-utilization-percentage: 70`, starting to scale out the Pods when 70% is reached. + +```yaml +apiVersion: serving.knative.dev/v1 +kind: Service +metadata: + name: hello +spec: + template: + metadata: + annotations: + autoscaling.knative.dev/target: "10" + autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev" +        autoscaling.knative.dev/target-utilization-percentage: "70" +        autoscaling.knative.dev/metric: "concurrency" +     spec: + containers: + - image: m.daocloud.io/ghcr.io/knative/helloworld-go:latest + ports: + - containerPort: 8080 + env: + - name: TARGET + value: "World" +``` + +## case 4 - Canary Release/Traffic Percentage + +We can control the distribution of traffic to each version through `spec.traffic`. 
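The percentages across all `traffic` entries must add up to 100; `latestRevision: true` follows the latest ready revision, while entries with `revisionName` pin a share of traffic to a specific revision. After applying the manifest shown below, one way to confirm how traffic is actually split — assuming the Service is still named `hello` in the current namespace — is to read the split recorded in the Service status:

```shell
# Show the resolved traffic split from the Knative Service status.
kubectl get ksvc hello -o jsonpath='{.status.traffic}'
```

The full Service manifest for this case is as follows.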
+ +```yaml +apiVersion: serving.knative.dev/v1 +kind: Service +metadata: + name: hello +spec: + template: + metadata: + annotations: + autoscaling.knative.dev/target: "1" + autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev" + spec: + containers: + - image: m.daocloud.io/ghcr.io/knative/helloworld-go:latest + ports: + - containerPort: 8080 + env: + - name: TARGET + value: "World" + traffic: + - latestRevision: true + percent: 50 + - latestRevision: false + percent: 50 + revisionName: hello-00001 +``` diff --git a/docs/en/docs/end-user/kpanda/scale/knative/scene.md b/docs/en/docs/end-user/kpanda/scale/knative/scene.md new file mode 100644 index 0000000000..64d0a5cf4a --- /dev/null +++ b/docs/en/docs/end-user/kpanda/scale/knative/scene.md @@ -0,0 +1,15 @@ +# Use Cases + +## Suitable Cases + +* High concurrency business with short connections +* Businesses that require elastic scaling +* A large number of applications need to scale down to 0 to improve resource utilization +* AI Serving services that scale based on specific metrics + +## Unsuitable Cases + +* Long-lived connection business +* Latency-sensitive business +* Traffic splitting based on cookies +* Traffic splitting based on headers diff --git a/docs/en/docs/end-user/kpanda/security/audit.md b/docs/en/docs/end-user/kpanda/security/audit.md new file mode 100644 index 0000000000..e6de9f0cbe --- /dev/null +++ b/docs/en/docs/end-user/kpanda/security/audit.md @@ -0,0 +1,61 @@ +# Permission Scan + +To use the [Permission Scan](index.md) feature, you need to create a scan policy first. After executing the policy, a scan report will be automatically generated for viewing. + +## Create a Scan Policy + +1. On the left navigation bar of the homepage in the Container Management module, click __Security Management__ . + + + +2. Click __Permission Scan__ on the left navigation bar, then click the __Scan Policy__ tab and click __Create Scan Policy__ on the right. + + + +3. Fill in the configuration according to the following instructions, and then click __OK__ . + + - Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters accessed or created in the [Container Management](../../intro/index.md) module. If the desired cluster is not available, you can access or create a cluster in the Container Management module. + - Scan Type: + + - Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later. + - Scheduled scan: Automatically repeat the scan at scheduled intervals. + + - Number of Scan Reports to Keep: Set the maximum number of scan reports to be kept. When the specified retention quantity is exceeded, delete from the earliest report. + + + +## Update/Delete Scan Policies + +After creating a scan policy, you can update or delete it as needed. + +Under the __Scan Policy__ tab, click the __┇__ action button to the right of a configuration: + +- For periodic scan policies: + + - Select __Execute Immediately__ to perform an additional scan outside the regular schedule. + - Select __Disable__ to interrupt the scanning plan until __Enable__ is clicked to resume executing the scan policy according to the scheduling plan. + - Select __Edit__ to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed. + - Select __Delete__ to delete the configuration. 
+ +- For one-time scan policies: Only support the __Delete__ operation. + + + +## View Scan Reports + +1. Under the __Security Management__ -> __Permission Scanning__ -> __Scan Reports__ tab, click the report name. + + > Clicking __Delete__ on the right of a report allows you to manually delete the report. + + + +2. View the scan report content, including: + + - The target cluster scanned. + - The scan policy used. + - The total number of scan items, warnings, and errors. + - In periodic scan reports generated by periodic scan policies, you can also view the scan frequency. + - The start time of the scan. + - Check details, such as the checked resources, resource types, scan results, error types, and error details. + + \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/security/cis/config.md b/docs/en/docs/end-user/kpanda/security/cis/config.md new file mode 100644 index 0000000000..c58eaa309c --- /dev/null +++ b/docs/en/docs/end-user/kpanda/security/cis/config.md @@ -0,0 +1,38 @@ +# Scan Configuration + +The first step in using [CIS Scanning](../index.md) is to create a scan configuration. Based on the scan configuration, you can then create scan policies, execute scan policies, and finally view scan results. + +## Create a Scan Configuration + +The steps for creating a scan configuration are as follows: + +1. Click __Security Management__ in the left navigation bar of the homepage of the container management module. + + + +2. By default, enter the __Compliance Scanning__ page, click the __Scan Configuration__ tab, and then click __Create Scan Configuration__ in the upper-right corner. + + + +3. Fill in the configuration name, select the configuration template, and optionally check the scan items, then click __OK__ . + + Scan Template: Currently, two templates are provided. The __kubeadm__ template is suitable for general Kubernetes clusters. The __daocloud__ template ignores scan items that are not applicable to AI platform based on the __kubeadm__ template and the platform design of AI platform. + + + +## View Scan Configuration + +Under the scan configuration tab, clicking the name of a scan configuration displays the type of the configuration, the number of scan items, the creation time, the configuration template, and the specific scan items enabled for the configuration. + + + +## Updat/Delete Scan Configuration + +After a scan configuration has been successfully created, it can be updated or deleted according to your needs. + +Under the scan configuration tab, click the __┇__ action button to the right of a configuration: + +- Select __Edit__ to update the configuration. You can update the description, template, and scan items. The configuration name cannot be changed. +- Select __Delete__ to delete the configuration. + + \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/security/cis/policy.md b/docs/en/docs/end-user/kpanda/security/cis/policy.md new file mode 100644 index 0000000000..a368bcb96d --- /dev/null +++ b/docs/en/docs/end-user/kpanda/security/cis/policy.md @@ -0,0 +1,39 @@ +# Scan Policy + +## Create a Scan Policy + +After creating a scan configuration, you can create a scan policy based on the configuration. + +1. Under the __Security Management__ -> __Compliance Scanning__ page, click the __Scan Policy__ tab on the right to create a scan policy. + + + +2. Fill in the configuration according to the following instructions and click __OK__ . + + - Cluster: Select the cluster to be scanned. 
The optional cluster list comes from the clusters accessed or created in the [Container Management](../../../intro/index.md) module. If the desired cluster is not available, you can access or create a cluster in the Container Management module. + - Scan Configuration: Select a pre-created scan configuration. The scan configuration determines which specific scan items need to be performed. + - Scan Type: + + - Immediate scan: Perform a scan immediately after the scan policy is created. It cannot be automatically/manually executed again later. + - Scheduled scan: Automatically repeat the scan at scheduled intervals. + + - Number of Scan Reports to Keep: Set the maximum number of scan reports to be kept. When the specified retention quantity is exceeded, delete from the earliest report. + + + +## Update/Delete Scan Policies + +After creating a scan policy, you can update or delete it as needed. + +Under the __Scan Policy__ tab, click the __┇__ action button to the right of a configuration: + +- For periodic scan policies: + + - Select __Execute Immediately__ to perform an additional scan outside the regular schedule. + - Select __Disable__ to interrupt the scanning plan until __Enable__ is clicked to resume executing the scan policy according to the scheduling plan. + - Select __Edit__ to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The configuration name and the target cluster to be scanned cannot be changed. + - Select __Delete__ to delete the configuration. + +- For one-time scan policies: Only support the __Delete__ operation. + + \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/security/cis/report.md b/docs/en/docs/end-user/kpanda/security/cis/report.md new file mode 100644 index 0000000000..d520c2d669 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/security/cis/report.md @@ -0,0 +1,26 @@ +--- +hide: + - toc +--- + +# Scan Report + +After executing a scan policy, a scan report will be generated automatically. You can view the scan report online or download it to your local computer. + +- Download and View + + Under the __Security Management__ -> __Compliance Scanning__ page, click the __Scan Report__ tab, then click the __┇__ action button to the right of a report and select __Download__ . + + +- View Online + + Clicking the name of a report allows you to view its content online, which includes: + + - The target cluster scanned. + - The scan policy and scan configuration used. + - The start time of the scan. + - The total number of scan items, the number passed, and the number failed. + - For failed scan items, repair suggestions are provided. + - For passed scan items, more secure operational suggestions are provided. + + \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/security/hunter.md b/docs/en/docs/end-user/kpanda/security/hunter.md new file mode 100644 index 0000000000..01f23dcd28 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/security/hunter.md @@ -0,0 +1,57 @@ +# Vulnerability Scan + +To use the [Vulnerability Scan](index.md) feature, you need to create a scan policy first. After executing the policy, a scan report will be automatically generated for viewing. + +## Create a Scan Policy + +1. On the left navigation bar of the homepage in the Container Management module, click __Security Management__ . + + + +2. Click __Vulnerability Scan__ on the left navigation bar, then click the __Scan Policy__ tab and click __Create Scan Policy__ on the right. + + +3. 
Fill in the configuration according to the following instructions, and then click __OK__ .

    - Cluster: Select the cluster to be scanned. The optional cluster list comes from the clusters integrated into or created in the [Container Management](../../intro/index.md) module. If the desired cluster is not available, integrate or create one in the Container Management module.
    - Scan Type:

        - Immediate scan: Performs a scan immediately after the scan policy is created. It cannot be executed again, automatically or manually, later.
        - Scheduled scan: Automatically repeats the scan at scheduled intervals.

    - Number of Scan Reports to Keep: Set the maximum number of scan reports to keep. When this retention limit is exceeded, the oldest reports are deleted first.

## Update/Delete Scan Policies

After creating a scan policy, you can update or delete it as needed.

Under the __Scan Policy__ tab, click the __┇__ action button to the right of a policy:

- For periodic scan policies:

    - Select __Execute Immediately__ to perform an additional scan outside the regular schedule.
    - Select __Disable__ to pause the scanning plan until __Enable__ is clicked to resume executing the scan policy according to its schedule.
    - Select __Edit__ to update the configuration. You can update the scan configuration, type, scan cycle, and report retention quantity. The policy name and the target cluster cannot be changed.
    - Select __Delete__ to delete the policy.

- For one-time scan policies: only the __Delete__ operation is supported.

## View Scan Reports

1. Under the __Security Management__ -> __Vulnerability Scanning__ -> __Scan Reports__ tab, click the report name.

    > Clicking __Delete__ on the right of a report allows you to manually delete the report.

2. View the scan report content, which includes:

    - The target cluster scanned.
    - The scan policy used.
    - The scan frequency.
    - The total number of risks, and the numbers of high, medium, and low risks.
    - The time of the scan.
    - Check details, such as the vulnerability ID, type, name, and description.

diff --git a/docs/en/docs/end-user/kpanda/security/index.md b/docs/en/docs/end-user/kpanda/security/index.md new file mode 100644 index 0000000000..a66f00dc21 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/security/index.md @@ -0,0 +1,61 @@

# Types of Security Scans

AI platform Container Management provides three types of security scans:

- Compliance Scan: Performs security scans on cluster nodes based on the [CIS Benchmark](https://github.com/aquasecurity/kube-bench/tree/main/cfg).
- Authorization Scan: Checks for security and compliance issues in the Kubernetes cluster, and records and verifies authorized access, object changes, events, and other activities related to the Kubernetes API.
- Vulnerability Scan: Scans the Kubernetes cluster for potential vulnerabilities and risks, such as unauthorized access, sensitive information leakage, weak authentication, and container escape.

## Compliance Scan

Compliance scanning targets cluster nodes. The scan result lists the scan items and their results, and provides repair suggestions for any failed items. For the specific security rules used during scanning, refer to the [CIS Kubernetes Benchmark](https://www.cisecurity.org/benchmark/kubernetes).

The focus of the scan varies depending on the type of node being checked.
+ +- Scan the control plane node (Controller) + + - Focus on the security of system components such as __API Server__ , __controller-manager__ , __scheduler__ , __kubelet__ , etc. + - Check the security configuration of the Etcd database. + - Verify whether the cluster's authentication mechanism, authorization policy, and network security configuration meet security standards. + +- Scan worker nodes + + - Check if the configuration of container runtimes such as kubelet and Docker meets security standards. + - Verify whether the container image has been trusted and verified. + - Check if the network security configuration of the node meets security standards. + +!!! tip + + To use compliance scanning, you need to create a [scan configuration](cis/config.md) first, and then create a [scan policy](cis/policy.md) based on that configuration. After executing the scan policy, you can [view the scan report](cis/report.md). + +## Authorization Scan + +Authorization scanning focuses on security vulnerabilities caused by authorization issues. Authorization scans can help users identify security threats in Kubernetes clusters, identify which resources need further review and protection measures. By performing these checks, users can gain a clearer and more comprehensive understanding of their Kubernetes environment and ensure that the cluster environment meets Kubernetes' best practices and security standards. + +Specifically, authorization scanning supports the following operations: + +- Scans the health status of all nodes in the cluster. + +- Scans the running state of components in the cluster, such as __kube-apiserver__ , __kube-controller-manager__ , __kube-scheduler__ , etc. + +- Scans security configurations: Check Kubernetes' security configuration. + + - API security: whether unsafe API versions are enabled, whether appropriate RBAC roles and permission restrictions are set, etc. + - Container security: whether insecure images are used, whether privileged mode is enabled, whether appropriate security context is set, etc. + - Network security: whether appropriate network policy is enabled to restrict traffic, whether TLS encryption is used, etc. + - Storage security: whether appropriate encryption and access controls are enabled. + - Application security: whether necessary security measures are in place, such as password management, cross-site scripting attack defense, etc. + +- Provides warnings and suggestions: Security best practices that cluster administrators should perform, such as regularly rotating certificates, using strong passwords, restricting network access, etc. + +!!! tip + + To use authorization scanning, you need to create a scan policy first. After executing the scan policy, you can view the scan report. For details, refer to [Security Scanning](audit.md). + +## Vulnerability Scan + +Vulnerability scanning focuses on scanning potential malicious attacks and security vulnerabilities, such as remote code execution, SQL injection, XSS attacks, and some attacks specific to Kubernetes. The final scan report lists the security vulnerabilities in the cluster and provides repair suggestions. + +!!! tip + + To use vulnerability scanning, you need to create a scan policy first. After executing the scan policy, you can view the scan report. For details, refer to [Vulnerability Scan](hunter.md). 
\ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/storage/pv.md b/docs/en/docs/end-user/kpanda/storage/pv.md new file mode 100644 index 0000000000..065d867044 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/storage/pv.md @@ -0,0 +1,111 @@ +# data volume (PV) + +A data volume (PersistentVolume, PV) is a piece of storage in the cluster, which can be prepared in advance by the administrator, or dynamically prepared using a storage class (Storage Class). PV is a cluster resource, but it has an independent life cycle and will not be deleted when the Pod process ends. Mounting PVs to workloads can achieve data persistence for workloads. The PV holds the data directory that can be accessed by the containers in the Pod. + +## Create data volume + +Currently, there are two ways to create data volumes: YAML and form. These two ways have their own advantages and disadvantages, and can meet the needs of different users. + +- There are fewer steps and more efficient creation through YAML, but the threshold requirement is high, and you need to be familiar with the YAML file configuration of the data volume. + +- It is more intuitive and easier to create through the form, just fill in the corresponding values ​​according to the prompts, but the steps are more cumbersome. + +### YAML creation + +1. Click the name of the target cluster in the cluster list, and then click __Container Storage__ -> __Data Volume (PV)__ -> __Create with YAML__ in the left navigation bar. + + + +2. Enter or paste the prepared YAML file in the pop-up box, and click __OK__ at the bottom of the pop-up box. + + > Supports importing YAML files from local or downloading and saving filled files to local. + + + +### Form Creation + +1. Click the name of the target cluster in the cluster list, and then click __Container Storage__ -> __Data Volume (PV)__ -> __Create Data Volume (PV)__ in the left navigation bar. + + + +2. Fill in the basic information. + + - The data volume name, data volume type, mount path, volume mode, and node affinity cannot be changed after creation. + - Data volume type: For a detailed introduction to volume types, refer to the official Kubernetes document [Volumes](https://kubernetes.io/docs/concepts/storage/volumes/). + + - Local: The local storage of the Node node is packaged into a PVC interface, and the container directly uses the PVC without paying attention to the underlying storage type. Local volumes do not support dynamic configuration of data volumes, but support configuration of node affinity, which can limit which nodes can access the data volume. + - HostPath: Use files or directories on the file system of Node nodes as data volumes, and do not support Pod scheduling based on node affinity. + + - Mount path: mount the data volume to a specific directory in the container. + - access mode: + + - ReadWriteOnce: The data volume can be mounted by a node in read-write mode. + - ReadWriteMany: The data volume can be mounted by multiple nodes in read-write mode. + - ReadOnlyMany: The data volume can be mounted read-only by multiple nodes. + - ReadWriteOncePod: The data volume can be mounted read-write by a single Pod. + + - Recycling policy: + + - Retain: The PV is not deleted, but its status is only changed to __released__ , which needs to be manually recycled by the user. For how to manually reclaim, refer to [Persistent Volume](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#retain). 
+ - Recycle: keep the PV but empty its data, perform a basic wipe ( __rm -rf /thevolume/*__ ). + - Delete: When deleting a PV and its data. + + - Volume mode: + + - File system: The data volume will be mounted to a certain directory by the Pod. If the data volume is stored from a device and the device is currently empty, a file system is created on the device before the volume is mounted for the first time. + - Block: Use the data volume as a raw block device. This type of volume is given to the Pod as a block device without any file system on it, allowing the Pod to access the data volume faster. + + - Node affinity: + + + +## View data volume + +Click the name of the target cluster in the cluster list, and then click __Container Storage__ -> __Data Volume (PV)__ in the left navigation bar. + +- On this page, you can view all data volumes in the current cluster, as well as information such as the status, capacity, and namespace of each data volume. + +- Supports sequential or reverse sorting according to the name, status, namespace, and creation time of data volumes. + + + +- Click the name of a data volume to view the basic configuration, StorageClass information, labels, comments, etc. of the data volume. + + + +## Clone data volume + +By cloning a data volume, a new data volume can be recreated based on the configuration of the cloned data volume. + +1. Enter the clone page + + - On the data volume list page, find the data volume to be cloned, and select __Clone__ under the operation bar on the right. + + > You can also click the name of the data volume, click the operation button in the upper right corner of the details page and select __Clone__ . + + + +2. Use the original configuration directly, or modify it as needed, and click __OK__ at the bottom of the page. + +## Update data volume + +There are two ways to update data volumes. Support for updating data volumes via forms or YAML files. + +!!! note + + Only updating the alias, capacity, access mode, reclamation policy, label, and comment of the data volume is supported. + +- On the data volume list page, find the data volume that needs to be updated, select __Update__ under the operation bar on the right to update through the form, select __Edit YAML__ to update through YAML. + + + +- Click the name of the data volume to enter the details page of the data volume, select __Update__ in the upper right corner of the page to update through the form, select __Edit YAML__ to update through YAML. + + + +## Delete data volume + +On the data volume list page, find the data to be deleted, and select Delete in the operation column on the right. + +> You can also click the name of the data volume, click the operation button in the upper right corner of the details page and select __Delete__ . + diff --git a/docs/en/docs/end-user/kpanda/storage/pvc.md b/docs/en/docs/end-user/kpanda/storage/pvc.md new file mode 100644 index 0000000000..e3894cfb3a --- /dev/null +++ b/docs/en/docs/end-user/kpanda/storage/pvc.md @@ -0,0 +1,127 @@ +# Data volume declaration (PVC) + +A persistent volume claim (PersistentVolumeClaim, PVC) expresses a user's request for storage. PVC consumes PV resources and claims a data volume with a specific size and specific access mode. For example, the PV volume is required to be mounted in ReadWriteOnce, ReadOnlyMany or ReadWriteMany modes. + +## Create data volume statement + +Currently, there are two ways to create data volume declarations: YAML and form. 
These two ways have their own advantages and disadvantages, and can meet the needs of different users. + +- There are fewer steps and more efficient creation through YAML, but the threshold requirement is high, and you need to be familiar with the YAML file configuration of the data volume declaration. + +- It is more intuitive and easier to create through the form, just fill in the corresponding values ​​according to the prompts, but the steps are more cumbersome. + +### YAML creation + +1. Click the name of the target cluster in the cluster list, and then click __Container Storage__ -> __Data Volume Declaration (PVC)__ -> __Create with YAML__ in the left navigation bar. + + + +2. Enter or paste the prepared YAML file in the pop-up box, and click __OK__ at the bottom of the pop-up box. + + > Supports importing YAML files from local or downloading and saving filled files to local. + + + +### Form Creation + +1. Click the name of the target cluster in the cluster list, and then click __Container Storage__ -> __Data Volume Declaration (PVC)__ -> __Create Data Volume Declaration (PVC)__ in the left navigation bar. + + + +2. Fill in the basic information. + + - The name, namespace, creation method, data volume, capacity, and access mode of the data volume declaration cannot be changed after creation. + - Creation method: dynamically create a new data volume claim in an existing StorageClass or data volume, or create a new data volume claim based on a snapshot of a data volume claim. + + > The declared capacity of the data volume cannot be modified when the snapshot is created, and can be modified after the creation is complete. + + - After selecting the creation method, select the desired StorageClass/data volume/snapshot from the drop-down list. + - access mode: + + - ReadWriteOnce, the data volume declaration can be mounted by a node in read-write mode. + - ReadWriteMany, the data volume declaration can be mounted by multiple nodes in read-write mode. + - ReadOnlyMany, the data volume declaration can be mounted read-only by multiple nodes. + - ReadWriteOncePod, the data volume declaration can be mounted by a single Pod in read-write mode. + + + +## View data volume statement + +Click the name of the target cluster in the cluster list, and then click __Container Storage__ -> __Data Volume Declaration (PVC)__ in the left navigation bar. + +- On this page, you can view all data volume declarations in the current cluster, as well as information such as the status, capacity, and namespace of each data volume declaration. + +- Supports sorting in sequential or reverse order according to the declared name, status, namespace, and creation time of the data volume. + + + +- Click the name of the data volume declaration to view the basic configuration, StorageClass information, labels, comments and other information of the data volume declaration. + + + +## Expansion data volume statement + +1. In the left navigation bar, click __Container Storage__ -> __Data Volume Declaration (PVC)__ , and find the data volume declaration whose capacity you want to adjust. + + + +2. Click the name of the data volume declaration, and then click the operation button in the upper right corner of the page and select __Expansion__ . + + + +3. Enter the target capacity and click __OK__ . + + + +## Clone data volume statement + +By cloning a data volume claim, a new data volume claim can be recreated based on the configuration of the cloned data volume claim. + +1. 
Enter the clone page + + - On the data volume declaration list page, find the data volume declaration that needs to be cloned, and select __Clone__ under the operation bar on the right. + + > You can also click the name of the data volume declaration, click the operation button in the upper right corner of the details page and select __Clone__ . + + + +2. Use the original configuration directly, or modify it as needed, and click __OK__ at the bottom of the page. + + + +## Update data volume statement + +There are two ways to update data volume claims. Support for updating data volume claims via form or YAML file. + +!!! note + + Only aliases, labels, and annotations for data volume claims are updated. + +- On the data volume list page, find the data volume declaration that needs to be updated, select __Update__ in the operation bar on the right to update it through the form, and select __Edit YAML__ to update it through YAML. + + + +- Click the name of the data volume declaration, enter the details page of the data volume declaration, select __Update__ in the upper right corner of the page to update through the form, select __Edit YAML__ to update through YAML. + + + +## Delete data volume statement + +On the data volume declaration list page, find the data to be deleted, and select Delete in the operation column on the right. + +> You can also click the name of the data volume statement, click the operation button in the upper right corner of the details page and select __Delete__ . + + + +## common problem + +1. If there is no optional StorageClass or data volume in the list, you can [Create a StorageClass](sc.md) or [Create a data volume](pv.md). + +2. If there is no optional snapshot in the list, you can enter the details page of the data volume declaration and create a snapshot in the upper right corner. + + + +3. If the StorageClass (SC) used by the data volume declaration is not enabled for snapshots, snapshots cannot be made, and the page will not display the "Make Snapshot" option. +4. If the StorageClass (SC) used by the data volume declaration does not have the capacity expansion feature enabled, the data volume does not support capacity expansion, and the page will not display the capacity expansion option. + + \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/storage/sc-share.md b/docs/en/docs/end-user/kpanda/storage/sc-share.md new file mode 100644 index 0000000000..01e13d5d72 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/storage/sc-share.md @@ -0,0 +1,14 @@ +# shared StorageClass + +The AI platform container management module supports sharing a StorageClass with multiple namespaces to improve resource utilization efficiency. + +1. Find the StorageClass that needs to be shared in the StorageClass list, and click __Authorize Namespace__ under the operation bar on the right. + + + +2. Click __Custom Namespace__ to select which namespaces this StorageClass needs to be shared to one by one. + + - Click __Authorize All Namespaces__ to share this StorageClass to all namespaces under the current cluster at one time. + - Click __Remove Authorization__ under the operation bar on the right side of the list to deauthorize and stop sharing this StorageClass to this namespace. 
+ + \ No newline at end of file diff --git a/docs/en/docs/end-user/kpanda/storage/sc.md b/docs/en/docs/end-user/kpanda/storage/sc.md new file mode 100644 index 0000000000..8ad09bfa12 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/storage/sc.md @@ -0,0 +1,68 @@ +# StorageClass (SC) + +A StorageClass refers to a large storage resource pool composed of many physical disks. This platform supports the creation of block StorageClass, local StorageClass, and custom StorageClass after accessing various storage vendors, and then dynamically configures data volumes for workloads. + +## Create StorageClass (SC) + +Currently, it supports creating StorageClass through YAML and forms. These two methods have their own advantages and disadvantages, and can meet the needs of different users. + +- There are fewer steps and more efficient creation through YAML, but the threshold requirement is high, and you need to be familiar with the YAML file configuration of the StorageClass. + +- It is more intuitive and easier to create through the form, just fill in the corresponding values ​​according to the prompts, but the steps are more cumbersome. + +### YAML creation + +1. Click the name of the target cluster in the cluster list, and then click __Container Storage__ -> __StorageClass (SC)__ -> __Create with YAML__ in the left navigation bar. + + + +2. Enter or paste the prepared YAML file in the pop-up box, and click __OK__ at the bottom of the pop-up box. + + > Supports importing YAML files from local or downloading and saving filled files to local. + + + +### Form Creation + +1. Click the name of the target cluster in the cluster list, and then click __Container Storage__ -> __StorageClass (SC)__ -> __Create StorageClass (SC)__ in the left navigation bar. + + + +2. Fill in the basic information and click __OK__ at the bottom. + + **CUSTOM STORAGE SYSTEM** + + - The StorageClass name, driver, and reclamation policy cannot be modified after creation. + - CSI storage driver: A standard Kubernetes-based container storage interface plug-in, which must comply with the format specified by the storage manufacturer, such as __rancher.io/local-path__ . + + - For how to fill in the CSI drivers provided by different vendors, refer to the official Kubernetes document [Storage Class](https://kubernetes.io/docs/concepts/storage/storage-classes/#provisioner). + - Recycling policy: When deleting a data volume, keep the data in the data volume or delete the data in it. + - Snapshot/Expansion: After it is enabled, the data volume/data volume declaration based on the StorageClass can support the expansion and snapshot features, but **the premise is that the underlying storage driver supports the snapshot and expansion features**. + + **HwameiStor storage system** + + - The StorageClass name, driver, and reclamation policy cannot be modified after creation. + - Storage system: HwameiStor storage system. + - Storage type: support LVM, raw disk type + - __LVM type__ : HwameiStor recommended usage method, which can use highly available data volumes, and the corresponding CSI storage driver is `lvm.hwameistor.io` . + - __Raw disk data volume__ : suitable for high availability cases, without high availability capability, the corresponding CSI driver is `hdd.hwameistor.io` . + - High Availability Mode: Before using the high availability capability, please make sure __DRBD component__ has been installed. After the high availability mode is turned on, the number of data volume copies can be set to 1 and 2. 
Convert data volume copy from 1 to 1 if needed. + - Recycling policy: When deleting a data volume, keep the data in the data volume or delete the data in it. + - Snapshot/Expansion: After it is enabled, the data volume/data volume declaration based on the StorageClass can support the expansion and snapshot features, but **the premise is that the underlying storage driver supports the snapshot and expansion features**. + + + +## Update StorageClass (SC) + +On the StorageClass list page, find the StorageClass that needs to be updated, and select Edit under the operation bar on the right to update the StorageClass. + + + +!!! info + + Select __View YAML__ to view the YAML file of the StorageClass, but editing is not supported. + +## Delete StorageClass (SC) + +On the StorageClass list page, find the StorageClass to be deleted, and select Delete in the operation column on the right. + diff --git a/docs/en/docs/end-user/kpanda/workloads/create-cronjob.md b/docs/en/docs/end-user/kpanda/workloads/create-cronjob.md new file mode 100644 index 0000000000..f81724832e --- /dev/null +++ b/docs/en/docs/end-user/kpanda/workloads/create-cronjob.md @@ -0,0 +1,209 @@ +--- +MTPE: FanLin +Date: 2024-02-29 +--- + +# Create CronJob + +This page introduces how to create a CronJob through images and YAML files. + +CronJobs are suitable for performing periodic operations, such as backup and report generation. These jobs can be configured to repeat periodically (for example: daily/weekly/monthly), and the time interval at which the job starts to run can be defined. + +## Prerequisites + +Before creating a CronJob, the following prerequisites need to be met: + +- In the [Container Management](../../intro/index.md) module [Access Kubernetes Cluster](../clusters/integrate-cluster.md) or [Create Kubernetes Cluster](../clusters/create-cluster.md), and can access the cluster UI interface. + +- Create a [namespace](../namespaces/createns.md) and a [user](../../../ghippo/access-control/user.md). + +- The current operating user should have [NS Editor](../permissions/permission-brief.md#ns-editor) or higher permissions, for details, refer to [Namespace Authorization](../namespaces/createns.md). + +- When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail. + +## Create by image + +Refer to the following steps to create a CronJob using the image. + +1. Click __Clusters__ on the left navigation bar, and then click the name of the target cluster to enter the cluster details page. + + ![Clusters](../images/deploy01.png) + +2. On the cluster details page, click __Workloads__ -> __CronJobs__ in the left navigation bar, and then click the __Create by Image__ button in the upper right corner of the page. + + ![Create by image](../images/cronjob01.png) + +3. Fill in [Basic Information](create-cronjob.md#basic-information), [Container Settings](create-cronjob.md#container-settings), [CronJob Settings](create-cronjob.md#cronjob-settings), [Advanced Configuration](create-cronjob.md#advanced-configuration), click __OK__ in the lower right corner of the page to complete the creation. + + The system will automatically return to the __CronJobs__ list. Click __┇__ on the right side of the list to perform operations such as updating, deleting, and restarting the CronJob. 
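If you prefer to verify the result from the command line, the standard kubectl commands below can be used as well. They are a generic sketch: the namespace `default` is an example, and the CronJob name `demo` matches the YAML example later on this page — replace both with the values you actually used.

```shell
# List CronJobs in the namespace and check the schedule and last run time.
kubectl get cronjobs -n default

# Watch the Jobs spawned by the CronJob as they are created.
kubectl get jobs -n default --watch

# Trigger a one-off run manually from the CronJob template (the job name is arbitrary).
kubectl create job demo-manual-run --from=cronjob/demo -n default
```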
+ + ![Config](../images/cronjob06.png) + +### Basic information + +On the __Create CronJobs__ page, enter the information according to the table below, and click __Next__ . + +![Basic Information](../images/cronjob02.png) + +- Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created. +- Namespace: Select which namespace to deploy the newly created CronJob in, and the default namespace is used by default. If you can't find the desired namespace, you can go to [Create a new namespace](../namespaces/createns.md) according to the prompt on the page. +- Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512. + +### Container settings + +Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part. + +> Container setting is only configured for a single container. To add multiple containers to a pod, click __+__ on the right to add multiple containers. + +=== "Basic information (required)" + + ![Basic Info](../images/cronjob03.png) + + When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the configuration with reference to the following requirements, click __OK__ . + + - Container Name: Up to 63 characters, lowercase letters, numbers and separators ("-") are supported. Must start and end with a lowercase letter or number, eg nginx-01. + - Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official [DockerHub](https://hub.docker.com/) by default. After accessing the [container registry](../../../kangaroo/intro/index.md) module of AI platform, you can click the right side to select the image ` to select the image. + - Image Pull Policy: After checking __Always pull the image__ , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local mirror will be pulled, and only when the mirror does not exist locally, it will be re-pulled from the container registry. For more details, refer to [Image Pull Policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy). + - Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host. + - CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure. + - GPU Exclusive: Configure the GPU usage for the container, only positive integers are supported. The GPU quota setting supports setting exclusive use of the entire GPU card or part of the vGPU for the container. 
For example, for an 8-core GPU card, enter the number __8__ to let the container exclusively use the entire length of the card, and enter the number __1__ to configure a 1-core vGPU for the container. + + > Before setting exclusive GPU, the administrator needs to install the GPU card and driver plug-in on the cluster nodes in advance, and enable the GPU feature in [Cluster Settings](../clusterops/cluster-settings.md). + +=== "Lifecycle (optional)" + + Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to [Container Lifecycle Configuration](pod-config/lifecycle.md). + + ![Lifecycle](../images/cronjob07.png) + +=== "Health Check (optional)" + + It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to [Container Health Check Configuration](pod-config/health-check.md). + + ![Health Check](../images/deploy07.png) + +=== "Environment variables (optional)" + + Configure container parameters within the Pod, add environment variables or pass configuration to the Pod, etc. For details, refer to [Container environment variable configuration](pod-config/env-variables.md). + + ![Environment Variables](../images/deploy08.png) + +=== "Data storage (optional)" + + Configure the settings for container mounting data volumes and data persistence. For details, refer to [Container Data Storage Configuration](pod-config/env-variables.md). + + ![Data storage](../images/deploy09.png) + +=== "Security settings (optional)" + + Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter __0__ to use the privileges of the root account. + + ![Security settings](../images/deploy10.png) + +### CronJob Settings + +![CronJob Settings](../images/cronjob04.png) + +- Concurrency Policy: Whether to allow multiple Job jobs to run in parallel. + + - __Allow__ : A new CronJob can be created before the previous job is completed, and multiple jobs can be parallelized. Too many jobs may occupy cluster resources. + - __Forbid__ : Before the previous job is completed, a new job cannot be created. If the execution time of the new job is up and the previous job has not been completed, CronJob will ignore the execution of the new job. + - __Replace__ : If the execution time of the new job is up, but the previous job has not been completed, the new job will replace the previous job. + + > The above rules only apply to multiple jobs created by the same CronJob. Multiple jobs created by multiple CronJobs are always allowed to run concurrently. + +- Policy Settings: Set the time period for job execution based on minutes, hours, days, weeks, and months. Support custom Cron expressions with numbers and `*` , **after inputting the expression, the meaning of the current expression will be prompted**. For detailed expression syntax rules, refer to [Cron Schedule Syntax](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax). +- Job Records: Set how many records of successful or failed jobs to keep. __0__ means do not keep. +- Timeout: When this time is exceeded, the job will be marked as failed to execute, and all Pods under the job will be deleted. When it is empty, it means that no timeout is set. The default is 360 s. 
+- Retries: the number of times the job can be retried, the default value is 6. +- Restart Policy: Set whether to restart the Pod when the job fails. + +### Service settings + +Configure [Service](../network/create-services.md) for the statefulset, so that the statefulset can be accessed externally. + +1. Click the __Create Service__ button. + + ![Create Service](../images/cronjob08.png) + +2. Refer to [Create Service](../network/create-services.md) to configure service parameters. + + ![Config Parameters](../images/deploy13.png) + +3. Click __OK__ and click __Next__ . + +### Advanced configuration + +The advanced configuration of CronJobs mainly involves labels and annotations. + +You can click the __Add__ button to add labels and annotations to the workload instance Pod. + +![Labels and Annotations](../images/cronjob05.png) + +## Create from YAML + +In addition to mirroring, you can also create timed jobs more quickly through YAML files. + +1. Click __Clusters__ on the left navigation bar, and then click the name of the target cluster to enter the cluster details page. + + ![Clusters](../images/deploy01.png) + +2. On the cluster details page, click __Workloads__ -> __CronJobs__ in the left navigation bar, and then click the __Create from YAML__ button in the upper right corner of the page. + + ![Create](../images/cronjob09.png) + +3. Enter or paste the YAML file prepared in advance, click __OK__ to complete the creation. + + ![Confirm](../images/cronjob10.png) + +??? note "click to view the complete YAML" + + ```yaml + apiVersion: batch/v1 + kind: CronJob + metadata: + creationTimestamp: '2022-12-26T09:45:47Z' + generation: 1 + name: demo + namespace: default + resourceVersion: '92726617' + uid: d030d8d7-a405-4dcd-b09a-176942ef36c9 + spec: + concurrencyPolicy: Allow + failedJobsHistoryLimit: 1 + jobTemplate: + metadata: + creationTimestamp: null + spec: + activeDeadlineSeconds: 360 + backoffLimit: 6 + template: + metadata: + creationTimestamp: null + spec: + containers: + - image: nginx + imagePullPolicy: IfNotPresent + lifecycle: {} + name: container-3 + resources: + limits: + cpu: 250m + memory: 512Mi + requests: + cpu: 250m + memory: 512Mi + securityContext: + privileged: false + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + dnsPolicy: ClusterFirst + restartPolicy: Never + schedulerName: default-scheduler + securityContext: {} + terminationGracePeriodSeconds: 30 + schedule: 0 0 13 * 5 + successfulJobsHistoryLimit: 3 + suspend: false + status: {} + ``` diff --git a/docs/en/docs/end-user/kpanda/workloads/create-daemonset.md b/docs/en/docs/end-user/kpanda/workloads/create-daemonset.md new file mode 100644 index 0000000000..ae47d26e7a --- /dev/null +++ b/docs/en/docs/end-user/kpanda/workloads/create-daemonset.md @@ -0,0 +1,376 @@ +--- +MTPE: FanLin +Date: 2024-02-28 +--- + +# Create DaemonSet + +This page introduces how to create a daemonSet through image and YAML files. + +DaemonSet is connected to [taint](https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes-using-node-affinity/) through [node affinity]( ://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) feature ensures that a replica of a Pod is running on all or some of the nodes. For nodes that newly joined the cluster, DaemonSet automatically deploys the corresponding Pod on the new node and tracks the running status of the Pod. When a node is removed, the DaemonSet deletes all Pods it created. 
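+
+As a quick illustration of the mechanism described above, the following is a minimal DaemonSet sketch: the toleration allows the Pod to also be scheduled onto control-plane nodes, while the optional `nodeSelector` restricts it to a subset of nodes. The name `log-collector` and the `fluentd` image are placeholders used for illustration only.
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: log-collector          # placeholder name
+  namespace: kube-system
+spec:
+  selector:
+    matchLabels:
+      app: log-collector
+  template:
+    metadata:
+      labels:
+        app: log-collector
+    spec:
+      nodeSelector:            # optional: run only on nodes carrying this label
+        kubernetes.io/os: linux
+      tolerations:             # tolerate the default control-plane taint so the Pod can also run on those nodes
+        - key: node-role.kubernetes.io/control-plane
+          operator: Exists
+          effect: NoSchedule
+      containers:
+        - name: collector
+          image: fluentd       # placeholder image
+          resources:
+            limits:
+              cpu: 250m
+              memory: 512Mi
+```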
+ +Common cases for daemons include: + +- Run cluster daemons on each node. +- Run a log collection daemon on each node. +- Run a monitoring daemon on each node. + +For simplicity, a DaemonSet can be started on each node for each type of daemon. For finer and more advanced daemon management, you can also deploy multiple DaemonSets for the same daemon. Each DaemonSet has different flags and has different memory, CPU requirements for different hardware types. + +## Prerequisites + +Before creating a DaemonSet, the following prerequisites need to be met: + +- In the [Container Management](../../intro/index.md) module [Access Kubernetes Cluster](../clusters/integrate-cluster.md) or [Create Kubernetes Cluster](../clusters/create-cluster.md), and can access the cluster UI interface. + +- Create a [namespace](../namespaces/createns.md) and a [user](../../../ghippo/access-control/user.md). + +- The current operating user should have [NS Editor](../permissions/permission-brief.md#ns-editor) or higher permissions, for details, refer to [Namespace Authorization](../namespaces/createns.md). + +- When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail. + +## Create by image + +Refer to the following steps to create a daemon using the image. + +1. Click __Clusters__ on the left navigation bar, and then click the name of the target cluster to enter the cluster details page. + + ![Clusters](../images/deploy01.png) + +2. On the cluster details page, click __Workloads__ -> __DaemonSets__ in the left navigation bar, and then click the __Create by Image__ button in the upper right corner of the page. + + ![DaemonSet](../images/daemon01.png) + +3. Fill in [Basic Information](create-DaemonSet.md#basic-information), [Container Settings](create-DaemonSet.md#container-settings), [Service Settings](create-DaemonSet.md#service-settings), [Advanced Settings](create-DaemonSet.md#advanced-settings), click __OK__ in the lower right corner of the page to complete the creation. + + The system will automatically return the list of __DaemonSets__ . Click __┇__ on the right side of the list to perform operations such as updating, deleting, and restarting the DaemonSet. + + ![Daemons](../images/daemon05.png) + +### Basic information + +On the __Create DaemonSets__ page, after entering the information according to the table below, click __Next__ . + +![Basic Information](../images/daemon02.png) + +- Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created. +- Namespace: Select which namespace to deploy the newly created DaemonSet in, and the default namespace is used by default. If you can't find the desired namespace, you can go to [Create a new namespace](../namespaces/createns.md) according to the prompt on the page. +- Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512. + +### Container settings + +Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part. 
+ +> Container setting is only configured for a single container. To add multiple containers to a pod, click __+__ on the right to add multiple containers. + +=== "Basic information (required)" + + ![Basic Info](../images/daemon06.png) + + When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click __OK__ . + + - Container Name: Up to 63 characters, lowercase letters, numbers and separators ("-") are supported. Must start and end with a lowercase letter or number, eg nginx-01. + - Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official [DockerHub](https://hub.docker.com/) by default. After accessing the [container registry](../../../kangaroo/intro/index.md) module of AI platform, you can click __Select Image__ on the right to select an image. + - Image Pull Policy: After checking __Always pull image__ , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be pulled, and only when the image does not exist locally, it will be re-pulled from the container registry. For more details, refer to [Image Pull Policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy). + - Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host. + - CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure. + - GPU Exclusive: Configure the GPU usage for the container, only positive integers are supported. The GPU quota setting supports setting exclusive use of the entire GPU card or part of the vGPU for the container. For example, for an 8-core GPU card, enter the number __8__ to let the container exclusively use the entire length of the card, and enter the number __1__ to configure a 1-core vGPU for the container. + + > Before setting exclusive GPU, the administrator needs to install the GPU card and driver plug-in on the cluster nodes in advance, and enable the GPU feature in [Cluster Settings](../clusterops/cluster-settings.md). + +=== "Lifecycle (optional)" + + Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to [Container Lifecycle Configuration](pod-config/lifecycle.md). + + ![Lifecycle](../images/daemon006.png) + +=== "Health Check (optional)" + + It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to [Container Health Check Configuration](pod-config/health-check.md). + + ![Health Check](../images/deploy07.png) + +=== "Environment variables (optional)" + + Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to [Container environment variable settings](pod-config/env-variables.md). 
+
+    ![Environment Variables](../images/deploy08.png)
+
+=== "Data storage (optional)"
+
+    Configure the settings for container mounting data volumes and data persistence. For details, refer to [Container Data Storage Configuration](pod-config/env-variables.md).
+
+    ![Data storage](../images/deploy09.png)
+
+=== "Security settings (optional)"
+
+    Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter __0__ to use the privileges of the root account.
+
+    ![Security settings](../images/deploy10.png)
+
+### Service settings
+
+Create a [Service](../network/create-services.md) for the DaemonSet, so that the DaemonSet can be accessed externally.
+
+1. Click the __Create Service__ button.
+
+    ![Create Service](../images/daemon12.png)
+
+2. Configure service parameters. For details, refer to [Create Service](../network/create-services.md).
+
+    ![Service Settings](../images/deploy13.png)
+
+3. Click __OK__ and click __Next__ .
+
+### Advanced settings
+
+Advanced settings include four parts: network settings, upgrade policy, scheduling policy, and labels and annotations. You can click the tabs below to view the requirements of each part.
+
+=== "Network Configuration"
+
+    In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related settings options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
+
+    - DNS Policy
+
+        - Default: Make the container use the domain name resolution file pointed to by the __--resolv-conf__ parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
+        - ClusterFirstWithHostNet: The application uses the domain name resolution file of the host to which it is connected.
+        - ClusterFirst: The application docks with Kube-DNS/CoreDNS.
+        - None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting it to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the settings of dnsConfig.
+
+    - Nameservers: Fill in the address of the domain name server, such as __10.6.175.20__ .
+    - Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
+    - Options: Configuration options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
+    - Host Alias: The alias set for the host.
+
+    ![DNS](../images/daemon17.png)
+
+=== "Upgrade Policy"
+
+    - Upgrade Mode: __Rolling upgrade__ refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted.
__Rebuild and upgrade__ refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted. + - Max Unavailable Pods: Specify the maximum value or ratio of unavailable pods during the workload update process, the default is 25%. If it is equal to the number of instances, there is a risk of service interruption. + - Max Surge: The maximum or ratio of the total number of Pods exceeding the desired replica count of Pods during a Pod update. Default is 25%. + - Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10. + - Minimum Ready: The minimum time for a Pod to be ready. Only after this time is the Pod considered available. The default is 0 seconds. + - Upgrade Max Duration: If the deployment is not successful after the set time, the workload will be marked as failed. Default is 600 seconds. + - Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds. + + ![Upgrade Policy](../images/daemon14.png) + +=== "Scheduling Policies" + + - Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds. + - Node affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on. + - Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node. + - Workload anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node. + - Topology domain: namely topologyKey, used to specify a group of nodes that can be scheduled. For example, __kubernetes.io/os__ indicates that as long as the node of an operating system meets the conditions of labelSelector, it can be scheduled to the node. + + > For details, refer to [Scheduling Policy](pod-config/scheduling-policy.md). + + ![Scheduling Policy](../images/daemon15.png) + +=== "Labels and Annotations" + + You can click the __Add__ button to add tags and annotations to workloads and pods. + + ![Labels and Annotations](../images/daemon16.png) + +## Create from YAML + +In addition to image, you can also create daemons more quickly through YAML files. + +1. Click __Clusters__ on the left navigation bar, and then click the name of the target cluster to enter the __Cluster Details__ page. + + ![Clusters](../images/deploy01.png) + +2. On the cluster details page, click __Workload__ -> __Daemons__ in the left navigation bar, and then click the __YAML Create__ button in the upper right corner of the page. + + ![Deployments](../images/daemon02Yaml.png) + +3. Enter or paste the YAML file prepared in advance, click __OK__ to complete the creation. + + ![Confirm](../images/daemon03Yaml.png) + +??? 
note "Click to see an example YAML for creating a daemon" + + ```yaml + kind: DaemonSet + apiVersion: apps/v1 + metadata: + name: hwameistor-local-disk-manager + namespace: hwameistor + uid: ccbdc098-7de3-4a8a-96dd-d1cee159c92b + resourceVersion: '90999552' + generation: 1 + creationTimestamp: '2022-12-15T09:03:44Z' + labels: + app.kubernetes.io/managed-by: Helm + annotations: + deprecated.DaemonSet.template.generation: '1' + meta.helm.sh/release-name: hwameistor + meta.helm.sh/release-namespace:hwameistor + spec: + selector: + matchLabels: + app: hwameistor-local-disk-manager + template: + metadata: + creationTimestamp: null + labels: + app: hwameistor-local-disk-manager + spec: + volumes: + - name: udev + hostPath: + path: /run/udev + type: Directory + - name: procmount + hostPath: + path: /proc + type: Directory + - name: devmount + hostPath: + path: /dev + type: Directory + - name: socket-dir + hostPath: + path: /var/lib/kubelet/plugins/disk.hwameistor.io + type: DirectoryOrCreate + - name: registration-dir + hostPath: + path: /var/lib/kubelet/plugins_registry/ + type: Directory + - name: plugin-dir + hostPath: + path: /var/lib/kubelet/plugins + type: DirectoryOrCreate + - name: pods-mount-dir + hostPath: + path: /var/lib/kubelet/pods + type: DirectoryOrCreate + containers: + - name: registrar + image: k8s-gcr.m.daocloud.io/sig-storage/csi-node-driver-registrar:v2.5.0 + args: + - '--v=5' + - '--csi-address=/csi/csi.sock' + - >- + --kubelet-registration-path=/var/lib/kubelet/plugins/disk.hwameistor.io/csi.sock + env: + - name: KUBE_NODE_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: spec.nodeName + resources: {} + volumeMounts: + - name: socket-dir + mountPath: /csi + - name: registration-dir + mountPath: /registration + lifecycle: + preStop: + exec: + command: + - /bin/sh + - '-c' + - >- + rm -rf /registration/disk.hwameistor.io + /registration/disk.hwameistor.io-reg.sock + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: IfNotPresent + -name: managerimage: ghcr.m.daocloud.io/hwameistor/local-disk-manager:v0.6.1 + command: + - /local-disk-manager + args: + - '--endpoint=$(CSI_ENDPOINT)' + - '--nodeid=$(NODENAME)' + - '--csi-enable=true' + env: + - name: CSI_ENDPOINT + value: unix://var/lib/kubelet/plugins/disk.hwameistor.io/csi.sock + - name: NAMESPACE + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + - name: WATCH_NAMESPACE + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + - name: POD_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.name + - name: NODENAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: spec.nodeName + - name: OPERATOR_NAME + value: local-disk-manager + resources: {} + volumeMounts: + - name: udev + mountPath: /run/udev + - name: procmount + readOnly: true + mountPath: /host/proc + - name: devmount + mountPath: /dev + - name: registration-dir + mountPath: /var/lib/kubelet/plugins_registry + - name: plugin-dir + mountPath: /var/lib/kubelet/plugins + mountPropagation: Bidirectional + - name: pods-mount-dir + mountPath: /var/lib/kubelet/pods + mountPropagation: Bidirectional + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: IfNotPresent + securityContext: + privileged: true + restartPolicy: Always + terminationGracePeriodSeconds: 30 + dnsPolicy: ClusterFirst + serviceAccountName: hwameistor-admin + serviceAccount: hwameistor-admin + hostNetwork: true + hostPID: true + securityContext: 
{} + schedulerName: default-scheduler + tolerations: + - key: CriticalAddonsOnly + operator: Exists + - key: node.kubernetes.io/not-ready + operator: Exists + effect: NoSchedule + - key: node-role.kubernetes.io/master + operator: Exists + effect: NoSchedule + - key: node-role.kubernetes.io/control-plane + operator: Exists + effect: NoSchedule + - key: node.cloudprovider.kubernetes.io/uninitialized + operator: Exists + effect: NoSchedule + updateStrategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 1 + maxSurge: 0 + revisionHistoryLimit: 10 + status: + currentNumberScheduled: 4 + numberMisscheduled: 0 + desiredNumberScheduled: 4 + numberReady: 4 + observedGeneration: 1 + updatedNumberScheduled: 4 + numberAvailable: 4 + ``` diff --git a/docs/en/docs/end-user/kpanda/workloads/create-deployment.md b/docs/en/docs/end-user/kpanda/workloads/create-deployment.md new file mode 100644 index 0000000000..1e19aea87d --- /dev/null +++ b/docs/en/docs/end-user/kpanda/workloads/create-deployment.md @@ -0,0 +1,233 @@ +--- +MTPE: FanLin +Date: 2024-02-27 +--- + +# Create Deployment + +This page describes how to create deployments through images and YAML files. + +[Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) is a common resource in Kubernetes, mainly [Pod](https://kubernetes.io/docs/concepts/workloads/pods/) and [ReplicaSet](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/) provide declarative updates, support elastic scaling, rolling upgrades, and version rollbacks features. Declare the desired Pod state in the Deployment, and the Deployment Controller will modify the current state through the ReplicaSet to make it reach the pre-declared desired state. Deployment is stateless and does not support data persistence. It is suitable for deploying stateless applications that do not need to save data and can be restarted and rolled back at any time. + +Through the container management module of AI platform, workloads on multicloud and multiclusters can be easily managed based on corresponding role permissions, including the creation of deployments, Full life cycle management such as update, deletion, elastic scaling, restart, and version rollback. + +## Prerequisites + +Before using image to create deployments, the following prerequisites need to be met: + +- In the [Container Management](../../intro/index.md) module [Access Kubernetes Cluster](../clusters/integrate-cluster.md) or [Create Kubernetes Cluster](../clusters/create-cluster.md), and can access the cluster UI interface. + +- Create a [namespace](../namespaces/createns.md) and a [user](../../../ghippo/access-control/user.md). + +- The current operating user should have [NS Editor](../permissions/permission-brief.md#ns-editor) or higher permissions, for details, refer to [Namespace Authorization](../namespaces/createns.md). + +- When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail. + +## Create by image + +Follow the steps below to create a deployment by image. + +1. Click __Clusters__ on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page. + + ![Clusters](../images/deploy01.png) + +2. On the cluster details page, click __Workloads__ -> __Deployments__ in the left navigation bar, and then click the __Create by Image__ button in the upper right corner of the page. + + ![Deployment](../images/deploy02.png) + +3. 
Fill in [Basic Information](create-deployment.md#basic-information), [Container Setting](create-deployment.md#container-settings), [Service Setting](create-deployment.md#service-settings), [Advanced Setting](create-deployment.md#advanced-settings) in turn, then click __OK__ in the lower right corner of the page to complete the creation.
+
+    The system will automatically return to the list of __Deployments__ . Click __┇__ on the right side of the list to perform operations such as update, delete, elastic scaling, restart, and version rollback on the workload. If the workload status is abnormal, please check the specific abnormal information, refer to [Workload Status](../workloads/pod-config/workload-status.md).
+
+    ![Menu](../images/deploy18.png)
+
+### Basic information
+
+- Workload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number, such as deployment-01. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created.
+- Namespace: Select the namespace where the newly created workload will be deployed. The default namespace is used by default. If you can't find the desired namespace, you can go to [Create a new namespace](../namespaces/createns.md) according to the prompt on the page.
+- Pods: Enter the number of Pod instances for the workload; one Pod instance is created by default.
+- Description: Enter the description information of the workload and customize the content. The number of characters cannot exceed 512.
+
+![Basic Information](../images/deploy04.png)
+
+### Container settings
+
+Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the requirements of each part.
+
+> Container setting is only configured for a single container. To add multiple containers to a pod, click __+__ on the right to add multiple containers.
+
+=== "Basic Information (Required)"
+
+    When configuring container-related parameters, it is essential to correctly fill in the container name and image parameters;
+    otherwise, you will not be able to proceed to the next step.
+    After filling in the configuration according to the following requirements, click __OK__.
+
+    ![Basic Info](../images/deploy05.png)
+
+    - Container Type: The default is `Work Container`. For information on init containers, see the
+      [K8s official documentation](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/).
+    - Container Name: No more than 63 characters, supporting lowercase letters, numbers, and separators ("-").
+      It must start and end with a lowercase letter or number, for example, nginx-01.
+    - Image:
+        - Image: Select an appropriate image from the list. When entering the image name, the default is to pull the image
+          from the official [DockerHub](https://hub.docker.com/).
+          After integrating the [Image Repository](../../../kangaroo/intro/index.md) module of AI platform,
+          you can click the __Choose an image__ button on the right to choose an image.
+        - Image Version: Select an appropriate version from the dropdown list.
+        - Image Pull Policy: By checking __Always pull the image__, the image will be pulled from the repository each time
+          the workload restarts/upgrades.
+          If unchecked, it will only pull the local image, and will pull from the repository only if the image does not exist locally.
+ For more details, refer to [Image Pull Policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy). + - Registry Secret: Optional. If the target repository requires a Secret to access, you need to [create secret](../configmaps-secrets/create-secret.md) first. + - Privileged Container: By default, the container cannot access any device on the host. After enabling the privileged container, + the container can access all devices on the host and has all the privileges of running processes on the host. + - CPU/Memory Request: The request value (the minimum resource needed) and the limit value (the maximum resource allowed) + for CPU/memory resources. Configure resources for the container as needed to avoid resource waste and system failures + caused by container resource overages. Default values are shown in the figure. + - GPU Configuration: Configure GPU usage for the container, supporting only positive integers. + The GPU quota setting supports configuring the container to exclusively use an entire GPU card or part of a vGPU. + For example, for a GPU card with 8 cores, entering the number __8__ means the container exclusively uses the entire card, + and entering the number __1__ means configuring 1 core of the vGPU for the container. + + > Before setting the GPU, the administrator needs to pre-install the GPU card and driver plugin on the cluster node + and enable the GPU feature in the [Cluster Settings](../clusterops/cluster-settings.md). + +=== "Lifecycle (optional)" + + Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to [Container Lifecycle Setting](pod-config/lifecycle.md). + + ![Lifecycle](../images/deploy06.png) + +=== "Health Check (optional)" + + It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to [Container Health Check Setting](pod-config/health-check.md). + + ![Health Check](../images/deploy07.png) + +=== "Environment variables (optional)" + + Configure container parameters within the Pod, add environment variables or pass setting to the Pod, etc. For details, refer to [Container environment variable setting](pod-config/env-variables.md). + + ![Environment variables](../images/deploy08.png) + +=== "Data storage (optional)" + + Configure the settings for container mounting data volumes and data persistence. For details, refer to [Container Data Storage Setting](pod-config/env-variables.md). + + ![Data storage](../images/deploy09.png) + +=== "Security settings (optional)" + + Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter __0__ to use the privileges of the root account. + + ![Security settings](../images/deploy10.png) + +### Service settings + +Configure [Service](../network/create-services.md) for the deployment, so that the deployment can be accessed externally. + +1. Click the __Create Service__ button. + + ![Create Service](../images/deploy12.png) + +2. Refer to [Create Service](../network/create-services.md) to configure service parameters. + + ![Service Settings](../images/deploy13.png) + +3. Click __OK__ and click __Next__ . + +### Advanced settings + +Advanced setting includes four parts: Network Settings, Upgrade Policy, Scheduling Policies, Labels and Annotations. 
You can click the tabs below to view the setting requirements of each part.
+
+=== "Network Settings"
+
+    1. For container NIC setting, refer to [Workload Usage IP Pool](../../../network/config/use-ippool/usage.md)
+    2. DNS setting
+
+        In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related setting options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases.
+
+        - DNS Policy
+
+            - Default: Make the container use the domain name resolution file pointed to by the __--resolv-conf__ parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query.
+            - ClusterFirstWithHostNet: The application uses the domain name resolution file of the host to which it is connected.
+            - ClusterFirst: The application docks with Kube-DNS/CoreDNS.
+            - None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting it to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the setting of dnsConfig.
+
+        - Nameservers: Fill in the address of the domain name server, such as __10.6.175.20__ .
+        - Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains.
+        - Options: Setting options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig.
+        - Host Alias: The alias set for the host.
+
+    ![DNS](../images/deploy17.png)
+
+=== "Upgrade Policy"
+
+    - Upgrade Mode: __Rolling upgrade__ refers to gradually replacing instances of the old version with instances of the new version. During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. __Rebuild and upgrade__ refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted.
+    - Max Unavailable: Specify the maximum number or ratio of unavailable pods during the workload update process, the default is 25%. If it is equal to the number of instances, there is a risk of service interruption.
+    - Max Surge: The maximum number or ratio of the total number of Pods exceeding the desired replica count of Pods during a Pod update. The default is 25%.
+    - Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10.
+    - Minimum Ready: The minimum time for a Pod to be ready. Only after this time is the Pod considered available. The default is 0 seconds.
+    - Upgrade Max Duration: If the deployment is not successful after the set time, the workload will be marked as failed. The default is 600 seconds.
+    - Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds.
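+
+    As a rough reference only, the fields above map to the Deployment spec approximately as in the following minimal sketch (the workload name `demo-rolling` and the nginx image are placeholders for illustration; the values shown are the defaults listed above):
+
+    ```yaml
+    apiVersion: apps/v1
+    kind: Deployment
+    metadata:
+      name: demo-rolling               # hypothetical workload name
+    spec:
+      replicas: 2
+      revisionHistoryLimit: 10         # Revision History Limit
+      minReadySeconds: 0               # Minimum Ready
+      progressDeadlineSeconds: 600     # Upgrade Max Duration
+      strategy:
+        type: RollingUpdate            # use Recreate for "Rebuild and upgrade"
+        rollingUpdate:
+          maxUnavailable: 25%          # Max Unavailable
+          maxSurge: 25%                # Max Surge
+      selector:
+        matchLabels:
+          app: demo-rolling
+      template:
+        metadata:
+          labels:
+            app: demo-rolling
+        spec:
+          terminationGracePeriodSeconds: 30   # Graceful Period
+          containers:
+            - name: nginx
+              image: nginx:1.14.2
+    ```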
+ + ![Upgrade Policy](../images/deploy14.png) + +=== "Scheduling Policies" + + - Toleration time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds. + - Node Affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on. + - Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node. + - Workload Anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node. + + > For details, refer to [Scheduling Policy](pod-config/scheduling-policy.md). + + ![Scheduling Policy](../images/deploy15.png) + +=== "Labels and Annotations" + + You can click the __Add__ button to add tags and annotations to workloads and pods. + + ![Labels and Annotations](../images/deploy16.png) + +## Create from YAML + +In addition to image, you can also create deployments more quickly through YAML files. + +1. Click __Clusters__ on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page. + + ![Clusters](../images/deploy01.png) + +2. On the cluster details page, click __Workloads__ -> __Deployments__ in the left navigation bar, and then click the __Create from YAML__ button in the upper right corner of the page. + + ![Deployments](../images/deploy02Yaml.png) + +3. Enter or paste the YAML file prepared in advance, click __OK__ to complete the creation. + + ![Confirm](../images/deploy03Yaml.png) + +??? note "Click to see an example YAML for creating a deployment" + + ```yaml + apiVersion: apps/v1 + kind: Deployment + metadata: + name: nginx-deployment + spec: + selector: + matchLabels: + app: nginx + replicas: 2 # (1)! + template: + metadata: + labels: + app: nginx + spec: + containers: + -name: nginx + image: nginx:1.14.2 + ports: + - containerPort: 80 + ``` + + 1. Tell the Deployment to run 2 Pods that match this template diff --git a/docs/en/docs/end-user/kpanda/workloads/create-job.md b/docs/en/docs/end-user/kpanda/workloads/create-job.md new file mode 100644 index 0000000000..7e73afcff1 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/workloads/create-job.md @@ -0,0 +1,195 @@ +--- +MTPE: FanLin +Date: 2024-02-28 +--- + +# Create Job + +This page introduces how to create a job through image and YAML file. + +Job is suitable for performing one-time jobs. A Job creates one or more Pods, and the Job keeps retrying to run Pods until a certain number of Pods are successfully terminated. A Job ends when the specified number of Pods are successfully terminated. When a Job is deleted, all Pods created by the Job will be cleared. When a Job is paused, all active Pods in the Job are deleted until the Job is resumed. For more information about jobs, refer to [Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/). + +## Prerequisites + +- In the [Container Management](../../intro/index.md) module [Access Kubernetes Cluster](../clusters/integrate-cluster.md) or [Create Kubernetes Cluster](../clusters/create-cluster.md), and can access the cluster UI interface. + +- Create a [namespace](../namespaces/createns.md) and a [user](../../../ghippo/access-control/user.md). + +- The current operating user should have [NS Editor](../permissions/permission-brief.md#ns-editor) or higher permissions, for details, refer to [Namespace Authorization](../namespaces/createns.md). 
+ +- When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail. + +## Create by image + +Refer to the following steps to create a job using an image. + +1. Click __Clusters__ on the left navigation bar, and then click the name of the target cluster to enter the cluster details page. + + ![Clusters](../images/deploy01.png) + +2. On the cluster details page, click __Workloads__ -> __Jobs__ in the left navigation bar, and then click the __Create by Image__ button in the upper right corner of the page. + + ![Jobs](../images/job01.png) + +3. Fill in [Basic Information](create-job.md#basic-information), [Container Settings](create-job.md#container-settings) and [Advanced Settings](create-job.md#advanced-settings), click __OK__ in the lower right corner of the page to complete the creation. + + The system will automatically return to the __job__ list. Click __┇__ on the right side of the list to perform operations such as updating, deleting, and restarting the job. + + ![Config](../images/job07.png) + +### Basic information + +On the __Create Jobs__ page, enter the basic information according to the table below, and click __Next__ . + +![Create Jobs](../images/job02.png) + +- Payload Name: Can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created. +- Namespace: Select which namespace to deploy the newly created job in, and the default namespace is used by default. If you can't find the desired namespace, you can go to [Create a new namespace](../namespaces/createns.md) according to the prompt on the page. +- Number of Instances: Enter the number of Pod instances for the workload. By default, 1 Pod instance is created. +- Description: Enter the description information of the workload and customize the content. The number of characters should not exceed 512. + +### Container settings + +Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. Click the tab below to view the setting requirements of each part. + +> Container settings is only configured for a single container. To add multiple containers to a pod, click __+__ on the right to add multiple containers. + +=== "Basic information (required)" + + ![Basic Information](../images/job02-1.png) + + When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click __OK__ . + + - Container Name: Up to 63 characters, lowercase letters, numbers and separators ("-") are supported. Must start and end with a lowercase letter or number, eg nginx-01. + - Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official [DockerHub](https://hub.docker.com/) by default. After accessing the [container registry](../../../kangaroo/intro/index.md) module of AI platform, you can click __Select Image__ on the right to select an image. 
+ - Image Pull Policy: After checking __Always pull image__ , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be pulled, and only when the image does not exist locally, it will be re-pulled from the container registry. For more details, refer to [Image Pull Policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy). + - Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host. + - CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure. + - GPU Exclusive: Configure the GPU usage for the container, only positive integers are supported. The GPU quota setting supports setting exclusive use of the entire GPU card or part of the vGPU for the container. For example, for an 8-core GPU card, enter the number __8__ to let the container exclusively use the entire length of the card, and enter the number __1__ to configure a 1-core vGPU for the container. + + > Before setting exclusive GPU, the administrator needs to install the GPU card and driver plug-in on the cluster nodes in advance, and enable the GPU feature in [Cluster Settings](../clusterops/cluster-settings.md). + +=== "Lifecycle (optional)" + + Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to [Container Lifecycle settings](pod-config/lifecycle.md). + + ![Lifecycle](../images/job06.png) + +=== "Health Check (optional)" + + It is used to judge the health status of containers and applications, which helps to improve the availability of applications. For details, refer to [Container Health Check settings](pod-config/health-check.md). + + ![Health Check](../images/deploy07.png) + +=== "Environment Variables (optional)" + + Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to [Container environment variable settings](pod-config/env-variables.md). + + ![Environment Variables](../images/deploy08.png) + +=== "Data Storage (optional)" + + Configure the settings for container mounting data volumes and data persistence. For details, refer to [Container Data Storage settings](pod-config/env-variables.md). + + ![Data storage](../images/deploy09.png) + +=== "Security Settings (optional)" + + Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter __0__ to use the privileges of the root account. + + ![Security settings](../images/deploy10.png) + +### Advanced settings + +Advanced setting includes job settings, labels and annotations. + +=== "Job Settings" + + ![Job Settings](../images/job03.png) + + - Parallel Pods: the maximum number of Pods that can be created at the same time during job execution, and the parallel number should not be greater than the total number of Pods. Default is 1. 
+ - Timeout: When this time is exceeded, the job will be marked as failed to execute, and all Pods under the job will be deleted. When it is empty, it means that no timeout is set. + - Restart Policy: Whether to restart the Pod when the setting fails. + +=== "Labels and Annotations" + + You can click the __Add__ button to add labels and annotations to the workload instance Pod. + + ![Labels and Annotations](../images/job04.png) + +## Create from YAML + +In addition to image, creation jobs can also be created more quickly through YAML files. + +1. Click __Clusters__ on the left navigation bar, and then click the name of the target cluster to enter the cluster details page. + + ![Clusters](../images/deploy01.png) + +2. On the cluster details page, click __Workloads__ -> __Jobs__ in the left navigation bar, and then click the __Create from YAML__ button in the upper right corner of the page. + + ![Create](../images/job08.png) + +3. Enter or paste the YAML file prepared in advance, click __OK__ to complete the creation. + + ![Confirm](../images/job09.png) + +??? note "Click to view the complete YAML" + + ```yaml + kind: Job + apiVersion: batch/v1 + metadata: + name: demo + namespace: default + uid: a9708239-0358-4aa1-87d3-a092c080836e + resourceVersion: '92751876' + generation: 1 + creationTimestamp: '2022-12-26T10:52:22Z' + labels: + app: demo + controller-uid: a9708239-0358-4aa1-87d3-a092c080836e + job-name: demo + annotations: + revisions: >- + {"1":{"status":"running","uid":"a9708239-0358-4aa1-87d3-a092c080836e","start-time":"2022-12-26T10:52:22Z","completion-time":"0001-01-01T00:00:00Z"}} + spec: + parallelism: 1 + backoffLimit: 6 + selector: + matchLabels: + controller-uid: a9708239-0358-4aa1-87d3-a092c080836e + template: + metadata: + creationTimestamp: null + labels: + app: demo + controller-uid: a9708239-0358-4aa1-87d3-a092c080836e + job-name: demo + spec: + containers: + - name: container-4 + image: nginx + resources: + limits: + cpu: 250m + memory: 512Mi + requests: + cpu: 250m + memory: 512Mi + lifecycle: {} + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: IfNotPresent + securityContext: + privileged: false + restartPolicy: Never + terminationGracePeriodSeconds: 30 + dnsPolicy: ClusterFirst + securityContext: {} + schedulerName: default-scheduler + completionMode: NonIndexed + suspend: false + status: + startTime: '2022-12-26T10:52:22Z' + active: 1 + ``` diff --git a/docs/en/docs/end-user/kpanda/workloads/create-statefulset.md b/docs/en/docs/end-user/kpanda/workloads/create-statefulset.md new file mode 100644 index 0000000000..cc0f3bacfe --- /dev/null +++ b/docs/en/docs/end-user/kpanda/workloads/create-statefulset.md @@ -0,0 +1,655 @@ +--- +MTPE: FanLin +Date: 2024-02-28 +--- + +# Create StatefulSet + +This page describes how to create a StatefulSet through image and YAML files. + +[StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) is a common resource in Kubernetes, and [Deployment](create-deployment.md), mainly used to manage the deployment and scaling of Pod collections. The main difference between the two is that Deployment is stateless and does not save data, while StatefulSet is stateful and is mainly used to manage stateful applications. In addition, Pods in a StatefulSet have a persistent ID, which makes it easy to identify the corresponding Pod when matching storage volumes. 
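+
+The following minimal sketch shows how these properties appear in a StatefulSet manifest: `serviceName` points to a headless Service that gives each Pod a stable network identity (web-0, web-1, ...), and `volumeClaimTemplates` creates one PVC per Pod that stays bound to it across rescheduling. The names `web` and `web-headless` and the nginx image are placeholders for illustration only.
+
+```yaml
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  name: web                     # Pods will be named web-0, web-1, ...
+spec:
+  serviceName: web-headless     # headless Service providing the stable network identity
+  replicas: 2
+  selector:
+    matchLabels:
+      app: web
+  template:
+    metadata:
+      labels:
+        app: web
+    spec:
+      containers:
+        - name: nginx
+          image: nginx:1.14.2
+          volumeMounts:
+            - name: data
+              mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:         # one PVC is created per Pod and follows it when rescheduled
+    - metadata:
+        name: data
+      spec:
+        accessModes: ["ReadWriteOnce"]
+        resources:
+          requests:
+            storage: 1Gi
+```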
+ +Through the container management module of AI platform, workloads on multicloud and multiclusters can be easily managed based on corresponding role permissions, including the creation of StatefulSets, update, delete, elastic scaling, restart, version rollback and other full life cycle management. + +## Prerequisites + +Before using image to create StatefulSets, the following prerequisites need to be met: + +- In the [Container Management](../../intro/index.md) module [Access Kubernetes Cluster](../clusters/integrate-cluster.md) or [Create Kubernetes Cluster](../clusters/create-cluster.md), and can access the cluster UI interface. + +- Create a [namespace](../namespaces/createns.md) and a [user](../../../ghippo/access-control/user.md). + +- The current operating user should have [NS Editor](../permissions/permission-brief.md#ns-editor) or higher permissions, for details, refer to [Namespace Authorization](../namespaces/createns.md). + +- When there are multiple containers in a single instance, please make sure that the ports used by the containers do not conflict, otherwise the deployment will fail. + +## Create by image + +Follow the steps below to create a statefulSet using image. + +1. Click __Clusters__ on the left navigation bar, then click the name of the target cluster to enter Cluster Details. + + ![Clusters](../images/deploy01.png) + +2. Click __Workloads__ -> __StatefulSets__ in the left navigation bar, and then click the __Create by Image__ button in the upper right corner. + + ![StatefulSets](../images/state02.png) + +3. Fill in [Basic Information](create-statefulset.md#basic-information), [Container Settings](create-statefulset.md#container-settings), [Service Settings](create-statefulset.md#service-settings), [Advanced Settings](create-statefulset.md#advanced-settings), click __OK__ in the lower right corner of the page to complete the creation. + + The system will automatically return to the list of __StatefulSets__ , and wait for the status of the workload to become __running__ . If the workload status is abnormal, refer to [Workload Status](../workloads/pod-config/workload-status.md) for specific exception information. + + Click __┇__ on the right side of the New Workload column to perform operations such as update, delete, elastic scaling, restart, and version rollback on the workload. + + ![Status](../images/state10.png) + +### Basic Information + +- Workload Name: can contain up to 63 characters, can only contain lowercase letters, numbers, and a separator ("-"), and must start and end with a lowercase letter or number, such as deployment-01. The name of the same type of workload in the same namespace cannot be repeated, and the name of the workload cannot be changed after the workload is created. +- Namespace: Select the namespace where the newly created payload will be deployed. The default namespace is used by default. If you can't find the desired namespace, you can go to [Create a new namespace](../namespaces/createns.md) according to the prompt on the page. +- Pods: Enter the number of Pod instances for the load, and one Pod instance is created by default. +- Description: Enter the description information of the payload and customize the content. The number of characters cannot exceed 512. + +![Basic Information](../images/state01.png) + +### Container settings + +Container setting is divided into six parts: basic information, life cycle, health check, environment variables, data storage, and security settings. 
Click the tab below to view the requirements of each part. + +> Container settings is only configured for a single container. To add multiple containers to a pod, click __+__ on the right to add multiple containers. + +=== "Basic information (required)" + + When configuring container-related parameters, you must correctly fill in the container name and image parameters, otherwise you will not be able to proceed to the next step. After filling in the settings with reference to the following requirements, click __OK__ . + + - Container Name: Up to 63 characters, lowercase letters, numbers and separators ("-") are supported. Must start and end with a lowercase letter or number, eg nginx-01. + - Image: Enter the address or name of the image. When entering the image name, the image will be pulled from the official [DockerHub](https://hub.docker.com/) by default. After accessing the [container registry](../../../kangaroo/intro/index.md) module of AI platform, you can click __Select Image__ on the right to select an image. + - Image Pull Policy: After checking __Always pull image__ , the image will be pulled from the registry every time the workload restarts/upgrades. If it is not checked, only the local image will be pulled, and only when the image does not exist locally, it will be re-pulled from the container registry. For more details, refer to [Image Pull Policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy). + - Privileged container: By default, the container cannot access any device on the host. After enabling the privileged container, the container can access all devices on the host and enjoy all the permissions of the running process on the host. + - CPU/Memory Quota: Requested value (minimum resource to be used) and limit value (maximum resource allowed to be used) of CPU/Memory resource. Please configure resources for containers as needed to avoid resource waste and system failures caused by excessive container resources. The default value is shown in the figure. + - GPU Exclusive: Configure the GPU usage for the container, only positive integers are supported. The GPU quota setting supports setting exclusive use of the entire GPU card or part of the vGPU for the container. For example, for an 8-core GPU card, enter the number __8__ to let the container exclusively use the entire length of the card, and enter the number __1__ to configure a 1-core vGPU for the container. + + > Before setting exclusive GPU, the administrator needs to install the GPU card and driver plug-in on the cluster nodes in advance, and enable the GPU feature in [Cluster Settings](../clusterops/cluster-settings.md). + + ![Basic Info](../images/state11.png) + +=== "Lifecycle (optional)" + + Set the commands that need to be executed when the container starts, after starting, and before stopping. For details, refer to [Container Lifecycle Configuration](pod-config/lifecycle.md). + + ![Lifecycle](../images/state06.png) + +=== "Health Check (optional)" + + Used to judge the health status of containers and applications. Helps improve app usability. For details, refer to [Container Health Check Configuration](pod-config/health-check.md). + + ![Health Check](../images/deploy07.png) + +=== "Environment Variables (optional)" + + Configure container parameters within the Pod, add environment variables or pass settings to the Pod, etc. For details, refer to [Container environment variable settings](pod-config/env-variables.md). 
+ + ![Environment Variables](../images/deploy08.png) + +=== "Data Storage (optional)" + + Configure the settings for container mounting data volumes and data persistence. For details, refer to [Container Data Storage Configuration](pod-config/env-variables.md). + + ![Data Storage](../images/state09.png) + +=== "Security Settings (optional)" + + Containers are securely isolated through Linux's built-in account authority isolation mechanism. You can limit container permissions by using account UIDs (digital identity tokens) with different permissions. For example, enter __0__ to use the privileges of the root account. + + ![Security Settings](../images/deploy10.png) + +### Service settings + +Configure [Service (Service)](../network/create-services.md) for the statefulset, so that the statefulset can be accessed externally. + +1. Click the __Create Service__ button. + + ![Create Service](../images/state12.png) + +2. Refer to [Create Service](../network/create-services.md) to configure service parameters. + + ![Config Parameters](../images/deploy13.png) + +3. Click __OK__ and click __Next__ . + +### Advanced settings + +Advanced setting includes four parts: load network settings, upgrade policy, scheduling policy, label and annotation. You can click the tabs below to view the requirements of each part. + +=== "Network Configuration" + + 1. For container NIC settings, refer to [Workload Usage IP Pool](../../../network/config/use-ippool/usage.md) + 2. DNS settings + + In some cases, the application will have redundant DNS queries. Kubernetes provides DNS-related settings options for applications, which can effectively reduce redundant DNS queries and increase business concurrency in certain cases. + + - DNS Policy + + - Default: Make the container use the domain name resolution file pointed to by the __--resolv-conf__ parameter of kubelet. This setting can only resolve external domain names registered on the Internet, but cannot resolve cluster internal domain names, and there is no invalid DNS query. + - ClusterFirstWithHostNet: The domain name file of the application docking host. + - ClusterFirst: application docking with Kube-DNS/CoreDNS. + - None: New option value introduced in Kubernetes v1.9 (Beta in v1.10). After setting to None, dnsConfig must be set. At this time, the domain name resolution file of the container will be completely generated through the settings of dnsConfig. + + - Nameservers: fill in the address of the domain name server, such as __10.6.175.20__ . + - Search domains: DNS search domain list for domain name query. When specified, the provided search domain list will be merged into the search field of the domain name resolution file generated based on dnsPolicy, and duplicate domain names will be deleted. Kubernetes allows up to 6 search domains. + - Options: Configuration options for DNS, where each object can have a name attribute (required) and a value attribute (optional). The content in this field will be merged into the options field of the domain name resolution file generated based on dnsPolicy. If some options of dnsConfig options conflict with the options of the domain name resolution file generated based on dnsPolicy, they will be overwritten by dnsConfig. + - Host Alias: the alias set for the host. + + ![DNS](../images/state17.png) + +=== "Upgrade Policy" + + - Upgrade Mode: __Rolling upgrade__ refers to gradually replacing instances of the old version with instances of the new version. 
During the upgrade process, business traffic will be load-balanced to the old and new instances at the same time, so the business will not be interrupted. __Rebuild and upgrade__ refers to deleting the workload instance of the old version first, and then installing the specified new version. During the upgrade process, the business will be interrupted. + - Revision History Limit: Set the number of old versions retained when the version is rolled back. The default is 10. + - Graceful Period: The execution period (0-9,999 seconds) of the command before the workload stops, the default is 30 seconds. + + ![Upgrade Policy](../images/state14.png) + +=== "Container Management Policies" + + Kubernetes v1.7 and later versions can set Pod management policies through __.spec.podManagementPolicy__ , which supports the following two methods: + + - __OrderedReady__ : The default Pod management policy, which means that Pods are deployed in order. Only after the deployment of the previous Pod is successfully completed, the statefulset will start to deploy the next Pod. Pods are deleted in reverse order, with the last created being deleted first. + + - __Parallel__ : Create or delete containers in parallel, just like Pods of the Deployment type. The StatefulSet controller starts or terminates all containers in parallel. There is no need to wait for a Pod to enter the Running and ready state or to stop completely before starting or terminating other Pods. This option only affects the behavior of scaling operations, not the order of updates. + + ![Container Management Policies](../images/state05.png) + +=== "Scheduling Policies" + + - Tolerance time: When the node where the workload instance is located is unavailable, the time for rescheduling the workload instance to other available nodes, the default is 300 seconds. + - Node affinity: According to the label on the node, constrain which nodes the Pod can be scheduled on. + - Workload Affinity: Constrains which nodes a Pod can be scheduled to based on the labels of the Pods already running on the node. + - Workload anti-affinity: Constrains which nodes a Pod cannot be scheduled to based on the labels of Pods already running on the node. + - Topology domain: namely topologyKey, used to specify a group of nodes that can be scheduled. For example, __kubernetes.io/os__ indicates that as long as the node of an operating system meets the conditions of labelSelector, it can be scheduled to the node. + + > For details, refer to [Scheduling Policy](pod-config/scheduling-policy.md). + + ![Scheduling Policies](../images/state15.png) + +=== "Labels and Annotations" + + You can click the __Add__ button to add tags and annotations to workloads and pods. + + ![Labels and Annotations](../images/state16.png) + +## Create from YAML + +In addition to image, you can also create statefulsets more quickly through YAML files. + +1. Click __Clusters__ on the left navigation bar, and then click the name of the target cluster to enter the Cluster Details page. + + ![Clusters](../images/deploy01.png) + +2. On the cluster details page, click __Workloads__ -> __StatefulSets__ in the left navigation bar, and then click the __Create from YAML__ button in the upper right corner of the page. + + ![Workloads](../images/state02Yaml.png) + +3. Enter or paste the YAML file prepared in advance, click __OK__ to complete the creation. + + ![Confirm](../images/state03yaml.png) + +??? 
note "Click to see an example YAML for creating a statefulSet" + + ```yaml + kind: StatefulSet + apiVersion: apps/v1 + metadata: + name: test-mysql-123-mysql + namespace: default + uid: d3f45527-a0ab-4b22-9013-5842a06f4e0e + resourceVersion: '20504385' + generation: 1 + creationTimestamp: '2022-09-22T09:34:10Z' + ownerReferences: + - apiVersion: mysql.presslabs.org/v1alpha1 + kind: MysqlCluster + name: test-mysql-123 + uid: 5e877cc3-5167-49da-904e-820940cf1a6d + controller: true + blockOwnerDeletion: true + spec: + replicas: 1 + selector: + matchLabels: + app.kubernetes.io/managed-by: mysql.presslabs.org + app.kubernetes.io/name: mysql + mysql.presslabs.org/cluster: test-mysql-123 + template: + metadata: + creationTimestamp: null + labels: + app.kubernetes.io/component: database + app.kubernetes.io/instance: test-mysql-123 + app.kubernetes.io/managed-by: mysql.presslabs.org + app.kubernetes.io/name: mysql + app.kubernetes.io/version: 5.7.31 + mysql.presslabs.org/cluster: test-mysql-123 + annotations: + config_rev: '13941099' + prometheus.io/port: '9125' + prometheus.io/scrape: 'true' + secret_rev: '13941101' + spec: + volumes: + -name: conf + emptyDir: {} + - name: init-scripts + emptyDir: {} + - name: config-map + configMap: + name: test-mysql-123-mysql + defaultMode: 420 + - name: data + persistentVolumeClaim: + claimName: data + initContainers: + -name: init + image: docker.m.daocloud.io/bitpoke/mysql-operator-sidecar-5.7:v0.6.1 + args: + - clone-and-init + envFrom: + - secretRef: + name: test-mysql-123-mysql-operated + env: + - name: MY_NAMESPACE + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + - name: MY_POD_NAME + valueFrom: + fieldRef:apiVersion: v1 + fieldPath: metadata.name + - name: MY_POD_IP + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: status.podIP + - name: MY_SERVICE_NAME + value: mysql + - name: MY_CLUSTER_NAME + value: test-mysql-123 + - name: MY_FQDN + value: $(MY_POD_NAME).$(MY_SERVICE_NAME).$(MY_NAMESPACE) + - name: MY_MYSQL_VERSION + value: 5.7.31 + - name: BACKUP_USER + valueFrom: + secretKeyRef: + name: test-mysql-123-mysql-operated + key: BACKUP_USER + optional: true + - name: BACKUP_PASSWORD + valueFrom: + secretKeyRef: + name: test-mysql-123-mysql-operated + key: BACKUP_PASSWORD + optional: true + resources: {} + volumeMounts: + - name: conf + mountPath: /etc/mysql + - name: config-map + mountPath: /mnt/conf + - name: data + mountPath: /var/lib/mysql + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: IfNotPresent + containers: + - name: mysql + image: docker.m.daocloud.io/mysql:5.7.31 + ports: + - name: mysql + containerPort: 3306 + protocol: TCP + env: + - name: MY_NAMESPACE + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + - name: MY_POD_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.name + - name: MY_POD_IP + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: status.podIP + - name: MY_SERVICE_NAME + value: mysql + - name: MY_CLUSTER_NAME + value: test-mysql-123 + - name: MY_FQDN + value: $(MY_POD_NAME).$(MY_SERVICE_NAME).$(MY_NAMESPACE) + - name: MY_MYSQL_VERSION + value: 5.7.31 + - name: ORCH_CLUSTER_ALIAS + value: test-mysql-123.default + - name: ORCH_HTTP_API + value: http://mysql-operator.mcamel-system/api + - name: MYSQL_ROOT_PASSWORD + valueFrom: + secretKeyRef: + name: test-mysql-123-secret + key: ROOT_PASSWORD + optional: false + - name: MYSQL_USER + valueFrom: + secretKeyRef: + name: test-mysql-123-secret + key: USER + 
optional: true + - name: MYSQL_PASSWORD + valueFrom: + secretKeyRef: + name: test-mysql-123-secret + key: PASSWORD + optional: true + - name: MYSQL_DATABASE + valueFrom: + secretKeyRef: + name: test-mysql-123-secret + key: DATABASE + optional: true + resources: + limits: + cpu: '1' + memory: 1Gi + requests: + cpu: 100m + memory: 512Mi + volumeMounts: + - name: conf + mountPath: /etc/mysql + - name: data + mountPath: /var/lib/mysql + livenessProbe: + exec: + command: + - mysqladmin + - '--defaults-file=/etc/mysql/client.conf' + - ping + initialDelaySeconds: 60 + timeoutSeconds: 5 + periodSeconds: 5 + successThreshold: 1 + failureThreshold: 3 + readinessProbe: + exec: + command: + - /bin/sh + - '-c' + - >- + test $(mysql --defaults-file=/etc/mysql/client.conf -NB -e + 'SELECT COUNT(*) FROM sys_operator.status WHERE + name="configured" AND value="1"') -eq 1 + initialDelaySeconds: 5 + timeoutSeconds: 5 + periodSeconds: 2 + successThreshold: 1 + failureThreshold: 3 + lifecycle:preStop: + exec: + command: + - bash + - /etc/mysql/pre-shutdown-ha.sh + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: IfNotPresent + - name: sidecar + image: docker.m.daocloud.io/bitpoke/mysql-operator-sidecar-5.7:v0.6.1 + args: + - config-and-serve + ports: + - name: sidecar-http + containerPort: 8080 + protocol: TCP + envFrom: + - secretRef: + name: test-mysql-123-mysql-operated + env: + - name: MY_NAMESPACE + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + - name: MY_POD_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.name + - name: MY_POD_IP + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: status.podIP + - name: MY_SERVICE_NAME + value: mysql + - name: MY_CLUSTER_NAME + value: test-mysql-123 + - name: MY_FQDN + value: $(MY_POD_NAME).$(MY_SERVICE_NAME).$(MY_NAMESPACE) + - name: MY_MYSQL_VERSION + value: 5.7.31 + - name: XTRABACKUP_TARGET_DIR + value: /tmp/xtrabackup_backupfiles/ + resources: + limits: + cpu: '1' + memory: 1Gi + requests: + cpu: 10m + memory: 64Mi + volumeMounts: + - name: conf + mountPath: /etc/mysql + - name: data + mountPath: /var/lib/mysql + readinessProbe: + httpGet: + path: /health + port: 8080 + scheme: HTTP + initialDelaySeconds: 30 + timeoutSeconds: 5 + periodSeconds: 5 + successThreshold: 1 + failureThreshold: 3 + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: IfNotPresent + - name: metrics-exporter + image: prom/mysqld-exporter:v0.13.0 + args: + - '--web.listen-address=0.0.0.0:9125' + - '--web.telemetry-path=/metrics' + - '--collect.heartbeat' + - '--collect.heartbeat.database=sys_operator' + ports: + - name: prometheus + containerPort: 9125 + protocol: TCP + env: + - name: MY_NAMESPACE + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + - name: MY_POD_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.name + - name: MY_POD_IP + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: status.podIP + - name: MY_SERVICE_NAME + value: mysql + - name: MY_CLUSTER_NAME + value: test-mysql-123 + - name: MY_FQDN + value: $(MY_POD_NAME).$(MY_SERVICE_NAME).$(MY_NAMESPACE) + - name: MY_MYSQL_VERSION + value: 5.7.31 + - name: USER + valueFrom: + secretKeyRef: + name: test-mysql-123-mysql-operated + key: METRICS_EXPORTER_USER + optional: false + - name: PASSWORD + valueFrom: + secretKeyRef: + name: test-mysql-123-mysql-operated + key: METRICS_EXPORTER_PASSWORD + optional: false + - name: DATA_SOURCE_NAME + value: 
$(USER):$(PASSWORD)@(127.0.0.1:3306)/ + resources: + limits: + cpu: 100m + memory: 128Mi + requests: + cpu: 10m + memory: 32Mi + livenessProbe: + httpGet: + path: /metrics + port: 9125 + scheme: HTTP + initialDelaySeconds: 30 + timeoutSeconds: 30 + periodSeconds: 30 + successThreshold: 1 + failureThreshold: 3 + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: IfNotPresent + - name: pt-heartbeat + image: docker.m.daocloud.io/bitpoke/mysql-operator-sidecar-5.7:v0.6.1 + args: + - pt-heartbeat + - '--update' + - '--replace' + - '--check-read-only' + - '--create-table' + - '--database' + - sys_operator + - '--table' + - heartbeat + - '--utc' + - '--defaults-file' + - /etc/mysql/heartbeat.conf + - '--fail-successive-errors=20' + env: + - name: MY_NAMESPACE + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.namespace + - name: MY_POD_NAME + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: metadata.name + - name: MY_POD_IP + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: status.podIP + - name: MY_SERVICE_NAME + value: mysql + - name: MY_CLUSTER_NAME + value: test-mysql-123 + - name: MY_FQDN + value: $(MY_POD_NAME).$(MY_SERVICE_NAME).$(MY_NAMESPACE) + - name: MY_MYSQL_VERSION + value: 5.7.31 + resources: + limits: + cpu: 100m + memory: 64Mi + requests: + cpu: 10m + memory: 32Mi + volumeMounts: + - name: conf + mountPath: /etc/mysql + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + imagePullPolicy: IfNotPresent + restartPolicy: Always + terminationGracePeriodSeconds: 30 + dnsPolicy: ClusterFirst + securityContext: + runAsUser: 999 + fsGroup: 999 + affinity: + podAntiAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + podAffinityTerm: + labelSelector: + matchLabels: + app.kubernetes.io/component: database + app.kubernetes.io/instance: test-mysql-123 + app.kubernetes.io/managed-by: mysql.presslabs.org + app.kubernetes.io/name: mysql + app.kubernetes.io/version: 5.7.31 + mysql.presslabs.org/cluster: test-mysql-123 + topologyKey: kubernetes.io/hostname + schedulerName: default-scheduler + volumeClaimTemplates: + - kind: PersistentVolumeClaim + apiVersion: v1 + metadata: + name: data + creationTimestamp: null + ownerReferences: + - apiVersion: mysql.presslabs.org/v1alpha1 + kind: MysqlCluster + name: test-mysql-123 + uid: 5e877cc3-5167-49da-904e-820940cf1a6d + controller: true + spec: + accessModes: + - ReadWriteOnce + resources: + limits: + storage: 1Gi + requests: + storage: 1Gi + storageClassName: local-path + volumeMode: Filesystem + status: + phase: Pending + serviceName: mysql + podManagementPolicy: OrderedReady + updateStrategy: + type: RollingUpdate + rollingUpdate: + partition: 0 + revisionHistoryLimit: 10 + status: + observedGeneration: 1 + replicas: 1 + readyReplicas: 1 + currentReplicas: 1 + updatedReplicas: 1 + currentRevision: test-mysql-123-mysql-6b8f5577c7 + updateRevision: test-mysql-123-mysql-6b8f5577c7 + collisionCount: 0 + availableReplicas: 1 + ``` diff --git a/docs/en/docs/end-user/kpanda/workloads/pod-config/env-variables.md b/docs/en/docs/end-user/kpanda/workloads/pod-config/env-variables.md new file mode 100644 index 0000000000..1714e8fb48 --- /dev/null +++ b/docs/en/docs/end-user/kpanda/workloads/pod-config/env-variables.md @@ -0,0 +1,19 @@ +# Configure environment variables + +An environment variable refers to a variable set in the container running environment, which is used to add environment flags to Pods or transfer configurations, etc. 
It supports configuring environment variables for Pods in the form of key-value pairs.
+
+Suanova container management adds a graphical interface for configuring environment variables for Pods on top of native Kubernetes, and supports the following configuration methods:
+
+- **Key-value pair** (Key/Value Pair): Use a custom key-value pair as the environment variable of the container.
+
+- **Resource reference** (Resource): Use a field defined by the container as the value of the environment variable, such as the CPU or memory limit of the container.
+
+- **Variable/Variable Reference** (Pod Field): Use a Pod field as the value of an environment variable, such as the name of the Pod.
+
+- **ConfigMap key import** (ConfigMap Key): Import the value of a key in a ConfigMap as the value of an environment variable.
+
+- **Secret key import** (Secret Key): Use the data of a key in a Secret as the value of an environment variable.
+
+- **Secret import** (Secret): Import all keys in a Secret as environment variables.
+
+- **ConfigMap import** (ConfigMap): Import all keys in a ConfigMap as environment variables.
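+
+For reference, these configuration methods correspond to the standard `env` and `envFrom` fields of the Pod spec. The following is a minimal, illustrative sketch only; the image, the ConfigMap name `app-config`, the Secret name `app-secret`, and all keys are placeholders:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: env-demo
+spec:
+  containers:
+    - name: demo
+      image: nginx:1.25
+      env:
+        - name: RUN_MODE               # key-value pair
+          value: "production"
+        - name: POD_NAME               # variable reference (Pod field)
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        - name: MEM_LIMIT              # resource reference
+          valueFrom:
+            resourceFieldRef:
+              containerName: demo
+              resource: limits.memory
+        - name: DB_HOST                # ConfigMap key import
+          valueFrom:
+            configMapKeyRef:
+              name: app-config
+              key: db_host
+        - name: DB_PASSWORD            # Secret key import
+          valueFrom:
+            secretKeyRef:
+              name: app-secret
+              key: password
+      envFrom:
+        - configMapRef:                # ConfigMap import (all keys)
+            name: app-config
+        - secretRef:                   # Secret import (all keys)
+            name: app-secret
+```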
\ No newline at end of file
diff --git a/docs/en/docs/end-user/kpanda/workloads/pod-config/health-check.md b/docs/en/docs/end-user/kpanda/workloads/pod-config/health-check.md
new file mode 100644
index 0000000000..f4c2f72fef
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/workloads/pod-config/health-check.md
@@ -0,0 +1,163 @@
+---
+MTPE: windsonsea
+Date: 2024-10-15
+---
+
+# Container health check
+
+Container health checks verify the health status of containers according to user requirements. After configuration, if the application in the container becomes abnormal, the container is automatically restarted and recovered. Kubernetes provides liveness checks, readiness checks, and startup checks.
+
+- **Liveness check (LivenessProbe)** can detect application deadlock (the application is running but cannot make further progress). Restarting a container in this state can help improve the availability of the application, even if it contains bugs.
+
+- **Readiness check (ReadinessProbe)** can detect when a container is ready to accept request traffic. A Pod is considered ready only when all of its containers are ready. One use of this signal is to control which Pods are used as backends of a Service. A Pod that is not ready is removed from the Service's load balancer.
+
+- **Startup check (StartupProbe)** can detect when the application inside the container has started. After configuration, liveness and readiness checks begin only after the container has started successfully, ensuring that those probes do not interfere with application startup. Startup probes can be used to perform liveness checks on slow-starting containers, preventing them from being killed before they are up and running.
+
+## Liveness and readiness checks
+
+The configuration of LivenessProbe is similar to that of ReadinessProbe; the only difference is to use the __readinessProbe__ field instead of the __livenessProbe__ field.
+
+**HTTP GET parameter description:**
+
+| Parameter | Description |
+| --------- | ----------- |
+| Path (Path) | The requested path for access, such as the /healthz path in the example. |
+| Port (Port) | The service listening port, such as port 8080 in the example. |
+| Protocol (protocol) | The access protocol, HTTP or HTTPS. |
+| Delay time (initialDelaySeconds) | The delay before the first check, in seconds. This setting is related to the normal startup time of the business program. For example, if it is set to 30, the health check starts 30 seconds after the container is started, which is the time reserved for the business program to start. |
+| Timeout (timeoutSeconds) | The timeout, in seconds. For example, if it is set to 10, the timeout for executing the health check is 10 seconds. If this time is exceeded, the check is regarded as failed. If set to 0 or not set, the default timeout is 1 second. |
+| SuccessThreshold (successThreshold) | The minimum number of consecutive successes required for the probe to be considered successful after having failed. The default value is 1, and the minimum value is 1. This value must be 1 for liveness and startup probes. |
+| Maximum number of failures (failureThreshold) | The number of retries when the probe fails. Giving up in the case of a liveness probe means restarting the container. Pods that are given up on due to readiness probes are marked as not ready. The default value is 3, and the minimum value is 1. |
+
+### Check with HTTP GET request
+
+**YAML example:**
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  labels:
+    test: liveness
+  name: liveness-http
+spec:
+  containers:
+    - name: liveness # Container name
+      image: k8s.gcr.io/liveness # Container image
+      args:
+        - /server # Arguments to pass to the container
+      livenessProbe:
+        httpGet:
+          path: /healthz # Access request path
+          port: 8080 # Service listening port
+          httpHeaders:
+            - name: Custom-Header # Custom header name
+              value: Awesome # Custom header value
+        initialDelaySeconds: 3 # Wait 3 seconds before the first probe
+        periodSeconds: 3 # Perform liveness detection every 3 seconds
+```
+
+According to the configured rules, the kubelet sends an HTTP GET request to the service running in the container (listening on port 8080) to perform the check. The kubelet considers the container alive if the handler under the __/healthz__ path on the server returns a success code; if the handler returns a failure code, the kubelet kills the container and restarts it. Any return code greater than or equal to 200 and less than 400 indicates success, and any other return code indicates failure. In this example, the __/healthz__ handler returns a 200 status code for the first 10 seconds of the container's lifetime and then returns a status code of 500.
+
+### Use TCP port check
+
+**TCP port parameter description:**
+
+| Parameter | Description |
+| --------- | ----------- |
+| Port (Port) | The service listening port, such as port 8080 in the example. |
+| Delay time (initialDelaySeconds) | The delay before the first check, in seconds. This setting is related to the normal startup time of the business program. For example, if it is set to 30, the health check starts 30 seconds after the container is started, which is the time reserved for the business program to start. |
+| Timeout (timeoutSeconds) | The timeout, in seconds. For example, if it is set to 10, the timeout for executing the health check is 10 seconds. If this time is exceeded, the check is regarded as failed. If set to 0 or not set, the default timeout is 1 second.
| + +For a container that provides TCP communication services, based on this configuration, the cluster establishes a TCP connection to the container according to the set rules. If the connection is successful, it proves that the detection is successful, otherwise the detection fails. If you choose the TCP port detection method, you must specify the port that the container listens to. + +**YAML example:** + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: goproxy + labels: + app: goproxy +spec: + containers: + - name: goproxy + image: k8s.gcr.io/goproxy:0.1 + ports: + - containerPort: 8080 + readinessProbe: + tcpSocket: + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 10 + livenessProbe: + tcpSocket: + port: 8080 + initialDelaySeconds: 15 + periodSeconds: 20 +``` + +This example uses both readiness and liveness probes. The kubelet sends the first readiness probe 5 seconds after the container is started. Attempt to connect to port 8080 of the __goproxy__ container. If the probe is successful, the Pod will be marked as ready and the kubelet will continue to run the check every 10 seconds. + +In addition to the readiness probe, this configuration includes a liveness probe. The kubelet will perform the first liveness probe 15 seconds after the container is started. The readiness probe will attempt to connect to the __goproxy__ container on port 8080. If the liveness probe fails, the container will be restarted. + +### Run command check + +**YAML example:** + +```yaml +apiVersion: v1 +kind: Pod +metadata: + labels: + test: liveness + name: liveness-exec +spec: + containers: + - name: liveness # Container name + image: k8s.gcr.io/busybox # Container image + args: + - /bin/sh # Command to run + - -c # Pass the following string as a command + - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600 # Command to execute + livenessProbe: + exec: + command: + - cat # Command to check liveness + - /tmp/healthy # File to check + initialDelaySeconds: 5 # Wait 5 seconds before the first probe + periodSeconds: 5 # Perform liveness detection every 5 seconds +``` + +The __periodSeconds__ field specifies that the kubelet performs a liveness probe every 5 seconds, and the __initialDelaySeconds__ field specifies that the kubelet waits for 5 seconds before performing the first probe. According to the set rules, the cluster periodically executes the command __cat /tmp/healthy__ in the container through the kubelet to detect. If the command executes successfully and the return value is 0, the kubelet considers the container to be healthy and alive. If this command returns a non-zero value, the kubelet will kill the container and restart it. + +### Protect slow-starting containers with pre-start checks + +Some applications require a long initialization time at startup. You need to use the same command to set startup detection. For HTTP or TCP detection, you can set the __failureThreshold * periodSeconds__ parameter to a long enough time to cope with the long startup time scene. + +**YAML example:** + +```yaml +ports: +- name: liveness-port + containerPort: 8080 + hostPort: 8080 + +livenessProbe: + httpGet: + path: /healthz + port: liveness-port + failureThreshold: 1 + periodSeconds: 10 + +startupProbe: + httpGet: + path: /healthz + port: liveness-port + failureThreshold: 30 + periodSeconds: 10 +``` + +With the above settings, the application will have up to 5 minutes (30 * 10 = 300s) to complete the startup process. 
Once the startup probe succeeds, the liveness probe takes over monitoring the container and can respond quickly to a container deadlock. If the startup probe never succeeds, the container is killed after 300 seconds and handled according to the __restartPolicy__ .
\ No newline at end of file
diff --git a/docs/en/docs/end-user/kpanda/workloads/pod-config/job-parameters.md b/docs/en/docs/end-user/kpanda/workloads/pod-config/job-parameters.md
new file mode 100644
index 0000000000..4deda315b7
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/workloads/pod-config/job-parameters.md
@@ -0,0 +1,51 @@
+---
+MTPE: windsonsea
+Date: 2024-10-15
+---
+
+# Description of job parameters
+
+According to the settings of __.spec.completions__ and __.spec.parallelism__ , Jobs can be divided into the following types:
+
+| Job Type | Description |
+| -------- | ----------- |
+| Non-parallel Job | Creates one Pod; the Job is complete when that Pod terminates successfully |
+| Parallel Jobs with a deterministic completion count | The Job is considered complete when the number of successfully completed Pods reaches __.spec.completions__ |
+| Parallel Jobs with a work queue | Creates one or more Pods; the Job is complete when at least one Pod terminates successfully and all Pods have terminated |
+
+**Parameter Description**
+
+| Parameter | Description |
+| ------------- | ---------------------------------------------- |
+| .spec.completions | The number of Pods that need to complete successfully for the Job to finish. The default is 1. |
+| .spec.parallelism | The number of Pods running in parallel. The default is 1. |
+| .spec.backoffLimit | The maximum number of retries for failed Pods, beyond which no more retries are attempted. |
+| .spec.activeDeadlineSeconds | The maximum running time of the Job. Once this time is reached, the Job and all of its Pods are stopped. activeDeadlineSeconds has a higher priority than backoffLimit: a Job that reaches activeDeadlineSeconds ignores the backoffLimit setting. |
+
+The following is an example Job configuration, saved in myjob.yaml, which calculates π to 2000 digits and prints the output.
+
+```yaml
+apiVersion: batch/v1
+kind: Job # The type of the current resource
+metadata:
+  name: myjob
+spec:
+  completions: 50 # The Job is complete after 50 Pods run successfully; in this example, π is printed 50 times
+  parallelism: 5 # 5 Pods run in parallel
+  backoffLimit: 5 # Retry up to 5 times
+  template:
+    spec:
+      containers:
+        - name: pi
+          image: perl
+          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
+      restartPolicy: Never # Restart policy
+```
+
+**Related commands**
+
+```bash
+kubectl apply -f myjob.yaml    # Start the Job
+kubectl get job                # View this Job
+kubectl logs myjob-1122dswzs   # View the logs of the Job Pod
+```
diff --git a/docs/en/docs/end-user/kpanda/workloads/pod-config/lifecycle.md b/docs/en/docs/end-user/kpanda/workloads/pod-config/lifecycle.md
new file mode 100644
index 0000000000..0cf8c4a31a
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/workloads/pod-config/lifecycle.md
@@ -0,0 +1,60 @@
+---
+MTPE: windsonsea
+Date: 2024-10-15
+---
+
+# Configure the container lifecycle
+
+Pods follow a predefined lifecycle, starting in the __Pending__ phase and entering the __Running__ state if at least one container in the Pod starts normally. If any container in the Pod ends in a failed state, the state becomes __Failed__ . The following __phase__ field values indicate which phase of the lifecycle a Pod is in.
+
+| Value | Description |
+| ----- | ----------- |
+| __Pending__ | The Pod has been accepted by the system, but one or more containers have not yet been created or run. This phase includes waiting for the Pod to be scheduled and downloading the image over the network. |
+| __Running__ | The Pod has been bound to a node, and all containers in the Pod have been created. At least one container is still running, or is in the process of starting or restarting. |
+| __Succeeded__ | All containers in the Pod were terminated successfully and will not be restarted. |
+| __Failed__ | All containers in the Pod have terminated, and at least one container terminated due to failure, that is, it exited with a non-zero status or was terminated by the system. |
+| __Unknown__ | The status of the Pod cannot be obtained for some reason, usually due to a communication failure with the host where the Pod resides. |
+
+When creating a workload in Suanova container management, images are usually used to specify the running environment of the container. By default, the __Entrypoint__ and __CMD__ fields set when building the image define the commands and parameters executed when the container runs. If you need to change the commands and parameters that the container image runs before starting, after starting, or before stopping, you can override the defaults in the image by setting the container's lifecycle event commands and parameters.
+
+## Lifecycle configuration
+
+Configure the start command, post-start command, and pre-stop command of the container according to business needs.
+
+| Parameter | Description | Example value |
+| --------- | ----------- | ------------- |
+| Start command | Type: Optional<br />Meaning: The container is started according to the start command. | |
+| Command after startup | Type: Optional<br />Meaning: The command executed after the container starts. | |
+| Command before stopping | Type: Optional<br />Meaning: The command executed by the container after receiving the stop command. It ensures that the services running in the instance can be drained in advance when the instance is upgraded or deleted. | - |
+
+### Start command
+
+Configure the start command according to the table below.
+
+| Parameter | Description | Example value |
+| --------- | ----------- | ------------- |
+| Run command | Type: Required<br />Meaning: Enter an executable command, and separate multiple commands with spaces. If the command itself contains spaces, wrap it in quotes ("").<br />When there are multiple commands, it is recommended to use /bin/sh or another shell to run the command and pass all the other commands in as parameters. | /run/server |
+| Running parameters | Type: Optional<br />Meaning: Enter the parameters of the command that runs the container. | port=8080 |
+
+### Post-start commands
+
+Suanova provides two processing types, command line script and HTTP request, to configure post-start commands. You can choose the configuration method that suits you according to the table below.
+
+**Command line script configuration**
+
+| Parameter | Description | Example value |
+| --------- | ----------- | ------------- |
+| Run command | Type: Optional<br />Meaning: Enter an executable command, and separate multiple commands with spaces. If the command itself contains spaces, wrap it in quotes ("").<br />When there are multiple commands, it is recommended to use /bin/sh or another shell to run the command and pass all the other commands in as parameters. | /run/server |
+| Running parameters | Type: Optional<br />Meaning: Enter the parameters of the command that runs the container. | port=8080 |
+
+### Pre-stop command
+
+Suanova provides two processing types, command line script and HTTP request, to configure the pre-stop command. You can choose the configuration method that suits you according to the table below.
+
+**HTTP request configuration**
+
+| Parameter | Description | Example value |
+| --------- | ----------- | ------------- |
+| URL Path | Type: Optional<br />Meaning: The requested URL path. | /run/server |
+| Port | Type: Required<br />Meaning: The requested port. | port=8080 |
+| Node Address | Type: Optional<br />Meaning: The requested IP address; it defaults to the node IP where the container is located. | - |
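+
+For reference, these settings correspond to the container's `command`/`args` and the `postStart`/`preStop` lifecycle hooks in the Pod spec. The following is a minimal, illustrative sketch only; the image, commands, path, and port are placeholders:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: lifecycle-demo
+spec:
+  containers:
+    - name: demo
+      image: nginx:1.25
+      command: ["/bin/sh"]                    # start command
+      args: ["-c", "nginx -g 'daemon off;'"]  # running parameters
+      lifecycle:
+        postStart:                            # command after startup
+          exec:
+            command: ["/bin/sh", "-c", "echo started > /tmp/started"]
+        preStop:                              # command before stopping, sent as an HTTP request
+          httpGet:
+            path: /drain
+            port: 8080
+```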
\ No newline at end of file
diff --git a/docs/en/docs/end-user/kpanda/workloads/pod-config/scheduling-policy.md b/docs/en/docs/end-user/kpanda/workloads/pod-config/scheduling-policy.md
new file mode 100644
index 0000000000..8d8588760f
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/workloads/pod-config/scheduling-policy.md
@@ -0,0 +1,98 @@
+# Scheduling Policy
+
+In a Kubernetes cluster, nodes have [labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/), just like many other Kubernetes objects. You can [manually add labels](https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes/#add-a-label-to-a-node), and Kubernetes also adds some standard labels to all nodes in the cluster. See [Common Labels, Annotations, and Taints](https://kubernetes.io/docs/reference/labels-annotations-taints/) for common node labels. By adding labels to nodes, you can have Pods scheduled onto specific nodes or groups of nodes. You can use this feature to ensure that specific Pods run only on nodes with certain isolation, security, or governance properties.
+
+__nodeSelector__ is the simplest recommended form of node selection constraint. You can add a __nodeSelector__ field to the Pod's spec to set the [node labels](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#built-in-node-labels) it requires. Kubernetes only schedules the Pod onto nodes that have every label specified. __nodeSelector__ is one of the easiest ways to constrain Pods to nodes with specific labels, while affinity and anti-affinity expand the types of constraints you can define. Some benefits of using affinity and anti-affinity are:
+
+- The affinity and anti-affinity language is more expressive. __nodeSelector__ can only select nodes that have all of the specified labels, while affinity and anti-affinity give you greater control over the selection logic.
+
+- You can mark a rule as a "soft requirement" or "preference", so that the scheduler still schedules the Pod even if no matching node can be found.
+
+- You can use the labels of other Pods running on a node (or in another topology domain) to enforce scheduling constraints, instead of using only the labels of the node itself. This capability allows you to define rules that place related Pods together.
+
+You can choose which nodes a Pod is deployed to by setting affinity and anti-affinity.
+
+## Tolerance time
+
+When the node where a workload instance is located becomes unavailable, this is the period after which the system reschedules the instance to another available node. The default is 300 seconds.
+
+## Node affinity (nodeAffinity)
+
+Node affinity is conceptually similar to __nodeSelector__ : it allows you to constrain which nodes Pods can be scheduled on based on the labels on the nodes. There are two types of node affinity:
+
+- **Must be satisfied ( __requiredDuringSchedulingIgnoredDuringExecution__ ):** The scheduler schedules the Pod only when the rules are satisfied. This functionality is similar to __nodeSelector__ , but with a more expressive syntax. You can define multiple hard constraint rules, of which only one needs to be satisfied.
+
+- **Satisfy as much as possible ( __preferredDuringSchedulingIgnoredDuringExecution__ ):** The scheduler tries to find nodes that meet the corresponding rules. If no matching node is found, the scheduler still schedules the Pod. You can also set weights for soft constraint rules. During scheduling, if multiple nodes meet the conditions, the node with the highest weight is preferred. You can likewise define multiple rules, of which only one needs to be satisfied.
+
+#### Tag name
+
+The label can be a default node label or a user-defined label.
+
+#### Operators
+
+- In: the label value must be in the list of values
+- NotIn: the label value must not be in the list of values
+- Exists: the label must exist; there is no need to set a label value
+- DoesNotExist: the label must not exist; there is no need to set a label value
+- Gt: the label value must be greater than a given value (integer comparison)
+- Lt: the label value must be less than a given value (integer comparison)
+
+#### Weights
+
+A weight can only be added to a "satisfy as much as possible" rule and can be understood as the scheduling priority: nodes matching higher-weight rules are preferred. The value range is 1 to 100.
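+
+In YAML terms, the two types of node affinity correspond to `requiredDuringSchedulingIgnoredDuringExecution` and `preferredDuringSchedulingIgnoredDuringExecution` under `spec.affinity.nodeAffinity`. The following is a minimal, illustrative sketch only; the image, the label key `disktype`, and its value are placeholders:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: node-affinity-demo
+spec:
+  containers:
+    - name: demo
+      image: nginx:1.25
+  affinity:
+    nodeAffinity:
+      requiredDuringSchedulingIgnoredDuringExecution:    # must be satisfied
+        nodeSelectorTerms:
+          - matchExpressions:
+              - key: kubernetes.io/os
+                operator: In
+                values: ["linux"]
+      preferredDuringSchedulingIgnoredDuringExecution:   # satisfy as much as possible
+        - weight: 80                                     # 1-100, higher is preferred
+          preference:
+            matchExpressions:
+              - key: disktype
+                operator: In
+                values: ["ssd"]
+```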
+
+## Workload Affinity
+
+Similar to node affinity, there are two types of workload affinity:
+
+- **Must be satisfied (requiredDuringSchedulingIgnoredDuringExecution):** The scheduler schedules the Pod only when the rules are satisfied. This functionality is similar to __nodeSelector__ , but with a more expressive syntax. You can define multiple hard constraint rules, of which only one needs to be satisfied.
+- **Satisfy as much as possible (preferredDuringSchedulingIgnoredDuringExecution):** The scheduler tries to find nodes that meet the corresponding rules. If no matching node is found, the scheduler still schedules the Pod. You can also set weights for soft constraint rules. During scheduling, if multiple nodes meet the conditions, the node with the highest weight is preferred. You can likewise define multiple rules, of which only one needs to be satisfied.
+
+Workload affinity is mainly used to determine which Pods of a workload can be deployed in the same topology domain. For example, services that communicate with each other can be deployed in the same topology domain (such as the same availability zone) by applying affinity scheduling, which reduces the network latency between them.
+
+#### Tag name
+
+The label can be a default node label or a user-defined label.
+
+#### Namespaces
+
+Specifies the namespaces in which the scheduling policy takes effect.
+
+#### Operators
+
+- In: the label value must be in the list of values
+- NotIn: the label value must not be in the list of values
+- Exists: the label must exist; there is no need to set a label value
+- DoesNotExist: the label must not exist; there is no need to set a label value
+
+#### Topology domain
+
+The topology domain (topologyKey) specifies the scope that is considered during scheduling. For example, if you specify kubernetes.io/hostname, scheduling is distinguished at the node level.
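+
+A workload affinity rule is expressed with `podAffinity` together with a `topologyKey` (the topology domain). The following is a minimal, illustrative sketch only; the image, the label `app: backend`, and the namespace are placeholders:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: pod-affinity-demo
+spec:
+  containers:
+    - name: demo
+      image: nginx:1.25
+  affinity:
+    podAffinity:
+      preferredDuringSchedulingIgnoredDuringExecution:   # satisfy as much as possible
+        - weight: 100
+          podAffinityTerm:
+            labelSelector:
+              matchExpressions:
+                - key: app
+                  operator: In
+                  values: ["backend"]
+            namespaces: ["default"]                      # namespaces the rule looks at
+            topologyKey: topology.kubernetes.io/zone     # topology domain
+```
+
+Workload anti-affinity, described in the next section, uses the same structure under `podAntiAffinity`, typically with `topologyKey: kubernetes.io/hostname` to spread Pods across different nodes.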
+
+## Workload Anti-Affinity
+
+Similar to node affinity, there are two types of workload anti-affinity:
+
+- **Must be satisfied (requiredDuringSchedulingIgnoredDuringExecution):** The scheduler schedules the Pod only when the rules are satisfied. This functionality is similar to __nodeSelector__ , but with a more expressive syntax. You can define multiple hard constraint rules, of which only one needs to be satisfied.
+- **Satisfy as much as possible (preferredDuringSchedulingIgnoredDuringExecution):** The scheduler tries to find nodes that meet the corresponding rules. If no matching node is found, the scheduler still schedules the Pod. You can also set weights for soft constraint rules. During scheduling, if multiple nodes meet the conditions, the node with the highest weight is preferred. You can likewise define multiple rules, of which only one needs to be satisfied.
+
+Workload anti-affinity is mainly used to determine which Pods of a workload cannot be deployed in the same topology domain. For example, spreading the Pods of a workload across different topology domains (such as different hosts) improves the stability of the workload itself.
+
+#### Tag name
+
+The label can be a default node label or a user-defined label.
+
+#### Namespaces
+
+Specifies the namespaces in which the scheduling policy takes effect.
+
+#### Operators
+
+- In: the label value must be in the list of values
+- NotIn: the label value must not be in the list of values
+- Exists: the label must exist; there is no need to set a label value
+- DoesNotExist: the label must not exist; there is no need to set a label value
+
+#### Topology domain
+
+The topology domain (topologyKey) specifies the scope that is considered during scheduling. For example, if you specify kubernetes.io/hostname, scheduling is distinguished at the node level.
\ No newline at end of file
diff --git a/docs/en/docs/end-user/kpanda/workloads/pod-config/workload-status.md b/docs/en/docs/end-user/kpanda/workloads/pod-config/workload-status.md
new file mode 100644
index 0000000000..7d12636d5a
--- /dev/null
+++ b/docs/en/docs/end-user/kpanda/workloads/pod-config/workload-status.md
@@ -0,0 +1,58 @@
+---
+MTPE: windsonsea
+Date: 2024-07-19
+---
+
+# Workload Status
+
+A workload is an application running on Kubernetes. Whether your application is composed of a single component or of many different components, you can run it in a set of Pods. Kubernetes provides five built-in workload resources to manage Pods:
+
+- [Deployment](../create-deployment.md)
+- [StatefulSet](../create-statefulset.md)
+- [DaemonSet](../create-daemonset.md)
+- [Job](../create-job.md)
+- [CronJob](../create-cronjob.md)
+
+You can also extend workload resources by setting a [Custom Resource CRD](../../custom-resources/create.md). In the fifth-generation container management, workloads support full lifecycle management such as creation, update, scaling, monitoring, logging, deletion, and version management.
+
+## Pod Status
+
+A Pod is the smallest computing unit created and managed in Kubernetes, that is, a collection of containers. These containers share storage, networking, and management policies that control how the containers run.
+Pods are typically not created directly by users, but through workload resources.
+Pods follow a predefined lifecycle, starting in the __Pending__ [phase](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase). If at least one of the primary containers starts normally, the Pod enters __Running__ , and then enters the __Succeeded__ or __Failed__ phase depending on whether any container in the Pod ended in failure.
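+
+As an illustration only, the current phase can be read from the `status` field of the Pod, for example with `kubectl get pod <pod-name> -o yaml`; the values below are placeholders:
+
+```yaml
+# Trimmed example output; only the fields relevant to the phase are shown.
+status:
+  phase: Running
+  conditions:
+    - type: PodScheduled
+      status: "True"
+    - type: Initialized
+      status: "True"
+    - type: ContainersReady
+      status: "True"
+    - type: Ready
+      status: "True"
+```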
+
+## Workload Status
+
+The fifth-generation container management module provides a built-in set of workload lifecycle statuses based on factors such as Pod status and the number of replicas, so that users can perceive the real running status of workloads more accurately.
+Because different workload types (such as Deployments and Jobs) manage Pods differently, different workloads show different lifecycle statuses while running, as shown in the following table.
+
+### Deployment, StatefulSet, DaemonSet Status
+
+| Status | Description |
+| ------ | ----------- |
+| Waiting | 1. A workload is in this status while its creation is in progress.
2. After an upgrade or rollback action is triggered, the workload is in this status.
3. Trigger operations such as pausing/scaling, and the workload is in this status. | +| Running | This status occurs when all instances under the workload are running and the number of replicas matches the user-defined number. | +| Deleting | When a delete operation is performed, the payload is in this status until the delete is complete. | +| Exception | Unable to get the status of the workload for some reason. This usually occurs because communication with the pod's host has failed. | +| Not Ready | When the container is in an abnormal, pending status, this status is displayed when the workload cannot be started due to an unknown error | + +### Job Status + +| Status | Description | +| ------ | ----------- | +| Waiting | The workload is in this status while Job creation is in progress. | +| Executing | The Job is in progress and the workload is in this status. | +| Execution Complete | The Job execution is complete and the workload is in this status. | +| Deleting | A delete operation is triggered and the workload is in this status. | +| Exception | Pod status could not be obtained for some reason. This usually occurs because communication with the pod's host has failed. | + +### CronJob status + +| Status | Description | +| ------ | ----------- | +| Waiting | The CronJob is in this status when it is being created. | +| Started | After the CronJob is successfully created, the CronJob is in this status when it is running normally or when the paused task is started. | +| Stopped | The CronJob is in this status when the stop task operation is performed. | +| Deleting | The deletion operation is triggered, and the CronJob is in this status. | + +When the workload is in an abnormal or unready status, you can move the mouse over the status value of the load, and the system will display more detailed error information through a prompt box. You can also view the [log](../../../../insight/data-query/log.md) or events to obtain related running information of the workload. diff --git a/docs/en/docs/end-user/register/index.md b/docs/en/docs/end-user/register/index.md index 2b0c71f815..f58415a87d 100644 --- a/docs/en/docs/end-user/register/index.md +++ b/docs/en/docs/end-user/register/index.md @@ -1,31 +1,31 @@ -# 用户注册 +# User Registration -新用户首次使用 AI 算力平台需要进行注册。 +New users need to register when using the AI platform for the first time. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- 已开启邮箱注册功能 -- 有一个可用的邮箱 +- The AI platform is installed +- Email registration functionality is enabled +- An available email address -## 邮箱注册步骤 +## Email Registration Steps -1. 打开 AI 算力平台首页 ,点击 **注册** +1. Open the AI platform homepage at and click on **Register**. ![home](../../images/regis01.PNG) -1. 键入用户名、密码、邮箱后点击 **注册** +2. Enter your username, password, and email, then click **Register**. ![to register](../../images/regis02.PNG) -1. 系统提示发送了一封邮件到您的邮箱。 +3. The system will prompt that an email has been sent to your inbox. ![to register](../../images/regis03.PNG) -1. 登录自己的邮箱,找到邮件,点击链接。 +4. Log into your email, find the email, and click the link. ![email](../../images/regis04.PNG) -1. 恭喜,您成功进入了 AI 算力平台,现在可以开始您的 AI 之旅了。 +5. Congratulations, you have successfully accessed the AI platform and can now start your AI journey. 
![verify](../../images/regis05.PNG) diff --git a/docs/en/docs/end-user/share/notebook.md b/docs/en/docs/end-user/share/notebook.md index 4cbc010a05..49f8340182 100644 --- a/docs/en/docs/end-user/share/notebook.md +++ b/docs/en/docs/end-user/share/notebook.md @@ -1,85 +1,83 @@ -# 使用 Notebook +# Using Notebook -Notebook 通常指的是 Jupyter Notebook 或类似的交互式计算环境。 -这是一种非常流行的工具,广泛用于数据科学、机器学习和深度学习等领域。 -本页说明如何在算丰 AI 算力平台中使用 Notebook。 +Notebook usually refers to Jupyter Notebook or similar interactive computing environments. It is a very popular tool widely used in fields such as data science, machine learning, and deep learning. This page explains how to use Notebook in the AI platform. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- [用户已成功注册](../register/index.md) -- 管理员为用户分配了工作空间 -- 已准备好数据集(代码、数据等) +- The AI platform is installed +- [User has successfully registered](../register/index.md) +- The administrator has assigned a workspace to the user +- Datasets (code, data, etc.) are prepared -## 创建和使用 Notebook 实例 +## Creating and Using Notebook Instances -1. 以 **管理员身份** 登录 AI 算力平台 -1. 导航至 **AI Lab** -> **运维管理** -> **队列管理** ,点击右侧的 **创建** 按钮 +1. Log into the AI platform as an **Administrator**. +2. Navigate to **AI Lab** -> **Operator** -> **Queue Management**, and click the **Create** button on the right. ![create queue](../images/notebook01.png) -1. 键入名称,选择集群、工作空间和配额后,点击 **确定** +3. Enter a name, select the cluster, workspace, and quota, then click **OK**. ![ok](../images/notebook02.png) -1. 以 **用户身份** 登录 AI 算力平台,导航至 **AI Lab** -> **Notebook** ,点击右侧的 **创建** 按钮 +4. Log into the AI platform as a **User**, navigate to **AI Lab** -> **Notebook**, and click the **Create** button on the right. ![create notebook](../images/notebook03.png) -1. 配置各项参数后点击 **确定** +5. After configuring the various parameters, click **OK**. - === "基本信息" + === "Basic Information" - 键入名称,选择集群、命名空间,选择刚创建的队列,点击 **一键初始化** + Enter a name, select the cluster, namespace, choose the queue just created, and click **One-Click Initialization**. ![basic](../images/notebook04.png) - === "资源配置" + === "Resource Configuration" - 选择 Notebook 类型,配置内存、CPU,开启 GPU,创建和配置 PVC: + Select the Notebook type, configure memory, CPU, enable GPU, create and configure PVC: ![resource](../images/notebook05.png) - === "高级配置" + === "Advanced Configuration" - 开启 SSH 外网访问: + Enable SSH external network access: ![advanced](../images/notebook06.png) -1. 自动跳转到 Notebook 实例列表,点击实例名称 +6. You will be automatically redirected to the Notebook instance list, click on the instance name. ![click name](../images/notebook07.png) -1. 进入 Notebook 实例详情页,点击右上角的 **打开** 按钮 +7. Enter the Notebook instance detail page and click the **Open** button in the upper right corner. ![open](../images/notebook08.png) -1. 进入了 Notebook 开发环境,比如在 `/home/jovyan` 目录挂载了持久卷,可以通过 git 克隆代码,通过 SSH 连接后上传数据等。 +8. You have entered the Notebook development environment, where a persistent volume is mounted in the `/home/jovyan` directory. You can clone code through git, upload data after connecting via SSH, etc. ![notebook](../images/notebook09.png) -## 通过 SSH 访问 Notebook 实例 +## Accessing Notebook Instances via SSH -1. 在自己的电脑上生成 SSH 密钥对 +1. Generate an SSH key pair on your own computer. - 在自己电脑上打开命令行,比如在 Windows 上打开 git bash,输入 `ssh-keygen.exe -t rsa`,然后一路回车。 + Open the command line on your computer, for example, open git bash on Windows, enter `ssh-keygen.exe -t rsa`, and press enter through the prompts. ![generate](../images/ssh01.png) -1. 通过 `cat ~/.ssh/id_rsa.pub` 等命令查看并复制公钥 +2. 
Use commands like `cat ~/.ssh/id_rsa.pub` to view and copy the public key. ![copy key](../images/ssh02.png) -1. 以用户身份登录 AI 算力平台,在右上角点击 **个人中心** -> **SSH 公钥** -> **导入 SSH 公钥** +3. Log into the AI platform as a user, click on **Personal Center** -> **SSH Public Key** -> **Import SSH Public Key** in the upper right corner. ![import](../images/ssh03.png) -1. 进入 Notebook 实例的详情页,复制 SSH 的链接 +4. Enter the detail page of the Notebook instance and copy the SSH link. ![copy link](../images/ssh04.png) -1. 在客户端使用 SSH 访问 Notebook 实例 +5. Use SSH to access the Notebook instance from the client. ![ssh](../images/ssh05.png) -下一步:[创建训练任务](../../admin/baize/developer/jobs/create.md) +Next step: [Create Training Job](../../admin/baize/developer/jobs/create.md) diff --git a/docs/en/docs/end-user/share/workload.md b/docs/en/docs/end-user/share/workload.md index 2a5517ec35..cd20f62bf6 100644 --- a/docs/en/docs/end-user/share/workload.md +++ b/docs/en/docs/end-user/share/workload.md @@ -1,51 +1,50 @@ -# 创建 AI 负载使用 GPU 资源 +# Creating AI Workloads Using GPU Resources -管理员为工作空间分配资源配额后,用户就可以创建 AI 工作负载来使用 GPU 算力资源。 +After the administrator allocates resource quotas for the workspace, users can create AI workloads to utilize GPU computing resources. -## 前置条件 +## Prerequisites -- 已安装 AI 算力平台 -- [用户已成功注册](../register/index.md) -- 管理员为用户分配了工作空间 -- 管理员为工作空间设置了资源配额 -- 管理员已经为用户分配了一个集群 +- The AI platform is installed +- [User has successfully registered](../register/index.md) +- The administrator has assigned a workspace to the user +- The administrator has set resource quotas for the workspace +- The administrator has assigned a cluster to the user -## 创建 AI 负载步骤 +## Steps to Create AI Workloads -1. 以用户身份登录 AI 算力平台 -1. 导航至 **容器管理** ,选择一个命名空间,点击 **工作负载** -> **无状态负载** , - 点击右侧的 **镜像创建** 按钮 +1. Log into the AI platform as a user. +2. Navigate to **Container Management**, select a namespace, click on **Workloads** -> **Deployments** , and then click the **Create Image** button on the right. ![button](../images/workload01.png) -1. 配置各项参数后点击 **确定** +3. After configuring various parameters, click **OK**. - === "基本信息" + === "Basic Information" - 选择自己的命名空间。 + Select your namespace. ![basic](../images/workload02.png) - === "容器配置" + === "Container Configuration" - 设置镜像,配置 CPU、内存、GPU 等资源,设置启动命令。 + Set the image, configure CPU, memory, GPU, and other resources, and set the startup command. ![container](../images/workload03.png) - === "其他" + === "Other" - 服务配置和高级配置可以使用默认配置。 + Service configuration and advanced configuration can use the default settings. -1. 自动返回无状态负载列表,点击负载名称 +4. You will be automatically redirected to the stateless workload list; click on the workload name. ![click name](../images/workload04.png) -1. 进入详情页,可以看到 GPU 配额 +5. Enter the detail page where you can see the GPU quota. ![check gpu](../images/workload05.png) -1. 你还可以进入控制台,运行 `mx-smi` 命令查看 GPU 资源 +6. You can also access the console and run the `nvidia-smi` command to view GPU resources. ![check gpu](../images/workload06.png) -下一步:[使用 Notebook](./notebook.md) +Next step: [Using Notebook](./notebook.md) diff --git a/docs/en/docs/openapi/index.md b/docs/en/docs/openapi/index.md index f4fd857f8e..af730d72aa 100644 --- a/docs/en/docs/openapi/index.md +++ b/docs/en/docs/openapi/index.md @@ -13,7 +13,7 @@ This is some OpenAPI documentation aimed at developers. Access Keys can be used to access the OpenAPI and for continuous publishing. You can follow the steps below to obtain their keys and access the API in their personal center. 
-Log in to the AI computing platform, find __Personal Center__ in the dropdown menu at +Log in to the AI platform, find __Personal Center__ in the dropdown menu at the top right corner, and manage your account's access keys on the __Access Keys__ page. ![ak list](../images/platform02_1.png) @@ -26,7 +26,7 @@ the top right corner, and manage your account's access keys on the __Access Keys ## Using the Key to Access the API -When accessing the AI computing platform's OpenAPI, include the request header `Authorization:Bearer ${token}` +When accessing the AI platform's OpenAPI, include the request header `Authorization:Bearer ${token}` in the request to identify the visitor's identity, where `${token}` is the key obtained in the previous step. **Request Example**