MetaX GPU Component Installation and Usage
This chapter provides installation guidance for MetaX's gpu-extensions, gpu-operator, and other components, as well as usage methods for both the full GPU and vGPU modes.
Prerequisites
- The required tar package has been downloaded and installed from the MetaX Software Center. This article uses metax-gpu-k8s-package.0.7.10.tar.gz as an example.
Component Introduction

Metax provides two helm-chart packages: metax-extensions and gpu-operator. Depending on the usage scenario, different components can be selected for installation.

- Metax-extensions: Includes two components, gpu-device and gpu-label. When using the Metax-extensions solution, the user's application container image needs to be built based on the MXMACA® base image. Moreover, Metax-extensions is only suitable for scenarios using the full GPU.
- gpu-operator: Includes components such as gpu-device, gpu-label, driver-manager, container-runtime, and operator-controller. When using the gpu-operator solution, users can choose to create application container images that do not include the MXMACA® SDK. The gpu-operator is suitable for both full GPU and vGPU scenarios (a minimal full-GPU scheduling sketch follows this list).
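As a quick illustration of the full GPU scenario, the sketch below schedules a pod onto one whole MetaX GPU. The resource name metax-tech.com/gpu and the image name are assumptions for illustration only; verify the resource that gpu-device actually advertises on your nodes (for example with kubectl describe node).

```yaml
# A minimal sketch, not an official example. The resource name
# metax-tech.com/gpu and the image tag are assumptions; check what the
# gpu-device plugin advertises on your nodes before using them.
apiVersion: v1
kind: Pod
metadata:
  name: metax-full-gpu-demo
spec:
  containers:
    - name: app
      # If you install only metax-extensions, this image must be built
      # from the MXMACA base image.
      image: mxmaca-base:example-tag      # hypothetical image name
      command: ["sleep", "infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1           # assumed name for one whole MetaX GPU
```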
Using MIG GPU through the UI
Confirm whether the cluster has recognized the GPU type.

Go to Cluster Details -> Nodes and check that the GPU has been correctly recognized as MIG.
When deploying an application using an image, you can select and use NVIDIA MIG resources.
Example of MIG Single Mode (used in the same way as a full GPU):
Note
The MIG single policy allows users to request and use GPU resources in the same way as a full GPU (nvidia.com/gpu). The difference is that these resources can be a portion of the GPU (MIG device) rather than the entire GPU. Learn more from the GPU MIG Mode Design.
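A minimal sketch of such a request under the single strategy; the image tag is only an example, and the single nvidia.com/gpu unit may resolve to one MIG slice rather than a whole physical GPU:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-single-demo
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image
      command: ["nvidia-smi", "-L"]                        # list the visible GPU/MIG devices
      resources:
        limits:
          nvidia.com/gpu: 1   # under MIG single mode this may be one MIG device
```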
Using vGPU through YAML Configuration

This YAML configuration requests the application to use vGPU resources. It specifies that each card should utilize 20% of GPU cores, 200MB of GPU memory, and requests 1 GPU (fragment; see the full manifest below for where these fields nest):

```yaml
limits:
  nvidia.com/gpucores: '20'   # Request 20% of GPU cores for each card
  nvidia.com/gpumem: '200'    # Request 200MB of GPU memory for each card
  nvidia.com/vgpu: '1'        # Request 1 GPU
imagePullPolicy: Always
restartPolicy: Always
```
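For context, here is a complete pod manifest showing where those fields sit. This is a minimal sketch assembled around the fragment above; the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo                # hypothetical name
spec:
  restartPolicy: Always          # pod-level field
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
      imagePullPolicy: Always    # container-level field
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpucores: '20'   # 20% of GPU cores per card
          nvidia.com/gpumem: '200'    # 200 MB of GPU memory per card
          nvidia.com/vgpu: '1'        # one vGPU
```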
Using Cambricon GPU
This article introduces how to use Cambricon GPUs in the Suanova AI computing platform.
Prerequisites
- The Suanova AI computing platform's container management platform has been deployed and is running normally.
- The container management module has either integrated with a Kubernetes cluster or created a Kubernetes cluster, and is able to access the cluster's UI interface.
- The current cluster has installed the Cambricon firmware, drivers, and DevicePlugin components. For installation details, please refer to the official documentation:
    - Driver Firmware Installation
When installing DevicePlugin, please disable the --enable-device-type parameter; otherwise, the Suanova AI computing platform will not be able to correctly recognize the Cambricon GPU.
Introduction to Cambricon GPU Modes
Cambricon GPUs have the following modes:
- Dynamic SMLU Mode: Further refines resource allocation, allowing control over the size of memory and computing power allocated to containers.
- MIM Mode: Allows the Cambricon GPU to be divided into multiple GPUs of fixed specifications for use.
Using Cambricon in Suanova AI Computing Platform
Here, we take the Dynamic SMLU mode as an example:
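The sketch below requests a slice of a Cambricon device under dynamic SMLU. The resource names are assumptions for illustration (names and units vary by DevicePlugin version); verify what your DevicePlugin actually advertises, for example with kubectl describe node.

```yaml
# A minimal sketch, assuming the Cambricon DevicePlugin runs in dynamic
# SMLU mode. Resource names and units below are assumptions; confirm them
# against the extended resources advertised on your nodes.
apiVersion: v1
kind: Pod
metadata:
  name: mlu-smlu-demo
spec:
  containers:
    - name: app
      image: ubuntu:22.04             # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          cambricon.com/mlu.smlu.vmemory: 20   # assumed: memory quota in plugin-defined units
          cambricon.com/mlu.smlu.vcore: 10     # assumed: percentage of compute cores
```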
Using the Whole NVIDIA GPU for an Application

This section describes how to allocate an entire NVIDIA GPU to a single application on the AI platform.
Prerequisites

- The AI platform container management platform has been deployed and is running properly.
- The container management module has been connected to a Kubernetes cluster or a Kubernetes cluster has been created, and you can access the UI interface of the cluster.
- GPU Operator has been offline installed and NVIDIA DevicePlugin has been enabled on the current cluster. Refer to Offline Installation of GPU Operator for instructions.
- The GPU in the current cluster has not undergone any virtualization operations or been occupied by other applications.
- The kernel version of the cluster nodes where the gpu-operator is to be deployed must be completely consistent. The distribution and GPU model of the nodes must fall within the scope specified in the GPU Support Matrix.
- When installing the gpu-operator, select v23.9.0+2 or above.
- Check the system requirements for the GPU driver installation on the target node: GPU Support Matrix
Procedure
Configuring via the User Interface
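Whether the allocation is made through the UI or directly in YAML, the scheduled result corresponds to a manifest like the following minimal sketch (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: full-gpu-demo
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image
      command: ["nvidia-smi"]     # print the allocated GPU, then exit
      resources:
        limits:
          nvidia.com/gpu: 1       # one whole physical GPU
```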
NVIDIA GPU Usage Modes
NVIDIA, as a well-known graphics computing provider, offers various software and hardware solutions to enhance computational power. Among them, NVIDIA provides the following three solutions for GPU usage:
Full GPU
Full GPU refers to allocating the entire NVIDIA GPU to a single user or application. In this configuration, the application can fully occupy all the resources of the GPU and achieve maximum computational performance. Full GPU is suitable for workloads that require a large amount of computational resources and memory, such as deep learning training and scientific computing.
Enabling MIG Features

For more details, refer to the NVIDIA GPU Usage Modes.
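With the gpu-operator, a MIG layout is commonly selected per node through the nvidia.com/mig.config label, which the operator's mig-manager watches. The node name and profile name below are hypothetical; the profile must exist in the mig-parted configuration (see the sketch in the Overview of MIG section):

```yaml
# Sketch: opting one node into a named MIG layout by label.
# gpu-node-1 and all-1g.10gb are hypothetical values.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    nvidia.com/mig.config: all-1g.10gb
```

In practice this is applied with kubectl label nodes gpu-node-1 nvidia.com/mig.config=all-1g.10gb rather than by editing the Node object directly.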
MIG Scenarios

MIG offers increased compute power and memory capacity for training large-scale deep learning models. By partitioning the physical GPU into multiple MIG instances, each instance can independently carry out model training, improving training efficiency and throughput.
In general, NVIDIA MIG is suitable for scenarios that require finer-grained allocation and management of GPU resources. It enables resource isolation, improved performance utilization, and meets the GPU computing needs of multiple users or applications.
Overview of MIG
NVIDIA Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA on H100, A100, and A30 series GPUs. Its purpose is to divide a physical GPU into multiple GPU instances to provide finer-grained resource sharing and isolation. MIG can split a GPU into up to seven GPU instances, allowing a single physical GPU to provide separate GPU resources to multiple users, maximizing GPU utilization.
This feature enables multiple applications or users to share GPU resources simultaneously, improving the utilization of computational resources and increasing system scalability.
With MIG, each GPU instance's processor has an independent and isolated path throughout the entire memory system, including cross-switch ports on the chip, L2 cache groups, memory controllers, and DRAM address buses, all uniquely allocated to a single instance.
This ensures that the workload of individual users can run with predictable throughput and latency, along with identical L2 cache allocation and DRAM bandwidth. MIG can partition available GPU compute resources (such as streaming multiprocessors or SMs and GPU engines like copy engines or decoders) to provide defined quality of service (QoS) and fault isolation for different clients such as virtual machines, containers, or processes. MIG enables multiple GPU instances to run in parallel on a single physical GPU.
MIG allows multiple vGPUs (and virtual machines) to run in parallel on a single GPU instance while retaining the isolation guarantees provided by vGPU. For more details on using vGPU and MIG for GPU partitioning, refer to NVIDIA Multi-Instance GPU and NVIDIA Virtual Compute Server.
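The partitioning itself is typically declared in a mig-parted style configuration consumed by the gpu-operator. The sketch below is illustrative only: the 1g.10gb profile and the seven-way split apply to A100-80GB-class GPUs, and valid profile names depend on the GPU model.

```yaml
# Sketch of a mig-parted style profile; names and counts are assumptions.
version: v1
mig-configs:
  all-1g.10gb:              # named layout: split each GPU into seven 1g.10gb instances
    - devices: all          # apply to every GPU on the node
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7
```

A node then opts into one of these named layouts via the nvidia.com/mig.config label shown in the Enabling MIG Features section above.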
MIG Architecture
The following diagram provides an overview of MIG, illustrating how it virtualizes one physical GPU into seven GPU instances that can be used by multiple users.
Important Concepts

- SM (Streaming Multiprocessor): The core computational unit of a GPU responsible for executing graphics rendering and general-purpose computing tasks. Each SM contains a group of CUDA cores, as well as shared memory, register files, and other resources, capable of executing multiple threads concurrently. Each MIG instance has a certain number of SMs and other related resources, along with the allocated memory slices.