Merge branch 'Project-HAMi:master' into master

Project-HAMi · Oct 24, 2024 · 1e882c2
2 parents 299dae9 + b0873aa
commit 1e882c2

Showing 32 changed files with 796 additions and 50 deletions.
diff --git a/.github/workflows/call-release-helm.yaml b/.github/workflows/call-release-helm.yaml
@@ -88,7 +88,7 @@ jobs:
           mkdir -p tmp
           mv charts/*.tgz tmp
       - name: Upload Artifact
-        uses: actions/upload-artifact@v4.4.0
+        uses: actions/upload-artifact@v4.4.3
         with:
           name: chart_package_artifact
           path: tmp/*

diff --git a/.github/workflows/ci-image-scanning.yaml b/.github/workflows/ci-image-scanning.yaml
@@ -50,7 +50,7 @@ jobs:
         # Prevent running from the forked repository that doesn't need to upload coverage.
         # In addition, running on the forked repository would fail as missing the necessary secret.
         if: ${{ github.repository == 'Project-HAMi/HAMi' }}
-        uses: aquasecurity/trivy-action@0.24.0
+        uses: aquasecurity/trivy-action@0.28.0
         with:
           image-ref: "projecthami/hami:${{ steps.runtime-tag.outputs.tag }}"
           format: "table"
@@ -59,7 +59,7 @@ jobs:
           vuln-type: "os,library"
           trivyignores: .trivyignore
       - name: Run Trivy vulnerability scanner (SARIF)
-        uses: aquasecurity/trivy-action@0.24.0
+        uses: aquasecurity/trivy-action@0.28.0
         with:
           image-ref: "projecthami/hami:${{ steps.runtime-tag.outputs.tag }}"
           format: "sarif"

diff --git a/README.md b/README.md
@@ -10,7 +10,7 @@ English version|[中文版](README_cn.md)
 [![codecov](https://codecov.io/gh/Project-HAMi/HAMi/branch/master/graph/badge.svg?token=ROM8CMPXZ6)](https://codecov.io/gh/Project-HAMi/HAMi)
 [![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FProject-HAMi%2FHAMi.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2FProject-HAMi%2FHAMi?ref=badge_shield)
 [![docker pulls](https://img.shields.io/docker/pulls/4pdosc/k8s-vgpu.svg)](https://hub.docker.com/r/4pdosc/k8s-vgpu)
-[![slack](https://img.shields.io/badge/Slack-Join%20Slack-blue)](https://join.slack.com/t/hami-hsf3791/shared_invite/zt-2gcteqiph-Ls8Atnpky6clrspCAQ_eGQ)
+[![slack](https://img.shields.io/badge/Slack-Join%20Slack-blue)](https://cloud-native.slack.com/archives/C07T10BU4R2)
 [![discuss](https://img.shields.io/badge/Discuss-Ask%20Questions-blue)](https://github.com/Project-HAMi/HAMi/discussions)
 [![website](https://img.shields.io/badge/website-blue)](http://project-hami.io)
 [![Contact Me](https://img.shields.io/badge/Contact%20Me-blue)](https://github.com/Project-HAMi/HAMi#contact)
@@ -65,6 +65,8 @@ will see 3G device memory inside container
 [![cambricon MLU](https://img.shields.io/badge/Cambricon-Mlu-blue)](docs/cambricon-mlu-support.md)
 [![hygon DCU](https://img.shields.io/badge/Hygon-DCU-blue)](docs/hygon-dcu-support.md)
 [![iluvatar GPU](https://img.shields.io/badge/Iluvatar-GPU-blue)](docs/iluvatar-gpu-support.md)
+[![mthreads GPU](https://img.shields.io/badge/Mthreads-GPU-blue)](docs/mthreads-support.md)
+[![ascend NPU](https://img.shields.io/badge/Ascend-NPU-blue)](https://github.com/Project-HAMi/ascend-device-plugin/blob/main/README.md)
 
 ## Architect
 
@@ -166,7 +168,7 @@ If you have any questions, please feel free to reach out to us through the follo
   - [Meeting Link](https://meeting.tencent.com/dm/Ntiwq1BICD1P)
 - Email: refer to the [MAINTAINERS.md](MAINTAINERS.md) to find the email addresses of all maintainers. Feel free to contact them via email to report any issues or ask questions.
 - [mailing list](https://groups.google.com/forum/#!forum/hami-project)
-- [slack](https://join.slack.com/t/hami-hsf3791/shared_invite/zt-2gcteqiph-Ls8Atnpky6clrspCAQ_eGQ)
+- [slack](https://cloud-native.slack.com/archives/C07T10BU4R2) | [Join](https://slack.cncf.io/)
 
 ## License
 

diff --git a/README_cn.md b/README_cn.md
@@ -8,7 +8,7 @@
 [![codecov](https://codecov.io/gh/Project-HAMi/HAMi/branch/master/graph/badge.svg?token=ROM8CMPXZ6)](https://codecov.io/gh/Project-HAMi/HAMi)
 [![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FProject-HAMi%2FHAMi.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2FProject-HAMi%2FHAMi?ref=badge_shield)
 [![docker pulls](https://img.shields.io/docker/pulls/4pdosc/k8s-vgpu.svg)](https://hub.docker.com/r/4pdosc/k8s-vgpu)
-[![slack](https://img.shields.io/badge/Slack-Join%20Slack-blue)](https://join.slack.com/t/hami-hsf3791/shared_invite/zt-2gcteqiph-Ls8Atnpky6clrspCAQ_eGQ)
+[![slack](https://img.shields.io/badge/Slack-Join%20Slack-blue)](https://cloud-native.slack.com/archives/C07T10BU4R2)
 [![discuss](https://img.shields.io/badge/Discuss-Ask%20Questions-blue)](https://github.com/Project-HAMi/HAMi/discussions)
 [![website](https://img.shields.io/badge/website-blue)](http://project-hami.io)
 [![Contact Me](https://img.shields.io/badge/Contact%20Me-blue)](https://github.com/Project-HAMi/HAMi#contact)
@@ -22,6 +22,8 @@
 [![Cambricon MLU](https://img.shields.io/badge/寒武纪-Mlu-blue)](docs/cambricon-mlu-support_cn.md)
 [![Hygon DCU](https://img.shields.io/badge/海光-DCU-blue)](docs/hygon-dcu-support.md)
 [![Iluvatar GPU](https://img.shields.io/badge/天数智芯-GPU-blue)](docs/iluvatar-gpu-support_cn.md)
+[![Moore Threads GPU](https://img.shields.io/badge/摩尔线程-GPU-blue)](docs/mthreads-support_cn.md)
+[![Huawei Ascend NPU](https://img.shields.io/badge/华为昇腾-NPU-blue)](https://github.com/Project-HAMi/ascend-device-plugin/blob/main/README_cn.md)
 
 
 ## Introduction
@@ -204,4 +206,4 @@ The HAMi community is committed to fostering an open and friendly environment, and through various channels
   - [Meeting link](https://meeting.tencent.com/dm/Ntiwq1BICD1P)
 - Email: see [MAINTAINERS.md](MAINTAINERS.md) for the email addresses of all maintainers. Feel free to contact them by email to report issues or ask questions.
 - [Mailing list](https://groups.google.com/forum/#!forum/hami-project)
-- [slack](https://join.slack.com/t/hami-hsf3791/shared_invite/zt-2gcteqiph-Ls8Atnpky6clrspCAQ_eGQ)
+- [slack](https://cloud-native.slack.com/archives/C07T10BU4R2) | [Join](https://slack.cncf.io/)
diff --git a/charts/hami/templates/scheduler/configmap.yaml b/charts/hami/templates/scheduler/configmap.yaml
@@ -32,6 +32,14 @@ data:
                     },
                     {{- end }}
                     {{- end }}
+                    {{- if .Values.devices.mthreads.enabled }}
+                    {{- range .Values.devices.mthreads.resources }}
+                    {
+                      "name": "{{ . }}",
+                      "ignoredByScheduler": true
+                    },
+                    {{- end }}
+                    {{- end }}
                     {
                         "name": "{{ .Values.resourceName }}",
                         "ignoredByScheduler": true

diff --git a/charts/hami/templates/scheduler/configmapnew.yaml b/charts/hami/templates/scheduler/configmapnew.yaml
@@ -55,4 +55,10 @@ data:
         ignoredByScheduler: true
       {{- end }}
       {{- end }}
+      {{- if .Values.devices.mthreads.enabled }}
+      {{- range .Values.devices.mthreads.resources }}
+      - name: {{ . }}
+        ignoredByScheduler: true
+      {{- end }}
+      {{- end }}
 {{- end }}
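
As a rough illustration of what the new template block produces: assuming the chart is installed with `devices.mthreads.enabled=true` and the default `devices.mthreads.resources` list from `values.yaml` in this commit, the `range` loop should emit one entry per resource name. The fragment below is a hand-rendered sketch, not captured `helm template` output, and the surrounding keys of the scheduler config are omitted.

```yaml
# Hand-rendered sketch of the entry added by the mthreads block,
# assuming devices.mthreads.resources = [mthreads.com/vgpu] (the chart default).
- name: mthreads.com/vgpu
  ignoredByScheduler: true
```

Adding a second name to `devices.mthreads.resources` would add a second list entry of the same shape.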
diff --git a/charts/hami/templates/scheduler/job-patch/job-createSecret.yaml b/charts/hami/templates/scheduler/job-patch/job-createSecret.yaml
@@ -40,8 +40,8 @@ spec:
             - create
             - --cert-name=tls.crt
             - --key-name=tls.key
-            {{- if .Values.scheduler.customWebhook.enabled }}
-            - --host={{ printf "%s.%s.svc,127.0.0.1,%s" (include "hami-vgpu.scheduler" .) .Release.Namespace .Values.scheduler.customWebhook.host}}
+            {{- if .Values.scheduler.admissionWebhook.customURL.enabled }}
+            - --host={{ printf "%s.%s.svc,127.0.0.1,%s" (include "hami-vgpu.scheduler" .) .Release.Namespace .Values.scheduler.admissionWebhook.customURL.host}}
             {{- else }}
             - --host={{ printf "%s.%s.svc,127.0.0.1" (include "hami-vgpu.scheduler" .) .Release.Namespace }}
             {{- end }}

diff --git a/charts/hami/templates/scheduler/webhook.yaml b/charts/hami/templates/scheduler/webhook.yaml
@@ -6,16 +6,16 @@ webhooks:
   - admissionReviewVersions:
     - v1beta1
     clientConfig:
-      {{- if .Values.scheduler.customWebhook.enabled }}
-      url: https://{{ .Values.scheduler.customWebhook.host}}:{{.Values.scheduler.customWebhook.port}}{{.Values.scheduler.customWebhook.path}}
+      {{- if .Values.scheduler.admissionWebhook.customURL.enabled }}
+      url: https://{{ .Values.scheduler.admissionWebhook.customURL.host}}:{{.Values.scheduler.admissionWebhook.customURL.port}}{{.Values.scheduler.admissionWebhook.customURL.path}}
       {{- else }}
       service:
         name: {{ include "hami-vgpu.scheduler" . }}
         namespace: {{ .Release.Namespace }}
         path: /webhook
         port: {{ .Values.scheduler.service.httpPort }}
       {{- end }}
-    failurePolicy: {{ .Values.scheduler.mutatingWebhookConfiguration.failurePolicy }}
+    failurePolicy: {{ .Values.scheduler.admissionWebhook.failurePolicy }}
     matchPolicy: Equivalent
     name: vgpu.hami.io
     namespaceSelector:
@@ -24,19 +24,19 @@ webhooks:
         operator: NotIn
         values:
         - ignore
-      {{- if .Values.scheduler.customWebhook.whitelistNamespaces }}
+      {{- if .Values.scheduler.admissionWebhook.whitelistNamespaces }}
       - key: kubernetes.io/metadata.name
         operator: NotIn
         values:
-        {{- toYaml .Values.scheduler.customWebhook.whitelistNamespaces | nindent 10 }}
+        {{- toYaml .Values.scheduler.admissionWebhook.whitelistNamespaces | nindent 10 }}
       {{- end }}
     objectSelector:
       matchExpressions:
       - key: hami.io/webhook
         operator: NotIn
         values:
         - ignore
-    reinvocationPolicy: Never
+    reinvocationPolicy: {{ .Values.scheduler.admissionWebhook.reinvocationPolicy }}
     rules:
       - apiGroups:
           - ""

diff --git a/charts/hami/values.yaml b/charts/hami/values.yaml
@@ -79,18 +79,21 @@ scheduler:
   podAnnotations: {}
   tolerations: []
   #serviceAccountName: "hami-vgpu-scheduler-sa"
-  customWebhook:
-    enabled: false
-    # must be an endpoint using https.
-    # should generate host certs here
-    host: 127.0.0.1 # hostname or ip, can be your node'IP if you want to use https://<nodeIP>:<schedulerPort>/<path>
-    port: 31998
-    path: /webhook
+  admissionWebhook:
+    customURL:
+      enabled: false
+      # must be an endpoint using https.
+      # should generate host certs here
+      host: 127.0.0.1 # hostname or IP; can be your node's IP if you want to use https://<nodeIP>:<schedulerPort>/<path>
+      port: 31998
+      path: /webhook
     whitelistNamespaces:
     # Specify the namespaces that the webhook will not be applied to.
       # - default
       # - kube-system
       # - istio-system
+    reinvocationPolicy: Never
+    failurePolicy: Ignore
   patch:
     image: docker.io/jettech/kube-webhook-certgen:v1.5.2
     imageNew: liangjw/kube-webhook-certgen:v1.1.1
@@ -100,8 +103,6 @@ scheduler:
     nodeSelector: {}
     tolerations: []
     runAsUser: 2000
-  mutatingWebhookConfiguration:
-    failurePolicy: Ignore
   service:
     httpPort: 443
     schedulerPort: 31998
@@ -135,6 +136,10 @@ devicePlugin:
   tolerations: []
 
 devices:
+  mthreads:
+    enabled: false
+    resources:
+      - mthreads.com/vgpu
   ascend:
     enabled: false
     image: ""
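
The templates in this diff no longer reference `scheduler.customWebhook.*` or `scheduler.mutatingWebhookConfiguration.failurePolicy`, so an override file written against the old keys would presumably need to move to the new layout. A minimal sketch of such a file follows; the file itself, the host address, and the whitelisted namespace are illustrative assumptions, while the key names come from the `values.yaml` hunk above.

```yaml
# Hypothetical override file (for example my-values.yaml).
scheduler:
  admissionWebhook:
    customURL:
      enabled: true
      host: 192.0.2.10        # placeholder address; the endpoint must use HTTPS
      port: 31998
      path: /webhook
    whitelistNamespaces:      # namespaces the webhook will not be applied to
      - kube-system
    reinvocationPolicy: Never # chart default; Kubernetes also accepts IfNeeded
    failurePolicy: Ignore     # chart default; formerly scheduler.mutatingWebhookConfiguration.failurePolicy
```

Keeping `failurePolicy` and `reinvocationPolicy` next to the other webhook settings means all admission-webhook behaviour is configured under one key.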

diff --git a/docker/Dockerfile.withlib b/docker/Dockerfile.withlib
@@ -7,7 +7,7 @@ ADD . /k8s-vgpu
 ARG GOPROXY=https://goproxy.cn,direct
 RUN cd /k8s-vgpu && make all
 
-FROM ubuntu:20.04
+FROM ubuntu:24.04
 ENV NVIDIA_DISABLE_REQUIRE="true"
 ENV NVIDIA_VISIBLE_DEVICES=all
 ENV NVIDIA_DRIVER_CAPABILITIES=utility

diff --git a/docs/config.md b/docs/config.md
@@ -17,7 +17,7 @@ helm install vgpu-charts/vgpu vgpu --set devicePlugin.deviceMemoryScaling=5 ...
 * `devicePlugin.disablecorelimit:`
   String type. "true" disables the core limit, "false" enables it. Default: false
 * `scheduler.defaultMem:` 
-  Integer type, by default: 5000. The default device memory of the current task, in MB
+  Integer type, default: 0. The default device memory of the current task, in MB. '0' means use 100% of device memory
 * `scheduler.defaultCores:` 
   Integer type, default: 0. Percentage of GPU cores reserved for the current task. If set to 0, the task may fit on any GPU with enough device memory. If set to 100, the task uses an entire GPU card exclusively.
 * `scheduler.defaultGPUNum:`
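
To make the changed default concrete, here is a small sketch of a values override, assuming the documented `scheduler.defaultMem` and `scheduler.defaultCores` parameters map directly onto keys under `scheduler:`. The 5000 figure is simply the previously documented default, used here to show how a fixed per-task default could be restored.

```yaml
# Sketch of a values override; parameter names are taken from docs/config.md.
scheduler:
  defaultMem: 5000   # in MB; the new default of 0 would instead mean 100% of device memory
  defaultCores: 0    # 0 = the task may fit on any GPU with enough device memory
```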

diff --git a/docs/config_cn.md b/docs/config_cn.md
@@ -15,7 +15,7 @@ helm install vgpu vgpu-charts/vgpu --set devicePlugin.deviceMemoryScaling=5 ...
 * `devicePlugin.disablecorelimit:`
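   String type. "true" disables the compute limit, "false" enables it. Default: "false"
 * `scheduler.defaultMem:`
-  Integer type, default: 5000. The default device memory size used when device memory is not configured, in MB
+  Integer type, default: 0. The default device memory size used when device memory is not configured, in MB. A value of 0 means all device memory is used.
 * `scheduler.defaultCores:`
   Integer type (0-100), default: 0. The percentage of compute reserved for each task by default. If set to 0, the task may be assigned to any GPU that satisfies its device memory requirement; if set to 100, the task uses an entire GPU card exclusively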
 * `scheduler.defaultGPUNum:`

diff --git a/docs/mthreads-support.md b/docs/mthreads-support.md
@@ -0,0 +1,67 @@
+## Introduction
+
+**We now support `mthreads.com/vgpu`, implementing most of the device-sharing features available for NVIDIA GPUs**, including:
+
+***GPU sharing***: Each task can be allocated a portion of a GPU instead of a whole card, so one GPU can be shared among multiple tasks.
+
+***Device Memory Control***: A task can be allocated a specific amount of device memory on supported device types (e.g. MTT S4000), and its usage is kept within that limit.
+
+***Device Core Control***: A task can be allocated a limited number of compute cores on supported device types (e.g. MTT S4000), and its usage is kept within that limit.
+
+## Important Notes
+
+1. Device sharing across multiple cards is not supported.
+
+2. Only one mthreads device can be shared in a pod (even if the pod has multiple containers).
+
+3. An exclusive mthreads GPU can be allocated by specifying only `mthreads.com/vgpu`.
+
+4. These features have been tested on the MTT S4000.
+
+## Prerequisites
+
+* [MT CloudNative Toolkits > 1.9.0](https://docs.mthreads.com/cloud-native/cloud-native-doc-online/)
+* driver version >= 1.2.0
+
+## Enabling GPU-sharing Support
+
+* Deploy the MT-CloudNative Toolkit on mthreads nodes (consult your device provider to acquire the package and documentation)
+
+> **NOTICE:** *You can remove mt-mutating-webhook and mt-gpu-scheduler after installation (optional).*
+
+* Set 'devices.mthreads.enabled = true' when installing HAMi
+
+```
+helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set devices.mthreads.enabled=true -n kube-system
+```
+
+## Running Mthreads jobs
+
+Mthreads GPUs can now be requested by a container
+using the `mthreads.com/vgpu`, `mthreads.com/sgpu-memory` and `mthreads.com/sgpu-core` resource types:
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpushare-pod-default
+spec:
+  restartPolicy: OnFailure
+  containers:
+    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc 
+      imagePullPolicy: IfNotPresent
+      name: gpushare-pod-1
+      command: ["sleep"]
+      args: ["100000"]
+      resources:
+        limits:
+          mthreads.com/vgpu: 1
+          mthreads.com/sgpu-memory: 32
+          mthreads.com/sgpu-core: 8
+```
+
+> **NOTICE1:** *Each unit of sgpu-memory represents 512M of device memory.*
+
+> **NOTICE2:** *You can find more examples in [examples/mthreads folder](../examples/mthreads/)*
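
The Important Notes in this file say that an exclusive GPU is requested by specifying `mthreads.com/vgpu` alone. A minimal sketch of such a request is shown below; the pod name, container name, and image are placeholders chosen for illustration and do not come from this commit.

```yaml
# Sketch of an exclusive request: only mthreads.com/vgpu is set,
# with no sgpu-memory or sgpu-core limits.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-exclusive-pod                          # hypothetical name
spec:
  restartPolicy: OnFailure
  containers:
    - name: worker                                 # hypothetical name
      image: registry.example.com/your-image:tag   # placeholder image
      imagePullPolicy: IfNotPresent
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          mthreads.com/vgpu: 1
```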
+
+
diff --git a/docs/mthreads-support_cn.md b/docs/mthreads-support_cn.md
@@ -0,0 +1,68 @@
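+## Introduction
+
+This component supports sharing Moore Threads GPU devices and provides the following vGPU-like sharing features:
+
+***GPU sharing***: Each task can occupy only part of a GPU card, so multiple tasks can share one card.
+
+***Device memory limit***: A GPU can be allocated by device memory size (for example, 3000M), and the component ensures that the task's device memory usage does not exceed the allocated amount.
+
+***Compute core group limit***: A GPU can be allocated by number of compute core groups (for example, 8), and the component ensures that the task's compute usage does not exceed the allocated amount.
+
+## Notes
+
+1. Slicing across multiple cards is not yet supported; multi-card tasks can only be allocated whole cards.
+
+2. A pod can use slices from only one GPU, even if the pod has multiple containers.
+
+3. Exclusive mode is supported: specifying only `mthreads.com/vgpu` requests an exclusive GPU.
+
+4. This feature currently supports only MTT S4000 devices.
+
+## Node requirements
+
+* [MT CloudNative Toolkits > 1.9.0](https://docs.mthreads.com/cloud-native/cloud-native-doc-online/)
+* Driver version >= 1.2.0
+
+## Enabling GPU sharing
+
+* Deploy 'gpu-manager'. Moore Threads GPU sharing requires the vendor-provided 'MT-CloudNative Toolkit'; contact your device provider to obtain it.
+
+> **Note:** *(Optional) After deployment, uninstall the mt-mutating-webhook and mt-scheduler components, because the HAMi scheduler provides this functionality.*
+
+* Set 'devices.mthreads.enabled = true' when installing HAMi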
+
+```
+helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set devices.mthreads.enabled=true -n kube-system
+```
+
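+## Running GPU jobs
+
+The three resources `mthreads.com/vgpu`, `mthreads.com/sgpu-memory` and `mthreads.com/sgpu-core` set the number of slices a container requests and the corresponding device memory and compute core groups: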
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpushare-pod-default
+spec:
+  restartPolicy: OnFailure
+  containers:
+    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc 
+      imagePullPolicy: IfNotPresent
+      name: gpushare-pod-1
+      command: ["sleep"]
+      args: ["100000"]
+      resources:
+        limits:
+          mthreads.com/vgpu: 1
+          mthreads.com/sgpu-memory: 32
+          mthreads.com/sgpu-core: 8
+```
+
+> **Note 1:** *Each unit of sgpu-memory represents 512M of device memory.*
+
+> **Note 2:** *See more [examples](../examples/mthreads/).*
+
+
+
+
diff --git a/examples/mthreads/default_use.yaml b/examples/mthreads/default_use.yaml
@@ -0,0 +1,17 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpushare-pod-default
+spec:
+  restartPolicy: OnFailure
+  containers:
+    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc 
+      imagePullPolicy: IfNotPresent
+      name: gpushare-pod-1
+      command: ["sleep"]
+      args: ["100000"]
+      resources:
+        limits:
+          mthreads.com/vgpu: 1
+          mthreads.com/sgpu-memory: 32
+          mthreads.com/sgpu-core: 8
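
For reference, the docs added in this commit state that each unit of `mthreads.com/sgpu-memory` is 512M of device memory, so the manifest above requests 32 × 512M = 16384M. A smaller slice might look like the sketch below; the pod name, container name, and image are placeholders, and the unit counts are arbitrary values picked to show the arithmetic.

```yaml
# Sketch of a smaller shared slice, assuming 512M per sgpu-memory unit
# as stated in docs/mthreads-support.md.
apiVersion: v1
kind: Pod
metadata:
  name: gpushare-pod-small                         # hypothetical name
spec:
  restartPolicy: OnFailure
  containers:
    - name: worker                                 # hypothetical name
      image: registry.example.com/your-image:tag   # placeholder image
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          mthreads.com/vgpu: 1
          mthreads.com/sgpu-memory: 8   # 8 units x 512M = 4096M of device memory
          mthreads.com/sgpu-core: 4     # number of compute core groups
```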