Merge branch 'Project-HAMi:master' into master
jiangsanyin authored Oct 24, 2024
2 parents 299dae9 + b0873aa commit 1e882c2
Showing 32 changed files with 796 additions and 50 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/call-release-helm.yaml
@@ -88,7 +88,7 @@ jobs:
mkdir -p tmp
mv charts/*.tgz tmp
- name: Upload Artifact
uses: actions/[email protected].0
uses: actions/[email protected].3
with:
name: chart_package_artifact
path: tmp/*
4 changes: 2 additions & 2 deletions .github/workflows/ci-image-scanning.yaml
@@ -50,7 +50,7 @@ jobs:
# Prevent running from the forked repository that doesn't need to upload coverage.
# In addition, running on the forked repository would fail as missing the necessary secret.
if: ${{ github.repository == 'Project-HAMi/HAMi' }}
uses: aquasecurity/trivy-action@0.24.0
uses: aquasecurity/trivy-action@0.28.0
with:
image-ref: "projecthami/hami:${{ steps.runtime-tag.outputs.tag }}"
format: "table"
@@ -59,7 +59,7 @@ jobs:
vuln-type: "os,library"
trivyignores: .trivyignore
- name: Run Trivy vulnerability scanner (SARIF)
uses: aquasecurity/trivy-action@0.24.0
uses: aquasecurity/trivy-action@0.28.0
with:
image-ref: "projecthami/hami:${{ steps.runtime-tag.outputs.tag }}"
format: "sarif"
6 changes: 4 additions & 2 deletions README.md
@@ -10,7 +10,7 @@ English version|[中文版](README_cn.md)
[![codecov](https://codecov.io/gh/Project-HAMi/HAMi/branch/master/graph/badge.svg?token=ROM8CMPXZ6)](https://codecov.io/gh/Project-HAMi/HAMi)
[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FProject-HAMi%2FHAMi.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2FProject-HAMi%2FHAMi?ref=badge_shield)
[![docker pulls](https://img.shields.io/docker/pulls/4pdosc/k8s-vgpu.svg)](https://hub.docker.com/r/4pdosc/k8s-vgpu)
[![slack](https://img.shields.io/badge/Slack-Join%20Slack-blue)](https://join.slack.com/t/hami-hsf3791/shared_invite/zt-2gcteqiph-Ls8Atnpky6clrspCAQ_eGQ)
[![slack](https://img.shields.io/badge/Slack-Join%20Slack-blue)](https://cloud-native.slack.com/archives/C07T10BU4R2)
[![discuss](https://img.shields.io/badge/Discuss-Ask%20Questions-blue)](https://github.com/Project-HAMi/HAMi/discussions)
[![website](https://img.shields.io/badge/website-blue)](http://project-hami.io)
[![Contact Me](https://img.shields.io/badge/Contact%20Me-blue)](https://github.com/Project-HAMi/HAMi#contact)
@@ -65,6 +65,8 @@ will see 3G device memory inside container
[![cambricon MLU](https://img.shields.io/badge/Cambricon-Mlu-blue)](docs/cambricon-mlu-support.md)
[![hygon DCU](https://img.shields.io/badge/Hygon-DCU-blue)](docs/hygon-dcu-support.md)
[![iluvatar GPU](https://img.shields.io/badge/Iluvatar-GPU-blue)](docs/iluvatar-gpu-support.md)
[![mthreads GPU](https://img.shields.io/badge/Mthreads-GPU-blue)](docs/mthreads-support.md)
[![ascend NPU](https://img.shields.io/badge/Ascend-NPU-blue)](https://github.com/Project-HAMi/ascend-device-plugin/blob/main/README.md)

## Architect

@@ -166,7 +168,7 @@ If you have any questions, please feel free to reach out to us through the follo
- [Meeting Link](https://meeting.tencent.com/dm/Ntiwq1BICD1P)
- Email: refer to the [MAINTAINERS.md](MAINTAINERS.md) to find the email addresses of all maintainers. Feel free to contact them via email to report any issues or ask questions.
- [mailing list](https://groups.google.com/forum/#!forum/hami-project)
- [slack](https://join.slack.com/t/hami-hsf3791/shared_invite/zt-2gcteqiph-Ls8Atnpky6clrspCAQ_eGQ)
- [slack](https://cloud-native.slack.com/archives/C07T10BU4R2) | [Join](https://slack.cncf.io/)

## License

6 changes: 4 additions & 2 deletions README_cn.md
@@ -8,7 +8,7 @@
[![codecov](https://codecov.io/gh/Project-HAMi/HAMi/branch/master/graph/badge.svg?token=ROM8CMPXZ6)](https://codecov.io/gh/Project-HAMi/HAMi)
[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FProject-HAMi%2FHAMi.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2FProject-HAMi%2FHAMi?ref=badge_shield)
[![docker pulls](https://img.shields.io/docker/pulls/4pdosc/k8s-vgpu.svg)](https://hub.docker.com/r/4pdosc/k8s-vgpu)
[![slack](https://img.shields.io/badge/Slack-Join%20Slack-blue)](https://join.slack.com/t/hami-hsf3791/shared_invite/zt-2gcteqiph-Ls8Atnpky6clrspCAQ_eGQ)
[![slack](https://img.shields.io/badge/Slack-Join%20Slack-blue)](https://cloud-native.slack.com/archives/C07T10BU4R2)
[![discuss](https://img.shields.io/badge/Discuss-Ask%20Questions-blue)](https://github.com/Project-HAMi/HAMi/discussions)
[![website](https://img.shields.io/badge/website-blue)](http://project-hami.io)
[![Contact Me](https://img.shields.io/badge/Contact%20Me-blue)](https://github.com/Project-HAMi/HAMi#contact)
@@ -22,6 +22,8 @@
[![寒武纪 MLU](https://img.shields.io/badge/寒武纪-Mlu-blue)](docs/cambricon-mlu-support_cn.md)
[![海光 DCU](https://img.shields.io/badge/海光-DCU-blue)](docs/hygon-dcu-support.md)
[![天数智芯 GPU](https://img.shields.io/badge/天数智芯-GPU-blue)](docs/iluvatar-gpu-support_cn.md)
[![摩尔线程 GPU](https://img.shields.io/badge/摩尔线程-GPU-blue)](docs/mthreads-support_cn.md)
[![华为昇腾 NPU](https://img.shields.io/badge/华为昇腾-NPU-blue)](https://github.com/Project-HAMi/ascend-device-plugin/blob/main/README_cn.md)


## Introduction
@@ -204,4 +206,4 @@ The HAMi community is committed to fostering an open and friendly environment, and through various ways
- [Meeting link](https://meeting.tencent.com/dm/Ntiwq1BICD1P)
- Email: see [MAINTAINERS.md](MAINTAINERS.md) for the email addresses of all maintainers. Feel free to email them to report issues or ask questions.
- [Mailing list](https://groups.google.com/forum/#!forum/hami-project)
- [slack](https://join.slack.com/t/hami-hsf3791/shared_invite/zt-2gcteqiph-Ls8Atnpky6clrspCAQ_eGQ)
- [slack](https://cloud-native.slack.com/archives/C07T10BU4R2) | [Join](https://slack.cncf.io/)
8 changes: 8 additions & 0 deletions charts/hami/templates/scheduler/configmap.yaml
@@ -32,6 +32,14 @@ data:
},
{{- end }}
{{- end }}
{{- if .Values.devices.mthreads.enabled }}
{{- range .Values.devices.mthreads.resources }}
{
"name": "{{ . }}",
"ignoredByScheduler": true
},
{{- end }}
{{- end }}
{
"name": "{{ .Values.resourceName }}",
"ignoredByScheduler": true
6 changes: 6 additions & 0 deletions charts/hami/templates/scheduler/configmapnew.yaml
@@ -55,4 +55,10 @@ data:
ignoredByScheduler: true
{{- end }}
{{- end }}
{{- if .Values.devices.mthreads.enabled }}
{{- range .Values.devices.mthreads.resources }}
- name: {{ . }}
ignoredByScheduler: true
{{- end }}
{{- end }}
{{- end }}
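
With the chart defaults from `values.yaml` in this commit (`devices.mthreads.resources` containing only `mthreads.com/vgpu`) and `devices.mthreads.enabled` set to `true`, the block added above should render roughly as follows. This is a hand-rendered sketch, not `helm template` output; the real indentation depends on the surrounding template, which this hunk does not show.

```yaml
# Hand-rendered sketch of the new mthreads entry (assumes default resources list)
- name: mthreads.com/vgpu
  ignoredByScheduler: true
```

Each additional entry under `devices.mthreads.resources` would produce one more item of the same shape.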
@@ -40,8 +40,8 @@ spec:
- create
- --cert-name=tls.crt
- --key-name=tls.key
{{- if .Values.scheduler.customWebhook.enabled }}
- --host={{ printf "%s.%s.svc,127.0.0.1,%s" (include "hami-vgpu.scheduler" .) .Release.Namespace .Values.scheduler.customWebhook.host}}
{{- if .Values.scheduler.admissionWebhook.customURL.enabled }}
- --host={{ printf "%s.%s.svc,127.0.0.1,%s" (include "hami-vgpu.scheduler" .) .Release.Namespace .Values.scheduler.admissionWebhook.customURL.host}}
{{- else }}
- --host={{ printf "%s.%s.svc,127.0.0.1" (include "hami-vgpu.scheduler" .) .Release.Namespace }}
{{- end }}
12 changes: 6 additions & 6 deletions charts/hami/templates/scheduler/webhook.yaml
@@ -6,16 +6,16 @@ webhooks:
- admissionReviewVersions:
- v1beta1
clientConfig:
{{- if .Values.scheduler.customWebhook.enabled }}
url: https://{{ .Values.scheduler.customWebhook.host}}:{{.Values.scheduler.customWebhook.port}}{{.Values.scheduler.customWebhook.path}}
{{- if .Values.scheduler.admissionWebhook.customURL.enabled }}
url: https://{{ .Values.scheduler.admissionWebhook.customURL.host}}:{{.Values.scheduler.admissionWebhook.customURL.port}}{{.Values.scheduler.admissionWebhook.customURL.path}}
{{- else }}
service:
name: {{ include "hami-vgpu.scheduler" . }}
namespace: {{ .Release.Namespace }}
path: /webhook
port: {{ .Values.scheduler.service.httpPort }}
{{- end }}
failurePolicy: {{ .Values.scheduler.mutatingWebhookConfiguration.failurePolicy }}
failurePolicy: {{ .Values.scheduler.admissionWebhook.failurePolicy }}
matchPolicy: Equivalent
name: vgpu.hami.io
namespaceSelector:
@@ -24,19 +24,19 @@ webhooks:
operator: NotIn
values:
- ignore
{{- if .Values.scheduler.customWebhook.whitelistNamespaces }}
{{- if .Values.scheduler.admissionWebhook.whitelistNamespaces }}
- key: kubernetes.io/metadata.name
operator: NotIn
values:
{{- toYaml .Values.scheduler.customWebhook.whitelistNamespaces | nindent 10 }}
{{- toYaml .Values.scheduler.admissionWebhook.whitelistNamespaces | nindent 10 }}
{{- end }}
objectSelector:
matchExpressions:
- key: hami.io/webhook
operator: NotIn
values:
- ignore
reinvocationPolicy: Never
reinvocationPolicy: {{ .Values.scheduler.admissionWebhook.reinvocationPolicy }}
rules:
- apiGroups:
- ""
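
To make the effect of the renamed values concrete, here is a hand-rendered sketch of the relevant webhook fields, assuming `scheduler.admissionWebhook.customURL.enabled` is set to `true` and every other value keeps the default from `values.yaml` in this commit (`host: 127.0.0.1`, `port: 31998`, `path: /webhook`, `failurePolicy: Ignore`, `reinvocationPolicy: Never`). It is not actual `helm template` output, and unrelated fields are omitted.

```yaml
# Hand-rendered sketch; assumes customURL.enabled=true and chart defaults otherwise
clientConfig:
  url: https://127.0.0.1:31998/webhook
failurePolicy: Ignore
reinvocationPolicy: Never
```

With `customURL.enabled` left at `false`, the template falls back to the `service` form of `clientConfig` shown in the hunk.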
23 changes: 14 additions & 9 deletions charts/hami/values.yaml
@@ -79,18 +79,21 @@ scheduler:
podAnnotations: {}
tolerations: []
#serviceAccountName: "hami-vgpu-scheduler-sa"
customWebhook:
enabled: false
# must be an endpoint using https.
# should generate host certs here
host: 127.0.0.1 # hostname or ip, can be your node'IP if you want to use https://<nodeIP>:<schedulerPort>/<path>
port: 31998
path: /webhook
admissionWebhook:
customURL:
enabled: false
# must be an endpoint using https.
# should generate host certs here
host: 127.0.0.1 # hostname or ip, can be your node'IP if you want to use https://<nodeIP>:<schedulerPort>/<path>
port: 31998
path: /webhook
whitelistNamespaces:
# Specify the namespaces that the webhook will not be applied to.
# - default
# - kube-system
# - istio-system
reinvocationPolicy: Never
failurePolicy: Ignore
patch:
image: docker.io/jettech/kube-webhook-certgen:v1.5.2
imageNew: liangjw/kube-webhook-certgen:v1.1.1
@@ -100,8 +103,6 @@ scheduler:
nodeSelector: {}
tolerations: []
runAsUser: 2000
mutatingWebhookConfiguration:
failurePolicy: Ignore
service:
httpPort: 443
schedulerPort: 31998
@@ -135,6 +136,10 @@ devicePlugin:
tolerations: []

devices:
mthreads:
enabled: false
resources:
- mthreads.com/vgpu
ascend:
enabled: false
image: ""
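
The key renames in `charts/hami/values.yaml` can be summarized with a values override sketch. The file name `my-values.yaml` is hypothetical; every key and default is taken from the hunks above, with only `devices.mthreads.enabled` flipped to `true` for illustration. Overrides that still use `scheduler.customWebhook.*` or `scheduler.mutatingWebhookConfiguration.failurePolicy` would presumably no longer be read by the templates changed in this commit.

```yaml
# my-values.yaml (hypothetical file name)
scheduler:
  admissionWebhook:              # replaces scheduler.customWebhook
    customURL:
      enabled: false
      host: 127.0.0.1
      port: 31998
      path: /webhook
    reinvocationPolicy: Never    # previously hard-coded in webhook.yaml
    failurePolicy: Ignore        # moved from scheduler.mutatingWebhookConfiguration.failurePolicy
devices:
  mthreads:
    enabled: true                # chart default is false
    resources:
      - mthreads.com/vgpu
```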
2 changes: 1 addition & 1 deletion docker/Dockerfile.withlib
@@ -7,7 +7,7 @@ ADD . /k8s-vgpu
ARG GOPROXY=https://goproxy.cn,direct
RUN cd /k8s-vgpu && make all

FROM ubuntu:20.04
FROM ubuntu:24.04
ENV NVIDIA_DISABLE_REQUIRE="true"
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=utility
2 changes: 1 addition & 1 deletion docs/config.md
@@ -17,7 +17,7 @@ helm install vgpu-charts/vgpu vgpu --set devicePlugin.deviceMemoryScaling=5 ...
* `devicePlugin.disablecorelimit:`
String type. "true" disables the core limit, "false" enables it. Default: false
* `scheduler.defaultMem:`
Integer type, by default: 5000. The default device memory of the current task, in MB
Integer type, default: 0. The default device memory of the current task, in MB. '0' means use 100% of the device memory
* `scheduler.defaultCores:`
Integer type, default: 0. Percentage of GPU cores reserved for the current task. If set to 0, the task may fit on any GPU with enough device memory. If set to 100, the task uses an entire GPU card exclusively.
* `scheduler.defaultGPUNum:`
2 changes: 1 addition & 1 deletion docs/config_cn.md
@@ -15,7 +15,7 @@ helm install vgpu vgpu-charts/vgpu --set devicePlugin.deviceMemoryScaling=5 ...
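* `devicePlugin.disablecorelimit:`
String type. "true" disables the compute limit, "false" enables it. Default: "false"
* `scheduler.defaultMem:`
Integer type, default 5000. The default device memory size used when device memory is not configured, in MB
Integer type, default 0. The default device memory size used when device memory is not configured, in MB. A value of 0 means all device memory is used.
* `scheduler.defaultCores:`
Integer type (0-100), default 0. The percentage of compute reserved for each task by default. If set to 0, the task may be assigned to any GPU that satisfies its device memory requirement; if set to 100, the task uses an entire GPU card exclusively
* `scheduler.defaultGPUNum:`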
67 changes: 67 additions & 0 deletions docs/mthreads-support.md
@@ -0,0 +1,67 @@
## Introduction

**We now support mthreads.com/vgpu by implementing most of the device-sharing features available for NVIDIA GPUs**, including:

***GPU sharing***: Each task can allocate a portion of a GPU instead of a whole GPU card, so a GPU can be shared among multiple tasks.

***Device Memory Control***: A GPU can be allocated with a specific device memory size on certain device types (e.g. MTT S4000), and the task's usage is kept within that limit.

***Device Core Control***: A GPU can be allocated with a limited number of compute cores on certain device types (e.g. MTT S4000), and the task's usage is kept within that limit.

## Important Notes

1. Device sharing across multiple cards is not supported.

2. Only one mthreads device can be shared in a pod (even if the pod has multiple containers).

3. Exclusive allocation of an mthreads GPU is supported by specifying only `mthreads.com/vgpu`.

4. These features have been tested on MTT S4000.

## Prerequisites

* [MT CloudNative Toolkits > 1.9.0](https://docs.mthreads.com/cloud-native/cloud-native-doc-online/)
* driver version >= 1.2.0

## Enabling GPU-sharing Support

* Deploy the MT-CloudNative Toolkit on mthreads nodes (please consult your device provider to acquire the package and its documentation)

> **NOTICE:** *You can remove mt-mutating-webhook and mt-gpu-scheduler after installation (optional).*

* Set `devices.mthreads.enabled = true` when installing HAMi

```
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set devices.mthreads.enabled=true -n kube-system
```

## Running Mthreads jobs

Mthreads GPUs can now be requested by a container using the `mthreads.com/vgpu`, `mthreads.com/sgpu-memory` and `mthreads.com/sgpu-core` resource types:

```
apiVersion: v1
kind: Pod
metadata:
  name: gpushare-pod-default
spec:
  restartPolicy: OnFailure
  containers:
    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc
      imagePullPolicy: IfNotPresent
      name: gpushare-pod-1
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          mthreads.com/vgpu: 1
          mthreads.com/sgpu-memory: 32
          mthreads.com/sgpu-core: 8
```

> **NOTICE1:** *Each unit of sgpu-memory represents 512M of device memory.*
>
> **NOTICE2:** *You can find more examples in the [examples/mthreads folder](../examples/mthreads/).*
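
As a sketch of the exclusive mode described in the Important Notes (specifying only `mthreads.com/vgpu`), the pod below requests a card without the `sgpu-memory` and `sgpu-core` limits. The pod and container names are hypothetical; the image is reused from the example above. For comparison, the shared request of `mthreads.com/sgpu-memory: 32` in that example corresponds to 32 × 512M = 16384M of device memory.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-exclusive-pod        # hypothetical name
spec:
  restartPolicy: OnFailure
  containers:
    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc
      imagePullPolicy: IfNotPresent
      name: gpu-exclusive-1      # hypothetical name
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          # Only vgpu is set, with no sgpu-memory or sgpu-core,
          # which the notes above describe as an exclusive request.
          mthreads.com/vgpu: 1
```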

68 changes: 68 additions & 0 deletions docs/mthreads-support_cn.md
@@ -0,0 +1,68 @@
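## Introduction

This component supports sharing Mthreads (摩尔线程) GPU devices and provides the following vGPU-like sharing features:

***GPU sharing***: Each task can occupy only part of a card, and multiple tasks can share one card.

***Device memory limit***: You can allocate a GPU by device memory size (for example, 3000M), and this component ensures that the memory used by the task does not exceed the allocated amount.

***Compute core group limit***: You can allocate a GPU by number of compute core groups (for example, 8), and this component ensures that the compute used by the task does not exceed the allocated amount.

## Notes

1. Slicing across multiple cards is not supported for now; multi-card tasks can only be allocated whole cards.

2. A pod can only use slices from a single GPU, even if the pod has multiple containers.

3. Exclusive mode is supported: specifying only `mthreads.com/vgpu` requests exclusive use.

4. This feature currently supports only MTT S4000 devices.

## Node Requirements

* [MT CloudNative Toolkits > 1.9.0](https://docs.mthreads.com/cloud-native/cloud-native-doc-online/)
* Driver version >= 1.2.0

## Enabling GPU Sharing

* Deploy 'gpu-manager'. Mthreads GPU sharing must be used together with the vendor-provided 'MT-CloudNative Toolkit'; contact your device provider to obtain it.

> **NOTE:** *(Optional) After deployment, uninstall the mt-mutating-webhook and mt-scheduler components, because the HAMi scheduler provides this functionality.*

* Set `devices.mthreads.enabled = true` when installing HAMi

```
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set devices.mthreads.enabled=true -n kube-system
```

## Running GPU Jobs

The three resources `mthreads.com/vgpu`, `mthreads.com/sgpu-memory` and `mthreads.com/sgpu-core` determine the number of slices a container requests and the corresponding device memory and compute core groups: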

```
apiVersion: v1
kind: Pod
metadata:
  name: gpushare-pod-default
spec:
  restartPolicy: OnFailure
  containers:
    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc
      imagePullPolicy: IfNotPresent
      name: gpushare-pod-1
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          mthreads.com/vgpu: 1
          mthreads.com/sgpu-memory: 32
          mthreads.com/sgpu-core: 8
```

> **NOTE 1:** *Each unit of sgpu-memory represents 512M of device memory.*
>
> **NOTE 2:** *See more [examples](../examples/mthreads/).*



17 changes: 17 additions & 0 deletions examples/mthreads/default_use.yaml
@@ -0,0 +1,17 @@
apiVersion: v1
kind: Pod
metadata:
  name: gpushare-pod-default
spec:
  restartPolicy: OnFailure
  containers:
    - image: core.harbor.zlidc.mthreads.com:30003/mt-ai/lm-qy2:v17-mpc
      imagePullPolicy: IfNotPresent
      name: gpushare-pod-1
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          mthreads.com/vgpu: 1
          mthreads.com/sgpu-memory: 32
          mthreads.com/sgpu-core: 8