
Commit

fix grafana dashboard and clarify dashboard usage more clearly; minor changes to document format

Signed-off-by: jiangsanyin <[email protected]>
jiangsanyin committed Oct 24, 2024
1 parent 0abdd64 commit a6a626a
Showing 3 changed files with 11 additions and 13 deletions.
12 changes: 5 additions & 7 deletions docs/dashboard.md
@@ -16,9 +16,7 @@

## Deploy kube-prometheus stack

-**Note:**See the version compatibility matrix for kubernetes and kube-prometheus stack in
-
-[compatibility matrix]: https://github.com/prometheus-operator/kube-prometheus?tab=readme-ov-file#compatibility
+**Note:**See the version compatibility matrix for kubernetes and kube-prometheus stack in:https://github.com/prometheus-operator/kube-prometheus?tab=readme-ov-file#compatibility

```shell
#Clone kube-prometheus code repository(using release-0.11 here)
@@ -50,7 +48,7 @@ grafana NodePort 10.233.56.112 <none> 3000:30300/TCP
prometheus-k8s NodePort 10.233.38.113 <none> 9090:30090/TCP,8080:31273/TCP 19h
```

- If ip address of controller node is 10.0.0.21, then grafana, prometheus, and alertmanager can be accessed using the following urls: http://10.0.0.21:30300, http://10.0.0.21:30090, and http://10.0.0.21:30093, and the default user name and password for accessing grafana are admin
+ If ip address of controller node is 10.0.0.21, then grafana, prometheus, and alertmanager can be accessed using the following urls: http://10.0.0.21:30300 , http://10.0.0.21:30090 , and http://10.0.0.21:30093 , and the default user name and password for accessing grafana are admin

## Configure grafana

@@ -60,9 +58,9 @@ prometheus-k8s NodePort 10.233.38.113 <none> 9090:30090/TCP

### Import dashboard

- Go to the "Configuration" -> "Data soutces" page in grafana and import the dashboard from https://grafana.com/grafana/dashboards/22043-hami-vgpu-metrics-dashboard/, and a dashboard page named "hami-vgpu-metrics-dashboard" will be created. 22043-hami-vgpu-metrics-dashboard is valid in grafana8.5.5 and grafana9.1.0, and it's grealty possible that this dashboard is vaild in grafana version later than 9.1.0. Now data of some panels in this dashboard page are missing, which requires you read the rest of the document.
+ Go to the "Configuration" -> "Data soutces" page in grafana and import the dashboard from https://grafana.com/grafana/dashboards/22043-hami-vgpu-metrics-dashboard/ , and a dashboard page named "hami-vgpu-metrics-dashboard" will be created. 22043-hami-vgpu-metrics-dashboard is valid in grafana8.5.5 and grafana9.1.0, and it's grealty possible that this dashboard is vaild in grafana version later than 9.1.0. Now data of some panels in this dashboard page are missing, which requires you read the rest of the document.

- For versions earlier than grafana8.5.5, such as grafana7.5.17, please refer to:https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard/
+ For versions earlier than grafana8.5.5, such as grafana7.5.17, please refer to:https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard/

# Deploy dcgm-exporter

@@ -204,4 +202,4 @@ gpu-pod-01 0/1 Completed 0 52s 10.233.81.70 controlle

 You can see the monitoring details in the dashboard. The contents are as follows:

-![image-20241003215400685](https://s2.loli.net/2024/10/03/RFJuthzAGYw5UHk.png)
+![image-20241003215400685](..\imgs\hami-vgpu-metrics-dashboard.png)
12 changes: 6 additions & 6 deletions docs/dashboard_cn.md
@@ -16,7 +16,7 @@

## Deploy kube-prometheus stack

-**Note:**For the version compatibility matrix between kubernetes and kube-prometheus stack, see https://github.com/prometheus-operator/kube-prometheus?tab=readme-ov-file#compatibility, and choose a kube-prometheus stack version that suits your kubernetes version
+**Note:**For the version compatibility matrix between kubernetes and kube-prometheus stack, see https://github.com/prometheus-operator/kube-prometheus?tab=readme-ov-file#compatibility , and choose a kube-prometheus stack version that suits your kubernetes version

```shell
#Clone the kube-prometheus code repository (using branch release-0.11 here)
@@ -48,19 +48,19 @@ grafana NodePort 10.233.56.112 <none> 3000:30300/TCP
prometheus-k8s NodePort 10.233.38.113 <none> 9090:30090/TCP,8080:31273/TCP 19h
```

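- At this point, if the ip of the controller node is 10.0.0.21, grafana, prometheus, and alertmanager can be accessed using the following urls respectively: http://10.0.0.21:30300, http://10.0.0.21:30090, http://10.0.0.21:30093, and the default user name and password for accessing grafana are both admin
+ At this point, if the ip of the controller node is 10.0.0.21, grafana, prometheus, and alertmanager can be accessed using the following urls respectively: http://10.0.0.21:30300 http://10.0.0.21:30090 http://10.0.0.21:30093 , and the default user name and password for accessing grafana are both admin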

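## Configure grafana

### Create the data source ALL

- Go to the "Configuration" -> "Data soutces" page and create a data source named "ALL". Keep the value of HTTP.URL the same as in the default data source "prometheus", namely "http://prometheus-k8s.monitoring.svc:9090", then save the data source "ALL"
+ Go to the "Configuration" -> "Data soutces" page and create a data source named "ALL". Keep the value of HTTP.URL the same as in the default data source "prometheus", namely http://prometheus-k8s.monitoring.svc:9090" , then save the data source "ALL"

### Import the default HAMi dashboard

- Go to the "Dashboards" -> "Browse" page and import this dashboard: https://grafana.com/grafana/dashboards/22043-hami-vgpu-metrics-dashboard/, and a dashboard named "hami-vgpu-metrics-dashboard" will be created in grafana. The dashboard with ID 22043 has been verified on grafana8.5.5 and grafana9.1.0, and should also work on versions after grafana9.1.0. At this point some Panels on this page, such as vGPUCorePercentage, have no data yet. Please read the rest of this document; after completing the steps in "Deploy dcgm-exporter" and "Create ServiceMonitor", the Panel data will be displayed normally.
+ Go to the "Dashboards" -> "Browse" page and import this dashboard: https://grafana.com/grafana/dashboards/22043-hami-vgpu-metrics-dashboard/ , and a dashboard named "hami-vgpu-metrics-dashboard" will be created in grafana. The dashboard with ID 22043 has been verified on grafana8.5.5 and grafana9.1.0, and should also work on versions after grafana9.1.0. At this point some Panels on this page, such as vGPUCorePercentage, have no data yet. Please read the rest of this document; after completing the steps in "Deploy dcgm-exporter" and "Create ServiceMonitor", the Panel data will be displayed normally.

- For versions earlier than grafana8.5.5, such as grafana7.5.17, please use this dashboard: https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard/
+ For versions earlier than grafana8.5.5, such as grafana7.5.17, please use this dashboard: https://grafana.com/grafana/dashboards/21833-hami-vgpu-dashboard/

# Deploy dcgm-exporter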

@@ -202,4 +202,4 @@ gpu-pod-01 0/1 Completed 0 52s 10.233.81.70 controlle

 At this point, you should be able to see the monitoring details in the dashboard. The contents look roughly as follows

-![image-20241003215400685](https://s2.loli.net/2024/10/03/RFJuthzAGYw5UHk.png)
+![image-20241003215400685](..\imgs\hami-vgpu-metrics-dashboard.png)
Binary file added imgs/hami-vgpu-metrics-dashboard.png
