fix grafana dashboard and clarify dashboard usage more clearly. #543

jiangsanyin · 2024-10-10T01:47:05Z

Signed-off-by: jiangsanyin [email protected]

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fixes the Grafana dashboard and clarifies the dashboard usage instructions. Thanks to fangfenghuang (https://github.com/fangfenghuang) for the help.

Which issue(s) this PR fixes:
Fixes #498 #468

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

yxl · 2024-10-14T02:36:00Z

docs/dashboard.md

+    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
+    resources:
+      limits:
+        nvidia.com/vgpu: 2 # requesting 2 vGPUs


Should "nvidia.com/vgpu" be "nvidia.com/gpu"?

I forgot to explain this: it depends on your own setup.
To distinguish it from "nvidia.com/gpu" in nvidia-device-plugin, I used the resourceName parameter and set its value to "nvidia.com/vgpu", for example: helm install hami hami-charts/hami --set resourceName=nvidia.com/vgpu --set scheduler.kubeScheduler.imageTag=v1.23.10 -n kube-system
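
For readers who prefer a values file over repeated --set flags, the same two overrides could be written roughly as follows. This is only a sketch: the file name hami-values.yaml is hypothetical, and the keys are taken directly from the --set flags in the command above.

```yaml
# hami-values.yaml (hypothetical file name)
# Mirrors: --set resourceName=nvidia.com/vgpu
resourceName: nvidia.com/vgpu

# Mirrors: --set scheduler.kubeScheduler.imageTag=v1.23.10
scheduler:
  kubeScheduler:
    imageTag: v1.23.10
```

Such a file would be passed with -f hami-values.yaml in place of the two --set flags. With this setting, pods request "nvidia.com/vgpu" as in the snippet under review; with the chart's default resource name they would not.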

wawa0210 · 2024-10-24T02:40:41Z

@fangfenghuang Can you help review this pr?

codecov · 2024-10-24T02:44:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Flag	Coverage Δ
unittests	`27.09% <ø> (ø)`

fangfenghuang

Please fix some of the HTTP URLs.

fangfenghuang · 2024-10-24T03:00:20Z

docs/dashboard.md

+
+	You can see the monitoring details in the dashboard. The contents are as follows:
+
+![image-20241003215400685](https://s2.loli.net/2024/10/03/RFJuthzAGYw5UHk.png)


It is best to place the referenced images in ../imgs/.

OK, the changes have been made.

Nimbus318 · 2024-11-25T07:33:10Z

@jiangsanyin
I have followed the installation instructions as described in the documentation, but encountered a minor issue, which I also mentioned previously in Issue #498. By default, the dcgm-exporter only includes the Hostname label. To match the current Grafana dashboard configuration, it's necessary to add a node_name relabeling configuration when installing dcgm-exporter

https://github.com/NVIDIA/dcgm-exporter/blob/b97b7633e3f39f7a537bd77561cc0ec0c2dca3f5/deployment/values.yaml#L117C3-L117C18

This relabeling should be consistent with the configurations for hami-device-plugin-svc-monitor and hami-scheduler-svc-monitor

It would be helpful to include this information in the documentation, as users unfamiliar with the Prometheus stack may struggle to configure everything correctly on the first attempt

jiangsanyin · 2024-11-25T07:45:06Z

Have you created and applied the ServiceMonitor as described in dashboard.md or dashboard_cn.md? The node_name label is added once this is done.
# Create the file hami-device-plugin-svc-monitor.yaml
root@controller01:~# cat hami-device-plugin-svc-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - "kube-system"
  endpoints:
    - path: /metrics
      port: monitorport
      interval: "15s"
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace

# Apply the file hami-device-plugin-svc-monitor.yaml
root@controller01:~# kubectl apply -f hami-device-plugin-svc-monitor.yaml

Nimbus318 · 2024-11-25T08:00:57Z

@jiangsanyin
Both are correct. What I meant is that you might have forgotten to include the explanation for the relabel configuration of dcgm-exporter. By default, dcgm-exporter only includes the Hostname label

It’s important to document this configuration to ensure it aligns with the relabeling setup for hami-device-plugin-svc-monitor. Without this explanation, users may miss adding the necessary node_name relabeling when setting up dcgm-exporter
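
As an illustration of the dcgm-exporter relabeling discussed above, a values override along these lines might work. It is a sketch, not a tested configuration: it assumes the dcgm-exporter Helm chart exposes a serviceMonitor.relabelings list (the key at the values.yaml line linked earlier in this thread), the file name dcgm-values.yaml is hypothetical, and the rule simply mirrors the node_name rule in hami-device-plugin-svc-monitor.

```yaml
# dcgm-values.yaml (hypothetical file name)
# Assumption: the chart renders serviceMonitor.relabelings into the
# ServiceMonitor endpoint it creates; check the chart version in use.
serviceMonitor:
  enabled: true
  relabelings:
    # Copy the Kubernetes node name into node_name so that dcgm-exporter
    # metrics carry the same label as the HAMi ServiceMonitors.
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      regex: (.*)
      targetLabel: node_name
      replacement: $1
      action: replace
```

The file would be supplied with -f dcgm-values.yaml when installing or upgrading dcgm-exporter. Without a rule like this, the metrics only carry the default Hostname label, and dashboard panels that filter on node_name would likely show no data for them.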

yxl reviewed Oct 14, 2024

fangfenghuang reviewed Oct 24, 2024

fix grafana dashboard and clarify dashboard usage more clearly; Fix t…

1fe9420

…he image display problem in document and document format Signed-off-by: jiangsanyin <[email protected]>

jiangsanyin force-pushed the master branch from a6a626a to 1fe9420 Compare October 24, 2024 08:34

fangfenghuang approved these changes Oct 24, 2024
