
A task is running, but "Compute Usage Trend" and "GPU Memory Usage Trend" in task management always show 0 #10

Open
bogan-FMA opened this issue Nov 22, 2024 · 0 comments


bogan-FMA commented Nov 22, 2024

I started a HAMi vGPU pod and ran a task inside it. The GPU utilization is displayed in the UI, but the container task metrics in the UI are always 0.
It looks like the hami_container_core_util and hami_container_memory_util values are not being counted correctly, while the Allocated metrics are fine.
The version is v1.0.4.
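
As a sanity check (assuming nvidia-smi is available inside the client container, which is not shown above), the task's GPU usage can also be confirmed from inside the vGPU pod itself:

kubectl -n hami exec -it native-tf-85bf47f4dd-4bfvz -c client -- nvidia-smi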

root@master:~/host# curl 192.168.1.81:30080/metrics | grep hami_container
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13966    0 13966    0     0   166k      0 --:--:-- --:--:-- --:--:--  166k
# HELP hami_container_core_used task used core
# TYPE hami_container_core_used gauge
hami_container_core_used{container_name="client",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 0
# HELP hami_container_core_util task core util percent 0-100
# TYPE hami_container_core_util gauge
hami_container_core_util{container_name="client",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 0
# HELP hami_container_memory_used task used memory unit MB
# TYPE hami_container_memory_used gauge
hami_container_memory_used{container_name="client",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 0
# HELP hami_container_memory_util task memory util percent 0-100
# TYPE hami_container_memory_util gauge
hami_container_memory_util{container_name="client",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 0
# HELP hami_container_vcore_allocated task allocated vCore size
# TYPE hami_container_vcore_allocated gauge
hami_container_vcore_allocated{container_name="client",container_pod_uuid="client:98a8c2e3-86d4-41fd-9a09-5f50dd1834cb",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 30
# HELP hami_container_vgpu_allocated task allocated vGPU count
# TYPE hami_container_vgpu_allocated gauge
hami_container_vgpu_allocated{container_name="client",container_pod_uuid="client:98a8c2e3-86d4-41fd-9a09-5f50dd1834cb",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 1
# HELP hami_container_vmemory_allocated task allocated vMemory size
# TYPE hami_container_vmemory_allocated gauge
hami_container_vmemory_allocated{container_name="client",container_pod_uuid="client:98a8c2e3-86d4-41fd-9a09-5f50dd1834cb",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 6000
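
For reference, a minimal loop to watch these gauges while the task is running (endpoint and metric names taken from the output above; the 30s interval is arbitrary):

while true; do
  date
  # print only the per-container utilization gauges
  curl -s 192.168.1.81:30080/metrics | grep -E 'hami_container_(core|memory)_util'
  sleep 30
done

This matches the behavior described above: the *_util gauges stay at 0 for the whole run, while the *_allocated gauges keep their expected values (30, 1, and 6000).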

The fe and be logs are as follows:

INFO ts=2024-11-22T07:57:42+08:00 caller=data/pod.go:96 msg=Pod added: Name: native-tf-85bf47f4dd-4bfvz, UID: 98a8c2e3-86d4-41fd-9a09-5f50dd1834cb, Namespace: hami, NodeID: node1
INFO ts=2024-11-22T08:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T08:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T08:57:42+08:00 caller=data/pod.go:96 msg=Pod added: Name: native-tf-85bf47f4dd-4bfvz, UID: 98a8c2e3-86d4-41fd-9a09-5f50dd1834cb, Namespace: hami, NodeID: node1
INFO ts=2024-11-22T09:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T09:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T09:57:42+08:00 caller=data/pod.go:96 msg=Pod added: Name: native-tf-85bf47f4dd-4bfvz, UID: 98a8c2e3-86d4-41fd-9a09-5f50dd1834cb, Namespace: hami, NodeID: node1
INFO ts=2024-11-22T10:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T10:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T10:57:42+08:00 caller=data/pod.go:96 msg=Pod added: Name: native-tf-85bf47f4dd-4bfvz, UID: 98a8c2e3-86d4-41fd-9a09-5f50dd1834cb, Namespace: hami, NodeID: node1
root@master:~/host# k logs hami-webui-7c9859b578-72pd7 -c hami-webui-fe-oss
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [NestFactory] Starting Nest application...
[Nest] 1 - 11/20/2024, 5:57:41 PM WARN [ModuleTokenFactory] The module "InternalCoreModule" is taking 82.12ms to serialize, this may be caused by larger objects statically assigned to the module. More details: nestjs/nest#12738 +84ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [InstanceLoader] HttpModule dependencies initialized +10ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [InstanceLoader] AppModule dependencies initialized +1ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [RoutesResolver] AppController {/}: +6ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [RouterExplorer] Mapped {/health_check, GET} route +2ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [RouterExplorer] Mapped {/*, GET} route +2ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [NestApplication] Nest application successfully started +80ms
(node:1) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 close listeners added to [Server]. Use emitter.setMaxListeners() to increase limit
(Use node --trace-warnings ... to show where the warning was created)
