
A task is running, but "Compute Usage Trend" and "GPU Memory Usage Trend" in task management always show 0 #10

Open
bogan-FMA opened this issue Nov 22, 2024 · 0 comments


bogan-FMA commented Nov 22, 2024

I started a HAMi vGPU pod and ran a task inside it. The GPU utilization is displayed in the UI, but the container task metrics in the UI are always 0.
It looks like the hami_container_core_util and hami_container_memory_util values are not being counted correctly, while the Allocated metrics are fine.
The version is v1.0.4.
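
As a sanity check (assuming nvidia-smi is available inside the client container, which is not shown above), the task's GPU usage can also be confirmed from inside the vGPU pod itself:

kubectl -n hami exec -it native-tf-85bf47f4dd-4bfvz -c client -- nvidia-smi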

root@master:~/host# curl 192.168.1.81:30080/metrics | grep hami_container
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13966    0 13966    0     0   166k      0 --:--:-- --:--:-- --:--:--  166k
# HELP hami_container_core_used task used core
# TYPE hami_container_core_used gauge
hami_container_core_used{container_name="client",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 0
# HELP hami_container_core_util task core util percent 0-100
# TYPE hami_container_core_util gauge
hami_container_core_util{container_name="client",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 0
# HELP hami_container_memory_used task used memory unit MB
# TYPE hami_container_memory_used gauge
hami_container_memory_used{container_name="client",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 0
# HELP hami_container_memory_util task memory util percent 0-100
# TYPE hami_container_memory_util gauge
hami_container_memory_util{container_name="client",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 0
# HELP hami_container_vcore_allocated task allocated vCore size
# TYPE hami_container_vcore_allocated gauge
hami_container_vcore_allocated{container_name="client",container_pod_uuid="client:98a8c2e3-86d4-41fd-9a09-5f50dd1834cb",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 30
# HELP hami_container_vgpu_allocated task allocated vGPU count
# TYPE hami_container_vgpu_allocated gauge
hami_container_vgpu_allocated{container_name="client",container_pod_uuid="client:98a8c2e3-86d4-41fd-9a09-5f50dd1834cb",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 1
# HELP hami_container_vmemory_allocated task allocated vMemory size
# TYPE hami_container_vmemory_allocated gauge
hami_container_vmemory_allocated{container_name="client",container_pod_uuid="client:98a8c2e3-86d4-41fd-9a09-5f50dd1834cb",devicetype="NVIDIA-NVIDIA GeForce RTX 3060",deviceuuid="GPU-6172f5d3-5c0e-d266-2c38-3008f874074c",namespace_name="hami",node="node1",pod_name="native-tf-85bf47f4dd-4bfvz",provider="NVIDIA"} 6000
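
For reference, a minimal loop to watch these gauges while the task is running (endpoint and metric names taken from the output above; the 30s interval is arbitrary):

while true; do
  date
  # print only the per-container utilization gauges
  curl -s 192.168.1.81:30080/metrics | grep -E 'hami_container_(core|memory)_util'
  sleep 30
done

This matches the behavior described above: the *_util gauges stay at 0 for the whole run, while the *_allocated gauges keep their expected values (30, 1, and 6000).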

The fe and be logs are as follows:

INFO ts=2024-11-22T07:57:42+08:00 caller=data/pod.go:96 msg=Pod added: Name: native-tf-85bf47f4dd-4bfvz, UID: 98a8c2e3-86d4-41fd-9a09-5f50dd1834cb, Namespace: hami, NodeID: node1
INFO ts=2024-11-22T08:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T08:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T08:57:42+08:00 caller=data/pod.go:96 msg=Pod added: Name: native-tf-85bf47f4dd-4bfvz, UID: 98a8c2e3-86d4-41fd-9a09-5f50dd1834cb, Namespace: hami, NodeID: node1
INFO ts=2024-11-22T09:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T09:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T09:57:42+08:00 caller=data/pod.go:96 msg=Pod added: Name: native-tf-85bf47f4dd-4bfvz, UID: 98a8c2e3-86d4-41fd-9a09-5f50dd1834cb, Namespace: hami, NodeID: node1
INFO ts=2024-11-22T10:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T10:57:42+08:00 caller=util/util.go:307 msg=Decoded pod annos: poddevices map[NVIDIA:[[{0 GPU-6172f5d3-5c0e-d266-2c38-3008f874074c NVIDIA 6000 30 }]]]
INFO ts=2024-11-22T10:57:42+08:00 caller=data/pod.go:96 msg=Pod added: Name: native-tf-85bf47f4dd-4bfvz, UID: 98a8c2e3-86d4-41fd-9a09-5f50dd1834cb, Namespace: hami, NodeID: node1
root@master:~/host# k logs hami-webui-7c9859b578-72pd7 -c hami-webui-fe-oss
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [NestFactory] Starting Nest application...
[Nest] 1 - 11/20/2024, 5:57:41 PM WARN [ModuleTokenFactory] The module "InternalCoreModule" is taking 82.12ms to serialize, this may be caused by larger objects statically assigned to the module. More details: nestjs/nest#12738 +84ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [InstanceLoader] HttpModule dependencies initialized +10ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [InstanceLoader] AppModule dependencies initialized +1ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [RoutesResolver] AppController {/}: +6ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [RouterExplorer] Mapped {/health_check, GET} route +2ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [RouterExplorer] Mapped {/*, GET} route +2ms
[Nest] 1 - 11/20/2024, 5:57:41 PM LOG [NestApplication] Nest application successfully started +80ms
(node:1) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 close listeners added to [Server]. Use emitter.setMaxListeners() to increase limit
(Use node --trace-warnings ... to show where the warning was created)
