dcgmi policy about Reset GPU Not effective #185

Open
mr-j-1992 opened this issue Aug 28, 2024 · 0 comments

A description of the problem.
The "Reset GPU" action configured through dcgmi policy does not take effect: after an XID error is injected, no GPU reset happens.
Steps to reproduce the issue.
root@dcgm-image-4090:~# dcgmi policy -g 0 --get -v
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information                                                           |
| GPU ID: 0                                                                    |
+=============================+================================================+
| Violation conditions        | XID error detected.                            |
| Isolation mode              | Manual                                         |
| Action on violation         | Reset GPU                                      |
| Validation after action     | System Validation (Short)                      |
| Validation failure action   | None                                           |
+-----------------------------+------------------------------------------------+
root@dcgm-image-4090:~# dcgmi test --inject --gpuid 0 -f 230 -v 1
Successfully injected field info.
root@dcgm-image-4090:~# dcgmi test --inject --gpuid 0 -f 230 -v 1
Successfully injected field info.
root@dcgm-image-4090:~#

However, dmesg shows no sign of a GPU reset after the injection.
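A small helper like the one below is one way to check the kernel log for driver activity after the injection. This is only a sketch under assumptions: the function name filter_nvrm_events is made up for illustration, it assumes NVIDIA driver messages carry the "NVRM" tag, and because the exact wording of a reset message is not known here it simply matches the word "reset" (or "xid") on those lines. The value written by dcgmi test --inject is held by DCGM rather than produced by the driver, so the injection itself would probably not add an Xid line; the interesting question is whether any reset-related line appears afterwards.

```shell
#!/usr/bin/env bash
# Sketch: keep only kernel-log lines from the NVIDIA driver (tagged "NVRM")
# that mention an Xid or a reset.
# Assumptions: the "NVRM" tag and the generic words "xid"/"reset" are used as
# match patterns; the real reset message text may differ.
filter_nvrm_events() {
    # First grep keeps driver lines, second keeps Xid/reset mentions.
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E 'NVRM' | grep -iE 'xid|reset' || true
}

# Usage (reading the kernel log may require root):
#   dmesg | filter_nvrm_events
```

An empty result after the two injections would be consistent with the behaviour described above, although it does not by itself show why the policy action was not carried out.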

Relevant configuration information
Bare-metal environment.

root@dcgm-image-4090:~# nvidia-smi
Wed Aug 28 03:45:20 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:05.0 Off |                  Off |
| 37%   29C    P8     6W / 450W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@dcgm-image-4090:~#

root@dcgm-image-4090:~# dcgmi -v
Version : 3.3.7
Build ID : 26
Build Date : 2024-07-09
Build Type : Release
Commit ID : 105620196e46a7ef2f99a1ce3e69a5d12af1e845
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c1b74febf52d45d29ae956b78c091857

Hostengine build info:
Version : 3.3.7
Build ID : 26
Build Date : 2024-07-09
Build Type : Release
Commit ID : 105620196e46a7ef2f99a1ce3e69a5d12af1e845
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c1b74febf52d45d29ae956b78c091857
root@dcgm-image-4090:~#
