A description of the problem.
The dcgmi policy action "Reset GPU" has no effect.
Steps to reproduce the issue.
root@dcgm-image-4090:# dcgmi policy -g 0 --get -v
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information |
| GPU ID: 0 |
+=============================+================================================+
| Violation conditions | XID error detected. |
| Isolation mode | Manual |
| Action on violation | Reset GPU |
| Validation after action | System Validation (Short) |
| Validation failure action | None |
+-----------------------------+------------------------------------------------+
root@dcgm-image-4090:#
root@dcgm-image-4090:# dcgmi test --inject --gpuid 0 -f 230 -v 1
Successfully injected field info.
root@dcgm-image-4090:#
root@dcgm-image-4090:#
root@dcgm-image-4090:# dcgmi test --inject --gpuid 0 -f 230 -v 1
Successfully injected field info.
root@dcgm-image-4090:~#
However, dmesg shows no sign that a GPU reset was triggered.
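To confirm from a script which action the policy is actually configured with before injecting the XID field, the table printed by `dcgmi policy -g 0 --get -v` can be parsed. The following is only a minimal sketch: `policy_field` is a hypothetical helper name (not part of DCGM), and it assumes the two-column `| label | value |` row layout shown in the output above.

```shell
#!/bin/sh
# policy_field LABEL
# Hypothetical helper (not part of DCGM): reads `dcgmi policy --get` table
# output on stdin and prints the value of the row whose label equals LABEL.
# Assumes rows look like: | Action on violation | Reset GPU |
policy_field() {
    awk -F'|' -v key="$1" '
        NF >= 4 {
            label = $2
            value = $3
            # Trim surrounding whitespace from both cells.
            gsub(/^[ \t]+|[ \t]+$/, "", label)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (label == key) { print value; exit }
        }'
}

# Example usage on a host where dcgmi is installed:
#   dcgmi policy -g 0 --get -v | policy_field "Action on violation"
```

If the helper prints `Reset GPU` but no reset follows the injection, the policy is registered as expected and the problem lies in how the action is carried out.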
Relevant configuration information
Bare-metal environment.
root@dcgm-image-4090:~# nvidia-smi
Wed Aug 28 03:45:20 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:00:05.0 Off | Off |
| 37% 29C P8 6W / 450W | 0MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@dcgm-image-4090:~#
root@dcgm-image-4090:~# dcgmi -v
Version : 3.3.7
Build ID : 26
Build Date : 2024-07-09
Build Type : Release
Commit ID : 105620196e46a7ef2f99a1ce3e69a5d12af1e845
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c1b74febf52d45d29ae956b78c091857
Hostengine build info:
Version : 3.3.7
Build ID : 26
Build Date : 2024-07-09
Build Type : Release
Commit ID : 105620196e46a7ef2f99a1ce3e69a5d12af1e845
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c1b74febf52d45d29ae956b78c091857
root@dcgm-image-4090:~#