dcgm diagnostic tests pass despite nvlinks being down #198

Open
aditigaur4 opened this issue Nov 13, 2024 · 0 comments
@aditigaur4
Hello,

I have a machine where all NVLinks are down:

dcgmi nvlink --link-status
+----------------------+
|  NvLink Link Status  |
+----------------------+
GPUs:
    gpuId 0:
        D D D D D D D D D D D D D D D D D D
    gpuId 1:
        D D D D D D D D D D D D D D D D D D
NvSwitches:
    No NvSwitches found.

Key: Up=U, Down=D, Disabled=X, Not Supported=_

Yet when I run the dcgmi diagnostic tests at level 2, they all pass. Shouldn't they fail if the NVLinks aren't working? The dcgmi docs say that the PCIe test covers both PCIe and NVLink.

dcgmi diag -r 2
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.7                                          |
| Driver Version Detected   | 560.35.03                                      |
| GPU Device IDs Detected   | 2321,2321                                      |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Info                      | Persistence mode for GPU 0 is disabled. Enabl  |
|                           | e persistence mode by running "nvidia-smi -i   |
|                           | <gpuId> -pm 1 " as root.,Persistence mode for  |
|                           |  GPU 1 is disabled. Enable persistence mode b  |
|                           | y running "nvidia-smi -i <gpuId> -pm 1 " as r  |
|                           | oot.                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+
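For reference, here is the sequence of checks I would expect to surface the problem. The dcgmi commands are the ones shown above; the nvidia-smi invocations are assumed driver-level equivalents for cross-checking the link state:

# Link state as reported by DCGM (shows all links Down, as above)
dcgmi nvlink --link-status

# Link state as reported by the driver directly
nvidia-smi nvlink --status

# Topology view, to confirm the two GPUs are expected to be NVLink-connected
nvidia-smi topo -m

# Level-2 diagnostic, which includes the PCIe/NVLink integration test
dcgmi diag -r 2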