fix display issues with metrics files and links #4

Open
wants to merge 1 commit into
base: master
12 changes: 6 additions & 6 deletions README.md
@@ -6,7 +6,7 @@ gpu-mon is an [open-falcon](http://open-falcon.com/) GPU-status monitoring

### Metrics

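-1. For detailed descriptions of the metrics, see the [metric](https://github.com/open-falcon/gpu-mon/metric) file. Some commonly used metrics are described below:
+1. For detailed descriptions of the metrics, see the [metrics](https://github.com/open-falcon/gpu-mon/blob/master/metrics) file. Some commonly used metrics are described below: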

```plain
GPUUtils GPU utilization (%)
@@ -27,13 +27,13 @@ gpu-mon is an [open-falcon](http://open-falcon.com/) GPU-status monitoring

### 1. Dependencies

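-1. Install dcgm (version 1.4.2) and start the nv-hostengine process
+1. Install DCGM and start the nv-hostengine process
2. GPU models that currently support the full feature set of DCGM 1.4.2 include:
-   - K80 and later Tesla GPUs
-   - Maxwell and newer non-Tesla GPUs
+   - K80 and later Tesla GPUs
+   - Maxwell and newer non-Tesla GPUs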

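-For the GPU models supported by Dcgm and for DCGM installation, see [(DCGM) NVIDIA Data Center GPU Manager](https://developer.nvidia.com/data-center-gpu-manager-dcgm)
-3. GPU models the plugin has been tested on so far: v100, p4, p40.
+For the GPU models supported by Dcgm and for DCGM installation, see [(DCGM) NVIDIA Data Center GPU Manager](https://developer.nvidia.com/data-center-gpu-manager-dcgm)
+3. GPU models the plugin has been tested on so far: v100, p4, p40; testing used DCGM version 1.4.2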

### 2. Installation and usage

14 changes: 7 additions & 7 deletions metrics
@@ -13,16 +13,16 @@ Tx MB PCIe Tx utilization information
Replays PCIe replay counter
Performance Performance state (P-State) 0-15. 0=highest
FanSpeed % Fan speed for the device in percent 0-100
-PowerUsed W Power usage for the device in Watts
+PowerUsed W Power usage for the device in Watts
DeviceTemperature °C Current temperature readings for the device, in degrees C
MemTemperature °C Memory temperature for the device
SlowdownTemperature °C Slowdown temperature for the device
ShutdownTemperature °C Shutdown temperature for the device Modules
-PowerCurrentLimit W Current Power limit for the device
-PowerMinManLimit W Minimum power management limit for the device
-PowerMaxManLimit W Maximum power management limit for the device
-PowerDefaultManLimit W Default power management limit for the device
-PowerEnforcedLimit W Effective power limit that the driver enforces after taking into account all limiters
+PowerCurrentLimit W Current Power limit for the device
+PowerMinManLimit W Minimum power management limit for the device
+PowerMaxManLimit W Maximum power management limit for the device
+PowerDefaultManLimit W Default power management limit for the device
+PowerEnforcedLimit W Effective power limit that the driver enforces after taking into account all limiters
PowerViolationTime W Power Violation time in usec
FBtotal MB Total Frame Buffer of the GPU in MB
FBfree MB Free Frame Buffer in MB
@@ -39,4 +39,4 @@ DeviceMemSBErrors Device memory single bit volatile ECC errors
DeviceMemDBErrors Device memory double bit volatile ECC errors
RegisterSBErrors Register file single bit volatile ECC errors
RegisterDBErrors Register file double bit volatile ECC errors
-DcgmSupported supported 1, not supported -1
+DcgmSupported Support Dcgm 1, not support -1
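
As a reading aid for the metrics listed above, here is a minimal Python sketch of how a consumer might interpret `DcgmSupported` (1 for supported, -1 for not supported) together with `PowerUsed` and `PowerEnforcedLimit`. The dict-of-floats sample shape and both function names are assumptions made for illustration; they are not part of gpu-mon, whose actual output format is not shown in this diff.

```python
from typing import Mapping, Optional


def dcgm_supported(sample: Mapping[str, float]) -> bool:
    """Return True only when DcgmSupported is 1.

    The metrics file documents 1 as supported and -1 as not supported;
    a missing key is treated as not supported (an assumption of this sketch).
    """
    return sample.get("DcgmSupported", -1) == 1


def power_headroom_watts(sample: Mapping[str, float]) -> Optional[float]:
    """Watts left before PowerUsed reaches PowerEnforcedLimit.

    Returns None when DCGM is not supported or either value is missing.
    """
    if not dcgm_supported(sample):
        return None
    used = sample.get("PowerUsed")
    limit = sample.get("PowerEnforcedLimit")
    if used is None or limit is None:
        return None
    return limit - used


# Hypothetical sample values, chosen only to exercise the functions.
sample = {"DcgmSupported": 1, "PowerUsed": 180.0, "PowerEnforcedLimit": 250.0}
print(power_headroom_watts(sample))  # 70.0
```

Checking `DcgmSupported` first keeps power values from an unsupported device from being read as real measurements.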