Components

GPU components

accelerator-nvidia-bad-envs: Tracks any bad environment variables that are globally set for the NVIDIA GPUs.
accelerator-nvidia-hw-slowdown: Monitors hardware clock slowdown events across all NVIDIA GPUs.
accelerator-nvidia-clock-speed: Tracks the per-GPU clock speed.
accelerator-nvidia-ecc: Tracks the NVIDIA per-GPU ECC errors and other ECC related information.
accelerator-nvidia-error: Tracks NVIDIA GPU errors in real time via SMI queries; such errors likely require a host restart.
accelerator-nvidia-error-sxid: Tracks NVIDIA GPU SXid errors by scanning dmesg -- see the fabric manager documentation.
accelerator-nvidia-error-xid: Tracks NVIDIA GPU Xid errors by scanning dmesg and querying the NVIDIA Management Library (NVML) -- see Xid messages.
accelerator-nvidia-error-xid-sxid: Tracks NVIDIA GPU Xid and SXid errors by scanning dmesg and querying the NVIDIA Management Library (NVML) -- see Xid messages.
accelerator-nvidia-fabric-manager: Tracks the fabric manager version and whether it is active.
accelerator-nvidia-gsp-firmware: Tracks the GSP firmware mode.
accelerator-nvidia-infiniband: Monitors the InfiniBand status of the system. Optional, enabled if the host has NVIDIA GPUs.
accelerator-nvidia-info: Serves relatively static information about the NVIDIA accelerators (e.g., GPU product names).
accelerator-nvidia-memory: Monitors the NVIDIA per-GPU memory usage.
accelerator-nvidia-gpm: Monitors the NVIDIA per-GPU GPM metrics.
accelerator-nvidia-nvlink: Monitors the NVIDIA per-GPU NVLink devices.
accelerator-nvidia-peermem: Monitors the peermem module status. Optional, enabled if the host has NVIDIA GPUs.
accelerator-nvidia-persistence-mode: Tracks the NVIDIA persistence mode.
accelerator-nvidia-nccl: Monitors the NCCL (NVIDIA Collective Communications Library) status. Optional, enabled if the host has NVIDIA GPUs.
accelerator-nvidia-power: Tracks the NVIDIA per-GPU power usage.
accelerator-nvidia-processes: Tracks the NVIDIA per-GPU processes.
accelerator-nvidia-remapped-rows: Tracks the NVIDIA per-GPU remapped rows (which indicate whether the GPU needs a reset).
accelerator-nvidia-temperature: Tracks the NVIDIA per-GPU temperatures.
accelerator-nvidia-utilization: Tracks the NVIDIA per-GPU utilization.
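
As a rough illustration of what "scanning dmesg" for Xid errors can look like, the sketch below pulls the device address and Xid code out of kernel log lines. It assumes the common driver log shape `NVRM: Xid (PCI:<address>): <code>, <details>`. The names `XID_PATTERN` and `find_xid_events` and the sample lines are hypothetical and are not taken from this project's implementation.

```python
import re

# Assumption: the NVIDIA driver reports Xid events in the kernel log as
# "NVRM: Xid (PCI:<address>): <code>, <details>".
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(dmesg_lines):
    """Return (device, code) pairs for every line that looks like an Xid report."""
    events = []
    for line in dmesg_lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events


# Illustrative log lines, not captured from a real host.
sample = [
    "[  100.000000] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.",
    "[  101.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]
print(find_xid_events(sample))
```

A real component would read the lines from the kernel log rather than a list, and would likely cross-check against NVML as the descriptions above mention.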

General Hardware components

cpu: Tracks the combined usage of all CPUs (not per-CPU).
disk: Tracks the disk usage of all the mount points specified in the configuration.
memory: Tracks the memory usage of the host.
network-latency: Tracks global network connectivity statistics.
power-supply: Tracks the power supply/usage on the host.
pci: Tracks the PCI devices and their Access Control Services (ACS) status.
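
To make the disk component's description concrete, here is a minimal sketch of reporting usage per configured mount point using Python's standard `shutil.disk_usage`. The function name `disk_usage_percent` and the idea of passing mount points as a plain list are assumptions for illustration, not this project's configuration format.

```python
import shutil


def disk_usage_percent(mount_points):
    """Map each mount point to the percentage of its capacity in use."""
    report = {}
    for mount_point in mount_points:
        usage = shutil.disk_usage(mount_point)  # named tuple: total, used, free (bytes)
        if usage.total:
            report[mount_point] = 100.0 * usage.used / usage.total
        else:
            report[mount_point] = 0.0
    return report


# Hypothetical configuration: check only the current directory's filesystem.
print(disk_usage_percent(["."]))
```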

System components

info: Provides static information about the host (e.g., labels, IDs).
os: Queries the host OS information (e.g., kernel version).
systemd: Tracks the systemd state and unit files.
dmesg: Scans and watches dmesg output for errors, as specified in the configuration (e.g., regex matches for NVIDIA GPU errors).
file-descriptor: Tracks the number of file descriptors used on the host.
kernel-module: Tracks the kernel modules loaded on the host.
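
One way a host-wide file descriptor count can be obtained on Linux is from `/proc/sys/fs/file-nr`, which holds three numbers: allocated file handles, allocated-but-unused handles, and the system maximum. The sketch below parses that format from a string. Whether this project reads that file is an assumption, and `parse_file_nr` is a hypothetical helper.

```python
def parse_file_nr(text):
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {
        "allocated": allocated,
        "unused": unused,
        "max": maximum,
        "in_use": allocated - unused,
    }


# Illustrative content; on a Linux host this would come from
# open("/proc/sys/fs/file-nr").read().
print(parse_file_nr("9376\t0\t1048576\n"))
```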

Misc. components

containerd-pod: Tracks the current pods from the containerd CRI.
k8s-pod: Tracks the current pods from the kubelet read-only port.
docker-container: Tracks the current containers from the Docker runtime.
tailscale: Tracks the Tailscale state (e.g., version) if available.
file: Returns healthy if and only if all the specified files exist.
library: Returns healthy if and only if all the specified libraries exist.
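
The "healthy if and only if all the specified files exist" rule can be sketched as below. The function name `check_files` and the shape of its return value are hypothetical. Note one design consequence: under this rule an empty list of files is trivially healthy.

```python
import os
import tempfile


def check_files(paths):
    """Report healthy only when every path exists; list the ones that do not."""
    missing = [path for path in paths if not os.path.exists(path)]
    return {"healthy": not missing, "missing": missing}


# Demonstration with a temporary file that exists and a sibling path that does not.
with tempfile.NamedTemporaryFile() as present:
    print(check_files([present.name]))
    print(check_files([present.name, present.name + ".missing"]))
```

Returning the missing paths alongside the boolean makes an unhealthy result easier to act on than a bare true/false.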
