accelerator-nvidia-bad-envs
: Tracks any bad environment variables that are globally set for the NVIDIA GPUs.
accelerator-nvidia-hw-slowdown
: Monitors hardware slowdown clock events on all NVIDIA GPUs.
accelerator-nvidia-clock-speed
: Tracks the per-GPU clock speed.
accelerator-nvidia-ecc
: Tracks the NVIDIA per-GPU ECC errors and other ECC-related information.
accelerator-nvidia-error
: Tracks NVIDIA GPU errors in real time via SMI queries -- such errors likely require host restarts.
accelerator-nvidia-error-sxid
: Tracks NVIDIA GPU SXid errors by scanning dmesg -- see fabric manager documentation.
accelerator-nvidia-error-xid
: Tracks NVIDIA GPU Xid errors by scanning dmesg and using the NVIDIA Management Library (NVML) -- see Xid messages.
accelerator-nvidia-error-xid-sxid
: Tracks NVIDIA GPU Xid and SXid errors by scanning dmesg and using the NVIDIA Management Library (NVML) -- see Xid messages.
accelerator-nvidia-fabric-manager
: Tracks the fabric manager version and whether it is active.
accelerator-nvidia-gsp-firmware
: Tracks the GSP firmware mode.
accelerator-nvidia-infiniband
: Monitors the InfiniBand status of the system. Optional, enabled if the host has NVIDIA GPUs.
accelerator-nvidia-info
: Serves relatively static information about the NVIDIA accelerators (e.g., GPU product names).
accelerator-nvidia-memory
: Monitors the NVIDIA per-GPU memory usage.
accelerator-nvidia-gpm
: Monitors the NVIDIA per-GPU GPM metrics.
accelerator-nvidia-nvlink
: Monitors the NVIDIA per-GPU NVLink devices.
accelerator-nvidia-peermem
: Monitors the peermem module status. Optional, enabled if the host has NVIDIA GPUs.
accelerator-nvidia-persistence-mode
: Tracks the NVIDIA persistence mode.
accelerator-nvidia-nccl
: Monitors the NCCL (NVIDIA Collective Communications Library) status. Optional, enabled if the host has NVIDIA GPUs.
accelerator-nvidia-power
: Tracks the NVIDIA per-GPU power usage.
accelerator-nvidia-processes
: Tracks the NVIDIA per-GPU processes.
accelerator-nvidia-remapped-rows
: Tracks the NVIDIA per-GPU remapped rows (which indicate whether the GPU needs to be reset).
accelerator-nvidia-temperature
: Tracks the NVIDIA per-GPU temperatures.
accelerator-nvidia-utilization
: Tracks the NVIDIA per-GPU utilization.
cpu
: Tracks the combined usage of all CPUs (not per-CPU).
disk
: Tracks the disk usage of all the mount points specified in the configuration.
memory
: Tracks the memory usage of the host.
network-latency
: Tracks global network connectivity statistics.
power-supply
: Tracks the power supply/usage on the host.
pci
: Tracks the PCI devices and their Access Control Services (ACS) status.
info
: Provides static information about the host (e.g., labels, IDs).
os
: Queries the host OS information (e.g., kernel version).
systemd
: Tracks the systemd state and unit files.
dmesg
: Scans and watches dmesg output for errors, as specified in the configuration (e.g., regex matching of NVIDIA GPU errors).
file-descriptor
: Tracks the number of file descriptors used on the host.
kernel-module
: Tracks the kernel modules loaded on the host.
containerd-pod
: Tracks the current pods from the containerd CRI.
k8s-pod
: Tracks the current pods from the kubelet read-only port.
docker-container
: Tracks the current containers from the docker runtime.
tailscale
: Tracks the tailscale state (e.g., version) if available.
file
: Returns healthy if and only if all the specified files exist.
library
: Returns healthy if and only if all the specified libraries exist.
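
The "healthy if and only if all the specified files exist" rule used by the `file` component can be illustrated with a minimal sketch. This is not the component's actual implementation; the function name `check_files` and its return shape are hypothetical and chosen only for illustration.

```python
import os
from typing import Iterable, List, Tuple


def check_files(paths: Iterable[str]) -> Tuple[bool, List[str]]:
    """Hypothetical helper: report healthy only when every path exists.

    Returns a (healthy, missing) pair, where `missing` lists the paths
    that were not found, in the order they were given.
    """
    missing = [p for p in paths if not os.path.exists(p)]
    # "If and only if": a single missing path makes the check unhealthy.
    return (len(missing) == 0, missing)
```

Under this reading, an empty list of paths counts as healthy, since there is nothing that could be missing. Returning the missing paths alongside the boolean is a design choice of the sketch, made so that a caller can report which file caused the unhealthy state.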