Hi,
first of all, thank you for the great work on check_redfish! We're currently using it to monitor around 2,500 hosts from various vendors in our datacenter.
However, we've encountered a problem with how exit codes are handled when checking multiple components (e.g. fan, memory, power, proc, storage, temp) in a single Icinga service. Currently, if any component returns an UNKNOWN state, the overall exit code of the plugin is 3 (UNKNOWN), even if another component reports a CRITICAL issue. Below is an example from a DELL PowerEdge R620 [1].
In our setup, we prioritize alerts in the following order:
CRITICAL (2) > WARNING (1) > UNKNOWN (3) > OK (0)
Our datacenter operations team typically ignores UNKNOWN alerts, whereas CRITICAL ones are acted upon immediately. Because of this behavior, critical hardware failures can go unnoticed if masked by an unrelated UNKNOWN state in the same check.
We previously used separate service checks for each component, but with ~100,000 total checks, this approach led to performance issues on our Icinga nodes.
Is there a way to change this behavior so that CRITICAL (and possibly even WARNING) states are prioritized over UNKNOWN when multiple components are checked at once?
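To make the request concrete, here is a minimal sketch of the aggregation we have in mind. It is purely illustrative and not check_redfish's actual code: the names `SEVERITY_RANK` and `worst_state` are hypothetical, and returning UNKNOWN for an empty result list is our own assumption.

```python
# Hypothetical sketch of the requested aggregation; NOT check_redfish's code.

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Priority used by our operations team: CRITICAL > WARNING > UNKNOWN > OK.
# A higher rank wins when several component states are combined.
SEVERITY_RANK = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def worst_state(states):
    """Return the exit code with the highest priority among `states`.

    Assumption: if no component reported anything, UNKNOWN is returned.
    """
    return max(states, key=SEVERITY_RANK.__getitem__, default=UNKNOWN)


if __name__ == "__main__":
    # Component states from the R620 example in [1]:
    # storage UNKNOWN, power CRITICAL (twice), everything else OK.
    component_states = [UNKNOWN, CRITICAL, CRITICAL, OK, OK, OK, OK]
    print(worst_state(component_states))  # prints 2 (CRITICAL)
```

Simply taking the numerically highest exit code would yield 3 (UNKNOWN) for the same input, which matches the behavior we see today.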
We can't use --ignore_unavailable_resources, as we still need visibility into any issues that would lead to an UNKNOWN state.
Thanks in advance!
Best regards,
Alex
[1]
~ check_redfish.py --fan --memory --power --proc --storage --temp --retries 3 --sessionfiledir /opt/icinga2/redfish-sessions/ --timeout 60 --host <...> --username <...> --password <...>
[UNKNOWN]: No storage controller and disk drive data found in system
[CRITICAL]: Power supply 2 (PWR SPLY,750WP,RDNT,FLX) status is: CRITICAL
[CRITICAL]: Power redundancy 1 status is: Enabled
[OK]: All processors (2) are in good condition
[OK]: All memory modules (Total 128GB) are in good condition
[OK]: All temp sensors (4) are in good condition
[OK]: All fans (14) are in good condition and fan redundancy status is: Enabled|'voltage_CPU1_VCORE_PG'=1.0,;;....
~ echo $?
3