Prioritize CRITICAL (WARNING) over UNKNOWN #174

@trauta

Hi,

First of all, thank you for the great work on check_redfish! We currently use it to monitor around 2,500 hosts from various vendors in our datacenter.

However, we've encountered a problem with how exit codes are handled when checking multiple components (e.g. fan, memory, power, proc, storage, temp) in a single Icinga service. Currently, if any component returns an UNKNOWN state, the overall exit code of the plugin is 3 (UNKNOWN) — even if another component reports a CRITICAL issue. Below is an example from a DELL PowerEdge R620 [1].

In our setup, we prioritize alerts in the following order:

CRITICAL (2) > WARNING (1) > UNKNOWN (3) > OK (0)
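
To illustrate the ordering we have in mind, here is a minimal sketch in Python. The names `SEVERITY_RANK` and `overall_exit_code` are hypothetical and not part of check_redfish; the sketch only shows how per-component exit codes could be combined under this priority, and it assumes that a run with no results at all should count as UNKNOWN.

```python
# Hypothetical helper, not taken from check_redfish: combine per-component
# exit codes using the priority CRITICAL > WARNING > UNKNOWN > OK.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Higher rank wins. UNKNOWN (exit code 3) deliberately ranks below WARNING (1).
SEVERITY_RANK = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def overall_exit_code(component_codes):
    """Return the most severe exit code among the component results."""
    codes = list(component_codes)
    if not codes:
        # Assumption: no component results at all is reported as UNKNOWN.
        return UNKNOWN
    return max(codes, key=SEVERITY_RANK.__getitem__)
```

With the component states from the example in [1] (one UNKNOWN, two CRITICAL, four OK), `overall_exit_code([3, 2, 2, 0, 0, 0, 0])` yields 2 rather than 3.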

Our datacenter operations team typically ignores UNKNOWN alerts, whereas CRITICAL ones are acted upon immediately. Because the plugin currently lets UNKNOWN win, critical hardware failures can go unnoticed when they are masked by an unrelated UNKNOWN state in the same check.

We previously used separate service checks for each component, but with ~100,000 total checks, this approach led to performance issues on our Icinga nodes.

Is there a way to change this behavior so that CRITICAL (and possibly even WARNING) states are prioritized over UNKNOWN when multiple components are checked at once?

We can't use --ignore_unavailable_resources, as we still need visibility into any issues that would lead to an UNKNOWN state.

Thanks in advance!

Best regards,
Alex

[1]

~ check_redfish.py --fan --memory --power --proc --storage --temp --retries 3 --sessionfiledir /opt/icinga2/redfish-sessions/ --timeout 60 --host <...> --username <...> --password <...>
[UNKNOWN]: No storage controller and disk drive data found in system
[CRITICAL]: Power supply 2 (PWR SPLY,750WP,RDNT,FLX) status is: CRITICAL
[CRITICAL]: Power redundancy 1 status is: Enabled
[OK]: All processors (2) are in good condition
[OK]: All memory modules (Total 128GB) are in good condition
[OK]: All temp sensors (4) are in good condition
[OK]: All fans (14) are in good condition and fan redundancy status is: Enabled|'voltage_CPU1_VCORE_PG'=1.0,;;....

~ echo $?
3
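
As a possible stopgap on our side, the exit code could be recomputed from the plugin output. The following is only a sketch: it assumes every status line starts with `[OK]:`, `[WARNING]:`, `[CRITICAL]:` or `[UNKNOWN]:` as in the output above, and `exit_code_from_output` is a hypothetical name, not an existing check_redfish function.

```python
import re

# Assumption: each status line begins with "[STATUS]:" as in the sample output.
STATUS_LINE = re.compile(r"^\[(OK|WARNING|CRITICAL|UNKNOWN)\]:")

EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# Desired priority, most severe first.
PRIORITY = ["CRITICAL", "WARNING", "UNKNOWN", "OK"]


def exit_code_from_output(output):
    """Derive an exit code from the status prefixes found in plugin output."""
    seen = set()
    for line in output.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            seen.add(match.group(1))
    for status in PRIORITY:
        if status in seen:
            return EXIT_CODES[status]
    # No recognizable status line: treat the result as UNKNOWN.
    return EXIT_CODES["UNKNOWN"]


if __name__ == "__main__":
    sample = "\n".join([
        "[UNKNOWN]: No storage controller and disk drive data found in system",
        "[CRITICAL]: Power supply 2 (PWR SPLY,750WP,RDNT,FLX) status is: CRITICAL",
        "[OK]: All processors (2) are in good condition",
    ])
    print(exit_code_from_output(sample))
```

A wrapper like this would still need to pass the original output and performance data through unchanged, so we would much prefer an option in the plugin itself.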
