[Feature] Daily AI-Powered System Health Report & Risk Detection #4025

@anusha975

Feature Request

Daily AI-Powered System Health Report & Risk Detection

Is your feature request related to a problem? Please describe

Yes. HertzBeat provides powerful monitoring and alerting, but inspecting overall system health each day still takes significant manual effort.
O&M teams face alert fatigue, and many issues stay hidden until they trigger critical alerts. Without a proactive inspection workflow, users must review multiple monitors, trend graphs, and alerts by hand to understand system health.

Describe the solution you'd like

Implement an Intelligent Inspection Workflow inside the hertzbeat-ai module that automates daily system health checks using LLMs.
The workflow should:

- Automatically scan all monitors and active alerts (e.g., last 24 hours)
- Collect trend data for abnormal monitors (CPU, memory, latency)
- Perform correlation analysis to identify shared root causes
- Generate a concise Markdown report summarizing:
  - Overall system health status
  - Critical risks and anomalies
  - Optimization suggestions
- Provide a human-in-the-loop confirmation before any automated actions
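
To make the scanning and correlation steps more concrete, here is a minimal Python sketch. It is not HertzBeat code: the `Alert` record, its fields, and the `recent_alerts` and `group_by_host` helpers are hypothetical names chosen for illustration. It assumes alerts carry a host and a timestamp, keeps those from the last 24 hours, and groups them by host so that hosts with several alerts can be handed to the LLM as candidates for a shared root cause.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    """Hypothetical alert record; field names are assumptions, not HertzBeat's schema."""
    monitor_id: str
    host: str
    metric: str
    fired_at: datetime


def recent_alerts(alerts, now, window=timedelta(hours=24)):
    """Keep only alerts fired within the look-back window ending at `now`."""
    return [a for a in alerts if now - window <= a.fired_at <= now]


def group_by_host(alerts):
    """Group alerts by host and keep hosts with more than one alert.

    Several alerts on one host are only a hint of a shared root cause;
    the actual correlation analysis would still be left to the LLM.
    """
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.host].append(alert)
    return {host: items for host, items in groups.items() if len(items) > 1}
```

Grouping by host is only one possible pre-grouping key; service name or deployment label would work the same way.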

The solution should minimize token usage through:

- Funnel filtering (send only abnormal monitors to the LLM)
- Statistical summarization (send max, avg, and trend instead of raw time series)
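
The two token-saving ideas above could look roughly like the following Python sketch. The function names `summarize` and `funnel`, the input shapes, and the half-versus-half trend rule are all assumptions made for illustration; the request does not specify how the trend should be computed.

```python
from statistics import mean


def summarize(values):
    """Reduce a time series to max, average, and a coarse trend label."""
    if not values:
        raise ValueError("values must not be empty")
    half = len(values) // 2
    # Assumed trend rule: compare the mean of the second half of the
    # series with the mean of the first half.
    first = mean(values[:half]) if half else values[0]
    second = mean(values[half:])
    delta = second - first
    if delta > 0:
        trend = "rising"
    elif delta < 0:
        trend = "falling"
    else:
        trend = "flat"
    return {"max": max(values), "avg": round(mean(values), 2), "trend": trend}


def funnel(series_by_monitor, abnormal_ids):
    """Summarize only monitors flagged as abnormal; healthy ones are dropped."""
    return {
        monitor_id: summarize(values)
        for monitor_id, values in series_by_monitor.items()
        if monitor_id in abnormal_ids and values
    }
```

With this shape, each abnormal monitor contributes three small fields to the prompt regardless of how many raw data points it has.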

Describe alternatives you've considered

Manual inspection: Users manually review metrics and alerts, which is time-consuming and error-prone.

Static rule-based reports: Predefined rules can generate reports, but they cannot handle complex correlations or unknown issues.

Raw time-series analysis without LLM: Requires heavy computing and cannot provide human-friendly explanations.

Full agent-based automation: Fully automated actions can be risky; human confirmation is required for safety.
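
The human confirmation requirement can be expressed as a small gate in front of any action the workflow proposes. The Python sketch below is an assumption about how such a gate might be shaped: `ProposedAction`, `execute_with_confirmation`, and the `confirm` callback are hypothetical names, and how the operator is actually asked (dashboard button, chat message, CLI prompt) is left open.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProposedAction:
    """Hypothetical action suggested by the inspection workflow."""
    description: str
    run: Callable[[], str]


def execute_with_confirmation(
    action: ProposedAction, confirm: Callable[[str], bool]
) -> Optional[str]:
    """Run `action` only if `confirm` explicitly returns True.

    Any other return value is treated as a rejection, so the action
    is skipped by default.
    """
    if confirm(action.description) is True:
        return action.run()
    return None
```

Treating everything except an explicit `True` as a rejection keeps the default on the safe side if the confirmation step fails or times out.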

Additional context

This feature is aimed at evolving HertzBeat into an AIOps platform.
It builds on the existing hertzbeat-ai module and leverages LLMs for correlation analysis, risk assessment, and reporting.
The design should prioritize token efficiency and user safety (manual confirmation for actions).
The generated report can be exported as Markdown or integrated into the dashboard as a daily summary.
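
As a rough illustration of the Markdown export, the Python sketch below renders the three report sections named in this request. The `render_report` function, its parameters, and the exact headings are assumptions; the real report layout is still to be designed.

```python
def _bullets(items, empty_text):
    """Format items as Markdown bullets, or a single placeholder bullet."""
    return [f"- {item}" for item in items] or [f"- {empty_text}"]


def render_report(status, risks, suggestions):
    """Render overall status, risks, and suggestions as a Markdown string."""
    lines = [
        "# Daily System Health Report",
        "",
        f"**Overall status:** {status}",
        "",
        "## Critical risks and anomalies",
        *_bullets(risks, "None detected"),
        "",
        "## Optimization suggestions",
        *_bullets(suggestions, "None"),
    ]
    return "\n".join(lines) + "\n"
```

The same string could be written to a `.md` file or passed to a dashboard component that renders Markdown.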
