Feature Request
Daily AI-Powered System Health Report & Risk Detection
Is your feature request related to a problem? Please describe
Yes. HertzBeat provides powerful monitoring and alerting, but inspecting system health every day still takes significant manual effort.
O&M teams face alert fatigue, and many issues remain hidden until they trigger critical alerts. Without a proactive inspection workflow, users must manually review multiple monitors, trend graphs, and alerts to understand system health.
Describe the solution you'd like
Implement an Intelligent Inspection Workflow inside the hertzbeat-ai module that automates daily system health checks using LLMs.
The workflow should:
Automatically scan all monitors and active alerts (e.g., over the last 24 hours)
Collect trend data for abnormal monitors (CPU, memory, latency)
Perform correlation analysis to identify shared root causes
Generate a concise Markdown report summarizing:
  Overall system health status
  Critical risks and anomalies
  Optimization suggestions
Provide a human-in-the-loop confirmation before any automated actions
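To make the correlation step more concrete, here is a minimal sketch of a statistical pre-pass that flags pairs of abnormal series that move together, so only those candidate pairs need to be described to the LLM for root-cause reasoning. This is an illustration, not the proposed hertzbeat-ai implementation: the function names, the series keys, and the 0.8 cutoff are all assumptions, and it expects series of equal length sampled at the same timestamps.

```python
from itertools import combinations
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def correlated_pairs(series: dict[str, list[float]], min_r: float = 0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= min_r.

    min_r is a hypothetical cutoff, not a value taken from HertzBeat.
    """
    pairs = []
    for (name_a, xs), (name_b, ys) in combinations(series.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= min_r:
            pairs.append((name_a, name_b, round(r, 2)))
    return pairs


# Hypothetical abnormal series collected over the inspection window.
abnormal = {
    "db-1.cpu": [10.0, 20.0, 30.0, 40.0],
    "api.latency": [100.0, 200.0, 300.0, 400.0],
    "cache.mem": [5.0, 1.0, 4.0, 2.0],
}
candidates = correlated_pairs(abnormal)
```

Strong correlation only suggests a shared cause, so the pairs are best treated as hints for the LLM rather than conclusions.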
The solution should minimize LLM token usage through:
Funnel filtering (only abnormal monitors)
Statistical summarization (max, avg, trend)
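The two token-saving techniques above can be sketched as follows. This is a minimal illustration under stated assumptions: `MonitorSeries`, the per-metric thresholds, and the half-versus-half trend rule are all hypothetical and are not part of HertzBeat's API. The idea is to drop healthy monitors first, then replace each remaining raw series with three numbers before anything reaches the LLM.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MonitorSeries:
    """Hypothetical container for one monitor's samples, oldest first."""
    name: str
    metric: str
    values: list[float]
    has_active_alert: bool = False


def is_abnormal(m: MonitorSeries, thresholds: dict[str, float]) -> bool:
    """Funnel filter: keep monitors with an active alert or a threshold breach."""
    if not m.values:
        return False
    limit = thresholds.get(m.metric, float("inf"))
    return m.has_active_alert or max(m.values) >= limit


def summarize(values: list[float], tolerance: float = 0.10) -> dict:
    """Collapse a series into max, avg, and a coarse trend label.

    The trend compares the mean of the newer half with the older half;
    a relative change within `tolerance` counts as flat (an assumed rule).
    """
    half = len(values) // 2
    older, newer = values[:half], values[half:]
    baseline = mean(older) if older else mean(newer)
    change = mean(newer) - baseline
    if abs(change) <= tolerance * abs(baseline):
        trend = "flat"
    else:
        trend = "rising" if change > 0 else "falling"
    return {"max": max(values), "avg": round(mean(values), 2), "trend": trend}


def build_summary(monitors: list[MonitorSeries], thresholds: dict[str, float]) -> list[dict]:
    """Compact payload for the LLM prompt: one small dict per abnormal monitor."""
    return [
        {"monitor": m.name, "metric": m.metric, **summarize(m.values)}
        for m in monitors
        if is_abnormal(m, thresholds)
    ]


monitors = [
    MonitorSeries("db-1", "cpu", [40.0, 45.0, 80.0, 95.0]),
    MonitorSeries("web-1", "cpu", [20.0, 22.0, 21.0, 19.0]),
    MonitorSeries("cache-1", "latency", [5.0, 5.0, 5.0, 5.0], has_active_alert=True),
]
payload = build_summary(monitors, {"cpu": 90.0})
```

Here `web-1` never reaches the prompt, and each surviving monitor costs a handful of tokens instead of a full 24-hour series.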
Describe alternatives you've considered
Manual inspection: Users manually review metrics and alerts, which is time-consuming and error-prone.
Static rule-based reports: Predefined rules can generate reports, but they cannot handle complex correlations or unknown issues.
Raw time-series analysis without an LLM: This is computationally heavy and cannot produce human-friendly explanations.
Full agent-based automation: Fully automated actions can be risky; human confirmation is required for safety.
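The confirmation requirement could be enforced with a small gate between the report and any action it proposes. The sketch below is only one possible shape, with hypothetical names (`ProposedAction`, `execute_approved`); the `confirm` callback stands in for whatever UI or chat prompt would ask the operator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """Hypothetical action suggested by the inspection report."""
    description: str
    run: Callable[[], str]


def execute_approved(
    actions: list[ProposedAction],
    confirm: Callable[[ProposedAction], bool],
) -> list[str]:
    """Run only the actions the operator approves; record the rest as skipped."""
    results = []
    for action in actions:
        if confirm(action):
            results.append(action.run())
        else:
            results.append(f"skipped: {action.description}")
    return results


proposed = [
    ProposedAction("raise alert threshold on db-1", lambda: "threshold updated"),
    ProposedAction("restart cache-1", lambda: "cache-1 restarted"),
]
# Stand-in for a real operator prompt: approve everything except restarts.
outcome = execute_approved(proposed, lambda a: "restart" not in a.description)
```

Because nothing runs unless `confirm` returns true, the workflow stays advisory by default.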
Additional context
This feature is aimed at evolving HertzBeat into an AIOps platform.
It builds on the existing hertzbeat-ai module and leverages LLMs for correlation analysis, risk assessment, and reporting.
The design should prioritize token efficiency and user safety (manual confirmation for actions).
The generated report can be exported as Markdown or integrated into the dashboard as a daily summary.
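As a rough idea of the Markdown export, the sketch below renders the three report sections named in the proposed solution. The function name, the headings, and the sample findings are assumptions for illustration; in the proposed workflow the findings would come from the LLM rather than from hard-coded strings.

```python
def render_report(status: str, risks: list[str], suggestions: list[str]) -> str:
    """Render a daily health report as Markdown (hypothetical layout)."""
    lines = [
        "# Daily System Health Report",
        "",
        f"**Overall status:** {status}",
        "",
        "## Critical risks and anomalies",
    ]
    lines += [f"- {risk}" for risk in risks] or ["- None detected"]
    lines += ["", "## Optimization suggestions"]
    lines += [f"- {tip}" for tip in suggestions] or ["- None"]
    return "\n".join(lines) + "\n"


report = render_report(
    status="Degraded",
    risks=["db-1 CPU rising, peak 95%"],
    suggestions=[],
)
```

The same string could be saved as a `.md` file or passed to a dashboard widget that renders Markdown.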