Replace alerts cloudrun function with agent automation tasks #2506

@psFried

Alert evaluation is occasionally hitting the postgres statement timeout, causing all alert evaluation to fail until the next attempt. This makes our alerting less reliable. We've also seen cases where sending alert notification emails fails, and users miss important notifications. So we'd like to make alerting more robust, and easier to monitor. The plan is:

  • Add a new AlertNotifications automation task, and a corresponding executor in agent for sending alert emails for a specific alert instance. A task of this type will get created whenever an alert starts firing. The task should stick around while the alert is active. After the alert resolves, this task will send the final resolution email and then complete, deleting itself by returning Action::Done from the Outcome.
  • Alter the alert_history table to add a nullable task_id flowid column. This column should be populated with a newly generated flowid when a new row is inserted.
  • Alter the insert trigger on alert_history to create an AlertNotifications task for each new alert, with a task_id that's the same as the alert_history.task_id. The newly created task should be immediately queued to run.
  • Alter the update trigger on alert_history to send a final Resolved message to the corresponding AlertNotifications task when the alert resolves.
  • Add a new alert evaluation automation task for tenant-related alerts (free_trial, free_trial_ending, etc), which should periodically query the tenant_alerts view and insert/update alert_history
  • Add a new alert evaluation automation task for data_movement_stalled alerts, which should periodically query the corresponding alert view and insert/update alert_history
  • Update the live spec controllers automation executor to insert/update alert_history in response to changes in the alerts sub-status, after each controller run.
  • Remove the alerts cloudrun function
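
The intended lifecycle of an AlertNotifications task could be sketched roughly as below. This is only an illustration under stated assumptions: apart from `Action::Done`, every name here (`Action::Suspend`, `Message`, `Email`, `NotifierState`, `poll`) is hypothetical and not taken from the agent crate, and real email delivery is replaced by returning the list of emails the run would send.

```rust
// Hypothetical sketch: apart from `Action::Done`, none of these names come
// from the issue or from the agent crate.

/// What the executor asks the automation framework to do after a run.
#[derive(Debug, PartialEq)]
pub enum Action {
    /// Keep the task around and wait for the next message (assumed variant).
    Suspend,
    /// The task is finished and should be deleted.
    Done,
}

/// Messages delivered to the task's inbox (assumed shape).
#[derive(Debug, PartialEq)]
pub enum Message {
    /// Sent by the alert_history update trigger when the alert resolves.
    Resolved,
}

/// Emails the task decides to send during one run.
#[derive(Debug, PartialEq)]
pub enum Email {
    Fired,
    Resolved,
}

/// State persisted between runs, so a retried run does not resend the
/// firing email.
#[derive(Debug, Default)]
pub struct NotifierState {
    pub fired_sent: bool,
}

/// One run of the task: returns the emails to send and the next action.
pub fn poll(state: &mut NotifierState, inbox: &[Message]) -> (Vec<Email>, Action) {
    let mut emails = Vec::new();

    // The task is queued immediately on creation, so the first run sends
    // the firing notification.
    if !state.fired_sent {
        emails.push(Email::Fired);
        state.fired_sent = true;
    }

    // A Resolved message triggers the final email, after which the task
    // completes and is deleted.
    if inbox.contains(&Message::Resolved) {
        emails.push(Email::Resolved);
        return (emails, Action::Done);
    }

    // Otherwise the task sticks around while the alert is active.
    (emails, Action::Suspend)
}
```

In this sketch, an alert that resolves before the task's first run yields both emails in a single run. Whether that is the desired behavior is left open here.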

Metadata

Labels

chore: Internal cleanups, refactoring, or improvements
control-plane
