Open
Labels
chore, control-plane
Description
Alert evaluation occasionally hits the Postgres statement timeout, which causes all alert evaluation to fail until the next attempt and makes our alerting less reliable. We've also seen cases where sending alert notification emails fails and users miss important notifications. We'd like to make alerting more robust and easier to monitor. The plan is:
- Add a new `AlertNotifications` automation task, and a corresponding executor in `agent` for sending alert emails for a specific alert instance. A task of this type will get created whenever an alert starts firing. The task should stick around while the alert is active. After the alert resolves, this task will send the final resolution email and then complete, deleting itself by returning `Action::Done` from the `Outcome`.
- Alter the `alert_history` table to add a nullable `task_id flowid` column. This column should be populated with a newly generated `flowid` when a new row is inserted.
- Alter the insert trigger on `alert_history` to create an `AlertNotifications` task for each new alert, with a `task_id` that's the same as the `alert_history.task_id`. The newly created task should be immediately queued to run.
- Alter the update trigger on `alert_history` to send a final `Resolved` message to the `AlertNotifications` task.
- Add a new alert evaluation automation task for tenant-related alerts (`free_trial`, `free_trial_ending`, etc.), which should periodically query the `tenant_alerts` view and insert/update `alert_history`.
- Add a new alert evaluation automation task for `data_movement_stalled` alerts, which should periodically query the corresponding alert view and insert/update `alert_history`.
- Update the live spec controllers automation executor to insert/update `alert_history` in response to changes in the `alerts` sub-status, after each controller run.
- Remove the alerts cloudrun function.
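As a rough illustration of the `AlertNotifications` lifecycle described in the plan, here is a minimal, self-contained sketch. Everything in it is hypothetical: `Message`, `Action`, `Email`, `NotificationState`, and `handle` only borrow names from the plan and do not reflect the real automations API or executor trait. The duplicate-delivery guards are an added assumption, not something this issue specifies.

```rust
// Hypothetical model of the notification task lifecycle.
// These types are illustrative stand-ins, NOT the real automations API.

/// Messages the task might receive (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Message {
    /// The alert started firing (task created by the insert trigger).
    Fired,
    /// The alert resolved (sent by the update trigger).
    Resolved,
}

/// What the task asks the runtime to do next (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Keep the task around while the alert is still active.
    Suspend,
    /// The task is finished and should be deleted.
    Done,
}

/// The kinds of email this sketch pretends to send.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Email {
    Fired,
    Resolved,
}

/// Per-task state, used here so a redelivered message does not
/// produce a second email (an assumption of this sketch).
#[derive(Debug, Default)]
pub struct NotificationState {
    pub fired_sent: bool,
    pub resolved_sent: bool,
}

/// Handle one message: record any email to send in `outbox` and
/// return the next action for the task.
pub fn handle(
    state: &mut NotificationState,
    msg: Message,
    outbox: &mut Vec<Email>,
) -> Action {
    match msg {
        Message::Fired => {
            if !state.fired_sent {
                outbox.push(Email::Fired);
                state.fired_sent = true;
            }
            // The alert is still active, so the task sticks around.
            Action::Suspend
        }
        Message::Resolved => {
            if !state.resolved_sent {
                outbox.push(Email::Resolved);
                state.resolved_sent = true;
            }
            // Final resolution email is queued; the task can delete itself.
            Action::Done
        }
    }
}
```

In this sketch a `Fired` message leaves the task suspended, and only `Resolved` yields `Action::Done`, which mirrors the "stick around while the alert is active" behavior in the plan. A real executor would also need to handle email send failures and retries, which this model leaves out.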