Phil/alert evals #2583

psFried · 2025-12-30T21:38:15Z

Description:

Refactors alerts, and makes them more reliable.

Our previous alerting system suffered from a few different reliability issues:

Alert evaluations would sometimes run into postgres statement timeouts, meaning no alerts would be fired or resolved until the next successful evaluation 10 minutes later.
Alert notification requests would sometimes fail, meaning no notification would be sent for some alerts and resolutions. This was especially common because Resend has a 9 req/s rate limit, so we'd generally fail at least one notification whenever more than 9 alerts were firing or resolving at the same time (which happened often, since we evaluated all alerts at the same time).
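To make the rate-limit failure mode concrete, here is a small simulation. It is only a sketch: `RateLimitedSender`, `send_once`, and `send_with_retry` are hypothetical names, no real Resend API is called, and the retry loop is a simplification rather than how the automations crate actually schedules retries.

```rust
/// Hypothetical stand-in for an email provider that accepts at most
/// `limit` requests per one-second window (the description cites 9 req/s).
struct RateLimitedSender {
    limit: u32,
    used: u32,
}

impl RateLimitedSender {
    fn new(limit: u32) -> Self {
        // A zero limit would make the retry loop below spin forever.
        assert!(limit > 0, "limit must be positive");
        Self { limit, used: 0 }
    }

    /// Start a fresh one-second window.
    fn next_window(&mut self) {
        self.used = 0;
    }

    /// Accept the request if the current window still has capacity.
    fn try_send(&mut self) -> bool {
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

/// Old behavior, simplified: every notification is attempted exactly once
/// in the same window. Returns how many notifications were never delivered.
fn send_once(sender: &mut RateLimitedSender, pending: u32) -> u32 {
    (0..pending).filter(|_| !sender.try_send()).count() as u32
}

/// New behavior, simplified: failed notifications stay pending and are
/// retried in a later window. Returns how many windows were needed.
fn send_with_retry(sender: &mut RateLimitedSender, mut pending: u32) -> u32 {
    let mut windows = 0;
    while pending > 0 {
        sender.next_window();
        windows += 1;
        let attempts = pending;
        for _ in 0..attempts {
            if sender.try_send() {
                pending -= 1;
            }
        }
    }
    windows
}
```

With a limit of 9 and 12 simultaneous notifications, `send_once` drops 3 of them, while `send_with_retry` delivers all 12 across 2 windows.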

The new design fixes those issues, and adds a few other improvements. The basic idea is to use the automations crate to manage a few new background tasks for evaluating alerts and sending any notifications.

A new AlertNotifications task gets created for every alert, as part of adding it to alert_history. This task is responsible for sending all of the notifications pertaining to a single alert. After the alert has resolved and all resolution notifications have been sent, the task will delete itself.
Two new AlertEvaluator tasks get created as part of the migration. One evaluates data_movement_stalled alerts, and the other evaluates all of the tenant-related alerts (free_trial, missing_payment_method, etc). They each just add/update alert_history.
Controller runs now return a set of AlertActions, which will fire and resolve alerts while committing the controller outcome.
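The lifecycle of a per-alert notification task can be pictured as a small state machine. This is a sketch under assumptions: the `NotifyState` variants, `Observation`, `step`, and `initial_state` are hypothetical names for illustration, and the actual task state in the agent may be shaped differently. It also assumes the "fired" notifications are always sent before any resolution notifications.

```rust
/// Hypothetical states for the notification task of a single alert.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NotifyState {
    /// The alert fired, but the "fired" notifications are not sent yet.
    FiredPending,
    /// The "fired" notifications were sent; waiting for the alert to resolve.
    AwaitingResolution,
    /// The alert resolved, but the resolution notifications are not sent yet.
    ResolvedPending,
    /// Everything was sent; the task can delete itself.
    Done,
}

/// What the task observed on this run (hypothetical inputs).
struct Observation {
    alert_resolved: bool,
    send_succeeded: bool,
}

/// Advance the task by one run. A failed send leaves the state unchanged,
/// so the same notifications are attempted again on the next run.
fn step(state: NotifyState, obs: &Observation) -> NotifyState {
    use NotifyState::*;
    match state {
        FiredPending if obs.send_succeeded => AwaitingResolution,
        AwaitingResolution if obs.alert_resolved => ResolvedPending,
        ResolvedPending if obs.send_succeeded => Done,
        other => other,
    }
}

/// Alerts opened by the legacy system start with their "fired" notifications
/// already marked as sent, as the deployment plan describes.
fn initial_state(fired_by_legacy_system: bool) -> NotifyState {
    if fired_by_legacy_system {
        NotifyState::AwaitingResolution
    } else {
        NotifyState::FiredPending
    }
}
```

Because a failed send is just "stay in the same state", retries need no special bookkeeping in this model.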

The alert_history table structure is unchanged except for the addition of an id column. The id is essentially a surrogate key, though the original (catalog_name, alert_type, fired_at) primary key was left in place. The shape of the arguments and resolved_arguments JSON remains as it was, with the resolved recipients being embedded in it. I have regrets about the recipients being embedded like that, but I'm not sure that it's worth changing at this point, and I thought it would help to keep the transition as simple as possible.
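As a rough illustration of how the surrogate `id` coexists with the original key, the following in-memory model enforces both. The `AlertRow` and `AlertHistory` types are hypothetical, timestamps are reduced to plain integers, the JSON columns are omitted, and treating `id` as unique is an assumption of this sketch; the real constraints live in the Postgres schema.

```rust
use std::collections::{HashMap, HashSet};

/// (catalog_name, alert_type, fired_at): the original primary key.
type NaturalKey = (String, String, u64);

/// Hypothetical, trimmed-down view of an `alert_history` row.
struct AlertRow {
    id: u64,
    catalog_name: String,
    alert_type: String,
    fired_at: u64,
}

/// In-memory model: rows are addressable by `id`, and the original
/// key must still be unique.
#[derive(Default)]
struct AlertHistory {
    by_id: HashMap<u64, AlertRow>,
    natural_keys: HashSet<NaturalKey>,
}

impl AlertHistory {
    /// Insert a row, rejecting it if either key is already taken.
    fn insert(&mut self, row: AlertRow) -> bool {
        let key = (row.catalog_name.clone(), row.alert_type.clone(), row.fired_at);
        if self.by_id.contains_key(&row.id) || self.natural_keys.contains(&key) {
            return false;
        }
        self.natural_keys.insert(key);
        self.by_id.insert(row.id, row);
        true
    }

    /// Lookup by surrogate id, which is how a notification task would
    /// find its alert in this model.
    fn get(&self, id: u64) -> Option<&AlertRow> {
        self.by_id.get(&id)
    }
}
```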

Deployment plan:

Run the migration, which will disable the old alerting system
Deploy new agent version, which starts the new one

Alerts that were fired by the legacy system can be resolved by the new system. The migration creates a notification task for each open alert, and sets the state to indicate that the "fired" notifications have already been sent. When the alert resolves, the new notification system will send the emails.

If we need to roll back, we can do so without much risk by deploying the previous agent version and re-enabling the evaluation cron job and alert_history triggers.

Adds a Rust crate for rendering notification messages using handlebars templates. This will be used by the new alert system for rendering alert emails. The previous alerts edge function used the `mjml-browser` library to style the base alert templates so that they match the visual style of the application UI. This removes that in favor of plain HTML, which was generated by an LLM from the original mjml template. The resulting emails are pretty similar to what we had before, but not quite identical. This was considered a pretty fair tradeoff, since otherwise we'd need a JS runtime in order to render the mjml.

Adds a database migration for moving to the new alerts system. This migration disables the previous alerting system by disabling the evaluation cron job, and disabling the triggers on `alert_history`. The new alerts system has an automations task for each alert, which gets created when the alert is fired, and which cleans itself up automatically after resolution emails are sent.

The migration adds a new `id` column to `alert_history`, in order to associate each alert with the notification automation tasks. The `task_id` of the notifications task is the same as the `id` of the alert. Technically, this column could have been named `notification_task_id`. But it also happens to function as a surrogate id, since there is at most 1 notification task per alert, and the notification tasks need to be able to query alerts by this id. So I'm kinda just leaning into that, and calling it `id`. This doesn't replace the unique index on `(catalog_name, alert_type, fired_at)` (though we might one day revisit that index).

control-plane: complete alerts migration

Disables the old alerting system and enables the new one, completing the cutover.

Adds end-to-end alerting to the agent. Previously, alerting was handled by a postgres cron job, which would evaluate alert views, add/update `alert_history`, and send email notifications via the alerts edge function. This adds all of that functionality to the agent, along with a number of improvements.

The new system is able to tolerate faults in both sending notifications and in alert evaluation, so we should no longer have problems with missing alert emails. And any alerts that get fired by controllers will no longer suffer from a delay due to the cron-based evaluation. The `agent::alerts` module documentation has an overview of how the main pieces fit together, so I won't repeat that here.

During development and testing, I stumbled across some problems that were ultimately caused by cloning the `models::IdGenerator`, causing duplicate ids to be generated. So I removed the `Clone` impl from `IdGenerator`, so it's no longer a footgun.
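The cloning footgun is easy to reproduce with any generator that keeps its sequence state inline. The sketch below uses a deliberately simplified, hypothetical `CounterIdGen`; it is not the real `models::IdGenerator`, whose internals are not shown here, but it fails in the same way: a clone copies the internal state, so both copies go on to hand out the same ids.

```rust
/// Hypothetical, simplified id generator (NOT the real `models::IdGenerator`).
/// Deriving `Clone` is the footgun being illustrated.
#[derive(Clone)]
struct CounterIdGen {
    next_seq: u64,
}

impl CounterIdGen {
    fn new() -> Self {
        Self { next_seq: 1 }
    }

    /// Return the next id and advance the internal sequence.
    fn next_id(&mut self) -> u64 {
        let id = self.next_seq;
        self.next_seq += 1;
        id
    }
}

/// Generate ids from a generator and from a clone of it.
/// Both vectors come back identical, meaning every id is a duplicate.
fn ids_after_clone() -> (Vec<u64>, Vec<u64>) {
    let mut original = CounterIdGen::new();
    let _first = original.next_id();

    // The clone copies `next_seq`, so it continues from the same point.
    let mut copy = original.clone();

    let from_original = vec![original.next_id(), original.next_id()];
    let from_copy = vec![copy.next_id(), copy.next_id()];
    (from_original, from_copy)
}
```

Dropping the `Clone` impl turns this mistake into a compile error, which forces callers to share a single generator (for example behind a `&mut` borrow) instead of copying it.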

Adds some basic integration tests for alerts, and improves the current `assert_alert_firing|resolved` test assertions to include exercising the alert notification code.

Removes the legacy alerts edge function, which is no longer used now that we have agent-based alerting.

psFried added 5 commits December 30, 2025 20:59

agent: integration tests for alerts

48d2b72

remove legacy alerts edge function

18346e2

psFried force-pushed the phil/alert-evals branch from 08a8629 to 18346e2 Compare December 30, 2025 22:18
