@psFried psFried commented Dec 30, 2025

Description:

Refactors alerts, and makes them more reliable.

Our previous alerting system suffered from a few different reliability issues:

  • Alert evaluations would sometimes run into postgres statement timeouts, meaning no alerts would be fired or resolved until the next successful evaluation 10 minutes later
  • Alert notification requests would sometimes fail, meaning no notification was sent for some alerts and resolutions. This was especially common because Resend has a 9 req/s rate limit, so we'd generally fail at least one notification whenever more than 9 alerts fired or resolved at the same time (which happened often, since we evaluated all alerts at the same time).

The new design fixes those issues, and adds a few other improvements. The basic idea is to use the automations crate to manage a few new background tasks for evaluating alerts and sending any notifications.

  • A new AlertNotifications task gets created for every alert, as part of adding it to alert_history. This task is responsible for sending all of the notifications pertaining to a single alert. After the alert has resolved and all resolution notifications have been sent, the task will delete itself.
  • Two new AlertEvaluator tasks get created as part of the migration. One evaluates data_movement_stalled alerts, and the other evaluates all of the tenant-related alerts (free_trial, missing_payment_method, etc). They each just add/update alert_history.
  • Controller runs now return a set of AlertActions, which will fire and resolve alerts while committing the controller outcome.
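To make the controller piece more concrete, here is a minimal in-memory sketch of applying a set of alert actions. The `AlertAction` variants and fields, `AlertRow`, `OpenAlerts`, and `apply_actions` are all hypothetical names for illustration, not the agent's actual types, and the sketch assumes that firing an already-open alert is a no-op. The real implementation applies these changes to `alert_history` while committing the controller outcome.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the actions returned by a controller run.
#[derive(Debug, PartialEq)]
enum AlertAction {
    Fire { catalog_name: String, alert_type: String, fired_at: u64 },
    Resolve { catalog_name: String, alert_type: String, resolved_at: u64 },
}

/// Hypothetical in-memory stand-in for a row of `alert_history`.
#[derive(Debug, PartialEq)]
struct AlertRow {
    fired_at: u64,
    resolved_at: Option<u64>,
}

/// Open (unresolved) alerts, keyed by (catalog_name, alert_type).
type OpenAlerts = BTreeMap<(String, String), AlertRow>;

/// Applies every action in order and returns the rows that were resolved.
fn apply_actions(open: &mut OpenAlerts, actions: Vec<AlertAction>) -> Vec<AlertRow> {
    let mut resolved = Vec::new();
    for action in actions {
        match action {
            AlertAction::Fire { catalog_name, alert_type, fired_at } => {
                // Assumption: an alert that is already open keeps its original row.
                open.entry((catalog_name, alert_type))
                    .or_insert(AlertRow { fired_at, resolved_at: None });
            }
            AlertAction::Resolve { catalog_name, alert_type, resolved_at } => {
                // Resolving an alert that is not open does nothing.
                if let Some(mut row) = open.remove(&(catalog_name, alert_type)) {
                    row.resolved_at = Some(resolved_at);
                    resolved.push(row);
                }
            }
        }
    }
    resolved
}
```

Returning actions as plain data, rather than having controllers write alerts directly, is what lets them be applied together with the rest of the controller outcome.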

The alert_history table structure is unchanged except for the addition of an id column. The id is essentially a surrogate key, though the original (catalog_name, alert_type, fired_at) primary key was left in place. The shape of the arguments and resolved_arguments JSON remains as it was, with the resolved recipients being embedded in it. I have regrets about the recipients being embedded like that, but I'm not sure that it's worth changing at this point, and I thought it would help to keep the transition as simple as possible.

Deployment plan:

  • Run the migration, which will disable the old alerting system
  • Deploy the new agent version, which starts the new alerting system

Alerts that were fired by the legacy system can be resolved by the new system. The migration creates a notification task for each open alert, and sets the state to indicate that the "fired" notifications have already been sent. When the alert resolves, the new notification system will send the emails.
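As a rough illustration of how a migrated alert differs from a new one, the sketch below models the notification task state as two flags. `NotificationState`, `Step`, and their methods are hypothetical names for illustration, and the real task state is likely richer than this. The sketch also assumes that fired notifications are always sent before resolution notifications.

```rust
/// Hypothetical per-alert notification task state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NotificationState {
    fired_sent: bool,
    resolved_sent: bool,
}

/// What the task should do the next time it runs.
#[derive(Debug, PartialEq, Eq)]
enum Step {
    SendFired,
    SendResolved,
    /// Nothing to do until the alert resolves.
    Wait,
    /// All notifications were sent, so the task can delete itself.
    Done,
}

impl NotificationState {
    /// State for an alert fired by the new system.
    fn new_alert() -> Self {
        Self { fired_sent: false, resolved_sent: false }
    }

    /// State the migration would create for an alert that the legacy
    /// system already sent "fired" notifications for.
    fn migrated_open_alert() -> Self {
        Self { fired_sent: true, resolved_sent: false }
    }

    fn next_step(&self, alert_resolved: bool) -> Step {
        if !self.fired_sent {
            Step::SendFired
        } else if !alert_resolved {
            Step::Wait
        } else if !self.resolved_sent {
            Step::SendResolved
        } else {
            Step::Done
        }
    }
}
```

With this shape, a migrated alert never re-sends the "fired" email: it waits until the alert resolves and then goes straight to `SendResolved`.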

If we need to roll back, we can do so without much risk by deploying the previous agent version and re-enabling the evaluation cron job and alert_history triggers.

Adds a rust crate for rendering notification messages using handlebars
templates. This will be used by the new alert system for rendering alert
emails.

The previous alerts edge function used the `mjml-browser` library
to style the base alert templates so that they match the visual style
of the application UI. This removes that in favor of plain HTML, which
was generated by an LLM from the original mjml template. The resulting
emails are pretty similar to what we had before, but not quite
identical. This was considered a pretty fair tradeoff, since otherwise
we'd need a JS runtime in order to render the mjml.

Adds a database migration for moving to the new alerts system.

This migration disables the previous alerting system by disabling the
evaluation cron job, and disabling the triggers on `alert_history`.

The new alerts system has an automations task for each alert, which gets
created when the alert is fired, and which cleans itself up
automatically after resolution emails are sent.

The migration adds a new `id` column to `alert_history`, in order to
associate each alert with the notification automation tasks. The
`task_id` of the notifications task is the same as the `id` of the
alert. Technically, this column could have been named
`notification_task_id`. But it also happens to function as a surrogate
id, since there is at most 1 notification task per alert, and the
notification tasks need to be able to query alerts by this id. So I'm
kinda just leaning into that, and calling it `id`. This doesn't replace
the unique index on `(catalog_name, alert_type, fired_at)` (though we
might one day revisit that index).
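A small in-memory sketch of the two keys may help here. `Alert`, `AlertHistory`, `insert`, and `for_task` are hypothetical names, not the actual schema or agent code. The point is only that the `id` gives a notification task a direct lookup, while the `(catalog_name, alert_type, fired_at)` uniqueness constraint stays in force.

```rust
use std::collections::HashMap;

type Id = u64;

/// Hypothetical in-memory stand-in for a row of `alert_history`.
#[derive(Debug)]
struct Alert {
    id: Id,
    catalog_name: String,
    alert_type: String,
    fired_at: u64,
}

/// Hypothetical table with both keys: the surrogate `id`, and the
/// original `(catalog_name, alert_type, fired_at)` key.
#[derive(Debug, Default)]
struct AlertHistory {
    by_id: HashMap<Id, Alert>,
    by_natural_key: HashMap<(String, String, u64), Id>,
}

impl AlertHistory {
    /// Rejects rows that would violate either uniqueness constraint.
    fn insert(&mut self, alert: Alert) -> Result<(), String> {
        let key = (alert.catalog_name.clone(), alert.alert_type.clone(), alert.fired_at);
        if self.by_natural_key.contains_key(&key) || self.by_id.contains_key(&alert.id) {
            return Err(format!("duplicate alert: {:?}", key));
        }
        self.by_natural_key.insert(key, alert.id);
        self.by_id.insert(alert.id, alert);
        Ok(())
    }

    /// The notification task's `task_id` equals the alert's `id`,
    /// so the task can fetch its alert with a single lookup.
    fn for_task(&self, task_id: Id) -> Option<&Alert> {
        self.by_id.get(&task_id)
    }
}
```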

control-plane: complete alerts migration

Disables the old alerting system and enables the new one, completing the
cutover.

Adds end-to-end alerting to the agent. Previously, alerting was handled
by a postgres cron job, which would evaluate alert views, add/update
`alert_history`, and send email notifications via the alerts edge
function. This adds all of that functionality to the agent, along with
a number of improvements.

The new system is able to tolerate faults in both sending notifications
and in alert evaluation, so we should no longer have problems with
missing alert emails. And any alerts that get fired by controllers will
no longer suffer from a delay due to the cron-based evaluation.
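One way to picture the fault tolerance for notifications is a queue where failed sends stay pending and get retried on a later run. This is only a sketch under that assumption: `Notification` and `send_pending` are hypothetical names, and the real task tracks its state through the automations crate rather than an in-memory queue.

```rust
use std::collections::VecDeque;

/// Hypothetical outbound email.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Notification {
    recipient: String,
    subject: String,
}

/// Attempts each pending notification exactly once and returns how many
/// were sent. Failed notifications are pushed back onto the queue, so a
/// later run of the task retries them instead of dropping them.
fn send_pending<F>(pending: &mut VecDeque<Notification>, mut send: F) -> usize
where
    F: FnMut(&Notification) -> Result<(), String>,
{
    let mut sent = 0;
    // The range is computed once, so re-queued failures are not retried
    // again within the same run.
    for _ in 0..pending.len() {
        let Some(notification) = pending.pop_front() else { break };
        match send(&notification) {
            Ok(()) => sent += 1,
            Err(_) => pending.push_back(notification),
        }
    }
    sent
}
```

For example, if the sender starts returning rate-limit errors after 9 requests, the remaining notifications simply stay queued and go out on the next run.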

The `agent::alerts` module documentation has an overview of how the main
pieces fit together, so I won't repeat that here.

During development and testing, I stumbled across some problems that
were ultimately caused by cloning the `models::IdGenerator`, causing
duplicate ids to be generated. So I removed the `Clone` impl from
`IdGenerator`, so it's no longer a footgun.
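The footgun is easy to reproduce with any stateful generator. The sketch below uses a hypothetical `SeqIdGenerator`, which is not the real `models::IdGenerator` and makes no claims about how that type builds its ids. It only shows that copying generator state makes both copies hand out the same ids.

```rust
/// Hypothetical id generator for illustration only.
/// It deliberately does not derive `Clone`.
#[derive(Debug)]
struct SeqIdGenerator {
    shard: u16,
    counter: u64,
}

impl SeqIdGenerator {
    fn new(shard: u16) -> Self {
        Self { shard, counter: 0 }
    }

    /// Returns an id combining the next counter value with the shard.
    fn next_id(&mut self) -> u64 {
        self.counter += 1;
        (self.counter << 16) | self.shard as u64
    }

    /// Does what a derived `Clone` impl would do: copies the counter,
    /// so the copy continues from the same point as the original.
    fn duplicate_state(&self) -> Self {
        Self { shard: self.shard, counter: self.counter }
    }
}
```

After `duplicate_state`, the next id from each generator is identical. Without a `Clone` impl, callers have to share one generator explicitly (for example by passing `&mut`), which makes that kind of duplication much harder to introduce by accident.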

Adds some basic integration tests for alerts, and improves the
current `assert_alert_firing|resolved` test assertions to include
exercising the alert notification code.

This is no longer used now that we have agent-based alerting.