Description:
Refactors alerts, and makes them more reliable.
Our previous alerting system suffered from a few different reliability issues:
The new design fixes those issues, and adds a few other improvements. The basic idea is to use the automations crate to manage a few new background tasks for evaluating alerts and sending any notifications.
An AlertNotifications task gets created for every alert, as part of adding it to alert_history. This task is responsible for sending all of the notifications pertaining to a single alert. After the alert has resolved and all resolution notifications have been sent, the task will delete itself.
AlertEvaluator tasks get created as part of the migration. One evaluates data_movement_stalled alerts, and the other evaluates all of the tenant-related alerts (free_trial, missing_payment_method, etc.). They each just add/update alert_history.
AlertActions, which will fire and resolve alerts while committing the controller outcome.
The alert_history table structure is unchanged except for the addition of an id column. The id is essentially a surrogate key, though the original (catalog_name, alert_type, fired_at) primary key was left in place. The shape of the arguments and resolved_arguments JSON remains as it was, with the resolved recipients being embedded in it. I have regrets about the recipients being embedded like that, but I'm not sure that it's worth changing at this point, and I thought it would help to keep the transition as simple as possible.
Deployment plan:
Alerts that were fired by the legacy system can be resolved by the new system. The migration creates a notification task for each open alert, and sets the state to indicate that the "fired" notifications have already been sent. When the alert resolves, the new notification system will send the emails.
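To make the handoff concrete, here is a minimal sketch of how a per-alert notification task could decide what to do next, including the case of an alert that was fired by the legacy system. This is not the code in this PR: the names NotificationState, NextStep, next_step, new_alert, and migrated_open_alert are all hypothetical, and the real task state may track more than two flags.

```rust
/// Hypothetical per-alert notification state. The field names are
/// illustrative assumptions, not the actual task state from this PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NotificationState {
    /// True once the "fired" notifications have gone out.
    pub fired_sent: bool,
    /// True once the "resolved" notifications have gone out.
    pub resolved_sent: bool,
}

impl NotificationState {
    /// State for an alert that was just added to alert_history.
    pub fn new_alert() -> Self {
        Self { fired_sent: false, resolved_sent: false }
    }

    /// State assumed for an alert that was still open at migration time:
    /// the legacy system already sent the "fired" notifications.
    pub fn migrated_open_alert() -> Self {
        Self { fired_sent: true, resolved_sent: false }
    }
}

/// What the task should do on its next run (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    SendFired,
    WaitForResolution,
    SendResolved,
    DeleteTask,
}

/// Decide the next step from the stored state and whether the alert
/// has resolved.
pub fn next_step(state: &NotificationState, alert_resolved: bool) -> NextStep {
    if !state.fired_sent {
        NextStep::SendFired
    } else if !alert_resolved {
        NextStep::WaitForResolution
    } else if !state.resolved_sent {
        NextStep::SendResolved
    } else {
        // Everything has been sent, so the task can delete itself.
        NextStep::DeleteTask
    }
}
```

Under these assumptions, a migrated alert skips straight to waiting for resolution, so the "fired" emails are never sent a second time, and the resolution emails come from the new system.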
If we need to roll back, we can do so without much risk by deploying the previous agent version and re-enabling the evaluation cron job and alert_history triggers.