Establish convention for capturing system incidents #730

Open
grahamalama opened this issue Oct 31, 2023 · 0 comments

Contributor

grahamalama commented Oct 31, 2023

We are trying to reduce the frequency of incidents that cause customers to ask, "why isn't my bug syncing?". In order to answer the question "are we reducing the frequency?", we need to capture this data consistently.

In an ADR, we should establish a workflow for tracking incidents, including the start time and resolution time. This way, we can capture both the number of incidents and the average time to resolve incidents.
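
As a rough illustration of what such a record could look like, here is a minimal sketch of an incident with a start time and a resolution time, plus the two metrics mentioned above. The `Incident` class, its field names, and the helper functions are hypothetical and only meant to make the idea concrete; the actual convention would be decided in the ADR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""

    kind: str
    started_at: datetime
    resolved_at: Optional[datetime] = None  # None while the incident is ongoing

    @property
    def time_to_resolve(self) -> Optional[timedelta]:
        # Only resolved incidents have a duration.
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.started_at


def incident_count(incidents: List[Incident]) -> int:
    """Number of incidents recorded, resolved or not."""
    return len(incidents)


def mean_time_to_resolve(incidents: List[Incident]) -> Optional[timedelta]:
    """Average resolution time over resolved incidents, or None if there are none."""
    durations = [
        i.time_to_resolve for i in incidents if i.time_to_resolve is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Unresolved incidents count toward the total but are left out of the average, so an ongoing incident does not skew the resolution time.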

A non-exhaustive list of "incidents" might include:

  • webhook queue becomes disabled
  • a workflow or workflows record n partial syncs over some duration of time
  • a workflow or workflows record n total sync failures over some duration of time
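
The "n failures over some duration of time" criteria could be checked with a sliding window. The sketch below is one possible approach, not an existing part of this project: `FailureWindow`, `threshold`, and `window` are hypothetical names, and it assumes events are recorded in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class FailureWindow:
    """Hypothetical sliding-window counter for sync failures."""

    def __init__(self, threshold: int, window: timedelta):
        self.threshold = threshold  # the "n" in "n failures"
        self.window = window  # the "duration of time"
        self._events = deque()

    def record(self, when: datetime) -> bool:
        """Record a failure; return True if the threshold is reached.

        Assumes calls arrive in chronological order.
        """
        self._events.append(when)
        cutoff = when - self.window
        # Drop failures that fell out of the window.
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

A `True` result would mark the start of an incident. How the incident is then considered resolved is left open here, since the issue does not specify it.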