Incidents and Outages

An incident is a service outage, degraded service, or a security concern that needs immediate attention.

Setting Intention

An incident is a stressful event. This is why we conduct a collaborative incident response: our goal in DevOps culture is to reduce stress and increase trust. When our system fails, the highest-value use of our time is to understand why it broke and to analyze the failure so the organization can learn from what just happened.

Take some time to get into a learning frame of mind. If an outage is fixed, but no one learns anything about why the outage occurred or how to prevent it in the future, this will increase the team's stress load, because there is a failure condition we haven't accounted for.

If, on the other hand, it takes a little longer to fix the issue, but we all leave with a good understanding of the systems that led to this failure state, and some ideas about how to prevent this outcome in the future, this will reduce the team's stress load. This is by far the better option long term. So... take a few deep breaths and get into a learning frame of mind!

Process

Fill in the incident details in the text below, post in #incident_reports on Slack, and then follow the notes document.

@here We have the following incident: [Service] is [down|degraded|insecure].

Notes document

Zoom link is the first bookmark on this channel.

Our goal in response is to learn about the causes, fix the issue, reflect on both, and take steps toward never having to respond to this particular issue again.

Everyone is encouraged to come, even if this system is not in your area of expertise.
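
If you want to fill in the first line of the announcement above without typos, a small helper like the one below can do it. This is only a sketch: the function name `format_incident_announcement` and the `VALID_STATES` tuple are hypothetical and not part of any existing tooling, and it only builds the text, so posting to #incident_reports is still done by hand.

```python
# Hypothetical helper: builds the first line of the incident announcement.
# The three states mirror the [down|degraded|insecure] choices in the template.
VALID_STATES = ("down", "degraded", "insecure")


def format_incident_announcement(service: str, state: str) -> str:
    """Return the announcement line for the given service and state."""
    service = service.strip()
    if not service:
        raise ValueError("service name must not be empty")
    if state not in VALID_STATES:
        raise ValueError(f"state must be one of {VALID_STATES}, got {state!r}")
    return f"@here We have the following incident: {service} is {state}."


if __name__ == "__main__":
    # "Billing API" is a made-up service name used only for illustration.
    print(format_incident_announcement("Billing API", "degraded"))
```

Restricting the state to the three values from the template keeps announcements consistent and easy to search for later.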
