Incident Analysis

madScalability, 2020-09-23

Your Host Tonight


What We Will Cover

Root Cause Analysis: The Five Whys

The Chernobyl Accident (1986)

Blameless Postmortems

USS John S. McCain Collision (2017)


Root Cause Analysis 

Oh noes!

Putting Out Fires All Day Long?

The Five Whys

Ask like a small child

Do not stop at the first cause

But, why "five" whys?

Keep on asking until everything is clear!

Problem: Root Cause?

In a complex system, failures rarely have a single cause

We should strive to search for every issue
... and then fix all of them

Dive in until you have complete confidence
that you have understood the problem
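
One way to keep track of more than one cause is to record the answers to each "why?" as a tree instead of a single chain. The sketch below is only an illustration: the `Cause` class, the `open_issues` helper and the example outage are all hypothetical and were invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    """One answer to a "why?" question (illustrative structure, not a standard)."""
    description: str
    fixed: bool = False
    # A single "why?" can have several answers, so causes form a tree.
    whys: list[Cause] = field(default_factory=list)

    def ask_why(self, description: str) -> Cause:
        """Record one answer to "why did this happen?" and return it."""
        child = Cause(description)
        self.whys.append(child)
        return child


def open_issues(cause: Cause) -> list[str]:
    """Collect every recorded cause that has not been fixed yet (pre-order)."""
    found = [] if cause.fixed else [cause.description]
    for deeper in cause.whys:
        found.extend(open_issues(deeper))
    return found


# Hypothetical incident, invented for illustration only.
outage = Cause("Checkout page returned errors")
db = outage.ask_why("Database ran out of connections")
db.ask_why("Connection pool size was never reviewed")
leak = db.ask_why("A retry loop leaked connections")
leak.ask_why("No alert on connection count")
outage.ask_why("Failover took 20 minutes")

for issue in open_issues(outage):
    print(issue)
```

Nothing limits the depth to five: keep calling `ask_why` until the picture is clear, and treat the incident as closed only when `open_issues` returns an empty list.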

Root of causes

Ishikawa Diagram

Or fishbone diagram

Finding the root cause of a failure is like finding the root cause of a success

Chernobyl (1986)

The Chernobyl Accident (1986)

A thoroughly studied accident

Hundreds of deaths

Thousands of people displaced

Different points of view

Chernobyl (2019), HBO

Do You Think It Was Realistic?

Overview (No Spoilers!)

Evil bosses

Incompetent operators

Dreadful response by authorities

Politics involved in disaster mitigation

International Atomic Energy Agency

Report IAEA INSAG-7, 1992

A most extraordinary report



Legasov Tapes

Original source

Overview (Spoilers!)

Adequate response
Devoted responders

Inexperienced operators

Inadequate culture

Incompetent hierarchy

Safety Standard

1: Make the reactor maximally reliable

2: Make the operation maximally reliable:
  • trained staff
  • good discipline
  • easy-to-operate equipment

3: Enclose the reactor in a containment

(Tape 4 Side B)

Redundant Systems

At least two protection systems

Based on different principles

Not 211 identical rods!
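
The same idea carries over to software as a loose analogy. The sketch below assumes a hypothetical service guarded by two protections that share no inputs: one watches the outcome of the work, the other watches liveness. All function names, parameters and thresholds are invented for illustration.

```python
def error_rate_tripped(errors: int, requests: int, limit: float = 0.05) -> bool:
    """Principle 1: judge the outcome of the work (share of failed requests)."""
    return requests > 0 and errors / requests > limit


def heartbeat_tripped(last_beat: float, now: float, timeout: float = 10.0) -> bool:
    """Principle 2: judge liveness (seconds since the last heartbeat)."""
    return now - last_beat > timeout


def should_shut_down(errors: int, requests: int, last_beat: float, now: float) -> bool:
    """Trip if either protection fires; neither depends on the other's data."""
    return error_rate_tripped(errors, requests) or heartbeat_tripped(last_beat, now)


# Healthy: 0% errors, heartbeat 5 seconds ago.
print(should_shut_down(errors=0, requests=100, last_beat=95.0, now=100.0))
# Error rate of 10% trips the first protection.
print(should_shut_down(errors=10, requests=100, last_beat=95.0, now=100.0))
# No failed requests, but a 20-second silence trips the second protection.
print(should_shut_down(errors=0, requests=100, last_beat=80.0, now=100.0))
```

Because the two checks rely on different signals, a fault that blinds one of them (for example, no requests arriving at all) does not automatically blind the other.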

Three Views of Chernobyl

HBO: drama, bad communists

IAEA: guilty operators, wrong culture

Valery Legasov: safety standard, redundant systems

Blameless Postmortems 

Do People Blame Each Other?

Human Error!

Human error is not a cause, it is an effect.

If people are punished for being honest about what transpired, employees will soon learn that the personal costs to speaking up far outweigh the personal benefits. Improving the safety of a system is rooted in information.

United States Forest Service

Second Stories

First Stories | Second Stories
Human error is seen as cause of failure | Human error is seen as the effect of systemic vulnerabilities deeper inside the organization
Saying what people should have done is a satisfying way to describe failure | Saying what people should have done doesn’t explain why it made sense for them to do what they did
Telling people to be more careful will make the problem go away | Only by constantly seeking out its vulnerabilities can organizations enhance safety

USS John S. McCain Collision

Review the incident summary

Review the consequences

Who was to blame?

What measures were correct?

Punishing the Operator


What Do Team Members Expect?

Attitude and Expectations

How you act in a crisis will set the tone

Communication is crucial

Explain clearly what you are looking for

Try to get the best out of people

Consider listening to what an incident has to teach you. It's your job to figure out what that is.

John Allspaw, Incidents as we Imagine Them Versus How They Actually Are

Wrong: Putting Out Fires

All Day Long

Right: Incident-Driven Development

Incidents will show the way

Accept it: there will be unknown unknowns

Try to understand and solve issues thoroughly

What can be done to avoid a repetition?

Yes, We Can!

And Now for Some Spam

An Incident-shaped Hole

Scalability Training

In Spanish

Spend that training budget!

Deductible for corporations