Incident Analysis


madScalability, 2020-09-23

Your Host Tonight


What We Will Cover


Root Cause Analysis: The Five Whys


The Chernobyl Accident (1986)


Blameless Postmortems


USS John S. McCain Collision (2017)


Leadership

Root Cause Analysis 


Oh noes!


Putting Out Fires All Day Long?


The Five Whys


Ask like a small child


Do not stop at the first cause


But why "five" whys?


Keep on asking until everything is clear!

Problem: Root Cause?


In a complex system, failures rarely have a single cause


We should strive to search for every issue
... and then fix all of them


Dive in until you have complete confidence
that you have understood the problem
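
The idea of asking "why?" repeatedly, and following every branch instead of stopping at one root cause, could be sketched as a small tree. This is only an illustration under stated assumptions: the `Cause` class, the `deepest_causes` helper, and the example incident are all hypothetical and do not come from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    """One answer to a "why?" question; it may itself have several causes."""
    description: str
    whys: list[Cause] = field(default_factory=list)


def deepest_causes(cause: Cause) -> list[str]:
    """Return every cause that has no further "why" recorded, depth first."""
    if not cause.whys:
        return [cause.description]
    found = []
    for why in cause.whys:
        found.extend(deepest_causes(why))
    return found


# Hypothetical incident, invented for illustration only.
incident = Cause("Website was down for an hour", [
    Cause("Database ran out of disk space", [
        Cause("Old logs were never deleted"),
        Cause("Disk usage alert was muted", [
            Cause("Alert fired too often and was ignored"),
        ]),
    ]),
    Cause("Nobody on call knew the database", [
        Cause("No runbook existed"),
    ]),
])

# Every branch ends in something to fix, not just the first one found.
for description in deepest_causes(incident):
    print("To fix:", description)
```

The tree is not limited to five levels; each branch is followed for as long as someone can still answer "why?".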

Root of causes


Ishikawa Diagram






Finding the root cause of a failure is like finding the root cause of a success

Chernobyl (1986)


The Chernobyl Accident (1986)



A thoroughly studied accident


Hundreds of deaths

Thousands of people displaced


Different points of view

Chernobyl (2019), HBO


Do You Think It Was Realistic?



Overview (No Spoilers!)


Evil bosses


Incompetent operators


Dreadful response by authorities


Politics involved in disaster mitigation

International Atomic Energy Agency



Report IAEA INSAG-7, 1992



A most extraordinary report



Excerpts



Source

Legasov Tapes


Overview (Spoilers!)


Adequate response
Devoted responders


Inexperienced operators


Inadequate culture


Incompetent hierarchy

Safety Standard


1: Make the reactor maximally reliable

2: Make the operation maximally reliable:
  • trained staff
  • good discipline
  • easy-to-operate equipment


3: Enclose the reactor in a containment



(Tape 4 Side B)

Redundant Systems


At least two protection systems


Based on different principles


Not 211 identical rods!



Three Views of Chernobyl



HBO: drama, bad communists


IAEA: guilty operators, wrong culture


Valery Legasov: safety standard, redundant systems

Blameless Postmortems 


Do People Blame Each Other?



Human Error!



Human error is not a cause, it is an effect.


If people are punished for being honest about what transpired, employees will soon learn that the personal costs to speaking up far outweigh the personal benefits. Improving the safety of a system is rooted in information.

United States Forest Service

Second Stories


First Stories | Second Stories
Human error is seen as cause of failure | Human error is seen as the effect of systemic vulnerabilities deeper inside the organization
Saying what people should have done is a satisfying way to describe failure | Saying what people should have done doesn’t explain why it made sense for them to do what they did
Telling people to be more careful will make the problem go away | Only by constantly seeking out its vulnerabilities can organizations enhance safety

USS John S. McCain Collision


Review the incident summary


Review the consequences


Who was to blame?


What measures were correct?

Punishing the Operator


Leadership



What Do Team Members Expect?



Attitude and Expectations


How you act in a crisis will set the tone



Communication is crucial



Explain clearly what you look for



Try to get the best out of people



Consider listening to what an incident has to teach you. It's your job to figure out what that is.


John Allspaw, Incidents as we Imagine Them Versus How They Actually Are

Wrong: Putting Out Fires All Day Long


Right: Incident-Driven Development


Incidents will show the way


Accept it: there will be unknown unknowns


Try to understand and solve issues thoroughly


What can be done to avoid a repetition?

Yes, We Can!


And Now for Some Spam


An Incident-shaped Hole



Scalability Training




In Spanish


Spend that training budget!


Deductible for corporations

Thanks!


madScalability: Incident Analysis

By Alex Fernández

Slides for the Meetup at madScalability: https://www.meetup.com/madscalability/events/273170362/
