Incident Analysis
madScalability, 2020-09-23
What We Will Cover
Root cause analysis. The five whys
The Chernobyl Accident (1986)
Blameless Postmortems
USS John McCain Collision (2017)
Leadership
Putting Out Fires All Day Long?
The Five Whys
Ask like a small child
Do not stop at the first cause
But why "five" whys?
Keep on asking until everything is clear!
Problem: Root Cause?
In a complex system, failures rarely have a single cause
We should strive to find every contributing issue
... and then fix all of them
Dive in until you are fully confident that you have understood the problem
Finding the root cause of a failure is like finding the root cause of a success
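A minimal sketch of the idea, assuming we record each answer to a "why?" as a node that can itself have several deeper causes. The `Cause` class, the `contributing_causes` helper, and the incident data are all hypothetical, invented for illustration only. The point is that repeated whys produce a tree rather than a single chain, and every leaf is something to fix.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    """One answer to a "why?" question, with its own deeper causes."""
    description: str
    because: list[Cause] = field(default_factory=list)


def contributing_causes(cause: Cause) -> list[str]:
    """Return every leaf of the cause tree, in depth-first order."""
    if not cause.because:
        return [cause.description]
    leaves: list[str] = []
    for deeper in cause.because:
        leaves.extend(contributing_causes(deeper))
    return leaves


# Hypothetical incident, made up purely for illustration.
incident = Cause("Checkout was down for 40 minutes", [
    Cause("Database ran out of connections", [
        Cause("A retry loop had no backoff"),
        Cause("Connection pool limit was never load-tested"),
    ]),
    Cause("Nobody was paged for 25 minutes", [
        Cause("Alert threshold was set too high"),
    ]),
])

for item in contributing_causes(incident):
    print("fix:", item)
```

Stopping at the first branch would surface only the retry loop. Walking the whole tree also surfaces the untested pool limit and the alerting gap.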
The Chernobyl Accident (1986)
A thoroughly studied accident
Hundreds of deaths
Thousands of people displaced
Different points of view
Do You Think It Was Realistic?
Overview (No Spoilers!)
Evil bosses
Incompetent operators
Dreadful response by authorities
Politics involved in disaster mitigation
International Atomic Energy Agency
Overview (Spoilers!)
Adequate response
Devoted responders
Inexperienced operators
Inadequate culture
Incompetent hierarchy
Safety Standard
1: Make the reactor maximally reliable
2: Make the operation maximally reliable:
- trained staff
- good discipline
- easy-to-operate equipment
3: Enclose the reactor in a containment structure
(Valery Legasov, Tape 4 Side B)
Redundant Systems
At least two protection systems
Based on different principles
Not 211 identical rods!
Three Views of Chernobyl
HBO: drama, bad communists
IAEA: guilty operators, wrong culture
Valery Legasov: safety standard, redundant systems
Do People Blame Each Other?
Human Error!
Human error is not a cause, it is an effect.
If people are punished for being honest about what transpired, employees will soon learn that the personal costs to speaking up far outweigh the personal benefits. Improving the safety of a system is rooted in information.
United States Forest Service
Second Stories
| First Stories | Second Stories |
| --- | --- |
| Human error is seen as cause of failure | Human error is seen as the effect of systemic vulnerabilities deeper inside the organization |
| Saying what people should have done is a satisfying way to describe failure | Saying what people should have done doesn’t explain why it made sense for them to do what they did |
| Telling people to be more careful will make the problem go away | Only by constantly seeking out its vulnerabilities can organizations enhance safety |
USS John McCain Collision
Review the consequences
Who was to blame?
What measures were correct?
What Do Team Members Expect?
Attitude and Expectations
How you act in a crisis will set the tone
Communication is crucial
Explain clearly what you are looking for
Try to get the best out of people
Wrong: Putting Out Fires All Day Long
Right: Incident-Driven Development
Incidents will show the way
Accept it: there will be unknown unknowns
Try to understand and solve issues thoroughly
What can be done to prevent a recurrence?
Scalability Training
In Spanish
Spend that training budget!
Deductible for corporations