Incident Analysis
madScalability, 2020-09-23
Your Host Tonight
What We Will Cover
Root Cause Analysis: The Five Whys
The Chernobyl Accident (1986)
Blameless Postmortems
USS John McCain Collision (2017)
Leadership
Root Cause Analysis
Oh noes!
Putting Out Fires All Day Long?
The Five Whys
Ask like a small child
Do not stop at the first cause
But, why "five" whys?
Keep on asking until everything is clear!
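As a rough illustration of the questioning loop, the Python sketch below keeps asking "why?" until no further answer is recorded. The incident, the `answers` mapping and the `ask_whys` helper are all hypothetical, invented for this example. A real analysis is a conversation, not a dictionary lookup.

```python
def ask_whys(problem, answers, max_whys=None):
    """Follow "why?" answers from a problem until none is left.

    `answers` maps a statement to the reason given for it.
    `max_whys` optionally caps the number of questions; None means keep asking.
    """
    chain = [problem]
    current = problem
    while current in answers:
        asked = len(chain) - 1  # number of "why?" questions answered so far
        if max_whys is not None and asked >= max_whys:
            break
        current = answers[current]
        if current in chain:  # guard against circular explanations
            break
        chain.append(current)
    return chain


# Hypothetical incident: every statement here is made up for illustration.
answers = {
    "The site was down": "The database ran out of connections",
    "The database ran out of connections": "A deploy doubled the worker count",
    "A deploy doubled the worker count": "The default was changed without review",
    "The default was changed without review": "Config changes skip code review",
    "Config changes skip code review": "The review tool ignores the config repo",
    "The review tool ignores the config repo": "Nobody owns the review tooling",
}

full_chain = ask_whys("The site was down", answers)
capped_chain = ask_whys("The site was down", answers, max_whys=5)
```

In this made-up chain, stopping at exactly five questions leaves out the last answer, which is one way to read the slide: the number five is a reminder to keep going, not a limit.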
Problem: Root Cause?
In a complex system, failures rarely have a single cause
We should strive to search for every issue
... and then fix all of them
Dive in until you have complete confidence that you have understood the problem
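One way to picture "every issue" rather than a single root cause is a graph of contributing causes. The sketch below is a minimal example under that assumption: the `causes` mapping and the `contributing_causes` helper are hypothetical, and the incident data is invented.

```python
def contributing_causes(event, causes):
    """Return every leaf cause reachable from `event`, sorted by name.

    `causes` maps an event to the list of things that contributed to it.
    A leaf is a cause with no recorded contributors of its own.
    """
    leaves = []
    seen = set()
    stack = [event]
    while stack:
        node = stack.pop()
        if node in seen:  # a cause may contribute through several paths
            continue
        seen.add(node)
        children = causes.get(node, [])
        if not children and node != event:
            leaves.append(node)
        stack.extend(children)
    return sorted(leaves)


# Hypothetical outage with more than one branch of causes.
causes = {
    "outage": ["bad config pushed", "alert did not fire"],
    "bad config pushed": ["no review step", "no staging environment"],
    "alert did not fire": ["threshold too high"],
}

to_fix = contributing_causes("outage", causes)
```

Here `to_fix` lists three separate issues. A single-chain analysis that followed only the first branch would have found at most two of them.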
Root of causes
Ishikawa Diagram
Finding the root cause of a failure is like finding the root cause of a success
Chernobyl (1986)
The Chernobyl Accident (1986)
A thoroughly studied accident
Hundreds of deaths
Thousands of people displaced
Different points of view
Chernobyl (2019), HBO
Do You Think It Was Realistic?
Overview (No Spoilers!)
Evil bosses
Incompetent operators
Dreadful response by authorities
Politics involved in disaster mitigation
International Atomic Energy Agency
Report IAEA INSAG-7, 1992
A most extraordinary report
Legasov Tapes
Overview (Spoilers!)
Adequate response
Devoted responders
Inexperienced operators
Inadequate culture
Incompetent hierarchy
Safety Standard
1: Make the reactor maximally reliable
2: Make the operation maximally reliable:
- trained staff
- good discipline
- easy-to-operate equipment
3: Enclose the reactor in a containment
Redundant Systems
At least two protection systems
Based on different principles
Not 211 identical rods!
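The redundancy rule above can be phrased as a simple check: at least two protection systems, working on at least two different principles. The sketch below assumes that reading. `ProtectionSystem`, `is_diversely_redundant` and the principle labels are hypothetical names chosen for illustration, not terms from any safety standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionSystem:
    name: str
    principle: str  # the physical or logical principle the system relies on


def is_diversely_redundant(systems):
    """True if there are at least two systems based on different principles."""
    principles = {system.principle for system in systems}
    return len(systems) >= 2 and len(principles) >= 2


# 211 copies of the same mechanism: many systems, one principle.
identical_rods = [
    ProtectionSystem(f"rod-{i}", "control rod insertion") for i in range(211)
]

# Placeholder for a design with a second, independent mechanism.
diverse = [
    ProtectionSystem("control rods", "control rod insertion"),
    ProtectionSystem("second system", "a different principle"),
]
```

With these inputs the check rejects `identical_rods` and accepts `diverse`: adding more copies of one mechanism does not protect against a flaw they all share.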
Three Views of Chernobyl
HBO: drama, bad communists
IAEA: guilty operators, wrong culture
Valery Legasov: safety standard, redundant systems
Blameless Postmortems
Do People Blame Each Other?
Human Error!
Human error is not a cause, it is an effect.
John Allspaw: Outages, Post Mortems, and Human Error 101
If people are punished for being honest about what transpired, employees will soon learn that the personal costs to speaking up far outweigh the personal benefits. Improving the safety of a system is rooted in information.
United States Forest Service
Second Stories
First Stories | Second Stories |
---|---|
Human error is seen as cause of failure | Human error is seen as the effect of systemic vulnerabilities deeper inside the organization |
Saying what people should have done is a satisfying way to describe failure | Saying what people should have done doesn’t explain why it made sense for them to do what they did |
Telling people to be more careful will make the problem go away | Only by constantly seeking out its vulnerabilities can organizations enhance safety |
USS John McCain Collision
Review the incident summary
Review the consequences
Who was to blame?
What measures were correct?
Punishing the Operator
Leadership
What Do Team Members Expect?
Attitude and Expectations
How you act in a crisis will set the tone
Communication is crucial
Explain clearly what you are looking for
Try to get the best out of people
Consider listening to what an incident has to teach you. It's your job to figure out what that is.
John Allspaw, Incidents as we Imagine Them Versus How They Actually Are
Wrong: Putting Out Fires All Day Long
Right: Incident-Driven Development
Incidents will show the way
Accept it: there will be unknown unknowns
Try to understand and solve issues thoroughly
What can be done to avoid a repetition?
Yes, We Can!
And Now for Some Spam
An Incident-shaped Hole
Scalability Training
In Spanish
Spend that training budget!
Deductible for corporations
Thanks!
madScalability: Incident Analysis
By Alex Fernández
Slides for the Meetup at madScalability: https://www.meetup.com/madscalability/events/273170362/