Chernobyl: Lessons for DevOps
2020-11-24
Your host today
Freelance dev + ops
Today we will cover
Multiple sources
The incident
The analysis
The postmortem
The consequences
Multiple sources
Chernobyl (2019), HBO
Do you think the series is realistic?
Briefly (no spoilers!)
Bad bosses
Incompetent operators
Terrible response from the authorities
Disaster mitigation tangled up in politics
International Atomic Energy Agency
IAEA INSAG-7 report, 1992
An absolutely extraordinary report
Corrects the IAEA INSAG-1 report, 1987
The Legásov tapes
Briefly (spoilers!)
Adequate response
Selfless workers
Inexperienced operators
Inadequate culture
Incompetent hierarchy
Little to lose
Legásov died by suicide on the second anniversary of the catastrophe
Three views of Chernobyl
HBO: drama, evil communists
IAEA: operators to blame, bad culture
Valery Legásov: inadequate culture, incompetent hierarchy
Other sources
Wilson Center:
National Security Archive:
Green Facts: Chernobyl Nuclear Accident
The incident
Lesson: We talk about "incidents"
Sometimes we have accidents or even catastrophes
It helps take the drama out of the situation
It helps us think with a cool head
Chernobyl: a bird's-eye view
The (planned) largest nuclear power plant in the world
A thoroughly studied accident
Hundreds of direct deaths
Thousands of people displaced
I must say that at multiple times that I attended the Operative group meetings, these meetings were held in a very calm and conservative manner. They tried as much as they could to base their decisions on the specialists’ point of view. [...] In summary, for me it was an example of a correctly set up workflow.
Legásov, tape 1, side B
Lesson: Crisis committee
Initial investigation
Assessment of the incident
It must bring together the necessary staff
We could only kick ourselves for not having external automatic dosimetry devices set up around the station, that would record the telemetry about radiation conditions within, say, 1 km, 2 km, 4 km and 10 km radius.
Lesson: Monitoring
Essential to know what is going on
Also known as: telemetry
Very different from observability:
knowing the internal state of a system from its outputs
(which is even more desirable)
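A minimal sketch of the monitoring idea, assuming hypothetical sensors placed at the radii mentioned in the quote (1, 2, 4 and 10 km). The `Reading` type, the `readings_over_threshold` function and the threshold value are illustrative assumptions, not part of any real dosimetry or monitoring system.

```python
from dataclasses import dataclass

# Hypothetical alert threshold; the unit and value are illustrative only.
ALERT_THRESHOLD = 1.0


@dataclass
class Reading:
    radius_km: float  # distance of the sensor from the monitored system
    value: float      # measured level, in arbitrary units


def readings_over_threshold(readings, threshold=ALERT_THRESHOLD):
    """Return the readings above the threshold, nearest sensor first."""
    alerts = [r for r in readings if r.value > threshold]
    return sorted(alerts, key=lambda r: r.radius_km)


if __name__ == "__main__":
    sample = [Reading(10, 0.2), Reading(1, 5.0), Reading(4, 0.9), Reading(2, 1.5)]
    for r in readings_over_threshold(sample):
        print(f"ALERT at {r.radius_km} km: {r.value}")
```

This only covers telemetry: it reports what the sensors see, and says nothing about the internal state of the system.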
This is why [the firemen's] actions were not only heroic but very professional, educated and correct from the point that they took the first precise steps to localize the accident and prevent it from spreading.
Legásov, tape 1, side A
[...] the Operative group was constantly trying to provide maximum protection for the people and judging from possible degrees of contamination decide on a compensation amount that would be necessary for the evacuees. [...] I personally was a witness to that. They made many decisions specifically to help people who were affected by this accident.
Lesson: Mitigation
Find out in depth what is going on
Use every available means to mitigate the incident
Take special care to help the people affected
The analysis
The scram just before the sharp rise in power that destroyed the reactor may well have been the decisive contributory factor. On the other hand, the features of the RBMK reactor had also set other pitfalls for the operating staff.
IAEA INSAG-7 report, 1992
Lesson: The five whys?
Ask questions like a small child
Do not stop at the first cause
But why "five" whys?
Keep asking until everything is clear!
Thus the question arises: Which weakness ultimately caused the accident?
There is a second question: Does it really matter which shortcoming was the actual cause, if any of them could potentially have been the determining factor?
IAEA INSAG-7 report, 1992
Lesson: Root cause?
In a complex system, failures do not have a single cause
We should look for every problem
... and fix them all
Dig until you are fully confident
that you have understood the problem
Root of causes
Ishikawa diagram
Or fishbone diagram
Finding the root cause of a failure is like finding the root cause of a success
[...] the positive scram effect [...] had been known of at the time of the accident and had first been identified at the Ignalina RBMK plant in the Lithuanian Republic in 1983. Although the Chief Design Engineer for RBMK reactors [...] stated that design changes would be made to correct the problem, he made no such changes, and the procedural measures he recommended for inclusion in plant operating instructions were not adopted.
The accident at Leningrad Unit 1 is even considered by some to have been a precursor to the Chernobyl accident. However, lessons learned from these accidents prompted at most only very limited design modifications or improvements in operating practices.
Lesson: Firefighting?
Time spent on incidents
Fighting fires all day long?
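One way to put a number on "time spent on incidents" is a simple ratio. The sketch below assumes the team already tracks hours somehow; the function name `firefighting_ratio` and the sample figures are hypothetical.

```python
def firefighting_ratio(incident_hours, total_hours):
    """Fraction of working time spent on incidents (0.0 to 1.0)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= incident_hours <= total_hours:
        raise ValueError("incident_hours must be between 0 and total_hours")
    return incident_hours / total_hours


if __name__ == "__main__":
    # Hypothetical week: 10 hours of incidents out of 40 worked.
    print(f"{firefighting_ratio(10, 40):.0%} of the week spent firefighting")
```

What counts as "too much" is a judgment call for each team; the ratio only makes the trend visible.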
The [...] the normal standard of safety for [...] nuclear power stations [...] consists of three elements.
One, make the reactor maximally reliable. Two, make the operation maximally reliable; trained staff, good discipline, easy-to-operate equipment, etc. [...] Three, all this dangerous industry [...] must compulsorily be [...] enclosed in a containment as it is called in the West.
Lesson: Safety standard
1: Make the system as reliable as possible
2: Make operations as reliable as possible:
- Trained staff
- Good discipline
- Easy-to-operate equipment
3: Make the system fault-tolerant
Since the emergency protection system consists of 211 rods that are lowered, [the designers of the RBMK reactor] say that they have 211 [protection] systems, not two. But this is rubbish [...]. And if the operator is killed, falls ill or dies, then all these 211 rods will remain in place.
Lesson: Redundant systems
At least two protection systems
Based on different principles
Not 211 identical control rods!
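A rough sketch of "two protection systems based on different principles", applied to a generic monitored value. All names, limits and defaults here are hypothetical: one check looks at the absolute level, the other at the rate of change, and either one alone can trigger a shutdown.

```python
def level_protection(value, limit):
    """Principle 1: trip when the absolute level exceeds a limit."""
    return value > limit


def rate_protection(previous, current, max_delta):
    """Principle 2: trip when the value rises too fast, whatever its level."""
    return (current - previous) > max_delta


def should_shut_down(previous, current, limit=100.0, max_delta=20.0):
    """Trip if either independent protection fires."""
    return (
        level_protection(current, limit)
        or rate_protection(previous, current, max_delta)
    )


if __name__ == "__main__":
    print(should_shut_down(50, 60))   # neither check trips
    print(should_shut_down(10, 40))   # rate check trips, level is still low
    print(should_shut_down(95, 105))  # level check trips, rise is slow
```

Two functions in the same process are of course not independent in practice. Real redundancy would also need separate hardware, power and operators; the sketch only illustrates the "different principles" part.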
The postmortem
[...] the test procedure was altered on an ad hoc basis; [...]
the operating staff did not stop and think, but on the spot they modified the test conditions [...]
Where in the process it is found that the initial procedures are defective [...], tests should cease while a carefully preplanned process is followed to evaluate any changes contemplated.
Lesson: Checklists
Review the procedures
Write a checklist
Every delicate procedure should have its checklist
drawn up in advance, with a cool head
On any deviation, stop the procedure
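The "stop on any deviation" rule can be sketched as a small checklist runner. This is an illustration under assumptions: the step format `(description, action, expected)` and the names `run_checklist` and `ChecklistDeviation` are hypothetical, not taken from any existing tool.

```python
class ChecklistDeviation(Exception):
    """Raised when a step does not produce the expected result."""


def run_checklist(steps):
    """Run (description, action, expected) steps in order.

    Stops at the first deviation instead of improvising a workaround.
    Returns the descriptions of the completed steps.
    """
    completed = []
    for description, action, expected in steps:
        result = action()
        if result != expected:
            raise ChecklistDeviation(
                f"{description!r}: expected {expected!r}, got {result!r}"
            )
        completed.append(description)
    return completed


if __name__ == "__main__":
    steps = [
        ("backup exists", lambda: True, True),
        ("replica lag is zero", lambda: 3, 0),  # deviates on purpose
        ("switch traffic", lambda: "done", "done"),
    ]
    try:
        run_checklist(steps)
    except ChecklistDeviation as deviation:
        print(f"Procedure stopped: {deviation}")
```

The later steps never run after a deviation, which is the point: any change to the procedure is evaluated calmly afterwards, not on the spot.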
The account given to the Vienna Conference stated that it was possible to explain the nature of the accident [...] in terms of an uncontrolled reactivity driven excursion. [...] The assertion was made that the accident arose through a low probability coincidence of a number of violations of rules and procedures by the operating staff and those responsible for authorizing the test.
Lesson: Blameless postmortems
Is it normal to blame people?
Human error!
Human error is not a cause, it is an effect.
John Allspaw: Outages, Post Mortems, and Human Error 101
There is a need to shift the balance of perception so as to emphasize more the deficiencies in the safety features of the design which were touched on in INSAG-1.
However, INSAG remains of the view that in many respects the actions of the operators were unsatisfactory.
Lesson: Second stories
First Stories | Second Stories |
---|---|
Human error is seen as cause of failure | Human error is seen as the effect of systemic vulnerabilities deeper inside the organization |
Saying what people should have done is a satisfying way to describe failure | Saying what people should have done doesn’t explain why it made sense for them to do what they did |
Telling people to be more careful will make the problem go away | Only by constantly seeking out its vulnerabilities can organizations enhance safety |
The consequences
I immediately told [Scherbina] that 200 tons won’t solve any problems. We would need around 2000 tons of lead to be dropped into the crater of the reactor. He listened very carefully and [...] ordered 6000 tons of lead, because he thought that we could have made a mistake in our calculations.
Legásov, tape 1, side A
Lesson: Effective leadership
How you act in a crisis will set the tone
Communication is the most important thing
Explain clearly what you are looking for
Try to bring out the best in everyone
What does the team expect?
The tens of thousands of deaths of liquidators and victims of the catastrophe and the loss of health and quality of life for the nine million people who still survive in the affected areas paid for them.
Damage assessment
Victims:
- WHO: 4000
- A. Yaroshinskaya: tens of thousands
- Up to half a million (?)
5-7% of Ukrainian government spending
Lesson: Incidents are expensive
The team pays for it in blood
Result: high turnover, burned-out team
Costs are not always visible
Impact on customers
Harmful secrets
Lesson: Public report
To be sent to the organization
Everyone can learn
Extend the audience to those affected
A public failure is usually better accepted
Perception
Reality
Lesson: Accidents are avoidable
Incidents are inevitable
Good engineering keeps them from becoming accidents
Good engineering is expensive...
But accidents are more expensive
We can do better!
16 lessons
We talk about "incidents"
Crisis committee
Monitoring
Mitigation
The five whys?
Root cause?
Firefighting?
Safety standard
Redundant systems
Checklists
Blameless postmortems
Second stories
Effective leadership
Incidents are expensive
Public report
Accidents are avoidable
And now a bit of spam
Scalability course
We are hoping for a third edition
Spend that training budget!
Eligible for company training subsidies
Syllabus
Thank you!
The Chernobyl Accident: Lessons for DevOps
By Alex Fernández
Slides for the Madrid DevOps Meetup: https://www.meetup.com/es-ES/madrid-devops/events/274538968/