Chernobyl: Lessons for DevOps


2020-11-24

Your host today


Freelance dev + ops

Today we will cover


Multiple sources


The incident


The analysis


The postmortem


The consequences

Multiple sources


Chernobyl (2019), HBO


Do you think the series is realistic?



In brief (no spoilers!)


Bad bosses


Incompetent operators


A terrible response from the authorities


Disaster mitigation tangled up in politics

International Atomic Energy Agency



IAEA report INSAG-7, 1992



An absolutely extraordinary report



It corrects the IAEA report INSAG-1, 1987



Source

The Legasov Tapes


In brief (spoilers!)


An adequate response
Selfless workers


Inexperienced operators


An inadequate culture


An incompetent hierarchy

Little to lose


Legasov took his own life on the second anniversary of the catastrophe

Three views of Chernobyl



HBO: drama, evil communists


IAEA: guilty operators, bad culture


Valery Legasov: inadequate culture, incompetent hierarchy

Other sources


Wilson Center:

National Security Archive:


The incident


Lesson: We talk about "incidents"



Sometimes we have accidents or even catastrophes


It helps take the drama out of the situation


It helps us think with a cool head

A bird's-eye view of Chernobyl


The world's largest nuclear power plant (as planned)


A very well-studied accident


Hundreds of direct deaths


Thousands of people displaced


I must say that at multiple times that I attended the Operative group meetings, these meetings were held in a very calm and conservative manner. They tried as much as they could to base their decisions on the specialists’ point of view. [...] In summary, for me it was an example of a correctly set up workflow.


Legasov, tape 1, side B


Lesson: Crisis committee



Initial investigation


Assessment of the incident


It should bring together the necessary staff



We could only kick ourselves for not having external automatic dosimetry devices set up around the station, that would record the telemetry about radiation conditions within, say, 1 km, 2 km, 4 km and 10 km radius.

Legasov, tape 1, side A

Lesson: Monitoring


Essential to know what is going on


Also known as: telemetry


Quite different from observability:
knowing the internal state of a system from its output
(which is even more desirable)
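
A minimal sketch of what this looks like in practice, assuming the prometheus_client Python library and an imaginary service (all names here are hypothetical):

    from prometheus_client import Counter, Histogram, start_http_server
    import random
    import time

    # Hypothetical metrics for an imaginary service
    REQUESTS = Counter("requests_total", "Requests handled")
    LATENCY = Histogram("request_latency_seconds", "Request latency in seconds")

    @LATENCY.time()
    def handle_request():
        REQUESTS.inc()
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

    if __name__ == "__main__":
        start_http_server(8000)  # metrics served at /metrics for the monitoring system to scrape
        while True:
            handle_request()

Which telemetry you collect matters as much as collecting it: the external dosimeters Legasov wishes they had are exactly the kind of independent measurement that keeps reporting even when the system itself is in trouble.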



This is why [the firemen's] actions were not only heroic but very professional, educated and correct from the point that they took the first precise steps to localize the accident and prevent it from spreading.

Legasov, tape 1, side A


[...] the Operative group was constantly trying to provide maximum protection for the people and judging from possible degrees of contamination decide on a compensation amount that would be necessary for the evacuees. [...] I personally was a witness to that. They made many decisions specifically to help people who were affected by this accident.

Legasov, tape 1, side B

Lesson: Mitigation



Get a thorough understanding of what is going on


Use every means available to mitigate the incident


Take special care to help the people affected

The analysis




The scram just before the sharp rise in power that destroyed the reactor may well have been the decisive contributory factor. On the other hand, the features of the RBMK reactor had also set other pitfalls for the operating staff.

IAEA report INSAG-7, 1992

Lesson: The five whys?


Ask questions like a small child


Don't stop at the first cause


But why "five" whys?


Keep asking until everything is clear!
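
A hypothetical worked example, just to show the chain (it does not describe any real incident):

Why did the site go down? Because the database ran out of disk.
Why did it run out of disk? Because logs were written to the same volume as the data.
Why were they on the same volume? Because the default configuration was never reviewed.
Why was it never reviewed? Because nobody owns the database configuration.
Why does nobody own it? Because operations work is never planned or assigned.

The chain could have branched at every step; settling on one tidy story is exactly the trap the next lesson warns about.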



Thus the question arises: Which weakness ultimately caused the accident?
There is a second question: Does it really matter which shortcoming was the actual cause, if any of them could potentially have been the determining factor?

IAEA report INSAG-7, 1992

Lesson: Root cause?


In a complex system, failures do not have a single cause


We should look for every problem
... and fix them all


Dig down until you are fully confident
that you have understood the problem

A root of causes


Ishikawa diagram






Finding the root cause of a failure is like finding the root cause of a success


[...] the positive scram effect [...] had been known of at the time of the accident and had first been identified at the Ignalina RBMK plant in the Lithuanian Republic in 1983. Although the Chief Design Engineer for RBMK reactors [...] stated that design changes would be made to correct the problem, he made no such changes, and the procedural measures he recommended for inclusion in plant operating instructions were not adopted.

IAEA report INSAG-7, 1992


The accident at Leningrad Unit 1 is even considered by some to have been a precursor to the Chernobyl accident. However, lessons learned from these accidents prompted at most only very limited design modifications or improvements in operating practices.


IAEA report INSAG-7, 1992

Lesson: Firefighting?


Time spent on incidents



Putting out fires all day long?



The [...] the normal standard of safety for [...] nuclear power stations [...] consists of three elements.
One, make the reactor maximally reliable. Two, make the operation maximally reliable; trained staff, good discipline, easy-to-operate equipment, etc. [...] Three, all this dangerous industry [...] must compulsorily be [...] enclosed in a containment as it is called in the West.

Legasov, tape 4, side B

Lesson: Safety standard


1: Make the system maximally reliable

2: Make operations maximally reliable:
  • Trained staff
  • Good discipline
  • Easy-to-operate equipment


3: Make the system fault-tolerant



Since the emergency protection system consists of 211 rods that are lowered, [the designers of the RBMK reactor] say that they have 211 [protection] systems, not two. But this is rubbish [...]. And if the operator is killed, falls ill or dies, then all these 211 rods will remain in place.

Legasov, tape 4, side B

Lesson: Redundant systems



At least two protection systems


Based on different principles


Not 211 identical control rods!
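
A minimal sketch of the same idea in software, with two safeguards based on different principles (all names hypothetical): an in-process error budget, plus an external heartbeat watchdog that does not depend on the process being healthy enough to police itself.

    import time

    # Safeguard 1: in-process error budget; the service stops taking work
    # when too many errors accumulate.
    class ErrorBudget:
        def __init__(self, max_errors):
            self.max_errors = max_errors
            self.errors = 0

        def record_error(self):
            self.errors += 1

        def allows_work(self):
            return self.errors < self.max_errors

    # Safeguard 2: heartbeat file touched on every loop; a *separate* watchdog
    # process (or the orchestrator) restarts the service if it goes stale.
    HEARTBEAT_FILE = "/tmp/service.heartbeat"  # hypothetical path

    def beat():
        with open(HEARTBEAT_FILE, "w") as f:
            f.write(str(time.time()))

The two safeguards fail for different reasons: if the process itself is wedged, its internal error budget can no longer act, but the external watchdog still fires.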

The postmortem



[...] the test procedure was altered on an ad hoc basis; [...]
the operating staff did not stop and think, but on the spot they modified the test conditions [...]
Where in the process it is found that the initial procedures are defective [...], tests should cease while a carefully preplanned process is followed to evaluate any changes contemplated.

IAEA report INSAG-7, 1992

Lesson: Checklists


Review your procedures


Write a checklist


Every delicate procedure should have its own checklist,
drawn up calmly, in advance


On any deviation, stop the procedure
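
A minimal sketch of the idea as code, assuming a hypothetical release checklist: each step is verified in order, and any deviation stops the whole procedure.

    import sys

    # Hypothetical release checklist: (description, check function).
    # Replace the lambdas with real checks.
    CHECKLIST = [
        ("Backups completed in the last 24 h", lambda: True),
        ("Monitoring dashboards are green", lambda: True),
        ("Rollback plan is written down", lambda: True),
    ]

    def run_checklist():
        for description, check in CHECKLIST:
            if not check():
                # Any deviation stops the procedure: no improvising on the spot.
                sys.exit(f"ABORTED: {description}")
            print(f"OK: {description}")

    if __name__ == "__main__":
        run_checklist()

The value is not in the code but in writing the steps down ahead of time, with a cool head.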


The account given to the Vienna Conference stated that it was possible to explain the nature of the accident [...] in terms of an uncontrolled reactivity driven excursion. [...] The assertion was made that the accident arose through a low probability coincidence of a number of violations of rules and procedures by the operating staff and those responsible for authorizing the test.

IAEA report INSAG-7, 1992

Lesson: Blameless postmortems


Is it normal to blame people?



Human error!



Human error is not a cause, it is an effect.


There is a need to shift the balance of perception so as to emphasize more the deficiencies in the safety features of the design which were touched on in INSAG-1.
However, INSAG remains of the view that in many respects the actions of the operators were unsatisfactory.

IAEA report INSAG-7, 1992

Lesson: Second stories


First stories:
  • Human error is seen as cause of failure
  • Saying what people should have done is a satisfying way to describe failure
  • Telling people to be more careful will make the problem go away

Second stories:
  • Human error is seen as the effect of systemic vulnerabilities deeper inside the organization
  • Saying what people should have done doesn't explain why it made sense for them to do what they did
  • Only by constantly seeking out its vulnerabilities can organizations enhance safety

The consequences




I immediately told [Scherbina] that 200 tons won’t solve any problems. We would need around 2000 tons of lead to be dropped into the crater of the reactor. He listened very carefully and [...] ordered 6000 tons of lead, because he thought that we could have made a mistake in our calculations.

Legasov, tape 1, side A

Lesson: Effective leadership



How you act in a crisis will set the tone


Communication is the most important thing


Explain clearly what you are looking for


Try to bring out the best in everyone

What does the team expect?





The tens of thousands of deaths of liquidators and victims of the catastrophe and the loss of health and quality of life for the nine million people who still survive in the affected areas paid for them.

Damage assessment


Victims:


5-7% of the Ukrainian government's spending

Lesson: Incidents are expensive


The team pays for them in blood


The result: high turnover, a burned-out team


The costs are not always visible


Impact on customers

Harmful secrets


Lesson: Public report


To be sent across the whole organization


Everyone can learn from it


Widen the audience to include those affected


A failure admitted in public tends to be better accepted
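
One possible outline for such a report (a sketch, not a standard):

  • Summary: what happened, in one paragraph
  • Impact: who was affected, for how long, and at what cost
  • Timeline: detection, escalation, mitigation, resolution
  • Contributing causes: all of them, not just one
  • Action items: each with an owner and a deadline
  • Lessons learned: what the rest of the organization should take away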

Perception


Reality


Lesson: Accidents are avoidable


Incidents are unavoidable


Good engineering keeps them from becoming accidents


Good engineering is expensive...


But accidents are even more expensive

We can do better!



16 lessons


We talk about "incidents"
Crisis committee
Monitoring
Mitigation
The five whys?
Root cause?
Firefighting?
Safety standard
Redundant systems
Checklists
Blameless postmortems
Second stories
Effective leadership
Incidents are expensive
Public report
Accidents are avoidable

And now a little spam


Scalability course




We are hoping to run a third edition


Spend that training budget!


Subsidized (bonificable) for companies

Syllabus



Thank you!


The Chernobyl accident: Lessons for DevOps

By Alex Fernández


Slides for the Madrid DevOps Meetup: https://www.meetup.com/es-ES/madrid-devops/events/274538968/
