Chernobyl: Lessons for DevOps
Your host today
Chernobyl (2019), HBO
Do you think the series is realistic?
Briefly (no spoilers!)
International Atomic Energy Agency
IAEA report INSAG-7, 1992
An absolutely extraordinary report
Corrects IAEA report INSAG-1, 1987
The Legasov tapes
Little to lose
Three views of Chernobyl
Lesson: We talk about "incidents"
Chernobyl: a bird's-eye view
I must say that at multiple times that I attended the Operative group meetings, these meetings were held in a very calm and conservative manner. They tried as much as they could to base their decisions on the specialists’ point of view. [...] In summary, for me it was an example of a correctly set up workflow.
Legasov, tape 1, side B
Lesson: Crisis committee
We could only kick ourselves for not having external automatic dosimetry devices set up around the station, that would record the telemetry about radiation conditions within, say, 1 km, 2 km, 4 km and 10 km radius.
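The point of the quote carries over to services: telemetry has to be in place, at several distances from the source, before the incident happens. A minimal sketch of that idea follows. The `Reading` type, the threshold and the sample values are assumptions made up for illustration and do not come from the tapes; only the 1, 2, 4 and 10 km radii come from the quote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    radius_km: float  # distance of the probe from the station
    dose_rate: float  # hypothetical unit; any consistent unit works


def affected_radius(readings, threshold):
    """Return the largest probe radius whose reading exceeds the threshold.

    Returns None when no probe is above the threshold.
    """
    over = [r.radius_km for r in readings if r.dose_rate > threshold]
    return max(over) if over else None


# Made-up readings at the radii mentioned in the quote.
readings = [
    Reading(1.0, 9.0),
    Reading(2.0, 4.5),
    Reading(4.0, 0.8),
    Reading(10.0, 0.1),
]
print(affected_radius(readings, threshold=1.0))
```

The same shape applies to external probes placed around a service: without readings from outside, there is no way to tell how far a problem has spread.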
This is why [the firemen's] actions were not only heroic but very professional, educated and correct from the point that they took the first precise steps to localize the accident and prevent it from spreading.
[...] the Operative group was constantly trying to provide maximum protection for the people and judging from possible degrees of contamination decide on a compensation amount that would be necessary for the evacuees. [...] I personally was a witness to that. They made many decisions specifically to help people who were affected by this accident.
The scram just before the sharp rise in power that destroyed the reactor may well have been the decisive contributory factor. On the other hand, the features of the RBMK reactor had also set other pitfalls for the operating staff.
Lesson: The five whys?
Thus the question arises: Which weakness ultimately caused the accident?
There is a second question: Does it really matter which shortcoming was the actual cause, if any of them could potentially have been the determining factor?
Lesson: Root cause?
Root of causes
Ishikawa diagram
Finding the root cause of a failure is like finding the root cause of a success
[...] the positive scram effect [...] had been known of at the time of the accident and had first been identified at the Ignalina RBMK plant in the Lithuanian Republic in 1983. Although the Chief Design Engineer for RBMK reactors [...] stated that design changes would be made to correct the problem, he made no such changes, and the procedural measures he recommended for inclusion in plant operating instructions were not adopted.
The accident at Leningrad Unit 1 is even considered by some to have been a precursor to the Chernobyl accident. However, lessons learned from these accidents prompted at most only very limited design modifications or improvements in operating practices.
Lesson: Putting out fires?
Time spent on incidents
Putting out fires all day?
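One way to answer that question with data is to measure what share of the team's capacity goes to incidents. This is a minimal sketch: the function name `incident_share`, the team size and the hours are hypothetical values chosen for illustration.

```python
def incident_share(incident_hours, team_size, hours_per_person):
    """Fraction of total team capacity spent handling incidents."""
    capacity = team_size * hours_per_person
    if capacity <= 0:
        raise ValueError("team capacity must be positive")
    return sum(incident_hours) / capacity


# Hypothetical week: three incidents, four people, 40 hours each.
share = incident_share([6, 3.5, 10.5], team_size=4, hours_per_person=40)
print(f"{share:.1%}")  # prints 12.5%
```

Tracking this number over time shows whether the fixes from past incidents are actually reducing firefighting.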
The [...] the normal standard of safety for [...] nuclear power stations [...] consists of three elements.
One, make the reactor maximally reliable. Two, make the operation maximally reliable; trained staff, good discipline, easy-to-operate equipment, etc. [...] Three, all this dangerous industry [...] must compulsorily be [...] enclosed in a containment as it is called in the West.
Lesson: Safety standard
- Trained staff
- Good discipline
- Easy-to-operate equipment
3: Make the system fault-tolerant
Since the emergency protection system consists of 211 rods that are lowered, [the designers of the RBMK reactor] say that they have 211 [protection] systems, not two. But this is rubbish [...]. And if the operator is killed, falls ill or dies, then all these 211 rods will remain in place.
Lesson: Redundant systems
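Legasov's objection can be put into numbers: many actuators behind a single trigger are still one system. The sketch below assumes that failures are independent and uses invented availability figures. It illustrates the argument and is not a model of the RBMK protection system.

```python
def availability(trigger_availability, triggers, rod_availability, rods):
    """Probability that at least one trigger fires AND at least one rod drops.

    Assumes every trigger and every rod fails independently of the others.
    """
    any_trigger = 1 - (1 - trigger_availability) ** triggers
    any_rod = 1 - (1 - rod_availability) ** rods
    return any_trigger * any_rod


# Invented figures: 211 rods, each 90% reliable, triggers 99% reliable.
single = availability(0.99, triggers=1, rod_availability=0.9, rods=211)
double = availability(0.99, triggers=2, rod_availability=0.9, rods=211)
print(f"one trigger:  {single:.4f}")
print(f"two triggers: {double:.4f}")
```

With one trigger the result can never exceed the trigger's own availability, however many rods there are. Adding a second independent trigger is what improves it.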
[...] the test procedure was altered on an ad hoc basis; [...]
the operating staff did not stop and think, but on the spot they modified the test conditions [...]
Where in the process it is found that the initial procedures are defective [...], tests should cease while a carefully preplanned process is followed to evaluate any changes contemplated.
The account given to the Vienna Conference stated that it was possible to explain the nature of the accident [...] in terms of an uncontrolled reactivity driven excursion. [...] The assertion was made that the accident arose through a low probability coincidence of a number of violations of rules and procedures by the operating staff and those responsible for authorizing the test.
Lesson: Blameless postmortems
Is it normal to blame people?
Human error is not a cause, it is an effect.
There is a need to shift the balance of perception so as to emphasize more the deficiencies in the safety features of the design which were touched on in INSAG-1.
However, INSAG remains of the view that in many respects the actions of the operators were unsatisfactory.
Lesson: Second stories
| First Stories | Second Stories |
|---|---|
| Human error is seen as cause of failure | Human error is seen as the effect of systemic vulnerabilities deeper inside the organization |
| Saying what people should have done is a satisfying way to describe failure | Saying what people should have done doesn’t explain why it made sense for them to do what they did |
| Telling people to be more careful will make the problem go away | Only by constantly seeking out its vulnerabilities can organizations enhance safety |
I immediately told [Scherbina] that 200 tons won’t solve any problems. We would need around 2000 tons of lead to be dropped into the crater of the reactor. He listened very carefully and [...] ordered 6000 tons of lead, because he thought that we could have made a mistake in our calculations.
Lesson: Effective leadership
How you act in a crisis will set the tone
Communication is what matters most
Explain clearly what you are looking for
Try to bring out the best in everyone
What does the team expect?
The tens of thousands of deaths of liquidators and victims of the catastrophe and the loss of health and quality of life for the nine million people who still survive in the affected areas paid for them.
Damage assessment
Lesson: Incidents are expensive
The team pays in blood
Result: high turnover, burned-out team
Costs are not always visible
Impact on customers
Lesson: Public report
Lesson: Accidents are avoidable
We can do better!
- We talk about "incidents"
- Crisis committee
- Monitoring
- Mitigation
- The five whys?
- Root cause?
- Putting out fires?
- Safety standard
- Redundant systems
- Checklists
- Blameless postmortems
- Second stories
- Effective leadership
- Incidents are expensive
- Public report
- Accidents are avoidable
And now a bit of spam
Scalability course
The Chernobyl accident: Lessons for DevOps
By Alex Fernández