Playing the (Chaos) Monkey

Alejandro Guirao Rodríguez

@lekum

github.com/lekum

1. Precedents

2. Netflix and the Chaos Monkey

3. Other similar initiatives

4. Do-it-yourself Chaos Monkey

AGENDA

1. PRECEDENTS

"Master of Disaster" (Availability program)
2001, at Amazon
Resilience Engineering
GameDay
- Advance notice
- (Realistic) incident simulation
- Business processes and communications
- A culture of "being prepared"
- MTBF and MTTR

JESSE ROBBINS AND GAMEDAY
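The MTBF and MTTR bullet can be tied back to the "Availability program" with the usual steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below is only an illustration: the function name `availability` and the figures in the usage note are assumptions, not data from Amazon.

```shell
#!/bin/sh
# availability MTBF MTTR
# Prints the steady-state availability MTBF / (MTBF + MTTR) with 4 decimals.
# Both arguments must use the same time unit (for example, hours).
availability() {
    awk -v mtbf="$1" -v mttr="$2" \
        'BEGIN { printf "%.4f\n", mtbf / (mtbf + mttr) }'
}
```

For example, `availability 999 1` describes a hypothetical service that fails on average every 999 hours and takes 1 hour to recover. Shortening the MTTR, which is what GameDay exercises rehearse, raises the result without changing the MTBF.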

"Disaster Recovery Testing Event"
At Google
DiRT
- 3-4 months' advance notice
- Lasts several days
- Complex failures
- Technical team, coordinators, war rooms...
- Final report with the required fixes

KRIPA KRISHNAN AND DiRT

Published in 2012 by Nassim Nicholas Taleb
A mathematical, financial and philosophical theory

ANTIFRAGILE: THINGS THAT GAIN FROM DISORDER

Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better

2. NETFLIX AND THE CHAOS MONKEY

Patent by Gregory S. Orzell and Yury Izrailevsky (September 20, 2010): "Validating the resiliency of networked applications"

ORIGIN

Techniques are disclosed for validating the resiliency of a networked application made available using a collection of interacting servers. In one embodiment, a network monitoring application observes each running server (or application) and at unspecified intervals, picks one and terminates it. In the case of a cloud based deployment, this may include terminating a virtual machine instance or terminating a process running on the server. By observing the effects of the failed server on the rest of the network application, a provider can ensure that each component can tolerate any single instance disappearing without warning.
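The observe, pick and terminate cycle described in the abstract can be sketched in a few lines of shell. This is a minimal illustration, not Netflix's implementation: `pick_victim`, `chaos_round` and the two hook commands passed to it are hypothetical names introduced here.

```shell
#!/bin/sh
# pick_victim: read one server identifier per line on stdin and print
# one of them at random (awk's generator is seeded from the clock).
pick_victim() {
    awk 'BEGIN { srand() }
         { ids[NR] = $0 }
         END { if (NR > 0) print ids[int(rand() * NR) + 1] }'
}

# chaos_round LIST_CMD KILL_CMD: run one observe-pick-terminate cycle.
# LIST_CMD prints the running servers, one per line; KILL_CMD receives
# the chosen identifier. Both are hooks supplied by the caller.
chaos_round() {
    list_cmd=$1
    kill_cmd=$2
    victim=$("$list_cmd" | pick_victim)
    [ -n "$victim" ] || return 0    # nothing is running, nothing to do
    "$kill_cmd" "$victim"
}
```

A caller would define its own hooks, for instance `list_servers() { printf '%s\n' web-1 web-2 web-3; }` and `terminate() { echo "terminating $1"; }`, and then run `chaos_round list_servers terminate`. Wrapping that call in a loop with a varying `sleep` approximates the "unspecified intervals" of the abstract.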

Mentioned on the Netflix Tech Blog in December 2010
Explained in more detail in 2011
Code released to the community through the Netflix Open Source Software Center (OSS) in July 2012
Formalization of the definition of Chaos Engineering:

PUBLICATION

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production

MOTIVATION

Failures happen, and they inevitably happen when least desired. If your application can't tolerate a system failure would you rather find out by being paged at 3am or after you are in the office having already had your morning coffee?

Even if you are confident that your architecture can tolerate a system failure, are you sure it will still be able to next week, how about next month? Software is complex and dynamic, that "simple fix" you put in place last week could have undesired consequences.

Do your traffic load balancers correctly detect and route requests around system failures? Can you reliably rebuild your systems? Perhaps an engineer "quick patched" a live system last week and forgot to commit the changes to your source repository?

WHAT IS NETFLIX CHAOS MONKEY?

Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. [...] In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don't, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.
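The "limited set of hours" rule can be pictured as a simple guard that is checked before any termination. The sketch below is an assumption-laden illustration: the function name `is_monkey_time`, the variables `OPEN_HOUR` and `CLOSE_HOUR`, and the 9 to 15 weekday window are placeholders chosen here, not values taken from the quoted text.

```shell
#!/bin/sh
# Assumed working window; override through the environment if needed.
OPEN_HOUR=${OPEN_HOUR:-9}      # first hour in which the monkey may act
CLOSE_HOUR=${CLOSE_HOUR:-15}   # first hour in which it must stop again

# is_monkey_time HOUR DOW
# HOUR is 0-23; DOW is 1 (Monday) to 7 (Sunday), as printed by `date +%u`.
# Succeeds only on weekdays and only inside the working window.
is_monkey_time() {
    hour=$1
    dow=$2
    [ "$dow" -le 5 ] &&
        [ "$hour" -ge "$OPEN_HOUR" ] &&
        [ "$hour" -lt "$CLOSE_HOUR" ]
}
```

A scheduler could call `is_monkey_time "$(date +%H)" "$(date +%u)"` and skip the round whenever it fails, so that instances only disappear while engineers are around to respond.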

OTHER MEMBERS OF THE SIMIAN ARMY

Conformity Monkey
Janitor Monkey
Latency Monkey
Doctor Monkey
Security Monkey
10-18 Monkey
Chaos Gorilla
Chaos Kong

HOW DO YOU USE NETFLIX CHAOS MONKEY?

A handy Dockerized image:

docker run -it --rm \
    -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY=$AWS_ACCESS_KEY_ID \
    -e SIMIANARMY_CLIENT_AWS_SECRETKEY=$AWS_SECRET_ACCESS_KEY \
    -e SIMIANARMY_CLIENT_AWS_REGION=$AWS_REGION \
    -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
    -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
    -e SIMIANARMY_CHAOS_ASGTAG_KEY=chaos_monkey \
    -e SIMIANARMY_CHAOS_ASGTAG_VALUE=true \
    -e SIMIANARMY_CHAOS_LEASHED=false \
    mlafeldt/simianarmy

The properties that configure its behavior are detailed in the Netflix documentation

CHAOS MONKEY ATTACKS

Shutdown instance (Simius Mortus)
Block all network traffic (Simius Quies)
Detach all EBS volumes (Simius Amputa)
Burn-CPU (Simius Cogitarius)
Burn-IO (Simius Occupatus)
Fill Disk (Simius Plenus)
Kill Processes (Simius Delirius)

CHAOS MONKEY ATTACKS

Null-Route (Simius Desertus)
Fail DNS (Simius Nonomenius)
Fail EC2 API (Simius Noneccius)
Fail S3 API (Simius Amnesius)
Fail DynamoDB API (Simius Nodynamus)
Network Corruption (Simius Politicus)
Network Latency (Simius Tardus)
Network Loss (Simius Perditus)

3. OTHER SIMILAR INITIATIVES

CLOUD FOUNDRY: STREPSIRRHINI ARMY

Chaos Loris: destroys instances
Chaos Lemur: a Chaos Monkey for BOSH environments

AZURE: WAZMONKEY

Restarts instances
Repository with very little activity

4. DO-IT-YOURSELF CHAOS MONKEY

terraform destroy
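A gentler variant of the command above destroys a single, randomly chosen resource instead of the whole stack. The sketch below is one possible way to do it and makes several assumptions: it runs inside an initialized Terraform working directory, and the names `pick_resource`, `chaos_terraform` and `CHAOS_UNLEASHED` are hypothetical helpers introduced here. It only prints what it would do unless `CHAOS_UNLEASHED=1` is set.

```shell
#!/bin/sh
# pick_resource: read resource addresses on stdin (one per line) and
# print one of them at random.
pick_resource() {
    awk 'BEGIN { srand() }
         { addrs[NR] = $0 }
         END { if (NR > 0) print addrs[int(rand() * NR) + 1] }'
}

# chaos_terraform: choose one resource from the Terraform state of the
# current directory and destroy it. Dry run unless CHAOS_UNLEASHED=1.
chaos_terraform() {
    target=$(terraform state list | pick_resource)
    if [ -z "$target" ]; then
        echo "no resources in state" >&2
        return 1
    fi
    if [ "${CHAOS_UNLEASHED:-0}" = "1" ]; then
        # Really destroys the resource, without asking for confirmation.
        terraform destroy -target="$target" -auto-approve
    else
        echo "dry run: would destroy $target"
    fi
}
```

Running `chaos_terraform` as is reports the chosen victim; `CHAOS_UNLEASHED=1 chaos_terraform` carries out the attack, after which `terraform apply` should be able to rebuild the missing resource if the infrastructure is reproducible.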

Happy hacking!

github.com/lekum