Playing the (chaos) monkey

Alejandro Guirao Rodríguez

@lekum

github.com/lekum

Agenda

1. Precedents

2. Netflix and the Chaos Monkey

3. Other similar initiatives

4. Do-it-yourself Chaos Monkey

1. Precedents

Jesse Robbins and GameDay

  • "Master of Disaster" (availability program)
  • 2001, at Amazon
  • Resilience Engineering
  • GameDay
    • Advance notice
    • (Realistic) simulation of an incident
    • Business processes and communications
    • A culture of "being prepared"
    • MTBF and MTTR
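The MTBF and MTTR bullet can be made concrete with the usual steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is only illustrative: the function name and the sample figures are assumptions, not numbers from the talk.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure every 500 hours on average.
slow_recovery = availability(mtbf_hours=500, mttr_hours=4)
fast_recovery = availability(mtbf_hours=500, mttr_hours=0.5)

print(f"MTTR 4 h:   {slow_recovery:.4%}")
print(f"MTTR 0.5 h: {fast_recovery:.4%}")
```

Read this way, a rehearsal such as GameDay mostly targets the MTTR term: failures may be just as frequent, but a practiced team repairs them faster.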

Kripa Krishnan and DiRT

  • "Disaster Recovery Testing Event"
  • At Google
  • DiRT
    • 3-4 months' advance notice
    • Lasts several days
    • Complex failures
    • Technical team, coordinators, war rooms...
    • Final report with the required fixes

Antifragile: Things That Gain from Disorder

  • Published in 2012 by Nassim Nicholas Taleb
  • A mathematical, financial and philosophical theory

Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.

2. Netflix and the Chaos Monkey

Origin

  • Patent by Gregory S. Orzell and Yury Izrailevsky (September 20, 2010): "Validating the resiliency of networked applications"

Techniques are disclosed for validating the resiliency of a networked application made available using a collection of interacting servers. In one embodiment, a network monitoring application observes each running server (or application) and at unspecified intervals, picks one and terminates it. In the case of a cloud based deployment, this may include terminating a virtual machine instance or terminating a process running on the server. By observing the effects of the failed server on the rest of the network application, a provider can ensure that each component can tolerate any single instance disappearing without warning.
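The core step of the abstract, picking one running server and terminating it, can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the `terminate` callback and the server names are hypothetical, and an in-memory list stands in for a real fleet.

```python
import random
from typing import Callable, List, Optional


def terminate_random_instance(
    instances: List[str],
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one running instance at random and terminate it.

    `terminate` is a hypothetical callback; in a real deployment it would
    call the cloud provider's API or kill a process on the server.
    """
    if not instances:
        return None
    rng = rng or random.Random()
    victim = rng.choice(instances)
    terminate(victim)
    return victim


# In-memory stand-in for a fleet of servers (hypothetical names).
fleet = ["web-1", "web-2", "web-3"]
killed = terminate_random_instance(fleet, fleet.remove, random.Random(42))
print(f"terminated {killed}; survivors: {fleet}")
```

A scheduler would call this function at irregular intervals and then watch how the rest of the application reacts; that observation step is the part that actually validates resiliency.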

Publication

  • Mentioned on the Netflix Tech Blog in December 2010
  • Explained in more detail in 2011
  • Code released to the community through the Netflix Open Source Software Center (OSS) in July 2012
  • Formalization of the definition of Chaos Engineering:

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
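One way to read "experimenting" in this definition is as a before-and-after comparison of a steady-state metric around an injected failure. The sketch below assumes that reading; the toy system, the function names and the 1% tolerance are all hypothetical.

```python
from typing import Callable


def run_experiment(
    measure: Callable[[], float],
    inject_failure: Callable[[], None],
    tolerance: float,
) -> bool:
    """Return True if the metric stays within `tolerance` of its baseline."""
    baseline = measure()
    inject_failure()
    after = measure()
    return abs(after - baseline) <= tolerance * baseline


# Toy system: three replicas serving up to 100 req/s each, 150 req/s of demand.
replicas = [100.0, 100.0, 100.0]
demand = 150.0


def served_throughput() -> float:
    """Requests per second actually served, capped by remaining capacity."""
    return min(demand, sum(replicas))


def kill_one_replica() -> None:
    """Simulated failure: one replica disappears without warning."""
    replicas.pop()


print(run_experiment(served_throughput, kill_one_replica, tolerance=0.01))
```

With spare capacity the hypothesis holds after losing one replica; repeating the experiment on the two survivors would break it, which is exactly the kind of limit such experiments are meant to surface.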

Motivation

Failures happen, and they inevitably happen when least desired. If your application can't tolerate a system failure would you rather find out by being paged at 3am or after you are in the office having already had your morning coffee?

Even if you are confident that your architecture can tolerate a system failure, are you sure it will still be able to next week, how about next month? Software is complex and dynamic, that "simple fix" you put in place last week could have undesired consequences.

Do your traffic load balancers correctly detect and route requests around system failures? Can you reliably rebuild your systems? Perhaps an engineer "quick patched" a live system last week and forgot to commit the changes to your source repository?

What is Netflix Chaos Monkey?

Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. [...] In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don't, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.
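The two ideas in this description, acting per group and only during hours when engineers are around, can be sketched as follows. This is an assumption-laden illustration rather than Chaos Monkey's own logic: the weekday 09:00 to 15:00 window, the "at most one instance per group" rule and the instance identifiers are placeholders.

```python
import random
from datetime import datetime
from typing import Dict, List


def is_monkey_time(now: datetime, open_hour: int = 9, close_hour: int = 15) -> bool:
    """True on weekdays from open_hour (inclusive) to close_hour (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and open_hour <= now.hour < close_hour


def plan_terminations(
    groups: Dict[str, List[str]], now: datetime, rng: random.Random
) -> Dict[str, str]:
    """Choose at most one instance per group, and only inside the time window."""
    if not is_monkey_time(now):
        return {}
    return {name: rng.choice(members) for name, members in groups.items() if members}


# Hypothetical Auto Scaling Groups and instance ids.
groups = {"api": ["i-a1", "i-a2"], "web": ["i-w1"], "empty": []}
rng = random.Random(0)
print(plan_terminations(groups, datetime(2024, 1, 10, 11, 0), rng))  # a Wednesday
print(plan_terminations(groups, datetime(2024, 1, 13, 11, 0), rng))  # a Saturday
```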

Other members of the Simian Army

  • Conformity Monkey
  • Janitor Monkey
  • Latency Monkey
  • Doctor Monkey
  • Security Monkey
  • 10-18 Monkey
  • Chaos Gorilla
  • Chaos Kong

How do you use Netflix Chaos Monkey?

A handy Dockerized image:

docker run -it --rm \
    -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY=$AWS_ACCESS_KEY_ID \
    -e SIMIANARMY_CLIENT_AWS_SECRETKEY=$AWS_SECRET_ACCESS_KEY \
    -e SIMIANARMY_CLIENT_AWS_REGION=$AWS_REGION \
    -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
    -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
    -e SIMIANARMY_CHAOS_ASGTAG_KEY=chaos_monkey \
    -e SIMIANARMY_CHAOS_ASGTAG_VALUE=true \
    -e SIMIANARMY_CHAOS_LEASHED=false \
    mlafeldt/simianarmy

The properties that configure its behavior are detailed in the Netflix documentation.
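When scripting the command above, it may be safer to default to a leashed monkey. The sketch below rebuilds the same argument list in Python, on the assumption (taken from the SimianArmy documentation) that a leashed Chaos Monkey only logs what it would do. The function name and the placeholder credentials are hypothetical; the environment variable names are the ones shown in the command.

```python
from typing import Dict, List


def build_docker_args(env: Dict[str, str], leashed: bool = True) -> List[str]:
    """Build the `docker run` argument list from the slide, leashed by default."""
    settings = {
        "SIMIANARMY_CLIENT_AWS_ACCOUNTKEY": env["AWS_ACCESS_KEY_ID"],
        "SIMIANARMY_CLIENT_AWS_SECRETKEY": env["AWS_SECRET_ACCESS_KEY"],
        "SIMIANARMY_CLIENT_AWS_REGION": env["AWS_REGION"],
        "SIMIANARMY_CALENDAR_ISMONKEYTIME": "true",
        "SIMIANARMY_CHAOS_ASG_ENABLED": "true",
        "SIMIANARMY_CHAOS_ASGTAG_KEY": "chaos_monkey",
        "SIMIANARMY_CHAOS_ASGTAG_VALUE": "true",
        # Assumption: "true" means dry run, "false" means real terminations.
        "SIMIANARMY_CHAOS_LEASHED": "true" if leashed else "false",
    }
    args = ["docker", "run", "-it", "--rm"]
    for name, value in settings.items():
        args += ["-e", f"{name}={value}"]
    return args + ["mlafeldt/simianarmy"]


# Placeholder credentials so the example runs without touching a real account.
fake_env = {
    "AWS_ACCESS_KEY_ID": "EXAMPLE_KEY",
    "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET",
    "AWS_REGION": "eu-west-1",
}
print(" ".join(build_docker_args(fake_env)))
```

Passing `leashed=False` reproduces the command from the slide, which lets the monkey act on every Auto Scaling Group tagged `chaos_monkey=true`.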

Chaos Monkey attacks

  • Shutdown instance (Simius Mortus)

  • Block all network traffic (Simius Quies)

  • Detach all EBS volumes (Simius Amputa)

  • Burn-CPU (Simius Cogitarius)

  • Burn-IO (Simius Occupatus)

  • Fill Disk (Simius Plenus)

  • Kill Processes (Simius Delirius)

Chaos Monkey attacks (continued)

  • Null-Route (Simius Desertus)

  • Fail DNS (Simius Nonomenius)

  • Fail EC2 API (Simius Noneccius)

  • Fail S3 API (Simius Amnesius)

  • Fail DynamoDB API (Simius Nodynamus)

  • Network Corruption (Simius Politicus)

  • Network Latency (Simius Tardus)

  • Network Loss (Simius Perditus)

3. Other similar initiatives

Cloud Foundry: Strepsirrhini Army

Azure: WazMonkey

  • Reboots instances
  • Repository with very little activity

4. Do-it-yourself Chaos Monkey

terraform destroy
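The slide gives only the bare command. One possible do-it-yourself variant, sketched here as an assumption rather than as the speaker's method, is to destroy a single randomly chosen resource with Terraform's `-target` option instead of the whole stack. The resource addresses are hypothetical, and the script prints the command instead of running it.

```python
import random
from typing import List, Optional


def pick_victim(
    addresses: List[str], rng: Optional[random.Random] = None
) -> Optional[str]:
    """Pick one resource address at random, ignoring blank lines."""
    candidates = [address.strip() for address in addresses if address.strip()]
    if not candidates:
        return None
    return (rng or random.Random()).choice(candidates)


def destroy_command(address: str) -> List[str]:
    """Terraform command that destroys only the given resource address."""
    return ["terraform", "destroy", f"-target={address}", "-auto-approve"]


# Hypothetical output of `terraform state list`, one address per line.
state_listing = """aws_instance.web[0]
aws_instance.web[1]
aws_instance.worker
"""
victim = pick_victim(state_listing.splitlines(), random.Random(7))
print(" ".join(destroy_command(victim)))
```

To run it for real, the list returned by `destroy_command` could be passed to `subprocess.run`; a later `terraform apply` should then recreate the missing resource, which doubles as a test that the infrastructure can be rebuilt reliably.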

Happy hacking!
