Alejandro Guirao Rodríguez
Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.
Techniques are disclosed for validating the resiliency of a networked application made available using a collection of interacting servers. In one embodiment, a network monitoring application observes each running server (or application) and at unspecified intervals, picks one and terminates it. In the case of a cloud based deployment, this may include terminating a virtual machine instance or terminating a process running on the server. By observing the effects of the failed server on the rest of the network application, a provider can ensure that each component can tolerate any single instance disappearing without warning.
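The core loop described above (observe the running servers, pick one, terminate it) can be sketched in a few lines. This is only an illustrative simulation under stated assumptions: `ServerGroup` and `terminate_random_instance` are hypothetical names, the instance ids are made up, and the "termination" just removes an entry from an in-memory list instead of calling any cloud provider API.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServerGroup:
    """In-memory stand-in for a group of interchangeable servers (hypothetical)."""
    name: str
    instances: List[str] = field(default_factory=list)

    def terminate(self, instance_id: str) -> None:
        # A real tool would call the provider's API here; this sketch only
        # forgets the instance so the effect on the group can be observed.
        self.instances.remove(instance_id)


def terminate_random_instance(group: ServerGroup, rng: random.Random) -> Optional[str]:
    """Pick one running instance at random and terminate it.

    Returns the terminated instance id, or None if the group is empty.
    """
    if not group.instances:
        return None
    victim = rng.choice(group.instances)
    group.terminate(victim)
    return victim


if __name__ == "__main__":
    group = ServerGroup("web", ["i-aaa", "i-bbb", "i-ccc"])
    victim = terminate_random_instance(group, random.Random())
    print(f"terminated {victim}; still running: {group.instances}")
```

Passing the random generator in as a parameter is a design choice for the sketch: a seeded `random.Random` makes a run repeatable when reproducing a failure, while an unseeded one keeps the selection unpredictable.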
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Failures happen, and they inevitably happen when least desired. If your application can't tolerate a system failure, would you rather find out by being paged at 3am or after you are in the office having already had your morning coffee?
Even if you are confident that your architecture can tolerate a system failure, are you sure it will still be able to next week? How about next month? Software is complex and dynamic, and that "simple fix" you put in place last week could have undesired consequences.
Do your traffic load balancers correctly detect and route requests around system failures? Can you reliably rebuild your systems? Perhaps an engineer "quick patched" a live system last week and forgot to commit the changes to your source repository?
Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. [...] In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don't, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.
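The "limited set of hours" rule amounts to a gate that is checked before any termination is attempted. A minimal sketch of such a gate follows. The function name `is_monkey_time` and the specific window (weekdays, 09:00 to 15:00 in the caller's local time) are assumptions chosen for illustration, not values stated in the text above.

```python
from datetime import datetime

# Assumed window, not taken from the text: weekdays, 09:00 to 15:00 local time.
OPEN_HOUR = 9
CLOSE_HOUR = 15


def is_monkey_time(now: datetime,
                   open_hour: int = OPEN_HOUR,
                   close_hour: int = CLOSE_HOUR) -> bool:
    """Return True only on weekdays inside [open_hour, close_hour)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and open_hour <= now.hour < close_hour


if __name__ == "__main__":
    if is_monkey_time(datetime.now()):
        print("inside the chaos window: terminations allowed")
    else:
        print("outside the chaos window: doing nothing")
```

Taking `now` as an argument rather than reading the clock inside the function keeps the gate easy to check against fixed dates.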
Hands-on with the Dockerized image:
docker run -it --rm \
  -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY=$AWS_ACCESS_KEY_ID \
  -e SIMIANARMY_CLIENT_AWS_SECRETKEY=$AWS_SECRET_ACCESS_KEY \
  -e SIMIANARMY_CLIENT_AWS_REGION=$AWS_REGION \
  -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
  -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
  -e SIMIANARMY_CHAOS_ASGTAG_KEY=chaos_monkey \
  -e SIMIANARMY_CHAOS_ASGTAG_VALUE=true \
  -e SIMIANARMY_CHAOS_LEASHED=false \
  mlafeldt/simianarmy
The properties that configure its behavior are detailed in the Netflix documentation.
Shutdown instance (Simius Mortus)
Block all network traffic (Simius Quies)
Detach all EBS volumes (Simius Amputa)
Burn-CPU (Simius Cogitarius)
Burn-IO (Simius Occupatus)
Fill Disk (Simius Plenus)
Kill Processes (Simius Delirius)
Null-Route (Simius Desertus)
Fail DNS (Simius Nonomenius)
Fail EC2 API (Simius Noneccius)
Fail S3 API (Simius Amnesius)
Fail DynamoDB API (Simius Nodynamus)
Network Corruption (Simius Politicus)
Network Latency (Simius Tardus)
Network Loss (Simius Perditus)