SURVIVING CHAOS
A field guide
with Andrew Hobden of PingCAP
Thank you to our hosts.
Vancouver was founded on the unceded traditional territories of the Musqueam, Squamish and Tsleil-Waututh First Nations.
GET THIS CONTENT
github.com/hoverbear/chaos-workshop
pingcap.gitbook.io/chaos-workshop
CHAOS
When failure strikes
What is Chaos Engineering?
TLDR:
The practice of causing faults to find bugs, refine operations, and improve resiliency.
Why do Chaos Engineering?
- Find new bugs in testing before finding them in production.
- Learn about points of failure in your infrastructure.
- Break things for fun and profit.
- Improve operations experience.
GOALS & NON-GOALS
Start small, work your way up
Goals
- Learn the basics of Chaos Engineering
- Learn to break some stuff (in a controlled way)
- Get ideas about what to test in your system
- Get ideas about how to include chaos in your regular testing
Non-Goals
- Leave with a full-fledged Chaos Engineering system
- Spend a lot of time installing and setting up complex tools
- Teach you to solve bugs discovered this way
- Show you a magic bullet
PRINCIPLES OF CHAOS
Via principlesofchaos.org
Determine a Steady State
We need a way to compare a control and an experiment.
We can do this by determining key metrics about the system.
Example:
A database could measure QPS during a TPC-H benchmark.
A REST API could measure the average response time of a sample workload.
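A minimal Rust sketch of comparing a control run against an experiment run, using average response time as the key metric. The `SteadyState` type, the function names, and the idea of a fixed percentage tolerance are assumptions made for illustration; they are not part of the workshop material.

```rust
/// Summary of one run of a sample workload (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SteadyState {
    /// Mean response time in milliseconds.
    pub mean_ms: f64,
}

/// Compute the mean response time from raw latency samples (in milliseconds).
/// Returns `None` when there are no samples, since the mean is undefined then.
pub fn measure(samples_ms: &[f64]) -> Option<SteadyState> {
    if samples_ms.is_empty() {
        return None;
    }
    let sum: f64 = samples_ms.iter().sum();
    Some(SteadyState {
        mean_ms: sum / samples_ms.len() as f64,
    })
}

/// True when the experiment is at most `tolerance` slower than the control
/// (0.10 means "up to 10% slower still counts as steady").
pub fn within_tolerance(control: SteadyState, experiment: SteadyState, tolerance: f64) -> bool {
    experiment.mean_ms <= control.mean_ms * (1.0 + tolerance)
}
```

Measure the control once without faults, measure again while injecting a fault, then call `within_tolerance` with a threshold that matches your hypothesis.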
Hypothesize Failure Effects
Consider what types of failures the system should tolerate before:
- Degraded Performance
- Complete failure
Example:
PostgreSQL should tolerate a replica failing without performance loss.
Losing 4/5ths of your Rails workers should cause a performance loss.
Introduce Failures
Using a variety of tools and techniques, start applying failure situations.
Example:
Use Pumba to test how your new containerized infrastructure might behave in a lossy network.
Try to Disprove Hypotheses
If any of your tests cause your hypothesis to be disproved it means one of three things:
- Your test isn't correct. (Check and try more!)
- Your hypothesis was wrong. (Check your assumptions)
- You have a legitimate bug. (Get it reproducible and make a test!)
ADVANCED PRINCIPLES
Hypothesize via Steady State
Instead of focusing on the internal workings of the system, focus on the steady state metrics.
Try to verify that the system does work, instead of just validating how it works.
Vary Real World Events
Prioritize your testing to reflect real world events and demands.
Consider high impact and high frequency failures first.
Example:
Web services may frequently receive malformed responses.
Databases may lose valuable data when their drives (infrequently) fail.
Experiment on Production
At the start you may use a staging environment, and captured sample workloads.
However, this is just a simulation.
The only way to authentically develop confidence is to test in production.
Chaos lends itself well to production environments, as your system should withstand the failures.
Experiment Continuously
Manual tests are slow and expensive. Ultimately they are unsustainable.
Instead, work towards regular, automated chaos testing.
Doing this requires automation in both orchestration (getting the tests to run) and analysis (identifying and warning on failures).
Minimize Blast Radius
Testing in production may cause some short-term pain for the team and users, but it is for the long term good.
Still, it is the responsibility and obligation of the Chaos Engineer to ensure that experiments can be contained and recovered from.
KINDS OF FAILURE
Commonly considered situations
Disk
Disks, like all hardware, fail due to age, wear, or misuse.
Common Failures:
- A file that was expected was not present. (open fails)
- A file that was not expected was present. (create fails)
- A file was removed after being opened. (read / write fails)
- A file contains data that is invalid to the reader. (encoding mismatch, missing/extra data)
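The first two disk failures are easy to reproduce with the standard library alone. The sketch below is an illustration under assumptions: the helper names, the scratch directory under the system temp dir, and the file name `state.db` are all made up for the example.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, ErrorKind};
use std::path::Path;

/// Try to open a file that is expected to exist; report the error kind on failure.
pub fn open_expected(path: &Path) -> Result<File, ErrorKind> {
    File::open(path).map_err(|e| e.kind())
}

/// Try to create a file that is expected NOT to exist yet.
/// `create_new(true)` makes the call fail if the file is already present.
pub fn create_unexpected(path: &Path) -> Result<File, ErrorKind> {
    OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(path)
        .map_err(|e| e.kind())
}

/// Provoke both failures inside a scratch directory and return the two error kinds.
pub fn demo() -> io::Result<(ErrorKind, ErrorKind)> {
    let dir = std::env::temp_dir().join(format!("chaos-disk-demo-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    let path = dir.join("state.db");
    let _ = fs::remove_file(&path); // start from a clean slate

    // Fault 1: the expected file is missing, so open fails.
    let missing = open_expected(&path).unwrap_err();

    // Fault 2: the file already exists, so the exclusive create fails.
    File::create(&path)?;
    let present = create_unexpected(&path).unwrap_err();

    fs::remove_dir_all(&dir)?;
    Ok((missing, present))
}
```

A chaos test would apply the same faults to the files your real service depends on, then check that the service reports the error instead of crashing or silently continuing.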
Network
The tendrils that connect machines are imperfect, and so are the operators.
Common Failures:
- One node becomes isolated from the rest.
- A partition isolates two (or more) distinct node groups.
- Two particular nodes can no longer communicate.
- Malformed (or outright hostile) requests
- Increased probability of packet corruption (forcing re-transmits)
- The network becomes intolerably slow at some or all links.
Scheduler
In a multi-threaded, multi-machine environment, the ordering of events is not constant.
Common Failures:
The system expects to have events ABC in order and gets ACB instead.
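A small Rust sketch of why ordering cannot be assumed, plus a checker a test could use to spot reordering. The function names and the use of sequence numbers 1, 2, 3 to stand for events A, B, C are assumptions for illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawn three workers that each emit one event (sequence numbers 1, 2, 3).
/// The arrival order depends on the scheduler and may differ between runs.
pub fn collect_events() -> Vec<u64> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = (1..=3u64)
        .map(|seq| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(seq).expect("receiver is alive"))
        })
        .collect();
    drop(tx); // drop the original sender so the channel closes once workers finish
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    rx.iter().collect()
}

/// Return the position of the first event that arrives out of order,
/// or `None` if every sequence number is greater than the one before it.
pub fn first_out_of_order(seqs: &[u64]) -> Option<usize> {
    seqs.windows(2).position(|w| w[1] <= w[0]).map(|i| i + 1)
}
```

For the ACB case from the slide, `first_out_of_order(&[1, 3, 2])` points at index 2, the late B. Running `collect_events` repeatedly may or may not show a reordering on a given machine, which is exactly why tools that perturb scheduling are useful.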
Power
Achieving 100% reliable power for the entire infrastructure is still only just a hope. Components fail and leave machines offline.
Common Failures:
- A machine loses power, and recovers some time later.
(Try to find the worst cases possible for this)
- A machine reboots, disappearing and reappearing a minute later.
- A machine with persistent data reboots, and returns with corrupted data.
Byzantine
Trying to get a sufficiently large number of nodes to agree on something while a small number of bad actors subvert the system.
Common Failures:
- A node starts sending messages it shouldn't be, under the influence of a bad actor.
- A node is under the wrong impression about the state of the system due to some bug, and sends messages it shouldn't.
(E.g. "I'm the master replica now, listen to me!")
Data Center
In many distributed systems, cross-data-center deployments are used to help protect against regional failures.
Common Failures:
- One or more data centers lose connectivity.
(E.g. an undersea fiber is cut)
- A data center is lost entirely and a new one needs to be bootstrapped.
SETTING UP YOUR TOOLKIT
Some assembly required
Options
- Use one of the machines we spawned on DigitalOcean.
- Spawn your own with Terraform.
- Provision your own on a VM/hardware. (Vagrant?)
For some experiments you may want more than one machine. Feel free to boot up machines however you please and install tools as needed.
This is just for ease of use and reference.
Spawn New VM(s)
git clone https://github.com/hoverbear/chaos-workshop
cd chaos-workshop
terraform init
terraform apply -var="digitalocean_token=$TOKEN"
Provision Existing VM(s)
git clone https://github.com/hoverbear/chaos-workshop
cd chaos-workshop
./provision.sh
Using Ubuntu 18.04
Please, do not use a container.
Use ours!
Pass is "pawlooza", user is "root"
EXPLORING THE TOOLS
In Depth Instructions in the GitBook!
kill
- HUP
- INT
- KILL
- TERM
KILL a service that is necessary. Is there an interruption? How long until it is detected?
HUP the same service. Is there any interruption?
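The KILL experiment can be sketched from a Rust test harness. This is an illustration under assumptions: `sleep 30` stands in for the necessary service, the function name is made up, and the code is Unix-only. Note that `Child::kill` always sends SIGKILL on Unix; sending HUP, INT, or TERM would need the `kill` command or a signal crate.

```rust
use std::io;
use std::os::unix::process::ExitStatusExt;
use std::process::Command;
use std::time::{Duration, Instant};

/// Start a stand-in service, KILL it, and report which signal ended it
/// and how long it took for the parent to observe the exit.
pub fn kill_and_observe() -> io::Result<(Option<i32>, Duration)> {
    // `sleep 30` is a placeholder for the service under test.
    let mut child = Command::new("sleep").arg("30").spawn()?;
    let started = Instant::now();

    child.kill()?; // on Unix this sends SIGKILL
    let status = child.wait()?; // blocks until the exit is observed

    // `signal()` is `Some(n)` when the process was terminated by signal `n`.
    Ok((status.signal(), started.elapsed()))
}
```

In a real experiment the interesting number is not how fast the parent notices, but how long the rest of the system (load balancer, supervisor, clients) takes to detect the loss and recover.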
ip
- link down/up
- MTU manipulation
Take down a link during an HTTP download. Put it up again at varying times later, see if it resumes.
Try changing the MTU to something abnormally small/large to simulate misconfiguration. Does anything happen?
tc
- Delay
- Loss
- Delay distributions
Cause a delay between two services. See how big you can make it before things break.
Drop 1% of packets, then 10%. Do things still work? How far can you push it?
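One way to drive these experiments from a test harness is to build the `tc` netem arguments in code. The sketch below assumes a Linux host, root privileges, and an interface named `eth0` in the usage note; the `NetemFault` type and function names are hypothetical. The resulting commands are the standard `tc qdisc add dev <dev> root netem delay <n>ms` and `... netem loss <n>%` forms.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// A network fault to inject with tc's netem queueing discipline.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum NetemFault {
    /// Fixed delay added to every outgoing packet, in milliseconds.
    DelayMs(u32),
    /// Percentage of outgoing packets to drop at random.
    LossPercent(f32),
}

/// Build the arguments for `tc qdisc add dev <dev> root netem ...`.
pub fn netem_add_args(dev: &str, fault: NetemFault) -> Vec<String> {
    let mut args: Vec<String> = ["qdisc", "add", "dev", dev, "root", "netem"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    match fault {
        NetemFault::DelayMs(ms) => args.extend(vec!["delay".to_string(), format!("{}ms", ms)]),
        NetemFault::LossPercent(p) => args.extend(vec!["loss".to_string(), format!("{}%", p)]),
    }
    args
}

/// Build the arguments that remove the fault again: `tc qdisc del dev <dev> root`.
pub fn netem_clear_args(dev: &str) -> Vec<String> {
    ["qdisc", "del", "dev", dev, "root"]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Run `tc` with the given arguments. Needs root on a Linux host.
pub fn run_tc(args: &[String]) -> io::Result<ExitStatus> {
    Command::new("tc").args(args).status()
}
```

For example, `run_tc(&netem_add_args("eth0", NetemFault::LossPercent(1.0)))` applies 1% loss, and `run_tc(&netem_clear_args("eth0"))` removes it. Always pair an add with a clear so the blast radius stays contained.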
iptables
- Loss
- Partitioning
Drop 1% of packets, then 10%. Is it harder than tc? Easier?
Cause a host to become unreachable for the node.
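The same two experiments with iptables, again sketched as argument builders in Rust. The function names and the address `10.0.0.2` in the usage note are assumptions for illustration; the rules use the `statistic` match in random mode for loss and a plain `-d <host> -j DROP` rule on the OUTPUT chain to make one host unreachable from this node. Running them requires root.

```rust
use std::net::Ipv4Addr;

/// Convert string slices into the owned argument list `Command::args` accepts.
fn to_args(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|s| s.to_string()).collect()
}

/// Rule that drops incoming packets at random.
/// `probability` is a fraction between 0.0 and 1.0 (0.01 means 1%).
pub fn random_drop_args(probability: f64) -> Vec<String> {
    let p = probability.to_string();
    to_args(&[
        "-A", "INPUT", "-m", "statistic", "--mode", "random",
        "--probability", p.as_str(), "-j", "DROP",
    ])
}

/// Rule that drops everything this node sends to `host`,
/// making that host unreachable from here.
pub fn unreachable_args(host: Ipv4Addr) -> Vec<String> {
    let ip = host.to_string();
    to_args(&["-A", "OUTPUT", "-d", ip.as_str(), "-j", "DROP"])
}

/// Turn an append rule into the matching delete rule by swapping `-A` for `-D`.
pub fn undo_args(rule: &[String]) -> Vec<String> {
    rule.iter()
        .map(|arg| if arg.as_str() == "-A" { "-D".to_string() } else { arg.clone() })
        .collect()
}
```

Pass the result to `std::process::Command::new("iptables").args(...)`, for instance with `unreachable_args(Ipv4Addr::new(10, 0, 0, 2))`, and run the `undo_args` version afterwards to heal the partition.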
nmz
- Scheduling
pumba
- Container based networking
Simulate packet loss between a database container and a web host container.
FAILURE AS A FEATURE
Injecting failures using libraries
Fail-rs
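Baking in Failure
Inject failures via environment variables.
use fail::{fail_point, FailScenario};
fn main() {
    // In fail 0.3 and later, setup() reads FAILPOINTS from the environment.
    let scenario = FailScenario::setup();
    fail_point!("foo");
    scenario.teardown();
}
# Fail points are compiled out unless the `failpoints` feature is enabled (fail 0.3+).
FAILPOINTS="foo=panic" cargo run --features fail/failpoints
Failure in Tests
use fail::fail_point;
fn foo() {
    fail_point!("foo");
}
#[test]
#[should_panic] // the configured fail point panics inside foo()
fn test_foo() {
    fail::cfg("foo", "panic").unwrap();
    foo();
}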
CONTINUOUS CHAOS
Chaos in your CI
- cron based jobs (avoid unrelated failures)
- Collect as much trace as possible
- Try to find regressions
Some Examples
- spinnaker.io (Chaos Monkey)
- gremlin.com
Q & A
& Thanks
PS: We're hiring!
Remote friendly, Rust, Go, C++.
Surviving Chaos
By hoverbear