Vancouver was founded on the unceded traditional territories of the Musqueam, Squamish and
Tsleil-Waututh First Nations.
TLDR:
The practice of causing faults to find bugs, refine operations, and improve resiliency.
Learn the basics of Chaos Engineering
Learn to break some stuff (in a controlled way)
Get ideas about what to test in your system
Get ideas about how to include chaos in your regular testing
Leave with a full-fledged Chaos Engineering system
Spend a lot of time installing and setting up complex tools
Teach you to solve bugs discovered this way
Show you a magic bullet
We need a way to compare a control and an experiment.
We can do this by determining key metrics about the system.
Example:
For a database, this could be QPS during a TPC-H benchmark.
For a REST API, this could be the average response time of a sample workload.
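A minimal sketch of capturing such a metric, assuming a hypothetical health endpoint at http://localhost:8080/health:
# Average response time over 100 requests (endpoint and count are illustrative).
for i in $(seq 1 100); do
  curl -o /dev/null -s -w '%{time_total}\n' http://localhost:8080/health
done | awk '{ sum += $1 } END { print "avg seconds:", sum / NR }'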
Consider beforehand what types of failures the system should tolerate:
Example:
PostgreSQL should tolerate a replica failing without performance loss.
Losing 4/5ths of your Rails workers should cause a performance loss.
Using a variety of tools and techniques, start applying failure situations.
Example:
Use Pumba to test how your new containerized infrastructure might behave in a lossy network.
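A rough sketch with Pumba, assuming a container named my_app (flags vary between Pumba versions, so check pumba netem --help):
# Add 3 seconds of latency to my_app's traffic for 5 minutes.
pumba netem --duration 5m delay --time 3000 my_app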
If any of your tests disprove your hypothesis, it means one of three things:
Instead of focusing on the internal workings of the system, focus on its steady-state metrics.
Try to verify that the system does work, instead of just validating how it works.
Prioritize your testing to reflect real-world events and demands.
Consider high-impact and high-frequency failures first.
Example:
Web services may frequently receive malformed responses.
Databases may lose valuable data when their drives (infrequently) fail.
At the start, you may use a staging environment and captured sample workloads.
However, this is just a simulation.
The only way to authentically develop confidence is to test in production.
Chaos lends itself well to production environments, as your system should withstand the failures.
Manual tests are slow and expensive. Ultimately they are unsustainable.
Instead, work towards regular, automated chaos testing.
Doing this requires automation in both orchestration (getting the tests to run) and analysis (identifying and alerting on failures).
Testing in production may cause some short-term pain for the team and users, but it is for the long-term good.
Still, it is the responsibility and obligation of the Chaos Engineer to ensure that experiments can be contained and recovered from.
Disks, like all hardware, fail due to age, wear, or misuse.
A file that was expected was not present. (open fails)
A file that was not expected was present. (create fails)
A file was removed after being opened. (read / write fails)
A file contains data that is invalid to the reader. (encoding mismatch, missing/extra data)
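One low-tech way to provoke some of these on a throwaway machine is to point the service at a deliberately tiny filesystem (paths and sizes here are only illustrative):
mkdir -p /mnt/tiny
mount -t tmpfs -o size=8m tmpfs /mnt/tiny
dd if=/dev/zero of=/mnt/tiny/filler bs=1M count=16   # fails partway with "No space left on device"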
The tendrils that connect machines are imperfect, and so are the operators.
One node becomes isolated from the rest.
A partition isolates two (or more) distinct node groups.
Two particular nodes can no longer communicate.
Malformed (or outright hostile) requests
Increased probability of packet corruption (forcing re-transmits)
The network becomes intolerably slow at some or all links.
In a multi-thread, multi-machine environment, the ordering of events is not constant.
The system expects to have events ABC in order and gets ACB instead.
Achieving 100% reliable power for the entire infrastructure is still little more than a hope. Components fail and leave machines offline.
A machine loses power, and recovers some time later.
(Try to find the worst cases possible for this)
A machine reboots, disappearing and reappearing a minute later.
A machine with persistent data reboots, and returns with corrupted data.
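A crude way to approximate an abrupt power loss on a disposable VM (never on a machine you care about) is the kernel's magic SysRq trigger:
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger   # reboots immediately, without syncing or unmounting filesystems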
Trying to get a sufficiently large number of nodes to agree on something while a small number of bad actors subvert the system.
A node starts sending messages it shouldn't be, under the influence of a bad actor.
A node is under the wrong impression about the state of the system due to some bug, and sends messages it shouldn't.
(Eg "I'm the master replica now, listen to me!")
In many distributed systems cross-data center deployments are used to help protect against regional failures.
One or more data centers lose connectivity
(Eg. an undersea fiber is cut)
Data center is lost entirely and a new one needs to be bootstrapped.
For some experiments you may want more than one machine. Feel free to boot up machines however you please and install tools as needed.
This is just for ease of use and reference.
git clone https://github.com/hoverbear/chaos-workshop
cd chaos-workshop
terraform init
terraform apply -var="digitalocean_token=$TOKEN"
git clone https://github.com/hoverbear/chaos-workshop
cd chaos-workshop
./provision.sh
Using Ubuntu 18.04
Please, do not use a container.
Password is "pawlooza", user is "root"
KILL a service that is necessary. Is there an interruption? How long until it is detected?
HUP the same service. Is there any interruption?
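For example, assuming a systemd-managed service named nginx (both the manager and the name are assumptions):
systemctl kill -s SIGKILL nginx   # hard kill: is there an interruption, and how fast is it detected?
systemctl kill -s SIGHUP nginx    # many daemons treat HUP as a config reload rather than a stop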
Take down a link during an HTTP download. Bring it up again at varying times later and see if the download resumes.
Try changing the MTU to something abnormally small/large to simulate misconfiguration. Does anything happen?
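Sketch using iproute2, assuming the interface is eth0 (don't take down the interface your SSH session uses):
ip link set eth0 down               # drop the link mid-download...
sleep 30 && ip link set eth0 up     # ...then restore it and see if the transfer resumes
ip link set eth0 mtu 576            # shrink the MTU to simulate a misconfiguration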
Cause a delay between two services. See how big you can make it before things break.
Drop 1% of packets, then 10%. Do things still work? How far can you push it?
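With tc's netem qdisc (interface name is again an assumption):
tc qdisc add dev eth0 root netem delay 200ms   # add latency; raise it until something breaks
tc qdisc change dev eth0 root netem loss 1%    # then drop 1% of packets...
tc qdisc change dev eth0 root netem loss 10%   # ...then 10%
tc qdisc del dev eth0 root netem               # clean up afterwards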
Drop 1% of packets, then 10%. Is it harder than tc? Easier?
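Assuming the other tool meant here is iptables, a sketch with its statistic match:
iptables -A INPUT -m statistic --mode random --probability 0.01 -j DROP   # drop ~1% of inbound packets
iptables -D INPUT -m statistic --mode random --probability 0.01 -j DROP   # remove the rule when done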
Cause a host to become unreachable from the node.
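One way, using a blackhole route toward a hypothetical peer at 10.0.0.5:
ip route add blackhole 10.0.0.5/32   # packets to that host are silently dropped
ip route del blackhole 10.0.0.5/32   # restore reachability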
Simulate packet loss between a database container and a web host container.
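With Pumba again, assuming the database container is named my_database (percentage and duration are illustrative):
pumba netem --duration 2m loss --percent 20 my_database   # drop 20% of the container's packets for two minutes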
Inject failures via environment variables.
use fail::fail_point;

fn main() {
    fail_point!("foo"); // panics when run with FAILPOINTS="foo=panic" (some fail crate versions need a setup call first)
}
FAILPOINTS="foo=panic" cargo run
fn foo() {
    fail_point!("foo");
}

#[test]
#[should_panic]
fn test_foo() {
    fail::cfg("foo", "panic").unwrap(); // arm the fail point: foo() now panics
    foo();
}
PS: We're hiring!
Remote friendly, Rust, Go, C++.