Vancouver was founded on the unceded traditional territories of the Musqueam, Squamish and
Tsleil-Waututh First Nations.
TLDR:
The practice of causing faults to find bugs, refine operations, and improve resiliency.
Learn the basics of Chaos Engineering
Learn to break some stuff (in a controlled way)
Get ideas about what to test in your system
Get ideas about how to include chaos in your regular testing
Leave with a full-fledged Chaos Engineering system
Spend a lot of time installing and setting up complex tools
Teach you to solve bugs discovered this way
Show you a magic bullet
We need a way to compare a control and an experiment.
We can do this by determining key metrics about the system.
Example:
For a database, this could be QPS during a TPC-H benchmark.
For a REST API, this could be the average response time of a sample workload.
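A minimal sketch of capturing such a metric, assuming a hypothetical health endpoint at http://localhost:8080/health:
# Average response time over 100 requests (endpoint and count are illustrative).
for i in $(seq 1 100); do
  curl -o /dev/null -s -w '%{time_total}\n' http://localhost:8080/health
done | awk '{ sum += $1 } END { print "avg seconds:", sum / NR }'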
Consider beforehand what types of failures the system should tolerate:
Example:
PostgreSQL should tolerate a replica failing without performance loss.
Losing 4/5ths of your Rails workers should cause a performance loss.
Using a variety of tools and techniques, start applying failure situations.
Example:
Use Pumba to test how your new containerized infrastructure might behave in a lossy network.
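A rough sketch with Pumba, assuming a container named my_app (flags vary between Pumba versions, so check pumba netem --help):
# Add 3 seconds of latency to my_app's traffic for 5 minutes.
pumba netem --duration 5m delay --time 3000 my_app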
If any of your tests disprove your hypothesis, it means one of three things:
Instead of focusing on the internal workings of the system, focus on its steady-state metrics.
Try to verify that the system does work, instead of just validating how it works.
Prioritize your testing to reflect real-world events and demands.
Consider high-impact and high-frequency failures first.
Example:
Web services may frequently receive malformed responses.
Databases may lose valuable data when their drives (infrequently) fail.
At the start, you may use a staging environment and captured sample workloads.
However, this is just a simulation.
The only way to authentically develop confidence is to test in production.
Chaos lends itself well to production environments, as your system should withstand the failures.
Manual tests are slow and expensive. Ultimately they are unsustainable.
Instead, work towards regular, automated chaos testing.
Doing this requires automation in both orchestration (getting the tests to run) and analysis (identifying and alerting on failures).
Testing in production may cause some short-term pain for the team and users, but it is for the long-term good.
Still, it is the responsibility and obligation of the Chaos Engineer to ensure that experiments can be contained and recovered from.
Disks, like all hardware, fail due to age, wear, or misuse.
A file that was expected was not present. (open fails)
A file that was not expected was present. (create fails)
A file was removed after being opened. (read / write fails)
A file contains data that is invalid to the reader. (encoding mismatch, missing/extra data)
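One low-tech way to provoke some of these on a throwaway machine is to point the service at a deliberately tiny filesystem (paths and sizes here are only illustrative):
mkdir -p /mnt/tiny
mount -t tmpfs -o size=8m tmpfs /mnt/tiny
dd if=/dev/zero of=/mnt/tiny/filler bs=1M count=16   # fails partway with "No space left on device"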
The tendrils that connect machines are imperfect, and so are the operators.
One node becomes isolated from the rest.
A partition isolates two (or more) distinct node groups.
Two particular nodes can no longer communicate.
Malformed (or outright hostile) requests
Increased probability of packet corruption (forcing re-transmits)
The network becomes intolerably slow at some or all links.
In a multi-thread, multi-machine environment, the ordering of events is not constant.
The system expects to have events ABC in order and gets ACB instead.
Achieving 100% reliable power for the entire infrastructure is still little more than a hope. Components fail and leave machines offline.
A machine loses power, and recovers some time later.
(Try to find the worst cases possible for this)
A machine reboots, disappearing and reappearing a minute later.
A machine with persistent data reboots, and returns with corrupted data.
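A crude way to approximate an abrupt power loss on a disposable VM (never on a machine you care about) is the kernel's magic SysRq trigger:
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger   # reboots immediately, without syncing or unmounting filesystems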
Trying to get a sufficiently large number of nodes to agree on something while a small number of bad actors subvert the system.
A node starts sending messages it shouldn't be, under the influence of a bad actor.
A node is under the wrong impression about the state of the system due to some bug, and sends messages it shouldn't.
(Eg "I'm the master replica now, listen to me!")
In many distributed systems cross-data center deployments are used to help protect against regional failures.
One or more data centers lose connectivity
(Eg. an undersea fiber is cut)
Data center is lost entirely and a new one needs to be bootstrapped.
For some experiments you may want more than one machine. Feel free to boot up machines however you please and install tools as needed.
This is just for ease of use and reference.
git clone https://github.com/hoverbear/chaos-workshop
cd chaos-workshop
terraform init
terraform apply -var="digitalocean_token=$TOKEN"
git clone https://github.com/hoverbear/chaos-workshop
cd chaos-workshop
./provision.sh
Using Ubuntu 18.04
Please, do not use a container.
Password is "pawlooza", user is "root"
KILL a service that is necessary. Is there an interruption? How long until it is detected?
HUP the same service. Is there any interruption?
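For example, assuming a systemd-managed service named nginx (both the manager and the name are assumptions):
systemctl kill -s SIGKILL nginx   # hard kill: is there an interruption, and how fast is it detected?
systemctl kill -s SIGHUP nginx    # many daemons treat HUP as a config reload rather than a stop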
Take down a link during an HTTP download. Bring it up again at varying times later and see if the download resumes.
Try changing the MTU to something abnormally small/large to simulate misconfiguration. Does anything happen?
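Sketch using iproute2, assuming the interface is eth0 (don't take down the interface your SSH session uses):
ip link set eth0 down               # drop the link mid-download...
sleep 30 && ip link set eth0 up     # ...then restore it and see if the transfer resumes
ip link set eth0 mtu 576            # shrink the MTU to simulate a misconfiguration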
Cause a delay between two services. See how big you can make it before things break.
Drop 1% of packets, then 10%. Do things still work? How far can you push it?
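With tc's netem qdisc (interface name is again an assumption):
tc qdisc add dev eth0 root netem delay 200ms   # add latency; raise it until something breaks
tc qdisc change dev eth0 root netem loss 1%    # then drop 1% of packets...
tc qdisc change dev eth0 root netem loss 10%   # ...then 10%
tc qdisc del dev eth0 root netem               # clean up afterwards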
Drop 1% of packets, then 10%. Is it harder than tc? Easier?
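Assuming the other tool meant here is iptables, a sketch with its statistic match:
iptables -A INPUT -m statistic --mode random --probability 0.01 -j DROP   # drop ~1% of inbound packets
iptables -D INPUT -m statistic --mode random --probability 0.01 -j DROP   # remove the rule when done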
Cause a host to become unreachable from the node.
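One way, using a blackhole route toward a hypothetical peer at 10.0.0.5:
ip route add blackhole 10.0.0.5/32   # packets to that host are silently dropped
ip route del blackhole 10.0.0.5/32   # restore reachability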
Simulate packet loss between a database container and a web host container.
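With Pumba again, assuming the database container is named my_database (percentage and duration are illustrative):
pumba netem --duration 2m loss --percent 20 my_database   # drop 20% of the container's packets for two minutes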
Inject failures via environment variables.
use fail::fail_point;

fn main() {
    fail_point!("foo"); // panics when run with FAILPOINTS="foo=panic" (some fail crate versions need a setup call first)
}
FAILPOINTS="foo=panic" cargo run
fn foo() {
    fail_point!("foo");
}

#[test]
#[should_panic]
fn test_foo() {
    fail::cfg("foo", "panic").unwrap(); // arm the fail point: foo() now panics
    foo();
}
PS: We're hiring!
Remote friendly, Rust, Go, C++.