Kévin Maschtaler

Developer at marmelab

Errors Budget

An SRE Principle

Errors Budget

Deployment

Reliability

Monitoring

Benjamin Treynor Sloss, VP of Engineering, Google

Source: https://irishtugofwar.com/gallery-2/

Source: http://jonathan-marks.com/complexity-cooperation/

Get To Know What Really Matters For Your Users

Weather API

Availability

Database

Data Consistency

Stock Exchange App

Response Time

Define A Service Level Indicator

Availability = \frac{Uptime}{Uptime + Downtime}
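As a minimal sketch of the availability formula above, the SLI can be computed from measured uptime and downtime. The function name `availability` and the 30-day window with 43 minutes of downtime are hypothetical values chosen for illustration, not figures from the talk.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Return availability as a ratio between 0 and 1: uptime / (uptime + downtime)."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("the measurement window must have a positive duration")
    return uptime_seconds / total


# Hypothetical measurement: a 30-day window with 43 minutes of downtime.
window_seconds = 30 * 24 * 3600
downtime_seconds = 43 * 60
sli = availability(window_seconds - downtime_seconds, downtime_seconds)
print(f"Availability SLI: {sli:.4%}")
```

In practice the uptime and downtime would come from a monitoring system rather than from constants.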

Choose A Service Level Objective

Unrealistic & unreachable

Do more harm than good

Example of realistic SLOs (over a year)

99% ("two nines"): 3.65 days of downtime

99.9% ("three nines"): 8.77 hours of downtime

99.99% ("four nines"): 52.60 minutes of downtime

99.999% ("five nines"): 5.26 minutes of downtime
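The downtime figures listed for each SLO can be reproduced with a short calculation. This sketch assumes a year of 365.25 days, which is the value that matches the numbers on the slide. The helper name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta

# Assumption: an average year of 365.25 days, which matches the slide's figures.
YEAR = timedelta(days=365.25)


def allowed_downtime(slo: float, window: timedelta = YEAR) -> timedelta:
    """Return the downtime a given SLO (ratio between 0 and 1) tolerates over the window."""
    if not 0 <= slo <= 1:
        raise ValueError("slo must be a ratio between 0 and 1")
    return window * (1 - slo)


for slo in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime(slo).total_seconds() / 60
    print(f"SLO {slo:.3%}: {minutes:.2f} minutes of downtime per year")
```

The same function works for shorter windows, for example `allowed_downtime(0.999, timedelta(days=30))` for a monthly objective.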

Focus on unplanned downtime

Errors Budget = SLI - SLO
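A minimal sketch of the budget formula above: the budget is the margin between what was measured (SLI) and what was promised (SLO). The function name `error_budget` and the sample values are hypothetical.

```python
def error_budget(sli: float, slo: float) -> float:
    """Return SLI - SLO: positive means budget remains, negative means the SLO is missed."""
    return sli - slo


# Hypothetical values: a "three nines" objective and two possible measurements.
slo = 0.999
healthy = error_budget(sli=0.9995, slo=slo)  # measured above the objective
overspent = error_budget(sli=0.998, slo=slo)  # measured below the objective

print(f"healthy budget: {healthy:+.4%}")
print(f"overspent budget: {overspent:+.4%}")
```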

if (budget > 0 && !friday)

if (budget <= 0 || friday)
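The two conditions above can be sketched as a small deployment gate. The interpretation here is an assumption: a positive budget on any day but Friday leaves room for risky changes (velocity), and every other case calls for stability. The function name `can_ship_risky_change` is hypothetical.

```python
from datetime import date

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4


def can_ship_risky_change(budget: float, today: date) -> bool:
    """Mirror `budget > 0 && !friday`; the negation is `budget <= 0 || friday`."""
    return budget > 0 and today.weekday() != FRIDAY


def team_focus(budget: float, today: date) -> str:
    """Assumed mapping of the two conditions to a team focus."""
    return "velocity" if can_ship_risky_change(budget, today) else "stability"


print(team_focus(0.0005, date(2024, 1, 3)))   # a Wednesday with budget left
print(team_focus(0.0005, date(2024, 1, 5)))   # a Friday with budget left
print(team_focus(-0.001, date(2024, 1, 3)))   # a Wednesday with the budget spent
```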

Recap

1. Get To Know What Really Matters For Your Users

2. Measure it (SLI)

3. Choose A Realistic Objective (SLO)

4. Align Team Behavior With The Errors Budget

5. Iterate and goto 1

- Risky

- Not Risky

Focus On Stability

Focus On Velocity

DEMO

https://github.com/Kmaschta/monitoring-example

About Site Reliability Engineering

Google (Benjamin Treynor Sloss)
Microsoft (David N. Blank-Edelman)
Stack Overflow (Nick Craver)

leboncoin.fr
vente-privee.com

AirBnb
Amazon
Apple
Baidu
Dropbox
Etsy

Facebook
GitHub
LinkedIn
Netflix
Pinterest
Twitter

Uber
Yahoo!
Yelp
...

Errors Budget - An SRE Principle

A short introduction to the error budget method, or how to reconcile devs and sysadmins thanks to SRE principles.

