SRE Metrics


Hivency Workshop, 2021-09-16

What We Will See


Four Golden Signals


Performance Optimization


Commercial Metrics: SLAs, Uptime


Uptime Objective: Error Budget


Unknown Unknowns

What Is a Metric?



A magnitude plus a measurement system


What is it, and how is it obtained?


Where does it come from?


What does it mean?

Internal and external metrics



Internal: measured in the machine
They affect the measurement


External: measured from the outside
Can be less precise


Every external metric has an internal counterpart

Measurement Problem



The act of measuring affects the measurement


Happens at all levels


If the measurement is done externally, it may be less precise


Some problems are solved just by observing them 🤔

Four Golden Signals

Traffic: requests per second


Errors: rate of failures in the system


Latency: time to answer


Saturation: if the system is near its limits


Source: SRE Book

Requests


A request is part of an information exchange


Request + response / reply


Oriented towards network connections
Sockets
HTTP
APIs


Roughly equivalent to a database query

Request + Response



Traffic Measurement



Requests are usually counted externally


Can also be measured internally using log events


Over a time interval (typically a second or a day)


Usually the result is recorded
  • HTTP Status
  • Success or error
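The internal approach can be sketched as a small aggregation over log events. A minimal sketch; the event shape (`{ timestamp, status }`) is an assumption for illustration:

```javascript
// Count requests per second from log events, bucketing by timestamp (ms).
function requestsPerSecond(events) {
  const buckets = new Map();
  for (const event of events) {
    const second = Math.floor(event.timestamp / 1000);
    buckets.set(second, (buckets.get(second) || 0) + 1);
  }
  return buckets;
}

const events = [
  { timestamp: 1000, status: 200 },
  { timestamp: 1500, status: 500 },
  { timestamp: 2100, status: 200 },
];
const counts = requestsPerSecond(events);
console.log(counts.get(1)); // 2 requests in the first second
```

The same pass can also tally status codes per bucket to record success or error.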

Exercise: How to Measure Traffic?



Discuss in group how to measure traffic


How did you measure it in the past?


How can it be better measured?



Awesome!



Error Rate


In a distributed system there's not always "up" or "down"


Systems can fail intermittently


Availability is computed as:


Availability = successful requests / total requests
Error rate = 1 - availability
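The two formulas translate directly into code. A minimal sketch from request counts:

```javascript
// Availability and error rate computed from request counts.
function availability(successful, total) {
  return successful / total;
}
function errorRate(successful, total) {
  return 1 - availability(successful, total);
}

console.log(availability(9990, 10000)); // 0.999
console.log(errorRate(9990, 10000));    // ~0.001
```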


Hard to verify as a customer 😅

When There Is No "Up" or "Down"



Latency


Wait time between the start of a request
and the reception of a response

Measured externally, at the client,
usually in milliseconds


Includes network time


The closer the measurement, the better the precision

Measuring Latency














Millisecond precision is usually enough

Exercise: Measuring Latencies



Send a request to a sample service:
https://reqres.in/api/users?page=2


Measure the latency in ms


Latency = time(end) - time(start)



Exercise +


Code:

// Time a request from the browser console
let xhr = new XMLHttpRequest();
let start = Date.now();
xhr.open('GET', 'https://reqres.in/api/users?page=2', true);
xhr.onload = function () {
    const elapsed = Date.now() - start;
    console.log('Elapsed:', elapsed, 'ms');
};
xhr.send();


Open your browser, go to https://reqres.in/ and hit F12

Exercise +


Now we will solve a performance issue


We will measure against https://reqres.in/api/products/3.


Measure latency again


What do you get now?


Awesome!



Event



Message sent asynchronously


Measured internally in the processing machine


May or may not trigger an action


Indicates delayed processing
  • Fire & forget
  • Polling

Event Processing


Saturation


Use of finite resources: CPU, network...


In absolute units or percentage
Network measurement: 16 Mbit/s


Percentage indicates saturation


Measure CPU usage
Internal (per CPU): top can report 400%
External (whole server): EC2 never goes above 100%
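The gap between the two views is just normalization over the number of cores. A minimal sketch:

```javascript
// top reports CPU per core: a 4-core machine can show up to 400%.
// External views normalize over all cores into a 0-100% range.
function normalizedCpu(topPercent, cores) {
  return topPercent / cores;
}

console.log(normalizedCpu(400, 4)); // 100: fully saturated
console.log(normalizedCpu(100, 4)); // 25: one core busy out of four
```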

Why the Four Golden Signals



Traffic: important for stability and cost


Errors: crucial for customer stability


Latency: crucial for customer response


Saturation: important for stability and cost


The Tyranny of Averages




Twitter is a daily reminder that 50 percent of the population are of below average intelligence.

Richard Dawkins


Intelligence is not the same as IQ


Metrics never replace the magnitude that we want to measure

False!


Or not necessarily true

Generalized Pareto distribution

Performance Optimization


Why Doesn't a System Scale



Bottlenecks


Resources that can be exhausted


Resource optimization


Locks

Bottlenecks


A system cannot run faster than the slowest component


Corollary: a component with too much capacity
is overprovisioned

What resources can saturate?


Short answer: all

  • Memory *
  • CPU
  • Bandwidth
  • File descriptors
  • Input/output buffers *
  • ...

Non-catastrophic saturation: the system blocks
*Catastrophic saturation: the system breaks


Do we even notice when a resource is saturated?

Classics: CPU



Independence


No locks required

Immutability

Functional programming
Example: Erlang

Locks



Protect a resource


Example: global kernel lock


When requests cannot be processed, they accumulate


Input / output buffers
(which also saturate)

When Should We Use Locks?


Modelling dependencies (e.g. games)

Single process programming (Node.js)

Atomic operations

Separation of reads / writes

Mutexes

Events

Channels
...

Exercise: Visit Counter


We are a video service in 2012

We want to count visits to all videos


We have a single-process Rails app
writing to multiple files

We have outgrown it


Can you identify possible bottlenecks?


Exercise +



Design a visit counter that can scale horizontally


What type of operations are necessary?


What calls do you need in the API?


What horizontal strategies would you use?



Exercise +


It's 2013 and we are in a crazy international expansion

We have visits from all over the world


How do we extend the counter?

Is it possible to have a precise, synchronized counter?


Can you think of alternatives?
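One classic alternative to a single synchronized counter is a per-region counter that is only summed on read, in the style of a grow-only (G-)counter CRDT. A minimal sketch; the region names are hypothetical:

```javascript
// Each region increments only its own slot; the global count is the sum.
// No cross-region synchronization is needed on the write path.
class RegionCounter {
  constructor() {
    this.counts = {};
  }
  increment(region) {
    this.counts[region] = (this.counts[region] || 0) + 1;
  }
  total() {
    return Object.values(this.counts).reduce((a, b) => a + b, 0);
  }
  // Merging two replicas takes the max per region (G-counter semantics)
  merge(other) {
    for (const [region, count] of Object.entries(other.counts)) {
      this.counts[region] = Math.max(this.counts[region] || 0, count);
    }
  }
}

const eu = new RegionCounter();
eu.increment('eu-west');
eu.increment('eu-west');
const us = new RegionCounter();
us.increment('us-east');
eu.merge(us);
console.log(eu.total()); // 3
```

The trade-off: writes never block on other regions, but the total is only eventually consistent.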



Exercise + (in common)



Now design a server that assigns unique identifiers
UUIDs


We have a centralized server
It keeps a single counter


The server collapses under traffic
We are also opening new regions
Can you think of other options?
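One well-known option is to compose IDs locally from a timestamp, a machine number, and a local sequence, in the style of Twitter's Snowflake. A minimal sketch; the bit widths are assumptions:

```javascript
// Snowflake-style ID: timestamp | machine id | per-machine sequence.
// No coordination needed: each machine generates unique IDs on its own.
function makeIdGenerator(machineId) {
  let sequence = 0;
  let lastMs = 0;
  return function nextId() {
    const now = Date.now();
    sequence = now === lastMs ? sequence + 1 : 0;
    lastMs = now;
    // 22 low bits reserved: 10 for machine id, 12 for sequence
    return (BigInt(now) << 22n) | (BigInt(machineId) << 12n) | BigInt(sequence);
  };
}

const nextId = makeIdGenerator(42);
const a = nextId();
const b = nextId();
console.log(a !== b); // true: unique even within the same millisecond
```

IDs stay roughly time-ordered, which is often useful, but clock skew between machines has to be handled in a real system.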


Beep beep



Commercial Metrics


Availability


The percentage of time that the system is up


Also known as: uptime


Typical values:
  • 99%: down 14 minutes per day
  • 99.9%: down 10 minutes per week
  • 99.999%: down 26 seconds per month
  • 99.99999%: down 3 seconds per year
  • 99.9999999%: down 3 seconds per century
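The table above is just one multiplication. A minimal sketch for checking any target:

```javascript
// Allowed downtime (seconds) for an availability target over a period.
function allowedDowntime(availability, periodSeconds) {
  return (1 - availability) * periodSeconds;
}

const DAY = 24 * 3600, WEEK = 7 * DAY, MONTH = 30 * DAY;
console.log(allowedDowntime(0.99, DAY) / 60);   // ~14.4 minutes per day
console.log(allowedDowntime(0.999, WEEK) / 60); // ~10.1 minutes per week
console.log(allowedDowntime(0.99999, MONTH));   // ~25.9 seconds per month
```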

Exercise: Periodic Downtime



Compute the availability of these systems


National bank: down 1 hour per day


Marketing: down 1 hour per month


Online shopping: down 5 hours per year
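The arithmetic for the exercise is the inverse of the downtime table. A sketch, with the first case worked as a check:

```javascript
// Availability given recurring downtime per period.
function availabilityFromDowntime(downSeconds, periodSeconds) {
  return 1 - downSeconds / periodSeconds;
}

const HOUR = 3600, DAY = 24 * HOUR;
console.log(availabilityFromDowntime(HOUR, DAY)); // ~0.958: the national bank
```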



Exercise +


We have an availability commitment of 99.9%


We want to do monthly maintenance
How many minutes do we have?


An intervention is two hours
We need to do it now
How can we maintain our commitment?



Exercise + (in common)


Extra, extra!
Sales sold 99.99% uptime 😱


We have ten servers to update


Every one takes 30 minutes


How do we keep our commitment?



Not bad




SLA


SLI: Service Level Indicator



SLO: Service Level Objective



SLA: Service Level Agreement




Chris Jones et al
, SRE book

SLI


An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.


Indicator for service level
(So, basically a metric)

Examples:
  • Latency
  • Availability: 99.999%
  • Error rate
  • Throughput, traffic, requests per second

SLO


An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.


Objective for service level


Examples:
  • Average latency < 100 ms
  • Error rate < 0.01%

Objectives are useful to set expectations
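Once the targets are explicit, checking them is mechanical. A minimal sketch using the slide's two example SLOs (0.01% = 0.0001 as a fraction):

```javascript
// Check measured indicators against the example SLOs.
function meetsSlo(metrics) {
  return metrics.averageLatencyMs < 100 && metrics.errorRate < 0.0001;
}

console.log(meetsSlo({ averageLatencyMs: 80, errorRate: 0.00005 }));  // true
console.log(meetsSlo({ averageLatencyMs: 250, errorRate: 0.00005 })); // false
```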

SLA, for realz


Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.


Agreement for service level


An SLA is a business matter


Usually tricky

Exercise: Amazon SLA



Amazon Compute SLA


We spend €1000/month on EC2
There is an earthquake and availability falls to 99.01%
How much money do we get from Amazon?


How long should EC2 be down in a month to get back:
  • 10% of the bill?
  • 100% of the bill?
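The exercise comes down to a tiered credit schedule. A sketch in the style of the EC2 SLA; the tiers below are an assumption here, so check the current Amazon Compute SLA for the real values:

```javascript
// Assumed tiers: < 99.99% → 10% credit, < 99.0% → 30%, < 95.0% → 100%.
function creditPercent(availability) {
  if (availability < 0.95) return 100;
  if (availability < 0.99) return 30;
  if (availability < 0.9999) return 10;
  return 0;
}

// 99.01% falls in the lowest tier: 10% of a €1000 bill
console.log(creditPercent(0.9901) / 100 * 1000); // 100
```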


That was easy!



Service Degradation



The SLA is often just about availability


Sometimes services work but in a degraded condition


High latency, sporadic errors...


An SLA can combine several SLOs:
  • Error rate < 0.1%
  • Latency > 1000 ms ⇒ error

Uptime Objective


Error budget



Introduced by Google: error budget


Most incidents are caused by changes
~70%


Want to meet the SLO? Limit the changes


Too much availability is bad:
we could have made more changes!
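The budget itself is just the gap between the SLO and perfection. A minimal sketch:

```javascript
// Error budget: the fraction of failed requests the SLO still allows.
function errorBudgetRemaining(slo, successful, total) {
  const budget = 1 - slo;               // e.g. 0.001 for a 99.9% SLO
  const spent = 1 - successful / total; // observed error rate
  return budget - spent;                // negative: the SLO is blown
}

console.log(errorBudgetRemaining(0.999, 99950, 100000)); // ~0.0005 left
```

A positive remainder is the signal to keep shipping changes; a negative one is the signal to freeze them.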

Unknown Unknowns



Rumsfeld quadrant

What we don't know


There are known knowns:
What we know we know


There are known unknowns:
What we know we don't know


There are unknown knowns:
What we don't know we know


There are unknown unknowns:
What we don't know we don't know

Testing in Production



Source


Heidi Waterhouse





I really hate metrics.

Charity Majors

Bibliography



SRE Book: Service Level Objectives


SRE Book: Embracing Risk


pinchito.es: Continuous Deployment on the Cheap


Charity Majors: I test in prod


Charity Majors: Yes, I Test in Production (And So Do You)