SRE Metrics

Hivency Workshop, 2021-09-16

What We Will See

Four Golden Signals

Performance Optimization

Commercial Metrics: SLAs, Uptime

Uptime Objective: Error Budget

Unknown Unknowns

What Is a Metric?

A magnitude + a measurement system

What is it, and how is it obtained?

Where does it come from?

What does it mean?

Internal and external metrics

Internal: measured in the machine
They affect the measurement

External: measured from the outside
Can be less precise

All external metrics have their internal replica

Measurement Problem

The act of measuring affects the measurement

Happens at all levels

If the measurement is done externally, it may be less precise

Some problems are solved just by observing them 🤔

Four Golden Signals

Traffic: requests per second

Errors: rate of failures in the system

Latency: time to answer

Saturation: if the system is near its limits

Source: SRE Book


A request is part of an information exchange

Request + response / reply

Oriented towards network connections

Roughly equivalent to a database query

Request + Response

Traffic Measurement

Requests are usually counted externally

Can also be measured internally using log events

For a time interval (typically second or day)

Usually the result is recorded
  • HTTP Status
  • Success or error
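As a sketch of internal measurement from log events (the event shape and names are illustrative, not from the slides): bucket events per second and record the result of each one.

```javascript
// Sketch: measure traffic internally from log events.
// Each event carries a timestamp in ms and an HTTP status code.
function countTraffic(events) {
  const perSecond = new Map();
  for (const { timestamp, status } of events) {
    const second = Math.floor(timestamp / 1000);
    const bucket = perSecond.get(second) || { total: 0, errors: 0 };
    bucket.total += 1;
    if (status >= 500) bucket.errors += 1; // record success or error
    perSecond.set(second, bucket);
  }
  return perSecond;
}

// Three requests within the same second, one of them a server error.
const counts = countTraffic([
  { timestamp: 1000, status: 200 },
  { timestamp: 1500, status: 500 },
  { timestamp: 1900, status: 200 },
]);
console.log(counts.get(1)); // { total: 3, errors: 1 }
```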

Exercise: How to Measure Traffic?

Discuss in group how to measure traffic

How did you measure it in the past?

How can it be better measured?


Error Rate

In a distributed system there's not always "up" or "down"

Systems can fail intermittently

Availability is computed as:

successful requests / total requests
Error rate = 1 - availability

Hard to verify as a customer 😅
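The two formulas above, as a minimal sketch:

```javascript
// Availability and error rate from request counts.
function availability(successful, total) {
  return successful / total;
}
function errorRate(successful, total) {
  return 1 - availability(successful, total);
}

// 999 successful requests out of 1000: 99.9% available, ~0.1% errors.
console.log(availability(999, 1000)); // 0.999
console.log(errorRate(999, 1000));    // ~0.001
```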

When There Is No "Up" or "Down"


Wait time between the start of a request
and the reception of a response

Measured externally in the client
usually as milliseconds

Includes network time

The closer the measurement, the better the precision

Measuring Latency

Milliseconds are usually precise enough

Exercise: Measuring Latencies

Send a request to a sample service:

Measure the latency in ms

Latency = time(end) - time(start)

Exercise +


let xhr = new XMLHttpRequest();
let start =;'GET', '', true); // URL left blank in the original
xhr.onload = function () {
    const elapsed = - start;
    console.log('Elapsed:', elapsed, 'ms');
};
xhr.send();
Open your browser, go to and hit F12

Exercise +

Now we will solve a performance issue

We will measure against

Measure latency again

What do you get now?



Message sent asynchronously

Measured internally in the processing machine

Can generate an action, or not

Indicates delayed processing
  • Fire & forget
  • Polling

Event Processing


Use of finite resources: CPU, network...

In absolute units or percentage
Network measurement: 16 Mbit/s

Percentage indicates saturation

Measure CPU usage
Internal (per CPU): top can report 400%
External (per server): EC2 never goes above 100%
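The conversion between the two views is simple (a sketch; the function name is illustrative):

```javascript
// Normalise a per-core reading (top-style, up to 100% per core)
// into a whole-server percentage (EC2-style, capped at 100%).
function serverCpuPercent(topPercent, cores) {
  return topPercent / cores;
}

console.log(serverCpuPercent(400, 4)); // 100: all four cores busy
console.log(serverCpuPercent(200, 4)); // 50: half the server saturated
```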

Why the Four Golden Signals

Traffic: important for stability and cost

Errors: crucial for customer stability

Latency: crucial for customer response

Saturation: important for stability and cost

The Tyranny of Averages

Twitter is a daily reminder that 50 percent of the population are of below average intelligence.

Intelligence is not the same as IQ

Metrics never replace the magnitude that we want to measure


Or not necessarily true
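A sketch of why percentiles beat averages for latency (the data is invented): one slow outlier inflates the mean, while the median shows the typical request and the p99 exposes the outlier.

```javascript
// Mean vs percentiles over a set of latencies in ms.
function mean(values) {
  return values.reduce((a, b) => a + b, 0) / values.length;
}
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[index];
}

const latencies = [10, 11, 12, 10, 11, 12, 10, 11, 12, 2000];
console.log(mean(latencies));           // 209.9: looks slow everywhere
console.log(percentile(latencies, 50)); // 11: the typical request is fine
console.log(percentile(latencies, 99)); // 2000: the outlier shows up
```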

Performance Optimization

Why Doesn't a System Scale?


Resources that can be exhausted

Resource optimization



A system cannot run faster than the slowest component

Corollary: a component with too much capacity
is overprovisioned

What resources can saturate?

Short answer: all

  • Memory *
  • CPU
  • Bandwidth
  • File descriptors
  • Input/output buffers *
  • ...

Non-catastrophic saturation: the system blocks
*Catastrophic saturation: the system breaks

Do we even notice when a resource is saturated?

Classics: CPU


No locks required


Functional programming
Example: Erlang


Protect a resource

Example: global kernel lock

When requests cannot be processed, they accumulate

Input / output buffers
(which also saturate)

When Should We Use Locks?

Modelling dependencies (e.g. games)

Single process programming (Node.js)

Atomic operations

Separation of reads / writes



Exercise: Visit Counter

We are a video service in 2012

We want to count visits to all videos

We had a single process Rails app
writing to multiple files

We have outgrown it

Can you identify possible bottlenecks?

Exercise +

Design a visit counter that can scale horizontally

What type of operations are necessary?

What calls do you need in the API?

What horizontal strategies would you use?

Exercise +

It's 2013 and we are in a wild international expansion

We have visits from all over the world

How do we scale the counter?

Is it possible to have a precise, synchronized counter?

Can you think of alternatives?

Exercise + (in common)

Now design a server that assigns unique identifiers

We have a centralized server
It keeps a single counter

The server collapses under the traffic
We are also opening new regions
Can you think of other options?

Beep beep

Commercial Metrics


Availability

The percentage of time that the system is up

Also known as: uptime

Typical values:
  • 99%: down 14 minutes per day
  • 99.9%: down 10 minutes per week
  • 99.999%: down 24 seconds per month
  • 99.99999%: down 3 seconds per year
  • 99.9999999%: down 3 seconds per century
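These figures come from a single formula; a minimal helper to reproduce them:

```javascript
// Downtime allowed by an availability target over a period.
function downtimeSeconds(availabilityPercent, periodSeconds) {
  return periodSeconds * (1 - availabilityPercent / 100);
}

const DAY = 24 * 3600;
console.log(downtimeSeconds(99, DAY) / 60);       // ≈14.4 minutes per day
console.log(downtimeSeconds(99.9, 7 * DAY) / 60); // ≈10.08 minutes per week
```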

Exercise: Periodic Downtime

Compute the availability of these systems

National bank: down 1 hour per day

Marketing: down 1 hour per month

Online shopping: down 5 hours per year

Exercise +

We have an availability commitment of 99.9%

We want to do monthly maintenance
How many minutes do we have?

An intervention takes two hours
We need to do it now
How can we keep our commitment?

Exercise + (in common)

Extra, extra!
Sales sold 99.99% uptime 😱

We have ten servers to update

Each one takes 30 minutes

How do we keep our commitment?

Not bad


SLI: Service Level Indicator

SLO: Service Level Objective

SLA: Service Level Agreement

Chris Jones et al., SRE Book


An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.

Indicator for service level
(So, basically a metric)

  • Latency
  • Availability: 99.999%
  • Error rate
  • Throughput, traffic, requests per second


An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.

Objective for service level

  • Average latency < 100 ms
  • Error rate < 0.01%

Objectives are useful to set expectations

SLA, for realz

Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.

Agreement for service level

Made with a business entity

Usually tricky

Exercise: Amazon SLA

We spend €1000/month on EC2
There is an earthquake and availability falls to 99.01%
How much money do we get from Amazon?

How long should EC2 be down in a month to get back:
  • 10% of the bill?
  • 100% of the bill?

That was easy!

Service Degradation

SLA is often just for availability

Sometimes services work but in a degraded condition

High latency, sporadic errors...

An SLA can combine several SLOs:
  • Error rate < 0.1%
  • Latency > 1000 ms ⇒ error
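A sketch of such a combined check (thresholds taken from the slide; the request shape is illustrative):

```javascript
// A request breaches the SLO if it failed outright or was too slow.
function isError(request) {
  return request.status >= 500 || request.latencyMs > 1000;
}

const requests = [
  { status: 200, latencyMs: 80 },   // fine
  { status: 200, latencyMs: 2500 }, // degraded: counts as an error
  { status: 503, latencyMs: 30 },   // failed
];
const rate = requests.filter(isError).length / requests.length;
console.log(rate); // 2 of 3 requests count as errors
```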

Uptime Objective

Introduced by Google: the error budget

Most incidents are caused by changes

Want to meet the SLO? Limit the changes

Too much availability: bad
We could have made more changes!

Unknown Unknowns

Rumsfeld quadrant

What we don't know

There's known knowns:
What we know we know

There's known unknowns:
What we know we don't know

There's unknown knowns:
What we don't know we know

There's unknown unknowns:
What we don't know we don't know

Testing in Production

I really hate metrics.


SRE Book: Embracing Risk

Charity Majors: I test in prod

Hivency Tech Workshop: SRE Metrics

By Alex Fernández
