SRE Metrics

Hivency Workshop, 2021-09-16

What We Will See

Four Golden Signals

Performance Optimization

Commercial Metrics: SLAs, Uptime

Uptime Objective. Incident Budget

Unknown Unknowns

What Is a Metric?

A magnitude + a measurement system

What is it, and how is it obtained?

Where does it come from?

What does it mean?

Internal and external metrics

Internal: measured in the machine

Can affect the measurement

External: measured from the outside

Can be less precise

All external metrics have an internal counterpart

Measurement Problem

The act of measurement affects measures

Happens at all levels

If the measurement is done externally, it may be less precise

Some problems are solved just by observing them 🤔

Four Golden Signals

Traffic: requests per second

Errors: rate of failures in the system

Latency: time to answer

Saturation: how close the system is to its limits

Source: SRE Book

Requests

A request is part of an information exchange

Request + response / reply

Oriented towards network connections

Sockets

HTTP

APIs

Roughly equivalent to a database query

Request + Response

Traffic Measurement

Requests are usually counted externally

Can also be measured internally using log events

Per time interval (typically a second or a day)

The result is usually recorded

HTTP Status
Success or error
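
As a minimal sketch of counting traffic internally from log events, the snippet below groups requests into one-second intervals and records each result as success or error. The event shape (`timestamp` in milliseconds, `status` as an HTTP status code), the name `countTraffic` and the rule that any status of 400 or above is an error are assumptions made for illustration.

```javascript
// Hypothetical log events: timestamp in ms, HTTP status code.
function countTraffic(events, intervalMs = 1000) {
    const buckets = new Map();
    for (const event of events) {
        // Start of the interval this event falls into.
        const key = Math.floor(event.timestamp / intervalMs) * intervalMs;
        if (!buckets.has(key)) {
            buckets.set(key, { total: 0, success: 0, error: 0 });
        }
        const bucket = buckets.get(key);
        bucket.total += 1;
        // Assumption: statuses below 400 are successes, the rest are errors.
        if (event.status < 400) {
            bucket.success += 1;
        } else {
            bucket.error += 1;
        }
    }
    return buckets;
}

const sampleEvents = [
    { timestamp: 1000, status: 200 },
    { timestamp: 1500, status: 500 },
    { timestamp: 2100, status: 200 },
];
const traffic = countTraffic(sampleEvents);
console.log(traffic.get(1000)); // { total: 2, success: 1, error: 1 }
```

Passing a different `intervalMs` (for example 86400000) would give counts per day instead of per second.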

Exercise: How to Measure Traffic?

Discuss in groups how to measure traffic

How did you measure it in the past?

How can it be better measured?

⮯

Awesome!

Error Rate

In a distributed system there's not always "up" or "down"

Systems can fail intermittently

Availability is computed as:

successful requests / total requests

Error rate = 1 - availability

Hard to verify as a customer 😅
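
The two formulas above can be sketched in a few lines. The function names and the sample numbers are illustrative assumptions, as is the choice to report full availability when there has been no traffic.

```javascript
// Availability = successful requests / total requests.
function availability(successful, total) {
    if (total === 0) {
        // Assumption: with no traffic there is nothing to fail.
        return 1;
    }
    return successful / total;
}

// Error rate = 1 - availability.
function errorRate(successful, total) {
    return 1 - availability(successful, total);
}

console.log(availability(9990, 10000)); // 0.999
console.log(errorRate(9990, 10000).toFixed(3)); // 0.001
```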

When There Is No "Up" or "Down"

Latency

Wait time between the start of a request

and the reception of a response

Measured externally in the client

usually as milliseconds

Includes network time

The closer the measurement, the better the precision

Measuring Latency

Milliseconds are usually enough

Exercise: Measuring Latencies

Send a request to a sample service:

https://reqres.in/api/users?page=2

Measure the latency in ms

Latency = time(end) - time(start)

⮯

Exercise +

Code:

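// Measure the latency of a GET request from the browser console.
const xhr = new XMLHttpRequest();
xhr.open('GET', 'https://reqres.in/api/users?page=2', true);
xhr.onload = function () {
    // Latency = time(end) - time(start), in milliseconds.
    const elapsed = Date.now() - start;
    console.log('Elapsed:', elapsed, 'ms');
};
// Start the clock right before the request is sent.
const start = Date.now();
xhr.send();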

Open your browser, go to https://reqres.in/ and hit F12

⮯

Exercise +

Now we will solve a performance issue

We will measure against https://reqres.in/api/products/3.

Measure latency again

What do you get now?

⮯

Awesome!

Event

Message sent asynchronously

Measured internally in the processing machine

Can generate an action, or not

Indicates delayed processing

Fire & forget
Polling

Event Processing

Saturation

Use of finite resources: CPU, network...

In absolute units or percentage

Network measurement: 16 Mbit/s

Percentage indicates saturation

Measure CPU usage

Internal (by CPUs): top can measure 400%

External (by server): EC2 never goes beyond 100%
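
To compare the internal and the external view of CPU usage, the per-CPU total can be divided by the number of cores. This is only a sketch: the function names and the 90% saturation threshold are assumptions, not values from the workshop.

```javascript
// Convert a per-CPU total (as top can report, e.g. 400% with four cores)
// into a per-server percentage that never goes beyond 100%.
function toServerPercent(perCpuPercent, cores) {
    return perCpuPercent / cores;
}

// Assumption: a resource at or above this threshold counts as saturated.
const SATURATION_THRESHOLD = 90;

function isSaturated(perCpuPercent, cores) {
    return toServerPercent(perCpuPercent, cores) >= SATURATION_THRESHOLD;
}

console.log(toServerPercent(400, 4)); // 100
console.log(toServerPercent(120, 4)); // 30
console.log(isSaturated(380, 4)); // true
```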

Why the Four Golden Signals

Traffic: important for stability and cost

Errors: crucial for customer stability

Latency: crucial for customer response

Saturation: important for stability and cost

The Tyranny of Averages

Twitter is a daily reminder that 50 percent of the population are of below average intelligence.

Richard Dawkins

Intelligence is not the same as IQ

Metrics never replace the magnitude that we want to measure

False!

Or not necessarily true

Generalized Pareto distribution
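
A small sketch of why averages mislead on skewed data: with one very slow request, most samples end up below the mean. The latencies are made up for illustration, and the percentile uses the nearest-rank method, which is one of several possible definitions.

```javascript
// Arithmetic mean of a list of numbers.
function mean(values) {
    return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
function percentile(values, p) {
    const sorted = [...values].sort((a, b) => a - b);
    const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
    return sorted[rank - 1];
}

// Made-up latencies in ms: most requests are fast, one is very slow.
const latencies = [10, 12, 11, 13, 12, 10, 11, 12, 13, 1000];

console.log(mean(latencies)); // 110.4
console.log(percentile(latencies, 50)); // 12
console.log(percentile(latencies, 99)); // 1000
```

Here nine out of ten requests are faster than the average, so "half are below average" does not hold; the median and a high percentile describe this sample better than the mean.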

Performance Optimization

Why Doesn't a System Scale?

Bottlenecks

Resources that can be exhausted

Resource optimization

Locks

Bottlenecks

A system cannot run faster than the slowest component

Corollary: a component with too much capacity

is oversized
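
As an illustration of the rule above, the throughput of components in series is capped by the one with the lowest capacity. The component names and capacities are made up, and `findBottleneck` is a hypothetical helper.

```javascript
// Made-up capacities, in requests per second, for components in series.
const capacities = { loadBalancer: 5000, appServer: 800, database: 300 };

// The system cannot run faster than its slowest component.
function findBottleneck(components) {
    let slowest = null;
    for (const [name, capacity] of Object.entries(components)) {
        if (slowest === null || capacity < slowest.capacity) {
            slowest = { name, capacity };
        }
    }
    return slowest;
}

const bottleneck = findBottleneck(capacities);
console.log(bottleneck); // { name: 'database', capacity: 300 }
```

In this sample the load balancer has far more capacity than the database can ever use, which is the corollary in action.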

What resources can saturate?

Short answer: all

Memory *
CPU
Bandwidth
File descriptors
Input/output buffers *
...

Non-catastrophic saturation: the system blocks

*Catastrophic saturation: the system breaks

Do we even notice when a resource is saturated?

Classics: CPU

Independence

No locks required

Immutability

Functional programming

Example: Erlang

Locks

Protect a resource

Example: global kernel lock

When requests cannot be processed, they accumulate

Input / output buffers

(which also saturate)

When Should We Use Locks?

Modelling dependencies (e.g. games)

Single process programming (Node.js)

Atomic operations

Separation of reads / writes

Mutexes

Events

Channels

...

Exercise: Visit Counter

We are a video service in 2012

We want to count visits to all videos

We had a single process Rails app

writing to multiple files

We have outgrown it

Can you identify possible bottlenecks?

⮯

Exercise +

Design a visit counter that can scale horizontally

What type of operations are necessary?

What calls do you need in the API?

What horizontal strategies would you use?

⮯

Exercise +

It's 2013, and we are in the middle of a crazy international expansion

We get visits from all over the world

How do we extend the counter?

Is it possible to have a precise, synchronized counter?

Can you think of any alternatives?

⮯

Exercise + (in common)

Now design a server that assigns unique identifiers

UUIDs

We have a centralized server

It keeps a single counter

The server collapses under traffic

We are also opening new regions

Can you think of other options?

⮯

Beep beep

Commercial Metrics

Availability

The percentage of time that the system is up

Also known as: uptime

Typical values:

99%: down 14 minutes per day
99.9%: down 10 minutes per week
99.999%: down 26 seconds per month
99.99999%: down 3 seconds per year
99.9999999%: down 3 seconds per century
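
The downtime allowed by an availability figure can be computed directly. This is a rough sketch: the function name is illustrative, and it assumes a 30-day month and a 365-day year.

```javascript
// Seconds per period. Assumptions: 30-day month, 365-day year.
const SECONDS = { day: 86400, week: 604800, month: 2592000, year: 31536000 };

// Allowed downtime = (1 - availability) * length of the period.
function allowedDowntimeSeconds(availabilityPercent, period) {
    return (1 - availabilityPercent / 100) * SECONDS[period];
}

console.log((allowedDowntimeSeconds(99, 'day') / 60).toFixed(1)); // 14.4 (minutes)
console.log((allowedDowntimeSeconds(99.9, 'week') / 60).toFixed(1)); // 10.1 (minutes)
console.log(allowedDowntimeSeconds(99.99999, 'year').toFixed(1)); // 3.2 (seconds)
```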

Exercise: Periodic Downtime

Compute the availability of these systems

National bank: down 1 hour per day

Marketing: down 1 hour per month

Online shopping: down 5 hours per year

⮯

Exercise +

We have an availability commitment of 99.9%

We want to do monthly maintenance

How many minutes do we have?

An intervention takes two hours

We need to do it now

How can we keep our commitment?

⮯

Exercise + (in common)

Extra, extra!

Sales sold 99.99% uptime 😱

We have ten servers to update

Each one takes 30 minutes

How do we keep the commitment?

⮯

Not bad

SLA

SLI: Service Level Indicator

SLO: Service Level Objective

SLA: Service Level Agreement

Chris Jones et al, SRE book

SLI

An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.

Indicator for service level

(So, basically a metric)

Examples:

Latency
Availability: 99.999%
Error rate
Throughput, traffic, requests per second

SLO

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.

Objective for service level

Examples:

Average latency < 100 ms
Error rate < 0.01%

Objectives are useful to set expectations
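
A minimal sketch of checking measured SLIs against the two example SLOs above. The measured values are made up, and the data layout and the name `checkSlos` are assumptions for illustration.

```javascript
// SLO targets taken from the examples above.
const slos = [
    { name: 'average latency (ms)', target: 100 },
    { name: 'error rate (%)', target: 0.01 },
];

// In this sketch an SLO is met when the measured SLI is strictly below its target.
function checkSlos(objectives, measured) {
    return objectives.map((slo) => ({
        name: slo.name,
        met: measured[slo.name] < slo.target,
    }));
}

// Made-up SLI measurements.
const measuredSlis = { 'average latency (ms)': 85, 'error rate (%)': 0.02 };
const results = checkSlos(slos, measuredSlis);
console.log(results);
```

With these sample values the latency objective is met and the error rate objective is missed.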

SLA, for realz

Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.

Agreement for service level

A business entity

Usually tricky

Exercise: Amazon SLA

Amazon Compute SLA

We spend €1000/month on EC2

There is an earthquake and availability falls to 99.01%

How much money do we get from Amazon?

How long should EC2 be down in a month to get back:

10% of the bill?
100% of the bill?

⮯

That was easy!

Service Degradation

SLA is often just for availability

Sometimes services work but in a degraded condition

High latency, sporadic errors...

An SLA can combine several SLOs:

Error rate < 0.1%
Latency > 1000 ms ⇒ error
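
The combination above can be sketched as follows: a request that answers too slowly is counted as an error, and the resulting error rate is compared with the limit. The request shape (`ok`, `latencyMs`), the function names and the sample traffic are assumptions.

```javascript
const MAX_LATENCY_MS = 1000; // Latency > 1000 ms counts as an error
const MAX_ERROR_RATE = 0.001; // Error rate must stay below 0.1%

// A request fails if it returned an error or took too long.
function isFailure(request) {
    return !request.ok || request.latencyMs > MAX_LATENCY_MS;
}

function meetsSla(requests) {
    const failures = requests.filter(isFailure).length;
    return failures / requests.length < MAX_ERROR_RATE;
}

// Made-up sample: 1999 fast successes and one slow "success".
const requests = Array.from({ length: 1999 }, () => ({ ok: true, latencyMs: 50 }));
requests.push({ ok: true, latencyMs: 1500 });
console.log(meetsSla(requests)); // true
```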

Uptime Objective

Error budget

Introduced by Google: error budget

Most incidents are caused by changes

~70%

Do we want to meet the SLO? We limit changes

Too much availability: bad

We could have made more changes!
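
One possible way to track an error budget, sketched under assumptions: the budget is counted in failed requests, and the SLO, traffic and failure numbers are made up.

```javascript
// Error budget: the number of requests allowed to fail under the SLO.
function errorBudget(sloAvailability, totalRequests) {
    return Math.round((1 - sloAvailability) * totalRequests);
}

// Budget left after the failures seen so far; negative means the SLO is missed.
function remainingBudget(sloAvailability, totalRequests, failedRequests) {
    return errorBudget(sloAvailability, totalRequests) - failedRequests;
}

// Made-up numbers: 99.9% SLO, one million requests, 400 failures.
console.log(errorBudget(0.999, 1000000)); // 1000
console.log(remainingBudget(0.999, 1000000, 400)); // 600
```

A large remaining budget at the end of the period suggests that more changes could have been shipped.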

Unknown Unknowns

Rumsfeld quadrant

What we don't know

There's known knowns:

What we know we know

There's known unknowns:

What we know we don't know

There's unknown knowns:

What we don't know we know

There's unknown unknowns

What we don't know we don't know

Testing in Production

Source

Heidi Waterhouse

I really hate metrics.

Charity Majors

Bibliography

SRE Book: Service Level Objectives

SRE Book: Embracing Risk

pinchito.es: Continuous Deployment on the Cheap

Charity Majors: I test in prod

Charity Majors: Yes, I Test in Production (And So Do You)

Hivency Tech Workshop: SRE Metrics

By Alex Fernández
