SRE Metrics
Hivency Workshop, 2021-09-16
What We Will See
Four Golden Signals
Performance Optimization
Commercial Metrics: SLAs, Uptime
Uptime Objective. Incident Budget
Unknown Unknowns
What Is a Metric?
A magnitude + a measurement system
What is it, and how is it obtained?
Where does it come from?
What does it mean?
Internal and external metrics
Internal: measured in the machine
Can affect the measurement
External: measured from the outside
Can be less precise
Every external metric has an internal counterpart
Measurement Problem
The act of measurement affects measures
Happens at all levels
If the measurement is done externally, it may be less precise
Some problems are solved just by observing them 🤔
Four Golden Signals
Traffic: requests per second
Errors: rate of failures in the system
Latency: time to answer
Saturation: how close the system is to its limits
Requests
A request is part of an information exchange
Request + response / reply
Oriented towards network connections
Sockets
HTTP
APIs
Roughly equivalent to a database query
Traffic Measurement
Requests are usually counted externally
Can also be measured internally using log events
Per time interval (typically a second or a day)
The result is usually recorded
- HTTP Status
- Success or error
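A minimal sketch of counting traffic internally from log events. The event shape (a millisecond timestamp plus an HTTP status), the function name requestsPerSecond and the rule that only 5xx statuses are errors are all assumptions made for illustration.

```javascript
// Hypothetical log events: { timestamp: ms since epoch, status: HTTP status code }.
function requestsPerSecond(events) {
  const buckets = new Map();
  for (const { timestamp, status } of events) {
    // Group events by the whole second in which they arrived.
    const second = Math.floor(timestamp / 1000);
    const bucket = buckets.get(second) ?? { total: 0, errors: 0 };
    bucket.total += 1;
    // Assumption: any 5xx status counts as an error.
    if (status >= 500) bucket.errors += 1;
    buckets.set(second, bucket);
  }
  return buckets;
}

const sampleEvents = [
  { timestamp: 1000, status: 200 },
  { timestamp: 1200, status: 500 },
  { timestamp: 1999, status: 200 },
  { timestamp: 2001, status: 404 },
];
const traffic = requestsPerSecond(sampleEvents);
console.log(traffic.get(1)); // { total: 3, errors: 1 }
console.log(traffic.get(2)); // { total: 1, errors: 0 }
```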
Exercise: How to Measure Traffic?
Discuss in group how to measure traffic
How did you measure it in the past?
How can it be better measured?
Error Rate
In a distributed system there's not always "up" or "down"
Systems can fail intermittently
Availability is computed as:
successful requests / total requests
Error rate = 1 - availability
Hard to verify as a customer 😅
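The two formulas above, as a short sketch. The request counts are made-up numbers, and treating an interval with no traffic as fully available is an assumption.

```javascript
// Availability = successful requests / total requests.
function availability(successful, total) {
  if (total === 0) return 1; // Assumption: no traffic counts as fully available.
  return successful / total;
}

// Error rate = 1 - availability.
function errorRate(successful, total) {
  return 1 - availability(successful, total);
}

// Hypothetical numbers: 9,990 successes out of 10,000 requests.
console.log(availability(9990, 10000));         // 0.999
console.log(errorRate(9990, 10000).toFixed(4)); // "0.0010"
```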
When There Is No "Up" or "Down"
Latency
Wait time between the start of a request
and the reception of a response
Measured externally in the client
usually in milliseconds
Includes network time
The closer the measurement point, the better the precision
Measuring Latency
Milliseconds are usually precise enough
Exercise: Measuring Latencies
Send a request to a sample service:
https://reqres.in/api/users?page=2
Measure the latency in ms
Latency = time(end) - time(start)
Exercise +
Code:
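// Run this in the browser console
const xhr = new XMLHttpRequest();
// Timestamp in ms, taken before the request is sent
const start = Date.now();
xhr.open('GET', 'https://reqres.in/api/users?page=2', true);
xhr.onload = function () {
  // Latency = time(end) - time(start)
  const elapsed = Date.now() - start;
  console.log('Elapsed:', elapsed, 'ms');
};
xhr.send();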
Open your browser, go to https://reqres.in/ and hit F12
Exercise +
Now we will solve a performance issue
We will measure against https://reqres.in/api/products/3.
Measure latency again
What do you get now?
Event
Message sent asynchronously
Measured internally in the processing machine
Can generate an action, or not
Indicates delayed processing
Saturation
Use of finite resources: CPU, network...
In absolute units or percentage
Network measurement: 16 Mbit/s
Percentage indicates saturation
Measure CPU usage
Internal (per CPU): top can report 400%, e.g. with four cores
External (by server): EC2 never goes beyond 100%
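A small sketch of how the two CPU readings relate, assuming a four-core machine. The function name normalizeCpu is illustrative.

```javascript
// Convert a per-core CPU reading (each core contributes up to 100%)
// into a whole-machine percentage between 0 and 100.
function normalizeCpu(perCorePercent, coreCount) {
  return perCorePercent / coreCount;
}

// Assumption: a machine with four cores.
console.log(normalizeCpu(400, 4)); // 100
console.log(normalizeCpu(250, 4)); // 62.5
```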
Why the Four Golden Signals
Traffic: important for stability and cost
Errors: crucial for customer stability
Latency: crucial for customer response
Saturation: important for stability and cost
The Tyranny of Averages
Twitter is a daily reminder that 50 percent of the population are of below average intelligence.
Intelligence is not the same as IQ
Metrics never replace the magnitude that we want to measure
False!
Or not necessarily true
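A quick sketch of why averages mislead. The latencies are made-up sample values, and the percentile uses the nearest-rank method, which is one of several common definitions.

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical latencies in ms: nine fast requests and one very slow one.
const latencies = [10, 10, 10, 10, 10, 10, 10, 10, 10, 1000];
console.log(mean(latencies));           // 109
console.log(percentile(latencies, 50)); // 10
console.log(percentile(latencies, 99)); // 1000
```

Here nine out of ten requests are below the average: "half are below average" only holds for the median.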
Why Doesn't a System Scale
Bottlenecks
Resources that can be exhausted
Resource optimization
Locks
Bottlenecks
A system cannot run faster than the slowest component
Corollary: a component with too much capacity
is over-dimensioned
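A sketch of finding the bottleneck, assuming every request passes through every stage in series. The stage names and capacities are invented.

```javascript
// Hypothetical capacities in requests per second for each stage of a pipeline.
const stages = { loadBalancer: 5000, appServer: 800, database: 300 };

// The stage with the lowest capacity caps the throughput of the whole system.
function bottleneck(capacities) {
  let slowest = null;
  for (const [name, capacity] of Object.entries(capacities)) {
    if (slowest === null || capacity < slowest.capacity) {
      slowest = { name, capacity };
    }
  }
  return slowest;
}

console.log(bottleneck(stages)); // { name: 'database', capacity: 300 }
```

In this example the load balancer has more than sixteen times the capacity the database can absorb.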
What resources can saturate?
Short answer: all
- Memory *
- CPU
- Bandwidth
- File descriptors
- Input/output buffers *
- ...
Non-catastrophic saturation: the system blocks
*Catastrophic saturation: the system breaks
Do we even notice when a resource is saturated?
Independence
No locks required
Immutability
Functional programming
Example: Erlang
Locks
Protect a resource
Example: global kernel lock
When requests cannot be processed, they accumulate
Input / output buffers
(which also saturate)
When Should We Use Locks?
Modelling dependencies (e.g. games)
Single process programming (Node.js)
Atomic operations
Separation of reads / writes
Events
Channels
...
Exercise: Visit Counter
We are a video service in 2012
We want to count visits to all videos
We had a single process Rails app
writing to multiple files
We have outgrown it
Can you identify possible bottlenecks?
Exercise +
Design a visit counter that can scale horizontally
What type of operations are necessary?
What calls do you need in the API?
What horizontal strategies would you use?
Exercise +
It is 2013, and we are in a frantic international expansion
We have visits from all over the world
How do we extend the counter?
Is it possible to have a precise, synchronized counter?
Can you think of alternatives?
Exercise + (in common)
Now design a server that assigns unique identifiers
UUIDs
We have a centralized server
It keeps a single counter
The server collapses under traffic
We are also opening new regions
Can you think of other options?
Availability
The percentage of time that the system is up
Also known as: uptime
Typical values:
- 99%: down 14 minutes per day
- 99.9%: down 10 minutes per week
- 99.999%: down 26 seconds per month
- 99.99999%: down 3 seconds per year
- 99.9999999%: down 3 seconds per century
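A sketch for converting an availability percentage into allowed downtime. The period table and the function name allowedDowntimeSeconds are illustrative; a month is taken as 30 days and a year as 365 days.

```javascript
// Assumption: 30-day months and 365-day years.
const SECONDS_PER_PERIOD = { day: 86400, week: 604800, month: 30 * 86400, year: 365 * 86400 };

// Allowed downtime = (1 - availability) * length of the period.
function allowedDowntimeSeconds(availabilityPercent, period) {
  return (1 - availabilityPercent / 100) * SECONDS_PER_PERIOD[period];
}

console.log((allowedDowntimeSeconds(99, 'day') / 60).toFixed(1));    // "14.4" minutes
console.log((allowedDowntimeSeconds(99.9, 'week') / 60).toFixed(1)); // "10.1" minutes
```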
Exercise: Periodic Downtime
Compute the availability of these systems
National bank: down 1 hour per day
Marketing: down 1 hour per month
Online shopping: down 5 hours per year
Exercise +
We have an availability commitment of 99.9%
We want to do monthly maintenance
How many minutes do we have?
An intervention takes two hours
We need to do it now
How can we keep our commitment?
Exercise + (in common)
Extra, extra!
Sales sold 99.99% uptime 😱
We have ten servers to update
Each one takes 30 minutes
How do we keep the commitment?
SLA
SLI: Service Level Indicator
SLO: Service Level Objective
SLA: Service Level Agreement
Chris Jones et al., SRE book
SLI
An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.
Indicator for service level
(So, basically a metric)
Examples:
- Latency
- Availability: 99.999%
- Error rate
- Throughput, traffic, requests per second
SLO
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.
Objective for service level
Examples:
- Average latency < 100 ms
- Error rate < 0.01%
Objectives are useful to set expectations
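A sketch of checking measured SLIs against the two example SLOs above. The object layout, the function name violatedSlos and the measured values are assumptions.

```javascript
// SLOs from the examples above, expressed as upper bounds on each SLI.
const slos = { averageLatencyMs: 100, errorRatePercent: 0.01 };

// Returns the names of the SLIs that miss their objective (value not below the bound).
function violatedSlos(measured, objectives) {
  return Object.keys(objectives).filter((name) => !(measured[name] < objectives[name]));
}

// Hypothetical measurements for one reporting window.
const measuredSlis = { averageLatencyMs: 85, errorRatePercent: 0.02 };
console.log(violatedSlos(measuredSlis, slos)); // [ 'errorRatePercent' ]
```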
SLA, for realz
Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
Agreement for service level
A business entity
Usually tricky
Exercise: Amazon SLA
We spend €1000/month on EC2
There is an earthquake and availability falls to 99.01%
How much money do we get from Amazon?
How long should EC2 be down in a month to get back:
- 10% of the bill?
- 100% of the bill?
Service Degradation
SLA is often just for availability
Sometimes services work but in a degraded condition
High latency, sporadic errors...
An SLA can combine several SLOs:
- Error rate < 0.1%
- Latency > 1000 ms ⇒ error
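A sketch of the combined rule above: a request counts as an error if it fails or if it takes longer than 1000 ms. The request record shape and the sample data are assumptions.

```javascript
const LATENCY_LIMIT_MS = 1000; // latency > 1000 ms counts as an error
const MAX_ERROR_RATE = 0.001;  // error rate < 0.1%

// Hypothetical request records: { ok: boolean, latencyMs: number }.
function effectiveErrorRate(requests) {
  if (requests.length === 0) return 0;
  const failed = requests.filter((r) => !r.ok || r.latencyMs > LATENCY_LIMIT_MS).length;
  return failed / requests.length;
}

const degradedSample = [
  { ok: true, latencyMs: 120 },
  { ok: true, latencyMs: 1500 }, // succeeded, but too slow: counted as an error
  { ok: false, latencyMs: 80 },
  { ok: true, latencyMs: 90 },
];
const degradedRate = effectiveErrorRate(degradedSample);
console.log(degradedRate);                  // 0.5
console.log(degradedRate < MAX_ERROR_RATE); // false
```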
Error budget
Most incidents are caused by changes
~70%
Want to meet the SLO? Limit the changes
Too much availability: bad
We could have made more changes!
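A sketch of an error budget computed from an SLO, assuming the SLO is a percentage of successful requests. The function name errorBudget and the monthly figures are invented.

```javascript
// Error budget: the number of requests allowed to fail while still meeting the SLO.
function errorBudget(sloPercent, totalRequests, failedRequests) {
  // Rounded to absorb floating-point noise in (100 - sloPercent).
  const allowedFailures = Math.round((totalRequests * (100 - sloPercent)) / 100);
  return { allowedFailures, remaining: allowedFailures - failedRequests };
}

// Hypothetical month: 1,000,000 requests against a 99.9% SLO, 400 failures so far.
console.log(errorBudget(99.9, 1000000, 400)); // { allowedFailures: 1000, remaining: 600 }
```

A positive remaining value is budget that could still be spent on changes; a negative one means the SLO has been missed.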
What we don't know
There are known knowns:
What we know we know
There are known unknowns:
What we know we don't know
There are unknown knowns:
What we don't know we know
There are unknown unknowns:
What we don't know we don't know