Monitoring Production Methodologically
Diego Parra
Cristian Spinetta


@diegolparra
@dpsoft
@cebspinetta


@cspinetta

Outline
Monitoring... for what?

What do we really want to monitor?
How to design it?
What is not monitoring?
Observability anti-methodologies





Can we do it better?

What do we need when we want to monitor a system?

Classic starting points...
Is everything OK?
What is the response time of my API?
Since when has the system been broken?
Where does the problem come from?
How fast is disk usage growing?
...










Perhaps, after a lot of effort, we've got a graph for each question.
So, what do we have now?...


it can even show you the weather

But wait!
How useful is this dashboard when facing production problems?
Not very useful, I think...


Monitoring
Monitoring... for what???
"Your monitoring system should address two questions: what's broken, and why?"

The "what's broken" indicates the symptom

The "why" indicates a (possibly intermediate) cause

"In the event of a failure, monitoring data should immediately be able to provide visibility into the impact of the failure as well as the effect of any fix deployed." — Cindy Sridharan
Some examples...

Symptom (What?)         →  Cause (Why?)
I'm serving HTTP 500s   →  The DB is refusing connections
My responses are slow   →  The web server is queuing requests
Users can't log in      →  The auth client is receiving HTTP 503

Blackbox monitoring sees the symptom; whitebox monitoring sees the cause.
What versus why is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
In a multilayered system, one person’s symptom is another person’s cause
Key distinctions

Blackbox monitoring             | Whitebox monitoring
User/business point of view     | Component point of view
SLI/SLO-based control           | Threshold-based control
Mostly easy to know             | Mostly hard to know
Tends to be the last to alert   | Tends to be the early alarm
Reactive approach               | Proactive approach
Usually on-call resolution      | Usually automatic resolution
Preferably few metrics          | Preferably many metrics
What?                           | Why?
Detects active problems         | Detects imminent problems

Monitoring: best practices
Keep an eye toward simplicity

Should be easy to read and understand

Don't disaggregate unnecessarily

Metrics/alerts that are rarely exercised should be up for removal
Keep dashboards clean and tidy


Few is better than many

Monitoring: best practices
Time series are preferable to discrete checks

Avoid averages

Select an appropriate resolution

Keep an eye on tail latency

Pay close attention to how you are measuring:
collect metrics in buckets (histograms)
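A quick sketch (synthetic numbers, not from the talk) of why averages hide the tail:

```python
import math

# 90 fast requests and 10 slow ones: the mean looks fine,
# the tail tells the real story. Numbers are synthetic.
latencies_ms = [10] * 90 + [1000] * 10

def percentile(samples, p):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

print(mean)  # 109.0 -- "about 100 ms, looks healthy"
print(p50)   # 10    -- most users are fast
print(p95)   # 1000  -- but 1 in 20 waits a full second
```

This is why histogram buckets beat a single average: they keep enough shape to recover percentiles.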


Monitoring Methodologically


USE Method
For every resource, monitor:

Utilization: % of time the resource was busy

Errors: rate of error events

Saturation: degree of extra work the resource cannot service, often queue length
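The three USE numbers can be computed from sampled counters; everything below is illustrative, not a real measurement:

```python
# USE sketch for a hypothetical disk over a 10-second window.
# All numbers are made up for illustration.
window_s = 10.0
busy_s = 7.5                      # time the device spent servicing I/O
io_errors = 3                     # error events observed in the window
queue_lengths = [0, 1, 4, 6, 2]   # sampled queue depth

utilization_pct = 100 * busy_s / window_s             # U: % of time busy
error_rate = io_errors / window_s                     # E: errors per second
saturation = sum(queue_lengths) / len(queue_lengths)  # S: average queue length

print(utilization_pct, error_rate, saturation)  # 75.0 0.3 2.6
```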



4 Golden Signals
For every service, monitor:

Latency: time to service a request

Traffic: requests per second

Errors: error rate of requests

Saturation: how overloaded a service is
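A minimal sketch of deriving the four signals from a window of request records (synthetic data; the in-flight/limit capacity model for saturation is an assumption):

```python
# Deriving the four golden signals from a 60-second window of
# (status_code, latency_ms) request records. Data is synthetic.
window_s = 60
requests = [(200, 120), (200, 80), (500, 30), (200, 95), (503, 10), (200, 110)]

traffic = len(requests) / window_s                                # requests/second
errors = sum(1 for s, _ in requests if s >= 500) / len(requests)  # error rate

# Latency, measured only on successful requests (errors can fail fast).
good_ms = sorted(ms for s, ms in requests if s < 500)
p50_ms = good_ms[len(good_ms) // 2]

# Saturation needs a capacity model, e.g. in-flight requests vs. a limit.
in_flight, limit = 42, 100
saturation = in_flight / limit

print(traffic, errors, p50_ms, saturation)  # 0.1 0.333... 110 0.42
```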


4 Golden Signals - Example

Monitoring
with SRE practices
Monitoring with SLIs
(Service Level Indicators)
Quantify whether we meet user expectations:
is our service working as our users expect it to?

Monitoring with SLIs
1. For each user journey / data flow, identify suitable SLI types from the SLI menu.

2. Decide how to measure good and valid events.
3. Decide where to measure the SLI.
3. Decide where to measure the SLI
Monitoring with SLIs
Availability
Specification: % of GET requests that complete successfully
Example: backend API for user info
Implementation:



Latency
Specification: % of requests returning 2xx that complete in less than 500 ms.
Implementation:
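Both example SLIs can be computed over load-balancer request records; the data below is illustrative:

```python
# Computing the two example SLIs from load-balancer request records.
# Each record: (method, status, latency_ms). Data is made up.
records = [
    ("GET", 200, 130), ("GET", 200, 310), ("GET", 500, 40),
    ("GET", 200, 620), ("POST", 201, 90), ("GET", 200, 450),
]

gets = [(s, ms) for m, s, ms in records if m == "GET"]

# Availability SLI: % of GET requests completing successfully (2xx).
good = sum(1 for s, _ in gets if 200 <= s < 300)
availability = 100 * good / len(gets)

# Latency SLI: % of 2xx GET requests completing in < 500 ms.
ok_ms = [ms for s, ms in gets if 200 <= s < 300]
fast = sum(1 for ms in ok_ms if ms < 500)
latency_sli = 100 * fast / len(ok_ms)

print(availability, latency_sli)  # 80.0 75.0
```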



Monitoring with SLIs
Add objective:


+ SLOs


Objectives for SLI
(Service Level Objectives)
Examples
Availability: 99.9% of GET requests complete successfully
Latency: 95% of requests that return 2xx complete in less than 500 ms.
Conditions:
- Measured across all the backend servers from the load balancer
- Taking the past 24 hours
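Checking such an objective, and the error budget it implies, is simple arithmetic (the counts below are made up):

```python
# Checking the example availability SLO and the remaining error
# budget over the past 24 hours. Counts are illustrative.
total_get = 100_000
good_get = 99_940

availability = good_get / total_get    # 0.9994
slo = 0.999                            # the 99.9% objective
error_budget = 1 - slo                 # allowed failure fraction
budget_spent = (1 - availability) / error_budget

print(availability >= slo)       # True: SLO met
print(round(budget_spent, 2))    # 0.6 -> 60% of the budget burned
```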
SLI examples
Availability: 2xx / (2xx + 5xx)

Latency: measured only on 2xx responses

End-to-end elapsed time for processing events within the data pipeline


The SLOs are still being defined. So far we have just chosen some suggested values, expressed as "should be ...".
Alerting

Symptom-based alerting

Alerts should be just about urgency!
Otherwise, they leave people demoralized and fatigued.

Questions that help to avoid false positives:

Does this rule detect a condition that is urgent, actionable and imminently user-visible?

Can I take action? Is it urgent, or could it wait until morning?

Could the action be automated?
People add latency: automate as much as possible!

Are other people getting paged for this issue unnecessarily?
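These questions can be baked into the paging rule itself as a predicate; the thresholds below are illustrative, not recommendations:

```python
# A symptom-based paging predicate: page a human only when impact is
# urgent, sustained and user-visible. Thresholds are made up.
def should_page(error_rate: float, duration_min: float) -> bool:
    """Return True only for urgent, actionable, user-visible impact."""
    urgent = error_rate > 0.05     # more than 5% of requests failing
    sustained = duration_min >= 5  # not a transient blip
    return urgent and sustained

print(should_page(0.08, 10))  # True  -> wake someone up
print(should_page(0.08, 1))   # False -> probably a blip
print(should_page(0.01, 30))  # False -> within budget, can wait
```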

OK, but now I've lost the power to understand and predict the behavior of my system...
and that's where observability tools come in...

Observability
Highly granular insight into the behavior of systems, along with rich context: perfect for debugging.
Observability captures what monitoring doesn't (and shouldn't),
based on evidence (not conjecture)





Observability tools
Logs
Discrete Events
Good practices for more effective logs:
Log with context (trace-id / unit of work / whatever)

Use structured logs to enable machine readability

Standardize logging levels
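A minimal structured-logging sketch with Python's stdlib (the `trace_id` field and the `payments` logger name are made-up examples):

```python
import json
import logging

# Emit one JSON object per log line, carrying context such as a trace id.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context attached via `extra={...}` lands on the record object.
        if hasattr(record, "trace_id"):
            entry["trace_id"] = record.trace_id
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("charge accepted", extra={"trace_id": "abc-123"})
# -> {"level": "INFO", "logger": "payments", "message": "charge accepted", "trace_id": "abc-123"}
```

Machine-readable lines like this are what make the Loki-style filtering and metrics-from-logs examples on the next slides possible.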



Example: logs by Loki

Example: filtering logs

Example: Metrics from logs

Metrics
Aggregated data
Used for reliability and trending:
What is happening right now?

What will happen next?

Reliability is not just about throughput and latency:
there are a lot of extra effects to consider: GC, hiccups, stolen CPU, ...



Metrics example

Tracing
A few of the critical questions that tracing can answer quickly and easily:
Which services did a request pass through?

Where are the bottlenecks?

How much time is lost due to network lag between services?

What occurred in each service for a given request?


Distributed tracing example

In Summary
Use the four golden signals + USE metrics


Instrument services as early as possible


Combine with profiling, logging and tracing


Iterate


Observability with
anti-methodologies
Blame Someone Else
1. Find a system or environment component you are not responsible for

2. Hypothesize that the issue is with that component
3. Redirect the issue to the responsible team
4. When proven wrong, go to 1




Drunk Man
Change things at random until the problem goes away



Traffic Light
Open dashboard
All green? Assume everything is good
Something red? Assume that's a problem




Can we do it better?
Is Controllability the purpose of Observability?
Is Observability just about Brute Force?
What about Systems and their Dynamics?



What about Signaling and State?


Questions?

Thanks!

Diego Parra
Cristian Spinetta


@diegolparra
@dpsoft
@cebspinetta


@cspinetta
Bonus!

Feedback Control
Feedback Control is used heavily in

Process control


Electronics


Automation


Aerospace

Was used by the Egyptians in a water clock more than 2000 years ago
Feedback Control


OpenSignals

Signals are emitted or received

Signals are indicative of operations or outcomes

Signals influence others and help to infer the state of others as well as ourselves

A signal is a direct and meaningful unit of information within a context, like an emoji

Signals are recorded and then given a score
OpenSignals
States
Signals
UNKNOWN

OK

DEVIATING

DEGRADED

DEFECTIVE

DOWN

START

STOP

CALL

SUCCEED

RETRY

DELAY

DISCONNECT

SCHEDULE

S1 CONTEXT | S2 CONTEXT
------------------------------------------
S1 EMIT START |
S2 EMIT CALL |
| S2 EMIT START
| S2 EMIT FAIL
| S2 EMIT STOP
S2 RECEIPT FAIL |
S1 EMIT FAIL |
S1 EMIT STOP |
Service-to-Service interaction
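The interaction above can be sketched in code. The signal names come from the slides, but the `Service`/`emit` recording below is a made-up illustration, not the OpenSignals API:

```python
from enum import Enum

# Signal names taken from the slides; FAIL appears in the interaction table.
class Signal(Enum):
    START = "START"
    STOP = "STOP"
    CALL = "CALL"
    SUCCEED = "SUCCEED"
    FAIL = "FAIL"
    RETRY = "RETRY"

# A hypothetical service that records the signals it emits.
class Service:
    def __init__(self, name):
        self.name = name
        self.emitted = []

    def emit(self, signal):
        self.emitted.append(signal)

# Replaying the S1/S2 interaction from the table:
s1, s2 = Service("S1"), Service("S2")
s1.emit(Signal.START)
s1.emit(Signal.CALL)    # S1 calls S2
s2.emit(Signal.START)
s2.emit(Signal.FAIL)    # S2 fails...
s2.emit(Signal.STOP)
s1.emit(Signal.FAIL)    # ...so S1 sees the failure too
s1.emit(Signal.STOP)

print([s.value for s in s1.emitted])  # ['START', 'CALL', 'FAIL', 'STOP']
```

From sequences like these, a scoring function could infer each service's state (OK, DEGRADED, DOWN, ...) without scraping raw metrics.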

Monitoring an API
Monitoring data pipeline

Monitoring Production: Methodologically
By Diego Parra
A tour of Monitoring / Observability and some of the most-used methodologies (USE, the Four Golden Signals) to monitor systems and troubleshoot. Some of the questions that guided this talk:
- How to design a good monitoring system?
- What methodologies can we apply?
- What is observability and how can it benefit us?
- Are we doing the right things? Really?