Diego Parra
Cristian Spinetta
@diegolparra
@dpsoft
@cebspinetta
@cspinetta
Monitoring... for what?
What do we really want to monitor?
How to design it?
What is not monitoring?
Observability anti-methodologies
Can we do it better?
The "what’s broken" indicates the symptom
The "why" indicates a (possibly intermediate) cause
"Your monitoring system should address two questions: what’s broken, and why?"
"In the event of a failure, monitoring data should immediately be able to provide visibility into impact of the failure as well as the effect of any fix deployed." by Cindy Sridharan.
Symptom (What?)                   | Cause (Why?)
----------------------------------+----------------------------------------
I'm serving HTTP 500s             | The DB is refusing connections
My responses are slow             | The web server is queuing requests
Users can't log in                | The auth client is receiving HTTP 503
Blackbox
Whitebox
What versus why is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
In a multilayered system, one person’s symptom is another person’s cause
Blackbox monitoring               | Whitebox monitoring
----------------------------------+-----------------------------------
What?                             | Why?
Detect active problem             | Detect imminent problem
User/Business points of view      | Component points of view
SLI/SLO based control             | Threshold based control
Mostly easy to know               | Mostly hard to know
Tends to be the last to alert     | Tends to be the early alarm
Reactive approach                 | Proactive approach
Usually on-call resolution        | Usually automatic resolution
Preferably few metrics            | Preferably not few metrics
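As a rough sketch of the blackbox side (my own illustration, not code from the talk; the endpoint and timeouts are hypothetical), a probe observes only what a user would see: the status code and the latency of an external request.

```scala
import java.net.{HttpURLConnection, URL}
import scala.util.Try

// Blackbox probe: looks only at externally visible behaviour
// (status code and latency), exactly like a user would.
object BlackboxProbe {
  // Hypothetical endpoint; point this at a real user-facing URL.
  private val endpoint = new URL("https://example.com/api/users/me")

  // Right(latencyMs) when the symptom is "OK", Left(description) otherwise.
  def probe(): Either[String, Long] = {
    val start = System.nanoTime()
    Try {
      val conn = endpoint.openConnection().asInstanceOf[HttpURLConnection]
      conn.setConnectTimeout(2000)
      conn.setReadTimeout(2000)
      try conn.getResponseCode
      finally conn.disconnect()
    }.toEither match {
      case Left(err) => Left(s"request failed: ${err.getMessage}")
      case Right(status) =>
        val elapsedMs = (System.nanoTime() - start) / 1000000
        if (status >= 200 && status < 300) Right(elapsedMs)
        else Left(s"HTTP $status after ${elapsedMs}ms")
    }
  }
}
```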
Keep an eye toward simplicity
Should be easy to read and understand
Don't disaggregate unnecessarily
Metrics/Alerts that are rarely exercised should be up for removal
Keep dashboards clean and tidy
Few is better than many
Time-series are preferable to discrete checks
Avoid averages
Select an appropriate resolution
Keep an eye on the tail latency
Pay close attention to how you are measuring
A histogram collects metrics in buckets
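A minimal sketch of the last three points (mine, assuming nothing about the tooling used in the talk): record latencies into fixed buckets and read tail percentiles from them instead of an average.

```scala
// Toy latency histogram: fixed millisecond buckets, no library assumed.
// Averages hide the tail; percentiles read from the buckets expose it.
object LatencyHistogram {
  private val bounds = Array[Long](5, 10, 25, 50, 100, 250, 500, 1000, 2500, Long.MaxValue)
  private val counts = new Array[Long](bounds.length)

  def record(latencyMs: Long): Unit = {
    val i = bounds.indexWhere(latencyMs <= _)
    counts(i) += 1
  }

  // Returns the upper bound of the bucket containing the q-th quantile.
  def percentile(q: Double): Long = {
    val total = counts.sum
    val target = math.ceil(q * total).toLong
    var cumulative = 0L
    var i = 0
    while (i < counts.length && cumulative < target) {
      cumulative += counts(i)
      i += 1
    }
    bounds(math.max(i - 1, 0))
  }
}

// Usage: a mean of ~60 ms can coexist with a p99 above one second.
// LatencyHistogram.record(42); LatencyHistogram.percentile(0.99)
```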
For every resource, monitor:
Utilization: % of time that the resource was busy
Errors: rate of error events
Saturation: amount of work the resource has to do, often queue length
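A rough USE-style sketch (my assumption, not the talk's tooling), treating a thread pool as the resource: utilization from busy threads, saturation from the queue, errors counted wherever submitted work fails.

```scala
import java.lang.management.ManagementFactory
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// USE for one resource (a thread pool standing in for "the resource"):
//   Utilization: busy threads / pool size
//   Saturation:  work queued and waiting for a thread
//   Errors:      counted by the caller when submitted tasks throw
object UseSample {
  private val queue = new LinkedBlockingQueue[Runnable]()
  val pool = new ThreadPoolExecutor(8, 8, 0L, TimeUnit.MILLISECONDS, queue)

  def utilization: Double = pool.getActiveCount.toDouble / pool.getMaximumPoolSize
  def saturation: Int     = queue.size()   // queue length

  // System-level hint for the same idea: OS load vs available CPUs.
  def cpuSaturationHint: Double =
    ManagementFactory.getOperatingSystemMXBean.getSystemLoadAverage /
      Runtime.getRuntime.availableProcessors()
}
```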
For every service, monitor:
Latency: time to service a request
Traffic: requests/second
Errors: error rate of requests
Saturation: how overloaded the service is
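A minimal sketch, assuming a plain in-process handler and no particular metrics library, of capturing the four golden signals around a service call.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}
import scala.util.Try

// Four golden signals around one service call, kept in plain counters;
// a real setup would export these to a metrics backend instead.
object GoldenSignals {
  val traffic       = new AtomicLong(0)    // requests seen (rate = traffic over time)
  val errors        = new AtomicLong(0)    // requests that failed
  val inFlight      = new AtomicInteger(0) // crude saturation proxy
  val lastLatencyMs = new AtomicLong(0)    // would feed a histogram in practice

  def observed[A](handler: => A): Try[A] = {
    traffic.incrementAndGet()
    inFlight.incrementAndGet()
    val start  = System.nanoTime()
    val result = Try(handler)
    lastLatencyMs.set((System.nanoTime() - start) / 1000000)
    inFlight.decrementAndGet()
    if (result.isFailure) errors.incrementAndGet()
    result
  }
}
```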
(Service Level Indicators)
Quantify whether we are meeting user expectations:
Is our service working as our users expect it to?
1. For each User Journey/Data Flow, identify suitable types of SLI from the SLI Menu:
2. Make a decision about how to measure good and valid events.
3. Decide where to measure the SLI
Availability
Specification: % of GET requests that complete successfully
Examples: backend API for user info
Implementation:
Latency
Specification: % of 2xx requests that complete in less than 500ms.
Implementation:
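The "Implementation:" lines are left open on the slides; purely as an illustration of one possible implementation (an assumption, not the talk's), both SLIs can be computed from a window of request records.

```scala
// One request as seen at the measurement point (e.g. the load balancer).
final case class RequestRecord(status: Int, durationMs: Long)

object SliMeasurement {
  // Availability SLI: share of GET requests that completed successfully.
  // Here every non-2xx counts as bad; the SLO slide later narrows it to 2xx / (2xx + 5xx).
  def availability(window: Seq[RequestRecord]): Double =
    if (window.isEmpty) 1.0
    else window.count(r => r.status >= 200 && r.status < 300).toDouble / window.size

  // Latency SLI: share of 2xx requests completing in under 500 ms.
  def latencyUnder500ms(window: Seq[RequestRecord]): Double = {
    val ok = window.filter(r => r.status >= 200 && r.status < 300)
    if (ok.isEmpty) 1.0
    else ok.count(_.durationMs < 500).toDouble / ok.size
  }
}
```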
Add objective:
Objectives for SLI
(Service Level Objectives)
Examples
Availability: 99.9% of GET requests complete successfully
Latency: 95% of 2xx requests complete in less than 500ms.
Conditions:
- Measured across all the backend servers from the load balancer
- Taking the past 24 hours
Availability = 2xx / (2xx + 5xx)
Latency measured only on 2xx status
End-to-end elapsed time for processing events within data pipeline
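Continuing the same assumption, a sketch of checking the two example objectives under the stated conditions: load-balancer records, the past 24 hours, availability as 2xx / (2xx + 5xx), latency measured only on 2xx responses.

```scala
import java.time.Instant

// A request as recorded at the load balancer, across all backend servers.
final case class LbRecord(at: Instant, status: Int, durationMs: Long)

object SloEvaluation {
  val availabilityTarget = 0.999   // 99.9% of GET requests complete successfully
  val latencyTarget      = 0.95    // 95% of 2xx requests complete in < 500 ms

  // Returns (availability SLO met, latency SLO met) for the past 24 hours.
  def evaluate(records: Seq[LbRecord], now: Instant = Instant.now()): (Boolean, Boolean) = {
    val window = records.filter(_.at.isAfter(now.minusSeconds(24 * 3600)))
    val ok     = window.filter(r => r.status >= 200 && r.status < 300)
    val errors = window.count(r => r.status >= 500 && r.status < 600)

    val availability = if (ok.size + errors == 0) 1.0 else ok.size.toDouble / (ok.size + errors)
    val fastEnough   = if (ok.isEmpty) 1.0 else ok.count(_.durationMs < 500).toDouble / ok.size

    (availability >= availabilityTarget, fastEnough >= latencyTarget)
  }
}
```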
The SLOs are still being defined. So far we have just chosen some values as suggestions, expressed with a "should be ...".
Does this rule detect a condition that is urgent, actionable and imminently user-visible?
Can I take action? Is it urgent, or could it wait until morning?
Questions that help to avoid false positives:
Alerts should be just about urgency!
Otherwise, it makes people demoralized and fatigued
Could the action be automated?
People add latency: automate as much as possible!
Are other people getting paged for this issue unnecessarily?
Highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes.
Observability captures what "monitoring" doesn't (and shouldn't)
based on evidence (not conjecture)
Discrete Events
Good practices for more effective logs:
Logging with context (trace-id / unit of work / whatever)
Use structured logs to enable machine readability
Standardized Logging Levels
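A small sketch of the first two practices using SLF4J's MDC (the logger name, field names and key=value payload are my assumptions, not the talk's stack): the trace-id travels in the MDC so every line carries it, and a JSON encoder on the appender turns the message plus MDC entries into machine-readable fields.

```scala
import org.slf4j.{LoggerFactory, MDC}
import java.util.UUID

// Context + structure: put the trace-id (or unit-of-work id) into the MDC so
// every log line carries it; a JSON encoder on the appender side then emits
// the message and MDC entries as structured fields.
object ContextualLogging {
  private val log = LoggerFactory.getLogger("checkout")  // hypothetical component name

  def withTraceId[A](traceId: String)(body: => A): A = {
    MDC.put("trace_id", traceId)
    try body
    finally MDC.remove("trace_id")
  }

  def handle(): Unit =
    withTraceId(UUID.randomUUID().toString) {
      log.info("payment authorized amount_cents=1250 currency=ARS")  // key=value payload
    }
}
```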
Aggregated data
For reliability and trend analysis:
What happens right now?
Reliability is not just about throughput and latency
What will happen next?
There are a lot of extra effects to consider: GC pauses, hiccups, stolen CPU, ...
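One of those extra effects made visible (a sketch of mine, not from the slides): accumulated GC time is exposed by the standard MX beans, and a "hiccup" is simply a sleep that wakes up later than asked.

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

// GC time from the standard MX beans, plus a crude hiccup meter:
// sleep 1 ms and report how much longer than 1 ms the wake-up actually took.
object RuntimeEffects {
  def totalGcMillis: Long =
    ManagementFactory.getGarbageCollectorMXBeans.asScala.map(_.getCollectionTime).sum

  def hiccupMillis(): Long = {
    val start = System.nanoTime()
    Thread.sleep(1)
    (System.nanoTime() - start) / 1000000 - 1
  }
}
```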
A few of the critical questions that tracing can answer quickly and easily:
Which services did a request pass through?
Where are the bottlenecks?
How much time is lost due to network lag during communication between services?
What occurred in each service for a given request?
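A toy span model (no specific tracing library assumed) with just enough structure to answer those questions: a shared trace-id for the path, timestamps for the bottlenecks, and parent links to attribute time per service.

```scala
// Minimal span: enough structure to reconstruct the path of a request,
// find the slowest hop, and attribute time to each service.
final case class Span(
    traceId: String,          // shared by every span of one request
    spanId: String,
    parentId: Option[String], // links callee spans to their caller
    service: String,
    operation: String,
    startMicros: Long,
    endMicros: Long
) {
  def durationMicros: Long = endMicros - startMicros
}

object TraceQueries {
  // "Which services did a request pass through?"
  def servicesOnPath(trace: Seq[Span]): Seq[String] =
    trace.sortBy(_.startMicros).map(_.service).distinct

  // "Where are the bottlenecks?" - the span that took the longest.
  def slowestSpan(trace: Seq[Span]): Option[Span] =
    trace.sortBy(-_.durationMicros).headOption
}
```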
Use four golden signals + USE metrics
Instrument services as early as possible
Combine with profiling, logging and tracing
Iterate
1. Find a system or environment component you are not responsible for
2. Hypothesize that the issue is with that component
3. Redirect the issue to the responsible team
4. When proven wrong, go to 1
Open dashboard
All green? Assume everything is good
Something red? Assume that's a problem
Is Controllability the purpose of Observability?
Is Observability just about Brute Force?
What about Systems and their Dynamics?
What about Signaling and State?
Diego Parra
Cristian Spinetta
@diegolparra
@dpsoft
@cebspinetta
@cspinetta
Feedback Control is used heavily in
Process control
Electronics
Automation
Aerospace
Was used by the Egyptians in a water clock more than 2000 years ago
Signals are emitted or received
Signals are indicative of operations or outcomes
Signals influence others and help to infer the state of others as well as ourselves
A signal is a direct and meaningful unit of information within a context, like an emoji
Signals are recorded and then given a score
States
Signals
UNKNOWN
OK
DEVIATING
DEGRADED
DEFECTIVE
DOWN
START
STOP
CALL
SUCCEED
RETRY
DELAY
DISCONNECT
SCHEDULE
S1 CONTEXT        | S2 CONTEXT
------------------+------------------
S1 EMIT START     |
S2 EMIT CALL      |
                  | S2 EMIT START
                  | S2 EMIT FAIL
                  | S2 EMIT STOP
S2 RECEIPT FAIL   |
S1 EMIT FAIL      |
S1 EMIT STOP      |
Service-to-Service interaction
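A compact sketch of the model described above: the signal and state names come from the slides, while the scoring rule is only my illustrative assumption, not the talk's actual inference.

```scala
// Signals and states as named on the slides; the scoring rule below is only
// an illustrative assumption.
object SignalModel {
  sealed trait Signal
  case object Start extends Signal;  case object Stop extends Signal
  case object Call extends Signal;   case object Succeed extends Signal
  case object Fail extends Signal;   case object Retry extends Signal
  case object Delay extends Signal;  case object Disconnect extends Signal
  case object Schedule extends Signal

  sealed trait Orientation
  case object Emit extends Orientation     // recorded by the service itself
  case object Receipt extends Orientation  // recorded about a remote party

  sealed trait State
  case object Unknown extends State;   case object Ok extends State
  case object Deviating extends State; case object Degraded extends State
  case object Defective extends State; case object Down extends State

  // One row of the interaction table: "within <context>, <subject> <orientation> <signal>".
  final case class Record(context: String, subject: String, orientation: Orientation, signal: Signal)

  // Toy scoring: infer a service's state from the signals recorded about it.
  def infer(recent: Seq[Record], service: String): State = {
    val signals = recent.filter(_.subject == service).map(_.signal)
    if (signals.isEmpty) Unknown
    else if (signals.forall(_ == Disconnect)) Down
    else {
      val failRatio = signals.count(_ == Fail).toDouble / signals.size
      if (failRatio > 0.5) Defective
      else if (signals.contains(Fail)) Degraded
      else if (signals.exists(s => s == Retry || s == Delay)) Deviating
      else Ok
    }
  }
}
```

Feeding in the S1-context rows of the table above (CALL plus a FAIL receipt about S2), this toy rule would score S2 as Degraded from S1's point of view.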