Monitoring Production Methodologically
Diego Parra
Cristian Spinetta
@diegolparra
@dpsoft
@cebspinetta
@cspinetta
Outline
Monitoring... for what?
What do we really want to monitor?
How to design it?
What is not monitoring?
Observability anti-methodologies
Can we do better?
What do we need when we want to monitor a system?
Classic starting points...
Is everything OK?
What is the response time of my API?
Since when has the system been broken?
Where does the problem come from?
...
How fast is disk usage growing?
perhaps, after a lot of effort we've gotten a graph for each question
so, what do we have now?...
a dashboard that even tells us the weather
but wait!
How useful is this dashboard to face production problems?
Not very useful, I think...
Monitoring
Monitoring... for what???
The "what’s broken" indicates the symptom
The "why" indicates a (possibly intermediate) cause
"Your monitoring system should address two questions: what’s broken, and why?"
"In the event of a failure, monitoring data should immediately be able to provide visibility into impact of the failure as well as the effect of any fix deployed." by Cindy Sridharan.
Some examples...
Symptom (What?)          | Cause (Why?)
--------------------------------------------------------------
I'm serving HTTP 500s    | DBs are refusing connections
My responses are slow    | Web server is queuing requests
Users can't log in       | Auth client is receiving HTTP 503
Blackbox
Whitebox
What versus why is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
In a multilayered system, one person’s symptom is another person’s cause
Key distinctions
Blackbox monitoring (what?)      | Whitebox monitoring (why?)
----------------------------------------------------------------
User/Business point of view      | Component point of view
SLI/SLO-based control            | Threshold-based control
Mostly easy to know              | Mostly hard to know
Tends to be the last to alert    | Tends to be the early alarm
Reactive approach                | Proactive approach
Usually on-call resolution       | Usually automatic resolution
Preferably few metrics           | Preferably many metrics
Detect active problems           | Detect imminent problems
Monitoring: best practices
Keep an eye toward simplicity
Should be easy to read and understand
Don't disaggregate unnecessarily
Metrics/Alerts that are rarely exercised should be up for removal
Keep dashboards clean and tidy
Few is better than many
Monitoring: best practices
Time-series is preferable to discrete checks
Avoid averages
Select an appropriate resolution
Keep an eye on tail latency
Pay close attention to how you are measuring
Collect metrics in buckets
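Why avoid averages and watch the tail? A minimal sketch, assuming made-up latency samples and bucket bounds (the names `percentile` and `bucketize` are illustrative, not taken from any library):

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integers
    return ordered[max(rank, 1) - 1]

def bucketize(samples, upper_bounds):
    """Cumulative bucket counts: how many samples are <= each upper bound."""
    bounds = list(upper_bounds) + [float("inf")]
    return {bound: sum(1 for s in samples if s <= bound) for bound in bounds}

# Hypothetical latencies in ms: 95 fast requests and 5 very slow ones.
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)  # pulled up by the 5 slow requests
p50_ms = percentile(latencies_ms, 50)    # what a typical request sees
p99_ms = percentile(latencies_ms, 99)    # what the slowest requests see
buckets = bucketize(latencies_ms, [50, 100, 500, 1000])

print(mean_ms, p50_ms, p99_ms, buckets)
```

The mean describes no real request here: it sits far above the typical latency and far below the slow tail, while the percentiles and buckets show both groups.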
Monitoring Methodologically
USE Method
For every resource, monitor
Utilization: % time that the resource was busy
Saturation: amount of work resource has to do, often queue length
Errors: rate of error events
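The USE Method can be sketched for a single resource as follows. This is only an illustration: the `ResourceSample` fields and the sample values are assumptions, and real numbers would come from the operating system or a metrics agent.

```python
from dataclasses import dataclass

@dataclass
class ResourceSample:
    interval_seconds: float  # length of the sampling interval
    busy_seconds: float      # time the resource was busy in that interval
    queue_length: int        # work waiting for the resource at sample time
    error_count: int         # error events seen in that interval

def use_metrics(samples):
    """Return utilization (%), saturation (mean queue length), errors (per second)."""
    total_interval = sum(s.interval_seconds for s in samples)
    total_busy = sum(s.busy_seconds for s in samples)
    return {
        "utilization_pct": 100 * total_busy / total_interval,
        "saturation_avg_queue": sum(s.queue_length for s in samples) / len(samples),
        "errors_per_second": sum(s.error_count for s in samples) / total_interval,
    }

# Hypothetical samples for one disk, taken every 10 seconds.
disk_samples = [
    ResourceSample(10.0, 5.0, 0, 0),
    ResourceSample(10.0, 10.0, 4, 3),
    ResourceSample(10.0, 9.0, 2, 0),
]
disk = use_metrics(disk_samples)
print(disk)
```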
4 Golden Signals
For every service, monitor
Latency: time to service a request
Traffic: requests/second
Errors: error rate of requests
Saturation: how overloaded a service is
4 Golden Signals - Example
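A minimal sketch of the four signals computed over one time window. The `Request` record, the choice of p95, and saturation as in-flight requests over an assumed capacity are all illustrative assumptions, not something the slides prescribe.

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int

def golden_signals(requests, window_seconds, in_flight, capacity):
    """Compute latency, traffic, errors and saturation for one window."""
    ok = sorted(r.duration_ms for r in requests if r.status < 500)
    failed = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank p95 over successful requests only.
    rank = (95 * len(ok) + 99) // 100
    return {
        "latency_p95_ms": ok[rank - 1] if ok else None,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests) if requests else 0.0,
        "saturation": in_flight / capacity,  # assumed definition of "how overloaded"
    }

# Hypothetical 5-second window: 8 successful requests and 2 failures.
window = [Request(100.0, 200)] * 8 + [Request(30.0, 500)] * 2
signals = golden_signals(window, window_seconds=5, in_flight=40, capacity=50)
print(signals)
```

Latency is measured on successful requests only, so fast failures do not make the service look healthier than it is.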
Monitoring with SRE practices
Monitoring with SLIs (Service Level Indicators)
Quantify whether user expectations are met:
is our service working as our users expect it to?
Monitoring with SLIs
1. For each user journey/data flow, identify suitable SLI types from the SLI menu.
2. Decide how to measure good and valid events.
3. Decide where to measure the SLI.
Monitoring with SLIs
Availability
Specification: % of GET requests that complete successfully
Examples: backend API for user info
Implementation:
Latency
Specification: % of requests that return 2xx will complete in less than 500ms.
Implementation:
Monitoring with SLIs + SLOs (Service Level Objectives)
Objectives for SLIs
Add an objective:
Examples
Availability: 99.9% of GET requests complete successfully
Latency: 95% of requests that return 2xx will complete in less than 500ms.
Conditions:
- Measured across all the backend servers from the load balancer
- Taking the past 24 hours
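The two example SLIs and their objectives can be sketched like this. The function names, the sample data, and the choice to report 1.0 when there are no valid events are assumptions for illustration; the input is assumed to be the last 24 hours of load balancer records.

```python
def availability_sli(status_codes):
    """Good events are 2xx; valid events are 2xx + 5xx (other codes are ignored)."""
    good = sum(1 for s in status_codes if 200 <= s < 300)
    bad = sum(1 for s in status_codes if 500 <= s < 600)
    valid = good + bad
    return good / valid if valid else 1.0  # assumption: no traffic counts as meeting the SLI

def latency_sli(requests, threshold_ms=500):
    """Fraction of 2xx requests faster than threshold_ms; requests are (status, duration_ms)."""
    durations = [d for status, d in requests if 200 <= status < 300]
    if not durations:
        return 1.0  # same assumption as above
    return sum(1 for d in durations if d < threshold_ms) / len(durations)

def meets_slo(sli, objective):
    return sli >= objective

# Hypothetical 24 hours of GET requests seen at the load balancer.
statuses = [200] * 998 + [500] * 2 + [404] * 5
timed = [(200, 120)] * 96 + [(200, 800)] * 4 + [(500, 900)] * 3

availability = availability_sli(statuses)
latency = latency_sli(timed)
print(availability, meets_slo(availability, 0.999))  # availability objective: 99.9%
print(latency, meets_slo(latency, 0.95))             # latency objective: 95% under 500 ms
```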
SLIs examples
Availability (2xx / (2xx + 5xx))
Latency measured only on 2xx status
End-to-end elapsed time for processing events within the data pipeline
The SLOs are still being defined. So far we have just chosen some values as suggestions, expressed as "should be ..."
Alerting
Symptom-based alerting
Questions that help to avoid false positives:
Does this rule detect a condition that is urgent, actionable and imminently user-visible?
Can I take action? Is it urgent or could it wait until morning?
Could the action be automated?
Are other people getting paged for this issue unnecessarily?
Alerts should be just about urgency!
Otherwise, they leave people demoralized and fatigued
People add latency: automate as much as possible!
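The questions above can be turned into a small routing rule. This is only a sketch: the `AlertCandidate` fields and the three outcomes ("automate", "page", "ticket") are assumptions, with "ticket" standing for anything that can wait until morning.

```python
from dataclasses import dataclass

@dataclass
class AlertCandidate:
    name: str
    user_visible: bool  # is the condition (imminently) visible to users?
    urgent: bool        # does it need attention now rather than tomorrow?
    actionable: bool    # can the person who is paged actually do something?
    automatable: bool   # could a machine take the action instead?

def route(alert):
    """Decide what to do with an alert rule before it is allowed to page a human."""
    if alert.automatable:
        return "automate"  # people add latency
    if alert.user_visible and alert.urgent and alert.actionable:
        return "page"      # alerts should be just about urgency
    return "ticket"        # not urgent: handle it during working hours

# Hypothetical alert rules.
print(route(AlertCandidate("api-5xx-rate-high", True, True, True, False)))    # page
print(route(AlertCandidate("disk-80-percent-full", False, False, True, False)))  # ticket
print(route(AlertCandidate("worker-stuck", True, True, True, True)))          # automate
```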
OK, but now I've lost the ability to understand and predict the behavior of my system
and that's where the observability tools come in...
Observability
Highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes.
Observability captures what "monitoring" doesn't (and shouldn't)
Based on evidence (not conjecture)
Observability tools
Logs
Discrete Events
Good practices for more effective logs:
Logging with context (trace-id / uow / whatever)
Use structured logs to enable machine readability
Standardized Logging Levels
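A minimal sketch of these practices with Python's standard `logging` module: each record is emitted as one JSON object and carries a trace id as context. The `JsonFormatter` class, the `checkout` logger name, and the `trace_id` field are illustrative assumptions.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,                    # standardized level
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),  # context, if provided
        })

def build_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

stream = io.StringIO()  # stands in for stdout or a log file
log = build_logger(stream)
# The trace id travels with the record through the `extra` argument.
log.info("payment accepted", extra={"trace_id": "abc123"})
print(stream.getvalue())
```

Because every line is valid JSON with the same keys, a log backend can filter by `trace_id` or `level` without parsing free text.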
Example: logs by Loki
Example: filtering logs
Example: Metrics from logs
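Deriving a metric from logs can be as simple as counting events. The sketch below is not tied to Loki or any other backend: it assumes JSON log lines with a `level` field and counts them per level, skipping lines that are not structured.

```python
import json
from collections import Counter

def count_by_level(lines):
    """Count structured log lines per level; non-JSON lines are ignored."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured line: nothing to aggregate
        if isinstance(event, dict):
            counts[event.get("level", "UNKNOWN")] += 1
    return counts

# Hypothetical log lines collected over one minute.
log_lines = [
    '{"level": "INFO", "message": "payment accepted"}',
    '{"level": "ERROR", "message": "card declined"}',
    '{"level": "ERROR", "message": "gateway timeout"}',
    'plain text line without structure',
]
per_level = count_by_level(log_lines)
print(per_level)
```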
Metrics
Aggregated data
For reliability and trending in use:
What happens right now?
Reliability is not just about throughput and latency
What will happen next?
There are a lot of extra effects to consider: GC, hiccups, CPU steal, ...
Metrics example
Tracing
A few of the critical questions that tracing can answer quickly and easily:
Which services did a request pass through?
Where are the bottlenecks?
How much time is lost due to network lag during communication between services?
What occurred in each service for a given request?
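How a trace answers these questions can be sketched with a handful of spans. The `Span` fields and the sample trace are assumptions for illustration, and the self-time calculation assumes that the children of a span do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: float
    end_ms: float

def services_visited(spans):
    """Services a request passed through, in order of span start time."""
    return [s.service for s in sorted(spans, key=lambda s: s.start_ms)]

def self_times(spans):
    """Time spent in each span itself: its duration minus its direct children."""
    durations = {s.span_id: s.end_ms - s.start_ms for s in spans}
    child_total = {s.span_id: 0.0 for s in spans}
    for s in spans:
        if s.parent_id in child_total:
            child_total[s.parent_id] += durations[s.span_id]
    return {sid: durations[sid] - child_total[sid] for sid in durations}

# Hypothetical trace of a single request.
trace = [
    Span("a", None, "gateway", 0, 100),
    Span("b", "a", "auth", 5, 25),
    Span("c", "a", "orders", 30, 95),
    Span("d", "c", "db", 35, 90),
]
times = self_times(trace)
bottleneck = max(times, key=times.get)
print(services_visited(trace), times, bottleneck)
```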
Distributed tracing example
In Summary
Use four golden signals + USE metrics
Instrument services as early as possible
Combine with profiling, logging and tracing
Iterate
Observability with
anti-methodologies
Blame Someone Else
1. Find a system or environment component you are not responsible for
2. Hypothesize that the issue is with that component
3. Redirect the issue to the responsible team
4. When proven wrong, go to 1
Drunk Man
Change things at random until the problem goes away
Traffic Light
Open dashboard
All green? Assume everything is good
Something red? Assume that's a problem
Can we do better?
Is Controllability the purpose of Observability?
Is Observability just about Brute Force?
What about Systems and their Dynamics?
What about Signaling and State?
Questions?
Thanks!
Bonus!
Feedback Control
Feedback Control is used heavily in
Process control
Electronics
Automation
Aerospace
It was used by the Egyptians in a water clock more than 2,000 years ago
Feedback Control
OpenSignals
Signals are emitted or received
Signals are indicative of operations or outcomes
Signals influence others and help to infer the state of others as well as ourselves
A signal is a direct and meaningful unit of information within a context, like an emoji
Signals are recorded and then given a score
OpenSignals
States
UNKNOWN
OK
DEVIATING
DEGRADED
DEFECTIVE
DOWN
Signals
START
STOP
CALL
SUCCEED
RETRY
DELAY
DISCONNECT
SCHEDULE
S1 CONTEXT         | S2 CONTEXT
------------------------------------------
S1 EMIT START      |
S2 EMIT CALL       |
                   | S2 EMIT START
                   | S2 EMIT FAIL
                   | S2 EMIT STOP
S2 RECEIPT FAIL    |
S1 EMIT FAIL       |
S1 EMIT STOP       |
Service-to-Service interaction
Monitoring an API
Monitoring data pipeline
Monitoring Production: Methodologically
A tour of Monitoring / Observability and some of the most used methodologies (USE | THE FOUR GOLDEN SIGNALS) to monitor systems and troubleshoot. Some of the questions that guided this talk are:
- How to design a good monitoring system?
- What methodologies can we apply?
- What is observability and how can it benefit us?
- Are we doing the right things? Really?