Diego Parra
Cristian Spinetta
@diegolparra
@dpsoft
@cebspinetta
@cspinetta
Monitoring... for what?
What do we really want to monitor?
How to design it?
What is not monitoring?
Observability anti-methodologies
Can we do it better?
The "what’s broken" indicates the symptom
The "why" indicates a (possibly intermediate) cause
"Your monitoring system should address two questions: what’s broken, and why?"
"In the event of a failure, monitoring data should immediately be able to provide visibility into impact of the failure as well as the effect of any fix deployed." by Cindy Sridharan.
Symptom (What?)                   | Cause (Why?)
----------------------------------+----------------------------------------
I'm serving HTTP 500s             | The DB is refusing connections
My responses are slow             | The web server is queuing requests
Users can't log in                | The auth client is receiving HTTP 503
Blackbox
Whitebox
What versus why is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
In a multilayered system, one person’s symptom is another person’s cause
Blackbox monitoring               | Whitebox monitoring
----------------------------------+-----------------------------------
What?                             | Why?
Detect active problem             | Detect imminent problem
User/Business points of view      | Component points of view
SLI/SLO based control             | Threshold based control
Mostly easy to know               | Mostly hard to know
Tends to be the last to alert     | Tends to be the early alarm
Reactive approach                 | Proactive approach
Usually on-call resolution        | Usually automatic resolution
Preferably few metrics            | Preferably not few metrics
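As a rough sketch of the blackbox side (my own illustration, not code from the talk; the endpoint and timeouts are hypothetical), a probe observes only what a user would see: the status code and the latency of an external request.

```scala
import java.net.{HttpURLConnection, URL}
import scala.util.Try

// Blackbox probe: looks only at externally visible behaviour
// (status code and latency), exactly like a user would.
object BlackboxProbe {
  // Hypothetical endpoint; point this at a real user-facing URL.
  private val endpoint = new URL("https://example.com/api/users/me")

  // Right(latencyMs) when the symptom is "OK", Left(description) otherwise.
  def probe(): Either[String, Long] = {
    val start = System.nanoTime()
    Try {
      val conn = endpoint.openConnection().asInstanceOf[HttpURLConnection]
      conn.setConnectTimeout(2000)
      conn.setReadTimeout(2000)
      try conn.getResponseCode
      finally conn.disconnect()
    }.toEither match {
      case Left(err) => Left(s"request failed: ${err.getMessage}")
      case Right(status) =>
        val elapsedMs = (System.nanoTime() - start) / 1000000
        if (status >= 200 && status < 300) Right(elapsedMs)
        else Left(s"HTTP $status after ${elapsedMs}ms")
    }
  }
}
```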
Keep an eye toward simplicity
Should be easy to read and understand
Don't disaggregate unnecessarily
Metrics/Alerts that are rarely exercised should be up for removal
Keep dashboards clean and tidy
Few is better than many
Time-series are preferable to discrete checks
Avoid averages
Select an appropriate resolution
Keep an eye on the tail latency
Pay close attention to how you are measuring
A histogram collects metrics in buckets
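A minimal sketch of the last three points (mine, assuming nothing about the tooling used in the talk): record latencies into fixed buckets and read tail percentiles from them instead of an average.

```scala
// Toy latency histogram: fixed millisecond buckets, no library assumed.
// Averages hide the tail; percentiles read from the buckets expose it.
object LatencyHistogram {
  private val bounds = Array[Long](5, 10, 25, 50, 100, 250, 500, 1000, 2500, Long.MaxValue)
  private val counts = new Array[Long](bounds.length)

  def record(latencyMs: Long): Unit = {
    val i = bounds.indexWhere(latencyMs <= _)
    counts(i) += 1
  }

  // Returns the upper bound of the bucket containing the q-th quantile.
  def percentile(q: Double): Long = {
    val total = counts.sum
    val target = math.ceil(q * total).toLong
    var cumulative = 0L
    var i = 0
    while (i < counts.length && cumulative < target) {
      cumulative += counts(i)
      i += 1
    }
    bounds(math.max(i - 1, 0))
  }
}

// Usage: a mean of ~60 ms can coexist with a p99 above one second.
// LatencyHistogram.record(42); LatencyHistogram.percentile(0.99)
```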
For every resource, monitor:
Utilization: % of time that the resource was busy
Errors: rate of error events
Saturation: amount of work the resource has to do, often queue length
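A rough USE-style sketch (my assumption, not the talk's tooling), treating a thread pool as the resource: utilization from busy threads, saturation from the queue, errors counted wherever submitted work fails.

```scala
import java.lang.management.ManagementFactory
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// USE for one resource (a thread pool standing in for "the resource"):
//   Utilization: busy threads / pool size
//   Saturation:  work queued and waiting for a thread
//   Errors:      counted by the caller when submitted tasks throw
object UseSample {
  private val queue = new LinkedBlockingQueue[Runnable]()
  val pool = new ThreadPoolExecutor(8, 8, 0L, TimeUnit.MILLISECONDS, queue)

  def utilization: Double = pool.getActiveCount.toDouble / pool.getMaximumPoolSize
  def saturation: Int     = queue.size()   // queue length

  // System-level hint for the same idea: OS load vs available CPUs.
  def cpuSaturationHint: Double =
    ManagementFactory.getOperatingSystemMXBean.getSystemLoadAverage /
      Runtime.getRuntime.availableProcessors()
}
```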
For every service, monitor:
Latency: time to service a request
Traffic: requests/second
Errors: error rate of requests
Saturation: how overloaded the service is
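A minimal sketch, assuming a plain in-process handler and no particular metrics library, of capturing the four golden signals around a service call.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}
import scala.util.Try

// Four golden signals around one service call, kept in plain counters;
// a real setup would export these to a metrics backend instead.
object GoldenSignals {
  val traffic       = new AtomicLong(0)    // requests seen (rate = traffic over time)
  val errors        = new AtomicLong(0)    // requests that failed
  val inFlight      = new AtomicInteger(0) // crude saturation proxy
  val lastLatencyMs = new AtomicLong(0)    // would feed a histogram in practice

  def observed[A](handler: => A): Try[A] = {
    traffic.incrementAndGet()
    inFlight.incrementAndGet()
    val start  = System.nanoTime()
    val result = Try(handler)
    lastLatencyMs.set((System.nanoTime() - start) / 1000000)
    inFlight.decrementAndGet()
    if (result.isFailure) errors.incrementAndGet()
    result
  }
}
```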
(Service Level Indicators)
Quantify whether we are meeting user expectations:
Is our service working as our users expect it to?
1. For each User Journey/Data Flow, identify suitable types of SLI from the SLI Menu:
2. Make a decision about how to measure good and valid events.
3. Decide where to measure the SLI
Availability
Specification: % of GET requests that complete successfully
Examples: backend API for user info
Implementation:
Latency
Specification: % of 2xx requests that complete in less than 500ms.
Implementation:
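The "Implementation:" lines are left open on the slides; purely as an illustration of one possible implementation (an assumption, not the talk's), both SLIs can be computed from a window of request records.

```scala
// One request as seen at the measurement point (e.g. the load balancer).
final case class RequestRecord(status: Int, durationMs: Long)

object SliMeasurement {
  // Availability SLI: share of GET requests that completed successfully.
  // Here every non-2xx counts as bad; the SLO slide later narrows it to 2xx / (2xx + 5xx).
  def availability(window: Seq[RequestRecord]): Double =
    if (window.isEmpty) 1.0
    else window.count(r => r.status >= 200 && r.status < 300).toDouble / window.size

  // Latency SLI: share of 2xx requests completing in under 500 ms.
  def latencyUnder500ms(window: Seq[RequestRecord]): Double = {
    val ok = window.filter(r => r.status >= 200 && r.status < 300)
    if (ok.isEmpty) 1.0
    else ok.count(_.durationMs < 500).toDouble / ok.size
  }
}
```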
Add objective:
Objectives for SLI
(Service Level Objectives)
Examples
Availability: 99.9% of GET requests complete successfully
Latency: 95% of 2xx requests complete in less than 500ms.
Conditions:
- Measured across all the backend servers from the load balancer
- Taking the past 24 hours
Availability = 2xx / (2xx + 5xx)
Latency measured only on 2xx status
End-to-end elapsed time for processing events within data pipeline
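Continuing the same assumption, a sketch of checking the two example objectives under the stated conditions: load-balancer records, the past 24 hours, availability as 2xx / (2xx + 5xx), latency measured only on 2xx responses.

```scala
import java.time.Instant

// A request as recorded at the load balancer, across all backend servers.
final case class LbRecord(at: Instant, status: Int, durationMs: Long)

object SloEvaluation {
  val availabilityTarget = 0.999   // 99.9% of GET requests complete successfully
  val latencyTarget      = 0.95    // 95% of 2xx requests complete in < 500 ms

  // Returns (availability SLO met, latency SLO met) for the past 24 hours.
  def evaluate(records: Seq[LbRecord], now: Instant = Instant.now()): (Boolean, Boolean) = {
    val window = records.filter(_.at.isAfter(now.minusSeconds(24 * 3600)))
    val ok     = window.filter(r => r.status >= 200 && r.status < 300)
    val errors = window.count(r => r.status >= 500 && r.status < 600)

    val availability = if (ok.size + errors == 0) 1.0 else ok.size.toDouble / (ok.size + errors)
    val fastEnough   = if (ok.isEmpty) 1.0 else ok.count(_.durationMs < 500).toDouble / ok.size

    (availability >= availabilityTarget, fastEnough >= latencyTarget)
  }
}
```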
The SLOs are still being defined. So far we have just chosen some values as suggestions, expressed with a "should be ...".
Does this rule detect a condition that is urgent, actionable and imminently user-visible?
Can I take action? Is it urgent, or could it wait until morning?
Questions that help to avoid false positives:
Alerts should be just about urgency!
Otherwise, it makes people demoralized and fatigued
Could the action be automated?
People add latency: automate as much as possible!
Are other people getting paged for this issue unnecessarily?
Highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes.
Observability captures what "monitoring" doesn't (and shouldn't)
based on evidence (not conjecture)
Discrete Events
Good practices for more effective logs:
Logging with context (trace-id / unit of work / whatever)
Use structured logs to enable machine readability
Standardized Logging Levels
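A small sketch of the first two practices using SLF4J's MDC (the logger name, field names and key=value payload are my assumptions, not the talk's stack): the trace-id travels in the MDC so every line carries it, and a JSON encoder on the appender turns the message plus MDC entries into machine-readable fields.

```scala
import org.slf4j.{LoggerFactory, MDC}
import java.util.UUID

// Context + structure: put the trace-id (or unit-of-work id) into the MDC so
// every log line carries it; a JSON encoder on the appender side then emits
// the message and MDC entries as structured fields.
object ContextualLogging {
  private val log = LoggerFactory.getLogger("checkout")  // hypothetical component name

  def withTraceId[A](traceId: String)(body: => A): A = {
    MDC.put("trace_id", traceId)
    try body
    finally MDC.remove("trace_id")
  }

  def handle(): Unit =
    withTraceId(UUID.randomUUID().toString) {
      log.info("payment authorized amount_cents=1250 currency=ARS")  // key=value payload
    }
}
```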
Aggregated data
For reliability and trend analysis:
What happens right now?
Reliability is not just about throughput and latency
What will happen next?
There are a lot of extra effects to consider: GC pauses, hiccups, stolen CPU, ...
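One of those extra effects made visible (a sketch of mine, not from the slides): accumulated GC time is exposed by the standard MX beans, and a "hiccup" is simply a sleep that wakes up later than asked.

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

// GC time from the standard MX beans, plus a crude hiccup meter:
// sleep 1 ms and report how much longer than 1 ms the wake-up actually took.
object RuntimeEffects {
  def totalGcMillis: Long =
    ManagementFactory.getGarbageCollectorMXBeans.asScala.map(_.getCollectionTime).sum

  def hiccupMillis(): Long = {
    val start = System.nanoTime()
    Thread.sleep(1)
    (System.nanoTime() - start) / 1000000 - 1
  }
}
```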
A few of the critical questions that tracing can answer quickly and easily:
Which services did a request pass through?
Where are the bottlenecks?
How much time is lost due to network lag during communication between services?
What occurred in each service for a given request?
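A toy span model (no specific tracing library assumed) with just enough structure to answer those questions: a shared trace-id for the path, timestamps for the bottlenecks, and parent links to attribute time per service.

```scala
// Minimal span: enough structure to reconstruct the path of a request,
// find the slowest hop, and attribute time to each service.
final case class Span(
    traceId: String,          // shared by every span of one request
    spanId: String,
    parentId: Option[String], // links callee spans to their caller
    service: String,
    operation: String,
    startMicros: Long,
    endMicros: Long
) {
  def durationMicros: Long = endMicros - startMicros
}

object TraceQueries {
  // "Which services did a request pass through?"
  def servicesOnPath(trace: Seq[Span]): Seq[String] =
    trace.sortBy(_.startMicros).map(_.service).distinct

  // "Where are the bottlenecks?" - the span that took the longest.
  def slowestSpan(trace: Seq[Span]): Option[Span] =
    trace.sortBy(-_.durationMicros).headOption
}
```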
Use four golden signals + USE metrics
Instrument services as early as possible
Combine with profiling, logging and tracing
Iterate
1. Find a system or environment component you are not responsible for
2. Hypothesize that the issue is with that component
3. Redirect the issue to the responsible team
4. When proven wrong, go to 1
Open dashboard
All green? Assume everything is good
Something red? Assume that's a problem
Is Controllability the purpose of Observability?
Is Observability just about Brute Force?
What about Systems and their Dynamics?
What about Signaling and State?
Diego Parra
Cristian Spinetta
@diegolparra
@dpsoft
@cebspinetta
@cspinetta
Feedback Control is used heavily in
Process control
Electronics
Automation
Aerospace
Was used by the Egyptians in a water clock more than 2000 years ago
Signals are emitted or received
Signals are indicative of operations or outcomes
Signals influence others and help to infer the state of others as well as ourselves
A signal is a direct and meaningful unit of information within a context, like an emoji
Signals are recorded and then given a score
States
Signals
UNKNOWN
OK
DEVIATING
DEGRADED
DEFECTIVE
DOWN
START
STOP
CALL
SUCCEED
RETRY
DELAY
DISCONNECT
SCHEDULE
S1 CONTEXT        | S2 CONTEXT
------------------+------------------
S1 EMIT START     |
S2 EMIT CALL      |
                  | S2 EMIT START
                  | S2 EMIT FAIL
                  | S2 EMIT STOP
S2 RECEIPT FAIL   |
S1 EMIT FAIL      |
S1 EMIT STOP      |
Service-to-Service interaction
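A compact sketch of the model described above: the signal and state names come from the slides, while the scoring rule is only my illustrative assumption, not the talk's actual inference.

```scala
// Signals and states as named on the slides; the scoring rule below is only
// an illustrative assumption.
object SignalModel {
  sealed trait Signal
  case object Start extends Signal;  case object Stop extends Signal
  case object Call extends Signal;   case object Succeed extends Signal
  case object Fail extends Signal;   case object Retry extends Signal
  case object Delay extends Signal;  case object Disconnect extends Signal
  case object Schedule extends Signal

  sealed trait Orientation
  case object Emit extends Orientation     // recorded by the service itself
  case object Receipt extends Orientation  // recorded about a remote party

  sealed trait State
  case object Unknown extends State;   case object Ok extends State
  case object Deviating extends State; case object Degraded extends State
  case object Defective extends State; case object Down extends State

  // One row of the interaction table: "within <context>, <subject> <orientation> <signal>".
  final case class Record(context: String, subject: String, orientation: Orientation, signal: Signal)

  // Toy scoring: infer a service's state from the signals recorded about it.
  def infer(recent: Seq[Record], service: String): State = {
    val signals = recent.filter(_.subject == service).map(_.signal)
    if (signals.isEmpty) Unknown
    else if (signals.forall(_ == Disconnect)) Down
    else {
      val failRatio = signals.count(_ == Fail).toDouble / signals.size
      if (failRatio > 0.5) Defective
      else if (signals.contains(Fail)) Degraded
      else if (signals.exists(s => s == Retry || s == Delay)) Deviating
      else Ok
    }
  }
}
```

Feeding in the S1-context rows of the table above (CALL plus a FAIL receipt about S2), this toy rule would score S2 as Degraded from S1's point of view.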