Cristian Spinetta
Software developer.
@cebspinetta
@cspinetta
First production datacenter
Miami region
Private cloud based on OpenStack
Self-administered by devs via Cloudia, an in-house solution for creating VMs, load balancers, storage, traffic rules...
A lot of VMs distributed across 2 datacenters:
~8K nodes
AWS
Contingency region
Active mode with production traffic
~1.4K nodes
~500 deploys per day
Centralize at least the access logs
Logging with context: you need to correlate logs and see the causality between services
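A minimal sketch of what "logging with context" can look like on the JVM, assuming SLF4J and an MDC-aware log pattern; in practice the tracing library usually populates the trace id, so the header name and field shown here are illustrative:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class CorrelatedLogging {
    private static final Logger log = LoggerFactory.getLogger(CorrelatedLogging.class);

    // Illustrative entry point: the incoming trace id would normally come from
    // a propagation header (e.g. X-B3-TraceId) set by the tracing library.
    public void handleRequest(String incomingTraceId) {
        MDC.put("traceId", incomingTraceId);   // attach the id to every log line on this thread
        try {
            log.info("processing request");    // the log pattern can include %X{traceId}
        } finally {
            MDC.remove("traceId");             // avoid leaking context to the next request
        }
    }
}
```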
For reliability and usage trends:
What is happening right now?
What will happen next?
To measure the impact of a change:
Aggregated Data
How was it doing before the change?
How is it doing after the change?
Reliability is not just about throughput and latency
There are a lot of extra effects to consider
Garbage collection
What about the aggregated effects introduced by the underlying platform?
Never use averages
For all of them, the average is ~50ns
Don't be fooled by the average
you'll be blinded!!
Average says:
Response time: ~55ms
Percentiles say:
Avg: ~55ms | P95: ~250ms | P99: ~550ms
Actual latency
There could be a lot of requests with high values
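To make the point concrete, here is a rough sketch (with made-up latencies, not the slide's real data) of how an average can look healthy while the percentiles expose the tail:

```java
import java.util.Arrays;

public class LatencyStats {
    // Nearest-rank percentile over a sorted sample; a rough sketch, not a streaming estimator.
    static double percentile(double[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Illustrative latencies in ms: most requests are fast, a few are very slow.
        double[] latencies = {20, 22, 25, 30, 31, 35, 40, 45, 55, 60, 70, 80, 90, 250, 550};
        double[] sorted = latencies.clone();
        Arrays.sort(sorted);

        double avg = Arrays.stream(latencies).average().orElse(0);
        System.out.printf("avg=%.0fms p95=%.0fms p99=%.0fms%n",
                avg, percentile(sorted, 95), percentile(sorted, 99));
        // The average hides the tail: P95/P99 reveal the slow requests users actually see.
    }
}
```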
A time series database is a system optimized for handling time series data (usually arrays of longs indexed by time); see the sketch after the list below.
Prometheus
InfluxDB
Graphite
Datadog
Khronus
OpenTSDB
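As a toy illustration of the "array of longs indexed by time" idea mentioned above, the sketch below stores one aggregated value per second. It is a simplification for intuition only, not how Prometheus, InfluxDB or the other systems listed here are actually implemented:

```java
import java.util.Arrays;

// Toy fixed-resolution time series: one long slot per second, indexed by timestamp.
public class SimpleTimeSeries {
    private final long startEpochSecond;
    private final long[] values;

    public SimpleTimeSeries(long startEpochSecond, int seconds) {
        this.startEpochSecond = startEpochSecond;
        this.values = new long[seconds];
    }

    public void record(long epochSecond, long value) {
        int idx = (int) (epochSecond - startEpochSecond);
        if (idx >= 0 && idx < values.length) {
            values[idx] += value;   // aggregate samples falling into the same second
        }
    }

    public long[] range(long fromSecond, long toSecond) {
        int from = (int) (fromSecond - startEpochSecond);
        int to = (int) (toSecond - startEpochSecond);
        return Arrays.copyOfRange(values, Math.max(0, from), Math.min(values.length, to));
    }
}
```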
Be careful or you'll be lying to yourself
Averages lie! Percentiles are a good option
Neither percentiles nor averages can be aggregated (see the sketch below)
Remember the external effects: hypervisor, JVM, networking...
Get to know your tools. Don't just believe anyone!
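A small illustration of why percentiles cannot be aggregated: averaging per-host P99s gives a number with no relation to the global P99. The samples below are made up; computing the real global percentile requires the merged raw data (or mergeable histograms):

```java
import java.util.Arrays;

public class PercentileAggregation {
    // Nearest-rank P99 over an already sorted sample.
    static double p99(double[] sorted) {
        int rank = (int) Math.ceil(0.99 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Two hosts with made-up latency samples (ms), already in ascending order.
        double[] hostA = new double[100];
        double[] hostB = new double[100];
        Arrays.fill(hostA, 10);  hostA[98] = 100;   hostA[99] = 120;    // hostA P99 = 100
        Arrays.fill(hostB, 10);  hostB[98] = 1000;  hostB[99] = 1200;   // hostB P99 = 1000

        double avgOfP99s = (p99(hostA) + p99(hostB)) / 2;   // 550 -- a meaningless number

        // The real global P99 needs the merged raw samples (or mergeable histograms).
        double[] merged = new double[200];
        System.arraycopy(hostA, 0, merged, 0, 100);
        System.arraycopy(hostB, 0, merged, 100, 100);
        Arrays.sort(merged);
        System.out.println("avg of P99s = " + avgOfP99s + ", global P99 = " + p99(merged));
    }
}
```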
A few of the critical questions that distributed tracing can answer quickly and easily:
Which services did a request pass through?
Where are the bottlenecks?
How much time is lost to network lag between services?
What occurred in each service for a given request?
Sampling reduces overhead (see the sketch after this list)
Observability tools are unintrusive
Instrumentation can be delegated to common frameworks
Don't trace every single operation
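As a sketch of how sampling keeps overhead low in practice, this is roughly what a probabilistic sampler looks like with Brave, OpenZipkin's Java library; the service name is hypothetical and builder options vary by Brave version:

```java
import brave.Tracing;
import brave.sampler.Sampler;

public class TracingSetup {
    public static Tracing buildTracing() {
        // Record roughly 1% of traces; unsampled requests still propagate context
        // downstream but are not reported, which keeps the overhead low.
        return Tracing.newBuilder()
                .localServiceName("checkout-service")   // hypothetical service name
                .sampler(Sampler.create(0.01f))
                .build();
    }
}
```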
Distributed Tracing System
Based on Google Dapper (2010)
Created by Twitter (2012)
OpenZipkin (2015)
Active Community
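A minimal sketch of instrumenting one unit of work with Brave so the span ends up in Zipkin; the class, method, tag and span names here are hypothetical:

```java
import brave.ScopedSpan;
import brave.Tracer;
import brave.Tracing;

public class CheckoutHandler {
    private final Tracer tracer;

    public CheckoutHandler(Tracing tracing) {
        this.tracer = tracing.tracer();
    }

    public void process(String orderId) {
        // One span per unit of work; it is reported to Zipkin when finished (if sampled).
        ScopedSpan span = tracer.startScopedSpan("process-order");
        span.tag("order.id", orderId);
        try {
            // ... business logic ...
        } catch (RuntimeException e) {
            span.error(e);   // mark the span as failed before finishing it
            throw e;
        } finally {
            span.finish();
        }
    }
}
```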
Distributed Tracing System
Based on Google Dapper (2010)
Inspired by OpenZipkin
Created by Uber
Distributed Tracing, Metrics and Context Propagation for applications running on the JVM.
- Observability SDK (metrics, tracing).
- Trace instrumentation API definitions.
- OpenZipkin's Java library and instrumentation.
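A rough sketch of what metric recording looks like with an observability SDK on the JVM; the calls below assume the Kamon 2.x API, and the metric names are made up (earlier Kamon versions use slightly different calls):

```java
import kamon.Kamon;

public class OrderMetrics {
    // Kamon.init() must be called once at application start (Kamon 2.x).
    public void recordOrder(long elapsedMillis) {
        // Count processed orders and record how long each one took.
        Kamon.counter("orders.processed").withoutTags().increment();
        Kamon.histogram("orders.processing-time").withoutTags().record(elapsedMillis);
    }
}
```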
By Cristian Spinetta
In an environment made up of thousands of instances, how do we identify which one a request passed through? Which one failed or slowed the rest? Come to learn how Despegar faces these challenges!