Cristian Spinetta
Software developer.
@cebspinetta
@cspinetta
First production datacenter
Miami region
Private cloud based on OpenStack
Self-administered by devs via Cloudia, an in-house solution for creating VMs, load balancers, storage, traffic rules...
A lot of VMs distributed across 2 datacenters:
~8K nodes
AWS
Contingency region
Active mode with production traffic
~1.4K nodes
~500 deploys per day
Centralize at least the access logs
Logging with context: you need to correlate logs and see the causality between services
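A minimal sketch of what "logging with context" can look like on the JVM, assuming SLF4J and an MDC-aware log pattern; in practice the tracing library usually populates the trace id, so the header name and field shown here are illustrative:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class CorrelatedLogging {
    private static final Logger log = LoggerFactory.getLogger(CorrelatedLogging.class);

    // Illustrative entry point: the incoming trace id would normally come from
    // a propagation header (e.g. X-B3-TraceId) set by the tracing library.
    public void handleRequest(String incomingTraceId) {
        MDC.put("traceId", incomingTraceId);   // attach the id to every log line on this thread
        try {
            log.info("processing request");    // the log pattern can include %X{traceId}
        } finally {
            MDC.remove("traceId");             // avoid leaking context to the next request
        }
    }
}
```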
For reliability and usage trends:
What is happening right now?
What will happen next?
To measure the impact of a change:
Aggregated Data
How was it doing before the change?
How is it doing after the change?
Reliability is not just about throughput and latency
There are a lot of extra effects to consider
Garbage collection
What about the aggregated effects introduced by the underlying platform?
Never use averages
For all of them, the average is ~50ns
Don't be fooled by the average
you'll be blinded!!
Average says:
Response time: ~55ms
Percentiles say:
Avg: ~55ms | P95: ~250ms | P99: ~550ms
Actual latency
There could be a lot of requests with high values
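To make the point concrete, here is a rough sketch (with made-up latencies, not the slide's real data) of how an average can look healthy while the percentiles expose the tail:

```java
import java.util.Arrays;

public class LatencyStats {
    // Nearest-rank percentile over a sorted sample; a rough sketch, not a streaming estimator.
    static double percentile(double[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Illustrative latencies in ms: most requests are fast, a few are very slow.
        double[] latencies = {20, 22, 25, 30, 31, 35, 40, 45, 55, 60, 70, 80, 90, 250, 550};
        double[] sorted = latencies.clone();
        Arrays.sort(sorted);

        double avg = Arrays.stream(latencies).average().orElse(0);
        System.out.printf("avg=%.0fms p95=%.0fms p99=%.0fms%n",
                avg, percentile(sorted, 95), percentile(sorted, 99));
        // The average hides the tail: P95/P99 reveal the slow requests users actually see.
    }
}
```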
A time series database is a system optimized for handling time series data (usually arrays of longs indexed by time); see the sketch after the list below.
Prometheus
InfluxDB
Graphite
Datadog
Khronus
OpenTSDB
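As a toy illustration of the "array of longs indexed by time" idea mentioned above, the sketch below stores one aggregated value per second. It is a simplification for intuition only, not how Prometheus, InfluxDB or the other systems listed here are actually implemented:

```java
import java.util.Arrays;

// Toy fixed-resolution time series: one long slot per second, indexed by timestamp.
public class SimpleTimeSeries {
    private final long startEpochSecond;
    private final long[] values;

    public SimpleTimeSeries(long startEpochSecond, int seconds) {
        this.startEpochSecond = startEpochSecond;
        this.values = new long[seconds];
    }

    public void record(long epochSecond, long value) {
        int idx = (int) (epochSecond - startEpochSecond);
        if (idx >= 0 && idx < values.length) {
            values[idx] += value;   // aggregate samples falling into the same second
        }
    }

    public long[] range(long fromSecond, long toSecond) {
        int from = (int) (fromSecond - startEpochSecond);
        int to = (int) (toSecond - startEpochSecond);
        return Arrays.copyOfRange(values, Math.max(0, from), Math.min(values.length, to));
    }
}
```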
Be careful or you'll be lying to yourself
Averages lie! Percentiles are a good option
Neither percentiles nor averages can be aggregated (see the sketch below)
Remember the external effects: hypervisor, JVM, networking...
Get to know your tools. Don't just believe anyone!
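A small illustration of why percentiles cannot be aggregated: averaging per-host P99s gives a number with no relation to the global P99. The samples below are made up; computing the real global percentile requires the merged raw data (or mergeable histograms):

```java
import java.util.Arrays;

public class PercentileAggregation {
    // Nearest-rank P99 over an already sorted sample.
    static double p99(double[] sorted) {
        int rank = (int) Math.ceil(0.99 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Two hosts with made-up latency samples (ms), already in ascending order.
        double[] hostA = new double[100];
        double[] hostB = new double[100];
        Arrays.fill(hostA, 10);  hostA[98] = 100;   hostA[99] = 120;    // hostA P99 = 100
        Arrays.fill(hostB, 10);  hostB[98] = 1000;  hostB[99] = 1200;   // hostB P99 = 1000

        double avgOfP99s = (p99(hostA) + p99(hostB)) / 2;   // 550 -- a meaningless number

        // The real global P99 needs the merged raw samples (or mergeable histograms).
        double[] merged = new double[200];
        System.arraycopy(hostA, 0, merged, 0, 100);
        System.arraycopy(hostB, 0, merged, 100, 100);
        Arrays.sort(merged);
        System.out.println("avg of P99s = " + avgOfP99s + ", global P99 = " + p99(merged));
    }
}
```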
A few of the critical questions that distributed tracing can answer quickly and easily:
Which services did a request pass through?
Where are the bottlenecks?
How much time is lost to network lag between services?
What occurred in each service for a given request?
Sampling reduces overhead (see the sketch after this list)
Observability tools are unintrusive
Instrumentation can be delegated to common frameworks
Don't trace every single operation
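As a sketch of how sampling keeps overhead low in practice, this is roughly what a probabilistic sampler looks like with Brave, OpenZipkin's Java library; the service name is hypothetical and builder options vary by Brave version:

```java
import brave.Tracing;
import brave.sampler.Sampler;

public class TracingSetup {
    public static Tracing buildTracing() {
        // Record roughly 1% of traces; unsampled requests still propagate context
        // downstream but are not reported, which keeps the overhead low.
        return Tracing.newBuilder()
                .localServiceName("checkout-service")   // hypothetical service name
                .sampler(Sampler.create(0.01f))
                .build();
    }
}
```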
Distributed Tracing System
Based on Google Dapper (2010)
Created by Twitter (2012)
OpenZipkin (2015)
Active Community
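A minimal sketch of instrumenting one unit of work with Brave so the span ends up in Zipkin; the class, method, tag and span names here are hypothetical:

```java
import brave.ScopedSpan;
import brave.Tracer;
import brave.Tracing;

public class CheckoutHandler {
    private final Tracer tracer;

    public CheckoutHandler(Tracing tracing) {
        this.tracer = tracing.tracer();
    }

    public void process(String orderId) {
        // One span per unit of work; it is reported to Zipkin when finished (if sampled).
        ScopedSpan span = tracer.startScopedSpan("process-order");
        span.tag("order.id", orderId);
        try {
            // ... business logic ...
        } catch (RuntimeException e) {
            span.error(e);   // mark the span as failed before finishing it
            throw e;
        } finally {
            span.finish();
        }
    }
}
```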
Distributed Tracing System
Based on Google Dapper (2010)
Inspired by OpenZipkin
Created by Uber
Distributed Tracing, Metrics and Context Propagation for applications running on the JVM.
- Observability SDK (metrics, tracing).
- Trace instrumentation API definitions.
- OpenZipkin's Java library and instrumentation.
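A rough sketch of what metric recording looks like with an observability SDK on the JVM; the calls below assume the Kamon 2.x API, and the metric names are made up (earlier Kamon versions use slightly different calls):

```java
import kamon.Kamon;

public class OrderMetrics {
    // Kamon.init() must be called once at application start (Kamon 2.x).
    public void recordOrder(long elapsedMillis) {
        // Count processed orders and record how long each one took.
        Kamon.counter("orders.processed").withoutTags().increment();
        Kamon.histogram("orders.processing-time").withoutTags().record(elapsedMillis);
    }
}
```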
By Cristian Spinetta
In an environment made up of thousands of instances, how do we identify which one a request passed through? Which one failed or slowed the rest? Come to learn how Despegar faces these challenges!