Observability Toolchain

First, let's talk about... Monitoring

  • Each system component exposes information
    • Network
    • Host, database, web/application server
    • Application
  • Focused on failure or symptoms
  • No coordinated strategy
    • Correlation is difficult
  • Common features
    • Centralized store
    • Dashboards
    • Alerting

And then this happened

Managing Complexity

  • Environments are complex
    • Distributed + cloud-based
    • CI/CD
    • Containerized workloads
    • Microservices architecture
  • Automation is at the center of DevOps
  • Actual vs. desired state
    • Controllability
      • State can be steered through inputs
    • Observability
      • State can be determined from outputs
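
The observability bullet borrows a control-theory definition, which a small worked example can make concrete. The sketch below assumes a hypothetical two-state linear system (position and velocity) and checks whether the full state can be reconstructed from one measured output. The matrix `A` and the helper names are illustrative only.

```python
# Hypothetical 2-state linear system: x[k+1] = A x[k], y[k] = C x[k].
# State x = (position, velocity); C selects which quantity is measured.

def observability_matrix(A, C):
    """Stack C and C*A for a 2-state system with a single output."""
    CA = [sum(C[k] * A[k][j] for k in range(2)) for j in range(2)]
    return [list(C), CA]

def is_observable(A, C):
    """The state is recoverable from outputs iff the stacked matrix has full rank."""
    O = observability_matrix(A, C)
    det = O[0][0] * O[1][1] - O[0][1] * O[1][0]
    return det != 0

A = [[1, 1],
     [0, 1]]  # position += velocity each step; velocity stays constant

print(is_observable(A, [1, 0]))  # measuring position only
print(is_observable(A, [0, 1]))  # measuring velocity only
```

Measuring position lets the velocity be inferred from successive readings. Measuring only velocity never reveals the starting position, so that system is not observable.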

Netflix's Deathstar

So...

Observability: the new Monitoring?

  • Observability is a property of a system
  • A superset of monitoring
  • Understand the state of the system at any time
    • Detect, comprehend and react to issues
    • Predict issues
    • Enable forensic analysis
    • Support business decisions
      • Validate hypotheses
      • Measure value delivery
    • Make changes with confidence

And...

The 3 Pillars of Observability

Logs

Evolution of log analytics

Challenges of logging

Fluentd

  • Cloud-native data collector for a unified logging layer
    • A single interface for multiple
      • Data sources
      • Destinations
    • Horizontally scalable, with reliable data transfer
      • Small memory footprint
  • Plugin architecture for input, parsing, filtering, formatting and outputting logs
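
The plugin stages listed above (input, parsing, filtering, formatting, output) can be pictured as a chain of small functions. The following is a minimal sketch in plain Python, not the collector's actual plugin API. The log line format, the function names and the in-memory destination are all assumptions made for illustration.

```python
import json
import re

# Hypothetical line format "<LEVEL> <service> <message>", for illustration only.
LINE_RE = re.compile(r"(?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)")

def parse(line):
    """Parsing stage: turn a raw text line into a structured record."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None

def keep(record):
    """Filtering stage: drop unparseable lines and DEBUG noise."""
    return record is not None and record["level"] != "DEBUG"

def fmt(record):
    """Formatting stage: serialize the record as one JSON document."""
    return json.dumps(record, sort_keys=True)

def run_pipeline(lines, outputs):
    """Send every surviving record to each configured destination."""
    for line in lines:
        record = parse(line)
        if keep(record):
            doc = fmt(record)
            for out in outputs:
                out(doc)

store = []  # stands in for a destination such as a search index
run_pipeline(
    ["ERROR checkout payment failed", "DEBUG checkout cache hit", "garbage"],
    [store.append, print],
)
```

Because each stage is an independent function, a new source or destination can be added without touching the others, which is the point of a single interface between many inputs and many outputs.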

Elasticsearch

  • Distributed and highly available RESTful search engine
    • Full-text search
    • Each index is sharded to parallelize operations
    • Shard replicas for HA
  • JSON-document oriented
    • No need for upfront schema definition
  • (Near) real-time search

Inverted index
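
As a rough illustration of the idea, an inverted index maps each term to the documents that contain it, so a query touches only the postings for its terms instead of scanning every document. The sketch below is a toy version in plain Python with whitespace tokenization. It is not how Elasticsearch or Lucene implement it, and the sample documents are made up.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "payment failed for order",
    2: "order shipped",
    3: "payment accepted for order",
}
index = build_inverted_index(docs)
print(sorted(search(index, "payment order")))
```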

ES cluster design

Kibana

  • Analytics and visualization for Elasticsearch
    • Search, view, and interact with data stored in indices
  • Filter events based on tags such as environments, application names, K8s namespaces, etc.
  • Advanced time-series data analysis
  • Create custom dashboards including histograms, line graphs, pie charts and sunbursts

Examples

Metrics

Prometheus

  • Monitoring system and time series database
  • Time series collection happens via a pull model over HTTP
  • Triggers alerts when a rule expression holds over a given time interval
  • Multiple modes of graphing and dashboarding supported
  • Flexible query language to leverage multiple dimensions
  • Targets are discovered via service discovery (such as Consul) or static configuration
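
To make the pull model concrete: a target exposes its current values over HTTP and the server scrapes them on a schedule. The sketch below uses only the Python standard library rather than an official client library. The metric name, label values, counter numbers and port are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counter values keyed by (method, status) labels.
REQUESTS = {("GET", "200"): 1027, ("POST", "500"): 3}

def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(REQUESTS.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the scrape path is served; anything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Block forever, answering scrapes on http://localhost:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()

print(render_metrics(), end="")
```

Calling `serve()` would start the endpoint. The application never pushes anything: it only answers when it is scraped.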

Anatomy of a Prometheus Metric
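
A sample in the text exposition format consists of a metric name, an optional set of key/value labels, and a numeric value. The sketch below splits one such line into those parts. It is a simplified parser for illustration (no escaped quotes, no timestamps), and the sample line is made up.

```python
import re

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>.*)\})?"
    r"\s+(?P<value>\S+)$"
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def parse_sample(line):
    """Split one exposition line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric sample: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))

name, labels, value = parse_sample(
    'http_requests_total{method="GET",status="200"} 1027'
)
print(name)    # the metric name identifies what is measured
print(labels)  # each distinct label combination is its own time series
print(value)   # the sample value at scrape time
```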

Grafana

  • Feature-rich metrics dashboard and graph editor for multiple data sources like Elasticsearch, OpenTSDB, Prometheus and InfluxDB
  • Lets you query, visualize, alert on and understand metrics no matter where they are stored
  • Templating queries for generic dashboards
  • Different ways to visualize metrics and logs
  • Alias patterns for short readable series names

Examples

Instana

  • APM for microservices-based applications
  • Real-time alerting and notifications
    • Integrations with webhooks, Slack, email
  • 1 second metric granularity
  • Monitoring multiple client technologies based on Sensors
  • Full-stack distributed tracing
    • Traces every single request
  • Auto-discovery, data collection, and dynamic graph features

Examples

Tracing

Oldschool Debugging

Tracing

What's distributed tracing?
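
In short, distributed tracing follows one request across services by giving every unit of work (a span) a shared trace id and a pointer to its parent. The sketch below simulates that propagation in one process. The header names, service names and the in-memory collector are hypothetical and do not follow any particular tracing standard.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique per unit of work
    parent_id: Optional[str]  # None for the root span
    name: str

COLLECTED = []  # stands in for a tracing backend

def start_span(name, headers=None):
    """Continue the caller's trace if headers carry one, else start a new trace."""
    headers = headers or {}
    span = Span(
        trace_id=headers.get("x-trace-id") or uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=headers.get("x-parent-span-id"),
        name=name,
    )
    COLLECTED.append(span)
    return span

def outgoing_headers(span):
    """Headers a service attaches to downstream calls to propagate context."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}

# One request crossing three services: frontend -> checkout -> payment
frontend = start_span("GET /cart")
checkout = start_span("POST /checkout", outgoing_headers(frontend))
payment = start_span("charge card", outgoing_headers(checkout))

for span in COLLECTED:
    print(span.name, span.trace_id[:8], span.parent_id)
```

With the parent links in place, a backend can rebuild the call tree and show where the time of a single request was spent.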

Examples

Grafana

Alerting

AlertManager

  • What is an alert?
  • Prometheus server sends alerts to an Alertmanager
  • Alertmanager manages the alerts it receives
    • Aggregate
    • Group
    • Inhibit
    • Silence
    • Route and send out notifications
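
Two of the steps above, grouping and silencing, can be sketched in a few lines. This is an illustration of the concepts in plain Python, not Alertmanager's implementation or configuration format. The alert labels, the silence and the function names are made up.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the chosen labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

def is_silenced(alert, silences):
    """A silence applies when all of its label matchers equal the alert's labels."""
    return any(
        all(alert["labels"].get(k) == v for k, v in silence.items())
        for silence in silences
    )

def notifications(alerts, group_by, silences):
    """Drop silenced alerts, then emit one notification per group."""
    active = [a for a in alerts if not is_silenced(a, silences)]
    return {
        key: [a["labels"]["instance"] for a in members]
        for key, members in group_alerts(active, group_by).items()
    }

alerts = [
    {"labels": {"alertname": "HighLatency", "cluster": "eu", "instance": "web-1"}},
    {"labels": {"alertname": "HighLatency", "cluster": "eu", "instance": "web-2"}},
    {"labels": {"alertname": "DiskFull", "cluster": "us", "instance": "db-1"}},
]
silences = [{"alertname": "DiskFull", "cluster": "us"}]

result = notifications(alerts, ["alertname", "cluster"], silences)
print(result)
```

The two latency alerts collapse into a single notification, and the silenced disk alert produces none.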

Example

Grafana Alerts

ElastAlert

  • Open source project from Yelp
  • Alerting based on logs or other information managed in Elasticsearch
  • Define alerting conditions based on a DSL query
  • Define notification channel
    • Slack
    • Email
    • JIRA
    • HipChat
    • ...
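
A common alerting condition over log data is "at least N matching events within a time window". The sketch below shows that logic on in-memory events. It is an illustration in plain Python, not ElastAlert's code or rule syntax. The events, the predicate and the parameter names are assumptions.

```python
from datetime import datetime, timedelta

def frequency_alert(events, predicate, num_events, timeframe):
    """Return the timestamp at which the threshold is first reached, or None."""
    window = []
    for ts, doc in sorted(events, key=lambda e: e[0]):
        if not predicate(doc):
            continue
        window.append(ts)
        # Drop matches that have fallen out of the sliding window.
        window = [t for t in window if ts - t <= timeframe]
        if len(window) >= num_events:
            return ts
    return None

t0 = datetime(2024, 1, 1, 12, 0)
events = [
    (t0, {"level": "ERROR"}),
    (t0 + timedelta(minutes=1), {"level": "INFO"}),
    (t0 + timedelta(minutes=2), {"level": "ERROR"}),
    (t0 + timedelta(minutes=3), {"level": "ERROR"}),
]

def is_error(doc):
    return doc["level"] == "ERROR"

fired = frequency_alert(events, is_error, num_events=3, timeframe=timedelta(minutes=5))
print(fired)
```

In a real deployment the predicate would be a query against the log store and the returned timestamp would trigger one of the notification channels listed above.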

Instana Alerts

  • Integrates with Instana concepts of
    • Issues
    • Incidents
    • Changes
  • Uses Lucene query syntax to filter assets
  • Integrations represent notification channels
    • Slack
    • Email
    • Splunk
    • Office365
    • ...

Questions

Feedback

https://bit.ly/2J9G8ui

Thank you!
