Observability Toolchain
First, let's talk about... Monitoring
Each system component exposes information
Network
Host, database, web/application server
Application
Focused on failure or symptoms
No coordinated strategy
Correlation is difficult
Common features
Centralized store
Dashboards
Alerting
And then this happened
Managing Complexity
Environments are complex
Distributed + cloud based
CI/CD
Containerized workloads
Microservices architecture
Automation is at the center of DevOps
Actual vs. desired state
Controllability
State is determined by inputs
Observability
State can be determined from outputs
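The "actual vs. desired state" idea above can be sketched as a small reconciliation step. This is a minimal illustration, not any particular orchestrator's algorithm; the service names and replica counts are hypothetical. It only works because the actual state can be observed from the system's outputs.

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Return the actions needed to move the actual replica counts to the desired ones."""
    actions = []
    # Look at every service mentioned on either side, in a stable order.
    for service in sorted(desired.keys() | actual.keys()):
        want = desired.get(service, 0)
        have = actual.get(service, 0)
        if want > have:
            actions.append(f"start {want - have} x {service}")
        elif want < have:
            actions.append(f"stop {have - want} x {service}")
    return actions


# Hypothetical example data.
desired_state = {"api": 3, "worker": 2}
actual_state = {"api": 1, "worker": 2, "legacy": 1}
print(reconcile(desired_state, actual_state))
```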
Netflix's Deathstar
So...
Observability: the new Monitoring?
Observability is a property of a system
A superset of monitoring
Understand the state of the system at any time
Detect, comprehend and react to issues
Predict issues
Enable forensic analysis
Support business decisions
Validate hypotheses
Measure value delivery
Make changes with confidence
And...
The 3 Pillars of Observability
Logs
Evolution of log analytics
Challenges of logging
Cloud-native data collector for a unified logging layer
A single interface for multiple
Data sources
Destinations
Horizontally-scalable and provides reliable data transfers
Small memory footprint
Plugin architecture for input, parsing, filtering, formatting and outputting logs
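The plugin stages listed above (input, parsing, filtering, formatting, output) can be pictured as a chain of small functions. The sketch below is a simplified illustration, not the collector's real plugin API; the log line format and all function names are assumptions made for the example.

```python
import json
import re
from typing import Iterable, Iterator

# Assumed line format for this example: "<LEVEL> <message>".
LINE_PATTERN = re.compile(r"(?P<level>[A-Z]+) (?P<message>.*)")


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Parsing stage: turn raw lines into structured events, skipping unparseable lines."""
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            yield match.groupdict()


def keep_levels(events: Iterable[dict], levels: set) -> Iterator[dict]:
    """Filtering stage: keep only events whose level is in `levels`."""
    for event in events:
        if event["level"] in levels:
            yield event


def to_json(events: Iterable[dict]) -> Iterator[str]:
    """Formatting stage: serialize each event as one JSON document."""
    for event in events:
        yield json.dumps(event, sort_keys=True)


def run_pipeline(lines: Iterable[str], levels: set) -> list:
    """Chain the stages; a real output stage would ship the records to a destination."""
    return list(to_json(keep_levels(parse(lines), levels)))


raw_lines = ["INFO service started", "garbage", "ERROR payment failed"]
for record in run_pipeline(raw_lines, {"ERROR"}):
    print(record)
```

Because each stage only consumes and produces events, stages can be swapped or reordered independently, which is the point of a plugin architecture.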
Elasticsearch
Distributed and highly-available
RESTful search engine
Full-text search
Each index is sharded to parallelize operations
Shard replicas for HA
JSON-document oriented
No need for upfront schema definition
(Near) real-time search
Inverted index
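To make the inverted index idea concrete, here is a toy version: a map from each term to the set of documents that contain it. This is only a sketch; Elasticsearch builds on Lucene, and its real analysis and scoring are far more elaborate. The tokenizer (lowercase, split on whitespace) and the sample documents are assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map every term to the set of document ids in which it appears."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical log messages used as documents.
documents = {
    1: "payment failed for order",
    2: "order created",
    3: "payment accepted",
}
index = build_inverted_index(documents)
print(search(index, "payment"))
print(search(index, "payment order"))
```

A lookup touches only the terms in the query rather than scanning every document, which is what makes full-text search fast.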
ES cluster design
Analytics and visualization for Elasticsearch
Search, view, and interact with data stored in indices
Filter events based on tags such as environment, application name, K8s namespace, etc.
Advanced time-series data analysis
Create custom dashboards including histograms, line graphs, pie charts and sunbursts
Examples
Docker Swarm
https://bitbucket.org/walmartdigital/bff-black-cyber/src/285af956fdd748a843793f2a5aa2786ba758166d/stack.yml?at=master#stack.yml-20,22,64,66,98
K8s
https://bitbucket.org/walmartdigital/backend-k8s/src/c8e4c3e141fa42d6eb203266eaff645b96d35b9b/deployments/production/eastus2/grocery/production/oms/broker/deployment.yaml
https://bitbucket.org/walmartdigital/backend-k8s/src/master/deployments/production/eastus2/grocery/production/shared/configmap.yaml
Metrics
Prometheus
Monitoring system and time series database
Time series collection happens via a pull model over HTTP
Triggers alerts when a rule expression holds true for a given time interval
Multiple modes of graphing and dashboarding supported
Flexible query language to leverage multiple dimensions
Targets are discovered via service discovery (like Consul) or static configuration
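The alerting behavior described above, where a rule must hold for some time before an alert fires, can be sketched as a small state machine. This is an illustration of the idea only, not Prometheus code; the class name, the threshold comparison and the sample values are assumptions. The state names mirror the inactive, pending and firing states that Prometheus reports for alerts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float       # the rule is "true" while the value exceeds this
    hold_seconds: float    # how long the rule must stay true before firing
    active_since: Optional[float] = None

    def evaluate(self, value: float, now: float) -> str:
        """Evaluate one sample taken at time `now` (seconds) and return the alert state."""
        if value <= self.threshold:
            # Condition no longer holds: reset the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


# Hypothetical error-rate samples as (timestamp, value) pairs.
rule = AlertRule(threshold=0.5, hold_seconds=60)
for timestamp, error_rate in [(0, 0.9), (30, 0.9), (60, 0.9), (90, 0.1)]:
    print(timestamp, rule.evaluate(error_rate, timestamp))
```

Requiring the condition to hold for a while avoids notifying on short spikes.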
Anatomy of a Prometheus Metric
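As a rough illustration of that anatomy, a sample in the Prometheus text exposition format is a metric name, an optional set of key="value" labels in braces, and a numeric value. The helper below is a hypothetical sketch that renders one such line; it does not escape special characters in label values, and the metric and label names are made up.

```python
def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample line as name{label="value",...} value."""
    if labels:
        # Sort labels so the output is stable; escaping is deliberately omitted here.
        rendered = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        return f"{name}{{{rendered}}} {value}"
    return f"{name} {value}"


# Hypothetical metric: a request counter with two label dimensions.
print(format_sample("http_requests_total", {"method": "GET", "status": "200"}, 1027))
```

Each distinct combination of label values is its own time series, which is what the query language's "multiple dimensions" refers to.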
Grafana
Feature-rich metrics dashboard and graph editor for multiple data sources like Elasticsearch, OpenTSDB, Prometheus and InfluxDB
Lets you query, visualize, alert on and understand metrics no matter where they are stored
Templating queries for generic dashboards
Different ways to visualize metrics and logs
Alias patterns for short readable series names
Examples
Production Prometheus instance:
https://prometheus.tools.walmartdigital.cl/graph
Business-model metrics
https://bitbucket.org/walmartdigital/job-metrics/src/master/src/index.js
System-level metrics
https://bitbucket.org/walmartdigital/midas-nats/src/master/stack.yml
Instana
APM for microservices-based applications
Real-time alerting and notifications
Integrations with webhooks, Slack, email
1-second metric granularity
Monitoring multiple client technologies based on Sensors
Full-stack distributed tracing
Traces every single request
Auto-discovery, data collection, and dynamic graph features
Examples
Production Instana instance:
https://production-walmartdigital.instana.io
mercurio-slots-api service dashboard
https://production-walmartdigital.instana.io/#/service;serviceId=157e9ee1ac301b74f25f15dd7280ddbd273f01e4/summary?timeline.to&timeline.ws=3600000
Grafana panel from Instana metrics
https://grafana.tools.walmartdigital.cl/d/XqGwU-Qik/slots?refresh=5s&orgId=1
Tracing
Old-school Debugging
Tracing
What's distributed tracing?
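One way to picture distributed tracing: every request gets a trace id, and each unit of work inside it is a span that records its parent span. The sketch below is a minimal illustration under those assumptions, not the data model of Instana, Zipkin or Jaeger; the class and the span names are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str                     # shared by every span of the same request
    parent_id: Optional[str] = None   # None for the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

    def child(self, name: str) -> "Span":
        """Start a span for downstream work, propagating the trace id."""
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)


def start_trace(name: str) -> Span:
    """Create the root span of a new trace."""
    return Span(name=name, trace_id=uuid.uuid4().hex)


# Hypothetical request flowing through two services.
root = start_trace("GET /slots")
db_call = root.child("query slots table")
print(db_call.trace_id == root.trace_id, db_call.parent_id == root.span_id)
```

In a real system the trace id and parent span id travel between services, typically in request headers, so that spans recorded on different hosts can be stitched back into one request tree.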
Examples
Including the Instana sensor in Node.js code
https://bitbucket.org/walmartdigital/mercurio-slots-api/src/71b8edbcd586ac9fe5e381bffcdc8bb87c9df3ec/src/index.js?at=master#index.js-1
Production Instana instance:
https://production-walmartdigital.instana.io
mercurio-slots-api service dashboard
https://production-walmartdigital.instana.io/#/service;serviceId=157e9ee1ac301b74f25f15dd7280ddbd273f01e4/summary?timeline.to&timeline.ws=3600000
Grafana
Visualize metrics from tracing tools such as:
Instana
https://grafana.com/plugins/instana-datasource
Zipkin
https://grafana.com/dashboards/1598
Jaeger
https://grafana.com/dashboards/7439
Alerting
AlertManager
What is an alert?
Prometheus server sends alerts to an Alertmanager
Alertmanager manages the alerts it receives
Aggregate
Group
Inhibit
Silence
Route and send out notifications
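Two of the operations above, grouping and silencing, can be sketched with plain label matching. This is a simplified illustration, not Alertmanager's implementation or configuration format; the function names, label names and alerts are assumptions.

```python
from collections import defaultdict

Alert = dict[str, str]  # an alert is represented here as a set of labels


def is_silenced(alert: Alert, silences: list[Alert]) -> bool:
    """An alert is silenced when it carries every label of at least one silence."""
    return any(
        all(alert.get(key) == value for key, value in silence.items())
        for silence in silences
    )


def group_alerts(alerts: list[Alert], group_by: list[str], silences: list[Alert]) -> dict:
    """Drop silenced alerts, then bucket the rest by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        if is_silenced(alert, silences):
            continue
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts and a silence covering the staging environment.
alerts = [
    {"alertname": "HighLatency", "service": "api", "env": "prod"},
    {"alertname": "HighLatency", "service": "web", "env": "prod"},
    {"alertname": "DiskFull", "service": "db", "env": "staging"},
]
groups = group_alerts(alerts, group_by=["alertname"], silences=[{"env": "staging"}])
for key, members in groups.items():
    print(key, len(members))
```

Each group would then produce a single notification, so an outage affecting many services does not turn into a flood of messages.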
Example
Slack receiver configuration:
https://bitbucket.org/walmartdigital/tools-deployments/src/f84922e84a93b8c51d00bd79bc5381b634030403/prometheus.yaml?at=master#prometheus.yaml-403
Consul backups job alerting rule:
https://bitbucket.org/walmartdigital/tools-deployments/src/f84922e84a93b8c51d00bd79bc5381b634030403/prometheus.yaml?at=master#prometheus.yaml-226,229
Grafana Alerts
ElastAlert
Open source project from Yelp
Alerting based on logs or other information managed in Elasticsearch
Define alerting conditions based on a DSL query
Define notification channel
Slack
Email
JIRA
HipChat
...
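A common kind of log-based alerting condition is "at least N matching events within a time window". The sketch below illustrates that logic in isolation; it is not ElastAlert's code or rule syntax, and it assumes the matching events have already been fetched by a query and arrive in timestamp order. The class name and parameters are hypothetical.

```python
from collections import deque


class FrequencyRule:
    """Signal an alert when `num_events` events fall within `timeframe_seconds`."""

    def __init__(self, num_events: int, timeframe_seconds: float):
        self.num_events = num_events
        self.timeframe_seconds = timeframe_seconds
        self._timestamps = deque()

    def add_event(self, timestamp: float) -> bool:
        """Record one matching event and report whether the rule now triggers."""
        self._timestamps.append(timestamp)
        # Forget events that have slid out of the time window.
        while timestamp - self._timestamps[0] > self.timeframe_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.num_events


# Hypothetical stream of error-log timestamps, in seconds.
rule = FrequencyRule(num_events=3, timeframe_seconds=60)
for ts in [0, 30, 50]:
    print(ts, rule.add_event(ts))
```

When the rule triggers, the alert would be handed to one of the notification channels listed above.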
Instana Alerts
Integrates with Instana concepts of
Issues
Incidents
Changes
Uses Lucene query syntax to filter assets
Integrations
represent notification channels
Slack
Email
Splunk
Office365
...
Questions
Feedback
https://bit.ly/2J9G8ui
Thank you!