Monitoring 101

Nir Cohen


@nir0s

Agenda

  • Why are we here?
  • What is monitoring?
  • Why monitor?
  • Metrics
  • Pipeline
  • Some basic math
  • Metrics vs. Logs

Why?

  • Increase BigPanda's SLA
  • A neglected area in software
  • NOT to become experts

To monitor is to

observe and check the progress or quality of (something) over a period of time; keep under systematic review.

What is a Monitoring Pipeline?

+ IMPROVEMENT

Monitor What (in software)?

  • Infra (CPU, memory, inodes, network, IOPS, disk space, etc.)
  • Middleware (web-server request count, thread count, etc.)
  • Product (user events, functional performance, etc.)
  • Business KPIs (revenue, SLA, etc.)

MEME PLACEHOLDER  FOR ERIK

Why Monitor?

  • Visibility (System Failure, Proactivity → Prevention)
  • Feedback Loops (automatic and manual)

Metrics

  • What is a metric
  • Namespace History
  • Formats
  • Dimensionality & Tags
  • Types of metrics (Counters, Gauges, Timers, Histograms, (Sets))
  • Functions (sum, upper, mean, etc.)
  • Naming conventions
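
The metric types and functions listed above can be sketched as tiny in-memory classes. This is only an illustration of the concepts, not any particular library's API: the class names `Counter`, `Gauge`, and `Timer` and their methods are hypothetical.

```python
import statistics


class Counter:
    """Counts events; it only ever goes up between flushes."""

    def __init__(self):
        self.value = 0

    def incr(self, n=1):
        self.value += n


class Gauge:
    """Holds the most recent reading of a value that can go up or down."""

    def __init__(self):
        self.value = None

    def set(self, v):
        self.value = v


class Timer:
    """Collects durations so functions like sum, mean, and upper can be applied."""

    def __init__(self):
        self.samples = []

    def record(self, ms):
        self.samples.append(ms)

    def summary(self):
        return {
            "count": len(self.samples),
            "sum": sum(self.samples),
            "mean": statistics.mean(self.samples),
            "upper": max(self.samples),
        }
```

A counter answers "how many times did this happen?", a gauge answers "what is the value right now?", and a timer keeps enough samples to answer "how long does this usually take, and how bad does it get?".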

Naming

1932819231 correlation.incidents_copied_count 13|c
1932819231 correlation.kafka_to_rabbit_copy 13|c
incidents_copied_count,service=corr,host=prod-corr-11,org=intel 13|c
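
The two naming styles above (a dotted namespace versus a flat name with key=value tags) can be sketched with a few helpers. The function names are hypothetical, and the output follows the notation used on the slide rather than the wire format of a specific tool.

```python
def dotted_name(*parts):
    """Join namespace parts into a hierarchical name, e.g. service.metric."""
    return ".".join(parts)


def tagged_name(name, tags):
    """Append key=value tags to a metric name, in the order given."""
    return ",".join([name] + [f"{k}={v}" for k, v in tags.items()])


def parse_tagged_name(series):
    """Split a tagged name back into (name, tags)."""
    name, *pairs = series.split(",")
    return name, dict(pair.split("=", 1) for pair in pairs)
```

With tags, the same metric name can be filtered or grouped by service, host, or org at query time, instead of encoding each dimension into the name itself.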

Pipeline[collecting]

  • Push vs. Scrape
  • Agents
  • Code

Pipeline[shipping]

  • Protocols (TCP, UDP, HTTP)
  • Aggregation/Bucketing
  • Sampling
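
As a rough sketch of shipping with sampling, the snippet below sends a counter over UDP and drops a share of the events on the client side. It assumes the StatsD text format (`name:value|c|@rate`), which differs slightly from the notation on the Naming slide; the function names and the `rng` parameter are hypothetical.

```python
import random
import socket


def statsd_counter_line(name, value=1, sample_rate=1.0):
    """Format a counter in the StatsD text format: name:value|c[|@rate]."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line


def send_counter(sock, addr, name, value=1, sample_rate=1.0, rng=random.random):
    """Send a counter over UDP, skipping some events when sampling.

    Returns True if a datagram was sent. UDP is fire-and-forget: there is
    no delivery guarantee, which is the trade-off for low overhead.
    """
    if sample_rate < 1.0 and rng() >= sample_rate:
        return False  # sampled out; the receiver scales up using @rate
    sock.sendto(statsd_counter_line(name, value, sample_rate).encode(), addr)
    return True
```

Usage, assuming an aggregator listening on the hypothetical address 127.0.0.1:8125:

```python
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_counter(sock, ("127.0.0.1", 8125), "correlation.incidents_copied_count", 13)
```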

Pipeline[storing]

  • What is a Timeseries?
  • What is a Timeseries Database? (TSDB)
  • How much data to keep? And at what resolution?
  • Why not use a relational database?
  • Buckets
  • Downsampling (storage schemas)
  • Retention Periods (technical limits and system/business requirements)
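
Downsampling into buckets can be sketched as follows: raw (timestamp, value) points are grouped into fixed-width time buckets and each bucket is reduced to one value. Averaging is assumed here as the reduction; a real storage schema might keep the sum, max, or last value instead. The function name `downsample` is hypothetical.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) sorted by bucket_start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]
```

Keeping fine-grained points for a short period and coarser buckets for longer is what lets retention grow without storage growing at the same rate.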

Pipeline[graphing]

GRAPH CHAOS!

MOAR CHAOS!

DASH CHAOS!

Graph Color

Dash Color

Dual-Y

Pipeline[alerting]

  • What is an alert
  • Alert Criterion
    • Law of Actionability
    • Time Spans
    • Alert Hierarchy (and channeling)
  • Fatigue
  • Levels*
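
One way to read the "Time Spans" criterion above is that a single bad reading should not page anyone; the condition has to hold for a while. A minimal sketch, assuming evenly spaced readings and a hypothetical `should_alert` helper:

```python
def should_alert(values, threshold, min_consecutive):
    """Return True if `values` ends with at least `min_consecutive`
    readings above `threshold`."""
    if min_consecutive <= 0 or len(values) < min_consecutive:
        return False
    return all(v > threshold for v in values[-min_consecutive:])
```

Requiring several consecutive breaches trades a slower alert for fewer false alarms, which is one lever against alert fatigue.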

Some Math

Averages

Percentiles
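
A small worked example of why averages and percentiles tell different stories. The latency numbers are made up for illustration, and `percentile` uses the nearest-rank definition, which is one of several in common use.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    the samples at or below it (0 < p <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Nine fast requests and one very slow one (illustrative data).
latencies_ms = [10, 11, 12, 12, 13, 13, 14, 15, 16, 900]

mean_ms = statistics.mean(latencies_ms)  # 101.6
p50_ms = percentile(latencies_ms, 50)    # 13
p90_ms = percentile(latencies_ms, 90)    # 16
p99_ms = percentile(latencies_ms, 99)    # 900
```

The mean (101.6 ms) describes no request that actually happened: most took about 13 ms and one took 900 ms. The percentiles keep both facts visible.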

Metrics vs. Logs

{
  "timestamp": 12391238121,
  "message": "Received Incoming Event",
  "org": "AWS",
  "metadata" : {
    "method": "POST",
    "service": "consumer",
    "params": [...]
  }

  ...
}

~=

incoming_events_count,service=consumer,method=POST,... 1|c
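
The rough equivalence above between a log event and a metric can be sketched as a reduction: keep a few low-cardinality fields as tags and count the event. The function name `log_to_metric` and the choice of which fields become tags are assumptions made for illustration.

```python
def log_to_metric(event, name="incoming_events_count"):
    """Reduce a structured log event to a counter line with tags.

    Only the service and method fields are kept as tags; the message,
    the timestamp, and the request params are dropped.
    """
    meta = event.get("metadata", {})
    tags = {"service": meta.get("service"), "method": meta.get("method")}
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items() if v is not None)
    series = f"{name},{tag_str}" if tag_str else name
    return f"{series} 1|c"
```

The conversion is lossy in one direction only: a metric can be derived from a log, but the original message and params cannot be recovered from the metric.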

