Monitoring done wrong

Learning by failing

Avishai Ish-Shalom (@nukemberg)

Is your monitor working?

How do you know?

If you've never seen it fail, you've never seen it working

Oh no, CPU > 70%!!!!

Monitoring requires a specification of "ok"
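
One way to make "ok" explicit is to write it down as data and evaluate measurements against it. The sketch below is only an illustration: the `OkSpec` class, its field names, and the thresholds are hypothetical, not taken from any particular monitoring tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class OkSpec:
    """Hypothetical, explicit definition of "ok" for one service."""
    max_p99_latency_ms: float
    max_error_rate: float  # fraction of failed requests, 0.0 to 1.0


def evaluate(spec: OkSpec, p99_latency_ms: float, error_rate: float) -> list[str]:
    """Return the violated conditions; an empty list means "ok"."""
    violations = []
    if p99_latency_ms > spec.max_p99_latency_ms:
        violations.append(
            f"p99 latency {p99_latency_ms} ms exceeds {spec.max_p99_latency_ms} ms"
        )
    if error_rate > spec.max_error_rate:
        violations.append(
            f"error rate {error_rate} exceeds {spec.max_error_rate}"
        )
    return violations


# Assumed thresholds, for illustration only.
spec = OkSpec(max_p99_latency_ms=250.0, max_error_rate=0.01)
healthy = evaluate(spec, p99_latency_ms=120.0, error_rate=0.001)
unhealthy = evaluate(spec, p99_latency_ms=400.0, error_rate=0.05)
```

Note that CPU usage does not appear in the spec at all: under this definition, a host at 90% CPU that still meets both conditions is "ok".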

Your memcached is broken

Or is it?

stats.timer.xmemcached.*.p99

What does this metric actually measure?

What can possibly go wrong?

  • Client <> Memcached network
  • Client GC
  • Client CPU starvation
  • Client queue
  • Client <> StatsD network
  • StatsD GC

"memcached latency" = memcached server latency + network + client GC + client queueing + ...
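
A toy model shows how far a client-side timer can drift from the server's real latency. It assumes a single connection that serves requests one at a time, a constant server time, and a burst of requests arriving at the same instant. The function names are hypothetical and this is not how xmemcached itself works; it only illustrates the queueing term in the sum above.

```python
from __future__ import annotations

import math


def observed_latencies(server_ms: float, n_requests: int) -> list[float]:
    """Latency a client-side timer records for n requests that arrive at t=0
    and are served one at a time, each taking server_ms on the server."""
    latencies = []
    clock = 0.0
    for _ in range(n_requests):
        clock += server_ms       # finishes only after all earlier requests
        latencies.append(clock)  # timer started at arrival, so queue wait is included
    return latencies


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, with p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


# The server takes a constant 1 ms per request, yet the client timer's p99 is 99 ms.
samples = observed_latencies(server_ms=1.0, n_requests=100)
client_p99 = percentile(samples, 99)
```

In this model the server never got slower; the entire difference is client-side queueing, which the metric reports as "memcached latency".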

If you don't know what you are measuring, it's better not to measure it

Darn it, DB is down!
Quick! Recover from backups!

Backups not working since July

Monitoring things that didn't happen is just as important as monitoring things that did happen
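
A common way to catch things that did not happen is a "dead man's switch": every job reports when it succeeds, and the monitor alerts on silence. A minimal sketch, where the class, the job name, and the time limit are all hypothetical:

```python
from __future__ import annotations

import time
from typing import Optional


class DeadMansSwitch:
    """Tracks jobs that are expected to report in regularly."""

    def __init__(self, max_silence_s: dict[str, float]) -> None:
        # Job name -> longest tolerated gap between reports, in seconds.
        self.max_silence_s = dict(max_silence_s)
        self.last_seen: dict[str, float] = {}

    def report(self, job: str, now: Optional[float] = None) -> None:
        """Called by a job after it completes successfully."""
        self.last_seen[job] = time.time() if now is None else now

    def overdue(self, now: Optional[float] = None) -> list[str]:
        """Jobs that have been silent too long, or have never reported."""
        now = time.time() if now is None else now
        late = []
        for job, limit in self.max_silence_s.items():
            last = self.last_seen.get(job)
            # A job that never reported counts as overdue: "never ran" is a failure too.
            if last is None or now - last > limit:
                late.append(job)
        return sorted(late)


switch = DeadMansSwitch({"nightly-backup": 24 * 3600})
switch.report("nightly-backup", now=1_000)
fine = switch.overdue(now=2_000)
late = switch.overdue(now=1_000 + 24 * 3600 + 1)
```

The alert fires because a report is missing, so a backup job that silently stopped in July would be noticed a day later rather than at restore time.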

Monitor state, not activity

  • Last backup artifact timestamp
  • SSL certificate expiry date
  • Deployed version on instances
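
The first two items can be checked directly from state, without waiting for any job to log an error. A minimal sketch using only the Python standard library; the function names are hypothetical, and the certificate date string is assumed to be in the format of the `notAfter` field returned by `ssl.SSLSocket.getpeercert()`.

```python
import os
import ssl
import time


def backup_age_seconds(path, now=None):
    """Seconds since the backup artifact at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def days_until_cert_expiry(not_after, now=None):
    """Days left before a certificate expires.

    `not_after` is a string such as "Jan  5 09:34:43 2018 GMT".
    A negative result means the certificate has already expired.
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

A check built on these would alert when the backup age exceeds the backup interval, or when the days remaining drop below the time needed to renew the certificate. Both thresholds depend on the environment and are left to the reader.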

If you wait for errors, good luck

The system is down! Let's look at the monitor!

Monitoring systems should be decoupled from the systems they monitor*

*As much as possible

The system is down! Let's look at the graphs!

Monitoring systems need to be more available and scalable than the systems being monitored.

- Adrian Cockcroft

What have we learned?

A monitoring system should be

  • Decoupled
  • Reliable, scalable
  • Simple
  • Understandable
  • Focused on state, not activity
  • Realistic

Monitoring should be designed for extreme scenarios

Uptime is an illusion created by lack of monitoring

- @nukemberg

Thanks! Questions?
