Monitoring done wrong

Learning by failing

Avishai Ish-Shalom (@nukemberg)

Is your monitor working?

How do you know?

If you've never seen it fail, you've never seen it working

Oh no, CPU > 70%!!!!

Monitoring requires a specification of "ok"
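
One way to read this slide: a raw threshold such as CPU > 70% says nothing by itself unless "ok" has been written down first. A minimal sketch of such a specification follows. The class name `OkSpec`, its fields, and the thresholds in the usage note are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OkSpec:
    """Hypothetical definition of "ok" for a single service."""
    max_p99_latency_ms: float
    max_error_rate: float

    def violations(self, p99_latency_ms: float, error_rate: float) -> List[str]:
        """Return a human-readable list of the ways the service is not ok."""
        problems = []
        if p99_latency_ms > self.max_p99_latency_ms:
            problems.append(
                f"p99 latency {p99_latency_ms} ms exceeds {self.max_p99_latency_ms} ms"
            )
        if error_rate > self.max_error_rate:
            problems.append(
                f"error rate {error_rate} exceeds {self.max_error_rate}"
            )
        return problems
```

With `OkSpec(max_p99_latency_ms=250.0, max_error_rate=0.01)`, a service at 120 ms and 0.1% errors produces no violations, whatever its CPU usage happens to be.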

Your memcached is broken

Or is it?

stats.timer.xmemcached.*.p99

What does this metric actually measure?

What can possibly go wrong?

Client <> Memcached network
Client GC
Client CPU starvation
Client queue
Client <> StatsD network
StatsD GC

"memcached latency" = memcached server latency + network + client GC + client queueing + ...

If you don't know what it is you measure, it's better you don't
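
The decomposition of "memcached latency" can be made concrete with a small sketch. It assumes four hypothetical timestamps taken on the client (request enqueued, request sent, response received, caller resumed). These are illustration points only, not something the xmemcached client or StatsD is known to expose.

```python
from typing import Dict


def breakdown(enqueued_at: int, sent_at: int, received_at: int, done_at: int) -> Dict[str, int]:
    """Split one client-side timer sample (in ms) into its components.

    All four timestamps are hypothetical instrumentation points on the client.
    """
    return {
        # Time spent waiting in the client's own queue before sending.
        "client_queue": sent_at - enqueued_at,
        # Network round trip plus the work memcached itself did.
        "network_plus_server": received_at - sent_at,
        # Client-side delay after the response arrived (GC, CPU starvation).
        "client_after_response": done_at - received_at,
        # What a timer wrapped around the whole call would report.
        "reported_by_timer": done_at - enqueued_at,
    }
```

For `breakdown(0, 40, 42, 90)` the timer reports 90 ms, of which only 2 ms involve the network and memcached at all.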

Darn it, DB is down!
Quick! Recover from backups!

Backups not working since July

Monitoring things that didn't happen is just as important as monitoring things that did happen

Monitor state, not activity

Last backup artifact timestamp
SSL certificate expiry date
Deployed version on instances
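
The state checks listed above can be sketched in a few lines. This is an illustration under assumptions, not a complete monitor. The function names and the idea of passing `now` in explicitly are hypothetical. `os.path.getmtime` and `ssl.cert_time_to_seconds` are standard library calls, and the `not_after` string is assumed to be in the format returned in the `notAfter` field of `SSLSocket.getpeercert()`.

```python
import os
import ssl
import time
from typing import Dict, Optional, Set


def backup_is_fresh(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """True if the backup artifact exists and was modified recently enough."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing artifact is a failure state too, not "no data".
        return False
    return (now - mtime) <= max_age_seconds


def days_until_cert_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate expires (negative if already expired)."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def deployed_versions(version_by_instance: Dict[str, str]) -> Set[str]:
    """Distinct versions currently deployed; more than one may indicate drift."""
    return set(version_by_instance.values())
```

None of these checks waits for an error to occur. Each one inspects the current state and can fail even when nothing has "happened", which is exactly the backups-since-July case.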

If you wait for errors, good luck

The system is down! Let's look at the monitor!

Monitoring systems should be decoupled from the systems they monitor*

*As much as possible
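
One rough way to reason about decoupling is to list what each side depends on and look for overlap. The sketch below assumes the dependencies have been written down by hand as plain names. The function name and the example dependency names in the usage note are hypothetical.

```python
from typing import Iterable, Set


def shared_dependencies(system_deps: Iterable[str], monitor_deps: Iterable[str]) -> Set[str]:
    """Dependencies that the monitored system and its monitor have in common.

    Anything returned here is a component whose failure could take down
    the system and the monitor at the same time.
    """
    return set(system_deps) & set(monitor_deps)
```

For example, `shared_dependencies({"dns-internal", "db-primary"}, {"dns-internal", "smtp-external"})` returns `{"dns-internal"}`. An empty result is the goal, as much as possible.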

The system is down! Let's look at the graphs!

Monitoring systems need to be more available and scalable than the systems being monitored.

- Adrian Cockcroft

What have we learned?

Monitoring systems should be

Decoupled
Reliable, scalable
Simple
Understandable
Check state, not activity
Realistic

Monitoring should be designed for extreme scenarios

Uptime is an illusion created by lack of monitoring

- @nukemberg