Monitoring done wrong
Learning by failing
Avishai Ish-Shalom (@nukemberg)
Is your monitor working?
How do you know?
If you've never seen it fail, you've never seen it working
Oh no, CPU > 70%!!!!
Monitoring requires a specification of "ok"
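One way to make "ok" explicit is to write it down as data instead of reacting to a number on a graph. A minimal sketch, assuming a hypothetical rule that CPU is only "not ok" when it stays above a limit for several consecutive samples. The names `OkSpec` and `is_ok` and the numbers used with them are illustrative, not from the talk.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OkSpec:
    """Hypothetical specification of "ok" for a single metric."""
    limit: float            # values above this are suspicious
    sustained_samples: int  # consecutive samples that must exceed the limit

    def __post_init__(self) -> None:
        if self.sustained_samples < 1:
            raise ValueError("sustained_samples must be at least 1")


def is_ok(samples: Sequence[float], spec: OkSpec) -> bool:
    """Return False only if the most recent samples all exceed the limit."""
    recent = samples[-spec.sustained_samples:]
    if len(recent) < spec.sustained_samples:
        return True  # not enough data to declare a violation
    return not all(value > spec.limit for value in recent)
```

With `OkSpec(limit=70.0, sustained_samples=3)`, a single spike to 95% is still "ok", while three samples in a row above 70% are not. Whether that is the right rule depends on the system; the point is that the rule is stated somewhere.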
Your memcached is broken
Or is it?
stats.timer.xmemcached.*.p99
What does this metric actually measure?
What can possibly go wrong?
- Client <> Memcached network
- Client GC
- Client CPU starvation
- Client queue
- Client <> StatsD network
- StatsD GC
"memcached latency" = actual memcached latency + network + client GC + client queueing + ...
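The sum above can be sketched with timestamps taken along one request's path. This is an illustration under assumptions: the client timer is assumed to run from enqueue to callback, the field names are hypothetical, and the numbers in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTimestamps:
    """Hypothetical timestamps, in microseconds, for one request."""
    enqueued: int         # request placed on the client's queue
    sent: int             # request written to the socket
    server_received: int  # memcached starts processing
    server_replied: int   # memcached finishes processing
    callback_ran: int     # client code finally sees the response


def client_timer(ts: RequestTimestamps) -> int:
    """What a client-side timer reports: everything from enqueue to callback."""
    return ts.callback_ran - ts.enqueued


def server_latency(ts: RequestTimestamps) -> int:
    """Time memcached itself spent on the request."""
    return ts.server_replied - ts.server_received


def everything_else(ts: RequestTimestamps) -> int:
    """Queueing, network, GC pauses and scheduling delays around the server."""
    return client_timer(ts) - server_latency(ts)
```

For made-up timestamps `RequestTimestamps(0, 4000, 5000, 5500, 30000)`, the client timer reports 30000 microseconds while memcached spent only 500 of them. The metric would say "memcached is slow" when memcached is fine.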
If you don't know what you are measuring, it's better not to measure it
Darn it, DB is down!
Quick! Recover from backups!
Backups not working since July
Monitoring things that didn't happen is just as important as monitoring things that did happen
Monitor state, not activity
- Last backup artifact timestamp
- SSL certificate expiry date
- Deployed version on instances
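Two of the state checks above can be sketched in a few lines. This is a sketch under assumptions, not a full monitor: the function names, the path argument and the thresholds are hypothetical, and the certificate date is assumed to arrive as a `notAfter` string in the format Python's `ssl` module uses.

```python
import os
import ssl
import time
from typing import Optional


def backup_is_fresh(path: str, max_age_seconds: float,
                    now: Optional[float] = None) -> bool:
    """State check: does a recent enough backup artifact exist?

    A missing or unreadable file counts as stale, so "the backup job
    never ran" is caught as well as "the backup job reported an error".
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return False
    return age <= max_age_seconds


def cert_days_left(not_after: str, now: Optional[float] = None) -> float:
    """Days until expiry for a certificate 'notAfter' string,
    for example 'Jan  5 09:34:43 2018 GMT'."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

Both functions look at what exists right now (a file timestamp, an expiry date) and do not depend on any job emitting an error. A check such as `backup_is_fresh(path, 24 * 3600)` fails on its own once backups silently stop.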
If you wait for errors, good luck
The system is down! Let's look at the monitor!
Monitoring systems should be decoupled from the systems they monitor*
*As much as possible
The system is down! Let's look at the graphs!
Monitoring systems need to be more available and scalable than the systems being monitored.
- Adrian Cockcroft
What have we learned?
A monitoring system should be
- Decoupled
- Reliable, scalable
- Simple
- Understandable
- Focused on state, not activity
- Realistic
Monitoring should be designed for extreme scenarios
Uptime is an illusion created by lack of monitoring
- @nukemberg
Thanks! Questions?