Monitoring done wrong
Learning by failing
Avishai Ish-Shalom (@nukemberg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/50960/images/4923081/aleph_black.png)
Is your monitor working?
How do you know?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/50960/images/6129112/30wmt5.jpg)
If you've never seen it fail, you've never seen it working
Oh no, CPU > 70%!!!!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/50960/images/6129170/30wnfu.jpg)
Monitoring requires a specification of "ok"
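One way to make "ok" concrete is to write it down as explicit thresholds on symptoms users actually feel, instead of alerting on a raw resource number such as CPU. A minimal sketch, assuming hypothetical names (`OkSpec`, `is_ok`) and illustrative thresholds that are not from the talk:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OkSpec:
    """Hypothetical specification of "ok" for a request-serving service."""
    max_p99_latency_ms: float  # assumed threshold, pick your own
    max_error_rate: float      # fraction of failed requests, 0.0 to 1.0


def is_ok(spec: OkSpec, p99_latency_ms: float, error_rate: float) -> bool:
    """Return True only if every observed symptom is within the spec."""
    return (
        p99_latency_ms <= spec.max_p99_latency_ms
        and error_rate <= spec.max_error_rate
    )


spec = OkSpec(max_p99_latency_ms=250.0, max_error_rate=0.01)

# CPU usage is deliberately not an input: a busy CPU is not a failure by itself.
print(is_ok(spec, p99_latency_ms=120.0, error_rate=0.001))
print(is_ok(spec, p99_latency_ms=900.0, error_rate=0.001))
```

With a spec like this, "CPU > 70%" only matters if it pushes latency or errors outside the stated limits.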
Your memcached is broken
Or is it?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/50960/images/6129202/putin-laughing.gif)
stats.timer.xmemcached.*.p99
What does this metric actually measure?
What can possibly go wrong?
- Client <> Memcached network
- Client GC
- Client CPU starvation
- Client queue
- Client <> StatsD network
- StatsD GC
"memcached latency" = actual memcached latency + network + client GC + client queueing + ...
If you don't know what you are measuring, it's better not to measure it
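The point about client-side timers can be illustrated with a small simulation. This is only a sketch under stated assumptions: `fake_server_work` and `client_get` are hypothetical stand-ins, no real memcached or StatsD client is involved, and the sleep durations are arbitrary.

```python
import time


def timed(call):
    """Measure wall-clock time around a call, as a client-side timer does."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def fake_server_work():
    # Hypothetical stand-in for the server answering in about 1 ms.
    time.sleep(0.001)


def client_get(client_delay_s):
    # Stand-in for a client-side GC pause, CPU starvation or queueing.
    time.sleep(client_delay_s)
    fake_server_work()


fast = timed(lambda: client_get(0.0))
slow = timed(lambda: client_get(0.05))

# The server did the same work both times, yet the "latency" metric differs.
print(f"no client delay: {fast * 1000:.1f} ms")
print(f"50 ms client delay: {slow * 1000:.1f} ms")
```

The timer cannot tell which component added the delay, so a high p99 here says nothing certain about memcached itself.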
Darn it, DB is down!
Quick! Recover from backups!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/50960/images/6129326/double-facepalm.jpg)
Backups not working since July
Monitoring things that didn't happen is just as important as monitoring things that did happen
Monitor state, not activity
- Last backup artifact timestamp
- SSL certificate expiry date
- Deployed version on instances
If you wait for errors, good luck
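State checks like the ones listed above can be written as small functions that look at what exists right now, not at whether a job reported an error. A minimal sketch, assuming hypothetical function names and an illustrative 26-hour freshness window for a daily backup:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_fresh(
    last_backup_at: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: daily backup plus slack
) -> bool:
    """True if the newest backup artifact exists and is recent enough."""
    if last_backup_at is None:
        # No artifact at all is a failure, even though nothing "errored".
        return False
    return now - last_backup_at <= max_age


def cert_days_left(not_after: datetime, now: datetime) -> int:
    """Whole days until the certificate's expiry date."""
    return (not_after - now).days


now = datetime.now(timezone.utc)
print(backup_is_fresh(now - timedelta(hours=3), now))
print(backup_is_fresh(now - timedelta(days=300), now))
print(cert_days_left(now + timedelta(days=10), now))
```

In practice `last_backup_at` could come from the modification time of the newest file in the backup location and `not_after` from the served certificate; both sources are left out here as deployment-specific.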
The system is down! Let's look at the monitor!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/50960/images/6129302/Screen_Shot_2019-05-14_at_10.58.28.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/50960/images/6129268/Screen_Shot_2019-05-14_at_10.51.08.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/50960/images/6129195/hnrt2uupp3d11.jpg)
Monitoring systems should be decoupled from the systems they monitor*
*As much as possible
The system is down! Let's look at the graphs!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/50960/images/6129588/Screen_Shot_2019-05-14_at_12.18.40.png)
Monitoring systems need to be more available and scalable than the systems being monitored.
- Adrian Cockcroft
What have we learned?
A monitoring system should be
- Decoupled
- Reliable, scalable
- Simple
- Understandable
- State-based, not activity-based
- Realistic
Monitoring should be designed for extreme scenarios
Uptime is an illusion created by lack of monitoring
- @nukemberg
Thanks! Questions?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/50960/images/6129081/unexplainable-pictures37.jpg)