Practicalities of Managing Large Clusters

Avishai Ish-Shalom (@nukemberg)

Size Matters

Things don't scale linearly

at some point, we get a...

Phase Change

Traditional IT

Large Scale IT

Broken is the new normal

Capacity/Latency tradeoffs

E.g., you can have spare capacity and be out of capacity simultaneously!

The Straggler pattern
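
A rough way to see why stragglers start to dominate at scale, sketched under assumptions the slides do not state: a request fans out to several servers, each server is slow independently with the same small probability, and the request is only as fast as its slowest reply. The function name and the 1% figure below are illustrative, not taken from the talk.

```python
def straggler_probability(p_slow: float, fanout: int) -> float:
    """Chance that at least one of `fanout` independent calls is slow.

    Assumes every call is slow with the same probability `p_slow`
    and that calls are independent of each other (a simplification).
    """
    return 1.0 - (1.0 - p_slow) ** fanout


if __name__ == "__main__":
    # Hypothetical numbers: each server is slow 1% of the time.
    for fanout in (1, 10, 100, 1000):
        print(fanout, round(straggler_probability(0.01, fanout), 3))
```

With these assumed numbers, a fan-out of 100 already hits at least one slow server in roughly 63% of requests, even though each individual server is slow only 1% of the time.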

Floods, Storms and other weather effects

Ever tried to turn on 5000 machines at once?

Ever tried to reboot a datacenter?

Storms
(aka feedback loops)
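
One common source of such feedback loops is clients that retry in lockstep after a failure, which adds load to an already struggling service. A widely used mitigation, which the slides do not name, is exponential backoff with random jitter. The sketch below is a minimal illustration; the function name, base delay and cap are assumptions, not values from the talk.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Return a retry delay in seconds using "full jitter" backoff.

    The ceiling doubles with every attempt (base * 2**attempt) up to `cap`;
    the actual delay is drawn uniformly between 0 and that ceiling so that
    many clients failing together do not retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Print one sample delay per retry attempt; values differ between runs.
    for attempt in range(6):
        print(attempt, round(backoff_delay(attempt), 2))
```

The randomness is the important part: a fixed backoff schedule keeps clients synchronized, so every retry wave arrives as another spike.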

Maintenance Overhead

Minimize Recovery Paths

I, Robot

Assume everything is broken

Monitoring

What do you do when you have more servers than pixels?

Facebook Claspin

Graphs & Metrics

Alerting

Questions?
