Practicalities of Managing Large Clusters
Avishai Ish-Shalom (@nukemberg)
Size Matters
Things don't scale linearly
at some point, we get a...
Phase Change
Traditional IT
- 1:10 admin/server
- 1:1000 admin/user
- 2 machines/cluster
- Centralized network
- A/B power grid
- Redundant infrastructure
Large Scale IT
- 1:10000 admin/server
- 1:10M admin/user
- 10000 machines/cluster
- Mesh network
- Single power feed, multiple grids
- Non-redundant, compartmentalized infra
Broken is the new normal
- 0% malfunctions is infeasible
- Some (individual) failures
- The system is OK
- We care about capacity/latency
- Designed for fault tolerance
Capacity/Latency tradeoffs
- Statistics, statistics, statistics
- >80% capacity -> high latency
- Load balancing/scheduling is hard
- Multiple dimensions
E.g., you can have spare capacity and be out of capacity simultaneously!
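A minimal sketch of how that can happen, assuming capacity has two dimensions (CPU and RAM) and a task must fit on a single node. The `Node` class, the helper names and all numbers are hypothetical and only illustrate the point:

```python
from dataclasses import dataclass


@dataclass
class Node:
    free_cpu: int      # free cores on this node
    free_ram_gb: int   # free memory on this node, in GB


def aggregate_free(nodes):
    """Total free CPU and RAM summed over the whole cluster."""
    return (sum(n.free_cpu for n in nodes),
            sum(n.free_ram_gb for n in nodes))


def can_place(nodes, cpu, ram_gb):
    """A task fits only if ONE node has enough of EVERY resource."""
    return any(n.free_cpu >= cpu and n.free_ram_gb >= ram_gb for n in nodes)


# Hypothetical cluster: half the nodes have CPU left, the other half have RAM left.
nodes = ([Node(free_cpu=8, free_ram_gb=1) for _ in range(50)]
         + [Node(free_cpu=0, free_ram_gb=32) for _ in range(50)])

print(aggregate_free(nodes))              # (400, 1650)
print(can_place(nodes, cpu=4, ram_gb=8))  # False
```

The aggregate says 400 cores and 1650 GB are free, yet a 4-core, 8 GB task has nowhere to run.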
The Straggler pattern
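The slide gives no detail, so this is one common reading rather than the speaker's own definition: a request that fans out to many nodes is only as fast as its slowest sub-request. Assuming each node is independently slow with probability `p_slow` (a simplification), the chance that a request hits at least one straggler is 1 - (1 - p_slow)^fanout:

```python
def p_request_hits_straggler(p_slow, fanout):
    """Probability that at least one of `fanout` independent sub-requests is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


# Hypothetical numbers: 1% of nodes are slow at any given moment.
for fanout in (1, 10, 100):
    print(fanout, round(p_request_hits_straggler(0.01, fanout), 3))
```

With a fan-out of 100, roughly 63% of requests wait on a slow node even though 99% of nodes are fine.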
Floods, Storms and other weather effects
Ever tried to turn on 5000 machines at once?
Ever tried to reboot a datacenter?
Storms
(aka feedback loops)
- Mirroring storms
- Retry storms
- Resync storms
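One widely used way to damp a retry storm is exponential backoff with random jitter, so that thousands of clients do not all retry at the same instant. This is a sketch of that general technique, not something shown in the talk; `call_with_retries`, `jittered_delay` and the default values are hypothetical:

```python
import random
import time


def jittered_delay(attempt, base=0.1, cap=30.0):
    """Random delay in [0, min(cap, base * 2**attempt)] seconds ("full jitter")."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(op, max_attempts=5, sleep=time.sleep, base=0.1, cap=30.0):
    """Call op(); on failure wait a jittered, growing delay before retrying."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: give up instead of adding load
            sleep(jittered_delay(attempt, base, cap))
```

The hard cap on attempts matters as much as the jitter: unbounded retries are what turn a brief outage into a feedback loop.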
Maintenance Overhead
Minimize Recovery Paths
- Automated provisioning
- Re-provision first
- Immutable servers
- Amputate
I, Robot
- Automate everything
- Throttling
- Locking
- Monitoring integration
- Tools instead of Docs
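A sketch of what throttling plus monitoring integration might look like for an automated rolling operation: work in small batches, check health after each host, and stop once a failure budget is spent. The function, its parameters and the callbacks `action` and `is_healthy` are hypothetical stand-ins for real tooling:

```python
def rolling_apply(hosts, action, is_healthy, batch_size=10, max_failures=3):
    """Apply `action` to hosts in small batches; abort when too many hosts fail."""
    done, failed = [], []
    for start in range(0, len(hosts), batch_size):
        for host in hosts[start:start + batch_size]:
            action(host)
            # Monitoring integration: ask the health check, do not assume success.
            (done if is_healthy(host) else failed).append(host)
        if len(failed) >= max_failures:
            break  # throttle: stop before automation takes down the cluster
    return done, failed


# Hypothetical usage: a no-op action on 100 hosts, all of which report unhealthy.
hosts = [f"node{i:03d}" for i in range(100)]
done, failed = rolling_apply(hosts, action=lambda h: None, is_healthy=lambda h: False)
print(len(done), len(failed))  # 0 10
```

A bad change stops after the first batch of 10 hosts instead of reaching all 100.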
Assume everything is broken
- Prefer async
- Better to retry
- Idempotence
- Easier to replace than repair
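Retrying is only safe when the operation is idempotent. A minimal illustration using a hypothetical config file modelled as a list of lines: the first function describes a desired state and can be re-run freely, the second changes the result on every retry.

```python
def ensure_line(lines, line):
    """Idempotent: states the desired end result, so re-running is harmless."""
    if line not in lines:
        lines.append(line)
    return lines


def append_line(lines, line):
    """Not idempotent: every retry adds another copy."""
    lines.append(line)
    return lines


safe, unsafe = [], []
for _ in range(3):  # simulate the same step being retried three times
    ensure_line(safe, "max_connections=100")
    append_line(unsafe, "max_connections=100")

print(len(safe), len(unsafe))  # 1 3
```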
Monitoring
What do you do when you have more servers than pixels?
Facebook Claspin
Graphs & Metrics
- Most deviant/Top
- Histograms/Percentiles
- Outliers
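A sketch of the idea behind these views, assuming one latency value per host. The nearest-rank percentile and the "distance from the median" ranking are simple choices made here for illustration, and the host names and numbers are made up:

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


def most_deviant(metric_by_host, top=5):
    """Hosts whose metric is farthest from the cluster median."""
    median = statistics.median(metric_by_host.values())
    return sorted(metric_by_host,
                  key=lambda host: abs(metric_by_host[host] - median),
                  reverse=True)[:top]


# Hypothetical cluster: 1000 hosts at 20 ms, two of them misbehaving.
latency_ms = {f"host{i:04d}": 20.0 for i in range(1000)}
latency_ms["host0007"] = 900.0
latency_ms["host0420"] = 350.0

values = list(latency_ms.values())
print(statistics.mean(values))       # the average barely moves
print(percentile(values, 50))        # the median hides the problem entirely
print(percentile(values, 100))       # the maximum shows it
print(most_deviant(latency_ms, 2))   # ['host0007', 'host0420']
```

Showing only the top few deviants keeps the graph readable no matter how many hosts there are.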
Alerting
- Individual faults don't matter!
- Individual capacity doesn't matter!
- Only aggregates matter
- No CPU/disk/RAM alerts
- Turn it ALL OFF
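A sketch of an aggregate alert rule in this spirit: page someone only when the healthy fraction of the cluster drops below what is needed to serve traffic. The 90% threshold and the function name are hypothetical:

```python
def should_alert(healthy, total, min_healthy_fraction=0.9):
    """Alert on the aggregate only, never on an individual node."""
    if total == 0:
        return True  # no nodes reporting at all is itself an emergency
    return healthy / total < min_healthy_fraction


# Hypothetical 10,000-node cluster.
print(should_alert(healthy=9970, total=10000))  # False: 30 dead nodes is a normal day
print(should_alert(healthy=8500, total=10000))  # True: capacity is at risk
```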
Questions?