Practicalities of Managing Large Clusters

Avishai Ish-Shalom (@nukemberg)

Size Matters

Things don't scale linearly

at some point, we get a...

Phase Change

Traditional IT

  • 1:10 admin/server
  • 1:1000 admin/user
  • 2 machines/cluster
  • Centralized network
  • A/B power grid
  • Redundant infrastructure

Large Scale IT

  • 1:10000 admin/server
  • 1:10M admin/user
  • 10000 machines/cluster
  • Mesh network
  • Single power feed, multiple grids
  • Non-redundant, compartmentalized infra 

Broken is the new normal

  • 0% malfunctions is infeasible
  • Some (individual) failures
  • The system is OK
  • We care about capacity/latency
  • Designed for fault tolerance

Capacity/Latency tradeoffs

  • Statistics, statistics, statistics
  • >80% capacity used -> high latency
  • Load balancing/scheduling is hard
  • Multiple dimensions
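
The 80% rule of thumb can be illustrated with the simplest queueing model. The sketch below assumes an M/M/1 queue (a single server, random arrivals, exponentially distributed service times). That model is an assumption chosen for illustration, not something the slides specify; the function name and the 10 ms service time are made up for the example.

```python
def mm1_mean_latency(service_time_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical 10 ms service time, increasing utilization.
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        latency = mm1_mean_latency(10.0, rho)
        print(f"{rho:.0%} utilized -> {latency:7.1f} ms mean latency")
```

Under this model, mean latency is 5x the service time at 80% utilization and 10x at 90%, so the last few percent of capacity are by far the most expensive.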

E.g., you can have spare capacity in one dimension and be out of capacity in another, simultaneously!
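
A minimal sketch of how that can happen, assuming capacity has just two dimensions (CPU and RAM) and a job must fit on a single node. The Node type, the numbers, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    free_cpu: int      # free cores on this node
    free_ram_gb: int   # free memory on this node


def fits(node: Node, cpu: int, ram_gb: int) -> bool:
    """A job fits only if ONE node has enough of EVERY resource."""
    return node.free_cpu >= cpu and node.free_ram_gb >= ram_gb


def cluster_can_place(nodes: List[Node], cpu: int, ram_gb: int) -> bool:
    return any(fits(n, cpu, ram_gb) for n in nodes)


nodes = [Node(free_cpu=16, free_ram_gb=2), Node(free_cpu=1, free_ram_gb=64)]
total_cpu = sum(n.free_cpu for n in nodes)
total_ram = sum(n.free_ram_gb for n in nodes)

# In aggregate there is plenty of room for a 4-core / 8 GB job,
# yet no single node can actually host it.
print(total_cpu, total_ram, cluster_can_place(nodes, cpu=4, ram_gb=8))
```

The aggregate numbers say the cluster is mostly idle, while the scheduler correctly reports that the job cannot be placed.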

The Straggler pattern
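
The slide gives only the title, so the following is one common reading of the pattern, stated as an assumption: a request that fans out to many servers is only as fast as its slowest server. If each server is independently slow with probability p_slow (a simplification), the chance that a request hits at least one straggler grows quickly with fan-out. The function name and the 1% figure are illustrative.

```python
def p_request_hits_straggler(p_slow: float, fanout: int) -> float:
    """P(at least one of `fanout` independent servers is slow)."""
    return 1.0 - (1.0 - p_slow) ** fanout


if __name__ == "__main__":
    # Assume 1% of servers are slow at any given moment.
    for fanout in (1, 10, 100, 1000):
        p = p_request_hits_straggler(0.01, fanout)
        print(f"fan-out {fanout:5d}: {p:.1%} of requests wait on a straggler")
```

With 1% slow servers and a fan-out of 100, roughly 63% of requests are held up by at least one straggler, which is why tail latency rather than average latency dominates at scale.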

Floods, Storms and other weather effects

Ever tried to turn on 5000 machines at once?

Ever tried to reboot a datacenter?

Storms
(aka feedback loops)

  • Mirroring storms
  • Retry storms
  • Resync storms
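
A retry storm happens when many clients retry a failing service in lockstep and keep it down. One widely used mitigation, which the slides do not name and is offered here as an assumption, is capped exponential backoff with random jitter. The function name and default values below are hypothetical.

```python
import random
import time


def call_with_retries(op, attempts=5, base_s=0.1, cap_s=30.0, sleep=time.sleep):
    """Call op(), retrying failures with capped exponential backoff and jitter.

    The delay before retry n is drawn uniformly from
    [0, min(cap_s, base_s * 2**n)], so clients that failed together
    do not all come back at the same instant.
    """
    for attempt in range(attempts):
        try:
            return op()
        except Exception:  # real code should catch only retryable errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(0.0, min(cap_s, base_s * 2 ** attempt)))
```

The sleep parameter is injectable only so the sketch can be exercised without actually waiting.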

Maintenance Overhead

Minimize Recovery Paths

  • Automated provisioning
  • Re-provision first
  • Immutable servers
  • Amputate
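
The bullets above can be read as a remediation policy with very few branches. A minimal sketch, assuming that a failed node is always rebuilt first and is removed from the cluster after repeated failures; the function name, the action strings, and the limit of two attempts are hypothetical.

```python
def next_action(failed_reprovisions: int, max_reprovisions: int = 2) -> str:
    """Pick the single next step for an unhealthy node.

    There is deliberately no "log in and debug by hand" branch:
    every broken node follows the same, well-exercised path.
    """
    if failed_reprovisions < max_reprovisions:
        return "reprovision"   # wipe and rebuild from the standard image
    return "amputate"          # take it out of the cluster for good


if __name__ == "__main__":
    for failures in range(4):
        print(failures, next_action(failures))
```

Because provisioning is the same path used to build healthy machines, it is tested constantly, unlike a rarely used repair procedure.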

I, Robot

  • Automate everything
  • Throttling
  • Locking
  • Monitoring integration
  • Tools instead of Docs

Assume everything is broken

  • Prefer async
  • Better to retry
  • Idempotence
  • Easier to replace than repair
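
Retrying is only safe when operations are idempotent. A minimal sketch of one way to get there, assuming the caller attaches a unique request ID to each logical operation; ProvisioningApi is a toy in-memory stand-in for a hypothetical remote service, not a real API.

```python
class ProvisioningApi:
    """Toy stand-in for a remote service that deduplicates by request ID."""

    def __init__(self):
        self._results = {}      # request_id -> vm_id already handed out
        self.vms_created = 0

    def create_vm(self, request_id: str, name: str) -> str:
        # A replayed request returns the earlier result instead of
        # creating a second machine.
        if request_id in self._results:
            return self._results[request_id]
        self.vms_created += 1
        vm_id = f"vm-{self.vms_created}"
        self._results[request_id] = vm_id
        return vm_id


api = ProvisioningApi()
first = api.create_vm("req-1", "web")
retry = api.create_vm("req-1", "web")   # e.g. the first reply was lost
print(first == retry, api.vms_created)
```

The client never needs to know whether the first attempt succeeded; it simply sends the same request again.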

Monitoring

What do you do when you have more servers than pixels?

Facebook Claspin

Graphs & Metrics

  • Most deviant/Top
  • Histograms/Percentiles
  • Outliers
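
With thousands of hosts, one line per host is unreadable, so the graphs show summaries instead. A small sketch of two such summaries, assuming a nearest-rank percentile and "distance from the fleet median" as the deviance measure; both choices and all names are illustrative.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def most_deviant(metric_by_host, top=3):
    """Hosts whose metric is farthest from the fleet median."""
    median = statistics.median(metric_by_host.values())
    return sorted(
        metric_by_host,
        key=lambda host: abs(metric_by_host[host] - median),
        reverse=True,
    )[:top]


latency_ms = {"a": 10, "b": 11, "c": 9, "d": 50, "e": 10}
print(percentile(latency_ms.values(), 50), most_deviant(latency_ms, top=1))
```

Plotting a handful of percentiles plus the few most deviant hosts keeps the graph readable no matter how large the fleet grows.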

Alerting

  • Individual faults don't matter!
  • Individual capacity doesn't matter!
  • Only aggregates matter
  • No CPU/disk/RAM alerts
  • Turn it ALL OFF
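
A minimal sketch of an aggregate-only alert, assuming the service needs some fraction of the fleet to be healthy to carry its load. The 90% threshold and the function name are hypothetical.

```python
def should_page(healthy: int, total: int, required_fraction: float = 0.9) -> bool:
    """Page a human only when aggregate healthy capacity is too low."""
    if total <= 0:
        return True  # no fleet at all is definitely worth a page
    return healthy / total < required_fraction


if __name__ == "__main__":
    # 300 broken machines out of 10000: business as usual.
    print(should_page(healthy=9700, total=10000))
    # 1500 broken machines: the service as a whole is at risk.
    print(should_page(healthy=8500, total=10000))
```

No individual machine appears in the rule, so a single dead disk or a pegged CPU never wakes anyone up.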

Questions?
