scaling

Cleo

It's time to discuss the elephant...

Pop Quiz

What do these four fictional stories have in common?

Pop Quiz: Answer

They are all examples of people misunderstanding scale!

Spiderman

Explanation:

  • Ratio of surface area to mass too low
  • Would need enormous hands, 1m wide
  • Would rip his skin apart

Wall crawling works at spider scale, not human scale

King Kong

Explanation:

  • Bones would not support weight
  • Would struggle to dissipate heat
  • Heart could not produce enough pressure to circulate blood around his body

Gorillas work at gorilla scale, not King Kong scale

The Borrowers

Explanation:

  • Would lose a lot of body heat and water
  • Struggle to breathe
  • Would be practically blind

Humans work at human scale, not at borrower (5cm) scale

Up

Explanation:

  • Would require an impossible number of balloons
  • Would take too long to fill those balloons
  • House would rip apart from the forces

Lifting things with balloons works at toy scale, not at house scale

Solving problems with Lego!

The point?

Solutions are scale-specific

  • Solutions to problems are only useful at certain scales
  • When we reach a limiting factor, the solution has to change

The point?

We really struggle to understand scale!

  • Specifically, we struggle to understand non-linear scaling
  • When we change one dimension of a system, we often see disproportionate changes as a result
  • Non-linear scaling means that limiting factors can take us by surprise
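The square-cube law behind the Spiderman, King Kong and Borrowers examples is one concrete case of non-linear scaling. A minimal sketch, assuming every length of a body is multiplied by the same factor k (the function name `scale_up` is made up for illustration):

```python
# Square-cube law: scale every length by k and area grows with k**2,
# while volume (and therefore mass) grows with k**3.

def scale_up(k: float) -> dict[str, float]:
    """Return growth factors when every length is multiplied by k."""
    area = k ** 2  # skin, bone cross-section, adhesive pad area
    mass = k ** 3  # volume, and therefore weight
    return {"length": k, "area": area, "mass": mass, "area_per_mass": area / mass}


for k in (1, 10, 100):
    print(k, scale_up(k))
```

At k = 10 the mass is 1,000 times larger but the area is only 100 times larger, so each unit of area has to carry ten times the load.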

Cleo is scaling!

  • Scaling in our data
  • Scaling in our code
  • Scaling in our organisation

Challenge: Database capacity

Join performance suddenly degrades!

Join cost grows superlinearly once the join crosses “in-memory” limits

Example scenario:

  • A join that took 200ms at 5M rows suddenly takes 20,000ms at 50M rows
  • 100x slower for a 10x increase in scale
  • Working memory spills to disk and the join becomes I/O-bound
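One way to put a number on "100x slower for 10x the rows" is the effective scaling exponent between the two measurements. This is only a rough sketch using the figures from the scenario above; the real behaviour is closer to a step at the memory limit than a smooth power law, and `scaling_exponent` is a name invented here.

```python
import math


def scaling_exponent(n1: float, t1: float, n2: float, t2: float) -> float:
    """Exponent p such that time grows like n**p between two measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Figures from the example scenario: 200ms at 5M rows, 20,000ms at 50M rows.
p = scaling_exponent(5_000_000, 200, 50_000_000, 20_000)
print(f"effective exponent: {p:.2f}")  # 1.00 would be linear scaling
```

An exponent of about 2 means that, across this range, the join behaved as if its cost grew with the square of the row count.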

Change in query plan causes sudden latency spike

A previously selective filter becomes common. Updated statistics tell the planner that the index is no longer useful, so it switches to a sequential scan.

Example scenario:

  • Cardinality estimate increases sharply after ANALYZE
  • Planner stops using the index
  • Latency jumps from ~10ms to ~2,000ms
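The plan flip can be illustrated with a deliberately simplified cost comparison. This is a toy model, not PostgreSQL's actual cost formula (which also accounts for CPU costs, correlation and bitmap scans). The two page-cost constants are PostgreSQL's default settings; the table size and selectivities are hypothetical.

```python
SEQ_PAGE_COST = 1.0     # PostgreSQL default for seq_page_cost
RANDOM_PAGE_COST = 4.0  # PostgreSQL default for random_page_cost


def choose_plan(table_pages: int, table_rows: int, selectivity: float) -> str:
    """Toy planner: one random page read per matching row vs. reading every page in order."""
    index_cost = selectivity * table_rows * RANDOM_PAGE_COST
    seq_cost = table_pages * SEQ_PAGE_COST
    return "index scan" if index_cost < seq_cost else "sequential scan"


# Hypothetical table: 1M rows stored in 10,000 pages.
print(choose_plan(10_000, 1_000_000, 0.0001))  # filter matches ~100 rows
print(choose_plan(10_000, 1_000_000, 0.05))    # filter matches ~50,000 rows
```

The decision is a threshold: nothing changes while the estimate creeps up, then the plan switches all at once.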

Index scans suddenly get much slower when the index no longer fits in cache

Index lookups that were fast in memory become dominated by random disk I/O once the index grows beyond what PostgreSQL and the OS can keep cached.

  • At 3M rows, a 200MB index fits comfortably in memory; a query with thousands of lookups runs in ~5–10ms
  • As the table grows to 40M rows, the index reaches ~2GB and no longer fits in cache
  • Cache hit rate collapses; many index page reads now require random disk I/O
  • The same query’s latency jumps from ~10ms to multiple seconds despite no code or schema changes
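A back-of-the-envelope model shows why latency falls off a cliff rather than degrading gently. Every number below is an assumption chosen for illustration (cache size, per-lookup costs, uniformly random access), not a measurement from our database.

```python
def query_latency_ms(index_mb: float, cache_mb: float, lookups: int,
                     hit_ms: float = 0.001, miss_ms: float = 1.0) -> float:
    """Estimate latency when some index page reads miss the cache.

    hit_ms and miss_ms are assumed per-lookup costs, and lookups are
    assumed to be spread uniformly over the index.
    """
    hit_rate = min(1.0, cache_mb / index_mb)
    per_lookup = hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms
    return lookups * per_lookup


# Hypothetical 1GB cache and a query doing 5,000 index lookups.
print(query_latency_ms(index_mb=200, cache_mb=1024, lookups=5000))   # index fits
print(query_latency_ms(index_mb=2048, cache_mb=1024, lookups=5000))  # index is 2x the cache
```

While the index fits, growth costs nothing. Once it stops fitting, each miss is around a thousand times more expensive than a hit, so even a modest miss rate dominates the total.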

Challenge: Code complexity

  • Code metrics (LoC etc.) tend to scale linearly
  • Linear growth in objects creates super-linear growth in possible behaviours
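To make the super-linear growth concrete, here is a small sketch that counts two things for n objects: the number of pairs that could interact, and the number of combined states if each object carries a single on/off flag. Both are simplifying assumptions; real systems have richer interactions than these counts capture.

```python
from math import comb


def possible_behaviours(n: int) -> dict[str, int]:
    """Count pairwise interactions and combined on/off states for n objects."""
    return {
        "objects": n,
        "pairs": comb(n, 2),   # n * (n - 1) / 2 possible pairwise interactions
        "flag_states": 2 ** n,  # every combination of n boolean flags
    }


for n in (10, 20, 40):
    print(possible_behaviours(n))
```

Doubling the number of objects roughly quadruples the pairs and squares the number of flag states.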

Challenge: Organisational complexity

How are we going to solve these problems of scale?

Learn from nature

Why nature?

Earth's combined genome holds about 10^30 bits of information

Equivalent to the amount of RAM Google Chrome uses. (joke 🥁)

More information than human beings have ever created, ever.

Nature knows how to scale

Bacteria => Whale

  • Smallest living organism
    • Mycoplasma genitalium - 10^-16
  • Largest living creature
    • Blue whale - 10^8
  • 24 orders of magnitude
  • Greater than the difference between Earth and the Milky Way
  • Nature knows how to scale!

Nature knows how to scale

How does nature scale?

  • Nature generalises
  • Nature specialises

How nature generalises

  • Repeats same patterns everywhere
  • Encapsulates, fanatically
  • Communicates through interfaces

How nature specialises

  • Organises particular capabilities
  • Comes up with new solutions to limiting factors

Where we will generalise

If one part of the system works differently from all the rest, that part will require additional effort to control

  • Apply general rules across all domains
    • Encapsulation of implementation
      • Hide implementation, expose intent
    • Encapsulation of state / data
      • One owner per piece of state
    • A single interface for synchronous commands
    • A single interface for asynchronous event publishing
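The rules above can be sketched in a few lines. This is an illustration only, not the Subscriptions proof of concept: the class names, command names and event names are all hypothetical, and the in-process `EventBus` delivers events immediately, standing in for whatever asynchronous broker or queue a real system would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    name: str
    payload: dict


class EventBus:
    """Single interface for publishing events (in-process stand-in for a broker)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[Event], None]]] = {}

    def subscribe(self, name: str, handler: Callable[[Event], None]) -> None:
        self._subscribers.setdefault(name, []).append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._subscribers.get(event.name, []):
            handler(event)


class Subscriptions:
    """Hypothetical module: the only owner of subscription state."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus
        self._active: set[str] = set()  # private; no other module touches this

    def handle(self, command: str, user_id: str) -> None:
        """Single entry point for synchronous commands."""
        if command == "subscribe":
            self._active.add(user_id)
            self._bus.publish(Event("subscription.started", {"user_id": user_id}))
        elif command == "cancel":
            self._active.discard(user_id)
            self._bus.publish(Event("subscription.cancelled", {"user_id": user_id}))
        else:
            raise ValueError(f"unknown command: {command}")


# Another module reacts to events without ever reading Subscriptions' state.
bus = EventBus()
seen: list[Event] = []
bus.subscribe("subscription.started", seen.append)
Subscriptions(bus).handle("subscribe", "user-1")
print([event.name for event in seen])
```

Callers express intent ("subscribe", "cancel") and learn about outcomes through events; how the state is stored stays hidden inside the module.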

Where we will specialise

  • Group common functions together into modules
    • Specialising around domain expertise
    • Specialising around capabilities
    • Specialising around tooling
      • DB (doesn't have to be Postgres)
      • Language (doesn't have to be Ruby)

What do we mean by "Capability"?

  • Technology gives us the ability to do something
  • Most technology doesn't give us a new ability; it allows us to do something we could already do, more effectively

Transportation example

We've always had the ability to move.

  • By foot
  • By donkey
  • By wagon
  • By car
  • By train
  • By plane

These are all examples of technology making us more effective at the same capability: transportation.

By specialising by capability

  • We focus domain expertise in one place
  • We become future-proof, because the "how" is an implementation detail
  • We achieve better organisation
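Using the transportation example, a capability can be written down as an interface, with each technology as an interchangeable implementation. A minimal sketch; the `Transport` protocol, the class names and the speeds are all made up for illustration.

```python
from typing import Protocol


class Transport(Protocol):
    """The capability: getting from A to B. The 'how' is left to implementations."""

    def travel_hours(self, km: float) -> float: ...


class OnFoot:
    def travel_hours(self, km: float) -> float:
        return km / 5.0  # assumed walking speed of 5 km/h


class ByTrain:
    def travel_hours(self, km: float) -> float:
        return km / 100.0  # assumed average speed of 100 km/h


def plan_trip(transport: Transport, km: float) -> str:
    """Callers depend on the capability only, never on a specific technology."""
    return f"{km:.0f} km takes {transport.travel_hours(km):.1f} h"


print(plan_trip(OnFoot(), 50))   # 50 km takes 10.0 h
print(plan_trip(ByTrain(), 50))  # 50 km takes 0.5 h
```

Swapping the donkey for the train changes nothing in `plan_trip`, which is the sense in which the "how" becomes an implementation detail.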

Work so far

We've established a Proof of Concept with Subscriptions

Scale

By Gavin Morrice