scaling
Cleo
It's time to discuss the elephant...
Pop Quiz
What do these four fictional stories have in common?




Pop Quiz: Answer
They are all examples of people misunderstanding scale!
Spiderman
Explanation:
- Ratio of surface area to mass too low
- Would need enormous hands, 1m wide
- Would rip his skin apart
Wall crawling works at spider scale, not human scale

King Kong
Explanation:
- Bones would not support weight
- Would struggle to dissipate heat
- Heart could not produce enough pressure to circulate blood around his body
Gorillas work at gorilla scale, not King Kong scale
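
Both the Spiderman and King Kong examples follow from the square-cube law. A minimal sketch, assuming every length of a creature is multiplied by the same factor k and density stays constant (the function name and the factor of 10 are illustrative, not from the original slides):

```python
def scale_uniformly(k: float) -> dict[str, float]:
    """Relative change in key quantities when every length is multiplied by k."""
    area = k ** 2   # skin surface, bone cross-section, grip contact area
    mass = k ** 3   # volume, and so mass at constant density
    return {
        "length": k,
        "area": area,
        "mass": mass,
        "area_per_mass": area / mass,       # surface available per unit of mass shrinks
        "load_per_bone_area": mass / area,  # weight carried per unit of bone area grows
    }


if __name__ == "__main__":
    for name, value in scale_uniformly(10).items():
        print(f"{name}: x{value:g}")
```

At 10x the height, mass grows 1,000x while bone cross-section grows only 100x, so each unit of bone carries 10x the load.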

The Borrowers
Explanation:
- Would lose a lot of body heat and water
- Struggle to breathe
- Would be practically blind
Humans work at human scale, not at borrower (5cm) scale

Up
Explanation:
- Would require an impossible number of balloons
- Would take too long to fill those balloons
- House would rip apart from the forces
Lifting things with balloons works at toy scale, not at house scale

Solving problems with Lego!





The point?
Solutions are scale-specific
- Solutions to problems are only useful at a certain scale
- When we reach a limiting factor, the solution has to change
The point?
We really struggle to understand scale!
- Specifically, we struggle to understand non-linear scaling
- When we change one dimension of a system, we often see disproportionate changes as a result
- Non-linear scaling can mean that limiting factors can surprise us
Cleo is scaling!
- Scaling in our data
- Scaling in our code
- Scaling in our organisation
Challenge: Database capacity
Join performance suddenly degrades!
Join performance becomes superlinear when it crosses “in-memory” limits
Example scenario:
- A join that took 200ms at 5M rows suddenly takes 20,000ms (100x) at 50M rows
- 100x slower for a 10x increase in scale
- Working memory spills into I/O
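
The cliff in the scenario above can be sketched as a toy cost model. The per-row costs are derived from the two data points on the slide; the 10M-row memory limit is an assumption made purely for illustration. In a real PostgreSQL system the threshold depends on settings such as work_mem and on row width.

```python
IN_MEMORY_ROW_LIMIT = 10_000_000  # assumed: rows that fit in working memory
IN_MEMORY_NS_PER_ROW = 40         # 200ms / 5M rows, from the scenario
SPILLED_NS_PER_ROW = 400          # 20,000ms / 50M rows, from the scenario


def join_latency_ms(rows: int) -> float:
    """Toy estimate: cost per row jumps once the join no longer fits in memory."""
    if rows <= IN_MEMORY_ROW_LIMIT:
        ns_per_row = IN_MEMORY_NS_PER_ROW
    else:
        ns_per_row = SPILLED_NS_PER_ROW  # working memory has spilled into I/O
    return rows * ns_per_row / 1_000_000


if __name__ == "__main__":
    for rows in (5_000_000, 50_000_000):
        print(f"{rows:>11,} rows: {join_latency_ms(rows):,.0f} ms")
```

The model is linear on either side of the limit; the surprise comes entirely from crossing it.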
Change in query plan causes sudden latency spike
A previously selective filter becomes common, updated stats tell the planner the index is no longer useful, and it switches to a sequential scan.
Example scenario:
- Cardinality estimate increases sharply after ANALYZE
- Planner stops using the index
- Latency jumps from ~10ms to ~2,000ms
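
Why a planner flips from one plan to another can be sketched with a heavily simplified cost comparison. This is not PostgreSQL's actual cost model; the constants and the function name are assumptions chosen to show the crossover.

```python
SEQ_PAGE_COST = 1.0     # assumed relative cost of reading one page sequentially
RANDOM_PAGE_COST = 4.0  # assumed relative cost of one random page read
ROWS_PER_PAGE = 100     # assumed


def choose_plan(total_rows: int, selectivity: float) -> str:
    """Pick the cheaper plan given the estimated fraction of rows that match."""
    seq_cost = (total_rows / ROWS_PER_PAGE) * SEQ_PAGE_COST
    # Pessimistic assumption: one random page read per matching row.
    index_cost = (total_rows * selectivity) * RANDOM_PAGE_COST
    return "index scan" if index_cost < seq_cost else "sequential scan"


if __name__ == "__main__":
    for selectivity in (0.001, 0.05):
        print(f"selectivity {selectivity}: {choose_plan(1_000_000, selectivity)}")
```

Once the estimated selectivity passes the crossover point, the plan changes in a single step, which is why the latency change is sudden rather than gradual.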
Index scans suddenly get much slower when the index no longer fits in cache
Index lookups that were fast in memory become dominated by random disk I/O once the index grows beyond what PostgreSQL and the OS can keep cached.
- At 3M rows, a 200MB index fits comfortably in memory; a query with thousands of lookups runs in ~5–10ms
- As the table grows to 40M rows, the index reaches ~2GB and no longer fits in cache
- Cache hit rate collapses; many index page reads now require random disk I/O
- The same query’s latency jumps from ~10ms to multiple seconds despite no code or schema changes
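
A rough sketch of the arithmetic behind this collapse, assuming a 1GB cache, 5,000 lookups per query, uniform access to index pages, and the read costs below. All of these figures are illustrative assumptions, not measurements.

```python
MEMORY_READ_MS = 0.001  # assumed cost of reading a cached index page
DISK_READ_MS = 1.0      # assumed cost of one random disk read


def query_latency_ms(lookups: int, index_mb: float, cache_mb: float) -> float:
    """Expected latency when only part of the index is cached."""
    hit_rate = min(1.0, cache_mb / index_mb)  # crude: every page equally likely
    per_lookup = hit_rate * MEMORY_READ_MS + (1 - hit_rate) * DISK_READ_MS
    return lookups * per_lookup


if __name__ == "__main__":
    for index_mb in (200, 2048):
        latency = query_latency_ms(lookups=5000, index_mb=index_mb, cache_mb=1024)
        print(f"{index_mb:>5} MB index: {latency:,.1f} ms")
```

Because a disk read costs orders of magnitude more than a memory read, even a modest drop in hit rate dominates the total.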
Challenge: Code complexity
- Code metrics (LoC, etc.) tend to scale linearly
- Linear growth in objects creates super-linear growth in possible behaviours
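
A small sketch of that super-linear growth, assuming any two objects may interact and each object has just two states (both simplifying assumptions; the function names are illustrative):

```python
from math import comb


def pairwise_interactions(n: int) -> int:
    """Number of distinct pairs among n objects: n * (n - 1) / 2."""
    return comb(n, 2)


def on_off_states(n: int) -> int:
    """Number of combined states if each of n objects is either on or off."""
    return 2 ** n


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(f"{n} objects: {pairwise_interactions(n)} pairs, {on_off_states(n)} states")
```

Doubling the number of objects roughly quadruples the pairs and squares the number of combined states.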
Challenge: Organisational complexity
How are we going to solve these problems of scale?
Learn from nature
Why nature?
The genome on Earth holds about 10^30 bits of information
Equivalent to the amount of RAM Google Chrome uses. (joke 🥁)
More information than human beings have ever created, ever.
Nature knows how to scale
Bacteria => Whale
- Smallest living organism
- Mycoplasma genitalium - 10^-16
- Largest living creature
- Blue whale - 10^8
- 24 orders of magnitude
- Greater than the difference between Earth and the Milky Way
- Nature knows how to scale!
Nature knows how to scale
How does nature scale?
- Nature generalises
- Nature specialises
How nature generalises
- Repeats same patterns everywhere
- Encapsulates, fanatically
- Communicates through interfaces
How nature specialises
- Organises particular capabilities
- Comes up with new solutions to limiting factors
Where we will generalise
If one part of the system works differently from all the rest, that part will require additional effort to control
- Apply general rules across all domains
- Encapsulation of implementation
- Hide implementation, expose intent
- Encapsulation of state / data
- One owner per piece of state
- A single interface for synchronous commands
- A single interface for asynchronous event publishing
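
The rules above can be sketched as follows. This is a hypothetical illustration, not Cleo's implementation: the class names, the command, and the event name are all assumptions, and events are delivered in-process here for brevity where a real system would publish them asynchronously.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    name: str
    payload: dict


class EventBus:
    """The single interface for publishing events (in-process stand-in)."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._subscribers:
            handler(event)


class Subscriptions:
    """Hypothetical module: the one owner of subscription state."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus
        self._active: set[str] = set()  # private: no other module touches this

    def start_subscription(self, user_id: str) -> bool:
        """Synchronous command: expresses intent, hides how it is carried out."""
        if user_id in self._active:
            return False
        self._active.add(user_id)
        self._bus.publish(Event("subscription_started", {"user_id": user_id}))
        return True
```

Other modules can call the command or react to the event, but they never read or write the module's state directly.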
Where we will specialise
- Group common functions together into modules
- Specialise in domain expertise
- Specialising around capabilities
- Specialising around tooling
- DB (doesn't have to be Postgres)
- Language (doesn't have to be Ruby)
What do we mean by "Capability"?
- Technology gives us the ability to do something
- Most technology doesn't give us a new ability, it allows us to do something more effectively
Transportation example
We've always had the ability to move.
- By foot
- By donkey
- By wagon
- By car
- By train
- By plane
These are all examples of technology making us more effective at the same capability: transportation.
By specialising by capability
- We focus domain expertise in one place
- We become future-proof, because the how is an implementation detail
- We achieve better organisation
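
Using the transportation example, a minimal sketch of "the how is an implementation detail". The interface name, the classes, and the speeds are illustrative assumptions.

```python
from typing import Protocol


class Transport(Protocol):
    """The capability: get from A to B. How is left to the implementation."""

    def travel_time_hours(self, distance_km: float) -> float: ...


class OnFoot:
    def travel_time_hours(self, distance_km: float) -> float:
        return distance_km / 5.0    # assumed walking speed in km/h


class ByTrain:
    def travel_time_hours(self, distance_km: float) -> float:
        return distance_km / 100.0  # assumed train speed in km/h


def plan_trip(transport: Transport, distance_km: float) -> str:
    """Callers depend on the capability only, never on a specific technology."""
    hours = transport.travel_time_hours(distance_km)
    return f"{distance_km:g} km in {hours:g} h"


if __name__ == "__main__":
    print(plan_trip(OnFoot(), 100))
    print(plan_trip(ByTrain(), 100))
```

Swapping the technology changes how effective the capability is, while nothing that depends on it has to change.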
Work so far
We've established a Proof of Concept with Subscriptions
Scale
By Gavin Morrice