Database Storage
Data Layouts, Indexes, and More
Outline
- DB 101
- DB Classifications
- In-Memory vs On-Disk
- OLTP vs OLAP
- Data Storage
- SSDs vs HDDs
- Data Layouts
- Indexes and Compression
- General Theory
- Hash Tables
- B-Tree Family
- LSM Trees
Databases 101
Data Management
- Safety
- Auditablity
- Access control
- Knowledge sharing
Types of Queries
- Point Queries
SELECT *
FROM tbl
WHERE x = 42;
SELECT *
FROM tbl
WHERE x > 42
AND x < 99;
SELECT
x, y, z,
COUNT(*)
FROM tbl
GROUP BY x, y, z
-
Range Queries
- >, <, >=, <=
-
Aggregations
- GROUP BY x, y, z
- MIN, MAX, AVG, COUNT, SUM, AVG
Read/Write Usage
- Volume
- Queries per Second (QPS)
Write Strategies
- Mutable (In-Place Updates)
- Immutable (Append-Only / Log)
Database Classifications
In-Memory vs On-Disk
- Bullet One
- Bullet Two
- Bullet Three
In-Memory vs On-Disk
- Bullet One
- Bullet Two
- Bullet Three



OLTP
- Transactions
- Point and Range Queries
- In-place Updates
- High Read Rate
- > 10k TPS
- Medium Storage Needs
- 10s of TBs
See more at https://dbdb.io/browse?tag=oltp
OLTP
See more at https://dbdb.io/browse?tag=oltp






- Transactions
- Point and Range Queries
- In-place Updates
- High Read Rate
- > 10k TPS
- Medium Storage Needs
- 10s of TBs
OLAP
See more at https://dbdb.io/browse?tag=olap
- Aggregations (full table scans)
- Append-only storage
- Write-Optimized, Low Read Rate
- 100s Read QPS
- Large Storage Needs
- 100s of PBs
OLAP

BigQuery
Redshift
See more at https://dbdb.io/browse?tag=olap

Hive
Snowflake

ClickHouse
DuckDB



Impala
Pinot
- Aggregations (full table scans)
- Append-only storage
- Write-Optimized, Low Read Rate
- 100s Read QPS
- Large Storage Needs
- 100s of PBs





Database Architecture
DBMS Architecture

- Transport Subsystem
- Query Processor
- Execution Engine
- Storage Engine
Image from Database Internals (1 - Introduction and Overview/DBMS Architecture)
Transport

- Transport Subsystem
- Handling client requests
- Communication with nodes across cluster
Image from Database Internals (1 - Introduction and Overview/DBMS Architecture)
Query Processing

- Query Processor
- Parses, interprets, validates, and optimizes queries
Image from Database Internals (1 - Introduction and Overview/DBMS Architecture)
Execution

- Execution Engine
- Runs execution plan
- Collects the results of the execution of local and remote operations
Image from Database Internals (1 - Introduction and Overview/DBMS Architecture)
Storage

- Storage Engine
- dasfajsdnlasd
Image from Database Internals (1 - Introduction and Overview/DBMS Architecture)
Data Storage
Storage Hierarchy
Slides copied from Database Storage Part 1 - @CMUDatabaseGroup
CPU Registers
CPU Cache
DRAM
SSD
HDD
Network Storage
Storage Hierarchy
Slides copied from Database Storage Part 1 - @CMUDatabaseGroup
CPU Registers
CPU Cache
DRAM
SSD
HDD
Network Storage
Disk
Memory
CPU
{
{
{
Storage Hierarchy
Slides copied from Database Storage Part 1 - @CMUDatabaseGroup
CPU Registers
CPU Cache
DRAM
SSD
HDD
Network Storage
Faster
Smaller
More Expensive
Slower
Larger
Cheaper
Interlude: SSDs vs HDDs
-
Hard Disk Drive
- Slow Random Access (Spinning Disk w/ Mechanical Arm)
- Longevity (Magnetic Storage)

Interlude: SSDs vs HDDs
-
Solid State Drive
-
No Moving Parts
- 100x faster random access
- Parallel Accesses
-
Cons
- Limited Lifetime (Write Endurance + Power Loss)
- Expensive
-
No Moving Parts

SSD Organization Schematics

Image from Database Internals (2 - B-Tree Basics/Disk-Based Structures/Solid State Drives)
SSD Organization Schematics

Image from Database Internals (2 - B-Tree Basics/Disk-Based Structures/Solid State Drives)
Page: 2 - 16 Kb
Block: >= 1 MB
- 64 - 512 Pages
Planes:
Die:
Memory Cell: 1 bit
Strings: 32 or 64 cells
Arrays: Multiple strings
Page: Multiple arrays
SSDs: Flash Translation Layer
- Bullet One
- Bullet Two
- Bullet Three
Interlude: SSDs vs HDDs
Attribute | Solid-State Drive | Hard Disk Drive |
---|---|---|
Price per capacity | 30 cents / GB | 3 cents / GB |
Start-up time | Almost Instantaneous | Several Seconds |
Sequential-access performance | 200 - 3500 MB/sec | 200 - 480 MB/sec |
Random-access performance | < 0.1 ms | 3 - 12 ms |
Parallel Access | Yes | No |
Storage Hierarchy
Slides copied from Database Storage Part 1 - @CMUDatabaseGroup
CPU Registers
CPU Cache
DRAM
SSD
HDD
Network Storage
Faster
Smaller
More Expensive
Slower
Larger
Cheaper
Storage Hierarchy
Slides copied from Database Storage Part 1 - @CMUDatabaseGroup
CPU Registers
CPU Cache
DRAM
SSD
HDD
Network Storage
Faster
Smaller
More Expensive
Slower
Larger
Cheaper
Volatile
Random Access
Byte-Addressable
Non-Volatile
Sequential Access
Block-Addressable
Storage Hierarchy
Slides copied from Database Storage Part 1 - @CMUDatabaseGroup
CPU Registers
CPU Cache
DRAM
SSD
HDD
Network Storage
Volatile
Random Access
Byte-Addressable
Non-Volatile
Sequential Access
Block-Addressable


Tiered Storage
- Data Lake with SSDs + HDDs
- SSDs for new, active data
- HDDs for older, archival data
- Ex. ClickHouse Cloud
- Offers read-through + write-through NVMe SSD caches on top of S3 network storage
- In-Memory Buffer Pools
- Every on-disk store will have this
Disaggregated Storage
- Bullet One
- Bullet Two
- Bullet Three
Access Times
Slides copied from Database Storage Part 1 - @CMUDatabaseGroup
"Latency Numbers Every Programmer Should Know"
1 ns | L1 Cache Ref |
4 ns | L2 Cache Ref |
100 ns | DRAM |
16,000 ns | SSD |
2,000,000 ns | HDD |
~50,000,000 ns | Network Storage |
1,000,000,000 ns | Tape Archives |
Access Times x 1 Million
Slides copied from Database Storage Part 1 - @CMUDatabaseGroup
"Latency Numbers Every Programmer Should Know"
1 ns | L1 Cache Ref | 1 sec |
4 ns | L2 Cache Ref | 4 sec |
100 ns | DRAM | 100 sec |
16,000 ns | SSD | 4.4 hours |
2,000,000 ns | HDD | 3.3 weeks |
~50,000,000 ns | Network Storage | 1.5 years |
1,000,000,000 ns | Tape Archives | 31.7 years |
"Latency Numbers Every Programmer Should Know"

Jeff Dean's Latency Numbers via github:colin-scott/interactive_latencies

1 sec
10 sec
100 sec
3 hours
1 day
1.5 weeks
16.5 weeks
3 years
3 decades
3 centuries
x1 Million
Storage Engine
Buffer (Pool) Manager
- Bullet One
- Bullet Two
- Bullet Three
On-Disk Storage
- Bullet One
- Bullet Two
- Bullet Three
By Data Structure

Data Layout
Row vs Columnar
- Bullet One
- Bullet Two
- Bullet Three
Slotted Pages
- Bullet One
- Bullet Two
- Bullet Three
Log-Structured Storage
- Bullet One
- Bullet Two
- Bullet Three
Columnar File Formats
- Bullet One
- Bullet Two
- Bullet Three
Scaling Databases
Scaling Up (Single Server)
Database | Example | Peak QPS (w/o experts) |
Storage Capacity |
---|---|---|---|
Cache (KV Store) | Valkey / Redis | 150k QPS | ~13 GB (AWS ElasticCache) |
OLTP | Postgres | 50k TPS | 64 TiB (Aurora on RDS) |
OLAP | ClickHouse on EBS | 5k QPS | 64 TiB (Max EBS Volume) |
ClickHouse on S3 |
Unscientific and Totally Made-Up Capacity
Scaling Out
- Read Replicas
- QPS++
- Partitioning
- Storage++
- ClickHouse can handle 100s of PBs
- Requires object storage (S3) instead of block storage
Compression & Indexes
An Intuition
Compression
- Bullet One
- Bullet Two
- Bullet Three
Indexes
- Auxillary Data Structure
- Support queries on a (usually) separate data structure
- Index Flavors
- Ordering
- Grouping
Compressed Index
- Queryable in compressed form
- Ordering helps both compression and indexing
- By surfacing the latent structure of our data
See more from Ben Langmead JHU Burrows-Wheeler Indexing Course
Compact Data Structures 🤝 Databases



Index Types
- Adaptive Radix Tree (ART)
- AVL-Tree
- B-Epsilon Tree
- BitMap
- Block Range Index (BRIN)
- B+Tree
- Bw-Tree
- Hash Table
- Hierarchical Navigable Small World (HNSW)
- Inverted Index (Full Text)
- K-D Tree
- Log-Structured Merge Tree
- MassTree
- Patricia/Radix Trie
- Red-Black Tree
- R-Tree
- Skip List
- T-Tree
Extended Indexes
- Partial Indexes
- Functional Indexes
- Skip Indexes
- Fence Pointers
- Min / Max
- Block Range Indexes
- Bloom Filters
Index Lookups
- Full Table Scan
- (Parallel) Sequential Scan
- (Parallel) Sequential Scan
- Index Scan
- Index-Only Scan
- Bitmap Index Scan
Sequential Scan Optimizations
- Prefetching
- Buffer Pool Bypass
- Parallelization
- Heap Clustering
- Late Materialization
- Data Skipping
- Approximation
- BRIN / Zone Maps
Copied from CMU DB Course - Query Execution Part 1
Compression
- Bullet One
- Bullet Two
- Bullet Three

Hash Indexes
Limitations of Hash Indexes
- No Range Scans
- Only point queries
- No Prefix Lookups
- Need entire key
B-Tree Indexes
Block Range Indexes
BRIN / Zone Maps
- Pre-Compute
- Bullet Two
- Bullet Three
LSM Indexes
Log-Structured Merge Tree
- Always IOT / clustered index????
- Data stored in sorted key order on disk
- Bullet Two
- Bullet Three

Database Storage Internals
By Heneli Kailahi
Database Storage Internals
- 10