Database Storage

Data Layouts, Indexes, and More

Outline

  • DB 101
  • DB Classifications
    • In-Memory vs On-Disk
    • OLTP vs OLAP
  • Data Storage
    • SSDs vs HDDs
  • Data Layouts
  • Indexes and Compression
    • General Theory
    • Hash Tables
    • B-Tree Family
    • LSM Trees

Databases 101

Data Management

  • Safety
  • Auditablity
  • Access control
  • Knowledge sharing

Types of Queries

  • Point Queries
SELECT *
FROM tbl
WHERE x = 42;
SELECT *
FROM tbl
WHERE x > 42 
  AND x < 99;
SELECT
  x, y, z,
  COUNT(*)
FROM tbl
GROUP BY x, y, z
  • Range Queries
    • >, <, >=, <=
  • Aggregations
    • GROUP BY x, y, z
    • MIN, MAX, AVG, COUNT, SUM, AVG

Read/Write Usage

  • Volume
  • Queries per Second (QPS)

Write Strategies

  • Mutable (In-Place Updates)
  • Immutable (Append-Only / Log)

Database Classifications

In-Memory vs On-Disk

  • Bullet One
  • Bullet Two
  • Bullet Three

In-Memory vs On-Disk

  • Bullet One
  • Bullet Two
  • Bullet Three

OLTP

  • Transactions
    • Point and Range Queries
    • In-place Updates
    • High Read Rate
      • > 10k TPS
    • Medium Storage Needs
      • 10s of TBs

OLTP

  • Transactions
    • Point and Range Queries
    • In-place Updates
    • High Read Rate
      • > 10k TPS
    • Medium Storage Needs
      • 10s of TBs

OLAP

  • Aggregations (full table scans)
  • Append-only storage
  • Write-Optimized, Low Read Rate
    • 100s Read QPS
  • Large Storage Needs
    • 100s of PBs

OLAP

Redshift
BigQuery

BigQuery

Redshift

Hive

Snowflake

ClickHouse

DuckDB

Impala

Pinot

  • Aggregations (full table scans)
  • Append-only storage
  • Write-Optimized, Low Read Rate
    • 100s Read QPS
  • Large Storage Needs
    • 100s of PBs

Database Architecture

DBMS Architecture

  • Transport Subsystem
  • Query Processor
  • Execution Engine
  • Storage Engine

Image from Database Internals (1 - Introduction and Overview/DBMS Architecture)

Transport

  • Transport Subsystem
    • Handling client requests
    • Communication with nodes across cluster

Image from Database Internals (1 - Introduction and Overview/DBMS Architecture)

Query Processing

  • Query Processor
    • Parses, interprets, validates, and optimizes queries

Image from Database Internals (1 - Introduction and Overview/DBMS Architecture)

Execution

  • Execution Engine
    • Runs execution plan
    • Collects the results of the execution of local and remote operations

Image from Database Internals (1 - Introduction and Overview/DBMS Architecture)

Storage

  • Storage Engine
    • dasfajsdnlasd

Image from Database Internals (1 - Introduction and Overview/DBMS Architecture)

Data Storage

Storage Hierarchy

CPU Registers

CPU Cache

DRAM

SSD

HDD

Network Storage

Storage Hierarchy

CPU Registers

CPU Cache

DRAM

SSD

HDD

Network Storage

Disk

Memory

CPU

{

{

{

Storage Hierarchy

CPU Registers

CPU Cache

DRAM

SSD

HDD

Network Storage

Faster

Smaller

More Expensive

Slower

Larger

Cheaper

Interlude: SSDs vs HDDs

  • Hard Disk Drive
    • Slow Random Access (Spinning Disk w/ Mechanical Arm)
    • Longevity (Magnetic Storage)

Interlude: SSDs vs HDDs

  • Solid State Drive
    • No Moving Parts
      • 100x faster random access
      • Parallel Accesses
    • Cons
      • Limited Lifetime (Write Endurance + Power Loss)
      • Expensive

SSD Organization Schematics

Image from Database Internals (2 - B-Tree Basics/Disk-Based Structures/Solid State Drives)

SSD Organization Schematics

Image from Database Internals (2 - B-Tree Basics/Disk-Based Structures/Solid State Drives)

Page: 2 - 16 Kb

Block: >= 1 MB

  • 64 - 512 Pages

Planes:

Die:

 

Memory Cell: 1 bit

Strings: 32 or 64 cells

Arrays: Multiple strings

Page: Multiple arrays

SSDs: Flash Translation Layer

  • Bullet One
  • Bullet Two
  • Bullet Three

Interlude: SSDs vs HDDs

Attribute Solid-State Drive Hard Disk Drive
Price per capacity 30 cents / GB 3 cents / GB
Start-up time Almost Instantaneous Several Seconds
Sequential-access performance 200 - 3500 MB/sec 200 - 480 MB/sec
Random-access performance < 0.1 ms 3 - 12 ms
Parallel Access Yes No

Storage Hierarchy

CPU Registers

CPU Cache

DRAM

SSD

HDD

Network Storage

Faster

Smaller

More Expensive

Slower

Larger

Cheaper

Storage Hierarchy

CPU Registers

CPU Cache

DRAM

SSD

HDD

Network Storage

Faster

Smaller

More Expensive

Slower

Larger

Cheaper

Volatile

Random Access

Byte-Addressable

Non-Volatile

Sequential Access

Block-Addressable

Storage Hierarchy

CPU Registers

CPU Cache

DRAM

SSD

HDD

Network Storage

Volatile

Random Access

Byte-Addressable

Non-Volatile

Sequential Access

Block-Addressable

Tiered Storage

  • Data Lake with SSDs + HDDs
    • SSDs for new, active data
    • HDDs for older, archival data
  • Ex. ClickHouse Cloud
    • Offers read-through + write-through NVMe SSD caches on top of S3 network storage
  • In-Memory Buffer Pools
    • Every on-disk store will have this

Disaggregated Storage

  • Bullet One
  • Bullet Two
  • Bullet Three

Access Times

"Latency Numbers Every Programmer Should Know"

1 ns L1 Cache Ref
4 ns L2 Cache Ref
100 ns DRAM
16,000 ns SSD
2,000,000 ns HDD
~50,000,000 ns Network Storage
1,000,000,000 ns Tape Archives

Access Times x 1 Million

"Latency Numbers Every Programmer Should Know"

1 ns L1 Cache Ref 1 sec
4 ns L2 Cache Ref 4 sec
100 ns DRAM 100 sec
16,000 ns SSD 4.4 hours
2,000,000 ns HDD 3.3 weeks
~50,000,000 ns Network Storage 1.5 years
1,000,000,000 ns Tape Archives 31.7 years

"Latency Numbers Every Programmer Should Know"

Jeff Dean's Latency Numbers via github:colin-scott/interactive_latencies

1 sec

10 sec

100 sec

 3 hours

1 day

1.5 weeks

16.5 weeks

3 years

3 decades

3 centuries

x1 Million

Storage Engine

Buffer (Pool) Manager

  • Bullet One
  • Bullet Two
  • Bullet Three

On-Disk Storage

  • Bullet One
  • Bullet Two
  • Bullet Three

By Data Structure

Data Layout

Row vs Columnar

  • Bullet One
  • Bullet Two
  • Bullet Three

Slotted Pages

  • Bullet One
  • Bullet Two
  • Bullet Three

Log-Structured Storage

  • Bullet One
  • Bullet Two
  • Bullet Three

Columnar File Formats

  • Bullet One
  • Bullet Two
  • Bullet Three

Scaling Databases

Scaling Up (Single Server)

Database Example Peak QPS
(w/o experts)
Storage Capacity
Cache (KV Store) Valkey / Redis 150k QPS ~13 GB
(AWS ElasticCache)
OLTP Postgres 50k TPS 64 TiB
(Aurora on RDS)
OLAP ClickHouse on EBS 5k QPS 64 TiB
(Max EBS Volume)
ClickHouse on S3

Unscientific and Totally Made-Up Capacity

Scaling Out

  • Read Replicas
    • QPS++
  • Partitioning
    • Storage++
    • ClickHouse can handle 100s of PBs
      • Requires object storage (S3) instead of block storage

Compression & Indexes

An Intuition

Compression

  • Bullet One
  • Bullet Two
  • Bullet Three

Indexes

  • Auxillary Data Structure
    • Support queries on a (usually) separate data structure
  • Index Flavors
    • Ordering
    • Grouping

Compressed Index

  • Queryable in compressed form
  • Ordering helps both compression and indexing
    • By surfacing the latent structure of our data

Compact Data Structures 🤝 Databases

Index Types

  • Adaptive Radix Tree (ART)
  • AVL-Tree
  • B-Epsilon Tree
  • BitMap
  • Block Range Index (BRIN)
  • B+Tree
  • Bw-Tree
  • Hash Table
  • Hierarchical Navigable Small World (HNSW)
  • Inverted Index (Full Text)
  • K-D Tree
  • Log-Structured Merge Tree
  • MassTree
  • Patricia/Radix Trie
  • Red-Black Tree
  • R-Tree
  • Skip List
  • T-Tree

Extended Indexes

  • Partial Indexes
  • Functional Indexes
  • Skip Indexes
    • Fence Pointers
    • Min / Max
    • Block Range Indexes
    • Bloom Filters

Index Lookups

  • Full Table Scan
    • (Parallel) Sequential Scan
  • Index Scan
  • Index-Only Scan
  • Bitmap Index Scan

Sequential Scan Optimizations

  • Prefetching
  • Buffer Pool Bypass
  • Parallelization
  • Heap Clustering
  • Late Materialization
  • Data Skipping
    • Approximation
    • BRIN / Zone Maps

Compression

  • Bullet One
  • Bullet Two
  • Bullet Three

Hash Indexes

Limitations of Hash Indexes

  • No Range Scans
    • Only point queries
  • No Prefix Lookups
    • Need entire key

B-Tree Indexes

Block Range Indexes

BRIN / Zone Maps

  • Pre-Compute
  • Bullet Two
  • Bullet Three

LSM Indexes

Log-Structured Merge Tree

  • Always IOT / clustered index????
    • Data stored in sorted key order on disk
  • Bullet Two
  • Bullet Three

Database Storage Internals

By Heneli Kailahi

Database Storage Internals

  • 10