Under the hood:
Storage and Retrieval in databases from scratch
# about me
Ten Zhi Yang
- @tzyinc on slides.com, codepen
- Software Developer @ Shopify (Not the other e-commerce)
- Writes nonsense in HTML, CSS and JS
https://slides.com/d/kEy393M/live
https://codepen.io/collection/dbeEYv
Defining "Database"
- Fundamentally:
- Store Data
- Retrieve Data
- OLTP: Online Transaction Processing
- OLAP: Online Analytical Processing
World's simplest DB
World's simplest database
# PRESENTING CODE
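The live-coded example itself isn't captured in this export; a minimal Node.js sketch of the same idea (file name and record format are my own choices) might look like this:

```js
// Append-only key-value store: dbSet appends a record, dbGet scans the file.
// Assumes keys contain no commas; values are JSON-encoded, so no raw newlines.
const fs = require('fs');

const DB_FILE = 'database.log'; // hypothetical file name

function dbSet(key, value) {
  // Appending is cheap: existing data is never rewritten.
  fs.appendFileSync(DB_FILE, `${key},${JSON.stringify(value)}\n`);
}

function dbGet(key) {
  if (!fs.existsSync(DB_FILE)) return undefined;
  // Linear scan, O(n): the last matching record wins.
  let result;
  for (const line of fs.readFileSync(DB_FILE, 'utf8').split('\n')) {
    if (!line) continue;
    const idx = line.indexOf(',');
    if (line.slice(0, idx) === key) result = JSON.parse(line.slice(idx + 1));
  }
  return result;
}

dbSet('42', { city: 'San Francisco' });
dbSet('42', { city: 'Singapore' }); // "overwrites" by appending a newer record
console.log(dbGet('42'));           // { city: 'Singapore' }
```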
Learnings
- Writing has pretty good performance; appending to a file is cheap
- The same basic principle applies to logging subsystems in databases
- Reading requires finding the start of the next key, so some form of escaping is necessary
- Reading is a linear search over records: O(n)
- Escape characters will corrupt your storage if not handled properly
- Writing to (and hence reading from) multiple files requires some form of pointer to remember where you are and where to write next
Improvements
- Segmenting
- Compacting (on older files, in the background); see the sketch after this list
- Merging segments (can be done together with compacting)
- Crash recovery (as in the previous example)
- Checksums for crash recovery
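As a rough illustration of compacting and merging (the helper name and segment format are assumptions, matching the sketch above):

```js
// Compaction / merging sketch: read segments from oldest to newest and keep
// only the most recent value for each key, then write one new segment.
const fs = require('fs');

function compactSegments(segmentFiles, outputFile) {
  const latest = new Map();
  for (const file of segmentFiles) {            // oldest first, so later writes win
    for (const line of fs.readFileSync(file, 'utf8').split('\n')) {
      if (!line) continue;
      const idx = line.indexOf(',');
      latest.set(line.slice(0, idx), line.slice(idx + 1));
    }
  }
  const records = [...latest].map(([k, v]) => `${k},${v}`).join('\n');
  fs.writeFileSync(outputFile, records + '\n'); // the new, compacted segment
}

// e.g. compactSegments(['segment-1.log', 'segment-2.log'], 'segment-merged.log');
```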
Hash Indexes
Hash indexes - in memory hash
# PRESENTING CODE
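Again the presented code isn't in this export; the idea can be sketched like this (in-memory map from key to byte offset, single-segment version, names assumed):

```js
// Hash index sketch: each key maps to the byte offset of its latest record,
// so a read is one seek instead of a full scan. Assumes a fresh file; on
// restart the index would be rebuilt by scanning the log once.
const fs = require('fs');

const DB_FILE = 'database.log';
const index = new Map(); // key -> byte offset of latest record
let offset = 0;

function dbSet(key, value) {
  const record = `${key},${JSON.stringify(value)}\n`;
  fs.appendFileSync(DB_FILE, record);
  index.set(key, offset);                 // remember where this record starts
  offset += Buffer.byteLength(record);
}

function dbGet(key) {
  if (!index.has(key)) return undefined;
  const fd = fs.openSync(DB_FILE, 'r');
  const buf = Buffer.alloc(4096);         // assumes a record fits in 4 KB
  const bytes = fs.readSync(fd, buf, 0, buf.length, index.get(key));
  fs.closeSync(fd);
  const line = buf.toString('utf8', 0, bytes).split('\n')[0];
  return JSON.parse(line.slice(line.indexOf(',') + 1));
}
```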
Strengths
- sequential writing is fast (especially for disk drives)
- concurrency & crash recovery (segments are immutable and/or append only)
- little disk fragmentation
Limitations
- hash tables must fit into memory
- range queries are not efficient
SSTables + LSM trees
SSTables + LSM trees (kind of)
# PRESENTING CODE
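A rough sketch of the memtable (C0) + SSTable (C1) idea, not the code shown live; the limit, file names and flush policy are all made up for illustration:

```js
// Writes go into an in-memory sorted structure; when it grows too big it is
// flushed to disk as a sorted segment. Reads check memory first, then the
// on-disk segments from newest to oldest.
const fs = require('fs');

const MEMTABLE_LIMIT = 4;   // tiny, for illustration only
let memtable = new Map();   // C0: recent writes, in memory
const sstables = [];        // C1: sorted segment files, newest last

function set(key, value) {
  memtable.set(key, value);
  if (memtable.size >= MEMTABLE_LIMIT) flushMemtable();
}

function flushMemtable() {
  const sortedKeys = [...memtable.keys()].sort(); // sorted, so segments can be merge-sorted later
  const file = `sstable-${sstables.length}.txt`;
  fs.writeFileSync(
    file,
    sortedKeys.map(k => `${k},${JSON.stringify(memtable.get(k))}`).join('\n') + '\n'
  );
  sstables.push(file);
  memtable = new Map();
}

function get(key) {
  if (memtable.has(key)) return memtable.get(key); // check C0 first
  for (let i = sstables.length - 1; i >= 0; i--) { // then newest segment to oldest
    for (const line of fs.readFileSync(sstables[i], 'utf8').split('\n')) {
      if (line.startsWith(key + ',')) return JSON.parse(line.slice(key.length + 1));
    }
  }
  return undefined;
}
```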
Learnings
- the LSM in LSM trees stands for Log-Structured Merge; merging happens periodically, as a rolling merge
- it is possible to have two (or more) trees/structures, C0 and C1, where the on-disk ones can take the form of SSTables
- C0 is stored in memory and is very fast, containing frequently referenced nodes
actual rolling merge
Improvements
- bloom filter: a space-efficient probabilistic data structure (see the sketch after this list)
- may contain false positives (but that's better than false negatives)
- check whether a key exists before going through the LSM tree
- 2 strategies for merging + compaction
- size-tiered: smaller, newer tables are merged into bigger, older tables
- leveled: the key range is split into smaller tables and older data is moved into separate levels
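A toy Bloom filter, to make the "check before going through the LSM tree" step concrete; the bit-array size and hash function here are arbitrary choices, not how any particular database does it:

```js
class BloomFilter {
  constructor(bits = 1024, hashes = 3) {
    this.bits = new Uint8Array(bits);
    this.hashes = hashes;
  }
  hash(key, seed) {
    // Simple salted string hash; not cryptographic, just for the sketch.
    let h = seed;
    for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
    return h % this.bits.length;
  }
  add(key) {
    for (let i = 1; i <= this.hashes; i++) this.bits[this.hash(key, i)] = 1;
  }
  mightContain(key) {
    // false = definitely absent; true = maybe present (false positives possible)
    for (let i = 1; i <= this.hashes; i++) {
      if (!this.bits[this.hash(key, i)]) return false;
    }
    return true;
  }
}

const filter = new BloomFilter();
filter.add('user:42');
console.log(filter.mightContain('user:42')); // true
console.log(filter.mightContain('user:99')); // almost certainly false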
B+ trees
B+trees
# PRESENTING CODE
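The presented code isn't in this export; a much-simplified lookup sketch (the page layout is invented for illustration) shows the two kinds of pages and a sibling pointer:

```js
// Internal pages hold keys and child pointers only; leaf pages hold the data.
function search(page, key) {
  if (page.leaf) {
    const entry = page.entries.find(e => e.key === key);
    return entry ? entry.value : undefined;
  }
  // Keys in an internal page are the boundaries between child ranges.
  for (let i = 0; i < page.keys.length; i++) {
    if (key < page.keys[i]) return search(page.children[i], key);
  }
  return search(page.children[page.keys.length], key);
}

// Tiny hand-built tree: one root index page, two leaf pages.
const leafA = { leaf: true, entries: [{ key: 'apple', value: 1 }, { key: 'fig', value: 2 }] };
const leafB = { leaf: true, entries: [{ key: 'kiwi', value: 3 }, { key: 'pear', value: 4 }] };
leafA.next = leafB; // sibling pointer: range scans continue without going back to the root
const root = { leaf: false, keys: ['kiwi'], children: [leafA, leafB] };

console.log(search(root, 'pear')); // 4
```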
Learnings
- B+Trees are actually more established than LSM trees
- B+Trees store the tree as pages, of two kinds
- A series of pointers representing the index
- Leaf pages which contain the data
- B+Trees modify in place
- B+Trees are built from "bottom up"; every time you split, you grow more parents
not sure what I was expecting
Improvements
- Write-Ahead Log (WAL): every modification is appended to a log before the page itself is changed, so the tree can be restored to a consistent state after a crash (see the sketch after this list)
- Alternatively, use a copy-on-write scheme, which is also useful for snapshots
- add lightweight locks (latches) to maintain integrity during concurrent updates
- abbreviate keys in index pages; they only need to mark the boundaries between ranges
- arrange pages on disk so that leaf pages lie in sequential order
- sibling pointers in leaf pages, to reduce the need to go back to the root during range scans
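A sketch of the write-ahead log idea (file names and page format are assumptions):

```js
// Every page modification is appended to the log *before* the page is
// overwritten in place, so a crash mid-update can be repaired by replaying
// the log. Entries hold the full new page contents, so replaying is safe
// even if some changes had already been applied.
const fs = require('fs');

const WAL_FILE = 'btree.wal';

function writePage(pageFile, pageId, newContents) {
  fs.appendFileSync(WAL_FILE, JSON.stringify({ pageId, newContents }) + '\n'); // 1. log the intent
  fs.writeFileSync(pageFile, newContents);                                     // 2. then modify in place
}

function recover(applyPage) {
  if (!fs.existsSync(WAL_FILE)) return;
  for (const line of fs.readFileSync(WAL_FILE, 'utf8').split('\n')) {
    if (!line) continue;
    const { pageId, newContents } = JSON.parse(line);
    applyPage(pageId, newContents);
  }
}
```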
B trees vs LSM trees
LSM trees | B+ trees |
---|---|
Faster writes | Faster reads |
Sequential writes | Pages are fixed sizes |
Compact data | Each key appears only once |
Compaction can block reads/writes | Light locking on writes |
Other Indexing Structures
Full text search and Fuzzy Index
- Some applications require searching for similar keys; this can be done with fuzzy indexes.
- For example, Lucene allows searching for words within a certain edit distance (see the sketch below)
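Fuzzy matching boils down to edit distance (insertions, deletions, substitutions). This is just the distance function as a plain sketch, not how Lucene's index actually works:

```js
function editDistance(a, b) {
  // dp[i][j] = edits needed to turn the first i chars of a into the first j chars of b
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                    // deletion
        dp[i][j - 1] + 1,                                    // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)   // substitution (or match)
      );
    }
  }
  return dp[a.length][b.length];
}

console.log(editDistance('database', 'databse')); // 1: one character away
```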
OLTP VS OLAP
Transactions VS Analytics
 | OLTP | OLAP |
---|---|---|
Main read pattern | Small number of records, fetched by key | Aggregated over large number of records |
Main write pattern | Random-access, low-latency writes from user input | Bulk import or event stream |
Primarily used by | Customer, through web-application | Analysts, for making business decisions |
Data | Latest state of the world | History of events |
Size | GB to TB | TB to PB |
Bottleneck | Disk seek time | Disk bandwidth |
Data warehousing
- OLTP systems need to be highly available with low latency, so heavy analytic queries are run on a separate data warehouse
- The warehouse usually holds a read-only copy of the data, extracted from the OLTP systems
- Can be optimised for analytics access patterns
- key indexing not as important
OLAP Structures
Star and snowflake schemas
Snowflake schema | Star schema |
---|---|
Each dimension is normalised | One fact table with dimension tables |
No data redundancy | Easier for analysts to use |
Less storage required | Faster query time (fewer joins) |
Data integrity | Data could mismatch |
Easy to maintain | Easy to set up |
Column-Oriented Storage
- In column-oriented storage, the values of each column are stored together across all rows. It relies on every column file containing the rows in the same order.
- Access time: in OLAP we usually only need a few columns, so there is no need to waste time fetching entire rows
- Compression: adjacent similar values can be compressed; we just need to track how many consecutive rows hold the same value. E.g. gender has very low cardinality, and age is also small (nobody is older than 150); sorting by the low-cardinality columns achieves better compression (see the sketch after this list)
- Optimisations: bitmap encoding for distinct values (e.g. country) allows faster equality comparisons; we can also keep multiple copies with different sort orders for regularly used queries
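As a sketch of the compression point (run-length encoding a sorted, low-cardinality column; the column values are made up):

```js
// Instead of storing one value per row, store (value, count) runs.
function runLengthEncode(column) {
  const runs = [];
  for (const value of column) {
    const last = runs[runs.length - 1];
    if (last && last.value === value) last.count++;
    else runs.push({ value, count: 1 });
  }
  return runs;
}

// A sorted, low-cardinality column collapses into a handful of runs.
console.log(runLengthEncode(['F', 'F', 'F', 'M', 'M', 'M', 'M', 'M']));
// [ { value: 'F', count: 3 }, { value: 'M', count: 5 } ]
```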
Writing to Column-Oriented Storage
- Sorted columns optimise for read-only queries, but they make writes more difficult.
- In-place updates would require rewriting the whole column on every write. Instead, we can use an LSM-tree-like structure where an in-memory store buffers new writes. When enough writes have accumulated, they are merged with the column files and written out to new files in bulk.
- Queries then need to read data from both disk and the in-memory store, but this is hidden inside the engine and invisible to the user.
Aggregation: Materialized Views
- A materialized view is defined like a standard view, but its results are stored on disk, whereas virtual views are expanded and processed at query time.
- Updating a materialized view when the underlying data changes is expensive
Aggregation: Data Cube
A common special case of a materialized view is a data cube or OLAP cube, where data is pre-summarised over each dimension. This enables fast queries over precomputed results, although a data cube does not have the flexibility of the raw data.
Can slice and dice, as well as roll up and drill down (see the toy sketch below)
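A toy two-dimensional cube, purely to illustrate pre-summarising over each dimension (the fact records and dimension names are made up):

```js
// Pre-aggregate sales by (date, product), plus roll-ups per dimension and a
// grand total, so common queries read precomputed sums instead of raw facts.
function buildCube(facts) {
  const cube = { cells: new Map(), byDate: new Map(), byProduct: new Map(), total: 0 };
  const bump = (map, key, amount) => map.set(key, (map.get(key) || 0) + amount);
  for (const { date, product, amount } of facts) {
    bump(cube.cells, `${date}|${product}`, amount); // slice / dice on both dimensions
    bump(cube.byDate, date, amount);                // roll-up: total per date
    bump(cube.byProduct, product, amount);          // roll-up: total per product
    cube.total += amount;
  }
  return cube;
}

const cube = buildCube([
  { date: '2024-01-01', product: 'apple', amount: 3 },
  { date: '2024-01-01', product: 'pear', amount: 5 },
  { date: '2024-01-02', product: 'apple', amount: 2 },
]);
console.log(cube.byProduct.get('apple')); // 5
```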
Storage and retrieval in databases
By tzyinc