Hash Tables, B-Trees, and Learned Indexes
Teb for short.
I used to work at some startups that don't matter.
I teach programming and computer science.
I write about technology and its impact on the world.
Website: www.tebslab.com
Medium: @tebbaVonMathenstien
Twitter: @TebbaVonMaths
Database Indexing
Well Known Indexing Strategies
New Research Into "Learned Indexes"
How Do Learned Indexes Stack Up?
Indexing helps us quickly find specific items within a large collection of data.
Especially helpful when that data is not sorted by the field you're querying against.
id: 0
name: Teb
age: 79
...
This is a single row in a database table
These are all rows in the same table
Where is Tim?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
Finding Tim Without an Index...
X
√
?
?
?
?
?
?
?
?
?
?
Are you Tim?
X
X
X
X
Finding Tim With an Index
Tim
The Index
Where is Tim?
Offset 7
Indexing is a trade off between memory usage and computation speed.
The index occupies memory, but it makes queries faster.
5 major performance considerations:
Type of Query
Where is Tim?
Fetch everyone age 5-12.
Lookup Speed
The Index
Where is Tim?
Offset 7
How long does this take?
Cost to insert
id: 99
name: Yannet
age: 30
...
The Index
How long does it take to update the index?
Cost to delete
The Index
How long does it take to update the index?
Size & utilization
The Index
How much additional data is this index?
What is the information density of this index?
A Caveat
Indexing is primarily a tool used when there is a lot of data.
Optimizing for the number of hard drive accesses is more common than optimizing for total operations
Hash Function
Arbitrary Data
Integer
Hash functions map arbitrary data to an integer, deterministically.
Hash Table
Hash Function
Arbitrary Data
Integer
Hash table indexes map those integers to memory locations of database records
Memory Address
Hash Function
"Tim"
15
The Hash Table:
The Database Table:
Performance
Gotchas
The structure of a B-Tree represents sorted data, similar to a Binary Search Tree.
...
First
Last
...
The root node and each internal node are lists of pointers to sorted sections of the overall data.
a | b-d | e-g |
---|---|---|
h-l | j-p | q-t |
u-v | w-x | y-z |
aa-ab | ac-ad | ae-ag |
---|---|---|
ah-al | aj-ap | aq-at |
au-av | aw-ax | ay-az |
aeon | afar |
---|---|
afraid | agriculture |
Root
Internal
Leaf
Leaf nodes contain the items themselves and a pointer to their values location in the database
a | b-d | e-g |
---|---|---|
h-l | j-p | q-t |
u-v | w-x | y-z |
aa-ab | ab-ad | ae-ag |
---|---|---|
ah-al | aj-ap | aq-at |
au-av | aw-ax | ay-az |
aeon | afar |
---|---|
afraid | agriculture |
Root
Internal
Leaf
Because indexes are optimized for hard drive access, the size of each node is often based on hard drive block size.
...
First
Last
...
Root
2. Iterate over block
Hard Drive/Secondary Storage
3. Indicates which block to read next
1. Read block to RAM
Performance
Gotchas
A Google + MIT Collaboration explored the idea of applying machine learning to indexing
https://arxiv.org/abs/1712.01208
"Hash Tables and B-Trees are just models"
Index
Item of Interest
Location in Memory
This is essentially regression!
"Range indexes are just CDFs"
estimatedPosition = Model(key) ∗ numberOfElements
One important caveat...
ML Model
Item of Interest
Approximate Location in Memory
Indexes must return the exact location in memory — there is no room for any error.
This is addressed by storing the min and max error over the dataset.
ML Model
Item of Interest
Approximate Location in Memory
Lookups don't have as strong theoretical guarantees, but are bounded by magnitude of maxError - minError
Min Error
Max Error
ML Model
Item of Interest
Approximate Location in Memory
Min Error
Max Error
Min Error + Guess
Max Error + Guess
Guess
Models could also replace hash functions.
Hash Function
Arbitrary Data
Integer
ML
Model
Arbitrary Data
Integer
Hash Table
The researchers used a collection of simpler models, not a single large model.
This helped with last mile accuracy.
The researchers focused on simple models. The paper is a proof of concept.
Promising size and speed savings!
B-trees compared to learned indexes
Promising utilization improvements, slower speeds
Learned hash functions compared to standard hash function
Models are better at taking the distribution into account
Overfitting is kind of the goal.
Average Error is less important than error bounds
Min Error + Guess
Max Error + Guess
Guess
Interesting properties at scale
Takes advantage of new hardware
Multidimensional indexing
My name is still Tyler Elliot Bettilyon
Website: www.tebslab.com
Medium: @tebbaVonMathenstien
Twitter: @TebbaVonMaths
My name is still Tyler Elliot Bettilyon
I'm teaching a deep learning course focused on computer vision starting Tuesday the 28th.