Partitioning

vertical scaling or scaling up

shared-memory architecture

Many CPUs, many RAM chips, and many disks can be joined together under one operating system

problem: cost grows faster than linearly; hot-swappable components give some fault tolerance, but the whole machine still sits in a single geographic location

buy a more powerful machine

problem: cannot scale up infinitely

shared-disk architecture

several machines store their data on an array of disks that is shared between them

problem: contention and the overhead of locking limit scalability

shared-nothing architecture

each machine or virtual machine running the database software is called a node. Each node uses its CPUs, RAM, and disks independently

benefits: price/performance, geographic distribution of data, high fault tolerance

What is called a partition here goes by different names:

shard: MongoDB, Elasticsearch, SolrCloud

region: HBase

tablet: Bigtable

vnode: Cassandra

vBucket: Couchbase

If the partitioning is unfair, so that some partitions have more data or queries than others, we call it skewed.

In an extreme case, all the load could end up on one partition, so n-1 out of n nodes are idle and your bottleneck is the single busy node

A partition with disproportionately high load is called a hot spot

Assign record to a partition

random

 + avoids hot spots, since keys are distributed randomly

 - reads require querying all partitions, since there is no way to compute which partition holds a given key

partitioning by key range

 + range scans are easy

 - can lead to hot spots (example: if the key is a timestamp and partitions correspond to time ranges, all writes for the current hour go to the same partition. Solution: prefix the key with the sensor name and partition by sensor first. Downside of the solution: a query for all sensors in a time range has to hit all partitions again)
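A minimal sketch of key-range partition lookup (illustrative only, not any particular database's code): partition boundaries are a sorted list of keys, a single key is routed with a binary search, and a range scan only needs the partitions whose ranges overlap the query. The boundary values are made up.

import bisect

# Hypothetical boundaries: partition i owns keys in [boundaries[i], boundaries[i+1]).
# Choosing good boundaries requires knowing the key distribution.
boundaries = ["a", "g", "n", "t"]   # 4 partitions: [a,g), [g,n), [n,t), [t, ...)

def partition_for_key(key: str) -> int:
    # binary search over the sorted boundaries
    return bisect.bisect_right(boundaries, key) - 1

def partitions_for_range(start: str, end: str) -> range:
    # a range scan only touches the partitions overlapping [start, end]
    return range(partition_for_key(start), partition_for_key(end) + 1)

print(partition_for_key("hello"))            # -> 1
print(list(partitions_for_range("b", "p")))  # -> [0, 1, 2]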

 

Assign record to a partition

partitioning by hash of key

 + even distribution of keys and load (if the partition boundaries are chosen pseudorandomly, this is sometimes called consistent hashing)

 - range scans are not possible (more precisely, a range query has to be sent to all partitions)

(note: the hash function should spread keys evenly and be deterministic across processes; MD5, for example)
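A sketch of hash partitioning along these lines, assuming MD5 and evenly spaced ranges of a 32-bit hash space (the range layout and constants are made up for illustration):

import hashlib

NUM_PARTITIONS = 8
HASH_SPACE = 2 ** 32   # we use the first 4 bytes of MD5 as a 32-bit hash

def key_hash(key: str) -> int:
    # deterministic across processes, unlike e.g. Python's built-in hash()
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def partition_for_key(key: str) -> int:
    # map the hash into one of NUM_PARTITIONS evenly spaced ranges of the hash space
    return key_hash(key) * NUM_PARTITIONS // HASH_SPACE

print(partition_for_key("sensor-42:2024-01-01T00:00"))   # adjacent key values scatter across partitions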

 

hybrid

hash only the first part of a compound key. Example: (user_id, update_timestamp), partitioned by hash(user_id); within each partition, rows are stored sorted by update_timestamp

 + even distribution of load

 + range scans are possible over the remaining columns of the key (for a fixed first part)
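A sketch of this hybrid scheme (in the spirit of compound primary keys in Cassandra, though not its actual implementation): hash the first column to pick the partition and keep rows sorted by the remaining columns inside it, so one user's updates can be range-scanned from a single partition.

import bisect, hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]   # each holds (user_id, timestamp, value) rows, kept sorted

def partition_for(user_id: str) -> int:
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_PARTITIONS

def insert(user_id: str, update_timestamp: str, value: str) -> None:
    rows = partitions[partition_for(user_id)]
    bisect.insort(rows, (user_id, update_timestamp, value))   # stays sorted within the partition

def updates_between(user_id: str, start_ts: str, end_ts: str) -> list:
    # a range scan over one user's updates touches only a single partition
    rows = partitions[partition_for(user_id)]
    out = []
    for row in rows[bisect.bisect_left(rows, (user_id, start_ts)):]:
        if row[0] != user_id or row[1] > end_ts:
            break
        out.append(row)
    return out

insert("alice", "2024-06-01", "post 1")
insert("alice", "2024-06-03", "post 2")
print(updates_between("alice", "2024-06-01", "2024-06-02"))   # -> [('alice', '2024-06-01', 'post 1')]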

Assign record to a partition

application level

example: celebrities on social networks

most data systems cannot automatically compensate for such a highly skewed workload, so it has to be handled at the application level

solution: add a random number to the beginning or end of the key; a two-digit decimal random number splits writes to that key evenly across 100 different keys (see the sketch after this list)

 + remove the hot spot for writes

 - reads are spread across partitions: they have to query all 100 keys and combine the results

 - needs extra bookkeeping, since only hot keys should be split; applying this to keys with low write throughput would be unnecessary overhead
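A sketch of this key-splitting idea for a known hot key; the store, the HOT_KEYS set and the 100-way fan-out are all illustrative stand-ins:

import random

store = {}                      # stand-in for the partitioned key-value store
HOT_KEYS = {"celebrity:123"}    # app-level bookkeeping: only known hot keys are split
FANOUT = 100                    # two-digit decimal suffix -> 100 sub-keys

def write(key: str, value) -> None:
    if key in HOT_KEYS:
        key = f"{key}#{random.randrange(FANOUT):02d}"   # spread writes over 100 keys
    store.setdefault(key, []).append(value)

def read_all(key: str) -> list:
    if key not in HOT_KEYS:
        return store.get(key, [])
    out = []                    # reads must gather all sub-keys and combine the results
    for i in range(FANOUT):
        out.extend(store.get(f"{key}#{i:02d}", []))
    return out

write("celebrity:123", "like from user 1")
write("celebrity:123", "like from user 2")
print(len(read_all("celebrity:123")))   # -> 2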

 

Secondary Indexes - by document

each partition maintains its own secondary index, covering only the documents in that partition (local index)

writes are fast

requires read queries to all partitions (scatter/gather)

 

used by: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, and VoltDB
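A sketch of document-partitioned (local) secondary indexes and the resulting scatter/gather read, using a hypothetical in-memory Partition class (not any of the databases above):

from collections import defaultdict

class Partition:
    # hypothetical partition: primary data plus its own local secondary index
    def __init__(self):
        self.docs = {}                         # doc_id -> document
        self.index = defaultdict(set)          # e.g. color -> {doc_id, ...}, local to this partition

    def put(self, doc_id: int, doc: dict) -> None:
        self.docs[doc_id] = doc                # a write touches only this partition...
        self.index[doc["color"]].add(doc_id)   # ...including its local index (fast writes)

    def find_by_color(self, color: str) -> list:
        return [self.docs[i] for i in self.index[color]]

partitions = [Partition() for _ in range(4)]

def put(doc_id: int, doc: dict) -> None:
    partitions[doc_id % len(partitions)].put(doc_id, doc)

def find_by_color(color: str) -> list:
    # scatter/gather: the query must be sent to every partition and the results combined
    results = []
    for p in partitions:
        results.extend(p.find_by_color(color))
    return results

put(1, {"color": "red", "make": "Honda"})
put(6, {"color": "red", "make": "Ford"})
print(len(find_by_color("red")))   # -> 2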

Secondary Indexes - by term

operates across all partitions (global index)

the secondary index is itself partitioned, by the indexed value (the name term comes from full-text indexes, where the terms are all the words that occur in a document)

 

writes are slower and more complicated (a single write can affect several partitions of the index; keeping the global index perfectly up to date would need a distributed transaction, so in practice updates are often asynchronous and reads may see a stale index)

reads are more efficient: a query only needs to hit the partition containing the term it wants
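For contrast, a sketch of a term-partitioned (global) index: the index is partitioned by the indexed value, so a read hits exactly one index partition, while a write may have to update an index partition other than the one holding the document (all names here are made up):

import hashlib

NUM_PARTITIONS = 4
doc_partitions = [dict() for _ in range(NUM_PARTITIONS)]     # doc_id -> document
index_partitions = [dict() for _ in range(NUM_PARTITIONS)]   # term such as "color:red" -> set of doc_ids

def _p(term: str) -> int:
    return int(hashlib.md5(term.encode()).hexdigest(), 16) % NUM_PARTITIONS

def put(doc_id: int, doc: dict) -> None:
    doc_partitions[doc_id % NUM_PARTITIONS][doc_id] = doc
    term = f"color:{doc['color']}"
    # the index entry may live on a different partition than the document;
    # in real systems this index update is usually asynchronous, hence possibly stale reads
    index_partitions[_p(term)].setdefault(term, set()).add(doc_id)

def find_by_color(color: str) -> set:
    # the read hits exactly one index partition instead of all of them
    term = f"color:{color}"
    return index_partitions[_p(term)].get(term, set())

put(1, {"color": "red"})
put(6, {"color": "red"})
print(find_by_color("red"))   # -> {1, 6}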

Reasons for rebalancing:

 - query throughput increases

 - dataset size increases

 - node fails

 

Requirements for rebalancing:

 - after rebalancing, load should be shared fairly

 - while rebalancing is in progress, the database should keep accepting reads and writes

 - no more data than necessary should be moved

 

Rebalancing

modulo approach (node = hash(key) mod N)

 - when the number of nodes N changes, most keys map to a different node, so far more data than necessary is moved during rebalancing
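A small worked example of the problem, assuming an MD5-based hash: going from 10 to 11 nodes with hash mod N remaps roughly 10/11 of all keys.

import hashlib

def node_for(key: str, num_nodes: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_nodes

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if node_for(k, 10) != node_for(k, 11))
# a key stays put only if hash mod 10 == hash mod 11, i.e. roughly 1 key in 11
print(f"{moved / len(keys):.0%} of keys moved")   # ~91%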

fixed number of partitions

create many more partitions than there are nodes and assign several partitions to each node. The number of partitions is fixed, so each partition's size grows in proportion to the total dataset. Rebalancing moves whole partitions between nodes, never individual keys within a partition; while a partition is being transferred, reads and writes keep using the old assignment.

if partitions are very large, rebalancing and recovery from node failures become expensive. But if partitions are too small, they incur too much overhead.

+ easy to operate

 - choosing the right number of partitions is difficult if the total size of the dataset is highly variable

 

used in: Riak, Elasticsearch, Couchbase, Voldemort
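A sketch of the fixed-number-of-partitions scheme (in the spirit of these systems, not their actual code): a key's partition is fixed forever, and adding a node only reassigns whole partitions, just enough of them for a fair share.

import hashlib
from collections import defaultdict

NUM_PARTITIONS = 1024   # chosen up front, much larger than the expected number of nodes

def partition_for_key(key: str) -> int:
    # a key's partition never changes, no matter how many nodes there are
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def add_node(assignment: dict, new_node: str) -> dict:
    # rebalance by stealing whole partitions from the most loaded nodes until shares are roughly equal
    assignment = dict(assignment)
    per_node = defaultdict(list)
    for p, n in assignment.items():
        per_node[n].append(p)
    fair_share = NUM_PARTITIONS // (len(per_node) + 1)
    for _ in range(fair_share):
        donor = max(per_node, key=lambda n: len(per_node[n]))
        assignment[per_node[donor].pop()] = new_node
    return assignment

initial = {p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)}   # 3 nodes, ~341 partitions each
after = add_node(initial, "node-3")
moved = sum(1 for p in range(NUM_PARTITIONS) if initial[p] != after[p])
print(f"{moved} of {NUM_PARTITIONS} partitions moved")   # 256: only the new node's fair share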

 

dynamic partitioning

used by: MongoDB, HBase, RethinkDB

the number of partitions is not tied to the number of nodes; instead the size of a partition is bounded (e.g., ~10 GB in HBase): a partition that grows beyond the limit is split in two, a shrinking partition can be merged with a neighbor, so the number of partitions adapts to the size of the dataset

 + adapts to dataset size

 + good choice for range-partitioned data

 - an empty database starts with a single partition: all writes are processed by a single node while the others sit idle, until the first split (to mitigate this, some databases, including HBase and MongoDB, allow configuring an initial set of partitions: pre-splitting)
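A sketch of dynamic (size-based) splitting for key-range partitions; the 10-row threshold is a made-up stand-in for a byte-size limit like HBase's ~10 GB:

import bisect

MAX_PARTITION_SIZE = 10   # stand-in for a size threshold such as ~10 GB

partitions = [[]]         # an empty database starts with a single partition of (key, value) rows
boundaries = [""]         # partition i owns keys >= boundaries[i]

def put(key: str, value: str) -> None:
    i = bisect.bisect_right(boundaries, key) - 1
    rows = partitions[i]
    bisect.insort(rows, (key, value))
    if len(rows) > MAX_PARTITION_SIZE:          # split an oversized partition roughly in half
        mid = len(rows) // 2
        partitions[i] = rows[:mid]
        partitions.insert(i + 1, rows[mid:])
        boundaries.insert(i + 1, rows[mid][0])  # new boundary = first key of the upper half

for n in range(25):
    put(f"key-{n:02d}", "value")
print(len(partitions), boundaries)   # the number of partitions has grown with the data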

partitioning proportionally to nodes

used by: Cassandra, Ketama

fixed number of partitions per node (consistent hashing)

size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged, but when you increase the number of nodes, the partitions become smaller again. Since a larger data volume generally requires a larger number of nodes to store, this approach also keeps the size of each partition fairly stable.

When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, taking ownership of one half of each split partition and leaving the other half in place
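A sketch of the per-node-token idea (a simplification of Cassandra-style consistent hashing, not its actual algorithm): every node owns a fixed number of randomly placed tokens on a hash ring, and a key belongs to the node owning the next token at or after the key's hash.

import bisect, hashlib, random

TOKENS_PER_NODE = 8    # a fixed number of partitions (tokens) per node
HASH_SPACE = 2 ** 32
ring = []              # sorted list of (token, node) pairs

def key_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def add_node(node: str) -> None:
    # a joining node picks random ring positions, splitting whichever ranges they land in
    for _ in range(TOKENS_PER_NODE):
        bisect.insort(ring, (random.randrange(HASH_SPACE), node))

def node_for_key(key: str) -> str:
    # the key belongs to the node owning the first token at or after hash(key), wrapping around
    i = bisect.bisect_left(ring, (key_hash(key), ""))
    return ring[i % len(ring)][1]

for n in ["node-a", "node-b", "node-c"]:
    add_node(n)
print(node_for_key("user:42"))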

automatic or manual rebalancing

some databases generate a suggested partition assignment automatically, but require an administrator to commit it before it takes effect.

 

fully automated rebalancing can be unpredictable; combined with automatic failure detection, it can lead to cascading failures

 

For example, say one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that the overloaded node is dead, and automatically rebalance the cluster to move load away from it. This puts additional load on the overloaded node, other nodes, and the network—making the situation worse and potentially causing a cascading failure.

request routing (service discovery)

in all cases, the key problem is: how does the component making the routing decision learn about changes in the assignment of partitions to nodes?

Apache ZooKeeper

HBase, SolrCloud, and Kafka use ZooKeeper to track the assignment of partitions to nodes

MongoDB uses approach with routing tier but relies on its own config server instead of ZooKeeper

Cassandra and Riak use a gossip protocol among the nodes to spread changes in cluster state, so a request can be sent to any node, which forwards it to the node that owns the requested partition
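A sketch of a routing tier that caches the partition-to-node assignment and forwards requests; in real systems the refresh is driven by ZooKeeper watches, config servers, or gossip rather than the hypothetical refresh_assignment call below.

import hashlib

NUM_PARTITIONS = 16

class Router:
    # hypothetical routing tier: knows the current partition assignment and forwards requests
    def __init__(self, assignment: dict):
        self.assignment = assignment   # partition -> node address

    def refresh_assignment(self, assignment: dict) -> None:
        # in practice triggered by a ZooKeeper watch, a config server, or gossip
        self.assignment = assignment

    def route(self, key: str) -> str:
        partition = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS
        return self.assignment[partition]   # the node this request should be forwarded to

router = Router({p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)})
print(router.route("user:42"))
router.refresh_assignment({p: f"node-{p % 4}" for p in range(NUM_PARTITIONS)})   # after a rebalance
print(router.route("user:42"))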

 

By Michael Romanov