Dynamo: Amazon’s Highly Available Key-value Store
Colin King
CMSC818E
November 7, 2017
Motivation
Amazon was pushing the limits with Oracle's Enterprise Edition
Unable to sustain availability, scalability, and performance needs
70% of Amazon's operations were simple K/V lookups, no need for an RDBMS
Availability >> Strong consistency
Reliability depends on persistent state
Created Dynamo: a distributed, scalable, highly available, eventually-consistent K-V store
Mixes together a variety of previously studied concepts
Question: Can an eventually consistent database work in production under significant load?
Principles
Highly available: 99.9% of requests < 300ms
Incrementally scalable: Ability to add new nodes, one at a time
Symmetry / Decentralization: No SPOF, all nodes share the same role
Optimistic Replication: "always writeable" system, postpone conflict resolution (CR)
Heterogeneity: Support nodes of varying capabilities
Data Model
Intentionally barebones
No transactions or relational schema
K-V systems are "embarrassingly parallel"
get(key)
key: arbitrary data, MD5-hashed to a 128-bit id
value: data blob, opaque to Dynamo
may return multiple objects with conflicting versions
rare: 99.94% of reads return a single version
put(key, context, object)
context: opaque to the client
Used for client-side CR (see the API sketch below)
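As a rough illustration of this interface, here is a minimal in-memory sketch; the class and method bodies are assumptions, only the get / put(key, context, object) shape comes from the paper.

```python
import hashlib

# Minimal sketch of the client-facing API (class and helper names are mine,
# not Amazon's client library). Keys are MD5-hashed to a 128-bit id, values
# are opaque blobs, and the context carries versioning metadata that is
# opaque to the caller and passed back on the next put.

class DynamoLikeClient:
    def __init__(self):
        self._store = {}                        # 128-bit key id -> list of (context, blob)

    @staticmethod
    def _key_id(key: bytes) -> int:
        return int.from_bytes(hashlib.md5(key).digest(), "big")

    def get(self, key: bytes):
        """Return all known versions; usually exactly one."""
        return list(self._store.get(self._key_id(key), []))

    def put(self, key: bytes, context, blob: bytes):
        """Store a new version tied (via context) to what the caller last read."""
        self._store.setdefault(self._key_id(key), []).append((context, blob))

client = DynamoLikeClient()
client.put(b"cart:alice", None, b'{"items": []}')
print(client.get(b"cart:alice"))                # -> [(None, b'{"items": []}')]
```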
Key Design Features
P2P System
Consistent Hashing
Virtual Nodes
Preference List Replication
Failure Detection
Temporary: Hinted Handoff
Permanent: Gossip Protocol
Conflict Resolution
Vector Clocks
Replica Synchronization
Configurability
P2P: Consistent Hashing
Need to be able to add and remove nodes without downtime
Solution: Map each node to a range in the hash's output space (see the sketch below)
Each node is assigned a random location on the ring
Handles all keys up to the next node in the ring
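A minimal sketch of ring lookup, assuming made-up node names and helper functions:

```python
import bisect
import hashlib

# Minimal consistent-hashing sketch. Each node hashes to a position on a
# 128-bit ring; a key is served by the node whose position most closely
# precedes it, i.e. each node handles all keys from its position up to the
# next node on the ring.

def ring_position(name: bytes) -> int:
    return int.from_bytes(hashlib.md5(name).digest(), "big")

nodes = [b"node-a", b"node-b", b"node-c"]
ring = sorted((ring_position(n), n) for n in nodes)
positions = [pos for pos, _ in ring]

def lookup(key: bytes) -> bytes:
    pos = ring_position(key)
    idx = bisect.bisect_right(positions, pos) - 1   # largest node position <= key
    return ring[idx][1]                             # idx == -1 wraps to the last node

print(lookup(b"cart:alice"))
```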
P2P: Virtual Nodes
Problems:
Random placement is not uniform with few nodes (the law of large numbers hasn't kicked in)
Fails to support heterogeneity
Solution: Virtual Nodes
Each physical machine is treated as a configurable number of virtual nodes
Distributes the key space handled by a physical node
Great for failures: a failed node's load spreads evenly across the remaining nodes (sketch below)
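A sketch of token assignment; the token counts are made-up parameters, chosen only to show how a bigger machine can claim a larger share of the key space.

```python
import hashlib
from collections import Counter

# Each physical machine claims several positions ("tokens") on the ring; a
# beefier machine simply claims more tokens, which supports heterogeneity.

TOKENS_PER_NODE = {"node-a": 16, "node-b": 16, "node-c": 32}   # node-c is twice as big

ring = sorted(
    (int.from_bytes(hashlib.md5(f"{node}#{i}".encode()).digest(), "big"), node)
    for node, count in TOKENS_PER_NODE.items()
    for i in range(count)
)

# Rough check of how the key space splits between physical machines:
# each token owns the arc from its position to the next token's position.
share = Counter()
for i, (pos, node) in enumerate(ring):
    next_pos = ring[(i + 1) % len(ring)][0]
    share[node] += (next_pos - pos) % (2 ** 128)

for node, arc in sorted(share.items()):
    print(node, f"{arc / 2 ** 128:.2%}")    # roughly 25% / 25% / 50%
```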
P2P: Replication
Data is replicated across N successor nodes
This forms the "preference list"
Effectively, each node is responsible for the key ranges of its N-1 predecessors as well as its own
Problem: Replication needs to occur across physical, not virtual, nodes
The preference list is constructed by skipping ring positions so that it contains only distinct physical nodes (sketched below)
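A small sketch of that skipping walk; the ring positions and host names are made up.

```python
# Starting from the token that owns the key, walk the ring clockwise and
# collect successors, skipping tokens belonging to a physical host that is
# already chosen, until N distinct physical nodes are found.

N = 3

# each (position, host) pair is a virtual node; hosts appear multiple times
ring = [
    (10, "node-a"), (25, "node-b"), (40, "node-a"),
    (55, "node-c"), (70, "node-b"), (85, "node-c"),
]

def preference_list(start_idx: int, n: int = N):
    chosen = []
    for step in range(len(ring)):
        host = ring[(start_idx + step) % len(ring)][1]
        if host not in chosen:              # skip duplicate virtual nodes of the same host
            chosen.append(host)
        if len(chosen) == n:
            break
    return chosen

print(preference_list(0))   # -> ['node-a', 'node-b', 'node-c']
```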
Failures: Temporary
Nodes fail all the time in a cloud environment, for a variety of reasons
Oftentimes are only temporarily offline
When a node is unavailable, another node can accept the write
Enables "always-writeable" system
(Assuming 1+ nodes!)
The write is stored in a local buffer, tagged with a hint naming the intended node
Handed off once that node recovers (see the sketch below)
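A toy sketch of hinted handoff; the Node class and method names are mine, not Dynamo's code.

```python
import collections

# When the intended replica is down, a stand-in node accepts the write and
# keeps it in a local buffer tagged with a hint naming the intended node;
# the hint is replayed once that node is reachable again.

class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}                                   # key -> value
        self.hints = collections.defaultdict(list)       # intended node name -> [(key, value)]

    def write(self, key, value, intended=None):
        if intended is None or intended is self:
            self.data[key] = value
        else:
            self.hints[intended.name].append((key, value))   # buffer the write with a hint

    def replay_hints(self, nodes_by_name):
        for name, writes in list(self.hints.items()):
            target = nodes_by_name[name]
            if target.up:                                # deliver once the node is back
                for key, value in writes:
                    target.data[key] = value
                del self.hints[name]

a, b = Node("node-a"), Node("node-b")
b.up = False
a.write("cart:alice", b"v1", intended=b)    # b is down, so a holds the write plus a hint
b.up = True
a.replay_hints({"node-b": b})
print(b.data)                                # -> {'cart:alice': b'v1'}
```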
Failures: Permanent
Problem: Need to know when a node goes offline
All-to-all heartbeats are expensive (O(n^2) messages!)
Gossip Protocol
Every node tracks the overall cluster state
At every tick, randomly choose another node in the ring
Two nodes sync their cluster states
Leads to eventual failure propagation (toy round sketched below)
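A toy gossip round; the per-node state layout is an assumption made for the sketch.

```python
import random

# Each node keeps a view of the cluster as {node: (version, status)}; every
# tick it contacts one random peer and the pair merge views, keeping the
# entry with the higher version. News of a failure reaches everyone
# eventually, without O(n^2) heartbeats.

def merge(view_a, view_b):
    merged = dict(view_a)
    for node, (version, status) in view_b.items():
        if node not in merged or version > merged[node][0]:
            merged[node] = (version, status)
    return merged

members = ["A", "B", "C", "D"]
views = {n: {m: (0, "up") for m in members} for n in members}
views["A"]["C"] = (1, "down")                 # A noticed that C stopped responding

for tick in range(8):                         # a few random pairwise exchanges
    node = random.choice(members)
    peer = random.choice([p for p in members if p != node])
    views[node] = views[peer] = merge(views[node], views[peer])

print({n: views[n]["C"] for n in members})    # "down" spreads as nodes gossip
```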
Conflict Resolution
For an "always-writeable" system, conflict resolution cannot be done at write-time.
Dynamo uses read-time reconciliation instead
Weak consistency guarantees -> divergent object versions
Conflict Resolution: Vector Clocks
Dynamo uses vector clocks
Avoids wall-clock skew issues
A vector clock is a list of (node, counter) pairs (sketched below)
Each object version is immutable
"Application-assisted conflict resolution"
Conflict Resolution: Replica Synchronization
Problem: Recovering from a permanent failure or a network partition
Can be very expensive with significant divergence
Merkle trees simplify figuring out which parts of a key range differ (sketched below)
Each leaf is a hash of a key's value; parents hash their children
Minimizes data exchange, too!
Used to sync divergent replicas in the background.
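A simplified Merkle-tree sketch: one small tree, power-of-two leaf count, values hashed directly; the structure is my own illustration of the comparison idea.

```python
import hashlib

# Two replicas compare root hashes; only where hashes differ do they descend,
# so the data exchanged is proportional to the divergence, not the range size.

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(leaves):
    """Return a list of levels, leaves first, root last."""
    levels = [[h(v) for v in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff(tree_a, tree_b, level=None, idx=0):
    """Yield leaf indexes whose hashes differ between the two replicas."""
    if level is None:
        level = len(tree_a) - 1                       # start at the root
    if tree_a[level][idx] == tree_b[level][idx]:
        return
    if level == 0:
        yield idx
        return
    yield from diff(tree_a, tree_b, level - 1, 2 * idx)
    yield from diff(tree_a, tree_b, level - 1, 2 * idx + 1)

replica_a = [b"v0", b"v1", b"v2", b"v3"]
replica_b = [b"v0", b"v1", b"v2-new", b"v3"]
print(list(diff(build(replica_a), build(replica_b))))   # -> [2]
```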
Configurability
Some services need to be able to configure the trade-off between availability, consistency, cost-effectiveness, and performance.
Three parameters:
N: Length of the preference list
W: # nodes required for a write (must consider durability)
R: # nodes required for a read
Typical configuration: N=3, R=2, W=2
Very small!
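A one-liner's worth of sketch on why that configuration works; the helper name is mine.

```python
# When R + W > N, every read quorum overlaps every write quorum in at least
# one replica, so a read sees the most recent acknowledged write; lowering W
# trades consistency and durability for write latency and availability.

N, R, W = 3, 2, 2

def quorums_overlap(n: int, r: int, w: int) -> bool:
    return r + w > n

print(quorums_overlap(N, R, W))    # True:  2 + 2 > 3
print(quorums_overlap(3, 1, 1))    # False: a read may miss the newest write
```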
Tradeoffs
AP system (in CAP terms)
Hotspots can form on the (Chord-style) ring; Dynamo does not handle them
No relational query capabilities (unlike an RDBMS)
Movement to NewSQL
The full ring's metadata could become very large, and each node has to maintain this!
A hierarchical system could resolve some of these issues
Resources
Dynamo: Amazon's Highly Available Key-value Store [DeCandia et al.]
A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications [Werner Vogels]