Ólafur Helgason
VP of Engineering @ OZ
Papers We Love
Reykjavik University
19 Nov 2014
Why Chord?
- Super-cool and simple
- Theory & practice
- Influential paper & topic
- Personal reasons - nostalgia
Presentation overview
- Background
- Distributed Hash Tables
- Chord
- Lookup and churn in Chord
- Applications
- Discussions - Yes you!

Interest in Distributed Systems
Distributed Hash Table
- Same interface as normal hash table
- keys map to values
- Keys are assigned to nodes
- Node stores all values for its set of keys
- Hash table buckets are nodes in a network
- Chord one of the first implementations
- CAN, Tapestry, Pastry, Kademlia
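A minimal sketch of the interface described above: the same put/get calls as a normal hash table, with each bucket played by a "node". Everything here is hypothetical and in-process (the `ToyDHT` class, the node names, and the hash-modulo assignment rule are assumptions for illustration); a real DHT such as Chord assigns keys by ring position and talks over a network.

```python
import hashlib


class ToyDHT:
    """In-process stand-in for a DHT: each 'node' is just a dict."""

    def __init__(self, node_names):
        self.names = sorted(node_names)
        self.nodes = {name: {} for name in self.names}

    def node_for(self, key):
        # Placeholder assignment rule (hash modulo node count), assumed
        # only for illustration. Chord uses ring positions instead.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self.names[int.from_bytes(digest, "big") % len(self.names)]

    def put(self, key, value):
        # The node responsible for the key stores its value.
        self.nodes[self.node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self.node_for(key)].get(key)


dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.put("paper", "Chord")
```

Callers never name a node: `dht.get("paper")` goes to whichever node the key maps to. The modulo rule is the weak spot of this sketch, since adding or removing a node would remap most keys.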
Wtf do I care?
- Benefits
- Decentralized
- Scalability
- Fault tolerance
- Reliability
- Drawbacks/challenges
- Nodes leave/fail
- Nodes join
Chord: One ring to rule them all

lookup(key) -> node
One ring to rule them all
- Each node responsible for 1/n of keyspace
- Half of neighbour's keyspace on join
- Split keyspace on leave
- Simple lookup
- Each node queries neighbour
- Node state (successor) ~ O(1)
- Lookup ~ O(N)
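The simple lookup above can be sketched as a walk around the ring, where each node knows only its successor. The node IDs below are illustrative values chosen for an m=6 ring with 10 nodes, and the sorted `NODES` list stands in for per-node successor pointers; both are assumptions, not something the slides specify.

```python
M = 6
RING = 2 ** M  # identifiers live in [0, 2^M - 1]
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # illustrative node IDs


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b], allowing wrap-around."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def successor_pointer(n):
    # The only state a node keeps in the simple scheme: its next node.
    i = NODES.index(n)
    return NODES[(i + 1) % len(NODES)]


def simple_lookup(start, key):
    """Forward the query node by node until the key's owner is found."""
    n, hops = start, 0
    while not in_half_open(key, n, successor_pointer(n)):
        n = successor_pointer(n)
        hops += 1
    return successor_pointer(n), hops
```

With these IDs, `simple_lookup(8, 54)` returns `(56, 7)`: the query is forwarded seven times before it reaches the node that precedes the owner, which is the O(N) cost in practice.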
Example: m=6, 10 nodes, 5 keys

Simple lookup (non optimal)

Improve lookup performance
- Finger table
- At most m entries ~ log(N)
- keyspace [0, 2^m - 1]
- Entry i
- successor(n + 2^(i-1))
Finger table

Improved lookup
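A rough sketch of the finger table and the improved lookup, loosely following the find-successor / closest-preceding-finger structure of the Chord paper. The node IDs are illustrative, and the global `NODES` list replaces real network calls; both are assumptions made to keep the example self-contained.

```python
import bisect

M = 6
RING = 2 ** M
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])  # illustrative IDs


def successor(ident):
    """First node whose ID is >= ident, wrapping around the ring."""
    i = bisect.bisect_left(NODES, ident % RING)
    return NODES[i % len(NODES)]


def finger_table(n):
    # Entry i (1-based) points at successor(n + 2^(i-1)).
    return [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]


def between(x, a, b, inclusive_right=False):
    """x in the ring interval (a, b), or (a, b] if inclusive_right."""
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b


def closest_preceding(n, key):
    # Scan fingers from farthest to nearest for one that precedes the key.
    for f in reversed(finger_table(n)):
        if between(f, n, key):
            return f
    return n


def find_successor(n, key, hops=0):
    succ = finger_table(n)[0]  # first finger is the immediate successor
    if between(key, n, succ, inclusive_right=True):
        return succ, hops
    return find_successor(closest_preceding(n, key), key, hops + 1)
```

Here `finger_table(8)` is `[14, 14, 14, 21, 32, 42]`, and `find_successor(8, 54)` returns `(56, 2)`: the query jumps 8 -> 42 -> 51 and then resolves to 56, instead of walking every node in between.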

- Join
- new node (N26) finds its successor
- receives keys from successor
- each node periodically updates finger table
- Leave
- Transfer keys to successor & notify predecessor
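The key hand-off on join and leave can be sketched with a small in-memory ring. The `Ring` class and the node and key IDs are hypothetical, and the periodic finger-table updates are not modelled; only the transfer of keys between a node and its successor is shown.

```python
import bisect

M = 6
RING = 2 ** M


class Ring:
    def __init__(self):
        self.ids = []    # sorted node IDs
        self.store = {}  # node ID -> {key: value}

    def successor(self, ident):
        i = bisect.bisect_left(self.ids, ident % RING)
        return self.ids[i % len(self.ids)]

    def put(self, key, value):
        self.store[self.successor(key)][key] = value

    def join(self, new_id):
        # The new node first finds its successor among the existing nodes.
        succ = self.successor(new_id) if self.ids else None
        bisect.insort(self.ids, new_id)
        self.store[new_id] = {}
        if succ is not None:
            # It then receives the keys it is now responsible for.
            moved = [k for k in self.store[succ] if self.successor(k) == new_id]
            for k in moved:
                self.store[new_id][k] = self.store[succ].pop(k)

    def leave(self, node_id):
        self.ids.remove(node_id)
        keys = self.store.pop(node_id)
        if self.ids:
            # A departing node transfers its keys to its successor.
            self.store[self.successor(node_id)].update(keys)


ring = Ring()
for node in (21, 32):
    ring.join(node)
ring.put(24, "a")
ring.put(30, "b")
ring.join(26)  # key 24 moves from node 32 to node 26
```

After `ring.join(26)`, node 26 holds key 24 while node 32 keeps key 30. Calling `ring.leave(26)` would hand key 24 back to node 32.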

- Simple: lookup(key) -> node
- Scalable
- State per node ~ O(log(N))
- lookup performance ~ O(log(N))
- Provable correctness
- Even under churn
DHT Applications/implementations
- BitTorrent distributed tracker
- Overlay multicast
- Coral CDN
- Amazon Dynamo
- Cassandra, Riak
Amazon Dynamo
- Each virtual node gets random ring position
- Data assigned to vnode such that
- vnode = successor(key)
- Data replicated to N-1 next successors
- N is number of replicas
- Coordinator to 'tune' replication across different physical nodes
- Fault-tolerance
- Nodes can fail
- Can read and write under network partitioning
- Write not propagated to all nodes - inconsistency!
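A hedged sketch of the placement rule described above: virtual nodes at random ring positions, data stored at successor(key), and replicas on the next successors while skipping vnodes that belong to an already chosen physical node. The function names, the 32-bit ring, and the truncated MD5 hash are assumptions for illustration, not Dynamo's actual code.

```python
import bisect
import hashlib
import random

RING = 2 ** 32  # assumed ring size for this sketch


def ring_hash(key):
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_ring(physical_nodes, vnodes_per_node=4, seed=0):
    """Give every virtual node a distinct random position on the ring."""
    rng = random.Random(seed)
    count = len(physical_nodes) * vnodes_per_node
    positions = rng.sample(range(RING), count)
    owners = [p for p in physical_nodes for _ in range(vnodes_per_node)]
    return sorted(zip(positions, owners))


def preference_list(ring, key, n_replicas):
    """Physical nodes holding the key: successor(key), then next successors."""
    start = bisect.bisect_left(ring, (ring_hash(key), ""))
    chosen = []
    for step in range(len(ring)):
        _, owner = ring[(start + step) % len(ring)]
        if owner not in chosen:  # skip vnodes of an already chosen machine
            chosen.append(owner)
        if len(chosen) == n_replicas:
            break
    return chosen


ring = build_ring(["A", "B", "C", "D"])
replicas = preference_list(ring, "cart:42", 3)
```

With N = 3, `replicas` names three distinct physical nodes, so losing a single machine cannot remove every copy of the key.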
