Consensus algorithms in distributed systems

Piotr Grzesik

What is consensus?

Getting a group of processes to agree on a single value out of the values they proposed.

 

Consensus is fundamental to solving problems such as leader election, committing a transaction to a database, or performing clock synchronisation.

Potential problems

  • Network partitions
  • Lost or duplicated messages
  • Processes crashing
  • Byzantine failures
  • Storage failure

Properties of consensus protocols

  • Agreement - final decision of all non-faulty processes must be the same  
  • Termination - each non-faulty process must eventually decide on a value
  • Validity - final decision must be one of the values proposed by the processes (no default values)
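The three properties above can be written as simple checks over the outcome of one protocol run. This is a minimal sketch: `check_consensus` and its arguments are hypothetical names, and termination is only checked at a single point in time ("has everyone decided by now"), not in the "eventually" sense.

```python
def check_consensus(proposed, decisions):
    """Check one run of a consensus protocol.

    proposed:  set of values proposed by the processes
    decisions: dict mapping each non-faulty process to its decided
               value, or None if it has not decided yet
    """
    decided = [v for v in decisions.values() if v is not None]
    return {
        # Agreement: all decided values are identical.
        "agreement": len(set(decided)) <= 1,
        # Termination: every non-faulty process has decided.
        "termination": len(decided) == len(decisions),
        # Validity: every decided value was actually proposed.
        "validity": all(v in proposed for v in decided),
    }
```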

Consensus protocols

  • 2 phase commit
  • 3 phase commit
  • Paxos
  • Raft

2 Phase commit

A simple approach to solving the consensus problem. Consists of two distinct phases: propose and commit/abort.
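The two phases can be sketched in a few lines. This is an in-process illustration only, assuming no failures and no timeouts; `Participant`, `two_phase_commit` and the `can_commit` flag are hypothetical names, not part of any library.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # how this participant will vote
        self.state = "init"

    def prepare(self, txn):
        # Propose phase: vote yes (ready) or no (aborted).
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self, txn):
        self.state = "committed"

    def abort(self, txn):
        self.state = "aborted"


def two_phase_commit(participants, txn):
    # Phase 1 (propose): the coordinator collects a vote from everyone.
    votes = [p.prepare(txn) for p in participants]
    # Phase 2 (commit/abort): commit only if every vote was yes.
    if all(votes):
        for p in participants:
            p.commit(txn)
        return "committed"
    for p in participants:
        p.abort(txn)
    return "aborted"
```

A single "no" vote aborts the transaction for everyone. The sketch does not cover the hard part of 2PC, which is what happens when the coordinator or a participant crashes between the phases.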

2 Phase commit - propose

2 Phase commit - commit

Potential failures

  • Coordinator fails before propose phase
  • Coordinator fails during propose phase
  • Participant fails during propose phase
  • Participant fails during commit phase
  • Coordinator fails during commit phase

3 Phase commit

  • Improvement over 2PC at the cost of extra messages
  • Splits commit phase into prepare-to-commit and commit phases
  • A recovery node can take over if the coordinator crashes
  • Can survive the failure of a participant that committed/aborted the transaction (all other participants know the outcome)
  • Vulnerable to network partitions

3 Phase commit

https://en.wikipedia.org/wiki/Three-phase_commit_protocol#/media/File:Three-phase_commit_diagram.png

Raft

  • Proposed by Diego Ongaro and John Ousterhout in the paper "In Search of an Understandable Consensus Algorithm"
  • Designed as an easier-to-understand alternative to (Multi-)Paxos
  • Manages replicated log (replicated state machine)
  • Separates leader election and log replication
  • Needs a majority of nodes (floor(n/2) + 1) to operate; a 3-node cluster can tolerate the failure of 1 node
  • Time divided into terms (change of leader -> change of term)
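The majority requirement above is plain arithmetic. A short sketch (the function names are illustrative):

```python
def quorum(n):
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many nodes can fail while a majority is still available."""
    return n - quorum(n)
```

For example, `quorum(3)` is 2 and `tolerated_failures(3)` is 1. A 4-node cluster needs 3 nodes and still tolerates only 1 failure, which is why odd cluster sizes are the usual choice.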

Replicated state machine

Possible server states

  • Leader - handles client requests and is responsible for log replication
  • Follower - accepts replicated log entries from the leader
  • Candidate - stands for election as the new leader

Used RPCs

  • RequestVote - used by candidates during election
  • AppendEntries - used by the leader to replicate log entries; also serves as a heartbeat
  • (optional) InstallSnapshot - used by the leader to send a log snapshot to a follower

Leader election

  • Increment term
  • Switch to Candidate
  • Vote for itself
  • Send RequestVote RPC call to all other participants
  • Become leader on receiving votes from a majority
  • Become follower on receiving AppendEntries from a valid leader
  • Start a new election if no leader emerges (split vote)
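The election steps above can be sketched as a single in-process round. This is a simplification under stated assumptions: there are no timeouts or network, and the vote handler skips the check that the candidate's log is up to date, which real Raft requires. `Node`, `handle_request_vote` and `start_election` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None
    state: str = "follower"

    def handle_request_vote(self, term, candidate_id):
        # A higher term always wins: adopt it and step down.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = "follower"
        # Grant at most one vote per term.
        if term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def start_election(candidate, peers):
    candidate.current_term += 1               # increment term
    candidate.state = "candidate"             # switch to candidate
    candidate.voted_for = candidate.node_id   # vote for itself
    votes = 1
    for peer in peers:                        # RequestVote to all others
        if peer.handle_request_vote(candidate.current_term, candidate.node_id):
            votes += 1
    if votes >= (len(peers) + 1) // 2 + 1:    # majority of the cluster
        candidate.state = "leader"
    return votes
```

If the candidate does not reach a majority it stays in the candidate state. In real Raft it would then wait for a randomized timeout and start a new election in the next term.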

Log

  • Each participant stores log entries, each consisting of a log index, a term and a value (command)
  • Logs are stored durably
  • An entry is committed once it is durably stored on a majority of the participants
  • No gaps in the log - if a given entry is committed, all preceding entries are also committed
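The "stored on a majority" rule can be turned into a small calculation on the leader. This sketch assumes the leader knows, for each follower, the highest log index stored there; `majority_commit_index` is an illustrative name. Real Raft additionally requires that the entry at that index belongs to the leader's current term, which is omitted here.

```python
def majority_commit_index(follower_indexes, leader_last_index):
    """Highest log index that is stored on a majority of the cluster.

    follower_indexes:  highest stored log index of each follower
    leader_last_index: index of the last entry in the leader's own log
    """
    indexes = sorted(list(follower_indexes) + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th highest value is stored on at least `majority` nodes.
    return indexes[majority - 1]
```

With a leader at index 7 and followers at 7, 5, 5 and 3, the result is 5: four of the five nodes hold index 5, while only two hold index 6. Because logs have no gaps, everything up to index 5 is committed as well.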

Log replication

  • Client sends a command to the leader (a follower redirects the client to the leader)
  • Leader appends the command to its own log
  • Leader sends AppendEntries RPC call to all followers
  • On receiving responses from a majority, the leader passes the command to its state machine and returns the result to the client
  • Leader notifies followers of committed entries
  • Followers pass the command to their state machines
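The replication flow above can be sketched in-process, with a toy key-value store as the state machine. Assumptions: commands are `(key, value)` pairs, a follower's `reachable` flag stands in for network failures, and followers that fall behind are never caught up. `Server` and `replicate` are hypothetical names.

```python
class Server:
    def __init__(self):
        self.log = []          # list of (term, command)
        self.state = {}        # toy key-value state machine
        self.last_applied = 0  # number of log entries applied so far
        self.reachable = True

    def append_entries(self, prev_index, entries):
        # Reject if unreachable or if the log does not line up.
        if not self.reachable or len(self.log) != prev_index:
            return False
        self.log.extend(entries)
        return True

    def apply_up_to(self, commit_index):
        # Apply committed entries to the state machine, in log order.
        while self.last_applied < commit_index:
            _term, (key, value) = self.log[self.last_applied]
            self.state[key] = value
            self.last_applied += 1


def replicate(leader, followers, term, command):
    prev_index = len(leader.log)
    entry = (term, command)
    leader.log.append(entry)                  # leader appends to its own log
    acked = [f for f in followers if f.append_entries(prev_index, [entry])]
    if len(acked) + 1 < (len(followers) + 1) // 2 + 1:
        return None                           # no majority: not committed
    commit_index = len(leader.log)
    leader.apply_up_to(commit_index)          # leader applies the command
    for f in acked:                           # followers learn the commit
        f.apply_up_to(commit_index)
    return leader.state[command[0]]           # result returned to the client
```

When a majority cannot be reached, the entry stays in the leader's log but is not applied, and the client gets no result.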

Safety

  • If a command has been applied to a state machine at a given log index, no other state machine can apply a different command at that index
  • Entries must be committed before they are applied to the state machine
  • A leader never overwrites or deletes entries in its own log (it only appends)
  • If an entry is committed, it will be present in the logs of all future leaders
  • A leader may order a follower to overwrite its entries to 'fix' the follower's log
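The last point, a follower overwriting its own entries, can be sketched from the follower's side of AppendEntries. This follows the consistency check described in the Raft paper, reduced to a plain function; the signature is illustrative, and handling of terms and the commit index is left out.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower-side handling of AppendEntries.

    log:        follower's log, a list of (term, command); Raft index i
                is stored at log[i - 1]
    prev_index: index of the entry just before the new ones (0 = none)
    prev_term:  term of that entry on the leader
    entries:    new (term, command) entries sent by the leader
    Returns True if the entries were accepted.
    """
    # Consistency check: the follower must hold the preceding entry.
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False
    for offset, entry in enumerate(entries):
        pos = prev_index + offset
        if pos < len(log):
            if log[pos][0] != entry[0]:
                # Conflict: drop this entry and everything after it.
                del log[pos:]
                log.append(entry)
        else:
            log.append(entry)
    return True
```

On a rejection the leader retries with an earlier `prev_index` until the logs line up, then overwrites the follower's conflicting suffix. Only uncommitted entries can be lost this way, since committed entries are present in every future leader's log.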

Usage

  • Consul (https://www.consul.io/)
  • etcd (https://github.com/etcd-io/etcd)
  • rqlite (https://github.com/rqlite/rqlite)
  • Apache Ratis (http://ratis.incubator.apache.org/)

Demo

Sources

  1. https://www.the-paper-trail.org/post/2008-11-27-consensus-protocols-two-phase-commit/
  2. In Search of an Understandable Consensus Algorithm https://raft.github.io/raft.pdf
  3. https://www.the-paper-trail.org/post/2008-11-29-consensus-protocols-three-phase-commit/
  4. An Introduction to Raft (CoreOS Fest 2015) - https://www.youtube.com/watch?v=6bBggO6KN_k
  5. http://thesecretlivesofdata.com/raft
