Membership Changes

Evolving the cluster

Current State

How we handle peer changes.

PD

  • (Remove) Receives a command from pd-ctl.
  • (Add) Receives a command from a TiKV peer.
  • Sends instructions to each TiKV/Raft node.
  • If a new node is added, it is inserted as a learner.
message Message {
    MessageType msg_type = 1;
    ...
    repeated Entry entries = 7;
    ...
}

enum EntryType {
    EntryNormal = 0;
    EntryConfChange = 1;
}

TiKV

  • Receives instructions from PD.
  • Informs tooling/instrumentation as needed.
  • Relays the message to Raft.

Raft

  • Receives command via TiKV.
  • Forcibly takes action without coordination.
/// Takes the conf change and applies it.
pub fn apply_conf_change(&mut self, cc: &ConfChange) -> ConfState {
    // ...
    match cc.get_change_type() {
        ConfChangeType::AddNode => self.raft.add_node(nid),
        ConfChangeType::AddLearnerNode => self.raft.add_learner(nid),
        ConfChangeType::RemoveNode => self.raft.remove_node(nid),
    }
    // ...
}

pub fn remove_node(&mut self, id: u64) {
    self.mut_prs().remove(id);
    // ...
}

Not Joint Consensus

Not Good Enough:

How the current system can fail.

Example of How It Fails

Add, then Remove:

 

IDC 1 fails after the Add but before the Remove. A and D are both down, so only B and C remain out of four voters. Quorum fails and the cluster pauses until IDC 1 is back up.

Remove, then Add:

 

IDC 2 fails after the Remove but before the Add, leaving only C out of the two voters B and C. Quorum fails and the cluster pauses.

Nodes:  A  B  C  D    Current: ABC
IDCs:   1  2  3  1    To replace: A with D

There are others...

Lurking in the shadows.

We can stop them.

Joint Consensus

Problem

Adding and removing Raft peers poses several potential problems.

 

Primarily:

How can pauses in the cluster be prevented?

Solution

The Raft paper describes a process called Joint Consensus.

 

TL;DR:

It involves using a union of both the old peer set and the new peer set temporarily.

A Corner with a Joint

We allow the leader to receive a command to change the peer set.

 

It enters the union state, then distributes the log message to followers.

Each of them then enters the union state as it learns of it.

 

After the union is committed (and learners have caught up), the new configuration is applied the same way.

C_{old} \rightarrow C_{old \cup new} \rightarrow C_{new}
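
As a rough illustration of the three stages, the sketch below computes the peer set that receives log entries at each step. `PeerSet`, `joint_peers`, and `transition` are hypothetical names chosen for this example and are not part of the raft crate.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a set of peer IDs.
type PeerSet = BTreeSet<u64>;

/// Peers that take part during the transition: the union of C(old) and C(new).
fn joint_peers(old: &PeerSet, new: &PeerSet) -> PeerSet {
    old.union(new).copied().collect()
}

/// The three peer sets a change passes through, in order:
/// C(old), then the union of old and new, then C(new).
fn transition(old: &PeerSet, new: &PeerSet) -> [PeerSet; 3] {
    [old.clone(), joint_peers(old, new), new.clone()]
}
```

For example, replacing peer 1 with peer 4 in a cluster of peers 1, 2, 3 passes through the intermediate set 1, 2, 3, 4 before settling on 2, 3, 4.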

The Joint Consensus State

Log entries are still replicated to all servers.

 

Agreement for both elections and entry commits requires separate majorities from both C(new) and C(old) clusters.

Still services requests.

 

Individual peers transition at different times.

 

No leadership restrictions: a peer from either configuration may serve as leader.
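
The "separate majorities" rule is the part that differs most from a plain majority over the union. Here is a minimal sketch of that check, assuming peers are identified by `u64` IDs; `majority` and `joint_agreement` are hypothetical helpers rather than functions from the raft crate.

```rust
use std::collections::HashSet;

/// True if `acks` contains a strict majority of `voters`.
fn majority(voters: &HashSet<u64>, acks: &HashSet<u64>) -> bool {
    voters.intersection(acks).count() > voters.len() / 2
}

/// Joint rule: a vote or commit needs a majority of C(old)
/// AND a majority of C(new), each counted separately.
fn joint_agreement(
    old: &HashSet<u64>,
    new: &HashSet<u64>,
    acks: &HashSet<u64>,
) -> bool {
    majority(old, acks) && majority(new, acks)
}
```

Note that four acknowledgements out of six peers in the union are not enough if they come from, say, all three peers of C(old) and only one of C(new). Counting the two sets separately is what stops either configuration from making decisions on its own.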

Other Problems

Solved by the Joint

Catching Up New Nodes

If a node is added without any log entries it may take quite some time before it catches up, which could potentially render the cluster unable to make progress.

 

This is resolved via the learner state: nodes in this state are not considered for majorities.
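
A small sketch of why the learner state helps, under the assumption that each peer tracks the highest log index it has acknowledged. `Role`, `Peer`, and `committed` are hypothetical names for illustration only.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role {
    Voter,
    Learner,
}

struct Peer {
    role: Role,
    /// Highest log index this peer is known to have replicated.
    matched: u64,
}

/// An entry at `index` is committed once a strict majority of voters
/// has replicated it. Learners receive entries but are never counted.
fn committed(peers: &[Peer], index: u64) -> bool {
    let voters = peers.iter().filter(|p| p.role == Role::Voter).count();
    let acks = peers
        .iter()
        .filter(|p| p.role == Role::Voter && p.matched >= index)
        .count();
    acks > voters / 2
}
```

With three voters, two of which have replicated index 10, plus one empty new node, the entry commits if the new node is a learner (2 of 3 voters). If the same node were added directly as a voter, only 2 of 4 would have the entry and the commit would stall.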

 

Leader not in C(new)

It is possible that the peer which is currently leader in C(old) will not be part of C(new). This means at some point during C(old, new) the leader will be managing a cluster it is not a part of until the C(new) entry is committed.

 

During this time it replicates log entries, but does not count itself in majorities. When C(new) is committed a leadership transfer can occur.

 

Removed Peers are Disruptive

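When a server is removed it no longer receives heartbeats, so it will time out and try to start a new election with a new term. Though it cannot win, the higher term forces the current leader to step down, and this will cause a new election.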

 

In order to prevent this, peers will not service RequestVote RPCs they receive within the minimum election timeout of hearing from their current leader.
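
The guard described above might look roughly like the following. This is a sketch under assumptions: the `Follower` struct, its fields, and `should_service_vote` are hypothetical and do not reflect how the raft crate tracks time.

```rust
use std::time::{Duration, Instant};

/// Hypothetical view of the state a peer needs for this check.
struct Follower {
    /// When this peer last heard from its current leader.
    last_leader_contact: Instant,
    /// Lower bound of the randomized election timeout.
    min_election_timeout: Duration,
}

impl Follower {
    /// Returns true if a RequestVote arriving at `now` should be processed.
    /// Requests that arrive while the leader is still believed to be alive
    /// are ignored, so a removed peer cannot depose a healthy leader.
    fn should_service_vote(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_leader_contact)
            >= self.min_election_timeout
    }
}
```

With a minimum election timeout of 150 ms, a RequestVote that arrives 50 ms after the last heartbeat is dropped, while one arriving 200 ms after it is handled as usual.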

Moving Forward

Joint Consensus for Us

Raft

  • Add a new variant: ConfChangeType::SetNodes.
  • Have ConfChange hold a ConfState.
  • Add logic to calculate C(old, new) from C(old) and C(new).
  • Add mechanism to append C(new) when a C(old, new) is committed and any added learners are promoted.
  • Add way to remember the intended C(new).
  • Design way to react to ConfChange at the appropriate time.

Raft

message ConfState {
    repeated uint64 nodes = 1;
    repeated uint64 learners = 2;
}

enum ConfChangeType {
    AddNode    = 0;
    RemoveNode = 1;
    AddLearnerNode = 2;
    SetNodes = 3;
}

message ConfChange {
    uint64 id = 1;
    ConfChangeType change_type = 2;
    // Used for add/remove
    uint64 node_id = 3;
    bytes context = 4;
    // Used for set
    ConfState new_state = 5;
}

TiKV

  • Validate new Raft commands are not obstructed.
  • Add new metrics.
  • (Optional) Add configuration change to tikv-ctl.

PD

  • Migrate to calling new SetNodes command instead of AddNode and RemoveNode.

Joint Consensus

By hoverbear
