message Message {
    MessageType msg_type = 1;
    ...
    repeated Entry entries = 7;
    ...
}

enum EntryType {
    EntryNormal = 0;
    EntryConfChange = 1;
}
/// Takes the conf change and applies it.
pub fn apply_conf_change(&mut self, cc: &ConfChange) -> ConfState {
    // ...
    // `nid` is the target peer's ID, taken from `cc` (elided above).
    match cc.get_change_type() {
        ConfChangeType::AddNode => self.raft.add_node(nid),
        ConfChangeType::AddLearnerNode => self.raft.add_learner(nid),
        ConfChangeType::RemoveNode => self.raft.remove_node(nid),
    }
    // ...
}

pub fn remove_node(&mut self, id: u64) {
    self.mut_prs().remove(id);
    // ...
}
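To tie these pieces together, here is a rough sketch of how a single conf change might flow through a RawNode: propose it, wait for its entry to commit, then apply it. The helper names, error handling, and the committed-entry loop are assumptions for illustration, and the generated setters assume the rust-protobuf codegen; details vary by raft-rs version.

use raft::eraftpb::{ConfChange, ConfChangeType, Entry, EntryType};
use raft::{RawNode, Storage};

// Sketch: propose adding a peer as a voter. The change is replicated like any
// other log entry, tagged as EntryConfChange.
fn add_peer<S: Storage>(node: &mut RawNode<S>, id: u64) -> raft::Result<()> {
    let mut cc = ConfChange::default();
    cc.set_change_type(ConfChangeType::AddNode);
    cc.set_node_id(id);
    node.propose_conf_change(vec![], cc)
}

// Sketch: once entries are committed, pick out the conf changes and apply them.
fn apply_committed<S: Storage>(node: &mut RawNode<S>, committed: &[Entry]) {
    for entry in committed {
        if entry.get_entry_type() == EntryType::EntryConfChange {
            let cc: ConfChange =
                protobuf::parse_from_bytes(entry.get_data()).expect("valid ConfChange");
            // Mutates the peer set (add_node / add_learner / remove_node above)
            // and returns the new ConfState, which should be persisted.
            let _conf_state = node.apply_conf_change(&cc);
        }
    }
}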
Nodes: A B C D
IDCs:  1 2 3 1
Current configuration: A B C
To replace: A with D

Add, then Remove:
If IDC 1 fails before the Remove, the configuration is A B C D with both A and D down. Quorum fails, and the cluster pauses until IDC 1 is back up.

Remove, then Add:
If IDC 2 fails before the Add, the configuration is B C with only C left. Quorum fails, and the cluster pauses.
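A quick way to see why both orderings pause is to count votes: a simple majority needs strictly more than half of the configured voters reachable. A small illustration (not raft-rs code; peers are numbered A=1, B=2, C=3, D=4 as in the setup above):

use std::collections::HashSet;

fn has_quorum(voters: &HashSet<u64>, alive: &HashSet<u64>) -> bool {
    // Simple majority: strictly more than half of the voters must be reachable.
    voters.intersection(alive).count() > voters.len() / 2
}

fn main() {
    let abcd: HashSet<u64> = [1, 2, 3, 4].iter().cloned().collect();
    let bc: HashSet<u64> = [2, 3].iter().cloned().collect();
    let c: HashSet<u64> = [3].iter().cloned().collect();

    // Add first: voters = {A, B, C, D}; IDC 1 (A and D) fails -> only B, C left.
    assert!(!has_quorum(&abcd, &bc)); // 2 of 4 is not a majority.

    // Remove first: voters = {B, C}; IDC 2 (B) fails -> only C left.
    assert!(!has_quorum(&bc, &c)); // 1 of 2 is not a majority.
}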
Adding and removing Raft peers poses several potential problems.
Primarily:
How can pauses in the cluster be prevented?
The Raft paper describes a process called Joint Consensus.
TL;DR:
It temporarily uses the union of the old peer set and the new peer set, C(old, new).
The leader receives a command to change the peer set.
It enters the union configuration, then replicates that configuration entry to its followers.
Each follower enters the union configuration as it learns of it.
After the union entry is committed (and any learners have caught up), the new configuration C(new) is applied the same way.
Log entries are still replicated to all servers in both configurations.
Agreement, for both elections and entry commits, requires separate majorities from both C(old) and C(new) (see the sketch below).
The cluster still services requests throughout.
Individual peers transition at different times.
There are no leadership restrictions: any server from either configuration may serve as leader.
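The dual-majority rule is the heart of Joint Consensus and is easy to state directly. A minimal sketch of the check (illustrative only, not the raft-rs implementation):

use std::collections::HashSet;

fn majority(voters: &HashSet<u64>, acks: &HashSet<u64>) -> bool {
    voters.intersection(acks).count() > voters.len() / 2
}

/// During C(old, new), an entry commits (and a candidate wins an election) only
/// if it is acknowledged by a majority of C(old) AND a majority of C(new).
fn joint_agreement(c_old: &HashSet<u64>, c_new: &HashSet<u64>, acks: &HashSet<u64>) -> bool {
    majority(c_old, acks) && majority(c_new, acks)
}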
If a node is added with no log entries, it may take quite some time to catch up, and counting it toward majorities during that window could leave the cluster unable to make progress.
This is resolved via the learner state: nodes in this state receive the log but are not counted towards majorities.
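Concretely, a learner receives log entries but never affects the quorum; promoting it (an AddNode on an existing learner) is what moves it into the voter set. A toy sketch of that separation (assumed names, not raft-rs's actual ProgressSet):

use std::collections::HashSet;

struct Peers {
    voters: HashSet<u64>,
    learners: HashSet<u64>,
}

impl Peers {
    // Quorum is computed over voters only; learners never affect it.
    fn quorum(&self) -> usize {
        self.voters.len() / 2 + 1
    }

    // Promoting a caught-up learner is just a move between the two sets.
    fn promote(&mut self, id: u64) {
        if self.learners.remove(&id) {
            self.voters.insert(id);
        }
    }
}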
It is possible that the peer which is currently leader in C(old) will not be part of C(new). This means that during C(old, new) the leader is managing a cluster it is not a part of, until the C(new) entry is committed.
During this time it replicates log entries but does not count itself in majorities. When C(new) is committed, a leadership transfer can occur.
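In raft-rs terms, one way to handle this on the leader is to check the ConfState returned by apply_conf_change and hand off leadership if it no longer contains the local peer. The surrounding names (node, my_id) and the choice of transferee below are assumptions for illustration, and the exact methods vary by raft-rs version.

// Sketch: run on the leader when the C(new) entry is applied.
let conf_state = node.apply_conf_change(&cc);
if !conf_state.get_nodes().contains(&my_id) {
    // This peer replicated C(new) but is no longer a voter: pick any remaining
    // voter and transfer leadership before stepping aside.
    if let Some(&target) = conf_state.get_nodes().first() {
        node.transfer_leader(target);
    }
}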
When a server is removed, it no longer receives heartbeats, so it will time out and start new elections with new terms. This can force the current leader to step down and trigger needless elections.
In order to prevent this, peers will not service RequestVote RPCs they receive within the minimum election timeout of hearing from their current leader.
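That guard boils down to comparing the time since the peer last heard from its leader with the minimum election timeout. A tiny sketch (names are made up for illustration):

fn should_reject_request_vote(ticks_since_leader_contact: u64, min_election_timeout: u64) -> bool {
    // If we heard from a live leader within the minimum election timeout,
    // ignore the RequestVote: a removed or partitioned peer cannot disrupt us.
    ticks_since_leader_contact < min_election_timeout
}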
message ConfState {
    repeated uint64 nodes = 1;
    repeated uint64 learners = 2;
}

enum ConfChangeType {
    AddNode = 0;
    RemoveNode = 1;
    AddLearnerNode = 2;
    SetNodes = 3;
}
message ConfChange {
    uint64 id = 1;
    ConfChangeType change_type = 2;
    // Used for add/remove.
    uint64 node_id = 3;
    bytes context = 4;
    // Used for set.
    ConfState new_state = 5;
}