Membership Changes
Evolving the cluster
Current State
How we handle peer changes.
PD
- (Remove) Receives a command from pd-ctl.
- (Add) Receives a command from a TiKV peer.
- Sends instructions to each TiKV/Raft node.
- If a new node is added, it is inserted as a learner.
message Message {
    MessageType msg_type = 1;
    ...
    repeated Entry entries = 7;
    ...
}
enum EntryType {
    EntryNormal = 0;
    EntryConfChange = 1;
}
TiKV
- Receives instructions from PD.
- Informs tooling/instrumentation as needed.
- Relays message to raft.
Raft
- Receives command via TiKV.
- Forcibly takes action without coordination.
/// Takes the conf change and applies it.
pub fn apply_conf_change(&mut self, cc: &ConfChange) -> ConfState {
    // ...
    match cc.get_change_type() {
        ConfChangeType::AddNode => self.raft.add_node(nid),
        ConfChangeType::AddLearnerNode => self.raft.add_learner(nid),
        ConfChangeType::RemoveNode => self.raft.remove_node(nid),
    }
    // ...
}
pub fn remove_node(&mut self, id: u64) {
    self.mut_prs().remove(id);
    // ...
}
Not Joint Consensus
Not Good Enough:
How the current system can fail.
Example of How It Fails
Add, then Remove:
IDC 1 fails before Remove, quorum fails, cluster pauses until IDC 1 is back up.
Remove, then Add:
IDC 2 fails before Add, leaving only C. Quorum fails, cluster pauses.
Nodes:      A  B  C  D
IDCs:       1  2  3  1
Current:    A, B, C
To replace: A with D
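Both failure cases above come down to simple majority arithmetic. The following is a minimal sketch of that check, not code from raft-rs or TiKV: the function names and the `HashMap` placement model are hypothetical, and peers are identified by the letters used in the example.

```rust
use std::collections::HashMap;

/// Returns true if the peers that survive the loss of `failed_idc` still
/// form a strict majority of `peers`.
/// `idc_of` maps each node to its data center (hypothetical model);
/// the lookup panics if a peer is missing from the map.
fn has_quorum_after_failure(
    peers: &[char],
    idc_of: &HashMap<char, u8>,
    failed_idc: u8,
) -> bool {
    let alive = peers.iter().filter(|p| idc_of[*p] != failed_idc).count();
    alive > peers.len() / 2
}

/// The placement from the example: A and D share IDC 1.
fn example_placement() -> HashMap<char, u8> {
    [('A', 1u8), ('B', 2), ('C', 3), ('D', 1)]
        .iter()
        .cloned()
        .collect()
}
```

With this model, `has_quorum_after_failure(&['A', 'B', 'C', 'D'], &example_placement(), 1)` is false (2 of 4 peers survive, 3 are needed), and `has_quorum_after_failure(&['B', 'C'], &example_placement(), 2)` is false (1 of 2 survives, 2 are needed).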
There are others...
Lurking in the shadows.
We can stop them.
Joint Consensus
Problem
Adding and removing Raft peers poses several potential problems.
Primarily:
How can pauses in the cluster be prevented?
Solution
The Raft paper describes a process called Joint Consensus.
TL;DR:
It involves using a union of both the old peer set and the new peer set temporarily.
A Corner with a Joint
We allow the leader to receive a command to change the peer set.
It enters the union state, then distributes the log message to followers.
Each of them then enters the union state as it learns of it.
After the union is committed (and learners have caught up), the new configuration is applied the same way.
The Joint Consensus State
Log entries are still replicated to all servers.
Agreement for both elections and entry commits requires separate majorities from both C(new) and C(old) clusters.
Still services requests.
Individual peers transition at different times.
No leadership restrictions
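The agreement rule for the joint state can be sketched as two independent majority checks. This is an illustrative sketch only: `majority`, `joint_majority`, and `set` are hypothetical helpers, not raft-rs APIs, and peers are plain `u64` IDs.

```rust
use std::collections::HashSet;

/// True if `acks` contains a strict majority of `config`.
fn majority(config: &HashSet<u64>, acks: &HashSet<u64>) -> bool {
    config.intersection(acks).count() > config.len() / 2
}

/// Joint rule: agreement needs separate majorities of C(old) and C(new).
fn joint_majority(
    old: &HashSet<u64>,
    new: &HashSet<u64>,
    acks: &HashSet<u64>,
) -> bool {
    majority(old, acks) && majority(new, acks)
}

/// Small helper to build a peer set from a slice of IDs.
fn set(ids: &[u64]) -> HashSet<u64> {
    ids.iter().copied().collect()
}
```

Taking A, B, C, D as IDs 1 to 4, replacing A with D gives C(old) = {1, 2, 3} and C(new) = {2, 3, 4}. If IDC 1 fails, only B and C respond, and `joint_majority(&set(&[1, 2, 3]), &set(&[2, 3, 4]), &set(&[2, 3]))` is still true, so under this model the transition tolerates the failure that paused the cluster in the earlier example.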
Other Problems
Solved by the Joint
Catching Up New Nodes
If a node is added without any log entries it may take quite some time before it catches up, which could potentially render the cluster unable to make progress.
This is resolved via the learner state: nodes in this state are not considered for majorities.
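One way to picture the effect of the learner state is a commit index calculation that skips learners. The sketch below is an assumption-laden simplification: `Progress` here is a hypothetical two-field struct, not the raft-rs type, and the rule that a leader only commits entries from its own term is ignored.

```rust
/// Replication progress of one peer (hypothetical, simplified type).
struct Progress {
    /// Highest log index known to be stored on this peer.
    matched: u64,
    /// Learners receive entries but do not count toward majorities.
    is_learner: bool,
}

/// Highest log index stored on a strict majority of voters.
/// Learners are filtered out before the majority is computed.
fn committed_index(peers: &[Progress]) -> u64 {
    let mut matched: Vec<u64> = peers
        .iter()
        .filter(|p| !p.is_learner)
        .map(|p| p.matched)
        .collect();
    if matched.is_empty() {
        return 0;
    }
    // Sort descending; the entry at position len / 2 is held by a majority.
    matched.sort_unstable_by(|a, b| b.cmp(a));
    matched[matched.len() / 2]
}
```

With three voters at indexes 10, 10, and 9 plus a fresh learner at index 0, this returns 10. If the new node had been added as a full voter instead, the same calculation would return 9 until it caught up.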
Leader not in C(new)
It is possible that the peer which is currently leader in C(old) will not be part of C(new). This means at some point during C(old, new) the leader will be managing a cluster it is not a part of until the C(new) entry is committed.
During this time it replicates log entries, but does not count itself in majorities. When C(new) is committed a leadership transfer can occur.
Removed Peers are Disruptive
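When a server is removed it stops receiving heartbeats, so it will time out and try to start a new election with a new term. The removed server cannot win, but its higher term still forces the current leader to step down and causes a new election.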
In order to prevent this, peers will not service RequestVote RPCs they receive within the minimum election timeout of hearing from their current leader.
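The RequestVote guard can be sketched as a single time comparison. The function name and parameters below are hypothetical and only illustrate the rule stated above; they are not taken from raft-rs.

```rust
use std::time::{Duration, Instant};

/// Returns true if a RequestVote RPC arriving at `now` should be ignored
/// because this peer heard from its current leader less than
/// `min_election_timeout` ago. A peer that has never heard from a leader
/// (`None`) always services the request.
fn should_ignore_vote_request(
    last_leader_contact: Option<Instant>,
    now: Instant,
    min_election_timeout: Duration,
) -> bool {
    match last_leader_contact {
        Some(heard) => now.saturating_duration_since(heard) < min_election_timeout,
        None => false,
    }
}
```

A removed peer that keeps timing out and asking for votes is therefore ignored for as long as the remaining peers receive regular heartbeats from their leader.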
Moving Forward
Joint Consensus for Us
Raft
- Add a new variant: ConfChangeType::SetNodes.
- Have ConfChange hold a ConfState.
- Add logic to calculate C(old, new) from C(old) and C(new).
- Add mechanism to append C(new) when a C(old, new) is committed and any added learners are promoted.
- Add way to remember the intended C(new).
- Design way to react to ConfChange at the appropriate time.
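As a rough sketch of the "calculate C(old, new)" and "remember the intended C(new)" items, the joint state could keep both voter sets side by side and derive everything else from them. `JointConfig` and its methods are hypothetical names for illustration, not a proposed raft-rs API.

```rust
use std::collections::BTreeSet;

/// C(old, new): both voter sets are kept so that C(new) is remembered
/// until the joint entry is committed (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct JointConfig {
    old_voters: BTreeSet<u64>,
    new_voters: BTreeSet<u64>,
}

impl JointConfig {
    fn new(old: &[u64], new: &[u64]) -> Self {
        JointConfig {
            old_voters: old.iter().copied().collect(),
            new_voters: new.iter().copied().collect(),
        }
    }

    /// Every peer that must receive log entries during the transition.
    fn replication_targets(&self) -> BTreeSet<u64> {
        self.old_voters.union(&self.new_voters).copied().collect()
    }

    /// Peers in C(new) but not in C(old); these would start as learners.
    fn added(&self) -> BTreeSet<u64> {
        self.new_voters.difference(&self.old_voters).copied().collect()
    }

    /// Peers in C(old) but not in C(new); dropped once C(new) is applied.
    fn removed(&self) -> BTreeSet<u64> {
        self.old_voters.difference(&self.new_voters).copied().collect()
    }
}
```

For `JointConfig::new(&[1, 2, 3], &[2, 3, 4])`, the replication targets are {1, 2, 3, 4}, the added set is {4}, and the removed set is {1}.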
Raft
message ConfState {
    repeated uint64 nodes = 1;
    repeated uint64 learners = 2;
}
enum ConfChangeType {
    AddNode = 0;
    RemoveNode = 1;
    AddLearnerNode = 2;
    SetNodes = 3;
}
message ConfChange {
    uint64 id = 1;
    ConfChangeType change_type = 2;
    // Used for add/remove
    uint64 node_id = 3;
    bytes context = 4;
    // Used for set (field numbers must be unique within a message)
    ConfState new_state = 5;
}
TiKV
- Validate new Raft commands are not obstructed.
- Add new metrics.
- (Optional) Add configuration change to tikv-ctl.
PD
- Migrate to calling new SetNodes command instead of AddNode and RemoveNode.
Joint Consensus
By hoverbear