message Message {
    MessageType msg_type = 1;
    ...
    repeated Entry entries = 7;
    ...
}

enum EntryType {
    EntryNormal = 0;
    EntryConfChange = 1;
}
/// Takes the conf change and applies it.
pub fn apply_conf_change(&mut self, cc: &ConfChange) -> ConfState {
    // ...
    // `nid` is the target peer's ID, taken from `cc` (elided above).
    match cc.get_change_type() {
        ConfChangeType::AddNode => self.raft.add_node(nid),
        ConfChangeType::AddLearnerNode => self.raft.add_learner(nid),
        ConfChangeType::RemoveNode => self.raft.remove_node(nid),
    }
    // ...
}

pub fn remove_node(&mut self, id: u64) {
    self.mut_prs().remove(id);
    // ...
}
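To tie these pieces together, here is a rough sketch of how a single conf change might flow through a RawNode: propose it, wait for its entry to commit, then apply it. The helper names, error handling, and the committed-entry loop are assumptions for illustration, and the generated setters assume the rust-protobuf codegen; details vary by raft-rs version.

use raft::eraftpb::{ConfChange, ConfChangeType, Entry, EntryType};
use raft::{RawNode, Storage};

// Sketch: propose adding a peer as a voter. The change is replicated like any
// other log entry, tagged as EntryConfChange.
fn add_peer<S: Storage>(node: &mut RawNode<S>, id: u64) -> raft::Result<()> {
    let mut cc = ConfChange::default();
    cc.set_change_type(ConfChangeType::AddNode);
    cc.set_node_id(id);
    node.propose_conf_change(vec![], cc)
}

// Sketch: once entries are committed, pick out the conf changes and apply them.
fn apply_committed<S: Storage>(node: &mut RawNode<S>, committed: &[Entry]) {
    for entry in committed {
        if entry.get_entry_type() == EntryType::EntryConfChange {
            let cc: ConfChange =
                protobuf::parse_from_bytes(entry.get_data()).expect("valid ConfChange");
            // Mutates the peer set (add_node / add_learner / remove_node above)
            // and returns the new ConfState, which should be persisted.
            let _conf_state = node.apply_conf_change(&cc);
        }
    }
}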
Nodes: A B C D
IDCs:  1 2 3 1
Current configuration: A B C
To replace: A with D

Add, then Remove:
If IDC 1 fails before the Remove, the configuration is A B C D with both A and D down. Quorum fails, and the cluster pauses until IDC 1 is back up.

Remove, then Add:
If IDC 2 fails before the Add, the configuration is B C with only C left. Quorum fails, and the cluster pauses.
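A quick way to see why both orderings pause is to count votes: a simple majority needs strictly more than half of the configured voters reachable. A small illustration (not raft-rs code; peers are numbered A=1, B=2, C=3, D=4 as in the setup above):

use std::collections::HashSet;

fn has_quorum(voters: &HashSet<u64>, alive: &HashSet<u64>) -> bool {
    // Simple majority: strictly more than half of the voters must be reachable.
    voters.intersection(alive).count() > voters.len() / 2
}

fn main() {
    let abcd: HashSet<u64> = [1, 2, 3, 4].iter().cloned().collect();
    let bc: HashSet<u64> = [2, 3].iter().cloned().collect();
    let c: HashSet<u64> = [3].iter().cloned().collect();

    // Add first: voters = {A, B, C, D}; IDC 1 (A and D) fails -> only B, C left.
    assert!(!has_quorum(&abcd, &bc)); // 2 of 4 is not a majority.

    // Remove first: voters = {B, C}; IDC 2 (B) fails -> only C left.
    assert!(!has_quorum(&bc, &c)); // 1 of 2 is not a majority.
}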
Adding and removing Raft peers poses several potential problems.
Primarily:
How can pauses in the cluster be prevented?
The Raft paper describes a process called Joint Consensus.
TL;DR:
It temporarily uses the union of the old peer set and the new peer set, C(old, new).
The leader receives a command to change the peer set.
It enters the union configuration, then replicates that configuration entry to its followers.
Each follower enters the union configuration as it learns of it.
After the union entry is committed (and any learners have caught up), the new configuration C(new) is applied the same way.
Log entries are still replicated to all servers in both configurations.
Agreement, for both elections and entry commits, requires separate majorities from both C(old) and C(new) (see the sketch below).
The cluster still services requests throughout.
Individual peers transition at different times.
There are no leadership restrictions: any server from either configuration may serve as leader.
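The dual-majority rule is the heart of Joint Consensus and is easy to state directly. A minimal sketch of the check (illustrative only, not the raft-rs implementation):

use std::collections::HashSet;

fn majority(voters: &HashSet<u64>, acks: &HashSet<u64>) -> bool {
    voters.intersection(acks).count() > voters.len() / 2
}

/// During C(old, new), an entry commits (and a candidate wins an election) only
/// if it is acknowledged by a majority of C(old) AND a majority of C(new).
fn joint_agreement(c_old: &HashSet<u64>, c_new: &HashSet<u64>, acks: &HashSet<u64>) -> bool {
    majority(c_old, acks) && majority(c_new, acks)
}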
If a node is added with no log entries, it may take quite some time to catch up, and counting it toward majorities during that window could leave the cluster unable to make progress.
This is resolved via the learner state: nodes in this state receive the log but are not counted towards majorities.
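Concretely, a learner receives log entries but never affects the quorum; promoting it (an AddNode on an existing learner) is what moves it into the voter set. A toy sketch of that separation (assumed names, not raft-rs's actual ProgressSet):

use std::collections::HashSet;

struct Peers {
    voters: HashSet<u64>,
    learners: HashSet<u64>,
}

impl Peers {
    // Quorum is computed over voters only; learners never affect it.
    fn quorum(&self) -> usize {
        self.voters.len() / 2 + 1
    }

    // Promoting a caught-up learner is just a move between the two sets.
    fn promote(&mut self, id: u64) {
        if self.learners.remove(&id) {
            self.voters.insert(id);
        }
    }
}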
It is possible that the peer which is currently leader in C(old) will not be part of C(new). This means that during C(old, new) the leader is managing a cluster it is not a part of, until the C(new) entry is committed.
During this time it replicates log entries but does not count itself in majorities. When C(new) is committed, a leadership transfer can occur.
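In raft-rs terms, one way to handle this on the leader is to check the ConfState returned by apply_conf_change and hand off leadership if it no longer contains the local peer. The surrounding names (node, my_id) and the choice of transferee below are assumptions for illustration, and the exact methods vary by raft-rs version.

// Sketch: run on the leader when the C(new) entry is applied.
let conf_state = node.apply_conf_change(&cc);
if !conf_state.get_nodes().contains(&my_id) {
    // This peer replicated C(new) but is no longer a voter: pick any remaining
    // voter and transfer leadership before stepping aside.
    if let Some(&target) = conf_state.get_nodes().first() {
        node.transfer_leader(target);
    }
}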
When a server is removed, it no longer receives heartbeats, so it will time out and start new elections with new terms. This can force the current leader to step down and trigger needless elections.
In order to prevent this, peers will not service RequestVote RPCs they receive within the minimum election timeout of hearing from their current leader.
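That guard boils down to comparing the time since the peer last heard from its leader with the minimum election timeout. A tiny sketch (names are made up for illustration):

fn should_reject_request_vote(ticks_since_leader_contact: u64, min_election_timeout: u64) -> bool {
    // If we heard from a live leader within the minimum election timeout,
    // ignore the RequestVote: a removed or partitioned peer cannot disrupt us.
    ticks_since_leader_contact < min_election_timeout
}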
message ConfState {
    repeated uint64 nodes = 1;
    repeated uint64 learners = 2;
}

enum ConfChangeType {
    AddNode = 0;
    RemoveNode = 1;
    AddLearnerNode = 2;
    SetNodes = 3;
}
message ConfChange {
    uint64 id = 1;
    ConfChangeType change_type = 2;
    // Used for add/remove.
    uint64 node_id = 3;
    bytes context = 4;
    // Used for set.
    ConfState new_state = 5;
}