Plan
Replication
History of replication in Tarantool
Raft algorithm
Scheme of asynchronous replication
Implementation of synchronous replication
Interface
Differences from Raft
Future work
Replication group - a "replicaset"
[Diagram: one master with several replicas, and a master-master configuration where every node is a master]
Usage
Types
Log write = commit. The master does not wait for replication.
1. Transaction
2. Log (master)
3. Commit and response
4. Replication
5. Log (replica)
Master failure between (3) and (5) = transaction loss.
[Diagram: a client, a master, and a replica]
- Client: "I have 100 money. I put 50." It sends {money: +50}, gets {success}, and the master now stores {money: 150}: "I have 150 money."
- The change is in the master's log, but replication has only started; the replica still stores {money: 100}.
- The master dies. The client does get({money}) on the replica and receives {money: 100}: "Where are my 150 money?!"
With asynchronous replication the guarantees are almost like without replication.
Create database schema:
_ = box.schema.create_space('test')
_ = _:create_index('pk')

Execute transaction:
box.space.test:replace{1, 100}

Manual replication wait:
wait_replication()

The changes are already visible!
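The wait_replication() above is not a built-in; a minimal sketch of such a helper (the name, the timeout, and the 0.1-second poll interval are illustrative) could compare the master's vclock with the vclocks acknowledged by replicas in box.info.replication:

local fiber = require('fiber')

-- Hypothetical helper: block until every connected replica has acknowledged
-- everything currently in the master's vclock, or until the timeout expires.
local function wait_replication(timeout)
    local deadline = fiber.clock() + (timeout or 10)
    local target = box.info.vclock
    while fiber.clock() < deadline do
        local caught_up = true
        for _, r in pairs(box.info.replication) do
            local acked = r.downstream and r.downstream.vclock
            if acked then
                for id, lsn in pairs(target) do
                    if (acked[id] or 0) < lsn then
                        caught_up = false
                    end
                end
            end
        end
        if caught_up then
            return true
        end
        fiber.sleep(0.1)
    end
    return false
end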
Replication is guaranteed before commit.
1. Transaction
2. Log (master)
3. Replication
4. Log (replica)
5. Commit and response
Typical configurations
1 + 1
Triplet
50% + 1
Synchronous replication: slow; fragile availability; hard to configure; only master-replica; high durability.
Asynchronous replication: fast; good availability; easy to configure; can do master-master; easy to lose data.
Raft - an algorithm for synchronous replication
- Guarantee of data safety on > 50% of the nodes
- Extreme simplicity
- Tested by time
- Leader election
2015: market demand for synchronous replication appeared. It was decided to use Raft, and a great plan was created:
- SWIM module for cluster assembly
- Auto-proxy
- Manual leader election - box.ctl.promote()
- Replication optimisation
- Raft
[Timeline 2015-2020: plan creation; SQL and Vinyl; change of team management; SWIM; box.ctl.promote(); auto-proxy; replication optimisations; Raft]
Raft consists of two parts: synchronous replication and leader election.
Roles:
- leader
- replica
- candidate
Each node starts as a replica.
Nodes store a persistent term - the logical clock of the cluster.
[Diagram: five nodes, each with Term: 1]
Replicas wait for a leader; if there is no leader for too long, they start an election.
[Diagram: a node becomes a candidate, bumps its term to 2, votes for itself (Votes: 1), and sends vote requests to the others]
Once a candidate gets the majority of votes, it becomes the leader.
[Diagram: term 2 everywhere; the candidate with Votes: 3 becomes the leader, another candidate stays with Votes: 2]
The leader accepts all transactions, replicates them, and commits as soon as it collects the quorum of acks.
[Diagram: transaction → replication → acks → quorum → commit → response; replicas outside the quorum synchronize asynchronously]
Four stages of a transaction commit:
1. Data to the leader's log
2. Data to the replicas' logs
3. Commit to the leader's log and response to the user
4. Commit to the replicas' logs

A failure at any stage on any node does not lead to loss of committed data as long as more than 50% of the nodes are alive.
Log: transactions, terms, votes, commits
[Diagram: logs of a leader and three replicas across terms 1-3; the leader's log is the longest, each replica has applied a different prefix; entries up to the last commit are committed, the ones after it are new transactions]
Tarantool threads (diagram):
- TX-thread: database, transactions, fibers
- IProto-threads: network, client connections, connection management, input/output
- WAL-thread: log, writes to disk
- Relay-threads: replication of transactions, data to and from replicas
Transaction ID: {Replica ID, LSN}
VClock - a set of {Replica ID, LSN} pairs from all nodes; a snapshot of the cluster state.
[Diagram: three nodes with Replica ID = 1, 2, 3; every vclock starts as {0, 0, 0}; after a transaction on node 1 it becomes {1, 0, 0} on all nodes, and after two transactions on node 2 it becomes {1, 2, 0}]
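A minimal illustration of how a vclock advances (not Tarantool internals; the apply() helper is hypothetical), assuming a three-node cluster where each transaction carries the ID of the node that created it:

-- vclock: replica ID -> LSN of the last transaction seen from that replica
local vclock = {0, 0, 0}

-- apply a transaction identified by {replica_id, lsn}
local function apply(tx)
    assert(tx.lsn == vclock[tx.replica_id] + 1, 'gap in the log')
    vclock[tx.replica_id] = tx.lsn
end

apply({replica_id = 1, lsn = 1})  -- vclock is now {1, 0, 0}
apply({replica_id = 2, lsn = 1})  -- {1, 1, 0}
apply({replica_id = 2, lsn = 2})  -- {1, 2, 0}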
A set of hard requirements:
- Complete compatibility with old versions as long as synchronous replication is not used
- The log format does not change; only new record types can be added
- Tarantool's architecture does not change significantly
$> sync = box.schema.create_space('stest', {is_sync = true})
$> sync:create_index('pk')
$> sync:replace{1}
$> box.begin()
$> sync:replace{2}
$> sync:replace{3}
$> box.commit()
Synchronicity is a property of a space. These transactions are synchronous.
One synchronous space makes the whole transaction synchronous:
$> async = box.schema.create_space('atest', {is_sync = false})
$> async:create_index('pk')
$> box.begin()
$> sync:replace{5}
$> async:replace{6}
$> box.commit()
The commit starts by writing the transaction to the local log.
After the log write the transaction goes to the limbo.
The limbo is a queue of synchronous transactions in the TX thread.
The limbo is the link between the TX, WAL, and Relay threads.
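A toy model of that queue (the real limbo is a C structure inside the TX thread; the names here are illustrative): a FIFO of pending synchronous transactions, confirmed from the front.

-- queue of pending synchronous transactions, ordered by LSN
local limbo = {queue = {}}

-- called after the transaction has been written to the local log
function limbo:append(lsn)
    table.insert(self.queue, {lsn = lsn})
end

-- called once everything with LSN <= confirmed_lsn has collected the quorum:
-- those transactions are committed and leave the queue
function limbo:confirm(confirmed_lsn)
    while #self.queue > 0 and self.queue[1].lsn <= confirmed_lsn do
        table.remove(self.queue, 1)
    end
end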
[Diagram: the limbo in the TX thread - a queue of transactions ordered by LSN (..., 5, 7, 10, ...); committed transactions leave from the front once they collect a quorum, new transactions are appended at the back]
A replica writes the transaction to its log and its limbo and responds with its applied vclock. The master's limbo collects the quorum from these responses.
[Diagram walkthrough: transactions 1-3 arrive at the master; the master writes them to its log (vclock {3, 0}) and keeps them waiting in the limbo; they are replicated, the replica writes them to its log (vclock {3, 0}), keeps them waiting in its limbo, and sends an ack with {3, 0}; the master then collects the quorum]
Quorum collection
The limbo uses a special vclock.
[Diagram walkthrough: a master and two replicas; each node has a log vclock, and the master's limbo vclock records the master LSN confirmed by each node]
- Initially both the log vclock and the limbo vclock are {0, 0, 0}.
- The master executes transactions up to LSN 4 (log vclock {0, 4, 0}) and replicates them; after the acks the limbo vclock is {4, 4, 4}: the limbo sees that all transactions with LSN <= 4 got quorum 3.
- The master executes more transactions up to LSN 8 (log vclock {0, 8, 0}) and replicates them; after the master's own write and one replica's ack the limbo vclock is {8, 8, 4}: the limbo sees that LSN 4 got quorum 3, and LSN 8 - quorum 2.
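A small sketch of that quorum check (illustrative, not the actual implementation): if the limbo vclock holds, for each node, the master LSN that node has confirmed, then the LSN that reached a quorum of q nodes is the q-th largest entry.

-- limbo_vclock: node -> confirmed master LSN; quorum: required number of nodes
local function quorum_lsn(limbo_vclock, quorum)
    local lsns = {}
    for _, lsn in pairs(limbo_vclock) do
        table.insert(lsns, lsn)
    end
    table.sort(lsns, function(a, b) return a > b end)
    return lsns[quorum] or 0
end

print(quorum_lsn({4, 4, 4}, 3))  -- 4: LSN 4 has quorum 3
print(quorum_lsn({8, 8, 4}, 3))  -- 4: LSN 8 has only quorum 2 so far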
Replica failure

The limbo checks the quorum as it collects acks; a single failed replica does not block the commit as long as the quorum is collected from the remaining nodes.

[Diagram walkthrough: five nodes with Replica ID = 1...5, quorum = 3; the limbo queue holds transactions with LSNs 1-8]
- Limbo vclock {0, 0, 1, 0, 0}: so far only one node has confirmed LSN 1.
- Got an ack from replica ID = 1: {1, 0, 1, 0, 0}. The total is 2 and the quorum is 3 - wait more.
- Got acks from replicas ID = 4 and 5: {1, 0, 1, 2, 2}. LSN 1 has the quorum: Commit({LSN = 1}) is written to the log.
- Got more acks from all replicas: {5, 5, 4, 6, 7}. Commit LSN 5: Commit({LSN = 5}) is written to the log.
- New transactions keep being appended behind the commit point.
A timeout protects against infinite queue growth.
When the timeout expires, all transactions in the limbo are deleted, a ROLLBACK record is written to the log, and users get an error.
But ROLLBACK does not mean the replication did not happen: after the master dies, the transaction can still be committed on a new master.
All transactions are deleted because they may depend on each other.
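For illustration, this is how that error could be handled on the client side, using the sync space from the earlier example (the option names are real Tarantool options, the handling itself is just a sketch): if the quorum is not collected within replication_synchro_timeout, the commit fails with an error.

box.cfg{replication_synchro_timeout = 5}

local log = require('log')
local ok, err = pcall(function()
    sync:replace{9}  -- a single synchronous statement commits on its own
end)
if not ok then
    -- the transaction got ROLLBACK locally, but it might still be
    -- committed on a new master after a failover
    log.warn('synchronous commit failed: %s', tostring(err))
end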
Automatic

box.cfg{
    election_mode = <mode>,
    election_timeout = <seconds>,
    replication_synchro_quorum = <count>,
}

Automates the second part of the Raft algorithm - leader election.
- election_mode - off / candidate / voter
- election_timeout - election timeout in seconds
- replication_synchro_quorum - quorum for election and replication
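A hypothetical configuration of one node in a three-node replicaset (the addresses and values are illustrative, the option names are real):

box.cfg{
    listen = 3301,
    replication = {'localhost:3301', 'localhost:3302', 'localhost:3303'},
    election_mode = 'candidate',        -- this node may become the leader
    election_timeout = 5,               -- seconds without a leader before voting
    replication_synchro_quorum = 2,     -- acks needed to commit and to win an election
    replication_synchro_timeout = 10,   -- seconds to wait for the quorum
    replication_timeout = 1,            -- timeout for detecting a dead connection
}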
Manual
- The new leader should be the alive node with the biggest LSN of the old leader - compare box.info.vclock.
- On the new leader it is necessary to call box.ctl.clear_synchro_queue() to clear the limbo.
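A hedged sketch of that manual switchover, assuming the old leader's replica ID was 1 (the ID is illustrative):

-- On each alive node, check how much of the old leader's data it has applied:
box.info.vclock[1]          -- the old leader's LSN as seen by this node

-- On the node where this value is the biggest, take over the queue:
box.ctl.clear_synchro_queue()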
Election options:
box.cfg{
    election_mode = <mode>,
    election_timeout = <seconds>,
}
box.info.election

Synchronous replication options:
box.cfg{
    replication_synchro_quorum = <count>,
    replication_synchro_timeout = <seconds>,
    replication_timeout = <seconds>,
}
box.info.replication
box.ctl.clear_synchro_queue()
Create a new space:
box.schema.create_space(name, {is_sync = true})

Turn it on for an existing space:
box.space.<name>:alter{is_sync = true}
Log format
- In Raft the log is linear - one LSN for all nodes.
- In Tarantool the log is vectorised - each node has its own LSN.

Log type
- In Raft the log is UNDO - it can be reverted from the end.
- In Tarantool the log is REDO - it can't be reverted from the end.

This allows implementing master-master synchronous replication in the future.
A replica can't delete uncommitted transactions from its log - that may require a rejoin to the cluster.
Master-master synchronous replication
Automatic cluster assembly with SWIM
Integration with VShard
Example of leader election
Example of synchronous replication
Official site