Vladislav Shpilevoy
Database C developer at Tarantool. Backend C++ developer at VirtualMinds.
Plan
Asynchronous replication
Synchronous replication
Leader election
Transaction manager
Future work
Log write = commit. Does not wait for replication
Transaction
Log (master)
Commit and response
Replication
Log (replica)
Transaction
1
Master failure between (3) and (5) = transaction loss
3
Commit and response
Log
2
4
Replication
Log
5
{money: 100}
{money: 150}
I have 100 money
I put 50
I have 150 money
{money: 100}
{money: 100}
Where are my 150 money?!
With asynchronous replication, the guarantees are almost the same as without replication
{money: +50}
{success}
get({money})
{money: 100}
Log
Replication start...
_ = box.schema.create_space('test')
_ = _:create_index('pk')
box.space.test:replace{1, 100}
Create database schema
Execute transaction
wait_replication()
Manual replication wait
The changes are already visible!
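The `wait_replication()` call above is not a built-in function. A minimal sketch of what such a helper could look like, assuming `box.info.replication[i].downstream.vclock` is available (it is in recent Tarantool versions); the polling interval and timeout are illustrative:

```lua
fiber = require('fiber')

-- Hypothetical helper: block until every connected replica has
-- applied everything this instance has written, or the timeout expires.
function wait_replication(timeout)
    local deadline = fiber.clock() + (timeout or 10)
    local own_id = box.info.id
    local own_lsn = box.info.lsn
    while fiber.clock() < deadline do
        local done = true
        for _, r in pairs(box.info.replication) do
            local d = r.downstream
            -- The replica has not yet confirmed our latest LSN.
            if r.id ~= own_id and d ~= nil and d.vclock ~= nil
               and (d.vclock[own_id] or 0) < own_lsn then
                done = false
            end
        end
        if done then return true end
        fiber.sleep(0.01)
    end
    return false
end
```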
Transaction
Log
(master)
Replication
Log
(replica)
Commit and response
Replication is guaranteed before commit
1
Transaction
2
Log
Commit and response
5
3
Replication
Log
4
Typical configurations
1 + 1
Triplet
50% + 1
{money: 100}
I have 100 money
I put 50
I have 150 money
{money: 150}
{money: 100}
Synchronous replication guarantees data safety while enough nodes are alive
{money: +50}
{success}
get({money})
{money: 150}
Log
Replication
Log
{money: 150}
{money: 150}
Replication
Ack
Commit
{money: 150}
Commit
{money: 150}
Replication happens before commit
Commit is done after quorum of replicas responded
Commit is replicated asynchronously, and its loss is not critical
Synchronous replication
Asynchronous replication
Fast
Slow
Good availability
Fragile availability
Easy to configure
Hard to configure
Can do master-master
Only master-replica
Easy to lose data
High durability
Synchronous transactions wait in a queue inside Tarantool
To prevent unbounded queue growth, synchronous transactions have a timeout
box.cfg{
    replication_synchro_quorum = <count>,
    replication_synchro_timeout = <seconds>,
}
Quorum for synchronous transactions commit
Timeout for synchronous transactions commit
The quorum must be set manually and can be changed dynamically
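For example, the quorum can be lowered at runtime when a node is known to be gone for good (the values here are illustrative for a 3-node cluster):

```lua
-- Initial setup: a 3-node cluster, full quorum.
box.cfg{replication_synchro_quorum = 3}

-- One node is permanently lost - shrink the quorum at runtime
-- so new synchronous transactions can still commit.
box.cfg{replication_synchro_quorum = 2}
```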
box.space.<name>:alter( {is_sync = true} )
box.schema.create_space(name, {is_sync = true} )
Turn on for an existing space
Create new space
Can make the existing data synchronous
Or start from scratch
Synchronicity is a property of transaction, not of the whole replication
Synchronous transaction changes sync space
$> sync = box.schema.create_space(
       'stest', {is_sync = true})
$> _ = sync:create_index('pk')
$> sync:replace{1}
$> box.begin()
$> sync:replace{2}
$> sync:replace{3}
$> box.commit()
Change sync space - transaction is sync
These transactions are synchronous
If any changed space is sync, the whole transaction is sync
$> async = box.schema.create_space(
       'atest', {is_sync = false})
$> _ = async:create_index('pk')
$> box.begin()
$> sync:replace{5}
$> async:replace{6}
$> box.commit()
box.cfg{
    listen = 3313,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    memtx_use_mvcc_engine = true,
    replication_synchro_quorum = 3,
    replication_synchro_timeout = 1000,
}
box.schema.user.grant('guest', 'super')
There will be 2 replicas on these hosts
So we don't need to care about access rights at all
To turn off dirty reads for memtx
Quorum is set manually to 3 - master and both replicas
box.cfg{
    listen = 3314,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true
}
box.cfg{
    listen = 3315,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true
}
Replica 1
Replica 2
No replication_synchro options - needed only on master
$> s = box.schema.create_space(
       'test', {is_sync = true})
$> _ = s:create_index('pk')
Replica 2
Replica 1
Master
Create schema
$> s:replace{1}
Test how replication works
$> box.space.test:get({1})
---
- [1]
...
$> box.space.test:get({1})
---
- [1]
...
$> os.exit(1)
Kill one of the replicas
$> fiber = require('fiber')
$> f = fiber.create(function() s:replace{2} end)
$> f:status()
---
- suspended
...
Start new sync transaction in a separate fiber
Replica 2
Replica 1
Master
The transaction is not committed, so its changes are not visible - on the replica either - because quorum 3 cannot be reached
$> s:get{2}
---
...
$> box.space.test:get{2}
---
...
$> box.cfg{...}
Start the replica with the same config again
$> f:status()
---
- dead
...
The transaction is finished on master now
$> s:get{2}
---
- [2]
...
Changes are visible on all nodes
$> box.space.test:get{2}
---
- [2]
...
$> box.space.test:get{2}
---
- [2]
...
Normally only one node is kept writable, to avoid conflicts
box.cfg{read_only = <boolean>}
With asynchronous replication it is enough to set read_only option
Need special rules to elect a new leader without losing data
With synchronous replication only master-replica is possible - election is necessary
{a = 10}
{a = 10}
{a = 10}
{a = 10}
{a = 10}
{a = 20}
{a = 20}
{a = 20}
The update of a to 20 reached quorum 3. The other nodes haven't received it yet.
Then the leader dies
New leader should be from the latest quorum
Bad choice - must not become the leader, its data is too old!
Good choice - has the newest data, can become the new leader!
Transaction ID: {Replica ID, LSN}
VClock - pairs {Replica ID, LSN} from all nodes, snapshot of cluster state
{0, 0, 0}
{0, 0, 0}
{0, 0, 0}
{1, 0, 0}
{1, 0, 0}
{1, 0, 0}
{1, 2, 0}
{1, 2, 0}
{1, 2, 0}
Replica ID = 1
Replica ID = 2
Replica ID = 3
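The vclock shown in the diagrams can be inspected at runtime on any node. For a 3-node cluster it might look like this (values are illustrative):

```lua
tarantool> box.info.vclock
---
- {1: 10, 2: 3}
...
-- Component 1 is the LSN of transactions that originated on node 1
-- (10 so far), component 2 of node 2, and so on; zero components
-- are omitted. Comparing vclocks component-wise shows which node
-- has applied more of the cluster's history.
```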
box.ctl.clear_synchro_queue()
The new master should have the biggest LSN of the old master among the surviving nodes
To find it look at
box.info.vclock
On the new master it is necessary to call
to clear the synchronous transaction queue correctly
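Putting the manual failover steps together, a sketch of the whole procedure (the quorum value 2 is illustrative, for a 3-node cluster that lost one node):

```lua
-- On each surviving node: look up the old master's component of the
-- vclock. The node where it is biggest should become the new master.
box.info.vclock

-- On the chosen node only:
box.cfg{replication_synchro_quorum = 2}  -- adjust quorum if the cluster shrank
box.ctl.clear_synchro_queue()            -- finalize pending sync transactions
box.cfg{read_only = false}               -- become the new master
```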
box.cfg{
    listen = 3313,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    memtx_use_mvcc_engine = true,
    replication_synchro_quorum = 3,
    replication_synchro_timeout = 1000,
}
box.schema.user.grant('guest', 'super')
box.cfg{
    listen = 3315,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true
}
box.cfg{
    listen = 3314,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true
}
All the same as for synchronous replication example
$> s = box.schema.create_space(
       'test', {is_sync = true})
$> _ = s:create_index('pk')
$> s:replace{1}
Replica 2
Replica 1
Master
Create schema and see how replication works
$> box.space.test:get({1})
---
- [1]
...
$> box.space.test:get({1})
---
- [1]
...
$> os.exit(1)
Kill one of the replicas
$> fiber = require('fiber')
$> f = fiber.create(function() s:replace{2} end)
$> s:get{2}
---
...
$> os.exit(1)
Start new sync transaction in a separate fiber and kill the leader
Replica 2
Replica 1
$> box.cfg{
       listen = 3315,
       replication = {
           '127.0.0.1:3314',
           '127.0.0.1:3315'
       },
       read_only = true,
       memtx_use_mvcc_engine = true,
   }
Restart the replica without the old leader in replication
$> box.space.test:get({2})
---
...
The transaction is still not finished
$> box.space.test:get({2})
---
...
$> box.info.vclock
---
- {1: 9}
...
$> box.info.vclock
---
- {1: 9}
...
Vclock is the same - anyone can be a new master
Replica 2
Replica 1
$> box.cfg{
       replication_synchro_quorum = 2,
       replication_synchro_timeout = 3,
   }
Change parameters to prepare to become a new master
$> box.ctl.clear_synchro_queue()
Finalise pending synchronous transactions
$> box.space.test:get{2}
---
- [2]
...
The transaction from the old leader is committed
$> box.space.test:get{2}
---
- [2]
...
$> box.cfg{read_only = false}
Become a fully functional master
$> box.space.test:replace{3}
$> box.space.test:get{3}
---
- [3]
...
Can commit new transactions
$> box.space.test:get{3}
---
- [3]
...
- leader
- replica
- candidate
Each node always has exactly one role
Term: 1
Term: 1
Term: 1
Term: 1
Term: 1
All nodes start as replicas with term 1. The term is the logical clock of the election
After some time a few nodes notice leader absence and start new election
They become candidates, vote for self, and send vote requests
Term: 2
Votes: 1
Term: 2
Votes: 1
Others vote for exactly one candidate
Term: 2
Votes: 2
Term: 2
Votes: 3
The node that gets the majority becomes the leader and notifies the other nodes
Others accept the leader
Term: 2
Voted
Term: 2
Voted
Term: 2
Voted
box.cfg{
    election_mode = 'candidate'/'voter'/'off',
    election_timeout = <seconds>,
    replication_synchro_quorum = <count>,
}
Election mode - use it if some nodes must never become a leader, or if you want manual election
How long to wait for majority of votes before election restart
How many votes a node needs to become a leader
box.cfg{
    listen = 3313,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    memtx_use_mvcc_engine = true,
    replication_synchro_quorum = 3,
    replication_synchro_timeout = 1000,
    election_mode = 'candidate',
}
box.ctl.wait_rw()
box.schema.user.grant('guest', 'super')
All the same config as for synchronous replication example, but with election_mode
Need to wait until the node is writable, as the election takes some time at bootstrap
box.cfg{
    listen = 3314,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true,
    election_mode = 'voter',
}
box.cfg{
    listen = 3315,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true,
    election_mode = 'voter',
}
Replica 1
Replica 2
I want only one node to be able to become the leader for now, so I make the others voters
$> s = box.schema.create_space(
       'test', {is_sync = true})
$> _ = s:create_index('pk')
$> s:replace{1}
Replica 2
Replica 1
Master
Create schema and see how replication works
$> box.space.test:get({1})
---
- [1]
...
$> box.space.test:get({1})
---
- [1]
...
$> os.exit(1)
Kill one of the replicas
$> fiber = require('fiber')
$> f = fiber.create(function() s:replace{2} end)
$> s:get{2}
---
...
$> os.exit(1)
Start new sync transaction in a separate fiber and kill the leader
Replica 2
Replica 1
$> box.space.test:get({2})
---
...
The transaction is not committed
$> box.info.election.state
---
- candidate
...
After some time the leader death is detected, and the node becomes a candidate
$> box.cfg{
       election_mode = 'candidate',
       replication_synchro_quorum = 2,
       read_only = false,
   }
Become a candidate and set the quorum to 2, as the old leader won't return
$> box.cfg{
       listen = 3315,
       replication = {
           '127.0.0.1:3314',
           '127.0.0.1:3315'
       },
       read_only = true,
       memtx_use_mvcc_engine = true,
       election_mode = 'voter',
   }
Start the other replica again. But no old leader in its config
Replica 2
Replica 1
$> box.ctl.wait_rw()
Now two nodes are alive, so a leader can be elected. The leader becomes writable, so wait for that
$> box.info.election.state
---
- leader
...
The state became 'leader'
$> box.space.test:get({2})
---
- [2]
...
The uncommitted transaction of the old leader is finished automatically
$> s:replace{3}
$> box.space.test:select()
---
- - [1]
  - [2]
  - [3]
...
Can create new transactions
$> box.space.test:select()
---
- - [1]
  - [2]
  - [3]
...
Raft - an algorithm for synchronous replication and leader election
Guarantees data safety while > 50% of nodes are alive;
Extreme simplicity
Tested by time
Leader election
This is how it worked before:
$> box.cfg{}
$> s = box.schema.create_space('test')
$> _ = s:create_index('pk')
Create schema
$> fiber = require('fiber')
$> function yield_in_txn()
box.begin()
s:replace{1}
fiber.yield()
s:replace{2}
box.commit()
end
$> yield_in_txn()
---
- error: Transaction has been aborted by a fiber yield
...
A yield inside a transaction aborts it
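With the new transaction manager enabled, the same function is expected to commit, because a transaction now survives a yield. A sketch, assuming Tarantool 2.6+ and that `memtx_use_mvcc_engine` is set in the very first `box.cfg` call:

```lua
-- Enable the transaction manager before anything else.
box.cfg{memtx_use_mvcc_engine = true}
s = box.schema.create_space('test')
_ = s:create_index('pk')
fiber = require('fiber')

function yield_in_txn()
    box.begin()
    s:replace{1}
    fiber.yield()      -- no longer aborts the transaction
    s:replace{2}
    box.commit()
end

yield_in_txn()         -- expected to commit successfully
```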
box.cfg{
    listen = 3313,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    memtx_use_mvcc_engine = false,
    replication_synchro_quorum = 3,
    replication_synchro_timeout = 1000,
}
box.schema.user.grant('guest', 'super')
box.cfg{
    listen = 3315,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = false
}
box.cfg{
    listen = 3314,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = false
}
All the same as for synchronous replication example, but no memtx_use_mvcc_engine
$> s = box.schema.create_space(
       'test', {is_sync = true})
$> _ = s:create_index('pk')
Replica 2
Replica 1
Master
Create schema
$> s:replace{1}
Test how replication works
$> box.space.test:get({1})
---
- [1]
...
$> box.space.test:get({1})
---
- [1]
...
$> os.exit(1)
Kill one of the replicas
$> fiber = require('fiber')
$> f = fiber.create(function() s:replace{2} end)
$> f:status()
---
- suspended
...
Start new sync transaction in a separate fiber
Replica 2
Replica 1
Master
$> box.space.test:get{2}
---
- [2]
...
$> box.space.test:get{2}
---
- [2]
...
Dirty read! The transaction is not committed, but it is visible
Master node
$> box.cfg{memtx_use_mvcc_engine = true}
$> s = box.schema.create_space('test')
$> _ = s:create_index('pk')
Start with the manager enabled, create schema
$> require('console').listen(3313)
Start console to connect from 2 terminals
Console 1
$> require('console').connect(3313)
Start 2 client consoles
Console 2
$> require('console').connect(3313)
$> box.begin()
$> s:replace{1, 1}
Start a transaction in the first console
$> s:get{1}
---
...
Its data is not visible yet
$> box.commit()
Commit the transaction
$> s:get{1}
---
- [1, 1]
...
The data became visible
Console 1
Console 2
$> box.begin()
$> s:update({1}, {{'+', 2, 1}})
A more complex action - update, which involves reading old data
One transaction is committed successfully
$> box.begin()
$> s:update({1}, {{'+', 2, 1}})
$> box.commit()
$> s:get{1}
---
- [1, 2]
...
$> box.commit()
---
- error: Transaction has been aborted by conflict
...
$> s:get{1}
---
- [1, 2]
...
The other one is less lucky
Console 1
Console 2
$> box.begin()
$> s:replace({1, 3})
Replace does not involve reading - no conflict
One transaction is committed successfully
$> box.begin()
$> s:replace({1, 4})
$> box.commit()
$> s:get{1}
---
- [1, 3]
...
$> box.commit()
$> s:get{1}
---
- [1, 4]
...
And the other one too! Because it didn't rely on the old data
Stability improvements - 2.6 is beta.
APIs still can change
Auto-calculation of quorum
box.cfg{ replication_synchro_quorum = "N/2 + 1" }
New monitoring endpoints
box.info.synchro
Transaction queue size
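In later releases the endpoint is expected to expose the effective quorum and the queue of pending synchronous transactions. An illustrative look at the output (field names may change while the API is in flux, as the slide warns):

```lua
tarantool> box.info.synchro
---
- queue:
    len: 0      -- synchronous transactions waiting for quorum
  quorum: 2     -- current effective quorum
...
```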
Interactive transactions over network
Without writing Lua functions
New transaction options
box.commit({is_lazy, is_sync})
box.begin({isolation = ...})
Election triggers
box.ctl.on_election(function() ... end)
Example of leader election
Example of synchronous replication
Official site
Full release info
Synchronous replication talk
By Vladislav Shpilevoy
Tarantool 2.6 was released in October 2020. It is the biggest release in several years, bringing a beta version of synchronous replication and a transaction manager for the memtx storage engine. The talk sheds light on the key features of the release.