Tarantool 2.6

Vladislav Shpilevoy

Synchronous Replication
Transaction Manager

Plan

Asynchronous replication

Synchronous replication

Leader election

Transaction manager

Future work

Asynchronous Replication [1]

Log write = commit. Does not wait for replication

[Diagram: (1) transaction from the client → (2) log write on the master → (3) commit and response → (4) replication → (5) log write on the replica]

Master failure between (3) and (5) = transaction loss

Asynchronous Replication [2]: data loss

[Diagram: the client, seeing {money: 100}, sends {money: +50}; the master logs {money: 150} and answers {success}; the master dies while replication is only starting, so the replica still has {money: 100}; after failover get({money}) returns 100 - "Where are my 150 money?!"]

With asynchronous replication the guarantees are almost like without replication

Asynchronous Replication [3]: dirty reads

_ = box.schema.create_space('test')
_ = _:create_index('pk')
box.space.test:replace{1, 100}

Create database schema

Execute transaction

wait_replication()

Manual replication wait

The changes are already visible!
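
wait_replication() is not a built-in function. A minimal sketch of such a helper, assuming it should block until every connected replica has acknowledged the local LSN (the box.info fields used are real, the helper itself is illustrative):

fiber = require('fiber')
function wait_replication()
    local self_id = box.info.id
    local target_lsn = box.info.lsn
    while true do
        local done = true
        for _, r in pairs(box.info.replication) do
            -- downstream.vclock shows what this replica has acknowledged
            if r.id ~= self_id and (r.downstream == nil or
               r.downstream.vclock == nil or
               (r.downstream.vclock[self_id] or 0) < target_lsn) then
                done = false
            end
        end
        if done then return end
        fiber.sleep(0.01)
    end
end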

Synchronous Replication [1]

[Diagram: (1) transaction from the client → (2) log write on the master → (3) replication → (4) log write on the replica → (5) commit and response]

Replication is guaranteed before commit

Synchronous Replication [2]: quorum

Typical configurations:

  • 1 + 1
  • Triplet
  • 50% + 1
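
For example, how these configurations map to the quorum option (a sketch; the values follow the 50% + 1 rule):

-- 1 + 1 (master and one replica): every transaction
-- must reach both nodes
box.cfg{replication_synchro_quorum = 2}
-- Triplet: 50% + 1 = 2, so a transaction survives
-- the loss of any single node
box.cfg{replication_synchro_quorum = 2}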

Synchronous Replication [3]: no loss

[Diagram: the client sends {money: +50}; the master logs {money: 150} and replicates it; the replica logs {money: 150} and answers with an ack; only then does the master commit and respond {success}; the commit record is replicated afterwards; get({money}) returns 150 on every node]

Synchronous replication guarantees data safety while enough nodes are alive

Replication happens before commit

Commit is done after quorum of replicas responded

Commit is replicated asynchronously, and its loss is not critical

Synchronous Replication [4]: tradeoff

Asynchronous replication   | Synchronous replication
---------------------------|-------------------------
Fast                       | Slow
Good availability          | Fragile availability
Easy to configure          | Hard to configure
Can do master-master       | Only master-replica
Easy to lose data          | High durability

Synchronous Replication [5]: options

Synchronous transactions wait in a queue inside Tarantool

To keep the queue from growing infinitely, there is a timeout for synchronous transactions

box.cfg{
    replication_synchro_quorum = <count>,
    replication_synchro_timeout = <seconds>
}

Quorum for synchronous transactions commit

Timeout for synchronous transactions commit

The quorum must be set manually and can be changed dynamically
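
For example, if one replica of a triplet is lost for good, the quorum can be lowered at runtime with a plain box.cfg call:

-- Lower the quorum from 3 to 2 without a restart
box.cfg{replication_synchro_quorum = 2}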

Synchronous Replication [6]: usage [1]

box.space.<name>:alter(
    {is_sync = true}
)
box.schema.create_space(name,
    {is_sync = true}
)

Turn on for an existing space

Create new space

Can make the existing data synchronous

Or start from scratch

Synchronicity is a property of a transaction, not of the whole replication

A transaction that changes a sync space is synchronous

Synchronous Replication [6]: usage [2]

$> sync = box.schema.create_space(
    'stest', {is_sync = true}
)
$> _ = sync:create_index('pk')

$> sync:replace{1}

$> box.begin()
$> sync:replace{2}
$> sync:replace{3}
$> box.commit()

Change a sync space - the transaction is sync

These transactions are synchronous

If any changed space is sync, the whole transaction is sync

$> async = box.schema.create_space(
    'atest', {is_sync = false}
)
$> _ = async:create_index('pk')

$> box.begin()
$> sync:replace{5}
$> async:replace{6}
$> box.commit()

Synchronous Replication [7]: example [1]

Master configuration

box.cfg{
    listen = 3313,
    replication =  {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    memtx_use_mvcc_engine = true,
    replication_synchro_quorum = 3,
    replication_synchro_timeout = 1000,
}
box.schema.user.grant('guest', 'super')

There will be 2 replicas on these hosts

So we do not need to care about access rights at all

To turn off dirty reads for memtx

Quorum is set manually to 3 - master and both replicas

Synchronous Replication [7]: example [2]

Replicas configuration

box.cfg{
    listen = 3314,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true
}
box.cfg{
    listen = 3315,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true
}

Replica 1

Replica 2

No replication_synchro options - needed only on master

Synchronous Replication [7]: example [3]

$> s = box.schema.create_space(
    'test', {is_sync = true})
$> _ = s:create_index('pk')

Replica 2

Replica 1

Master

Create schema

$> s:replace{1}

Test how replication works

$> box.space.test:get({1})
---
- [1]
...
$> box.space.test:get({1})
---
- [1]
...
$> os.exit(1)

Kill one of the replicas

$> fiber = require('fiber')
$> f = fiber.create(function()       
    s:replace{2}
end)
$> f:status()
---
- suspended
...

Start new sync transaction in a separate fiber

Synchronous Replication [7]: example [4]

Replica 2

Replica 1

Master

The transaction is not committed, so its changes are not visible - on the replica too - because the quorum of 3 cannot be reached

$> s:get{2}
---
...
$> box.space.test:get{2}
---
...
$> box.cfg{...}

Start the replica with the same config again

$> f:status()
---
- dead
...

The transaction is finished on master now

$> s:get{2}
---
- [2]
...

Changes are visible on all nodes

$> box.space.test:get{2}
---
- [2]
...
$> box.space.test:get{2}
---
- [2]
...

Leader election [1]: asynchronous

Normally only one node is kept writable, to avoid conflicts

box.cfg{read_only = <boolean>}

With asynchronous replication it is enough to set the read_only option
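
A sketch of a manual failover under this scheme - only the read_only flag moves:

-- On the old master, if it is still alive: stop accepting writes
box.cfg{read_only = true}
-- On the chosen new master: start accepting writes
box.cfg{read_only = false}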

Leader election [2]: synchronous [1]

Need special rules to elect a new leader without losing data

With synchronous replication only master-replica is possible - election is necessary

  • The latest committed transaction may not be on all replicas
  • The latest COMMIT entry might not have been delivered to any replica

Leader election [2]: synchronous [2]

[Diagram: five nodes; an update of a from 10 to 20 reached quorum 3 - the leader and two replicas have {a = 20}, the other two nodes still have {a = 10}. Then the leader dies.]

The new leader should be from the latest quorum:

  • Bad choice - a node still holding {a = 10}: too old data, shall not be a leader!
  • Good choice - a node holding {a = 20}: has the newest data, can be the new leader!

Leader election [3]: vclock

Transaction ID: {Replica ID, LSN}

  • Replica ID - ID of the node which authored the change
  • LSN - monotonically growing unique counter

VClock - the pairs {Replica ID, LSN} from all nodes; a snapshot of the cluster state

[Diagram: three nodes with Replica IDs 1, 2 and 3, each starting with vclock {0, 0, 0}; after a transaction authored by replica 1 all nodes have {1, 0, 0}; after two transactions authored by replica 2 - {1, 2, 0}]
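
Both pieces can be inspected on a live instance (a sketch; the values depend on the cluster):

box.info.id      -- replica ID of this node, e.g. 1
box.info.lsn     -- own LSN, e.g. 9
box.info.vclock  -- the whole vclock, a map from replica ID to LSN, e.g. {1: 9, 2: 2}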

Leader election [4]: manual

The new master should have the biggest LSN of the old master among the alive nodes

To find it, look at

box.info.vclock

On the new master it is necessary to call

box.ctl.clear_synchro_queue()

to clear the synchronous transaction queue correctly
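
A sketch of that rule: collect box.info.vclock from the alive nodes by any means (e.g. over net.box) and pick the node with the biggest LSN of the old master. The helper and its arguments are illustrative, not a built-in API:

-- old_master_id: replica ID of the dead master
-- vclocks: {node_uri = vclock} gathered from the alive nodes
function pick_new_master(old_master_id, vclocks)
    local best_uri, best_lsn = nil, -1
    for uri, vclock in pairs(vclocks) do
        local lsn = vclock[old_master_id] or 0
        if lsn > best_lsn then
            best_uri, best_lsn = uri, lsn
        end
    end
    return best_uri
end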

Leader election [4]: example [1]

Master configuration

box.cfg{
    listen = 3313,
    replication =  {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    memtx_use_mvcc_engine = true,
    replication_synchro_quorum = 3,
    replication_synchro_timeout = 1000,
}
box.schema.user.grant('guest', 'super')
box.cfg{
    listen = 3315,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true
}

Replicas configuration

box.cfg{
    listen = 3314,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true
}

All the same as for synchronous replication example

Leader election [4]: example [2]

$> s = box.schema.create_space(
    'test', {is_sync = true})
$> _ = s:create_index('pk')
$> s:replace{1}

Replica 2

Replica 1

Master

Create schema and see how replication works

$> box.space.test:get({1})
---
- [1]
...
$> box.space.test:get({1})
---
- [1]
...
$> os.exit(1)

Kill one of the replicas

$> fiber = require('fiber')
$> f = fiber.create(function()       
    s:replace{2}
end)
$> s:get{2}
---
...
$> os.exit(1)

Start new sync transaction in a separate fiber and kill the leader

Leader election [4]: example [3]

Replica 2

Replica 1

$> box.cfg{
    listen = 3315,
    replication = {
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true,
}

Restart the replica without the old leader in replication

$> box.space.test:get({2})
---
...

The transaction is still not finished

$> box.space.test:get({2})
---
...
$> box.info.vclock
---
- {1: 9}
...
$> box.info.vclock
---
- {1: 9}
...

The vclocks are the same - either node can become the new master

Leader election [4]: example [4]

Replica 2

Replica 1

$> box.cfg{
    replication_synchro_quorum = 2,
    replication_synchro_timeout = 3,
}

Change parameters to prepare to become a new master

$> box.ctl.clear_synchro_queue()

Finalise pending synchronous transactions

$> box.space.test:get{2}
---
- [2]
...

The transaction from the old leader is committed

$> box.space.test:get{2}
---
- [2]
...
$> box.cfg{read_only = false}

Become a fully functional master

$> box.space.test:replace{3}

$> box.space.test:get{3}
---
- [3]
...

Can commit new transactions

$> box.space.test:get{3}
---
- [3]
...

Leader election [5]: automatic

[Diagram legend: leader, replica, candidate]

Each node always has exactly one role

[Diagram: five nodes, each with Term: 1]

All nodes start as replicas with term 1. The term is the logical clock of the elections

After some time a few nodes notice leader absence and start new election

They become candidates, vote for themselves, and send vote requests

[Diagram: two candidates, each with Term: 2, Votes: 1]

Others vote for exactly one candidate

[Diagram: the candidates now have Term: 2 with Votes: 2 and Votes: 3]

The node that has the majority of votes becomes the leader and notifies the other nodes

Others accept the leader

[Diagram: all nodes at Term: 2, marked Voted, with a single leader]
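
The election state machine can be observed on any node through box.info (a sketch; field set as of 2.6):

e = box.info.election
e.state   -- 'follower', 'candidate' or 'leader'
e.term    -- current election term
e.vote    -- replica ID this node voted for in the current term
e.leader  -- replica ID of the known leader, 0 if there is none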

Leader election [6]: usage

box.cfg{
    election_mode = 'candidate'/'voter'/'off',
    election_timeout = <seconds>,
    replication_synchro_quorum = <count>,
}

Election mode - for when some nodes must not become a leader, or when you want manual election

How long to wait for a majority of votes before restarting the election

How many votes a node needs to become a leader

Leader election [7]: example [1]

Master configuration

box.cfg{
    listen = 3313,
    replication =  {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    memtx_use_mvcc_engine = true,
    replication_synchro_quorum = 3,
    replication_synchro_timeout = 1000,
    election_mode = 'candidate',
}
box.ctl.wait_rw()
box.schema.user.grant('guest', 'super')

All the same config as for the synchronous replication example, but with election_mode

Need to wait for the node to become writable, as the election takes some time at bootstrap

Leader election [7]: example [2]

Replicas configuration

box.cfg{
    listen = 3314,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true,
    election_mode = 'voter',
}
box.cfg{
    listen = 3315,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true,
    election_mode = 'voter',
}

Replica 1

Replica 2

I want only one node to be able to become the leader for now, so I make the others voters

Leader election [7]: example [3]

$> s = box.schema.create_space(
    'test', {is_sync = true})
$> _ = s:create_index('pk')
$> s:replace{1}

Replica 2

Replica 1

Master

Create schema and see how replication works

$> box.space.test:get({1})
---
- [1]
...
$> box.space.test:get({1})
---
- [1]
...
$> os.exit(1)

Kill one of the replicas

$> fiber = require('fiber')
$> f = fiber.create(function()       
    s:replace{2}
end)
$> s:get{2}
---
...
$> os.exit(1)

Start new sync transaction in a separate fiber and kill the leader

Leader election [7]: example [4]

Replica 2

Replica 1

$> box.space.test:get({2})
---
...

The transaction is not committed

$> box.info.election.state
---
- candidate
...

After some time the leader death is detected, and the node becomes a candidate

$> box.cfg{
    election_mode = 'candidate',
    replication_synchro_quorum = 2,
    read_only = false,
}

Become a candidate and set the quorum to 2, as the old leader won't return

$> box.cfg{
    listen = 3315,
    replication = {
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = true,
    election_mode = 'voter',
}

Start the other replica again, but without the old leader in its config

Leader election [7]: example [5]

Replica 2

Replica 1

$> box.ctl.wait_rw()

Now two nodes are alive, so a leader can be elected. The leader becomes writable, so wait for that

$> box.info.election.state
---
- leader
...

The state became 'leader'

$> box.space.test:get({2})
---
- [2]
...

The uncommitted transaction of the old leader is finished automatically

$> s:replace{3}
$> box.space.test:select()
---
- - [1]
  - [2]
  - [3]
...

Can create new transactions

$> box.space.test:select()
---
- - [1]
  - [2]
  - [3]
...

Raft

Raft is an algorithm of synchronous replication and leader election. It offers:

  • Guaranteed data safety while > 50% of the nodes are alive
  • Extreme simplicity
  • A design tested by time
  • Leader election

Transaction manager [1]: before

This is how it worked before:

$> box.cfg{}
$> s = box.schema.create_space('test')
$> _ = s:create_index('pk')

Create schema

$> fiber = require('fiber')

$> function yield_in_txn()
    box.begin()
    s:replace{1}
    fiber.yield()
    s:replace{2}
    box.commit()
end

$> yield_in_txn()
---
- error: Transaction has been aborted by a fiber yield
...

Make a yield inside a transaction - it is aborted
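
With the transaction manager this limitation goes away. A sketch, assuming a freshly started instance, since memtx_use_mvcc_engine is not a dynamic option:

-- First box.cfg call of a new instance: enable the manager
box.cfg{memtx_use_mvcc_engine = true}
-- ... recreate the space and yield_in_txn() from above ...
yield_in_txn()  -- now commits both replaces instead of aborting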

Transaction manager [2]: dirty reads [1]

Master configuration

box.cfg{
    listen = 3313,
    replication =  {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    memtx_use_mvcc_engine = false,
    replication_synchro_quorum = 3,
    replication_synchro_timeout = 1000,
}
box.schema.user.grant('guest', 'super')
box.cfg{
    listen = 3315,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = false
}

Replicas configuration

box.cfg{
    listen = 3314,
    replication = {
        '127.0.0.1:3313',
        '127.0.0.1:3314',
        '127.0.0.1:3315'
    },
    read_only = true,
    memtx_use_mvcc_engine = false
}

All the same as for the synchronous replication example, but with memtx_use_mvcc_engine disabled

Transaction manager [2]: dirty reads [2]

$> s = box.schema.create_space(
    'test', {is_sync = true})
$> _ = s:create_index('pk')

Replica 2

Replica 1

Master

Create schema

$> s:replace{1}

Test how replication works

$> box.space.test:get({1})
---
- [1]
...
$> box.space.test:get({1})
---
- [1]
...
$> os.exit(1)

Kill one of the replicas

$> fiber = require('fiber')
$> f = fiber.create(function()       
    s:replace{2}
end)
$> f:status()
---
- suspended
...

Start new sync transaction in a separate fiber

Transaction manager [2]: dirty reads [3]

Replica 2

Replica 1

Master

$> box.space.test:get{2}
---
- [2]
...
$> box.space.test:get{2}
---
- [2]
...

Dirty read! The transaction is not committed, but it is visible

Transaction manager [2]: example [1]

Master node

$> box.cfg{memtx_use_mvcc_engine = true}
$> s = box.schema.create_space('test')
$> _ = s:create_index('pk')

Start with the manager enabled, create schema

$> require('console').listen(3313)

Start console to connect from 2 terminals

Transaction manager [2]: example [2]

Console 1

$> require('console').connect(3313)

Start 2 client consoles

Console 2

$> require('console').connect(3313)
$> box.begin()
$> s:replace{1, 1}

Start a transaction in the first console

$> s:get{1}
---
...

Its data is not visible yet

$> box.commit()

Commit the transaction

$> s:get{1}
---
- [1, 1]
...

The data became visible

Transaction manager [2]: example [3]

Console 1

Console 2

$> box.begin()
$> s:update({1}, {{'+', 2, 1}})

A more complex action - update, which involves reading old data

One transaction is committed successfully

$> box.begin()
$> s:update({1}, {{'+', 2, 1}})
$> box.commit()
$> s:get{1}
---
- [1, 2]
...
$> box.commit()
---
- error: Transaction has been aborted by conflict
...
$> s:get{1}
---
- [1, 2]
...

The other one is less lucky

Transaction manager [2]: example [4]

Console 1

Console 2

$> box.begin()
$> s:replace({1, 3})

Replace does not involve reading - no conflict

One transaction is committed successfully

$> box.begin()
$> s:replace({1, 4})
$> box.commit()
$> s:get{1}
---
- [1, 3]
...
$> box.commit()
$> s:get{1}
---
- [1, 4]
...

And the other one too! Because it didn't rely on the old data

Future work

Stability improvements - 2.6 is beta.
APIs can still change

Auto-calculation of quorum

box.cfg{
    replication_synchro_quorum = "N/2 + 1"
}

New monitoring endpoints

box.info.synchro

Transaction queue size

Interactive transactions over network

Without writing Lua functions

New transaction options

box.commit({is_lazy, is_sync})
box.begin({isolation = ...})

Election triggers

box.ctl.on_election(function()
    ...
end)

Links

Example of leader election

Example of synchronous replication

Official site

Full release info

Synchronous replication talk

Tarantool 2.6 release - Synchronous Replication, Transaction Manager

By Vladislav Shpilevoy

Tarantool 2.6 was released in October of 2020. This is the biggest release in several years, bringing a beta version of synchronous replication and a transaction manager for the memtx storage engine. The talk sheds more light on the key features of the release.
