Plan
Replication
History of replication in Tarantool
Raft algorithm
Scheme of asynchronous replication
Implementation of synchronous replication
Interface
Differences from Raft
Future work
Replication group - a "replicaset"
[Diagram: one master with several replicas, and a master-master configuration where every node is a master]
Usage
Types
Log write = commit. The master does not wait for replication.
1. Transaction
2. Log (master)
3. Commit and response
4. Replication
5. Log (replica)
Master failure between (3) and (5) = transaction loss.
[Diagram: a client, a master, and a replica]
- Client: "I have 100 money. I put 50." It sends {money: +50}, gets {success}, and the master now stores {money: 150}: "I have 150 money."
- The change is in the master's log, but replication has only started; the replica still stores {money: 100}.
- The master dies. The client does get({money}) on the replica and receives {money: 100}: "Where are my 150 money?!"
With asynchronous replication the guarantees are almost like without replication.
Create database schema:
_ = box.schema.create_space('test')
_ = _:create_index('pk')

Execute transaction:
box.space.test:replace{1, 100}

Manual replication wait:
wait_replication()

The changes are already visible!
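The wait_replication() above is not a built-in; a minimal sketch of such a helper (the name, the timeout, and the 0.1-second poll interval are illustrative) could compare the master's vclock with the vclocks acknowledged by replicas in box.info.replication:

local fiber = require('fiber')

-- Hypothetical helper: block until every connected replica has acknowledged
-- everything currently in the master's vclock, or until the timeout expires.
local function wait_replication(timeout)
    local deadline = fiber.clock() + (timeout or 10)
    local target = box.info.vclock
    while fiber.clock() < deadline do
        local caught_up = true
        for _, r in pairs(box.info.replication) do
            local acked = r.downstream and r.downstream.vclock
            if acked then
                for id, lsn in pairs(target) do
                    if (acked[id] or 0) < lsn then
                        caught_up = false
                    end
                end
            end
        end
        if caught_up then
            return true
        end
        fiber.sleep(0.1)
    end
    return false
end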
Replication is guaranteed before commit.
1. Transaction
2. Log (master)
3. Replication
4. Log (replica)
5. Commit and response
Typical configurations
1 + 1
Triplet
50% + 1
Synchronous replication: slow; fragile availability; hard to configure; only master-replica; high durability.
Asynchronous replication: fast; good availability; easy to configure; can do master-master; easy to lose data.
Raft - an algorithm for synchronous replication
- Guarantee of data safety on > 50% of the nodes
- Extreme simplicity
- Tested by time
- Leader election
2015: market demand for synchronous replication appeared. It was decided to use Raft, and a great plan was created:
- SWIM module for cluster assembly
- Auto-proxy
- Manual leader election - box.ctl.promote()
- Replication optimisation
- Raft
[Timeline 2015-2020: plan creation; SQL and Vinyl; change of team management; SWIM; box.ctl.promote(); auto-proxy; replication optimisations; Raft]
Raft consists of two parts: synchronous replication and leader election.
Roles:
- leader
- replica
- candidate
Each node starts as a replica.
Nodes store a persistent term - the logical clock of the cluster.
[Diagram: five nodes, each with Term: 1]
Replicas wait for a leader; if there is no leader for too long, they start an election.
[Diagram: a node becomes a candidate, bumps its term to 2, votes for itself (Votes: 1), and sends vote requests to the others]
Once a candidate gets the majority of votes, it becomes the leader.
[Diagram: term 2 everywhere; the candidate with Votes: 3 becomes the leader, another candidate stays with Votes: 2]
The leader accepts all transactions, replicates them, and commits as soon as it collects the quorum of acks.
[Diagram: transaction → replication → acks → quorum → commit → response; replicas outside the quorum synchronize asynchronously]
Four stages of a transaction commit:
1. Data to the leader's log
2. Data to the replicas' logs
3. Commit to the leader's log and response to the user
4. Commit to the replicas' logs

A failure at any stage on any node does not lead to loss of committed data as long as more than 50% of the nodes are alive.
Log: transactions, terms, votes, commits
[Diagram: logs of a leader and three replicas across terms 1-3; the leader's log is the longest, each replica has applied a different prefix; entries up to the last commit are committed, the ones after it are new transactions]
Tarantool threads (diagram):
- TX-thread: database, transactions, fibers
- IProto-threads: network, client connections, connection management, input/output
- WAL-thread: log, writes to disk
- Relay-threads: replication of transactions, data to and from replicas
Transaction ID: {Replica ID, LSN}
VClock - a set of {Replica ID, LSN} pairs from all nodes; a snapshot of the cluster state.
[Diagram: three nodes with Replica ID = 1, 2, 3; every vclock starts as {0, 0, 0}; after a transaction on node 1 it becomes {1, 0, 0} on all nodes, and after two transactions on node 2 it becomes {1, 2, 0}]
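A minimal illustration of how a vclock advances (not Tarantool internals; the apply() helper is hypothetical), assuming a three-node cluster where each transaction carries the ID of the node that created it:

-- vclock: replica ID -> LSN of the last transaction seen from that replica
local vclock = {0, 0, 0}

-- apply a transaction identified by {replica_id, lsn}
local function apply(tx)
    assert(tx.lsn == vclock[tx.replica_id] + 1, 'gap in the log')
    vclock[tx.replica_id] = tx.lsn
end

apply({replica_id = 1, lsn = 1})  -- vclock is now {1, 0, 0}
apply({replica_id = 2, lsn = 1})  -- {1, 1, 0}
apply({replica_id = 2, lsn = 2})  -- {1, 2, 0}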
A set of hard requirements:
- Complete compatibility with old versions as long as synchronous replication is not used
- The log format does not change; only new record types can be added
- Tarantool's architecture does not change significantly
$> sync = box.schema.create_space('stest', {is_sync = true})
$> sync:create_index('pk')
$> sync:replace{1}
$> box.begin()
$> sync:replace{2}
$> sync:replace{3}
$> box.commit()
Synchronicity is a property of a space. These transactions are synchronous.
One synchronous space makes the whole transaction synchronous:
$> async = box.schema.create_space('atest', {is_sync = false})
$> async:create_index('pk')
$> box.begin()
$> sync:replace{5}
$> async:replace{6}
$> box.commit()
The commit starts by writing the transaction to the local log.
After the log write the transaction goes to the limbo.
The limbo is a queue of synchronous transactions in the TX thread.
The limbo is the link between the TX, WAL, and Relay threads.
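A toy model of that queue (the real limbo is a C structure inside the TX thread; the names here are illustrative): a FIFO of pending synchronous transactions, confirmed from the front.

-- queue of pending synchronous transactions, ordered by LSN
local limbo = {queue = {}}

-- called after the transaction has been written to the local log
function limbo:append(lsn)
    table.insert(self.queue, {lsn = lsn})
end

-- called once everything with LSN <= confirmed_lsn has collected the quorum:
-- those transactions are committed and leave the queue
function limbo:confirm(confirmed_lsn)
    while #self.queue > 0 and self.queue[1].lsn <= confirmed_lsn do
        table.remove(self.queue, 1)
    end
end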
[Diagram: the limbo in the TX thread - a queue of transactions ordered by LSN (..., 5, 7, 10, ...); committed transactions leave from the front once they collect a quorum, new transactions are appended at the back]
A replica writes the transaction to its log and its limbo and responds with its applied vclock. The master's limbo collects the quorum from these responses.
[Diagram walkthrough: transactions 1-3 arrive at the master; the master writes them to its log (vclock {3, 0}) and keeps them waiting in the limbo; they are replicated, the replica writes them to its log (vclock {3, 0}), keeps them waiting in its limbo, and sends an ack with {3, 0}; the master then collects the quorum]
Quorum collection
The limbo uses a special vclock.
[Diagram walkthrough: a master and two replicas; each node has a log vclock, and the master's limbo vclock records the master LSN confirmed by each node]
- Initially both the log vclock and the limbo vclock are {0, 0, 0}.
- The master executes transactions up to LSN 4 (log vclock {0, 4, 0}) and replicates them; after the acks the limbo vclock is {4, 4, 4}: the limbo sees that all transactions with LSN <= 4 got quorum 3.
- The master executes more transactions up to LSN 8 (log vclock {0, 8, 0}) and replicates them; after the master's own write and one replica's ack the limbo vclock is {8, 8, 4}: the limbo sees that LSN 4 got quorum 3, and LSN 8 - quorum 2.
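A small sketch of that quorum check (illustrative, not the actual implementation): if the limbo vclock holds, for each node, the master LSN that node has confirmed, then the LSN that reached a quorum of q nodes is the q-th largest entry.

-- limbo_vclock: node -> confirmed master LSN; quorum: required number of nodes
local function quorum_lsn(limbo_vclock, quorum)
    local lsns = {}
    for _, lsn in pairs(limbo_vclock) do
        table.insert(lsns, lsn)
    end
    table.sort(lsns, function(a, b) return a > b end)
    return lsns[quorum] or 0
end

print(quorum_lsn({4, 4, 4}, 3))  -- 4: LSN 4 has quorum 3
print(quorum_lsn({8, 8, 4}, 3))  -- 4: LSN 8 has only quorum 2 so far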
Replica failure

The limbo checks the quorum as it collects acks; a single failed replica does not block the commit as long as the quorum is collected from the remaining nodes.

[Diagram walkthrough: five nodes with Replica ID = 1...5, quorum = 3; the limbo queue holds transactions with LSNs 1-8]
- Limbo vclock {0, 0, 1, 0, 0}: so far only one node has confirmed LSN 1.
- Got an ack from replica ID = 1: {1, 0, 1, 0, 0}. The total is 2 and the quorum is 3 - wait more.
- Got acks from replicas ID = 4 and 5: {1, 0, 1, 2, 2}. LSN 1 has the quorum: Commit({LSN = 1}) is written to the log.
- Got more acks from all replicas: {5, 5, 4, 6, 7}. Commit LSN 5: Commit({LSN = 5}) is written to the log.
- New transactions keep being appended behind the commit point.
A timeout protects against infinite queue growth.
When the timeout expires, all transactions in the limbo are deleted, a ROLLBACK record is written to the log, and users get an error.
But ROLLBACK does not mean the replication did not happen: after the master dies, the transaction can still be committed on a new master.
All transactions are deleted because they may depend on each other.
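For illustration, this is how that error could be handled on the client side, using the sync space from the earlier example (the option names are real Tarantool options, the handling itself is just a sketch): if the quorum is not collected within replication_synchro_timeout, the commit fails with an error.

box.cfg{replication_synchro_timeout = 5}

local log = require('log')
local ok, err = pcall(function()
    sync:replace{9}  -- a single synchronous statement commits on its own
end)
if not ok then
    -- the transaction got ROLLBACK locally, but it might still be
    -- committed on a new master after a failover
    log.warn('synchronous commit failed: %s', tostring(err))
end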
Automatic

box.cfg{
    election_mode = <mode>,
    election_timeout = <seconds>,
    replication_synchro_quorum = <count>,
}

Automates the second part of the Raft algorithm - leader election.
- election_mode - off / candidate / voter
- election_timeout - election timeout in seconds
- replication_synchro_quorum - quorum for election and replication
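A hypothetical configuration of one node in a three-node replicaset (the addresses and values are illustrative, the option names are real):

box.cfg{
    listen = 3301,
    replication = {'localhost:3301', 'localhost:3302', 'localhost:3303'},
    election_mode = 'candidate',        -- this node may become the leader
    election_timeout = 5,               -- seconds without a leader before voting
    replication_synchro_quorum = 2,     -- acks needed to commit and to win an election
    replication_synchro_timeout = 10,   -- seconds to wait for the quorum
    replication_timeout = 1,            -- timeout for detecting a dead connection
}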
Manual
- The new leader should be the alive node with the biggest LSN of the old leader - compare box.info.vclock.
- On the new leader it is necessary to call box.ctl.clear_synchro_queue() to clear the limbo.
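A hedged sketch of that manual switchover, assuming the old leader's replica ID was 1 (the ID is illustrative):

-- On each alive node, check how much of the old leader's data it has applied:
box.info.vclock[1]          -- the old leader's LSN as seen by this node

-- On the node where this value is the biggest, take over the queue:
box.ctl.clear_synchro_queue()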
Election options:
box.cfg{
    election_mode = <mode>,
    election_timeout = <seconds>,
}
box.info.election

Synchronous replication options:
box.cfg{
    replication_synchro_quorum = <count>,
    replication_synchro_timeout = <seconds>,
    replication_timeout = <seconds>,
}
box.info.replication
box.ctl.clear_synchro_queue()
Create a new space:
box.schema.create_space(name, {is_sync = true})

Turn it on for an existing space:
box.space.<name>:alter{is_sync = true}
Log format
- In Raft the log is linear - one LSN for all nodes.
- In Tarantool the log is vectorised - each node has its own LSN.

Log type
- In Raft the log is UNDO - it can be reverted from the end.
- In Tarantool the log is REDO - it can't be reverted from the end.

This allows implementing master-master synchronous replication in the future.
A replica can't delete uncommitted transactions from its log - that may require a rejoin to the cluster.
Master-master synchronous replication
Automatic cluster assembly with SWIM
Integration with VShard
Example of leader election
Example of synchronous replication
Official site