SWIM – Protocol to Build a Cluster

Vladislav Shpilevoy

Plan

  1. Failure Detection
  2. Gossip protocols
  3. SWIM protocol
  4. SWIM algorithm
  5. Tarantool
  6. SWIM implementation in Tarantool
  7. SWIM extensions
  8. Usage examples
  9. Performance
  10. Plans

Horizontal scaling

Failure detection

Make decisions

Ping


Dead nodes eviction

Repair broken nodes

Ping: check the leader/master/...

Heartbeat: node makes a broadcast

Failure detection

[Plot: messages in the network against the number of nodes N; growth is O(N^2)]

"All to all" does not work

Gossip

[Plot: messages in the network against the number of nodes N; growth is O(N)]

Infection-style protocols

Gossip


All nodes spread the event, not one node

Algorithms are randomized:

  • no synchronization;
  • load is even

Gossip. Failure detection

Direct ping

Indirect ping

Network is broken

Ping

Ack

PingRequest

Ping

Ack

Ack

Self ping

Network is broken

... or me?
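The direct and indirect ping scheme above can be sketched in plain Lua. This is only an illustration under assumptions: `can_reach`, `probe`, the `links` table and the returned strings are hypothetical names, and the table lookup stands in for a real ping/ack round trip over the network.

```lua
-- Hypothetical link table: links[from][to] == true means that a ping
-- from `from` to `to` is answered with an ack.
function can_reach(links, from, to)
    return links[from] ~= nil and links[from][to] == true
end

-- Probe `target` from `from`: try a direct ping first, then ask each
-- helper to ping the target on our behalf (ping request).
function probe(links, from, target, helpers)
    if can_reach(links, from, target) then
        return 'alive (direct)'
    end
    for _, helper in ipairs(helpers) do
        -- The request must reach the helper, and the helper must
        -- reach the target.
        if can_reach(links, from, helper) and
           can_reach(links, helper, target) then
            return 'alive (indirect via ' .. helper .. ')'
        end
    end
    return 'suspected'
end

-- The link between A and B is broken, but C reaches both of them.
links = {
    A = {C = true},
    B = {C = true},
    C = {A = true, B = true},
}
print(probe(links, 'A', 'B', {'C'}))  -- alive (indirect via C)
print(probe(links, 'A', 'D', {'C'}))  -- suspected
```

The indirect step is what separates "the node is down" from "my link to the node is down".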

Gossip. Failure granularity

Continuous state

Alive

Dead

VS

Discrete state

Criteria:

  • Number of ACKs not received
  • Latency
  • Direct or indirect availability

...

Gossip. Dissemination direction

Push

Event

Pull

Request

Event

Push-Pull

Full dissemination: O(log(N))

Detection: O(1)

Network load: O(N)

Node load: O(1)
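The logarithmic dissemination time can be illustrated with a small simulation. This is a sketch of a generic push gossip, not of any particular implementation: `gossip_rounds` is a hypothetical name, every informed node sends the event to exactly one random node per round, and `max_rounds` is only a safety limit. Since the number of informed nodes can at most double per round, at least log2(N) rounds are needed.

```lua
-- Simulate push gossip. Returns the number of rounds it took and the
-- number of informed nodes at the end.
function gossip_rounds(n, max_rounds)
    local informed = {[1] = true}  -- node 1 originates the event
    local count = 1
    local rounds = 0
    while count < n and rounds < max_rounds do
        -- Snapshot the senders: nodes informed during this round
        -- start spreading only in the next one.
        local senders = {}
        for id in pairs(informed) do
            senders[#senders + 1] = id
        end
        for _ = 1, #senders do
            local target = math.random(n)
            if not informed[target] then
                informed[target] = true
                count = count + 1
            end
        end
        rounds = rounds + 1
    end
    return rounds, count
end

for _, n in ipairs({16, 128, 1024}) do
    local rounds = gossip_rounds(n, 1000)
    print(n, rounds)
end
```

The printed round counts vary from run to run because the targets are random, but they grow slowly with the cluster size.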

Gossip. Usage

Lifeguard

SWIM

SWIM

Scalable

O(1)

Weakly-Consistent

O(log(N))

Infection-Style Process

Group Membership

[Diagram: nodes A through H; every node's member table lists A, B, C, D, E, F, G, H]

SWIM

Failure detection

Event dissemination

Ping Request

Ack

Ack

Failure suspicion

Ping + Events in one UDP-packet

T

Each event is sent log(N) times

Suspicion was not confirmed
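Sending each event a bounded number of times can be sketched as a queue with a per-event send budget. The names `send_limit`, `add_event` and `next_packet_events` are hypothetical, and the budget of ceil(log2(N)) with a minimum of 1 is an assumption; a real implementation may use another base or multiplier.

```lua
-- Send budget for one event: ceil(log2(cluster_size)), at least 1.
-- Computed with integers to avoid floating point rounding.
function send_limit(cluster_size)
    local limit, reach = 1, 2
    while reach < cluster_size do
        reach = reach * 2
        limit = limit + 1
    end
    return limit
end

-- Queue a new event to be piggybacked on outgoing pings.
function add_event(queue, text, cluster_size)
    queue[#queue + 1] = {text = text, left = send_limit(cluster_size)}
end

-- Build the event section of the next packet and return the events
-- that still have some budget left.
function next_packet_events(queue)
    local events, remaining = {}, {}
    for _, ev in ipairs(queue) do
        events[#events + 1] = ev.text
        ev.left = ev.left - 1
        if ev.left > 0 then
            remaining[#remaining + 1] = ev
        end
    end
    return events, remaining
end

queue = {}
add_event(queue, 'B is suspected', 8)
sent = 0
while #queue > 0 do
    local events
    events, queue = next_packet_events(queue)
    sent = sent + #events
end
print(sent)  -- 3 sends for a cluster of 8 nodes
```

Once the budget is used up, the event leaves the queue, so packets stay small no matter how many events have happened in the past.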

SWIM

Incarnation

[Diagram: nodes A, B and C; conflicting gossips "B alive" and "B dead" circulate in the cluster]

Assume the worst

How will B restore its state?

'Incarnation' concept is added to refute false gossips

B 1

You know ... how could I say this ... You are dead

?!

B 2

Assume the worst, but greater incarnation = newer information

  • False gossip refutation
  • Protection against UDP problems
  • If equal, assume the worst
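The incarnation rules above fit into two small functions. This is a sketch under assumptions: `severity`, `is_newer` and `refute` are hypothetical names, the incarnation is a plain number, and the status order alive < suspected < dead is one way to express "assume the worst".

```lua
-- Order of statuses used when the incarnations are equal.
severity = {alive = 1, suspected = 2, dead = 3}

-- Decide whether a gossip about a member replaces what is known.
-- A greater incarnation always wins; on a tie the worse status wins.
function is_newer(known, gossip)
    if gossip.incarnation ~= known.incarnation then
        return gossip.incarnation > known.incarnation
    end
    return severity[gossip.status] > severity[known.status]
end

-- What a member does when it hears a false gossip about itself:
-- it bumps its own incarnation so that its "alive" overrides the gossip.
function refute(self_state, gossip)
    if gossip.status ~= 'alive' and
       gossip.incarnation >= self_state.incarnation then
        self_state.incarnation = gossip.incarnation + 1
    end
    return self_state
end

-- B is alive with incarnation 1, but "B dead, 1" is spreading.
b = {status = 'alive', incarnation = 1}
false_gossip = {status = 'dead', incarnation = 1}
print(is_newer(b, false_gossip))  -- true: equal incarnation, worse status
refute(b, false_gossip)
print(b.incarnation)              -- 2
print(is_newer(false_gossip, b))  -- true: "B alive, 2" beats "B dead, 1"
```

Only the member itself increments its incarnation, which is why a greater value can be trusted as newer information.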

DBMS + Application Server in one

C, Lua, SQL, Python, PHP, Go, Java, C# ...

Persistent in-memory and disk storage of your choice

Stored procedures in C, Lua, SQL

Cooperative multitasking with fibers

Thread:

Transaction processing

WAL

Network

Process

Threads

Event-loop

Tasks

Examples in Lua

box.cfg{
    listen = '192.168.1.1:3313',
    replication = {
        '192.168.1.1:3313', 
        '192.168.1.2:3313'
    }
}

Start one node of a replica set, listen for clients, set up replication

nb = require('net.box')
c = nb.connect('192.168.1.1:3313')
c:ping()
c:call(function_name)

Connect to a remote node, ping, call a stored function on it

Tarantool SWIM

Lua module

Many nodes can be created

Almost everything is configurable

Functions for manual manipulation of the member table

Can access individual nodes

Async subscription on events

Graceful cluster quit

swim = require('swim')

s = swim.new([cfg])

s:quit()
s:cfg({
    heartbeat_rate = ...,
    ack_timeout = ...,
    gc_mode = ..., uri = ..., uuid = ...
})

s:probe_member(uri)

s:add_member({uuid = ..., uri = ...})

s:remove_member(uuid)

s:broadcast([port])

s:size()

s:member_by_uuid(uuid)

s:pairs()

s:on_member_event(new_trigger[, old_trigger])

SWIM - Example

Two Tarantools with 2 SWIM nodes

First node sends a ping to the second node

Now the nodes know each other, ping each other and share information

Triggers can be used to learn when and how the member table is updated

If there is no answer for too long, the node is marked dead after several pings

-- Node 1
swim = require('swim')

s = swim.new({uri = 3333, uuid = uuid1,
              ack_timeout = 0.1})

local event
s:on_member_event(function(m, ev, ctx)
    event = {m, ev, ctx}
end)

-- Node 2
swim = require('swim')

s = swim.new({uri = 3334, uuid = uuid2,
              ack_timeout = 0.1})

-- Node 1: ping the second node
s:probe_member(3334)
tarantool> s:size()
---
- 2
...
tarantool> event
---
- - uri: 127.0.0.1:3334
    status: alive
    incarnation: 0
    uuid: 00000000-0000-1000-8000-000000000002
  - is_new: true
...

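-- Node 2: destroy the instance without notifying the cluster
s:delete()

-- Node 1: after several unanswered pings the member is marked dead
tarantool> member = event[1]
---
...

tarantool> member:status()
---
- dead
...
tarantool> s:size()
---
- 2
...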

 

 

 

 

Ping + UUID + IP:Port

Extensions - Anti-entropy

A

B

A, B

A, B

Nodes A and B know each other

C

A third node appeared, knowing B. How will it learn about A?

B, C

B broadcasts an event "new node", but it is lost

Table synchronization is needed. Each message carries a random part of the sender's member table

UDP-packet: Ping + Events + Anti-entropy

New node C!

My table: A, B, C

Sooner or later A and C learn about each other: anti-entropy always works
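Attaching a random part of the member table to each message can be sketched as follows. The names `random_part` and `merge` are hypothetical, members are represented by plain strings, and the part size of 2 is an arbitrary choice for the example.

```lua
-- Pick `count` random distinct members (partial Fisher-Yates shuffle
-- on a copy, so the sender's own table is left untouched).
function random_part(members, count)
    local copy = {}
    for i, m in ipairs(members) do
        copy[i] = m
    end
    count = math.min(count, #copy)
    for i = 1, count do
        local j = math.random(i, #copy)
        copy[i], copy[j] = copy[j], copy[i]
    end
    local part = {}
    for i = 1, count do
        part[i] = copy[i]
    end
    return part
end

-- Merge a received anti-entropy section into the local table.
-- Returns the number of members learned from it.
function merge(known, part)
    local learned = 0
    for _, name in ipairs(part) do
        if not known[name] then
            known[name] = true
            learned = learned + 1
        end
    end
    return learned
end

-- C missed the "new node" event and does not know A, but B keeps
-- sending random parts of its table with every message.
b_table = {'A', 'B', 'C'}
c_known = {B = true, C = true}
packets = 0
while not c_known['A'] and packets < 1000 do
    merge(c_known, random_part(b_table, 2))
    packets = packets + 1
end
print('C learned about A after ' .. packets .. ' packet(s)')
```

A single lost event no longer matters: every later message gives another chance to repair the table.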

Extensions - Payload

After SWIM nodes are linked, how do they find each other's TCP port to connect?

-- Node 1
swim = require('swim')

s = swim.new({
    uri = 3333, uuid = uuid1
})

-- Node 2
swim = require('swim')

s = swim.new({
    uri = 3335, uuid = uuid2
})
box.cfg{listen = 3336}
s:set_payload({
    tport = 3336,
    any_other_data = value
})

Method set_payload spreads arbitrary user-defined information

s:broadcast(3335)

It is enough to link the SWIM nodes in any way

mem = s:member_by_uuid(uuid2)
p = mem:payload()

tarantool> p
---
- {'tport': 3336,
   'any_other_data': value}
...

Payload will be available on all nodes

nb = require('net.box')

tarantool> nb.connect(p.tport)
---
- peer_uuid: ...
  schema_version: 73
  protocol: Binary
  state: active
  peer_version_id: 131584
  port: '3336'
...

You can connect!

tarantool> s:size()
---
- 2
...

Extensions - Encryption

SWIM can be used in an open network

An open network is vulnerable

Tarantool SWIM provides built-in encryption

swim = require('swim')

s = swim.new()

s:set_codec({
    algo = 'aes192', mode = 'cbc', key = private_key
})

Algorithms: DES, AES128/192/256

Modes: ECB, CBC, CFB, OFB

Extensions - Restart detector

Incarnation is not sufficient sometimes

A

B

A 1

Payload:

Two nodes. B sees incarnation A = 1 and payload "blue circle"


A is off

A is restarted, its incarnation is 1 again, but the payload is different

Hey, I have a new payload with incarnation 1!

A1 vs A1? Nothing new. Payload is not changed.

A tells B it has a new payload, but the incarnation is the same, and B still remembers A from before the restart

-- Node A: first start, generation 1
swim = require('swim')
s = swim.new({generation = 1})
s:cfg({uri = 3333, uuid = uuid1})

-- Node B: started second, then linked with A
swim = require('swim')
s = swim.new()
saved_e = nil
s:on_member_event(function(m, e)
    saved_e = e
end)
s:cfg({uri = 3334, uuid = uuid2})

s:probe_member(3333)

-- Node A: restart with a new generation
swim = require('swim')

s = swim.new({generation = 2})

s:cfg({uri = 3333, uuid = uuid1})

-- Node B: the saved event reports the new generation
tarantool> saved_e
---
- is_new_version: true
  is_new_generation: true
  is_update: true
  is_new_incarnation: true
...

In Tarantool SWIM the incarnation consists of two parts, one of which always grows and is persistent

At creation a user can set the generation: this number is the high part of the incarnation. By default it is the current time

It should be set either before or during the first configuration

First is started

Second is started. They are linked

A new generation is set after the restart

The second learns about that via an event
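A two-part incarnation can be compared field by field, high part first. The sketch below is an illustration only: `incarnation_cmp` is a hypothetical name, and the field names `generation` and `version` are chosen to match the event flags `is_new_generation` and `is_new_version` shown earlier.

```lua
-- Compare two incarnations made of a persistent `generation` (high
-- part) and a `version` (low part). Returns -1, 0 or 1.
function incarnation_cmp(a, b)
    if a.generation ~= b.generation then
        return a.generation < b.generation and -1 or 1
    end
    if a.version ~= b.version then
        return a.version < b.version and -1 or 1
    end
    return 0
end

-- Without a generation, a restarted node looks exactly like the old one.
before = {generation = 1, version = 1}
after_same = {generation = 1, version = 1}
print(incarnation_cmp(before, after_same))  -- 0: nothing new

-- With a new generation the restart is visible even though the
-- version started from scratch.
after_new = {generation = 2, version = 0}
print(incarnation_cmp(before, after_new))   -- -1: after_new is newer
```

Because the generation is compared first, a restart wins over any version the node had reached before it.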

Extensions - Restart detector

Practice - Build a cluster

Datacenter

Datacenter

Task:

Create N replicasets with 2 replicas in each, grouped by datacenters

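-- Attributes of one node, taken from the command line
datacenter = arg[1]
swim_uri = arg[2]
box_uri = arg[3]
uuid = arg[4]

fiber = require('fiber')

-- One SWIM instance per node
swim = require('swim').new()

-- Nodes of this datacenter, keyed by UUID string; updated automatically
nodes = {}
function on_event(m, e)
  -- m:uuid() is a cdata object, so its string form is used as the key
  local key = tostring(m:uuid())
  if e:is_drop() then
    nodes[key] = nil
  elseif e:is_new_payload() and
         m:payload() then
    local dc = m:payload().dc
    if dc == datacenter then
      nodes[key] = m
    end
  end
end
swim:on_member_event(on_event)
swim:cfg({
  uri = swim_uri,
  uuid = uuid
})

-- Tell the datacenter and the box port to everyone
swim:set_payload({
  dc = datacenter,
  box = box_uri
})

-- Search for new nodes by periodic broadcast.
-- swim_ports is assumed to be an array of SWIM ports to search,
-- defined elsewhere; the original slide does not show it.
fiber.create(function()
  while true do
    for _, port in ipairs(swim_ports) do
      swim:broadcast(port)
    end
    fiber.sleep(1)
  end
end)

-- Helper: number of keys in a table (not shown on the original slide)
function map_size(t)
  local n = 0
  for _ in pairs(t) do n = n + 1 end
  return n
end

-- 2 nodes are needed to build the replica set
while map_size(nodes) ~= 2 do
  fiber.sleep(1)
end

local rep = {}
for _, n in pairs(nodes) do
  table.insert(rep, n:payload().box)
end

-- Replication is ready to work
box.cfg{
  listen = box_uri,
  replication = rep,
  instance_uuid = uuid,
}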

One SWIM per node

Array with datacenter nodes

Is updated automatically

Tell DC and port to everyone

Search for new nodes by periodic broadcast

2 nodes are needed to build the replicaset

Replication is ready to work!

Practice - Build a cluster - Code

Practice - Other applications

  • Leader election

Among several Tarantools, choose one leader to make decisions about cluster configuration and master failover

  • Monitoring

Find failed nodes, notify an administrator, rebalance load, initiate a master failover

Performance

With heartbeat rate 1 second

[Plot: full dissemination time in seconds (axis ticks 4 to 8) against cluster size (axis ticks 25 to 800); series: real SWIM implementation and log_2(x)]

Summary

SWIM

  • Detection
  • Dissemination
  • Simplicity
  • Availability

Plans

  • Raft for cluster leader change
  • Cluster autobuild
  • Cluster topology discovery
  • More extensions against false-positive failure detection