Vladislav Shpilevoy
Database C developer at Tarantool. Backend C++ developer at VirtualMinds.
Horizontal scaling
Make decisions
Dead nodes eviction
Repair broken nodes
Ping – check the leader/master/...
Heartbeat – node makes a broadcast
nodes
messages in the network
"All to all" does not work
nodes
messages in the network
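Why "all to all" does not scale can be seen by counting messages per round. The sketch below is a back-of-the-envelope calculation, not taken from the slides: it assumes every node heartbeats every other node in the first scheme, and that every node sends one ping and receives one ack in the second. The function names are hypothetical.

```lua
-- Messages per round when every node sends a heartbeat to every
-- other node ("all to all").
local function all_to_all_messages(n)
    return n * (n - 1)
end

-- Messages per round when every node pings one randomly chosen
-- member and gets one ack back (assumed scheme for comparison).
local function ping_ack_messages(n)
    return 2 * n
end

for _, n in ipairs({10, 100, 1000}) do
    print(n, all_to_all_messages(n), ping_ack_messages(n))
end
```

The first count grows quadratically with the cluster size, the second one linearly.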
Infection-style protocols
All nodes spread the event, not one node
Algorithms are randomized:
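A minimal simulation of a randomized, infection-style spread, written as a sketch under simple assumptions: in every round each node that already knows the event tells one randomly chosen node, and no packets are lost. The function name and the optional `random` parameter are hypothetical and exist only for this illustration.

```lua
-- Simulate infection-style dissemination: every round, each node that
-- already knows the event tells one randomly chosen node about it.
-- Returns the number of rounds until all `n` nodes know the event.
local function rounds_to_infect_all(n, random)
    random = random or math.random
    local knows = {[1] = true}   -- node 1 originates the event
    local known_count = 1
    local rounds = 0
    while known_count < n do
        rounds = rounds + 1
        -- Snapshot the spreaders so that nodes infected in this round
        -- start gossiping only in the next one.
        local spreaders = {}
        for id in pairs(knows) do
            spreaders[#spreaders + 1] = id
        end
        for _ = 1, #spreaders do
            local target = random(1, n)
            if not knows[target] then
                knows[target] = true
                known_count = known_count + 1
            end
        end
    end
    return rounds
end

print(rounds_to_infect_all(100))
```

Because the number of informed nodes can at most double per round, the result is never below log2(n); in practice it stays within a small multiple of that.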
Network is broken ... or me?
Ping, Ack
PingRequest, Ping, Ack, Ack
Alive VS Dead
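The direct and indirect probes above can be sketched as follows. This is an illustration only: `link_ok` is a hypothetical predicate standing in for real UDP sends with an ack timeout, and the status names are assumed.

```lua
-- `link_ok(from, to)` is a hypothetical predicate telling whether a
-- packet from `from` reaches `to`.
local function round_trip(link_ok, a, b)
    return link_ok(a, b) and link_ok(b, a)
end

-- Probe `target` from `self`: first directly, then through `helpers`.
-- Returns 'alive' if any path yields an ack, otherwise 'suspected'.
local function probe(link_ok, self, target, helpers)
    -- Direct probe: Ping -> Ack.
    if round_trip(link_ok, self, target) then
        return 'alive', 'direct'
    end
    -- Indirect probe: PingRequest -> Ping -> Ack -> Ack.
    for _, helper in ipairs(helpers) do
        if round_trip(link_ok, self, helper) and
           round_trip(link_ok, helper, target) then
            return 'alive', 'via ' .. helper
        end
    end
    return 'suspected', 'no ack'
end

-- The link between A and B is broken, but C reaches both of them.
local broken = {['A>B'] = true, ['B>A'] = true}
local function link_ok(from, to)
    return not broken[from .. '>' .. to]
end

print(probe(link_ok, 'A', 'B', {'C'}))
```

Here A cannot reach B directly, yet B is still reported alive because the request relayed through C is acknowledged.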
Criteria:
...
Event
Request
Event
Full dissemination:
Detection:
Network load:
Node load:
Lifeguard
Scalable
Weakly-Consistent
Infection-Style Process
Group Membership
A, B, C, D, E, F, G, H
Ping Request
Ack
Ack
Failure suspicion
UDP-packet: Ping + Events
Each event is sent times
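One way to picture a bounded number of sends per event is a queue whose entries carry a remaining-sends counter. The sketch below is an assumption-laden illustration: the slide's actual send count is not recoverable here, so `max_sends` is a hypothetical parameter, as are all function names.

```lua
-- Queue of events waiting to be piggybacked on outgoing messages.
-- `max_sends` is a hypothetical limit chosen by the caller.
local function new_event_queue(max_sends)
    return {max_sends = max_sends, events = {}}
end

local function push_event(queue, text)
    table.insert(queue.events, {text = text, sends_left = queue.max_sends})
end

-- Build the event section of one outgoing packet and drop the events
-- that have used up their send budget.
local function take_events_for_packet(queue)
    local section, remaining = {}, {}
    for _, ev in ipairs(queue.events) do
        table.insert(section, ev.text)
        ev.sends_left = ev.sends_left - 1
        if ev.sends_left > 0 then
            table.insert(remaining, ev)
        end
    end
    queue.events = remaining
    return section
end

local q = new_event_queue(2)
push_event(q, 'B is suspected')
print(#take_events_for_packet(q))
print(#take_events_for_packet(q))
print(#take_events_for_packet(q))
```

With a budget of 2, the event rides on the first two packets and is absent from the third.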
Suspicion was not confirmed
B alive?
B dead?
Assume the worst
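"Assume the worst" can be expressed as an ordering of statuses. A minimal sketch, assuming three status names (`alive`, `suspected`, `dead`) and a hypothetical `severity` table:

```lua
-- Severity order of member statuses; a higher number is "worse".
local severity = {alive = 1, suspected = 2, dead = 3}

-- Resolve two contradicting reports about the same member by
-- assuming that the worse one is true.
local function worst_status(a, b)
    if severity[a] >= severity[b] then
        return a
    end
    return b
end

print(worst_status('alive', 'dead'))
```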
How will B restore its state?
'Incarnation' concept is added to refute false gossips
B 1
You know ... how could I say this ... You are dead
?!
B 2
Assume the worst, but greater incarnation – newer information
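The rule above can be sketched as a merge function. This is an illustration, not Tarantool's code: the record layout (`status`, `incarnation`) and the `severity` table are assumptions made for the example.

```lua
local severity = {alive = 1, suspected = 2, dead = 3}

-- Decide whether a gossiped record should replace the local one.
-- A greater incarnation always wins; with equal incarnations the
-- worse status wins.
local function merge_member(known, gossip)
    if gossip.incarnation > known.incarnation then
        return gossip
    end
    if gossip.incarnation == known.incarnation and
       severity[gossip.status] > severity[known.status] then
        return gossip
    end
    return known
end

-- B is declared dead at incarnation 1, then refutes it with incarnation 2.
local view = {status = 'alive', incarnation = 1}
view = merge_member(view, {status = 'dead', incarnation = 1})
view = merge_member(view, {status = 'alive', incarnation = 2})
print(view.status, view.incarnation)
```

A late "B is dead" gossip with incarnation 1 would no longer change the record, because the local incarnation is already 2.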
DBMS + Application Server in one
C, Lua, SQL, Python, PHP, Go, Java, C# ...
Persistent in-memory and disk storage at your choice
Stored procedures in C, Lua, SQL
Cooperative multitasking with fibers
Threads: transaction processing, WAL, network
Process
Threads
Event-loop
Tasks
Examples in Lua
box.cfg{
    listen = '192.168.1.1:3313',
    replication = {
        '192.168.1.1:3313',
        '192.168.1.2:3313'
    }
}
Start one node of a replica set, listen for clients, set up replication
nb = require('net.box')
c = nb.connect('192.168.1.1:3313')
c:ping()
c:call(function_name)
Connect to a remote node, ping, call a stored function on it
Lua module
Many nodes can be created
Almost everything is configurable
Functions for manual manipulation of the member table
Can access individual nodes
Async subscription to events
Graceful cluster quit
swim = require('swim')
s = swim.new([cfg])
s:quit()
s:cfg({
    heartbeat_rate = ...,
    ack_timeout = ...,
    gc_mode = ...,
    uri = ...,
    uuid = ...
})
s:probe_member(uri)
s:add_member({uuid = ..., uri = ...})
s:remove_member(uuid)
s:broadcast([port])
s:size()
s:member_by_uuid(uuid)
s:pairs()
s:on_member_event(new_trigger[, old_trigger])
Two Tarantools with 2 SWIM nodes
First node sends a ping to the second node
Now the nodes know each other, ping each other, and share information
Triggers report when the member table is updated and what exactly has changed
If a node does not answer for too long, it is marked dead after several unacknowledged pings
swim = require('swim')
s = swim.new({uri = 3333, uuid = uuid1, ack_timeout = 0.1})
local event
s:on_member_event(function(m, ev, ctx)
    event = {m, ev, ctx}
end)
swim = require('swim')
s = swim.new({uri = 3334, uuid = uuid2, ack_timeout = 0.1})
s:probe_member(3334)
tarantool> s:size()
---
- 2
...
tarantool> event
---
- - uri: 127.0.0.1:3334
    status: alive
    incarnation: 0
    uuid: 00000000-0000-1000-8000-000000000002
  - is_new: true
...
tarantool> member = event[1]
---
...
tarantool> member:status()
---
- dead
...
tarantool> s:size()
---
- 2
...
s:delete()
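How a member goes from alive to dead after several unanswered pings can be sketched with a counter of missed acks. The thresholds and function names below are hypothetical and are not Tarantool's actual constants or API; the sketch also leaves out incarnations.

```lua
-- Hypothetical thresholds, chosen only for this illustration.
local ACKS_MISSED_TO_SUSPECT = 2
local ACKS_MISSED_TO_DEAD = 5

local function new_member()
    return {status = 'alive', missed_acks = 0}
end

-- Called when a ping to the member was acknowledged in time.
-- A dead member stays dead here: reviving it would require a
-- refutation with a greater incarnation, which this sketch omits.
local function on_ack(member)
    member.missed_acks = 0
    if member.status == 'suspected' then
        member.status = 'alive'
    end
end

-- Called when the ack timeout expired without an ack.
local function on_ack_timeout(member)
    member.missed_acks = member.missed_acks + 1
    if member.missed_acks >= ACKS_MISSED_TO_DEAD then
        member.status = 'dead'
    elseif member.missed_acks >= ACKS_MISSED_TO_SUSPECT then
        member.status = 'suspected'
    end
end

local m = new_member()
for _ = 1, 5 do
    on_ack_timeout(m)
end
print(m.status)
```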
Ping + UUID + IP:Port
A, B
Nodes A and B know each other
A third node, C, appeared, knowing B. How will it learn about A?
B, C
B broadcasts an event "new node", but it is lost
A table sync is needed: each message carries a random part of the sender's member table
UDP-packet: Ping + Events + Anti-entropy
New node C!
My table: A, B, C
Sooner or later, A and C learn about each other: anti-entropy always works
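A minimal sketch of the anti-entropy idea: the sender attaches a few random records of its member table to a packet, and the receiver adds the ones it does not know. Function names, the record layout and the `count` parameter are assumptions made for this example, not the Tarantool implementation.

```lua
-- Pick up to `count` random records from `member_table` (a map from
-- uuid to record) for the anti-entropy section of an outgoing packet.
local function anti_entropy_section(member_table, count, random)
    random = random or math.random
    local uuids = {}
    for uuid in pairs(member_table) do
        uuids[#uuids + 1] = uuid
    end
    -- Partial Fisher-Yates shuffle: only the first `n` slots are filled.
    local n = math.min(count, #uuids)
    local section = {}
    for i = 1, n do
        local j = random(i, #uuids)
        uuids[i], uuids[j] = uuids[j], uuids[i]
        section[#section + 1] = member_table[uuids[i]]
    end
    return section
end

-- Receiver side: add every record that is not in the table yet.
local function apply_section(member_table, section)
    for _, record in ipairs(section) do
        if member_table[record.uuid] == nil then
            member_table[record.uuid] = record
        end
    end
end

local a = {A = {uuid = 'A'}, B = {uuid = 'B'}}
local b = {A = {uuid = 'A'}, B = {uuid = 'B'}, C = {uuid = 'C'}}
-- B's packets to A eventually include C, even though the event was lost.
while a.C == nil do
    apply_section(a, anti_entropy_section(b, 1))
end
print(a.C.uuid)
```

Even with one record per packet, A learns about C after a few packets, since each packet has a one-in-three chance of carrying C's record.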
After SWIM nodes are linked, how do they find the TCP port to connect to?
swim = require('swim')
s = swim.new({ uri = 3333, uuid = uuid1 })
swim = require('swim')
s = swim.new({
uri = 3335, uuid = uuid2
})
box.cfg{listen = 3336}
s:set_payload({ tport = 3336, any_other_data = value })
Method set_payload spreads arbitrary user-defined information
s:broadcast(3335)
It is enough to connect the SWIM nodes in any way
mem = s:member_by_uuid(uuid2)
p = mem:payload()
tarantool> p
---
- {'tport': 3336, 'any_other_data': value}
...
Payload will be available on all nodes
nb = require('net.box')
tarantool> nb.connect(p.tport)
---
- peer_uuid: ...
  schema_version: 73
  protocol: Binary
  state: active
  peer_version_id: 131584
  port: '3336'
...
You can connect!
tarantool> s:size()
---
- 2
...
SWIM can be used in an open network
Open network is vulnerable
Tarantool SWIM provides built-in encryption
swim = require('swim')
s = swim.new()
s:set_codec({
    algo = 'aes192',
    mode = 'cbc',
    key = private_key
})
Algorithms: DES, AES128/192/256
Modes: ECB, CBC, CFB, OFB
Incarnation is not sufficient sometimes
Two nodes. B sees incarnation A = 1 and payload "blue circle"
A is restarted, incarnation is again 1, payload is different
Hey, I have a new payload with incarnation 1!
A1 vs A1? – Nothing new. Payload is not changed.
A tells B it has a new payload, but the incarnation is the same: B still remembers A from before the restart
A is off
swim = require('swim')
s = swim.new({generation = 1})
swim = require('swim')
s = swim.new()
saved_e = nil
s:on_member_event(function(m, e) saved_e = e end)
s:cfg({uri = 3334, uuid = uuid2})
s:probe_member(3333)
s:cfg({uri = 3333, uuid = uuid1})
Restart
swim = require('swim')
s = swim.new({generation = 2})
s:cfg({uri = 3333, uuid = uuid1})
tarantool> saved_e
---
- is_new_version: true
  is_new_generation: true
  is_update: true
  is_new_incarnation: true
...
In Tarantool SWIM, the incarnation consists of two parts, one of which always grows and is persistent
At creation, a user can set the generation: this number is the high part of the incarnation. By default it is the current time
It should be set either before or during the first configuration
Second is started. They are linked
First is started
New generation is set after start
The second learns about that via an event
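Comparing two-part incarnations can be sketched as a lexicographic comparison. The field names `generation` and `version` are assumed here from the event flags shown in the console output; the function itself is hypothetical.

```lua
-- Compare two incarnations given as {generation = ..., version = ...}.
-- Returns -1, 0 or 1. The generation is the high part: it is compared
-- first, and the version only breaks ties.
local function incarnation_cmp(a, b)
    if a.generation ~= b.generation then
        return a.generation < b.generation and -1 or 1
    end
    if a.version ~= b.version then
        return a.version < b.version and -1 or 1
    end
    return 0
end

local before_restart = {generation = 1, version = 1}
local after_restart = {generation = 2, version = 0}
print(incarnation_cmp(after_restart, before_restart))
```

After a restart the version may start from a small number again, but the greater generation still makes the whole incarnation newer.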
Task:
Create N replicasets with 2 replicas in each, grouped by datacenters
-- One node attributes
datacenter = arg[1]
swim_uri = arg[2]
box_uri = arg[3]
uuid = arg[4]

fiber = require('fiber')

-- One SWIM per node
swim = require('swim').new()

-- Array with datacenter nodes. Is updated automatically
nodes = {}
function on_event(m, e)
    if e:is_drop() then
        nodes[m:uuid()] = nil
    elseif e:is_new_payload() and
           m:payload() then
        local dc = m:payload().dc
        if dc == datacenter then
            nodes[m:uuid()] = m
        end
    end
end
swim:on_member_event(on_event)

swim:cfg({
    uri = swim_uri,
    uuid = uuid
})

-- Tell DC and port to everyone
swim:set_payload({
    dc = datacenter,
    box = box_uri
})

-- Search for new nodes by periodic broadcast.
-- swim_ports is not defined on the slides; it is assumed to be
-- an array of the SWIM ports used in the cluster.
fiber.create(function()
    while true do
        for _, port in ipairs(swim_ports) do
            swim:broadcast(port)
        end
        fiber.sleep(1)
    end
end)

-- 2 nodes are needed to build the replicaset.
-- map_size is not defined on the slides; it is assumed to return
-- the number of keys in a table.
while map_size(nodes) ~= 2 do
    fiber.sleep(1)
end

local rep = {}
for _, n in pairs(nodes) do
    table.insert(rep, n:payload().box)
end

box.cfg{
    listen = box_uri,
    replication = rep,
    instance_uuid = uuid,
}
-- Replication is ready to work!
Among several Tarantool instances, choose one leader to make decisions about cluster configuration and master failover
Find failed nodes, notify an administrator, rebalance load, initiate a master failover
Chart, with heartbeat rate 1 second: full dissemination time (seconds) vs. cluster size
Series: SWIM implementation; real
SWIM
By Vladislav Shpilevoy
SWIM is a protocol for detecting and monitoring cluster nodes and for distributing events and data between them. The protocol is lightweight and decentralized, and its speed does not depend on cluster size. The talk describes how the SWIM protocol is organized and how, and with which extensions, it is implemented in Tarantool.