Vladislav Shpilevoy
Database C developer at Tarantool. Backend C++ developer at VirtualMinds.
Task scheduling
Typical solutions
New task scheduler
Scheduling gears
Verification
Benchmarks
Future plans
Task scheduling is just execution of code
Thread pool
Event loop
Coroutine engine
class TrivialSched:
    Thread myThread;
    Mutex myLock;
    ConditionVariable myCond;
    List<Callback> myQueue;

TrivialSched::Post(const Callback& aFunc) {
    myLock.Lock();
    myQueue.Append(aFunc);
    myCond.Signal();
    myLock.Unlock();
}

TrivialSched::Run() {
    while (!IsStopped()) {
        myLock.Lock();
        while (myQueue.IsEmpty())
            myCond.Wait();
        Callback cb = myQueue.Pop();
        myLock.Unlock();
        cb.Execute();
    }
}
Single thread
Grind the tasks one by one
Simple locked queue
Append under a lock
Pop under a lock
Execute one by one
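Below is a runnable sketch of this trivial scheduler in standard C++. The slides use pseudocode; std::function, std::deque, and the Stop() method are assumptions of this sketch.

#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

class TrivialSched {
public:
    // Append under a lock and wake the worker.
    void Post(std::function<void()> f) {
        std::lock_guard<std::mutex> g(myLock);
        myQueue.push_back(std::move(f));
        myCond.notify_one();
    }
    // Grind the tasks one by one in a single thread.
    void Run() {
        while (!myIsStopped.load()) {
            std::unique_lock<std::mutex> g(myLock);
            myCond.wait(g, [&]{
                return !myQueue.empty() || myIsStopped.load();
            });
            if (myQueue.empty())
                break;
            std::function<void()> cb = std::move(myQueue.front());
            myQueue.pop_front();
            g.unlock();
            cb(); // Execute outside the lock.
        }
    }
    void Stop() {
        std::lock_guard<std::mutex> g(myLock);
        myIsStopped.store(true);
        myCond.notify_one();
    }
private:
    std::mutex myLock;
    std::condition_variable myCond;
    std::deque<std::function<void()>> myQueue;
    std::atomic<bool> myIsStopped{false};
};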
Handling savegames
Locks
Saves
x10k RPS
CPU-intensive
Network exchange
Multiple steps:
lock a user profile
download a saveblob
change the blob
upload the blob
unlock the profile
Won't scale
Choking on CPU
Can't postpone a blocked task
Need multiple threads
Need coroutines
Waiting for events
Tasks stick to threads
Round-robin distribution
Periodic update of each task
Updater "updates" all tasks all the time
Thread pool
Uneven CPU usage
High latency
Waste of CPU
(Timeline diagram: tasks are updated every 100 ms, 5 ms each; events that arrive right after an update wait idle for the next cycle - the minimal latency reaches 215 ms of unnecessary waiting.)
Unnecessary wakeup, cost = 10 us
x10 000 tasks = 100 ms wasted
Fairness
Coroutines
Events
Front Queue
Ready Queue
Worker threads
Workers compete for scheduling
Wait Queue
Who processes the queues?
Mutex
Condition variable
Containers
Lock-free atomics
AtomicLoad(var) { return var; }
Atomically get the old value

AtomicExchange(var, new_value) { old_value = var; var = new_value; return old_value; }
Atomically set a new value and get the old value

AtomicCompareExchange(var, new_value, check) { if (var != check) return false; var = new_value; return true; }
Atomically set if the old value equals something
* Google "memory models" about why it has to be atomic
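In standard C++ these primitives map directly onto std::atomic; a minimal sketch:

#include <atomic>
#include <cassert>

int main() {
    std::atomic<int> var{0};

    // AtomicExchange: set a new value, get the old one.
    int old = var.exchange(42);
    assert(old == 0);

    // AtomicCompareExchange: set only if the current value equals
    // 'expected'. On failure 'expected' is overwritten with the current
    // value - exactly the "get old" variant shown later in the talk.
    int expected = 42;
    bool ok = var.compare_exchange_strong(expected, 43);
    assert(ok && var.load() == 43);

    // AtomicLoad: a plain atomic read.
    assert(var.load() == 43);
    return 0;
}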
Multi-Producer-Single-Consumer*
All threads
Sched-worker
* Common notation for queues: MPSC, MPMC, SPSC, SPMC
High contention
class NormalQueue:
    T* myHead;
    T* myTail;

NormalQueue::Push(T* aItem) {
    aItem->myNext = nullptr;
    if (myHead == nullptr)
        myHead = aItem;
    else
        myTail->myNext = aItem;
    myTail = aItem;
}

NormalQueue::Pop() {
    if (myHead == nullptr)
        return nullptr;
    T* res = myHead;
    myHead = res->myNext;
    return res;
}
Normally a queue needs 2 members: head and tail
Doing it in a lock-free way is hardly possible - too many variables
class MPSCQueue:
    T* myTop;

MPSCQueue::Push(T* aItem) {
    T* oldTop;
    do {
        oldTop = AtomicLoad(myTop);
        aItem->myNext = oldTop;
    } while (not AtomicCompareExchange(myTop, aItem, oldTop));
}

MPSCQueue::PopAll() {
    T* top = AtomicExchange(myTop, nullptr);
    return ReverseList(top);
}
Make it a stack to reduce the number of variables
Retry atomic push of a new top. Stack grows
Pop takes all and turns stack to queue
Completely lock-free
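A sketch of the same MPSC stack in standard C++ (the intrusive Node type here is an assumption):

#include <atomic>

struct Node { Node* myNext; };

class MPSCQueue {
public:
    // Retry an atomic push of a new top; the stack grows.
    void Push(Node* item) {
        Node* oldTop = myTop.load(std::memory_order_relaxed);
        do {
            item->myNext = oldTop;
            // On failure compare_exchange_weak reloads oldTop itself.
        } while (!myTop.compare_exchange_weak(
            oldTop, item, std::memory_order_release,
            std::memory_order_relaxed));
    }
    // Take the whole stack at once and reverse it into FIFO order.
    Node* PopAll() {
        Node* top = myTop.exchange(nullptr, std::memory_order_acquire);
        Node* head = nullptr;
        while (top != nullptr) {
            Node* next = top->myNext;
            top->myNext = head;
            head = top;
            top = next;
        }
        return head;
    }
private:
    std::atomic<Node*> myTop{nullptr};
};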
Quickly get expired tasks
Binary heap
Sched-worker
(Diagram: a binary min-heap of seven nodes; the smallest value, 0, sits at the root.)
Very good time complexity
Perfectly balanced binary tree
Node <= children - a min-heap
Sort tasks by deadlines - the closest on top
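A minimal sketch of such a wait queue with std::priority_queue (the Task type and the time unit are assumptions of this sketch):

#include <cstdint>
#include <queue>
#include <vector>

struct Task {
    uint64_t myDeadline; // E.g. microseconds since start.
};

// Invert the comparison to get a min-heap: the closest deadline on top.
struct CloserDeadline {
    bool operator()(const Task* a, const Task* b) const {
        return a->myDeadline > b->myDeadline;
    }
};

using WaitQueue =
    std::priority_queue<Task*, std::vector<Task*>, CloserDeadline>;

int main() {
    WaitQueue q;
    Task t1{100}, t2{50}, t3{200};
    q.push(&t1); q.push(&t2); q.push(&t3);
    uint64_t now = 120;
    // Quickly get the expired tasks: pop while the top is overdue.
    while (!q.empty() && q.top()->myDeadline <= now)
        q.pop(); // t2 (50) and t1 (100) expire; t3 stays.
    return q.size() == 1 ? 0 : 1;
}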
Multi-Consumer-Single-Producer
All threads
Sched-worker
High contention
There is no simple unbounded lock-free MCSP queue
* Google "ABA-problem" about why
But there are these:
Bounded lock-free MCSP queue
Unbounded lock-based MCSP queue
class LockFreeBounded:
    uint64 myIdxBegin;
    uint64 myIdxEnd;
    uint32 mySize;
    T* myBuffer;

LockFreeBounded::Push(T* aItem) {
    uint64 idxEnd = AtomicLoad(myIdxEnd);
    if (idxEnd - AtomicLoad(myIdxBegin) == mySize)
        return false;
    myBuffer[idxEnd % mySize] = aItem;
    AtomicExchange(myIdxEnd, idxEnd + 1);
    return true;
}

LockFreeBounded::Pop() {
    uint64 idxBegin;
    T* res;
    do {
        idxBegin = AtomicLoad(myIdxBegin);
        if (idxBegin == AtomicLoad(myIdxEnd))
            return nullptr;
        res = myBuffer[idxBegin % mySize];
    } while (not AtomicCompareExchange(myIdxBegin, idxBegin + 1, idxBegin));
    return res;
}
Cyclic array with atomic indexes
Single producer atomically bumps 'end index'
Consumers read by atomically incremented 'begin index'
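A sketch of this bounded queue in standard C++; the capacity becomes a template parameter here for brevity:

#include <atomic>
#include <cstdint>

template<typename T, uint32_t kSize>
class LockFreeBounded {
public:
    // Single producer: write the slot, then bump the end index.
    bool Push(T* item) {
        uint64_t end = myIdxEnd.load(std::memory_order_relaxed);
        if (end - myIdxBegin.load(std::memory_order_acquire) == kSize)
            return false; // Full.
        myBuffer[end % kSize] = item;
        myIdxEnd.store(end + 1, std::memory_order_release);
        return true;
    }
    // Multiple consumers: claim a slot via CAS on the begin index.
    T* Pop() {
        uint64_t begin = myIdxBegin.load(std::memory_order_relaxed);
        T* res;
        do {
            if (begin == myIdxEnd.load(std::memory_order_acquire))
                return nullptr; // Empty.
            res = myBuffer[begin % kSize];
            // On CAS failure 'begin' is reloaded automatically and the
            // stale 'res' is discarded on the retry.
        } while (!myIdxBegin.compare_exchange_weak(
            begin, begin + 1, std::memory_order_acq_rel));
        return res;
    }
private:
    std::atomic<uint64_t> myIdxBegin{0};
    std::atomic<uint64_t> myIdxEnd{0};
    T* myBuffer[kSize];
};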
class LockedUnbounded:
    Mutex myLock;
    List<T> myQueue;

LockedUnbounded::Push(T* aItem) {
    myLock.Lock();
    myQueue.Append(aItem);
    myLock.Unlock();
}

LockedUnbounded::Pop() {
    myLock.Lock();
    T* res = nullptr;
    if (not myQueue.IsEmpty())
        res = myQueue.PopFirst();
    myLock.Unlock();
    return res;
}
Trivial mutex and list
Lock on push and pop
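The lock-protected counterpart in standard C++, as a minimal sketch:

#include <deque>
#include <mutex>

template<typename T>
class LockedUnbounded {
public:
    void Push(T* item) {
        std::lock_guard<std::mutex> g(myLock);
        myQueue.push_back(item);
    }
    T* Pop() {
        std::lock_guard<std::mutex> g(myLock);
        if (myQueue.empty())
            return nullptr;
        T* res = myQueue.front();
        myQueue.pop_front();
        return res;
    }
private:
    std::mutex myLock;
    std::deque<T*> myQueue;
};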
Lock-free
Lock-protected
Lock-based queue of lock-free queues
Single producer
Multiple consumers
Consumers need an explicit state
Lock is taken once per sub-queue size
Task scheduling
Typical solutions
New task scheduler
Scheduling gears
Verification
Benchmarks
Future plans
TaskScheduler parts
Coroutines
Start a download
Handle the result
Need to step away to let other tasks work
Need some timeout for the waiting
Need to wake up when a response arrives
Can yield
Can set a deadline
Can be woken up explicitly
TaskScheduler sched;
HTTPClient http;

MyTask* t = new MyTask();
t->SetCallback(Download);
sched.Post(t);

Download(t):
    t->SetCallback(HandleResult);
    http.GetAsync(url, { sched.Wakeup(t); });
    sched.PostDeadline(t, now + 5 sec);

HandleResult(t):
    if (t->IsExpired()) {
        http.Cancel();
        sched.PostWait(t);
        return;
    }
    if (http.IsSuccess())
        HandleSuccess();
    else
        HandleFailure();
    delete t;
Prepare the next step
Send an async request, wakeup() on completion
Yield until +5 secs or wakeup
Check if the deadline is up. Cancel then
Not expired = completed. Handle it
Completely lock-free, very low overhead
Temporal Logic of Actions
Mathematical logic with a concept of "time"
Language
Runtime
(Diagram: your system / algorithm as a graph of states - Initial, then states A, B, C, D reached by actions.)
Fullscan and checking
CONSTANT Count
CONSTANT Lim
VARIABLES Pipe, LastReceived, LastSent

Init == /\ Pipe = << >>
        /\ LastReceived = 0
        /\ LastSent = 0
Decide on the granularity of the system: objects and the actions on them
Actions consist of conditions. The first one is usually "Init"
Examples of expressions:
X and Y are true: x /\ y
List of items x, y, z: <<x, y, z>>
X equals Y: x = y
All items in set Y are equal 10: \A x \in Y: x = 10
Send == /\ Len(Pipe) < Lim
        /\ LastSent < Count
        /\ Pipe' = Append(Pipe, LastSent + 1)
        /\ LastSent' = LastSent + 1
Conditions for the action to be possible
Conditions which "change" the state if the action is possible
A single quote (') refers to the next value of the variable
Next value of X equals X + 1: X' = X + 1
Recv == /\ Len(Pipe) > 0
        /\ LastReceived' = Head(Pipe)
        /\ Pipe' = Tail(Pipe)
PipeInvariant == /\ \A i \in 1..Len(Pipe) - 1: Pipe[i] + 1 = Pipe[i + 1]
                 /\ Len(Pipe) =< Lim
                 /\ \/ Len(Pipe) = 0
                    \/ Pipe[1] = LastReceived + 1
Items are ordered
The queue never overflows
The first item is always the next to receive (or the queue is empty)
TaskScheduler spec: ~750 lines
MCSP queue spec: ~430 lines
How to run it:
Study TLA+, great course from its author:
Comparative benchmarks: each algorithm vs its naive trivial version
Example: Debian Linux, 8 cores, 2.3GHz, hyperthreading
Front Queue
Ready Queue
Example: Debian Linux, 8 cores, 2.3GHz, hyperthreading
Savegame blobs multistep processing
This is a thread-safe box: it can do anything, not just handle deadlines - for example, add epoll or IOCP
The algorithm is extensible
"Updater":
10x speedup right away
"TaskScheduler":
Debian Linux, 8 cores, 2.3GHz, hyperthreading
Try on ARM
Other languages
Optimizations
TLA+ specs and C++ code:
My talks (and this one too):
Feedback
An event storage
Threads shouldn't poll
Signal sig;
bool hasEvent = false;

Thread 1:
    hasEvent = true;
    sig.Send();

Thread 2:
    sig.BlockingReceive();
    assert(hasEvent);
    // Won't receive again:
    assert(sig.IsEmpty());
class Signal:
    Mutex myLock;
    ConditionVariable myCond;
    bool myFlag;

Signal::Send() {
    myLock.Lock();
    myFlag = true;
    myCond.Signal();
    myLock.Unlock();
}

Signal::BlockingReceive() {
    myLock.Lock();
    while (not myFlag)
        myCond.Wait();
    myFlag = false;
    myLock.Unlock();
}
Usual implementation
Expensive mutex lock on each operation
class Signal:
    Mutex myLock;
    ConditionVariable myCond;
    bool myFlag;

Signal::Send() {
    myLock.Lock();
    AtomicExchange(myFlag, true);
    myCond.Signal();
    myLock.Unlock();
}

Signal::BlockingReceive() {
    if (AtomicExchange(myFlag, false))
        return;
    myLock.Lock();
    while (not AtomicExchange(myFlag, false))
        myCond.Wait();
    myLock.Unlock();
}
Lock-free receipt if already signaled
class Signal:
    Mutex myLock;
    ConditionVariable myCond;
    bool myFlag;

Signal::Send() {
    if (AtomicExchange(myFlag, true))
        return;
    myLock.Lock();
    myCond.Signal();
    myLock.Unlock();
}

Signal::BlockingReceive() {
    if (AtomicExchange(myFlag, false))
        return;
    myLock.Lock();
    while (not AtomicExchange(myFlag, false))
        myCond.Wait();
    myLock.Unlock();
}
Lock-free send if already signaled
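The final Signal translated to standard C++, as a sketch - lock-free on both sides once the flag is already set:

#include <atomic>
#include <condition_variable>
#include <mutex>

class Signal {
public:
    void Send() {
        if (myFlag.exchange(true, std::memory_order_release))
            return; // Already signaled: nothing to wake up.
        std::lock_guard<std::mutex> g(myLock);
        myCond.notify_one();
    }
    void BlockingReceive() {
        if (myFlag.exchange(false, std::memory_order_acquire))
            return; // Fast path: already signaled.
        std::unique_lock<std::mutex> g(myLock);
        while (!myFlag.exchange(false, std::memory_order_acquire))
            myCond.wait(g);
    }
private:
    std::mutex myLock;
    std::condition_variable myCond;
    // Must stay atomic: Send() may flip it without holding the lock.
    std::atomic<bool> myFlag{false};
};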
AtomicCompareExchange(var, new_value, check) {
    if (var != check)
        return false;
    var = new_value;
    return true;
}

AtomicCompareExchangeGetOld(var, new_value, check) {
    old_value = var;
    if (old_value != check)
        return old_value;
    var = new_value;
    return old_value;
}

The former can be expressed via the latter:

AtomicCompareExchange(var, new_value, check) {
    return AtomicCompareExchangeGetOld(var, new_value, check) == check;
}
Before - the top is atomically loaded again on every retry:

class MPSCQueue:
    T* myTop;

MPSCQueue::Push(T* aItem) {
    T* oldTop;
    do {
        oldTop = AtomicLoad(myTop);
        aItem->myNext = oldTop;
    } while (not AtomicCompareExchange(myTop, aItem, oldTop));
}

After - AtomicCompareExchangeGetOld returns the current top on failure, so the retry loop needs no extra load:

MPSCQueue::Push(T* aItem) {
    T* oldTop;
    T* res = AtomicLoad(myTop);
    do {
        oldTop = res;
        aItem->myNext = oldTop;
        res = AtomicCompareExchangeGetOld(myTop, aItem, oldTop);
    } while (res != oldTop);
}

MPSCQueue::PopAll() {
    T* top = AtomicExchange(myTop, nullptr);
    return ReverseList(top);
}
Load the top once
Atomically retry setting a new top and getting the old one
TaskScheduler sched;
HTTPClient http;

Download(t):
    t->SetCallback(HandleResult);
    http.GetAsync(url, {
        http.SetReady();
        sched.Wakeup(t);
    });
    sched.PostDeadline(t, now + 5 sec);

HandleResult(t):
    if (http.IsReady()) {
        Process(http.GetResult());
        delete t;
        return;
    }
    http.Cancel();
    sched.PostWait(t);
Check the completion before the expiration
The code will eventually crash here
The race, step by step:
1. The HTTP thread (A) sets the 'ready' flag.
2. The scheduler thread (B) wakes up by timeout and deletes the task.
3. The HTTP thread uses the deleted task.
TaskScheduler sched;
HTTPClient http;

Download(t):
    t->SetCallback(HandleResult);
    http.GetAsync(url, { sched.Signal(t); });
    sched.PostDeadline(t, now + 5 sec);

HandleResult(t):
    if (t->ReceiveSignal()) {
        Process(http.GetResult());
        delete t;
        return;
    }
    http.Cancel();
    sched.PostWait(t);
Signal is atomic "wakeup + set flag"
By Vladislav Shpilevoy
Algorithm for a multithreaded task scheduler for languages like C, C++, C#, Rust, Java. C++ version is open-sourced. Features: (1) formally verified in TLA+, (2) even CPU usage across worker threads, (3) coroutine-like functionality, (4) almost entirely lock-free, (5) up to 10 million RPS per thread. Key points for the potential audience: fair task scheduling with multiple worker threads; open source; algorithms; TLA+ verified; up to 10 million RPS per thread; for backend programmers; algorithm for languages like C++, C, Java, Rust, C# and others.