Fair Threaded Task Scheduler Verified in TLA+

Vladislav Shpilevoy

Plan

Task scheduling

Typical solutions

New task scheduler

Scheduling gears

Verification

Benchmarks

Future plans

Task scheduling

is just execution of code

Thread pool

Event loop

Coroutine engine

Trivial scheduler

class TrivialSched:
    Thread            myThread;
    Mutex             myLock;
    ConditionVariable myCond;
    List<Callback>    myQueue;
TrivialSched::Post(const Callback& aFunc)
{
    myLock.Lock();
    myQueue.Append(aFunc);
    myCond.Signal();
    myLock.Unlock();
}

TrivialSched::Run()
{
    while(!IsStopped()) {
        myLock.Lock();
        while (myQueue.IsEmpty())
            myCond.Wait();
        Callback cb = myQueue.Pop();
        myLock.Unlock();

        cb.Execute();
    }
}

Single thread

Grind the tasks one by one

Simple locked queue

Append under a lock

Pop under a lock

Execute one by one
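For comparison with real code, here is a minimal runnable C++11 sketch of the same trivial scheduler, built on std::mutex, std::condition_variable, and std::function (the Stop() method and all names here are illustrative additions, not the talk's code):

#include <condition_variable>
#include <functional>
#include <list>
#include <mutex>

class TrivialSched {
public:
    void Post(std::function<void()> aFunc) {
        std::lock_guard<std::mutex> lock(myLock);
        myQueue.push_back(std::move(aFunc));
        myCond.notify_one();
    }
    void Run() {
        for (;;) {
            std::unique_lock<std::mutex> lock(myLock);
            // Sleep until there is a task or the scheduler is stopped.
            myCond.wait(lock, [this] {
                return !myQueue.empty() || myIsStopped;
            });
            if (myIsStopped)
                return;
            std::function<void()> cb = std::move(myQueue.front());
            myQueue.pop_front();
            lock.unlock();
            // Execute outside of the lock.
            cb();
        }
    }
    void Stop() {
        std::lock_guard<std::mutex> lock(myLock);
        myIsStopped = true;
        myCond.notify_one();
    }
private:
    std::mutex myLock;
    std::condition_variable myCond;
    std::list<std::function<void()>> myQueue;
    bool myIsStopped = false;
};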

The real task

Handling savegames

[Diagram: locks service + saves storage, x10k RPS]

Multiple steps:

  • lock a user profile
  • download a saveblob
  • change the blob
  • upload the blob
  • unlock the profile

CPU-intensive

Network exchange

Trivial scheduler problems

Won't scale

Choking on CPU

Can't postpone a blocked task

Need multiple threads

Need coroutines

[Diagram: task1, task2, task3 waiting for events]

Trivial scheduler v2

Tasks stick to threads

Round-robin distribution

Periodic update of each task

Updater "updates" all tasks all the time

Thread pool

Trivial scheduler v2 problems

Uneven CPU usage

High latency

Waste of CPU

[Timing diagram: 5 ms of work per task on a 100 ms update period; min latency is 215 ms]

Got events, but still waiting: unnecessary waiting until the next periodic update

Unnecessary wakeup, cost = 10 us; x10 000 tasks = 100 ms wasted

Summary requirements

Fairness

Coroutines

Events

TaskScheduler

Front Queue

Ready Queue

Worker threads

Workers compete for scheduling

Wait Queue

Who processes the queues?

TaskScheduler example

Front Queue

Ready Queue

Worker threads

Wait Queue

[Animation: tasks travel between the Front Queue, the Wait Queue, the Ready Queue, and the worker threads]

Dependencies

Mutex

Condition variable

Containers

Lock-free atomics

AtomicExchange(var, new_value)
{
    old_value = var;
    var = new_value;
    return old_value;
}

Atomically set a new value and get the old value

AtomicCompareExchange(var, new_value, check)
{
    if (var != check)
        return false;
    var = new_value;
    return true;
}

Atomically set a new value if the old value equals the expected one

AtomicLoad(var)
{
    return var;
}

Atomically get the current value

* Google "memory models" about why it has to be atomic
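In portable C++ these three primitives map one-to-one onto std::atomic. A small illustrative correspondence (using the default sequentially consistent orderings for brevity):

#include <atomic>

void AtomicsDemo() {
    std::atomic<int> var{0};

    // AtomicExchange: atomically set a new value, get the old one back.
    int old = var.exchange(42);

    // AtomicCompareExchange: set the new value only if the current one
    // equals 'expected'; on failure 'expected' receives the value seen.
    int expected = 42;
    bool ok = var.compare_exchange_strong(expected, 43);

    // AtomicLoad: atomically read the current value.
    int cur = var.load();
    (void)old; (void)ok; (void)cur;
}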

Front queue

Multi-Producer-Single-Consumer*

Producers: all threads. Consumer: the sched-worker

High contention

* Common notation for queues: MPSC, MPMC, SPSC, SPMC

class NormalQueue:
    T* myHead;
    T* myTail;
NormalQueue::Push(T* aItem)
{
    aItem->myNext = nullptr;
    if (myHead == nullptr)
        myHead = aItem;
    else
        myTail->myNext = aItem;
    myTail = aItem;
}

NormalQueue::Pop()
{
    if (myHead == nullptr)
        return nullptr;
    T* res = myHead;
    myHead = res->myNext;
    return res;
}

Normally a queue needs 2 members: head and tail

Doing it in a lock-free way is hardly possible - too many variables

Front queue

Multi-Producer-Single-Consumer

class MPSCQueue:
    T* myTop;
MPSCQueue::Push(T* aItem)
{
    T* oldTop;
    do {
        oldTop = AtomicLoad(myTop);
        aItem->myNext = oldTop;
    } while (not AtomicCompareExchange(
        myTop, aItem, oldTop));
}

MPSCQueue::PopAll()
{
    T* top = AtomicExchange(myTop, nullptr);
    return ReverseList(top);
}

Make it a stack to reduce the number of variables

Retry atomic push of a new top. Stack grows

Pop takes all and turns stack to queue

Completely lock-free
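The same stack expressed with std::atomic, as a sketch (the Item type is an assumption for illustration; compare_exchange_weak conveniently reloads the expected value on failure):

#include <atomic>

struct Item { Item* myNext = nullptr; };

// Reverse the detached stack into FIFO order. Single-threaded:
// the list is already private to the caller.
static Item* ReverseList(Item* aHead) {
    Item* res = nullptr;
    while (aHead != nullptr) {
        Item* next = aHead->myNext;
        aHead->myNext = res;
        res = aHead;
        aHead = next;
    }
    return res;
}

class MPSCQueue {
public:
    void Push(Item* aItem) {
        Item* oldTop = myTop.load();
        do {
            aItem->myNext = oldTop;
            // On failure oldTop is updated to the current top automatically.
        } while (!myTop.compare_exchange_weak(oldTop, aItem));
    }
    Item* PopAll() {
        // Detach the whole stack at once, then turn it into a queue.
        return ReverseList(myTop.exchange(nullptr));
    }
private:
    std::atomic<Item*> myTop{nullptr};
};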

Wait queue

Binary heap

Quickly get expired tasks

Sched-worker

O(log(N)) - update
O(1) - get top

[Diagram: a binary heap with nodes 0-6]

Quickly get expired tasks

Very good time complexity

Perfectly balanced binary tree

Parent node <= its children (by deadline)

Sort tasks by deadlines - the closest on top
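Such a wait queue can be sketched on top of std::priority_queue, ordered so that the closest deadline is on top (the Task type and its deadline field are assumptions for illustration):

#include <cstdint>
#include <queue>
#include <vector>

struct Task { uint64_t myDeadline; };

// std::priority_queue is a max-heap, so invert the comparison
// to keep the closest deadline on top.
struct DeadlineGreater {
    bool operator()(const Task* aLeft, const Task* aRight) const {
        return aLeft->myDeadline > aRight->myDeadline;
    }
};

using WaitQueue =
    std::priority_queue<Task*, std::vector<Task*>, DeadlineGreater>;

// The sched-worker's expiration check: O(1) peeks, O(log N) pops.
inline void PopExpired(WaitQueue& aQueue, uint64_t aNow) {
    while (!aQueue.empty() && aQueue.top()->myDeadline <= aNow) {
        Task* t = aQueue.top();
        aQueue.pop();
        // ... move 't' to the ready queue ...
        (void)t;
    }
}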

Ready queue

Multi-Consumer-Single-Producer

Producer: the sched-worker. Consumers: all worker threads

High contention

There is no simple unbounded lock-free MCSP queue*

* Google "ABA-problem" about why

But there are these:

  • Bounded lock-free MCSP queue
  • Unbounded lock-based MCSP queue

Ready queue - Bounded Lock-Free

class LockFreeBounded:
    uint64 myIdxBegin;
    uint64 myIdxEnd;
    uint32 mySize;
    T* myBuffer;
LockFreeBounded::Push(T* aItem)
{
    uint64 idxEnd = AtomicLoad(myIdxEnd);
    if (idxEnd - AtomicLoad(myIdxBegin) == mySize)
        return false;
    myBuffer[idxEnd % mySize] = aItem;
    AtomicExchange(myIdxEnd, idxEnd + 1);
    return true;
}

LockFreeBounded::Pop()
{
    T* res;
    uint64 idxBegin;
    do {
        idxBegin = AtomicLoad(myIdxBegin);
        if (idxBegin == AtomicLoad(myIdxEnd))
            return nullptr;
        res = myBuffer[idxBegin % mySize];
    } while (not AtomicCompareExchange(
        myIdxBegin, idxBegin + 1, idxBegin));
    return res;
}

Cyclic array with atomic indexes

Single producer atomically bumps 'end index'

Consumers read by atomically incremented 'begin index'

Ready queue - Unbounded Locked

class LockedUnbounded:
    Mutex myLock;
    List<T> myQueue;
LockedUnbounded::Push(T* aItem)
{
    myLock.Lock();
    myQueue.Append(aItem);
    myLock.Unlock();
}

LockedUnbounded::Pop()
{
    myLock.Lock();
    T* res = nullptr;
    if (not myQueue.IsEmpty())
        res = myQueue.PopFirst();
    myLock.Unlock();
    return res;
}

Trivial mutex and list

Lock on push and pop

Ready queue - Combine

Lock-based queue of lock-free queues:

  • Lock-free bounded sub-queues
  • Lock-protected list of sub-queues
  • Single producer
  • Multiple consumers
  • Consumers need an explicit state
  • The lock is taken only once per sub-queue worth of items (sketched below)
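A heavily simplified sketch of that combination with std::atomic and std::mutex. Reclamation of drained sub-queues is deliberately omitted (the real algorithm has to solve it); the point is only where the lock is touched: the producer locks once per full sub-queue, a consumer locks once per sub-queue it drains:

#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

// Compact std::atomic rendering of the bounded lock-free sub-queue above.
class LockFreeBounded {
public:
    explicit LockFreeBounded(uint32_t aSize) : myBuffer(aSize), mySize(aSize) {}

    bool Push(void* aItem) {  // Single producer.
        uint64_t idxEnd = myIdxEnd.load();
        if (idxEnd - myIdxBegin.load() == mySize)
            return false;
        myBuffer[idxEnd % mySize] = aItem;
        myIdxEnd.store(idxEnd + 1);
        return true;
    }
    void* Pop() {  // Multiple consumers.
        uint64_t idxBegin = myIdxBegin.load();
        void* res;
        do {
            if (idxBegin == myIdxEnd.load())
                return nullptr;
            res = myBuffer[idxBegin % mySize];
            // On CAS failure idxBegin is reloaded automatically.
        } while (!myIdxBegin.compare_exchange_weak(idxBegin, idxBegin + 1));
        return res;
    }
private:
    std::atomic<uint64_t> myIdxBegin{0}, myIdxEnd{0};
    std::vector<void*> myBuffer;
    const uint32_t mySize;
};

class ReadyQueue {
public:
    // Single producer: locks only when the tail sub-queue is full.
    void Push(void* aItem) {
        if (myTail != nullptr && myTail->Push(aItem))
            return;
        LockFreeBounded* sub = new LockFreeBounded(kSubSize);
        sub->Push(aItem);
        std::lock_guard<std::mutex> lock(myLock);
        mySubQueues.push_back(sub);
        myTail = sub;
    }
    // The explicit per-consumer state: which sub-queue it is draining.
    struct Consumer { LockFreeBounded* mySub = nullptr; size_t myIdx = 0; };

    void* Pop(Consumer& aConsumer) {
        for (;;) {
            if (aConsumer.mySub != nullptr) {
                void* res = aConsumer.mySub->Pop();
                if (res != nullptr)
                    return res;
            }
            // Locked only to attach to a sub-queue or when the queue is empty.
            std::lock_guard<std::mutex> lock(myLock);
            if (aConsumer.mySub == nullptr) {
                if (mySubQueues.empty())
                    return nullptr;
                aConsumer.mySub = mySubQueues[aConsumer.myIdx];
            } else if (aConsumer.mySub == myTail) {
                return nullptr;  // The tail may still refill; empty for now.
            } else {
                aConsumer.mySub = mySubQueues[++aConsumer.myIdx];
            }
        }
    }
private:
    static constexpr uint32_t kSubSize = 1024;
    std::mutex myLock;
    std::vector<LockFreeBounded*> mySubQueues;  // Never freed in this sketch.
    LockFreeBounded* myTail = nullptr;
};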

Ready queue example

[Animation: items B and A are pushed by the sched-worker and consumed by the worker threads through the sub-queues]

Our progress

Task scheduling

Typical solutions

New task scheduler

Scheduling gears

Verification

Benchmarks

Future plans

TaskScheduler parts

Coroutines

Start a download

Handle the result

Need to step away to let other tasks work

Need some timeout for the waiting

Need to wakeup when get a response

Can yield

Can set a deadline

Can be woken up explicitly

Coroutines

TaskScheduler sched;
HTTPClient http;

MyTask *t = new MyTask();
t->SetCallback(Download);
sched.Post(t);

Download(t):
    t->SetCallback(HandleResult);
    http.GetAsync(url, {
        sched.Wakeup(t);
    });
    sched.PostDeadline(t, now + 5 sec);

HandleResult(t):
    if (t->IsExpired()) {
        http.Cancel();
        sched.PostWait(t);
        return;
    }
    if (http.IsSuccess())
        HandleSuccess();
    else
        HandleFailure();
    delete t;

Prepare the next step

Send an async request, wakeup() on completion

Yield until now + 5 sec or an explicit wakeup

Check whether the deadline expired; cancel the request if so

Not expired = completed. Handle it

Completely lock-free, very low overhead

How to verify a multithreaded algorithm?

TLA+ Verification

Temporal Logic of Actions

Mathematical logic with a concept of "time"

Language

Runtime

[Diagram: your system / algorithm as a graph of states - Initial, then states A, B, C, D]

Full scan and checking of all reachable states

TLA+ Example

CONSTANT Count
CONSTANT Lim
VARIABLES Pipe,
          LastReceived,
          LastSent

Init ==
  /\ Pipe = << >>
  /\ LastReceived = 0
  /\ LastSent = 0

Decide on the granularity of the system: its objects and the actions on them

An action is a conjunction of conditions; the first one is usually Init

Examples of expressions:

  X and Y are true:              x /\ y
  List of items x, y, z:         <<x, y, z>>
  X equals Y:                    x = y
  All items in set Y equal 10:   \A x \in Y: x = 10

TLA+ Example

CONSTANT Count
CONSTANT Lim
VARIABLES Pipe,
          LastReceived,
          LastSent

Init ==
  /\ Pipe = << >>
  /\ LastReceived = 0
  /\ LastSent = 0

Send ==
  /\ Len(Pipe) < Lim
  /\ LastSent < Count
  /\ Pipe' = Append(Pipe, LastSent + 1)
  /\ LastSent' = LastSent + 1

Conditions for the action to be possible

Conditions which "change" the state if the action is possible

A single quote (') refers to the next value of a variable

Next value of X equals X + 1:
X' = X + 1

TLA+ Example

CONSTANT Count
CONSTANT Lim
VARIABLES Pipe,
          LastReceived,
          LastSent

Init ==
  /\ Pipe = << >>
  /\ LastReceived = 0
  /\ LastSent = 0

Send ==
  /\ Len(Pipe) < Lim
  /\ LastSent < Count
  /\ Pipe' = Append(Pipe, LastSent + 1)
  /\ LastSent' = LastSent + 1

Recv ==
  /\ Len(Pipe) > 0
  /\ LastReceived' = Head(Pipe)
  /\ Pipe' = Tail(Pipe)

PipeInvariant ==
  /\ \A i \in 1..Len(Pipe) - 1: Pipe[i] + 1 = Pipe[i + 1]
  /\ Len(Pipe) =< Lim
  /\ \/ Len(Pipe) = 0
     \/ Pipe[1] = LastReceived + 1

Items are ordered

The queue never overflows

The first item is always the next to receive (or the queue is empty)
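One piece of conventional glue is not on the slide: to let the TLC model checker run, the actions are combined into a next-state relation and a temporal formula. A sketch of that boilerplate (assuming the spec EXTENDS Naturals and Sequences for Len, Append, Head, and Tail):

vars == <<Pipe, LastReceived, LastSent>>

Next == Send \/ Recv

Spec == Init /\ [][Next]_vars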

TaskScheduler TLA+

TaskScheduler spec: ~750 lines

./tla/TaskScheduler.tla

MCSP queue spec: ~430 lines

./tla/MCSPQueue.tla

How to run it:

./tla/README.md

To study TLA+, there is a great course from its author:

lamport.azurewebsites.net/tla/tla.html

Benchmarks - parts

Comparative: each algorithm vs its naive trivial version

Example: Debian Linux, 8 cores, 2.3GHz, hyperthreading

Front Queue:

  • 5 push-threads: ~9 mln/sec, x1.5 faster vs trivial
  • 10 push-threads: ~5.8 mln/sec, x2.6 faster vs trivial

Ready Queue:

  • 5 pop-threads: ~2.5 mln/sec, x2.6 faster vs trivial, x0.0009 lock-contention
  • 10 pop-threads: ~1.7 mln tasks/sec, x4.5 faster vs trivial, x0.0007 lock-contention

Benchmarks - scheduler

Example: Debian Linux, 8 cores, 2.3GHz, hyperthreading

  • 1 worker: ~11 mln/sec, x2.2 faster vs trivial, x0 lock-contention
  • 5 workers: ~4 mln/sec, x3 faster vs trivial, x0.002 lock-contention
  • 10 workers: ~5.2 mln/sec, x7.5 faster vs trivial, x0.003 lock-contention

Real usage

Savegame blobs multistep processing

Debian Linux, 8 cores, 2.3GHz, hyperthreading

"Updater":

  • 4 workers
  • ~100-300 RPS
  • ~500 ms latency

"TaskScheduler":

  • 4 workers
  • >10000 RPS
  • ~100 ms latency

x10 speed up right away

The algorithm is extendible: the scheduler is a thread-safe box, so anything can be done inside it, not just deadlines - e.g. add epoll or IOCP

Future plans

Try on ARM

Other languages

Optimizations

TLA+ specs and C++ code:

github.com/ubisoft/task-scheduler

Feedback

Additional content

Signal

An event storage

Threads shouldn't poll

Signal

An event storage

Thread 1:
    hasEvent = true;
    sig.Send();
Thread 2:
    sig.BlockingReceive()
    assert(hasEvent);
    // Won't receive again:
    assert(sig.IsEmpty());
Signal sig;
bool hasEvent = false;

Threads shouldn't poll

Signal

An event storage

class Signal:
    Mutex myLock;
    ConditionVariable myCond;
    bool myFlag;
Signal::Send()
{
    myLock.Lock();
    myFlag = true;
    myCond.Signal();
    myLock.Unlock();
}
Signal::BlockingReceive()
{
    myLock.Lock();
    while (not myFlag)
        myCond.Wait();
    myFlag = false;
    myLock.Unlock();
}

Usual implementation

Expensive mutex lock on each operation

Threads shouldn't poll

Signal

An event storage

class Signal:
    Mutex myLock;
    ConditionVariable myCond;
    bool myFlag;
Signal::Send()
{
    myLock.Lock();
    AtomicExchange(myFlag, true);
    myCond.Signal();
    myLock.Unlock();
}
Signal::BlockingReceive()
{
    if (AtomicExchange(myFlag, false))
        return;
    myLock.Lock();
    while (not AtomicExchange(myFlag, false))
        myCond.Wait();
    myLock.Unlock();
}

Threads shouldn't poll

Lock-free receipt if already signaled

Signal

An event storage

class Signal:
    Mutex myLock;
    ConditionVariable myCond;
    bool myFlag;
Signal::Send()
{
    if (AtomicExchange(myFlag, true))
        return;
    myLock.Lock();
    myCond.Signal();
    myLock.Unlock();
}
Signal::BlockingReceive()
{
    if (AtomicExchange(myFlag, false))
        return;
    myLock.Lock();
    while (not AtomicExchange(myFlag, false))
        myCond.Wait();
    myLock.Unlock();
}

Threads shouldn't poll

Lock-free receipt if already signaled

Lock-free send if already signaled
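The final variant translates almost mechanically into portable C++; a sketch with std::atomic and std::condition_variable (illustrative, not the shipped code):

#include <atomic>
#include <condition_variable>
#include <mutex>

class Signal {
public:
    void Send() {
        // Lock-free if the signal was already set.
        if (myFlag.exchange(true))
            return;
        std::lock_guard<std::mutex> lock(myLock);
        myCond.notify_one();
    }
    void BlockingReceive() {
        // Lock-free if the signal was already set.
        if (myFlag.exchange(false))
            return;
        std::unique_lock<std::mutex> lock(myLock);
        while (!myFlag.exchange(false))
            myCond.wait(lock);
    }
private:
    std::mutex myLock;
    std::condition_variable myCond;
    std::atomic<bool> myFlag{false};
};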

Smart usage of cmpxchg

AtomicCompareExchange(var, new_value, check)
{
    if (var != check)
        return false;
    var = new_value;
    return true;
}

Smart usage of cmpxchg

AtomicCompareExchangeGetOld(var, new_value, check)
{
    old_value = var;
    if (old_value != check)
        return old_value;
    var = new_value;
    return old_value;
}

AtomicCompareExchange(var, new_value, check)
{
    return AtomicCompareExchangeGetOld(
        var, new_value, check) == check;
}
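In C++ this "get old" form is what std::atomic already provides: compare_exchange_strong takes 'expected' by reference and overwrites it with the observed value on failure. A sketch of the correspondence:

#include <atomic>

// Returns the value observed in aVar; it equals aCheck exactly on success.
template<typename T>
T AtomicCompareExchangeGetOld(std::atomic<T>& aVar, T aNewValue, T aCheck) {
    T old = aCheck;
    aVar.compare_exchange_strong(old, aNewValue);
    return old;  // Unchanged on success, the actual old value on failure.
}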

Smart usage of cmpxchg

class MPSCQueue:
    T* myTop;
MPSCQueue::Push(T* aItem)
{
    T* oldTop;
    do {
        oldTop = AtomicLoad(myTop);
        aItem->myNext = oldTop;
    } while (not AtomicCompareExchange(
        myTop, aItem, oldTop));
}

MPSCQueue::PopAll()
{
    T* top = AtomicExchange(myTop, nullptr);
    return ReverseList(top);
}

Smart usage of cmpxchg

class MPSCQueue:
    T* myTop;
MPSCQueue::Push(T* aItem)
{
    T* oldTop;
    T* res = AtomicLoad(myTop);
    do {
        oldTop = res;
        aItem->myNext = oldTop;
        res = AtomicCompareExchangeGetOld(myTop, aItem, oldTop);
    } while (res != oldTop);
}

MPSCQueue::PopAll()
{
    T* top = AtomicExchange(myTop, nullptr);
    return ReverseList(top);
}

Load the top once

Atomically retry setting a new top and getting the old one

Task signal

TaskScheduler sched;
HTTPClient http;

Download(t):
    t->SetCallback(HandleResult);
    http.GetAsync(url,
    {
        http.SetReady();
        sched.Wakeup(t);
    });
    sched.PostDeadline(t, now + 5 sec);

HandleResult(t):
    if (http.IsReady())
    {
        Process(http.GetResult());
        delete t;
        return;
    }
    http.Cancel();
    sched.PostWait(t);

Check the completion before the expiration

HTTP threads set the 'ready' flag

The code will eventually crash here:

  1. The HTTP thread (Thread A) sets the 'ready' flag
  2. The scheduler thread (Thread B) wakes up by timeout and deletes the task
  3. The HTTP thread uses the deleted task

Task signal

TaskScheduler sched;
HTTPClient http;

Download(t):
    t->SetCallback(HandleResult);
    http.GetAsync(url,
    {
        sched.Signal(t);
    });
    sched.PostDeadline(t, now + 5 sec);

HandleResult(t):
    if (t->ReceiveSignal())
    {
        Process(http.GetResult());
        delete t;
        return;
    }
    http.Cancel();
    sched.PostWait(t);

Signal is atomic "wakeup + set flag"

Highload 2022: Fair threaded task scheduler verified in TLA+

By Vladislav Shpilevoy

Algorithm for a multithreaded task scheduler for languages like C, C++, C#, Rust, Java. C++ version is open-sourced. Features: (1) formally verified in TLA+, (2) even CPU usage across worker threads, (3) coroutine-like functionality, (4) almost entirely lock-free, (5) up to 10 million RPS per thread. Key points for the potential audience: fair task scheduling with multiple worker threads; open source; algorithms; TLA+ verified; up to 10 million RPS per thread; for backend programmers; algorithm for languages like C++, C, Java, Rust, C# and others.
