Advanced C++

Multithreading

Máté Cserép

December 2022, Budapest

Overview

Atomicity
Threads and thread safety
Threads with STL
Synchronization objects
Deadlocks
Singleton Pattern
Condition variables
Asynchronicity with futures and promises
Packaged tasks
Parallel STL

Multithreading in C++

Multithreading is just one damn thing after, before, or simultaneous with another.

Andrei Alexandrescu

Multithreading in C++

Beyond the errors which can occur in single-threaded programs,
multithreaded environments are subject to additional errors:

Race conditions
Deadlock, livelock
Priority failures (priority inversion, starvation, etc.)

Moreover testing and debugging of multithreaded programs are harder. Multithreaded programs are non-deterministic. Failures are often non-repeatable. Debugged code can produce very different results then non-debugged ones. Testing on single processor hardware may produce different results than testing on multiprocessor hardware.

Atomicity

int x, y;

// thread 1               
x = 1;
y = 2;

Since C++11 this is undefined behaviour, in C++98/03 not even that.



// thread 2                  
std::cout << y << ", ";
std::cout << x << std::endl;

int x, y;
std::mutex x_mutex, y_mutex;

// thread 1              
x_mutex.lock();
x = 1;
x_mutex.unlock();
y_mutex.lock();
y = 2;
y_mutex.unlock();

Workaround with mutexes:




// thread 2                  
y_mutex.lock();
std::cout << y << ", ";
y_mutex.unlock();
x_mutex.lock();
std::cout << x << std::endl;
x_mutex.unlock();

std::atomic<int> x, y;

// thread 1
x.store(1);
y.store(2);

Workaround with atomic:



// thread 2
std::cout << y.load() << ", ";   
std::cout << x.load() << std::endl;

Threads

namespace std
{
  class thread
  {
  public:
    typedef native_handle /* ... */;
    typedef id /* ... */;

    thread() noexcept;               // does not represent a thread
    thread(thread&& other) noexcept; // move constructor
    ~thread();                       // if joinable() calls std::terminate()

    template <typename Function, typename... Args> // copies args to thread local
    explicit thread(Function&& f, Arg&&... args);  // then execute f with args

    thread(const thread&) = delete;             // no copy
    thread& operator=(thread&& other) noexcept; // move
    void swap(thread& other);                   // swap 

    bool joinable() const; // thread object owns a physical thread 
    void join();           // blocks current thread until *this finish
    void detach();         // separates physical thread from the thread object

    std::thread::id get_id() const;              // std::this_thread
    static unsigned int hardware_concurrency();  // supported concurrent threads
    native_handle_type native_handle();          // e.g. thread id
  };
}

Typesafe parameter passing to the thread

void f(int i, const std::string& s);

std::thread t(f, 3, "Hello");

Creates a new thread of execution with t, which calls f(3, "hello"), where arguments are copied (as is) into an internal storage (even if the function takes them as reference).

If an exception occurs, it will be thrown in the hosting thread.

class f
{
public:
    f(int i = 0, std::string s = "") : _i(i), _s(s) { }
    void operator()() const
    {
       // background activity
    }
    int _i;
    std::string _s;
};

std::thread t(f());              // Most vexing parse (Scott Meyers: Effective STL) prior C++11
std::thread t((f()));            // OK
std::thread t((f(3, "Hello")));  // OK

f can be any callable function, e.g. operator():

See code example

Passing parameter by reference

void f( int i, const std::string&)
{
    std::cout << "Hello Concurrent World!" << std::endl;
}

int main()
{
    int i = 3;
    std::string s("Hello");

    // Will copy both i and s
    //std::thread t(f, i, s);

    // We can prevent the copy by using reference wrapper
    std::thread t(f, std::ref(i), std::ref(s));

    // If the thread destructor runs and the thread is joinable, then 
    // std::system_error will be thrown.
    // Use join() or detach() to avoid that.
    t.join();
    return 0;
}

By default all arguments are copied by value, even if the function takes them as reference.

Alternative destructor strategies

Implicit join: the destructor waits until the thread execution is completed. Hard to detect performance issues.

Implicit detach: destructor runs, but the underlying thread will still run. Destructor may free memory but the thread may try to access them.

Scott Meyers

Possible problems

struct func
{
    int& i;
    func(int& i_) : i (i_) { }

    void operator()()
    {
        for(unsigned int j = 0; j < 1000000; ++j)
        {
            do_something(i);  // i may refer to a destroyed variable
        }
    }
};

int main()
{
    int some_local_state = 0;
    func my_func(some_local_state);
    std::thread my_thread(my_func);
    my_thread.detach();  // don't wait the thread to finish
    return 0;
}  // some_local_state is destroyed, but the thread is likely still running.

Still, there is possible to make wrong code (of course, this is C++).
Better to avoid pointers and references, or join().

Scoped thread

class scoped_thread
{
  std::thread t;
public:
  explicit scoped_thread(std::thread t_): t(std::move(t_))
  {
    if(!t.joinable())
      throw std::logic_error(“No thread”);
  }
  ~scoped_thread()
  {
    t.join();
  }
  scoped_thread(scoped_thread const&)=delete;
  scoped_thread& operator=(scoped_thread const&)=delete;
};

struct func;

void f()
{
  int some_local_state;
  scoped_thread t(std::thread(func(some_local_state)));
  do_something_in_current_thread();
}

Source: C++ Concurrency In Action, by Anthony Williams

Implementation can also be found in the Boost Library: scoped_thread

And also in the standard since C++20: std::jthread

Possible problems

void f(int i, const std::string& s);

std::thread t(f, 3, "Hello");

void f(int i, const std::string& s);

void bad(int some_param)
{
    char buffer[1024];
    sprintf(buffer, "%i", some_param);
    std::thread t(f, 3, buffer);
    t.detach();
}

void good(int some_param)
{
    char buffer[1024];
    sprintf(buffer,"%i",some_param);
    std::thread t(f, 3, std::string(buffer));
    t.detach();
}

"Hello" is passed to f as const char * and converted to std::string in the new thread. This can lead to problems, e.g.:

Threads with STL

void do_work(unsigned id);

int main()
{
  std::vector<std::thread> threads;
  for(unsigned i=0;i<20;++i)
  {
    threads.push_back(std::thread(do_work,i));
  }
  std::for_each(threads.begin(), threads.end(), 
                std::mem_fn(&std::thread::join)); // generates functor for function
  return 0;
}

sth::thread is compatible with the STL containers.

Syncronization objects

std::mutex m;
int sh; // shared data

void f()
{
    /* ... */
    m.lock();
    // manipulate shared data:
    sh += 1;
    m.unlock();
    /* ... */
}

Mutex:

Recursive mutex:

std::recursive_mutex m;
int sh; // shared data

void f(int i)
{
    /* ... */
    m.lock();
    // manipulate shared data:
    sh += 1;
    if (--i > 0) f(i);
    m.unlock();
    /* ... */
}

Syncronization objects

std::timed_mutex m;
int sh; // shared data

void f()
{
  /* ... */
  if (m.try_lock_for(std::chrono::seconds(10))) {
    // we got the mutex, manipulate shared data:
    sh += 1;
    m.unlock();
  }
  else {
    // we didn't get the mutex; do something else
  }
}
void g()
{
  /* ... */
  if (m.try_lock_until(midnight)) {
    // we got the mutex, manipulate shared data:
    sh+=1;
    m.unlock();
  }
  else {
    // we didn't get the mutex; do something else
  }
}

Timed mutex:

Locks

std::list<int> l;
std::mutex     m;

void add_to_list(int value);
{
    // lock acquired - with RAII style lock management
    std::lock_guard<std::mutex> guard(m);
    l.push_back(value);
}   // lock released

Locks support the Resource Allocation Is Initialization (RAII) idiom.

Pointers or references pointing out from the guarded area can be an issue!

Shared mutexes locks

Since C++14 we have synchronization objects that can be used to protect shared data from being simultaneously accessed by multiple threads. In contrast to basic mutex and lock types which facilitate exclusive access, shared mutexes and locks has two levels of access:

shared - several threads can share ownership of the same mutex.
exclusive - only one thread can own the mutex.

If one thread has acquired the exclusive lock, no other threads can acquire the lock (including the shared).

If one thread has acquired the shared lock, no other thread can acquire the exclusive lock, but can acquire the shared lock.

New types are: shared_mutex, shared_timed_mutex, shared_lock

shared_mutex and shared_lock

std::shared_mutex m;
my_data d;

void reader()
{
  std::shared_lock<std::shared_mutex> rl(m);
  read_only(d);
}

void writer()
{
  std::lock_guard<std::shared_mutex> wl(m);
  write(d);
}

std::shared_mutex m;
my_data d;

void reader()
{
  m.lock_shared();
  read_only(d);
  m.unlock_shared();
}

void writer()
{
  m.lock();
  write(d);
  m.unlock();
}

Deadlocks

template <class T>
bool operator<(const T& lhs, const X& rhs)
{
  if (&lhs == &rhs)
    return false;

  lhs.m.lock(); rhs.m.lock();
  bool result = lhs.data < rhs.data;
  lhs.m.unlock(); rhs.m.unlock();
  return result;
}

The code below can result in a deadlock when a < b and b < a are simultaneously evaluated on 2 threads.

Avoid deadlocks

Avoid nested locks
Avoid user defined call when holding a lock
Acquire locks in a fixed order

Deadlocks

template <class T>
bool operator<(T const& lhs, X const& rhs)
{
  if (&lhs == &rhs)
    return false;

  // std::lock - locks two or more mutexes
  std::lock(lhs.m, rhs.m);

  // std::adopt_lock - assume the calling thread already has ownership
  std::lock_guard<std::mutex> lock_lhs(lhs.m, std::adopt_lock); 
  std::lock_guard<std::mutex> lock_rhs(rhs.m, std::adopt_lock);

  return lhs.data < rhs.data;
}

A correct solution to avoid deadlock:

With the lock guards, mutexes are released with RAII.

Deadlocks

template <class T>
bool operator<(T const& lhs, X const& rhs)
{
  if (&lhs == &rhs )
    return false;

  // std::unique_locks constructed with defer_lock can be locked
  // manually, by using lock() on the lock object ...
  std::unique_lock<std::mutex> lock_lhs(lhs.m, std::defer_lock);
  std::unique_lock<std::mutex> lock_rhs(rhs.m, std::defer_lock);
  // lock_lhs.owns_lock() now false

  // ... or passing to std::lock
  std::lock(lock_lhs, lock_rhs);  // designed to avoid dead-lock
  // also there is an unlock() member function

  // lock_lhs.owns_lock() now true
  return lhs.data < rhs.data;
}

Another correct solution with different approach:

std::unique_lock can be locked and unlocked, in comparison to std::lock_guard.

(It is also moveable, but not copyable, but that is not a factor here.)

Deadlocks

template <class T>
bool operator<(T const& lhs, X const& rhs)
{
  if (&lhs == &rhs )
    return false;

  // designed to avoid dead-lock
  std::scoped_lock(lhs.m, rhs.m);
  
  return lhs.data < rhs.data;
}

C++17 simplifies the problem with the introduction of scoped_lock, specifically designed for locking (and releasing) multiple mutexes and the same time in RAII style:

Singleton Pattern: single thread

template <typename T>
class MySingleton
{
public:
  static std::shared_ptr<T> instance()
  {
    if (!resource_ptr)
    {
      resource_ptr.reset(new MySingleton(/* ... */)); // lazy initialization
    }
    return resource_ptr;
  }
private:
  static std::shared_ptr<T> resource_ptr;
  static mutable std::mutex resource_mutex;
};

Problem: in case of multiple threads, a race condition can occur, where the resource_ptr gets initialized more than once!

Singleton Pattern: naïve

template <typename T>
class MySingleton
{
public:
  static std::shared_ptr<T> instance()
  {
    std::unique_lock<std::mutex> lock(resource_mutex);
    if (!resource_ptr)
      resource_ptr.reset(new MySingleton(/* ... */));
    lock.unlock();
    return resource_ptr;
  }
private:
  static std::shared_ptr<T> resource_ptr;
  static mutable std::mutex resource_mutex;
};

Problem: while the problematic race condition is connected only to the initialization of the Singleton instance, the critical section is executed for every calls of the instance() method. Such an excessive usage of the locking mechanism may cause serious overhead which could not be acceptable.

Singleton Pattern: naïve (again)

template <typename T>
class MySingleton
{
public:
  static std::shared_ptr<T> instance()
  {
    if (!resource_ptr)
    {
      std::lock_guard<std::mutex> guard(resource_mutex);
      resource_ptr.reset(new MySingleton(/* ... */));
    }
    return resource_ptr;
  }
private:
  static std::shared_ptr<T> resource_ptr;
  static mutable std::mutex resource_mutex;
};

Problem: to address the previous problem, the critical section is narrowed down to the construction of the object. This means checking the resource_ptr to be initialized is not guarded, resulting in possible multiple object creation.

Singleton Pattern: DCLP

template <typename T>
class MySingleton
{
public:
  std::shared_ptr<T> instance()
  {
    if (!resource_ptr)  // 1
    {
      std::unique_lock<std::mutex> lock(resource_mutex);
      if (!resource_ptr)
        resource_ptr.reset(new T(/* ... */));  // 2
      lock.unlock();
    }
    return resource_ptr;
  }
private:
  static std::shared_ptr<T> resource_ptr;
  static mutable std::mutex resource_mutex;
};

Problem: load in (1) and store in (2) is not synchronized:
An overly-agressive compiler can optimize by caching the pointer in some way (e.g. storing it in a register) or by removing the second check of the pointer. (Possible fix: volatile.)

Double-Checked Locking Pattern

Singleton Pattern: call_once

template <typename T>
class MySingleton
{
public:
  static std::shared_ptr<T> instance()
  {
    std::call_once(resource_init_flag, init_resource);
    return resource_ptr;
  }
private:
  void init_resource()
  {
    resource_ptr.reset(new T(/* ... */));
  }
  static std::shared_ptr<T> resource_ptr;
  static std::once          resource_init_flag; // can't be moved or copied
};

std::call_once is guaranteed to execute its callable parameter exactly once, even if called from several threads.

Singleton Pattern: Meyers singleton

Source: Effective C++, by Scott Meyers

class MySingleton;
MySingleton& MySingletonInstance()
{
  static MySingleton _instance;
  return _instance;
}

C++11 guarantees local static is initialized in a thread safe way!

Condition variable

std::mutex                my_mutex;
std::queue<data_t>        my_queue;
std::conditional_variable data_cond;  // conditional variable

void producer() {
  while (more_data_to_produce())
  {
    const data_t data = produce_data();
    std::lock_guard<std::mutex> prod_lock(my_mutex);  // guard the push
    my_queue.push(data);
    data_cond.notify_one(); // notify the waiting thread to evaluate cond.
  }
}

void consumer() {
  while (true)
  {
    std::unique_lock<std::mutex> cons_lock(my_mutex);   // not lock_guard
    data_cond.wait(cons_lock,                           // returns if lamdba returns true 
             [&my_queue]{return !my_queue.empty();});   // else unlocks and waits 
    data_t data = my_queue.front();                     // lock is hold here to protect pop...
    my_queue.pop();
    cons_lock.unlock();                                 // ... until here
    consume_data(data);
  }
}

Classical producer-consumer example:

Condition variable

During the wait the condition variable may check the condition any time, but under the protection of the mutex and returns immediately if condition is true.

Spurious wake: wake up without notification from other thread. Undefined times and frequency, so it is better to avoid functions with side effect. (E.g. using a counter in lambda to check how many notifications were is typically bad.)

Futures and Promises

Future is a read-only placeholder view of a variable.
Promise is a writable, single assignment container (set the value of future).
Futures are results of asyncronous function calls. When I execute that function I won't get the result, but get a future which will hold the result when the function completed.
A future is also capable to store exceptions.
With shared futures multiple threads can wait for a single shared async result (std::shared_future<T>).

Futures

int f(int);
void do_other_stuff();

int main()
{
  std::future<int> the_answer = std::async(f, 1);
  do_other_stuff();
  std::cout<< "The answer is " << the_answer.get() << std::endl;
  return 0;
}

The std::async() executes the task either in a new thread or on get().

// starts in a new thread
auto fut1 = std::async(std::launch::async, f, 1);

// run in the same thread on wait() or get()
auto fut2 = std::async(std::launch::deferred, f, 2);

// default: implementation chooses
auto fut3 = std::async(std::launch::deferred | std::launch::async, f, 3);

// default: implementation chooses
auto fut4 = std::async(f, 4);

If no wait() or get() is called, then the task may not be executed at all.

Futures and exceptions

double square_root(double x)
{
  if (x < 0)
  {
    throw std::out_of_range("x is negative");
  }
  return sqrt(x);
}

int main()
{
  std::future<double> fut = std::async(square_root, -1);
  double res = fut.get(); // f becomes ready on exception and rethrows
  return 0;
}

Further future methods

fut.valid(): future has a shared state
fut.wait(): wait until result is available
fut.wait_for(): timeout duration
fut.wait_until(): wait until specific time point

Futures' destructors

double long_calculation(int n)
{
  /* ... */
}

int main()
{
  std::async(std::launch::async, long_calculation, 42);  // ~future blocks
  std::async(std::launch::async, long_calculation, 100); // ~future blocks
}

int main()
{
  std::future<double> fut1 = std::async(std::launch::async, long_calculation, 42);  
  // no blocking
  std::future<double> fut2 = std::async(std::launch::async, long_calculation, 100);
  // no blocking
}

Keep in mind that the futures has a special shared state, which demands that future::~future blocks.

For real asynchronous you need to keep the returned future:

Promises

A promise is a tool for passing the return value (or exception) from a thread executing a function to the thread that consumes the result using future.

void asyncFun(std::promise<int> myPromise)
{
  int result;
  try
  {
    // calculate the result
    myPromise.set_value(result);
  }
  catch (...)
  {
    myPromise.set_exception(std::current_exception());
  }
}

Promises

int main()
{
  std::promise<int> intPromise;
  std::future<int> intFuture = intPromise.getFuture();
  std::thread t(asyncFun, std::move(intPromise));

  // do other stuff here, while asyncFun is working

  int result = intFuture.get();  // may throw exception
  return 0;
}

void asyncFun(std::promise<int> myPromise)
{
  int result;
  try
  {
    // calculate the result
    myPromise.set_value(result);
  }
  catch (...)
  {
    myPromise.set_exception(std::current_exception());
  }
}

Packaged task

double square_root(double x)
{
  if ( x < 0 )
  {
    throw std::out_of_range("x<0");
  }
  return sqrt(x);
}

int main()
{
  double x = 4.0;

  std::packaged_task<double(double)> tsk(square_root);
  std::future<double> fut = tsk.get_future(); // future will be ready when task completes

  std::thread t(std::move(tsk), x);  // make sure, task starts immediately
                                     // on different thread
                                     // thread can be joined, detached
  
  double res = fut.get();            // using the future
  return 0;
}

A higher level tool than promises.

Packaged task implementation

template <typename> class my_task;

template <typename R, typename ...Args>
class my_task<R(Args...)>
{
  std::function<R(Args...)> fn;
  std::promise<R> pr;
public:
  template <typename ...Ts>
  explicit my_task(Ts&&... ts) : fn(std::forward<Ts>(ts)...) { }

  template <typename ...Ts>
  void operator()(Ts&&... ts)
  {
      pr.set_value(fn(std::forward<Ts>(ts)...));
  }

  std::future<R> get_future() { return pr.get_future(); }
  // disable copy, default move
};

Packaged task vs async

In the end a std::packaged_task is just a lower level feature for implementing std::async (which is why it can do more than std::async if used together with other lower level stuff, like std::thread).

Simply spoken a std::packaged_task is a std::function linked to a std::future and std::async wraps and calls a std::packaged_task (possibly in a different thread).

Parallel STL (C++17)

C++17 brings us parallel algorithms, so the well known STL algorithms (std::find_if, std::for_each, std::sort, etc.) get a support for parallel (or vectorized) execution.

vector<int> v = { /* ... */ };

// standard sequential sort
std::sort(v.begin(), v.end());

// sequential execution
std::sort(std::parallel::seq, v.begin(), v.end());

// permitting parallel execution
std::sort(std::parallel::par, v.begin(), v.end());

// permitting vectorized execution (only since C++20)
std::sort(std::parallel::unseq, v.begin(), v.end());

// permitting parallel and vectorized execution
std::sort(std::parallel::par_unseq, v.begin(), v.end());

Parallel STL (C++17)

What is vectorized (or unsequenced) execution?

Most modern CPUs can execute SIMD (single instruction, multiple data) operations, significantly boosting efficiency.
A SIMD operation is where the same instruction is executed on multiple data. SIMD is data level parallelism without thread concurrency.
With parallel and vectorized execution policies, it is the programmer's responsibility to avoid data races and deadlocks.

Parallel STL (C++17)

Parallel STL is only implemented in modern compilers, so keep in mind where you can use this new feature.

GNU GCC: supported since version 9.1 (May 2019)
Clang: no support (soon thanks to Intel's "donation")
MSVC: supported since MSVC 19.14
(Visual Studio 2017 with at least version 15.7)

Support is not necessarily complete immediately, e.g. MSVC only implements the vectorized execution policy since MSVC 19.28 (VS 2019 16.8+)

Follow current state at:

https://en.cppreference.com/w/cpp/compiler_support

Look for "Parallel algorithms and execution policies".