CS110 Lecture 18: Multithreading Patterns

CS110: Principles of Computer Systems

Winter 2021-2022

Stanford University

Instructors: Nick Troccoli and Jerry Cain

CS110 Topic 3: How can we have concurrency within a single process?

Learning About Multithreading

Introduction to Threads (Lecture 13)
Mutexes and Condition Variables (Lectures 14/15)
Semaphores (Lecture 16)
Multithreading Patterns (Lecture 17/this lecture)

assign5: implement your own multithreaded news aggregator to quickly fetch news from the web!

Learning Goals

Practice applying our toolbox of concurrency directives (mutexes, condition variables and semaphores) to coordinate threads in different ways
Learn about the challenges of applying multithreading to an existing program and how we can intuit the benefits of multithreading for different programs

Plan For Today

Recap: Multithreading Patterns and Reader-Writer
Example: Mythbuster
- V1: Sequential
- I/O Bound vs. CPU-Bound Programs
- V2: Multithreaded
- V2.5: "Capped" Multithreaded
- V3: Thread Pool

Multithreading Patterns

Binary lock (mutex) - e.g. dining philosophers' forks

Generalized wait (condition variable) - e.g. waiting for complex condition

Permits (semaphore) - e.g. dining philosophers permits for eating

Binary coordination (semaphore) - e.g. writer telling reader there is new content

Generalized coordination (semaphore) - e.g. thread waits for N others to finish a task

Layered Construction (combo) - combine multiple patterns
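As one illustration of the "generalized coordination" pattern, here is a minimal sketch in which a thread waits for N others to finish a task. The `Semaphore` class is a small stand-in built from a mutex and a condition variable; it is an assumption for this example and not the course's `semaphore` implementation. The function name `waitForAllWorkers` is likewise hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in semaphore (assumption: not the course's semaphore class).
class Semaphore {
 public:
  explicit Semaphore(int permits) : count_(permits) {}
  void wait() {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [this] { return count_ > 0; });  // block until a permit exists
    --count_;
  }
  void signal() {
    std::lock_guard<std::mutex> lock(m_);
    ++count_;
    cv_.notify_one();
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  int count_;
};

// Spawns numWorkers threads and blocks until every one has signaled.
// Returns how many workers had finished their task at that point.
int waitForAllWorkers(int numWorkers) {
  Semaphore done(0);  // 0 permits: nothing has finished yet
  std::mutex countLock;
  int finished = 0;
  std::vector<std::thread> workers;

  for (int i = 0; i < numWorkers; i++) {
    workers.emplace_back([&] {
      {
        std::lock_guard<std::mutex> lg(countLock);
        finished++;  // the "task" in this sketch
      }
      done.signal();  // tell the coordinator one more task is complete
    });
  }

  // One wait per worker: returns only after numWorkers signals.
  for (int i = 0; i < numWorkers; i++) done.wait();

  int result;
  {
    std::lock_guard<std::mutex> lg(countLock);
    result = finished;
  }
  assert(result == numWorkers);
  for (std::thread& t : workers) t.join();
  return result;
}
```

Initializing the semaphore to 0 and waiting N times is what turns a permit counter into a "wait for N events" mechanism.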

Reader-Writer

Let's implement a program that requires thread coordination with semaphores. First, we'll look at a version without semaphores to see why they are necessary.

The reader-writer pattern/program spawns 2 threads: one writer (publishes content to a shared buffer) and one reader (reads from shared buffer when content is available)

Common pattern! E.g. web server publishes content over a dedicated communication channel, and the web browser consumes that content.

Optionally consider a more complex version: multiple readers, similar to how a web server handles many incoming requests (puts request in buffer, readers each read and process requests)

int main(int argc, const char *argv[]) {
  // Create an empty buffer
  char buffer[kNumBufferSlots];
  memset(buffer, ' ', sizeof(buffer));

  thread writer(writeToBuffer, buffer, sizeof(buffer), kNumIterations);
  thread reader(readFromBuffer, buffer, sizeof(buffer), kNumIterations);
  writer.join();
  reader.join();
  return 0;
}

Confused Reader-Writer

confused-reader-writer.cc

Both threads share the same buffer, so they agree where content is stored (think of buffer like state for a pipe or a connection between client and server)

static void readFromBuffer(char buffer[], size_t bufferSize, size_t iterations) {
  cout << oslock << "Reader: ready to read." << endl << osunlock;
  for (size_t i = 0; i < iterations * bufferSize; i++) {
  
    // Read and process the data
    char ch = buffer[i % bufferSize];
    processData(ch); // sleep to simulate work
    buffer[i % bufferSize] = ' ';
    
    cout << oslock << "Reader: consumed data packet " 
      << "with character '" << ch << "'.\t\t" << osunlock;
    printBuffer(buffer, bufferSize);
  }
}

Confused Reader-Writer

The reader consumes the content as it's written. Each thread cycles through the buffer the same number of times, and they both agree that i % bufferSize (i % 8 for an 8-slot buffer) identifies the next slot of interest.

confused-reader-writer.cc

static void writeToBuffer(char buffer[], size_t bufferSize, size_t iterations) {
  cout << oslock << "Writer: ready to write." << endl << osunlock;
  for (size_t i = 0; i < iterations * bufferSize; i++) {

    char ch = prepareData();
    buffer[i % bufferSize] = ch;
    
    cout << oslock << "Writer: published data packet with character '" 
      << ch << "'.\t\t" << osunlock;
    printBuffer(buffer, bufferSize);
  }
}

Confused Reader-Writer

The writer publishes content to the circular buffer. Each thread cycles through the buffer the same number of times, and they both agree that i % bufferSize (i % 8 for an 8-slot buffer) identifies the next slot of interest.

confused-reader-writer.cc

Confused Reader-Writer

Problem: each thread runs independently, without knowing how much progress the other has made.

Example: no way for the reader to know that the slot it wants to read from has meaningful data in it. It's possible the writer hasn't gotten that far yet.
Example: the writer could loop around and overwrite content that the reader has not yet consumed.

Goal: we must encode constraints into our program.

What constraint(s) should we add to our program?

A reader should not read until something is available to read

A writer should not write until there is space available to write

How can we model these constraint(s)?

One semaphore to manage empty slots

One semaphore to manage full slots

Reader-Writer Constraints

What might this look like in code?

The writer thread waits until at least one buffer slot is empty before writing. Once it writes, it increments the full buffer count by one.
The reader thread waits until at least one buffer slot is full before reading. Once it reads, it increments the empty buffer count by one.

Reader-Writer

Could we do this with one semaphore instead of 2? Unfortunately, no.

Let's say we initialize one semaphore with 8 permits, representing empty slots.
The writer can call wait before it writes, but how does it tell the reader there's a full slot? If it calls signal, that would increment the number of empty slots.
The reader can call signal when it's done reading, but how does it block until there's a full slot? There's no way to check whether the semaphore count is < 8.

Reader-Writer

Generally, we use one semaphore per type of "event"
There are two kinds of events here - slot becoming empty, and slot becoming full
There is bidirectional communication here - writer-to-reader and reader-to-writer

static void readFromBuffer(char buffer[], size_t bufferSize, size_t iterations, semaphore& fullBufferSlots, 
       semaphore& emptyBufferSlots) {

  cout << oslock << "Reader: ready to read." << endl << osunlock;
  for (size_t i = 0; i < iterations * bufferSize; i++) {

    fullBufferSlots.wait();

    char ch = buffer[i % bufferSize];
    processData(ch); // sleep to simulate work
    buffer[i % bufferSize] = ' ';

    emptyBufferSlots.signal();

    cout << oslock << "Reader: consumed data packet " << "with character '" << ch << "'.\t\t" << osunlock;
    printBuffer(buffer, bufferSize);
  }
}

Reader-Writer

reader-writer.cc

The reader consumes the content as it's written. Before reading, it waits for a slot to be full. After reading, it indicates that a new slot is empty. It is "tracing the steps" of the writer because it cycles through the same indexes that the writer does.

static void writeToBuffer(char buffer[], size_t bufferSize, size_t iterations, semaphore& fullBufferSlots, 
       semaphore& emptyBufferSlots) {

  cout << oslock << "Writer: ready to write." << endl << osunlock;
  for (size_t i = 0; i < iterations * bufferSize; i++) {
    char ch = prepareData();

    emptyBufferSlots.wait();

    buffer[i % bufferSize] = ch;

    fullBufferSlots.signal();

    cout << oslock << "Writer: published data packet with character '" << ch << "'.\t\t" << osunlock;
    printBuffer(buffer, bufferSize);
  }
}

Reader-Writer

reader-writer.cc

The writer publishes content to the circular buffer. Before writing, it waits for a slot to be empty. After writing, it indicates that a new slot is full. It is "leading the way" for the reader because it cycles through the same indexes always ahead of the reader.
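The slides do not show how the two semaphores are initialized, so the following is a hedged sketch of how the pieces might be wired together: the full-slot semaphore starts at 0 and the empty-slot semaphore starts at the buffer size. The `Semaphore` class, the constant `kSlots`, and the function `runReaderWriter` are assumptions for this example, not the contents of reader-writer.cc.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>

// Stand-in semaphore (assumption: not the course's semaphore class).
class Semaphore {
 public:
  explicit Semaphore(size_t permits) : count_(permits) {}
  void wait() {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [this] { return count_ > 0; });
    --count_;
  }
  void signal() {
    std::lock_guard<std::mutex> lock(m_);
    ++count_;
    cv_.notify_one();
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  size_t count_;
};

constexpr size_t kSlots = 8;  // assumed buffer size

// Sends message through a circular buffer, one character per slot,
// and returns what the reader received.
std::string runReaderWriter(const std::string& message) {
  char buffer[kSlots];
  Semaphore fullBufferSlots(0);        // nothing to read yet
  Semaphore emptyBufferSlots(kSlots);  // every slot is free to write
  std::string received;

  std::thread writer([&] {
    for (size_t i = 0; i < message.size(); i++) {
      emptyBufferSlots.wait();          // wait for space
      buffer[i % kSlots] = message[i];
      fullBufferSlots.signal();         // announce new content
    }
  });
  std::thread reader([&] {
    for (size_t i = 0; i < message.size(); i++) {
      fullBufferSlots.wait();           // wait for content
      received += buffer[i % kSlots];
      emptyBufferSlots.signal();        // announce freed slot
    }
  });

  writer.join();
  reader.join();
  return received;
}
```

Because the writer can never get more than `kSlots` characters ahead and the reader can never pass the writer, a message longer than the buffer should still arrive intact and in order.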

Mythbuster

Let's implement a program called myth-buster that prints out how many CS110 student processes are running on each myth machine right now.

This is representative of load balancers (e.g. myth.stanford.edu or www.netflix.com) determining which internal server your request should be forwarded to.

myth51 has this many CS110-student processes: 59
myth52 has this many CS110-student processes: 135
myth53 has this many CS110-student processes: 112
myth54 has this many CS110-student processes: 89
myth55 has this many CS110-student processes: 107
myth56 has this many CS110-student processes: 58
myth57 has this many CS110-student processes: 70
myth58 has this many CS110-student processes: 93
myth59 has this many CS110-student processes: 107
myth60 has this many CS110-student processes: 145
myth61 has this many CS110-student processes: 105
myth62 has this many CS110-student processes: 126
myth63 has this many CS110-student processes: 314
myth64 has this many CS110-student processes: 119
myth65 has this many CS110-student processes: 156
myth66 has this many CS110-student processes: 144
Machine least loaded by CS110 students: myth56
Number of CS110 processes on least loaded machine: 58

int getNumProcesses(int mythNum, const std::unordered_set<std::string>& sunetIDs);

We'll use the following pre-implemented function that does some networking to fetch process counts. This connects to the specified myth machine, and blocks until done.

int main(int argc, char *argv[]) {
  // Create a set of student SUNETs
  unordered_set<string> cs110SUNETs;
  readStudentSUNETsFile(cs110SUNETs, kCS110StudentIDsFile);

  // Create a map from myth number -> CS110 process count and print its info
  map<int, int> processCountMap;
  createCS110ProcessCountMap(cs110SUNETs, processCountMap);
  printMythMachineWithFewestProcesses(processCountMap);

  return 0;
}

We'll implement createCS110ProcessCountMap sequentially and concurrently.

Mythbuster: Sequential

static void createCS110ProcessCountMap(const unordered_set<string>& sunetIDs,
					map<int, int>& processCountMap) {

  for (int mythNum = kMinMythMachine; mythNum <= kMaxMythMachine; mythNum++) {
    int numProcesses = getNumProcesses(mythNum, sunetIDs);

    // If successful, add to the map and print out
    if (numProcesses >= 0) {
      processCountMap[mythNum] = numProcesses;
      cout << "myth" << mythNum << " has this many CS110-student processes: " << numProcesses << endl;
    }
  }
}

This implementation fetches the count for each myth machine one after the other. This means we have to wait for 16 sequential connections to be started and completed.

myth-buster-sequential.cc

We want to make this multithreaded. How can we parallelize this? And what gains should we expect to see?

I/O-Bound vs. CPU-Bound Programs

Depending on a program's tasks, it may see different benefits from multithreading.

Our goal is to make a program faster
Multithreading means splitting work across many threads, but those threads are still ultimately executed by the CPU (maybe on multiple cores)
Multithreading can make a program faster due to one or more of:
- leveraging multiple cores to run tasks truly simultaneously
- programs having tasks that have "idle time" where they can be swapped off the CPU and another thread can be given time.

I/O-Bound vs. CPU-Bound Programs

Thought experiment: each of us is a single-core CPU! (yes, it's true)

So how is it possible for us to multitask if we can only do one thing at a time?

Key Idea: not everything needs our constant, undivided attention:

doing laundry
letting something bake in the oven

These tasks consist mostly of waiting, during which we can do other things and come back when they need our attention. Even without our attention, they make progress. We can thus alternate between them to "multitask"!

These are I/O-bound tasks: the time to complete them is dictated by how long it takes for some external mechanism to complete its work (laundry machine, shipping, oven).

I/O-Bound vs. CPU-Bound Programs

Key Idea: some things do need our constant, undivided attention:

doing homework
reading a book

These tasks only make progress while we devote our attention to them. We'll probably see limited gains by alternating between them, and may even see "context switch" penalties.

These are CPU-bound tasks: the time to complete them is dictated by how long it takes us to do the CPU computation (solve homework, read chapters).

I/O-Bound vs. CPU-Bound Programs

CPU-bound tasks: the time to complete them is dictated by how long it takes us to do the CPU computation.

heavy computations
data processing

I/O-bound tasks: the time to complete them is dictated by how long it takes for some external mechanism to complete its work.

reading from an external device (e.g. disk)
reading data from the network

Even a single-core CPU can see performance improvements by parallelizing I/O-bound tasks. But parallelizing CPU-bound tasks will likely show minimal gains unless we have a multi-core CPU.
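To make the I/O-bound case concrete, here is a small sketch that simulates I/O-bound tasks with sleeps and times them run one after another versus on separate threads. The function names and the 50 ms delay are assumptions chosen for illustration; a sleep only approximates real I/O, since the sleeping thread is off the CPU just as it would be while blocked on a disk or network read.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// Simulated I/O-bound task: the thread is off the CPU while a
// hypothetical external device "does the work" for 50 ms.
void fakeIoTask() {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
}

// Runs numTasks tasks one after another; returns elapsed milliseconds.
long long runSequentially(int numTasks) {
  auto start = Clock::now();
  for (int i = 0; i < numTasks; i++) fakeIoTask();
  return std::chrono::duration_cast<std::chrono::milliseconds>(
             Clock::now() - start).count();
}

// Runs each task on its own thread; returns elapsed milliseconds.
long long runConcurrently(int numTasks) {
  auto start = Clock::now();
  std::vector<std::thread> threads;
  for (int i = 0; i < numTasks; i++) threads.emplace_back(fakeIoTask);
  for (std::thread& t : threads) t.join();
  return std::chrono::duration_cast<std::chrono::milliseconds>(
             Clock::now() - start).count();
}
```

With four tasks, the sequential version should take at least 200 ms, while the threaded version should take roughly as long as a single task because the waits overlap. Replacing the sleep with a tight computation loop would be expected to remove most of that benefit on a single core.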

Parallelizing Mythbuster

For mythbuster, the primary task is fetching the number of running CS110 processes over the network. Is this an I/O-bound or CPU-bound task?

I/O-bound!

This means we should see large gains from multithreading, even on a single-core machine.

Mythbusters: Sequential

Why is this implementation slow?

We wait 16 times, because we idle while waiting for each connection to come back.

How can we improve its performance?

Each call to getNumProcesses is independent. We should call it multiple times concurrently to overlap this "dead time".

Mythbusters: Concurrent

myth-buster-concurrent.cc

What might this look like in code?

For each myth machine number, we'll spawn a new thread that will fetch the count for that myth machine. It must acquire a lock before modifying the map because modifying the map is not thread-safe.

Implementation: spawn multiple threads, each responsible for connecting to a different myth machine and updating the map.

Demo: myth-buster-concurrent.cc
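The thread-per-machine approach described above might be sketched as follows. This is not the contents of myth-buster-concurrent.cc: `fakeGetNumProcesses` is a hypothetical stand-in for getNumProcesses that does no networking and returns a made-up count, and the machine range 51 to 66 is taken from the sample output earlier in the lecture.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kMinMythMachine = 51;
constexpr int kMaxMythMachine = 66;

// Hypothetical stand-in for getNumProcesses: no networking, made-up count.
int fakeGetNumProcesses(int mythNum) { return mythNum * 2; }

// Thread routine: fetch one machine's count, then record it under the lock.
void countForMyth(int mythNum, std::map<int, int>& processCountMap,
                  std::mutex& mapLock) {
  int numProcesses = fakeGetNumProcesses(mythNum);  // slow part, no lock held
  if (numProcesses >= 0) {
    std::lock_guard<std::mutex> lg(mapLock);  // std::map is not thread-safe
    processCountMap[mythNum] = numProcesses;
  }
}

// Spawns one thread per myth machine and waits for all of them.
void createProcessCountMap(std::map<int, int>& processCountMap) {
  std::vector<std::thread> threads;
  std::mutex mapLock;
  for (int mythNum = kMinMythMachine; mythNum <= kMaxMythMachine; mythNum++) {
    threads.emplace_back(countForMyth, mythNum, std::ref(processCountMap),
                         std::ref(mapLock));
  }
  for (std::thread& t : threads) t.join();
}
```

The lock is held only around the map update, so the slow fetches still overlap; holding it across the fetch would serialize the threads again.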

Mythbusters: Capped

When spawning threads, we don't want to spawn too many, because we might overwhelm the OS and diminish the performance gains of our multithreaded implementation.

A common approach is to limit the number of simultaneous threads with a cap. E.g. we can only have 16 spawned threads at a time. Once one finishes, then we can spawn another.

What might this look like in code?

For each myth machine number, we'll wait until a permit is available and then spawn a new thread. That thread will fetch the count for that myth machine.
When the thread finishes, it returns its permit.

myth-buster-concurrent.cc

static void createCS110ProcessCountMap(const unordered_set<string>& sunetIDs, map<int, int>& processCountMap) {
  vector<thread> threads;
  mutex processCountMapLock;
  semaphore permits(kMaxNumSimultaneousThreads);

  for (int mythNum = kMinMythMachine; mythNum <= kMaxMythMachine; mythNum++) {
    permits.wait();

    threads.push_back(thread(countCS110ProcessesForMyth, mythNum, ref(sunetIDs),
      ref(processCountMap), ref(processCountMapLock), ref(permits)));
  }

  for (thread& threadToJoin : threads) threadToJoin.join();
}

static void countCS110ProcessesForMyth(int mythNum, const unordered_set<string>& sunetIDs,
  map<int, int>& processCountMap, mutex& processCountMapLock, semaphore& permits) {

  int numProcesses = getNumProcesses(mythNum, sunetIDs);

  if (numProcesses >= 0) {
    processCountMapLock.lock();
    processCountMap[mythNum] = numProcesses;
    processCountMapLock.unlock();
    cout << "myth" << mythNum << " has this many CS110-student processes: " << numProcesses << endl;
  }
  
  permits.signal();
}

(Nagging voice) hey, technically isn't it possible for more than the permitted number of threads to be alive if a new one is spawned right after permits.signal(), before this thread has actually exited?

Mythbusters: Capped

When the thread finishes, it returns its permit. We can use a special version of signal() to specify that the semaphore should be signaled only once the thread exits.

myth-buster-concurrent.cc

static void countCS110ProcessesForMyth(int mythNum, const unordered_set<string>& sunetIDs,
  map<int, int>& processCountMap, mutex& processCountMapLock, semaphore& permits) {
  
  permits.signal(on_thread_exit);

  int numProcesses = getNumProcesses(mythNum, sunetIDs);

  if (numProcesses >= 0) {
    processCountMapLock.lock();
    processCountMap[mythNum] = numProcesses;
    processCountMapLock.unlock();
    cout << "myth" << mythNum << " has this many CS110-student processes: " << numProcesses << endl;
  }
}

Mythbusters Takeaways

We parallelized an independent operation to speed up runtime
One call to getNumProcesses isn't dependent on another
To share the map for updating, we need a lock
We use signal(on_thread_exit) to signal only once the thread has terminated. This more accurately reflects permits as a cap on spawned threads.

myth-buster-concurrent.cc

Mythbusters: Thread Pool

Even though we are limiting the number of simultaneous threads, we still spawn one thread per task in total. It would be nice if we could reuse the same threads to complete all the tasks.

A common approach is to use a thread pool: a type that maintains a pool of worker threads that can complete assigned tasks.

You initialize the thread pool and specify the number of workers
You can call schedule and pass in a function you want it to execute. It will assign it to the next available worker.
You can call wait to block until all currently-assigned tasks have been completed.

class ThreadPool {
public:
   ThreadPool(size_t numThreads);
   void schedule(const std::function<void(void)>& thunk);
   void wait();
   ~ThreadPool();
};

In myth-buster, instead of spawning threads, we can schedule a "thunk" for each task of fetching a myth machine's count of CS110 processes. A thunk must be a function that has no parameters or return value.
After we add all the tasks to the thread pool, we wait on the thread pool to finish all the tasks.

Demo: Myth Buster w/ Thread Pool

myth-buster-pooled.cc

Mythbuster Thread Pool Takeaways

We parallelized an independent operation to speed up runtime
One call to getNumProcesses isn't dependent on another
To share the map for updating, we need a lock
We used a thread pool to parallelize tasks without spawning too many threads
We can schedule a "thunk", which is a function that takes no input and gives back no output. To pass data to it, we can use a lambda function and capture values.
We can wait on the thread pool to finish all enqueued tasks

myth-buster-pooled.cc
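The schedule-then-wait usage described in these takeaways might look like the sketch below. It is not the contents of myth-buster-pooled.cc. `SimpleThreadPool` is a minimal stand-in written for this example that mirrors the `ThreadPool` interface shown earlier (constructor, schedule, wait, destructor); the real class may be implemented quite differently. `fakeGetNumProcesses` is again a hypothetical replacement for getNumProcesses, and the pool size of 4 is an arbitrary choice.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal stand-in pool (assumption: not the course's ThreadPool).
class SimpleThreadPool {
 public:
  explicit SimpleThreadPool(size_t numThreads) {
    for (size_t i = 0; i < numThreads; i++)
      workers_.emplace_back([this] { workerLoop(); });
  }
  void schedule(const std::function<void(void)>& thunk) {
    {
      std::lock_guard<std::mutex> lg(m_);
      tasks_.push(thunk);
      pending_++;
    }
    taskAvailable_.notify_one();
  }
  void wait() {  // blocks until every scheduled thunk has finished
    std::unique_lock<std::mutex> ul(m_);
    allDone_.wait(ul, [this] { return pending_ == 0; });
  }
  ~SimpleThreadPool() {
    wait();
    {
      std::lock_guard<std::mutex> lg(m_);
      shuttingDown_ = true;
    }
    taskAvailable_.notify_all();
    for (std::thread& t : workers_) t.join();
  }
 private:
  void workerLoop() {
    while (true) {
      std::function<void(void)> thunk;
      {
        std::unique_lock<std::mutex> ul(m_);
        taskAvailable_.wait(ul, [this] { return shuttingDown_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // shutting down, nothing left to run
        thunk = tasks_.front();
        tasks_.pop();
      }
      thunk();  // run the task without holding the pool's lock
      std::lock_guard<std::mutex> lg(m_);
      if (--pending_ == 0) allDone_.notify_all();
    }
  }
  std::mutex m_;
  std::condition_variable taskAvailable_, allDone_;
  std::queue<std::function<void(void)>> tasks_;
  std::vector<std::thread> workers_;
  size_t pending_ = 0;
  bool shuttingDown_ = false;
};

// Hypothetical stand-in for getNumProcesses: no networking, made-up count.
int fakeGetNumProcesses(int mythNum) { return mythNum * 2; }

void createProcessCountMap(std::map<int, int>& processCountMap) {
  std::mutex mapLock;
  SimpleThreadPool pool(4);  // 4 workers handle all 16 tasks
  for (int mythNum = 51; mythNum <= 66; mythNum++) {
    // The thunk takes no parameters; the lambda captures what it needs.
    pool.schedule([mythNum, &processCountMap, &mapLock] {
      int numProcesses = fakeGetNumProcesses(mythNum);
      std::lock_guard<std::mutex> lg(mapLock);
      processCountMap[mythNum] = numProcesses;
    });
  }
  pool.wait();
}
```

Capturing `mythNum` by value matters here: the loop variable changes before most thunks run, so capturing it by reference would let tasks read the wrong machine number.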

Recap

Recap: Multithreading Patterns and Reader-Writer
Example: Mythbuster
- V1: Sequential
- I/O Bound vs. CPU-Bound Programs
- V2: Multithreaded
- V2.5: "Capped" Multithreaded
- V3: Thread Pool

Next time: multithreading wrap-up and introduction to networking