Principles of Computer Systems

Autumn 2019

Stanford University

Computer Science Department

Instructors: Chris Gregg

Philip Levis


Lecture 17: MapReduce

The Genesis of Datacenter Computing: MapReduce

Problem for large service providers such as Google: computation requires hundreds or thousands of computers
- How do you distribute the computation?
- Distributing the computation boils down to distributing data
- Nodes fail, some nodes are slow, and load must be balanced: it is difficult to make this work well
The system came from the observation that many early Google systems had a common pattern
Designing a computing system around this pattern
- allows the system (written once) to do all of the hard systems work (fault tolerance, load balancing, parallelization)
- allows tens of thousands of programmers to just write their computation
Canonical example of how the right abstraction revolutionized computing
- An open source version soon appeared: Hadoop

Core Data Abstraction: Key/Value Pairs

Take a huge amount of data (more than can fit in the memory of 1000 machines)
Write two functions:
- map(k1, v1) -> list(k2, v2)
- reduce(k2, list(v2)) -> list(v2)
Using these two functions, MapReduce parallelizes the computation across thousands of machines, automatically load balancing, recovering from failures, and producing the correct result.
You can string together MapReduce programs: the output of reduce becomes the input to map.
Simple example of word count (wc):

map(String key, String value):
  // key: document name
  // value: document contents
  for word w in value:
    EmitIntermediate(w,"1")
    
reduce(String key, List values):
  // key: a word
  // values: a list of counts
  int result = 0
  for v in values:
    result += ParseInt(v)
  Emit(AsString(result))
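
The pseudocode above can be tried out on a single machine. The following is a minimal Python sketch, not the real runtime: `run_local` is a hypothetical helper that stands in for everything MapReduce does between map and reduce (here just an in-memory group-by-key), and the two sample documents are made up.

```python
from collections import defaultdict

def map_fn(key, value):
    """key: document name; value: document contents. Yields (word, "1") pairs."""
    for word in value.split():
        yield (word, "1")

def reduce_fn(key, values):
    """key: a word; values: a list of counts as strings. Returns the total as a string."""
    return str(sum(int(v) for v in values))

def run_local(documents):
    """Single-process stand-in for the MapReduce runtime (hypothetical helper)."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for k2, v2 in map_fn(name, contents):
            grouped[k2].append(v2)  # group-by-key: collect every v2 seen for each k2
    return {k2: reduce_fn(k2, v2s) for k2, v2s in sorted(grouped.items())}

counts = run_local({"doc1": "the cat saw the dog", "doc2": "The dog"})
print(counts)
```

As in the slides, keys are case-sensitive, so "the" and "The" are counted separately.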

"The number of partitions (R) and the partitioning function are specified by the user."

input: the sentence quoted above

map output:
("the", "1"), ("the", "1"), ("The", "1"), ("of", "1"), ("number", "1"), ...

reduce:
"the", ("1", "1") -> "2"
"The", ("1") -> "1"
"of", ("1") -> "1"
"number", ("1") -> "1"

Key/Value Pairs: How and Where

Keys allow MapReduce to distribute and parallelize load
Core abstraction: data can be partitioned by key, there is no locality between keys
In the original paper...
- Each mapper writes a local file for each key in k2, and reports its files to a master node
- The master node tells the reducer for k2 where all of the k2 files are
- The reducer reads all of the k2 files from the nodes that ran the mappers and writes its own output locally, reporting this to the master node
- There have been lots of optimizations since
Keys can be arbitrary data: hash the keys, and assign keys to splits using modulus
- split(key) = hash(key) % N, where N is the number of splits
A master node tracks progress of workers and manages execution
- When a worker falls idle, the master sends it a new split to compute
- When the master thinks a worker has failed, it tells another worker to compute its split
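
The rule split(key) = hash(key) % N can be sketched in a few lines of Python. This is an illustration under assumptions: CRC32 is used only as a stand-in for whatever hash function a real system uses, and `split_for` is a hypothetical name. A stable hash matters here because Python's built-in `hash()` for strings is randomized per process, so different workers would disagree on where a key belongs.

```python
import zlib

def split_for(key: str, num_splits: int) -> int:
    """split(key) = hash(key) % N, with CRC32 as an assumed stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_splits

words = ["the", "The", "of", "number", "the"]
splits = [split_for(w, 20) for w in words]
print(list(zip(words, splits)))
```

Every occurrence of the same key lands in the same split, no matter which worker computes it, which is what lets a single reducer see all of the values for a key.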


MapReduce System Architecture

Your MapReduce

Your MapReduce will differ from the standard one in one significant way: instead of writing results to local disk, it writes results to AFS, myth's networked file system
- Every worker can access the files directly using AFS
- Cost: file write is over the network
- Benefit: recovery from failure is easier (don't have to regenerate lots of files)
- Benefit: don't have to write file transfer protocol (handled by AFS)

[Figure: mapper-to-reducer data path. Google: mapper -> reducer. Your System: mapper -> AFS -> reducer.]

MapReduce Data Flow

The map component of a MapReduce job typically parses input data and distills it down to some intermediate result.
The reduce component of a MapReduce job collates these intermediate results and distills them down even further to the desired output.
The pipeline of processes involved in a MapReduce job is captured by the below illustration:
The processes shaded in yellow are programs specific to the data set being processed, whereas the processes shaded in green are present in all MapReduce pipelines.
We'll invest some energy over the next several slides explaining what a mapper, a reducer, and the group-by-key processes look like.

Word Count Example

Walk through the word count example in detail to see what MapReduce does
There are several parameters; let's set them as follows:
- Number of map tasks (input partitions/splits): 12
  - In normal MapReduce this is user-specifiable; in your implementation it is predefined by how the input is split
- Number of map workers: 4
- Number of reduce tasks (intermediate and output partitions/splits): 20
- Number of reduce workers: 5
Each map task (assigned to one of 4 map workers) maps an input partition

[Figure: 12 input partitions (chunks of file) feed the map tasks]


Each map task produces 20 output files; file F contains the pairs whose keys satisfy hash(key) % 20 == F

[Figure: 12 input partitions (e.g., "Propitious Pallas, to secure...") -> map tasks -> 240 intermediate files]
Each of the 12 map tasks produces one file for each reduce task: 1.0.mapped, ..., 1.5.mapped, ..., 12.18.mapped, 12.19.mapped

7.5.mapped: this file contains all of the map outputs for map task 7 whose key hash, modulo 20, is 5
in 1
now 1
step 1
now 1
in 1
....
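
The map-side bucketing can be sketched as follows. This is a simplified model, not the assignment's code: files are represented as an in-memory dictionary from file name to lines, CRC32 is an assumed stand-in hash, and `bucket` and `run_map_task` are hypothetical names.

```python
import zlib

NUM_MAP_TASKS = 12
NUM_REDUCE_TASKS = 20

def bucket(key):
    """Reduce task responsible for this key (CRC32 is an assumed stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCE_TASKS

def run_map_task(map_task_id, text):
    """Return {file name: lines} for one map task, one file per reduce task."""
    files = {f"{map_task_id}.{r}.mapped": [] for r in range(NUM_REDUCE_TASKS)}
    for word in text.split():
        files[f"{map_task_id}.{bucket(word)}.mapped"].append(f"{word} 1")
    return files

files = run_map_task(7, "now in step now in")
total_files = NUM_MAP_TASKS * NUM_REDUCE_TASKS  # 12 * 20 = 240 intermediate files
```

Repeated words stay as separate lines in the same file; nothing is summed until the reduce step.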

Runtime (your reducer) collates, sorts, and groups all of the inputs for each reduce task

[Figure: the 240 intermediate files (1.0.mapped, ..., 12.19.mapped) are grouped into one file per reduce task, e.g., 7.5.mapped feeds 5.grouped]

5.grouped: this file contains all of the reduce inputs for every key whose hash % 20 == 5
in 1 1 1 1 ... (1149 times)
now 1 1 1 1 ... (126 times)
step 1 1 1 1 ... (10 times)
....
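
A minimal sketch of the collate, sort, and group step, assuming each mapped file is available as a list of "word 1" lines (`group_by_key` is a hypothetical name, and real inputs would be read from the *.5.mapped files rather than passed in as lists):

```python
from collections import defaultdict

def group_by_key(mapped_files):
    """mapped_files: one list of "word 1" lines per mapped file for this reduce task."""
    grouped = defaultdict(list)
    for lines in mapped_files:
        for line in lines:
            key, value = line.split()
            grouped[key].append(value)
    # Sort by key and emit one line per key: the key followed by all of its values.
    return [f"{key} {' '.join(values)}" for key, values in sorted(grouped.items())]

grouped_lines = group_by_key([["in 1", "now 1", "step 1"], ["now 1", "in 1"], ["in 1"]])
print(grouped_lines)
```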

Each reduce task (scheduled to one of 5 workers) runs reduce on its input

[Figure: each reduce task reads its grouped file and writes an output file, e.g., 5.grouped -> 5.output]

5.output: contains the counts of the words whose hash % 20 == 5
in 1149
now 126
step 10
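
For word count, the reduce step on a grouped file amounts to summing the values on each line. A short sketch, assuming the "key value value ..." line layout shown above (`reduce_grouped_line` is a hypothetical name):

```python
def reduce_grouped_line(line):
    """Turn a grouped line such as "in 1 1 1" into an output line such as "in 3"."""
    key, *values = line.split()
    return f"{key} {sum(int(v) for v in values)}"

output_lines = [reduce_grouped_line(line) for line in ["in 1 1 1", "now 1 1", "step 1"]]
print(output_lines)
```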

Importance of Keys

Keys are the mechanism that allows MapReduce to distribute load across many machines while keeping data locality
- All data with the same key is processed by the same mapper or reducer
- Any map worker or reduce worker can process a given key
- Keys can be collected together into larger units of work by hashing
  - We've seen this: if there are N tasks, a key K is the responsibility of the task whose ID is hash(K) % N
Distinction between tasks and nodes: there are many more tasks than nodes
- Worker nodes request work from the master server, which sends them map or reduce tasks to execute
- If one task is fast, the worker just requests a new one to complete
- Very simple load balancing
- This load balancing works because a task can run anywhere
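
The pull-based load balancing described above can be imitated with threads on one machine. In this sketch a shared queue plays the role of the master and each thread plays a worker node; the names are illustrative, and "running" a task just records its ID.

```python
import queue
import threading

def run_workers(num_workers, tasks):
    """Workers repeatedly pull the next task; returns {worker id: [task ids]}."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)
    done = {w: [] for w in range(num_workers)}

    def worker(worker_id):
        while True:
            try:
                task = pending.get_nowait()  # "request work from the master"
            except queue.Empty:
                return  # no tasks left
            done[worker_id].append(task)  # stand-in for executing the task

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

assignment = run_workers(5, list(range(20)))
```

Which worker runs which task varies from run to run, but every task runs exactly once, and a worker that finishes quickly simply pulls more.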

MapReduce Assignment: Example

Let's see an example run with the solution executables. There is a plethora of communication between the machine we run on and the other myths, and output ends up in the files/ directory:

myth57:$ make filefree
rm -fr files/intermediate/* files/output/*

myth57:$ ./samples/mr_soln --mapper ./samples/mrm_soln --reducer ./samples/mrr_soln --config odyssey-full.cfg

Determining which machines in the myth cluster can be used... [done!!]
Mapper executable: word-count-mapper
Reducer executable: word-count-reducer
Number of Mapping Workers: 8
Number of Reducing Workers: 4
Input Path: /usr/class/cs110/samples/assign8/odyssey-full
Intermediate Path: /afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/intermediate
Output Path: /afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/output
Server running on port 48721

Received a connection request from myth59.stanford.edu.
Incoming communication from myth59.stanford.edu on descriptor 6.
Instructing worker at myth59.stanford.edu to process this pattern: "/usr/class/cs110/samples/assign8/odyssey-full/00001.input"
Conversation with myth59.stanford.edu complete.
Received a connection request from myth61.stanford.edu.
Incoming communication from myth61.stanford.edu on descriptor 7.

... LOTS of lines removed

Remote ssh command on myth56 executed and exited with status code 0.
Reduction of all intermediate chunks now complete.
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/output/00000.output hashes to 13585898109251157014
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/output/00001.output hashes to 1022930401727915107
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/output/00002.output hashes to 9942936493001557706
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/output/00003.output hashes to 5127170323801202206

... more lines removed

Server has shut down.

MapReduce Assignment: Mapped File Contents

In the map phase, mr has the 8 mappers (from the .cfg file) run word-count-mapper over the 12 input files and put the results into files/intermediate:

myth57:$ ls -lu files/intermediate/
total 858
-rw------- 1 cgregg operator 2279 May 29 09:29 00001.00000.mapped
-rw------- 1 cgregg operator 1448 May 29 09:29 00001.00001.mapped
-rw------- 1 cgregg operator 1927 May 29 09:29 00001.00002.mapped
-rw------- 1 cgregg operator 2776 May 29 09:29 00001.00003.mapped
-rw------- 1 cgregg operator 1071 May 29 09:29 00001.00004.mapped
...lots removed
-rw------- 1 cgregg operator  968 May 29 09:29 00012.00027.mapped
-rw------- 1 cgregg operator 1720 May 29 09:29 00012.00028.mapped
-rw------- 1 cgregg operator 1686 May 29 09:29 00012.00029.mapped
-rw------- 1 cgregg operator 2930 May 29 09:29 00012.00030.mapped
-rw------- 1 cgregg operator 2355 May 29 09:29 00012.00031.mapped

If we look at 00012.00028.mapped, we see:

myth57:$ head -10 files/intermediate/00012.00028.mapped
thee 1
rest 1
thee 1
woes 1
knows 1
grieve 1
sire 1
laertes 1
sire 1
power 1

This file represents the words in 00012.input that hashed to 28 modulo 32 (because we have 8 mappers * 4 reducers)
Note that some words will appear multiple times (e.g., "thee")

MapReduce Assignment: Hashing

If we look at 00005.00028.mapped, we can also see "thee" again:

myth57:$ head -10 files/intermediate/00005.00028.mapped
vain 1
must 1
strand 1
cry 1
herself 1
she 1
along 1
head 1
dayreflection 1
thee 1

This makes sense because "thee" also occurs in file 00005.input (these files are not reduced yet!)
"thee" hashes to 28 modulo 32, so it will end up in any of the .00028 files if it occurs in the input that produced that file.
To test a word with a hash, you can run the hasher program:

myth57:$ ./hasher thee 32
28

MapReduce Assignment: Starter Code

Let's test the starter code (this only runs map). Afterwards, files/intermediate contains files without the reducer split:

myth57:~$ make directories filefree
// make command listings removed for brevity
myth57:~$ make
// make command listings removed for brevity
myth57:~$ ./mr --mapper ./mrm --reducer ./mrr --config odyssey-full.cfg --map-only --quiet 
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00001.mapped hashes to 2579744460591809953
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00002.mapped hashes to 15803262022774104844
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00003.mapped hashes to 15899354350090661280
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00004.mapped hashes to 15307244185057831752
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00005.mapped hashes to 13459647136135605867
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00006.mapped hashes to 2960163283726752270
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00007.mapped hashes to 3717115895887543972
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00008.mapped hashes to 8824063684278310934
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00009.mapped hashes to 673568360187010420
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00010.mapped hashes to 9867662168026348720
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00011.mapped hashes to 5390329291543335432
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00012.mapped hashes to 13755032733372518054
myth57:~$

myth57:~$ ls -l files/intermediate/
total 655
-rw------- 1 cgregg operator 76280 May 29 10:26 00001.mapped
-rw------- 1 cgregg operator 54704 May 29 10:26 00002.mapped
-rw------- 1 cgregg operator 53732 May 29 10:26 00003.mapped
-rw------- 1 cgregg operator 53246 May 29 10:26 00004.mapped
-rw------- 1 cgregg operator 53693 May 29 10:26 00005.mapped
-rw------- 1 cgregg operator 53182 May 29 10:26 00006.mapped
-rw------- 1 cgregg operator 54404 May 29 10:26 00007.mapped
-rw------- 1 cgregg operator 53464 May 29 10:26 00008.mapped
-rw------- 1 cgregg operator 53143 May 29 10:26 00009.mapped
-rw------- 1 cgregg operator 53325 May 29 10:26 00010.mapped
-rw------- 1 cgregg operator 53790 May 29 10:26 00011.mapped
-rw------- 1 cgregg operator 52207 May 29 10:26 00012.mapped

It turns out that "thee" is only in 11 of the 12 files:

$ grep -l "^thee " files/intermediate/*.mapped \
          | wc -l
11

Also, files/output is empty:

myth57:$ ls -l files/output/
total 0

MapReduce questions


Centralized vs. Distributed

Master sends map and reduce tasks to workers
- This gives master a global view of the system: progress of map and reduce tasks, etc.
- Simplifies scheduling and debugging
Master can become a bottleneck: what if it cannot issue tasks fast enough?
- Centralized controllers can usually process 8,000-12,000 tasks/second
- MapReduce generally does not hit this bottleneck
  - Both map and reduce tasks read from disk: tasks take seconds to tens of seconds
  - MapReduce can scale to run on 80,000 - 120,000 cores
- More modern frameworks, like Spark, can hit this bottleneck
  - Spark and other frameworks operate on in-memory data
  - Significant work has gone into making their tasks computationally fast: sometimes 10s of ms
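
One way to read these numbers: if each core runs one task at a time, a master that issues R tasks per second can keep roughly R × (task duration) cores busy. The sketch below assumes 10-second MapReduce tasks and 10 ms in-memory tasks; both durations are illustrative picks from the ranges above, and `max_busy_cores` is a hypothetical name.

```python
def max_busy_cores(dispatch_rate_per_s, task_seconds):
    """Cores a master can keep busy if each core runs one task at a time."""
    return dispatch_rate_per_s * task_seconds

low = max_busy_cores(8_000, 10)          # 10-second disk-bound tasks
high = max_busy_cores(12_000, 10)
in_memory = max_busy_cores(8_000, 0.010)  # 10 ms in-memory tasks
print(low, high, in_memory)
```

With 10-second tasks this gives the 80,000 to 120,000 core range; with 10 ms tasks the same master saturates at around 80 cores.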

Centralized vs. Distributed

"Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics"

Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis

In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)

Graph shows what happens as you try to parallelize a Spark workload across more servers: it slows down
More and more time is spent in the "control plane" -- sending messages to spawn tasks to compute
Nodes fall idle, waiting for the master to send them work to do.
Suppose tasks are 100ms long
- 30 workers = 2400 cores
- Can execute 24,000 tasks/second
- Master can only send 8,000
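
The arithmetic in this example can be checked directly. A small sketch, assuming each core runs one task at a time and the master's 8,000 tasks/second is a hard ceiling (`utilization` and its parameter names are illustrative):

```python
def utilization(cores, task_ms, master_tasks_per_s):
    """Return (tasks/second the cores could execute, fraction of cores kept busy)."""
    demand = cores * 1000 // task_ms  # each core finishes 1000 / task_ms tasks per second
    return demand, min(1.0, master_tasks_per_s / demand)

demand, busy_fraction = utilization(cores=2400, task_ms=100, master_tasks_per_s=8_000)
print(demand, busy_fraction)
```

Under these assumptions the cores could execute 24,000 tasks/second, so a master limited to 8,000 keeps only about a third of them busy.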

Centralized Bottleneck

This bottleneck will only worsen as we make tasks computationally faster: Spark used to be 50x slower than C++, now it is 6x slower (3x for JVM, 2x for data copies)
- What happens when it invokes C++ libraries and is 1.1x slower?
- What happens when it invokes GPUs?

Decentralized Approach (e.g., Naiad, MPI)

Some systems have no centralized master: there are only worker nodes
Problem: what do you do when one of the workers fails? How do other workers discover this and respond appropriately?
- In practice, these systems just fail and force you to restart the computation
- The complexity of handling this is not worth the benefit of automatic recovery
Problem: how do you load balance?
- You do so locally and suboptimally or not at all
- Just try to keep workers busy
- Work stealing: if a worker falls idle, it tries to steal work from adjacent workers
- Can require complex reconfiguration of the program, as workers need to know new placements of computations/data (e.g., Naiad requires installing a new dataflow graph)
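
Work stealing can be illustrated with a small single-threaded simulation. This is a toy model, not how Naiad or MPI implement it: workers sit in a ring, each has its own queue, and in each round an idle worker takes one task from the back of an adjacent worker's queue. All names are hypothetical.

```python
from collections import deque

def run_with_stealing(queues):
    """queues: one deque of task ids per worker. Returns the tasks each worker executed."""
    n = len(queues)
    executed = [[] for _ in range(n)]
    while any(queues):  # an empty deque is falsy
        for w in range(n):
            if queues[w]:
                executed[w].append(queues[w].popleft())  # run own work from the front
            else:
                for neighbor in ((w - 1) % n, (w + 1) % n):
                    if queues[neighbor]:
                        # Idle: steal one task from the back of an adjacent queue.
                        executed[w].append(queues[neighbor].pop())
                        break
    return executed

result = run_with_stealing([deque(range(8)), deque(), deque([8, 9])])
print(result)
```

Worker 1 starts with nothing and still ends up doing work, but only by looking at its neighbors. No node has a global view, which is why the balance is local and possibly suboptimal.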

Idea: Cache Control Plane Messages

If a job uses fast tasks and runs for a long time, it must either be an enormously long program or have loops
Loops are repeated patterns of control plane messages: they execute the same tasks again and again
- Common in machine learning, simulations, data analytics, approximations, etc.
Idea: rather than execute each loop iteration from scratch, cache a block of control plane operations and re-instantiate it with a single message
This occurs between the user program and the master as well as between the master and workers
Call this cached structure a template: some values and structures are static (like which tasks to execute), others are bound dynamically (like which data objects to operate on, task IDs, etc.)
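
The message savings from templates can be illustrated with a toy model. This is not the Nimbus API: the `Worker` class, its methods, and the three-task loop body are all invented for illustration. The static part of the template is the task list; the dynamic part is the `data_id` bound at each instantiation.

```python
class Worker:
    """Toy worker that counts the control plane messages it receives."""

    def __init__(self):
        self.templates = {}
        self.messages_received = 0
        self.executed = []

    def run_task(self, task_name, data_id):
        self.messages_received += 1  # one message per task
        self.executed.append((task_name, data_id))

    def install_template(self, template_id, task_names):
        self.messages_received += 1  # one-time cost to cache the block
        self.templates[template_id] = list(task_names)

    def instantiate(self, template_id, data_id):
        self.messages_received += 1  # one message re-runs the whole block
        for task_name in self.templates[template_id]:
            self.executed.append((task_name, data_id))

loop_body = ["gradient", "update", "check"]
iterations = 100

uncached = Worker()
for i in range(iterations):
    for task in loop_body:
        uncached.run_task(task, data_id=i)

cached = Worker()
cached.install_template("loop", loop_body)
for i in range(iterations):
    cached.instantiate("loop", data_id=i)

print(uncached.messages_received, cached.messages_received)
```

Both workers execute the same 300 tasks, but the cached version receives 101 messages instead of 300, and the gap grows with the size of the loop body.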

Results

Caching control plane messages (Nimbus) is as fast as systems that do not have a centralized master (Naiad-opt), and much faster than systems which have a centralized master but do not cache messages (Spark-opt)

Results

Caching control plane messages allows Nimbus to scale to support over 100,000 tasks/second, while Spark bottlenecks around 6,000

Results

Because systems caching control messages can use a centralized manager, they can recover from failures and load-balance easily, although doing so requires generating new templates and caching them.

Question From Piazza

The assignment spec suggests that we should be able to select a thread to which we assign a thunk ("The dispatcher thread should loop almost interminably. In each iteration, it should sleep until schedule tells it that something has been added to the queue. It should then wait for a worker to become available, select the worker, mark it as unavailable, dequeue a function, put a copy of that function in a place where the worker can access it, and then signal the worker to execute it").

This requires bookkeeping at a fine granularity as we need to store context for each thread. An alternative implementation would be to keep a global task queue and a synchronization mechanism realized by, say, a single condition variable so that we can simply `notify_one` just one thread from the pool of worker threads and let it execute the thunk. This should work just fine as long as we do not care which thread performs the task, and that seems to be the case in this assignment. This alternative implementation is likely more lightweight, since we will be freed from keeping track of each thread's context. Maybe I am missing something, but what are the reasons for having the control over exactly which thread executes a thunk?
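
The alternative described in the question (a global queue plus a single condition variable, with no dispatcher thread) might look like the sketch below. It is written in Python rather than the assignment's C++, it is not the assignment's required design, and `SimplePool` and its methods are hypothetical names. Python's `Condition.notify()` plays the role of `notify_one`.

```python
import threading
from collections import deque

class SimplePool:
    """Workers pull thunks off one shared queue guarded by a condition variable."""

    def __init__(self, num_workers):
        self.tasks = deque()
        self.cond = threading.Condition()
        self.shutting_down = False
        self.workers = [threading.Thread(target=self._worker) for _ in range(num_workers)]
        for t in self.workers:
            t.start()

    def schedule(self, thunk):
        with self.cond:
            self.tasks.append(thunk)
            self.cond.notify()  # wake one waiting worker; we do not care which

    def _worker(self):
        while True:
            with self.cond:
                while not self.tasks and not self.shutting_down:
                    self.cond.wait()
                if not self.tasks:
                    return  # shutting down and nothing left to run
                thunk = self.tasks.popleft()
            thunk()  # run outside the lock so other workers can dequeue

    def shutdown(self):
        with self.cond:
            self.shutting_down = True
            self.cond.notify_all()
        for t in self.workers:
            t.join()

results = []
results_lock = threading.Lock()

def make_thunk(i):
    def thunk():
        with results_lock:
            results.append(i * i)
    return thunk

pool = SimplePool(4)
for i in range(10):
    pool.schedule(make_thunk(i))
pool.shutdown()
```

No per-thread bookkeeping is needed: whichever worker wakes up takes the next thunk. The trade-off is that nothing in the system decides, or knows in advance, which worker will run a given thunk.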

Visually

Centralized with dispatcher thread

Distributed: workers pull thunks off queue
