Autumn 2021
Jerry Cain
Credit Phil Levis, Chris Gregg, and Nick Troccoli for these slides.
Idea: Split document into chunks, count words in each chunk concurrently
Problem: What if a word appears in multiple chunks?
Better Idea: Combine all the output, sort it, split it into chunks, and combine the counts within each chunk (in parallel).
Input chunks:       "the very very"   |   "quick fox greeted"   |   "the brown fox"
Per-chunk counts:   the, 1  very, 2   |   quick, 1  fox, 1  greeted, 1   |   the, 1  brown, 1  fox, 1
Combined:           the, 1  very, 2  quick, 1  fox, 1  greeted, 1  the, 1  brown, 1  fox, 1
Sorted:             brown, 1  fox, 1  fox, 1  greeted, 1  quick, 1  the, 1  the, 1  very, 2
Counts combined (each chunk of the sorted list handled in parallel):
                    brown, 1  fox, 2  greeted, 1  quick, 1  the, 2  very, 2
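Below is a minimal single-process sketch of this pipeline in Python (count per chunk, combine, sort, merge counts); it is only illustrative - in a real parallel run, each chunk and each slice of the sorted output would be handled by a separate worker.

    from collections import Counter

    chunks = ["the very very", "quick fox greeted", "the brown fox"]

    # Phase 1: count words within each chunk (each chunk could go to its own worker).
    per_chunk = [list(Counter(chunk.split()).items()) for chunk in chunks]

    # Combine all the per-chunk output and sort it by word.
    combined = [pair for chunk_counts in per_chunk for pair in chunk_counts]
    combined.sort(key=lambda pair: pair[0])

    # Phase 2: merge the counts for each word (slices of the sorted list could
    # again go to separate workers, since equal words are now adjacent).
    totals = Counter()
    for word, count in combined:
        totals[word] += count
    print(dict(totals))  # {'brown': 1, 'fox': 2, 'greeted': 1, 'quick': 1, 'the': 2, 'very': 2}

Note that the parallel reduce only works if the chunk boundaries in the sorted list fall between different words, so that no word's pairs are split across two workers.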
There are 2 "phases" where we can install parallelism.
map the input to some intermediate data representation
reduce the intermediate data representation into final result
The first phase focuses on finding words, and the second phase focuses on summing counts. So the first phase should output only 1's, and leave the summing for later.
With that change, the chunk "the very very" now maps to:  the, 1   very, 1   very, 1   (one pair per word occurrence, instead of the, 1   very, 2).
There are 2 "phases" where we can install parallelism.
map the input to some intermediate data representation
reduce the intermediate data representation into final result
Per-chunk map output (all 1's):   the, 1  very, 1  very, 1   |   quick, 1  fox, 1  greeted, 1   |   the, 1  brown, 1  fox, 1
Combined:   the, 1  very, 1  very, 1  quick, 1  fox, 1  greeted, 1  the, 1  brown, 1  fox, 1
Sorted:     brown, 1  fox, 1  fox, 1  greeted, 1  quick, 1  the, 1  the, 1  very, 1  very, 1
Reduced:    brown, 1  fox, 2  greeted, 1  quick, 1  the, 2  very, 2
Question: Is there a way to parallelize the combine-and-sort step as well?
Answer: Yes - have each map task assign every key to a bucket: bucket # = hash(key) % R, where R = # reduce tasks (3 in this example).
Per-chunk map output, with each pair routed to a bucket by hash(key) % 3:
  chunk 1:  the, 1  very, 1  very, 1       (buckets 2 and 3)
  chunk 2:  fox, 1  greeted, 1  quick, 1   (buckets 1 and 2)
  chunk 3:  brown, 1  fox, 1  the, 1       (buckets 1 and 2)
Sorted (within buckets):   brown, 1  fox, 1  fox, 1  greeted, 1  quick, 1  the, 1  the, 1  very, 1  very, 1
Reduced:                   brown, 1  fox, 2  greeted, 1  quick, 1  the, 2  very, 2
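A small sketch of that bucketing rule in Python. A stable hash is used here because Python's built-in hash() is randomized per process, which would send the same word to different buckets on different machines; everything else mirrors the diagram above.

    import zlib

    R = 3  # number of reduce tasks / buckets

    def bucket_for(key):
        # Stable hash of the key modulo R, so every occurrence of a given word
        # lands in the same bucket no matter which mapper emitted it.
        return zlib.crc32(key.encode("utf-8")) % R

    pairs = [("the", 1), ("very", 1), ("very", 1), ("quick", 1), ("fox", 1),
             ("greeted", 1), ("the", 1), ("brown", 1), ("fox", 1)]
    buckets = {r: [] for r in range(R)}
    for word, count in pairs:
        buckets[bucket_for(word)].append((word, count))
    # Each bucket can now be sorted and reduced by a separate worker, in parallel.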
Input Files → Map Phase → Intermediate Files → Reduce Phase → Output Files
Task: We have web pages, and we want to build, for each URL, a list of the web pages that link to it.
Possible Approach: Leverage our existing idea to read web pages and build a URL to list(web page) map.
Fundamentally, this is a simplified version of PageRank 😊. In fact, Google once used MapReduce to build an in-memory index of the entire web.
How can we parallelize this?
Input pages:
  a.com: Visit d.com for more! Also see e.com.
  b.com: Visit a.com for more! Also see e.com.
  c.com: Visit a.com for more! Also see d.com.
Map output (linked-to URL, linking page):
  a.com, b.com   a.com, c.com   d.com, a.com   d.com, c.com   e.com, a.com   e.com, b.com
Reduced output:
  a.com, [b.com, c.com]   d.com, [a.com, c.com]   e.com, [a.com, b.com]
Map output, with each pair routed to a bucket by hash(key) % 3 (a.com → bucket 1, d.com → bucket 2, e.com → bucket 3):
  from a.com:  d.com, a.com (bucket 2)   e.com, a.com (bucket 3)
  from b.com:  a.com, b.com (bucket 1)   e.com, b.com (bucket 3)
  from c.com:  d.com, c.com (bucket 2)   a.com, c.com (bucket 1)
Input Files → Map Phase → Intermediate Files → Reduce Phase → Output Files
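A small Python sketch of the same idea for the inverted link index; the regex-based link extraction is naive and only meant to reproduce the example above, not to parse real HTML.

    import re
    from collections import defaultdict

    pages = {
        "a.com": "Visit d.com for more! Also see e.com.",
        "b.com": "Visit a.com for more! Also see e.com.",
        "c.com": "Visit a.com for more! Also see d.com.",
    }

    # Map step: for each page, emit (linked-to URL, linking page) pairs.
    pairs = []
    for page, text in pages.items():
        for url in re.findall(r"[a-z]+\.com", text):
            pairs.append((url, page))

    # Reduce step: group the pairs by linked-to URL.
    index = defaultdict(list)
    for url, page in sorted(pairs):
        index[url].append(page)
    print(dict(index))  # {'a.com': ['b.com', 'c.com'], 'd.com': ['a.com', 'c.com'], 'e.com': ['a.com', 'b.com']}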
Case Study: Counting Word Frequencies
Sequential Approach: a program that reads a document and builds a word-to-frequency map
Parallel Approach: split the document into chunks, count words in each chunk concurrently, partitioning the output by key. Then sort and reduce each partition concurrently.
Case Study: Inverted Web Index
Sequential Approach: a program that reads web pages and builds a URL-to-list(web page) map
Parallel Approach: split the web pages into chunks, find linked URLs in each chunk concurrently, partitioning the output by key. Then sort and reduce each partition concurrently.
map the input to some intermediate data representation
reduce the intermediate data representation into the final result
MapReduce executes programs in parallel, provided you specify the input, the map step, and the reduce step.
Published by Google in 2004 as "MapReduce: Simplified Data Processing on Large Clusters" (Dean and Ghemawat, OSDI 2004). Read it [here].
MapReduce's primary goal is to make running programs across multiple machines as easy as possible.
There are many challenges in writing programs that span many machines, including:
machines failing
communicating over the network
coordinating tasks
Programmer must implement map and reduce steps:
map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)
map(String key, String value):
    // key: document name
    // value: document contents
    for word w in value:
        EmitIntermediate(w, "1")

reduce(String key, List values):
    // key: a word
    // values: a list of counts
    int result = 0
    for v in values:
        result += ParseInt(v)
    Emit(AsString(result))
User specifies information about the job they wish to run:
input data
map component
reduce component
# map tasks M (perhaps chosen so each task handles some target amount of data)
# reduce tasks R (perhaps chosen so each task handles some target amount of data)
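As a concrete (purely hypothetical) illustration, a job specification for the word-count example might be bundled up like this; the field names are made up for this sketch and are not the actual MapReduce API.

    # Hypothetical job specification - field names are illustrative only.
    word_count_job = {
        "input_files": ["corpus/part-0000.txt", "corpus/part-0001.txt"],
        "map_fn": "word_count_map",        # emits (word, "1") per occurrence
        "reduce_fn": "word_count_reduce",  # sums the 1's for each word
        "num_map_tasks": 8,                # M: e.g. so each task reads ~64 MB
        "num_reduce_tasks": 3,             # R: number of output partitions
    }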
MapReduce partitions the input data into M pieces and starts the program running on a cluster of machines - one machine becomes the orchestrator, the others become workers.
The orchestrator assigns tasks (map or reduce) to idle workers until the job is done.
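Here is a minimal local sketch of that loop in Python, using a thread pool to stand in for the worker machines. The real orchestrator dispatches tasks over the network and also has to detect and re-run tasks whose workers fail; none of that is modeled here.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_job(map_tasks, reduce_tasks, do_map, do_reduce, num_workers=4):
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            # Phase 1: hand out map tasks and wait for every one to finish,
            # since any map task may contribute data to any reduce task.
            for future in as_completed([pool.submit(do_map, t) for t in map_tasks]):
                future.result()
            # Phase 2: only then hand out the reduce tasks.
            for future in as_completed([pool.submit(do_reduce, t) for t in reduce_tasks]):
                future.result()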
Map task - the worker reads its slice of the input data, calls map() on it, and partitions the output across R (= 3) intermediate files on disk using hash(key) % R.
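A rough Python sketch of one such map task, assuming a hypothetical map_fn that yields (key, value) pairs; the intermediate file names and tab-separated format are made up for illustration.

    import zlib

    def run_map_task(task_id, input_slice_path, map_fn, R=3):
        # One intermediate file per reduce task; reducer r later reads
        # intermediate-<task_id>-<r>.txt from every map task.
        partitions = [open(f"intermediate-{task_id}-{r}.txt", "w") for r in range(R)]
        with open(input_slice_path) as f:
            for key, value in map_fn(input_slice_path, f.read()):
                r = zlib.crc32(key.encode("utf-8")) % R  # which reduce task owns this key
                partitions[r].write(f"{key}\t{value}\n")
        for out in partitions:
            out.close()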
Reduce task - the orchestrator tells the reducer where its relevant partitions live; the reducer reads them and sorts them by intermediate key. For each intermediate key and its set of intermediate values, it calls reduce(), and the output is appended to the reducer's output file.
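A matching Python sketch of one reduce task, again under the illustrative file naming and tab-separated format used in the map-task sketch above.

    from itertools import groupby

    def run_reduce_task(r, partition_paths, reduce_fn):
        # Gather this reducer's partition files (one per map task), sort all
        # pairs by intermediate key, then reduce each group of values.
        pairs = []
        for path in partition_paths:  # e.g. every intermediate-*-<r>.txt file
            with open(path) as f:
                for line in f:
                    key, value = line.rstrip("\n").split("\t")
                    pairs.append((key, value))
        pairs.sort(key=lambda kv: kv[0])
        with open(f"output-{r}.txt", "w") as out:
            for key, group in groupby(pairs, key=lambda kv: kv[0]):
                values = [v for _, v in group]
                out.write(f"{key}\t{reduce_fn(key, values)}\n")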
The orchestrator repeatedly assigns tasks (map or reduce) to idle workers until the job is done.
Takeaways:
We have to execute all maps before we execute any reduce, because any map task's output may feed into any given reduce task.
The number of workers is separate from the number of tasks; e.g., in the word count example we could have any number of workers. A worker can execute multiple tasks, and tasks can run anywhere.
The MapReduce framework handles parallel processing, networking, error handling, and more.
MapReduce relies heavily on keys to distribute load across many machines while avoiding the need to move data around unnecessarily.
Hashing lets us collect keys into larger units of work, e.g. for N tasks, a key K falls under the responsibility of the task whose ID is hash(K) % N.
All data with the same key is processed by the same worker, as part of the same reduce task.
Optional Assignment 7: You code to the standard MapReduce design, with tons of scaffolding and a few differences:
Intermediate files aren't saved locally on the mappers and fetched by the reducers - instead, we rely on AFS, which every machine can read and write.