Principles of Computer Systems
Autumn 2019
Stanford University
Computer Science Department
Instructors: Chris Gregg
Philip Levis
map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)
map(String key, String value):
// key: document name
// value: document contents
for word w in value:
EmitIntermediate(w,"1")
reduce(String key, List values):
// key: a word
// values: a list of counts
int result = 0
for v in values:
result += ParseInt(v)
Emit(AsString(result))
"The number of partitions (R) and the partitioning function are specified by the user."
("the", "1") , ("the", "1"), ("The", "1"), ("of", 1),
(number, "1"), ...
"the", ("1", "1") -> "2"
"The", ("1") -> "1"
"of", ("1") -> "1"
"number", ("1") -> "1"
map output
input
reduce
map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)
mapper
reducer
mapper
reducer
Your System
AFS
12 input partitions
chunks of file
map tasks
1
2
11
12
map(String key, String value):
// key: document name
// value: document contents
for word w in value:
EmitIntermediate(w,"1")
reduce(String key, List values):
// key: a word
// values: a list of counts
int result = 0
for v in values:
result += ParseInt(v)
Emit(AsString(result))
12 input partitions
"Propitious Pallas, to secure..."
map tasks
1
2
11
12
240 intermediate files
each of 12 map tasks produces one file for each worker task
12.19.mapped
1.5.mapped
1.0.mapped
12.18.mapped
This file contains all of the map outputs for map task 7 whose key, modulo 20, is 5
in 1
now 1
step 1
now 1
in 1
in 1
....
7.5.mapped
12 input partitions
"Propitious Pallas, to secure..."
map tasks
1
2
11
12
240 intermediate files
each of 12 map tasks produces one file for each worker task
12.19.mapped
1.5.mapped
1.0.mapped
12.18.mapped
7.5.mapped
5.grouped
This file contains all of the reduce inputs for every key whose hash % 20 == 5
in 1 1 1 1 1 1 1 1 .. (1149 times)
now 1 1 1 1 ... (126 times)
step 1 1 1 1 ... (10 times)
....
12 input partitions
"Propitious Pallas, to secure..."
map tasks
1
2
11
12
240 intermediate files
each of 12 map tasks produces one file for each worker task
12.19.mapped
1.5.mapped
1.0.mapped
12.18.mapped
7.5.mapped
5.grouped
0
1
5
19
5.output
Contains the count of words whose hash % 20 == 5
in 1149
now 126
step 10
reduce tasks
files/
directory.myth57:$ make filefree
rm -fr files/intermediate/* files/output/*
myth57:$ ./samples/mr_soln --mapper ./samples/mrm_soln --reducer ./samples/mrr_soln --config odyssey-full.cfg
Determining which machines in the myth cluster can be used... [done!!]
Mapper executable: word-count-mapper
Reducer executable: word-count-reducer
Number of Mapping Workers: 8
Number of Reducing Workers: 4
Input Path: /usr/class/cs110/samples/assign8/odyssey-full
Intermediate Path: /afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/intermediate
Output Path: /afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/output
Server running on port 48721
Received a connection request from myth59.stanford.edu.
Incoming communication from myth59.stanford.edu on descriptor 6.
Instructing worker at myth59.stanford.edu to process this pattern: "/usr/class/cs110/samples/assign8/odyssey-full/00001.input"
Conversation with myth59.stanford.edu complete.
Received a connection request from myth61.stanford.edu.
Incoming communication from myth61.stanford.edu on descriptor 7.
... LOTS of lines removed
Remote ssh command on myth56 executed and exited with status code 0.
Reduction of all intermediate chunks now complete.
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/output/00000.output hashes to 13585898109251157014
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/output/00001.output hashes to 1022930401727915107
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/output/00002.output hashes to 9942936493001557706
/afs/.ir.stanford.edu/users/c/g/cgregg/cs110/spring-2019/assignments/assign8/files/output/00003.output hashes to 5127170323801202206
... more lines removed
Server has shut down.
mr
has the 8 mappers (from the .cfg file) process the 12 files processed by word-count-mapper and put the results into files/intermediate:
myth57:$ ls -lu files/intermediate/
total 858
-rw------- 1 cgregg operator 2279 May 29 09:29 00001.00000.mapped
-rw------- 1 cgregg operator 1448 May 29 09:29 00001.00001.mapped
-rw------- 1 cgregg operator 1927 May 29 09:29 00001.00002.mapped
-rw------- 1 cgregg operator 2776 May 29 09:29 00001.00003.mapped
-rw------- 1 cgregg operator 1071 May 29 09:29 00001.00004.mapped
...lots removed
-rw------- 1 cgregg operator 968 May 29 09:29 00012.00027.mapped
-rw------- 1 cgregg operator 1720 May 29 09:29 00012.00028.mapped
-rw------- 1 cgregg operator 1686 May 29 09:29 00012.00029.mapped
-rw------- 1 cgregg operator 2930 May 29 09:29 00012.00030.mapped
-rw------- 1 cgregg operator 2355 May 29 09:29 00012.00031.mapped
myth57:$ head -10 files/intermediate/00012.00028.mapped
thee 1
rest 1
thee 1
woes 1
knows 1
grieve 1
sire 1
laertes 1
sire 1
power 1
00012.input
that hashed to 28 modulo 32 (because we have 8 mappers * 4 reducers)00005.00028
, we can also see "thee" again:
00005.input
(these files are not reduced yet!)myth57:$ head -10 files/intermediate/00005.00028.mapped
vain 1
must 1
strand 1
cry 1
herself 1
she 1
along 1
head 1
dayreflection 1
thee 1
myth57:$ ./hasher thee 32
28
files/intermediate
, we see files without the reducer split:myth57:~$ make directories filefree
// make command listings removed for brevity
myth57:~$ make
// make command listings removed for brevity
myth57:~$ ./mr --mapper ./mrm --reducer ./mrr --config odyssey-full.cfg --map-only --quiet
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00001.mapped hashes to 2579744460591809953
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00002.mapped hashes to 15803262022774104844
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00003.mapped hashes to 15899354350090661280
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00004.mapped hashes to 15307244185057831752
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00005.mapped hashes to 13459647136135605867
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00006.mapped hashes to 2960163283726752270
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00007.mapped hashes to 3717115895887543972
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00008.mapped hashes to 8824063684278310934
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00009.mapped hashes to 673568360187010420
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00010.mapped hashes to 9867662168026348720
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00011.mapped hashes to 5390329291543335432
/afs/ir.stanford.edu/users/c/g/cgregg/assign8/files/intermediate/00012.mapped hashes to 13755032733372518054
myth57:~$
myth57:~$ $ ls -l files/intermediate/
total 655
-rw------- 1 cgregg operator 76280 May 29 10:26 00001.mapped
-rw------- 1 cgregg operator 54704 May 29 10:26 00002.mapped
-rw------- 1 cgregg operator 53732 May 29 10:26 00003.mapped
-rw------- 1 cgregg operator 53246 May 29 10:26 00004.mapped
-rw------- 1 cgregg operator 53693 May 29 10:26 00005.mapped
-rw------- 1 cgregg operator 53182 May 29 10:26 00006.mapped
-rw------- 1 cgregg operator 54404 May 29 10:26 00007.mapped
-rw------- 1 cgregg operator 53464 May 29 10:26 00008.mapped
-rw------- 1 cgregg operator 53143 May 29 10:26 00009.mapped
-rw------- 1 cgregg operator 53325 May 29 10:26 00010.mapped
-rw------- 1 cgregg operator 53790 May 29 10:26 00011.mapped
-rw------- 1 cgregg operator 52207 May 29 10:26 00012.mapped
$ grep -l "^thee " files/intermediate/*.mapped \
| wc -l
11
myth57:$ ls -l files/output/
total 0
"Execution Templates: Caching Control Plane Decisions for
Strong Scaling of Data Analytics"
Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis
In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)
"Execution Templates: Caching Control Plane Decisions for
Strong Scaling of Data Analytics"
Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis
In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)
"Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics"
Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis
In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)
"Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics"
Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis
In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)
"Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics"
Omid Mashayekhi, Hang Qu, Chinmayee Shah, Philip Levis
In Proceedings of 2017 USENIX Annual Technical Conference (USENIX ATC '17)
The assignment spec suggests that we should be able to select a thread to which we assign a thunk ("The dispatcher thread should loop almost interminably. In each iteration, it should sleep until schedule tells it that something has been added to the queue. It should then wait for a worker to become available, select the worker, mark it as unavailable, dequeue a function, put a copy of that function in a place where the worker can access it, and then signal the worker to execute it").
This requires bookkeeping at a fine granularity as we need to store context for each thread. An alternative implementation would be to keep a global task queue and a synchronization mechanism realized by, say, a single condition variable so that we can simply `nofity_one` just one thread from the pool of worker threads and let it execute the thunk. This should work just fine as long as we do not care which thread performs the task, and that seems to be the case in this assignment. This alternative implementation is likely more lightweight, since we will be freed from keeping track of each thread's context. Maybe I am missing something, but what are the reasons for having the control over exactly which thread executes a thunk?
Centralized with dispatcher
thread
Distributed: workers pull thunks
off queue