Map Reduce
Achieving locality by being oblivious to order
- To minimize cache misses we want to process data sequentially.
- To compute in parallel on several CPUs, we want processing in each CPU to be independent of the others.
- As a programmer, we want to achieve sequentiality and parallelism, without knowing the details of the hardware.
- Approach: write code that expresses the desired end result, without specifying how to get there.
- Map-Reduce: perform operations on arrays without specifying the order ot the computation.
- Spark will optimize the order of computation on the fly.
Map: square each item
- list L=[0,1,2,3]
- Compute the square of each item
- output: [0,1,4,9]
Traditional
Map-Reduce
## For Loop
O=[]
for i in L:
O.append(i*i)
## List Comprehension
[i*i for i in L]
map(lambda x:x*x, L)
compute from first to last in order
computation order is not specified
Reduce: compute the sum
- A list L=[3,1,5,7]
- Find the sum (16)
Traditional
Map-Reduce
## Use Builtin
sum(L)
## for loop
s=0
for i in L:
s+=i
reduce(lambda (x,y): x+y, L)
compute from first to last in order
computation order is not specified
Map + Reduce
- list L=[0,1,2,3]
- Compute the sum of the squares
- Note the differences
Traditional
Map-Reduce
## For Loop
s=0
for i in L:
s+= i*i
## List comprehension
sum([i*i for i in L])
reduce(lambda x,y:x+y, \\
map(lambda i:i*i,L))
compute from first to last in order
computation order is not specified
Execution plan
Immediate execution
Order independence
- The result of map or reduce must not depend on the order
sum does not depend on computation order
5
7
3
1
3
12
15
16
19
For loop order
5
7
3
1
3
parallel order
10
4
14
19
Result does not depend on order
difference depends on computation order
5
7
3
1
3
-2
-5
-6
-9
For loop order
5
7
3
1
3
parallel order
4
-2
6
-1
Result depends on order
Average = data.reduce(lambda a,b: (a+b)/2)
Computing the average incorrectly
Average = data.reduce(lambda a,b: (a+b)/2)
data=[1,2,3], average is 2
Computed Average=((1+2)/2 + 3)/2 = 2.25
Average = data.reduce(lambda a,b: (a+b)/2)
Average = data.reduce(lambda a,b: (a+b)/2)
Computing the average correctly
sum,count=data.map(lambda x: (x,1)) .reduce(lambda P1,P2: (P1[0]+P2[0],P1[1]+P2[1])) Average = sum/count
data=[1,2,3], average is 2
sum, count = [(1,1),(2,1),(3,1)].reduce() = 6,3
[1,2,3].map(lambda x: (x,1)) = [(1,1),(2,1),(3,1)]
average = 6/3 = 2
Why Order Independence?
- Computation order can be chosen by compiler/optimizer.
- Allows for parallel computation of sums of subsets.
- Modern hardware calls for parallel computation but parallel computation is very hard to program.
- Using map-reduce programmer exposes to the compiler opportunities for parallel computation.
Spark and Map-Reduce
- Map reduce is the basis for for many systems.
- For big data: Hadoop and Spark.
Map Reduce
By Yoav Freund
Map Reduce
- 5,692