Map Reduce

Achieving locality by being oblivious to order

To minimize cache misses we want to process data sequentially.
To compute in parallel on several CPUs, we want processing in each CPU to be independent of the others.
As programmers, we want both sequential access and parallelism, without having to know the details of the hardware.
Approach: write code that expresses the desired end result, without specifying how to get there.
Map-Reduce: perform operations on arrays without specifying the order of the computation.
Spark will optimize the order of computation on the fly.

Map: square each item

list L=[0,1,2,3]
Compute the square of each item
output: [0,1,4,9]

Traditional (compute from first to last in order)

## For Loop
O=[]
for i in L:
    O.append(i*i)

## List Comprehension
[i*i for i in L]

Map-Reduce (computation order is not specified)

map(lambda x:x*x, L)
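A minimal runnable sketch of the comparison above, using plain Python 3. One caveat: Python's built-in `map` still walks the list in order; it only illustrates the style of stating the per-item operation without writing the loop. The variable names `squares_loop` and `squares_map` are illustrative.

```python
# Example data from the slide.
L = [0, 1, 2, 3]

# Traditional: the loop spells out the order of evaluation.
squares_loop = []
for i in L:
    squares_loop.append(i * i)

# Map: only the per-item operation is stated.
# In Python 3, map() returns a lazy iterator, so list() forces evaluation.
squares_map = list(map(lambda x: x * x, L))

print(squares_loop)  # [0, 1, 4, 9]
print(squares_map)   # [0, 1, 4, 9]
```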

Reduce: compute the sum

A list L=[3,1,5,7]
Find the sum (16)

Traditional (compute from first to last in order)

## Use Builtin
sum(L)

## for loop
s=0
for i in L:
    s+=i

Map-Reduce (computation order is not specified)

## In Python 3, reduce must be imported from functools
from functools import reduce
reduce(lambda x,y: x+y, L)
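The following sketch shows why leaving the order unspecified is safe for a sum. It reduces the list once from left to right and once as two halves that are combined afterwards, which is roughly what a parallel engine might do. The split point and the helper name `add` are arbitrary choices for illustration.

```python
from functools import reduce  # reduce is not a builtin in Python 3

L = [3, 1, 5, 7]

def add(x, y):
    return x + y

# Left-to-right reduction over the whole list.
total_sequential = reduce(add, L)

# The same reduction done on two halves first, then combined.
left = reduce(add, L[:2])    # 3 + 1
right = reduce(add, L[2:])   # 5 + 7
total_split = add(left, right)

print(total_sequential, total_split)  # 16 16
```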

Map + Reduce

list L=[0,1,2,3]
Compute the sum of the squares
Note the differences

Traditional (immediate execution: compute from first to last in order)

## For Loop
s=0
for i in L:
    s+= i*i

## List comprehension
sum([i*i for i in L])

Map-Reduce (execution plan: computation order is not specified)

## In Python 3, reduce must be imported from functools
from functools import reduce
reduce(lambda x,y:x+y,
       map(lambda i:i*i,L))
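As a loose analogy for the difference between immediate execution and an execution plan, the sketch below uses Python 3's lazy `map`: building the `map` object squares nothing, and the work happens only when `reduce` consumes it. This illustrates deferred evaluation in plain Python and is not a description of how Spark builds its plans. The name `plan` is illustrative.

```python
from functools import reduce

L = [0, 1, 2, 3]

# Immediate execution: each statement runs as soon as it is reached.
s = 0
for i in L:
    s += i * i

# Deferred: map() only builds an iterator; no item has been squared yet.
plan = map(lambda i: i * i, L)

# The squares are computed only when reduce() pulls items from the iterator.
sum_of_squares = reduce(lambda x, y: x + y, plan)

print(s, sum_of_squares)  # 14 14
```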

Order independence

The result of map or reduce must not depend on the order

sum does not depend on computation order

[Figure: the same list summed in for-loop order and in parallel order. Result does not depend on order.]

difference depends on computation order

[Figure: the same list subtracted in for-loop order and in parallel order. Result depends on order.]
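The contrast between sum and difference can be checked with a short sketch. `tree_reduce` is a hypothetical helper that reduces the two halves of the list separately and then combines them, which is one possible parallel order. Addition gives the same answer either way, while subtraction does not.

```python
from functools import reduce

data = [3, 1, 5, 7]

def tree_reduce(f, items):
    """Reduce each half separately, then combine (one possible parallel order)."""
    if len(items) == 1:
        return items[0]
    mid = len(items) // 2
    return f(tree_reduce(f, items[:mid]), tree_reduce(f, items[mid:]))

def add(x, y):
    return x + y

def sub(x, y):
    return x - y

# Sum: for-loop order and tree order agree.
print(reduce(add, data), tree_reduce(add, data))  # 16 16

# Difference: ((3-1)-5)-7 versus (3-1)-(5-7).
print(reduce(sub, data), tree_reduce(sub, data))  # -10 4
```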

Computing the average incorrectly

Average = data.reduce(lambda a,b: (a+b)/2)

data=[1,2,3], average is 2

Computed Average=((1+2)/2 + 3)/2 = 2.25
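The arithmetic above can be reproduced in plain Python with `functools.reduce` standing in for the `data.reduce` call. The sketch also evaluates the other possible grouping, showing that the pairwise-average function gives a different wrong answer when the order changes. The names `pair_avg`, `left_to_right`, and `right_first` are illustrative.

```python
from functools import reduce

data = [1, 2, 3]

def pair_avg(a, b):
    return (a + b) / 2

# Grouping used on the slide: ((1+2)/2 + 3)/2
left_to_right = reduce(pair_avg, data)

# The other grouping: (1 + (2+3)/2)/2
right_first = pair_avg(data[0], pair_avg(data[1], data[2]))

true_average = sum(data) / len(data)

print(left_to_right, right_first, true_average)  # 2.25 1.75 2.0
```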

Computing the average correctly

## "total" is used instead of "sum" so the Python builtin is not shadowed
total,count=(data.map(lambda x: (x,1))
   .reduce(lambda P1,P2:
           (P1[0]+P2[0],P1[1]+P2[1])))

Average = total/count

data=[1,2,3], average is 2

[1,2,3].map(lambda x: (x,1)) = [(1,1),(2,1),(3,1)]

total, count = [(1,1),(2,1),(3,1)].reduce(...) = 6,3

average = 6/3 = 2
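The same computation can be run in plain Python, with the built-in `map` and `functools.reduce` standing in for the `data.map(...).reduce(...)` calls. The sketch also regroups the pairs to show that, unlike the pairwise average, adding (sum, count) pairs gives the same result in either order. The names `combine` and `regrouped` are illustrative.

```python
from functools import reduce

data = [1, 2, 3]

def combine(p1, p2):
    """Add two (sum, count) pairs component-wise."""
    return (p1[0] + p2[0], p1[1] + p2[1])

# Map each value x to the pair (x, 1), then reduce the pairs.
pairs = [(x, 1) for x in map(lambda x: x, data)]
total, count = reduce(combine, pairs)
average = total / count
print(total, count, average)  # 6 3 2.0

# A different grouping of the same pairs yields the same (sum, count).
regrouped = combine(pairs[0], combine(pairs[1], pairs[2]))
print(regrouped)  # (6, 3)
```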

Why Order Independence?

Computation order can be chosen by the compiler/optimizer.
This allows sums of subsets to be computed in parallel.
Modern hardware calls for parallel computation, but parallel computation is very hard to program.
By using map-reduce, the programmer exposes opportunities for parallel computation to the compiler.
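A rough sketch of "parallel computation of sums of subsets" using only the Python standard library. The helpers `chunked` and `parallel_sum` are hypothetical names, and the thread pool is used only to show the structure: in standard CPython, threads will not speed up this kind of arithmetic. The point is that the partial sums are independent, so they can be computed in any order and combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous slices (hypothetical helper)."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_sum(items, n_workers=4):
    """Sum each chunk independently, then combine the partial sums."""
    chunks = chunked(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # The initial value 0 makes the reduction well defined for an empty input.
    return reduce(lambda x, y: x + y, partial_sums, 0)

data = list(range(1, 101))
print(parallel_sum(data))  # 5050
```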

Spark and Map-Reduce

Map-reduce is the basis for many systems.
For big data: Hadoop and Spark.

By Yoav Freund
