plain RDDS
Lines=sc.parallelize(
  ['you are my sunshine',
   'my only sunshine',
   'you make me happy'])
## Initialize RDD
A=sc.parallelize(range(4))
## map
B=A.map(lambda x: x-2)
B.collect()
# output:
[-2,-1,0,1]
## Initialize RDD
A=sc.parallelize(range(4))
## filter
B=A.filter(lambda x: x%2==0)
B.collect()
# output:
[0,2]
sc.parallelize(range(4)).collect()
# output:
[0,1,2,3]
sc.parallelize(range(4)).count()
# output:
4
## Initialize RDD
A=sc.parallelize(range(4))
## reduce
A.reduce(lambda x,y: x+y)
# output:
6
(key, value) RDDs
car_count=sc.parallelize(
  [('honda',3),
   ('subaru',2),
   ('honda',2)])
database=sc.parallelize(
  [(55632,{'name':'yoav','city':'jerusalem'}),
   (3342,{'name':'homer','town':'fairview'})])
Each element of the RDD is a pair (key,value):
## Initialize pair RDD
A=sc.parallelize(range(4))\
.map(lambda x: (x,x*x))
## output
A.collect()
# output:
[(0,0),(1,1),(2,4),(3,9)]
A=sc.parallelize(\
  [(1,3),(4,100),(1,-5),(3,2)])
A.reduceByKey(lambda x,y: x*y)\
 .collect()
# output:
[(1,-15),(4,100),(3,2)]
reduceByKey : performs a reduce separately on the values of each key. Note: it is a transformation (it returns an RDD), not an action.
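To see the difference (a sketch added for illustration, using the same data as above): reduce is an action and immediately returns a value to the driver, while reduceByKey returns a new RDD and computes nothing until an action is called.
A=sc.parallelize(\
  [(1,3),(4,100),(1,-5),(3,2)])
A.values().reduce(lambda x,y: x+y)  # action: returns the number 100
B=A.reduceByKey(lambda x,y: x*y)    # transformation: returns an RDD, no work yet
B.count()                           # an action triggers the actual computation
# output:
3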
A detour: iterators
In C, a loop generates each index on the fly; the full range is never stored in memory:
for(i=0; i<1000000000; i++)
  { /* do something */ }
In Python 2, range builds the entire list first, while xrange yields one value at a time:
## Waste of memory space:
for i in range(1000000000):
    # do something
## No waste:
for i in xrange(1000000000):
    # do something
(In Python 3, range is already lazy, like xrange.)
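As a concrete Python illustration (added here; not part of the original example), a generator is an iterator that produces values on demand:
def squares(n):
    i=0
    while i<n:
        yield i*i  # one value at a time; the full sequence is never stored
        i+=1
for s in squares(1000000000):
    pass  # do something with s; memory use stays constant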
A=sc.parallelize(\
  [(1,3),(3,100),(1,-5),(3,2)])
A.groupByKey()\
 .map(lambda pair: (pair[0],[x for x in pair[1]]))\
 .collect()
# output:
[(1,[3,-5]),(3,[100,2])]
groupByKey : returns a (key,<iterator>) pair for each
key value. The iterator iterates over the values corresponding to the key.
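An equivalent and arguably more idiomatic way to materialize the iterators is PySpark's mapValues (a sketch, using the same A as above):
A.groupByKey()\
 .mapValues(list)\
 .collect()
# output (key order may vary):
[(1,[3,-5]),(3,[100,2])]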
A=sc.parallelize(\
[(1,3),(3,100),(1,-5),(3,2)])
A.countByKey()
# output:
{1:2, 3:2}
countByKey : returns a Python dictionary with the number of pairs for each key. Note: it is an action; the result is returned to the driver.
A=sc.parallelize(\
[(1,3),(3,100),(1,-5),(3,2)])
A.lookup(3)
# output:
[100,2]
lookup(key) : returns the list of all of the values associated with key.
A=sc.parallelize(\
[(1,3),(3,100),(1,-5),(3,2)])
A.collectAsMap()
# output:
{1: -5, 3: 2}
collectAsMap() : like collect(), but instead of returning a list of tuples it returns a Map (= dictionary). Note: if a key appears more than once, only one of its values is kept.
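To keep all of the values per key, combine groupByKey with collectAsMap (a sketch, same A as above):
A.groupByKey().mapValues(list).collectAsMap()
# output:
{1:[3,-5], 3:[100,2]}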
A=sc.parallelize(range(1000000))\
 .map(lambda x: (x,x))  # make it a (key,value) RDD
#select 10% of the entries
B=A.filter(lambda pair: pair[0]%10==0)
# get no. of partitions
n=B.getNumPartitions()
# Repartition using fewer partitions (at least 1).
C=B.partitionBy(max(1,n//10))
partitionBy(numPartitions[, partitionFunc]) : initiates a shuffle to distribute the keys to partitions according to the partitioner. Defined only for (key,value) RDDs.
glom() : returns an RDD with one array per partition.
Allows the worker to access all of the data in its partition.
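A small sketch of an explicit partitioner (the data and the partitionFunc here are illustrative assumptions):
pairs=sc.parallelize([(i,i*i) for i in range(8)])
## send even keys to partition 0, odd keys to partition 1
split=pairs.partitionBy(2, partitionFunc=lambda k: k%2)
split.glom().collect()
# output (order within each partition may vary):
[[(0,0),(2,4),(4,16),(6,36)],[(1,1),(3,9),(5,25),(7,49)]]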
## One partition per key
A=sc.parallelize(range(1000000))\
 .map(lambda x: (x*x%100,x))\
 .partitionBy(100)\
 .glom()
def variation(A):
    # A is the list of (key,value) pairs in one partition;
    # all pairs in a partition share the same key.
    d=0
    if len(A)>1:
        # sum the absolute differences between consecutive values
        for i in range(len(A)-1):
            d+=abs(A[i+1][1]-A[i][1])
        return (A[0][0],d)
    else:
        return None
A.map(variation).collect()