Manipulating RDDs

Plain RDDs

Plain RDDs

  • A plain RDD is a parallelized list of elements
  • Examples:
Lines=sc.parallelize(
        ['you are my sunshine',
         'my only sunshine',
         'you make me happy'])

A=sc.parallelize(range(4))

Three groups of commands

  • Creation: RDD from files, databases, or data on the driver node. (We will discuss these later.)
  • Transformations: RDD to RDD
  • Actions: RDD to data on the driver node, databases, or files. A minimal end-to-end sketch follows.
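
A minimal sketch tying the three groups together (it assumes a SparkContext named sc, as in the other examples in these slides; the numbers are just an illustration):

## Creation: data on the driver node -> RDD
A=sc.parallelize(range(4))

## Transformation: RDD -> RDD (lazy, nothing is computed yet)
B=A.map(lambda x: x*10)

## Action: RDD -> data on the driver node (triggers the computation)
B.collect()
# output:
[0,10,20,30]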

Plain RDD

Transformations

## Initialize RDD
A=sc.parallelize(range(4))

## map
B=A.map(lambda x: x-2)
B.collect()

# output:
[-2,-1,0,1]
## Initialize RDD
A=sc.parallelize(range(4))

## filter
B=A.filter(lambda x: x%2==0)
B.collect()

# output:
[0,2]

Plain RDD

Actions

sc.parallelize(range(4)).collect()
# output:
[0,1,2,3]
sc.parallelize(range(4)).count()
# output:
4
## Initialize RDD
A=sc.parallelize(range(4))

## reduce
A.reduce(lambda x,y: x+y)

# output:
6

Summary

  • We described some operations on plain RDDs
  • Transformations: RDD to RDD
  • Actions: RDD to data on the driver node.
  • Next: Operations on (key,value) RDDs

 

 

Manipulating RDDs

(key, value)

car_count=sc.parallelize(
        [('honda',3),
         ('subaru',2),
         ('honda',2)])


database=sc.parallelize(
        [(55632,{'name':'yoav','city':'jerusalem'}),
         (3342,{'name':'homer','town':'fairview'})])


(key,value) RDDs

Each element of the RDD is a pair (key,value):

  • Key: an identifier (for example, a social security number)
  • Value: can be anything
  • Examples: the car_count and database RDDs above

(key,value) RDDs

Transformations

## Initialize pair RDD
A=sc.parallelize(range(4))\
    .map(lambda x: (x,x*x))

## output
A.collect()

# output:
[(0,0),(1,1),(2,4),(3,9)]
A=sc.parallelize(\
   [(1,3),(4,100),(1,-5),(3,2)])

A.reduceByKey(lambda x,y: x*y)\
 .collect()

# output:
[(1,-15),(4,100),(3,2)]

reduceByKey: performs a reduce separately on the set of values for each key. Note: this is a transformation, not an action.

A detour: iterators

  • range (in Python 2) creates a large list in memory and then iterates over it.
  • xrange is an iterator that generates the elements one by one. Similar to C:
for(i=0; i<1000000000; i++)
   {Do something}
  • In Python 3, xrange no longer exists; range itself is lazy and does not build the list.

## Waste of memory space:
for i in range(1000000000):
    pass  # do something

## No waste:
for i in xrange(1000000000):
    pass  # do something

A=sc.parallelize(\
   [(1,3),(3,100),(1,-5),(3,2)])

A.groupByKey()\
 .map(lambda pair: (pair[0],[x for x in pair[1]]))\
 .collect()
# output:
[(1,[3,-5]),(3,[100,2])]

groupByKey: returns a (key, <iterator>) pair for each distinct key; the iterator iterates over the values corresponding to that key.

(key,value) RDDs

Actions

A=sc.parallelize(\
   [(1,3),(3,100),(1,-5),(3,2)])

A.countByKey()
# output:
{1:2, 3:2}

countByKey: returns a Python dictionary with the number of pairs for each key.

A=sc.parallelize(\
   [(1,3),(3,100),(1,-5),(3,2)])

A.lookup(3)
# output:
[100,2]

lookup(key): returns a list of all the values associated with the given key.

A=sc.parallelize(\
   [(1,3),(3,100),(1,-5),(3,2)])

A.collectAsMap()
# output:
{1: -5, 3: 2}

collectAsMap(): like collect(), but instead of returning a list of tuples it returns a map (a Python dictionary). Note that if a key appears more than once, only one of its values is kept in the dictionary.
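
If the goal is a dictionary that keeps all of the values for each key (as in the groupByKey example above), one possible sketch is to group first and then collect; the order of the values inside each list is not guaranteed:

A=sc.parallelize(\
   [(1,3),(3,100),(1,-5),(3,2)])

A.groupByKey()\
 .mapValues(list)\
 .collectAsMap()
# output (value order may vary):
{1: [3, -5], 3: [100, 2]}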

Summary

  • We described some operations on (key,value) RDDs
  • Indexing by key provides a richer set of operations
  • Next: Partitioners.

 

 

Partitioners

Partitions

  • Each RDD is divided into partitions. 
    • Typically one partition per worker core.
    • A single partition can be operated on as a regular Python list (see the sketch below).
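
A small sketch of how to inspect the partitions of an RDD; the request for 4 partitions and the resulting even split are just an illustration:

## ask for 4 partitions explicitly
A=sc.parallelize(range(8),4)

A.getNumPartitions()
# output:
4

## glom() (described later) turns each partition into a list
A.glom().collect()
# output: one list per partition
[[0, 1], [2, 3], [4, 5], [6, 7]]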

Partition imbalance

  • After manipulations (such as filter()) some partitions can shrink to zero elements while others remain very large.
  • This means that future work is not balanced across the workers.
  • We end up waiting for the heavily loaded workers while the others sit idle.
  • If the RDD consists of (key,value) pairs, we can use a partitioner to redistribute the items among the workers (a sketch follows).
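
A sketch of how such an imbalance can be observed. After the filter below, all of the surviving elements sit in the first partition; glom() (described later) lets us count what is left in each partition. The data and the 4-way split are just an illustration:

A=sc.parallelize(range(100),4)
## keep only the small values; with this split they all live in partition 0
B=A.filter(lambda x: x<25)

B.glom().map(len).collect()
# output: number of elements per partition
[25, 0, 0, 0]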

Types of partitioners

  • HashPartitioner(n): assigns pairs to n partitions according to a hash of their keys, so the assignment of keys to partitions is essentially random, but all pairs with the same key land in the same partition.
  • RangePartitioner(n): each partition corresponds to a range of key values, chosen so that each range contains approximately the same number of items (keys).
  • Custom partitioner:
    define a function that maps a key K to an integer I.
    n = number of partitions.
    The pair with key K is placed in partition I mod n (see the sketch below).
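
In PySpark the custom partitioner is passed to partitionBy as a function that maps a key K to an integer I; the pair is then placed in partition I mod n. A minimal sketch, where region_of is only a hypothetical example of such a function:

def region_of(key):
    ## hypothetical rule mapping a key K to an integer I
    return key % 7

A=sc.parallelize([(k,k*k) for k in range(20)])
## n=4 partitions: the pair with key K goes to partition region_of(K) % 4
B=A.partitionBy(4,region_of)

B.glom().map(len).collect()
# output: number of pairs per partition (for these particular keys)
[6, 6, 5, 3]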

Transformations

Involving  partitioners

A=sc.parallelize(range(1000000))\
    .map(lambda x: (x,x))  # make it a (key,value) RDD
# select the 10% of the pairs whose key is divisible by 10
B=A.filter(lambda kv: kv[0]%10==0)
# get no. of partitions
n=B.getNumPartitions()
# Repartition using about 10x fewer partitions (RDDs are immutable, so keep the result)
B=B.partitionBy(max(1,n//10))

partitionBy(n [, partitionFunc]): initiates a shuffle that redistributes the (key,value) pairs into n partitions according to the partitioner (by default, a hash of the key).

glom(): returns an RDD with one list per partition.

This allows a worker to access all of the data in its partition at once.

## partitionBy(100): roughly one partition per key (keys are x*x%100)
A=sc.parallelize(range(1000000))\
    .map(lambda x: (x*x%100,x))\
    .partitionBy(100)\
    .glom()

def variation(P):
    ## P is the list of (key,value) pairs in one partition
    d=0
    if len(P)>1:
        for i in range(len(P)-1):
            d+=abs(P[i+1][1]-P[i][1])
        return (P[0][0],d)
    else:
        return None

## compute the variation of each partition on its own worker
A.map(variation).collect()

Summary

  • RDDs are split into partitions; after transformations such as filter() the partitions can become unbalanced.
  • Partitioners (hash, range, or custom) redistribute (key,value) pairs across partitions to rebalance the work.

RDDs (Plain and key-value)

By Yoav Freund
