Manipulating RDDs
Plain RDDs
Plain RDDs
- Plain RDDs are parallelized lists of elements
- Examples:
Lines=sc.parallelize(
    ['you are my sunshine',
     'my only sunshine',
     'you make me happy'])
A=sc.parallelize(range(4))
Three groups of commands
- Creation: RDD from files, databases, or data on driver node. (We will talk about those later)
- Transformations: RDD to RDD
- Actions: RDD to data on driver node, databases, files.
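A minimal sketch tying the three groups together (assuming sc is an already-created SparkContext, as in all examples in this deck):
A=sc.parallelize(range(4))   # Creation: data on the driver -> RDD
B=A.map(lambda x: x*10)      # Transformation: RDD -> RDD
B.collect()                  # Action: RDD -> list on the driver
# output:
[0,10,20,30]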
Plain RDD
Transformations
## Initialize RDD
A=sc.parallelize(range(4))
## map
B=A.map(lambda x: x-2)
B.collect()
# output:
[-2,-1,0,1]
## Initialize RDD
A=sc.parallelize(range(4))
## filter
B=A.filter(lambda x: x%2==0)
B.collect()
# output:
[0,2]
Plain RDD
Actions
sc.parallelize(range(4)).collect()
# output:
[0,1,2,3]
sc.parallelize(range(4)).count()
# output:
4
## Initialize RDD
A=sc.parallelize(range(4))
## reduce
A.reduce(lambda x,y: x+y)
# output:
6
Summary
- We described some operations on Plain RDDs
- Transformations: RDD to RDD
- Actions: RDD to a list on the driver node.
- Next: Operations on (key,value) RDDs
Manipulating RDDs
(key, value)
(key,value) RDDs
Each element of the RDD is a pair (key,value):
- Key: an identifier (example: SSN)
- Value: can be anything
- Examples:
car_count=sc.parallelize(
    [('honda',3),
     ('subaru',2),
     ('honda',2)])
database=sc.parallelize(
    [(55632,{'name':'yoav','city':'jerusalem'}),
     (3342,{'name':'homer','town':'fairview'})])
(key,value) RDDs
Transformations
## Initialize pair RDD
A=sc.parallelize(range(4))\
.map(lambda x: (x,x*x))
## collect
A.collect()
# output:
[(0,0),(1,1),(2,4),(3,9)]
A=sc.parallelize(\
[(1,3),(4,100),(1,-5),(3,2)])
A.reduceByKey(lambda x,y: x*y)\
 .collect()
# output:
[(1,-15),(4,100),(3,2)]
reduceByKey : performs reduce separately on the values associated with each
key. Note: this is a transformation, not an action.
A detour: iterators
- range creates a large list in memory and then iterates over it.
- xrange (in Python 2) is an iterator that generates the elements one by one; in Python 3, range itself behaves this way. Similar to C:
for(i=0; i<1000000000; i++)
    { /* do something */ }
## Waste of memory space:
for i in range(1000000000):
    pass  # do something
## No waste:
for i in xrange(1000000000):
    pass  # do something
A=sc.parallelize(
    [(1,3),(3,100),(1,-5),(3,2)])
A.groupByKey()\
 .map(lambda kv: (kv[0],[x for x in kv[1]]))\
 .collect()
# output:
[(1,[3,-5]),(3,[100,2])]
groupByKey : returns a (key,<iterator>) pair for each
key value. The iterator iterates over the values corresponding to the key.
(key,value) RDDs
Actions
A=sc.parallelize(\
[(1,3),(3,100),(1,-5),(3,2)])
A.countByKey()
# output:
{1:2, 3:2}
countByKey : returns a Python dictionary with the number of pairs for each key.
A=sc.parallelize(\
[(1,3),(3,100),(1,-5),(3,2)])
A.lookup(3)
# output:
[100,2]
lookup(key) : returns the list of all of the values associated with key
A=sc.parallelize(\
[(1,3),(3,100),(1,-5),(3,2)])
A.collectAsMap()
# output:
{1: -5, 3: 2}
collectAsMap() : like collect() but instead of returning a list of (key,value) tuples it returns a Map (=dictionary). Note: if a key appears more than once, only one of its values is kept.
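Because collectAsMap() keeps only one value per key, a sketch of how to recover the full list of values per key, by grouping first (groupByKey and mapValues are standard pair-RDD transformations):
A=sc.parallelize(
    [(1,3),(3,100),(1,-5),(3,2)])
A.groupByKey()\
 .mapValues(list)\
 .collectAsMap()
# output:
{1:[3,-5], 3:[100,2]}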
Summary
- We described some operations on (key,value) RDDs
- Indexing using keys provides a richer set of operations
- Next: Partitioners.
Partitioners
Partitions
- Each RDD is divided into partitions.
- One partition per worker (core)
- A single partition can be operated on as a regular Python list.
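A small sketch of how to inspect partitions (numSlices sets the number of partitions; glom() is described in more detail below):
A=sc.parallelize(range(8),numSlices=4)
A.getNumPartitions()
# output:
4
A.glom().collect()  # one list per partition
# output:
[[0,1],[2,3],[4,5],[6,7]]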
Partition imbalance
- After manipulations (such as filter()) some partitions can shrink to zero while others remain very large
- This means that future work is not balanced across the workers.
- Everyone waits for the most loaded worker while the others sit idle.
- If the RDD consists of (key,value) pairs, we can use a partitioner to redistribute the items among the workers, as sketched below.
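A sketch showing how such imbalance can be observed, counting the elements in each partition (glom() is introduced below):
A=sc.parallelize(range(1000),numSlices=4)
# keep only the small values: every survivor lives in the first partition
B=A.filter(lambda x: x<250)
B.glom().map(len).collect()
# output:
[250,0,0,0]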
Types of partitioners
- HashPartitioner(n): divides the keys into n pseudo-random groups by hashing them; each pair is assigned to a partition according to its key.
- RangePartitioner(n): each partition corresponds to a range of key values, so that each range contains approximately the same number of items (keys).
- Custom partitioner: define a function that maps key K to an integer I. With n partitions, the pair with key K is placed in partition I mod n.
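In PySpark the custom key-to-integer mapping is passed as the partitionFunc argument of partitionBy; a minimal sketch (the length-of-key mapping is purely illustrative):
A=sc.parallelize(
    [('honda',3),('subaru',2),('honda',2)])
## map key K to integer I=len(K);
## the pair is placed in partition I mod n (here n=2)
B=A.partitionBy(2,partitionFunc=lambda k: len(k))
B.glom().collect()
# output:
[[('subaru',2)],[('honda',3),('honda',2)]]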
Transformations
Involving partitioners
A=sc.parallelize(range(1000000))\
    .map(lambda x: (x,x))  # make it a (key,value) RDD
# select 10% of the entries
B=A.filter(lambda kv: kv[0]%10==0)
# get the number of partitions
n=B.getNumPartitions()
# repartition using 10x fewer partitions
C=B.partitionBy(max(1,n//10))
partitionBy(numPartitions[, partitionFunc]) : initiates a shuffle that distributes the pairs to partitions according to the
partitioner (in PySpark, a hash of the key by default).
glom() : returns an RDD with one list per partition.
Allows the worker to access all of the data in its partition as a regular Python list.
A=sc.parallelize(range(1000000))\
    .map(lambda x: (x*x%100,x))\
    .partitionBy(100)\
    .glom()  # one partition per key value

def variation(A):
    # A is the list of (key,value) pairs in one partition
    d=0
    if len(A)>1:
        for i in range(len(A)-1):
            d+=abs(A[i+1][1]-A[i][1])
        return (A[0][0],d)
    else:
        return None

A.map(variation).collect()
Summary
- Partitioners are an optimization tool
- Goal: distribute the data evenly across workers.
- This was an introduction to PySpark.
- For more, see the Spark programming guide:
- http://spark.apache.org/docs/latest/rdd-programming-guide.html
RDDs (Plain and key-value)
By Yoav Freund