plain RDDS
Lines=sc.parallelize(
  ['you are my sunshine',
   'my only sunshine',
   'you make me happy'])
## Initialize RDD
A=sc.parallelize(range(4))
## map
B=A.map(lambda x: x-2)
B.collect()
# output:
[-2,-1,0,1]
## Initialize RDD
A=sc.parallelize(range(4))
## filter
B=A.filter(lambda x: x%2==0)
B.collect()
# output:
[0,2]
sc.parallelize(range(4)).collect()
# output:
[0,1,2,3]
sc.parallelize(range(4)).count()
# output:
4
## Initialize RDD
A=sc.parallelize(range(4))
## reduce
A.reduce(lambda x,y: x+y)
# output:
6
(key, value) RDDs
car_count=sc.parallelize(
  [('honda',3),
   ('subaru',2),
   ('honda',2)])
database=sc.parallelize(
  [(55632,{'name':'yoav','city':'jerusalem'}),
   (3342,{'name':'homer','town':'fairview'})])
Each element of the RDD is a pair (key,value):
## Initialize pair RDD
A=sc.parallelize(range(4))\
.map(lambda x: (x,x*x))
## output
A.collect()
# output:
[(0,0),(1,1),(2,4),(3,9)]
A=sc.parallelize(\
  [(1,3),(4,100),(1,-5),(3,2)])
A.reduceByKey(lambda x,y: x*y)\
 .collect()
# output:
[(1,-15),(4,100),(3,2)]
reduceByKey : performs a reduce separately on the values of each key. Note: it is a transformation (it returns an RDD), not an action.
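To see the difference (a sketch added for illustration, using the same data as above): reduce is an action and immediately returns a value to the driver, while reduceByKey returns a new RDD and computes nothing until an action is called.
A=sc.parallelize(\
  [(1,3),(4,100),(1,-5),(3,2)])
A.values().reduce(lambda x,y: x+y)  # action: returns the number 100
B=A.reduceByKey(lambda x,y: x*y)    # transformation: returns an RDD, no work yet
B.count()                           # an action triggers the actual computation
# output:
3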
A detour: iterators
In C, a loop generates each index on the fly; the full range is never stored in memory:
for(i=0; i<1000000000; i++)
  { /* do something */ }
In Python 2, range builds the entire list first, while xrange yields one value at a time:
## Waste of memory space:
for i in range(1000000000):
    # do something
## No waste:
for i in xrange(1000000000):
    # do something
(In Python 3, range is already lazy, like xrange.)
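As a concrete Python illustration (added here; not part of the original example), a generator is an iterator that produces values on demand:
def squares(n):
    i=0
    while i<n:
        yield i*i  # one value at a time; the full sequence is never stored
        i+=1
for s in squares(1000000000):
    pass  # do something with s; memory use stays constant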
A=sc.parallelize(\
  [(1,3),(3,100),(1,-5),(3,2)])
A.groupByKey()\
 .map(lambda pair: (pair[0],[x for x in pair[1]]))\
 .collect()
# output:
[(1,[3,-5]),(3,[100,2])]
groupByKey : returns a (key,<iterator>) pair for each
key value. The iterator iterates over the values corresponding to the key.
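An equivalent and arguably more idiomatic way to materialize the iterators is PySpark's mapValues (a sketch, using the same A as above):
A.groupByKey()\
 .mapValues(list)\
 .collect()
# output (key order may vary):
[(1,[3,-5]),(3,[100,2])]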
A=sc.parallelize(\
[(1,3),(3,100),(1,-5),(3,2)])
A.countByKey()
# output:
{1:2, 3:2}
countByKey : returns a Python dictionary with the number of pairs for each key. Note: it is an action; the result is returned to the driver.
A=sc.parallelize(\
[(1,3),(3,100),(1,-5),(3,2)])
A.lookup(3)
# output:
[100,2]
lookup(key) : returns the list of all of the values associated with key.
A=sc.parallelize(\
[(1,3),(3,100),(1,-5),(3,2)])
A.collectAsMap()
# output:
{1: -5, 3: 2}
collectAsMap() : like collect(), but instead of returning a list of tuples it returns a Map (= dictionary). Note: if a key appears more than once, only one of its values is kept.
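To keep all of the values per key, combine groupByKey with collectAsMap (a sketch, same A as above):
A.groupByKey().mapValues(list).collectAsMap()
# output:
{1:[3,-5], 3:[100,2]}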
A=sc.parallelize(range(1000000))\
 .map(lambda x: (x,x))  # make it a (key,value) RDD
#select 10% of the entries
B=A.filter(lambda pair: pair[0]%10==0)
# get no. of partitions
n=B.getNumPartitions()
# Repartition using fewer partitions (at least 1).
C=B.partitionBy(max(1,n//10))
partitionBy(numPartitions[, partitionFunc]) : initiates a shuffle to distribute the keys to partitions according to the partitioner. Defined only for (key,value) RDDs.
glom() : returns an RDD with one array per partition.
Allows the worker to access all of the data in its partition.
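A small sketch of an explicit partitioner (the data and the partitionFunc here are illustrative assumptions):
pairs=sc.parallelize([(i,i*i) for i in range(8)])
## send even keys to partition 0, odd keys to partition 1
split=pairs.partitionBy(2, partitionFunc=lambda k: k%2)
split.glom().collect()
# output (order within each partition may vary):
[[(0,0),(2,4),(4,16),(6,36)],[(1,1),(3,9),(5,25),(7,49)]]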
## One partition per key
A=sc.parallelize(range(1000000))\
 .map(lambda x: (x*x%100,x))\
 .partitionBy(100)\
 .glom()
def variation(A):
    # A is the list of (key,value) pairs in one partition;
    # all pairs in a partition share the same key.
    d=0
    if len(A)>1:
        # sum the absolute differences between consecutive values
        for i in range(len(A)-1):
            d+=abs(A[i+1][1]-A[i][1])
        return (A[0][0],d)
    else:
        return None
A.map(variation).collect()