An introduction to
using Spark and Dask
Israel Saeta Pérez
Adrián Pino Alcalde
June 2016
slides.com/israelsaetaperez/distributed-computing-spark-dask
Process a long list of (similar) tasks
Pipelining commands in bash (Windows 10 too!)
zcat train.csv.gz | cut -d"," -f1 | grep 2014-12
scikit-learn 'n_jobs' for Grid Search, Random Forest, Cross-Validation...
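For example (a hedged sketch with the current scikit-learn API and made-up toy data; any estimator or CV helper that accepts n_jobs works the same way):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10000, n_features=20)  # toy data
clf = RandomForestClassifier(n_estimators=200)
scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)  # n_jobs=-1: run the CV folds on all cores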
DeepMind
Answer to the Ultimate Question of Life, The Universe, and Everything
AWS EC2 X1
2 TB RAM, 128 vCPU
$4,000/hour
Raspberry Pi cluster
ContinuumIO Dask
Hadoop cluster
HDFS + MapReduce, 2006
Image credit: Alexey Grishchenko
IPYTHON_OPTS="notebook" /path/to/spark/bin/pyspark --master local[nthreads]
Spark web UI on port 4040 by default
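Once the notebook is up, a SparkContext is already available as sc (a quick sanity check, not from the slides):
sc.master               # e.g. 'local[4]' if launched with --master local[4]
sc.defaultParallelism   # default number of partitions / parallel tasks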
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
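The flatMap step turns each line into several words; a quick doctest with toy data (not from the slides) to show the intermediate results:
>>> lines = sc.parallelize(["to be or", "not to be"])
>>> lines.flatMap(lambda line: line.split(" ")).collect()
['to', 'be', 'or', 'not', 'to', 'be']
>>> sorted(lines.flatMap(lambda line: line.split(" "))
...             .map(lambda word: (word, 1))
...             .reduceByKey(lambda a, b: a + b).collect())
[('be', 2), ('not', 1), ('or', 1), ('to', 2)]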
>>> rdd = sc.parallelize(["b", "a", "c"])
>>> sorted(rdd.map(lambda x: (x, 1)).collect())
[('a', 1), ('b', 1), ('c', 1)]
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(lambda x, y: x + y).collect())
[('a', 2), ('b', 1)]
join: return an RDD containing all pairs of elements with matching keys in self and other. Example:
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2), ("a", 3)])
>>> sorted(x.join(y).collect())
[('a', (1, 2)), ('a', (1, 3))]
>>> sc.parallelize([1, 2, 3, 4, 5]).reduce(lambda x, y: x + y)
15
>>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
[2, 3]
>>> sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7], 2).takeOrdered(6, key=lambda x: -x)
[10, 9, 7, 6, 5, 4]
Image credit: Alexey Grishchenko
Stages: parallel tasks grouped together
Disk shuffle: intermediate results are written to disk between stages
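A hedged way to see where a stage boundary falls is the RDD's toDebugString method (toy data, not from the slides):
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.toDebugString())  # prints the lineage; reduceByKey introduces a shuffle between stages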
>>> lines = sc.textFile('expedia.csv')
>>> lines.cache() # lazy, doesn't compute anything yet
>>> lines.count() # performs expensive count and caches 'lines'
37670293
>>> lines.count() # faster b/c 'lines' is in RAM
37670293
young = users[users.age < 21] # filter users by age
young.select(young.name, young.age + 1) # increment everybody’s age by 1
young.groupBy("gender").count() # return people count for each gender
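The users DataFrame above is assumed to exist already; with the Spark 1.x API it could be loaded, e.g., from a hypothetical JSON file (pyspark creates sqlContext for you):
users = sqlContext.read.json('users.json')  # 'users.json' is a made-up file name
users.printSchema()                         # inspect the inferred schema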
Parallelized Large-Scale Machine Learning
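In Spark this typically means MLlib; a minimal sketch with made-up toy data (not from the slides):
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# toy dataset: (label, feature vector) pairs, distributed as an RDD
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(0.0, [0.5, 1.5]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(1.0, [1.5, 0.5]),
])
model = LogisticRegressionWithLBFGS.train(data)  # training runs in parallel over the RDD
model.predict([1.0, 0.0])                        # -> 0 or 1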
Dask is a parallel computing library for analytics in Python.
Three core collections:
Dask array: distributed NumPy arrays
Dask bag: for working with arbitrary collections of data (the equivalent of Spark's RDDs); see the small example below
Dask dataframe: distributed pandas dataframes
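Dask arrays and dataframes are shown below; here is a small dask bag sketch with toy data (not from the slides):
import dask.bag as db

b = db.from_sequence(range(6), npartitions=2)  # arbitrary Python objects, split into 2 partitions
b.map(lambda x: x ** 2).filter(lambda x: x % 2 == 0).sum().compute()  # -> 20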
import dask.array as da
x = da.arange(1e7, chunks = 3e6)
res = x.dot(x) - da.var(x)
To see the task graph dask built, we call the visualize method:
res.visualize()
To trigger computation of the task graph, we call the compute method:
res.compute()
We can chain operations and evaluate them only at the end, without materialising intermediate results in RAM.
Take for instance:
x = da.arange(1e9, chunks = 3e6)
res = x.dot(x) - da.var(x)
%timeit res.compute()
1 loop, best of 3: 6.96 s per loop
Multicore!
import dask.dataframe as dd
df = dd.read_csv('filename.csv')
result = df.groupby(list_columns).var.mean() # 'var' here is a column of df
result.compute()
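A couple of convenience operations compute eagerly instead of returning a lazy result (based on dask's documented behaviour, not the slides):
df.head()  # computes only the first partition and returns a plain pandas DataFrame
len(df)    # computes the number of rows across all partitions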
If you're comfortable using pandas, you'll feel right at home.
Load data from a single file or from multiple files using globstrings:
df1 = dd.read_csv('file.csv')
df2 = dd.read_csv('file_*.csv')
df2.to_csv('file_output_*.csv') # writes one output file per partition
Read and write HDF5 files:
dd.read_hdf('file_input.hdf5', '/data')
df.to_hdf('file_output.hdf5', '/data')
Integration with a new generation of compressed/columnar storage formats (castra, bcolz)
Partitioning of a dataframe is determined by one of its columns: the index.
df.divisions
df_new = df.set_index(df.column)
Doing this will reshuffle the data, but subsequent operations involving this index will be faster.
df.set_index(df.column, compute=False).to_castra('df.castra')
df.groupby('column').apply(foo)
Dask's own columnar format, castra, stores data in columns, compressed and partitioned on the index.
As with Spark, dask supports caching for faster repeated computations, but it works differently.
LRU may not be the best policy for analytic computations. Instead, we can be more opportunistic and weigh the cost of recomputing a result against the memory it takes:
np.std(x)       # small result, costly to recompute -> worth keeping
np.transpose(x) # big result, cheap to recompute -> not worth keeping
from dask.cache import Cache
cache = Cache(cache=1e9) # up to 1 GB of opportunistic cache
cache.register()         # used by all following dask computations
df.column.apply(whatever).compute()
(df.column.apply(whatever) + 1).compute() # this call will be fast
When to use dask:
When not to use dask:
Remember: GNU Coreutils is your friend
Spark:
Dask: