an introduction to
distributed computing
using Spark and Dask
Israel Saeta Pérez
Adrián Pino Alcalde
June 2016
slides.com/israelsaetaperez/distributed-computing-spark-dask
What will we cover?
- What is distributed computing
- What is Apache Spark
- Main concepts of Spark architecture
- Drivers and workers
- RDDs and DAGs
- Transformations vs. Actions
- Dataframes
- Dask:
- Lazy operations and the DAG
- Dataframes
- Caching
- If time permits: example Analyzing Expedia dataset
with Spark & Dask
What is distributed computing?
Process a long list of (similar) tasks
- Simple
- Slow if lots of tasks
- More complex
- Needs scheduling & orchestration
- Faster only if maaany tasks
Building blocks of distributed computing
- Computing nodes: Threads, processes, machines, executors...
- Distributed-friendly task definition language
- Task partitioning
- Aggregation logic
- Scheduler: Assigns tasks to nodes
- Message Passing channel and protocol
Single machine easy distributed computing
-
Pipelining commands in bash (W10 too!) zcat train.csv.gz | cut -d"," -f1 | grep 2014-12
- Some library ops like numpy.dot(A,B) can use parallel algebra implementations (OpenBLAS/MKL)
-
scikit-learn 'n_jobs' for Grid Search, Random Forest, Cross-Validation...
- Python and R packages for multi-processing and multi-threading
- Need parallelization framework for more complex tasks!
But what if computing needs start growing...
DeepMind
Answer to the Ultimate Question of Life, The Universe, and Everything
AWS EC2 X1
2 TB RAM, 128 vCPU
$4,000/hour
But what if computing needs start growing...
Raspberry Pi cluster
ContinuumIO Dask
Hadoop cluster
HDFS + MapReduce, 2006
- Framework for cluster computing
- Started around 2009 at Berkeley
- Top level Apache project 2014
- Last version 1.6.1 March 2016
- Written mainly in Scala
- Interfaces for Scala, Java, Python and R
- Can run standalone (EC2 scripts available) or Mesos/YARN
- Cool UI for cluster and tasks status
- Emiliano says it's the future
- Correlates with higher salaries
When should you use Spark?
- Client already has a Hadoop/Spark cluster
- You have to process 100s or 1000s of GBs
- You have the $$$, time and knowledge to build a cluster of machines
- You want to use Spark MLLib parallel algorithms, or Spark Streaming
- For anything else... use GNU Utils or Dask!
How is it different from plain Hadoop MapReduce?
- Easy interactive lazy task definition using RDDs
- LRU data caching - but Spark is NOT in-memory!
- Efficient pipelining via DAGs
- Result: Better suited for iterative algorithms and data exploration
- BUT: It needs Hadoop HDFS and libraries for distributed data access (data locality)
Spark cluster architecture
Image credit: Alexey Grishchenko
- Entry point
- Task definition
- Scheduler
- WebUI monitor
- Work, work work...
- Data I/O
Easy start
- spark.apache.org/downloads.html
- Get last pre-built version
- Launch Jupyter notebook
IPYTHON_OPTS="notebook" /path/to/spark/bin/pyspark --master local[nthreads]
Driver Web UI
port 4040 by default
Resilient Distributed Datasets
- Representation of unit of data
- List of elements
- Follow map-reduce paradigm
- (Just) metadata:
- Where data comes from
- How is it partitioned
- Computations to be performed (tasks definition)
Map-Reduce paradigm
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
RDDs transformations & actions
Basic transformations
-
map: Apply an arbitrary function to every element of the RDD. Example:
-
filter: Keep only elements satisfying the specified condition. Example:
-
reduceByKey: Aggregate values by key. Example:
-
join: Return an RDD containing all pairs of elements with matching keys in self and other. Example:
>>> rdd = sc.parallelize(["b", "a", "c"])
>>> sorted(rdd.map(lambda x: (x, 1)).collect())
[('a', 1), ('b', 1), ('c', 1)]
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(lambda x, y: x + y).collect())
[('a', 2), ('b', 1)]
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2), ("a", 3)])
>>> sorted(x.join(y).collect())
[('a', (1, 2)), ('a', (1, 3))]
Basic actions
- collect: Take all elements. Can destroy computers.
-
reduce: Aggregate the elements of the RDD using a commutative and associative function that takes two arguments and returns one. Example:
-
take: Take the first num elements of the RDD. Similar to df.head() Example:
- takeOrdered: Get the N elements from a RDD ordered in ascending order or as specified by the optional key function. Example:
>>> sc.parallelize([1, 2, 3, 4, 5]).reduce(lambda x, y: x + y)
15
>>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
[2, 3]
>>> sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7], 2).takeOrdered(6, key=lambda x: -x)
[10, 9, 7, 6, 5, 4]
Directed Acyclic Graphs
Image credit: Alexey Grishchenko
Stages: Parallel tasks grouped
Disk shuffle
Caching
- Data is not cached by default
- Least Recent Used approach: If cache mem is full, drop oldest blocks
- Several storage levels: Cache to memory and/or disk
- .cache() == .persist(MEMORY_ONLY)
>>> lines = sc.textFile('expedia.csv')
>>> lines.cache() # transformation, doesn't compute anything
>>> lines.count() # performs expensive count and caches 'lines'
37670293
>>> lines.count() # faster b/c 'lines' is in RAM
37670293
Spark Dataframes
- Released Spark 1.3, Feb 2015
- Mimics pandas dataframes
- No more map-reduce crap, better for DS
- Columnar format
- Can run SQL Queries! (better in Spark 2.0)
- Great speedup for native ops b/c no Python serialization
young = users[users.age < 21] # filter users by age
young.select(young.name, young.age + 1) # increment everybody’s age by 1
young.groupBy("gender").count() # return ppl count for each gender
Spark MLLib
Parallelized Large ScaleMachine Learning
- Logistic Regression, with SGD and LBGFS
- Ridge, Lasso, Linear Regression
- SVM, Naïve Bayes
- Random Forest, GBTrees
- Alternating Least Squares (recommendations)
- See also: spark-sklearn, sparkit-learn
Dask
Pandas meets distributed computing
What is Dask
It's a parallel computing library for analytics in python.
- Performs operations from disk, so fits in memory becomes fits in disk
- Very easy to set up (you probably have it installed already) and use it (really)
- It's a python library, not an interface to Scala, Java or any weird thing.
- Scales from your laptop to clusters.
What is Dask
Three main core elements:
-
Dask array: distributed numpy arrays
-
Dask bag: to work with arbitrary collections of data (equivalent to RDD in Spark)
-
Dask dataframe: distributed pandas dataframes
What does under the hood
- Operations in Dask are performed lazily
- When you define a computation, dask elaborates the Direct Acyclic Graph (DAG) of the tasks required to complete it.
- Example with dask arrays:
import dask.array as da
x = da.arange(1e7, chunks = 3e6)
res = x.dot(x) - da.var(x)
- We do this, and nothing is calculated (lazy evaluation)
What did just happened?
To see what dask did, we call the method visualize:
res.visualize()
How to get results?
To trigger computation of a graph of tasks we call the method compute:
res.compute()
We can chain different operations, and evaluate them at the end, without having to have them in RAM
Multicore madness
Take for instance:
res = da.arange(1e9, chunks = 3e6)
res = x.dot(x) - da.var(x)
%timeit res.compute()
1 loop, best of 3: 6.96 s per loop
Multicore!
Dask dataframes
- On-disk parallel equivalent to pandas famous library for data analysis.
- Very easy to use, (really):
import dask.dataframe as dd
df = dd.read_csv('filename.csv')
- Lazy operations again, this does not load any data to disk, but sets the partitions:
Dask dataframes
- On-disk parallel equivalent to pandas famous library for data analysis.
- Very easy to use, (really):
import dask.dataframe as dd
df = dd.read_csv('filename.csv')
result = df.groupby([list_columns]).var.mean()
result.compute()
- Lazy operations again, this does not load any data to disk, but sets the partitions:
Dask dataframes
Dask dataframes
If you're confortable using pandas, you'll find like at home.
Pandas and the GIL
- In CPython's implementation of Python, native python code can't run into multiple threads simultaneously (safety reasons). It's called the Global Interpreter Lock (GIL).
- Still, if python interpreter runs functions written in external libraries (C/Fortran) can release the GIL.
- Most of pandas methods are written in C (Cython).
- Dask splits dataframe operations into different chunks and launch them in different threads achieving parallelism.
I/O operations
Load data from a single and multiple files using globstrings:
df1 = dd.read_csv('file.csv')
df2 = dd.read_csv('file_*.csv')
df3 = df2.to_csv('file_output.csv')
Read and write to hdf files.
dd.read_hdf('file_input.hdf5', '/data')
dd.to_hdf('file_output.hdf5', key='data')
Integration with new generation of compressed/columnar storage (castra, bcolz)
Indices
Partitioning of dataframes is determined by a column of the dataframe, its index.
df.divisions
df_new = df.set_index(df.column)
Doing this will reshufle the data, but subsequent operations involving this index will be faster.
df.set_index(df.column, compute=False).to_castra('df.castra')
df.groupby('column').apply(foo)
Dask's own columnar format, castra, stores data in columns, compressed and partitioned on the index.
Caching
As with Spark, dask support caching for faster repetitive computations, but it works differntly.
LRU may not be the best for analytic computations. Instead, we can be more opportunistic and keep:
- Expensive to compute
- Cheap to store
- Frequently used
np.std(x) # small result, costly to recompute
np.transpose(x) # big result, cheap to recompute
from dask.cache import Cache
cache = Cache(cache=1e9)
cache.register()
df.column.apply(whatever).compute()
(df.column.apply(whatever) + 1).compute() # this call will be fast
Summary
When to use dask:
- Doing exploratory analysis on larger-than-memory datasets
- Working with multiple files at the same time.
- Applying embarrasingly parallel tasks
When not to use dask :
- When your operations require shuffling (sorting, merges, etc.)
- Simple operations with fast on th command line: sorts, deduplicating files, subselecting cols, etc.
Remember: GNU Coreutils is your friend
References
Spark:
- Spark docs (obviously)
- Data Science and Engineering with Spark XSeries
- Alexey Grishchenko blog
- Mastering Apache Spark gitbook (incomplete but deep)
Dask