an introduction to

distributed computing

using Spark and Dask


Israel Saeta Pérez

Adrián Pino Alcalde

June 2016

What will we cover?

  • What is distributed computing
  • What is Apache Spark
  • Main concepts of Spark architecture
    • Drivers and workers
    • RDDs and DAGs
    • Transformations vs. Actions
    • Dataframes
  • Dask:
    • Lazy operations and the DAG
    • Dataframes
    • Caching
  • If time permits: example Analyzing Expedia dataset
    with Spark & Dask

What is distributed computing?

Process a long list of (similar) tasks

  • Simple
  • Slow if lots of tasks
  • More complex
  • Needs scheduling & orchestration
  • Faster only if maaany tasks

Building blocks of distributed computing

  • Computing nodes: Threads, processes, machines, executors...
  • Distributed-friendly task definition language
    • Task partitioning
    • Aggregation logic
  • Scheduler: Assigns tasks to nodes
  • Message Passing channel and protocol

Single machine easy distributed computing

  • Pipelining commands in bash (W10 too!)
    zcat train.csv.gz  | cut -d"," -f1 | grep 2014-12
  • Some library ops like,B) can use parallel algebra implementations (OpenBLAS/MKL)
  • scikit-learn 'n_jobs' for Grid Search, Random Forest, Cross-Validation...

  • Python and R packages for multi-processing and multi-threading
  • Need parallelization framework for more complex tasks!

But what if computing needs start growing...


Answer to the Ultimate Question of Life, The Universe, and Everything


2 TB RAM, 128 vCPU


But what if computing needs start growing...

Raspberry Pi cluster

ContinuumIO Dask

Hadoop cluster

HDFS + MapReduce, 2006

  • Framework for cluster computing
  • Started around 2009 at Berkeley
  • Top level Apache project 2014
  • Last version 1.6.1 March 2016
  • Written mainly in Scala
  • Interfaces for Scala, Java, Python and R
  • Can run standalone (EC2 scripts available) or Mesos/YARN
  • Cool UI for cluster and tasks status
  • Emiliano says it's the future
  • Correlates with higher salaries

When should you use Spark?

  • Client already has a Hadoop/Spark cluster
  • You have to process 100s or 1000s of GBs
  • You have the $$$, time and knowledge to build a cluster of machines
  • You want to use Spark MLLib parallel algorithms, or Spark Streaming
  • For anything else... use GNU Utils or Dask!

How is it different from plain Hadoop MapReduce?

  • Easy interactive lazy task definition using RDDs
  • LRU data caching - but Spark is NOT in-memory!
  • Efficient pipelining via DAGs
  • Result: Better suited for iterative algorithms and data exploration
  • BUT: It needs Hadoop HDFS and libraries for distributed data access (data locality)

Spark cluster architecture

Image credit: Alexey Grishchenko

  • Entry point
  • Task definition
  • Scheduler
  • WebUI monitor
  • Work, work work...
  • Data I/O

Easy start

  • Get last pre-built version
  • Launch Jupyter notebook
IPYTHON_OPTS="notebook" /path/to/spark/bin/pyspark --master local[nthreads]

Driver Web UI

port 4040 by default

Resilient Distributed Datasets

  • Representation of unit of data
  • List of elements
  • Follow map-reduce paradigm
  • (Just) metadata:
    • Where data comes from
    • How is it partitioned
    • Computations to be performed (tasks definition)

Map-Reduce paradigm

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

RDDs transformations & actions

Basic transformations

  • map: Apply an arbitrary function to every element of the RDD. Example:

  • filter: Keep only elements satisfying the specified condition. Example:

  • reduceByKey: Aggregate values by key. Example:

  • join: Return an RDD containing all pairs of elements with matching keys in self and other. Example:

>>> rdd = sc.parallelize(["b", "a", "c"])
>>> sorted( x: (x, 1)).collect())
[('a', 1), ('b', 1), ('c', 1)]
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[2, 4]
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(lambda x, y: x + y).collect())
[('a', 2), ('b', 1)]
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2), ("a", 3)])
>>> sorted(x.join(y).collect())
[('a', (1, 2)), ('a', (1, 3))]

Basic actions

  • collect: Take all elements. Can destroy computers.
  • reduce: Aggregate the elements of the RDD using a commutative and associative function that takes two arguments and returns one. Example:

  • take: Take the first num elements of the RDD. Similar to df.head() Example:

  • takeOrdered: Get the N elements from a RDD ordered in ascending order or as specified by the optional key function. Example:
>>> sc.parallelize([1, 2, 3, 4, 5]).reduce(lambda x, y: x + y)
>>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
[2, 3]
>>> sc.parallelize([10, 1, 2, 9, 3, 4, 5, 6, 7], 2).takeOrdered(6, key=lambda x: -x)
[10, 9, 7, 6, 5, 4]

Directed Acyclic Graphs

Image credit: Alexey Grishchenko

Stages: Parallel tasks grouped

Disk shuffle


  • Data is not cached by default
  • Least Recent Used approach: If cache mem is full, drop oldest blocks
  • Several storage levels: Cache to memory and/or disk
  • .cache() == .persist(MEMORY_ONLY)
>>> lines = sc.textFile('expedia.csv')
>>> lines.cache()  # transformation, doesn't compute anything
>>> lines.count()  # performs expensive count and caches 'lines'
>>> lines.count()  # faster b/c 'lines' is in RAM

Spark Dataframes

  • Released Spark 1.3, Feb 2015
  • Mimics pandas dataframes
  • No more map-reduce crap, better for DS
  • Columnar format
  • Can run SQL Queries! (better in Spark 2.0)
  • Great speedup for native ops b/c no Python serialization
young = users[users.age < 21]  # filter users by age, young.age + 1)  # increment everybody’s age by 1
young.groupBy("gender").count()   # return ppl count for each gender

Spark MLLib

Parallelized Large ScaleMachine Learning

  • Logistic Regression, with SGD and LBGFS
  • Ridge, Lasso, Linear Regression
  • SVM, Naïve Bayes
  • Random Forest, GBTrees
  • Alternating Least Squares (recommendations)
  • See also: spark-sklearn, sparkit-learn


Pandas meets distributed computing

What is Dask

It's a parallel computing library for analytics in python.


  • Performs operations from disk, so fits in memory becomes fits in disk
  • Very easy to set up (you probably have it installed already) and use it (really)
  • It's a python library, not an interface to Scala, Java or any weird thing.
  • Scales from  your laptop to clusters.

What is Dask

Three main core elements:

  1. Dask array: distributed numpy arrays

  2. Dask bag: to work with arbitrary collections of data (equivalent to RDD in Spark)

  3. Dask dataframe: distributed pandas dataframes

What does under the hood

  • Operations in Dask are performed lazily
  • When you define a computation, dask elaborates the Direct Acyclic Graph (DAG) of the tasks required to complete it.
  • Example with dask arrays:


import dask.array as da
x = da.arange(1e7, chunks = 3e6)
res = - da.var(x)
  • We do this, and nothing is calculated (lazy evaluation)

What did just happened?

To see what dask did, we call the method visualize:


How to get results?

To trigger computation of a graph of tasks we call the method compute:


We can chain different operations, and evaluate them at the end, without having to have them in RAM

Multicore madness

Take for instance:

res = da.arange(1e9, chunks = 3e6)
res = - da.var(x)
%timeit res.compute()

1 loop, best of 3: 6.96 s per loop


Dask dataframes

  • On-disk parallel equivalent to pandas famous library for data analysis.
  • Very easy to use, (really):
import dask.dataframe as dd

df = dd.read_csv('filename.csv')
  • Lazy operations again, this does not load any data to disk, but sets the partitions:

Dask dataframes

  • On-disk parallel equivalent to pandas famous library for data analysis.
  • Very easy to use, (really):
import dask.dataframe as dd
df = dd.read_csv('filename.csv')
result = df.groupby([list_columns]).var.mean()

  • Lazy operations again, this does not load any data to disk, but sets the partitions:

Dask dataframes

Dask dataframes

If you're confortable using pandas, you'll find like at home.

Pandas and the GIL

  • In CPython's implementation of Python, native python code can't run into multiple threads simultaneously (safety reasons). It's called the Global Interpreter Lock (GIL).
  • Still, if python interpreter runs functions written in external libraries (C/Fortran) can release the GIL.
  • Most of pandas methods are written in C (Cython).
  • Dask splits dataframe operations into different chunks and launch them in different threads achieving parallelism.

I/O operations

Load data from a single and multiple files using globstrings:


df1 = dd.read_csv('file.csv')
df2 = dd.read_csv('file_*.csv')

df3 = df2.to_csv('file_output.csv')

Read and write to hdf files.

dd.read_hdf('file_input.hdf5', '/data')
dd.to_hdf('file_output.hdf5', key='data')

Integration with new generation of compressed/columnar storage (castra, bcolz)



Partitioning of dataframes is determined by a column of the dataframe, its index.


df_new = df.set_index(df.column)

Doing this will reshufle the data, but subsequent operations involving this index will be faster.

df.set_index(df.column, compute=False).to_castra('df.castra')


Dask's own columnar format, castra, stores data in columns, compressed and partitioned on the index.


As with Spark, dask support caching for faster repetitive computations, but it works differntly.

LRU may not be the best for analytic computations. Instead, we can be more opportunistic  and keep:

  • Expensive to compute
  • Cheap to store
  • Frequently used
np.std(x)        # small result, costly to recompute
np.transpose(x)  # big result, cheap to recompute

from dask.cache import Cache
cache = Cache(cache=1e9)

(df.column.apply(whatever) + 1).compute()    # this call will be fast


When to use dask:

  • Doing exploratory analysis on larger-than-memory datasets
  • Working with multiple files at the same time.
  • Applying embarrasingly parallel tasks

When not to use dask :

  • When your operations require  shuffling (sorting, merges, etc.)
  • Simple operations with fast on th command line: sorts, deduplicating files, subselecting cols, etc.

Remember: GNU Coreutils is your friend