parallel computing with Dask


Israel Saeta Pérez


PyConES Almería Oct 2016


  • intro to analytic parallel computing
  • dask features
  • dask.distributed
  • links


parallel computing

processing a long list of (similar) tasks

  • Sequential
    • Simple
    • Slow if lots of tasks
  • Parallel
    • More complex
    • Needs scheduling & orchestration
    • Scheduling overhead → faster?
    • Profits from modern multi-core CPUs

Building blocks of parallel computing

  • Computing nodes: Threads, processes, machines, executors...
  • Distributed-friendly task definition language
    • Task partitioning
    • Aggregation logic
  • Scheduler: Assigns tasks to nodes intelligently
  • Message Passing channel and protocol
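
The building blocks above can be sketched with the standard library alone. This is only an illustration, not how Dask is implemented: `partition` and `partial_sum` are hypothetical helpers, and `ThreadPoolExecutor` stands in for both the computing nodes and a very simple scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, size):
    """Task partitioning: split a list into chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def partial_sum(chunk):
    """The task run on each computing node (here: a thread)."""
    return sum(v * v for v in chunk)


data = list(range(1, 101))
chunks = partition(data, 25)

# The executor hands chunks to idle worker threads (a very simple scheduler).
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))

# Aggregation logic: combine the partial results into the final answer.
total = sum(partials)
```

For CPU-bound pure-Python tasks, swapping in `ProcessPoolExecutor` would sidestep the GIL at the cost of extra message passing between processes.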

What is Dask?

Dask is a Python library for parallel programming that leverages task scheduling for computational problems

Why dask?

very easy to set up and start using

pip install dask

  • No configuration
  • No daemons (by default)
  • No need to learn Java/Scala
  • Feels like a library
  • Just pure ol' Python!

amazing support

  • SO questions answered in minutes by main author (@mrocklin)
  • Very good documentation
  • Lots of didactic examples
  • Funding for full-time devs
  • pandas & scipy devs onboard
  • Active GitHub Pulse, 1000+ stars!

fast and responsive

  • Low overhead, low latency job scheduling
  • Interactive (Jupyter notebook, non-blocking)
  • Progressbar and diagnostics to help humans

scales up and down

from a single computer to a cluster

under the hood

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# task definition - simple dict!!!
dsk = {'a': 1, 
       'x': 10,

       'b': (inc, 'a'),       
       'y': (inc, 'x'),
       'z': (add, 'b', 'y')}

from dask.threaded import get as tget
tget(dsk, 'z')  # execute in multiple threads

from dask.multiprocessing import get as mget
mget(dsk, 'z')  # execute in multiple processes
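
To make the dict format less magical, here is a minimal sketch of how such a graph could be evaluated. `simple_get` is a hypothetical, single-threaded helper written for this example; it is not Dask's scheduler, and it assumes every task argument is a key of the graph.

```python
def inc(x):
    return x + 1


def add(x, y):
    return x + y


dsk = {'a': 1,
       'x': 10,
       'b': (inc, 'a'),
       'y': (inc, 'x'),
       'z': (add, 'b', 'y')}


def simple_get(graph, key, cache=None):
    """Hypothetical evaluator: resolve `key` by walking its dependencies."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *arg_keys = task
        # Resolve every dependency first, then call the function.
        args = [simple_get(graph, k, cache) for k in arg_keys]
        result = func(*args)
    else:
        result = task  # a literal value such as 1 or 10
    cache[key] = result
    return result


z = simple_get(dsk, 'z')  # add(inc(1), inc(10))
```

Note that 'b' and 'y' do not depend on each other, which is what lets a real scheduler run them in parallel.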


to use

map-reduce no more


high-level blocked algorithms


dataframes split wrt index


# numpy
import h5py
import numpy as np
f = h5py.File('myfile.hdf5')
x = np.array(f['/small-data'])
x - x.mean(axis=1)

# dask
import h5py
import dask.array as da
f = h5py.File('myfile.hdf5')
x = da.from_array(f['/big-data'],
                  chunks=(1000, 1000))  # partition big array
(x - x.mean(axis=1)).compute()  # compute the whole expression, not just the mean

mimics existing known interfaces!

uses numpy arrays under the hood
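
A runnable variant that needs no HDF5 file, as a sketch: a small in-memory array (an assumption for this example) stands in for the dataset, and `keepdims=True` is used so the subtraction also broadcasts for non-square arrays.

```python
import numpy as np
import dask.array as da

# Small in-memory stand-in for the HDF5 dataset (assumption for the example).
data = np.arange(24, dtype=float).reshape(4, 6)

x = da.from_array(data, chunks=(2, 3))  # 4 blocks of shape (2, 3)

# keepdims=True keeps the row means as a column so they broadcast per row.
centered = x - x.mean(axis=1, keepdims=True)

result = centered.compute()  # the lazy graph is only executed here
expected = data - data.mean(axis=1, keepdims=True)
```

The dask expression and the numpy expression are written the same way; only `from_array` and `compute()` differ.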



# pandas
import pandas as pd                     
df = pd.read_csv('2015-01-01.csv')      

# dask.dataframe
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')

mimics existing known interfaces!

uses pandas dataframes under the hood
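
Conceptually, each matched file becomes (at least) one pandas dataframe, and reductions are computed per partition and then combined. The sketch below imitates that idea with pandas only; the file contents and the `value` column are made up for the example, and this is not how `dd.read_csv` is implemented.

```python
import glob
import os
import tempfile

import pandas as pd

tmpdir = tempfile.mkdtemp()
# Two made-up daily files standing in for '2015-*-*.csv'.
for day, values in [('2015-01-01', [1, 2]), ('2015-01-02', [3, 4, 5])]:
    pd.DataFrame({'value': values}).to_csv(
        os.path.join(tmpdir, day + '.csv'), index=False)

# One pandas dataframe per file: these play the role of partitions.
paths = sorted(glob.glob(os.path.join(tmpdir, '2015-*-*.csv')))
partitions = [pd.read_csv(p) for p in paths]

# A reduction is computed per partition and then combined.
partial_sums = [part['value'].sum() for part in partitions]
total = sum(partial_sums)
```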


import dask.bag as db
b = db.read_text('lines_*.txt')
b.filter(lambda line: 'python' in line)

mimics collections, iterators, Spark RDDs

uses the multiprocessing scheduler by default (to work around the GIL)
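
A small runnable variant of the bag example, as a sketch: `from_sequence` replaces `read_text` so no files are needed, and the sample lines are made up. The `scheduler='synchronous'` argument is an assumption about the installed Dask version (recent releases accept it) and is used only to keep this tiny example in a single process.

```python
import dask.bag as db

lines = ['I like python', 'java too', 'python and dask', 'nothing here']

# from_sequence stands in for read_text so no files are needed.
b = db.from_sequence(lines, npartitions=2)
matches = b.filter(lambda line: 'python' in line)  # still lazy

# The synchronous scheduler keeps this small example in a single process.
result = matches.compute(scheduler='synchronous')
```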

How does it work?

  • Computations defined in dataframes or arrays get translated to a Directed Acyclic Graph (DAG) with all the required tasks
  • Array example:
import dask.array as da
x = da.arange(int(1e9), chunks=int(3e6))  # sizes as ints, not floats
res = (x + 1).sum()

# res:
dask.array<sum-agg..., shape=(), dtype=int64, chunksize=()>

Nothing is calculated yet (lazy evaluation)

Task graph: create blocks → add 1 → sum each chunk → add up partial sums
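
The four stages of the graph (create blocks, add 1, sum each chunk, add up partial sums) can be imitated in plain Python. This is a sketch of the blocked algorithm only: `blocked_sum_plus_one` is a hypothetical helper, it runs sequentially, and it uses a smaller `n` than the slide so it finishes quickly.

```python
def blocked_sum_plus_one(n, chunk_size):
    """Sum of (i + 1) for i in range(n), computed block by block."""
    # create blocks: each block is a range of at most chunk_size numbers
    blocks = [range(start, min(start + chunk_size, n))
              for start in range(0, n, chunk_size)]
    # add 1 and sum each chunk: one independent task per block
    partial_sums = [sum(i + 1 for i in block) for block in blocks]
    # add up partial sums: the final aggregation step
    return sum(partial_sums)


n = 1_000_000
res = blocked_sum_plus_one(n, chunk_size=300_000)
```

Because the per-block tasks are independent, a scheduler is free to run them on different threads, processes or machines.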




Compute results

>>> res.compute()

no GIL - multi-threading madness!


dask.distributed for computer clusters


  • Separate scheduler that can run on a cluster
  • Launches status server a la Spark, using Bokeh
  • From Medium Data to Big Data
  • Data locality
  • Can read from HDFS or S3
  • A bit harder to set up, understand and use


# on every computer of the cluster
$ pip install distributed

# on main, scheduler node
$ dask-scheduler
Start scheduler at

# on worker nodes (2 in this example)
$ dask-worker
Start worker at:  
Registered with center at:

$ dask-worker
Start worker at:  
Registered with center at:

# on local machine
$ python
>>> from distributed import Client
>>> client = Client('')
  • SSH
  • EC2 (script)
  • YARN/Mesos
  • Marathon
  • Kubernetes
  • ...


from distributed import Client

# automatically registered as default scheduler
client = Client('')

import dask.array as da
x = da.arange(int(1e9), chunks=int(3e6))  # sizes as ints, not floats
res = (x + 1).sum()

future = client.compute(res)  # returns immediately
# future:
<Future: status: pending, key: finalize-e8bdd...>

# a couple of seconds later...
<Future: status: finished, type: int64, key: finalize-e8bdd...>

future.result()  # would block until result is available
# out: 500000000500000000
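
The submit-now, collect-later pattern can be tried without a cluster. The sketch below uses the standard library's `concurrent.futures` as an analogy, not the `distributed` classes themselves; `slow_sum` is a made-up task for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_sum(n):
    time.sleep(0.2)  # pretend this is a long computation
    return sum(range(1, n + 1))


pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(slow_sum, 1000)  # returns immediately

pending = not future.done()  # most likely True: the task is still running
value = future.result()      # blocks until the result is available
finished = future.done()     # True once result() has returned
pool.shutdown()
```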

joblib integration

import distributed.joblib
from sklearn.externals.joblib import parallel_backend
from sklearn.datasets import load_digits
from sklearn.grid_search import RandomizedSearchCV
from sklearn.svm import SVC
import numpy as np

digits = load_digits()

param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17)
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)

# run the search on the dask.distributed cluster instead of local processes
with parallel_backend('dask.distributed', scheduler_host='localhost:8786'):
    search.fit(digits.data, digits.target)

Dask vs Spark

  • Dask
    • "Just a library"
    • Pure Python
    • Good for a single computer
    • Good for medium data
    • Builds on existing libraries
    • Easy to write complex algorithms, so other libraries build on it (dask-learn, xarray)
  • Spark
    • Whole framework
    • JVM, extra serialization
    • Aimed at large clusters
    • Aimed at Big Big Data
    • Replaces existing libraries
    • Hard to write complex algorithms

(thanks @eyadsibai)

links & examples

Thanks to all contributors!

  • mrocklin
  • cowlicks
  • jcrist
  • sinhrks
  • cpcloud
  • shoyer
  • 60+!

Thanks to all supporters!

  • Continuum Analytics
  • XDATA Blaze project
  • The Moore Foundation

Thank you!


Parallel computing with Dask

By Israel Saeta Pérez

Intro to Dask for Data Science