parallel computing with

 

Israel Saeta Pérez

@dukebody

PyConES Almería Oct 2016

Overview

  • intro to analytic parallel computing
  • dask features
  • dask.distributed
  • links

DISCLAIMER

parallel computing

processing a long list of (similar) tasks

  • Simple
  • Slow if lots of tasks
  • More complex
  • Needs scheduling & orchestration
  • Scheduling overhead → faster?
  • Profits from modern multi-core

Building blocks of parallel computing

  • Computing nodes: Threads, processes, machines, executors...
  • Distributed-friendly task definition language
    • Task partitioning
    • Aggregation logic
  • Scheduler: Assigns tasks to nodes intelligently
  • Message Passing channel and protocol

What is Dask?

Dask is a Python library for parallel programming that leverages task scheduling for computational problems

Why dask?

very easy to set up and start using

pip install dask

  • No configuration
  • No daemons (by default)
  • No need to learn Java/Scala
  • Feels like a library
  • Just pure ol' Python!

amazing support

  • SO questions answered in minutes by main author (@mrocklin)
  • Very good documentation
  • Lots of didactic examples
  • Funding for full-time devs
  • pandas & scipy devs onboard
  • Great GH Pulse, 1000+ *s!

fast and responsive

  • Low overhead, low latency job scheduling
  • Interactive (Jupyter notebook, non-blocking)
  • Progressbar and diagnostics to help humans

fast and responsive

scales up and down

from a single computer to a cluster

under the hoods

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# task definition - simple dict!!!
dsk = {'a': 1, 
       'x': 10,

       'b': (inc, 'a'),       
       'y': (inc, 'x'),
       
       'z': (add, 'b', 'y')}

from dask.threaded import get as tget
tget(dsk, 'z')  # execute in multiple threads

from dask.multiprocessing import get as mget
mget(dsk, 'z')  # execute in multiple processes

familiar

to use

map-reduce no more

larger-than-memory

high-level blocked algorithms

 

dataframes split wrt index

dask.array

# numpy
import numpy as np
f = h5py.File('myfile.hdf5')
x = np.array(f['/small-data'])
x - x.mean(axis=1)                       

# dask
import dask.array as da
f = h5py.File('myfile.hdf5')
x = da.from_array(f['/big-data'],
                  chunks=(1000, 1000))  # partition big array
x - x.mean(axis=1).compute()

mimics existing known interfaces!

uses numpy arrays under the hood

dask.dataframe

 

# pandas
import pandas as pd                     
df = pd.read_csv('2015-01-01.csv')      
df.groupby(df.user_id).value.mean()     

# dask.dataframe
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

mimics existing known interfaces!

uses pandas dataframes under the hood

dask.bag

import dask.bag as db
b = db.read_text('lines_*.txt')
b.filter(lambda line: 'python' in line)

mimics collections, iterators, Spark RDDs

multiprocessing scheduler (GIL)

How does it work?

  • Computations defined in dataframes or arrays get translated to a Direct Acyclic Graph (DAG) with all the required tasks
  • Array example:
import dask.array as da
x = da.arange(1e9, chunks=3e6)
res = (x + 1).sum()

# res:
dask.array<sum-agg..., shape=(), dtype=int64, chunksize=()>

Nothing is calculated yet (lazy evaluation)

add 1

sum each chunk

add up partial sums

create blocks

res.visualize()

operations

results

Compute results

>>> res.compute()
500000000500000000

no GIL - multi-threading madness!

dask.distributed

for computer clusters

dask.distributed

  • Separate scheduler that can run on a cluster
  • Launches status server a la Spark, using Bokeh
  • From Medium Data to Big Data
  • Data locality
  • Can read from HDFS or S3
  • A bit harder to set up, understand and use

setup

# on every computer of the cluster
$ pip install distributed

# on main, scheduler node
$ dask-scheduler
Start scheduler at 192.168.0.1:8786


# on worker nodes (2 in this example)
$ dask-worker 192.168.0.1:8786
Start worker at:            192.168.0.2:12345
Registered with center at:  192.168.0.1:8786

$ dask-worker 192.168.0.1:8786
Start worker at:            192.168.0.3:12346
Registered with center at:  192.168.0.1:8786


# on local machine
$ python
>>> from distributed import Client
>>> client = Client('192.168.0.1:8786')
  • SSH
  • EC2 (script)
  • YARN/Mesos
  • Marathon
  • Kubernetes
  • ...

usage

from distributed import Client

# automatically registered as default scheduler
client = Client('192.168.0.1:8786')

import dask.array as da
x = da.arange(1e9, chunks = 3e6)
res = (x + 1).sum()

future = client.compute(res)  # returns immediately
# future:
<Future: status: pending, key: finalize-e8bdd...>

# a couple of seconds later...
<Future: status: finished, type: int64, key: finalize-e8bdd...>

future.result()  # would block until result is available
# out: 500000000500000000

joblib integration

import distributed.joblib
from sklearn.externals.joblib import parallel_backend
from sklearn.datasets import load_digits
from sklearn.grid_search import RandomizedSearchCV
from sklearn.svm import SVC
import numpy as np

digits = load_digits()

param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17)
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)

with parallel_backend('dask.distributed', scheduler_host='localhost:8786'):
    search.fit(digits.data, digits.target)

Dask vs Spark

  • "Just a library"
  • Pure Python
  • Good for single computer
  • Good for medium data
  • Builds on existing libraries
  • Easy to write complex algorithms, so other libraries using it! (dask-learn, xarray)
  • Whole framework
  • JVM, extra serialization
  • Aimed for large clusters
  • Aimed for Big Big Data
  • Replaces existing libraries
  • Hard to write complex algorithms

(thanks @eyadsibai)

links & examples

Thanks to all contributors!

  • mrocklin
  • cowlicks
  • jcrist
  • sinhrnks
  • cpcloud
  • shoyer
  • 60+!

Thanks to all supporters!

  • Continuum Analytics
  • XDATA Blaze proyect
  • The Moore Foundation

Thank you!

@dukebody

Parallel computing with Dask

By Israel Saeta Pérez

Parallel computing with Dask

Intro to Dask for Data Science

  • 7,173