Machine Learning with Apache Spark

Favio Vázquez

Cosmologist & Data Scientist

@faviovaz

June 2, 2017

(And something more...)


About me

  • Venezuelan
  • Physicist and Computer Engineer
  • Master's student at PCF-UNAM
  • Data Scientist
  • Contributor to the Apache Spark project on GitHub (releases 1.3.0, 1.4.0, 1.4.1, and 1.5.0)

Outline

Machine Learning

Apache Spark

TensorFlow

Dask

MLeap

Demo (ML + Spark)

DVC

Machine Learning

Laying the foundations for Skynet

Machine Learning

Machine Learning is about software that learns from previous experience. Such a program improves its performance as more and more examples become available. The hope is that if you feed enough data into this machinery, it will learn patterns and produce intelligent results for new, unseen input.
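A minimal, hedged illustration of "learning from examples" (scikit-learn is used here purely for illustration; it is not part of this talk's stack):

from sklearn.linear_model import LogisticRegression

# Previous experience: inputs X with known answers y
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

model = LogisticRegression().fit(X, y)  # learn a pattern from the examples
print(model.predict([[2.5]]))           # a prediction for new input -> [1]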


What is Spark?

Apache Spark is a fast and general engine for large-scale data processing.

Unified Engine

High-level APIs with room for optimization (sketched below):

  • Express the whole workflow through a single API
  • Connect existing libraries and storage systems
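A minimal PySpark sketch of that single, unified API (the app name and toy data are illustrative; assumes Spark 2.x, current when this talk was given):

from pyspark.sql import SparkSession

# One entry point for SQL, DataFrames, streaming, and MLlib
spark = SparkSession.builder.appName("unified-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
df.filter(df.id > 1).show()  # the same API runs locally or on a cluster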

RDD

  • Transformations (lazy)
  • Actions (trigger execution)
  • Cache

Dataset

  • Typed
  • Scala & Java
  • RDD benefits

DataFrame

  • Dataset[Row]
  • Optimized
  • Versatile
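A hedged sketch contrasting the two APIs (reuses the spark session from the sketch above; the data is illustrative):

rdd = spark.sparkContext.parallelize(range(10))
doubled = rdd.map(lambda n: n * 2)  # transformation: builds a plan, runs nothing
doubled.cache()                     # keep the results in memory across actions
print(doubled.sum())                # action: triggers the computation -> 90

df = doubled.map(lambda n: (n,)).toDF(["value"])  # a DataFrame is a Dataset[Row]
df.agg({"value": "sum"}).show()     # this path goes through the query optimizer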

TensorFlow

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
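A toy data-flow graph, assuming the TensorFlow 1.x API that was current at the time of this talk:

import tensorflow as tf

# Nodes are operations; the edges between them carry tensors
a = tf.constant(3.0)
b = tf.constant(4.0)
c = a * b  # a multiply node wired to the two constants

with tf.Session() as sess:  # TF 1.x: graphs execute inside a session
    print(sess.run(c))      # -> 12.0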

Deep Learning

Dask

Dask is a flexible parallel computing library for analytic computing.

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads (a small sketch follows this list).
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
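A minimal sketch of the task-scheduling side (dask.delayed; the function names are illustrative):

from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

total = add(inc(1), inc(2))  # builds a task graph; nothing has run yet
print(total.compute())       # a scheduler executes the graph -> 5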

Components

import numpy as np

%%time
# NumPy is eager: the full 20000 x 20000 float64 array lives in RAM
x = np.random.normal(10, 0.1, size=(20000, 20000))
y = x.mean(axis=0)[::100]
y

CPU times: user 19.6 s, sys: 160 ms, total: 19.8 s
Wall time: 19.7 s

import dask.array as da

%%time
# Dask is lazy: the same array is split into 1000 x 1000 chunks and the
# means are computed chunk by chunk, in parallel, when compute() is called
x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))
y = x.mean(axis=0)[::100]
y.compute()

CPU times: user 29.4 s, sys: 1.07 s, total: 30.5 s
Wall time: 4.01 s

Memory footprint: the NumPy run holds the whole array in RAM (> 4 GB), while the Dask run only needs a few chunks at a time (MBs).

MLeap

MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, scikit-learn, and TensorFlow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark, scikit-learn, TensorFlow graphs, or an MLeap pipeline for use in a scoring engine (API servers).
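A hedged sketch of exporting a fitted Spark pipeline to an MLeap Bundle. The imports and the serializeToBundle call follow MLeap's PySpark support as documented around this period, so treat the exact names as assumptions:

import mleap.pyspark  # noqa: F401
from mleap.pyspark.spark_support import SimpleSparkSerializer  # patches PipelineModel
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame([("a",), ("b",), ("a",)], ["label"])
model = Pipeline(stages=[StringIndexer(inputCol="label", outputCol="idx")]).fit(df)

# Write the pipeline out as a bundle that Spark, scikit-learn, or an
# MLeap scoring server can load back
model.serializeToBundle("jar:file:/tmp/pipeline.zip", model.transform(df))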

Data Version Control

DVC makes your data science projects reproducible by automatically building a data dependency graph (DAG). It derives the DAG transparently, with no extra annotations from the user.
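A hedged sketch of the workflow using the 2017-era DVC command line (train.py and model.pkl are hypothetical, and later DVC versions changed this interface):

dvc init                 # start tracking the current Git repository with DVC
dvc run python train.py  # run a step; DVC records it as a node in the DAG
dvc repro model.pkl      # rebuild a target, re-running only what changed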

 

Data Version Control

Share your code and data separately, all from a single DVC environment.

DEMO

Questions?

Favio Vázquez

Cosmologist & Data Scientist

@faviovaz
