Machine Learning with Apache Spark
Favio Vázquez
Cosmologist & Data Scientist
@faviovaz
2 de junio de 2017
(And something more...)
Follow live presentation
Releases 1.3.0, 1.4.0, 1.4.1 y 1.5.0
Outline
Machine Learning
Apache Spark
TensorFlow
Dask
MLeap
Demo (ML + Spark)
DVC
Machine Learning is about software that learns from previous experiences. Such a computer program improves performance as more and more examples are available. The hope is that if you throw enough data at this machinery, it will learn patterns and produce intelligent results for newly fed input.
Is a fast and general engine for large-scale data processing.
Transformations
Actions
Caché
Tiped
Scala & Java
RDD Benefits
Dataset[Row]
Optimized
Versatile
TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
Dask is a flexible parallel computing library for analytic computing.
Components
import numpy as np
%%time
x = np.random.normal(10, 0.1, size=(20000, 20000))
y = x.mean(axis=0)[::100]
y
CPU times: user 19.6 s, sys: 160 ms, total: 19.8 s
Wall time: 19.7 s
import dask.array as da
%%time
x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))
y = x.mean(axis=0)[::100]
y.compute()
CPU times: user 29.4 s, sys: 1.07 s, total: 30.5 s
Wall time: 4.01 s
> 4 GBs
MBs
MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn and Tensorflow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark, Scikit-learn, TensorFlow graphs, or an MLeap pipeline for use in a scoring engine (API Servers).
DVC derives the DAG (dependency graph) transparently to the user.
Share your code and data separately from a single DVC environment.
Questions?
Favio Vázquez
Cosmologist & Data Scientist
@faviovaz