Machine Learning with Apache Spark
Favio Vázquez
Cosmologist & Data Scientist
@faviovaz
June 2, 2017
(And something more...)
About me
- Venezuelan
- Physicist and Computer Engineer
- Master's student at PCF-UNAM
- Data Scientist
- Contributor to the Apache Spark project on GitHub
Releases 1.3.0, 1.4.0, 1.4.1 and 1.5.0
Outline
Machine Learning
Apache Spark
TensorFlow
Dask
MLeap
Demo (ML + Spark)
DVC
Machine Learning
Laying the foundations for Skynet
Machine Learning
Machine Learning is about software that learns from previous experiences. Such a computer program improves performance as more and more examples are available. The hope is that if you throw enough data at this machinery, it will learn patterns and produce intelligent results for newly fed input.
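A minimal sketch of "improving as more examples are available". It assumes a toy 1-nearest-neighbour learner and a noise-free target y = x^2, both chosen here purely for illustration and not taken from the talk. The average prediction error shrinks as the number of training examples grows.

```python
def target(x):
    # Hypothetical "true" relationship the learner has to discover.
    return x * x

def make_examples(n):
    # n evenly spaced (input, label) pairs on [0, 1].
    return [(i / (n - 1), target(i / (n - 1))) for i in range(n)]

def predict(examples, x):
    # 1-nearest-neighbour: answer with the label of the closest known input.
    nearest = min(examples, key=lambda ex: abs(ex[0] - x))
    return nearest[1]

def mean_abs_error(examples, n_test=101):
    # Average error over inputs spread across [0, 1], most of them unseen.
    points = [i / (n_test - 1) for i in range(n_test)]
    return sum(abs(predict(examples, x) - target(x)) for x in points) / n_test

# More examples, same learner: the error goes down.
errors = {n: mean_abs_error(make_examples(n)) for n in (3, 17, 129)}
for n, err in errors.items():
    print(f"{n:>4} examples -> mean abs error {err:.4f}")
```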
What is Spark?
Apache Spark is a fast and general engine for large-scale data processing.
Unified Engine
High-level APIs with room for optimization
- Expresses the entire workflow with a single API
- Connects existing libraries and storage systems
RDD
Transformations
Actions
Cache
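The three RDD ideas above can be sketched in plain Python. This is a toy stand-in, not Spark: the class LazyCollection is hypothetical, and only the method names (map, filter, cache, collect, count) are borrowed from the RDD vocabulary. Transformations only record work, actions trigger it, and a cached result is reused by later actions.

```python
class LazyCollection:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, data, steps=()):
        self._data = data            # re-iterable source data
        self._steps = steps          # recorded transformations, not yet run
        self._cache_enabled = False
        self._cached = None
        self.runs = 0                # how many times the pipeline executed

    # Transformations: return a new collection, compute nothing.
    def map(self, fn):
        return LazyCollection(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyCollection(self._data, self._steps + (("filter", pred),))

    # Mark the result to be kept after the first action computes it.
    def cache(self):
        self._cache_enabled = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.runs += 1
        items = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        if self._cache_enabled:
            self._cached = items
        return items

    # Actions: force execution and return a result to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return len(self._compute())

calls = []

def square(x):
    calls.append(x)                  # lets us observe when work really happens
    return x * x

evens_squared = LazyCollection(range(10)).map(square).filter(lambda x: x % 2 == 0).cache()
print(len(calls))                    # 0: no action has been called yet
print(evens_squared.collect())       # [0, 4, 16, 36, 64]
print(evens_squared.count())         # 5, served from the cached result
```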
Dataset
Typed
Scala & Java
RDD Benefits
DataFrame
Dataset[Row]
Optimized
Versatile
TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
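To make "nodes are operations, edges carry tensors" concrete, here is a small data flow graph in plain Python. It is only an illustration of the idea and not TensorFlow's API: Node, constant, placeholder, add, matmul and run are hypothetical helpers, and nested lists stand in for tensors. The graph is built first and evaluated later with concrete input.

```python
class Node:
    """An operation in the graph; its inputs are the incoming edges."""

    def __init__(self, op, *inputs, name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

def constant(value):
    return Node(lambda: value)

def placeholder(name):
    # Value is supplied only when the graph is run.
    return Node(None, name=name)

def add(a, b):
    # Element-wise addition of two equally shaped 2-D lists.
    return Node(lambda x, y: [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)], a, b)

def matmul(a, b):
    # Plain matrix product of two 2-D lists.
    return Node(
        lambda x, y: [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x],
        a, b,
    )

def run(node, feed):
    # Evaluate a node by first evaluating everything that flows into it.
    if node.op is None:
        return feed[node.name]
    return node.op(*(run(i, feed) for i in node.inputs))

# Build the graph y = x @ w + b once ...
x = placeholder("x")
w = constant([[2.0], [3.0]])
b = constant([[1.0]])
y = add(matmul(x, w), b)

# ... then run it with different inputs.
print(run(y, {"x": [[1.0, 1.0]]}))   # [[6.0]]
print(run(y, {"x": [[2.0, 0.0]]}))   # [[5.0]]
```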
Deep Learning
Dask is a flexible parallel computing library for analytic computing.
- Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
- “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
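A rough sketch of what a task graph looks like, in the spirit of the dictionary-based graphs Dask documents: keys name results, and a tuple whose first element is a callable is a task. The get function below is a hypothetical, single-threaded stand-in written for this example. It is not Dask's scheduler, which also decides ordering and parallelism.

```python
from operator import add, mul

def get(graph, key, done=None):
    """Compute graph[key], resolving dependencies first and reusing results."""
    done = {} if done is None else done
    if key in done:
        return done[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # String arguments that name another key are dependencies.
        resolved = [
            get(graph, a, done) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task                # plain data, nothing to execute
    done[key] = result
    return result

graph = {
    "a": 1,
    "b": 2,
    "sum": (add, "a", "b"),
    "double": (mul, "sum", 2),
    "total": (add, "sum", "double"),
}

print(get(graph, "total"))           # 9
```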
Components
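import numpy as np
%%time
# Materialize the full 20000 x 20000 array in memory,
# then keep every 100th column mean (200 values).
x = np.random.normal(10, 0.1, size=(20000, 20000))
y = x.mean(axis=0)[::100]
y
CPU times: user 19.6 s, sys: 160 ms, total: 19.8 s
Wall time: 19.7 s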
import dask.array as da
%%time
# Same computation, split into 1000 x 1000 chunks.
# Nothing is computed until compute() is called.
x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))
y = x.mean(axis=0)[::100]
y.compute()
CPU times: user 29.4 s, sys: 1.07 s, total: 30.5 s
Wall time: 4.01 s
Memory used: > 4 GB (NumPy) vs. MBs (Dask)
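Why the chunked version needs so little memory can be sketched without Dask. The plain-Python function below (hypothetical, with much smaller sizes than the slide) generates the data one block of rows at a time and keeps only running column sums, so only one chunk exists in memory at any moment. Dask's real implementation differs and also runs chunks in parallel.

```python
import random

def column_means_chunked(n_rows, n_cols, chunk_rows, seed=0):
    """Column means of a random matrix that is never fully materialized."""
    rng = random.Random(seed)
    sums = [0.0] * n_cols
    for start in range(0, n_rows, chunk_rows):
        rows = min(chunk_rows, n_rows - start)
        # Only this chunk is held in memory; it is discarded after the loop body.
        chunk = [[rng.gauss(10, 0.1) for _ in range(n_cols)] for _ in range(rows)]
        for row in chunk:
            for j, value in enumerate(row):
                sums[j] += value
    return [s / n_rows for s in sums]

means = column_means_chunked(n_rows=2000, n_cols=20, chunk_rows=100)
print(means[::10])                   # every 10th column mean, each close to 10
```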
MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn and TensorFlow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark, Scikit-learn, TensorFlow graphs, or an MLeap pipeline for use in a scoring engine (API Servers).
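The train, export, load and score cycle can be illustrated with a toy example. Everything here is an assumption made for illustration: the JSON layout, the stage names and the functions fit_pipeline and score are hypothetical and do not reflect MLeap's bundle format or API. The point is only that a fitted pipeline reduces to plain parameters that a separate scoring process can reload.

```python
import json

def fit_pipeline(xs, ys):
    """'Training': standardize x, then fit a least-squares line on the result."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    zs = [(x - mean) / std for x in xs]
    # zs has zero mean, so the intercept is simply the mean of ys.
    slope = sum(z * y for z, y in zip(zs, ys)) / sum(z * z for z in zs)
    intercept = sum(ys) / n
    return [
        {"op": "standard_scaler", "mean": mean, "std": std},
        {"op": "linear_regression", "slope": slope, "intercept": intercept},
    ]

def score(stages, x):
    """Scoring engine: needs only the serialized parameters, not the trainer."""
    for stage in stages:
        if stage["op"] == "standard_scaler":
            x = (x - stage["mean"]) / stage["std"]
        elif stage["op"] == "linear_regression":
            x = stage["slope"] * x + stage["intercept"]
    return x

# Training side: fit on data that follows y = 2x + 1, then export.
trained = fit_pipeline([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
bundle = json.dumps({"format": "toy-bundle", "stages": trained})

# Serving side: load the bundle and score a new input.
restored = json.loads(bundle)["stages"]
print(round(score(restored, 5.0), 6))
```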
Data Version Control
DVC makes your data science projects reproducible by automatically building a data dependency graph (DAG).
DVC derives the DAG (dependency graph) transparently to the user.
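A minimal sketch of how a dependency graph supports reproducibility. The file names, commands and the two helper functions are hypothetical, and the dependencies are written out by hand here, whereas DVC derives them for the user. Given the graph, one can work out the order in which stages must run and which outputs become stale when an input changes.

```python
# output file -> (command, dependencies); all names are made up for this example
stages = {
    "clean.csv": ("python clean.py", ["raw.csv", "clean.py"]),
    "features.csv": ("python featurize.py", ["clean.csv", "featurize.py"]),
    "model.pkl": ("python train.py", ["features.csv", "train.py"]),
}

def run_order(stages, target):
    """Outputs to build, dependencies first, in order to reproduce target."""
    order, seen = [], set()

    def visit(name):
        if name in seen or name not in stages:
            return                   # already handled, or a plain source file
        seen.add(name)
        for dep in stages[name][1]:
            visit(dep)
        order.append(name)

    visit(target)
    return order

def stale_after(stages, changed):
    """Outputs that must be rebuilt once the file 'changed' is modified."""
    stale = set()
    grew = True
    while grew:
        grew = False
        for out, (_, deps) in stages.items():
            if out not in stale and any(d == changed or d in stale for d in deps):
                stale.add(out)
                grew = True
    return stale

print(run_order(stages, "model.pkl"))
# ['clean.csv', 'features.csv', 'model.pkl']
print(sorted(stale_after(stages, "featurize.py")))
# ['features.csv', 'model.pkl']
```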
Data Version Control
Share your code and data separately from a single DVC environment.
DEMO
Questions?
Favio Vázquez
Cosmologist & Data Scientist
@faviovaz
Talk to be given on June 2 at the IA, UNAM