Distributed Deep Learning and Transfer Learning with Spark, Keras and DLS

Favio Vázquez

Data Scientist

@faviovaz

Webinars

Eric Feuilleaubois

PhD in Artificial Neural Networks

@Deep_In_Depth

Introduction to Transfer Learning with Convolutional Neural Networks

Deep Learner / Machine Learner

Curator of Deep_In_Depth - news feed on Deep Learning, Machine Learning and Data Science

Writer for Medium - Towards data science

Principle of Transfer Learning

Aim: Predict classes (labels) that have not been seen by the source (pre-trained) model

ImageNet Dataset

n07714571 - head cabbage
n07714990 - broccoli
n07715103 - cauliflower
n07716358 - zucchini, courgette
n07718472 - cucumber, cuke
n07718747 - artichoke, globe artichoke
n07720875 - bell pepper
n07730033 - cardoon
n07734744 - mushroom
n07742313 - Granny Smith
n07745940 - strawberry
n07747607 - orange
n07749582 - lemon
n07753113 - fig
n07753275 - pineapple, ananas
n07753592 - banana
n07768694 - pomegranate

1,000 image categories
Training - 1.2 million images
Validation & test - 150,000 images

Veg Dataset - 4124 records

Pumpkin

Tomato

Watermelon

Motivations for CNN TL

- Availability of open CNN models with top-class performance

CNN Transfer Learning

Proof by Features

- CNNs detect visual features (patterns) in images

- First layers learn "basic" features

- Last layers learn "advanced" features

- Top layers classify

Basic features are very similar from one CNN model to another --> No need to re-learn them, best to re-use them
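
The re-use idea above can be sketched in Keras roughly as follows. This is a minimal illustration, not the model built in the webinar: the function name `build_transfer_model`, the 224x224 input size, the 256-unit dense layer and the Adam optimizer are all assumptions chosen for the example. Only the 3 output classes (pumpkin, tomato, watermelon) come from the slides.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16


def build_transfer_model(num_classes=3, weights="imagenet"):
    """Frozen VGG16 feature extractor plus a small trainable classifier.

    Hypothetical helper: layer sizes and input shape are illustrative only.
    """
    # Convolutional part of VGG16 without its original ImageNet classifier.
    base = VGG16(weights=weights, include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # re-use the learned features, do not re-learn them

    # Inputs are expected to be preprocessed with
    # keras.applications.vgg16.preprocess_input before being fed in.
    inputs = keras.Input(shape=(224, 224, 3))
    x = base(inputs, training=False)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)  # assumed head size
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = keras.Model(inputs, outputs)
    # categorical_crossentropy assumes one-hot encoded labels.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_transfer_model()` downloads the ImageNet weights on first use. Only the two dense layers are trained, which is why a small dataset can be enough.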

DL model with pre-trained VGG16

Training results

Dataset	Train Acc. (%)	Validation Acc. (%)
300 - 7%	95	88
600 - 14%	96	89
900 - 21%	97	90
3917 - 95%	97	91
300 - with Data Aug.	98	88
600 - with Data Aug.	98	90
900 - with Data Aug.	98	90
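
The percentages in the Dataset column can be checked with a few lines of arithmetic. The sketch below assumes they are the share of the 4124-record Veg dataset used for training, which the slides suggest but do not state outright. The helper name `training_share` is made up for this example.

```python
TOTAL_RECORDS = 4124  # size of the Veg dataset quoted on the slide


def training_share(n_train, total=TOTAL_RECORDS):
    """Return the percentage of the dataset used for training."""
    return 100.0 * n_train / total


# Training-set sizes taken from the results table.
for n in (300, 600, 900, 3917):
    print(f"{n:>4} images -> {training_share(n):.1f}% of {TOTAL_RECORDS}")
```

Under that assumption the shares come out at about 7.3%, 14.5%, 21.8% and 95.0%, close to the 7%, 14%, 21% and 95% shown in the table.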

Wrong results

[Figure: misclassified watermelon and pumpkin images, each annotated with training dataset sizes (300, 600, 900)]

Result for "difficult" images

[Figure: pumpkin, watermelon and tomato images, annotated with training dataset sizes (300, 600, 900)]

Fine-tuning TL models

Two approaches:

1) Fine-tune the convolutional part of the CNN

Retrain the last convolutional layer (Deep Learning Studio provides an easy way to do this)
Freeze and train specific layers

2) Fine-tune the classification part of the CNN

Classification architecture optimisation
- number of neurons in the classification layers
Hyper-parameter optimization
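
Approach 1 can be sketched as a small helper that toggles the per-layer `trainable` flag. This is an illustration under assumptions, not the Deep Learning Studio mechanism: the function name `unfreeze_layers` is hypothetical, and the layer names follow the Keras VGG16 naming scheme (`block5_conv3` is its last convolutional layer). Stand-in objects are used so the sketch runs without TensorFlow. A real Keras model exposes the same `.layers`, `.name` and `.trainable` attributes.

```python
from types import SimpleNamespace


def unfreeze_layers(base_model, prefix="block5_conv3"):
    """Make only the layers whose name starts with `prefix` trainable.

    Hypothetical helper; works on any object with `.trainable` and
    `.layers`, where each layer has `.name` and `.trainable`.
    """
    # Re-enable the model first, otherwise a frozen parent model
    # hides the weights of its sub-layers from training.
    base_model.trainable = True
    for layer in base_model.layers:
        layer.trainable = layer.name.startswith(prefix)
    return [layer.name for layer in base_model.layers if layer.trainable]


# Stand-in for the tail of a VGG16 base (names as used by Keras).
names = ["block4_conv3", "block4_pool",
         "block5_conv1", "block5_conv2", "block5_conv3", "block5_pool"]
fake_base = SimpleNamespace(
    trainable=False,
    layers=[SimpleNamespace(name=n, trainable=False) for n in names],
)
print(unfreeze_layers(fake_base))
```

Passing `prefix="block5"` instead would unfreeze the whole last convolutional block. With a real Keras model, recompile after changing the flags, typically with a low learning rate so the pre-trained weights are only adjusted slightly.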

Simpler DL model with pre-trained VGG16

Training results

Dataset	Train Acc. (%)	Validation Acc. (%)
300 - Simpler	99	88
300 - 512 - 64	99	86
300 - MSimpler-no BN	98	84
300 - Simpler VGG 10% trainable	99	89
3917 - 95% - Simpler	98	93.5
95% - MSimpler-no BN	40	35
95% - MSimpler- BN	97	92

Benefits of CNN Transfer Learning

Dataset does not need to be huge
Save time and effort
- no CNN architecture search
- training phase faster
- less computing power needed
Produces very good models that can then be fine-tuned
Can be applied to very diverse classification problems

And drawbacks:

Needs a fair amount of GPU memory
Not fully tunable with Keras

Distributed Deep Learning with Spark, Keras and DLS

Favio Vázquez

Data Scientist

@faviovaz

https://github.com/faviovazquez

https://www.linkedin.com/in/faviovazquez/

Outline

Timeline

Fundamentals of Apache Spark

Deep Learning Pipelines

Apache Spark on Deep Cognition

Apache Spark

Favio Vázquez

About me

Venezuelan
Physicist and Computer Engineer
Master in Physics UNAM
Data Scientist
Contributor to the Apache Spark project on GitHub and Stack Overflow
Principal Data Scientist at Oxxo
Chief Data Scientist at Iron
Main Developer of Optimus
Creator of Ciencia y Datos

Very active member of LinkedIn ;)
Editor - International Journal of Business Analytics and Intelligence
Lecturer at Afi (Data Science program, MX)
Writer in Towards Data Science, Becoming Human, Planeta Chatbot and more :)

http://spark.apache.org/

2004 – Google

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat research.google.com/archive/mapreduce.html

2006 – Apache

Hadoop, originating from the Nutch Project

Doug Cutting research.yahoo.com/files/cutting.pdf

2008 – Yahoo

web scale search indexing

Hadoop Summit, HUG, etc. developer.yahoo.com/hadoop/

2009 – Amazon AWS

Elastic MapReduce

Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc. aws.amazon.com/elasticmapreduce/

What is Spark?

Spark is a fast and general engine for large-scale data processing.

Unified Engine

High-level APIs with room for optimization

Expresses the entire workflow with a single API
Connects existing libraries and storage systems

RDD

- Transformations
- Actions
- Cache

Dataset

- Typed
- Scala & Java
- RDD benefits

DataFrame

- Dataset[Row]
- Optimized
- Versatile

Deep Learning Pipelines

Deep Learning Pipelines is an open source library created by Databricks that provides high-level APIs for scalable deep learning in Python with Apache Spark.