Distributed Deep Learning and Transfer Learning with Spark, Keras and DLS

Favio Vázquez

Data Scientist

@faviovaz

Webinars

Eric Feuilleaubois

PhD in Artificial Neural Networks

@Deep_In_Depth

Introduction to Transfer Learning with Convolutional Neural Networks

Deep Learner / Machine Learner

Curator of Deep_In_Depth - news feed on Deep Learning, Machine Learning and Data Science

Writer for Medium - Towards Data Science

Principle of Transfer Learning

Aim: Predict classes (labels) that have not been seen by the source (pre-trained) model

ImageNet Dataset

n07714571 - head cabbage
n07714990 - broccoli
n07715103 - cauliflower
n07716358 - zucchini, courgette
n07718472 - cucumber, cuke
n07718747 - artichoke, globe artichoke
n07720875 - bell pepper
n07730033 - cardoon
n07734744 - mushroom
n07742313 - Granny Smith
n07745940 - strawberry
n07747607 - orange
n07749582 - lemon
n07753113 - fig
n07753275 - pineapple, ananas
n07753592 - banana
n07768694 - pomegranate

  • 1,000 image categories
  • Training - 1.2 million images
  • Validation & test - 150,000 images

Veg Dataset - 4124 records

Pumpkin

Tomato

Watermelon

Motivations for CNN TL

- Availability of open CNN models with top-class performance

CNN Transfer Learning

Proof by Features

- CNNs detect visual features (patterns) in images

- First layers learn "basic" features

- Last layers learn "advanced" features

- Top layers classify

Basic features are very similar from one CNN model to another, so there is no need to re-learn them; it is best to re-use them.

DL model with pre-trained VGG16
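
The slide showed this model as built in Deep Learning Studio; the exact architecture is not recoverable from the text. As a rough Keras equivalent of the idea (re-use the pre-trained convolutional features, train only a new classifier), here is a minimal sketch. The head sizes (256 units, 0.5 dropout), the 224x224 input and the optimizer are assumptions for illustration, not the values used in the webinar.

```python
# Classes of the Veg dataset named on the slides.
CLASS_NAMES = ("Pumpkin", "Tomato", "Watermelon")


def build_transfer_model(num_classes=len(CLASS_NAMES), weights="imagenet"):
    """Frozen VGG16 convolutional base plus a small, newly trained classifier.

    The head layout below is an illustrative assumption.
    """
    # Imported lazily so this module can be loaded without TensorFlow installed.
    from tensorflow import keras

    # include_top=False drops the original 1,000-class ImageNet classifier.
    base = keras.applications.VGG16(
        weights=weights, include_top=False, input_shape=(224, 224, 3)
    )
    base.trainable = False  # re-use the learned features, do not re-learn them

    model = keras.Sequential([
        base,
        keras.layers.Flatten(),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
    )
    return model
```

Calling `build_transfer_model()` downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture with random weights, which is only useful for checking shapes.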

Training results

Dataset                 Train acc. (%)   Validation acc. (%)
300 - 7%                95               88
600 - 14%               96               89
900 - 21%               97               90
3917 - 95%              97               91
300 - with data aug.    98               88
600 - with data aug.    98               90
900 - with data aug.    98               90
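
The dataset column pairs a training-set size with its approximate share of the Veg dataset. A quick check of that arithmetic, assuming the percentages are taken against the full 4124 records (the slides do not state this explicitly):

```python
TOTAL_RECORDS = 4124  # size of the Veg dataset given on the slides


def share_of_dataset(n_train, total=TOTAL_RECORDS):
    """Return n_train as a percentage of the full dataset."""
    return 100.0 * n_train / total


# Training-set sizes used in the results table.
for n in (300, 600, 900, 3917):
    print(f"{n:>5} images = {share_of_dataset(n):.1f}% of {TOTAL_RECORDS}")
```

The 3917-image run corresponds to a 95% training split of the 4124 records; the smaller runs use only a fraction of the data, which is the point of the comparison.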

Wrong results

[Figure: example images labelled Watermelon and Pumpkin, each annotated with a subset of the training dataset sizes 300, 600 and 900]

Result for "difficult" images

[Figure: example images labelled Pumpkin, Watermelon and Tomato, each annotated with the training dataset sizes 300, 600 and 900]

Fine-tuning TL models

Two approaches:

 

1) Fine-tune the convolutional part of the CNN

  • Retrain the last convolutional layer (Deep Learning Studio provides an easy way to do this)
  • Freeze and train specific layers

2) Fine-tune the classification part of the CNN

  • Classification architecture optimization
    • number of neurons in the classification layers
  • Hyper-parameter optimization
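
The first approach (retraining only the last convolutional layer) can be sketched in Keras as follows. The helper `unfreeze_from` is a hypothetical name introduced here; it only relies on each layer having `name` and `trainable` attributes. The layer name `block5_conv3` is the last convolutional layer of the Keras VGG16 implementation. This is an illustration, not the Deep Learning Studio procedure shown in the webinar.

```python
def unfreeze_from(layers, first_trainable_name):
    """Freeze every layer before `first_trainable_name`, unfreeze it and the rest.

    Returns the names of the layers left trainable.
    """
    trainable = False
    for layer in layers:
        if layer.name == first_trainable_name:
            trainable = True
        layer.trainable = trainable
    if not trainable:
        raise ValueError(f"no layer named {first_trainable_name!r}")
    return [layer.name for layer in layers if layer.trainable]


def build_partially_trainable_vgg16(weights="imagenet"):
    """VGG16 base in which only the last convolutional layer is retrained."""
    # Imported lazily so this module can be loaded without TensorFlow installed.
    from tensorflow import keras

    base = keras.applications.VGG16(
        weights=weights, include_top=False, input_shape=(224, 224, 3)
    )
    unfreeze_from(base.layers, "block5_conv3")
    return base
```

After changing `trainable` flags, the model that contains the base has to be compiled again for the change to take effect. A small learning rate is a common choice at this stage so the pre-trained weights are not overwritten too aggressively.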

Simpler DL model with pre-trained VGG16

Training results

Dataset                               Train acc. (%)   Validation acc. (%)
300 - Simpler                         99               88
300 - 512 - 64                        99               86
300 - MSimpler-no BN                  98               84
300 - Simpler, VGG 10% trainable      99               89
3917 - 95% - Simpler                  98               93.5
95% - MSimpler-no BN                  40               35
95% - MSimpler-BN                     97               92

Benefits of CNN Transfer Learning

  • Dataset does not need to be huge
  • Saves time and effort
    • no CNN architecture search
    • faster training phase
    • less computing power needed
  • Produces very good models that can then be fine-tuned
  • Can be applied to very diverse classification problems

And drawbacks:

  • Needs a fair amount of GPU memory
  • Not fully tunable with Keras

Distributed Deep Learning with Spark, Keras and DLS

Favio Vázquez

Data Scientist

@faviovaz

https://github.com/faviovazquez

https://www.linkedin.com/in/faviovazquez/

Outline

Timeline 

Fundamentals of Apache Spark

Deep Learning Pipelines

Apache Spark on Deep Cognition

Apache Spark

About me

  • Venezuelan
  • Physicist and Computer Engineer
  • Master's in Physics, UNAM
  • Data Scientist
  • Collaborator on the Apache Spark project on GitHub and Stack Overflow
  • Principal Data Scientist at Oxxo
  • Chief Data Scientist at Iron
  • Main Developer of Optimus
  • Creator of Ciencia y Datos

  • Very active member of LinkedIn ;)
  • Editor - International Journal of Business Analytics and Intelligence
  • Lecturer Afi (Data Science program-MX)
  • Writer in Towards Data Science, Becoming Human, Planeta Chatbot and more :)

2004 – Google

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat research.google.com/archive/mapreduce.html

 

2006 – Apache

Hadoop, originating from the Nutch Project

Doug Cutting research.yahoo.com/files/cutting.pdf

 

2008 – Yahoo

web scale search indexing

Hadoop Summit, HUG, etc. developer.yahoo.com/hadoop/

 

2009 – Amazon AWS

Elastic MapReduce

Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc. aws.amazon.com/elasticmapreduce/

What is Spark?

Apache Spark is a fast and general engine for large-scale data processing.

Unified Engine

High-level APIs with room for optimization

  • Expresses all the workflow with a single API
  • Connects existing libraries and storage systems

RDD

  • Transformations
  • Actions
  • Cache

Dataset

  • Typed
  • Scala & Java
  • RDD benefits

DataFrame

  • Dataset[Row]
  • Optimized
  • Versatile
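
To make the RDD vocabulary concrete, here is a minimal PySpark sketch, assuming a local installation of the `pyspark` package and a Java runtime. Transformations (`map`, `filter`) are lazy, actions (`sum`, `count`) trigger the computation, and `cache` keeps the computed result in memory for later actions. From Python only the DataFrame API is available; the typed Dataset API is for Scala and Java, as the slide notes. The sample data and the names `square`, `is_even` and `spark_basics_demo` are made up for illustration.

```python
def square(x):
    return x * x


def is_even(x):
    return x % 2 == 0


def spark_basics_demo():
    """Run a tiny local Spark job and return (sum, count, filtered row count)."""
    # Imported lazily: needs the pyspark package and a Java runtime.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]").appName("spark-basics").getOrCreate()
    )
    try:
        # RDD: transformations only describe the work; nothing runs yet.
        numbers = spark.sparkContext.parallelize(range(1, 11))
        even_squares = numbers.map(square).filter(is_even)
        even_squares.cache()  # keep the result in memory once it is computed

        # Actions trigger the computation.
        total = even_squares.sum()
        count = even_squares.count()  # served from the cached partitions

        # DataFrame: rows with named columns, optimized by Spark SQL.
        df = spark.createDataFrame(
            [("pumpkin", 1), ("tomato", 2), ("watermelon", 3)], ["label", "id"]
        )
        n_rows = df.filter(df["id"] > 1).count()
        return total, count, n_rows
    finally:
        spark.stop()
```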

Deep Learning Pipelines

Deep Learning Pipelines is an open source library created by Databricks that provides high-level APIs for scalable deep learning in Python with Apache Spark.

Apache Spark on Deep Cognition

We will learn:

  • How to load an image into Apache Spark
  • How to apply pre-trained models as transformers in a Spark ML pipeline
  • Transfer learning with Apache Spark
  • Deploying models in DataFrames and SQL
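
The first three points can be sketched with Deep Learning Pipelines roughly as follows. This assumes the `sparkdl` package and a compatible Spark 2.4 installation; the library's API has changed between releases, so treat the calls as a sketch to check against the version you install. The function names, the `label` column and the logistic-regression settings are illustrative choices, not the code from the demo.

```python
def load_labeled_images(spark, path, label):
    """Load every image under `path` into a DataFrame and attach a numeric label."""
    from pyspark.sql.functions import lit

    # The "image" data source (Spark 2.4+) yields one `image` struct column per file.
    return spark.read.format("image").load(path).withColumn("label", lit(label))


def build_transfer_learning_pipeline(model_name="InceptionV3"):
    """Pre-trained CNN as a feature extractor, followed by logistic regression."""
    # Imported lazily: needs pyspark and the sparkdl (Deep Learning Pipelines) package.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from sparkdl import DeepImageFeaturizer

    # The featurizer applies the pre-trained model as a Spark ML transformer.
    featurizer = DeepImageFeaturizer(
        inputCol="image", outputCol="features", modelName=model_name
    )
    classifier = LogisticRegression(
        maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label"
    )
    return Pipeline(stages=[featurizer, classifier])
```

A typical use would union the DataFrames returned by `load_labeled_images` for each class, split them with `randomSplit`, and call `build_transfer_learning_pipeline().fit(train_df)` to train only the classifier on top of the pre-trained features.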

Competition and Prizes!!

Take my article on detecting breast cancer with deep learning and solve it yourselves using DLS!

Write a post or blog about it; the top 10 will each win a $50 Amazon gift card.

DEMO
