Distributed Deep Learning and Transfer Learning with Spark, Keras and DLS

Favio Vázquez

Data Scientist

@faviovaz

Webinars

Eric Feuilleaubois

PhD in Artificial Neural Networks

@Deep_In_Depth

Introduction to Transfer Learning with Convolutional Neural Networks

Deep Learner / Machine Learner

Curator of Deep_In_Depth - news feed on Deep Learning, Machine Learning and Data Science

Writer for Medium - Towards data science

Principle of Transfer Learning

Aim: Predict classes (labels) that have not been seen by the source (pre-trained) model

ImageNet Dataset

n07714571  -  head cabbage
n07714990  -  broccoli
n07715103  -  cauliflower
n07716358  -  zucchini, courgette
n07718472  -  cucumber, cuke
n07718747  -  artichoke, globe artichoke
n07720875  -  bell pepper
n07730033  -  cardoon
n07734744  -  mushroom
n07742313  -  Granny Smith
n07745940  -  strawberry
n07747607  -  orange
n07749582  -  lemon
n07753113  -  fig
n07753275  -  pineapple, ananas
n07753592  -  banana
n07768694  -  pomegranate

  • 1,000 image categories
  • Training  -  1.2 million images
  • Validation & test  -  150,000 images

Veg Dataset - 4124 records

Pumpkin

Tomato

Watermelon

Motivations for CNN TL

- Availability of Open CNN Models with Top class performance

CNN Transfer Learning

Proof by Features

- CNNs detect visual features (patterns) in images

- First  layers learn "basic" features

- Last layers learn "advanced" features

- Top layers classify

Basic features are very similar from one CNN model to another --> no need to re-learn them; best to re-use them

DL model with pre-trained VGG16
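
A model of this kind can be sketched in Keras roughly as follows. This is a minimal sketch, not the exact architecture shown in the webinar: only the pre-trained VGG16 base and the three target classes (Pumpkin, Tomato, Watermelon) come from the slides. The 224x224 input size, the 256-unit dense layer, the dropout rate and the function name `build_transfer_model` are assumptions made for illustration.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16


def build_transfer_model(num_classes=3, weights="imagenet"):
    """Frozen VGG16 feature extractor plus a small trainable classifier."""
    # include_top=False drops VGG16's own 1,000-class ImageNet classifier.
    base = VGG16(weights=weights, include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # re-use the learned features, do not re-learn them

    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),   # assumed layer size
        layers.Dropout(0.5),                    # assumed dropout rate
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_transfer_model()` downloads the ImageNet weights on first use; only the two dense layers on top are updated during training.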

Training results

Dataset                 Train Acc. (%)   Validation Acc. (%)
300 - 7%                95               88
600 - 14%               96               89
900 - 21%               97               90
3917 - 95%              97               91
300 - with Data Aug.    98               88
600 - with Data Aug.    98               90
900 - with Data Aug.    98               90
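
The percentages next to the dataset sizes appear to be each training subset's share of the 4124-record Veg dataset. That reading is an assumption, and the helper name `share_of_dataset` is made up for this sketch. A quick check of the arithmetic:

```python
TOTAL_RECORDS = 4124  # size of the Veg dataset given on the slides


def share_of_dataset(n_train, total=TOTAL_RECORDS):
    """Return n_train as a percentage of the full dataset."""
    return 100.0 * n_train / total


for n in (300, 600, 900, 3917):
    print(f"{n:>4} images -> {share_of_dataset(n):.1f}% of {TOTAL_RECORDS}")
```

The computed shares come out at roughly 7.3%, 14.5%, 21.8% and 95.0%, so the slide labels look like approximate figures rather than exact ones.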

Wrong results

[Figure: misclassified Watermelon and Pumpkin images, each annotated with training dataset sizes (300, 600, 900)]

Result for "difficult" images

[Figure: difficult Pumpkin, Watermelon and Tomato images, each annotated with training dataset sizes 300 - 600 - 900]

Fine-tuning TL models

Two approaches:

1) Fine-tune the convolutional part of the CNN

  • Retrain the last convolutional layer (Deep Learning Studio provides an easy way to do it)
  • Freeze and train specific layers

2) Fine-tune the classification part of the CNN

  • Classification architecture optimisation
    • number of neurons in the classification layers
  • Hyper-parameter optimisation
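
The first approach, freezing most of the network and retraining specific layers, can be sketched in Keras as follows. This is an illustration under assumptions rather than what Deep Learning Studio does internally: the helper names are hypothetical, and `block5_conv3` is simply the name Keras gives to the last convolutional layer of its VGG16 implementation.

```python
from tensorflow.keras.applications import VGG16


def set_trainable_layers(base, trainable_names):
    """Freeze every layer of `base` except those named in `trainable_names`."""
    base.trainable = True  # enable training on the model, then restrict per layer
    for layer in base.layers:
        layer.trainable = layer.name in trainable_names
    return base


def build_finetune_base(weights="imagenet", trainable_names=("block5_conv3",)):
    """VGG16 convolutional base with only the named layers left trainable."""
    base = VGG16(weights=weights, include_top=False, input_shape=(224, 224, 3))
    return set_trainable_layers(base, trainable_names)
```

A model built on this base has to be compiled (or recompiled) after the `trainable` flags change for them to take effect, and a small learning rate is commonly used so the pre-trained weights are only adjusted slightly.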

Simpler DL model with pre-trained VGG16

Training results

Dataset                               Train Acc. (%)   Validation Acc. (%)
300 - Simpler                         99               88
300 - 512 - 64                        99               86
300 - MSimpler - no BN                98               84
300 - Simpler, VGG 10% trainable      99               89
3917 - 95% - Simpler                  98               93.5
95% - MSimpler - no BN                40               35
95% - MSimpler - BN                   97               92

Benefits of CNN Transfer Learning

  • Dataset does not need to be huge
  • Save time and effort
    • no CNN architecture search
    • training phase faster
    • less computing power needed
  • Produces very good models that can then be fine-tuned
  • Can be applied to very diverse classification problems

 

And drawbacks:

  • need a fair amount of memory on GPU
  • not fully tunable with Keras

Distributed Deep Learning with Spark, Keras and DLS

Favio Vázquez

Data Scientist

@faviovaz

https://github.com/faviovazquez

https://www.linkedin.com/in/faviovazquez/

Outline

Timeline 

Fundamentals of Apache Spark

Deep Learning Pipelines

Apache Spark on Deep Cognition

Apache Spark

About me

  • Venezuelan
  • Physicist and Computer Engineer
  • Master's in Physics, UNAM
  • Data Scientist
  • Collaborator of Apache Spark project on GitHub and Stack Overflow
  • Principal Data Scientist at Oxxo
  • Chief Data Scientist at Iron
  • Main Developer of Optimus
  • Creator of Ciencia y Datos

  • Very active member of LinkedIn ;)
  • Editor - International Journal of Business Analytics and Intelligence
  • Lecturer Afi (Data Science program-MX)
  • Writer in Towards Data Science, Becoming Human, Planeta Chatbot and more :)

2004 – Google

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat research.google.com/archive/mapreduce.html

 

2006 – Apache

Hadoop, originating from the Nutch Project

Doug Cutting research.yahoo.com/files/cutting.pdf

 

2008 – Yahoo

web scale search indexing

Hadoop Summit, HUG, etc. developer.yahoo.com/hadoop/

 

2009 – Amazon AWS

Elastic MapReduce

Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc. aws.amazon.com/elasticmapreduce/

What is Spark?

Apache Spark is a fast and general engine for large-scale data processing.

Unified Engine

High-level APIs with room for optimization

  • Expresses all the workflow with a single API
  • Connects existing libraries and storage systems

RDD

  • Transformations
  • Actions
  • Cache

Dataset

  • Typed
  • Scala & Java
  • RDD benefits

DataFrame

  • Dataset[Row]
  • Optimized
  • Versatile
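
The difference between transformations, actions and cache can be illustrated without a cluster. The class below is a toy, plain-Python stand-in written for this purpose. `ToyRDD` and its methods are hypothetical and are not Spark's API; they only mimic the idea that transformations are recorded lazily, actions trigger the computation, and a cached result is reused.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning an iterator
        self._use_cache = False
        self._cache = None

    @classmethod
    def from_list(cls, items):
        items = list(items)
        return cls(lambda: iter(items))

    def _iterate(self):
        if self._cache is not None:
            return iter(self._cache)          # reuse the stored result
        if self._use_cache:
            self._cache = list(self._compute())
            return iter(self._cache)
        return self._compute()                # recompute from the lineage

    # Transformations: build a new ToyRDD, compute nothing yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)))

    # Cache: mark the result to be kept after the first action.
    def cache(self):
        self._use_cache = True
        return self

    # Actions: actually run the recorded transformations.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []  # records every element the mapped function is applied to


def square(x):
    calls.append(x)
    return x * x


squares = ToyRDD.from_list(range(5)).map(square).cache()
print(len(calls))          # 0: the transformation has not run yet
print(squares.collect())   # [0, 1, 4, 9, 16]: the action triggers it
print(squares.count())     # 5: served from the cache
print(len(calls))          # 5: square() ran once per element, not twice
```

In Spark itself the same pattern applies across a cluster, with the engine deciding how to schedule and optimize the recorded transformations.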

Deep Learning Pipelines

Deep Learning Pipelines is an open source library created by Databricks that provides high-level APIs for scalable deep learning in Python with Apache Spark.

Apache Spark on Deep Cognition

We will learn:

  • How to load an image to Apache Spark
  • How to apply pre-trained models as transformers in a Spark ML pipeline
  • Transfer learning with Apache Spark
  • Deploying models in DataFrames and SQL

Competition and Prizes!!

Take my article on Detecting Breast Cancer with Deep Learning and solve the problem yourselves using DLS!

Write a post or blog about it; the top 10 will each win a $50 Amazon gift card.

DEMO
