Health Data Science Meetup

December 5, 2016

Convolutional Neural Network

 

Recurrent Neural Network

 

Implementations in Python

Convolutional Neural Network

  • The flattening of the image matrix of pixels to a long vector of pixel values loses all of the spatial structure in the image

[Figure: images containing the letter "C" in different colors and at different positions]

Does color matter?

No, only the structure matters
Translation Invariance

Weight Sharing

[Figure: weight sharing, with the same weights w_1 and w_2 reused across every patch of the R, G, B input]

Statistical Invariants

  • Preserve the spatial relationship between pixels by learning internal feature representations using small squares of input data
     
  • Features are learned and used across the whole image
    -  allowing objects in the images to be shifted or translated in the scene and still be detected by the network (see the sketch below)
     
  • Advantages of CNNs:
    -  fewer parameters to learn than a fully connected network
    -  designed to be invariant to object position and distortion in the scene
    -  automatically learn and generalize features from the input domain
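
To make weight sharing concrete, the sketch below slides a single small filter across a toy image in plain NumPy: the same weights are reused at every position, and the filter responds to its pattern wherever it appears. The filter values, image size, and vertical-bar pattern are made up purely for illustration.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide one small kernel over the image (stride 1, no padding).

    The same weights are reused at every position -- this is weight sharing.
    """
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A simple vertical-edge detector applied to a toy 8 x 8 image.
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

image = np.zeros((8, 8))
image[:, 2] = 1.0                       # a vertical bar at column 2
shifted = np.roll(image, 3, axis=1)     # the same bar, shifted 3 pixels to the right

fmap = convolve2d(image, kernel)
fmap_shifted = convolve2d(shifted, kernel)

# The strongest response simply shifts along with the bar (translation invariance).
print(np.unravel_index(np.argmax(np.abs(fmap)), fmap.shape))                   # (0, 0)
print(np.unravel_index(np.argmax(np.abs(fmap_shifted)), fmap_shifted.shape))   # (0, 3)
```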

CNN

Building Blocks of CNNs

Convolutional Layers

  • Filters:
    -  essentially the neurons of the layer
    -  have both weighted inputs and generate an output value like a neuron
    -  the input size is a fixed square called a patch or a receptive field
     
  • Feature Maps:
    -  the output of one filter applied to the previous layer
    -  e.g. A given filter is drawn across the entire previous layer, moved one pixel at a time. Each position results in an activation of the neuron and the output is collected in the feature map
    -  the distance that the filter is moved across the input from the previous layer between activations is referred to as the stride
     

Pooling Layers

  • Down-sample the previous layer's feature map
     
  • Intended to consolidate the features learned and expressed in the previous layer's feature map
     
  • May be considered as a technique to compress or generalize feature representations and generally reduce the overfitting of the training data by the model
     
  • They also have a receptive field, often much smaller than the convolutional layer
     
  • They are often very simple: taking the average or the maximum of the input values in order to create their own feature map (a max-pooling sketch follows)
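
A minimal NumPy sketch of 2 × 2 max pooling with a stride of 2; the input values are arbitrary:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Down-sample with a 2 x 2 receptive field and a stride of 2, keeping the max of each patch."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]       # drop an odd edge row/column if present
    blocks = trimmed.reshape(trimmed.shape[0] // 2, 2, trimmed.shape[1] // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))
# [[ 5.  7.]
#  [13. 15.]]
```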

Fully Connected Layers

  • Normal flat feedforward layers
     
  • Usually used at the end of the network after feature extraction and consolidation has been performed by the convolutional and pooling layers
     
  • Used to create final nonlinear combinations of features and for making predictions by the network

Example

  • A dataset of gray scale images
    -  Size: 32 × 32 × 1 (width × height × channels)
     
  • Convolutional Layer:
    -  10 filters
    -  a patch 5 pixels wide and 5 pixels high
    -  a stride length of 1
     
  • Since each filter can only get input from 5 × 5 (25) pixels at a time, each will require 25+1 input weights
     
  • Dragging the 5 × 5 patch across the input image with a stride of 1 results in a feature map of 28 × 28 output values (or 784 distinct activations per image)
     
  • We have 10 filters, so we will get 10 different 28 × 28 feature maps (or 7,840 outputs for one image)
     
  • In summary, 26 × 10 × 28 × 28 (or 203,840) connections in the convolutional layer, but thanks to weight sharing only 10 × 26 = 260 distinct weights to learn (checked in the sketch below)
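
The arithmetic above can be reproduced in a few lines of Python; the variable names are just for illustration:

```python
# Quick check of the numbers above (valid convolution, no padding).
width, height = 32, 32
patch, stride = 5, 1
n_filters = 10

fmap_size = (width - patch) // stride + 1              # 28
activations_per_map = fmap_size * fmap_size            # 784
activations_total = n_filters * activations_per_map    # 7,840

weights_per_filter = patch * patch + 1                 # 25 weights + 1 bias = 26
connections = weights_per_filter * activations_total   # 26 * 10 * 28 * 28 = 203,840
distinct_weights = n_filters * weights_per_filter      # only 260, thanks to weight sharing

print(fmap_size, activations_total, connections, distinct_weights)   # 28 7840 203840 260
```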

Example

  • Pooling layer:
    -  a patch with a width of 2 and a height of 2
    -  a stride of 2
    -  use a max() operation for each patch so that the activation is the maximum input value
     
  • This results in feature maps that are one half the size of the input feature maps in each dimension
    -  e.g. from 10 different 28 × 28 feature maps as input to 10 different 14 × 14 feature maps as output
     
  • Fully connected layer:
    -  flatten out the square feature maps into a traditional flat fully-connected layer
    -  200 hidden neurons, each with 10 × 14 × 14 input connections (or 1,960+1 weights per neuron)
    -  a total of 392,200 connections and weights to learn in this layer
    -  finally, we use a sigmoid function to output probabilities of class values directly (a Keras sketch of this whole example follows)
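
A minimal Keras sketch of this example architecture follows. It assumes a TensorFlow/Keras installation, ReLU activations for the hidden layers, and a binary classification problem with a single sigmoid output unit; none of these choices are specified in the slides.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Input(shape=(32, 32, 1)),                         # 32 x 32 gray scale images
    # 10 filters, 5 x 5 patch, stride 1 -> ten 28 x 28 feature maps (260 shared weights)
    Conv2D(10, (5, 5), strides=(1, 1), activation='relu'),
    # 2 x 2 pooling with stride 2 -> ten 14 x 14 feature maps
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    # flatten to 10 * 14 * 14 = 1,960 inputs
    Flatten(),
    # 200 hidden neurons -> 200 * (1,960 + 1) = 392,200 weights
    Dense(200, activation='relu'),
    # sigmoid output for class probabilities, as in the example
    Dense(1, activation='sigmoid'),
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()   # 260 parameters in the convolutional layer, 392,200 in the hidden dense layer
```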

     

Best Practices for CNNs

  • Input receptive field dimensions:
    -  1D for words in a sentence
    -  2D for images
    -  3D for videos

  • Receptive field size:
    -  patch should be as small as possible, but large enough to "see" features in the input data
    -  common to use 3 × 3 on small images and 5 × 5 or 7 × 7 and more on larger image sizes
     
  • Stride Width:
    -  start with the default stride of 1 (it is easy to understand, and you don't need padding to handle the receptive field falling off the edge of the images)
    -  usually increase to 2 or higher for larger images
     
  • Number of filters:
    -  filters are feature detectors
    -  usually fewer filters are used at the input layer, and increasingly more filters used at deeper layers

Best Practices for CNNs

  • Padding:
    -  use zero-padding when the receptive field reads beyond the edge of the input (the non-input data is set to zero)
    -  useful when you cannot standardize input image sizes or the patch and stride sizes cannot neatly divide up the image size
     
  • Pooling:
    -  a generalization process to reduce overfitting
    -  patch size is almost always set to 2 × 2 with a stride of 2 to discard 75% of the activations from the output of the previous layer
     
  • Data preparation:
    -  standardize input data (both the dimensions of the images and the pixel values)

     
  • Dropout:
    -  CNNs have a habit of overfitting, even with pooling layers
    -  dropout should be used, for example between fully connected layers and perhaps after pooling layers (see the sketch below)
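
Putting these best practices together, a small Keras sketch might look like the following. The input size, the filter counts (32 then 64), the dropout rates, and the 10-class softmax output are illustrative choices only, not values given in the slides.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Input(shape=(64, 64, 3)),                          # hypothetical standardized RGB input
    # small 3 x 3 patches with zero-padding ('same') so the edges are handled
    Conv2D(32, (3, 3), padding='same', activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),    # discards 75% of the activations
    Dropout(0.25),                                     # dropout after pooling

    # more filters at the deeper layer
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    Dropout(0.25),

    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),                                      # dropout between fully connected layers
    Dense(10, activation='softmax'),
])
```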

     

Recurrent Neural Network

Sequences

  • Time-series data:
    -  e.g. the price of a stock over time
     
  • Classical feedforward NN:
    -  define a window size (e.g. 5)
    -  train the network to learn to make short term predictions from the fixed sized window of inputs
    -  limitation: the window size must be chosen in advance (see the windowing sketch after this list)
     
  • Different types of sequence problems:
    -  one-to-many: sequence output, for image captioning
    -  many-to-one: sequence input, for sentiment classification
    -  many-to-many: sequence in and out, for machine translation
    -  synchronized many to many: synced sequences in and out, for video classification
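
A small NumPy sketch of the fixed-window approach mentioned above; the sine wave stands in for a real price series, and the window size of 5 is an arbitrary choice (which is exactly the limitation noted).

```python
import numpy as np

def make_windows(series, window_size=5):
    """Turn a 1-D series into (X, y) pairs: each window of past values predicts the next value."""
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i:i + window_size])
        y.append(series[i + window_size])
    return np.array(X), np.array(y)

prices = np.sin(np.linspace(0, 10, 200))     # stand-in for a stock price series
X, y = make_windows(prices, window_size=5)
print(X.shape, y.shape)                      # (195, 5) (195,)
# X and y can now be fed to an ordinary feedforward network with 5 inputs.
```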

RNNs

  • RNNs are a special type of NN designed for sequence problems
     
  • An RNN can be thought of as the addition of loops to the architecture of a standard feedforward NN
    -  the output of the network may feed back as an input to the network along with the next input vector, and so on (sketched below)
     
  • The recurrent connections add state or memory to the network and allow it to learn broader abstractions from the input sequences
     
  • Two major issues:
    -  how to train the network with backpropagation
    -  how to stop gradients vanishing or exploding during training
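
A forward-pass-only NumPy sketch of that loop follows: the hidden state produced at one time step is fed back in at the next step. Training (backpropagation through time) is the hard part and is not shown; the layer sizes and random weights are purely illustrative.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Unroll one recurrent layer over time: the hidden state feeds back at every step."""
    h = np.zeros(W_h.shape[0])                  # initial state (the network's "memory")
    states = []
    for x_t in inputs:                          # process the sequence one step at a time
        h = np.tanh(W_x @ x_t + W_h @ h + b)    # current input combined with the previous state
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
seq = rng.normal(size=(20, 3))                  # 20 time steps, 3 features each
W_x = rng.normal(scale=0.1, size=(8, 3))        # input-to-hidden weights (8 hidden units)
W_h = rng.normal(scale=0.1, size=(8, 8))        # hidden-to-hidden weights: the recurrent loop
b = np.zeros(8)

print(simple_rnn_forward(seq, W_x, W_h, b).shape)   # (20, 8): one hidden state per time step
```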

LSTM

  • Long Short-Term Memory network
    -  overcomes the vanishing gradient problem
    -  can be used to create large RNNs
     
  • Instead of neurons, LSTM has memory blocks that are connected into layers
    -  a block contains gates that manage the block's state and output
    -  a unit operates upon an input sequence, and each gate within the unit uses the sigmoid activation function to control whether it is triggered or not
  • Three types of gates within a memory unit (see the Keras sketch after this list):
    -  Input Gate: conditionally decides which values from the input to update the memory state
    -  Forget Gate: conditionally decides what information to discard from the unit
    -  Output Gate: conditionally decides what to output based on input and the memory of the unit
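
A minimal Keras sketch of a many-to-one LSTM for sentiment classification follows, assuming a TensorFlow/Keras installation; the vocabulary size, embedding dimension, sequence length, and the choice of 100 memory blocks are illustrative.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size, seq_len = 5000, 100     # hypothetical vocabulary size and padded sequence length

model = Sequential([
    Input(shape=(seq_len,), dtype='int32'),
    # map word indices to dense 32-dimensional vectors
    Embedding(input_dim=vocab_size, output_dim=32),
    # a layer of 100 LSTM memory blocks; the input, forget and output gates are handled internally
    LSTM(100),
    # many-to-one: a single sentiment probability for the whole input sequence
    Dense(1, activation='sigmoid'),
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
```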

Implementations

HDS Meetup 12/5/2016

By Hui Hu