## Health Data Science Meetup

### December 5, 2016

Convolutional Neural Network

Recurrent Neural Network

Implementations in Python

## Convolutional Neural Network

• Flattening the image's matrix of pixels into a long vector of pixel values loses all of the spatial structure in the image

[Figure: motivating example for statistical invariance. Does color matter? No, only the structure matters. The same weights $w_1$ and $w_2$ are reused at every position in the image.]

### Statistical Invariants

• Preserve the spatial relationship between pixels by learning internal feature representations using small squares of input data

• Features are learned and used across the whole image
-  allowing objects in the image to be shifted or translated in the scene and still be detected by the network

-  fewer parameters to learn than a fully connected network
-  designed to be invariant to object position and distortion in the scene
-  automatically learn and generalize features from the input domain

### Convolutional Layers

• Filters:
-  essentially the neurons of the layer
-  like a neuron, they have weighted inputs and generate an output value
-  the input size is a fixed square called a patch or a receptive field

• Feature Maps:
-  the output of one filter applied to the previous layer
-  e.g. a given filter is drawn across the entire previous layer, moved one pixel at a time; each position results in an activation of the neuron, and the outputs are collected in the feature map
-  the distance the filter is moved across the input from the previous layer between activations is referred to as the stride (see the sketch below)
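
To make the filter, feature-map, and stride terminology concrete, here is a minimal NumPy sketch (not from the original slides) that drags a single 5 × 5 filter across a 32 × 32 image with a stride of 1; the random image and filter values are purely illustrative.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` one stride at a time and
    collect the activations into a feature map (no padding)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # one activation
    return feature_map

image = np.random.rand(32, 32)          # a 32 x 32 gray-scale image
kernel = np.random.rand(5, 5)           # a 5 x 5 filter (receptive field)
print(convolve2d(image, kernel).shape)  # (28, 28) feature map
```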

### Pooling Layers

• Down-sample the previous layer's feature map

• Intended to consolidate the features learned and expressed in the previous layer's feature map

• May be considered as a technique to compress or generalize feature representations and generally reduce the overfitting of the training data by the model

• They also have a receptive field, often much smaller than the convolutional layer

• They are often very simple: taking the average or the maximum of the input values to create their own feature map (as sketched below)
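
In the same spirit as the convolution sketch above, a toy NumPy illustration (not from the original slides) of 2 × 2 max pooling with a stride of 2, which halves each dimension of a feature map:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Down-sample by taking the maximum value in each patch."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            pooled[i, j] = patch.max()
    return pooled

fmap = np.random.rand(28, 28)   # one feature map from the convolutional layer
print(max_pool(fmap).shape)     # (14, 14)
```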

### Fully Connected Layers

• Normal flat feedforward layers

• Usually used at the end of the network after feature extraction and consolidation has been performed by the convolutional and pooling layers

• Used to create final nonlinear combinations of features and for making predictions by the network

### Example

• A dataset of gray scale images
-  Size: 32 × 32 × 1 (width × height × channels)

• Convolutional Layer:
-  10 filters
-  a patch 5 pixels wide and 5 pixels high
-  a stride length of 1

• Since each filter can only get input from 5 × 5 (25) pixels at a time, each requires 25 + 1 input weights (25 weights plus one bias)

• Dragging the 5 × 5 patch across the input image with a stride of 1 results in a feature map of 28 × 28 output values (or 784 distinct activations per image)

• We have 10 filters, so we will get 10 different 28 × 28 feature maps (or 7,840 outputs for one image)

• In summary, (25 + 1) × 10 × 28 × 28 = 203,840 connections in the convolutional layer

### Example

• Pooling layer:
-  a patch with a width of 2 and a height of 2
-  a stride of 2
-  use a max() operation for each patch so that the activation is the maximum input value

• This results in feature maps whose width and height are each half those of the input feature maps
-  e.g. from 10 different 28 × 28 feature maps as input to 10 different 14 × 14 feature maps as output

• Fully connected layer:
-  flatten out the square feature maps into a traditional flat fully-connected layer
-  200 hidden neurons, each with 10 × 14 × 14 input connections (or 1,960+1 weights per neuron)
-  a total of 392,200 connections and weights to learn in this layer
-  finally, we use a sigmoid function to output probabilities of class values directly (the full example is sketched in code below)
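
The worked example above can be written down as a small model; this is a sketch assuming the Keras (tf.keras) Sequential API, which the original slides may or may not have used, and the ReLU activations and the number of output classes (10) are assumptions not specified in the slides.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

n_classes = 10  # assumption: the number of target classes is not given in the slides

model = Sequential([
    Input(shape=(32, 32, 1)),                # 32 x 32 gray-scale images
    # 10 filters, 5 x 5 receptive field, stride 1 -> ten 28 x 28 feature maps
    Conv2D(10, (5, 5), strides=(1, 1), activation='relu'),
    # 2 x 2 max pooling, stride 2 -> ten 14 x 14 feature maps
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    Flatten(),                               # 10 x 14 x 14 = 1,960 values per image
    Dense(200, activation='relu'),           # 200 x (1,960 + 1) = 392,200 weights
    Dense(n_classes, activation='sigmoid'),  # class probabilities via sigmoid
])
model.summary()
```

Note that because the filter weights are shared across all 784 positions, `model.summary()` reports only 10 × (25 + 1) = 260 parameters for the convolutional layer even though it makes 203,840 connections, while the first fully connected layer shows the full 392,200 weights from the arithmetic above.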

### Best Practices for CNN

• Input receptive field dimensions:
-  1D for words in a sentence
-  2D for images
-  3D for videos

• Receptive field size:
-  patch should be as small as possible, but large enough to "see" features in the input data
-  common to use 3 × 3 on small images and 5 × 5 or 7 × 7 and more on larger image sizes

• Stride Width:
-  start with the default stride of 1 (easy to understand and don't need padding to handle the receptive field falling off the edge of the images)
-  usually increase to 2 or higher for larger images

• Number of filters:
-  filters are feature detectors
-  usually fewer filters are used at the input layer, and increasingly more filters used at deeper layers

### Best Practices for CNN

• Padding:
-  useful when you cannot standardize input image sizes, or when the patch and stride sizes do not neatly divide the image size

• Pooling:
-  a generalization process to reduce overfitting
-  patch size is almost always set to 2 × 2 with a stride of 2 to discard 75% of the activations from the output of the previous layer

• Data preparation:
-  standardize input data (both the dimensions of the images and the pixel values)

• Dropout:
-  CNNs have a habit of overfitting, even with pooling layers
-  use dropout, for example between fully connected layers and perhaps after pooling layers (see the sketch below)
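
A brief sketch of where dropout is commonly placed (after a pooling layer and between fully connected layers), again assuming the Keras Sequential API; the layer sizes and the dropout rates of 0.25 and 0.5 are illustrative assumptions, not recommendations from the slides.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential([
    Input(shape=(32, 32, 1)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),                    # dropout after a pooling layer
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),                     # dropout between fully connected layers
    Dense(10, activation='sigmoid'),
])
```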

## Recurrent Neural Network

### Sequences

• Time-series data:
-  e.g. the price of a stock over time

• Classical feedforward NN:
-  define a window size (e.g. 5)
-  train the network to make short-term predictions from the fixed-size window of inputs
-  limitation: how to choose the window size (a windowing sketch follows this list)

• Different types of sequence problems:
-  one-to-many: sequence output, for image captioning
-  many-to-one: sequence input, for sentiment classification
-  many-to-many: sequence in and out, for machine translation
-  synchronized many-to-many: synced sequences in and out, for video classification
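
As referenced in the list above, here is a minimal NumPy sketch of the fixed-window framing: each sample's inputs are the previous five values and its target is the next value; the price series is made up for illustration.

```python
import numpy as np

def make_windows(series, window=5):
    """Frame a series as supervised pairs: the previous `window`
    values are the inputs, the next value is the target."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)

prices = np.array([10.0, 10.2, 10.1, 10.4, 10.6, 10.5, 10.9, 11.0])
X, y = make_windows(prices, window=5)
print(X.shape, y.shape)  # (3, 5) (3,)
```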

### RNNs

• RNNs are a special type of NN designed for sequence problems

• An RNN can be thought of as the addition of loops to the architecture of a standard feedforward NN
-  the output of the network may feed back as an input to the network along with the next input vector, and so on

• The recurrent connections add state or memory to the network and allow it to learn broader abstractions from the input sequences

• Two major issues:
-  how to train the network with backpropagation
-  how to stop gradients vanishing or exploding during training

### LSTM

• Long Short-Term Memory network
-  overcomes the vanishing gradient problem
-  can be used to create large RNNs

• Instead of neurons, LSTM has memory blocks that are connected into layers
-  a block contains gates that manage the block's state and output
-  a unit operates on an input sequence, and each gate within a unit uses the sigmoid activation function to control whether it is triggered
• Three types of gates within a memory unit:
-  Input Gate: conditionally decides which values from the input to update the memory state
-  Forget Gate: conditionally decides what information to discard from the unit
-  Output Gate: conditionally decides what to output based on the input and the memory of the unit (a code sketch follows)
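
A minimal sketch of a many-to-one LSTM (e.g. sentiment classification, as in the sequence-problem list above), assuming the Keras Sequential API; the vocabulary size, sequence length, embedding dimension, and number of memory units are illustrative assumptions.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 5000  # assumption: size of the word index
seq_length = 100   # assumption: sequences padded/truncated to a fixed length

model = Sequential([
    Input(shape=(seq_length,)),      # one integer word index per time step
    Embedding(vocab_size, 32),       # learn a 32-dimensional embedding per word
    LSTM(100),                       # 100 memory blocks with input/forget/output gates
    Dense(1, activation='sigmoid'),  # many-to-one output, e.g. sentiment
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
```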

By Hui Hu
