Health Data Science Meetup
December 5, 2016
Convolutional Neural Network
Recurrent Neural Network
Implementations in Python
Convolutional Neural Network
- Flattening the image's matrix of pixels into a long vector of pixel values loses all of the spatial structure in the image
[Slide figure: the letter "C" shown in several colors and positions]
Does color matter?
No, only the structure matters
Translation Invariance
Weight Sharing
[Slide figure: weight sharing — the same weights (w1, w2) are applied across patches of the R, G, and B channels]
Statistical Invariants
- Preserve the spatial relationship between pixels by learning internal feature representations using small squares of input data
- Features are learned and used across the whole image
- allowing objects in the images to be shifted or translated in the scene and still be detected by the network
- Advantages of CNNs:
- fewer parameters to learn than a fully connected network
- designed to be invariant to object position and distortion in the scene
- automatically learn and generalize features from the input domain
CNN
Building Blocks of CNNs
Convolutional Layers
- Filters:
- essentially the neurons of the layer
- have both weighted inputs and generate an output value like a neuron
- the input size is a fixed square called a patch or a receptive field
- Feature Maps:
- the output of one filter applied to the previous layer
- e.g. A given filter is drawn across the entire previous layer, moved one pixel at a time. Each position results in an activation of the neuron and the output is collected in the feature map
- the distance the filter moves across the input from the previous layer between activations is referred to as the stride
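A minimal sketch of how one filter produces a feature map (pure Python with illustrative helper names; real implementations use optimized libraries):

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a 2-D image (no padding, no bias).

    Each position of the kernel produces one activation; the collected
    activations form the feature map.
    """
    n, k = len(image), len(kernel)
    feature_map = []
    for i in range(0, n - k + 1, stride):
        row = []
        for j in range(0, n - k + 1, stride):
            # weighted sum of the k x k patch under the kernel
            act = sum(image[i + a][j + b] * kernel[a][b]
                      for a in range(k) for b in range(k))
            row.append(act)
        feature_map.append(row)
    return feature_map

# a 32 x 32 input with a 5 x 5 filter and stride 1 yields a 28 x 28 map
image = [[1.0] * 32 for _ in range(32)]
kernel = [[0.04] * 5 for _ in range(5)]
fmap = conv2d(image, kernel)
```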
Pooling Layers
- Down-sample the previous layer's feature map
- Intended to consolidate the features learned and expressed in the previous layer's feature map
- May be considered as a technique to compress or generalize feature representations and generally reduce the overfitting of the training data by the model
- They also have a receptive field, often much smaller than the convolutional layer
- They are often very simple: taking the average or the maximum of the input values in order to create its own feature map
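Max pooling can be sketched in a few lines (pure Python, illustrative helper name):

```python
def max_pool(feature_map, size=2, stride=2):
    """Down-sample a feature map by taking the max of each patch."""
    n = len(feature_map)
    pooled = []
    for i in range(0, n - size + 1, stride):
        row = []
        for j in range(0, n - size + 1, stride):
            row.append(max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size)))
        pooled.append(row)
    return pooled

fm = [[1, 3, 2, 4],
      [5, 6, 7, 8],
      [9, 2, 1, 0],
      [3, 4, 5, 6]]
pooled = max_pool(fm)   # 4 x 4 -> 2 x 2: [[6, 8], [9, 6]]
```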
Fully Connected Layers
- Normal flat feedforward layers
- Usually used at the end of the network after feature extraction and consolidation has been performed by the convolutional and pooling layers
- Used to create final nonlinear combinations of features and for making predictions by the network
Example
- A dataset of gray scale images
- Size: 32 × 32 × 1 (width × height × channels)
- Convolutional Layer:
- 10 filters
- a patch 5 pixels wide and 5 pixels high
- a stride length of 1
- Since each filter can only get input from 5 × 5 (25) pixels at a time, each will require 25+1 input weights (25 weights plus one bias)
- Dragging the 5 × 5 patch across the input image with a stride of 1 results in a feature map of 28 × 28 output values (or 784 distinct activations per image)
- With 10 filters, we get 10 different 28 × 28 feature maps (or 7,840 outputs for one image)
- In summary, 26 weights × 10 filters × 28 × 28 positions (or 203,840) connections in the convolutional layer
Example
- Pooling layer:
- a patch with a width of 2 and a height of 2
- a stride of 2
- use a max() operation for each patch so that the activation is the maximum input value
- This results in feature maps that are one half the size of the input feature maps
- e.g. from 10 different 28 × 28 feature maps as input to 10 different 14 × 14 feature maps as output
- Fully connected layer:
- flatten out the square feature maps into a traditional flat fully-connected layer
- 200 hidden neurons, each with 10 × 14 × 14 input connections (or 1,960+1 weights per neuron)
- a total of 392,200 connections and weights to learn in this layer
- finally, we use a sigmoid function to output probabilities of class values directly
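The counts in this example can be verified with a few lines of arithmetic (variable names are illustrative):

```python
# Convolutional layer: 10 filters, 5 x 5 patch, stride 1 on a 32 x 32 image
patch_weights = 5 * 5 + 1          # 25 weights plus 1 bias per filter
fmap_size = (32 - 5) // 1 + 1      # 28
conv_connections = patch_weights * 10 * fmap_size * fmap_size
print(conv_connections)            # 203840

# Pooling layer: 2 x 2 patch, stride 2 halves each dimension
pooled_size = fmap_size // 2       # 14

# Fully connected layer: 200 neurons over the flattened 10 x 14 x 14 maps
fc_weights_per_neuron = 10 * pooled_size * pooled_size + 1  # 1,960 + 1
fc_total = 200 * fc_weights_per_neuron
print(fc_total)                    # 392200
```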
Best Practices for CNN
- Input receptive field dimensions:
- 1D for words in a sentence
- 2D for images
- 3D for videos
- Receptive field size:
- patch should be as small as possible, but large enough to "see" features in the input data
- common to use 3 × 3 on small images and 5 × 5 or 7 × 7 and more on larger image sizes
- Stride Width:
- start with the default stride of 1 (easy to understand and don't need padding to handle the receptive field falling off the edge of the images)
- usually increase to 2 or higher for larger images
- Number of filters:
- filters are feature detectors
- usually fewer filters are used at the input layer, and increasingly more filters used at deeper layers
Best Practices for CNN
- Padding:
- pad the edges of the input with zeros so the receptive field can extend past the image border
- useful when you cannot standardize input image sizes or the patch and stride sizes cannot neatly divide up the image size
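Whether a patch and stride neatly divide an image can be checked with the standard output-size formula (a sketch; `pad` is the zero-padding added to each side):

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Output width of a convolution: (n + 2*pad - f) // stride + 1."""
    return (n + 2 * pad - f) // stride + 1

# 32 x 32 image, 5 x 5 patch, stride 1: fits with no padding
print(conv_output_size(32, 5))            # 28
# padding of 2 keeps the output the same size as the input
print(conv_output_size(32, 5, pad=2))     # 32
# the 2 x 2 / stride-2 pooling step: 28 -> 14
print(conv_output_size(28, 2, stride=2))  # 14
```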
- Pooling:
- a generalization process to reduce overfitting
- patch size is almost always set to 2 × 2 with a stride of 2 to discard 75% of the activations from the output of the previous layer
- Data preparation:
- standardize input data (both the dimensions of the images and the pixel values)
- Dropout:
- CNNs have a habit of overfitting, even with pooling layers
- dropout should be used such as between fully connected layers and perhaps after pooling layers
Recurrent Neural Network
Sequences
- Time-series data:
- e.g. price of a stock over time
- Classical feedforward NN:
- define a window size (e.g. 5)
- train the network to learn to make short term predictions from the fixed sized window of inputs
- limitation: it is difficult to determine the right window size
- Different types of sequence problems:
- one-to-many: sequence output, for image captioning
- many-to-one: sequence input, for sentiment classification
- many-to-many: sequence in and out, for machine translation
- synchronized many to many: synced sequences in and out, for video classification
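The fixed-window approach described above can be sketched as turning a series into (window, next-value) training pairs (illustrative helper, not from the slides):

```python
def make_windows(series, window=5):
    """Split a time series into fixed-size input windows and next-step targets."""
    inputs, targets = [], []
    for i in range(len(series) - window):
        inputs.append(series[i:i + window])   # the fixed-size input window
        targets.append(series[i + window])    # the value to predict next
    return inputs, targets

prices = [10, 11, 13, 12, 14, 15, 17, 16]
X, y = make_windows(prices, window=5)
# X[0] = [10, 11, 13, 12, 14], y[0] = 15
```

Each pair becomes one training example for a feedforward network; the limitation is that the network can never see further back than the chosen window.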
RNNs
- RNNs are a special type of NN designed for sequence problems
- An RNN can be thought of as the addition of loops to the architecture of a standard feedforward NN
- the output of the network may feedback as an input to the network with the next input vector, and so on
- The recurrent connections add state or memory to the network and allow it to learn broader abstractions from the input sequences
- Two major issues:
- how to train the network with back propagation
- how to stop gradients vanishing or exploding during training
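The vanishing/exploding issue can be illustrated numerically: backpropagating through T time steps multiplies the gradient by roughly the same factor T times (a toy sketch, not the full backpropagation-through-time computation):

```python
# repeated multiplication over 50 time steps
steps = 50
vanishing = 0.9 ** steps    # factor < 1: gradient shrinks toward 0
exploding = 1.1 ** steps    # factor > 1: gradient blows up
print(vanishing)            # ~0.005 — early inputs barely affect learning
print(exploding)            # ~117 — updates become unstable
```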
LSTM
- Long Short-Term Memory network
- overcomes the vanishing gradient problem
- can be used to create large RNNs
- Instead of neurons, LSTM has memory blocks that are connected into layers
- a block contains gates that manage the block's state and output
- a unit operates upon an input sequence, and each gate within a unit uses the sigmoid activation function to control whether it is triggered
- Three types of gates within a memory unit:
- Input Gate: conditionally decides which values from the input to update the memory state
- Forget Gate: conditionally decides what information to discard from the unit
- Output Gate: conditionally decides what to output based on input and the memory of the unit
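A single LSTM step with the three gates can be sketched in scalar form (a toy, untrained cell with hypothetical weights; real layers are vectorized and learned):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step for scalar input/state; w maps each gate to (wx, wh, bias)."""
    i = sigmoid(w['i'][0] * x + w['i'][1] * h_prev + w['i'][2])    # input gate
    f = sigmoid(w['f'][0] * x + w['f'][1] * h_prev + w['f'][2])    # forget gate
    o = sigmoid(w['o'][0] * x + w['o'][1] * h_prev + w['o'][2])    # output gate
    g = math.tanh(w['g'][0] * x + w['g'][1] * h_prev + w['g'][2])  # candidate value
    c = f * c_prev + i * g     # discard old memory, add selected new information
    h = o * math.tanh(c)       # gated output of the unit
    return h, c

# arbitrary illustrative weights (wx, wh, bias) for each gate
w = {'i': (0.5, 0.1, 0.0), 'f': (0.5, 0.1, 1.0),
     'o': (0.5, 0.1, 0.0), 'g': (0.5, 0.1, 0.0)}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.2]:     # run the cell over a short input sequence
    h, c = lstm_step(x, h, c, w)
```

The memory state `c` is what carries information across many time steps; because it is updated additively, gradients can flow through it without vanishing.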
Implementations
HDS Meetup 12/5/2016
By Hui Hu
Slides for the Health Data Science Meetup