Flattening the image matrix of pixels into a long vector of pixel values loses all of the spatial structure in the image
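As a small illustration (a NumPy sketch; the pixel coordinates below are made up for the example), two pixels that are vertical neighbours in a 32 × 32 image end up a full row width apart once the image is flattened:

```python
import numpy as np

image = np.arange(32 * 32).reshape(32, 32)   # a toy 32 x 32 "image"
flat = image.flatten()                       # 1,024-element vector

# Pixels (10, 5) and (11, 5) are vertical neighbours in the image grid...
row_a = 10 * 32 + 5   # index of pixel (10, 5) in the flat vector
row_b = 11 * 32 + 5   # index of pixel (11, 5) in the flat vector
print(flat[row_a] == image[10, 5], flat[row_b] == image[11, 5])  # True True
print(row_b - row_a)  # 32: adjacent pixels are now a full row apart
```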
[Figure: the same letter 'C' drawn in several different colors, each still labeled 'C']
Does color matter?
No, only the structure matters
[Figure: the letter 'C' appearing at different positions within the image]
Translation Invariance
Weight Sharing
[Figure: weight sharing: the same weights (w_1, w_2) are reused across positions and across the R, G, B color channels]
Statistical Invariants
Preserve the spatial relationship between pixels by learning internal feature representations using small squares of input data
Features are learned and used across the whole image
- allowing objects in the images to be shifted or translated in the scene and still be detected by the network
Advantages of CNNs:
- fewer parameters to learn than a fully connected network
- designed to be invariant to object position and distortion in the scene
- automatically learn and generalize features from the input domain
CNN
Building Blocks of CNNs
Convolutional Layers
Filters:
- essentially the neurons of the layer
- have both weighted inputs and generate an output value like a neuron
- the input size is a fixed square called a patch or a receptive field
Feature Maps:
- the output of one filter applied to the previous layer
- e.g. A given filter is drawn across the entire previous layer, moved one pixel at a time. Each position results in an activation of the neuron and the output is collected in the feature map
- the distance the filter is moved across the input from the previous layer between successive activations is referred to as the stride (see the sketch below)
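A rough sketch in plain Python/NumPy of how a single filter, patch, and stride produce a feature map (the helper name and random values are only for illustration):

```python
import numpy as np

def feature_map(image, kernel, stride=1):
    """Drag one filter (kernel) across the image, collecting activations."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]   # the receptive field
            fmap[i, j] = np.sum(patch * kernel)         # one activation
    return fmap

image = np.random.rand(32, 32)        # toy single-channel image
kernel = np.random.rand(5, 5)         # one 5 x 5 filter (its weights)
print(feature_map(image, kernel).shape)   # (28, 28)
```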
Pooling Layers
Down-sample the previous layer's feature map
Intended to consolidate the features learned and expressed in the previous layer's feature map
May be considered as a technique to compress or generalize feature representations and generally reduce the overfitting of the training data by the model
They also have a receptive field, often much smaller than the convolutional layer
They are often very simple: taking the average or the maximum of the input values in order to create their own feature map, as sketched below
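A minimal NumPy sketch of max pooling with a 2 × 2 patch and a stride of 2 (the helper name is hypothetical):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Down-sample a feature map by taking the maximum of each patch."""
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = fmap[i * stride:i * stride + size,
                                j * stride:j * stride + size].max()
    return pooled

fmap = np.random.rand(28, 28)
print(max_pool(fmap).shape)   # (14, 14): half the width and height
```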
Fully Connected Layers
Normal flat feedforward layers
Usually used at the end of the network after feature extraction and consolidation has been performed by the convolutional and pooling layers
Used to create final nonlinear combinations of features and for making predictions by the network
Example
A dataset of gray scale images
- Size: 32 × 32 × 1 (width × height × channels)
Convolutional Layer:
- 10 filters
- a patch 5 pixels wide and 5 pixels high
- a stride length of 1
Since each filter can only get input from 5 × 5 (25) pixels at a time, each will require 25 + 1 input weights (25 pixel weights plus a bias)
Dragging the 5 × 5 patch across the input image data with a stride of 1 results in a feature map of 28 × 28 output values (or 784 distinct activations per image)
We have 10 filters, so we will get 10 different 28 × 28 feature maps (or 7,840 outputs for one image)
In summary, 26 × 10 × 28 × 28 (or 203,840) connections in the convolutional layer
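The figures above can be checked with a few lines of plain Python (the variable names are only for the worked example):

```python
patch = 5 * 5                        # 25 pixels in the receptive field
weights_per_filter = patch + 1       # 25 weights plus a bias = 26
fmap_side = 32 - 5 + 1               # 28 positions per dimension at stride 1
activations_per_filter = fmap_side ** 2            # 784
activations_total = 10 * activations_per_filter    # 7,840 for 10 filters
connections = weights_per_filter * activations_total
print(fmap_side, activations_per_filter, activations_total, connections)
# 28 784 7840 203840
```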
Example
Pooling layer:
- a patch with a width of 2 and a height of 2
- a stride of 2
- use a max() operation for each patch so that the activation is the maximum input value
This results in feature maps whose width and height are half those of the input feature maps
- e.g. from 10 different 28 × 28 feature maps as input to 10 different 14 × 14 feature maps as output
Fully connected layer:
- flatten out the square feature maps into a traditional flat fully-connected layer
- 200 hidden neurons, each with 10 × 14 × 14 input connections (or 1,960+1 weights per neuron)
- a total of 392,200 connections and weights to learn in this layer
- finally, we use a sigmoid function to output probabilities of class values directly
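A sketch of this example as a Keras model (assuming TensorFlow/Keras; the ReLU activations and the choice of 10 output classes are assumptions, not part of the example above):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(10, kernel_size=(5, 5), strides=1, activation='relu',
           input_shape=(32, 32, 1)),               # 10 filters, 5 x 5 patch
    MaxPooling2D(pool_size=(2, 2), strides=2),     # 28 x 28 -> 14 x 14
    Flatten(),                                     # 10 x 14 x 14 = 1,960 inputs
    Dense(200, activation='relu'),                 # 200 hidden neurons
    Dense(10, activation='sigmoid'),               # class probabilities
])
model.summary()   # Conv2D: 260 weights (26 per filter); Dense(200): 392,200
```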
Best Practices for CNN
Input receptive field dimensions:
- 1D for words in a sentence
- 2D for images
- 3D for videos
Receptive field size:
- patch should be as small as possible, but large enough to "see" features in the input data
- common to use 3 × 3 on small images and 5 × 5 or 7 × 7 and more on larger image sizes
Stride Width:
- start with the default stride of 1 (it is easy to understand, and no padding is needed to handle the receptive field falling off the edge of the images)
- usually increase to 2 or higher for larger images
Number of filters:
- filters are feature detectors
- usually fewer filters are used at the input layer, and increasingly more filters used at deeper layers
Best Practices for CNN
Padding:
- use zero-padding when the receptive field falls off the edge of the image and would read non-input data
- useful when you cannot standardize input image sizes or the patch and stride sizes cannot neatly divide up the image size
Pooling:
- a generalization process to reduce overfitting
- patch size is almost always set to 2 × 2 with a stride of 2 to discard 75% of the activations from the output of the previous layer
Data preparation:
- standardize input data (both the dimensions of the images and the pixel values)
Dropout:
- CNNs have a habit of overfitting, even with pooling layers
- dropout should be used, for example between the fully connected layers and perhaps after pooling layers, as in the sketch below
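A sketch (assuming Keras) combining several of these practices: zero-padding via padding='same', small 3 × 3 receptive fields, more filters in deeper layers, 2 × 2 max pooling, and dropout after pooling and between the fully connected layers. The filter counts, dropout rates, and input size are arbitrary choices for illustration:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Dropout,
                                     Flatten, Dense)

model = Sequential([
    Conv2D(32, (3, 3), padding='same', activation='relu',
           input_shape=(64, 64, 3)),            # small 3 x 3 receptive field
    MaxPooling2D(pool_size=(2, 2), strides=2),
    Dropout(0.25),                              # dropout after pooling
    Conv2D(64, (3, 3), padding='same', activation='relu'),  # more filters deeper
    MaxPooling2D(pool_size=(2, 2), strides=2),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),                               # dropout between FC layers
    Dense(10, activation='softmax'),
])
```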
Recurrent Neural Network
Sequences
Time-series data:
- e.g. the price of a stock over time
Classical feedforward NN:
- define a window size (e.g. 5)
- train the network to learn to make short term predictions from the fixed sized window of inputs
- limitation: how to determine the window size (see the sketch below)
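A minimal NumPy sketch of this window-based framing for a univariate series (the series values and the window size of 5 are placeholders):

```python
import numpy as np

series = np.arange(20, dtype=float)   # stand-in for e.g. a stock price series
window = 5                            # the fixed window size

X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]                   # target: the next value after each window
print(X.shape, y.shape)               # (15, 5) (15,)
```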
Different types of sequence problems:
- one-to-many: sequence output, for image captioning
- many-to-one: sequence input, for sentiment classification
- many-to-many: sequence in and out, for machine translation
- synchronized many-to-many: synced sequences in and out, for video classification
RNNs
RNNs are a special type of NN designed for sequence problems
An RNN can be thought of as the addition of loops to the architecture of a standard feedforward NN
- the output of the network may feed back as an input to the network with the next input vector, and so on
The recurrent connections add state or memory to the network and allow it to learn broader abstractions from the input sequences
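A bare-bones NumPy sketch of that loop: the hidden state is fed back in with each new input vector (the weight matrices and sizes are arbitrary, untrained values):

```python
import numpy as np

n_in, n_hidden = 3, 4
W_x = np.random.randn(n_hidden, n_in)      # input-to-hidden weights
W_h = np.random.randn(n_hidden, n_hidden)  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                     # the network's state / memory
for x_t in np.random.randn(6, n_in):       # a toy sequence of 6 input vectors
    h = np.tanh(W_x @ x_t + W_h @ h + b)   # the previous output feeds back in
print(h)                                   # final state summarizes the sequence
```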
Two major issues:
- how to train the network with backpropagation
- how to stop gradients vanishing or exploding during training
LSTM
Long Short-Term Memory network
- overcomes the vanishing gradient problem
- can be used to create large RNNs
Instead of neurons, LSTM has memory blocks that are connected into layers
- a block contains gates that manage the block's state and output
- a unit operates upon an input sequence, and each gate within a unit uses the sigmoid activation function to control whether it is triggered or not
Three types of gates within a memory unit:
- Input Gate: conditionally decides which values from the input to update the memory state
- Forget Gate: conditionally decides what information to discard from the unit
- Output Gate: conditionally decides what to output based on input and the memory of the unit
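A sketch (assuming Keras) of a small LSTM used in a many-to-one setup such as sentiment classification; the vocabulary size, embedding dimension, and number of memory units are placeholder values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=5000, output_dim=32),  # word ids -> 32-d vectors
    LSTM(100),                                 # 100 memory units (gated blocks)
    Dense(1, activation='sigmoid'),            # e.g. positive vs. negative
])
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```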