Deep Learning with Tensorflow:

Achieving Optimal Model Performance

Sahil Singh

Topics of Discussion

  • Input Normalisation
  • Weight Initialization Strategies
  • Loss Functions
  • Activation Functions
  • Choice of Optimizers
  • Dropout
  • Optimizer Hyperparameters (learning rate, epochs, etc.)
  • Model Hyperparameters (no. of hidden layers, hidden layer size, etc.)
  • Batch Normalization

Structure of the Workshop

  • Highlighting the problem


     
  • A discussion of the technique that can address the problem


     
  • Implementation in Code

Starting model

A benchmark network in Keras.

 

Dataset: MNIST

No. of hidden layers: 2

No. of hidden units: 10

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Flatten, Activation
from keras.optimizers import SGD
from keras.utils import to_categorical
from matplotlib import pyplot as plt
import seaborn as sns

(x_train, y_train), (x_test, y_test) = mnist.load_data()

num_classes = 10

# Apply one-hot encoding to the labels
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

# Break training data into training and validation sets
(x_train, x_valid) = x_train[:50000], x_train[50000:]
(y_train, y_valid) = y_train[:50000], y_train[50000:]

# Model definition
model = Sequential()
model.add(Flatten(input_shape=(28,28)))

# Model hyperparameter
num_hidden_layers = 2
num_hidden_units = 10



for _ in range(num_hidden_layers):
    model.add(Dense(num_hidden_units))
    model.add(Activation('relu'))

# Output Layer
model.add(Dense(num_classes))
model.add(Activation('softmax'))

# Set optimization hyperparameters
batch_size = 256
epochs = 10

model.compile(
    optimizer='adam', 
    loss='categorical_crossentropy',
    metrics=['accuracy'])

history = model.fit(
    x_train, y_train, 
    epochs=epochs, 
    batch_size=batch_size,
    validation_data=(x_valid, y_valid), 
    verbose=2)

accuracy_score = model.evaluate(
    x_test, y_test, batch_size=128)

print(f'\nTest accuracy: {accuracy_score[1]:>5.3f}')
Epoch 1/10
1s - loss: 2.8814 - acc: 0.2122 - val_loss: 1.9508 - val_acc: 0.2677
Epoch 2/10
0s - loss: 1.9028 - acc: 0.2714 - val_loss: 1.8804 - val_acc: 0.2859
...
...
...
Epoch 10/10
0s - loss: 0.9046 - acc: 0.6434 - val_loss: 0.8798 - val_acc: 0.6495
 5632/10000 [===============>..............] - ETA: 0s
Test accuracy: 0.648

Model Evaluation

Input Normalisation

Rule of Thumb

Input values and activations should be small and not come from highly variable distributions.

Different units for features

  • Features can be measured in different units - e.g. kilometres and kilograms
  • Normalization puts all features on a comparable scale
  • Large input values push activations into saturation, shrinking the gradients
  • Results in more efficient training
  • Two common schemes: Min-Max Normalization and Z-Score Normalization

Min-Max Normalization

A' = \left(\frac{A - \min{A}}{\max{A} - \min{A}}\right) * (D - C) + C

A is a feature value, min A and max A are taken over that feature, and [C, D] is the interval that we want to map to.

For the range [0, 1], the formula becomes:

A' = \left(\frac{A - \min{A}}{\max{A} - \min{A}}\right)
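A minimal sketch of both schemes applied to the MNIST inputs of the starting model (variable names reused from that code; the Z-Score statistics must come from the training set only):

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# Min-Max normalization to [0, 1]: pixels range from 0 to 255,
# so (A - min A) / (max A - min A) reduces to dividing by 255
x_train_minmax = x_train / 255.0
x_test_minmax = x_test / 255.0

# Z-Score normalization: subtract the training mean and divide by the
# training standard deviation
mean, std = x_train.mean(), x_train.std()
x_train_z = (x_train - mean) / std
x_test_z = (x_test - mean) / std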

Weight Initialization Strategies

 

The problem with improper weights

Strategies for initialization

  • All zeros
  • Sampling Normal Distribution ~ (0, 1)
  • Sampling Normal Distribution ~ (0, 0.1)
  • Xavier Initialization
  • He Initialization

All Zeros

  • Never do this!
  • All neurons in a layer compute identical activations
  • And receive identical gradients during backpropagation
  • We need to break the symmetry between neurons

Sampling Normal Distribution ~ (0,1)

  • Can result in the weighted inputs being large
  • Saturating Neurons
tf.truncated_normal(shape=(INPUT, OUTPUT), mean=0, stddev=1)

Sampling Normal Distribution ~ (0,0.1)

  • Weights concentrated near 0
  • Efficient learning
  • Can be used as the go-to strategy
tf.truncated_normal(shape=(INPUT, OUTPUT), mean=0, stddev=0.1)

Xavier Initialization

  • Glorot & Bengio (2010)
  • Sample a Normal Distribution ~ (0, 1/N), i.e. variance 1/N, where N is the number of inputs to the layer
  • Works well in many cases
  • Not ideal for ReLU activations
W = tf.get_variable("W", shape=[INPUT, OUTPUT], initializer=tf.contrib.layers.xavier_initializer())

He Initialization

  • Sample a Normal Distribution ~ (0, 2/N), i.e. variance 2/N, where N is the number of inputs to the layer
  • Designed with ReLU activations in mind
W = tf.get_variable("W", shape=[INPUT, OUTPUT], initializer=tf.contrib.layers.variance_scaling_initializer())
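The snippets above use the low-level TensorFlow API; in the Keras benchmark model the same strategies can be plugged in via the (assumed Keras 2) kernel_initializer argument and built-in initializers:

from keras.layers import Dense
from keras.initializers import TruncatedNormal

# Normal ~ (0, 0.1): truncated normal with a small standard deviation
Dense(10, kernel_initializer=TruncatedNormal(mean=0.0, stddev=0.1))

# Xavier (Glorot) initialization
Dense(10, kernel_initializer='glorot_normal')

# He initialization - variance scaled by 2/N, suited to ReLU layers
Dense(10, kernel_initializer='he_normal')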

Activation Functions

Activation Functions

Sigmoid

  • The problem of vanishing gradient
  • Output is not zero-centered
  • Not a good choice for hidden layers
  • Its main remaining use is the final output layer for binary classification

Activation Functions

Tanh

  • The problem of vanishing gradient remains
  • Output is zero-centered
  • Not the ideal choice for hidden layers
  • In practice, performs better than Sigmoid

Activation Functions

Rectified Linear Units (ReLU)

  • Solves the problem of vanishing gradient
  • Standard non-linearity for hidden layers
  • Can suffer from 'dead' hidden units (zero gradient for negative inputs) - compare the gradient sketch below
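A quick NumPy sketch (illustrative values only) of why the gradients behave so differently: the sigmoid and tanh derivatives shrink towards zero for large inputs, while ReLU passes a gradient of 1 for positive inputs and exactly 0 for negative ones - the 'dead unit' case:

import numpy as np

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sig = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sig * (1.0 - sig)        # at most 0.25, nearly 0 for large |x|
d_tanh = 1.0 - np.tanh(x) ** 2       # at most 1.0, nearly 0 for large |x|
d_relu = (x > 0).astype(float)       # 1 for positive inputs, 0 otherwise

print(d_sigmoid)   # ~[0.00005, 0.105, 0.25, 0.105, 0.00005]
print(d_tanh)      # ~[0.0, 0.07, 1.0, 0.07, 0.0]
print(d_relu)      #  [0.0, 0.0, 0.0, 1.0, 1.0]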

Batch Normalization

Batch Normalization

  • Normalizing inputs to all the layers of a network
  • Normalization using mean and variance of the current mini-batch
  • Inference happens using population mean and variance

Batch Normalization

  • Each training step is slower (extra computation per layer)
  • Helps with the problem of vanishing gradients
  • Added job of regularizing the model
  • Leads to faster convergence to optimal accuracy.
  • Allows higher learning rates, faster training.
  • Allows building much deeper networks (see the Keras sketch below).
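A minimal Keras sketch, reusing the hyperparameter names from the starting model, with a BatchNormalization layer inserted before each hidden-layer activation:

from keras.models import Sequential
from keras.layers import Dense, Flatten, Activation, BatchNormalization

model = Sequential()
model.add(Flatten(input_shape=(28, 28)))

for _ in range(num_hidden_layers):
    model.add(Dense(num_hidden_units))
    # Normalize the pre-activations of each mini-batch, then apply ReLU
    model.add(BatchNormalization())
    model.add(Activation('relu'))

model.add(Dense(num_classes))
model.add(Activation('softmax'))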

Before Batch Normalization

After Batch Normalization

Dropout

Image Source: http://cs231n.github.io

Dropout

  • Main job is to prevent overfitting
  • Works extremely well.
  • Increases training time.
  • Requires hyperparameter tuning (keep probability) - see the sketch below
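A minimal Keras sketch, again reusing the names from the starting model (note that Keras's Dropout takes the drop rate, i.e. 1 - keep probability; the 0.2 here is only an illustrative starting point):

from keras.layers import Dropout

for _ in range(num_hidden_layers):
    model.add(Dense(num_hidden_units))
    model.add(Activation('relu'))
    # Randomly drop 20% of this layer's activations during training;
    # dropout is disabled automatically at inference time
    model.add(Dropout(0.2))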

Choice of optimizers

  • Optimal Learning rate
  • Why use the same learning rate for every parameter?
  • Escape sub-optimal local minima
  • Deal with Saddle points

Choice of optimizers

  • GradientDescentOptimizer
     
  • Momentum Optimizer
     
  • AdagradOptimizer
     
  • AdamOptimizer

Gradient Descent Optimizer

  • The baseline optimizer
  • In practice it is fed mini-batches, i.e. used as the Stochastic Gradient Descent optimizer
  • The only hyperparameter to tune is the learning rate

Momentum Optimizer

  • Softens the update oscillations in irrelevant directions
  • Adds a fraction of the past step update to the current one
  • Parameter vector builds up velocity in the direction of consistent gradients
  • Can overshoot minima - Use Nesterov Momentum

AdaGrad Optimizer

  • Uses different learning rate for every parameter based on past gradients
  • Parameters with high gradients will have learning rate reduced
  • Weights with small/infrequent updates have their learning rates increased
  • Disadvantage - Learning rates always decreasing
  • Can lead to situation where no learning happens
  • Rectified by AdaDelta Optimizer

Adam Optimizer

  • Combines the best of AdaGrad and RMSProp
  • AdaGrad works well for Sparse features and gradients
  • Often the best-performing choice in practice (see the compile sketch below)
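The list above names the TensorFlow optimizer classes; in the Keras benchmark model the corresponding choices are passed to model.compile. A minimal sketch (learning-rate and momentum values are illustrative only):

from keras.optimizers import SGD, Adagrad, Adam

# Plain (stochastic) gradient descent - only the learning rate to tune
sgd = SGD(lr=0.01)

# SGD with Nesterov momentum - dampens oscillations, builds up velocity
sgd_momentum = SGD(lr=0.01, momentum=0.9, nesterov=True)

# Adagrad - per-parameter learning rates based on past gradients
adagrad = Adagrad(lr=0.01)

# Adam - combines the advantages of AdaGrad and RMSProp
adam = Adam(lr=0.001)

model.compile(optimizer=adam,
              loss='categorical_crossentropy',
              metrics=['accuracy'])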

Hyperparameters

Optimizer Hyperparameters

  • Learning Rate
     
  • Momentum
     
  • Mini-batch size
     
  • Number of Epochs

 

Learning Rate

  • The most important hyperparameter to tune.
  • Values that are too large make training very unstable.
  • Values that are too small take too long to converge (compare the sketch below).
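One rough way to see both failure modes is to retrain the benchmark network for a few epochs at several learning rates and compare the validation accuracy. A sketch under that assumption (build_model is an illustrative helper that rebuilds the starting model so each run begins from fresh weights):

from keras.models import Sequential
from keras.layers import Dense, Flatten, Activation
from keras.optimizers import SGD

def build_model():
    # Rebuild the benchmark network so every run starts from fresh weights
    m = Sequential()
    m.add(Flatten(input_shape=(28, 28)))
    for _ in range(num_hidden_layers):
        m.add(Dense(num_hidden_units))
        m.add(Activation('relu'))
    m.add(Dense(num_classes))
    m.add(Activation('softmax'))
    return m

for lr in [1.0, 0.1, 0.01, 0.001]:
    model = build_model()
    model.compile(optimizer=SGD(lr=lr),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train, epochs=3, batch_size=256,
                        validation_data=(x_valid, y_valid), verbose=2)
    # Too large: the loss oscillates or diverges; too small: it barely moves
    print(lr, history.history['val_acc'][-1])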

Batch Size

Small

  • Generally good for training.
  • Smaller, noisier updates and longer training lead to lower overall error.
  • Need to compensate with lower learning rates and more epochs.
  • Lower learning rates and more epochs mean longer training times.
  • Low memory usage.

Batch Size

Large

  • High memory consumption.
  • Fewer weight updates per epoch, which can keep the network from training effectively
  • Faster training.

Epochs

  • Need to stop training before network starts overfitting
  • Use a large value and checkpoint model training (sketched below)
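A minimal sketch of that pattern with Keras callbacks - ModelCheckpoint keeps the best weights seen so far and EarlyStopping halts once the validation loss stops improving (the file name and patience value are illustrative):

from keras.callbacks import ModelCheckpoint, EarlyStopping

callbacks = [
    # Save the weights whenever the validation loss improves
    ModelCheckpoint('model.best.hdf5', monitor='val_loss',
                    save_best_only=True, verbose=1),
    # Stop if the validation loss has not improved for 3 epochs in a row
    EarlyStopping(monitor='val_loss', patience=3, verbose=1),
]

model.fit(x_train, y_train,
          epochs=100,               # deliberately large; the callbacks take over
          batch_size=256,
          validation_data=(x_valid, y_valid),
          callbacks=callbacks,
          verbose=2)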

Model Hyperparameters

  • Number of Hidden Layers

     
  • Number of Hidden Units

Number of Hidden Layers

  • Increases the representational power of the model
     
  • Increasing helps capture hierarchical structures better - e.g. CNNs

Number of Hidden Units

  • Should be as large as the budget allows - more parameters give the model more capacity
  • Using the same size for all hidden layers works better in practice
  • Overcomplete layers work better than undercomplete layers (see the sketch below).
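Both hyperparameters are already exposed in the starting model (num_hidden_layers, num_hidden_units). A rough sketch of a small grid search over them, reusing the data and num_classes from that code (the candidate values are illustrative only):

from keras.models import Sequential
from keras.layers import Dense, Flatten, Activation

results = {}
for num_hidden_layers in [1, 2, 3]:
    for num_hidden_units in [10, 64, 256]:
        model = Sequential()
        model.add(Flatten(input_shape=(28, 28)))
        # Same width for every hidden layer, as suggested above
        for _ in range(num_hidden_layers):
            model.add(Dense(num_hidden_units))
            model.add(Activation('relu'))
        model.add(Dense(num_classes))
        model.add(Activation('softmax'))
        model.compile(optimizer='adam', loss='categorical_crossentropy',
                      metrics=['accuracy'])
        model.fit(x_train, y_train, epochs=10, batch_size=256,
                  validation_data=(x_valid, y_valid), verbose=0)
        loss, val_acc = model.evaluate(x_valid, y_valid, verbose=0)
        results[(num_hidden_layers, num_hidden_units)] = val_acc

print(results)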

