Optimization Methods

for Deep Neural Networks

Table of Contents

  • Artificial Intelligence
    • Machine Learning
  • Neural Networks
    • Finding Optimal Parameters
  • Deep Learning
    • Representation Learning
    • Large Networks
    • Challenges

AI & ML

  • Artificial Intelligence is the intelligence demonstrated by machines or robots, as opposed to the natural intelligence displayed by humans or animals.

 

  • Machine Learning is a subset of AI that utilizes advanced statistical techniques to enable computing systems to improve at tasks with experience over time.


Neural Networks

Artificial Neuron

  • Input vector \(x\)
  • Weight vector \(w\)
  • Bias variable \(b\)
  • Nonlinear activation function \(f\)
  • Output variable \(y\)

$$y = f(w^T x + b)$$
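As a minimal sketch of this computation (the numeric values and the choice of sigmoid as \(f\) are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Nonlinear activation f."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values; the slide only defines the symbols x, w, b, f, y.
x = np.array([0.5, -1.0, 2.0])   # input vector
w = np.array([0.1, 0.4, -0.3])   # weight vector
b = 0.2                          # bias

y = sigmoid(w @ x + b)           # y = f(w^T x + b)
print(y)
```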

Neural Networks

  • A collection of connected artificial neurons.
  • Loosely models the neurons in a biological brain.

Neural Networks: XOR
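A small network can compute XOR, which a single neuron cannot. One common textbook construction is sketched below; the hand-set weights and the ReLU activation are an illustrative assumption, not necessarily the network shown on the slide.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hand-set weights for a 2-2-1 network that computes XOR.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])   # input -> hidden weights
c = np.array([0.0, -1.0])    # hidden biases
w = np.array([1.0, -2.0])    # hidden -> output weights
b = 0.0                      # output bias

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
H = relu(X @ W + c)          # hidden layer, matrix form
y = H @ w + b                # outputs: [0, 1, 1, 0]
print(y)
```

This also illustrates the matrix form on the next slide: a whole layer is computed for a batch of inputs at once as \(f(XW + c)\).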

Neural Networks: Matrix form

Activation Function

  • A function that introduces nonlinearity into the model
  • Sigmoid

$$f(x)=\frac{1}{1+e^{-\alpha x}}$$

  • Tanh

$$f(x)=\tanh(x) = 2\,\operatorname{sigmoid}(2x) - 1$$
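A quick sketch of both activations (\(\alpha = 1\) is assumed for the sigmoid), verifying the identity between tanh and sigmoid numerically:

```python
import numpy as np

def sigmoid(x, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * x))

def tanh(x):
    # Same identity as above: tanh(x) = 2*sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

x = np.linspace(-3, 3, 7)
print(sigmoid(x))
print(np.allclose(tanh(x), np.tanh(x)))  # True
```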

Loss Function

  • A function that measures the distance between the model's current output and the expected output.
  • Mean Squared Error:

$$L(y , \hat y) = \frac{1}{N}\sum_{i=1}^N(y_i - \hat y_i)^2$$

  • Mean Absolute Error:

$$L(y , \hat y) = \frac{1}{N}\sum_{i=1}^N|y_i - \hat y_i|$$
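Both losses in a few lines of NumPy (the array values are illustrative):

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean Absolute Error."""
    return np.mean(np.abs(y - y_hat))

y     = np.array([1.0, 0.0, 1.0, 1.0])   # expected outputs
y_hat = np.array([0.9, 0.2, 0.7, 1.1])   # current model outputs
print(mse(y, y_hat), mae(y, y_hat))
```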

Finding Optimal Parameters

If we employ:

  • A differentiable activation function
  • A differentiable loss function

Then to find the optimal parameters, we can use a first-order gradient-based optimization algorithm.

How do we find the gradient of the loss function w.r.t. the parameters?

Backpropagation
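As a minimal sketch of backpropagation (a single sigmoid neuron with a squared-error loss; the data and parameter values are illustrative assumptions), the chain rule gives the gradients of the loss with respect to \(w\) and \(b\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: one training example.
x, y = np.array([0.5, -1.0]), 1.0
w, b = np.array([0.1, 0.4]), 0.0

# Forward pass.
z = w @ x + b
y_hat = sigmoid(z)
loss = (y - y_hat) ** 2

# Backward pass (chain rule):
# dL/dy_hat = -2(y - y_hat),  dy_hat/dz = y_hat(1 - y_hat),  dz/dw = x,  dz/db = 1
dL_dyhat = -2.0 * (y - y_hat)
dyhat_dz = y_hat * (1.0 - y_hat)
grad_w = dL_dyhat * dyhat_dz * x
grad_b = dL_dyhat * dyhat_dz
print(grad_w, grad_b)
```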

Gradient Descent

$$w^{(new)} = w^{(old)} - \eta \nabla_{w} L(w)$$

  • Use the first-order derivative to minimize the loss function
  • Gradient Descent update rule (shown above)
  • Momentum adds a velocity term \(v_t\) to the update:

$$\begin{aligned} v_t &= \gamma v_{t-1} + \eta \nabla_{w} L(w) \\ w^{(new)} &= w^{(old)} - v_t \end{aligned}$$
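A sketch of both update rules on a toy quadratic loss (the loss and the hyperparameter values are illustrative assumptions):

```python
import numpy as np

def grad_L(w):
    # Gradient of the toy loss L(w) = ||w||^2 / 2
    return w

eta, gamma = 0.1, 0.9
w = np.array([5.0, -3.0])
v = np.zeros_like(w)

for t in range(100):
    g = grad_L(w)
    # Plain gradient descent would be: w = w - eta * g
    v = gamma * v + eta * g      # momentum accumulates a velocity
    w = w - v
print(w)   # approaches the minimizer [0, 0]
```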

Gradient Descent

  • Adaptive Moment Estimation (Adam):

$$\begin{aligned} g_t &= \nabla_{w} L(w_t)\\ m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\ w^{(new)} &= w^{(old)} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t \end{aligned}$$
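A sketch of one Adam step in NumPy, including the bias-corrected moments \(\hat{m}_t\) and \(\hat{v}_t\) (the default hyperparameter values and the toy loss are assumptions for illustration):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient g (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Illustrative usage on the toy loss L(w) = ||w||^2 / 2, so grad = w.
w = np.array([5.0, -3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    w, m, v = adam_step(w, w, m, v, t, eta=0.1)  # large eta for the toy example
print(w)   # moves toward the minimizer [0, 0]
```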

Second-Order Optimizers

  • Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS)
    • Approximates the Hessian matrix of the loss function
  • Limited-memory BFGS algorithm (L-BFGS)
    • Stores only a short history of updates instead of a full Hessian approximation (see the sketch below)
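In practice these quasi-Newton methods are usually called through a library. A hedged sketch using SciPy (assuming SciPy is installed; the Rosenbrock test function stands in for a real training loss):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Minimize the Rosenbrock test function with BFGS and with L-BFGS-B.
x0 = np.zeros(5)
res_bfgs  = minimize(rosen, x0, jac=rosen_der, method="BFGS")
res_lbfgs = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(res_bfgs.x)    # close to the minimizer [1, 1, 1, 1, 1]
print(res_lbfgs.x)
```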

Deep Learning

Representation Learning


  • Convolutional Neural Networks
    • ResNet
    • VGG
  • Recurrent Neural Networks
    • LSTM
    • GRU

Large Networks

| Task                 | Dataset                 | Architecture | # of params |
|----------------------|-------------------------|--------------|-------------|
| Language Modelling   | WikiText-103            | GLM-XXLarge  | 10B         |
| Machine Translation  | WMT2014 French-English  | GPT-3        | 175B        |
| Image Classification | ImageNet                | ViT-MoE-15B  | 14.7B       |
| Object Detection     | COCO                    | YOLO-V3      | 65M         |

Challenges

  • Computationally Expensive

    • Use more efficient optimizers (Momentum, Adam, etc.)

  • Vanishing & Exploding Gradients (illustrated in the sketch below)

    • Use better activation functions
    • Develop new architectures
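A small numerical illustration of the vanishing-gradient problem (the depth of 20 layers is an illustrative assumption): the sigmoid derivative is at most 0.25, so a gradient backpropagated through many sigmoid layers shrinks geometrically, whereas activations like ReLU pass gradients through unchanged on their active region.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Backpropagate a gradient of 1.0 through a chain of 20 sigmoid units;
# each layer multiplies the gradient by f'(z) <= 0.25.
grad = 1.0
for _ in range(20):
    z = 0.0                                  # the most favorable case for sigmoid
    grad *= sigmoid(z) * (1 - sigmoid(z))    # derivative of sigmoid at z
print(grad)   # about 0.25**20 ~ 9e-13: the gradient has effectively vanished
```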

Any Questions?

Thanks

Optimization Methods for Deep Learning

By Alireza Afzal Aghaei
