The symbiotic relationship between optimization and deep learning

Pierre Ablin

CNRS - Université Paris-Dauphine

Joint work with:

T. Moreau, M. Massias, A. Gramfort, M. Sander, M. Blondel and G. Peyré

Optimization

Optimization: how to minimize a function ?

 

$$\min_{x\in\mathcal{X}} f(x)$$

Challenges:

Algorithm design

Find algorithms depending on assumptions on \(f\) (convex, smooth, ...) and on \(\mathcal{X}\) (convex, manifold, discrete...)

Theoretical guarantees

Convergence ? In which sense ? At which speed ?

Implementation

Numerical complexity, practical computational costs, hardware 

Deep learning

Design a parametrized transform $$ \phi_{\theta}: x \to y$$ by composition of simple, differentiable blocks

Challenges:

Network design

Design networks that work well depending on the task / input data / output data

Theoretical guarantees

What can we say about generalization of the network ? About the learned weights?

Implementation

Fast training and inference, deal with memory issues,...

The obvious link

Neural networks are usually trained by optimizing a function

 

Empirical risk minimization:

 

$$\min_{\theta} \frac1n\sum_{i=1}^n \ell_i(\phi_{\theta}(x_i))$$
 

Basic algorithm: stochastic gradient descent

 

$$\text{sample } i \text{ uniformly in } \{1, \dots, n\}$$

$$\theta \leftarrow \theta - \rho \nabla_{\theta}[\ell_i(\phi_{\theta}(x_i))]$$
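A minimal NumPy sketch of the update above. The toy problem (a linear model with squared loss and noiseless targets), the step size and the helper names `sgd` and `grad_i` are assumptions made for illustration, not part of the talk.

```python
import numpy as np

def sgd(grad_i, theta0, n, step, n_iters, seed=0):
    """Plain SGD: draw one index at random, step against its gradient."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        i = rng.integers(n)                  # uniform index in {0, ..., n-1}
        theta -= step * grad_i(theta, i)     # theta <- theta - rho * grad
    return theta

# Toy data (assumed): phi_theta(x) = <theta, x>, loss_i = 0.5 * (<theta, x_i> - y_i)^2
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                           # noiseless targets

def grad_i(theta, i):
    # Gradient of the i-th loss with respect to theta
    return (X[i] @ theta - y[i]) * X[i]

theta_hat = sgd(grad_i, np.zeros(3), n=len(X), step=0.05, n_iters=5000)
```

Because the targets are noiseless, every individual loss is minimized at the same point, so a constant step size is enough here; with noisy data a decreasing step size would usually be needed.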

Today's talk :

other links

Learning to optimize: neural networks for optimization

Inverse problems

Latent process \(z \) generates observed outputs  \(x\):

 

\(z \to x \) 

The forward operation "\( \to\)" is generally known:

 

\(x = f(z) + \varepsilon \)

Goal of inverse problems: find a mapping 

 

\(x \to z\)

Example: MEG acquisition

\( z \) : current density in the brain

\( x \) : observed MEG signals

\(f\) : linear operator given by physics (Maxwell's equations)

\( x = D z \)

Linear regression

Linear forward model : \(z \in \mathbb{R}^m\), \(x\in\mathbb{R}^n\), \(D \in \mathbb{R}^{n \times m} \)

 

\(x = Dz + \varepsilon \)

Simple solution: least squares

\( z^* \in \arg\min \frac12 \|x - Dz\|^2\)

Problem: in some applications \(m \gg n \), so the least-squares problem is ill-posed

\(\to\) bet on sparsity: only a few coefficients in \(z^*\) are \( \neq 0 \)

The Lasso

\( \lambda > 0 \) regularization parameter :

 

\(z^*\in\arg\min \frac12\|x - Dz\|^2 + \lambda \|z\|_1 = F_x(z)\) 

Enforces sparsity of the solution.

 

Easier to see on the equivalent problem: \(z^* \in \arg\min \frac12 \|x-Dz\|^2 \) s.t. \(\|z\|_1\leq C\)

Tibshirani, Regression shrinkage and selection via the lasso, 1996

Lasso induces sparsity

\(z^*\in \arg\min \frac12 \|x - Dz \|^2 \) s.t. \(\|z\|_1\leq C\)

\(z^*\in \arg\min \frac12 \|x - Dz \|^2 \)

Iterative shrinkage-thresholding algorithm

ISTA: simple algorithm to fit the Lasso.

 \(F_x(z) = \frac12\|x-Dz\|^2 + \lambda \|z\|_1\)

Idea: use proximal gradient descent

\(\to\) \(\frac12\|x - Dz\|^2\) is a smooth function

$$\nabla_z \left(\frac12\|x-Dz\|^2\right) = D^{\top}(Dz-x)$$

\(\to\) \(\lambda \|z\|_1\) has a simple proximal operator

Daubechies et al., An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, 2004

ISTA: gradient descent step on the smooth part + proximal step

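\(z^{(t+1)} = \text{st}(z^{(t)} - \frac1LD^{\top}(Dz^{(t)} - x), \frac{\lambda}{L})\)

where \(\text{st}(u, \tau) = \text{sign}(u)\max(|u| - \tau, 0)\) is the soft-thresholding operator (the proximal operator of \(\tau\|\cdot\|_1\)) and \(L\) is the largest eigenvalue of \(D^{\top}D\)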
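The iteration can be sketched in a few lines of NumPy. This is an illustrative version, assuming `L` is taken as the squared spectral norm of `D` (the usual choice) and that the iterates start from zero; the function names and the toy data are not from the talk.

```python
import numpy as np

def soft_thresholding(u, threshold):
    """Proximal operator of threshold * ||.||_1, applied coordinate-wise."""
    return np.sign(u) * np.maximum(np.abs(u) - threshold, 0.0)

def lasso_objective(z, x, D, lam):
    """F_x(z) = 0.5 * ||x - D z||^2 + lam * ||z||_1."""
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def ista(x, D, lam, n_iters):
    L = np.linalg.norm(D, ord=2) ** 2        # largest eigenvalue of D^T D
    z = np.zeros(D.shape[1])                 # assumed initialization
    for _ in range(n_iters):
        grad = D.T @ (D @ z - x)             # gradient of the smooth part
        z = soft_thresholding(z - grad / L, lam / L)
    return z

# Toy problem (assumed): m = 20 unknowns, n = 5 observations, 2 active coefficients
rng = np.random.default_rng(0)
D = rng.standard_normal((5, 20))
z_true = np.zeros(20)
z_true[[2, 7]] = [1.5, -2.0]
x = D @ z_true
z_hat = ista(x, D, lam=0.1, n_iters=100)
```

With the step size `1 / L`, each iteration cannot increase the Lasso objective, which gives a simple sanity check for an implementation.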

ISTA as a Recurrent neural network

Solving the lasso many times

Assume that we want to solve the Lasso for many observations \(x_1, \cdots, x_N\) with a fixed dictionary \(D\) 

e.g.  MEG inverse problem:

\(D\) is fixed given by Maxwell's equations, \(x_i\) is one sample of the recording:

 

up to 100K samples !

We want to solve the Lasso many times with same \(D\); can we accelerate ISTA ? 

Training / Testing

 

\(\to\) \((x_1, \cdots, x_N)\) is the training set, drawn from a distribution \(p\) and we want to accelerate ISTA on unseen data \(x \sim p\)

ISTA is a Neural Net

ISTA:

\(z^{(t+1)} = \text{st}(z^{(t)} - \frac1LD^{\top}(Dz^{(t)} - x), \frac{\lambda}{L})\)

 

Let \(W_1 = I_m - \frac1LD^{\top}D\) and \(W_2 = \frac1LD^{\top}\):

\(z^{(t+1)} = \text{st}(W_1z^{(t)} +W_2x, \frac{\lambda}{L})\)

 

 

3 iterations of ISTA = a 3-layer neural network

Learned-ISTA

Gregor, LeCun, Learning Fast Approximations of Sparse Coding, 2010

A \(T\)-layer LISTA network is a function \(\Phi_{\Theta}\) parametrized by \(T\) triplets of parameters \( \Theta = (W^t_1,W^t_2, \beta^t )_{t=0}^{T-1}\)

  • \(z^{(0)} = 0\)
  • \(z^{(t+1)} = \text{st}(W^t_1z^{(t)} + W^t_2x, \beta^t) \)
  • Return \(z^{(T)}  = \Phi_{\Theta}(x) \)
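The three steps above translate directly into a forward pass. The sketch below is an illustration, not the authors' code: `lista_forward` and `ista_init` are hypothetical names, and `ista_init` fills every layer with the ISTA weights \(W_1 = I_m - \frac1LD^{\top}D\), \(W_2 = \frac1LD^{\top}\), \(\beta = \frac{\lambda}{L}\), so that the untrained network reproduces \(T\) iterations of ISTA.

```python
import numpy as np

def soft_thresholding(u, threshold):
    return np.sign(u) * np.maximum(np.abs(u) - threshold, 0.0)

def lista_forward(x, params):
    """params is a list of T triplets (W1, W2, beta), one per layer."""
    z = np.zeros(params[0][0].shape[0])      # z^(0) = 0
    for W1, W2, beta in params:
        z = soft_thresholding(W1 @ z + W2 @ x, beta)
    return z                                 # z^(T) = Phi_Theta(x)

def ista_init(D, lam, n_layers):
    """Weights for which each layer is exactly one ISTA iteration."""
    L = np.linalg.norm(D, ord=2) ** 2        # largest eigenvalue of D^T D
    W1 = np.eye(D.shape[1]) - D.T @ D / L
    W2 = D.T / L
    return [(W1.copy(), W2.copy(), lam / L) for _ in range(n_layers)]

# Toy sizes (assumed): n = 5 observations, m = 8 coefficients, T = 3 layers
rng = np.random.default_rng(0)
D = rng.standard_normal((5, 8))
x = rng.standard_normal(5)
lam = 0.1
z_net = lista_forward(x, ista_init(D, lam, n_layers=3))
```

Starting training from these weights is a natural choice, since the network is then at least as good as ISTA before any parameter is updated.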

 

The parameters of the network are learned to get better results than ISTA

Learning parameters

A \(T\)-layer LISTA network is a function \(\Phi_{\Theta}\) parametrized by \(T\) triplets of parameters \( \Theta = (W^t_1,W^t_2, \beta^t )_{t=0}^{T-1}\)

Supervised-learning

Ground truth \(s_1, \cdots, s_N\) available (e.g. such that \(x_i = D s_i\))

 

$$\mathcal{L}(\Theta) = \sum_{i=1}^N \left\|\Phi_{\Theta}(x_i) - s_i\right\|^2$$

Semi-supervised

Compute \(s_1, \cdots, s_N\) as \(s_i = \arg\min_z F_{x_i}(z)\)

$$\mathcal{L}(\Theta) = \sum_{i=1}^N \left\|\Phi_{\Theta}(x_i) - s_i\right\|^2$$

Unsupervised

Learn to solve the Lasso

 

 

$$\mathcal{L}(\Theta) = \sum_{i=1}^N F_{x_i}(\Phi_{\Theta}(x_i))$$

$$\Theta \in \argmin \mathcal{L}(\Theta)$$

If we see a new sample \(x\), we expect:

$$F_x(\Phi_{\Theta}(x)) \leq F_x(\text{ISTA}(x)) \enspace, $$

where ISTA is applied for \(T\) iterations.
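The unsupervised loss only needs the Lasso objective and a forward pass, so it can be evaluated without any ground truth. The sketch below computes it for a network whose layers are set to the ISTA weights, which gives the baseline that a trained network is expected to beat. Training itself is not shown, and all names and toy sizes are assumptions for illustration.

```python
import numpy as np

def soft_thresholding(u, threshold):
    return np.sign(u) * np.maximum(np.abs(u) - threshold, 0.0)

def lasso_objective(z, x, D, lam):
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def network(x, params):
    """Forward pass through layers given as (W1, W2, beta) triplets."""
    z = np.zeros(params[0][0].shape[0])
    for W1, W2, beta in params:
        z = soft_thresholding(W1 @ z + W2 @ x, beta)
    return z

def unsupervised_loss(params, X, D, lam):
    """Sum over the samples of F_{x_i}(Phi_Theta(x_i))."""
    return sum(lasso_objective(network(x, params), x, D, lam) for x in X)

def ista_params(D, lam, n_layers):
    """Layers equal to ISTA iterations (the untrained baseline)."""
    L = np.linalg.norm(D, ord=2) ** 2
    W1 = np.eye(D.shape[1]) - D.T @ D / L
    return [(W1, D.T / L, lam / L)] * n_layers

# Toy training set (assumed): N = 50 samples of dimension n = 5, m = 8
rng = np.random.default_rng(0)
D = rng.standard_normal((5, 8))
X = rng.standard_normal((50, 5))
lam = 0.1
baseline_5 = unsupervised_loss(ista_params(D, lam, 5), X, D, lam)
baseline_20 = unsupervised_loss(ista_params(D, lam, 20), X, D, lam)
```

The baseline decreases as layers are added, because each ISTA iteration can only lower the objective; a trained network aims at a lower loss with the same number of layers.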

LISTA

Advantages:

  • Can handle large-scale datasets
  • GPU friendly

Drawbacks:

  • Learning problem is non-convex, non-differentiable: no practical guarantees, and generally hard to train
  • Can fail badly on unseen data

What does the network learn?

Consider a "deep" LISTA network: \(T\gg 1\), assume that we have solved the expected optimization problem:

$$\Theta \in \argmin \mathbb{E}_{x\sim p}\left[F_x(\Phi_{\Theta}(x))\right]$$

As \(T \to + \infty \), assume \( (W_1^t, W_2^t, \beta^t) \to (W_1^*,W_2^*, \beta^*)\). Call \(\alpha = \beta^* / \lambda\). Then:

 

$$ W_1^* = I_m - \alpha D^{\top}D$$

$$ W_2^* = \alpha D^{\top}$$

$$ \beta^* = \alpha \lambda$$

Ablin, Moreau et al., Learning Step Sizes for Unfolded Sparse Coding, 2019

Corresponds to ISTA with step size \(\alpha\) instead of \(1/L\)

$$ W_1 = I_m - \frac1LD^{\top}D$$

$$ W_2 = \frac1L D^{\top}$$

$$ \beta = \frac{\lambda}{L}$$
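Substituting the limit weights into one LISTA layer makes the correspondence explicit. This is only a rewriting of the formulas above, with \(\text{st}\) the soft-thresholding used in ISTA:

$$z^{(t+1)} = \text{st}\left(W_1^* z^{(t)} + W_2^* x, \beta^*\right) = \text{st}\left(z^{(t)} - \alpha D^{\top}(Dz^{(t)} - x), \alpha\lambda\right)$$

Taking \(\alpha = \frac1L\) recovers the ISTA iteration.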

Optimization theory helps us characterize precisely what the network learns !

Network architecture guided by optimization

Residual networks

Classical networks :

 

\(x_{n+1} = f(x_n, \theta_n)\), e.g. \(x_{n+1} = \sigma(Wx_n + b)\)
 

Problem :

Stacking too many layers degrades performance

 

Residual networks

 

\(x_{n+1} = x_n +  f(x_n, \theta_n)\)
 

Easy to learn identity !

Can stack many layers :)

Residual networks

Allows stacking many layers (hundreds)

 

- State of the art on many problems for a long time

- Still widely used today

He et al., Deep residual learning for image recognition, 2015

Memory issues...

Forward pass: evaluate network


\(x_0 = x\)

\(x_{n+1}= x_n + f(x_n , \theta_n)\)

\(y = x_N\)

 

Backward pass: compute gradients


\(\frac{\partial \ell(y)}{\partial y} = \ell'(y)\)

\(\frac{\partial \ell(y)}{\partial x_n} = \frac{\partial \ell(y)}{\partial x_{n+1}}(I + \partial_{x}f(x_n, \theta_n))\)

\(\frac{\partial \ell(y)}{\partial \theta_n} =  \frac{\partial \ell(y)}{\partial x_{n+1}}\partial_{\theta}f(x_n, \theta_n)\)

Start from output : need to store \(x_n\)
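The two passes can be written out by hand to see where the memory goes. This sketch assumes \(f(x, \theta) = \tanh(Wx)\) with \(\theta = W\) and the loss \(\ell(y) = \frac12\|y\|^2\); both are illustrative choices, not taken from the talk. Gradients are stored as column vectors, so the Jacobians appear transposed compared with the formulas above.

```python
import numpy as np

def f(x, W):
    return np.tanh(W @ x)

def forward(x, weights):
    """Forward pass; every activation x_n is kept for the backward pass."""
    activations = [x]
    for W in weights:
        x = x + f(x, W)
        activations.append(x)
    return activations

def backward(activations, weights, grad_y):
    """Backward pass from the output, reading the stored activations."""
    g = grad_y                                   # dl/dx_N
    grads_W = [None] * len(weights)
    for n in reversed(range(len(weights))):
        x_n, W = activations[n], weights[n]
        d = (1.0 - np.tanh(W @ x_n) ** 2) * g    # g pushed back through tanh
        grads_W[n] = np.outer(d, x_n)            # dl/dW_n
        g = g + W.T @ d                          # dl/dx_n = (I + d_x f)^T dl/dx_{n+1}
    return g, grads_W

# Toy network (assumed): 3 layers of width 4
rng = np.random.default_rng(0)
weights = [0.5 * rng.standard_normal((4, 4)) for _ in range(3)]
x0 = rng.standard_normal(4)
activations = forward(x0, weights)
grad_x0, grads_W = backward(activations, weights, grad_y=activations[-1])
```

The list `activations` holds one vector per layer, which is the memory cost that grows with depth.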

Memory issues...

Need to store all activations \(x_n\) for backprop : huge memory cost !

On a classical image classification task : 

Parallel with optimization

Residual network

\(x_{n+1} = x_n + f(x_n, \theta) \)

Gradient descent

\(x_{n+1} = x_n -\rho \nabla g(x_n, \theta) \)

Equivalent if \(f = - \rho \nabla_xg\)
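A small numerical illustration of this equivalence, assuming a quadratic \(g(x) = \frac12 x^{\top}Ax - b^{\top}x\) chosen for the example: a residual network whose layers all apply \(f = -\rho\nabla g\) runs gradient descent on \(g\), so its output approaches the minimizer as layers are added.

```python
import numpy as np

def residual_network(x, f, n_layers):
    """Weight-tied residual network: x_{n+1} = x_n + f(x_n)."""
    for _ in range(n_layers):
        x = x + f(x)
    return x

# Quadratic objective (assumed for illustration), A symmetric positive definite
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
rho = 0.3                                    # step size, below 2 / lambda_max(A)

def grad_g(x):
    return A @ x - b

def f(x):
    return -rho * grad_g(x)                  # residual block = gradient step

x_out = residual_network(np.zeros(2), f, n_layers=200)
x_star = np.linalg.solve(A, b)               # exact minimizer of g
```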

Momentum Gradient descent

\(v_{n+1}= \beta v_n - \rho \nabla g(x_n, \theta)\)

\(x_{n+1} = x_n +v_{n+1} \)

Momentum residual network

\(v_{n+1}= \beta v_n + f(x_n, \theta)\)

\(x_{n+1} = x_n +v_{n+1} \)

Sander et al., Momentum residual neural networks, 2021

Invertible layers

Momentum residual network

\(v_{n+1}= \beta v_n + f(x_n, \theta)\)

\(x_{n+1} = x_n +v_{n+1} \)

Inverted by:

\(x_n = x_{n+1} - v_{n+1}\)

\(v_n = \frac1\beta(v_{n+1} - f(x_n, \theta))\)

No need to store activations ! They can be recomputed on the fly during backprop
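A short sketch of the forward map and its inverse, to check that the input can be rebuilt from the output alone. The block \(f(x, \theta) = \tanh(Wx)\), the zero initial velocity and the value \(\beta = 0.9\) are assumptions for the example. Each inverse step first recovers \(x_n = x_{n+1} - v_{n+1}\), then \(v_n\).

```python
import numpy as np

def f(x, W):
    return np.tanh(W @ x)

def momentum_forward(x, v, weights, beta):
    for W in weights:
        v = beta * v + f(x, W)               # v_{n+1} = beta v_n + f(x_n)
        x = x + v                            # x_{n+1} = x_n + v_{n+1}
    return x, v

def momentum_inverse(x, v, weights, beta):
    for W in reversed(weights):
        x = x - v                            # x_n = x_{n+1} - v_{n+1}
        v = (v - f(x, W)) / beta             # v_n = (v_{n+1} - f(x_n)) / beta
    return x, v

# Toy network (assumed): 10 layers of width 4
rng = np.random.default_rng(0)
weights = [0.5 * rng.standard_normal((4, 4)) for _ in range(10)]
x0 = rng.standard_normal(4)
v0 = np.zeros(4)                             # assumed initial velocity
beta = 0.9

x_out, v_out = momentum_forward(x0, v0, weights, beta)
x_rec, v_rec = momentum_inverse(x_out, v_out, weights, beta)
```

Only the final pair `(x_out, v_out)` has to be kept: during backpropagation each earlier activation is obtained from the next one. The division by `beta` amplifies rounding errors, so the reconstruction is exact only up to floating-point precision.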

Representation capacity

Momentum residual network

\(v_{n+1}= \beta v_n + f(x_n, \theta)\)

\(x_{n+1} = x_n +v_{n+1} \)

Continuous equivalent:

\(\varepsilon \ddot x +\dot x = f(x, \theta)\)

Residual network

\(x_{n+1} =x_n + f(x_n, \theta)\)

 

Continuous equivalent:

\(\dot x = f(x, \theta)\)

Open source code

Get the Python code:

pip install momentumnet

Transform a ResNet into a momentum net:

import torch
from momentumnet import transform_to_momentumnet
from torchvision.models import resnet101
# Load a pretrained ResNet-101 from torchvision
resnet = resnet101(pretrained=True)
# gamma is the momentum coefficient; use_backprop=False selects the memory-saving backward pass
mresnet101 = transform_to_momentumnet(resnet, gamma=0.9, use_backprop=False)

Documentation: 

 

michaelsdr.github.io/momentumnet/  

Conclusion

 

 

- There are many links between deep learning and optimization

 

- Deep neural networks can accelerate optimization when training is possible

 

- Optimization can guide the design of deep networks and lead to intriguing properties

 

Thanks  ! 
