The symbiotic relationship between optimization and deep learning
Pierre Ablin
CNRS - Université Paris-Dauphine
Joint work with:
T. Moreau, M. Massias, A. Gramfort, M. Sander, M. Blondel and G. Peyré
Optimization
Optimization: how to minimize a function ?
$$\min_{x\in\mathcal{X}} f(x)$$
Challenges:
Algorithm design
Find algorithms depending on assumptions on \(f\) (convex, smooth, ...) and on \(\mathcal{X}\) (convex, manifold, discrete...)
Theoretical guarantees
Convergence ? In which sense ? At which speed ?
Implementation
Numerical complexity, practical computational costs, hardware
Deep learning
Design a parametrized transform $$ \phi_{\theta}: x \to y$$ by composition of simple, differentiable blocks
Challenges:
Network design
Design networks that work well depending on the task / input data / output data
Theoretical guarantees
What can we say about generalization of the network ? About the learned weights?
Implementation
Fast training and inference, deal with memory issues,...
The obvious link
Neural networks are usually trained by optimizing a function
Empirical risk minimization:
$$\min_{\theta} \frac1n\sum_{i=1}^n \ell_i(\phi_{\theta}(x_i))$$
Basic algorithm: stochastic gradient descent
$$\text{sample } i\sim [1, n]$$
$$\theta \leftarrow \theta - \rho \nabla_{\theta}[\ell_i(\phi_{\theta}(x_i))]$$
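The update above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the toy linear model, the squared loss, the step size and the names `sgd` and `grad_i` are assumptions made for the example, not part of the talk.

```python
import numpy as np

def sgd(grad_i, theta0, n, step=0.01, n_iters=1000, seed=0):
    """Plain SGD: theta <- theta - step * gradient of the i-th loss term."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        i = rng.integers(n)  # sample i uniformly in {0, ..., n-1}
        theta -= step * grad_i(theta, i)
    return theta

# Toy problem (assumed): phi_theta(x) = <theta, x> with a squared loss.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
theta_true = rng.standard_normal(5)
y = X @ theta_true  # noiseless targets

def grad_i(theta, i):
    # Gradient of 0.5 * (<theta, x_i> - y_i)^2 with respect to theta
    return (X[i] @ theta - y[i]) * X[i]

theta_hat = sgd(grad_i, np.zeros(5), n=200, step=0.05, n_iters=5000)
```

Because the targets are noiseless here, a constant step size is enough for `theta_hat` to approach `theta_true`; with noisy data a decaying step would usually be needed.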
Today's talk :
other links
Learning to optimize: neural networks for optimization
Inverse problems
Latent process \(z \) generates observed outputs \(x\):
\(z \to x \)
The forward operation "\( \to\)" is generally known:
\(x = f(z) + \varepsilon \)
Goal of inverse problems: find a mapping
\(x \to z\)
Example: MEG acquisition
\( z \) : current density in the brain
\( x \) : observed MEG signals
\(f\) : linear operator given by physics (Maxwell's equations)
\( x = D z \)
Linear regression
Linear forward model : \(z \in \mathbb{R}^m\), \(x\in\mathbb{R}^n\), \(D \in \mathbb{R}^{n \times m} \)
\(x = Dz + \varepsilon \)
Simple solution: least squares
\( z^* \in \arg\min_z \frac12 \|x - Dz\|^2\)
Problem: in some applications, \(m \gg n \), so least squares is ill-posed
\(\to\) bet on sparsity: only a few coefficients in \(z^*\) are \( \neq 0 \) (\(z\) is sparse)
The Lasso
\( \lambda > 0 \) regularization parameter :
\(z^*\in\arg\min \frac12\|x - Dz\|^2 + \lambda \|z\|_1 = F_x(z)\)
Enforces sparsity of the solution.
Easier to see on the equivalent problem: \(z^* \in \arg\min \frac12 \|x-Dz\|^2 \) s.t. \(\|z\|_1\leq C\)
Tibshirani, Regression shrinkage and selection via the lasso, 1996
Lasso induces sparsity
\(z^*\in \arg\min \frac12 \|x - Dz \|^2 \)s.t. \(\|z\|_1\leq C\)
\(z^*\in \arg\min \frac12 \|x - Dz \|^2 \)
Iterative shrinkage-thresholding algorithm
ISTA: simple algorithm to fit the Lasso.
\(F_x(z) = \frac12\|x-Dz\|^2 + \lambda \|z\|_1\)
Idea: use proximal gradient descent
\(\to\) \(\frac12\|x - Dz\|^2\) is a smooth function
$$\nabla_z \left(\frac12\|x-Dz\|^2\right) = D^{\top}(Dz-x)$$
\(\to\) \(\lambda \|z\|_1\) has a simple proximal operator
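As a reminder (a standard fact, written here with a generic threshold \(\tau > 0\) that is not in the original slides), the proximal operator of \(\tau\|\cdot\|_1\) is the coordinate-wise soft-thresholding \(\text{st}\):
$$\text{st}(u, \tau)_j = \arg\min_{v \in \mathbb{R}} \frac12 (v - u_j)^2 + \tau |v| = \text{sign}(u_j)\max(|u_j| - \tau, 0)$$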
Daubechies et al., An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. , 2004
ISTA: gradient descent step on the smooth part + proximal step
\(z^{(t+1)} = \text{st}(z^{(t)} - \frac1LD^{\top}(Dz^{(t)} - x), \frac{\lambda}{L})\)
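A minimal NumPy sketch of this iteration follows. It assumes that \(\text{st}\) is the soft-thresholding operator and that \(L\) is the largest eigenvalue of \(D^{\top}D\); the function names and the random test problem are illustrative choices, not taken from the talk.

```python
import numpy as np

def soft_threshold(u, tau):
    """Coordinate-wise soft-thresholding: prox of tau * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def lasso_cost(z, x, D, lam):
    """F_x(z) = 0.5 * ||x - D z||^2 + lam * ||z||_1."""
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def ista(x, D, lam, n_iters=100):
    """ISTA with step 1/L, started from z = 0."""
    L = np.linalg.norm(D, ord=2) ** 2  # largest eigenvalue of D^T D
    z = np.zeros(D.shape[1])
    for _ in range(n_iters):
        grad = D.T @ (D @ z - x)  # gradient of the smooth part
        z = soft_threshold(z - grad / L, lam / L)
    return z

# Illustrative problem: a sparse code observed through a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((10, 30))
z_true = np.zeros(30)
z_true[[2, 11, 25]] = [1.0, -2.0, 1.5]
x = D @ z_true
lam = 0.1 * np.max(np.abs(D.T @ x))
z_hat = ista(x, D, lam, n_iters=200)
```

With step \(1/L\) the cost \(F_x\) does not increase from one iteration to the next, which is a convenient sanity check when implementing the method.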
ISTA as a Recurrent neural network
Solving the lasso many times
Assume that we want to solve the Lasso for many observations \(x_1, \cdots, x_N\) with a fixed dictionary \(D\)
e.g. MEG inverse problem:
\(D\) is fixed given by Maxwell's equations, \(x_i\) is one sample of the recording:
up to 100K samples !
We want to solve the Lasso many times with same \(D\); can we accelerate ISTA ?
Training / Testing
\(\to\) \((x_1, \cdots, x_N)\) is the training set, drawn from a distribution \(p\) and we want to accelerate ISTA on unseen data \(x \sim p\)
ISTA is a Neural Net
ISTA:
\(z^{(t+1)} = \text{st}(z^{(t)} - \frac1LD^{\top}(Dz^{(t)} - x), \frac{\lambda}{L})\)
Let \(W_1 = I_m - \frac1LD^{\top}D\) and \(W_2 = \frac1LD^{\top}\):
\(z^{(t+1)} = \text{st}(W_1z^{(t)} +W_2x, \frac{\lambda}{L})\)
3 iterations of ISTA = 3 layers NN
Learned-ISTA
Gregor, LeCun, Learning Fast Approximations of Sparse Coding, 2010
A \(T\)-layer LISTA network is a function \(\Phi\) parametrized by \(T\) sets of parameters, one per layer: \( \Theta = (W^t_1,W^t_2, \beta^t )_{t=0}^{T-1}\)
- \(z^{(0)} = 0\)
- \(z^{(t+1)} = \text{st}(W^t_1z^{(t)} + W^t_2x, \beta^t) \)
- Return \(z^{(T)} = \Phi_{\Theta}(x) \)
The parameters of the network are learned to get better results than ISTA
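The forward pass of such a network can be sketched as below. This is an assumed NumPy illustration, not the authors' implementation: `ista_init` and `lista_forward` are hypothetical names, and the weights are set to the ISTA values \(W_1 = I_m - \frac1LD^{\top}D\), \(W_2 = \frac1LD^{\top}\), \(\beta = \frac{\lambda}{L}\) in every layer, so the untrained network reproduces \(T\) iterations of ISTA.

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def ista_init(D, lam, T):
    """One (W1, W2, beta) triple per layer, set to the ISTA values."""
    L = np.linalg.norm(D, ord=2) ** 2
    W1 = np.eye(D.shape[1]) - D.T @ D / L
    W2 = D.T / L
    return [(W1.copy(), W2.copy(), lam / L) for _ in range(T)]

def lista_forward(x, params):
    """z^(0) = 0, then z^(t+1) = st(W1^t z^(t) + W2^t x, beta^t)."""
    z = np.zeros(params[0][0].shape[0])
    for W1, W2, beta in params:
        z = soft_threshold(W1 @ z + W2 @ x, beta)
    return z

# Illustrative data (assumed shapes and values)
rng = np.random.default_rng(0)
D = rng.standard_normal((10, 30))
x = rng.standard_normal(10)
lam, T = 0.5, 3
z_T = lista_forward(x, ista_init(D, lam, T))
```

Training would then update each layer's triple independently, starting from this initialization.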
Learning parameters
Supervised-learning
Ground truth \(s_1, \cdots, s_N\) available (e.g. such that \(x_i = D s_i\))
$$\mathcal{L}(\Theta) = \sum_{i=1}^N \left\|\Phi_{\Theta}(x_i) - s_i\right\|^2$$
Semi-supervised
Compute \(s_1, \cdots, s_N\) as \(s_i = \arg\min F_{x_i}\)
$$\mathcal{L}(\Theta) = \sum_{i=1}^N \left\|\Phi_{\Theta}(x_i) - s_i\right\|^2$$
Unsupervised
Learn to solve the Lasso
$$\mathcal{L}(\Theta) = \sum_{i=1}^N F_{x_i}(\Phi_{\Theta}(x_i))$$
Learning parameters
Unsupervised
Learn to solve the Lasso:
$$\Theta \in \arg\min_{\Theta}\mathcal{L}(\Theta) = \sum_{i=1}^N F_{x_i}(\Phi_{\Theta}(x_i))$$
If we see a new sample \(x\), we expect:
$$F_x(\Phi_{\Theta}(x)) \leq F_x(\text{ISTA}(x)) \enspace, $$
where ISTA is applied for \(T\) iterations.
LISTA
Advantages:
- Can handle large-scale datasets
- GPU friendly
Drawbacks:
- Learning problem is non-convex, non-differentiable: no practical guarantees, and generally hard to train
- Can fail badly on unseen data
What does the network learn?
Consider a "deep" LISTA network: \(T\gg 1\), assume that we have solved the expected optimization problem:
$$\Theta \in \arg\min_{\Theta} \mathbb{E}_{x\sim p}\left[F_x(\Phi_{\Theta}(x))\right]$$
As \(T \to + \infty \), assume \( (W_1^t, W_2^t, \beta^t) \to (W_1^*,W_2^*, \beta^*)\). Call \(\alpha = \beta^* / \lambda\). Then:
$$ W_1^* = I_m - \alpha D^{\top}D$$
$$ W_2^* = \alpha D^{\top}$$
$$ \beta^* = \alpha \lambda$$
Ablin, Moreau et al., Learning Step Sizes for Unfolded Sparse Coding, 2019
Corresponds to ISTA with step size \(\alpha\) instead of \(1/L\)
$$ W_1 = I_m - \frac1LD^{\top}D$$
$$ W_2 = \frac1L D^{\top}$$
$$ \beta = \frac{\lambda}{L}$$
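To see the effect of the step size alone, one can restrict the network to a single scalar \(\alpha\) shared by all layers and compare the unsupervised loss for several values. The sketch below uses a plain grid search on random data, which is only a crude stand-in for gradient-based training and is not the procedure of the cited paper; the data, the grid and the function names are assumptions.

```python
import numpy as np

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def lasso_cost(z, x, D, lam):
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def ista_with_step(x, D, lam, alpha, T):
    """T layers with W1 = I - alpha D^T D, W2 = alpha D^T, beta = alpha * lam."""
    z = np.zeros(D.shape[1])
    for _ in range(T):
        z = soft_threshold(z - alpha * D.T @ (D @ z - x), alpha * lam)
    return z

# Illustrative training set (assumed): N random observations, fixed D.
rng = np.random.default_rng(0)
n, m, N, T, lam = 10, 30, 50, 10, 0.1
D = rng.standard_normal((n, m)) / np.sqrt(n)
X = rng.standard_normal((N, n))
L = np.linalg.norm(D, ord=2) ** 2

# Candidate steps alpha = scale / L; scale = 1.0 is standard ISTA.
scales = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.9]
losses = [
    np.mean([lasso_cost(ista_with_step(x, D, lam, s / L, T), x, D, lam) for x in X])
    for s in scales
]
best_scale = scales[int(np.argmin(losses))]
ista_loss = losses[scales.index(1.0)]
print(f"best scale: {best_scale}, loss: {min(losses):.4f}, ISTA loss: {ista_loss:.4f}")
```

Since the grid contains \(1/L\), the selected step can only match or improve on ISTA's average loss on this training set; whether the gain carries over to unseen data has to be checked separately.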
Optimization theory helps us characterize precisely what the network learns !
Network architecture guided by optimization
Residual networks
Classical networks :
\(x_{n+1} = f(x_n, \theta_n)\), e.g. \(x_{n+1} = \sigma(Wx_n + b)\)
Problem :
Stacking too many layers degrades performance
Residual networks
\(x_{n+1} = x_n + f(x_n, \theta_n)\)
Easy to learn identity !
Can stack many layers :)
Residual networks
Allows stacking many layers (hundreds)
- State of the art on many problems for a long time
- Still widely used today
He et al., Deep residual learning for image recognition, 2015
Memory issues...
Forward pass: evaluate network
\(x_0 = x\)
\(x_{n+1}= x_n + f(x_n , \theta_n)\)
\(y = x_N\)
Backward pass: compute gradients
\(\frac{\partial \ell(y)}{\partial y} = \ell'(y)\)
\(\frac{\partial \ell(y)}{\partial x_n} = \frac{\partial \ell(y)}{\partial x_{n+1}}(I + \partial_{x}f(x_n, \theta_n))\)
\(\frac{\partial \ell(y)}{\partial \theta_n} = \frac{\partial \ell(y)}{\partial x_{n+1}}\partial_{\theta}f(x_n, \theta_n)\)
Start from output : need to store \(x_n\)
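The two passes can be sketched in NumPy as follows. The residual function \(f(x, W) = \tanh(Wx)\) and the loss \(\ell(y) = \frac12\|y\|^2\) are assumptions chosen to keep the example short; gradients are stored as column vectors, so the Jacobians appear transposed compared with the formulas above.

```python
import numpy as np

def forward(x, weights):
    """Forward pass; keeps every activation x_0, ..., x_N for the backward pass."""
    xs = [x]
    for W in weights:
        xs.append(xs[-1] + np.tanh(W @ xs[-1]))
    return xs

def backward(xs, weights):
    """Gradients of 0.5 * ||x_N||^2 with respect to x_0 and to each W_n."""
    g = xs[-1].copy()  # d loss / d y for the assumed loss
    grads = [None] * len(weights)
    for n in reversed(range(len(weights))):
        W, x_n = weights[n], xs[n]  # the stored activation x_n is needed here
        s = 1.0 - np.tanh(W @ x_n) ** 2  # derivative of tanh at W x_n
        grads[n] = np.outer(g * s, x_n)  # d loss / d W_n
        g = g + W.T @ (g * s)  # d loss / d x_n = (I + J_f^T) g
    return g, grads

# Illustrative 3-layer network on a 4-dimensional input
rng = np.random.default_rng(0)
weights = [0.5 * rng.standard_normal((4, 4)) for _ in range(3)]
x0 = rng.standard_normal(4)
xs = forward(x0, weights)
grad_x0, grads = backward(xs, weights)
```

The list `xs` grows linearly with the depth, which is the memory cost discussed next.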
Memory issues...
Need to store all activations \(x_n\) for backprop : huge memory cost !
On a classical image classification task :
Parallel with optimization
Residual network
\(x_{n+1} = x_n + f(x_n, \theta) \)
Gradient descent
\(x_{n+1} = x_n -\rho \nabla g(x_n, \theta) \)
Equivalent if \(f = - \rho \nabla_xg\)
Momentum Gradient descent
\(v_{n+1}= \beta v_n - \rho \nabla g(x_n, \theta)\)
\(x_{n+1} = x_n +v_{n+1} \)
Momentum residual network
\(v_{n+1}= \beta v_n + f(x_n, \theta)\)
\(x_{n+1} = x_n +v_{n+1} \)
Sander et al., Momentum residual neural networks, 2021
Invertible layers
Momentum residual network
\(v_{n+1}= \beta v_n + f(x_n, \theta)\)
\(x_{n+1} = x_n +v_{n+1} \)
Inverted by:
\(x_n = x_{n+1} - v_{n+1}\)
\(v_n = \frac1\beta(v_{n+1} - f(x_n, \theta))\)
No need to store activations ! They can be recomputed on the fly during backprop
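A small NumPy check of this inversion is sketched below. The residual function \(f(x, W) = \tanh(Wx)\), the per-layer weights and the value \(\beta = 0.9\) are assumptions for illustration; this is not the `momentumnet` implementation.

```python
import numpy as np

def f(x, W):
    return np.tanh(W @ x)

def momentum_forward(x, v, weights, beta):
    """v_{n+1} = beta * v_n + f(x_n); x_{n+1} = x_n + v_{n+1}."""
    for W in weights:
        v = beta * v + f(x, W)
        x = x + v
    return x, v

def momentum_inverse(x, v, weights, beta):
    """Recover (x_0, v_0) from (x_N, v_N) without any stored activation."""
    for W in reversed(weights):
        x = x - v  # x_n = x_{n+1} - v_{n+1}
        v = (v - f(x, W)) / beta  # v_n = (v_{n+1} - f(x_n)) / beta
    return x, v

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 4)) for _ in range(10)]
x0, v0 = rng.standard_normal(4), np.zeros(4)
xN, vN = momentum_forward(x0, v0, weights, beta=0.9)
x_rec, v_rec = momentum_inverse(xN, vN, weights, beta=0.9)
print(np.max(np.abs(x_rec - x0)))  # small reconstruction error
```

In floating-point arithmetic the reconstruction is only approximate, because each inverse step divides by \(\beta\) and amplifies rounding errors slightly.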
Representation capacity
Momentum residual network
\(v_{n+1}= \beta v_n + f(x_n, \theta)\)
\(x_{n+1} = x_n +v_{n+1} \)
Continuous equivalent:
\(\varepsilon \ddot x +\dot x = f(x, \theta)\)
Residual network
\(x_{n+1} =x_n + f(x_n, \theta)\)
Continuous equivalent:
\(\dot x = f(x, \theta)\)
Open source code
>>> pip install momentumnet
Transform a resnet into a momentum net in Python:
import torch
from momentumnet import transform_to_momentumnet
from torchvision.models import resnet101
# Load a pretrained ResNet-101 from torchvision
resnet = resnet101(pretrained=True)
# Convert it into a momentum ResNet
mresnet101 = transform_to_momentumnet(resnet, gamma=0.9, use_backprop=False)
Documentation:
michaelsdr.github.io/momentumnet/
Conclusion
- There are many links between deep learning and optimization
- Deep neural networks can accelerate optimization when training is possible
- Optimization can guide the design of deep networks and lead to intriguing properties
Thanks !