FAST AND ACCURATE OPTIMIZATION ON THE ORTHOGONAL MANIFOLD WITHOUT RETRACTION.
Gabriel Peyré
CNRS - École normale supérieure
Pierre Ablin
CNRS - Université Paris-Dauphine
ORTHOGONAL WEIGHTS IN A NEURAL NETWORK?
A matrix \(W \in\mathbb{R}^{p\times p}\) is orthogonal when:
$$WW^\top = I_p$$
Orthogonal weights can be used to build neural network layers \(y = \sigma(Wx + b)\) for:
- Certified adversarial robustness [1]
- Avoiding vanishing / exploding gradients [2]
- Normalizing flows [3]
[1]: Cisse, Moustapha, et al. "Parseval networks: Improving robustness to adversarial examples." International Conference on Machine Learning. PMLR, 2017.
[2]: Arjovsky, Martin, Amar Shah, and Yoshua Bengio. "Unitary evolution recurrent neural networks." International Conference on Machine Learning. PMLR, 2016.
[3]: Berg, Rianne van den, et al. "Sylvester normalizing flows for variational inference." arXiv preprint arXiv:1803.05649 (2018).
Training problem
Neural network \(\phi_{\theta}: x\mapsto y\) with parameters \(\theta\). Some parameters are orthogonal matrices.
Dataset \(x_1, \dots, x_n\).
Find parameters by empirical risk minimization:
$$\min_{\theta}f(\theta) = \frac1n\sum_{i=1}^n\ell_i(\phi_{\theta}(x_i))$$
\(\ell_i\): individual loss function (e.g. log-likelihood, cross-entropy with targets, ...)
How to do this with orthogonal weights?
Orthogonal manifold
\( \mathcal{O}_p = \{W\in\mathbb{R}^{p\times p}|\enspace W^\top W =I_p\}\) is a Riemannian manifold
Problem:
$$\min_{W\in\mathcal{O}_p}f(W)$$
Main approach:
Extend Euclidean algorithms to the Riemannian setting (gradient descent, stochastic gradient descent...)
Riemannian gradient descent with retraction
- Start from \(W^0\in\mathcal{O}_p\)
- Iterate \(W^{t+1} = \mathcal{R}(W^t, -\eta\mathrm{grad} f(W^t))\)
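Classical retraction, for a skew-symmetric matrix \(A\):
$$\mathcal{R}(W, AW) =\exp(A)W$$
Riemannian gradient: \(\mathrm{grad} f(W) = \mathrm{Skew}(\nabla f(W)W^\top)W\), where \(\mathrm{Skew}(X) = \frac12(X - X^\top)\)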
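The iteration above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the toy objective \(f(W) = \frac12\|W - B\|^2\), the rotation target `B`, the step size, and the helper names (`skew`, `expm_skew`, `retraction_step`) are all assumptions made for the example. The matrix exponential is computed through an eigendecomposition, which is one possible choice among several.

```python
import numpy as np

def skew(X):
    """Skew-symmetric part of a square matrix: (X - X^T) / 2."""
    return 0.5 * (X - X.T)

def expm_skew(A):
    """Matrix exponential of a real skew-symmetric matrix A.

    1j * A is Hermitian, so 1j * A = V diag(w) V^H with real w,
    hence exp(A) = V diag(exp(-1j * w)) V^H, which is real.
    """
    w, V = np.linalg.eigh(1j * A)
    return ((V * np.exp(-1j * w)) @ V.conj().T).real

def retraction_step(W, euclid_grad, eta):
    """One step W <- exp(-eta * A) W with A = Skew(nabla f(W) W^T)."""
    A = skew(euclid_grad @ W.T)
    return expm_skew(-eta * A) @ W

# Toy problem (assumption): f(W) = 0.5 * ||W - B||^2, so nabla f(W) = W - B,
# where B is a rotation of angle 1 radian about the third axis.
theta = 1.0
B = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
W = np.eye(3)  # orthogonal starting point
for _ in range(300):
    W = retraction_step(W, W - B, eta=0.1)

ortho_error = np.linalg.norm(W @ W.T - np.eye(3))
target_error = np.linalg.norm(W - B)
```

Every iterate stays orthogonal up to rounding error, at the price of one matrix exponential per step.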
Today's problem: retractions can be very costly for deep learning
Computational cost
Riemannian gradient descent in a neural network:
- Compute \(\nabla f(W)\) using backprop
- Compute the Riemannian gradient \(\mathrm{grad}f(W) \)
- Move using a retraction \(W\leftarrow \mathcal{R}(W, -\eta \mathrm{grad} f(W))\)
Classical retraction:
$$\mathcal{R}(W, AW) =\exp(A)W$$
- costly linear algebra operations
- not suited for GPU (hard to parallelize)
- can be the most expensive step
Main idea
In a deep learning setting, moving on the manifold is too costly!
Can we design a method that is free to move outside the manifold and that:
- Still converges to the solutions of \(\min_{W\in\mathcal{O}_p} f(W)\)
- Has cheap iterations?
Optimization and projection
Projection
Follow the gradient of $$\mathcal{N}(M) = \frac14\|MM^\top - I_p\|^2$$
$$\nabla \mathcal{N}(M) = (MM^\top - I_p)M$$
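The gradient formula can be checked numerically against a finite difference. The sketch below is only a sanity check under assumed values: the random test point `M`, the direction `D`, and the step `eps` are arbitrary choices, and the function names are hypothetical.

```python
import numpy as np

def N(M):
    """N(M) = 1/4 * ||M M^T - I_p||^2 (squared Frobenius norm)."""
    p = M.shape[0]
    return 0.25 * np.linalg.norm(M @ M.T - np.eye(p)) ** 2

def grad_N(M):
    """nabla N(M) = (M M^T - I_p) M."""
    p = M.shape[0]
    return (M @ M.T - np.eye(p)) @ M

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))  # generic point, not orthogonal
D = rng.standard_normal((4, 4))  # arbitrary direction
eps = 1e-5

# Central finite difference of N along D vs. the inner product <grad N(M), D>.
finite_diff = (N(M + eps * D) - N(M - eps * D)) / (2 * eps)
analytic = np.sum(grad_N(M) * D)
```

The two quantities should agree up to finite-difference error, and `grad_N` vanishes at any orthogonal matrix.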
Optimization
Riemannian gradient:
$$\mathrm{grad}f(M) = \mathrm{Skew}(\nabla f(M)M^\top) M$$
The two directions \(\nabla \mathcal{N}(M)\) and \(\mathrm{grad}f(M)\) are orthogonal for the Frobenius inner product!
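One way to verify the orthogonality, writing \(A = \mathrm{Skew}(\nabla f(M)M^\top)\) and \(S = MM^\top - I_p\) (shorthand introduced here for the check) and using the Frobenius inner product:
$$\langle AM, SM\rangle = \mathrm{Tr}(M^\top A^\top S M) = \mathrm{Tr}(A^\top S MM^\top) = 0$$
The last equality holds because \(A^\top\) is skew-symmetric while \(S MM^\top = (MM^\top)^2 - MM^\top\) is symmetric, and the trace of a skew-symmetric matrix times a symmetric matrix vanishes.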
The landing field
Combine the two directions, with a hyperparameter \(\lambda > 0\):
$$\Lambda(M) = \mathrm{grad}f(M) + \lambda \nabla \mathcal{N}(M)$$
The landing field is cheap to compute
$$\Lambda(M) = \left(\mathrm{Skew}(\nabla f(M)M^\top) + \lambda (MM^\top-I_p)\right)M$$
Only matrix-matrix multiplications! No expensive linear algebra, and parallelizable on GPUs
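A possible NumPy transcription of the landing field, using only matrix products. The function name `landing_field`, the default `lam=1.0`, and the random test matrices are assumptions for illustration; `G` merely stands in for a gradient \(\nabla f(M)\) obtained by backprop.

```python
import numpy as np

def landing_field(M, euclid_grad, lam=1.0):
    """Lambda(M) = (Skew(nabla f(M) M^T) + lam * (M M^T - I_p)) M."""
    p = M.shape[0]
    G = euclid_grad @ M.T
    skew_part = 0.5 * (G - G.T)          # Skew(nabla f(M) M^T)
    dist_part = M @ M.T - np.eye(p)      # M M^T - I_p
    return (skew_part + lam * dist_part) @ M

rng = np.random.default_rng(0)
p = 4
M = rng.standard_normal((p, p))  # generic point, off the manifold
G = rng.standard_normal((p, p))  # stand-in for nabla f(M)

optim_term = landing_field(M, G, lam=0.0)   # Skew(nabla f(M) M^T) M alone
proj_term = (M @ M.T - np.eye(p)) @ M       # nabla N(M)
inner = np.sum(optim_term * proj_term)      # Frobenius inner product
```

Here `inner` is zero up to rounding error, even though `M` is not orthogonal.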
Comparison to retractions:
The Landing algorithm:
$$\Lambda(M) = \left(\mathrm{Skew}(\nabla f(M)M^\top) + \lambda (MM^\top-I_p)\right)M$$
Starting from \(M^0\in\mathcal{O}_p\), iterate
$$M^{t+1} = M^t -\eta \Lambda(M^t)$$
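The iteration can be sketched on a small toy problem. Everything specific here is an assumption chosen for illustration: the objective \(f(M) = \frac12\|M - B\|^2\) with a rotation target `B`, the values `eta=0.1` and `lam=1.0`, the iteration count, and the helper name `landing_step`.

```python
import numpy as np

def landing_step(M, euclid_grad, eta, lam):
    """One landing iteration M <- M - eta * Lambda(M)."""
    p = M.shape[0]
    G = euclid_grad @ M.T
    field = (0.5 * (G - G.T) + lam * (M @ M.T - np.eye(p))) @ M
    return M - eta * field

# Toy problem (assumption): f(M) = 0.5 * ||M - B||^2, so nabla f(M) = M - B,
# where B is a rotation of angle 1 radian about the third axis.
theta = 1.0
B = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
M = np.eye(3)  # start on the manifold
for _ in range(500):
    M = landing_step(M, M - B, eta=0.1, lam=1.0)

dist_to_manifold = np.linalg.norm(M @ M.T - np.eye(3))
dist_to_target = np.linalg.norm(M - B)
```

In this example the intermediate iterates leave the manifold slightly, but the final iterate is orthogonal and equal to `B` up to numerical precision, without any matrix exponential.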
landing aistats slide
By Pierre Ablin