FAST AND ACCURATE OPTIMIZATION ON THE ORTHOGONAL MANIFOLD WITHOUT RETRACTION

Gabriel Peyré

CNRS - École normale supérieure

Pierre Ablin

CNRS - Université Paris-Dauphine

ORTHOGONAL WEIGHTS IN A NEURAL NETWORK?

A matrix \(W \in\mathbb{R}^{p\times p}\) is orthogonal when:

 

$$WW^\top = I_p$$
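As a quick illustration (a minimal PyTorch sketch, not part of the original slides), an orthogonal matrix can be built from a QR decomposition and the defining identity checked numerically:

```python
import torch

p = 4
# The Q factor of a QR decomposition of a square matrix is orthogonal.
W, _ = torch.linalg.qr(torch.randn(p, p))

# Defining property of the orthogonal manifold: W W^T = I_p.
print(torch.allclose(W @ W.T, torch.eye(p), atol=1e-6))  # True
```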

Orthogonal matrices can be used to build layers \(y = \sigma(Wx + b)\) in neural networks, for:

  • Certified adversarial robustness [1]
  • Vanishing / exploding gradients [2]
  • Normalizing flows [3]

[1]: Cisse, Moustapha, et al. "Parseval networks: Improving robustness to adversarial examples." International Conference on Machine Learning. PMLR, 2017.

[2]: Arjovsky, Martin, Amar Shah, and Yoshua Bengio. "Unitary evolution recurrent neural networks." International Conference on Machine Learning. PMLR, 2016.

[3]: Berg, Rianne van den, et al. "Sylvester normalizing flows for variational inference." arXiv preprint arXiv:1803.05649 (2018).

Training problem

Neural network \(\phi_{\theta}: x\mapsto y\) with parameters \(\theta\); some of the parameters are constrained to be orthogonal matrices.

 

Dataset \(x_1, \dots, x_n\).

 

Find parameters by empirical risk minimization:

 

$$\min_{\theta}f(\theta) = \frac1n\sum_{i=1}^n\ell_i(\phi_{\theta}(x_i))$$

 

\(\ell_i\) is the individual loss function (e.g., log-likelihood, cross-entropy with targets, ...).
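To make the objective concrete, here is a minimal sketch assuming a hypothetical one-hidden-layer network with cross-entropy losses (PyTorch; the names `W`, `V`, `b`, and `f` are illustrative); how to keep \(W\) orthogonal during training is precisely the question below:

```python
import torch
import torch.nn.functional as F

p, n, n_classes = 16, 128, 10
x = torch.randn(n, p)                              # dataset x_1, ..., x_n
targets = torch.randint(0, n_classes, (n,))        # labels for the losses ell_i

W = torch.linalg.qr(torch.randn(p, p))[0]          # weight we want to keep orthogonal
b = torch.zeros(p)
V = torch.randn(n_classes, p) / p ** 0.5           # unconstrained output layer (illustrative)

def f(W):
    """Empirical risk (1/n) * sum_i ell_i(phi_theta(x_i))."""
    y = torch.relu(x @ W.T + b)                    # layer y = sigma(W x + b)
    logits = y @ V.T
    return F.cross_entropy(logits, targets)        # averages over the n samples

print(float(f(W)))
```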

How to do this with orthogonal weights?

Orthogonal manifold

\( \mathcal{O}_p = \{W\in\mathbb{R}^{p\times p}|\enspace W^\top W =I_p\}\) is a Riemannian manifold

 

Problem:

$$\min_{W\in\mathcal{O}_p}f(W)$$

Main approach:


Extend Euclidean algorithms to the Riemannian setting (gradient descent, stochastic gradient descent...)

Riemannian gradient descent with retraction

[Figure: Riemannian gradient descent on \(\mathcal{O}_p\): from each iterate \(W^t\), follow \(-\mathrm{grad} f(W^t)\) and retract back onto the manifold, producing \(W^1, W^2, W^3, \dots\)]
  • Start from \(W^0\in\mathcal{O}_p\)
  • Iterate \(W^{t+1} = \mathcal{R}(W^t, -\eta\,\mathrm{grad} f(W^t))\), with step size \(\eta > 0\)

Classical retraction:

$$\mathcal{R}(W, AW) = \exp(A)W, \quad A \text{ skew-symmetric}$$

Riemannian gradient: \(\mathrm{grad} f(W) = \mathrm{Skew}(\nabla f(W)W^\top)W\), where \(\mathrm{Skew}(B) = \frac{1}{2}(B - B^\top)\)
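A minimal sketch of this Riemannian gradient using PyTorch autograd (the helper names `skew` and `riemannian_grad` and the toy objective \(f(M) = \langle A, M\rangle\) are illustrative); the final check confirms that \(\mathrm{grad} f(W)\,W^\top\) is skew-symmetric, i.e. the gradient lies in the tangent space at \(W\):

```python
import torch

def skew(B):
    """Skew(B) = (B - B^T) / 2."""
    return 0.5 * (B - B.T)

def riemannian_grad(f, W):
    """grad f(W) = Skew(nabla f(W) W^T) W."""
    W = W.detach().requires_grad_(True)
    (euclid_grad,) = torch.autograd.grad(f(W), W)   # nabla f(W) by backprop
    return skew(euclid_grad @ W.T) @ W

p = 4
A = torch.randn(p, p)
W = torch.linalg.qr(torch.randn(p, p))[0]
G = riemannian_grad(lambda M: torch.sum(A * M), W)  # toy objective f(M) = <A, M>

# G W^T should be skew-symmetric when W is orthogonal.
print(torch.allclose(G @ W.T, -(G @ W.T).T, atol=1e-5))  # True
```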

Today's problem: retractions can be very costly for deep learning

Computational cost

Riemannian gradient descent in a neural network: 

  • Compute \(\nabla f(W)\) using backprop
  • Compute the Riemannian gradient \(\mathrm{grad}f(W) \)
  • Move using a retraction \(W\leftarrow \mathcal{R}(W, -\eta \mathrm{grad} f(W))\)

Classical retraction:

 

$$\mathcal{R}(W, AW) =\exp(A)W$$

  • Costly linear algebra operations
  • Not well suited for GPUs (hard to parallelize)
  • Can be the most expensive step (see the sketch below)
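Putting the three steps together, a rough sketch of one retraction-based update (toy objective, hypothetical step size; `retraction_step` is an illustrative helper, not the authors' code); the matrix exponential on the last line of the helper is the costly, hard-to-parallelize operation:

```python
import torch

def retraction_step(f, W, eta):
    """One step of Riemannian gradient descent with R(W, AW) = exp(A) W."""
    W = W.detach().requires_grad_(True)
    (euclid_grad,) = torch.autograd.grad(f(W), W)        # 1. nabla f(W) by backprop
    A = 0.5 * (euclid_grad @ W.T - W @ euclid_grad.T)    # 2. Skew(nabla f(W) W^T)
    return torch.matrix_exp(-eta * A) @ W                # 3. retraction: costly matrix exponential

p = 64
B = torch.randn(p, p) / p                                # toy objective f(W) = <B, W>
W = torch.linalg.qr(torch.randn(p, p))[0]
W = retraction_step(lambda M: torch.sum(B * M), W, eta=0.1)

# The retraction keeps the iterate exactly on the manifold (up to rounding).
print(torch.allclose(W @ W.T, torch.eye(p), atol=1e-4))  # True
```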

Main idea

In a deep learning setting, moving on the manifold is too costly!

 

Can we design a method that is free to move outside the manifold, yet

  • still converges to the solutions of \(\min_{W\in\mathcal{O}_p} f(W)\)
  • has cheap iterations?

Optimization and projection

Projection

 

Follow the gradient of $$\mathcal{N}(M) = \frac14\|MM^\top - I_p\|^2$$

$$\nabla \mathcal{N}(M) = (MM^\top - I_p)M$$

Optimization

 

Riemannian gradient:

 $$\mathrm{grad}f(M) = \mathrm{Skew}(\nabla f(M)M^\top)  M$$

These two terms are orthogonal for any \(M\): the first is \(SM\) with \(S = \mathrm{Skew}(\nabla f(M)M^\top)\) skew-symmetric, the second is \(PM\) with \(P = MM^\top - I_p\) symmetric, and \(\langle SM, PM\rangle = \mathrm{tr}(S^\top P MM^\top) = 0\) because \(PMM^\top\) is symmetric.
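A quick numerical illustration (random matrices standing in for \(M\) and \(\nabla f(M)\), double precision for a clean zero) that the two fields are orthogonal even when \(M\) is not on the manifold:

```python
import torch

p = 8
M = torch.randn(p, p, dtype=torch.float64)   # arbitrary point, not necessarily orthogonal
G = torch.randn(p, p, dtype=torch.float64)   # stand-in for the Euclidean gradient nabla f(M)

riem = 0.5 * (G @ M.T - M @ G.T) @ M         # Skew(nabla f(M) M^T) M
proj = (M @ M.T - torch.eye(p, dtype=torch.float64)) @ M   # nabla N(M) = (M M^T - I_p) M

# Frobenius inner product of the two fields: zero up to floating-point error.
print(float(torch.sum(riem * proj)))
```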

The landing field


The landing field combines these two orthogonal terms:

$$\Lambda(M) = \mathrm{grad}f(M) + \lambda \nabla \mathcal{N}(M),$$

where \(\lambda > 0\) balances optimization against attraction to the manifold.

The landing field is cheap to compute

$$\Lambda(M) = \left(\mathrm{Skew}(\nabla f(M)M^\top) + \lambda (MM^\top-I_p)\right)M$$

 

Only matrix-matrix multiplications! No expensive linear algebra, and it parallelizes well on GPUs.
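A minimal sketch of the landing field (the helper name `landing_field` and the default \(\lambda = 1\) are illustrative); everything is a plain matrix-matrix product, so it fits naturally in a standard GPU training loop:

```python
import torch

def landing_field(euclid_grad, M, lam=1.0):
    """Lambda(M) = (Skew(nabla f(M) M^T) + lam * (M M^T - I_p)) M.

    Only matrix-matrix multiplications: no matrix exponential, QR or SVD.
    """
    eye = torch.eye(M.shape[0], device=M.device, dtype=M.dtype)
    skew_part = 0.5 * (euclid_grad @ M.T - M @ euclid_grad.T)
    return (skew_part + lam * (M @ M.T - eye)) @ M
```

In a network, `euclid_grad` would simply be the gradient of the loss with respect to `M` computed by backprop.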

 

 

Comparison to retractions:

 

The Landing algorithm:

$$\Lambda(M) = \left(\mathrm{Skew}(\nabla f(M)M^\top) + \lambda (MM^\top-I_p)\right)M$$

Starting from \(M^0\in\mathcal{O}_p\), iterate

$$M^{t+1} = M^t -\eta \Lambda(M^t)$$
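A self-contained sketch of the full iteration on a toy objective (the values of \(p\), \(\eta\), \(\lambda\), and the number of steps are illustrative); the iterates are free to leave \(\mathcal{O}_p\), but the distance to the manifold stays controlled and shrinks as they land:

```python
import torch

p, eta, lam, n_steps = 32, 0.1, 1.0, 500
A = torch.randn(p, p, dtype=torch.float64) / p                   # toy objective f(M) = <A, M>
eye = torch.eye(p, dtype=torch.float64)

M = torch.linalg.qr(torch.randn(p, p, dtype=torch.float64))[0]   # M^0 on the manifold
for _ in range(n_steps):
    M_ = M.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(torch.sum(A * M_), M_)            # nabla f(M) by backprop
    skew_part = 0.5 * (g @ M.T - M @ g.T)                        # Skew(nabla f(M) M^T)
    field = (skew_part + lam * (M @ M.T - eye)) @ M              # landing field Lambda(M)
    M = M - eta * field                                          # plain Euclidean update, no retraction

# Distance to the manifold after the run: small, the iterate has essentially landed.
print(float(torch.linalg.norm(M @ M.T - eye)))
```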

[Figure: landing trajectory; the iterates \(M^0, M^1, M^2, \dots\) leave \(\mathcal{O}_p\), driven by \(-\mathrm{grad}f(M)\) and pulled back by \(-\nabla \mathcal{N}(M)\)]
