FAST AND ACCURATE OPTIMIZATION ON THE ORTHOGONAL MANIFOLD WITHOUT RETRACTION.

Gabriel Peyré

CNRS - École normale supérieure

Pierre Ablin

CNRS - Université Paris-Dauphine

ORTHOGONAL WEIGHTS IN A NEURAL NETWORK?

A matrix $W \in\mathbb{R}^{p\times p}$ is orthogonal when:

$$WW^\top = I_p$$

Orthogonal matrices can be used to build neural network layers $y = \sigma(Wx + b)$, with applications to:

Certified adversarial robustness [1]

Mitigating vanishing / exploding gradients [2]

Normalizing flows [3]

[1]: Cisse, Moustapha, et al. "Parseval networks: Improving robustness to adversarial examples." International Conference on Machine Learning. PMLR, 2017.

[2]: Arjovsky, Martin, Amar Shah, and Yoshua Bengio. "Unitary evolution recurrent neural networks." International Conference on Machine Learning. PMLR, 2016.

[3]: Berg, Rianne van den, et al. "Sylvester normalizing flows for variational inference." arXiv preprint arXiv:1803.05649 (2018).

Training problem

Neural network $\phi_{\theta}: x\mapsto y$ with parameters $\theta$. Some parameters are orthogonal matrices.

Dataset $x_1, \dots, x_n$.

Find parameters by empirical risk minimization:

$$\min_{\theta}f(\theta) = \frac1n\sum_{i=1}^n\ell_i(\phi_{\theta}(x_i))$$

$\ell_i$: individual loss function (e.g. log-likelihood, cross-entropy with targets, ...)

How to do this with orthogonal weights?

Orthogonal manifold

$\mathcal{O}_p = \{W\in\mathbb{R}^{p\times p} \mid W^\top W = I_p\}$ is a Riemannian manifold

Problem:

$$\min_{W\in\mathcal{O}_p}f(W)$$

Main approach:

Extend Euclidean algorithms to the Riemannian setting (gradient descent, stochastic gradient descent...)

Riemannian gradient descent with retraction

Start from $W^0\in\mathcal{O}_p$
Iterate $W^{t+1} = \mathcal{R}(W^t, -\eta\,\mathrm{grad} f(W^t))$

[Figure: iterates $W^0, W^1, W^2, W^3$ on $\mathcal{O}_p$; each one is reached by following $-\mathrm{grad} f(W^t)$ and retracting onto the manifold.]

Classical retraction:

$$\mathcal{R}(W, AW) =\exp(A)W$$

Riemannian gradient: $\mathrm{grad} f(W) = \mathrm{Skew}(\nabla f(W)W^\top)W$, where $\mathrm{Skew}(X) = \frac12(X - X^\top)$
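The iteration above can be sketched in NumPy on a toy problem. Everything problem-specific here is an assumption made for illustration: the least-squares loss, the data `X` and `Y`, the step size, and the helper names `skew`, `expm_skew` and `riemannian_gd`. The matrix exponential is computed through an eigendecomposition of the Hermitian matrix $iA$, which is only one possible way to evaluate it.

```python
import numpy as np

def skew(X):
    """Skew-symmetric part of a square matrix."""
    return 0.5 * (X - X.T)

def expm_skew(A):
    """exp(A) for a real skew-symmetric A, using that 1j * A is Hermitian."""
    lam, V = np.linalg.eigh(1j * A)            # 1j * A = V diag(lam) V^H
    return ((V * np.exp(-1j * lam)) @ V.conj().T).real

def riemannian_gd(grad_f, W0, eta, n_iter):
    """Iterate W <- exp(-eta * Skew(grad_f(W) W^T)) W."""
    W = W0.copy()
    for _ in range(n_iter):
        A = skew(grad_f(W) @ W.T)              # Riemannian gradient is A @ W
        W = expm_skew(-eta * A) @ W            # retraction R(W, -eta * A W)
    return W

# Toy problem (assumed): recover a rotation Q from the pairs (x_i, Q x_i).
rng = np.random.default_rng(0)
p, n = 5, 50
X = rng.standard_normal((p, n))
Q = expm_skew(skew(rng.standard_normal((p, p))))
Y = Q @ X

def f(W):
    return 0.5 * np.sum((W @ X - Y) ** 2) / n

def grad_f(W):
    return (W @ X - Y) @ X.T / n

W0 = np.eye(p)
W = riemannian_gd(grad_f, W0, eta=0.1, n_iter=200)
print("loss:", f(W0), "->", f(W))
print("distance to manifold:", np.linalg.norm(W @ W.T - np.eye(p)))
```

Each iteration calls `expm_skew`, so the iterates stay orthogonal up to rounding error, at the price of one eigendecomposition per step.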

Today's problem: retractions can be very costly for deep learning

Computational cost

Riemannian gradient descent in a neural network:

Compute $\nabla f(W)$ using backprop
Compute the Riemannian gradient $\mathrm{grad}f(W)$
Move using a retraction $W\leftarrow \mathcal{R}(W, -\eta \mathrm{grad} f(W))$

Classical retraction:

$$\mathcal{R}(W, AW) =\exp(A)W$$

costly linear algebra operations
not suited for GPU (hard to parallelize)
can be the most expensive step

Main idea

In a deep learning setting, moving on the manifold is too costly!

Can we design a method that is free to move outside the manifold and that:

Still converges to the solutions of $\min_{W\in\mathcal{O}_p} f(W)$?
Has cheap iterations?

Optimization and projection

Projection

Follow the gradient of $$\mathcal{N}(M) = \frac14\|MM^\top - I_p\|^2$$

$$\nabla \mathcal{N}(M) = (MM^\top - I_p)M$$
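The gradient formula can be checked numerically with a central finite difference. This is a minimal sketch that assumes $\|\cdot\|$ denotes the Frobenius norm (the slides do not say so explicitly); the function names `N` and `grad_N`, the matrix size and the random test matrices are illustrative choices.

```python
import numpy as np

def N(M):
    """N(M) = 1/4 * ||M M^T - I||_F^2 (Frobenius norm assumed)."""
    R = M @ M.T - np.eye(M.shape[0])
    return 0.25 * np.sum(R ** 2)

def grad_N(M):
    """Closed-form gradient (M M^T - I) M."""
    return (M @ M.T - np.eye(M.shape[0])) @ M

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
D = rng.standard_normal((4, 4))     # arbitrary direction

eps = 1e-6
fd = (N(M + eps * D) - N(M - eps * D)) / (2 * eps)   # directional derivative
exact = np.sum(grad_N(M) * D)                        # <grad N(M), D>
print(fd, exact)
```

The two printed numbers should agree up to finite-difference error.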

Optimization

Riemannian gradient:

$$\mathrm{grad}f(M) = \mathrm{Skew}(\nabla f(M)M^\top) M$$

These two terms are orthogonal!
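One way to see this, assuming the Frobenius inner product $\langle X, Y\rangle = \mathrm{Tr}(X^\top Y)$: write $A = \mathrm{Skew}(\nabla f(M)M^\top)$, which is skew-symmetric, and $S = MM^\top - I_p$, which is symmetric, so that $\mathrm{grad}f(M) = AM$ and $\nabla\mathcal{N}(M) = SM$. Then

$$\begin{aligned}
\langle AM, SM\rangle &= \mathrm{Tr}(M^\top A^\top S M) = \mathrm{Tr}(A^\top S\, MM^\top) \\
&= -\mathrm{Tr}\big(A\, S (S + I_p)\big) = -\mathrm{Tr}\big(A (S^2 + S)\big) = 0,
\end{aligned}$$

where the last equality holds because $S^2 + S$ is symmetric and the trace of a skew-symmetric matrix times a symmetric matrix is zero.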

The landing field

$$\Lambda(M) = \mathrm{grad}f(M) + \lambda \nabla \mathcal{N}(M)$$

The landing field is cheap to compute

$$\Lambda(M) = \left(\mathrm{Skew}(\nabla f(M)M^\top) + \lambda (MM^\top-I_p)\right)M$$

Only matrix-matrix multiplications! No expensive linear algebra, and parallelizable on GPUs
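As a rough illustration, the field can be written with matrix products only. In this sketch the function name `landing_field`, the value `lam=1.0` and the random matrix `G` (a stand-in for a gradient obtained by backprop) are assumptions, not part of the slides.

```python
import numpy as np

def skew(X):
    """Skew-symmetric part of a square matrix."""
    return 0.5 * (X - X.T)

def landing_field(M, G, lam):
    """(Skew(G M^T) + lam * (M M^T - I)) M, where G stands for the gradient of f at M."""
    p = M.shape[0]
    return (skew(G @ M.T) + lam * (M @ M.T - np.eye(p))) @ M

rng = np.random.default_rng(0)
p = 4
M = rng.standard_normal((p, p))      # generic point, not orthogonal
G = rng.standard_normal((p, p))      # stand-in for the backprop gradient

optim_term = skew(G @ M.T) @ M               # Riemannian gradient formula at M
proj_term = (M @ M.T - np.eye(p)) @ M        # gradient of N at M
field = landing_field(M, G, lam=1.0)
inner = np.sum(optim_term * proj_term)       # Frobenius inner product of the two terms

# On the manifold the projection term vanishes and the field is the Riemannian gradient.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
field_on_manifold = landing_field(Q, G, lam=1.0)
print(inner, np.linalg.norm(field_on_manifold - skew(G @ Q.T) @ Q))
```

No factorization, matrix inverse or matrix exponential appears in `landing_field`; both printed values should be zero up to rounding error.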

Comparison to retractions:

The Landing algorithm:

$$\Lambda(M) = \left(\mathrm{Skew}(\nabla f(M)M^\top) + \lambda (MM^\top-I_p)\right)M$$

Starting from $M^0\in\mathcal{O}_p$, iterate

$$M^{t+1} = M^t -\eta \Lambda(M^t)$$
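The update can be sketched in a few lines of NumPy. The toy least-squares problem, the values of `eta` and `lam`, the number of iterations and the function names are all assumptions chosen for illustration; they are not tuned or taken from the slides.

```python
import numpy as np

def skew(X):
    """Skew-symmetric part of a square matrix."""
    return 0.5 * (X - X.T)

def landing_field(M, G, lam):
    p = M.shape[0]
    return (skew(G @ M.T) + lam * (M @ M.T - np.eye(p))) @ M

def landing(grad_f, M0, eta, lam, n_iter):
    """Iterate M <- M - eta * Lambda(M); also record the distance ||M M^T - I||."""
    M = M0.copy()
    dists = []
    for _ in range(n_iter):
        M = M - eta * landing_field(M, grad_f(M), lam)
        dists.append(np.linalg.norm(M @ M.T - np.eye(M.shape[0])))
    return M, dists

# Toy problem (assumed): recover a planar rotation Q from the pairs (x_i, Q x_i).
rng = np.random.default_rng(0)
p, n, theta = 3, 50, 0.5
Q = np.eye(p)
Q[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]
X = rng.standard_normal((p, n))
Y = Q @ X

def f(M):
    return 0.5 * np.sum((M @ X - Y) ** 2) / n

def grad_f(M):
    return (M @ X - Y) @ X.T / n

M0 = np.eye(p)
M_final, dists = landing(grad_f, M0, eta=0.1, lam=1.0, n_iter=500)
print("loss:", f(M0), "->", f(M_final))
print("largest / final distance to the manifold:", max(dists), dists[-1])
```

The iterates are never projected back: they leave $\mathcal{O}_p$ slightly during the run, and the term $\lambda\nabla\mathcal{N}(M)$ pulls them back as the Riemannian gradient shrinks.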

[Figure: landing iterates $M^0$, $M^1$, $M^2$ near $\mathcal{O}_p$; each step combines the directions $-\mathrm{grad}f(M)$ and $-\nabla \mathcal{N}(M)$.]