Approximate optimization on the orthogonal manifold

Pierre Ablin

Mokameeting

01/04/2020

Introduction to optimization on manifolds,

followed by: approximate optimization on the orthogonal manifold

 

Who am I?

Ph.D. from Inria, supervised by Jean-François Cardoso and Alexandre Gramfort

 

Subject: develop better Independent Component Analysis algorithms for neuroscience

 

Now: postdoc with Gabriel Peyré

 

Interests: OT, optimization, automatic differentiation, manifolds


Outline:

What is a manifold, how to minimize a function defined on a manifold?

Approximate optimization on the orthogonal manifold

Manifold (informal definition)

\(\mathbb{R}^p\) equipped with the canonical scalar product / \(\ell_2\) norm \(\|\cdot\|\)

 

(Sub)-manifold \(\mathcal{M}\) of dimension \(k\) is a "smooth differentiable surface" in \(\mathbb{R}^p\).

 

At each point \(x \in \mathcal{M}\), the manifold looks like a Euclidean space of dimension \(k\), the tangent space \(T_x\).

 

 

Riemannian manifolds

\(T_x\) is endowed with a scalar product \(\langle \cdot, \cdot \rangle_x\).

\(\mathcal{M}\) is Riemannian when \(x \mapsto \langle \cdot, \cdot \rangle_x\) is smooth.

Examples

- The sphere \(S^{p-1} = \{x \in \mathbb{R}^{p} \enspace | \enspace \|x\| = 1 \}\) is a \((p-1)\)-dimensional manifold

 

- Any \(k\)-dimensional vector space

 

- The orthogonal matrix manifold \(\mathcal{O}_p = \{M \in \mathbb{R}^{p \times p} \enspace | \enspace MM^{\top} = I_p\}\) is a \(p(p-1)/2\)-dimensional manifold

Tangent space: example

- On the sphere \(S^{p-1} = \{x \in \mathbb{R}^{p} \enspace | \enspace \|x\| = 1 \}\):

 

 

$$T_x = \{\xi \in \mathbb{R}^p \enspace | \enspace \xi^{\top} x = 0 \}$$

- On the orthogonal manifold \(\mathcal{O}_p = \{M \in \mathbb{R}^{p \times p} \enspace | \enspace MM^{\top} = I_p\}\):

 

 

$$T_W = \{X \in \mathbb{R}^{p \times p} \enspace | \enspace XW^{\top} + WX^{\top} = 0\} = \{AW \enspace | \enspace A\in \mathcal{A}_p\},$$

where \(\mathcal{A}_p\) is the set of \(p \times p\) antisymmetric matrices.

- On a vector space \(F\):

$$T_x = F$$
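These tangent spaces come with simple explicit formulas. Below is a minimal numpy sketch (the helper names proj_sphere and proj_orth are mine, not from the talk) that projects an arbitrary vector onto \(T_x\) for the sphere and onto \(T_W\) for \(\mathcal{O}_p\), and checks the defining equations above:

import numpy as np

def proj_sphere(x, xi):
    # Tangent space of the sphere at x: remove the component of xi along x.
    return xi - (xi @ x) * x

def proj_orth(W, xi):
    # Tangent space of O_p at W: T_W = {A W, A antisymmetric}.
    A = xi @ W.T
    A = 0.5 * (A - A.T)                         # antisymmetric part of xi W^T
    return A @ W

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
x /= np.linalg.norm(x)
xi = proj_sphere(x, rng.standard_normal(5))
print(np.allclose(xi @ x, 0.0))                 # xi^T x = 0

W, _ = np.linalg.qr(rng.standard_normal((5, 5)))
X = proj_orth(W, rng.standard_normal((5, 5)))
print(np.allclose(X @ W.T + W @ X.T, 0.0))      # X W^T + W X^T = 0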

Curves

Consider a "curve on the manifold", linking two points \(x\) and \(x'\).

It is a (differentiable) function \(\gamma\) from \([0, 1]\) to \(\mathcal{M}\):

 

$$ \gamma(t) \in \mathcal{M} \enspace, $$

 

such that \(\gamma(0) = x\) and \(\gamma(1) = x' \).

Its derivative at a point \(t\) is a vector in the tangent space at \(\gamma(t)\):

 

$$\gamma'(t) \in T_{\gamma(t)}$$

 

Geodesic: distances on manifolds

Let \( x, x' \in \mathcal{M} \). Let \(\gamma : [0, 1] \to \mathcal{M} \) be a curve linking \(x\) to \(x'\).

 

The length of \(\gamma \) is:

$$\text{length}(\gamma) = \int_{0}^1 \|\gamma'(t) \|_{\gamma(t)} dt \enspace ,$$

and the geodesic distance between \(x\) and \(x'\) is the minimal length:

$$d(x, x') = \min_{\gamma} \text{length}(\gamma) \enspace .$$

A curve \(\gamma\) achieving this minimum is called a geodesic.

Example:

- On the sphere, equipped with constant metric,  \(d(x, x') = \arccos(x^{\top}x') \)

- On \(\mathcal{O}_p\) equipped with constant metric, \(d(W, W') = \|\log(W'W^{\top})\|_F\)

Geodesics depend on the metric
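A quick numerical sanity check of these two formulas (my sketch, not from the talk; it assumes scipy is available for the matrix exponential and logarithm):

import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(0)

# Sphere: geodesic distance between two unit vectors.
x = rng.standard_normal(4)
x /= np.linalg.norm(x)
y = rng.standard_normal(4)
y /= np.linalg.norm(y)
print(np.arccos(np.clip(x @ y, -1.0, 1.0)))

# O_p: take W' = exp(A) W with A antisymmetric and small,
# then d(W, W') = ||log(W' W^T)||_F = ||A||_F.
W, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = 0.1 * rng.standard_normal((4, 4))
A = 0.5 * (A - A.T)
W2 = expm(A) @ W
print(np.linalg.norm(logm(W2 @ W.T), 'fro'), np.linalg.norm(A, 'fro'))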

Exponential map

Let \(x\in \mathcal{M} \) and \(\xi \in T_x\). There is a unique geodesic \(\gamma \) such that \(\gamma(0) = x \) and \(\gamma'(0) = \xi \). The exponential map at \(x\) is a function from \(T_x\) to \(\mathcal{M}\):

 

$$\text{Exp}_x(\xi) = \gamma(1) \in \mathcal{M}$$

 

- Sometimes available in closed form

- Important property: the exponential map preserves distances!

$$d(\text{Exp}_x(\xi), x) = \|\xi\|_x$$

Retractions

Retractions generalize the exponential: ways to move on the manifold.

 

\(R_x(\xi)\) is a retraction if for all \(x\in \mathcal{M}\), \(R_x\) maps \(T_x\) to \(\mathcal{M}\), and:

$$R_x(\xi) = x + \xi + o(\|\xi\|_2)$$

Examples on the orthogonal manifold \(\mathcal{O}_p\), where \(T_W = \{AW \enspace | \enspace A\in \mathcal{A}_p\}\) (a numpy sketch follows the list):

 

  • Exponential: \(R_W(AW) = \exp(A)W\)
  • Projection:    \(R_W(AW) = \mathcal{P}(W + AW), \enspace \mathcal{P}(M) = (MM^{\top})^{-\frac12}M\)
  • Cayley:           \(R_W(AW) = (I_p - \frac A2)^{-1}(I_p + \frac A2)W\)
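A minimal numpy/scipy sketch of the three retractions above (function names are mine); each maps a tangent vector \(AW\), with \(A\) antisymmetric, back onto \(\mathcal{O}_p\):

import numpy as np
from scipy.linalg import expm, solve

def retraction_exp(W, A):
    # Exponential retraction: exp(A) W.
    return expm(A) @ W

def retraction_proj(W, A):
    # Projection retraction: polar factor of M = W + A W, i.e. (M M^T)^{-1/2} M.
    U, _, Vt = np.linalg.svd(W + A @ W)
    return U @ Vt

def retraction_cayley(W, A):
    # Cayley retraction: (I - A/2)^{-1} (I + A/2) W.
    p = W.shape[0]
    return solve(np.eye(p) - A / 2, (np.eye(p) + A / 2) @ W)

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = rng.standard_normal((4, 4))
A = 0.5 * (A - A.T)
for R in (retraction_exp, retraction_proj, retraction_cayley):
    Wn = R(W, A)
    print(np.allclose(Wn @ Wn.T, np.eye(4)))    # each output is orthogonal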

Optimization on manifolds

Gradient

Optimization on a manifold \(\mathcal{M}\):

$$ \text{minimize} \enspace F(x) \enspace \text{s.t.} \enspace x \in \mathcal{M}\enspace, $$

The (Riemannian) gradient of \(F\) at \(x\) is \(\text{Grad}F(x) \in T_x\) such that:

$$F(R_x(\xi)) = F(x) + \langle \xi, \text{Grad}F(x)\rangle_x + o(\|\xi\|_2)$$

It does not depend on the choice of retraction!

 

Link with the Euclidean gradient: if \(F\) is defined on an open set of \(\mathbb{R}^p\) around \(x\), differentiable at \(x\), and \(\langle\xi, \xi'\rangle_x = \langle S_x\xi, \xi'\rangle_2\), then:

$$\text{Grad}F(x) = S_x^{-1} \text{Proj}_{T_x}(\nabla F(x))$$
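For instance, on the sphere with the metric inherited from \(\mathbb{R}^p\) (so \(S_x = I\)), the Riemannian gradient is simply the Euclidean gradient projected onto \(T_x\). A small sketch with the toy function \(F(x) = \frac12 x^{\top} C x\) (my example, not the talk's):

import numpy as np

def riemannian_grad_sphere(x, g):
    # S_x = I on the sphere: Grad F(x) = Proj_{T_x}(grad F(x)) = g - (g^T x) x.
    return g - (g @ x) * x

rng = np.random.default_rng(0)
C = rng.standard_normal((5, 5))
C = C + C.T                               # symmetric matrix
x = rng.standard_normal(5)
x /= np.linalg.norm(x)

g = C @ x                                 # Euclidean gradient of F(x) = 0.5 * x^T C x
rg = riemannian_grad_sphere(x, g)
print(np.allclose(rg @ x, 0.0))           # the Riemannian gradient lies in T_x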

Riemannian gradient descent

$$ \text{minimize} \enspace F(x) \enspace \text{s.t.} \enspace x \in \mathcal{M}$$

 

$$x^{i+1} = R_{x^i}(-\rho \text{Grad}F(x^i))$$

 

Standard non-convex optimization results [Absil et al., 2009, Optimization Algorithms on Matrix Manifolds]: asymptotic guarantees

 

E.g. with proper line-search:

 

 

$$\text{Grad}F(x^i) \to 0$$
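As an illustration (my own toy problem, not from the talk): Riemannian gradient descent on the sphere for \(F(x) = -\frac12 x^{\top} C x\), whose minimizers are leading eigenvectors of \(C\), using the projection retraction \(R_x(\xi) = (x + \xi)/\|x + \xi\|\) and a fixed step size instead of a line search:

import numpy as np

rng = np.random.default_rng(0)
p = 20
C = rng.standard_normal((p, p))
C = C + C.T                                     # symmetric matrix

x = rng.standard_normal(p)
x /= np.linalg.norm(x)
rho = 0.05
for i in range(500):
    g = -C @ x                                  # Euclidean gradient of F(x) = -0.5 * x^T C x
    rg = g - (g @ x) * x                        # Riemannian gradient: projection onto T_x
    x = x - rho * rg
    x /= np.linalg.norm(x)                      # retraction back onto the sphere

_, eigvecs = np.linalg.eigh(C)
print(abs(x @ eigvecs[:, -1]))                  # should be close to 1 (leading eigenvector)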

Non-asymptotic analysis: geodesic convexity

$$ \text{minimize} \enspace F(x) \enspace \text{s.t.} \enspace x \in \mathcal{M}$$

 

\(\mathcal{X}\subset \mathcal{M}\) is a geodesically convex set if there is only one geodesic between any pair of points of \(\mathcal{X}\).

Examples:

- The northern hemisphere (excluding the equator) is geodesically convex.

- \(\mathcal{O}_p\) is not geodesically convex.

               

\(F\) is geodesically convex on \(\mathcal{X}\) if for any geodesic \(\gamma\), \(F \circ \gamma\) is convex:

$$F(\gamma(t)) \leq (1 - t) F(x) + t F(y), \quad x = \gamma(0), \enspace y = \gamma(1)$$

 

- The linear function \(\langle W, B\rangle\) is g-convex around \(W\) if \(\mathcal{P}_{S_p}(BW^{\top})\), the symmetric part of \(BW^{\top}\), is negative definite

Non-asymptotic analysis is harder than in the Euclidean case, but yields essentially the same rates [Zhang & Sra, 2016, First-order Methods for Geodesically Convex Optimization]

Approximate optimization on the orthogonal manifold

Ongoing, unfinished work!

 

Joint work with G. Peyré

Optimization over big orthogonal matrices?

Orthogonal matrices are used as layers in deep learning:

$$x_{i+1}  =\sigma(Wx_i), \enspace W \in \mathcal{O}_p$$

 

Why?

 

- [Arjovsky, Shah & Bengio, 2016, Unitary evolution recurrent neural networks]: avoid gradient explosion / vanishing, since orthogonal matrices preserve the norm and \(\frac{\partial x_{i+1}}{\partial \theta} = \frac{\partial x_i}{\partial \theta} \text{diag}(\sigma')W^{\top}\)

 

- Acts as implicit regularization
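A tiny numerical illustration of the norm-preservation argument (a sketch with a purely linear layer, ignoring the nonlinearity \(\sigma\)):

import numpy as np

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((100, 100)))   # a random orthogonal matrix
x = rng.standard_normal(100)
print(np.linalg.norm(W @ x), np.linalg.norm(x))        # equal: orthogonal layers preserve the norm

y = x.copy()
for _ in range(50):                                    # a deep stack of such layers
    y = W @ y
print(np.linalg.norm(y) / np.linalg.norm(x))           # still 1: no explosion, no vanishing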

Riemannian gradient on \(\mathcal{O}_p\)

Assuming the constant metric \(\langle \cdot, \cdot \rangle_W = \langle \cdot, \cdot\rangle_2\):

$$\text{Grad}F(W) = \text{Proj}_{T_W}(\nabla F(W)) = P_{\mathcal{A}_p}(\nabla F(W)W^{\top})W, \quad P_{\mathcal{A}_p}(M) = \frac12(M - M^{\top})$$

Relative gradient: \(\phi(W) = P_{\mathcal{A}_p}(\nabla F(W)W^{\top})\in\mathcal{A}_p\), \(\text{Grad} F(W) = \phi(W) W\)

 

Example: Gradient descent with exponential retraction

 

$$W^{i+1} = \exp(-\rho \phi(W^i))W^i$$ 

Computing the gradient \(\phi(W^i)\) is cheap (only matrix products)
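A numpy/scipy sketch of this update, on the toy objective \(F(W) = \frac12\|W - B\|_F^2\) (my choice of objective and step size, not the talk's):

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
p = 6
B = rng.standard_normal((p, p))

def phi(W):
    # Relative gradient: antisymmetric part of grad F(W) W^T, with grad F(W) = W - B.
    M = (W - B) @ W.T
    return 0.5 * (M - M.T)

W, _ = np.linalg.qr(rng.standard_normal((p, p)))
print(np.linalg.norm(phi(W)))               # gradient norm at the starting point
rho = 0.1
for i in range(1000):
    W = expm(-rho * phi(W)) @ W             # exponential retraction: stays exactly on O_p

print(np.allclose(W @ W.T, np.eye(p)))      # W is still orthogonal
print(np.linalg.norm(phi(W)))               # gradient norm, much smaller after the descent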

Retractions are costly

  • Exponential: \(R_W(AW) = {\color{red}\exp}(A)W\)
  • Projection:    \(R_W(AW) = {\color{red}\mathcal{P}}(W + AW)\),  \(\color{red} \mathcal{P}(M) = (MM^{\top})^{-\frac12}M\)
  • Cayley:           \(R_W(AW) = (I_p - \frac A2)^{\color{red}-1}(I_p + \frac A2)W\)

Not suited to GPUs!

There is no polynomial retraction \(R_W(AW) = P(A, W)\)

Approximate optimization

To first order, a retraction is \(R_W(AW) = (I_p + A)W\): cheap!

 

Idea:

$$W^{i+1} = (I_p - \rho\phi(W^i))W^i$$

 

\(\rightarrow\) The corresponding flow stays on the manifold:

$$\dot{W}(t) = -\phi(W(t))W(t)$$

 

but its discrete counterpart can drift away from it...

 

We need to define \(\phi(W)\) for \(W \notin \mathcal{O}_p\): the formula \(\phi(W) = P_{\mathcal{A}_p}(\nabla F(W)W^{\top})\) is convenient

EASI algorithm

$$W^{i+1} = (I_p - \rho\phi(W^i))W^i$$

 

At a stationary point, \(\phi(W) = 0\): no guarantee that \(W\in\mathcal{O}_p\).


Add an orthogonalization term [Cardoso & Laheld, 1996, Equivariant adaptive source separation]:

 

$$W^{i+1} = \left(I_p - \rho(\phi(W^i) + W^i(W^i)^{\top} - I_p)\right)W^i$$

 

At a stationary point: \(\phi(W^i) + W^i(W^i)^{\top} - I_p = 0\)

 

Symmetric part: \(WW^{\top}=I_p\rightarrow W\) is orthogonal

Antisymmetric part: \(\phi(W)=0\rightarrow\) critical point of \(F\)
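A minimal numpy sketch of this iteration, on the toy objective \(F(W) = \frac12\|W - B\|_F^2\) with an orthogonal target \(B\) (again my choice, not the talk's); the point is that only matrix products are needed, with no exponential, inverse, or square root:

import numpy as np

rng = np.random.default_rng(0)
p = 6
B, _ = np.linalg.qr(rng.standard_normal((p, p)))   # orthogonal target: the minimizer of F on O_p is B

def phi(W):
    # Relative gradient of F(W) = 0.5 * ||W - B||_F^2, defined for any square W.
    M = (W - B) @ W.T
    return 0.5 * (M - M.T)

Id = np.eye(p)
W, _ = np.linalg.qr(rng.standard_normal((p, p)))
rho = 0.1
for i in range(2000):
    # Descent direction + orthogonalization term: only matrix-matrix products.
    W = (Id - rho * (phi(W) + W @ W.T - Id)) @ W

print(np.linalg.norm(W @ W.T - Id))   # should be small: W has "landed" on O_p
print(np.linalg.norm(W - B))          # should be small: W is close to the minimizer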

Landing flow

Corresponding flow:

 

$$\dot{W}= - \left(\phi(W) + WW^{\top} - I_p\right)W$$

 

Consider \(N(W) = \|WW^{\top} - I_p\|_F^2\).

 

$$\dot{W}= - \text{Grad}F(W) - \nabla N(W)$$

 

Not a gradient flow

Orthogonalization

 \(N(W) = \|WW^{\top} - I_p\|_F^2\)

Prop: Let \(\ell(t) = N(W(t))\). We have \(\ell(t) \leq K \exp(-4t)\) for some constant \(K\)

Proof:

$$\begin{aligned}
\ell'(t) &= -\langle \dot W, (WW^{\top} -I_p)W\rangle\\
&= -\langle \phi(W) + WW^{\top} - I_p, (WW^{\top} -I_p)WW^{\top}\rangle\\
&= -\|(WW^{\top} - I_p)W\|^2\\
&\leq -4 \ell(t)\sqrt{1 - \ell(t)}
\end{aligned}$$

\(\phi\) disappears! Same convergence as Oja's flow [Oja, 1982, Simplified neuron model as a principal component analyzer]:

$$\dot{W}= -(WW^{\top} - I_p)W$$

Function minimization (?)

Let \(L(t) = F(W(t))\). Then:

$$\begin{aligned}
L'(t) &= -\langle \dot W, \nabla F(W)\rangle\\
&= -\langle \phi(W) + WW^{\top} - I_p, \nabla F(W)W^{\top}\rangle\\
&= -\|\phi\|^2 - \langle WW^{\top} - I_p, \nabla F(W)W^{\top}\rangle
\end{aligned}$$

The second term can be controlled with \(\ell(t)\) using Cauchy-Schwarz and a bounded gradient: \(-\langle WW^{\top} - I_p, \nabla F(W)W^{\top}\rangle \leq C\sqrt{\ell(t)}\).

The first term: a local Polyak-Łojasiewicz type inequality, based on the expansion \(\phi(W^* + E) = E H(W^*) + O(E^2)\) around a critical point \(W^*\):

$$\|\phi\|\geq 2\mu(F(W) - F^*)$$

Combining both terms:

$$L'(t)\leq -\mu L(t)^2 + K \exp(-2t)$$

Example: orthogonal Procrustes

\(X, Y \in \mathbb{R}^{n \times p} \): two datasets of \(n\) samples in dimension \(p\)

Orthogonal Procrustes:

$$\min_W \|X - YW\|_F^2, \enspace \text{s.t.} \enspace W \in \mathcal{O}_p$$

 

Closed form solution: \(W^* = \mathcal{P}(Y^{\top}X)\).

Landing flow for \(p=2\):

Example: orthogonal Procrustes

Discrete algorithm:
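A hedged numpy sketch of one possible discrete landing iteration for this Procrustes problem (the data generation, step size, and iteration count are my own choices), compared against the closed-form solution:

import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
Y = rng.standard_normal((n, p))
W_true, _ = np.linalg.qr(rng.standard_normal((p, p)))
X = Y @ W_true                                 # noiseless data: the optimum is W_true

def phi(W):
    # Relative gradient of F(W) = 1/(2n) * ||X - Y W||_F^2, defined for any square W.
    G = Y.T @ (Y @ W - X) / n                  # Euclidean gradient
    M = G @ W.T
    return 0.5 * (M - M.T)

# Closed-form solution: W* = P(Y^T X), the polar factor of Y^T X.
U, _, Vt = np.linalg.svd(Y.T @ X)
W_star = U @ Vt

# Discrete landing iteration: only matrix products.
Id = np.eye(p)
W = np.eye(p)
rho = 0.1
for i in range(1000):
    W = (Id - rho * (phi(W) + W @ W.T - Id)) @ W

print(np.linalg.norm(W - W_star))              # should be small
print(np.linalg.norm(W @ W.T - Id))            # should be small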

Ongoing directions

  • Proof of convergence of the flow under suitable conditions
  • How to choose the step size \(\rho\)?
  • Experiments with neural networks
  • Acceleration possibility: quasi-Newton (\(\phi\) obtained with different metric), extrapolation
  • Extension to Stiefel manifold (\(W\in \mathbb{R}^{p \times {\color{red}d}}, \enspace WW^{\top} = I_p\))

Thanks for your attention!
