Gradient descent with a general cost

Flavien Léger

joint work with Pierre-Cyril Aubin-Frankowski,

Gabriele Todeschi, François-Xavier Vialard

Introduction

Minimize \(f\colon X\to\mathbb{R}\cup\{+\infty\}\) using a function \(c(x,y)\) as a “movement limiter”

⏵ Explicit minimizing movement scheme based on \(c(x,y)\)

⏵ Even without differentiability

⏵ Identify the convexity properties needed for convergence

Ex 1:   \(\displaystyle x_{n+1}\in\argmin_{x\in \mathbb{R}^d}f(x)+\frac{L\lVert x-x_n\rVert^2}{2} \quad\longrightarrow \quad x_{n+1}\in\argmin_{x\in X}f(x)+c(x,x_n)\)

Ex 2:   \(\displaystyle x_{n+1}-x_n=-\frac1L\nabla f(x_n)\quad \longrightarrow\quad ?\)

Gradient descent with a general cost

If there exists \(h\) such that \(f(x)=\inf_{y\in Y}c(x,y)+h(y)\) then

\[\inf_{x\in X}f(x)=\inf_{x\in X,y\in Y}\underbrace{c(x,y)+f^c(y)}_{\phi(x,y)}\]

Algorithm GDGC: alternating minimization of \(\phi(x,y)\)

Given:      \(X,Y\) arbitrary sets,

\(c\colon X\times Y\to\mathbb{R}\cup\{+\infty\}\),

\(f\colon X\to\mathbb{R}\cup\{+\infty\}\)
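To make the scheme concrete, here is a minimal brute-force sketch (mine, not from the talk) with finite \(X,Y\): the random cost matrix `C` and the choice \(h=0\) are made-up placeholders, so \(f(x)=\min_y c(x,y)\) is c-concave by construction.

```python
import numpy as np

# GDGC on finite sets: alternating minimization of
# phi(x, y) = c(x, y) + f^c(y), all argmins by enumeration.
rng = np.random.default_rng(0)
nX, nY = 50, 60
C = rng.random((nX, nY))            # cost c(x, y) as an nX-by-nY matrix
f = C.min(axis=1)                   # f(x) = min_y c(x, y): c-concave with h = 0
fc = (f[:, None] - C).max(axis=0)   # c-transform f^c(y) = max_x f(x) - c(x, y)

x = 0                               # arbitrary starting point
for _ in range(20):
    y = np.argmin(C[x, :] + fc)     # y-update: minimize phi(x_n, .)
    x = np.argmin(C[:, y] + fc[y])  # x-update: minimize phi(., y_{n+1})
print(f[x], f.min())                # value reached vs. global minimum
```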

1. Formulation

2. Convergence theory

3. Direct applications

c-concavity

Definition. \(f\) is c-concave if there exists \(h\colon Y\to \mathbb{R}\cup\{+\infty\}\) s.t. \[f(x)=\inf_{y\in Y}c(x,y)+h(y).\]

Smallest such \(h\) is the c-transform \(f^c(y)=\sup_{x\in X} f(x)-c(x,y).\)

In general \(f(x)\leq c(x,y)+f^c(y)\), and

\(f\) is c-concave \(\iff f(x)=\inf_{y\in Y} c(x,y)+f^c(y)\)
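A small numerical sketch (mine, not from the talk) of this equivalence on a 1-D grid, with the toy choice \(c(x,y)=\frac{L}{2}(x-y)^2\), \(L=1\): the double transform \(\inf_y c(\cdot,y)+f^c(y)\) reproduces \(f\) exactly when \(f\) is c-concave.

```python
import numpy as np

# Check c-concavity numerically: f is c-concave iff inf_y c(., y) + f^c(y)
# reproduces f.  Toy 1-D grids; c(x, y) = (L/2)(x - y)^2 with L = 1.
L = 1.0
xs = np.linspace(-3, 3, 601)
ys = np.linspace(-6, 6, 1201)
c = 0.5 * L * (xs[:, None] - ys[None, :])**2

def double_transform(f):
    fc = (f[:, None] - c).max(axis=0)     # f^c(y) = sup_x f(x) - c(x, y)
    return (c + fc[None, :]).min(axis=1)  # f^{cc}(x) = inf_y c(x, y) + f^c(y)

f_smooth = np.sin(xs)   # f'' <= L: c-concave
f_stiff = xs**4         # f'' unbounded: not c-concave
print(np.abs(double_transform(f_smooth) - f_smooth).max())  # ~0 (grid error)
print(np.abs(double_transform(f_stiff) - f_stiff).max())    # large gap
```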

c-concavity

Example. \(X=Y=\mathbb{R}^d\), \(c(x,y)=\frac{L}{2}\lVert x-y\rVert^2\):

\(f\) is \(c\)-concave \(\iff \nabla^2 f\preccurlyeq L\, I_{d\times d}\)

[Figure: the majorants \(c(\cdot,y)+f^c(y)\) over the graph of \(f\), in two panels: \(f\) is \(c\)-concave vs. \(f\) is not \(c\)-concave.]

Alternating minimization of the surrogate

\(y_{n+1} = \argmin_{y}\phi(x_n,y)\)   (\(y\)-update ↔ Majorize step)

\(x_{n+1} = \argmin_{x}\phi(x,y_{n+1})\)   (\(x\)-update ↔ Minimize step)

[Figure: the family of majorizing functions \(\phi(\cdot,y)\), with \(\phi(\cdot,y_{n+1})\) touching \(f\) at \(x_n\).]
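The majorize/touch property is easy to check numerically; a sketch (mine, not from the talk) with the toy choices \(c(x,y)=\frac{1}{2}(x-y)^2\) and \(f=\sin\), which is c-concave here:

```python
import numpy as np

# After the y-update, the surrogate phi(., y_{n+1}) majorizes f
# and touches it at the current iterate x_n.
L = 1.0
xs = np.linspace(-3, 3, 601)
ys = np.linspace(-6, 6, 1201)
c = 0.5 * L * (xs[:, None] - ys[None, :])**2
f = np.sin(xs)
fc = (f[:, None] - c).max(axis=0)    # c-transform of f

i_n = 100                            # grid index of the current iterate x_n
j = np.argmin(c[i_n, :] + fc)        # y-update
surrogate = c[:, j] + fc[j]          # phi(., y_{n+1})
print((surrogate - f).min())         # >= 0: phi(., y_{n+1}) majorizes f
print(surrogate[i_n] - f[i_n])       # ~0: touches f at x_n (y-grid error)
```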

Differentiable setting

\(X,Y\) finite-dimensional manifolds, twisted \(c\in C^1(X\times Y)\), \(f\in C^1(X)\) c-concave.

\[\begin{aligned} y_{n+1} &= \argmin_{y\in Y} c(x_n,y)+f^c(y)\\ x_{n+1} &= \argmin_{x\in X} c(x,y_{n+1})+f^c(y_{n+1}) \end{aligned}\]

Explicit algorithm:

\[\begin{aligned} -\nabla_xc(x_n,y_{n+1})&=-\nabla f(x_n)\\ \nabla_xc(x_{n+1},y_{n+1})&=0 \end{aligned}\]

Important examples

\(\,\,\,c(x,y)=\frac{L}{2}\lVert x-y\rVert^2 \longrightarrow\)  standard gradient descent: \[x_{n+1}-x_n=-\frac1L\nabla f(x_n)\]

\(\,\,\,c(x,y)=\underbrace{u(x)-u(y)-\langle\nabla u(y),x-y\rangle}_{\qquad\quad\eqqcolon \,u(x|y)} \longrightarrow\)  mirror descent: \[\nabla u(x_{n+1})-\nabla u(x_n)=-\nabla f(x_n)\]

\(\,\,\,c(x,y)=u(y|x) \longrightarrow\)  natural gradient descent (Newton when \(u=f\)): \[x_{n+1}-x_n=-\nabla^2 u(x_n)^{-1}\nabla f(x_n)\]

\(\,\,\,c(x,y)=\frac{L}{2}d_M^2(x,y)\longrightarrow\)  Riemannian gradient descent: \[x_{n+1}=\exp_{x_n}\!\big(-\tfrac{1}{L}\nabla f(x_n)\big)\]
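As a quick illustration of the mirror-descent row (a toy sketch, not from the talk): with \(u(x)=\sum_i x_i\log x_i\) on the simplex, the update \(\nabla u(x_{n+1})=\nabla u(x_n)-\nabla f(x_n)\) becomes the exponentiated-gradient step; the objective \(f(x)=\frac12 x^\top A x\) below is made up.

```python
import numpy as np

# Mirror descent with the entropy Bregman cost u(x|y):
# grad u(x) = 1 + log(x), so the update is multiplicative.
A = np.diag([1.0, 2.0, 3.0])
grad_f = lambda x: A @ x           # f(x) = x^T A x / 2

x = np.ones(3) / 3                 # start at the simplex center
for _ in range(200):
    x = x * np.exp(-grad_f(x))     # exponentiated-gradient step
    x /= x.sum()                   # stay on the simplex
print(x)                           # approaches (6/11, 3/11, 2/11), the
                                   # minimizer of f over the simplex
```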

Summary

If \(f\) is c-concave: \[\inf_{x\in X} f(x)=\inf_{x\in X,\,y\in Y} c(x,y)+f^c(y)\]

GDGC: alternating minimization of \(c(x,y)+f^c(y)\)

Nonsmooth formulation: \(X,Y\) arbitrary sets and \(c\colon X\times Y\to\mathbb{R}\cup\{+\infty\}\) (essentially) arbitrary

Differentiable formulation: if \(X,Y\) smooth manifolds and \(c\in C^1(X\times Y)\) then 

\[\begin{aligned} -\nabla_xc(x_n,y_{n+1})&=-\nabla f(x_n)\\ \nabla_xc(x_{n+1},y_{n+1})&=0 \end{aligned}\]

Differential geometry based on the cost \(c(x,y)\)

Gives meaning to an explicit method even without regularity

1. Formulation

2. Convergence theory

3. Direct applications

What I will generalize

\(X=\mathbb{R}^d\),   \(x_{n+1}-x_n=-\frac1L\nabla f(x_n)\)

Convergence rates: if \(\lambda I_d \preccurlyeq\nabla^2f\preccurlyeq L I_d\) then

⏵ sublinear rates when \(\lambda=0\): \[f(x_n)\leq f(x)+\frac{L\lVert x-x_0\rVert^2/2}{n}\]

⏵ linear rates when \(\lambda>0\): \[f(x_n)\leq f(x)+\frac{\lambda L\lVert x-x_0\rVert^2/2}{\Lambda^n-1},\qquad \Lambda=(1-\lambda)^{-1}>1\]

Need new notions of:

⏵ \(L\)-smoothness

⏵ (strong) convexity
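A quick numerical check of the sublinear rate (a sketch with a made-up quadratic, not from the talk):

```python
import numpy as np

# Verify f(x_n) - f(x_*) <= L ||x_* - x_0||^2 / (2n) for gradient
# descent x_{n+1} = x_n - grad f(x_n)/L on f(x) = x^T A x / 2.
A = np.diag([0.1, 1.0, 4.0])       # lambda = 0.1, L = 4
L = 4.0
f = lambda x: 0.5 * x @ A @ x      # minimized at x_* = 0
x0 = np.ones(3)
x = x0.copy()
for n in range(1, 101):
    x = x - (A @ x) / L
    assert f(x) <= L * x0 @ x0 / (2 * n)
print("sublinear rate verified up to n = 100")
```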

Nonsmooth geometry based on the cost

D E F I N I T I O N  (L–Todeschi–Vialard '24)

\(X, Y\) two arbitrary sets,   \(c\colon X\times Y\to\mathbb{R}\cup\{+\infty\}\).

\((x(s),\bar y)\) is a variational c-segment if \(c(x(s),\bar y)\) is finite and \[\forall y\in Y,\quad c(x(s),\bar y)-c(x(s),y)\leq (1-s)[c(x(0),\bar y)-c(x(0),y)]+s[c(x(1),\bar y)-c(x(1),y)].\]

\((X\times Y,c)\) is a space with nonnegative cross-curvature (NNCC space) if variational c-segments always exist.

Note: no regularity
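For instance (a sketch, not from the talk), with \(c(x,y)=\lVert x-y\rVert^2\) the cross-difference \(c(x,\bar y)-c(x,y)\) is affine in \(x\), so straight segments \(x(s)=(1-s)x_0+sx_1\) are variational c-segments (the inequality holds with equality):

```python
import numpy as np

# Check the variational c-segment inequality for c(x, y) = ||x - y||^2
# along straight segments, with random points (toy data).
rng = np.random.default_rng(1)
d = 4
x0, x1, ybar = rng.normal(size=(3, d))
c = lambda x, y: np.sum((x - y)**2)

for s in np.linspace(0, 1, 11):
    xs = (1 - s) * x0 + s * x1
    for y in rng.normal(size=(5, d)):
        lhs = c(xs, ybar) - c(xs, y)
        rhs = (1 - s) * (c(x0, ybar) - c(x0, y)) + s * (c(x1, ybar) - c(x1, y))
        assert lhs <= rhs + 1e-9   # equality here, up to round-off
print("variational c-segment inequality verified")
```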

NNCC examples

⏵ \(c(x,y)=\) Bregman divergence on \(\mathbb{R}^d\)

⏵ \(c(x,y)=\lVert x-y\rVert^2\) on \(\mathbb{R}^d\)

⏵ Sphere \(\mathbb{S}^d\) with the squared geodesic distance

⏵ Bures–Wasserstein: \[\operatorname{BW}^2(\Sigma_1,\Sigma_2) = \operatorname{tr}(\Sigma_1) + \operatorname{tr}(\Sigma_2) - 2 \operatorname{tr}\left(\sqrt{\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}}\right)\]

NNCC examples

⏵ Kullback–Leibler divergence, Hellinger, Fisher–Rao

⏵ Optimal transport: \[W_2^2(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\int \lVert x-y\rVert^2\,d\pi\] [Figure: optimal transport illustration, credit G. Peyré]

⏵ Gromov–Wasserstein: for \(\mathbf{X}=[X,f,\mu]\) and \(\mathbf{Y}=[Y,g,\nu]\), \[\operatorname{GW}^2(\mathbf{X},\mathbf{Y})=\inf_{\pi\in\Pi(\mu,\nu)}\int\lvert f(x,x')-g(y,y')\rvert^2\,d\pi(x,y)\,d\pi(x',y')\,.\]
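The Bures–Wasserstein case is easy to evaluate directly from the formula above; a sketch (mine, not from the talk) with random SPD inputs, using `scipy.linalg.sqrtm`:

```python
import numpy as np
from scipy.linalg import sqrtm

# BW^2(S1, S2) = tr(S1) + tr(S2) - 2 tr((S1^{1/2} S2 S1^{1/2})^{1/2})
def bw2(S1, S2):
    r = sqrtm(S1)                          # S1^{1/2}
    middle = sqrtm(r @ S2 @ r)             # (S1^{1/2} S2 S1^{1/2})^{1/2}
    return np.trace(S1) + np.trace(S2) - 2 * np.trace(middle).real

rng = np.random.default_rng(2)
B = rng.normal(size=(3, 3))
S1 = B @ B.T + np.eye(3)                   # a random SPD matrix
S2 = np.eye(3)
print(bw2(S1, S2), bw2(S2, S2))            # positive, and 0 for S2 vs itself
```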

Convergence rates 1

T H E O R E M   (L–Aubin-Frankowski '24)

Suppose that for every \(y_0\in Y\) there exists \(x_0\in \argmin_{x\in X}\phi(x,y_0)\) and \(y_1\in \argmin_{y\in Y}\phi(x_0,y)\) such that for each \(x\in X\),

⏵ there exists a variational c-segment \(s\mapsto (x(s),y_0)\) on \((X\times Y,c)\) with \(x(0)=x_0\) and \(x(1)=x\),

⏵ \(s\mapsto f(x(s))-\lambda c(x(s),y_1)\) is convex,

⏵ \(\displaystyle\lim_{s\to 0^+}\frac{c(x(s),y_1)+f^c(y_1)-f(x(s))}{s}=0\).

Then sublinear (\(\lambda=0\)) or linear (\(\lambda>0\)) convergence rates.

Convergence rates 2

T H E O R E M   (L–Aubin-Frankowski '24)

\(X,Y\) smooth manifolds, \(c\in C^2\), NNCC space. Let \(f\in C^2(X)\) and \(\lambda\geq 0\). Suppose that \(f\) is c-concave, and that for every \(x\in X\) and \(y_1\in Y\) there exists a point \(y\in Y\) such that
\[\begin{aligned} \nabla f(x) &= \nabla_x c(x,y_1)-(1-\lambda)\nabla_x c(x,y),\\ \nabla^2f(x) &\succcurlyeq \nabla^2_{xx}c(x,y_1)-(1-\lambda)\nabla^2_{xx}c(x,y). \end{aligned}\]

Then sublinear (\(\lambda=0\)) or linear (\(\lambda>0\)) convergence rates.
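Sanity check (not on the slide): for \(c(x,y)=\frac{L}{2}\lVert x-y\rVert^2\) one has \(\nabla_x c(x,y)=L(x-y)\) and \(\nabla^2_{xx}c=L\,I_d\), so the first condition can always be solved for \(y\), and the second reduces to \(\nabla^2 f(x)\succcurlyeq \lambda L\, I_d\), recovering the Euclidean strong-convexity condition from before.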

1. Formulation

2. Convergence theory

3. Direct applications

Nondifferentiable mirror descent

\(X=Y=\mathbb{R}^d\),  \(u\colon \mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}\) convex,  \(f\colon \mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}\)

Idea: cost \(c(x,y)=u(x)+u^*(y)-\langle x,y\rangle=u(x|\tilde y)\)   for \(y=\nabla u(\tilde y)\)

Assume \(f\) convex and \(u-f\) convex.

Algorithm: given \(y_n\),
\[\left\{\begin{aligned} x_n&\in\partial u^*(y_n)\\ y_n-y_{n+1}&\in\partial f(x_n)\ \text{ and }\ y_{n+1}\in\partial(u-f)(x_n) \end{aligned}\right.\]

If differentiable: \(\nabla u(x_{n+1})-\nabla u(x_n)=-\nabla f(x_n)\)

Convergence rate: \[f(x_n)\leq f(x)+\frac{c(x,y_0)}{n}\]
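A numerical sanity check of this rate (a sketch with made-up data, not from the talk): take the entropy cost on the simplex, so that \(c(x,y_0)=\operatorname{KL}(x\,\|\,x_0)\) when \(x_0\) is uniform, and a toy quadratic \(f\) with \(f\) and \(u-f\) convex on the simplex.

```python
import numpy as np

# Check f(x_n) <= f(x) + c(x, y_0)/n for mirror descent with the
# entropy cost; here x = x_*, the minimizer of f over the simplex.
A = np.diag([0.2, 0.5, 1.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
kl = lambda p, q: np.sum(p * np.log(p / q))   # c(x, y_0) for uniform x_0

x0 = np.ones(3) / 3
x_star = np.array([5.0, 2.0, 1.0]) / 8        # a_i x_i constant: the minimizer
x = x0.copy()
for n in range(1, 201):
    x = x * np.exp(-grad_f(x))                # mirror step (differentiable case)
    x /= x.sum()
    assert f(x) <= f(x_star) + kl(x_star, x0) / n
print("mirror-descent rate verified up to n = 200")
```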

Global rates for Newton's method

\(c(x,y)=u(y|x)\longrightarrow\) \(x_{n+1}-x_n=-\nabla^2u(x_n)^{-1}\nabla f(x_n)\)

Newton's method: if \(0 \leq \nabla^3f(x)\big((\nabla^2f)^{-1}(x)\nabla f(x),-,-\big) \leq (1-\lambda)\nabla^2f(x)\) then

\[f(x_n)-f_*\leq \Big(\frac{1-\lambda}{2}\Big)^n(f(x_0)-f_*)\]

If \[\nabla^3u(\nabla^2u^{-1}\nabla f,-,-)\leq \nabla^2f\leq \nabla^2u+\nabla^3u(\nabla^2u^{-1}\nabla f,-,-)\] then

\[f(x_n)\leq f(x)+\frac{u(x_0|x)}{n}\]
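A 1-D sanity check (mine, not from the talk): for \(f(x)=x^4\), \(\nabla^3 f(x)\big((\nabla^2 f)^{-1}(x)\nabla f(x),-,-\big)=8x^2\leq \frac{2}{3}\cdot 12x^2\), so the condition holds with \(\lambda=1/3\) and the linear rate can be tested directly.

```python
# Newton's method on f(x) = x^4: the step is x - f'(x)/f''(x) = (2/3) x,
# and the global linear rate holds with (1 - lambda)/2 = 1/3 (here f_* = 0).
f = lambda x: x**4
lam = 1.0 / 3.0
x = 2.0
fx0 = f(x)
for n in range(1, 30):
    x = x - 4 * x**3 / (12 * x**2)   # Newton step
    assert f(x) <= ((1 - lam) / 2)**n * fx0
print("global Newton rate verified")
```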

Thank you!