Gradient descent with a general cost
Flavien Léger
joint works with Pierre-Cyril Aubin-Frankowski,
Gabriele Todeschi, François-Xavier Vialard
Introduction
Minimize \(f(x)\) over \(x\in X\), using a function \(c(x,y)\) as a “movement limiter”
⏵ Explicit minimizing movement scheme based on \(c(x,y)\)
⏵ Even without differentiability
⏵ Identify convexity for convergence
Ex 1: \(\displaystyle x_{n+1}\in\argmin_{x\in \mathbb{R}^d}f(x)+\frac{L\lVert x-x_n\rVert^2}{2} \quad\longrightarrow \quad x_{n+1}\in\argmin_{x\in X}f(x)+c(x,x_n)\)
Ex 2:
Gradient descent with a general cost
Given: \(X,Y\) arbitrary sets
If there exists \(h\) such that \(f(x)=\inf_{y\in Y}c(x,y)+h(y)\) then
\[\inf_{x\in X}f(x)=\inf_{x\in X,y\in Y}\underbrace{c(x,y)+f^c(y)}_{\phi(x,y)}\]
Algorithm GDGC: alternating minimization of \(\phi(x,y)\)
1. Formulation
2. Convergence theory
3. Direct applications
c-concavity
Definition. \(f\) is c-concave if there exists \(h\colon Y\to \mathbb{R}\cup\{+\infty\}\) s.t. \[f(x)=\inf_{y\in Y}c(x,y)+h(y).\]
Smallest such \(h\) is the c-transform \(f^c(y)=\sup_{x\in X} f(x)-c(x,y).\)
c-concave \(\iff\) \(f(x)=\inf_{y\in Y}c(x,y)+f^c(y)\) for all \(x\in X\)
[Figure: two panels, one where \(f\) is \(c\)-concave and one where \(f\) is not \(c\)-concave]
Example. \(X=Y=\mathbb{R}^d\), \(c(x,y)=\frac{L}{2}\lVert x-y\rVert^2\):
\(f\) is \(c\)-concave \(\iff \nabla^2 f\preccurlyeq L\, I_{d\times d}\)
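A one-dimensional worked example may make this criterion concrete. It is only a sketch: the quadratic cost \(c(x,y)=\frac{L}{2}(x-y)^2\) on \(X=Y=\mathbb{R}\) and the test function \(f(x)=\frac{a}{2}x^2\) are illustrative choices, not taken from the slides. For \(a<L\) the supremum defining the c-transform is attained at \(x=\frac{L}{L-a}\,y\), and
\[f^c(y)=\sup_{x\in\mathbb{R}}\Big(\frac{a}{2}x^2-\frac{L}{2}(x-y)^2\Big)=\frac{aL}{2(L-a)}\,y^2,\qquad \inf_{y\in\mathbb{R}}\Big(\frac{L}{2}(x-y)^2+f^c(y)\Big)=\frac{a}{2}x^2=f(x),\]
where the infimum is attained at \(y=\frac{L-a}{L}\,x\). So \(f\) is c-concave. For \(a>L\) the supremum is \(+\infty\) for every \(y\), hence \(f\) is not c-concave, in agreement with the condition \(f''=a\leq L\).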
[Figure: graph of \(f\) together with the surrogates \(x\mapsto c(x,y)+f^c(y)\)]
Alternating minimization of the surrogate
\(y\)-update ↔ Majorize step: \(y_{n+1} = \argmin_{y}\phi(x_n,y)\)
\(x\)-update ↔ Minimize step: \(x_{n+1} = \argmin_{x}\phi(x,y_{n+1})\)
Family of majorizing functions \(\phi(x,y)\)
\(\phi(\cdot,y_{n+1})\)
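The two updates can be tried out on finite sets, where every argmin is a table lookup. The following is a minimal sketch, not the authors' implementation: the random cost table `c`, the auxiliary function `h`, and the helper name `gdgc` are all hypothetical, and `f` is made c-concave by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y = 40, 30

# Hypothetical finite instance: X and Y are index sets, c is an arbitrary cost table.
c = rng.uniform(0.0, 5.0, size=(n_x, n_y))
h = rng.uniform(0.0, 5.0, size=n_y)

# c-concave by construction: f(x) = min_y c(x, y) + h(y).
f = (c + h[None, :]).min(axis=1)

# c-transform: f^c(y) = max_x f(x) - c(x, y).
f_c = (f[:, None] - c).max(axis=0)

# Surrogate phi(x, y) = c(x, y) + f^c(y); for each y it majorizes f.
phi = c + f_c[None, :]


def gdgc(x0, steps):
    """Alternating minimization of phi, started at index x0."""
    x, values = x0, [f[x0]]
    for _ in range(steps):
        y = int(np.argmin(phi[x, :]))  # y-update (majorize step)
        x = int(np.argmin(phi[:, y]))  # x-update (minimize step)
        values.append(f[x])
    return x, values


x_last, values = gdgc(x0=0, steps=10)
```

Because \(f(x_n)=\min_y\phi(x_n,y)\) when \(f\) is c-concave, each iteration satisfies \(f(x_{n+1})\leq\phi(x_{n+1},y_{n+1})\leq\phi(x_n,y_{n+1})=f(x_n)\), so `values` is nonincreasing. The sketch makes no claim about reaching the global minimum on an arbitrary cost table.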
Differentiable setting
\(X,Y\) finite-dimensional manifolds, twisted \(c\in C^1(X\times Y)\), \(f\in C^1(X)\) c-concave
Explicit algorithm
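As a sketch of where an explicit scheme can come from (a derivation under the stated assumptions, not a quotation of the slide), one can write the first-order conditions of the two alternating-minimization updates, assuming both minima are attained. Since \(f\) is c-concave, \(x\mapsto c(x,y_{n+1})+f^c(y_{n+1})-f(x)\) is nonnegative and vanishes at \(x=x_n\), so its gradient vanishes there. The \(x\)-update is a plain minimization of \(c(\cdot,y_{n+1})\). Together:
\[\nabla_x c(x_n,y_{n+1})=\nabla f(x_n),\qquad \nabla_x c(x_{n+1},y_{n+1})=0.\]
The twist condition (injectivity of \(y\mapsto\nabla_x c(x,y)\)) is what lets the first equation determine \(y_{n+1}\) from \(x_n\). For instance, with \(c(x,y)=u(x)-u(y)-\langle\nabla u(y),x-y\rangle\) one has \(\nabla_x c(x,y)=\nabla u(x)-\nabla u(y)\), and eliminating \(y_{n+1}\) gives \(\nabla u(x_{n+1})-\nabla u(x_n)=-\nabla f(x_n)\).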
Important examples
\(c(x,y)=\underbrace{u(x)-u(y)-\langle\nabla u(y),x-y\rangle}_{\eqqcolon \,u(x|y)} \longrightarrow\) mirror descent
\[\nabla u(x_{n+1})-\nabla u(x_n)=-\nabla f(x_n)\]
\(c(x,y)=u(y|x) \longrightarrow\) natural gradient descent (Newton's method when \(u=f\))
\[x_{n+1}-x_n=-\nabla^2 u(x_n)^{-1}\nabla f(x_n)\]
\(c(x,y)=\frac{L}{2}d_M^2(x,y)\longrightarrow\) Riemannian gradient descent
\[x_{n+1}=\exp_{x_n}\Big(-\frac{1}{L}\nabla f(x_n)\Big)\]
\(c(x,y)=\frac{L}{2}\lVert x-y\rVert^2 \longrightarrow\) standard gradient descent
\[x_{n+1}-x_n=-\frac1L\nabla f(x_n)\]
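The mirror descent update \(\nabla u(x_{n+1})-\nabla u(x_n)=-\nabla f(x_n)\) is easy to run once \(\nabla u\) can be inverted. The sketch below is illustrative only: the helper `mirror_descent`, the entropy choice of \(u\), the target vector `p`, and the quadratic test problem are assumptions made for the example. It also shows standard gradient descent as the special case \(u(x)=\frac{L}{2}\lVert x\rVert^2\).

```python
import numpy as np


def mirror_descent(grad_f, grad_u, grad_u_inv, x0, steps):
    """Iterate grad_u(x_{n+1}) = grad_u(x_n) - grad_f(x_n)."""
    x = np.asarray(x0, dtype=float)
    iterates = [x]
    for _ in range(steps):
        x = grad_u_inv(grad_u(x) - grad_f(x))
        iterates.append(x)
    return iterates


# Assumed example: u(x) = sum x_i log x_i on the positive orthant and
# f(x) = 0.5 * u(x|p), so that both f and u - f are convex.
p = np.array([0.2, 0.5, 0.3])
md_iterates = mirror_descent(
    grad_f=lambda x: 0.5 * (np.log(x) - np.log(p)),
    grad_u=lambda x: np.log(x) + 1.0,
    grad_u_inv=lambda z: np.exp(z - 1.0),
    x0=np.ones(3),
    steps=60,
)

# Standard gradient descent: u(x) = L/2 ||x||^2, here on f(x) = 0.5 x^T A x.
L = 4.0
A = np.diag([1.0, 4.0])
gd_iterates = mirror_descent(
    grad_f=lambda x: A @ x,
    grad_u=lambda x: L * x,
    grad_u_inv=lambda z: z / L,
    x0=np.array([1.0, 1.0]),
    steps=200,
)
```

In the entropy example the update reads \(\log x_{n+1}=\tfrac12\log x_n+\tfrac12\log p\), so the iterates approach the minimizer \(p\) geometrically.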
Summary
GDGC: Alternating Minimization of \(c(x,y)+f^c(y)\)
If \(f\) is c-concave:
Nonsmooth formulation: \(X,Y\) arbitrary sets and \(c\colon X\times Y\to\mathbb{R}\cup\{+\infty\}\) (essentially) arbitrary
Differentiable formulation: if \(X,Y\) smooth manifolds and \(c\in C^1(X\times Y)\) then
Differential geometry based on the cost \(c(x,y)\)
Gives a meaning to doing an explicit method without regularity
1. Formulation
2. Convergence theory
3. Direct applications
What I will generalize
\(X=\mathbb{R}^d\)
Convergence rates: if \(\lambda I_d \preccurlyeq\nabla^2f\preccurlyeq L I_d\) then
⏵ sublinear rates when \(\lambda=0\)
⏵ linear rates when \(\lambda>0\)
Need new notions of \(L\)-smoothness and of (strong) convexity
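For reference, the classical Euclidean bounds for gradient descent with step \(1/L\) are \(f(x_n)-f_*\leq \frac{L\lVert x_0-x_*\rVert^2}{2n}\) when \(\lambda=0\) and \(f(x_n)-f_*\leq(1-\lambda/L)^n\,(f(x_0)-f_*)\) when \(\lambda>0\). These are standard textbook statements, not copied from the slide. The sketch below checks them numerically on diagonal quadratics; the helper name and the specific eigenvalues are assumptions chosen for illustration.

```python
import numpy as np


def gradient_descent_values(hessian, x0, L, steps):
    """Run x_{n+1} = x_n - grad f(x_n) / L on f(x) = 0.5 x^T H x; return f(x_n)."""
    x = np.asarray(x0, dtype=float)
    values = []
    for _ in range(steps + 1):
        values.append(0.5 * x @ hessian @ x)
        x = x - (hessian @ x) / L
    return np.array(values)


L = 10.0
x0 = np.array([1.0, 1.0, 1.0])
n = np.arange(1, 201)

# lambda > 0: Hessian eigenvalues lie in [lam, L]; minimizer x_* = 0, f_* = 0.
lam = 0.5
strongly_convex = gradient_descent_values(np.diag([lam, 3.0, L]), x0, L, 200)
linear_bound = (1 - lam / L) ** n * strongly_convex[0]

# lambda = 0: one flat direction; x_* = 0 is still a minimizer and f_* = 0.
merely_convex = gradient_descent_values(np.diag([0.0, 3.0, L]), x0, L, 200)
sublinear_bound = L * np.dot(x0, x0) / (2 * n)
```

Comparing `strongly_convex[1:]` with `linear_bound` and `merely_convex[1:]` with `sublinear_bound` entry by entry confirms both inequalities on this example.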
Nonsmooth geometry based on the cost
Definition (L–Todeschi–Vialard '24)
\(s\mapsto(x(s),\bar y)\) is a variational c-segment if \(c(x(s),\bar y)\) is finite and
\((X\times Y,c)\) is a space with nonnegative cross-curvature (NNCC space) if variational c-segments always exist.
\(X, Y\) two arbitrary sets, \(c\colon X\times Y\to\mathbb{R}\cup\{+\infty\}\).
Note: no regularity
NNCC examples
\(c(x,y)=\) Bregman divergence on \(\mathbb{R}^d\)
\(c(x,y)=\lVert x-y\rVert^2\) on \(\mathbb{R}^d\)
Sphere \(\mathbb{S}^d\) with the squared geodesic distance
Bures–Wasserstein
Gromov–Wasserstein
Kullback–Leibler divergence, Hellinger, Fisher–Rao
Optimal transport
G. Peyré
\(\mathbf{X}=[X,f,\mu]\) and \(\mathbf{Y}=[Y,g,\nu]\)
\[\operatorname{GW}^2(\mathbf{X},\mathbf{Y})=\inf_{\pi\in\Pi(\mu,\nu)}\int\lvert f(x,x')-g(y,y')\rvert^2\,d\pi(x,y)\,d\pi(x',y')\,.\]
Convergence rates 1
Theorem (L–Aubin-Frankowski '24)
Suppose that for every \(y_0\in Y\) there exist \(x_0\in \argmin_{x\in X}\phi(x,y_0)\) and \(y_1\in \argmin_{y\in Y}\phi(x_0,y)\) such that for each \(x\in X\),
⏵ there exists a variational c-segment \(s\mapsto (x(s),y_0)\) on \((X\times Y,c)\) with \(x(0)=x_0\) and \(x(1)=x\)
⏵ \(s\mapsto f(x(s))-\lambda c(x(s),y_1)\) is convex
⏵ \(\displaystyle\lim_{s\to 0^+}\frac{c(x(s),y_1)+f^c(y_1)-f(x(s))}{s}=0\)
Then sublinear (\(\lambda=0\)) or linear (\(\lambda>0\)) convergence rates.
Convergence rates 2
Theorem (L–Aubin-Frankowski '24)
\(X,Y\) smooth manifolds, \(c\in C^2\), NNCC space.
Let \(f\in C^2(X)\) and \(\lambda\geq 0\). Suppose that \(f\) is c-concave. Suppose that for every \(x\in X\) and \(y_1\in Y\) there exists a point \(y\in Y\) such that
Then sublinear (\(\lambda=0\)) or linear (\(\lambda>0\)) convergence rates.
1. Formulation
2. Convergence theory
3. Direct applications
Nondifferentiable mirror descent
\(X=Y=\mathbb{R}^d\), \(u\colon \mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}\) convex
Idea: cost \(c(x,y)=u(x)+u^*(y)-\langle x,y\rangle=u(x|\tilde y)\) for \(y=\nabla u(\tilde y)\)
\(f\colon \mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}\)
Assume \(f\) convex and \(u-f\) convex
Algorithm
Given \(y_n\),
If differentiable: \(\nabla u(x_{n+1})-\nabla u(x_n)=-\nabla f(x_n)\)
Convergence rate:
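The identity in the “Idea” line can be checked in one step, assuming \(u\) is differentiable at \(\tilde y\) (an assumption needed only for this verification). Fenchel's equality gives \(u^*(\nabla u(\tilde y))=\langle \tilde y,\nabla u(\tilde y)\rangle-u(\tilde y)\), so with \(y=\nabla u(\tilde y)\),
\[u(x)+u^*(y)-\langle x,y\rangle=u(x)-u(\tilde y)-\langle\nabla u(\tilde y),x-\tilde y\rangle=u(x|\tilde y).\]
The left-hand side makes sense for any convex \(u\), differentiable or not, which is what allows the cost to be used in the nondifferentiable setting.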
Global rates for Newton's method
\(c(x,y)=u(y|x)\longrightarrow\) \(x_{n+1}-x_n=-\nabla^2u(x_n)^{-1}\nabla f(x_n)\)
If \[\nabla^3u\big((\nabla^2u)^{-1}\nabla f,-,-\big)\leq \nabla^2f\leq \nabla^2u+\nabla^3u\big((\nabla^2u)^{-1}\nabla f,-,-\big)\] then
\[f(x_n)\leq f(x)+\frac{u(x_0|x)}{n}\]
Newton's method: if \(0 \leq \nabla^3f(x)\big((\nabla^2f)^{-1}(x)\nabla f(x),-,-\big) \leq (1-\lambda)\nabla^2f(x)\) then
\[f(x_n)-f_*\leq \Big(\frac{1-\lambda}{2}\Big)^n(f(x_0)-f_*)\]
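A small numerical check of the Newton rate, as a sketch under assumptions: the test function \(f(x)=x^4\) on \(x>0\) is chosen here for illustration and does not come from the slides. In one dimension the condition reads \(0\leq f'''f'/f''\leq(1-\lambda)f''\). For \(f(x)=x^4\) one computes \(f'''f'/f''=8x^2\) and \(f''=12x^2\), so the condition holds with \(\lambda=1/3\) away from the degenerate point \(x=0\), and the stated bound becomes \(f(x_n)-f_*\leq(1/3)^n(f(x_0)-f_*)\) with \(f_*=0\).

```python
def newton_iterates(f_prime, f_second, x0, steps):
    """Plain Newton iteration x_{n+1} = x_n - f'(x_n) / f''(x_n) in one dimension."""
    x = float(x0)
    xs = [x]
    for _ in range(steps):
        x = x - f_prime(x) / f_second(x)
        xs.append(x)
    return xs


def f(x):
    # Illustrative test function (assumption), with minimum value f_* = 0.
    return x ** 4


xs = newton_iterates(lambda x: 4 * x ** 3, lambda x: 12 * x ** 2, x0=1.0, steps=20)

lam = 1.0 / 3.0
gaps = [f(x) for x in xs]  # f(x_n) - f_*
bounds = [((1 - lam) / 2) ** n * gaps[0] for n in range(len(xs))]
```

Here Newton's step is \(x_{n+1}=\tfrac23 x_n\), so \(f(x_n)=(16/81)^n f(x_0)\), which indeed stays below \((1/3)^n f(x_0)\).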
Thank you!
(MIA 2024-11-07) GDGC