Gradient descent with a general cost
Flavien Léger
joint works with Pierre-Cyril Aubin-Frankowski
Outline
1. A new family of algorithms
  - Gradient descent as alternating minimization
  - General method unifies gradient/mirror/natural gradient/Riemannian descent
2. Convergence theory
  - Generalized smoothness and convexity
  - Optimal transport theory → local characterizations
3. Applications
  - Global rates for Newton
  - Explicit vs. implicit Riemannian gradient descent
1. Gradient descent as minimizing movement
Objective function \(f\colon \mathbb{R}^d\to\mathbb{R}\). Gradient descent:
\[x_{n+1}=x_n-\frac{1}{L}\nabla f(x_n).\]
D E F I N I T I O N
\(f\) is \(L\)-smooth if \(\nabla^2f\leq L I_{d\times d}\).
If \(f\) is \(L\)-smooth, then for every \(x_n\),
\[f(x)\leq f(x_n)+\langle\nabla f(x_n),x-x_n\rangle+\frac{L}{2}\lVert x-x_n\rVert^2.\]
Two steps:
1) majorize: find the tangent parabola (“surrogate”)
2) minimize: minimize the surrogate
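A quick check: minimizing the quadratic surrogate in \(x\) recovers the gradient descent update,
\[0=\nabla_x\Big[f(x_n)+\langle\nabla f(x_n),x-x_n\rangle+\tfrac{L}{2}\lVert x-x_n\rVert^2\Big]=\nabla f(x_n)+L(x-x_n)\ \Longrightarrow\ x_{n+1}=x_n-\tfrac{1}{L}\nabla f(x_n).\]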
Reformulating the majorize step
Family of majorizing functions \(\phi(x,y)\).
Majorize step ↔ \(y\)-update:
\[y_{n+1} = \argmin_{y}\phi(x_n,y)\]
Minimize step ↔ \(x\)-update:
\[x_{n+1} = \argmin_{x}\phi(x,y_{n+1})\]
The majorize step selects the surrogate \(\phi(\cdot,y_{n+1})\).
General cost
Given: \(X\) and \(f\colon X\to\mathbb{R}\)
Choose: \(Y\) and \(c(x,y)\)
\[f(x)\leq \underbrace{c(x,y)+f^c(y)}_{\phi(x,y)}\]
D E F I N I T I O N (\(c\)-transform)
\[f^c(y)=\sup_{x\in X}f(x)-c(x,y)\]
D E F I N I T I O N
\(f\) is \(c\)-concave if
\[f(x)=\inf_{y\in Y}c(x,y)+f^c(y)\]
Figure: majorizing surrogates \(c(\cdot,y)+f^c(y)\) and \(c(\cdot,y)+\lambda\) (Moreau '66).
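In particular, the majorization \(f(x)\leq c(x,y)+f^c(y)\) follows directly from the definition of the \(c\)-transform (a generalized Fenchel–Young inequality):
\[f^c(y)=\sup_{x'}\big[f(x')-c(x',y)\big]\ \geq\ f(x)-c(x,y)\qquad\text{for all }x\in X,\ y\in Y.\]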
\(c\)-concavity is smoothness
D E F I N I T I O N
\(f\) is \(c\)-concave if
\[f(x)=\inf_{y\in Y}c(x,y)+f^c(y)\]
Figure: a \(c\)-concave \(f\) vs. a non-\(c\)-concave \(f\).
Example: for \(c(x,y)=\frac{L}{2}\lVert x-y\rVert^2\),
\(f\) is \(c\)-concave \(\iff \nabla^2 f\leq L I_{d\times d}\).
\[\inf_{x}f(x)=\inf_{x,y}c(x,y)+f^c(y)\]
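A worked instance of this example, in dimension one with \(f(x)=\frac{\alpha}{2}x^2\), \(0\leq\alpha<L\), and \(c(x,y)=\frac{L}{2}(x-y)^2\):
\[f^c(y)=\sup_{x}\Big[\tfrac{\alpha}{2}x^2-\tfrac{L}{2}(x-y)^2\Big]=\frac{\alpha L}{2(L-\alpha)}\,y^2,
\qquad
\inf_{y}\Big[\tfrac{L}{2}(x-y)^2+\frac{\alpha L}{2(L-\alpha)}\,y^2\Big]=\tfrac{\alpha}{2}x^2=f(x),\]
so this \(f\) is \(c\)-concave, consistent with \(f''=\alpha\leq L\).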
Gradient descent with a general cost
A L G O R I T H M (FL–PCAF '23)
With the majorizing function \(\phi(x,y)=c(x,y)+f^c(y)\), alternate:
“majorize”: \(y_{n+1} = \argmin_{y}\phi(x_n,y)\)
“minimize”: \(x_{n+1} = \argmin_{x}\phi(x,y_{n+1})\), i.e. minimize the surrogate \(\phi(\cdot,y_{n+1})\).
The majorize step gives \(-\nabla_xc(x_n,y_{n+1})=-\nabla f(x_n)\), i.e.
\[y_{n+1}=\operatorname{c-exp}_{x_n}(-\nabla f(x_n)).\]
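A minimal numerical sketch of the scheme, assuming the quadratic cost \(c(x,y)=\frac{L}{2}\lVert x-y\rVert^2\) (both steps then have closed forms and the iteration reduces to gradient descent with step \(1/L\)); names and the test objective are illustrative:

import numpy as np

# Illustrative sketch (ad hoc names), specializing the scheme to the quadratic cost.
def gd_general_cost_quadratic(grad_f, x0, L, n_iters=100):
    """Majorize-minimize scheme with c(x, y) = L/2 * ||x - y||^2.

    For this cost the two steps are explicit:
      majorize: y_{n+1} = x_n - grad_f(x_n) / L   (the c-exponential map)
      minimize: x_{n+1} = argmin_x c(x, y_{n+1}) = y_{n+1}
    so the scheme reduces to plain gradient descent with step 1/L.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        y = x - grad_f(x) / L   # majorize step
        x = y                   # minimize step (closed form for the quadratic cost)
    return x

# Example: f(x) = 0.5 * x^T A x with A symmetric positive definite.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
L = np.linalg.eigvalsh(A).max()          # smoothness constant of f
x_star = gd_general_cost_quadratic(lambda x: A @ x, x0=[1.0, -1.0], L=L, n_iters=200)
print(x_star)                            # converges to the minimizer 0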
Some examples
\(c(x,y)=\underbrace{u(x)-u(y)-\langle\nabla u(y),x-y\rangle}_{\qquad\quad\eqqcolon \,u(x|y)} \longrightarrow\) mirror descent:
\[\nabla u(x_{n+1})-\nabla u(x_n)=-\nabla f(x_n)\]
\(c(x,y)=u(y|x) \longrightarrow\) natural gradient descent (Newton's method when \(u=f\)):
\[x_{n+1}-x_n=-\nabla^2 u(x_n)^{-1}\nabla f(x_n)\]
\(c(x,y)=\frac{L}{2}d_M^2(x,y)\longrightarrow\) Riemannian gradient descent:
\[x_{n+1}=\exp_{x_n}\big(-\tfrac{1}{L}\nabla f(x_n)\big)\]
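A minimal sketch of the mirror and natural gradient updates above, assuming the illustrative choice \(u(x)=\sum_i x_i\log x_i-x_i\) (negative entropy, so \(\nabla u(x)=\log x\) and \(\nabla^2u(x)=\operatorname{diag}(1/x)\)) and a quadratic \(f\); step sizes are absorbed into \(u\) as in the displayed updates:

import numpy as np

# Illustrative sketch (ad hoc names); u = negative entropy is an example choice.
def mirror_step(x, grad_f):
    # Solve grad_u(x_next) = grad_u(x) - grad_f(x), i.e. x_next = x * exp(-grad_f(x)).
    return x * np.exp(-grad_f(x))

def natural_gradient_step(x, grad_f):
    # x_next = x - hess_u(x)^{-1} grad_f(x) = x - x * grad_f(x), since hess_u = diag(1/x).
    return x - x * grad_f(x)

# Toy objective f(x) = 0.5 * ||x - a||^2 on the positive orthant.
a = np.array([0.5, 1.5])
grad_f = lambda x: x - a

x_md = x_ng = np.array([1.0, 1.0])
for _ in range(200):
    x_md = mirror_step(x_md, grad_f)
    x_ng = natural_gradient_step(x_ng, grad_f)
print(x_md, x_ng)   # both approach the minimizer a = [0.5, 1.5]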
2. Convergence theory
  - Generalized smoothness and convexity
  - Optimal transport theory → local characterizations
2. Cross-convexity
D E F I N I T I O N
\(f\) is \(\lambda\)-strongly \(c\)-cross-convex if for all \(x,x_n\),
\[f(x) \geq f(x_n) + \delta_c(x,y_n;x_n,y_{n+1})+\lambda\big(c(x,y_n)-c(x_n,y_n)\big),\]
where the cross-difference is
\[\delta_c(x',y';x,y)=[c(x,y')+c(x',y)]-[c(x,y)+c(x',y')].\]
Example: for \(c(x,y)=\frac{L}{2}\lVert x-y\rVert^2\) this reads
\[f(x)\geq f(x_n)+\langle\nabla f(x_n),x-x_n\rangle +\frac{\lambda L}{2}\lVert x-x_n\rVert^2.\]
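A short verification of the example: for the quadratic cost the cross-difference is
\[\delta_c(x',y';x,y)=\tfrac{L}{2}\big[\lVert x-y'\rVert^2+\lVert x'-y\rVert^2-\lVert x-y\rVert^2-\lVert x'-y'\rVert^2\big]=L\,\langle x'-x,\,y'-y\rangle,\]
and for this cost the iterates satisfy \(y_n=x_n\) (for \(n\geq 1\), since the minimize step gives \(x_{n+1}=y_{n+1}\)) and \(y_{n+1}=x_n-\frac{1}{L}\nabla f(x_n)\), so
\[\delta_c(x,y_n;x_n,y_{n+1})=L\,\langle x-x_n,\,y_n-y_{n+1}\rangle=\langle \nabla f(x_n),\,x-x_n\rangle,
\qquad
c(x,y_n)-c(x_n,y_n)=\tfrac{L}{2}\lVert x-x_n\rVert^2.\]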
Convergence rates
T H E O R E M (FL–PCAF '23)
If \(f\) is \(c\)-concave and \(c\)-cross-convex, then
\[f(x_n)\le f(x) + \frac{c(x,y_0)-c(x_0,y_0)}{n}.\]
If \(f\) is \(\lambda\)-strongly \(c\)-cross-convex with \(0<\lambda<1\), then
\[f(x_n)\le f(x) + \frac{\lambda \,\big(c(x,y_0)-c(x_0,y_0)\big)}{\Lambda^n-1},
\qquad \Lambda\coloneqq(1-\lambda)^{-1}>1.\]
Proof.
“Fenchel–Young inequality”: \(f(x_{n+1})\leq c(x_{n+1},y_{n+1})+f^c(y_{n+1})\)
\(c\)-concavity: \(f(x_{n})= c(x_{n},y_{n+1})+f^c(y_{n+1})\)
\(\implies f(x_{n+1})\leq f(x_{n})- [c(x_{n},y_{n+1})-c(x_{n+1},y_{n+1})]\)
Cross-convexity: \(f(x_{n}) \leq f(x)+ c(x,y_n)-c(x,y_{n+1})+c(x_n,y_{n+1})-c(x_{n},y_{n})\)
Combining the two:
\[f(x_{n+1}) \leq f(x)+ [c(x,y_n)-c(x_n,y_n)]- [c(x,y_{n+1})-c(x_{n+1},y_{n+1})]\]
Summing over \(n\), the right-hand side telescopes and yields the \(O(1/n)\) rate.
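In more detail, the last display telescopes: set \(a_n\coloneqq c(x,y_n)-c(x_n,y_n)\), which is \(\geq 0\) for \(n\geq1\) because \(x_n\) minimizes \(c(\cdot,y_n)\); the same minimizing property makes \(f(x_n)\) nonincreasing, so
\[n\big(f(x_n)-f(x)\big)\leq\sum_{k=0}^{n-1}\big(f(x_{k+1})-f(x)\big)\leq\sum_{k=0}^{n-1}(a_k-a_{k+1})=a_0-a_n\leq a_0=c(x,y_0)-c(x_0,y_0).\]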
The Kim–McCann geometry
\(\delta_c(x+\xi,y+\eta;x,y)=\underbrace{-\nabla^2_{xy}c(x,y)(\xi,\eta)}_{\text{Kim--McCann metric ('10)}}+o(\lvert\xi\rvert^2+\lvert\eta\rvert^2)\)
\[\delta_c(x',y';x,y)=[c(x,y')+c(x',y)]-[c(x,y)+c(x',y')]\]
Optimal transport problem:
\[\inf_{\pi\in\Pi(\mu,\nu)}\iint_{X\times Y}c(x,y)\,\pi(dx,dy)\]
➡ Kim–McCann geodesics
➡ Kim–McCann curvature: cross-curvature
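For instance, for the quadratic cost \(c(x,y)=\frac{L}{2}\lVert x-y\rVert^2\) the cross-difference is exactly bilinear and the expansion has no remainder:
\[\delta_c(x+\xi,y+\eta;x,y)=L\,\langle\xi,\eta\rangle=-\nabla^2_{xy}c(x,y)(\xi,\eta),\]
so the Kim–McCann metric is (a multiple of) the Euclidean pairing \(\langle\xi,\eta\rangle\).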
Cross-curvature
D E F I N I T I O N (Ma–Trudinger–Wang ’05)
The cross-curvature, or Ma–Trudinger–Wang tensor, is
\[\mathfrak{S}_c(\xi,\eta)=(c_{ik\bar s} c^{\bar s t} c_{t \bar\jmath\bar\ell}-c_{i\bar \jmath k\bar\ell})\, \xi^i\eta^{\bar\jmath}\xi^k\eta^{\bar\ell},
\qquad
c_{i\bar \jmath}=\frac{\partial^2c}{\partial x^i\partial y^{\bar\jmath}},\ \dots\]
T H E O R E M (Kim–McCann '11)
\[\mathfrak{S}_c\geq 0 \iff c(x(t),y)-c(x(t),y')\text{ convex in } t\]
for any Kim–McCann geodesic \(t\mapsto (x(t),y)\)
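Example: for the quadratic cost \(c(x,y)=\frac{L}{2}\lVert x-y\rVert^2\), the mixed derivative \(c_{i\bar\jmath}=-L\,\delta_{ij}\) is constant, so all third- and fourth-order terms in the tensor vanish and
\[\mathfrak{S}_c\equiv 0,\]
hence the Euclidean setting satisfies the nonnegative cross-curvature assumption.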
Local criteria for \(c\)-concavity and cross-convexity
T H E O R E M (Trudinger–Wang '06)
Suppose that for all \(\bar x\in X\) there exists \(\hat y\in Y\) satisfying \(-\nabla_xc(\bar x,\hat y)=-\nabla f(\bar x)\) and such that
\[\nabla^2f(\bar x) \leq \nabla^2_{xx}c(\bar x,\hat y).\]
Then \(f\) is \(c\)-concave.
T H E O R E M (FL–PCAF '23)
Suppose that \(c\) has nonnegative cross-curvature and let \(\lambda>0\). If \[t\mapsto f(x(t))-\lambda c(x(t),\bar y)\] is convex on every Kim–McCann geodesic \(t\mapsto (x(t),\bar y)\) satisfying \(\nabla_xc(x(0),\bar y)=0\), then \(f\) is \(\lambda\)-strongly \(c\)-cross-convex.
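As a sanity check, for \(c(x,y)=\frac{L}{2}\lVert x-y\rVert^2\) one has \(\nabla^2_{xx}c\equiv L I_{d\times d}\) and \(\hat y=\bar x-\frac{1}{L}\nabla f(\bar x)\) always exists, so the first theorem recovers one direction of the earlier example:
\[\nabla^2 f\leq L I_{d\times d}\ \implies\ f\ \text{is } c\text{-concave}.\]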
3. Applications
  - Global rates for Newton
  - Explicit vs. implicit Riemannian gradient descent
Global rates for Newton's method
\(c(x,y)=u(y|x)\longrightarrow\) Natural gradient descent:
\[x_{n+1}-x_n=-\nabla^2u(x_n)^{-1}\nabla f(x_n)\]
With \(u=f\) this is Newton's method: a new global convergence rate, under a new condition on \(f\) that is similar to, but different from, self-concordance.
T H E O R E M (FL–PCAF '23)
If \[\nabla^3u(\nabla^2u^{-1}\nabla f,-,-)\leq \nabla^2f\leq \nabla^2u+\nabla^3u(\nabla^2u^{-1}\nabla f,-,-),\] then
\[f(x_n)\leq f(x)+\frac{u(x_0|x)}{n}.\]
Explicit vs. implicit Riemannian gradient descent
\[\operatorname*{minimize}_{x\in M} f(x),\qquad c(x,y)=\frac{1}{2\tau} d_M^2(x,y)\]
1. Explicit: \(x_{n+1}=\exp_{x_n}\big(-\tau\nabla f(x_n)\big)\)
  - \(R\geq 0\): (smoothness and) \(\nabla^2f\geq 0\) gives \(O(1/n)\) convergence rates (da Cruz Neto, de Lima, Oliveira ’98; Bento, Ferreira, Melo ’17)
  - \(R\geq 0\): if \(\mathfrak{S}_c\geq 0\) then convexity of \(f\) on Kim–McCann geodesics gives \(O(1/n)\) convergence rates
  - \(R\leq 0\): ? (nonlocal condition)
2. Implicit: \(x_{n+1}=\argmin_{x} f(x)+\frac{1}{2\tau}d^2(x,x_n)\)
  - \(R\leq 0\): \(\nabla^2f\geq 0\) gives \(O(1/n)\) convergence rates (Wasserstein gradient flows, generalized geodesics: Ambrosio–Gigli–Savaré '05)
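A minimal sketch of the two updates in the Euclidean case \(M=\mathbb{R}^d\), where \(\exp_x(v)=x+v\) and \(d_M\) is the Euclidean distance; the toy objective and step size are illustrative:

import numpy as np
from scipy.optimize import minimize

# Illustrative Euclidean sketch (ad hoc names and toy objective);
# on a general manifold exp_x and d_M replace the Euclidean operations.
def explicit_step(x, grad_f, tau):
    # x_{n+1} = exp_{x_n}(-tau * grad f(x_n)); in R^d, exp_x(v) = x + v.
    return x - tau * grad_f(x)

def implicit_step(x, f, tau):
    # x_{n+1} = argmin_z f(z) + (1/(2*tau)) * ||z - x||^2   (proximal step)
    res = minimize(lambda z: f(z) + np.sum((z - x) ** 2) / (2 * tau), x)
    return res.x

f = lambda z: np.sum(z ** 2) + np.sin(z[0])            # toy smooth (convex) objective
grad_f = lambda z: 2 * z + np.array([np.cos(z[0]), 0.0])

x_e = x_i = np.array([2.0, -1.0])
for _ in range(50):
    x_e = explicit_step(x_e, grad_f, tau=0.1)
    x_i = implicit_step(x_i, f, tau=0.1)
print(x_e, x_i)   # both approach the same minimizer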
Thank you!