Flavien Léger
joint works with Pierre-Cyril Aubin-Frankowski,
Gabriele Todeschi, François-Xavier Vialard

Gradient descent with a general cost
What I will present
Main motivation: optimization on a space of measures P(M):
minimize E: P(M) → R∪{+∞}
Typical scheme: μ_{n+1} ∈ argmin_{μ∈P(M)} E(μ) + D(μ, μ_n),
where D(μ,ν) is, e.g.,
- a transport cost: W_2^2(μ,ν), T_c(μ,ν), ...
- a Bregman divergence: KL(μ,ν), ...
- a Csiszár divergence: ∫_M (μ−ν)², ...
- a regularized cost
Minimizing movement schemes based on general movement limiters
...
What I will present
Minimizing movement schemes based on general movement limiters
1. Formulations for implicit and explicit schemes with a general movement limiter
2. Theory for rates of convergence based on convexity along specific paths, and a generalized “L-smoothness” (“L-Lipschitz gradients”) for the explicit scheme
→ unify gradient / mirror / natural gradient / Riemannian gradient descents
3. Applications
Implicit scheme
Minimize E:X→R∪{+∞}, where X is a set (finite or infinite dimensional...)
Use D:X×X→[0,+∞]
Algorithm (implicit scheme): x_{n+1} ∈ argmin_{x∈X} E(x) + D(x, x_n)   (see the sketch below)
Motivations for general D(x,y):
Implicit scheme
- D(x,y) tailored to the problem
- Regularized squared distance
- Discretizing gradient flows “x˙(t)=−∇E(x(t))”
Toy example: ẋ(t) = −∇²u(x(t))^{-1} ∇E(x(t)), with u: R^d → R strictly convex
Two approaches:
- d = distance for the Hessian metric ∇²u
- D = Bregman divergence of u (cf. the mirror descent and Newton slides below)
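To make the implicit scheme above concrete, here is a minimal numerical sketch (my own illustration, not from the slides): the objective E, the movement limiter D and the use of scipy's BFGS solver for each subproblem are all assumptions.

import numpy as np
from scipy.optimize import minimize

# One implicit (minimizing-movement) step: x_{n+1} ∈ argmin_x E(x) + D(x, x_n).
def E(x):
    return 0.25 * np.sum(x**4) + 0.5 * np.sum(x**2)   # toy smooth objective

def D(x, y):
    return 0.5 * np.sum((x - y)**2)                   # movement limiter: squared distance

def implicit_step(x_n):
    # each step is a small optimization subproblem, solved numerically here
    return minimize(lambda x: E(x) + D(x, x_n), x_n, method="BFGS").x

x = np.full(3, 2.0)
for _ in range(30):
    x = implicit_step(x)
# x approaches the minimizer of E (the origin for this toy objective)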
Explicit minimizing movements: warm-up
E: R^d → R
Gradient descent: x_{n+1} = x_n − (1/L) ∇E(x_n)
“Variational” formulation of gradient descent:
x_{n+1} = argmin_{x∈R^d} E(x_n) + ⟨∇E(x_n), x − x_n⟩ + (L/2) ‖x − x_n‖²
Two steps:
1) majorize: find the tangent parabola (“surrogate”)
2) minimize: minimize the surrogate
Explicit minimizing movements: warm-up

If E is L-smooth (∇²E ≼ L·I_{d×d}) then E sits below the surrogate:
E(x) ≤ E(x_n) + ⟨∇E(x_n), x − x_n⟩ + (L/2) ‖x − x_n‖²
x_{n+1} = x_n − (1/L) ∇E(x_n)
Two steps:
1) majorize: find the tangent parabola (“surrogate”)
2) minimize: minimize the surrogate
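As a small check of the majorize–minimize picture (my own illustration on an assumed quadratic objective), minimizing the tangent parabola in closed form reproduces the gradient step x_{n+1} = x_n − (1/L)∇E(x_n):

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])        # toy problem E(x) = 0.5 x·Ax − b·x
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A).max()               # smallest L with ∇²E ≼ L·I

def E(x):
    return 0.5 * x @ A @ x - b @ x

def grad_E(x):
    return A @ x - b

def surrogate(x, x_n):
    # tangent parabola at x_n; it majorizes E because E is L-smooth
    return E(x_n) + grad_E(x_n) @ (x - x_n) + 0.5 * L * np.sum((x - x_n) ** 2)

x = np.array([5.0, -3.0])
for _ in range(100):
    x_new = x - grad_E(x) / L                 # closed-form argmin of the surrogate
    assert E(x_new) <= surrogate(x_new, x) + 1e-9   # majorization implies descent
    x = x_new
print(np.allclose(x, np.linalg.solve(A, b)))  # True: converges to the minimizer A⁻¹b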
Explicit minimizing movements: c-concavity
Abstract setting: X, Y: two sets, with a cost D(x,y).
Definition. E is c-concave if ∃h: Y → R∪{+∞} such that E(x) = inf_{y∈Y} [D(x,y) + h(y)] for all x∈X.
This generalizes “L-smoothness”.
Smallest such h is the c-transform: h(y) = sup_{x∈X} E(x) − D(x,y)
[Figure: a c-concave function vs. a function that is not c-concave]
Explicit minimizing movements: c-concavity
Suppose that ∀x∈X, ∃y∈Y: ∇_x D(x,y) = ∇E(x) and ∇²E(x) ≼ ∇²_{xx} D(x,y).
Then E is c-concave.
Explicit minimizing movements: c-concavity
Example. For the quadratic cost D(x,y) = (L/2)‖x−y‖², E is c-concave ⟺ ∇²E ≼ L·I_{d×d}.
Example. Differentiable NNCC setting.
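A worked check of the first example (my own derivation, plugging the quadratic cost into the definition of c-concavity above):

\[
E(x)=\inf_{y}\Big[\tfrac{L}{2}\|x-y\|^2+h(y)\Big]
\;\Longleftrightarrow\;
\tfrac{L}{2}\|x\|^2-E(x)=\sup_{y}\Big[L\langle x,y\rangle-\tfrac{L}{2}\|y\|^2-h(y)\Big],
\]

so E is c-concave exactly when (L/2)‖x‖² − E(x) is convex (a supremum of affine functions of x), which for twice-differentiable E is the condition ∇²E ≼ L·I_{d×d}.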
Explicit minimizing movements

Algorithm (explicit scheme) (L–Aubin-Frankowski '23). Assume E c-concave, with c-transform h:
y_{n+1} ∈ argmin_{y∈Y} D(x_n, y) + h(y)   (majorize)
x_{n+1} ∈ argmin_{x∈X} D(x, y_{n+1})   (minimize)
Other point of view:
Explicit minimizing movements
X, Y smooth manifolds, D ∈ C¹(X×Y), E ∈ C¹(X) c-concave
Under certain assumptions, the explicit scheme can be written as:
∇_x D(x_n, y_{n+1}) = ∇E(x_n)   (majorize)
∇_x D(x_{n+1}, y_{n+1}) = 0   (minimize)
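A minimal numerical sketch of these two equations (my own illustration: the objective E, the preconditioned quadratic cost D and the scipy solvers for the two inner problems are all assumptions):

import numpy as np
from scipy.optimize import minimize, root

M = np.diag([4.0, 2.0])                       # cost D(x,y) = 0.5 (x−y)·M(x−y)

def E(x):
    return np.log(np.sum(np.exp(x))) + 0.1 * np.sum(x**2)

def grad_E(x):
    w = np.exp(x)
    return w / w.sum() + 0.2 * x

def grad_x_D(x, y):
    return M @ (x - y)

x = np.array([2.0, -1.0])
for _ in range(200):
    # majorize: find y with ∇_x D(x_n, y) = ∇E(x_n)
    y = root(lambda z: grad_x_D(x, z) - grad_E(x), x).x
    # minimize: x_{n+1} ∈ argmin_x D(x, y), i.e. ∇_x D(x_{n+1}, y) = 0
    x = minimize(lambda z: 0.5 * (z - y) @ M @ (z - y), x).x
# for this cost the two steps reduce to x_{n+1} = x_n − M⁻¹ ∇E(x_n)
print(np.linalg.norm(grad_E(x)))              # ≈ 0 at the minimizer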
2. Convergence rates
EVI and convergence rates
Definition (L–Aubin-Frankowski ’23). Evolution Variational Inequality (Ambrosio–Gigli–Savaré ’05), or five-point property (Csiszár–Tusnády ’84).
Theorem (L–Aubin-Frankowski '23). If (x_n, y_n) satisfy the EVI then:
- sublinear rates when μ = 0
- exponential rates when μ > 0
Variational c-segments and NNCC spaces
X, Y two arbitrary sets, D: X×Y → R∪{±∞}.
Definition (L–Todeschi–Vialard '24).
⏵ s ↦ (x(s), ȳ) is a variational c-segment if D(x(s), ȳ) is finite and, for every y∈Y,
   D(x(s), ȳ) − D(x(s), y) ≤ (1−s) [D(x(0), ȳ) − D(x(0), y)] + s [D(x(1), ȳ) − D(x(1), y)]
⏵ (X×Y, D) is a space with nonnegative cross-curvature (NNCC space) if variational c-segments always exist.
Origins in the regularity theory of optimal transport (Ma–Trudinger–Wang ’05), (Trudinger–Wang ’09), (Kim–McCann ’10); convexity of the set of c-concave functions (Figalli–Kim–McCann '11).
Finite dimensional examples
- D(x,y) = Bregman divergence on R^d
- D(x,y) = ‖x−y‖² on R^d (both are verified in the computation below)
- Sphere S^d with the squared geodesic distance
- Bures–Wasserstein
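A short check of the first two items (my own computation, using the definition of variational c-segments above): along any straight line x(s) = (1−s)x(0) + s·x(1), the relevant difference is affine in x, hence the required inequality holds. Indeed,

\[
\|x-\bar y\|^2-\|x-y\|^2=-2\langle x,\,\bar y-y\rangle+\|\bar y\|^2-\|y\|^2,
\]

and for a Bregman divergence D(x,y) = u(x) − u(y) − ⟨∇u(y), x−y⟩,

\[
D(x,\bar y)-D(x,y)=\langle\nabla u(y)-\nabla u(\bar y),\,x\rangle+\mathrm{const},
\]

both affine in x; so straight lines give variational c-segments and these costs are NNCC.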
Infinite dimensional examples
Costs on measures. The following are NNCC:
- Relative entropy KL(μ,ν) = ∫ log(dμ/dν) dμ
- Hellinger D(μ,ν) = ∫ (√(dμ/dλ) − √(dν/dλ))² dλ
- Fisher–Rao = length space associated with Hellinger
- Transport costs: squared Wasserstein distance on R^d, on the sphere...
Gromov–Wasserstein: (G×G, GW²) is NNCC, for X = [X, f, μ] and Y = [Y, g, ν] ∈ G.
[Figure: illustration of Gromov–Wasserstein (image: G. Peyré)]
Convergence rates for minimizing movements
Theorem (L–Aubin-Frankowski '23). Suppose that for each x∈X and n≥0,
⏵ there exists a variational c-segment s ↦ (x(s), y_n) on (X×Y, D) with x(0) = x_n and x(1) = x,
⏵ s ↦ E(x(s)) − μ D(x(s), y_{n+1}) is convex,
⏵ lim_{s→0⁺} D(x(s), y_{n+1})/s = 0.
Then sublinear (μ=0) or linear (μ>0) convergence rates hold.
3. Applications
Background on mirror descent
Minimize E: R^d → R without using the Euclidean structure.
Instead use a strictly convex u: R^d → R.
“Mirror map” ∇u: R^d → R^d to go from the primal to the dual space.
Mirror descent: ∇u(x_{n+1}) = ∇u(x_n) − τ ∇E(x_n)
Convergence rates: if μ∇²u ≼ ∇²E ≼ ∇²u then sublinear (μ=0) or linear (μ>0) rates hold.
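A minimal sketch of mirror descent with an assumed entropic mirror map u(x) = Σᵢ xᵢ log xᵢ on the probability simplex (the quadratic objective is also an illustrative assumption); the mirror update becomes a multiplicative rule:

import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 5))
Q = Q @ Q.T                                   # toy convex objective E(x) = 0.5 x·Qx

def grad_E(x):
    return Q @ x

x = np.full(5, 0.2)                           # start at the uniform distribution
tau = 0.1
for _ in range(500):
    x = x * np.exp(-tau * grad_E(x))          # dual step ∇u(x⁺) = ∇u(x) − τ∇E(x)
    x /= x.sum()                              # normalize back onto the simplex
print(x, x @ Q @ x)                           # approximate minimizer of E on the simplex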
Nondifferentiable mirror descent
X = Y = R^d, u: R^d → R∪{+∞} convex, E: R^d → R∪{+∞}
Idea: cost D(x,y) = u(x) + u*(y) − ⟨x,y⟩ = u(x|ỹ) for y = ∇u(ỹ)
Algorithm. Given y_n, apply the explicit scheme with this cost.
Theorem. Assume E convex and u − E convex. Convergence rate: sublinear.
Global rates for Newton's method
D(x,y) = u(y|x) ⟶ x_{n+1} − x_n = −∇²u(x_n)^{-1} ∇f(x_n)
Newton's method (u = f): if 0 ≼ ∇³f(x)((∇²f)^{-1}(x)∇f(x), −, −) ≼ (1−λ) ∇²f(x) then
f(x_n) − f* ≤ ((1−λ)/2)^n (f(x_0) − f*)
If ∇³u(∇²u^{-1}∇f, −, −) ≼ ∇²f ≼ ∇²u + ∇³u(∇²u^{-1}∇f, −, −) then
f(x_n) ≤ f(x) + u(x_0|x)/n
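A small sketch of the update x_{n+1} − x_n = −∇²u(x_n)^{-1}∇f(x_n) in the Newton case u = f (the test function below is an illustrative assumption):

import numpy as np

def f(x):
    return np.sum(np.exp(x)) + 0.5 * np.sum(x**2)    # toy smooth convex objective

def grad_f(x):
    return np.exp(x) + x

def hess_f(x):
    return np.diag(np.exp(x) + 1.0)

x = np.array([1.5, -2.0, 0.3])
for n in range(20):
    x = x - np.linalg.solve(hess_f(x), grad_f(x))    # x_{n+1} − x_n = −∇²f(x_n)⁻¹∇f(x_n)
print(np.linalg.norm(grad_f(x)))                     # ≈ 0: first-order optimality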
Riemannian setting
(da Cruz Neto, de Lima, Oliveira ’98), (Bento, Ferreira, Melo ’17)
1. Implicit: x_{n+1} = argmin_x f(x) + (1/(2τ)) d_M²(x, x_n)
   Riem ≤ 0: ∇²f ≥ 0 gives O(1/n) convergence rates
   Riem ≥ 0: convexity of f on c-segments gives O(1/n) convergence rates
2. Explicit: x_{n+1} = exp_{x_n}(−τ ∇f(x_n))
   Riem ≥ 0: ∇²f ≥ 0 gives O(1/n) convergence rates
   Riem ≤ 0: convexity of f on c-segments gives O(1/n) convergence rates
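A minimal sketch of the explicit scheme 2 on the sphere S² ⊂ R³ (the linear objective f(x) = ⟨a, x⟩, the point a and the step size are illustrative assumptions):

import numpy as np

a = np.array([1.0, -2.0, 0.5])

def riem_grad(x):
    # Riemannian gradient of f(x) = <a, x>: project a onto the tangent space T_x S²
    return a - (a @ x) * x

def exp_map(x, v):
    # exponential map on the sphere: follow the great circle from x in direction v
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * v / nv

x = np.array([0.0, 0.0, 1.0])
tau = 0.2
for _ in range(200):
    x = exp_map(x, -tau * riem_grad(x))              # x_{n+1} = exp_{x_n}(−τ∇f(x_n))
print(x, -a / np.linalg.norm(a))                     # x approaches the minimizer −a/‖a‖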
Thank you!