Optimization is everywhere: machine learning, operation research, data mining, theoretical computer science, etc.

\underset{x \in C}{\min} ~f(x)

\underset{x \in C}{\min} ~f(x)

First-order methods:

x^{(t)} = \text{span}\{ x^{(0)}, \nabla f( x^{(0)} ), \nabla f( x^{(1)} ), \ldots, \nabla f( x^{(t-1)} ) \}

x^{(t)} = \text{span}\{ x^{(0)}, \nabla f( x^{(0)} ), \nabla f( x^{(1)} ), \ldots, \nabla f( x^{(t-1)} ) \}

Why first-order methods?

Background and motivation

x^{(t)}

x^{(t)}

x^{(t+1)}

x^{(t+1)}

The literature

Gradient descent can be traced back to Cauchy's work in1847.

It is gaining increasing interest in the past 3 decades due to its empirical success.

Fundamental works have been done by pioneering researchers (Bertsekas, Nesterov, etc.).

There are still gaps between theory and practice.

Coordinate optimization

Mrrior descent

Stochastic subgradient method

Coordinate Optimization

Coordinate Descent

For

t = 0, 1, 2, \ldots,

t = 0, 1, 2, \ldots,

Select coordinate

i \in [d]

i \in [d]

Update

x_i^{(t+1)} = x_i^{(t)} - \eta_t \nabla_i f( x^{(t)} )

x_i^{(t+1)} = x_i^{(t)} - \eta_t \nabla_i f( x^{(t)} )

Different coordinate selection rules:

Random selection:

Cyclic selection:

Random permuted cyclic (the matrix AMGM inequality conjecture is false [LL20, S20])

Greedy selection (Gauss-Southwell)

i \in \arg\max_{ j \in [d] } | \nabla_j f( x^{(t)} ) |

i \in \arg\max_{ j \in [d] } | \nabla_j f( x^{(t)} ) |

i \sim \mathrm{uniform}\{1,2,\ldots, d\}

i \sim \mathrm{uniform}\{1,2,\ldots, d\}

i = (t-1) ~\mathrm{mod}~ d + 1

i = (t-1) ~\mathrm{mod}~ d + 1

GCD for Sparse Optimization

\underset{ x \in \mathbb{R}^d }{\min}~F(x) \coloneqq f(x) + g(x)

\underset{ x \in \mathbb{R}^d }{\min}~F(x) \coloneqq f(x) + g(x)

regularizer

One-norm regularization or nonnegative constraint can promote a sparse solution.

When initailzed at zero, greedy CD is observed to have an implicit screening ability to select variables that are nonzero at solution.

Iter 1

Iter 2

Iter 3

Iter 4

Iter

T

T

g(x) = \lambda \| x \|_1

g(x) = \lambda \| x \|_1

g(x) = \delta_{\geq 0}(x)

g(x) = \delta_{\geq 0}(x)

data fitting

GCD for Sparse Optimization

[FFSF, AISTATS'20]

We provide a theoretical characterization of GCD's screening ability:

GCD converges fast in first few iterations.

The iterate is "close" the to solution when the iterate is still sparse, and sparsity pattern will not further expand anymore.

x^{(0)}

x^{(0)}

x^{*}

x^{*}

x^{(100)}

x^{(100)}

d = 10^4, \| x^* \|_0 = 10

d = 10^4, \| x^* \|_0 = 10

\| x^{(t)} \|_0 \leq 110

\| x^{(t)} \|_0 \leq 110

for

t > 100

t > 100

\delta

\delta

From coordinate to atom

Learning a sparse representation of an atomic set :

\mathcal{A}

\mathcal{A}

x^* = \sum_{a \in \mathcal{A}} c_a a

x^* = \sum_{a \in \mathcal{A}} c_a a

such that

c_a \geq 0~\forall a \in \mathcal{A}

c_a \geq 0~\forall a \in \mathcal{A}

sparse vector:

\mathcal{A} = \{ \pm e_1, \pm e_2, \ldots, \pm e_d \}

\mathcal{A} = \{ \pm e_1, \pm e_2, \ldots, \pm e_d \}

Low-rank matrix:

\mathcal{A} = \{ uv^T \mid u \in \mathbb{R}^{n}, v \in \mathbb{R}^m, \|u\| = \|v\| = 1 \}

\mathcal{A} = \{ uv^T \mid u \in \mathbb{R}^{n}, v \in \mathbb{R}^m, \|u\| = \|v\| = 1 \}

Our contribution: how to identify the atoms with nonzero coefficients at solution during the optimization process.

[FFF21] Submitted

Online Convex Optimization

Play a game for rounds, for

T

T

Propose a point

x^{(t)} \in \mathcal{X}.

x^{(t)} \in \mathcal{X}.

Suffer loss

f_t( x^{(t)} ).

f_t( x^{(t)} ).

t = 1,2,\ldots, T

t = 1,2,\ldots, T

The goal of online learning algorithm: obtain sublinear regret

\displaystyle \mathrm{Regret} (T, z) \coloneqq \sum_{t=1}^T f_t( x^{(t)} ) - \sum_{t=1}^T f_t(z).

\displaystyle \mathrm{Regret} (T, z) \coloneqq \sum_{t=1}^T f_t( x^{(t)} ) - \sum_{t=1}^T f_t(z).

\mathrm{Regret} (T, z) = o(T) \to \mathrm{Regret} (T, z)/T \to 0.

\mathrm{Regret} (T, z) = o(T) \to \mathrm{Regret} (T, z)/T \to 0.

player's loss

competitor's loss

Mirror Descent (MD) and Dual Averaging (DA)

MD and DA are parameterized by mirror map, they have advantages over the vanilla projected subgradient method.

f_t(x) = c_t^T x \quad \text{and} \quad \mathcal{X} = \left\{ \sum_{i=1}^d x_i = 1, x_i \geq 0 \right\}, ~\|c_t\|_{\infty} \leq 1.

f_t(x) = c_t^T x \quad \text{and} \quad \mathcal{X} = \left\{ \sum_{i=1}^d x_i = 1, x_i \geq 0 \right\}, ~\|c_t\|_{\infty} \leq 1.

When is known in advance or , both MD and DA guarantee regret

T

T

\mathcal{O}( \sqrt{T} ).

\mathcal{O}( \sqrt{T} ).

When is unknown in advance and , then MD has rate while DA still guarantees regret.

T

T

\sup_{x, y \in \mathcal{X}} D_\Phi(x, y) = +\infty

\sup_{x, y \in \mathcal{X}} D_\Phi(x, y) = +\infty

\Omega ( T )

\Omega ( T )

\mathcal{O}( \sqrt{T} )

\mathcal{O}( \sqrt{T} )

\sup_{x, y \in \mathcal{X}} D_\Phi(x, y) < +\infty

\sup_{x, y \in \mathcal{X}} D_\Phi(x, y) < +\infty

Our contribution: fix the divergence issue of MD and obtain

regret.

\mathcal{O}( \sqrt{T} )

\mathcal{O}( \sqrt{T} )

OMD Algorithm

[FHPF, ICML'20]

Primal

Dual

x^{(t)}

x^{(t)}

\hat{x}^{(t)}

\hat{x}^{(t)}

\nabla \Phi

\nabla \Phi

\hat{y}^{(t)}

\hat{y}^{(t)}

-\eta_t \nabla f_t( x^{(t)} )

-\eta_t \nabla f_t( x^{(t)} )

y^{(t)}

y^{(t)}

x^{(t+1)}

x^{(t+1)}

Bregman projection

\nabla \Phi^*

\nabla \Phi^*

Figure accredited to Victor Portella

Stabilized OMD

[FHPF, ICML'20]

Primal

Dual

x^{(t)}

x^{(t)}

\hat{x}^{(t)}

\hat{x}^{(t)}

\nabla \Phi

\nabla \Phi

\hat{y}^{(t)}

\hat{y}^{(t)}

-\eta_t \nabla f_t( x^{(t)} )

-\eta_t \nabla f_t( x^{(t)} )

y^{(t)}

y^{(t)}

x^{(t+1)}

x^{(t+1)}

Bregman projection

\nabla \Phi^*

\nabla \Phi^*

\hat{x}^{(0)}

\hat{x}^{(0)}

\hat{w}^{(t)}

\hat{w}^{(t)}

}

\gamma_t

\gamma_t

With stabilization, OMD can obtain regret.

\mathcal{O}( \sqrt{T} )

\mathcal{O}( \sqrt{T} )

Smooth v.s. Nonsmooth Minimization

Consider minimizing a convex function

The iteration complexity of gradient descent (GD) and subgradient descent (subGD) for smooth and nonsmooth objectives:

\underset{ x \in \mathbb{R}^d }{\min}~ f(x).

\underset{ x \in \mathbb{R}^d }{\min}~ f(x).

when is smooth:

f

f

\mathcal{O}(1/\epsilon).

\mathcal{O}(1/\epsilon).

when is nonsmooth:

f

f

\mathcal{O}(1/\epsilon^2).

\mathcal{O}(1/\epsilon^2).

smooth

nonsmooth

Filling the gap

Our assumption:

The objective satisfy certain structure:

\underset{x \in \mathbb{R}^d}{\min}~ f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) \coloneqq \frac{1}{n} \sum_{i=1}^n \ell(h_i(x)),

\underset{x \in \mathbb{R}^d}{\min}~ f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) \coloneqq \frac{1}{n} \sum_{i=1}^n \ell(h_i(x)),

where is a nonnegative, , convex, 1-smooth loss function, 's are Lipschitz continuous.

\ell : \mathbb{R} \to \mathbb{R}_{\geq 0}

\ell : \mathbb{R} \to \mathbb{R}_{\geq 0}

The interpolation condition: there exist such that

x^*

x^*

f(x^*) = 0 .

f(x^*) = 0 .

h_i

h_i

We prove

[FFF, ICLR'21]

Convex objective:

Strongly convex objective:

\mathcal{O}(1/\epsilon^2) \to \mathcal{O}(1/\epsilon).

\mathcal{O}(1/\epsilon^2) \to \mathcal{O}(1/\epsilon).

\mathcal{O}(1/\epsilon) \to \mathcal{O}(\log(1/\epsilon)).

\mathcal{O}(1/\epsilon) \to \mathcal{O}(\log(1/\epsilon)).

Optimal

\inf \ell = 0

\inf \ell = 0

The above rates match the rate of SGD for smooth objectives.

First-order methods for structured optimization

Background and motivation

The literature

Coordinate Optimization

Coordinate Descent

GCD for Sparse Optimization

GCD for Sparse Optimization

From coordinate to atom

Online Mirror Descent

Online Convex Optimization

Mirror Descent (MD) and Dual Averaging (DA)

OMD Algorithm

Stabilized OMD

(Stochastic) Subgradient Method

Smooth v.s. Nonsmooth Minimization

The empirical observation

Filling the gap

Acknowledgment

Thank you! Questions?