Optimization is everywhere: machine learning, operation research, data mining, theoretical computer science, etc.

\underset{x \in C}{\min} ~f(x)

\underset{x \in C}{\min} ~f(x)

First-order methods:

x^{(t)} = \text{span}\{ x^{(0)}, \nabla f( x^{(0)} ), \nabla f( x^{(1)} ), \ldots, \nabla f( x^{(t-1)} ) \}

x^{(t)} = \text{span}\{ x^{(0)}, \nabla f( x^{(0)} ), \nabla f( x^{(1)} ), \ldots, \nabla f( x^{(t-1)} ) \}

Why first-order methods?

First-order method

x^{(t)}

x^{(t)}

x^{(t+1)}

x^{(t+1)}

Coordinate Descent

For

t = 0, 1, 2, \ldots,

t = 0, 1, 2, \ldots,

Select coordinate

i \in [d]

i \in [d]

Update

x_i^{(t+1)} = x_i^{(t)} - \eta_t \nabla_i f( x^{(t)} )

x_i^{(t+1)} = x_i^{(t)} - \eta_t \nabla_i f( x^{(t)} )

Different coordinate selection rules:

Random selection:

Cyclic selection:

Random permuted cyclic (the matrix AMGM inequality conjecture is false [LL20, S20])

Greedy selection (Gauss-Southwell)

i \in \arg\max_{ j \in [d] } | \nabla_j f( x^{(t)} ) |

i \in \arg\max_{ j \in [d] } | \nabla_j f( x^{(t)} ) |

i \sim \mathrm{uniform}\{1,2,\ldots, d\}

i \sim \mathrm{uniform}\{1,2,\ldots, d\}

i = (t-1) ~\mathrm{mod}~ d + 1

i = (t-1) ~\mathrm{mod}~ d + 1

^1

^1

^1

^1

Forty-Two Open Problems in the Mathematics of Data Science

GCD for Sparse Optimization

\underset{ x \in \mathbb{R}^d }{\min}~F(x) \coloneqq f(x) + g(x)

\underset{ x \in \mathbb{R}^d }{\min}~F(x) \coloneqq f(x) + g(x)

regularizer

One-norm regularization or nonnegative constraint can promote a sparse solution.

When initailzed at zero, greedy CD is observed to have an implicit screening ability to select variables that are nonzero at solution.

Iter 1

Iter 2

Iter 3

Iter 4

Iter

T

T

g(x) = \lambda \| x \|_1

g(x) = \lambda \| x \|_1

g(x) = \delta_{\geq 0}(x)

g(x) = \delta_{\geq 0}(x)

data fitting

GCD for Sparse Optimization

[FFSF, AISTATS'20]

We provide a theoretical characterization of GCD's screening ability:

GCD converges fast in first few iterations.

The iterate is "close" the to solution when the iterate is still sparse, and sparsity pattern will not further expand anymore.

x^{(0)}

x^{(0)}

x^{*}

x^{*}

x^{(100)}

x^{(100)}

d = 10^4, \| x^* \|_0 = 10

d = 10^4, \| x^* \|_0 = 10

\| x^{(t)} \|_0 \leq 110

\| x^{(t)} \|_0 \leq 110

for

t > 100

t > 100

\delta

\delta

From coordinate to atom

Learning a sparse representation of an atomic set :

\mathcal{A}

\mathcal{A}

x^* = \sum_{a \in \mathcal{A}} c_a a

x^* = \sum_{a \in \mathcal{A}} c_a a

such that

c_a \geq 0~\forall a \in \mathcal{A}.

c_a \geq 0~\forall a \in \mathcal{A}.

sparse vector:

\mathcal{A} = \{ \pm e_1, \pm e_2, \ldots, \pm e_d \}

\mathcal{A} = \{ \pm e_1, \pm e_2, \ldots, \pm e_d \}

Low-rank matrix:

\mathcal{A} = \{ uv^T \mid u \in \mathbb{R}^{n}, v \in \mathbb{R}^m, \|u\| = \|v\| = 1 \}

\mathcal{A} = \{ uv^T \mid u \in \mathbb{R}^{n}, v \in \mathbb{R}^m, \|u\| = \|v\| = 1 \}

Our contribution: how to identify the atoms with nonzero coefficients at solution during the optimization process.

[FFF, OJMO'24]

Online Convex Optimization

Play a game for rounds, for

T

T

Propose a point

x^{(t)} \in \mathcal{X}.

x^{(t)} \in \mathcal{X}.

Suffer loss

f_t( x^{(t)} ).

f_t( x^{(t)} ).

t = 1,2,\ldots, T

t = 1,2,\ldots, T

The goal of online learning algorithm: obtain sublinear regret

\displaystyle \mathrm{Regret} (T, z) \coloneqq \sum_{t=1}^T f_t( x^{(t)} ) - \sum_{t=1}^T f_t(z).

\displaystyle \mathrm{Regret} (T, z) \coloneqq \sum_{t=1}^T f_t( x^{(t)} ) - \sum_{t=1}^T f_t(z).

\mathrm{Regret} (T, z) = o(T) \to \mathrm{Regret} (T, z)/T \to 0.

\mathrm{Regret} (T, z) = o(T) \to \mathrm{Regret} (T, z)/T \to 0.

player's loss

competitor's loss

Mirror Descent (MD) and Dual Averaging (DA)

MD and DA are parameterized by mirror map, they have advantages over the vanilla projected subgradient method.

f_t(x) = c_t^T x \quad \text{and} \quad \mathcal{X} = \left\{ \sum_{i=1}^d x_i = 1, x_i \geq 0 \right\}, ~\|c_t\|_{\infty} \leq 1.

f_t(x) = c_t^T x \quad \text{and} \quad \mathcal{X} = \left\{ \sum_{i=1}^d x_i = 1, x_i \geq 0 \right\}, ~\|c_t\|_{\infty} \leq 1.

When is known in advance or , both MD and DA guarantee regret

T

T

\mathcal{O}( \sqrt{T} ).

\mathcal{O}( \sqrt{T} ).

When is unknown in advance and , then MD has rate while DA still guarantees regret.

T

T

\sup_{x, y \in \mathcal{X}} D_\Phi(x, y) = +\infty

\sup_{x, y \in \mathcal{X}} D_\Phi(x, y) = +\infty

\Omega ( T )

\Omega ( T )

\mathcal{O}( \sqrt{T} )

\mathcal{O}( \sqrt{T} )

\sup_{x, y \in \mathcal{X}} D_\Phi(x, y) < +\infty

\sup_{x, y \in \mathcal{X}} D_\Phi(x, y) < +\infty

Our contribution: fix the divergence issue of MD and obtain

regret.

\mathcal{O}( \sqrt{T} )

\mathcal{O}( \sqrt{T} )

OMD Algorithm

[FHPF, ICML'20, JMLR'22]

Primal

Dual

x^{(t)}

x^{(t)}

\hat{x}^{(t)}

\hat{x}^{(t)}

\nabla \Phi

\nabla \Phi

\hat{y}^{(t)}

\hat{y}^{(t)}

-\eta_t \nabla f_t( x^{(t)} )

-\eta_t \nabla f_t( x^{(t)} )

y^{(t)}

y^{(t)}

x^{(t+1)}

x^{(t+1)}

Bregman projection

\nabla \Phi^*

\nabla \Phi^*

Figure accredited to Victor Portella.

Stabilized OMD

[FHPF, ICML'20, JMLR'22]

Primal

Dual

x^{(t)}

x^{(t)}

\hat{x}^{(t)}

\hat{x}^{(t)}

\nabla \Phi

\nabla \Phi

\hat{y}^{(t)}

\hat{y}^{(t)}

-\eta_t \nabla f_t( x^{(t)} )

-\eta_t \nabla f_t( x^{(t)} )

y^{(t)}

y^{(t)}

x^{(t+1)}

x^{(t+1)}

Bregman projection

\nabla \Phi^*

\nabla \Phi^*

\hat{x}^{(0)}

\hat{x}^{(0)}

\hat{w}^{(t)}

\hat{w}^{(t)}

}

\gamma_t

\gamma_t

With stabilization, OMD can obtain regret.

\mathcal{O}( \sqrt{T} )

\mathcal{O}( \sqrt{T} )

Smooth v.s. Nonsmooth Minimization

Consider minimizing a convex function

The iteration complexity of gradient descent (GD) and subgradient descent (subGD) for smooth and nonsmooth objectives:

\underset{ x \in \mathbb{R}^d }{\min}~ f(x).

\underset{ x \in \mathbb{R}^d }{\min}~ f(x).

when is smooth:

f

f

\mathcal{O}(1/\epsilon).

\mathcal{O}(1/\epsilon).

when is nonsmooth:

f

f

\mathcal{O}(1/\epsilon^2).

\mathcal{O}(1/\epsilon^2).

smooth

nonsmooth

Filling the gap

Two important structures:

The objective satisfy certain structure:

\underset{x \in \mathbb{R}^d}{\min}~ f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) \coloneqq \frac{1}{n} \sum_{i=1}^n \ell(h_i(x)),

\underset{x \in \mathbb{R}^d}{\min}~ f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) \coloneqq \frac{1}{n} \sum_{i=1}^n \ell(h_i(x)),

where is a nonnegative, , convex, 1-smooth loss function, 's are Lipschitz continuous.

\ell : \mathbb{R} \to \mathbb{R}_{\geq 0}

\ell : \mathbb{R} \to \mathbb{R}_{\geq 0}

The interpolation condition: there exist such that

x^*

x^*

f(x^*) = 0 .

f(x^*) = 0 .

h_i

h_i

[FFF, ICLR'21]

\inf \ell = 0

\inf \ell = 0

square loss, L2-hinge loss, logistic loss, etc.

absolute loss, L1-hinge loss.

Filling the gap

With constant learning rate, we prove

Convex objective:

Strongly convex objective:

\mathcal{O}(1/\epsilon^2) \to \mathcal{O}(1/\epsilon).

\mathcal{O}(1/\epsilon^2) \to \mathcal{O}(1/\epsilon).

\mathcal{O}(1/\epsilon) \to \mathcal{O}(\log(1/\epsilon)).

\mathcal{O}(1/\epsilon) \to \mathcal{O}(\log(1/\epsilon)).

The above rates match the rate of SGD for smooth objectives.

[FFF, ICLR'21]

Lower bounds

Can we accelerated SSGD using momentum under interpolation? e.g.,

Two follow up questions:

\mathcal{O}(\epsilon^{-{1}/{2}}).

\mathcal{O}(\epsilon^{-{1}/{2}}).

The structure stays in the center of our analysis, could this structure itself give us improved rate without the interpolation condition?

[FFF, ICLR'21]

f_i = \ell \circ h_i

f_i = \ell \circ h_i

The answer to above questions is "no".

With interpolation condition:

Without interpolation condition:

\mathcal{\Omega}(1/\epsilon).

\mathcal{\Omega}(1/\epsilon).

\mathcal{\Omega}(1/\epsilon^2).

\mathcal{\Omega}(1/\epsilon^2).

We derive lower bounds for iteration complexity:

Markov Logic Network (MLN)

First-order logic rules:

\text{father}(X, Y) \land \text{brother}(Y, Z) \Longrightarrow \text{father}(X, Z);

\text{father}(X, Y) \land \text{brother}(Y, Z) \Longrightarrow \text{father}(X, Z);

\text{friend}(X, Y) \land \text{smoke}(X) \Longrightarrow \text{smoke}(Y).

\text{friend}(X, Y) \land \text{smoke}(X) \Longrightarrow \text{smoke}(Y).

1.0

0.4

Given a set of facts, MLN defines the probability of the "world" as

\displaystyle \text{Pr}(X) \propto \exp\left( \sum_{i=1}^m w_i f_i( X ) \right)

\displaystyle \text{Pr}(X) \propto \exp\left( \sum_{i=1}^m w_i f_i( X ) \right)

where is the number of time that the i-th rule is satisfied.

f_i(X)

f_i(X)

Computational Optimization and Machine Learning

BIMSA_talk

More from Fang Huang