Optimization is everywhere: machine learning, operations research, data mining, theoretical computer science, etc.

\underset{x \in C}{\min} ~f(x)

First-order methods: each iterate is built only from the starting point and past gradients,

x^{(t)} \in \text{span}\{ x^{(0)}, \nabla f( x^{(0)} ), \nabla f( x^{(1)} ), \ldots, \nabla f( x^{(t-1)} ) \}

Why first-order methods?

[Figure: a first-order method maps the iterate $x^{(t)}$ to $x^{(t+1)}$.]

Coordinate Descent

For $t = 0, 1, 2, \ldots$:

Select a coordinate $i \in [d]$.

Update (all other coordinates stay unchanged)

x_i^{(t+1)} = x_i^{(t)} - \eta_t \nabla_i f( x^{(t)} )

Different coordinate selection rules:

Random selection:

i \sim \mathrm{uniform}\{1,2,\ldots, d\}

Cyclic selection:

i = (t-1) ~\mathrm{mod}~ d + 1

Random permuted cyclic (the matrix AMGM inequality conjecture is false [LL20, S20])^1

Greedy selection (Gauss-Southwell):

i \in \arg\max_{ j \in [d] } | \nabla_j f( x^{(t)} ) |

^1 Forty-Two Open Problems in the Mathematics of Data Science
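The selection rules above can be compared in a few lines of code. The following is a minimal sketch, not the speaker's implementation: it assumes a least-squares objective $f(x) = \frac{1}{2}\|Ax - b\|^2$, uses the step size $\eta_t = 1/\|A_{:,i}\|^2$ (which minimizes this $f$ exactly along coordinate $i$), and uses 0-based indices. The function name `coordinate_descent` and the rule names are hypothetical.

```python
import numpy as np


def coordinate_descent(A, b, rule="greedy", iters=500, seed=0):
    """Coordinate descent on f(x) = 0.5 * ||A x - b||^2 (assumed test objective)."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    x = np.zeros(d)
    # Coordinate-wise curvature ||A[:, i]||^2. Using eta_t = 1 / curvature[i]
    # is an illustrative choice: it minimizes f exactly along coordinate i.
    curvature = np.sum(A * A, axis=0)
    perm = np.arange(d)
    for t in range(iters):
        grad = A.T @ (A @ x - b)
        if rule == "random":
            i = int(rng.integers(d))
        elif rule == "cyclic":
            i = t % d  # 0-based counterpart of the cyclic rule
        elif rule == "permuted":
            if t % d == 0:  # draw a fresh permutation at the start of each pass
                perm = rng.permutation(d)
            i = int(perm[t % d])
        elif rule == "greedy":
            i = int(np.argmax(np.abs(grad)))  # Gauss-Southwell
        else:
            raise ValueError(f"unknown rule: {rule}")
        x[i] -= grad[i] / curvature[i]
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((20, 5))
    b = rng.standard_normal(20)
    for rule in ("random", "cyclic", "permuted", "greedy"):
        x = coordinate_descent(A, b, rule=rule)
        print(rule, 0.5 * np.sum((A @ x - b) ** 2))
```

Recomputing the full gradient at every step keeps the sketch short; a practical implementation would update a residual incrementally so that the non-greedy rules only touch one column per iteration.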

GCD for Sparse Optimization

\underset{ x \in \mathbb{R}^d }{\min}~F(x) \coloneqq f(x) + g(x)

Here $f$ is the data-fitting term and $g$ is the regularizer.

One-norm regularization or a nonnegativity constraint can promote a sparse solution:

g(x) = \lambda \| x \|_1

g(x) = \delta_{\geq 0}(x)

When initialized at zero, greedy CD (GCD) is observed to have an implicit screening ability: it selects the variables that are nonzero at the solution.

[Figure: iterates at Iter 1, Iter 2, Iter 3, Iter 4, ..., Iter $T$.]

GCD for Sparse Optimization

[FFSF, AISTATS'20]

We provide a theoretical characterization of GCD's screening ability:

GCD converges fast in the first few iterations.

The iterate is already "close" to the solution while it is still sparse, and the sparsity pattern does not expand any further.

[Figure: $x^{(0)}$, $x^{(100)}$, and $x^{*}$, with a $\delta$ annotation. Example: $d = 10^4$, $\| x^* \|_0 = 10$, and $\| x^{(t)} \|_0 \leq 110$ for $t > 100$.]
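The screening behavior can be observed numerically by tracking the support size of the iterates. The sketch below is illustrative only and makes several assumptions that are not on the slides: a least-squares data-fitting term, the regularizer $g(x) = \lambda\|x\|_1$, a proximal coordinate update, and a greedy rule that picks the coordinate whose proximal update moves the most (one common Gauss-Southwell variant for composite problems). The names `greedy_cd_lasso` and `soft_threshold` are hypothetical.

```python
import numpy as np


def soft_threshold(z, tau):
    """Proximal operator of tau * |.|, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def greedy_cd_lasso(A, b, lam, iters=100):
    """Greedy proximal CD on F(x) = 0.5 * ||A x - b||^2 + lam * ||x||_1, from x = 0."""
    d = A.shape[1]
    x = np.zeros(d)
    curvature = np.sum(A * A, axis=0)  # ||A[:, i]||^2 for each coordinate
    residual = A @ x - b
    support_sizes = []
    for _ in range(iters):
        grad = A.T @ residual
        # Candidate value of every coordinate after its own proximal update.
        candidate = soft_threshold(x - grad / curvature, lam / curvature)
        # Assumed greedy rule: update the coordinate that would move the most.
        i = int(np.argmax(np.abs(candidate - x)))
        delta = candidate[i] - x[i]
        x[i] += delta
        residual += delta * A[:, i]  # keep A x - b in sync with x
        support_sizes.append(int(np.count_nonzero(x)))
    return x, support_sizes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 1000))
    x_true = np.zeros(1000)
    x_true[:10] = 1.0  # synthetic sparse ground truth
    b = A @ x_true
    x, sizes = greedy_cd_lasso(A, b, lam=5.0, iters=100)
    print("support size over iterations:", sizes[::10])
```

Because each iteration changes a single coordinate, the support can grow by at most one entry per step, which is why starting from zero keeps the early iterates sparse by construction.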

From coordinate to atom

[FFF, OJMO'24]

Learning a sparse representation with respect to an atomic set $\mathcal{A}$:

x^* = \sum_{a \in \mathcal{A}} c_a a

such that

c_a \geq 0~\forall a \in \mathcal{A}.

Sparse vector:

\mathcal{A} = \{ \pm e_1, \pm e_2, \ldots, \pm e_d \}

Low-rank matrix:

\mathcal{A} = \{ uv^T \mid u \in \mathbb{R}^{n}, v \in \mathbb{R}^m, \|u\| = \|v\| = 1 \}

Our contribution: how to identify, during the optimization process, the atoms that have nonzero coefficients at the solution.

Online Convex Optimization

Play a game for $T$ rounds; for $t = 1,2,\ldots, T$:

Propose a point $x^{(t)} \in \mathcal{X}.$

Suffer loss $f_t( x^{(t)} ).$

The goal of an online learning algorithm: obtain sublinear regret

\displaystyle \mathrm{Regret} (T, z) \coloneqq \sum_{t=1}^T f_t( x^{(t)} ) - \sum_{t=1}^T f_t(z).

The first sum is the player's loss and the second is the competitor's loss.

\mathrm{Regret} (T, z) = o(T) \implies \mathrm{Regret} (T, z)/T \to 0.
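As a small worked example of the regret definition, the sketch below evaluates $\mathrm{Regret}(T, z)$ for linear losses $f_t(x) = c_t^T x$. The linear-loss restriction and the helper name `regret` are assumptions made for illustration.

```python
import numpy as np


def regret(costs, plays, z):
    """Regret(T, z) for linear losses f_t(x) = c_t^T x.

    costs: array of shape (T, d), row t is c_t
    plays: array of shape (T, d), row t is the player's point x^(t)
    z:     fixed competitor point of shape (d,)
    """
    costs = np.asarray(costs, dtype=float)
    plays = np.asarray(plays, dtype=float)
    z = np.asarray(z, dtype=float)
    player_loss = float(np.sum(np.einsum("td,td->t", costs, plays)))
    competitor_loss = float(np.sum(costs @ z))
    return player_loss - competitor_loss


if __name__ == "__main__":
    costs = [[1.0, 0.0], [1.0, 0.0]]  # the first coordinate is always costly
    plays = [[0.5, 0.5], [0.5, 0.5]]  # the player always plays the uniform point
    print(regret(costs, plays, z=[0.0, 1.0]))  # regret against the better vertex
```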

Mirror Descent (MD) and Dual Averaging (DA)

MD and DA are parameterized by a mirror map $\Phi$, which gives them advantages over the vanilla projected subgradient method.

Example setting:

f_t(x) = c_t^T x \quad \text{and} \quad \mathcal{X} = \left\{ x \in \mathbb{R}^d : \sum_{i=1}^d x_i = 1, x_i \geq 0 \right\}, ~\|c_t\|_{\infty} \leq 1.

When $T$ is known in advance or $\sup_{x, y \in \mathcal{X}} D_\Phi(x, y) < +\infty$ (where $D_\Phi$ is the Bregman divergence of $\Phi$), both MD and DA guarantee $\mathcal{O}( \sqrt{T} )$ regret.

When $T$ is unknown in advance and $\sup_{x, y \in \mathcal{X}} D_\Phi(x, y) = +\infty$, MD can suffer $\Omega ( T )$ regret while DA still guarantees $\mathcal{O}( \sqrt{T} )$ regret.

Our contribution: fix the divergence issue of MD and obtain $\mathcal{O}( \sqrt{T} )$ regret.
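To make the difference between MD and DA concrete, here is a minimal sketch for linear losses on the simplex. It assumes the negative-entropy mirror map (whose Bregman divergence, the KL divergence, is unbounded on the simplex), for which both updates have closed forms. The function names and the step-size schedules are hypothetical choices, not taken from the talk.

```python
import numpy as np


def softmax(logits):
    """Map dual-space logits to a point on the probability simplex."""
    w = np.exp(logits - np.max(logits))
    return w / np.sum(w)


def entropic_md(costs, etas):
    """Mirror descent: x^(t+1) is proportional to x^(t) * exp(-eta_t * c_t)."""
    T, d = costs.shape
    logits = np.zeros(d)  # uniform starting point
    plays = np.empty((T, d))
    for t in range(T):
        plays[t] = softmax(logits)
        logits = logits - etas[t] * costs[t]  # each past cost keeps its own eta
    return plays


def entropic_da(costs, etas):
    """Dual averaging: x^(t) is proportional to exp(-eta_t * sum of past costs)."""
    T, d = costs.shape
    cumulative = np.zeros(d)
    plays = np.empty((T, d))
    for t in range(T):
        plays[t] = softmax(-etas[t] * cumulative)  # current eta rescales all history
        cumulative = cumulative + costs[t]
    return plays


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 200, 5
    costs = rng.uniform(-1.0, 1.0, size=(T, d))  # satisfies ||c_t||_inf <= 1
    etas = 1.0 / np.sqrt(np.arange(1, T + 1))    # anytime schedule, T not needed
    md, da = entropic_md(costs, etas), entropic_da(costs, etas)
    print("largest gap between MD and DA iterates:", np.max(np.abs(md - da)))
```

With a constant step size the two methods produce identical iterates in this setting; they differ only when the step size varies over time, which is the regime where $T$ is unknown in advance.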

OMD Algorithm

[FHPF, ICML'20, JMLR'22]

[Figure: one OMD step across the primal and dual spaces. Labels: $x^{(t)}$, $\nabla \Phi$, $\hat{y}^{(t)}$, $-\eta_t \nabla f_t( x^{(t)} )$, $y^{(t)}$, $\nabla \Phi^*$, $\hat{x}^{(t)}$, Bregman projection, $x^{(t+1)}$.]

Figure credit: Victor Portella.

Stabilized OMD

[FHPF, ICML'20, JMLR'22]

[Figure: one stabilized OMD step across the primal and dual spaces. Labels: $x^{(t)}$, $\nabla \Phi$, $\hat{y}^{(t)}$, $-\eta_t \nabla f_t( x^{(t)} )$, $y^{(t)}$, $\nabla \Phi^*$, $\hat{x}^{(t)}$, Bregman projection, $x^{(t+1)}$, plus the stabilization labels $\hat{x}^{(0)}$, $\hat{w}^{(t)}$, and $\gamma_t$.]

With stabilization, OMD can obtain $\mathcal{O}( \sqrt{T} )$ regret.

Smooth vs. Nonsmooth Minimization

Consider minimizing a convex function

\underset{ x \in \mathbb{R}^d }{\min}~ f(x).

The iteration complexity of gradient descent (GD) and subgradient descent (subGD) for smooth and nonsmooth objectives:

When $f$ is smooth:

\mathcal{O}(1/\epsilon).

When $f$ is nonsmooth:

\mathcal{O}(1/\epsilon^2).

[Figure: panels labeled "smooth" and "nonsmooth".]

Filling the gap

[FFF, ICLR'21]

Two important structures:

The objective satisfies a certain structure:

\underset{x \in \mathbb{R}^d}{\min}~ f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) \coloneqq \frac{1}{n} \sum_{i=1}^n \ell(h_i(x)),

where $\ell : \mathbb{R} \to \mathbb{R}_{\geq 0}$ is a nonnegative, convex, 1-smooth loss function with $\inf \ell = 0$, and the $h_i$'s are Lipschitz continuous.

The interpolation condition: there exists $x^*$ such that

f(x^*) = 0 .

Loss examples: square loss, L2-hinge loss, logistic loss, etc. (smooth); absolute loss, L1-hinge loss (nonsmooth).

Filling the gap

[FFF, ICLR'21]

With a constant learning rate, we prove the following improvements in iteration complexity:

Convex objective:

\mathcal{O}(1/\epsilon^2) \to \mathcal{O}(1/\epsilon).

Strongly convex objective:

\mathcal{O}(1/\epsilon) \to \mathcal{O}(\log(1/\epsilon)).

The above rates match the rates of SGD for smooth objectives.
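The setting can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not the experiment from the paper: it takes $\ell(z) = z^2/2$ and $h_i(x) = \|A_i x - b_i\|_1$ (so each $f_i = \ell \circ h_i$ is convex but nonsmooth), builds data with $b_i = A_i x^*$ so that interpolation holds, and runs the stochastic subgradient method with a constant step $\eta = 1/L^2$, where $L$ bounds the Lipschitz constants of the $h_i$. All function names are hypothetical.

```python
import numpy as np


def make_problem(n=50, k=3, d=10, seed=0):
    """Synthetic interpolation problem: b_i = A_i x_star, hence f(x_star) = 0."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, k, d))
    x_star = rng.standard_normal(d)
    b = A @ x_star  # shape (n, k)
    return A, b, x_star


def objective(A, b, x):
    """f(x) = (1/n) * sum_i 0.5 * h_i(x)^2 with h_i(x) = ||A_i x - b_i||_1."""
    h = np.sum(np.abs(A @ x - b), axis=1)
    return float(np.mean(0.5 * h ** 2))


def safe_step(A):
    """Constant step 1 / L^2, where L bounds the Lipschitz constant of every h_i."""
    # ||A_i^T s||_2 <= sum of row norms of A_i whenever every |s_j| <= 1.
    lipschitz = np.max(np.sum(np.linalg.norm(A, axis=2), axis=1))
    return 1.0 / lipschitz ** 2


def ssgd(A, b, x0, eta, iters, seed=0):
    """Stochastic subgradient method with a constant learning rate."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = A.shape[0]
    for _ in range(iters):
        i = int(rng.integers(n))
        r = A[i] @ x - b[i]
        h = np.sum(np.abs(r))
        # Chain rule: ell'(h) = h, and A_i^T sign(r) is a subgradient of h_i.
        x -= eta * h * (A[i].T @ np.sign(r))
    return x


if __name__ == "__main__":
    A, b, x_star = make_problem()
    x0 = np.zeros(A.shape[2])
    x = ssgd(A, b, x0, safe_step(A), iters=5000)
    print("f(x0) =", objective(A, b, x0), " f(x) =", objective(A, b, x))
```

With this step size, the distance to $x^*$ never increases along any sample path, because each step satisfies $\|x^{+} - x^*\|^2 \leq \|x - x^*\|^2 - \eta\, h_i(x)^2 (2 - \eta L^2)$. The bound $L$ used here is conservative, so larger constant steps may work in practice.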

Lower bounds

[FFF, ICLR'21]

Two follow-up questions:

Can we accelerate the stochastic subgradient method (SSGD) using momentum under interpolation, e.g., to $\mathcal{O}(\epsilon^{-1/2})$?

The structure $f_i = \ell \circ h_i$ sits at the center of our analysis. Could this structure by itself give an improved rate without the interpolation condition?

The answer to both questions is "no". We derive lower bounds on the iteration complexity:

With interpolation condition:

\Omega(1/\epsilon).

Without interpolation condition:

\Omega(1/\epsilon^2).

Markov Logic Network (MLN)

First-order logic rules:

\text{father}(X, Y) \land \text{brother}(Y, Z) \Longrightarrow \text{father}(X, Z);

\text{friend}(X, Y) \land \text{smoke}(X) \Longrightarrow \text{smoke}(Y).

Rule weights shown on the slide, in order: 1.0, 0.4.

Given a set of facts, an MLN defines the probability of a "world" $X$ as

\displaystyle \text{Pr}(X) \propto \exp\left( \sum_{i=1}^m w_i f_i( X ) \right)

where $f_i(X)$ is the number of times that the $i$-th rule is satisfied and $w_i$ is the weight of that rule.
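The probability formula can be worked through on a tiny example. The sketch below is hypothetical: it uses only the smoking rule, assumes its weight is 0.4, invents a two-person domain with fixed friendship facts, and enumerates every assignment of the smoke predicate to normalize the distribution.

```python
import itertools
import math

PEOPLE = ["anna", "bob"]                         # hypothetical domain
FRIENDS = {("anna", "bob"), ("bob", "anna")}     # hypothetical fixed facts
RULE_WEIGHT = 0.4                                # assumed weight of the smoking rule


def count_satisfied(smokes):
    """Number of groundings (x, y) with friend(x, y) and smoke(x) => smoke(y) true."""
    count = 0
    for x, y in itertools.product(PEOPLE, repeat=2):
        premise = (x, y) in FRIENDS and smokes[x]
        if (not premise) or smokes[y]:
            count += 1
    return count


def world_distribution():
    """Return [(world, probability)] with Pr(world) proportional to exp(w * f(world))."""
    scored = []
    for values in itertools.product([False, True], repeat=len(PEOPLE)):
        smokes = dict(zip(PEOPLE, values))
        scored.append((smokes, math.exp(RULE_WEIGHT * count_satisfied(smokes))))
    partition = sum(score for _, score in scored)
    return [(world, score / partition) for world, score in scored]


if __name__ == "__main__":
    for world, prob in world_distribution():
        print(world, round(prob, 4))
```

Worlds in which friends agree on smoking satisfy all four groundings, while worlds in which exactly one friend smokes violate one grounding, so the latter receive a lower probability. Full enumeration is only feasible for tiny domains like this one.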

Computational Optimization and Machine Learning

Education background

Industry experience

Research plan

Overview

Part I: First-order methods

First-order method

The literature

Coordinate Optimization

Coordinate Descent

GCD for Sparse Optimization

GCD for Sparse Optimization

From coordinate to atom

Online Mirror Descent

Online Convex Optimization

Mirror Descent (MD) and Dual Averaging (DA)

OMD Algorithm

Stabilized OMD

(Stochastic) Subgradient Method

Smooth vs. Nonsmooth Minimization

The empirical observation

Filling the gap

Filling the gap

Lower bounds

Part II: Optimization for applications

Knowledge Graph Reasoning

Knowledge graph

Markov Logic Network (MLN)

The previous state of MLN

Key insights

Experiments

Experiments

Follow up work

Computational Optimization and Large Language Model

Optimization matters for pretraining

Training stability of the Adam optimizer

Low precision training

Thank you! Questions?