Online Convex Optimization With Unbounded Memory

Sarah Dean

ACC Workshop, May 2023

joint work with Raunak Kumar and Bobby Kleinberg

Motivation: Online Optimal Control

Online interaction

  1. Choose an action \(a_t\) according to policy \(\pi_t\)
  2. State updates according to \(f\) and \(w_t\)
  3. Pay a cost \(c_t(s_t,a_t)\)

Offline (hindsight) control problem

$$ \min_{\pi} \sum_{t=1}^T c_t(s_t,a_t) \quad\text{s.t.}\quad s_{t+1}=f(s_t,a_t,w_t),~~a_t=\pi(s_t) $$

Challenges

  • loss depends on all past actions
  • decision variable is a function

Outline

Online Convex Optimization (OCO) with Unbounded Memory is a framework which directly addresses these challenges

 

  1. Problem Setting & Examples
  2. Main Results: Regret Bounds
  3. Applications & Conclusion

Problem Setting

Components of an OCO with unbounded memory problem

  • Decision space \(\mathcal X\) is a closed, convex subset of a Hilbert space
  • History space \(\mathcal H\) is a Banach space
  • Linear operators \(A:\mathcal H\to\mathcal H\) and \(B:\mathcal X\to \mathcal H\)
  • Convex loss functions \(f_t:\mathcal H\to\mathbb R\)

Online interaction protocol

  • In round \(t=1,2,\dots,T\):
    • Learner chooses decision \(x_t\in\mathcal X\)
    • History updates as \(h_t = Ah_{t-1}+Bx_t\)
    • Learner suffers loss \(f_t(h_t)\) and observes \(f_t\)
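The interaction protocol can be sketched in a few lines for the finite-dimensional case. This is a minimal illustration, not the authors' code: the function name `run_protocol`, the zero initial history \(h_0 = 0\), and the toy operators and losses are all assumptions, and the decisions are replayed from a fixed list rather than chosen by a learner.

```python
import numpy as np

def run_protocol(A, B, decisions, losses):
    """Play the given decisions and return the histories and per-round losses."""
    h = np.zeros(A.shape[0])      # assumption: initial history h_0 = 0
    histories, costs = [], []
    for x_t, f_t in zip(decisions, losses):
        h = A @ h + B @ x_t       # history update h_t = A h_{t-1} + B x_t
        histories.append(h.copy())
        costs.append(float(f_t(h)))  # learner suffers f_t(h_t)
    return histories, costs

# Toy instance (hypothetical): scalar decisions, history decays by 1/2 each round.
A = np.array([[0.5]])
B = np.array([[1.0]])
decisions = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
losses = [lambda h: h[0] ** 2] * 3  # the same convex quadratic loss in every round
histories, costs = run_protocol(A, B, decisions, losses)
print(costs)
```

The loss in each round is evaluated on the history, so a decision made early keeps affecting later costs through \(A\).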

Example: loss depends arbitrarily on \(m\) past decisions (Anava et al., 2015)

  • History space \(\mathcal H = \mathcal X\times\dots\times\mathcal X = \mathcal X^m\)
  • Linear operators \(A=\begin{bmatrix} 0 \\ I & 0 \\ & \ddots & \ddots \\ && I & 0 \end{bmatrix},\quad B = \begin{bmatrix} I \\ 0 \\ \vdots \\ 0 \end{bmatrix}\), so that \(h_t = (x_t, x_{t-1}, \dots, x_{t-m+1})\)
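As a quick check of the finite-memory example, the sketch below builds the shift operator \(A\) and the injection \(B\) for \(\mathcal X = \mathbb R^d\), with histories stacked as vectors in \(\mathbb R^{md}\). The helper name `finite_memory_operators` and the zero initial history are assumptions made for illustration.

```python
import numpy as np

def finite_memory_operators(d, m):
    """Return (A, B) for H = X^m with X = R^d, stacked as vectors in R^(m*d)."""
    A = np.kron(np.eye(m, k=-1), np.eye(d))   # block subdiagonal: shifts each block down by one
    B = np.kron(np.eye(m)[:, :1], np.eye(d))  # places the new decision in the first block
    return A, B

A, B = finite_memory_operators(d=1, m=3)
h = np.zeros(3)                               # assumption: h_0 = 0
for x in [1.0, 2.0, 3.0, 4.0]:
    h = A @ h + B @ np.array([x])             # h_t = A h_{t-1} + B x_t
print(h)                                      # the last m = 3 decisions, newest first
```

After four rounds the oldest decision has been shifted out, so the history holds only the three most recent decisions.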

Example: loss depends on all past decisions with \(\rho\)-discount factor

  • History space \(\mathcal H\) contains length-\(T\) sequences over \(\mathcal X\)
  • Linear operators $$A(x_0, x_1, \dots)=(0, \rho x_0, \rho x_1, \dots ),\quad B x = (x,0,\dots )$$ so that \(h_t = (x_t, \rho x_{t-1}, \rho^2 x_{t-2}, \dots)\)
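The discounted example can be checked the same way. The sketch below assumes scalar decisions, a sequence space truncated to length \(T\), and a zero initial history; the helper name `discounted_operators` is hypothetical.

```python
import numpy as np

def discounted_operators(T, rho):
    """Return (A, B) acting on length-T sequences of scalar decisions."""
    A = rho * np.eye(T, k=-1)   # (x_0, x_1, ...) -> (0, rho x_0, rho x_1, ...)
    B = np.eye(T)[:, :1]        # x -> (x, 0, ..., 0)
    return A, B

A, B = discounted_operators(T=4, rho=0.5)
h = np.zeros(4)                 # assumption: h_0 = 0
for x in [1.0, 2.0, 3.0]:
    h = A @ h + B @ np.array([x])
print(h)                        # (x_3, rho x_2, rho^2 x_1, 0)
```

Each past decision stays in the history forever but is scaled down by another factor of \(\rho\) every round.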

Regret Minimization

Goal: perform well compared to the best fixed decision in hindsight

The regret of an algorithm \(\mathcal A\) whose decisions result in histories \(h_1,\dots,h_T\) is $$ R_T(\mathcal A) = \sum_{t=1}^T f_t(h_t) - \min_{x\in\mathcal X} \sum_{t=1}^T \underbrace{ f_t\left(\sum_{k=0}^{t-1} A^k B x\right)}_{\tilde f_t(x)}$$
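The comparator plays one fixed decision \(x\) in every round. Unrolling \(h_t = Ah_{t-1}+Bx_t\) from \(h_0 = 0\) (an assumption, since the initial history is not stated here) with \(x_t = x\) gives the idealized history \(\sum_{k=0}^{t-1} A^k B x\). The sketch below checks this numerically on a small hypothetical instance; `idealized_history` is an illustrative name.

```python
import numpy as np

def idealized_history(A, B, x, t):
    """Return sum_{k=0}^{t-1} A^k B x, the history after playing x for t rounds."""
    h = np.zeros(A.shape[0])
    term = B @ x                  # A^0 B x
    for _ in range(t):
        h = h + term
        term = A @ term           # next power: A^(k+1) B x
    return h

# Hypothetical operators and decision, chosen only for the check.
A = np.array([[0.5, 0.1], [0.0, 0.3]])
B = np.array([[1.0], [1.0]])
x = np.array([2.0])

h = np.zeros(2)                   # run the protocol with the constant decision x
for _ in range(5):
    h = A @ h + B @ x
gap = np.linalg.norm(h - idealized_history(A, B, x, 5))
print(gap)
```

The gap is zero up to floating-point error, so \(\tilde f_t(x)\) is the loss the learner would have suffered in round \(t\) by playing \(x\) throughout.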

Assumptions

  1. Learner observes the function \(f_t\) after each round, knows \(A\) and \(B\), and \(\|B\|=1\)
  2. Functions \(f_t\) are differentiable, \(L\)-Lipschitz continuous, and convex
    • implies that \(\tilde f_t = f_t\circ  \sum_{k=0}^{t-1} A^k B \) are differentiable, convex, and Lipschitz with constant \(\tilde L \leq L\sum_{k=0}^\infty \|A^k \|\)

Assumptions & Definitions

Definition (\(p\)-effective memory capacity): \(\displaystyle H_p = \left( \sum_{k=0}^\infty k^p \|A^k\|^p \right)^{1/p}\)

  • Bounds the distance between histories resulting from decision sequences whose distance grows at most linearly with time
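For matrices, \(H_p\) can be approximated by summing the series directly. The sketch below is illustrative only: it assumes the operator norm is the spectral norm, truncates the series once \(\|A^k\|\) drops below a tolerance, and uses the hypothetical name `effective_memory_capacity`.

```python
import numpy as np

def effective_memory_capacity(A, p=2, max_k=1000, tol=1e-12):
    """Approximate H_p = (sum_k k^p ||A^k||^p)^(1/p) using the spectral norm."""
    total, power = 0.0, np.eye(A.shape[0])
    for k in range(1, max_k + 1):        # the k = 0 term contributes zero
        power = power @ A                # power now holds A^k
        norm = np.linalg.norm(power, 2)  # largest singular value of A^k
        if norm < tol:                   # assumption: later terms are negligible
            break
        total += (k * norm) ** p
    return total ** (1.0 / p)

# Finite memory with m = 3: ||A^k|| = 1 for k < 3 and A^3 = 0.
H_finite = effective_memory_capacity(np.eye(3, k=-1), p=2)
# Discount factor rho = 0.5, where ||A^k|| = rho^k; a 1x1 matrix has the same norms.
H_discount = effective_memory_capacity(np.array([[0.5]]), p=1)
print(H_finite, H_discount)
```

In the finite-memory case the series stops at \(k = m-1\), giving \(H_2 = \sqrt{1^2 + 2^2}\) for \(m=3\); in the discounted case it converges to \(\rho/(1-\rho)^2\) for \(p=1\).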

Example: constant input linear control

$$\min_{a} \sum_{t=1}^T c_t(s_t,a) \quad\text{s.t.}\quad s_{t+1} = Fs_t+Ga + w_t $$

  • History combines "noiseless" state with action $$\bar s_{t+1} = F\bar s_t + G a_t,\quad h_t = \begin{bmatrix} \bar s_t \\a_t \end{bmatrix}$$
  • Linear operators defined by dynamics $$A=\begin{bmatrix} F & G\\ 0 & 0 \end{bmatrix},\quad B=\begin{bmatrix}0\\ I\end{bmatrix}$$
  • Loss functions defined by cost & disturbance (taking \(\bar s_1 = s_1\)) $$f_t(h_t) = c_t\left (\bar s_t + \sum_{k=1}^{t-1} F^{k-1} w_{t-k}, a_t\right )=c_t(s_t, a_t)$$
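The reduction can be sanity-checked numerically. The sketch below assumes one consistent indexing, which is a choice made here and not confirmed by the slides: \(h_t = (\bar s_t, a_t)\) with \(A = \begin{bmatrix} F & G \\ 0 & 0\end{bmatrix}\), \(B = \begin{bmatrix} 0 \\ I \end{bmatrix}\), and \(\bar s_1 = s_1 = 0\). The matrices and random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 2, 1, 6
F = np.array([[0.9, 0.2], [0.0, 0.5]])
G = np.array([[0.0], [1.0]])
w = rng.normal(size=(T + 1, n))   # w[t] is the disturbance w_t (index 0 unused)
a = rng.normal(size=(T + 1, d))   # a[t] is the action a_t (index 0 unused)

A = np.block([[F, G], [np.zeros((d, n)), np.zeros((d, d))]])
B = np.vstack([np.zeros((n, d)), np.eye(d)])

s = np.zeros(n)                   # assumption: true state starts at s_1 = 0
h = np.zeros(n + d)               # assumption: h_0 = 0
max_err = 0.0
for t in range(1, T + 1):
    h = A @ h + B @ a[t]          # h_t = (s_bar_t, a_t)
    s_bar = h[:n]
    # Accumulated effect of past disturbances: sum_{k=1}^{t-1} F^(k-1) w_{t-k}
    noise = sum(np.linalg.matrix_power(F, k - 1) @ w[t - k] for k in range(1, t))
    max_err = max(max_err, float(np.linalg.norm(s_bar + noise - s)))
    s = F @ s + G @ a[t] + w[t]   # true dynamics s_{t+1} = F s_t + G a_t + w_t
print(max_err)
```

The reconstructed state matches the true state in every round, so the disturbances can be absorbed into the loss \(f_t\) while the history stays linear in the actions.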

Main Results

  • Theorem (upper bound): there are algorithms such that the regret of an OCO with unbounded memory problem is at most $$O\left(\sqrt{T}\sqrt{H_p}\sqrt{L\tilde L}\right) $$
    • the bound depends on the effective memory capacity & Lipschitz constants
  • Theorem (lower bound): there exists an OCO with unbounded memory problem on which every algorithm incurs regret at least $$\Omega \left(\sqrt{T}\sqrt{H_p}\sqrt{L\tilde L}\right) $$

Upper Bound

Algorithm:   Follow-the-Regularized-Leader on \(\tilde f_t\)

  • For \(t=1,\dots,T\) $$x_{t+1} = \arg\min_{x\in\mathcal X} \sum_{k=1}^t \tilde f_k(x) + \frac{R(x)}{\eta} $$
    • \(\sum_{k=1}^t \tilde f_k(x)\): total loss from playing \(x\) every round
    • \(R\): strongly convex regularizer
    • step size \(\eta = (T\tilde L(LH_p+\tilde L))^{-1/2} \)
  • Lazy variant: apply the update only if \(t \bmod \frac{LH_p}{\tilde L}=0\), otherwise \(x_{t+1}=x_t\)
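A minimal sketch of the update, under simplifying assumptions that are not from the slides: the losses \(\tilde f_t\) are linear with gradients \(g_t\), the regularizer is \(R(x) = \tfrac12\|x\|^2\), and \(\mathcal X\) is a Euclidean ball, so the minimizer has a closed form (a projection of \(-\eta\sum_k g_k\)). The `batch` parameter imitates the lazy variant; the function name `ftrl_linear` is hypothetical.

```python
import numpy as np

def ftrl_linear(gradients, eta, radius=1.0, batch=1):
    """FTRL iterates for linear losses <g_t, x> with R(x) = 0.5 ||x||^2 on a ball.

    With batch > 1 the decision is recomputed only when t % batch == 0.
    """
    d = len(gradients[0])
    x = np.zeros(d)                # x_1 minimizes the regularizer alone
    g_sum = np.zeros(d)
    xs = []
    for t, g in enumerate(gradients, start=1):
        xs.append(x.copy())        # play x_t, then observe the loss (here g_t)
        g_sum = g_sum + g
        if t % batch == 0:
            y = -eta * g_sum       # unconstrained minimizer of <g_sum, x> + R(x)/eta
            norm = np.linalg.norm(y)
            x = y if norm <= radius else y * (radius / norm)  # project onto the ball
    return xs

grads = [np.array([1.0]), np.array([1.0]), np.array([-4.0])]
eager = [float(x[0]) for x in ftrl_linear(grads, eta=0.5)]
lazy = [float(x[0]) for x in ftrl_linear(grads, eta=0.5, batch=2)]
print(eager, lazy)
```

The lazy run keeps \(x_2 = x_1\) and only moves after the second round, which reduces how often the decision changes at the cost of reacting later.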

Proof Sketch. Decompose regret into two terms $$R_T(\mathcal A) = \textstyle \sum_{t=1}^T f_t(h_t) - \sum_{t=1}^T \tilde f_t(x_t) + \sum_{t=1}^T \tilde f_t(x_t) - \min_{x\in\mathcal X} \sum_{t=1}^T \tilde f_t(x)$$

  • First difference (actual vs. idealized history): of order \(\eta T L \tilde L H_p\)
  • Second difference (standard OCO regret of FTRL): of order \(\eta^{-1} + \eta T\tilde L^2\)
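To see where the step size comes from, the two terms can be added and balanced. This is a sketch of the arithmetic only; it ignores the constants the slide suppresses (such as the range of the regularizer).

$$ \eta^{-1} + \eta T\tilde L^2 + \eta T L\tilde L H_p = \eta^{-1} + \eta\, T\tilde L\,(LH_p + \tilde L) $$

The right-hand side is minimized at \(\eta = (T\tilde L(LH_p+\tilde L))^{-1/2}\), where both terms are equal:

$$ \eta^{-1} + \eta\, T\tilde L\,(LH_p + \tilde L) = 2\sqrt{T\tilde L\,(LH_p + \tilde L)} $$

If in addition \(\tilde L \leq LH_p\), this is at most \(2\sqrt{2}\,\sqrt{T}\sqrt{H_p}\sqrt{L\tilde L}\), which matches the form of the upper bound.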

Lower Bound

The following instance of OCO with finite memory has $$R_T(\mathcal A) \geq \Omega \left(\sqrt{T}\sqrt{H_p}\sqrt{L\tilde L}\right) = \Omega \left(\sqrt{T} m\right)\quad \forall~~\mathcal A$$

Let \(\mathcal X = [-1,1]\), finite memory \(\mathcal H = \mathcal X^m\), Rademacher samples \(w_1,\dots, w_{\frac{T}{m}}\), and $$ f_t(h_t) = w_{\lceil\frac{t}{m}\rceil} m^{-1/2} (x_{t-m+1} + \dots + x_{m\lfloor\frac{t}{m}\rfloor + 1})$$

  • first index \(t-m+1\): \(m\) steps in the past
  • last index \(m\lfloor\frac{t}{m}\rfloor + 1\): time of sample

[Figure: timeline over \(t\) split into blocks of length \(m\), one sample per block (\(w_{i-1}\), \(w_i\), \(w_{i+1}\)), showing which decisions enter \(h_t\) and \(f_t\)]

Application: Online Linear Control

$$\min_{K} \sum_{t=1}^T c_t(s_t,a_t) \quad\text{s.t.}\quad s_{t+1} = Fs_t+Ga_t + w_t ,~~a_t=Ks_t$$

  • Convex lifting to disturbance-action controllers (Youla et al., 1976; Anderson et al., 2019; Agarwal et al., 2019) $$ \textstyle a_t = \sum_{k=1}^{t+1}M^{[k]} w_{t-k},\quad X_t = (M^{[k]})_{k\in[t]}$$
  • History contains (weighted) sequences of controllers $$H_t = (X_t, GX_{t-1}, FGX_{t-2},F^2 GX_{t-3},\dots )$$

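[Figure: block diagrams of the closed loop between the system \((F,G)\) and the controller \(K\) with signals \(\bf s\), \(\bf a\), \(\bf w\), and of the lifted system driven by \(\bf w\) through \(X\) and \(H\). Instead of a loop, the system looks like a line.]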

  • Linear operators defined by linear dynamics $$A\left((Y_0,Y_1,Y_2,\dots)\right) = (0, GY_0, F Y_1, F Y_2,\dots),\quad BX=(X,0,\dots)$$
  • States & actions are linear in history & decisions, so loss functions are defined by cost & disturbance $$f_t(H_t) = c_t\big ( \underbrace{\langle  H_t, w_{1:t} \rangle}_{(s_t,a_t)}\big)=c_t(s_t, a_t)$$


  • Approach: translate usual control assumptions on dynamics, controllers, and costs into the quantities \(H_2\), \(L\), and \(\tilde L\)
  • No truncation analysis is necessary
  • Improves existing upper bounds (Agarwal et al., 2019) by a factor of dimension and stability radius
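The point of the convex lifting is that, for a fixed disturbance sequence, states depend affinely on the disturbance-action parameters \(M^{[k]}\), whereas under \(a_t = Ks_t\) they depend on \(K\) through powers of \(F+GK\). The sketch below checks the affine dependence numerically. It is an illustration under assumptions: a finite number of lags `n_lags`, zero initial state, terms with \(t-k<0\) dropped, and hypothetical matrices; `rollout` is not from the paper.

```python
import numpy as np

def rollout(F, G, M, w):
    """Simulate s_{t+1} = F s_t + G a_t + w_t with a_t = sum_k M[k-1] @ w_{t-k}.

    M: (n_lags, d, n) array holding M^[1], ..., M^[n_lags].
    w: (T, n) array of disturbances; terms with t - k < 0 are dropped.
    """
    T, n = w.shape
    s = np.zeros(n)                       # assumption: zero initial state
    states = []
    for t in range(T):
        a = np.zeros(G.shape[1])
        for k in range(1, min(len(M), t) + 1):
            a = a + M[k - 1] @ w[t - k]   # action is linear in past disturbances
        states.append(s.copy())
        s = F @ s + G @ a + w[t]
    return np.array(states)

rng = np.random.default_rng(0)
F = np.array([[0.9, 0.1], [0.0, 0.8]])
G = np.array([[0.0], [1.0]])
w = rng.normal(size=(8, 2))
M1 = rng.normal(size=(3, 1, 2))
M2 = rng.normal(size=(3, 1, 2))
lam = 0.3
mixed = rollout(F, G, lam * M1 + (1 - lam) * M2, w)
combo = lam * rollout(F, G, M1, w) + (1 - lam) * rollout(F, G, M2, w)
gap = float(np.max(np.abs(mixed - combo)))
print(gap)
```

Because the trajectory of a convex combination of controllers equals the convex combination of trajectories, a convex cost \(c_t\) stays convex in the decision variable.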

Conclusion & Discussion

  • Key takeaways for OCO with Unbounded Memory
    1. General framework which captures online control
    2. Matching upper & (worst-case) lower regret bounds highlight a fundamental quantity: effective memory capacity
  • Open directions for future work
    1. Unknown dynamics \(A\) and \(B\)
    2. "Bandit" feedback of \(f_t(h_t)\)
    3. Nonlinearly evolving history
    4. Nonconvex optimization

Thank you!

Online Convex Optimization with Unbounded Memory

https://arxiv.org/abs/2210.09903

Raunak Kumar    Sarah Dean    Robert Kleinberg

Questions?

References:

  • Agarwal, Bullins, Hazan, Kakade, Singh. "Online control with adversarial disturbances." ICML, 2019.
  • Anava, Hazan, Mannor. "Online learning for adversaries with memory: Price of past mistakes." NeurIPS, 2015.
  • Anderson, Doyle, Low, Matni. "System level synthesis." Annual Reviews in Control, 2019.
  • Youla, Jabr, Bongiorno. "Modern Wiener-Hopf design of optimal controllers--Part II: The multivariable case." IEEE Transactions on Automatic Control, 1976.