# Online Convex Optimization With Unbounded Memory

Sarah Dean

ACC Workshop, May 2023

joint work with Raunak Kumar and Bobby Kleinberg

## Motivation: Online Optimal Control

Online interaction

1. Choose an action $$a_t$$ according to policy $$\pi_t$$
2. The state updates according to $$f$$ and $$w_t$$
3. Pay a cost $$c_t(s_t,a_t)$$

Offline (hindsight) control problem

$$\min_{\pi} \sum_{t=1}^T c_t(s_t,a_t) \quad\text{s.t.}\quad s_{t+1}=f(s_t,a_t,w_t),~~a_t=\pi(s_t)$$

Challenges: the loss depends on all past actions, and the decision variable is a function

## Outline

Online Convex Optimization (OCO) with Unbounded Memory is a framework that directly addresses these challenges

1. Problem Setting & Examples
2. Main Results: Regret Bounds
3. Applications & Conclusion

## Problem Setting

Components of an OCO with memory problem

• Decision space $$\mathcal X$$ is a closed, convex subset of a Hilbert space
• History space $$\mathcal H$$ is a Banach space
• Linear operators $$A:\mathcal H\to\mathcal H$$ and $$B:\mathcal X\to \mathcal H$$
• Convex loss functions $$f_t:\mathcal H\to\mathbb R$$

Online interaction protocol

• In round $$t=1,2,...,T$$:
• Learner chooses decision $$x_t\in\mathcal X$$
• History updates as $$h_t = Ah_{t-1}+Bx_t$$
• Learner suffers loss $$f_t(h_t)$$ and observes $$f_t$$
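The protocol above can be sketched numerically; the operators, dimensions, decisions, and loss below are all hypothetical placeholders.

```python
import numpy as np

# Minimal sketch of the interaction protocol with a 1-D decision space
# and a 2-dimensional history (hypothetical choices throughout).
A = np.array([[0.5, 0.0],
              [1.0, 0.0]])   # history transition operator
B = np.array([[1.0],
              [0.0]])        # injects the new decision into the history

h = np.zeros((2, 1))         # h_0 = 0
losses = []
for t in range(5):
    x = np.array([[0.1 * (t + 1)]])   # learner's decision x_t
    h = A @ h + B @ x                 # history update h_t = A h_{t-1} + B x_t
    losses.append(float(np.sum(h)))   # f_t(h_t): a simple linear loss
```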


Example: loss depends arbitrarily on $$m$$ past decisions (Anava et al., 2015)

• History space $$\mathcal H = \mathcal X\times\dots\times\mathcal X = \mathcal X^m$$
• Linear operators $$A=\begin{bmatrix} 0 \\ I & 0 \\ & \ddots & \ddots \\ && I & 0\end{bmatrix},\quad B = \begin{bmatrix} I \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
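Concretely, for scalar decisions the operators are the shift matrix and first-coordinate embedding; a sketch with an arbitrary $$m$$ and decision sequence:

```python
import numpy as np

# Finite-memory construction for scalar decisions (d = 1): A is the
# m-by-m down-shift matrix and B injects x_t into the first slot.
m = 4
A = np.diag(np.ones(m - 1), k=-1)    # ones on the subdiagonal
B = np.zeros((m, 1)); B[0, 0] = 1.0

h = np.zeros((m, 1))
for x in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:
    h = A @ h + B * x                # h_t = A h_{t-1} + B x_t

# After t rounds, h holds the last m decisions, most recent first.
```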



Example: loss depends on all past decisions with a $$\rho$$-discount factor

• History space $$\mathcal H$$ contains length-$$T$$ sequences over $$\mathcal X$$
• Linear operators $$A(x_0, x_1, \dots)=(0, \rho x_0, \rho x_1, \dots ),\quad B x = (x,0,\dots )$$

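A sketch of the discounted history with scalar decisions, truncating the sequence at the horizon so it fits in a vector ($$\rho$$ and $$T$$ below are arbitrary choices):

```python
import numpy as np

# Rho-discounted history: slot k of h holds rho^k x_{t-k}, i.e. the
# decision from k rounds ago, discounted. A shifts and discounts.
rho, T = 0.9, 6
A = rho * np.diag(np.ones(T - 1), k=-1)   # (A h)[k] = rho * h[k-1]
B = np.zeros((T, 1)); B[0, 0] = 1.0

h = np.zeros((T, 1))
for t in range(T):
    h = A @ h + B * 1.0   # play the constant decision x = 1 each round
```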

## Regret Minimization

Goal: perform well compared to the best fixed decision in hindsight

The regret of an algorithm whose decisions result in $$h_1,\dots,h_T$$ is $$R_T(\mathcal A) = \sum_{t=1}^T f_t(h_t) - \min_{x\in\mathcal X} \sum_{t=1}^T \underbrace{ f_t\left(\sum_{k=0}^{t-1} A^k B x\right)}_{\tilde f_t(x)}$$
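The identity inside the comparator term can be checked directly: playing a fixed $$x$$ every round from $$h_0=0$$ produces the history $$\sum_{k=0}^{t-1}A^k B x$$. A sketch with hypothetical finite-memory operators:

```python
import numpy as np

# Verify that the comparator's idealized history sum_{k=0}^{t-1} A^k B x
# equals the history reached by playing x every round (finite memory, d = 1).
m, t, x = 3, 5, 2.0
A = np.diag(np.ones(m - 1), k=-1)
B = np.zeros((m, 1)); B[0, 0] = 1.0

ideal = sum(np.linalg.matrix_power(A, k) @ B * x for k in range(t))

h = np.zeros((m, 1))
for _ in range(t):
    h = A @ h + B * x
```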

## Assumptions & Definitions

Assumptions

1. The learner observes the function $$f_t$$ after each round, knows $$A$$ and $$B$$, and $$\|B\|=1$$
2. The functions $$f_t$$ are differentiable, $$L$$-Lipschitz continuous, and convex
• This implies that $$\tilde f_t = f_t\circ \left(\textstyle\sum_{k=0}^{t-1} A^k B\right)$$ are differentiable, convex, and Lipschitz with $$\tilde L \leq L\sum_{k=0}^\infty \|A^k \|$$

Definition ($$p$$-effective memory capacity): $$\displaystyle H_p = \left( \sum_{k=0}^\infty k^p \|A^k\|^p \right)^{1/p}$$

• Bounds the distance between histories resulting from decision sequences whose distance grows at most linearly with time
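The capacity can be estimated numerically; below, $$\|A^k\|=\rho^k$$ for the $$\rho$$-discounted history, and the truncation length is an arbitrary choice:

```python
import numpy as np

# Numerically estimate the p-effective memory capacity H_p from the
# sequence of operator norms ||A^k|| (series truncated once negligible).
def effective_memory_capacity(op_norms, p=1):
    return sum(k**p * n**p for k, n in enumerate(op_norms)) ** (1.0 / p)

rho = 0.5
norms = [rho**k for k in range(200)]          # ||A^k|| for the discounted case
H1 = effective_memory_capacity(norms, p=1)

# Closed form for comparison: sum_k k rho^k = rho / (1 - rho)^2.
```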

## Example: constant input linear control

$$\min_{a} \sum_{t=1}^T c_t(s_t,a) \quad\text{s.t.}\quad s_{t+1} = Fs_t+Ga + w_t$$

• History combines the "noiseless" state with the action: $$\bar s_{t+1} = F\bar s_t + G a_t,\quad h_t = \begin{bmatrix} \bar s_t \\a_t \end{bmatrix}$$
• Linear operators defined by the dynamics: $$A=\begin{bmatrix} F & 0\\ 0 & 0 \end{bmatrix},\quad B=\begin{bmatrix}G\\ I\end{bmatrix}$$
• Loss functions defined by the cost & disturbances: $$f_t(h_t) = c_t\left (\bar s_t + \sum_{k=1}^t F^{k-1} w_{t-k},\, a_t\right )=c_t(s_t, a_t)$$
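The decomposition of the state into the noiseless part plus accumulated disturbances can be checked on scalar dynamics (the values below are hypothetical):

```python
import numpy as np

# Sanity-check s_T = sbar_T + sum_{k=1}^T F^{k-1} w_{T-k} for scalar
# dynamics s_{t+1} = F s_t + G a + w_t with a fixed action a.
rng = np.random.default_rng(1)
F, G, a, T = 0.8, 1.0, 0.3, 10
w = rng.normal(size=T)

s, sbar = 0.0, 0.0
for t in range(T):
    s = F * s + G * a + w[t]      # true (noisy) state
    sbar = F * sbar + G * a       # noiseless state

noise_sum = sum(F**k * w[T - 1 - k] for k in range(T))
```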

## Main Results

• Theorem: there are algorithms such that the regret of an OCO with unbounded memory problem is at most $$O\left(\sqrt{T}\sqrt{H_p}\sqrt{L\tilde L}\right)$$, a bound governed by the effective memory capacity & Lipschitz constants
• Theorem: there exists an OCO with unbounded memory problem with regret at least $$\Omega \left(\sqrt{T}\sqrt{H_p}\sqrt{L\tilde L}\right)$$

## Upper Bound

Algorithm:   Follow-the-Regularized-Leader (FTRL) on $$\tilde f_t$$

• For $$t=1,\dots,T$$: $$x_{t+1} = \arg\min_{x\in\mathcal X} \sum_{k=1}^t \tilde f_k(x) + \frac{R(x)}{\eta}$$
• Here $$\sum_{k=1}^t \tilde f_k(x)$$ is the total loss from playing $$x$$ every round, $$R$$ is a strongly convex regularizer, and the step size is $$\eta = (T\tilde L(LH_p+\tilde L))^{-1/2}$$
• Lazy variant: update only if $$t \equiv 0 \pmod{\lceil LH_p/\tilde L\rceil}$$; otherwise set $$x_{t+1}=x_t$$
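A minimal FTRL sketch under extra simplifying assumptions (not the general algorithm): linear surrogate losses $$\tilde f_t(x)=g_t x$$ on $$\mathcal X=[-1,1]$$ with quadratic regularizer $$R(x)=x^2/2$$, so the update has a clipped closed form.

```python
import numpy as np

# FTRL with linear losses g_t * x over X = [-1, 1] and R(x) = x^2 / 2:
# x_{t+1} = argmin_x (sum_{k<=t} g_k) x + x^2 / (2 eta)
#         = clip(-eta * sum_{k<=t} g_k, -1, 1).
rng = np.random.default_rng(2)
T = 1000
g = rng.choice([-1.0, 1.0], size=T)   # per-round gradients of the losses
eta = 1.0 / np.sqrt(T)

x, cum_g, loss = 0.0, 0.0, 0.0
for t in range(T):
    loss += g[t] * x                              # suffer loss of x_t
    cum_g += g[t]
    x = float(np.clip(-eta * cum_g, -1.0, 1.0))   # FTRL update

best = -abs(cum_g)        # loss of the best fixed x in hindsight
regret = loss - best      # should scale like sqrt(T)
```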


Proof Sketch. Decompose regret into two terms $$R_T(\mathcal A) = \textstyle \sum_{t=1}^T f_t(h_t) - \sum_{t=1}^T \tilde f_t(x_t) + \sum_{t=1}^T \tilde f_t(x_t) - \min_{x\in\mathcal X} \sum_{t=1}^T \tilde f_t(x)$$

• The second term is bounded by the standard OCO analysis of FTRL: $$O(\eta^{-1} + \eta T\tilde L^2)$$
• The first term is bounded by comparing the actual vs. idealized history: $$O(\eta T L \tilde L H_p)$$

## Lower Bound

The following instance of OCO with finite memory has $$R_T(\mathcal A) \geq \Omega \left(\sqrt{T}\sqrt{H_p}\sqrt{L\tilde L}\right) = \Omega \left(\sqrt{T} m\right)\quad \forall~~\mathcal A$$

Let $$\mathcal X = [-1,1]$$, finite memory $$\mathcal H = \mathcal X^m$$, Rademacher samples $$w_1,\dots, w_{T/m}$$, and $$f_t(h_t) = w_{\lceil\frac{t}{m}\rceil} m^{-1/2} (x_{t-m+1} + \dots + x_{m\lfloor\frac{t}{m}\rfloor + 1})$$, where $$\lceil t/m\rceil$$ is the time of the sample and the sum reaches decisions up to $$m$$ steps in the past

(Figure: the horizon is divided into blocks of $$m$$ rounds; the losses $$f_t$$ and histories $$h_t$$ within block $$i$$ share the sample $$w_i$$.)
## Application: Online Linear Control

$$\min_{K} \sum_{t=1}^T c_t(s_t,a_t) \quad\text{s.t.}\quad s_{t+1} = Fs_t+Ga_t + w_t ,~~a_t=Ks_t$$

• Convex lifting to disturbance-action controllers (Youla et al., 1976; Anderson et al., 2019; Agarwal et al., 2019): $$\textstyle a_t = \sum_{k=1}^{t}M^{[k]} w_{t-k},\quad X_t = (M^{[k]})_{k\in[t]}$$
• History contains (weighted) sequences of controllers: $$H_t = (X_t, GX_{t-1}, FGX_{t-2},F^2 GX_{t-3},\dots )$$

(Figure: block diagram of the dynamics $$(F,G)$$ with controller $$K$$, states $$\bf s$$, actions $$\bf a$$, and disturbances $$\bf w$$; in the lifted view over decisions $$X$$ and histories $$H$$, the system looks like a line.)

• Linear operators defined by the linear dynamics: $$A\left((Y_0,Y_1,\dots)\right) = (0, GY_0, F Y_1,\dots),\quad BX=(X,0,\dots)$$
• States & actions are linear in history & decisions, so loss functions are defined by cost & disturbance $$f_t(H_t) = c_t\big ( \underbrace{\langle H_t, w_{1:t} \rangle}_{(s_t,a_t)}\big)=c_t(s_t, a_t)$$
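The linearity claim can be sanity-checked in one dimension; the controller weights, dynamics, and horizon below are hypothetical:

```python
import numpy as np

# One-dimensional sketch of a disturbance-action controller: the action
# a_t = sum_k M[k-1] * w_{t-k} is linear in past disturbances, so the
# state s_t is linear in the disturbance sequence w as well.
def rollout(M, w, F=0.7, G=1.0):
    s = 0.0
    for t in range(len(w)):
        a = sum(M[k - 1] * w[t - k]
                for k in range(1, len(M) + 1) if t - k >= 0)
        s = F * s + G * a + w[t]
    return s

rng = np.random.default_rng(4)
M = np.array([0.4, 0.2, 0.1])   # disturbance-action weights M^{[k]}
w = rng.normal(size=12)
s1 = rollout(M, w)
```

Doubling the disturbances doubles the state, confirming that $$(s_t, a_t)$$ is linear in $$w$$ for fixed controller weights.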

$$\bf s$$

$$\bf a$$

$$\bf w$$

$$X_t$$

$$H$$

• Approach: translate the usual control assumptions on dynamics, controllers, and costs into the quantities $$H_2$$, $$L$$, and $$\tilde L$$
• No truncation analysis is necessary
• Improves existing upper bounds (Agarwal et al., 2019) by a factor of the dimension and the stability radius

## Conclusion & Discussion

• Key takeaways for OCO with Unbounded Memory
1. General framework which captures online control
2. Matching upper & (worst case) lower regret bounds highlight fundamental quantity: effective memory capacity
• Open directions for future work
1. Unknown dynamics $$A$$ and $$B$$
2. "Bandit" feedback of $$f_t(h_t)$$
3. Nonlinearly evolving history
4. Nonconvex optimization

## Thank you!

Online Convex Optimization with Unbounded Memory

https://arxiv.org/abs/2210.09903

Raunak Kumar    Sarah Dean    Robert Kleinberg

## Questions?

References:

• Agarwal, Bullins, Hazan, Kakade, Singh. "Online control with adversarial disturbances." ICML, 2019.
• Anava, Hazan, Mannor. "Online learning for adversaries with memory: Price of past mistakes." NeurIPS, 2015.
• Anderson, Doyle, Low, Matni. "System level synthesis." Annual Reviews in Control, 2019.
• Youla, Jabr, Bongiorno. "Modern Wiener-Hopf design of optimal controllers--Part II: The multivariable case." IEEE Transactions on Automatic Control, 1976.
