Prof Sarah Dean

## Reminders/etc

• Project midterm update due 11/11
• Scribing feedback to come soon!
• Upcoming paper presentations starting next week
• Participation includes attending presentations

## Reporting back on NCCR Symposium

Symposium on Socially Responsible Automation, hosted by NCCR Automation at EPFL

• talks on responsibility/fairness in cyberphysical systems (power grid, dynamic pricing), recommendation systems, repeated resource allocation via "karma economy", and vulnerability-aware AV rules
• interesting idea: "reparative fairness"
• $$\mathbb E[\text{future cost}\mid\text{past cost}]\propto -\text{past cost}$$

## Action in a dynamic world

Goal: select actions $$a_t$$ to bring environment to low-cost states

(Diagram: the policy $$\pi_t:\mathcal S\to\mathcal A$$ maps the observation $$s_t$$ to an action $$a_t$$, the environment $$F$$ updates the state $$s$$, and the data $$\{(s_t, a_t, c_t)\}$$ accumulate.)

## Recap: Optimal Control

Stochastic Infinite Horizon Optimal Control Problem

$$\min_{\pi} ~~\underbrace{\lim_{T\to\infty} \mathbb E_w\Big[\frac{1}{T}\sum_{k=0}^{T} c(s_k, \pi(s_k)) \Big]}_{J^\pi(s_0)}\quad \text{s.t.}\quad s_0~~\text{given},~~s_{k+1} = F(s_k, \pi(s_k),w_k)$$

Bellman Optimality Equation

• $$\underbrace{J^\star (s)}_{\text{value function}} = \min_{a\in\mathcal A} \underbrace{c(s, a)+\mathbb E_w[J^\star (F(s,a,w))]}_{\text{state-action function}}$$

• Minimizing argument is $$\pi^\star(s)$$

## Recap: LQR with known dynamics

• Goal: minimize quadratic cost ($$Q,R$$) in a system with linear dynamics ($$A,B$$)
• Classic approach: Dynamic programming/Bellman optimality
• $$P = \mathrm{DARE}(A,B,Q,R)$$ and $$K_\star = -(R+B^\top PB)^{-1}B^\top PA$$
• System level synthesis: Convex optimization
• $$\underset{\mathbf{\Phi}}{\min}~\left\| \begin{bmatrix}Q^{1/2} &\\& R^{1/2}\end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_s \\ \mathbf{\Phi}_a \end{bmatrix} \right\|_{\mathcal H_2}^2~~\text{s.t.}~~ \begin{bmatrix} zI - A & - B\end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_s \\ \mathbf{\Phi}_a \end{bmatrix}= I$$
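As a rough illustration of the dynamic programming route, the sketch below iterates the Riccati recursion until it reaches a fixed point of the DARE and then forms the gain $$K_\star$$. The matrices `A`, `B`, `Q`, `R` and the helper `riccati_iteration` are hypothetical choices for this example, not part of the lecture.

```python
import numpy as np

# Hypothetical system and cost matrices, chosen only for illustration.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)


def riccati_iteration(A, B, Q, R, iters=500):
    """Iterate P <- Q + A'PA - A'PB (R + B'PB)^{-1} B'PA, starting from P = Q."""
    P = Q.copy()
    for _ in range(iters):
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ G
    return P


P = riccati_iteration(A, B, Q, R)

# Optimal gain K = -(R + B'PB)^{-1} B'PA, so that a_t = K s_t.
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Sanity checks: P should satisfy the DARE and A + BK should be stable.
G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
dare_residual = np.linalg.norm(Q + A.T @ P @ A - A.T @ P @ B @ G - P)
spectral_radius = max(abs(np.linalg.eigvals(A + B @ K)))
```

A library DARE solver would normally replace the hand-written loop; the iteration is used here only to keep the example self-contained.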

Setting: have data $$\{s_k, a_k, c_k\}_{k=0}^N$$. Approaches include a focus on:

1. Model: learn dynamics (and costs) from data, then do policy design
• For LQR: estimate $$\hat A,\hat B$$ ($$\hat Q,\hat R$$) then design $$\hat K$$
• "model based"
2. Bellman: learn value or state-action function
• For LQR: estimate $$\hat J$$ then determine $$\hat K$$ as $$\arg\min$$
• "model free"
3. Policy: estimate gradients and update policy directly
• For LQR: $$\hat K \leftarrow \hat K -\alpha\widehat{\nabla J}(\hat K)$$
• "model free"

## Recap: unknown dynamics

1. Learn Model:
• estimate $$\hat A,\hat B$$ via least-squares, guarantee $$\max\{\varepsilon_A, \varepsilon_B\}\lesssim \sqrt{\frac{m+n}{N}}$$
2. Design Policy:
• robust approach uses $$\hat A, \hat B, \varepsilon_A, \varepsilon_B$$
• $$J(\hat{\mathbf \Phi}) - J(\mathbf \Phi_\star) \leq \epsilon$$ for $$N\gtrsim \frac{(m+n)^2}{\epsilon^2}$$

• nominal or certainty equivalent approach uses $$\hat A, \hat B$$
• for small enough $$\varepsilon$$, can show that $$J(\hat{\mathbf \Phi}) - J(\mathbf \Phi_\star) \lesssim \varepsilon^2$$
• thus faster rate, $$J(\hat{\mathbf \Phi}) - J(\mathbf \Phi_\star) \leq \epsilon$$ for $$N\gtrsim \frac{m+n}{\epsilon}$$
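The "learn model" step can be sketched as an ordinary least-squares fit of $$[\hat A~~\hat B]$$ to one trajectory excited by random actions. The system matrices, noise level, and trajectory length below are assumptions made for the example; the errors computed at the end play the role of $$\varepsilon_A,\varepsilon_B$$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true system (unknown to the learner) and noise level.
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
n, m = B.shape
N = 2000
sigma_w = 0.1

# Collect one trajectory driven by random actions.
S = np.zeros((N + 1, n))
U = rng.normal(size=(N, m))
for k in range(N):
    w = sigma_w * rng.normal(size=n)
    S[k + 1] = A @ S[k] + B @ U[k] + w

# Least squares: s_{k+1} ~ [A B] [s_k; a_k].
Z = np.hstack([S[:-1], U])                          # N x (n + m) regressors
Theta, *_ = np.linalg.lstsq(Z, S[1:], rcond=None)   # (n + m) x n
A_hat, B_hat = Theta.T[:, :n], Theta.T[:, n:]

# Operator-norm estimation errors.
eps_A = np.linalg.norm(A_hat - A, 2)
eps_B = np.linalg.norm(B_hat - B, 2)
```

A certainty-equivalent design would then plug `A_hat`, `B_hat` into the known-dynamics LQR solution, while a robust design would also use `eps_A`, `eps_B`.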

## Model-based LQR

Approximate Policy Iteration [KTR19]

• estimate quadratic state-action function with LSTD $$q^\top f + q^\top \phi(s_t, a_t) = c(s_t,a_t)+\mathbb E_{w_t}[q^\top \phi(s_{t+1}, \pi(s_{t+1}))]$$
• update policy as $$\hat Ks = \arg\min_a \hat q^\top \phi(s, a) = \arg\min_a \begin{bmatrix}s\\ a\end{bmatrix}^\top \mathrm{mat}(\hat q) \begin{bmatrix}s\\ a\end{bmatrix}$$
• guarantee $$\|\hat K - K_\star \|_2 \leq \epsilon$$ for $$NT \gtrsim \frac{(m+n)^3}{\epsilon^2}$$

## Model-free LQR

• estimate $$\widehat{\nabla J}(K)$$ with finite differencing
• update policy as $$K \leftarrow K -\alpha\widehat{\nabla J}( K)$$
• guarantee $$J(\hat K) - J(K_\star ) \leq\epsilon$$ for $$N\gtrsim \mathrm{poly}(n,m,1/\epsilon)$$
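The two steps above can be sketched on a scalar system. This is a simplified, noise-free illustration: the cost is evaluated by a deterministic rollout from a fixed initial state, and the gradient is a central finite difference. The dynamics `a_dyn`, `b_dyn`, the cost weights, the step size, and the initial stabilizing gain are all assumptions for the example.

```python
def rollout_cost(K, a_dyn=1.2, b_dyn=1.0, q=1.0, r=1.0, s0=1.0, horizon=200):
    """Total quadratic cost of the policy a_t = K s_t on s_{t+1} = a_dyn s_t + b_dyn a_t."""
    s, total = s0, 0.0
    for _ in range(horizon):
        action = K * s
        total += q * s**2 + r * action**2
        s = a_dyn * s + b_dyn * action
    return total


def finite_difference_gradient(K, delta=1e-3):
    """Central-difference estimate of dJ/dK from two rollouts."""
    return (rollout_cost(K + delta) - rollout_cost(K - delta)) / (2 * delta)


K = -1.0       # initial gain; must be stabilizing (here a_dyn + b_dyn * K = 0.2)
alpha = 0.05   # step size
for _ in range(200):
    K = K - alpha * finite_difference_gradient(K)
```

With stochastic rollouts and matrix-valued gains, the finite difference would be taken along random directions and averaged over many rollouts, which is where the polynomial sample complexity enters.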

## Dynamic Performative Optimality

• Learner chooses $$\theta_t$$
• Population reacts as $$\rho_{t} = \mathcal D(\theta_t, \rho_{t-1})$$
• For fixed $$\theta$$, the limiting distribution $$\rho_\star(\theta) = \lim_{t\to\infty} \rho_t$$
• Goal for learner: minimize $$\mathcal L^\star(\theta) = \mathbb E_{z\sim\rho_\star(\theta)} [\ell(z, \theta)]$$
• Similar to goal of minimizing $$\lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^T \mathbb E_{z\sim\rho_t} [\ell(z, \theta)]$$
• estimate parameters of $$\rho_t$$ and parametric model of $$\partial_t \mathcal D$$, use to estimate gradient of $$\mathcal L^\star$$

Is low cost all we want?

• In the setting of performative prediction, we may learn that homogeneous populations are easier to make predictions about
• recall retention dynamics of Hashimoto et al
• Controlling autonomous vehicles or robots may involve avoiding obstacles or staying on the road

## Motivation: Safety

A trajectory of states $$(s_0,\dots,s_t)$$ is safe if $$s_k\in\mathcal S_\mathrm{safe}$$ for all $$0\leq k\leq t$$.

## Safe Trajectories

We define safety in terms of the "safe set" $$\mathcal S_\mathrm{safe}\subseteq \mathcal S$$.

(We can analogously define $$\mathcal A_\mathrm{safe}\subseteq \mathcal A$$ and require that $$a_k\in\mathcal A_\mathrm{safe}$$ for all $$0\leq k\leq t$$.)

A state $$s$$ is safe if $$s\in\mathcal S_\mathrm{safe}$$.

## Example

The state is position & velocity $$s=[\theta,\omega]$$ with $$s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t$$

Safety constraint on position $$|\theta|\leq 1$$

Are trajectories safe as long as $$|\theta_0|<1$$?

• no! Exercise: what is the necessary condition on $$\omega_0$$?
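A quick simulation makes the "no" concrete. The sketch below rolls out the example dynamics from two initial states with the same safe position; the specific initial velocities are hypothetical choices.

```python
import numpy as np

# Dynamics from the example: position theta, velocity omega.
A = np.array([[0.9, 0.1], [0.0, 0.9]])


def max_abs_position(s0, steps=100):
    """Largest |theta_t| along the trajectory starting from s0."""
    s = np.array(s0, dtype=float)
    worst = abs(s[0])
    for _ in range(steps):
        s = A @ s
        worst = max(worst, abs(s[0]))
    return worst


# Same initial position, different initial velocities (illustrative values).
worst_fast = max_abs_position([0.9, 2.0])
worst_rest = max_abs_position([0.9, 0.0])
```

Starting from $$\theta_0=0.9,\ \omega_0=2$$, the first step already gives $$\theta_1 = 0.9\cdot 0.9+0.1\cdot 2 = 1.01 > 1$$, while starting at rest the position simply decays.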

## Safe Invariant Sets

We define safety in terms of the "safe set" $$\mathcal S_\mathrm{safe}\subseteq \mathcal S$$

A system $$s_{t+1}=F(s_t)$$ is safe if some $$\mathcal S_\mathrm{inv}\subseteq \mathcal S_{\mathrm{safe}}$$ is invariant, i.e.

• for all $$s\in\mathcal S_\mathrm{inv}$$, $$F(s)\in\mathcal S_\mathrm{inv}$$

Exercise: Prove that if $$\mathcal S_\mathrm{inv}$$ is invariant for dynamics $$F$$, then $$s_0\in \mathcal S_\mathrm{inv} \implies s_t\in\mathcal S_\mathrm{inv}$$ for all $$t$$.

## Example: Linear Dynamics

• Consider stable linear dynamics $$s_{t+1}=As_t$$.
• Consider the set $$\{s\mid s^\top Ps \leq c\}$$ with $$P = \sum_{t=0}^\infty (A^t)^\top A^t$$
• Claim: This is an invariant set
• $$(As)^\top \sum_{t=0}^\infty (A^t)^\top A^t (As)$$

• $$= s^\top \sum_{t=1}^\infty (A^t)^\top A^t s$$

• $$\leq s^\top \sum_{t=0}^\infty (A^t)^\top A^t s \leq c$$

Example: An invariant set for
$$s=[\theta,\omega]$$ with $$s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t$$
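For this example, an invariant ellipsoid inside the safe set can be computed numerically. The sketch below truncates the series for $$P$$, picks the largest level $$c$$ for which $$\{s\mid s^\top Ps\leq c\}$$ stays within $$|\theta|\leq 1$$, and checks invariance on sampled boundary points. The truncation length and sample count are arbitrary choices for illustration.

```python
import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.9]])

# P = sum_t (A^t)^T A^t, truncated after many terms (A is stable, so the tail is negligible).
P = np.zeros((2, 2))
At = np.eye(2)
for _ in range(2000):
    P += At.T @ At
    At = A @ At

# The maximum of theta over {s : s'Ps <= c} is sqrt(c * (P^{-1})_{00}),
# so this c is the largest level whose ellipsoid satisfies |theta| <= 1.
c = 1.0 / np.linalg.inv(P)[0, 0]

# Sample points on the boundary s'Ps = c by rescaling random vectors.
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))
levels = np.einsum("ij,jk,ik->i", samples, P, samples)
boundary = samples * np.sqrt(c / levels)[:, None]

# Invariance check: (As)'P(As) <= c for every sampled boundary point.
next_states = boundary @ A.T
values_next = np.einsum("ij,jk,ik->i", next_states, P, next_states)
max_theta = np.abs(boundary[:, 0]).max()
```

Sampling only spot-checks the claim; the proof is the three-line argument above.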

## Invariance via Lyapunov

Definition: A Lyapunov function $$V:\mathcal S\to \mathbb R$$ for $$F$$ is continuous and

• (positive definite) $$V(0)=0$$ and $$V(s)>0$$ for all $$s\in\mathcal S \setminus \{0\}$$
• (decreasing) $$V(F(s)) - V(s) \leq 0$$ for all $$s\in\mathcal S$$

Claim: if $$V(s)$$ is a Lyapunov function for $$F$$ then any sublevel set $$\{s\mid V(s)\leq c\}$$ is invariant.

• For any $$s$$ with $$V(s)\leq c$$: $$V(F(s)) \leq V(s) \leq c$$

## Constrained Control

$$a_t = {\color{Goldenrod} K_t }s_{t}$$

$$\underset{\mathbf a }{\min}~~\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t \quad\text{s.t.}\quad s_{t+1} = As_t + Ba_t,~~ s_t \in\mathcal S_\mathrm{safe},~~ a_t \in\mathcal A_\mathrm{safe}$$

$$\begin{bmatrix} \mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a\end{bmatrix}\mathbf w, \qquad \mathbf w = \begin{bmatrix}s_0\\ 0\\ \vdots \\0 \end{bmatrix}$$

• nonconvex in $$K$$
• convex if $$\mathcal S_\mathrm{safe}$$ and $$\mathcal A_\mathrm{safe}$$ are convex

$$\underset{\color{teal}\mathbf{\Phi}}{\min}~\left\| \begin{bmatrix}\bar Q^{1/2} &\\& \bar R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a\end{bmatrix} \mathbf w\right\|_{2}^2$$

$$\text{s.t.}~~ \begin{bmatrix} I - \mathcal Z \bar A & - \mathcal Z \bar B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a\end{bmatrix}= I,~~ \mathbf \Phi_s\mathbf w \in\mathcal S_\mathrm{safe}^T,~~\mathbf \Phi_a\mathbf w\in\mathcal A_\mathrm{safe}^T$$

```python
import cvxpy as cvx
import numpy as np

# Assumes A (n x n), B (n x p), Q_sqrt, R_sqrt, F_s, b_s, F_a, b_a, s_0,
# and the horizon T are already defined.
# Since w = [s_0; 0; ...; 0], only the first block column of Phi is needed.
Phi_s = cvx.Variable((T*n, n), name="Phi_s")
Phi_a = cvx.Variable((T*p, n), name="Phi_a")

# Affine dynamics constraint
constr = [Phi_s[:n, :] == np.eye(n)]
for k in range(T-1):
    constr.append(Phi_s[n*(k+1):n*(k+2), :]
                  == A @ Phi_s[n*k:n*(k+1), :] + B @ Phi_a[p*k:p*(k+1), :])
# Terminal constraint: the response is zero after T steps
constr.append(A @ Phi_s[n*(T-1):, :] + B @ Phi_a[p*(T-1):, :] == 0)

# Polytope safety constraint: F_s s_k <= b_s and F_a a_k <= b_a
for k in range(T):
    constr.append(F_s @ Phi_s[n*k:n*(k+1), :] @ s_0 <= b_s)
    constr.append(F_a @ Phi_a[p*k:p*(k+1), :] @ s_0 <= b_a)

# Stack the cost-weighted blocks of the response
cost_matrix = cvx.bmat([[Q_sqrt @ Phi_s[n*k:n*(k+1), :]] for k in range(T)]
                       + [[R_sqrt @ Phi_a[p*k:p*(k+1), :]] for k in range(T)])
objective = cvx.norm(cost_matrix, 'fro')

prob = cvx.Problem(cvx.Minimize(objective), constr)
prob.solve()
Phi_s = np.array(Phi_s.value)
Phi_a = np.array(Phi_a.value)
```

## Safe policies via convex programming

• Linear control means policy has a constant gain
• Constant gain may be an inefficient way to ensure safety

## Nonlinear Safe Control

(Figure: size of $$a$$ plotted against size of $$s$$, with the safety constraint marked.)

## Control Barrier Function

Claim: Suppose that for all $$t$$, the policy satisfies

• $$C(F(s_t, \pi(s_t)))\leq \gamma C(s_t)$$ for some $$0\leq \gamma\leq 1$$.
• Then $$\{s\mid C(s)\leq 0\}$$ is an invariant set.

$$\pi(s_t) = \text{find}\quad a\quad\text{s.t.}\quad C(F(s_t, a)) \leq \gamma C(s_t)$$

Equivalently, $$C(F(s, a))-C(s) \leq -(1-\gamma) C(s)$$

(Figure: size of $$a$$ plotted against size of $$s$$, with the safety constraint and the boundary $$C(s)=0$$ marked.)

Example: safety filter for linear dynamics

$$a_t = \arg\min_{a\in\mathcal A_\mathrm{safe} } \|a-Ks_t\|_2 \quad \text{s.t.}\quad C(As_t+Ba) \leq \gamma C(s_t)$$
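A scalar instance of this safety filter can be solved in closed form, which makes for a small runnable sketch. Everything specific here is an assumption for illustration: the dynamics `a_dyn`, `b_dyn`, the nominal gain `K`, the barrier $$C(s)=s^2-1$$ (so the safe set is $$|s|\leq 1$$), and no action constraint ($$\mathcal A_\mathrm{safe}=\mathbb R$$). In one dimension the constraint says $$|s_{t+1}|\leq\sqrt{1+\gamma C(s_t)}$$, so the filter just clips the nominal next state to that interval.

```python
import math

# Hypothetical scalar dynamics s_{t+1} = a_dyn * s_t + b_dyn * action.
a_dyn, b_dyn = 1.1, 1.0
K = 0.2        # hypothetical nominal gain; a_dyn + b_dyn * K = 1.3 is unstable
gamma = 0.5


def barrier(s):
    """C(s) = s^2 - 1, so {C <= 0} is the safe set |s| <= 1."""
    return s**2 - 1.0


def safety_filter(s):
    """Action closest to K*s such that C(a_dyn*s + b_dyn*action) <= gamma*C(s)."""
    radius = math.sqrt(1.0 + gamma * barrier(s))   # allowed |s_next|
    nominal_next = a_dyn * s + b_dyn * K * s
    safe_next = min(max(nominal_next, -radius), radius)
    return (safe_next - a_dyn * s) / b_dyn


def simulate(policy, s0=0.5, steps=30):
    """Roll out the closed loop and return the list of visited states."""
    traj = [s0]
    for _ in range(steps):
        s = traj[-1]
        traj.append(a_dyn * s + b_dyn * policy(s))
    return traj


filtered = simulate(safety_filter)
unfiltered = simulate(lambda s: K * s)
```

The filter leaves the nominal action untouched whenever it already satisfies the barrier condition and intervenes only near the boundary. For vector states and a quadratic $$C$$, the same problem would be handed to a convex solver instead of being solved by clipping.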

Exercise: If $$C$$ is a quadratic function, when is the above optimization problem feasible for some $$a\in\mathbb R^m$$?

Adversarial perspective is common when dealing with disturbances.

• $$\mathcal S_\mathrm{inv}$$ is robustly invariant if for all $$s\in\mathcal S_\mathrm{inv}$$ and $$w\in\mathcal W$$, $$F(s, w)\in\mathcal S_\mathrm{inv}$$
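• ex: if $$F(s,w) = \gamma s+w$$ with $$0\leq\gamma<1$$ and $$|w|\leq B$$, then $$\{s\mid |s|\leq B/(1-\gamma)\}$$ is robustly invariant, since $$|\gamma s+w|\leq \gamma B/(1-\gamma)+B = B/(1-\gamma)$$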
• Robust constraints: $$\mathbf \Phi_s\mathbf w \in\mathcal S_\mathrm{safe}^T\quad\text{for all}\quad w\in\mathcal W$$
• Robust safety filter: $$\pi(s_t) = \text{find}\quad a\quad\text{s.t.}\quad C(F(s_t, a, w)) \leq \gamma C(s_t)~~\forall ~~ w\in\mathcal W$$
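The scalar robust-invariance example above can be checked numerically by taking the worst case over a grid of states and disturbances. The values of `gamma` and the disturbance bound `B_w` are hypothetical, and the grid only spot-checks the claim.

```python
import numpy as np

# Hypothetical contraction factor and disturbance bound.
gamma, B_w = 0.8, 1.0
bound = B_w / (1.0 - gamma)    # candidate invariant interval |s| <= bound

# Evaluate F(s, w) = gamma * s + w on a grid covering the set and all disturbances.
s_grid = np.linspace(-bound, bound, 201)
w_grid = np.linspace(-B_w, B_w, 201)
next_states = gamma * s_grid[:, None] + w_grid[None, :]

# Worst case over s in the set and w in W (adversarial disturbance).
worst_next = np.abs(next_states).max()
```

The worst case is attained at the corner $$s = B/(1-\gamma),\ w = B$$, where the next state lands exactly on the boundary, so the interval cannot be shrunk.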

## Recap

• Recap of data-driven optimal control (RL)
• policy, value, model
• Safety as constraints/invariance
• Safe control with