Safe Control
ML in Feedback Sys #20
Prof Sarah Dean
Reminders/etc
- Comments on proposals posted
- Project midterm update due 11/11
- Scribing feedback to come soon!
- Upcoming paper presentations starting next week
- Participation includes attending presentations
Reporting back on NCCR Symposium
Symposium on Socially Responsible Automation hosted by NCCR Automation at EPFL
- talks on responsibility/fairness in cyberphysical systems (power grid, dynamic pricing), recommendation systems, repeated resource allocation via "karma economy", and vulnerability-aware AV rules
- interesting idea: "reparative fairness"
- \(\mathbb E[\text{future cost}\mid\text{past cost}]\propto -\text{past cost}\)

Action in a dynamic world
Goal: select actions \(a_t\) to bring environment to low-cost states
(Diagram: the policy \(\pi_t:\mathcal S\to\mathcal A\) maps the observed state \(s_t\) to an action \(a_t\); the environment \(F\) updates the state \(s\); the learner accumulates data \(\{(s_t, a_t, c_t)\}\).)
Recap: Optimal Control
Stochastic Infinite Horizon Optimal Control Problem
$$ \min_{\pi} ~~\underbrace{\lim_{T\to\infty} \mathbb E_w\Big[\frac{1}{T}\sum_{k=0}^{T} c(s_k, \pi(s_k)) \Big]}_{J^\pi(s_0)}\quad \text{s.t.}\quad s_0~~\text{given},~~s_{k+1} = F(s_k, \pi(s_k),w_k) $$
Bellman Optimality Equation
- \(\underbrace{J^\star (s)}_{\text{value function}} = \min_{a\in\mathcal A} \underbrace{c(s, a)+\mathbb E_w[J^\star (F(s,a,w))]}_{\text{state-action function}}\)
- Minimizing argument is \(\pi^\star(s)\)
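To make the Bellman optimality recursion concrete, here is a minimal sketch on a toy finite MDP. It uses a discounted cost for illustration (the average-cost problem above would call for relative value iteration instead), and all problem data are randomly generated.

import numpy as np

# Toy finite MDP: nS states, nA actions, random costs and transitions.
rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
c = rng.uniform(size=(nS, nA))                  # cost c(s, a)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition probabilities P(s' | s, a)

J = np.zeros(nS)
for _ in range(1000):
    # Bellman optimality backup: Q(s, a) = c(s, a) + gamma * E[J(s')]
    Q = c + gamma * P @ J
    J_new = Q.min(axis=1)                       # J*(s) = min_a Q(s, a)
    if np.max(np.abs(J_new - J)) < 1e-10:
        break
    J = J_new

pi_star = Q.argmin(axis=1)                      # minimizing argument is pi*(s)
print("J* =", J.round(3), "  pi* =", pi_star)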
Recap: LQR with known dynamics
- Goal: minimize quadratic cost (\(Q,R\)) in a system with linear dynamics (\(A,B\))
- Classic approach: Dynamic programming/Bellman optimality
- \(P = \mathrm{DARE}(A,B,Q,R)\) and \(K_\star = -(R+B^\top PB)^{-1}B^\top PA\) (sketch below)
- System level synthesis: Convex optimization
- \( \underset{\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}Q^{1/2} &\\& R^{1/2}\end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_s \\ \mathbf{\Phi}_a \end{bmatrix} \right\|_{\mathcal H_2}^2~~\text{s.t.}~~ \begin{bmatrix} zI - A & - B\end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_s \\ \mathbf{\Phi}_a \end{bmatrix}= I \)
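As a quick illustration of the classic DARE-based solution, here is a hedged sketch using SciPy; the system and cost matrices below are toy values chosen for illustration, not anything from the lecture.

import numpy as np
from scipy.linalg import solve_discrete_are

# Toy system and cost matrices (illustrative values only)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

P = solve_discrete_are(A, B, Q, R)                   # P = DARE(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K* = -(R + B'PB)^{-1} B'PA
print("K* =", K)
print("closed-loop eigenvalues:", np.linalg.eigvals(A + B @ K))  # should lie inside the unit circle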
Recap: unknown dynamics
Setting: have data \(\{s_k, a_k, c_k\}_{k=0}^N\). Approaches include a focus on:
- Model: learn dynamics (and costs) from data, then do policy design
- For LQR: estimate \(\hat A,\hat B\) (\(\hat Q,\hat R\)) then design \(\hat K\)
- "model based"
- Bellman: learn value or state-action function
- For LQR: estimate \(\hat J\) then determine \(\hat K\) as \(\arg\min\)
- "model free"
- Policy: estimate gradients and update policy directly
- For LQR: \(\hat K \leftarrow \hat K -\alpha\widehat{\nabla J}(\hat K)\)
- "model free"
Model-based LQR
- Learn Model:
- estimate \(\hat A,\hat B\) via least-squares, guarantee \(\max\{\varepsilon_A, \varepsilon_B\}\lesssim \sqrt{\frac{m+n}{N}}\) (see the sketch after this list)
- Design Policy:
- robust approach uses \(\hat A, \hat B, \varepsilon_A, \varepsilon_B\)
- guarantee \(J(\hat{\mathbf \Phi}) - J(\mathbf \Phi_\star) \leq \epsilon\) for \(N\gtrsim \frac{(m+n)^2}{\epsilon^2}\)
- nominal or certainty equivalent approach uses \(\hat A, \hat B\)
- for small enough \(\varepsilon\), can show that \(J(\hat{\mathbf \Phi}) - J(\mathbf \Phi_\star) \lesssim \varepsilon^2\)
- thus faster rate, \(J(\hat{\mathbf \Phi}) - J(\mathbf \Phi_\star) \leq \epsilon\) for \(N\gtrsim \frac{m+n}{\epsilon}\)
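A minimal sketch of the "learn model" step, assuming a toy system excited by random inputs; the matrices and noise level are illustrative, and the estimator is an ordinary least-squares regression of \(s_{k+1}\) on \((s_k, a_k)\).

import numpy as np

rng = np.random.default_rng(0)
n, m, N = 2, 1, 500
A_true = np.array([[0.9, 0.1], [0.0, 0.9]])    # toy ground-truth dynamics
B_true = np.array([[0.0], [1.0]])

s = np.zeros(n)
X, Y = [], []                                  # regressors [s_k; a_k] and targets s_{k+1}
for _ in range(N):
    a = rng.normal(size=m)                     # exploratory input
    s_next = A_true @ s + B_true @ a + 0.01 * rng.normal(size=n)
    X.append(np.concatenate([s, a]))
    Y.append(s_next)
    s = s_next

Theta, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
A_hat, B_hat = Theta.T[:, :n], Theta.T[:, n:]  # Theta^T = [A_hat  B_hat]
print("estimation errors:", np.linalg.norm(A_hat - A_true), np.linalg.norm(B_hat - B_true))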
Model-free LQR
Approximate Policy Iteration [KTR19]
- estimate quadratic state-action function with LSTD (toy sketch below) $$ \hat q^\top f + \hat q^\top \phi(s_t, a_t) = c(s_t,a_t)+\mathbb E_{w_t}[\hat q^\top \phi(s_{t+1}, \pi(s_{t+1}))]$$
- update policy as $$ \hat K s = \arg\min_a \hat q^\top \phi(s, a) = \arg\min_a \begin{bmatrix}s\\ a\end{bmatrix}^\top \mathrm{mat}(\hat q) \begin{bmatrix}s\\ a\end{bmatrix}$$
- guarantee \(\|\hat K - K_\star \|_2 \leq \epsilon\) for \(NT \gtrsim \frac{(m+n)^3}{\epsilon^2}\)
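Below is a hedged toy sketch of the approximate policy iteration loop: it uses a simplified noiseless, discounted variant (not the average-cost estimator of [KTR19]), quadratic features \(\phi(s,a)=\mathrm{vec}([s;a][s;a]^\top)\), and illustrative system matrices.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R, gamma = np.eye(2), np.eye(1), 0.95
n, m = 2, 1

def phi(s, a):
    x = np.concatenate([s, a])
    return np.outer(x, x).ravel()              # quadratic features vec([s;a][s;a]^T)

K = np.zeros((m, n))                           # initial policy (stabilizing since A is stable)
for _ in range(10):
    feats, targets = [], []
    for _ in range(400):
        s = rng.normal(size=n)                 # random state and exploratory action
        a = rng.normal(size=m)
        s_next = A @ s + B @ a
        a_next = K @ s_next                    # next action follows the current policy
        feats.append(phi(s, a) - gamma * phi(s_next, a_next))
        targets.append(s @ Q @ s + a @ R @ a)
    q, *_ = np.linalg.lstsq(np.array(feats), np.array(targets), rcond=None)
    M = q.reshape(n + m, n + m)
    M = 0.5 * (M + M.T)                        # symmetrize mat(q)
    K = -np.linalg.solve(M[n:, n:], M[n:, :n]) # argmin_a [s;a]^T M [s;a]
print("learned gain:", K)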
Model-free LQR
Policy Gradient [FGKM18]
- estimate \(\widehat{\nabla J}(K)\) with finite differencing
- update policy as $$ K \leftarrow K -\alpha\widehat{\nabla J}( K)$$
- guarantee \(J(\hat K) - J(K_\star ) \leq\epsilon\) for \(N\gtrsim \mathrm{poly}(n,m,1/\epsilon)\)
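A hedged sketch of the policy gradient recipe with two-point finite differencing over random perturbation directions; the system, the finite-horizon rollout cost, and the step sizes below are all illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R, T = np.eye(2), np.eye(1), 50

def cost(K):
    """Finite-horizon rollout cost from a fixed initial state."""
    s, J = np.array([1.0, 0.0]), 0.0
    for _ in range(T):
        a = K @ s
        J += s @ Q @ s + a @ R @ a
        s = A @ s + B @ a
    return J

K, r, alpha = np.zeros((1, 2)), 0.1, 0.01
print("initial cost:", cost(K))
for _ in range(200):
    g = np.zeros_like(K)
    for _ in range(8):                         # average a few two-point estimates
        U = rng.normal(size=K.shape)
        g += (cost(K + r * U) - cost(K - r * U)) / (2 * r) * U
    K = K - alpha * g / 8
print("final cost:", cost(K), " gain:", K)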
Dynamic Performative Optimality
How to learn when data gradually reacts to your model [IZY22]
- Learner chooses \(\theta_t\)
- Population reacts as \(\rho_{t} = \mathcal D(\theta_t, \rho_{t-1})\)
- For fixed \(\theta\), the limiting distribution \(\rho_\star(\theta) = \lim_{t\to\infty} \rho_t\)
- Goal for learner: minimize \(\mathcal L^\star(\theta) = \mathbb E_{z\sim\rho_\star(\theta)} [\ell(z, \theta)]\)
- Similar to goal of minimizing \(\lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^T \mathbb E_{z\sim\rho_t} [\ell(z, \theta)]\)
- Approach: sample-based gradient descent.
- estimate parameters of \(\rho_t\) and parametric model of \(\partial_t \mathcal D\), use to estimate gradient of \(\mathcal L^\star\)
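As a toy illustration of the setting (not the estimator from [IZY22]), the sketch below assumes the population mean drifts toward \(a + b\theta\) and uses a simple deploy-wait-and-finite-difference baseline on the long-term loss; all parameters are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
a_true, b_true, lam, sigma = 1.0, 0.5, 0.3, 0.1   # toy drift model and noise
mu = 0.0                                          # current population mean

def deploy(theta, steps=30, n=200):
    """Deploy theta, let rho_t settle, and return an empirical loss estimate."""
    global mu
    for _ in range(steps):
        mu = (1 - lam) * mu + lam * (a_true + b_true * theta)   # rho_t reacts to theta
    z = mu + sigma * rng.normal(size=n)                         # samples ~ rho_*(theta)
    return np.mean((z - theta) ** 2)                            # loss l(z, theta) = (z - theta)^2

theta, delta, alpha = 0.0, 0.2, 0.5
for _ in range(40):
    grad = (deploy(theta + delta) - deploy(theta - delta)) / (2 * delta)
    theta -= alpha * grad
print("final theta:", round(theta, 3), " optimum of L* is a/(1-b) =", a_true / (1 - b_true))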
Is low cost all we want?
- In the setting of performative prediction, we may learn that homogeneous populations are easier to make predictions about
- recall retention dynamics of Hashimoto et al.
- Controlling autonomous vehicles or robots may involve avoiding obstacles or staying on the road
Motivation: Safety



Safe Trajectories
We define safety in terms of the "safe set" \(\mathcal S_\mathrm{safe}\subseteq \mathcal S\).
A state \(s\) is safe if \(s\in\mathcal S_\mathrm{safe}\).
A trajectory of states \((s_0,\dots,s_t)\) is safe if \(s_k\in\mathcal S_\mathrm{safe}\) for all \(0\leq k\leq t\).
(we can analogously define \(\mathcal A_\mathrm{safe}\subseteq \mathcal A\) and require that \(a_k\in\mathcal A_\mathrm{safe}\) for all \(0\leq k\leq t\))
Example
The state is position & velocity \(s=[\theta,\omega]\) with \( s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t \)
Safety constraint on position \(|\theta|\leq 1\)

Are trajectories safe as long as \(|\theta_0|<1\)?
- no! Exercise: what is the necessary condition on \(\omega_0\)?
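A quick numerical exploration of the exercise (not a proof): simulate the system for a few values of \(\omega_0\) with \(\theta_0\) near the boundary and check whether \(|\theta_t|\leq 1\) along the trajectory.

import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.9]])

def stays_safe(theta0, omega0, T=200):
    s = np.array([theta0, omega0])
    for _ in range(T):
        if abs(s[0]) > 1:          # position constraint |theta| <= 1 violated
            return False
        s = A @ s
    return True

for omega0 in [0.0, 0.5, 1.0, 2.0, 5.0]:
    print("omega_0 =", omega0, " safe:", stays_safe(0.9, omega0))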
Safe Invariant Sets
We define safety in terms of the "safe set" \(\mathcal S_\mathrm{safe}\subseteq \mathcal S\)
A system \(s_{t+1}=F(s_t)\) is safe if some \(\mathcal S_\mathrm{inv}\subseteq \mathcal S_{\mathrm{safe}}\) is invariant, i.e.
- for all \( s\in\mathcal S_\mathrm{inv}\), \( F(s)\in\mathcal S_\mathrm{inv}\)
Exercise: Prove that if \(\mathcal S_\mathrm{inv}\) is invariant for dynamics \(F\), then \(s_0\in \mathcal S_\mathrm{inv} \implies s_t\in\mathcal S_\mathrm{inv}\) for all \(t\).
Example: Linear Dynamics
- Consider stable linear dynamics \(s_{t+1}=As_t\).
- Consider the set \( \{s\mid s^\top Ps \leq c\} \) with \(P = \sum_{t=0}^\infty (A^t)^\top A^t\)
- Claim: This is an invariant set
- \((As)^\top \sum_{t=0}^\infty (A^t)^\top A^t (As) = s^\top \sum_{t=1}^\infty (A^t)^\top A^t s \leq s^\top \sum_{t=0}^\infty (A^t)^\top A^t s \leq c\)

Example: An invariant set for \(s=[\theta,\omega]\) with \( s_{t+1} = \begin{bmatrix} 0.9 & 0.1\\ & 0.9 \end{bmatrix}s_t \)
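A small sketch for this example: \(P=\sum_t (A^t)^\top A^t\) solves the discrete Lyapunov equation \(P = A^\top P A + I\), so we can compute it with SciPy and spot-check one-step invariance of \(\{s\mid s^\top P s\leq c\}\) on sampled boundary points (the level \(c\) is arbitrary).

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.1], [0.0, 0.9]])
P = solve_discrete_lyapunov(A.T, np.eye(2))    # solves P = A^T P A + I
c = 1.0

rng = np.random.default_rng(0)
for _ in range(1000):
    v = rng.normal(size=2)
    s = v * np.sqrt(c / (v @ P @ v))           # a point on the boundary s^T P s = c
    assert (A @ s) @ P @ (A @ s) <= c + 1e-9   # one-step invariance check
print("P =\n", P.round(3))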
Invariance via Lyapunov
Definition: A Lyapunov function \(V:\mathcal S\to \mathbb R\) for \(F\) is continuous and
- (positive definite) \(V(0)=0\) and \(V(s)>0\) for all \(s\in\mathcal S - \{0\}\)
- (decreasing) \(V(F(s)) - V(s) \leq 0\) for all \(s\in\mathcal S\)
Claim: if \(V(s)\) is a Lyapunov function for \(F\) then any sublevel set \(\{V(s)\leq c\}\) is invariant.
- Proof sketch: if \(V(s)\leq c\), then \(V(F(s)) \leq V(s) \leq c\), so \(F(s)\) remains in the sublevel set.
Constrained Control
\(a_t = {\color{Goldenrod} K_t }s_{t}\)
\( \underset{\mathbf a }{\min}\) \(\displaystyle\sum_{t=0}^T s_t^\top Q s_t + a_t^\top R a_t\)
\(\text{s.t.}~~s_{t+1} = As_t + Ba_t \)
\(s_t \in\mathcal S_\mathrm{safe},~~ a_t \in\mathcal A_\mathrm{safe}\)
\(\begin{bmatrix} \mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix} \mathbf \Phi_s\\ \mathbf \Phi_a\end{bmatrix}\mathbf w \)
\(\mathbf w = \begin{bmatrix}s_0\\ 0\\ \vdots \\0 \end{bmatrix}\)
- nonconvex in \(K\)
- convex if \(\mathcal S_\mathrm{safe}\) and \(\mathcal A_\mathrm{safe}\) are convex
\( \underset{\color{teal}\mathbf{\Phi}}{\min}\)\(\left\| \begin{bmatrix}\bar Q^{1/2} &\\& \bar R^{1/2}\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a\end{bmatrix} \mathbf w\right\|_{2}^2\)
\(\text{s.t.}~~ \begin{bmatrix} I - \mathcal Z \bar A & - \mathcal Z \bar B\end{bmatrix} \begin{bmatrix}\color{teal} \mathbf{\Phi}_s \\ \color{teal} \mathbf{\Phi}_a\end{bmatrix}= I \)
\(\mathbf \Phi_s\mathbf w \in\mathcal S_\mathrm{safe}^T,~~\mathbf \Phi_a\mathbf w\in\mathcal A_\mathrm{safe}^T\)
import numpy as np
import cvxpy as cvx

# Problem data assumed given: horizon T, state dim n, input dim p, dynamics A, B,
# cost square roots Q_sqrt, R_sqrt, polytopes (F_s, b_s), (F_a, b_a), initial state s_0.
Phi_s = cvx.Variable((T*n, T*n), name="Phi_s")
Phi_a = cvx.Variable((T*p, T*n), name="Phi_a")
# Affine dynamics constraint: first block of Phi_s is the identity,
# and each later block follows the dynamics (A, B)
constr = [Phi_s[:n, :] == np.eye(n)]
for k in range(T-1):
    constr.append(Phi_s[n*(k+1):n*(k+2), :]
                  == A @ Phi_s[n*k:n*(k+1), :] + B @ Phi_a[p*k:p*(k+1), :])
constr.append(A @ Phi_s[n*(T-1):, :] + B @ Phi_a[p*(T-1):, :] == 0)
# Polytope safety constraints: F_s s_k <= b_s and F_a a_k <= b_a
# (with w = [s_0; 0; ...; 0], only the first n columns of Phi multiply s_0)
for k in range(T-1):
    constr.append(F_s @ Phi_s[n*(k+1):n*(k+2), :n] @ s_0 <= b_s)
    constr.append(F_a @ Phi_a[p*k:p*(k+1), :n] @ s_0 <= b_a)
# Quadratic cost (Frobenius norm of the weighted system response)
cost_matrix = cvx.bmat([[Q_sqrt @ Phi_s[n*k:n*(k+1), :]] for k in range(T)]
                       + [[R_sqrt @ Phi_a[p*k:p*(k+1), :]] for k in range(T)])
objective = cvx.norm(cost_matrix, 'fro')
prob = cvx.Problem(cvx.Minimize(objective), constr)
prob.solve()
Phi_s = np.array(Phi_s.value)
Phi_a = np.array(Phi_a.value)
Safe policies via convex programming
- Linear control means policy has a constant gain
- Constant gain may be an inefficient way to ensure safety
Nonlinear Safe Control
(Figure: a nonlinear safety constraint illustrated on axes of the size of \(s\) and the size of \(a\).)
Control Barrier Function
Claim: Suppose that for all \(t\), the policy satisfies
- \(C(F(s_t, \pi(s_t)))\leq \gamma C(s_t)\) for some \(0\leq \gamma\leq 1\).
- Then \(\{s\mid C(s)\leq 0\}\) is an invariant set.
$$\pi(s_t) = \text{find}\quad a\quad\text{s.t.}\quad C(F(s_t, a)) \leq \gamma C(s_t) $$
\(C(F(s, a))-C(s) \leq -(1-\gamma) C(s) \)
(Figure: the barrier boundary \(C(s)=0\) and the safety constraint on axes of the size of \(s\) and the size of \(a\).)
Example: safety filter for linear dynamics
Control Barrier Function
$$a_t = \arg\min_{a\in\mathcal A_\mathrm{safe} } \|a-Ks_t\|_2 \quad \text{s.t.}\quad C(As_t+Ba_t) \leq \gamma C(s_t) $$
Exercise: If \(C\) is a quadratic function, when is the above optimization problem feasible for some \(a\in\mathbb R^m\)?
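A minimal CVXPY sketch of this safety filter, assuming a quadratic barrier \(C(s)=\|s\|^2 - c\) and toy system matrices; the nominal gain is deliberately chosen so the filter has to modify the action.

import numpy as np
import cvxpy as cvx

A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.8, 0.8]])            # nominal gain (deliberately unsafe here)
c, gamma = 1.0, 0.9                   # barrier C(s) = ||s||^2 - c

def safety_filter(s):
    a = cvx.Variable(1)
    s_next = A @ s + B @ a
    barrier = [cvx.sum_squares(s_next) - c <= gamma * (float(s @ s) - c)]
    prob = cvx.Problem(cvx.Minimize(cvx.sum_squares(a - K @ s)), barrier)
    prob.solve()
    return a.value

s = np.array([0.8, 0.3])
print("nominal action:", K @ s, " filtered action:", safety_filter(s))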
Safety with Disturbances
Adversarial perspective is common when dealing with disturbances.
- \(\mathcal S_\mathrm{inv}\) is robustly invariant if for all \( s\in\mathcal S_\mathrm{inv}\) and \(w\in\mathcal W\), \( F(s, w)\in\mathcal S_\mathrm{inv}\)
- ex: \(F(s,w) = \gamma s+w\) with \(0\leq\gamma<1\) and \(|w|\leq B\), then \(|s|\leq B/(1-\gamma)\) is invariant, since \(|\gamma s + w| \leq \gamma|s| + B \leq \gamma\tfrac{B}{1-\gamma} + B = \tfrac{B}{1-\gamma}\)
- Robust constraints: $$\mathbf \Phi_s\mathbf w \in\mathcal S_\mathrm{safe}^T\quad\text{for all}\quad \mathbf w\in\mathcal W$$
- Robust safety filter: $$\pi(s_t) = \text{find}\quad a\quad\text{s.t.}\quad C(F(s_t, a, w)) \leq \gamma C(s_t)~~\forall ~~ w\in\mathcal W $$
Recap
- Recap of data-driven optimal control (RL)
- policy, value, model
- Safety as constraints/invariance
- Safe control with
- quadratic Lyapunov functions
- System level synthesis
- Barrier functions
References: Predictive Control by Borrelli, Bemporad, Morari