Prof Sarah Dean
$$a_t = K_t s_t$$

$$\min_{a}\; \mathbb{E}\left[\sum_{t=0}^{T} s_t^\top Q s_t + a_t^\top R a_t\right] \quad \text{s.t.}\; s_{t+1} = A s_t + B a_t + w_t$$

$$\begin{bmatrix}\mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix}\mathbf\Phi_s\\ \mathbf\Phi_a\end{bmatrix}\mathbf w$$

$$\min_{\mathbf\Phi}\;\left\|\begin{bmatrix}\bar Q^{1/2} & \\ & \bar R^{1/2}\end{bmatrix}\begin{bmatrix}\mathbf\Phi_s\\ \mathbf\Phi_a\end{bmatrix}\right\|_F^2 \quad \text{s.t.}\; \begin{bmatrix}I - Z\bar A & -Z\bar B\end{bmatrix}\begin{bmatrix}\mathbf\Phi_s\\ \mathbf\Phi_a\end{bmatrix} = I$$

(here $Z$ is the block-downshift matrix and $\bar A, \bar B, \bar Q, \bar R$ stack $A, B, Q, R$ block-diagonally over the horizon)
[Figure: the feedback loop: the controller $K$ maps state $s_t$ to action $a_t$, and the dynamics $(A, B)$ driven by noise $w_t$ produce the next state.]

Instead of a loop, the system looks like a line:

[Figure: the same signals unrolled in time, so each $s_t$ and $a_t$ is a direct function of $s_0$ and the noise $w_0, \dots, w_{t-1}$.]
Reference: System Level Synthesis by Anderson, Doyle, Low, and Matni
Theorem: For a linear system in feedback with a linear controller over the horizon $t = 0, \dots, T$, the achievable closed-loop responses $(\mathbf\Phi_s, \mathbf\Phi_a)$ are exactly those satisfying the affine constraint above.
import numpy as np
import cvxpy as cvx

# Decision variables: the first block column of the block-Toeplitz response,
# i.e. Phi_s^0, ..., Phi_s^{T-1} and Phi_a^0, ..., Phi_a^{T-1} stacked.
Phi_s = cvx.Variable((T*n, n), name="Phi_s")
Phi_a = cvx.Variable((T*p, n), name="Phi_a")

# Affine dynamics constraint: Phi_s^0 = I, Phi_s^{k+1} = A Phi_s^k + B Phi_a^k,
# with the response driven to zero at the end of the horizon.
constr = [Phi_s[:n, :] == np.eye(n)]
for k in range(T-1):
    constr.append(Phi_s[n*(k+1):n*(k+2), :]
                  == A @ Phi_s[n*k:n*(k+1), :] + B @ Phi_a[p*k:p*(k+1), :])
constr.append(A @ Phi_s[n*(T-1):, :] + B @ Phi_a[p*(T-1):, :] == 0)

# Quadratic cost: Frobenius norm of the weighted, stacked response
cost_matrix = cvx.bmat(
    [[Q_sqrt @ Phi_s[n*k:n*(k+1), :]] for k in range(T)]
    + [[R_sqrt @ Phi_a[p*k:p*(k+1), :]] for k in range(T)])
objective = cvx.norm(cost_matrix, 'fro')

prob = cvx.Problem(cvx.Minimize(objective), constr)
prob.solve()
Phi_s = np.array(Phi_s.value)
Phi_a = np.array(Phi_a.value)
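One way to read a controller off the solution (a post-processing note, not on the slide): in SLS the feedback law can be implemented directly from the response via $\mathbf a = \mathbf\Phi_a \mathbf\Phi_s^{-1}\mathbf s$; in particular, since $\Phi_s^0 = I$, the first block of Phi_a is the first-step gain:

K0 = Phi_a[:p, :]   # a_0 = Phi_a^0 s_0, because Phi_s^0 = I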
Infinite Horizon LQR Problem
$$\min_{\pi}\;\lim_{T\to\infty}\mathbb{E}_w\left[\frac{1}{T}\sum_{k=0}^{T} s_k^\top Q s_k + a_k^\top R a_k\right] \quad \text{s.t.}\; s_{k+1} = A s_k + B a_k + w_k$$
Claim: The optimal cost-to-go function is quadratic and the optimal policy is linear: $J^\star(s) = s^\top P s$, $\pi^\star(s) = K s$.
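For reference, the claim is certified by standard facts not spelled out on the slide: $P$ solves the discrete-time algebraic Riccati equation, and $K$ is the corresponding gain,

$$P = Q + A^\top P A - A^\top P B\,(R + B^\top P B)^{-1} B^\top P A, \qquad K = -(R + B^\top P B)^{-1} B^\top P A.$$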
Stochastic Infinite Horizon Optimal Control Problem
$$\min_{\pi}\;\lim_{T\to\infty}\mathbb{E}_w\left[\frac{1}{T}\sum_{k=0}^{T} c(s_k, \pi(s_k))\right] \quad \text{s.t.}\; s_0 \text{ given},\; s_{k+1} = F(s_k, \pi(s_k), w_k)$$
$J^\pi(s_0)$ denotes the value of policy $\pi$ from initial state $s_0$.
Bellman Optimality Equation
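A standard statement (cf. Bertsekas, Ch. 1), written here in total-cost form; the average-cost problem above adds a constant per-step offset:

$$J^\star(s) = \min_{a}\; c(s, a) + \mathbb{E}_w\left[J^\star\big(F(s, a, w)\big)\right]$$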
Reference: Ch 1 in Dynamic Programming & Optimal Control, Vol. I by Bertsekas
$$s_{t+1} = \begin{bmatrix} 0.9 & 0.1 \\ 0 & 0.9 \end{bmatrix} s_t + \begin{bmatrix} 0 \\ 1 \end{bmatrix} a_t + w_t$$

The state is position & velocity $s = [\theta, \omega]$; the input is a force $a \in \mathbb{R}$.
Goal: stay near origin and be energy efficient
$$\pi^\star(s) \approx -\begin{bmatrix} 7.0\times 10^{-2} & 3.7\times 10^{-2} \end{bmatrix} s, \qquad J^\star(s) \approx s^\top \begin{bmatrix} 33.5 & 5.8 \\ 5.8 & 2.4 \end{bmatrix} s$$
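A sketch of how such values are computed numerically. The slide does not state its cost matrices, so $Q$ and $R$ below are placeholders and the printed numbers will differ from those above:

import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # placeholder: the slide's actual weights are not shown
R = 100.0 * np.eye(1)  # placeholder: large R encodes "be energy efficient"

P = solve_discrete_are(A, B, Q, R)                  # J*(s) = s' P s
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # pi*(s) = K s
print(K, P, sep="\n")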
[Figure: the feedback loop again: dynamics $(A, B)$, noise $w_t$, state $s_t$, action $a_t$, controller $K$.]
With the linear policy $a_t = K s_t$, unrolling the dynamics gives

$$s_{t+1} = A s_t + B a_t + w_t = (A+BK)^{t+1} s_0 + \sum_{k=0}^{t} (A+BK)^{t-k} w_k$$

$$a_{t+1} = K s_{t+1} = K(A+BK)^{t+1} s_0 + \sum_{k=0}^{t} K(A+BK)^{t-k} w_k$$
Renaming the convolution coefficients as system response variables,

$$s_t = \Phi_s^{t} s_0 + \sum_{k=1}^{t} \Phi_s^{k-1} w_{t-k}, \qquad a_t = \Phi_a^{t} s_0 + \sum_{k=1}^{t} \Phi_a^{k-1} w_{t-k}$$

or, stacked over the horizon,

$$\begin{bmatrix} s_0\\ \vdots\\ s_T\end{bmatrix} = \begin{bmatrix}\Phi_s^0 & & \\ \Phi_s^1 & \Phi_s^0 & \\ \vdots & \ddots & \ddots \\ \Phi_s^T & \cdots & \Phi_s^1 & \Phi_s^0\end{bmatrix}\begin{bmatrix}s_0\\ w_0\\ \vdots\\ w_{T-1}\end{bmatrix}, \qquad \begin{bmatrix} a_0\\ \vdots\\ a_T\end{bmatrix} = \begin{bmatrix}\Phi_a^0 & & \\ \Phi_a^1 & \Phi_a^0 & \\ \vdots & \ddots & \ddots \\ \Phi_a^T & \cdots & \Phi_a^1 & \Phi_a^0\end{bmatrix}\begin{bmatrix}s_0\\ w_0\\ \vdots\\ w_{T-1}\end{bmatrix}$$
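A small numerical check (mine, not the deck's) that the closed loop under $a_t = K s_t$ has exactly this convolution structure, using the finite-horizon convention $\Phi_s^0 = I$:

import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
K = -np.array([[7.0e-2, 3.7e-2]])

T = 10
ABK = A + B @ K
Phi_s = [np.linalg.matrix_power(ABK, k) for k in range(T + 1)]  # Phi_s^k = (A+BK)^k
Phi_a = [K @ Pk for Pk in Phi_s]                                # Phi_a^k = K(A+BK)^k

rng = np.random.default_rng(0)
s0 = rng.standard_normal(2)
w = rng.standard_normal((T, 2))

s = s0.copy()
for t in range(1, T + 1):
    s = ABK @ s + w[t - 1]  # closed-loop rollout
    s_conv = Phi_s[t] @ s0 + sum(Phi_s[t - 1 - k] @ w[k] for k in range(t))
    assert np.allclose(s, s_conv)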
$$s_{t+1} = \begin{bmatrix} 0.9 & 0.1 \\ 0 & 0.9 \end{bmatrix} s_t + \begin{bmatrix} 0 \\ 1 \end{bmatrix} a_t + w_t$$

The state is position & velocity $s = [\theta, \omega]$; the input is a force $a \in \mathbb{R}$.
$$\pi^\star(s) \approx -\begin{bmatrix} 7.0\times10^{-2} & 3.7\times10^{-2}\end{bmatrix}s$$

$$\Phi_s^t \approx \begin{bmatrix}0.9 & 0.1\\ -0.07 & 0.86\end{bmatrix}^{t-1}, \qquad \Phi_a^t \approx -\begin{bmatrix}7.0\times10^{-2} & 3.7\times10^{-2}\end{bmatrix}\begin{bmatrix}0.9 & 0.1\\ -0.07 & 0.86\end{bmatrix}^{t-1}$$

(indexed so that $\Phi_s^1 = I$, matching the transfer-function convention $\mathbf\Phi(z) = \sum_{t\ge1}\Phi^t z^{-t}$ used below)

eigenvalues $\approx 0.88 \pm 0.082j$
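The stated eigenvalues are easy to check directly (with the rounded entries shown, the imaginary part comes out near 0.081):

import numpy as np
print(np.linalg.eigvals(np.array([[0.9, 0.1], [-0.07, 0.86]])))
# approximately [0.88 + 0.081j, 0.88 - 0.081j]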
$$a_t = K s_t$$

$$\min_{a}\;\lim_{T\to\infty}\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T} s_t^\top Q s_t + a_t^\top R a_t\right] \quad \text{s.t.}\; s_{t+1} = A s_t + B a_t + w_t$$

$$\begin{bmatrix}\mathbf s\\ \mathbf a\end{bmatrix} = \begin{bmatrix}\mathbf\Phi_s\\ \mathbf\Phi_a\end{bmatrix}\mathbf w$$

$$\min_{\mathbf\Phi}\;\left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\begin{bmatrix}\mathbf\Phi_s\\ \mathbf\Phi_a\end{bmatrix}\right\|_{\mathcal H_2}^2 \quad \text{s.t.}\; \begin{bmatrix}zI - A & -B\end{bmatrix}\begin{bmatrix}\mathbf\Phi_s\\ \mathbf\Phi_a\end{bmatrix} = I$$
Exercise: using the frequency-domain notation, derive the expression for the SLS cost and constraints. Hint: in signal notation, the dynamics can be written $z\mathbf s = A\mathbf s + B\mathbf a + \mathbf w$.
Where we use the norm:
$$\|\mathbf\Phi\|_{\mathcal H_2}^2 = \sum_{t=0}^{\infty}\|\Phi^t\|_F^2$$
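This norm is what connects the two formulations: for $w_t$ i.i.d. with identity covariance, the average LQR cost of a response $\mathbf\Phi$ is exactly the squared $\mathcal H_2$ norm of the weighted response (a step the slides leave implicit):

$$\lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}\left[\sum_{t=0}^{T} s_t^\top Q s_t + a_t^\top R a_t\right] = \sum_{t}\left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\begin{bmatrix}\Phi_s^t\\ \Phi_a^t\end{bmatrix}\right\|_F^2 = \left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\begin{bmatrix}\mathbf\Phi_s\\ \mathbf\Phi_a\end{bmatrix}\right\|_{\mathcal H_2}^2$$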
$$\min_{\mathbf\Phi}\;\left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\begin{bmatrix}\mathbf\Phi_s\\ \mathbf\Phi_a\end{bmatrix}\right\|_{\mathcal H_2}^2 \quad \text{s.t.}\; \begin{bmatrix}zI - A & -B\end{bmatrix}\begin{bmatrix}\mathbf\Phi_s\\ \mathbf\Phi_a\end{bmatrix} = I$$
[Figure: the interaction loop: a policy $\pi_t : \mathcal S \to \mathcal A$ selects action $a_t$, the environment responds with observation $s_t$, and the learner accumulates data $\{(s_t, a_t, c_t)\}$.]

Goal: select actions $a_t$ to bring the environment to low-cost states
Setting: the dynamics (and cost) functions are not known, but we have data $\{(s_k, a_k, c_k)\}_{k=0}^N$. Approaches differ in what they focus on; here we take the model-based route.

Setting: the dynamics $A, B$ are not known, but we have data $\{(s_k, a_k)\}_{k=0}^N$.
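A minimal sketch of the standard least-squares estimator for this setting (the slides name no particular estimator): regress $s_{k+1}$ on $(s_k, a_k)$.

import numpy as np

def estimate_dynamics(states, actions):
    """Fit s_{k+1} ~ A s_k + B a_k by least squares.

    states: (N+1, n) array; actions: (N, p) array.
    """
    X = np.hstack([states[:-1], actions])          # (N, n+p) regressors
    Y = states[1:]                                 # (N, n) targets
    Theta, *_ = np.linalg.lstsq(X, Y, rcond=None)  # (n+p, n) coefficients
    n = states.shape[1]
    return Theta[:n].T, Theta[n:].T                # A_hat, B_hat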
The state is position & velocity $s = [\theta, \omega]$; the input is a force $a \in \mathbb{R}$.

Goal: be energy efficient

$$\hat\pi^\star(s) \approx -\begin{bmatrix}6.1\times10^{-5} & 2.8\times10^{-4}\end{bmatrix}s \quad \text{does not stabilize the system!}$$

Even though $\varepsilon = 0.02$, $J(\hat K)$ is infinite!

True dynamics $\left(\begin{bmatrix}1.01 & 0.1\\ 0 & 1.01\end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}\right)$, but we estimate $\left(\begin{bmatrix}0.99 & 0.1\\ 0 & 0.99\end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}\right)$.
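A numerical illustration of this failure (the cost weights are assumptions, chosen so that "energy efficient" means control is expensive): design the certainty-equivalent gain on the estimate, then check stability on the truth.

import numpy as np
from scipy.linalg import solve_discrete_are

A_true = np.array([[1.01, 0.1], [0.0, 1.01]])
A_hat = np.array([[0.99, 0.1], [0.0, 0.99]])
B = np.array([[0.0], [1.0]])
Q = 1e-3 * np.eye(2)  # assumed: state deviations are cheap...
R = np.eye(1)         # ...relative to control effort

P = solve_discrete_are(A_hat, B, Q, R)
K_hat = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A_hat)

# A_hat is stable, so the optimizer barely acts; A_true is unstable, and the
# tiny gain cannot stabilize it: the closed-loop spectral radius exceeds 1.
print(K_hat)
print(np.abs(np.linalg.eigvals(A_true + B @ K_hat)))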
$$\min_{a=Ks}\;\max_{\substack{\|A-\hat A\|\le\varepsilon_A\\ \|B-\hat B\|\le\varepsilon_B}}\;\mathbb{E}\left[\lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T} s_t^\top Q s_t + a_t^\top R a_t\right] \quad \text{s.t.}\; s_{t+1} = A s_t + B a_t + w_t$$
Challenge: translating predictions $\hat s_{t+1} = \hat A \hat s_t + \hat B \hat a_t$ to reality $s_{t+1} = A s_t + B a_t$.
Lemma: if the system response variables satisfy $\begin{bmatrix} zI - \hat A & -\hat B\end{bmatrix}\begin{bmatrix}\hat{\mathbf\Phi}_s\\ \hat{\mathbf\Phi}_a\end{bmatrix} = I$, then on the true system they satisfy $\begin{bmatrix} zI - A & -B\end{bmatrix}\begin{bmatrix}\hat{\mathbf\Phi}_s\\ \hat{\mathbf\Phi}_a\end{bmatrix} = I - \mathbf\Delta$.

Proof: $\begin{bmatrix} zI - A & -B\end{bmatrix}\begin{bmatrix}\hat{\mathbf\Phi}_s\\ \hat{\mathbf\Phi}_a\end{bmatrix} = \begin{bmatrix} zI - \hat A & -\hat B\end{bmatrix}\begin{bmatrix}\hat{\mathbf\Phi}_s\\ \hat{\mathbf\Phi}_a\end{bmatrix} - \begin{bmatrix}\Delta_A & \Delta_B\end{bmatrix}\begin{bmatrix}\hat{\mathbf\Phi}_s\\ \hat{\mathbf\Phi}_a\end{bmatrix} = I - \mathbf\Delta$. ∎
Therefore, the estimated cost is $$\hat J(\hat{\mathbf\Phi}) = \left\|\begin{bmatrix} Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\begin{bmatrix}\hat{\mathbf\Phi}_s\\ \hat{\mathbf\Phi}_a\end{bmatrix}\right\|_{\mathcal H_2}^2$$ while the cost actually achieved is $$J(\hat{\mathbf\Phi}) = \left\|\begin{bmatrix} Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\begin{bmatrix}\hat{\mathbf\Phi}_s\\ \hat{\mathbf\Phi}_a\end{bmatrix}(I - \mathbf\Delta)^{-1}\right\|_{\mathcal H_2}^2$$
Theorem (Anderson et al., 2019): A policy designed from system responses satisfying $\begin{bmatrix} zI - \hat A & -\hat B\end{bmatrix}\begin{bmatrix}\hat{\mathbf\Phi}_s\\ \hat{\mathbf\Phi}_a\end{bmatrix} = I$ will achieve the response $\begin{bmatrix}\hat{\mathbf\Phi}_s\\ \hat{\mathbf\Phi}_a\end{bmatrix}(I - \mathbf\Delta)^{-1}$ on the true system,

where $\mathbf\Delta = \Delta_A\hat{\mathbf\Phi}_s + \Delta_B\hat{\mathbf\Phi}_a$, $\Delta_A \triangleq A - \hat A$, $\Delta_B \triangleq B - \hat B$, provided the inverse exists.
$$\hat{\mathbf\Phi} = \operatorname*{argmin}_{\mathbf\Phi,\,\gamma}\;\frac{1}{1-\gamma}\left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\mathbf\Phi\right\|_{\mathcal H_2} \quad \text{s.t.}\; \begin{bmatrix}zI - \hat A & -\hat B\end{bmatrix}\mathbf\Phi = I, \qquad \left\|\begin{bmatrix}\varepsilon_A\mathbf\Phi_s\\ \varepsilon_B\mathbf\Phi_a\end{bmatrix}\right\|_{\mathcal H_\infty} \le \gamma$$
$$\min_{\mathbf\Phi}\;\max_{\substack{\|\Delta_A\|\le\varepsilon_A\\ \|\Delta_B\|\le\varepsilon_B}}\;\left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\mathbf\Phi(I-\mathbf\Delta)^{-1}\right\|_{\mathcal H_2} \quad \text{s.t.}\; \mathbf\Phi\in\mathrm{Affine}(\hat A, \hat B), \qquad \mathbf\Delta = \begin{bmatrix}\Delta_A & \Delta_B\end{bmatrix}\mathbf\Phi$$
$$\min_{a=Ks}\;\max_{\substack{\|A-\hat A\|\le\varepsilon_A\\ \|B-\hat B\|\le\varepsilon_B}}\;\mathbb{E}\left[\lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T} s_t^\top Q s_t + a_t^\top R a_t\right] \quad \text{s.t.}\; s_{t+1} = A s_t + B a_t + w_t$$
Where we use the norm:
$$\|\mathbf\Phi\|_{\mathcal H_\infty} = \max_{\|\mathbf x\|_{\ell_2}\le 1}\|\mathbf\Phi\mathbf x\|_{\ell_2}, \quad \text{induced by } \|\mathbf x\|_{\ell_2}^2 = \sum_{t=0}^{\infty}\|x_t\|_2^2$$
Upper bounding this nonconvex objective leads to the quasi-convex synthesis below. Upper bounds follow by submultiplicativity: up to a dimension-free constant, $\|\mathbf\Delta\|_{\mathcal H_\infty} \lesssim \left\|\begin{bmatrix}\varepsilon_A\mathbf\Phi_s\\ \varepsilon_B\mathbf\Phi_a\end{bmatrix}\right\|_{\mathcal H_\infty} \le \gamma$, and therefore $\left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\mathbf\Phi(I-\mathbf\Delta)^{-1}\right\|_{\mathcal H_2} \le \frac{1}{1-\gamma}\left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\mathbf\Phi\right\|_{\mathcal H_2}$.
$$\min_{\mathbf\Phi}\;\max_{\substack{\|\Delta_A\|\le\varepsilon_A\\ \|\Delta_B\|\le\varepsilon_B}}\;\left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\mathbf\Phi(I-\mathbf\Delta)^{-1}\right\|_{\mathcal H_2} \quad \text{s.t.}\; \mathbf\Phi\in\mathrm{Affine}(\hat A, \hat B), \qquad \mathbf\Delta = \begin{bmatrix}\Delta_A & \Delta_B\end{bmatrix}\mathbf\Phi$$
$$\min_{\mathbf\Phi,\,\gamma}\;\frac{1}{1-\gamma}\left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\mathbf\Phi\right\|_{\mathcal H_2} \quad \text{s.t.}\; \begin{bmatrix}zI - \hat A & -\hat B\end{bmatrix}\mathbf\Phi = I, \qquad \left\|\begin{bmatrix}\varepsilon_A\mathbf\Phi_s\\ \varepsilon_B\mathbf\Phi_a\end{bmatrix}\right\|_{\mathcal H_\infty} \le \gamma$$
$$\hat{\mathbf\Phi} = \operatorname*{argmin}_{\mathbf\Phi,\,\gamma}\;\frac{1}{1-\gamma}\left\|\begin{bmatrix}Q^{1/2} & \\ & R^{1/2}\end{bmatrix}\mathbf\Phi\right\|_{\mathcal H_2} \quad \text{s.t.}\; \begin{bmatrix}zI - \hat A & -\hat B\end{bmatrix}\mathbf\Phi = I, \qquad \left\|\begin{bmatrix}\varepsilon_A\mathbf\Phi_s\\ \varepsilon_B\mathbf\Phi_a\end{bmatrix}\right\|_{\mathcal H_\infty} \le \gamma$$
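Note that $\gamma$ enters quasi-convexly: for each fixed $\gamma \in [0, 1)$ the problem is convex, so the synthesis can be run as a one-dimensional outer search. A sketch of that loop, where solve_inner is a hypothetical function implementing the convex subproblem (e.g. a finite-horizon truncation in cvxpy):

import numpy as np

def robust_sls(solve_inner, gammas=np.linspace(0.05, 0.95, 19)):
    # solve_inner(gamma) -> (h2_cost, Phi) solves, for fixed gamma,
    #   min ||blkdiag(Q^1/2, R^1/2) Phi||_H2
    #   s.t. [zI - A_hat, -B_hat] Phi = I, ||(eps_A Phi_s; eps_B Phi_a)||_Hinf <= gamma
    # and returns (np.inf, None) when infeasible.
    best_val, best_Phi = np.inf, None
    for g in gammas:
        cost, Phi = solve_inner(g)
        if cost / (1.0 - g) < best_val:
            best_val, best_Phi = cost / (1.0 - g), Phi
    return best_Phi, best_val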
Informal Theorem (Suboptimality):
For $\hat{\mathbf\Phi}$ synthesized as above and $\mathbf\Phi^\star$ the true optimal system response,

$$J(\hat{\mathbf\Phi}) - J(\mathbf\Phi^\star) \lesssim J(\mathbf\Phi^\star)\left\|\begin{bmatrix}\varepsilon_A\mathbf\Phi_s^\star\\ \varepsilon_B\mathbf\Phi_a^\star\end{bmatrix}\right\|_{\mathcal H_\infty}$$

and, for least-squares estimates from $N$ samples (so that $\varepsilon_A, \varepsilon_B \lesssim \sqrt{(m+n)/N}$),

$$J(\hat{\mathbf\Phi}) - J(\mathbf\Phi^\star) \lesssim J(\mathbf\Phi^\star)\,\|\mathbf\Phi^\star\|_{\mathcal H_\infty}\sqrt{\frac{m+n}{N}}$$
Using an explore-then-commit algorithm, we have $R(T) = R_{\text{explore}}(N) + R_{\text{commit}}(N, T)$.
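Balancing the two terms (a standard back-of-the-envelope, constants and logarithms suppressed): exploring for $N$ steps incurs regret $O(N)$, while committing pays the per-step excess cost from the suboptimality theorem, so

$$R(T) \lesssim N + T\sqrt{\frac{m+n}{N}}, \qquad N \asymp T^{2/3} \;\Rightarrow\; R(T) \lesssim T^{2/3}.$$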
References: System Level Synthesis by Anderson, Doyle, Low, and Matni; Ch. 2-3 of Machine Learning in Feedback Systems by Sarah Dean.
By Sarah Dean