Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Continuous Control
3. Linear Dynamics
Goal: achieve high cumulative reward
maximize \(\displaystyle \mathbb E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]\) or \(\displaystyle \mathbb E\left[\sum_{t=0}^{H-1} r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, [H~\text{or}~\gamma]\}\)
Policy Iteration
Value Iteration
Exactly compute the optimal policy
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)
minimize \(\displaystyle\sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H)\)
s.t. \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)
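The objective above can be evaluated for a fixed policy by rolling the deterministic dynamics forward and summing the stage costs plus the terminal cost. The sketch below is illustrative only: the function name `rollout_cost` and the scalar example (dynamics \(s_{t+1}=s_t+a_t\), quadratic costs, policy \(a_t=-0.5s_t\)) are assumptions, not part of the lecture.

```python
def rollout_cost(f, policies, costs, terminal_cost, s0):
    """Sum of c_t(s_t, a_t) for t = 0..H-1 plus c_H(s_H), with a_t = pi_t(s_t)."""
    s = s0
    total = 0.0
    for pi_t, c_t in zip(policies, costs):
        a = pi_t(s)          # a_t = pi_t(s_t)
        total += c_t(s, a)   # accumulate the stage cost
        s = f(s, a)          # s_{t+1} = f(s_t, a_t)
    return total + terminal_cost(s)


# Hypothetical scalar example with horizon H = 3.
H = 3
f = lambda s, a: s + a
policies = [lambda s: -0.5 * s] * H
costs = [lambda s, a: s**2 + a**2] * H
terminal_cost = lambda s: s**2

total = rollout_cost(f, policies, costs, terminal_cost, s0=1.0)
print(total)
```

Because the dynamics and the policy are deterministic, a single rollout gives the exact cost; no expectation is needed.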
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c,( f,\mathcal D_w), [H,\gamma,\mathsf{avg}]\}\)
\(f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t\)
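A minimal simulation of the linear dynamics above. The reading of the state as (position, velocity), of \(a_t\) as a force, and of \(m\) as a mass is an assumption based on the form of the matrices. The helper names `double_integrator` and `step`, and the values \(\Delta=0.1\), \(m=2\), are hypothetical.

```python
import numpy as np


def double_integrator(delta, m):
    """Return (A, B) for f(s, a) = A s + B a with time step delta and parameter m."""
    A = np.array([[1.0, delta], [0.0, 1.0]])
    B = np.array([[0.0], [delta / m]])
    return A, B


def step(A, B, s, a):
    """One step of s_{t+1} = A s_t + B a_t for a scalar action a."""
    return A @ s + B @ np.atleast_1d(a)


A, B = double_integrator(delta=0.1, m=2.0)
s = np.array([0.0, 0.0])
for _ in range(10):
    s = step(A, B, s, 1.0)  # constant action a_t = 1
print(s)
```

With a constant action, the second state component grows linearly in \(t\) and the first grows quadratically, which is the behavior expected of a double integrator.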
You have investments in two companies.
Setting 1: Each dollar of investment in company \(i\) leads to \(\lambda_i\) returns. The companies are independent.
\(0<\lambda_2<\lambda_1<1\)
\(0<\lambda_2<1<\lambda_1\)
\(1<\lambda_2<\lambda_1\)
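The three cases for Setting 1 can be explored numerically with a diagonal matrix \(A=\mathrm{diag}(\lambda_1,\lambda_2)\). This is a sketch under assumed values: the specific \(\lambda\)'s, the case labels, the horizon of 50 steps, and the helper name `simulate` are illustrative choices.

```python
import numpy as np


def simulate(A, s0, T):
    """Return the trajectory s_0, ..., s_T of s_{t+1} = A s_t as a (T+1, n) array."""
    s = np.asarray(s0, dtype=float)
    traj = [s]
    for _ in range(T):
        s = A @ s
        traj.append(s)
    return np.array(traj)


# (lambda_1, lambda_2) for each of the three cases, with lambda_2 < lambda_1.
cases = {
    "both shrink": (0.9, 0.5),
    "one grows": (1.1, 0.5),
    "both grow": (1.1, 1.05),
}
final = {
    name: simulate(np.diag(lams), [1.0, 1.0], T=50)[-1]
    for name, lams in cases.items()
}
for name, s_T in final.items():
    print(name, s_T)
```

Each coordinate evolves independently as \(\lambda_i^t\), so a coordinate decays when \(\lambda_i<1\) and grows when \(\lambda_i>1\).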
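If \(s_0 = v\) is an eigenvector of \(A\) with eigenvalue \(\lambda\), then \(s_{1} = As_0 = \lambda s_0\)
Repeating this step gives \(s_t = \lambda^t v\)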
If \(A\) is diagonalizable, then any \(s_0\) can be written as a linear combination of eigenvectors \(s_0 = \sum_{i=1}^{n_s} \alpha_i v_i\)
\(s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i\)
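The modal formula \(s_t = \sum_i \alpha_i\lambda_i^t v_i\) can be checked against direct matrix powers. The matrix `A`, the initial state `s0`, and the step count `t` below are hypothetical; the matrix is chosen to have distinct eigenvalues so that it is diagonalizable.

```python
import numpy as np

A = np.array([[0.9, 0.2], [0.0, 0.5]])  # hypothetical example, eigenvalues 0.9 and 0.5
s0 = np.array([1.0, 1.0])
t = 7

lams, V = np.linalg.eig(A)        # columns of V are the eigenvectors v_i
alpha = np.linalg.solve(V, s0)    # coefficients such that s0 = sum_i alpha_i v_i

s_t_modes = V @ (alpha * lams**t)                 # sum_i alpha_i lambda_i^t v_i
s_t_direct = np.linalg.matrix_power(A, t) @ s0    # A^t s0

print(s_t_modes, s_t_direct)
```

Solving \(V\alpha = s_0\) is only well posed when the eigenvectors form a basis, which is exactly the diagonalizability assumption.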
The effect of internal dynamics $$ s_{t+1} = As_t$$
Setting 2: The companies are interdependent: each dollar of investment in company \(i\) leads to a return of \(\alpha\) for company \(i\), and it also leads to a return of \(\beta\) (when \(i=1\)) or a loss of \(\beta\) (when \(i=2\)) for the other company.
\(0<\alpha^2+\beta^2<1\)
\(1<\alpha^2+\beta^2\)
$$\begin{bmatrix}1\\0\end{bmatrix} \to \begin{bmatrix}\alpha\\ \beta\end{bmatrix} $$
rotation by \(\arctan(\beta/\alpha)\)
scale by \(\sqrt{\alpha^2+\beta^2}\)
\(\lambda = \alpha \pm i \beta\)
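A numerical check of the rotation-and-scaling picture, using the matrix \(\begin{bmatrix}\alpha & -\beta\\ \beta & \alpha\end{bmatrix}\) for Setting 2. The values \(\alpha=0.3\), \(\beta=0.4\) are an illustrative assumption (they give a scale factor of \(0.5\)).

```python
import numpy as np

alpha, beta = 0.3, 0.4  # hypothetical values with alpha^2 + beta^2 < 1
A = np.array([[alpha, -beta], [beta, alpha]])

eigs = np.linalg.eigvals(A)       # expected: alpha +/- i beta
r = np.hypot(alpha, beta)         # scale factor sqrt(alpha^2 + beta^2)
theta = np.arctan2(beta, alpha)   # rotation angle; equals arctan(beta/alpha) for alpha > 0

# A factors as (scale) x (rotation).
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])

# The norm of the state is multiplied by r at every step.
s = np.array([1.0, 0.0])
for _ in range(5):
    s = A @ s
norm_after_5 = np.linalg.norm(s)

print(eigs, r, theta, norm_after_5)
```

Since every step multiplies the norm by \(\sqrt{\alpha^2+\beta^2}\), the trajectory spirals inward when this quantity is below one and outward when it is above one.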
Setting 3: Each dollar of investment in company \(i\) leads to \(\lambda\) return for company \(i\), and company \(2\) is a subsidiary of company \(1\), which thus accumulates its returns as well.
\(0<\lambda<1\)
\(1<\lambda\)
$$ \left(\begin{bmatrix} \lambda & \\ & \lambda\end{bmatrix} + \begin{bmatrix} & 1\\ & \end{bmatrix} \right)^t$$
$$ =\begin{bmatrix} \lambda^t & t\lambda^{t-1}\\ & \lambda^t\end{bmatrix} $$
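The closed form for the power of this \(2\times 2\) block can be verified numerically. The helper name `jordan_power` and the values \(\lambda=0.9\), \(t=12\) are hypothetical.

```python
import numpy as np


def jordan_power(lam, t):
    """Closed form of [[lam, 1], [0, lam]]^t for an integer t >= 1."""
    return np.array([[lam**t, t * lam**(t - 1)],
                     [0.0, lam**t]])


lam, t = 0.9, 12
J = np.array([[lam, 1.0], [0.0, lam]])

direct = np.linalg.matrix_power(J, t)  # repeated matrix multiplication
closed = jordan_power(lam, t)          # formula from the slide

print(direct)
print(closed)
```

The off-diagonal term \(t\lambda^{t-1}\) still decays when \(0<\lambda<1\), but it grows linearly in \(t\) when \(\lambda=1\), which is why the boundary case needs separate treatment.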
Example 1: \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t \)
General case: diagonalizable, real eigenvalues (geometric \(=\) algebraic multiplicity)
Example 2: \(\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta\\\beta & \alpha\end{bmatrix} s_t \)
General case: pair of complex eigenvalues
\(\lambda = \alpha \pm i \beta\)
Example 3: \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1\\ & \lambda\end{bmatrix} s_t \)
General case: non-diagonalizable (geometric \(<\) algebraic multiplicity)
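Theorem: Let \(\{\lambda_i\}_{i=1}^{n_s}\subset \mathbb C\) be the eigenvalues of \(A\).
Then for \(s_{t+1}=As_t\), the equilibrium \(s_{eq}=0\) is
asymptotically stable (\(s_t\to 0\) for every \(s_0\)) if and only if \(|\lambda_i|<1\) for all \(i\)
unstable if \(|\lambda_i|>1\) for some \(i\)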
Proof
If \(A\) is diagonalizable, then any \(s_0\) can be written as a linear combination of eigenvectors \(s_0 = \sum_{i=1}^{n_s} \alpha_i v_i\)
By definition, \(Av_i = \lambda_i v_i\)
Therefore, \(s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i\)
Thus \(s_t\to 0\) for every \(s_0\) if and only if all \(|\lambda_i|<1\), and if any \(|\lambda_i|>1\), then \(\|s_t\|\to\infty\) whenever the corresponding \(\alpha_i\neq 0\)
Proof in the non-diagonalizable case is out of scope, but it follows using the Jordan Normal Form
We call \(\max_i|\lambda_i|=1\) "marginally (un)stable"
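The eigenvalue conditions above translate into a check on \(\max_i|\lambda_i|\). The sketch below is one possible way to code it; the function name `classify`, the string labels, and the tolerance `tol` are assumptions. In floating point the boundary case \(\max_i|\lambda_i|=1\) can only be detected up to a tolerance.

```python
import numpy as np


def classify(A, tol=1e-9):
    """Classify s_{t+1} = A s_t by the largest eigenvalue magnitude of A."""
    rho = np.max(np.abs(np.linalg.eigvals(A)))
    if rho < 1 - tol:
        return "stable"      # all |lambda_i| < 1, so s_t -> 0
    if rho > 1 + tol:
        return "unstable"    # some |lambda_i| > 1
    return "marginal"        # max |lambda_i| = 1, up to tol


print(classify(np.diag([0.9, 0.5])))
print(classify(np.diag([1.1, 0.5])))
print(classify(np.array([[0.0, -1.0], [1.0, 0.0]])))  # pure rotation, |lambda| = 1
```

The check uses only eigenvalue magnitudes, so it applies equally to real and complex eigenvalues.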
Full dynamics depend on actions $$ s_{t+1} = As_t+Ba_t $$
Linear policy defined by \(a_t=Ks_t\): $$ s_{t+1} = As_t+BKs_t = (A+BK)s_t$$
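A linear policy changes the eigenvalues of the closed-loop matrix \(A+BK\). The example below is a sketch with assumed numbers: \(A\) and \(B\) take the two-state form \(\begin{bmatrix}1&\Delta\\0&1\end{bmatrix}\), \(\begin{bmatrix}0\\ \Delta/m\end{bmatrix}\) with \(\Delta=m=1\), and the gain `K` is a hypothetical choice that places the closed-loop eigenvalues at \(0\) and \(0.5\).

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])  # Delta = 1
B = np.array([[0.0], [1.0]])            # Delta / m = 1
K = np.array([[-0.5, -1.5]])            # hypothetical gain, a_t = K s_t


def spectral_radius(M):
    """Largest eigenvalue magnitude of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))


A_cl = A + B @ K                        # closed-loop matrix A + BK
rho_open = spectral_radius(A)
rho_closed = spectral_radius(A_cl)

# Roll out the closed loop s_{t+1} = (A + BK) s_t from a nonzero start.
s = np.array([1.0, 0.0])
for _ in range(30):
    s = A_cl @ s

print(rho_open, rho_closed, s)
```

Here the open-loop matrix has all eigenvalues on the unit circle, while the closed-loop matrix has all eigenvalues strictly inside it, so the state is driven to zero.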