We aim to optimize the total reward over a finite horizon.
For now, assume the dynamics are deterministic.
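Spelled out (my formalization of the setup; the horizon convention \(t = 0, \dots, T\) matches the rest of the notes):
\[ \max_{a_0, \dots, a_T} \; \sum_{t=0}^{T} r(s_t, a_t) \quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t). \]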
What if \(r\) and \(f\) are known and "nice"?
Then we can express the optimal last action \(a_T\) as a function of \(s_T\):
Now we can write the value function at the last time step:
Going one step backward:
Apply recursively until \(t = 0 \) where the state \(s_0\) is known!
Then we can go forward:
Using the fact that compositions of "nice" functions stay "nice":
Dynamics are linear: \( f(s_t, a_t) = F_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + f_t \)
Rewards are quadratic: \( r(s_t, a_t) = \frac{1}{2} \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T R_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T r_t \)
The last Q-function is just the final-step reward: \( Q(s_T, a_T) = r(s_T, a_T) = \frac{1}{2} \begin{bmatrix} s_T \\ a_T \end{bmatrix}^T R_T \begin{bmatrix} s_T \\ a_T \end{bmatrix} + \begin{bmatrix} s_T \\ a_T \end{bmatrix}^T r_T \)
Equate the gradient to zero:
Get the optimal last time-step behavior
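Concretely, partitioning \(R_T\) and \(r_T\) into state and action blocks (the block notation here is mine):
\[ R_T = \begin{bmatrix} R_{s_T, s_T} & R_{s_T, a_T} \\ R_{a_T, s_T} & R_{a_T, a_T} \end{bmatrix}, \qquad r_T = \begin{bmatrix} r_{s_T} \\ r_{a_T} \end{bmatrix}, \]
\[ \nabla_{a_T} Q(s_T, a_T) = R_{a_T, s_T} s_T + R_{a_T, a_T} a_T + r_{a_T} = 0 \;\; \Longrightarrow \;\; a_T = K_T s_T + k_T, \]
\[ K_T = -R_{a_T, a_T}^{-1} R_{a_T, s_T}, \qquad k_T = -R_{a_T, a_T}^{-1} r_{a_T}, \]
assuming \(R_{a_T, a_T}\) is invertible (negative definite, so the stationary point is a maximum).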
The last time-step value is also quadratic in \(s_T\)
The Q function at \(T - 1\) is again quadratic in both \(s_{T-1}\) and \(a_{T-1}\):
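Written out (the \(V_T, v_T, Q_{T-1}, q_{T-1}\) names are my shorthand): substituting \(a_T = K_T s_T + k_T\) back into the last Q gives
\[ V(s_T) = \text{const} + \frac{1}{2} s_T^T V_T s_T + s_T^T v_T, \]
\[ V_T = R_{s_T, s_T} + R_{s_T, a_T} K_T + K_T^T R_{a_T, s_T} + K_T^T R_{a_T, a_T} K_T, \qquad v_T = r_{s_T} + R_{s_T, a_T} k_T + K_T^T r_{a_T} + K_T^T R_{a_T, a_T} k_T, \]
and, since \(Q(s_{T-1}, a_{T-1}) = r(s_{T-1}, a_{T-1}) + V(f(s_{T-1}, a_{T-1}))\), substituting the linear dynamics \(s_T = F_{T-1} \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix} + f_{T-1}\) gives
\[ Q(s_{T-1}, a_{T-1}) = \text{const} + \frac{1}{2} \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix}^T Q_{T-1} \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix} + \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix}^T q_{T-1}, \]
\[ Q_{T-1} = R_{T-1} + F_{T-1}^T V_T F_{T-1}, \qquad q_{T-1} = r_{T-1} + F_{T-1}^T V_T f_{T-1} + F_{T-1}^T v_T. \]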
Thus you get an analytical formula for the policy at each time step: \( a_t = K_t s_t + k_t \)
Given: \(s_0, F_t, f_t, R_t, r_t\) for all \(t\)
The result is a closed-loop policy: \( a_t = K_t s_t + k_t \) uses whatever state is actually reached at time \(t\).
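As a concrete sketch, the backward and forward recursions above can be written in a few lines of NumPy (the function names, input shapes, and \(t = 0 \dots T\) indexing are my own choices, not the notes'):

```python
import numpy as np

def lqr_backward(F, f, R, r, n_s):
    """Backward recursion for the LQR problem above.

    F[t] (n_s x (n_s+n_a)), f[t] (n_s,): dynamics  s_{t+1} = F_t [s_t; a_t] + f_t, for t = 0..T-1
    R[t] ((n_s+n_a) x (n_s+n_a)), r[t] (n_s+n_a,): reward terms, for t = 0..T
    Returns the gains of the per-step policy a_t = K_t s_t + k_t.
    """
    T = len(R) - 1
    Ks, ks = [None] * (T + 1), [None] * (T + 1)
    V, v = np.zeros((n_s, n_s)), np.zeros(n_s)   # no value beyond the last step
    for t in range(T, -1, -1):
        # Q at step t: reward now plus the (quadratic) value of the next state.
        if t == T:
            Q, q = R[t], r[t]
        else:
            Q = R[t] + F[t].T @ V @ F[t]
            q = r[t] + F[t].T @ (V @ f[t] + v)
        Q_ss, Q_sa = Q[:n_s, :n_s], Q[:n_s, n_s:]
        Q_as, Q_aa = Q[n_s:, :n_s], Q[n_s:, n_s:]
        q_s, q_a = q[:n_s], q[n_s:]
        # Set the gradient w.r.t. a_t to zero (Q_aa assumed invertible,
        # negative definite for a maximum): a_t = K_t s_t + k_t.
        Q_aa_inv = np.linalg.inv(Q_aa)
        K, k = -Q_aa_inv @ Q_as, -Q_aa_inv @ q_a
        # Plug the optimal action back in: the value stays quadratic in s_t.
        V = Q_ss + Q_sa @ K + K.T @ Q_as + K.T @ Q_aa @ K
        v = q_s + Q_sa @ k + K.T @ q_a + K.T @ Q_aa @ k
        Ks[t], ks[t] = K, k
    return Ks, ks

def lqr_forward(s0, F, f, Ks, ks):
    """Forward pass: run the closed-loop policy through the (linear) dynamics."""
    states, actions = [np.asarray(s0)], []
    for t in range(len(Ks)):
        actions.append(Ks[t] @ states[t] + ks[t])
        if t < len(Ks) - 1:
            states.append(F[t] @ np.concatenate([states[t], actions[t]]) + f[t])
    return states, actions
```

Note that the backward pass only needs the given matrices; no states have to be simulated until the forward pass.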
What if the dynamics \(f\) and the reward \(r\) are not linear-quadratic? Approximate them locally around a nominal trajectory \((\hat s_t, \hat a_t)\) and run LQR on the approximation.
Taylor expansion:
\( f(s_t, a_t) \approx f(\hat s_t, \hat a_t) + \nabla f (\hat s_t, \hat a_t) \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix}\)
\( r(s_t, a_t) \approx r(\hat s_t, \hat a_t) + \nabla r (\hat s_t, \hat a_t) \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix} + \frac{1}{2} \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix}^T \nabla^2 r(\hat s_t, \hat a_t) \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix} \)
Simplify a bit by collecting the expansion into the LQR form, with \(F_t = \nabla f(\hat s_t, \hat a_t)\) and \(R_t = \nabla^2 r(\hat s_t, \hat a_t)\), and with \(f_t\) and \(r_t\) absorbing the leftover constant and linear terms:
\( \tilde f_t(s_t, a_t) = F_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + f_t \)
\( \tilde r_t (s_t, a_t) = \frac{1}{2} \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T R_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T r_t \)
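One way to see where \(F_t, f_t, R_t, r_t\) come from is to compute them numerically. A rough finite-difference sketch (in practice automatic differentiation would be used; `f_fn`, `r_fn`, and the step size are my assumptions):

```python
import numpy as np

def linearize(f_fn, r_fn, s_hat, a_hat, eps=1e-4):
    """Taylor-expand the dynamics (1st order) and reward (2nd order) around
    (s_hat, a_hat), folded into the tilde-f / tilde-r form written in [s_t; a_t]."""
    x0 = np.concatenate([s_hat, a_hat])
    n_s, n = len(s_hat), len(x0)
    f_joint = lambda x: np.asarray(f_fn(x[:n_s], x[n_s:]))  # dynamics of stacked [s; a]
    r_joint = lambda x: float(r_fn(x[:n_s], x[n_s:]))       # reward of stacked [s; a]

    # Jacobian of the dynamics, gradient/Hessian of the reward (central differences).
    F = np.zeros((n_s, n))
    g = np.zeros(n)
    H = np.zeros((n, n))
    for i in range(n):
        d = np.zeros(n); d[i] = eps
        F[:, i] = (f_joint(x0 + d) - f_joint(x0 - d)) / (2 * eps)
        g[i] = (r_joint(x0 + d) - r_joint(x0 - d)) / (2 * eps)
        for j in range(n):
            e = np.zeros(n); e[j] = eps
            H[i, j] = (r_joint(x0 + d + e) - r_joint(x0 + d - e)
                       - r_joint(x0 - d + e) + r_joint(x0 - d - e)) / (4 * eps ** 2)

    # Rewrite the expansions in terms of [s_t; a_t], constants absorbed into the offsets:
    #   f(s, a) ~ F_t [s; a] + f_t              with f_t = f(s_hat, a_hat) - F_t [s_hat; a_hat]
    #   r(s, a) ~ 1/2 [s; a]^T R_t [s; a] + [s; a]^T r_t + const,
    #             with R_t = Hessian of r, r_t = grad r - R_t [s_hat; a_hat]
    F_t, f_t = F, f_joint(x0) - F @ x0
    R_t, r_t = H, g - H @ x0
    return F_t, f_t, R_t, r_t
```

With these ingredients, the LQR recursion from the previous section applies unchanged to the local approximation.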
Algorithm:
Initialize \(( \hat s_0, \hat a_0, \hat s_1, \dots) \) somehow
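For completeness, one plausible shape of the rest of the loop, reusing the `linearize` and `lqr_backward` sketches above (the update rule and fixed iteration count are simplifications; practical implementations add a line search and regularization):

```python
import numpy as np

def ilqr(f_fn, r_fn, s0, a_init, n_iters=20):
    """Repeatedly: Taylor-expand around the nominal trajectory, solve the
    resulting LQR problem, and roll the new policy through the true dynamics."""
    n_s = len(s0)
    a_hat = [np.asarray(a) for a in a_init]                 # nominal actions a_0 .. a_T
    s_hat = [np.asarray(s0)]
    for t in range(len(a_hat) - 1):                         # nominal states s_0 .. s_T
        s_hat.append(np.asarray(f_fn(s_hat[t], a_hat[t])))

    for _ in range(n_iters):
        # Local LQR ingredients at every point of the nominal trajectory.
        F, f, R, r = map(list, zip(*[linearize(f_fn, r_fn, s, a)
                                     for s, a in zip(s_hat, a_hat)]))
        Ks, ks = lqr_backward(F, f, R, r, n_s)
        # Forward pass with the *true* dynamics gives the next nominal trajectory.
        s_new, a_new = [np.asarray(s0)], []
        for t in range(len(a_hat)):
            a_new.append(Ks[t] @ s_new[t] + ks[t])
            if t < len(a_hat) - 1:
                s_new.append(np.asarray(f_fn(s_new[t], a_new[t])))
        s_hat, a_hat = s_new, a_new
    return s_hat, a_hat
```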