Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Outline:
Infinite Horizon Discounted MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
Finite Horizon MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}\)
ex - Pac-Man as MDP
Optimal Control Problem
ex - UAV as OCP
examples:
Policy results in a trajectory \(\tau = (s_0, a_0, s_1, a_1, ... )\)
[diagram: discounted return along a trajectory, \(r(s_0, a_0) + \gamma\, r(s_1, a_1) + \gamma^2\, r(s_2, a_2) + \dots = \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\)]
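A minimal sketch of sampling a trajectory from a tabular MDP \(\{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) and computing its (truncated) discounted return; the two-state MDP below is a made-up illustration, not an example from lecture.

import numpy as np

# Made-up 2-state, 2-action MDP (illustration only).
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s'] = transition probability
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0],                    # r[s, a] = reward
              [0.0, 2.0]])

def rollout(policy, s0, horizon=200, rng=np.random.default_rng(0)):
    """Sample a trajectory tau = (s_0, a_0, s_1, a_1, ...) under a stochastic policy."""
    traj, s = [], s0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy[s])   # a_t ~ pi(s_t)
        traj.append((s, a))
        s = rng.choice(n_states, p=P[s, a])      # s_{t+1} ~ P(s_t, a_t)
    return traj

def discounted_return(traj):
    """r(s_0,a_0) + gamma*r(s_1,a_1) + gamma^2*r(s_2,a_2) + ... (truncated at len(traj))."""
    return sum(gamma**t * r[s, a] for t, (s, a) in enumerate(traj))

pi_uniform = np.full((n_states, n_actions), 0.5)  # uniformly random policy
print(discounted_return(rollout(pi_uniform, s0=0)))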
Food for thought:
[diagram: trajectory \(s_0, a_0, s_1, a_1, s_2, a_2, \dots\)]
examples:
...
...
...
Recursive Bellman Expectation Equation:
...
...
...
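For reference, a standard statement of the Bellman expectation equation in the notation above (added here; check against the form used in lecture):
\(V^\pi(s) = \mathbb{E}_{a\sim \pi(s)}\big[\, r(s,a) + \gamma\, \mathbb{E}_{s'\sim P(s,a)}[V^\pi(s')] \,\big]\)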
Recall: Icy navigation (PSet 2, lecture example), Prelim question
Recall: Verifying optimality in Icy Street example, Prelim
Food for thought: rigorous argument for optimal policy?
ex - UAV
Recall: PSet 4 and Prelim question about cumulative cost and stability
Model-Based RL
\(h_1=h\) w.p. \(\propto \gamma^h\)
\(s_t\)
\(a_t\sim \pi(s_t)\)
\(r_t\sim r(s_t, a_t)\)
\(s_{t+1}\sim P(s_t, a_t)\)
\(a_{t+1}\sim \pi(s_{t+1})\)
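A minimal sketch of the sampling scheme above: draw a random horizon \(h\) with probability \(\propto \gamma^h\) (a geometric distribution), roll the policy forward, and return the state-action pair at time \(h\); function and variable names are illustrative, and the tabular representation is an assumption.

import numpy as np

def sample_from_discounted_distribution(P, pi, s0, gamma, rng=np.random.default_rng(0)):
    """Sample (s_h, a_h) where h = 0, 1, 2, ... w.p. (1 - gamma) * gamma^h.

    P[s, a] is a next-state distribution, pi[s] an action distribution (tabular sketch).
    """
    h = rng.geometric(1.0 - gamma) - 1           # h ~ Geometric(1 - gamma), shifted to start at 0
    s = s0
    for _ in range(h):
        a = rng.choice(len(pi[s]), p=pi[s])      # a_t ~ pi(s_t)
        s = rng.choice(P.shape[-1], p=P[s, a])   # s_{t+1} ~ P(s_t, a_t)
    a = rng.choice(len(pi[s]), p=pi[s])          # a_h ~ pi(s_h)
    return s, a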
Food for thought: how to compute off-policy gradient estimate?
\( \max ~J(\theta)\)
\(\text{s.t.} ~~d_{KL}(\theta, \theta_0)\leq \delta \)
\( \max ~\nabla J(\theta_0)^\top(\theta-\theta_0)\)
\(\text{s.t.} ~~(\theta-\theta_0)^\top F_{\theta_0} (\theta-\theta_0) \leq \delta\)
\(\theta_{i+1} = \theta_i + \alpha F^{-1}_{i} g_i\)
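A minimal numpy sketch of the update \(\theta_{i+1} = \theta_i + \alpha F^{-1}_{i} g_i\), assuming estimates of the gradient \(g_i\) and Fisher matrix \(F_{\theta_i}\) are already available (how to estimate them is not shown); the damping term is an added numerical convenience, not part of the update above.

import numpy as np

def natural_gradient_step(theta, g, F, alpha, damping=1e-3):
    """One natural policy gradient update: theta <- theta + alpha * F^{-1} g.

    g: estimated policy gradient at theta; F: estimated Fisher information matrix.
    Small damping keeps the linear solve well conditioned.
    """
    direction = np.linalg.solve(F + damping * np.eye(len(theta)), g)
    return theta + alpha * direction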
Food for thought: performance/regret of softmax policy?
Imitation Learning with BC
Food for thought: Expert in LQR setting? (Linear regression)
[diagram: supervised learning maps a dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) of expert state-action pairs to a policy \(\pi(\cdot)\)]
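A minimal behavior cloning sketch: fit a linear policy to expert state-action pairs by least squares (linear regression, as hinted at in the food-for-thought above); the data and shapes below are placeholders.

import numpy as np

def behavior_cloning(states, expert_actions):
    """Fit a linear policy a = K s by least squares on expert data (x_i = s_i, y_i = pi*(s_i))."""
    # states: (M, state_dim), expert_actions: (M, action_dim)
    K, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)
    return lambda s: s @ K          # learned policy pi(s)

# Placeholder expert data (illustration only).
rng = np.random.default_rng(0)
states = rng.normal(size=(100, 4))
expert_actions = states @ rng.normal(size=(4, 2))   # pretend the expert is linear
pi = behavior_cloning(states, expert_actions)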
Imitation Learning with DAgger
[diagram: DAgger loop]
Supervised learning: fit policy \(\pi\) on dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\)
Execute \(\pi\): collect states \(s_0, s_1, s_2, ...\)
Query expert: \(\pi^*(s_0), \pi^*(s_1), ...\)
Aggregate: add \((x_i = s_i, y_i = \pi^*(s_i))\) to \(\mathcal D\) and repeat
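A minimal DAgger sketch using the same linear least-squares fit as the behavior cloning example; the expert \(\pi^*\) and the environment transition are assumed to be given callables, and the iteration counts are arbitrary.

import numpy as np

def dagger(env_step, expert, s0, n_iters=5, rollout_len=50):
    """DAgger: alternate between fitting on the aggregated dataset and rolling out the learner.

    env_step(s, a) -> next state; expert(s) -> expert action (both assumed given).
    """
    states, actions = [s0], [expert(s0)]            # seed dataset with one expert label
    for _ in range(n_iters):
        X, Y = np.array(states), np.array(actions)
        K, *_ = np.linalg.lstsq(X, Y, rcond=None)   # supervised learning step (linear policy)
        s = s0
        for _ in range(rollout_len):                # execute the learned policy
            s = env_step(s, s @ K)
            states.append(s)                        # aggregate: x_i = s_i,
            actions.append(expert(s))               #            y_i = pi*(s_i)
    return K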
maximize \(\mathsf{Ent}(\pi)\)
s.t. \(\pi\) consistent with expert data
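One common way to make "consistent with expert data" precise is feature-expectation matching (an assumption here, not necessarily the constraint used in lecture), where \(\phi\) denotes a feature map introduced only for this formulation:
\(\max_\pi~\mathsf{Ent}(\pi) \quad \text{s.t.} \quad \mathbb{E}_{\pi}[\phi(s,a)] = \mathbb{E}_{\text{expert}}[\phi(s,a)]\)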