Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
0. Announcements
1. Review
2. Questions
HW4 due tonight
5789 Paper Review Assignment due Friday
Course evaluations
Final Monday 5/16 at 7pm in Statler Hall 196
Closed-book, definition/equation sheet provided
Focus: Units 1-4
Study Materials: Lecture Notes, HWs, Prelim, Review Slides
Outline:
Participation point: PollEV.com/sarahdean011
Infinite Horizon Discounted MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
Finite Horizon MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}\)
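For reference, the objective in both settings is to choose a policy maximizing expected cumulative reward (standard form; expectation over the dynamics \(P\) and the policy):
\[
\text{discounted: } \mathbb E\Big[\sum_{t=0}^{\infty}\gamma^t\, r(s_t,a_t)\Big],
\qquad
\text{finite horizon: } \mathbb E\Big[\sum_{t=0}^{H-1} r(s_t,a_t)\Big],\quad s_0\sim\mu_0 .
\]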
ex - Pac-Man as MDP
Optimal Control Problem
ex - UAV as OCP
examples:
Policy results in a trajectory \(\tau = (s_0, a_0, s_1, a_1, ... )\)
(diagram: example trajectories \(s_0, a_0, s_1, a_1, s_2, a_2, \dots\))
Food for thought:
examples:
Recursive Bellman Expectation Equation:
\(V^\pi(s) = \mathbb E_{a\sim \pi(s)}\big[r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}[V^\pi(s')]\big]\)
\(Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}\big[\mathbb E_{a'\sim \pi(s')}[Q^\pi(s',a')]\big]\)
Recall: Gardening MDP HW problem, Prelim
Recall: Gardening MDP, Prelim (verifying optimality)
Food for thought: rigorous argument for optimal policy?
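One standard route to a rigorous argument: \(\pi\) is optimal if and only if its value function satisfies the Bellman optimality equation,
\[
V^\pi(s) = \max_{a\in\mathcal A}\Big[r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}[V^\pi(s')]\Big] \quad \text{for all } s,
\]
so a candidate policy can be verified by checking that \(\pi(s)\) attains the maximum on the right-hand side at every state.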
Basis for approximation-based algorithms (local linearization and iLQR)
Recall: Prelim question on linear policy \(a_t = K s_t\)
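For reference, a sketch of the LQR solution under the usual notation (dynamics \(s_{t+1} = A s_t + B a_t\), cost \(\sum_t s_t^\top Q s_t + a_t^\top R a_t\); this notation is an assumption, not taken from these slides): the optimal policy is linear, \(a_t = K_t s_t\), with
\[
K_t = -(R + B^\top P_{t+1} B)^{-1} B^\top P_{t+1} A,
\qquad
P_t = Q + A^\top P_{t+1} A - A^\top P_{t+1} B (R + B^\top P_{t+1} B)^{-1} B^\top P_{t+1} A,
\]
computed backwards from \(P_H = Q\). Local linearization and iLQR apply this recursion to linear-quadratic approximations of a nonlinear problem.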
Model-Based RL
\(h_1=h\) w.p. \(\propto \gamma^h\)
\(s_t\)
\(a_t\sim \pi(s_t)\)
\(r_t\sim r(s_t, a_t)\)
\(s_{t+1}\sim P(s_t, a_t)\)
\(a_{t+1}\sim \pi(s_{t+1})\)
Food for thought: how to compute off-policy gradient estimate?
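A minimal sketch of tabular model-based RL from rollout tuples \((s_t, a_t, r_t, s_{t+1})\), assuming integer-indexed states and actions; the function names and planning routine are illustrative, not course code.

import numpy as np

def estimate_model(transitions, S, A):
    # Certainty-equivalent estimates of P and r from (s, a, r, s') tuples.
    counts = np.zeros((S, A, S))
    reward_sum = np.zeros((S, A))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = np.maximum(counts.sum(axis=2), 1)   # N(s, a); unvisited pairs keep an all-zero row below
    P_hat = counts / visits[:, :, None]          # empirical transition probabilities
    r_hat = reward_sum / visits                  # empirical mean rewards
    return P_hat, r_hat

def plan(P_hat, r_hat, gamma, iters=500):
    # Value iteration in the estimated MDP; returns a greedy policy.
    V = np.zeros(r_hat.shape[0])
    for _ in range(iters):
        Q = r_hat + gamma * P_hat @ V            # (S, A) Bellman optimality backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1)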
Derivative Free Optimization: Random Search
\(\nabla J(\theta) \approx \frac{1}{2\delta}\big(J(\theta + \delta v) - J(\theta - \delta v)\big)\, v\)
ex - \(J(\theta) = -\theta^2 - 1\)
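A minimal sketch of random search on the toy objective \(J(\theta) = -\theta^2 - 1\); the step size, perturbation size, and iteration count are illustrative choices.

import numpy as np

def J(theta):
    return -theta**2 - 1.0

rng = np.random.default_rng(0)
theta, delta, lr = 3.0, 0.1, 0.05
for _ in range(200):
    v = rng.choice([-1.0, 1.0])                                 # random direction (just a sign in 1-D)
    grad_est = (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v
    theta += lr * grad_est                                      # gradient-ascent step on the estimate
print(theta)                                                    # converges toward 0, the maximizer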
Derivative Free Optimization: Sampling
\(J(\theta) = \mathbb E_{x\sim P_\theta}[h(x)]\)
\(\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(x))\, h(x)\)
ex - \(P_\theta = \mathcal N(\theta, 1)\), \(h(x) = -x^2\), so \(J(\theta) = \mathbb E_{x\sim\mathcal N(\theta, 1)}[-x^2]\) and \(\nabla_\theta \log(P_\theta(x))\, h(x) = (x - \theta)\, h(x)\)
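As a sanity check on this example: \(J(\theta) = \mathbb E_{x\sim\mathcal N(\theta,1)}[-x^2] = -(\theta^2 + 1)\), so the true gradient is \(-2\theta\), and the sampling-based estimator matches it in expectation:
\[
\mathbb E_{x\sim\mathcal N(\theta,1)}\big[(x-\theta)(-x^2)\big]
= -\,\mathbb E_{z\sim\mathcal N(0,1)}\big[z\,(\theta + z)^2\big]
= -2\theta .
\]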
\( \max ~J(\theta)\)
\(\text{s.t.} ~~d_{KL}(\theta, \theta_0)\leq \delta \)
\( \max ~\nabla J(\theta_0)^\top(\theta-\theta_0)\)
\(\text{s.t.} ~~(\theta-\theta_0)^\top F_{\theta_0} (\theta-\theta_0) \leq \delta\)
\(\theta_{t+1} = \theta_t + \alpha F^{-1}_{t} g_t\)
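For reference on how the update follows: a second-order expansion of the KL constraint gives the Fisher quadratic form, and setting the gradient of the Lagrangian of the linearized problem to zero gives
\[
\nabla J(\theta_0) - 2\lambda\, F_{\theta_0}(\theta - \theta_0) = 0
\;\Longrightarrow\;
\theta - \theta_0 = \tfrac{1}{2\lambda}\, F_{\theta_0}^{-1} \nabla J(\theta_0),
\]
which is the natural gradient update \(\theta_{t+1} = \theta_t + \alpha F_t^{-1} g_t\) with \(g_t = \nabla J(\theta_t)\) and \(\alpha = \tfrac{1}{2\lambda}\).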
Food for thought: performance/regret of softmax policy?
Explore-then-Commit: set exploration \(N \approx T^{2/3}\), then commit to the empirically best arm; regret \(R(T) \lesssim T^{2/3}\)
Upper Confidence Bound: for \(t=1,...,T\), pull the arm with the largest upper confidence bound; regret \(R(T) \lesssim \sqrt{T}\)
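A minimal sketch of UCB for a \(K\)-armed bandit with rewards in \([0,1]\); the bonus constant is one common choice, and pull is a placeholder for the environment.

import numpy as np

def ucb(pull, K, T):
    # pull(k) returns a stochastic reward in [0, 1] for arm k.
    counts = np.zeros(K)
    means = np.zeros(K)
    for t in range(T):
        if t < K:
            arm = t                                     # pull each arm once to initialize
        else:
            bonus = np.sqrt(2 * np.log(T) / counts)     # optimism: confidence-width bonus
            arm = int(np.argmax(means + bonus))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]    # running average of rewards
    return means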
Imitation Learning with BC
Food for thought: Expert in LQR setting? (Linear regression)
Supervised learning: fit a policy \(\widehat\pi\) to an expert dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) of observations \(x_i\) and expert actions \(y_i\), so that \(\widehat\pi(x_i) \approx y_i\).
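Equivalently, behavior cloning is empirical risk minimization over a policy class \(\Pi\) (the loss \(\ell\), e.g. 0-1 or cross-entropy, is a standard choice, not specified on the slide):
\[
\widehat\pi = \arg\min_{\pi\in\Pi}\ \sum_{i=1}^{M} \ell\big(\pi(x_i),\, y_i\big),
\qquad x_i = s_i,\quad y_i = \pi^*(s_i).
\]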
Imitation Learning with DAgger
Food for thought: Expert in LQR setting? (Linear regression)
Supervised learning: fit a policy \(\pi^t\) to the aggregated dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\), so that \(\pi^t(x_i) \approx y_i\).
Execute \(\pi^t\) to collect visited states \(s_0, s_1, s_2, ...\)
Query expert: \(\pi^*(s_0), \pi^*(s_1), ...\)
Aggregate \((x_i = s_i, y_i = \pi^*(s_i))\) into \(\mathcal D\) and repeat.
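A minimal sketch of the DAgger loop; fit, rollout, and expert are placeholder callables standing in for the supervised learner, policy execution, and expert queries.

def dagger(fit, rollout, expert, expert_states, n_iters):
    # fit(D) -> policy; rollout(policy) -> states visited; expert(s) -> expert action.
    D = [(s, expert(s)) for s in expert_states]     # iteration 0: behavior cloning data
    policy = fit(D)
    for _ in range(n_iters):
        states = rollout(policy)                    # execute the current learned policy
        D += [(s, expert(s)) for s in states]       # query the expert on visited states, aggregate
        policy = fit(D)                             # supervised learning on the aggregated dataset
    return policy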
Supervised learning guarantee (BC): \(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)
\(\implies\) Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
Online learning guarantee (DAgger): \(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)
\(\implies\) Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)
maximize \(\mathsf{Ent}(\pi)\)
s.t. \(\pi\) consistent with expert data
\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\)
\(\displaystyle x^* =\arg \min_x \max_{w} ~~f(x)+w\cdot g(x)\)
Iterative or \(\nabla \mathcal L(x,w) = 0\)
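For reference, the first-order conditions of the Lagrangian \(\mathcal L(x,w) = f(x) + w\cdot g(x)\):
\[
\nabla_x \mathcal L(x,w) = \nabla f(x) + w\,\nabla g(x) = 0,
\qquad
\nabla_w \mathcal L(x,w) = g(x) = 0,
\]
i.e. the constraint holds and \(\nabla f(x^*)\) is a multiple of \(\nabla g(x^*)\).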