CS 4/5789: Introduction to Reinforcement Learning

Lecture 15

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

 

0. Announcements & Recap

1. PG with Q & Advantage functions

2. Trust Regions & KL-Divergence

3. Natural Policy Gradient

Announcements

 

HW2 due Monday 3/28

 

5789 Paper Review Assignment (weekly pace suggested)

 

Monday 3/21 is the last day to drop

Prelim Exam

Prelim Tuesday 3/22 at 7:30-9pm in Phillips 101

 

Closed-book, definition/equation sheet provided

 

Focus: mainly Unit 1 (known models), but many lectures in Unit 2 revisit key concepts

Study Materials: Lecture Notes 1-15, HW0&1

 

Lecture on Monday 3/21 will be a review


Recap

Derivative Free Optimization: Random Search

\(\nabla J(\theta) \approx \frac{1}{2\delta}\left(J(\theta+\delta v) - J(\theta-\delta v)\right)v\)

Example: parabola \(J(\theta) = -\theta^2 - 1\)

[Figure: plot of \(J(\theta)\) against \(\theta\)]
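As a concrete illustration (not from the slides), here is a minimal Python sketch of the two-point finite-difference estimate on the parabola example; the values of theta and delta and the standard-normal search direction v are illustrative choices.

import numpy as np

def J(theta):
    # parabola from the example
    return -theta**2 - 1.0

rng = np.random.default_rng(0)
theta, delta = 1.5, 0.1
v = rng.standard_normal()                                            # random search direction
g = (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v  # two-point estimate
print(g)   # in expectation over v, this matches the true gradient -2*theta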

Recap

Derivative Free Optimization: Sampling

\(J(\theta) = \mathbb E_{x\sim P_\theta}[h(x)]\)

Example: \(P_\theta = \mathcal N(\theta, 1)\), \(h(x) = -x^2\), so \(J(\theta) = \mathbb E_{x\sim\mathcal N(\theta, 1)}[-x^2]\)

\(\nabla J(\theta) \approx \nabla_\theta \log P_\theta(x)\, h(x) = (x - \theta)\, h(x)\)

[Figure: plot of \(h(x) = -x^2\) against \(x\)]
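A minimal Python sketch of this sampling-based (score-function) estimate for the Gaussian example above; the value of theta and the sample size are illustrative choices.

import numpy as np

def h(x):
    return -x**2

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=10_000)   # x ~ N(theta, 1)
score = x - theta                                   # grad_theta log P_theta(x) for a unit-variance Gaussian
g = np.mean(score * h(x))                           # average of single-sample estimates
print(g)   # should be close to dJ/dtheta = -2*theta = -3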

RL Setting

  • MDP \(\mathcal M = \{\mathcal S, \mathcal A, P, r, \gamma\}\) with \(P, r\) unknown
  • policy \(\pi_\theta\) with parameter \(\theta\in\mathbb R^d\)
  • observe rollout of \(\pi_\theta\): \(\tau = (s_0,a_0,s_1,...)\) and \((r_0, r_1,...)\)
  • objective function
    \(J(\theta) = \mathbb E_{s_0\sim\mu_0}[\sum_{t=0}^\infty\gamma^t r_t \mid P,r,\pi_\theta] = \mathbb E_{\tau\sim\rho_\theta}[R(\tau)]\)
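A minimal sketch of estimating this objective empirically; rollout is a hypothetical helper that runs \(\pi_\theta\) once and returns the reward sequence \((r_0, r_1, \dots)\), truncated at some finite horizon.

import numpy as np

def discounted_return(rewards, gamma):
    # R(tau) = sum_t gamma^t r_t for a (truncated) reward sequence
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_J(rollout, gamma, n_rollouts=100):
    # Monte Carlo estimate of J(theta) = E_{tau ~ rho_theta}[R(tau)]
    return np.mean([discounted_return(rollout(), gamma) for _ in range(n_rollouts)])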

Simple Random Search

  1. with \(\theta_t \pm \delta v\) observe \(\tau_+\) and \(\tau_-\)
  2. finite difference approx
    \(g=\frac{1}{2\delta}(R(\tau_+) - R(\tau_-))v\)
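A minimal sketch of one simple-random-search gradient estimate; sample_return is a hypothetical helper that rolls out the policy with the given parameter once and returns \(R(\tau)\), and delta is an illustrative perturbation size.

import numpy as np

def srs_gradient(theta, sample_return, delta=0.05, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    v = rng.standard_normal(theta.shape)           # random perturbation direction
    r_plus = sample_return(theta + delta * v)      # R(tau_+) from rollout of theta + delta*v
    r_minus = sample_return(theta - delta * v)     # R(tau_-) from rollout of theta - delta*v
    return (r_plus - r_minus) / (2 * delta) * v    # finite-difference estimate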

REINFORCE

  1. with \(\theta_t\) observe \(\tau\)
  2. trajectory-based approx
    \(g=\sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\)
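A minimal sketch of the REINFORCE estimate from a single (truncated) trajectory; grad_log_pi is a hypothetical helper returning \(\nabla_\theta \log\pi_\theta(a\mid s)\), and gamma is an illustrative choice.

import numpy as np

def reinforce_gradient(theta, states, actions, rewards, grad_log_pi, gamma=0.99):
    # R(tau): discounted return of the whole trajectory
    R = sum(gamma**t * r for t, r in enumerate(rewards))
    # sum_t grad_theta log pi_theta(a_t | s_t), scaled by R(tau)
    g = np.zeros_like(theta)
    for s, a in zip(states, actions):
        g += grad_log_pi(theta, s, a)
    return g * R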

Meta-Algorithm: DF-SGA

initialize \(\theta_0\)

for \(t=0,1,...\)

  1. collect rollouts using \(\theta_t\)
  2. estimate gradient with \(g_t\)
  3. \(\theta_{t+1} = \theta_t + \alpha g_t\)
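A minimal sketch of the DF-SGA loop; estimate_gradient stands in for either estimator above (e.g. srs_gradient or reinforce_gradient wrapped around rollout collection), and alpha and n_iters are illustrative choices.

import numpy as np

def df_sga(theta0, estimate_gradient, alpha=0.01, n_iters=1000):
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        g = estimate_gradient(theta)   # steps 1-2: collect rollouts, estimate gradient
        theta = theta + alpha * g      # step 3: gradient ascent update
    return theta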

Agenda

 

0. Announcements & Recap

1. PG with Q & Advantage functions

2. Trust Regions & KL-Divergence

3. Natural Policy Gradient
