CS 4/5789: Introduction to Reinforcement Learning

Lecture 21

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

 

0. Announcements & Recap

1. Model-Based RL with Exploration

2. UCB Value Iteration Algorithm

3. UCB-VI Analysis

Announcements

 

5789 Paper Review Assignment (weekly pace suggested)

HW 3 released Monday, due 4/25

 

Final exam Monday 5/16 at 7pm

Recap: Contextual Bandit

A (less) simplified setting for studying exploration

ex - machine make and model affect rewards, so the context \(x\) is a vector of machine features

  • See context \(x_t\sim \mathcal D\), pull "arm" \(a_t\), and get reward \(r_t\sim r(x_t, a_t)\) with \(\mathbb E[r(x, a)] = \mu_a(x)\); in the linear case, \(\mu_a(x)=\theta_a^\top x\)
  • Regret, with \(\mu^*(x)=\max_a \mu_a(x)\) the mean reward of the optimal policy \(\pi^*\): $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t)  \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
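As a rough illustration of the regret definition in the linear case, the sketch below sums the gap between the best mean reward and the chosen arm's mean reward along one realized sequence of contexts and actions. The function name `realized_regret` and the array layout are assumptions made for this example; averaging the result over many sequences would approximate the expectation in \(R(T)\).

```python
import numpy as np

def realized_regret(thetas, contexts, actions):
    """Sum over t of mu*(x_t) - mu_{a_t}(x_t), assuming mu_a(x) = theta_a^T x.

    thetas:   (K, d) array, one parameter vector per arm (hypothetical layout)
    contexts: (T, d) array of observed contexts x_t
    actions:  (T,) integer array of pulled arms a_t
    """
    means = contexts @ thetas.T                      # means[t, a] = theta_a^T x_t
    best = means.max(axis=1)                         # mu*(x_t)
    chosen = means[np.arange(len(actions)), actions] # mu_{a_t}(x_t)
    return float(np.sum(best - chosen))

rng = np.random.default_rng(0)
thetas = rng.normal(size=(3, 4))
contexts = rng.normal(size=(100, 4))
random_actions = rng.integers(0, 3, size=100)
optimal_actions = (contexts @ thetas.T).argmax(axis=1)

print("regret of random arms: ", realized_regret(thetas, contexts, random_actions))
print("regret of optimal arms:", realized_regret(thetas, contexts, optimal_actions))
```

Pulling the optimal arm at every step gives zero regret by construction, while any other sequence gives a nonnegative value.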

LinUCB Algorithm

For \(t=1,...,T\):

  1. Observe context \(x_t\)
  2. Pull \(a_t = \arg\max_a (\widehat \theta^a_t)^\top x_t + \alpha \sqrt{x_t^\top (A_t^a)^{-1} x_t}\)
  3. Update \(\widehat \theta^{a_t}\) and \(A^{a_t}\)
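The loop above can be sketched in code as follows. This is a minimal sketch under stated assumptions, not the course's reference implementation: it assumes a ridge-regression update \(A^a = \lambda I + \sum x x^\top\), \(b^a = \sum r x\), \(\widehat\theta^a = (A^a)^{-1} b^a\), Gaussian contexts, and Gaussian reward noise. The class name `LinUCB`, the helper `run_linucb`, and the parameters `reg` and `noise` are hypothetical.

```python
import numpy as np

class LinUCB:
    def __init__(self, n_arms, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha
        # One (A, b) pair per arm; A starts at reg * I (assumed ridge regularizer).
        self.A = np.stack([reg * np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))

    def select(self, x):
        """Step 2: pick the arm with the largest upper confidence bound."""
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            theta_hat = np.linalg.solve(A_a, b_a)
            bonus = self.alpha * np.sqrt(x @ np.linalg.solve(A_a, x))
            scores.append(theta_hat @ x + bonus)
        return int(np.argmax(scores))

    def update(self, a, x, r):
        """Step 3: update the statistics of the pulled arm only."""
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x

def run_linucb(thetas, T=2000, alpha=1.0, noise=0.1, seed=0):
    """Simulate LinUCB on a linear bandit with true parameters `thetas` (K, d)."""
    rng = np.random.default_rng(seed)
    K, d = thetas.shape
    agent = LinUCB(K, d, alpha=alpha)
    regret = 0.0
    for _ in range(T):
        x = rng.normal(size=d)                  # Step 1: observe context
        a = agent.select(x)
        means = thetas @ x
        r = means[a] + noise * rng.normal()     # noisy reward for the pulled arm
        agent.update(a, x, r)
        regret += means.max() - means[a]
    return agent, regret

agent, regret = run_linucb(np.array([[1.0, 0.0], [0.0, 1.0]]))
print("cumulative regret:", regret)
```

The exploration weight `alpha` trades off the estimated reward against the confidence width; larger values make the algorithm pull uncertain arms more often.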

Contextual Bandit

  • Claim: with high probability, \(\theta_a^\top x \leq (\widehat \theta^a_t)^\top x + \alpha \sqrt{x^\top (A_t^a)^{-1} x}\)
  • Chebyshev's: If \(\mathbb E[u]=0\) then with probability at least \(1-1/\beta^2\), $$|u| \leq \beta \sqrt{\mathbb E[u^2]}$$
  • Proof of Claim:
    • Strategy: apply Chebyshev's with \(u=\theta_a^\top x - (\widehat \theta^a_t)^\top x\)
    • Last lecture: showed \(\mathbb E[(\theta_a - \widehat \theta^a_t)^\top x]=0\)
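A small simulation can make the claim concrete. The sketch below assumes a simplified setting that is not taken from the slides: a fixed design matrix, an unregularized least-squares estimate with \(A = X^\top X\), and unit-variance Gaussian noise, in which case \(u\) has mean zero and second moment \(x^\top A^{-1} x\). It checks how often \(|u| \leq \beta\sqrt{x^\top A^{-1} x}\) holds; Chebyshev's inequality in its standard form guarantees at least \(1-1/\beta^2\). The function name `coverage` and all parameters are hypothetical.

```python
import numpy as np

def coverage(beta, n=50, d=3, trials=2000, seed=0):
    """Fraction of trials in which |(theta - theta_hat)^T x| <= beta * sqrt(x^T A^{-1} x)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)            # true parameter (assumed, for simulation)
    X = rng.normal(size=(n, d))           # fixed design: past contexts for one arm
    A = X.T @ X
    x = rng.normal(size=d)                # query context
    width = beta * np.sqrt(x @ np.linalg.solve(A, x))
    hits = 0
    for _ in range(trials):
        y = X @ theta + rng.normal(size=n)        # fresh unit-variance noise
        theta_hat = np.linalg.solve(A, X.T @ y)   # least-squares estimate
        u = (theta - theta_hat) @ x
        hits += int(abs(u) <= width)
    return hits / trials

for beta in (2.0, 3.0):
    print(f"beta={beta}: empirical {coverage(beta):.3f}, Chebyshev bound {1 - 1 / beta**2:.3f}")
```

With Gaussian noise the empirical coverage is typically well above the Chebyshev guarantee, which is distribution-free and therefore conservative.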

Recap: Tabular MBRL

Algorithm:

  1. Query each \((s,a)\) pair \(\frac{N}{SA}\) times, record sample \(s'\sim P(s,a)\)
  2. Fit transition model by counting: $$\widehat P(s'\mid s,a) = \frac{\sum_{i=1}^N \mathbb 1\{(s_i, a_i, s_i') = (s, a, s')\}}{\sum_{i=1}^N \mathbb 1\{(s_i, a_i) = (s, a)\}}$$
  3. Design \(\widehat \pi\) as if \(\widehat P\) is true
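The three steps above might look as follows in code. This is a sketch under assumptions the slide does not state: a discounted infinite-horizon MDP with discount `gamma`, a known reward table `r` of shape (S, A), and value iteration as the planning method in step 3. The function names are hypothetical.

```python
import numpy as np

def estimate_transitions(P, n_per_pair, rng):
    """Steps 1-2: sample s' ~ P(s, a) n_per_pair times per pair and count."""
    S, A, _ = P.shape
    P_hat = np.zeros_like(P)
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=n_per_pair, p=P[s, a])
            P_hat[s, a] = np.bincount(next_states, minlength=S) / n_per_pair
    return P_hat

def value_iteration(P, r, gamma=0.9, iters=500):
    """Step 3: plan as if the given model P were the true one."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = r + gamma * P @ V       # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # random true model, shape (S, A, S)
r = rng.uniform(size=(S, A))                 # assumed known rewards in [0, 1]

P_hat = estimate_transitions(P, n_per_pair=1000, rng=rng)
pi_hat, _ = value_iteration(P_hat, r)
pi_star, _ = value_iteration(P, r)
print("pi_hat: ", pi_hat)
print("pi_star:", pi_star)
```

Here `n_per_pair` plays the role of \(\frac{N}{SA}\), so the total sample budget is `n_per_pair * S * A`.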

Analysis: \(\widehat \pi\) vs. \(\pi^*\)

  • Compare \(\widehat P\) and \(P\) (Hoeffding's)
  • Compare \(\widehat V^\pi\) and \(V^\pi\) (Simulation Lemma)
  • Compare \(\widehat V^{\widehat \pi}\) and \(V^{\pi^*}\)
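The second comparison can be checked numerically. The sketch below assumes a discounted setting with rewards in \([0,1]\), which may differ from the horizon convention used in the course. In that setting one standard form of the simulation lemma gives \(\|\widehat V^\pi - V^\pi\|_\infty \leq \frac{\gamma}{(1-\gamma)^2}\max_{s,a}\|\widehat P(\cdot\mid s,a) - P(\cdot\mid s,a)\|_1\). The function names are hypothetical.

```python
import numpy as np

def policy_value(P, r, pi, gamma):
    """Exact V^pi in model P by solving (I - gamma * P_pi) V = r_pi."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, pi]          # (S, S): transitions under deterministic policy pi
    r_pi = r[idx, pi]          # (S,):  rewards under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def simulation_gap(P, P_hat, r, pi, gamma):
    """Return (max_s |V_hat^pi(s) - V^pi(s)|, simulation-lemma style bound)."""
    gap = np.abs(policy_value(P_hat, r, pi, gamma) - policy_value(P, r, pi, gamma)).max()
    l1_err = np.abs(P_hat - P).sum(axis=2).max()
    bound = gamma / (1 - gamma) ** 2 * l1_err
    return float(gap), float(bound)

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
# Perturbed model standing in for an estimate P_hat (assumption for illustration).
P_hat = 0.9 * P + 0.1 * rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))
pi = rng.integers(0, A, size=S)

gap, bound = simulation_gap(P, P_hat, r, pi, gamma)
print(f"value gap {gap:.4f} <= bound {bound:.4f}")
```

The bound is usually loose in practice, but it shows how a small per-pair model error from the first comparison translates into a small value error for any fixed policy.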
