CS 4/5789: Introduction to Reinforcement Learning

Lecture 22: Exploration in MDPs

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • 5789 Paper Reviews due weekly on Mondays
    • PSet 7 due Monday
    • PA 4 due May 3
  • Final exam is Saturday 5/13 at 2pm
  • WICCxURMC Survey

Agenda

1. Recap: Bandits & MBRL

2. MBRL with Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

Recap: Bandits

A simplified setting for studying exploration

ex - machine make and model affect rewards, so the context \(x\) is a feature vector encoding the machine's make, model, and related attributes

  • See context \(x_t\sim \mathcal D\), pull "arm" \(a_t\) and get reward \(r_t\sim r(x_t, a_t)\) with \(\mathbb E[r(x, a)] = \mu_a(x)\), which in the linear case is \(\theta_a^\top x\)
  • Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi_\star(x_t))-r(x_t,a_t)  \right] = \sum_{t=1}^T\mathbb E[\mu_\star(x_t) - \mu_{a_t}(x_t)] $$

Recap: UCB

UCB-type Algorithms

  • for \(t=1,2,...,T\)
    • Observe context \(x_t\)
    • Pull arm \(a_t\) optimistically
      • MAB: \(\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}\)
      • Linear: \(\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}\)
    • Update estimates and confidence intervals (see the sketch below)
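A minimal sketch of the MAB case of this loop in Python (context-free; the reward function pull and the constant C are assumed placeholders, not from the slides):

    import numpy as np

    def ucb_mab(pull, n_arms, T, C=2.0):
        # pull(a) returns a stochastic reward for arm a (assumed given)
        counts = np.zeros(n_arms)   # N_{a,t}: number of pulls of each arm
        sums = np.zeros(n_arms)     # running sum of rewards per arm
        history = []
        for t in range(T):
            # optimistic index: empirical mean + sqrt(C / N_{a,t}); unpulled arms get +inf
            with np.errstate(divide="ignore", invalid="ignore"):
                means = np.where(counts > 0, sums / counts, 0.0)
                bonus = np.where(counts > 0, np.sqrt(C / counts), np.inf)
            a = int(np.argmax(means + bonus))
            r = pull(a)
            counts[a] += 1          # update estimates and confidence intervals
            sums[a] += r
            history.append((a, r))
        return history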

Recap: Tabular MBRL

Algorithm:

  1. Query each \((s,a)\) pair \(\frac{N}{SA}\) times, record sample \(s'\sim P(s,a)\)
  2. Fit transition model by counting (sketched in code after this list): $$\widehat P(s'\mid s,a) = \frac{\sum_{i=1}^N \mathbb 1\{(s_i, a_i, s_i') = (s, a, s')\}}{\sum_{i=1}^N \mathbb 1\{(s_i, a_i) = (s, a)\}}$$
  3. Design \(\widehat \pi\) as if \(\widehat P\) is true
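A sketch of steps 1–2 in Python, assuming generative-model access through a hypothetical sampler sample_next_state(s, a); names and array shapes are illustrative, not the course's reference code:

    import numpy as np

    def fit_transition_model(sample_next_state, S, A, N):
        # Step 1: query each (s, a) pair N/(S*A) times
        # Step 2: estimate P_hat(s' | s, a) by counting observed transitions
        counts = np.zeros((S, A, S))
        per_pair = N // (S * A)
        for s in range(S):
            for a in range(A):
                for _ in range(per_pair):
                    s_next = sample_next_state(s, a)   # one sample s' ~ P(. | s, a)
                    counts[s, a, s_next] += 1
        return counts / counts.sum(axis=2, keepdims=True)

Step 3 then plans (e.g. by value iteration) on \(\widehat P\) as if it were the true model.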

Analysis: \(\widehat \pi\) vs. \(\pi^*\)

  • Compare \(\widehat P\) and \(P\) (Hoeffding's)
  • Compare \(\widehat V^\pi\) and \(V^\pi\) (Simulation Lemma)
  • Compare \(\widehat V^{\widehat \pi}\) and \(V^{\pi^*}\)

Agenda

1. Recap: Bandits & MBRL

2. MBRL with Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

MBRL with Exploration

  • Finite horizon tabular MDP with given initial state $$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, s_0\}$$
  • Finite states \(|\mathcal S| = S\) and actions \(|\mathcal A|=A\)
  • Transition probabilities \(P\) unknown
    • for simplicity assume reward function is noiseless, thus perfectly known once \(s,a\) is visited
  • Unlike in previous Unit, initial state is fixed!
    • No longer possible to query \(s'\sim P(s,a)\) for any \(s,a\)

Example

  • Consider the deterministic chain MDP
    • \(\mathcal S = \{0,\dots,H-1\}\) and \(\mathcal A = \{1, \dots, A\}\)
    • \(s_{t+1} = s_t + \mathbf 1\{a_t=1\} - \mathbf 1\{a_t\neq 1\}\) except at endpoints
  • Suppose \(s_0=0\), finite horizon \(H\), and reward function $$r(s,a) = \mathbf 1\{s=H-1\}$$
  • Uniformly random policy for exploration: observe \((r_t)_{0\leq t\leq H-1}\) (simulated in the sketch below)
    • \(\mathbb P\{r_t = 0 ~\forall t\} = 1-1/A^{H-1}\), so the random policy almost never observes any reward signal

[Figure: chain MDP with states \(0, 1, 2, \dots, H-1\); action \(1\) moves right, every action \(\neq 1\) moves left]
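A quick simulation of this example (a sketch; the particular values of H and A are illustrative):

    import numpy as np

    def chain_step(s, a, H):
        # deterministic chain: action 1 moves right, any other action moves left
        s_next = s + 1 if a == 1 else s - 1
        return min(max(s_next, 0), H - 1)

    def random_rollout(H, A, rng):
        s, total = 0, 0.0
        for _ in range(H):
            total += float(s == H - 1)    # r(s, a) = 1{s = H-1}
            a = rng.integers(1, A + 1)    # uniform over {1, ..., A}
            s = chain_step(s, a, H)
        return total

    rng = np.random.default_rng(0)
    H, A = 6, 3
    hits = np.mean([random_rollout(H, A, rng) > 0 for _ in range(200_000)])
    print(hits, A ** -(H - 1))            # empirical hit rate vs. 1/A^(H-1)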

Agenda

1. Recap: Bandits & MBRL

2. MBRL with Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

UCB-VI

  • Initialize \(\hat P_0\), \(\hat r_0=0\), and \(b_0\)
  • for \(i=1,2,...,T\)
    • Design optimistic policy \(\hat \pi^i = VI(\hat P_i, \hat r_i+b_i)\)
    • Rollout \(\hat\pi^i\) and observe trajectory
    • Update \(\hat P_{i+1}\), \(\hat r_{i+1}\), and \(b_{i+1}\) (see the sketch below)
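In Python, the outer loop might look like the sketch below; estimate_model and optimistic_dp mirror the model-estimation and DP slides that follow (and are sketched there), while env and rollout are placeholder names for episode collection:

    import numpy as np

    def ucb_vi(env, rollout, S, A, H, T, alpha):
        D = []                                            # all observed (s, a, r, s') transitions
        policies = []
        for i in range(T):
            P_hat, r_hat, N = estimate_model(D, S, A)     # counts -> P_hat_i, r_hat_i, N_i(s, a)
            b = H * np.sqrt(alpha / np.maximum(N, 1))     # bonus b_i(s, a); unvisited pairs get the max bonus
            pi_i, _ = optimistic_dp(P_hat, r_hat + b, H)  # optimistic policy: VI(P_hat_i, r_hat_i + b_i)
            D += rollout(env, pi_i, H)                    # roll out pi^i for one episode
            policies.append(pi_i)
        return policies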

Optimistic MBRL

  • Design a reward bonus to incentivize exploration of unseen states and actions

Model Estimation

  • Using the dataset at iteration \(i\): $$\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1} $$
  • Number of times we took action \(a\) in state \(s\) $$N_i(s,a) = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} $$
  • Number of times we transitioned to \(s'\) after \(s,a\) $$N_i(s,a, s') = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a, s^k_{t+1}=s'\} $$
  • Estimate transition probability \(\hat P_i(s'|s,a) = \frac{N_i(s,a,s')}{N_i(s,a)}\)
  • Reward of \(s,a\): \(\hat r_i(s,a) = \frac{1}{N_i(s,a) } \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} r_t^k\)
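A sketch of these estimates in code, assuming the dataset is stored as a flat list of (s, a, r, s') transitions (the storage format is an assumption):

    import numpy as np

    def estimate_model(D, S, A):
        # D: list of (s, a, r, s_next) tuples from all episodes collected so far
        N = np.zeros((S, A))                 # N_i(s, a)
        N_next = np.zeros((S, A, S))         # N_i(s, a, s')
        R_sum = np.zeros((S, A))             # running sum of rewards observed at (s, a)
        for s, a, r, s_next in D:
            N[s, a] += 1
            N_next[s, a, s_next] += 1
            R_sum[s, a] += r
        denom = np.maximum(N, 1)             # avoid 0/0 for unvisited pairs
        P_hat = N_next / denom[:, :, None]   # P_hat_i(s' | s, a) = N_i(s, a, s') / N_i(s, a)
        r_hat = R_sum / denom                # r_hat_i(s, a)
        return P_hat, r_hat, N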

Example

[Figure: the chain MDP from before]

  • Uniformly random policy:
    • e.g. \(a_{0:8} = (1,3,2,2,3,1,1,3,2)\)
    • then \(s_{0:9} = (0,1,0,0,0,0,1,2,1,0)\)
  • \(N(s,a) = ?\)
  • \(\hat P(s'|s,a)=\) PollEV
  • \(\hat r(s,a)=0\)
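For concreteness, a sketch that tabulates the counts for this trajectory (the array layout is illustrative):

    import numpy as np

    actions = [1, 3, 2, 2, 3, 1, 1, 3, 2]      # a_{0:8}
    states = [0, 1, 0, 0, 0, 0, 1, 2, 1, 0]    # s_{0:9}
    S, A = max(states) + 1, 3

    N = np.zeros((S, A))
    N_next = np.zeros((S, A, S))
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        N[s, a - 1] += 1                       # actions {1, 2, 3} -> indices {0, 1, 2}
        N_next[s, a - 1, s_next] += 1
    P_hat = N_next / np.maximum(N, 1)[:, :, None]
    print(N)                                   # visit counts N(s, a)
    print(P_hat[0, 0])                         # P_hat(s' | s=0, a=1): all mass on s'=1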


Reward Bonus

  • Using the dataset at iteration \(i\): \(\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1} \)
  • Number of times \(s,a\): \(N_i(s,a)\)
  • Number of times \(s,a\to s'\): \(N_i(s,a, s')\)
  • Reward of \(s,a\): \(\hat r_i(s,a)\)
  • Reward bonus $$ b_i(s,a) = H\sqrt{\frac{\alpha}{N_i(s,a)}} $$

Optimistic DP

  • In the finite horizon setting, VI is just DP (sketched in code below)
  • Initialize \(\hat V^i_H(s)=0\)
  • for \(t=H-1, ..., 0\)
    • \(\hat Q^i_t(s,a)=\hat r_i(s,a)+b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]\)
    • \(\hat\pi^i_t(s) = \arg\max_a \hat Q^i_t(s,a)\)
    • \(\hat V^i_{t}(s)=\hat Q^i_t(s,\hat\pi^i_t(s))\)
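A sketch of this backward pass in Python, matching the array shapes of the estimation sketch above (P_hat of shape (S, A, S), bonus-augmented reward of shape (S, A)):

    import numpy as np

    def optimistic_dp(P_hat, r_tilde, H):
        # r_tilde = r_hat + b is the bonus-augmented reward
        S, A = r_tilde.shape
        V = np.zeros(S)                    # V_H = 0
        pi = np.zeros((H, S), dtype=int)
        for t in reversed(range(H)):
            Q = r_tilde + P_hat @ V        # Q_t(s, a) = r~(s, a) + E_{s' ~ P_hat}[V_{t+1}(s')]
            pi[t] = Q.argmax(axis=1)       # pi^i_t(s) = argmax_a Q_t(s, a)
            V = Q.max(axis=1)              # V_t(s) = Q_t(s, pi^i_t(s))
        return pi, V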

Example

[Figure: the chain MDP from before]

  • Iteration 1: uniformly random policy for exploration
    • trajectory contains \(s,a\) with probability \(\propto 1/A^s\)
  • Iteration 2: reward bonus incentivizes upward exploration
  • Eventually reach \(s=H-1\) and converge to \(\pi^\star_t(s) = 1\)


Agenda

1. Recap: Bandits & MBRL

2. MBRL with Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

Analysis: Two Key Facts

  1. The exploration bonus bounds, with high probability, the difference $$ |\mathbb E_{s'\sim\hat P(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \quad\forall ~~s,a $$
    • similar to confidence intervals bounding \(|\mu_a-\hat\mu_a|\)
  2. The exploration bonus leads to optimism $$ \hat V_t^i(s) \geq V_t^\star(s) $$
    • similar to showing that \(\hat\mu_{\hat a_\star} \geq \mu_\star\)

Regret in RL Setting

  • These two properties are key to proving a regret bound $$R(T) = \mathbb E\left[\sum_{i=1}^T V_0^\star (s_0)-V_0^{\pi^i}(s_0) \right] $$
  • Here regret is cumulative sub-optimality over episodes
    • each episode's sub-optimality is itself a difference of cumulative rewards (i.e. values)
  • Argument follows UCB proof structure:
    1. By optimism, \(V^\star_0(s_0) - V_0^{\pi^i}(s_0) \leq \hat V_0^i(s_0) - V^{\pi^i}_0(s_0)\)
    2. Simulation Lemma to compare \(\hat V_0^i(s_0)\) & \(V^{\pi^i}_0(s_0)\)
  • Regret proof is out of scope for this class (see 6789)
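In symbols, step 1 applies optimism inside the sum: $$R(T) = \mathbb E\left[\sum_{i=1}^T V_0^\star(s_0) - V_0^{\pi^i}(s_0)\right] \leq \mathbb E\left[\sum_{i=1}^T \hat V_0^i(s_0) - V_0^{\pi^i}(s_0)\right]$$ and step 2 controls each remaining gap \(\hat V_0^i(s_0) - V_0^{\pi^i}(s_0)\) with the Simulation Lemma, since both are values of the same policy \(\hat\pi^i\) (under the estimated and true models, respectively).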


Example

[Figure: the chain MDP from before]

  • The reward bonus enables the "right amount" of exploration


Analysis: Exploration Bonus

  • Lemma: For any fixed function \(V:\mathcal S\to [0,H]\), whp, $$|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \leq H\sqrt{\frac{\alpha}{N_i(s,a)}}=b_i(s,a)$$ where \(\alpha\) depends on \(S,A,H\) and the failure probability
  • Proof: \(|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]|\)
    • \(=\left|\sum_{s'\in\mathcal S} (\hat P_i(s'|s,a)  - P(s'|s,a) )V(s')\right|\)
    • \(\leq \sum_{s'\in\mathcal S} |\hat P_i(s'|s,a)  - P(s'|s,a) | |V(s')|\)
    • \(\leq \left(\sum_{s'\in\mathcal S} |\hat P_i(s'|s,a)  - P(s'|s,a) |\right)\max_{s'} |V(s')| \)
    • \(\leq \sqrt{\frac{\alpha}{N_i(s,a)}} \underbrace{\max_{s'} |V(s')|}_{\leq H}\), using the \(\ell_1\) concentration bound \(\|\hat P_i(\cdot\mid s,a)-P(\cdot\mid s,a)\|_1 \leq \sqrt{\alpha/N_i(s,a)}\) from Lecture 11

Analysis: Optimism

  • Lemma: as long as \(r(s,a)\in[0,1]\), \( \hat V_t^i(s) \geq V_t^\star(s)\) for all \(t,i,s\)
  • Proof: By induction.
    1. Base case \(\hat V_H^i(s)=0=V_H^\star(s)\)
    2. Assume that \( \hat V_{t+1}^i(s) \geq V_{t+1}^\star(s)\)
  • Then \(\hat Q_t^i(s,a) - Q_t^\star (s,a)\)
  • \( = \hat r_i(s,a) + b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]- r(s,a) - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \)
  • \( \geq b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \) (noiseless rewards, so \(\hat r_i(s,a)=r(s,a)\))
  • \( \geq b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[V^\star_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \) (inductive hypothesis)
  • \( \geq b_i(s,a)-b_i(s,a) = 0 \) (previous Lemma applied to \(V^\star_{t+1}\))
  • Thus \(\hat Q_t^i(s,a) \geq Q_t^\star (s,a)\) which implies \(\hat V_t^i(s) \geq V_t^\star (s)\) (exercise)

Recap

  • PSet due Monday

 

  • Exploration problem
  • UCB-VI Algorithm

 

  • Next lecture: Revisiting imitation learning

Sp23 CS 4/5789: Lecture 22

By Sarah Dean
