CS 4/5789: Introduction to Reinforcement Learning

Lecture 22: Exploration in MDPs

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • My OH rescheduled to today after lecture
  • Homework
    • 5789 Paper Assignments
    • PSet 8 due Monday
    • Final PA due next week
  • Final exam is Tuesday 5/14 at 2pm in Ives 305
  • Prelim grades released. Out of 48 points,
    • A range: 42+ points
    • B range: 33+ points
    • C range: 28+ points

Agenda

1. Recap: Bandits

2. MBRL and Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

Recap: Bandits

A simplified setting for studying exploration

ex - machine make and model affect rewards, so the context \(x\) encodes the machine's features (make, model, etc.)

  • See context \(x_t\sim \mathcal D\), pull "arm" \(a_t\) and get reward \(r_t\sim r(x_t, a_t)\) with \(\mathbb E[r(x, a)] = \mu_a(x)\); in the linear case, \(\mu_a(x)=\theta_a^\top x\)
  • Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi_\star(x_t))-r(x_t,a_t)  \right] = \sum_{t=1}^T\mathbb E[\mu_\star(x_t) - \mu_{a_t}(x_t)] $$

Recap: UCB

UCB-type Algorithms

  • for \(t=1,2,...,T\)
    • Observe context \(x_t\)
    • Pull arm \(a_t\) optimistically
      • MAB: \(\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}\)
      • Linear: \(\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}\)
    • update estimates and confidence intervals
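
As a concrete illustration (not part of the slides), here is a minimal simulation of the MAB version of the UCB rule above; the Bernoulli arms, horizon \(T\), and constant \(C\) are arbitrary choices.

```python
import numpy as np

# Minimal UCB for a stochastic multi-armed bandit (no context).
# The Bernoulli arms, horizon, and constant C are illustrative choices.
rng = np.random.default_rng(0)
K, T, C = 5, 5000, 2.0
true_means = rng.uniform(size=K)          # unknown to the learner

counts = np.zeros(K)                      # N_{a,t}: number of pulls of each arm
sums = np.zeros(K)                        # running sum of rewards per arm
regret = 0.0

for t in range(T):
    if t < K:
        a = t                             # pull each arm once to initialize
    else:
        mu_hat = sums / counts            # empirical means \hat mu_{a,t}
        a = int(np.argmax(mu_hat + np.sqrt(C / counts)))   # optimistic index
    r = rng.binomial(1, true_means[a])    # observe reward
    counts[a] += 1
    sums[a] += r
    regret += true_means.max() - true_means[a]

print(f"cumulative pseudo-regret after {T} rounds: {regret:.1f}")
```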

Agenda

1. Recap: Bandits

2. MBRL and Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

Model Based RL

  • Finite horizon tabular MDP with given initial state $$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, s_0\}$$
  • Finite states \(|\mathcal S| = S\) and actions \(|\mathcal A|=A\)
  • Transition probabilities \(P\) unknown
    • for simplicity, assume the reward function is noiseless, so it is perfectly known once \((s,a)\) is visited
  • Model-based RL has two simple steps:
    1. Use data to estimate \(\hat P\)
    2. Use \(\hat P\) with algorithms from Unit 1

Transition Estimation

  • Consider a dataset \(\mathcal D=\{s_0,a_0,s_1,a_1,...\}\)
  • How to estimate \(\hat P(\cdot|s,a)\)?
    • Collect all time steps \(t_i\) such that \((s_{t_i},a_{t_i})=(s,a)\)
    • Define the dataset: $$\mathcal D(s,a) = \{(s_{t_i}, a_{t_i}, s_{t_{i}+1})\}_{i=1}^{N(s,a)}$$
    • \(N(s,a)\) is the number of times \(s,a\) is visited
Warmup: Coin & Dice

  • Consider a biased coin which is heads with probability \(p\) for an unknown value of \(p\in[0,1]\)
  • How to estimate from trials?
    • Flip the coin \(N\) times $$\hat p =\frac{\mathsf{\#~heads}}{N} $$
  • Consider an \(S\)-sided die which lands on side \(s\) with probability \(p_s\) for \(s\in\{1,\dots,S\}=[S]\), where the \(p_s\) are unknown
  • How to estimate from trials?
    • Roll the die \(N\) times $$\hat p_s =\frac{\mathsf{\#~times~land~on}~s}{N} $$

Estimation Errors

  • For the weighted coin,
    • the number of heads \(N\hat p\) is Binomial\((N, p)\)
    • \(|p-\hat p| \leq\sqrt{\frac{\log(2/\delta)}{N}}\) with probability \(1-\delta\)
  • For the \(S\)-sided die, with probability \(1-\delta\), $$\max_{s\in[S]} |p_s-\hat p_s| \leq \sqrt{\frac{\log(2S/\delta)}{N}} $$
  • Alternatively, the total variation distance satisfies, w.p. \(1-\delta\), $$ \sum_{s\in[S]} |p_s-\hat p_s| \leq \sqrt{\frac{S\log(2/\delta)}{N}} $$
  • Why? Unbiasedness (\(\mathbb E[\hat p]=p\)) & concentration (Hoeffding's inequality)

Hoeffding's Inequality

  • This slide is not in scope. Hoeffding's inequality states that for independent random variables \(X_1,\dots,X_n\) with \(a_i\leq X_i\leq b_i\) and \(S_n=\sum_{i=1}^n X_i\), $$ \mathbb P\{|S_n-\mathbb E[S_n]|\geq t\} \leq 2\exp\left(\frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right)$$
  • Rearranging, this is equivalent to $$ \left|\frac{1}{n}S_n-\mathbb E\left[\frac{1}{n} S_n\right]\right|\leq \frac{1}{n}\sqrt{\frac{1}{2}\log(2/\delta)\sum_{i=1}^n (b_i-a_i)^2} \quad\text{w.p. at least}\quad  1-\delta$$
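
As a quick sanity check (not from the slides), here is a small simulation, with arbitrary choices of the die distribution, \(N\), and \(\delta\), comparing the empirical \(\ell_1\) error against the bound \(\sqrt{S\log(2/\delta)/N}\).

```python
import numpy as np

# Empirically check the L1 / total-variation bound for the S-sided die.
# The die distribution, N, and delta below are arbitrary choices.
rng = np.random.default_rng(1)
S, N, delta = 10, 500, 0.05
p = rng.dirichlet(np.ones(S))                  # true (unknown) probabilities

errors = []
for _ in range(1000):                          # repeat the estimation experiment
    rolls = rng.choice(S, size=N, p=p)
    p_hat = np.bincount(rolls, minlength=S) / N
    errors.append(np.abs(p - p_hat).sum())     # sum_s |p_s - p_hat_s|

bound = np.sqrt(S * np.log(2 / delta) / N)
print(f"fraction of runs within the bound: {np.mean(np.array(errors) <= bound):.3f}")
print(f"bound = {bound:.3f}, typical error = {np.median(errors):.3f}")
```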

Transition Estimation

  • How to estimate \(P(s,a)\)? $$ \mathcal D(s,a) = \{(s_{t_i}, a_{t_i}, s_{t_{i}+1})\}_{i=1}^{N(s,a)} $$
  • Estimate by counting! $$\hat P(s'|s,a) =\frac{\mathsf{\# times~}~s_{t_i+1}=s'}{N(s,a)} \quad\forall~~s,a$$
  • Lemma: Estimation error of above, with probability \(1-\delta\) $$\sum_{s'\in\mathcal S} |P(s'|s,a)-\hat P(s'|s,a)| \leq \sqrt{\frac{\alpha}{N(s,a)}} $$ for \(\alpha=S^2A \log(2/\delta)\)
    • Proof out of scope, but details in slides below
  • How can we make sure \(N(s,a)\) is large enough?
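
For concreteness, a minimal sketch (not from the lecture) of the counting estimator above, applied to a single trajectory; the dictionary representation and the toy data are illustrative assumptions.

```python
from collections import defaultdict

def estimate_transitions(trajectory):
    """Count-based estimate of P(s'|s,a) from one trajectory [(s_0,a_0), (s_1,a_1), ...]."""
    N_sa = defaultdict(int)     # N(s,a): visits to each state-action pair
    N_sas = defaultdict(int)    # number of times (s,a) was followed by s'
    for (s, a), (s_next, _) in zip(trajectory[:-1], trajectory[1:]):
        N_sa[(s, a)] += 1
        N_sas[(s, a, s_next)] += 1
    P_hat = {key: n / N_sa[key[:2]] for key, n in N_sas.items()}
    return P_hat, dict(N_sa)

# toy trajectory consistent with the chain MDP below, with A = 3 (illustrative data)
traj = [(0, 1), (1, 3), (0, 2), (0, 1), (1, 1), (2, 2)]
P_hat, N_sa = estimate_transitions(traj)
print(N_sa)    # e.g. N(0,1) = 2
print(P_hat)   # e.g. P_hat[(0, 1, 1)] = 1.0
```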

Exploration Example

  • Consider the deterministic chain MDP
    • \(\mathcal S = \{0,\dots,H-1\}\) and \(\mathcal A = \{1, \dots, A\}\)
    • \(s_{t+1} = s_t + \mathbf 1\{a_t=1\} - \mathbf 1\{a_t\neq 1\}\) except at endpoints
  • Suppose \(s_0=0\), finite horizon \(H\), and reward function $$r(s,a) = \mathbf 1\{s=H-1\}$$
  • Uniformly random policy for exploration, observe \((r_t)_{0\leq t\leq H-1}\)
    • \(\mathbb P\{r_t = 0~\forall t\} = 1-(1/A)^{H-1}\)

(Diagram: chain MDP with states \(0, 1, 2, \dots, H-1\); action \(a=1\) moves right, any action \(a\neq 1\) moves left.)
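
A small simulation makes the failure of naive exploration concrete; the particular values of \(H\), \(A\), and the number of rollouts below are illustrative.

```python
import numpy as np

# Chain MDP from the example: states 0..H-1, action 1 moves right, other actions move left.
# Reward 1 only in state H-1. Parameters are illustrative.
rng = np.random.default_rng(0)
H, A, episodes = 8, 3, 100_000

hits = 0
for _ in range(episodes):
    s = 0
    for t in range(H):
        if s == H - 1:
            hits += 1                     # random rollout reached the rewarding state
            break
        a = rng.integers(1, A + 1)        # uniform over actions {1,...,A}
        s = min(s + 1, H - 1) if a == 1 else max(s - 1, 0)

print(f"fraction of random rollouts that ever see reward: {hits / episodes:.2e}")
print(f"compare to 1/A^(H-1) = {A ** -(H - 1):.2e}")
```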

Agenda

1. Recap: Bandits

2. MBRL and Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

UCB-VI

  • Initialize \(\hat P_0\), \(\hat r_0=0\), and \(b_0\)
  • for \(i=1,2,...,T\)
    • Design optimistic policy \(\hat \pi^i = VI(\hat P_i, \hat r_i+b_i)\)
    • Rollout \(\hat\pi^i\) and observe trajectory
    • Update \(\hat P_{i+1}\), \(\hat r_{i+1}\), and \(b_{i+1}\)

Optimistic MBRL

  • Design a reward bonus to incentivize exploration of unseen states and actions

Model Estimation

  • Using the dataset at iteration \(i\): $$\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1} $$
  • Number of times we took action \(a\) in state \(s\) $$N_i(s,a) = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} $$
  • Number of times we transitioned to \(s'\) after \(s,a\) $$N_i(s,a, s') = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a, s^k_{t+1}=s'\} $$
  • Estimate transition probability \(\hat P_i(s'|s,a) = \frac{N_i(s,a,s')}{N_i(s,a)}\)
  • Reward of \(s,a\): \(\hat r_i(s,a) = \frac{1}{N_i(s,a) } \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} r_t^k\)
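
A sketch of how these counts and estimates might be computed from episodic data, using arrays indexed by state and action; the array representation and the handling of unvisited \((s,a)\) pairs (uniform \(\hat P\), zero reward) are assumptions, not something specified on the slide.

```python
import numpy as np

def estimate_model(episodes, S, A):
    """Build N_i(s,a), hat P_i, and hat r_i from a list of episodes.

    Each episode is a list of (s_t, a_t, r_t, s_{t+1}) tuples. Unvisited (s,a)
    pairs default to a uniform next-state distribution and zero reward
    (an arbitrary convention; the exploration bonus handles them in UCB-VI)."""
    N_sa = np.zeros((S, A))
    N_sas = np.zeros((S, A, S))
    R_sum = np.zeros((S, A))
    for ep in episodes:
        for s, a, r, s_next in ep:
            N_sa[s, a] += 1
            N_sas[s, a, s_next] += 1
            R_sum[s, a] += r
    N_safe = np.maximum(N_sa, 1)                  # avoid division by zero
    P_hat = N_sas / N_safe[:, :, None]
    P_hat[N_sa == 0] = 1.0 / S                    # uniform where no data
    r_hat = R_sum / N_safe
    return N_sa, P_hat, r_hat
```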

Example

(Diagram: the chain MDP from above, states \(0, 1, 2, \dots, H-1\).)

  • Uniformly random policy:
    • e.g. \(a_{0:8} = (1,3,2,2,3,1,1,3,2)\)
    • then \(s_{0:9} = (0,1,0,0,0,0,1,2,1,0)\)
  • \(N(s,a) = ?\)
  • \(\hat P(s'|s,a)=\) PollEV
  • \(\hat r(s,a)=0\)

Reward Bonus

  • Using the dataset at iteration \(i\): \(\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1} \)
  • Number of times \(s,a\): \(N_i(s,a)\)
  • Number of times \(s,a\to s'\): \(N_i(s,a, s')\)
  • Reward of \(s,a\): \(\hat r_i(s,a)\)
  • Reward bonus $$ b_i(s,a) = H\sqrt{\frac{\alpha}{N_i(s,a)}} $$
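
As a one-liner in code (a sketch, not from the slides): treating unvisited pairs as having a single visit, so that the bonus stays finite and maximal, is an illustrative convention.

```python
import numpy as np

def bonus(N_sa, H, alpha):
    """Exploration bonus b_i(s,a) = H * sqrt(alpha / N_i(s,a)).

    Unvisited pairs get the largest bonus; clipping N at 1 here is an
    illustrative convention to avoid dividing by zero."""
    return H * np.sqrt(alpha / np.maximum(N_sa, 1))
```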

Optimistic DP

  • In the finite-horizon setting, VI is just DP, run with the bonus added to the estimated reward:
  • Initialize \(\hat V^i_H(s)=0\)
  • for \(t=H-1, ..., 0\)
    • \(\hat Q^i_t(s,a)=\hat r_i(s,a)+b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]\)
    • \(\hat\pi^i_t(s) = \arg\max_a \hat Q^i_t(s,a)\)
    • \(\hat V^i_{t}(s)=\hat Q^i_t(s,\hat\pi^i_t(s) )\)
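
A minimal sketch of the optimistic DP above, taking the estimated transitions, estimated reward, and bonus as arrays; the function name, array shapes, and the usage example are illustrative assumptions.

```python
import numpy as np

def optimistic_dp(P_hat, r_hat, b, H):
    """Finite-horizon DP on the estimated model with reward r_hat + b.

    P_hat: (S, A, S) estimated transitions; r_hat, b: (S, A) arrays.
    Returns optimistic values V of shape (H+1, S) and a greedy policy pi of shape (H, S)."""
    S, A, _ = P_hat.shape
    V = np.zeros((H + 1, S))                 # \hat V^i_H(s) = 0
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        Q = r_hat + b + P_hat @ V[t + 1]     # \hat Q^i_t(s,a), shape (S, A)
        pi[t] = Q.argmax(axis=1)             # greedy (time-dependent) policy
        V[t] = Q.max(axis=1)
    return V, pi

# usage with arbitrary estimated quantities (illustrative shapes only)
S, A, H = 4, 2, 5
P_hat = np.full((S, A, S), 1.0 / S)
V, pi = optimistic_dp(P_hat, r_hat=np.zeros((S, A)), b=np.ones((S, A)), H=H)
print(V[0], pi[0])
```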

Example

(Diagram: the chain MDP from above, states \(0, 1, 2, \dots, H-1\).)

  • Iteration 1: uniformly random policy for exploration
    • trajectory contains \(s,a\) with probability \(\propto 1/A^s\)
  • Iteration 2: reward bonus incentivizes upward exploration
  • Eventually reach \(s=H-1\) and converge to \(\pi^\star_t(s) = 1\)

Agenda

1. Recap: Bandits

2. MBRL and Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

Analysis: Two Key Facts

  1. The exploration bonus bounds, with high probability, the difference $$ |\mathbb E_{s'\sim\hat P(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \quad\forall ~~s,a $$
    • similar to the confidence intervals bounding \(|\mu_a-\hat\mu_a|\)
  2. The exploration bonus leads to optimism $$ \hat V_t^i(s) \geq V_t^\star(s) $$
    • similar to showing that \(\hat\mu_{\hat a_\star} \geq \mu_\star\)

Regret in RL Setting

  • These two properties are key to proving a regret bound $$R(T) = \mathbb E\left[\sum_{i=1}^T V_0^\star (s_0)-V_0^{\pi^i}(s_0) \right] \lesssim H^2\sqrt{SA\cdot T}$$
  • Here regret is defined as cumulative sub-optimality over episodes
    • Note: sub-optimality in terms of value is a difference in cumulative reward (over time within an episode)
  • The argument follows the UCB proof structure:
    1. By optimism, \(V^\star_0(s_0) - V_0^{\pi^i}(s_0) \leq \hat V_0^i(s_0) - V^{\pi^i}_0(s_0)\)
    2. The Simulation Lemma compares \(\hat V_0^i(s_0)\) & \(V^{\pi^i}_0(s_0)\)
  • The regret proof is out of scope for this class (see 6789)

Example

(Diagram: the chain MDP from above, states \(0, 1, 2, \dots, H-1\).)

  • The reward bonus enables the "right amount" of exploration

Analysis: Exploration Bonus

  • Lemma: For any fixed function \(V:\mathcal S\to [0,H]\), with high probability, $$|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \leq H\sqrt{\frac{\alpha}{N_i(s,a)}}=b_i(s,a)$$ where \(\alpha\) depends on \(S,A,H\) and the failure probability
  • Proof: \(|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]|\)
    • \(=\left|\sum_{s'\in\mathcal S} (\hat P_i(s'|s,a)  - P(s'|s,a) )V(s')\right|\)
    • \(\leq \sum_{s'\in\mathcal S} |\hat P_i(s'|s,a)  - P(s'|s,a) | |V(s')|\)
    • \(\leq \left(\sum_{s'\in\mathcal S} |\hat P_i(s'|s,a)  - P(s'|s,a) |\right)\max_{s'} |V(s')| \)
    • \(\leq \sqrt{\frac{\alpha}{N_i(s,a)}}\, \underbrace{\max_{s'} |V(s')|}_{\leq H}\) using the estimation error Lemma from earlier

Analysis: Optimism

  • Lemma: as long as \(r(s,a)\in[0,1]\), \( \hat V_t^i(s) \geq V_t^\star(s)\) for all \(t,i,s\)
  • Proof: By induction.
    1. Base case \(\hat V_H^i(s)=0=V_H^\star(s)\)
    2. Assume that \( \hat V_{t+1}^i(s) \geq V_{t+1}^\star(s)\)
  • Then \(\hat Q_t^i(s,a) - Q_t^\star (s,a)\)
  • \( = \hat r_i(s,a) + b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]- r(s,a) - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \)
  • \( = b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \) (noiseless reward, so \(\hat r_i = r\))
  • \( \geq b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[V^\star_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \) (inductive assumption)
  • \( \geq b_i(s,a)-b_i(s,a) = 0 \) (previous Lemma)
  • Thus \(\hat Q_t^i(s,a) \geq Q_t^\star (s,a)\), which implies \(\hat V_t^i(s) \geq V_t^\star (s)\) (exercise)

Recap

  • PSet due Monday

 

  • Exploration problem
  • UCB-VI Algorithm

 

  • Next lecture: Imitation learning

Sp24 CS 4/5789: Lecture 22

By Sarah Dean
