## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 21

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Model-Based RL with Exploration

2. UCB Value Iteration Algorithm

3. UCB-VI Analysis

## Announcements

5789 Paper Review Assignment (weekly pace suggested)

HW 3 released Monday, due 4/25

Final exam Monday 5/16 at 7pm

## Recap: Contextual Bandit

A (less) simplified setting for studying exploration

ex - machine make and model affect rewards, so the context $$x$$ encodes these machine features

• See context $$x_t\sim \mathcal D$$, pull "arm" $$a_t$$ and get reward $$r_t\sim r(x_t, a_t)$$ with $$\mathbb E[r(x, a)] = \mu_a(x)$$; in the linear case, $$\mu_a(x)=\theta_a^\top x$$
• Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$, where $$\mu^*(x) = \max_a \mu_a(x)$$

LinUCB Algorithm

For $$t=1,...,T$$:

1. Observe context $$x_t$$
2. Pull $$a_t = \arg\max_a (\widehat \theta^a_t)^\top x_t + \alpha \sqrt{x_t^\top (A_t^a)^{-1} x_t}$$
3. Update $$\widehat \theta^{a_t}$$ and $$A^{a_t}$$
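
The loop above can be sketched in code. This is a minimal simulation, not the lecture's reference implementation: it assumes the ridge-regression update $$A^a = \lambda I + \sum x x^\top$$, $$b^a = \sum r x$$, $$\widehat\theta^a = (A^a)^{-1} b^a$$, and the names `linucb`, `lam`, and `noise_std` are hypothetical. The true parameters `thetas` are used only to simulate rewards and to measure regret.

```python
import numpy as np

def linucb(contexts, thetas, alpha=1.0, lam=1.0, noise_std=0.1, seed=0):
    """Run LinUCB on a simulated linear contextual bandit.

    contexts: (T, d) array of contexts x_t
    thetas:   (K, d) array of true parameters theta_a (simulation only)
    Returns the chosen arms and the cumulative expected regret.
    """
    rng = np.random.default_rng(seed)
    T, d = contexts.shape
    K = thetas.shape[0]
    # Assumed update rule: A^a = lam * I + sum of x x^T, b^a = sum of r x
    A = np.stack([lam * np.eye(d) for _ in range(K)])
    b = np.zeros((K, d))
    arms = np.zeros(T, dtype=int)
    regret = np.zeros(T)
    for t in range(T):
        x = contexts[t]                              # 1. observe context
        ucb = np.empty(K)
        for a in range(K):
            theta_hat = np.linalg.solve(A[a], b[a])  # ridge estimate
            bonus = np.sqrt(x @ np.linalg.solve(A[a], x))
            ucb[a] = theta_hat @ x + alpha * bonus
        a_t = int(np.argmax(ucb))                    # 2. pull the UCB arm
        mu = thetas @ x                              # true mean rewards
        r = mu[a_t] + noise_std * rng.standard_normal()
        A[a_t] += np.outer(x, x)                     # 3. update pulled arm only
        b[a_t] += r * x
        arms[t] = a_t
        regret[t] = mu.max() - mu[a_t]
    return arms, np.cumsum(regret)
```

Only the pulled arm's statistics change at each step, so the exploration bonus for rarely pulled arms stays large until they are tried.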

## Contextual Bandit

• Claim: with high probability, $$\theta_a^\top x \leq (\widehat \theta^a_t)^\top x + \alpha \sqrt{x^\top (A_t^a)^{-1} x}$$
• Chebyshev's inequality: if $$\mathbb E[u]=0$$, then with probability at least $$1-1/\beta^2$$, $$|u| \leq \beta \sqrt{\mathbb E[u^2]}$$
• Proof of Claim:
• Strategy: apply Chebyshev's inequality with $$u=\theta_a^\top x - (\widehat \theta^a_t)^\top x$$
• Last lecture: showed $$\mathbb E[(\theta_a - \widehat \theta^a_t)^\top x]=0$$
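
The claim can be checked numerically in a simplified setting. The sketch below makes assumptions the slides do not state: a fixed set of past contexts for one arm, unit-variance Gaussian noise, and an unregularized least-squares estimate with $$A = \sum x_i x_i^\top$$. Under those assumptions $$u$$ has mean zero and $$\mathbb E[u^2] = x^\top A^{-1} x$$, so Chebyshev's inequality with $$\beta=\alpha$$ predicts coverage of at least $$1-1/\alpha^2$$. The function name `ucb_coverage` and its parameters are hypothetical.

```python
import numpy as np

def ucb_coverage(alpha=2.0, n=50, d=3, trials=5000, seed=0):
    """Fraction of trials in which |u| <= alpha * sqrt(x^T A^{-1} x)."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(d)     # hypothetical true parameter
    X = rng.standard_normal((n, d))    # fixed past contexts for one arm
    A = X.T @ X                        # no regularization (assumption)
    x = rng.standard_normal(d)         # query context
    width = alpha * np.sqrt(x @ np.linalg.solve(A, x))
    # Fresh unit-variance noise in every trial; rewards have shape (trials, n)
    rewards = X @ theta + rng.standard_normal((trials, n))
    # Least-squares estimate per trial, shape (trials, d)
    theta_hat = np.linalg.solve(A, X.T @ rewards.T).T
    u = (theta - theta_hat) @ x        # estimation error along x
    return float(np.mean(np.abs(u) <= width))
```

With Gaussian noise the observed coverage is typically well above the Chebyshev guarantee, which holds for any zero-mean distribution with finite variance and is therefore conservative.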

## Recap: Tabular MBRL

Algorithm:

1. Query each $$(s,a)$$ pair $$\frac{N}{SA}$$ times, record sample $$s'\sim P(s,a)$$
2. Fit transition model by counting: $$\widehat P(s'\mid s,a) = \frac{\sum_{i=1}^N \mathbb 1\{(s_i, a_i, s_i') = (s, a, s')\}}{\sum_{i=1}^N \mathbb 1\{(s_i, a_i) = (s, a)\}}$$
3. Design $$\widehat \pi$$ as if $$\widehat P$$ is true
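
The three steps above can be sketched as follows. This is an illustration under assumptions not fixed by the slide: a known reward table `r[s, a]`, a discounted infinite-horizon objective with discount `gamma`, and value iteration as the planning method. All function names are hypothetical.

```python
import numpy as np

def query_all_pairs(P, n_per_pair, rng):
    """Step 1: sample s' ~ P(s, a) n_per_pair times for every (s, a)."""
    S, A, _ = P.shape
    return [(s, a, rng.choice(S, p=P[s, a]))
            for s in range(S) for a in range(A) for _ in range(n_per_pair)]

def fit_transitions(samples, S, A):
    """Step 2: estimate P_hat[s, a, s'] by counting."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in samples:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to uniform (an arbitrary choice here)
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)

def value_iteration(P, r, gamma=0.9, iters=500):
    """Step 3: plan as if P were the true model (discounted setting assumed)."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = r + gamma * P @ V          # (S, A) Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V
```

Calling `value_iteration(fit_transitions(samples, S, A), r)` gives a policy that is optimal for the estimated model; how well it performs on the true model is what the analysis addresses.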

Analysis: $$\widehat \pi$$ vs. $$\pi^*$$

• Compare $$\widehat P$$ and $$P$$ (Hoeffding's)
• Compare $$\widehat V^\pi$$ and $$V^\pi$$ (Simulation Lemma)
• Compare $$\widehat V^{\widehat \pi}$$ and $$V^{\pi^*}$$
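
One standard way to organize the last comparison, sketched here under the assumption that $$\widehat \pi$$ is exactly optimal for the estimated model $$\widehat P$$, is to add and subtract the estimated values:

$$V^{\pi^*} - V^{\widehat \pi} = \underbrace{\left(V^{\pi^*} - \widehat V^{\pi^*}\right)}_{\text{Simulation Lemma}} + \underbrace{\left(\widehat V^{\pi^*} - \widehat V^{\widehat \pi}\right)}_{\leq 0} + \underbrace{\left(\widehat V^{\widehat \pi} - V^{\widehat \pi}\right)}_{\text{Simulation Lemma}}$$

The middle term is nonpositive because $$\widehat \pi$$ maximizes value under $$\widehat P$$. The two outer terms each compare the same policy under the true and estimated models.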
