## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 18

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Explore-then-Commit

2. UCB Algorithm

3. UCB Analysis

## Announcements

HW2 due tonight!

5789 Paper Review Assignment (weekly pace suggested)

## Exam scores & corrections

Corrections

• due Tuesday after Spring Break
• corrected problems will be scored more strictly than original exam
• exam bonus proportional to the sum of differences between original & corrected score

## Exam Summary

1. Problem 1: Approximate Policy Evaluation
• Similar to PE proof from lecture with $$V$$
2. Problem 2: Optimal Machine Repair
• Similar to Gardening HW problem
3. Problem 3: State distributions
• Use proof techniques from review lecture
• Induction does not prove 3.2 (use 3.1, 3.2, & induction for 3.3)
4. Problem 4: Value of Linear Policy
• Use the finite-horizon Bellman Expectation Equation (not the Bellman Optimality Equation), or the unrolled expression for linear dynamics

## Multi-Armed Bandit

A simplified setting for studying exploration

## Recap

• "Arms" $$a\in \mathcal A =\{1,\dots,K\}$$
• Noisy rewards $$r_t\sim r(a_t)$$ with $$\mathbb E[r(a)] = \mu_a$$
• Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T \big(r(a^*)-r(a_t)\big) \right] = \sum_{t=1}^T \left(\mu^* - \mu_{a_t}\right)$$, where $$a^* = \arg\max_{a\in\mathcal A} \mu_a$$ and $$\mu^* = \mu_{a^*}$$
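
A minimal sketch of the setting above, to make the regret formula concrete. The Bernoulli reward model, the `means` values, and the helper names `pull` and `regret` are assumptions for illustration only; arms are indexed from 0 rather than 1.

```python
import random

def pull(arm, means, rng):
    """Draw a noisy reward r ~ r(arm).

    Assumption: rewards are Bernoulli with E[r(arm)] = means[arm].
    """
    return 1.0 if rng.random() < means[arm] else 0.0

def regret(actions, means):
    """Regret of an action sequence: sum over t of (mu* - mu_{a_t})."""
    mu_star = max(means)
    return sum(mu_star - means[a] for a in actions)

means = [0.2, 0.5, 0.8]   # hypothetical mu_a for K = 3 arms
rng = random.Random(0)
T = 1000

# A uniformly random policy, used only as a baseline for comparison.
actions = [rng.randrange(len(means)) for _ in range(T)]
rewards = [pull(a, means, rng) for a in actions]

print("regret of random policy:", regret(actions, means))
```

Always pulling the best arm gives zero regret, while a uniformly random policy pays the average gap at every step, so its regret grows linearly in $$T$$.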

Algorithm: Explore-then-Commit

1. Pull each arm $$N$$ times ($$t=1,\dots,NK$$) and compute the empirical mean $$\widehat \mu_a$$ of each arm
2. For $$t=NK+1,\dots,T$$: Pull $$\widehat a^* = \arg\max_a \widehat \mu_a$$
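
The two steps above can be sketched as follows. This is an illustrative simulation, not a reference implementation: the Bernoulli reward model, the function name `explore_then_commit`, and its parameters are assumptions, and arms are indexed from 0.

```python
import random

def explore_then_commit(means, N, T, seed=0):
    """Simulate Explore-then-Commit on a K-armed bandit.

    Assumption: arm a yields Bernoulli rewards with mean means[a].
    Returns the committed arm, the empirical means, and the regret.
    """
    rng = random.Random(seed)
    K = len(means)
    assert N * K <= T, "exploration phase must fit within the horizon"

    sums = [0.0] * K
    actions = []

    # Step 1: pull each arm N times (t = 1, ..., NK).
    for a in range(K):
        for _ in range(N):
            r = 1.0 if rng.random() < means[a] else 0.0
            sums[a] += r
            actions.append(a)
    mu_hat = [s / N for s in sums]

    # Step 2: commit to the empirically best arm for t = NK+1, ..., T.
    # Ties are broken in favor of the lowest arm index.
    a_hat = max(range(K), key=lambda a: mu_hat[a])
    actions.extend([a_hat] * (T - N * K))

    # Regret uses the true means, which the algorithm never observes.
    mu_star = max(means)
    total_regret = sum(mu_star - means[a] for a in actions)
    return a_hat, mu_hat, total_regret

a_hat, mu_hat, reg = explore_then_commit([0.2, 0.5, 0.8], N=50, T=1000)
print("committed arm:", a_hat, "empirical means:", mu_hat, "regret:", reg)
```

The exploration phase always costs $$N\sum_a (\mu^* - \mu_a)$$ in regret. The commit phase adds nothing if $$\widehat a^* = a^*$$ and adds $$(T-NK)(\mu^* - \mu_{\widehat a^*})$$ otherwise, which is the tradeoff that the choice of $$N$$ has to balance.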
