## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 18

Prof. Sarah Dean

MW 2:45-4pm

110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Explore-then-Commit

2. UCB Algorithm

3. UCB Analysis

## Announcements

HW2 due tonight!

5789 Paper Review Assignment (weekly pace *suggested*)

Prelim grades released

## Exam scores & corrections

Relationship between scores and grades

Corrections

- due Tuesday after Spring Break
- corrected problems will be scored more strictly than original exam
- exam bonus proportional to the sum of differences between original & corrected score

## Exam Summary

- Problem 1: Approximate Policy Evaluation
  - Similar to PE proof from lecture with \(V\)

- Problem 2: Optimal Machine Repair
  - Similar to Gardening HW problem

- Problem 3: State distributions
  - Use proof techniques from review lecture
  - Induction does not prove 3.2 (use 3.1, 3.2, & induction for 3.3)

- Problem 4: Value of Linear Policy
  - Finite horizon Bellman *Expectation* Equation, not Bellman *Optimality* Equation, or unrolled expression for linear dynamics

## Multi-Armed Bandit

A simplified setting for studying exploration

## Recap

- "Arms" \(a\in \mathcal A =\{1,\dots,K\}\)
- Noisy rewards \(r_t\sim r(a_t)\) with \(\mathbb E[r(a)] = \mu_a\)
- Regret, with \(\mu^* = \max_a \mu_a\) and \(a^* = \arg\max_a \mu_a\): $$ R(T) = \mathbb E\left[\sum_{t=1}^T \big(r(a^*)-r(a_t)\big) \right] = \sum_{t=1}^T \left(\mu^* - \mu_{a_t}\right)$$
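
To make the regret formula concrete, here is a minimal sketch that evaluates \(\sum_{t=1}^T (\mu^* - \mu_{a_t})\) for a given sequence of pulls. The function name `pseudo_regret`, the mean values, and the pull sequence are hypothetical and chosen only for illustration. Arms are indexed from 0 here rather than from 1 as on the slide.

```python
def pseudo_regret(mu, pulls):
    """Sum of the gaps mu* - mu_{a_t} over the pulled arms a_1, ..., a_T."""
    mu_star = max(mu)  # mean reward of the best arm
    return sum(mu_star - mu[a] for a in pulls)


# Hypothetical example: K = 3 arms, T = 5 rounds.
mu = [0.2, 0.5, 0.9]     # assumed true means (unknown to the learner in practice)
pulls = [0, 1, 2, 2, 2]  # assumed sequence of pulled arms
print(pseudo_regret(mu, pulls))
```

Only rounds that pull a suboptimal arm contribute: each such round adds that arm's gap \(\mu^* - \mu_a\), and pulling the best arm adds zero.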

Algorithm: Explore-then-Commit

- Pull each arm \(N\) times (\(t=1,..., NK\)) and compute empirical mean \(\widehat \mu_a\)
- For \(t=NK+1,...,T\): Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
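
The two steps above can be sketched as a short simulation. This is an illustrative sketch rather than course-provided code: it assumes Bernoulli rewards (the slides only require noisy rewards with mean \(\mu_a\)), indexes arms from 0, and the function name `explore_then_commit` and its parameters are hypothetical.

```python
import random


def explore_then_commit(mu, N, T, rng):
    """Explore-then-Commit on arms with means mu (Bernoulli rewards assumed).

    Returns the sequence of pulled arms (0-indexed) and the empirical means.
    """
    K = len(mu)
    assert N >= 1 and N * K <= T, "exploration phase must fit in the horizon"

    def pull(a):
        # Assumption: reward is Bernoulli with mean mu[a].
        return 1.0 if rng.random() < mu[a] else 0.0

    totals = [0.0] * K
    pulls = []
    # Exploration: pull each arm N times (rounds t = 1, ..., NK).
    for a in range(K):
        for _ in range(N):
            totals[a] += pull(a)
            pulls.append(a)
    mu_hat = [s / N for s in totals]  # empirical mean of each arm

    # Commitment: pull the empirically best arm for t = NK+1, ..., T.
    # Rewards in this phase are not sampled since only the pulls are recorded.
    a_hat = max(range(K), key=lambda a: mu_hat[a])
    pulls.extend([a_hat] * (T - N * K))
    return pulls, mu_hat


# Hypothetical usage: 3 arms, 20 exploration pulls each, horizon 1000.
pulls, mu_hat = explore_then_commit([0.2, 0.5, 0.9], N=20, T=1000, rng=random.Random(0))
print(mu_hat, pulls[-1])
```

The choice of \(N\) sets the tradeoff: a larger \(N\) gives more accurate \(\widehat \mu_a\) but spends more rounds on suboptimal arms, while a smaller \(N\) risks committing to the wrong arm for the remaining \(T - NK\) rounds.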
