CS 4/5789: Introduction to Reinforcement Learning
Lecture 18
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements & Recap
1. Explore-then-Commit
2. UCB Algorithm
3. UCB Analysis
Announcements
HW2 due tonight!
5789 Paper Review Assignment (weekly pace suggested)
Prelim grades released
Exam scores & corrections
Relationship between scores and grades
Corrections
- due Tuesday after Spring Break
- corrected problems will be scored more strictly than original exam
- exam bonus proportional to the sum of differences between original & corrected scores
Exam Summary
- Problem 1: Approximate Policy Evaluation
- Similar to PE proof from lecture with \(V\)
- Problem 2: Optimal Machine Repair
- Similar to Gardening HW problem
- Problem 3: State distributions
- Use proof techniques from review lecture
- Induction does not prove 3.2 (use 3.1, 3.2, & induction for 3.3)
- Problem 4: Value of Linear Policy
- Finite horizon Bellman Expectation Equation not Bellman Optimality Equation, or unrolled expression for linear dynamics

Multi-Armed Bandit
A simplified setting for studying exploration
Recap
- "Arms" \(a\in \mathcal A =\{1,\dots,K\}\)
- Noisy rewards \(r_t\sim r(a_t)\) with \(\mathbb E[r(a)] = \mu_a\)
- Regret (with \(\mu^* = \max_a \mu_a\)): $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t) \right] = \mathbb E\left[\sum_{t=1}^T \left(\mu^* - \mu_{a_t}\right)\right]$$
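To make the recap concrete, here is a minimal Python sketch of a \(K\)-armed bandit and the regret a fixed policy incurs; the arm means, Gaussian noise model, and noise scale are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.9])   # assumed arm means mu_a, so K = 3
K, T = len(mu), 1000

def pull(a):
    """One noisy reward r_t ~ r(a_t); Gaussian noise is an assumed model."""
    return mu[a] + rng.normal(scale=0.1)

# Realized regret of a uniformly random policy: sum_t (mu* - mu_{a_t})
actions = rng.integers(K, size=T)
print(f"regret of uniform play: {np.sum(mu.max() - mu[actions]):.1f}")
```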
Algorithm: Explore-then-Commit
- Pull each arm \(N\) times (\(t=1,\dots,NK\)) and compute empirical mean \(\widehat \mu_a\)
- For \(t=NK+1,\dots,T\): Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
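A minimal sketch of Explore-then-Commit in Python, reusing the assumed Gaussian bandit from above (the means, noise scale, and choice of \(N\) are illustrative, not from the slides):

```python
import numpy as np

def explore_then_commit(pull, K, N, T):
    """Explore-then-Commit: pull each arm N times (t = 1, ..., NK),
    then commit to the empirical best arm for t = NK+1, ..., T."""
    rewards = np.zeros((K, N))
    actions = []
    for a in range(K):                 # exploration phase
        for n in range(N):
            rewards[a, n] = pull(a)
            actions.append(a)
    mu_hat = rewards.mean(axis=1)      # empirical means  \hat{mu}_a
    a_hat = int(np.argmax(mu_hat))     # \hat{a}* = argmax_a \hat{mu}_a
    actions += [a_hat] * (T - N * K)   # commit phase
    return np.array(actions)

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.9])         # assumed arm means
actions = explore_then_commit(lambda a: mu[a] + rng.normal(scale=0.1),
                              K=3, N=20, T=1000)
print(f"ETC regret: {np.sum(mu.max() - mu[actions]):.1f}")
```

Note the tradeoff this sketch makes visible: larger \(N\) makes \(\widehat a^*\) more reliable but spends more of the horizon on exploration.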