CS 4/5789: Introduction to Reinforcement Learning
Lecture 18
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements & Recap
1. Explore-then-Commit
2. UCB Algorithm
3. UCB Analysis
Announcements
HW2 due tonight!
5789 Paper Review Assignment (weekly pace suggested)
Prelim grades released
Exam scores & corrections
Relationship between scores and grades
Corrections
- due Tuesday after Spring Break
- corrected problems will be scored more strictly than original exam
- exam bonus proportional to the sum of differences between original & corrected scores
Exam Summary
- Problem 1: Approximate Policy Evaluation
- Similar to PE proof from lecture with \(V\)
- Problem 2: Optimal Machine Repair
- Similar to Gardening HW problem
- Problem 3: State distributions
- Use proof techniques from review lecture
- Induction does not prove 3.2 (use 3.1, 3.2, & induction for 3.3)
- Problem 4: Value of Linear Policy
- Finite horizon Bellman Expectation Equation not Bellman Optimality Equation, or unrolled expression for linear dynamics

Multi-Armed Bandit
A simplified setting for studying exploration
Recap
- "Arms" \(a\in \mathcal A =\{1,\dots,K\}\)
- Noisy rewards \(r_t\sim r(a_t)\) with \(\mathbb E[r(a)] = \mu_a\)
- Regret (with \(\mu^* = \max_a \mu_a\)): $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t) \right] = \mathbb E\left[\sum_{t=1}^T \left(\mu^* - \mu_{a_t}\right)\right]$$
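To make the recap concrete, here is a minimal Python sketch of a \(K\)-armed bandit and the regret a fixed policy incurs; the arm means, Gaussian noise model, and noise scale are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.9])   # assumed arm means mu_a, so K = 3
K, T = len(mu), 1000

def pull(a):
    """One noisy reward r_t ~ r(a_t); Gaussian noise is an assumed model."""
    return mu[a] + rng.normal(scale=0.1)

# Realized regret of a uniformly random policy: sum_t (mu* - mu_{a_t})
actions = rng.integers(K, size=T)
print(f"regret of uniform play: {np.sum(mu.max() - mu[actions]):.1f}")
```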
Algorithm: Explore-then-Commit
- Pull each arm \(N\) times (\(t=1,\dots,NK\)) and compute empirical mean \(\widehat \mu_a\)
- For \(t=NK+1,\dots,T\): Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
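A minimal sketch of Explore-then-Commit in Python, reusing the assumed Gaussian bandit from above (the means, noise scale, and choice of \(N\) are illustrative, not from the slides):

```python
import numpy as np

def explore_then_commit(pull, K, N, T):
    """Explore-then-Commit: pull each arm N times (t = 1, ..., NK),
    then commit to the empirical best arm for t = NK+1, ..., T."""
    rewards = np.zeros((K, N))
    actions = []
    for a in range(K):                 # exploration phase
        for n in range(N):
            rewards[a, n] = pull(a)
            actions.append(a)
    mu_hat = rewards.mean(axis=1)      # empirical means  \hat{mu}_a
    a_hat = int(np.argmax(mu_hat))     # \hat{a}* = argmax_a \hat{mu}_a
    actions += [a_hat] * (T - N * K)   # commit phase
    return np.array(actions)

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.9])         # assumed arm means
actions = explore_then_commit(lambda a: mu[a] + rng.normal(scale=0.1),
                              K=3, N=20, T=1000)
print(f"ETC regret: {np.sum(mu.max() - mu[actions]):.1f}")
```

Note the tradeoff this sketch makes visible: larger \(N\) makes \(\widehat a^*\) more reliable but spends more of the horizon on exploration.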