CS 4/5789: Introduction to Reinforcement Learning

Lecture 18

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall



0. Announcements & Recap

1. Explore-then-Commit

2. UCB Algorithm

3. UCB Analysis



HW2 due tonight!


5789 Paper Review Assignment (weekly pace suggested)


Prelim grades released

Exam scores & corrections


Relationship between scores and grades



  • corrections due Tuesday after Spring Break
  • corrected problems will be scored more strictly than the original exam
  • exam bonus proportional to the sum of differences between original & corrected scores

Exam Summary

  1. Problem 1: Approximate Policy Evaluation
    • Similar to PE proof from lecture with \(V\)
  2. Problem 2: Optimal Machine Repair
    • Similar to Gardening HW problem
  3. Problem 3: State distributions
    • Use proof techniques from review lecture
    • Induction does not prove 3.2 (use 3.1, 3.2, & induction for 3.3)
  4. Problem 4: Value of Linear Policy
    • Use the finite-horizon Bellman Expectation Equation (not the Bellman Optimality Equation), or the unrolled expression for linear dynamics

Multi-Armed Bandit

A simplified setting for studying exploration


  • "Arms" \(a\in \mathcal A =\{1,\dots,K\}\)
  • Noisy rewards \(r_t\sim r(a_t)\) with \(\mathbb E[r(a)] = \mu_a\)
  • Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T \left(r(a^*)-r(a_t)\right)\right] = \sum_{t=1}^T \left(\mu^* - \mu_{a_t}\right)$$ where \(a^* = \arg\max_a \mu_a\) and \(\mu^* = \mu_{a^*}\)
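
As a minimal sketch of the regret sum above: given the true means and a fixed sequence of pulled arms, add up the per-round gaps \(\mu^* - \mu_{a_t}\). The function name `pseudo_regret` and the 0-based arm indexing are assumptions made for illustration; the slides index arms from 1.

```python
def pseudo_regret(mu, actions):
    """Sum of gaps mu* - mu[a_t] over a sequence of pulled arms.

    mu      : list of true mean rewards, one per arm (0-indexed here, an assumption)
    actions : list of arm indices a_1, ..., a_T that were pulled
    """
    mu_star = max(mu)  # mean reward of the best arm
    return sum(mu_star - mu[a] for a in actions)


# Hypothetical 3-armed instance: pulling a suboptimal arm costs its gap to the best arm.
example_mu = [0.2, 0.5, 0.9]
example_regret = pseudo_regret(example_mu, [0, 1, 2, 2])
```

Only rounds that pull a suboptimal arm contribute, so always pulling the best arm gives zero regret.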

Algorithm: Explore-then-Commit

  1. Pull each arm \(N\) times (\(t=1,\dots,NK\)) and compute empirical mean \(\widehat \mu_a\)
  2. For \(t=NK+1,\dots,T\): Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
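
The two steps above can be sketched in Python as follows. This is an illustrative sketch, not the course's reference code: the names `explore_then_commit` and `make_bernoulli_pull`, the 0-based arm indices, the arm-by-arm exploration order, and the Bernoulli reward model are all assumptions.

```python
import random


def make_bernoulli_pull(mu, seed=0):
    """Hypothetical reward model: arm a pays 1 with probability mu[a], else 0."""
    rng = random.Random(seed)

    def pull(a):
        return 1.0 if rng.random() < mu[a] else 0.0

    return pull


def explore_then_commit(pull, K, N, T):
    """Run Explore-then-Commit for T rounds on K arms (0-indexed).

    pull(a) returns one noisy reward for arm a.
    Returns the committed arm, the empirical means, and the list of pulled arms.
    """
    if N * K > T:
        raise ValueError("exploration phase N*K must fit within the horizon T")

    # Step 1: pull each arm N times and record the empirical mean.
    actions = []
    mu_hat = []
    for a in range(K):
        total = 0.0
        for _ in range(N):
            total += pull(a)
            actions.append(a)
        mu_hat.append(total / N)

    # Step 2: commit to the arm with the highest empirical mean
    # (ties go to the lowest index) for the remaining T - N*K rounds.
    a_hat = max(range(K), key=lambda a: mu_hat[a])
    for _ in range(T - N * K):
        pull(a_hat)
        actions.append(a_hat)

    return a_hat, mu_hat, actions


# Example run on a hypothetical 3-armed Bernoulli bandit.
pull = make_bernoulli_pull([0.2, 0.5, 0.9], seed=0)
a_hat, mu_hat, actions = explore_then_commit(pull, K=3, N=20, T=200)
```

The choice of \(N\) controls the trade-off: a small \(N\) risks committing to the wrong arm, while a large \(N\) spends many rounds on arms that are known to be suboptimal.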
