CS 4/5789: Introduction to Reinforcement Learning

Lecture 20

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

 

0. Announcements & Recap

1. Linear Contextual Bandits

2. Interactive Demo

3. LinUCB Algorithm

Announcements

 

My office hours today are cancelled

 

Prelim corrections due tomorrow - please list collaborators

5789 Paper Review Assignment (weekly pace suggested)

HW 3 released tonight, due in 2 weeks

 

Final exam Monday 5/16 at 7pm

Multi-Armed Bandit

A simplified setting for studying exploration

  • Pull "arms" \(a\in \mathcal A =\{1,\dots,K\}\), get noisy reward \(r_t\sim r(a_t)\) with \(\mathbb E[r(a)] = \mu_a\)
  • Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(a^*)-r(a_t)  \right] = \sum_{t=1}^T \mathbb E[\mu^* - \mu_{a_t}]$$
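To make the setting concrete, a minimal simulation sketch (not from the slides); the arm means, horizon, and Bernoulli reward model are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])   # unknown means mu_a (illustrative values)
K, T = len(mu), 10_000

def pull(a):
    """Noisy reward r ~ r(a) with E[r(a)] = mu_a (Bernoulli here)."""
    return rng.binomial(1, mu[a])

# Regret accumulates the per-step gap mu* - mu_{a_t}.
mu_star = mu.max()
regret = 0.0
for t in range(T):
    a_t = rng.integers(K)        # placeholder policy: uniformly random arm
    pull(a_t)
    regret += mu_star - mu[a_t]
print(regret)                    # random play incurs regret linear in T
```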

MAB Recap

Explore-then-Commit

  1. Pull each arm \(N\) times and compute empirical mean \(\widehat \mu_a\)
  2. For \(t=NK+1,...,T\):
        Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
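A minimal sketch of Explore-then-Commit in the Bernoulli setup above (arm means and horizon are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])            # unknown arm means (illustrative)
K, T = len(mu), 10_000
N = int(T ** (2 / 3))                     # exploration budget N ~ T^(2/3)

def pull(a):
    return rng.binomial(1, mu[a])         # noisy reward with mean mu[a]

# 1. Exploration phase: pull each arm N times, form empirical means.
mu_hat = np.array([np.mean([pull(a) for _ in range(N)]) for a in range(K)])

# 2. Commit phase: pull the empirically best arm for the rest of the horizon.
a_star_hat = int(np.argmax(mu_hat))
for t in range(N * K, T):
    pull(a_star_hat)
```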

Upper Confidence Bound

For \(t=1,...,T\):

  • Pull \( a_t = \arg\max_a \left(\widehat \mu_t^a + \sqrt{C/N_t^a}\right)\)
  • Update empirical means \(\widehat \mu_t^a\) and counts \(N_t^a\)
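A minimal sketch of the UCB loop, again assuming Bernoulli rewards; the confidence constant \(C\) and arm means are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])            # unknown arm means (illustrative)
K, T, C = len(mu), 10_000, 2.0            # C is the confidence constant

counts = np.zeros(K)                      # N_t^a
means = np.zeros(K)                       # empirical means mu_hat_t^a

for t in range(T):
    # Optimistic index: empirical mean plus bonus sqrt(C / N_t^a);
    # unpulled arms get an infinite bonus so each arm is tried once.
    bonus = np.where(counts > 0, np.sqrt(C / np.maximum(counts, 1)), np.inf)
    a_t = int(np.argmax(means + bonus))
    r_t = rng.binomial(1, mu[a_t])        # noisy reward
    counts[a_t] += 1
    means[a_t] += (r_t - means[a_t]) / counts[a_t]   # running-mean update
```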

Explore-then-Commit with exploration \(N \approx T^{2/3}\) achieves \(R(T) \lesssim T^{2/3}\), while Upper Confidence Bound achieves \(R(T) \lesssim \sqrt{T}\).

Contextual Bandit

A (less) simplified setting for studying exploration

ex - a machine's make and model affect rewards, so the context is \(x = (\text{make}, \text{model}, \dots)\)

  • See context \(x_t\sim \mathcal D\), pull "arm" \(a_t\), and get noisy reward \(r_t\sim r(x_t, a_t)\) with \(\mathbb E[r(x, a)] = \mu_a(x)\)
  • Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t)  \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
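As a concrete instance (illustrative, not from the slides), take the linear reward model \(\mu_a(x) = \theta_a^\top x\), which is the linear contextual bandit setting on today's agenda; a minimal environment and regret computation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 3, 2, 1_000
theta = rng.normal(size=(K, d))   # unknown parameters defining mu_a(x)

def context():
    return rng.normal(size=d)     # x_t ~ D

def reward(x, a):
    # Noisy reward with conditional mean mu_a(x) = theta_a^T x.
    return theta[a] @ x + rng.normal(scale=0.1)

# Regret compares the best arm for each x_t against the arm pulled.
regret = 0.0
for t in range(T):
    x_t = context()
    a_t = rng.integers(K)         # placeholder policy: uniform random
    reward(x_t, a_t)
    mu_x = theta @ x_t            # mu_a(x_t) for every arm
    regret += mu_x.max() - mu_x[a_t]   # mu*(x_t) - mu_{a_t}(x_t)
```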

Explore-then-Commit

  1. Pull each arm \(N\) times and use supervised learning to estimate $$\widehat \mu_a(x) = \arg\min_{\mu\in\mathcal M} \sum_{i=1}^N (\mu(x_i^a) - r_i^a)^2$$
  2. For \(t=NK+1,...,T\):
        Observe context \(x_t\)
        Pull \(a_t = \widehat \pi(x_t) = \arg\max_a \widehat \mu_a(x_t)\)
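A minimal sketch of contextual Explore-then-Commit, assuming a linear model class \(\mathcal M\) so the least-squares fit has a closed form; all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 3, 2, 10_000
theta = rng.normal(size=(K, d))           # unknown per-arm parameters
N = int(T ** (2 / 3))

def reward(x, a):
    # Noisy reward with E[r(x, a)] = mu_a(x) = theta_a^T x.
    return theta[a] @ x + rng.normal(scale=0.1)

# 1. Exploration: pull each arm N times, fit mu_hat_a by least squares
#    (the arg-min over the linear model class M).
theta_hat = np.zeros((K, d))
for a in range(K):
    X = rng.normal(size=(N, d))           # contexts x_i^a ~ D
    r = np.array([reward(x, a) for x in X])
    theta_hat[a], *_ = np.linalg.lstsq(X, r, rcond=None)

# 2. Commit: act greedily with respect to the fitted models.
for t in range(N * K, T):
    x_t = rng.normal(size=d)
    a_t = int(np.argmax(theta_hat @ x_t))  # pi_hat(x_t) = argmax_a mu_hat_a(x_t)
    reward(x_t, a_t)
```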

Contextual Bandit

Set exploration \(N \approx T^{2/3}\),

we showed \(R(T) \lesssim T^{2/3}\) using prediction error guarantees \(\mathbb E_{x\sim \mathcal D}[|\widehat \mu_a(x) - \mu_a(x)|]\)


For context-dependent confidence bounds, we need to understand

\(\mathbb E[|\widehat \mu_a(x) - \mu_a(x)|\mid x]\)
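In the linear model \(\mu_a(x) = \theta_a^\top x\) (previewing LinUCB), this conditional error is governed by a context-dependent width \(\sqrt{x^\top A^{-1} x}\); a minimal ridge-regression sketch for a single arm, with illustrative \(\lambda\) and noise scale:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 3, 200, 1.0
theta = rng.normal(size=d)                 # true parameter (illustrative)

X = rng.normal(size=(n, d))                # observed contexts for one arm
r = X @ theta + rng.normal(scale=0.1, size=n)

# Ridge regression: theta_hat = (X^T X + lam I)^{-1} X^T r.
A = X.T @ X + lam * np.eye(d)
theta_hat = np.linalg.solve(A, X.T @ r)

# For a new context x, |mu_hat(x) - mu(x)| scales with the width
# sqrt(x^T A^{-1} x): small for contexts well covered by past data,
# large in directions the data has not explored.
x = rng.normal(size=d)
width = np.sqrt(x @ np.linalg.solve(A, x))
print(theta_hat @ x, width)
```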

Agenda

 

0. Announcements & Recap

1. Linear Contextual Bandits

2. Interactive Demo

3. LinUCB Algorithm