Blocking Problems for Multi-Armed Bandits and Matching Problems

Nicholas Bishop

Resource Deployment

We must account for resource availability!

Outline

Adversarial Blocking Bandits

Sequential Blocked Matching

Mulit-Armed Bandits

\textcolor{blue}{\mu_{1}}

\textcolor{blue}{\mu_{2}}

\textcolor{blue}{\mu_{3}}

Mulit-Armed Bandits

\textcolor{blue}{\mu_{1}}

\textcolor{blue}{\mu_{2}}

\textcolor{blue}{\mu_{3}}

GOAL: Maximise reward accumulated over time

Mulit-Armed Bandits

\textcolor{blue}{\mu_{1}}

\textcolor{blue}{\mu_{2}}

\textcolor{blue}{\mu_{3}}

\textcolor{green}{i_{t} = \arg\max_{i} \mu_{i}}

Just pull the best arm on every time step!

Mulit-Armed Bandits

\textcolor{blue}{\mu_{1}}

\textcolor{blue}{\mu_{2}}

\textcolor{blue}{\mu_{3}}

\textcolor{green}{i_{t} = \arg\max_{i} \mu_{i}}

Just pull the best arm on every time step!

Mulit-Armed Bandits

\textcolor{blue}{?}

\textcolor{red}{i_{t} = \arg\max_{i} \mu_{i}}

Problem: Rewards are not known!

Multi-Armed Bandits

Rewards may be noisy!

Multi-Armed Bandits

Or worse, rewards may be chosen by a malicious adversary!

Multi-Armed Bandits

How do we lean about arms whilst maintaining good long-term reward?

Multi-Armed Bandits

How do we lean about arms whilst maintaining good long-term reward?

EXP3

UCB

ETC

Blocking Bandits

\textcolor{blue}{\mu_{1}}

\textcolor{blue}{\mu_{2}}

\textcolor{blue}{\mu_{3}}

\textcolor{red}{D_{1}}

\textcolor{red}{D_{2}}

\textcolor{red}{D_{3}}

Blocking Bandits

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

\textcolor{green}{1.5}

Blocking Bandits

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

Blocking Bandits

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

\textcolor{green}{0.9}

Blocking Bandits

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

Blocking Bandits

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

What's a good performance benchmark?

Blocking Bandits

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

We'd like to pull the best arm all the time...

Blocking Bandits

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

But we can't as it will sometimes be blocked...

Blocking Bandits

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

Instead we must interleave between arms!

Blocking Bandits

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

That is, we must benchmark against an adaptive policy!

No Blocking

With Blocking

Benchmark policy:

No Blocking

With Blocking

Benchmark policy:

Always pull the best arm!

Find the best pulling schedule!

No Blocking

With Blocking

Benchmark policy:

Always pull the best arm!

Find the best pulling schedule!

Easy to compute when rewards known?

No Blocking

With Blocking

Benchmark policy:

Always pull the best arm!

Find the best pulling schedule!

Easy to compute when rewards known?

YES

No Blocking

With Blocking

Benchmark policy:

Always pull the best arm!

Find the best pulling schedule!

Easy to compute when rewards known?

YES

Stochastic Blocking Bandits

Stochastic/noisy feedback

Delays for each arm are fixed

Stochastic Blocking Bandits

Stochastic/noisy feedback

Delays for each arm are fixed

There is no polynomial time algorithm in this setting!

Finding a Good Approximation

What if we just greedily pull the best arm on each time step out of those available?

Finding a Good Approximation

What if we just greedily pull the best arm on each time step out of those available?

\textcolor{green}{(1-1/e)}

- approximation!

Taking Things Online

GREEDY

Taking Things Online

+ UCB

GREEDY

Taking Things Online

+ UCB

GREEDY

Inherit an instance-dependent regret bound from classical bandit setting!

Taking Things Online

+ UCB

GREEDY

Inherit an instance-dependent regret bound from classical bandit setting!

\mathcal{O}(\log(T))

Blocking in the Real World

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

Blocking in the Real World

\textcolor{blue}{1}

\textcolor{blue}{5}

\textcolor{blue}{8}

\textcolor{red}{1}

\textcolor{red}{2}

\textcolor{blue}{10}

\textcolor{blue}{11}

\textcolor{blue}{2}

\textcolor{red}{3}

\textcolor{red}{5}

\textcolor{red}{3}

Blocking in the Real World

\textcolor{blue}{10}

\textcolor{blue}{11}

\textcolor{blue}{2}

\textcolor{red}{3}

\textcolor{red}{5}

\textcolor{red}{3}

Blocking in the Real World

\textcolor{blue}{2}

\textcolor{blue}{20}

\textcolor{blue}{10}

\textcolor{red}{6}

\textcolor{red}{7}

\textcolor{red}{6}

Adversarial Blocking Bandits

Rewards vary adversarially in accordance with a path variation budget

Blocking durations are free to vary arbitrarily, but are bounded above.

\sum^{T-1}_{t=1}\sum^{K}_{k=1}|X^{k}_{t+1} - X^{k}_{t}| \leq B_{T}

D^{k}_{t} \leq \tilde{D} \quad \forall k , t

Full Information Setting

Consider a greedy algorithm which pulls the arm with highest reward

Using a knapsack-style proof, we obtain the following regret guarantee

\text{arg}\max_{k \in A_{t}}X^{k}_{t}

\left(1 + \tilde{D}\right)^{-1}\left(1 - \frac{\tilde{D}B_{T}}{r(\pi^{\star})}\right)