Blocking Problems for Multi-Armed Bandits and Matching Problems

Nicholas Bishop

Resource Deployment

We must account for resource availability!

Outline

Adversarial Blocking Bandits

Sequential Blocked Matching

Multi-Armed Bandits

\textcolor{blue}{\mu_{1}}
\textcolor{blue}{\mu_{2}}
\textcolor{blue}{\mu_{3}}

GOAL: Maximise reward accumulated over time

\textcolor{green}{i_{t} = \arg\max_{i} \mu_{i}}

Just pull the best arm on every time step!

\textcolor{blue}{?}
\textcolor{blue}{?}
\textcolor{blue}{?}
\textcolor{red}{i_{t} = \arg\max_{i} \mu_{i}}

Problem: Rewards are not known!

Multi-Armed Bandits

Rewards may be noisy!

Or worse, rewards may be chosen by a malicious adversary!

How do we learn about arms whilst maintaining good long-term reward?

EXP3

UCB

ETC

Blocking Bandits

\textcolor{blue}{\mu_{1}}
\textcolor{blue}{\mu_{2}}
\textcolor{blue}{\mu_{3}}
\textcolor{red}{D_{1}}
\textcolor{red}{D_{2}}
\textcolor{red}{D_{3}}

Blocking Bandits

[Figure: example with \mu = (1, 5, 8) in blue and D = (1, 2, 2) in red; green values 1.5 and 0.9 appear on successive builds.]

What's a good performance benchmark?

We'd like to pull the best arm all the time...

But we can't as it will sometimes be blocked...

Instead we must interleave between arms!

That is, we must benchmark against an adaptive policy!

No Blocking
Benchmark policy: Always pull the best arm!
Easy to compute when rewards known? YES

With Blocking
Benchmark policy: Find the best pulling schedule!
Easy to compute when rewards known? NO

Stochastic Blocking Bandits

Stochastic/noisy feedback

Delays for each arm are fixed

Even when the mean rewards are known, there is no polynomial-time algorithm for finding the best pulling schedule in this setting (under standard complexity-theoretic assumptions)!

Finding a Good Approximation

What if we just greedily pull the best arm on each time step out of those available?

\textcolor{green}{(1-1/e)}-approximation!
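The greedy rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `greedy_schedule` is hypothetical, the mean rewards are assumed known, and a delay of D is taken to mean that an arm pulled at step t is next available at step t + D (so a delay of 1 means no blocking).

```python
def greedy_schedule(mu, delays, horizon):
    """Pull the highest-mean available arm at every step.

    Assumption: an arm pulled at step t with delay D is free again at step t + D.
    Returns the list of pulled arms (None if nothing is available) and the total mean reward.
    """
    next_free = [0] * len(mu)  # first step at which each arm can be pulled again
    schedule, total = [], 0
    for t in range(horizon):
        available = [k for k in range(len(mu)) if next_free[k] <= t]
        if not available:
            schedule.append(None)
            continue
        best = max(available, key=lambda k: mu[k])
        schedule.append(best)
        total += mu[best]
        next_free[best] = t + delays[best]
    return schedule, total


# Example values taken from the slides: rewards (1, 5, 8), delays (1, 2, 2).
schedule, total = greedy_schedule(mu=[1, 5, 8], delays=[1, 2, 2], horizon=6)
print(schedule, total)
```

On this example the greedy rule interleaves the two best arms, since the best arm is blocked on every other step.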

Taking Things Online

GREEDY + UCB

Inherit an instance-dependent regret bound from the classical bandit setting: \mathcal{O}(\log(T))
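A minimal sketch of combining the greedy rule with UCB indices might look as follows. Everything here is an assumption made for illustration rather than the exact algorithm from the talk: rewards are simulated as Bernoulli draws, the index is the standard UCB1 bonus, and an arm pulled at step t with delay D is assumed to be free again at step t + D.

```python
import math
import random


def ucb_index(reward_sum, count, t):
    """Standard UCB1 index; unpulled arms get an infinite index so they are tried first."""
    if count == 0:
        return float("inf")
    return reward_sum / count + math.sqrt(2 * math.log(t + 1) / count)


def ucb_greedy(mu, delays, horizon, seed=0):
    """At each step, pull the available arm with the highest UCB index."""
    rng = random.Random(seed)
    n = len(mu)
    counts, sums, next_free = [0] * n, [0.0] * n, [0] * n
    pulls = []
    for t in range(horizon):
        available = [k for k in range(n) if next_free[k] <= t]
        if not available:
            pulls.append(None)
            continue
        k = max(available, key=lambda a: ucb_index(sums[a], counts[a], t))
        reward = 1.0 if rng.random() < mu[k] else 0.0  # simulated Bernoulli feedback
        counts[k] += 1
        sums[k] += reward
        next_free[k] = t + delays[k]
        pulls.append(k)
    return pulls, counts


pulls, counts = ucb_greedy(mu=[0.1, 0.5, 0.8], delays=[1, 2, 2], horizon=200)
print(counts)
```

The only change relative to the known-reward greedy rule is that the unknown means are replaced by optimistic estimates.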

Blocking in the Real World

[Figure: rewards (blue) and delays (red) change over time: rewards (1, 5, 8) with delays (1, 2, 2), then rewards (10, 11, 2) with delays (3, 5, 3), then rewards (2, 20, 10) with delays (6, 7, 6).]

Adversarial Blocking Bandits

Rewards vary adversarially in accordance with a path variation budget:

\sum^{T-1}_{t=1}\sum^{K}_{k=1}|X^{k}_{t+1} - X^{k}_{t}| \leq B_{T}

Blocking durations are free to vary arbitrarily, but are bounded above:

D^{k}_{t} \leq \tilde{D} \quad \forall k , t
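To make the two constraints concrete, the following sketch checks them on a small instance. The function names and the list-of-lists layout (`X[t][k]` for the reward and `D[t][k]` for the delay of arm k at step t) are assumptions made for this example.

```python
def path_variation(X):
    """Sum over time and arms of |X[t+1][k] - X[t][k]|."""
    return sum(
        abs(X[t + 1][k] - X[t][k])
        for t in range(len(X) - 1)
        for k in range(len(X[0]))
    )


def is_feasible(X, D, budget, max_delay):
    """Check the variation budget B_T and the delay bound (D-tilde)."""
    delays_ok = all(d <= max_delay for row in D for d in row)
    return path_variation(X) <= budget and delays_ok


X = [[1, 5], [3, 5], [3, 1]]  # two arms over three time steps
D = [[1, 2], [2, 2], [1, 3]]
print(path_variation(X))                         # 2 + 0 + 0 + 4 = 6
print(is_feasible(X, D, budget=6, max_delay=3))  # True
```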

Full Information Setting

Consider a greedy algorithm which pulls the available arm with highest reward:

\arg\max_{k \in A_{t}}X^{k}_{t}

Using a knapsack-style proof, we obtain the following approximation guarantee relative to the optimal policy \pi^{\star}:

\left(1 + \tilde{D}\right)^{-1}\left(1 - \frac{\tilde{D}B_{T}}{r(\pi^{\star})}\right)

Bandit Setting

Split the time horizon [0, T] into blocks of length \Delta_{T}

Consider one such block

At the start of the block, play each arm once, and store the rewards observed. Then pull no arms until all arms are available.

Then play greedily, using the rewards received in the first phase as a proxy for the real reward

Pull no arms at the end of the block so all arms will be available at the beginning of the next block.
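The three phases within a block can be sketched as below. This is a simplified illustration, not the exact procedure from the talk: the names `X`, `D`, `block_len` and `max_delay` are hypothetical, an arm pulled at step t with delay d is assumed to be free again at step t + d, and the final idle phase conservatively lasts `max_delay` steps.

```python
def block_greedy(X, D, block_len, max_delay):
    """X[t][k] and D[t][k] are the reward and delay of arm k at step t."""
    T, K = len(X), len(X[0])
    pulls, total = [None] * T, 0
    for start in range(0, T, block_len):
        end = min(start + block_len, T)
        next_free = [start] * K  # all arms are available at the start of a block
        proxy = [0] * K
        t = start
        # Phase 1: pull each arm once and store the observed reward.
        for k in range(K):
            if t >= end:
                break
            proxy[k] = X[t][k]
            total += X[t][k]
            pulls[t] = k
            next_free[k] = t + D[t][k]
            t += 1
        # Idle until every arm is available again.
        t = max([t] + next_free)
        # Phase 2: greedy on the stored proxies, stopping max_delay steps early.
        stop = end - max_delay
        while t < stop:
            available = [k for k in range(K) if next_free[k] <= t]
            if available:
                k = max(available, key=lambda a: proxy[a])
                total += X[t][k]
                pulls[t] = k
                next_free[k] = t + D[t][k]
            t += 1
        # Phase 3: no pulls in [stop, end), so all arms are free for the next block.
    return pulls, total


X = [[1, 3]] * 12  # constant rewards, two arms
D = [[1, 2]] * 12
pulls, total = block_greedy(X, D, block_len=6, max_delay=2)
print(pulls, total)
```

The idle steps are the price paid for resetting availability at block boundaries, which is why the block length has to be tuned.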

Bandit Setting

By appropriately choosing the block length, we can obtain the following regret bound:

\mathcal{O}\left(\sqrt{T(2\tilde{D} + K)B_{T}}\right)

Problem: We need to know the variation budget to set the block length!

Solution: Run EXP3 as a meta-bandit algorithm to learn the correct block length!

Bandit Setting

Maintain a list of possible budgets and split the time horizon [0, T] into epochs of length H

Consider one such epoch

Sample a budget and thus an associated block length and play the previous algorithm within the epoch

At the end of the epoch, update the sampling probability of the chosen budget according to EXP3.

Repeat this process with the next epoch.
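The epoch loop above can be sketched with a textbook EXP3 update over the candidate budgets. The callback `run_epoch` is hypothetical: it stands in for running the block-based algorithm with the block length implied by the sampled budget, and is assumed to return the epoch reward rescaled to [0, 1]. The exploration rate `gamma` is likewise an illustrative choice.

```python
import math
import random


def exp3_over_budgets(budgets, num_epochs, run_epoch, gamma=0.1, seed=0):
    """Sample one candidate budget per epoch and reweight it with EXP3."""
    rng = random.Random(seed)
    n = len(budgets)
    weights = [1.0] * n
    probs = [1.0 / n] * n
    chosen = []
    for epoch in range(num_epochs):
        total_w = sum(weights)
        probs = [(1 - gamma) * w / total_w + gamma / n for w in weights]
        i = rng.choices(range(n), weights=probs)[0]
        reward = run_epoch(budgets[i], epoch)  # assumed to lie in [0, 1]
        estimate = reward / probs[i]           # importance-weighted reward estimate
        weights[i] *= math.exp(gamma * estimate / n)
        top = max(weights)
        weights = [w / top for w in weights]   # rescale to avoid overflow
        chosen.append(i)
    return probs, chosen


# Toy callback: pretend that the budget 2 is the right one.
probs, chosen = exp3_over_budgets(
    budgets=[1, 2, 4],
    num_epochs=300,
    run_epoch=lambda budget, epoch: 1.0 if budget == 2 else 0.0,
)
print(probs)
```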

Sequential Blocked Matching

[Figure: worked example of sequential blocked matching, built up over successive slides; labels take the values 1 and 2.]

Requirements

Resistance to strategic manipulation induced by blocking - bound the incentive ratio.

What I get if I tell the truth

What I can get if I lie

Requirements

Achieve high social welfare - minimise the distortion.

The social welfare of the best matching

The social welfare of my matching

Repeated RSD

Generalise RSD (random serial dictatorship) by allowing each agent to choose its allocation for the entire time horizon [0, T] greedily.
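A rough sketch of this idea is given below. The blocking model is an assumption made for the example, not taken from the slides: a service given to agent i at step t is treated as unavailable to everyone until step t + `delays[i][j]`. The function name and the use of numeric `values` to rank services are also hypothetical.

```python
import random


def repeated_rsd(values, delays, horizon, order=None, seed=0):
    """Agents, taken in (random) order, greedily fix their schedule for the whole horizon."""
    n, m = len(values), len(values[0])
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)  # the random serial order
    busy = [set() for _ in range(m)]        # steps at which each service is taken or blocked
    allocation = [[None] * horizon for _ in range(n)]
    for i in order:
        ranked = sorted(range(m), key=lambda j: values[i][j], reverse=True)
        for t in range(horizon):
            for j in ranked:
                window = range(t, min(t + delays[i][j], horizon))
                if all(s not in busy[j] for s in window):
                    allocation[i][t] = j    # best service that is free for the whole window
                    busy[j].update(window)
                    break
    return allocation


allocation = repeated_rsd(
    values=[[3, 1], [2, 1]],
    delays=[[2, 1], [2, 1]],
    horizon=4,
    order=[0, 1],  # fixed order here so the example is reproducible
)
print(allocation)
```

With the order fixed, the first agent alternates between its two services and the second agent picks from whatever remains.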

Repeated RSD

Repeated RSD is asymptotically optimal in terms of distortion.

Repeated RSD can be derandomized to yield a deterministic algorithm which is also asymptotically optimal! 

Repeated RSD also has bounded incentive ratio!

Bandit Matching

[Figure: agent i is matched to service j and observes a reward]

m_{ij} \sim \mu_{ij}

\hat{\mu}_{ij} = \frac{\sum^{T}_{t=1}\mathbf{1}[m_{i}(t) = j]\,m_{ij}(t)}{\sum^{T}_{t=1}\mathbf{1}[m_{i}(t) = j]}

where m_{i}(t) is the service matched to agent i at time t and m_{ij}(t) is the reward then observed.

(mean-based)
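The mean-based estimate is just a per-pair running average. A small sketch, assuming the feedback is logged as hypothetical (agent, service, reward) triples:

```python
def empirical_means(history, n_agents, n_services):
    """Average observed reward for every (agent, service) pair; None if never matched."""
    sums = [[0.0] * n_services for _ in range(n_agents)]
    counts = [[0] * n_services for _ in range(n_agents)]
    for agent, service, reward in history:
        sums[agent][service] += reward
        counts[agent][service] += 1
    return [
        [sums[i][j] / counts[i][j] if counts[i][j] else None for j in range(n_services)]
        for i in range(n_agents)
    ]


history = [(0, 0, 8), (0, 1, 1), (0, 0, 10)]
print(empirical_means(history, n_agents=2, n_services=2))
```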

Bandit RRSD

Idea: Extend RRSD to the bandit setting with an explore-then-commit framework!

In the exploration phase, assign each service to each agent a fixed number of times

Wait until all arms are available

In the exploitation phase, play RRSD, using the last preference submission of each agent
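One simple way to realise the exploration phase is a round-robin rotation, sketched below. This is an illustrative choice rather than the schedule used in the talk: it assumes there are at least as many services as agents and ignores blocking, so in practice idle steps may be needed between rounds.

```python
def exploration_schedule(n_agents, n_services, repeats):
    """Round r assigns service (i + r) mod n_services to agent i.

    Every agent receives every service exactly `repeats` times, and no two
    agents share a service within a round.
    """
    if n_agents > n_services:
        raise ValueError("this sketch assumes at least as many services as agents")
    return [
        [(i + r) % n_services for i in range(n_agents)]
        for r in range(repeats * n_services)
    ]


for matching in exploration_schedule(n_agents=2, n_services=3, repeats=1):
    print(matching)
```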

Bandit RRSD

[Figure: worked example of Bandit RRSD on the timeline from 0 to T, built up over successive slides; the values 8, 1, 9, 7, 10, 3, 9, 4, 1, 5, 7, 0 are revealed in turn.]