Blocking Problems for Multi-Armed Bandits and Matching Problems
Nicholas Bishop
Resource Deployment
We must account for resource availability!
Outline
Adversarial Blocking Bandits
Sequential Blocked Matching
Multi-Armed Bandits
GOAL: Maximise reward accumulated over time
Just pull the best arm on every time step!
Problem: Rewards are not known!
Rewards may be noisy!
Or worse, rewards may be chosen by a malicious adversary!
How do we learn about arms whilst maintaining good long-term reward?
EXP3
UCB
ETC
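As a concrete reference point, here is a minimal sketch of one of these strategies (UCB1). The reward oracle, arm means, and horizon are illustrative assumptions, not part of the slides.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1 sketch: try each arm once, then repeatedly pull the arm whose
    empirical mean plus exploration bonus is largest."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initial pass: pull every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total += reward
    return total

# Illustrative use: three Bernoulli arms with unknown means.
true_means = [0.3, 0.5, 0.7]
print(ucb1(lambda i: float(random.random() < true_means[i]), 3, 10_000))
```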
Blocking Bandits
What's a good performance benchmark?
We'd like to pull the best arm all the time...
But we can't, as it will sometimes be blocked...
Instead we must interleave between arms!
That is, we must benchmark against an adaptive policy!
No Blocking:
Benchmark policy: Always pull the best arm!
Easy to compute when rewards known? YES

With Blocking:
Benchmark policy: Find the best pulling schedule!
Easy to compute when rewards known? NO
Stochastic Blocking Bandits
Stochastic/noisy feedback
Delays for each arm are fixed
Even with known rewards, there is no polynomial-time algorithm for computing the optimal schedule in this setting!
Finding a Good Approximation
What if we just greedily pull the best arm on each time step out of those available?
This greedy policy yields a constant-factor approximation to the optimal schedule!
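For reference, a minimal sketch of that greedy rule in the offline setting, assuming mean rewards and fixed per-arm delays are known; all names and parameters here are illustrative.

```python
def greedy_schedule(means, delays, horizon):
    """Greedy pulling schedule for blocking bandits with known mean rewards:
    on each time step, pull the available arm with the highest mean."""
    n_arms = len(means)
    blocked_until = [0] * n_arms  # earliest time step each arm can be pulled again
    total = 0.0
    for t in range(horizon):
        available = [i for i in range(n_arms) if blocked_until[i] <= t]
        if not available:
            continue  # every arm is blocked on this step
        arm = max(available, key=lambda i: means[i])
        total += means[arm]
        blocked_until[arm] = t + delays[arm]  # pulling blocks the arm for delays[arm] steps
    return total

# Illustrative use: arm 0 is best but blocks for 3 steps once pulled.
print(greedy_schedule([0.9, 0.6, 0.4], [3, 2, 1], horizon=10))
```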
Taking Things Online
GREEDY + UCB
We inherit an instance-dependent regret bound from the classical bandit setting!
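A sketch of how the two pieces fit together: the same greedy rule, but ranking the available arms by a UCB index instead of the unknown mean. The pull oracle and delay parameters are illustrative assumptions.

```python
import math

def greedy_ucb(pull, n_arms, delays, horizon):
    """Greedy blocking-bandit play with UCB indices standing in for the unknown means:
    among the arms available this step, pull the one with the largest index."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    blocked_until = [0] * n_arms

    def index(i, t):
        if counts[i] == 0:
            return float("inf")  # make sure every arm is tried at least once
        return means[i] + math.sqrt(2.0 * math.log(t + 1) / counts[i])

    for t in range(horizon):
        available = [i for i in range(n_arms) if blocked_until[i] <= t]
        if not available:
            continue
        arm = max(available, key=lambda i: index(i, t))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        blocked_until[arm] = t + delays[arm]
    return means
```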
Blocking in the Real World
Adversarial Blocking Bandits
Rewards vary adversarially in accordance with a path variation budget
Blocking durations are free to vary arbitrarily, but are bounded above.
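One standard way to write the path-variation constraint, assuming the usual definition of a variation budget $B_T$ over $K$ arms and horizon $T$; the notation here is illustrative.

```latex
% Rewards may change from step to step, but the total variation is bounded:
\sum_{t=1}^{T-1} \max_{i \in [K]} \left| X_{t+1}(i) - X_{t}(i) \right| \le B_T
```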
Full Information Setting
Consider a greedy algorithm which pulls the available arm with the highest reward.
Using a knapsack-style proof, we obtain a regret guarantee for this greedy algorithm.
Bandit Setting
Split the time horizon into blocks and consider one such block:
At the start of the block, play each arm once and store the rewards observed, then pull no arms until all arms are available again.
Then play greedily, using the rewards received in the first phase as a proxy for the true rewards.
Pull no arms at the end of the block, so that all arms will be available at the beginning of the next block (see the sketch below).
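A minimal sketch of one such block, assuming pull(arm) returns an observed (reward, delay) pair, delays are bounded by max_delay, and the block is long enough to fit all three phases; all names here are illustrative.

```python
def play_block(pull, n_arms, max_delay, block_len):
    """One block of the block-based bandit algorithm:
    phase 1 - pull each arm once and record the observed rewards;
    phase 2 - once every arm is free again, play greedily using those
              observations as proxies for the true rewards;
    phase 3 - idle for the final steps so all arms are available next block."""
    proxy = [0.0] * n_arms
    blocked_until = [0] * n_arms
    t = 0
    for arm in range(n_arms):                 # phase 1: one pull per arm
        proxy[arm], delay = pull(arm)
        blocked_until[arm] = t + delay
        t += 1
    t = max(t, max(blocked_until))            # wait until all arms are available
    while t < block_len - max_delay:          # phase 2: greedy on proxy rewards
        available = [i for i in range(n_arms) if blocked_until[i] <= t]
        if available:
            arm = max(available, key=lambda i: proxy[i])
            _, delay = pull(arm)
            blocked_until[arm] = t + delay
        t += 1
    return proxy                              # phase 3: pull nothing until the block ends
```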
Bandit Setting
By appropriately choosing the block length, we can obtain a regret bound for this algorithm.
Problem: We need to know the variation budget to set the block length!
Solution: Run EXP3 as a meta-bandit algorithm to learn the correct block length!
Bandit Setting
Maintain a list of possible budgets and split the time horizon into epochs.
Consider one such epoch:
Sample a budget, and thus an associated block length, and play the previous algorithm within the epoch.
At the end of the epoch, update the sampling probability of the chosen budget according to EXP3.
Repeat this process with the next epoch.
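A sketch of the meta-bandit wrapper, using a simplified EXP3-style update over a finite list of candidate budgets. The run_epoch callback, candidate block lengths, and learning rate are illustrative, and per-epoch rewards are assumed rescaled to [0, 1].

```python
import math
import random

def meta_exp3(run_epoch, block_lens, n_epochs, eta=0.05):
    """EXP3-style meta-bandit over candidate block lengths:
    each epoch, sample a block length, run the block-based algorithm with it,
    and increase that block length's weight in proportion to the reward seen."""
    k = len(block_lens)
    weights = [1.0] * k
    for _ in range(n_epochs):
        total = sum(weights)
        probs = [w / total for w in weights]
        j = random.choices(range(k), weights=probs)[0]  # sample a candidate budget
        reward = run_epoch(block_lens[j])               # assumed rescaled to [0, 1]
        estimate = reward / probs[j]                    # importance-weighted reward estimate
        weights[j] *= math.exp(eta * estimate)          # exponential-weights update
    return weights
```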
Sequential Blocked Matching
[Figure: worked example of sequential blocked matching between agents and services; the numbers are the agents' preference values.]
Requirements
Resistance to strategic manipulation induced by blocking - bound the incentive ratio:
incentive ratio = (what I can get if I lie) / (what I get if I tell the truth)
Achieve high social welfare - minimise the distortion:
distortion = (the social welfare of the best matching) / (the social welfare of my matching)
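Written out, the two ratios look roughly as follows. This is a sketch using generic notation, where $u_i$ is agent $i$'s utility, $\succ_i$ its reported preference, and $SW$ the social welfare; the precise definitions in the thesis may differ in details such as where expectations are taken.

```latex
% Incentive ratio: the most an agent can gain, multiplicatively, by misreporting.
\mathrm{IR} = \max_{i} \; \max_{\hat{\succ}_i} \;
  \frac{u_i(\hat{\succ}_i, \succ_{-i})}{u_i(\succ_i, \succ_{-i})}

% Distortion: optimal social welfare relative to the welfare of the chosen matching.
\mathrm{dist} = \frac{\max_{M} SW(M)}{\mathbb{E}\left[ SW(M_{\mathrm{alg}}) \right]}
```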
Repeated RSD
Generalise RSD (random serial dictatorship) by allowing each agent to choose its allocation for the entire time horizon greedily (see the sketch below).
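A minimal sketch of Repeated RSD, assuming each agent has a numeric value for every service, a one-to-one matching per round, and a per-service blocking delay once an agent has been matched to it; the names and the exact blocking rule are illustrative.

```python
import random

def repeated_rsd(values, delays, horizon):
    """Repeated RSD sketch: draw a uniformly random agent order; each agent, in turn,
    greedily picks its best available service in every round of the horizon,
    respecting both earlier agents' picks and its own blocking constraints."""
    n_agents, n_services = len(values), len(values[0])
    schedule = [[None] * n_services for _ in range(horizon)]  # schedule[t][s] = agent or None
    for agent in random.sample(range(n_agents), n_agents):    # serial dictatorship order
        blocked_until = [0] * n_services                       # this agent's blocking state
        for t in range(horizon):
            free = [s for s in range(n_services)
                    if schedule[t][s] is None and blocked_until[s] <= t]
            if not free:
                continue
            s = max(free, key=lambda s: values[agent][s])      # greedy pick for round t
            schedule[t][s] = agent
            blocked_until[s] = t + delays[s]                   # service blocked for this agent
    return schedule
```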
Repeated RSD
Repeated RSD is asymptotically optimal in terms of distortion.
Repeated RSD can be derandomized to yield a deterministic algorithm which is also asymptotically optimal!
Repeated RSD also has a bounded incentive ratio!
Bandit Matching
(mean-based)
Bandit RRSD
Idea: Extend RRSD to the bandit setting with an explore-then-commit framework!
In the exploration phase, assign each agent each service a fixed number of times.
Wait until all arms are available.
In the exploitation phase, play RRSD, using the last preference submission of each agent (see the sketch below).
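A sketch of the explore-then-commit wrapper, reusing the repeated_rsd routine sketched earlier; assign(agent, service) returning a noisy reward and the exploration length n_explore are illustrative assumptions.

```python
def bandit_rrsd(assign, n_agents, n_services, delays, horizon, n_explore):
    """Explore-then-commit Bandit RRSD sketch:
    exploration - assign every (agent, service) pair n_explore times and average the rewards;
    then wait for all blocking to clear;
    exploitation - run Repeated RSD on the estimated preferences for the remaining rounds."""
    estimates = [[0.0] * n_services for _ in range(n_agents)]
    for _ in range(n_explore):                          # exploration phase (round-robin)
        for a in range(n_agents):
            for s in range(n_services):
                estimates[a][s] += assign(a, s) / n_explore
    rounds_used = n_explore * n_services + max(delays)  # exploration rounds plus waiting
    remaining = max(horizon - rounds_used, 0)
    return repeated_rsd(estimates, delays, remaining)   # exploitation phase
```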
Bandit RRSD
[Figure: worked example of the Bandit RRSD algorithm.]