Adversarial Blocking Bandits
Sequential Blocked Matching
GOAL: Maximise reward accumulated over time
Just pull the best arm on every time step!
Problem: Rewards are not known!
Rewards may be noisy!
Or worse, rewards may be chosen by a malicious adversary!
How do we learn about arms whilst maintaining good long-term reward?
EXP3
UCB
ETC
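For reference, here is a minimal sketch of one of these classical approaches (UCB1) in the standard, non-blocking setting; the arm probabilities and horizon below are made up for illustration.

```python
import math
import random

def ucb1(num_arms, horizon, pull):
    """Minimal UCB1 sketch: `pull(arm)` returns a reward in [0, 1]."""
    counts = [0] * num_arms        # number of times each arm has been pulled
    means = [0.0] * num_arms       # empirical mean reward of each arm
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1            # pull every arm once to initialise
        else:
            # Optimism in the face of uncertainty: mean + confidence radius.
            arm = max(range(num_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means

# Illustrative usage with Bernoulli arms (success probabilities are made up).
probs = [0.2, 0.5, 0.7]
print(ucb1(len(probs), 10_000, lambda a: float(random.random() < probs[a])))
```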
What's a good performance benchmark?
We'd like to pull the best arm all the time...
But we can't as it will sometimes be blocked...
Instead we must interleave between arms!
That is, we must benchmark against an adaptive policy!
Benchmark policy:
Always pull the best arm? No: find the best pulling schedule!
Easy to compute when the rewards are known? YES?... NO!
Stochastic/noisy feedback
Delays for each arm are fixed
There is no polynomial time algorithm in this setting!
What if we just greedily pull the best arm on each time step out of those available?
This gives an approximation to the optimal schedule!
GREEDY + UCB
Inherit an instance-dependent regret bound from the classical bandit setting!
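A rough sketch of this greedy-plus-UCB idea, assuming known fixed delays: on each step, pull the available arm with the highest UCB index. The interface and names here are illustrative, not the paper's exact algorithm.

```python
import math
import random

def ucb_greedy_blocking(delays, horizon, pull):
    """Greedy over the currently available arms, ranked by UCB index
    (illustrative sketch).

    delays[a] -- known blocking duration of arm a: after a pull at round t,
                 the arm is unavailable until round t + delays[a]
    pull(a)   -- returns a stochastic reward in [0, 1]
    """
    k = len(delays)
    counts = [0] * k
    means = [0.0] * k
    blocked_until = [0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        available = [a for a in range(k) if blocked_until[a] <= t]
        if not available:
            continue                         # every arm is blocked: skip the round
        untried = [a for a in available if counts[a] == 0]
        if untried:
            arm = untried[0]                 # try each arm at least once
        else:
            arm = max(available,
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        total += r
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        blocked_until[arm] = t + delays[arm]
    return total

# Illustrative usage: three Bernoulli arms with made-up means and delays.
probs, delays = [0.3, 0.6, 0.8], [1, 2, 3]
print(ucb_greedy_blocking(delays, 10_000, lambda a: float(random.random() < probs[a])))
```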
Rewards vary adversarially in accordance with a path variation budget
Blocking durations are free to vary arbitrarily, but are bounded above.
Consider a greedy algorithm which, at each time step, pulls the available arm with the highest reward.
Using a knapsack-style proof, we obtain a regret guarantee for this greedy approach.
Split the time horizon into blocks
Consider one such block
At the start of the block, play each arm once, and store the rewards observed. Then pull no arms until all arms are available.
Then play greedily, using the rewards received in the first phase as a proxy for the real reward
Pull no arms at the end of the block so all arms will be available at the beginning of the next block.
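A compact sketch of the block scheme just described, assuming a known upper bound d_max on the blocking durations; the interface (rewards[t][a], delays[t][a]) and all names are illustrative, not the paper's exact pseudocode.

```python
import random

def repeating_greedy(rewards, delays, block_len, d_max):
    """Block-based sketch: within each block, sample every arm once, play
    greedily on the stored samples, then idle so every arm is free again.
    rewards[t][a] and delays[t][a] are the reward and blocking duration of
    arm a at round t; d_max upper-bounds all blocking durations.
    """
    horizon, k = len(rewards), len(rewards[0])
    blocked_until = [0] * k
    total = 0.0
    for start in range(0, horizon, block_len):
        proxy = [None] * k            # rewards observed in the sampling phase
        greedy_phase = False
        for t in range(start, min(start + block_len, horizon)):
            if t - start >= block_len - d_max:
                continue              # end of block: idle so every arm recovers
            available = [a for a in range(k) if blocked_until[a] <= t]
            if not greedy_phase:
                unsampled = [a for a in available if proxy[a] is None]
                if unsampled:                     # play each arm once, store its reward
                    arm = unsampled[0]
                    proxy[arm] = rewards[t][arm]
                elif len(available) == k:         # all arms free again: go greedy
                    greedy_phase = True
                    arm = max(available, key=lambda a: proxy[a])
                else:
                    continue                      # pull no arms until all are available
            else:                                 # greedy on the stored proxy rewards
                if not available:
                    continue
                arm = max(available, key=lambda a: proxy[a])
            total += rewards[t][arm]
            blocked_until[arm] = t + delays[t][arm]
    return total

# Illustrative usage: random rewards and delays over a short horizon.
T, K = 200, 3
R = [[random.random() for _ in range(K)] for _ in range(T)]
D = [[random.randint(1, 3) for _ in range(K)] for _ in range(T)]
print(repeating_greedy(R, D, block_len=25, d_max=3))
```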
By appropriately choosing the block length, we obtain a regret bound in terms of the variation budget and the time horizon.
Problem: We need to know the variation budget to set the block length!
Solution: Run EXP3 as a meta-bandit algorithm to learn the correct block length!
Maintain a list of possible budgets and split the time horizon into epochs
Consider one such epoch
Sample a budget, and thus an associated block length, and play the previous algorithm within the epoch
At the end of the epoch, update the sampling probability of the chosen budget according to EXP3.
Repeat this process with the next epoch.
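A sketch of this meta-bandit layer, assuming a hook run_epoch(block_len, start, end) that plays the block-based algorithm above on one epoch and returns its total reward; the weighting scheme and parameters are illustrative rather than the paper's exact choices.

```python
import math
import random

def exp3_over_budgets(budgets, block_len_for, run_epoch,
                      num_epochs, epoch_len, gamma=0.1, eta=0.1):
    """Meta-bandit sketch: treat each candidate variation budget as an arm of
    EXP3.  Each epoch, sample a budget, run the block-based algorithm with the
    corresponding block length, and reweight that budget by the normalised
    epoch reward (rewards are assumed to lie in [0, epoch_len]).
    """
    k = len(budgets)
    weights = [1.0] * k
    total = 0.0
    for e in range(num_epochs):
        z = sum(weights)
        probs = [(1 - gamma) * w / z + gamma / k for w in weights]   # EXP3 mixing
        i = random.choices(range(k), weights=probs)[0]               # sample a budget
        reward = run_epoch(block_len_for(budgets[i]), e * epoch_len, (e + 1) * epoch_len)
        total += reward
        # Importance-weighted estimate of the chosen budget's normalised reward.
        estimate = (reward / epoch_len) / probs[i]
        weights[i] *= math.exp(eta * estimate)
    return total

# Illustrative usage with a dummy inner routine standing in for the block algorithm.
budgets = [1, 4, 16, 64]
print(exp3_over_budgets(budgets,
                        block_len_for=lambda b: max(2, 100 // (1 + b)),
                        run_epoch=lambda L, s, t: random.random() * (t - s),
                        num_epochs=50, epoch_len=100))
```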
In the sequential blocked matching problem, we want two things:
Resistance to strategic manipulation induced by blocking: bound the incentive ratio, i.e. (what I can get if I lie) / (what I get if I tell the truth).
High social welfare: minimise the distortion, i.e. (the social welfare of the best matching) / (the social welfare of my matching).
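In symbols (the notation below is ours, following the standard definitions rather than anything on the slides):

```latex
% Incentive ratio of mechanism M: the most an agent can gain by
% misreporting its preferences, relative to reporting truthfully.
\[
  \mathrm{IR}(M) \;=\; \max_{i}\ \max_{\hat{\sigma}_i}\
  \frac{u_i\bigl(M(\hat{\sigma}_i, \sigma_{-i})\bigr)}
       {u_i\bigl(M(\sigma_i, \sigma_{-i})\bigr)}
\]

% Distortion of mechanism M: optimal social welfare divided by the
% (expected) social welfare of the matchings produced by M.
\[
  \mathrm{distortion}(M) \;=\; \sup_{\text{instances}}\
  \frac{\mathrm{SW}(\text{optimal matching sequence})}
       {\mathbb{E}\bigl[\mathrm{SW}(M)\bigr]}
\]
```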
Generalise RSD by allowing each agent to choose its allocation for the entire time horizon greedily.
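A rough sketch of this Repeated RSD idea, assuming each agent i has a value for each service s and each service has a blocking duration; the round-by-round greedy booking and all names are illustrative, not the paper's pseudocode.

```python
import random

def repeated_rsd(values, durations, horizon, rng=random):
    """Repeated-RSD-style allocation sketch.  Agents are ordered uniformly at
    random; each agent then greedily books, round by round, its highest-valued
    service among those not already taken by earlier agents in that round and
    not blocked for it by its own recent choices.

    values[i][s]  -- agent i's value for service s
    durations[s]  -- blocking duration of service s
    Returns schedule[t][s] = agent matched to service s at round t (or None).
    """
    n_agents, n_services = len(values), len(values[0])
    schedule = [[None] * n_services for _ in range(horizon)]
    order = list(range(n_agents))
    rng.shuffle(order)                      # random serial dictatorship order
    for i in order:
        blocked_until = [0] * n_services    # agent i's own blocking constraints
        for t in range(horizon):
            free = [s for s in range(n_services)
                    if schedule[t][s] is None and blocked_until[s] <= t]
            if not free:
                continue
            s = max(free, key=lambda s: values[i][s])   # greedy pick for this round
            schedule[t][s] = i
            blocked_until[s] = t + durations[s]
    return schedule

# Illustrative usage with made-up values and durations.
vals = [[0.9, 0.4, 0.1], [0.2, 0.8, 0.5], [0.6, 0.3, 0.7]]
print(repeated_rsd(vals, durations=[2, 1, 3], horizon=5))
```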
Repeated RSD is asymptotically optimal in terms of distortion.
Repeated RSD can be derandomized to yield a deterministic algorithm which is also asymptotically optimal!
Repeated RSD also has bounded incentive ratio!
(mean-based)
Idea: Extend RRSD to the bandit setting with an explore-then-commit framework!
In the exploration phase, assign each agent each service a fixed number of times
Wait until all arms are available
In the exploitation phase, play RRSD, using the last preference submission of each agent
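A sketch of this explore-then-commit scheme under simplifying assumptions (n agents and n services, blocking durations at most n so the exploration cycle is feasible, and agents modelled directly by noisy value samples); all names and the inline RRSD-style commit step are illustrative.

```python
import random

def etc_blocked_matching(noisy_value, n, durations, horizon,
                         explore_reps, rng=random):
    """Explore-then-commit sketch: explore via cyclic matchings so every
    agent samples every service, wait out the blocking, then commit to an
    RRSD-style allocation built from the estimated preferences.
    noisy_value(i, s) returns one noisy sample of agent i's value for s.
    """
    # Exploration phase: assign each agent each service `explore_reps` times.
    estimates = [[0.0] * n for _ in range(n)]
    t = 0
    for _ in range(explore_reps):
        for shift in range(n):
            for i in range(n):                 # cyclic matching for this round
                s = (i + shift) % n
                estimates[i][s] += noisy_value(i, s) / explore_reps
            t += 1
    t += max(durations)      # wait until every agent-service pair is free again
    # Exploitation phase: RRSD-style commit on the estimated preferences.
    order = list(range(n))
    rng.shuffle(order)
    schedule = [[None] * n for _ in range(max(0, horizon - t))]
    for i in order:
        blocked_until = [0] * n
        for r in range(len(schedule)):
            free = [s for s in range(n)
                    if schedule[r][s] is None and blocked_until[s] <= r]
            if free:
                s = max(free, key=lambda s: estimates[i][s])
                schedule[r][s] = i
                blocked_until[s] = r + durations[s]
    return schedule

# Illustrative usage with made-up noisy values.
true_vals = [[0.9, 0.2, 0.5], [0.3, 0.8, 0.4], [0.6, 0.1, 0.7]]
print(etc_blocked_matching(
    lambda i, s: true_vals[i][s] + random.gauss(0, 0.1),
    n=3, durations=[2, 1, 3], horizon=30, explore_reps=2))
```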