Consider agent incentives when making decisions
Account for real-world constraints on decision making
Stackelberg Prediction Games for Linear Regression (Chapter 3)
Adversarial Blocking Bandits (Chapter 4)
Sequential Blocked Matching (Chapter 5)
At training time:
At test time:
Target of the data provider
Idea: Simulate agent behaviour using training data!
Learner's loss
Agent's loss
Agent's manipulation cost
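As a point of reference, the interaction can be written as a bilevel (Stackelberg) problem. The following is a hedged sketch with illustrative notation (w the learner's weights, z_i the manipulated feature vector, y_i the true label, y_i^A the data provider's target, \ell and \ell_A the learner's and agent's losses, c the manipulation cost weight), not the exact formulation from the thesis:

\[
\min_{w}\;\sum_{i=1}^{n} \ell\big(w^{\top} z_i(w),\, y_i\big)
\qquad \text{s.t.} \qquad
z_i(w) \;\in\; \arg\min_{z}\; \ell_A\big(w^{\top} z,\, y_i^{A}\big) \;+\; c\,\lVert z - x_i \rVert_2^{2}.
\]

The learner commits to w first; each data provider then trades off its own loss against the cost of moving x_i to z_i, and simulating this response on the training data gives the learner's training objective.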
Can we solve such an optimisation problem?
Does the solution generalise well?
Reformulate the problem as a fractional program
Substitute out
Use fractional programming to rewrite this problem as a single-parameter root-finding problem.
Idea: Use bisection search to find a root!
Problem: How do we evaluate the function at each candidate point?
Solution: Convert the evaluation to an SDP!
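To illustrate the root-finding step, here is a minimal bisection sketch in Python; the helper evaluate_g, which returns the value (or at least the sign) of the single-parameter auxiliary function by solving the associated SDP, is a hypothetical placeholder rather than the thesis's actual routine.

def bisection_root(evaluate_g, lo, hi, tol=1e-6):
    # Assumes evaluate_g changes sign exactly once on [lo, hi].
    g_lo = evaluate_g(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if evaluate_g(mid) * g_lo > 0:
            lo = mid   # same sign as the left endpoint: root lies to the right
        else:
            hi = mid   # sign change detected: root lies to the left
    return 0.5 * (lo + hi)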
Consider what a linear function predicts after manipulation:
Each of these functions is linear and has bounded norm!
Hence we can bound the Rademacher complexity of the resulting hypothesis class!
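For intuition, the standard bound for norm-bounded linear classes gives the rate; with illustrative symbols (weights bounded as \lVert w \rVert_2 \le W, manipulated inputs bounded as \lVert z_i \rVert_2 \le B, and n training points), the empirical Rademacher complexity obeys

\[
\widehat{\mathfrak{R}}_n\Big(\big\{\, z \mapsto \langle w, z \rangle \;:\; \lVert w \rVert_2 \le W \,\big\}\Big)
\;\le\; \frac{W B}{\sqrt{n}},
\]

and uniform-convergence arguments then translate this into a generalisation bound for the learned predictor.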
Rewards vary adversarially in accordance with a path variation budget
Blocking durations are free to vary arbitrarily, but are bounded above.
Consider a greedy algorithm which pulls the available arm with the highest reward
Using a knapsack-style proof, we obtain the following regret guarantee
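To make the greedy benchmark concrete, here is a minimal Python sketch assuming full information, i.e. the current round's rewards and blocking durations (here rewards[t][k] and delays[t][k]) are visible when the decision is made; all names are illustrative.

def greedy_blocking(rewards, delays, n_arms, horizon):
    # rewards[t][k]: reward of arm k at round t; delays[t][k]: its blocking duration.
    available_at = [0] * n_arms          # round at which each arm becomes free again
    total = 0.0
    for t in range(horizon):
        free = [k for k in range(n_arms) if available_at[k] <= t]
        if not free:
            continue                     # every arm is blocked this round
        k = max(free, key=lambda a: rewards[t][a])   # highest-reward free arm
        total += rewards[t][k]
        available_at[k] = t + delays[t][k]           # arm k is now blocked
    return total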
Split the time horizon into blocks
Consider one such block
At the start of the block, play each arm once, and store the rewards observed. Then pull no arms until all arms are available.
Then play greedily, using the rewards received in the first phase as a proxy for the real reward
Pull no arms at the end of the block so all arms will be available at the beginning of the next block.
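A minimal sketch of one such block, assuming a callable pull(t, k) that returns the observed (reward, blocking duration) and a known upper bound max_delay on blocking durations; all names and the exact phase boundaries are illustrative.

def run_block(pull, n_arms, block_len, max_delay, start):
    # Phase 1: play every arm once and store the observed rewards.
    estimates = [0.0] * n_arms
    available_at = [start] * n_arms
    t = start
    for k in range(n_arms):
        reward, delay = pull(t, k)
        estimates[k] = reward
        available_at[k] = t + delay
        t += 1
    # Wait until every arm is available again.
    t = max(t, max(available_at))
    # Phase 2: greedy play, using the stored estimates as a proxy for the
    # true rewards, stopping early enough for all arms to recover.
    rest_from = start + block_len - max_delay
    while t < rest_from:
        free = [k for k in range(n_arms) if available_at[k] <= t]
        if free:
            k = max(free, key=lambda a: estimates[a])
            _, delay = pull(t, k)
            available_at[k] = t + delay
        t += 1
    # Phase 3: rest, so all arms are free at the start of the next block.
    return start + block_len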
By appropriately choosing block length, we can obtain the following regret bound:
Problem: We need to know the variation budget to set the block length!
Solution: Run EXP3 as a meta-bandit algorithm to learn the correct block length!
Maintain a list of possible budgets and split the time horizon into epochs
Consider one such epoch
Sample a budget and thus an associated block length and play the previous algorithm within the epoch
At the end of the epoch, update the sampling probability of the chosen budget according to EXP3.
Repeat this process with the next epoch.
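A minimal sketch of this meta-level EXP3 loop, assuming an inner routine run_epoch(block_len) that runs the previous algorithm for one epoch and returns a reward normalised to [0, 1]; the candidate block lengths and the learning rate gamma are illustrative.

import math
import random

def exp3_meta(run_epoch, block_lens, n_epochs, gamma=0.1):
    k = len(block_lens)
    weights = [1.0] * k
    for _ in range(n_epochs):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]
        i = random.choices(range(k), weights=probs)[0]   # sample a budget / block length
        reward = run_epoch(block_lens[i])                # play the inner algorithm
        # Importance-weighted exponential update of the chosen budget only (EXP3).
        weights[i] *= math.exp(gamma * reward / (probs[i] * k))
    return weights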
Resistance to strategic manipulation induced by blocking - bound the incentive ratio.
Achieve high social welfare - minimise the distortion.
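As a hedged reminder of these two quantities (notation illustrative; the thesis fixes the exact worst cases): the incentive ratio measures the largest multiplicative gain any agent can obtain by misreporting its preferences, and the distortion measures how far the mechanism's expected social welfare can fall below the optimum:

\[
\mathrm{IR}(\mathcal{M}) \;=\; \max_{i}\;\max_{\hat\sigma_i}\;
\frac{u_i\big(\mathcal{M}(\hat\sigma_i, \sigma_{-i})\big)}{u_i\big(\mathcal{M}(\sigma_i, \sigma_{-i})\big)},
\qquad
\mathrm{dist}(\mathcal{M}) \;=\; \max_{\text{instances}}\;
\frac{\mathrm{SW}(\mathrm{OPT})}{\mathbb{E}\big[\mathrm{SW}(\mathcal{M})\big]}.
\]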
Generalise RSD (random serial dictatorship) by allowing each agent to choose its allocation for the entire time horizon greedily.
Repeated RSD is asymptotically optimal in terms of distortion.
Repeated RSD can be derandomised to yield a deterministic algorithm which is also asymptotically optimal!
Repeated RSD also has bounded incentive ratio!
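An illustrative Python sketch of the serial-dictatorship idea behind repeated RSD; blocking constraints and the thesis's exact tie-breaking are deliberately abstracted away, and all names are assumptions.

import random

def repeated_rsd(preferences, n_rounds):
    # preferences[a] lists agent a's services from most to least preferred.
    agents = list(preferences)
    random.shuffle(agents)                       # uniformly random serial order
    taken = [set() for _ in range(n_rounds)]     # services already claimed in each round
    allocation = {a: [] for a in agents}
    for a in agents:                             # each dictator in turn...
        for t in range(n_rounds):                # ...greedily claims its whole horizon
            for s in preferences[a]:
                if s not in taken[t]:
                    taken[t].add(s)
                    allocation[a].append(s)
                    break
    return allocation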
(mean-based)
Idea: Extend RRSD to the bandit setting with an explore-then-commit framework!
In the exploration phase, assign each agent each service a fixed number of times
Wait until all arms are available
In the exploitation phase, play RRSD, using the last preference submission of each agent
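A minimal sketch of this two-phase structure, assuming at most as many agents as services; assign(t, agent, service), last_submission(agent), play_rrsd(prefs, start), the exploration length n_explore and the bound max_delay are all illustrative placeholders for the mechanism's actual components.

def explore_then_commit(agents, services, assign, last_submission,
                        play_rrsd, n_explore, max_delay):
    m = len(services)
    t = 0
    # Exploration: a round-robin schedule in which every agent is assigned
    # every service n_explore times.
    for _ in range(n_explore):
        for shift in range(m):
            for i, a in enumerate(agents):
                assign(t, a, services[(i + shift) % m])
            t += 1
    # Wait until all services (arms) are available again.
    t += max_delay
    # Exploitation: commit to RRSD on the last submitted preference of each agent.
    prefs = {a: last_submission(a) for a in agents}
    return play_rrsd(prefs, start=t)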