Ramesh Johari, Sven Schmit
Stanford University
INFORMS 2017 - Houston, TX
[Diagram: a platform and a user interact over time; at some point the user abandons.]
Key ingredients:
Discrete time t = 1, 2, ...
Actions chosen by the platform each period.
Reward collected from each action the user tolerates.
Thresholds that determine how much the user tolerates (drawn once, or drawn iid each period, in the two extreme cases).
Stopping time: the period in which the user abandons.
Discount factor.
Objective: maximize expected discounted reward collected before the user abandons.
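As a rough formalization, with notation that is assumed here rather than taken from the talk (x_t for the action, θ_t for the threshold, F for its distribution, r for the reward function, δ for the discount factor, τ for the abandonment time):

```latex
% Sketch of the objective under the assumed notation above.
\[
  \tau \;=\; \min\{\, t \ge 1 \;:\; x_t > \theta_t \,\}, \qquad
  \max_{(x_t)_{t \ge 1}} \;\; \mathbb{E}\!\left[\, \sum_{t=1}^{\tau - 1} \delta^{\,t-1}\, r(x_t) \right].
\]
```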
Result
The optimal policy is a constant policy: the platform plays the same action in every period.
Threshold drawn once, at the start of the interaction:
Each period the platform chooses an action; the outcome depends on the threshold:
if the action exceeds the threshold, there is no reward and the user abandons;
if it does not, the platform collects the reward and the interaction continues with the next period's threshold.
Optimal policy?
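For reference, a sketch (my own derivation, under the notation above, writing 1 - F(x) for the probability that action x is tolerated) of the value of repeating one fixed action forever in the two extreme cases; the best constant action maximizes the corresponding expression.

```latex
% Value of the constant policy x_t = x for all t (assumed notation).
% Threshold drawn once:           the user survives forever iff x <= theta.
% Thresholds drawn iid each period: the user survives each period independently.
\[
  V_{\mathrm{once}}(x) \;=\; \bigl(1 - F(x)\bigr)\,\frac{r(x)}{1-\delta},
  \qquad
  V_{\mathrm{iid}}(x) \;=\; \frac{\bigl(1 - F(x)\bigr)\, r(x)}{1 - \delta\bigl(1 - F(x)\bigr)}.
\]
```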
Result
The optimal policy is a constant policy: the same action is optimal for all t.
Proof by induction on value iteration.
Suppose the optimal policy is increasing; compare it to the constant policy that repeats the action played at time t.
The two policies make no difference while they play the same action; when they first differ, there are two cases, and in each it is optimal to keep playing the constant action.
Back to the Uniform example: with the threshold drawn from the uniform distribution, the optimal policy remains the same.
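A small numerical check of the constant-policy claim, with an entirely illustrative setup of my own (threshold drawn once from Uniform[0, 1], reward equal to the action, discount 0.9, value iteration on a grid of states, the state being the largest action accepted so far). If the optimal policy is constant, the greedy action should not change once the first action is accepted.

```python
import numpy as np

# Discretized single-threshold model (illustrative parameters, not from the talk):
# theta ~ Uniform[0, 1] drawn once, reward r(x) = x, discount delta.
# State m = largest action accepted so far, so theta | history ~ Uniform[m, 1].
delta = 0.9
grid = np.linspace(0.0, 1.0, 201)[:-1]     # states m and candidate actions x
V = np.zeros_like(grid)

def q_values(i, m, V):
    """One-step lookahead values of every candidate action from state m."""
    safe = grid <= m                        # actions at or below the known lower bound
    survive = (1.0 - grid) / (1.0 - m)      # P(theta >= x | theta >= m) for risky actions
    return np.where(safe, grid + delta * V[i], survive * (grid + delta * V))

# Value iteration until (approximate) convergence.
for _ in range(5000):
    V_new = np.array([q_values(i, m, V).max() for i, m in enumerate(grid)])
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

x0 = grid[np.argmax(q_values(0, grid[0], V))]       # first action (initial state m = 0)
i1 = int(np.argmin(np.abs(grid - x0)))              # state reached once x0 is accepted
x1 = grid[np.argmax(q_values(i1, grid[i1], V))]     # next action
print(f"first action = {x0:.3f}, next action = {x1:.3f}")   # expect the same value
```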
Single threshold: constant policy.
Independent thresholds: constant policy.
In between: ?
Thresholds in between the extremes: the optimal policy is increasing, and intractable to compute.
Result
If the mean reward function is Lipschitz, then there exists a constant policy whose value is close to the optimal value.
Fixed threshold: constant policy.
Independent thresholds: constant policy.
In between: constant policy approximately optimal.
Result (stated informally)
If the reward function is bounded by B and Lipschitz, and F satisfies a regularity condition, then there exists a constant policy that is approximately optimal.
The constants in the bound depend on the distributions, the discount factor, and the reward function.
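Constant policies are also easy to search over even when the exact dynamic program is intractable. Below is a small illustration of my own construction: thresholds sitting between the two extremes, modeled here as a fixed component plus additive iid noise (the uniform fixed part, the Normal(0, 0.05) noise, the discount factor, and the reward r(x) = x are all assumed for the example), with a Monte Carlo grid search for the best constant action.

```python
import numpy as np

# Monte Carlo evaluation of constant policies when thresholds are correlated
# over time (illustrative parameters):
#   theta_t = theta + eps_t,  theta ~ Uniform[0, 1],  eps_t iid Normal(0, 0.05),
#   reward r(x) = x when x <= theta_t, abandonment as soon as x > theta_t.
rng = np.random.default_rng(0)
delta, horizon, n_users = 0.9, 200, 5000

def value_of_constant(x):
    """Estimated discounted reward of playing the same action x every period."""
    theta = rng.uniform(0.0, 1.0, size=n_users)
    total = np.zeros(n_users)
    alive = np.ones(n_users, dtype=bool)
    for t in range(horizon):
        eps = rng.normal(0.0, 0.05, size=n_users)
        ok = alive & (x <= theta + eps)       # user tolerates the action this period
        total += (delta ** t) * x * ok
        alive = ok                            # anyone who was exceeded abandons
    return total.mean()

candidates = np.linspace(0.05, 0.95, 19)
values = [value_of_constant(x) for x in candidates]
best = candidates[int(np.argmax(values))]
print(f"best constant action on this grid: {best:.2f}")
```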
Key idea
The user does not always abandon (immediately).
Focus on the single threshold model.
When the action exceeds the threshold, the platform receives no reward and the user abandons with probability 1-p; otherwise the platform collects the reward and the interaction continues.
The extremes: p = 0 is the original model (the user always abandons); p = 1 means the user never abandons.
Optimal policy?
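To make the feedback concrete, here is a tiny simulator of one interaction under this partial-abandonment model. It is a sketch under assumptions of my own (the midpoint policy, p = 0.5, the uniform threshold, and the interval bookkeeping are illustrative): a rejected action that does not end the interaction reveals an upper bound on the threshold, which is the partial learning mentioned in the summary.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(policy, p=0.5, delta=0.9, horizon=50):
    """Single-threshold model where exceeding the threshold is forgiven w.p. p.

    policy(lo, hi) -> action, where (lo, hi) is the interval of threshold
    values still consistent with the feedback observed so far.
    """
    theta = rng.uniform(0.0, 1.0)        # threshold, drawn once and never observed
    lo, hi = 0.0, 1.0                    # platform's knowledge: theta lies in (lo, hi)
    total = 0.0
    for t in range(horizon):
        x = policy(lo, hi)
        if x <= theta:                   # tolerated: reward, and a new lower bound
            total += (delta ** t) * x
            lo = max(lo, x)
        else:                            # exceeded: no reward ...
            if rng.random() < 1.0 - p:   # ... and abandonment with probability 1 - p
                break
            hi = min(hi, x)              # ... otherwise the platform learns an upper bound
    return total

# Example policy (hypothetical): probe the midpoint of what is still possible.
midpoint = lambda lo, hi: 0.5 * (lo + hi)
print(np.mean([simulate(midpoint) for _ in range(2000)]))
```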
Understanding the structure
Result: once the state is informative enough, the optimal action in that state no longer changes.
Dynamic program: the state is the interval of threshold values still consistent with the feedback observed so far.
Two phases: adapt, then constant.
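One way to write the dynamic program this refers to, under assumed notation (state = interval (a, b) of threshold values consistent with past feedback, q(x) = P(θ ≥ x | θ ∈ (a, b))); the exact form in the talk may differ.

```latex
% Sketch of the Bellman equation for the partial-abandonment model (assumed notation).
\[
  V(a, b) \;=\; \max_{x \in (a, b)}
  \Bigl\{\, q(x)\bigl[r(x) + \delta\, V(x, b)\bigr]
        \;+\; \bigl(1 - q(x)\bigr)\, p\,\delta\, V(a, x) \,\Bigr\},
  \qquad
  q(x) \;=\; \frac{F(b) - F(x)}{F(b) - F(a)}.
\]
```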
[Diagram: spectrum from "always abandon" to "never abandon", with abandonment probability 1-p in between; the optimal policy ranges from the constant policy at one end to more aggressive or more conservative policies in between.]
Model for personalization with risk of abandonment.
Result: the optimal policy is constant (no learning).
Bounds on performance with additive noise.
Feedback leads to partial learning, and the optimal policy can be more aggressive or more conservative.
Learning about thresholds across users.
Feel free to reach out: schmit@stanford.edu