Learning with Abandonment
Ramesh Johari, Sven Schmit
Stanford University
INFORMS 2017 - Houston, TX
A platform and a user interact over time, until the user abandons.
Key ingredients
- User preferences
- Feedback besides abandonment
Related work
Model
- Discrete time t = 1, 2, ...
- Actions a_t chosen by the platform
- Reward r(a_t) earned by the platform
- Thresholds θ_t of the user
- Stopping time τ: the user abandons the first time the action exceeds the threshold
- Discount factor γ
- Objective: maximize expected discounted reward collected before abandonment (one way to formalize this is sketched below)
Setup
[Figure: actions plotted over time]
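A plausible formalization of this objective; the notation (actions a_t, thresholds θ_t, reward r, discount γ, stopping time τ) is chosen here for illustration rather than taken from the talk:

```latex
% Sketch under the reading above: the user abandons the first time the
% action exceeds the threshold, and the platform collects discounted
% rewards only in the periods before that happens.
\[
  \tau = \min\{\, t \ge 1 : a_t > \theta_t \,\},
  \qquad
  \max_{(a_t)_{t \ge 1}} \;
  \mathbb{E}\!\left[ \sum_{t=1}^{\tau - 1} \gamma^{\,t-1}\, r(a_t) \right].
\]
```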
Threshold models
Independent thresholds: the threshold θ_t is drawn i.i.d. from a distribution F in every period.
Result
The optimal policy is a constant policy: play the same action in every period.
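A sketch of where the constant action comes from, under the illustrative assumption that the platform collects r(a) in every period the user stays (i.e. whenever a ≤ θ_t): the value of always playing a satisfies a one-line recursion, and the best constant action maximizes it.

```latex
% q(a) = P(theta_t >= a): per-period survival probability of the constant action a.
\[
  V(a) = q(a)\,\bigl(r(a) + \gamma\, V(a)\bigr)
  \;\Longrightarrow\;
  V(a) = \frac{q(a)\, r(a)}{1 - \gamma\, q(a)},
  \qquad
  a^{\ast} \in \arg\max_a \frac{q(a)\, r(a)}{1 - \gamma\, q(a)}.
\]
```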
Single threshold
The threshold θ is drawn once and then fixed over time.
Example
- Threshold drawn from Uniform[0, 1]
- Action a; outcome: if the action exceeds the threshold, no reward and the user abandons
- Otherwise: reward, and continue knowing the threshold is at least a
- Optimal policy? (a numerical sketch follows)
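A quick numerical sketch of this example, under assumptions added here for illustration (threshold θ ~ Uniform[0, 1] drawn once, reward r(a) = a collected whenever a ≤ θ): a constant action a survives the first period with probability 1 - a and can never trigger abandonment afterwards, so its value is (1 - a)·a / (1 - γ).

```python
import numpy as np

# Best constant action in the single-threshold Uniform[0,1] example, under
# illustrative assumptions: reward r(a) = a is collected whenever a <= theta,
# and the user abandons as soon as a > theta.
gamma = 0.9
a = np.linspace(0, 1, 10001)

# Value of playing `a` forever: survive the first period w.p. (1 - a),
# after which the same action can never exceed the (now known >= a) threshold.
value = (1 - a) * a / (1 - gamma)

best = a[np.argmax(value)]
print(best, value.max())   # best constant action is 0.5, for any gamma
```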
Optimal Policy
Result
The optimal policy is a constant policy: the same action is played for all t.
Proof: by induction on value iteration.
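To make the value-iteration argument concrete, here is a small numerical sketch under the same illustrative assumptions as above (θ ~ Uniform[0, 1] drawn once, reward r(a) = a collected whenever a ≤ θ). The state is the largest action survived so far; running value iteration on a grid, the greedy action in the initial state and in the state reached after surviving it coincide, i.e. the computed policy is constant.

```python
import numpy as np

# Value-iteration sketch for the single-threshold model (assumptions are mine,
# for illustration): theta ~ Uniform[0, 1] drawn once, reward r(a) = a is
# collected whenever a <= theta, and the user abandons as soon as a > theta.
# State: the largest action survived so far, i.e. a lower bound l on theta.

gamma = 0.9
grid = np.linspace(0.0, 0.99, 100)   # shared grid of states (lower bounds) and actions

def r(a):
    return a                          # hypothetical increasing reward function

V = np.zeros(len(grid))
for _ in range(2000):
    V_new = np.empty_like(V)
    for i, l in enumerate(grid):
        # P(a <= theta | theta >= l) for a uniform prior
        survive = np.where(grid <= l, 1.0, (1.0 - grid) / (1.0 - l))
        nxt = np.searchsorted(grid, np.maximum(grid, l))   # next state if the user stays
        V_new[i] = np.max(survive * (r(grid) + gamma * V[nxt]))
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

def greedy(l):
    survive = np.where(grid <= l, 1.0, (1.0 - grid) / (1.0 - l))
    nxt = np.searchsorted(grid, np.maximum(grid, l))
    return grid[np.argmax(survive * (r(grid) + gamma * V[nxt]))]

a1 = greedy(0.0)
print("action at t=1:", a1, "| action after surviving it:", greedy(a1))
# both are ~0.5: the greedy policy never moves, i.e. it is a constant policy
```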
Intuition
- Suppose the optimal policy is increasing: a_1 ≤ a_2 ≤ ...
- Compare it to the constant policy that plays a_t in every period
- Two cases: the threshold is either at least a_t or below it
- If it is at least a_t, survival is identical under both policies and the constant policy collects at least as much reward every period
- If it is below a_t, the optimal policy is already willing to risk abandonment at a_t, and with discounting it is better to take that risk (and the larger rewards) earlier
Surprising corollary
Back to the Uniform example: if the threshold is drawn from …, the optimal policy remains the same for any ….
Single threshold: constant policy
Independent thresholds: constant policy
In between: ?
Thresholds in between extremes:
Optimal policy is increasing, and intractable
Small noise
(Thresholds: a fixed threshold plus small additive noise.)
Result
If the mean reward function is Lipschitz, then there exists a constant policy that is approximately optimal.
Fixed threshold: constant policy
Small noise: constant policy approximately optimal
Large noise: ?
Independent thresholds: constant policy
Large noise
Result (stated informally)
If the mean reward function is bounded by B and Lipschitz, and F satisfies a regularity condition, then there exists a constant policy that is approximately optimal.
Constants depend on the distributions, the discount factor, and the reward function.
Fixed threshold: constant policy
Small noise: constant policy approximately optimal
Large noise: constant policy approximately optimal
Independent thresholds: constant policy
Feedback
Key idea
User does not always abandon (immediately)
Focus on single threshold model
Signal model
- Action at or below the threshold: reward, and the user stays
- Action above the threshold: no reward; the user abandons with probability 1-p and stays with probability p
- Extremes: p = 0 is "always abandon" (the original model), p = 1 is "never abandon"
Optimal policy?
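A Monte Carlo sketch of this signal model, with illustrative assumptions of my own (θ ~ Uniform[0, 1] drawn once, reward r(a) = a whenever a ≤ θ, abandonment with probability 1 - p after a period with no reward). The "midpoint" policy below is a hypothetical adaptive rule used only to exercise the simulator; it is not claimed to be optimal.

```python
import numpy as np

# Monte Carlo sketch of the signal model: a single threshold theta ~ Uniform[0,1],
# reward r(a) = a whenever a <= theta, and when a > theta there is no reward
# and the user abandons only with probability 1 - p.

def episode(policy, p, gamma=0.9, rng=None):
    """Discounted reward from one user; `policy` maps the current (low, high)
    bounds on theta to an action."""
    rng = rng or np.random.default_rng()
    theta = rng.uniform()
    low, high = 0.0, 1.0              # thresholds still consistent with feedback
    total, disc = 0.0, 1.0
    for _ in range(300):              # horizon is effectively infinite under discounting
        a = policy(low, high)
        if a <= theta:
            total += disc * a         # reward observed => theta >= a
            low = max(low, a)
        elif rng.uniform() > p:       # no reward: abandon with probability 1 - p
            break
        else:
            high = min(high, a)       # user stayed => theta < a
        disc *= gamma
    return total

constant = lambda low, high: 0.5                 # no-learning benchmark
midpoint = lambda low, high: (low + high) / 2    # hypothetical adaptive policy

rng = np.random.default_rng(0)
for name, pol in [("constant 0.5", constant), ("midpoint", midpoint)]:
    print(name, np.mean([episode(pol, p=0.5, rng=rng) for _ in range(5000)]))
```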
Optimal Policy
Understanding the structure: partial learning
Result
For every … there exists … such that if …, the optimal action in state … is ….
Dynamic program
- State: the interval of threshold values consistent with the feedback observed so far
- Two phases: adapt, then play a constant action
[Figure: the optimal initial action as a function of the stay probability p, between the "always abandon" and "never abandon" extremes. Relative to the constant policy, the initial action can be more aggressive or more conservative.]
Wrapping up
Open problem
Learning about thresholds across users
Summary
- Model for personalization with risk of abandonment
- Result: optimal policy is constant (no learning)
- Bounds on performance with additive noise
- Feedback leads to partial learning, and the optimal policy can be more aggressive or more conservative

Feel free to reach out: schmit@stanford.edu