Ramesh Johari, Sven Schmit
Stanford University
INFORMS 2017 - Houston, TX
[Diagram: a platform and a user interact over time; at some point the user abandons.]
Key ingredients:
Discrete time t = 1, 2, ...
Actions chosen by the platform each period.
Reward collected from each action the user tolerates.
Thresholds that determine how much the user tolerates (drawn once, or drawn iid each period, in the two extreme cases).
Stopping time: the period in which the user abandons.
Discount factor.
Objective: maximize expected discounted reward collected before the user abandons.
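As a rough formalization, with notation that is assumed here rather than taken from the talk (x_t for the action, θ_t for the threshold, F for its distribution, r for the reward function, δ for the discount factor, τ for the abandonment time):

```latex
% Sketch of the objective under the assumed notation above.
\[
  \tau \;=\; \min\{\, t \ge 1 \;:\; x_t > \theta_t \,\}, \qquad
  \max_{(x_t)_{t \ge 1}} \;\; \mathbb{E}\!\left[\, \sum_{t=1}^{\tau - 1} \delta^{\,t-1}\, r(x_t) \right].
\]
```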
Result
The optimal policy is a constant policy: the platform plays the same action in every period.
Threshold drawn once, at the start of the interaction:
Each period the platform chooses an action; the outcome depends on the threshold:
if the action exceeds the threshold, there is no reward and the user abandons;
if it does not, the platform collects the reward and the interaction continues with the next period's threshold.
Optimal policy?
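For reference, a sketch (my own derivation, under the notation above, writing 1 - F(x) for the probability that action x is tolerated) of the value of repeating one fixed action forever in the two extreme cases; the best constant action maximizes the corresponding expression.

```latex
% Value of the constant policy x_t = x for all t (assumed notation).
% Threshold drawn once:           the user survives forever iff x <= theta.
% Thresholds drawn iid each period: the user survives each period independently.
\[
  V_{\mathrm{once}}(x) \;=\; \bigl(1 - F(x)\bigr)\,\frac{r(x)}{1-\delta},
  \qquad
  V_{\mathrm{iid}}(x) \;=\; \frac{\bigl(1 - F(x)\bigr)\, r(x)}{1 - \delta\bigl(1 - F(x)\bigr)}.
\]
```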
Result
The optimal policy is a constant policy: the same action is optimal for all t.
Proof by induction on value iteration.
Suppose the optimal policy is increasing; compare it to the constant policy that repeats the action played at time t.
The two policies make no difference while they play the same action; when they first differ, there are two cases, and in each it is optimal to keep playing the constant action.
Back to the Uniform example: with the threshold drawn from the uniform distribution, the optimal policy remains the same.
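A small numerical check of the constant-policy claim, with an entirely illustrative setup of my own (threshold drawn once from Uniform[0, 1], reward equal to the action, discount 0.9, value iteration on a grid of states, the state being the largest action accepted so far). If the optimal policy is constant, the greedy action should not change once the first action is accepted.

```python
import numpy as np

# Discretized single-threshold model (illustrative parameters, not from the talk):
# theta ~ Uniform[0, 1] drawn once, reward r(x) = x, discount delta.
# State m = largest action accepted so far, so theta | history ~ Uniform[m, 1].
delta = 0.9
grid = np.linspace(0.0, 1.0, 201)[:-1]     # states m and candidate actions x
V = np.zeros_like(grid)

def q_values(i, m, V):
    """One-step lookahead values of every candidate action from state m."""
    safe = grid <= m                        # actions at or below the known lower bound
    survive = (1.0 - grid) / (1.0 - m)      # P(theta >= x | theta >= m) for risky actions
    return np.where(safe, grid + delta * V[i], survive * (grid + delta * V))

# Value iteration until (approximate) convergence.
for _ in range(5000):
    V_new = np.array([q_values(i, m, V).max() for i, m in enumerate(grid)])
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

x0 = grid[np.argmax(q_values(0, grid[0], V))]       # first action (initial state m = 0)
i1 = int(np.argmin(np.abs(grid - x0)))              # state reached once x0 is accepted
x1 = grid[np.argmax(q_values(i1, grid[i1], V))]     # next action
print(f"first action = {x0:.3f}, next action = {x1:.3f}")   # expect the same value
```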
Single threshold: constant policy.
Independent thresholds: constant policy.
In between: ?
Thresholds in between the extremes: the optimal policy is increasing, and intractable to compute.
Result
If the mean reward function is Lipschitz, then there exists a constant policy whose value is close to the optimal value.
Fixed threshold: constant policy.
Independent thresholds: constant policy.
In between: constant policy approximately optimal.
Result (stated informally)
If the reward function is bounded by B and Lipschitz, and F satisfies a regularity condition, then there exists a constant policy that is approximately optimal.
The constants in the bound depend on the distributions, the discount factor, and the reward function.
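Constant policies are also easy to search over even when the exact dynamic program is intractable. Below is a small illustration of my own construction: thresholds sitting between the two extremes, modeled here as a fixed component plus additive iid noise (the uniform fixed part, the Normal(0, 0.05) noise, the discount factor, and the reward r(x) = x are all assumed for the example), with a Monte Carlo grid search for the best constant action.

```python
import numpy as np

# Monte Carlo evaluation of constant policies when thresholds are correlated
# over time (illustrative parameters):
#   theta_t = theta + eps_t,  theta ~ Uniform[0, 1],  eps_t iid Normal(0, 0.05),
#   reward r(x) = x when x <= theta_t, abandonment as soon as x > theta_t.
rng = np.random.default_rng(0)
delta, horizon, n_users = 0.9, 200, 5000

def value_of_constant(x):
    """Estimated discounted reward of playing the same action x every period."""
    theta = rng.uniform(0.0, 1.0, size=n_users)
    total = np.zeros(n_users)
    alive = np.ones(n_users, dtype=bool)
    for t in range(horizon):
        eps = rng.normal(0.0, 0.05, size=n_users)
        ok = alive & (x <= theta + eps)       # user tolerates the action this period
        total += (delta ** t) * x * ok
        alive = ok                            # anyone who was exceeded abandons
    return total.mean()

candidates = np.linspace(0.05, 0.95, 19)
values = [value_of_constant(x) for x in candidates]
best = candidates[int(np.argmax(values))]
print(f"best constant action on this grid: {best:.2f}")
```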
Key idea
The user does not always abandon (immediately).
Focus on the single threshold model.
When the action exceeds the threshold, the platform receives no reward and the user abandons with probability 1-p; otherwise the platform collects the reward and the interaction continues.
The extremes: p = 0 is the original model (the user always abandons); p = 1 means the user never abandons.
Optimal policy?
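To make the feedback concrete, here is a tiny simulator of one interaction under this partial-abandonment model. It is a sketch under assumptions of my own (the midpoint policy, p = 0.5, the uniform threshold, and the interval bookkeeping are illustrative): a rejected action that does not end the interaction reveals an upper bound on the threshold, which is the partial learning mentioned in the summary.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(policy, p=0.5, delta=0.9, horizon=50):
    """Single-threshold model where exceeding the threshold is forgiven w.p. p.

    policy(lo, hi) -> action, where (lo, hi) is the interval of threshold
    values still consistent with the feedback observed so far.
    """
    theta = rng.uniform(0.0, 1.0)        # threshold, drawn once and never observed
    lo, hi = 0.0, 1.0                    # platform's knowledge: theta lies in (lo, hi)
    total = 0.0
    for t in range(horizon):
        x = policy(lo, hi)
        if x <= theta:                   # tolerated: reward, and a new lower bound
            total += (delta ** t) * x
            lo = max(lo, x)
        else:                            # exceeded: no reward ...
            if rng.random() < 1.0 - p:   # ... and abandonment with probability 1 - p
                break
            hi = min(hi, x)              # ... otherwise the platform learns an upper bound
    return total

# Example policy (hypothetical): probe the midpoint of what is still possible.
midpoint = lambda lo, hi: 0.5 * (lo + hi)
print(np.mean([simulate(midpoint) for _ in range(2000)]))
```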
Understanding the structure
Result: once the state is informative enough, the optimal action in that state no longer changes.
Dynamic program: the state is the interval of threshold values still consistent with the feedback observed so far.
Two phases: adapt, then constant.
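One way to write the dynamic program this refers to, under assumed notation (state = interval (a, b) of threshold values consistent with past feedback, q(x) = P(θ ≥ x | θ ∈ (a, b))); the exact form in the talk may differ.

```latex
% Sketch of the Bellman equation for the partial-abandonment model (assumed notation).
\[
  V(a, b) \;=\; \max_{x \in (a, b)}
  \Bigl\{\, q(x)\bigl[r(x) + \delta\, V(x, b)\bigr]
        \;+\; \bigl(1 - q(x)\bigr)\, p\,\delta\, V(a, x) \,\Bigr\},
  \qquad
  q(x) \;=\; \frac{F(b) - F(x)}{F(b) - F(a)}.
\]
```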
[Diagram: spectrum from "always abandon" to "never abandon", with abandonment probability 1-p in between; the optimal policy ranges from the constant policy at one end to more aggressive or more conservative policies in between.]
Model for personalization with risk of abandonment.
Result: the optimal policy is constant (no learning).
Bounds on performance with additive noise.
Feedback leads to partial learning, and the optimal policy can be more aggressive or more conservative.
Learning about thresholds across users.
Feel free to reach out: schmit@stanford.edu