Learning with Abandonment

Ramesh Johari, Sven Schmit

Stanford University

INFORMS 2017 - Houston, TX

[Figure: a platform and a user interact over time; the user may abandon]

Key ingredients

  • User preferences
  • Feedback besides abandonment

Related work

  • Abandonment
    • Lu, Kanoria, and Lobel [2017]
  • Dynamic pricing
    • Pavan, Segal, and Toikka [2014]
    • Myerson [1981]

Model

  • Discrete time: t = 0, 1, 2, \ldots
  • Actions: x_0, x_1, x_2, \ldots
  • Reward: R_t(x_t), \mathbb{E} R_t(x_t) = r(x_t) > 0
  • Thresholds: \theta_0, \theta_1, \theta_2, \ldots
  • Stopping time: T = \min\{ t : x_t > \theta_t \}
  • Discount factor: \gamma
  • Objective: \max_{x_t} \mathbb{E}_{\theta, R} \sum_{t = 0}^{T-1} \gamma^t R_t(x_t)

Setup

[Timeline figure: actions x_0, x_1, x_2, x_3, x_4 over time, earning discounted rewards R_0, \gamma R_1, \gamma^2 R_2, \gamma^3 R_3 until the user abandons at T = 4; thresholds \{\theta_t\}_{t=0}^\infty, discount factor \gamma]
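To make the setup concrete, here is a minimal simulation sketch of one user episode; the iid uniform thresholds, the linear reward R_t(x_t) = x_t, and \gamma = 0.9 are illustrative assumptions, not choices made in the talk.

    import random

    def simulate_episode(policy, gamma=0.9, horizon=1000, seed=None):
        """Run one user until abandonment (x_t > theta_t) or the horizon.

        policy(t) returns the action x_t; thresholds theta_t are drawn iid
        U[0, 1] and the realized reward is R_t(x_t) = x_t (assumptions).
        """
        rng = random.Random(seed)
        total = 0.0
        for t in range(horizon):
            x = policy(t)
            theta = rng.random()       # theta_t ~ U[0, 1]
            if x > theta:              # abandonment: stopping time T = t
                break
            total += gamma ** t * x    # discounted reward gamma^t R_t(x_t)
        return total

    # Estimate the value of the constant policy x_t = 0.3
    values = [simulate_episode(lambda t: 0.3, seed=i) for i in range(5000)]
    print(sum(values) / len(values))

The iid draw of theta_t here corresponds to the independent-thresholds case discussed next.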

Threshold models

Independent thresholds

\theta_t \sim F, iid

Result

The optimal policy is a constant policy, x_t = x^*_{iid}, where

x^*_{iid} \in \arg\max_{x} r(x) \frac{1-F(x)}{1-\gamma (1-F(x))}
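A quick numerical sketch of this formula via grid search; F = U[0, 1], r(x) = x, and \gamma = 0.9 are assumed only for illustration.

    import numpy as np

    gamma = 0.9                          # discount factor (assumed)
    xs = np.linspace(0.0, 1.0, 10_001)   # action grid
    r = xs                               # mean reward r(x) = x (assumed)
    F = xs                               # F(x) = x for U[0, 1] thresholds (assumed)

    objective = r * (1 - F) / (1 - gamma * (1 - F))
    x_iid = xs[np.argmax(objective)]
    print(x_iid)   # about 0.24, more conservative than the single-threshold answer 1/2 below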

Single threshold

\theta \sim F, \theta_t = \theta

Threshold drawn once

Example

\theta \sim U[0, 1], R_t(x_t) = x_t

Play action x. Outcome:

  • \mathbb{P}(x > \theta) = x: no reward, user abandons
  • \mathbb{P}(x < \theta) = 1-x: reward x, continue with \theta \mid x \sim U[x, 1]

Optimal policy: x_t = 1/2
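A quick check of the value of a constant policy x_t = x in this example (that a constant policy is in fact optimal is the result on the next slide): once the user survives the first round, \theta > x is fixed, so she never abandons afterwards, and

V_c(x) = \mathbb{P}(\theta > x) \sum_{t=0}^{\infty} \gamma^t x = \frac{x(1-x)}{1-\gamma}, \qquad \arg\max_{x \in [0,1]} V_c(x) = \frac{1}{2}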

Optimal Policy

Result

The optimal policy is a constant policy, x_t = x^* for all t, where

x^* = \arg\max_x r(x) (1-F(x))

Proof by induction on value iteration

Intuition

Suppose the optimal policy is increasing: x_t = y, x_{t+1} = z > y.

Compare to the (constant) policy that plays x_t = z at time t. Two cases:

  • \theta < y: no difference with x_t = z (the user abandons either way)
  • \theta > y: optimal to play x_t = z, x_{t+1} = z

Surprising corollary

Back to the Uniform example: if the threshold is drawn from \theta \sim U[c, 1], the optimal policy remains the same,

x^* = 1/2

for any c \in [0, 1/2].
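Why this holds: for x \ge c, 1 - F(x) = (1-x)/(1-c), so \arg\max_x x (1-F(x)) = \arg\max_x x(1-x) = 1/2 whenever 1/2 \ge c. A quick grid check (the grid and the particular values of c are my own choices):

    import numpy as np

    xs = np.linspace(0.0, 1.0, 100_001)
    for c in [0.0, 0.1, 0.3, 0.5]:
        F = np.clip((xs - c) / (1 - c), 0.0, 1.0)   # CDF of U[c, 1]
        x_star = xs[np.argmax(xs * (1 - F))]        # r(x) = x from the running example
        print(c, round(x_star, 3))                  # prints 0.5 for each c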

[Recap figure: single threshold → constant policy; independent thresholds → constant policy; in between → ?]

Thresholds in between the extremes:

\theta_t = \theta + \epsilon_t

Optimal policy is increasing, and intractable

Small noise

\epsilon_t \in [-y, y]

Result

If the mean reward function r is L-Lipschitz, then there exists a constant policy such that

V^* - V_c \le \frac{2 y L}{1-\gamma}
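A Monte Carlo sketch of the value of a constant policy under this noise model; \theta \sim U[0, 1], \epsilon_t \sim U[-y, y], r(x) = x, and \gamma = 0.9 are illustrative assumptions. With r(x) = x the Lipschitz constant is L = 1, so the result above guarantees that some constant policy is within 2y/(1-\gamma) of optimal.

    import random

    def value_constant(x, y, gamma=0.9, horizon=200, n_users=10_000, seed=0):
        """Estimate V_c for the constant policy x_t = x when theta_t = theta + eps_t."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(n_users):
            theta = rng.random()              # theta ~ U[0, 1] (assumed)
            for t in range(horizon):
                eps = rng.uniform(-y, y)      # eps_t ~ U[-y, y]
                if x > theta + eps:           # abandonment: x_t > theta_t
                    break
                total += gamma ** t * x       # reward r(x) = x (assumed)
        return total / n_users

    print(value_constant(x=0.4, y=0.05))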
[Recap figure: fixed threshold → constant policy; independent thresholds → constant policy; \theta_t = \theta + \epsilon_t with small noise → constant policy approximately optimal; large noise → ?]

Large noise

Result (stated informally)

If r is bounded by B and Lipschitz, and F satisfies a regularity condition, then there exists a constant policy such that

V^* - V_c \le \frac{2 \eta B}{1-\gamma} + (1-\eta) \frac{L_v w}{2}

Constants depend on the distributions, the discount factor, and the reward function

[Recap figure: fixed threshold → constant policy; independent thresholds → constant policy; \theta_t = \theta + \epsilon_t (small or large noise) → constant policy approximately optimal]

Feedback

Key idea

User does not always abandon (immediately)

Focus on the single threshold model

When x_t > \theta: no reward, and the user abandons with probability 1-p
(a spectrum between the extremes: p = 0, always abandon, as before; p = 1, never abandon)

Optimal policy?

Signal model

  • x_t < \theta: reward R_t(x_t), continue to x_{t+1}
  • x_t > \theta: reward 0; continue to x_{t+1} with probability p, stop with probability 1-p

Optimal Policy

Understanding structure: partial learning

Dynamic program

State: [l,u] such that \theta \mid x_0, x_1, \ldots, x_{t-1} \in [l, u]

Result

For every u there exists \epsilon(u) > 0 such that if u - l < \epsilon(u), the optimal action in state [l, u] is x_t = l

Two phases: adapt then constant
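A sketch of value iteration for this dynamic program on a discretized grid of belief intervals; the uniform prior \theta \sim U[0, 1], the reward r(x) = x from the running example, and the parameter values are my own choices. I assume the platform sees on which side of \theta each surviving action fell (this is what makes the belief an interval [l, u]): playing x in state [l, u], with probability (u - x)/(u - l) the reward is x and the belief becomes [x, u]; otherwise the reward is 0 and, with probability p, the user stays and the belief becomes [l, x].

    import numpy as np

    def solve_signal_model(p=0.5, gamma=0.9, n=100, iters=200):
        """Value iteration over belief states [l, u] = [g[i], g[j]], theta ~ U[l, u]."""
        g = np.linspace(0.0, 1.0, n + 1)   # grid of interval endpoints
        V = np.zeros((n + 1, n + 1))       # V[i, j] approximates the value of [g[i], g[j]]
        for _ in range(iters):
            V_new = np.zeros_like(V)
            for i in range(n + 1):
                for j in range(i + 1, n + 1):
                    ks = np.arange(i, j + 1)                  # candidate actions x = g[k]
                    stay = (g[j] - g[ks]) / (g[j] - g[i])     # P(theta > x | theta in [l, u])
                    vals = (stay * (g[ks] + gamma * V[ks, j])        # reward x, belief -> [x, u]
                            + (1 - stay) * p * gamma * V[i, ks])     # no reward, stays w.p. p, belief -> [l, x]
                    V_new[i, j] = vals.max()
            V = V_new
        return g, V

    g, V = solve_signal_model()
    print(V[0, -1])   # value when starting from the uninformative belief [0, 1]

Per the result above, once the belief interval is narrow enough the maximizing action sits at its lower endpoint l, i.e. the policy stops experimenting and becomes constant.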

Initial action

[Spectrum figure: from always abandon to never abandon, with abandonment probability 1-p]

  • Constant policy: x_0 = x_c^*
  • Aggressive policy: x_0 > x_c^*
  • Conservative policy: x_0 < x_c^*

Wrapping up

Open problem

Learning about thresholds across users

Summary

  • Model for personalization with risk of abandonment
  • Result: optimal policy is constant (no learning)
  • Bounds on performance with additive noise
  • Feedback leads to partial learning, and the optimal policy can be more aggressive or more conservative

Feel free to reach out: schmit@stanford.edu
