# Learning with Abandonment

Ramesh Johari, Sven Schmit

Stanford University

INFORMS 2017 - Houston, TX

Setting: a platform interacts with a user over time, and the user may abandon.

Key ingredients

• User preferences
• Feedback besides abandonment

## Related work

• Abandonment: Lu, Kanoria, and Lobel [2017]
• Dynamic pricing: Pavan, Segal, and Toikka [2014]; Myerson [1981]

# Model

• Discrete time: $t = 0, 1, 2, \ldots$
• Actions: $x_0, x_1, x_2, \ldots$
• Rewards: $R_t(x_t)$, with mean $\mathbb{E} R_t(x_t) = r(x_t) > 0$
• Thresholds: $\theta_0, \theta_1, \theta_2, \ldots$
• Stopping time: $T = \min\{ t : x_t > \theta_t \}$
• Discount factor: $\gamma$
• Objective: $\max_{\{x_t\}} \mathbb{E}_{\theta, R} \sum_{t = 0}^{T-1} \gamma^t R_t(x_t)$

## Setup

Timeline: actions $x_0, x_1, x_2, x_3$ stay below their thresholds $\{\theta_t\}_{t=0}^\infty$ and earn discounted rewards $R_0, \gamma R_1, \gamma^2 R_2, \gamma^3 R_3$; action $x_4$ exceeds its threshold, the user abandons, and $T = 4$.

# Threshold models

## Independent thresholds

Thresholds are drawn iid: $\theta_t \sim F$.

Result

The optimal policy is a constant policy $x_t = x^*_{iid}$, where

$x^*_{iid} \in \arg\max_{x} r(x) \frac{1-F(x)}{1-\gamma (1-F(x))}$
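As a numerical sketch (not part of the talk), the maximizer can be found by grid search; here we assume, for illustration, $\theta_t \sim U[0,1]$, $r(x) = x$, and $\gamma = 0.9$:

```python
import numpy as np

# Grid search for the optimal constant action under iid thresholds,
# assuming (illustratively) theta_t ~ U[0,1], r(x) = x, gamma = 0.9.
gamma = 0.9
x = np.linspace(0.001, 0.999, 999)
F = x                                       # CDF of U[0,1]
value = x * (1 - F) / (1 - gamma * (1 - F))
x_star = x[np.argmax(value)]
print(round(x_star, 3))                     # ≈ 0.24
```

Note the iid maximizer (≈ 0.24 here) is more conservative than the myopic maximizer of $x(1-x)$, since abandonment also forfeits all future rewards.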

## Single threshold

The threshold is drawn once, $\theta \sim F$, and fixed over time: $\theta_t = \theta$.

## Example

Suppose $\theta \sim U[0, 1]$ and $R_t(x_t) = x_t$. Playing action $x$:

• With probability $\mathbb{P}(x > \theta) = x$: the user abandons, no reward.
• With probability $\mathbb{P}(x < \theta) = 1-x$: reward $x$, and we continue with posterior $\theta \mid x \sim U[x, 1]$.

Optimal policy: $x_t = 1/2$.

## Optimal Policy

Result

The optimal policy is a constant policy $x_t = x^*$ for all $t$, where

$x^* = \arg\max_x r(x) (1-F(x))$

Proof by induction on value iteration.
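A quick numerical check of the uniform example, where $r(x)(1-F(x)) = x(1-x)$ should peak at $1/2$:

```python
import numpy as np

# Single threshold theta ~ U[0,1], r(x) = x: the optimal constant
# action maximizes r(x) * (1 - F(x)) = x * (1 - x).
x = np.linspace(0, 1, 10001)
value = x * (1 - x)
x_star = x[np.argmax(value)]
print(x_star)  # 0.5
```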

## Intuition

Suppose the optimal policy is increasing: $x_t = y$ and $x_{t+1} = z > y$. Compare to the (constant) policy that plays $x_t = z$ at time $t$. Two cases:

• $\theta < y$: both actions exceed the threshold, so there is no difference.
• $\theta > y$: it is optimal to play $x_t = z$ and $x_{t+1} = z$, since this faces the same posterior but earns the larger reward one period earlier.

## Surprising corollary

Back to the Uniform example: if the threshold is drawn from $\theta \sim U[c, 1]$, the optimal policy remains the same, $x^* = 1/2$, for any $c \in [0, 1/2]$.
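The corollary is easy to check numerically; the values of $c$ below are illustrative:

```python
import numpy as np

# Check the corollary: with theta ~ U[c, 1] and r(x) = x, the maximizer
# of r(x) * (1 - F(x)) = x * (1 - x) / (1 - c) stays at 1/2
# for every c in [0, 1/2].
maximizers = {}
for c in [0.0, 0.2, 0.4, 0.5]:
    x = np.linspace(c, 1, 100001)
    F = (x - c) / (1 - c)                    # CDF of U[c, 1] on [c, 1]
    maximizers[c] = x[np.argmax(x * (1 - F))]
print({c: round(xs, 3) for c, xs in maximizers.items()})  # each ≈ 0.5
```

Intuitively, truncating the prior at $c \le 1/2$ rescales $x(1-x)$ by the constant $1/(1-c)$ on $[c, 1]$, leaving the maximizer unchanged.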

So far: with a single threshold the optimal policy is constant, and with independent thresholds the optimal policy is constant. What about thresholds in between these extremes, $\theta_t = \theta + \epsilon_t$? There the optimal policy is increasing, and intractable.

## Small noise

Suppose $\theta_t = \theta + \epsilon_t$ with $\epsilon_t \in [-y, y]$.

Result

If the mean reward function $r$ is $L$-Lipschitz, then there exists a constant policy such that

$V^* - V_c \le \frac{2 y L}{1-\gamma}$
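A Monte Carlo sketch of a constant policy under small noise; all specifics are illustrative assumptions: $\theta \sim U[0,1]$, $\epsilon_t \sim U[-y, y]$ with $y = 0.01$, $r(x) = x$ (so $L = 1$), $\gamma = 0.9$, which makes the bound $2yL/(1-\gamma) = 0.2$:

```python
import numpy as np

# Estimate the value of the constant policy x_t = 1/2 under small
# additive noise theta_t = theta + eps_t. Assumptions (illustrative):
# theta ~ U[0,1], eps_t ~ U[-y, y], y = 0.01, r(x) = x, gamma = 0.9.
rng = np.random.default_rng(0)
gamma, y, x_c = 0.9, 0.01, 0.5

def episode(horizon=100):
    theta = rng.uniform(0, 1)
    total = 0.0
    for t in range(horizon):
        if x_c > theta + rng.uniform(-y, y):   # threshold crossed: abandon
            break
        total += gamma ** t * x_c              # reward r(x_c) = x_c
    return total

v_c = np.mean([episode() for _ in range(10000)])
print(round(v_c, 2))   # close to the noiseless value 2.5
```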

Takeaway: with small additive noise, a constant policy is approximately optimal.

## Large noise

Result (stated informally)

If $r$ is bounded by $B$ and Lipschitz, and $F$ satisfies a regularity condition, then there exists a constant policy such that

$V^* - V_c \le \frac{2 \eta B}{1-\gamma} + (1-\eta) \frac{L_v w}{2}$

The constants depend on the distributions, the discount factor, and the reward function.

Takeaway: under both small and large noise, a constant policy is approximately optimal.

# Feedback

Key idea

User does not always abandon (immediately)

Focus on single threshold model

Dynamics: if $x_t < \theta$, we earn reward $R_t(x_t)$ and continue with $x_{t+1}$. If $x_t > \theta$, there is no reward; the user abandons with probability $1-p$, and stays with probability $p$. This interpolates between always abandoning ($p = 0$) and never abandoning ($p = 1$). What is the optimal policy?
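A Monte Carlo sketch of these dynamics under a constant policy, with illustrative assumptions $\theta \sim U[0,1]$, $r(x) = x$, $\gamma = 0.9$:

```python
import numpy as np

# Feedback model under a constant policy. Assumptions (illustrative):
# theta ~ U[0,1], r(x) = x, gamma = 0.9. Overshooting (x > theta) earns
# nothing, and the user abandons only with probability 1 - p.
rng = np.random.default_rng(1)
gamma = 0.9

def constant_policy_value(x, p, n=10000, horizon=100):
    total = 0.0
    for _ in range(n):
        theta = rng.uniform(0, 1)
        for t in range(horizon):
            if x < theta:
                total += gamma ** t * x    # success: reward, user stays
            elif rng.uniform() > p:
                break                      # overshoot: user abandons
    return total / n

# A constant policy cannot exploit the second chances: once it overshoots
# it overshoots forever and never earns again, so its value should be
# (roughly) the same for every p.
values = {p: constant_policy_value(0.5, p) for p in [0.0, 0.5, 0.9]}
print({p: round(v, 2) for p, v in values.items()})
```

The estimates stay near 2.5 for every $p$: the extra feedback only helps a policy that adapts its actions.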

## Optimal policy with feedback

Understanding the structure of the optimal policy.

## Partial learning

Dynamic program. State: an interval $[l,u]$ such that $\theta \mid x_0, x_1, \ldots, x_{t-1} \in [l, u]$.

Result

For every $u$ there exists $\epsilon(u) > 0$ such that if $u - l < \epsilon(u)$, the optimal action in state $[l, u]$ is $x_t = l$.
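The interval dynamic program can be sketched by value iteration on a discretized state space. Everything specific below is an illustrative assumption: $\theta \sim U[0,1]$ (so $\theta \mid [l,u]$ is uniform on $[l,u]$), $r(x) = x$, $\gamma = 0.9$, $p = 0.5$, a 21-point grid, and value 0 assigned to degenerate intervals:

```python
import numpy as np

# Discretized value iteration for the interval DP of the feedback model.
# Playing x in [l, u]: with prob (u - x)/(u - l) we earn x and the state
# shrinks to [x, u]; otherwise no reward, and with prob p the user stays
# and the state shrinks to [l, x].
gamma, p, m = 0.9, 0.5, 21
grid = np.linspace(0, 1, m)
V = np.zeros((m, m))          # V[i, j]: value of interval [grid[i], grid[j]]

for _ in range(300):          # value iteration (contraction factor gamma)
    V_new = np.zeros_like(V)
    for i in range(m):
        for j in range(i + 1, m):
            l, u = grid[i], grid[j]
            best = -np.inf
            for k in range(i, j + 1):        # candidate actions x = grid[k]
                x = grid[k]
                q = (u - x) / (u - l)        # P(theta > x | theta in [l, u])
                val = q * (x + gamma * V[k, j]) + (1 - q) * p * gamma * V[i, k]
                best = max(best, val)
            V_new[i, j] = best
    V = V_new

# Value of the prior state [0, 1]. With feedback, learning should beat the
# best constant policy, whose value here is max_x x*(1-x)/(1-gamma) = 2.5.
print(round(V[0, m - 1], 2))
```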


## Initial action

• Constant policy: $x_0 = x_c^*$
• Aggressive policy: $x_0 > x_c^*$
• Conservative policy: $x_0 < x_c^*$

# Wrapping up

## Summary

• Model for personalization with risk of abandonment
• Result: optimal policy is constant (no learning)
• Bounds on performance with additive noise
• Feedback leads to partial learning, and the optimal policy can be more aggressive or more conservative

Feel free to reach out: schmit@stanford.edu
