Human Interaction with Recommendation Systems:

On Bias and Exploration

 

Sven Schmit, joint work with Ramesh Johari, Vijay Kamble, and Carlos Riquelme

Recommendations

User preferences

User picks an item

User reports outcome (e.g. rating)

entire population vs. specific population

Example

Goal: find the quality of items efficiently

Bias

 

What happens if we ignore (private) preferences?

Exploration

 

Do private preferences help us learn about every item?

Model

 

Propose a simple model for the dynamics

Outline

 

Finding quality efficiently

Model

At every time t, a new user arrives

K items:

i=1,2,\ldots,K

(Private) preference over each item

\theta_{it} \sim F_i

Recommendation server supplies score

q_{it}

Quality

Q_i

User selects item

a_t = \arg\max_i q_{it} + \theta_{it}

[Ifrach et al. 2013]

Horizontal differentiation (preferences)

Vertical differentiation (quality)

Model ctd.

User selects item

a_t = \arg\max_i q_{it} + \theta_{it}

User reports value

V_t(a_t) = Q_{a_t} + \theta_{a_t t} + \epsilon_t

Server uses this to update scores

\vec{q}_{t+1} = f(V_1, V_2, \ldots, V_t)

(value: quality + preference + error; choice: recommendation + preference)
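To make the dynamics concrete, here is a minimal Python sketch of one round, assuming normal preference distributions F_i and Gaussian reporting noise; the variable names, scales, and K = 5 are illustrative choices, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                                  # number of items (illustrative)
Q = np.sort(rng.normal(size=K))[::-1]  # true qualities, ordered Q_1 > Q_2 > ... > Q_K
PREF_SCALE = 1.0                       # spread of private preferences (assumption)
NOISE_SCALE = 0.1                      # reporting noise (assumption)

def one_round(q):
    """One user arrival given the server's current scores q (length-K array)."""
    theta = rng.normal(scale=PREF_SCALE, size=K)   # private preferences theta_it ~ F_i
    a = int(np.argmax(q + theta))                  # user picks a_t = argmax_i q_it + theta_it
    eps = rng.normal(scale=NOISE_SCALE)
    V = Q[a] + theta[a] + eps                      # reported value V_t = Q_a + theta_at + eps_t
    return a, theta, V
```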

Performance metric

Output K scores, observe 1 outcome

(Pseudo)-regret:

R_T = \sum_{t=1}^T \max_i (Q_i + \theta_{it}) - (Q_{a_t} + \theta_{a_t t})

expected value of best item vs. expected value of selected item

Optimal policy:

q_{it} \equiv Q_i

Partial feedback [Cesa-Bianchi, Bubeck 2012]
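Under the same illustrative simulation setup as the sketch above, the per-round pseudo-regret increment is just the gap between the best achievable value and the realized one:

```python
def regret_increment(theta, a):
    """One term of R_T: max_i (Q_i + theta_it) minus (Q_a + theta_at) for the chosen item a."""
    return np.max(Q + theta) - (Q[a] + theta[a])
```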

Comment

This model does not use covariates that could explain preferences

Data is sparse: limited information per user

Highlight issues introduced by preferences

Analysis of the simple model

Order items

Q_1 > Q_2 > \ldots > Q_K

Preferences are Bernoulli:

\theta_{it} \sim \text{Bernoulli}(p)

Define gap:

\Delta = Q_1 - Q_2

Relevant regime

p \lesssim \frac{\log(K)}{2K}

Bias

[Marlin 2003] [Marlin et al. 2007] [Steck 2010]

Naive recommendation server

Score is the average of reported values:

q_{it} = \frac{1}{|S_{it}|} \sum_{\tau \in S_{it}} V_\tau

where

S_{it} = \{\tau < t : a_\tau = i \}

are the times item i was chosen
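A sketch of this naive server on top of the simulation above (all names illustrative). The issue it exhibits: item i only enters the average when a user with a favorable theta_it chose it, so the running mean systematically overestimates Q_i.

```python
def run_naive(T=10_000):
    """Naive server: score each item by the running mean of the values reported for it."""
    sums, counts = np.zeros(K), np.zeros(K)
    q = np.zeros(K)                      # current scores q_it
    for _ in range(T):
        a, theta, V = one_round(q)
        sums[a] += V                     # V folds in the self-selected theta_at
        counts[a] += 1
        q[a] = sums[a] / counts[a]
    return q, counts
```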

Linear regret almost surely

Under the simple Bernoulli(p) preferences model, if

\Delta < \Delta^*(K, p)

then

\limsup_t \frac{R_t}{t} \ge c \quad \text{a.s.}

for some

c > 0

(linear regret)

Note:

\Delta^*(2, 0.5) = \frac{1}{3}

Simulations

Bernoulli preferences, p = 1/2: with \Delta = 0.5 there is a gap in scores; with \Delta = 0.2 there is no gap in scores and the worst arm is selected too often.

Solution

Debias estimates

 

  • Computationally difficult
  • Distributional assumptions
  • Not robust

 

Request different feedback

 

  • Ask user for debiased estimates
  • Computationally easy
  • No assumptions
  • Feasible in practice?

How does the item compare to your expectation?

W_t = V_t - (q_{it} + \theta_{it})

Better recommendations

Score is the average of unbiased values:

q'_{it} = \frac{1}{|S_{it}|} \sum_{\tau \in S_{it}} (W_\tau + q'_{i\tau})

where

S_{it} = \{\tau < t : a_\tau = i \}

are the times item i was chosen
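A sketch of the debiased server under the same illustrative setup: the user reports how the item compared to their expectation, W_t = V_t - (q_it + theta_it), and adding the score at recommendation time back in gives W_t + q'_it = Q_i + eps_t, an unbiased observation of quality.

```python
def run_debiased(T=10_000):
    """Debiased server: average W_tau + q'_{i,tau}, which equals Q_i plus noise."""
    sums, counts = np.zeros(K), np.zeros(K)
    q = np.zeros(K)                      # current scores q'_it
    for _ in range(T):
        a, theta, V = one_round(q)
        W = V - (q[a] + theta[a])        # feedback relative to the user's expectation
        sums[a] += W + q[a]              # equals Q_a + eps: the preference cancels
        counts[a] += 1
        q[a] = sums[a] / counts[a]
    return q, counts
```

In this simulation the naive scores sit above the true qualities while the debiased scores concentrate around Q.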

Exploration

Motivation

How can we incentivize myopic users?

Could users explore due to heterogeneous preferences?

Exploration is difficult

confidence bounds & incentives

[Hummel and McAfee 2014] [Slivkins et al 2015] [Frazier et al. 2014] [Papanastasiou et al. 2014]

Assumptions

Bernoulli preferences (K items)

But: lower bound the scores so that

\max_i q'_{it} - \min_i q'_{it} < 1

Server takes empirical averages

q'_{it}
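One illustrative way to realize the bounded-range condition in the sketches above (not necessarily the talk's construction) is to clip scores from below so their spread stays under 1; with Bernoulli(p) preferences this leaves every item a positive probability of being chosen, which is the free exploration the result relies on.

```python
def clip_scores(q, width=0.999):
    """Enforce max_i q_i - min_i q_i < 1 by clipping scores from below."""
    return np.maximum(q, q.max() - width)
```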

Regret bound

If

\epsilon_t

is sub-Gaussian, then the regret bound

E(R_t) \le CK\log(t) + C'K

holds.

Notes:

  • problem dependent bound
  • constants depend on the relation between p and K

Regret simulations (normal preferences, \Delta = 0.4): biased vs. unbiased estimates
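A compact harness, under the same illustrative setup, for reproducing this kind of comparison: both servers run through the same simulation loop, and only the statistic they average differs (the raw value V for the biased version, V - theta for the unbiased one).

```python
def cumulative_regret(statistic, T=5_000):
    """Run a server defined by its per-observation statistic and track cumulative pseudo-regret."""
    sums, counts, q = np.zeros(K), np.zeros(K), np.zeros(K)
    total, history = 0.0, []
    for _ in range(T):
        a, theta, V = one_round(q)
        total += np.max(Q + theta) - (Q[a] + theta[a])
        history.append(total)
        sums[a] += statistic(V, q[a], theta[a])
        counts[a] += 1
        q[a] = sums[a] / counts[a]
    return history

biased_curve   = cumulative_regret(lambda V, q_a, th_a: V)         # naive: averages reported values
unbiased_curve = cumulative_regret(lambda V, q_a, th_a: V - th_a)  # debiased: W + q' = V - theta
```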

Take-away

Users are effective explorers

But: slow start for new items

Rather than explicitly explore, highlight new items

Wrapping up

Bias

Private preferences lead to bad outcomes

 

Ask user for unbiased feedback

Exploration

Private preferences lead to free exploration

 

Highlight new items

Model: preferences, recommendations, choice, feedback
