The Effect of Experimenter Demand on Inference

Brisbane, April 2025

David Danz (Pittsburgh) \textcircled{r} Guillermo Lezama (Pittsburgh) \textcircled{r} Pun Winichakul (Smith College) \textcircled{r} Priyoma Mustafi (Pittsburgh) \textcircled{r} Marissa Lepper (Texas A&M) \textcircled{r} Lise Vesterlund (Pittsburgh) \textcircled{r} Alistair Wilson (Pittsburgh)


Experimenter Demand

[The participant’s] general attitude of mind is that of ready complacency and cheerful willingness to assist the investigator in every possible way by reporting to him those very things which he is most eager to find.

-A. H. Pierce, 1908

 

The subject’s performance in an experiment might almost be conceptualized as problem-solving behavior... he sees it as his task to ascertain the true purpose of the experiment and respond in a manner which will support the hypotheses being tested.

-M. T. Orne, 1962

Experimenter Demand

“the critical assumption underlying the interpretation of data from lab experiments is that the insights gained can be extrapolated to the world beyond”

-S. Levitt and J. List, 2007

“many reasons to suspect that these laboratory findings might fail to generalize to real markets”

“field experiments avoid many of the important obstacles to generalizability faced by lab experiments”

-S. Levitt and J. List, 2008

Outline

1. EDE Mitigation

2. EDE Measurement

Jonathan de Quidt (Queen Mary), Lise Vesterlund (Pittsburgh), Alistair Wilson (Pittsburgh)

Experimenter Demand

2019 (ed. Schram & Ule)

2024 (ed. Rees-Jones)

1. EDE Mitigation

2. EDE Measurement

  • Experimenter Demand Effects, de Quidt, Vesterlund, and Wilson (2019, 2024)
  • Best practice: Mitigate hypothesis speculation and reduce responsiveness to EDE
    • Documentation (instructions, screen shots, survey, script, etc.)
    • Conceal the hypothesis
      • Abstract frame
      • Hiding the independent variable
        • between-subject designs
        • within-subject designs w/ “progressive revelation”
    • Reduce responsiveness to EDE
      • Incentives
      • Anonymous decisions
      • Participants
      • Limit experimenter-participant interaction

1. EDE Mitigation

2. EDE Measurement

                   Top Five   Exp. Econ
Between-subject       59%        89%
Abstract frame        89%        96%
Blind                 83%        94%
Incentivized          91%        99%
All the above         46%        84%

Design:

1. EDE Mitigation

                Top Five   Exp. Econ
Classroom           5%         5%
Lab                68%        84%
Lab-in-field       17%         7%
Online             12%         2%

Population:

Impact of EDE

2. EDE Measurement

  • Best practice – “ward off critiques that a result might be driven by experimenter demand”
  • Assessing EDE impact on decision estimates
    • Incentivized actors induced to make high or low contributions (Bischoff and Frank, 2011)
    • Design-by-correlation
      • Dhar et al. (2018): Crowne-Marlowe social desirability scale to assess whether treatment-driven changes in attitudes toward gender equality result from experimenter demand
      • Allcott and Taubinsky (2015): Snyder (1974) “Self-Monitoring Scale” to measure subjects’ responsiveness to experimenter demand
      • Tsutsui and Zizzo (2013): Demand-susceptibility measure from a dominated lottery choice w/ the statement “it would be nice if you were to choose” and a smiley face
    • Bound EDE (de Quidt et al., 2017): intentional positive and negative demand
      • strong demand: “do us a favor if...”
      • weak demand: “we expect that...”
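The bounding logic can be sketched in a few lines: if behavior responds monotonically to demand, the means under deliberately applied negative and positive demand bracket the demand-free mean. The function name and contribution numbers below are hypothetical, not data from any cited study.

```python
# Sketch of the bounding approach (de Quidt et al.): bound the demand-free
# mean action by the mean actions under opposing demand manipulations.

def demand_bounds(actions_negative, actions_positive):
    """Return (lower, upper) bounds on the demand-free mean action."""
    mean = lambda xs: sum(xs) / len(xs)
    lo, hi = mean(actions_negative), mean(actions_positive)
    return (min(lo, hi), max(lo, hi))

# Hypothetical contributions under the two demand manipulations:
negative = [2.0, 3.0, 2.5, 1.5]   # "do us a favor... lower action"
positive = [4.0, 5.0, 4.5, 3.5]   # "do us a favor... higher action"

print(demand_bounds(negative, positive))  # (2.25, 4.25)
```

A narrow interval then indicates the behavior is insensitive to even strong demand.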

Impact of EDE on Inference?

  • The objective of much of experimental research is qualitative inference (Kessler & Vesterlund, 2015).

    • Causal effect of \(X\) on \(Y\)

    • Direction that is economically meaningful (and statistically significant)

  • Can EDE alter inference?

    • Impact of an ill-intentioned experimenter who differentially applies positive and negative demand across a decision pair?

      • False negatives – where true effect is positive

      • False positives – where true effect is null


What do we do?

  • Use “worst case scenario” to assess false negatives and false positives
  • Differentially apply strong positive and negative demand across a decision pair (de Quidt, Haushofer and Roth, AER 2018)

    You will do us a favor if you take a higher (lower) action than you normally would.

  • Four domains
    1. Probability weighting (risk)
    2. The endowment effect (ownership)
    3. Charitable giving (self vs. other)
    4. Intertemporal choice (now vs. later)

Design

  • Eight within-subject decisions:
    • Four lottery valuations:
      • WTP and WTA
      • Lotteries with 10% and 90% chance of winning $10
    • Two donations:
      • Matched (low cost)
      • Unmatched (high cost)
    • Two intertemporal allocations:
      • Immediate (today vs a week from now)
      • Delayed  (tomorrow vs week from tomorrow)

Design

  • Eight within-subject decisions:
    • Four lottery valuations
    • Two donations
    • Two intertemporal allocations
  • Three between-subject treatments:
    1. No demand
    2. Positive demand
    3. Negative demand
  • Three Populations:
    1. Laboratory (Pitt undergrads)
    2. Mechanical Turk
    3. Prolific 

Online Populations

[Figure: papers on Google Scholar by year, 2013–2023]

Design

  • Eight within-subject decisions:
    • Four lottery valuations
    • Two donations
    • Two intertemporal allocations
  • Three between-subject treatments:
    1. No demand
    2. Positive demand
    3. Negative demand
  • Three Populations:
    1. Laboratory (Pitt undergrads; N=236, ~80/treatment)
    2. Mechanical Turk (N=756, ~250/treatment)
    3. Prolific (N=732, ~240/treatment)

Probability Weighting

Probability Weighting

Literature Predictions:

  • Kahneman & Tversky 1979; Prelec 1998 
    • Risk seeking at low probabilities: \(\text{WTP}(\tfrac{1}{10})>\$1\)
    • Risk aversion at high probabilities: \(\text{WTP}(\tfrac{9}{10})<\$9\)

Task:

Endowed with $10, and asked about willingness to pay for the lottery:

\( p\cdot\$10\oplus(1-p)\cdot \$0\)

with two probabilities of winning \(p\in\left\{\tfrac{1}{10},\tfrac{9}{10}\right\}\)
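These predictions follow from an inverse-S probability weighting function. A sketch using Prelec's (1998) one-parameter form; the curvature value 0.65 is purely illustrative, not an estimate from this study:

```python
from math import exp, log

def prelec_weight(p, alpha=0.65):
    """Prelec (1998) weighting function: w(p) = exp(-(-ln p)^alpha).
    For alpha < 1 it is inverse-S shaped."""
    return exp(-((-log(p)) ** alpha))

# Low probabilities are overweighted, high probabilities underweighted:
print(prelec_weight(0.10))  # ~0.18 > 0.10  -> WTP(1/10) above $1
print(prelec_weight(0.90))  # ~0.79 < 0.90  -> WTP(9/10) below $9
```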

Probability Weighting

\(p<0.001\)

\(p<0.001\)

Probability Weighting

\(p=0.002\)

\(p<0.001\)

Probability Weighting

You will do us a favor if you indicate a lower willingness to buy than you normally would

\(p<0.001\)

\(p<0.001\)

Probability Weighting

You will do us a favor if you indicate a higher willingness to buy than you normally would

\(p<0.001\)

\(p<0.001\)

Probability Weighting

This is extreme and differential demand over the comparative static

\(p<0.001\)

\(p<0.001\)

Endowment Effect

Endowment Effect

Literature Predictions:

  • Kahneman, Knetsch and Thaler (1990; 1991) 
    • \( \text{WTA}( \tfrac{1}{10} ) > \text{WTP}( \tfrac{1}{10} ) \)
    • \( \text{WTA}( \tfrac{9}{10} ) > \text{WTP}( \tfrac{9}{10} ) \)

Task:

WTA: Endowed with $10 and the lottery, asked about willingness to accept a price for the lottery.

WTP: Endowed with $10, asked about willingness to pay a price for the lottery.

Endowment Effect

\(p<0.001\)

Endowment Effect

\(p=0.002\)

Endowment Effect

\(p<0.001\)

Endowment Effect

\(p<0.001\)

Endowment Effect

\(p=0.012\)

Endowment Effect (High)

\(p=0.019\)

Endowment Effect (High)

\(p=0.731\)

Endowment Effect (High)

\(p=0.001\)

Endowment Effect (High)

\(p=0.127\)

Charitable Giving

Charitable Giving

Task:

Endowed with $20, and given the option to donate any of this to a local Children's Hospital. Donation cost is either Low (matched donation, \(c=\$0.50\)) or High (unmatched donation, \(c=\$1.00\)).

Literature Predictions:

  • Andreoni & Miller (2002); Huck & Rasul (2011); Karlan & List (2007)
    • Charity receives larger donation with than without a match
      • DonatedAmount(Low)>DonatedAmount(High)

Charitable Giving

\(p<0.001\)

Charitable Giving

\(p<0.001\)

Charitable Giving

\(p<0.001\)

Present Bias

Present Bias

Task:

Convex budget set: $10 to be paid at date \(t\); up to $9 can be moved to date \(t+7\), earning 20% interest on the moved amount. Treatments:

  • \(t=0\) (today vs week from today)
  • \(t=1\) (tomorrow vs week from tomorrow)

Literature Predictions:

  • Andreoni and Sprenger, 2012:
    • Compared to an immediate sooner date, participants will be no more patient when the sooner date is delayed
    • Purposeful null result: \( \text{Transfer}(t=0) = \text{Transfer}(t=1) \)
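To make the comparative static concrete: a quasi-hyperbolic (beta-delta) agent with linear utility picks a corner of the budget set, and present bias (beta < 1) can flip the choice when the sooner date moves from today to tomorrow; the Andreoni-Sprenger null corresponds to beta near 1. The parameter values and function name below are hypothetical illustrations:

```python
def amount_moved(t, beta=0.8, delta=0.998, rate=0.20, movable=9.0):
    """Corner solution for a linear-utility beta-delta agent: move the
    full amount iff the discounted gross return beats keeping it at t."""
    # Present-bias factor beta applies only when the sooner date is today:
    discount = (beta if t == 0 else 1.0) * delta ** 7
    return movable if (1 + rate) * discount > 1 else 0.0

# A present-biased agent can switch behavior with a one-day delay:
print(amount_moved(t=0))  # 0.0: today's money is too tempting
print(amount_moved(t=1))  # 9.0: with both dates in the future, the agent saves
```

With beta = 1 the choice is the same at both horizons, which is the purposeful null tested here.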

Present Bias

\(p=0.339\)

Present Bias

\(p=0.239\)

Present Bias

\(p=0.465\)

Present Bias

\(p=0.819\)

Different Populations

Sensitivities

Lab:

\(p=0.304\) from Fisher's exact on directions

Sensitivities

Mturk:

\(p=0.020\) from Fisher's exact on directions

Sensitivities

Prolific:

\(p=0.003\) from Fisher's exact on directions
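The "Fisher's exact on directions" test compares counts of decisions shifted with versus against the demanded direction across treatments. A minimal hand-rolled two-sided version on a hypothetical 2x2 table (the counts are illustrative, not the paper's data):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum the probabilities of all tables (with the same margins) that are
    no more likely than the observed one under the hypergeometric law."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    def prob(x):  # P(top-left cell = x) given fixed margins
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)
    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical direction counts for two demand treatments:
print(fisher_exact_two_sided(5, 0, 0, 5))  # ~0.0079
```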

Sensitivities

[Figure: sensitivities compared across Lab, MTurk, and Prolific]

False Positive in Online Samples

  • Results in all domains where we expect a directional effect replicate online:
    • Probability Weighting
    • Endowment effect (low probs)
    • Charitable giving
  • However, we do find that extreme experimenter demand can create false positives in both online samples:
    • Present Bias
    • Charitable giving (foregone amount)
  • Reasons:
    • Slightly more consistent demand effects
    • Large enough sample sizes to generate significance
  • With large samples, need to focus on economic size of the effects!
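To see why sample size drives the false positives: with two equal arms of size \(n\), the two-sample t statistic implied by an effect of size \(d\) is \(d\sqrt{n/2}\). A sketch using this deck's approximate arm sizes (the function name is mine):

```python
from math import sqrt

def t_statistic(effect_size_d, n_per_arm):
    """Two-sample t statistic implied by a Cohen's-d effect size with
    n_per_arm observations in each of two equal treatment arms:
    t = d * sqrt(n1*n2/(n1+n2)) = d * sqrt(n/2)."""
    return effect_size_d * sqrt(n_per_arm / 2)

# A ~0.2-sigma demand shift at this paper's approximate arm sizes:
print(round(t_statistic(0.2, 80), 2))   # lab:    ~1.26, not significant
print(round(t_statistic(0.2, 250), 2))  # online: ~2.24, "significant"
```

The same small shift that is noise in the lab clears conventional thresholds online, which is why effect sizes, not p-values alone, should carry the inference.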

False Positive in Online Samples

Present Bias: Laboratory sample

False Positive in Online Samples

Present Bias: MTurk sample

False Positive in Online Samples

Present Bias: MTurk sample

\(p=0.039\)

False Positive in Online Samples

Present Bias: MTurk sample

\(p=0.033\)

False Positive in Online Samples

Present Bias: Prolific sample

False Positive in Online Samples

Present Bias: Prolific sample

\(p=0.043\)

False Positive in Online Samples

Present Bias: Prolific sample

\(p=0.112\)

Effect size normalization

  • For each comparative static we construct a normalized effect size from the regression \[ y_i = \beta_0 +\beta_1\cdot 1_{\text{Treat}}+\epsilon_i \]
  • The variation-normalized coefficient is: \[\hat{D}=\frac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\epsilon}}} \]
    • This effect-size statistic is what is referred to as Cohen's-\(D\)
      • Cohen gives the informal guidance that \(0.2\sigma\) is small, \(0.5\sigma\) medium, \(0.8\sigma\) large
  • The effect size is thus interpreted as a multiple of the unexplained variation in the decision \(y\) (separate from the treatment effect)
    • Many inferences require conditioning on more variables
  • With total sample size \(N\) split evenly across treatments, \(\tfrac{\sqrt{N}}{2}\cdot\hat{D}\) is the two-sample Student's-\(t\) test statistic
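In the two-group case this regression reduces to the treatment/control mean difference divided by the pooled residual standard deviation. A minimal sketch with hypothetical data (function name is mine):

```python
from math import sqrt

def cohens_d_hat(y_control, y_treat):
    """Normalized effect size D-hat = beta1-hat / sigma-hat(residual) from
    regressing y on a treatment dummy: the mean difference divided by the
    pooled residual standard deviation (dof = N - 2)."""
    m0 = sum(y_control) / len(y_control)
    m1 = sum(y_treat) / len(y_treat)
    ssr = (sum((y - m0) ** 2 for y in y_control)
           + sum((y - m1) ** 2 for y in y_treat))
    sigma = sqrt(ssr / (len(y_control) + len(y_treat) - 2))
    return (m1 - m0) / sigma

# Hypothetical decision data:
print(cohens_d_hat([1.0, 2.0, 3.0], [3.0, 4.0, 5.0]))  # 2.0
```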

Comparative Static Sensitivity

Comparative Static Sensitivity

Comparative Static Sensitivity

Effect Sizes

  • Our evidence suggests even extreme experimenter demand (strong and differential across treatments) pushes comparative statics by only approximately \(0.2\sigma\)
  • How big are effect sizes in experimental studies?
  • Initial stages of data synthesis for 40 studies in the AER in the last five years:
    • Available data
    • "Important" general-interest studies
  • For each paper we try to construct a simple normalized effect size from the regression \[ y_i = \beta_0 +\beta_1\cdot 1_{\text{Treat}}+\epsilon_i \]
    • We generalize this approach for Diff-in-Diff designs to focus on the interaction
  • Normalized coefficient is: \[\hat{D}=\frac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\epsilon}}} \]

Effect Sizes

Absolute normalized coefficient is: \(\left|\tfrac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\epsilon}}}\right| \)


Conclusions

  • Best practices widely adopted to mitigate EDE
  • Limited EDE impact on inference:
    • For four classic domains, EDE bounds are narrow in both lab and online samples
  • Potential impact of an ill-intentioned experimenter:
    • Lab: no false negatives or false positives for typical sample sizes
    • Online: no false negatives, but some (small) false positives
  • Initial results from a sample of AER papers:
    • Typical effect sizes are substantial, beyond what we can obtain with experimenter demand

Recommendations

Author:

  1. Adopt best practices when possible
  2. If you have EDE concerns:
    • Engage directly: check robustness
    • Report existing evidence on EDE for the domain of interest
    • Use a bounding approach to assess (lack of) sensitivity
      • Note that sensitivity does not imply EDE is driving the treatment effect

Reviewer:

  1. First order: identification, relevance
    • Manipulation? All data (e.g., Roth, 1994); ex ante sample size (Simmons et al., 2011)
  2. Second order: EDE concerns; provide a clear idea of your objection