The Effect of Experimenter Demand
on Inference
Brisbane, April 2025

David
Danz
Pittsburgh

Guillermo
Lezama
Pittsburgh

Pun
Winichakul
Smith College

Priyoma
Mustafi
Pittsburgh

Marissa
Lepper
Texas A&M

Lise
Vesterlund
Pittsburgh

Alistair
Wilson
Pittsburgh
Experimenter Demand
[The participant’s] general attitude of mind is that of ready complacency and cheerful willingness to assist the investigator in every possible way by reporting to him those very things which he is most eager to find.
- A. H. Pierce, 1908
The subject’s performance in an experiment might almost be conceptualized as problem-solving behavior... he sees it as his task to ascertain the true purpose of the experiment and respond in a manner which will support the hypotheses being tested.
- M. T. Orne, 1962
Experimenter Demand
“The critical assumption underlying the interpretation of data from lab experiments is that the insights gained can be extrapolated to the world beyond.”
- S. Levitt and J. List, 2007
“Many reasons to suspect that these laboratory findings might fail to generalize to real markets ... field experiments avoid many of the important obstacles to generalizability faced by lab experiments.”
- S. Levitt and J. List, 2008
Outline
1. EDE Mitigation
2. EDE Measurement

Jonathan
de Quidt
Queen Mary

Lise
Vesterlund
Pittsburgh

Alistair
Wilson
Pittsburgh
Experimenter Demand
2019 (ed. Schram & Ule)
2024 (ed. Rees-Jones)
1. EDE Mitigation
2. EDE Measurement
- Experimenter Demand Effects, de Quidt, Vesterlund, and Wilson (2019, 2024)
- Best practice: mitigate hypothesis speculation and reduce responsiveness to EDE
- Documentation (instructions, screenshots, survey, script, etc.)
- Conceal the hypothesis
- Abstract frame
- Hiding independent variable
- between-subject design
- within-subject designs w/ “progressive revelation”
- Reduce responsiveness to EDE
- Incentives
- Anonymous decisions
- Participants
- Limit experimenter-participant interaction
1. EDE Mitigation
2. EDE Measurement
Design:

| | Top Five | Exp. Econ |
|---|---|---|
| Between-subject | 59% | 89% |
| Abstract Frame | 89% | 96% |
| Blind | 83% | 94% |
| Incentivized | 91% | 99% |
| All the above | 46% | 84% |
1. EDE Mitigation
Population:

| | Top Five | Exp. Econ |
|---|---|---|
| Classroom | 5% | 5% |
| Lab | 68% | 84% |
| Lab-in-field | 17% | 7% |
| Online | 12% | 2% |
Impact of EDE
2. EDE Measurement
- Best practice – “ward off critiques that a result might be driven by experimenter demand”
- Assessing EDE impact on decision estimates
- Incentivized actors induced to make high or low contributions (Bischoff and Frank, 2011)
- Design-by-correlation
- Dhar et al. (2018): Crowne-Marlowe social desirability scale to assess whether treatment-driven changes in attitudes toward gender equality result from experimenter demand
- Allcott and Taubinsky (2015): Snyder (1974) “Self-Monitoring Scale” to measure subjects’ responsiveness to experimenter demand
- Tsutsui and Zizzo (2013): demand-susceptibility measure based on a dominated lottery choice w/ the statement “it would be nice if you were to choose” and a smiley face
- Bound EDE (de Quidt et al., 2017): intentional positive and negative demand
- strong demand: “do us a favor if...”
- weak demand: “we expect that...”
Impact of EDE on Inference?
- The objective of much of experimental research is qualitative inference (Kessler & Vesterlund, 2015):
- Causal effect of \(X\) on \(Y\)
- Directional and economically meaningful (and statistically significant)
- Can EDE alter inference?
- What is the impact of an ill-intentioned experimenter who differentially applies positive and negative demand across a decision pair?
- False negatives, where the true effect is positive
- False positives, where the true effect is null

What do we do?
- Use a “worst case scenario” to assess false negatives and false positives
- Differentially apply strong positive and negative demand across a decision pair (de Quidt, Haushofer and Roth, AER 2018)
You will do us a favor if you take a higher (lower) action than you normally would.
- Four domains:
- Probability weighting (Risk)
- The endowment effect (Ownership)
- Charitable giving (Self vs. Other)
- Intertemporal choice (Now vs. Later)
Design
- Eight within-subject decisions:
- Four lottery valuations:
- WTP and WTA
- Lotteries with 10% and 90% chance of winning $10
- Two donations:
- Matched (low cost)
- Unmatched (high cost)
- Two intertemporal allocations:
- Immediate (today vs a week from now)
- Delayed (tomorrow vs week from tomorrow)
Design
- Eight within-subject decisions:
- Four lottery valuations
- Two donations
- Two intertemporal allocations
- Three between-subject treatments:
- No demand
- Positive demand
- Negative demand
- Three Populations:
- Laboratory (Pitt undergrads)
- Mechanical Turk
- Prolific
Online Populations
[Figure: papers on Google Scholar, 2013–2023]
Design
- Eight within-subject decisions:
- Four lottery valuations
- Two donations
- Two intertemporal allocations
- Three between-subject treatments:
- No demand
- Positive demand
- Negative demand
- Three Populations:
- Laboratory (Pitt undergrads, N=236, ~80/treatment)
- Mechanical Turk (N=756, ~250/treatment)
- Prolific (N=732, ~240/treatment)
Probability Weighting
Probability Weighting
Literature Predictions:
- Kahneman & Tversky 1979; Prelec 1998
- Risk seeking at low probabilities: \(\text{WTP}(\tfrac{1}{10})>\$1\)
- Risk averse at high probabilities: \(\text{WTP}(\tfrac{9}{10})<\$9\)
Task:
Endowed with $10, and asked about willingness to pay for the lottery:
\( p\cdot\$10\oplus(1-p)\cdot \$0\)
with two probabilities of winning \(p\in\left\{\tfrac{1}{10},\tfrac{9}{10}\right\}\)
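As a quick sanity check on these predictions, the risk-neutral benchmarks can be computed directly (a sketch; the prize and probabilities come from the slide, the code does not):

```python
# Risk-neutral expected value of the lottery p * $10 + (1 - p) * $0.
# Probability weighting predicts WTP above this benchmark at p = 1/10
# and below it at p = 9/10.
def expected_value(p, prize=10.0):
    return p * prize

ev_low = expected_value(1 / 10)   # benchmark for the 10% lottery: $1
ev_high = expected_value(9 / 10)  # benchmark for the 90% lottery: $9
```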
Probability Weighting
\(p<0.001\)
\(p<0.001\)
Probability Weighting
\(p=0.002\)
\(p<0.001\)
Probability Weighting
You will do us a favor if you indicate a lower willingness to buy than you normally would
\(p<0.001\)
\(p<0.001\)
Probability Weighting
You will do us a favor if you indicate a higher willingness to buy than you normally would
\(p<0.001\)
\(p<0.001\)
Probability Weighting
This is extreme and differential demand over the comparative static
\(p<0.001\)
\(p<0.001\)
Endowment Effect
Endowment Effect
Literature Predictions:
- Kahneman, Knetsch and Thaler (1990; 1991)
- \( \text{WTA}( \tfrac{1}{10} ) > \text{WTP}( \tfrac{1}{10} ) \)
- \( \text{WTA}( \tfrac{9}{10} ) > \text{WTP}( \tfrac{9}{10} ) \)
Task:
WTA: endowed with $10 and the lottery, asked about willingness to accept a price for the lottery.
WTP: endowed with $10 and asked about willingness to pay a price for the lottery.
Endowment Effect
\(p<0.001\)
Endowment Effect
\(p=0.002\)
Endowment Effect
\(p<0.001\)
Endowment Effect
\(p<0.001\)
Endowment Effect
\(p=0.012\)
Endowment Effect (High)
\(p=0.019\)
Endowment Effect (High)
\(p=0.731\)
Endowment Effect (High)
\(p=0.001\)
Endowment Effect (High)
\(p=0.127\)
Charitable Giving
Charitable Giving
Task:
Endowed with $20, and given the option to donate any of this to a local Children's Hospital. Donation cost is either Low (matched donation, \(c=\$0.50\)) or High (unmatched donation, \(c=\$1.00\)).
Literature Predictions:
- Andreoni & Miller (2002); Huck & Rasul (2011); Karlan & List (2007)
- The charity receives a larger donation with a match than without
- DonatedAmount(Low) > DonatedAmount(High)
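One way to read the cost parameter (an assumption on our part; the slide does not spell out the mechanics) is as an out-of-pocket cost per dollar the charity receives, in which case the budget arithmetic is:

```python
# Donation arithmetic under an ASSUMED interpretation of c: the participant
# pays c per $1 the charity receives, out of a $20 endowment.
def amount_kept(charity_receives, cost_per_dollar, endowment=20.0):
    spent = charity_receives * cost_per_dollar
    assert 0 <= spent <= endowment
    return endowment - spent

keep_low = amount_kept(10.0, 0.50)   # matched: $10 to the charity costs $5
keep_high = amount_kept(10.0, 1.00)  # unmatched: $10 to the charity costs $10
```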
Charitable Giving
\(p<0.001\)
Charitable Giving
\(p<0.001\)
Charitable Giving
\(p<0.001\)
Present Bias
Present Bias
Task:
Convex budget set: $10 to be paid at date \(t\); up to $9 can be moved to date \(t+7\), earning 20% interest on the moved amount. Treatments vary the sooner date:
- \(t=0\) (today vs week from today)
- \(t=1\) (tomorrow vs week from tomorrow)
Literature Predictions:
- Andreoni and Sprenger, 2012:
- Compared to an immediate sooner date, participants will be no more patient when the sooner date is delayed
- Purposeful null result: \( \text{Transfer}(t=0) = \text{Transfer}(t=1) \)
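The budget set's payoff arithmetic, under the parameters on the slide (a sketch, not the study's code):

```python
# Moving `moved` dollars from date t to t+7 earns 20% interest on that amount.
def payments(moved, endowment=10.0, max_move=9.0, interest=0.20):
    assert 0 <= moved <= max_move
    sooner = endowment - moved      # paid at date t
    later = moved * (1 + interest)  # paid at date t + 7
    return sooner, later

corner = payments(9.0)  # shifting the full $9: $1 at t, $10.80 at t+7
```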
Present Bias
\(p=0.339\)
Present Bias
\(p=0.239\)
Present Bias
\(p=0.465\)
Present Bias
\(p=0.819\)
Different Populations
Sensitivities
Lab:
\(p=0.304\) from Fisher's exact on directions
Sensitivities
Mturk:
\(p=0.020\) from Fisher's exact on directions
Sensitivities
Prolific:
\(p=0.003\) from Fisher's exact on directions
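The direction tests above can be reproduced with a two-sided Fisher's exact test; below is a stdlib-only sketch with made-up counts (not the paper's data):

```python
# Two-sided Fisher's exact test for a 2x2 table, summing hypergeometric
# probabilities no larger than the observed one (the usual two-sided rule).
from math import comb

def fisher_exact_2x2(a, b, c, d):
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # P(top-left cell = x) under the hypergeometric null
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows are demand treatments, columns are response directions.
p = fisher_exact_2x2(8, 2, 3, 7)
```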
Sensitivities
Lab
MTurk
Prolific
False Positive in Online Samples
- All of the domains where we expect a directional result replicate online:
- Probability weighting
- Endowment effect (low probabilities)
- Charitable giving
- However, we do find that extreme experimenter demand can create false positives in both online samples:
- Present bias
- Charitable giving (foregone amount)
- Reasons:
- Slightly more consistent demand effects
- Sample sizes large enough to generate significance
- With large samples, focus on the economic size of the effects!
False Positive in Online Samples
Present Bias: Laboratory sample
False Positive in Online Samples
Present Bias: MTurk sample
False Positive in Online Samples
Present Bias: MTurk sample
\(p=0.039\)
False Positive in Online Samples
Present Bias: MTurk sample
\(p=0.033\)
False Positive in Online Samples
Present Bias: Prolific sample
False Positive in Online Samples
Present Bias: Prolific sample
\(p=0.043\)
False Positive in Online Samples
Present Bias: Prolific sample
\(p=0.112\)
Effect size normalization
- For each comparative static we construct a normalized effect size from \[ y_i = \hat{\beta}_0 +\hat{\beta}_1\cdot 1_{\text{Treat}}+\hat{\epsilon}_i \]
- The variation-normalized coefficient is \[\hat{D}=\frac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\epsilon}}} \]
- This effect-size statistic is what is referred to as Cohen's \(D\)
- Cohen gives the informal guidance that \(0.2\sigma\) is small, \(0.5\sigma\) medium, and \(0.8\sigma\) large
- The effect size is thus interpreted as a multiple of the unexplained variation in the decision \(y\) (separate from the treatment effect)
- Many inferences require conditioning on more variables
- With total sample size \(N\) and equal-sized treatment groups, \(\tfrac{\sqrt{N}}{2}\cdot\hat{D}\) is the two-sample Student's \(t\) test statistic
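A minimal sketch of this normalization, with made-up data and assuming equal group sizes (not the authors' estimation code):

```python
# Cohen's D analogue from a treatment-dummy regression: beta1 over the
# residual standard deviation. Data below are hypothetical.
import math

control = [2.0, 3.0, 4.0, 3.5, 2.5]  # y when 1_Treat = 0
treated = [3.0, 4.0, 5.0, 4.5, 3.5]  # y when 1_Treat = 1

beta0 = sum(control) / len(control)            # OLS intercept: control mean
beta1 = sum(treated) / len(treated) - beta0    # OLS coefficient on the dummy
ssr = sum((y - beta0) ** 2 for y in control) + \
      sum((y - beta0 - beta1) ** 2 for y in treated)
n = len(control) + len(treated)
sigma = math.sqrt(ssr / (n - 2))               # residual standard deviation
D = beta1 / sigma

# With equal-sized groups, (sqrt(N) / 2) * D is the two-sample t statistic.
t = math.sqrt(n) / 2 * D
```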
Comparative Static Sensitivity
Effect Sizes
- Our evidence suggests even extreme experimenter demand (strong and differential across treatments) can push comparative statics by approximately \(0.2\sigma\)
- How big are effect sizes in experimental studies?
- Initial stages of data synthesis for 40 studies in the AER over the last five years:
- Available data
- “Important” general-interest studies
- For each paper we construct a simple normalized effect size from \[ y_i = \hat{\beta}_0 +\hat{\beta}_1\cdot 1_{\text{Treat}}+\hat{\epsilon}_i \]
- We generalize this approach for diff-in-diff designs to focus on the interaction
- The normalized coefficient is \[\hat{D}=\frac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\epsilon}}} \]
Effect Sizes
Absolute normalized coefficient is: \(\left|\tfrac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\epsilon}}}\right| \)
Conclusions
- Best practices widely adopted to mitigate EDE
- Limited EDE impact on inference:
- For four classic domains, EDE bounds are narrow in both lab and online samples
- Potential impact of an ill-intentioned experimenter:
- Lab: no false negatives or false positives for typical sample sizes
- Online: no false negatives, but (small) false positives
- Initial results from a sample of AER papers:
- Typical effect sizes are substantial, beyond what we can obtain with experimenter demand
Recommendations
Author:
- Adopt best practices when possible
- If you have EDE concerns:
- Engage directly: check robustness
- Report on existing evidence on EDE for the domain of interest
- Use a bounding approach to assess (lack of) sensitivity
- Note that sensitivity does not imply EDE is driving the treatment effect
Reviewer:
- First order: identification, relevance
- Manipulation? Report all data (e.g., Roth, 1994); set sample size ex ante (Simmons et al., 2011)
- Second order: EDE concerns; provide a clear idea of your objection