Preferences over Experimental Populations

  • Neeraja Gupta

  • Luca Rigotti

  • Alistair Wilson

Basic Idea

Suppose you have a limited budget and are trying to uncover a qualitative comparative-static effect.

 

Different populations have different costs per observation, and different degrees of noise.

 

You can formulate your preference over populations through the power of the resulting test.

Attenuation

Suppose that two treatments \(A\) and \(B\) exhibit a difference \(\Delta_{AB}\) in the level of a binary choice, but that data from the population is noisy (or, equivalently, that the quantitative effect size is simply smaller in this population).

 

If the effect size in a population with no error/attenuation is \(\Delta_{AB}\), then the expected effect size in population \(i\), where the true effect is attenuated by \(\gamma_i\in\left[0,1\right]\), is given by:

\[(1-\gamma_i)\cdot \Delta_{AB}.\]
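As an illustration (the numbers here are hypothetical): with \(\Delta_{AB}=0.20\) and \(\gamma_i=0.4\), the expected observed difference falls to \((1-0.4)\cdot 0.20=0.12\).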

 

Power

Power in a traditional experimental test stems from both the true effect size and the sample size \(N\).
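One standard way to operationalize this (a normal-approximation sketch for a two-sided, two-sample proportion test; not necessarily the exact specification used in the paper):

\[\pi_i(N)\;\approx\;\Phi\!\left(\frac{(1-\gamma_i)\,\Delta_{AB}}{\sqrt{2\,\bar{p}(1-\bar{p})/N}}\;-\;z_{1-\alpha/2}\right),\]

where \(N\) is the per-treatment sample size, \(\bar{p}\) the pooled choice rate, \(\Phi\) the standard normal CDF, and \(z_{1-\alpha/2}\) the two-sided critical value.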

 

While it might be nice to have a huge sample, financial constraints limit us to smaller ones. Our focus is on two dual problems:

  • A fixed budget, where we maximize power
  • A fixed power level, where we minimize the budget

If you want to maximize the chance of detecting a significant effect, there are tradeoffs between (see the sketch after this list):

  • The cost per observation (lower cost means a bigger \(N\))
  • The noise/attenuation on the platform
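To make the fixed-budget problem concrete, here is a minimal sketch using the normal-approximation power formula above; the platform names, costs, attenuation values, and the true effect are all hypothetical, not the study's estimates:

from math import floor, sqrt
from statistics import NormalDist

def power_two_prop(delta, n, p_bar=0.5, alpha=0.05):
    """Approximate power of a two-sided two-sample proportion test
    (normal approximation), with n observations per treatment."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(2 * p_bar * (1 - p_bar) / n)
    return NormalDist().cdf(delta / se - z_crit)

BUDGET = 1600.0      # total budget per platform (the talk mentions $1600)
TRUE_EFFECT = 0.20   # hypothetical unattenuated effect Delta_AB

# Hypothetical (cost per observation, attenuation gamma_i) pairs
populations = {"lab": (12.0, 0.0), "platform_A": (3.0, 0.7), "platform_B": (3.5, 0.3)}

for name, (cost, gamma) in populations.items():
    n = floor(BUDGET / (2 * cost))         # per-treatment sample the budget buys
    observed = (1 - gamma) * TRUE_EFFECT   # attenuated effect size
    print(f"{name}: N={n}, power={power_two_prop(observed, n):.3f}")

With these illustrative numbers, the cheap-but-very-noisy platform ends up below the lab, while the cheap-and-moderately-noisy one comes out on top: both margins matter.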

Dual Power Problem

[Figures: iso-power and iso-budget curves]
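The dual problem inverts this: fix a target power and find the cheapest population that achieves it. A minimal sketch under the same normal approximation (the 80% target and pooled rate of 0.5 are illustrative assumptions):

from math import ceil
from statistics import NormalDist

def min_budget(delta, cost, gamma, target=0.80, p_bar=0.5, alpha=0.05):
    """Smallest budget reaching the target power on a platform with the
    given cost per observation and attenuation (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(target)
    observed = (1 - gamma) * delta
    n = ceil(2 * p_bar * (1 - p_bar) * ((z_alpha + z_beta) / observed) ** 2)
    return 2 * n * cost  # two treatment cells

# e.g. min_budget(0.20, cost=12.0, gamma=0.0) for a lab-like population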

Our Experiment

We ask participants to make choices in four simple games (without feedback), where they are matched to another participant.

 

On each platform we had a budget of $1600, using the standard incentives for that platform (scaling the game incentives with the probability of payment rather than the amounts).

 

  • Two of the games satisfied a stronger form of dominance (a dominant strategy for any agent with pro-social preferences over efficiency)
  • Two of the games were Prisoner's Dilemmas, but with different tensions

Our Games

Payoffs shown as (row player, column player).

PD 1:
            C          D
  C     (21,21)    (2,28)
  D     (28,2)     (8,8)

PD 2:
            C          D
  C     (19,19)    (8,22)
  D     (22,8)     (9,9)

Dom 1:
            C          D
  C     (17,17)    (12,16)
  D     (16,12)    (10,10)

Dom 2:
            C          D
  C     (15,15)    (16,10)
  D     (10,16)    (11,11)

  • The PD games differ in tension (PD 1 has more temptation)
  • In both Dom games, \(C\) is individually dominant and socially efficient

Treatments

Our environment varies:

  • The four strategic games presented to the participants (randomized order at the participant level)
  • A presentation effect: \(C\) action listed first vs. \(D\) action listed first
  • The population/mode from which the sample was drawn:
    • Standard Physical Lab sample
    • Virtual Lab sample
    • MTurk
    • Prolific
    • CloudResearch (Approved List)

Assessment

Setting the fixed and variable payments to match typical levels (and minimum payments) for each population, we recruited a sample on each platform, spending approximately $1650.

 

Using data from Charness et al. (GEB 2016), we formulate an expected cooperation-rate difference between the two PD games (expecting more cooperation with smaller temptation).

We examine two key metrics across populations:

  • The proportion of participants who choose dominated actions
    • Required for control via incentives (cf. Smith's precepts)
  • Inference over the cooperation-rate difference in the PD games
    • A standard behavioral comparative static (an MDE sketch follows this list)
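One way to compare the platforms' realized samples on the second metric is the minimum detectable effect (MDE) at a given power. A minimal sketch, using the same normal approximation as above (the 80% power target and pooled rate of 0.5 are illustrative assumptions, not values from the study):

from math import sqrt
from statistics import NormalDist

def mde(n, target=0.80, p_bar=0.5, alpha=0.05):
    """Smallest cooperation-rate difference detectable with n observations
    per treatment at the target power (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(target)
    return (z_alpha + z_beta) * sqrt(2 * p_bar * (1 - p_bar) / n)

# e.g. mde(100) is roughly 0.198: with 100 participants per treatment,
# only cooperation differences of about 20 points are reliably detectable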

Dominated Response (Noise)

  • M-Turk exhibits substantial noise, with sensitivity to the first listed option
  • Other platforms have much lower fractions of dominated choices

[Figure: dominated-choice rates by population; arrows show the response to action order]

PD Cooperation Static

  • Both M-Turk and Prolific show essentially zero effect on the PD-game choices
  • There are significant effects in the other populations

[Figure: PD cooperation rates by population; arrows show the change from PD 2 to PD 1]

Population Power

  • If dominated choices were the only problem, the Prolific and CloudResearch populations would be far superior
  • They have higher power because of much lower observation costs

But there is another effect

[Figure: population power under pure noise effects only]

Observed PD-Static 

  • Prolific's small response elasticity to the PD tension makes it actually worse than the lab for detecting this effect
  • In contrast, even though the CloudResearch effect is smaller than the lab's, cheaper observation costs more than compensate

[Figure: population power under total attenuation]

Conclusions

  • Unfiltered M-Turk is unfit for purpose: noise in responses leads to a substantial reduction in power
  • Both Prolific and CloudResearch offer curated populations with low rates of dominated response, only slightly higher than in the lab samples (but much cheaper per observation)
  • With respect to our study-specific behavioral comparative static (PD-game cooperation):
    • Prolific shows almost zero elasticity of response (participants are overly cooperative, regardless of tension)
    • The CloudResearch sample exhibits a similar effect size to the lab sample, but the cheaper observations lead to substantially increased power