Beyond the Dead Salmon:
p-values, the statistical crisis in science and what to do about it.
Patrick Beukema, PhD
CoAx lab, CMU
Center for Neuroscience, U. Pitt
Source: Bennett et al. 2009
Gold-standard analyses can produce nonsensical results
Experiment: fMRI analysis of emotional valence.
Analysis: Standard General Linear Model
Subject: North Atlantic salmon (deceased), n = 1
Study findings: significant BOLD activation related to emotional valence
Conclusion: There can be flaws in current
"best practice" analytical approaches
>50% of research findings are false
21% median statistical power in neuroscience
35% replication rate
$28 Billion estimated costs of irreproducible research
Statistical errors result in massive inefficiencies
Source: Ioannidis 2005, Button 2013, Freedman 2015, OSC 2015
Source: Brown 2013
Misunderstanding p-values can be deadly
“InterMune Announces Phase III Data Demonstrating Survival Benefit of Actimmune in IPF...Reduces Mortality by 70% in Patients with Mild to Moderate Disease.”
Experiment: Interferon gamma-1b for idiopathic pulmonary fibrosis
Analysis: 1. treatment vs. placebo (p-value = 0.08) 2. in subgroup (p-value = 0.004)
Dr. Harkonen
(former InterMune CEO)
Follow-up RCT with subgroup:
15 more people died on drug
Outline for today
 History of p-values
 p-value problems (PVPs)
 PVPs: False Findings & Low Power
 Utility of p-values
 What should we do?
 Discussion
p-values quantify the probability of a result at least as extreme as the one observed, assuming the null hypothesis is true
Source: Wikipedia
Question:
Where do we draw the line that distinguishes the likely from the unlikely?
Fisher proposed, somewhat arbitrarily, that p-values below 0.05 be considered significant
Source: Fisher 1926
"...If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty or one in a hundred. Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level."
QED
By the early 1960s, significance at the 0.05 level had become standard practice in biomedical research
Some problems with p-values are related to practice and can be avoided
Problem 1: misinterpreting what the test tells you
Problem 2: the use of sharp point null hypotheses
Source: Wagenmakers 2007
Straw man nulls assume (sometimes incorrectly)
that there is zero variance and zero systematic error
Unavoidable problems result from the assumptions of null hypothesis significance testing
Problem 3: p-values depend on data that weren't observed
Problem 4: p-values depend on the unknown intentions of the researcher
Source: Wagenmakers 2007
Think of it this way: hypothetical replications determine the sampling distribution
Problem 3: p-values depend on data that weren't observed
Source: Wagenmakers 2007
Suppose we have the following two sampling distributions, f(x) and g(x)
Observation: x = 5
Under f(x): p-value = 0.03 + 0.01 = 0.04 → reject null
Under g(x): p-value = 0.03 + 0.03 = 0.06 → fail to reject null
"What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred" (Jeffreys, 1961)
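A minimal sketch of the f(x) vs. g(x) example. The two probability tables are assumptions: illustrative values over x = 1..6, chosen only so that the tail sums match the numbers on the slide. Both distributions give the observed value x = 5 the same probability; they differ only at x = 6, which was never observed, yet the decision flips.

```python
# Hypothetical sampling distributions over outcomes x = 1..6.
# The values are assumptions picked so the tail sums match the slide.
f = {1: 0.04, 2: 0.30, 3: 0.31, 4: 0.31, 5: 0.03, 6: 0.01}
g = {1: 0.04, 2: 0.30, 3: 0.30, 4: 0.30, 5: 0.03, 6: 0.03}


def upper_tail_p(dist, observed):
    """P(X >= observed) under dist: a one-sided p-value."""
    return sum(prob for x, prob in dist.items() if x >= observed)


ALPHA = 0.05
observed = 5
p_f = upper_tail_p(f, observed)
p_g = upper_tail_p(g, observed)

for name, p in (("f", p_f), ("g", p_g)):
    verdict = "reject null" if p < ALPHA else "fail to reject null"
    print(f"under {name}(x): p-value = {p:.2f} -> {verdict}")
```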
Problem 4: p-values depend on a sampling plan, which in turn depends on the subjective intentions of the researcher
Source: Wagenmakers 2007
Example: Alex tests if emotionally primed words are categorized more quickly than neutrally primed words in n=20 subjects.
Result: p-value = 0.045, Conclusion: reject null
Question for Alex:
“What would you have done if the effect had not been significant after 20 subjects?”
Alex's response:
A. "I don't know" → p-value undefined
B. "Depends on the editor's response" → p-value depends on the action letter
C. "I would not have tested more subjects" → p-value unaffected
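A rough simulation of why Alex's answer matters. Everything here is an assumption for illustration: a two-sided z-test with known unit variance, a true null, and a hypothetical plan of adding 10 subjects at a time up to n = 50 whenever the result is not yet significant. The fixed plan (answer C) uses the first look only, on the same data.

```python
import random
from statistics import NormalDist


def two_sided_p(sample):
    """Two-sided z-test p-value for mean 0, assuming known unit variance."""
    n = len(sample)
    z = sum(sample) / n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=5000, n_start=20, n_step=10, n_max=50, alpha=0.05, seed=0):
    """Return false positive rates for the fixed plan and the test-more plan."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        data = [rng.gauss(0, 1) for _ in range(n_start)]  # the null is true
        significant = two_sided_p(data) < alpha
        fixed_hits += significant  # answer C: stop at n_start regardless
        # Hypothetical alternative plan: keep adding subjects until significant.
        while not significant and len(data) < n_max:
            data.extend(rng.gauss(0, 1) for _ in range(n_step))
            significant = two_sided_p(data) < alpha
        peeking_hits += significant
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate()
print(f"false positive rate, fixed n = 20:          {fixed_rate:.3f}")
print(f"false positive rate, test more if not sig.: {peeking_rate:.3f}")
```

The same first-look p-value is therefore not comparable across the two plans: under the test-more plan, the nominal 0.05 no longer describes the long-run error rate.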
Problem 5: p-values do not quantify statistical evidence (the p-value postulate)
Source: Wagenmakers 2007
Consider the following two experiments
Experiment 1: p-value = 0.032, n = 11
Experiment 2: p-value = 0.032, n = 98
Which of these experiments provides more evidence against the null?
Researchers disagree widely on the answer, which violates the p-postulate (that identical p-values provide identical evidence, regardless of n)
There are many sources of flexibility in research design that lead to known errors
Source of flexibility:
 adjusting bin size, frequencies, kernels, 1- vs. 2-tailed tests, etc. (p-hacking)
 subgroup analysis after the main effect is not significant (dredging)
 writing a hypothesis after the analysis (HARKing)
These errors are extremely common
In a sample of 2,000 researchers:
50% had selectively reported only studies that 'worked'
58% had peeked at the results
43% threw out data after checking its impact on the p-value
35% reported unexpected findings as if predicted from the start
Source: Loewenstein 2012
Results may be invalid even without explicit p-hacking
Source: Gelman and Loken 2013
1. Simple classical test based on a unique test statistic
2. Classical test pre-chosen from a set of possible tests
3. Researcher degrees of freedom without fishing: a single test based on the data, but a different test would have been performed given different data (lesser-known problem)
4. "Fishing": performing J tests and then reporting the best result (well-known problem)
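For the well-known "fishing" case, the inflation is easy to quantify under one simplifying assumption: the J tests are independent and every null is true. A minimal sketch (the function name is illustrative):

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


for j in (1, 5, 20):
    rate = familywise_error_rate(0.05, j)
    print(f"J = {j:2d}: P(at least one p < 0.05 | all nulls true) = {rate:.3f}")
```

With J = 20 the chance of finding at least one "significant" result is roughly 64%. The forking-paths case is harder to quantify because the tests that were never run are not enumerated.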
The lesser-known problem is described as the "garden of forking paths", which can lead to statistically invalid conclusions
Source: Bishop, Dorothy V M (2016): The Garden of Forking Paths.
Degrees of freedom in design choice can lead to many different outcomes
Source: PsyArXiv Crowdsourcing Analytics
Case study: Are referees more likely to give red cards to dark-skin-toned professional soccer players?
Subjective belief across time
Significance tests and p-values do not indicate whether a result is true.
What proportion of studies can we trust, say at a given significance threshold of 0.05?
19/20? maybe less?

The truth value of a result is much more nuanced. In addition to the significance level, it depends on:
 Bias
 Pre-study odds
 Statistical power
Source: N.D.G. on Science
Pre-study odds of the hypothesis being correct dramatically impact conclusions from p-values
Identical p-values, opposite conclusions
Source: Nuzzo 2014
Power in neuroscience is low
Median statistical power is 21%
Source: Button et al. 2013
The distribution of power across 49 meta-analyses
It is challenging to collect high-quality neural data, so perhaps it is not surprising that power is low
“when effect size is tiny and measurement error is huge, you’re essentially trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.”
 Andrew Gelman
Source: Gelman 2015, artwork by Viktor Beekman
Still a problem
We can quantify the probability that a result is true
Source: Ioannidis 2005
It depends on:
1) power
2) pre-study odds
3) significance level
4) bias
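Positive Predictive Value (PPV) = ([1 − β]R + uβR) / (R + α − βR + u − uα + uβR)
where 1 − β is power, R is the pre-study odds, α is the significance level, and u is bias
[Figure: PPV (%) curves at 80% power vs. 20% power ("but we are here"), in two panels: less bias and more bias]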
Case study: Genome-wide association study
Dataset: 100,000 polymorphisms
Power: 60%
Pre-study odds (R): 10/100,000
alpha = 0.05
Positive predictive value is low even for reasonably powered study and minimal/no bias
Without bias
u=0
PPV = .0012
With bias
u=0.1
PPV = .00044
Source: Ioannidis 2005
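The two PPV values in this case study can be checked with a few lines. This sketch assumes the Ioannidis (2005) formulation with bias, PPV = ([1 − β]R + uβR) / (R + α − βR + u − uα + uβR); the function and argument names are illustrative.

```python
def ppv(power, odds, alpha=0.05, bias=0.0):
    """Positive predictive value, following the Ioannidis (2005) formula.

    power: 1 - beta, odds: pre-study odds R, bias: u.
    """
    beta = 1 - power
    true_positives = power * odds + bias * beta * odds
    all_positives = (odds + alpha - beta * odds
                     + bias - bias * alpha + bias * beta * odds)
    return true_positives / all_positives


R = 10 / 100_000  # pre-study odds from the case study
print(f"u = 0:   PPV = {ppv(0.60, R):.4f}")
print(f"u = 0.1: PPV = {ppv(0.60, R, bias=0.1):.5f}")
```

Even at 60% power, a tiny R means almost all significant hits are false positives, and a modest amount of bias cuts the PPV by more than half again.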
If research findings are accurate, then replication rates should be high, but they are not
Source: Open Science Collaboration, Science 2015
Expected 89 of 100 studies to replicate if the original effects were true,
but only 35 replicated
Distribution of p-values
Distribution of effect sizes
Interim Summary
1. Chance of a result being correct is low (<50%)
2. Power in neuroscience is low (~21%)
3. Replication rates are low (~35%)
4. Misuse of p-values is a central problem
What should we do?
The American Statistical Association recognized that there was widespread misuse of p-values and released a statement that underscores six key points.
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
p-values can be appropriate under specific circumstances
For example:
 You register the study
 Your analysis is specified before observing the data
 You are blind to the data
 You don't do the unblinding
 The decision to publish does not depend on the p-value, i.e. the study is published even if p-value > alpha
image source: www.ihatestatistics.com
There are three major strategies that have been put forward by the research community
1. Replace: use alternative measures of assessing evidence e.g. Bayes Factors
2. Reform: apply more stringent testing criteria, and use pvalues correctly (ASA)
3. Abandon: Embrace uncertainty instead of thresholds
1. Replace: p-values can be replaced/supplemented with alternative measures such as Bayes Factors
What are Bayes Factors?
The ratio of the marginal likelihoods of the data under two models (equal to the posterior odds when the two models are equally likely a priori).
Pros:
 Avoids dichotomization
 Can quantify evidence for H0
Cons:
 Less analytically tractable
 Specification of priors
Source: Wagenmakers 2007
BIC approximation to calculate posterior probability of null (assuming equal prior odds):
Pr(H0 | D) ≈ 1 / (1 + exp(−ΔBIC10 / 2))
where ΔBIC10 = BIC(H1) − BIC(H0)
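A minimal sketch of the BIC approximation, assuming the form given in Wagenmakers (2007): BF01 ≈ exp(ΔBIC10 / 2) with ΔBIC10 = BIC(H1) − BIC(H0), and equal prior probability for H0 and H1. The function names and the BIC values in the example are hypothetical.

```python
import math


def bf01_from_bic(bic_null, bic_alt):
    """Approximate Bayes factor for H0 over H1 from the two BIC values."""
    return math.exp((bic_alt - bic_null) / 2)


def posterior_null(bic_null, bic_alt):
    """Approximate Pr(H0 | data), assuming H0 and H1 are equally likely a priori."""
    bf01 = bf01_from_bic(bic_null, bic_alt)
    return bf01 / (1 + bf01)


# Hypothetical BIC values, for illustration only.
print(f"Pr(H0 | D) = {posterior_null(bic_null=100.0, bic_alt=102.0):.3f}")
```

Unlike a p-value, this quantity can favor the null: a lower BIC for H0 pushes Pr(H0 | D) above 0.5.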
An alternative approach: Conditional Equivalence Testing
2. Reform: A "critical mass" of researchers recommend changing the default threshold to 0.005
Pros:
 Weak BFs at 0.05, strong BFs at 0.005
 Will reduce the false positive rate
Cons:
 Will increase the false negative rate
 Will require bigger n
 0.005 is still unjustified
Substantial controversy over this proposal.
Source: Benjamin et al. 2017
Justification is strongly based on Bayes Factors (BFs)
What effect will this proposal have on the positive predictive value?
Simulations using the Ioannidis formulation and the new threshold
The new threshold increases PPV across the board, but the extent depends heavily on power, pre-study odds, and especially bias
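A sketch of the trade-off, not the simulation behind the slide. It assumes a two-sided z-test, no bias (u = 0), a hypothetical study with about 80% power at alpha = 0.05, and pre-study odds of 1:4. Sample size is held fixed, so lowering alpha also lowers power.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test_power(noncentrality, alpha):
    """Power of a two-sided z-test with the given noncentrality parameter."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(z_crit - noncentrality)
            + STD_NORMAL.cdf(-z_crit - noncentrality))


def ppv_no_bias(power, odds, alpha):
    """Ioannidis (2005) PPV with bias u = 0."""
    return power * odds / (power * odds + alpha)


# Hypothetical study: about 80% power at alpha = 0.05, pre-study odds of 1:4.
noncentrality = STD_NORMAL.inv_cdf(0.975) + STD_NORMAL.inv_cdf(0.80)
odds = 0.25

results = {}
for alpha in (0.05, 0.005):
    power = z_test_power(noncentrality, alpha)
    results[alpha] = (power, ppv_no_bias(power, odds, alpha))
    print(f"alpha = {alpha}: power = {power:.2f}, PPV = {results[alpha][1]:.2f}")
```

Under these assumptions PPV rises at the stricter threshold while power falls, which is the false-negative cost listed among the cons unless n is increased.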
[Figure: PPV under the original threshold (0.05) and the new threshold (0.005)]
3. Abandon: Embrace uncertainty
“Scientific knowledge is a body of statements of varying degrees of certainty: some most unsure, some nearly sure, none absolutely certain.” (Richard Feynman)
“The solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.” (Andrew Gelman)
AKA: use hierarchical Bayes at all times.
Source: Feynman 195?, Gelman 2017
A statistician offers helpful advice for psychology researchers
Source: Gelman 2013
 Analyze all your data
 Present all your comparisons
 Make your data public
 Make your analysis public (use GitHub)
 Preregister if possible
Source: Data Colada
Advice from Rob Kass:
When Possible, Replicate!
When statistical inferences, such as p-values, follow extensive looks at the data, they no longer have their usual interpretation. Ignoring this reality is dishonest: it is like painting a bull’s eye around the landing spot of your arrow...
The only truly reliable solution to the problem posed by data snooping is to record the statistical inference procedures that produced the key results, together with the features of the data to which they were applied, and then to replicate the same analysis using new data.
Source: Kass et al. 2016
Remember to look at your data, and avoid blindly reporting p-values
Source: Matejka & Fitzmaurice
Additional Resources
 ASA statement on p-values (esp. supplemental material)
 World beyond 0.05 (ASA conference notes)
 "Ten Simple Rules for Effective Statistical Practice"
 "Why most published research findings are false"
 "Empirical estimates suggest most published medical research is true" - empirical
 "Power failure: why small sample size undermines the reliability of neuroscience"
 "Reanalysis of Power failure" - response to above
 "Abandon statistical significance" - recent comment (Sep 2017) by Andrew Gelman
 Vox article - useful for science communication
Additional Resources continued
 Dorothy Bishop talk on reproducibility
 "How not to fool yourself with p-values" - Regina Nuzzo
 Practicing reproducible research
 Data Colada
 When the revolution came for Amy Cuddy
 Justify Your Alpha
 Nature commentary: 5 ways to fix statistics
 Prestigious science journals are especially bad and Impact Factors are nonsense
Discussion time...
Thanks to Ran Liu, Kyle Dunovan, Harlan Campbell
& Chris Hobson for comments on this talk
Are p-values the cause of the current reproducibility crisis? Is neuroscience doomed? Should we abandon science and become sheep herders? This talk will explore the current use and abuse of p-values, "the parody of falsificationism that is null hypothesis significance testing" (Gelman 2014), and the related power failure crisis. We will outline the theory and history of p-values and explore how neuroscience, psychology and cognitive science publications have become so dominated by their use. Next we will consider recent (2017) calls for reform by the American Statistical Association and by academics who advocate moving the standard threshold to α = 0.005 or adopting Bayes Factors, and ultimately why pragmatism in the face of NHST is challenging.