# p-values, the statistical crisis in science and what to do about it.

Patrick Beukema, PhD

CoAx lab, CMU

Center for Neuroscience, U. Pitt

Gold-standard analyses can result in nonsensical results

Experiment: fMRI analysis of emotional valence.

Analysis: Standard General Linear Model

Subject: North Atlantic salmon (deceased), n=1

Study findings: significant BOLD activation related to emotional valence

Conclusion: There can be flaws in current "best practice" analytical approaches

>50% of research findings are false

21% median statistical power in neuroscience

35% replication rate

\$28 Billion estimated costs of irreproducible research

Statistical errors result in massive inefficiencies

Source: Ioannidis 2005, Button 2013, Freedman 2015, OSC 2015

Source: Brown 2013

Misunderstanding p-values can be deadly

“InterMune Announces Phase III Data Demonstrating Survival Benefit of Actimmune in IPF...Reduces Mortality by 70% in Patients with Mild to Moderate Disease.”

Experiment: Interferon gamma-1b for idiopathic pulmonary fibrosis

Analysis:

1. treatment vs. placebo (p-value = 0.08)
2. in subgroup (p-value = 0.004)

Dr. Harkonen

(former InterMune CEO)

Followup RCT with subgroup:

15 more people died on drug

Outline for today

1. History of p-values
2. p-value problems (PVPs)
3. PVPs: False Findings & Low Power
4. Utility of p-values
5. What should we do?
6. Discussion

P-values quantify the probability of a result at least as extreme as the one observed, assuming the null hypothesis is true

Source:  Wikipedia

Question:

Where do we draw the line that distinguishes the likely from the unlikely?

Fisher arbitrarily proposed that p-values are significant if less than 0.05

Source: Fisher 1926

"...If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty or one in a hundred. Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level."

QED

By the early 1960s, significance at the 0.05 level had become standard practice in biomedical research

Some problems with p-values are related to practice and can be avoided

Problem 1: misinterpreting what the test tells you

Problem 2: the use of sharp point null hypotheses

Source: Wagenmakers 2007

$P(\text{observation}|\text{hypothesis}) \ne P(\text{hypothesis}|\text{observation})$

Straw man nulls assume (sometimes incorrectly) that there is zero variance and zero systematic error

Unavoidable problems result from the assumptions of null hypothesis significance testing

Problem 3: p-values depend on data that weren't observed

Problem 4: p-values depend on unknown intentions of researcher

Source: Wagenmakers 2007

$P(t(y^{rep}) > t(y_{observed})|H_0)$

hypothetical replications

determine sampling distribution

Think of this

Problem 3: p-values depend on data that weren't observed

Source: Wagenmakers 2007

Observation: x = 5

Under f(x): p-value = 0.03+0.01 = 0.04 -> reject null
Under g(x): p-value = 0.03+0.03 = 0.06 -> fail to reject null
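
The x = 5 example can be sketched in a few lines of Python. Only the tail probabilities (0.03 and 0.01 under f, 0.03 and 0.03 under g) come from the slide. The outcome range 1 to 6 and the remaining probabilities are illustrative assumptions, chosen so that each distribution sums to 1.

```python
# Illustrative sampling distributions over outcomes x = 1..6 (assumed values).
# Only the probabilities at x = 5 and x = 6 are taken from the example above.
f = {1: 0.04, 2: 0.30, 3: 0.31, 4: 0.31, 5: 0.03, 6: 0.01}
g = {1: 0.04, 2: 0.30, 3: 0.30, 4: 0.30, 5: 0.03, 6: 0.03}


def upper_tail_p(dist, observed):
    """P(X >= observed) under the null: the observed outcome plus more extreme ones."""
    return sum(p for x, p in dist.items() if x >= observed)


p_f = upper_tail_p(f, 5)
p_g = upper_tail_p(g, 5)
print(f"f(x): p = {p_f:.2f}, reject at 0.05: {p_f < 0.05}")
print(f"g(x): p = {p_g:.2f}, reject at 0.05: {p_g < 0.05}")
```

Both distributions assign the same probability to the observed x = 5. They differ only at x = 6, an outcome that never occurred, yet that difference flips the decision.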

"What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred" (Jeffreys, 1961)

Suppose we have the following two sampling distributions

Problem 4: p-values depend on a sampling plan which in turn depends on the subjective intentions of researcher

Source: Wagenmakers 2007

Example: Alex tests if emotionally primed words are categorized more quickly than neutrally primed words in n=20 subjects.

Result: p-val = 0.045, Conclusion: reject null

Question for Alex:

“What would you have done if the effect had not been significant after 20 subjects?”

Alex's response:

A. "I don't know" -> p-value undefined

B. "depends on the editor's response" -> p-value depends on action letter

C. "I would not have tested more subjects" -> p-value unaffected
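
To see why the sampling plan matters, here is a rough simulation sketch under stated assumptions. It uses a two-sided z-test with known unit variance as a stand-in for whatever test Alex ran, and a hypothetical plan of "test 10 more subjects if the first 20 are not significant". The function names and the second-look size are illustrative, not taken from Wagenmakers 2007.

```python
import math
import random


def two_sided_p(sample):
    """Two-sided z-test p-value for a mean of 0, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_sims=5000, n_first=20, n_extra=10, alpha=0.05, seed=0):
    """False positive rates under a true null for two sampling plans."""
    rng = random.Random(seed)
    fixed_hits = 0    # plan C: stop at n_first no matter what
    peeking_hits = 0  # hypothetical plan: add n_extra subjects if not significant
    for _ in range(n_sims):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_first)]
        if two_sided_p(data) < alpha:
            fixed_hits += 1
            peeking_hits += 1
        else:
            data += [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
            if two_sided_p(data) < alpha:
                peeking_hits += 1
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate()
print(f"fixed n = 20:           false positive rate = {fixed_rate:.3f}")
print(f"test more if not sig.:  false positive rate = {peeking_rate:.3f}")
```

The same first 20 observations yield a nominal p-value in both plans, but the plan that allows a second look rejects a true null more often, so the nominal p-value no longer means what it claims.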

Problem 5: p-values do not quantify statistical evidence (the p-value postulate: identical p-values provide identical evidence against the null)

Source: Wagenmakers 2007

Consider the following two experiments

Experiment 1:

p-value = 0.032

n = 11

Experiment 2:

p-value = 0.032

n = 98

Which of these experiments provides more evidence against the null?

There is widespread disagreement, which violates the p-postulate

There are many sources of flexibility in research design that lead to known errors

• adjusting bin size, frequencies, kernels, 1- vs. 2-tailed tests, etc.
• subgroup analysis after the main effect is not significant
• writing a hypothesis after the analysis

p-hacking

50% had selectively reported only studies that 'worked'

58% had peeked at the results

43% threw out data after checking its impact on the p-value

35% reported unexpected findings as if predicted from start

Source: Loewenstein 2012

These errors are extremely common

Source of flexibility

dredging

HARKing

In a sample of 2,000 researchers:

Results may be invalid even without explicit p-hacking

Source: Gelman and Loken 2013

1. Simple classical test based on a unique test statistic: $T(y)$

2. Classical test pre-chosen from a set of possible tests: $T(y;\phi)$

3. Researcher degrees of freedom without fishing: single test based on the data, but a different test would have been performed given different data: $T(y;\phi(y))$

4. "Fishing": performing J tests and then reporting the best result: $T(y;\phi^{best}(y))$
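
For the fishing case, a back-of-the-envelope sketch shows how quickly the chance of a spurious "best result" grows with J. It assumes the J tests are independent and every null is true, which real analyses rarely satisfy exactly; the function name is illustrative. The forking-paths case has no equally simple closed form, because the test that would have been run under other data is never observed.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """P(at least one p < alpha) across n_tests independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


for J in (1, 5, 20):
    rate = chance_of_false_positive(J)
    print(f"J = {J:2d}: chance of at least one significant result = {rate:.3f}")
```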

Well known problem

Lesser known problem

Test-statistic

The lesser-known problem is described as the "garden of forking paths", which can lead to statistically invalid conclusions

Source: Bishop, Dorothy V M (2016): The Garden of Forking Paths.

Degrees of freedom in design choice can lead to many different outcomes

Source: PsyArXiv Crowdsourcing Analytics

Case study: Are referees more likely to give red cards to dark skin toned professional soccer players?

Subjective belief across time

Source: PsyArXiv Crowdsourcing Analytics

Case study: Are referees more likely to give red cards to dark skin toned professional soccer players?

Significance tests and p-values do not indicate whether a result is true.

What proportion of studies can we trust, say at a given significance threshold of 0.05?

19/20? Maybe less?

The truth-value of a result is much more nuanced. In addition to the significance level, it depends on:

• Bias

• Pre-study odds

• Statistical power

Source: N.D.G. on Science

Pre-study odds of the hypothesis being correct dramatically impact conclusions from p-values

Identical p-values, opposite conclusions

Source: Nuzzo 2014

Power in neuroscience is low

Median statistical power is 21%

Source: Button et al. 2013

The distribution of power across 49 meta-analyses

It is challenging to collect high-quality neural data, so perhaps it is not surprising that power is low

“when effect size is tiny and measurement error is huge, you’re essentially trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.”

- Andrew Gelman

Source:  Gelman 2015, artwork by Viktor Beekman

Still a problem

$PPV = \frac{[1-\beta]R+u\beta R}{R+\alpha-\beta R+u-u\alpha+u\beta R}$

We can quantify the probability that a result is true

Source:  Ioannidis 2005

It depends on:

1) power $(1-\beta)$

2) pre-study odds $(R)$

3) significance level $(\alpha)$

4) bias $(u)$
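
A minimal Python sketch of the Ioannidis (2005) expression in terms of these four quantities. The function name `ppv` and the pre-study odds of 0.25 are illustrative assumptions, used here only to compare 80% power with 20% power.

```python
def ppv(power, R, alpha=0.05, u=0.0):
    """Positive predictive value from power (1 - beta), pre-study odds R,
    significance level alpha, and bias u."""
    beta = 1.0 - power
    numerator = (1 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


R = 0.25  # assumed pre-study odds, for illustration only
for power in (0.8, 0.2):
    print(f"power = {power:.0%}: PPV = {ppv(power, R):.2f}")
```

With everything else held fixed, dropping power from 80% to 20% lowers the share of significant findings that are true.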

[Figure: positive predictive value (PPV, %) in two panels, less bias and more bias. Annotations mark 80% power and "but we are here" at 20% power.]

Case study: Genome-wide association study

$PPV = \frac{[1-\beta]R+u\beta R}{R+\alpha-\beta R+u-u\alpha+u\beta R}$

Dataset: 100,000 polymorphisms

Power: 60%

Pre-study odds (R): 10/100,000

alpha = 0.05

Positive predictive value is low even for reasonably powered study and minimal/no bias

Without bias

u=0

PPV = .0012

With bias

u=0.1

PPV = .00044
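
The case-study numbers can be checked directly by plugging the stated inputs into the PPV formula. This is a small sketch; the function name `ppv` is illustrative.

```python
def ppv(power, R, alpha, u):
    """PPV as in Ioannidis (2005), with beta = 1 - power."""
    beta = 1.0 - power
    numerator = (1 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


R = 10 / 100_000  # pre-study odds from the case study
without_bias = ppv(power=0.60, R=R, alpha=0.05, u=0.0)
with_bias = ppv(power=0.60, R=R, alpha=0.05, u=0.1)
print(f"u = 0:   PPV = {without_bias:.4f}")
print(f"u = 0.1: PPV = {with_bias:.5f}")
```

Under these inputs the sketch gives roughly .0012 without bias and .00044 with bias, matching the values above.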

If research findings are accurate, then replication rates should be high, but they are not

Source: Open Science Collaboration, Science 2015

Expected replication of 89/100 if original effects were true, but only 35 could be replicated

Distribution of p-values

Distribution of effect sizes

Interim Summary

1. Chance of a result being correct is low (<50%)

2. Power in neuroscience is low (~21%)

3. Replication rates are low (~35%)

4. Misuse of p-values is a central problem

What should we do?

The American Statistical Association recognized there was widespread misuse of p-values and released a statement which underscores 6 key points.

1. P-values can indicate how incompatible the data are with a specified statistical model

2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

p-values can be appropriate under specific circumstances

• You register the study
• Your analysis is specified before observing the data
• You are blind to the data
• You don't do the unblinding
• The decision to publish does not depend on the p-value, i.e. you publish even if p-value > alpha

image source: www.ihatestatistics.com

For example:

There are three major strategies that have been put forward by the research community

1. Replace: use alternative measures of assessing evidence e.g. Bayes Factors

2. Reform: apply more stringent testing criteria, and use p-values correctly (ASA)

3. Abandon: Embrace uncertainty instead of thresholds

1. Replace: p-values can be replaced/supplemented with alternative measures such as Bayes Factors

What are Bayes Factors?

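The ratio of the marginal likelihoods of the data under two models. It equals the ratio of their posterior probabilities when the two models are equally likely a priori.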

Pros:

• Avoids dichotomization
• Can quantify evidence for H0

Cons:

• Less analytically tractable
• Specification of priors

Source: Wagenmakers 2007

BIC approximation to calculate posterior probability of null:

$Pr_{BIC}(H_0|D) = \frac{1}{1+e^{-\frac{1}{2}\Delta BIC_{10}}}$

where

$\Delta BIC_{10} = n \log(\frac{SSE_1}{SSE_0})+(k_1-k_0)\log(n)$
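
The BIC approximation is simple enough to compute by hand or in a few lines of Python. The sketch below follows the two expressions for $\Delta BIC_{10}$ and $Pr_{BIC}(H_0|D)$. The function names and the example numbers (n = 40, sums of squared errors of 100 and 90, one extra parameter) are hypothetical, chosen only to show the mechanics.

```python
import math


def delta_bic_10(n, sse1, sse0, k1, k0):
    """BIC(H1) - BIC(H0) from sample size, sums of squared errors,
    and number of free parameters in each model."""
    return n * math.log(sse1 / sse0) + (k1 - k0) * math.log(n)


def pr_bic_h0(delta_bic):
    """Approximate posterior probability of H0, assuming equal prior odds."""
    return 1.0 / (1.0 + math.exp(-0.5 * delta_bic))


# Hypothetical example values, not taken from the talk.
d = delta_bic_10(n=40, sse1=90.0, sse0=100.0, k1=2, k0=1)
print(f"Delta BIC_10 = {d:.2f}, Pr(H0 | D) = {pr_bic_h0(d):.2f}")
```

A negative $\Delta BIC_{10}$ favors the alternative and pushes the posterior probability of the null below 0.5; a positive value favors the null, which is how this approach can quantify evidence for H0.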

An alternative approach: Conditional Equivalence Testing

2. Reform: A "critical mass" of researchers recommend changing the default threshold to 0.005

Pros:

• Weak BFs at 0.05, strong BFs at 0.005
• Will reduce the false positive rate

Cons:

• Will increase the false negative rate
• Will require larger n
• 0.005 is still unjustified

Substantial controversy over this proposal.

Source: Benjamin et al. 2017

Justification is strongly based on Bayes Factors (BFs)

What effect will this proposal have on the positive predictive value?

Simulations using the Ioannidis formulation and the new threshold

New threshold increases PPV across the board, but extent depends heavily on power, pre-study odds, and especially bias
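
A rough sketch of this kind of comparison, not the simulation behind the slide. The grid of power, pre-study odds and bias values is an illustrative assumption, and power is held fixed when the threshold changes, whereas in practice a stricter threshold lowers power at a given n.

```python
from itertools import product


def ppv(power, R, alpha, u):
    """PPV as in Ioannidis (2005), with beta = 1 - power."""
    beta = 1.0 - power
    numerator = (1 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


# Hypothetical scenarios: (power, pre-study odds R, bias u).
scenarios = list(product((0.2, 0.8), (0.1, 1.0), (0.0, 0.2)))

for power, R, u in scenarios:
    original = ppv(power, R, alpha=0.05, u=u)
    new = ppv(power, R, alpha=0.005, u=u)
    print(f"power={power:.1f} R={R:.1f} u={u:.1f}: "
          f"PPV {original:.2f} -> {new:.2f}")
```

In every scenario the stricter threshold raises PPV, but the gain shrinks sharply once bias is present, because the bias term adds false positives that do not depend on the threshold.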

Original

New

$PPV = \frac{[1-\beta]R+u\beta R}{R+\alpha-\beta R+u-u\alpha+u\beta R}$

3. Abandon: Embrace uncertainty

"The solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation."

- Andrew Gelman

AKA: use hierarchical Bayes at all times.

"Scientific knowledge is a body of statements of varying degrees of certainty -- some most unsure, some nearly sure, none absolutely certain."

- Richard Feynman

Source: Feynman 195?, Gelman 2017

A statistician offers helpful advice for psychology researchers

Source: Gelman 2013

1. Analyze all your data
2. Present all your comparisons
3. Make your data public
4. Make your analysis public (use github)

Preregister if possible

Advice from Rob Kass:

When Possible, Replicate!

When statistical inferences, such as p-values, follow extensive looks at the data, they no longer have their usual interpretation. Ignoring this reality is dishonest: it is like painting a bull’s eye around the landing spot of your arrow...

The only truly reliable solution to the problem posed by data snooping is to record the statistical inference procedures that produced the key results, together with the features of the data to which they were applied, and then to replicate the same analysis using new data.

Source: Kass et al. 2016

Remember to look at your data, and to avoid blindly reporting p-values

Source: Matejka & Fitzmaurice

# Discussion time...

Thanks to Ran Liu, Kyle Dunovan, Harlan Campbell & Chris Hobson for comments on this talk

#### Beyond the dead salmon: p-values, the statistical crisis in science and what to do about it.

By Patrick Beukema

Are p-values the cause of the current reproducibility crisis? Is neuroscience doomed? Should we abandon science and become sheep herders? This talk will explore the current use and abuse of p-values, "the parody of falsificationism that is null hypothesis significance testing (Gelman 2014)", and the related power failure crisis. We will outline the theory and history of p-values and explore how neuroscience, psychology and cognitive science publications have become so dominated by their use. Next we will consider recent (2017) calls for reform by the American Statistical Association, academics who advocate moving the standard threshold to $\alpha = 0.005$ or Bayes Factors, and ultimately why pragmatism in the face of NHST is challenging.
