On science, statistics and the problem

of pretty good inference

Danielle Navarro (@djnavarro)

https://slides.com/djnavarro/scienceandstatistics2

I briefly considered giving a talk about preregistration...

 

... but decided against it

(https://psyarxiv.com/x36pz/)

A request: if you tweet about my talk, please think about the framing & consider adding a link to my slides

 

https://slides.com/djnavarro/scienceandstatistics2

Statistics is a tool that can help scientists develop, evaluate and modify theoretical claims

Statistical theory provides the logic of and the rules for scientific reasoning

Two views of statistics, alike in respectability...

  • Scientists must follow the rules
  • Statistics dictates what is "questionable"
  • Breaking the rules is "p-hacking"
  • Researcher "degrees of freedom" are bad
  • Human judgment is not to be trusted

"Whatever. Those rules aren't real"

"Without rules there is chaos"

  • Scientists should select a suitable tool
  • Scientists decide what is "questionable"
  • The rules aren't much more than guidelines
  • Researcher "degrees of freedom" are good
  • Statistical calculations are not to be trusted

"Without rules there is chaos"

I am deeply opposed to all forms of statistical prescriptivism... and my goal in this talk is to discuss why I feel this way

Part 1: The data you have are rarely the data you need

original artwork made with github.com/djnavarro/jasmines

Thomas Kuhn

"Science does not deal in all possible laboratory manipulations. Instead, it selects those relevant to the juxtaposition of a paradigm with the immediate experience that that paradigm has partially determined. As a result, scientists with different paradigms engage in different concrete laboratory manipulations"

Okay, so I want to study human concept learning and inductive reasoning.

Let's consider two strategies I typically follow...

Strategy 1: study human reasoning in the lab with controlled stimuli, randomisation, etc...

(https://psyarxiv.com/szr4u)

Operationalising the phenomenon of interest

In the small world defined by my operationalisation, I can specify a class of possible cognitive models

(Yay math psych! I don't have to hide the equations)

Experiments based on my operationalisation yield very clean data that discriminate between competing computational models, yay!!!

What does my Bayes factor tell me about the relationship between my models and human reasoning outside of the experiment?

But I have an external validity problem... does this map onto anything in the world?

There are more things in human reasoning than are dreamt of in my operationalisation!

Strategy 2: study reasoning using natural language concepts and "real world" data...

(https://psyarxiv.com/mb93p/)

The SWOW (Small World of Words) norms encode common sense knowledge

 

... about science

The SWOW norms encode common sense knowledge

 

... about clothes

We can use SWOW to study reasoning about lexicalised concepts...

 

(https://psyarxiv.com/s3k79/)

If I don't want sexist assumptions embedded in experiments that rely on SWOW norms, it is unwise to take those norms at face value

Buuuuut....

http://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/

(Mikolov et al 2013)

But... does this really matter?

A cautionary example of how to misuse real world data in an applied context

Yeah, yeah, the model achieves 95% cross-validation accuracy, whatever...

Looking good!

For lab studies, the data analysis is only useful if you know how your operationalisation affects the way you generalise outside the lab

For naturalistic studies, the data analysis is only useful if you know which aspects of your uncontrolled data are relevant to the thing you are trying to study

... either way, the limitations of your design and your data ensure that the scientific problem you care about is more difficult than the one your statistical tools are built to solve

Part 2: The statistics we have are rarely the statistics we need

original artwork made with github.com/djnavarro/jasmines

Statistical inference as I learned it in my Ph.D. 

Should we use p-values to evaluate scientific models?

Life is too short to take p-values seriously

Okay how about something that isn't absurd? Bayes factors?

Bayes factors: A tool for confidently selecting the "best" "model"...

... where "best" is weird and the "models" are fantasies

Here be dragons.

Gronau and Wagenmakers (2018)

A toy problem that implicitly asserts a prescriptivist view of statistics

...which is unfortunate, because the true data-generating process is not in your model space

Selecting a horribly broken model with absolute confidence just because it's the "least wrong" (in KL-divergence terms) is a transparently insane inferential strategy

In real life everything falls apart when all your models are wrong in terrible ways

 

(https://psyarxiv.com/39q8y/)
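The "least wrong" pathology is easy to see numerically. A toy sketch (not from the talk; the models and numbers are invented for illustration): two parameter-free candidate models, both misspecified, so the Bayes factor reduces to a likelihood ratio, and it endorses whichever model is closer in KL divergence with confidence that grows without bound as the sample grows.

```python
import math
import random

random.seed(1)

# True data-generating process: Normal(0, 3) -- NOT in the candidate set.
n = 500
data = [random.gauss(0, 3) for _ in range(n)]

def log_normal_pdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Two fixed (parameter-free) candidate models, both badly wrong:
#   M1: Normal(0, 1)    M2: Normal(0, 2)  -- M2 is closer in KL divergence
log_lik_m1 = sum(log_normal_pdf(x, 0.0, 1.0) for x in data)
log_lik_m2 = sum(log_normal_pdf(x, 0.0, 2.0) for x in data)

# With no free parameters, the Bayes factor is just the likelihood ratio
log_bf_21 = log_lik_m2 - log_lik_m1
print(f"log Bayes factor (M2 over M1) at n={n}: {log_bf_21:.1f}")
# The log Bayes factor grows roughly linearly in n: ever-increasing
# confidence in M2, even though neither model generated the data.
```

Nothing here diagnoses that both models are wrong; the machinery only ranks the candidates it was given.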

P(X|M) = ∫ P(X|θ) P(θ|M) dθ

The prior P(θ|M): the researcher's belief (or "knowledge") about the plausible values of the model parameters.

 

The prior predictive P(X|M) is quite sensitive to this prior, including the parts we don't think much about, like the heaviness of the tails
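The tail sensitivity can be shown by brute-force integration. A minimal sketch (the priors, the single observation, and the unit-variance likelihood are all invented for illustration): two priors agree on centre and scale and differ only in tail heaviness, yet the marginal likelihoods they produce differ by an order of magnitude.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def cauchy_pdf(x, loc=0.0, scale=1.0):
    return 1.0 / (math.pi * scale * (1.0 + ((x - loc) / scale)**2))

def marginal_likelihood(x_obs, prior_pdf, lo=-60.0, hi=60.0, steps=120000):
    """P(X|M) = integral of P(X|theta) P(theta|M) dtheta, trapezoid rule.
    Likelihood: one observation x_obs ~ Normal(theta, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * normal_pdf(x_obs, theta, 1.0) * prior_pdf(theta)
    return total * h

# One observation out in the tails of both priors
x_obs = 5.0

# Same centre (0), same unit scale -- only the tails differ
m_normal = marginal_likelihood(x_obs, lambda t: normal_pdf(t, 0.0, 1.0))
m_cauchy = marginal_likelihood(x_obs, cauchy_pdf)

print(f"P(X|M), Normal(0,1) prior: {m_normal:.2e}")
print(f"P(X|M), Cauchy(0,1) prior: {m_cauchy:.2e}")
print(f"ratio (Cauchy / Normal):   {m_cauchy / m_normal:.1f}")
```

A Bayes factor between two models that differ only in a tail assumption nobody thought hard about can come out heavily lopsided.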

The more carefully you look, the stranger things become...

image via Michael Betancourt

The more parameters in your model, the more "tail" there is. So the closer we come to having genuinely plausible models of human reasoning, the less likely it is we can write down a plausible prior for those models
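The "more tail" point is the standard concentration-of-measure phenomenon, and a quick simulation (an illustration, not code from the talk) makes it concrete: draws from a standard normal prior in d dimensions sit at a typical distance of about sqrt(d) from the mode, so in a high-dimensional model essentially all prior mass lives far from the prior's "most plausible" point.

```python
import math
import random

random.seed(1)

def mean_distance_from_mode(dim, n_samples=2000):
    """Average Euclidean distance from the origin (the mode) for
    draws from a standard normal prior in `dim` dimensions."""
    total = 0.0
    for _ in range(n_samples):
        total += math.sqrt(sum(random.gauss(0, 1) ** 2 for _ in range(dim)))
    return total / n_samples

dists = {dim: mean_distance_from_mode(dim) for dim in (1, 10, 100)}
for dim, dist in dists.items():
    print(f"dim = {dim:3d}: typical distance from mode ~ {dist:.2f}"
          f" (sqrt(dim) = {math.sqrt(dim):.2f})")
```

A prior that looks sensible coordinate-by-coordinate is, in aggregate, a claim about regions of parameter space you never actually inspected.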

Oh and also the researcher is a human being who doesn't have perfect access to their own "beliefs" or "knowledge" (remember, we're psychologists????)

Anyone care to specify a prior for this model?

One that is precise enough for a Bayes factor calculation to actually mean something?

Lake et al (2015)

Part 3: Statistical modesty, domain knowledge, and "pretty good" inferences

original artwork made with github.com/djnavarro/jasmines

Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.

- George Box (1976)

Model fitting the way I was taught it as a naive young thing ... quantify the degree of agreement between model and data

(https://psyarxiv.com/2m83v/)

Correlations range from .91 to .95. About what I'd expect for a decent model, but is it ... (gasp) ... overfitting?

The middle-aged cynic, having found a way to "break" every model selection criterion she has tried...

(e.g., https://djnavarro.net/post/a-personal-essay-on-bayes-factors/)

What I do now...

What counts as a meaningful pattern in the data? To what extent are such patterns a priori predictions of the model?

(https://psyarxiv.com/43auj/)

The same data set and model fits replotted.

Huh. Maybe I'm underfitting???

 

(https://psyarxiv.com/2m83v/)

This theoretically meaningful model failure is barely noticeable on the scatterplot, and a correlation of 0.92 rarely leads us to worry about underfitting, does it?
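This is easy to reproduce with a fabricated toy dataset (not the model fits from the paper): give the data a smooth, systematic deviation from the model's predictions, and the model–data correlation still comes out above .9 while the residuals are anything but noise.

```python
import math
import random

random.seed(1)

# Toy "model vs data" scatter: the model predicts y = x, but the data
# contain a smooth systematic deviation the model misses entirely.
n = 100
x = [i / (n - 1) for i in range(n)]                  # model predictions
y = [xi + 0.15 * math.sin(2 * math.pi * xi)          # systematic misfit
     + random.gauss(0, 0.02) for xi in x]            # small noise

def pearson_r(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    sb = math.sqrt(sum((bi - mb) ** 2 for bi in b))
    return cov / (sa * sb)

r = pearson_r(x, y)
print(f"model-data correlation: r = {r:.2f}")  # looks 'good'

# But the residuals change sign in long, smooth runs: pure structure
residuals = [yi - xi for xi, yi in zip(x, y)]
first_half = sum(residuals[: n // 2]) / (n // 2)
second_half = sum(residuals[n // 2:]) / (n - n // 2)
print(f"mean residual, first half:  {first_half:+.3f}")
print(f"mean residual, second half: {second_half:+.3f}")
```

The summary statistic happily averages over exactly the theoretically meaningful pattern you should be worrying about.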

Fit looks good, but is the model overfitting???

Theoretical agreement looks good, but is the model underfitting???

Alas, this iterative process means Sir Ronald Fisher & I are "double dipping". This is purported to be a questionable research practice.

Oh well

Wrapping up...

On the naivete of statistical prescriptivism

(1) When statistical inferences don't work as advertised, they lose their usual meaning even for confirmatory tests. So why be so petrified of making tentative inferences in exploratory analysis?

(2) Science is hard. Transparency, openness, audit trails, and scrupulous honesty are absolutely critical. This does not mean we must endorse the "remove researcher degrees of freedom to avoid p-hacking" opinion currently popular among methodological reformists

Thank you

https://slides.com/djnavarro/scienceandstatistics2

original artwork made with github.com/djnavarro/jasmines