Data Analysis Tactics

Ferran Muiños @fmuinos

Tuesday, February 4, 2020

Aim

"Introduction to statistical tactics by example"

  • Introduce some common tactics in data analysis

  • Correlation

  • Qualitative models

  • Fisher's test for categorical data

  • Regression

Correlation

Informal definition: 

the extent to which two experimental variables X, Y fluctuate in a synchronized way

Correlation is a handy proxy for causation, but the two are not equivalent.

https://www.tylervigen.com/spurious-correlations

how do we spot correlation?

               Sedentary YES   Sedentary NO   Total
CVD YES                  100             25     125
CVD NO                    15            100     115
Total                    115            125     240

for categorical data we use contingency tables

is the mass concentrated on one of the diagonals?


log Odds Ratio:

\log\textrm{OR} = \log \frac{T_{11}T_{22}}{T_{12}T_{21}} = \log\frac{100\cdot 100}{25\cdot 15} \sim 3.28
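As a quick check, here is a minimal sketch (assuming numpy is available) that reproduces the log odds ratio of the table above:

import numpy as np

# contingency table from the slide: rows = CVD YES/NO, columns = Sedentary YES/NO
T = np.array([[100, 25],
              [15, 100]])

log_or = np.log((T[0, 0] * T[1, 1]) / (T[0, 1] * T[1, 0]))
print(round(log_or, 2))  # ~3.28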

how do we spot correlation?


There are many other available options to spot correlation between categorical variables:

- Fisher's test -- wait and see.

- Logistic regression -- wait and see.

- Chi-squared-statistic based methods -- like Cramér's V, the Phi coefficient, etc.

how do we spot correlation?

for numerical data we use Pearson's correlation:

r_{xy} =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}
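For instance, a minimal sketch (with made-up data, assuming numpy) of computing r_xy:

import numpy as np

rng = np.random.default_rng(0)

# illustrative data: y is a noisy linear function of x
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.5, size=100)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r_xy
r_xy = np.corrcoef(x, y)[0, 1]
print(r_xy)  # positive, well away from 0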

Qualitative Models

Causal diagrams

Pearl & Mackenzie 2018

Junction patterns

fork: A \leftarrow B \rightarrow C

chain: A \rightarrow B \rightarrow C

collider: A \rightarrow B \leftarrow C

Berkson's paradox

Berkson's paradox:

even if two diseases have no relation to each other in the general population, they can appear to be associated among patients in a hospital.

Joseph Berkson

What kind of associations? Why?

Berkson's paradox

Scenario 1:

There is a causal relationship between the diseases. This could be true for some pairs D1, D2.

Berkson's paradox

Scenario 2:

A single disease alone is not severe enough to cause hospitalization.

Berkson's paradox

Scenario 3:

By performing the study on patients who are hospitalized, we are controlling for the variable Hospitalization, and this leads to a spurious negative correlation (also known as the "explaining-away" effect).

let's examine this scenario with a simple simulation experiment...

Berkson's paradox

coin-flipping simulation: 

- independent diseases with prescribed prevalences

- each is severe enough to lead to hospitalization

import numpy as np

binary_outcomes = [0, 1]
p1, p2 = 0.1, 0.1  # prevalences of d1, d2

# simulate the two diseases independently across 1000 individuals
d1 = np.random.choice(binary_outcomes, size=1000, p=[1-p1, p1])
d2 = np.random.choice(binary_outcomes, size=1000, p=[1-p2, p2])

\log\textrm{OR} \sim 0.0

D1 = [0 1 0 0 1 0 1 0 1 0 0 0 0 ...]

D2 = [0 0 0 0 0 1 0 0 1 0 1 0 0 ...]

Berkson's paradox

run 100 simulations...

Berkson's paradox

coin-flipping simulation (cont)

now select those samples that have at least one disease and check the correlation among those samples only.

# keep only the "hospitalized" samples: those with at least one disease
zipped = [z for z in zip(d1, d2) if (z[0] + z[1] >= 1)]
d1, d2 = tuple(map(np.array, zip(*zipped)))

\log\textrm{OR} \sim -8.3
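The slides do not show how the log OR of the restricted sample is computed; one possible sketch, continuing from the code above and adding a 0.5 pseudocount because one cell of the table is zero by construction:

# cross-tabulate the restricted sample: rows = d1 (1/0), columns = d2 (1/0)
T = np.array([[np.sum((d1 == 1) & (d2 == 1)), np.sum((d1 == 1) & (d2 == 0))],
              [np.sum((d1 == 0) & (d2 == 1)), np.sum((d1 == 0) & (d2 == 0))]])

# add a 0.5 pseudocount to every cell so the zero cell does not blow up the log
T = T + 0.5
log_or = np.log((T[0, 0] * T[1, 1]) / (T[0, 1] * T[1, 0]))
print(log_or)  # strongly negative, e.g. around -7 or -8 depending on the run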

Berkson's paradox

(array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
        1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1,
        1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1,
        0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1,
        1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1,
        1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
        0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,
        1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0,
        1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0]),
 array([1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1,
        0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
        0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
        1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
        0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,
        1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
        1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1,
        0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1,
        0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1]))

Berkson's paradox

           D1=YES   D1=NO
D2=YES         13     109
D2=NO          87       0

In the restricted sample the diseases seem to repel each other, even if this does not happen in the general population!

confounders

B is often called a confounder of A and C

 

B will make A and C statistically correlated even when there is no direct causal link between them.

A \leftarrow B \rightarrow C

confounders

Age is a confounder that makes Shoe Size and Reading Ability appear correlated.

To eliminate this spurious correlation we must correct for Age.

E.g. within a single age group (7-year-olds) we will not see any correlation.

\textrm {Shoe Size} \leftarrow \textrm{Age} \rightarrow \textrm {Reading Ability}
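A minimal simulation sketch of this fork (the numbers are made up for illustration, assuming numpy):

import numpy as np

rng = np.random.default_rng(1)

# Age is the confounder: it drives both Shoe Size and Reading Ability
age = rng.integers(5, 13, size=5000)                        # children aged 5 to 12
shoe_size = 20 + 1.5 * age + rng.normal(scale=1.0, size=5000)
reading = 10 * age + rng.normal(scale=5.0, size=5000)

# strong correlation in the full sample...
print(np.corrcoef(shoe_size, reading)[0, 1])                # close to 1

# ...but essentially none once we correct for Age (within the 7-year-olds)
seven = age == 7
print(np.corrcoef(shoe_size[seven], reading[seven])[0, 1])  # close to 0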

Simpson's paradox

A new drug D aimed at preventing heart attacks.

Clinical trial where participants decide whether to adhere to the treatment or not.

Bad for men, bad for women, good for "people"?

Simpson's paradox

- Adherence to the Drug treatment depends on Gender.

- Incidence of Heart Attack depends on Gender.

- Gender is a confounder that introduces a spurious correlation between Drug and Heart Attack.

Categorical Data Analysis

A lady claimed to be able to tell whether the tea or the milk was added first.

Lady tasting challenge

                Milk first: TRUE   Milk first: FALSE   Total
Lady says YES                 ??                  ??       4
Lady says NO                  ??                  ??       4
Total                          4                   4       8

https://en.wikipedia.org/wiki/Lady_tasting_tea

Lady tasting challenge

                Milk first: TRUE   Milk first: FALSE   Total
Lady says YES                  3                   1       4
Lady says NO                   1                   3       4
Total                          4                   4       8

Is this outcome much better than pure chance?

\log \textrm{OR} = \log \frac{3\cdot 3}{1\cdot 1} \sim 2.2

Lady tasting challenge

How many possible outcomes?

N = \binom{8}{4} = \frac{8!}{4!\;\cdot\;4!} = 70

How many outcomes with 3 successes or more?

N_{\geq 3} = N_3 + N_4 = \binom{4}{3}\binom{4}{1} + \binom{4}{4}\binom{4}{0} = 16 + 1 = 17

How often would we score 3 successes or more just by chance?

P_{\geq 3} = N_{\geq 3} / N = 17 / 70 \sim 0.24
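This is exactly the one-sided p-value of Fisher's exact test, which can be checked with scipy (a sketch, assuming scipy is installed):

from scipy.stats import fisher_exact

# lady tasting table: rows = lady says YES/NO, columns = milk first TRUE/FALSE
odds_ratio, p_value = fisher_exact([[3, 1], [1, 3]], alternative='greater')
print(p_value)  # 17/70 ~ 0.24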

Fisher's exact test

T(a,b,c,d)      X=1    X=0    Total
Y=1               a      b      a+b
Y=0               c      d      c+d
Total           a+c    b+d        n

\textrm{Prob}\;T(a,b,c,d) = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}

Suppose we randomly sample binary vectors X, Y of length n:

X = [0 1 0 0 1 0 1 1 1 1 ...] with #{1} = a+c; #{0} = b+d  

Y = [0 1 0 0 0 1 1 0 0 0 ...] with #{1} = a+b; #{0} = c+d

What are the chances of getting this Table?

p-value

How often the results of the experiment would be at least as extreme as the results actually observed just by chance.

 

"Measure of how surpising the observed is assuming the null-hypothesis"

For instance, in the lady tasting challenge:

p = \textrm{Prob}(T_{a=3}) + \textrm{Prob}(T_{a=4})

Fisher's test p-value

Suppose we run a more comprehensive lady tasting challenge:

                Milk first: TRUE   Milk first: FALSE   Total
Lady says YES               a=??                  ??      20
Lady says NO                  ??                  ??      20
Total                         20                  20      40

Fisher's test p-value

What is the probability of exactly a successes just by chance?

Fisher's test p-value

Suppose that the lady had a=12 successes. How often would we get a result at least as good? Is this very surprising?

p = \sum_{a=12}^{20} P(T_a) \sim 0.17

Fisher's test p-value

import numpy as np
import scipy.stats

# hypergeometric parameters: M = total cups, n = milk-first cups, N = cups the lady labels YES
[M, n, N] = [40, 20, 20]
rv = scipy.stats.hypergeom(M, n, N)

# probability of each possible number of successes a = 0, ..., 20
x = np.arange(0, n+1)
pmf_successes = rv.pmf(x)

# probability of 12 or more successes just by chance
p = sum(pmf_successes[12:21])

Fisher's exact test is just an application of the hypergeometric distribution.
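Equivalently (a sketch), the same one-sided p-value comes out of scipy's Fisher's exact test applied to the 2x2 table with a = 12:

from scipy.stats import fisher_exact

# a = 12 successes: rows = lady says YES/NO, columns = milk first TRUE/FALSE
table = [[12, 8],
         [8, 12]]
_, p_value = fisher_exact(table, alternative='greater')
print(p_value)  # ~0.17, matching the hypergeometric computation above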

Regression

Does growth rate depend on temperature?

[scatter plot: growth rate (Y) vs temperature (X)]

Does growth rate depend on temperature?

f:\textrm{Covariates} \to \textrm{Response}
[diagram: covariates X = temperature, pH, pressure, nutrients; response Y = growth rate]

What patterns do we see?

Perfect fit!

But do not expect it to fit new data :(


Different approaches

Parametric methods:

  • assume a rigid global shape
  • prediction may be inaccurate
  • good for interpretation

Smoothers:

  • less rigid shape
  • shape is locally defined
  • useful for prediction
  • more difficult to interpret

Parametric methods

Linear:

f_\theta(x) = ax + b,\qquad \theta = (a, b)

Quadratic:

f_\theta(x) = ax^2 + bx + c,\qquad \theta = (a, b, c)

The parameters are fit by least squares:

\hat\theta = \arg\min_\theta \sum_{i=1}^N (y_i - f_\theta(x_i))^2
Smoothers (non-parametric)

Average smoothing:

f(x) = \frac{1}{k}\sum_{i \in \mathcal{N}_k(x)} y_i

Nadaraya-Watson:

f(x) = \frac{\sum_{i=1}^N K(x - x_i)y_i}{\sum_{i=1}^N K(x - x_i)}

where K(x) is a bell-shaped kernel.
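A minimal Nadaraya-Watson sketch with a Gaussian kernel (the data and the bandwidth h are made up for illustration):

import numpy as np

def nadaraya_watson(x_grid, x, y, h=1.0):
    # weights[i, j] = K(x_grid[i] - x[j]) with a Gaussian (bell-shaped) kernel
    weights = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / h) ** 2)
    # kernel-weighted average of the responses y at each point of x_grid
    return (weights @ y) / weights.sum(axis=1)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)

x_grid = np.linspace(0, 10, 50)
y_smooth = nadaraya_watson(x_grid, x, y, h=0.5)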

Whatever the strategy, we want some function

f:\textrm{Covariates} \to \textrm{Response}

such that

f(x) = E(\textrm{Response}\; |\; \textrm{Covariates} = x)

Linear regression

Input

Covariates = Numerical

Response = Numerical

 

What for?

- Understanding the association between the covariates and the response variable.

- Prediction of the expected value of the response given the values of the covariates.

Linear regression

f(x) = 0.12 \cdot x - 0.41

Linear regression

f(x) = 0.12 \cdot x - 0.41

what if we repeated the experiment?

Linear regression with many variables?

f(x_1, x_2, x_3) = 0.12\cdot x_1 + 0.89\cdot x_2 + 3.2\cdot x_3
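A minimal sketch of fitting such a model by least squares with numpy (the data are simulated for illustration, with the slide's coefficients as ground truth; scikit-learn or statsmodels would work just as well):

import numpy as np

rng = np.random.default_rng(3)

# simulated covariates and a noisy linear response with coefficients 0.12, 0.89, 3.2
X = rng.normal(size=(200, 3))
y = X @ np.array([0.12, 0.89, 3.2]) + rng.normal(scale=0.5, size=200)

# add an intercept column and solve the least-squares problem
A = np.column_stack([X, np.ones(200)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # roughly [0.12, 0.89, 3.2, 0.0]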

Logistic regression

Input

Covariates = Numerical or Categorical

Response = Categorical Binary: {False, True} or {0, 1}

 

What for?

- Understanding the association between the response variable and the covariates.

- Classification -- estimating the probability that the response takes the True/False class given the values of the covariates.

Logistic regression

Logistic Regression fits a sigmoid function of the form:

f(x) = \frac{1}{1 + e^{-(ax + b)}}

\hat a \sim 0.19,\qquad \hat b \sim -5.87

How do we interpret the parameters?

How to interpret parameters in logistic regression

For a sample with covariate value x, we write p(y=1|x) for the probability that the sample has response class 1.

The association between the covariate value and this probability is given by the following expression:

ax + b = \log \frac{p(y=1|x)}{1 - p(y=1|x)}

a = increase of the log-odds (logit) for a 1-unit increase of the covariate.

b = log-odds when x = 0.
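A sketch of fitting and interpreting such a model on simulated data (assuming statsmodels is available; the true parameters below are made up to mimic the slide's estimates):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# simulate a binary response whose log-odds are a linear function of x
x = rng.uniform(0, 60, size=500)
p = 1.0 / (1.0 + np.exp(-(0.19 * x - 5.87)))
y = rng.binomial(1, p)

# Logit needs an explicit intercept column in the design matrix
result = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
print(result.params)  # [b_hat, a_hat], roughly -5.87 and 0.19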

Categorical logistic regression

Example (Lady Tea Challenge Revisited):

Samples = Cups of tea

Covariate = Whether the lady perceives it as milk-first

Response = Whether the cup of tea is milk-first

Input

Covariates = Categorical Binary: {False, True} or {0, 1}

Response = Categorical Binary: {False, True} or {0, 1}

All the information we require to fit this regression is... the contingency table of the experiment!

Categorical logistic regression

Example (Lady Tea Challenge Revisited)

 

Covariate: X = [1, 1, 1, 0, 1, 0, 0, 0]

Response: Y = [1, 1, 1, 1, 0, 0, 0, 0]

                Milk first: TRUE   Milk first: FALSE   Total
Lady says YES                  3                   1       4
Lady says NO                   1                   3       4
Total                          4                   4       8

\log \textrm{OR} = \log \frac{3\cdot 3}{1\cdot 1} \sim 2.2

\hat a \sim 2.2,\qquad \hat b \sim -1.1
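A sketch of checking this with statsmodels (assuming it is available), fitting the logistic regression directly on the eight cups:

import numpy as np
import statsmodels.api as sm

# covariate: the lady says milk-first; response: the cup actually was milk-first
X = np.array([1, 1, 1, 0, 1, 0, 0, 0])
Y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

result = sm.Logit(Y, sm.add_constant(X)).fit(disp=False)
print(result.params)  # intercept ~ -1.1 (log-odds at x = 0), coefficient ~ 2.2 (the log OR)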

Categorical logistic regression

Example (Lady Tea Challenge Revisited)

 


a = log-odds ratio of the contingency table.

b = log-odds when x = 0.

References

Pearl J, Mackenzie D. "The Book of Why". 2018.

James G, Witten D, Hastie T, Tibshirani R. "An Introduction to Statistical Learning".

Hastie T, Tibshirani R, Friedman J. "The Elements of Statistical Learning".
