Basic Data Analysis

Ferran Muiños

Institute for Research in Biomedicine

(IRB Barcelona)

Monday, 2021-02-08, 18:00

Aim

"Introduction to statistical tactics by example"

  • Qualitative Models

  • Measuring Association

  • Fisher test

  • Mann-Whitney test

  • Multiple test correction

Qualitative Models

Causal diagrams

Pearl & Mackenzie 2018

source: Wikipedia

An arrow from X to Y says that some
probability rule specifies how Y would change if X were to change.

Causal diagrams

- Out of 1 million children, 990,000 get vaccinated, 9,900 have the reaction, and 99 die from it.

- Out of 1 million children, 10,000 don't get vaccinated, 200 get smallpox, and 40 die from the disease.

In summary, more children die from the vaccination (99) than from the disease (40).

Junction patterns

fork: A \leftarrow B \rightarrow C

chain: A \rightarrow B \rightarrow C

collider: A \rightarrow B \leftarrow C

An example of collider bias: Berkson's paradox

"Even if two diseases have no relation to each other in the general population, they can appear to be associated among patients in a hospital".

Joseph Berkson

Why???

Scenario 1

There is a causal association between the diseases and at least one of them is sufficient to lead to hospitalization.

Why would two diseases appear to be associated?

Scenario 2

A single disease is not severe enough to cause hospitalization

Why would two diseases appear to be associated?

Scenario 3: Berkson's paradox

- The diseases have no association in the general population whatsoever.

- Both diseases are sufficient to cause hospitalization

- By performing the study on patients that are hospitalized, we are introducing a spurious negative association.

- This is also known as "explain-away" effect.

Why would two diseases appear to be associated?

Berkson's paradox

simulation experiment [notebook]

 

assumptions:

- independent diseases

- with the same prevalence in the general population (10%)

- each is severe enough to cause hospitalization

Berkson's paradox

simulation experiment [notebook]

 

step 1:

- for each of N=1,000 people, randomly assign D1 = 1 with probability p=0.1

- for each of the same people, randomly assign D2 = 1 with probability p=0.1

        person 1   person 2   ...   person N
D1         0          1       ...      0
D2         0          0       ...      0

Berkson's paradox

simulation experiment [notebook]

 

step 2:

- discard samples with D1=D2=0

  these are the people who are not hospitalized
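The notebook itself is not reproduced here; the following is a minimal sketch of the two steps above, assuming numpy and a fixed random seed (the use of np.corrcoef to quantify the association is an illustration choice, not taken from the notebook):

import numpy as np

rng = np.random.default_rng(seed=0)

N, p = 1_000, 0.1
d1 = rng.binomial(1, p, size=N)   # disease 1 indicators
d2 = rng.binomial(1, p, size=N)   # disease 2 indicators, independent of d1

# association in the general population: close to zero
print(np.corrcoef(d1, d2)[0, 1])

# step 2: keep only hospitalized people (at least one disease)
hospitalized = (d1 == 1) | (d2 == 1)
d1_h, d2_h = d1[hospitalized], d2[hospitalized]

# association among hospitalized people: clearly negative (collider bias)
print(np.corrcoef(d1_h, d2_h)[0, 1])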

Berkson's paradox

simulation experiment [notebook]

result:

(array([0, 1, 1, 1, 0, 0, 1, ...]),
 array([1, 0, 0, 0, 1, 1, 0, ...]))

(output truncated: the D1 and D2 indicator vectors restricted to the hospitalized subset)

forks

B is often called a "confounder" of A and C

 

B will make A and C statistically associated even when there is no direct causal link between them

A \leftarrow B \rightarrow C

confounders

- Age is a confounder that makes Shoe Size and Reading Ability appear correlated.

 

- To eliminate this spurious correlation we must "correct" for the variable Age.

 

- e.g. check whether the association holds within groups of the same age.

\textrm {Shoe Size} \leftarrow \textrm{Age} \rightarrow \textrm {Reading Ability}

forks

Failing to account for confounders in your analyses may completely undermine the validity of your interpretations.

A \leftarrow B \rightarrow C

Bias driven by unaccounted for confounders

A new drug D is aimed at preventing heart attacks.

A clinical trial is run in which participants decide for themselves whether to adhere to the treatment.

Results:

Bias driven by unaccounted for confounders

Heart-attack rates (C = control, T = treatment):

Female: C = 0.05; T = 0.075

Male: C = 0.3; T = 0.4

Combined: C ~ 0.22; T ~ 0.18

Bad for men, bad for women, good combined???

Simpson's paradox

- Adherence to Drug treatment depends on Gender.

- Incidence of Heart Attack depends on Gender.

- Gender is a confounder that introduces a spurious correlation between Drug and Heart Attack.
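A minimal numerical sketch of how the reversal can arise. The group compositions below (control arm 1/3 female, treatment arm 2/3 female) are hypothetical, chosen only so that the per-group rates above roughly reproduce the combined rates:

# heart-attack rates per group, as given above
rate = {("F", "C"): 0.05, ("F", "T"): 0.075,
        ("M", "C"): 0.3,  ("M", "T"): 0.4}

# hypothetical composition of each arm: women adhere more often,
# so the treatment arm is mostly female and the control arm mostly male
n = {("F", "C"): 1000, ("M", "C"): 2000,   # control: 1/3 female
     ("F", "T"): 2000, ("M", "T"): 1000}   # treatment: 2/3 female

for arm in ("C", "T"):
    events = sum(rate[(sex, arm)] * n[(sex, arm)] for sex in ("F", "M"))
    total = sum(n[(sex, arm)] for sex in ("F", "M"))
    print(arm, round(events / total, 2))   # C ~ 0.22, T ~ 0.18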

Measuring Association

correlation

Informal definition: 

the extent to which two experimental variables X, Y fluctuate in a synchronized way

Correlation is a handy proxy for causation, but the two are not at all equivalent.

r_{xy} = \frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2}\sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}

where \bar{x} and \bar{y} are the sample means.

Pearson correlation

Pearson correlation

r_{xy} = \frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2}\sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}

Remarks:

- do the fluctuations typically point towards the same direction?

- r has values between -1 and 1

- positive --> positive association

- negative --> negative association
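A minimal sketch of computing r in Python; the toy vectors are made up for illustration, and numpy's corrcoef gives the same value as the definition above:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8])   # fluctuates together with x

# Pearson correlation straight from the definition
r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())

print(r)                        # close to +1: strong positive association
print(np.corrcoef(x, y)[0, 1])  # same value via numpy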

Categorical Data

            Sedentary   Active   Total
CVD            100        25      125
Healthy         15       100      115
Total          115       125      240

Contingency tables represent the frequency of each combination of categorical values.

Is the mass concentrated along any of the diagonals?

(log) odds ratio

\textrm{Odds (sedentary)} = \frac{p_{\textrm{cvd}}}{p_{\textrm{healthy}}} = \frac{p_{\textrm{cvd}}}{1 - p_{\textrm{cvd}}} = \frac{100}{15} \sim 6.67
            Sedentary   Active   Total
CVD            100        25      125
Healthy         15       100      115
Total          115       125      240
\textrm{Odds (active)} = \frac{p_{\textrm{cvd}}}{p_{\textrm{healthy}}} = \frac{p_{\textrm{cvd}}}{1 - p_{\textrm{cvd}}} = \frac{25}{100} \sim 0.25
\textrm{Odds Ratio} = \frac{\textrm{Odds (sedentary)}}{\textrm{Odds (active)}} = \frac{6.67}{0.25} \sim 26.68

(log) odds ratio

            Sedentary           Active              Total
CVD         f_{11}              f_{12}              f_{11} + f_{12}
Healthy     f_{21}              f_{22}              f_{21} + f_{22}
Total       f_{11} + f_{21}     f_{12} + f_{22}

\textrm{Odds Ratio} = \frac{\textrm{Odds (sedentary)}}{\textrm{Odds (active)}} = \frac{f_{11}f_{22}}{f_{12}f_{21}} \sim 26.68

\log(\textrm{Odds Ratio}) \sim 3.28

(log) odds ratio

Remarks:

 

0 < OR < 1 --> negative association

OR > 1 --> positive association

 

log OR < 0 --> negative association

log OR > 0 --> positive association
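A minimal sketch computing the odds ratio and log odds ratio for the sedentary/active table above; scipy's fisher_exact reports the same sample odds ratio together with an exact p-value:

import numpy as np
from scipy import stats

table = np.array([[100, 25],    # CVD:     sedentary, active
                  [15, 100]])   # Healthy: sedentary, active

odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
print(odds_ratio)           # ~26.67
print(np.log(odds_ratio))   # ~3.28

or_scipy, p = stats.fisher_exact(table)
print(or_scipy, p)          # same odds ratio, plus a p-value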

Statistical testing

Statistical Significance

To what extent can the "association" we observe be explained by pure chance?

An English lady claimed to be able to tell whether the tea or the milk was added to the cup first.

Tea tasting challenge

                Milk first: TRUE   Milk first: FALSE   Total
Lady says YES          ??                  ??             4
Lady says NO           ??                  ??             4
Total                   4                   4             8

https://en.wikipedia.org/wiki/Lady_tasting_tea

Tea tasting challenge

                Milk first: TRUE   Milk first: FALSE   Total
Lady says YES           3                   1             4
Lady says NO            1                   3             4
Total                   4                   4             8

is this performance much better than pure chance?

\log \textrm{OR} = \log \frac{3\cdot 3}{1\cdot 1} \sim 2.2

Tea tasting challenge

How many possible outcomes?

N = \binom{8}{4} = \frac{8!}{4!\;\cdot\;4!} = 70

How many outcomes have 3 successes or more?

n(x \geq 3) = n(x=3) + n(x=4) = \binom{4}{3}\binom{4}{1} + \binom{4}{4}\binom{4}{0} = 16 + 1 = 17

What are the chances that we score 3 successes or more just by chance?

\textrm{Prob}(x\geq 3) = \frac{n(x \geq 3)}{N}= \frac{17}{70} \sim 0.24
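A quick check of this computation with scipy: the one-sided Fisher exact test on the observed table gives the same probability.

from scipy import stats

table = [[3, 1],   # lady says "milk first": 3 right, 1 wrong
         [1, 3]]   # lady says "tea first":  1 wrong, 3 right

# one-sided: how likely is a table at least this favourable by chance?
_, p = stats.fisher_exact(table, alternative="greater")
print(p)   # 17/70 ~ 0.24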

Fisher's exact test

          X=1   X=0   Total
Y=1        a     b      R
Y=0        c     d      S
Total      U     V      n

Assuming the variables X, Y are not associated, suppose we are given any two binary vectors X, Y of length n such that:

 

X = {0 1 0 0 1 0 1 1 1 1 ... } with n(X=1) = U; n(X=0) = V

Y = {0 1 0 0 0 1 1 0 0 0 ... } with n(Y=1) = R; n(Y=0) = S

Fisher's exact test

          X=1   X=0   Total
Y=1        a     b      R
Y=0        c     d      S
Total      U     V      n

What is the probability that...?

n(X=1, Y=1) = a, n(X=0, Y=1) = b

n(X=1, Y=0) = c, n(X=0, Y=0) = d

\textrm{Prob}_{a,b,c,d} = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!}
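A quick way to check this formula numerically, using Python's exact integer factorials; the helper name table_prob is just for illustration:

from math import factorial

def table_prob(a, b, c, d):
    """Probability of a specific 2x2 table with fixed margins."""
    n = a + b + c + d
    num = (factorial(a + b) * factorial(c + d) *
           factorial(a + c) * factorial(b + d))
    den = (factorial(a) * factorial(b) * factorial(c) *
           factorial(d) * factorial(n))
    return num / den

print(table_prob(3, 1, 1, 3))   # 16/70 ~ 0.23
print(table_prob(4, 0, 0, 4))   # 1/70  ~ 0.014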

p-value

How often would the results of the experiment be at least as extreme as those actually observed, just by chance?

 

"How surprising the observed data are, assuming that the variables are not associated."

For instance, in the Tea Tasting Challenge:

p = \textrm{Prob}(T_{a=3}) + \textrm{Prob}(T_{a=4}) = \frac{16}{70} + \frac{1}{70} = \frac{17}{70} \sim 0.24

Fisher's test p-value

Now, suppose we run a more comprehensive Tea Tasting Challenge:

                Milk first: TRUE   Milk first: FALSE   Total
Lady says YES         a=??                b=??            20
Lady says NO          c=??                d=??            20
Total                  20                  20              40

Fisher's test p-value

What is the probability of each possible outcome if the lady picks at random?

Fisher's test p-value

Suppose that the lady had a=12 successes. How often would we get a result at least as good just by chance? Is this surprising?

p = \sum_{a=12}^{20} P(T_a) \sim 0.17

Fisher test p-value

import numpy as np
import scipy.stats

# M = 40 cups in total, n = 20 cups with milk first, N = 20 cups picked by the lady
M, n, N = 40, 20, 20
rv = scipy.stats.hypergeom(M, n, N)
x = np.arange(0, n + 1)
pmf_successes = rv.pmf(x)
p = pmf_successes[12:].sum()   # P(a >= 12) ~ 0.17

Fisher's exact test is just an application of the hypergeometric distribution.

Group comparisons

Mann-Whitney test for group comparison

[figure: expression of gene G (numerical scale) measured in two groups, group 1 and group 2]

Mann-Whitney test for group comparison

We want to assess the propensity of G to be more/less expressed depending on the group

Mann-Whitney test for group comparison

- Effect: e.g. difference between average values

- Significance: how surprising the data is if we assumed there is no association between group and expression of G.

Mann-Whitney test: significance

[figure: pooled values sorted along the scale; we see many greens to the left and many reds to the right]

Mann-Whitney test: significance

[figure: the two groups, of sizes n_1 and n_2, pooled and ranked]

Mann-Whitney test: significance

If you simulate all the possible ranking arrangements and compute the respective U-statistic, the distribution of U-statistics approximately follows a normal distribution with mean and variance:

\mu = \frac{n_1 n_2}{2}
\sigma^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}

Mann-Whitney test: significance

\mu = \frac{n_1 n_2}{2}
\sigma^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}

one-sided p-value

If the U-statistic associated with our data (\hat U) is very extreme, we will deem the association unlikely to have arisen just by chance.
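A minimal sketch of the test in scipy, assuming made-up expression values for the two groups; scipy.stats.mannwhitneyu gives the test directly, and the normal approximation above is computed by hand for comparison:

import numpy as np
from scipy import stats

# expression of gene G in two groups (toy data)
group1 = np.array([2.1, 3.5, 4.0, 4.8, 5.2, 6.1])
group2 = np.array([1.0, 1.8, 2.0, 2.6, 3.1])

res = stats.mannwhitneyu(group1, group2, alternative="greater")
print(res.statistic, res.pvalue)

# normal approximation of the U distribution under the null
n1, n2 = len(group1), len(group2)
mu = n1 * n2 / 2
sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (res.statistic - mu) / sigma
print(1 - stats.norm.cdf(z))   # one-sided p-value, close to the exact one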

Multiple Test Correction

Iterative Tea Tasting Tour

...

- Repeating the Tea Tasting Challenge many times makes it more likely that a good performance shows up at some point just by chance (see the sketch after the Bonferroni table below).

- The more repetitions, the higher the chances of a good performance by chance alone.

Bonferroni correction

Bonferroni correction:

 

lower the significance level required to reject the null hypothesis, to compensate for the larger number of opportunities.

 

If the significance level is set to some value alpha, then the corrected significance level is alpha/m, where m is the number of tests.

Bonferroni correction

test   p-value   standard   Bonferroni
#1     0.04      REJECT     NULL
#2     0.15      NULL       NULL
#3     0.9       NULL       NULL
#4     0.01      REJECT     NULL
#5     0.64      NULL       NULL
#6     0.07      NULL       NULL
#7     0.15      NULL       NULL
#8     0.64      NULL       NULL
#9     0.3       NULL       NULL

\alpha = 0.05

\hat\alpha = \frac{0.05}{9} \sim 0.0056
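A minimal sketch reproducing the table above with plain numpy, using the nine listed p-values:

import numpy as np

pvals = np.array([0.04, 0.15, 0.9, 0.01, 0.64, 0.07, 0.15, 0.64, 0.3])
alpha = 0.05
m = len(pvals)

reject_standard = pvals < alpha          # tests #1 and #4
reject_bonferroni = pvals < alpha / m    # none: threshold ~0.0056
print(reject_standard)
print(reject_bonferroni)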

Bonferroni correction

Caveats:

 

Bonferroni correction tends to yield too conservative results, particularly if there is a large number of tests or if the tests are correlated with each other.

 

The correction comes at the cost of increasing the chances of producing false negatives.

False Discovery Rate

False Discovery Rate

 

Given a significance level, the expected proportion of falsely significant tests among all tests declared significant.

 

Benjamini-Hochberg FDR

 

Given the collection of all p-values, this is a method to estimate the FDR incurred if each p-value is used as the significance cutoff.

FDR correction

q-value

 

the expected proportion of false positives among all tests that are at least as extreme as the observed one.

 

FDR correction

test   p-value   standard   q-value (FDR)
#1     0.04      REJECT     0.18
#2     0.15      NULL       0.27
#3     0.9       NULL       0.9
#4     0.01      REJECT     0.09
#5     0.64      NULL       0.72
#6     0.07      NULL       0.21
#7     0.15      NULL       0.27
#8     0.64      NULL       0.72
#9     0.3       NULL       0.45

\alpha = 0.05
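A minimal sketch reproducing these q-values with statsmodels' implementation of Benjamini-Hochberg (method='fdr_bh'), up to rounding:

import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.04, 0.15, 0.9, 0.01, 0.64, 0.07, 0.15, 0.64, 0.3])

reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(np.round(qvals, 2))   # [0.18 0.27 0.9  0.09 0.72 0.21 0.27 0.72 0.45]
print(reject)               # no test passes at FDR < 0.05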

FDR correction

- q-value computation depends on the full collection of p-values

- Suitable for analyses with many instances: e.g. analyses where you are conducting one test per gene

- Several more refined methods have been proposed in the literature, but Benjamini-Hochberg FDR continues to be widely accepted.

References

Pearl J, Mackenzie D. "The Book of Why"

James G, Witten D, Hastie T, Tibshirani R. "An Introduction to Statistical Learning"

Hastie T, Tibshirani R, Friedman J. "The Elements of Statistical Learning"
