Ferran Muiños
Institute for Research in Biomedicine
(IRB Barcelona)
Monday, 2021-02-08, 18:00
"Introduction to statistical tactics by example"
Qualitative Models
Measuring Association
Fisher test
Mann-Whitney test
Multiple test correction
Pearl & Mackenzie 2018
source: Wikipedia
An arrow from X to Y says that some
probability rule specifies how Y would change if X were to change.
- Out of 1 million children,
990,000 get vaccinated,
9,900 have the reaction,
and 99 die from it.
- Out of the same 1 million children,
10,000 don’t get vaccinated,
200 get smallpox,
40 die from the disease.
In summary,
more children die from vaccination (99) than from the disease (40).
Source: Our World in Data
fork: A ← B → C
chain: A → B → C
collider: A → B ← C
"Even if two diseases have no relation to each other in the general population, they can appear to be associated among patients in a hospital".
Joseph Berkson
Why???
Scenario 1
There is a causal association between the diseases and at least one of them is sufficient to lead to hospitalization.
Scenario 2
A single disease on its own is not severe enough to cause hospitalization.
Scenario 3: Berkson's paradox
- The diseases have no association in the general population whatsoever.
- Each disease on its own is sufficient to cause hospitalization.
- By performing the study on patients that are hospitalized, we are introducing a spurious negative association.
- This is also known as the "explain-away" effect.
simulation experiment [notebook]
assumptions:
- independent diseases
- with the same prevalence in the general population (10%)
- each disease on its own is severe enough to cause hospitalization
simulation experiment [notebook]
step 1:
- for each of N=1,000 people, assign D1=1 with probability p=0.1
- for each of the same N=1,000 people, independently assign D2=1 with probability p=0.1
| | person 1 | person 2 | ... | person N |
|---|---|---|---|---|
| D1 | 0 | 1 | ... | 0 |
| D2 | 0 | 0 | ... | 0 |
simulation experiment [notebook]
step 2:
- discard samples with D1=D2=0: these are the people who are not hospitalized
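A minimal NumPy sketch of both steps (seed and variable names are illustrative, not taken from the notebook):

import numpy as np

rng = np.random.default_rng(seed=42)   # illustrative seed
N, p = 1_000, 0.1

# step 1: two independent diseases, each with 10% prevalence
D1 = rng.binomial(1, p, size=N)
D2 = rng.binomial(1, p, size=N)

# step 2: keep only hospitalized people, i.e. those with at least one disease
hospitalized = (D1 == 1) | (D2 == 1)
D1_h, D2_h = D1[hospitalized], D2[hospitalized]

print(np.corrcoef(D1, D2)[0, 1])       # close to 0: independent in the general population
print(np.corrcoef(D1_h, D2_h)[0, 1])   # negative: the spurious association among the hospitalized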
simulation experiment [notebook]
result:
(array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, ...]),
 array([1, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...]))
(output truncated; note that after discarding D1=D2=0, D1=0 forces D2=1 and vice versa, which is what creates the spurious negative association)
In a fork A ← B → C, B is often called a "confounder" of A and C.
B will make A and C statistically associated even when there is no direct causal link between them.
- Age is a confounder that makes Shoe Size and Reading Ability appear correlated.
- To eliminate this spurious correlation we must "correct" for the variable Age,
- e.g. check whether the association is maintained within groups of the same age.
Failing to account for confounders in your analyses may completely undermine the validity of your interpretations.
A new drug D aimed at preventing heart attacks.
Clinical trial where participants decide whether to adhere to the treatment.
Results (heart-attack rates; C = control, T = treatment):

| group | C | T |
|---|---|---|
| Female | 0.05 | 0.075 |
| Male | 0.3 | 0.4 |
| Combined | ~0.22 | ~0.18 |
Bad for men, bad for women, good combined???
- Adherence to the drug treatment depends on Gender.
- Incidence of Heart Attack depends on Gender.
- Gender is a confounder that introduces a spurious correlation between Drug and Heart Attack.
Informal definition:
the extent to which two experimental variables X, Y fluctuate in a synchronized way
Correlation is a handy proxy of causation, but they are not at all equivalent.
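Presumably the r in the remarks below is Pearson's correlation coefficient, which for paired samples (x_i, y_i) is:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$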
Remarks:
- do the fluctuations typically point in the same direction?
- r has values between -1 and 1
- positive --> positive association
- negative --> negative association
| | Sedentary | Active | total |
|---|---|---|---|
| CVD | 100 | 25 | 125 |
| Healthy | 15 | 100 | 115 |
| total | 115 | 125 | 240 |
contingency tables: represent the frequency for each combination of categorical values
is the mass concentrated along any of the diagonals?
| | Sedentary | Active | total |
|---|---|---|---|
| CVD | 100 | 25 | 125 |
| Healthy | 15 | 100 | 115 |
| total | 115 | 125 | 240 |
Remarks:
0 < OR < 1 --> negative association
OR > 1 --> positive association
log OR < 0 --> negative association
log OR > 0 --> positive association
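A minimal sketch computing the OR for the CVD/activity table above (the cell names a, b, c, d follow the generic 2×2 table used later in the deck):

import math

# CVD/activity table: a = CVD & Sedentary, b = CVD & Active,
# c = Healthy & Sedentary, d = Healthy & Active
a, b, c, d = 100, 25, 15, 100
odds_ratio = (a * d) / (b * c)    # = 10000/375 ≈ 26.7, OR > 1: positive association
log_or = math.log(odds_ratio)     # > 0: positive association on the log scale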
To what extent can the "association" we observe be explained by pure chance?
An English lady claimed to be able to tell whether the tea or the milk was added first.
| | Milk first: TRUE | Milk first: FALSE | total |
|---|---|---|---|
| Lady says YES | ?? | ?? | 4 |
| Lady says NO | ?? | ?? | 4 |
| total | 4 | 4 | 8 |
https://en.wikipedia.org/wiki/Lady_tasting_tea
| | Milk first: TRUE | Milk first: FALSE | total |
|---|---|---|---|
| Lady says YES | 3 | 1 | 4 |
| Lady says NO | 1 | 3 | 4 |
| total | 4 | 4 | 8 |
is this performance much better than pure chance?
How many possible outcomes?
How many outcomes have 3 successes or more?
What are the chances that we score 3 successes or more just by chance?
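These three questions can be answered with standard combinatorics; a minimal sketch (the scipy call mirrors the one used later in the deck):

from math import comb
from scipy.stats import hypergeom

print(comb(8, 4))            # 70 possible outcomes: ways to pick 4 "milk first" cups out of 8
# outcomes with 3 or more successes: C(4,3)*C(4,1) + C(4,4)*C(4,0) = 16 + 1 = 17
rv = hypergeom(8, 4, 4)      # M=8 cups, n=4 milk-first, N=4 cups the lady calls YES
p = rv.pmf(3) + rv.pmf(4)    # P(3 or more successes) = 17/70 ≈ 0.243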
| | X=1 | X=0 | total |
|---|---|---|---|
| Y=1 | a | b | R |
| Y=0 | c | d | S |
| total | U | V | n |
Assuming the variables X, Y are not associated, suppose we are given any two binary vectors X, Y of length n such that:
X = {0 1 0 0 1 0 1 1 1 1 ...} with n(X=1) = U; n(X=0) = V
Y = {0 1 0 0 0 1 1 0 0 0 ...} with n(Y=1) = R; n(Y=0) = S
| | X=1 | X=0 | total |
|---|---|---|---|
| Y=1 | a | b | R |
| Y=0 | c | d | S |
| total | U | V | n |
What is the probability that...?
n(X=1, Y=1) = a, n(X=0, Y=1) = b
n(X=1, Y=0) = c, n(X=0, Y=0) = d
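Under the null hypothesis (no association, all margins fixed), the standard counting argument gives the hypergeometric probability:

$$ P(a, b, c, d) = \frac{\binom{R}{a}\binom{S}{c}}{\binom{n}{U}} = \frac{R!\,S!\,U!\,V!}{n!\,a!\,b!\,c!\,d!} $$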
How often the results of the experiment would be at least as extreme as the results actually observed, just by chance.
"How surprising the observed data is, assuming that the variables are not associated."
For instance, in the Tea Tasting Challenge...
Now, suppose we run a more comprehensive Tea Tasting Challenge
| | Milk first: TRUE | Milk first: FALSE | total |
|---|---|---|---|
| Lady says YES | a=?? | b=?? | 20 |
| Lady says NO | c=?? | d=?? | 20 |
| total | 20 | 20 | 40 |
What are the probabilities of each possible outcome if we pick at random?
Suppose that the lady had a=12 successes. How often would we get a result at least as good just by chance? Is this surprising?
import numpy as np
from scipy.stats import hypergeom

M, n, N = 40, 20, 20          # total cups, milk-first cups, cups the lady labels YES
rv = hypergeom(M, n, N)
x = np.arange(0, n + 1)       # possible numbers of successes: 0, 1, ..., 20
pmf_successes = rv.pmf(x)
p = pmf_successes[12:].sum()  # P(12 or more successes) just by chance
The Fisher test is just an application of the so-called hypergeometric distribution.
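For reference, scipy ships this as a built-in exact test; a minimal cross-check of the a=12 scenario above:

from scipy.stats import fisher_exact

# 2x2 table for a=12 successes: [[a, b], [c, d]] = [[12, 8], [8, 12]]
odds_ratio, p = fisher_exact([[12, 8], [8, 12]], alternative='greater')
print(p)   # same one-sided p-value as the hypergeometric sum above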
numerical scale: e.g. expression of gene G, measured in two groups (group 1 and group 2)
We want to assess the propensity of G to be more/less expressed depending on the group
- Effect: e.g. difference between average values
- Significance: how surprising the data is if we assume there is no association between group and expression of G.
[figure: pooled values sorted along the numerical scale; many greens (group 1) fall to the left, many reds (group 2) to the right]
If you simulate all the possible ranking arrangements and compute the respective U-statistics, their distribution approximately follows a normal distribution.
one-sided p-value
If the U-statistic associated with our data (Û) is very extreme, we will deem the association unlikely to arise just by chance.
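A minimal scipy sketch (the data here is synthetic, purely to illustrate the call):

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
group1 = rng.normal(loc=5.0, size=30)   # e.g. expression of G in group 1
group2 = rng.normal(loc=6.0, size=30)   # shifted upwards in group 2
u_stat, p = mannwhitneyu(group1, group2, alternative='less')   # one-sided p-value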
...
- Running the Tea Tasting Challenge many times makes it more likely that a good performance shows up at some point just by chance.
- The more repetitions, the higher the chances of a good performance.
Bonferroni correction:
adjust the significance level for rejecting the null hypothesis to compensate for the larger number of opportunities.
If the significance level is set to some value alpha, then the new significance level is alpha/m, where m is the number of tests.
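A minimal sketch, assuming alpha = 0.05 and the m = 9 p-values from the table below:

alpha, m = 0.05, 9                          # significance level and number of tests
threshold = alpha / m                       # ≈ 0.0056
pvals = [0.04, 0.15, 0.9, 0.01, 0.64, 0.07, 0.15, 0.64, 0.3]
rejected = [p < threshold for p in pvals]   # all False: even p=0.01 no longer rejects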
| test | p-value | standard | Bonferroni |
|---|---|---|---|
| #1 | 0.04 | REJECT | NULL |
| #2 | 0.15 | NULL | NULL |
| #3 | 0.9 | NULL | NULL |
| #4 | 0.01 | REJECT | NULL |
| #5 | 0.64 | NULL | NULL |
| #6 | 0.07 | NULL | NULL |
| #7 | 0.15 | NULL | NULL |
| #8 | 0.64 | NULL | NULL |
| #9 | 0.3 | NULL | NULL |
Caveats:
Bonferroni correction tends to be too conservative, particularly when there is a large number of tests or when the tests are correlated with each other.
The correction comes at the cost of increasing the chances of producing false negatives.
False Discovery Rate
Given a significance level, the expected proportion of false positives among all tests called significant.
Benjamini-Hochberg FDR
Given the collection of all p-values, this is a method to estimate the FDR associated with each p-value.
q-value
the expected proportion of false positives among all tests that are as extreme as or more extreme than the observed one.
| test | p-value | standard | q-value (FDR) |
|---|---|---|---|
| #1 | 0.04 | REJECT | 0.18 |
| #2 | 0.15 | NULL | 0.27 |
| #3 | 0.9 | NULL | 0.9 |
| #4 | 0.01 | REJECT | 0.09 |
| #5 | 0.64 | NULL | 0.72 |
| #6 | 0.07 | NULL | 0.21 |
| #7 | 0.15 | NULL | 0.27 |
| #8 | 0.64 | NULL | 0.72 |
| #9 | 0.3 | NULL | 0.45 |
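A minimal sketch of the Benjamini-Hochberg computation (the bh_qvalues helper is illustrative, not from the slides); it reproduces the q-values in the table above:

import numpy as np

def bh_qvalues(pvals):
    # Benjamini-Hochberg: sort the p-values, scale p_(i) by m/i,
    # then enforce monotonicity from the largest p-value downwards
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

pvals = [0.04, 0.15, 0.9, 0.01, 0.64, 0.07, 0.15, 0.64, 0.3]
print(bh_qvalues(pvals).round(2))   # [0.18 0.27 0.9  0.09 0.72 0.21 0.27 0.72 0.45]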
- q-value computation depends on the full collection of p-values
- Suitable for analyses with many instances: e.g. analyses where you are conducting one test per gene
- Several more refined methods have been proposed in the literature, but Benjamini-Hochberg FDR remains widely used and accepted.
Pearl J, Mackenzie D. "The Book of Why" (2018)
James G, Witten D, Hastie T, Tibshirani R. "An Introduction to Statistical Learning"
Hastie T, Tibshirani R, Friedman J. "The Elements of Statistical Learning"