Basic Data Analysis
Ferran Muiños
Institute for Research in Biomedicine
(IRB Barcelona)
Monday, 2021-02-08, 18:00
Aim
"Introduction to statistical tactics by example"
- Qualitative Models
- Measuring Association
- Fisher test
- Mann-Whitney test
- Multiple test correction
Qualitative Models
Causal diagrams
Pearl & Mackenzie 2018
source: Wikipedia
An arrow from X to Y says that some probability rule specifies how Y would change if X were to change.
Causal diagrams
- Out of 1 million children, 990,000 get vaccinated, 9,900 have the reaction, and 99 die from it.
- Out of 1 million children, 10,000 don't get vaccinated, 200 get smallpox, and 40 die from the disease.
In summary, more children die from vaccination (99) than from the disease (40).
Source: Our World in Data
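A quick per-child risk check with the numbers above (a minimal sketch; this computation is not on the original slide):

```python
# counts from the slide (per 1 million children)
vaccinated, vaccine_deaths = 990_000, 99
unvaccinated, smallpox_deaths = 10_000, 40

risk_if_vaccinated = vaccine_deaths / vaccinated        # 99 / 990,000 = 0.0001 (1 in 10,000)
risk_if_unvaccinated = smallpox_deaths / unvaccinated   # 40 / 10,000  = 0.004  (1 in 250)
print(risk_if_unvaccinated / risk_if_vaccinated)        # per-child risk is ~40x higher without the vaccine
```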
Junction patterns
fork
chain
collider
An example of collider bias: Berkson's paradox
"Even if two diseases have no relation to each other in the general population, they can appear to be associated among patients in a hospital".
Joseph Berkson
Why???
Scenario 1
There is a causal association between the diseases, and at least one of them is sufficient to lead to hospitalization.
Why would two diseases appear to be associated?
Scenario 2
A single disease is not severe enough to cause hospitalization
Why would two diseases appear to be associated?
Scenario 3: Berkson's paradox
- The diseases have no association in the general population whatsoever.
- Both diseases are sufficient to cause hospitalization
- By performing the study on patients that are hospitalized, we are introducing a spurious negative association.
- This is also known as the "explaining-away" effect.
Why would two diseases appear to be associated?
Berkson's paradox
simulation experiment [notebook]
assumptions:
- independent diseases
- both have the same prevalence in the general population (10%)
- each is severe enough to cause hospitalization
Berkson's paradox
simulation experiment [notebook]
step 1:
- for N=1,000 people, randomly assign D1 with probability p=0.1
- for the same people, randomly assign D2 with probability p=0.1
| | person 1 | person 2 | ... | person N |
|---|---|---|---|---|
| D1 | 0 | 1 | ... | 0 |
| D2 | 0 | 0 | ... | 0 |
Berkson's paradox
simulation experiment [notebook]
step 2:
- discard samples with D1=D2=0
these are the people who are not hospitalized
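A minimal sketch of the simulation described above (the original notebook is not shown here; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 1_000, 0.1

# step 1: two independent diseases, each with 10% prevalence
d1 = rng.binomial(1, p, size=N)
d2 = rng.binomial(1, p, size=N)

# step 2: keep only hospitalized people, i.e. discard samples with D1 = D2 = 0
hospitalized = (d1 == 1) | (d2 == 1)
d1_h, d2_h = d1[hospitalized], d2[hospitalized]

print(np.corrcoef(d1, d2)[0, 1])      # general population: close to 0
print(np.corrcoef(d1_h, d2_h)[0, 1])  # hospitalized only: clearly negative
```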
Berkson's paradox
simulation experiment [notebook]
result:
[notebook output: the D1 and D2 indicator vectors restricted to the hospitalized samples]
forks
B is often called a "confounder" of A and C
B will make A and C statistically associated even when there is no direct causal link between them
confounders
- Age is a confounder that makes Shoe Size and Reading Ability correlated.
- To eliminate this spurious correlation we must "correct" for the variable Age,
- e.g. check whether the association still holds within groups of the same age.
forks
Failing to account for confounders in your analyses may completely undermine the validity of your interpretations.
Bias driven by unaccounted for confounders
New drug D aimed to prevent heart attack.
Clinical trial where participants decide whether to adhere to the treatment.
Results:
Bias driven by unaccounted for confounders
Heart-attack rate by group (C = control, T = treatment):
Female: C = 0.05; T = 0.075
Male: C = 0.3; T = 0.4
Combined: C ≈ 0.22; T ≈ 0.18
Bad for men, bad for women, good combined???
Simpson's paradox
- Adherence to Drug treatment depends on Gender.
- Incidence of Heart Attack depends on Gender.
- Gender is a confounder that introduces a spurious correlation between Drug and Heart Attack.
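A minimal sketch of how the combined rates can flip. The subgroup sizes below are illustrative assumptions (they are not given on the slides), chosen only so that the combined rates come out close to C ≈ 0.22 and T ≈ 0.18:

```python
# heart-attack rates from the slides (C = control, T = treatment)
rate = {("F", "C"): 0.05, ("F", "T"): 0.075,
        ("M", "C"): 0.30, ("M", "T"): 0.40}

# assumed group sizes: women adhere to the treatment far more often than men
size = {("F", "C"): 30, ("F", "T"): 70,
        ("M", "C"): 70, ("M", "T"): 30}

for arm in ("C", "T"):
    events = sum(rate[(g, arm)] * size[(g, arm)] for g in ("F", "M"))
    total = sum(size[(g, arm)] for g in ("F", "M"))
    print(arm, round(events / total, 3))   # C: 0.225, T: 0.173
```

Within each gender the treatment looks worse, but because the treated group is dominated by the low-risk gender, the combined treated rate ends up lower.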
Measuring
Association
correlation
Informal definition:
the extent to which two experimental variables X, Y fluctuate in a synchronized way
Correlation is a handy proxy for causation, but the two are by no means equivalent.
Pearson correlation
Pearson correlation
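The formula behind this slide is presumably the standard sample Pearson correlation coefficient:

$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$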
Remarks:
- do the fluctuations typically point in the same direction?
- r takes values between -1 and 1
- positive --> positive association
- negative --> negative association
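A minimal numpy check of these remarks (toy vectors, not from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + np.array([0.1, -0.2, 0.0, 0.15, -0.05])   # fluctuates with x
y_down = -y_up                                           # fluctuates against x

print(np.corrcoef(x, y_up)[0, 1])     # close to +1
print(np.corrcoef(x, y_down)[0, 1])   # close to -1
```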
Categorical Data
| | Sedentary | Active | |
|---|---|---|---|
| CVD | 100 | 25 | 125 |
| Healthy | 15 | 100 | 115 |
| | 115 | 125 | 240 |
contingency tables: represent the frequency for each combination of categorical values
is the mass concentrated along any of the diagonals?
(log) odds ratio
| | Sedentary | Active | |
|---|---|---|---|
| CVD | 100 | 25 | 125 |
| Healthy | 15 | 100 | 115 |
| | 115 | 125 | 240 |
(log) odds ratio
| | Sedentary | Active | |
|---|---|---|---|
| CVD | a | b | |
| Healthy | c | d | |
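With the counts from the table above, the odds ratio OR = (a·d)/(b·c) and its logarithm can be computed as follows (a minimal sketch):

```python
import numpy as np

# contingency table from the slide: rows CVD/Healthy, columns Sedentary/Active
a, b = 100, 25
c, d = 15, 100

odds_ratio = (a * d) / (b * c)   # (100*100) / (25*15) ≈ 26.7
log_or = np.log(odds_ratio)      # ≈ 3.3
print(odds_ratio, log_or)
```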
(log) odds ratio
Remarks:
0 < OR < 1 --> negative association
OR > 1 --> positive association
log OR < 0 --> negative association
log OR > 0 --> positive association
Statistical testing
Statistical Significance
To what extent can the "association" we observe be explained by pure chance?
An English lady claimed to be able to tell whether the tea or the milk was added first.
Tea tasting challenge
| Milk first? | TRUE | FALSE | |
|---|---|---|---|
| Lady says YES | ?? | ?? | 4 |
| Lady says NO | ?? | ?? | 4 |
| | 4 | 4 | 8 |
https://en.wikipedia.org/wiki/Lady_tasting_tea
Tea tasting challenge
| | TRUE | FALSE | |
|---|---|---|---|
| Lady says YES | 3 | 1 | 4 |
| Lady says NO | 1 | 3 | 4 |
| | 4 | 4 | 8 |
is this performance much better than pure chance?
Tea tasting challenge
How many possible outcomes?
How many outcomes have 3 successes or more?
What are the chances that we score 3 successes or more just by chance?
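A quick way to answer these questions with scipy (a minimal sketch; the design is the 8-cup table above, with 4 milk-first cups):

```python
from math import comb
from scipy import stats

print(comb(8, 4))              # 70 possible ways to pick the 4 "milk first" cups

# when guessing, the number of correct "milk first" calls follows a hypergeometric law
rv = stats.hypergeom(8, 4, 4)  # M=8 cups, n=4 milk-first, N=4 cups labelled YES
p_3_or_more = rv.pmf(3) + rv.pmf(4)
print(p_3_or_more)             # (16 + 1) / 70 ≈ 0.243
```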
Fisher's exact test
| | X=1 | X=0 | |
|---|---|---|---|
| Y=1 | a | b | R |
| Y=0 | c | d | S |
| | U | V | n |
Assuming the variables X, Y are not associated, suppose we are given any two binary vectors X, Y of length n such that:
X = {0 1 0 0 1 0 1 1 1 1 ...} with n(X=1) = U; n(X=0) = V
Y = {0 1 0 0 0 1 1 0 0 0 ...} with n(Y=1) = R; n(Y=0) = S
Fisher's exact test
| | X=1 | X=0 | |
|---|---|---|---|
| Y=1 | a | b | R |
| Y=0 | c | d | S |
| | U | V | n |
What is the probability that...?
n(X=1, Y=1) = a, n(X=0, Y=1) = b
n(X=1, Y=0) = c, n(X=0, Y=0) = d
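Under the assumption of no association, and with the margins R, S, U, V, n fixed, this probability is given by the hypergeometric formula that underlies Fisher's exact test:

$$ P(a,b,c,d) = \frac{\binom{R}{a}\binom{S}{c}}{\binom{n}{U}} = \frac{R!\,S!\,U!\,V!}{n!\,a!\,b!\,c!\,d!} $$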
p-value
How often the results of the experiment would be at least as extreme as the results actually observed, just by chance.
"How surprising the observed data are, assuming that the variables are not associated."
For instance, in the Tea Tasting Challenge...
Fisher's test p-value
Now, suppose we run a more comprehensive Tea Tasting Challenge
| | TRUE | FALSE | |
|---|---|---|---|
| Lady says YES | a=?? | b=?? | 20 |
| Lady says NO | c=?? | d=?? | 20 |
| | 20 | 20 | 40 |
Fisher's test p-value
What is the probability of each possible outcome if the lady just picks at random?
Fisher's test p-value
Suppose that the lady had a=12 successes. How often would we get a result at least as good just by chance? Is this surprising?
Fisher test p-value
import numpy as np
import scipy.stats

M, n, N = 40, 20, 20                  # total cups, milk-first cups, cups the lady labels "milk first"
rv = scipy.stats.hypergeom(M, n, N)
x = np.arange(0, n + 1)               # possible numbers of successes: 0, 1, ..., 20
pmf_successes = rv.pmf(x)
p = pmf_successes[12:].sum()          # P(12 or more successes just by chance)
Fisher's exact test is just an application of the so-called hypergeometric distribution
Group comparisons
Mann-Whitney test for group comparison
[figure: expression of gene G on a numerical scale, shown separately for group 1 and group 2]
Mann-Whitney test for group comparison
We want to assess the propensity of G to be more/less expressed depending on the group
Mann-Whitney test for group comparison
- Effect: e.g. difference between average values
- Significance: how surprising the data is if we assumed there is no association between group and expression of G.
Mann-Whitney test: significance
[figure: pooled ranking of both groups; many greens to the left, many reds to the right]
Mann-Whitney test: significance
Mann-Whitney test: significance
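The statistic itself (not shown in this export) is presumably the standard Mann-Whitney U, computed from the ranks of the pooled values. With R1, R2 the rank sums and n1, n2 the group sizes:

$$ U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \qquad U_2 = R_2 - \frac{n_2(n_2+1)}{2}, \qquad U = \min(U_1, U_2) $$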
If you enumerate all the possible rank arrangements and compute the respective U-statistic for each, the resulting distribution of U-statistics approximately follows a normal distribution
Mann-Whitney test: significance
one-sided p-value
If the U-statistic associated with our data (Û) is very extreme, we deem the association unlikely to have arisen just by chance
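In practice the test is usually run with scipy; a minimal sketch with made-up expression values for gene G (the data below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(loc=5.0, scale=1.0, size=30)   # expression of G in group 1
group2 = rng.normal(loc=6.0, scale=1.0, size=30)   # expression of G in group 2

# one-sided test: is G expressed at lower values in group 1 than in group 2?
u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative="less")
print(u_stat, p_value)
```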
Multiple Test Correction
Iterative Tea Tasting Tour
...
- Running the Tea Tasting Challenge many times makes it more likely that a good performance will show up at some point just by chance
- The more repetitions, the higher the chances of such a good performance
Bonferroni correction
Bonferroni correction:
lower the significance level required to reject the null hypothesis, to compensate for the larger number of opportunities.
If the significance level is set to some value alpha, then the new significance level is alpha/m, where m is the number of tests
Bonferroni correction
| test | p-value | standard | Bonferroni |
|---|---|---|---|
| #1 | 0.04 | REJECT | NULL |
| #2 | 0.15 | NULL | NULL |
| #3 | 0.9 | NULL | NULL |
| #4 | 0.01 | REJECT | NULL |
| #5 | 0.64 | NULL | NULL |
| #6 | 0.07 | NULL | NULL |
| #7 | 0.15 | NULL | NULL |
| #8 | 0.64 | NULL | NULL |
| #9 | 0.3 | NULL | NULL |
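A minimal sketch reproducing the "Bonferroni" column above (alpha = 0.05 is assumed; it is not stated on the slide):

```python
import numpy as np

pvals = np.array([0.04, 0.15, 0.9, 0.01, 0.64, 0.07, 0.15, 0.64, 0.3])
alpha = 0.05

standard = pvals < alpha                  # rejects tests #1 and #4
bonferroni = pvals < alpha / len(pvals)   # threshold 0.05/9 ≈ 0.0056: nothing is rejected
print(standard, bonferroni)
```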
Bonferroni correction
Caveats:
The Bonferroni correction tends to be too conservative, particularly when the number of tests is large or when the tests are correlated with each other.
The correction comes at the cost of increasing the chances of producing false negatives.
False Discovery Rate
False Discovery Rate
Given a significance level, the expected proportion of falsely significant tests (false positives) among all the tests called significant.
Benjamini-Hochberg FDR
Given the collection of all p-values, this is a method to estimate the FDR associated with each p-value.
FDR correction
q-value
expected proportion of false positives among all tests that are as extreme as or more extreme than the observed one.
FDR correction
| test | p-value | standard | q-value (FDR) |
|---|---|---|---|
| #1 | 0.04 | REJECT | 0.18 |
| #2 | 0.15 | NULL | 0.27 |
| #3 | 0.9 | NULL | 0.9 |
| #4 | 0.01 | REJECT | 0.09 |
| #5 | 0.64 | NULL | 0.72 |
| #6 | 0.07 | NULL | 0.21 |
| #7 | 0.15 | NULL | 0.27 |
| #8 | 0.64 | NULL | 0.72 |
| #9 | 0.3 | NULL | 0.45 |
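A minimal sketch reproducing the q-values above with statsmodels (a hand-rolled Benjamini-Hochberg implementation would work just as well):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.04, 0.15, 0.9, 0.01, 0.64, 0.07, 0.15, 0.64, 0.3])
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(np.round(qvals, 2))   # [0.18 0.27 0.9  0.09 0.72 0.21 0.27 0.72 0.45]
```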
FDR correction
- q-value computation depends on the full collection of p-values
- Suitable for analyses with many instances: e.g. analyses where you are conducting one test per gene
- Several more refined methods have been proposed in the literature, but Benjamini-Hochberg FDR remains widely used and accepted.
References
Pearl J, Mackenzie D. "The Book of Why" (2018)
James G, Witten D, Hastie T, Tibshirani R. "An Introduction to Statistical Learning"
Hastie T, Tibshirani R, Friedman J. "The Elements of Statistical Learning"