"Introduction to statistical tactics by example"
We introduce some common tactics in data analysis:
- Correlation
- Qualitative models
- Fisher's test for categorical data
- Regression
Informal definition:
the extent to which two experimental variables X, Y fluctuate in a synchronized way.
Correlation is a handy proxy for causation, but the two are not equivalent:
https://www.tylervigen.com/spurious-correlations
| | Sedentary YES | Sedentary NO | |
|---|---|---|---|
| CVD YES | 100 | 25 | 125 |
| CVD NO | 15 | 100 | 115 |
| | 115 | 125 | 240 |
For categorical data we use contingency tables.
Is the mass concentrated in either of the diagonals?
log Odds Ratio:

$$\log \mathrm{OR} = \log \frac{a \cdot d}{b \cdot c}$$

where a, b, c, d are the four cell counts of the contingency table. Values far from 0 indicate association; for the table above, $\log \mathrm{OR} = \log \frac{100 \cdot 100}{25 \cdot 15} \approx 3.28$.
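A minimal sketch of this computation in NumPy (cell counts taken from the table above):

```python
import numpy as np

# Cell counts: a, b = CVD YES row; c, d = CVD NO row
a, b, c, d = 100, 25, 15, 100

log_or = np.log((a * d) / (b * c))
print(log_or)  # ~3.28: strong association between Sedentary and CVD
```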
There are many other options available to detect association between categorical variables:
- Fisher's test -- wait and see.
- Logistic regression -- wait and see.
- Chi-squared-statistic based methods -- like Cramér's V, Phi coefficient, etc. (sketched below).
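As a sketch of the chi-squared route, using SciPy on the Sedentary/CVD table above (Cramér's V is computed by hand from the chi-squared statistic):

```python
import numpy as np
from scipy import stats

table = np.array([[100, 25],
                  [15, 100]])

# Chi-squared test of independence
chi2, p, dof, expected = stats.chi2_contingency(table)

# Cramér's V: the chi-squared statistic rescaled to [0, 1]
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(p, v)  # tiny p-value, V well above 0
```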
For numerical data...

Pearson's correlation:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}$$
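A short sketch checking the definition against NumPy's built-in (the synthetic data is an assumption of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)  # y depends linearly on x, plus noise

# Pearson's r straight from the definition...
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

# ...agrees with the library shortcut
print(r, np.corrcoef(x, y)[0, 1])
```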
Pearl & Mackenzie 2018
fork
chain
collider
Berkson's paradox:
even if two diseases have no relation to each other in the general population, they can appear to be associated among patients in a hospital.
Joseph Berkson
What kind of associations? Why?
Scenario 1:
There is a causal relationship between the diseases. This could be true for some pairs D1, D2.
Scenario 2:
A single disease is not severe enough to cause hospitalization, so hospitalized patients tend to carry both diseases at once.
Scenario 3:
By performing the study on patients that are hospitalized, we are controlling for the variable Hospitalization, and this leads to a spurious negative correlation (also known as the "explaining-away" effect).
Let's examine this scenario with a simple simulation experiment...
coin-flipping simulation:
- independent diseases with prescribed prevalences
- each is severe enough to lead to hospitalization
```python
import numpy as np

binary_outcomes = [0, 1]
p1, p2 = 0.1, 0.1  # prevalences of d1, d2

# One independent coin flip per individual for each disease
d1 = np.random.choice(binary_outcomes, size=1000, p=[1 - p1, p1])
d2 = np.random.choice(binary_outcomes, size=1000, p=[1 - p2, p2])
```

D1 = [0 1 0 0 1 0 1 0 1 0 0 0 0 ...]
D2 = [0 0 0 0 0 1 0 0 1 0 1 0 0 ...]
run 100 simulations...
Coin-flipping simulation (cont.):
now select those samples that have at least one disease and check the correlation among those samples only.

```python
# Keep only the individuals with at least one disease (the "hospitalized" ones)
zipped = [z for z in zip(d1, d2) if z[0] + z[1] >= 1]
d1, d2 = map(np.array, zip(*zipped))
```
(array([0, 1, 1, 1, 0, ...]), array([1, 0, 0, 0, 1, ...]))  # 209 individuals retained
| | D1=YES | D1=NO |
|---|---|---|
| D2=YES | 13 | 109 |
| D2=NO | 87 | 0 |
In the restricted sample the diseases seem to repel each other, even though no such relationship exists in the general population!
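The whole effect fits in a few lines; a minimal sketch (seed and prevalences are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)
d1 = rng.choice([0, 1], size=1000, p=[0.9, 0.1])
d2 = rng.choice([0, 1], size=1000, p=[0.9, 0.1])

print(np.corrcoef(d1, d2)[0, 1])  # ~0: independent in the general population

hospitalized = (d1 + d2) >= 1     # condition on having at least one disease
print(np.corrcoef(d1[hospitalized], d2[hospitalized])[0, 1])  # clearly negative
```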
In a fork A ← B → C, B is often called a confounder of A and C.
B will make A and C statistically correlated even when there is no direct causal link between them.
Age is a confounder that makes Shoe Size and Reading Ability appear correlated.
To eliminate this spurious correlation we must correct for Age.
E.g. within a single age group (7-year-olds) we will not see any correlation.
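A quick simulation of this fork (all effect sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.integers(5, 12, size=2000)                 # the confounder
shoe_size = 25 + 0.8 * age + rng.normal(0, 1, 2000)  # driven by age
reading = 10 * age + rng.normal(0, 5, 2000)          # also driven by age

print(np.corrcoef(shoe_size, reading)[0, 1])         # strong spurious correlation

seven = age == 7                                     # correct for age: fix one age group
print(np.corrcoef(shoe_size[seven], reading[seven])[0, 1])  # ~0
```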
New drug D aimed to prevent heart attack.
Clinical trial where participants decide whether to adhere to the treatment or not.
Bad for men, bad for women, good for "people"?
- Adherence to the Drug treatment depends on Gender.
- Incidence of Heart Attack depends on Gender.
- Gender is a confounder that introduces a spurious correlation between Drug and Heart Attack (Simpson's paradox).
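A toy numerical example of the reversal (all counts are invented for illustration; pairs are "heart attacks, participants"):

```python
# Within each gender the drug looks bad...
men_drug,   men_nodrug   = (30, 40),  (210, 300)  # 75% vs 70%
women_drug, women_nodrug = (60, 300), (6, 40)     # 20% vs 15%

# ...but pooled over gender the comparison flips, because the low-risk
# group (women) adhered to the treatment far more often
pooled_drug   = (30 + 60) / (40 + 300)    # ~26%
pooled_nodrug = (210 + 6) / (300 + 40)    # ~64%: drug looks "good for people"
```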
A lady claimed to be able to tell whether the tea or the milk was added first.
| | Milk first: TRUE | Milk first: FALSE | |
|---|---|---|---|
| Lady says YES | ?? | ?? | 4 |
| Lady says NO | ?? | ?? | 4 |
| | 4 | 4 | 8 |
https://en.wikipedia.org/wiki/Lady_tasting_tea
| | Milk first: TRUE | Milk first: FALSE | |
|---|---|---|---|
| Lady says YES | 3 | 1 | 4 |
| Lady says NO | 1 | 3 | 4 |
| | 4 | 4 | 8 |
Is this outcome much better than pure chance?
How many possible outcomes?
How many outcomes with 3 successes or more?
How often would we score 3 successes or more just by chance?
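A quick count answers all three questions (the lady knows exactly 4 cups are milk-first, so she always picks 4):

```python
from math import comb

total = comb(8, 4)                                 # 70 possible outcomes
at_least_3 = comb(4, 3) * comb(4, 1) + comb(4, 4)  # 16 + 1 = 17 score 3 or more
print(at_least_3 / total)                          # ~0.243: not very surprising
```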
| T(a,b,c,d) | X=1 | X=0 | |
|---|---|---|---|
| Y=1 | a | b | a+b |
| Y=0 | c | d | c+d |
| | a+c | b+d | n |
Suppose we randomly sample binary vectors X, Y of length n:
X = [0 1 0 0 1 0 1 1 1 1 ...] with #{1} = a+c; #{0} = b+d
Y = [0 1 0 0 0 1 1 0 0 0 ...] with #{1} = a+b; #{0} = c+d
What are the chances of getting this table? The margins are fixed, so the count a determines everything, and

$$P(T) = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}}$$
p-value: how often the results of the experiment would be at least as extreme as the results actually observed, just by chance.
"A measure of how surprising the observed result is, assuming the null hypothesis."
For instance, in the lady tasting challenge...
Suppose we run a more comprehensive lady tasting challenge:
| | Milk first: TRUE | Milk first: FALSE | |
|---|---|---|---|
| Lady says YES | a=?? | ?? | 20 |
| Lady says NO | ?? | ?? | 20 |
| | 20 | 20 | 40 |
What is the probability of a successes just by chance?
Suppose that the lady had a=12 successes. How often would we get a result at least as good? Is this very surprising?
```python
import numpy as np
from scipy import stats

M, n, N = 40, 20, 20          # cups in total, milk-first cups, cups the lady labels YES
rv = stats.hypergeom(M, n, N)

x = np.arange(0, n + 1)
pmf_successes = rv.pmf(x)     # probability of exactly k successes, k = 0..20
p = pmf_successes[12:].sum()  # P(12 or more successes) ~ 0.17
```
Fisher's test is just an application of the so-called hypergeometric distribution
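SciPy packages this directly; a sketch on the original 8-cup table, where the one-sided p-value matches the 17/70 hand count from before:

```python
from scipy import stats

# Rows: lady says YES/NO; columns: milk-first TRUE/FALSE
odds_ratio, p = stats.fisher_exact([[3, 1], [1, 3]], alternative='greater')
print(p)  # ~0.243, same as 17/70
```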
[Scatter plots: growth rate against temperature, pH, pressure, and nutrients]
What patterns do we see?
Perfect fit! But do not expect it to fit new data :(
A model flexible enough to pass through every point is overfitting the sample.
Parametric methods:
- Linear: $y = b_0 + b_1 x$
- Quadratic: $y = b_0 + b_1 x + b_2 x^2$

Smoothers (non-parametric):
- Average smoothing: the fitted value at $x$ is the mean response of the nearby samples.
- Nadaraya-Watson: a kernel-weighted average with a bell-shaped kernel $K_h$,

$$\hat{m}(x) = \frac{\sum_i K_h(x - x_i)\, y_i}{\sum_i K_h(x - x_i)}$$
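A minimal Nadaraya-Watson sketch with a Gaussian kernel (the synthetic data and bandwidth are choices of mine):

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h=0.5):
    """Kernel-weighted average of y at each point of x_grid (Gaussian kernel, bandwidth h)."""
    # weights[i, j] = K_h(x_grid[i] - x[j])
    weights = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / h) ** 2)
    return (weights * y).sum(axis=1) / weights.sum(axis=1)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + rng.normal(0, 0.3, 80)

x_grid = np.linspace(0, 10, 200)
y_smooth = nadaraya_watson(x_grid, x, y)  # follows the sine without a parametric form
```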
Input:
- Covariates = Numerical
- Response = Numerical

What for?
- Understanding the association between the covariates and the response variable.
- Prediction of the expected value of the response given the values of the covariates.
What if we repeated the experiment? The fitted curve would come out different each time: the estimated coefficients are themselves random variables.
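A sketch of that idea: refit a straight line to 100 replicated experiments and look at the spread of the estimated slope (true parameters and noise level are my choices):

```python
import numpy as np

rng = np.random.default_rng(2)

slopes = []
for _ in range(100):
    x = rng.uniform(0, 10, 50)
    y = 1.5 * x + 2 + rng.normal(0, 2, 50)   # true slope 1.5, intercept 2
    slope, intercept = np.polyfit(x, y, deg=1)
    slopes.append(slope)

print(np.mean(slopes), np.std(slopes))  # centered near 1.5, with visible spread
```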
Input:
- Covariates = Numerical or Categorical
- Response = Categorical Binary: {False, True} or {0, 1}

What for?
- Understanding the association between the response variable and the covariates.
- Classification -- probability that the response has the True/False class value given the values of the covariates.
Logistic Regression fits a sigmoid function of the form:

$$p(y=1 \mid x) = \frac{1}{1 + e^{-(ax + b)}}$$

How do we interpret the parameters?
For a sample with covariate value x we denote by p(y=1|x) the probability that the sample has response class 1.
The association between covariate values and this probability is given by the following expression (the logit, or log-odds):

$$\log \frac{p(y=1 \mid x)}{1 - p(y=1 \mid x)} = ax + b$$

- a = increase of the logit for a 1-unit increase of the covariate.
- b = log-odds when x = 0.
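A tiny numerical check of this interpretation (the function names and parameter values are mine):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def logit(p):
    return np.log(p / (1 - p))

a, b = 2.0, -1.0                 # arbitrary illustrative parameters
x = np.array([0.0, 1.0, 2.0])
p = sigmoid(a * x + b)

print(logit(p))                  # [-1, 1, 3]: each unit of x adds a=2 to the log-odds
```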
Example (Lady Tea Challenge Revisited):
Samples = Cups of tea
Covariate = Whether the lady perceives it as milk-first
Response = Whether the cup of tea is milk-first
Input
Covariates = Categorical Binary: {False, True} or {0, 1}
Response = Categorical Binary: {False, True} or {0, 1}
All the information we require to fit this regression is... the contingency table of the experiment!
Example (Lady Tea Challenge Revisited)
Covariate: X = [1, 1, 1, 0, 1, 0, 0, 0]
Response: Y = [1, 1, 1, 1, 0, 0, 0, 0]
| | Milk first: TRUE | Milk first: FALSE | |
|---|---|---|---|
| Lady says YES | 3 | 1 | 4 |
| Lady says NO | 1 | 3 | 4 |
| | 4 | 4 | 8 |
- a = log-odds ratio of the contingency table: $a = \log \frac{3 \cdot 3}{1 \cdot 1} = \log 9 \approx 2.20$.
- b = log-odds when x = 0: $b = \log \frac{1}{3} \approx -1.10$.
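A sketch verifying this with scikit-learn (an assumption on my part: the slides may use a different library; C is set huge to effectively disable regularization so the fit matches the table-based values):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([1, 1, 1, 0, 1, 0, 0, 0]).reshape(-1, 1)  # lady's calls
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])                 # true milk-first

model = LogisticRegression(C=1e9).fit(X, y)
print(model.coef_[0][0], model.intercept_[0])  # ~2.20 (= log 9) and ~-1.10 (= log 1/3)
```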
References:
- Pearl J, Mackenzie D. "The Book of Why" (2018)
- James G, Witten D, Hastie T, Tibshirani R. "An Introduction to Statistical Learning"
- Hastie T, Tibshirani R, Friedman J. "The Elements of Statistical Learning"