Data Analysis Tactics
Aim
"Introduction to statistical tactics by example"
- Introduce some common tactics in data analysis
- Correlation
- Qualitative models
- Fisher's test for categorical data
- Regression
Correlation
Informal definition:
the extent to which two experimental variables X, Y fluctuate in a synchronized way
Correlation is a handy proxy for causation, but the two are not equivalent.
https://www.tylervigen.com/spurious-correlations
how do we spot correlation?
| | Sedentary YES | Sedentary NO | Total |
|---|---|---|---|
| CVD YES | 100 | 25 | 125 |
| CVD NO | 15 | 100 | 115 |
| Total | 115 | 125 | 240 |
for categorical data we use contingency tables
is the mass concentrated in any of the diagonals?
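Log odds ratio: for a 2x2 table with top row (a, b) and bottom row (c, d), $\log \mathrm{OR} = \log \frac{a\,d}{b\,c}$. Values far from 0 indicate association.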
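As a quick check, the log odds ratio of the CVD / sedentary table can be computed directly from its four cells. This is a minimal sketch; the variable names `a`, `b`, `c`, `d` are labels chosen here for the cells, read row by row.

```python
import math

# Counts from the CVD / sedentary contingency table
a, b = 100, 25   # CVD YES: sedentary YES, sedentary NO
c, d = 15, 100   # CVD NO:  sedentary YES, sedentary NO

odds_ratio = (a * d) / (b * c)
log_odds_ratio = math.log(odds_ratio)

print(f"odds ratio     = {odds_ratio:.2f}")      # 26.67
print(f"log odds ratio = {log_odds_ratio:.2f}")  # 3.28
```

A log odds ratio of 0 would mean no association; here the mass sits on the main diagonal and the value is clearly positive.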
There are many other available options to spot correlation between categorical variables:
- Fisher's test -- wait and see.
- Logistic regression -- wait and see.
- Chi-squared-statistic based methods -- like Cramer's V, Phi coefficient, etc.
how do we spot correlation?
for numerical data...
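Pearson's correlation:
$$r_{XY} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$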
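Pearson's correlation is the covariance of the two variables divided by the product of their standard deviations. The sketch below implements that definition and compares it with NumPy's `np.corrcoef`; the toy data and the helper name `pearson_r` are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # NumPy's built-in estimate, for comparison
print(r, r_numpy)
```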
Qualitative Models
Causal diagrams
Pearl & Mackenzie 2018
Junction patterns
fork: A ← B → C
chain: A → B → C
collider: A → B ← C
Berkson's paradox
even if two diseases have no relation to each other in the general population, they can appear to be associated among patients in a hospital.
Joseph Berkson
What kind of associations? Why?
Scenario 1:
There is a causal relationship between the diseases. This could be true for some pairs D1, D2.
Scenario 2:
A single disease is not severe enough to cause hospitalization, so hospitalized patients tend to have more than one disease.
Scenario 3:
By performing the study on hospitalized patients, we are conditioning on the variable Hospitalization, which is a collider of the two diseases. This leads to a spurious negative correlation (also known as the "explaining away" effect).
let's examine this scenario with a simple simulation experiment...
coin-flipping simulation:
- independent diseases with prescribed prevalences
- each is severe enough to lead to hospitalization
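import numpy as np
binary_outcomes = [0, 1]
p1, p2 = 0.1, 0.1  # prevalences of d1, d2
# draw 1000 independent samples of each disease indicator (1 = has the disease)
d1 = np.random.choice(binary_outcomes, size=1000, p=[1 - p1, p1])
d2 = np.random.choice(binary_outcomes, size=1000, p=[1 - p2, p2])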
D1 = [0 1 0 0 1 0 1 0 1 0 0 0 0 ...]
D2 = [0 0 0 0 0 1 0 0 1 0 1 0 0 ...]
run 100 simulations...
coin-flipping simulation (cont)
now select those samples that have at least one disease and check the correlation among those samples only.
# keep only the samples with at least one disease (the hospitalized ones)
hospitalized = (d1 + d2) >= 1
d1, d2 = d1[hospitalized], d2[hospitalized]
(array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1,
1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1,
0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1,
1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1,
1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,
1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0]),
array([1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,
1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1,
0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1,
0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1]))
| | D1=YES | D1=NO |
|---|---|---|
| D2=YES | 13 | 109 |
| D2=NO | 87 | 0 |
In the restricted sample the diseases seem to repel each other, even though this does not happen in the general population!
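The whole experiment can be repeated many times to see that the effect is systematic. The sketch below is one possible way to do it, not necessarily the code behind the slides: it assumes a fixed seed and 100 runs of 1000 samples, and compares the correlation in the general population with the correlation among the hospitalized only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for reproducibility
p1, p2 = 0.1, 0.1               # prevalences of d1, d2
n_samples, n_runs = 1000, 100

corr_all, corr_hosp = [], []
for _ in range(n_runs):
    # independent disease indicators (1 = has the disease)
    d1 = (rng.random(n_samples) < p1).astype(int)
    d2 = (rng.random(n_samples) < p2).astype(int)
    hospitalized = (d1 + d2) >= 1  # any single disease leads to hospitalization
    corr_all.append(np.corrcoef(d1, d2)[0, 1])
    corr_hosp.append(np.corrcoef(d1[hospitalized], d2[hospitalized])[0, 1])

mean_all = float(np.mean(corr_all))
mean_hosp = float(np.mean(corr_hosp))
print("mean correlation, general population:", mean_all)
print("mean correlation, hospitalized only: ", mean_hosp)
```

In the general population the correlation hovers around zero, while among the hospitalized it is strongly negative.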
confounders
B is often called a confounder of A and C
B will make A and C statistically correlated even when there is no direct causal link between them.
Age is a confounder that makes Shoe Size and Reading Ability correlated.
To eliminate this spurious correlation we must adjust for Age.
E.g. within a single age group (7-year-olds) we will not see any correlation.
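A small simulation can illustrate the fork Shoe Size ← Age → Reading Ability. Everything numeric below is an assumption made up for illustration (age range, linear effects, noise levels); the point is only that the correlation is strong overall and vanishes within one age group.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, arbitrary
n = 8000
age = rng.integers(5, 13, size=n)                  # ages 5..12 (upper bound excluded)
shoe_size = 20 + 1.5 * age + rng.normal(0, 1, n)   # depends on age only
reading = 10 * age + rng.normal(0, 10, n)          # depends on age only

corr_overall = np.corrcoef(shoe_size, reading)[0, 1]

seven = age == 7                                   # adjust for Age: look inside one group
corr_age7 = np.corrcoef(shoe_size[seven], reading[seven])[0, 1]

print("overall correlation:        ", corr_overall)
print("correlation among 7-y-olds: ", corr_age7)
```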
Simpson's paradox
A new drug D is aimed at preventing heart attacks.
A clinical trial in which participants decide whether or not to adhere to the treatment.
Bad for men, bad for women, good for "people"?
- Adherence to Drug treatment depends on Gender.
- Incidence of Heart Attack depends on Gender.
- Gender is a confounder that introduces a spurious correlation between Drug and Heart Attack.
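The reversal can be reproduced with a handful of counts. The numbers below are hypothetical and chosen only to exhibit the pattern (they are not taken from the slides): women mostly adhere to the drug, men mostly do not, and men have a higher heart attack rate overall.

```python
# (heart attacks, participants) per group; hypothetical counts for illustration only
counts = {
    ("women", "no drug"): (1, 20),
    ("women", "drug"): (3, 40),
    ("men", "no drug"): (12, 40),
    ("men", "drug"): (8, 20),
}

def rate(gender, arm):
    """Heart attack rate within one gender and one treatment arm."""
    attacks, total = counts[(gender, arm)]
    return attacks / total

def pooled_rate(arm):
    """Heart attack rate in one arm, ignoring gender."""
    attacks = sum(counts[(g, arm)][0] for g in ("women", "men"))
    total = sum(counts[(g, arm)][1] for g in ("women", "men"))
    return attacks / total

for g in ("women", "men"):
    print(g, "no drug:", rate(g, "no drug"), "drug:", rate(g, "drug"))
print("pooled no drug:", pooled_rate("no drug"), "drug:", pooled_rate("drug"))
```

Within each gender the drug arm has the higher rate, yet the pooled drug arm has the lower one, because the two arms have very different gender compositions.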
Categorical Data Analysis
A lady claimed to be able to tell whether the tea or the milk was added first.
Lady tasting challenge
| Milk first? | TRUE | FALSE | Total |
|---|---|---|---|
| Lady says YES | ?? | ?? | 4 |
| Lady says NO | ?? | ?? | 4 |
| Total | 4 | 4 | 8 |
https://en.wikipedia.org/wiki/Lady_tasting_tea
| Milk first? | TRUE | FALSE | Total |
|---|---|---|---|
| Lady says YES | 3 | 1 | 4 |
| Lady says NO | 1 | 3 | 4 |
| Total | 4 | 4 | 8 |
Is this outcome much better than pure chance?
How many possible outcomes?
How many outcomes with 3 successes or more?
How often would we score 3 successes or more just by chance?
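These three questions can be answered by counting. The sketch below assumes, as in the table, that the lady must label exactly 4 of the 8 cups as milk-first, and that a "success" is a true milk-first cup she labels correctly; the helper name `n_ways` is introduced here for illustration.

```python
from math import comb

# Ways to choose which 4 of the 8 cups get labelled "milk first"
n_outcomes = comb(8, 4)

def n_ways(k):
    """Labellings with k correct picks among the 4 true milk-first cups
    and 4 - k wrong picks among the other 4 cups."""
    return comb(4, k) * comb(4, 4 - k)

n_at_least_3 = n_ways(3) + n_ways(4)
p_chance = n_at_least_3 / n_outcomes

print(n_outcomes, n_at_least_3, round(p_chance, 3))  # 70 17 0.243
```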
Fisher's exact test
| T(a,b,c,d) | X=1 | X=0 | Total |
|---|---|---|---|
| Y=1 | a | b | a+b |
| Y=0 | c | d | c+d |
| Total | a+c | b+d | n |
Suppose we randomly sample binary vectors X, Y of length n:
X = [0 1 0 0 1 0 1 1 1 1 ...] with #{1} = a+c; #{0} = b+d
Y = [0 1 0 0 0 1 1 0 0 0 ...] with #{1} = a+b; #{0} = c+d
What are the chances of getting this Table?
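With all four margins fixed, the probability of a given table follows the hypergeometric formula $P(T) = \binom{a+b}{a}\binom{c+d}{c} \big/ \binom{n}{a+c}$. The sketch below evaluates it; the function name `table_probability` is a label chosen here, not something from the slides.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of the 2x2 table T(a, b, c, d) when X and Y are paired
    at random with all margins fixed (hypergeometric)."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# Lady tasting table: a=3, b=1, c=1, d=3
p_table = table_probability(3, 1, 1, 3)
print(p_table)  # 16/70

# Sanity check: the probabilities of all tables with these margins add up to 1
total = sum(table_probability(a, 4 - a, 4 - a, a) for a in range(5))
print(total)
```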
p-value
How often, just by chance, the results of the experiment would be at least as extreme as the results actually observed.
"A measure of how surprising the observed result is, assuming the null hypothesis"
For instance, in the lady tasting challenge, 17 of the 70 equally likely outcomes have 3 or more successes, so the p-value is 17/70 ≈ 0.24.
Fisher's test p-value
Suppose we run a more comprehensive lady tasting challenge
| Milk first? | TRUE | FALSE | Total |
|---|---|---|---|
| Lady says YES | a=?? | ?? | 20 |
| Lady says NO | ?? | ?? | 20 |
| Total | 20 | 20 | 40 |
What is the probability of exactly a successes just by chance?
Suppose that the lady had a=12 successes. How often would we get a result at least as good? Is this very surprising?
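import numpy as np
import scipy.stats
M, n, N = 40, 20, 20  # total cups, true milk-first cups, cups the lady labels milk-first
rv = scipy.stats.hypergeom(M, n, N)
x = np.arange(0, n + 1)  # every possible number of successes
pmf_successes = rv.pmf(x)
p = pmf_successes[12:].sum()  # P(a >= 12): one-sided p-value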
Fisher's test is just an application of the so-called hypergeometric distribution
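To see the connection, the same one-sided p-value can be obtained from `scipy.stats.fisher_exact` applied to the full table. This is a sketch under the assumption that a = 12 with all margins equal to 20, which fixes the other three cells.

```python
from scipy.stats import fisher_exact, hypergeom

# Table for a = 12 successes when every margin equals 20
table = [[12, 8],
         [8, 12]]

# One-sided test: is the association at least this strong by chance?
_, p_fisher = fisher_exact(table, alternative="greater")

# Same quantity from the hypergeometric tail: P(a >= 12) = 1 - P(a <= 11)
p_hypergeom = hypergeom.sf(11, 40, 20, 20)

print(p_fisher, p_hypergeom)
```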
Regression
Does growth rate depend on temperature?
[Figure labels: temperature, growth rate, pH, pressure, nutrients]
What patterns do we see?
Perfect fit!
But do not expect it to fit new data :(
Parametric methods:
- assume rigid global shape
- predictions may be inaccurate
- good for interpretation
Smoothers:
- less rigid shape
- shape is locally defined
- useful for prediction
- more difficult to interpret
Different approaches
Parametric methods: Linear, Quadratic
Smoothers (non-parametric): Average Smoothing, Nadaraya-Watson (bell-shaped kernel)
Whatever the strategy, we want some function $f$ such that $y \approx f(x)$, where $x$ is the covariate (temperature) and $y$ the response (growth rate).
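The contrast between the two families can be sketched on synthetic data. Everything about the data below is an assumption made for illustration (a growth rate that peaks at an intermediate temperature, Gaussian noise, a bandwidth of 2 degrees); the Nadaraya-Watson smoother is written out by hand as a kernel-weighted local average.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, arbitrary
# Hypothetical data: growth rate peaks at an intermediate temperature
temperature = np.linspace(10, 40, 60)
growth = np.exp(-((temperature - 28) / 6) ** 2) + rng.normal(0, 0.05, temperature.size)

# Parametric: rigid global shapes fitted by least squares
linear_fit = np.polyval(np.polyfit(temperature, growth, 1), temperature)
quadratic_fit = np.polyval(np.polyfit(temperature, growth, 2), temperature)

def nadaraya_watson(x0, x, y, bandwidth):
    """Local average of y around each x0, weighted by a Gaussian (bell-shaped) kernel."""
    weights = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / bandwidth) ** 2)
    return (weights * y).sum(axis=1) / weights.sum(axis=1)

smooth_fit = nadaraya_watson(temperature, temperature, growth, bandwidth=2.0)

def sse(fit):
    """Sum of squared residuals on the training data."""
    return float(((growth - fit) ** 2).sum())

print("SSE linear:   ", sse(linear_fit))
print("SSE quadratic:", sse(quadratic_fit))
print("SSE smoother: ", sse(smooth_fit))
```

A smaller bandwidth makes the smoother follow the data more closely, at the price of the "perfect fit" problem: low training error does not guarantee good predictions on new data.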
Linear regression
Input
Covariates = Numerical
Response = Numerical
What for?
- Understanding the association between the covariates and the response variables.
- Prediction of expected value of the response given the values of the covariates.
what if we repeated the experiment?
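One way to explore this question is to simulate it: keep the true line fixed, redraw the noise, and refit. The sketch below assumes a hypothetical ground truth (slope 2, intercept 1, Gaussian noise) and shows that the fitted slope varies from one repetition to the next while staying centred on the true value.

```python
import numpy as np

rng = np.random.default_rng(3)          # fixed seed, arbitrary
true_slope, true_intercept = 2.0, 1.0   # hypothetical ground truth
x = np.linspace(0, 10, 30)

slopes = []
for _ in range(500):
    # a fresh experiment: same x values, new measurement noise
    y = true_intercept + true_slope * x + rng.normal(0, 2.0, x.size)
    slope, intercept = np.polyfit(x, y, 1)
    slopes.append(slope)

mean_slope = float(np.mean(slopes))
std_slope = float(np.std(slopes))
print("mean slope:", mean_slope)
print("spread of the slope across repetitions:", std_slope)
```

The spread across repetitions is what a standard error or confidence interval for the slope tries to estimate from a single experiment.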
Linear regression with many variables?
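With several covariates, the same least-squares idea applies to a design matrix with one column per covariate. The sketch below uses made-up covariates and coefficients (temperature and pH affecting growth rate) and solves the problem with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed, arbitrary
n = 200
# Hypothetical covariates and coefficients, for illustration only
temperature = rng.uniform(10, 40, n)
ph = rng.uniform(5, 9, n)
growth = 0.5 + 0.03 * temperature - 0.1 * ph + rng.normal(0, 0.05, n)

# Design matrix: one column per covariate plus a column of ones for the intercept
A = np.column_stack([temperature, ph, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, growth, rcond=None)

print("temperature coefficient:", coef[0])
print("pH coefficient:         ", coef[1])
print("intercept:              ", coef[2])
```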
Logistic regression
Input
Covariates = Numerical or Categorical
Response = Categorical Binary: {False, True} or {0, 1}
What for?
- Understanding the association between the response variable and the covariates.
- Classification -- probability that the response has True/False class value given the values of the covariates.
Logistic Regression fits a sigmoid function of the form:
$$p(y=1 \mid x) = \frac{1}{1 + e^{-(a x + b)}}$$
How do we interpret the parameters?
How to interpret parameters in logistic regression
For a sample with covariate value = x we denote by p(y=1|x) the probability that the sample has response class = 1.
The association between covariate values and this probability is given by the following expression:
$$\operatorname{logit}\, p(y=1 \mid x) = \log \frac{p(y=1 \mid x)}{1 - p(y=1 \mid x)} = a x + b$$
a = increase of logit for a 1 unit increase of the covariate.
b = log-odds when x = 0.
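Both interpretations can be checked numerically. The sketch below picks arbitrary, hypothetical values for a and b, maps x to a probability through the sigmoid, and maps that probability back to log-odds.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

a, b = 0.8, -1.5   # hypothetical parameter values

def prob(x):
    """p(y=1|x) under the logistic model with parameters a, b."""
    return sigmoid(a * x + b)

log_odds_at_zero = logit(prob(0.0))               # equals b
logit_increase = logit(prob(3.0)) - logit(prob(2.0))  # equals a

print(log_odds_at_zero, logit_increase)
```

On the probability scale the effect of a unit increase depends on where x is; only on the logit scale is it constant and equal to a.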
Categorical logistic regression
Example (Lady Tea Challenge Revisited):
Samples = Cups of tea
Covariate = Whether the lady perceives it as milk-first
Response = Whether the cup of tea is milk-first
Input
Covariates = Categorical Binary: {False, True} or {0, 1}
Response = Categorical Binary: {False, True} or {0, 1}
All the information we require to fit this regression is... the contingency table of the experiment!
Covariate: X = [1, 1, 1, 0, 1, 0, 0, 0]
Response: Y = [1, 1, 1, 1, 0, 0, 0, 0]
| Milk first? | TRUE | FALSE | Total |
|---|---|---|---|
| Lady says YES | 3 | 1 | 4 |
| Lady says NO | 1 | 3 | 4 |
| Total | 4 | 4 | 8 |
a = log-odds ratio of the contingency table.
b = log-odds when x = 0.
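These two identities can be verified on the lady tasting data. The sketch below fits the logistic model with a plain Newton iteration, which is an assumption made here for self-containment (the slides do not say which fitting routine was used), and compares the fitted parameters with the values read off the contingency table.

```python
import numpy as np

# Covariate (lady says milk-first) and response (cup is milk-first)
X = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=float)
Y = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

# Maximum likelihood fit of logit p = a*x + b by Newton's method
A = np.column_stack([X, np.ones_like(X)])
theta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-A @ theta))
    gradient = A.T @ (Y - p)
    hessian = A.T @ (A * (p * (1.0 - p))[:, None])
    theta = theta + np.linalg.solve(hessian, gradient)
a, b = theta

# Values read directly off the contingency table
a_table = np.log((3 * 3) / (1 * 1))  # log odds ratio
b_table = np.log(1 / 3)              # log-odds of milk-first when the lady says NO (x = 0)

print(a, a_table)
print(b, b_table)
```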
References
Pearl J, Mackenzie D. "The Book of Why" (2018)
James G, Witten D, Hastie T, Tibshirani R. "An Introduction to Statistical Learning"
Hastie T, Tibshirani R, Friedman J. "The Elements of Statistical Learning"
Data Analysis Tactics
By Ferran Muiños
A gentle introduction to statistical tactics in science.