Bias in OLAP QUeries: Detection, Explanation, and Removal
Babak Salimi, Johannes Gehrke, Dan Suciu
Outline
- Background: Simpson's Paradox
- Formal definition
- Detecting Bias
- Explaining Bias
- Resolving Bias
- Experiment
SELECT SUM(Delayed) / SUM(Total)
FROM FlightData
GROUP BY Carrier
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
Delayed | Total | Carrier | Airport |
---|---|---|---|
5 | 10 | AA | COS |
6 | 10 | AA | MFE |
7 | 10 | AA | MTJ |
8 | 10 | AA | ROC |
4 | 10 | UA | COS |
5 | 10 | UA | MFE |
6 | 10 | UA | MTJ |
7 | 10 | UA | ROC |
AA: \(\frac{5+6+7+8}{10+10+10+10} = \frac{26}{40}\)
UA: \(\frac{4+5+6+7}{10+10+10+10} = \frac{22}{40}\)
Simpson's Paradox
SELECT SUM(Delayed) / SUM(Total)
FROM FlightData
GROUP BY Carrier
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
Delayed | Total | Carrier | Airport |
---|---|---|---|
50 | 100 | AA | COS |
60 | 100 | AA | MFE |
7 | 10 | AA | MTJ |
8 | 10 | AA | ROC |
4 | 10 | UA | COS |
5 | 10 | UA | MFE |
60 | 100 | UA | MTJ |
70 | 100 | UA | ROC |
AA: \(\frac{50+60+7+8}{100+100+10+10} = \frac{125}{220}\)
UA: \(\frac{4+5+60+70}{10+10+100+100} = \frac{139}{220}\)
Simpson's Paradox
AA | UA | |
---|---|---|
COS | 50/100 | 4/10 |
MFE | 60/100 | 5/10 |
MTF | 7/10 | 60/100 |
ROC | 8/10 | 70/100 |
For each airport, it can be actually divided into small subgroups by airports, and the distribution for these subgroups are not the SAME.
Simpson's Paradox
SELECT SUM(Delayed) / SUM(Total)
FROM FlightData
GROUP BY Carrier
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
At the heart of the issue is an incorrect interpretation of the query; while the analyst’s goal is to compare the causal effect of the carriers on delay, the OLAP query measures only their association.
Formal Definition
Principled business decision
Two alternatives: T \(\in \{t_0, t_1\} \)
An outcome: Y
Some other factors \(x_1, x_2, ...\)
\(t_0\): control group, taking the placebo
\(t_1\): treatment group: taking the medicine
\(Y(t_0), Y(t_1)\): Blood pressure after taking the placebo (medicine)
\(x_1, x_2,...\): Age, gender, ...
E.g.
In SQL
SELECT T, X, AVG(Y)
FROM D
WHERE C
GROUP BY T, X
T: UA or AA
D: FlightData
C: Carrier and Airport constraint
X: NULL
Y: Delay
We want to find the causal relationship between T (Carrier) and Y (Delay)
Neyman-Rubin Causal Model
Average treatment effect: \(ATE(T,Y)\)
\(= \mathbb{E}[Y(t_1) - Y(t_0)]\)
\(= AVG[Delay(AA) - Delay(AU)]\)
Neyman-Rubin Causal Model
Average treatment effect: \(ATE(T,Y)\)
\(= \mathbb{E}[Y(t_1) - Y(t_0)]\)
\(= AVG[Delay(AA) - Delay(AU)]\)
?
?
Or
Average treatment effect: \(ATE(T,Y)\)
\(= \mathbb{E}[Y(t_1) - Y(t_0)]\)
\(= \mathbb{E}[Y(t_1)] - \mathbb{E}[Y(t_0)] \)
\(= AVG[Delay(AA)] - AVG[Delay(AU)]\)
If \((Y(t_1), Y(t_0)) \perp T\), -- Not possible
Or,
for some other variables \(\textbf{Z}\) -- covariates,
\((Y(t_1), Y(t_0)) \perp T | \textbf{Z}=\textbf{z}\),
Our goal
With \((Y(t_1), Y(t_0)) \perp T | \textbf{Z}=\textbf{z}\),
\(ATE(T,Y) = \sum_{\textbf{z} \in \textbf{Z}} (\mathbb{E}[Y|T=t_1,\textbf{z}] - \mathbb{E}[Y|T=t_0,\textbf{z}]) Pr(\textbf{z})\)
A.k.a. We calculate the average delay for each (Carrier, Airport) individually, then re-weight and sum them up
AA | UA | |
---|---|---|
COS | 50/100 | 4/10 |
MFE | 60/100 | 5/10 |
MTF | 7/10 | 60/100 |
ROC | 8/10 | 70/100 |
ATE=(50/100 - 4/10)*0.25 + (60/100 - 5/10)*0.25 + (7/10 - 60/100)*0.25 + (8/10 - 70/100)*0.25 = 1/10
Biased=(50+60+7+8)/(100+100+10+10) - (4+5+60+70) / (10+10+100+100) = -7/110
Covariate discovery?
- Learn a causal DAG from the data
- The parents of a node is a sufficient set of covariates
- Exponential to generate a DAG
Pearl's method
Markov Boundary
Detecting Bias
AA | UA | |
---|---|---|
COS | 5/10 | 4/10 |
MFE | 6/10 | 5/10 |
MTF | 7/10 | 6/10 |
ROC | 8/10 | 7/10 |
AA | UA | |
---|---|---|
COS | 50/100 | 4/10 |
MFE | 60/100 | 5/10 |
MTF | 7/10 | 60/100 |
ROC | 8/10 | 70/100 |
VS
Only when the distribution of
\(Pr(Airport|Carrier=AA)\) and \(Pr(Airport| Carrier=AU)\) are the same, then we can call the query is balanced.
AA | UA | |
---|---|---|
COS | 5/10 | 4/10 |
MFE | 6/10 | 5/10 |
MTF | 7/10 | 6/10 |
ROC | 8/10 | 7/10 |
A.k.a. \(Carrier \perp Airport\)
\(I(Carrier;Airport) = 0.25 \neq 0\) with p < 0.001
If the query is balanced, then
\(AVG(Delay|Carrier=AU) - AVG(Delay|Carrier=AA)\)
is a unbiased estimator of
\(\mathbb{E}[Delay|Airport=AU] - \mathbb{E}[Delay|Airport=AA]\)
SELECT SUM(Delayed) / SUM(Total)
FROM FlightData
GROUP BY Carrier
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
Explaining Bias
- Coerce-grained explanation
- Fine-grained explanation
Coarse-grained Explaination
Intuitively, more dependent, more bias
\(\rho_Z = \frac{I(Carrier;Airport) - I(Carrier;Airport|Airport)}{...}\)
\(A \perp B \Rightarrow I(A;B) = 0\)
Rank attributes via \(\rho_z\)
Notice:
\(I(T;V) - I(T;V|Z) \)
\( = (H(T) + H(V) - H(TV)) - (H(TZ)+H(V)-H(TV)-H(Z))\)
\(= H(T)+H(Z)-H(TZ) \geq 0\)
So: \(\ 0 \leq \rho_X \leq 1\)
Fine-grained Explaination
Since I(X;Y) is calculated by \[ I(X;Y) = \sum_{x\in X} \sum_{y\in Y} Pr(x,y) \log(\frac{Pr(x,y)}{Pr(x)Pr(y)})\]
For each tuple (x,y) we can evaluate how much it contributes to the mutual information \(I(X;Y)\)
E.g. (Airport=ROC, Carrier=UA) contributes most for the bias
Resolving Bias
A.k.a. We calculate the average delay for each (Carrier, Airport) individually, then re-weight and sum them up
AA | UA | |
---|---|---|
COS | 50/100 | 4/10 |
MFE | 60/100 | 5/10 |
MTF | 7/10 | 60/100 |
ROC | 8/10 | 70/100 |
ATE=(50/100 - 4/10)*0.25 + (60/100 - 5/10)*0.25 + (7/10 - 60/100)*0.25 + (8/10 - 70/100)*0.25 = 1/10
With \((Y(t_1), Y(t_0)) \perp T | \textbf{Z}=\textbf{z}\),
\(ATE(T,Y) = \sum_{\textbf{z} \in \textbf{Z}} (\mathbb{E}[Y|T=t_1,\textbf{z}] - \mathbb{E}[Y|T=t_0,\textbf{z}]) Pr(\textbf{z})\)
WITH Blocks AS (
SELECT Carrier, Airport, AVG(Delayed) as AVG1
FROM D
GROUP BY Carrier, Airport
),
Weights AS (
SELECT Airport, count(*) / n as W
FROM D
GROUP BY Airport
HAVING count(DISTINCT Carrier) = 2
)
SELECT Carrier, sum(AVG1 * W)
FROM Blocks, Weights
GROUP BY Carrier
WHERE Blocks.Airport = Weights.Airport
SELECT Carrier, AVG(Delayed)
FROM D
GROUP BY Carrier
Experiments
E1: How easy to make a mistake
SELECT SUM(Delayed) / SUM(Total)
FROM FlightData
GROUP BY Carrier
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
With its rewritten version
20% reversed
E2: End to end testing, Women earn less than men?
Conclusion
- Nice Idea, Important
- Interesting examples
Pros
Cons
- Fined-grained explanation obscure
- Continuous variable
- Multi treatments (naive is \(O(n^2)\))
deck
By Weiyüen Wu
deck
- 602