Bias in OLAP Queries: Detection, Explanation, and Removal

Babak Salimi, Johannes Gehrke, Dan Suciu

Outline

Background: Simpson's Paradox
Formal definition
Detecting Bias
Explaining Bias
Resolving Bias
Experiment

SELECT Carrier, SUM(Delayed) / SUM(Total)
FROM FlightData
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
GROUP BY Carrier

Delayed	Total	Carrier	Airport
5	10	AA	COS
6	10	AA	MFE
7	10	AA	MTJ
8	10	AA	ROC
4	10	UA	COS
5	10	UA	MFE
6	10	UA	MTJ
7	10	UA	ROC

AA: \(\frac{5+6+7+8}{10+10+10+10} = \frac{26}{40}\)

UA: \(\frac{4+5+6+7}{10+10+10+10} = \frac{22}{40}\)

Simpson's Paradox

SELECT Carrier, SUM(Delayed) / SUM(Total)
FROM FlightData
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
GROUP BY Carrier

Delayed	Total	Carrier	Airport
50	100	AA	COS
60	100	AA	MFE
7	10	AA	MTJ
8	10	AA	ROC
4	10	UA	COS
5	10	UA	MFE
60	100	UA	MTJ
70	100	UA	ROC

AA: \(\frac{50+60+7+8}{100+100+10+10} = \frac{125}{220}\)

UA: \(\frac{4+5+60+70}{10+10+100+100} = \frac{139}{220}\)

Simpson's Paradox

	AA	UA
COS	50/100	4/10
MFE	60/100	5/10
MTJ	7/10	60/100
ROC	8/10	70/100

Each carrier's flights can be divided into subgroups by airport, and the distribution of flights across these subgroups is not the same for the two carriers.
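The reversal can be checked directly from the table. The following is a minimal sketch, not part of the original slides; the dictionary `flights` and the helper names `overall_rate` and `airport_rate` are introduced here for illustration only.

```python
from fractions import Fraction

# (delayed, total) per (carrier, airport), copied from the table above.
flights = {
    ("AA", "COS"): (50, 100), ("AA", "MFE"): (60, 100),
    ("AA", "MTJ"): (7, 10),   ("AA", "ROC"): (8, 10),
    ("UA", "COS"): (4, 10),   ("UA", "MFE"): (5, 10),
    ("UA", "MTJ"): (60, 100), ("UA", "ROC"): (70, 100),
}

def overall_rate(carrier):
    """Delay rate pooled over all airports (what the GROUP BY Carrier query computes)."""
    delayed = sum(d for (c, _), (d, _) in flights.items() if c == carrier)
    total = sum(t for (c, _), (_, t) in flights.items() if c == carrier)
    return Fraction(delayed, total)

def airport_rate(carrier, airport):
    """Delay rate of one carrier at one airport."""
    delayed, total = flights[(carrier, airport)]
    return Fraction(delayed, total)

airports = sorted({a for _, a in flights})

# AA has the higher delay rate at every single airport ...
aa_worse_everywhere = all(
    airport_rate("AA", a) > airport_rate("UA", a) for a in airports
)
# ... yet the lower delay rate once the airports are pooled.
aa_better_overall = overall_rate("AA") < overall_rate("UA")

print(aa_worse_everywhere, aa_better_overall)
```

Both flags come out true on this table, which is the paradox: the pooled comparison points the opposite way from every per-airport comparison.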

Simpson's Paradox

SELECT Carrier, SUM(Delayed) / SUM(Total)
FROM FlightData
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
GROUP BY Carrier

At the heart of the issue is an incorrect interpretation of the query; while the analyst’s goal is to compare the causal effect of the carriers on delay, the OLAP query measures only their association.

Formal Definition

Principled business decision

Two alternatives: T \(\in \{t_0, t_1\} \)

An outcome: Y

Some other factors \(x_1, x_2, ...\)

E.g.

\(t_0\): control group, taking the placebo

\(t_1\): treatment group, taking the medicine

\(Y(t_0), Y(t_1)\): blood pressure after taking the placebo (medicine)

\(x_1, x_2, ...\): age, gender, ...

In SQL

SELECT T, X, AVG(Y)
FROM D
WHERE C
GROUP BY T, X

T: UA or AA

D: FlightData

C: Carrier and Airport constraint

X: none (no additional grouping attributes)

Y: Delay

We want to find the causal relationship between T (Carrier) and Y (Delay)

Neyman-Rubin Causal Model

Average treatment effect: \(ATE(T,Y)\)

\(= \mathbb{E}[Y(t_1) - Y(t_0)]\)

\(= AVG[Delay(AA) - Delay(UA)]\)

Neyman-Rubin Causal Model

Average treatment effect: \(ATE(T,Y)\)

\(= \mathbb{E}[Y(t_1) - Y(t_0)]\)

\(= \mathbb{E}[Y(t_1)] - \mathbb{E}[Y(t_0)] \)

\(= AVG[Delay(AA)] - AVG[Delay(UA)]\)

This holds if \((Y(t_1), Y(t_0)) \perp T\), as in a randomized experiment; not possible with observational data

Or,

for some other variables \(\textbf{Z}\) (covariates),

\((Y(t_1), Y(t_0)) \perp T | \textbf{Z}=\textbf{z}\)

Our goal

With \((Y(t_1), Y(t_0)) \perp T | \textbf{Z}=\textbf{z}\),

\(ATE(T,Y) = \sum_{\textbf{z} \in \textbf{Z}} (\mathbb{E}[Y|T=t_1,\textbf{z}] - \mathbb{E}[Y|T=t_0,\textbf{z}]) Pr(\textbf{z})\)

In other words, we compute the average delay for each (Carrier, Airport) pair individually, then weight each airport by \(Pr(Airport)\) and sum the differences.

	AA	UA
COS	50/100	4/10
MFE	60/100	5/10
MTJ	7/10	60/100
ROC	8/10	70/100

ATE=(50/100 - 4/10)*0.25 + (60/100 - 5/10)*0.25 + (7/10 - 60/100)*0.25 + (8/10 - 70/100)*0.25 = 1/10

Biased=(50+60+7+8)/(100+100+10+10) - (4+5+60+70) / (10+10+100+100) = -7/110
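The two numbers above can be reproduced with exact arithmetic. This is a sketch under the assumption that \(Pr(\textbf{z})\) is estimated as each airport's share of all flights in the table (110 of 440, hence the 0.25 weights); the names `flights`, `rate`, `pooled_rate`, and `weight` are illustrative.

```python
from fractions import Fraction

# (delayed, total) per (carrier, airport), copied from the table above.
flights = {
    ("AA", "COS"): (50, 100), ("AA", "MFE"): (60, 100),
    ("AA", "MTJ"): (7, 10),   ("AA", "ROC"): (8, 10),
    ("UA", "COS"): (4, 10),   ("UA", "MFE"): (5, 10),
    ("UA", "MTJ"): (60, 100), ("UA", "ROC"): (70, 100),
}
carriers = ("AA", "UA")
airports = sorted({a for _, a in flights})
n = sum(total for _, total in flights.values())  # 440 flights overall

def rate(carrier, airport):
    """E[Y | T = carrier, Z = airport], estimated from the counts."""
    delayed, total = flights[(carrier, airport)]
    return Fraction(delayed, total)

def pooled_rate(carrier):
    """E[Y | T = carrier], ignoring the airport."""
    delayed = sum(flights[(carrier, a)][0] for a in airports)
    total = sum(flights[(carrier, a)][1] for a in airports)
    return Fraction(delayed, total)

# Pr(z): share of all flights that use each airport.
weight = {
    a: Fraction(sum(flights[(c, a)][1] for c in carriers), n) for a in airports
}

# Adjustment formula: per-airport differences, re-weighted by Pr(z).
ate = sum((rate("AA", a) - rate("UA", a)) * weight[a] for a in airports)
# Naive difference of the pooled rates.
biased = pooled_rate("AA") - pooled_rate("UA")

print(ate, biased)  # 1/10 -7/110
```

The adjusted estimate is positive while the naive difference is negative, matching the hand computation.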

Covariate discovery?

Learn a causal DAG from the data
The parents of the treatment node form a sufficient set of covariates
Learning the full DAG takes exponential time

Pearl's method

Markov Boundary

Detecting Bias

	AA	UA
COS	5/10	4/10
MFE	6/10	5/10
MTJ	7/10	6/10
ROC	8/10	7/10

	AA	UA
COS	50/100	4/10
MFE	60/100	5/10
MTJ	7/10	60/100
ROC	8/10	70/100

The query is balanced only when the distributions

\(Pr(Airport|Carrier=AA)\) and \(Pr(Airport|Carrier=UA)\) are the same.

	AA	UA
COS	5/10	4/10
MFE	6/10	5/10
MTJ	7/10	6/10
ROC	8/10	7/10

Equivalently, \(Carrier \perp Airport\)

\(I(Carrier;Airport) = 0.25 \neq 0\) with p < 0.001
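As an illustration of the balance check, the sketch below computes a plug-in estimate of \(I(Carrier;Airport)\) for the two toy tables. It is not the authors' implementation: the function name `mutual_information` is hypothetical, the result is in bits, and the significance test behind the p-value is omitted. On the toy counts the unbalanced table gives roughly 0.56 bits rather than the 0.25 quoted above, which presumably was measured on different data.

```python
import math
from fractions import Fraction

# Number of flights per (carrier, airport): the Total column of the two tables.
balanced = {
    (c, a): 10 for c in ("AA", "UA") for a in ("COS", "MFE", "MTJ", "ROC")
}
unbalanced = {
    ("AA", "COS"): 100, ("AA", "MFE"): 100, ("AA", "MTJ"): 10, ("AA", "ROC"): 10,
    ("UA", "COS"): 10,  ("UA", "MFE"): 10,  ("UA", "MTJ"): 100, ("UA", "ROC"): 100,
}

def mutual_information(counts):
    """Plug-in estimate of I(Carrier;Airport) in bits from a table of counts."""
    n = sum(counts.values())
    carrier_total, airport_total = {}, {}
    for (c, a), k in counts.items():
        carrier_total[c] = carrier_total.get(c, 0) + k
        airport_total[a] = airport_total.get(a, 0) + k
    mi = 0.0
    for (c, a), k in counts.items():
        if k == 0:
            continue  # empty cells contribute nothing
        # Pr(c,a) / (Pr(c) Pr(a)) simplifies to k * n / (count(c) * count(a)).
        ratio = Fraction(k * n, carrier_total[c] * airport_total[a])
        mi += (k / n) * math.log2(float(ratio))
    return mi

print(mutual_information(balanced))    # zero: Carrier and Airport are independent
print(mutual_information(unbalanced))  # clearly positive: the query is unbalanced
```

A value of zero corresponds to identical airport distributions for both carriers; any dependence between Carrier and Airport pushes the estimate above zero.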

If the query is balanced, then

\(AVG(Delay|Carrier=UA) - AVG(Delay|Carrier=AA)\)

is an unbiased estimator of

\(\mathbb{E}[Delay(UA)] - \mathbb{E}[Delay(AA)]\)

SELECT Carrier, SUM(Delayed) / SUM(Total)
FROM FlightData
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
GROUP BY Carrier

Explaining Bias

Coarse-grained explanation
Fine-grained explanation

Coarse-grained Explanation

Intuitively, the more dependent, the more bias

\(\rho_Z = \frac{I(Carrier;Airport) - I(Carrier;Airport|Airport)}{...}\)

\(A \perp B \Rightarrow I(A;B) = 0\)

Rank attributes via \(\rho_Z\)

Notice (for \(Z \in V\), so that \(H(VZ) = H(V)\) and \(H(TVZ) = H(TV)\)):

\(I(T;V) - I(T;V|Z) \)

\( = (H(T) + H(V) - H(TV)) - (H(TZ)+H(V)-H(TV)-H(Z))\)

\(= H(T)+H(Z)-H(TZ) = I(T;Z) \geq 0\)

So: \(0 \leq \rho_Z \leq 1\)

Fine-grained Explanation

Since I(X;Y) is calculated by \[ I(X;Y) = \sum_{x\in X} \sum_{y\in Y} Pr(x,y) \log(\frac{Pr(x,y)}{Pr(x)Pr(y)})\]

For each tuple (x,y) we can evaluate how much it contributes to the mutual information \(I(X;Y)\)

E.g. (Airport=ROC, Carrier=UA) contributes the most to the bias
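The per-tuple breakdown can be sketched by evaluating each summand of \(I(X;Y)\) separately and sorting. This is an illustrative reading of the slide, not the authors' ranking procedure; `contribution` and `ranked` are hypothetical names, and the counts are the flight totals of the unbalanced toy table. On this symmetric toy table the four 100-flight cells tie for the largest contribution, so it does not single out (ROC, UA) the way the example above does.

```python
import math
from fractions import Fraction

# Number of flights per (carrier, airport) in the unbalanced toy table.
counts = {
    ("AA", "COS"): 100, ("AA", "MFE"): 100, ("AA", "MTJ"): 10, ("AA", "ROC"): 10,
    ("UA", "COS"): 10,  ("UA", "MFE"): 10,  ("UA", "MTJ"): 100, ("UA", "ROC"): 100,
}

n = sum(counts.values())
carrier_total, airport_total = {}, {}
for (c, a), k in counts.items():
    carrier_total[c] = carrier_total.get(c, 0) + k
    airport_total[a] = airport_total.get(a, 0) + k

def contribution(cell):
    """One summand of I(Carrier;Airport): Pr(x,y) * log2(Pr(x,y) / (Pr(x) Pr(y)))."""
    c, a = cell
    k = counts[cell]
    ratio = Fraction(k * n, carrier_total[c] * airport_total[a])
    return (k / n) * math.log2(float(ratio))

# Cells sorted from the largest to the smallest contribution.
ranked = sorted(counts, key=contribution, reverse=True)
for cell in ranked:
    print(cell, round(contribution(cell), 3))
```

Over-represented cells contribute positive terms and under-represented cells contribute negative ones; the terms add up to the mutual information itself.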

Resolving Bias

In other words, we compute the average delay for each (Carrier, Airport) pair individually, then weight each airport by \(Pr(Airport)\) and sum the differences.

	AA	UA
COS	50/100	4/10
MFE	60/100	5/10
MTJ	7/10	60/100
ROC	8/10	70/100

ATE=(50/100 - 4/10)*0.25 + (60/100 - 5/10)*0.25 + (7/10 - 60/100)*0.25 + (8/10 - 70/100)*0.25 = 1/10

With \((Y(t_1), Y(t_0)) \perp T | \textbf{Z}=\textbf{z}\),

\(ATE(T,Y) = \sum_{\textbf{z} \in \textbf{Z}} (\mathbb{E}[Y|T=t_1,\textbf{z}] - \mathbb{E}[Y|T=t_0,\textbf{z}]) Pr(\textbf{z})\)

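WITH Blocks AS (
    -- average delay within each (Carrier, Airport) block
    SELECT Carrier, Airport, AVG(Delayed) AS AVG1
    FROM D
    GROUP BY Carrier, Airport
),
Weights AS (
    -- weight of each airport; n is taken to be the total number of rows in D
    SELECT Airport, COUNT(*) * 1.0 / (SELECT COUNT(*) FROM D) AS W
    FROM D
    GROUP BY Airport
    HAVING COUNT(DISTINCT Carrier) = 2  -- keep airports served by both carriers
)
SELECT Blocks.Carrier, SUM(AVG1 * W)
FROM Blocks, Weights
WHERE Blocks.Airport = Weights.Airport
GROUP BY Blocks.Carrier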

SELECT Carrier, AVG(Delayed)
FROM D
GROUP BY Carrier
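To see the naive and the rewritten query side by side, the sketch below loads the unbalanced toy table into an in-memory SQLite database and runs both. Several things here are assumptions rather than part of the slides: `D` is taken to hold one row per flight with `Delayed` as a 0/1 flag, `n` is read as the total row count, and the use of SQLite and the names `naive` and `adjusted` are illustrative choices.

```python
import sqlite3

# (delayed, total, carrier, airport) rows of the unbalanced toy table.
counts = [
    (50, 100, "AA", "COS"), (60, 100, "AA", "MFE"),
    (7, 10, "AA", "MTJ"),   (8, 10, "AA", "ROC"),
    (4, 10, "UA", "COS"),   (5, 10, "UA", "MFE"),
    (60, 100, "UA", "MTJ"), (70, 100, "UA", "ROC"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE D (Carrier TEXT, Airport TEXT, Delayed INTEGER)")
for delayed, total, carrier, airport in counts:
    # Expand the aggregated counts into one row per flight (assumed layout of D).
    rows = [(carrier, airport, 1)] * delayed
    rows += [(carrier, airport, 0)] * (total - delayed)
    conn.executemany("INSERT INTO D VALUES (?, ?, ?)", rows)

# Original query: pooled delay rate per carrier.
naive = dict(conn.execute("SELECT Carrier, AVG(Delayed) FROM D GROUP BY Carrier"))

# Rewritten query: per-airport averages re-weighted by each airport's share.
adjusted = dict(conn.execute("""
    WITH Blocks AS (
        SELECT Carrier, Airport, AVG(Delayed) AS AVG1
        FROM D
        GROUP BY Carrier, Airport
    ),
    Weights AS (
        SELECT Airport, COUNT(*) * 1.0 / (SELECT COUNT(*) FROM D) AS W
        FROM D
        GROUP BY Airport
        HAVING COUNT(DISTINCT Carrier) = 2
    )
    SELECT Blocks.Carrier, SUM(AVG1 * W)
    FROM Blocks JOIN Weights ON Blocks.Airport = Weights.Airport
    GROUP BY Blocks.Carrier
"""))

print(naive)     # pooled rates: AA looks better than UA
print(adjusted)  # re-weighted rates: AA is worse than UA
```

On this data the adjusted rates differ by 0.1 in favor of UA, in line with the ATE computed by hand, while the naive rates suggest the opposite.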

Experiments

E1: How easy is it to make a mistake?

SELECT Carrier, SUM(Delayed) / SUM(Total)
FROM FlightData
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
GROUP BY Carrier

Compare the query with its rewritten version

20% of the results are reversed

E2: End-to-end test: do women earn less than men?

Conclusion

Pros

Nice idea, important
Interesting examples

Cons

Fine-grained explanation is obscure
Continuous variables
Multiple treatments (naive is \(O(n^2)\))
