Bias in OLAP QUeries: Detection, Explanation, and Removal

Babak Salimi, Johannes Gehrke, Dan Suciu

Outline

  • Background: Simpson's Paradox
  • Formal definition
  • Detecting Bias
  • Explaining Bias
  • Resolving Bias
  • Experiment
SELECT SUM(Delayed) / SUM(Total) 
FROM FlightData
GROUP BY Carrier
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
Delayed Total Carrier Airport
5 10 AA COS
6 10 AA MFE
7 10 AA MTJ
8 10 AA ROC
4 10 UA COS
5 10 UA MFE
6 10 UA MTJ
7 10 UA ROC

AA: \(\frac{5+6+7+8}{10+10+10+10} = \frac{26}{40}\)

UA: \(\frac{4+5+6+7}{10+10+10+10} = \frac{22}{40}\)

Simpson's Paradox

SELECT SUM(Delayed) / SUM(Total) 
FROM FlightData
GROUP BY Carrier
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')
Delayed Total Carrier Airport
50 100 AA COS
60 100 AA MFE
7 10 AA MTJ
8 10 AA ROC
4 10 UA COS
5 10 UA MFE
60 100 UA MTJ
70 100 UA ROC

AA: \(\frac{50+60+7+8}{100+100+10+10} = \frac{125}{220}\)

UA: \(\frac{4+5+60+70}{10+10+100+100} = \frac{139}{220}\)

Simpson's Paradox

AA UA
COS 50/100 4/10
MFE 60/100 5/10
MTF 7/10 60/100
ROC 8/10 70/100

For each airport, it can be actually divided into small subgroups by airports, and the distribution for these subgroups are not the ​SAME.

Simpson's Paradox

SELECT SUM(Delayed) / SUM(Total) 
FROM FlightData
GROUP BY Carrier
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')

 At the heart of the issue is an incorrect interpretation of the query; while the analyst’s goal is to compare the causal effect of the carriers on delay, the OLAP query measures only their association.

Formal Definition

Principled business decision

Two alternatives: T \(\in \{t_0, t_1\} \)

An outcome: Y

Some other factors \(x_1, x_2, ...\)

\(t_0\): control group, taking the placebo

\(t_1\): treatment group: taking the medicine

\(Y(t_0), Y(t_1)\): Blood pressure after taking the placebo (medicine)

\(x_1, x_2,...\): Age, gender, ...

E.g.

In SQL

SELECT T, X, AVG(Y)
FROM D
WHERE C
GROUP BY T, X

T: UA or AA

D: FlightData

C: Carrier and Airport constraint

X: NULL

Y: Delay

We want to find the causal relationship between T (Carrier) and Y (Delay)

Neyman-Rubin Causal Model

Average treatment effect: \(ATE(T,Y)\)

\(= \mathbb{E}[Y(t_1) - Y(t_0)]\)

\(= AVG[Delay(AA) - Delay(AU)]\)

Neyman-Rubin Causal Model

Average treatment effect: \(ATE(T,Y)\)

\(= \mathbb{E}[Y(t_1) - Y(t_0)]\)

\(= AVG[Delay(AA) - Delay(AU)]\)

?

?

Or

Average treatment effect: \(ATE(T,Y)\)

\(= \mathbb{E}[Y(t_1) - Y(t_0)]\)

\(= \mathbb{E}[Y(t_1)] - \mathbb{E}[Y(t_0)] \)

\(= AVG[Delay(AA)] - AVG[Delay(AU)]\)

If \((Y(t_1), Y(t_0)) \perp T\), -- Not possible

Or,

for some other variables \(\textbf{Z}\) -- covariates,

\((Y(t_1), Y(t_0)) \perp T | \textbf{Z}=\textbf{z}\),

Our goal

With \((Y(t_1), Y(t_0)) \perp T | \textbf{Z}=\textbf{z}\),

\(ATE(T,Y) = \sum_{\textbf{z} \in \textbf{Z}} (\mathbb{E}[Y|T=t_1,\textbf{z}] - \mathbb{E}[Y|T=t_0,\textbf{z}]) Pr(\textbf{z})\)

A.k.a. We calculate the average delay for each (Carrier, Airport) individually, then re-weight and sum them up 

AA UA
COS 50/100 4/10
MFE 60/100 5/10
MTF 7/10 60/100
ROC 8/10 70/100

ATE=(50/100 - 4/10)*0.25 + (60/100 - 5/10)*0.25 + (7/10 - 60/100)*0.25 + (8/10 - 70/100)*0.25 =  1/10

Biased=(50+60+7+8)/(100+100+10+10) - (4+5+60+70) / (10+10+100+100) = -7/110

Covariate discovery?

  1. Learn a causal DAG from the data
  2. The parents of a node is a sufficient set of covariates
  3. Exponential to generate a DAG

Pearl's method

Markov Boundary

Detecting Bias

AA UA
COS 5/10 4/10
MFE 6/10 5/10
MTF 7/10 6/10
ROC 8/10 7/10
AA UA
COS 50/100 4/10
MFE 60/100 5/10
MTF 7/10 60/100
ROC 8/10 70/100

VS

Only when the distribution of 

\(Pr(Airport|Carrier=AA)\) and \(Pr(Airport| Carrier=AU)\) are the same, then we can call the query is balanced.

AA UA
COS 5/10 4/10
MFE 6/10 5/10
MTF 7/10 6/10
ROC 8/10 7/10

A.k.a. \(Carrier \perp Airport\)

\(I(Carrier;Airport) = 0.25 \neq 0\) with p < 0.001

If the query is balanced, then

\(AVG(Delay|Carrier=AU) - AVG(Delay|Carrier=AA)\)

is a unbiased estimator of

\(\mathbb{E}[Delay|Airport=AU] - \mathbb{E}[Delay|Airport=AA]\)

SELECT SUM(Delayed) / SUM(Total) 
FROM FlightData
GROUP BY Carrier
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')

Explaining Bias

  • Coerce-grained explanation
  • Fine-grained explanation

Coarse-grained Explaination

Intuitively, more dependent, more bias

\(\rho_Z = \frac{I(Carrier;Airport) - I(Carrier;Airport|Airport)}{...}\)

\(A \perp B \Rightarrow I(A;B) = 0\)

Rank attributes via \(\rho_z\)

Notice:

\(I(T;V) - I(T;V|Z) \)

\( = (H(T) + H(V) - H(TV)) - (H(TZ)+H(V)-H(TV)-H(Z))\)

\(= H(T)+H(Z)-H(TZ) \geq 0\)

So: \(\ 0 \leq \rho_X \leq 1\)

Fine-grained Explaination

Since I(X;Y) is calculated by \[ I(X;Y) = \sum_{x\in X} \sum_{y\in Y} Pr(x,y) \log(\frac{Pr(x,y)}{Pr(x)Pr(y)})\]

For each tuple (x,y) we can evaluate how much it contributes to the mutual information \(I(X;Y)\)

E.g. (Airport=ROC, Carrier=UA) contributes most for the bias

Resolving Bias

A.k.a. We calculate the average delay for each (Carrier, Airport) individually, then re-weight and sum them up 

AA UA
COS 50/100 4/10
MFE 60/100 5/10
MTF 7/10 60/100
ROC 8/10 70/100

ATE=(50/100 - 4/10)*0.25 + (60/100 - 5/10)*0.25 + (7/10 - 60/100)*0.25 + (8/10 - 70/100)*0.25 =  1/10

With \((Y(t_1), Y(t_0)) \perp T | \textbf{Z}=\textbf{z}\),

\(ATE(T,Y) = \sum_{\textbf{z} \in \textbf{Z}} (\mathbb{E}[Y|T=t_1,\textbf{z}] - \mathbb{E}[Y|T=t_0,\textbf{z}]) Pr(\textbf{z})\)

WITH Blocks AS (
    SELECT Carrier, Airport, AVG(Delayed) as AVG1
    FROM D
    GROUP BY Carrier, Airport
),
Weights AS (
    SELECT Airport, count(*) / n as W
    FROM D
    GROUP BY Airport
    HAVING count(DISTINCT Carrier) = 2
)
SELECT Carrier, sum(AVG1 * W)
FROM Blocks, Weights
GROUP BY Carrier
WHERE Blocks.Airport = Weights.Airport
SELECT Carrier, AVG(Delayed)
FROM D
GROUP BY Carrier

Experiments

E1: How easy to make a mistake

SELECT SUM(Delayed) / SUM(Total) 
FROM FlightData
GROUP BY Carrier
WHERE Carrier IN ('AA', 'UA')
AND Airport IN ('COS', 'MFE', 'MTJ', 'ROC')

With its rewritten version

20% reversed

E2: End to end testing, Women earn less than men?

Conclusion

  • Nice Idea, Important
  • Interesting examples

Pros

Cons

  • Fined-grained explanation obscure
  • Continuous variable
  • Multi treatments (naive is \(O(n^2)\))

deck

By Weiyüen Wu