Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances

Zhengjie Miao et al.

Motivation

Author	Venue	Year	Pubcnt
X	SIGKDD	2006	4
X	SIGKDD	2007	1
X	SIGKDD	2008	4

Why X only published 1 SIGKDD paper in 2007?

SELECT Author, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author, Venue, Year

Motivation

Author	Venue	Year	Pubcnt
X	SIGKDD	2006	4
X	SIGKDD	2007	1
X	SIGKDD	2008	4
X	VLDB	2006	5
X	VLDB	2007	5
X	VLDB	2008	5
X	ICDE	2006	5
X	ICDE	2007	7
X	ICDE	2008	4

Group by year: 2006 - 14 , 2007 - 13 , 2008 - 13

2006 - 14

2007 - 13

2008 - 13

Motivation

SIGKDD

2007

ICDE

2007

Because of

Existing Work: Provenance

ICDE

2007

Author 1	Author 2	Venue	Year	Title
X	Y	ICDE	2007	A
X	Y	ICDE	2007	B
X	Y	ICDE	2007	C
X	Z	ICDE	2007	D
...	...	...	...	...

SELECT Author1, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author1, Venue, Year

Why X publish 7 ICDE papers in 2007?

Existing Work: Provenance

ICDE

2007

Author 1	Author 2	Venue	Year	Title
X	Y	ICDE	2007	A
X	Y	ICDE	2007	B
X	Y	ICDE	2007	C
X	Z	ICDE	2007	D
...	...	...	...	...

Because of

Intervention

ICDE

2007

Author 1	Author 2	Venue	Year	Title
X	Y	ICDE	2007	A
X	Y	ICDE	2007	B
X	Y	ICDE	2007	C
X	Z	ICDE	2007	D
...	...	...	...	...

Because of

What to delete?

SIGKDD

2007

Author 1	Author 2	Venue	Year	Title
X	Y	ICDE	2007	A
X	Y	ICDE	2007	B
X	Y	ICDE	2007	C
X	Z	ICDE	2007	D
...	...	...	...	...

Key Idea

SELECT Year, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year

Year	Count
2006	14
2007	13
2008	13

SELECT Year, Venue, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year, Venue

Year	Venue	Count
2007	SIGKDD	1
2007	ICDE	7

\(\curvearrowleft\)

Exp.

Aggregation Regression Pattern (ARP)

Year	Count
2006	14
2007	13
2008	13

\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)

In other words:

\([Author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)

ARP: \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)

Example:

Relavant ARP

\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\) is a relavant ARP of

tuple \(t\): (X, 2007, SIGKDD, 1)

Example:

Given a tuple \(t\) from the query \(\gamma_{G, Agg(A)}\), that the user complain

Relavant ARP is \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \cup V \subset G\)

(Conserved Quantity)

Refinement ARP

\(P = [author,venue]: year \stackrel{Const}{\leadsto} count(*)\) is a refinement ARP of

\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)

Example:

Given a Relavant ARP \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)

Refinement ARP is \(P = [F']: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \subset F'\)

(Drill Down)

Same Problem in ARP term

The SIGKDD problem can be formulated as:

Complaint: a tuple \(t\) generated by \(\gamma_{\{author, venue, year\},count(*)}\), the \(count(*)\) is "low".
Relavant ARP: Select \(\{author\} \cup \{year\} \subset \{author, venue, year\}\), \([author]: year \stackrel{Const}{\leadsto} count(*)\)
Refinement ARP: Add in "venue", \([author, venue]: year \stackrel{Const}{\leadsto} count(*)\)
Explanation: a tuple \(t' \neq t\) that \(t' \nsim (3)\), and is in the opposite direction of \(t\).

Example

The SIGKDD problem can be formulated as:

Complaint: a tuple \(t\) (X, SIGKDD, 2007, 1), 1 is low.
Relavant ARP: Select \([author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)
Refinement ARP: Add in "ICDE", \[[author=X, venue=ICDE]: \{2006,2007,2008\} \stackrel{Const=5}{\leadsto} count(*)\]
Explanation: \(t' =\) (X, ICDE, 2007, 7), that \(t' \nsim (3)\), and 7 is high.

Offline

Online

Challenges

Year	Count
2006	14
2007	13
2008	13

Year	Venue	Count
2007	SIGKDD	1
2007	ICDE	7

1. Aggregation Regression Pattern (ARP) Mining

2. Explanation Generation

2007

ICDE

Regression

Author	Year	Count
Z	2006	5
Z	2007	6
Z	2008	5

How to get \(M\) in \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\) ?

Define M is a Regression Model in {Const, Linear,...}
Fit \(Agg(A) \sim V\) (e.g. \(count \sim year\))

Regression

1. Not all the authors obey the ARP

\([author=Z]: year \stackrel{Const}{\not\leadsto} count(*)\)

Author	Year	Count
Z	2006	5
Z	2007	11
Z	2008	3

2. Person's Chi-Square test for \(const\): remove ARP with test score \(\lt \theta\) (ARP Regression Quality)

M can also be Linear Regresssion and they use R-squared statistic to test a good fit.

2006

3. Remove low supported ARP: \(\lt \delta\) (ARP Local Support)

ARP Mining

Sort group attributes and share queries:

\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\)

\([A, B]: C, D \stackrel{Const}{\leadsto} count(*)\)

Limit Group Size: e.g. only support group by less than 4 attributes (\(\psi = 4\)).

Idea: exhaustively run group by queries offline.

Using "cube by" operator:

Group By A, B can be calculated through Group By A, B, C

FD pruning:

if \(\{A,B\} \rightarrow C\) , no need to calculate

\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\), if \([A, B]: D \stackrel{Const}{\leadsto} count(*)\) is calculated.

ARP Pruning

Some ARP may only hold locally:

Spurious: \([venue=SIGMOD]: year \stackrel{Linear}{\leadsto} count(*)\)

But \([venue=ICDE]: year \stackrel{Linear}{\not\leadsto} count(*)\)

and \([venue=SIGKDD]: year \stackrel{Linear}{\not\leadsto} count(*)\)

Some ARP may only have few instances.
\(\gamma\): Minimal percentage of good fits (ARP Global Quality)
\(\Delta\): Minimal support of good fits (ARP Global Support)

Explanation Generation

Idea: search top k expl. through generated ARPs.

Define:

\(dev(t')\): Deviation of \(t'\) from regression value.

E.g. Each year X publish ~ 5 @ ICDE, in 2007, 7 @ ICDE. \(dev = 7 - 5 = 2\).

\(dist(t_1, t_2)\): The distance between two tuples.

Explanation generation:

Find Relavant ARP \([author=X]: \{2006,2007,2008\} \stackrel{Const}{\leadsto} count(*)\).
Find Refinement ARP \([author=X, venue=ICDE]: \{2006,2007,2008\} \stackrel{Const}{\leadsto} count(*)\)
Score tuples \(t'\) in (2) based on \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\)
Select the top scored tuple as the explanation.

Example Explanations

Rank	Explanation	Score
1	(X, ICDE, 2007,7)	13.78
2	(X, ICDE, 2006,5)	10.91
3	(X, ICDM, 2007, 5)	6.44
...	...	...
10	(X, 2010, 63)	3.20

Optimization

Goal: Top K explanation tuples.

Idea: Calculate the upperbound for \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\).

Action: Drop \(t'\) if \(\text{UB}_{t'} \lt \min_{k=1..K}Score(t_k)\)

Experiment Setting

Datasets:
- DBLP: 1M rows, 4 columns.
- Chicago Crime: after preprocessing, 1M rows, ? attributes.
Environment: Python 3.5, PostgreSQL 10.4
Machine: 128G RAM, 4x 1TB HDD
Cherry picked a set of parameters (\(\psi = 4, \theta=0.5,\lambda=0.5, \delta=15, \Delta=15\))

Time Spent in ARP Mining

ARP-mine: Share query + Sort
share-grp: Share query
Cube: Cube
Naive: No optimiation

ARP-mine out performs others.

Time Spent in ARP Mining

ARP-mine: Share query + Sort
share-grp: Share query
Cube: Cube
Naive: No optimiation

All three methods scale linearly w.r.t. data size

Time Spent in ARP Mining

ARP-mine: Share query + Sort
share-grp: Share query
Cube: Cube
Naive: No optimiation

FD has positive effects for ARP mining.

Time Spent in Explanation Generation

ExplGen-Naive: Brute Force Search

ExplGen-Opt: With UB pruning.

Expl. generation time increase linearly w.r.t. the size of candidate ARPs.

Time Spent in Explanation Generation

ExplGen-Naive: Brute Force Search

ExplGen-Opt: With UB pruning.

Expl. generation time increase exponentially w.r.t. the # of attributes in the user question.

Parameter Sensitivity Analysis

Synthetic dataset: remove/add tuples to groups in real-world dataset.
10 user questions, top 10 explanations.
Vary parameters ( \(\theta,\lambda, \delta, \Delta\)), report # of generated explanations that are correct.

Parameter Sensitivity Analysis

\(\Delta\): ARP Global Support

\(\delta\): ARP Local Support

\(\lambda\): ARP Global Quality

\(\theta\): ARP Regression Quality

\(\delta\): Won't affect the result too much.
\(\Delta\): The lower the better.
\(\lambda\): 0.3-0.5.
\(\delta\): The lower the better.

Conclusion

ARP can explain some outliers in the aggregation query that provenance cannot explain.

While naively searching for ARP explanations is time consuming, optimizations can be done to accelerate the process.