Going Beyond Provenance: Explaining Query Answers with Patternbased Counterbalances
Zhengjie Miao et al.
Motivation
Author  Venue  Year  Pubcnt 

X  SIGKDD  2006  4 
X  SIGKDD  2007  1 
X  SIGKDD  2008  4 
1 
Why X only published 1 SIGKDD paper in 2007?
SELECT Author, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author, Venue, Year
Motivation
Author  Venue  Year  Pubcnt 

X  SIGKDD  2006  4 
X  SIGKDD  2007  1 
X  SIGKDD  2008  4 
X  VLDB  2006  5 
X  VLDB  2007  5 
X  VLDB  2008  5 
X  ICDE  2006  5 
X  ICDE  2007  7 
X  ICDE  2008  4 
7 
Group by year: 2006  14 , 2007  13 , 2008  13
4 
5 
5 
1 
4 
5 
5 
4 
2006  14
2007  13
2008  13
7 
Motivation
X  SIGKDD  2007  1 
X  ICDE  2007  7 
Because of
Existing Work: Provenance
X  ICDE  2007  7 
Author 1  Author 2  Venue  Year  Title 

X  Y  ICDE  2007  A 
X  Y  ICDE  2007  B 
X  Y  ICDE  2007  C 
X  Z  ICDE  2007  D 
...  ...  ...  ...  ... 
SELECT Author1, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author1, Venue, Year
7 
Why X publish 7 ICDE papers in 2007?
Existing Work: Provenance
X  ICDE  2007  7 
Author 1  Author 2  Venue  Year  Title 

X  Y  ICDE  2007  A 
X  Y  ICDE  2007  B 
X  Y  ICDE  2007  C 
X  Z  ICDE  2007  D 
...  ...  ...  ...  ... 
Because of
Y 
Y 
Y 
Intervention
X  ICDE  2007  7 
Author 1  Author 2  Venue  Year  Title 

X  Y  ICDE  2007  A 
X  Y  ICDE  2007  B 
X  Y  ICDE  2007  C 
X  Z  ICDE  2007  D 
...  ...  ...  ...  ... 
Because of

Y 
Y 
Y 
What to delete?
X  SIGKDD  2007  1 
Author 1  Author 2  Venue  Year  Title 

X  Y  ICDE  2007  A 
X  Y  ICDE  2007  B 
X  Y  ICDE  2007  C 
X  Z  ICDE  2007  D 
...  ...  ...  ...  ... 
Key Idea
SELECT Year, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year
Year  Count 

2006  14 
2007  13 
2008  13 
SELECT Year, Venue, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year, Venue
Year  Venue  Count 

2007  SIGKDD  1 
2007  ICDE  7 
\(\curvearrowleft\)
Exp.
Aggregation Regression Pattern (ARP)
Year  Count 

2006  14 
2007  13 
2008  13 
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
In other words:
\([Author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)
ARP: \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)
Example:
Relavant ARP
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\) is a relavant ARP of
tuple \(t\): (X, 2007, SIGKDD, 1)
Example:
Given a tuple \(t\) from the query \(\gamma_{G, Agg(A)}\), that the user complain
Relavant ARP is \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \cup V \subset G\)
(Conserved Quantity)
Refinement ARP
\(P = [author,venue]: year \stackrel{Const}{\leadsto} count(*)\) is a refinement ARP of
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
Example:
Given a Relavant ARP \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)
Refinement ARP is \(P = [F']: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \subset F'\)
(Drill Down)
Same Problem in ARP term
The SIGKDD problem can be formulated as:
 Complaint: a tuple \(t\) generated by \(\gamma_{\{author, venue, year\},count(*)}\), the \(count(*)\) is "low".
 Relavant ARP: Select \(\{author\} \cup \{year\} \subset \{author, venue, year\}\), \([author]: year \stackrel{Const}{\leadsto} count(*)\)
 Refinement ARP: Add in "venue", \([author, venue]: year \stackrel{Const}{\leadsto} count(*)\)
 Explanation: a tuple \(t' \neq t\) that \(t' \nsim (3)\), and is in the opposite direction of \(t\).
Example
The SIGKDD problem can be formulated as:
 Complaint: a tuple \(t\) (X, SIGKDD, 2007, 1), 1 is low.
 Relavant ARP: Select \([author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)
 Refinement ARP: Add in "ICDE", \[[author=X, venue=ICDE]: \{2006,2007,2008\} \stackrel{Const=5}{\leadsto} count(*)\]
 Explanation: \(t' =\) (X, ICDE, 2007, 7), that \(t' \nsim (3)\), and 7 is high.
Offline
Online
Challenges
Year  Count 

2006  14 
2007  13 
2008  13 
Year  Venue  Count 

2007  SIGKDD  1 
2007  ICDE  7 
1. Aggregation Regression Pattern (ARP) Mining
2. Explanation Generation
2007  ICDE  7 
Regression
Author  Year  Count 

Z  2006  5 
Z  2007  6 
Z  2008  5 
How to get \(M\) in \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\) ?
 Define M is a Regression Model in {Const, Linear,...}
 Fit \(Agg(A) \sim V\) (e.g. \(count \sim year\))
Regression
1. Not all the authors obey the ARP
\([author=Z]: year \stackrel{Const}{\not\leadsto} count(*)\)
Author  Year  Count 

Z  2006  5 
Z  2007  11 
Z  2008  3 
2. Person's ChiSquare test for \(const\): remove ARP with test score \(\lt \theta\) (ARP Regression Quality)
M can also be Linear Regresssion and they use Rsquared statistic to test a good fit.
W  2006  5 
3. Remove low supported ARP: \(\lt \delta\) (ARP Local Support)
ARP Mining
 Sort group attributes and share queries:
\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\)
\([A, B]: C, D \stackrel{Const}{\leadsto} count(*)\)
 Limit Group Size: e.g. only support group by less than 4 attributes (\(\psi = 4\)).
Idea: exhaustively run group by queries offline.
 Using "cube by" operator:
Group By A, B can be calculated through Group By A, B, C
 FD pruning:
if \(\{A,B\} \rightarrow C\) , no need to calculate
\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\), if \([A, B]: D \stackrel{Const}{\leadsto} count(*)\) is calculated.
ARP Pruning
 Some ARP may only hold locally:
Spurious: \([venue=SIGMOD]: year \stackrel{Linear}{\leadsto} count(*)\)
But \([venue=ICDE]: year \stackrel{Linear}{\not\leadsto} count(*)\)
and \([venue=SIGKDD]: year \stackrel{Linear}{\not\leadsto} count(*)\)
 Some ARP may only have few instances.
 \(\gamma\): Minimal percentage of good fits (ARP Global Quality)
 \(\Delta\): Minimal support of good fits (ARP Global Support)
Explanation Generation
Idea: search top k expl. through generated ARPs.
Define:
 \(dev(t')\): Deviation of \(t'\) from regression value.
E.g. Each year X publish ~ 5 @ ICDE, in 2007, 7 @ ICDE. \(dev = 7  5 = 2\).
 \(dist(t_1, t_2)\): The distance between two tuples.
Explanation generation:
 Find Relavant ARP \([author=X]: \{2006,2007,2008\} \stackrel{Const}{\leadsto} count(*)\).
 Find Refinement ARP \([author=X, venue=ICDE]: \{2006,2007,2008\} \stackrel{Const}{\leadsto} count(*)\)
 Score tuples \(t'\) in (2) based on \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\)
 Select the top scored tuple as the explanation.
Example Explanations
Rank  Explanation  Score 

1  (X, ICDE, 2007,7)  13.78 
2  (X, ICDE, 2006,5)  10.91 
3  (X, ICDM, 2007, 5)  6.44 
...  ...  ... 
10  (X, 2010, 63)  3.20 
Optimization
Goal: Top K explanation tuples.
Idea: Calculate the upperbound for \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\).
Action: Drop \(t'\) if \(\text{UB}_{t'} \lt \min_{k=1..K}Score(t_k)\)
Experiment Setting
 Datasets:
 DBLP: 1M rows, 4 columns.
 Chicago Crime: after preprocessing, 1M rows, ? attributes.
 Environment: Python 3.5, PostgreSQL 10.4
 Machine: 128G RAM, 4x 1TB HDD
 Cherry picked a set of parameters (\(\psi = 4, \theta=0.5,\lambda=0.5, \delta=15, \Delta=15\))
Time Spent in ARP Mining
 ARPmine: Share query + Sort
 sharegrp: Share query
 Cube: Cube
 Naive: No optimiation
ARPmine out performs others.
Time Spent in ARP Mining
 ARPmine: Share query + Sort
 sharegrp: Share query
 Cube: Cube
 Naive: No optimiation
All three methods scale linearly w.r.t. data size
Time Spent in ARP Mining
 ARPmine: Share query + Sort
 sharegrp: Share query
 Cube: Cube
 Naive: No optimiation
FD has positive effects for ARP mining.
Time Spent in Explanation Generation
ExplGenNaive: Brute Force Search
ExplGenOpt: With UB pruning.
Expl. generation time increase linearly w.r.t. the size of candidate ARPs.
Time Spent in Explanation Generation
ExplGenNaive: Brute Force Search
ExplGenOpt: With UB pruning.
Expl. generation time increase exponentially w.r.t. the # of attributes in the user question.
Parameter Sensitivity Analysis
 Synthetic dataset: remove/add tuples to groups in realworld dataset.
 10 user questions, top 10 explanations.
 Vary parameters ( \(\theta,\lambda, \delta, \Delta\)), report # of generated explanations that are correct.
Parameter Sensitivity Analysis
\(\Delta\): ARP Global Support
\(\delta\): ARP Local Support
\(\lambda\): ARP Global Quality
\(\theta\): ARP Regression Quality
 \(\delta\): Won't affect the result too much.
 \(\Delta\): The lower the better.
 \(\lambda\): 0.30.5.
 \(\delta\): The lower the better.
Conclusion
 ARP can explain some outliers in the aggregation query that provenance cannot explain.
 While naively searching for ARP explanations is time consuming, optimizations can be done to accelerate the process.
Going Beyond Provenance: Explaining Query Answers with Patternbased Counterbalances
By Weiyüen Wu
Going Beyond Provenance: Explaining Query Answers with Patternbased Counterbalances
 185