Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances
Zhengjie Miao et al.
Motivation
Author | Venue | Year | Pubcnt |
---|---|---|---|
X | SIGKDD | 2006 | 4 |
X | SIGKDD | 2007 | 1 |
X | SIGKDD | 2008 | 4 |
1 |
Why X only published 1 SIGKDD paper in 2007?
SELECT Author, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author, Venue, Year
Motivation
Author | Venue | Year | Pubcnt |
---|---|---|---|
X | SIGKDD | 2006 | 4 |
X | SIGKDD | 2007 | 1 |
X | SIGKDD | 2008 | 4 |
X | VLDB | 2006 | 5 |
X | VLDB | 2007 | 5 |
X | VLDB | 2008 | 5 |
X | ICDE | 2006 | 5 |
X | ICDE | 2007 | 7 |
X | ICDE | 2008 | 4 |
7 |
Group by year: 2006 - 14 , 2007 - 13 , 2008 - 13
4 |
5 |
5 |
1 |
4 |
5 |
5 |
4 |
2006 - 14
2007 - 13
2008 - 13
7 |
Motivation
X | SIGKDD | 2007 | 1 |
X | ICDE | 2007 | 7 |
Because of
Existing Work: Provenance
X | ICDE | 2007 | 7 |
Author 1 | Author 2 | Venue | Year | Title |
---|---|---|---|---|
X | Y | ICDE | 2007 | A |
X | Y | ICDE | 2007 | B |
X | Y | ICDE | 2007 | C |
X | Z | ICDE | 2007 | D |
... | ... | ... | ... | ... |
SELECT Author1, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author1, Venue, Year
7 |
Why X publish 7 ICDE papers in 2007?
Existing Work: Provenance
X | ICDE | 2007 | 7 |
Author 1 | Author 2 | Venue | Year | Title |
---|---|---|---|---|
X | Y | ICDE | 2007 | A |
X | Y | ICDE | 2007 | B |
X | Y | ICDE | 2007 | C |
X | Z | ICDE | 2007 | D |
... | ... | ... | ... | ... |
Because of
Y |
Y |
Y |
Intervention
X | ICDE | 2007 | 7 |
Author 1 | Author 2 | Venue | Year | Title |
---|---|---|---|---|
X | Y | ICDE | 2007 | A |
X | Y | ICDE | 2007 | B |
X | Y | ICDE | 2007 | C |
X | Z | ICDE | 2007 | D |
... | ... | ... | ... | ... |
Because of
|
Y |
Y |
Y |
What to delete?
X | SIGKDD | 2007 | 1 |
Author 1 | Author 2 | Venue | Year | Title |
---|---|---|---|---|
X | Y | ICDE | 2007 | A |
X | Y | ICDE | 2007 | B |
X | Y | ICDE | 2007 | C |
X | Z | ICDE | 2007 | D |
... | ... | ... | ... | ... |
Key Idea
SELECT Year, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year
Year | Count |
---|---|
2006 | 14 |
2007 | 13 |
2008 | 13 |
SELECT Year, Venue, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year, Venue
Year | Venue | Count |
---|---|---|
2007 | SIGKDD | 1 |
2007 | ICDE | 7 |
\(\curvearrowleft\)
Exp.
Aggregation Regression Pattern (ARP)
Year | Count |
---|---|
2006 | 14 |
2007 | 13 |
2008 | 13 |
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
In other words:
\([Author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)
ARP: \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)
Example:
Relavant ARP
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\) is a relavant ARP of
tuple \(t\): (X, 2007, SIGKDD, 1)
Example:
Given a tuple \(t\) from the query \(\gamma_{G, Agg(A)}\), that the user complain
Relavant ARP is \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \cup V \subset G\)
(Conserved Quantity)
Refinement ARP
\(P = [author,venue]: year \stackrel{Const}{\leadsto} count(*)\) is a refinement ARP of
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
Example:
Given a Relavant ARP \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)
Refinement ARP is \(P = [F']: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \subset F'\)
(Drill Down)
Same Problem in ARP term
The SIGKDD problem can be formulated as:
- Complaint: a tuple \(t\) generated by \(\gamma_{\{author, venue, year\},count(*)}\), the \(count(*)\) is "low".
- Relavant ARP: Select \(\{author\} \cup \{year\} \subset \{author, venue, year\}\), \([author]: year \stackrel{Const}{\leadsto} count(*)\)
- Refinement ARP: Add in "venue", \([author, venue]: year \stackrel{Const}{\leadsto} count(*)\)
- Explanation: a tuple \(t' \neq t\) that \(t' \nsim (3)\), and is in the opposite direction of \(t\).
Example
The SIGKDD problem can be formulated as:
- Complaint: a tuple \(t\) (X, SIGKDD, 2007, 1), 1 is low.
- Relavant ARP: Select \([author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)
- Refinement ARP: Add in "ICDE", \[[author=X, venue=ICDE]: \{2006,2007,2008\} \stackrel{Const=5}{\leadsto} count(*)\]
- Explanation: \(t' =\) (X, ICDE, 2007, 7), that \(t' \nsim (3)\), and 7 is high.
Offline
Online
Challenges
Year | Count |
---|---|
2006 | 14 |
2007 | 13 |
2008 | 13 |
Year | Venue | Count |
---|---|---|
2007 | SIGKDD | 1 |
2007 | ICDE | 7 |
1. Aggregation Regression Pattern (ARP) Mining
2. Explanation Generation
2007 | ICDE | 7 |
Regression
Author | Year | Count |
---|---|---|
Z | 2006 | 5 |
Z | 2007 | 6 |
Z | 2008 | 5 |
How to get \(M\) in \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\) ?
- Define M is a Regression Model in {Const, Linear,...}
- Fit \(Agg(A) \sim V\) (e.g. \(count \sim year\))
Regression
1. Not all the authors obey the ARP
\([author=Z]: year \stackrel{Const}{\not\leadsto} count(*)\)
Author | Year | Count |
---|---|---|
Z | 2006 | 5 |
Z | 2007 | 11 |
Z | 2008 | 3 |
2. Person's Chi-Square test for \(const\): remove ARP with test score \(\lt \theta\) (ARP Regression Quality)
M can also be Linear Regresssion and they use R-squared statistic to test a good fit.
W | 2006 | 5 |
3. Remove low supported ARP: \(\lt \delta\) (ARP Local Support)
ARP Mining
- Sort group attributes and share queries:
\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\)
\([A, B]: C, D \stackrel{Const}{\leadsto} count(*)\)
- Limit Group Size: e.g. only support group by less than 4 attributes (\(\psi = 4\)).
Idea: exhaustively run group by queries offline.
- Using "cube by" operator:
Group By A, B can be calculated through Group By A, B, C
- FD pruning:
if \(\{A,B\} \rightarrow C\) , no need to calculate
\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\), if \([A, B]: D \stackrel{Const}{\leadsto} count(*)\) is calculated.
ARP Pruning
- Some ARP may only hold locally:
Spurious: \([venue=SIGMOD]: year \stackrel{Linear}{\leadsto} count(*)\)
But \([venue=ICDE]: year \stackrel{Linear}{\not\leadsto} count(*)\)
and \([venue=SIGKDD]: year \stackrel{Linear}{\not\leadsto} count(*)\)
- Some ARP may only have few instances.
- \(\gamma\): Minimal percentage of good fits (ARP Global Quality)
- \(\Delta\): Minimal support of good fits (ARP Global Support)
Explanation Generation
Idea: search top k expl. through generated ARPs.
Define:
- \(dev(t')\): Deviation of \(t'\) from regression value.
E.g. Each year X publish ~ 5 @ ICDE, in 2007, 7 @ ICDE. \(dev = 7 - 5 = 2\).
- \(dist(t_1, t_2)\): The distance between two tuples.
Explanation generation:
- Find Relavant ARP \([author=X]: \{2006,2007,2008\} \stackrel{Const}{\leadsto} count(*)\).
- Find Refinement ARP \([author=X, venue=ICDE]: \{2006,2007,2008\} \stackrel{Const}{\leadsto} count(*)\)
- Score tuples \(t'\) in (2) based on \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\)
- Select the top scored tuple as the explanation.
Example Explanations
Rank | Explanation | Score |
---|---|---|
1 | (X, ICDE, 2007,7) | 13.78 |
2 | (X, ICDE, 2006,5) | 10.91 |
3 | (X, ICDM, 2007, 5) | 6.44 |
... | ... | ... |
10 | (X, 2010, 63) | 3.20 |
Optimization
Goal: Top K explanation tuples.
Idea: Calculate the upperbound for \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\).
Action: Drop \(t'\) if \(\text{UB}_{t'} \lt \min_{k=1..K}Score(t_k)\)
Experiment Setting
- Datasets:
- DBLP: 1M rows, 4 columns.
- Chicago Crime: after preprocessing, 1M rows, ? attributes.
- Environment: Python 3.5, PostgreSQL 10.4
- Machine: 128G RAM, 4x 1TB HDD
- Cherry picked a set of parameters (\(\psi = 4, \theta=0.5,\lambda=0.5, \delta=15, \Delta=15\))
Time Spent in ARP Mining
- ARP-mine: Share query + Sort
- share-grp: Share query
- Cube: Cube
- Naive: No optimiation
ARP-mine out performs others.
Time Spent in ARP Mining
- ARP-mine: Share query + Sort
- share-grp: Share query
- Cube: Cube
- Naive: No optimiation
All three methods scale linearly w.r.t. data size
Time Spent in ARP Mining
- ARP-mine: Share query + Sort
- share-grp: Share query
- Cube: Cube
- Naive: No optimiation
FD has positive effects for ARP mining.
Time Spent in Explanation Generation
ExplGen-Naive: Brute Force Search
ExplGen-Opt: With UB pruning.
Expl. generation time increase linearly w.r.t. the size of candidate ARPs.
Time Spent in Explanation Generation
ExplGen-Naive: Brute Force Search
ExplGen-Opt: With UB pruning.
Expl. generation time increase exponentially w.r.t. the # of attributes in the user question.
Parameter Sensitivity Analysis
- Synthetic dataset: remove/add tuples to groups in real-world dataset.
- 10 user questions, top 10 explanations.
- Vary parameters ( \(\theta,\lambda, \delta, \Delta\)), report # of generated explanations that are correct.
Parameter Sensitivity Analysis
\(\Delta\): ARP Global Support
\(\delta\): ARP Local Support
\(\lambda\): ARP Global Quality
\(\theta\): ARP Regression Quality
- \(\delta\): Won't affect the result too much.
- \(\Delta\): The lower the better.
- \(\lambda\): 0.3-0.5.
- \(\delta\): The lower the better.
Conclusion
- ARP can explain some outliers in the aggregation query that provenance cannot explain.
- While naively searching for ARP explanations is time consuming, optimizations can be done to accelerate the process.
Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances
By Weiyüen Wu
Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances
- 759