Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances

Zhengjie Miao et al.

Motivation

Author Venue Year Pubcnt
X SIGKDD 2006 4
X SIGKDD 2007 1
X SIGKDD 2008 4

Why did X publish only 1 SIGKDD paper in 2007?

SELECT Author, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author, Venue, Year

Motivation

Author Venue Year Pubcnt
X SIGKDD 2006 4
X SIGKDD 2007 1
X SIGKDD 2008 4
X VLDB 2006 5
X VLDB 2007 5
X VLDB 2008 5
X ICDE 2006 5
X ICDE 2007 7
X ICDE 2008 4

Group by year: 2006 - 14, 2007 - 13, 2008 - 13
Motivation

X SIGKDD 2007 1

    because of

X ICDE 2007 7

Existing Work: Provenance

X ICDE 2007 7
Author 1 Author 2 Venue Year Title
X Y ICDE 2007 A
X Y ICDE 2007 B
X Y ICDE 2007 C
X Z ICDE 2007 D
... ... ... ... ...
SELECT Author1, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author1, Venue, Year

Why did X publish 7 ICDE papers in 2007?

Existing Work: Provenance

X ICDE 2007 7
Author 1 Author 2 Venue Year Title
X Y ICDE 2007 A
X Y ICDE 2007 B
X Y ICDE 2007 C
X Z ICDE 2007 D
... ... ... ... ...

Because of the input tuples in its provenance, e.g., the three papers co-authored with Y.

Intervention

X ICDE 2007 7
Author 1 Author 2 Venue Year Title
X Y ICDE 2007 A
X Y ICDE 2007 B
X Y ICDE 2007 C
X Z ICDE 2007 D
... ... ... ... ...

Because of the three papers co-authored with Y: deleting them drops the count from 7 to 4.

What to delete?

X SIGKDD 2007 1
Author 1 Author 2 Venue Year Title
X Y ICDE 2007 A
X Y ICDE 2007 B
X Y ICDE 2007 C
X Z ICDE 2007 D
... ... ... ... ...

For the low-count complaint (X, SIGKDD, 2007, 1), there is nothing useful to delete: removing input tuples can only lower a count, so intervention cannot explain why it is low.

Key Idea

SELECT Year, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year
Year Count
2006 14
2007 13
2008 13
SELECT Year, Venue, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year, Venue
Year Venue Count
2007 SIGKDD 1
2007 ICDE 7

Explanation: the high count (2007, ICDE, 7) counterbalances the low count (2007, SIGKDD, 1), keeping X's yearly total roughly constant.

Aggregation Regression Pattern (ARP)

ARP: \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)

Example:

Year Count
2006 14
2007 13
2008 13

\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)

In other words:

\([Author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)
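As a concrete illustration (not the paper's implementation), an ARP can be represented by a small data structure; the field names below are illustrative:

```python
# Minimal sketch of an ARP P = [F]: V ~M~> Agg(A); names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ARP:
    F: frozenset   # partition attributes, e.g. {"author"}
    V: frozenset   # predictor attributes, e.g. {"year"}
    agg: str       # aggregate function, e.g. "count(*)"
    model: str     # regression model M, e.g. "Const" or "Linear"

# The pattern "for each author, the yearly paper count is roughly constant":
p = ARP(F=frozenset({"author"}), V=frozenset({"year"}),
        agg="count(*)", model="Const")
```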

Relevant ARP

Given a tuple \(t\), from the query \(\gamma_{G, Agg(A)}\), that the user complains about,

a relevant ARP is \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \cup V \subset G\)

(Conserved quantity)

Example:

\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\) is a relevant ARP of

tuple \(t\) = (X, SIGKDD, 2007, 1)

Refinement ARP

Given a relevant ARP \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\),

a refinement ARP is \(P' = [F']: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \subset F'\)

(Drill down)

Example:

\(P' = [author, venue]: year \stackrel{Const}{\leadsto} count(*)\) is a refinement ARP of

\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
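A minimal sketch of the two structural checks above, using plain sets of attribute names (the helper names are illustrative, not from the paper):

```python
# Structural checks for relevant and refinement ARPs; attribute sets only,
# the regression model and its quality are checked elsewhere.
def is_relevant(F, V, agg, G, question_agg):
    # [F]: V ~M~> Agg(A) is relevant to a complaint over group-by attributes G
    # if it uses the same aggregate and F u V is a strict subset of G.
    return agg == question_agg and (set(F) | set(V)) < set(G)

def is_refinement(F_refined, F, V_refined, V, agg_refined, agg):
    # [F']: V ~M~> Agg(A) refines [F]: V ~M~> Agg(A) if F is a strict subset of F'.
    return set(F) < set(F_refined) and set(V) == set(V_refined) and agg == agg_refined

assert is_relevant({"author"}, {"year"}, "count(*)",
                   {"author", "venue", "year"}, "count(*)")
assert is_refinement({"author", "venue"}, {"author"},
                     {"year"}, {"year"}, "count(*)", "count(*)")
```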

Same Problem in ARP Terms

The SIGKDD problem can be formulated as:

  1. Complaint: a tuple \(t\) generated by \(\gamma_{\{author, venue, year\},count(*)}\) whose \(count(*)\) is "low".
  2. Relevant ARP: select \(\{author\} \cup \{year\} \subset \{author, venue, year\}\): \([author]: year \stackrel{Const}{\leadsto} count(*)\)
  3. Refinement ARP: add "venue": \([author, venue]: year \stackrel{Const}{\leadsto} count(*)\)
  4. Explanation: a tuple \(t' \neq t\) that deviates from the refinement ARP (3), in the opposite direction of \(t\).

Example

The SIGKDD problem instantiated:

  1. Complaint: tuple \(t\) = (X, SIGKDD, 2007, 1); 1 is low.
  2. Relevant ARP: \([author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)
  3. Refinement ARP: add venue = ICDE: \([author=X, venue=ICDE]: \{2006,2007,2008\} \stackrel{Const=5}{\leadsto} count(*)\)
  4. Explanation: \(t'\) = (X, ICDE, 2007, 7), which deviates from (3) in the opposite direction: 7 is high.

Challenges

  1. Aggregation Regression Pattern (ARP) Mining (offline)
  2. Explanation Generation (online)

Year Count
2006 14
2007 13
2008 13

Year Venue Count
2007 SIGKDD 1
2007 ICDE 7

Regression

Author Year Count
Z 2006 5
Z 2007 6
Z 2008 5

How to get \(M\) in  \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\) ?

  1. Define \(M\) as a regression model in {Const, Linear, ...}
  2. Fit \(Agg(A) \sim V\) (e.g., \(count \sim year\)); see the sketch below.
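A minimal sketch of this fitting step, assuming numpy and two candidate models (Const and Linear); this is illustrative, not the paper's code:

```python
# Fit a candidate regression model M for one fragment of [F]: V ~M~> Agg(A).
import numpy as np

def fit_model(v, agg, model="Const"):
    """v, agg: 1-D arrays, e.g. years and counts for one author."""
    v, agg = np.asarray(v, dtype=float), np.asarray(agg, dtype=float)
    if model == "Const":
        params = np.array([agg.mean()])        # the constant is the mean
        pred = np.full_like(agg, params[0])
    elif model == "Linear":
        params = np.polyfit(v, agg, deg=1)     # least-squares slope, intercept
        pred = np.polyval(params, v)
    else:
        raise ValueError(f"unsupported model: {model}")
    return params, pred

# Author Z's counts 5, 6, 5 over 2006-2008 fit a constant of ~5.33 well.
params, pred = fit_model([2006, 2007, 2008], [5, 6, 5], model="Const")
```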

Regression

1. Not all the authors obey the ARP

    \([author=Z]: year \stackrel{Const}{\not\leadsto} count(*)\)

Author Year Count
Z 2006 5
Z 2007 11
Z 2008 3

2. Pearson's chi-squared test for \(Const\): remove ARPs with test score \(\lt \theta\) (ARP Regression Quality).

\(M\) can also be linear regression, in which case the R-squared statistic is used to test for a good fit.

Author Year Count
W 2006 5

3. Remove ARPs with low support, \(\lt \delta\) (ARP Local Support): e.g., author W has only a single data point. See the sketch below.
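A hedged sketch of these per-fragment checks. Here the "test score" for Const is taken to be the chi-squared p-value and for Linear the R-squared value; the paper's exact statistics and thresholds may differ:

```python
import numpy as np
from scipy.stats import chisquare

def const_quality(counts):
    # Pearson chi-squared test against "all values equal their mean";
    # a small p-value means the fragment does NOT follow a Const pattern.
    _, p_value = chisquare(counts)             # default expected = uniform mean
    return p_value

def linear_quality(v, agg):
    # R-squared of a degree-1 least-squares fit.
    slope, intercept = np.polyfit(v, agg, deg=1)
    pred = slope * np.asarray(v, dtype=float) + intercept
    ss_res = np.sum((np.asarray(agg, dtype=float) - pred) ** 2)
    ss_tot = np.sum((np.asarray(agg, dtype=float) - np.mean(agg)) ** 2)
    return 1.0 - ss_res / ss_tot

theta, delta = 0.5, 3                          # illustrative thresholds
counts = [5, 11, 3]                            # author Z's yearly counts
keep = len(counts) >= delta and const_quality(counts) >= theta
print(keep)                                    # False: Z does not obey the Const ARP
```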

ARP Mining

Idea: exhaustively run group-by queries offline.

  • Sort group attributes and share queries:

      \([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\)

      \([A, B]: C, D \stackrel{Const}{\leadsto} count(*)\)

  • Limit group size: e.g., only allow group-bys over fewer than 4 attributes (\(\psi = 4\)).

  • Use the "cube by" operator:

      Group By A, B can be computed from Group By A, B, C (see the sketch after this list).

  • FD pruning:

      If \(\{A,B\} \rightarrow C\), there is no need to compute

      \([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\) once \([A, B]: D \stackrel{Const}{\leadsto} count(*)\) has been computed.
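A minimal sketch of the reuse idea in pandas rather than SQL's CUBE operator (the toy data and column names are illustrative):

```python
# Reuse a finer group-by to answer a coarser one instead of rescanning the table.
import pandas as pd

dblp = pd.DataFrame({
    "author": ["X", "X", "X", "X", "X"],
    "venue":  ["ICDE", "ICDE", "SIGKDD", "VLDB", "VLDB"],
    "year":   [2007, 2007, 2007, 2007, 2008],
    "title":  ["A", "B", "C", "D", "E"],
})

# Finest aggregate: count per (author, venue, year), computed once.
fine = (dblp.groupby(["author", "venue", "year"])
            .size().rename("cnt").reset_index())

# Coarser aggregate (author, year): re-aggregate the finer result.
coarse = fine.groupby(["author", "year"])["cnt"].sum().reset_index()
print(coarse)
```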

ARP Pruning

  • Some ARPs may hold only locally:

      Spurious: \([venue=SIGMOD]: year \stackrel{Linear}{\leadsto} count(*)\)

      but \([venue=ICDE]: year \stackrel{Linear}{\not\leadsto} count(*)\)

      and \([venue=SIGKDD]: year \stackrel{Linear}{\not\leadsto} count(*)\)

  • Some ARPs may have only a few instances.
  • \(\lambda\): Minimal percentage of good fits (ARP Global Quality)
  • \(\Delta\): Minimal support of good fits (ARP Global Support); see the sketch below.
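A minimal sketch of this global pruning step (the per-fragment quality scores and thresholds are illustrative):

```python
# Keep a candidate pattern only if enough fragments fit the model well.
def keep_pattern(fragment_quality, theta, lambda_, Delta):
    """fragment_quality: per-fragment fit score, e.g. {"SIGMOD": 0.9, ...}.
    theta: per-fragment quality threshold, lambda_: minimal fraction of good
    fragments (global quality), Delta: minimal number of good fragments (global support)."""
    good = [f for f, q in fragment_quality.items() if q >= theta]
    return len(good) >= Delta and len(good) / len(fragment_quality) >= lambda_

# Only SIGMOD shows a Linear trend, so the global pattern
# [venue]: year ~Linear~> count(*) is pruned as spurious.
print(keep_pattern({"SIGMOD": 0.9, "ICDE": 0.2, "SIGKDD": 0.1},
                   theta=0.5, lambda_=0.5, Delta=2))   # False
```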

Explanation Generation

Idea: search for the top-k explanations over the mined ARPs.

Define:

  • \(dev(t')\): deviation of \(t'\) from its regression value.

      E.g., X publishes ~5 papers at ICDE each year; in 2007 there were 7, so \(dev = 7 - 5 = 2\).

  • \(dist(t_1, t_2)\): the distance between two tuples.

Explanation generation:

  1. Find the relevant ARP \([author=X]: \{2006,2007,2008\} \stackrel{Const}{\leadsto} count(*)\).
  2. Find the refinement ARP \([author=X, venue=ICDE]: \{2006,2007,2008\} \stackrel{Const}{\leadsto} count(*)\).
  3. Score the tuples \(t'\) in (2) with \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\).
  4. Select the top-scored tuple as the explanation (see the sketch below).
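A hedged sketch of the scoring step; here dist() counts differing group-by values and NORM is a constant, both stand-ins for the paper's actual definitions, and the opposite-direction check is omitted:

```python
# Score a candidate counterbalance tuple against the complaint tuple.
def deviation(observed, predicted):
    return abs(observed - predicted)

def distance(t1, t2):
    # Number of group-by attributes on which the two tuples differ (at least 1).
    return max(1, sum(a != b for a, b in zip(t1, t2)))

def score(t, t_prime, predicted, norm=1.0):
    # Score(t') = dev(t') / (dist(t, t') * NORM); the last field is the aggregate.
    return deviation(t_prime[-1], predicted) / (distance(t[:-1], t_prime[:-1]) * norm)

complaint = ("X", "SIGKDD", 2007, 1)
candidate = ("X", "ICDE", 2007, 7)
print(score(complaint, candidate, predicted=5))   # dev = 2, dist = 1 -> 2.0
```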

Example Explanations

Rank Explanation Score
1 (X, ICDE, 2007, 7) 13.78
2 (X, ICDE, 2006, 5) 10.91
3 (X, ICDM, 2007, 5) 6.44
... ... ...
10 (X, 2010, 63) 3.20

Optimization

Goal: top-K explanation tuples.

Idea: compute an upper bound for \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\).

Action: drop \(t'\) if \(\text{UB}_{t'} \lt \min_{k=1..K}Score(t_k)\) (see the sketch below).
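A minimal sketch of this top-K search with upper-bound pruning; exact_score and upper_bound are assumed callables, so the code is only illustrative of the idea:

```python
# Keep a min-heap of the K best scores; skip a candidate whenever its cheap
# upper bound cannot beat the current K-th best score.
import heapq

def top_k(candidates, exact_score, upper_bound, k):
    heap = []                                   # min-heap of (score, tuple)
    for t in candidates:
        if len(heap) == k and upper_bound(t) <= heap[0][0]:
            continue                            # UB below K-th best: prune t
        s = exact_score(t)                      # only now pay for the exact score
        if len(heap) < k:
            heapq.heappush(heap, (s, t))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, t))
    return sorted(heap, reverse=True)
```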

Experiment Setting

  • Datasets:
    • DBLP: 1M rows, 4 columns.
    • Chicago Crime: after preprocessing, 1M rows, ? attributes.
  • Environment: Python 3.5, PostgreSQL 10.4
  • Machine: 128G RAM, 4x 1TB HDD
  • Cherry picked a set of parameters (\(\psi = 4, \theta=0.5,\lambda=0.5, \delta=15, \Delta=15\))

Time Spent in ARP Mining

  • ARP-mine: Share query + Sort
  • share-grp: Share query
  • Cube: Cube
  • Naive: No optimization

ARP-mine outperforms the others.

Time Spent in ARP Mining

  • ARP-mine: Share query + Sort
  • share-grp: Share query
  • Cube: Cube
  • Naive: No optimization

All three methods scale linearly w.r.t. the data size.

Time Spent in ARP Mining

  • ARP-mine: Share query + Sort
  • share-grp: Share query
  • Cube: Cube
  • Naive: No optimization

FD pruning has a positive effect on ARP mining.

Time Spent in Explanation Generation

ExplGen-Naive: Brute Force Search

ExplGen-Opt: With UB pruning.

Explanation generation time increases linearly w.r.t. the number of candidate ARPs.

Time Spent in Explanation Generation

ExplGen-Naive: Brute Force Search

ExplGen-Opt: With UB pruning.

Explanation generation time increases exponentially w.r.t. the number of attributes in the user question.

Parameter Sensitivity Analysis

  • Synthetic dataset: add/remove tuples in groups of a real-world dataset.
  • 10 user questions, top 10 explanations.
  • Vary the parameters (\(\theta, \lambda, \delta, \Delta\)) and report the number of generated explanations that are correct.

Parameter Sensitivity Analysis

\(\Delta\): ARP Global Support

\(\delta\): ARP Local Support

\(\lambda\): ARP Global Quality

\(\theta\): ARP Regression Quality

  • \(\delta\): does not affect the result much.
  • \(\Delta\): the lower the better.
  • \(\lambda\): 0.3-0.5.
  • \(\theta\): the lower the better.

Conclusion

  • ARPs can explain some outliers in aggregation query results that provenance cannot explain.

  • While naively searching for ARP-based explanations is time-consuming, optimizations can accelerate the process.