Zhengjie Miao et al.
Author  Venue  Year  Pubcnt 

X  SIGKDD  2006  4 
X  SIGKDD  2007  1 
X  SIGKDD  2008  4 
Why did X publish only 1 SIGKDD paper in 2007?
SELECT Author, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author, Venue, Year
Author  Venue  Year  Pubcnt 

X  SIGKDD  2006  4 
X  SIGKDD  2007  1 
X  SIGKDD  2008  4 
X  VLDB  2006  5 
X  VLDB  2007  5 
X  VLDB  2008  5 
X  ICDE  2006  5 
X  ICDE  2007  7 
X  ICDE  2008  4 
Group by year: 2006: 14, 2007: 13, 2008: 13
X  SIGKDD  2007  1 
Because of
X  ICDE  2007  7 
Author 1  Author 2  Venue  Year  Title 

X  Y  ICDE  2007  A 
X  Y  ICDE  2007  B 
X  Y  ICDE  2007  C 
X  Z  ICDE  2007  D 
...  ...  ...  ...  ... 
SELECT Author1, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author1, Venue, Year
Why did X publish 7 ICDE papers in 2007?
X  ICDE  2007  7 
Because of Y (Author 2 of papers A, B, and C)
X  SIGKDD  2007  1 
SELECT Year, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year
Year  Count 

2006  14 
2007  13 
2008  13 
SELECT Year, Venue, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year, Venue
Year  Venue  Count 

2007  SIGKDD  1 
2007  ICDE  7 
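The two group-by queries above can be tried on a toy table. The sketch below uses Python's built-in `sqlite3` module. The `DBLP` schema (`Author`, `Venue`, `Year`, `Title`), the synthetic titles, and the `ORDER BY` clause are assumptions added for illustration; the per-venue counts are the ones listed for X earlier in the slides.

```python
import sqlite3

# Publication counts for author X, copied from the slides' table.
counts = {
    ("SIGKDD", 2006): 4, ("SIGKDD", 2007): 1, ("SIGKDD", 2008): 4,
    ("VLDB", 2006): 5, ("VLDB", 2007): 5, ("VLDB", 2008): 5,
    ("ICDE", 2006): 5, ("ICDE", 2007): 7, ("ICDE", 2008): 4,
}

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per paper.
conn.execute("CREATE TABLE DBLP (Author TEXT, Venue TEXT, Year INTEGER, Title TEXT)")
for (venue, year), n in counts.items():
    conn.executemany(
        "INSERT INTO DBLP VALUES ('X', ?, ?, ?)",
        [(venue, year, f"{venue}-{year}-{i}") for i in range(n)],
    )

# Coarse view: papers per year (ORDER BY added for a stable result order).
by_year = conn.execute(
    "SELECT Year, COUNT(*) FROM DBLP WHERE Author = 'X' GROUP BY Year ORDER BY Year"
).fetchall()

# Drill-down: papers per year and venue.
by_year_venue = {
    (year, venue): cnt
    for year, venue, cnt in conn.execute(
        "SELECT Year, Venue, COUNT(*) FROM DBLP WHERE Author = 'X' GROUP BY Year, Venue"
    )
}

print(by_year)
print(by_year_venue[(2007, "SIGKDD")], by_year_venue[(2007, "ICDE")])
```

The yearly totals are nearly constant, while the 2007 drill-down shows the low SIGKDD count next to the high ICDE count.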
Aggregation Regression Pattern (ARP): \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)
Example:
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
In other words:
\([Author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)
Relevant ARP (Conserved Quantity):
Given a tuple \(t\) from the query \(\gamma_{G, Agg(A)}\) that the user complains about,
a relevant ARP is \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \cup V \subset G\)
Example:
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\) is a relevant ARP of
tuple \(t\): (X, 2007, SIGKDD, 1)
Refinement ARP (Drill Down):
Given a relevant ARP \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\),
a refinement ARP is \(P' = [F']: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \subset F'\)
Example:
\(P = [author,venue]: year \stackrel{Const}{\leadsto} count(*)\) is a refinement ARP of
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
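The relevance and refinement conditions are plain set comparisons, so they can be sketched in a few lines of Python. The `ARP` class and the helper names are hypothetical, not taken from the paper. Reading \(\subset\) in the relevance condition as "subset or equal" is an assumption, made so that the refinement \([author,venue]: year\) still counts as relevant to a query grouped by author, venue, and year. The constant model is fitted here as the mean of the yearly counts, which matches the \(Const=13.3\) shown for X.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ARP:
    """Hypothetical container for P = [F]: V ~M~> Agg(A)."""
    fixed: frozenset      # F, the partition attributes
    variable: frozenset   # V, the predictor attributes
    model: str            # M, e.g. "Const" or "Linear"
    agg: str              # Agg(A), e.g. "count(*)"


def is_relevant(p: ARP, group_by) -> bool:
    # F ∪ V must be covered by the query's group-by attributes G
    # (assumption: non-strict subset).
    return (p.fixed | p.variable) <= frozenset(group_by)


def is_refinement(p_fine: ARP, p: ARP) -> bool:
    # Same V, M and Agg(A); F strictly extended to F'.
    return (
        p.fixed < p_fine.fixed
        and p.variable == p_fine.variable
        and p.model == p_fine.model
        and p.agg == p_fine.agg
    )


G = {"author", "venue", "year"}
p = ARP(frozenset({"author"}), frozenset({"year"}), "Const", "count(*)")
p_fine = ARP(frozenset({"author", "venue"}), frozenset({"year"}), "Const", "count(*)")

# Constant model for [Author=X]: mean of the yearly counts 14, 13, 13.
const_x = mean([14, 13, 13])

print(is_relevant(p, G), is_refinement(p_fine, p), round(const_x, 1))
```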
The SIGKDD problem can be formulated as:
Offline: 1. Aggregation Regression Pattern (ARP) Mining
Online: 2. Explanation Generation (here: 2007  ICDE  7)
Author  Year  Count 

Z  2006  5 
Z  2007  6 
Z  2008  5 
How to get \(M\) in \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\) ?
1. Not all the authors obey the ARP
\([author=Z]: year \stackrel{Const}{\not\leadsto} count(*)\)
Author  Year  Count 

Z  2006  5 
Z  2007  11 
Z  2008  3 
2. Pearson's chi-square test for \(Const\): remove ARPs with test score \(\lt \theta\) (ARP Regression Quality)
\(M\) can also be a linear regression, in which case the R-squared statistic is used to test goodness of fit.
W  2006  5 
3. Remove ARPs with low support: \(\lt \delta\) (ARP Local Support)
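The regression-quality and local-support filters above can be sketched with the standard library only. Several things here are assumptions: the "test score" is taken to be the p-value of Pearson's chi-square goodness-of-fit test against the mean count, local support is taken to be the number of data points for the fixed value, and the thresholds `theta = 0.5` and `delta = 3` are hypothetical. The two count series for Z are the ones shown in the tables above.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """P[chi-square(df) >= x], via the recurrence Q(a+1, y) = Q(a, y) + y^a e^-y / Gamma(a+1)."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def const_fit_score(counts) -> float:
    """p-value of Pearson's chi-square test of the counts against their mean (counts must be positive)."""
    expected = sum(counts) / len(counts)
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return chi2_sf(stat, df=len(counts) - 1)


def r_squared_linear(xs, ys) -> float:
    """R-squared of a least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot


def keep_const_arp(counts, theta: float, delta: int) -> bool:
    # Local support first, then regression quality.
    return len(counts) >= delta and const_fit_score(counts) >= theta


steady, bursty, single = [5, 6, 5], [5, 11, 3], [5]
for series in (steady, bursty, single):
    print(series, keep_const_arp(series, theta=0.5, delta=3))
```

With these assumed thresholds, the steady series passes, the bursty one fails the quality test, and the single data point for W fails the support test.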
\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\)
\([A, B]: C, D \stackrel{Const}{\leadsto} count(*)\)
Idea: exhaustively run group by queries offline.
Group By A, B can be calculated through Group By A, B, C
If the functional dependency \(\{A,B\} \rightarrow C\) holds, there is no need to compute
\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\) once \([A, B]: D \stackrel{Const}{\leadsto} count(*)\) has been computed.
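Both shortcuts can be illustrated on the counts for X. Because `count(*)` is additive, a coarser group-by is obtained by summing the groups of a finer one; and a functional dependency can be checked by confirming that every left-hand-side value maps to a single right-hand-side value. The helper names `roll_up` and `fd_holds` are hypothetical, and this is only a sketch of the idea, not the paper's mining algorithm.

```python
from collections import Counter

# Result of GROUP BY Author, Venue, Year for author X (from the slides).
fine = {
    ("X", "SIGKDD", 2006): 4, ("X", "SIGKDD", 2007): 1, ("X", "SIGKDD", 2008): 4,
    ("X", "VLDB", 2006): 5, ("X", "VLDB", 2007): 5, ("X", "VLDB", 2008): 5,
    ("X", "ICDE", 2006): 5, ("X", "ICDE", 2007): 7, ("X", "ICDE", 2008): 4,
}


def roll_up(groups, keep):
    """Derive a coarser count(*) group-by by summing over the dropped key positions."""
    out = Counter()
    for key, cnt in groups.items():
        out[tuple(key[i] for i in keep)] += cnt
    return dict(out)


def fd_holds(rows, lhs, rhs):
    """True if the column positions `lhs` functionally determine the positions `rhs` in `rows`."""
    seen = {}
    for row in rows:
        left = tuple(row[i] for i in lhs)
        right = tuple(row[i] for i in rhs)
        if seen.setdefault(left, right) != right:
            return False
    return True


# GROUP BY Author, Year computed from GROUP BY Author, Venue, Year.
by_author_year = roll_up(fine, keep=(0, 2))
print(by_author_year)

# {Venue, Year} -> Author holds on this toy data; {Author} -> Venue does not.
print(fd_holds(fine.keys(), lhs=(1, 2), rhs=(0,)), fd_holds(fine.keys(), lhs=(0,), rhs=(1,)))
```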
Spurious: \([venue=SIGMOD]: year \stackrel{Linear}{\leadsto} count(*)\)
But \([venue=ICDE]: year \stackrel{Linear}{\not\leadsto} count(*)\)
and \([venue=SIGKDD]: year \stackrel{Linear}{\not\leadsto} count(*)\)
Idea: search for the top-k explanations among the generated ARPs.
Define:
E.g. X publishes about 5 ICDE papers each year but published 7 in 2007, so \(dev = 7 - 5 = 2\).
Explanation generation:
Rank  Explanation  Score 

1  (X, ICDE, 2007, 7)  13.78 
2  (X, ICDE, 2006,5)  10.91 
3  (X, ICDM, 2007, 5)  6.44 
...  ...  ... 
10  (X, 2010, 63)  3.20 
Goal: Top K explanation tuples.
Idea: Calculate an upper bound \(\text{UB}_{t'}\) for \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\).
Action: Drop \(t'\) if \(\text{UB}_{t'} \lt \min_{k=1..K}Score(t_k)\)
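The pruning rule can be sketched as a single pass that keeps the current top K in a min-heap and skips any candidate whose upper bound is below the smallest kept score. How `dev`, `dist`, `NORM`, and the upper bound are actually computed is not given here, so the sketch takes them as inputs; the candidate values below are made-up toy numbers chosen only so that each upper bound is at least the true score.

```python
import heapq


def score(dev: float, dist: float, norm: float) -> float:
    # Score(t') = dev(t') / (dist(t, t') * NORM)
    return dev / (dist * norm)


def top_k_explanations(candidates, k: int, norm: float = 1.0):
    """candidates: iterable of (label, upper_bound, dev, dist).

    Returns the top-k (score, label) pairs in descending order and the
    number of candidates whose exact score had to be computed.
    """
    heap = []        # min-heap holding the best k scores seen so far
    evaluated = 0
    for label, ub, dev, dist in candidates:
        if len(heap) == k and ub < heap[0][0]:
            continue  # UB below the current k-th best score: drop t'
        evaluated += 1
        s = score(dev, dist, norm)
        if len(heap) < k:
            heapq.heappush(heap, (s, label))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, label))
    return sorted(heap, reverse=True), evaluated


# Toy candidates (hypothetical values): label, UB, dev, dist.
toy = [
    ("a", 3.0, 2.0, 1.0),
    ("b", 2.0, 3.0, 2.0),
    ("c", 1.0, 0.5, 1.0),
    ("d", 5.0, 4.0, 1.0),
]
best, evaluated = top_k_explanations(toy, k=2)
print(best, evaluated)
```

Candidate `c` is never scored because its upper bound is already below the second-best score at that point; a brute-force search would score all four.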
ARPmine outperforms the other methods.
All three methods scale linearly w.r.t. data size.
FDs have a positive effect on ARP mining.
ExplGenNaive: Brute Force Search
ExplGenOpt: With UB pruning.
Explanation generation time increases linearly w.r.t. the number of candidate ARPs.
Explanation generation time increases exponentially w.r.t. the number of attributes in the user question.
\(\Delta\): ARP Global Support
\(\delta\): ARP Local Support
\(\lambda\): ARP Global Quality
\(\theta\): ARP Regression Quality