Zhengjie Miao et al.
Author | Venue | Year | Pubcnt |
---|---|---|---|
X | SIGKDD | 2006 | 4 |
X | SIGKDD | 2007 | 1 |
X | SIGKDD | 2008 | 4 |
Why did X publish only 1 SIGKDD paper in 2007?
SELECT Author, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author, Venue, Year
Author | Venue | Year | Pubcnt |
---|---|---|---|
X | SIGKDD | 2006 | 4 |
X | SIGKDD | 2007 | 1 |
X | SIGKDD | 2008 | 4 |
X | VLDB | 2006 | 5 |
X | VLDB | 2007 | 5 |
X | VLDB | 2008 | 5 |
X | ICDE | 2006 | 5 |
X | ICDE | 2007 | 7 |
X | ICDE | 2008 | 4 |
Group by year: 2006 - 14 , 2007 - 13 , 2008 - 13
Explanation: (X, SIGKDD, 2007, 1) is low because of (X, ICDE, 2007, 7).
Author 1 | Author 2 | Venue | Year | Title |
---|---|---|---|---|
X | Y | ICDE | 2007 | A |
X | Y | ICDE | 2007 | B |
X | Y | ICDE | 2007 | C |
X | Z | ICDE | 2007 | D |
... | ... | ... | ... | ... |
SELECT Author1, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author1, Venue, Year
Why did X publish 7 ICDE papers in 2007?
(X, ICDE, 2007, 7): because of co-author Y (papers A, B, and C in the table above).
SELECT Year, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year
Year | Count |
---|---|
2006 | 14 |
2007 | 13 |
2008 | 13 |
SELECT Year, Venue, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year, Venue
Year | Venue | Count |
---|---|---|
2007 | SIGKDD | 1 |
2007 | ICDE | 7 |
Exp.: (2007, SIGKDD, 1) is explained by (2007, ICDE, 7).
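The two queries above can be tried on a toy table. The following is a minimal sketch using Python's built-in `sqlite3` module. The in-memory database, the three-column schema, and the one-placeholder-row-per-paper layout are assumptions made for illustration; only the counts are taken from the tables above.

```python
import sqlite3

# Publication counts for author X, copied from the tables above.
counts = {
    ("SIGKDD", 2006): 4, ("SIGKDD", 2007): 1, ("SIGKDD", 2008): 4,
    ("VLDB", 2006): 5, ("VLDB", 2007): 5, ("VLDB", 2008): 5,
    ("ICDE", 2006): 5, ("ICDE", 2007): 7, ("ICDE", 2008): 4,
}

# Assumed schema: one row per paper, titles omitted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DBLP (Author TEXT, Venue TEXT, Year INTEGER)")
for (venue, year), n in counts.items():
    conn.executemany(
        "INSERT INTO DBLP VALUES ('X', ?, ?)", [(venue, year)] * n
    )

# First query: yearly totals for X.
by_year = conn.execute(
    "SELECT Year, COUNT(*) FROM DBLP WHERE Author = 'X' "
    "GROUP BY Year ORDER BY Year"
).fetchall()

# Second query: totals per year and venue, stored in a dict for lookup.
by_year_venue = {
    (year, venue): n
    for year, venue, n in conn.execute(
        "SELECT Year, Venue, COUNT(*) FROM DBLP WHERE Author = 'X' "
        "GROUP BY Year, Venue"
    )
}

print(by_year)
print(by_year_venue[(2007, "SIGKDD")], by_year_venue[(2007, "ICDE")])
```

The yearly totals stay almost flat while the 2007 SIGKDD and ICDE counts move in opposite directions, which is the situation the explanation relies on.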
Year | Count |
---|---|
2006 | 14 |
2007 | 13 |
2008 | 13 |
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
In other words:
\([Author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)
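To make the constant 13.3 concrete, here is a small sketch that fits a constant model to X's yearly totals. It assumes the constant is fitted by least squares, in which case the best constant is the mean; the slides do not say how the constant is obtained, so treat this as one plausible reading.

```python
from statistics import mean

# Yearly totals for Author = X, from the table above.
counts_by_year = {2006: 14, 2007: 13, 2008: 13}

# Assumption: Const is fitted by least squares, i.e. it is the mean.
const = mean(counts_by_year.values())

# Residuals show how far each year is from the fitted constant.
residuals = {year: n - const for year, n in counts_by_year.items()}

print(round(const, 1))
print({year: round(r, 2) for year, r in residuals.items()})
```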
ARP: \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)
Given a tuple \(t\) from the query \(\gamma_{G, Agg(A)}\) that the user complains about, a relevant ARP is \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \cup V \subset G\)
(Conserved Quantity)
Example:
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\) is a relevant ARP of
tuple \(t\): (X, 2007, SIGKDD, 1)
Given a relevant ARP \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\), a refinement ARP is \(P' = [F']: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \subset F'\)
(Drill Down)
Example:
\(P' = [author,venue]: year \stackrel{Const}{\leadsto} count(*)\) is a refinement ARP of
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
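The two conditions are plain set checks, as the sketch below shows. The `ARP` class and both function names are hypothetical, introduced only for illustration. The sketch reads \(F \cup V \subset G\) as "subset, possibly equal", which is an assumption; \(F \subset F'\) is treated as a strict subset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ARP:
    """Hypothetical container for a pattern [F]: V ~M~> Agg(A)."""
    F: frozenset   # attributes inside the brackets
    V: frozenset   # attributes after the colon
    model: str     # M, e.g. "Const" or "Linear"
    agg: str       # Agg(A), e.g. "count(*)"


def is_relevant(p, G):
    # F ∪ V must be contained in the group-by attributes G of the query.
    return (p.F | p.V) <= frozenset(G)


def is_refinement(p_ref, p):
    # Same V, model, and aggregate; F strictly extended to F'.
    return (
        p.F < p_ref.F
        and p_ref.V == p.V
        and p_ref.model == p.model
        and p_ref.agg == p.agg
    )


G = {"author", "venue", "year"}
p = ARP(frozenset({"author"}), frozenset({"year"}), "Const", "count(*)")
p_ref = ARP(frozenset({"author", "venue"}), frozenset({"year"}), "Const", "count(*)")

print(is_relevant(p, G), is_refinement(p_ref, p))
```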
The SIGKDD problem can be formulated as:
1. Aggregation Regression Pattern (ARP) Mining (offline)
2. Explanation Generation (online)
Author | Year | Count |
---|---|---|
Z | 2006 | 5 |
Z | 2007 | 6 |
Z | 2008 | 5 |
How to get \(M\) in \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\) ?
1. Not all the authors obey the ARP
\([author=Z]: year \stackrel{Const}{\not\leadsto} count(*)\)
Author | Year | Count |
---|---|---|
Z | 2006 | 5 |
Z | 2007 | 11 |
Z | 2008 | 3 |
2. Pearson's chi-square test for \(Const\): remove ARPs with test score \(\lt \theta\) (ARP Regression Quality)
\(M\) can also be linear regression, in which case the R-squared statistic is used to test goodness of fit.
3. Remove ARPs with low support: \(\lt \delta\) (ARP Local Support)
Author | Year | Count |
---|---|---|
W | 2006 | 5 |
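The filters in steps 2 and 3 can be sketched for the constant model as follows. Several things here are assumptions and not taken from the slides: the "test score" is read as the p-value of Pearson's chi-square test, the thresholds `theta=0.1` and `delta=3` are made-up values, and the function names are hypothetical. The p-value uses the closed form for 2 degrees of freedom, so the sketch only handles series of 3 counts.

```python
import math


def chi_square_const(counts):
    """Pearson chi-square statistic of the counts against their mean."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def p_value_df2(stat):
    # Survival function of the chi-square distribution with 2 degrees
    # of freedom (3 data points, df = 3 - 1): exp(-x / 2).
    return math.exp(-stat / 2)


def keep_arp(counts, theta=0.1, delta=3):
    """Hypothetical filter: enough data points and a good enough fit."""
    if len(counts) < delta:   # local support too low
        return False
    if len(counts) != 3:
        raise ValueError("this sketch only handles 3 data points (df = 2)")
    return p_value_df2(chi_square_const(counts)) >= theta


print(keep_arp([5, 6, 5]))    # Z with counts 5, 6, 5
print(keep_arp([5, 11, 3]))   # Z with counts 5, 11, 3
print(keep_arp([5]))          # W with a single data point
```

With these assumed thresholds, the flat series is kept, the volatile series fails the fit test, and the single-point series fails the support test.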
\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\)
\([A, B]: C, D \stackrel{Const}{\leadsto} count(*)\)
Idea: exhaustively run group by queries offline.
Group By A, B can be calculated through Group By A, B, C
If the functional dependency \(\{A,B\} \rightarrow C\) holds, there is no need to compute \([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\) once \([A, B]: D \stackrel{Const}{\leadsto} count(*)\) has been computed.
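Both shortcuts can be illustrated on a toy grouped result. This is a sketch only: the attribute values, the `roll_up` and `fd_holds` helpers, and the positional encoding of attributes are all hypothetical.

```python
from collections import Counter


def roll_up(fine_counts, keep):
    """Sum counts keyed by attribute tuples down to the positions in `keep`."""
    coarse = Counter()
    for key, n in fine_counts.items():
        coarse[tuple(key[i] for i in keep)] += n
    return coarse


def fd_holds(fine_counts, lhs, rhs):
    """True if the attributes at positions `lhs` determine those at `rhs`."""
    seen = {}
    for key in fine_counts:
        left = tuple(key[i] for i in lhs)
        right = tuple(key[i] for i in rhs)
        if seen.setdefault(left, right) != right:
            return False
    return True


# Hypothetical result of GROUP BY A, B, C with COUNT(*).
abc = Counter({
    ("a1", "b1", "c1"): 2,
    ("a1", "b1", "c2"): 3,
    ("a2", "b1", "c1"): 4,
})

# GROUP BY A, B derived from the finer result without rescanning the data.
ab = roll_up(abc, keep=(0, 1))

print(dict(ab))
print(fd_holds(abc, lhs=(0, 1), rhs=(2,)))
```

In this toy data \(\{A,B\} \rightarrow C\) does not hold, because the group (a1, b1) appears with two different values of C, so the finer pattern would still have to be computed.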
Spurious: \([venue=SIGMOD]: year \stackrel{Linear}{\leadsto} count(*)\)
But \([venue=ICDE]: year \stackrel{Linear}{\not\leadsto} count(*)\)
and \([venue=SIGKDD]: year \stackrel{Linear}{\not\leadsto} count(*)\)
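Idea: search for the top-k explanations among the mined ARPs.
Define the deviation \(dev(t')\) of a tuple \(t'\) as its actual aggregate value minus the value predicted by the ARP.
E.g. X publishes about 5 ICDE papers each year but 7 in 2007, so \(dev = 7 - 5 = 2\).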
Explanation generation:
Rank | Explanation | Score |
---|---|---|
1 | (X, ICDE, 2007, 7) | 13.78 |
2 | (X, ICDE, 2006, 5) | 10.91 |
3 | (X, ICDM, 2007, 5) | 6.44 |
... | ... | ... |
10 | (X, 2010, 63) | 3.20 |
Goal: find the top-K explanation tuples.
Idea: compute an upper bound \(\text{UB}_{t'}\) on \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\).
Action: drop \(t'\) if \(\text{UB}_{t'} \lt \min_{k=1..K}Score(t_k)\), the lowest score among the current top K.
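The pruning rule can be sketched with a min-heap that holds the current top K. Everything concrete below is an assumption for illustration: the candidate tuples, `NORM = 1.0`, the smallest possible distance `MIN_DIST = 1.0`, and the bound \(dev / (\text{MIN\_DIST} \cdot NORM)\). The slides do not say how the upper bound is derived, and in a real system the score would be the expensive step that the bound helps to avoid.

```python
import heapq

NORM = 1.0       # hypothetical normalization constant
MIN_DIST = 1.0   # hypothetical smallest possible dist(t, t')


def score(cand):
    _, dev, dist = cand
    return dev / (dist * NORM)


def upper_bound(cand):
    # Assumed bound: the score cannot exceed dev / (MIN_DIST * NORM).
    _, dev, _ = cand
    return dev / (MIN_DIST * NORM)


def top_k(candidates, k):
    """Return the k best (score, index, candidate) entries and the
    number of candidates whose score was actually evaluated."""
    best = []        # min-heap; best[0] holds the lowest score in the top k
    evaluated = 0
    for i, cand in enumerate(candidates):
        if len(best) == k and upper_bound(cand) < best[0][0]:
            continue  # cannot enter the top k, so skip scoring it
        evaluated += 1
        s = score(cand)
        if len(best) < k:
            heapq.heappush(best, (s, i, cand))
        elif s > best[0][0]:
            heapq.heapreplace(best, (s, i, cand))
    return sorted(best, reverse=True), evaluated


# Toy candidates: (name, dev, dist).
cands = [("t1", 2.0, 1.0), ("t2", 1.0, 2.0), ("t3", 0.4, 1.0), ("t4", 3.0, 4.0)]
result, evaluated = top_k(cands, k=2)

print([c[0] for _, _, c in result], evaluated)
```

The index `i` is stored in each heap entry only as a tie-breaker, so two candidates with equal scores are never compared directly.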
ARP-mine outperforms the other methods.
All three methods scale linearly w.r.t. data size.
FDs have a positive effect on ARP mining.
ExplGen-Naive: brute-force search.
ExplGen-Opt: with UB pruning.
Explanation generation time increases linearly w.r.t. the number of candidate ARPs and exponentially w.r.t. the number of attributes in the user question.
\(\Delta\): ARP Global Support
\(\delta\): ARP Local Support
\(\lambda\): ARP Global Quality
\(\theta\): ARP Regression Quality