Zhengjie Miao et al.
| Author | Venue | Year | Pubcnt |
|---|---|---|---|
| X | SIGKDD | 2006 | 4 |
| X | SIGKDD | 2007 | 1 |
| X | SIGKDD | 2008 | 4 |
Why did X publish only 1 SIGKDD paper in 2007?
SELECT Author, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author, Venue, Year

| Author | Venue | Year | Pubcnt |
|---|---|---|---|
| X | SIGKDD | 2006 | 4 |
| X | SIGKDD | 2007 | 1 |
| X | SIGKDD | 2008 | 4 |
| X | VLDB | 2006 | 5 |
| X | VLDB | 2007 | 5 |
| X | VLDB | 2008 | 5 |
| X | ICDE | 2006 | 5 |
| X | ICDE | 2007 | 7 |
| X | ICDE | 2008 | 4 |
Group by year: 2006: 14, 2007: 13, 2008: 13
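The per-year totals can be reproduced by rolling the grouped result up over venue. A minimal Python sketch, using only the counts from the table above (the variable names are illustrative and not from the paper):

```python
from collections import defaultdict

# (venue, year) -> publication count for author X, copied from the table above
pubcnt = {
    ("SIGKDD", 2006): 4, ("SIGKDD", 2007): 1, ("SIGKDD", 2008): 4,
    ("VLDB", 2006): 5, ("VLDB", 2007): 5, ("VLDB", 2008): 5,
    ("ICDE", 2006): 5, ("ICDE", 2007): 7, ("ICDE", 2008): 4,
}

# Roll up over venue: same result as GROUP BY Year restricted to author X
per_year = defaultdict(int)
for (venue, year), count in pubcnt.items():
    per_year[year] += count

print(dict(per_year))  # {2006: 14, 2007: 13, 2008: 13}
```

The 2007 total stays close to the other years even though the SIGKDD count drops to 1, because the ICDE count rises to 7 in the same year.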
X published only 1 SIGKDD paper in 2007, (X, SIGKDD, 2007, 1), because of the 7 ICDE papers that year, (X, ICDE, 2007, 7).
| Author 1 | Author 2 | Venue | Year | Title |
|---|---|---|---|---|
| X | Y | ICDE | 2007 | A |
| X | Y | ICDE | 2007 | B |
| X | Y | ICDE | 2007 | C |
| X | Z | ICDE | 2007 | D |
| ... | ... | ... | ... | ... |
SELECT Author1, Venue, Year, COUNT(*)
FROM DBLP
GROUP BY Author1, Venue, Year
Why did X publish 7 ICDE papers in 2007?
Because of Y: three of X's ICDE 2007 papers (A, B, C) have Y as Author 2.
SELECT Year, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year

| Year | Count |
|---|---|
| 2006 | 14 |
| 2007 | 13 |
| 2008 | 13 |
SELECT Year, Venue, COUNT(*)
FROM DBLP
WHERE Author = 'X'
GROUP BY Year, Venue

| Year | Venue | Count |
|---|---|---|
| 2007 | SIGKDD | 1 |
| 2007 | ICDE | 7 |
Exp.: the (2007, ICDE, 7) row explains the (2007, SIGKDD, 1) row.
| Year | Count |
|---|---|
| 2006 | 14 |
| 2007 | 13 |
| 2008 | 13 |
ARP: \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)
Example:
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
In other words:
\([Author=X]: \{2006,2007,2008\} \stackrel{Const=13.3}{\leadsto} count(*)\)

Relevant ARP (Conserved Quantity):
Given a tuple \(t\) from the query \(\gamma_{G, Agg(A)}\) that the user complains about,
a relevant ARP is \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \cup V \subset G\)
Example:
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\) is a relevant ARP of
tuple \(t\): (X, SIGKDD, 2007, 1)

Refinement ARP (Drill Down):
Given a relevant ARP \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\),
a refinement ARP is \(P' = [F']: V \stackrel{M}{\leadsto} Agg(A)\), where \(F \subset F'\)
Example:
\(P' = [author,venue]: year \stackrel{Const}{\leadsto} count(*)\) is a refinement ARP of
\(P = [author]: year \stackrel{Const}{\leadsto} count(*)\)
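The two set conditions above can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: the `ARP` class and both function names are hypothetical, and \(F \cup V \subset G\) is read here as subset-or-equal, which is an assumption about the notation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ARP:
    F: frozenset   # attributes fixed on the left of the colon
    V: frozenset   # attributes the model M ranges over
    model: str     # e.g. "Const" or "Linear"
    agg: str       # e.g. "count(*)"

def is_relevant(p, G):
    # Relevant to a query grouping on G if F ∪ V is contained in G
    # (assumption: the subset symbol is taken as subset-or-equal).
    return (p.F | p.V) <= frozenset(G)

def is_refinement(p_refined, p):
    # Refinement: F strictly grows, while V, M and Agg(A) stay the same.
    return (p.F < p_refined.F and p_refined.V == p.V
            and p_refined.model == p.model and p_refined.agg == p.agg)

G = {"author", "venue", "year"}
p = ARP(frozenset({"author"}), frozenset({"year"}), "Const", "count(*)")
q = ARP(frozenset({"author", "venue"}), frozenset({"year"}), "Const", "count(*)")

print(is_relevant(p, G), is_refinement(q, p))
```

Here `p` is the pattern over author and year, and `q` is its drill-down that additionally fixes the venue.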
The SIGKDD problem can be formulated as:
The approach has two phases:
1. Aggregation Regression Pattern (ARP) Mining (offline)
2. Explanation Generation (online)
How to get \(M\) in \(P = [F]: V \stackrel{M}{\leadsto} Agg(A)\)?

Roughly constant counts:

| Author | Year | Count |
|---|---|---|
| Z | 2006 | 5 |
| Z | 2007 | 6 |
| Z | 2008 | 5 |

1. Not all authors obey the ARP. With the counts below, \([author=Z]: year \stackrel{Const}{\not\leadsto} count(*)\):

| Author | Year | Count |
|---|---|---|
| Z | 2006 | 5 |
| Z | 2007 | 11 |
| Z | 2008 | 3 |

2. Pearson's chi-square test for \(Const\): remove ARPs with test score \(\lt \theta\) (ARP Regression Quality). \(M\) can also be linear regression, in which case the R-squared statistic is used to test goodness of fit.
3. Remove ARPs with low support: \(\lt \delta\) (ARP Local Support). For example, an author with a single row:

| Author | Year | Count |
|---|---|---|
| W | 2006 | 5 |
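The quality and support filters can be illustrated on the two count series for Z. This is a hedged sketch rather than the paper's procedure: it assumes the test score compared against \(\theta\) is the p-value of Pearson's chi-square goodness-of-fit test, assumes local support means the number of data points, and uses made-up thresholds `theta = 0.5` and `delta = 3`. The closed-form p-value below is valid only for three observations (2 degrees of freedom).

```python
import math

def fit_const(counts):
    # Constant model: predict the mean count for every year
    return sum(counts) / len(counts)

def chi_square_stat(counts):
    # Pearson's statistic: sum of (observed - expected)^2 / expected
    expected = fit_const(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def p_value_three_points(counts):
    # For 2 degrees of freedom the chi-square survival function is exp(-x / 2)
    assert len(counts) == 3, "closed form only covers 3 observations"
    return math.exp(-chi_square_stat(counts) / 2)

theta = 0.5   # hypothetical regression-quality threshold
delta = 3     # hypothetical local-support threshold (number of points)

z_steady = [5, 6, 5]    # first table for Z
z_erratic = [5, 11, 3]  # second table for Z
w_counts = [5]          # author W: a single row

p_steady = p_value_three_points(z_steady)    # about 0.94, kept
p_erratic = p_value_three_points(z_erratic)  # about 0.065, removed
print(p_steady >= theta, p_erratic >= theta, len(w_counts) >= delta)
```

Under these assumptions the steady series passes the quality filter, the erratic one fails it, and W is dropped for lack of support before any model is fitted.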
\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\)
\([A, B]: C, D \stackrel{Const}{\leadsto} count(*)\)
Idea: exhaustively run group-by queries offline.
Reuse: GROUP BY A, B can be computed from the result of GROUP BY A, B, C.
FD pruning: if \(\{A,B\} \rightarrow C\), there is no need to compute
\([A, B, C]: D \stackrel{Const}{\leadsto} count(*)\) once \([A, B]: D \stackrel{Const}{\leadsto} count(*)\) has been computed.
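Both optimizations can be sketched in a few lines of Python. The data and the helper names `roll_up` and `fd_holds` are made up for illustration; the sketch only shows why a coarser grouping is derivable from a finer one and why a functional dependency makes the finer grouping redundant.

```python
from collections import Counter

def roll_up(grouped, keep):
    # Derive a coarser GROUP BY from a finer one by summing counts
    out = Counter()
    for key, cnt in grouped.items():
        out[tuple(key[i] for i in keep)] += cnt
    return dict(out)

def fd_holds(grouped, lhs, rhs):
    # True if every lhs value combination maps to exactly one rhs combination
    seen = {}
    for key in grouped:
        left = tuple(key[i] for i in lhs)
        right = tuple(key[i] for i in rhs)
        if seen.setdefault(left, right) != right:
            return False
    return True

# Made-up result of GROUP BY A, B, C: (a, b, c) -> count(*)
by_abc = {("a1", "b1", "c1"): 3, ("a1", "b1", "c2"): 2, ("a2", "b1", "c1"): 4}
by_ab = roll_up(by_abc, keep=(0, 1))
print(by_ab)  # {('a1', 'b1'): 5, ('a2', 'b1'): 4}

# Made-up data where {A, B} -> C holds: grouping by C adds no new groups
fd_abc = {("a1", "b1", "c1"): 3, ("a2", "b1", "c2"): 4}
print(fd_holds(by_abc, (0, 1), (2,)), fd_holds(fd_abc, (0, 1), (2,)))
```

When the dependency holds, the groups of GROUP BY A, B, C coincide one-to-one with those of GROUP BY A, B, so the pattern with the extra attribute carries no new information.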
Spurious pattern: \([venue=SIGMOD]: year \stackrel{Linear}{\leadsto} count(*)\) holds for one venue,
but \([venue=ICDE]: year \stackrel{Linear}{\not\leadsto} count(*)\)
and \([venue=SIGKDD]: year \stackrel{Linear}{\not\leadsto} count(*)\)
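Idea: search for the top-k explanations using the mined ARPs.
Define: \(dev(t')\), the deviation of the observed aggregate from the value the pattern predicts.
E.g. X publishes about 5 ICDE papers each year but published 7 in 2007, so \(dev = 7 - 5 = 2\).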
Explanation generation:
| Rank | Explanation | Score |
|---|---|---|
| 1 | (X, ICDE, 2007, 7) | 13.78 |
| 2 | (X, ICDE, 2006, 5) | 10.91 |
| 3 | (X, ICDM, 2007, 5) | 6.44 |
| ... | ... | ... |
| 10 | (X, 2010, 63) | 3.20 |
Goal: Top K explanation tuples.
Idea: Calculate an upper bound \(\text{UB}_{t'}\) for \(Score(t') = \frac{dev(t')}{dist(t, t') \cdot NORM}\).
Action: Drop \(t'\) if \(\text{UB}_{t'} \lt \min_{k=1..K}Score(t_k)\), i.e. if it cannot beat the current K-th best score.
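The pruning rule can be sketched as a top-K search over groups of candidates. This is an illustration under stated assumptions, not the paper's algorithm: each group is assumed to come with a cheap upper bound on \(dev\) and a positive lower bound on \(dist\), `NORM` defaults to 1, and all numbers except \(dev = 2\) for (X, ICDE, 2007) are made up.

```python
import heapq

def score(dev, dist, norm=1.0):
    # Score(t') = dev(t') / (dist(t, t') * NORM); dist is assumed positive
    return dev / (dist * norm)

def top_k_explanations(groups, k, norm=1.0):
    heap = []   # min-heap of (score, label); heap[0] is the current K-th best
    scored = 0  # number of candidates whose exact score was computed
    for dev_max, dist_min, candidates in groups:
        ub = score(dev_max, dist_min, norm)
        if len(heap) == k and ub < heap[0][0]:
            continue  # no candidate in this group can enter the top K
        for label, dev, dist in candidates:
            scored += 1
            s = score(dev, dist, norm)
            if len(heap) < k:
                heapq.heappush(heap, (s, label))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, label))
    return sorted(heap, reverse=True), scored

# Each group: (upper bound on dev, lower bound on dist, [(label, dev, dist), ...])
groups = [
    (2.0, 1.0, [("X, ICDE, 2007", 2.0, 1.0), ("X, ICDE, 2006", 1.0, 1.0)]),
    (0.5, 2.0, [("hypothetical c", 0.5, 2.0), ("hypothetical d", 0.2, 3.0)]),
]
best, scored = top_k_explanations(groups, k=2)
print(best, scored)
```

In this run the second group has an upper bound of 0.25, below the current second-best score of 1.0, so its two candidates are never scored.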
ARP-mine outperforms the other methods.
All three methods scale linearly w.r.t. data size.
Functional dependencies (FDs) have a positive effect on ARP mining.
ExplGen-Naive: brute-force search.
ExplGen-Opt: with UB pruning.
Explanation generation time increases linearly w.r.t. the number of candidate ARPs.
Explanation generation time increases exponentially w.r.t. the number of attributes in the user question.
\(\Delta\): ARP Global Support
\(\delta\): ARP Local Support
\(\lambda\): ARP Global Quality
\(\theta\): ARP Regression Quality