Weiyuan Wu
youngw@sfu.ca
Committee:
Dr. Jiannan Wang - Senior Supervisor
Dr. Jian Pei - Supervisor
Dr. Oliver Schulte - Examiner
Dr. Steven Bergner - Chair
Thesis Defense
Aug. 22, 2019
SELECT COUNT(*)
FROM INBOX
WHERE predict(INBOX.text) = 'spam'
AND INBOX.date = 'Aug. 22, 2019'
ID | Text | Label |
---|---|---|
1 | CLICK AND GET FREE PICKLE AT http://spam.com/clickme... | Spam |
2 | Hi Rick, lets have a meeting tmr 8pm -Rick | Spam |
3 | Grandpa, the light in the garage is broken - Morty | Ham |
Corrupted training data
SELECT COUNT(*)
FROM INBOX
WHERE predict(INBOX.text) = 'spam'
AND INBOX.date = 'Aug. 22, 2019'
Why is it so high?
Count(*) |
---|
1000 |
ID | Text | Label |
---|---|---|
1 | CLICK AND GET FREE PICKLE AT http://spam.com/clickme... | Spam |
2 | Hi Rick, lets have a meeting tmr 8pm -Rick | |
3 | Grandpa, the light in the garage is broken -Morty | Ham |
1. A complaint: assert query(DB) <= 20
2. A SQL-ML explanation tool
3. Training set bugs
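A minimal sketch of what such a complaint could look like in practice, assuming the model is exposed to SQL as a user-defined function. The `predict` rule, the helper names `count_spam` and `complaint_holds`, and the use of SQLite are all illustrative assumptions, not part of the presented system.

```python
import sqlite3


def predict(text: str) -> str:
    # Hypothetical stand-in for a trained classifier: flags any text containing a URL.
    return "spam" if "http" in text.lower() else "ham"


def count_spam(conn: sqlite3.Connection) -> int:
    # The query from the slides, with predict() resolved to the Python function above.
    row = conn.execute(
        "SELECT COUNT(*) FROM INBOX WHERE predict(INBOX.text) = 'spam'"
    ).fetchone()
    return row[0]


def complaint_holds(conn: sqlite3.Connection, upper_bound: int) -> bool:
    # The complaint is an assertion on the query result: query(DB) <= upper_bound.
    return count_spam(conn) <= upper_bound


conn = sqlite3.connect(":memory:")
conn.create_function("predict", 1, predict)  # register the model as a SQL function
conn.execute("CREATE TABLE INBOX (id INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO INBOX VALUES (?, ?)",
    [
        (1, "WANTED: Rick, for crime against interdimensional space"),
        (2, "Hi Rick, lets reschedule the meeting to next Monday -Rick"),
        (3, "http://test-spam.com..."),
    ],
)
spam_count = count_spam(conn)
```

When `complaint_holds` returns `False`, the query result violates the user's expectation, and that violation is the starting point for the explanation.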
Spam
SELECT COUNT(*)
FROM INBOX
WHERE predict(INBOX.text) = 'spam'
Too High!
Training data
Predictions
Count(*)
Querying data
Model
SQL Explanation
ML Explanation
Complaint
Labels
Training Set Bugs
SQL Exp.
ML Exp.
Querying Set Points
Possible Corruptions
Inputs:
Output:
\(\checkmark\) Provenance Polynomial
\(\checkmark\) Relax and Combine
\(\checkmark\) Influence Function
SELECT COUNT(*)
FROM INBOX
WHERE predict(INBOX.text) = 'spam'
COUNT(*): \(\mathcal{Q} = f(P1,P2,P3) = P1 + P2 + P3\)
where \(P1,P2,P3 \in \{0,1\}\)
ID | Text | Predicted |
---|---|---|
1 | WANTED: Rick, for crime against interdimensional space | |
2 | Hi Rick, lets reschedule the meeting to next Monday -Rick | |
3 | http://test-spam.com... | |
ID | Text | Predicted |
---|---|---|
1 | WANTED: Rick, for crime against interdimensional space | P1 |
2 | Hi Rick, lets reschedule the meeting to next Monday -Rick | P2 |
3 | http://test-spam.com... | P3 |
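The table above ties each prediction to a variable, so the COUNT(*) result is the polynomial P1 + P2 + P3. The sketch below evaluates that polynomial once on discrete predictions and once on relaxed values. Replacing each 0/1 prediction with a model probability is one reading of the "relax" step, and the numbers are hypothetical.

```python
def count_polynomial(p):
    """Provenance polynomial of the COUNT(*) query: Q = P1 + P2 + ... + Pn."""
    return sum(p)


# Hypothetical discrete predictions (1 = predicted spam) for the three rows.
hard_predictions = [0, 0, 1]
# Hypothetical spam probabilities for the same rows (the relaxed variables).
soft_predictions = [0.10, 0.25, 0.90]

q_hard = count_polynomial(hard_predictions)  # integer count
q_soft = count_polynomial(soft_predictions)  # differentiable surrogate of the count
```

The relaxed value is useful because it varies smoothly with the model outputs, which a later differentiation step relies on.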
Table 1
A | B | P |
---|---|---|
a | b | 1 |
d | b | 1 |
f | g | 1 |
\(\bowtie\)
Table 2
A | B | P |
---|---|---|
a | b | 1 |
d | b | 0 |
f | g | 0 |
\(=\)
A | B | P |
---|---|---|
a | b | 1 |
d | b | 0 |
f | g | 0 |
Provenance(t) = \(t.P^1 \cdot t.P^2\)
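The join example can be reproduced with a few lines of code. This sketch assumes the join matches on both A and B (a natural join); the slides do not state this explicitly. The function name `join_with_provenance` is illustrative.

```python
def join_with_provenance(t1, t2):
    """Join rows on (A, B); each output P is the product of the two input P values."""
    index = {}
    for a, b, p in t2:
        index.setdefault((a, b), []).append(p)
    joined = []
    for a, b, p1 in t1:
        for p2 in index.get((a, b), []):
            joined.append((a, b, p1 * p2))  # Provenance(t) = t.P^1 * t.P^2
    return joined


table1 = [("a", "b", 1), ("d", "b", 1), ("f", "g", 1)]
table2 = [("a", "b", 1), ("d", "b", 0), ("f", "g", 0)]
joined = join_with_provenance(table1, table2)
```

A joined tuple is kept (P = 1) only when both of its source tuples are kept, which is why the product is the natural combination rule here.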
ID | Text | Label |
---|---|---|
1 | CLICK AND GET FREE PICKLE AT http://spam.com/clickme... | Spam |
2 | Hi Rick, lets have a meeting tmr 8pm -Rick | Spam |
3 | Grandpa, the light in the garage is broken - Morty | Ham |
Influence: Training data → Count(*). How?
Training data → Model Params: Influence Function\(^1\)
Model Params → Count(*): Chain Rule
[1]: PW Koh et al. ICML'17
Where:
\(\theta^* = \operatorname*{arg\,min}_{\theta} L(\theta)\)
\(L(\theta) = \sum_{(x,y) \in T} \ell(f(x, \theta), y)\)
\(\theta_{\epsilon}^* = \operatorname*{arg\,min}_{\theta} \left[ L(\theta) + \epsilon \cdot \ell(f(x', \theta), y') \right]\)
\(H_{\theta} = \nabla_{\theta}^2 L(\theta)\)
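For reference, these definitions combine as in the influence function of Koh et al. [1]. The second line applies the chain rule to a relaxed, differentiable query result, written here as \(\mathcal{Q}(\theta)\); that notation is an assumption made for this sketch.

```latex
\begin{aligned}
\left.\frac{d\theta_{\epsilon}^*}{d\epsilon}\right|_{\epsilon=0}
  &= -H_{\theta^*}^{-1}\,\nabla_{\theta}\,\ell(f(x',\theta^*),y') \\
\left.\frac{d\mathcal{Q}(\theta_{\epsilon}^*)}{d\epsilon}\right|_{\epsilon=0}
  &= -\nabla_{\theta}\mathcal{Q}(\theta^*)^{\top} H_{\theta^*}^{-1}\,\nabla_{\theta}\,\ell(f(x',\theta^*),y')
\end{aligned}
```

Because \(L(\theta)\) is a sum over training points, removing \((x', y')\) from the training set corresponds to \(\epsilon = -1\), so the estimated change in the query result from a removal is the negative of the second expression.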
1. Express the query result using provenance
2. Relax and plug in models
3. Calculate the influence
4. Rank the training points by the influence
5. Delete the most influential training point and re-train
Repeat steps 3-5 until K points are deleted
Output: the K most likely corruptions
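The loop above can be sketched end to end for a small logistic-regression model and a "count is too high" complaint. Everything concrete here is an assumption made for illustration: the L2 regulariser `lam` (added so the Hessian is invertible), the toy one-dimensional data, and the function names. It is not the implementation evaluated in the thesis.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(X, y, lam=0.1, iters=3000):
    """Fit L2-regularised logistic regression by plain gradient descent."""
    step = 1.0 / (0.25 * np.sum(X * X) + lam)  # 1 / (upper bound on the curvature)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) + lam * theta
        theta -= step * grad
    return theta


def removal_effects(X, y, Xq, theta, lam=0.1):
    """Estimated change in the relaxed COUNT(*) if each training point is removed."""
    p = sigmoid(X @ theta)
    H = (X * (p * (1 - p))[:, None]).T @ X + lam * np.eye(X.shape[1])  # Hessian
    per_point_grads = (p - y)[:, None] * X      # one loss gradient per training row
    pq = sigmoid(Xq @ theta)
    grad_q = Xq.T @ (pq * (1 - pq))             # gradient of Q = sum of probabilities
    return per_point_grads @ np.linalg.solve(H, grad_q)


def top_k_suspects(X, y, Xq, k, lam=0.1):
    """Rank, delete the most influential point, re-train; repeat k times."""
    remaining = list(range(len(y)))
    suspects = []
    for _ in range(k):
        Xr, yr = X[remaining], y[remaining]
        theta = train(Xr, yr, lam)
        effects = removal_effects(Xr, yr, Xq, theta, lam)
        worst = int(np.argmin(effects))  # removal is estimated to lower the count most
        suspects.append(remaining.pop(worst))
    return suspects


# Toy data: one feature plus a bias column; the last two labels are flipped on purpose.
xs = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0, -1.2, -1.8])
X = np.column_stack([xs, np.ones_like(xs)])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0])
Xq = np.column_stack([np.array([-1.5, -1.0, -0.5]), np.ones(3)])  # querying set
suspects = top_k_suspects(X, y, Xq, k=2)
```

Re-training after every deletion keeps the influence estimates tied to the current model, at the cost of one extra training run per suspected point.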
[1]: PW Koh et al. ICML'17
[2]: A Meliou et al. SIGMOD'12
SELECT COUNT(*)
FROM DBLP, GOOG
WHERE predict(DBLP.*,GOOG.*) = true
SELECT COUNT(*)
FROM L, R
WHERE predict(L.img) = predict(R.img)
SELECT COUNT(*) FROM INBOX
WHERE INBOX.text LIKE '%special_word%'
AND predict(INBOX.text) = 'spam'
dear partner ,
we are a team of government officials that belong to an eight - man committee in the ...
...
sincerely ,
john adams
( chairman senate committee on banks and currency )
call number : 234 - 802 - 306 - 8507
L:
R:
GOOG:
DBLP:
The Top K Corruption Recall on MNIST. The closer to the ground truth line, the better.
Ambiguity!
Median Corruption
High Corruption
Corruption overwhelms!
The Top K Corruption Recall on DBLP. The closer to the ground truth line, the better.
Median Corruption
High Corruption
The Top K Corruption Recall on ENRON. The closer to the ground truth line, the better.
"deal" Corruption
"http" Corruption
Average time for finding one possible corruption, the lower the better. Loss: 1x, InfLoss: 2.4x, InfComp: 1.3x, HowToComp: 2x.