### Training Data

Weiyuan Wu

youngw@sfu.ca

Committee:

Dr. Jiannan Wang - Senior Supervisor

Dr. Jian Pei - Supervisor

Dr. Oliver Schulte - Examiner

Dr. Steven Bergner - Chair

Thesis Defense

Aug. 22, 2019

### SQL-ML Query

SELECT COUNT(*)
FROM INBOX
WHERE predict(INBOX.text) = 'spam'
AND INBOX.date = 'Aug. 22, 2019'
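A query like this can be prototyped by registering the model as a SQL UDF. A minimal sketch with SQLite and a stand-in classifier (the keyword rule is a hypothetical placeholder for a trained model):

```python
import sqlite3

def predict(text):
    # Stand-in for a trained ML model: a trivial keyword rule.
    return 'spam' if 'http://spam' in text else 'ham'

conn = sqlite3.connect(':memory:')
conn.create_function('predict', 1, predict)  # expose the model to SQL
conn.execute("CREATE TABLE INBOX (text TEXT, date TEXT)")
conn.executemany("INSERT INTO INBOX VALUES (?, ?)", [
    ("CLICK AND GET FREE PICKLE AT http://spam.com/clickme", "Aug. 22, 2019"),
    ("Hi Rick, lets have a meeting tmr 8pm -Rick", "Aug. 22, 2019"),
])
count, = conn.execute(
    "SELECT COUNT(*) FROM INBOX "
    "WHERE predict(INBOX.text) = 'spam' AND INBOX.date = 'Aug. 22, 2019'"
).fetchone()
```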

### Why SQL-ML System

• Democratize AI
• Improve Productivity
• Facilitate Management

### Training Data is Often Corrupted

| ID | Text | Label |
|---|---|---|
| 1 | CLICK AND GET FREE PICKLE AT http://spam.com/clickme... | Spam |
| 2 | Hi Rick, lets have a meeting tmr 8pm -Rick | Spam |
| 3 | Grandpa, the light in the garage is broken -Morty | Ham |

Corrupted training data

### The Need for SQL-ML Explanation

SELECT COUNT(*)
FROM INBOX
WHERE predict(INBOX.text) = 'spam'
AND INBOX.date = 'Aug. 22, 2019'

| COUNT(*) |
|---|
| 1000 |

Why is it so high?

### A Debugging Workflow

| ID | Text | Label |
|---|---|---|
| 1 | CLICK AND GET FREE PICKLE AT http://spam.com/clickme... | Spam |
| 2 | Hi Rick, lets have a meeting tmr 8pm -Rick | Spam |
| 3 | Grandpa, the light in the garage is broken -Morty | Ham |

1. A complaint: "Why is it so high?"
2. A SQL-ML explanation tool
3. Training set bugs

• Software debugging analogy: assert query(DB) <= 20

### Existing Approaches

• SQL Explanation
  • Why Not? [A Chapman et al. SIGMOD'09]
  • Tiresias [A Meliou et al. SIGMOD'12]
  • Scorpion [E Wu et al. VLDB'13]
• ML Explanation
  • Influence Function [PW Koh et al. ICML'17]
  • DUTI [X Zhang et al. AAAI'18]

### SQL & ML Explanation

SELECT COUNT(*)
FROM INBOX
WHERE predict(INBOX.text) = 'spam' 

Too High!

(Diagram: training data and labels feed the model; the model's predictions on the querying data feed COUNT(*), which the complaint targets. ML explanation covers the training-data-to-predictions half, SQL explanation covers the predictions-to-query-result half, and together they trace a complaint back to training set bugs.)

### Simple Combination

• SQL explanation: produces labels for querying set
• ML explanation: requires labels for querying set to debug
• HowToComp: chain the two: a SQL explanation tool selects querying set points, and an ML explanation tool then maps those points to possible corruptions.

### Ambiguity

(Diagram: the SQL explanation can select different querying set points, each sending the ML explanation toward different possible corruptions.)

• HowToComp: there are multiple ways to select the querying set points, so the reported corruptions are ambiguous.

### Agenda

• Motivation & Background
• Problem Definition & Challenges
• Our Solution
• Experiments
• Conclusions

### The SQL-ML Explanation Problem

Inputs:

• A training dataset
• A SQL-ML query
• A complaint (e.g. count is too high)

Output:

• Possible corrupted training data points

$$\checkmark$$   Provenance Polynomial

$$\checkmark$$   Relax and Combine

$$\checkmark$$   Influence Function

### Challenges

• Problem Formulation
• Holistic
• Discrete + Continuous Optimization
• Efficient Algorithm
• Need to train $$n^k$$ models

### Agenda

• Motivation & Background
• Problem Definition & Challenges
• Our Solution
• Experiments
• Conclusions

### InfComp: Influence & Complaint

• Compute influence on the differentiable query result
1. Make SQL query result differentiable w.r.t. model parameters by provenance polynomial
2. Debug the training set using the Influence Function

### Differentiable Query Result

SELECT COUNT(*)
FROM INBOX
WHERE predict(INBOX.text) = 'spam' 

COUNT(*): $$\mathcal{Q} = f(P1,P2,P3) = P1 + P2 + P3$$

where $$P1,P2,P3 \in \{0,1\}$$

| ID | Text | Predicted |
|---|---|---|
| 1 | WANTED: Rick, for crime against interdimensional space | P1 |
| 2 | Hi Rick, lets reschedule the meeting to next Monday -Rick | P2 |
| 3 | http://test-spam.com... | P3 |

### SQL Provenance

• SQL operators are isomorphic to some semimodules
• $$\pi, \sigma, \gamma, \bowtie, \cup$$ can be expressed using a boolean semimodule or integer semimodule, depending on the set or bag semantics
• Example (a join under provenance annotations; a result tuple's annotation is the product of its inputs'):

Table 1:

| A | B | P |
|---|---|---|
| a | b | 1 |
| d | b | 1 |
| f | g | 1 |

$$\bowtie$$

Table 2:

| A | B | P |
|---|---|---|
| a | b | 1 |
| d | b | 0 |
| f | g | 0 |

$$=$$

| A | B | P |
|---|---|---|
| a | b | 1 |
| d | b | 0 |
| f | g | 0 |

Provenance(t) = $$t.P^{(1)} \cdot t.P^{(2)}$$
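The join example above can be sketched in a few lines of plain Python, with each tuple carrying its annotation P (table contents as in the example; the helper name is illustrative):

```python
# Provenance-annotated tables: (row, P) pairs.
table1 = [(('a', 'b'), 1), (('d', 'b'), 1), (('f', 'g'), 1)]
table2 = [(('a', 'b'), 1), (('d', 'b'), 0), (('f', 'g'), 0)]

def provenance_join(t1, t2):
    # Join on the full (A, B) key; Provenance(t) = t.P1 * t.P2.
    return [(row1, p1 * p2)
            for row1, p1 in t1
            for row2, p2 in t2
            if row1 == row2]

result = provenance_join(table1, table2)
# Under set semantics, tuples annotated 0 are absent from the result.
present = [row for row, p in result if p == 1]
```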

### Connect ML and SQL

• Issue: Non-differentiable
• $$f(P1,P2,P3) = P1 + P2 + P3$$
• $$P1,P2,P3 \in \{0,1\}$$
• Solution: Relaxation
• Relax binaries into continuous: $$P1',P2',P3' \in [0,1]$$, thus $$f(P1',P2',P3') \in [0,3]$$
• Replace $$P1', P2', P3'$$ with the probabilistic output from the model

• Now the query output $$\mathcal{Q}$$ is differentiable w.r.t. the ML params!

$$\mathcal{Q} = f(P(\text{'spam'} \mid \text{ID}=1),\ P(\text{'spam'} \mid \text{ID}=2),\ P(\text{'spam'} \mid \text{ID}=3))$$
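The relaxation can be sketched numerically for a logistic model; the feature matrix and weights below are hypothetical, purely to show that the relaxed count has a well-defined gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical featurized emails (3 rows, 2 features) and model weights.
X = np.array([[2.0, 0.5], [-1.0, 0.3], [3.0, -0.2]])
theta = np.array([0.8, -0.4])

# Hard COUNT(*): a sum of {0,1} predicates -- not differentiable in theta.
hard_count = int(np.sum(sigmoid(X @ theta) > 0.5))

# Relaxed COUNT(*): replace each predicate Pi with P('spam' | row i).
probs = sigmoid(X @ theta)        # P1', P2', P3' in [0, 1]
relaxed_count = probs.sum()       # Q = f(P1', P2', P3') in [0, 3]

# Q is now differentiable: chain rule with d sigmoid(z)/dz = p * (1 - p).
grad_Q = (probs * (1 - probs)) @ X
```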

### Training Set Debugging

| ID | Text | Label |
|---|---|---|
| 1 | CLICK AND GET FREE PICKLE AT http://spam.com/clickme... | Spam |
| 2 | Hi Rick, lets have a meeting tmr 8pm -Rick | Spam |
| 3 | Grandpa, the light in the garage is broken -Morty | Ham |

(Diagram: the Influence Function$$^1$$ gives the influence of each training point on the model params; the chain rule then carries that influence through to COUNT(*).)

$$^1$$: PW Koh et al. ICML'17

### Influence Function

• Influence Function: $$\left. \frac{d \theta_{\epsilon}^*}{d\epsilon}\right|_{\epsilon=0} = \lim_{\epsilon \to 0} \frac{\theta_{\epsilon}^* - \theta^*}{\epsilon} = - H_{\theta^*}^{-1} \nabla_{\theta} \ell( f(x', \theta^*),y')$$
• InfComp: $$\left. \frac{d\, q(\theta_{\epsilon}^*)}{d\epsilon}\right|_{\epsilon=0} = \nabla_{\theta} q(\theta^*)^{\top} \left. \frac{d \theta_{\epsilon}^*}{d\epsilon}\right|_{\epsilon=0} = - \nabla_{\theta} q(\theta^*)^{\top} H_{\theta^*}^{-1} \nabla_{\theta} \ell( f(x', \theta^*),y')$$

where:

$$\theta^* = \argmin_{\theta} L(\theta)$$

$$L(\theta) = \sum_{(x,y) \in T} \ell( f(x, \theta),y)$$

$$\theta_{\epsilon}^* = \argmin_{\theta} L(\theta) + \epsilon \cdot \ell( f(x', \theta),y')$$

$$H_{\theta} = \nabla_{\theta}^2 L(\theta)$$
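The InfComp quantity can be sketched for a logistic model with numpy; the data, $$\theta^*$$ (taken as given rather than fitted), and the small L2 term keeping the Hessian invertible are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: 20 training points, 5 querying points, 2 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2))
y_train = (X_train[:, 0] > 0).astype(float)
X_query = rng.normal(size=(5, 2))

theta = np.array([1.0, -0.5])   # stand-in for the fitted theta*
lam = 1e-2                      # L2 term keeps H invertible

p_train = sigmoid(X_train @ theta)

# H: Hessian of the regularized logistic training loss at theta*.
H = X_train.T @ (X_train * (p_train * (1 - p_train))[:, None]) + lam * np.eye(2)

# grad_q: gradient of the relaxed query q(theta) = sum_i P('spam' | x_i).
p_query = sigmoid(X_query @ theta)
grad_q = (p_query * (1 - p_query)) @ X_query

# grad_ell: per-training-point loss gradients, (p - y) * x for logistic loss.
grad_ell = (p_train - y_train)[:, None] * X_train

# InfComp score per training point: -grad_q^T H^{-1} grad_ell_i.
influences = -grad_ell @ np.linalg.solve(H, grad_q)
ranking = np.argsort(-influences)   # most count-inflating points first
```

Using `np.linalg.solve` avoids forming $$H^{-1}$$ explicitly, which is the usual numerically stable choice.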

### InfComp Algorithm

Find the K most likely corruptions:

1. Express the query result using provenance
2. Relax and plug in the model's probabilistic outputs
3. Calculate the influence of each training point on the query result
4. Rank the training points by influence
5. Delete the most influential training point and re-train

Repeat until K points are deleted.
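The delete-and-retrain loop can be sketched end-to-end for a logistic model; the gradient-descent fitter, hyperparameters, and data shapes are all illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, lam=1e-2, lr=0.5, steps=500):
    # Plain gradient descent on the L2-regularized logistic loss.
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) + lam * theta
        theta -= lr * grad / len(y)
    return theta

def influence_scores(theta, X, y, X_query, lam=1e-2):
    # InfComp score per training point: -grad_q^T H^{-1} grad_ell_i.
    p = sigmoid(X @ theta)
    H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(X.shape[1])
    q = sigmoid(X_query @ theta)
    grad_q = (q * (1 - q)) @ X_query        # gradient of the relaxed COUNT
    grad_ell = (p - y)[:, None] * X         # per-point loss gradients
    return -grad_ell @ np.linalg.solve(H, grad_q)

def infcomp(X, y, X_query, K):
    # Repeatedly delete the point that inflates the count most, then re-train.
    idx = np.arange(len(y))
    deleted = []
    for _ in range(K):
        theta = fit(X[idx], y[idx])
        infl = influence_scores(theta, X[idx], y[idx], X_query)
        worst = np.argmax(infl)
        deleted.append(idx[worst])
        idx = np.delete(idx, worst)
    return deleted
```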

### Agenda

• Motivation & Background
• Problem Definition & Challenges
• Our Solution
• Experiments
• Conclusions

### Experiment: Compared Methods

$$^1$$: PW Koh et al. ICML'17

$$^2$$: A Meliou et al. SIGMOD'12

• Baselines:
• Complaint agnostic:
• Loss: delete training points with maximum loss values
• InfLoss$$^1$$: delete training points with maximum self-influences
• Complaint aware:
• HowToComp$$^{1,2}$$: delete training points by first using SQL explanation tools and then Influence Function
• Our approach:
• InfComp: Complaint aware, deleting training points by holistically considering SQL and ML

• Entity resolution task: DBLP-GOOG dataset

• Image recognition task: MNIST dataset

• Spam classification task: ENRON dataset

DBLP-GOOG:

SELECT COUNT(*)
FROM DBLP, GOOG
WHERE predict(DBLP.*, GOOG.*) = true

MNIST:

SELECT COUNT(*)
FROM L, R
WHERE predict(L.img) = predict(R.img)

ENRON:

SELECT COUNT(*)
FROM INBOX
WHERE INBOX.text LIKE '%special_word%'
AND predict(INBOX.text) = 'spam'

Example ENRON email:

> dear partner ,
> we are a team of government officials that belong to an eight - man committee in the ...
> ...
> sincerely ,
> ( chairman senate committee on banks and currency )
> call number : 234 - 802 - 306 - 8507

### Corruptions

• Random Label Flip: DBLP/MNIST
• Positive/Negative Label Flip: DBLP/MNIST
• Label Generation from a corrupted labeler: MNIST
• Label Flip by word: ENRON

### Experiment Results: MNIST

(Figure: the top-K corruption recall on MNIST, panels "Median Corruption" and "High Corruption"; the closer to the ground truth line, the better. Annotated takeaways: the SQL Exp. → ML Exp. pipeline suffers from ambiguity in selecting querying set points, and under high corruption the corruption overwhelms it.)

### Experiment Results: DBLP & ENRON

(Figure: the top-K corruption recall on DBLP, panels "Median Corruption" and "High Corruption"; the closer to the ground truth line, the better.)

(Figure: the top-K corruption recall on ENRON, panels "deal" Corruption and "http" Corruption; the closer to the ground truth line, the better.)

Average time to find one possible corruption (lower is better): Loss 1x, InfLoss 2.4x, InfComp 1.3x, HowToComp 2x.

### Conclusions

1. To the best of our knowledge, InfComp is the first approach to debugging SQL-ML queries.
2. InfComp can leverage the complaint to find more corrupted training points than the baselines.

### Future Work

• Richer set of explanations
• Integration into analytical databases
• "Why not" complaints
