Enabling SQL-ML Explanation to Debug Training Data

Weiyuan Wu

youngw@sfu.ca

Committee:

Dr. Jiannan Wang - Senior Supervisor

Dr. Jian Pei - Supervisor

Dr. Oliver Schulte - Examiner

Dr. Steven Bergner - Chair

Thesis Defense

Aug. 22, 2019

SQL-ML Query

SELECT COUNT(*)
  FROM INBOX 
 WHERE predict(INBOX.text) = 'spam' 
   AND INBOX.date = 'Aug. 22, 2019'
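
Here predict() is model inference embedded in SQL. As a minimal sketch of the idea (assuming a SQL engine that supports user-defined functions; the toy classifier below is a stand-in, not the system in this thesis):

import sqlite3

# Hypothetical stand-in for a trained spam classifier.
def predict(text):
    return "spam" if "http://" in text else "ham"

conn = sqlite3.connect(":memory:")
conn.create_function("predict", 1, predict)  # expose predict() as a SQL UDF
conn.execute("CREATE TABLE INBOX (text TEXT, date TEXT)")
conn.executemany("INSERT INTO INBOX VALUES (?, ?)", [
    ("CLICK AND GET FREE PICKLE AT http://spam.com/clickme", "Aug. 22, 2019"),
    ("Hi Rick, let's have a meeting tmr 8pm -Rick", "Aug. 22, 2019"),
])
print(conn.execute(
    "SELECT COUNT(*) FROM INBOX "
    "WHERE predict(INBOX.text) = 'spam' AND INBOX.date = 'Aug. 22, 2019'"
).fetchone()[0])  # -> 1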

SQL-ML System

Why SQL-ML Systems?

  • Democratize AI
  • Improve Productivity
  • Facilitate Management

Training Data is Often Corrupted

ID  Text                                                      Label
1   CLICK AND GET FREE PICKLE AT http://spam.com/clickme...   Spam
2   Hi Rick, let's have a meeting tmr 8pm  -Rick              Spam  ← wrong label
3   Grandpa, the light in the garage is broken  -Morty        Ham

Corrupted training data

The Need for SQL-ML Explanation

SELECT COUNT(*)
  FROM INBOX 
 WHERE predict(INBOX.text) = 'spam' 
   AND INBOX.date = 'Aug. 22, 2019'

COUNT(*): 1000

Why is it so high?

A Debugging Workflow

ID  Text                                                      Label
1   CLICK AND GET FREE PICKLE AT http://spam.com/clickme...   Spam
2   Hi Rick, let's have a meeting tmr 8pm  -Rick              Spam  ← wrong label
3   Grandpa, the light in the garage is broken  -Morty        Ham

1. A complaint ("Why is it so high?")
   • The software-debugging analogue: assert query(DB) <= 20
2. A SQL-ML explanation tool
3. Training set bugs

Existing Approaches

  • SQL Explanation
    • Why Not? [A Chapman et al. SIGMOD'09]
    • Tiresias [A Meliou et al. SIGMOD'12]
    • Scorpion [E Wu et al. VLDB'13]
  • ML Explanation
    • Influence Function [PW Koh et al. ICML'17]
    • DUTI [X Zhang et al. AAAI'18]

SQL & ML Explanation

SELECT COUNT(*)
  FROM INBOX 
 WHERE predict(INBOX.text) = 'spam' 

[Pipeline diagram: training data + labels → model; the model produces predictions over the querying data, which the query aggregates into Count(*), triggering the complaint "Too high!". SQL explanation covers the path from the complaint back to the predicted labels on the querying set; ML explanation covers the path from those labels back to training set bugs.]

Simple Combination

  • SQL explanation: produces labels for the querying set
  • ML explanation: requires labels on the querying set to debug the training set
  • HowToComp: chain the two, first using a SQL explanation to select and label querying set points, then applying an ML explanation to those points

[Diagram: querying set points → SQL Exp. → ML Exp. → possible corruptions]

  • Ambiguity: there are multiple ways to select the querying set points, and each choice yields a different set of possible corruptions.
 

Agenda

  • Motivation & Background
  • Problem Definition & Challenges
  • Our Solution
  • Experiments
  • Conclusions

The SQL-ML Explanation Problem

Inputs:

  • A training dataset
  • A SQL-ML query
  • A complaint (e.g. count is too high)

Output:

  • Possible corrupted training data points

Building blocks of our solution:

\(\checkmark\)   Provenance Polynomial

\(\checkmark\)   Relax and Combine

\(\checkmark\)   Influence Function

Challenges

  • Problem Formulation
    • Holistic
    • Discrete + Continuous Optimization
  • Efficient Algorithm
    • Brute force needs to train \(n^k\) models (one per candidate set of \(k\) corrupted points out of \(n\))

Agenda

  • Motivation & Background
  • Problem Definition & Challenges
  • Our Solution
  • Experiments
  • Conclusions

InfComp: Influence & Complaint

  • Compute the influence of each training point on the differentiable query result:
    1. Make the SQL query result differentiable w.r.t. the model parameters via provenance polynomials
    2. Debug the training set with the influence function

Differentiable Query Result

SELECT COUNT(*)
  FROM INBOX 
 WHERE predict(INBOX.text) = 'spam' 

COUNT(*): \(\mathcal{Q} = f(P1,P2,P3) = P1 + P2 + P3\)

where \(P1,P2,P3 \in \{0,1\}\)

ID  Text                                                          Predicted
1   WANTED: Rick, for crime against interdimensional space        P1
2   Hi Rick, let's reschedule the meeting to next Monday  -Rick   P2
3   http://test-spam.com...                                       P3

SQL Provenance

  • SQL operators are isomorphic to operations over semimodules
  • \(\pi, \sigma, \gamma, \bowtie, \cup\) can be expressed over a Boolean or an integer semimodule, depending on set or bag semantics
  • Example: a natural join under bag semantics; the provenance of each result tuple is the product of the provenances of its input tuples, Provenance(t) = \(t.P^1 \cdot t.P^2\)

Table 1        Table 2        Table 1 \(\bowtie\) Table 2
A  B  P        A  B  P        A  B  P
a  b  1        a  b  1        a  b  1
d  b  1        d  b  0        d  b  0
f  g  1        f  g  0        f  g  0
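
As a minimal sketch (an assumed toy implementation, not the thesis code), this provenance propagation through the join can be written directly:

# Tuple -> provenance annotation for each input table.
t1 = {("a", "b"): 1, ("d", "b"): 1, ("f", "g"): 1}
t2 = {("a", "b"): 1, ("d", "b"): 0, ("f", "g"): 0}

# Natural join on identical schemas: the output annotation is the
# product of the input annotations (integer semimodule, bag semantics).
join = {t: t1[t] * t2[t] for t in t1.keys() & t2.keys()}
print(join)  # ('a','b') -> 1, ('d','b') -> 0, ('f','g') -> 0 (order may vary)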

Connect ML and SQL

  • Issue: Non-differentiable
    • \(f(P1,P2,P3) = P1 + P2 + P3\)
    • \(P1,P2,P3 \in \{0,1\}\)
  • Solution: Relaxation
    • Relax the binary variables into continuous ones: \(P1',P2',P3' \in [0,1]\), thus \(f(P1',P2',P3') \in [0,3]\)
    • Replace \(P1', P2', P3'\) with the probabilistic outputs of the model
  • Now the query output \(\mathcal{Q}\) is differentiable w.r.t. the ML parameters!

\(\mathcal{Q} = f\big(P(\text{spam} \mid ID=1),\, P(\text{spam} \mid ID=2),\, P(\text{spam} \mid ID=3)\big)\)
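
A minimal PyTorch sketch of the relaxation (the logistic model and random features are assumptions for illustration, not the thesis setup):

import torch

theta = torch.zeros(100, requires_grad=True)  # model parameters
X_query = torch.rand(3, 100)                  # querying set features (stand-in)

probs = torch.sigmoid(X_query @ theta)        # P1', P2', P3' in [0, 1]
Q = probs.sum()                               # relaxed COUNT(*), in [0, 3]
Q.backward()                                  # dQ/dtheta now lives in theta.grad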

Training Set Debugging

ID  Text                                                      Label
1   CLICK AND GET FREE PICKLE AT http://spam.com/clickme...   Spam
2   Hi Rick, let's have a meeting tmr 8pm  -Rick              Spam
3   Grandpa, the light in the garage is broken  -Morty        Ham

[Diagram: how does the training data influence Count(*)? The influence function\(^1\) gives the influence of each training point on the model parameters; the chain rule then carries that influence from the parameters through to Count(*).]

[1]: PW Koh et al. ICML'17

Influence Function

Where:

\(L(\theta) = \sum_{(x,y) \in T} \ell(f(x, \theta), y)\)

\(\theta^* = \argmin_{\theta} L(\theta)\)

\(\theta_{\epsilon}^* = \argmin_{\theta} L(\theta) + \epsilon \cdot \ell(f(x', \theta), y')\)

\(H_{\theta^*} = \nabla_{\theta}^2 L(\theta^*)\)

  • Influence Function: \[\left. \frac{d \theta_{\epsilon}^*}{d\epsilon}\right|_{\epsilon=0} = \lim_{\epsilon \to 0} \frac{\theta_{\epsilon}^* - \theta^*}{\epsilon} = - H_{\theta^*}^{-1} \nabla_{\theta} \ell(f(x', \theta^*), y')\]
  • InfComp (by the chain rule): \[\left. \frac{d q(\theta_{\epsilon}^*)}{d\epsilon}\right|_{\epsilon=0} = \nabla_{\theta} q(\theta^*)^\top \left. \frac{d \theta_{\epsilon}^*}{d\epsilon}\right|_{\epsilon=0} = - \nabla_{\theta} q(\theta^*)^\top H_{\theta^*}^{-1} \nabla_{\theta} \ell(f(x', \theta^*), y')\]
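
A minimal sketch of this quantity for the relaxed COUNT(*) over a logistic model (the model, loss, and helper name are assumptions, not the thesis code):

import torch

def infcomp_score(theta, H, X_query, x_tr, y_tr):
    """Sketch of dq(theta*_eps)/d_eps for one training point (x', y')."""
    # grad_q: gradient of Q = sum(sigmoid(X_query @ theta)) w.r.t. theta
    p = torch.sigmoid(X_query @ theta)
    grad_q = X_query.T @ (p * (1 - p))
    # grad_l: gradient of the cross-entropy loss at the training point
    p_tr = torch.sigmoid(x_tr @ theta)
    grad_l = (p_tr - y_tr) * x_tr
    # influence = -grad_q^T H^{-1} grad_l (solve rather than invert H)
    return -grad_q @ torch.linalg.solve(H, grad_l)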

InfComp Algorithm

  1. Express the query result using provenance
  2. Relax and plug in the model
  3. Calculate the influence of each training point
  4. Rank the training points by influence
  5. Delete the most influential training point and re-train

Repeat until \(K\) points are deleted; the deleted points are returned as the \(K\) most likely corruptions (see the sketch below).
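
A minimal sketch of this loop, reusing the hypothetical infcomp_score above; train and hessian are assumed helpers, not the thesis implementation:

def infcomp(train_set, X_query, K):
    flagged = []
    for _ in range(K):
        theta = train(train_set)        # re-fit the model (assumed helper)
        H = hessian(theta, train_set)   # Hessian of the training loss (assumed)
        scores = [infcomp_score(theta, H, X_query, x, y) for (x, y) in train_set]
        # For a "count too high" complaint: upweighting the top-ranked point
        # raises the count the most, so deleting it lowers the count the most.
        worst = max(range(len(train_set)), key=lambda i: float(scores[i]))
        flagged.append(train_set.pop(worst))
    return flagged                      # the K most likely corruptions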

Agenda

  • Motivation & Background
  • Problem Definition & Challenges
  • Our Solution
  • Experiments
  • Conclusions

Experiment: Compared Methods

  • Baselines:
    • Complaint agnostic:
      • Loss: delete the training points with the largest loss values
      • InfLoss\(^1\): delete the training points with the largest self-influence
    • Complaint aware:
      • HowToComp\(^{1,2}\): delete training points by first running a SQL explanation tool and then the influence function
  • Our approach:
    • InfComp: complaint aware; deletes training points by considering SQL and ML holistically

[1]: PW Koh et al. ICML'17
[2]: A Meliou et al. SIGMOD'12

Experiment: Tasks

  • Entity resolution task (DBLP-GOOG dataset):

SELECT COUNT(*)
  FROM DBLP, GOOG
 WHERE predict(DBLP.*, GOOG.*) = true

  • Image recognition task (MNIST dataset):

SELECT COUNT(*)
  FROM L, R
 WHERE predict(L.img) = predict(R.img)

  • Spam classification task (ENRON dataset):

SELECT COUNT(*)
  FROM INBOX
 WHERE INBOX.text LIKE '%special_word%'
   AND predict(INBOX.text) = 'spam'

Example ENRON spam email:

dear partner ,
we are a team of government officials that belong to an eight - man committee in the ...
...

sincerely ,
john adams
( chairman senate committee on banks and currency )
call number : 234 - 802 - 306 - 8507

Corruptions

  • Random Label Flip: DBLP/MNIST (see the sketch below)
  • Positive/Negative Label Flip: DBLP/MNIST
  • Label Generation from a corrupted labeler: MNIST
  • Label Flip by word: ENRON
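
For instance, a minimal sketch of the random label flip corruption (the flip rate and binary labels are illustrative assumptions):

import random

def random_label_flip(labels, rate, seed=0):
    """Flip a random fraction `rate` of binary (0/1) labels."""
    rng = random.Random(seed)
    flipped = list(labels)
    for i in rng.sample(range(len(flipped)), int(rate * len(flipped))):
        flipped[i] = 1 - flipped[i]
    return flipped

corrupted = random_label_flip([0, 1, 0, 0, 1, 1, 0, 1], rate=0.25)  # flips 2 labels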

Experiment Results: MNIST

[Figure: Top-K corruption recall on MNIST; panels: Median Corruption, High Corruption. The closer to the ground-truth line, the better. Plot annotations: the querying set points → SQL Exp. → ML Exp. pipeline shows ambiguity; in the high-corruption panel, the corruption overwhelms.]

Experiment Results: DBLP & ENRON

[Figure: Top-K corruption recall on DBLP; panels: Median Corruption, High Corruption. The closer to the ground-truth line, the better.]

[Figure: Top-K corruption recall on ENRON; panels: "deal" Corruption, "http" Corruption. The closer to the ground-truth line, the better.]

Experiment Result: Time Overhead

Average time to find one possible corruption (lower is better): Loss 1x, InfLoss 2.4x, InfComp 1.3x, HowToComp 2x.

Conclusions

  1. To the best of our knowledge, InfComp is the first approach to debug training data for SQL-ML queries.
  2. InfComp can leverage the complaint to find more corrupted training points than the baselines.
  3. InfComp is fast, adding little time overhead.

Future Work

  • A richer set of explanations
  • Integration into analytical databases
  • Why-not complaints

Thank you!
