Debug Machine learning embedded SQL
ML-Embedded SQL

How many cats in the picture?
ML-Embedded SQL

How many cats in the picture?



{Cat,Cat,Cat,Dog,Dog,Dog,Dog,...}






Count(Cat) = 3
SELECT COUNT(*)
FROM image
WHERE classify(patch) == "cat"
ML-Embedded SQL




{Cat,Cat,Cat,Dog,Dog,Dog,Dog,...}






Count(Cat) = 3
More practically
ML-Embedded SQL
ML-Embedded SQL important because:
- Machine learning is put into use as a system but not standalone
- Given the data size is large, we want to move computation to the data rather than move data to the computation
- Big companies are using!



ML-Embedded SQL
How many cats in the picture?
WRONG




{Cat,Cat,Cat,Dog,Dog,Dog,Dog,...}






Count(Cat) = 3
ML-Embedded SQL
Goal: Why & How
ML-Embedded SQL




- Column: Feature-based
ML Interpretations
- Row: Case-based
* Masked due to the actual knowledge is encoded into the model by learning


ML-Embedded SQL
- Column: Predicate-based
SQL Interpretations
- Row: Provenance-based
ρB=1: Caused the issue
a | b | c |
d | b | e |
A | B | C |
---|---|---|
a | b | c |
d | b | e |
f | g | e |
A | B | C |
ML-Embedded SQL
Predicate-based > Feature-based
* Cased-based = Provenance based
ML-Embedded SQL
Causality model: counterfactual
A→B
If not A then not B
-- In the absence of a cause, the effect doesn’t occur
In database
Given a query Q, database D, t∈D is the cause of Q(D)'s incorrectness if Q(D−t) is correct.
ML-Embedded SQL
Combining SQL & ML interpretations?
- SQL with multiple models, how to reason between them?
- Error propagates, how to identify?
- Aggregate query shields provenance, how to penetrate?
- The query is complex, how to efficiently compute?
ML-Embedded SQL
- Problem Definition
- Solution
- Experiment
ML-Embedded SQL
D: Database, DT: Database trains the models
Q: Query
M1,...,Mk: Models
E: User expectation of Q(D)
Goal: find tuples T⊂DT causes Q(D)=E
ML-Embedded SQL
Counterfactual: find T⊂DT that Q(D,DT−T)=E.
Trivial approach:
1. Pick up n tuples t∈DT randomly, delete them.
2. Re-train all the models M1,...,Mk.
3. Run Q(D,DT−T) to see if Q(D,DT−T)=E
ML-Embedded SQL
Be smarter? Optimization!
Objective: Find a T, if multiple solutions exist, we want the one with minimal ∣T∣
Constraints:
1. The query result is correct Q(D,DT−T)=E
2. The model should be well trained.
ML-Embedded SQL
Difficulties:
1. Encoding SQL logic into an analytical form
2. Efficiently solve the problem
Provenance Semirings
Q(D,DT−T) is not analytical representable!
Provenance Semirings
A | B | C |
---|---|---|
a | b | c |
d | b | e |
f | g | e |
SELECT *
FROM D
WHERE B = 'b'
A | B | C |
---|---|---|
a | b | c |
d | b | e |
How to express this analytically?
Provenance Semirings
Key idea: the SQL operators are actually set operations, so SQL should be isomorphic with some semirings
A | B | C | I |
---|---|---|---|
a | b | c | 1 |
d | b | e | 1 |
f | g | e | 1 |
SELECT *
FROM D
WHERE B = 'b'
Where B = 'b' → P(t) = 1{t.B = 'b'}\)
A | B | C | I |
---|---|---|---|
a | b | c | 1 |
d | b | e | 1 |
f | g | e | 0 |
Provenance(t) = t.I⋅P(t)
Provenance Semirings
Selection
A | B | C | I |
---|---|---|---|
a | b | c | 1 |
d | b | e | 1 |
f | g | e | 1 |
A | B | C | I |
---|---|---|---|
a | b | c | 1 |
d | b | e | 1 |
f | g | e | 1 |
Provenance(t) = t.I1+t.I2
Provenance Semirings
Union
A | B | C | I |
---|---|---|---|
a | b | c | 1 |
⋃
=
A | B | C | I |
---|---|---|---|
a | b | c | 1 |
d | b | e | 1 |
f | g | e | 1 |
A | B | C | I |
---|---|---|---|
a | b | c | 1 |
d | b | e | 0 |
f | g | e | 0 |
Provenance(t) = t.I1⋅t.I2
Provenance Semirings
Natual join
A | B | C | I |
---|---|---|---|
a | b | c | 1 |
⋈
=
A | B | C | I |
---|---|---|---|
a | b | c | 1 |
d | b | e | 1 |
f | g | e | 1 |
B | V | I |
---|---|---|
b | 2 | 1 |
g | 1 | 1 |
Provenance(t) = t.I⋅P(t)
Provenance Semirings
Aggregation
γB,count
=
Provenance_count(t) = ∑t.I⋅P(t)⋅1
Provenance Semirings
Q(D,DT−T)semiringQa(D,DT−T)
Efficiently solve the problem
Gradient descent with Tensorflow
1. Relax integers (from provenance) to continuous variable
2. Relax constraints into objective
3. Using KKT condition to remove the sub-level optimization

A preliminary result
SELECT Count(*) FROM D WHERE classify(P) = 'dog'
Too much!!
Further works
1. Gradient approach is still slow: A better algorithm to approximate objective
2. Provenance semiring is a complete solution. However, given some special types of queries, a simplified problem formulation may exist (i.e. predicate push down to avoid quadratic joins)
Conclusion
ML/SQL explanation still have lots of opportunies
1. Standardize/solid foundation
2. More ways to do explanation (Row, Column, Row/Column)
3. More efficient/accurate explanation
Debug Machine learning embedded SQL
By Weiyüen Wu
Debug Machine learning embedded SQL
- 666