Constraint-based Explanation and Repair of Filter-based Transformations
Boss: Happy new year, Janet. Does our experimental diabetes medication work?
Janet: (Come on, I just got back from my holiday.) Let me investigate.
Mini theatre
Generating the dataset
But it takes too long...
Problems in data analysis pipeline
- A common task: Transform dataset for downstream analysis, interactively
- Even very simple transformations can introduce bias
- The whole process is trial and error
- The whole process is tedious and time-consuming
What if Janet has a system...
I want "type2 = True", "age > 65" and "COUNT(subjects) > 500"
Here is the filtered dataset, with a relaxation "age > 50". However, the prevalence of cardiovascular disease is skewed
Make it unskewed
Looks good
A constraint "excise = True" added
So, save Janet
Formally, we want a system that:
- Takes an existing dataset to transform
- Takes user-provided constraints on the desired output dataset
- Produces a set of transformations* and the data they produce, which best match the user’s constraints
- Explains potentially undesirable bias to the user, requests feedback, and uses it to create a new result
- Responds with a result within a reasonable timeframe
This is hard because:
Given a set of transformations, finding the subset that best satisfies the user's needs is a combinatorial optimization problem with a costly objective.
System Design Components
User model
A user has an initial set of goals for their desired output dataset, which may include:
- DesiredPred: Which items it should contain (e.g., type 2 diabetics);
- TargetDistrib: The distribution of particular items (e.g., a Gaussian distributed prevalence of cardiovascular disease);
- TupleCount: The number of data items (e.g., at least 500 subjects);
- NoPred: Which transformations to use or avoid (e.g., no gender filtering).
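The four goal types above could be held in a small record. This is only an illustrative sketch: the `UserGoals` class and its field names are hypothetical and are not taken from the system itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UserGoals:
    """Hypothetical container for the four goal types on this slide."""
    # DesiredPred: predicates the output should satisfy
    desired_preds: List[str] = field(default_factory=list)
    # TargetDistrib: column name -> desired histogram (one weight per bin)
    target_distrib: Dict[str, List[float]] = field(default_factory=dict)
    # TupleCount: minimum number of data items in the output
    min_tuple_count: int = 0
    # NoPred: columns or predicates the system should avoid filtering on
    no_preds: List[str] = field(default_factory=list)


# Janet's request from the earlier slides, written as goals.
goals = UserGoals(
    desired_preds=["type2 = True", "age > 65"],
    min_tuple_count=500,
    no_preds=["gender"],
)
```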
Constraint weight \(\mathcal{W}\)
Usually, it is impossible to satisfy all the constraints the user provides.
- Soft constraints
- Constraint weight \(\mathcal{W}\) to help the user determine which to satisfy first.
- Folding
- Extraction
- Filtering
- ...
The transformation universe
- Equality predicates: One for each value in every categorical column.
- Three predicates per bin: For numerical columns, a histogram is built from the column’s values. Each bin yields a range predicate, plus greater-than and less-than inequality predicates at the bin's next edge.
- This yields the set of candidate transformations C.
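The candidate generation above can be sketched as follows. This is a minimal illustration under stated assumptions: equal-width bins, a `(column, operator, value)` tuple as the predicate representation, and the helper names are all hypothetical choices, not the system's own.

```python
from typing import List, Tuple

# Hypothetical predicate representation: (column, operator, value)
Predicate = Tuple[str, str, object]


def equality_predicates(column: str, values: List[object]) -> List[Predicate]:
    """One equality predicate per distinct value of a categorical column."""
    distinct = sorted(set(values), key=str)
    return [(column, "==", v) for v in distinct]


def histogram_edges(values: List[float], bins: int) -> List[float]:
    """Equal-width bin edges (an assumption; other binning schemes would work)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + k * width for k in range(bins + 1)]


def numeric_predicates(column: str, values: List[float], bins: int) -> List[Predicate]:
    """Three predicates per bin: a range, plus > and < at the bin's next edge."""
    edges = histogram_edges(values, bins)
    preds: List[Predicate] = []
    for left, right in zip(edges[:-1], edges[1:]):
        preds.append((column, "between", (left, right)))
        preds.append((column, ">", right))
        preds.append((column, "<", right))
    return preds


# Toy columns: one categorical, one numerical.
candidates = (
    equality_predicates("type2", [True, False, True])
    + numeric_predicates("age", [20, 40, 60, 80], bins=3)
)
```

With two distinct categorical values and three bins, this toy input gives 2 + 3 × 3 = 11 candidates.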
Interactive workflow
The system runs in a loop. At each query cycle \(i\), the system outputs:
- Generated output dataset \(\mathcal{R}_i\)
- Transformation program \(\mathcal{O}_i\)
- A set \(\mathcal{P}_i\) of system-identified problems.
Also, for a given result, the system identifies:
- Any distribution changes* amongst columns in \(\mathcal{R}_i\).
* Janet is notified that the distribution of cardiovascular disease prevalence
Core problem: Formulation of the constraints and objective
Mathematical programming
\(\{\text{user constraints}\} \rightarrow G(\mathbf{x})\), And
It is usually not possible to satisfy all the constraints
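One common way to handle constraints that cannot all be met is to treat each as a soft constraint and minimize a weighted sum of violations. The sketch below illustrates that idea only; the violation scores, the weights, and the function name are assumptions, and the system's actual objective may be built differently.

```python
from typing import Dict


def weighted_violation(violations: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of per-constraint violations, each scaled by its weight."""
    return sum(weights[name] * violations[name] for name in violations)


# Hypothetical violation scores in [0, 1]; 0 means fully satisfied.
violations = {"TupleCount": 0.0, "TargetDistrib": 0.3, "DesiredPred": 0.5}
# A larger weight asks the solver to satisfy that constraint first.
weights = {"TupleCount": 1.0, "TargetDistrib": 2.0, "DesiredPred": 1.0}

total = weighted_violation(violations, weights)
```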
Translating TargetDistrib
Intuitively, given a bunch of candidate transformations (filters) \(\mathcal{C}\), during each turn \(i\), we want to:
\(\min_{\mathcal{O}_i \subseteq \mathcal{C}} \text{dist}(\text{target distribution}, \text{current distribution})\)
Histogram similarity
- Split data of each column into bins
- Calculate the number of instances in each bin \(b\), and store it into variable \(tc[b]\)
- After applying a predicate, recount the instances in each bin and store the count in \(bc[b]\)
Recalculate \(bc[b]\) every time? No.
- Set up an indicator vector \(preds = \{p_1,p_2,...\}\): \(preds[i] = 1\) if the i-th predicate is included in the result, and 0 otherwise.
- Because we know every candidate predicate, we can precompute its effect ahead of time and store it in \(bpp[i,b]\): the fraction of instances left in bin \(b\) after filter \(i\).
- We get: \(bc[b] = tc[b] \cdot e^{\sum_{i=1}^{|preds|} preds[i] \cdot \log bpp[i,b]}\)
E.g. \(bc[b] = 0.8 \cdot e^{0 \cdot \log 0.1 + 1 \cdot \log 0.3} = 0.8 \cdot 0.3 = 0.24\)
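The formula can be checked numerically with a short sketch. The function name and the list-of-lists layout for \(bpp\) are assumptions made for illustration. Note that the exponent is a linear function of the indicator variables, and that multiplying the per-filter fractions treats the filters as acting independently on each bin.

```python
import math
from typing import List


def remaining_counts(tc: List[float], bpp: List[List[float]], preds: List[int]) -> List[float]:
    """bc[b] = tc[b] * exp(sum_i preds[i] * log bpp[i][b]) for every bin b."""
    bc = []
    for b, total in enumerate(tc):
        log_sum = 0.0
        emptied = False
        for i, selected in enumerate(preds):
            if not selected:
                continue
            if bpp[i][b] <= 0.0:
                # log(0) is undefined; a selected filter that keeps nothing empties the bin.
                emptied = True
                break
            log_sum += math.log(bpp[i][b])
        bc.append(0.0 if emptied else total * math.exp(log_sum))
    return bc


# The slide's example: one bin, two candidate filters, only the second selected.
bc = remaining_counts(tc=[0.8], bpp=[[0.1], [0.3]], preds=[0, 1])
```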
Put everything together
Translating NoPred and DesiredPred
Similarly, given a bunch of candidate transformations (filters) \(\mathcal{C}\), during each turn \(i\), we want to:
\(\min_{\mathcal{O}_i \subseteq \mathcal{C}} \text{dist}(\text{target predicate}, \text{current predicate})\)
Calculating "Code" similarity
Code distance
Because we already know all the predicates...
Precalculate pairwise predicate similarity for all predicates!
\(S[i,u]\): Similarity score for predicate \(i\) and \(u\)
For each user-designated predicate \(u\), the distance between \(u\) and the selected candidate predicates is \(D_u[u] = \sum_{i=1}^{|preds|} preds[i] \cdot (1 - S[i,u])\)
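The distance formula is a plain weighted sum over the selected predicates, as this sketch shows. The similarity values in `S` are made up for illustration, and indexing `S` as rows of candidates by columns of user predicates is an assumption about the layout.

```python
from typing import List


def code_distance(preds: List[int], S: List[List[float]], u: int) -> float:
    """D_u[u] = sum_i preds[i] * (1 - S[i][u]) over all candidate predicates i."""
    return sum(p * (1.0 - S[i][u]) for i, p in enumerate(preds))


# Hypothetical precomputed similarities: 3 candidates (rows) x 2 user predicates (columns).
S = [
    [1.0, 0.2],
    [0.5, 1.0],
    [0.0, 0.9],
]
preds = [1, 0, 1]  # candidates 0 and 2 are selected

d0 = code_distance(preds, S, u=0)  # (1 - 1.0) + (1 - 0.0)
d1 = code_distance(preds, S, u=1)  # (1 - 0.2) + (1 - 0.9)
```

Because \(S\) is precomputed, the distance stays linear in the indicator variables.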
Compared algorithms
- Tiresias: A similar system that aims to find a dataset satisfying the user's constraints. It does not generate predicates.
- NChooseK: Enumerate all possible combinations of \(k\) filters and choose the best.
- Greedy: Uses the objective defined above, but greedily. Find the single best predicate \(p_1\), then the best pair \(p_1,p_2\). Repeat until \(k\) predicates are found.
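The two search baselines can be contrasted with a small sketch. The objective here is a toy stand-in (distance of a subset's sum from a target), not the system's objective, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple


def n_choose_k(candidates: Sequence[int], k: int,
               objective: Callable[[Tuple[int, ...]], float]) -> Tuple[int, ...]:
    """Exhaustive baseline: score every k-subset and keep the lowest."""
    return min(combinations(candidates, k), key=objective)


def greedy(candidates: Sequence[int], k: int,
           objective: Callable[[Tuple[int, ...]], float]) -> Tuple[int, ...]:
    """Greedy baseline: extend the current choice by the best single addition."""
    chosen: Tuple[int, ...] = ()
    remaining = list(candidates)
    for _ in range(k):
        best = min(remaining, key=lambda c: objective(chosen + (c,)))
        chosen += (best,)
        remaining.remove(best)
    return chosen


def toy_objective(subset: Tuple[int, ...]) -> float:
    """Toy stand-in objective: how far the subset's sum is from 6."""
    return abs(sum(subset) - 6)


exhaustive_result = n_choose_k([5, 2, 4], 2, toy_objective)
greedy_result = greedy([5, 2, 4], 2, toy_objective)
```

On this toy input the exhaustive search returns `(2, 4)`, which hits the target exactly, while greedy commits to 5 first and ends at `(5, 2)`. Greedy scores far fewer subsets but can miss the optimum.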
Synthetic dataset and TargetDist
Performance of the code similarity
Janet, saved
- The good: a meaningful problem with a fast usable solution.
- The bad: Mediocre solution, not very "surprising" to me.
- The ugly: Code similarity is a little bit heuristic
By Weiyüen Wu