deck

Boss: Happy new year Janet. Does our experimental diabetes medication work?

Janet: (Come on, I just got back from my holiday.) Let me investigate.

Mini theatre

Generating the dataset

But it takes too long...

Problems in data analysis pipeline

A common task: Transform dataset for downstream analysis, interactively
Even very simple transformations can introduce bias
The whole process is trial and error
The whole process is tedious and time-consuming

What if Janet has a system...

I want "type2 = True", "age > 65" and "COUNT(subjects) > 500"

Here is the filtered dataset, with a relaxation "age > 50". However, the prevalence of cardiovascular disease is skewed

Make it unskewed

Looks good

A constraint "exercise = True" added

So, save Janet

Formally, we want a system that

Takes:

An existing dataset to transform
User-provided constraints on the desired output dataset

Produces:

A set of transformations* and the data that the code produces, which best match the user’s constraints

And:

Explains potentially undesirable bias to the user, requests feedback, and uses this feedback to create a new result
Responds with a result within a reasonable timeframe

This is hard because:

Given a set of transformations, finding the subset that best satisfies the user's needs is a combinatorial optimization problem with a costly objective.

System Design Components

User model

A user has an initial set of goals for the desired output dataset, which may include:

DesiredPred: Which items it should contain (e.g., type 2 diabetics);
TargetDistrib: The distribution of particular items (e.g., a Gaussian distributed prevalence of cardiovascular disease);
TupleCount: The number of data items (e.g., at least 500 subjects);
NoPred: Which transformations to use or avoid (e.g., no gender filtering).
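The four goal types above can be pictured as a small record. This is only an illustrative sketch: the class name `UserGoals` and its field names are hypothetical, not the system's actual interface, and predicates are kept as plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UserGoals:
    """Hypothetical container for one user's goals (names are assumptions)."""
    # DesiredPred: predicates the output items should satisfy
    desired_preds: List[str] = field(default_factory=list)
    # TargetDistrib: column name -> desired per-bin fractions
    target_distrib: Dict[str, List[float]] = field(default_factory=dict)
    # TupleCount: minimum number of data items in the output
    min_tuple_count: int = 0
    # NoPred: predicates (transformations) to avoid
    no_preds: List[str] = field(default_factory=list)


# Janet's initial request from the mini theatre
janet = UserGoals(
    desired_preds=["type2 = True", "age > 65"],
    min_tuple_count=500,
)
```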

Constraint weight \(\mathcal{W}\)

Usually, it is impossible to satisfy all the constraints the user provides.

Soft constraints
Constraint weights \(\mathcal{W}\) let the user indicate which constraints to satisfy first.

Transformations

Folding
Extraction
Filtering
...

The transformation universe

Equality predicates: one for each distinct value in every categorical column.
Three predicates per histogram bin: for each numerical column, a histogram is built from the column's values. Each bin yields a range predicate plus greater-than and less-than inequality predicates at the bin's next edge.
Together, these form the set of candidate transformations \(\mathcal{C}\).
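A minimal sketch of how such a candidate set could be enumerated. The function names are hypothetical, predicates are rendered as strings, and two details are assumptions not stated on the slide: bins are equal-width, and "next edge" is read as each bin's upper edge.

```python
def equality_predicates(column, values):
    """One equality predicate per distinct categorical value."""
    return [f"{column} = {v!r}" for v in sorted(set(values))]


def numeric_predicates(column, values, n_bins=4):
    """Per equal-width bin: a range predicate, plus less-than and
    greater-than predicates at the bin's upper edge (an assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + k * width for k in range(n_bins + 1)]
    preds = []
    for left, right in zip(edges, edges[1:]):
        preds.append(f"{left:g} <= {column} < {right:g}")
        preds.append(f"{column} < {right:g}")
        preds.append(f"{column} > {right:g}")
    return preds


# Toy columns: one categorical, one numerical
candidates = (
    equality_predicates("type2", [True, False, True])
    + numeric_predicates("age", [30, 50, 70], n_bins=2)
)
```

With two distinct categorical values and two bins, this toy input produces 2 + 3 * 2 = 8 candidates, including "age > 50".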

Interactive workflow

The system runs in a loop. At each query cycle \(i\), it outputs:

Generated output dataset \(\mathcal{R}_i\)
Transformation program \(\mathcal{O}_i\)
A set \(\mathcal{P}_i\) of system-identified problems.

Also, for a given result, the system identifies:

Any distribution changes* amongst columns in \(\mathcal{R}_i\).

* Janet is notified that the distribution of cardiovascular disease prevalence is skewed

Core problem: Formulation of the constraints and objective

Mathematical programming

\begin{aligned} & \min_{\mathbf{x}} && f(\mathbf{x}) \\ & s.t. && G(\mathbf{x}) = 0 \end{aligned}

Ideally...

\(\{\text{user constraints}\} \rightarrow G(\mathbf{x})\), and

\begin{aligned} & \min_{\mathbf{x}} && 0 \\ & s.t. && G(\mathbf{x}) = 0 \end{aligned}

But...

It is usually not possible to satisfy all the constraints

\begin{aligned} & \min_{\mathbf{x}} && f(\mathbf{x}) \end{aligned}

Translating TargetDistrib

Intuitively, given a bunch of candidate transformations (filters) \(\mathcal{C}\), during each turn \(i\), we want to:

\(\min_{O_i \in \mathcal{C}} dist(\text{target distribution}, \text{current distribution})\)

But...

Histogram similarity

Split data of each column into bins
Calculate the number of instances in each bin \(b\), and store it into variable \(tc[b]\)
After applying a predicate, recalculate the number of instances and store it in \(bc[b]\)

\begin{aligned} \text{minimize} \sum_{b=1}^{|bins|} \operatorname{abs}(tc[b] - bc[b]) \end{aligned}

Recalculate \(bc[b]\) every time? No.

Set up a set of indicator variables \(preds = \{p_1,p_2,...\}\): if the i-th predicate is included in the result, \(preds[i] = 1\); otherwise, \(preds[i] = 0\).
Because every candidate predicate is known in advance, precompute its effect ahead of time and store it in \(bpp[i,b]\): the fraction of instances left in bin \(b\) after applying filter \(i\).
We get: \(bc[b] = tc[b] * e^{\sum_{i=1}^{|preds|} preds[i] * \log bpp[i,b]} \)

E.g. \(bc[b] = 0.8 * e^{0 * \log 0.1 + 1 * \log 0.3} = 0.8 * 0.3 = 0.24\)
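The worked example can be checked numerically. The helper name `bin_count` is hypothetical; it simply evaluates the formula for a single bin. The sketch assumes every \(bpp[i,b]\) is strictly positive, since \(\log 0\) is undefined.

```python
import math


def bin_count(tc_b, preds, bpp_b):
    """bc[b] = tc[b] * exp(sum_i preds[i] * log(bpp[i, b])) for one bin b."""
    exponent = sum(p * math.log(f) for p, f in zip(preds, bpp_b))
    return tc_b * math.exp(exponent)


# Two candidate predicates; only the second one is selected
bc = bin_count(0.8, preds=[0, 1], bpp_b=[0.1, 0.3])
```

Here `bc` comes out at about 0.24, matching 0.8 * 0.3. Writing the product of fractions as an exponentiated sum keeps the exponent linear in the indicator variables, which is presumably why the formula takes this shape.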

\begin{aligned} \text{minimize}_{preds} \sum_{b=1}^{|bins|} \operatorname{abs}(tc[b] - tc[b] * e^{\sum_{i=1}^{|preds|} preds[i] * \log bpp[i,b]}) \end{aligned}

Put everything together

Translating NoPred and DesiredPred

Similarly, given a bunch of candidate transformations (filters) \(\mathcal{C}\), during each turn \(i\), we want to:

\(\min_{O_i \in \mathcal{C}} dist(\text{target predicate}, \text{current predicate})\)

Calculating "Code" similarity

Code distance

Because we already know all the predicates...

Precalculate pairwise predicate similarity for all predicates!

\(S[i,u]\): Similarity score for predicate \(i\) and \(u\)

For each user-designated predicate \(u\), the distance to the selected candidate predicates is \(D_u[u] = \sum_{i=1}^{|preds|} preds[i] * (1 - S[i,u])\)

\begin{aligned} \text{minimize} \sum_{i=1}^{|U_s|} \mathcal{W}[U_s, i] * D_s[i] + \sum_{i=1}^{|U_c|} \mathcal{W}[U_c,i] * D_c[i] \end{aligned}
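A small sketch of the code-distance computation, under stated assumptions: the similarity matrix `S` holds made-up scores (the deck does not say how they are computed), and a single list of user predicates with one weight each stands in for the two weighted sums, because the sets \(U_s\) and \(U_c\) are not defined on the slides. Function names are hypothetical.

```python
def code_distance(preds, S, u):
    """D[u] = sum_i preds[i] * (1 - S[i][u]) for one user predicate u."""
    return sum(p * (1.0 - S[i][u]) for i, p in enumerate(preds))


def weighted_code_objective(preds, S, weights):
    """Weighted sum of code distances over all user predicates."""
    return sum(w * code_distance(preds, S, u) for u, w in enumerate(weights))


# Hypothetical scores: 3 candidate predicates (rows) x 2 user predicates (columns)
S = [
    [1.0, 0.2],
    [0.5, 0.0],
    [0.0, 1.0],
]
preds = [1, 0, 1]        # candidates 0 and 2 are selected
weights = [2.0, 1.0]     # hypothetical constraint weights

d0 = code_distance(preds, S, 0)   # (1 - 1.0) + (1 - 0.0)
d1 = code_distance(preds, S, 1)   # (1 - 0.2) + (1 - 1.0)
total = weighted_code_objective(preds, S, weights)
```

Because `S` is precomputed, evaluating a candidate selection costs only a sum over the selected predicates.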

Experiments

Compared algorithms

Tiresias: A similar system that aims to find a dataset satisfying the user's constraints. It does not generate predicates.
NChooseK: Enumerate all possible combinations of \(k\) filters and choose the best.
Greedy: Based on the objective defined above, but greedy. Find the single best predicate \(p_1\), then the best pair \(p_1,p_2\) that extends it. Repeat until \(k\) predicates are found.
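The two search baselines can be sketched on the histogram objective. This is a toy illustration, not the evaluated implementation. One assumption is labeled loudly: the slides use \(tc[b]\) for both the reference count and the starting count, whereas this sketch separates a hypothetical `target[b]` from the starting count `orig[b]` so that the search is non-trivial. All data values are invented.

```python
import math
from itertools import combinations


def objective(chosen, target, orig, bpp):
    """sum_b |target[b] - orig[b] * prod_{i in chosen} bpp[i][b]|."""
    return sum(
        abs(target[b] - orig[b] * math.prod(bpp[i][b] for i in chosen))
        for b in range(len(orig))
    )


def n_choose_k(k, target, orig, bpp):
    """Enumerate every k-subset of predicates and keep the best one."""
    return min(
        combinations(range(len(bpp)), k),
        key=lambda c: objective(c, target, orig, bpp),
    )


def greedy(k, target, orig, bpp):
    """Add one predicate at a time, always the one that helps most."""
    chosen = ()
    for _ in range(k):
        rest = [i for i in range(len(bpp)) if i not in chosen]
        best = min(rest, key=lambda i: objective(chosen + (i,), target, orig, bpp))
        chosen += (best,)
    return tuple(sorted(chosen))


orig = [100, 100]        # starting count per bin (invented)
target = [50, 20]        # desired count per bin (invented)
bpp = [                  # bpp[i][b]: fraction kept in bin b by predicate i
    [0.5, 0.5],
    [1.0, 0.4],
    [0.6, 0.3],
]

exhaustive_pick = n_choose_k(2, target, orig, bpp)
greedy_pick = greedy(2, target, orig, bpp)
```

On this toy input the exhaustive search selects predicates (0, 1), which hit the target almost exactly, while the greedy search commits to predicate 2 first and ends with (1, 2). The example is constructed to show the trade-off: greedy evaluates far fewer combinations but can miss the best subset.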

Synthetic dataset and TargetDist

Performance of the code similarity

Conclusion

Janet, saved

Conclusion

The good: a meaningful problem with a fast, usable solution.
The bad: a mediocre solution, not very surprising to me.
The ugly: the code similarity measure is somewhat heuristic.

Constraint-based Explanation and Repair of Filter-based Transformations
