Constraint-based Explanation and Repair of Filter-based Transformations

Boss: Happy new year, Janet. Does our experimental diabetes medication work?

Janet: (Come on, I just got back from my holiday.) Let me investigate.

Mini theatre

Generating the dataset

But it takes too long...

Problems in data analysis pipeline

  • A common task: Transform dataset for downstream analysis, interactively
  • Even very simple transformations can introduce bias
  • The whole process is trial and error
  • The whole process is tedious and time-consuming

What if Janet has a system...

I want "type2 = True", "age > 65" and "COUNT(subjects) > 500"

 Here is the filtered dataset, with a relaxation "age > 50". However, the prevalence of cardiovascular disease is skewed

Make it unskewed

Looks good

A constraint "excise = True" added

So, save Janet

Formally, we want a system that

Takes:

  • An existing dataset to transform
  • User-provided constraints on the desired output dataset

Produces:

  • A set of transformations* and the data that the code produces, which best match the user’s constraints

And:

  • Explains potentially undesirable bias to the user, requests feedback, and uses this to create a new result
  • Responds with a result within a reasonable timeframe

This is hard because:

Given a set of transformations, finding the subset that best satisfies the user's needs is a combinatorial optimization problem with a costly objective.

System Design Components

User model

A user has an initial set of goals for the desired output dataset, which may include:

  • DesiredPred: Which items it should contain (e.g., type 2 diabetics);
  • TargetDistrib: The distribution of particular items (e.g., a Gaussian distributed prevalence of cardiovascular disease);
  • TupleCount: The number of data items (e.g., at least 500 subjects);
  • NoPred: Which transformations to use or avoid (e.g., no gender filtering).

Constraint weight \(\mathcal{W}\)

Usually, it is impossible to satisfy all the constraints the user provides.

  • Soft constraints
  • Constraint weight \(\mathcal{W}\) to help the user determine which to satisfy first.

Transformations

  • Folding
  • Extraction
  • Filtering
  • ...

The transformation universe

  1. Equality predicates: one for each value in each categorical column.
  2. Three predicates per histogram bin: for each numerical column, a histogram is built from the column’s values. Each bin yields a range predicate plus greater-than and less-than inequality predicates on the next bin edge.
  3. This yields the set of candidate transformations C.
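A minimal sketch of how such a candidate set could be enumerated. The `(column, operator, value)` encoding, the equal-width binning, and all function names are assumptions for illustration; the slides do not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of a predicate: (column, operator, value).
Predicate = Tuple[str, str, object]


def histogram_edges(values: List[float], n_bins: int) -> List[float]:
    """Equal-width bin edges over the observed range (n_bins + 1 edges)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(n_bins + 1)]


def candidate_predicates(
    categorical: Dict[str, List[str]],
    numerical: Dict[str, List[float]],
    n_bins: int = 4,
) -> List[Predicate]:
    candidates: List[Predicate] = []
    # 1. One equality predicate per distinct value of each categorical column.
    for column, values in categorical.items():
        for value in sorted(set(values)):
            candidates.append((column, "==", value))
    # 2. Three predicates per histogram bin of each numerical column:
    #    a range predicate, plus ">" and "<" on the bin's right ("next") edge.
    for column, values in numerical.items():
        edges = histogram_edges(values, n_bins)
        for left, right in zip(edges, edges[1:]):
            candidates.append((column, "between", (left, right)))
            candidates.append((column, ">", right))
            candidates.append((column, "<", right))
    return candidates


# Toy data: one categorical and one numerical column.
C = candidate_predicates(
    categorical={"type2": ["True", "False", "True"]},
    numerical={"age": [40.0, 50.0, 65.0, 80.0]},
    n_bins=4,
)
```

With two categorical values and four bins, the toy call yields 2 + 4 × 3 = 14 candidates.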

Interactive workflow

The system runs in a loop. At each query cycle \(i\), the system outputs:

  • Generated output dataset \(\mathcal{R}_i\)
  • Transformation program \(\mathcal{O}_i\)
  • A set \(\mathcal{P}_i\) of system-identified problems.

Also, for a given result, the system identifies:

  • Any distribution changes* amongst columns in \(\mathcal{R}_i\).

* E.g., Janet is notified that the distribution of cardiovascular disease prevalence is skewed.

Core problem: Formulation of the constraints and objective

Mathematical programming

\begin{aligned} & \min_{\mathbf{x}} && f(\mathbf{x}) \\ & \text{s.t.} && G(\mathbf{x}) = 0 \end{aligned}

Ideally...

\(\{\text{user constraints}\} \rightarrow G(\mathbf{x})\), and

\begin{aligned} & \min_{\mathbf{x}} && 0 \\ & \text{s.t.} && G(\mathbf{x}) = 0 \end{aligned}

But...

It is usually not possible to satisfy all the constraints, so they are folded into the objective instead:

\begin{aligned} & \min_{\mathbf{x}} && f(\mathbf{x}) \end{aligned}

Translating TargetDistrib

Intuitively, given a set of candidate transformations (filters) \(\mathcal{C}\), during each cycle \(i\), we want to:

\(\min_{\mathcal{O}_i \subseteq \mathcal{C}} \operatorname{dist}(\text{target distribution}, \text{current distribution})\)

But...

Histogram similarity

  • Split data of each column into bins
  • Calculate the number of instances in each bin \(b\), and store it into variable \(tc[b]\)
  • After applying a predicate, recalculate the number of instances and store it in \(bc[b]\)
\begin{aligned} \text{minimize} \quad \sum_{b=1}^{|bins|} \operatorname{abs}(tc[b] - bc[b]) \end{aligned}

Recalculate \(bc[b]\) every time? No.

  • Set up a set of indicator variables \(preds = \{p_1, p_2, \ldots\}\): \(preds[i] = 1\) if the \(i\)-th predicate is included in the result, and 0 otherwise.
  • Because we know every candidate predicate, precompute its effect ahead of time and store it in \(bpp[i,b]\): the fraction of instances left in bin \(b\) after applying filter \(i\).
  • We get: \(bc[b] = tc[b] * e^{\sum_{i=1}^{|preds|} preds[i] * \log bpp[i,b]}\), i.e. \(tc[b]\) times the product of \(bpp[i,b]\) over the selected predicates.

E.g. \(bc[b] = 0.8 * e^{0 * \log 0.1 + 1 * \log 0.3} = 0.8 * 0.3 = 0.24\)

\begin{aligned} \underset{preds}{\text{minimize}} \quad \sum_{b=1}^{|bins|} \operatorname{abs}\left(tc[b] - tc[b] * e^{\sum_{i=1}^{|preds|} preds[i] * \log bpp[i,b]}\right) \end{aligned}
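A small sketch of how this objective could be evaluated for one choice of \(preds\), taking the formula above at face value. The function names and the list-of-lists layout for \(bpp\) are assumptions; the guard for \(bpp[i,b] = 0\) is also an addition, since \(\log 0\) is undefined.

```python
import math
from typing import List


def bin_counts_after(
    tc: List[float], bpp: List[List[float]], preds: List[int]
) -> List[float]:
    """bc[b] = tc[b] * exp(sum_i preds[i] * log(bpp[i][b]))."""
    bc = []
    for b, total in enumerate(tc):
        log_keep = 0.0
        for i, selected in enumerate(preds):
            if not selected:
                continue
            if bpp[i][b] == 0.0:
                # log(0) is undefined; a selected filter that empties the
                # bin forces the count to zero (exp(-inf) == 0.0).
                log_keep = -math.inf
                break
            log_keep += math.log(bpp[i][b])
        bc.append(total * math.exp(log_keep))
    return bc


def histogram_objective(
    tc: List[float], bpp: List[List[float]], preds: List[int]
) -> float:
    """Sum over bins of abs(tc[b] - bc[b])."""
    bc = bin_counts_after(tc, bpp, preds)
    return sum(abs(t - c) for t, c in zip(tc, bc))


# The worked example from the slide: one bin, two candidate filters,
# only the second one selected.
tc = [0.8]
bpp = [[0.1], [0.3]]
preds = [0, 1]
bc = bin_counts_after(tc, bpp, preds)   # approximately [0.24]
cost = histogram_objective(tc, bpp, preds)
```

Because \(bpp\) is precomputed once, each evaluation touches only the small \(|preds| \times |bins|\) table and never rescans the data.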

Put everything together

Translating NoPred and DesiredPred

Similarly, given a set of candidate transformations (filters) \(\mathcal{C}\), during each cycle \(i\), we want to:

\(\min_{\mathcal{O}_i \subseteq \mathcal{C}} \operatorname{dist}(\text{target predicate}, \text{current predicate})\)

Calculating "Code" similarity

Code distance

Because we already know all the predicates...

Precalculate pairwise predicate similarity for all predicates!

\(S[i,u]\): Similarity score between predicates \(i\) and \(u\)

For each user-designated predicate \(u\), the distance from the selected candidate predicates is \(D_u[u] = \sum_{i=1}^{|preds|} preds[i] * (1 - S[i,u])\)

\begin{aligned} \text{minimize} \quad \sum_{i=1}^{|U_s|} \mathcal{W}[U_s, i] * D_s[i] + \sum_{i=1}^{|U_c|} \mathcal{W}[U_c, i] * D_c[i] \end{aligned}
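A sketch of one weighted term of this objective, assuming \(S\) is stored as a candidate-by-user-predicate table. The slides do not define \(U_s\) and \(U_c\); here each group is simply a list of user-designated predicate indices with matching weights, and the function would be called once per group. All names are hypothetical.

```python
from typing import List


def predicate_distance(preds: List[int], S: List[List[float]], u: int) -> float:
    """D[u] = sum_i preds[i] * (1 - S[i][u])."""
    return sum(p * (1.0 - S[i][u]) for i, p in enumerate(preds))


def weighted_code_objective(
    preds: List[int],
    S: List[List[float]],
    user_preds: List[int],
    weights: List[float],
) -> float:
    """Sum over user-designated predicates u of W[u] * D[u]."""
    return sum(
        w * predicate_distance(preds, S, u)
        for u, w in zip(user_preds, weights)
    )


# Three candidate predicates (rows) scored against two user predicates (columns).
S = [
    [1.0, 0.2],
    [0.2, 1.0],
    [0.5, 0.5],
]
preds = [1, 0, 1]                                # candidates 0 and 2 selected
d0 = predicate_distance(preds, S, 0)             # (1 - 1.0) + (1 - 0.5)
d1 = predicate_distance(preds, S, 1)             # (1 - 0.2) + (1 - 0.5)
total = weighted_code_objective(preds, S, [0, 1], [2.0, 1.0])
```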

Experiments

Compared algorithms

  • Tiresias: A similar system that aims to find a dataset satisfying user constraints. It does not generate predicates.
  • NChooseK: Enumerates all combinations of \(k\) filters and chooses the best.
  • Greedy: Uses the objective defined above, but greedily. It finds the single best predicate \(p_1\), then the best pair \(p_1, p_2\) given \(p_1\), and repeats until \(k\) predicates are found.
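The two search baselines can be sketched as follows. The `objective` callable, the index-based representation of filters, and the toy cost function are assumptions for illustration, not the implementations used in the experiments. Both functions assume \(k\) does not exceed the number of candidates.

```python
from itertools import combinations
from typing import Callable, FrozenSet

# An objective maps a set of selected filter indices to a cost (lower is better).
Objective = Callable[[FrozenSet[int]], float]


def n_choose_k(n: int, k: int, objective: Objective) -> FrozenSet[int]:
    """Exhaustive baseline: score every k-subset of n candidates."""
    subsets = (frozenset(c) for c in combinations(range(n), k))
    return min(subsets, key=objective)


def greedy(n: int, k: int, objective: Objective) -> FrozenSet[int]:
    """Add, one at a time, the filter that most improves the objective."""
    chosen: FrozenSet[int] = frozenset()
    for _ in range(k):
        remaining = [i for i in range(n) if i not in chosen]
        best = min(remaining, key=lambda i: objective(chosen | {i}))
        chosen = chosen | {best}
    return chosen


# Toy objective: distance between the sum of selected values and a target.
values = [5, 4, 3]
target = 7


def toy_objective(selected: FrozenSet[int]) -> float:
    return abs(sum(values[i] for i in selected) - target)


exhaustive_result = n_choose_k(len(values), 2, toy_objective)
greedy_result = greedy(len(values), 2, toy_objective)
```

On this toy input the exhaustive search finds the pair that hits the target exactly, while the greedy search commits to the best single filter first and ends up one unit away. That is the usual trade: NChooseK scores every \(k\)-subset, Greedy scores at most \(n\) candidates per step.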

Synthetic dataset and TargetDist

Performance of the code similarity

Conclusion

Janet, saved

  • The good: a meaningful problem with a fast usable solution.
  • The bad: a mediocre solution, not very surprising to me.
  • The ugly: the code similarity measure is somewhat heuristic.

By Weiyüen Wu