Boss: Happy new year Janet. Does our experimental diabetes medication work?
Janet: (Come on, I just got back from my holiday.) Let me investigate.
I want "type2 = True", "age > 65" and "COUNT(subjects) > 500"
Here is the filtered dataset, with a relaxation "age > 50". However, the prevalence of cardiovascular disease is skewed
Make it unskewed
Looks good
A constraint "exercise = True" is added
So, save Janet
Takes:
Produces:
And:
Given a set of transformations, finding the subset that best satisfies the user's needs is a combinatorial optimization problem with a costly objective.
A user has an initial set of goals for their desired output dataset, which may include:
Usually, it is impossible to satisfy all the constraints the user provides.
The system runs in a loop. At each query cycle \(i\), the system outputs:
Also, for a given result, the system identifies:
* Janet is notified that the distribution of cardiovascular disease prevalence is skewed
\(\{\text{user constraints}\} \rightarrow G(\mathbf{x})\), And
It is usually not possible to satisfy all the constraints
Intuitively, given a bunch of candidate transformations (filters) \(\mathcal{C}\), during each turn \(i\), we want to:
\(\min_{O_i \in \mathcal{C}} \mathrm{dist}(\text{target distribution}, \text{current distribution})\)
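The per-turn objective above can be sketched in a few lines. This is only an illustration under stated assumptions, not the system's actual search: it assumes each candidate is a row filter, that a single candidate is picked per turn, and that dist is total variation distance. The names `distribution`, `total_variation`, `best_filter`, and the toy rows are all hypothetical.

```python
from collections import Counter


def distribution(rows, column):
    """Empirical distribution of `column` over `rows` (a list of dicts)."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def total_variation(p, q):
    """Total variation distance; an assumed choice for dist."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def best_filter(rows, candidates, column, target):
    """Return (name, distance) for the filter whose output is closest to target."""
    best = None
    for name, keep in candidates.items():
        kept = [row for row in rows if keep(row)]
        if not kept:  # a filter that removes every row has no distribution
            continue
        d = total_variation(distribution(kept, column), target)
        if best is None or d < best[1]:
            best = (name, d)
    return best


# Toy data: cardiovascular disease (cvd) is over-represented among older subjects.
rows = [
    {"age": 70, "cvd": True},
    {"age": 68, "cvd": True},
    {"age": 66, "cvd": True},
    {"age": 55, "cvd": False},
    {"age": 52, "cvd": False},
    {"age": 51, "cvd": True},
]
candidates = {
    "age > 65": lambda r: r["age"] > 65,
    "age > 50": lambda r: r["age"] > 50,
}
target = {True: 0.5, False: 0.5}  # the "unskewed" goal

choice = best_filter(rows, candidates, "cvd", target)
print(choice)
```

Note that every candidate requires applying the filter and recomputing the distribution, which is the costly part of the objective.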
But...
E.g. \(bc[b] = 0.8 \cdot e^{0 \cdot \log 0.1 + 1 \cdot \log 0.3} = 0.8 \cdot 0.3 = 0.24\)
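The arithmetic in this example can be checked directly. The sketch below assumes the expression has the general shape "leading factor times the exponential of a count-weighted sum of logs"; the names `weight`, `counts`, and `probs` are hypothetical labels for the numbers in the example, since the terms are not defined here.

```python
import math

weight = 0.8        # leading factor in the example
counts = [0, 1]     # multipliers on each log term
probs = [0.1, 0.3]  # values inside the logs

# Sum the weighted logs first, then exponentiate once.
log_sum = sum(c * math.log(p) for c, p in zip(counts, probs))
score = weight * math.exp(log_sum)
print(round(score, 6))
```

The zero multiplier drops the \(\log 0.1\) term entirely, so only \(e^{\log 0.3} = 0.3\) remains.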
Similarly, given a bunch of candidate transformations (filters) \(\mathcal{C}\), during each turn \(i\), we want to:
\(\min_{O_i \in \mathcal{C}} \mathrm{dist}(\text{target predicate}, \text{current predicate})\)
Because we already know all the predicates...
Precalculate pairwise predicate similarity for all predicates!
\(S[i,u]\): Similarity score for predicate \(i\) and \(u\)
For each user-designated predicate \(u\), the distance to the candidate predicates is \(D_u[u] = \sum_{i=1}^{|preds|} preds[i] \cdot (1 - S[i,u])\)
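A minimal sketch of this distance, assuming \(preds\) is a 0/1 vector marking which predicates are present in the current candidate and that \(S\) has already been precomputed. The function name `predicate_distance` and the similarity values are made up for illustration.

```python
def predicate_distance(preds, S, u):
    """Compute sum_i preds[i] * (1 - S[i][u]) for a user-designated predicate u."""
    return sum(p * (1.0 - S[i][u]) for i, p in enumerate(preds))


# Hypothetical precomputed pairwise similarity for three predicates.
S = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
preds = [1, 0, 1]  # predicates 0 and 2 are present in the candidate

d = predicate_distance(preds, S, u=1)
print(round(d, 6))
```

Because \(S\) is looked up rather than recomputed, each distance costs only one pass over the predicate vector.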
Janet, saved