Online Entity Resolution Using an Oracle

Donatella Firmani, Barna Saha, Divesh Srivastava

Order matters!

$(r_a,r_b)$ , $(r_a,r_c)$ , $(r_d,r_e)$ , $(r_a,r_d)$ , $(r_a,r_f)$ , $(r_d,r_f)$ .

Another ordering:

$(r_a, r_d)$ , $(r_a, r_f )$ , $(r_d, r_f )$ , $(r_d, r_e)$ , $(r_a,r_b)$ , $(r_a,r_c)$ .

The recall would be 0 after labeling the first three record pairs, 0.25 after labeling the fourth record pair, 0.5 after labeling the fifth record pair, and 1.0 after labeling the sixth record pair.
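The effect of ordering on recall can be checked with a small sketch. It assumes the ground-truth clusters $\{r_a, r_b, r_c\}$, $\{r_d, r_e\}$, $\{r_f\}$ (consistent with the matching pairs listed in these notes) and credits matching pairs inferred by transitivity. The function name `recall_curve` and the record labels are illustrative, not from the paper.

```python
def recall_curve(order, clusters):
    """Recall after each oracle answer, counting pairs inferred by transitivity."""
    truth = {r: i for i, c in enumerate(clusters) for r in c}
    total = sum(len(c) * (len(c) - 1) // 2 for c in clusters)  # all matching pairs
    parent = {r: r for r in truth}  # union-find over records

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    curve = []
    for u, v in order:
        if truth[u] == truth[v]:  # simulated oracle answers "match"
            parent[find(u)] = find(v)
        sizes = {}
        for r in parent:
            root = find(r)
            sizes[root] = sizes.get(root, 0) + 1
        found = sum(s * (s - 1) // 2 for s in sizes.values())
        curve.append(found / total)
    return curve


# Assumed ground truth for the running example.
clusters = [{"a", "b", "c"}, {"d", "e"}, {"f"}]
first = [("a", "b"), ("a", "c"), ("d", "e"), ("a", "d"), ("a", "f"), ("d", "f")]
second = [("a", "d"), ("a", "f"), ("d", "f"), ("d", "e"), ("a", "b"), ("a", "c")]

print(recall_curve(first, clusters))   # [0.25, 0.75, 1.0, 1.0, 1.0, 1.0]
print(recall_curve(second, clusters))  # [0.0, 0.0, 0.0, 0.25, 0.5, 1.0]
```

Under these assumptions the first ordering reaches full recall after three questions, while the second needs all six.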

Considering pairs with probabilities:

Matching:

$p(r_d, r_e) = 0.80$ , $p(r_b, r_c) = 0.60$ , $p(r_a, r_c) = 0.54$ , $p(r_a, r_b) = 0.46$ .

Non-matching:

$p(r_a, r_d) = 0.84$ , $p(r_d, r_f) = 0.81$ , $p(r_c, r_e) = 0.72$ , $p(r_a, r_e) = 0.65$ , $p(r_c, r_d) = 0.59$ , $p(r_e, r_f) = 0.59$ , $p(r_a, r_f) = 0.55$ , $p(r_b, r_d) = 0.51$ , $p(r_b, r_e) = 0.46$ , $p(r_c, r_f) = 0.45$ , $p(r_b, r_f) = 0.29$ .

With a strategy that aims to minimize the number of queries, a total of 7 pairs will be queried: $(r_a,r_d)$ , $(r_d,r_f)$ , $(r_d,r_e)$ , $(r_c,r_e)$ , $(r_b,r_c)$ , $(r_a,r_f)$ , $(r_a,r_c)$ .

Previous work:

Edge noise model

For matching edges, with probability $1 - \frac{\alpha}{n}$ , $\alpha \gt 0$ , the score is above 0.7, and the remaining probability mass is distributed uniformly in $[0, 0.7)$ .

For non-matching edges, with probability $1 - \frac{\beta}{n}$ , $\beta \gt 0$ , the score is below 0.1, and the remaining probability mass is distributed uniformly in $(0.1, 1]$ .
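A minimal sketch of sampling scores under this edge noise model follows. The notes only say that the high-probability region is "above 0.7" or "below 0.1", so drawing uniformly inside those regions is an assumption here, as are the function name `sample_score` and the default values of `alpha` and `beta`.

```python
import random


def sample_score(is_match, n, alpha=1.0, beta=1.0, rng=random):
    """Draw one edge score; n is the number of records."""
    if is_match:
        if rng.random() < 1 - alpha / n:
            # Assumption: uniform within the "above 0.7" region.
            return rng.uniform(0.7, 1.0)
        return rng.uniform(0.0, 0.7)  # remaining mass, roughly alpha / n
    if rng.random() < 1 - beta / n:
        # Assumption: uniform within the "below 0.1" region.
        return rng.uniform(0.0, 0.1)
    return rng.uniform(0.1, 1.0)  # remaining mass, roughly beta / n


rng = random.Random(0)
match_scores = [sample_score(True, n=100, rng=rng) for _ in range(10_000)]
non_match_scores = [sample_score(False, n=100, rng=rng) for _ in range(10_000)]
high = sum(s >= 0.7 for s in match_scores) / len(match_scores)
low = sum(s <= 0.1 for s in non_match_scores) / len(non_match_scores)
print(high, low)  # both should be close to 1 - 1/100
```

As $n$ grows, the noisy fraction $\frac{\alpha}{n}$ (or $\frac{\beta}{n}$) shrinks, so almost all edges carry an informative score.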

Wang et al.

$O(\log^2 n)$-approximation under the edge noise model

$\Omega(n)$-approximation under progressive recall

Strategy: High probability pair first.

A total of 7 pairs will be queried $(r_a,r_d)$ , $(r_d,r_f)$ , $(r_d,r_e)$ , $(r_c,r_e)$ , $(r_b,r_c)$ , $(r_a,r_f)$ , $(r_a,r_c)$ .
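The pair-first ordering can be simulated on the example probabilities. This is a sketch under stated assumptions: pairs are visited in decreasing probability, a pair is skipped when its answer already follows from earlier answers by transitivity, and the oracle is simulated from the matching pairs listed in these notes. The names `PROBS`, `MATCHING`, and `pair_first` are illustrative.

```python
PROBS = {
    ("d", "e"): 0.80, ("b", "c"): 0.60, ("a", "c"): 0.54, ("a", "b"): 0.46,
    ("a", "d"): 0.84, ("d", "f"): 0.81, ("c", "e"): 0.72, ("a", "e"): 0.65,
    ("c", "d"): 0.59, ("e", "f"): 0.59, ("a", "f"): 0.55, ("b", "d"): 0.51,
    ("b", "e"): 0.46, ("c", "f"): 0.45, ("b", "f"): 0.29,
}
MATCHING = {frozenset(p) for p in [("d", "e"), ("b", "c"), ("a", "c"), ("a", "b")]}


def pair_first(probs, matching):
    """Ask pairs in decreasing probability, skipping inferable ones."""
    cluster = {r: frozenset([r]) for pair in probs for r in pair}
    non_match = set()  # each entry: frozenset of two clusters known to differ
    asked = []
    for (u, v), _ in sorted(probs.items(), key=lambda kv: -kv[1]):
        cu, cv = cluster[u], cluster[v]
        if cu == cv or frozenset([cu, cv]) in non_match:
            continue  # answer already implied by transitivity
        asked.append((u, v))
        if frozenset([u, v]) in matching:  # simulated oracle says "match"
            merged = cu | cv
            # Re-point earlier non-match facts at the merged cluster.
            non_match = {
                frozenset(merged if c in (cu, cv) else c for c in nm)
                for nm in non_match
            }
            for r in merged:
                cluster[r] = merged
        else:
            non_match.add(frozenset([cu, cv]))
    return asked


asked = pair_first(PROBS, MATCHING)
print(len(asked), asked)
```

On this input the sketch asks 7 questions, in the same order as the sequence listed above.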

Vesdapunt et al.

$O(\log^2 n)$-approximation under the edge noise model

$\Omega(\sqrt{n})$-approximation under progressive recall

Strategy: High probability node first.

A total of 7 pairs will be queried $(r_e,r_d)$ , $(r_a,r_d)$ , $(r_c, r_e)$ , $(r_c, r_a)$ , $(r_f, r_d)$ , $(r_f, r_a)$ , $(r_b, r_c)$ .
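The node-first ordering can be sketched in the same way. The details below are assumptions chosen to illustrate the idea, not a statement of the original algorithm: records are processed in decreasing order of the sum of their pair probabilities, and each new record is compared against existing clusters in decreasing order of its best pair probability, using that best-scoring member as the representative. The names `PROBS`, `MATCHING`, and `node_first` are illustrative.

```python
PROBS = {
    ("d", "e"): 0.80, ("b", "c"): 0.60, ("a", "c"): 0.54, ("a", "b"): 0.46,
    ("a", "d"): 0.84, ("d", "f"): 0.81, ("c", "e"): 0.72, ("a", "e"): 0.65,
    ("c", "d"): 0.59, ("e", "f"): 0.59, ("a", "f"): 0.55, ("b", "d"): 0.51,
    ("b", "e"): 0.46, ("c", "f"): 0.45, ("b", "f"): 0.29,
}
MATCHING = {frozenset(p) for p in [("d", "e"), ("b", "c"), ("a", "c"), ("a", "b")]}


def node_first(probs, matching):
    """Insert records one at a time, comparing each to existing clusters."""
    p = {frozenset(k): v for k, v in probs.items()}
    records = sorted({r for k in probs for r in k})
    # Assumption: node priority is the sum of its pair probabilities.
    order = sorted(records, key=lambda r: -sum(v for k, v in p.items() if r in k))
    clusters, asked = [], []
    for r in order:
        # For each cluster, pick the member most likely to match r.
        candidates = sorted(
            ((max(c, key=lambda m: p[frozenset((r, m))]), c) for c in clusters),
            key=lambda rc: -p[frozenset((r, rc[0]))],
        )
        for rep, c in candidates:
            asked.append((r, rep))
            if frozenset((r, rep)) in matching:  # simulated oracle says "match"
                c.add(r)
                break
        else:
            clusters.append({r})  # no cluster matched: r starts a new one
    return asked


asked = node_first(PROBS, MATCHING)
print(len(asked), asked)
```

With these assumptions the records are processed in the order $r_d, r_e, r_a, r_c, r_f, r_b$, and the 7 questions asked coincide with the sequence listed above.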

A strategy $S^*$ that maximizes progressive recall (PRecall) is:

Assume the ground truth is already known.
For each cluster $C_i$ in the graph (a set of records referring to the same real-world entity), ask the oracle $|C_i| - 1$ matching questions, enough to connect the cluster.
Clusters are processed in the order $C_1, C_2, \dots, C_k$ , sorted by decreasing size $|C_i|$ .
Finally, ask $\binom{k}{2}$ non-matching questions, one per pair of clusters.
In total, $S^*$ asks $n - k + \binom{k}{2}$ questions, since $\sum_{i=1}^{k} (|C_i| - 1) = n - k$ .
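The ideal ordering can be generated directly once the clusters are known. This sketch connects each cluster with $|C| - 1$ matching questions, which is what the total $n - k + \binom{k}{2}$ implies; linking every member to the first record of its cluster is one arbitrary choice of spanning tree, and the name `ideal_order` is illustrative.

```python
from itertools import combinations


def ideal_order(clusters):
    """Matching questions for the largest clusters first, then non-matching ones."""
    ordered = sorted((sorted(c) for c in clusters), key=len, reverse=True)
    queries = []
    for c in ordered:
        # |C| - 1 matching questions: link every member to the first record.
        queries += [(c[0], m) for m in c[1:]]
    # One non-matching question per pair of clusters.
    queries += [(x[0], y[0]) for x, y in combinations(ordered, 2)]
    return queries


clusters = [{"a", "b", "c"}, {"d", "e"}, {"f"}]
queries = ideal_order(clusters)
n, k = sum(len(c) for c in clusters), len(clusters)
print(queries)
print(len(queries), n - k + k * (k - 1) // 2)  # 6 6
```

For the running example ($n = 6$, $k = 3$) this yields $(r_a,r_b)$, $(r_a,r_c)$, $(r_d,r_e)$, $(r_a,r_d)$, $(r_a,r_f)$, $(r_d,r_f)$, that is, 6 questions.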