Detecting Duplicate Resume Documents

Author: Andy Song (宋楠)

Naive Approach (Brute Force)

First, thanks to Xuanfei (刘轩飞) for helping us define the detailed criteria for detecting duplicates among resume documents.

Second, we could take these criteria, go through all the documents, and check which documents satisfy each criterion.

But we need to consider the complexity. With M documents, checking a single criterion means going through the whole collection, so the number of comparisons lies between (M-1) and M*(M-1). Each criterion type then yields many groups, of the form group(criteria type){criteria: {m1, m2}, criteria: {m3, m5}......
And since we have 7 criteria, this may yield as many as group(criteria 1) * group(criteria 2) * .... * group(criteria 7) combinations.

Finally, we would need a lot of space and time to check the final results.
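To make the cost concrete, here is a minimal Python sketch of the brute-force approach for a single criterion. The dictionary layout, the field names, and the helper `naive_duplicate_pairs` are assumptions for illustration only; the real criteria are not spelled out here. It compares every unordered pair once, which is M*(M-1)/2 comparisons and therefore falls inside the range given above.

```python
from itertools import combinations

def naive_duplicate_pairs(docs, field):
    """Compare every unordered pair of documents on one criterion.

    Hypothetical helper: `docs` maps a document name to a dict of fields.
    Returns the matching pairs and the number of comparisons performed.
    """
    pairs = []
    comparisons = 0
    for (name_a, doc_a), (name_b, doc_b) in combinations(docs.items(), 2):
        comparisons += 1
        value = doc_a.get(field)
        if value is not None and value == doc_b.get(field):
            pairs.append((name_a, name_b))
    return pairs, comparisons

# Illustrative data, not real resumes.
docs = {
    "A": {"telephone": "12345", "email": "A@B.com"},
    "B": {"telephone": "23456", "email": "C@D.com"},
    "C": {"telephone": "12345", "email": "A@B.com"},
}
pairs, comparisons = naive_duplicate_pairs(docs, "telephone")
print(pairs, comparisons)  # [('A', 'C')] 3
```

With 3 documents this is only 3 comparisons, but the count grows quadratically, and it has to be repeated for each of the 7 criteria.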

What can we do?

First, we need to review the problem. The goal is to find the sets of documents that match the criteria while reducing the number of comparisons needed to find those matches. Try to imagine the difference between comparing 100 documents and comparing 10000 documents.

Second, the goal therefore becomes reducing the size of the document set being compared. If we can probabilistically filter the likely-similar documents out of the massive collection, we also win the game.

Divide and conquer becomes your Swiss Army knife again.

Probabilistic Approach

MinHash is a method for estimating the similarity among documents.

So our problem becomes how to transform the documents into a boolean matrix for the similarity computation.

Bool Matrix Transform

The transformation behaves much like feature extraction.

We need to go through the documents once and compute statistics tables like the following, recording how many times each telephone number appears, how many times each email address appears, and so on:

telephone	counts	email	counts
1*****68	10	A***@***.com	2
1*****89	1	B***@***.com	1
...	...	...	...
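The first pass can be sketched in Python as follows. This is a minimal sketch under assumptions: documents are represented as dicts, only the two fields from the table (telephone and email) are counted, and `build_statistics` is a hypothetical helper name.

```python
from collections import Counter

def build_statistics(docs, fields=("telephone", "email")):
    """First pass: count how many documents contain each concrete value.

    Hypothetical helper: returns one Counter per field.
    """
    stats = {field: Counter() for field in fields}
    for doc in docs.values():
        for field in fields:
            value = doc.get(field)
            if value is not None:
                stats[field][value] += 1
    return stats

# Illustrative data, not real resumes.
docs = {
    "Doc A": {"telephone": "12345", "email": "A@B.com"},
    "Doc B": {"telephone": "23456", "email": "C@D.com"},
    "Doc C": {"telephone": "12345", "email": "A@B.com"},
}
stats = build_statistics(docs)
print(stats["telephone"]["12345"])  # 2
print(stats["email"]["C@D.com"])    # 1
```

A `Counter` keeps this pass linear in the number of documents, which is the point of collecting the statistics before doing any comparison.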

Then go through the documents again and compute each document's bool matrix according to the following rules.

Bool Matrix Transform (continued 1)

If the document contains a concrete criterion value, look that value up in the statistics table. If the number of documents containing the value is greater than 1, the document gets a 1 in that criterion's row; otherwise it gets a 0. For example:

document sets:

Document A: {telephone no: 12345, email: A@B.com}

Document B: {telephone no: 23456, email: C@A.com}

statistics table:

telephone	# docs	email	# docs
12345	10	A@B.com	1
23456	1	C@A.com	5

bool matrix (simple):

type	Doc A	Doc B
email	0	1
telephone	1	0
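The rule above can be sketched in Python using the example's two documents and its statistics table. The function name `bool_matrix` and the dict-based representation are assumptions for illustration; the counts are copied from the example rather than computed.

```python
def bool_matrix(docs, stats, fields=("email", "telephone")):
    """Second pass: one row per criterion type, one column per document.

    A cell is 1 when the document's value for that criterion appears in
    more than one document according to the statistics table, else 0.
    """
    matrix = {}
    for field in fields:
        row = []
        for doc in docs.values():
            value = doc.get(field)
            shared = value is not None and stats[field].get(value, 0) > 1
            row.append(1 if shared else 0)
        matrix[field] = row
    return matrix

docs = {
    "Doc A": {"telephone": "12345", "email": "A@B.com"},
    "Doc B": {"telephone": "23456", "email": "C@A.com"},
}
# Counts taken from the example's statistics table.
stats = {
    "telephone": {"12345": 10, "23456": 1},
    "email": {"A@B.com": 1, "C@A.com": 5},
}
matrix = bool_matrix(docs, stats)
print(matrix)  # {'email': [0, 1], 'telephone': [1, 0]}
```

Doc A scores 1 on telephone because 12345 appears in 10 documents, and 0 on email because A@B.com appears in only one.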

Bool Matrix Transform (continued 2)

Ideal Case

Criteria	Doc A	Doc B	Doc C
12345	1	0	1
23456	0	1	0
A@B.com	1	0	1
C@D.com	0	1	0

document sets:

Document A: {telephone no: 12345, email: A@B.com}

Document B: {telephone no: 23456, email: C@D.com}

Document C: {telephone no: 12345, email: A@B.com}
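As a rough sketch of how the ideal-case table could be produced, the Python below builds one row per concrete value and one column per document. It assumes, based on the table, that every value gets a row even when it appears in only one document; the helper `value_matrix` and the field names are hypothetical.

```python
def value_matrix(docs):
    """Build a bool matrix with one row per concrete criterion value.

    Hypothetical helper: a cell is 1 when the document contains the value.
    """
    values = []
    for doc in docs.values():
        for value in doc.values():
            if value not in values:
                values.append(value)
    return {
        value: [1 if value in doc.values() else 0 for doc in docs.values()]
        for value in values
    }

docs = {
    "Doc A": {"telephone": "12345", "email": "A@B.com"},
    "Doc B": {"telephone": "23456", "email": "C@D.com"},
    "Doc C": {"telephone": "12345", "email": "A@B.com"},
}
matrix = value_matrix(docs)
print(matrix["12345"])    # [1, 0, 1]
print(matrix["C@D.com"])  # [0, 1, 0]
```

Each column of this matrix is the set of values a document contains, which is the input MinHash needs.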

MinHash

Source:

H(a)	H(b)	Criteria	Doc A	Doc B	Doc C
1	3	12345	1	0	1
2	0	23456	0	1	0
3	2	A@B.com	1	0	1
4	4	C@D.com	0	1	0

Results:

Hash	Doc A	Doc B	Doc C
H(a)	1	2	1
H(b)	2	0	2

Doc A and Doc C have identical MinHash values under both hash functions, so we consider Doc A similar to Doc C.
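The results table can be reproduced with a short Python sketch. The hash values are copied from the source table rather than computed by real hash functions, and `minhash_signatures` and `estimated_similarity` are hypothetical helper names. For each document and each hash function, the signature keeps the smallest hash value among the rows where the document has a 1.

```python
doc_names = ["Doc A", "Doc B", "Doc C"]

# ((H(a), H(b)), membership bits for Doc A, Doc B, Doc C)
rows = [
    ((1, 3), (1, 0, 1)),  # 12345
    ((2, 0), (0, 1, 0)),  # 23456
    ((3, 2), (1, 0, 1)),  # A@B.com
    ((4, 4), (0, 1, 0)),  # C@D.com
]

def minhash_signatures(rows, num_docs):
    """Return one signature per document: the minimum hash value, per hash
    function, over the rows in which the document has a 1."""
    num_hashes = len(rows[0][0])
    signatures = [[float("inf")] * num_hashes for _ in range(num_docs)]
    for hashes, bits in rows:
        for doc_index, bit in enumerate(bits):
            if bit:
                for k, h in enumerate(hashes):
                    signatures[doc_index][k] = min(signatures[doc_index][k], h)
    return signatures

def estimated_similarity(sig_x, sig_y):
    """Fraction of hash functions on which two signatures agree."""
    agree = sum(1 for x, y in zip(sig_x, sig_y) if x == y)
    return agree / len(sig_x)

signatures = minhash_signatures(rows, len(doc_names))
print(dict(zip(doc_names, signatures)))
# {'Doc A': [1, 2], 'Doc B': [2, 0], 'Doc C': [1, 2]}
print(estimated_similarity(signatures[0], signatures[2]))  # 1.0
print(estimated_similarity(signatures[0], signatures[1]))  # 0.0
```

With only two hash functions the estimate is very coarse; more hash functions give a finer estimate at the cost of longer signatures.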

Finally

After MinHash has found the similar documents, you can run the final comparison on them to get the final correct answers.
So MinHash can only help you shrink the document set and reduce the complexity of the duplicate comparison.
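A minimal Python sketch of this last step, under stated assumptions: candidates are grouped by identical signature (a simplification; a real system might accept partial agreement), and `is_duplicate` stands in for the real criteria, which are not spelled out here. All helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_groups(signatures):
    """Group document names that share the same MinHash signature."""
    buckets = defaultdict(list)
    for name, signature in signatures.items():
        buckets[tuple(signature)].append(name)
    return [names for names in buckets.values() if len(names) > 1]

def confirm_duplicates(groups, docs, is_duplicate):
    """Run the exact comparison only inside each candidate group."""
    confirmed = []
    for names in groups:
        for name_a, name_b in combinations(names, 2):
            if is_duplicate(docs[name_a], docs[name_b]):
                confirmed.append((name_a, name_b))
    return confirmed

# Signatures taken from the MinHash results table.
signatures = {"Doc A": [1, 2], "Doc B": [2, 0], "Doc C": [1, 2]}
docs = {
    "Doc A": {"telephone": "12345", "email": "A@B.com"},
    "Doc B": {"telephone": "23456", "email": "C@D.com"},
    "Doc C": {"telephone": "12345", "email": "A@B.com"},
}

def same_fields(doc_x, doc_y):
    """Placeholder criterion (an assumption): all fields are equal."""
    return doc_x == doc_y

groups = candidate_groups(signatures)
confirmed = confirm_duplicates(groups, docs, same_fields)
print(groups)     # [['Doc A', 'Doc C']]
print(confirmed)  # [('Doc A', 'Doc C')]
```

The exact comparison now runs on one pair instead of all three, which is the saving MinHash is meant to provide.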