Semantic analysis and plagiarism search
Serhii Vashchilin
Paymaxi
Check the text for uniqueness
We love to search plagiarism on frameworks day
1st shingle: [We, love, to]
2nd shingle: [love, to, search]
3rd shingle: [to, search, plagiarism]
...
N-th shingle: [on, frameworks, day]
Same texts: 100.0 %
Text1 vs text2: 13.09 %
Text2 vs text1: 14.28 %
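The shingle comparison above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the function names `shingles` and `containment`, the shingle size of 3 words, and the second sample text are all assumptions made here. The score is the share of one text's shingles that also occur in the other, which is one way to get a result that differs by direction, as on the slide.

```python
def shingles(text, size=3):
    """Return the set of word shingles (overlapping windows of `size` words)."""
    words = text.split()  # assumption: plain whitespace split, no normalization
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(a, b, size=3):
    """Percentage of the shingles of `a` that also occur in `b`."""
    shingles_a, shingles_b = shingles(a, size), shingles(b, size)
    if not shingles_a:
        return 0.0  # text shorter than one shingle
    return 100.0 * len(shingles_a & shingles_b) / len(shingles_a)


text1 = "We love to search plagiarism on frameworks day"
text2 = "We love to search plagiarism"  # hypothetical second text

print("Same texts:", containment(text1, text1), "%")
print("Text1 vs text2:", containment(text1, text2), "%")
print("Text2 vs text1:", containment(text2, text1), "%")
```

The score is not symmetric because each direction divides by the shingle count of a different text. A real checker would usually also lowercase the words, strip punctuation, and hash the shingles before comparing them.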
The only thing you have to choose is the compression program!
Compress each file on its own, then compress the two files merged together.
Easy enough
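import lzma
import sys

# Read both input files as raw bytes.
with open(sys.argv[1], 'rb') as f:
    x = f.read()
with open(sys.argv[2], 'rb') as f:
    y = f.read()

# Compress each file on its own and the two files concatenated.
x_comp = lzma.compress(x)
y_comp = lzma.compress(y)
x_y_comp = lzma.compress(x + y)

print(len(x_comp), len(y_comp), len(x_y_comp))

# Normalized compression distance: close to 0 for similar files,
# close to 1 for unrelated ones.
ncd = (len(x_y_comp) - min(len(x_comp), len(y_comp))) / \
    max(len(x_comp), len(y_comp))
print(ncd)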
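Written as a formula, the value printed above is the normalized compression distance. The notation is added here for illustration: C(.) stands for the compressed length in bytes and xy for the two files concatenated.

```latex
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```

Values near 0 mean the second file adds almost nothing new once the first is known; values near 1 mean the files share little. With a real compressor the result can slightly exceed 1.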
Multithreading
Save intermediate results to storage
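Both ideas can be sketched for the compression-based comparison. This is one possible arrangement under stated assumptions, not the speaker's code: the function names, the `cache` dictionary standing in for real storage, and the sample documents are all hypothetical. Each document is compressed once, its size is kept for reuse, and the pairwise compressions are spread over a thread pool.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def compressed_size(data):
    """Length in bytes of the LZMA-compressed data."""
    return len(lzma.compress(data))


def ncd_all_pairs(docs, cache=None, workers=4):
    """docs maps name -> bytes. Returns a dict (name_a, name_b) -> NCD.

    `cache` maps name -> compressed size; it is an in-memory stand-in for
    storage (assumption) and could be saved to disk between runs.
    """
    cache = {} if cache is None else cache
    missing = [name for name in docs if name not in cache]
    pairs = list(combinations(sorted(docs), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Intermediate results: compress each new document only once.
        sizes = pool.map(compressed_size, (docs[name] for name in missing))
        cache.update(zip(missing, sizes))
        # Compress every concatenated pair in parallel.
        joint = list(pool.map(
            lambda pair: compressed_size(docs[pair[0]] + docs[pair[1]]),
            pairs))
    return {
        (a, b): (size - min(cache[a], cache[b])) / max(cache[a], cache[b])
        for (a, b), size in zip(pairs, joint)
    }


# Hypothetical sample data: "a" and "b" are identical, "c" is unrelated.
docs = {
    "a": b"hello world " * 50,
    "b": b"hello world " * 50,
    "c": bytes(range(256)) * 3,
}
cache = {}
scores = ncd_all_pairs(docs, cache)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Passing the same `cache` to a later call skips recompressing documents that were already seen, so only the pairwise compressions are repeated.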
https://github.com/SergeMerge