Semantic analysis and plagiarism search
Serhii Vashchilin
Paymaxi
A strange project
Check the text for uniqueness
Who needs this solution?
- paper mills
- news feeds
- publishers
- universities
Main problems
- semantic analysis
- plagiarism search
- different text languages
Impasses
- internet searching
- semantic analysis algorithms
- plagiarism search algorithms
- lack of useful information
Internet searching
- choose a search engine
- get keywords to search
Semantic analysis
- TF-IDF ranking (term frequency - inverse document frequency)
- latent semantic analysis
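The TF-IDF ranking mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code; the `tf_idf` helper and the toy documents are assumptions for the example:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    "we love plagiarism search".split(),
    "we love frameworks".split(),
]
w = tf_idf(docs)
# terms unique to one document get a positive weight;
# terms that appear in every document get weight 0
```

Terms shared by all documents score zero, so TF-IDF naturally downweights common words when picking keywords for search.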
Latent semantic analysis
- like a simple neural network
- singular value decomposition
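The SVD step of latent semantic analysis can be shown on a toy term-document matrix with NumPy. The matrix values and `cos_sim` helper are assumptions for illustration; real pipelines would start from TF-IDF weights:

```python
import numpy as np

# Term-document matrix: rows = terms, columns = documents
# (toy counts; in practice these would be TF-IDF weights)
A = np.array([
    [1, 1, 0],   # "plagiarism"
    [1, 1, 0],   # "search"
    [0, 0, 1],   # "framework"
], dtype=float)

# Singular value decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values: a low-rank "concept" space
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # each row: a document in concept space

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# documents 0 and 1 share terms, document 2 does not
print(cos_sim(docs_k[0], docs_k[1]))   # close to 1.0
print(cos_sim(docs_k[0], docs_k[2]))   # close to 0.0
</imports>

Documents compared in the truncated concept space can match even when they use different words for the same topic, which is what makes LSA useful for plagiarism detection.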
Plagiarism search algorithms
- shingles algorithm
- NCD (normalized compression distance)
Shingles algorithm
- text normalization
- split the text into shingles
- calculate a hash for each shingle
- take a random sample of the hashes to compare
How does it work with a shingle length of 3?
We love to search plagiarism on frameworks day
1st shingle: [We, love, to]
2nd shingle: [love, to, search]
3rd shingle: [to, search, plagiarism]
...
N-th shingle: [on, frameworks, day]
Example code
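A minimal sketch of the steps above (normalize, split into word shingles, hash, compare). The MD5 hash and the symmetric Jaccard-style score are assumptions for the example, not necessarily the deck's actual implementation:

```python
import hashlib
import re

def shingles(text, length=3):
    """Normalize text and return the set of hashed word shingles."""
    # normalization: lowercase, strip punctuation
    words = re.findall(r'[a-z0-9]+', text.lower())
    return {
        hashlib.md5(' '.join(words[i:i + length]).encode()).hexdigest()
        for i in range(len(words) - length + 1)
    }

def similarity(a, b, length=3):
    """Share of matching shingles between two texts, in percent."""
    sa, sb = shingles(a, length), shingles(b, length)
    return 100.0 * len(sa & sb) / max(len(sa | sb), 1)

print(similarity("We love to search plagiarism on frameworks day",
                 "We love to search plagiarism on frameworks day"))  # 100.0
```

Note that the deck's results below differ depending on comparison direction (13.09 % vs 14.28 %), which suggests an asymmetric score (matches divided by the first text's shingle count); the Jaccard variant here is symmetric.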
Results
Same texts: 100.0 %
Text1 vs text2: 13.09 %
Text2 vs text1: 14.28 %
Normalized compression distance
All you have to choose is the compression program!
How does it work?
Compress each file separately, then compress the two files concatenated, and compare the sizes:
NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
Easy enough
Example code
import lzma
import sys

x = open(sys.argv[1], 'rb').read()
y = open(sys.argv[2], 'rb').read()
x_y = x + y

x_comp = lzma.compress(x)
y_comp = lzma.compress(y)
x_y_comp = lzma.compress(x_y)

print(len(x_comp), len(y_comp), len(x_y_comp))

ncd = (len(x_y_comp) - min(len(x_comp), len(y_comp))) / \
      max(len(x_comp), len(y_comp))
print(ncd)
Results
- python3 pyncd.py examples/txt/1.txt examples/txt/1.txt: 0.11
- python3 pyncd.py examples/txt/1.txt examples/txt/2.txt: 0.44
- 1.txt is the same as 1.txt :)
- 2.txt is not the same as 1.txt :(
What about performance?
- searching the Internet slows the whole system down
- slow on large numbers of texts
Old solution
- multithreading
- save intermediate results to storage
New solution
- message broker (RabbitMQ)
- full text search engine (Apache Lucene)
- async requests (python packages: asyncio, aiohttp)
- request time-outs
- quantization of processes
- split big texts to speed up semantic analysis
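The async-requests idea above can be sketched with stdlib asyncio. The `fetch` stub below just sleeps instead of doing real HTTP (in the deck, aiohttp makes the actual requests); the URLs, delays, and 0.5 s timeout are assumptions for the example:

```python
import asyncio

async def fetch(url, delay):
    """Stand-in for an aiohttp request; sleeps instead of doing HTTP."""
    await asyncio.sleep(delay)
    return f"body of {url}"

async def fetch_with_timeout(url, delay, timeout=1.0):
    try:
        return await asyncio.wait_for(fetch(url, delay), timeout)
    except asyncio.TimeoutError:
        return None  # slow source: skip it instead of blocking the pipeline

async def main():
    urls = [("http://fast.example", 0.01), ("http://slow.example", 5.0)]
    # run all requests concurrently; slow ones are dropped by the timeout
    return await asyncio.gather(
        *(fetch_with_timeout(u, d, timeout=0.5) for u, d in urls))

results = asyncio.run(main())
print(results)  # ['body of http://fast.example', None]
```

The timeout is what keeps one slow search-engine response from stalling the whole plagiarism check, which is the performance problem the new solution addresses.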
https://github.com/SergeMerge
Thank you for listening!