![](https://s3.amazonaws.com/media-p.slid.es/uploads/769771/images/4204877/pasted-from-clipboard.png)
Semantic analysis and
plagiarism search
Serhii Vashchilin
Paymaxi
Strange
project
Check the text for uniqueness
![](https://i.pinimg.com/736x/33/f1/aa/33f1aab04a95d0ca556734a7e3f95368--turtle-soup-mock-turtle.jpg)
Who needs this solution?
- paper mills
- news feeds
- publishers
- universities
![](https://s3.amazonaws.com/media-p.slid.es/uploads/769771/images/4201275/pasted-from-clipboard.png)
Main problems
- semantic analysis
- plagiarism search
- different text languages
![](https://s3.amazonaws.com/media-p.slid.es/uploads/769771/images/4205110/king-s.jpg)
Impasses
![](https://hsto.org/getpro/habr/post_images/5b6/a43/a6c/5b6a43a6c61aa4789b4505e77dc45352.jpg)
- internet searching
- semantic analysis algorithms
- plagiarism search algorithms
- lack of useful information
Internet
searching
- choose a search engine
- extract keywords to search with
![](http://images.aif.ru/005/044/b4b58291a6d3429c010e45a91957134e.jpg)
Semantic analysis
- TF-IDF ranking (term frequency - inverse document frequency)
- latent semantic analysis
![](http://os.colta.ru/m/photo/2010/09/28/01holms_5_1.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/769771/images/4204964/introduction-to-information-retrieval-20-638.jpg)
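TF-IDF can be sketched in a few lines of plain Python; the toy corpus and tokenization below are illustrative assumptions, not the project's code:

```python
import math

# toy corpus: each "document" is just a short string
docs = [
    "we love plagiarism search",
    "we love frameworks",
    "plagiarism search algorithms",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc):
    # term frequency: share of the document occupied by the term
    tf = doc.count(term) / len(doc)
    # inverse document frequency: penalize terms that appear everywhere
    df = sum(1 for d in tokenized if term in d)
    idf = math.log(N / df)
    return tf * idf

# rank documents by how strongly they feature "plagiarism"
scores = [tf_idf("plagiarism", d) for d in tokenized]
```

A common word like "we" scores near zero everywhere, while a rarer word like "plagiarism" separates the relevant documents from the rest.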
Latent semantic analysis
![](https://hsto.org/getpro/habr/post_images/a39/dcf/f57/a39dcff57c999a877fa31b776a48e60d.jpg)
- like a simple neural network
- singular value decomposition
![](http://images.slideplayer.com/16/5189063/slides/slide_6.jpg)
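The SVD step can be sketched on a made-up term-document matrix (NumPy assumed available; the counts and the choice of k are illustrative):

```python
import numpy as np

# rows = terms, columns = documents; counts are made up for illustration
A = np.array([
    [1, 1, 0],   # "plagiarism"
    [1, 0, 0],   # "search"
    [0, 1, 1],   # "semantic"
    [0, 0, 1],   # "analysis"
], dtype=float)

# singular value decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# keep only the k strongest singular values -> latent "topic" space
k = 2
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T  # one row per document

def cos(a, b):
    # cosine similarity between two documents in the latent space
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Documents that share vocabulary end up close together in the truncated space, even when they don't share every term.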
Plagiarism search algorithms
- shingles algorithm
- NCD (normalized compression distance)
![](https://i.pinimg.com/736x/3c/2c/2e/3c2c2ee4a4ba7b9c55e70aac7d01b20b.jpg)
Shingles algorithm
- text normalization
- split the text into shingles
- calculate a hash for each shingle
- take a random sample of the hashes
![](https://s3.amazonaws.com/media-p.slid.es/uploads/769771/images/4222199/download.png)
How does it work with a shingle length of 3?
We love to search plagiarism on frameworks day
1st shingle: [We, love, to]
2nd shingle: [love, to, search]
3rd shingle: [to, search, plagiarism]
...
N-th shingle: [on, frameworks, day]
Example code
![](https://s3.amazonaws.com/media-p.slid.es/uploads/769771/images/4218810/Screenshot_from_2017-10-13_00-13-45.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/769771/images/4218825/qrcode.png)
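The steps above can be sketched as follows; MD5 as the hash and a Jaccard-style overlap for scoring are assumptions, not necessarily the project's exact choices:

```python
import hashlib
import re

def shingles(text, length=3):
    # normalize: lowercase and keep only word characters
    words = re.findall(r"\w+", text.lower())
    # split into overlapping shingles of `length` words
    grams = [words[i:i + length]
             for i in range(len(words) - length + 1)]
    # hash every shingle
    return {hashlib.md5(" ".join(g).encode()).hexdigest() for g in grams}

def similarity(a, b):
    # share of common shingle hashes, as a percentage
    sa, sb = shingles(a), shingles(b)
    return 100.0 * len(sa & sb) / max(len(sa | sb), 1)
```

Identical texts share every hash and score 100.0 %; unrelated texts share almost none.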
Results
Same texts: 100.0 %
Text1 vs text2: 13.09 %
Text2 vs text1: 14.28 %
![](http://cs307510.userapi.com/v307510219/775a/sMBLc3fVU48.jpg)
Normalized compression distance
All you have to choose is the compression program!
![](https://images.fineartamerica.com/images/artworkimages/mediumlarge/1/mad-hatter-and-rabbit-loremae-albano.jpg)
How does it work?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/769771/images/4205414/compressing.png)
![](https://www.kleo.ru/img/items/0386p.jpg)
Compress the two texts separately and their concatenation, then compare the sizes:
NCD = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
Easy enough
Example code
```python
import lzma
import sys

# read both files as bytes and concatenate them
x = open(sys.argv[1], 'rb').read()
y = open(sys.argv[2], 'rb').read()
x_y = x + y

# compress each file separately, then the concatenation
x_comp = lzma.compress(x)
y_comp = lzma.compress(y)
x_y_comp = lzma.compress(x_y)

print(len(x_comp), len(y_comp), len(x_y_comp))

# normalized compression distance: near 0 for identical texts,
# near 1 for unrelated ones
ncd = (len(x_y_comp) - min(len(x_comp), len(y_comp))) / \
      max(len(x_comp), len(y_comp))
print(ncd)
```
![](https://s3.amazonaws.com/media-p.slid.es/uploads/769771/images/4218757/22447077_167556953825127_1103370437_n.jpg)
Results
- python3 pyncd.py examples/txt/1.txt examples/txt/1.txt: 0.11
- python3 pyncd.py examples/txt/1.txt examples/txt/2.txt: 0.44
- 1.txt is the same as 1.txt :)
- 2.txt is not the same as 1.txt :(
![](https://upload.wikimedia.org/wikipedia/commons/thumb/6/66/Alice_par_John_Tenniel_38.png/426px-Alice_par_John_Tenniel_38.png)
What about performance?
![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Down_the_Rabbit_Hole.png/320px-Down_the_Rabbit_Hole.png)
- searching the Internet slows the system down
- slow on large numbers of texts
Old solution
- multithreading
- save intermediate results to storage
![](http://data.whicdn.com/images/1628732/original.jpg)
New solution
- message broker (RabbitMQ)
- full text search engine (Apache Lucene)
- async requests (python packages: asyncio, aiohttp)
![](http://data.whicdn.com/images/1628732/original.jpg)
- request timeouts
- quantization of processes
- split big texts to speed up semantic analysis
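The async-requests idea can be sketched with stdlib asyncio alone; the simulated fetch and the timeout values below are illustrative stand-ins for the real aiohttp calls:

```python
import asyncio

async def check_source(url, delay):
    # stands in for an aiohttp request to a search engine or source
    await asyncio.sleep(delay)
    return url, "ok"

async def check_all(sources, timeout=1.0):
    # fire all checks concurrently, each capped with its own timeout
    tasks = [asyncio.wait_for(check_source(u, d), timeout)
             for u, d in sources]
    # a slow source becomes a TimeoutError in the results, not a stall
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(check_all([("a", 0.01), ("b", 0.02), ("c", 5.0)],
                                timeout=0.1))
```

The fast sources return normally while the slow one times out, so one unresponsive site no longer blocks the whole batch.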
![](http://ic.pics.livejournal.com/idmkniga/39287902/1342543/1342543_original.jpg)
https://github.com/SergeMerge
Thank you for listening!
Highload
By Serg Vashchilin