Semantic analysis and plagiarism search

Serhii Vashchilin

Paymaxi

Strange project

Check the text for uniqueness

Who needs this solution? 

  • paper mills
  • news feeds
  • publishers
  • universities

Main problems 

  • semantic analysis
  • plagiarism search
  • texts in different languages

Dead ends

  • searching the internet
  • semantic analysis algorithms
  • plagiarism search algorithms
  • lack of useful information

Internet searching

  • choose a search engine
  • extract keywords to search for

Semantic analysis

  • TF-IDF ranking (term frequency - inverse document frequency), as sketched below
  • latent semantic analysis
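
A minimal sketch of TF-IDF scoring in pure Python; the example documents are made up, and this is not necessarily the exact ranking used in the project:

import math
from collections import Counter

def tf_idf(docs):
    # tf(t, d) = count of t in d / number of terms in d
    # idf(t)   = log(N / number of documents containing t)
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term appear?
    df = Counter(t for doc in tokenized for t in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t])
                       for t in tf})
    return scores

docs = ["we love to search plagiarism",
        "we love frameworks",
        "plagiarism search algorithms"]
print(tf_idf(docs)[0])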

Latent semantic analysis

  • like a simple neural network
  • singular value decomposition (see the sketch below)
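
A toy sketch of the SVD step behind latent semantic analysis, using numpy; the term-document matrix here is invented for illustration:

import numpy as np

# toy term-document matrix: rows = terms, columns = documents
# (values could be raw counts or TF-IDF weights)
A = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

# singular value decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# keep only the k largest singular values to get a low-rank
# "semantic" approximation that smooths over exact word choice
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A_k.round(2))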

Plagiarism search algorithms

  • shingles algorithm
  • NCD (normalized compression distance)

Shingles algorithm

  • text normalization
  • split the text into shingles
  • calculate a hash for each shingle
  • take a random sample of the hashes

How does it work with a shingle length of 3?

We love to search plagiarism on frameworks day

Shingle 1: [We, love, to]

Shingle 2: [love, to, search]

Shingle 3: [to, search, plagiarism]

...

Shingle N: [on, frameworks, day]

Example code
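
A minimal sketch of the shingles comparison (the original slide showed an image). Similarity here is measured as the share of the first text's shingle hashes found in the second, which would explain why text1 vs text2 and text2 vs text1 differ on the next slide; that formula, the MD5 hashing, and the normalization are all assumptions:

import hashlib

def shingles(text, length=3):
    # normalization: lowercase and strip punctuation
    words = [w.strip('.,!?').lower() for w in text.split()]
    return [' '.join(words[i:i + length])
            for i in range(len(words) - length + 1)]

def shingle_hashes(text, length=3):
    return {hashlib.md5(s.encode()).hexdigest()
            for s in shingles(text, length)}

def similarity(text1, text2):
    h1, h2 = shingle_hashes(text1), shingle_hashes(text2)
    # share of text1's shingle hashes that also occur in text2;
    # asymmetric, so similarity(a, b) != similarity(b, a)
    return 100.0 * len(h1 & h2) / max(len(h1), 1)

a = "We love to search plagiarism on frameworks day"
print(similarity(a, a))  # identical texts -> 100.0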

Results

Same texts:  100.0 %
Text1 vs text2:  13.09 %
Text2 vs text1:  14.28 %

Normalized compression distance

The only thing you have to choose is the compression program!

How does it work?

Compress each file separately, then compress the two files merged together:

NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

Easy enough.

Example code

import lzma
import sys

# read both files as raw bytes
with open(sys.argv[1], 'rb') as f:
    x = f.read()
with open(sys.argv[2], 'rb') as f:
    y = f.read()

# compress each file separately, and the two merged together
x_comp = lzma.compress(x)
y_comp = lzma.compress(y)
x_y_comp = lzma.compress(x + y)

print(len(x_comp), len(y_comp), len(x_y_comp))

# NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
ncd = (len(x_y_comp) - min(len(x_comp), len(y_comp))) \
    / max(len(x_comp), len(y_comp))

print(ncd)

Results

  • python3 pyncd.py examples/txt/1.txt examples/txt/1.txt:
    • 0.11

  • python3 pyncd.py examples/txt/1.txt examples/txt/2.txt:
    • 0.44

  1. 1.txt is the same as 1.txt :) (not exactly 0.0, since real compressors are imperfect)

  2. 2.txt is not the same as 1.txt :(

What about performance?

  • searching the internet slows the whole system down
  • slow on large collections of texts

Old solution

  • multithreading

  • save intermediate results to storage

 

New solution

  • message broker (RabbitMQ)
  • full-text search engine (Apache Lucene)
  • async requests (python packages: asyncio, aiohttp)

 

  • request timeouts (see the sketch below)
  • quantization of processes
  • split big texts to speed up semantic analysis
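
A minimal sketch of the async-requests idea with asyncio and aiohttp, including per-request timeouts; the URL is a placeholder, not one of the project's actual endpoints:

import asyncio
import aiohttp

async def fetch(session, url):
    # each request gets its own timeout, so one slow search
    # engine response cannot stall the whole batch
    try:
        async with session.get(url,
                timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return await resp.text()
    except asyncio.TimeoutError:
        return None

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # fire all requests concurrently instead of one by one
        return await asyncio.gather(*(fetch(session, u) for u in urls))

urls = ["https://example.com/search?q=plagiarism"]  # placeholder
pages = asyncio.run(main(urls))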

https://github.com/SergeMerge

Thank you for listening!
