Semantic analysis and plagiarism search

Serhii Vashchilin

Paymaxi

Strange project

Check the text for uniqueness

Who needs this solution? 

  • paper mills
  • news feeds
  • publishers
  • universities

Main problems 

  • semantic analysis
  • plagiarism search
  • texts in different languages

Impasses

  • internet searching
  • semantic analysis algorithms
  • plagiarism search algorithms
  • lack of useful information

Internet searching

  • choose a search engine
  • extract keywords to search for

Semantic analysis

  • TF-IDF ranking (term frequency - inverse document frequency)
  • latent semantic analysis
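
A minimal sketch of TF-IDF ranking in plain Python, assuming the texts are already split into words. The function name `tf_idf` and the sample documents are made up for illustration; the highest-weighted terms of a text could serve as its search keywords.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample texts, already tokenized.
docs = [
    "we love to search plagiarism".split(),
    "we love frameworks day".split(),
    "plagiarism search needs semantic analysis".split(),
]
weights = tf_idf(docs)
# Terms of the first text, most distinctive first.
keywords = sorted(weights[0], key=weights[0].get, reverse=True)
print(keywords)
```

A term that occurs in every document gets weight 0, so common words drop to the bottom of the ranking.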

Latent semantic analysis

  • like a simple neural network
  • singular value decomposition
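
For reference, the singular value decomposition step can be written as below. The symbols are chosen here for illustration: X is a term-document matrix (for example, of TF-IDF weights) and k is the number of largest singular values that are kept.

```latex
X = U \Sigma V^{\top}, \qquad X \approx X_k = U_k \Sigma_k V_k^{\top}
```

The columns of \Sigma_k V_k^{\top} place each document in a k-dimensional space, where documents can be compared, for instance by cosine similarity.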

Plagiarism search algorithms

  • shingles algorithm
  • NCD (normalized compression distance)

Shingles algorithm

  • normalize the text
  • split the text into shingles
  • calculate a hash for each shingle
  • take a random sample of the hashes

How does it work with a shingle length of 3?

We love to search plagiarism on frameworks day

Shingle 1: [We, love, to]

Shingle 2: [love, to, search]

Shingle 3: [to, search, plagiarism]

...

N-th shingle: [on, frameworks, day]

Example code
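
A minimal sketch of the steps above, not the code that produced the percentages under Results. It assumes word-level shingles, CRC32 as the hash, and Jaccard similarity over the full hash sets; the random sampling step is left out so that the output is deterministic. All function names are made up for illustration.

```python
import re
import zlib


def normalize(text):
    """Lowercase the text and return its words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def shingles(words, size=3):
    """Return all overlapping word sequences of the given size."""
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


def shingle_hashes(text, size=3):
    """Hash every shingle with CRC32 and return the set of hashes."""
    return {
        zlib.crc32(" ".join(shingle).encode("utf-8"))
        for shingle in shingles(normalize(text), size)
    }


def similarity(text_a, text_b, size=3):
    """Jaccard similarity of the two hash sets, as a percentage."""
    a = shingle_hashes(text_a, size)
    b = shingle_hashes(text_b, size)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)


sentence = "We love to search plagiarism on frameworks day"
print(shingles(normalize(sentence)))
print(similarity(sentence, sentence))
```

Sampling only part of the hashes trades some accuracy for speed on long texts.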

Results

Same texts:  100.0 %
Text1 vs text2:  13.09 %
Text2 vs text1:  14.28 %

Normalized compression distance

All you have to choose is the compression program!

How does it work?

Compress each file separately, then compress the two files merged together, and compare the compressed sizes.

Easy enough

Example code

import lzma
import sys

# Read both files as raw bytes.
with open(sys.argv[1], 'rb') as f:
    x = f.read()
with open(sys.argv[2], 'rb') as f:
    y = f.read()

# Compressed sizes of each file and of the two files concatenated.
x_size = len(lzma.compress(x))
y_size = len(lzma.compress(y))
xy_size = len(lzma.compress(x + y))

print(x_size, y_size, xy_size)

# NCD: close to 0 for similar files, close to 1 for unrelated ones.
ncd = (xy_size - min(x_size, y_size)) / max(x_size, y_size)

print(ncd)

Results

  • python3 pyncd.py examples/txt/1.txt examples/txt/1.txt:
    • 0.11

  • python3 pyncd.py examples/txt/1.txt examples/txt/2.txt:
    • 0.44

  1. 1.txt is the same as 1.txt :) (NCD close to 0)

  2. 2.txt is not the same as 1.txt :( (higher NCD)

What about performance?

  • Internet searching slows the system down
  • slow on large numbers of texts

Old solution

  • multithreading

  • save intermediate results to storage

 

New solution

  • message broker (RabbitMQ)
  • full-text search engine (Apache Lucene)
  • async requests (Python packages: asyncio, aiohttp)

 

  • request timeouts
  • quantization of processes
  • split big texts to speed up semantic analysis
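
A minimal sketch of async requests with a timeout, using only asyncio. The coroutine `fake_search` is a stand-in that sleeps instead of calling a real search engine; in a real system it would presumably be an aiohttp request. All names and delays are made up for illustration.

```python
import asyncio


async def fake_search(query, delay):
    """Stand-in for an HTTP request: sleeps instead of using the network."""
    await asyncio.sleep(delay)
    return f"results for {query!r}"


async def search_with_timeout(query, delay, timeout):
    """Return the search result, or None if the request takes too long."""
    try:
        return await asyncio.wait_for(fake_search(query, delay), timeout)
    except asyncio.TimeoutError:
        return None


async def search_all(queries, timeout=0.1):
    """Run all searches concurrently; results keep the input order."""
    tasks = [search_with_timeout(q, d, timeout) for q, d in queries]
    return await asyncio.gather(*tasks)


# One query answers quickly, the other exceeds the timeout.
results = asyncio.run(search_all([("fast query", 0.01), ("slow query", 1.0)]))
print(results)
```

A slow search engine response then costs at most the timeout instead of blocking the whole check.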

https://github.com/SergeMerge

Thank you for listening!

Highload
