Semantic analysis and plagiarism search

Strange project 

Check the text for uniqueness

Who needs this solution? 

  • paper mills
  • news feeds
  • publishers
  • universities

Main problems 

  • semantic analysis
  • plagiarism search
  • texts in different languages

Impasses

  • internet searching
  • semantic analysis
  • plagiarism search algorithms
  • lack of useful information

Internet searching. Tasks

  • choose search engine
  • extract keywords to search for
  • async requests
  • save search results to local storage
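
The last two tasks can be sketched as follows. This is a minimal illustration only: `fake_search` is a hypothetical stand-in for a real search-engine request (the deck does not name an engine or API), and the JSON file layout is an assumption.

```python
import asyncio
import json
from pathlib import Path


async def fake_search(keyword):
    """Hypothetical stand-in for one search-engine request."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return ["https://example.com/{}/{}".format(keyword, i) for i in range(2)]


async def search_all(keywords, search=fake_search):
    # Start one request per keyword and wait for all of them concurrently.
    results = await asyncio.gather(*(search(k) for k in keywords))
    return dict(zip(keywords, results))


def save_results(results, path):
    # Local storage here is just a JSON file (an assumption of this sketch).
    Path(path).write_text(json.dumps(results, indent=2), encoding="utf-8")


def load_results(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling `asyncio.run(search_all(["alpha", "beta"]))` returns a dictionary mapping each keyword to its list of result URLs. A real request function can be passed in through the `search` parameter.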

Semantic analysis

  • TF-IDF ranking (term frequency - inverse document frequency)
  • latent semantic analysis
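
To make the TF-IDF bullet concrete, here is a small sketch. It assumes the plain variant (relative term frequency times the natural log of inverse document frequency, with no smoothing). The deck does not say which variant the project uses, and the sample documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(docs)
```

In this sample, "the" occurs in every document and so gets weight zero, while "mat" occurs in only one document and ranks highest in the first one. Terms ranked this way are natural candidates for search keywords.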

Latent semantic analysis

  • like a simple neural network
  • singular value decomposition

Some math about SVD
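
The slide's formulas are not in this text, so what follows is the standard statement of SVD as it is used in latent semantic analysis, not necessarily the author's exact notation. The symbols A, U, Σ, V and k are introduced here for illustration, with A the m × n term-document matrix of rank r.

```latex
A = U \Sigma V^{T}, \qquad
U \in \mathbb{R}^{m \times r},\quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\quad
V \in \mathbb{R}^{n \times r},\quad
\sigma_1 \ge \dots \ge \sigma_r > 0

A_k = U_k \Sigma_k V_k^{T}, \qquad k < r

\lVert A - A_k \rVert_F = \min_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F
```

Keeping only the k largest singular values gives the best rank-k approximation of A in the Frobenius norm. Documents are then compared in the reduced k-dimensional space, typically by cosine similarity.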

Plagiarism search algorithms

  • shingles algorithm
  • NCD (normalized compression distance)

Shingles algorithms

  • text normalization
  • split the text into shingles
  • calculate hashes for each shingle
  • take a random sample of the hashes
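
The four steps above can be sketched as follows. Several choices here are assumptions, not taken from the deck: word-level shingles of length 3, CRC32 as the hash, and a repeatable "hash divisible by `modulus`" rule in place of a truly random pick, so that two documents are sampled the same way.

```python
import re
import zlib


def normalize(text):
    # Lowercase and keep only word characters; stop-word removal is omitted.
    return re.findall(r"\w+", text.lower())


def shingles(words, k=3):
    # Overlapping runs of k consecutive words.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def shingle_hashes(text, k=3):
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(normalize(text), k)}


def sample(hashes, modulus=1):
    # Keep a repeatable subset: hashes divisible by `modulus`.
    # modulus=1 keeps every hash.
    return {h for h in hashes if h % modulus == 0}


def resemblance(a, b, k=3, modulus=1):
    # Jaccard similarity of the two sampled hash sets.
    ha = sample(shingle_hashes(a, k), modulus)
    hb = sample(shingle_hashes(b, k), modulus)
    if not ha and not hb:
        return 0.0
    return len(ha & hb) / len(ha | hb)
```

A result near 1 means the texts share almost all of their shingles, and a result near 0 means they share almost none. A larger `modulus` shrinks the sets that have to be stored and compared, at the cost of a noisier estimate.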

Normalized compression distance

You have to choose only the compression program!

How does it work?

Compress the two files separately and merged together, then compare the compressed sizes.
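
As an illustration, the usual NCD formula is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. The sketch below assumes zlib as the compression program, which is only one possible choice. The sample texts are invented.

```python
import zlib


def compressed_size(data):
    # Size in bytes after zlib compression at the highest level.
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # the two "merged files"
    return (cxy - min(cx, cy)) / max(cx, cy)


original = b"the quick brown fox jumps over the lazy dog " * 20
unrelated = b"pack my box with five dozen liquor jugs " * 20
same_distance = ncd(original, original)
other_distance = ncd(original, unrelated)
```

The more the two inputs have in common, the less the merged input adds to the compressed size, so `same_distance` comes out smaller than `other_distance`. Values near 0 suggest copied content.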

Easy enough

Copy of deck

By Serg Vashchilin
