Semantic analysis and plagiarism search

Strange project 

Check the text for uniqueness

Who needs this solution? 

  • paper mills
  • news feeds
  • publishers
  • universities

Main problems 

  • semantic analysis
  • plagiarism search
  • texts in different languages

Impasses

  • internet searching
  • semantic analysis
  • plagiarism search algorithms
  • lack of useful information

Internet searching. Tasks

  • choose a search engine
  • extract keywords to search for
  • send async requests
  • save search results to local storage
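The async-request and local-storage tasks could be sketched roughly as below. This is only an illustration: `fake_search`, `search_all`, `save_results` and the example.com URLs are hypothetical names, and `fake_search` stands in for whichever real search engine is chosen.

```python
import asyncio
import json
from pathlib import Path


async def fake_search(query):
    """Hypothetical stand-in for a real search-engine request."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"https://example.com/{query.replace(' ', '-')}"]


async def search_all(queries, search=fake_search):
    """Run all searches concurrently and map each query to its results."""
    results = await asyncio.gather(*(search(q) for q in queries))
    return dict(zip(queries, results))


def save_results(results, path):
    """Store the results as JSON so later steps can work offline."""
    Path(path).write_text(json.dumps(results, indent=2), encoding="utf-8")


queries = ["latent semantic analysis", "shingles algorithm"]
results = asyncio.run(search_all(queries))
```

Calling `save_results(results, "results.json")` would then write the mapping to disk; the file name is an arbitrary choice.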

Semantic analysis

  • TF-IDF ranking (term frequency - inverse document frequency)
  • latent semantic analysis
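As a rough illustration of TF-IDF ranking, the sketch below weights each term by its frequency in a document times the logarithm of how rare it is across all documents. It assumes whitespace tokenization and the plain `log(N / df)` form of the inverse document frequency; other variants exist, and the function name `tf_idf` is made up for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog sat", "the bird flew"]
scores = tf_idf(docs)
```

A word that occurs in every document ("the") gets weight zero, while a word unique to one document ("cat") gets the highest weight, which makes it a natural keyword candidate.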

Latent semantic analysis

  • like a simple neural network
  • singular value decomposition

Some math about SVD
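As a reminder of the standard definition (the symbols below are assumptions, not taken from the slide): let A be the term-document matrix of rank r. The singular value decomposition factors it into orthonormal term vectors U, orthonormal document vectors V, and a diagonal matrix of singular values. Latent semantic analysis keeps only the k largest singular values, which gives the best rank-k approximation of A in the Frobenius norm.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad k < r
```

Documents are then usually compared by the cosine between their columns in the reduced k-dimensional space.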

Plagiarism search algorithms

  • shingles algorithm
  • NCD (normalized compression distance)

Shingles algorithm

  • text normalization
  • divide text into shingles
  • calculate a hash for each shingle
  • take a random sample of the hashes
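The four steps above can be sketched as follows. This is one possible reading, not necessarily the project's implementation: it assumes word-level shingles of three words, CRC32 as the hash, and a sample made of the hashes divisible by `modulus`. The sampling rule has to be the same for both texts so the samples stay comparable, which is why a fixed rule replaces a truly random pick here. All function names are hypothetical.

```python
import re
import zlib


def normalize(text):
    """Lowercase the text and keep only word tokens."""
    return re.findall(r"\w+", text.lower())


def shingles(words, size=3):
    """Return the set of overlapping word sequences of the given length."""
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def shingle_hashes(text, size=3):
    """Hash every shingle of the normalized text with CRC32."""
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(normalize(text), size)}


def sample(hashes, modulus=1):
    """Keep roughly 1/modulus of the hashes; modulus=1 keeps them all."""
    return {h for h in hashes if h % modulus == 0}


def resemblance(text_a, text_b, size=3, modulus=1):
    """Jaccard similarity of the two sampled hash sets, from 0.0 to 1.0."""
    a = sample(shingle_hashes(text_a, size), modulus)
    b = sample(shingle_hashes(text_b, size), modulus)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

With the default `modulus=1` the result is the exact Jaccard similarity of the shingle sets; larger values trade accuracy for speed on long texts.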

Normalized compression distance

The only thing you have to choose is the compression program!

How does it work?

Compress each file on its own and the two files merged together, then compare the compressed sizes.

Easy enough
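A minimal sketch of the idea, assuming zlib as the chosen compressor and the usual formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The function names and sample texts are made up for illustration.

```python
import zlib


def compressed_size(data):
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


original = (b"Latent semantic analysis maps documents into a space of hidden "
            b"topics, so texts with similar meaning end up close together.")
near_copy = original.replace(b"hidden", b"latent")
unrelated = (b"Boil the pasta for nine minutes, drain it, then stir in fresh "
             b"basil, olive oil and a handful of grated parmesan cheese.")

d_copy = ncd(original, near_copy)
d_other = ncd(original, unrelated)
```

The near copy should score much lower than the unrelated text. On inputs this short the values are noisy, so real use would compare whole documents.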
