Measure the similarity between texts on a large scale
Which similarity metrics are relevant to our case?
Is similarity on its own enough to explain, or at least better understand, Google (de)indexation?
Can we build a mathematical model that would "predict" indexation (or the chances of indexation) for a given text?
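One way to make that last question concrete, purely as a hypothetical sketch: treat indexation as a binary outcome and fit a logistic regression on similarity features. Everything below (the features, the data, the labels) is invented for illustration; the point is only the shape such a model could take.

```python
# Minimal sketch (hypothetical data): estimate indexation chances
# from similarity features with a logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up features per page: [max similarity to already-indexed pages,
# mean similarity to the rest of the site], both in [0, 1].
X = np.array([
    [0.15, 0.10],  # mostly unique text
    [0.20, 0.25],
    [0.85, 0.70],  # near-duplicate text
    [0.90, 0.80],
])
y = np.array([1, 1, 0, 0])  # 1 = indexed, 0 = deindexed (made-up labels)

model = LogisticRegression().fit(X, y)
# Estimated indexation probability for a new page
print(model.predict_proba([[0.5, 0.4]])[:, 1])
```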
NLP operates on three levels:
- Syntax (syntactic processing: what is grammatical?)
- Semantics (semantic processing: what does it mean?)
- Pragmatics (pragmatic processing: what is the goal?)
So does similarity.
For our purposes, computing similarity is interesting on two of these levels (see the sketch after this list):
- Syntax: what makes two texts syntactically/grammatically similar? How do we measure syntactic similarity?
- Semantics: what makes two texts semantically similar? How do we measure semantic similarity?
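As a starting point, here is a minimal sketch of both measurements. Jaccard overlap of word trigrams is one common proxy for surface/syntactic similarity; TF-IDF cosine is used here as a crude stand-in for semantic similarity (in practice, semantic similarity is usually measured on embeddings). The example texts are invented.

```python
# Syntactic similarity: Jaccard overlap of word trigrams.
# Semantic similarity (crude lexical proxy): cosine of TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ngrams(text, n=3):
    # Set of word n-grams, a simple fingerprint of surface form
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b, n=3):
    # Shared n-grams divided by total distinct n-grams
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / len(A | B) if A | B else 0.0

t1 = "the cat sat on the mat"
t2 = "the cat sat on the rug"

print("syntactic (trigram Jaccard):", jaccard(t1, t2))

tfidf = TfidfVectorizer().fit_transform([t1, t2])
print("semantic proxy (TF-IDF cosine):", cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```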
How do we represent language (text) mathematically?
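One classic answer, sketched below, is the bag-of-words model: each text becomes a vector of word counts over a shared vocabulary, which is exactly the kind of representation the similarity metrics above operate on. The example texts are invented.

```python
# Bag-of-words: map each text to a vector of word counts
# over a vocabulary shared by the whole corpus.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # vocabulary = vector dimensions
print(X.toarray())                         # each row is one text as a vector
```

Once texts live in the same vector space, similarity reduces to geometry (e.g., cosine of the angle between two vectors).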