Measure the similarity between texts on a large scale
Which similarity metrics are relevant to our case?
Is similarity alone enough to explain, or at least better understand, Google (de)indexation?
Can we build a mathematical model that would "predict" indexation (or the chances of indexation) for a given text?
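One common candidate metric for the questions above is cosine similarity over bag-of-words vectors. A minimal sketch, assuming simple whitespace tokenization (the function name and example texts are illustrative, not part of the original notes):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using raw bag-of-words counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product only needs words shared by both texts.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical texts score 1.0; disjoint vocabularies score 0.0.
print(cosine_similarity("the cat sat on the mat",
                        "the cat sat on the rug"))
```

At large scale, the same idea is usually applied to TF-IDF or embedding vectors rather than raw counts, with approximate nearest-neighbor search to avoid comparing all pairs.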