Similarity of Sentences/Questions

Cheuk Ting Ho



lexical similarity

  • Compare how many words the sentences share
  • Context/order of words is not important
  • Plagiarism checking algorithms
  • Not so useful for our task
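
To make the lexical idea concrete, here is a minimal sketch that scores two sentences by the share of distinct words they have in common (Jaccard similarity). Jaccard is only one possible lexical measure and is an assumption here, as are the function name and the example sentences; the slides do not name a specific metric.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Share of distinct words that appear in both sentences (word order ignored)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty sentences are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical example: same words, opposite meaning.
s1 = "the dog bites the man"
s2 = "the man bites the dog"
score = jaccard_similarity(s1, s2)
print(score)
```

Both sentences use exactly the same words, so the score is the maximum possible even though the meanings differ, which illustrates why a purely lexical measure is of limited use when meaning matters.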

semantic similarity

  • Compare the meaning of the sentences
  • Context/order of words is important
  • Used in customer service platforms (e.g. chatbots)
  • Our approach

General approach:

Vectorize each sentence to capture its content, then compare the vectors using cosine similarity
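
The vectorize-then-compare step can be sketched as follows. This is a minimal illustration, not the actual pipeline: it uses simple bag-of-words counts as the vectorizer (an assumption chosen to keep the example self-contained), and the helper names and example questions are hypothetical.

```python
import math
from collections import Counter


def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical example questions.
q1 = "how do i reset my password"
q2 = "how can i reset my password"
vocabulary = sorted(set(q1.split()) | set(q2.split()))
sim = cosine_similarity(bow_vector(q1, vocabulary), bow_vector(q2, vocabulary))
print(round(sim, 3))
```

Any other vectorizer can be swapped in for `bow_vector` without changing the comparison step, since `cosine_similarity` only needs two vectors of equal length.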

Word embedding

It captures the content better than TF-IDF and bag-of-words (BoW)

For that reason, the key is:

Would the vectorization method capture the contextual content?

and the best is:

Choice of pre-trained model:

Google Universal Sentence Encoder

  • Using Deep Learning
  • Handles sentences of different lengths 👍
  • Slow and needs lots of memory 👎

Choice of custom-trained model:

Siamese Manhattan LSTM

  • Deep learning: neural network architecture
  • Compares the outputs of 2 LSTMs (one per sentence) using Manhattan distance
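
The comparison step of a Siamese Manhattan LSTM can be sketched as below. This covers only the final similarity function, assuming the common formulation exp(-L1 distance) applied to the two LSTM output vectors; the LSTM encoders themselves are omitted, and the vectors shown are made-up placeholders rather than real model outputs.

```python
import math


def manhattan_similarity(h_left, h_right):
    """Map the L1 (Manhattan) distance between two encodings to a score in (0, 1]."""
    l1_distance = sum(abs(a - b) for a, b in zip(h_left, h_right))
    return math.exp(-l1_distance)


# Placeholder encodings standing in for the final hidden states of the two LSTMs.
h_question_1 = [0.20, -0.10, 0.40]
h_question_2 = [0.25, -0.05, 0.30]
print(manhattan_similarity(h_question_1, h_question_2))
```

Identical encodings give a score of exactly 1, and the score decays towards 0 as the encodings move apart, so it can be trained directly against similarity labels.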