3-gram fuzzy matching

Problem: Match

"AUSTRIA BEST CORPORATION"

to

"AUSTRIA BEST CORP."

and not to

"AUSTRALIA BEST CORPORATION"

Alternatives

  • Edit-based distances: Count the number of differences
  • Token-based distances: Divide the word into fragments and count the number of identical fragments
  • Machine learning: Magic
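Why a plain edit-based distance is not enough here: a minimal sketch using the classic Levenshtein distance (one possible edit-based distance, chosen for illustration; the function name `levenshtein` is hypothetical). The wrong company is only 2 edits away, while the right one is 7 edits away.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


query = "AUSTRIA BEST CORPORATION"
print(levenshtein(query, "AUSTRIA BEST CORP."))          # 7
print(levenshtein(query, "AUSTRALIA BEST CORPORATION"))  # 2
```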

Solution

3-grams + TF-IDF

3-GRAMS: "AUSTRIA" ==> "AUS" "UST" "STR" "TRI" "RIA"
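A minimal sketch of the 3-gram split, assuming a sliding window of one character over the raw string (the helper name `char_ngrams` is hypothetical):

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return every run of n consecutive characters, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("AUSTRIA"))  # ['AUS', 'UST', 'STR', 'TRI', 'RIA']
```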

TF-IDF: Tokens that are less frequent have a larger weight.
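A rough sketch of how the weighting can change the ranking, under several assumptions: a toy corpus of six made-up names, a smoothed IDF formula, and cosine similarity as the comparison. None of these choices come from the slides, and all helper names are hypothetical. Because "CORPORATION" appears in almost every name, its 3-grams weigh little, while the rare "TRI" and "RIA" weigh a lot.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(names):
    """Build a smoothed IDF lookup: rare 3-grams get larger weights."""
    doc_freq = Counter(g for name in names for g in set(char_ngrams(name)))
    total = len(names)
    return lambda gram: math.log((1 + total) / (1 + doc_freq[gram])) + 1


def tfidf_vector(name, idf):
    """Map each 3-gram of the name to (count in name) * (IDF weight)."""
    return {g: c * idf(g) for g, c in Counter(char_ngrams(name)).items()}


def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (assumption): "CORPORATION" is common, "AUSTRIA" is rare.
corpus = [
    "AUSTRIA BEST CORP.",
    "AUSTRALIA BEST CORPORATION",
    "ACME CORPORATION",
    "GLOBEX CORPORATION",
    "INITECH CORPORATION",
    "UMBRELLA CORPORATION",
]
idf = fit_idf(corpus)
query = tfidf_vector("AUSTRIA BEST CORPORATION", idf)
scores = {name: cosine(query, tfidf_vector(name, idf)) for name in corpus}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

On this toy corpus the best match is "AUSTRIA BEST CORP.", ahead of "AUSTRALIA BEST CORPORATION". With unweighted 3-gram counts the order would be the other way round, which is why the weighting matters.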

 

Another problem

200 million companies. Iterating over all of them is SLOW.

Solution

LSH forest

LSH: Group similar strings together.

Forest: Have different groupings to reduce biases.
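A simplified sketch of the grouping idea, with loud caveats: this is plain banded MinHash LSH with several hash tables, not a real LSH forest (which, as far as I know, keeps prefix trees so it can answer top-k queries). The class `BandedLSH` and all parameters are hypothetical. Each table is one "grouping"; a query only looks at names that share a bucket with it in at least one table, so there is no need to scan every company.

```python
import hashlib
from collections import defaultdict


def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text, num_hashes=32):
    """One minimum per seeded hash; similar 3-gram sets share many minima."""
    grams = char_ngrams(text) or {text}  # fall back for very short strings
    return tuple(
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(),
                "big",
            )
            for g in grams
        )
        for seed in range(num_hashes)
    )


class BandedLSH:
    def __init__(self, num_hashes=32, bands=8):
        self.num_hashes = num_hashes
        self.rows = num_hashes // bands
        # One table per band: several independent groupings of the names.
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        return [
            signature[i * self.rows:(i + 1) * self.rows]
            for i in range(len(self.tables))
        ]

    def add(self, name):
        signature = minhash_signature(name, self.num_hashes)
        for table, key in zip(self.tables, self._band_keys(signature)):
            table[key].add(name)

    def candidates(self, name):
        """Names sharing at least one bucket with the query."""
        signature = minhash_signature(name, self.num_hashes)
        found = set()
        for table, key in zip(self.tables, self._band_keys(signature)):
            found |= table.get(key, set())
        return found


names = ["AUSTRIA BEST CORP.", "AUSTRALIA BEST CORPORATION", "ACME CORPORATION"]
index = BandedLSH()
for name in names:
    index.add(name)
print(index.candidates("AUSTRIA BEST CORPORATION"))
```

The candidate set is approximate: a near match can be missed, and which names come back depends on the hash functions. More bands with fewer rows each make the lookup more permissive; the candidates would then be re-ranked with the TF-IDF similarity.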

 

Output: 100 "closest" matches

Solution

LSH forest

1000 matches in a few seconds

Further

Extra layer in the output

Use output + metadata + magic artificial neural networks to keep training it.

n-gram

By Javier GB
