3-gram fuzzy matching
Goal: match
"AUSTRIA BEST CORPORATION"
to
"AUSTRIA BEST CORP."
and not to
"AUSTRALIA BEST CORPORATION"
- Edit-based distances: Count number of differences
- Token-based distances: Split the string into fragments and count the number of identical fragments
- Machine learning: Magic
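A minimal sketch of the first two families, applied to the three example names. The function names (levenshtein, trigram_jaccard) are illustrative choices, not from the original talk. Levenshtein stands in for "edit-based" and Jaccard over 3-character fragments stands in for "token-based".

```python
def levenshtein(a: str, b: str) -> int:
    """Edit-based: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def trigram_jaccard(a: str, b: str) -> float:
    """Token-based: shared 3-character fragments divided by all fragments."""
    ta = {a[i:i + 3] for i in range(len(a) - 2)}
    tb = {b[i:i + 3] for i in range(len(b) - 2)}
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


QUERY = "AUSTRIA BEST CORPORATION"
GOOD = "AUSTRIA BEST CORP."
BAD = "AUSTRALIA BEST CORPORATION"

print(levenshtein(QUERY, GOOD), levenshtein(QUERY, BAD))          # 7 2
print(trigram_jaccard(QUERY, GOOD), trigram_jaccard(QUERY, BAD))  # about 0.65 and 0.77
```

On this example both unweighted measures rank the wrong name closer: "CORPORATION" vs "CORP." costs more than "AUSTRIA" vs "AUSTRALIA". That is the reason to weight the fragments.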
3-grams + TF-IDF
3-GRAMS: "AUSTRIA" ==> "AUS" "UST" "STR" "TRI" "RIA"
TF-IDF: Tokens that are less frequent have a larger weight.
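A rough sketch of 3-grams weighted by inverse document frequency, compared with cosine similarity. Everything here is an assumption for illustration: the six-name toy corpus (NAMES), the unsmoothed weight log(N / df), and the helper names. The real system's weighting may differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real database would hold millions of names.
NAMES = [
    "AUSTRIA BEST CORPORATION",
    "AUSTRIA BEST CORP.",
    "AUSTRALIA BEST CORPORATION",
    "GLOBAL BEST CORPORATION",
    "UNITED BEST CORPORATION",
    "NORDIC BEST CORPORATION",
]


def trigrams(s: str) -> list:
    return [s[i:i + 3] for i in range(len(s) - 2)]


def fit_idf(names) -> dict:
    """Weight per 3-gram: log(N / number of names containing it)."""
    df = Counter()
    for name in names:
        df.update(set(trigrams(name)))
    return {gram: math.log(len(names) / count) for gram, count in df.items()}


def tfidf_vector(name: str, idf: dict) -> dict:
    # 3-grams never seen in the corpus get weight 0 in this sketch.
    tf = Counter(trigrams(name))
    return {gram: count * idf.get(gram, 0.0) for gram, count in tf.items()}


def cosine(u: dict, v: dict) -> float:
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


idf = fit_idf(NAMES)
query = tfidf_vector(NAMES[0], idf)
for candidate in NAMES[1:3]:
    print(candidate, round(cosine(query, tfidf_vector(candidate, idf)), 2))
```

In this toy corpus the fragments of "BEST CORPORATION" appear in almost every name and carry little or no weight, while "TRI" and "RIA" are rare and dominate. The ranking flips: "AUSTRIA BEST CORP." now scores higher than "AUSTRALIA BEST CORPORATION".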
200 million companies. Iterating over all of them is SLOW.
LSH (locality-sensitive hashing): Group similar strings together.
Forest: Use several different groupings to reduce bias.
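A simplified sketch of the grouping idea, assuming MinHash signatures over 3-grams with banding into several hash tables. This is plain banded MinHash LSH, not a true LSH Forest (which stores signatures in prefix trees), and the class name, NUM_HASHES and BANDS are hypothetical choices. Each table is one "grouping". A name becomes a candidate if it shares a bucket with the query in at least one table.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32                 # signature length (assumed value)
BANDS = 8                       # number of independent groupings
ROWS = NUM_HASHES // BANDS      # signature entries per grouping


def trigram_set(s: str) -> set:
    # Fall back to the whole string when it is shorter than 3 characters.
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}


def seeded_hash(seed: int, token: str) -> int:
    # Deterministic across runs, unlike the built-in hash() for strings.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(name: str) -> list:
    grams = trigram_set(name)
    return [min(seeded_hash(seed, g) for g in grams) for seed in range(NUM_HASHES)]


class MinHashIndex:
    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(BANDS)]

    def _keys(self, name: str) -> list:
        sig = minhash(name)
        return [tuple(sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]

    def add(self, name: str) -> None:
        for table, key in zip(self.tables, self._keys(name)):
            table[key].add(name)

    def candidates(self, name: str) -> set:
        found = set()
        for table, key in zip(self.tables, self._keys(name)):
            found |= table.get(key, set())
        return found


index = MinHashIndex()
for company in ["AUSTRIA BEST CORP.", "AUSTRALIA BEST CORPORATION", "NORDIC STEEL LTD"]:
    index.add(company)
print(index.candidates("AUSTRIA BEST CORPORATION"))
```

Only the names returned by candidates() need to be scored with the expensive similarity, which is what avoids iterating over the whole database. More bands with fewer rows each catch more true matches at the price of more false candidates.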
Output: 100 "closest" matches
1000 matches in a few seconds
Extra layer in the output
Use output + metadata +
magic artificial neural networks to keep it training.
By Javier GB