by Cheuk Ting Ho
Somewhere in China...
"Starbucks" == "StarBucks"
"Starbucks" == "Star bucks"
"Starbucks" == "Starbuck"
upper() / lower()
"Fuzzy matching is a technique used in computer-assisted translation as a special case of record linkage. It works with matches that may be less than 100% perfect when finding correspondences between segments of a text and entries in a database of previous translations."
matching 2 strings that is similar but not exactly the same
measure the similarity between two strings using
The distance is the number of deletions, insertions, or substitutions required to transform source into target.
The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein.
Levenshtein distance interpretation
Fuzzy string matching like a boss.
simply measuring the Levenshtein Distance
finding best partial match
Token Sort Ratio
tokenizing and match despite order difference
Token Set Ratio
tokenizing and finding best partial match
The data we used is found on http://download.companieshouse.gov.uk/en_output.html
It is an openly licensed publicly available dataset that contains a list of registered (limited liability) companies in Great Britain.
Num of Companies: 15889
Matching form common words like "LTD" and "COMPANY" is less meaningful and will be discounted automatically in the algorithm.
Find the 30 most common words in all company names and deduct the matching score of a pair if any keywords is present in the names.
from collections import Counter all_names = df['CompanyName'][df['RegAddress.PostTown']=='CAMBRIDGE'].unique() names_freq = Counter() for name in all_names: names_freq.update(str(name).split(" ")) key_words = [word for (word,_) in names_freq.most_common(30)] print(key_words)
Then we group the names by their 1st character. As the list is too long, it will take forever to match them all at once (15889 x 15889 pairs to consider).
The work around is to match them by groups, assuming if the names are not matched at the 1st character, it is unlikely that they are the same name.
all_main_name = pd.DataFrame(columns=['sort_gp','names','alias','score']) all_names.sort() all_main_name['names'] = all_names all_main_name['sort_gp'] = all_main_name['names'].apply(lambda x: x)
Here for each group, we use token_sort_ratio to matching the names.
Deduct 10 points if have common words.
If score 85 or more then it's a match!
rom fuzzywuzzy import fuzz all_sort_gp = all_main_name['sort_gp'].unique() def no_key_word(name): """check if the name contain the keywords in travel company""" output = True for key in key_words: if key in name: output = False return output for sortgp in all_sort_gp: this_gp = all_main_name.groupby(['sort_gp']).get_group(sortgp) gp_start = this_gp.index.min() gp_end = this_gp.index.max() for i in range(gp_start,gp_end+1): # if self has not got alias, asign to be alias of itself if pd.isna(all_main_name['alias'].iloc[i]): all_main_name['alias'].iloc[i] = all_main_name['names'].iloc[i] all_main_name['score'].iloc[i] = 100 # if the following has not got alias and fuzzy match, asign to be alias of this one for j in range(i+1,gp_end+1): if pd.isna(all_main_name['alias'].iloc[j]): fuzz_socre = fuzz.token_sort_ratio(all_main_name['names'].iloc[i],all_main_name['names'].iloc[j]) if not no_key_word(all_main_name['names'].iloc[j]): fuzz_socre -= 10 if (fuzz_socre > 85): all_main_name['alias'].iloc[j] = all_main_name['alias'].iloc[i] all_main_name['score'].iloc[j] = fuzz_socre if i % (len(all_names)//10) == 0: print("progress: %.2f" % (100*i/len(all_names)) + "%")
Matches consisted of:
Applying the matching: