Names Similarity - CA

Problem Statement

Develop an algorithm that scores the similarity between two names. It should treat the following variations as matches:

Initials: Jane Doe Hamilton == Jane D. Hamilton
Missing Parts: Jane Doe Hamilton == Jane Hamilton
Spelling errors: Jane Hamlton == Jane Hamilton
Word Order: Hamilton, Jane == Jane Hamilton
Titles: Dr. Jane Doe == Jane Doe

* When disambiguating between names, less frequent names are prioritised.

Solution

A rule-based system that returns a score between 0 and 1 indicating the similarity between the two names, together with a short description of how the score was calculated. For example:

Score: 0.83

Reason: Common Initials

Steps

1. Check for exactly same names

2. Check if the names differ only by a missing title (PhD., Dr., Miss., etc.)

3. If one of the names has only a first name, match on the first names

4. Check if the first and last names are the same

5. Check if one of the names is in the reverse order

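* If any of the conditions in steps 1-5 is true, return a score of 1.0, i.e. the names are treated as the same. For steps 3 and 4 this score is damped by theta (see the damping formulas under "Customisable/Trainable Parameter: Theta").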

6. If the names have common initials then get a score based on matching initials

7. Get a score based on the number of different characters in both names

8. Get a score based on the popularity of the names

9. Return a combined score based on steps 6, 7 and 8
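
The direct-match rules in steps 1-5 can be sketched roughly as below. This is a minimal illustration and not the code from the gist: the helper names (tokens, direct_match), the TITLES set and the normalisation (lower-casing, stripping "." and ",") are all assumptions. Steps 6-9 and the theta damping are left out.

```python
# Hypothetical title list; the source only names "Phd., Dr., Miss., etc."
TITLES = {"dr", "phd", "miss", "mr", "mrs", "ms", "prof"}


def tokens(name):
    """Split a name into lower-case parts, dropping '.' and ',' punctuation."""
    stripped = (t.strip(".,").lower() for t in name.split())
    return [t for t in stripped if t]


def direct_match(name_1, name_2):
    """Return the trigger for steps 1-5, or None if no direct rule applies."""
    a, b = tokens(name_1), tokens(name_2)
    # Step 1: exactly the same names (after normalisation).
    if a == b:
        return "Same Names"
    # Step 2: names differ only by a title.
    a = [t for t in a if t not in TITLES]
    b = [t for t in b if t not in TITLES]
    if not a or not b:
        return None
    if a == b:
        return "Only additional title"
    # Step 3: one name has only a first name.
    if min(len(a), len(b)) == 1 and a[0] == b[0]:
        return "Same first name"
    # Step 4: first and last parts agree (middle parts may be missing).
    if a[0] == b[0] and a[-1] == b[-1]:
        return "Same first and last names"
    # Step 5: one name is the other in reverse order.
    if a == b[::-1]:
        return "Reversed names"
    return None


print(direct_match("Miss. Jane Doe", "Jane Doe"))
print(direct_match("Jane Doe", "Doe, Jane"))
print(direct_match("J. Smith", "James Smith"))
```

When direct_match returns None, as for "J. Smith" vs "James Smith", the pair would fall through to the combined score of steps 6-9.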

Example Outputs

Name 1	Name 2	Trigger	Score
Jane Doe	Jane Doe	Same Names	1.0
Miss. Jane Doe	Jane Doe	Only additional title	1.0
Jane	Jane Doe	Same first name (one name has only one part)	1.0*
Jane Doe Hamilton	Jane Hamilton	Same first and last names	1.0*
Jane Doe	Doe, Jane	Reversed names	1.0
J. Smith	James Smith	Average edit distance, rank and matching initials scores	0.86*
J. Nithercott	James Nithercott	Average edit distance, rank and matching initials scores	0.91*

'*' Uses damping factor (theta) = 1

Customisable/Trainable Parameter: Theta

Rule	Damping Formula
Same First names	den = name_1.n_parts + name_2.n_parts; score = (1 - (1/den)) + ((1/den) * theta)
Same First and Last names	den = name_1.n_parts + name_2.n_parts; score = (1 - (1/den)) + ((1/den) * theta)
Initials Match	score = 0.5 * theta + matching_parts / 2
Rank Score	score = 0.95 * theta
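
To make the effect of theta concrete, the formulas in the table can be written as small functions. This is only a sketch: the function names are hypothetical, and because the source does not define matching_parts it is simply taken as a number supplied by the caller.

```python
def same_name_score(n_parts_1, n_parts_2, theta=1.0):
    """Damped score for the 'Same First names' and
    'Same First and Last names' rules."""
    den = n_parts_1 + n_parts_2
    return (1 - 1 / den) + (1 / den) * theta


def initials_score(matching_parts, theta=1.0):
    """Damped score for the 'Initials Match' rule.
    matching_parts is passed in as-is (its definition is an assumption)."""
    return 0.5 * theta + matching_parts / 2


def rank_score(theta=1.0):
    """Damped score for the 'Rank Score' rule."""
    return 0.95 * theta


# "Jane" (1 part) vs "Jane Doe" (2 parts): den = 3
print(round(same_name_score(1, 2, theta=1.0), 2))  # 1.0
print(round(same_name_score(1, 2, theta=0.5), 2))  # 0.83
```

With theta = 1 the same-name rules return 1.0, matching the starred rows in the example outputs. Lowering theta reduces the score, and the reduction shrinks as the names gain parts.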

Combined Score

(Where there is no direct rule-based matching)

Combined Score = Average(Edit Distance Score, Rank Score, Initials Score)

Edit Distance Score = 2*(Number of edits required)/(combined length of both names)

Rank Score = 0.95*theta
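
A rough sketch of the combined score follows, using a plain Levenshtein distance for "number of edits required". Both that choice and the function names are assumptions; the gist may count edits differently. The edit distance term is implemented literally as written above, so it is 0 for identical names and grows with the number of edits. The source does not say whether this term or its complement enters the average, so the sketch leaves that decision to the caller.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def edit_distance_score(name_1, name_2):
    """2 * edits / combined length, exactly as stated in the text."""
    total = len(name_1) + len(name_2)
    if total == 0:
        return 0.0
    return 2 * levenshtein(name_1, name_2) / total


def combined_score(edit_score, rank, initials):
    """Plain average of the three component scores."""
    return (edit_score + rank + initials) / 3


# One missing letter: 1 edit over 12 + 13 = 25 characters
print(edit_distance_score("Jane Hamlton", "Jane Hamilton"))  # 0.08
```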

Gist: https://gist.github.com/rishy/2dbf3ed8ded93d767ed4fb152acfae4f

External Library Used: https://github.com/philipperemy/name-dataset