Names Similarity - CA

Problem Statement

Develop an algorithm to find the similarity between two names.

  • Initials: Jane Doe Hamilton == Jane D. Hamilton
  • Missing Parts: Jane Doe Hamilton == Jane Hamilton
  • Spelling errors: Jane Hamlton == Jane Hamilton
  • Word Order: Hamilton, Jane == Jane Hamilton
  • Titles: Dr. Jane Doe == Jane Doe

* When disambiguating between names, less frequent names are prioritised.

Solution

A rule-based system that returns a score in the range [0, 1] indicating the similarity between the two names, together with a short description of how the score was calculated. For example:

Score: 0.83

Reason: Common Initials 

Steps

1. Check for exactly same names

2. Check if the names differ only by a missing title (Phd., Dr., Miss., etc.)

3. If one of the names consists of only a first name, compare the first names only

4. Check if the first and last names are the same

5. Check if one of the names is in the reverse order

* If any of the conditions in steps 1-5 is true, return a score of 1.0, i.e. same names. For steps 3 and 4, the score of 1.0 assumes theta = 1 (see Rule Damping Formula).

6. If the names have common initials then get a score based on matching initials

7. Get a score based on the number of different characters in both names

8. Get a score based on the popularity of the names

9. Return a combined score based on steps 6, 7 and 8
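
Steps 1-5 above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the function names `parts_of` and `direct_rule`, the contents of `TITLES`, and the normalisation (lowercasing, stripping "." and ",") are all assumptions made for the example.

```python
from typing import List, Optional

# Hypothetical title list; the page only names "Phd., Dr., Miss., etc."
TITLES = {"dr", "phd", "miss", "mr", "mrs", "ms", "prof"}


def parts_of(name: str) -> List[str]:
    """Lowercase, split on whitespace, strip surrounding '.' and ','."""
    cleaned = [p.strip(".,") for p in name.lower().split()]
    return [p for p in cleaned if p]


def direct_rule(name_1: str, name_2: str) -> Optional[str]:
    """Return the name of the first rule (steps 1-5) that fires, else None."""
    a, b = parts_of(name_1), parts_of(name_2)
    if not a or not b:
        return None
    if a == b:                                    # step 1
        return "Same Names"
    a = [p for p in a if p not in TITLES]
    b = [p for p in b if p not in TITLES]
    if not a or not b:
        return None
    if a == b:                                    # step 2
        return "Only additional title"
    if 1 in (len(a), len(b)) and a[0] == b[0]:    # step 3
        return "Same first name"
    if a[0] == b[0] and a[-1] == b[-1]:           # step 4
        return "Same first and last names"
    if a == b[::-1]:                              # step 5
        return "Reversed names"
    return None                                   # fall through to steps 6-9
```

A `None` result means no direct rule matched, so the pair (for example "J. Smith" and "James Smith") would go on to the combined score of steps 6-9.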

Example Outputs

Name 1            | Name 2           | Trigger                                                  | Score
Jane Doe          | Jane Doe         | Same names                                               | 1.0
Miss. Jane Doe    | Jane Doe         | Only additional title                                    | 1.0
Jane              | Jane Doe         | Same first name (one name has only one part)             | 1.0*
Jane Doe Hamilton | Jane Hamilton    | Same first and last names                                | 1.0*
Jane Doe          | Doe, Jane        | Reversed names                                           | 1.0
J. Smith          | James Smith      | Average of edit distance, rank and matching initials scores | 0.86*
J. Nithercott     | James Nithercott | Average of edit distance, rank and matching initials scores | 0.91*

* Uses damping factor (theta) = 1

Customisable/Trainable Parameter: Theta

Rule                      | Damping Formula
Same First Names          | den = name_1.n_parts + name_2.n_parts
                          | score = (1 - (1/den)) + ((1/den) * theta)
Same First and Last Names | den = name_1.n_parts + name_2.n_parts
                          | score = (1 - (1/den)) + ((1/den) * theta)
Initials Match            | score = 0.5*theta + matching_parts/2
Rank Score                | score = 0.95*theta
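
The damping formulas translate directly into code. The sketch below is only illustrative: the function names are hypothetical, and since the page does not say whether `matching_parts` is a count or a fraction, the value is used exactly as passed in.

```python
def damped_rule_score(n_parts_1: int, n_parts_2: int, theta: float = 1.0) -> float:
    """Score for the 'same first names' and 'same first and last names' rules."""
    den = n_parts_1 + n_parts_2
    return (1 - 1 / den) + (1 / den) * theta


def initials_score(matching_parts: float, theta: float = 1.0) -> float:
    """Initials Match: 0.5*theta + matching_parts/2 (matching_parts taken as given)."""
    return 0.5 * theta + matching_parts / 2


def rank_score(theta: float = 1.0) -> float:
    """Rank Score: 0.95*theta."""
    return 0.95 * theta
```

With theta = 1 the rule score is 1.0 regardless of the number of parts, which is consistent with the starred rows in the example outputs. Lowering theta pulls the score down more for short names: "Jane" (1 part) against "Jane Doe" (2 parts) with theta = 0.5 gives 1 - 1/3 + 0.5/3, roughly 0.83.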

Combined Score

(Where there is no direct rule-based matching)

Average(Edit Distance, Rank Score, Initials Score)

Edit Distance Score = 2*(Number of edits required)/(combined length of both names)

Rank Score = 0.95*theta
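
A possible shape for the combined score is sketched below, under several assumptions that the page does not confirm. "Number of edits" is taken to be the Levenshtein distance. The edit distance term as written grows as the names diverge, so the sketch assumes it is converted to a similarity as 1 minus the term (clamped at 0) before averaging. The initials score is passed in by the caller. All function names are hypothetical.

```python
from statistics import mean


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def edit_distance_term(name_1: str, name_2: str) -> float:
    """2 * edits / combined length, exactly as the formula is written."""
    total = len(name_1) + len(name_2)
    return 2 * levenshtein(name_1, name_2) / total if total else 0.0


def combined_score(name_1: str, name_2: str,
                   initials_score: float, theta: float = 1.0) -> float:
    # ASSUMPTION: turn the distance term into a similarity so that identical
    # names contribute 1.0; clamp because the term can exceed 1.
    edit_similarity = max(0.0, 1 - edit_distance_term(name_1, name_2))
    rank = 0.95 * theta  # Rank Score as given on this page
    return mean([edit_similarity, rank, initials_score])
```

For instance, "Jane Hamlton" and "Jane Hamilton" differ by one edit over a combined length of 25, so the edit distance term is 2/25 = 0.08.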

By Rishabh Shukla