Names Similarity - CA
Problem Statement
Develop an algorithm to find the similarity between two names.
- Initials: Jane Doe Hamilton == Jane D. Hamilton
- Missing Parts: Jane Doe Hamilton == Jane Hamilton
- Spelling errors: Jane Hamlton == Jane Hamilton
- Word Order: Hamilton, Jane == Jane Hamilton
- Titles: Dr. Jane Doe == Jane Doe
* When disambiguating between names, less frequent (rarer) names are prioritised.
Solution
A rule-based system that returns a score in the range [0, 1] indicating how similar the two names are, along with a short description of how the score was calculated. For example:
Score: 0.83
Reason: Common Initials
Steps
1. Check whether the names are exactly the same
2. Check whether the names differ only by a title (PhD, Dr., Miss, etc.)
3. If one of the names consists of a first name only, compare the first names
4. Check whether the first and last names are the same
5. Check whether one of the names is the other in reverse order
* If any of the conditions in steps 1-5 is true, return a score of 1.0, i.e. the names are treated as the same
6. If the names have common initials then get a score based on matching initials
7. Get a score based on the number of different characters in both names
8. Get a score based on the popularity of the names
9. Return a combined score based on steps 6, 7 and 8
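The direct-match rules in steps 1-5 can be sketched as a simple cascade that returns a score and a reason, or nothing when no rule fires. This is a minimal illustration, not the code from the gist: the function names, the `TITLES` set and the reason strings are assumptions, theta is taken as 1 so every rule returns 1.0, and reversal is handled only as a plain reversal of the name parts.

```python
from typing import List, Optional, Tuple

# Hypothetical title list; the source only mentions "Phd., Dr., Miss., etc."
TITLES = {"dr", "phd", "miss", "mr", "mrs", "ms", "prof"}


def split_parts(name: str) -> List[str]:
    """Lowercase the name and split it into parts, dropping '.' and ','."""
    parts = (p.strip(".,").lower() for p in name.split())
    return [p for p in parts if p]


def match_by_rules(name_1: str, name_2: str) -> Optional[Tuple[float, str]]:
    """Apply steps 1-5 in order; return (score, reason) or None."""
    a, b = split_parts(name_1), split_parts(name_2)
    # Step 1: exactly the same names
    if a and a == b:
        return 1.0, "Same Names"
    # Step 2: names differ only by a title
    a = [p for p in a if p not in TITLES]
    b = [p for p in b if p not in TITLES]
    if not a or not b:
        return None
    if a == b:
        return 1.0, "Only additional title"
    # Step 3: one name has only a first name
    if min(len(a), len(b)) == 1 and a[0] == b[0]:
        return 1.0, "Same first name"
    # Step 4: same first and last names (middle parts may differ)
    if a[0] == b[0] and a[-1] == b[-1]:
        return 1.0, "Same first and last names"
    # Step 5: one name is the other in reverse order
    if a == b[::-1]:
        return 1.0, "Reversed names"
    # No direct rule matched; fall through to steps 6-9
    return None


print(match_by_rules("Miss. Jane Doe", "Jane Doe"))
print(match_by_rules("J. Smith", "James Smith"))
```

A `None` result means the pair has to be scored by the combined score of steps 6-9.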
Example Outputs
Name 1 | Name 2 | Trigger | Score |
---|---|---|---|
Jane Doe | Jane Doe | Same Names | 1.0 |
Miss. Jane Doe | Jane Doe | Only additional title | 1.0 |
Jane | Jane Doe | Same first name (one name has only one part) | 1.0* |
Jane Doe Hamilton | Jane Hamilton | Same first and last names | 1.0* |
Jane Doe | Doe, Jane | Reversed names | 1.0 |
J. Smith | James Smith | Average of edit distance, rank and matching initials scores | 0.86* |
J. Nithercott | James Nithercott | Average of edit distance, rank and matching initials scores | 0.91* |
* Uses damping factor (theta) = 1
Customisable/Trainable Parameter: Theta
Rule | Damping Formula |
---|---|
Same First names | den = name_1.n_parts + name_2.n_parts; score = (1 - (1/den)) + ((1/den)*theta) |
Same First and Last names | den = name_1.n_parts + name_2.n_parts; score = (1 - (1/den)) + ((1/den)*theta) |
Initials Match | score = 0.5*theta + (matching_parts)/2 |
Rank Score | score = 0.95*theta |
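To make the effect of theta concrete, the damping formulas in the table can be written as small functions. This is a sketch under assumptions: the function and argument names are illustrative, and `matching_parts` is passed through as given because the table does not define how it is counted.

```python
def damped_rule_score(n_parts_1: int, n_parts_2: int, theta: float = 1.0) -> float:
    """Damping for the 'Same First names' and 'Same First and Last names' rules."""
    den = n_parts_1 + n_parts_2
    return (1 - 1 / den) + (1 / den) * theta


def initials_score(matching_parts: float, theta: float = 1.0) -> float:
    """Damping for the 'Initials Match' rule, as written in the table."""
    return 0.5 * theta + matching_parts / 2


def rank_score(theta: float = 1.0) -> float:
    """Damping for the 'Rank Score' rule."""
    return 0.95 * theta


# "Jane" (1 part) vs "Jane Doe" (2 parts): den = 3
print(round(damped_rule_score(1, 2, theta=1.0), 2))  # 1.0
print(round(damped_rule_score(1, 2, theta=0.5), 2))  # 0.83
```

With theta = 1 the rule-based matches keep a score of 1.0; lowering theta pulls the score down, and the reduction is smaller for names with more parts because `den` grows.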
Combined Score
(Where there is no direct rule-based matching)
Combined Score = Average(Edit Distance Score, Rank Score, Initials Score)
Edit Distance Score = 2*(Number of edits required)/(combined length of both names)
Rank Score = 0.95*theta
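The combination step can be sketched as follows. The edit count is assumed here to be the Levenshtein distance (the text only says "number of edits required"), the Edit Distance Score follows the formula exactly as written above, and all function names are illustrative rather than taken from the gist.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_distance_score(name_1: str, name_2: str) -> float:
    """2 * (number of edits) / (combined length of both names)."""
    total = len(name_1) + len(name_2)
    if total == 0:
        return 0.0
    return 2 * levenshtein(name_1, name_2) / total


def combined_score(edit_score: float, rank: float, initials: float) -> float:
    """Plain average of the three component scores."""
    return (edit_score + rank + initials) / 3


print(levenshtein("Jane Hamlton", "Jane Hamilton"))  # 1
```

The rank and initials components passed to `combined_score` are expected to come from the damping formulas, with theta applied before averaging.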
Gist: https://gist.github.com/rishy/2dbf3ed8ded93d767ed4fb152acfae4f
External Library Used: https://github.com/philipperemy/name-dataset
By Rishabh Shukla