Cope With Typos: Fuzzy Matching
Lyubomir Filipov * BurgasConf 2016 * @FilipovG
Who am I
Lyubomir Filipov
PHP Dev
Enthusiast
Fuzzy matching
- Computer-assisted translation
- record linkage
- matches that are less than 100% perfect
- mostly English
- started in the 1990s
Fuzzy matching
- Tomas Jones
- Tom Jones
- Thomas Jones
Soundex
- the first letter is kept; after it, a, e, i, o, u, y, h, w are dropped
- b, f, p, v → 1
- c, g, j, k, q, s, x, z → 2
- d, t → 3
- l → 4
- m, n → 5
- r → 6
Robert → R163
Rupert → R163
Rubin → R150
Not what we expected
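The coding table above can be sketched in a few lines of Python. This is a minimal version of the common American Soundex variant, and it assumes details the slide does not spell out: adjacent letters with the same digit collapse into one, h and w do not separate such letters, and the code is padded with zeros to one letter plus three digits. The name `soundex` is only illustrative.

```python
# Letter groups from the slide, mapped to their Soundex digits.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return a 4-character Soundex code (assumed American variant)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()          # the first letter is kept as-is
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                     # h and w do not break a run of equal codes
        code = CODES.get(ch, "")         # vowels and y have no code
        if code and code != previous:
            result += code
        previous = code                  # a vowel resets the run
    return (result + "000")[:4]          # pad or cut to letter + 3 digits


for name in ("Robert", "Rupert", "Rubin"):
    print(name, soundex(name))
```

For the names on the earlier slide, this sketch gives T520 for both Tomas and Thomas, but T500 for Tom.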
Approximate string matching
- insertion: cot → coat
- deletion: coat → cot
- substitution: coat → cost
What is common
Levenshtein distance
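- difference between two sequences: the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other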
- kitten → sitten (substitution of "s" for "k")
- sitten → sittin (substitution of "i" for "e")
- sittin → sitting (insertion of "g" at the end)
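The kitten → sitting walk-through can be checked with a short dynamic-programming sketch. This is one common way to compute the distance (two rows of the edit table kept in memory), not necessarily what any particular library does; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]                              # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + cost,            # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("cot", "coat"))
```

The first call counts the three edits listed above, and the second counts the single insertion from the earlier slide.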
Damerau–Levenshtein distance
- adds a fourth operation: transposition of two adjacent characters
- cofn → conn (substitution of "n" for "f")
- conn → conf (substitution of "f" for "n")
- cofn → conf (transposition of "f" and "n")
- Levenshtein: D = 2
- Damerau–Levenshtein: D = 1
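The cofn → conf example can be reproduced with the sketch below. Note the assumption: this is the restricted variant, often called optimal string alignment, which counts a swap of two adjacent characters as one edit but never edits the same substring twice. The unrestricted Damerau–Levenshtein distance needs a slightly more involved algorithm. The name `osa_distance` is illustrative.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                                # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                                # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                   # deletion
                d[i][j - 1] + 1,                   # insertion
                d[i - 1][j - 1] + cost,            # substitution or match
            )
            # The last two characters are swapped: count one transposition.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("cofn", "conf"))
```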
Usecase - MySQL
- list of dummy data (generated company names)
Usecase - MySQL
Array
(
[0] => _Agamba
[1] => gamba
[2] => _gamba
[3] => A_gamba
[4] => Aamba
[5] => A_amba
[6] => Ag_amba
[7] => Agmba
[8] => Ag_mba
[9] => Aga_mba
[10] => Agaba
[11] => Aga_ba
[12] => Agam_ba
[13] => Agama
[14] => Agam_a
[15] => Agamb_a
[16] => Agamb
[17] => Agamb_
[18] => Agamba_
)
We made a typo in the company name 'Agimba' and searched for 'Agamba' instead.
D = 1
About 3 * n patterns, where n is the length of the search term
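The array above appears to follow a simple rule: at every position of the search term, add a wildcard, drop the character, or replace the character with a wildcard, where `_` stands for any single character. A minimal Python sketch of that rule follows. The function name `like_patterns` is hypothetical, and the talk's own generator (presumably PHP) is not shown on the slides.

```python
def like_patterns(term: str) -> list[str]:
    """Patterns matching every string within one edit of term ('_' = any char)."""
    patterns = []
    for i in range(len(term)):
        patterns.append(term[:i] + "_" + term[i:])       # insertion before position i
        patterns.append(term[:i] + term[i + 1:])         # deletion of position i
        patterns.append(term[:i] + "_" + term[i + 1:])   # substitution at position i
    patterns.append(term + "_")                          # insertion at the very end
    return patterns


variants = like_patterns("Agamba")
print(len(variants))
print(variants)
```

For a term of length n this yields 3 * n + 1 patterns, which is 19 for 'Agamba' and matches the listing above entry for entry.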
Usecase - MySQL
SELECT
*
FROM
`companyList`
WHERE
`companyNames` LIKE '%_Agamba%'
OR `companyNames` LIKE '%gamba%'
OR `companyNames` LIKE '%_gamba%'
OR `companyNames` LIKE '%A_gamba%'
OR `companyNames` LIKE '%Aamba%'
OR `companyNames` LIKE '%A_amba%'
OR `companyNames` LIKE '%Ag_amba%'
OR `companyNames` LIKE '%Agmba%'
OR `companyNames` LIKE '%Ag_mba%'
OR `companyNames` LIKE '%Aga_mba%'
OR `companyNames` LIKE '%Agaba%'
OR `companyNames` LIKE '%Aga_ba%'
OR `companyNames` LIKE '%Agam_ba%'
OR `companyNames` LIKE '%Agama%'
OR `companyNames` LIKE '%Agam_a%'
OR `companyNames` LIKE '%Agamb_a%'
OR `companyNames` LIKE '%Agamb%'
OR `companyNames` LIKE '%Agamb_%'
OR `companyNames` LIKE '%Agamba_%';
15 results in 30 ms.
D = 1
About 3 * n LIKE patterns in one query
Not what we expected
Usecase - Algolia
- 15 results in 1 ms.
- easy update of data
- auto index generated
Usecase - Algolia
- 15 results in 1 ms when searching for Agamba
Usecase - Elastic
- No easy way to import data
- Supports fuzzy matching
- recommended to keep the edit distance at 2 or less
- Embulk can be used for the bulk import
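As a rough illustration of the fuzzy matching support, the sketch below builds the body of an Elasticsearch `match` query with the `fuzziness` parameter. It only constructs and prints the JSON, so no cluster is needed. The field name `companyNames` is borrowed from the MySQL example as an assumption, and the helper `fuzzy_match_query` is hypothetical.

```python
import json


def fuzzy_match_query(field: str, text: str, fuzziness=1) -> dict:
    """Build a match query body that tolerates up to `fuzziness` edits."""
    # This sketch accepts only 0, 1, 2 or "AUTO", in line with the
    # edit-distance limit of 2 mentioned on the slide.
    if fuzziness not in (0, 1, 2, "AUTO"):
        raise ValueError("fuzziness must be 0, 1, 2 or 'AUTO'")
    return {
        "query": {
            "match": {
                field: {"query": text, "fuzziness": fuzziness}
            }
        }
    }


body = fuzzy_match_query("companyNames", "Agamba")
print(json.dumps(body, indent=2))
```

The resulting document would be sent to the index's `_search` endpoint. How the data gets into the index is a separate step.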
Usecase - Google
http://suggestqueries.google.com/complete/search?output=toolbar&hl=en&q=ninaj
Usecase - Google
[
  "ninaj",
  [
    "ninja",
    "ninja turtles",
    "ninja blender",
    ...
  ],
  {
    "google:clientdata": {
      "bpc": false,
      "tlw": false
    },
    "google:suggestrelevance": [
      1250,
      601,
      600,
    ],
    ...
  }
]
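A response of this shape can be turned into ranked suggestions with a few lines of Python. The sketch below sends no request: it rebuilds the URL from the previous slide and parses a trimmed copy of the response shown above, keeping only the values visible on the slide. The function names are hypothetical, and the slides do not say which request parameters produce JSON output.

```python
import json
from urllib.parse import urlencode


def suggest_url(term: str) -> str:
    """Rebuild the suggestion URL from the slide for a given search term."""
    params = {"output": "toolbar", "hl": "en", "q": term}
    return "http://suggestqueries.google.com/complete/search?" + urlencode(params)


def ranked_suggestions(payload: str):
    """Return the query and its (suggestion, relevance) pairs."""
    query, suggestions, extras = json.loads(payload)[:3]
    scores = extras.get("google:suggestrelevance", [])
    return query, list(zip(suggestions, scores))


# Trimmed copy of the response above (elided parts left out).
sample = """
["ninaj",
 ["ninja", "ninja turtles", "ninja blender"],
 {"google:suggestrelevance": [1250, 601, 600]}]
"""

print(suggest_url("ninaj"))
query, ranked = ranked_suggestions(sample)
for suggestion, score in ranked:
    print(query, "->", suggestion, score)
```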
Why do we need it
- Autocomplete
- Search with typos
- Bots
Thanks for watching