Fuzzy matching of strings
National Seminar on Coding Theory
"Professor Stefan Dodunekov", 10-13 November 2016
Lyubomir Filipov, Zlatko Varbanov,
Department of Information Technologies, University of Veliko Tarnovo
Why fuzzy?
Scrape some data!
Fuzzy matching
- computer-assisted translation
- record linkage
- matches that are less than 100% perfect
- mostly English
- started in the 1990s
Fuzzy matching
- Tomas Jones
- Tom Jones
- Thomas Jones
Soundex
- a, e, i, o, u, y, h, w → dropped
- b, f, p, v → 1
- c, g, j, k, q, s, x, z → 2
- d, t → 3
- l → 4
- m, n → 5
- r → 6
Robert → R163
Soundex
1. Robert → R
2. Robert → R1
3. Robert → R16
4. Robert → R163
Soundex
Robert → R163
Rupert → R163
Rubin → R150
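The coding table above can be sketched in a few lines of Python. This is a minimal sketch of the common American Soundex variant, not necessarily the exact variant used in the talk; the function name `soundex` and the rule that h and w do not separate equal codes are assumptions taken from that variant.

```python
# Letter groups from the slide; every other letter (vowels, y, h, w) gets no code.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name (empty string if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()              # step 1: keep the first letter
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:        # skip uncoded letters and repeated codes
            code += digit
        if ch not in "hw":                 # assumption: h and w do not break a run
            prev = digit
    return (code + "000")[:4]              # pad with zeros, cut to 4 characters


for name in ("Robert", "Rupert", "Rubin"):
    print(name, "→", soundex(name))
```

Robert and Rupert share a code, so they would be treated as a match, while Rubin would not.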
Bitap algorithm - agrep
- Text: PHPROCKS
- Pattern: ROCKS
- O(n·m), n = text length, m = pattern length
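A minimal sketch of the exact-match core of Bitap (the shift-and formulation), using the text and pattern from the slide. The function name `bitap_search` is hypothetical, and this sketch does not include the error-tolerant extension that agrep adds on top of it.

```python
def bitap_search(text, pattern):
    """Return the index of the first exact occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # One bitmask per character: bit i is set if pattern[i] == character.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    state = 0  # bit i set: pattern[:i + 1] matches the text ending at this position
    for pos, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & (1 << (m - 1)):         # highest bit set: full pattern matched
            return pos - m + 1
    return -1


print(bitap_search("PHPROCKS", "ROCKS"))
```

Python integers are unbounded, so the bitmask here is not limited to a machine word as it would be in C.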
Boyer–Moore Algorithm - grep
- Text: PHPROCKS
- Pattern: ROCKS
- Shifting direction: the pattern moves left to right along the text, characters are compared right to left
- O(n+m)
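A simplified sketch of Boyer–Moore that uses only the bad-character rule. The full algorithm also uses a good-suffix rule, which is omitted here, so this version does not guarantee the bound quoted on the slide; the function name `boyer_moore_bad_char` is hypothetical.

```python
def boyer_moore_bad_char(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Rightmost position of every character in the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    shift = 0
    while shift <= n - m:
        j = m - 1
        # Compare right to left.
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1
        if j < 0:
            return shift
        # Align the mismatched text character with its rightmost occurrence
        # in the pattern, always moving forward by at least one position.
        shift += max(1, j - last.get(text[shift + j], -1))
    return -1


print(boyer_moore_bad_char("PHPROCKS", "ROCKS"))
```

On this input the first mismatch (S against O) already lets the pattern jump three positions, so the match is found on the second alignment.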
Approximate string matching
- insertion: cot → coat
- deletion: coat → cot
- substitution: coat → cost
Levenshtein distance
- difference between two sequences
- kitten → sitten (substitution of "s" for "k")
- sitten → sittin (substitution of "i" for "e")
- sittin → sitting (insertion of "g" at the end)
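The three edits above give a distance of 3 between "kitten" and "sitting". A minimal sketch of the standard dynamic-programming computation, keeping only two rows of the table; the function name `levenshtein` is chosen for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))         # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```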
Damerau–Levenshtein distance
- transposition of two adjacent characters
- cofn → conn (substitution of "n" for "f")
- conn → conf (substitution of "f" for "n")
- Levenshtein: D = 2
- cofn → conf (transposition of "f" and "n")
- Damerau–Levenshtein: D = 1
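A minimal sketch of the restricted variant of Damerau–Levenshtein, often called optimal string alignment, in which no substring is edited more than once. It agrees with the example above but can differ from the unrestricted distance on some inputs; the function name `osa_distance` is hypothetical.

```python
def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("cofn", "conf"))
```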
Usecase - MySQL
- list of dummy data (generated company names)
Usecase - MySQL
Array
(
[0] => _Agamba
[1] => gamba
[2] => _gamba
[3] => A_gamba
[4] => Aamba
[5] => A_amba
[6] => Ag_amba
[7] => Agmba
[8] => Ag_mba
[9] => Aga_mba
[10] => Agaba
[11] => Aga_ba
[12] => Agam_ba
[13] => Agama
[14] => Agam_a
[15] => Agamb_a
[16] => Agamb
[17] => Agamb_
[18] => Agamba_
)
We made a typo: we searched for 'Agamba' instead of the company name 'Agimba'.
D = 1
About 3 * n LIKE patterns for a name of n letters (19 for 'Agamba')
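The pattern list above can be generated mechanically. A minimal sketch, assuming `_` is used as the single-character wildcard of SQL LIKE; the function name `distance_one_patterns` is hypothetical.

```python
def distance_one_patterns(word, wildcard="_"):
    """Patterns covering every insertion, deletion and substitution in word."""
    patterns = []
    for i in range(len(word)):
        patterns.append(word[:i] + wildcard + word[i:])      # insertion before i
        patterns.append(word[:i] + word[i + 1:])             # deletion of i
        patterns.append(word[:i] + wildcard + word[i + 1:])  # substitution at i
    patterns.append(word + wildcard)                         # insertion at the end
    return patterns


patterns = distance_one_patterns("Agamba")
print(len(patterns))       # 3 * 6 + 1 patterns
print(patterns[:3])
```

The pattern `Ag_mba` is the one that matches the intended name 'Agimba'.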
Usecase - MySQL
SELECT
*
FROM
`companyList`
WHERE
`companyNames` LIKE '%_Agamba%'
OR `companyNames` LIKE '%gamba%'
OR `companyNames` LIKE '%_gamba%'
OR `companyNames` LIKE '%A_gamba%'
OR `companyNames` LIKE '%Aamba%'
OR `companyNames` LIKE '%A_amba%'
OR `companyNames` LIKE '%Ag_amba%'
OR `companyNames` LIKE '%Agmba%'
OR `companyNames` LIKE '%Ag_mba%'
OR `companyNames` LIKE '%Aga_mba%'
OR `companyNames` LIKE '%Agaba%'
OR `companyNames` LIKE '%Aga_ba%'
OR `companyNames` LIKE '%Agam_ba%'
OR `companyNames` LIKE '%Agama%'
OR `companyNames` LIKE '%Agam_a%'
OR `companyNames` LIKE '%Agamb_a%'
OR `companyNames` LIKE '%Agamb%'
OR `companyNames` LIKE '%Agamb_%'
OR `companyNames` LIKE '%Agamba_%';
15 results in 30 ms.
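A query of this shape can be assembled from the pattern list instead of being written by hand. A minimal sketch that returns a parameterized statement; the `%s` placeholder style is an assumption that fits common Python MySQL drivers, and the function name `build_like_query` is hypothetical. The table and column names are the ones from the query above.

```python
def build_like_query(patterns, table="companyList", column="companyNames"):
    """Return (sql, params) for an OR-chain of LIKE conditions."""
    conditions = " OR ".join(f"`{column}` LIKE %s" for _ in patterns)
    sql = f"SELECT * FROM `{table}` WHERE {conditions};"
    # Surround every pattern with % so it can match anywhere in the name.
    params = [f"%{p}%" for p in patterns]
    return sql, params


sql, params = build_like_query(["_Agamba", "gamba", "_gamba"])
print(sql)
print(params)
```

Passing the patterns as parameters rather than pasting them into the SQL string keeps user input out of the statement text.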
Usecase - Algolia
- 15 results in 1 ms, searching for Agamba
- easy update of data
- index generated automatically
Usecase - Elastic
- No easy way to import data
- Supports fuzzy matching
- Recommended to keep the edit distance at 2 or less
- Embulk (bulk data loader) for the import
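As an illustration of the fuzzy matching support, here is a minimal sketch of a request body for Elasticsearch's `fuzzy` query, built as a Python dictionary. The field name `companyNames` is borrowed from the MySQL example and is an assumption, as is the helper name `fuzzy_query_body`; sending the body to a cluster is left out.

```python
import json


def fuzzy_query_body(field, value, fuzziness=1):
    """Build the JSON body of a fuzzy query on one field."""
    return {
        "query": {
            "fuzzy": {
                field: {
                    "value": value,
                    "fuzziness": fuzziness,  # maximum edit distance allowed
                }
            }
        }
    }


body = fuzzy_query_body("companyNames", "Agamba", fuzziness=1)
print(json.dumps(body, indent=2))
```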
Usecase - Google
http://suggestqueries.google.com/complete/search?output=toolbar&hl=en&q=ninaj
Usecase - Google
[
"ninaj",
[
"ninja",
"ninja turtles",
"ninja blender",
...
],
{
"google:suggestrelevance":[
1250,
601,
600,
],
...
}
]
Usecase - Google
http://suggestqueries.google.com/complete/search?output=toolbar&hl=bg&q=ниндж
Parameter | Description |
---|---|
output | Type of response you want. |
hl | Language's 2-letter abbreviation |
q | Your search term. |
Usecase - Google
http://suggestqueries.google.com/complete/search?output=firefox&hl=bg&q=ниндж
["нинд",
["нинджаго","нинджа","нинджи",
"нинджаго господари на спинджицу",
"нинджаго песни",
"нинджаго майсторите на спинджицу",
"нинджаго сезон 7","нинджа убиец",
"нинджаго играчки","нинджа игри"]
]
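A minimal sketch of how the request URL can be built and how a response of the shape shown above (query first, list of suggestions second) can be read. The helper names `build_suggest_url` and `parse_suggestions` are hypothetical, no network call is made, and the endpoint is unofficial, so its behaviour may change.

```python
import json
from urllib.parse import urlencode

SUGGEST_ENDPOINT = "http://suggestqueries.google.com/complete/search"


def build_suggest_url(query, lang="en", output="firefox"):
    """Build the suggestion URL from the three parameters described above."""
    return SUGGEST_ENDPOINT + "?" + urlencode({"output": output, "hl": lang, "q": query})


def parse_suggestions(payload):
    """Return the list of suggestions from a JSON response body."""
    data = json.loads(payload)
    return data[1]


print(build_suggest_url("ninaj"))
print(parse_suggestions('["ninaj", ["ninja", "ninja turtles", "ninja blender"]]'))
```

`urlencode` percent-encodes non-ASCII queries such as the Bulgarian one as UTF-8.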
Typos
- Skip letter
- Double letters
- Reverse letters
- Skip spaces
- Missed key
- Inserted key
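The first four typo types can be generated directly from a word. A minimal sketch; the function name `typo_variants` is hypothetical, and missed or inserted keys are left out because they would need a keyboard-layout table that the slides do not give.

```python
def typo_variants(text):
    """Return typo variants of text, grouped by the typo type that produces them."""
    n = len(text)
    return {
        # Skip letter: drop one non-space character.
        "skip_letter": {text[:i] + text[i + 1:] for i in range(n) if text[i] != " "},
        # Double letters: type one non-space character twice.
        "double_letter": {text[:i] + text[i] + text[i:] for i in range(n) if text[i] != " "},
        # Reverse letters: swap two adjacent characters.
        "reverse_letters": {text[:i] + text[i + 1] + text[i] + text[i + 2:]
                            for i in range(n - 1)},
        # Skip spaces: drop one space.
        "skip_space": {text[:i] + text[i + 1:] for i in range(n) if text[i] == " "},
    }


variants = typo_variants("ninja")
print(sorted(variants["reverse_letters"]))
```

The query "ninaj" from the Google example is one of the reversed-letter variants of "ninja".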
Why do we need it
- Autocomplete
- Search with typos
- Bots
Thank you
for your attention