# Fuzzy matching of strings

Second International Conference

Mathematics Days in Sofia, July 10-14, 2017

Lyubomir Filipov, Zlatko Varbanov

Department of Information Technologies, University of Veliko Tarnovo

## Fuzzy matching

• Computer-assisted translation

• matches that are less than 100% perfect

• mostly English

• started in the 1990s

• Tomas Jones

• Tom Jones

• Thomas Jones

## Soundex

• a, e, i, o, u, y, h, w → dropped (no code)

• b, f, p, v → 1
• c, g, j, k, q, s, x, z → 2
• d, t → 3
• l → 4
• m, n → 5
• r → 6

Robert → R163

## Soundex

1. Robert → R (the first letter is kept)

2. Robert → R1 (o is dropped, b → 1)

3. Robert → R16 (e is dropped, r → 6)

4. Robert → R163 (t → 3)

## Soundex


Robert → R163

Rupert → R163

Rubin → R150
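The coding table and the Robert, Rupert and Rubin examples can be reproduced with a short sketch. It assumes the common American Soundex conventions that the slides do not spell out: the result is padded with zeros to a letter plus three digits, adjacent letters with the same code collapse into one, and h and w do not separate equal codes. The function name `soundex` is chosen here for illustration.

```python
# Letter groups and digits taken from the coding table on the slide.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex code (assumed American Soundex rules)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()          # step 1: keep the first letter
    prev = CODES.get(letters[0], "")     # code of the previously seen letter
    for ch in letters[1:]:
        code = CODES.get(ch, "")         # vowels, y, h, w have no code
        if code and code != prev:
            result += code
        if ch not in "hw":               # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]          # pad with zeros, cut to 4 characters


for name in ("Robert", "Rupert", "Rubin"):
    print(name, "->", soundex(name))
```

Robert and Rupert receive the same code, which is why Soundex treats them as a phonetic match while Rubin stays distinct.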

Text: P H P R O C K S

Pattern: R O C K S

Naive search slides the pattern along the text one position at a time and compares character by character.

O(n·m)
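The O(n·m) approach, which tries the pattern at every position of the text, might look like the following sketch. The function name `naive_search` is illustrative and not taken from the slides.

```python
def naive_search(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):       # up to n - m + 1 alignments
        matched = 0
        # compare up to m characters at this alignment
        while matched < m and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == m:
            return start
    return -1


print(naive_search("PHPROCKS", "ROCKS"))
```

Each of the roughly n alignments can cost up to m comparisons, which is where the O(n·m) bound comes from.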

## Boyer–Moore Algorithm - grep

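Text: P H P R O C K S

Pattern: R O C K S

Shifting direction: the pattern moves left to right along the text, while characters are compared starting from the right end of the pattern.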

## Boyer–Moore Algorithm - grep

[Figure: the pattern R O C K S aligned against the text P H P R O C K S, with steps numbered 1 to 4]

O(n+m)
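As a rough illustration of the skipping idea, here is a sketch of the Boyer–Moore–Horspool simplification, which uses only a bad-character shift table. It is not the full Boyer–Moore algorithm, it does not carry the same worst-case guarantee, and the name `horspool_search` is chosen here for illustration.

```python
def horspool_search(text, pattern):
    """Return the first index of pattern in text, or -1 (Horspool variant)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # For every pattern character except the last one: distance from its
    # rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        j = m - 1
        # compare right to left
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # shift by the table entry of the text character under the pattern's
        # last position; characters not in the table allow a full shift of m
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_search("PHPROCKS", "ROCKS"))
```

For the slide's example the first comparison hits "O" under the final "S", so the pattern jumps three positions at once and matches on the second alignment.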

## Approximate string matching

• insertion: cot → coat

• deletion: coat → cot

• substitution: coat → cost

## Levenshtein distance

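- difference between two sequences: the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other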

1. kitten → sitten (substitution of "s" for "k")

2. sitten → sittin (substitution of "i" for "e")

3. sittin → sitting (insertion of "g" at the end).
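The kitten → sitting example can be checked with the standard dynamic-programming formulation of the distance. This is a minimal sketch with unit cost for every edit, and the function name `levenshtein` is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory stays proportional to the length of the second string.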

## Damerau–Levenshtein distance

Adds one more operation: transposition of two adjacent characters.

Levenshtein, D = 2:

1. cofn → conn (substitution of "n" for "f")

2. conn → conf (substitution of "f" for "n")

Damerau–Levenshtein, D = 1:

1. cofn → conf (transposition of "f" and "n")
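The cofn → conf comparison can be reproduced with a sketch of the restricted variant of the distance, often called optimal string alignment. That choice of variant is an assumption, since the slides do not say which one is meant, and the function name `osa_distance` is illustrative.

```python
def osa_distance(a, b):
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # transposition of two adjacent characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("cofn", "conf"))
```

The only difference from plain Levenshtein is the extra transposition check, which is what brings the distance for this pair down from 2 to 1.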

## Use case - MySQL

- list of dummy data (generated company names)

## Use case - MySQL

```
Array
(
    [0] => _Agamba
    [1] => gamba
    [2] => _gamba
    [3] => A_gamba
    [4] => Aamba
    [5] => A_amba
    [6] => Ag_amba
    [7] => Agmba
    [8] => Ag_mba
    [9] => Aga_mba
    [10] => Agaba
    [11] => Aga_ba
    [12] => Agam_ba
    [13] => Agama
    [14] => Agam_a
    [15] => Agamb_a
    [16] => Agamb
    [17] => Agamb_
    [18] => Agamba_
)
```

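We have made a typo in the company name 'Agimba' and search for 'Agamba' instead.

D = 1

About 3 · n patterns for a search term of length n (19 for 'Agamba'); `_` is the single-character wildcard of `LIKE`.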
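The pattern list for D = 1 can be generated mechanically. The sketch below assumes the scheme visible in the array: one leading insertion, then for every position a deletion, a substitution and an insertion, with `_` standing in for the unknown character. The function name `like_patterns` is hypothetical.

```python
def like_patterns(term):
    """Return LIKE patterns covering one insertion, deletion or substitution."""
    patterns = ["_" + term]                                  # insertion in front
    for i in range(len(term)):
        patterns.append(term[:i] + term[i + 1:])             # deletion
        patterns.append(term[:i] + "_" + term[i + 1:])       # substitution
        patterns.append(term[:i + 1] + "_" + term[i + 1:])   # insertion after i
    return patterns


patterns = like_patterns("Agamba")
print(len(patterns))
print(patterns[:4])
```

Each pattern would then be wrapped in `%...%` and passed to the database as a bound parameter rather than concatenated into the SQL string.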

## Use case - MySQL

```sql
SELECT
    *
FROM
    `companyList`
WHERE
    `companyNames` LIKE '%_Agamba%'
    OR `companyNames` LIKE '%gamba%'
    OR `companyNames` LIKE '%_gamba%'
    OR `companyNames` LIKE '%A_gamba%'
    OR `companyNames` LIKE '%Aamba%'
    OR `companyNames` LIKE '%A_amba%'
    OR `companyNames` LIKE '%Ag_amba%'
    OR `companyNames` LIKE '%Agmba%'
    OR `companyNames` LIKE '%Ag_mba%'
    OR `companyNames` LIKE '%Aga_mba%'
    OR `companyNames` LIKE '%Agaba%'
    OR `companyNames` LIKE '%Aga_ba%'
    OR `companyNames` LIKE '%Agam_ba%'
    OR `companyNames` LIKE '%Agama%'
    OR `companyNames` LIKE '%Agam_a%'
    OR `companyNames` LIKE '%Agamb_a%'
    OR `companyNames` LIKE '%Agamb%'
    OR `companyNames` LIKE '%Agamb_%'
    OR `companyNames` LIKE '%Agamba_%';
```

15 results in 30 ms.


## Use case - Algolia

• 15 results in 1 ms.

• easy update of data

• auto index generated

## Use case - Algolia

• 15 results in 1 ms when searching for Agamba

## Use case - Elastic

• No easy way to import data

• Supports fuzzy matching

• recommended to have indexes up to distance 2
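A fuzzy search in Elasticsearch is usually expressed as a `match` query with a `fuzziness` setting. The sketch below only builds the request body as a Python dictionary; the field name `companyNames` is borrowed from the MySQL example and is an assumption about how the index would be laid out, and `fuzzy_match_query` is a hypothetical helper.

```python
import json


def fuzzy_match_query(field, term, fuzziness=1):
    """Build a match query body that tolerates `fuzziness` edits."""
    return {
        "query": {
            "match": {
                field: {
                    "query": term,
                    "fuzziness": fuzziness,  # maximum edit distance per term
                }
            }
        }
    }


# Assumed field name, reused from the MySQL table for illustration only.
body = fuzzy_match_query("companyNames", "Agamba", fuzziness=1)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent to the search endpoint of the index that holds the company names.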

## Use case - Google

```
[
  "ninaj",
  [
    "ninja",
    "ninja turtles",
    "ninja blender",
    ...
  ],
  {
    1250,
    601,
    600,
    ],
    ...
  }
]
```

## Use case - Google

• output: type of response you want

• hl: the language's 2-letter abbreviation

• q: your search term

## Use case - Google

```
["нинд",
  ["нинджаго", "нинджа", "нинджи",
   "нинджаго господари на спинджицу",
   "нинджаго песни",
   "нинджаго майсторите на спинджицу",
   "нинджаго сезон 7", "нинджа убиец",
   "нинджаго играчки", "нинджа игри"]
]
```

## Typos

• Skip letter

• Double letters

• Reverse letters
• Skip spaces

• Missed key

• Inserted key
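Several of these typo classes can be simulated to produce test input for a fuzzy search. The sketch below covers skipped letters, doubled letters, reversed adjacent letters and skipped spaces. Missed and inserted keys depend on a keyboard layout and are left out. The function name `typo_variants` is hypothetical.

```python
def typo_variants(word):
    """Return a set of misspellings of `word` for a few simple typo classes."""
    variants = set()
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])               # skipped letter
        variants.add(word[:i + 1] + word[i] + word[i + 1:])  # doubled letter
        if i + 1 < len(word):
            # two adjacent letters typed in reverse order
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    variants.add(word.replace(" ", ""))                      # skipped spaces
    variants.discard(word)                                   # keep only real typos
    return variants


print(sorted(typo_variants("ninja")))
```

The misspelling "ninaj" from the Google example is one of the generated variants of "ninja".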

## Why do we need it

• Autocomplete

• Search with typos

• Bots

## Resources

• Knuth, Donald E. (1973). The Art of Computer Programming: Volume 3, Sorting and Searching. Addison-Wesley. pp. 391–92. ISBN 978-0-201-03803-3. OCLC 39472999
• G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM 46 (3), May 1999, 395–415
• Robert S. Boyer, J. Strother Moore, A fast string searching algorithm, Communications of the ACM, Volume 20, Oct. 1977, 762–772.
• https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
• https://dev.mysql.com/doc/refman/5.7/en/optimization-indexes.html
• https://blog.algolia.com/full-text-search-in-your-database-algolia-versus-elasticsearch

# Thank you for your attention

Lyubomir Filipov, Zlatko Varbanov

Department of Information Technologies, University of Veliko Tarnovo

#### Fuzzy matching of strings - MDS 2017

By Lyubomir Filipov
