Fuzzy matching of strings

Second International Conference

Mathematics Days in Sofia, July 10-14, 2017

Lyubomir Filipov, Zlatko Varbanov

Department of Information Technologies, University of Veliko Tarnovo

Why fuzzy?

Scrape some data!

Fuzzy matching

  • Computer-assisted translation
     
  • Record linkage
     
  • Matches that are less than 100% exact
     
  • Mostly English
     
  • In use since the 1990s

Fuzzy matching

  • Tomas Jones
     
  • Tom Jones
     
  • Thomas Jones

Soundex

Dropped (not encoded): a, e, i, o, u, y, h, w

  • b, f, p, v → 1
  • c, g, j, k, q, s, x, z → 2
  • d, t → 3
  • l → 4
  • m, n → 5
  • r → 6

Robert → R163
 

Soundex

  1. Robert → R (keep the first letter)
     
  2. Robert → R1 (o is dropped, b → 1)
     
  3. Robert → R16 (e is dropped, r → 6)
     
  4. Robert → R163 (t → 3)

Soundex

Robert → R163

Rupert → R163

Rubin → R150 (padded with 0 to three digits)
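
The encoding above can be sketched in a few lines of Python. This is a minimal sketch that assumes the common American Soundex conventions (adjacent letters with the same code are written once, h and w do not separate equal codes, the result is padded with zeros); the function name `soundex` is chosen here for illustration and is not taken from the slides.

```python
# Letter groups and digits exactly as listed on the slide.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a 4-character Soundex code (assumed American Soundex rules)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()          # step 1: keep the first letter
    prev = CODES.get(letters[0], "")     # code of the previous coded letter
    for ch in letters[1:]:
        code = CODES.get(ch, "")         # vowels, y, h, w have no code
        if code and code != prev:        # skip repeats of the same code
            result += code
        if ch not in "hw":               # h and w do not reset the repeat check
            prev = code
    return (result + "000")[:4]          # pad with zeros, cut to 4 characters


print(soundex("Robert"), soundex("Rupert"), soundex("Rubin"))
# R163 R163 R150
```

Under these assumptions "Tomas" and "Thomas" from the earlier slide both encode to T520, while "Tom" encodes to T500.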

 

Bitap algorithm - agrep

P H P R O C K S
R O C K S

Bitap algorithm - agrep

R O C K S
P H P R O C K S

O(n·m), where n is the text length and m is the pattern length
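
To make the idea concrete, here is a minimal sketch of the exact-match form of Bitap (often called shift-and). It keeps one bit per pattern position and updates all of them with a shift and a mask per text character. The function name `bitap_search` is hypothetical, and the approximate matching that agrep adds on top (one such state per allowed error) is deliberately left out.

```python
def bitap_search(text: str, pattern: str) -> int:
    """Return the index of the first exact match of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    state = 0  # bit i set: pattern[0..i] matches the text ending here
    for j, ch in enumerate(text):
        # Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & (1 << (m - 1)):       # highest bit set: full match
            return j - m + 1
    return -1


print(bitap_search("PHPROCKS", "ROCKS"))
# 3
```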

Boyer–Moore Algorithm - grep

[Figure: PATTERN aligned under TEXT, with an arrow marking the shifting direction]

Boyer–Moore Algorithm - grep

R O C K S
P H P R O C K S

O(n+m)
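
As an illustration, the sketch below implements the Horspool simplification of Boyer–Moore, not the full algorithm: it compares the pattern right to left and uses only a bad-character shift table, without the good-suffix rule. The function name `horspool_search` is hypothetical, and this simplified variant does not share the worst-case bound quoted above.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # For each character of the pattern except the last one: distance from
    # its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        i = m - 1
        # Compare the pattern against the current window, right to left.
        while i >= 0 and text[pos + i] == pattern[i]:
            i -= 1
        if i < 0:
            return pos
        # Shift by the table entry for the window's last character;
        # characters that are not in the table allow a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_search("PHPROCKS", "ROCKS"))
# 3
```

For the slide's example the first window "PHPRO" fails on its last character, the table entry for "O" moves the pattern three positions to the right, and the second window matches.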

Approximate string matching

  • insertion: cot → coat
     
  • deletion: coat → cot
     
  • substitution: coat → cost

Levenshtein distance

- the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one sequence into the other

  1. kitten → sitten (substitution of "s" for "k")
     
  2. sitten → sittin (substitution of "i" for "e")
     
  3. sittin → sitting (insertion of "g" at the end)
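
The distance can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch that keeps only two rows of the table; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
# 3
```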

Damerau–Levenshtein distance

- also counts the transposition of two adjacent characters as a single edit

Levenshtein, D = 2:

  1. cofn → conn (substitution of "n" for "f")
     
  2. conn → conf (substitution of "f" for "n")

Damerau–Levenshtein, D = 1:

  1. cofn → conf (transposition of "f" and "n")
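
A minimal sketch of the transposition rule follows. It implements the restricted variant usually called optimal string alignment, which is simpler than unrestricted Damerau–Levenshtein and can give a larger distance when a transposed pair is edited again. The function name `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Two adjacent characters swapped: one edit instead of two.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("cofn", "conf"))
# 1
```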

Use case - MySQL

- a table of dummy data (generated company names)

Use case - MySQL

Search patterns generated for the term 'Agamba' (`_` is the single-character wildcard of LIKE):

Array
(
    [0] => _Agamba
    [1] => gamba
    [2] => _gamba
    [3] => A_gamba
    [4] => Aamba
    [5] => A_amba
    [6] => Ag_amba
    [7] => Agmba
    [8] => Ag_mba
    [9] => Aga_mba
    [10] => Agaba
    [11] => Aga_ba
    [12] => Agam_ba
    [13] => Agama
    [14] => Agam_a
    [15] => Agamb_a
    [16] => Agamb
    [17] => Agamb_
    [18] => Agamba_
)

We searched for 'Agamba', a typo of the company name 'Agimba'.

D = 1

3 * n + 1 LIKE patterns for a term of n characters (19 for 'Agamba')
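
The pattern list can be generated mechanically. Below is a minimal sketch that reproduces the 19 entries in the same order: at each position it inserts a wildcard, deletes the character, and replaces the character with a wildcard, and finally appends a wildcard. The function names `like_patterns` and `build_query` are hypothetical, the `%s` placeholder style is an assumption that depends on the database driver, and a term that itself contains `%` or `_` would need escaping.

```python
def like_patterns(term: str, wildcard: str = "_") -> list:
    """Return the distance-1 LIKE patterns for term (3 * n + 1 of them)."""
    patterns = []
    for i in range(len(term)):
        head, tail = term[:i], term[i + 1:]
        patterns.append(head + wildcard + term[i:])  # insertion before position i
        patterns.append(head + tail)                 # deletion of position i
        patterns.append(head + wildcard + tail)      # substitution at position i
    patterns.append(term + wildcard)                 # insertion at the end
    return patterns


def build_query(term: str):
    """Return a parameterized SELECT and its parameters (assumed %s style)."""
    patterns = like_patterns(term)
    where = " OR ".join(["`companyNames` LIKE %s"] * len(patterns))
    sql = "SELECT * FROM `companyList` WHERE " + where
    params = ["%" + p + "%" for p in patterns]
    return sql, params


print(len(like_patterns("Agamba")))
# 19
```

Passing the patterns as parameters rather than concatenating them into the SQL string avoids injecting user input into the statement.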

Use case - MySQL

SELECT 
    *
FROM
    `companyList`
WHERE
    `companyNames` LIKE '%_Agamba%'
        OR `companyNames` LIKE '%gamba%'
        OR `companyNames` LIKE '%_gamba%'
        OR `companyNames` LIKE '%A_gamba%'
        OR `companyNames` LIKE '%Aamba%'
        OR `companyNames` LIKE '%A_amba%'
        OR `companyNames` LIKE '%Ag_amba%'
        OR `companyNames` LIKE '%Agmba%'
        OR `companyNames` LIKE '%Ag_mba%'
        OR `companyNames` LIKE '%Aga_mba%'
        OR `companyNames` LIKE '%Agaba%'
        OR `companyNames` LIKE '%Aga_ba%'
        OR `companyNames` LIKE '%Agam_ba%'
        OR `companyNames` LIKE '%Agama%'
        OR `companyNames` LIKE '%Agam_a%'
        OR `companyNames` LIKE '%Agamb_a%'
        OR `companyNames` LIKE '%Agamb%'
        OR `companyNames` LIKE '%Agamb_%'
        OR `companyNames` LIKE '%Agamba_%';

15 results in 30 ms.

Use case - Algolia

  • 15 results in 1 ms when searching for Agamba
     
  • Easy update of data
     
  • Index generated automatically

Use case - Elastic

  • No easy way to import data
     
  • Supports fuzzy matching
     
  • Edit distance of up to 2 recommended for fuzzy queries
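
For illustration, a fuzzy search in Elasticsearch is usually expressed as a `match` query with a `fuzziness` setting. The sketch below only builds the request body; the field name `companyNames` is borrowed from the MySQL slide as an assumption, the helper name `fuzzy_match_body` is hypothetical, and no request is sent.

```python
import json


def fuzzy_match_body(field: str, term: str, fuzziness: int = 1) -> dict:
    """Build the body of a match query that tolerates `fuzziness` edits."""
    return {
        "query": {
            "match": {
                field: {
                    "query": term,
                    "fuzziness": fuzziness,  # maximum edit distance allowed
                }
            }
        }
    }


# Field name assumed; a distance of 1 mirrors D = 1 from the MySQL use case.
body = fuzzy_match_body("companyNames", "Agamba", fuzziness=1)
print(json.dumps(body, indent=2))
```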

Embulk

Use case - Google

http://suggestqueries.google.com/complete/search?output=toolbar&hl=en&q=ninaj

Use case - Google

[  
   "ninaj",
   [  
      "ninja",
      "ninja turtles",
      "ninja blender",
...
   ],
   {  
      "google:suggestrelevance":[  
         1250,
         601,
         600,
      ],
      ...
   }
]

Use case - Google

http://suggestqueries.google.com/complete/search?output=toolbar&hl=bg&q=ниндж

  • output: the response format you want
  • hl: the two-letter language code
  • q: your search term

Use case - Google

http://suggestqueries.google.com/complete/search?output=firefox&hl=bg&q=ниндж

["нинд",
    ["нинджаго","нинджа","нинджи",
    "нинджаго господари на спинджицу",
    "нинджаго песни",
    "нинджаго майсторите на спинджицу",
    "нинджаго сезон 7","нинджа убиец",
    "нинджаго играчки","нинджа игри"]
]
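
A minimal sketch of how such a request could be assembled and its response read is shown below. It uses the URL and the three parameters from the slides, assumes the response is a JSON array whose first element is the query and whose second element is the list of suggestions, and sends no network request. The function names `suggest_url` and `parse_suggestions` are hypothetical.

```python
import json
from urllib.parse import urlencode

# Endpoint as shown on the slides.
BASE_URL = "http://suggestqueries.google.com/complete/search"


def suggest_url(term: str, lang: str = "en", output: str = "firefox") -> str:
    """Build the request URL; non-ASCII terms are percent-encoded as UTF-8."""
    return BASE_URL + "?" + urlencode({"output": output, "hl": lang, "q": term})


def parse_suggestions(body: str):
    """Split a JSON response into (query, list of suggestions)."""
    data = json.loads(body)
    return data[0], list(data[1])


print(suggest_url("ninaj"))
# http://suggestqueries.google.com/complete/search?output=firefox&hl=en&q=ninaj

# Sample body shaped like the responses above, trimmed to two elements.
sample = '["ninaj", ["ninja", "ninja turtles", "ninja blender"]]'
print(parse_suggestions(sample))
# ('ninaj', ['ninja', 'ninja turtles', 'ninja blender'])
```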

Typos

  • Skipped letter
     
  • Doubled letter
     
  • Reversed letters
     
  • Skipped space
     
  • Missed key
     
  • Inserted key

Why do we need it

  • Autocomplete
     
  • Search with typos
     
  • Bots

Resources

  • Knuth, Donald E. (1973). The Art of Computer Programming: Volume 3, Sorting and Searching. Addison-Wesley. pp. 391–92. ISBN 978-0-201-03803-3. OCLC 39472999
  • G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM 46 (3), May 1999, 395–415
  • Robert S. Boyer, J. Strother Moore, A fast string searching algorithm, Communications of the ACM, Volume 20, Oct. 1977, 762–772.
  • https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
  • https://dev.mysql.com/doc/refman/5.7/en/optimization-indexes.html
  • https://blog.algolia.com/full-text-search-in-your-database-algolia-versus-elasticsearch
  • https://support.google.com/gsa/answer/6329266?hl=en

Thank you for your attention
