Cope With Typos: Fuzzy Matching

Lyubomir Filipov  *  BurgasConf 2016  *  @FilipovG

Who am I

  • Lyubomir Filipov
  • PHP Dev
  • Enthusiast

Fuzzy matching

  • Computer-assisted translation
     
  • record linkage
     
  • finds matches that are less than 100% perfect
     
  • mostly for English text
     
  • started in the 1990s

Fuzzy matching

  • Tomas Jones
     
  • Tom Jones
     
  • Thomas Jones
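
As a rough illustration of "less than 100% perfect" matching, the three spellings above can be compared with Python's standard difflib module. This is only a sketch: difflib scores similarity with its own ratio rather than an edit distance, and the extra name "Rupert" and the 0.6 cutoff are assumptions chosen for the example.

```python
from difflib import get_close_matches

# Candidate names; "Rupert" is an unrelated entry added for contrast.
names = ["Tom Jones", "Thomas Jones", "Rupert"]

# Return up to 3 candidates whose similarity ratio to the query is >= 0.6.
matches = get_close_matches("Tomas Jones", names, n=3, cutoff=0.6)
print(matches)
```

Both "Tom Jones" and "Thomas Jones" pass the cutoff, while the unrelated name is filtered out.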

Soundex

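Keep the first letter, drop a, e, i, o, u, y, h, w from the rest, and map the remaining consonants to digits (pad with zeros to one letter plus three digits):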

  • b, f, p, v → 1
  • c, g, j, k, q, s, x, z → 2
  • d, t → 3
  • l → 4
  • m, n → 5
  • r → 6

Robert → R163
Rupert → R163
Rubin → R150
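
The mapping above can be sketched in a few lines of Python. This is a minimal version of the common American Soundex rules (adjacent letters with the same code are collapsed, h and w do not break such a run), not necessarily the exact variant used in the talk; the function name is an assumption.

```python
# Digit codes for consonants, as listed on the slide.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name):
    """Return a 4-character Soundex code (letter + 3 digits)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        # Emit a digit only when it differs from the previous code.
        if code and code != prev:
            digits.append(code)
        # Vowels reset the previous code; h and w do not.
        if c not in "hw":
            prev = code
    # Pad with zeros and truncate to one letter plus three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


for name in ("Robert", "Rupert", "Rubin"):
    print(name, soundex(name))
```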
 

Not what we expected

Approximate string matching

  • insertion: cot → coat
     
  • deletion: coat → cot
     
  • substitution: coat → cost

What do they have in common?

Levenshtein distance

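- the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one sequence into the other

Example: kitten → sitting, D = 3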

  1. kitten → sitten (substitution of "s" for "k")
     
  2. sitten → sittin (substitution of "i" for "e")
     
  3. sittin → sitting (insertion of "g" at the end).
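
The distance for the example above can be computed with the usual dynamic-programming table. The following is a minimal sketch that keeps only two rows of the table; the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```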

Damerau–Levenshtein distance

adds a fourth operation: transposition of two adjacent characters

Levenshtein, D = 2:

  1. cofn → conn (substitution of "n" for "f")
     
  2. conn → conf (substitution of "f" for "n")

Damerau–Levenshtein, D = 1:

  1. cofn → conf (transposition of "f" and "n")
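
A sketch of the transposition rule in Python follows. Note the assumption: this is the restricted "optimal string alignment" variant, which is simpler than the full Damerau–Levenshtein algorithm and can give a larger distance when a substring is edited more than once. The function name is hypothetical.

```python
def osa_distance(a, b):
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("cofn", "conf"))
```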

Usecase - MySQL

- list of dummy data (generated company names)

Array
(
    [0] => _Agamba
    [1] => gamba
    [2] => _gamba
    [3] => A_gamba
    [4] => Aamba
    [5] => A_amba
    [6] => Ag_amba
    [7] => Agmba
    [8] => Ag_mba
    [9] => Aga_mba
    [10] => Agaba
    [11] => Aga_ba
    [12] => Agam_ba
    [13] => Agama
    [14] => Agam_a
    [15] => Agamb_a
    [16] => Agamb
    [17] => Agamb_
    [18] => Agamba_
)

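We typed 'Agamba', a typo for the company name 'Agimba'.

D = 1

About 3 * n LIKE patterns for a word of n characters (3n + 1 = 19 for 'Agamba'); `_` matches any single character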
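
The pattern list can be generated mechanically. The sketch below reproduces the same 19 entries in the same order (insertion, deletion, substitution at each position, plus one trailing insertion); the function name is an assumption, and the talk's own generator was presumably written in PHP.

```python
def like_patterns(word):
    """All LIKE patterns within edit distance 1 of word, using `_` as wildcard."""
    patterns = []
    for i in range(len(word)):
        head, tail = word[:i], word[i + 1:]
        patterns.append(head + "_" + word[i:])  # a character was dropped here
        patterns.append(head + tail)            # an extra character was typed
        patterns.append(head + "_" + tail)      # a wrong character was typed
    patterns.append(word + "_")                 # a character was dropped at the end
    return patterns


patterns = like_patterns("Agamba")
print(len(patterns), patterns)
```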

SELECT 
    *
FROM
    `companyList`
WHERE
    `companyNames` LIKE '%_Agamba%'
        OR `companyNames` LIKE '%gamba%'
        OR `companyNames` LIKE '%_gamba%'
        OR `companyNames` LIKE '%A_gamba%'
        OR `companyNames` LIKE '%Aamba%'
        OR `companyNames` LIKE '%A_amba%'
        OR `companyNames` LIKE '%Ag_amba%'
        OR `companyNames` LIKE '%Agmba%'
        OR `companyNames` LIKE '%Ag_mba%'
        OR `companyNames` LIKE '%Aga_mba%'
        OR `companyNames` LIKE '%Agaba%'
        OR `companyNames` LIKE '%Aga_ba%'
        OR `companyNames` LIKE '%Agam_ba%'
        OR `companyNames` LIKE '%Agama%'
        OR `companyNames` LIKE '%Agam_a%'
        OR `companyNames` LIKE '%Agamb_a%'
        OR `companyNames` LIKE '%Agamb%'
        OR `companyNames` LIKE '%Agamb_%'
        OR `companyNames` LIKE '%Agamba_%';

15 results in 30 ms.
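
Writing such a query by hand does not scale, so it would normally be built from the pattern list with bound parameters. The sketch below uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs without a server; the three sample rows are invented for the example, and only the table and column names come from the query above.

```python
import sqlite3


def like_patterns(word):
    """Distance-1 LIKE patterns, with `_` as the single-character wildcard."""
    patterns = []
    for i in range(len(word)):
        head, tail = word[:i], word[i + 1:]
        patterns += [head + "_" + word[i:], head + tail, head + "_" + tail]
    patterns.append(word + "_")
    return patterns


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companyList (companyNames TEXT)")
# Invented sample rows, not the dataset from the talk.
conn.executemany(
    "INSERT INTO companyList VALUES (?)",
    [("Agimba",), ("Agamba",), ("Zoomzone",)],
)

# One bound parameter per pattern keeps the SQL free of user input.
params = ["%" + p + "%" for p in like_patterns("Agamba")]
where = " OR ".join(["companyNames LIKE ?"] * len(params))
sql = f"SELECT companyNames FROM companyList WHERE {where} ORDER BY companyNames"
matches = [row[0] for row in conn.execute(sql, params)]
print(matches)
```

The misspelled search term still finds 'Agimba' through the `Ag_mba` pattern, while the unrelated row is skipped.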

Not what we expected

Usecase - Algolia

  • 15 results in 1 ms when searching for Agamba
     
  • easy update of data
     
  • index generated automatically

Usecase - Elastic

  • No easy way to import data
     
  • Supports fuzzy matching
     
  • recommended fuzziness (edit distance) of at most 2
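
For reference, a fuzzy search in Elasticsearch is usually expressed as a match query with a fuzziness setting. The sketch below only builds the request body in Python and does not contact a cluster; the field name companyNames is borrowed from the MySQL example and is an assumption about how the data would be indexed.

```python
import json


def fuzzy_match_query(field, text, fuzziness=1):
    """Build a match query body that tolerates `fuzziness` edits per term."""
    return {
        "query": {
            "match": {
                field: {"query": text, "fuzziness": fuzziness}
            }
        }
    }


# Body that would be sent to an index's _search endpoint.
body = json.dumps(fuzzy_match_query("companyNames", "Agamba"), indent=2)
print(body)
```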

Embulk (an open-source bulk data loader)

Usecase - Google

http://suggestqueries.google.com/complete/search?output=toolbar&hl=en&q=ninaj

[  
   "ninaj",
   [  
      "ninja",
      "ninja turtles",
      "ninja blender",
...
   ],
   {  
      "google:clientdata":{  
         "bpc":false,
         "tlw":false
      },
      "google:suggestrelevance":[  
         1250,
         601,
         600,
      ],
      ...
   }
]
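
A response of this shape can be unpacked by pairing each suggestion with its relevance score. The sketch below works on a trimmed copy of the response above (the elided parts are left out, and no network request is made); the function name is an assumption.

```python
import json

# Trimmed copy of the response shown above, without the elided parts.
sample = (
    '["ninaj", ["ninja", "ninja turtles", "ninja blender"], '
    '{"google:suggestrelevance": [1250, 601, 600]}]'
)


def ranked_suggestions(raw):
    """Return the typed query and (suggestion, relevance) pairs, best first."""
    query, suggestions, meta = json.loads(raw)[:3]
    scores = meta.get("google:suggestrelevance", [])
    ranked = sorted(zip(suggestions, scores), key=lambda pair: pair[1], reverse=True)
    return query, ranked


query, ranked = ranked_suggestions(sample)
print(query, ranked)
```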

Why do we need it

  • Autocomplete
  • Search with typos
  • Bots

Thanks for watching
