Let's get fuzzy!

Bulgaria PHP Conference 2016
Lyubomir Filipov * @FilipovG

Who am I

Lyubomir Filipov

PHP Dev

Enthusiast

Why fuzzy?

Scrape some data!

Fuzzy matching

Computer-assisted translation
Record linkage
Matches that are less than 100% perfect
Mostly English
Started in the 1990s

Fuzzy matching

Tomas Jones
Tom Jones
Thomas Jones

Soundex

a, e, i, o, u, y, h, w → dropped (no code)

b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6

Robert → R163

Soundex

1. Robert → R

2. Robert → R1

3. Robert → R16

4. Robert → R163

a, e, i, o, u, y, h, w → dropped (no code)

b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6
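
The coding steps above can be sketched in PHP roughly as follows. This is a simplified illustration of American Soundex, not code from the talk; `simpleSoundex` is a hypothetical helper name, and PHP already ships a built-in `soundex()`.

```php
<?php
// Simplified Soundex sketch: keep the first letter, encode the rest,
// drop uncoded letters, collapse repeated codes, pad to 4 characters.
function simpleSoundex(string $word): string
{
    $codes = [
        'b' => '1', 'f' => '1', 'p' => '1', 'v' => '1',
        'c' => '2', 'g' => '2', 'j' => '2', 'k' => '2',
        'q' => '2', 's' => '2', 'x' => '2', 'z' => '2',
        'd' => '3', 't' => '3',
        'l' => '4',
        'm' => '5', 'n' => '5',
        'r' => '6',
    ];

    // Assumption: only ASCII letters matter; everything else is ignored.
    $letters = preg_replace('/[^a-z]/', '', strtolower($word));
    if ($letters === '') {
        return '';
    }

    $result = strtoupper($letters[0]);
    $previous = $codes[$letters[0]] ?? '';

    for ($i = 1; $i < strlen($letters) && strlen($result) < 4; $i++) {
        $char = $letters[$i];
        $code = $codes[$char] ?? '';
        if ($code !== '' && $code !== $previous) {
            $result .= $code;
        }
        // h and w do not separate two equal codes; vowels do.
        if ($char !== 'h' && $char !== 'w') {
            $previous = $code;
        }
    }

    return str_pad($result, 4, '0');
}

echo simpleSoundex('Robert'), "\n"; // R163
```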

Soundex

a, e, i, o, u, y, h, w → dropped (no code)

b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6

Robert → R163

Rupert → R163

Rubin → R150
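
A small sketch of how the three codes above could be checked with PHP's built-in `soundex()`. The grouping helper `groupBySoundex` is a hypothetical name introduced only for this illustration.

```php
<?php
// Group names by the code returned by PHP's built-in soundex().
function groupBySoundex(array $names): array
{
    $groups = [];
    foreach ($names as $name) {
        $groups[soundex($name)][] = $name;
    }
    return $groups;
}

// Robert and Rupert share R163, Rubin gets R150.
print_r(groupBySoundex(['Robert', 'Rupert', 'Rubin']));
```

Names that land in the same group are candidates for a match; the code alone cannot tell that Robert and Rupert are different names.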

Not what we expected

Bitap algorithm - agrep

P	H	P		R	O	C	K	S

R	O	C	K	S

Bitap algorithm - agrep

R	O	C	K	S

P	H	P		R	O	C	K	S
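
A minimal sketch of the bit-parallel idea behind Bitap, shown here for exact matching only (the shift-and variant). The fuzzy version used by agrep keeps one such state word per allowed error, which is left out to keep the example short. `bitapSearch` is a hypothetical name.

```php
<?php
// Exact-match Bitap (shift-and): bit i of $state is set when the first
// i + 1 pattern characters match the text ending at the current position.
function bitapSearch(string $text, string $pattern): int
{
    $m = strlen($pattern);
    if ($m === 0) {
        return 0;
    }
    if ($m > PHP_INT_SIZE * 8 - 1) {
        throw new InvalidArgumentException('Pattern too long for one machine word');
    }

    // One bitmask per character: bit i is set if $pattern[$i] is that character.
    $masks = [];
    for ($i = 0; $i < $m; $i++) {
        $masks[$pattern[$i]] = ($masks[$pattern[$i]] ?? 0) | (1 << $i);
    }

    $state = 0;
    $accept = 1 << ($m - 1);
    for ($j = 0; $j < strlen($text); $j++) {
        $state = (($state << 1) | 1) & ($masks[$text[$j]] ?? 0);
        if ($state & $accept) {
            return $j - $m + 1; // start offset of the match
        }
    }

    return -1; // no match
}

echo bitapSearch('PHP ROCKS', 'ROCKS'), "\n"; // 4
```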

Bitap algorithm

Boyer–Moore Algorithm - grep

TEXT

PATTERN

Shifting direction

Boyer–Moore Algorithm - grep

R	O	C	K	S

P	H	P		R	O	C	K	S

Boyer–Moore Algorithm - grep

R	O	C	K	S

P	H	P		R	O	C	K	S

1 2 3 4

Boyer–Moore Algorithm - grep

R	O	C	K	S

P	H	P		R	O	C	K	S
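
The shifting shown above can be sketched with the Boyer–Moore–Horspool simplification, which keeps only the bad-character rule of the full algorithm. This is an illustrative sketch, not how grep is actually implemented; `horspoolSearch` is a hypothetical name.

```php
<?php
// Boyer–Moore–Horspool: compare the pattern right to left, then shift it
// according to the text character aligned with the pattern's last position.
function horspoolSearch(string $text, string $pattern): int
{
    $n = strlen($text);
    $m = strlen($pattern);
    if ($m === 0) {
        return 0;
    }

    // Shift table: distance from the last occurrence of each character
    // (excluding the final pattern position) to the end of the pattern.
    $shift = [];
    for ($i = 0; $i < $m - 1; $i++) {
        $shift[$pattern[$i]] = $m - 1 - $i;
    }

    $pos = 0;
    while ($pos <= $n - $m) {
        $i = $m - 1;
        while ($i >= 0 && $text[$pos + $i] === $pattern[$i]) {
            $i--;
        }
        if ($i < 0) {
            return $pos; // full match
        }
        // Characters not in the pattern allow a jump of the full pattern length.
        $pos += $shift[$text[$pos + $m - 1]] ?? $m;
    }

    return -1;
}

echo horspoolSearch('PHP ROCKS', 'ROCKS'), "\n"; // 4
```

For this input the first comparison fails on `R` against `S`, the table shifts the pattern by 4, and the second alignment matches.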

Approximate string matching

insertion: cot → coat
deletion: coat → cot
substitution: coat → cost

What is common

Levenshtein distance

- difference between two sequences

kitten → sitten (substitution of "s" for "k")
sitten → sittin (substitution of "i" for "e")
sittin → sitting (insertion of "g" at the end).
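
The three steps above give a distance of 3. A minimal dynamic-programming sketch of the same calculation, assuming single-byte characters; `editDistance` is a hypothetical name, and PHP's built-in `levenshtein()` returns the same value.

```php
<?php
// Levenshtein distance with two rows of the DP table.
function editDistance(string $a, string $b): int
{
    $previous = range(0, strlen($b)); // distance from '' to each prefix of $b

    for ($i = 1; $i <= strlen($a); $i++) {
        $current = [$i]; // distance from prefix of $a to ''
        for ($j = 1; $j <= strlen($b); $j++) {
            $cost = $a[$i - 1] === $b[$j - 1] ? 0 : 1;
            $current[$j] = min(
                $previous[$j] + 1,         // deletion
                $current[$j - 1] + 1,      // insertion
                $previous[$j - 1] + $cost  // substitution or match
            );
        }
        $previous = $current;
    }

    return $previous[strlen($b)];
}

echo editDistance('kitten', 'sitting'), "\n"; // 3
```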

Damerau–Levenshtein distance

transposition of two adjacent characters

cofn → conn (substitution of "n" for "f")
conn → conf (substitution of "f" for "n")

cofn → conf

(transposition of "f" and "n")

Levenshtein: D = 2
Damerau–Levenshtein: D = 1
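
PHP has no built-in for this distance, so here is a sketch of the restricted variant (often called optimal string alignment), which adds adjacent transpositions to the Levenshtein table. `osaDistance` is a hypothetical name, and single-byte characters are assumed.

```php
<?php
// Optimal string alignment: Levenshtein plus adjacent transposition.
function osaDistance(string $a, string $b): int
{
    $la = strlen($a);
    $lb = strlen($b);

    $d = [];
    for ($i = 0; $i <= $la; $i++) {
        $d[$i][0] = $i;
    }
    for ($j = 0; $j <= $lb; $j++) {
        $d[0][$j] = $j;
    }

    for ($i = 1; $i <= $la; $i++) {
        for ($j = 1; $j <= $lb; $j++) {
            $cost = $a[$i - 1] === $b[$j - 1] ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,          // deletion
                $d[$i][$j - 1] + 1,          // insertion
                $d[$i - 1][$j - 1] + $cost   // substitution or match
            );
            // Two adjacent characters swapped count as one edit.
            if ($i > 1 && $j > 1
                && $a[$i - 1] === $b[$j - 2]
                && $a[$i - 2] === $b[$j - 1]
            ) {
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }

    return $d[$la][$lb];
}

echo levenshtein('cofn', 'conf'), "\n"; // 2
echo osaDistance('cofn', 'conf'), "\n"; // 1
```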

PHP functions

string soundex ( string $str )

int levenshtein ( string $str1 , string $str2 )

string metaphone ( string $str )

int similar_text ( string $first , string $second )
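
A short sketch of how the four built-ins listed above might be combined to compare two names. `compareStrings` and its array keys are hypothetical names chosen for this illustration; `similar_text()` reports a percentage through its optional third, by-reference argument.

```php
<?php
// Compare two strings with PHP's built-in fuzzy-matching helpers.
function compareStrings(string $first, string $second): array
{
    // similar_text() writes the similarity percentage into $percent.
    similar_text($first, $second, $percent);

    return [
        'same_soundex'       => soundex($first) === soundex($second),
        'same_metaphone'     => metaphone($first) === metaphone($second),
        'levenshtein'        => levenshtein($first, $second),
        'similarity_percent' => round($percent, 1),
    ];
}

print_r(compareStrings('Tomas Jones', 'Thomas Jones'));
```

Here the two spellings share a Soundex code and are one edit apart, which is the kind of near-match the name examples earlier in the deck illustrate.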

Use case - MySQL

- list of dummy data (generated company names)

Use case - MySQL

Array
(
    [0] => _Agamba
    [1] => gamba
    [2] => _gamba
    [3] => A_gamba
    [4] => Aamba
    [5] => A_amba
    [6] => Ag_amba
    [7] => Agmba
    [8] => Ag_mba
    [9] => Aga_mba
    [10] => Agaba
    [11] => Aga_ba
    [12] => Agam_ba
    [13] => Agama
    [14] => Agam_a
    [15] => Agamb_a
    [16] => Agamb
    [17] => Agamb_
    [18] => Agamba_
)

We search for 'Agamba', a mistyped version of the company name 'Agimba'.

D = 1

About 3 * n LIKE patterns (3n + 1 = 19 for the 6-letter 'Agamba')
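
A sketch of how a pattern list like the one above could be generated: for each position, one insertion, one deletion and one substitution, with `_` standing for any single character in a MySQL `LIKE` pattern. `likePatterns` is a hypothetical name and single-byte input is assumed.

```php
<?php
// Generate all distance-1 LIKE patterns for a word.
function likePatterns(string $word): array
{
    $patterns = [];
    $n = strlen($word);

    for ($i = 0; $i < $n; $i++) {
        $head = substr($word, 0, $i);
        $tail = substr($word, $i + 1);

        $patterns[] = $head . '_' . substr($word, $i); // insertion before position i
        $patterns[] = $head . $tail;                   // deletion of position i
        $patterns[] = $head . '_' . $tail;             // substitution at position i
    }
    $patterns[] = $word . '_';                         // insertion at the end

    return $patterns;
}

print_r(likePatterns('Agamba')); // 19 patterns, from "_Agamba" to "Agamba_"
```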

Use case - MySQL

SELECT 
    *
FROM
    `companyList`
WHERE
    `companyNames` LIKE '%_Agamba%'
        OR `companyNames` LIKE '%gamba%'
        OR `companyNames` LIKE '%_gamba%'
        OR `companyNames` LIKE '%A_gamba%'
        OR `companyNames` LIKE '%Aamba%'
        OR `companyNames` LIKE '%A_amba%'
        OR `companyNames` LIKE '%Ag_amba%'
        OR `companyNames` LIKE '%Agmba%'
        OR `companyNames` LIKE '%Ag_mba%'
        OR `companyNames` LIKE '%Aga_mba%'
        OR `companyNames` LIKE '%Agaba%'
        OR `companyNames` LIKE '%Aga_ba%'
        OR `companyNames` LIKE '%Agam_ba%'
        OR `companyNames` LIKE '%Agama%'
        OR `companyNames` LIKE '%Agam_a%'
        OR `companyNames` LIKE '%Agamb_a%'
        OR `companyNames` LIKE '%Agamb%'
        OR `companyNames` LIKE '%Agamb_%'
        OR `companyNames` LIKE '%Agamba_%';

15 results in 30 ms.
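
A sketch of how a query like the one above might be assembled with placeholders instead of hand-written clauses. The table and column names come from the slide; `buildFuzzyQuery` and the `$pdo` connection mentioned in the comments are assumptions for illustration.

```php
<?php
// Build one parameterised query from a list of LIKE patterns.
function buildFuzzyQuery(array $patterns): array
{
    if ($patterns === []) {
        throw new InvalidArgumentException('At least one pattern is required');
    }

    $clauses = array_fill(0, count($patterns), '`companyNames` LIKE ?');
    $sql = 'SELECT * FROM `companyList` WHERE ' . implode(' OR ', $clauses);

    // Wrap each pattern in % so it can match anywhere in the column.
    $params = array_map(
        fn (string $pattern): string => '%' . $pattern . '%',
        $patterns
    );

    return [$sql, $params];
}

[$sql, $params] = buildFuzzyQuery(['Ag_mba', 'Agmba']);
// With a PDO connection in $pdo (an assumption), one would then run:
// $statement = $pdo->prepare($sql);
// $statement->execute($params);
```

The leading `%` prevents MySQL from using an ordinary index on the column, which is one plausible reason the query stays slow as the table grows.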

D = 1

About 3 * n LIKE patterns in one query

Not what we expected

Use case - Algolia

15 results in 1 ms.
Easy data updates
Index generated automatically

Use case - Algolia

15 results in 1 ms when searching for 'Agamba'

Use case - Elastic

No easy way to import data
Supports fuzzy matching
Recommended to keep fuzziness (edit distance) at 2 or below
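
A sketch of what a fuzzy search request body might look like, built as a PHP array and encoded to JSON. It uses Elasticsearch's `fuzzy` query with its `value` and `fuzziness` options; the field name is borrowed from the MySQL example, and `fuzzyQueryBody` is a hypothetical helper, not part of any client library.

```php
<?php
// Build the JSON body for an Elasticsearch fuzzy query.
function fuzzyQueryBody(string $field, string $term, int $fuzziness = 1): string
{
    // Elasticsearch accepts an edit distance of 0, 1 or 2 here.
    if ($fuzziness < 0 || $fuzziness > 2) {
        throw new InvalidArgumentException('Fuzziness must be 0, 1 or 2');
    }

    return json_encode(
        [
            'query' => [
                'fuzzy' => [
                    $field => ['value' => $term, 'fuzziness' => $fuzziness],
                ],
            ],
        ],
        JSON_THROW_ON_ERROR
    );
}

echo fuzzyQueryBody('companyNames', 'Agamba'), "\n";
```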

Embulk

Use case - Google

http://suggestqueries.google.com/complete/search?output=toolbar&hl=en&q=ninaj

Use case - Google

[  
   "ninaj",
   [  
      "ninja",
      "ninja turtles",
      "ninja blender",
...
   ],
   {  
      "google:suggestrelevance":[  
         1250,
         601,
         600,
      ],
      ...
   }
]

Use case - Google

http://suggestqueries.google.com/complete/search?output=toolbar&hl=bg&q=ниндж

Parameter  Meaning
output     Type of response you want
hl         Language's 2-letter abbreviation
q          Your search term

Use case - Google

http://suggestqueries.google.com/complete/search?output=firefox&hl=bg&q=ниндж

["нинд",
    ["нинджаго","нинджа","нинджи",
    "нинджаго господари на спинджицу",
    "нинджаго песни",
    "нинджаго майсторите на спинджицу",
    "нинджаго сезон 7","нинджа убиец",
    "нинджаго играчки","нинджа игри"]
]
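
A sketch of how the `output=firefox` response above could be handled in PHP: build the URL from the three parameters, then read the suggestions from the second element of the decoded JSON array. The endpoint is not an officially documented API, and `suggestUrl` and `parseSuggestions` are hypothetical helper names. No request is sent here.

```php
<?php
// Build the suggestion URL from the parameters shown on the slide.
function suggestUrl(string $term, string $language = 'en'): string
{
    $query = http_build_query(
        ['output' => 'firefox', 'hl' => $language, 'q' => $term],
        '',
        '&'
    );
    return 'http://suggestqueries.google.com/complete/search?' . $query;
}

// Extract the list of suggestions: [ "term", [ "suggestion", ... ] ].
function parseSuggestions(string $json): array
{
    $decoded = json_decode($json, true);
    if (!is_array($decoded) || !isset($decoded[1]) || !is_array($decoded[1])) {
        return []; // unexpected shape or invalid JSON
    }
    return $decoded[1];
}

$sample = '["ninaj",["ninja","ninja turtles","ninja blender"]]';
print_r(parseSuggestions($sample));
```

Fetching the URL (for example with `file_get_contents()`) and handling network errors is left out of the sketch.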

Typos

Skip letter
Double letters
Reverse letters

Skip spaces
Missed key
Inserted key
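
Some of the typo classes above can be generated mechanically. The sketch below covers skipped letters, doubled letters, reversed letters and skipped spaces; missed and inserted keys would need a keyboard layout and are left out. `typoVariants` is a hypothetical name and single-byte input is assumed.

```php
<?php
// Generate simple typo variants of a phrase.
function typoVariants(string $phrase): array
{
    $variants = [];
    $length = strlen($phrase);

    for ($i = 0; $i < $length; $i++) {
        $before = substr($phrase, 0, $i);
        $after = substr($phrase, $i + 1);

        if ($phrase[$i] !== ' ') {
            $variants[] = $before . $after;                             // skipped letter
            $variants[] = $before . $phrase[$i] . $phrase[$i] . $after; // doubled letter
        }
        if ($i + 1 < $length && $phrase[$i] !== $phrase[$i + 1]) {
            // reversed letters: swap positions i and i + 1
            $variants[] = $before . $phrase[$i + 1] . $phrase[$i] . substr($phrase, $i + 2);
        }
    }

    if (strpos($phrase, ' ') !== false) {
        $variants[] = str_replace(' ', '', $phrase); // skipped spaces
    }

    return array_values(array_unique($variants));
}

print_r(typoVariants('ninja')); // includes "ninaj", "nija" and "ninjaa"
```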

Inserted key

Why do we need it

Autocomplete
Search with typos
Bots

Thanks for watching

Lyubomir Filipov * @FilipovG

https://joind.in/talk/88029
