# Let's get fuzzy!

Bulgaria PHP Conference 2016
Lyubomir Filipov  *  @FilipovG

Lyubomir Filipov

PHP Dev

Enthusiast

## Fuzzy matching

• Comes from computer-assisted translation

• Finds matches that are less than 100% perfect

• Mostly English

• In use since the 1990s

Example: three spellings that should match each other

• Tomas Jones

• Tom Jones

• Thomas Jones

## Soundex

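Keep the first letter, code the remaining consonants as digits, and pad with zeros to one letter plus three digits.

• a, e, i, o, u, y, h, w → dropped (not coded)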

• b, f, p, v → 1
• c, g, j, k, q, s, x, z → 2
• d, t → 3
• l → 4
• m, n → 5
• r → 6

Robert → R163

## Soundex

1. Robert → R (keep the first letter)

2. Robert → R1 (b → 1)

3. Robert → R16 (r → 6)

4. Robert → R163 (t → 3)

## Soundex

Robert → R163

Rupert → R163

Rubin → R150
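The codes above can be reproduced with PHP's built-in `soundex()` function. This is only a small sketch to show that names sharing a key are treated as sounding alike; the list of names is taken from the slide.

```php
<?php
// Print the Soundex key for each name from the slide.
foreach (['Robert', 'Rupert', 'Rubin'] as $name) {
    printf("%s -> %s\n", $name, soundex($name));
}
// Robert -> R163
// Rupert -> R163
// Rubin -> R150

// Two names "sound alike" when their keys are equal.
var_dump(soundex('Robert') === soundex('Rupert')); // bool(true)
var_dump(soundex('Robert') === soundex('Rubin'));  // bool(false)
```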

## Boyer–Moore Algorithm - grep

Text: P H P R O C K S

Pattern: R O C K S

Shifting direction: the pattern slides left to right along the text, while characters are compared right to left.

## Boyer–Moore Algorithm - grep

[Figure: the pattern R O C K S shown against the text P H P R O C K S, with positions numbered 1 to 4.]
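To make the shifting idea concrete, here is a minimal sketch of the Boyer–Moore–Horspool simplification, which keeps only the bad-character shift table and not the full Boyer–Moore rule set. The function name `horspoolSearch` is made up for this example, and the code works on bytes rather than multibyte characters.

```php
<?php
/**
 * Boyer-Moore-Horspool search (bad-character rule only).
 * Returns the 0-based offset of the first match, or -1 if there is none.
 */
function horspoolSearch(string $text, string $pattern): int
{
    $n = strlen($text);
    $m = strlen($pattern);
    if ($m === 0) {
        return 0;
    }

    // For every pattern character except the last one, store the distance
    // from its last occurrence to the end of the pattern.
    $shift = [];
    for ($i = 0; $i < $m - 1; $i++) {
        $shift[$pattern[$i]] = $m - 1 - $i;
    }

    $pos = 0; // current alignment of the pattern against the text
    while ($pos <= $n - $m) {
        // Compare right to left.
        $j = $m - 1;
        while ($j >= 0 && $text[$pos + $j] === $pattern[$j]) {
            $j--;
        }
        if ($j < 0) {
            return $pos;
        }
        // Shift according to the text character under the pattern's last
        // position; characters not in the table skip the whole pattern.
        $pos += $shift[$text[$pos + $m - 1]] ?? $m;
    }
    return -1;
}

echo horspoolSearch('PHPROCKS', 'ROCKS'), PHP_EOL; // 3
```

For this input the first alignment fails on the last character, the table says to move three positions, and the second alignment matches.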

## Approximate string matching

• insertion: cot → coat

• deletion: coat → cot

• substitution: coat → cost

## Levenshtein distance

- the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one sequence into the other

1. kitten → sitten (substitution of "s" for "k")

2. sitten → sittin (substitution of "i" for "e")

3. sittin → sitting (insertion of "g" at the end).
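The three steps above can be checked with a small dynamic-programming sketch. The function name `editDistance` is an assumption for this example; PHP's built-in `levenshtein()` gives the same result and is what you would normally call.

```php
<?php
/**
 * Levenshtein distance using the classic two-row dynamic-programming table.
 * Byte-based, like PHP's built-in levenshtein().
 */
function editDistance(string $a, string $b): int
{
    $m = strlen($a);
    $n = strlen($b);

    // $prev[$j]: distance between the processed prefix of $a and the
    // first $j bytes of $b. Initially the prefix of $a is empty.
    $prev = range(0, $n);

    for ($i = 1; $i <= $m; $i++) {
        $curr = [$i];
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution (or match)
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

echo editDistance('kitten', 'sitting'), PHP_EOL; // 3
echo levenshtein('kitten', 'sitting'), PHP_EOL;  // 3
```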

## Damerau–Levenshtein distance

Levenshtein: D = 2

1. cofn → conn (substitution of "n" for "f")

2. conn → conf (substitution of "f" for "n")

Damerau–Levenshtein: D = 1

1. cofn → conf (transposition of "f" and "n")
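PHP has no built-in function for this distance, so the sketch below implements the optimal string alignment variant (sometimes called restricted Damerau–Levenshtein), which counts a swap of two adjacent characters as one edit. The name `osaDistance` is hypothetical, and the code is byte-based.

```php
<?php
/**
 * Optimal string alignment distance: Levenshtein plus
 * transposition of two adjacent characters as a single edit.
 */
function osaDistance(string $a, string $b): int
{
    $m = strlen($a);
    $n = strlen($b);

    // $d[$i][$j]: distance between the first $i bytes of $a
    // and the first $j bytes of $b.
    $d = [];
    for ($i = 0; $i <= $m; $i++) {
        $d[$i] = [$i];
    }
    for ($j = 0; $j <= $n; $j++) {
        $d[0][$j] = $j;
    }

    for ($i = 1; $i <= $m; $i++) {
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,         // deletion
                $d[$i][$j - 1] + 1,         // insertion
                $d[$i - 1][$j - 1] + $cost  // substitution (or match)
            );
            // Transposition: the last two characters are swapped.
            if ($i > 1 && $j > 1
                && $a[$i - 1] === $b[$j - 2]
                && $a[$i - 2] === $b[$j - 1]
            ) {
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }
    return $d[$m][$n];
}

echo levenshtein('cofn', 'conf'), PHP_EOL; // 2
echo osaDistance('cofn', 'conf'), PHP_EOL; // 1
```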

## PHP functions

string soundex ( string $str )

int levenshtein ( string $str1 , string $str2 )

string metaphone ( string $str )

int similar_text ( string $first , string $second )
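As a rough illustration, the four functions can be run side by side on the name variants from the fuzzy matching slide. The choice of `Tomas Jones` as the query is an assumption made for this example. `similar_text()` also accepts an optional third argument, passed by reference, that receives the similarity as a percentage.

```php
<?php
$query = 'Tomas Jones';
$candidates = ['Tom Jones', 'Thomas Jones'];

foreach ($candidates as $candidate) {
    // $percent is filled in by reference with the similarity percentage.
    similar_text($query, $candidate, $percent);

    printf(
        "%-12s  levenshtein=%d  similar_text=%.1f%%  soundex=%s  metaphone=%s\n",
        $candidate,
        levenshtein($query, $candidate),
        $percent,
        soundex($candidate),
        metaphone($candidate)
    );
}
```

The edit distance is the easiest of the four to rank by: lower means closer. The phonetic keys are better suited to equality checks than to ranking.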

## Use case - MySQL

- List of dummy data (generated company names)

## Use case - MySQL

```
Array
(
    [0] => _Agamba
    [1] => gamba
    [2] => _gamba
    [3] => A_gamba
    [4] => Aamba
    [5] => A_amba
    [6] => Ag_amba
    [7] => Agmba
    [8] => Ag_mba
    [9] => Aga_mba
    [10] => Agaba
    [11] => Aga_ba
    [12] => Agam_ba
    [13] => Agama
    [14] => Agam_a
    [15] => Agamb_a
    [16] => Agamb
    [17] => Agamb_
    [18] => Agamba_
)
```

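We made a typo in the company name: we searched for 'Agamba' instead of 'Agimba'.

Covering edit distance D = 1 takes 3 * n + 1 LIKE patterns for a term of n characters (19 for 'Agamba').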
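The pattern list can be generated mechanically. The sketch below is one way to do it, assuming the order shown in the array: for each position an insertion, a deletion and a substitution, plus one final insertion at the end. The function name `distanceOnePatterns` is made up for this example; `_` is the single-character wildcard of SQL `LIKE`.

```php
<?php
/**
 * Build LIKE patterns that cover every single-edit variant of $term.
 * Byte-based; a multibyte term would need the mb_* string functions.
 */
function distanceOnePatterns(string $term): array
{
    $patterns = [];
    $n = strlen($term);

    for ($i = 0; $i < $n; $i++) {
        $head = substr($term, 0, $i);
        $tail = substr($term, $i + 1);

        $patterns[] = $head . '_' . substr($term, $i); // insertion before position $i
        $patterns[] = $head . $tail;                    // deletion of position $i
        $patterns[] = $head . '_' . $tail;              // substitution at position $i
    }
    $patterns[] = $term . '_'; // insertion after the last character

    return $patterns;
}

print_r(distanceOnePatterns('Agamba')); // 19 patterns, 'Ag_mba' matches 'Agimba'
```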

## Use case - MySQL

```sql
SELECT
    *
FROM
    `companyList`
WHERE
    `companyNames` LIKE '%_Agamba%'
    OR `companyNames` LIKE '%gamba%'
    OR `companyNames` LIKE '%_gamba%'
    OR `companyNames` LIKE '%A_gamba%'
    OR `companyNames` LIKE '%Aamba%'
    OR `companyNames` LIKE '%A_amba%'
    OR `companyNames` LIKE '%Ag_amba%'
    OR `companyNames` LIKE '%Agmba%'
    OR `companyNames` LIKE '%Ag_mba%'
    OR `companyNames` LIKE '%Aga_mba%'
    OR `companyNames` LIKE '%Agaba%'
    OR `companyNames` LIKE '%Aga_ba%'
    OR `companyNames` LIKE '%Agam_ba%'
    OR `companyNames` LIKE '%Agama%'
    OR `companyNames` LIKE '%Agam_a%'
    OR `companyNames` LIKE '%Agamb_a%'
    OR `companyNames` LIKE '%Agamb%'
    OR `companyNames` LIKE '%Agamb_%'
    OR `companyNames` LIKE '%Agamba_%';
```

15 results in 30 ms.
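A query of this shape is easier to build in code than by hand. The sketch below assembles it with placeholders so it can be run as a prepared statement. The helper name `buildLikeQuery` is hypothetical; the table and column names are the ones from the query above.

```php
<?php
/**
 * Turn a list of LIKE patterns into one parameterised query.
 * Returns [sql, params], suitable for PDO::prepare() and execute().
 */
function buildLikeQuery(array $patterns): array
{
    // One "LIKE ?" clause per pattern, joined with OR.
    $clauses = array_fill(0, count($patterns), '`companyNames` LIKE ?');
    $sql = 'SELECT * FROM `companyList` WHERE ' . implode(' OR ', $clauses);

    // Wrap each pattern in % so it matches anywhere in the name.
    $params = array_map(
        fn (string $pattern): string => '%' . $pattern . '%',
        $patterns
    );

    return [$sql, $params];
}

[$sql, $params] = buildLikeQuery(['Ag_mba', 'Agmba']);
echo $sql, PHP_EOL;
print_r($params);

// With an open PDO connection in $pdo (not shown here):
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
```

If the search term comes from a user, any literal `%` or `_` in it would need escaping before the patterns are generated, otherwise it acts as a wildcard too.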

## Use case - Algolia

• 15 results in 1 ms.

• easy update of data

• auto index generated

## Use case - Algolia

• 15 results in 1 ms when searching for 'Agamba'

## Use case - Elastic

• No easy way to import data

• Supports fuzzy matching

• Recommended to keep fuzzy matching to an edit distance of at most 2
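For illustration, a fuzzy search in Elasticsearch can be expressed as a `match` query with a `fuzziness` setting. The sketch below only builds the JSON request body; the helper name `fuzzyMatchBody` is made up, and the field name `companyNames` is borrowed from the MySQL example as an assumption about how the data would be indexed.

```php
<?php
/**
 * Build the body of an Elasticsearch "match" query with fuzziness.
 * $distance is the maximum edit distance to tolerate (0, 1 or 2).
 */
function fuzzyMatchBody(string $field, string $term, int $distance): array
{
    return [
        'query' => [
            'match' => [
                $field => [
                    'query' => $term,
                    'fuzziness' => $distance,
                ],
            ],
        ],
    ];
}

$body = fuzzyMatchBody('companyNames', 'Agamba', 1);
echo json_encode($body, JSON_PRETTY_PRINT), PHP_EOL;
```

The resulting JSON would be sent to the index's `_search` endpoint with whatever HTTP client or Elasticsearch client library the project already uses.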

## Embulk

```
[
"ninaj",
[
"ninja",
"ninja turtles",
"ninja blender",
...
],
{
1250,
601,
600,
],
...
}
]
```

• output: type of response you want

• hl: the language's 2-letter abbreviation

```
["нинд",
["нинджаго","нинджа","нинджи",
"нинджаго господари на спинджицу",
"нинджаго песни",
"нинджаго майсторите на спинджицу",
"нинджаго сезон 7","нинджа убиец",
"нинджаго играчки","нинджа игри"]
]
```

## Typos

• Skipped letter

• Doubled letter

• Reversed letters

• Skipped space

• Missed key

• Inserted key
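Some of these typo classes are easy to generate, which is handy for building test data or a lookup index. The sketch below covers skipped letters (including skipped spaces), doubled letters and reversed letters. Missed and inserted keys are left out because they depend on a keyboard layout. The function name `typoVariants` is hypothetical, and the code is byte-based.

```php
<?php
/**
 * Generate simple typo variants of $word:
 * skipped character, doubled character, two adjacent characters reversed.
 */
function typoVariants(string $word): array
{
    $variants = [];
    $n = strlen($word);

    for ($i = 0; $i < $n; $i++) {
        // Skipped letter (a skipped space when position $i holds a space).
        $variants[] = substr($word, 0, $i) . substr($word, $i + 1);

        // Doubled letter: repeat the character at position $i.
        $variants[] = substr($word, 0, $i + 1) . substr($word, $i);

        // Reversed letters: swap position $i with the next one.
        if ($i < $n - 1) {
            $variants[] = substr($word, 0, $i)
                . $word[$i + 1] . $word[$i]
                . substr($word, $i + 2);
        }
    }

    // Drop duplicates and anything identical to the original word.
    return array_values(array_unique(array_diff($variants, [$word])));
}

print_r(typoVariants('ninja')); // includes 'ninaj', 'nija' and 'ninjaa'
```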

## Why do we need it?

• Autocomplete

• Search with typos

• Bots

# Thanks for watching

Lyubomir Filipov  *  @FilipovG

https://joind.in/talk/88029
