Tips and Tricks on Developing High-Performance Fuzzy Name Search Engine to Prevent Terrorism Financing

Emrah Budur

Sr. Software Engineer

Garanti Technology

 

 

21-22 June 2016| Chicago

emrahbu@garanti.com.tr

@ebudur

9/11

Terrorism Financing

{1:F01MIDLGB22AXXX0548034693}{2:I103BKTRUS33XBRDN3}{3:{108:MT103}}{4:
:20:8861198-0706
:23B:CRED
:32A:000612USD5443,99
:33B:USD5443,99
:50K:IAN INTERNATIONAL LIMITED
:52A:BCITITMM500
:53A:BCITUS33
:54A:IRVTUS3N
:57A:BNPAFRPPGRE
:59:/20041010050500001M02606
AHMET EMRE TOASBIQ 
:70:/RFB/INVOICE SENT TO TAMERLAAN 
TZARNAEV
PLAZA DE ESPANA, 1 28934 MOSTOLES (MADRID)
:71A:SHA
-} 

Find the problem

Sample Swift Message

Screening

OFAC

30 mio

2 K

x 15000

What is so challenging?

{1:F01MIDLGB22AXXX0548034693}{2:I103BKTRUS33XBRDN3}{3:{108:MT103}}{4:
:20:8861198-0706
:23B:CRED
:32A:000612USD5443,99
:33B:USD5443,99
:50K:IAN INTERNATIONAL LIMITED
:52A:BCITITMM500
:53A:BCITUS33
:54A:IRVTUS3N
:57A:BNPAFRPPGRE
:59:/20041010050500001M02606
AHMET EMRE TOASBIQ 
:70:/RFB/INVOICE SENT TO TAMERLAAN 
TZARNAEV
PLAZA DE ESPANA, 1 28934 MOSTOLES (MADRID)
:71A:SHA
-} 

Unstructured text

Misspellings

Noisy words

Name variations

Low latency

False positives

How Do Search Engines Work?

Developing a High-Performance Fuzzy Name Search Engine

 

Index

Naive

5 paths

34 nodes

Tip 1: Trie

5 paths

23 nodes

32% space efficient
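
The node counts can be reproduced with a small sketch. The word list below is hypothetical (it is not necessarily the one pictured on the slide), and the helper names are made up for illustration. A naive index stores one node per character of every name, while a trie stores one node per distinct non-empty prefix, so shared prefixes are counted once.

```python
def naive_node_count(words):
    """One node per character of every word (no sharing at all)."""
    return sum(len(word) for word in words)


def trie_node_count(words):
    """Trie nodes (root excluded) correspond to distinct non-empty prefixes."""
    prefixes = {word[:i] for word in words for i in range(1, len(word) + 1)}
    return len(prefixes)


# Hypothetical sample of five name variants (five root-to-leaf paths).
NAMES = ["HASAN", "HASSAN", "HUSEIN", "HUSSEIN", "HOSSEIN"]

naive = naive_node_count(NAMES)
trie = trie_node_count(NAMES)
print(f"naive: {naive} nodes, trie: {trie} nodes")
```

The saving comes entirely from shared prefixes such as "H", "HAS" and "HUS".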

Tip 2: DAG

5 paths

12 nodes

65% space efficient
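
A minimal sketch of how a trie can be collapsed into a DAG: subtrees that accept exactly the same suffixes are merged into one node. The word list and function names are hypothetical, and this bottom-up signature approach is only one of several ways to build such a graph, not necessarily the one used in the talk.

```python
def build_trie(words):
    """Build a plain dict-based trie; each node has an end flag and children."""
    root = {"end": False, "kids": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["kids"].setdefault(ch, {"end": False, "kids": {}})
        node["end"] = True
    return root


def canonical_id(node, registry):
    """Assign equal ids to nodes whose subtrees accept the same suffixes."""
    signature = (
        node["end"],
        tuple(sorted((ch, canonical_id(kid, registry))
                     for ch, kid in node["kids"].items())),
    )
    return registry.setdefault(signature, len(registry))


def trie_size(node):
    """Number of nodes in the trie, root included."""
    return 1 + sum(trie_size(kid) for kid in node["kids"].values())


def count_nodes(words):
    """Return (trie nodes, DAG nodes), both excluding the root."""
    root = build_trie(words)
    registry = {}
    canonical_id(root, registry)
    return trie_size(root) - 1, len(registry) - 1


# Hypothetical sample; shared suffixes such as "EIN" collapse in the DAG.
NAMES = ["HASAN", "HASSAN", "HUSEIN", "HUSSEIN", "HOSSEIN"]
print(count_nodes(NAMES))
```

The trie shares only prefixes; the DAG additionally shares suffixes, which is where the extra saving comes from.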

Search

Edit Distance

      C  L  I  N  T  O  N
   0  1  2  3  4  5  6  7
T  1  1  2  3  4  4  5  6
R  2  2  2  3  4  5  5  6
U  3  3  3  3  4  5  6  6
M  4  4  4  4  4  5  6  7
P  5  5  5  5  5  5  6  7

T  R  U           M  P

C  L   I   N  T  O  N

Tip 3: Edit Distance on Trie
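
A minimal sketch of the idea, assuming unit edit costs and a simple class-based trie (the names `TrieNode`, `build_trie` and `fuzzy_search` are illustrative, not from the talk). Each trie node extends the dynamic-programming matrix by one row, so names sharing a prefix share the rows for that prefix, and a branch is abandoned once every cell in its row exceeds the allowed distance.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set only on nodes that end an indexed name


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def fuzzy_search(root, query, max_dist):
    """Return sorted (name, distance) pairs with distance <= max_dist."""
    results = []

    def walk(node, ch, prev_row):
        # Compute the next DP row for the character on this trie edge.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(
                row[j - 1] + 1,                           # skip a query char
                prev_row[j] + 1,                          # skip a trie char
                prev_row[j - 1] + (query[j - 1] != ch),   # match or substitute
            ))
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # Prune: no descendant can end up below the row minimum.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(results)


index = build_trie(["HUSSEIN", "HOSSEIN", "HUSEIN", "HASAN", "TRUMP", "CLINTON"])
print(fuzzy_search(index, "HUSEIN", 1))
```

With the threshold raised to 7, searching for TRUMP also returns CLINTON at distance 7, which agrees with the bottom-right cell of the matrix above.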

Query: HUSEIN

Tip 4: Weighting Errors

H  O  S  S  E  I  N
H  U  S     E  I  N      Edit Distance = 2

T  U  R  K  I  Y  E
S  U  R     I  Y  E      Edit Distance = 2

Cost Matrix

sub_cost [o, u] = 0.2 
ins_cost [s, s] = 0.2 

Substitution

Insertion

Deletion

Transposition

Hosein vs Hussein

Edit Distance = 0.2 + 0.2 = 0.4

Turkiye vs Suriye

sub_cost [t, s] = 0.993
ins_cost [k, r] = 0.996

Edit Distance = 0.993 + 0.996 = 1.989

Confusion Matrix

Substitution

Insertion

Deletion

Transposition

sub [o, u] = 39 
sub [s, t] = 15 

Frequency of Errors

Error types (correct spelling: Hussein)

  • Husein: deletion of s after s
  • Husssein: insertion of s after s
  • Hossein: substitution of u with o
  • Hussien: transposition of e and i

Tip 10: Generate 1-Edit Errors

  • Husein

Cross match all aliases

Hossein, Husein, Hussein

  • Usama

Usame, Osama, Osame
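
One way to read this tip is sketched below: cross match every pair of aliases, keep the pairs that are exactly one edit apart, and tally which edit it was. The function names are hypothetical, and treating the first alias of each pair as the "typed" form and the second as the "intended" form is an arbitrary assumption, since an alias list does not say which spelling is the original.

```python
from collections import Counter
from itertools import combinations


def classify_single_edit(typed, intended):
    """Return (kind, x, y) if typed is exactly one edit from intended, else None."""
    lt, li = len(typed), len(intended)
    i = 0  # length of the common prefix
    while i < min(lt, li) and typed[i] == intended[i]:
        i += 1
    if lt == li:
        if typed == intended:
            return None
        if typed[i + 1:] == intended[i + 1:]:
            return ("sub", typed[i], intended[i])      # typed x instead of y
        if (i + 1 < li and typed[i] == intended[i + 1]
                and typed[i + 1] == intended[i]
                and typed[i + 2:] == intended[i + 2:]):
            return ("trans", intended[i], intended[i + 1])
        return None
    prev = intended[i - 1] if i > 0 else "#"           # "#" marks word start
    if lt + 1 == li and typed[i:] == intended[i + 1:]:
        return ("del", prev, intended[i])              # y deleted after x
    if lt == li + 1 and typed[i + 1:] == intended[i:]:
        return ("ins", prev, typed[i])                 # y inserted after x
    return None


def count_one_edit_errors(alias_groups):
    """Cross match all aliases in each group and tally the 1-edit differences."""
    counts = Counter()
    for group in alias_groups:
        for a, b in combinations(group, 2):
            edit = classify_single_edit(a, b)
            if edit is not None:
                counts[edit] += 1
    return counts


groups = [["hossein", "husein", "hussein"], ["usama", "usame", "osama", "osame"]]
print(count_one_edit_errors(groups))
```

Tallies of this kind are what a confusion matrix such as sub [o, u] stores; pairs more than one edit apart (hossein vs husein, for example) are skipped.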

Cost matrices

P(x|w) = \begin{cases}
\frac{del[w_{i-1}, w_i]}{count[w_{i-1} w_i]}, & \text{if deletion} \\
\frac{ins[w_{i-1}, x_i]}{count[w_{i-1}]}, & \text{if insertion} \\
\frac{sub[x_i, w_i]}{count[w_i]}, & \text{if substitution} \\
\frac{trans[w_i, w_{i+1}]}{count[w_i w_{i+1}]}, & \text{if transposition}
\end{cases}

Cost = 1 - Prob

Weighted Edit Distance on Trie
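
A minimal sketch of the weighted distance on two plain strings (not on a trie), using only the four cost values quoted on the slides and a default cost of 1.0 for everything else. Several choices here are assumptions rather than the talk's implementation: costs are looked up symmetrically, insertions and deletions share one table keyed as (character, preceding character), and transpositions are left out.

```python
# Cost values quoted on the slides; every other edit costs 1.0 (assumption).
SUB_COST = {("o", "u"): 0.2, ("t", "s"): 0.993}
INDEL_COST = {("s", "s"): 0.2, ("k", "r"): 0.996}  # (character, preceding character)


def sub_cost(a, b):
    if a == b:
        return 0.0
    return SUB_COST.get((a, b), SUB_COST.get((b, a), 1.0))


def indel_cost(text, pos):
    """Cost of inserting or deleting text[pos], given the character before it."""
    prev = text[pos - 1] if pos > 0 else ""
    return INDEL_COST.get((text[pos], prev), 1.0)


def weighted_edit_distance(source, target):
    s, t = source.lower(), target.lower()
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = d[i - 1][0] + indel_cost(s, i - 1)
    for j in range(1, len(t) + 1):
        d[0][j] = d[0][j - 1] + indel_cost(t, j - 1)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(s, i - 1),               # delete from source
                d[i][j - 1] + indel_cost(t, j - 1),               # insert from target
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),   # match or substitute
            )
    return d[-1][-1]


print(round(weighted_edit_distance("Hosein", "Hussein"), 3))
print(round(weighted_edit_distance("Turkiye", "Suriye"), 3))
```

Both pairs are two edits apart under unit costs, but the weighted version separates a likely spelling variant from two unrelated names.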

Query: HUSEIN

Filter

Tip 5: Frequent terms

INNOCENTA CORPORATION

MATCHED ENTITY                Score
BADDY CORPORATION             95
WANTED INTL CORPORATION       94
SANCTIONED CORPORATION LTD    92
BOMBER CORPORATION            90
NARCOTIC CORPORATION          84

Tip 5: Frequent terms

TERMS            Doc Frequency
CORPORATION      7000+
BANK             6000+
INTERNATIONAL    5000+
SECURITIES       3000+
GLOBAL           2000+
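
A minimal sketch of the frequent-term filter, assuming whitespace tokenization and a hypothetical cut-off `max_df`; the records are the example names from the slide, not real list data. Terms that occur in many records carry little identifying power, so they are dropped before matching.

```python
from collections import Counter


def document_frequency(names):
    """Count in how many names each term occurs (once per name)."""
    df = Counter()
    for name in names:
        df.update(set(name.upper().split()))
    return df


def drop_frequent_terms(name, df, max_df):
    """Keep only the terms whose document frequency is at most max_df."""
    return [term for term in name.upper().split() if df[term] <= max_df]


records = [
    "BADDY CORPORATION",
    "WANTED INTL CORPORATION",
    "SANCTIONED CORPORATION LTD",
    "BOMBER CORPORATION",
    "NARCOTIC CORPORATION",
]
df = document_frequency(records)
print(df["CORPORATION"])
print(drop_frequent_terms("INNOCENTA CORPORATION", df, max_df=2))
```

After filtering, the query is compared on INNOCENTA alone, so a shared CORPORATION no longer inflates the score.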

Tip 6: Frequent termset

INNOCENTA GLOBAL CORPORATION

MATCHED ENTITY                   Score
BADDY GLOBAL CORPORATION         95
WANTED GLOBAL CORPORATION        94
SANCTIONED CORPORATION GLOBAL    92
BOMBER GLOBAL CORPORATION        90
NARCOTIC GLOBAL CORPORATION      84

Tip 6: Frequent termset

RECORDS                        Support / Frequency
CORPORATION + INTERNATIONAL    200+
GLOBAL + CORPORATION           100+
CORPORATION + SECURITIES       40+
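
A minimal sketch of counting termset support, assuming termsets are unordered pairs of terms that occur in the same record and using a hypothetical `min_support` threshold. The records are the example names from the slide.

```python
from collections import Counter
from itertools import combinations


def termset_support(names, size=2):
    """Count how many names contain each unordered set of `size` terms."""
    support = Counter()
    for name in names:
        terms = sorted(set(name.upper().split()))
        support.update(combinations(terms, size))
    return support


def frequent_termsets(names, min_support, size=2):
    """Termsets that occur together in at least min_support names."""
    support = termset_support(names, size)
    return {termset for termset, n in support.items() if n >= min_support}


records = [
    "BADDY GLOBAL CORPORATION",
    "WANTED GLOBAL CORPORATION",
    "SANCTIONED CORPORATION GLOBAL",
    "BOMBER GLOBAL CORPORATION",
    "NARCOTIC GLOBAL CORPORATION",
]
print(frequent_termsets(records, min_support=2))
```

Sorting the terms before pairing makes GLOBAL + CORPORATION and CORPORATION + GLOBAL count as the same termset, regardless of word order in the record.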

Tip 7: Unique termset

RECORDS                             Support / Frequency
GLOBAL SECURITIES CORPORATION       1
INTERNATIONAL BANK                  1
INTERNATIONAL TECHNOLOGIES CORP.    1

TERMS            Doc Frequency
CORPORATION      7000+
BANK             6000+
INTERNATIONAL    5000+
SECURITIES       3000+
GLOBAL           2000+

Information (IDF)

LIMITED

CORPORATION

LAVRENIUK

e.g. Microsoft Corporation vs Google Corporation

e.g. Lavrenyuk Limited vs Lavreniuk Corporation

Scoring

Mutual Information (MI) = IDF x Similarity

Information (IDF) = \log \frac{N}{df}

IDF(\text{Lavreniuk}) = \log \frac{5361}{1} = 3.72
IDF(\text{Corporation}) = \log \frac{5361}{1212} = 0.64
MI(\text{Lavrenyuk vs Lavreniuk}) = 3.72 \times 0.97 = 3.61

Scoring

Lavrenyuk Limited vs Lavreniuk Corporation

Total Mutual Info (TMI) = MI (Lavrenyuk vs Lavreniuk) = 3.61

Total Info (TI) = IDF (Lavreniuk) + IDF (Corporation) = 3.72 + 0.64 = 4.36

Score = \frac{TMI}{TI} \times 100 = \frac{3.61}{4.36} \times 100 = 82.7 \%
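
The arithmetic of this example can be sketched as follows, assuming a base-10 logarithm (which reproduces the IDF values on the slide) and taking the 0.97 name similarity as a given input. The function names are illustrative. Note that the slide truncates intermediate values to two decimals, so the unrounded IDF values printed here differ slightly in the last digit.

```python
import math


def idf(term, doc_freq, n_docs):
    """Information carried by a term: log10(N / df)."""
    return math.log10(n_docs / doc_freq[term])


def match_score(record_terms, matched, doc_freq, n_docs):
    """Score = 100 * TMI / TI.

    record_terms: all terms of the record being scored
    matched: {record term: similarity to its best-matching query term}
    """
    total_info = sum(idf(t, doc_freq, n_docs) for t in record_terms)
    total_mutual = sum(idf(t, doc_freq, n_docs) * sim for t, sim in matched.items())
    return 100.0 * total_mutual / total_info


N_DOCS = 5361
DOC_FREQ = {"LAVRENIUK": 1, "CORPORATION": 1212}

score = match_score(
    ["LAVRENIUK", "CORPORATION"],   # record: Lavreniuk Corporation
    {"LAVRENIUK": 0.97},            # Lavrenyuk vs Lavreniuk similarity from the slide
    DOC_FREQ,
    N_DOCS,
)
print(round(idf("LAVRENIUK", DOC_FREQ, N_DOCS), 4))
print(round(idf("CORPORATION", DOC_FREQ, N_DOCS), 4))
print(round(score, 1))
```

Because CORPORATION carries little information, failing to match it costs the score far less than failing to match a rare surname would.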

Tip 8: Score Maximization

        Hong  Kong
Hong    0     1
Kong    1     0

RECORD

QUERY

Query: Hong Kong

Tip: Hungarian Algorithm

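
Score maximization can be read as an assignment problem: pair each query token with a distinct record token so that the total similarity is as high as possible, whatever the word order. The sketch below solves it by brute force over permutations, which is only practical for a handful of tokens; the Hungarian algorithm named on the slide finds the same optimum in polynomial time. `difflib.SequenceMatcher` is a stand-in similarity measure chosen for this illustration, not the engine's weighted edit distance.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a, b):
    """Stand-in token similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_assignment(query_tokens, record_tokens):
    """Return (best total similarity, list of (query token, record token) pairs)."""
    q, r = list(query_tokens), list(record_tokens)
    swapped = len(q) > len(r)
    if swapped:                      # always permute the longer side
        q, r = r, q
    best_score, best_pairs = -1.0, []
    for perm in permutations(range(len(r)), len(q)):
        score = sum(similarity(q[i], r[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score = score
            best_pairs = [(q[i], r[j]) for i, j in enumerate(perm)]
    if swapped:
        best_pairs = [(b, a) for a, b in best_pairs]
    return best_score, best_pairs


score, pairs = best_assignment(["Kong", "Hong"], ["Hong", "Kong"])
print(score, pairs)
```

A position-by-position comparison would pair Kong with Hong and Hong with Kong and score lower; the assignment view recovers the exact match despite the swapped order.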

Experiment Driven Development

Evaluation

\text{Precision} = \frac{\#\text{relevant items retrieved}}{\#\text{retrieved items}}
\text{Recall} = \frac{\#\text{relevant items retrieved}}{\#\text{relevant items}}
F_\beta = (1+\beta^2) \frac{PR}{\beta^2 P+R}
\beta \in \{1,5\}
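
A minimal sketch of these metrics with made-up counts (8 relevant hits among 10 retrieved, 16 relevant items in total); the function names are illustrative. With beta = 5 the measure weights recall much more heavily than precision, which suits screening, where a missed match is costlier than a false alarm.

```python
def precision(relevant_retrieved, retrieved):
    return relevant_retrieved / retrieved if retrieved else 0.0


def recall(relevant_retrieved, relevant):
    return relevant_retrieved / relevant if relevant else 0.0


def f_beta(p, r, beta):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denominator = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denominator if denominator else 0.0


# Hypothetical counts, for illustration only.
p = precision(8, 10)
r = recall(8, 16)
for beta in (1, 5):
    print(beta, round(f_beta(p, r, beta), 4))
```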

Ground Truth

(No free lunch)

Query Text                 Match Text          UID     Score   Action
ENHUK LEGAN NAKL LTD       ENHUK LEGAN INC     13421   96
ALPHA BETA BANKING CORP    ABBC                54611   100
BETA USA TOURS             BETA AIRLINES       91887   96
GALAXY AIRLINES            GALAXY AVIATIONS    57166   95

Tip 10: Aggregate Data

True Positive

Summary

  • Index
    • Trie, DAG
  • Search
    • Weighted Edit Distance
  • Filter
    • Frequent Termset
    • Unique Termset
  • Ranking
    • TF-IDF
    • Decision Tree
    • Score Maximization
  • EDD (Experiment Driven Development)

What's next?

Word2Vec

Adobe Systems, Inc. 345 Park Ave. San Jose, CA 95110, US
Groupo San Jose
Spotify Inc, 988 Market St, San Francisco, CA 94109, US
Francisco Cal

http://noodle.org.uk

 

 

Questions

TAW2016

By Emrah Budur
