Tips and Tricks on Developing High-Performance Fuzzy Name Search Engine to Prevent Terrorism Financing
Emrah Budur
Sr. Software Engineer
Garanti Technology
21-22 June 2016 | Chicago
emrahbu@garanti.com.tr
@ebudur
9/11
Terrorism Financing
{1:F01MIDLGB22AXXX0548034693}{2:I103BKTRUS33XBRDN3}{3:{108:MT103}}{4:
:20:8861198-0706
:23B:CRED
:32A:000612USD5443,99
:33B:USD5443,99
:50K:IAN INTERNATIONAL LIMITED
:52A:BCITITMM500
:53A:BCITUS33
:54A:IRVTUS3N
:57A:BNPAFRPPGRE
:59:/20041010050500001M02606
AHMET EMRE TOASBIQ
:70:/RFB/INVOICE SENT TO TAMERLAAN
TZARNAEV
PLAZA DE ESPANA, 1 28934 MOSTOLES (MADRID)
:71A:SHA
-}
Find the problem
Sample Swift Message
Screening
OFAC
30 mio
2 K
x 15000
What is so challenging?
Unstructured text
Misspellings
Noisy words
Name variations
Low latency
False positives
How Do Search Engines Work?
Developing High Performance Fuzzy Name Search Engine
Index
Naive
5 paths
34 nodes
Tip 1: Trie
5 paths
23 nodes
32% space efficient
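A minimal sketch of the idea behind Tip 1, assuming a plain dictionary-based trie. The five names below are hypothetical stand-ins (the slide's own five paths are not recoverable from the text), so the node counts differ from the 34 and 23 quoted above.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.terminal = False  # True if a name ends at this node


def build_trie(names):
    """Insert every name character by character, sharing common prefixes."""
    root = TrieNode()
    for name in names:
        node = root
        for ch in name:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.terminal = True
    return root


def count_nodes(node):
    """Count this node and everything below it."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


# Hypothetical sample names, not taken from the slides.
names = ["HASAN", "HASIN", "HUSAN", "HUSEIN", "HUSSEIN"]
naive_nodes = sum(len(name) for name in names)   # one node per character
trie_nodes = count_nodes(build_trie(names)) - 1  # exclude the empty root
print(naive_nodes, trie_nodes)
```

With these sample names the naive layout needs 28 character nodes and the trie needs 18, because shared prefixes such as HAS and HUS are stored once.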
Tip 2: DAG
5 paths
12 nodes
65% space efficient
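One way to get from a trie to a DAG is to merge subtrees that accept exactly the same suffixes. The sketch below does this bottom-up with a registry of node signatures; it is an assumption about how the compression could be done, not necessarily the method used in the talk. The sample names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.terminal = False  # True if a name ends at this node


def build_trie(names):
    root = TrieNode()
    for name in names:
        node = root
        for ch in name:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.terminal = True
    return root


def minimize(node, registry):
    """Return the canonical node for this subtree, merging equal suffixes."""
    # Canonicalize the children first (post-order traversal).
    for ch, child in list(node.children.items()):
        node.children[ch] = minimize(child, registry)
    # Two nodes are equivalent if they agree on terminal status and point
    # to the same canonical children through the same characters.
    signature = (
        node.terminal,
        tuple(sorted((ch, id(child)) for ch, child in node.children.items())),
    )
    return registry.setdefault(signature, node)


# Hypothetical sample names, not taken from the slides.
names = ["HASAN", "HASIN", "HUSAN", "HUSEIN", "HUSSEIN"]
registry = {}
dag_root = minimize(build_trie(names), registry)
dag_nodes = len(registry)  # distinct nodes left after merging, root included
print(dag_nodes)
```

For these names the trie has 19 nodes including the root and the merged graph has 10, since endings such as N, IN and EIN are stored once and shared.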
Search
Edit Distance
| | | C | L | I | N | T | O | N |
|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| T | 1 | 1 | 2 | 3 | 4 | 4 | 5 | 6 |
| R | 2 | 2 | 2 | 3 | 4 | 5 | 5 | 6 |
| U | 3 | 3 | 3 | 3 | 4 | 5 | 6 | 6 |
| M | 4 | 4 | 4 | 4 | 4 | 5 | 6 | 7 |
| P | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 7 |
T R U M P
C L I N T O N
Implementation details https://en.wikipedia.org/wiki/Edit_distance
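The table above can be reproduced with the standard dynamic-programming recurrence for Levenshtein distance. This is a minimal sketch with unit costs for insertion, deletion and substitution; the function name is illustrative.

```python
def edit_distance_table(source, target):
    """Build the full Levenshtein table: rows follow source, columns target."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete i characters from source
    for j in range(cols):
        table[0][j] = j  # insert j characters of target
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                # deletion
                table[i][j - 1] + 1,                # insertion
                table[i - 1][j - 1] + substitution  # match or substitution
            )
    return table


table = edit_distance_table("TRUMP", "CLINTON")
for row in table:
    print(row)
print("distance:", table[-1][-1])
```

The bottom-right cell is the edit distance between the two full strings, 7 in this example.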
Tip 3: Edit Distance on Trie
Query: HUSEIN
Implementation details http://stevehanov.ca/blog/index.php?id=114
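A sketch of the approach described in the linked article: walk the trie depth-first, compute one row of the edit-distance table per trie node, and abandon a branch as soon as every cell in its row exceeds the allowed cost. The indexed names and the cost limit below are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.terminal = False  # True if a name ends at this node


def build_trie(names):
    root = TrieNode()
    for name in names:
        node = root
        for ch in name:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.terminal = True
    return root


def fuzzy_search(root, query, max_cost):
    """Return (name, distance) for every indexed name within max_cost edits."""
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, ch, query, first_row, results, max_cost)
    return sorted(results)


def _walk(node, ch, prefix, query, prev_row, results, max_cost):
    # Extend the table by one row for the character on this trie edge.
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        row.append(min(
            row[j - 1] + 1,                          # insertion
            prev_row[j] + 1,                         # deletion
            prev_row[j - 1] + (query[j - 1] != ch),  # match or substitution
        ))
    if node.terminal and row[-1] <= max_cost:
        results.append((prefix, row[-1]))
    # Prune: no descendant can get below the smallest value in this row.
    if min(row) <= max_cost:
        for nxt, child in node.children.items():
            _walk(child, nxt, prefix + nxt, query, row, results, max_cost)


# Hypothetical index contents.
index = build_trie(["HUSSEIN", "HOSSEIN", "HUSEIN", "HASAN", "OSAMA"])
hits = fuzzy_search(index, "HUSEIN", max_cost=1)
print(hits)
```

Names sharing a prefix reuse the rows already computed for that prefix, which is where the saving over comparing the query against every name separately comes from.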
Tip 4: Weighting Errors
HOSSEIN vs HUSEIN: Edit Distance = 2
TURKIYE vs SURIYE: Edit Distance = 2
Cost Matrix
sub_cost [o, u] = 0.2
ins_cost [s, s] = 0.2
Substitution
Insertion
Deletion
Transposition
Hosein vs Hussein
Edit Distance = 0.4
Turkiye vs Suriye
sub_cost [t, s] = 0.993
ins_cost [k, r] = 0.996
Edit Distance = 1.989
Implementation details https://web.stanford.edu/class/cs124/lec/med.pdf
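A minimal sketch of a weighted edit distance that reproduces the two totals above. Several things here are assumptions rather than facts from the slides: substitution costs are looked up symmetrically, insertions and deletions share one table keyed by (character, preceding character), and any pair missing from a table costs 1.0.

```python
def weighted_edit_distance(a, b, sub_cost, indel_cost, default=1.0):
    """Edit distance where each operation has its own cost."""
    a, b = a.lower(), b.lower()

    def sub(x, y):
        if x == y:
            return 0.0
        # Assumption: substituting x with y costs the same as y with x.
        return sub_cost.get((x, y), sub_cost.get((y, x), default))

    def indel(text, pos):
        # Assumption: key is (inserted or deleted char, preceding char).
        prev = text[pos - 1] if pos > 0 else ""
        return indel_cost.get((text[pos], prev), default)

    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + indel(a, i - 1)
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + indel(b, j - 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + indel(a, i - 1),               # drop a[i-1]
                d[i][j - 1] + indel(b, j - 1),               # add b[j-1]
                d[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),   # match or swap
            )
    return d[-1][-1]


# Cost values quoted on the slides.
sub_cost = {("o", "u"): 0.2, ("t", "s"): 0.993}
indel_cost = {("s", "s"): 0.2, ("k", "r"): 0.996}

similar = weighted_edit_distance("Hosein", "Hussein", sub_cost, indel_cost)
different = weighted_edit_distance("Turkiye", "Suriye", sub_cost, indel_cost)
print(round(similar, 3), round(different, 3))
```

Both pairs are two edits apart under unit costs, but the weighted version separates a likely spelling variant (0.4) from two unrelated names (1.989).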
Confusion Matrix
Adopted from https://web.stanford.edu/class/cs124/lec/med.pdf
Substitution
Insertion
Deletion
Transposition
sub [o, u] = 39
sub [s, t] = 15
Frequency of Errors
Error types
Hussein
- Husein: deletion of s after s
- Husssein: insertion of s after s
- Hossein: substitution of u with o
- Hussien: transposition of e and i
Implementation details http://norvig.com/spell-correct.html
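The four error types can be enumerated mechanically, in the style of the linked spelling-corrector article. A short sketch, assuming a lowercase Latin alphabet; it returns every string exactly one edit away from the input.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, transposition, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutes = {
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    }
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    # Substituting a letter with itself reproduces the word; drop it.
    return (deletes | transposes | substitutes | inserts) - {word}


variants = edits1("hussein")
for candidate in ("husein", "husssein", "hossein", "hussien"):
    print(candidate, candidate in variants)
```

Counting how often each generated variant actually occurs in real data is one way to fill a confusion matrix like the one shown earlier.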
Tip 10: Generate 1-Edit Errors
Cross match all aliases
- Husein: Hossein, Husein, Hussein
- Usama: Usame, Osama, Osame
Cost matrices
Adopted from https://web.stanford.edu/class/cs124/lec/med.pdf
Weighted Edit Distance on Trie
Query: HUSEIN
Filter
Tip 5: Frequent terms
Query: INNOCENTA CORPORATION
MATCHED ENTITY | Score |
---|---|
BADDY CORPORATION | 95 |
WANTED INTL CORPORATION | 94 |
SANCTIONED CORPORATION LTD | 92 |
BOMBER CORPORATION | 90 |
NARCOTIC CORPORATION | 84 |
Tip 5: Frequent terms
TERMS | Doc Frequency |
---|---|
CORPORATION | 7000+ |
BANK | 6000+ |
INTERNATIONAL | 5000+ |
SECURITIES | 3000+ |
GLOBAL | 2000+ |
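A sketch of how document frequencies like those in the table could be computed and used to spot terms that carry little identifying value. The threshold and the helper names are hypothetical, and treating frequent terms as ignorable is an assumption about how the filter works.

```python
from collections import Counter


def document_frequency(records):
    """Count in how many records each term appears (once per record)."""
    df = Counter()
    for record in records:
        df.update(set(record.upper().split()))
    return df


def frequent_terms(records, min_df):
    """Terms that occur in at least min_df records."""
    return {term for term, n in document_frequency(records).items() if n >= min_df}


def informative_terms(name, frequent):
    """Terms of a name that are left after removing frequent ones."""
    return [term for term in name.upper().split() if term not in frequent]


# Entity names from the example slide; the threshold of 5 is hypothetical.
records = [
    "BADDY CORPORATION",
    "WANTED INTL CORPORATION",
    "SANCTIONED CORPORATION LTD",
    "BOMBER CORPORATION",
    "NARCOTIC CORPORATION",
]
frequent = frequent_terms(records, min_df=5)
print(frequent)
print(informative_terms("INNOCENTA CORPORATION", frequent))
```

Here CORPORATION is the only term shared by all five records, so INNOCENTA CORPORATION is left with INNOCENTA alone, which matches none of the listed entities.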
Tip 6: Frequent termset
Query: INNOCENTA GLOBAL CORPORATION
MATCHED ENTITY | Score |
---|---|
BADDY GLOBAL CORPORATION | 95 |
WANTED GLOBAL CORPORATION | 94 |
SANCTIONED CORPORATION GLOBAL | 92 |
BOMBER GLOBAL CORPORATION | 90 |
NARCOTIC GLOBAL CORPORATION | 84 |
Tip 6: Frequent termset
RECORDS | Support / Frequency |
---|---|
CORPORATION + INTERNATIONAL | 200+ |
GLOBAL + CORPORATION | 100+ |
CORPORATION + SECURITIES | 40+ |
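Support for term pairs can be counted the same way as single-term frequency, one pair per record. A small sketch using the entity names from the example slide; the function name is illustrative.

```python
from collections import Counter
from itertools import combinations


def pair_support(records):
    """Count in how many records each unordered pair of terms co-occurs."""
    support = Counter()
    for record in records:
        terms = sorted(set(record.upper().split()))
        for pair in combinations(terms, 2):
            support[frozenset(pair)] += 1
    return support


records = [
    "BADDY GLOBAL CORPORATION",
    "WANTED GLOBAL CORPORATION",
    "SANCTIONED CORPORATION GLOBAL",
    "BOMBER GLOBAL CORPORATION",
    "NARCOTIC GLOBAL CORPORATION",
]
support = pair_support(records)
print(support[frozenset({"GLOBAL", "CORPORATION"})])
print(support[frozenset({"BADDY", "GLOBAL"})])
```

Word order is ignored, so SANCTIONED CORPORATION GLOBAL counts toward the same pair as the other records.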
Tip 7: Unique termset
RECORDS | Support / Frequency |
---|---|
GLOBAL SECURITIES CORPORATION | 1 |
INTERNATIONAL BANK | 1 |
INTERNATIONAL TECHNOLOGIES CORP. | 1 |
TERMS | Doc Frequency |
---|---|
CORPORATION | 7000+ |
BANK | 6000+ |
INTERNATIONAL | 5000+ |
SECURITIES | 3000+ |
GLOBAL | 2000+ |
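The two tables suggest that a record made up entirely of frequent terms can still be unique as a combination, and so should not be filtered out. The sketch below finds such records; the interpretation, the helper names and the sample data are assumptions.

```python
from collections import Counter


def termset(name):
    """Order-insensitive set of terms in a name, ignoring periods."""
    return frozenset(name.upper().replace(".", "").split())


def unique_termsets(records, frequent):
    """Records whose terms are all frequent but whose combination occurs once."""
    support = Counter(termset(record) for record in records)
    return [
        record
        for record in records
        if termset(record) <= frequent and support[termset(record)] == 1
    ]


# Frequent terms from the table; the record list is a hypothetical sample.
frequent = {"CORPORATION", "BANK", "INTERNATIONAL", "SECURITIES", "GLOBAL"}
records = [
    "GLOBAL SECURITIES CORPORATION",
    "INTERNATIONAL BANK",
    "BADDY GLOBAL CORPORATION",
    "WANTED INTERNATIONAL BANK",
]
protected = unique_termsets(records, frequent)
print(protected)
```

A query matching one of these records on every term is a meaningful hit even though each individual term is common.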
Information (IDF)
LIMITED
CORPORATION
LAVRENIUK
e.g. Microsoft Corporation
vs
Google Corporation
e.g. Lavrenyuk Limited
vs
Lavreniuk Corporation
Scoring
Mutual Information (MI) = IDF x Similarity
Information (IDF) =
Lavrenyuk Limited vs Lavreniuk Corporation
MI (Lavrenyuk vs Lavreniuk) = 3.61
Total Mutual Info (TMI) = 3.61
IDF (Lavreniuk) = 3.72
IDF (Corporation) = 0.64
Total Info (TI) = 3.72 + 0.64 = 4.36
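The arithmetic of the worked example can be sketched as follows. The IDF values are the ones shown above; the similarity of 0.97 is back-derived from 3.61 / 3.72 and is not stated on the slides, and the function and variable names are hypothetical. How TMI and TI are combined into a final score is not shown in the source, so the sketch stops at the two totals.

```python
def total_information(record_terms, idf):
    """TI: sum of the IDF of every term in the record."""
    return sum(idf[term] for term in record_terms)


def total_mutual_information(query_terms, record_terms, idf, similarity):
    """TMI: for each record term, IDF times its best similarity to a query term."""
    total = 0.0
    for record_term in record_terms:
        best = max(similarity(query_term, record_term) for query_term in query_terms)
        total += idf[record_term] * best
    return total


# IDF values from the worked example.
idf = {"LAVRENIUK": 3.72, "CORPORATION": 0.64}

# Assumed similarity values; 0.97 is inferred, not given in the source.
assumed_similarity = {("LAVRENYUK", "LAVRENIUK"): 0.97}


def similarity(query_term, record_term):
    if query_term == record_term:
        return 1.0
    return assumed_similarity.get((query_term, record_term), 0.0)


query = ["LAVRENYUK", "LIMITED"]
record = ["LAVRENIUK", "CORPORATION"]
tmi = total_mutual_information(query, record, idf, similarity)
ti = total_information(record, idf)
print(round(tmi, 2), round(ti, 2))
```

CORPORATION contributes nothing to TMI because no query term resembles it, while the rare surname carries nearly all of the information.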
Tip 8: Score Maximization
| | Hong | Kong |
|---|---|---|
| Hong | 0 | 1 |
| Kong | 1 | 0 |
RECORD
QUERY
Query: Hong Kong
Tip: Hungarian Algorithm
Implementation details https://en.wikipedia.org/wiki/Hungarian_algorithm
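Pairing query tokens with record tokens so that the total distance is smallest is an assignment problem. The sketch below solves it by exhaustive search over permutations, which is only practical for a handful of tokens; the Hungarian algorithm solves the same problem in polynomial time. The function names are illustrative, and the sketch assumes the query has no more tokens than the record.

```python
from itertools import permutations


def edit_distance(a, b):
    """Unit-cost Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(row[j - 1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = row
    return prev[-1]


def best_assignment(query_tokens, record_tokens):
    """Return (total cost, [(query index, record index), ...]) of the best pairing.

    Brute force stand-in for the Hungarian algorithm; requires
    len(query_tokens) <= len(record_tokens).
    """
    cost = [[edit_distance(q, r) for r in record_tokens] for q in query_tokens]
    best = None
    for perm in permutations(range(len(record_tokens)), len(query_tokens)):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, list(enumerate(perm)))
    return best


total, pairs = best_assignment(["KONG", "HONG"], ["HONG", "KONG"])
print(total, pairs)
```

Matching tokens in their written order would cost 2 here, while the best pairing swaps them and costs 0, so word order alone does not lower the score.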
Experiment Driven Development
Precision & Recall https://en.wikipedia.org/wiki/Precision_and_recall
Evaluation
Ground Truth
( No free lunch )
Query Text | Match Text | UID | Score | Action |
---|---|---|---|---|
ENHUK LEGAN NAKL LTD | ENHUK LEGAN INC | 13421 | 96 | |
ALPHA BETA BANKING CORP | ABBC | 54611 | 100 | |
BETA USA TOURS | BETA AIRLINES | 91887 | 96 | |
GALAXY AIRLINES | GALAXY AVIATIONS | 57166 | 95 | |
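Once the Action column of a ground-truth table like this is filled in, precision and recall follow directly from set arithmetic. A minimal sketch; the UIDs come from the table, but which of them count as relevant, and the extra UID 99999, are hypothetical labels chosen for illustration.

```python
def precision_recall(flagged, relevant):
    """Precision and recall of a set of flagged UIDs against the ground truth."""
    flagged, relevant = set(flagged), set(relevant)
    true_positives = len(flagged & relevant)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# UIDs returned by the engine (from the table above).
flagged = {13421, 54611, 91887, 57166}
# Hypothetical ground truth: 91887 is a false positive, 99999 was missed.
relevant = {13421, 54611, 57166, 99999}

precision, recall = precision_recall(flagged, relevant)
print(precision, recall)
```

Re-running this after every change to the index, cost matrices or filters is the core loop of experiment driven development.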
Tip 10: Aggregate Data
True Positive
Summary
- Index
- Trie, DAG
- Search
- Weighted Edit Distance
- Filter
- Frequent Termset
- Unique Termset
- Ranking
- TF-IDF
- Decision Tree
- Score Maximization
- EDD
What's next?
Word2Vec
Adobe Systems, Inc. 345 Park Ave. San Jose, CA 95110, US |
---|
Groupo San Jose |
Spotify Inc, 988 Market St, San Francisco, CA 94109, US |
---|
Francisco Cal |
http://noodle.org.uk
Questions
TAW2016