Emrah Budur
Sr. Software Engineer
Garanti Technology
21-22 June 2016| Chicago
emrahbu@garanti.com.tr
@ebudur
9/11
{1:F01MIDLGB22AXXX0548034693}{2:I103BKTRUS33XBRDN3}{3:{108:MT103}}{4:
:20:8861198-0706
:23B:CRED
:32A:000612USD5443,99
:33B:USD5443,99
:50K:IAN INTERNATIONAL LIMITED
:52A:BCITITMM500
:53A:BCITUS33
:54A:IRVTUS3N
:57A:BNPAFRPPGRE
:59:/20041010050500001M02606
AHMET EMRE TOASBIQ
:70:/RFB/INVOICE SENT TO TAMERLAAN
TZARNAEV
PLAZA DE ESPANA, 1 28934 MOSTOLES (MADRID)
:71A:SHA
-}
OFAC
30 mio
2 K
x 15000
Unstructured text
Misspellings
Noisy words
Name variations
Low latency
False positives
5 paths
34 nodes
5 paths
23 nodes
32% space efficient
5 paths
12 nodes
65% space efficient
| | | C | L | I | N | T | O | N |
|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| T | 1 | 1 | 2 | 3 | 4 | 4 | 5 | 6 |
| R | 2 | 2 | 2 | 3 | 4 | 5 | 5 | 6 |
| U | 3 | 3 | 3 | 3 | 4 | 5 | 6 | 6 |
| M | 4 | 4 | 4 | 4 | 4 | 5 | 6 | 7 |
| P | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 7 |
TRUMP vs CLINTON
Implementation details https://en.wikipedia.org/wiki/Edit_distance
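The matrix above can be reproduced with the classic dynamic-programming recurrence. The following is a minimal sketch with unit costs for insertion, deletion, and substitution; the function name `edit_distance` is chosen here for illustration and is not taken from the talk.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit costs (a sketch, not the talk's code)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] holds the distance between source[:i] and target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]


print(edit_distance("TRUMP", "CLINTON"))  # 7, the bottom-right cell of the table
```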
Query: HUSEIN
Implementation details http://stevehanov.ca/blog/index.php?id=114
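As a rough illustration of the trie-based lookup referenced above, the sketch below walks a trie while carrying one row of the edit-distance matrix per node, and prunes a branch as soon as every cell in the row exceeds the threshold. The class and function names, the sample name list, and the threshold of 2 are assumptions made for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set only on the node that ends a stored name


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word


def search(root, query, max_distance):
    """Return (name, distance) pairs within max_distance of query."""
    first_row = list(range(len(query) + 1))
    results = []
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, results, max_distance)
    return results


def _walk(node, ch, query, prev_row, results, max_distance):
    # Build the matrix row for the trie prefix that ends at this node.
    row = [prev_row[0] + 1]
    for col in range(1, len(query) + 1):
        insert_cost = row[col - 1] + 1
        delete_cost = prev_row[col] + 1
        replace_cost = prev_row[col - 1] + (query[col - 1] != ch)
        row.append(min(insert_cost, delete_cost, replace_cost))
    if node.word is not None and row[-1] <= max_distance:
        results.append((node.word, row[-1]))
    # Prune: descend only if some cell can still stay within the threshold.
    if min(row) <= max_distance:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, results, max_distance)


# Hypothetical name list, for illustration only.
root = TrieNode()
for name in ["HUSSEIN", "HOSSEIN", "HASAN", "OSAMA"]:
    insert(root, name)

print(sorted(search(root, "HUSEIN", 2)))  # [('HOSSEIN', 2), ('HUSSEIN', 1)]
```

Sharing one matrix row per trie node means names with a common prefix reuse the same partial computation, which is where the latency gain comes from.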
HOSSEIN vs HUSEIN
Edit Distance = 2
TURKIYE vs SURIYE
Edit Distance = 2
sub_cost [o, u] = 0.2
ins_cost [s, s] = 0.2
Substitution
Insertion
Deletion
Transposition
Hosein vs Hussein
Edit Distance = 0.4
Turkiye vs Suriye
sub_cost [t, s] = 0.993
ins_cost [k, r] = 0.996
Edit Distance = 1.989
Implementation details https://web.stanford.edu/class/cs124/lec/med.pdf
Adopted from https://web.stanford.edu/class/cs124/lec/med.pdf
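The weighted distances quoted on these slides can be reproduced with a small change to the unit-cost recurrence. This sketch makes several assumptions that the slides do not state: `ins_cost[(x, y)]` is read as "insert `x` after `y`", the same table is reused for deletions so the distance stays symmetric, `sub_cost` is looked up in both directions, and any pair missing from a table costs 1.0.

```python
def weighted_edit_distance(source, target, sub_cost, ins_cost, default=1.0):
    """Edit distance with per-character costs (a sketch under the stated assumptions)."""
    source, target = source.lower(), target.lower()

    def sub(a, b):
        if a == b:
            return 0.0
        return sub_cost.get((a, b), sub_cost.get((b, a), default))

    def gap(ch, prev):
        # Cost of inserting (or, by assumption, deleting) ch right after prev.
        return ins_cost.get((ch, prev), default)

    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        prev = source[i - 2] if i > 1 else ""
        dist[i][0] = dist[i - 1][0] + gap(source[i - 1], prev)
    for j in range(1, cols):
        prev = target[j - 2] if j > 1 else ""
        dist[0][j] = dist[0][j - 1] + gap(target[j - 1], prev)
    for i in range(1, rows):
        for j in range(1, cols):
            del_prev = source[i - 2] if i > 1 else ""
            ins_prev = target[j - 2] if j > 1 else ""
            dist[i][j] = min(
                dist[i - 1][j] + gap(source[i - 1], del_prev),       # deletion
                dist[i][j - 1] + gap(target[j - 1], ins_prev),       # insertion
                dist[i - 1][j - 1] + sub(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]


# Cost values as quoted on the slides.
sub_cost = {("o", "u"): 0.2, ("t", "s"): 0.993}
ins_cost = {("s", "s"): 0.2, ("k", "r"): 0.996}

print(round(weighted_edit_distance("Hosein", "Hussein", sub_cost, ins_cost), 3))  # 0.4
print(round(weighted_edit_distance("Turkiye", "Suriye", sub_cost, ins_cost), 3))  # 1.989
```

With unit costs both pairs are at distance 2; the weighted costs separate the likely spelling variant from the unrelated name.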
Substitution
Insertion
Deletion
Transposition
sub [o, u] = 39
sub [s, t] = 15
Frequency of Errors
Hussein
Deletion of s after s
Insertion of s after s
Substitution of u with o
Transposition of e and i
Implementation details http://norvig.com/spell-correct.html
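The four error types listed for "Hussein" can be enumerated in the style of the spelling corrector linked above. This is a minimal sketch of a single-edit candidate generator; the function name `edits1` follows that article, while the lowercase ASCII alphabet is an assumption made here.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, transposition, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutions = {
        left + c + right[1:] for left, right in splits if right for c in alphabet
    }
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return deletes | transposes | substitutions | inserts


candidates = edits1("hussein")
print("husein" in candidates)    # deletion of s after s
print("husssein" in candidates)  # insertion of s after s
print("hossein" in candidates)   # substitution of u with o
print("hussien" in candidates)   # transposition of e and i
```

Counting how often each such edit occurs in real data is one way to obtain error frequencies like the ones shown on this slide.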
Cross match all aliases
Hossein, Husein, Hussein
Usame, Osama, Osame
Adopted from https://web.stanford.edu/class/cs124/lec/med.pdf
Query: HUSEIN
Query: INNOCENTA CORPORATION
MATCHED ENTITY | Score |
---|---|
BADDY CORPORATION | 95 |
WANTED INTL CORPORATION | 94 |
SANCTIONED CORPORATION LTD | 92 |
BOMBER CORPORATION | 90 |
NARCOTIC CORPORATION | 84 |
TERMS | Doc Frequency |
---|---|
CORPORATION | 7000+ |
BANK | 6000+ |
INTERNATIONAL | 5000+ |
SECURITIES | 3000+ |
GLOBAL | 2000+ |
Query: INNOCENTA GLOBAL CORPORATION
MATCHED ENTITY | Score |
---|---|
BADDY GLOBAL CORPORATION | 95 |
WANTED GLOBAL CORPORATION | 94 |
SANCTIONED CORPORATION GLOBAL | 92 |
BOMBER GLOBAL CORPORATION | 90 |
NARCOTIC GLOBAL CORPORATION | 84 |
RECORDS | Support / Frequency |
---|---|
CORPORATION + INTERNATIONAL | 200+ |
GLOBAL + CORPORATION | 100+ |
CORPORATION + SECURITIES | 40+ |
RECORDS | Support / Frequency |
---|---|
GLOBAL SECURITIES CORPORATION | 1 |
INTERNATIONAL BANK | 1 |
INTERNATIONAL TECHNOLOGIES CORP. | 1 |
TERMS | Doc Frequency |
---|---|
CORPORATION | 7000+ |
BANK | 6000+ |
INTERNATIONAL | 5000+ |
SECURITIES | 3000+ |
GLOBAL | 2000+ |
LIMITED
CORPORATION
LAVRENIUK
e.g. Microsoft Corporation vs Google Corporation
e.g. Lavrenyuk Limited vs Lavreniuk Corporation
Mutual Information (MI) = IDF x Similarity
Information (IDF) =
Lavrenyuk Limited vs Lavreniuk Corporation
MI (Lavrenyuk vs Lavreniuk) = 3.61
Total Mutual Info (TMI) = 3.61
IDF (Lavreniuk) = 3.72
IDF (Corporation) = 0.64
Total Info (TI) = 3.72 + 0.64 = 4.36
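The totals above can be recomputed from the quoted IDF values. The sketch below takes the IDF numbers from the slide as given, since the IDF formula itself is not reproduced here. Two things are assumptions for illustration only: the similarity of 0.97 between Lavrenyuk and Lavreniuk, chosen so that IDF x Similarity lands near the quoted 3.61, and the final ratio TMI / TI, which is one plausible way to turn the two totals into a score and is not stated on the slides.

```python
def total_info(record_tokens, idf):
    """Sum of the IDF weights of all tokens in the record."""
    return sum(idf[token] for token in record_tokens)


def total_mutual_info(matched_pairs, idf):
    """Sum of IDF x Similarity over (record_token, similarity) pairs."""
    return sum(idf[token] * similarity for token, similarity in matched_pairs)


# IDF values as quoted on the slide.
idf = {"LAVRENIUK": 3.72, "CORPORATION": 0.64}

ti = total_info(["LAVRENIUK", "CORPORATION"], idf)

# Hypothetical similarities: 0.97 for Lavrenyuk vs Lavreniuk,
# 0.0 for Limited vs Corporation (no match).
tmi = total_mutual_info([("LAVRENIUK", 0.97), ("CORPORATION", 0.0)], idf)

score = tmi / ti  # assumed normalization, not taken from the slides

print(round(ti, 2))   # 4.36
print(round(tmi, 2))  # 3.61
```

Because CORPORATION carries little information, failing to match it costs far less than failing to match the rare token.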
QUERY \ RECORD | Hong | Kong |
---|---|---|
Hong | 0 | 1 |
Kong | 1 | 0 |
Query: Hong Kong
Implementation details https://en.wikipedia.org/wiki/Hungarian_algorithm
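Matching query tokens to record tokens regardless of word order is an assignment problem over a cost matrix like the one above. The sketch below solves it by brute force over permutations, which is only practical for a handful of tokens; it is not the Hungarian algorithm itself, which finds the same optimum in polynomial time. The function name and the swapped-order example are assumptions made for illustration.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (assignment, total_cost) for a square cost matrix.

    assignment[i] is the record-token index paired with query token i.
    Brute force: tries every permutation, so use only for small matrices.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best), sum(cost[i][best[i]] for i in range(n))


# Cost matrix from the slide: query "Hong Kong" vs record "Hong Kong".
same_order = [[0, 1],
              [1, 0]]
print(best_assignment(same_order))  # ([0, 1], 0)

# Hypothetical record "Kong Hong": the columns are swapped,
# yet the optimal assignment still has cost 0.
swapped = [[1, 0],
           [0, 1]]
print(best_assignment(swapped))  # ([1, 0], 0)
```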
Precision & Recall https://en.wikipedia.org/wiki/Precision_and_recall
( No free lunch )
Query Text | Match Text | UID | Score | Action |
---|---|---|---|---|
ENHUK LEGAN NAKL LTD | ENHUK LEGAN INC | 13421 | 96 | |
ALPHA BETA BANKING CORP | ABBC | 54611 | 100 | |
BETA USA TOURS | BETA AIRLINES | 91887 | 96 | |
GALAXY AIRLINES | GALAXY AVIATIONS | 57166 | 95 | |
Tip 10: Aggregate Data
True Positive
Adobe Systems, Inc. 345 Park Ave. San Jose, CA 95110, US |
---|
Groupo San Jose |
Spotify Inc, 988 Market St, San Francisco, CA 94109, US |
---|
Francisco Cal |
http://noodle.org.uk
Tips and Tricks on Developing High-Performance Fuzzy Name Search Engine to Prevent Terrorism Financing