Emrah Budur
Sr. Software Engineer
Garanti Technology
21-22 June 2016| Chicago
emrahbu@garanti.com.tr
@ebudur
9/11
{1:F01MIDLGB22AXXX0548034693}{2:I103BKTRUS33XBRDN3}{3:{108:MT103}}{4:
:20:8861198-0706
:23B:CRED
:32A:000612USD5443,99
:33B:USD5443,99
:50K:IAN INTERNATIONAL LIMITED
:52A:BCITITMM500
:53A:BCITUS33
:54A:IRVTUS3N
:57A:BNPAFRPPGRE
:59:/20041010050500001M02606
AHMET EMRE TOASBIQ
:70:/RFB/INVOICE SENT TO TAMERLAAN
TZARNAEV
PLAZA DE ESPANA, 1 28934 MOSTOLES (MADRID)
:71A:SHA
-}
OFAC
30 mio
2 K
x 15000
{1:F01MIDLGB22AXXX0548034693}{2:I103BKTRUS33XBRDN3}{3:{108:MT103}}{4:
:20:8861198-0706
:23B:CRED
:32A:000612USD5443,99
:33B:USD5443,99
:50K:IAN INTERNATIONAL LIMITED
:52A:BCITITMM500
:53A:BCITUS33
:54A:IRVTUS3N
:57A:BNPAFRPPGRE
:59:/20041010050500001M02606
AHMET EMRE TOASBIQ
:70:/RFB/INVOICE SENT TO TAMERLAAN
TZARNAEV
PLAZA DE ESPANA, 1 28934 MOSTOLES (MADRID)
:71A:SHA
-}
Unstructured text
Misspellings
Noisy words
Name variations
Low latency
False positives
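The MT103 message above carries party names in free-text fields, so a screening engine first has to pull those fields out. A minimal sketch, assuming only block 4 matters and that every field starts on its own line as `:TAG:`; the function name `parse_block4` is hypothetical and the message is shortened from the one shown above:

```python
import re

# Shortened copy of the sample message shown above.
MESSAGE = """{1:F01MIDLGB22AXXX0548034693}{2:I103BKTRUS33XBRDN3}{3:{108:MT103}}{4:
:20:8861198-0706
:50K:IAN INTERNATIONAL LIMITED
:59:/20041010050500001M02606
AHMET EMRE TOASBIQ
:70:/RFB/INVOICE SENT TO TAMERLAAN
TZARNAEV
PLAZA DE ESPANA, 1 28934 MOSTOLES (MADRID)
:71A:SHA
-}"""


def parse_block4(message):
    """Return a dict of tag -> value for the text block ({4: ... -})."""
    start = message.index("{4:") + len("{4:")
    end = message.index("-}", start)
    body = message[start:end]
    fields = {}
    # A field runs from its ":TAG:" line up to the next ":TAG:" line.
    pattern = re.compile(r"^:(\w+):(.*?)(?=^:\w+:|\Z)", re.M | re.S)
    for match in pattern.finditer(body):
        tag, value = match.group(1), match.group(2)
        # Collapse line breaks: names may be split across lines.
        fields[tag] = " ".join(value.split())
    return fields


fields = parse_block4(MESSAGE)
print(fields["50K"])  # ordering customer
print(fields["59"])   # beneficiary
print(fields["70"])   # remittance information (free text)
```

Joining the continuation lines matters here: the name in field 70 is split across two lines, so a line-by-line scan would never see it as one string.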
5 paths
34 nodes
5 paths
23 nodes
32% space efficient
5 paths
12 nodes
65% space efficient
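The three node counts above presumably compare a flat word list, a trie (shared prefixes) and a DAWG (shared prefixes and suffixes). The slide's word list is not recoverable, so the sketch below uses a hypothetical four-word list; it will not reproduce the 34 / 23 / 12 figures, only the same kind of comparison.

```python
END = "$"  # marker for "a word ends here"


def build_trie(words):
    """Build a nested-dict trie; each key is one letter."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def count_trie_nodes(node):
    """Count letter nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_trie_nodes(child)
               for ch, child in node.items() if ch != END)


def count_dawg_nodes(root):
    """Count distinct subtrees: identical suffix subtrees are merged,
    which is what a minimized DAWG stores."""
    seen = {}

    def canon(node):
        key = tuple(sorted(
            (ch, -1 if ch == END else canon(child))
            for ch, child in node.items()
        ))
        return seen.setdefault(key, len(seen))

    canon(root)
    return len(seen) - 1  # exclude the root state


words = ["TAP", "TAPS", "TOP", "TOPS"]  # hypothetical word list
flat = sum(len(w) for w in words)
trie = build_trie(words)
print(flat, count_trie_nodes(trie), count_dawg_nodes(trie))  # 14 7 4
```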
| | | C | L | I | N | T | O | N |
|---|---|---|---|---|---|---|---|---|
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| T | 1 | 1 | 2 | 3 | 4 | 4 | 5 | 6 |
| R | 2 | 2 | 2 | 3 | 4 | 5 | 5 | 6 |
| U | 3 | 3 | 3 | 3 | 4 | 5 | 6 | 6 |
| M | 4 | 4 | 4 | 4 | 4 | 5 | 6 | 7 |
| P | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 7 |
T R U M P
C L I N T O N
Implementation details https://en.wikipedia.org/wiki/Edit_distance
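The matrix above is the standard unit-cost (Levenshtein) table: each cell is the cheapest of a deletion, an insertion, or a substitution/match, and the bottom-right cell is the distance. A minimal sketch that keeps only two rows in memory; the function name is an illustration, not the author's code:

```python
def levenshtein(source, target):
    """Unit-cost edit distance between two strings."""
    previous = list(range(len(target) + 1))  # row for the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (s_char != t_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("TRUMP", "CLINTON"))  # 7, the bottom-right cell of the table
```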
Query: HUSEIN
Implementation details http://stevehanov.ca/blog/index.php?id=114
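The linked post combines the trie with the edit-distance table: one table row is computed per trie node, so names sharing a prefix share the work, and a branch is abandoned once every cell in its row exceeds the allowed distance. A minimal sketch of that idea; the class and function names and the name list are hypothetical:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on the node where a stored name ends


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word


def search(root, query, max_cost):
    """Return sorted (distance, name) pairs within max_cost of query."""
    first_row = list(range(len(query) + 1))
    results = []
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, results, max_cost)
    return sorted(results)


def _walk(node, ch, query, prev_row, results, max_cost):
    # Build the next row of the edit-distance table for this trie letter.
    row = [prev_row[0] + 1]
    for col in range(1, len(query) + 1):
        row.append(min(
            row[col - 1] + 1,
            prev_row[col] + 1,
            prev_row[col - 1] + (query[col - 1] != ch),
        ))
    if node.word is not None and row[-1] <= max_cost:
        results.append((row[-1], node.word))
    # Prune: if no cell is within budget, no descendant can match.
    if min(row) <= max_cost:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, results, max_cost)


root = TrieNode()
for name in ["HOSSEIN", "HUSSEIN", "HUSEIN", "HASAN", "OSAMA"]:
    insert(root, name)

print(search(root, "HUSEIN", 1))  # [(0, 'HUSEIN'), (1, 'HUSSEIN')]
```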
H O S S E I N
H U S E I N
Edit Distance = 2
T U R K I Y E
S U R I Y E
Edit Distance = 2
Substitution
Insertion
Deletion
Transposition
Hosein vs Hussein
sub_cost [o, u] = 0.2
ins_cost [s, s] = 0.2
Edit Distance = 0.4
Turkiye vs Suriye
sub_cost [t, s] = 0.993
ins_cost [k, r] = 0.996
Edit Distance = 1.989
Implementation details https://web.stanford.edu/class/cs124/lec/med.pdf
Adopted from https://web.stanford.edu/class/cs124/lec/med.pdf
Substitution
Insertion
Deletion
Transposition
sub [o, u] = 39
sub [s, t] = 15
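With per-character costs, likely confusions become cheap and unlikely ones stay close to 1. The sketch below plugs the four costs quoted above into the same dynamic programme. Several things are assumptions made for illustration: substitution costs are looked up symmetrically, insertion costs are keyed as (inserted letter, preceding letter), anything not listed costs 1.0, and transposition is left out. How the costs are derived from the error counts is not shown here.

```python
def weighted_edit_distance(source, target, sub_cost, ins_cost, default=1.0):
    """Edit distance with per-character substitution and insertion costs."""

    def sub(a, b):
        if a == b:
            return 0.0
        # Assumption: substitution costs are symmetric.
        return sub_cost.get((a, b), sub_cost.get((b, a), default))

    def ins(ch, prev):
        # Assumption: key is (inserted letter, letter it follows).
        return ins_cost.get((ch, prev), default)

    rows, cols = len(source) + 1, len(target) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + default  # deletions use the default cost
    for j in range(1, cols):
        prev = target[j - 2] if j > 1 else ""
        d[0][j] = d[0][j - 1] + ins(target[j - 1], prev)
    for i in range(1, rows):
        for j in range(1, cols):
            prev = target[j - 2] if j > 1 else ""
            d[i][j] = min(
                d[i - 1][j] + default,
                d[i][j - 1] + ins(target[j - 1], prev),
                d[i - 1][j - 1] + sub(source[i - 1], target[j - 1]),
            )
    return d[-1][-1]


SUB = {("O", "U"): 0.2, ("T", "S"): 0.993}
INS = {("S", "S"): 0.2, ("K", "R"): 0.996}

print(round(weighted_edit_distance("HOSEIN", "HUSSEIN", SUB, INS), 3))  # 0.4
print(round(weighted_edit_distance("SURIYE", "TURKIYE", SUB, INS), 3))  # 1.989
```

Both pairs are two edits apart under unit costs; the weights separate a plausible spelling variant (0.4) from two different words (1.989).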
Frequency of Errors
Hussein
Deletion of s after s
Insertion of s after s
Substitution of u with o
Transposition of e and i
Implementation details http://norvig.com/spell-correct.html
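The four error types listed above are the same ones the linked spelling-corrector article enumerates. A minimal sketch in that style, generating every string one edit away from a name; the uppercase alphabet is an assumption:

```python
import string


def edits1(name, alphabet=string.ascii_uppercase):
    """All strings one deletion, transposition, substitution or insertion away."""
    splits = [(name[:i], name[i:]) for i in range(len(name) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    substitutes = [left + c + right[1:]
                   for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)


variants = edits1("HUSSEIN")
for candidate in ["HUSEIN", "HUSSSEIN", "HOSSEIN", "HUSSIEN"]:
    print(candidate, candidate in variants)  # each line ends with True
```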
Cross match all aliases
Hossein, Husein, Hussein
Usame, Osama, Osame
Adopted from https://web.stanford.edu/class/cs124/lec/med.pdf
Query: HUSEIN
INNOCENTA CORPORATION

| MATCHED ENTITY | Score |
|---|---|
| BADDY CORPORATION | 95 |
| WANTED INTL CORPORATION | 94 |
| SANCTIONED CORPORATION LTD | 92 |
| BOMBER CORPORATION | 90 |
| NARCOTIC CORPORATION | 84 |

| TERMS | Doc Frequency |
|---|---|
| CORPORATION | 7000+ |
| BANK | 6000+ |
| INTERNATIONAL | 5000+ |
| SECURITIES | 3000+ |
| GLOBAL | 2000+ |
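Terms such as CORPORATION appear in thousands of records, so a match on them carries little evidence. One common way to quantify this is inverse document frequency. The sketch below counts document frequencies over a hypothetical record list and uses `log10(N / df)`, which is an assumed definition, not necessarily the one used in the talk:

```python
import math
from collections import Counter


def document_frequencies(records):
    """Number of records each term appears in (counted once per record)."""
    df = Counter()
    for record in records:
        df.update(set(record.upper().split()))
    return df


def idf(term, df, n_records):
    """Assumed definition: log10 of (record count / document frequency)."""
    return math.log10(n_records / df[term])


# Hypothetical watch-list records.
records = [
    "BADDY CORPORATION",
    "WANTED INTL CORPORATION",
    "BOMBER CORPORATION",
    "LAVRENIUK LIMITED",
]
df = document_frequencies(records)
for term in ["CORPORATION", "LAVRENIUK"]:
    print(term, df[term], round(idf(term, df, len(records)), 3))
```

The rare surname gets a much higher weight than the common legal suffix, which is the property the scoring needs.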
INNOCENTA GLOBAL CORPORATION

| MATCHED ENTITY | Score |
|---|---|
| BADDY GLOBAL CORPORATION | 95 |
| WANTED GLOBAL CORPORATION | 94 |
| SANCTIONED CORPORATION GLOBAL | 92 |
| BOMBER GLOBAL CORPORATION | 90 |
| NARCOTIC GLOBAL CORPORATION | 84 |

| RECORDS | Support / Frequency |
|---|---|
| CORPORATION + INTERNATIONAL | 200+ |
| GLOBAL + CORPORATION | 100+ |
| CORPORATION + SECURITIES | 40+ |
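The support figures above count how many records contain a given combination of terms. A minimal sketch of that count for term pairs, run on a hypothetical record list (the real list and the talk's exact mining method are not shown in the slides):

```python
from collections import Counter
from itertools import combinations


def pair_support(records):
    """Count, for every unordered term pair, the records containing both."""
    support = Counter()
    for record in records:
        terms = sorted(set(record.upper().split()))
        support.update(combinations(terms, 2))
    return support


# Hypothetical records.
records = [
    "GLOBAL SECURITIES CORPORATION",
    "BADDY GLOBAL CORPORATION",
    "WANTED GLOBAL CORPORATION",
    "INTERNATIONAL BANK",
]
support = pair_support(records)
print(support[("CORPORATION", "GLOBAL")])  # 3
```

Pairs are stored in sorted order, so look them up alphabetically, as in `("CORPORATION", "GLOBAL")`.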
| RECORDS | Support / Frequency |
|---|---|
| GLOBAL SECURITIES CORPORATION | 1 |
| INTERNATIONAL BANK | 1 |
| INTERNATIONAL TECHNOLOGIES CORP. | 1 |

| TERMS | Doc Frequency |
|---|---|
| CORPORATION | 7000+ |
| BANK | 6000+ |
| INTERNATIONAL | 5000+ |
| SECURITIES | 3000+ |
| GLOBAL | 2000+ |
LIMITED
CORPORATION
LAVRENIUK
e.g. Microsoft Corporation vs Google Corporation
e.g. Lavrenyuk Limited vs Lavreniuk Corporation
Mutual Information (MI) = IDF x Similarity
Information (IDF) =
Lavrenyuk Limited
vs
Lavreniuk Corporation
MI (Lavrenyuk vs Lavreniuk) = 3.61
Total Mutual Info (TMI) = 3.61
IDF (Lavreniuk) = 3.72
IDF (Corporation) = 0.64
Total Info (TI) = 3.72 + 0.64 = 4.36
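The arithmetic above can be sketched in a few lines. The IDF values are the ones on the slide. The similarity of 0.97 is an assumption back-derived from 3.61 / 3.72, and dividing TMI by TI to get a final score is also an assumption; the slide shows only the two totals.

```python
def total_scores(record_tokens, idf, similarity):
    """Return (total mutual info, total info) for one record.

    similarity maps a record token to its best similarity with any
    query token; tokens with no counterpart contribute zero MI.
    """
    total_info = sum(idf[token] for token in record_tokens)
    total_mutual_info = sum(
        idf[token] * similarity.get(token, 0.0) for token in record_tokens
    )
    return total_mutual_info, total_info


idf = {"LAVRENIUK": 3.72, "CORPORATION": 0.64}  # values from the slide
similarity = {"LAVRENIUK": 0.97}                # assumed: 3.61 / 3.72, rounded

tmi, ti = total_scores(["LAVRENIUK", "CORPORATION"], idf, similarity)
print(round(tmi, 2), round(ti, 2), round(tmi / ti, 2))  # 3.61 4.36 0.83
```

Because CORPORATION carries only 0.64 of the 4.36 total, the mismatch on the legal suffix barely lowers the ratio, while a mismatch on the surname would.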
Query: Hong Kong

| | Hong | Kong |
|---|---|---|
| Hong | 0 | 1 |
| Kong | 1 | 0 |

RECORD
QUERY
Implementation details https://en.wikipedia.org/wiki/Hungarian_algorithm
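The table above is a token-to-token cost matrix, and matching names regardless of word order means picking the cheapest one-to-one pairing of query and record tokens. The Hungarian algorithm solves this in polynomial time. The sketch below swaps in a brute-force search over permutations instead, which is only reasonable for the handful of tokens in a name; the function names are hypothetical.

```python
from itertools import permutations


def levenshtein(a, b):
    """Unit-cost edit distance, used as the token-to-token cost."""
    previous = list(range(len(b) + 1))
    for i, a_char in enumerate(a, start=1):
        current = [i]
        for j, b_char in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (a_char != b_char)))
        previous = current
    return previous[-1]


def best_token_assignment(query_tokens, record_tokens):
    """Cheapest one-to-one pairing of query tokens with record tokens.

    Brute force, not the Hungarian algorithm; assumes the query has
    no more tokens than the record.
    """
    if len(query_tokens) > len(record_tokens):
        raise ValueError("query has more tokens than record")
    cost = [[levenshtein(q, r) for r in record_tokens] for q in query_tokens]
    best_cost, best_perm = min(
        (sum(cost[i][j] for i, j in enumerate(perm)), perm)
        for perm in permutations(range(len(record_tokens)), len(query_tokens))
    )
    pairs = [(query_tokens[i], record_tokens[j]) for i, j in enumerate(best_perm)]
    return best_cost, pairs


print(best_token_assignment(["HONG", "KONG"], ["KONG", "HONG"]))
# (0, [('HONG', 'HONG'), ('KONG', 'KONG')])
```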
Precision & Recall https://en.wikipedia.org/wiki/Precision_and_recall
( No free lunch )
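The trade-off can be made concrete by sweeping the alert threshold over a set of reviewed hits: a higher threshold removes false positives but starts missing true matches. A minimal sketch with hypothetical scores and labels:

```python
def precision_recall(scored_hits, threshold):
    """scored_hits: list of (score, is_true_match) pairs from reviewed alerts."""
    flagged = [is_true for score, is_true in scored_hits if score >= threshold]
    true_positives = sum(flagged)
    total_true = sum(is_true for _, is_true in scored_hits)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / total_true if total_true else 1.0
    return precision, recall


# Hypothetical reviewed hits: (score, was it a true match?)
hits = [(100, True), (96, True), (95, False), (90, True), (84, False)]

for threshold in (96, 90, 84):
    precision, recall = precision_recall(hits, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

On this toy data, threshold 96 gives perfect precision but misses one true match, while threshold 84 catches everything at 60% precision.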
| Query Text | Match Text | UID | Score | Action |
|---|---|---|---|---|
| ENHUK LEGAN NAKL LTD | ENHUK LEGAN INC | 13421 | 96 | |
| ALPHA BETA BANKING CORP | ABBC | 54611 | 100 | |
| BETA USA TOURS | BETA AIRLINES | 91887 | 96 | |
| GALAXY AIRLINES | GALAXY AVIATIONS | 57166 | 95 | |
Tip 10:
Aggregate Data
True Positive
| Adobe Systems, Inc. 345 Park Ave. San Jose, CA 95110, US |
|---|
| Groupo San Jose |

| Spotify Inc, 988 Market St, San Francisco, CA 94109, US |
|---|
| Francisco Cal |
http://noodle.org.uk
Tips and Tricks on Developing High-Performance Fuzzy Name Search Engine to Prevent Terrorism Financing