Stefano Spigler
Unit8 May 6
Adverse Media Screening
Extracting information from news articles
The banking industry is highly regulated:
— Compliance plays an important role
— Discover issues early to mitigate consequences
Understand if clients are involved in illicit activities:
— A great deal of information is publicly available (web, news)
The Bank tried two approaches:
— Using a 3rd-Party Vendor (3PV)
— Developing a challenger model in house
— Benefits: in-house IP, no vendor lock-in, more flexibility, cheaper?
Premise
Supervised multilabel classification into 8 categories
— Multiple classes not strictly required for downstream tasks
— ... but such granularity provides additional information
Choice of labels is aligned with 3PV labels
— To reduce the time to a first working solution, we trained on the outputs of the 3PV model
— A small set of articles (25k) was labeled by an independent company
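In a multilabel setup each category gets its own independent score, so an article can receive several labels at once. The following is a minimal sketch of that decision step only, not the model from the talk: the label list is truncated (only "corruption" and "money laundering" appear in the slides, "fraud" is a hypothetical stand-in), and the logits and the 0.5 threshold are made-up assumptions.

```python
import math

# Only the first two labels appear in the slides; "fraud" is a hypothetical stand-in.
LABELS = ["corruption", "money laundering", "fraud"]


def sigmoid(x):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Multilabel decision: every label is scored independently of the others."""
    scores = {label: sigmoid(z) for label, z in zip(LABELS, logits)}
    return {label: s for label, s in scores.items() if s >= threshold}


# Made-up logits for one article (not real model output).
article_logits = [2.0, 3.9, -3.0]
predicted = predict_labels(article_logits)
print(sorted(predicted))
```

Unlike a softmax over classes, the per-label sigmoids do not have to sum to one, which is what allows more than one category to fire for the same article.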
Multilingual approaches
— Either invoke language-specific models
— Or translate to EN with NMT
Solution: Risk classification / Inputs
(corruption, money laundering, ...)
Natural Language Processing
— How to deal with real-world texts? Machines understand numbers!
— How to capture the meaning of words with numbers?
Word embeddings
— Represent words as vectors: similar vectors ~ similar words
Then: neural networks!
Developed an ad-hoc model and compared it to SotA Transformer models.
— Better performance than 3PV!
Solution: Risk classification / Process
king - man + woman ~ queen
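The analogy above can be reproduced with a toy example. The sketch below uses hand-made 3-dimensional vectors (roughly "royalty", "male", "female"), which are an illustrative assumption; real embeddings are learned from text and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (royalty, male, female); NOT learned embeddings.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.3, 0.3],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], target))


nearest = analogy("king", "man", "woman")
print(nearest)
```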
Solution: Risk imputation
Who is the main actor in an article?
— No resources to follow a supervised approach!
Two ingredients:
— Score risk at the sentence level (prediction explanation)
— Extract entities (people, orgs), resolve coreferences (pronouns)
Assign each entity a score related to proximity to relevant sentences.
Germany's biggest bank under pressure after investigation.
Deutsche Bank was raided on suspicion of filing money laundering reports too late.
Money Laundering: 98%
The probe reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.
The Frankfurt prosecutor's office did not release details about the background of the probe "due to ongoing investigative measures."
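One way to turn sentence-level risk into entity-level scores is sketched below. The slides only say that the score is "related to proximity to relevant sentences", so the concrete rule used here (best sentence risk, discounted by a factor `decay` per sentence of distance from a mention) is an assumption for illustration. The sentence risks other than the 98% shown above are made up, and the mention positions assume that coreference resolution has already linked "Germany's biggest bank" to Deutsche Bank.

```python
def entity_scores(sentence_risk, mentions, decay=0.5):
    """Score each entity by the riskiest sentence near one of its mentions.

    sentence_risk: list of risk scores, one per sentence.
    mentions: entity name -> list of sentence indices where it is mentioned
              (after coreference resolution).
    decay: hypothetical discount applied per sentence of distance.
    """
    scores = {}
    for entity, positions in mentions.items():
        scores[entity] = max(
            risk * decay ** abs(i - j)
            for i in positions
            for j, risk in enumerate(sentence_risk)
        )
    return scores


# Risk per sentence of the example article; only 0.98 comes from the slide.
sentence_risk = [0.10, 0.98, 0.60, 0.05]
mentions = {
    "Deutsche Bank": [0, 1],
    "Bashar Assad": [2],
    "Frankfurt prosecutor's office": [3],
}

scores = entity_scores(sentence_risk, mentions)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these numbers the bank ranks first because it is mentioned in the highest-risk sentence itself, while the other entities only inherit a discounted share of that risk.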
Explaining predictions with LIME
Neural network outputs are not easily justified.
In contrast, for linear models feature importance is known!
Simple idea: does the prediction change when we remove some parts?
Better idea (LIME):
— generate variations of the text that stay close to the original
— approximate the model locally with a linear model
— rank features by their weight in that linear model!
Germany's biggest bank under pressure after investigation.
Deutsche Bank was raided on suspicion of filing money laundering reports too late.
MASKED
The Frankfurt prosecutor's office did not release details about the background of the probe "due to ongoing investigative measures."
Germany's biggest bank under pressure after investigation.
MASKED
The probe reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.
The Frankfurt prosecutor's office did not release details about the background of the probe "due to ongoing investigative measures."
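The masking idea can be sketched at the sentence level as follows. This is a simplified, LIME-style illustration rather than the `lime` library or the talk's implementation: `toy_classifier` is a hypothetical keyword-based stand-in for the neural network, all masks are enumerated instead of sampled, and the proximity kernel (based on the fraction of removed sentences) is an assumption.

```python
import itertools
import math


def toy_classifier(kept_sentences):
    """Hypothetical black box: returns a made-up money-laundering score."""
    text = " ".join(kept_sentences).lower()
    return 0.05 + 0.85 * ("money laundering" in text) + 0.08 * ("probe" in text)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def explain(sentences, classifier, kernel_width=0.75):
    """Fit a weighted linear model on masked variants; return one weight per sentence."""
    n = len(sentences)
    rows, targets, weights = [], [], []
    for mask in itertools.product([0, 1], repeat=n):
        kept = [s for s, keep in zip(sentences, mask) if keep]
        rows.append([1.0] + [float(v) for v in mask])  # intercept + mask bits
        targets.append(classifier(kept))
        removed = 1.0 - sum(mask) / n  # distance from the original text
        weights.append(math.exp(-(removed ** 2) / kernel_width ** 2))
    d = n + 1
    # Weighted least squares via the normal equations (X^T W X) beta = X^T W y.
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(d)]
            for i in range(d)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets))
            for i in range(d)]
    for i in range(d):
        xtwx[i][i] += 1e-9  # tiny ridge term for numerical stability
    return solve(xtwx, xtwy)[1:]  # drop the intercept


sentences = [
    "Germany's biggest bank under pressure after investigation.",
    "Deutsche Bank was raided on suspicion of filing money laundering reports too late.",
    "The probe reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.",
    "The Frankfurt prosecutor's office did not release details about the background of the probe.",
]
coefs = explain(sentences, toy_classifier)
ranked = sorted(range(len(sentences)), key=lambda i: coefs[i], reverse=True)
print(ranked[0], round(coefs[ranked[0]], 2))
```

The sentence with the largest coefficient is the one whose removal changes the prediction most, which is the kind of ranking that can then be reused as a sentence-level risk score.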
Solution: Name matching
We match our hits with a database of client names.
Use fuzzy matching:
— Instead of matching full names exactly, split them into smaller pieces (n-grams)
— Represent all n-grams as sparse vectors (TF-IDF: frequency based)
— Extract client names that are closest to the matched ones
Better performance than 3PV!
Assad → A-s-s-a-d, As-s-ad, Ass-ad, ...
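The fuzzy-matching steps can be sketched in plain Python as below. This is an illustration under assumptions, not the production matcher: the client names are invented, character trigrams with space padding and a smoothed IDF formula are one common choice among several, and a real system would use sparse matrices rather than dictionaries.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Split a name into overlapping character n-grams (padded with spaces)."""
    padded = f" {name.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def normalize(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}


def build_index(names, n=3):
    """Compute smoothed IDF weights and unit-length TF-IDF vectors for all names."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    df = Counter(g for doc in docs for g in doc)
    idf = {g: math.log((1 + len(names)) / (1 + c)) + 1.0 for g, c in df.items()}
    vectors = [normalize({g: tf * idf[g] for g, tf in doc.items()}) for doc in docs]
    return idf, vectors


def best_match(query, names, idf, vectors, n=3):
    """Return the client name with the highest cosine similarity to the query."""
    counts = Counter(char_ngrams(query, n))
    q = normalize({g: tf * idf.get(g, 0.0) for g, tf in counts.items()})
    sims = [sum(w * vec.get(g, 0.0) for g, w in q.items()) for vec in vectors]
    best = max(range(len(names)), key=sims.__getitem__)
    return names[best], sims[best]


# Hypothetical client database, invented for this example.
clients = ["Bashar al-Assad", "Deutsche Bank AG", "Basel Asset Management"]
idf, vectors = build_index(clients)
match, score = best_match("Bashar Assad", clients, idf, vectors)
print(match)
```

Because the comparison is made on shared n-grams rather than on whole strings, spelling variants such as a missing "al-" still produce a high similarity.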
System design
[Architecture diagram; labels recovered from the figure]
Layers: Processing layer, Application layer, Client layer, Data layer
Infrastructure: Spark cluster, Kafka queue, Kubernetes / Load Balancer, ElasticSearch
Pipeline stages: Data providers, Ingestion, Translation, Classification, Sentence scoring, Entity extraction, Entity scoring, Entity matching
Stage groups labeled in the diagram: Risk classification, Risk imputation
Data stores: Doc archive (NoSQL DB), Client names (DB2), Alerts (PostgreSQL DB)
Interfaces: UI, Monitoring, REST API
Adverse Media Screening v2
By Stefano Spigler