Stefano Spigler
Zalando October 5
Extracting information from news articles
Premise
— Why this topic?
— Introduction to compliance
— Adverse-Media Screening
Our solution
— Multilingual inputs
— Labeling strategies
— Risk classification
Model outputs
— Estimation of targeted entities
— Alert generation
Why this topic?
— Previous research perhaps too technical
— Showcase practical industry applications
— Use case is heavily based on NLP (although from before the LLM era!)
Happy to discuss theoretical research and other projects!
Banking industry is highly regulated:
— Clients' wrongdoings can entail huge fines
— Compliance plays an important role
— Discover issues early to mitigate consequences
Understand if clients are involved in illicit activities:
— Anti-Money Laundering (AML), aimed at detecting illicit sources of funds
— Know Your Customer (KYC), aimed at verifying clients' identity
Is a client involved in illicit activities?
— A great deal of information is publicly available
Idea:
Read all newspaper articles → any mention of clients' wrongdoings?
Until recently: done manually!
The Bank tried two approaches:
— Using a 3rd-Party Vendor (3PV)
— Developing a Challenger Model in house
[Figure: pipeline diagram with the stages Ingestion, Translation, Risk classification, Sentence scoring, NER, Entity matching, Risk imputation, Alerts]
— Ingest & preprocess articles from external sources
— Extract information and identify relevant entities
— Match with client database and raise alerts
Sources:
— Articles from external providers (Dow Jones Factiva, LexisNexis)
— ~1M articles / day, in all languages!
— We are only interested in the top 10 languages
Multilingual approaches:
1. Single giant multilingual model (low flexibility, difficult labeling)
2. Many independent language-specific models (high maintenance, difficult labeling)
3. Translate to EN with Neural Machine Translation (NMT)
Start with Language Detection:
— Articles come without original language tags...
— Many options (langdetect, FAIR's fastText)
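A minimal sketch of the detect-then-filter step. It assumes a toy stopword-overlap detector as a stand-in for langdetect or fastText; the stopword lists, `detect_language`, `keep_article`, and `TARGET_LANGUAGES` are illustrative names, not the production setup.

```python
# Toy stopword lists (illustrative only; a real system would use langdetect or fastText).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
    "it": {"il", "di", "che", "e", "la", "per"},
}

# Hypothetical subset of the languages of interest.
TARGET_LANGUAGES = {"en", "de"}


def detect_language(text):
    """Return the language whose stopwords overlap most with the text, or None."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def keep_article(text):
    """Keep only articles written in one of the target languages."""
    return detect_language(text) in TARGET_LANGUAGES
```

Articles that pass the filter would then go on to translation; everything else is dropped early, which matters at ~1M articles per day.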
We want to do multilabel classification into 8 categories (corruption, money laundering, ...)
— Use same labels as 3PV
— Not a great choice of labels, but we needed to be aligned with the 3PV
Can we use unsupervised models?
— top2vec extracts topics, but is too noisy and difficult to align to desired labels
How do we train a supervised model? We need labels!
1. Use outputs of 3PV model and train on them (millions!)
2. Pay 3rd-party labeling company (25k)
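To make the multilabel setup concrete, here is a minimal sketch of how targets and predictions could be represented. Only "corruption" and "money laundering" are named on the slides; the other six label names, the logits, and the 0.5 threshold are placeholders.

```python
import math

# Only the first two names come from the slides; the rest are placeholders.
LABELS = ["corruption", "money laundering", "label_3", "label_4",
          "label_5", "label_6", "label_7", "label_8"]


def encode(labels):
    """Multi-hot target vector: one independent 0/1 entry per category."""
    return [1 if name in labels else 0 for name in LABELS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=0.5):
    """Each category is decided independently, so an article can get zero, one, or many labels."""
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


target = encode({"corruption", "money laundering"})
# Made-up logits, as a classifier head might output them for one article.
predicted = decode([2.0, 0.3, -1.5, -3.0, -2.0, -0.1, -4.0, -2.5])
```

Unlike multiclass classification with a softmax, the per-label sigmoid does not force the categories to compete with each other.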
Then: neural networks!
Tried multiple approaches:
— Ad-hoc model: GloVe embeddings + 2x BiLSTM + 2x FC layers
— Frozen pretrained RoBERTa + head (2x FC layers)
On our data, both approaches yield similar performance
— Performance is evaluated on a subset of the independently labeled data
— We achieved better performance than the 3PV (evaluated on an independent test set)
— Our labels were more consistent
— They probably did not optimize their model
... but before moving on: how do neural networks read words?
Machines understand numbers! How to capture the meaning of words with numbers?
We need a semantic representation:
— each word \(w\) is mapped into a vector \(\mathrm{repr}(w)\)
— similar words must have similar vectors!
Many algorithms developed over time:
— Frequency based (BOW, TF-IDF)
— Shallow neural networks (word2vec, GloVe)
— Transformers (e.g. RoBERTa)
\(\mathrm{repr}(\text{``king''}) - \mathrm{repr}(\text{``man''}) + \mathrm{repr}(\text{``woman''}) \approx \mathrm{repr}(\text{``queen''})\)
bank robber vs river bank?
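The king/queen analogy can be reproduced with a few hand-made vectors. This is only a toy sketch: the four-dimensional vectors below are invented for illustration (roughly: royalty, male, female, fruit), whereas real word2vec or GloVe vectors are learned from text and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (assumption): dimensions ~ (royalty, male, female, fruit).
VECTORS = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity: close to 1 for similar directions, 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to repr(a) - repr(b) + repr(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


result = analogy("king", "man", "woman")
```

Static vectors like these assign one vector per word, so "bank" gets the same representation in both phrases above; contextual models such as RoBERTa compute the vector from the surrounding words.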
Who is the main actor in an article?
Can we simply apply Named-Entity Recognition (people, organizations)? → NO!
Also, no resources to follow a supervised approach...
Germany's biggest bank under pressure after investigation.
Deutsche Bank was raided on suspicion of filing money laundering reports too late.
The probe, led by Frankfurt prosecutor Max Schmidt reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.
The prosecutor's office did not release details about the background of the probe due to ongoing investigative measures.
Two ingredients:
— Extract entities (spaCy, FlairNLP) and resolve coreferences, e.g. pronouns (Stanford's CoreNLP, spaCy's neuralcoref)
— Score risk at the sentence level via prediction explanation (SHAP, LIME)
Assign each entity a score related to its proximity to relevant sentences.
[Figure: the example article above, annotated with the risk label Money Laundering]
Neural network outputs are not easily interpreted.
In contrast, for linear models feature importance is known!
Simple idea: does the prediction change when we remove some parts?
Better idea: LIME
— generate variations of the text that stay close to the original
— approximate the decision boundary locally with a linear model
— rank features by their weight in the linear model!
[Figure: the example article shown twice, once in full and once with parts of the text replaced by MASKED; risk label: Money Laundering]
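The "simple idea" (remove a part and see whether the prediction changes) can be sketched at the sentence level as follows. This is leave-one-out occlusion, not LIME itself, and `toy_risk_score` with its `RISK_TERMS` is a hypothetical keyword-based stand-in for the neural classifier.

```python
# Hypothetical keyword list; the real classifier is a neural network.
RISK_TERMS = ("money laundering", "raided")


def toy_risk_score(text):
    """Stand-in for the classifier: fraction of risk terms present in the text."""
    lowered = text.lower()
    return sum(term in lowered for term in RISK_TERMS) / len(RISK_TERMS)


def sentence_importance(sentences, score_fn):
    """Drop in the score when each sentence is removed in turn (leave-one-out)."""
    full = score_fn(" ".join(sentences))
    drops = []
    for i in range(len(sentences)):
        remaining = sentences[:i] + sentences[i + 1:]
        drops.append(full - score_fn(" ".join(remaining)))
    return drops


ARTICLE = [
    "Germany's biggest bank under pressure after investigation.",
    "Deutsche Bank was raided on suspicion of filing money laundering reports too late.",
    "The probe, led by Frankfurt prosecutor Max Schmidt reportedly concerns a "
    "transaction involving the family of Syrian leader Bashar Assad.",
    "The prosecutor's office did not release details about the background of "
    "the probe due to ongoing investigative measures.",
]

importance = sentence_importance(ARTICLE, toy_risk_score)
most_relevant = ARTICLE[importance.index(max(importance))]
```

With this toy scorer only the second sentence carries the signal, so removing it is the only change that lowers the score.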
Rank entities:
— How frequently are they mentioned close to a risky sentence?
— Keep most relevant ones
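A minimal sketch of this ranking, assuming entities have already been extracted per sentence (with coreferences resolved) and each sentence has a risk score. The function name, the window of one sentence, and the 0.5 threshold are illustrative assumptions.

```python
from collections import Counter


def rank_entities(sentence_entities, sentence_risk, window=1, threshold=0.5, top_k=3):
    """Count mentions that fall within `window` sentences of a risky sentence."""
    risky = [i for i, risk in enumerate(sentence_risk) if risk >= threshold]
    counts = Counter()
    for i, entities in enumerate(sentence_entities):
        if any(abs(i - j) <= window for j in risky):
            counts.update(entities)
    return counts.most_common(top_k)


# Entities per sentence of the example article; mapping "Germany's biggest bank"
# to Deutsche Bank is assumed to come from coreference resolution.
sentence_entities = [
    ["Deutsche Bank"],
    ["Deutsche Bank"],
    ["Max Schmidt", "Bashar Assad"],
    [],
]
sentence_risk = [0.0, 1.0, 0.0, 0.0]  # made-up sentence-level scores

ranked = rank_entities(sentence_entities, sentence_risk)
```

A plain mention count is the simplest choice; weighting each mention by the risk score or by distance to the risky sentence would be natural refinements.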
Then, match them against the database of client names (fuzzy matching):
— Instead of matching full names exactly, split them into bits (n-grams)
— Return the client names that are closest to the extracted entities
We achieved better performance than the 3PV (evaluated on an independent test set).
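One way to sketch n-gram fuzzy matching is with character trigrams and Jaccard overlap. The similarity measure, the 0.5 threshold, and the client list are assumptions for illustration; the slides do not say which variant was used.

```python
def char_ngrams(name, n=3):
    """Set of character n-grams of the lowercased name, padded with spaces."""
    padded = f" {name.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two names (1.0 = identical sets)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_matches(entity, clients, threshold=0.5):
    """Client names whose similarity to the entity reaches the threshold, best first."""
    scored = [(similarity(entity, client), client) for client in clients]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Hypothetical client database.
CLIENTS = ["Deutsche Bank AG", "Deutsche Bahn", "Max Schmid"]

matches = best_matches("Deutsche Bank", CLIENTS)
```

Because the comparison works on small pieces of the name, suffixes such as "AG" or minor spelling differences lower the score only slightly instead of breaking the match. Near misses such as "Deutsche Bahn" can still score fairly high, so the threshold trades missed clients against false alerts.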
— ML/AI is used in the banking industry to tackle compliance-related use cases
— Adverse-Media Screening must be tackled with NLP models
— The use case requires several intertwined NLP components
— My team successfully beat the 3rd-party provider on every performance metric