AMS @ Zalando

Stefano Spigler

Zalando October 5

Adverse-Media Screening

Extracting information from news articles

Premise

— Why this topic?

— Introduction to compliance

— Adverse-Media Screening

Our solution

— Multilingual inputs

— Labeling strategies

— Risk classification

Model outputs

— Estimation of targeted entities

— Alert generation

Outline

Why this topic?

— Previous research perhaps too technical

— Showcase practical industry applications

— Use case is heavily based on NLP

Happy to discuss theoretical research and other projects!

Premise

(although before LLM era!)

Banking industry is highly regulated:

— Clients' wrongdoings can entail huge fines

— Compliance plays an important role

— Discover issues early to mitigate consequences

Understand if clients are involved in illicit activities:

— Anti-Money Laundering (AML) aims at detecting illicit sources of funds

— Know Your Customer (KYC) aims at verifying clients' identities

Compliance

Is a client involved in illicit activities?

— A great deal of information is publicly available

Idea:

Read all newspaper articles → any mention of clients' wrongdoing?

Adverse-Media Screening

Until recently: done manually!

The Bank tried two approaches:

— Using a 3rd-Party Vendor (3PV)

— Developing a Challenger Model in house

Solution Overview

Ingestion

Translation

Risk classification

Sentence scoring

NER

Entity matching

Alerts

— Ingest & preprocess articles from external sources

— Extract information and identify relevant entities

— Match with client database and raise alerts

Risk imputation

Sources:

— Articles by external providers (Dow Jones Factiva, LexisNexis)

— ~1M articles / day, in all languages!

— We are only interested in the top 10 languages

Multilingual approaches:

1. Single giant multilingual model

2. Many independent language-specific models

3. Translate to EN with Neural Machine Translation (NMT)

Start with Language Detection:

— Articles without original language tags...

— Many options (langdetect, FAIR's fastText)
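To make the detection step concrete, here is a toy detector that scores each language by stopword overlap. It is only a stand-in for langdetect or fastText, not how either library works internally: the stopword lists and the `detect_language` helper are hypothetical, and a real pipeline would call one of the libraries named above.

```python
import re

# Tiny hand-picked stopword lists (illustrative only, far from complete).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "was", "on"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "wegen", "von"},
    "fr": {"le", "la", "les", "et", "est", "des", "une", "dans"},
}


def detect_language(text, default="unknown"):
    """Return the language whose stopwords occur most often in `text`."""
    # Unicode-aware tokenization: runs of letters only.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # No stopword hit at all: refuse to guess.
    return best if scores[best] > 0 else default


print(detect_language(
    "Deutsche Bank was raided on suspicion of filing money laundering reports too late."
))
```

Articles whose detected language falls outside the languages of interest could then be dropped before any translation cost is paid.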

Multilingual Inputs

(high maintenance, difficult labeling)

(low flexibility, difficult labeling)

We want to do multilabel classification into 8 categories

— Use same labels as 3PV

— Not a great choice of labels, but we needed to stay aligned with the 3PV

Can we use unsupervised models?

— Top2Vec extracts topics, but they are too noisy and difficult to align with the desired labels

How do we train a supervised model? We need labels!

1. Use outputs of 3PV model and train on them

2. Pay 3rd-party labeling company

Supervised Learning: Labels

(corruption, money laundering, ...)

(millions!)

(25k)
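Multilabel means that one article can carry several risk categories at once, so targets are multi-hot vectors and every category is decided independently. The sketch below shows one way to encode and decode such labels. Only "corruption" and "money laundering" are named in the talk; the remaining six category names are placeholders, and the 0.5 threshold is an assumption.

```python
import math

# Only the first two names come from the talk; the rest are placeholders.
CATEGORIES = ["corruption", "money laundering"] + [
    f"category_{i}" for i in range(3, 9)
]


def to_multi_hot(labels):
    """Encode a collection of category names as a 0/1 target vector."""
    unknown = set(labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if c in labels else 0 for c in CATEGORIES]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=0.5):
    """Decide each category independently (sigmoid per class, not softmax)."""
    return [c for c, z in zip(CATEGORIES, logits) if sigmoid(z) >= threshold]


target = to_multi_hot({"corruption", "money laundering"})
predicted = decode([2.0, 1.0, -3.0, -3.0, -3.0, -3.0, -3.0, -3.0])
print(target, predicted)
```

A per-class sigmoid is the usual choice here because a softmax would force the categories to compete, which does not fit articles that mention several kinds of wrongdoing.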

Then: neural networks!

Tried multiple approaches:

— Ad-hoc model: GloVe embeddings + 2x BiLSTM + 2x FC layers

— Frozen pretrained RoBERTa + head (2x FC layers)

On our data, both approaches yield similar performance

— Performance is evaluated on a subset of the independently labeled data

— We achieved better performance than the 3PV (evaluated on an independent test set)

— Our labels were more consistent

— They probably did not optimize their model

... but before moving on: how do neural networks read words?

Risk Classification

Machines understand numbers! How to capture the meaning of words with numbers?

We need a semantic representation:

— each word \(w\) is mapped into a vector \(\mathrm{repr}(w)\)

— similar words must have similar vectors!

Many algorithms developed over time:

— Frequency based (BOW, TF-IDF)

— Shallow neural networks (word2vec, GloVe)

— Transformers (e.g. RoBERTa)

Embeddings

\(\mathrm{repr}(\text{``king''}) - \mathrm{repr}(\text{``man''}) + \mathrm{repr}(\text{``woman''}) \approx \mathrm{repr}(\text{``queen''})\)

bank robber vs river bank?
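The analogy above can be reproduced with hand-made toy vectors. This is a sketch under strong assumptions: the two dimensions (royalty, gender) and all values are invented for illustration, whereas real embeddings are learned and have hundreds of dimensions.

```python
import math

# Hand-made 2-D vectors: (royalty, gender). Purely illustrative values.
VECTORS = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
    "bank":  (0.5,  0.0),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def analogy(a, b, c):
    """Return the word closest to repr(a) - repr(b) + repr(c)."""
    target = tuple(x - y + z
                   for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    # As is customary, the query words themselves are excluded.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


print(analogy("king", "man", "woman"))
```

Note that a lookup table like `VECTORS` holds exactly one vector per word, so "bank" gets the same representation in "bank robber" and "river bank". Contextual models such as RoBERTa compute the vector from the surrounding sentence instead.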

Risk Imputation: Overview

Who is the main actor in an article?

Can we simply apply Named-Entity Recognition? → NO!

Also, no resources to follow a supervised approach...

Germany's biggest bank under pressure after investigation.

Deutsche Bank was raided on suspicion of filing money laundering reports too late.

The probe, led by Frankfurt prosecutor Max Schmidt reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.

The prosecutor's office did not release details about the background of the probe due to ongoing investigative measures.

(people, organizations)

Risk Imputation: Entities

Two ingredients:

— Extract entities, resolve coreferences (e.g. pronouns)

— Score risk at the sentence level (prediction explanation)

Assign each entity a score related to proximity to relevant sentences.

(Stanford's CoreNLP, spaCy's neuralcoref)

(SHAP, LIME)

Germany's biggest bank under pressure after investigation.

Deutsche Bank was raided on suspicion of filing money laundering reports too late.

(Money Laundering)

The probe, led by Frankfurt prosecutor Max Schmidt reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.

The prosecutor's office did not release details about the background of the probe due to ongoing investigative measures.

(spaCy, FlairNLP)

Risk Imputation: Explainability

Neural network outputs are not easily interpreted.

This is in contrast to linear models, where feature importance can be read directly from the coefficients!

Simple idea: does the prediction change when we remove some parts?
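The simple idea can be sketched as leave-one-out scoring at the sentence level: remove one sentence, re-run the classifier, and record how much the risk drops. The keyword-based `risk_score` below is a hypothetical stand-in for the trained classifier, and the list of risky terms is invented for the example.

```python
ARTICLE = [
    "Germany's biggest bank under pressure after investigation.",
    "Deutsche Bank was raided on suspicion of filing money laundering reports too late.",
    "The probe, led by Frankfurt prosecutor Max Schmidt reportedly concerns a transaction "
    "involving the family of Syrian leader Bashar Assad.",
    "The prosecutor's office did not release details about the background of the probe "
    "due to ongoing investigative measures.",
]

# Hypothetical stand-in for the trained risk classifier.
RISKY_TERMS = ("money laundering", "raided", "transaction")


def risk_score(sentences):
    """Fraction of risky terms that occur anywhere in the given sentences."""
    text = " ".join(sentences).lower()
    return sum(term in text for term in RISKY_TERMS) / len(RISKY_TERMS)


def occlusion_scores(sentences):
    """Score each sentence by the drop in risk when it alone is removed."""
    full = risk_score(sentences)
    return [full - risk_score(sentences[:i] + sentences[i + 1:])
            for i in range(len(sentences))]


for sentence, score in zip(ARTICLE, occlusion_scores(ARTICLE)):
    print(f"{score:.2f}  {sentence[:50]}")
```

With this toy classifier the second sentence gets the largest score, because removing it takes away two of the three risky terms.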

Better idea: LIME

— generate variations of the text that stay close to the original

— approximate the decision boundary locally with a linear model

— rank features by the weights of that linear model!
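The three steps can be sketched at the sentence level as follows. This is a simplified illustration of the LIME idea, not the `lime` library: it enumerates every keep/mask combination instead of sampling, uses the fraction of masked sentences as distance, and fits the local linear model with plain gradient descent. The `risk_score` classifier, its risky terms, and all parameter defaults are assumptions made for the example.

```python
import itertools
import math

ARTICLE = [
    "Germany's biggest bank under pressure after investigation.",
    "Deutsche Bank was raided on suspicion of filing money laundering reports too late.",
    "The probe, led by Frankfurt prosecutor Max Schmidt reportedly concerns a transaction "
    "involving the family of Syrian leader Bashar Assad.",
    "The prosecutor's office did not release details about the background of the probe "
    "due to ongoing investigative measures.",
]
RISKY_TERMS = ("money laundering", "raided", "transaction")  # hypothetical


def risk_score(sentences):
    """Toy stand-in for the trained classifier."""
    text = " ".join(sentences).lower()
    return sum(term in text for term in RISKY_TERMS) / len(RISKY_TERMS)


def lime_sentence_weights(sentences, predict, kernel_width=0.5,
                          steps=2000, lr=0.2):
    n = len(sentences)
    # 1. Variations of the text: every keep (1) / mask (0) pattern.
    masks = list(itertools.product([0, 1], repeat=n))
    preds = [predict([s for s, keep in zip(sentences, m) if keep])
             for m in masks]
    # Variations close to the original (few masked sentences) weigh more.
    weights = [math.exp(-((n - sum(m)) / n) ** 2 / kernel_width ** 2)
               for m in masks]
    total = sum(weights)
    # 2. Weighted linear fit: pred ~ bias + sum_j coef[j] * mask[j].
    coef, bias = [0.0] * n, 0.0
    for _ in range(steps):
        g_bias, g_coef = 0.0, [0.0] * n
        for m, y, w in zip(masks, preds, weights):
            err = bias + sum(c * x for c, x in zip(coef, m)) - y
            g_bias += w * err
            for j, x in enumerate(m):
                g_coef[j] += w * err * x
        bias -= lr * g_bias / total
        coef = [c - lr * g / total for c, g in zip(coef, g_coef)]
    return coef


coef = lime_sentence_weights(ARTICLE, risk_score)
# 3. Rank sentences by the weight the local linear model assigns to them.
ranking = sorted(range(len(coef)), key=lambda j: coef[j], reverse=True)
print(ranking[0], [round(c, 3) for c in coef])
```

Enumerating all masks is only feasible because the example has four sentences; with longer texts the perturbations would be sampled at random instead.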

Germany's biggest bank under pressure after investigation.

Deutsche Bank was raided on suspicion of filing money laundering reports too late.

The probe, led by Frankfurt prosecutor Max Schmidt reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.

The prosecutor's office did not release details about the background of the probe due to ongoing investigative measures.

MASKED

Risky Entities & Name Matching

Rank entities:

— How frequently are they mentioned close to a risky sentence?

— Keep most relevant ones
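One possible way to turn "mentioned close to a risky sentence" into a number is sketched below. The scoring rule (risk decaying geometrically with sentence distance), the `decay` and `top_k` parameters, and the example inputs are all assumptions for illustration; the entity lists are what an NER plus coreference step might plausibly return for the Deutsche Bank article.

```python
from collections import defaultdict


def rank_entities(sentence_entities, sentence_risk, decay=0.5, top_k=2):
    """Rank entities by how often they appear near risky sentences.

    sentence_entities[i] lists the entities mentioned in sentence i
    (after coreference resolution); sentence_risk[i] is its risk score.
    """
    if len(sentence_entities) != len(sentence_risk):
        raise ValueError("need one risk score per sentence")
    scores = defaultdict(float)
    for i, entities in enumerate(sentence_entities):
        # Proximity of sentence i to risk: strongest risk after distance decay.
        proximity = max(
            (risk * decay ** abs(i - j) for j, risk in enumerate(sentence_risk)),
            default=0.0,
        )
        for entity in entities:
            scores[entity] += proximity
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical outputs of the upstream steps for the example article.
entities = [
    ["Deutsche Bank"],                 # "Germany's biggest bank" via coreference
    ["Deutsche Bank"],
    ["Max Schmidt", "Bashar Assad"],
    [],
]
risk = [0.0, 1.0, 0.0, 0.0]
print(rank_entities(entities, risk))
```

In this example Deutsche Bank ranks first because it is mentioned both in the risky sentence and right next to it, while the prosecutor only appears one sentence away.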

Then, match them against the database of client names (fuzzy matching):

— Instead of matching full names exactly, split them into bits (n-grams)
— Extract client names that are closest to the matched ones
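A minimal sketch of n-gram matching, assuming character trigrams compared with Jaccard similarity. The similarity measure, the `threshold` value, and the client names are assumptions chosen for the example, not details from the talk.

```python
def char_ngrams(name, n=3):
    """Set of character n-grams of a normalized, space-padded name."""
    padded = f" {' '.join(name.lower().split())} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_clients(entity, client_names, threshold=0.7):
    """Return (similarity, client) pairs above the threshold, best first."""
    grams = char_ngrams(entity)
    scored = [(jaccard(grams, char_ngrams(c)), c) for c in client_names]
    return sorted([(s, c) for s, c in scored if s >= threshold], reverse=True)


# Hypothetical client database.
clients = ["Deutsche Bank AG", "Deutsche Bahn", "Commerzbank"]
print(match_clients("Deutsche Bank", clients))
```

The threshold controls the trade-off between missed clients and false alerts: in this example "Deutsche Bahn" shares most of its trigrams with "Deutsche Bank" and would also match if the threshold were lowered to 0.6.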



Conclusions

— ML/AI is used in the banking industry to tackle compliance-related use cases

— Adverse-Media Screening must be tackled with NLP models

— The use case requires several intertwined NLP components

— My team successfully beat the 3rd-party provider on every performance metric