Stefano Spigler

Unit8, May 6

Adverse Media Screening

Extracting information from news articles

The banking industry is highly regulated:

Compliance plays an important role

— Discover issues early to mitigate consequences

 

Understand whether clients are involved in illicit activities:

— A great deal of information is publicly available (web, news)

 

The Bank tried two approaches:

— Using a 3rd-Party Vendor (3PV)

— Developing a challenger model in house

— Benefits: IP stays in house, no vendor lock-in, more flexibility, cheaper?

  Premise

Supervised multilabel classification into 8 categories

— Multiple classes not strictly required for downstream tasks

— ... but such granularity provides additional information
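A minimal sketch of what "multilabel" means in practice: each category gets its own independent probability, so an article can carry several labels at once. The label names beyond corruption and money laundering, the logits, and the 0.5 threshold are assumptions for illustration, and only three of the 8 categories are shown.

```python
import math

# Hypothetical subset of the 8 categories; "other" is a placeholder name.
LABELS = ["corruption", "money_laundering", "other"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Return per-label probabilities and every label that reaches the threshold."""
    probs = {label: sigmoid(z) for label, z in zip(LABELS, logits)}
    active = [label for label, p in probs.items() if p >= threshold]
    return probs, active

# Made-up model outputs (logits) for one article.
probs, active = predict_labels([2.0, 3.9, -4.0])
print(active)  # ['corruption', 'money_laundering']
```

Unlike a softmax over classes, the per-label probabilities do not need to sum to one.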

 

Choice of labels is aligned with 3PV labels

— To reduce the time to a first working solution, we trained on the outputs of the 3PV model

— A small set of articles (25k) was labeled by an independent company

 

 

Multilingual approaches

— Either invoke language-specific models

— Or translate to EN with neural machine translation (NMT)
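One possible way to combine the two options is a simple router: use a language-specific model when one exists, otherwise translate to English first. This is only a sketch. The keyword "models" and the dictionary-based `translate` function are toy stand-ins so that the example runs; they are not the classifiers or the NMT system used in the project.

```python
def classify_multilingual(text, lang, models, translate, fallback_lang="en"):
    """Use a language-specific model if available, else translate and use the English one."""
    if lang in models:
        return models[lang](text)
    return models[fallback_lang](translate(text, lang, fallback_lang))

# Toy stand-ins: keyword rules instead of trained classifiers.
models = {
    "en": lambda t: "money_laundering" if "laundering" in t.lower() else "none",
    "de": lambda t: "money_laundering" if "geldwäsche" in t.lower() else "none",
}

# Toy word-by-word "translation" standing in for an NMT system.
toy_dictionary = {"blanchiment": "laundering"}

def translate(text, src, dst):
    return " ".join(toy_dictionary.get(word.lower(), word) for word in text.split())

german = classify_multilingual("Verdacht auf Geldwäsche", "de", models, translate)
french = classify_multilingual("Soupçons de blanchiment", "fr", models, translate)
print(german, french)  # money_laundering money_laundering
```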

  Solution: Risk classification / Inputs

(corruption, money laundering, ...)

Natural Language Processing

— How to deal with real-world texts? Machines understand numbers!

— How to capture the meaning of words with numbers?


 

Word embeddings

— Represent words as vectors: similar vectors ~ similar words
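A toy illustration of "similar vectors ~ similar words" and of the analogy king - man + woman ~ queen. The 3-dimensional vectors are invented for this sketch (axes roughly: royalty, male, female); real embeddings are learned from text and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors, not learned embeddings.
EMBEDDINGS = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "bank":  [0.1, 0.1, 0.1],
}

def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three query words."""
    target = [x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))

print(analogy("king", "man", "woman"))  # queen
```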


Then: neural networks!

We developed an ad hoc model and compared it with state-of-the-art (SotA) Transformer models.

— Better performance than 3PV!

  Solution: Risk classification / Process

king - man + woman ~ queen

  Solution: Risk imputation

Who is the main actor in an article?

— No resources to follow a supervised approach!

 

Two ingredients:

— Score risk at the sentence level (prediction explanation)

— Extract entities (people, organizations) and resolve coreferences (pronouns)

Assign each entity a score based on its proximity to the relevant sentences.

Germany's biggest bank under pressure after investigation.

Deutsche Bank was raided on suspicion of filing money laundering reports too late. [Money Laundering: 98%]

The probe reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.
The Frankfurt prosecutor's office did not release details about the background of the probe "due to ongoing investigative measures."
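The proximity idea can be sketched on the example article above. The exact scoring formula is not given in the slides, so the exponential decay with sentence distance is an assumption, as are all sentence scores other than the 98%, and the mention positions (which assume coreference has already linked "Germany's biggest bank" to Deutsche Bank).

```python
def score_entities(sentence_risk, mentions, decay=0.5):
    """Sum, over each mention, the risk of every sentence discounted by its distance."""
    scores = {}
    for entity, positions in mentions.items():
        scores[entity] = sum(
            risk * decay ** abs(pos - idx)
            for pos in positions
            for idx, risk in enumerate(sentence_risk)
        )
    return scores

# Risk per sentence of the example; only 0.98 comes from the slide, the rest is made up.
sentence_risk = [0.10, 0.98, 0.30, 0.05]

# Sentence indices where each entity is mentioned (assumed output of entity extraction).
mentions = {
    "Deutsche Bank": [0, 1],
    "Bashar Assad": [2],
    "Frankfurt prosecutor's office": [3],
}

scores = score_entities(sentence_risk, mentions)
main_actor = max(scores, key=scores.get)
print(main_actor)  # Deutsche Bank
```

Summing over mentions favors entities that appear often near risky sentences; averaging instead would be an equally plausible choice.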

  Explaining predictions with LIME

Neural network outputs are not easily justified.

This is in contrast to linear models, where feature importance is known!

 

Simple idea: does the prediction change when we remove some parts?
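A minimal sketch of this removal test at the sentence level: drop one sentence at a time and record how much the prediction falls. `toy_model` is a hypothetical keyword-based stand-in for the real classifier, used only so that the example runs.

```python
def toy_model(sentences):
    """Hypothetical stand-in for the classifier: fraction of trigger phrases present."""
    triggers = ("money laundering", "raided", "probe")
    text = " ".join(sentences).lower()
    return sum(t in text for t in triggers) / len(triggers)

def occlusion_importance(sentences, model):
    """Importance of sentence i = drop in the prediction when sentence i is removed."""
    base = model(sentences)
    return [base - model(sentences[:i] + sentences[i + 1:]) for i in range(len(sentences))]

SENTENCES = [
    "Germany's biggest bank under pressure after investigation.",
    "Deutsche Bank was raided on suspicion of filing money laundering reports too late.",
    "The probe reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.",
    "The Frankfurt prosecutor's office did not release details about the background of the probe.",
]

importance = occlusion_importance(SENTENCES, toy_model)
most_important = max(range(len(SENTENCES)), key=lambda i: importance[i])
print(most_important)  # 1
```

In this toy setup the last two sentences both receive zero importance, because each one still carries the word "probe" when the other is removed. Redundant evidence is a known weak spot of removing one part at a time.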


Better idea: LIME

— generate variations of the text that are close to the original

— approximate the model locally with a linear model

— rank features!
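A rough, dependency-free sketch of these three steps, with sentences as features. It is not the reference LIME implementation: `toy_model` is a hypothetical stand-in for the classifier, and the kernel, the number of samples, and the ridge term are assumptions chosen for illustration.

```python
import math
import random

def toy_model(sentences):
    """Hypothetical stand-in for the classifier: fraction of trigger phrases present."""
    triggers = ("money laundering", "raided", "probe")
    text = " ".join(sentences).lower()
    return sum(t in text for t in triggers) / len(triggers)

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def lime_weights(sentences, model, n_samples=500, kernel_width=0.75, ridge=1e-3, seed=0):
    """Fit a locally weighted linear surrogate; return one coefficient per sentence."""
    rng = random.Random(seed)
    d = len(sentences)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # 1. Variation of the text: keep (1) or mask (0) each sentence at random.
        mask = [rng.randint(0, 1) for _ in range(d)]
        kept = [s for s, keep in zip(sentences, mask) if keep]
        distance = 1.0 - sum(mask) / d  # fraction of sentences removed
        rows.append(mask + [1])         # trailing 1 is the intercept column
        targets.append(model(kept))
        # Variations closer to the original text get a larger weight.
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    # 2. Weighted ridge regression: (X^T W X + ridge * I) beta = X^T W y
    k = d + 1
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) + (ridge if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets)) for i in range(k)]
    return solve(xtwx, xtwy)[:d]

SENTENCES = [
    "Germany's biggest bank under pressure after investigation.",
    "Deutsche Bank was raided on suspicion of filing money laundering reports too late.",
    "The probe reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.",
    "The Frankfurt prosecutor's office did not release details about the background of the probe.",
]

coefs = lime_weights(SENTENCES, toy_model)
# 3. Rank features by their coefficient in the surrogate model.
ranking = sorted(range(len(SENTENCES)), key=lambda i: coefs[i], reverse=True)
print(ranking)
```

Because many sentences are masked jointly, the two sentences that mention the probe both receive a positive coefficient here, even though removing either one alone leaves the toy prediction unchanged.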

Germany's biggest bank under pressure after investigation.

Deutsche Bank was raided on suspicion of filing money laundering reports too late.

[MASKED]
The Frankfurt prosecutor's office did not release details about the background of the probe "due to ongoing investigative measures."

Germany's biggest bank under pressure after investigation.

[MASKED]

The probe reportedly concerns a transaction involving the family of Syrian leader Bashar Assad.
The Frankfurt prosecutor's office did not release details about the background of the probe "due to ongoing investigative measures."

  Solution: Name matching

We match our hits against a database of client names.
 

 

Use fuzzy matching:

— Instead of matching full names exactly, split them into smaller pieces (n-grams)

— Represent all n-grams as sparse vectors (TF-IDF: frequency-based)

— Retrieve the client names whose vectors are closest to those of the hits

 

Better performance than 3PV!

Assad  \(\longrightarrow\)   A-s-s-a-d, As-s-ad, Ass-ad, ...
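The fuzzy-matching steps above can be sketched with character bigrams, TF-IDF weights, and cosine similarity. The client list, the padding with spaces, the choice of n = 2, and the smoothed IDF formula are assumptions for illustration, not details taken from the project.

```python
import math
from collections import Counter

def ngrams(name, n=2):
    """Split a lowercased, space-padded name into overlapping character n-grams."""
    padded = f" {name.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def normalize(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else vec

def build_index(names, n=2):
    """Compute IDF weights and unit-length sparse TF-IDF vectors for all client names."""
    docs = [Counter(ngrams(name, n)) for name in names]
    df = Counter(g for doc in docs for g in doc)
    idf = {g: math.log((1 + len(names)) / (1 + c)) + 1.0 for g, c in df.items()}
    vectors = [normalize({g: tf * idf[g] for g, tf in doc.items()}) for doc in docs]
    return idf, vectors

def best_match(query, names, idf, vectors, n=2):
    """Return the client name with the highest cosine similarity to the query."""
    counts = Counter(ngrams(query, n))
    q = normalize({g: tf * idf[g] for g, tf in counts.items() if g in idf})
    sims = [sum(w * vec.get(g, 0.0) for g, w in q.items()) for vec in vectors]
    best = max(range(len(names)), key=sims.__getitem__)
    return names[best], sims[best]

# Hypothetical client database.
clients = ["Bashar Assad", "Deutsche Bank AG", "Frankfurt Trading GmbH"]
idf, vectors = build_index(clients)

name, similarity = best_match("Bashar al-Asad", clients, idf, vectors)
print(name)  # Bashar Assad
```

A spelling variant still shares most of its bigrams with the stored name, which is why it matches where an exact string comparison would fail.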

  System design

[Architecture diagram]

Layers: Data layer, Processing layer, Application layer, Client layer

Processing stages, grouped in the diagram under "Risk classification" and "Risk imputation":

— Ingestion

— Translation

— Classification

— Sentence scoring

— Entity extraction

— Entity scoring

— Entity matching

External inputs: Data providers

Infrastructure: Spark cluster, Kafka queue, ElasticSearch, Kubernetes / Load Balancer

Data stores: Doc archive (NoSQL DB), Client names (DB2), Alerts (PostgreSQL DB)

Interfaces: UI, Monitoring, REST API
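The order of the processing stages can be sketched as a chain of functions applied to one document. Every stage here is a placeholder that only records that it ran. The real stages, and how they are distributed over Spark and Kafka, are not described in the slides, so none of that is modeled.

```python
from functools import reduce

# Stage names as they appear in the diagram.
STAGE_NAMES = [
    "ingestion",
    "translation",
    "classification",
    "sentence_scoring",
    "entity_extraction",
    "entity_scoring",
    "entity_matching",
]

def make_stage(name):
    """Build a placeholder stage that appends its name to the document's trace."""
    def stage(doc):
        return {**doc, "trace": doc.get("trace", []) + [name]}
    return stage

PIPELINE = [make_stage(name) for name in STAGE_NAMES]

def run_pipeline(doc, stages=PIPELINE):
    """Apply the stages in order, feeding each stage the previous stage's output."""
    return reduce(lambda current, stage: stage(current), stages, doc)

result = run_pipeline({"text": "Deutsche Bank was raided ..."})
print(result["trace"])
```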
