Santander Group

The Santander Group is the largest bank in the Eurozone with a market capitalization of €65,792M [4Q’15].

Europe: 82.4%, America: 17.1%, Rest of the world: 0.5% [1]

[1] Quarterly Shareholder Report October - December 2015

Worldwide presence:

Methodology

Task 1: Business Problem

Which customers will become inactive by March 2015?

Data

Active Customers: 1.8M
Data from: October 2012 to December 2014
62M+ records
Data contains: Customer Activities

Process

Define active/inactive
Data Preprocessing
Predictive Analytics
Data Mining
Model Evaluation
Visualization

Output

Active Customers: 1.7M
Inactive Customers: 99K

Based on historical data, we want to identify the customers who will become inactive (99K).

Task 1: Data Understanding/Preparation

Data

Active Customers: 1.8M
62M+ records
Data contains: Customer Activities

A client is considered active if the client:

Performs at least 3 transactions with the account in the last 90 days
Has an average volume in the last 6 months >= a predetermined amount
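The activity rule on this slide can be sketched in a few lines of Python. This is only an illustration: the field names and the `min_avg_volume` parameter are hypothetical, the predetermined amount is not given in the deck, and the sketch assumes both criteria must hold, which the slide does not state.

```python
from dataclasses import dataclass


@dataclass
class ClientActivity:
    """Rolled-up activity for one client (field names are hypothetical)."""
    transactions_last_90_days: int
    avg_volume_last_6_months: float


def is_active(client: ClientActivity, min_avg_volume: float) -> bool:
    """Apply the two activity criteria from the slide.

    Assumption: both criteria must hold (logical AND); the slide does not
    say how they combine. `min_avg_volume` stands in for the predetermined
    amount, which the deck does not give.
    """
    enough_transactions = client.transactions_last_90_days >= 3
    enough_volume = client.avg_volume_last_6_months >= min_avg_volume
    return enough_transactions and enough_volume
```

If the two criteria are actually alternatives, the final `and` would become `or`.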

Task 1: Data Mining

Process

Feature Selection

163 variables
Difficult to directly select attributes

Selecting attributes

Manually identified certain fields that are not useful
Used the Weight by Information Gain operator, which calculates the weight of each attribute

Top Attributes

Balance in the second package checking account
Number of products a customer has
CRM ID of the customer income segment (A, B, C, D, S)
Total balance of the ATM transactions made by the customer
Total number of ATM transactions made by the customer
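To make the idea behind weighting attributes by information gain concrete, here is a minimal sketch of the textbook definition for a discrete attribute. It is not the operator's actual implementation, and the function names and any toy data used with it are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder
```

An attribute that separates active from inactive customers perfectly gets the highest possible gain, while one that leaves the class mix unchanged in every group gets a gain of zero. Ranking the 163 variables by such a weight is one way to shortlist candidates.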

Task 1: Data Modeling

Model

Model Selection

TP – Predicted as inactive and are truly inactive

TN – Predicted as active and are truly active

FP – Predicted as inactive and are truly active

FN – Predicted as active and are truly inactive

Precision rate = TP / (TP + FP)

Recall rate = TP / (TP + FN)
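As a worked illustration of these definitions, the sketch below counts TP, TN, FP and FN with "inactive" as the positive class, then computes precision and recall. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def confusion_counts(actual, predicted, positive="inactive"):
    """Count TP, TN, FP, FN, treating `positive` as the class of interest."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p != positive and a != positive:
            tn += 1
        elif p == positive and a != positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```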

Task 1: Model Evaluation

Model

Two models, Decision Tree and Naïve Bayes, were selected based on class recall and class precision.

Model Selection

Task 1: Results

The decision tree was used for scoring the Dec 2014 active customers (N0).

The model scored 97,885 customers as inactive.
The remaining 1,159 IDs were picked from the IDs scored as 1 but with low confidence.

Output

Inactive Customers

99K
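The selection described on this slide (97,885 + 1,159 = 99,044 IDs) can be sketched as a top-up step. This assumes that label 0 means inactive and label 1 means active, and that the confidence value refers to the predicted label; the deck does not spell out either point, and the function and variable names are hypothetical.

```python
def select_inactive(scored, quota):
    """Pick customer IDs to report as inactive.

    `scored` is a list of (customer_id, predicted_label, confidence) tuples.
    Assumptions: label 0 = inactive, label 1 = active, and `confidence` is
    the model's confidence in the predicted label.
    """
    predicted_inactive = [cid for cid, label, _ in scored if label == 0]
    shortfall = max(0, quota - len(predicted_inactive))
    # Fill the shortfall with the predicted-active IDs the model is least sure about.
    active_by_confidence = sorted(
        (row for row in scored if row[1] == 1), key=lambda row: row[2]
    )
    top_up = [cid for cid, _, _ in active_by_confidence[:shortfall]]
    return predicted_inactive + top_up
```

If the model already flags at least `quota` customers as inactive, the sketch returns all of them without truncation.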

Task 1: Challenges

Memory crunch when processing the massive data set of 62 million records.
SQL operations on the huge database take a lot of time.
Need to roll up activities first and then calculate fields.
Applying several models to the huge data set.
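One way to ease the memory pressure of the roll-up step is to stream the records and keep only per-customer, per-month aggregates in memory. The sketch below is not the approach used in the project (the deck mentions SQL); the record layout `(customer_id, month, amount)` is an assumption made for illustration.

```python
from collections import defaultdict


def roll_up(records):
    """Aggregate transaction records into per-customer, per-month totals.

    `records` is any iterable of (customer_id, month, amount) tuples, so it
    can be a generator reading from a file or cursor; only the aggregates
    are held in memory, not the raw records.
    """
    totals = defaultdict(lambda: [0, 0.0])  # key -> [transaction count, volume]
    for customer_id, month, amount in records:
        entry = totals[(customer_id, month)]
        entry[0] += 1
        entry[1] += amount
    return {key: (count, volume) for key, (count, volume) in totals.items()}
```

Derived fields such as the number of transactions in the last 90 days can then be calculated from the rolled-up table rather than from the 62M+ raw rows.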

Task 2: Nivel Satisfacción (Satisfaction Level)

Cost (Acquiring new customers) > Cost (retaining a customer)

Can we use transactional data to predict the level of satisfaction of a customer?

Dependent variable:

1. Nivel Satisfaccion ~ nominal variable with values 0,1,2.

2. Predict_binary ~ binary variable with values 0 (for 0) and 1 (for 1 and 2).
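The mapping from the three-level variable to the binary one is small enough to show directly. The function name is hypothetical; the mapping itself follows the definition above.

```python
def to_predict_binary(nivel_satisfaccion: int) -> int:
    """Map satisfaction level 0 to 0, and levels 1 and 2 to 1."""
    if nivel_satisfaccion not in (0, 1, 2):
        raise ValueError(f"unexpected satisfaction level: {nivel_satisfaccion}")
    return 0 if nivel_satisfaccion == 0 else 1
```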

Task 2: Nivel Satisfacción

Predict the customer satisfaction level

Satisfaction data given for 30,000 customers
Satisfaction score will have values 0, 1 or 2
Need to score the customer IDs of customers who were surveyed during Q1 2015
Scoring distribution:

0's: 133 (~10.6%)
1's: 419 (~33.4%)
2's: 703 (~56.0%)
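The percentages in the scoring distribution can be checked from the counts, which sum to 1,255. The variable names below are chosen for this sketch only.

```python
# Counts per satisfaction level, as listed on the slide.
counts = {0: 133, 1: 419, 2: 703}

total = sum(counts.values())
# Share of each level, in percent, rounded to one decimal place.
shares = {level: round(100 * n / total, 1) for level, n in counts.items()}
```

The lowest level accounts for roughly one score in ten, so the classes are clearly imbalanced.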

Task 2: Data Mining

Process

Feature Selection

200+ variables

Difficult to directly select attributes
Weka's ‘Select Attribute’ feature was used to narrow these down to 32 features

Selecting attributes

Top 5 Attributes

Task 2: Model Evaluation

Model

Cost-sensitive Random Forest was selected based on class recall and class precision.

Model Selection

W-simplekmeans
Cost-sensitive Random Forest
K-star
Multilayer perceptron
Neural net
Naïve Bayes
Decision Tree
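The deck does not say how cost sensitivity was applied to the random forest. One common approach, sketched here as an assumption rather than as the project's method, is to take the class probabilities from any classifier and predict the class with the lowest expected misclassification cost. The cost values used in the usage note are hypothetical.

```python
def min_expected_cost_class(probabilities, cost_matrix):
    """Return the class index with the lowest expected misclassification cost.

    `probabilities[k]` is the estimated probability of class k, and
    `cost_matrix[actual][predicted]` is the cost of predicting `predicted`
    when the true class is `actual`.
    """
    n = len(probabilities)
    expected = [
        sum(probabilities[actual] * cost_matrix[actual][predicted] for actual in range(n))
        for predicted in range(n)
    ]
    return min(range(n), key=lambda predicted: expected[predicted])
```

With probabilities `[0.2, 0.3, 0.5]` and equal costs for every error, the sketch picks class 2. If missing a true 0 is made five times as costly as any other error, it picks class 0 instead, which is how a cost matrix can push a model toward the rare dissatisfied class.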

Task 2: Results

Cost-sensitive Random Forest was used for scoring satisfaction for 1Q15

Output

Satisfaction Scores

0, 1 and 2

Task 2: Challenges

Unlike task 1, in task 2 there were no good predictors of nivel_satisfaccion.

There was no access to the actual survey or knowledge as to why certain individuals may have had several surveys administered to them.

Certain attributes related to satisfaction were in the training set but not in the scoring set.
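A quick check for the last challenge is to compare the column lists of the two data sets before training, so that a model is never fitted on attributes that will be unavailable at scoring time. The function name is hypothetical, as are any column names used with it.

```python
def columns_missing_from_scoring(training_columns, scoring_columns):
    """List training columns that do not appear in the scoring set, in order."""
    scoring = set(scoring_columns)
    return [column for column in training_columns if column not in scoring]
```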

Next Steps

Segment the different "personas" and create the right intervention to prevent customers from becoming inactive

Understand the tradeoff between an accurate model and a model that leaves enough time to make an intervention

Santander NEOs Challenge

Thank You!
