Santander NEOs Challenge 

 March 14, 2016

Santander Group

The Santander Group is the largest bank in the Eurozone with a market capitalization of €65,792M [4Q’15].

Europe

82.4%

America

17.1%

Rest of the world

0.5%

[1]

[1]

[1] Quarterly Shareholder Report  October - December  2015

Worldwide presence:

Methodology

Task 1:Business Problem

  • Which customers will become inactive by March 2015?

Data

Process

Output

Active Customers

1.8M

Active Customers

1.7M

Inactive Customers

99K

Data from:
October 2012 to
December 2014
.

62M+ Records

Data contains: 
Customer Activities

Based on historical data, we want to identify inactive customers (99K)

Define active/inactive

Data Preprocessing

Predictive Analytics

Data Mining

Model Evaluation

Visualization

 

Task 1:Data Understanding/Preparation

Active Customers

1.8M

 62M+ Records

Data contains: 
Customer Activities

  • Performs at least 3 transaction with the account in the last 90 day

  • Have an average volume in the last 6 months >= pre-determine amount

Data

  • A client is considered active if the client:

Task 1:Data Mining

Process

Feature Selection

163 variables

  • Difficult to directly select attributes
  • Manually identified certain fields that are not useful
  • Weight by Information Gain operator which calculates the weight of attributes
  • Selecting attributes
  • Balance in the second package checking account
  • Number of products a customer has
  • CRM ID of the customer income segment (A , B, C, D,S)
  • Total balance of the  ATM transactions made by the customer
  • Total number of ATM transactions made by the customer Identifies
  • Top Attributes

Task 1:Data Modeling

Model

Model Selection

TP – Predicted as inactive and are truly inactive

TN – Predicted as active and are truly active

FP – Predicted as inactive and are truly active

FN – Predicted as active and are truly inactive

 

Precision rate - TP / (TP + FP)

Recall rate - TP / (TP + FN)

Task 1:Model Evaluation

Model

Two models Decision Tree and Naïve Bayes are selected based on the class recall and class precision.

Model Selection

Task 1:Results

  • The model scored 97885 as inactive.
     
  • Remaining 1159 IDs are picked from the IDs scored as 1 but with a low confidence.

 

Decision tree is used for scoring Dec 2014 active customers (N0)

Output

Inactive Customers

99K

Task 1:Challenges

  • Memory crunch to process massive data of 62 million records.
     
  • Takes lot of time to perform any operations in SQL on huge database.
     
  • Need to do roll up activities and then calculate fields
     
  • Applying several models on huge data

Task 2: Nivel Satisfacción

Cost (Acquiring new customers) > Cost (retaining a customer)

Can we use transactional data to predict the level of satisfaction of a customer?

Dependent variable:

 

1. Nivel Satisfaccion ~ nominal variable with values 0,1,2.

 

2. Predict_binary ~ binary variable with values 0 (for 0) and 1 (for 1 and 2).

Task 2: Nivel Satisfacción

Predict the customer satisfaction level

  • Satisfaction data given for 30,000 customers
     
  • Satisfaction score will have values 0, 1 or 2
     
  • Need to score customer IDs that corresponds who were surveyed during Q1 2015
     
  • Scoring distribution: 

0's - 133 ~ 10.6%

1's - 419 ~ 33.4%

2's - 703 ~ 56%

Task 2:Data Mining

Process

Feature Selection

200+ variables

  • Difficult to directly select attributes
  • Weka ‘Select Attribute’ feature is used to determine 32 features
  • Selecting attributes
  • Top 5  Attributes

Task 2:Model Evaluation

Model

Cost sensitive-Random forest is selected based on the class recall and class precision.

Model Selection

W-simplekmeans
Cost sensitive-Random forest
K-star
Multilayer perception
Neural net
Naïve Bayes
Decision Tree

 

Task 2:Results

Cost sensitive-Random forest is used for scoring satisfaction for 1Q15

Output

Satisfaction Scores

0, 1 and 2

Task 2: Challenges

  • Unlike task 1, in task 2 there were no good predictors of nivel_satisfaccion.
  • There was no access to the actual survey or knowledge as to why certain individuals may have had several surveys administered to them.
  • Certain attributes related to the satisfaction were on the training set but not in the scoring set

Next Steps

  • Segment the different "personas" and create the right intervention to prevent customers from becoming inactive
  • Understand the tradeoff between an accurate model vs. a model that allows you enough time to make an intervention
  • Understand the tradeoff between an accurate model vs. a model that allows you enough time to make an intervention

Thank You!

Santader

By acast317

Santader

  • 1,084