Motiv Data Science Challenge

Detkov Nikita

Liyasov Ilya

Pavelyev Ivan

Vasilyeva Tanya

Creating a scoring model of customer churn

Formulation of the problem

  • Predict customer churn from each customer's data over the last 3 months;
  • The predictors (i.e., features) must be explicit, so we can't use:
    • dimensionality reduction algorithms such as PCA, t-SNE, UMAP and autoencoders (they transform features rather than select them);
    • neural networks (almost impossible to interpret);
    • Bayesian methods (we can't guarantee that the variables are independent);
    • decision-tree forests and their ensembles.
  • The model must be interpretable, so we chose linear models.
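
Why a linear model counts as interpretable can be written out for logistic regression, the classifier named later in these slides. The symbols below (x_j for feature j, beta_j for its weight, d for the number of features) are notation introduced here for illustration, not taken from the original deck:

```latex
\log\frac{p(\text{churn}\mid x)}{1 - p(\text{churn}\mid x)}
  = \beta_0 + \sum_{j=1}^{d} \beta_j x_j ,
\qquad
\frac{\text{odds}(x_1,\dots,x_j + 1,\dots,x_d)}{\text{odds}(x_1,\dots,x_j,\dots,x_d)} = e^{\beta_j}
```

Each feature enters the log-odds additively, so raising x_j by one unit multiplies the churn odds by exp(beta_j) with all other features held fixed. This is the sense in which every predictor stays explicit.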

A few words about the data

  • 100k customers in Train and 100k in Test
  • Train and Test differ a lot in customer behaviour because their lifetime distributions are completely different
  • 99.73% of customers belong to one class and 0.27% to the other: an extremely imbalanced classification problem
  • Features cover SMS, calls, internet traffic, SIM card and tariff ID for each month
  • The metric is ROC AUC
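
To illustrate why a ranking metric such as ROC AUC suits this class balance better than accuracy, here is a minimal sketch in plain Python. The 10,000-customer sample, the helper `roc_auc` and the variable names are all assumptions made for the example; only the 0.27% minority share comes from the slides.

```python
def roc_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney) formula; tied scores share an average rank."""
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores pairs[i:j].
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    n_pos = sum(y_true)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Hypothetical sample: 27 minority-class customers out of 10,000 (0.27%).
y_true = [1] * 27 + [0] * 9973

# A "model" that gives every customer the same score and always predicts class 0.
constant_scores = [0.0] * len(y_true)
always_negative = [0] * len(y_true)

accuracy = sum(p == y for p, y in zip(always_negative, y_true)) / len(y_true)
auc_constant = roc_auc(y_true, constant_scores)

print(f"accuracy of the constant model: {accuracy:.4f}")
print(f"ROC AUC of the constant model:  {auc_constant:.2f}")
```

The constant model looks excellent by accuracy yet carries no ranking information, which ROC AUC exposes. In practice a library routine such as scikit-learn's `roc_auc_score` would normally replace the hand-written helper.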

How did we solve the problem?

  1. EDA and data correction (missing values, duplicates, inconsistencies)
  2. Attempts to find a data leak (unsuccessful)
  3. EDA (yes, again)
  4. The fight against bias (due to lifetime)
  5. Feature engineering
  6. Feature selection
  7. Testing approaches with different scalers and classifiers
  8. Choosing 5 uncorrelated Logistic Regression models with the highest CV score: fast, accurate and interpretable
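
The last step can be sketched as a greedy selection. The slides do not describe the actual procedure, so everything here is an assumption for illustration: the function names, the toy predictions, and the 0.9 correlation threshold.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def pick_uncorrelated(candidates, k=5, max_corr=0.9):
    """Greedily keep the best-scoring models whose predictions are not too
    correlated with any model already kept.

    candidates: dict mapping model name -> (cv_score, list of predictions).
    max_corr:   hypothetical threshold on the absolute Pearson correlation.
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1][0], reverse=True)
    chosen = []
    for name, (_, preds) in ranked:
        if all(abs(pearson(preds, candidates[c][1])) <= max_corr for c in chosen):
            chosen.append(name)
        if len(chosen) == k:
            break
    return chosen


# Toy candidates: (CV ROC AUC, predictions on the same validation rows).
candidates = {
    "model_a": (0.71, [0.10, 0.20, 0.30, 0.40, 0.50]),
    "model_b": (0.70, [0.11, 0.21, 0.29, 0.41, 0.52]),  # near-duplicate of model_a
    "model_c": (0.69, [0.50, 0.10, 0.40, 0.20, 0.30]),
}

selected = pick_uncorrelated(candidates, k=2)
print(selected)
```

Here `model_b` is skipped despite its higher CV score because its predictions almost coincide with those of `model_a`, so it would add little new information to the final set.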

Figure: correlation matrix of model predictions. Each model scores about 0.7 ROC AUC on the Train set.

 

ROC AUC on the Test set is 0.67

Thanks for your attention!

Motiv Hackathon

By Nikita
