Motiv Data Science Challenge
Detkov Nikita
Liyasov Ilya
Pavelyev Ivan
Vasilyeva Tanya
Creating a scoring model of customer churn
Formulation of the problem
- Predict customer churn from each customer's data over the last 3 months;
- The predictors (i.e., features) must be explicit, so we can't use:
  - dimensionality reduction algorithms like PCA, t-SNE, UMAP and autoencoders (they transform features rather than select them);
  - neural networks (almost no way to interpret them);
  - Bayesian methods (we can't guarantee the independence of variables);
  - forests of decision trees and their ensembles.
- The model must be interpretable, so we chose linear models.
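
One reason linear models count as interpretable: in a logistic regression, each weight maps directly to an odds ratio. The sketch below illustrates this under assumptions. The feature names, weights and intercept are hypothetical (not taken from the team's models), and features are assumed to be standardized.

```python
from math import exp

# Hypothetical weights of a fitted logistic regression on standardized features.
coefs = {"calls_last_month": -0.8, "sms_last_month": 0.1}
# Hypothetical intercept, picked so the baseline probability is roughly 0.27%.
intercept = -5.9


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + exp(-z))


def churn_probability(features):
    """Probability of churn for one customer given a dict of feature values."""
    score = intercept + sum(coefs[name] * value for name, value in features.items())
    return sigmoid(score)


# exp(weight) is the factor by which the odds of churn change
# when the feature grows by one standard deviation.
odds_ratios = {name: exp(weight) for name, weight in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds of churn multiplied by {ratio:.2f} per +1 std")
```

An odds ratio below 1 means the feature lowers the odds of churn, above 1 means it raises them. This is the kind of direct reading that the excluded model families do not offer.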
A few words about the data
- 100k customers in Train and 100k in Test
- Train and Test differ a lot in customer behaviour because their lifetime distributions are completely different
- 99.73% of customers belong to one class and 0.27% to the other: an extremely imbalanced classification problem
- Features: SMS, calls, internet traffic, SIM card and tariff id for each month
- Metric is ROC AUC
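
With a 99.73% / 0.27% split, accuracy says almost nothing, which is why a ranking metric such as ROC AUC fits this task. The sketch below is a toy illustration on synthetic labels, not the challenge data. It computes ROC AUC as the probability that a random positive is scored above a random negative, and compares it with accuracy for a model that never predicts churn.

```python
def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison: P(score of positive > score of negative).

    Ties count as half a win. Quadratic in the worst case, fine for a toy example.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_pred):
    """Share of labels predicted exactly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Synthetic labels with the class ratio from the slide: 27 churners per 10,000.
y = [1] * 27 + [0] * 9973

# A "model" that gives every customer the same score and never predicts churn.
constant_scores = [0.0] * len(y)
acc = accuracy(y, [0] * len(y))
auc = roc_auc(y, constant_scores)

print(f"accuracy = {acc:.4f}, ROC AUC = {auc:.2f}")
```

The constant model looks excellent by accuracy yet ranks customers no better than chance, which ROC AUC exposes.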
How did we solve the problem?
- EDA and data correction (missing values, duplicates, inconsistencies)
- Attempts to find a data leak (unsuccessful)
- EDA (yes, again)
- Fighting the bias caused by the difference in lifetime
- Feature engineering
- Feature selection
- Testing approaches with different scalers and classifiers
- Choosing 5 uncorrelated Logistic Regression models with the highest CV score: fast, accurate and interpretable
- Correlation matrix of model predictions; each model has about 0.7 ROC AUC on the Train set
- ROC AUC on the Test set is 0.67
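
The slides do not say how the 5 uncorrelated models were picked, so the following is only one plausible sketch. It greedily walks candidates from the best CV score down and keeps a model only if its predictions are not too correlated with those already kept. The function `pick_uncorrelated`, the `max_corr` threshold, the model names and all numbers are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def pick_uncorrelated(candidates, k=5, max_corr=0.9):
    """Greedy selection of up to k models.

    candidates: dict mapping model name -> (cv_score, list of predictions).
    A model is kept if |correlation| with every already kept model is below max_corr.
    """
    chosen = []
    ranked = sorted(candidates.items(), key=lambda item: item[1][0], reverse=True)
    for name, (_, preds) in ranked:
        if all(abs(pearson(preds, candidates[c][1])) < max_corr for c in chosen):
            chosen.append(name)
        if len(chosen) == k:
            break
    return chosen


# Hypothetical candidates: (CV ROC AUC, predictions on a shared validation set).
candidates = {
    "model_a": (0.71, [0.10, 0.20, 0.30, 0.40, 0.50]),
    "model_b": (0.70, [0.11, 0.21, 0.29, 0.41, 0.52]),  # near-duplicate of model_a
    "model_c": (0.69, [0.50, 0.10, 0.40, 0.20, 0.30]),
}

selected = pick_uncorrelated(candidates, k=2)
print(selected)
```

Here `model_b` is skipped despite its higher CV score because it adds almost no information beyond `model_a`.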
Thanks for your attention!