Westerosi Ride Hailing Churn
EDA
- Computing Target:
- Convert last_trip_date to a timestamp
- Flag a user as churned if their last trip was more than 30 days before July 1, 2014
EDA
- Pretty good data:
- Few NaN or missing values
- avg_rating_of_driver 16% empty (engineered column)
- avg_rating_by_driver .4% empty (dropped rows)
- phone .8% empty (dropped rows)
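A minimal sketch of the missing-value handling listed above, assuming a pandas DataFrame with the column names from this slide. The toy rows are made up, and the `rated_driver` flag name is taken from the feature list later in the deck. The slides do not say how the remaining gaps in avg_rating_of_driver were filled, so the median fill below is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real ride-hailing table (values are made up).
df = pd.DataFrame({
    "avg_rating_of_driver": [4.5, np.nan, 5.0, np.nan, 3.0],
    "avg_rating_by_driver": [5.0, 4.8, np.nan, 4.0, 4.9],
    "phone": ["iPhone", "Android", "iPhone", None, "Android"],
})

# Share of missing values per column, as a percentage.
missing_pct = df.isna().mean() * 100

# Frequently missing column: keep the information as a boolean flag
# instead of dropping the rows.
df["rated_driver"] = df["avg_rating_of_driver"].notna()

# Assumption (not stated in the slides): fill the gaps with the median rating.
median_rating = df["avg_rating_of_driver"].median()
df["avg_rating_of_driver"] = df["avg_rating_of_driver"].fillna(median_rating)

# Rarely missing columns: drop the affected rows.
clean = df.dropna(subset=["avg_rating_by_driver", "phone"])

print(missing_pct.round(1))
print(clean)
```

Keeping the flag preserves the signal that a user never rated a driver, which a plain fill would erase.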
EDA
- Cities perform very differently:
- 74% of Astaporians Churn
- 65% of Winterfellers Churn
- Only 37% of King's Landites Churn
- We should focus our campaigns on Astapor and Winterfell; our market in King's Landing is already fairly strong!
- Further research into the characteristics and demographics of King's Landing vs. Astapor and Winterfell should be done
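Per-group churn rates like the ones above can be computed with a single groupby. The sketch below assumes a boolean `churn` column and uses made-up rows; the helper name `churn_rate_by` is hypothetical.

```python
import pandas as pd


def churn_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the fraction of churned users for each value of `column`."""
    return df.groupby(column)["churn"].mean().sort_values(ascending=False)


# Toy rows (made up); the real table has one row per user.
users = pd.DataFrame({
    "city": ["Astapor", "Astapor", "Winterfell", "King's Landing",
             "King's Landing", "Winterfell", "Astapor", "King's Landing"],
    "churn": [True, True, True, False, False, False, False, True],
})

rates = churn_rate_by(users, "city")
print((rates * 100).round(1))
```

The same helper would apply to any other categorical column, such as phone type or a rated-driver flag.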
EDA
- Rated Driver is Significant
- Did not rate driver: 1273 churns -> 81% of users who do not rate a driver churn!!
- Did rate driver: 4879 churns
- Recommendation:
- We should examine why users who do not rate drivers churn so often
- In the short term we should reach out to customers who are not rating drivers!
EDA
- Trips in First Month is significant!
- Incentivize users to take their first trip soon. Offer limited-time promotions (discounts/freebies) to encourage users to take a ride.
- Enhance marketing / reach-out
EDA
- Android Users Churn More
- 79% of Android users churn
- 55% of iPhone users churn
Our team should check for bugs and usability issues in the Android app
LEAKAGE
import pandas as pd

# Engineer the churn column: a user has churned if their last trip
# was more than 30 days before July 1, 2014
today = pd.Timestamp('2014-07-01')
days_delta = pd.Timedelta(days=30)
copy['days_since_last_used'] = today - copy['date_last_trip']
copy['churn'] = copy['days_since_last_used'] > days_delta
# Flag users who signed up within the last 30 days
copy['days_since_signup'] = today - copy['date_signup']
copy['new'] = copy['days_since_signup'] < days_delta
# Remove the columns from which the target was derived (stop leakage)
copy = copy.drop(['date_last_trip', 'days_since_last_used',
                  'last_trip_date', 'signup_date', 'date_signup',
                  'days_since_signup', 'new'], axis=1)
Prevented.
Choosing a Metric
- Business objective is to predict and prevent churn.
- This implies greater importance on "hit detection", i.e. recall.
- We preferred classifiers with high recall, but also considered their precision. Ultimately we ran our grid search using the F1 score (the harmonic mean of precision and recall).
- In future projects we'd like to use the F-beta score (with beta > 1) to weight recall more heavily
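For reference, the F-beta score is (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall; beta = 1 gives F1, and beta > 1 weights recall more heavily. A small sketch using scikit-learn's `fbeta_score` and `make_scorer` on made-up labels (beta = 2 is an illustrative choice, not one made in this project):

```python
from sklearn.metrics import f1_score, fbeta_score, make_scorer

# Toy labels (made up): 1 = churned, 0 = retained.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# Here precision = 3/4 and recall = 3/5, so recall is the weaker metric.
f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)  # beta > 1 favors recall

# Scorer object that could be passed as `scoring=` to a grid search.
f2_scorer = make_scorer(fbeta_score, beta=2)

print(f"F1 = {f1:.3f}, F2 = {f2:.3f}")
```

Because recall is lower than precision in this toy example, F2 comes out below F1: the metric penalizes missed churners more heavily.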
Choosing an Algorithm
We examined 5 classifiers:
- Gradient Boosting
- AdaBoost (decision tree base estimator)
- k-Nearest Neighbors (KNN)
- Decision Tree
- Random Forest
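A comparison like this can be sketched with cross-validated F1 scores. The data below is synthetic (`make_classification`) and the hyperparameters are illustrative defaults, not the settings used in the original comparison.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered feature matrix and churn labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

classifiers = {
    "GradientBoost": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean F1 over 3 folds for each candidate.
scores = {}
for name, clf in classifiers.items():
    scores[name] = cross_val_score(clf, X, y, cv=3, scoring="f1").mean()

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} mean F1 = {score:.3f}")
```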
SURVEY SAYS
GradientBoosting
f1 score: 0.839278289993
params: {
'max_features': 'sqrt',
'n_estimators': 1000,
'learning_rate': 0.05,
'max_depth': 4
}
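A minimal sketch of a grid search scored on F1, assuming scikit-learn's `GridSearchCV`. The grid below is deliberately tiny and hypothetical so that it runs quickly on synthetic data; it is not the grid that produced the parameters above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search grid over the same four hyperparameters.
param_grid = {
    "max_features": ["sqrt", None],
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 4],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1",  # same metric the slides report
    cv=3,
)
search.fit(X, y)

print("f1 score:", search.best_score_)
print("params:", search.best_params_)
```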
GradientBoosting
Training CV Results:
f1_score -> 0.838952585961
r2_score -> 0.124212558345
precision_score -> 0.813991763592
accuracy_score -> 0.793792696298
recall_score -> 0.86550548888
mean_squared_error -> 0.206207303702
roc_auc_score -> 0.771004947189
Feature importances:
avg_dist 0.1824
weekday_pct 0.163
trips_in_first_30_days 0.1386
surge_pct 0.1141
avg_surge 0.1007
avg_rating_by_driver 0.0906
avg_rating_of_driver 0.0767
city_Kings Landing 0.0271
luxury_car_user 0.0248
city_Astapor 0.0193
phone_Android 0.0135
phone_iPhone 0.0134
city_Winterfell 0.0131
rated_driver_False 0.012
rated_driver_True 0.0106
GradientBoosting
Test Results:
f1_score -> 0.831244719975
r2_score -> 0.0743100110721
precision_score -> 0.806983691001
accuracy_score -> 0.78287255563
recall_score -> 0.857043512597
mean_squared_error -> 0.21712744437
roc_auc_score -> 0.758442903427
Feature importances:
avg_dist 0.2377
weekday_pct 0.1394
trips_in_first_30_days 0.1231
surge_pct 0.1008
avg_surge 0.1004
avg_rating_by_driver 0.0889
avg_rating_of_driver 0.0821
city_Kings Landing 0.0247
luxury_car_user 0.0226
city_Astapor 0.0178
phone_iPhone 0.0142
phone_Android 0.014
city_Winterfell 0.0138
rated_driver_False 0.0108
rated_driver_True 0.0098
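A report of this shape (classification metrics followed by per-feature weights) could be produced roughly as follows. This is a sketch on synthetic data with placeholder feature names. It reuses the best parameters reported earlier except for a smaller `n_estimators`, chosen here only to keep the run short. AUC is computed from predicted probabilities, which may differ from how the original numbers were produced.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.05, max_depth=4,
    max_features="sqrt", random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "f1_score": f1_score(y_test, y_pred),
    "precision_score": precision_score(y_test, y_pred),
    "accuracy_score": accuracy_score(y_test, y_pred),
    "recall_score": recall_score(y_test, y_pred),
    "roc_auc_score": roc_auc_score(y_test, y_prob),
}
for name, value in metrics.items():
    print(f"{name} -> {value}")

# Feature importances sum to 1; sort so the strongest features come first.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.round(4))
```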
Profit and Loss
Next Steps
- Account for class imbalance in some of the features we found in EDA: Android and rated_driver (t/f) both have imbalanced classes but seem significant
- Complete pricing curves: now that we have our best estimator, we want to find where to set the classification threshold from a profit perspective
- Follow up on the city imbalance!
- Switch to F-beta to weight recall more heavily than precision during grid search
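One way to approach the threshold question is a profit curve: for each candidate threshold, total up the payoff of targeting everyone whose predicted churn probability is at or above it. The sketch below is illustrative only; the function name, the per-user benefit and cost figures, and the toy probabilities are all hypothetical.

```python
import numpy as np


def profit_curve(y_true, churn_prob, thresholds, benefit_tp, cost_fp):
    """Total profit at each threshold for a retention offer.

    A user is targeted when churn_prob >= threshold. Targeting a true
    churner earns `benefit_tp`; targeting a user who would have stayed
    costs `cost_fp`. Untargeted users contribute nothing.
    """
    y_true = np.asarray(y_true, dtype=bool)
    churn_prob = np.asarray(churn_prob, dtype=float)
    profits = []
    for t in thresholds:
        targeted = churn_prob >= t
        true_pos = np.sum(targeted & y_true)
        false_pos = np.sum(targeted & ~y_true)
        profits.append(true_pos * benefit_tp - false_pos * cost_fp)
    return np.array(profits)


# Toy labels and predicted churn probabilities (made up).
y_true = [1, 0, 1, 1, 0, 0]
churn_prob = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
thresholds = np.array([0.2, 0.5, 0.75, 0.95])

profits = profit_curve(y_true, churn_prob, thresholds,
                       benefit_tp=10.0, cost_fp=4.0)
best_threshold = thresholds[np.argmax(profits)]
print(dict(zip(thresholds.tolist(), profits.tolist())))
print("best threshold:", best_threshold)
```

With a benefit that outweighs the cost of a wasted offer, the lowest threshold wins in this toy example, which matches the preference for recall stated earlier.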
Westerosi Ride Sharing
By Tyler Bettilyon