Westerosi Ride Hailing Churn

EDA

  • Computing Target:
    • Convert last_ride to a timestamp
    • Compute the cutoff 30 days prior to July 1; users whose last ride is older than the cutoff count as churned
       

EDA

  • Pretty good data:
    • Few NaN or missing values
      • avg_rating_of_driver 16% empty (engineered column)
      • avg_rating_by_driver 0.4% empty (dropped rows)
      • phone 0.8% empty (dropped rows)
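
A minimal sketch of the missing-value handling summarized above, on a tiny made-up frame. The column names come from this deck, and the `rated_driver` flag matches the feature that appears in the model output later on. Filling the missing ratings with the median is an assumption for illustration, not necessarily what was done.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame using the column names from this deck.
df = pd.DataFrame({
    "avg_rating_of_driver": [5.0, np.nan, 4.5, np.nan, 4.0],
    "avg_rating_by_driver": [4.8, 5.0, np.nan, 4.9, 5.0],
    "phone": ["iPhone", "Android", "iPhone", None, "Android"],
})

# Percentage of missing values per column.
pct_missing = df.isna().mean() * 100
print(pct_missing.round(1))

# Engineer a flag instead of dropping the many rows without a driver rating.
df["rated_driver"] = df["avg_rating_of_driver"].notna()

# Assumption: fill the missing ratings with the column median.
median_rating = df["avg_rating_of_driver"].median()
df["avg_rating_of_driver"] = df["avg_rating_of_driver"].fillna(median_rating)

# Drop the few rows that are missing the sparsely empty columns.
clean = df.dropna(subset=["avg_rating_by_driver", "phone"])
```

Keeping a boolean flag preserves the "did not rate" signal even after the rating column itself has been filled.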

EDA

  • Cities perform very differently:
    • 74% of Astaporians Churn
    • 65% of Winterfellers Churn
    • Only 37% of King's Landites Churn
       
  • We should focus our campaigns on Astapor and Winterfell; our market in King's Landing is already fairly strong!
  • Further research into the characteristics and demographics of King's Landing versus Astapor and Winterfell should be done
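
Per-city churn rates like the ones above can be computed with a group-by. The rows below are made up; only the column names `city` and `churn` follow this deck, and `churn_rate_by` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows; only the column names follow this deck.
rides = pd.DataFrame({
    "city": ["Astapor", "Astapor", "Winterfell", "Winterfell",
             "Kings Landing", "Kings Landing", "Kings Landing", "Astapor"],
    "churn": [True, True, True, False, False, False, True, False],
})

def churn_rate_by(frame, column):
    """Return the share of churned users per category, highest first."""
    return frame.groupby(column)["churn"].mean().sort_values(ascending=False)

rates = churn_rate_by(rides, "city")
print((rates * 100).round(1))
```

The same helper should work for other categorical columns such as `phone` or `rated_driver`.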

EDA

  • Rated Driver is Significant
    • Did not rate driver: 1273 churns -> 81% of users who do not rate a driver churn!!
    • Did rate driver: 4879 churns
       
  • Recommendation:
    • We should examine why users who never rate a driver churn at such a high rate
    • In the short term we should reach out to customers who are not rating drivers!

EDA

  • Trips in First Month is significant!
  • Incentivize users to take their first trip soon. Offer limited-time promotions (discounts/freebies) to encourage users to take a ride.
  • Enhance marketing and outreach

EDA

  • Android Users Churn More
    • 79% of Android users churn
    • 55% of iPhone users churn

  • Our team should check for bugs and usability issues in the Android app

LEAKAGE

import pandas as pd

# `copy` is the working DataFrame built earlier in the analysis.
# Engineer the churn column: churned = no trip in the 30 days before July 1, 2014
today = pd.Timestamp('2014-07-01')
days_delta = pd.Timedelta(days=30)
copy['days_since_last_used'] = today - copy['date_last_trip']
copy['churn'] = copy['days_since_last_used'] > days_delta

# Flag users who signed up within the last 30 days
copy['days_since_signup'] = today - copy['date_signup']
copy['new'] = copy['days_since_signup'] < days_delta

# Remove the columns from which the target was derived (stop leakage)
copy = copy.drop(['date_last_trip', 'days_since_last_used',
                  'last_trip_date', 'signup_date', 'date_signup',
                  'days_since_signup', 'new'], axis=1)

Prevented.

Choosing a Metric

  • Business objective is to predict and prevent churn.
     
  • This implies greater importance on "hit detection", i.e. recall.

  • We preferred classifiers with high recall but also considered their precision. Ultimately we ran our grid search on the F1 score, the harmonic mean of precision and recall.

  • In future projects we'd like to use the F-beta score to weight recall more heavily
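
For reference, F-beta generalizes F1 as (1 + beta^2) * P * R / (beta^2 * P + R), where beta > 1 favors recall. The sketch below is plain Python; the precision and recall plugged in are the training CV values reported for GradientBoosting later in this deck. Note that an F1 computed from averaged precision and recall need not match the average of per-fold F1 scores exactly.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Training CV precision and recall reported for GradientBoosting in this deck.
precision, recall = 0.813991763592, 0.86550548888

f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)  # leans toward recall
print(round(f1, 4), round(f2, 4))
```

Because recall is higher than precision for this model, F2 comes out above F1 and closer to the recall value.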

Choosing an Algorithm

We examined 5 classifiers:

  • GradientBoost
  • AdaBoost (decision tree)
  • KNN
  • Decision Tree
  • Random Forest
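
A comparison like this can be sketched with scikit-learn's `cross_val_score`. The data below is synthetic and every estimator uses default settings, so this is an assumption about the workflow rather than the configuration actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ride-hailing features (not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

classifiers = {
    "GradientBoost": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),  # default base learner is a shallow decision tree
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validated F1 for each candidate.
scores = {name: cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
          for name, clf in classifiers.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```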

SURVEY SAYS

GradientBoosting

f1 score: 0.839278289993
params: {
  'max_features': 'sqrt',
  'n_estimators': 1000,
  'learning_rate': 0.05,
  'max_depth': 4
}
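
Parameters like these typically come out of a grid search. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data with a deliberately small, hypothetical grid so it runs quickly; the actual grid that was searched is not shown in this deck.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real features are not reproduced here.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical small grid. The deck's best values (n_estimators=1000,
# learning_rate=0.05, max_depth=4, max_features='sqrt') would sit
# inside a larger grid than this one.
param_grid = {
    "max_features": ["sqrt"],
    "n_estimators": [20, 50],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 4],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1",  # the metric used for the grid search in this deck
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# To weight recall more heavily, an F-beta scorer could be passed as `scoring`.
f2_scorer = make_scorer(fbeta_score, beta=2)
```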

GradientBoosting
Training CV Results:

f1_score -> 0.838952585961
r2_score -> 0.124212558345
precision_score -> 0.813991763592
accuracy_score -> 0.793792696298
recall_score -> 0.86550548888
mean_squared_error -> 0.206207303702
roc_auc_score -> 0.771004947189

Feature importances:

avg_dist 0.1824
weekday_pct 0.163
trips_in_first_30_days 0.1386
surge_pct 0.1141
avg_surge 0.1007
avg_rating_by_driver 0.0906
avg_rating_of_driver 0.0767
city_Kings Landing 0.0271
luxury_car_user 0.0248
city_Astapor 0.0193
phone_Android 0.0135
phone_iPhone 0.0134
city_Winterfell 0.0131
rated_driver_False 0.012
rated_driver_True 0.0106
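
The per-feature weights listed above sum to roughly 1, which is what a fitted scikit-learn tree ensemble exposes as `feature_importances_`. A minimal sketch of how such a ranking can be produced, using synthetic data and hypothetical feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with hypothetical feature names, for illustration only.
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = (pd.Series(model.feature_importances_, index=feature_names)
               .sort_values(ascending=False))
print(importances.round(4))
```

With one-hot encoded pairs such as phone_Android and phone_iPhone, the importance is split across the dummy columns, so summing them per original feature may give a fairer comparison.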

GradientBoosting
Test Results:

f1_score -> 0.831244719975
r2_score -> 0.0743100110721
precision_score -> 0.806983691001
accuracy_score -> 0.78287255563
recall_score -> 0.857043512597
mean_squared_error -> 0.21712744437
roc_auc_score -> 0.758442903427

Feature importances:

avg_dist 0.2377
weekday_pct 0.1394
trips_in_first_30_days 0.1231
surge_pct 0.1008
avg_surge 0.1004
avg_rating_by_driver 0.0889
avg_rating_of_driver 0.0821
city_Kings Landing 0.0247
luxury_car_user 0.0226
city_Astapor 0.0178
phone_iPhone 0.0142
phone_Android 0.014
city_Winterfell 0.0138
rated_driver_False 0.0108
rated_driver_True 0.0098

Profit and Loss

Next Steps

  • Account for class imbalance in features we found in EDA: phone (Android) and rated_driver (True/False) both have imbalanced classes but seem significant
     
  • Complete pricing curves -- now that we have our best estimator we want to find out where to set our thresholds from a profit perspective
     
  • Follow up on the city imbalance!
     
  • Switch to F-beta for better weighting of recall over precision during GridSearch
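
One way to approach the threshold question from a profit perspective is a simple profit curve: score each candidate threshold by the payoff of the users it would target. The payoffs, labels and probabilities below are all made up, and the sketch assumes every targeted churner is retained, which is a strong simplification.

```python
import numpy as np

# Hypothetical payoffs per user (assumptions, not figures from this deck):
# a retention offer costs 5; retaining a would-be churner is worth 30.
BENEFIT_TP = 30 - 5   # churner correctly targeted
COST_FP = -5          # loyal user targeted unnecessarily

def profit_curve(y_true, churn_proba, thresholds):
    """Average profit per user when targeting everyone scored >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    churn_proba = np.asarray(churn_proba, dtype=float)
    profits = []
    for t in thresholds:
        targeted = churn_proba >= t
        tp = np.sum(targeted & y_true)    # churners who receive the offer
        fp = np.sum(targeted & ~y_true)   # non-churners who receive the offer
        profits.append((tp * BENEFIT_TP + fp * COST_FP) / len(y_true))
    return np.array(profits)

# Made-up labels and predicted churn probabilities.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
churn_proba = [0.95, 0.85, 0.75, 0.65, 0.45, 0.35, 0.25, 0.15]

thresholds = np.linspace(0.0, 1.0, 11)
profits = profit_curve(y_true, churn_proba, thresholds)
best_threshold = thresholds[np.argmax(profits)]
print(best_threshold, profits.max())
```

In practice the probabilities would come from the fitted classifier's `predict_proba` on held-out data, and the payoffs from actual campaign costs.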