Westerosi Ride Hailing Churn

EDA

Computing Target:
- Convert the last ride date to a timestamp
- Flag a user as churned if their last ride was more than 30 days before July 1, 2014

EDA

Pretty good data:
- Few NaN or missing values
  - avg_rating_of_driver 16% empty (engineered column)
  - avg_rating_by_driver .4% empty (dropped rows)
  - phone .8% empty (dropped rows)
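A minimal sketch of how the missing values listed above might be handled, on a small made-up frame. The column names come from the dataset, and the `rated_driver` flag matches the feature names that appear in the model results. Filling the remaining gaps with the column median is an assumption; the slides do not say how the engineered column was completed.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with the same column names as the ride-hailing dataset
df = pd.DataFrame({
    'avg_rating_of_driver': [4.5, np.nan, 5.0, np.nan, 4.0],
    'avg_rating_by_driver': [5.0, 4.8, np.nan, 4.9, 4.7],
    'phone': ['iPhone', 'Android', 'iPhone', None, 'Android'],
})

# Fraction of missing values per column
missing_share = df.isna().mean()

# Engineer a flag from the sparsest column instead of dropping those rows
df['rated_driver'] = df['avg_rating_of_driver'].notna()

# Assumption: fill the remaining gaps in that column with its median
df['avg_rating_of_driver'] = df['avg_rating_of_driver'].fillna(
    df['avg_rating_of_driver'].median())

# Drop the few rows that are missing either of the other two columns
clean = df.dropna(subset=['avg_rating_by_driver', 'phone'])
```

Keeping the flag preserves the signal that a user never rated a driver, which a plain fill would hide.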

EDA

Cities perform very differently:
- 74% of Astaporians Churn
- 65% of Winterfellers Churn
- Only 37% of King's Landites Churn
We should focus our campaigns on Astapor and Winterfell; our market in King's Landing is already fairly strong.
Further research into how the characteristics and demographics of King's Landing differ from those of Astapor and Winterfell should be done.
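Per-city churn rates like the ones above can be computed with a single group-by. This is a sketch on made-up rows, assuming a boolean `churn` column and a `city` column as in the dataset; the toy rates are not the real ones.

```python
import pandas as pd

# Toy data (made up); 'city' and 'churn' mirror the dataset's columns
df = pd.DataFrame({
    'city': ['Astapor', 'Astapor', 'Winterfell', 'Winterfell',
             "King's Landing", "King's Landing"],
    'churn': [True, True, True, False, False, False],
})

# The mean of a boolean column is the share of True values,
# which here is the churn rate within each city
churn_by_city = df.groupby('city')['churn'].mean().sort_values(ascending=False)
print(churn_by_city)
```

The same pattern applies to any other categorical column, such as `phone`.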

EDA

Rated Driver is Significant
- Did not rate driver: 1273 churns -> 81% of users who do not rate a driver churn!!
- Did rate driver: 4879 churns
Recommendation:
- We should investigate why users who never rate a driver churn at such a high rate
- In the short term we should reach out to customers who are not rating drivers

EDA

Trips in First Month is significant!
Incentivize users to take their first trip soon. Offer limited-time promotions (discounts/freebies) to encourage users to take a ride.
Enhance marketing and outreach to new users.

EDA

Android Users Churn More
79% of Android users churn
55% of iPhone users churn

Our team should check for bugs and usability issues in the Android app

LEAKAGE

import pandas as pd

# Engineer the churn column: a user has churned if their last trip was
# more than 30 days before July 1, 2014.
# Assumes date_last_trip and date_signup already hold parsed timestamps.
today = pd.Timestamp('2014-07-01')
days_delta = pd.Timedelta(days=30)
copy['days_since_last_used'] = today - copy['date_last_trip']
copy['churn'] = copy['days_since_last_used'] > days_delta
copy['days_since_signup'] = today - copy['date_signup']
copy['new'] = copy['days_since_signup'] < days_delta

# Remove the columns from which the target was derived (stop leakage)
copy = copy.drop(columns=['date_last_trip', 'days_since_last_used',
                          'last_trip_date', 'signup_date', 'date_signup',
                          'days_since_signup', 'new'])

Prevented.

Choosing a Metric

The business objective is to predict and prevent churn.
This places greater importance on "hit detection", i.e. recall.
We preferred classifiers with high recall but also considered their precision. Ultimately we ran our grid search using the F1 score (the harmonic mean of precision and recall).
In future projects we'd like to use the F-beta score to weight recall more heavily.
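To make the trade-off concrete, here is a small sketch of the F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R), applied to the cross-validated precision and recall reported later in this deck. The helper name `f_beta` is ours, and beta = 2 is only an illustrative choice.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Cross-validated precision and recall reported for the chosen model
precision, recall = 0.813991763592, 0.86550548888

f1 = f_beta(precision, recall, beta=1)  # plain harmonic mean
f2 = f_beta(precision, recall, beta=2)  # recall counts more
print(round(f1, 3), round(f2, 3))
```

Because recall is higher than precision for this model, the beta = 2 score comes out above F1. A model with the opposite balance would be penalized instead.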

Choosing an Algorithm

We examined 5 classifiers:

- Gradient Boosting
- AdaBoost (decision tree)
- KNN
- Decision Tree
- Random Forest

SURVEY SAYS

GradientBoosting

f1 score: 0.839278289993
params: {
  'max_features': 'sqrt',
  'n_estimators': 1000,
  'learning_rate': 0.05,
  'max_depth': 4
}
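A hedged sketch of how a grid search of this kind can be run with scikit-learn. The data here is synthetic and the grid is deliberately tiny so it runs quickly; the grid that was actually searched is not shown in the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn features and labels (assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Small illustrative grid over the same four hyperparameters
param_grid = {
    'max_features': ['sqrt'],
    'n_estimators': [20, 50],
    'learning_rate': [0.05, 0.1],
    'max_depth': [2, 4],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring='f1',  # the metric used for grid searching in this project
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

To weight recall more heavily, `scoring` could instead be set to `make_scorer(fbeta_score, beta=2)`, with both names imported from `sklearn.metrics`.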

GradientBoosting
Training CV Results:

f1_score -> 0.838952585961
r2_score -> 0.124212558345
precision_score -> 0.813991763592
accuracy_score -> 0.793792696298
recall_score -> 0.86550548888
mean_squared_error -> 0.206207303702
roc_auc_score -> 0.771004947189
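A note on reading these numbers: for 0/1 labels and 0/1 predictions, the mean squared error is simply the misclassification rate, so it carries the same information as accuracy. The sketch below checks this on made-up labels and then on the scores reported above.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error; each miss on 0/1 labels contributes exactly 1."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels and predictions
y_true = [1, 0, 1, 1]
y_pred = [1, 1, 1, 0]
toy_gap = accuracy(y_true, y_pred) + mse(y_true, y_pred) - 1

# The cross-validated scores reported above obey the same identity
reported_gap = 0.793792696298 + 0.206207303702 - 1
print(abs(toy_gap) < 1e-9, abs(reported_gap) < 1e-9)
```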

Feature importances:
avg_dist 0.1824
weekday_pct 0.163
trips_in_first_30_days 0.1386
surge_pct 0.1141
avg_surge 0.1007
avg_rating_by_driver 0.0906
avg_rating_of_driver 0.0767
city_Kings Landing 0.0271
luxury_car_user 0.0248
city_Astapor 0.0193
phone_Android 0.0135
phone_iPhone 0.0134
city_Winterfell 0.0131
rated_driver_False 0.012
rated_driver_True 0.0106

GradientBoosting
Test Results:

f1_score -> 0.831244719975
r2_score -> 0.0743100110721
precision_score -> 0.806983691001
accuracy_score -> 0.78287255563
recall_score -> 0.857043512597
mean_squared_error -> 0.21712744437
roc_auc_score -> 0.758442903427

Feature importances:
avg_dist 0.2377
weekday_pct 0.1394
trips_in_first_30_days 0.1231
surge_pct 0.1008
avg_surge 0.1004
avg_rating_by_driver 0.0889
avg_rating_of_driver 0.0821
city_Kings Landing 0.0247
luxury_car_user 0.0226
city_Astapor 0.0178
phone_iPhone 0.0142
phone_Android 0.014
city_Winterfell 0.0138
rated_driver_False 0.0108
rated_driver_True 0.0098

Profit and Loss

Next Steps

- Account for class imbalance in some of the features we found in EDA: Android and rated_driver (True/False) both have imbalanced classes but seem significant
- Complete pricing curves: now that we have our best estimator, we want to find where to set our thresholds from a profit perspective
- Follow up on the city imbalance
- Switch to F-beta to weight recall over precision during grid search
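As a rough illustration of the threshold question, the sketch below scores a few candidate thresholds by expected profit per user. Everything in it is hypothetical: the labels, the predicted probabilities, and the dollar values (a retained churner nets 10, a wasted offer costs 4) are made up, and the helper `expected_profit` is ours.

```python
def expected_profit(y_true, y_prob, threshold, benefit_tp, cost_fp):
    """Average profit per user when everyone scored >= threshold gets an offer.

    benefit_tp: net gain from targeting a user who would have churned
    cost_fp: cost of an offer wasted on a user who would have stayed
    """
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        if prob >= threshold:
            total += benefit_tp if label == 1 else -cost_fp
    return total / len(y_true)

# Made-up labels (1 = churned) and predicted churn probabilities
y_true = [1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]

# Hypothetical economics: retaining a churner nets 10, a wasted offer costs 4
thresholds = [0.2, 0.35, 0.5, 0.65, 0.8]
profits = [expected_profit(y_true, y_prob, t, 10, 4) for t in thresholds]

# Pick the threshold with the highest expected profit
best_profit, best_threshold = max(zip(profits, thresholds))
print(best_threshold, round(best_profit, 2))
```

With real costs and the model's predicted probabilities on held-out data, the same loop traces out a profit curve from which a threshold can be chosen.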

Westerosi Ride Sharing

Tyler Bettilyon
