Objective:
The HR team wants to better anticipate unplanned departures or retirements in our Parisian workforce
1. Data Cleaning
2. Exploratory Analysis
3. Classifiers
4. Takeaways
1338 rows
35 columns
26 Numerical columns
9 Categorical columns
No NaN values
After cleaning, our dataset shows 214 resignations (19%) over the past 2 years
Target: y = df.Attrition (1 = resigned, 0 = stayed)
3 strategies for employee retention
Drop redundant columns
Winsorize 6 columns whose box plots show outliers
Create dummies for Gender, MaritalStatus & Department (drop_first=True)
Encode Attrition & OverTime as numbers ({'Yes':1,'No':0}), and BusinessTravel as ({'Non-Travel':0,'Travel_Rarely':1,'Travel_Frequently':2})
Create bins for Age (edges at 30, 40, 50 and 60), then drop the original Age column
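The cleaning steps above can be sketched with pandas as follows. This is only an illustration on made-up rows: the 5th/95th percentile cut-offs for winsorizing, the exact bin edges, and reading "drop=True" as `drop_first=True` are all assumptions, not values confirmed by the slides.

```python
import pandas as pd

# Toy rows standing in for the HR dataset; all values are made up.
df = pd.DataFrame({
    "Age": [25, 34, 41, 58],
    "Attrition": ["Yes", "No", "No", "Yes"],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "BusinessTravel": ["Travel_Frequently", "Non-Travel",
                       "Travel_Rarely", "Travel_Rarely"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "DistanceFromHome": [2, 5, 9, 120],
})

# Binary and ordinal encodings, using the mappings listed above.
yes_no = {"Yes": 1, "No": 0}
df["Attrition"] = df["Attrition"].map(yes_no)
df["OverTime"] = df["OverTime"].map(yes_no)
df["BusinessTravel"] = df["BusinessTravel"].map(
    {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}
)

# One-hot encode and drop the first level (assumed reading of "drop=True").
df = pd.get_dummies(df, columns=["Gender"], drop_first=True, dtype=int)

# Winsorize by clipping to percentiles (the 5%/95% cut-offs are an assumption).
low, high = df["DistanceFromHome"].quantile([0.05, 0.95])
df["DistanceFromHome"] = df["DistanceFromHome"].clip(lower=low, upper=high)

# Ordinal age bins (assumed edges), then drop the original column.
df["Age_bins"] = pd.cut(df["Age"], bins=[0, 30, 40, 50, 60], labels=False)
df = df.drop(columns="Age")
```

Keeping Age_bins as a single ordinal column, rather than one dummy per bin, matches the single Age_bins term that appears in the final model.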
YearsAtCompany Violinplot
Attrition 1 = Employee Left
Employees with fewer than 5 years at the company are more likely to leave
YearsWithCurrManager & YearsInCurrentRole
Violinplots
Attrition 1 = Employee Resigned
After 2 years with the current manager or 2 years in the current role, more employees stay
JobLevel & StockOptionLevel
Violinplots
The higher the job level and the stock option level, the lower the chance of leaving
Plot annotations: highest proportion of resignation 36% and 32%
BusinessTravel & OverTime
Violinplots
Business travel and overtime both hurt retention: fewer resignations at level 0 (no travel, no overtime)
Plot annotations: highest proportion of resignation 32% and 42%
WorkLifeBalance & Age
Violinplots
Employees under 30, or who value work-life balance highly (1 = top priority), are at higher risk of leaving
Plot annotations: highest proportion of resignation 38% and 49%
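The "highest proportion of resignation" figures quoted on the plots are group-level shares. One way such a share could be computed is sketched below, assuming Attrition is already encoded as 0/1; the rows are made up and do not reproduce the percentages above.

```python
import pandas as pd

# Made-up rows for illustration; Attrition 1 = employee resigned.
df = pd.DataFrame({
    "OverTime":  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Attrition": [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column within each group is that group's resignation share.
resignation_rate = df.groupby("OverTime")["Attrition"].mean() * 100

# Group with the highest share of resignations.
worst_group = resignation_rate.idxmax()
```

The same pattern applies to any of the grouping variables explored above (JobLevel, BusinessTravel, Age_bins and so on) by changing the column passed to `groupby`.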
Other Insights
No impact
People say what you want to hear
Test these assumptions against our models
VIF Test performed
Remaining columns saved in X variables
Correlation Matrix for X
Input Data Selection
Input Data
Total of 27 numerical columns
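For reference, a variance inflation factor (VIF) check can be sketched with NumPy alone: each column is regressed on the others, and VIF = 1 / (1 - R²). The slides do not say which library or cut-off was used, so the function name `vif` and the synthetic data here are illustrative assumptions.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        factors.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return factors

# Synthetic example: the third column is almost a copy of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)
scores = vif(np.column_stack([x1, x2, x3]))
```

Columns with a very high VIF (the first and third here) carry nearly the same information, so one of them is a candidate for removal before fitting the classifiers.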
Classifiers
Priorities:
1. Maximise AUC, the model's ability to separate leavers from stayers
2. Minimise False Negative rate (= predicted no resignation but the employee left)
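import pandas as pd

# Assumes conf1..conf12 (2x2 confusion matrices laid out as [[TN, FP], [FN, TP]])
# and model1_roc..model12_roc (AUC scores) are already defined.
lst = []
for i in range(1, 13):
    conf = eval(f'conf{i}')
    TN, FP = conf[0][0], conf[0][1]
    FN, TP = conf[1][0], conf[1][1]
    FNR = FN / (TP + FN) * 100                    # missed resignations, in %
    FPR = FP / (FP + TN) * 100                    # false alarms, in %
    ACC = (TP + TN) / (TP + FP + FN + TN) * 100   # overall accuracy, in %
    AUC = eval(f'model{i}_roc')
    lst.append([i, FNR, FPR, ACC, AUC])
results = pd.DataFrame(lst, columns=['Model', 'False_negative_rate',
    'False_Positive_rate', 'Overall_Accuracy', 'Area_Under_Curve'])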
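The same rates can also be computed from a single confusion matrix with a small helper, which avoids `eval` and is easy to check by hand. The function name `rates` and the example matrix are hypothetical; the layout [[TN, FP], [FN, TP]] follows the indexing used in the loop above.

```python
def rates(conf):
    """Compute FNR, FPR and accuracy (in %) from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = conf
    total = tn + fp + fn + tp
    return {
        "False_negative_rate": fn / (tp + fn) * 100,
        "False_Positive_rate": fp / (fp + tn) * 100,
        "Overall_Accuracy": (tp + tn) / total * 100,
    }

# Made-up confusion matrix: 100 stayers (20 flagged), 25 leavers (7 missed).
example = rates([[80, 20], [7, 18]])
```

Here 7 of the 25 actual leavers are missed, so the false negative rate is 7 / 25, or about 28%; this is the quantity the second priority asks to minimise.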
Figure: for every model, False Negative Rate, False Positive Rate, Overall Accuracy and Area Under the Curve (annotation: "Minimize")
Model Evaluation
Best Model
Logistic Regression with vs. without PCA
Model 1: Logistic Regression
AUC = 0.71 & FNR = 28%
Model 12: Logistic Regression with X_train_PCA & X_test_PCA
AUC = 0.54 & FNR = 54.6%
Final Model - Logistic Regression without PCA
Expression of Attrition =
Function = 0.64*BusinessTravel + 0.033*DistanceFromHome - 0.046*Education - 0.39*EnvironmentSatisfaction + 0.003*HourlyRate - 0.4*JobLevel - 0.39*JobSatisfaction + 0.14*NumCompaniesWorked + 1.4*OverTime - 0.69*StockOptionLevel - 0.07*TrainingTimesLastYear - 0.01*YearsInCurrentRole + 0.02*YearsSinceLastPromotion - 0.11*YearsWithCurrManager + 0.34*Gender_Male + 0.23*MaritalStatus_Married + 0.44*MaritalStatus_Single + 0.69*Department_Sales - 0.29*Age_bins
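In a logistic regression, a linear function like this one is passed through the sigmoid to obtain a probability. The sketch below shows that step with the coefficients listed above. The slides give no intercept and do not say whether inputs were rescaled, so `intercept=0.0` and raw feature values are assumptions, and the probabilities it returns are illustrative only.

```python
import math

# Coefficients copied from the expression above. The source shows "- - 0.11"
# for YearsWithCurrManager; a single minus sign is assumed here.
COEFS = {
    "BusinessTravel": 0.64, "DistanceFromHome": 0.033, "Education": -0.046,
    "EnvironmentSatisfaction": -0.39, "HourlyRate": 0.003, "JobLevel": -0.4,
    "JobSatisfaction": -0.39, "NumCompaniesWorked": 0.14, "OverTime": 1.4,
    "StockOptionLevel": -0.69, "TrainingTimesLastYear": -0.07,
    "YearsInCurrentRole": -0.01, "YearsSinceLastPromotion": 0.02,
    "YearsWithCurrManager": -0.11, "Gender_Male": 0.34,
    "MaritalStatus_Married": 0.23, "MaritalStatus_Single": 0.44,
    "Department_Sales": 0.69, "Age_bins": -0.29,
}

def attrition_probability(employee, intercept=0.0):
    """Sigmoid of the linear function; missing features count as 0."""
    z = intercept + sum(w * employee.get(name, 0.0) for name, w in COEFS.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical employee, with and without overtime.
base = {"JobLevel": 1, "JobSatisfaction": 3, "OverTime": 0}
with_overtime = {**base, "OverTime": 1}
p_base = attrition_probability(base)
p_overtime = attrition_probability(with_overtime)
```

Whatever the intercept, switching OverTime from 0 to 1 multiplies the odds of resignation by exp(1.4), roughly 4, provided the input was used as a raw 0/1 flag. That makes overtime the strongest single driver in the expression.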
Key Takeaways
3 Strategies for Employee Retention
Major focus on the Sales department (high turnover) and entry-level positions
Minimise business trips and anticipate high-demand cycles to foster a work environment that supports work-life balance
Minimise overtime as much as possible and reward with stock options or training sessions