Automated Business Data Analytics
陳達泓 呂學靖
Business Data Analytics
Data
Our data are UK road traffic accident records from 2005 to 2015.
import pandas as pd
import numpy as np
import datatable as dt
Accident = dt.fread("/Users/user/Desktop/自動化/dft-accident-data/Accidents0515.csv").to_pandas()
Casualty = dt.fread("/Users/user/Desktop/自動化/dft-accident-data/Clean Data/Casualties.csv").to_pandas()
Vehicle = dt.fread("/Users/user/Desktop/自動化/dft-accident-data/Clean Data/Vehicles.csv").to_pandas()
Total = Accident.merge(Casualty, how = "outer").merge(Vehicle, how = "left")
Rows : 2402909
Cols : 66
Total.columns
> Index(['Accident_Index',
'Location_Easting_OSGR', 'Location_Northing_OSGR','Longitude', 'Latitude',
'Police_Force',
'Accident_Severity',
'Number_of_Vehicles',
'Number_of_Casualties',
'Date', 'Day_of_Week', 'Time',
'Local_Authority_(District)', 'Local_Authority_(Highway)',
'1st_Road_Class', '1st_Road_Number',
'Road_Type',
'Speed_limit',
'Junction_Detail', 'Junction_Control',
'2nd_Road_Class', '2nd_Road_Number',
'Pedestrian_Crossing-Human_Control',
'Pedestrian_Crossing-Physical_Facilities',
'Light_Conditions',
'Weather_Conditions',
'Road_Surface_Conditions',
'Special_Conditions_at_Site',
'Carriageway_Hazards',
'Urban_or_Rural_Area',
'Did_Police_Officer_Attend_Scene_of_Accident',
'LSOA_of_Accident_Location',
'Vehicle_Reference',
'Casualty_Reference',
'Casualty_Class',
'Sex_of_Casualty',
'Age_of_Casualty', 'Age_Band_of_Casualty',
'Casualty_Severity',
'Pedestrian_Location', 'Pedestrian_Movement',
'Car_Passenger', 'Bus_or_Coach_Passenger',
'Pedestrian_Road_Maintenance_Worker',
'Casualty_Type','Casualty_Home_Area_Type',
'Vehicle_Type', 'Towing_and_Articulation',
'Vehicle_Manoeuvre',
'Vehicle_Location-Restricted_Lane',
'Junction_Location',
'Skidding_and_Overturning',
'Hit_Object_in_Carriageway', 'Vehicle_Leaving_Carriageway', 'Hit_Object_off_Carriageway',
'1st_Point_of_Impact',
'Was_Vehicle_Left_Hand_Drive?',
'Journey_Purpose_of_Driver',
'Sex_of_Driver', 'Age_of_Driver', 'Age_Band_of_Driver',
'Engine_Capacity_(CC)',
'Propulsion_Code',
'Age_of_Vehicle',
'Driver_IMD_Decile',
'Driver_Home_Area_Type'], dtype='object')
Columns Glimpse
Preprocessing
Check Missing
Total.isnull().sum()[Total.isnull().sum() != 0]
> Location_Easting_OSGR 183
Location_Northing_OSGR 183
Longitude 183
Latitude 183
dtype: int64
Merge Relational Datasets
def label_map(column_name, file) :
    # Map numeric codes to text labels using a lookup CSV (first column = code, second = label)
    df = pd.read_csv("/Users/user/Desktop/自動化/dft-accident-data/contextCSVs/" + file)
    DICT = dict(zip(df[df.columns[0]], df[df.columns[1]]))
    Total[column_name] = Total[column_name].map(DICT)
label_map("Age_Band_of_Driver","Age_Band.csv") #Age_Band
label_map("Age_Band_of_Casualty","Age_Band.csv") #Age_Band
label_map("Casualty_Class","Casualty_Class.csv")
label_map("Casualty_Type","Casualty_Type.csv")
label_map("Day_of_Week", "Day_of_Week.csv")
label_map("Journey_Purpose_of_Driver", "Journey_Purpose.csv")
label_map("Junction_Control", "Junction_Control.csv")
label_map("Junction_Detail", "Junction_Detail.csv")
label_map("Junction_Location", "Junction_Location.csv")
label_map("Light_Conditions", "Light_Conditions.csv")
label_map("Local_Authority_(District)", "Local_Authority_District.csv")
label_map("Local_Authority_(Highway)", "Local_Authority_Highway.csv")
label_map("Pedestrian_Crossing-Human_Control", "Ped_Cross_Human.csv")
label_map("Pedestrian_Crossing-Physical_Facilities", "Ped_Cross_Physical.csv")
label_map("Pedestrian_Location", "Ped_Location.csv")
label_map("Pedestrian_Movement", "Ped_Movement.csv")
label_map("1st_Point_of_Impact", "Point_of_Impact.csv")
label_map("Police_Force","Police_Force.csv")
label_map("Did_Police_Officer_Attend_Scene_of_Accident", "Police_Officer_Attend.csv")
label_map("1st_Road_Class", "Road_Class.csv") #Road_Class
label_map("2nd_Road_Class", "Road_Class.csv") #Road_Class
label_map("Road_Type", "Road_Type.csv")
label_map("Sex_of_Driver", "Sex_of_Driver.csv")
label_map("Sex_of_Casualty", "Sex_of_Driver.csv")
label_map("Urban_or_Rural_Area", "Urban_Rural.csv")
label_map("Vehicle_Location-Restricted_Lane", "Vehicle_Location.csv")
label_map("Vehicle_Manoeuvre", "Vehicle_Manoeuvre.csv")
label_map("Vehicle_Type", "Vehicle_Type.csv")
Create Response Variable
Accident Severity & Casualty Severity
CODE | LABEL |
---|---|
1 | Fatal |
2 | Serious |
3 | Slight |
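Response :
$$\text{res} = \text{Accident\_Severity} - \text{Casualty\_Severity}$$
$$\text{LuckyOrNot} = \begin{cases} \text{Lucky} & \text{if } \text{res} < 0 \\ \text{UnLucky} & \text{otherwise} \end{cases}$$
Lower codes are more severe, so res < 0 means the casualty was hurt less severely than the accident's overall severity rating.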
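The slides do not show the code that builds the response, so the following is only a minimal sketch of the rule above. The toy frame and its values are made up for illustration; only the column names `Accident_Severity`, `Casualty_Severity`, and `LuckyOrNot` come from the deck.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; severity codes follow the table (1 = Fatal, 2 = Serious, 3 = Slight)
toy = pd.DataFrame({
    "Accident_Severity": [1, 2, 3, 2],
    "Casualty_Severity": [3, 2, 3, 3],
})

# res < 0: the casualty's code is larger (less severe) than the accident's code
res = toy["Accident_Severity"] - toy["Casualty_Severity"]
toy["LuckyOrNot"] = np.where(res < 0, "Lucky", "UnLucky")
print(toy)
```

On the real data the same two lines would presumably be applied to `Total` before the severity columns are handled further.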
Drop No Use Columns
Total.drop(["Accident_Index", "Time", "Date",
"Location_Easting_OSGR", "Location_Northing_OSGR",
"Longitude", "Latitude"], axis = 1, inplace = True)
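The report focuses on machine learning rather than data visualization, so the time and location variables are removed.
Remove the ID, date/time, and geographic information.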
Confirm Column Types
DiscreteVar = ["Police_Force", "Day_of_Week", "1st_Road_Class", "Road_Type", "Vehicle_Location-Restricted_Lane",
"Junction_Detail", "Junction_Control", "2nd_Road_Class", "Was_Vehicle_Left_Hand_Drive?",
"Pedestrian_Crossing-Physical_Facilities", "Light_Conditions", "Weather_Conditions",
"Road_Surface_Conditions", "Special_Conditions_at_Site", "Carriageway_Hazards", "Journey_Purpose_of_Driver",
"Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident", "Casualty_Class",
"Sex_of_Casualty", "Age_Band_of_Casualty", "Casualty_Type", "Pedestrian_Crossing-Human_Control",
"Pedestrian_Location", "Pedestrian_Movement", "Casualty_Home_Area_Type", "Vehicle_Type",
"Towing_and_Articulation", "Vehicle_Manoeuvre" ,"Junction_Location", "Skidding_and_Overturning",
"Hit_Object_in_Carriageway", "1st_Point_of_Impact", "Sex_of_Driver", "Age_Band_of_Driver",
"Propulsion_Code", "Driver_IMD_Decile", "Driver_Home_Area_Type", "LuckyOrNot"]
NumericVar = ["Number_of_Vehicles", "Number_of_Casualties", "Speed_limit", "Age_of_Casualty",
"Car_Passenger", "Bus_or_Coach_Passenger", "Pedestrian_Road_Maintenance_Worker",
"Age_of_Driver", "Engine_Capacity_(CC)", "Age_of_Vehicle"]
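Confirm the variable types, then drop categorical variables with 25 or more categories.
def COL_DROP_RET():
    # Count distinct levels per discrete column and flag those with 25 or more
    ser = DF[DiscreteVar].apply(lambda x : len(x.value_counts()))
    ColumnsToDrop = list(ser.index[ser >= 25])
    return ColumnsToDrop
drop_list = COL_DROP_RET()
DF.drop(labels = drop_list, axis = 1, inplace = True)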
Replace Missing Values
NA_Labels = [-1, "Unknown", "", "nan", "Total missing or out of range" ,
"Other/Not known (2005-10)", "Data missing or out of range",
"data missing", np.nan, "Not known"]
DF.replace(to_replace = NA_Labels, value = -999, inplace = True)
Values whose label means "missing" are replaced with -999,
which makes the later processing easier.
Replace Missing Values
Replace -999 with a different value depending on the column type:
Discrete Variable : "Others or Missing"
Numeric Variable : np.nan
# Cast discrete columns to str and numeric columns to float
for i in Data.columns :
    if i in DiscreteVar and Data[i].dtype != "object" :
        Data[i] = Data[i].astype(str)
    elif i in NumericVar and Data[i].dtype != "float" :
        Data[i] = Data[i].astype(float)
Data_num_var_list = [x for x in Data.columns if x in NumericVar]
Data_str_var_list = [x for x in Data.columns if x in DiscreteVar]
Data.loc[:, Data_num_var_list] = Data[Data_num_var_list].replace(to_replace = -999.0, value = np.nan)
Data.loc[:, Data_str_var_list] = Data[Data_str_var_list].replace(to_replace = "-999", value = "Others or Missing")
Missing Percentage
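# Share of "Others or Missing" per discrete column and of NaN per numeric column
Discrete_Ser = Data.select_dtypes(["object"]).apply(lambda x : x[x == "Others or Missing"].count()).divide(Data.shape[0])
Numeric_Ser = Data.select_dtypes(["float", "int"]).apply(lambda x : x.isnull().sum()).divide(Data.shape[0])
# Series.append was removed in pandas 2.0; pd.concat gives the same combined Series
Missing_Ser = pd.concat([Discrete_Ser, Numeric_Ser])
Missing_Ser[Missing_Ser >= 0.2]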
> Junction_Control 0.367242
2nd_Road_Class 0.420547
Journey_Purpose_of_Driver 0.726724
Propulsion_Code 0.241899
Driver_IMD_Decile 0.288484
Pedestrian_Road_Maintenance_Worker 0.598930
Engine_Capacity_(CC) 0.247660
Age_of_Vehicle 0.277768
dtype: float64
Data.drop(["2nd_Road_Class", "Junction_Control", "Propulsion_Code", "Driver_IMD_Decile",
"Pedestrian_Road_Maintenance_Worker", "Journey_Purpose_of_Driver"],
axis = 1, inplace = True)
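Of the variables with a missing rate of at least 20%, we dropped six by judgment and kept Engine_Capacity_(CC) and Age_of_Vehicle.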
Data Manipulation & Viz
Discrete Variables Viz
import matplotlib.pyplot as plt

def PLOT(var) :
    # Horizontal bar chart of category counts for one discrete variable
    Data[var].value_counts().sort_values().plot.barh()

for i in Data.columns :
    if i in DiscreteVar :
        print(i, " :") ; print()
        PLOT(i)
        plt.show()
Numeric Variables Viz
fig = plt.figure(figsize = (15,20))
ax = fig.gca()
Data.hist(ax = ax)
Data Viz
These visualizations were made to guide the data processing that follows.
Data Wrangling- Discrete
pd.options.mode.chained_assignment = None
def LuckyPCT(var) :
    # Share of "Lucky" outcomes within each level of a discrete variable
    Df = (Data.groupby([var])["LuckyOrNot"].value_counts().unstack()
              .assign(Lucky_pct = lambda x : x.Lucky / (x.Lucky + x.UnLucky)))
    Df1 = Df.sort_values(by = "Lucky_pct", ascending = False)
    return Df1
# Vehicle_Location-Restricted_Lane
# LuckyPCT("Vehicle_Location-Restricted_Lane")
def replace_Vehicle_Location_Restricted_Lane(x):
    # Collapse similar lane descriptions into a few broad groups
    if "hard shoulder" in x :
        value = "Lay-By or Hard Shoulder"
    elif "On main c'way" in x :
        value = "Main Carriageway"
    elif "Bus" in x :
        value = "Bus Lane or Busway"
    elif "Cycle" in x :
        value = "Cycle Lane or Cycleway"
    else :
        value = "Others or Missing"
    return value
Data.loc[:, "Vehicle_Location-Restricted_Lane"] = Data["Vehicle_Location-Restricted_Lane"].apply(replace_Vehicle_Location_Restricted_Lane).values
# Junction_Detail
Data.loc[:, "Junction_Detail"] = Data["Junction_Detail"].replace(
to_replace = ["Slip road", "More than 4 arms (not roundabout)","Mini-roundabout"],
value = "Other junction").values
# Was_Vehicle_Left_Hand_Drive? Keep
LuckyPCT("Was_Vehicle_Left_Hand_Drive?")
# Pedestrian_Crossing-Physical_Facilities Drop
LuckyPCT("Pedestrian_Crossing-Physical_Facilities")
# Can drop the column since the dominant value also has the highest Lucky percentage
Data.drop("Pedestrian_Crossing-Physical_Facilities", axis = 1, inplace = True)
# Light_Conditions
# LuckyPCT("Light_Conditions")
Data.loc[:, "Light_Conditions"] = Data["Light_Conditions"].apply(lambda x : "Darkness - no lights" if x in ["Darkness - no lighting", "Darkness - lights unlit"] else x).values
# Weather_Conditions Keep
# LuckyPCT("Weather_Conditions")
# Road_Surface_Conditions Keep
# LuckyPCT("Road_Surface_Conditions")
# Special_Conditions_at_Site Keep
# LuckyPCT("Special_Conditions_at_Site")
# Carriageway_Hazards OK
# LuckyPCT("Carriageway_Hazards")
# Journey_Purpose_of_Driver Drop
# LuckyPCT(Journey_Purpose_of_Driver)
# LuckyOrNot Lucky | UnLucky | Lucky_pct
# Journey_Purpose_of_Driver | |
# ------------------------------------------------------------------------
# Other 882 | 17108 | 0.05
# Others or Missing 80590 | 1665661 | 0.05
# Journey as part of work 16705 | 347780 | 0.05
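# This column is already in the high-missing drop list above; errors = "ignore" avoids a KeyError on the second drop
Data.drop("Journey_Purpose_of_Driver", axis = 1, inplace = True, errors = "ignore")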
# Did_Police_Officer_Attend_Scene_of_Accident
Data.loc[:, "Did_Police_Officer_Attend_Scene_of_Accident"] = Data["Did_Police_Officer_Attend_Scene_of_Accident"].apply(lambda x : "No" if x not in ["Yes", "No", "Others or Missing"] else x).values
# Casualty_Class OK
# Sex_of_Casualty OK
# Age_Band_of_Casualty OK
# Casualty_Type Keep
# LuckyPCT("Casualty_Type")
def Replace_Casualty_Type(x) :
    # Group casualty types by the kind of vehicle (or road user) involved
    if "Goods vehicle" in x :
        value = "Goods Vehicle occupant"
    elif "Minibus" in x :
        value = "Minibus(8-16) occupant"
    elif ("Car" in x) or ("car" in x) :
        value = "Car occupant"
    elif "Tram" in x :
        value = "Tram occupant"
    elif "Motorcycle" in x :
        value = "Motorcycle rider"
    elif ("Pedestrian" in x) or ("Cyclist" in x) or ("Bus" in x) :
        value = x
    else :
        value = "Others"
    return value
Data.loc[:, "Casualty_Type"] = Data["Casualty_Type"].apply(Replace_Casualty_Type).values
# Pedestrian_Crossing-Human_Control Drop
# Pedestrian_Location Drop
# Pedestrian_Movement Drop
Data.drop(["Pedestrian_Crossing-Human_Control", "Pedestrian_Location", "Pedestrian_Movement"], axis = 1, inplace = True)
# Casualty_Home_Area_Type OK
# Vehicle_Type Drop
# LuckyPCT("Vehicle_Type")
# Data.groupby("Casualty_Type")["Vehicle_Type"].value_counts().unstack()
# Vehicle is same as Casualty
Data.drop("Vehicle_Type", axis = 1, inplace = True)
# Towing_and_Articulation Keep
# LuckyPCT("Towing_and_Articulation")
# Vehicle_Manoeuvre
def replace_Vehicle_Manoeuvre(x):
    # Merge manoeuvre variants (e.g. left/right turns) into one level each
    if "Going ahead" in x :
        value = "Going Ahead"
    elif "Waiting to" in x :
        value = "Waiting"
    elif "Turning" in x :
        value = "Turning"
    elif "Overtaking" in x :
        value = "Overtaking"
    elif "Changing lane" in x :
        value = "Changing Lane"
    else :
        value = x
    return value
Data.loc[:, "Vehicle_Manoeuvre"] = Data["Vehicle_Manoeuvre"].apply(replace_Vehicle_Manoeuvre).values
# Junction_Location
def Replace_Junction_Location(X):
    # Shorten and merge junction-location labels
    if "Approaching" in X :
        value = "Approaching Junction or Near"
    elif "Mid Junction" in X :
        value = "in Middle"
    elif "waiting" in X :
        value = "Waiting or Parking"
    elif "Entering" in X :
        value = "Entering"
    elif "Leaving" in X :
        value = "Leaving"
    elif "Data missing" in X :
        value = "Missing"
    else :
        value = X
    return value
Data.loc[:, "Junction_Location"] = Data["Junction_Location"].apply(Replace_Junction_Location).values
# Skidding_and_Overturning Keep
LuckyPCT("Skidding_and_Overturning")
# Hit_Object_in_Carriageway Keep
LuckyPCT("Hit_Object_in_Carriageway")
# 1st_Point_of_Impact OK
# Sex_of_Driver
Data.loc[:, "Sex_of_Driver"] = Data["Sex_of_Driver"].apply(lambda x : x if x in ["Male", "Female"] else "Others or Missing").values
# Age_Band_of_Driver OK
Data.groupby("Driver_Home_Area_Type")["Casualty_Home_Area_Type"].value_counts().unstack()
# Driver_Home_Area_Type Keep
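1. For columns with many categories, merge similar categories into one.
2. For columns with little variation, check the conditional probability of Lucky.
Each column is then kept, consolidated, or dropped.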
Data Wrangling- Numeric
Pearson Correlation Matrix
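The correlation figure from this slide is not recoverable, so here is a minimal sketch of how such a matrix could be computed with pandas. The toy values are invented for illustration; on the real data the frame would presumably be the numeric columns of `Data`.

```python
import pandas as pd

# Hypothetical toy values; only the column names come from the deck
toy_numeric = pd.DataFrame({
    "Speed_limit": [30, 30, 60, 70, 40],
    "Age_of_Driver": [22, 35, 48, 61, 30],
    "Age_of_Vehicle": [10, 7, 4, 1, 8],
})

# Pairwise Pearson correlation between numeric columns
corr_matrix = toy_numeric.corr(method="pearson")
print(corr_matrix.round(2))
```

The result is a symmetric matrix with ones on the diagonal, which can be passed to a heatmap for display.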
Data Wrangling- Numeric
Statistic | Number_of_Vehicles | Number_of_Casualties | Speed_limit | Age_of_Casualty | Car_Passenger | Bus_or_Coach_Passenger | Age_of_Driver | Engine_Capacity_(CC) | Age_of_Vehicle |
---|---|---|---|---|---|---|---|---|---|
count | 2402909 | 2402909 | 2402909 | 2353995 | 2402129 | 2402846 | 2308445 | 1807804 | 1735458 |
mean | 1.94 | 1.85 | 40.23 | 35.23 | 0.28 | 0.09 | 37.75 | 1767.12 | 7.44 |
std | 0.90 | 2.02 | 14.69 | 18.37 | 0.59 | 0.56 | 16.06 | 1486.62 | 4.61 |
min | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
25% | 1.00 | 1.00 | 30.00 | 21.00 | 0.00 | 0.00 | 24.00 | 1242.00 | 4.00 |
50% | 2.00 | 1.00 | 30.00 | 32.00 | 0.00 | 0.00 | 35.00 | 1590.00 | 7.00 |
75% | 2.00 | 2.00 | 60.00 | 47.00 | 0.00 | 0.00 | 48.00 | 1956.00 | 10.00 |
max | 67.00 | 93.00 | 70.00 | 104.00 | 2.00 | 4.00 | 100.00 | 99999.00 | 111.00 |
Boxplots :
Data Wrangling- Numeric
Clamp Transformation :
# Clamp Transformation: cap values above median + 3 * std at that threshold
EngineCC_Ser = Data["Engine_Capacity_(CC)"]
EngineCC_Upper = EngineCC_Ser.median() + 3 * EngineCC_Ser.std()
Data.loc[:, "Engine_Capacity_(CC)"] = np.where(EngineCC_Ser > EngineCC_Upper, EngineCC_Upper, EngineCC_Ser)
Sample Distribution
Data["LuckyOrNot"].value_counts()
> UnLucky 2296804
Lucky 106105
Name: LuckyOrNot, dtype: int64
Machine Learning
Remove Rows with NaN
Before
After
Under-Sampling
Data_dropna = Data.dropna()
Data_dropna["LuckyOrNot"].value_counts()
> UnLucky 1588699
Lucky 80521
Name: LuckyOrNot, dtype: int64
# We use Under-sampling (prototype selection)
def Undersampling(data, label) :
    # Randomly keep as many majority-class rows as there are minority-class rows
    counts = data[label].value_counts()    # sorted from most to least frequent
    largerSample, smallerSample = counts.index[0], counts.index[1]
    LargeSampleData = data[data[label] == largerSample]
    SmallSampleData = data[data[label] == smallerSample]
    ExtractNrow = SmallSampleData.shape[0]
    # Draw row positions without replacement so no majority row appears twice
    subsampleIndex = np.random.choice(LargeSampleData.shape[0], size = ExtractNrow, replace = False)
    Chosen_LargeSample = LargeSampleData.iloc[subsampleIndex,:]
    Under_Sample_SubSample = pd.concat([SmallSampleData, Chosen_LargeSample], axis = 0)
    return Under_Sample_SubSample
# Function End
# UnderSampling for Catboost
Data_dropna_underS = Undersampling(Data_dropna, "LuckyOrNot")
# Encode the response as integers: 1 = Lucky, 0 = UnLucky
Data_dropna_underS["LuckyOrNot"] = np.where(Data_dropna_underS["LuckyOrNot"] == "Lucky", 1, 0)
# End
from sklearn.model_selection import train_test_split

# Create Train Test Data with constant response term proportion
# The response is selected by name rather than by a hard-coded column position
catrainX, catestX, catrainY, catestY = train_test_split(Data_dropna_underS.drop(columns = "LuckyOrNot"),
                                                        Data_dropna_underS["LuckyOrNot"],
                                                        shuffle = True,
                                                        stratify = Data_dropna_underS["LuckyOrNot"],
                                                        train_size = 0.7)
# End
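Response Proportion
Under Sampling
We did not use a more elaborate under-sampling method (for example, prototype generation); we use prototype selection, randomly sampling from the majority class.
The train/test split is 7:3.
The response is labeled as:
1 : if "Lucky"
0 : if "UnLucky"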
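As an alternative way to write the random under-sampling described above, the following is a minimal sketch using `DataFrameGroupBy.sample` (available in pandas 1.1 and later). The function name `undersample_sketch`, the `seed` parameter, and the toy frame are assumptions for illustration, not part of the original deck.

```python
import pandas as pd

def undersample_sketch(df, label, seed=0):
    """Keep n rows per class, where n is the size of the smallest class."""
    n_min = df[label].value_counts().min()
    # Sample without replacement within each class
    return df.groupby(label).sample(n=n_min, random_state=seed)

# Hypothetical imbalanced toy data: 8 UnLucky rows, 2 Lucky rows
toy = pd.DataFrame({
    "LuckyOrNot": ["UnLucky"] * 8 + ["Lucky"] * 2,
    "Speed_limit": [30, 30, 40, 60, 70, 30, 50, 60, 30, 70],
})
balanced = undersample_sketch(toy, "LuckyOrNot")
print(balanced["LuckyOrNot"].value_counts())
```

Fixing `random_state` makes the subsample reproducible, which helps when comparing models trained on the same balanced data.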
CatBoost
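We use the CatBoost model.
from catboost import CatBoostClassifier, Pool

# Create Pool train test data for catboost
# Every non-float column is treated as categorical (np.float was removed from NumPy; use float)
categorical_features_indices = np.where(catrainX.dtypes != float)[0]
train_pool = Pool(catrainX, catrainY, cat_features=categorical_features_indices)
test_pool = Pool(catestX, cat_features=categorical_features_indices)
# End
Wrapping the training and test data in a Pool puts them in the feature-data form that the CatBoost algorithm expects.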
CatBoost
# Fit the Model!
catboost_model = CatBoostClassifier(iterations=1000,
loss_function = "Logloss",
custom_metric = "AUC",
learning_rate = 0.1,
bootstrap_type = "Bayesian",
depth = 7,
min_data_in_leaf = 1,
task_type = "GPU",
silent = True)
catboost_model.fit(train_pool)
Hyperparameter settings
# Predict
cat_pred_label = catboost_model.predict(test_pool, prediction_type="Class")
cat_pred_prob = catboost_model.predict(test_pool, prediction_type="Probability")
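# Probability columns follow catboost_model.classes_; with the 1/0 encoding that is [0, 1] = [UnLucky, Lucky]
catboost_res_DF = pd.DataFrame(cat_pred_prob, columns=["UnLucky", "Lucky"]).assign(Label = cat_pred_label)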
Predict
CatBoost
Overfitting?
CatBoost
from sklearn import metrics
import matplotlib.pyplot as plt

# Score (labels are encoded as 1 = Lucky, 0 = UnLucky)
def Cat_boost_model_Score(estimator) :
    # Use the predictions of the estimator passed in, not global results
    y_true = np.asarray(catestY).astype(int)
    pred_label = np.asarray(estimator.predict(catestX)).astype(int).ravel()
    prob_lucky = estimator.predict_proba(catestX)[:, 1]    # column 1 = class 1 = Lucky
    # Accuracy
    print("Accuracy:", metrics.accuracy_score(y_true, pred_label))
    # AUC
    fpr, tpr, thresholds = metrics.roc_curve(y_true, prob_lucky)
    print("AUC: ", metrics.auc(fpr, tpr))
    # Recall
    print("Recall:", metrics.recall_score(y_true, pred_label))
    # Precision
    print("Precision:", metrics.precision_score(y_true, pred_label))
    # f1-score
    print("f1 Score:", metrics.f1_score(y_true, pred_label))
    # ROC Curve
    plt.plot(fpr, tpr)
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve with CatBoost")
    plt.show()
Cat_boost_model_Score(catboost_model)
Model Performance
Accuracy: 0.7290170347525511
AUC: 0.8060161445738684
Recall: 0.7615913230667329
Precision: 0.7150019432568986
f1 Score: 0.7375616405404323
Performance is below expectations.
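As a quick sanity check on the reported figures, the F1 score should equal the harmonic mean of precision and recall. The short calculation below uses only the numbers printed above; the variable names are chosen for illustration.

```python
# Reported test-set metrics from the slide
precision = 0.7150019432568986
recall = 0.7615913230667329
reported_f1 = 0.7375616405404323

# F1 is the harmonic mean of precision and recall
f1_from_pr = 2 * precision * recall / (precision + recall)
print("recomputed:", f1_from_pr)
print("reported:  ", reported_f1)
```

The two values agree closely, which suggests the recall, precision, and F1 figures all refer to the same positive class.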
CatBoost
GridSearchCV
### GridSearchCV
from sklearn.model_selection import GridSearchCV
parms = {"learning_rate" : [1, 0.5, 0.1],
"iterations" : [750, 1000, 1250],
"depth" :[3,5,7]}
catboost_model_grid = CatBoostClassifier(loss_function = "Logloss",
custom_metric = "AUC",
task_type = "GPU",
cat_features = categorical_features_indices,
min_data_in_leaf = 1,
bootstrap_type = "Bayesian",
silent = True)
Grid_Catboost = GridSearchCV(estimator = catboost_model_grid, param_grid = parms, cv = 5, n_jobs = 1)
Grid_Catboost.fit(catrainX, catrainY)
========================================================
Results from Grid Search
========================================================
The best estimator across ALL searched params:
<catboost.core.CatBoostClassifier object at 0x7f46b8d864a8>
The best score across ALL searched params:
0.7259888825009236
The best parameters across ALL searched params:
{'depth': 7, 'iterations': 1000, 'learning_rate': 0.1}
========================================================
Result
CatBoost
GridSearchCV Predict
# Best Estimator Fit!
Cat_boost_model_Score(Grid_Catboost.best_estimator_)
The difference is small. Why?
Original Result
Accuracy: 0.7290170347525511
AUC: 0.8060161445738684
Recall: 0.7615913230667329
Precision: 0.7150019432568986
f1 Score: 0.7375616405404323
Grid Result
Accuracy: 0.7316781197607269
AUC: 0.8075542831278494
Recall: 0.7630910295152544
Precision: 0.7160105618176978
f1 Score: 0.738556223090017
CatBoost
Why the poor performance?
1. The granularity of the data's features
2. The under-sampling step introduces bias, since it is random under-sampling
Model Interpretation
Feature Importance
Permutation Importance:
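The permutation-importance figure is not recoverable from the slide. As a reminder of the idea, here is a minimal NumPy sketch: shuffle one feature at a time and measure how much accuracy drops. The function `permutation_importance_sketch`, the toy data, and the toy predictor are all assumptions for illustration, not the deck's implementation.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances

# Hypothetical toy setup: only feature 0 determines the label
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    # Stand-in for a fitted model's predict method
    return (X[:, 0] > 0).astype(int)

toy_importances = permutation_importance_sketch(toy_predict, X_toy, y_toy)
print(toy_importances)
```

A feature the model ignores gets an importance of zero, while shuffling the one feature it relies on drops accuracy toward chance.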
Feature Importance
SHAP (SHapley Additive exPlanations):
Feature Importance
We use SHAP values to examine variable importance.
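The slides do not show how the SHAP values were computed. To illustrate what a SHAP value measures, the sketch below computes exact Shapley values by enumerating feature subsets for a tiny linear model. It does not use the `shap` library; the function `exact_shapley`, the weights, and the background data are assumptions for illustration, and this brute-force approach is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial
import numpy as np

def exact_shapley(predict, x, background):
    """Exact Shapley values for one row x, averaging over background rows."""
    n = x.shape[0]

    def value(subset):
        # Expected prediction with features in `subset` fixed to x
        # and the remaining features taken from the background rows
        samples = background.copy()
        idx = list(subset)
        if idx:
            samples[:, idx] = x[idx]
        return predict(samples).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

# Hypothetical linear model: prediction = X @ w + 3
w = np.array([2.0, -1.0, 0.5])

def toy_predict(X):
    return X @ w + 3.0

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 2.0, 3.0])
phi = exact_shapley(toy_predict, x, background)
print(phi)
```

For a linear model each value reduces to the weight times the feature's deviation from its background mean, and the values sum to the gap between this row's prediction and the average prediction.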
SHAP Force Plot
SHAP force plot for the first 100 samples
SHAP force plot for the 50th sample
SHAP Decision Plot
Decision Plot (100 Obs)
From this variable onward, the model output starts to converge and level off.
SHAP Summary Plot
Partial Dependence Plot
One Way Plot : Vehicle Manoeuvre (levels shown: U-turn, Going Ahead, Others or Missing, Changing Lane, Overtaking, Turning, Parked, Moving off, Slowing or Stopping, Waiting, Reversing)
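The one-way plots themselves are lost, but the quantity they show can be sketched: for each level of a categorical feature, set every row to that level and average the model's predicted probability. The function `one_way_pdp`, the toy frame, and the toy scoring function below are assumptions for illustration, not the deck's code.

```python
import numpy as np
import pandas as pd

def one_way_pdp(predict_proba, X, feature):
    """Average predicted probability when `feature` is forced to each level."""
    result = {}
    for level in X[feature].unique():
        X_mod = X.copy()
        X_mod[feature] = level          # overwrite the feature for every row
        result[level] = float(np.mean(predict_proba(X_mod)))
    return pd.Series(result).sort_values(ascending=False)

# Hypothetical toy data and scoring function standing in for a fitted model
X_toy = pd.DataFrame({
    "Casualty_Class": ["Pedestrian", "Passenger", "Driver or rider", "Passenger"],
    "Speed_limit": [30, 30, 60, 70],
})

def toy_proba(X):
    is_passenger = (X["Casualty_Class"] == "Passenger").to_numpy()
    return 0.2 + 0.3 * is_passenger + 0.001 * X["Speed_limit"].to_numpy()

pdp = one_way_pdp(toy_proba, X_toy, "Casualty_Class")
print(pdp)
```

With a fitted CatBoost model, `predict_proba` would presumably be a wrapper returning the Lucky-class probability for a feature frame.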
Partial Dependence Plot
One Way Plot : Casualty Class (levels shown: Passenger, Driver or rider, Pedestrian)
Partial Dependence Plot
One Way Plot : Casualty Type (levels shown: Minibus(8-16) occupant, Others, Bus or coach occupant, Goods Vehicle occupant, Car occupant, Motorcycle rider, Pedestrian)
Partial Dependence Plot
One Way Plot : 1st Point of Impact (levels shown: Front, Offside, Nearside, Did not impact, Back, Others or Missing)
Thanks for Listening
Q & A
Business Data Analytics
By Chen Ta Hung