Automated Business Data Analytics

陳達泓 呂學靖

Business Data Analytics

Data

Our data set is UK road accident data covering 2005 to 2015.

import pandas as pd
import numpy as np
import datatable as dt

Accident = dt.fread("/Users/user/Desktop/自動化/dft-accident-data/Accidents0515.csv").to_pandas()
Casualty = dt.fread("/Users/user/Desktop/自動化/dft-accident-data/Clean Data/Casualties.csv").to_pandas()
Vehicle = dt.fread("/Users/user/Desktop/自動化/dft-accident-data/Clean Data/Vehicles.csv").to_pandas()

Total = Accident.merge(Casualty, how = "outer").merge(Vehicle, how = "left")

Rows : 2402909    Cols : 66
Total.columns

> Index(['Accident_Index', 
         'Location_Easting_OSGR', 'Location_Northing_OSGR','Longitude', 'Latitude', 
         'Police_Force', 
         'Accident_Severity',
         'Number_of_Vehicles', 
         'Number_of_Casualties', 
         'Date', 'Day_of_Week', 'Time', 
         'Local_Authority_(District)', 'Local_Authority_(Highway)',
         '1st_Road_Class', '1st_Road_Number', 
         'Road_Type', 
         'Speed_limit',
         'Junction_Detail', 'Junction_Control', 
         '2nd_Road_Class', '2nd_Road_Number', 
         'Pedestrian_Crossing-Human_Control',
         'Pedestrian_Crossing-Physical_Facilities', 
         'Light_Conditions',
         'Weather_Conditions', 
         'Road_Surface_Conditions',
         'Special_Conditions_at_Site', 
         'Carriageway_Hazards',
         'Urban_or_Rural_Area', 
         'Did_Police_Officer_Attend_Scene_of_Accident',
         'LSOA_of_Accident_Location', 
         'Vehicle_Reference', 
         'Casualty_Reference',
         'Casualty_Class', 
         'Sex_of_Casualty', 
         'Age_of_Casualty', 'Age_Band_of_Casualty', 
         'Casualty_Severity', 
         'Pedestrian_Location', 'Pedestrian_Movement', 
         'Car_Passenger', 'Bus_or_Coach_Passenger',
         'Pedestrian_Road_Maintenance_Worker', 
         'Casualty_Type','Casualty_Home_Area_Type', 
         'Vehicle_Type', 'Towing_and_Articulation',
         'Vehicle_Manoeuvre', 
         'Vehicle_Location-Restricted_Lane',
         'Junction_Location', 
         'Skidding_and_Overturning',
         'Hit_Object_in_Carriageway', 'Vehicle_Leaving_Carriageway', 'Hit_Object_off_Carriageway', 
         '1st_Point_of_Impact',
         'Was_Vehicle_Left_Hand_Drive?', 
         'Journey_Purpose_of_Driver',
         'Sex_of_Driver', 'Age_of_Driver', 'Age_Band_of_Driver',
         'Engine_Capacity_(CC)', 
         'Propulsion_Code', 
         'Age_of_Vehicle',
         'Driver_IMD_Decile', 
         'Driver_Home_Area_Type'], dtype='object')

Columns Glimpse

Preprocessing

Check Missing

Total.isnull().sum()[Total.isnull().sum() != 0]

> Location_Easting_OSGR     183
  Location_Northing_OSGR    183
  Longitude                 183
  Latitude                  183
  dtype: int64

Merge Relational Datasets

def label_map(column_name, file) :
    # map coded values to human-readable labels via the context lookup CSVs
    df = pd.read_csv("/Users/user/Desktop/自動化/dft-accident-data/contextCSVs/" + file)
    DICT = dict(zip(df[df.columns[0]], df[df.columns[1]]))
    Total[column_name] = Total[column_name].map(DICT)
      
label_map("Age_Band_of_Driver","Age_Band.csv")   #Age_Band
label_map("Age_Band_of_Casualty","Age_Band.csv") #Age_Band
label_map("Casualty_Class","Casualty_Class.csv")
label_map("Casualty_Type","Casualty_Type.csv")
label_map("Day_of_Week", "Day_of_Week.csv")
label_map("Journey_Purpose_of_Driver", "Journey_Purpose.csv")
label_map("Junction_Control", "Junction_Control.csv")
label_map("Junction_Detail", "Junction_Detail.csv")
label_map("Junction_Location", "Junction_Location.csv")
label_map("Light_Conditions", "Light_Conditions.csv")
label_map("Local_Authority_(District)", "Local_Authority_District.csv")
label_map("Local_Authority_(Highway)", "Local_Authority_Highway.csv")
label_map("Pedestrian_Crossing-Human_Control", "Ped_Cross_Human.csv")
label_map("Pedestrian_Crossing-Physical_Facilities", "Ped_Cross_Physical.csv")
label_map("Pedestrian_Location", "Ped_Location.csv")
label_map("Pedestrian_Movement", "Ped_Movement.csv")
label_map("1st_Point_of_Impact", "Point_of_Impact.csv")
label_map("Police_Force","Police_Force.csv")
label_map("Did_Police_Officer_Attend_Scene_of_Accident", "Police_Officer_Attend.csv")
label_map("1st_Road_Class", "Road_Class.csv") #Road_Class
label_map("2nd_Road_Class", "Road_Class.csv") #Road_Class
label_map("Road_Type", "Road_Type.csv")
label_map("Sex_of_Driver", "Sex_of_Driver.csv")
label_map("Sex_of_Casualty", "Sex_of_Driver.csv")
label_map("Urban_or_Rural_Area", "Urban_Rural.csv")
label_map("Vehicle_Location-Restricted_Lane", "Vehicle_Location.csv")
label_map("Vehicle_Manoeuvre", "Vehicle_Manoeuvre.csv")
label_map("Vehicle_Type", "Vehicle_Type.csv")

Create Response Variable

Accident Severity & Casualty Severity

Code    Label
1       Fatal
2       Serious
3       Slight

Response :

  res = Accident_Severity - Casualty_Severity

  LuckyOrNot = Lucky      if res < 0
               UnLucky    otherwise
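The construction of the response does not appear in the extracted code; a minimal sketch following the definition above, assuming Accident_Severity and Casualty_Severity still hold the raw 1-3 codes:

# res < 0 means the accident as a whole was more severe than this casualty's
# own injury, i.e. the casualty got off comparatively lightly
res = Total["Accident_Severity"] - Total["Casualty_Severity"]
Total["LuckyOrNot"] = np.where(res < 0, "Lucky", "UnLucky")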

Drop Unused Columns

Total.drop(["Accident_Index", "Time", "Date", 
            "Location_Easting_OSGR", "Location_Northing_OSGR", 
            "Longitude", "Latitude"], axis = 1, inplace = True) 

The report focuses on machine learning rather than data visualization, so the time and location variables are dropped.

Remove the ID, date/time, and geographic columns.

Confirm Column Types

DiscreteVar = ["Police_Force", "Day_of_Week", "1st_Road_Class", "Road_Type", "Vehicle_Location-Restricted_Lane",
               "Junction_Detail", "Junction_Control", "2nd_Road_Class", "Was_Vehicle_Left_Hand_Drive?",
               "Pedestrian_Crossing-Physical_Facilities", "Light_Conditions", "Weather_Conditions",
               "Road_Surface_Conditions", "Special_Conditions_at_Site", "Carriageway_Hazards", "Journey_Purpose_of_Driver",
               "Urban_or_Rural_Area", "Did_Police_Officer_Attend_Scene_of_Accident", "Casualty_Class", 
               "Sex_of_Casualty", "Age_Band_of_Casualty", "Casualty_Type", "Pedestrian_Crossing-Human_Control", 
               "Pedestrian_Location", "Pedestrian_Movement", "Casualty_Home_Area_Type", "Vehicle_Type", 
               "Towing_and_Articulation", "Vehicle_Manoeuvre" ,"Junction_Location", "Skidding_and_Overturning",
               "Hit_Object_in_Carriageway", "1st_Point_of_Impact", "Sex_of_Driver", "Age_Band_of_Driver", 
               "Propulsion_Code", "Driver_IMD_Decile", "Driver_Home_Area_Type", "LuckyOrNot"]

NumericVar = ["Number_of_Vehicles", "Number_of_Casualties", "Speed_limit", "Age_of_Casualty", 
              "Car_Passenger", "Bus_or_Coach_Passenger", "Pedestrian_Road_Maintenance_Worker",
              "Age_of_Driver", "Engine_Capacity_(CC)", "Age_of_Vehicle"]

Confirm the variable types, and drop any categorical variable with 25 or more levels.

DF = Total  # continue with the merged frame under a shorter name

def COL_DROP_RET():
    # flag discrete columns with 25 or more distinct levels
    ser = DF[DiscreteVar].apply(lambda x : len(x.value_counts()))
    ColumnsToDrop = list(ser.index[ser >= 25])
    return ColumnsToDrop

drop_list = COL_DROP_RET()
DF.drop(labels = drop_list, axis = 1, inplace = True)

Replace Missing Values

NA_Labels = [-1, "Unknown", "", "nan", "Total missing or out of range" ,
             "Other/Not known (2005-10)", "Data missing or out of range",
             "data missing", np.nan, "Not known"]
DF.replace(to_replace = NA_Labels, value = -999, inplace = True)

Values whose labels denote a missing entry are replaced with -999,

which makes the subsequent handling easier.

Replace Missing Values

Replace the -999 placeholders with a different value according to the column type:

Discrete variables : "Others or Missing"

Numeric variables : np.nan

Data = DF  # working name used from here on

# cast each column to the dtype that matches its role
for i in Data.columns :
    if i in DiscreteVar and Data[i].dtype != "object" :
        Data.loc[:, i] = Data[i].astype(str).values
    elif i in NumericVar and Data[i].dtype != "float" :
        Data.loc[:, i] = Data[i].astype(float).values

Data_num_var_list = [x for x in Data.columns if x in NumericVar]
Data_str_var_list = [x for x in Data.columns if x in DiscreteVar]
Data.loc[:, Data_num_var_list] = Data[Data_num_var_list].replace(to_replace = -999.0, value = np.nan)
Data.loc[:, Data_str_var_list] = Data[Data_str_var_list].replace(to_replace = "-999", value = "Others or Missing")

Missing Percentage

Discrete_Ser = Data.select_dtypes(["object"]).apply(lambda x : x[x == "Others or Missing"].count()).divide(Data.shape[0])
Numeric_Ser = Data.select_dtypes(["float", "int"]).apply(lambda x : x.isnull().sum()).divide(Data.shape[0])
Missing_Ser = pd.concat([Discrete_Ser, Numeric_Ser])
Missing_Ser[Missing_Ser >= 0.2]


> Junction_Control                      0.367242
  2nd_Road_Class                        0.420547
  Journey_Purpose_of_Driver             0.726724
  Propulsion_Code                       0.241899
  Driver_IMD_Decile                     0.288484
  Pedestrian_Road_Maintenance_Worker    0.598930
  Engine_Capacity_(CC)                  0.247660
  Age_of_Vehicle                        0.277768
  dtype: float64
    
Data.drop(["2nd_Road_Class", "Junction_Control", "Propulsion_Code", "Driver_IMD_Decile", 
           "Pedestrian_Road_Maintenance_Worker", "Journey_Purpose_of_Driver"], 
          axis = 1, inplace = True) 

Variables with an excessive share of missing values are dropped (a judgment call).

Data Manipulation & Viz

Discrete Variables Viz

import matplotlib.pyplot as plt

def PLOT(var) :
    # horizontal bar chart of level frequencies for one discrete column
    global Data
    Data[var].value_counts().sort_values().plot.barh()

for i in Data.columns :
    if i in DiscreteVar :
        print(i, " :") ; print()
        PLOT(i)
        plt.show()

Numeric Variables Viz

fig = plt.figure(figsize = (15,20))
ax = fig.gca()
Data.hist(ax = ax)

Data Viz

The visualizations above were made to inform the data processing that follows.

Data Wrangling - Discrete

pd.options.mode.chained_assignment = None

def LuckyPCT(var) :
    # per level of `var`: Lucky/UnLucky counts and the share of Lucky outcomes
    Df = Data.groupby([var])["LuckyOrNot"].value_counts().unstack().assign(Lucky_pct = lambda x : x.Lucky / (x.Lucky + x.UnLucky))
    Df1 = Df.sort_values(by = "Lucky_pct", ascending = False)
    return Df1
  
# Vehicle_Location-Restricted_Lane
# LuckyPCT("Vehicle_Location-Restricted_Lane")
def replace_Vehicle_Location_Restricted_Lane(x):
    if "hard shoulder" in x :
        value = "Lay-By or Hard Shoulder"
    elif "On main c'way" in x :
        value = "Main Carriageway" 
    elif "Bus" in x :
        value = "Bus Lane or Busway"
    elif "Cycle" in x :
        value = "Cycle Lane or Cycleway"
    else :
        value = "Others or Missing"
    return value
Data.loc[:, "Vehicle_Location-Restricted_Lane"] = Data["Vehicle_Location-Restricted_Lane"].apply(replace_Vehicle_Location_Restricted_Lane).values

# Junction_Detail
Data.loc[:, "Junction_Detail"] = Data["Junction_Detail"].replace(
    to_replace = ["Slip road", "More than 4 arms (not roundabout)","Mini-roundabout"], 
    value = "Other junction").values

# Was_Vehicle_Left_Hand_Drive?    Keep
LuckyPCT("Was_Vehicle_Left_Hand_Drive?")

# Pedestrian_Crossing-Physical_Facilities    Drop
LuckyPCT("Pedestrian_Crossing-Physical_Facilities")
# Safe to drop: the dominant level also has the highest Lucky percentage
Data.drop("Pedestrian_Crossing-Physical_Facilities", axis = 1, inplace = True)

# Light_Conditions
# LuckyPCT("Light_Conditions")
Data.loc[:, "Light_Conditions"] = Data["Light_Conditions"].apply(lambda x : "Darkness - no lights" if x in ["Darkness - no lighting", "Darkness - lights unlit"] else x).values

# Weather_Conditions     Keep
# LuckyPCT("Weather_Conditions")

# Road_Surface_Conditions     Keep
# LuckyPCT("Road_Surface_Conditions")

# Special_Conditions_at_Site     Keep
# LuckyPCT("Special_Conditions_at_Site")

# Carriageway_Hazards   OK
# LuckyPCT("Carriageway_Hazards")

# Journey_Purpose_of_Driver    Drop
# LuckyPCT("Journey_Purpose_of_Driver")

# LuckyOrNot                   Lucky     UnLucky   Lucky_pct
# Journey_Purpose_of_Driver
# ---------------------------------------------------------
# Other                          882       17108        0.05
# Others or Missing            80590     1665661        0.05
# Journey as part of work      16705      347780        0.05

# Lucky_pct is flat across levels, so the column adds nothing;
# errors = "ignore" because it was already dropped together with
# the other high-missingness variables above
Data.drop("Journey_Purpose_of_Driver", axis = 1, inplace = True, errors = "ignore")

# Did_Police_Officer_Attend_Scene_of_Accident
Data.loc[:, "Did_Police_Officer_Attend_Scene_of_Accident"] = Data["Did_Police_Officer_Attend_Scene_of_Accident"].apply(lambda x : "No" if x not in ["Yes", "No", "Others or Missing"] else x).values

# Casualty_Class         OK
# Sex_of_Casualty        OK
# Age_Band_of_Casualty   OK

# Casualty_Type           Keep
# LuckyPCT("Casualty_Type")

def Replace_Casualty_Type(x) :
    if "Goods vehicle" in x :
        value = "Goods Vehicle occupant"
    elif "Minibus" in x :
        value = "Minibus(8-16) occupant"
    elif ("Car" in x) or ("car" in x) :
        value = "Car occupant"
    elif "Tram" in x :
        value = "Tram occupant"
    elif "Motorcycle" in x :
        value = "Motorcycle rider" 
    elif ("Pedestrian" in x) or ("Cyclist" in x) or("Bus" in x):
        value = x
    else :
        value = "Others"
    return value
        
Data.loc[:, "Casualty_Type"] = Data["Casualty_Type"].apply(Replace_Casualty_Type).values

# Pedestrian_Crossing-Human_Control       Drop
# Pedestrian_Location                     Drop
# Pedestrian_Movement                     Drop
Data.drop(["Pedestrian_Crossing-Human_Control", "Pedestrian_Location", "Pedestrian_Movement"], axis = 1, inplace = True)

# Casualty_Home_Area_Type     OK

# Vehicle_Type                Drop
# LuckyPCT("Vehicle_Type")
# Data.groupby("Casualty_Type")["Vehicle_Type"].value_counts().unstack()
# Vehicle_Type duplicates the information already in Casualty_Type
Data.drop("Vehicle_Type", axis = 1, inplace = True)

# Towing_and_Articulation      Keep
# LuckyPCT("Towing_and_Articulation")

# Vehicle_Manoeuvre
def replace_Vehicle_Manoeuvre(x):
    if "Going ahead" in x :
        value = "Going Ahead"
    elif "Waiting to" in x :
        value = "Waiting"
    elif "Turning" in x :
        value = "Turning"
    elif "Overtaking" in x :
        value = "Overtaking"
    elif "Changing lane" in x :
        value = "Changing Lane"
    else :
        value = x
    return value

Data.loc[:, "Vehicle_Manoeuvre"] = Data["Vehicle_Manoeuvre"].apply(replace_Vehicle_Manoeuvre).values

# Junction_Location
def Replace_Junction_Location(X):
    if "Approaching" in X :
        value = "Approaching Junction or Near"
    elif "Mid Junction" in X :
        value = "in Middle"
    elif "waiting" in X :
        value = "Waiting or Parking"
    elif "Entering" in X:
        value = "Entering"
    elif "Leaving" in X :
        value = "Leaving"
    elif "Data missing" in X :
        value = "Missing"
    else :
        value = X
    return value


Data.loc[:, "Junction_Location"] = Data["Junction_Location"].apply(Replace_Junction_Location).values

# Skidding_and_Overturning     Keep
LuckyPCT("Skidding_and_Overturning")

# Hit_Object_in_Carriageway      Keep
LuckyPCT("Hit_Object_in_Carriageway")

# 1st_Point_of_Impact           OK

# Sex_of_Driver
Data.loc[:, "Sex_of_Driver"] = Data["Sex_of_Driver"].apply(lambda x : x if x in ["Male", "Female"] else "Others or Missing").values

# Age_Band_of_Driver           OK

# Driver_Home_Area_Type        Keep
Data.groupby("Driver_Home_Area_Type")["Casualty_Home_Area_Type"].value_counts().unstack()

1. For columns with many levels, group similar levels into one category.

2. For columns with little variation, inspect their conditional probability of Lucky.

Keep, consolidate, or drop.

Data Wrangling - Numeric

Pearson Correlation Matrix
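A minimal sketch of how the matrix can be computed and drawn, assuming the numeric columns that survived the drops above (num_cols is our helper name, not from the original deck):

# Pearson correlation among the remaining numeric columns
num_cols = [c for c in NumericVar if c in Data.columns]
corr = Data[num_cols].corr(method = "pearson")

# simple heatmap of the correlation matrix
fig, ax = plt.subplots(figsize = (8, 8))
im = ax.imshow(corr, cmap = "coolwarm", vmin = -1, vmax = 1)
ax.set_xticks(range(len(num_cols))) ; ax.set_xticklabels(num_cols, rotation = 90)
ax.set_yticks(range(len(num_cols))) ; ax.set_yticklabels(num_cols)
fig.colorbar(im)
plt.show()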

Data Wrangling - Numeric

                         count      mean      std       min     25%      50%      75%       max
Number_of_Vehicles       2402909    1.94      0.90      1.00    1.00     2.00     2.00      67.00
Number_of_Casualties     2402909    1.85      2.02      1.00    1.00     1.00     2.00      93.00
Speed_limit              2402909    40.23     14.69     0.00    30.00    30.00    60.00     70.00
Age_of_Casualty          2353995    35.23     18.37     0.00    21.00    32.00    47.00     104.00
Car_Passenger            2402129    0.28      0.59      0.00    0.00     0.00     0.00      2.00
Bus_or_Coach_Passenger   2402846    0.09      0.56      0.00    0.00     0.00     0.00      4.00
Age_of_Driver            2308445    37.75     16.06     1.00    24.00    35.00    48.00     100.00
Engine_Capacity_(CC)     1807804    1767.12   1486.62   1.00    1242.00  1590.00  1956.00   99999.00
Age_of_Vehicle           1735458    7.44      4.61      1.00    4.00     7.00     10.00     111.00
Boxplots :
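A sketch of how the boxplots can be drawn, reusing num_cols from the correlation sketch above:

# one box per numeric column; subplots keep the very different scales readable
Data[num_cols].plot(kind = "box", subplots = True, layout = (3, 3), figsize = (15, 10))
plt.show()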

Data Wrangling - Numeric

Clamp Transformation :

# Clamp transformation: cap Engine_Capacity_(CC) at median + 3 standard deviations
EngineCC_Ser = Data["Engine_Capacity_(CC)"]
upper = EngineCC_Ser.median() + 3 * EngineCC_Ser.std()
Data.loc[:, "Engine_Capacity_(CC)"] = np.where(EngineCC_Ser > upper, upper, EngineCC_Ser)

Sample Distribution

Data["LuckyOrNot"].value_counts()

> UnLucky    2296804
  Lucky       106105
  Name: LuckyOrNot, dtype: int64

Machine Learning

Remove Rows with NaN

Before

After

Under-Sampling

Data_dropna = Data.dropna()
Data_dropna["LuckyOrNot"].value_counts()

> UnLucky    1588699
  Lucky        80521
  Name: LuckyOrNot, dtype: int64
# We use under-sampling (prototype selection)
def Undersampling(data, label) :
  Train = data
  # value_counts() sorts descending, so index[0] is the more frequent level
  lev1 = Train[label].value_counts().index[0]
  lev2 = Train[label].value_counts().index[1]
  val1 = Train[label].value_counts().values[0]
  val2 = Train[label].value_counts().values[1]

  if val1 > val2 :
    largerSample, smallerSample = lev1, lev2
  else :
    largerSample, smallerSample = lev2, lev1

  LargeSampleData = Train[Train[label] == largerSample]
  SmallSampleData = Train[Train[label] == smallerSample]

  # draw as many majority rows as there are minority rows, without replacement
  ExtractNrow = SmallSampleData.shape[0]
  subsampleIndex = np.random.choice(LargeSampleData.shape[0], ExtractNrow, replace = False)

  Chosen_LargeSample = LargeSampleData.iloc[subsampleIndex, :]
  Under_Sample_SubSample = pd.concat([SmallSampleData, Chosen_LargeSample], axis = 0)

  return Under_Sample_SubSample
# Function End

# Under-sampling for CatBoost
Data_dropna_underS = Undersampling(Data_dropna, "LuckyOrNot")
Data_dropna_underS.loc[:, "LuckyOrNot"] = np.where(Data_dropna_underS["LuckyOrNot"] == "Lucky", 1, 0)
# End

# Create train/test data with a constant response proportion (stratified split)
from sklearn.model_selection import train_test_split

catrainX, catestX, catrainY, catestY = train_test_split(Data_dropna_underS.drop(columns = "LuckyOrNot"),
                                                        Data_dropna_underS["LuckyOrNot"],
                                                        shuffle = True,
                                                        stratify = Data_dropna_underS["LuckyOrNot"],
                                                        train_size = 0.7)
# End
# End

Response Proportion

For under-sampling we did not use a more elaborate method such as synthetic sample generation; we use prototype selection, drawing a random sample from the majority class.

Under-Sampling

The data are split into training and test sets at a 7:3 ratio.
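A quick sanity check (ours, not in the original deck) that the stratified split preserves the now-balanced response proportion:

# both should print roughly 0.5 / 0.5 and be identical across train and test
print(catrainY.value_counts(normalize = True))
print(catestY.value_counts(normalize = True))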

The response is labelled:

  1   if "Lucky"
  0   if "UnLucky"

CatBoost

We use a CatBoost model.

# Create Pool train/test data for CatBoost
from catboost import CatBoostClassifier, Pool

categorical_features_indices = np.where(catrainX.dtypes != np.float64)[0]
train_pool = Pool(catrainX, catrainY, cat_features = categorical_features_indices)
test_pool = Pool(catestX, cat_features = categorical_features_indices)
# End

After pooling, the training and test data are in the feature format that the CatBoost algorithm expects.

CatBoost

# Fit the Model!
catboost_model = CatBoostClassifier(iterations=1000, 
                                    loss_function = "Logloss",
                                    custom_metric = "AUC",
                                    learning_rate = 0.1,
                                    bootstrap_type = "Bayesian",
                                    depth = 7,
                                    min_data_in_leaf = 1, 
                                    task_type = "GPU",
                                    silent = True)

catboost_model.fit(train_pool)

Hyperparameter settings

# Predict
cat_pred_label = catboost_model.predict(test_pool, prediction_type = "Class")
cat_pred_prob = catboost_model.predict(test_pool, prediction_type = "Probability")
# probability columns follow the class order: 0 (UnLucky), 1 (Lucky)
catboost_res_DF = pd.DataFrame(cat_pred_prob, columns = ["UnLucky", "Lucky"]).assign(Label = cat_pred_label)

Predict

CatBoost

Overfitting?

CatBoost

# Score
from sklearn import metrics

def Cat_boost_model_Score(estimator) :
  # recompute predictions from the estimator being scored, so the same
  # function also works for the GridSearchCV best estimator later
  pred_label = estimator.predict(test_pool, prediction_type = "Class")
  pred_prob = estimator.predict(test_pool, prediction_type = "Probability")[:, 1]  # P(Lucky)
  # Accuracy
  print("Accuracy:", estimator.score(X = catestX, y = catestY))
  # AUC (catestY is already 0/1 with 1 = Lucky)
  fpr, tpr, thresholds = metrics.roc_curve(catestY, pred_prob)
  print("AUC: ", metrics.auc(fpr, tpr))
  # Recall
  print("Recall:", metrics.recall_score(catestY, pred_label))
  # Precision
  print("Precision:", metrics.precision_score(catestY, pred_label))
  # f1-score
  print("f1 Score:", metrics.f1_score(catestY, pred_label))
  # ROC Curve
  plt.plot(fpr, tpr)
  plt.xlabel("False Positive Rate")
  plt.ylabel("True Positive Rate")
  plt.title("ROC Curve with CatBoost")
  plt.show()

Cat_boost_model_Score(catboost_model)

Model Performance

Accuracy: 0.7290170347525511

AUC: 0.8060161445738684

Recall: 0.7615913230667329

Precision: 0.7150019432568986

f1 Score: 0.7375616405404323

Performance is below expectations...

CatBoost

GridSearchCV

### GridSearchCV
from sklearn.model_selection import GridSearchCV
parms = {"learning_rate" : [1, 0.5, 0.1],
         "iterations" : [750, 1000, 1250],
         "depth" :[3,5,7]}
catboost_model_grid = CatBoostClassifier(loss_function = "Logloss",
                                         custom_metric = "AUC",
                                         task_type = "GPU", 
                                         cat_features = categorical_features_indices,
                                         min_data_in_leaf = 1,
                                         bootstrap_type = "Bayesian",
                                         silent = True)
Grid_Catboost = GridSearchCV(estimator = catboost_model_grid, param_grid = parms, cv = 5, n_jobs = 1)
Grid_Catboost.fit(catrainX, catrainY)
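The summary below can be printed from the fitted grid object; a sketch using GridSearchCV's standard attributes:

print("=" * 56)
print(" Results from Grid Search ")
print("=" * 56)
print("\n The best estimator across ALL searched params:\n", Grid_Catboost.best_estimator_)
print("\n The best score across ALL searched params:\n", Grid_Catboost.best_score_)
print("\n The best parameters across ALL searched params:\n", Grid_Catboost.best_params_)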

========================================================
 Results from Grid Search 
========================================================

 The best estimator across ALL searched params:
 <catboost.core.CatBoostClassifier object at 0x7f46b8d864a8>

 The best score across ALL searched params:
 0.7259888825009236

 The best parameters across ALL searched params:
 {'depth': 7, 'iterations': 1000, 'learning_rate': 0.1}

 ========================================================

Result

CatBoost

GridSearchCV Predict

# Best Estimator Fit!
Cat_boost_model_Score(Grid_Catboost.best_estimator_)

The difference is small. So why?

Original Result :

Accuracy: 0.7290170347525511
AUC: 0.8060161445738684
Recall: 0.7615913230667329
Precision: 0.7150019432568986
f1 Score: 0.7375616405404323

Grid Result :

Accuracy: 0.7316781197607269
AUC: 0.8075542831278494
Recall: 0.7630910295152544
Precision: 0.7160105618176978
f1 Score: 0.738556223090017

CatBoost

Why is the performance poor?

1. The granularity of the available features.

2. The under-sampling step introduces bias, since it is purely random under-sampling.

Model Interpretation

Feature Importance

Permutation Importance:
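One way to compute it, a sketch assuming the fitted catboost_model and the held-out catestX / catestY from above (CatBoostClassifier follows the scikit-learn estimator API, so sklearn's permutation_importance can score it directly):

from sklearn.inspection import permutation_importance

# shuffle each column in turn and measure the drop in accuracy
perm = permutation_importance(catboost_model, catestX, catestY,
                              n_repeats = 5, random_state = 0)
perm_ser = pd.Series(perm.importances_mean, index = catestX.columns)
print(perm_ser.sort_values(ascending = False).head(10))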

Feature Importance

SHAP (SHapley Additive exPlanations) :

Feature Importance

We use SHAP values to investigate variable importance.
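A sketch using the shap package, whose TreeExplainer supports CatBoost models (with categorical features, the data must be passed as a Pool):

import shap

explainer = shap.TreeExplainer(catboost_model)
shap_values = explainer.shap_values(Pool(catestX, cat_features = categorical_features_indices))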

SHAP Force Plot

SHAP force plot of the first 100 samples.

SHAP force plot of the 50th sample.
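A sketch of the two force plots, using the explainer and shap_values computed above:

shap.initjs()

# force plot of the first 100 test samples
shap.force_plot(explainer.expected_value, shap_values[:100, :], catestX.iloc[:100, :])

# force plot of the 50th sample
shap.force_plot(explainer.expected_value, shap_values[49, :], catestX.iloc[49, :])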

SHAP Decision Plot

Decision Plot (100 Obs)

From this variable onward, the model output converges and flattens out.
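A sketch of the decision plot for the first 100 observations:

shap.decision_plot(explainer.expected_value, shap_values[:100, :], catestX.iloc[:100, :])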

SHAP Summary Plot
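A sketch of the summary plot, which gives the global view of feature impact and direction:

shap.summary_plot(shap_values, catestX)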

Partial Dependence Plot

One Way Plot : Vehicle Manoeuvre

(categories shown in the plot: U-turn, Going Ahead, Others or Missing, Changing Lane, Overtaking, Turning, Parked, Moving off, Slowing or Stopping, Waiting, Reversing)
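These one-way plots can be reproduced by hand: force the entire column to one category, average the predicted probability of Lucky, and repeat per category. A minimal sketch reusing catboost_model, catestX and categorical_features_indices from above (the helper one_way_pdp is ours, not from the original deck); the same call with "Casualty_Class", "Casualty_Type" or "1st_Point_of_Impact" yields the plots below:

def one_way_pdp(model, X, feature):
    # average P(Lucky) with the whole column forced to each observed category
    pdp = {}
    for cat in X[feature].unique():
        X_mod = X.copy()
        X_mod[feature] = cat
        pool = Pool(X_mod, cat_features = categorical_features_indices)
        pdp[cat] = model.predict(pool, prediction_type = "Probability")[:, 1].mean()
    return pd.Series(pdp).sort_values()

one_way_pdp(catboost_model, catestX, "Vehicle_Manoeuvre").plot.barh()
plt.show()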

Partial Dependence Plot

One Way Plot : Casualty Class

(categories shown in the plot: Passenger, Driver or rider, Pedestrian)

Partial Dependence Plot

One Way Plot : Casualty Type

(categories shown in the plot: Minibus(8-16) occupant, Others, Bus or coach occupant, Goods Vehicle occupant, Car occupant, Motorcycle rider, Pedestrian)

Partial Dependence Plot

One Way Plot : 1st Point of Impact

(categories shown in the plot: Front, Offside, Nearside, Did no impact, Back, Others or Missing)

Thanks for Listening

Q & A

Business Data Analytics

By Chen Ta Hung