variables
inputs and outputs
data types
import numpy as np
import pandas as pd
import pandas_profiling as pp
from pandas_profiling import ProfileReport
traindf = pd.read_csv('./data/train.csv').set_index('PassengerId')
testdf = pd.read_csv('./data/test.csv').set_index('PassengerId')
traindf.head(3)
boolean
categorical (nominal vs ordinal)
date time
number (integer, float)
strings
others
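To check how pandas actually typed each column, and to cast the ones that should be categorical, something like the following works on the training set loaded above (which columns to cast is a judgment call):

traindf.dtypes  # dtypes pandas inferred on load
traindf['Sex'] = traindf['Sex'].astype('category')
traindf['Embarked'] = traindf['Embarked'].astype('category')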
the automated way
ProfileReport(traindf, title='Dataset profiling', html={'style': {'full_width': True}})
basic info
traindf.describe()
AutoViz will automatically plot charts for each variable
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
adf = AV.AutoViz(filename="", sep=',', depVar='Survived', dfte=traindf, header=0,
                 verbose=2, lowess=False, chart_format='svg',
                 max_rows_analyzed=150000, max_cols_analyzed=30)
Central Tendency: Mean, Median, Mode, Min, Max
Measure of Dispersion: Range, Quartile, Variance, Standard Deviation
Visualization: Histogram, Box Plot
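A minimal sketch of these univariate checks, using the Age column as an example (matplotlib assumed for the plots):

import matplotlib.pyplot as plt

age = traindf['Age'].dropna()
# central tendency
print(age.mean(), age.median(), age.mode()[0], age.min(), age.max())
# dispersion
print(age.max() - age.min(), age.quantile([0.25, 0.5, 0.75]).values, age.var(), age.std())
# visualization
age.plot(kind='hist', bins=30, title='Age histogram')
plt.show()
age.plot(kind='box', title='Age box plot')
plt.show()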
Count per category
Visualization: bar chart
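For example, on the Embarked column:

counts = traindf['Embarked'].value_counts()  # count per category
print(counts)
counts.plot(kind='bar', title='Passengers per port of embarkation')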
Correlation: measures the strength of the relationship between two numerical variables
Visualization: matrix/heatmap
We will use Spearman's correlation coefficient (rank-based, so it also captures non-linear monotonic relationships)
Visualization: scatter plot
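A sketch of the numerical-vs-numerical analysis on the Titanic numeric columns, with seaborn assumed for the heatmap:

import seaborn as sns
import matplotlib.pyplot as plt

num_cols = ['Age', 'Fare', 'SibSp', 'Parch']
corr = traindf[num_cols].corr(method='spearman')  # Spearman correlation matrix
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

traindf.plot(kind='scatter', x='Age', y='Fare')   # one pair in detail
plt.show()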
Visualization: two-way table, stacked column chart
Use the chi-square test to assess the statistical significance of the relationship
p-value close to 0 = variables are dependent - vs - p-value close to 1 = variables are independent
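A sketch of this categorical-vs-categorical analysis with Sex vs. Survived (scipy assumed for the test):

from scipy.stats import chi2_contingency

crosstab = pd.crosstab(traindf['Sex'], traindf['Survived'])  # two-way table
print(crosstab)
crosstab.plot(kind='bar', stacked=True)                      # stacked column chart

chi2, p_value, dof, expected = chi2_contingency(crosstab)
print(p_value)  # close to 0 -> dependent, close to 1 -> independent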
Visualization: box plot
Use a Z-test or T-test to check statistical significance (T-test for a small number of observations, i.e. tens; otherwise Z-test)
ANOVA for more than 2 groups, but we will keep that for the stats course
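A sketch of the categorical-vs-numerical comparison, e.g. Fare of survivors vs. non-survivors (scipy's two-sample T-test, Welch variant since the groups differ in size and variance):

from scipy.stats import ttest_ind

traindf.boxplot(column='Fare', by='Survived')  # distribution of Fare per group

fare_survived = traindf.loc[traindf['Survived'] == 1, 'Fare'].dropna()
fare_died = traindf.loc[traindf['Survived'] == 0, 'Fare'].dropna()
t_stat, p_value = ttest_ind(fare_survived, fare_died, equal_var=False)
print(p_value)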
Solutions:
Delete rows
+ fast and easy
+ good when missing completely at random
- reduces sample size thus model power
Mean/Mode/Median imputation
+ easy, the most used option
- but can distort the distribution and underestimate the variance
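A minimal sketch of both simple options on two columns that have missing values in the Titanic training set (Age, Embarked):

# option 1: delete rows with any missing value (shrinks the sample)
train_dropped = traindf.dropna()

# option 2: impute - median for a numerical column, mode for a categorical one
train_imputed = traindf.copy()
train_imputed['Age'] = train_imputed['Age'].fillna(train_imputed['Age'].median())
train_imputed['Embarked'] = train_imputed['Embarked'].fillna(train_imputed['Embarked'].mode()[0])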
Solutions:
Predict/Classify
+ can provide really good results
- if there are no relationships with other variables, results are poor
- need to create a model, which is time consuming
Clustering
+ easier to do than building a model
+ easily handles multiple columns with missing values
- can be computationally very time consuming
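As one concrete stand-in for the predict/cluster ideas above, scikit-learn's KNNImputer fills a missing value from the most similar rows; a sketch on the numerical columns only (categoricals would need encoding first):

from sklearn.impute import KNNImputer

num_cols = ['Age', 'Fare', 'SibSp', 'Parch', 'Pclass']
imputer = KNNImputer(n_neighbors=5)
traindf_knn = traindf.copy()
traindf_knn[num_cols] = imputer.fit_transform(traindf_knn[num_cols])  # NaNs filled from the 5 nearest rows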
Reasons: errors (non-natural) vs. natural variation
Human Error
Measurement Error
Experiment Error
Sampling Error
Natural
Detection: plots, standard deviation (flag values more than 3× stddev from the mean), Mahalanobis distance
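A sketch of the 3× standard deviation rule on Fare (the same idea works per column or as a z-score):

fare = traindf['Fare']
outlier_mask = (fare - fare.mean()).abs() > 3 * fare.std()  # more than 3 stddev from the mean
print(outlier_mask.sum(), 'potential outliers')
print(traindf.loc[outlier_mask, ['Pclass', 'Fare']].head())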
Solutions:
Delete
+ fast and easy
- but discards information and can bias the model
Transformation
like log
+ easy and can provide good results
- sometimes has no effect
- does not work on 0 (or negative values)
Separate model
+ can provide better results
- more time
- can provide no results at all
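A sketch of the transformation option on Fare; np.log1p, i.e. log(1 + x), is one common way around the zero-value problem mentioned above (np is already imported at the top):

traindf['Fare_log'] = np.log1p(traindf['Fare'])  # log(1 + x), defined at x = 0
traindf['Fare_log'].plot(kind='hist', bins=30, title='log-transformed Fare')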
Solutions:
Binning
transforming a numerical variable into categorical bins
Transformation
like square root
+ works on 0
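A sketch of both options on Fare: quartile-based binning into a categorical column and the square-root transform:

# binning: turn the numerical Fare into 4 categorical buckets
traindf['Fare_bin'] = pd.qcut(traindf['Fare'], q=4, labels=['low', 'mid', 'high', 'very_high'])

# square-root transform: compresses large values and works on 0
traindf['Fare_sqrt'] = np.sqrt(traindf['Fare'])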
e.g. add weather data that could affect sales or energy generation
fill missing values from another variable, combine some existing variables, apply transformations, compute ratios, parse dates
Like a boolean column per category for a categorical feature, tokenizing strings and so on
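A sketch of these manual steps, building the Title, LastName and Family_Size columns that the featuretools cell below expects, plus a one-hot example; the exact rules (e.g. Family_Size = SibSp + Parch + 1) are one common convention, not the only one:

# parse the Name column into last name and title
traindf['LastName'] = traindf['Name'].str.split(',').str[0]
traindf['Title'] = traindf['Name'].str.extract(r',\s*([^\.]+)\.', expand=False)

# combine existing variables: size of the family travelling together
traindf['Family_Size'] = traindf['SibSp'] + traindf['Parch'] + 1

# boolean column per category for a categorical feature
embarked_dummies = pd.get_dummies(traindf['Embarked'], prefix='Embarked')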
import featuretools as ft
es = ft.EntitySet(id = 'titanic_data')
es = es.entity_from_dataframe(entity_id = 'traindf', dataframe = traindf.drop(['Survived'], axis=1),
variable_types =
{
'Embarked': ft.variable_types.Categorical,
'Sex': ft.variable_types.Boolean,
'Title': ft.variable_types.Categorical,
'Family_Size': ft.variable_types.Numeric,
'LastName': ft.variable_types.Categorical
},
index = 'PassengerId')
feature_matrix, feature_defs = ft.dfs(entityset = es,
                                      target_entity = 'traindf',
                                      max_depth = 2)
# one-hot encode the categorical features produced by deep feature synthesis
feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)