PREPROCESSING 101
Outline
- Golden Rules
- Everything should be numbers
- Categorical Variables
- Date
- NaN?
- Feature Engineering
- Hyperparameter optimization
- Quick and Dirty
Golden Rules
Rule #1
DON’T REINVENT THE WHEEL
- Always, always, always google first
- StackOverflow has the best answers
“You’ll be lucky if you have a single original idea in your entire life” (my dad)
Rule #2
Never loop over your data in numpy / pandas
# :-(
country_empty = []
for i in range(0, len(train)):
    if train['feature1'][i] is np.nan:
        country_empty.append(train['feature2'][i])

# :-)
country_empty = train['feature2'][train['feature1'].isnull()]
Codebase
import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# You "should" drop the ID column there
y = train['status_group']
# Creating a DataFrame with train+test data
piv_train = train.shape[0]
df_all = pd.concat((train.drop('status_group', axis=1), test), axis=0, ignore_index=True)
Everything should be numbers
Categorical Variables
Gotta catch 'em all!
numerics = ['int16', 'int32', 'int64',
            'float16', 'float32', 'float64']
non_numeric_columns = df_all.select_dtypes(exclude=numerics).columns
Now you can do different things...
1. Label Encoding
| column_before | column_after |
|---|---|
| foo | 0 |
| bar | 1 |
| baz | 2 |
| foo | 0 |
from sklearn.preprocessing import LabelEncoder
df_all[non_numeric_columns] = df_all[non_numeric_columns] \
    .astype(str).apply(LabelEncoder().fit_transform)
# ... or, equivalently, with pandas categorical codes
df_all[non_numeric_columns] = df_all[non_numeric_columns].astype('category') \
    .apply(lambda x: x.cat.codes)
2. One Hot Encoding
| column_before | foo | bar | baz |
|---|---|---|---|
| foo | 1 | 0 | 0 |
| bar | 0 | 1 | 0 |
| baz | 0 | 0 | 1 |
| foo | 1 | 0 | 0 |
# One-hot encode only a few columns (here: the first three non-numeric ones)
ohe_columns = non_numeric_columns[:3]
dummies = pd.concat(
    [pd.get_dummies(df_all[col], prefix=col) for col in ohe_columns],
    axis=1)
df_all.drop(ohe_columns, axis=1, inplace=True)
df_all = pd.concat((df_all, dummies), axis=1)
Date != Categorical Variable
You can extract Day/Month/Year/Whatever...
pd.to_datetime(df_all['date_recorded']).dt.day
Or transform it to Timestamp
# Unix timestamp in seconds
pd.to_datetime(df_all['date_recorded']).astype('int64') // 10**9
NaN?
Data might not be a number (yet)... Check out:
df_all.isnull().sum()
Imputing
Replace each missing value in a column with the column’s mean, median or most frequent value…
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer in older scikit-learn
my_strategy = 'mean'  # or 'median' or 'most_frequent'
imp = SimpleImputer(strategy=my_strategy)
df_all = pd.DataFrame(imp.fit_transform(df_all),
                      columns=df_all.columns,
                      index=df_all.index)
Fillna()
… Or forward-fill with the last valid observation
# Check out the other fill methods of fillna() ;-)
df_all.fillna(method='ffill', inplace=True)
TIPS
- Be smart! Don’t rely on a single method to replace NaNs…
- Be careful: Label Encoding and One Hot Encoding can silently ‘remove’ your NaN values (astype(str) turns NaN into the string 'nan', and get_dummies drops NaN rows by default).
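A tiny illustration of both pitfalls on a toy Series (the values are made up for the example):
import numpy as np
import pandas as pd

s = pd.Series(['foo', np.nan, 'bar'])

# One Hot Encoding: the NaN row just gets all-zero dummies
print(pd.get_dummies(s))

# Label Encoding after astype(str): NaN becomes the category 'nan'
print(s.astype(str).unique())  # ['foo' 'nan' 'bar']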
Feature Engineering
Feature Engineering
There is no method here, just feelings. Plot each feature to build an intuition about your data, look for correlations, ...
Think about the problem at hand and be creative !
A word on correlation
You should always check correlations and plot distributions of your features vs your target
Correlation : Pearson vs Spearman
Spearman is based on ranks => it captures monotonic relationships
Pearson is based on a linear fit => it captures proportional (linear) changes
You should read this link for more
x = (1:100)';
y = exp(x);  % then,
corr(x,y,'type','Spearman')  % will equal 1
corr(x,y,'type','Pearson')   % will be about equal to 0.25
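The same toy example in pandas, to keep everything in one language (a quick sketch):
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 101))
y = np.exp(x)  # monotonic, but far from linear

print(x.corr(y, method='spearman'))  # 1.0
print(x.corr(y, method='pearson'))   # roughly 0.25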
Create new features
- Create a flag (0/1). Example: a company has 500+ employees
- Create new categories. Example: indicate the season of a date feature (see the sketch below)
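For instance, something along these lines (the column 'num_employees' is hypothetical; 'date_recorded' comes from the earlier slides):
import pandas as pd

# Flag (0/1): company has 500+ employees
df_all['big_company'] = (df_all['num_employees'] >= 500).astype(int)

# New category: season of a date feature (Northern Hemisphere)
month = pd.to_datetime(df_all['date_recorded']).dt.month
df_all['season'] = month.map({12: 'winter', 1: 'winter', 2: 'winter',
                              3: 'spring', 4: 'spring', 5: 'spring',
                              6: 'summer', 7: 'summer', 8: 'summer',
                              9: 'autumn', 10: 'autumn', 11: 'autumn'})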
Scale your data
Some algorithms are dumb, help them.
1. Standardization = zero mean + unit variance
2. Transform your data: Sometimes a good logarithmic transformation is all you need
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
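Building on those two imports, a rough sketch of both options (they are independent of each other; 'amount_tsh' is just an example of a skewed, non-negative column):
import numpy as np

# 1. Standardization: zero mean, unit variance for every column
scaled = StandardScaler().fit_transform(df_all)

# 2. Log transform: np.log1p handles zeros gracefully
log_tf = FunctionTransformer(np.log1p, validate=False)
logged = log_tf.fit_transform(df_all[['amount_tsh']]).ravel()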
WITH ONLY ANONYMOUS NUMERICAL FEATURES
- Create a ton of polynomial features...
- ... And select among them !
Sklearn's feature selection (SelectKBest, RFECV)
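Roughly like this (the degree and k are arbitrary; y and piv_train come from the codebase slide):
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures

# Blow up the feature space with degree-2 terms and interactions...
X_poly = PolynomialFeatures(degree=2).fit_transform(df_all[:piv_train])

# ... then keep only the 50 columns most related to the target
X_selected = SelectKBest(score_func=f_classif, k=50).fit_transform(X_poly, y)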
Hyperparameter optimization
Hyperparameter optimization
Random Search
Sklearn's RandomizedSearchCV (tutorial)
Grid Search
Sklearn's GridSearchCV (tutorial)
Bayesian search
Some libraries: Hyperopt (tutorial), BayesianOptimization (Kaggle tutorial)
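As one possible sketch, a random search over a couple of placeholder hyperparameters (the model and the ranges are arbitrary):
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {'n_estimators': randint(100, 500),
                       'max_depth': randint(3, 12)}

search = RandomizedSearchCV(RandomForestClassifier(),
                            param_distributions,
                            n_iter=20, cv=3, n_jobs=-1)
search.fit(df_all[:piv_train], y)
print(search.best_params_, search.best_score_)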
Quick & Dirty
You always need a baseline
A small script with one goal:
deliver a (not so bad) prediction
My Quick and Dirty
- All non-numeric columns: Label Encoder (do not keep text features)
- Date: extract each component (year, month, day, hour, ...)
- NaN: imputing
- 0 feature engineering
- Model: XGBoost, ~300-500 rounds, depth ~8-10 (see the sketch below)
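Put together, the baseline might look something like this rough sketch (ballpark hyperparameters, assuming the preprocessing from the previous slides has already been applied to df_all):
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

# Split df_all back into the original train / test rows
X_train = df_all[:piv_train]
X_test = df_all[piv_train:]
y_enc = LabelEncoder().fit_transform(y)  # XGBoost wants integer classes

model = xgb.XGBClassifier(n_estimators=400, max_depth=8)
print(cross_val_score(model, X_train, y_enc, cv=3).mean())  # quick sanity check

model.fit(X_train, y_enc)
predictions = model.predict(X_test)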
Preprocessing 101
By Yann Carbonne