PREPROCESSING 101

Outline

  1. Golden Rules
  2. Everything should be numbers
    • Categorical Variables
    • Date
  3. NaN?
  4. Feature Engineering
  5. Hyperparameter optimization
  6. Quick and Dirty

Golden Rules

Rule #1

DON’T REINVENT THE WHEEL

  1. Always, always, always google first
  2. StackOverflow has the best answers

“You’ll be lucky if you have a single original idea in your entire life” (my dad)

Rule #2

Never loop over your data in numpy / pandas

:-(

import numpy as np

country_empty = []
for i in range(len(train)):
    if train['feature1'][i] is np.nan:
        country_empty.append(train['feature2'][i])

:-)

country_empty = train['feature2'][train['feature1'].isnull()]

Codebase

import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# You "should" drop the ID column here

y = train['status_group']
# Creating a single DataFrame with train+test data
piv_train = train.shape[0]  # remember where train ends, to split df_all back later
df_all = pd.concat((train.drop('status_group', axis=1), test), axis=0, ignore_index=True)

Everything should be numbers

Categorical Variables

Gotta catch 'em all!

numerics = ['int16', 'int32', 'int64',
            'float16', 'float32', 'float64']
non_numeric_columns = df_all.select_dtypes(exclude=numerics).columns

Now you can do different things...

1. Label Encoding

column_before   column_after
foo             0
bar             1
baz             2
foo             0

from sklearn.preprocessing import LabelEncoder

# Option 1: LabelEncoder (cast to str so NaN doesn't break the encoder)
df_all[non_numeric_columns] = df_all[non_numeric_columns] \
                                .astype(str).apply(LabelEncoder().fit_transform)

# Option 2 (alternative): pandas category codes (NaN becomes -1)
df_all[non_numeric_columns] = df_all[non_numeric_columns].astype('category') \
                                .apply(lambda x: x.cat.codes)

2. One Hot Encoding

column_before   foo  bar  baz
foo             1    0    0
bar             0    1    0
baz             0    0    1
foo             1    0    0
# One-hot encode a few (low-cardinality) columns
ohe_columns = non_numeric_columns[:3]
dummies = pd.concat(
    [pd.get_dummies(df_all[col], prefix=col) for col in ohe_columns],
    axis=1)
df_all.drop(ohe_columns, axis=1, inplace=True)
df_all = pd.concat((df_all, dummies), axis=1)

Date != Categorical Variable

You can extract Day/Month/Year/Whatever...

df_all['day_recorded'] = pd.to_datetime(df_all['date_recorded']).dt.day

Or transform it to a Unix timestamp

# Unix timestamp in seconds (datetime64 is stored as nanoseconds)
df_all['date_recorded'] = pd.to_datetime(df_all['date_recorded']).astype('int64') // 10**9

NaN?

Data might not be a number (yet)... Check out:

df_all.isnull().sum()

Imputing

Replace missing values in a column with the column's mean, median, or most frequent value…

from sklearn.impute import SimpleImputer  # formerly sklearn.preprocessing.Imputer
my_strategy = 'mean'  # or 'median' or 'most_frequent'
imp = SimpleImputer(strategy=my_strategy)
df_all = pd.DataFrame(imp.fit_transform(df_all),
                      columns=df_all.columns,
                      index=df_all.index)

fillna()

… Or fill forward with the last valid observation

# Check out the other fillna() methods ;-)
df_all.fillna(method='ffill', inplace=True)

TIPS

  1. Be smart! Don't rely on a single method to replace NaNs…
  2. Be careful: Label Encoding and One Hot Encoding can silently ‘remove’ your NaN values (see the sketch below).
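
A small illustration of tip 2 on a toy column (a sketch, not from the slides): casting to str before Label Encoding turns NaN into the literal string 'nan', which silently becomes just another category.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['foo', np.nan, 'bar'])

print(LabelEncoder().fit_transform(s.astype(str)))
# -> [1 2 0] : the NaN became the category 'nan' (code 2)

print(s.astype('category').cat.codes.values)
# -> [ 1 -1  0] : cat.codes at least flags NaN as -1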

Feature Engineering

There is no fixed method here, just intuition. Plot your data for each feature to build an intuition, look for correlations, ...

Think about the problem at hand and be creative!

A word on correlation

You should always check correlations and plot the distributions of your features against your target.

Correlation: Pearson vs Spearman

Spearman is based on rank => it evaluates monotonic relationships

Pearson is based on linear relationships => it evaluates proportional changes

You should read this link for more

x = (1:100)';                     % column vectors, as corr() expects
y = exp(x);                       % then,
corr(x, y, 'type', 'Spearman')    % will equal 1
corr(x, y, 'type', 'Pearson')     % will be about 0.25
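
The same experiment in Python (a sketch using scipy.stats, not part of the original slides):

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 101)
y = np.exp(x)

print(spearmanr(x, y)[0])  # 1.0   -- exp() is monotonic in x
print(pearsonr(x, y)[0])   # ~0.25 -- far from a linear relationship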

Create new features


  • Create a flag (0/1)

    Example: a company has 500+ employees

  • Create new categories

    Example: indicate the season of a date feature (both ideas are sketched below)
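
A minimal sketch of both ideas; 'num_employees' is a hypothetical column, 'date_recorded' comes from the codebase above:

import pandas as pd

# Flag (0/1): company has 500+ employees ('num_employees' is hypothetical)
df_all['big_company'] = (df_all['num_employees'] >= 500).astype(int)

# New category: season of a date feature (0=winter, 1=spring, 2=summer, 3=autumn)
month = pd.to_datetime(df_all['date_recorded']).dt.month
df_all['season'] = (month % 12) // 3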

Scale your data

Some algorithms are dumb, help them.

 

1. Standardization = zero mean + unit variance

   z = (x − μ) / σ

2. Transform your data: sometimes a good logarithmic transformation is all you need

from sklearn.preprocessing import StandardScaler       # standardization
from sklearn.preprocessing import FunctionTransformer  # arbitrary transforms, e.g. log
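
A minimal usage sketch of both (the column name is illustrative, not from the slides):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, FunctionTransformer

# Log-transform a skewed feature first; log1p handles zeros
# ('some_skewed_col' is a hypothetical column name)
log_tf = FunctionTransformer(np.log1p)
df_all[['some_skewed_col']] = log_tf.fit_transform(df_all[['some_skewed_col']])

# Then standardize every column: zero mean, unit variance
scaler = StandardScaler()
df_all = pd.DataFrame(scaler.fit_transform(df_all),
                      columns=df_all.columns, index=df_all.index)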

WITH ONLY ANONYMOUS NUMERICAL FEATURES

  • Create a ton of polynomial features...

    Sklearn's PolynomialFeatures

  • ... And select among them! (sketched below)

    Sklearn's feature selection (SelectKBest, RFECV)
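
A sketch of that pipeline, assuming X is the numeric training matrix and y the target (both names are illustrative):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif

# Blow up the feature space with degree-2 terms and interactions...
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# ...then keep only the 50 features best related to the target (ANOVA F-score)
X_selected = SelectKBest(f_classif, k=50).fit_transform(X_poly, y)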

Hyperparameter optimization

Random Search

Sklearn's RandomizedSearchCV (tutorial)

Grid Search

Sklearn's GridSearchCV (tutorial)

Bayesian search

Some libraries: Hyperopt (tutorial), BayesianOptimization (Kaggle tutorial)
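
A minimal random-search sketch (the model, the parameter ranges, and X/y are illustrative, not from the slides):

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {'n_estimators': randint(100, 500),
                       'max_depth': randint(3, 12)}

# Samples 20 random combinations, 3-fold cross-validation each
search = RandomizedSearchCV(RandomForestClassifier(), param_distributions,
                            n_iter=20, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)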

 

Quick & Dirty

You always need a baseline

A small script with one goal:

deliver a (not-so-bad) prediction

My Quick and Dirty

  • All non-numeric columns: Label Encoding

    (do not keep raw text features)

  • Date: extract each component (year, month, day, hour, ...)

  • NaN: imputing

  • Zero feature engineering

  • Model: XGBoost (sketched below)

    ~300-500 rounds

    ~8-10 depth
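
A baseline along those lines (a sketch assuming the encoded/imputed df_all, y and piv_train from the codebase above; hyperparameters follow the rough ranges listed):

import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# Split df_all back into its train and test parts
X_train, X_test = df_all[:piv_train], df_all[piv_train:]

# XGBoost wants numeric labels, so encode the target
le = LabelEncoder()
model = xgb.XGBClassifier(n_estimators=400, max_depth=9)
model.fit(X_train, le.fit_transform(y))

predictions = le.inverse_transform(model.predict(X_test))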
