PREPROCESSING 101

Outline

  1. Golden Rules
  2. Everything should be numbers
    • Categorical Variables
    • Date
  3. NaN?
  4. Feature Engineering
  5. Hyperparameter optimization
  6. Quick and Dirty

Golden Rules

Rule #1

DON’T REINVENT THE WHEEL

  1. Always, always, always google first
  2. StackOverflow has the best answers

“You’ll be lucky if you have a single original idea in your entire life” (my dad)

Rule #2

Never loop over your data in numpy / pandas

:-(

import numpy as np

country_empty = []
for i in range(len(train)):
    if train['feature1'][i] is np.nan:
        country_empty.append(train['feature2'][i])

:-)

country_empty = train['feature2'][train['feature1'].isnull()]

Codebase

import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# You "should" drop the ID column here

y = train['status_group']
# Creating a single DataFrame with train+test data
piv_train = train.shape[0]  # remember where train ends, to split df_all back later
df_all = pd.concat((train.drop('status_group', axis=1), test), axis=0, ignore_index=True)

Everything should be numbers

Categorical Variables

Gotta catch 'em all!

numerics = ['int16', 'int32', 'int64',
            'float16', 'float32', 'float64']
non_numeric_columns = df_all.select_dtypes(exclude=numerics).columns

Now you can do different things...

1. Label Encoding

column_before   column_after
foo             0
bar             1
baz             2
foo             0

from sklearn.preprocessing import LabelEncoder

# Option 1: LabelEncoder (cast to str so NaN doesn't break the encoder)
df_all[non_numeric_columns] = df_all[non_numeric_columns] \
                                .astype(str).apply(LabelEncoder().fit_transform)

# Option 2 (alternative): pandas category codes (NaN becomes -1)
df_all[non_numeric_columns] = df_all[non_numeric_columns].astype('category') \
                                .apply(lambda x: x.cat.codes)

2. One Hot Encoding

column_before   foo  bar  baz
foo             1    0    0
bar             0    1    0
baz             0    0    1
foo             1    0    0
# One-hot encode a few (low-cardinality) columns
ohe_columns = non_numeric_columns[:3]
dummies = pd.concat(
    [pd.get_dummies(df_all[col], prefix=col) for col in ohe_columns],
    axis=1)
df_all.drop(ohe_columns, axis=1, inplace=True)
df_all = pd.concat((df_all, dummies), axis=1)

Date != Categorical Variable

You can extract Day/Month/Year/Whatever...

df_all['day_recorded'] = pd.to_datetime(df_all['date_recorded']).dt.day

Or transform it to a Unix timestamp

# Unix timestamp in seconds (datetime64 is stored as nanoseconds)
df_all['date_recorded'] = pd.to_datetime(df_all['date_recorded']).astype('int64') // 10**9

NaN?

Data might not be a number (yet)... Check out:

df_all.isnull().sum()

Imputing

Replace missing values in a column with the column's mean, median, or most frequent value…

from sklearn.impute import SimpleImputer  # formerly sklearn.preprocessing.Imputer
my_strategy = 'mean'  # or 'median' or 'most_frequent'
imp = SimpleImputer(strategy=my_strategy)
df_all = pd.DataFrame(imp.fit_transform(df_all),
                      columns=df_all.columns,
                      index=df_all.index)

fillna()

… Or fill forward with the last valid observation

# Check out the other fillna() methods ;-)
df_all.fillna(method='ffill', inplace=True)

TIPS

  1. Be smart! Don't rely on a single method to replace NaNs…
  2. Be careful: Label Encoding and One Hot Encoding can silently ‘remove’ your NaN values (see the sketch below).
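
A small illustration of tip 2 on a toy column (a sketch, not from the slides): casting to str before Label Encoding turns NaN into the literal string 'nan', which silently becomes just another category.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['foo', np.nan, 'bar'])

print(LabelEncoder().fit_transform(s.astype(str)))
# -> [1 2 0] : the NaN became the category 'nan' (code 2)

print(s.astype('category').cat.codes.values)
# -> [ 1 -1  0] : cat.codes at least flags NaN as -1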

Feature Engineering

There is no fixed method here, just intuition. Plot your data for each feature to build an intuition, look for correlations, ...

Think about the problem at hand and be creative!

A word on correlation

You should always check correlations and plot the distributions of your features against your target.

Correlation: Pearson vs Spearman

Spearman is based on rank => it evaluates monotonic relationships

Pearson is based on linear relationships => it evaluates proportional changes

You should read this link for more

x = (1:100)';                     % column vectors, as corr() expects
y = exp(x);                       % then,
corr(x, y, 'type', 'Spearman')    % will equal 1
corr(x, y, 'type', 'Pearson')     % will be about 0.25
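
The same experiment in Python (a sketch using scipy.stats, not part of the original slides):

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 101)
y = np.exp(x)

print(spearmanr(x, y)[0])  # 1.0   -- exp() is monotonic in x
print(pearsonr(x, y)[0])   # ~0.25 -- far from a linear relationship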

Create new features


  • Create a flag (0/1)

    Example: a company has 500+ employees

  • Create new categories

    Example: indicate the season of a date feature (both ideas are sketched below)
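
A minimal sketch of both ideas; 'num_employees' is a hypothetical column, 'date_recorded' comes from the codebase above:

import pandas as pd

# Flag (0/1): company has 500+ employees ('num_employees' is hypothetical)
df_all['big_company'] = (df_all['num_employees'] >= 500).astype(int)

# New category: season of a date feature (0=winter, 1=spring, 2=summer, 3=autumn)
month = pd.to_datetime(df_all['date_recorded']).dt.month
df_all['season'] = (month % 12) // 3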

Scale your data

Some algorithms are dumb, help them.

 

1. Standardization = zero mean + unit variance

   z = (x − μ) / σ

2. Transform your data: sometimes a good logarithmic transformation is all you need

from sklearn.preprocessing import StandardScaler       # standardization
from sklearn.preprocessing import FunctionTransformer  # arbitrary transforms, e.g. log
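
A minimal usage sketch of both (the column name is illustrative, not from the slides):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, FunctionTransformer

# Log-transform a skewed feature first; log1p handles zeros
# ('some_skewed_col' is a hypothetical column name)
log_tf = FunctionTransformer(np.log1p)
df_all[['some_skewed_col']] = log_tf.fit_transform(df_all[['some_skewed_col']])

# Then standardize every column: zero mean, unit variance
scaler = StandardScaler()
df_all = pd.DataFrame(scaler.fit_transform(df_all),
                      columns=df_all.columns, index=df_all.index)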

WITH ONLY ANONYMOUS NUMERICAL FEATURES

  • Create a ton of polynomial features...

    Sklearn's PolynomialFeatures

  • ... And select among them! (sketched below)

    Sklearn's feature selection (SelectKBest, RFECV)
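
A sketch of that pipeline, assuming X is the numeric training matrix and y the target (both names are illustrative):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif

# Blow up the feature space with degree-2 terms and interactions...
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# ...then keep only the 50 features best related to the target (ANOVA F-score)
X_selected = SelectKBest(f_classif, k=50).fit_transform(X_poly, y)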

Hyperparameter optimization

Random Search

Sklearn's RandomizedSearchCV (tutorial)

Grid Search

Sklearn's GridSearchCV (tutorial)

Bayesian search

Some libraries: Hyperopt (tutorial), BayesianOptimization (Kaggle tutorial)
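
A minimal random-search sketch (the model, the parameter ranges, and X/y are illustrative, not from the slides):

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {'n_estimators': randint(100, 500),
                       'max_depth': randint(3, 12)}

# Samples 20 random combinations, 3-fold cross-validation each
search = RandomizedSearchCV(RandomForestClassifier(), param_distributions,
                            n_iter=20, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)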

 

Quick & Dirty

You always need a baseline

A small script with one goal:

deliver a (not-so-bad) prediction

My Quick and Dirty

  • All non-numeric columns: Label Encoding

    (do not keep raw text features)

  • Date: extract each component (year, month, day, hour, ...)

  • NaN: imputing

  • Zero feature engineering

  • Model: XGBoost (sketched below)

    ~300-500 rounds

    ~8-10 depth
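
A baseline along those lines (a sketch assuming the encoded/imputed df_all, y and piv_train from the codebase above; hyperparameters follow the rough ranges listed):

import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# Split df_all back into its train and test parts
X_train, X_test = df_all[:piv_train], df_all[piv_train:]

# XGBoost wants numeric labels, so encode the target
le = LabelEncoder()
model = xgb.XGBClassifier(n_estimators=400, max_depth=9)
model.fit(X_train, le.fit_transform(y))

predictions = le.inverse_transform(model.predict(X_test))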
