“You’ll be lucky if you have a single original idea in your entire life” (my dad)
Never loop over your data in numpy / pandas
# :-( looping row by row is slow (and `is np.nan` is an unreliable test)
country_empty = []
for i in range(len(train)):
    if pd.isnull(train['feature1'][i]):
        country_empty.append(train['feature2'][i])

# :-) the vectorized one-liner does the same job
country_empty = train['feature2'][train['feature1'].isnull()]
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# You "should" drop the ID column here
y = train['status_group']

# Creating a DataFrame with train+test data
piv_train = train.shape[0]  # pivot: where train ends and test begins
df_all = pd.concat((train.drop('status_group', axis=1), test),
                   axis=0, ignore_index=True)
Gotta catch 'em all!
numerics = ['int16', 'int32', 'int64',
            'float16', 'float32', 'float64']
non_numeric_columns = df_all.select_dtypes(exclude=numerics).columns
Now you can do different things...
| column_before | column_after |
|---|---|
| foo | 0 |
| bar | 1 |
| baz | 2 |
| foo | 0 |
from sklearn.preprocessing import LabelEncoder
df_all[non_numeric_columns] = df_all[non_numeric_columns] \
    .astype(str).apply(LabelEncoder().fit_transform)

# ... or the same thing with pandas categories:
df_all[non_numeric_columns] = df_all[non_numeric_columns].astype('category') \
    .apply(lambda x: x.cat.codes)

| column_before | foo | bar | baz |
|---|---|---|---|
| foo | 1 | 0 | 0 |
| bar | 0 | 1 | 0 |
| baz | 0 | 0 | 1 |
| foo | 1 | 0 | 0 |
# One-hot encode (here: the first three non-numeric columns)
ohe_columns = non_numeric_columns[:3]
dummies = pd.concat(
    [pd.get_dummies(df_all[col], prefix=col) for col in ohe_columns],
    axis=1)
df_all.drop(ohe_columns, axis=1, inplace=True)
df_all = pd.concat((df_all, dummies), axis=1)
You can extract Day/Month/Year/Whatever...
pd.to_datetime(df_all['date_recorded']).dt.day  # vectorized: no apply, no loop
Or transform it to Timestamp
# Unix timestamp in seconds
pd.to_datetime(df_all['date_recorded']).astype('int64') // 10**9
Data might not be a number (yet)... Check out:
df_all.isnull().sum()
Replace missing values in a column with the column's mean, median or most frequent value…
from sklearn.impute import SimpleImputer  # sklearn >= 0.20 (formerly sklearn.preprocessing.Imputer)

my_strategy = 'mean'  # or 'median' or 'most_frequent'
imp = SimpleImputer(strategy=my_strategy)
df_all = pd.DataFrame(imp.fit_transform(df_all),
                      columns=df_all.columns,
                      index=df_all.index)
… Or fill with the last valid value
# Check out the other methods of fillna() ;-)
df_all.fillna(method='ffill', inplace=True)  # newer pandas: df_all.ffill(inplace=True)
There is no method here, just feelings. Plot your data for each feature to build an intuition, find correlations, ...
Think about the problem at hand and be creative!
You should always check correlations and plot distributions of your features vs your target
Correlation: Pearson vs Spearman
Spearman is based on rank => evaluates monotonic relationships
Pearson is based on linear relationship => evaluates proportional (linear) changes
You should read this link for more
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 101)
y = np.exp(x)  # then,
spearmanr(x, y)[0]  # will equal 1
pearsonr(x, y)[0]  # will be about equal to 0.25
Create a flag (0/1)
Example: whether a company has 500+ employees
Create new categories
Example: indicate the season of a date feature (see the sketch below)
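A minimal pandas sketch of both ideas (the column name num_employees is hypothetical, not from the dataset above):

df_all['big_company'] = (df_all['num_employees'] >= 500).astype(int)  # 0/1 flag
dates = pd.to_datetime(df_all['date_recorded'])
df_all['season'] = dates.dt.month % 12 // 3  # 0=winter, 1=spring, 2=summer, 3=autumn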
Some algorithms are dumb, help them.
1. Standardization = zero mean + unit variance
2. Transform your data: Sometimes a good logarithmic transformation is all you need
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
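A minimal sketch of both, assuming df_all is fully numeric by this point (np.log1p is my choice of transform here, since it handles zeros gracefully):

import numpy as np

# 1. Standardization: zero mean + unit variance, column by column
df_scaled = StandardScaler().fit_transform(df_all)

# 2. Logarithmic transform via FunctionTransformer
df_logged = FunctionTransformer(np.log1p).fit_transform(df_all)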
Sklearn's feature selection (SelectKBest, RFECV)
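For instance, a minimal SelectKBest sketch (f_classif and k=20 are arbitrary illustration choices):

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=20)  # keep the 20 best features
X_selected = selector.fit_transform(df_all[:piv_train], y)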
Sklearn's RandomizedSearchCV (tutorial)
Sklearn's GridSearchCV (tutorial)
Some libraries: Hyperopt (tutorial), BayesianOptimization (Kaggle tutorial)
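A minimal RandomizedSearchCV sketch (the estimator and parameter ranges are placeholder choices, not recommendations):

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {'n_estimators': randint(100, 500),
              'max_depth': randint(3, 12)}
search = RandomizedSearchCV(RandomForestClassifier(), param_dist,
                            n_iter=20, cv=3, n_jobs=-1)
search.fit(df_all[:piv_train], y)
print(search.best_params_, search.best_score_)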
A small script with one goal: deliver a (not so bad) prediction.
All non-numeric columns: label encoding (do not keep text features)
Date: extract each component (year, month, day, hour, ...)
NaN: imputing
Zero feature engineering
Model: XGBoost
~300-500 rounds
~8-10 depth
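A sketch of that baseline, assuming df_all has been label-encoded and imputed as above (400 rounds and depth 9 are one pick within the ranges given):

import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# Split the combined frame back into train / test rows
X_train = df_all[:piv_train]
X_test = df_all[piv_train:]

# Encode the string target into integer classes
target_le = LabelEncoder()
y_enc = target_le.fit_transform(y)

model = xgb.XGBClassifier(n_estimators=400, max_depth=9)
model.fit(X_train, y_enc)
predictions = target_le.inverse_transform(model.predict(X_test))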