sklearn-pandas

streamline your model building

Israel Saeta Pérez - PyBCN March 2016

slides.com/israelsaetaperez/sklearn-pandas

A bit of Wikipedia

pandas

  • tabular data analysis and manipulation library
  • DataFrame: table-like object modelled after R's data.frame
  • Wes McKinney, 2008, financial quantitative analysis

scikit-learn

  • "Scientific Kit" of Machine Learning algorithms
  • Classification, regression, clustering, NLP
  • GSoC 2007 project by David Cournapeau

* improved in sklearn>=0.16!


Preprocessing

scikit-learn transformers

Numerical

Standardize: mean = 0, stdev = 1

>>> from sklearn import preprocessing
>>> import numpy as np

>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])

>>> scaler = preprocessing.StandardScaler().fit(X)

>>> scaler.transform(X)                               
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
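The scaled values above come from subtracting each column's mean and dividing by its population standard deviation. A minimal NumPy sketch of that arithmetic on the same X (the variable names are illustrative, not part of scikit-learn):

```python
import numpy as np

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Column-wise mean and population standard deviation (ddof=0),
# the same convention StandardScaler uses.
mean = X.mean(axis=0)
std = X.std(axis=0)

Z = (X - mean) / std

# After scaling, every column has mean 0 and standard deviation 1.
col_means = Z.mean(axis=0)
col_stds = Z.std(axis=0)
```

Z should match the array returned by scaler.transform(X).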

Ordinal or categorical

Dummyfication

>>> from sklearn.preprocessing import OneHotEncoder

>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'float'>,
       handle_unknown='error', n_values='auto', sparse=True)

>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])
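The nine output columns are the concatenation of one indicator block per input column. A small sketch that makes the layout visible, assuming scikit-learn 0.20 or later, where the fitted encoder exposes a categories_ attribute:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One list of observed values per input column.
categories = [list(c) for c in enc.categories_]

# Output columns contributed by each input column: 2 + 3 + 4 = 9.
widths = [len(c) for c in categories]

# [0, 1, 1] sets one indicator inside each block.
row = enc.transform([[0, 1, 1]]).toarray()[0]
```

The first two entries of row encode the first feature, the next three the second, and the last four the third.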

input features must be numbers :(

Ordinal example: 1st, 2nd, 3rd

Ordinal or categorical

input features must be numbers :(

  • LabelEncoder + OneHotEncoder
  • LabelBinarizer
  • DF to dictionary + DictVectorizer
  • pandas.get_dummies
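Two of the workarounds above, sketched on a hypothetical toy column (the values are made up, not the Titanic data). The dtype=int argument is only there so both results use 0/1 integers:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy column of string categories.
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# pandas.get_dummies: one indicator column per distinct string value.
dummies = pd.get_dummies(embarked, dtype=int)

# LabelBinarizer: same idea on the scikit-learn side; classes are sorted.
lb = LabelBinarizer()
binarized = lb.fit_transform(embarked)
```

Both approaches accept strings directly, so no manual replace step is needed.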

Missing values

imputation

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer

>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))                           
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]
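In scikit-learn 0.20 and later the same imputation is available as sklearn.impute.SimpleImputer (Imputer was deprecated in 0.20 and removed in 0.22). A sketch of the equivalent call, assuming a recent scikit-learn; it always works column-wise, so there is no axis argument:

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1, 2], [np.nan, 3], [7, 6]])
new = np.array([[np.nan, 2], [6, np.nan], [7, 6]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(train)

# Learned fill values are the training column means with NaN ignored:
# (1 + 7) / 2 = 4 and (2 + 3 + 6) / 3 = 3.666...
fill_values = imp.statistics_

filled = imp.transform(new)
```

Note that the fill values come from the data passed to fit, not from the data being transformed.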

Heterogeneous data

Kaggle's Titanic

1912, Southampton - New York

Kaggle's Titanic

>>> import pandas as pd
>>> df = pd.read_csv('train.csv')
>>> df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB

Column annotations:

  • Index: PassengerId
  • Label: Survived
  • Binary: Sex
  • Categorical: Pclass, Embarked
  • Missing values: Age, Cabin, Embarked
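The non-null counts in df.info() show which columns have gaps; the same question can be asked directly. A sketch on a hypothetical three-row stand-in for train.csv (all values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame, with made-up values.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Cabin': [None, 'C85', None],
    'Embarked': ['S', 'C', None],
    'Fare': [7.25, 71.28, 8.05],
})

# Count missing entries per column, then keep only columns that have any.
missing = toy.isnull().sum()
columns_with_missing = missing[missing > 0].index.tolist()
```

On the real df, the same two lines would list the columns whose non-null count is below 891.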

Kaggle's Titanic

  • Dummify Pclass, Sex, Embarked
  • Standardize Age, SibSp, Parch, Fare
  • Impute missing values in Age and Embarked

Data transformation

from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder, StandardScaler

# copy original df
dft = df.copy()

# encode string features as numbers
dft['Sex'] = LabelEncoder().fit_transform(dft['Sex'])
dft['Embarked'] = dft['Embarked'].replace({'S': 1, 'C': 2, 'Q': 3})

# impute missing values
dft['Embarked'] = Imputer(strategy='most_frequent').fit_transform(dft[['Embarked']])
dft['Age'] = Imputer(strategy='mean').fit_transform(dft[['Age']])

# standardize continuous variables
to_standardize = ['Age', 'SibSp', 'Parch', 'Fare']
dft[to_standardize] = StandardScaler().fit_transform(dft[to_standardize])

# select input columns for the model
X = dft[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
Xt = OneHotEncoder(categorical_features=[0, 1, 6]).fit_transform(X)

Feature indexes :(
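One way to soften the hard-coded positions is to derive them from the column names. A sketch using pandas only; the empty frame is a hypothetical stand-in for X with the same columns:

```python
import pandas as pd

# Hypothetical empty frame with the same input columns as X.
columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = pd.DataFrame(columns=columns)

categorical = ['Pclass', 'Sex', 'Embarked']

# Look up each name's position instead of hard-coding [0, 1, 6].
categorical_idx = [X.columns.get_loc(name) for name in categorical]
```

This keeps the names in the code, but the indexes still have to be recomputed whenever the column selection changes.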

Prediction

>>> from sklearn import cross_validation
>>> from sklearn.linear_model import LogisticRegression
>>> y = dft['Survived'].values
>>> clf = LogisticRegression()
>>> scores = cross_validation.cross_val_score(clf, Xt, y, cv=10)
>>> print('Accuracy: {:0.3f}'.format(scores.mean()))

Accuracy: 0.800

Issues

  • We had to write a lot
  • Code is hard to read
  • No proper train/test separation in transformations
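The last issue deserves a concrete picture: fitting a transformer on the whole dataset lets test rows influence the preprocessing. A sketch on synthetic data, assuming a recent scikit-learn where cross_val_score lives in sklearn.model_selection rather than the cross_validation module used in these slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = (X[:, 0] > 5.0).astype(int)

# Leaky: the scaler sees every row, including rows later used for testing.
X_leaky = StandardScaler().fit_transform(X)

# Safer: inside a pipeline, the scaler is re-fitted on each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```

On data this simple the two approaches will score almost the same; the point is where the scaler's statistics come from.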

sklearn-pandas

sugar up your code!

bridge between Scikit-Learn’s machine learning methods and pandas-style Data Frames

sklearn-pandas

  • Original code by Ben Hamner (Kaggle CTO) and Paul Butler (Google NY) 2013
  • Last version 1.1.0, 2015-12-06

DataFrameMapper

from sklearn.preprocessing import Imputer, LabelBinarizer, OneHotEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper

dft = df.copy()
dft['Embarked'] = dft['Embarked'].replace({'S': 1, 'C': 2, 'Q': 3})

mapper = DataFrameMapper([
    ('Pclass', LabelBinarizer()),
    (['Embarked'], [Imputer(strategy='most_frequent'), OneHotEncoder()]),
    (['Age'], [Imputer(strategy='mean'), StandardScaler()]),
    (['SibSp', 'Parch', 'Fare'], StandardScaler()),
    ('Sex', LabelBinarizer()),
])

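  • 'Pclass' (a string) passes a 1-D array of shape (n_samples,)
  • ['Embarked'] (a list) passes a 2-D array of shape (n_samples, n_feats)
  • Imputer and OneHotEncoder don't accept string features :( (hence the replace on Embarked)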
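The two shapes correspond to the two ways of selecting columns in pandas, and different transformers expect different ones. A sketch with plain pandas and scikit-learn on hypothetical toy rows (no sklearn-pandas needed):

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler

# Hypothetical toy rows, not the real Titanic data.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Age': [22.0, 38.0, 26.0]})

# Selecting with a string yields a 1-D array: shape (n_samples,).
sex_1d = toy['Sex'].to_numpy()
# Selecting with a list yields a 2-D array: shape (n_samples, n_feats).
age_2d = toy[['Age']].to_numpy()

# LabelBinarizer expects the 1-D form, StandardScaler the 2-D form.
sex_encoded = LabelBinarizer().fit_transform(sex_1d)
age_scaled = StandardScaler().fit_transform(age_2d)
```

This is why the mapper above uses 'Sex' for LabelBinarizer but ['Age'] for StandardScaler.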

DataFrameMapper

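>>> from sklearn.pipeline import make_pipeline
>>> X = dft  # assumed: pass the whole DataFrame; the mapper selects its columns
>>> clf = LogisticRegression()
>>> pipe = make_pipeline(mapper, clf)
>>> scores = cross_validation.cross_val_score(pipe, X, y, cv=10)
>>> print('Accuracy: {:0.3f}'.format(scores.mean()))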

Accuracy: 0.800

can be used inside pipeline!

  • Works with most CVs with scikit-learn>=0.16.0
  • Doesn't work with CalibratedClassifierCV

Final words

Lessons learned

  • scikit-learn transformers API is not for humans yet
  • You can be a Free Software hero too!
  • But with great power...

Future work

  • Parallelized transformations (currently serial)
  • Default transformers w/o listing all columns
  • Transform also for y column
  • Give some love to sklearn transformers

Thank you! :)


Intro to sklearn-pandas, a python package to bridge scikit-learn and pandas.
