sklearn-pandas
streamline your model building
Israel Saeta Pérez - PyBCN March 2016
slides.com/israelsaetaperez/sklearn-pandas
A bit of Wikipedia
pandas
- tabular data analysis and manipulation library
- DataFrame: table-like object modelled after R's data.frame
- Created by Wes McKinney in 2008 for financial quantitative analysis
scikit-learn
- "Scientific Kit" of Machine Learning algorithms
- Classification, regression, clustering, NLP
- GSoC 2007 project by David Cournapeau
Preprocessing
scikit-learn transformers
Numerical
Standardize: mean = 0, stdev = 1
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler.transform(X)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
Ordinal or categorical
Dummification
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'float'>,
              handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])
input features must be numbers :(
(the 9 output columns group by input feature: 1st feature → 2 columns, 2nd → 3, 3rd → 4)
Ordinal or categorical
input features must be numbers :(
- LabelEncoder + OneHotEncoder
- LabelBinarizer
- DF to dictionary + DictVectorizer
- pandas.get_dummies
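For illustration, a minimal sketch of two of these options on a made-up colors column (exact dtypes and printing vary across pandas versions):

>>> import pandas as pd
>>> from sklearn.preprocessing import LabelBinarizer
>>> colors = pd.Series(['red', 'green', 'red', 'blue'])
>>> pd.get_dummies(colors)                    # works directly on strings
   blue  green  red
0     0      0    1
1     0      1    0
2     0      0    1
3     1      0    0
>>> LabelBinarizer().fit_transform(colors)    # one binary column per class
array([[0, 0, 1],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])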
Missing values
imputation
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]
Heterogeneous data
Kaggle's Titanic
1912, Southampton - New York
Kaggle's Titanic
>>> df = pd.read_csv('train.csv')
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
Column annotations:
- PassengerId: index
- Survived: binary label
- Pclass, Sex, Embarked: categorical
- Age, Cabin, Embarked: missing values
Kaggle's Titanic
- Dummify Pclass, Sex, Embarked
- Standardize Age, SibSp, Parch, Fare
- Impute missing values in Age and Embarked
Data transformation
# needed imports
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, Imputer, StandardScaler

# copy the original df so the transformations don't clobber it
dft = df.copy()
# encode string features as numbers
dft['Sex'] = LabelEncoder().fit_transform(dft['Sex'])
dft['Embarked'] = dft['Embarked'].replace({'S': 1, 'C': 2, 'Q': 3})
# impute missing values (ravel back to 1-D for column assignment)
dft['Embarked'] = Imputer(strategy='most_frequent').fit_transform(dft[['Embarked']]).ravel()
dft['Age'] = Imputer(strategy='mean').fit_transform(dft[['Age']]).ravel()
# standardize continuous variables
to_standardize = ['Age', 'SibSp', 'Parch', 'Fare']
dft[to_standardize] = StandardScaler().fit_transform(dft[to_standardize])
# select input columns for the model
X = dft[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
Xt = OneHotEncoder(categorical_features=[0, 1, 6]).fit_transform(X)
Feature indexes :(
Prediction
>>> from sklearn import cross_validation
>>> from sklearn.linear_model import LogisticRegression
>>> y = dft['Survived'].values
>>> clf = LogisticRegression()
>>> scores = cross_validation.cross_val_score(clf, Xt, y, cv=10)
>>> print('Accuracy: {:0.3f}'.format(scores.mean()))
Accuracy: 0.800
Issues
- We had to write a lot of boilerplate
- The resulting code is hard to read
- No proper train/test separation: every transformer is fit on the full dataset, so test data leaks into the training statistics
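A minimal sketch of the leak-free pattern, reusing the X from before: fit each transformer on the training part only, then apply it to the test part (wrapping transformers in a Pipeline gives you this for free during cross-validation):

>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
>>> scaler = StandardScaler().fit(X_train)   # statistics come from train only
>>> X_train_t = scaler.transform(X_train)
>>> X_test_t = scaler.transform(X_test)      # test data never touches fit()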
sklearn-pandas
sugar up your code!
bridge between Scikit-Learn’s machine learning methods and pandas-style Data Frames
sklearn-pandas
- Original code by Ben Hamner (Kaggle CTO) and Paul Butler (Google NY) 2013
- Latest release: 1.1.0 (2015-12-06)
DataFrameMapper
from sklearn_pandas import DataFrameMapper

dft = df.copy()
# Embarked must be numeric before Imputer/OneHotEncoder can handle it
dft['Embarked'] = dft['Embarked'].replace({'S': 1, 'C': 2, 'Q': 3})
mapper = DataFrameMapper([
('Pclass', LabelBinarizer()),
(['Embarked'], [Imputer(strategy='most_frequent'), OneHotEncoder()]),
(['Age'], [Imputer(strategy='mean'), StandardScaler()]),
(['SibSp', 'Parch', 'Fare'], StandardScaler()),
('Sex', LabelBinarizer())
])
- a string selector ('Pclass') feeds the transformer a 1-D array of shape (n_samples,)
- a list selector (['Age']) feeds it a 2-D array of shape (n_samples, n_feats)
- Imputer and OneHotEncoder don't accept string features :( hence the Embarked replace above
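Applying the mapper is then a one-liner. With the mapper above we'd expect 11 output features (3 Pclass dummies + 3 Embarked dummies + Age + SibSp/Parch/Fare + 1 binary Sex column); a sketch:

>>> Xt = mapper.fit_transform(dft)   # plain numpy array, ready for any estimator
>>> Xt.shape
(891, 11)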
DataFrameMapper
>>> from sklearn.pipeline import make_pipeline
>>> X, y = dft, dft['Survived'].values
>>> clf = LogisticRegression()
>>> pipe = make_pipeline(mapper, clf)
>>> scores = cross_validation.cross_val_score(pipe, X, y, cv=10)
>>> print('Accuracy: {:0.3f}'.format(scores.mean()))
Accuracy: 0.800
can be used inside pipeline!
- Works with most cross-validation utilities on scikit-learn>=0.16.0 (they accept DataFrames)
- Doesn't work with CalibratedClassifierCV
Final words
Lessons learned
- scikit-learn transformers API is not for humans yet
- You can be a Free Software hero too!
- But with great power...
Future work
- Parallelized transformations (currently serial)
- Default transformers w/o listing all columns
- Transform also for y column
- Give some love to sklearn transformers
Thank you! :)