Random Forests in Python
Ivo Flipse (@ivoflipse5)
Amsterdam Python Meetup, June 2014
Background
- Human Movement Scientist
- Specialized in gait analysis and pressure measurements
- Pythonista, Matlab-survivor and reluctant R-user
Normally pressure measurements are done on humans
But I needed a version that worked on animals too
So I created Pawlabeling, see GitHub: https://www.github.com/ivoflipse/pawlabeling
Scientific Python stack
- PySide (GUI)
- NumPy (array operations)
- SciPy (scientific calculations)
- PyTables (storage)
- OpenCV (computer vision)
- Matplotlib (plotting)
- Scikit-Learn (machine-learning)
For installing I recommend Anaconda (Continuum Analytics) or Canopy (Enthought).
Challenging problems
- Detection and tracking of paws
- Labeling the paws
- Detecting toes
- Image registration
Focus of the talk
Labeling the paws:
Given a paw, predict its label (LF, LH, RF, RH)
So how can we solve this?
Machine Learning in Python
Gaël Varoquaux (scikit-learn developer):
Machine Learning is about building programs with tunable parameters that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
Today we'll focus on Supervised Learning:
Supervised learning consists of learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called target or labels. Most often, y is a 1D array of length n_samples.
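As a minimal sketch of that pattern (the data here is made up for illustration), every scikit-learn estimator exposes the same fit/predict interface:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: four samples with two features each, and one label per sample
X = np.array([[0.0, 1.2], [1.1, 0.9], [5.0, 4.8], [4.9, 5.2]])
y = np.array([0, 0, 1, 1])

clf = RandomForestClassifier()
clf.fit(X, y)                     # learn the link between X and y
print(clf.predict([[0.5, 1.0]]))  # predict the label of an unseen sample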
Think of it as finding a line that separates a set of black points from white points
Or guessing the digit shown in a small image:
Random Forests
Why use Random Forests:
- Easy to use
- Used for regression (predicting values)
- Used for classification (predicting categories)
- Gives you good scores on (entry-level) Kaggle competitions
- I just happen to like them :-)
Decision Tree
Split the data on the feature that gives the most information gain (lowest entropy)
Source: Citizennet.com | Microsoft Research
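To make that concrete, a small illustrative sketch (these helper names are mine, not scikit-learn's) of entropy and the information gain of a split:
import numpy as np
from collections import Counter

def entropy(labels):
    # Shannon entropy of a set of class labels
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Entropy reduction achieved by splitting `parent` into `left` and `right`
    n = float(len(parent))
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

labels = ["LF", "LF", "RF", "RF"]
print(information_gain(labels, labels[:2], labels[2:]))  # perfect split: gain 1.0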
Random Forests
A Random Forest combines a 'forest' of decision trees, each trained on a random subset of the training data
Source: Citizennet.com | Microsoft Research
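In spirit (an illustrative sketch, not scikit-learn's actual implementation), each tree gets a bootstrap sample of the data and the forest takes a majority vote:
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=10):
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw n rows with replacement
        idx = np.random.randint(0, len(X), len(X))
        tree = DecisionTreeClassifier(max_features="sqrt")
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, sample):
    # Majority vote over the individual trees' predictions
    votes = [tree.predict([sample])[0] for tree in trees]
    return Counter(votes).most_common(1)[0][0]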
Random Forests in Action
Most machine learning algorithms implemented in scikit-learn expect a numpy array as input X. The expected shape of X is (n_samples, n_features).
The number of features must be fixed in advance. However it can be very high dimensional (e.g. millions of features) with most of them being zeros for a given sample.
Source: Scikit-Learn Tutorial
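For instance, a very high-dimensional but mostly-zero X is typically stored as a sparse matrix (a sketch; note that not every estimator accepts sparse input):
import numpy as np
from scipy.sparse import csr_matrix

# Three samples with a million features each, only three values non-zero
rows = np.array([0, 1, 2])
cols_idx = np.array([42, 7, 999999])
data = np.array([1.0, 3.5, 2.0])
X_sparse = csr_matrix((data, (rows, cols_idx)), shape=(3, 1000000))
print(X_sparse.shape)  # (3, 1000000), but only three values are stored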
import pandas as pd
# Let's load the data from some CSV files I prepared:
X_train = pd.read_csv("classify X_train.csv", index_col=0)
X_test = pd.read_csv("classify X_test.csv", index_col=0)
print(X_train.columns)
n_samples, n_features = X_train.shape
print(n_samples, n_features)
Index([u'subject_id', u'weight', u'measurement_id', u'contact_id', u'max_force', u'max_surface', u'max_duration', u'-f2', u'-s2', u'-m2', u'-x2', u'-y2', u'-z2', u'-f1', u'-s1', u'-m1', u'-x1', u'-y1', u'-z1', u'f1', u's1', u'm1', u'x1', u'y1', u'z1', u'f2', u's2', u'm2', u'x2', u'y2', u'z2', u'label'], dtype='object')
486 32
The features
cols = ['max_force', 'max_surface', 'max_duration',
'-f2', '-s2', '-m2', '-x2', '-y2', '-z2',
'-f1', '-s1', '-m1', '-x1', '-y1', '-z1',
'f1', 's1', 'm1', 'x1', 'y1', 'z1',
'f2', 's2', 'm2', 'x2', 'y2', 'z2']
X_train.head(n=5)
|   | subject_id | weight | measurement_id | contact_id | max_force | max_surface | max_duration | -f2 | -s2 | -m2 | -x2 | -y2 | -z2 | -f1 | -s1 | -m1 | -x1 | -y1 | -z1 | f1 | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | subject_30 | 7 | sel_3 - 3-4-2010 - Entire Plate Roll Off | contact_0 | 77.5 | 19.354800 | 24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 69.200000 | ... |
| 1 | subject_30 | 7 | sel_3 - 3-4-2010 - Entire Plate Roll Off | contact_1 | 69.2 | 19.354800 | 23 | NaN | NaN | NaN | NaN | NaN | NaN | 77.500000 | 19.354800 | 24 | -30.0 | 14.5 | -22 | 22.800001 | ... |
| 2 | subject_30 | 7 | sel_3 - 3-4-2010 - Entire Plate Roll Off | contact_3 | 81.5 | 18.580608 | 23 | 69.200000 | 19.354800 | 23 | -29.5 | -20.0 | -21 | 22.800001 | 4.645152 | 14 | -60.5 | 3.5 | -12 | 43.700000 | ... |
| 3 | subject_30 | 7 | sel_3 - 3-4-2010 - Entire Plate Roll Off | contact_4 | 43.7 | 10.064496 | 18 | 22.800001 | 4.645152 | 14 | -21.5 | 20.5 | -17 | 81.500000 | 18.580608 | 23 | 39.0 | 17.0 | -5 | 80.800000 | ... |
| 4 | subject_30 | 7 | sel_3 - 3-4-2010 - Entire Plate Roll Off | contact_5 | 80.8 | 19.354800 | 21 | 81.500000 | 18.580608 | 23 | -29.5 | 17.0 | -24 | 43.700000 | 10.064496 | 18 | -68.5 | 0.0 | -19 | 61.200000 | ... |

5 rows × 32 columns
Random Forests in Action
Let's set up the Random Forest classifier and try it on our data
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train[cols].dropna(),
X_train[cols+["label"]].dropna()["label"])
score = rf.score(X_test[cols].dropna(),
X_test[cols+["label"]].dropna()["label"])
print "Score: {:.2f}".format(score)
Score: 0.94
# Great performance, but there's a catch:
n_samples, n_features = X_test.shape
n_samples_dropna, n_features_dropna = X_test.dropna().shape
print "Number of samples before dropna: {}, after dropna: {}".format(
n_samples, n_samples_dropna)
print "Missing due to NaN: {}".format(n_samples-n_samples_dropna)
Number of samples before dropna: 3236, after dropna: 2359
Missing due to NaN: 877
Ways to improve performance
Several things we can do to improve our results
- Imputation of missing values
- Normalize or standardize values
- Cross validation
- Grid Search hyperparameters
- Pipeline: chaining estimators
- Model Evaluation
Imputation of missing values
My example data has a lot of samples with NaN features, rendering those samples unusable
Solution: impute or infer them from the known part of the data
Source: Scikit-Learn docs
from sklearn.preprocessing import Imputer
# Create the imputer, which will replace the NaN's with the mean along the columns
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
X_train2 = []
for subject_id, subject_group in X_train.groupby("subject_id"):
# Impute the missing values
subject_group[cols] = imputer.fit_transform(subject_group[cols])
X_train2.append(subject_group)
# Create a data frame again
X_train2 = pd.concat(X_train2)
X_test2 = []
for subject_id, subject_group in X_test.groupby("subject_id"):
subject_group[cols] = imputer.fit_transform(subject_group[cols])
X_test2.append(subject_group)
X_test2 = pd.concat(X_test2)
# Let's see how many samples we have left now
n_samples, n_features = X_test.shape
n_samples_dropna, n_features_dropna = X_test.dropna().shape
n_samples_imputed, n_features_imputed = X_test2.dropna().shape
print("Number of samples before dropna: {}, after dropna: {}, "),
print("after imputing: {}".format(n_samples, n_samples_dropna, n_samples_imputed))
Number of samples before dropna: 3236, after dropna: 2359, after imputing: 3236
rf = RandomForestClassifier()
rf.fit(X_train2[cols], X_train2["label"])
score_imputed = rf.score(X_test2[cols], X_test2["label"])
print "Score after imputation: {:.2f}".format(score_imputed)
Score after imputation: 0.90
So we got a slightly worse score, but it labeled an additional 877 samples
These samples are also likely to be harder to predict, given they lack information about previous or next paws
A potential improvement would be to use regression to predict the missing value
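A sketch of what that could look like (the helper and the chosen columns are illustrative, not part of the talk's code): fit a regressor on the rows where a feature is known and predict it for the rows where it is missing:
from sklearn.ensemble import RandomForestRegressor

def impute_with_regression(df, target_col, feature_cols):
    # Fit on rows where target_col is known, predict where it is missing
    known = df[df[target_col].notnull()]
    missing = df[df[target_col].isnull()]
    if len(missing) == 0:
        return df
    reg = RandomForestRegressor()
    reg.fit(known[feature_cols], known[target_col])
    df.loc[missing.index, target_col] = reg.predict(missing[feature_cols])
    return df

# e.g. predict the previous paw's force from the contact's own features
X_train_reg = impute_with_regression(X_train.copy(), "-f1",
                                     ["max_force", "max_surface", "max_duration"])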
Normalize or standardize values
Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit: they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
This isn't necessary for Random Forests though (see discussion here)
Source: Scikit-Learn docs
from sklearn import preprocessing
X_train3 = []
for subject_id, subject_group in X_train.groupby("subject_id"):
# First impute the missing values
subject_group[cols] = imputer.fit_transform(subject_group[cols])
# Then scale the values
subject_group[cols] = preprocessing.scale(subject_group[cols])
X_train3.append(subject_group)
X_train3 = pd.concat(X_train3)
X_test3 = []
for subject_id, subject_group in X_test.groupby("subject_id"):
subject_group[cols] = imputer.fit_transform(subject_group[cols])
subject_group[cols] = preprocessing.scale(subject_group[cols])
X_test3.append(subject_group)
X_test3 = pd.concat(X_test3)
rf = RandomForestClassifier()
rf.fit(X_train3[cols], X_train3["label"])
score_scaled = rf.score(X_test3[cols], X_test3["label"])
print "Score after scaling and imputation: {:.2f}".format(score_scaled)
Score after scaling and imputation: 0.93
Cross Validation
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.
To avoid over-fitting, we have to define two different sets: a training set X_train, y_train which is used for learning the parameters of a predictive model, and a testing set X_test, y_test which is used for evaluating the fitted predictive model.
This is a bit harder on my data as the samples are not independent
See this IPython Notebook for how this affects the results
Source: Scikit-Learn docs
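For independent samples a single random split is enough; a minimal sketch with scikit-learn's train_test_split helper:
from sklearn.cross_validation import train_test_split

x_tr, x_te, y_tr, y_te = train_test_split(
    X_train2[cols].values, X_train2["label"].values,
    test_size=0.25, random_state=42)
Because our paws come from a small set of dogs, we instead split the folds by dog: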
import numpy as np

dognames = np.array(['subject_30', 'subject_32', 'subject_33', 'subject_34', 'subject_35', 'subject_36',
'subject_16', 'subject_17', 'subject_14', 'subject_15', 'subject_12', 'subject_13',
'subject_10', 'subject_18', 'subject_19', 'subject_8', 'subject_9', 'subject_4',
'subject_5', 'subject_6', 'subject_7', 'subject_0', 'subject_2', 'subject_29',
'subject_28', 'subject_27', 'subject_26', 'subject_25', 'subject_24', 'subject_23',
'subject_22', 'subject_21', 'subject_20'])
# Little helper function to split the dataset
def split_data_set(X, train_dogs):
x_train = []
x_test = []
for subject_id, subject_group in X.groupby("subject_id"):
if subject_id in train_dogs:
x_train.append(subject_group)
else:
x_test.append(subject_group)
x_train = pd.concat(x_train)
x_test = pd.concat(x_test)
return x_train, x_test
from sklearn import cross_validation
k_fold = cross_validation.KFold(n=len(dognames), n_folds=5)
X = pd.concat([X_train2, X_test2])
scores_cv = []
for index, (train_indices, test_indices) in enumerate(k_fold):
train_dogs = dognames[train_indices]
test_dogs = dognames[test_indices]
# Little helper function to split the dataset
x_train, x_test = split_data_set(X, train_dogs)
# Now fit the classifier
rf = RandomForestClassifier()
rf.fit(x_train[cols], x_train["label"])
score_cv = rf.score(x_test[cols], x_test["label"])
scores_cv.append(score_cv)
print("fold-{: <2} {:.2f}\t".format(index, score_cv))
print("")
print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores_cv), np.std(scores_cv)))
fold-0  0.95
fold-1  0.99
fold-2  0.98
fold-3  0.87
fold-4  0.97

Average: 0.95 +/- 0.04
# On most other data sets this can be done a lot simpler
from sklearn import cross_validation
k_fold = cross_validation.KFold(n=len(X), n_folds=5)
# Scikit-Learn comes with a useful helper function to
# compute the cross validation score
score_cv = cross_validation.cross_val_score(rf, X[cols], X["label"], cv=k_fold)
print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores_cv), np.std(scores_cv)))
Average: 0.95 +/- 0.04
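For grouped data like ours there is also a built-in scheme; a sketch using LeaveOneLabelOut, which holds out all samples of one label (here: one dog) per fold:
from sklearn.cross_validation import LeaveOneLabelOut

# One fold per dog: train on all other dogs, test on the held-out dog
lolo = LeaveOneLabelOut(X["subject_id"].values)
scores_lolo = cross_validation.cross_val_score(
    rf, X[cols].values, X["label"].values, cv=lolo)
print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores_lolo), np.std(scores_lolo)))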
Grid Search hyperparameters
A grid search consists of:
- an estimator (regressor or classifier such as sklearn.svm.SVC());
- a parameter space;
- a method for searching or sampling candidates;
- a cross-validation scheme; and
- a score function.
Source: Scikit-Learn docs
from operator import itemgetter
from time import time
def report(grid_scores, n_top=3):
top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
for i, score in enumerate(top_scores):
print("Model with rank: {0}".format(i + 1))
print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
score.mean_validation_score,
np.std(score.cv_validation_scores)))
print("Parameters: {0}".format(score.parameters))
print("")
from sklearn.grid_search import GridSearchCV
# Use a full grid over all parameters
param_grid = {"max_depth": [3, None],
"max_features": [1, 3, 10],
"min_samples_split": [1, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
clf = RandomForestClassifier(n_estimators=20)
# Run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X[cols], X["label"])
print("GridSearchCV took {:.2f} seconds for {} candidate parameter settings.".format(
time() - start, len(grid_search.grid_scores_)))
print("")
# Helper function to pretty print the results
report(grid_search.grid_scores_)
GridSearchCV took 1.22 seconds for 4 candidate parameter settings.

Model with rank: 1
Mean validation score: 0.961 (std: 0.027)
Parameters: {'criterion': 'gini', 'max_depth': None}

Model with rank: 2
Mean validation score: 0.959 (std: 0.023)
Parameters: {'criterion': 'entropy', 'max_depth': None}

Model with rank: 3
Mean validation score: 0.942 (std: 0.027)
Parameters: {'criterion': 'gini', 'max_depth': 3}
rf = RandomForestClassifier(bootstrap=False, min_samples_leaf=1,
min_samples_split=10, criterion='entropy',
max_features=3, max_depth=None)
# Let's try it on one of our last train/test splits
rf.fit(x_train[cols], x_train["label"])
score_grid = rf.score(x_test[cols], x_test["label"])
print("Score after grid search: {:.2f}".format(score_grid))
Score after grid search: 0.97
Pipeline
Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves two purposes here:
- Convenience: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
- Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.
Source: Scikit-Learn docs
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation
# Create X again from the raw data
X = pd.concat([X_train, X_test])
pipeline = Pipeline([
("imputer", Imputer(missing_values="NaN",
strategy="mean", axis=0)),
("forest", RandomForestClassifier(n_estimators=20))])
# Note the difference in the params,
# you have to use double underscore notation
# See: http://stackoverflow.com/questions/16437022/how-to-tune-parameters-of-nested-pipelines-by-gridsearchcv-in-scikit-learn
param_grid = {
"forest__min_samples_split": [1, 3, 10],
"forest__min_samples_leaf": [1, 3, 10],
"forest__bootstrap": [True, False],
"forest__criterion": ["gini", "entropy"]
}
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X[cols], X["label"])
print("Best score: {:0.3f}".format(grid_search.best_score_))
Best score: 0.967
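Once fitted, GridSearchCV keeps the winning pipeline (refitted on all the data by default), so imputation and classification happen in a single predict call:
best_pipeline = grid_search.best_estimator_
print(grid_search.best_params_)
predictions = best_pipeline.predict(X[cols])  # impute, then classify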
Model evaluation
We can evaluate the quality of the predictions of a model
For example with metrics such as accuracy, precision or the F1-score
Source: Scikit-Learn docs
%matplotlib inline
from sklearn.metrics import confusion_matrix
y_pred = rf.predict(X_test2[cols])
# Compute confusion matrix
cm = confusion_matrix(X_test2["label"], y_pred)
print(cm)
import matplotlib.pyplot as plt
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
[[843   0   6   0]
 [  1 725   1   8]
 [ 13   1 862   0]
 [  0   4   0 772]]
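For the per-class precision, recall and F1-score mentioned above, scikit-learn's classification_report gives a quick summary (a sketch reusing the predictions we just made):
from sklearn.metrics import classification_report

# Precision, recall and F1-score per paw label
print(classification_report(X_test2["label"], y_pred))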
Visualize the Random Forest
Finally, we can export the tree and visualize its structure
import StringIO
import pydot
from sklearn import tree
from IPython.core.display import Image
# Visualize tree
dot_data = StringIO.StringIO()
tree.export_graphviz(rf.estimators_[0], out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_png("./images/random_network.png")
print("Feature")
for col, importance in zip(cols, rf.feature_importances_):
if importance > 0.05:
print "{:>3}:\t{:.4f}".format(col, importance)
Feature
-y2:	0.0913
-x1:	0.1633
-y1:	0.1220
 x1:	0.1476
 y1:	0.1387
 y2:	0.1490
import matplotlib.pyplot as plt
# Example from: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
fig, ax = plt.subplots(figsize=(12,6))
fig.suptitle("Feature importances")
ax.bar(range(len(importances)), importances[indices], color="r", yerr=std[indices], align="center")
ax.set_xticks(range(len(importances)))
cols = np.array(cols)
ax.set_xticklabels(cols[indices], rotation=90)
ax.set_ylim([0, 0.3])
_ = ax.set_xlim([-1, len(cols)])
Random Forests in Python
By Ivo Flipse