Random Forests in Python

Ivo Flipse (@ivoflipse5)

Amsterdam Python Meetup, June 2014

Background

  • Human Movement Scientist
  • Specialized in gait analysis and pressure measurements
  • Pythonista, Matlab-survivor and reluctant R-user

Normally pressure measurements are done on humans

But I needed a version that worked on animals too

So I created Pawlabeling; see GitHub: https://www.github.com/ivoflipse/pawlabeling

Scientific Python stack

  • PySide (GUI)
  • NumPy (array operations)
  • SciPy (scientific calculations)
  • PyTables (storage)
  • OpenCV (computer vision)
  • Matplotlib (plotting)
  • Scikit-Learn (machine-learning)

For installation, I recommend Anaconda (Continuum Analytics) or Canopy (Enthought).

Challenging problems

  • Detection and tracking of paws
  • Labeling the paws
  • Detecting toes
  • Image registration

Focus of the talk

Labeling the paws:

Given a paw, predict its label: LF (left front), LH (left hind), RF (right front) or RH (right hind)

So how can we solve this?

Machine Learning in Python

Gaël Varoquaux (scikit-learn developer):

Machine Learning is about building programs with tunable parameters that are adjusted automatically so as to improve their behavior by adapting to previously seen data.

Today we'll focus on Supervised Learning:

Supervised learning consists of learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called target or labels. Most often, y is a 1D array of length n_samples.
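
As a minimal sketch of that workflow (using scikit-learn's bundled digits dataset rather than my paw data), fitting an estimator and predicting on held-out samples looks like this:

from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X, y = digits.data, digits.target          # X has shape (n_samples, n_features)

clf = KNeighborsClassifier()               # any classifier with fit/predict would do
clf.fit(X[:-10], y[:-10])                  # learn from all but the last 10 samples
print(clf.predict(X[-10:]))                # predicted labels for the held-out samples
print(y[-10:])                             # true labels, for comparison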

Machine Learning in Python

Like finding a line that separates these black and white points

Machine Learning in Python

Or guessing the digit given a small image:

Using Scikit-Learn

Scikit-Learn cheat-sheet:

Source: Andreas Müller

Random Forests

Why use Random Forests:

  • Easy to use
  • Used for regression (predicting values)
  • Used for classification (predicting categories)
  • Gives you good scores on (entry-level) Kaggle competitions
  • I just happen to like them :-)

Decision Tree

Split the data on the feature that yields the most information gain (lowest entropy)

Source: Citizennet.com | Microsoft Research
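
To make "information gain" concrete, here is a toy sketch of the entropy bookkeeping a tree performs when evaluating a candidate split (not the actual scikit-learn internals):

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    # Parent entropy minus the size-weighted entropy of the two child nodes
    left, right = labels[left_mask], labels[~left_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / float(len(labels))
    return entropy(labels) - weighted

labels = np.array(["LF", "LF", "RF", "RF", "LH", "RH"])
split = np.array([True, True, False, False, False, False])  # a candidate split
print(information_gain(labels, split))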

Random Forests

A Random Forest combines a 'forest' of decision trees, each trained on a random subset of the training data

Source: Citizennet.com | Microsoft Research
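
In spirit (a toy bagging sketch, assuming X and y are NumPy arrays; this is not what scikit-learn's implementation actually does), that amounts to:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_forest(X, y, n_trees=10, seed=0):
    # Each tree sees a bootstrap sample (drawn with replacement) of the training data
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X), len(X))
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def toy_forest_predict(trees, X):
    # The forest's prediction is a majority vote over the individual trees
    votes = np.array([t.predict(X) for t in trees])         # shape (n_trees, n_samples)
    majority = []
    for sample_votes in votes.T:
        values, counts = np.unique(sample_votes, return_counts=True)
        majority.append(values[np.argmax(counts)])
    return np.array(majority)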

Random Forests in Action

Most machine learning algorithms implemented in scikit-learn expect a numpy array as input X. The expected shape of X is (n_samples, n_features).

The number of features must be fixed in advance. However it can be very high dimensional (e.g. millions of features) with most of them being zeros for a given sample.

Source: Scikit-Learn Tutorial
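
For example, a trivial sketch of what such an X looks like:

import numpy as np

X = np.array([[0.0, 1.2, 0.0, 3.4],
              [0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 2.1, 0.0]])
print(X.shape)   # (n_samples, n_features) -> (3, 4)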

In [1]:
import pandas as pd

# Let's load the data from some CSV files I prepared:
X_train = pd.read_csv("classify X_train.csv", index_col=0)
X_test = pd.read_csv("classify X_test.csv", index_col=0)

print(X_train.columns)

n_samples, n_features = X_train.shape
print n_samples, n_features
Index([u'subject_id', u'weight', u'measurement_id', u'contact_id', u'max_force', u'max_surface', u'max_duration', u'-f2', u'-s2', u'-m2', u'-x2', u'-y2', u'-z2', u'-f1', u'-s1', u'-m1', u'-x1', u'-y1', u'-z1', u'f1', u's1', u'm1', u'x1', u'y1', u'z1', u'f2', u's2', u'm2', u'x2', u'y2', u'z2', u'label'], dtype='object')
486 32

The features

cols = ['max_force', 'max_surface', 'max_duration', 
        '-f2', '-s2', '-m2', '-x2', '-y2', '-z2', 
        '-f1', '-s1', '-m1', '-x1', '-y1', '-z1', 
         'f1',  's1',  'm1',  'x1',  'y1',  'z1', 
         'f2',  's2',  'm2',  'x2',  'y2',  'z2']


In [2]:
X_train.head(n=5)
Out[2]:
subject_id weight measurement_id contact_id max_force max_surface max_duration -f2 -s2 -m2 -x2 -y2 -z2 -f1 -s1 -m1 -x1 -y1 -z1 f1
0 subject_30 7 sel_3 - 3-4-2010 - Entire Plate Roll Off contact_0 77.5 19.354800 24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 69.200000 ...
1 subject_30 7 sel_3 - 3-4-2010 - Entire Plate Roll Off contact_1 69.2 19.354800 23 NaN NaN NaN NaN NaN NaN 77.500000 19.354800 24 -30.0 14.5 -22 22.800001 ...
2 subject_30 7 sel_3 - 3-4-2010 - Entire Plate Roll Off contact_3 81.5 18.580608 23 69.200000 19.354800 23 -29.5 -20.0 -21 22.800001 4.645152 14 -60.5 3.5 -12 43.700000 ...
3 subject_30 7 sel_3 - 3-4-2010 - Entire Plate Roll Off contact_4 43.7 10.064496 18 22.800001 4.645152 14 -21.5 20.5 -17 81.500000 18.580608 23 39.0 17.0 -5 80.800000 ...
4 subject_30 7 sel_3 - 3-4-2010 - Entire Plate Roll Off contact_5 80.8 19.354800 21 81.500000 18.580608 23 -29.5 17.0 -24 43.700000 10.064496 18 -68.5 0.0 -19 61.200000 ...

5 rows × 32 columns

In [92]:
cols = ['max_force', 'max_surface', 'max_duration', 
        '-f2', '-s2', '-m2', '-x2', '-y2', '-z2', 
        '-f1', '-s1', '-m1', '-x1', '-y1', '-z1', 
        'f1', 's1', 'm1', 'x1', 'y1', 'z1', 
        'f2', 's2', 'm2', 'x2', 'y2', 'z2']

Random Forests in Action

Let's set up the Random Forest classifier and try it on our data

In [93]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
# dropna() removes every sample that still has a missing (NaN) feature value
rf.fit(X_train[cols].dropna(), 
       X_train[cols+["label"]].dropna()["label"])  
score = rf.score(X_test[cols].dropna(), 
                 X_test[cols+["label"]].dropna()["label"])
print "Score: {:.2f}".format(score)
Score: 0.94

In [99]:
# Great performance, but there's a catch:
n_samples, n_features = X_test.shape
n_samples_dropna, n_features_dropna = X_test.dropna().shape

print "Number of samples before dropna: {}, after dropna: {}".format(
    n_samples, n_samples_dropna)
print "Missing due to NaN: {}".format(n_samples-n_samples_dropna)
Number of samples before dropna: 3236, after dropna: 2359
Missing due to NaN: 877

Ways to improve performance

There are several things we can do to improve our results:

  • Imputation of missing values
  • Normalize or standardize values
  • Cross validation
  • Grid Search hyperparameters
  • Pipeline: chaining estimators
  • Model Evaluation

Imputation of missing values

My example data has a lot of samples with NaN features, rendering those samples unusable

Solution: impute or infer them from the known part of the data

Source: Scikit-Learn docs

In [103]:
from sklearn.preprocessing import Imputer

# Create the imputer, which will replace the NaN's with the mean along the columns
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)

X_train2 = []
for subject_id, subject_group in X_train.groupby("subject_id"):
    # Impute the missing values
    subject_group[cols] = imputer.fit_transform(subject_group[cols])
    X_train2.append(subject_group)

# Create a data frame again
X_train2 = pd.concat(X_train2)


X_test2 = []
for subject_id, subject_group in X_test.groupby("subject_id"):
    subject_group[cols] = imputer.fit_transform(subject_group[cols])
    X_test2.append(subject_group)
    
X_test2 = pd.concat(X_test2)

# Let's see how many samples we have left now
n_samples, n_features = X_test.shape
n_samples_dropna, n_features_dropna = X_test.dropna().shape
n_samples_imputed, n_features_imputed = X_test2.dropna().shape
print("Number of samples before dropna: {}, after dropna: {}, "),
print("after imputing: {}".format(n_samples, n_samples_dropna, n_samples_imputed))
Number of samples before dropna: {}, after dropna: {},  after imputing: 3236

In [7]:
rf = RandomForestClassifier()
rf.fit(X_train2[cols], X_train2["label"])  
score_imputed = rf.score(X_test2[cols], X_test2["label"])

print "Score after imputation: {:.2f}".format(score_imputed)
Score after imputation: 0.90

So we got a slightly worse score, but it labeled an additional 877 samples

These samples are also likely to be harder to predict, given they lack information about previous or next paws

A potential improvement would be to use regression to predict the missing values
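
A rough sketch of that idea, using a RandomForestRegressor to fill in one missing column at a time. It assumes `max_force`, `max_surface` and `max_duration` are always present; the names `impute_with_regression` and `X_train_reg` are just placeholders for this example:

from sklearn.ensemble import RandomForestRegressor

def impute_with_regression(df, target_col, predictor_cols):
    # Fit a regressor on the rows where target_col is known,
    # then predict it for the rows where it is missing
    known = df[df[target_col].notnull()]
    missing = df[df[target_col].isnull()]
    if len(missing) == 0:
        return df
    reg = RandomForestRegressor(n_estimators=20)
    reg.fit(known[predictor_cols], known[target_col])
    df.loc[missing.index, target_col] = reg.predict(missing[predictor_cols])
    return df

# For example, predict the previous contact's force from the current contact's features
X_train_reg = impute_with_regression(X_train.copy(), "-f1",
                                     ["max_force", "max_surface", "max_duration"])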

Normalize or standardize values

Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit: they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

This isn't necessary for Random Forests though (see discussion here)

Source: Scikit-Learn docs

In [8]:
from sklearn import preprocessing

X_train3 = []
for subject_id, subject_group in X_train.groupby("subject_id"):
    # First impute the missing values
    subject_group[cols] = imputer.fit_transform(subject_group[cols])  
    # Then scale the values
    subject_group[cols] = preprocessing.scale(subject_group[cols])
    X_train3.append(subject_group)

X_train3 = pd.concat(X_train3)

X_test3 = []
for subject_id, subject_group in X_test.groupby("subject_id"):
    subject_group[cols] = imputer.fit_transform(subject_group[cols])
    subject_group[cols] = preprocessing.scale(subject_group[cols])      
    X_test3.append(subject_group)
    
X_test3 = pd.concat(X_test3)
In [9]:
rf = RandomForestClassifier()
rf.fit(X_train3[cols], X_train3["label"])  
score_scaled = rf.score(X_test3[cols], X_test3["label"])

print "Score after scaling and imputation: {:.2f}".format(score_scaled)
Score after scaling and imputation: 0.93

Cross Validation

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.

To avoid over-fitting, we have to define two different sets: a training set X_train, y_train which is used for learning the parameters of a predictive model, and a testing set X_test, y_test which is used for evaluating the fitted predictive model.

This is a bit harder on my data, as the samples are not independent: contacts from the same dog should not end up in both the training and the test set

See this IPython Notebook for how this affects the results

Source: Scikit-Learn docs
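
As an alternative to the manual split below, the old `cross_validation` module also has `LeaveOneLabelOut`, which keeps all contacts from the same subject together in one fold. A sketch, assuming `X` is the imputed, concatenated data frame built in the next cells:

from sklearn import cross_validation

labels = X["subject_id"].values            # one label per sample: which dog it came from
lolo = cross_validation.LeaveOneLabelOut(labels)

scores = []
for train_idx, test_idx in lolo:
    rf = RandomForestClassifier()
    rf.fit(X[cols].iloc[train_idx], X["label"].iloc[train_idx])
    scores.append(rf.score(X[cols].iloc[test_idx], X["label"].iloc[test_idx]))

print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores), np.std(scores)))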

In [95]:
import numpy as np

dognames = np.array(['subject_30', 'subject_32', 'subject_33', 'subject_34', 'subject_35', 'subject_36', 
                     'subject_16', 'subject_17', 'subject_14', 'subject_15', 'subject_12', 'subject_13', 
                     'subject_10', 'subject_18', 'subject_19', 'subject_8', 'subject_9', 'subject_4', 
                     'subject_5', 'subject_6', 'subject_7', 'subject_0', 'subject_2', 'subject_29', 
                     'subject_28', 'subject_27', 'subject_26', 'subject_25', 'subject_24', 'subject_23', 
                     'subject_22', 'subject_21', 'subject_20'])

# Little helper function to split the dataset
def split_data_set(X, train_dogs):
    x_train = []
    x_test = []
    
    for subject_id, subject_group in X.groupby("subject_id"):
        if subject_id in train_dogs:
            x_train.append(subject_group)
        else:
            x_test.append(subject_group)
            
    x_train = pd.concat(x_train)
    x_test = pd.concat(x_test)
    return x_train, x_test
In [10]:
import numpy as np
from sklearn import cross_validation

k_fold = cross_validation.KFold(n=len(dognames), n_folds=5)

X = pd.concat([X_train2, X_test2])

scores_cv = []

for index, (train_indices, test_indices) in enumerate(k_fold):   
    train_dogs = dognames[train_indices]
    test_dogs = dognames[test_indices]
    
    # Little helper function to split the dataset
    x_train, x_test = split_data_set(X, train_dogs)
        
    # Now fit the classifier
    rf = RandomForestClassifier()
    rf.fit(x_train[cols], x_train["label"])  
    score_cv = rf.score(x_test[cols], x_test["label"])
    scores_cv.append(score_cv)
    
    print("fold-{: <2} {:.2f}\t".format(index, score_cv))

print("")
print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores_cv), np.std(scores_cv)))
fold-0  0.95	
fold-1  0.99	
fold-2  0.98	
fold-3  0.87	
fold-4  0.97	

Average: 0.95 +/- 0.04

In [11]:
# On most other data sets this can be done a lot simpler
from sklearn import cross_validation

k_fold = cross_validation.KFold(n=len(X), n_folds=5)
# Scikit-Learn comes with a useful helper function to 
# compute the cross validation score
score_cv = cross_validation.cross_val_score(rf, X[cols], X["label"], cv=k_fold)

print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores_cv), np.std(scores_cv)))
Average: 0.95 +/- 0.04

Grid Search hyperparameters

A grid search consists of:

  • an estimator (regressor or classifier such as sklearn.svm.SVC());
  • a parameter space;
  • a method for searching or sampling candidates;
  • a cross-validation scheme; and
  • a score function.

Source: Scikit-Learn docs

In [105]:
from operator import itemgetter
from time import time

def report(grid_scores, n_top=3):
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")
In [114]:
from sklearn.grid_search import GridSearchCV

# Use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

clf = RandomForestClassifier(n_estimators=20)

# Run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X[cols], X["label"])

print("GridSearchCV took {:.2f} seconds for {} candidate parameter settings.".format(
    time() - start, len(grid_search.grid_scores_)))

print("")
# Helper function to pretty print the results
report(grid_search.grid_scores_)
GridSearchCV took 1.22 seconds for 4 candidate parameter settings.

Model with rank: 1
Mean validation score: 0.961 (std: 0.027)
Parameters: {'criterion': 'gini', 'max_depth': None}

Model with rank: 2
Mean validation score: 0.959 (std: 0.023)
Parameters: {'criterion': 'entropy', 'max_depth': None}

Model with rank: 3
Mean validation score: 0.942 (std: 0.027)
Parameters: {'criterion': 'gini', 'max_depth': 3}


In [115]:
rf = RandomForestClassifier(bootstrap=False, min_samples_leaf=1, 
                            min_samples_split=10, criterion='entropy', 
                            max_features=3, max_depth=None)

# Let's try it on one of our last train/test splits
rf.fit(x_train[cols], x_train["label"])  
score_grid = rf.score(x_test[cols], x_test["label"])

print("Score after grid search: {:.2f}".format(score_grid))
Score after grid search: 0.97

Pipeline

Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves two purposes here:

  • Convenience: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
  • Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.

Source: Scikit-Learn docs

In [27]:
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation

# Create X again from the raw data
X = pd.concat([X_train, X_test])

pipeline = Pipeline([
    ("imputer", Imputer(missing_values="NaN", 
                        strategy="mean", axis=0)),
    ("forest", RandomForestClassifier(n_estimators=20))])

# Note the difference in the params, 
# you have to use double underscore notation
# See: http://stackoverflow.com/questions/16437022/how-to-tune-parameters-of-nested-pipelines-by-gridsearchcv-in-scikit-learn
param_grid = {
    "forest__min_samples_split": [1, 3, 10],
    "forest__min_samples_leaf": [1, 3, 10],
    "forest__bootstrap": [True, False],
    "forest__criterion": ["gini", "entropy"]
}

grid_search = GridSearchCV(pipeline, param_grid=param_grid)

grid_search.fit(X[cols], X["label"])
print("Best score: {:0.3f}".format(grid_search.best_score_))
Best score: 0.967

Model evaluation

We can evaluate the quality of the predictions of a model

For example, the accuracy, precision or F1-score

Source: Scikit-Learn docs
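
Beyond the confusion matrix shown below, scikit-learn's classification_report gives per-class precision, recall and F1 in one go (a sketch, assuming `rf` is the fitted classifier from above):

from sklearn.metrics import classification_report

y_pred = rf.predict(X_test2[cols])
print(classification_report(X_test2["label"], y_pred))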

In [28]:
%matplotlib inline
from sklearn.metrics import confusion_matrix

y_pred = rf.predict(X_test2[cols])

# Compute confusion matrix
cm = confusion_matrix(X_test2["label"], y_pred)
print(cm)

import matplotlib.pyplot as plt
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
[[843   0   6   0]
 [  1 725   1   8]
 [ 13   1 862   0]
 [  0   4   0 772]]

Visualize the Random Forest

Finally, we can export one of the trees and visualize its structure

In [58]:
import StringIO
import pydot
from sklearn import tree
from IPython.core.display import Image

# Visualize tree
dot_data = StringIO.StringIO()
tree.export_graphviz(rf.estimators_[0], out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
image = graph.write_png("./images/random_network.png")


In [77]:
print("Feature")
for col, importance in zip(cols, rf.feature_importances_):
    if importance > 0.05:
        print "{:>3}:\t{:.4f}".format(col, importance)
Feature
-y2:	0.0913
-x1:	0.1633
-y1:	0.1220
 x1:	0.1476
 y1:	0.1387
 y2:	0.1490

In [89]:
import matplotlib.pyplot as plt

# Example from: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
importances = rf.feature_importances_
std = np.std([est.feature_importances_ for est in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

fig, ax = plt.subplots(figsize=(12,6))
fig.suptitle("Feature importances")
ax.bar(range(len(importances)), importances[indices], color="r", yerr=std[indices], align="center")
ax.set_xticks(range(len(importances)))
cols = np.array(cols)
ax.set_xticklabels(cols[indices], rotation=90)
ax.set_ylim([0, 0.3])
_ = ax.set_xlim([-1, len(cols)])
