Random Forests in Python

Ivo Flipse (@ivoflipse5)

Amsterdam Python Meetup, June 2014

Background

  • Human Movement Scientist
  • Specialized in gait analysis and pressure measurements
  • Pythonista, Matlab-survivor and reluctant R-user

Normally pressure measurements are done on humans

But I needed a version that worked on animals too

So I created Pawlabeling; see GitHub: https://www.github.com/ivoflipse/pawlabeling

Scientific Python stack

  • PySide (GUI)
  • NumPy (array operations)
  • SciPy (scientific calculations)
  • PyTables (storage)
  • OpenCV (computer vision)
  • Matplotlib (plotting)
  • Scikit-Learn (machine-learning)

For installation, I recommend Anaconda (Continuum Analytics) or Canopy (Enthought).

Challenging problems

  • Detection and tracking of paws
  • Labeling the paws
  • Detecting toes
  • Image registration

Focus of the talk

Labeling the paws:

Given a paw, predict its label: LF (left front), LH (left hind), RF (right front) or RH (right hind)

So how can we solve this?

Machine Learning in Python

Gaël Varoquaux (scikit-learn developer):

Machine Learning is about building programs with tunable parameters that are adjusted automatically so as to improve their behavior by adapting to previously seen data.

Today we'll focus on Supervised Learning:

Supervised learning consists of learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called target or labels. Most often, y is a 1D array of length n_samples.
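
As a minimal sketch of that workflow (using scikit-learn's bundled digits dataset rather than my paw data), fitting an estimator and predicting on held-out samples looks like this:

from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X, y = digits.data, digits.target          # X has shape (n_samples, n_features)

clf = KNeighborsClassifier()               # any classifier with fit/predict would do
clf.fit(X[:-10], y[:-10])                  # learn from all but the last 10 samples
print(clf.predict(X[-10:]))                # predicted labels for the held-out samples
print(y[-10:])                             # true labels, for comparison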

Machine Learning in Python

Like finding a line that separates these black and white points

Machine Learning in Python

Or guessing the digit given a small image:

Using Scikit-Learn

Scikit-Learn cheat-sheet:

Source: Andreas Müller

Random Forests

Why use Random Forests:

  • Easy to use
  • Used for regression (predicting values)
  • Used for classification (predicting categories)
  • Gives you good scores on (entry-level) Kaggle competitions
  • I just happen to like them :-)

Decision Tree

Split the data on the feature that yields the most information gain (lowest entropy)

Source: Citizennet.com | Microsoft Research
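
To make "information gain" concrete, here is a toy sketch of the entropy bookkeeping a tree performs when evaluating a candidate split (not the actual scikit-learn internals):

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    # Parent entropy minus the size-weighted entropy of the two child nodes
    left, right = labels[left_mask], labels[~left_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / float(len(labels))
    return entropy(labels) - weighted

labels = np.array(["LF", "LF", "RF", "RF", "LH", "RH"])
split = np.array([True, True, False, False, False, False])  # a candidate split
print(information_gain(labels, split))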

Random Forests

A Random Forest combines a 'forest' of decision trees, each trained on a random subset of the training data

Source: Citizennet.com | Microsoft Research
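
In spirit (a toy bagging sketch, assuming X and y are NumPy arrays; this is not what scikit-learn's implementation actually does), that amounts to:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_forest(X, y, n_trees=10, seed=0):
    # Each tree sees a bootstrap sample (drawn with replacement) of the training data
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X), len(X))
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def toy_forest_predict(trees, X):
    # The forest's prediction is a majority vote over the individual trees
    votes = np.array([t.predict(X) for t in trees])         # shape (n_trees, n_samples)
    majority = []
    for sample_votes in votes.T:
        values, counts = np.unique(sample_votes, return_counts=True)
        majority.append(values[np.argmax(counts)])
    return np.array(majority)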

Random Forests in Action

Most machine learning algorithms implemented in scikit-learn expect a numpy array as input X. The expected shape of X is (n_samples, n_features).

The number of features must be fixed in advance. However it can be very high dimensional (e.g. millions of features) with most of them being zeros for a given sample.

Source: Scikit-Learn Tutorial
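
For example, a trivial sketch of what such an X looks like:

import numpy as np

X = np.array([[0.0, 1.2, 0.0, 3.4],
              [0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 2.1, 0.0]])
print(X.shape)   # (n_samples, n_features) -> (3, 4)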

In [1]:
import pandas as pd

# Let's load the data from some CSV files I prepared:
X_train = pd.read_csv("classify X_train.csv", index_col=0)
X_test = pd.read_csv("classify X_test.csv", index_col=0)

print(X_train.columns)

n_samples, n_features = X_train.shape
print n_samples, n_features
Index([u'subject_id', u'weight', u'measurement_id', u'contact_id', u'max_force', u'max_surface', u'max_duration', u'-f2', u'-s2', u'-m2', u'-x2', u'-y2', u'-z2', u'-f1', u'-s1', u'-m1', u'-x1', u'-y1', u'-z1', u'f1', u's1', u'm1', u'x1', u'y1', u'z1', u'f2', u's2', u'm2', u'x2', u'y2', u'z2', u'label'], dtype='object')
486 32

The features

cols = ['max_force', 'max_surface', 'max_duration', 
        '-f2', '-s2', '-m2', '-x2', '-y2', '-z2', 
        '-f1', '-s1', '-m1', '-x1', '-y1', '-z1', 
         'f1',  's1',  'm1',  'x1',  'y1',  'z1', 
         'f2',  's2',  'm2',  'x2',  'y2',  'z2']


In [2]:
X_train.head(n=5)
Out[2]:
subject_id weight measurement_id contact_id max_force max_surface max_duration -f2 -s2 -m2 -x2 -y2 -z2 -f1 -s1 -m1 -x1 -y1 -z1 f1
0 subject_30 7 sel_3 - 3-4-2010 - Entire Plate Roll Off contact_0 77.5 19.354800 24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 69.200000 ...
1 subject_30 7 sel_3 - 3-4-2010 - Entire Plate Roll Off contact_1 69.2 19.354800 23 NaN NaN NaN NaN NaN NaN 77.500000 19.354800 24 -30.0 14.5 -22 22.800001 ...
2 subject_30 7 sel_3 - 3-4-2010 - Entire Plate Roll Off contact_3 81.5 18.580608 23 69.200000 19.354800 23 -29.5 -20.0 -21 22.800001 4.645152 14 -60.5 3.5 -12 43.700000 ...
3 subject_30 7 sel_3 - 3-4-2010 - Entire Plate Roll Off contact_4 43.7 10.064496 18 22.800001 4.645152 14 -21.5 20.5 -17 81.500000 18.580608 23 39.0 17.0 -5 80.800000 ...
4 subject_30 7 sel_3 - 3-4-2010 - Entire Plate Roll Off contact_5 80.8 19.354800 21 81.500000 18.580608 23 -29.5 17.0 -24 43.700000 10.064496 18 -68.5 0.0 -19 61.200000 ...

5 rows × 32 columns

In [92]:
cols = ['max_force', 'max_surface', 'max_duration', 
        '-f2', '-s2', '-m2', '-x2', '-y2', '-z2', 
        '-f1', '-s1', '-m1', '-x1', '-y1', '-z1', 
        'f1', 's1', 'm1', 'x1', 'y1', 'z1', 
        'f2', 's2', 'm2', 'x2', 'y2', 'z2']

Random Forests in Action

Let's set up the Random Forest classifier and try it on our data

In [93]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
# dropna() removes every sample that still has a missing (NaN) feature value
rf.fit(X_train[cols].dropna(), 
       X_train[cols+["label"]].dropna()["label"])  
score = rf.score(X_test[cols].dropna(), 
                 X_test[cols+["label"]].dropna()["label"])
print "Score: {:.2f}".format(score)
Score: 0.94

In [99]:
# Great performance, but there's a catch:
n_samples, n_features = X_test.shape
n_samples_dropna, n_features_dropna = X_test.dropna().shape

print "Number of samples before dropna: {}, after dropna: {}".format(
    n_samples, n_samples_dropna)
print "Missing due to NaN: {}".format(n_samples-n_samples_dropna)
Number of samples before dropna: 3236, after dropna: 2359
Missing due to NaN: 877

Ways to improve performance

There are several things we can do to improve our results:

  • Imputation of missing values
  • Normalize or standardize values
  • Cross validation
  • Grid Search hyperparameters
  • Pipeline: chaining estimators
  • Model Evaluation

Imputation of missing values

My example data has a lot of samples with NaN features, rendering those samples unusable

Solution: impute or infer them from the known part of the data

Source: Scikit-Learn docs

In [103]:
from sklearn.preprocessing import Imputer

# Create the imputer, which will replace the NaN's with the mean along the columns
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)

X_train2 = []
for subject_id, subject_group in X_train.groupby("subject_id"):
    # Impute the missing values
    subject_group[cols] = imputer.fit_transform(subject_group[cols])
    X_train2.append(subject_group)

# Create a data frame again
X_train2 = pd.concat(X_train2)


X_test2 = []
for subject_id, subject_group in X_test.groupby("subject_id"):
    subject_group[cols] = imputer.fit_transform(subject_group[cols])
    X_test2.append(subject_group)
    
X_test2 = pd.concat(X_test2)

# Let's see how many samples we have left now
n_samples, n_features = X_test.shape
n_samples_dropna, n_features_dropna = X_test.dropna().shape
n_samples_imputed, n_features_imputed = X_test2.dropna().shape
print("Number of samples before dropna: {}, after dropna: {}, "),
print("after imputing: {}".format(n_samples, n_samples_dropna, n_samples_imputed))
Number of samples before dropna: {}, after dropna: {},  after imputing: 3236

In [7]:
rf = RandomForestClassifier()
rf.fit(X_train2[cols], X_train2["label"])  
score_imputed = rf.score(X_test2[cols], X_test2["label"])

print "Score after imputation: {:.2f}".format(score_imputed)
Score after imputation: 0.90

So we got a slightly worse score, but it labeled an additional 877 samples

These samples are also likely to be harder to predict, given they lack information about previous or next paws

A potential improvement would be to use regression to predict the missing values
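
A rough sketch of that idea, using a RandomForestRegressor to fill in one missing column at a time. It assumes `max_force`, `max_surface` and `max_duration` are always present; the names `impute_with_regression` and `X_train_reg` are just placeholders for this example:

from sklearn.ensemble import RandomForestRegressor

def impute_with_regression(df, target_col, predictor_cols):
    # Fit a regressor on the rows where target_col is known,
    # then predict it for the rows where it is missing
    known = df[df[target_col].notnull()]
    missing = df[df[target_col].isnull()]
    if len(missing) == 0:
        return df
    reg = RandomForestRegressor(n_estimators=20)
    reg.fit(known[predictor_cols], known[target_col])
    df.loc[missing.index, target_col] = reg.predict(missing[predictor_cols])
    return df

# For example, predict the previous contact's force from the current contact's features
X_train_reg = impute_with_regression(X_train.copy(), "-f1",
                                     ["max_force", "max_surface", "max_duration"])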

Normalize or standardize values

Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit: they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

This isn't necessary for Random Forests though (see discussion here)

Source: Scikit-Learn docs

In [8]:
from sklearn import preprocessing

X_train3 = []
for subject_id, subject_group in X_train.groupby("subject_id"):
    # First impute the missing values
    subject_group[cols] = imputer.fit_transform(subject_group[cols])  
    # Then scale the values
    subject_group[cols] = preprocessing.scale(subject_group[cols])
    X_train3.append(subject_group)

X_train3 = pd.concat(X_train3)

X_test3 = []
for subject_id, subject_group in X_test.groupby("subject_id"):
    subject_group[cols] = imputer.fit_transform(subject_group[cols])
    subject_group[cols] = preprocessing.scale(subject_group[cols])      
    X_test3.append(subject_group)
    
X_test3 = pd.concat(X_test3)
In [9]:
rf = RandomForestClassifier()
rf.fit(X_train3[cols], X_train3["label"])  
score_scaled = rf.score(X_test3[cols], X_test3["label"])

print "Score after scaling and imputation: {:.2f}".format(score_scaled)
Score after scaling and imputation: 0.93

Cross Validation

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.

To avoid over-fitting, we have to define two different sets: a training set X_train, y_train which is used for learning the parameters of a predictive model, and a testing set X_test, y_test which is used for evaluating the fitted predictive model.

This is a bit harder on my data, as the samples are not independent: contacts from the same dog should not end up in both the training and the test set

See this IPython Notebook for how this affects the results

Source: Scikit-Learn docs
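
As an alternative to the manual split below, the old `cross_validation` module also has `LeaveOneLabelOut`, which keeps all contacts from the same subject together in one fold. A sketch, assuming `X` is the imputed, concatenated data frame built in the next cells:

from sklearn import cross_validation

labels = X["subject_id"].values            # one label per sample: which dog it came from
lolo = cross_validation.LeaveOneLabelOut(labels)

scores = []
for train_idx, test_idx in lolo:
    rf = RandomForestClassifier()
    rf.fit(X[cols].iloc[train_idx], X["label"].iloc[train_idx])
    scores.append(rf.score(X[cols].iloc[test_idx], X["label"].iloc[test_idx]))

print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores), np.std(scores)))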

In [95]:
import numpy as np

dognames = np.array(['subject_30', 'subject_32', 'subject_33', 'subject_34', 'subject_35', 'subject_36', 
                     'subject_16', 'subject_17', 'subject_14', 'subject_15', 'subject_12', 'subject_13', 
                     'subject_10', 'subject_18', 'subject_19', 'subject_8', 'subject_9', 'subject_4', 
                     'subject_5', 'subject_6', 'subject_7', 'subject_0', 'subject_2', 'subject_29', 
                     'subject_28', 'subject_27', 'subject_26', 'subject_25', 'subject_24', 'subject_23', 
                     'subject_22', 'subject_21', 'subject_20'])

# Little helper function to split the dataset
def split_data_set(X, train_dogs):
    x_train = []
    x_test = []
    
    for subject_id, subject_group in X.groupby("subject_id"):
        if subject_id in train_dogs:
            x_train.append(subject_group)
        else:
            x_test.append(subject_group)
            
    x_train = pd.concat(x_train)
    x_test = pd.concat(x_test)
    return x_train, x_test
In [10]:
import numpy as np
from sklearn import cross_validation

k_fold = cross_validation.KFold(n=len(dognames), n_folds=5)

X = pd.concat([X_train2, X_test2])

scores_cv = []

for index, (train_indices, test_indices) in enumerate(k_fold):   
    train_dogs = dognames[train_indices]
    test_dogs = dognames[test_indices]
    
    # Little helper function to split the dataset
    x_train, x_test = split_data_set(X, train_dogs)
        
    # Now fit the classifier
    rf = RandomForestClassifier()
    rf.fit(x_train[cols], x_train["label"])  
    score_cv = rf.score(x_test[cols], x_test["label"])
    scores_cv.append(score_cv)
    
    print("fold-{: <2} {:.2f}\t".format(index, score_cv))

print("")
print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores_cv), np.std(scores_cv)))
fold-0  0.95	
fold-1  0.99	
fold-2  0.98	
fold-3  0.87	
fold-4  0.97	

Average: 0.95 +/- 0.04

In [11]:
# On most other data sets this can be done a lot simpler
from sklearn import cross_validation

k_fold = cross_validation.KFold(n=len(X), n_folds=5)
# Scikit-Learn comes with a useful helper function to 
# compute the cross validation score
score_cv = cross_validation.cross_val_score(rf, X[cols], X["label"], cv=k_fold)

print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores_cv), np.std(scores_cv)))
Average: 0.95 +/- 0.04

Grid Search hyperparameters

A grid search consists of:

  • an estimator (regressor or classifier such as sklearn.svm.SVC());
  • a parameter space;
  • a method for searching or sampling candidates;
  • a cross-validation scheme; and
  • a score function.

Source: Scikit-Learn docs

In [105]:
from operator import itemgetter
from time import time

def report(grid_scores, n_top=3):
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")
In [114]:
from sklearn.grid_search import GridSearchCV

# Use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

clf = RandomForestClassifier(n_estimators=20)

# Run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X[cols], X["label"])

print("GridSearchCV took {:.2f} seconds for {} candidate parameter settings.".format(
    time() - start, len(grid_search.grid_scores_)))

print("")
# Helper function to pretty print the results
report(grid_search.grid_scores_)
GridSearchCV took 1.22 seconds for 4 candidate parameter settings.

Model with rank: 1
Mean validation score: 0.961 (std: 0.027)
Parameters: {'criterion': 'gini', 'max_depth': None}

Model with rank: 2
Mean validation score: 0.959 (std: 0.023)
Parameters: {'criterion': 'entropy', 'max_depth': None}

Model with rank: 3
Mean validation score: 0.942 (std: 0.027)
Parameters: {'criterion': 'gini', 'max_depth': 3}


In [115]:
rf = RandomForestClassifier(bootstrap=False, min_samples_leaf=1, 
                            min_samples_split=10, criterion='entropy', 
                            max_features=3, max_depth=None)

# Let's try it on one of our last train/test splits
rf.fit(x_train[cols], x_train["label"])  
score_grid = rf.score(x_test[cols], x_test["label"])

print("Score after grid search: {:.2f}".format(score_grid))
Score after grid search: 0.97

Pipeline

Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves two purposes here:

  • Convenience: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
  • Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.

Source: Scikit-Learn docs

In [27]:
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation

# Create X again from the raw data
X = pd.concat([X_train, X_test])

pipeline = Pipeline([
    ("imputer", Imputer(missing_values="NaN", 
                        strategy="mean", axis=0)),
    ("forest", RandomForestClassifier(n_estimators=20))])

# Note the difference in the params, 
# you have to use double underscore notation
# See: http://stackoverflow.com/questions/16437022/how-to-tune-parameters-of-nested-pipelines-by-gridsearchcv-in-scikit-learn
param_grid = {
    "forest__min_samples_split": [1, 3, 10],
    "forest__min_samples_leaf": [1, 3, 10],
    "forest__bootstrap": [True, False],
    "forest__criterion": ["gini", "entropy"]
}

grid_search = GridSearchCV(pipeline, param_grid=param_grid)

grid_search.fit(X[cols], X["label"])
print("Best score: {:0.3f}".format(grid_search.best_score_))
Best score: 0.967

Model evaluation

We can evaluate the quality of the predictions of a model

For example, the accuracy, precision or F1-score

Source: Scikit-Learn docs
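
Beyond the confusion matrix shown below, scikit-learn's classification_report gives per-class precision, recall and F1 in one go (a sketch, assuming `rf` is the fitted classifier from above):

from sklearn.metrics import classification_report

y_pred = rf.predict(X_test2[cols])
print(classification_report(X_test2["label"], y_pred))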

In [28]:
%matplotlib inline
from sklearn.metrics import confusion_matrix

y_pred = rf.predict(X_test2[cols])

# Compute confusion matrix
cm = confusion_matrix(X_test2["label"], y_pred)
print(cm)

import matplotlib.pyplot as plt
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
[[843   0   6   0]
 [  1 725   1   8]
 [ 13   1 862   0]
 [  0   4   0 772]]

Visualize the Random Forest

Finally, we can export one of the trees and visualize its structure

In [58]:
import StringIO
import pydot
from sklearn import tree
from IPython.core.display import Image

# Visualize tree
dot_data = StringIO.StringIO()
tree.export_graphviz(rf.estimators_[0], out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
image = graph.write_png("./images/random_network.png")


In [77]:
print("Feature")
for col, importance in zip(cols, rf.feature_importances_):
    if importance > 0.05:
        print "{:>3}:\t{:.4f}".format(col, importance)
Feature
-y2:	0.0913
-x1:	0.1633
-y1:	0.1220
 x1:	0.1476
 y1:	0.1387
 y2:	0.1490

In [89]:
import matplotlib.pyplot as plt

# Example from: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
importances = rf.feature_importances_
std = np.std([est.feature_importances_ for est in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

fig, ax = plt.subplots(figsize=(12,6))
fig.suptitle("Feature importances")
ax.bar(range(len(importances)), importances[indices], color="r", yerr=std[indices], align="center")
ax.set_xticks(range(len(importances)))
cols = np.array(cols)
ax.set_xticklabels(cols[indices], rotation=90)
ax.set_ylim([0, 0.3])
_ = ax.set_xlim([-1, len(cols)])
