Normally pressure measurements are done on humans
But I needed a version that worked on animals too
So I created Pawlabeling, see GitHub: https://www.github.com/ivoflipse/pawlabeling
For installation I recommend Anaconda (Continuum Analytics) or Canopy (Enthought).
Labeling the paws:
Given a paw, predict its label: LF (left front), LH (left hind), RF (right front) or RH (right hind)
So how can we solve this?
Gaël Varoquaux (scikit-learn developer):
Machine Learning is about building programs with tunable parameters that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
Today we'll focus on Supervised Learning:
Supervised learning consists of learning the link between two datasets: the observed data X and an external variable y that we are trying to predict, usually called target or labels. Most often, y is a 1D array of length n_samples.
Like finding a line that separates these black and white points
Or guessing the digit given a small image:
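As a quick illustration, here is a minimal sketch of that digits example using scikit-learn's bundled digits dataset (not my paw data; the settings are just for illustration):

from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Each digit is an 8x8 image, flattened into 64 features
digits = load_digits()
X, y = digits.data, digits.target

# Hold out a quarter of the samples to evaluate on unseen data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=20)
clf.fit(X_tr, y_tr)
print("Digits accuracy: {:.2f}".format(clf.score(X_te, y_te)))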
Why use Random Forests:
A decision tree splits the data on the feature value that gives the most information gain (the lowest entropy in the resulting classes)
Source: Citizennet.com | Microsoft Research
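For intuition, here is what that splitting criterion looks like on a toy set of labels (a made-up sketch, not my actual data):

import numpy as np

def entropy(labels):
    # Shannon entropy of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts.astype(float) / counts.sum()
    return -np.sum(p * np.log2(p))

# Toy example: a candidate split of six labeled paws into a left and a right node
parent = np.array(["LF", "LF", "RF", "RF", "LH", "RH"])
left, right = parent[:3], parent[3:]

# Information gain = parent entropy - weighted entropy of the child nodes
gain = entropy(parent) - (len(left) * entropy(left) + len(right) * entropy(right)) / float(len(parent))
print("Information gain: {:.3f}".format(gain))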
A Random Forest combines a 'forest' of decision trees, each trained on a random subset of the training data
Source: Citizennet.com | Microsoft Research
Most machine learning algorithms implemented in scikit-learn expect a numpy array as input X. The expected shape of X is (n_samples, n_features).
The number of features must be fixed in advance. However it can be very high dimensional (e.g. millions of features) with most of them being zeros for a given sample.
Source: Scikit-Learn Tutorial
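In other words, something like this (a toy sketch with made-up numbers):

import numpy as np

# A minimal example: 3 samples with 4 features each, plus one label per sample
X = np.array([[77.5, 19.4, 24.0, 0.0],
              [69.2, 19.4, 23.0, 1.0],
              [81.5, 18.6, 23.0, 0.0]])
y = np.array(["LF", "RF", "LH"])

print(X.shape)  # (n_samples, n_features) -> (3, 4)
print(y.shape)  # (n_samples,)            -> (3,)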
import pandas as pd
# Let's load the data from some CSV files I prepared:
X_train = pd.read_csv("classify X_train.csv", index_col=0)
X_test = pd.read_csv("classify X_test.csv", index_col=0)
print(X_train.columns)
n_samples, n_features = X_train.shape
print n_samples, n_features
Index([u'subject_id', u'weight', u'measurement_id', u'contact_id', u'max_force', u'max_surface', u'max_duration', u'-f2', u'-s2', u'-m2', u'-x2', u'-y2', u'-z2', u'-f1', u'-s1', u'-m1', u'-x1', u'-y1', u'-z1', u'f1', u's1', u'm1', u'x1', u'y1', u'z1', u'f2', u's2', u'm2', u'x2', u'y2', u'z2', u'label'], dtype='object')
486 32
cols = ['max_force', 'max_surface', 'max_duration',
'-f2', '-s2', '-m2', '-x2', '-y2', '-z2',
'-f1', '-s1', '-m1', '-x1', '-y1', '-z1',
'f1', 's1', 'm1', 'x1', 'y1', 'z1',
'f2', 's2', 'm2', 'x2', 'y2', 'z2']
X_train.head(n=5)
 | subject_id | weight | measurement_id | contact_id | max_force | max_surface | max_duration | -f2 | -s2 | -m2 | -x2 | -y2 | -z2 | -f1 | -s1 | -m1 | -x1 | -y1 | -z1 | f1 | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | subject_30 | 7 | sel_3 - 3-4-2010 - Entire Plate Roll Off | contact_0 | 77.5 | 19.354800 | 24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 69.200000 | ... |
1 | subject_30 | 7 | sel_3 - 3-4-2010 - Entire Plate Roll Off | contact_1 | 69.2 | 19.354800 | 23 | NaN | NaN | NaN | NaN | NaN | NaN | 77.500000 | 19.354800 | 24 | -30.0 | 14.5 | -22 | 22.800001 | ... |
2 | subject_30 | 7 | sel_3 - 3-4-2010 - Entire Plate Roll Off | contact_3 | 81.5 | 18.580608 | 23 | 69.200000 | 19.354800 | 23 | -29.5 | -20.0 | -21 | 22.800001 | 4.645152 | 14 | -60.5 | 3.5 | -12 | 43.700000 | ... |
3 | subject_30 | 7 | sel_3 - 3-4-2010 - Entire Plate Roll Off | contact_4 | 43.7 | 10.064496 | 18 | 22.800001 | 4.645152 | 14 | -21.5 | 20.5 | -17 | 81.500000 | 18.580608 | 23 | 39.0 | 17.0 | -5 | 80.800000 | ... |
4 | subject_30 | 7 | sel_3 - 3-4-2010 - Entire Plate Roll Off | contact_5 | 80.8 | 19.354800 | 21 | 81.500000 | 18.580608 | 23 | -29.5 | 17.0 | -24 | 43.700000 | 10.064496 | 18 | -68.5 | 0.0 | -19 | 61.200000 | ... |
5 rows × 32 columns
Let's set up the Random Forest classifier and try it on our data
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
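# For now, simply drop any rows that still contain NaN features (we revisit this below)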
rf.fit(X_train[cols].dropna(),
X_train[cols+["label"]].dropna()["label"])
score = rf.score(X_test[cols].dropna(),
X_test[cols+["label"]].dropna()["label"])
print "Score: {:.2f}".format(score)
Score: 0.94
# Great performance, but there's a catch:
n_samples, n_features = X_test.shape
n_samples_dropna, n_features_dropna = X_test.dropna().shape
print "Number of samples before dropna: {}, after dropna: {}".format(
n_samples, n_samples_dropna)
print "Missing due to NaN: {}".format(n_samples-n_samples_dropna)
Number of samples before dropna: 3236, after dropna: 2359
Missing due to NaN: 877
There are several things we can do to improve our results
My example data has a lot of samples with NaN features, which renders those samples unusable
Solution: impute or infer them from the known part of the data
Source: Scikit-Learn docs
from sklearn.preprocessing import Imputer
# Create the imputer, which will replace the NaN's with the mean along the columns
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
X_train2 = []
for subject_id, subject_group in X_train.groupby("subject_id"):
# Impute the missing values
subject_group[cols] = imputer.fit_transform(subject_group[cols])
X_train2.append(subject_group)
# Create a data frame again
X_train2 = pd.concat(X_train2)
X_test2 = []
for subject_id, subject_group in X_test.groupby("subject_id"):
subject_group[cols] = imputer.fit_transform(subject_group[cols])
X_test2.append(subject_group)
X_test2 = pd.concat(X_test2)
# Let's see how many samples we have left now
n_samples, n_features = X_test.shape
n_samples_dropna, n_features_dropna = X_test.dropna().shape
n_samples_imputed, n_features_imputed = X_test2.dropna().shape
print("Number of samples before dropna: {}, after dropna: {}, "),
print("after imputing: {}".format(n_samples, n_samples_dropna, n_samples_imputed))
Number of samples before dropna: 3236, after dropna: 2359, after imputing: 3236
rf = RandomForestClassifier()
rf.fit(X_train2[cols], X_train2["label"])
score_imputed = rf.score(X_test2[cols], X_test2["label"])
print "Score after imputation: {:.2f}".format(score_imputed)
Score after imputation: 0.90
So we got a slightly worse score, but it labeled an additional 877 samples
These samples are also likely to be harder to predict, given they lack information about previous or next paws
A potential improvement would be to use regression to predict the missing value
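A rough sketch of what that could look like (this is not part of my pipeline; the choice of RandomForestRegressor and of the predictor columns is just an assumption for illustration):

from sklearn.ensemble import RandomForestRegressor

def impute_with_regression(df, target_col, predictor_cols):
    # Fit a regressor on the rows where target_col is known...
    # (assumes the predictor columns themselves contain no NaNs)
    known = df[df[target_col].notnull()]
    missing = df[df[target_col].isnull()]
    if len(known) == 0 or len(missing) == 0:
        return df
    reg = RandomForestRegressor(n_estimators=20)
    reg.fit(known[predictor_cols], known[target_col])
    # ...and predict it for the rows where it is missing
    df.loc[missing.index, target_col] = reg.predict(missing[predictor_cols])
    return df

# Example: predict the missing previous-contact force from the current contact's features
X_train_reg = impute_with_regression(X_train.copy(), "-f1",
                                     ["max_force", "max_surface", "max_duration"])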
Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit: they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
This isn't necessary for Random Forests though (see discussion here)
Source: Scikit-Learn docs
from sklearn import preprocessing
X_train3 = []
for subject_id, subject_group in X_train.groupby("subject_id"):
# First impute the missing values
subject_group[cols] = imputer.fit_transform(subject_group[cols])
# Then scale the values
subject_group[cols] = preprocessing.scale(subject_group[cols])
X_train3.append(subject_group)
X_train3 = pd.concat(X_train3)
X_test3 = []
for subject_id, subject_group in X_test.groupby("subject_id"):
subject_group[cols] = imputer.fit_transform(subject_group[cols])
subject_group[cols] = preprocessing.scale(subject_group[cols])
X_test3.append(subject_group)
X_test3 = pd.concat(X_test3)
rf = RandomForestClassifier()
rf.fit(X_train3[cols], X_train3["label"])
score_scaled = rf.score(X_test3[cols], X_test3["label"])
print "Score after scaling and imputation: {:.2f}".format(score_scaled)
Score after scaling and imputation: 0.93
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.
To avoid over-fitting, we have to define two different sets: a training set X_train, y_train which is used for learning the parameters of a predictive model, and a testing set X_test, y_test which is used for evaluating the fitted predictive model.
This is a bit harder on my data as the samples are not independent
See this IPython Notebook for how this affects the results
Source: Scikit-Learn docs
import numpy as np

dognames = np.array(['subject_30', 'subject_32', 'subject_33', 'subject_34', 'subject_35', 'subject_36',
'subject_16', 'subject_17', 'subject_14', 'subject_15', 'subject_12', 'subject_13',
'subject_10', 'subject_18', 'subject_19', 'subject_8', 'subject_9', 'subject_4',
'subject_5', 'subject_6', 'subject_7', 'subject_0', 'subject_2', 'subject_29',
'subject_28', 'subject_27', 'subject_26', 'subject_25', 'subject_24', 'subject_23',
'subject_22', 'subject_21', 'subject_20'])
# Little helper function to split the dataset
def split_data_set(X, train_dogs):
x_train = []
x_test = []
for subject_id, subject_group in X.groupby("subject_id"):
if subject_id in train_dogs:
x_train.append(subject_group)
else:
x_test.append(subject_group)
x_train = pd.concat(x_train)
x_test = pd.concat(x_test)
return x_train, x_test
import numpy as np
from sklearn import cross_validation
k_fold = cross_validation.KFold(n=len(dognames), n_folds=5)
X = pd.concat([X_train2, X_test2])
scores_cv = []
for index, (train_indices, test_indices) in enumerate(k_fold):
train_dogs = dognames[train_indices]
test_dogs = dognames[test_indices]
    # Split the data set with the helper function defined above
x_train, x_test = split_data_set(X, train_dogs)
# Now fit the classifier
rf = RandomForestClassifier()
rf.fit(x_train[cols], x_train["label"])
score_cv = rf.score(x_test[cols], x_test["label"])
scores_cv.append(score_cv)
print("fold-{: <2} {:.2f}\t".format(index, score_cv))
print("")
print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores_cv), np.std(scores_cv)))
fold-0 0.95
fold-1 0.99
fold-2 0.98
fold-3 0.87
fold-4 0.97

Average: 0.95 +/- 0.04
# On most other data sets this can be done a lot simpler
from sklearn import cross_validation
k_fold = cross_validation.KFold(n=len(X), n_folds=5)
# Scikit-Learn comes with a useful helper function to
# compute the cross validation score
scores_cv = cross_validation.cross_val_score(rf, X[cols], X["label"], cv=k_fold)
print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores_cv), np.std(scores_cv)))
Average: 0.95 +/- 0.04
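Since the samples are grouped per dog, another option is to let scikit-learn do the grouping itself with LeaveOneLabelOut, using subject_id as the group label (a sketch of an alternative, not what I used above):

from sklearn import cross_validation

# Each fold holds out every contact of one dog
labels = X["subject_id"].values
lolo = cross_validation.LeaveOneLabelOut(labels)
scores_lolo = cross_validation.cross_val_score(rf, X[cols], X["label"], cv=lolo)
print("Average: {:.2f} +/- {:.2f}".format(np.mean(scores_lolo), np.std(scores_lolo)))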
A Grid Search exhaustively tries combinations of parameter values and selects the combination with the best cross-validation score:
Source: Scikit-Learn docs
from operator import itemgetter
from time import time
def report(grid_scores, n_top=3):
top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
for i, score in enumerate(top_scores):
print("Model with rank: {0}".format(i + 1))
print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
score.mean_validation_score,
np.std(score.cv_validation_scores)))
print("Parameters: {0}".format(score.parameters))
print("")
from sklearn.grid_search import GridSearchCV
# Use a full grid over all parameters
param_grid = {"max_depth": [3, None],
"max_features": [1, 3, 10],
"min_samples_split": [1, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
clf = RandomForestClassifier(n_estimators=20)
# Run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X[cols], X["label"])
print("GridSearchCV took {:.2f} seconds for {} candidate parameter settings.".format(
time() - start, len(grid_search.grid_scores_)))
print("")
# Helper function to pretty print the results
report(grid_search.grid_scores_)
GridSearchCV took 1.22 seconds for 4 candidate parameter settings.

Model with rank: 1
Mean validation score: 0.961 (std: 0.027)
Parameters: {'criterion': 'gini', 'max_depth': None}

Model with rank: 2
Mean validation score: 0.959 (std: 0.023)
Parameters: {'criterion': 'entropy', 'max_depth': None}

Model with rank: 3
Mean validation score: 0.942 (std: 0.027)
Parameters: {'criterion': 'gini', 'max_depth': 3}
rf = RandomForestClassifier(bootstrap=False, min_samples_leaf=1,
min_samples_split=10, criterion='entropy',
max_features=3, max_depth=None)
# Let's try it on one of our last train/test splits
rf.fit(x_train[cols], x_train["label"])
score_grid = rf.score(x_test[cols], x_test["label"])
print("Score after grid search: {:.2f}".format(score_grid))
Score after grid search: 0.97
Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves two purposes here: convenience (you only have to call fit and predict once on your data to fit a whole sequence of estimators) and joint parameter selection (you can grid search over parameters of all the estimators in the pipeline at once).
Source: Scikit-Learn docs
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation
# Create X again from the raw data
X = pd.concat([X_train, X_test])
pipeline = Pipeline([
("imputer", Imputer(missing_values="NaN",
strategy="mean", axis=0)),
("forest", RandomForestClassifier(n_estimators=20))])
# Note the difference in the params,
# you have to use double underscore notation
# See: http://stackoverflow.com/questions/16437022/how-to-tune-parameters-of-nested-pipelines-by-gridsearchcv-in-scikit-learn
param_grid = {
"forest__min_samples_split": [1, 3, 10],
"forest__min_samples_leaf": [1, 3, 10],
"forest__bootstrap": [True, False],
"forest__criterion": ["gini", "entropy"]
}
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X[cols], X["label"])
print("Best score: {:0.3f}".format(grid_search.best_score_))
Best score: 0.967
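To see which settings won, the fitted GridSearchCV keeps them around:

# The parameter combination with the best cross-validation score
print(grid_search.best_params_)

# The pipeline (imputer + forest), refit on all the data with those parameters
best_pipeline = grid_search.best_estimator_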
We can evaluate the quality of the predictions of a model
For example the accuracy, precision or F1-score
Source: Scikit-Learn docs
%matplotlib inline
from sklearn.metrics import confusion_matrix
y_pred = rf.predict(X_test2[cols])
# Compute confusion matrix
cm = confusion_matrix(X_test2["label"], y_pred)
print(cm)
import matplotlib.pyplot as plt
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
[[843   0   6   0]
 [  1 725   1   8]
 [ 13   1 862   0]
 [  0   4   0 772]]
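scikit-learn can also summarize precision, recall and the F1-score per paw label, reusing the predictions from above:

from sklearn.metrics import classification_report

# Precision, recall and F1-score for LF, LH, RF and RH
print(classification_report(X_test2["label"], y_pred))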
Finally, we can export the tree and visualize its structure
import StringIO
import pydot
from sklearn import tree
from IPython.core.display import Image
# Visualize tree
dot_data = StringIO.StringIO()
tree.export_graphviz(rf.estimators_[0], out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_png("./images/random_network.png")
# Display the exported image inline in the notebook
Image("./images/random_network.png")
print("Feature")
for col, importance in zip(cols, rf.feature_importances_):
if importance > 0.05:
print "{:>3}:\t{:.4f}".format(col, importance)
Feature
-y2:	0.0913
-x1:	0.1633
-y1:	0.1220
 x1:	0.1476
 y1:	0.1387
 y2:	0.1490
import matplotlib.pyplot as plt
# Example from: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
fig, ax = plt.subplots(figsize=(12,6))
fig.suptitle("Feature importances")
ax.bar(range(len(importances)), importances[indices], color="r", yerr=std[indices], align="center")
ax.set_xticks(range(len(importances)))
cols = np.array(cols)
ax.set_xticklabels(cols[indices], rotation=90)
ax.set_ylim([0, 0.3])
_ = ax.set_xlim([-1, len(cols)])