federica bianco
astro | data science | data for good
dr.federica bianco | fbb.space | fedhere | fedhere
Tree methods
what is machine learning
supervised learning: classification, prediction, feature selection
methods: k-Nearest Neighbors, Regression, Support Vector Machines, Classification/Regression Trees, Neural networks
unsupervised learning: understanding structure, organizing/compressing data, anomaly detection, dimensionality reduction
methods: clustering, PCA, Apriori
used to:
understand structure of feature space
classify based on examples,
regression (classification with infinitely small classes)
should be interpretable:
ethical implications
ML models have parameters and hyperparameters
parameters: the model optimizes based on the data
hyperparameters: chosen by the model author, could be based on domain knowledge, other data, guessed (?!).
e.g. the shape of the polynomial
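A minimal sketch of the distinction, assuming a polynomial model fit with NumPy on made-up data: the degree is a hyperparameter picked by hand, while the coefficients are the parameters optimized on the data. The data and the names `degree` and `coefficients` are illustrative only.

```python
import numpy as np

# synthetic data: a noisy parabola (illustrative, not from the slides)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2.0 * x**2 - 0.5 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

degree = 2                                # hyperparameter: chosen by the model author
coefficients = np.polyfit(x, y, degree)   # parameters: optimized on the data

print("degree (hyperparameter):", degree)
print("coefficients (parameters):", coefficients)
```

Changing `degree` changes the shape of the model; rerunning the fit on new data changes only `coefficients`.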
observed features: (x, y)
goal is to partition the space so that the observed variables are separated into maximally homogeneous, maximally distinguishable groups
models typically return a cluster label by object
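As a hedged illustration of "a cluster label by object", the sketch below runs scikit-learn's `KMeans` on two synthetic blobs. K-means is only one possible clustering method, chosen here as an assumption; the data are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# two well separated synthetic groups in the (x, y) feature space
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(3.0, 3.0), scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# no target is used: only the observed features
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_   # one cluster label per object

print(labels[:5], labels[-5:])
```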
goal is to partition the space so that the unobserved variables are separated in groups consistently with an observed subset
models typically return a partition of the space
observed features: (x, y)
target features: (color)
partition with a line, ax+b:
def classify(x, y, a, b):
    # objects on or below the line y = a*x + b are blue
    if y <= a * x + b:
        return "blue"
    else:
        return "orange"
def classify(x, y, a, b):
    # blue if (x, y) is at least as close to the origin as to the point (a, b)
    if x**2 + y**2 <= (x - a)**2 + (y - b)**2:
        return "blue"
    else:
        return "orange"
this is a solution SVM would provide:
Support Vector Machine:
finds a hyperplane that partitions the space
A subset of objects has class labels. Guess the label for the other objects
hyperplane in 2D: line (curve)
hyperplane in 3D: surface
hyperplane in 4D: volume
...
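A small sketch of the idea, assuming scikit-learn's `SVC` with a linear kernel and synthetic (x, y) points colored by a made-up line. The fitted `coef_` and `intercept_` describe the separating hyperplane, which in 2D is a line.

```python
import numpy as np
from sklearn.svm import SVC

# synthetic objects with observed features (x, y) and a target color
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
color = np.where(X[:, 1] <= 0.5 * X[:, 0] + 0.1, "blue", "orange")

# the first 150 objects have labels; guess the labels of the remaining 50
svm = SVC(kernel="linear", C=10.0).fit(X[:150], color[:150])
w, b = svm.coef_[0], svm.intercept_[0]   # hyperplane: w[0]*x + w[1]*y + b = 0
score = svm.score(X[150:], color[150:])

print("hyperplane:", w, b, "accuracy on unlabeled objects:", score)
```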
Tree Methods
split spaces along each axis separately
A subset of objects has class labels. Guess the label for the other objects
split along x
def classify(x, a):
    if x <= a:
        return "blue"
    else:
        return "orange"
then along y
def classify(x, y, a, b):
    if x <= a:
        if y <= b:
            return "blue"
    return "orange"
k-Nearest Neighbors
Calculate the distance d to all known objects
Select the k closest objects
Classification: Assign the most common among the k classes
Regression: Predict the average (median) of the k target values
# k = 1
d = distance(x, trainingset)
C(x) = C(trainingset[argmin(d)])
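The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming Euclidean distance; the function name `knn_predict` and the tiny training set are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=1, task="classification"):
    """Predict for one object x from its k nearest training objects."""
    d = np.linalg.norm(X_train - x, axis=1)   # distance to all known objects
    nearest = np.argsort(d)[:k]               # indices of the k closest objects
    neighbors = y_train[nearest]
    if task == "classification":
        # most common among the k classes
        return Counter(neighbors.tolist()).most_common(1)[0][0]
    # regression: average of the k target values
    return float(np.mean(neighbors))

# hypothetical training set
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
classes = np.array(["blue", "blue", "orange", "orange", "orange"])
values = np.array([1.0, 2.0, 10.0, 11.0, 12.0])

x_new = np.array([0.95, 0.95])
label = knn_predict(x_new, X_train, classes, k=3)
value = knn_predict(x_new, X_train, values, k=3, task="regression")
print(label, value)
```

Swapping `np.mean` for `np.median` gives the median variant mentioned for regression.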
Good
non parametric
very good with large training sets
Cover and Hart 1967: As n→∞, the 1-NN error is no more than twice the error of the Bayes Optimal classifier.
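Let $x_{NN}$ be the nearest neighbor of the test point $x_t$.
For $n \to \infty$, $x_{NN} \to x_t$, so $\mathrm{dist}(x_{NN}, x_t) \to 0$
Theorem: $\epsilon_{NN} \le 2\,\epsilon_{BayesOpt}$
Bayes Optimal classifier: predicts $y^* = \arg\max_y P(y|x_t)$, with error $\epsilon_{BayesOpt} = 1 - P(y^*|x_t)$
Proof: assume $P(y|x_t) = P(y|x_{NN})$
(always assumed in ML)
$\epsilon_{NN} = P(y^*|x_t)\,(1 - P(y^*|x_{NN})) + P(y^*|x_{NN})\,(1 - P(y^*|x_t))$
$\le (1 - P(y^*|x_{NN})) + (1 - P(y^*|x_t))$
$= 2\,(1 - P(y^*|x_t)) = 2\,\epsilon_{BayesOpt}$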
Not so good
it is only as good as the distance metric
If the similarity in feature space reflects similarity in label then it is perfect!
poor if training sample is sparse
poor with outliers
Tree Methods
supervised learning method
partitions feature space along each feature separately
The good
The bad
Application:
a robot to predict surviving the Titanic
714 passengers: Ns=290 survived, Nd=424 died
target variable: survival (y/n)
features:
gender (binary)
M: Ns=93 Nd=360
F: Ns=197 Nd=64
optimize over purity:
class (ordinal)
1st: Ns=120 Nd=80
2nd + 3rd: Ns=234 Nd=298
age (continuous)
>6.5: Ns=250 Nd=107
<=6.5: Ns=139 Nd=217
candidate splits, one per feature:
feature | split | Ns | Nd |
---|---|---|---|
gender | M | 93 | 360 |
gender | F | 197 | 64 |
age | >6.5 | 250 | 107 |
age | <=6.5 | 139 | 217 |
class | 1st + 2nd | 120 | 80 |
class | 3rd | 234 | 298 |
growing the tree: each node is split again on the feature that gives the purest groups (p = purity of the node)
node | Ns | Nd | p |
---|---|---|---|
gender: M | 93 | 360 | |
gender: F | 197 | 64 | |
age: >6.5 | 250 | 107 | 82% |
age: <=6.5 | 139 | 217 | 67% |
age: >2.5 | 1 | 1 | 50% |
age: <=2.5 | 8 | 139 | 95% |
age: >38.5 | 44 | 46 | |
age: <=38.5 | 11 | 1 | |
class: 1st + 2nd | 120 | 80 | |
class: 3rd | 234 | 298 | |
class: 1st | 100 | 20 | 80% |
class: 2nd | 40 | 40 | 50% |
A single tree
root node
nodes (make a decision)
branches (split off of a node)
leaves (last groups)
this visualization is called a "dendrogram"
split criteria: gini impurity, information gain (entropy)
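To make the two criteria concrete, here is a small sketch that evaluates them on the gender split from the Titanic example (M: Ns=93, Nd=360; F: Ns=197, Nd=64). It assumes the standard definitions, Gini impurity 1 − Σ pᵢ² and entropy −Σ pᵢ log₂ pᵢ; the helper names are illustrative.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node given its class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p**2)

def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]                      # avoid log(0) for pure nodes
    return -np.sum(p * np.log2(p))

def weighted(criterion, children):
    """Criterion averaged over child nodes, weighted by node size."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * criterion(c) for c in children)

male, female = (93, 360), (197, 64)                    # (Ns, Nd) per node
parent = (male[0] + female[0], male[1] + female[1])    # all passengers

gini_parent = gini(parent)
gini_split = weighted(gini, [male, female])
info_gain = entropy(parent) - weighted(entropy, [male, female])
print(gini_parent, gini_split, info_gain)
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent, which is equivalent to a positive information gain when entropy is the criterion.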
A single tree: hyperparameters
depth
max depth = 2: PREVENTS OVERFITTING
alternative: tree pruning
CART: Classification and Regression Trees
regression criteria: mean square error, mean absolute error
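A hedged sketch of the effect of the depth hyperparameter, using scikit-learn's `DecisionTreeClassifier` on synthetic data with 10% of the labels flipped. The data are made up; `max_depth` and `criterion` are real scikit-learn parameters, and `ccp_alpha` is the library's cost-complexity pruning option for the pruning alternative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# synthetic two-feature problem with some label noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = ((X[:, 0] <= 0.5) & (X[:, 1] <= 0.5)).astype(int)
flip = rng.random(500) < 0.1
y_noisy = np.where(flip, 1 - y, y)

# a shallow tree versus a tree grown until every leaf is pure
shallow = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=0).fit(X, y_noisy)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y_noisy)

print("depth:", shallow.get_depth(), deep.get_depth())
print("training score:", shallow.score(X, y_noisy), deep.score(X, y_noisy))
```

The unrestricted tree reproduces the noisy training labels exactly, which is the overfitting that a depth limit guards against.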
variance:
different trees lead to different results
why?
because calculating the criterion for every split and every node is an intractable problem!
e.g. 2 continuous variables would be a problem of order
solution
run many trees and take an "ensemble" decision!
Random Forests: a bunch of parallel trees
Gradient Boosted Trees: a series of trees
run multiple versions of the same model with some small (stochastic or progressive) variation and learn from the ensemble of methods
Gradient boosted trees:
trees run in series (one after the other)
each tree is fit to the errors (residuals) left by the trees before it
the prediction is the sum of the outputs of all the trees in the series
Random forest:
trees run in parallel (independently of each other)
each tree uses a random subset of observations/features (bootstrap - bagging)
class predicted by majority vote:
what class do most trees think a point belongs to
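A minimal sketch comparing the two ensembles with scikit-learn, assuming a synthetic dataset from `make_classification`. The hyperparameter values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic classification problem (illustrative only)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# parallel trees: bootstrap samples and a random subset of features at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
# trees in series: each shallow tree corrects the errors of the previous ones
boosted = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=2, random_state=0)

scores = {}
for name, model in [("random forest", forest), ("gradient boosted trees", boosted)]:
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_score, test_score) in scores.items():
    print(f"{name}: train={train_score:.2f} test={test_score:.2f}")
```

Comparing the training and test scores of each model is a quick check for overfitting.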
LR = _____________________________
True Negative
False Negative
test result | H0 is True | H0 is False |
---|---|---|
H0 is falsified | Type I Error: False Positive | True Positive |
H0 is not falsified | True Negative | Type II Error: False Negative |
False Positive: important message spammed
False Negative: spam in your inbox
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
TP=True Positive
FP=False Positive
TN=True Negative
FN=False Negative
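A short worked example of the three metrics, using the standard definitions and hypothetical counts for a spam filter in which "positive" means "flagged as spam". The counts are made up for illustration.

```python
def precision(tp, fp):
    """Fraction of objects flagged positive that really are positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of truly positive objects that were flagged."""
    return tp / (tp + fn)

def accuracy(tp, fp, tn, fn):
    """Fraction of all objects that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

# hypothetical confusion-matrix counts
tp, fp, tn, fn = 40, 10, 45, 5

print("precision:", precision(tp, fp))
print("recall:   ", recall(tp, fn))
print("accuracy: ", accuracy(tp, fp, tn, fn))
```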
tuning by changing hyperparameters
In principle CART methods are interpretable
you can measure the influence that each feature has on the decision : feature importance
feature importance:
how soon was a feature chosen,
how many times was it used...
example (RF and GBT): A Data-Driven Evaluation of Delays in Criminal Prosecution
In practice the interpretation is complicated by covariance of features
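A sketch of how covariance can blur feature importance, assuming scikit-learn's impurity-based `feature_importances_` and synthetic data. The feature `copy` is a hypothetical near-duplicate of `signal`, and `noise` is unrelated to the label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
copy = signal + rng.normal(scale=0.05, size=n)   # strongly covariant with signal
noise = rng.normal(size=n)                       # unrelated to the label
y = (signal > 0).astype(int)

X_alone = np.column_stack([signal, noise])
X_both = np.column_stack([signal, copy, noise])

rf_alone = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_alone, y)
rf_both = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_both, y)

imp_alone = rf_alone.feature_importances_   # order: signal, noise
imp_both = rf_both.feature_importances_     # order: signal, copy, noise
print("without the covariant feature:", imp_alone)
print("with the covariant feature:   ", imp_both)
```

When the near-duplicate is present the forest can use either column, so the importance is shared between them and neither reflects how informative the underlying quantity is.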
species | age | weight |
---|---|---|
dog | 7 | 32.3 |
bird | 1 | 0.3 |
cat | 3 | 8.1 |
species: categorical
age: ordinal
weight: continuous
change categorical to (integer) numerical
species | age | weight |
---|---|---|
1 | 7 | 32.3 |
2 | 1 | 0.3 |
3 | 3 | 8.1 |
implies an order that does not exist
change each category to a binary
cat | bird | dog | age | weight |
---|---|---|---|---|
0 | 0 | 1 | 7 | 32.3 |
0 | 1 | 0 | 1 | 0.3 |
1 | 0 | 0 | 3 | 8.1 |
ignores covariance between features
Definitely Preferred!
problematic if you are interested in feature importance
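Both encodings can be sketched with pandas on the small table above. This assumes `pandas.get_dummies` for the one-hot version; the integer mapping is the arbitrary one used in the example.

```python
import pandas as pd

animals = pd.DataFrame({
    "species": ["dog", "bird", "cat"],
    "age": [7, 1, 3],
    "weight": [32.3, 0.3, 8.1],
})

# integer encoding: compact, but implies dog < bird < cat
integer_encoded = animals.copy()
integer_encoded["species"] = integer_encoded["species"].map({"dog": 1, "bird": 2, "cat": 3})

# one-hot encoding: one binary column per category, no implied order
one_hot = pd.get_dummies(animals, columns=["species"], dtype=int)

print(integer_encoded)
print(one_hot)
```

The one-hot columns are not independent: each row has exactly one of them set to 1, which is the covariance that complicates feature importance.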
Machine Learning includes models that learn parameters from data
ML models have parameters learned from the data and hyperparameters assigned by the user.
Unsupervised learning:
Supervised learning:
Tree methods:
single trees have high variance as the optimization has to be local
ensemble methods solve the variance issue by running multiple trees and making an ensemble decision
random forest: trees run in parallel, each with a random subset of observations/features, and the decision scheme is a "majority" decision
gradient boosted trees: trees run in series, each tree fit to the errors left by the previous trees. The decision combines all the trees in the series
feature importance: the importance of each feature can be extracted. In the presence of covariance the feature importance may be hard to interpret
encoding categorical variables:
variables have to be encoded as numbers for computers to understand them. You can encode categorical variables with integers or floating point numbers, but you implicitly impart an order. The standard is to one-hot encode, which means creating a binary (True/False) feature (column) for each category of a categorical variable, but this increases the feature space and generates covariance.
model diagnostics for classifiers: the numbers of True/False Positives and True/False Negatives are the basis of the metrics to evaluate classifiers. Combinations of those numbers include Precision (TP/(TP+FP)), Recall (TP/(TP+FN)), Accuracy ((TP+TN)/(TP+TN+FP+FN)).
ROC curve: (True Positive rate vs False Positive rate) is a holistic metric of a model. It can be used to guide the choice of hyperparameters to find the "sweet spot" for your problem
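A minimal sketch of computing an ROC curve, assuming scikit-learn's `roc_curve` and `roc_auc_score` applied to a random forest trained on synthetic data. The dataset and model settings are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# synthetic classification problem (illustrative only)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
p_positive = model.predict_proba(X_test)[:, 1]   # predicted probability of class 1

# one (FPR, TPR) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_test, p_positive)
auc = roc_auc_score(y_test, p_positive)
print("area under the ROC curve:", auc)
```

Plotting `tpr` against `fpr` (for example with matplotlib) gives the curve; a model no better than chance follows the diagonal.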
http://what-when-how.com/artificial-intelligence/decision-tree-applications-for-data-modelling-artificial-intelligence/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4466856/
resource on coding
Data Science
Interpretability Methods in Machine Learning: A Brief Survey
Insights by Two Sigma
- Download the Higgs boson data from Kaggle (programmatically within the notebook, see how in the Titanic notebook)
- Split the provided training data into a training and a test set. For each model calculate and discuss the training and test score results.
- Use a Random Forest and a Gradient Boosted Tree Classifier model to predict the label of the particles.
- Produce a confusion matrix for each model and compare them
- Use a Random Forest and a Gradient Boosted Tree Regressor model to predict the weight of the particles.
- Calculate the L1 and L2 metrics of each model and compare them.
- For the Random Forest classifier, select the 4 most important features (see how in the Titanic notebook) and explore the parameter space with the sklearn module sklearn.model_selection.RandomizedSearchCV for a model that uses only those features to predict the labels https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py
- Generate an ROC curve plot for the best model and discuss it https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html or https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py
- EC and 667
---- Download the script provided in the kaggle challenge to validate your model.
---- Generate an output file as required by this script for your best model
---- Report on the result
Higgs Boson Search
By federica bianco
CART methods