Ferran Muiños
updated: Monday, 2021-03-01
"Demystify tree ensemble methods and in particular boosted trees"
Quick overview of regression
Gentle intro to decision trees and tree ensembles
What is boosting trying to achieve?
Some examples along the way
https://www.quora.com/What-machine-learning-approaches-have-won-most-Kaggle-competitions
[Scatter plots: covariates (x-axis) vs. response (y-axis)]
Problem statement:
Find a function $f$ that gives a precise description of the dependence relationship between the covariates $x$ and the response $y$:
$y \approx f(x)$
Alternative problem statement:
Given a collection of samples $(x_1, y_1), \ldots, (x_n, y_n)$,
find a function $f$ that provides good predictions: $f(x_i) \approx y_i$
What patterns do we see?
Average Smoothing
Nadaraya-Watson:
$\hat{f}(x) = \dfrac{\sum_{i=1}^{n} K(x - x_i)\, y_i}{\sum_{i=1}^{n} K(x - x_i)}$, where $K$ is a bell-shaped kernel
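As an illustration, here is a minimal Nadaraya-Watson smoother with a Gaussian (bell-shaped) kernel; the toy data and the bandwidth value are just placeholders for the example:

import numpy as np

def nadaraya_watson(x_query, x, y, bandwidth=1.0):
    # Gaussian kernel weights: samples close to x_query count more
    weights = np.exp(-0.5 * ((x_query - x) / bandwidth) ** 2)
    # weighted average of the responses
    return np.sum(weights * y) / np.sum(weights)

# toy data: noisy sine curve
x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.normal(scale=0.2, size=x.size)
y_smooth = np.array([nadaraya_watson(q, x, y, bandwidth=0.5) for q in x])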
Linear
Quadratic
Parametric methods:
Smoothers:
Applicable to datasets with n covariates
This is the key!
How do we know what to expect after all?
Dataset
Training dataset
Test dataset
Fit the model
Test the model
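A minimal sketch of this fit/test workflow with scikit-learn; the toy dataset and the choice of a linear model are placeholders:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# toy dataset: one covariate, noisy linear response
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)               # fit the model
test_mse = mean_squared_error(y_test, model.predict(X_test))   # test the model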
Bagging = Bootstrap + Aggregating
"Averaging" is the typical way to reach consensus
Me learn good
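A minimal hand-rolled sketch of bagging, using shallow decision trees (introduced next) as base learners; all names here are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_new, n_estimators=100, seed=0):
    # X_train, y_train, X_new: numpy arrays
    rng = np.random.default_rng(seed)
    n = len(y_train)
    predictions = []
    for _ in range(n_estimators):
        # Bootstrap: resample the training set with replacement
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor(max_depth=2).fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_new))
    # Aggregating: average the predictions of all the trees
    return np.mean(predictions, axis=0)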
Tree functions, a.k.a. decision trees:
[Diagram: regression tree with a 1st split and a 2nd split; yes/no branches at each node]
Which split gives minimum loss? e.g. loss = RSS (residual sum of squares)
For which split do we get minimum RSS?
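A minimal sketch of that search for a single covariate: scan every candidate threshold, predict each side by its mean, and keep the split with minimum RSS (names are illustrative):

import numpy as np

def best_split(x, y):
    # x: 1D covariate, y: response (numpy arrays)
    best_threshold, best_rss = None, np.inf
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        # RSS when each side is predicted by its own mean
        rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if rss < best_rss:
            best_threshold, best_rss = threshold, rss
    return best_threshold, best_rss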
from sklearn.ensemble import RandomForestRegressor

# temp must be a 2D array of shape (n_samples, 1); rate is the response
model = RandomForestRegressor(n_estimators=3, max_depth=1)
res = model.fit(temp, rate)
n_estimators=1000, max_depth=1
n_estimators=1000, max_depth=2
Instead of fitting a single global model that minimizes the loss in one go, training is done by adding one tree at a time (additive training):
$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$
Goal: find the tree $f_t$ that minimizes the loss
$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$
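A minimal sketch of additive training for the squared-error loss, where each new shallow tree is fitted to the current residuals; this is a simplified gradient-boosting recipe (no regularization term), not the exact XGBoost algorithm:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_additive_trees(X, y, n_rounds=100, learning_rate=0.1, max_depth=1):
    prediction = np.zeros(len(y))        # start from a constant prediction of 0
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction       # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        # additive update, shrunk by the learning rate
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

def predict_additive(trees, X, learning_rate=0.1):
    return learning_rate * sum(tree.predict(X) for tree in trees)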
The regularization term $\Omega(f)$ penalizes the tree complexity.
For example, in XGBoost:
$\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$
where $T$ is the number of leaves and $w_1, \ldots, w_T$ are the values (or weights) at the leaves.
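For orientation, these two quantities correspond to XGBoost's gamma (cost per leaf, $\gamma$) and reg_lambda (L2 penalty on leaf weights, $\lambda$) parameters; the other values below are arbitrary placeholders:

import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,
    max_depth=3,
    learning_rate=0.1,
    gamma=1.0,       # penalty paid for every extra leaf
    reg_lambda=1.0,  # L2 penalty on the leaf weights
)
# model.fit(X_train, y_train) then works like any scikit-learn regressor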
How do we find $f_t$?
We can use a second-order Taylor approximation of the loss function:
$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)$
where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ are the first and second derivatives of the loss with respect to the prediction.
How do we find $f_t$?
New goal: minimize this new loss function (dropping the terms that are constant at step $t$):
$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)$
Regrouping by leaf, we can write it as a sum of quadratic functions, one for each leaf:
$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \Big(\textstyle\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2} \Big(\textstyle\sum_{i \in I_j} h_i + \lambda\Big) w_j^2 \right] + \gamma T$
How do we find $f_t$?
If the tree structure of $f_t$ is fixed, then the optimal weight for each leaf $j$ is given by:
$w_j^{*} = -\dfrac{G_j}{H_j + \lambda}$
where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ ($I_j$ is the set of samples falling in leaf $j$).
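In code, given the gradients and hessians of the samples that fall in one leaf, the optimal weight is a one-liner (illustrative sketch):

import numpy as np

def optimal_leaf_weight(g, h, lam=1.0):
    # g, h: gradients and hessians of the samples in this leaf
    G, H = np.sum(g), np.sum(h)
    return -G / (H + lam)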
Evaluating the new loss at the optimal weights $w_j^{*}$ gives a score for each possible tree structure:
$\tilde{\mathcal{L}}^{(t)} = -\tfrac{1}{2} \sum_{j=1}^{T} \dfrac{G_j^2}{H_j + \lambda} + \gamma T$
Trees are grown greedily so that this score keeps decreasing at every step.
Example of a tree structure:
Now that we have a way to measure how good a tree is, ideally we would enumerate all possible trees and pick the best one. In practice, however, this is intractable.
Is the new split beneficial or not?
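The standard answer (as in XGBoost) is to compute the gain of the split, $\mathrm{Gain} = \tfrac{1}{2}\left[ \dfrac{G_L^2}{H_L + \lambda} + \dfrac{G_R^2}{H_R + \lambda} - \dfrac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$, and keep the split only if the gain is positive. A small illustrative sketch:

import numpy as np

def leaf_score(g, h, lam=1.0):
    # contribution of one leaf to the structure score
    return np.sum(g) ** 2 / (np.sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    parent = leaf_score(np.concatenate([g_left, g_right]),
                        np.concatenate([h_left, h_right]), lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma   # positive gain means the split is worth making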
The optimal tree structure is searched for iteratively.
We can introduce rules to constrain the search at each update. These rules define the "learning style" of the model.
Upsides:
Downsides:
Upsides:
Downsides:
Freund Y, Schapire R. A short introduction to boosting.
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning.
Natekin A, Knoll A. Gradient boosting machines, a tutorial.
XGBoost documentation: https://xgboost.readthedocs.io/en/latest