one tree at a time
"Demystify tree ensemble methods"
Quick overview of regression
Gentle intro to decision trees and tree ensembles
What is gradient boosting trying to achieve?
Practical usage
Some examples along the way
Covariates: X
Response: Y
Problem statement:
Find a function f that gives a precise description of the dependence relationship between X and Y:  Y ≈ f(X)
Alternative problem statement:
Given a collection of samples (x_1, y_1), ..., (x_n, y_n),
find a function f that provides:  y_i ≈ f(x_i) for all i
[Plot: response vs. covariates for the example dataset]
What patterns do we see?
Perfect fit!
But do not expect it to fit new data :(
Average Smoothing
Nadaraya-Watson estimator:  f̂(x) = Σ_i K(x − x_i) y_i / Σ_i K(x − x_i)
K: bell-shaped kernel (e.g. Gaussian)
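A minimal sketch of the Nadaraya-Watson smoother with a Gaussian (bell-shaped) kernel; the bandwidth value is an arbitrary illustrative choice:

import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=1.0):
    # Gaussian kernel weight for every (query point, training point) pair
    weights = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    # Kernel-weighted average of the training responses
    return (weights @ y_train) / weights.sum(axis=1)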
Linear
Quadratic
Parametric methods:
Smoothers:
Applicable to datasets with n covariates
This is the key!
How do we know what to expect after all?
Dataset
Training dataset
Test dataset
Fit the model
Test the model
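For instance, with scikit-learn's train_test_split; the temp/rate arrays below are a synthetic stand-in for the example data, and the 80/20 split is an arbitrary choice:

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the temperature/rate example data
temp = np.linspace(0, 30, 100).reshape(-1, 1)
rate = 2.0 * temp.ravel() + np.random.normal(scale=3.0, size=100)

# Hold out 20% of the samples to check how the model behaves on new data
temp_train, temp_test, rate_train, rate_test = train_test_split(
    temp, rate, test_size=0.2, random_state=0
)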
Bagging = Bootstrap + Aggregating
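A hand-rolled sketch of the two ingredients for regression, assuming arrays X (covariates) and y (response): each tree is fit on a bootstrap resample, and the ensemble predicts by averaging the individual trees:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap: sample with replacement
        trees.append(DecisionTreeRegressor(max_depth=2).fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    # Aggregating: average the predictions of the individual trees
    return np.mean([tree.predict(X) for tree in trees], axis=0)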
"Averaging" is the typical way to reach consensus
I am a small tree.
Learn weak, die hard.
Trees (a.k.a. decision trees) are functions of a particular kind:
[Diagram: a depth-2 decision tree; each internal node asks a yes/no question (1st split, 2nd split) and each leaf returns a constant value]
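In code, the tree in the diagram is just nested yes/no questions ending in constant leaf values; the thresholds and leaf values below are made-up placeholders:

def tree_predict(x):
    if x < 10.0:            # 1st split (hypothetical threshold)
        if x < 5.0:         # 2nd split
            return 1.2      # leaf value (placeholder)
        return 2.7          # leaf value (placeholder)
    return 4.1              # leaf value (placeholder)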
For which split do we get the minimum RSS?
Try every candidate split and keep the one with the smallest residual sum of squares.
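A sketch of this greedy search over one covariate: each candidate threshold splits the samples in two, each half is predicted by its mean response, and the threshold with the smallest RSS wins:

import numpy as np

def best_split(x, y):
    best_thr, best_rss = None, np.inf
    for thr in np.unique(x)[:-1]:                      # candidate thresholds
        left, right = y[x <= thr], y[x > thr]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_thr, best_rss = thr, rss
    return best_thr, best_rss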
from sklearn.ensemble import RandomForestRegressor

# temp: covariate array of shape (n_samples, 1); rate: response array
model = RandomForestRegressor(n_estimators=3, max_depth=1)
res = model.fit(temp, rate)
n_estimators=1000, max_depth=1
n_estimators=1000, max_depth=2
Greedy cousin of the Random Forest:
Fix a loss function: e.g. the squared error  L(y, f(x)) = (y − f(x))²
Initialize the model with an educated guess: e.g.  f_0(x) = mean(y)
At each step m we find a new tree t_m and add it to the current model:  f_m(x) = f_{m−1}(x) + t_m(x)
The tree t_m is such that the following "loss", summed over the training samples (x_i, y_i), is small:
Σ_i L(y_i, f_{m−1}(x_i) + t_m(x_i))
Using Taylor's expansion this can be re-written:
Σ_i [ L(y_i, f_{m−1}(x_i)) + g_i t_m(x_i) + ½ h_i t_m(x_i)² ]
The tree t_m is such that the following objective is minimized:
Σ_i [ g_i t_m(x_i) + ½ h_i t_m(x_i)² ] + Ω(t_m)
The pseudo-residuals g_i are computed using the first derivative of the loss function L.
The factors h_i are computed using the second derivative of the loss function L.
Ω(t_m) is a regularization term penalizing the complexity of the tree.
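For the squared-error loss this recipe becomes especially simple: the pseudo-residuals are just y − f(x), and each new shallow tree is fit to them. A minimal from-scratch sketch (learning rate, depth and number of trees are arbitrary choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=200, learning_rate=0.1, max_depth=1):
    f0 = y.mean()                                    # educated guess: mean response
    pred = np.full(len(y), f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                         # pseudo-residuals (negative gradient)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)      # add the new tree to the model
        trees.append(tree)
    return f0, trees

def boosted_predict(f0, trees, X, learning_rate=0.1):
    return f0 + learning_rate * np.sum([tree.predict(X) for tree in trees], axis=0)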
We can introduce rules to constrain the search for each update. These rules define the "learning style" of the model.
Upsides:
Downsides:
Upsides:
Downsides:
Dataset: UV-induced CPD (cyclobutane pyrimidine dimer) in human skin fibroblasts
Samples: 1Mb chunks
Response: CPD counts per chunk
Covariates: annotated coverage by chromatin-associated structures and enrichment of histone modifications.
Challenges
Many covariates: from 20 to 1000+
Many interactions expected
Want accurate prediction without killing interpretability
params = {
'objective': 'reg:linear',
'n_estimators': 15000,
'subsample': 0.5,
'colsample_bytree': 0.5,
'learning_rate': 0.001,
'max_depth': 4,
...
}
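A hedged sketch of how such a params dict is typically used with the scikit-learn-style XGBoost wrapper; X_train/y_train/X_test stand in for the CPD training and test data:

import xgboost as xgb

# Unpack the parameter dict into the regressor and fit on the training split
model = xgb.XGBRegressor(**params)
model.fit(X_train, y_train)
predictions = model.predict(X_test)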
0.5 Fold
Partial model with single covariate (~ 0.3 Var Explained)
Full Model (~ 0.9 Var Explained)
Several choices in most frameworks
Attribution models: model the effect of each covariate in each sample.
We use an attribution model based on Shapley values (cooperative game theory):
"average effect of adding a feature to predict a given sample"
Example with H3K27ac
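With tree ensembles these Shapley-value attributions can be computed efficiently, e.g. with the shap package of Lundberg & Lee; model and X below stand in for the fitted booster and its covariate matrix:

import shap

# Tree-specific, fast Shapley value computation for the fitted ensemble
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one attribution per covariate per sample

# Global view: which covariates contribute most, and in which direction
shap.summary_plot(shap_values, X)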
Valiant (1984): PAC learning models
Kearns and Valiant (1988): first to pose the question of whether a “weak” learning algorithm which performs just slightly better than random guessing [PAC] can be “boosted” into an arbitrarily accurate “strong” learning algorithm.
Schapire (1989): first provable polynomial-time boosting algorithm.
Freund (2000): improved efficiency and caveats.
Many others...
Distributed Machine Learning Common Codebase https://xgboost.readthedocs.io/en/latest/model.html
Freund Y, Schapire R A short introduction to boosting
Hastie T, Tibshirani R, Friedman J The Elements of Statistical Learning
Lundberg S, Lee S-I Consistent individualized feature attribution for tree ensembles
Natekin A, Knoll A Gradient boosting machines, a tutorial