Introduction to XGBoost
Outline
- The canonical forest model
- Why we need another tree model
- Regression Tree and Ensemble (What are we Learning)
- Gradient Boosting (How do we Learn)
- Experiment
Outline
- The forest model
- Why we need another tree model
- Regression Tree and Ensemble (What are we Learning)
- Gradient Boosting (How do we Learn)
- Experiment
Forest model
- A bunch of trees
- Output the aggregation of the outcomes of the individual trees
Example
Problems in tree models
- How to split the node?
- How to aggregate the outcome?
- How to prevent overfitting?
Outline
- The Forest Model
- Why We Need Another Tree Model
- Gradient Boosting (How do we Learn)
- System Designing Tricks
- Take Home Message
- Tips for Open Source
Common setting of classification & regression problems
A classification or regression problem:
- Find \(F(\textbf{x}) = f(\textbf{x};\Theta)\), where \(\Theta\) is the parameter
- So that \(F\) outputs a class label \(\hat{y}\) on input \(\textbf{x}\) (classification)
- or \(F\) outputs a real number on input \(\textbf{x}\) (regression)
How to train it
- Increase accuracy
- Define the loss \(L = \sum_i \ell(y_i, \hat{y}_i)\) on the training data
- Minimize the loss: \(\min_\Theta \sum_i \ell(y_i, \hat{y}_i)\)
- Prevent overfitting
- Add regularization: \(\sum_i \ell(y_i, \hat{y}_i)+\Omega(\Theta)\)
- Minimize the whole regularized objective
- The objective function that appears everywhere: training loss + regularization (a toy numeric check follows below)
- Loss on training data: \(L = \sum^{n}_{i=1} l(y_i, \hat{y}_i)\)
- Square loss: \(l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2\)
- Logistic loss: \(l(y_i, \hat{y}_i) = y_i \ln(1 + e^{-\hat{y}_i}) + (1-y_i)\ln(1+e^{\hat{y}_i})\)
- Regularization: how complicated is the model?
- L2 norm: \(\Omega(w) = \lambda \|w\|^2\)
- L1 norm (lasso): \(\Omega(w) = \lambda \|w\|_1\)
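As a concrete toy check of these formulas, here is a minimal Python sketch; the labels, predictions, and weights are made up purely for illustration.

```python
import numpy as np

# Made-up labels, predictions, and leaf weights, purely for illustration.
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([2.3, -1.2, 0.7])
w = np.array([0.5, -0.2, 0.1])
lam = 1.0

square_loss = np.sum((y - y_hat) ** 2)                        # sum_i (y_i - y_hat_i)^2
logistic_loss = np.sum(y * np.log1p(np.exp(-y_hat))           # y_i * ln(1 + e^{-y_hat_i})
                       + (1 - y) * np.log1p(np.exp(y_hat)))   # + (1 - y_i) * ln(1 + e^{y_hat_i})
l2_reg = lam * np.sum(w ** 2)                                  # Omega(w) = lambda * ||w||^2

objective = square_loss + l2_reg   # training loss + regularization
print(square_loss, logistic_loss, l2_reg, objective)
```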
Formal definition of the task
Revisiting Random Forest
- How to split the node?
- Heuristic method, entropy
- How to aggregate the outcome?
- Voting for classification
- Average for regression
- How to prevent overfitting?
- Max depth
- Pruning by impurity
Revisiting Random Forest
- How to split the node?
- Heuristic method, entropy - Bad
- How to aggregate the outcome?
- Voting for classification - Good
- Average for regression - Good
- How to prevent overfitting?
- Max depth - Bad
- Pruning by impurity - Bad
Adapting the optimization view to tree problems
- Entropy -> training loss
- Pruning -> regularization defined by #nodes
- Max depth -> constraint on the function space
- Smoothing leaf values -> L2 regularization on leaf weights
Trees in the framework \(Obj(\Theta) = L(\Theta) + \Omega(\Theta)\)
Beautiful!
Outline
- The Forest Model
- Why We Need Another Tree Model
- Gradient Boosting (How do we Learn)
- System Designing Tricks
How do we Learn
- Objective: \(\sum^n_{i=1} l(y_i, \hat{y}_i) + \sum_k\Omega(f_k), f_k \in \textbf{F} \)
- We cannot use methods such as SGD, since the \(f_k\) are trees rather than numerical parameter vectors
- Solution: Additive Training (Boosting) [Friedman 99]
- Start from a constant prediction and add a new function each time
\(\hat{y}^{(0)}_i = 0\)
\(\hat{y}^{(1)}_i = f_1(x_i) = \hat{y}^{(0)}_i + f_1(x_i) \)
\(\hat{y}^{(2)}_i = f_1(x_i) + f_2(x_i)= \hat{y}^{(1)}_i + f_2(x_i) \)
...
\(\hat{y}^{(t)}_i = \sum^t_{k=1} f_k(x_i) = \hat{y}^{(t-1)}_i + f_t(x_i) \)
Boosted Tree Algorithm
- Add a new tree in each iteration
- Grow a tree \(f_t(x)\)
- Add \(f_t(x)\) to the model: \(\hat{y}^{(t)}_i = \hat{y}^{(t-1)}_i + f_t(x_i) \)
- Usually, instead we do \(\hat{y}^{(t)}_i = \hat{y}^{(t-1)}_i + \epsilon f_t(x_i)\)
- \(\epsilon\) is called step-size or shrinkage, usually set around 0.1
- This means we do not do full optimization in each step, reserving the chance for future rounds, which helps prevent overfitting (a toy version of this loop is sketched below)
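A toy sketch of this additive training loop with shrinkage, using shallow scikit-learn regression trees as base learners and square loss (for which the residual plays the role of the negative gradient). This is an illustration of the idea, not XGBoost's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

eps = 0.1                      # step-size / shrinkage
y_pred = np.zeros_like(y)      # y_hat^(0) = 0
trees = []

for t in range(100):
    residual = y - y_pred                                  # negative gradient of square loss
    f_t = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    y_pred += eps * f_t.predict(X)                         # y_hat^(t) = y_hat^(t-1) + eps * f_t(x)
    trees.append(f_t)

print("training MSE:", np.mean((y - y_pred) ** 2))
```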
How do we get \(f_t(x)\)
- Optimize the objective!!
- The prediction at round t is \(\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)\)
\(Obj^{(t)} = \sum^n_{i=1}l(y_i,\hat{y}_i^{(t)}) + \sum^t_{k=1}\Omega(f_k)\)
\(\ \ \ \ \ \ \ \ \ \ = \sum^n_{i=1}l(y_i,\hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + const\)
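The step the later slides rely on (it is what makes the leaf-weight formula \(w^*_j=-\frac{G_j}{H_j+\lambda}\) work out) is a second-order Taylor expansion of \(l\) around \(\hat{y}^{(t-1)}_i\), with \(g_i\) and \(h_i\) the first and second derivatives of the loss at that point:

\(Obj^{(t)} \simeq \sum^n_{i=1}\left[l(y_i,\hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2}h_i f^2_t(x_i)\right] + \Omega(f_t) + const\)

Grouping the examples by the leaf \(j\) they fall into (index set \(I_j\)) and writing \(G_j = \sum_{i \in I_j} g_i\), \(H_j = \sum_{i \in I_j} h_i\):

\(Obj^{(t)} = \sum^T_{j=1}\left[G_j w_j + \frac{1}{2}(H_j + \lambda)w^2_j\right] + \gamma T + const\)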
What is \(l\)?
- The loss function
- XGBoost exposes it for user customization (see the sketch below)
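For instance, XGBoost's Python API accepts a custom objective that returns the per-example gradient and hessian. The squared-error objective and synthetic data below are only a sketch of the mechanism.

```python
import numpy as np
import xgboost as xgb

# Synthetic regression data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
dtrain = xgb.DMatrix(X, label=y)

def squared_error(preds, dtrain):
    """Custom objective: gradient and hessian of 1/2 * (y_hat - y)^2 per example."""
    labels = dtrain.get_label()
    grad = preds - labels          # first derivative w.r.t. the prediction
    hess = np.ones_like(preds)     # second derivative is constant
    return grad, hess

booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=squared_error)
```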
What is \(\Omega\)?
- \(\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum^T_{j=1}w^{2}_j \), where \(T\) is the number of leaves and \(w_j\) are the leaf weights
- For example, a tree with \(T = 3\) leaves whose weights square to 4, 0.01 and 1 gives \( \Omega = 3 \gamma + \frac{1}{2} \lambda (4 + 0.01 + 1) \)
The Structure Score (Obj) Calculation
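Concretely, following the second-order expansion above: for a fixed tree structure \(q\), each leaf weight is a free parameter, and plugging in the minimizer \(w^*_j=-\frac{G_j}{H_j+\lambda}\) gives the structure score

\(Obj^{(t)} = -\frac{1}{2}\sum^T_{j=1}\frac{G_j^2}{H_j+\lambda} + \gamma T\)

The smaller this score, the better the structure. This is the standard closed form and agrees with the optimal leaf weight quoted on the next slides.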
Revisiting Random Forest
- How to split the node?
- Training loss
- How to aggregate the outcome?
- How to prevent overfitting?
Searching Algorithm for a Single Tree
- Enumerate the possible tree structures q
- Calculate the structure score for each q, using the scoring function
- Find the best tree structure, and use the optimal leaf weight: \(w^*_j=-\frac{G_j}{H_j+\lambda} \)
- But... there are infinitely many possible tree structures.
Greedy Learning of the Tree
- In practice, we grow the tree greedily
- Start from a tree with depth 0
- For each leaf node of the tree, try to add a split. The change of objective after adding the split is
\(Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma\)
i.e., the score of the left child, plus the score of the right child, minus the score if we do not split, minus the complexity cost of introducing an additional leaf.
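A simplified single-feature version of this exact greedy scan, written against the \(G\)/\(H\) statistics above. The default \(\lambda\) and \(\gamma\) values are placeholders; this is a sketch of the technique, not XGBoost's actual implementation.

```python
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy scan over one feature.

    x: feature values; g, h: per-example gradients and hessians.
    Returns (best_gain, best_threshold).
    """
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:            # cannot separate identical feature values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam)
                      + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (x[i] + x[i + 1])
    return best_gain, best_thr
```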
Revisiting Random Forest
- How to split the node?
- How to aggregate the outcome?
- How to prevent overfitting?
- Pruning by training loss
Pruning and Regularization
- Recall that the gain of a split can be negative
- This happens when the training loss reduction is smaller than the regularization cost
- It is a trade-off between simplicity and predictiveness
- Pre-stopping
- Stop splitting if the best split has negative gain
- But a split may still benefit future splits...
- Post-pruning
- Grow a tree to maximum depth, then recursively prune all the leaf splits with negative gain
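In XGBoost's Python API these ideas surface as hyperparameters; the values below are only an illustrative (hypothetical) configuration, not recommended defaults.

```python
import xgboost as xgb

params = {
    "max_depth": 6,   # constraint on the function space
    "eta": 0.1,       # shrinkage / step-size epsilon
    "gamma": 1.0,     # gamma: minimum loss reduction required to keep a split
    "lambda": 1.0,    # L2 regularization on leaf weights
}
# Assuming a DMatrix `dtrain` like the one built earlier:
# booster = xgb.train(params, dtrain, num_boost_round=100)
```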
Outline
- The Forest Model
- Why We Need Another Tree Model
- Gradient Boosting (How do we Learn)
- System Designing Tricks
Approximate Split Algorithm
- Traversing all the data points at each split is time-consuming!
- Using quantiles!
- Global quantile
- Local quantile
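A rough sketch of proposing global quantile candidates for one feature, using plain unweighted percentiles. XGBoost actually uses a weighted quantile sketch, which this simplification ignores.

```python
import numpy as np

def global_quantile_candidates(x, n_bins=32):
    """Propose split candidates as (roughly) evenly spaced quantiles of feature x."""
    qs = np.linspace(0, 100, n_bins + 1)[1:-1]   # interior percentiles
    return np.unique(np.percentile(x, qs))

x = np.random.randn(10_000)
candidates = global_quantile_candidates(x)
# Only these candidate thresholds are evaluated, instead of every data point.
```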
Sparsity-aware Split Algorithm
- Sparsity can be caused by:
- Missing values
- One-hot encoding
Solution: skip missing values and give them a default direction at each split
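One way to picture the default-direction choice: accumulate the gradient statistics of the missing-value examples separately, then try routing them left or right and keep whichever side yields the higher gain. This is illustrative Python, not the actual implementation.

```python
import numpy as np

def gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Split gain under the structure-score view."""
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                  - G**2 / (H + lam)) - gamma

def choose_default_direction(G_L, H_L, G_R, H_R, G_miss, H_miss):
    """Route the aggregated statistics of missing-value rows left vs. right,
    and keep whichever direction gives the higher gain."""
    gain_left = gain(G_L + G_miss, H_L + H_miss, G_R, H_R)
    gain_right = gain(G_L, H_L, G_R + G_miss, H_R + H_miss)
    return ("left", gain_left) if gain_left >= gain_right else ("right", gain_right)
```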
Block Structure
- Sorting the data at every step of split finding is expensive
- Instead, precompute a block structure in which each column is stored pre-sorted
- Watch out for cache misses!
- Fetch data in bulk (prefetching)
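A minimal illustration of the precomputation idea: sort each column once up front and reuse that order at every split, rather than re-sorting inside the split loop. This is an in-memory toy with a random matrix, not XGBoost's compressed column blocks.

```python
import numpy as np

X = np.random.rand(1_000, 20)

# Precompute, once, the row order that sorts each column ("the block").
sorted_index = np.argsort(X, axis=0)

def scan_feature(X, sorted_index, j):
    """Visit feature j in ascending order without re-sorting inside the split loop."""
    return X[sorted_index[:, j], j]
```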
Outline
- The Forest Model
- Why We Need Another Tree Model
- Gradient Boosting (How do we Learn)
- System Designing Tricks
Take home messages
- Optimizing non-smooth functions? Try the trick used for gradient boosted trees
- Computers have their own ways to accelerate; make use of them (cache, pipeline)
Outline
- The Forest Model
- Why We Need Another Tree Model
- Gradient Boosting (How do we Learn)
- System Designing Tricks
Tips to help make your open source project successful
Documents
- User documentation and developer documentation
- Docstrings plus GitBook or Read the Docs are helpful
- Explain what it is, what the use cases are, and how it performs
Code quality
- Extensible module system
- Meaningful function and variable names
- Critical parts should be documented
- Users are also developers and can help you!
Branding your project
- Think about the usage scenarios
- Make it compatible with users' everyday tools
- For data science, make your tool work in distributed settings
- Post news on Hacker News and Reddit
Introduction to XGBoost
By Weiyüen Wu