Introduction to XGBoost
Outline
- The canonical forest model
- Why we need another tree model
- Regression Tree and Ensemble (What are we Learning)
- Gradient Boosting (How do we Learn)
- Experiment
Outline
- The forest model
- Why we need another tree model
- Regression Tree and Ensemble (What are we Learning)
- Gradient Boosting (How do we Learn)
- Experiment
Forest model
- A bunch of trees
- Output the aggregation of the outcomes of the individual trees
Example
Problems in tree models
- How to split the node?
- How to aggregate the outcome?
- How to prevent overfitting?
Outline
- The Forest Model
- Why We Need Another Tree Model
- Gradient Boosting (How do we Learn)
- System Designing Tricks
- Take Home Message
- Tips for Open Source
Common setting of classification & regression problems
A classification or regression problem:
- Find \(F(\textbf{x}) = f(\textbf{x};\Theta)\), where \(\Theta\) is the parameter
- So that \(F\) outputs a class label \(\hat{y}\) on input \(\textbf{x}\) (classification)
- or \(F\) outputs a real number on input \(\textbf{x}\) (regression)
How to train it
- Increase accuracy
- Define the loss \(L = \sum_i \ell(y_i, \hat{y}_i)\) on the training data
- Minimize the loss: \(\min_\Theta \sum_i \ell(y_i, \hat{y}_i)\)
- Prevent overfitting
- Add regularization: \(\sum_i \ell(y_i, \hat{y}_i)+\Omega(\Theta)\)
- Minimize the whole regularized objective
- The objective function that appears everywhere: training loss + regularization (a toy numeric check follows below)
- Loss on training data: \(L = \sum^{n}_{i=1} l(y_i, \hat{y}_i)\)
- Square loss: \(l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2\)
- Logistic loss: \(l(y_i, \hat{y}_i) = y_i \ln(1 + e^{-\hat{y}_i}) + (1-y_i)\ln(1+e^{\hat{y}_i})\)
- Regularization: how complicated is the model?
- L2 norm: \(\Omega(w) = \lambda \|w\|^2\)
- L1 norm (lasso): \(\Omega(w) = \lambda \|w\|_1\)
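As a concrete toy check of these formulas, here is a minimal Python sketch; the labels, predictions, and weights are made up purely for illustration.

```python
import numpy as np

# Made-up labels, predictions, and leaf weights, purely for illustration.
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([2.3, -1.2, 0.7])
w = np.array([0.5, -0.2, 0.1])
lam = 1.0

square_loss = np.sum((y - y_hat) ** 2)                        # sum_i (y_i - y_hat_i)^2
logistic_loss = np.sum(y * np.log1p(np.exp(-y_hat))           # y_i * ln(1 + e^{-y_hat_i})
                       + (1 - y) * np.log1p(np.exp(y_hat)))   # + (1 - y_i) * ln(1 + e^{y_hat_i})
l2_reg = lam * np.sum(w ** 2)                                  # Omega(w) = lambda * ||w||^2

objective = square_loss + l2_reg   # training loss + regularization
print(square_loss, logistic_loss, l2_reg, objective)
```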
Formal definition of the task
Revisiting Random Forest
- How to split the node?
- Heuristic method, entropy
- How to aggregate the outcome?
- Voting for classification
- Average for regression
- How to prevent overfitting?
- Max depth
- Pruning by impurity
Revisiting Random Forest
- How to split the node?
- Heuristic method, entropy - Bad
- How to aggregate the outcome?
- Voting for classification - Good
- Average for regression - Good
- How to prevent overfitting?
- Max depth - Bad
- Pruning by impurity - Bad
Adapting the optimization view to tree problems
- Entropy -> training loss
- Pruning -> regularization defined by #nodes
- Max depth -> constraint on the function space
- Smoothing leaf values -> L2 regularization on leaf weights
Trees in the framework \(Obj(\Theta) = L(\Theta) + \Omega(\Theta)\)
Beautiful!
Outline
- The Forest Model
- Why We Need Another Tree Model
- Gradient Boosting (How do we Learn)
- System Designing Tricks
How do we Learn
- Objective: \(\sum^n_{i=1} l(y_i, \hat{y}_i) + \sum_k\Omega(f_k), f_k \in \textbf{F} \)
- We cannot use methods such as SGD, since the \(f_k\) are trees rather than numerical parameter vectors
- Solution: Additive Training (Boosting) [Friedman 99]
- Start from a constant prediction and add a new function each time
\(\hat{y}^{(0)}_i = 0\)
\(\hat{y}^{(1)}_i = f_1(x_i) = \hat{y}^{(0)}_i + f_1(x_i) \)
\(\hat{y}^{(2)}_i = f_1(x_i) + f_2(x_i)= \hat{y}^{(1)}_i + f_2(x_i) \)
...
\(\hat{y}^{(t)}_i = \sum^t_{k=1} f_k(x_i) = \hat{y}^{(t-1)}_i + f_t(x_i) \)
Boosted Tree Algorithm
- Add a new tree in each iteration
- Grow a tree \(f_t(x)\)
- Add \(f_t(x)\) to the model: \(\hat{y}^{(t)}_i = \hat{y}^{(t-1)}_i + f_t(x_i) \)
- Usually, instead we do \(\hat{y}^{(t)}_i = \hat{y}^{(t-1)}_i + \epsilon f_t(x_i)\)
- \(\epsilon\) is called step-size or shrinkage, usually set around 0.1
- This means we do not do full optimization in each step, reserving the chance for future rounds, which helps prevent overfitting (a toy version of this loop is sketched below)
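A toy sketch of this additive training loop with shrinkage, using shallow scikit-learn regression trees as base learners and square loss (for which the residual plays the role of the negative gradient). This is an illustration of the idea, not XGBoost's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

eps = 0.1                      # step-size / shrinkage
y_pred = np.zeros_like(y)      # y_hat^(0) = 0
trees = []

for t in range(100):
    residual = y - y_pred                                  # negative gradient of square loss
    f_t = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    y_pred += eps * f_t.predict(X)                         # y_hat^(t) = y_hat^(t-1) + eps * f_t(x)
    trees.append(f_t)

print("training MSE:", np.mean((y - y_pred) ** 2))
```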
How do we get \(f_t(x)\)
- Optimize the objective!!
- The prediction at round t is \(\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)\)
\(Obj^{(t)} = \sum^n_{i=1}l(y_i,\hat{y}_i^{(t)}) + \sum^t_{k=1}\Omega(f_k)\)
\(\ \ \ \ \ \ \ \ \ \ = \sum^n_{i=1}l(y_i,\hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + const\)
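The step the later slides rely on (it is what makes the leaf-weight formula \(w^*_j=-\frac{G_j}{H_j+\lambda}\) work out) is a second-order Taylor expansion of \(l\) around \(\hat{y}^{(t-1)}_i\), with \(g_i\) and \(h_i\) the first and second derivatives of the loss at that point:

\(Obj^{(t)} \simeq \sum^n_{i=1}\left[l(y_i,\hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2}h_i f^2_t(x_i)\right] + \Omega(f_t) + const\)

Grouping the examples by the leaf \(j\) they fall into (index set \(I_j\)) and writing \(G_j = \sum_{i \in I_j} g_i\), \(H_j = \sum_{i \in I_j} h_i\):

\(Obj^{(t)} = \sum^T_{j=1}\left[G_j w_j + \frac{1}{2}(H_j + \lambda)w^2_j\right] + \gamma T + const\)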
What is \(l\)?
- The loss function
- XGBoost exposes it for user customization (see the sketch below)
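For instance, XGBoost's Python API accepts a custom objective that returns the per-example gradient and hessian. The squared-error objective and synthetic data below are only a sketch of the mechanism.

```python
import numpy as np
import xgboost as xgb

# Synthetic regression data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
dtrain = xgb.DMatrix(X, label=y)

def squared_error(preds, dtrain):
    """Custom objective: gradient and hessian of 1/2 * (y_hat - y)^2 per example."""
    labels = dtrain.get_label()
    grad = preds - labels          # first derivative w.r.t. the prediction
    hess = np.ones_like(preds)     # second derivative is constant
    return grad, hess

booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=squared_error)
```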
What is \(\Omega\)?
- \(\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum^T_{j=1}w^{2}_j \), where \(T\) is the number of leaves and \(w_j\) are the leaf weights
- For example, a tree with \(T = 3\) leaves whose weights square to 4, 0.01 and 1 gives \( \Omega = 3 \gamma + \frac{1}{2} \lambda (4 + 0.01 + 1) \)
The Structure Score (Obj) Calculation
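Concretely, following the second-order expansion above: for a fixed tree structure \(q\), each leaf weight is a free parameter, and plugging in the minimizer \(w^*_j=-\frac{G_j}{H_j+\lambda}\) gives the structure score

\(Obj^{(t)} = -\frac{1}{2}\sum^T_{j=1}\frac{G_j^2}{H_j+\lambda} + \gamma T\)

The smaller this score, the better the structure. This is the standard closed form and agrees with the optimal leaf weight quoted on the next slides.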
Revisiting Random Forest
- How to split the node?
- Training loss
- How to aggregate the outcome?
- How to prevent overfitting?
Searching Algorithm for a Single Tree
- Enumerate the possible tree structures q
- Calculate the structure score for each q, using the scoring function
- Find the best tree structure, and use the optimal leaf weight: \(w^*_j=-\frac{G_j}{H_j+\lambda} \)
- But... there are infinitely many possible tree structures.
Greedy Learning of the Tree
- In practice, we grow the tree greedily
- Start from a tree with depth 0
- For each leaf node of the tree, try to add a split. The change of objective after adding the split is
\(Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma\)
i.e., the score of the left child, plus the score of the right child, minus the score if we do not split, minus the complexity cost of introducing an additional leaf.
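A simplified single-feature version of this exact greedy scan, written against the \(G\)/\(H\) statistics above. The default \(\lambda\) and \(\gamma\) values are placeholders; this is a sketch of the technique, not XGBoost's actual implementation.

```python
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy scan over one feature.

    x: feature values; g, h: per-example gradients and hessians.
    Returns (best_gain, best_threshold).
    """
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:            # cannot separate identical feature values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam)
                      + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (x[i] + x[i + 1])
    return best_gain, best_thr
```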
Revisiting Random Forest
- How to split the node?
- How to aggregate the outcome?
- How to prevent overfitting?
- Pruning by training loss
Pruning and Regularization
- Recall that the gain of a split can be negative
- This happens when the training loss reduction is smaller than the regularization cost
- It is a trade-off between simplicity and predictiveness
- Pre-stopping
- Stop splitting if the best split has negative gain
- But a split may still benefit future splits...
- Post-pruning
- Grow a tree to maximum depth, then recursively prune all the leaf splits with negative gain
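In XGBoost's Python API these ideas surface as hyperparameters; the values below are only an illustrative (hypothetical) configuration, not recommended defaults.

```python
import xgboost as xgb

params = {
    "max_depth": 6,   # constraint on the function space
    "eta": 0.1,       # shrinkage / step-size epsilon
    "gamma": 1.0,     # gamma: minimum loss reduction required to keep a split
    "lambda": 1.0,    # L2 regularization on leaf weights
}
# Assuming a DMatrix `dtrain` like the one built earlier:
# booster = xgb.train(params, dtrain, num_boost_round=100)
```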
Outline
- The Forest Model
- Why We Need Another Tree Model
- Gradient Boosting (How do we Learn)
- System Designing Tricks
Approximate Split Algorithm
- Traversing all the data points at each split is time-consuming!
- Using quantiles!
- Global quantile
- Local quantile
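A rough sketch of proposing global quantile candidates for one feature, using plain unweighted percentiles. XGBoost actually uses a weighted quantile sketch, which this simplification ignores.

```python
import numpy as np

def global_quantile_candidates(x, n_bins=32):
    """Propose split candidates as (roughly) evenly spaced quantiles of feature x."""
    qs = np.linspace(0, 100, n_bins + 1)[1:-1]   # interior percentiles
    return np.unique(np.percentile(x, qs))

x = np.random.randn(10_000)
candidates = global_quantile_candidates(x)
# Only these candidate thresholds are evaluated, instead of every data point.
```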
Sparsity-aware Split Algorithm
- Sparsity can be caused by:
- Missing values
- One-hot encoding
Solution: skip missing values and give them a default direction at each split
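One way to picture the default-direction choice: accumulate the gradient statistics of the missing-value examples separately, then try routing them left or right and keep whichever side yields the higher gain. This is illustrative Python, not the actual implementation.

```python
import numpy as np

def gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Split gain under the structure-score view."""
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                  - G**2 / (H + lam)) - gamma

def choose_default_direction(G_L, H_L, G_R, H_R, G_miss, H_miss):
    """Route the aggregated statistics of missing-value rows left vs. right,
    and keep whichever direction gives the higher gain."""
    gain_left = gain(G_L + G_miss, H_L + H_miss, G_R, H_R)
    gain_right = gain(G_L, H_L, G_R + G_miss, H_R + H_miss)
    return ("left", gain_left) if gain_left >= gain_right else ("right", gain_right)
```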
Block Structure
- Sorting the data at every step of split finding is expensive
- Instead, precompute a block structure in which each column is stored pre-sorted
- Watch out for cache misses!
- Fetch data in bulk (prefetching)
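A minimal illustration of the precomputation idea: sort each column once up front and reuse that order at every split, rather than re-sorting inside the split loop. This is an in-memory toy with a random matrix, not XGBoost's compressed column blocks.

```python
import numpy as np

X = np.random.rand(1_000, 20)

# Precompute, once, the row order that sorts each column ("the block").
sorted_index = np.argsort(X, axis=0)

def scan_feature(X, sorted_index, j):
    """Visit feature j in ascending order without re-sorting inside the split loop."""
    return X[sorted_index[:, j], j]
```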
Outline
- The Forest Model
- Why We Need Another Tree Model
- Gradient Boosting (How do we Learn)
- System Designing Tricks
Take home messages
- Optimizing non-smooth functions? Try the trick used for gradient boosted trees
- Computers have their own ways to accelerate; make use of them (cache, pipeline)
Outline
- The Forest Model
- Why We Need Another Tree Model
- Gradient Boosting (How do we Learn)
- System Designing Tricks
Tips to help make your open source project successful
Documents
- User documentation and developer documentation
- Docstrings plus GitBook or Read the Docs are helpful
- Explain what it is, what the use cases are, and how it performs
Code quality
- Extensible module system
- Meaningful function and variable names
- Critical parts should be documented
- Users are also developers and can help you!
Branding your project
- Think about the usage scenarios
- Make it compatible with users' everyday tools
- For data science, make your tool work in distributed settings
- Post news on Hacker News and Reddit
Introduction to XGBoost
By Weiyüen Wu