Introduction to XGBoost

Outline

  • The canonical forest model
  • Why we need another tree model
  • Regression Tree and Ensemble (What are we Learning)
  • Gradient Boosting (How do we Learn)​
  • Experiment

Forest model

  • A bunch of trees
  • Output the aggregation of the outcome of each tree

Example

Problems in tree models

  • How to split the node?
  • How to aggregate the outcome?
  • How to prevent overfitting?

Outline

  • The Forest Model
  • Why We Need Another Tree Model
  • Gradient Boosting (How do we Learn)​
  • System Designing Tricks
  • Take Home Message
  • Tips for Open Source

Common Setting of Classification & Regression Problems

A classification or regression problem:

  • Find \(F(\textbf{x}) = f(\textbf{x};\Theta)\), where \(\Theta\) are the parameters
  • so that \(F\) outputs a class label \(\hat{y}\) on input \(\textbf{x}\) (classification)
  • or \(F\) outputs a real number on input \(\textbf{x}\) (regression)

How to train it

  • Increasing the accuracy
    • Define the loss \(L = \sum_i \ell(y_i, \hat{y}_i)\) on the training data
    • Minimize the loss: \(\min_\Theta \sum_i \ell(y_i, \hat{y}_i)\)
  • Preventing overfitting
    • Add regularization: \(\sum_i \ell(y_i, \hat{y}_i) + \Omega(\Theta)\)
    • Minimize the whole thing
  • The objective function, which appears everywhere (a code sketch follows this list):

\(Obj(\Theta) = L(\Theta) + \Omega(\Theta)\)
  • Loss on training data: \(L = \sum^{n}_{i=1} l(y_i, \hat{y}_i)\)
    • Square loss: \(l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2\)
    • Logistic loss: \(l(y_i, \hat{y}_i) = y_i \ln(1 + e^{-\hat{y}_i}) + (1-y_i)\ln(1+e^{\hat{y}_i})\)
  • Regularization: how complicated is the model?
    • L2 norm: \(\Omega(w) = \lambda \|w\|^2\)
    • L1 norm (lasso): \(\Omega(w) = \lambda \|w\|_1\)
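To make the objective concrete, here is a minimal sketch, not XGBoost code: the linear model, the square loss, and the names objective, X, y, w, lam are chosen purely for illustration.

```python
import numpy as np

def objective(w, X, y, lam=1.0):
    """Obj(w) = training loss + regularization (a minimal sketch).

    A linear model y_hat = X @ w is assumed purely for illustration;
    the same loss + penalty structure applies to any model with parameters w.
    """
    y_hat = X @ w
    loss = np.sum((y - y_hat) ** 2)   # square loss L
    reg = lam * np.sum(w ** 2)        # L2 regularization Omega
    return loss + reg
```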

Formal definition of the task

Revisiting Random Forest

  • How to split the node?
    • Heuristic method, entropy
  • How to aggregate the outcome?
    • Voting for classification
    • Average for regression
  • How to prevent overfitting?
    • Max depth
    • Pruning by impurity

Revisiting Random Forest

  • How to split the node?
    • Heuristic method, entropy - Bad
  • How to aggregate the outcome?
    • Voting for classification - Good
    • Average for regression - Good
  • How to prevent overfitting?
    • Max depth - Bad
    • Pruning by impurity - Bad

Adapting optimization view to tree problems

  • Entropy -> training loss
  • Pruning -> regularization defined by #nodes
  • Max depth -> constraint on the function space
  • Smoothing leaf values -> L2 regularization on leaf weights

Tree in \(Obj(\Theta) = L(\Theta) + \Omega(\Theta)\)

Beautiful!

Outline

  • The Forest Model
  • Why We Need Another Tree Model
  • Gradient Boosting (How do we Learn)​
  • System Designing Tricks

How do we Learn

  • Objective: \(\sum^n_{i=1} l(y_i, \hat{y}_i) + \sum_k\Omega(f_k),\ f_k \in \textbf{F} \)
  • We cannot use SGD: the \(f_k\) are trees (functions with structure), not numerical vectors
  • Solution: Additive Training (Boosting) [Friedman 99]
    • Starting from a constant prediction, add a new function each time

\(\hat{y}^{(0)}_i = 0\)

\(\hat{y}^{(1)}_i = f_1(x_i) = \hat{y}^{(0)}_i + f_1(x_i) \)

\(\hat{y}^{(2)}_i = f_1(x_i) + f_2(x_i)= \hat{y}^{(1)}_i + f_2(x_i) \)

...

\(\hat{y}^{(t)}_i = \sum^t_{k=1} f_k(x_i) = \hat{y}^{(t-1)}_i + f_t(x_i) \)
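A minimal sketch of how these staged predictions accumulate, assuming `trees` is any sequence of already-fitted regressors with a `predict` method (the names are illustrative, not XGBoost internals):

```python
import numpy as np

def staged_predictions(trees, X):
    """Yield y_hat^(0), y_hat^(1), ..., y_hat^(t) as trees are added."""
    y_hat = np.zeros(X.shape[0])       # y_hat^(0) = 0
    yield y_hat.copy()
    for f in trees:                    # y_hat^(k) = y_hat^(k-1) + f_k(x)
        y_hat = y_hat + f.predict(X)
        yield y_hat.copy()
```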

Boosted Tree Algorithm

  • Add a new tree in each iteration
  • Grow a tree \(f_t(x)\)
  • Add \(f_t(x)\) to the model: \(\hat{y}^{(t)}_i = \hat{y}^{(t-1)}_i + f_t(x_i) \)
    • Usually, instead we do \(\hat{y}^{(t)}_i = \hat{y}^{(t-1)}_i + \epsilon f_t(x_i)\)
    • \(\epsilon\) is called the step size or shrinkage, usually set around 0.1
    • This means we do not fully optimize in each step, leaving room for future rounds, which helps prevent overfitting (see the sketch after this list)
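For the square loss the negative gradient is just the residual, so a minimal boosting loop with shrinkage can be sketched as below (scikit-learn's DecisionTreeRegressor stands in as the base learner; this is an illustration, not XGBoost's implementation). In the xgboost library itself the shrinkage is the eta / learning_rate parameter.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_rounds=100, eps=0.1, max_depth=3):
    """Additive training with shrinkage, sketched for the square loss."""
    y_pred = np.zeros(len(y))           # start from constant prediction 0
    trees = []
    for _ in range(n_rounds):
        residual = y - y_pred           # negative gradient of the square loss
        f_t = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        y_pred += eps * f_t.predict(X)  # y^(t) = y^(t-1) + eps * f_t(x)
        trees.append(f_t)
    return trees
```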

How do we get \(f_t(x)\)

  • Optimize the objective!!
  • The prediction at round t is \(\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)\)

\(Obj^{(t)} = \sum^n_{i=1}l(y_i,\hat{y}_i^{(t)}) + \sum^t_{i=1}\Omega(f_i)\)

\(\ \ \ \ \ \ \ \ \ \ = \sum^n_{i=1}l(y_i,\hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + const\)
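Since \(\hat{y}^{(t-1)}_i\) is fixed at round \(t\), we take a second-order Taylor expansion of \(l\) around it. With \(g_i = \partial_{\hat{y}^{(t-1)}_i} l(y_i, \hat{y}^{(t-1)}_i)\) and \(h_i = \partial^2_{\hat{y}^{(t-1)}_i} l(y_i, \hat{y}^{(t-1)}_i)\), and dropping constant terms:

\(Obj^{(t)} \approx \sum^n_{i=1}\big[g_i f_t(x_i) + \frac{1}{2} h_i f^2_t(x_i)\big] + \Omega(f_t)\)

Only \(g_i\) and \(h_i\) depend on the loss, which is why \(l\) can be left to the user.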

What is \(l\)

  • Loss function
  • Available for user customization

What is \(\Omega\)

  • \(\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum^T_{j=1}w^{2}_j \)

Example: for a tree with \(T = 3\) leaves whose leaf weights squared are 4, 0.01 and 1, \( \Omega = 3 \gamma + \frac{1}{2} \lambda (4 + 0.01 + 1) \)

The Structure Score (Obj) Calculation
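Group the training points by the leaf they fall into, \(I_j = \{i \mid q(x_i) = j\}\), and write \(G_j = \sum_{i \in I_j} g_i\), \(H_j = \sum_{i \in I_j} h_i\). Since \(f_t(x_i) = w_{q(x_i)}\), the objective becomes

\(Obj^{(t)} \approx \sum^T_{j=1}\big[G_j w_j + \frac{1}{2}(H_j + \lambda)w^2_j\big] + \gamma T\)

For a fixed tree structure \(q\), each leaf weight can be optimized independently, giving the optimal weight \(w^*_j = -\frac{G_j}{H_j+\lambda}\) and the structure score

\(Obj^* = -\frac{1}{2}\sum^T_{j=1}\frac{G^2_j}{H_j+\lambda} + \gamma T\)

The smaller the score, the better the tree structure.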

Revisiting Random Forest

  • How to split the node?
    • Training loss
  • How to aggregate the outcome?
  • How to prevent overfitting?

Searching Algorithm for a Single Tree

  • Enumerate the possible tree structures \(q\)
  • Calculate the structure score for each \(q\) using the scoring function above (sketched in code below)
  • Find the best tree structure and use the optimal leaf weights: \(w^*_j=-\frac{G_j}{H_j+\lambda} \)
  • But... there can be infinitely many possible tree structures
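A minimal code sketch of that scoring function, assuming the per-instance gradients g, hessians h, and a leaf assignment array are given (all names here are illustrative, not XGBoost internals):

```python
import numpy as np

def structure_score(g, h, leaf_of, n_leaves, lam=1.0, gamma=0.0):
    """Obj* = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T (lower is better).

    leaf_of[i] is the (non-negative integer) index of the leaf instance i falls into.
    """
    G = np.bincount(leaf_of, weights=g, minlength=n_leaves)  # G_j per leaf
    H = np.bincount(leaf_of, weights=h, minlength=n_leaves)  # H_j per leaf
    return -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * n_leaves

def optimal_leaf_weights(g, h, leaf_of, n_leaves, lam=1.0):
    """w*_j = -G_j / (H_j + lambda)."""
    G = np.bincount(leaf_of, weights=g, minlength=n_leaves)
    H = np.bincount(leaf_of, weights=h, minlength=n_leaves)
    return -G / (H + lam)
```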

 

Greedy Learning of the Tree

  • In practice, we grow the tree greedily
    • Start from tree with depth 0
    • For each leaf node of the tree, try to add a split. The change of objective after adding the split (sketched in code below) is:

\(Gain = \frac{1}{2}\Big[\frac{G^2_L}{H_L+\lambda} + \frac{G^2_R}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\Big] - \gamma\)

  • \(\frac{G^2_L}{H_L+\lambda}\): the score of the left child
  • \(\frac{G^2_R}{H_R+\lambda}\): the score of the right child
  • \(\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\): the score if we do not split
  • \(\gamma\): the complexity cost of introducing an additional leaf
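Below is a minimal sketch of evaluating this gain while scanning one feature, assuming per-instance gradients g and hessians h are available (function and variable names are illustrative, not XGBoost internals):

```python
import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children; a negative gain means
    the complexity cost gamma outweighs the training-loss reduction."""
    def score(G, H):
        return G ** 2 / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan the sorted values of one feature and return (best_gain, threshold)."""
    order = np.argsort(x)
    g, h = g[order], h[order]
    G_total, H_total = g.sum(), h.sum()
    G_L = H_L = 0.0
    best = (-np.inf, None)
    for k in range(len(x) - 1):                 # split between positions k and k+1
        G_L += g[k]; H_L += h[k]
        gain = split_gain(G_L, H_L, G_total - G_L, H_total - H_L, lam, gamma)
        if gain > best[0]:
            best = (gain, (x[order[k]] + x[order[k + 1]]) / 2)
    return best
```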

Revisiting Random Forest

  • How to split the node?
  • How to aggregate the outcome?
  • How to prevent overfitting?
    • Pruning by training loss

Pruning and Regularization

\(Gain = \frac{1}{2}\Big[\frac{G^2_L}{H_L+\lambda} + \frac{G^2_R}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\Big] - \gamma\)
  • Recall that the gain of a split can be negative
  • This happens when the training loss reduction is smaller than the regularization
  • Trade-off between simplicity and predictiveness (see the parameter sketch after this list)
  • Pre-stopping
    • Stop splitting if the best split has negative gain
    • But a split may benefit future splits...
  • Post-pruning
    • Grow the tree to maximum depth, then recursively prune all leaf splits with negative gain
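In the xgboost library these two levers correspond to real parameters: gamma (alias min_split_loss) is the minimum gain a split must have to be kept, and lambda / reg_lambda is the L2 penalty on leaf weights. A rough usage sketch with placeholder data and values:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 10)   # placeholder data
y = np.random.rand(1000)

model = xgb.XGBRegressor(
    n_estimators=100,    # number of boosting rounds
    learning_rate=0.1,   # shrinkage (eta)
    max_depth=6,         # constraint on the function space
    reg_lambda=1.0,      # L2 regularization on leaf weights (lambda)
    gamma=0.0,           # minimum split gain (complexity cost per leaf)
)
model.fit(X, y)
```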

Outline

  • The Forest Model
  • Why We Need Another Tree Model
  • Gradient Boosting (How do we Learn)​
  • System Designing Tricks

Approximate Split Algorithm

  • Traversing all the data points at every split is time-consuming!
    • Use quantiles as candidate split points! (sketched below)
      • Global quantiles: proposed once per tree
      • Local quantiles: re-proposed after each split
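A minimal sketch of proposing candidate thresholds from quantiles of a single feature column (plain unweighted quantiles for simplicity; XGBoost's actual algorithm uses a weighted quantile sketch):

```python
import numpy as np

def candidate_splits(x, n_candidates=32):
    """Approximate split finding: only evaluate quantile thresholds
    instead of every distinct value of the feature."""
    qs = np.linspace(0, 1, n_candidates + 2)[1:-1]   # interior quantiles
    return np.unique(np.quantile(x, qs))
```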

Sparsity-aware Split Algorithm

  • Sparsity can be caused by:
    • Missing values
    • One-hot encoding

Solution: skip the missing entries when enumerating splits, and learn a default direction for them at each split (sketched below)
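A minimal sketch of the idea, assuming per-instance gradients g and hessians h and using NaN to mark missing values (names are illustrative): missing entries are sent left or right as a block, whichever direction yields the larger gain.

```python
import numpy as np

def gain_with_default_direction(x, g, h, threshold, lam=1.0, gamma=0.0):
    """Return (gain, default_direction) for one threshold on a feature
    that may contain NaNs (missing values)."""
    def score(G, H):
        return G ** 2 / (H + lam)

    missing = np.isnan(x)
    left = (~missing) & (x < threshold)
    right = (~missing) & (x >= threshold)

    G_L, H_L = g[left].sum(), h[left].sum()
    G_R, H_R = g[right].sum(), h[right].sum()
    G_m, H_m = g[missing].sum(), h[missing].sum()
    G_all, H_all = G_L + G_R + G_m, H_L + H_R + H_m

    def gain(GL, HL, GR, HR):
        return 0.5 * (score(GL, HL) + score(GR, HR) - score(G_all, H_all)) - gamma

    gain_left = gain(G_L + G_m, H_L + H_m, G_R, H_R)    # missing go left
    gain_right = gain(G_L, H_L, G_R + G_m, H_R + H_m)   # missing go right
    return max((gain_left, "left"), (gain_right, "right"))
```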

Block Structure

Sorting the data at every step of split finding is expensive.

 

Precomputing a data structure that is sorted on each column
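A minimal sketch of that precomputation: sort each feature's row indices once up front, so split finding can walk the presorted order instead of re-sorting (a rough analogue of XGBoost's column blocks; names are illustrative):

```python
import numpy as np

def build_sorted_blocks(X):
    """Precompute, for each feature, the row indices sorted by that feature.
    Done once per dataset; split finding then just walks these index arrays."""
    return [np.argsort(X[:, j], kind="stable") for j in range(X.shape[1])]
```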

Cache miss!

  • Solution: fetch data in bulk (prefetching)

Outline

  • The Forest Model
  • Why We Need Another Tree Model
  • Gradient Boosting (How do we Learn)​
  • System Designing Tricks

Take home messages

  • Optimizing non-smooth functions? Try the trick used for gradient boosted trees
  • Computers have their own ways to accelerate: make use of them! (cache, pipeline)

Outline

  • The Forest Model
  • Why We Need Another Tree Model
  • Gradient Boosting (How do we Learn)​
  • System Designing Tricks

Tips to help make your open source project successful

Documentation

  • User documentation and developer documentation
  • Docstrings plus GitBook or Read the Docs are helpful
  • Explain what it is, what the use cases are, and how it performs

Code quality

  • Extensible module system
  • Meaningful function and variable names
  • Critical parts should be documented
  • Users are also developers and can help you!

Branding your project

  • Think about the usage scenarios
  • Make it compatible with users' everyday tools
  • For data science, make your tool support distributed computation
  • Post news on Hacker News and Reddit