A Brief Introduction to
Stochastic Gradient Boosting
in Machine Learning
Daina Bouquin
Can a set of weak learners create a single strong learner?
(yes)
Boosting algorithms iteratively learn weak classifiers with respect to a distribution and add them to a final strong classifier
Boosting:
ML ensemble method/metaheuristic
Helps with bias-variance tradeoff (reduces both)
Boosting algorithms:
Weak predictors are weighted in relation to their accuracy
Weighting decorrelates the predictors by focusing on regions missed by past predictors
New predictors learn from previous predictors' mistakes
∴ take fewer iterations to converge
https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/
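For intuition, here is a minimal sketch (mine, not from the talk, assuming scikit-learn is installed): depth-1 decision stumps are weak on their own, but boosting reweights and combines many of them into a much stronger classifier.

```python
# Sketch: a single weak stump vs. a boosted ensemble of stumps (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)   # one weak learner
boosted = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# AdaBoost's default base learner is a depth-1 stump; each round reweights the
# training examples so later stumps focus on the points earlier stumps missed.

print("single stump accuracy:", round(stump.score(X_te, y_te), 3))
print("boosted stumps accuracy:", round(boosted.score(X_te, y_te), 3))
```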
Ensembling
Bagging: handles overfitting (reduces variance); independent classifiers; e.g. Random Forest
Boosting: can overfit; reduces bias & variance; sequential classifiers; e.g. Gradient Boosting
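A rough illustration of the two columns (my own sketch, assuming scikit-learn): fit a bagging-style and a boosting-style ensemble on the same data and compare.

```python
# Sketch: bagging-style vs. boosting-style ensembles on the same regression task.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bagging = RandomForestRegressor(n_estimators=200, random_state=0)       # independent trees, averaged
boosting = GradientBoostingRegressor(n_estimators=200, random_state=0)  # sequential trees, fit to residuals

for name, model in [("random forest", bagging), ("gradient boosting", boosting)]:
    model.fit(X_tr, y_tr)
    print(name, "test R^2:", round(model.score(X_te, y_te), 3))
```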
Boosting helps address the main sources of the gap between actual and predicted values: variance and bias
(noise is the irreducible part)
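For squared-error loss, this is the familiar decomposition behind that claim (not spelled out on the slide):

$$ \mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}} $$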
Boosting with
Gradient Descent
Gradient descent assumes a convex cost function
A local minimum must then be a global minimum
Too much random noise can still be an issue with convex optimization
Non-convex optimization options for boosting exist too, e.g. BrownBoost
If you're worried about local minima, check out warm restarts (SGDR)
*The point of GD is to minimize the cost function*
(find the lowest error value)
https://hackernoon.com/gradient-descent-aynk-7cbe95a778da
GD optimizers use a technique called “annealing” to adapt the learning rate (how small/large a step to take)
The cost J(θ) should decrease at each iteration
The slope of a function is its derivative with respect to a parameter value.
Following the negative of the slope always takes you toward the nearest valley.
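To make that concrete, a minimal sketch (mine, not from the talk) of gradient descent on a convex 1-D cost with a simple annealing schedule that shrinks the step size each iteration:

```python
# Sketch: gradient descent on the convex cost J(theta) = (theta - 3)^2 with an annealed learning rate.
def cost(theta):
    return (theta - 3.0) ** 2

def gradient(theta):                 # dJ/dtheta
    return 2.0 * (theta - 3.0)

theta = -8.0                         # arbitrary starting point
base_lr = 0.1
for t in range(1, 51):
    lr = base_lr / (1.0 + 0.01 * t)  # "annealing": gradually smaller steps
    theta -= lr * gradient(theta)    # step against the slope, i.e. downhill
    if t % 10 == 0:
        print(f"iter {t:2d}  theta={theta:.4f}  cost={cost(theta):.6f}")
```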
Example (figure):
The black line represents a non-linear loss function
If our parameters are initialized at the blue dot, we need a way to move around parameter space to the lowest point.
https://medium.com/38th-street-studios/exploring-stochastic-gradient-descent-with-restarts-sgdr-fa206c38a74e
Then just do it stochastically
- With every GD iteration shuffle the training set and pick a random training example
- Since you're only using one training example, the path to the minimum will be all zig-zag crazy
(Imagine trying to find the fastest way down a hill
only you can't see all of the curves in the hill)
You may want to consider mini-batching rather than a purely stochastic approach with very large datasets
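A sketch of that stochastic loop (mine, assuming NumPy) for least-squares linear regression: shuffle the data and update from one random example at a time, accepting a noisier path to the minimum.

```python
# Sketch: stochastic gradient descent for least-squares linear regression (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=500)

w = np.zeros(3)
lr = 0.01
for epoch in range(20):
    order = rng.permutation(len(X))            # shuffle the training set
    for i in order:                            # one random example per update -> zig-zag path
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]  # gradient of (x_i . w - y_i)^2
        w -= lr * grad
print("estimated weights:", np.round(w, 3))
```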
Gradient boosting machine - Linear Regression Example
GBMs can be configured with different base learners (e.g. tree, stump, linear model); a minimal from-scratch sketch appears after the pseudocode below
https://www.kaggle.com/grroverpr/gradient-boosting-simplified/
Pseudocode for a generic gradient boosting method
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
1. Initialize with the best constant: F₀(x) = argmin_γ Σᵢ L(yᵢ, γ)
2. For m = 1, …, M:
   a. Compute pseudo-residuals: rᵢₘ = −[∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ)] evaluated at F = Fₘ₋₁
   b. Fit a base learner hₘ(x) to {(xᵢ, rᵢₘ)}
   c. Line search for the step size: γₘ = argmin_γ Σᵢ L(yᵢ, Fₘ₋₁(xᵢ) + γ hₘ(xᵢ))
   d. Update: Fₘ(x) = Fₘ₋₁(x) + γₘ hₘ(x)
3. Return F_M(x)
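As promised above, a minimal from-scratch sketch of that loop specialized to squared-error loss (mine, assuming NumPy and scikit-learn): each round fits a shallow regression tree to the current residuals, which are exactly the negative gradient of the squared-error loss.

```python
# Sketch: generic gradient boosting specialized to squared error,
# using shallow regression trees as the base learners.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)

n_rounds, learning_rate = 100, 0.1
F = np.full_like(y, y.mean())            # F_0: the best constant under squared error
trees = []
for m in range(n_rounds):
    residuals = y - F                    # negative gradient of 0.5 * (y - F)^2 w.r.t. F
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    trees.append(tree)
    F = F + learning_rate * tree.predict(X)   # F_m = F_{m-1} + learning_rate * h_m

def predict(X_new):
    """Sum the initial constant and the scaled tree predictions."""
    pred = np.full(len(X_new), y.mean())
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X_new)
    return pred

print("train MSE:", round(float(np.mean((y - F) ** 2)), 4))
```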
StackOverflow fixed my problems
(there are a lot of people who can help you if you're lost)
Further...
- The probability of GD getting stuck at a saddle point is actually 0: arxiv.org/abs/1602.04915
- The presence of saddle points might severely slow GD's progress, though: www.jmlr.org/proceedings/papers/v40/Ge15.pdf
- Lots on optimization: https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f
- Tools like H2O are great: http://www.h2o.ai/wp-content/uploads/2018/01/GBM-BOOKLET.pdf
- Learn about ranking: https://pdfs.semanticscholar.org/9b9c/4bf53eb680e2eb26b456c4752a23dafb2d5e.pdf
- Learning rates: https://www.coursera.org/learn/machine-learning/lecture/3iawu/gradient-descent-in-practice-ii-learning-rate
- Original work from 1999: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
A brief introduction to stochastic gradient boosting in machine learning
By Daina Bouquin
Invited talk at the CfA's Classification in the Golden Era of X-ray Catalogs Workshop, May 4, 2018