Daina Bouquin
Boosting algorithms iteratively learn weak classifiers with respect to a distribution and add them to a final strong classifier
https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/
Bagging, e.g. Random Forest
Boosting, e.g. Gradient Boosting
Helps address main causes of differences between actual and predicted values: variance and bias
(noise is somewhat irreducible)
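A minimal scikit-learn sketch of the two ensemble styles side by side (not from the original slides; the dataset and hyperparameters are illustrative only):

    # Sketch: bagging (Random Forest) vs. boosting (Gradient Boosting) in scikit-learn.
    # Assumes scikit-learn is installed; data and settings are placeholders for illustration.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Bagging: many deep trees trained independently on bootstrap samples (mainly reduces variance).
    bagging = RandomForestClassifier(n_estimators=200, random_state=0)

    # Boosting: shallow trees added sequentially, each fit to the current errors (mainly reduces bias).
    boosting = GradientBoostingClassifier(n_estimators=200, max_depth=2, learning_rate=0.1, random_state=0)

    for name, model in [("Random Forest", bagging), ("Gradient Boosting", boosting)]:
        scores = cross_val_score(model, X, y, cv=5)
        print(name, scores.mean())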
gradient descent assuming a convex cost function
Local minimum must be a global minimum
Too much random noise can be an issue with convex optimization.
Non-convex optimization options for boosting do exist though, e.g. BrownBoost
If you're worried about local minima, check out warm restarts (SGDR)
*The point of GD is to minimize the cost function*
(find the lowest error value)
https://hackernoon.com/gradient-descent-aynk-7cbe95a778da
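A minimal sketch of plain gradient descent on a convex cost (the quadratic cost, starting point, and learning rate are made up for illustration):

    # Sketch: gradient descent minimizing a convex cost J(theta) = (theta - 3)^2.
    # The learning rate and initialization are arbitrary choices for this example.
    def cost(theta):
        return (theta - 3.0) ** 2

    def gradient(theta):
        return 2.0 * (theta - 3.0)   # dJ/dtheta

    theta = 10.0          # arbitrary initialization
    learning_rate = 0.1

    for step in range(50):
        theta -= learning_rate * gradient(theta)   # step downhill along the negative gradient

    print(theta, cost(theta))   # theta approaches 3, the cost approaches 0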
GD optimizers use a technique called "annealing" to determine the learning rate (how small/large a step to take)
The cost J(theta) should decrease at each iteration (if it increases, the learning rate is probably too large)
The slope of a function is the derivative of the function with respect to a parameter.
Following the negative of the slope always points you toward the nearest valley.
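One way to picture annealing is a simple decaying learning-rate schedule on the same kind of convex cost; the decay rule and constants below are assumptions for illustration, not the only choice:

    # Sketch: gradient descent with a decaying ("annealed") learning rate.
    # The rule lr = lr0 / (1 + decay * step) is one common schedule among many.
    def cost(theta):
        return (theta - 3.0) ** 2

    def gradient(theta):
        return 2.0 * (theta - 3.0)

    theta, lr0, decay = 10.0, 0.3, 0.1
    previous_cost = cost(theta)

    for step in range(100):
        lr = lr0 / (1.0 + decay * step)          # smaller steps as we approach the valley
        theta -= lr * gradient(theta)            # move opposite the slope (the derivative)
        assert cost(theta) <= previous_cost      # the cost J(theta), not theta itself, should shrink
        previous_cost = cost(theta)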
Example:
The black line represents a non-linear loss function
If our parameters are initialized to the blue dot, we need a way to move around parameter space to the lowest point.
https://medium.com/38th-street-studios/exploring-stochastic-gradient-descent-with-restarts-sgdr-fa206c38a74e
(Imagine trying to find the fastest way down a hill
only you can't see all of the curves in the hill)
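A sketch of the cosine learning-rate schedule with warm restarts that SGDR describes (the maximum/minimum rates and restart period are placeholder values; see the SGDR write-up linked above for the real details):

    # Sketch: cosine-annealed learning rate with warm restarts (SGDR-style schedule).
    # lr_max, lr_min, and the restart period are illustrative placeholders.
    import math

    def sgdr_lr(step, period=50, lr_max=0.1, lr_min=0.001):
        # Position within the current cycle; the rate "restarts" at lr_max every `period` steps.
        t = step % period
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / period))

    # The rate decays smoothly, then jumps back up, which can bounce the optimizer
    # out of a poor local minimum and into a different part of the loss surface.
    print([round(sgdr_lr(s), 4) for s in (0, 25, 49, 50, 75)])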
May want to consider mini-batching rather than a purely stochastic (one-sample-at-a-time) approach with very large datasets
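A sketch of the mini-batch variant on a toy least-squares problem (the batch size, learning rate, and synthetic data are made up for illustration; batch_size=1 would be the stochastic approach):

    # Sketch: mini-batch gradient descent for least squares.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    y = X @ true_w + 0.1 * rng.normal(size=10_000)

    w = np.zeros(5)
    batch_size, lr = 64, 0.05          # batch_size=1 would be purely stochastic

    for epoch in range(5):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of mean squared error on the batch
            w -= lr * grad

    print(w)   # should end up close to true_w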
GBM can be configured to use different base learners (e.g. tree, stump, linear model)
https://www.kaggle.com/grroverpr/gradient-boosting-simplified/
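For example, in scikit-learn the depth of the tree base learner is configurable, and max_depth=1 turns each tree into a stump (the hyperparameters below are illustrative; linear base learners are offered by other libraries, not by scikit-learn's GBM):

    # Sketch: configuring the base learner of a GBM in scikit-learn.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

    stump_gbm = GradientBoostingRegressor(n_estimators=300, max_depth=1, learning_rate=0.1)  # stumps
    tree_gbm = GradientBoostingRegressor(n_estimators=300, max_depth=3, learning_rate=0.1)   # small trees

    for name, model in [("stumps", stump_gbm), ("depth-3 trees", tree_gbm)]:
        model.fit(X, y)
        print(name, model.score(X, y))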
Pseudocode for a generic gradient boosting method
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
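A hedged Python sketch of that generic procedure, following the shape of the algorithm in the Friedman paper linked above (the squared-error loss, tree base learner, and learning rate here are simplifications chosen for illustration):

    # Sketch: generic gradient boosting with squared-error loss and regression-tree base learners.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
        # 1. Initialize with a constant prediction (the mean minimizes squared error).
        f0 = np.mean(y)
        prediction = np.full(len(y), f0)
        learners = []

        for _ in range(n_rounds):
            # 2. Compute pseudo-residuals: the negative gradient of the loss w.r.t. the
            #    current prediction (for squared error this is just y - prediction).
            residuals = y - prediction
            # 3. Fit a base learner to the pseudo-residuals.
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residuals)
            # 4. Add the new learner to the ensemble, shrunk by the learning rate.
            prediction += learning_rate * tree.predict(X)
            learners.append(tree)

        def predict(X_new):
            return f0 + learning_rate * sum(t.predict(X_new) for t in learners)

        return predict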
*MATH*
(there are a lot of people who can help you if you're lost)
The probability that GD gets stuck at a saddle point is actually 0: arxiv.org/abs/1602.04915
The presence of saddle points might severely slow GD's progress down: www.jmlr.org/proceedings/papers/v40/Ge15.pdf
Lots on optimization: https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f
Tools like H2O are great: http://www.h2o.ai/wp-content/uploads/2018/01/GBM-BOOKLET.pdf
Learn about ranking: https://pdfs.semanticscholar.org/9b9c/4bf53eb680e2eb26b456c4752a23dafb2d5e.pdf
Learning rates: https://www.coursera.org/learn/machine-learning/lecture/3iawu/gradient-descent-in-practice-ii-learning-rate
Original work from 1999: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf