Stochastic Gradient Boosting Machines

the basics

Daina Bouquin

I'm a librarian

MS Data Analytics

MS Library and Information Science

CAS Data Science

Can a set of weak learners create a single strong learner?

(yes)

Boosting algorithms iteratively learn weak classifiers with respect to a distribution and add them to a final strong classifier

Boosting:
ML ensemble method/metaheuristic

Helps with bias-variance tradeoff (can reduce both, mainly bias)

A metaheuristic is a higher-level procedure or heuristic designed to find, generate, or select a heuristic (partial search algorithm) that may provide a sufficiently good solution to an optimization problem, especially with incomplete or imperfect information or limited computational capacity

Bias = error from erroneous assumptions in the learning algorithm

Variance = sensitivity to small fluctuations in the training set

You don't want to model noise

High bias means you could miss relevant relations between features

∴ underfitting

High variance means you could end up modeling the random noise in the training set

∴ overfitting

Boosting algorithms:

Weighted in relation to the weak predictors' accuracy

Weighting decorrelates the predictors by focusing on regions missed by past predictors

New predictors learn from previous predictor mistakes

∴ take fewer iterations to converge

https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/

Boosting means observations have an unequal probability of appearing in subsequent models

Observations with highest error

appear most
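
A minimal sketch of that reweighting, using one AdaBoost-style update on toy labels. AdaBoost is only one boosting variant, picked here because its update is short. The labels, the `weak_preds` values, and the helper name `reweight` are made up for illustration, and the sketch assumes the weak learner's weighted error is strictly between 0 and 1.

```python
import math

def reweight(weights, y_true, y_pred):
    """One AdaBoost-style update: raise the weights of misclassified points."""
    # Weighted error rate of the weak learner (assumed to be between 0 and 1)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight: more accurate learners get a larger say in the ensemble
    alpha = 0.5 * math.log((1 - err) / err)
    # Labels are +1/-1, so t * p is -1 for a mistake and +1 for a hit
    new = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]

y_true = [1, 1, -1, -1, 1]
weak_preds = [1, 1, -1, -1, -1]   # hypothetical weak learner, wrong on the last point
weights = [0.2] * 5               # start from a uniform distribution

alpha, weights = reweight(weights, y_true, weak_preds)
```

After the update the single misclassified observation carries the largest weight, so the next weak learner is pushed to get it right.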

Ensembling: Bagging vs. Boosting

Bagging: handles overfitting; reduces variance; independent classifiers; e.g. Random Forest

Boosting: can overfit; reduces bias & variance; sequential classifiers; e.g. Gradient Boosting

Helps address main causes of differences between actual and predicted values: variance and bias

(noise is somewhat irreducible)

Boosting with Gradient Descent

gradient descent assuming a convex cost function

Local minimum must be a global minimum

Most common cost function is mean squared error

Too much random noise can be an issue with convex optimization.

Non-convex optimization options for boosting exist though e.g. BrownBoost

If you're worried about local minima check out restarts (SGDR)

*The point of GD is to minimize the cost function*

(find the lowest error value/the deepest valley in that function)

https://hackernoon.com/gradient-descent-aynk-7cbe95a778da

The slope points uphill, so stepping against it heads toward the nearest valley

Choice of cost function will affect calculation of the gradient of each weight.

Cost function is for monitoring the error with each training example

The derivative of the cost function with respect to the weight (slope!) tells us which way to shift the weight to reduce the error for that training example

This gives us direction
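
A minimal sketch of those ideas for a single weight, assuming a mean squared error cost and a toy model y = w * x. The data, the starting weight, and the learning rate `alpha` are illustrative choices, not values from the talk.

```python
def mse(w, xs, ys):
    # Cost: mean squared error of the predictions w * x
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    # d/dw of mean((w*x - y)^2) = mean(2 * x * (w*x - y))
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # toy data generated by y = 2x

w = 0.0        # initial weight
alpha = 0.05   # learning rate (step size)
costs = []
for _ in range(50):
    costs.append(mse(w, xs, ys))
    w -= alpha * mse_gradient(w, xs, ys)  # step against the slope
```

The sign of the gradient gives the direction and `alpha` scales the step; with this data `w` moves from 0 toward 2 while the recorded cost shrinks.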

https://hackernoon.com/gradient-descent-aynk-7cbe95a778da

The learning rate α sets how small/large of a step to take; some GD optimizers use a technique called “annealing” to shrink α as training goes on

The cost J(θ) should decrease at each iteration (θ = the weights)

if alpha is too large we overshoot the min

if alpha is too small we take too many iterations to find the min
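
To make the effect of alpha concrete, here is a small sketch on the toy cost J(θ) = θ², whose minimum is at θ = 0. The function name `descend` and the three alpha values are arbitrary illustrations.

```python
def descend(alpha, steps=20, theta=1.0):
    """Run gradient descent on J(theta) = theta**2 and return the final theta."""
    for _ in range(steps):
        grad = 2 * theta          # dJ/dtheta
        theta -= alpha * grad
    return theta

too_large = descend(alpha=1.1)    # every step overshoots, so |theta| grows
too_small = descend(alpha=0.001)  # barely moves in 20 steps
reasonable = descend(alpha=0.3)   # shrinks quickly toward 0
```

With the large alpha each step jumps past the minimum and lands farther away than it started; with the tiny alpha theta is still close to its starting value after 20 steps.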

Example:

Black line represents a non linear loss function

If our parameters are initialized to the blue dot, we need a way to move around parameter space to the lowest point.

https://medium.com/38th-street-studios/exploring-stochastic-gradient-descent-with-restarts-sgdr-fa206c38a74e

Then just do it stochastically
With every GD iteration shuffle the training set and pick a random training example
Since you’re only using one training example, the path to the minimum will be all zig-zag crazy

(Imagine trying to find the fastest way down a hill

only you can't see all of the curves in the hill)

May want to consider mini-batching rather than stochastic approach with very large datasets
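
A rough sketch of both options on a toy one-weight regression, assuming a mean squared error cost. The data, the seed, the learning rate, and the helper names `grad` and `train` are all illustrative; `batch_size=1` is the stochastic case and a larger value is mini-batching.

```python
import random

random.seed(0)
xs = [float(i) for i in range(1, 21)]
ys = [3.0 * x + random.gauss(0, 1) for x in xs]  # toy data: y = 3x + noise
data = list(zip(xs, ys))

def grad(w, batch):
    # Gradient of mean squared error for the model w * x over the given batch
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

def train(batch_size, epochs=30, alpha=0.001):
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)                      # new random order every epoch
        for i in range(0, len(data), batch_size):
            w -= alpha * grad(w, data[i:i + batch_size])
    return w

w_sgd = train(batch_size=1)    # one example per update: noisy path
w_mini = train(batch_size=5)   # small batches: smoother path
```

Both runs should land near the true slope of 3; the single-example version gets there with more jitter because each update only sees one point.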

Gradient boosting machine - Linear Regression Example

GBM can be configured to different base learners (e.g. tree, stump, linear model)

https://www.kaggle.com/grroverpr/gradient-boosting-simplified/

basic assumption: sum of residuals = 0

leverage pattern in residuals to strengthen weak prediction model until residuals become randomly distributed

if you keep going you risk overfitting

Algorithmically we are minimizing our loss function such that the test loss reaches its minimum

We adjust our predictions using the fit on the residuals, adjusting the value of alpha accordingly
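
A minimal sketch of that loop, assuming squared error loss and a one-split stump as the weak learner. The toy data, the number of rounds, the value of alpha, and the helper names `fit_stump` and `boost` are assumptions for illustration, not the code from the linked notebook.

```python
def fit_stump(xs, residuals):
    """Weak learner: the single split on x that best fits the residuals."""
    best = None
    for split in sorted(set(xs))[1:]:   # skip the minimum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x < split else rmean

def boost(xs, ys, rounds=50, alpha=0.1):
    base = sum(ys) / len(ys)                 # start from the mean, so residuals sum to 0
    preds = [base] * len(xs)
    losses = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)     # fit the pattern left in the residuals
        preds = [p + alpha * stump(x) for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, losses

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]   # toy data, roughly y = x
preds, losses = boost(xs, ys)
```

Note that `losses` tracks training error only, which keeps falling as rounds are added; spotting overfitting needs held-out data.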

We are doing supervised learning here

You can check for overfitting using k-fold cross validation: a resampling procedure used to evaluate machine learning models on a limited data sample
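
A small sketch of how the folds can be built, assuming contiguous folds over already-shuffled data. The helper name `k_fold_indices` is made up for illustration; libraries such as scikit-learn ship their own splitters.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n samples."""
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n=10, k=3))
```

Each sample is held out exactly once: train the model on `train_idx`, score it on `val_idx`, and average the k scores. A validation score much worse than the training score is the overfitting signal.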

Pseudocode for a generic gradient boosting method

http://statweb.stanford.edu/~jhf/ftp/trebst.pdf

*MATH*
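
For reference, the generic procedure is usually written along the following lines. This is a sketch in commonly used notation, which is an assumption here: L is the loss, h(x; a) the base learner with parameters a, N the number of training points, and M the number of boosting rounds. The linked paper has the authoritative statement.

```latex
\begin{align*}
&F_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho) \\
&\text{for } m = 1, \dots, M: \\
&\quad \tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \dots, N \\
&\quad a_m = \arg\min_{a, \beta} \sum_{i=1}^{N} \left[ \tilde{y}_i - \beta\, h(x_i; a) \right]^2 \\
&\quad \rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \rho\, h(x_i; a_m)\big) \\
&\quad F_m(x) = F_{m-1}(x) + \rho_m\, h(x; a_m)
\end{align*}
```

For squared error loss the pseudo-residuals are just the ordinary residuals y_i - F_{m-1}(x_i), up to a constant factor.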

StackOverflow fixed my problems

https://bit.ly/2FwXUAF

(there are a lot of people who can help you if you're lost)

Further...

The probability of GD getting stuck at a saddle point is actually 0 (with random initialization): arxiv.org/abs/1602.04915
Presence of saddle points might severely slow GD's progress down: www.jmlr.org/proceedings/papers/v40/Ge15.pdf
Lots on optimization: https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f
Tools like H2O are great: http://www.h2o.ai/wp-content/uploads/2018/01/GBM-BOOKLET.pdf
Learn about ranking: https://pdfs.semanticscholar.org/9b9c/4bf53eb680e2eb26b456c4752a23dafb2d5e.pdf
Learning rates: https://www.coursera.org/learn/machine-learning/lecture/3iawu/gradient-descent-in-practice-ii-learning-rate
Original work from 1999: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

Please make your code citable

https://guides.github.com/activities/citable-code/

Presentation given at the Center for Astrophysics Machine Learning Journal Club. December 7, 2018
