A Brief Introduction to

Stochastic Gradient Boosting

in Machine Learning

 

 

Daina Bouquin

 

Can a set of weak learners create a single strong learner?

(yes)

Boosting algorithms iteratively learn weak classifiers with respect to a distribution and add them to a final strong classifier

Boosting:  
ML ensemble method/metaheuristic

Helps with the bias-variance tradeoff (primarily reduces bias, and can also reduce variance)

 

 

Boosting algorithms:

Weighted in relation to the weak predictors' accuracy

Weighting decorrelates the predictors by focusing on regions missed by past predictors

New predictors learn from previous predictor mistakes

∴ take fewer iterations to converge
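The reweighting idea above can be sketched with AdaBoost, one specific boosting algorithm (the slides do not say which one they have in mind). This is a minimal illustration, assuming 1-D inputs, labels in {-1, +1}, and threshold "stumps" as the weak learners; the function names and the toy data are hypothetical.

```python
import math

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best 1-D threshold stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    """Train an AdaBoost ensemble of stumps; returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n          # start with a uniform distribution over examples
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)    # more accurate stumps get a larger vote
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the rest, then renormalize.
        updated = []
        for w, x, y in zip(weights, xs, ys):
            pred = polarity if x >= threshold else -polarity
            updated.append(w * math.exp(-alpha * y * pred))
        total = sum(updated)
        weights = [w / total for w in updated]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data (assumed): no single threshold separates the classes.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy set the best single stump still misclassifies two of the six points, while three reweighted stumps voting together fit all six.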

https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/

Ensembling

Bagging:

Handles overfitting

Reduces variance

Independent classifiers

e.g. Random Forest

Boosting:

Can overfit

Reduces bias & variance

Sequential classifiers

e.g. Gradient Boosting

Helps address the main causes of differences between actual and predicted values: variance and bias

(noise is somewhat irreducible)
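For reference, the three sources of error named above can be written as the standard decomposition of expected squared error. This assumes a squared-error loss and data generated as y = f(x) + ε with zero-mean noise of variance σ²; the symbols are not from the slides.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

Ensembling can shrink the first two terms; the last one does not depend on the model.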

Boosting with

Gradient Descent

Gradient descent, assuming a convex cost function

Any local minimum must be a global minimum

 

 

 

Too much random noise (e.g. mislabeled examples) can be an issue for boosting with a convex loss.

Non-convex optimization options for boosting exist though, e.g. BrownBoost

If you're worried about local minima, check out restarts (SGDR)

*The point of GD is to minimize the cost function*

(find the lowest error value)

https://hackernoon.com/gradient-descent-aynk-7cbe95a778da

GD optimizers often use a technique called “annealing” to adjust the learning rate (how small or large a step to take) over the course of training

 

The cost J(θ) should decrease at each iteration
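The update rule itself is not shown in the text here, so as a reminder, the usual gradient descent step looks like the following. The symbols (θ for the parameters, J for the cost, α_t for the learning rate at step t) and the 1/t-style decay schedule are assumptions chosen for illustration; other annealing schedules exist.

```latex
\theta_{t+1} = \theta_t - \alpha_t \, \nabla_\theta J(\theta_t),
\qquad
\alpha_t = \frac{\alpha_0}{1 + k t}
```

With a small enough step size, each update moves θ downhill, so the cost J(θ) should not increase from one iteration to the next.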

The slope of a function is the derivative of the function with respect to a value.

The negative slope points downhill, toward a nearby valley.

Example:

Black line represents a nonlinear loss function

If our parameters are initialized to the blue dot, we need a way to move around parameter space to the lowest point.
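Moving from a starting point to the lowest point can be sketched in a few lines. This is a minimal illustration on a made-up convex 1-D loss (not the curve in the figure), with an assumed 1/t-style learning-rate decay; all names and constants are hypothetical.

```python
def loss(theta):
    """Toy convex loss with its minimum at theta = 3 (assumed for illustration)."""
    return (theta - 3.0) ** 2 + 1.0

def grad(theta):
    """Derivative of the toy loss with respect to theta."""
    return 2.0 * (theta - 3.0)

def gradient_descent(theta0, lr0=0.2, decay=0.01, steps=200):
    """Plain gradient descent with a slowly annealed learning rate."""
    theta = theta0
    history = [loss(theta)]
    for t in range(steps):
        lr = lr0 / (1.0 + decay * t)      # anneal: take smaller steps over time
        theta = theta - lr * grad(theta)  # step against the slope (downhill)
        history.append(loss(theta))
    return theta, history

theta, history = gradient_descent(theta0=-4.0)  # the "blue dot" starting point
print(theta, history[0], history[-1])
```

Here the recorded loss never goes up between iterations, which is the behavior you want to see when monitoring training.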

https://medium.com/38th-street-studios/exploring-stochastic-gradient-descent-with-restarts-sgdr-fa206c38a74e

Then just do it stochastically

With every GD iteration, shuffle the training set and pick a random training example

Since you’re only using one training example, the path to the minimum will be all zig-zag crazy

 

(Imagine trying to find the fastest way down a hill

only you can't see all of the curves in the hill)

May want to consider mini-batching rather than a purely stochastic approach with very large datasets
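The stochastic and mini-batch variants described above differ only in how many examples feed each update. Here is a minimal sketch for a one-feature linear model with squared error; it assumes the common "shuffle once per epoch, then walk through the data" variant, and the function name, learning rate, and toy data are hypothetical.

```python
import random

def sgd_linear_regression(xs, ys, lr=0.05, epochs=200, batch_size=1, seed=0):
    """Fit y ~ w*x + b. batch_size=1 is pure SGD; larger values give mini-batches."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                          # new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over just this batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free toy data (assumed) generated from y = 2x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]
print(sgd_linear_regression(xs, ys, batch_size=1))  # one example per update
print(sgd_linear_regression(xs, ys, batch_size=2))  # small mini-batches
```

Averaging over a mini-batch smooths out some of the zig-zag in the path while keeping each update much cheaper than a pass over the full dataset.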

Gradient boosting machine - Linear Regression Example

GBM can be configured with different base learners (e.g. tree, stump, linear model)

 

https://www.kaggle.com/grroverpr/gradient-boosting-simplified/
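To make the regression case concrete, here is a minimal gradient boosting sketch. It assumes squared-error loss (so the negative gradient is just the residual), regression stumps as the base learner, and a fixed shrinkage factor; it is not the code from the linked notebook, and all names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Regression stump: one threshold, one constant prediction on each side."""
    best = None
    for threshold in sorted(set(xs))[1:]:     # skip the smallest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Additive model: initial constant plus a shrunken sum of stumps fit to residuals."""
    f0 = sum(ys) / len(ys)                    # best constant under squared error
    preds = [f0] * len(xs)
    learners = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient of 0.5*(y - F)^2
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(h(x) for h in learners)

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (assumed): a curve that no single stump can fit.
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
model = gradient_boost(xs, ys)
print(mean_squared_error(model, xs, ys))
```

Swapping `fit_stump` for a deeper tree or a linear fit changes the base learner without touching the boosting loop.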

Pseudocode for a generic gradient boosting method

http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
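The pseudocode itself is not reproduced in the text here. As a sketch, the generic procedure is usually stated roughly as follows, with L a differentiable loss, h(x; a) a base learner with parameters a, and M the number of boosting rounds; check the paper linked above for the exact notation.

```latex
\begin{aligned}
& F_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho) \\
& \text{for } m = 1, \dots, M: \\
& \qquad \tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \dots, N \\
& \qquad a_m = \arg\min_{a, \beta} \sum_{i=1}^{N} \big[ \tilde{y}_i - \beta\, h(x_i; a) \big]^2 \\
& \qquad \rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \rho\, h(x_i; a_m)\big) \\
& \qquad F_m(x) = F_{m-1}(x) + \rho_m\, h(x; a_m)
\end{aligned}
```

The stochastic variant fits each base learner on a random subsample of the training data, drawn without replacement at every round, instead of the full set.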

*MATH*

StackOverflow fixed my problems

https://bit.ly/2FwXUAF

(there are a lot of people who can help you if you're lost)
