Stochastic Gradient Boosting Machines
the basics
Daina Bouquin
I'm a librarian
MS Data Analytics
MS Library and Information Science
CAS Data Science
Can a set of weak learners create a single strong learner?
(yes)
Boosting algorithms iteratively learn weak classifiers with respect to a distribution and add them to a final strong classifier
Boosting:
ML ensemble method/metaheuristic
Helps with the bias-variance tradeoff (reduces both)
A metaheuristic is a higher-level procedure or heuristic designed to find, generate, or select a heuristic (partial search algorithm) that may provide a sufficiently good solution to an optimization problem, especially with incomplete or imperfect information or limited computational capacity
Bias = error from erroneous assumptions in the learning algorithm
High bias means you could miss relevant relations between features and target outputs
∴ underfitting
Variance = error from sensitivity to small fluctuations in the training set
High variance means you model the noise in the training data (you don't want to model noise)
∴ overfitting
Boosting algorithms:
Weighted in relation to the weak predictors' accuracy
Weighting decorrelates the predictors by focusing on regions missed by past predictors
New predictors learn from previous predictor mistakes
∴ take fewer iterations to converge
https://quantdare.com/whatisthedifferencebetweenbaggingandboosting/
Boosting means observations have an unequal probability of appearing in subsequent models
Observations with the highest error appear most often
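The reweighting idea can be sketched with an AdaBoost-style update. AdaBoost is not named on the slides, so treat it as an assumed example of a boosting algorithm; the function name `reweight` and the toy weights are hypothetical. The sketch shows how one misclassified observation ends up with a much larger weight, which makes it more likely to appear in the next model's training sample.

```python
import math

def reweight(weights, misclassified):
    """One AdaBoost-style reweighting step (illustrative sketch).

    weights: current observation weights (assumed to sum to 1)
    misclassified: list of bools, True where the weak learner was wrong
    Assumes the weighted error is strictly between 0 and 1.
    Returns (learner_weight, new_weights).
    """
    # Weighted error rate of the weak learner
    err = sum(w for w, m in zip(weights, misclassified) if m)
    # More accurate learners get a larger say in the final ensemble
    learner_weight = 0.5 * math.log((1.0 - err) / err)
    # Up-weight the mistakes, down-weight the correct predictions
    raw = [w * math.exp(learner_weight if m else -learner_weight)
           for w, m in zip(weights, misclassified)]
    total = sum(raw)
    # Normalize so the new weights form a distribution again
    return learner_weight, [r / total for r in raw]

# Five observations with equal weight; only the last one was misclassified
weights = [0.2] * 5
misclassified = [False, False, False, False, True]
learner_weight, new_weights = reweight(weights, misclassified)
```

With these toy inputs the single misclassified observation carries half of the total weight after the update, while the four correct ones share the other half.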
Ensembling: Bagging vs. Boosting
Bagging: handles overfitting; reduces variance; independent classifiers; e.g. Random Forest
Boosting: can overfit; reduces bias & variance; sequential classifiers; e.g. Gradient Boosting
Helps address main causes of differences between actual and predicted values: variance and bias
(noise is somewhat irreducible)
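The three sources of error can be written as the standard bias-variance decomposition of expected squared error. The notation is an assumption, not from the slides: the data follow y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the fitted model.

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why ensembling targets bias and variance and leaves the noise term alone.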
Boosting with
Gradient Descent
gradient descent assuming a convex cost function
Local minimum must be a global minimum
Most common cost function is mean squared error
Too much random noise can be an issue with convex optimization.
Nonconvex optimization options for boosting exist though, e.g. BrownBoost
If you're worried about local minima check out restarts (SGDR)
*The point of GD is to minimize the cost function*
(find the lowest error value/the deepest valley in that function)
https://hackernoon.com/gradientdescentaynk7cbe95a778da
Slope points to the nearest valley
Choice of cost function will affect calculation of the gradient of each weight.
Cost function is for monitoring the error with each training example
The derivative of the cost function with respect to the weight (slope!) tells us which way to shift the weight to reduce the error for that training example
This gives us direction
https://hackernoon.com/gradientdescentaynk7cbe95a778da
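As a worked example of the slope giving direction, take mean squared error for a one-weight linear model. The model ŷ = w·x, the sample size n, and the learning rate α are assumptions chosen for illustration.

```latex
\begin{aligned}
J(w) &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w x_i\right)^2 \\
\frac{\partial J}{\partial w} &= -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - w x_i\right) \\
w &\leftarrow w - \alpha\,\frac{\partial J}{\partial w}
\end{aligned}
```

When the slope is positive the update moves w down, and when it is negative the update moves w up, so each step heads toward the valley.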
GD optimizers can use a technique called “annealing” (gradually shrinking the step size over time) to set the learning rate α (how small/large of a step to take)
The cost J(θ), where θ is the weight, should decrease at each iteration
if alpha is too large we overshoot the min
if alpha is too small we take too many iterations to find the min
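The effect of the learning rate can be seen on a toy cost function. This is a minimal sketch that assumes J(θ) = θ² (not a function from the slides) and a fixed, non-annealed α; the name `gradient_descent` and the three α values are hypothetical.

```python
def gradient_descent(alpha, theta=1.0, steps=20):
    """Minimize the toy cost J(theta) = theta**2 with a fixed learning rate."""
    history = [theta]
    for _ in range(steps):
        grad = 2.0 * theta            # dJ/dtheta for J = theta**2
        theta = theta - alpha * grad  # step against the slope
        history.append(theta)
    return history

good = gradient_descent(alpha=0.1)         # shrinks steadily toward 0
too_small = gradient_descent(alpha=0.001)  # barely moves in 20 steps
too_large = gradient_descent(alpha=1.1)    # overshoots and diverges
```

With α = 1.1 each step jumps across the minimum to a point farther away on the other side, so the sign of θ flips and its magnitude grows. With α = 0.001 the iterate is still close to its starting value after 20 steps.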
Example:
Black line represents a nonlinear loss function
If our parameters are initialized to the blue dot, we need a way to move around parameter space to the lowest point.
https://medium.com/38thstreetstudios/exploringstochasticgradientdescentwithrestartssgdrfa206c38a74e

Then just do it stochastically
With every GD iteration, shuffle the training set and pick a random training example
Since you’re only using one training example, the path to the minimum will be all zigzag crazy
(Imagine trying to find the fastest way down a hill
only you can't see all of the curves in the hill)
May want to consider mini-batching rather than a purely stochastic approach with very large datasets
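A minimal sketch of the stochastic and mini-batch variants, assuming a one-weight linear model y ≈ w·x with squared error and noise-free toy data. The function name `sgd_linear`, the data, and the hyperparameters are all hypothetical. Setting `batch_size=1` gives the one-example-at-a-time version; a larger `batch_size` averages the gradient over a few examples per step.

```python
import random

def sgd_linear(xs, ys, alpha=0.01, epochs=50, batch_size=1, seed=0):
    """Fit y ~ w * x with stochastic (batch_size=1) or mini-batch gradient descent."""
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the training examples in a new random order
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of squared error, averaged over the current batch
            grad = sum(-2.0 * xs[i] * (ys[i] - w * xs[i]) for i in batch) / len(batch)
            w -= alpha * grad
    return w

# Toy data generated from y = 3x with no noise
xs = [0.5 * i for i in range(1, 11)]
ys = [3.0 * x for x in xs]
w_sgd = sgd_linear(xs, ys, batch_size=1)
w_mini = sgd_linear(xs, ys, batch_size=5)
```

Both runs recover a weight close to 3 on this toy data. On noisy data the single-example path would wander more, and the mini-batch path would be smoother.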
Gradient boosting machine: Linear Regression Example
GBM can be configured to different base learners (e.g. tree, stump, linear model)
https://www.kaggle.com/grroverpr/gradientboostingsimplified/
basic assumption: sum of residuals = 0
leverage pattern in residuals to strengthen weak prediction model until residuals become randomly distributed
if you keep going you risk overfitting
Algorithmically, we are minimizing our loss function such that the test loss reaches its minimum
We adjust our predictions using the fit on the residuals, adjusting the value of alpha accordingly
We are doing supervised learning here
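The residual-fitting loop can be sketched in plain Python. This is an illustration under assumptions, not the notebook linked above: it uses decision stumps as the base learner, squared error as the loss, a fixed alpha of 0.1, and toy data y = x². The names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Pick the threshold on x that best splits the residuals (least squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_rounds=50, alpha=0.1):
    """Gradient boosting for squared error: each stump is fit to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean, so the residuals sum to 0
    preds = [base] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Move predictions a small step (alpha) toward the residual fit
        preds = [p + alpha * stump(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    predict = lambda x: base + alpha * sum(s(x) for s in stumps)
    return predict, losses

xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
predict, losses = boost(xs, ys)
```

`losses` records the training error after each round, and it never increases here. The training error cannot reveal overfitting, which is why a held-out check is still needed.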
You can check for overfitting using k-fold cross-validation:
a resampling procedure used to evaluate machine learning models on a limited data sample
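The splitting step of k-fold cross-validation can be sketched as follows. The helper name `k_fold_indices` is hypothetical, and the folds are contiguous for simplicity; in practice the indices would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds; yield (train, test) index lists."""
    # Spread any remainder over the first n % k folds
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
```

Each observation lands in exactly one test fold. Fitting the model on each `train` list and scoring it on the matching `test` list gives k error estimates to average.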
Pseudocode for a generic gradient boosting method
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
*MATH*
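A sketch of the generic procedure, written in the notation commonly used for Friedman's gradient boosting algorithm. The symbols are assumptions here, since the slide showed the pseudocode as an image: L is the loss, h(x; a) is a base learner with parameters a, M is the number of rounds, and N is the number of training examples. Check the details against the paper linked above.

```latex
\begin{aligned}
&F_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho) \\
&\text{For } m = 1 \text{ to } M: \\
&\quad \tilde{y}_i = -\left[\frac{\partial L\left(y_i, F(x_i)\right)}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}, \quad i = 1, \dots, N \\
&\quad a_m = \arg\min_{a,\,\beta} \sum_{i=1}^{N} \left[\tilde{y}_i - \beta\, h(x_i; a)\right]^2 \\
&\quad \rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\left(y_i, F_{m-1}(x_i) + \rho\, h(x_i; a_m)\right) \\
&\quad F_m(x) = F_{m-1}(x) + \rho_m\, h(x; a_m)
\end{aligned}
```

For squared error the pseudo-residuals ỹ are the ordinary residuals y − F(x) up to a constant factor, which is the "fit on the residuals" step described earlier.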
StackOverflow fixed my problems
(there are a lot of people who can help you if you're lost)
Further...

The probability of GD getting stuck at a saddle point is actually 0: arxiv.org/abs/1602.04915

The presence of saddle points might severely slow GD's progress: www.jmlr.org/proceedings/papers/v40/Ge15.pdf

Lots on optimization: https://towardsdatascience.com/typesofoptimizationalgorithmsusedinneuralnetworksandwaystooptimizegradient95ae5d39529f

Tools like H2O are great: http://www.h2o.ai/wpcontent/uploads/2018/01/GBMBOOKLET.pdf

Learn about ranking: https://pdfs.semanticscholar.org/9b9c/4bf53eb680e2eb26b456c4752a23dafb2d5e.pdf

Learning rates: https://www.coursera.org/learn/machinelearning/lecture/3iawu/gradientdescentinpracticeiilearningrate

Original work from 1999: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
Stochastic Gradient Boosting Machines: the basics
By Daina Bouquin
Presentation given at the Center for Astrophysics Machine Learning Journal Club. December 7, 2018