Daina Bouquin
Boosting algorithms iteratively learn weak classifiers with respect to a distribution and add them to a final strong classifier
https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/
Bagging, e.g. Random Forest
Boosting, e.g. Gradient Boosting
Helps address main causes of differences between actual and predicted values: variance and bias
(noise is somewhat irreducible)
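A minimal scikit-learn sketch of the two ensemble styles side by side (not from the original slides; the dataset and hyperparameters are illustrative only):

    # Sketch: bagging (Random Forest) vs. boosting (Gradient Boosting) in scikit-learn.
    # Assumes scikit-learn is installed; data and settings are placeholders for illustration.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Bagging: many deep trees trained independently on bootstrap samples (mainly reduces variance).
    bagging = RandomForestClassifier(n_estimators=200, random_state=0)

    # Boosting: shallow trees added sequentially, each fit to the current errors (mainly reduces bias).
    boosting = GradientBoostingClassifier(n_estimators=200, max_depth=2, learning_rate=0.1, random_state=0)

    for name, model in [("Random Forest", bagging), ("Gradient Boosting", boosting)]:
        scores = cross_val_score(model, X, y, cv=5)
        print(name, scores.mean())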
gradient descent assuming a convex cost function
Local minimum must be a global minimum
Too much random noise can be an issue with convex optimization.
Non-convex optimization options for boosting do exist though, e.g. BrownBoost
If you're worried about local minima, check out warm restarts (SGDR)
*The point of GD is to minimize the cost function*
(find the lowest error value)
https://hackernoon.com/gradient-descent-aynk-7cbe95a778da
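A minimal sketch of plain gradient descent on a convex cost (the quadratic cost, starting point, and learning rate are made up for illustration):

    # Sketch: gradient descent minimizing a convex cost J(theta) = (theta - 3)^2.
    # The learning rate and initialization are arbitrary choices for this example.
    def cost(theta):
        return (theta - 3.0) ** 2

    def gradient(theta):
        return 2.0 * (theta - 3.0)   # dJ/dtheta

    theta = 10.0          # arbitrary initialization
    learning_rate = 0.1

    for step in range(50):
        theta -= learning_rate * gradient(theta)   # step downhill along the negative gradient

    print(theta, cost(theta))   # theta approaches 3, the cost approaches 0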
GD optimizers use a technique called "annealing" to determine the learning rate (how small/large a step to take)
The cost J(theta) should decrease at each iteration (if it increases, the learning rate is probably too large)
The slope of a function is the derivative of the function with respect to a parameter.
Following the negative of the slope always points you toward the nearest valley.
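One way to picture annealing is a simple decaying learning-rate schedule on the same kind of convex cost; the decay rule and constants below are assumptions for illustration, not the only choice:

    # Sketch: gradient descent with a decaying ("annealed") learning rate.
    # The rule lr = lr0 / (1 + decay * step) is one common schedule among many.
    def cost(theta):
        return (theta - 3.0) ** 2

    def gradient(theta):
        return 2.0 * (theta - 3.0)

    theta, lr0, decay = 10.0, 0.3, 0.1
    previous_cost = cost(theta)

    for step in range(100):
        lr = lr0 / (1.0 + decay * step)          # smaller steps as we approach the valley
        theta -= lr * gradient(theta)            # move opposite the slope (the derivative)
        assert cost(theta) <= previous_cost      # the cost J(theta), not theta itself, should shrink
        previous_cost = cost(theta)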
Example:
The black line represents a non-linear loss function
If our parameters are initialized to the blue dot, we need a way to move around parameter space to the lowest point.
https://medium.com/38th-street-studios/exploring-stochastic-gradient-descent-with-restarts-sgdr-fa206c38a74e
(Imagine trying to find the fastest way down a hill
only you can't see all of the curves in the hill)
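A sketch of the cosine learning-rate schedule with warm restarts that SGDR describes (the maximum/minimum rates and restart period are placeholder values; see the SGDR write-up linked above for the real details):

    # Sketch: cosine-annealed learning rate with warm restarts (SGDR-style schedule).
    # lr_max, lr_min, and the restart period are illustrative placeholders.
    import math

    def sgdr_lr(step, period=50, lr_max=0.1, lr_min=0.001):
        # Position within the current cycle; the rate "restarts" at lr_max every `period` steps.
        t = step % period
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / period))

    # The rate decays smoothly, then jumps back up, which can bounce the optimizer
    # out of a poor local minimum and into a different part of the loss surface.
    print([round(sgdr_lr(s), 4) for s in (0, 25, 49, 50, 75)])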
May want to consider mini-batching rather than a purely stochastic (one-sample-at-a-time) approach with very large datasets
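A sketch of the mini-batch variant on a toy least-squares problem (the batch size, learning rate, and synthetic data are made up for illustration; batch_size=1 would be the stochastic approach):

    # Sketch: mini-batch gradient descent for least squares.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    y = X @ true_w + 0.1 * rng.normal(size=10_000)

    w = np.zeros(5)
    batch_size, lr = 64, 0.05          # batch_size=1 would be purely stochastic

    for epoch in range(5):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of mean squared error on the batch
            w -= lr * grad

    print(w)   # should end up close to true_w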
GBM can be configured to use different base learners (e.g. tree, stump, linear model)
https://www.kaggle.com/grroverpr/gradient-boosting-simplified/
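For example, in scikit-learn the depth of the tree base learner is configurable, and max_depth=1 turns each tree into a stump (the hyperparameters below are illustrative; linear base learners are offered by other libraries, not by scikit-learn's GBM):

    # Sketch: configuring the base learner of a GBM in scikit-learn.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

    stump_gbm = GradientBoostingRegressor(n_estimators=300, max_depth=1, learning_rate=0.1)  # stumps
    tree_gbm = GradientBoostingRegressor(n_estimators=300, max_depth=3, learning_rate=0.1)   # small trees

    for name, model in [("stumps", stump_gbm), ("depth-3 trees", tree_gbm)]:
        model.fit(X, y)
        print(name, model.score(X, y))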
Pseudocode for a generic gradient boosting method
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
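A hedged Python sketch of that generic procedure, following the shape of the algorithm in the Friedman paper linked above (the squared-error loss, tree base learner, and learning rate here are simplifications chosen for illustration):

    # Sketch: generic gradient boosting with squared-error loss and regression-tree base learners.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
        # 1. Initialize with a constant prediction (the mean minimizes squared error).
        f0 = np.mean(y)
        prediction = np.full(len(y), f0)
        learners = []

        for _ in range(n_rounds):
            # 2. Compute pseudo-residuals: the negative gradient of the loss w.r.t. the
            #    current prediction (for squared error this is just y - prediction).
            residuals = y - prediction
            # 3. Fit a base learner to the pseudo-residuals.
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residuals)
            # 4. Add the new learner to the ensemble, shrunk by the learning rate.
            prediction += learning_rate * tree.predict(X)
            learners.append(tree)

        def predict(X_new):
            return f0 + learning_rate * sum(t.predict(X_new) for t in learners)

        return predict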
*MATH*
(there are a lot of people who can help you if you're lost)
The probability that GD gets stuck at a saddle point is actually 0: arxiv.org/abs/1602.04915
The presence of saddle points might severely slow GD's progress down: www.jmlr.org/proceedings/papers/v40/Ge15.pdf
Lots on optimization: https://towardsdatascience.com/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f
Tools like H2O are great: http://www.h2o.ai/wp-content/uploads/2018/01/GBM-BOOKLET.pdf
Learn about ranking: https://pdfs.semanticscholar.org/9b9c/4bf53eb680e2eb26b456c4752a23dafb2d5e.pdf
Learning rates: https://www.coursera.org/learn/machine-learning/lecture/3iawu/gradient-descent-in-practice-ii-learning-rate
Original work from 1999: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf