image source: https://wikidocs.net/3413
Intuition via Gradient Descent vs Stochastic Gradient Descent
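In standard textbook notation (a sketch, not the slide's original figure): full-batch gradient descent computes the gradient of the loss averaged over all N training examples before every update, while stochastic gradient descent updates after a single randomly drawn example i_t, which is what lets an online learner like VW train from a stream:

    % gradient descent: one update per full pass over the data
    w_{t+1} = w_t - \eta \, \nabla_w \left[ \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, y_i; w_t) \right]

    % stochastic gradient descent: one update per example i_t
    w_{t+1} = w_t - \eta \, \nabla_w \, \ell(x_{i_t}, y_{i_t}; w_t)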
it's not going away soon :)
As summarized by Peter H. Salus in A Quarter-Century of Unix (1994):[1]
"Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."
Research project started by John Langford (initially at Yahoo!, now at Microsoft)
Fast and scalable due to
Multiple learning algorithms / models
Loss functions (typical choices sketched after this list):
Optimization algorithms:
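As a hedged illustration (the exact lists are not preserved in this text): the loss functions most commonly used with VW's linear models are squared, logistic, hinge, and quantile loss, optimized either by online SGD-style updates or by batch methods such as L-BFGS:

    \ell_{\text{squared}}(p, y)  = (p - y)^2
    \ell_{\text{logistic}}(p, y) = \log\!\left(1 + e^{-y p}\right), \quad y \in \{-1, +1\}
    \ell_{\text{hinge}}(p, y)    = \max(0, 1 - y p)
    \ell_{\text{quantile}}(p, y) = \tau\,(y - p) \ \text{if } y \ge p, \quad (1 - \tau)\,(p - y) \ \text{otherwise}

where p is the model's prediction for the example.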
[Label] (|[Namespace] Features)...
Namespace = String[:Value]
Features  = (String[:Value])...
Example (simplified format):
tsv line:  0    2    0    1    287e684f    0a519c5c    02cf9876
vw line:   -1 |n 1:2 2:0 3:1 4__287e684f 5__0a519c5c 6__02cf9876
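As a sketch of that mapping, a hypothetical helper (not part of VW; the column layout, the single namespace "n", and the index__value convention for the hashed categorical columns are assumptions read off the example above) could look like this:

# Hypothetical tsv -> vw converter illustrating the example above (not VW code).
# Assumptions: first column is a 0/1 label mapped to -1/+1, the next numeric_cols
# columns become index:value features, the rest become index__value features,
# and everything goes into one namespace.
def tsv_to_vw(tsv_line, numeric_cols=3, namespace="n"):
    cols = tsv_line.rstrip("\n").split("\t")
    label = "+1" if cols[0] == "1" else "-1"       # 0/1 label -> -1/+1
    feats = []
    for i, value in enumerate(cols[1:], start=1):
        if i <= numeric_cols:
            feats.append(f"{i}:{value}")           # numeric feature: index:value
        else:
            feats.append(f"{i}__{value}")          # categorical feature: index__value
    return f"{label} |{namespace} " + " ".join(feats)

# Reproduces the vw line shown above:
print(tsv_to_vw("0\t2\t0\t1\t287e684f\t0a519c5c\t02cf9876"))
# -1 |n 1:2 2:0 3:1 4__287e684f 5__0a519c5c 6__02cf9876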
LIVE
what could go wrong? :)
source: scikit-learn docs
L2 norm: leads to shrunken coefficients
L1 norm: leads to sparse coefficients
Elastic Net: a weighted combination of the L1 and L2 penalties
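For reference, in the scikit-learn-style formulation (a standard sketch; the exact expressions from the slide are not preserved here), the penalties added to the training loss are:

    % L2 (ridge): shrinks all coefficients toward zero
    \Omega_{\mathrm{L2}}(w) = \lambda \lVert w \rVert_2^2 = \lambda \sum_j w_j^2

    % L1 (lasso): pushes many coefficients exactly to zero
    \Omega_{\mathrm{L1}}(w) = \lambda \lVert w \rVert_1 = \lambda \sum_j |w_j|

    % Elastic Net: a weighted mix of the two, controlled by alpha in [0, 1]
    \Omega_{\mathrm{EN}}(w) = \lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right)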
Let's talk now!