Regularized Composite ReLU-ReHU Loss Minimization with Linear Computation and Linear Convergence



Ben Dai (CUHK)
(Joint work with Yixuan Qiu)
LIBLINEAR

- LIBLINEAR was the winner of the ICML 2008 large-scale learning challenge (linear SVM track). It was also used in the winning solution of KDD Cup 2010.

- In scikit-learn, liblinear is the solver behind LinearSVC, the standard linear SVM in Python.

LIBLINEAR
As noted on the official LIBLINEAR website, thanks to contributions from researchers and developers worldwide, LIBLINEAR provides interfaces to many languages:
{ R, Python, MATLAB, Java, Perl, Ruby, and even PHP }
Its popularity is thus evident.

LIBLINEAR



- From 2008 to 2024: 16 years of continuous contributions.
- Countless hours have been devoted.
- Since its development in 2008, it has consistently remained the No. 1 solver for linear SVMs.
Dual Coordinate Descent
Given a training set of n points $(x_i, y_i)_{i=1}^n$, where $y_i = \pm 1$ is the binary label of the i-th instance $x_i \in \mathbb{R}^d$, the (unconstrained) primal form is
$$\min_{\beta} \sum_{i=1}^n C_i (1 - y_i \beta^\top x_i)_+ + \frac{1}{2}\|\beta\|^2.$$
After introducing slack variables $\xi_i$, the primal is a QP with 2n linear constraints:
$$\min_{\beta,\xi} \sum_{i=1}^n C_i \xi_i + \frac{1}{2}\|\beta\|^2, \quad \text{s.t. } y_i \beta^\top x_i \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, n.$$
Dual Coordinate Descent
The dual is a box-constrained QP:
- simpler form than the primal problem
- naturally leads to coordinate descent (CD)
Lagrangian, with multipliers $\alpha_i \ge 0$ and $\mu_i \ge 0$:
$$L_P = \sum_{i=1}^n C_i \xi_i + \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^n \alpha_i \big( y_i x_i^\top \beta - (1 - \xi_i) \big) - \sum_{i=1}^n \mu_i \xi_i.$$
Setting the derivatives w.r.t. $\beta$ and $\xi_i$ to zero (KKT conditions):
$$\beta = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \alpha_i = C_i - \mu_i.$$
Dual form:
$$\min_{\alpha} \frac{1}{2}\alpha^\top Q \alpha - \mathbf{1}^\top \alpha, \quad \text{s.t. } 0 \le \alpha_i \le C_i.$$
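For completeness, here is the elided step from the Lagrangian to the dual: substituting the two stationarity conditions back into $L_P$ makes the $\xi_i$ terms vanish (their coefficient is $C_i - \alpha_i - \mu_i = 0$) and leaves
$$L_P = \frac{1}{2}\Big\| \sum_{i} \alpha_i y_i x_i \Big\|^2 - \sum_{i} \alpha_i y_i x_i^\top \beta + \sum_i \alpha_i = -\frac{1}{2}\alpha^\top Q \alpha + \mathbf{1}^\top \alpha,$$
with $Q_{ij} = y_i y_j x_i^\top x_j$; maximizing this over $\alpha$ is exactly the boxed minimization above.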
Dual Coordinate Descent
CD sub-problem. Given the current ("old") value of $\alpha$, for each coordinate $i$ we solve
$$\min_{\delta_i} \frac{1}{2} Q_{ii}\delta_i^2 + \big((Q\alpha)_i - 1\big)\delta_i, \quad \text{s.t. } -\alpha_i \le \delta_i \le C_i - \alpha_i,$$
where $Q_{ij} = y_i y_j x_i^\top x_j$. The solution is the unconstrained minimizer clipped to the box:
$$\delta_i^* = \max\Big( -\alpha_i, \min\Big( C_i - \alpha_i, \frac{1 - (Q\alpha)_i}{Q_{ii}} \Big) \Big), \qquad \alpha_i \leftarrow \alpha_i + \delta_i^*.$$
Note that $(Q\alpha)_i = y_i x_i^\top \sum_{j=1}^n y_j x_j \alpha_j$ costs $O(nd)$ per update (at least $O(n)$ if $Q$ is pre-computed).
Dual Coordinate Descent
Pure CD: looping over $i = 1, \dots, n$, one full pass therefore costs $O(n^2)$ (with $Q$ pre-computed), which is no better than generic solvers such as:
- IPM
- ADMM
- ...
Dual Coordinate Descent
The KKT condition $\beta = \sum_{i=1}^n \alpha_i y_i x_i$ links the primal and dual variables. If we maintain $\beta$ explicitly, then
$$(Q\alpha)_i = y_i x_i^\top \sum_{j=1}^n y_j x_j \alpha_j = y_i x_i^\top \beta,$$
which costs only $O(d)$ per update instead of $O(nd)$.
Dual Coordinate Descent
Pure CD (loop over $i = 1, \dots, n$; $O(n^2)$ per pass):
$$\delta_i^* = \max\Big( -\alpha_i, \min\Big( C_i - \alpha_i, \frac{1 - (Q\alpha)_i}{Q_{ii}} \Big) \Big), \qquad \alpha_i \leftarrow \alpha_i + \delta_i^*.$$
Primal-dual CD (loop over $i = 1, \dots, n$; $O(nd)$ per pass):
$$\delta_i^* = \max\Big( -\alpha_i, \min\Big( C_i - \alpha_i, \frac{1 - y_i \beta^\top x_i}{Q_{ii}} \Big) \Big), \qquad \alpha_i \leftarrow \alpha_i + \delta_i^*, \quad \beta \leftarrow \beta + \delta_i^* y_i x_i.$$
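To make the per-pass cost concrete, here is a minimal NumPy sketch of the primal-dual CD update above (the function name, defaults, and random coordinate order are illustrative choices, not LIBLINEAR's actual implementation):

```python
import numpy as np

def dual_cd_svm(X, y, C=1.0, n_epochs=50, seed=0):
    """Primal-dual coordinate descent for the linear SVM dual.

    Maintains beta = sum_i alpha_i * y_i * x_i, so (Q alpha)_i = y_i * x_i^T beta
    is available in O(d), giving O(nd) per full pass over the coordinates.
    """
    n, d = X.shape
    alpha = np.zeros(n)
    beta = np.zeros(d)
    Qii = np.einsum("ij,ij->i", X, X)  # Q_ii = x_i^T x_i, since y_i^2 = 1
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            if Qii[i] == 0.0:
                continue
            grad = y[i] * (X[i] @ beta) - 1.0              # (Q alpha)_i - 1, in O(d)
            delta = np.clip(-grad / Qii[i], -alpha[i], C - alpha[i])
            alpha[i] += delta
            beta += delta * y[i] * X[i]                    # keep beta in sync, O(d)
    return beta

# Quick sanity check on synthetic data:
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = np.sign(X @ np.ones(5) + 0.1 * rng.normal(size=500))
beta = dual_cd_svm(X, y)
print("train accuracy:", np.mean(np.sign(X @ beta) == y))
```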
LIBLINEAR
What contributes to the rapid efficiency of LIBLINEAR?
- An analytic solution for each CD update
- Reducing $O(n^2)$ to $O(nd)$ in CD updates, by combining the linear KKT condition into the CD updates
- Linear convergence, i.e., $O(\log(\epsilon^{-1}))$ iterations
  - CD usually converges only sublinearly
  - the linear structure improves the convergence rate!

Source: Ryan Tibshirani, Convex Optimization, lecture notes.
Luo, Z.-Q., & Tseng, P. (1992). On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications.
ReHLine
Extension. When can the idea of LIBLINEAR be applied? Whenever the KKT conditions stay linear:
- Loss
  - hinge loss in SVMs (✔)
  - check loss in quantile regression (✔)
  - many other piecewise linear/quadratic losses (✔)
  - a whole class of losses: convex PLQ (✔)
  - pieces of order > 2 (✘)
- Constraints
  - box constraints (✔)
  - linear constraints (✔)
ReHLine
In this paper, we consider a general regularized ERM based on a convex PLQ loss with linear constraints:
$$\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n L_i(x_i^\top \beta) + \frac{1}{2}\|\beta\|_2^2, \quad \text{s.t. } A\beta + b \ge 0,$$
- $L_i(\cdot) \ge 0$ is the proposed composite ReLU-ReHU loss;
- $x_i \in \mathbb{R}^d$ is the feature vector of the i-th observation;
- $A \in \mathbb{R}^{K \times d}$ and $b \in \mathbb{R}^K$ encode K linear inequality constraints on $\beta$.
We focus on large-scale datasets, where the dimension of the coefficient vector and the total number of constraints are much smaller than the sample size, that is, $d \ll n$ and $K \ll n$.
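For reference, the same problem can be posed to a generic solver; here is a cvxpy sketch with the hinge loss standing in for $L_i$ (a baseline of the kind ReHLine is benchmarked against, not ReHLine itself; the function name is ours):

```python
import cvxpy as cp

def erm_hinge_with_constraints(X, y, A, b, C=1.0):
    """Generic-solver baseline for
    min_beta  C * sum_i (1 - y_i x_i^T beta)_+ + 0.5 * ||beta||^2
    s.t.      A beta + b >= 0.
    """
    n, d = X.shape
    beta = cp.Variable(d)
    hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ beta)))
    prob = cp.Problem(
        cp.Minimize(C * hinge + 0.5 * cp.sum_squares(beta)),
        [A @ beta + b >= 0],
    )
    prob.solve()
    return beta.value
```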
ReHLine Loss
Definition 1 (Dai and Qiu, 2023). A function $L(z)$ is composite ReLU-ReHU if there exist $u, v \in \mathbb{R}^L$ and $\tau, s, t \in \mathbb{R}^H$ such that
$$L(z) = \sum_{l=1}^{L} \mathrm{ReLU}(u_l z + v_l) + \sum_{h=1}^{H} \mathrm{ReHU}_{\tau_h}(s_h z + t_h),$$
where $\mathrm{ReLU}(z) = \max\{z, 0\}$, and the rectified Huber unit is
$$\mathrm{ReHU}_{\tau}(z) = \begin{cases} 0, & z \le 0, \\ z^2/2, & 0 < z \le \tau, \\ \tau(z - \tau/2), & z > \tau. \end{cases}$$

Theorem 1 (Dai and Qiu, 2023). A loss function $L: \mathbb{R} \to \mathbb{R}_{\ge 0}$ is convex PLQ if and only if it is composite ReLU-ReHU.
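A minimal NumPy sketch of Definition 1 (helper names are ours, not the rehline package API): the hinge loss is a single ReLU term, and the Huber loss is the sum of two ReHU terms.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def rehu(z, tau):
    """Rectified Huber: 0 for z <= 0, z^2/2 for 0 < z <= tau, tau*(z - tau/2) for z > tau."""
    z = np.maximum(z, 0.0)
    return np.where(z <= tau, 0.5 * z ** 2, tau * (z - 0.5 * tau))

def composite_relu_rehu(z, u=(), v=(), s=(), t=(), tau=()):
    """L(z) = sum_l ReLU(u_l z + v_l) + sum_h ReHU_{tau_h}(s_h z + t_h)."""
    total = sum(relu(ul * z + vl) for ul, vl in zip(u, v))
    total += sum(rehu(sh * z + th, tauh) for sh, th, tauh in zip(s, t, tau))
    return total

z = np.linspace(-3.0, 3.0, 7)
hinge = composite_relu_rehu(z, u=[-1.0], v=[1.0])                            # ReLU(1 - z)
huber = composite_relu_rehu(z, s=[1.0, -1.0], t=[0.0, 0.0], tau=[1.0, 1.0])  # ReHU_1(z) + ReHU_1(-z)
```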
ReHLine Formulation
$$\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n L_i(x_i^\top \beta) + \frac{1}{2}\|\beta\|_2^2, \quad \text{s.t. } A\beta + b \ge 0.$$
The formulation can also handle the elastic-net penalty.
ReHLine Results
A broad range of problems. ReHLine applies to any convex piecewise linear-quadratic loss (non-smoothness included) with arbitrary linear constraints, covering the hinge loss, the check loss, the Huber loss, etc.

Super efficient. ReHLine attains a linear convergence rate, with per-iteration computational complexity linear in the sample size.
ReHLine Algo
- Inspired by CD and LIBLINEAR.
- The linear relationship between the primal and dual variables greatly simplifies the computation of CD.
Software. Generic vs. specialized solvers:
- Generic: cvx/cvxpy with mosek (IPM), ecos (IPM), scs (ADMM), dccp (DCCP)
- Specialized: liblinear → SVM; hqreg → Huber regression; lightning → smoothed SVM (sSVM)
Experiments
- ~1000× speed-up over generic solvers
- no worse than specialized solvers
[Benchmark figures]
ReHLine Universe
[Figure: the ReHLine ecosystem]

ReHLine SVM
[Figure: SVM benchmark, LIBLINEAR vs. ReHLine]
We illustrate our algo in QR using the simulated example from Feng, He, and Hu (2011):
yi=1+xi1+xi2+3−1/2(2+101(1+(xi1−8)2+x2i))ϵi

ReHLine QR

1M-scale quantile regression in 0.5 seconds with ReHLine.
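A short sketch generating the simulated design above (the covariate distribution is not stated on the slide, so the uniform draw below is an assumption), together with the check loss in the composite-ReLU form that makes QR fit the ReHLine formulation:

```python
import numpy as np

def make_qr_data(n, seed=0):
    """Heteroscedastic simulation of Feng, He, and Hu (2011) as written above.

    Assumption: x_{i1}, x_{i2} ~ Uniform(0, 16); the slide does not state the
    covariate distribution.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0.0, 16.0, n)
    x2 = rng.uniform(0.0, 16.0, n)
    eps = rng.normal(size=n)
    scale = 3.0 ** (-0.5) * (2.0 + 0.1 * (1.0 + (x1 - 8.0) ** 2 + x2 ** 2))
    y = 1.0 + x1 + x2 + scale * eps
    return np.column_stack([x1, x2]), y

def check_loss(r, kappa=0.5):
    """Check loss rho_kappa(r) = kappa*ReLU(r) + (1-kappa)*ReLU(-r): composite ReLU."""
    return kappa * np.maximum(r, 0.0) + (1.0 - kappa) * np.maximum(-r, 0.0)
```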
Summary
- Powerful algorithm
  - We have improved the computing power of a large class of regularized empirical risk minimization problems to the level of LIBLINEAR: linear convergence + linear per-iteration computation.
- Powerful software
  - Efficient C++ implementation. On SVMs, ReHLine is equivalent to LIBLINEAR, and our implementation can be even faster.
  - Flexible Python/R APIs for custom losses and constraints, intended to tackle a vast array of ML and statistics problems (e.g., FairSVM).
Thank you!
If you like ReHLine, please star 🌟 our GitHub repository!