Word-Level Maximum Mean Discrepancy Regularization for Word Embedding

 

Ben Dai (CUHK)

(Joint work with Youqian Gao)

 

 

Embedding Learning

Background

Muandet, K., et al. (2017). "Kernel mean embedding of distributions: A review and beyond." Foundations and Trends® in Machine Learning.

Word-level MMD Regularization

Building on the exploratory data analysis, we introduce a word-level variant of the maximum mean discrepancy to enhance and regularize the construction of word embeddings.

 

Our objective is to obtain an embedding matrix \(\mathbf{M}\) such that the word-level distributions of the numerical representations (\(\mathbf{M}_{X_\tau}| Y=-1\) and \(\mathbf{M}_{X_\tau}| Y=1\)) preserve the discrepancy between \(X_\tau | Y = -1\) and \(X_\tau | Y=1\). Here, \(\mathbf{M}_{X_\tau} = \mathbf{M}^\intercal \mathbf{e}_{X_\tau}\) represents the numerical representation of the word \(X_\tau\) based on the embedding matrix \(\mathbf{M}\). To measure the difference between distributions, we propose a word-level variant of the maximum mean discrepancy (Gretton et al., 2012), termed word-level MMD (wMMD), which is based on the embedding matrix \(\mathbf{M}\):

Word-level MMD Regularization

$$ \text{wMMD}^2 (\mathbf{M};\mathbb{P}_{\cdot|Y=-1}, \mathbb{P}_{\cdot|Y=1}) := \Big( \sup_{\|h\|_{\mathcal{H}_\mathcal{K}} \leq 1} \Big( \mathbb{E} \big( h (\mathbf{M}_{X_\tau} ) \big| Y =-1 \big) - \mathbb{E}\big( h(\mathbf{M}_{X'_{\tau'}}) \big| Y' = 1 \big) \Big) \Big)^2, $$

where \((\tau', \mathbf{X}', Y')\) is an iid copy of \((\tau, \mathbf{X}, Y)\), and \(\mathcal{H}_\mathcal{K}\) is an RKHS with a kernel \(\mathcal{K}(\cdot, \cdot)\), such as the Gaussian kernel. By the reproducing property,

$$ \text{wMMD}^2 (\mathbf{M};\mathbb{P}_{\cdot|Y=-1}, \mathbb{P}_{\cdot|Y=1}) = \mathbb{E} \big( \mathcal{K}(\mathbf{M}_{X_\tau}, \mathbf{M}_{X'_{\tau'}}) \big| Y=-1, Y'=-1 \big) - 2 \mathbb{E} \big( \mathcal{K}(\mathbf{M}_{X_\tau}, \mathbf{M}_{X'_{\tau'}}) \big| Y=-1, Y'=1 \big) + \mathbb{E} \big( \mathcal{K}(\mathbf{M}_{X_\tau}, \mathbf{M}_{X'_{\tau'}}) \big| Y=1, Y'=1 \big). $$

When \(\mathcal{K}\) is a universal kernel, \( \text{wMMD}(\mathbf{M}) = 0\) implies that \(\mathbf{M}_{X_\tau}|Y=-1 \stackrel{d}{=} \mathbf{M}_{X_\tau}|Y=1\), i.e., there is no discrepancy between the word-level distributions of the numerical representations.

wMMD: Computation

Given a training dataset \(\mathcal{D}_n = (\mathbf{x}_i, y_i)_{i=1, \cdots, n}\) and a positive definite (pd) kernel function \(\mathcal{K}(\cdot, \cdot)\), we define \(\mathcal{I}^+ = \{1 \leq i \leq n: y_i = 1\}\) and \(\mathcal{I}^- = \{1 \leq i \leq n: y_i = -1\}\). The empirical estimate of \(\text{wMMD}^2(\mathbf{M})\) is given by:

$$     \widehat{\text{wMMD}}^2(\mathbf{M}; \mathcal{D}_n) = \frac{1}{d^2 |\mathcal{I}^-|(|\mathcal{I}^-| - 1)} \sum_{i,i' \in \mathcal{I}^-; i\neq i'} \sum_{j=1}^d \sum_{j'=1}^d \mathcal{K}( \mathbf{M}_{x_{ij}}, \mathbf{M}_{x_{i'j'}}) $$

$$ - \frac{2}{d^2 |\mathcal{I}^-| |\mathcal{I}^+|} \sum_{i \in \mathcal{I}^-; i' \in \mathcal{I}^+} \sum_{j=1}^d \sum_{j'=1}^d \mathcal{K}( \mathbf{M}_{x_{ij}}, \mathbf{M}_{x_{i'j'}}) $$

$$    + \frac{1}{d^2|\mathcal{I}^+|(|\mathcal{I}^+| - 1)} \sum_{i,i' \in \mathcal{I}^+; i \neq i'} \sum_{j=1}^d \sum_{j'=1}^d \mathcal{K}( \mathbf{M}_{x_{ij}}, \mathbf{M}_{x_{i'j'}} ). $$

Lemma 1. \(\widehat{\text{wMMD}}^2(\mathbf{M}; \mathcal{D}_n)\) is an unbiased estimator of \(\text{wMMD}^2(\mathbf{M})\).
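As a concrete illustration, here is a minimal NumPy sketch of \(\widehat{\text{wMMD}}^2(\mathbf{M}; \mathcal{D}_n)\) with a Gaussian kernel; the embedding matrix `M`, the bandwidth `sigma`, and the helper names are illustrative placeholders rather than the authors' implementation.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    # A: (a, p), B: (b, p) -> (a, b) Gaussian kernel matrix
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def wmmd2_hat(M, X, y, sigma=1.0):
    """Unbiased empirical wMMD^2(M; D_n) with a Gaussian kernel.

    M : (V, p) embedding matrix, row v embeds word index v
    X : (n, d) word indices, one document (length d) per row
    y : (n,) labels in {-1, +1}
    """
    n, d = X.shape
    E = M[X]                                    # (n, d, p): embedded documents
    E_neg, E_pos = E[y == -1], E[y == +1]

    def avg_pair(Ea, Eb, exclude_diag):
        # average of K(M_{x_ij}, M_{x_i'j'}) over document pairs (i, i') and word pairs (j, j')
        p = Ea.shape[-1]
        K = gaussian_gram(Ea.reshape(-1, p), Eb.reshape(-1, p), sigma)   # (a*d, b*d)
        K = K.reshape(len(Ea), d, len(Eb), d).sum(axis=(1, 3))           # (a, b) per-pair sums
        if exclude_diag:                                                 # drop i == i' for unbiasedness
            return (K.sum() - np.trace(K)) / (d * d * len(Ea) * (len(Ea) - 1))
        return K.sum() / (d * d * len(Ea) * len(Eb))

    return (avg_pair(E_neg, E_neg, True)
            - 2.0 * avg_pair(E_neg, E_pos, False)
            + avg_pair(E_pos, E_pos, True))
```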

 


wMMD: Experiments

Simulation (Topic modelling). For \(1 \leq j \leq d\), let \(X_{j} \mid Y = 0 \overset{\mathrm{i.i.d.}}{\sim} \text{BetaBin}(V, \alpha_0, \beta_0)\) and \(X_{j} \mid Y = 1 \overset{\mathrm{i.i.d.}}{\sim} \text{BetaBin}(V, \alpha_1, \beta_1)\), where \(\text{BetaBin}(\cdot)\) denotes the beta-binomial distribution.

Blei D. et al. (2010). Introduction to Probabilistic Topic Models
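A minimal SciPy sketch of how such simulated documents could be generated; the specific values of \(V\), \((\alpha_0, \beta_0, \alpha_1, \beta_1)\), the document length \(d\), and the sample size are hypothetical choices for illustration only.

```python
import numpy as np
from scipy.stats import betabinom

rng = np.random.default_rng(0)
n, d, V = 1000, 50, 200                 # hypothetical sample size, document length, vocabulary size
alpha0, beta0 = 2.0, 5.0                # hypothetical Beta-Binomial parameters for Y = 0
alpha1, beta1 = 3.0, 4.0                # hypothetical Beta-Binomial parameters for Y = 1

y = rng.integers(0, 2, size=n)          # topic labels in {0, 1}
X0 = betabinom.rvs(V, alpha0, beta0, size=(n, d), random_state=rng)
X1 = betabinom.rvs(V, alpha1, beta1, size=(n, d), random_state=rng)
X = np.where((y == 1)[:, None], X1, X0) # word indices X_j | Y, each document has d words
```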


wMMD: Experiments

Real Datasets

  • CE-T1. The CE-T1 dataset consists of 2,187 Spanish tweets posted after the 2010 Chilean earthquake. Each tweet is labeled as either relevant or not relevant to the earthquake.
  • BBC-News. The BBC-News dataset comprises 2,225 English documents sourced from the BBC News website. Each document is categorized into one of five topics: business, entertainment, politics, sports, or technology.

 


Neural Network Architectures: BiLSTM, GRU, CNN

Competitors: L1, dropout, re-embedding, group-lasso, wMMD (ours)

wMMD: Experiments

Real Application

[Classification results on the real datasets: CE-T1 and BBC-News.]


LIBLINEAR

  • From 2008 to 2024: a 16-year period of continuous contributions.

  • Countless hours have been devoted to it.

  • Since its development in 2008, it has consistently remained the No. 1 solver for SVMs.

Dual Coordinate Descent

The primal is a QP with 2n linear constraints.

Consider a training set of \(n\) points \((\mathbf{x}_i, y_i)_{i=1}^n\), where \( \mathbf{x}_i \in \mathbb{R}^d \) is the \(i\)-th instance and \(y_i = \pm 1\) is its binary label.

Primal form.

$$\min_{\pmb{\beta}} \sum_{i=1}^{n} C_i ( 1 - y_i \pmb{\beta}^T \mathbf{x}_i )_+ + \frac{1}{2} \| \pmb{\beta} \|^2 $$

After introducing slack variables \( \xi_i \), this becomes

$$ \min_{\pmb{\beta}, \pmb{\xi}} \sum_{i=1}^{n} C_i \xi_i + \frac{1}{2} \| \pmb{\beta} \|^2, \quad \text{s.t.} \quad y_i \pmb{\beta}^T \mathbf{x}_i \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n. $$

Dual Coordinate Descent

The dual is a box-constrained QP:

  • simpler form than the primal problem
  • naturally leads to coordinate descent (CD)

Lagrangian:

$$L_P = \sum_{i=1}^{n} C_i \xi_i + \frac{1}{2} \| \pmb{\beta} \|^2  - \sum_{i=1}^n \alpha_i \big( y_i \mathbf{x}_i^T \pmb{\beta} - (1 - \xi_i) \big)  - \sum_{i=1}^n \mu_i \xi_i$$

  • Taking derivatives with respect to \( \pmb{\beta} \) and \( \xi_i \) and setting them to zero gives the KKT (stationarity) conditions:

$$ \pmb{\beta} =  \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i, \quad \alpha_i = C_i -  \mu_i. $$

Dual form:

$$ \min_{\pmb{\alpha}} \frac{1}{2} \pmb{\alpha}^T \mathbf{Q} \pmb{\alpha} - \mathbf{1}^T \pmb{\alpha}, \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C_i, \quad \text{where } Q_{ij} = y_i y_j \mathbf{x}_i^T \mathbf{x}_j. $$
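For completeness, substituting the stationarity conditions back into \( L_P \) eliminates \( \pmb{\beta} \), \( \pmb{\xi} \), and \( \pmb{\mu} \) and gives the dual objective (a standard derivation, sketched here):

$$ L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \Big\| \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \Big\|^2 = \mathbf{1}^T \pmb{\alpha} - \frac{1}{2} \pmb{\alpha}^T \mathbf{Q} \pmb{\alpha}, $$

so maximizing \( L_D \) over \( 0 \leq \alpha_i \leq C_i \) is equivalent to the dual minimization above.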

Dual Coordinate Descent

CD sub-problem. Given an "old" value of \( \pmb{\alpha} \), we solve

$$ \min_{\delta_i} \frac{1}{2} Q_{ii} \delta_i^2 + \big((\mathbf{Q} \pmb{\alpha})_i - 1\big) \delta_i, \quad \text{s.t.} \quad -\alpha_i \leq \delta_i \leq C_i - \alpha_i, $$

where \( Q_{ij} = y_i y_j \mathbf{x}^T_i \mathbf{x}_j \). The sub-problem has the closed-form solution

$$ \delta^*_i = \max \Big( -\alpha_i, \min\big( C_i - \alpha_i, \frac{1 - (\mathbf{Q}\pmb{\alpha})_i }{Q_{ii}}  \big) \Big), \qquad \alpha_i \leftarrow  \alpha_i + \delta^*_i. $$

Note that \( (\mathbf{Q} \pmb{\alpha})_{i} = y_i \mathbf{x}_i^T \sum_{j=1}^n y_j \mathbf{x}_j \alpha_j \), so each update costs \( O(nd) \) (at least \( O(n) \) if \( \mathbf{Q} \) is pre-computed), and a full loop over \( i=1,\cdots,n \) costs \( O(n^2) \) even with a pre-computed \( \mathbf{Q} \).

Alternative generic solvers for the dual QP:

  • IPM
  • ADMM
  • ...

Dual Coordinate Descent

From pure CD to primal-dual CD. By the KKT condition

$$ \pmb{\beta} =  \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i, $$

we have \( (\mathbf{Q} \pmb{\alpha})_{i} = y_i \mathbf{x}_i^T \pmb{\beta} \), which costs only \( O(d) \) once \( \pmb{\beta} \) is maintained. The CD update becomes

$$ \delta^*_i = \max \Big( -\alpha_i, \min\big( C_i - \alpha_i, \frac{1 - y_i \pmb{\beta}^T\mathbf{x}_i }{Q_{ii}}  \big) \Big), \qquad \alpha_i \leftarrow  \alpha_i + \delta^*_i, \quad \pmb{\beta} \leftarrow \pmb{\beta} + \delta^*_i y_i \mathbf{x}_i. $$

Looping over \( i=1,\cdots,n \) now costs \( O(nd) \) per sweep (primal-dual CD), compared with \( O(n^2) \) for pure CD.
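As a concrete illustration, here is a minimal NumPy sketch of the primal-dual CD sweep described above for the linear SVM dual; the function name, stopping rule, and random sweep order are illustrative choices, not the LIBLINEAR implementation itself.

```python
import numpy as np

def dual_cd_svm(X, y, C=1.0, n_sweeps=100, tol=1e-6, seed=0):
    """Primal-dual coordinate descent for the linear SVM dual (illustrative sketch)."""
    n, d = X.shape
    alpha = np.zeros(n)
    beta = np.zeros(d)                  # beta = sum_i alpha_i y_i x_i, maintained incrementally
    Qii = (X ** 2).sum(axis=1)          # diagonal of Q: Q_ii = ||x_i||^2
    rng = np.random.default_rng(seed)

    for _ in range(n_sweeps):
        max_step = 0.0
        for i in rng.permutation(n):    # one sweep over the coordinates
            if Qii[i] <= 0.0:
                continue                # skip all-zero feature vectors
            grad = y[i] * (X[i] @ beta) - 1.0          # (Q alpha)_i - 1, computed in O(d)
            delta = np.clip(-grad / Qii[i], -alpha[i], C - alpha[i])
            if delta != 0.0:
                alpha[i] += delta
                beta += delta * y[i] * X[i]            # O(d) update of the primal variable
                max_step = max(max_step, abs(delta))
        if max_step < tol:
            break
    return beta, alpha
```

For example, `beta, alpha = dual_cd_svm(X, y, C=1.0)` returns the primal coefficients and dual variables for a dataset `X` (n × d) with labels `y` in {−1, +1}.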

LIBLINEAR

  • What contributes to the rapid efficiency of LIBLINEAR?

    • an analytic solution for each CD update
    • combining the linear KKT condition with the CD updates, which reduces \( O(n^2) \) to \( O(nd) \) per sweep
    • linear convergence \( O(\log(\epsilon^{-1})) \)
      • CD is usually only sublinearly convergent
      • the linear structure improves the convergence!

Luo, Z. Q., & Tseng, P. (1992). On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications.

Source: Ryan Tibshirani, Convex Optimization, lecture notes.

ReHLine

Extension. When can the idea of LIBLINEAR be applied?

  • Loss
    • hinge loss in SVMs
    • check loss in quantile regression
    • losses of order > 2
    • many piecewise linear / quadratic losses: a class of losses? (PLQ)
  • Constraints
    • box constraints?
    • linear constraints

The key ingredient is linear KKT conditions.

ReHLine

In this paper, we consider a general regularized ERM based on a convex PLQ loss with linear constraints:

\( \min_{\mathbf{\beta} \in \mathbb{R}^d} \sum_{i=1}^n L_i(\mathbf{x}_i^\intercal \mathbf{\beta}) + \frac{1}{2} \| \mathbf{\beta} \|_2^2, \quad \text{ s.t. } \mathbf{A} \mathbf{\beta} + \mathbf{b} \geq \mathbf{0}, \)

  • \( L_i(\cdot) \geq 0\) is the proposed composite ReLU-ReHU loss.

  • \( \mathbf{x}_i \in \mathbb{R}^d\) is the feature vector for the \(i\)-th observation.

  • \(\mathbf{A} \in \mathbb{R}^{K \times d}\) and \(\mathbf{b} \in \mathbb{R}^K\) are linear inequality constraints for \(\mathbf{\beta}\).

  • We focus on large-scale datasets, where the dimension of the coefficient vector and the total number of constraints are much smaller than the sample size, that is, \(d \ll n\) and \(K \ll n\).

ReHLine Loss

Definition 1 (Dai and Qiu, 2023). A function \(L(z)\) is composite ReLU-ReHU if there exist \( \mathbf{u}, \mathbf{v} \in \mathbb{R}^{L}\) and \(\mathbf{\tau}, \mathbf{s}, \mathbf{t} \in \mathbb{R}^{H}\) such that

 

\( L(z) = \sum_{l=1}^L \text{ReLU}( u_l z + v_l) + \sum_{h=1}^H \text{ReHU}_{\tau_h}( s_h z + t_h)\)

where \( \text{ReLU}(z) = \max\{z,0\}\), and \( \text{ReHU}_{\tau_h}(z)\) is defined below.
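For reference, the ReHU (rectified Huber) function referenced above is, following Dai and Qiu (2023),

$$ \text{ReHU}_{\tau}(z) = \begin{cases} 0, & z \leq 0, \\ z^2/2, & 0 < z \leq \tau, \\ \tau\,(z - \tau/2), & z > \tau, \end{cases} $$

i.e., it is quadratic near the origin and linear beyond \( \tau \).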

Theorem 1 (Dai and Qiu, 2023). A loss function \(L:\mathbb{R}\rightarrow\mathbb{R}_{\geq 0}\) is convex PLQ if and only if it is composite ReLU-ReHU.
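To make Definition 1 concrete, the sketch below evaluates a generic composite ReLU-ReHU loss and checks the easily verified decomposition of the SVM hinge loss \( C(1 - y_i z)_+ = \text{ReLU}(-C y_i z + C) \); the helper names and parameter values are ours, for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def rehu(z, tau):
    # rectified Huber: 0 for z <= 0, quadratic on (0, tau], linear beyond tau
    z = np.maximum(z, 0.0)
    return np.where(z <= tau, 0.5 * z ** 2, tau * (z - 0.5 * tau))

def relu_rehu_loss(z, u=(), v=(), s=(), t=(), tau=()):
    """L(z) = sum_l ReLU(u_l z + v_l) + sum_h ReHU_{tau_h}(s_h z + t_h)."""
    total = sum(relu(ul * z + vl) for ul, vl in zip(u, v))
    total += sum(rehu(sh * z + th, tauh) for sh, th, tauh in zip(s, t, tau))
    return total

# Hinge loss of an SVM as a single ReLU term: u = -C * y_i, v = C (no ReHU terms).
C, y_i = 1.0, +1
z = np.linspace(-2.0, 2.0, 9)
assert np.allclose(relu_rehu_loss(z, u=(-C * y_i,), v=(C,)),
                   C * np.maximum(1.0 - y_i * z, 0.0))
```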

ReHLine Formulation

\( \min_{\mathbf{\beta} \in \mathbb{R}^d} \sum_{i=1}^n L_i(\mathbf{x}_i^\intercal \mathbf{\beta}) + \frac{1}{2} \| \mathbf{\beta} \|_2^2, \quad \text{ s.t. } \mathbf{A} \mathbf{\beta} + \mathbf{b} \geq \mathbf{0}, \)

ReHLine can also handle the elastic-net penalty.

ReHLine Results

A broad range of problems. ReHLine applies to any convex piecewise linear-quadratic (PLQ) loss function, possibly non-smooth, with arbitrary linear constraints; examples include the hinge loss, the check loss, and the Huber loss.

Super efficient. ReHLine has a linear convergence rate. The per-iteration computational complexity is linear in the sample size.

ReHLine Algo

  • Inspired by CD and LIBLINEAR

The linear relationship between primal and dual variables greatly simplifies the computation of CD.


Software. Generic vs. specialized solvers:

  • Generic solvers
    • cvx/cvxpy
    • mosek (IPM)
    • ecos (IPM)
    • scs (ADMM)
    • dccp (DCP)
  • Specialized solvers
    • liblinear -> SVM
    • hqreg -> Huber
    • lightning -> sSVM
Experiments

  • 1000x speed-up over generic solvers
  • no worse than specialized solvers


ReHLine Universe

[Diagram: the ReHLine ecosystem alongside LIBLINEAR.]

Summary

  • Powerful algorithm
    • We improve the computational efficiency of a large class of regularized empirical risk minimization problems to the level of LIBLINEAR (linear convergence + linear per-iteration computation).
  • Powerful software
    • Efficient C++ implementation: within SVMs, ReHLine is equivalent to LIBLINEAR, and our present implementation can be even faster than LIBLINEAR.
    • Flexible Python/R APIs for specifying losses and constraints, intended to tackle a vast array of ML and statistical problems (e.g., FairSVM).

Thank you!

If you like ReHLine, please star 🌟 our GitHub repository. Thank you!
