Convex Learning

Outline

  • Convex Learning Problem
    • Useful Properties
    • Learnability
    • Surrogate Loss Function
  • Regularization and Stability
    • Regularized Loss Minimization
    • Fitting-Stability Tradeoff
  • Stochastic Gradient Descent
    • Learning with SGD

In fact, we already know some convex losses

  • Linear regression with square loss
l(w, (x, y)) = \left( \langle w, x\rangle - y \right)^{2}
  • Logistic regression
l(w, (x, y)) = {\log{\left(1+e^{-y\langle w, x\rangle} \right)}}

Convex set

\forall u,v \in C,\,\forall \alpha \in \left[ 0,1 \right],\,\alpha u+(1-\alpha)v \in C

Convex function

f:C\rightarrow \mathbb{R},\quad\forall u,v \in C,\,\forall \alpha \in \left[ 0,1 \right]:
f\left( \alpha u+(1-\alpha)v \right) \leq \alpha f(u) + (1-\alpha)f(v)

First Order Property

\forall u,w \in dom(f):\quad f(u) \geq f(w) + \langle\nabla f(w),\, u-w\rangle

Second Order Property

For a twice-differentiable function f (1-D case), the following are equivalent:

  • f is convex
  • f' is monotonically increasing
  • f'' is nonnegative

Examples

f(x) = x^{2}\Rightarrow f^{\prime\prime}(x) = 2 \geq 0
f(x) = \log\left(1+e^{x} \right)\Rightarrow f^{\prime\prime}(x) =\frac{e^{-x}}{(1+e^{-x})^{2}} \geq 0

What about l(w, (x, y)) = {\log{\left(1+e^{-y\langle w, x\rangle} \right)}}? It is a function of the vector w, so the 1-D test does not apply directly.
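
The 1-D second-order test can also be checked numerically; a minimal Python sketch (the step size h and the test points are arbitrary choices):

```python
import math

def f(x):
    # f(x) = log(1 + e^x), written stably for large |x|
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def second_derivative(f, x, h=1e-4):
    # central finite-difference approximation of f''(x)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# f''(x) = e^{-x} / (1 + e^{-x})^2 should be nonnegative everywhere
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    approx = second_derivative(f, x)
    exact = math.exp(-x) / (1.0 + math.exp(-x)) ** 2
    assert approx >= 0.0
    assert abs(approx - exact) < 1e-5
```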

Linear transformation preserves convexity

f(w) =g(\langle w,\,x\rangle+y)

is convex whenever g(x) is convex.

Since g(x) = \log\left(1+e^{x} \right) is convex,

l(w, (x, y)) =g(-y\langle w,\,x\rangle) =g(\langle w,\,-yx\rangle)={\log{\left(1+e^{-y\langle w, x\rangle} \right)}}

is convex, and with g(x)=x^{2},

f(w)=(\langle w,\,x\rangle-y)^{2}

is convex as well.

Other functions preserve convexity

g(x) = \max\limits_{i \in [r]}{f_{i}(x)}
g(x) = \sum\limits_{i=1}^{r}{w_{i}f_{i}(x)},\quad w_{i}\geq 0

Logistic loss is convex

L_{S}(w) = \frac{1}{m}\sum\limits_{i=1}^{m}{\log{\left(1+e^{-y_{i}\langle w, x_{i}\rangle} \right)}}

Each term is convex in w, and a nonnegative weighted sum of convex functions is convex.
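
Convexity of this empirical risk can be spot-checked with Jensen's inequality on random data; a toy Python sketch (the data and the points u, v are made up):

```python
import math
import random

def logistic_risk(w, data):
    # L_S(w) = (1/m) * sum_i log(1 + exp(-y_i * <w, x_i>))
    total = 0.0
    for x, y in data:
        margin = -y * sum(wi * xi for wi, xi in zip(w, x))
        total += math.log1p(math.exp(margin))
    return total / len(data)

random.seed(0)
data = [([random.gauss(0, 1), random.gauss(0, 1)], random.choice([-1, 1]))
        for _ in range(20)]

# Jensen: L_S(a*u + (1-a)*v) <= a*L_S(u) + (1-a)*L_S(v)
u, v = [1.0, -2.0], [-0.5, 3.0]
for a in [0.0, 0.25, 0.5, 0.75, 1.0]:
    mid = [a * ui + (1 - a) * vi for ui, vi in zip(u, v)]
    lhs = logistic_risk(mid, data)
    rhs = a * logistic_risk(u, data) + (1 - a) * logistic_risk(v, data)
    assert lhs <= rhs + 1e-12
```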

Proofs

Lipschitzness

\forall w_{1},w_{2}\in C:\quad \lVert f(w_{1})-f(w_{2})\rVert \leq \rho \lVert w_{1}-w_{2}\rVert

Lipschitzness

For a differentiable function f (1-D case), f is \rho-Lipschitz if and only if

\forall x \in dom(f),\,\lvert f^{\prime}(x)\rvert\leq \rho

f(x) = \log\left(1+e^{x} \right)

is 1-Lipschitz: f^{\prime}(x)=\frac{1}{1+e^{-x}}\in(0,1)

f(x) = x^{2}

is not Lipschitz: f^{\prime}(x)=2x is unbounded!
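
A quick numerical illustration of both claims, as a Python sketch (the sampling range is arbitrary):

```python
import math
import random

def softplus(x):
    # f(x) = log(1 + e^x); f'(x) = 1/(1 + e^{-x}) is in (0, 1), so f is 1-Lipschitz
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

random.seed(1)
for _ in range(1000):
    x1, x2 = random.uniform(-50, 50), random.uniform(-50, 50)
    assert abs(softplus(x1) - softplus(x2)) <= abs(x1 - x2) + 1e-12

# x^2 is not Lipschitz: its difference quotients grow without bound
def quotient(x1, x2):
    return abs(x1 ** 2 - x2 ** 2) / abs(x1 - x2)

assert quotient(1e6, 1e6 + 1.0) > 1e6  # slope around x = 1e6 is about 2e6
```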

Smoothness

A function is called \beta-smooth when its derivative is \beta-Lipschitz

f(x) = \log\left(1+e^{x} \right)

is 1/4-smooth: f^{\prime\prime}(x)=\frac{e^{-x}}{(1+e^{-x})^{2}}\in(0,\tfrac{1}{4}]

f(x) = x^{2}

is 2-smooth: f^{\prime\prime}(x)=2

Property of smoothness

f(v ) \leq f(w) +\langle \nabla f(w), v-w \rangle + \frac{\beta}{2}\lVert v-w \rVert^{2}
(Figure: since f^{\prime} is \beta-Lipschitz, f^{\prime}(w) lies between f^{\prime}(v)-\beta\lVert v-w\rVert and f^{\prime}(v)+\beta\lVert v-w\rVert.)

Self-bounded

When f is nonnegative and \beta-smooth, setting

v = w - \frac{1}{\beta}\nabla f(w)

in

f(v) \leq f(w) +\langle \nabla f(w), v-w \rangle + \frac{\beta}{2}\lVert v-w \rVert^{2}

and using f(v)\geq 0 gives the self-bounded property

\lVert \nabla f(w) \rVert^{2} \leq 2\beta f(w)
Lipschitzness and smoothness under linear transformation

If g(x) is \rho-Lipschitz, then

f(w) =g(\langle w,\,x\rangle+y)

is \rho \lVert x \rVert-Lipschitz.

If g(x) is \beta-smooth, then

f(w) =g(\langle w,\,x\rangle+y)

is \beta \lVert x \rVert^{2}-smooth.

Examples

l(w, (x, y)) = {\log{\left(1+e^{-y\langle w, x\rangle} \right)}}

is \lVert x \rVert-Lipschitz and \frac{\lVert x \rVert^{2}}{4}-smooth, since g(x)=\log(1+e^{x}) is 1-Lipschitz and 1/4-smooth, and

y \in \{1,-1\}:\quad -y\langle w,x\rangle=\langle w,-yx\rangle, \quad \lVert -yx\rVert^{2}=\lVert x\rVert^{2}

Examples

l(w, (x, y)) = \left( \langle w, x\rangle - y \right)^{2}

is 2\lVert x \rVert^{2}-smooth, since g(x)=x^{2} is 2-smooth.

Boundedness of the training set

In the previous argument we obtained smoothness constants of the form K\lVert x \rVert^{2}, but x is a variable, so we also need x to be bounded:

\lVert x \rVert^{2} \leq B

so that we can say the loss function is KB-smooth.

Proofs

Convex learning problem

A learning problem with

  1. a convex hypothesis set H
  2. a loss function l(h,z) that is convex in h

So linear regression and logistic regression are convex learning problems

Convex learning problem and convex optimization problem

When we apply the ERM rule to a convex learning problem, we are finding the minimum of a convex function

ERM_{H}(S) = \arg\min\limits_{w \in H} L_{S}(w)

which is equivalent to solving a convex optimization problem

Learnability of convex learning problems

Two kinds of convex learning problems are learnable:

  1. Convex-Lipschitz-Bounded problems
  2. Convex-Smooth-Bounded problems

A convex learning problem is not learnable in general, as the following examples show.

Example 12.8

Linear regression with the squared loss is not learnable over an unbounded hypothesis set, even in one dimension. Consider the two examples (1,0) and (\mu,-1) and the two distributions

  prob   (1,0)   (\mu,-1)
  D_1    \mu     1-\mu
  D_2    0       1

with \mu = \frac{\log(\frac{100}{99})}{2m}, chosen so that with probability at least 99/100 an m-sample from D_1 contains only the point (\mu,-1) and is thus indistinguishable from a sample from D_2.

(Figure: the line y=-\frac{1}{2\mu}x and the ERM line y=\hat{w}x in the (x,y) plane, for samples from D_1 and D_2.)

Example 12.9

Boundedness alone is not enough either, when the loss is not Lipschitz. Consider the two examples (1,-1) and (\frac{1}{\mu},0) and the two distributions

  prob   (\frac{1}{\mu},0)   (1,-1)
  D_1    \mu                 1-\mu
  D_2    0                   1

with \mu = \frac{\log(\frac{100}{99})}{2m}, again chosen so that with probability at least 99/100 an m-sample from D_1 is indistinguishable from a sample from D_2.

(Figure: the lines y=-\frac{1}{2}x, y=-x, y=x, and the ERM line y=\hat{w}x in the (x,y) plane, for samples from D_1 and D_2.)

Surrogate Loss Function

L_{D}^{0-1}(A(S))\leq L^{hinge}_{D}(A(S))\leq\min\limits_{w \in H}L_{D}^{hinge}(w)+\epsilon
=\min\limits_{w \in H}L_{D}^{0-1}(w)+\left( \min\limits_{w\in H}L_{D}^{hinge}(w)-\min\limits_{w\in H}L_{D}^{0-1}(w) \right) + \epsilon

The error thus splits into \epsilon_{approximation}, \epsilon_{estimation}, and \epsilon_{optimization} terms.
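
Pointwise, the hinge loss upper-bounds the 0-1 loss, which is what makes the first inequality above work; a small Python check (weights and data are made up):

```python
import random

def zero_one(w, x, y):
    # 0-1 loss of the linear classifier sign(<w, x>) on example (x, y)
    return 0.0 if y * sum(wi * xi for wi, xi in zip(w, x)) > 0 else 1.0

def hinge(w, x, y):
    # hinge loss: a convex surrogate that upper-bounds the 0-1 loss
    return max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))

random.seed(0)
w = [0.5, -1.0]
for _ in range(100):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    y = random.choice([-1, 1])
    assert zero_one(w, x, y) <= hinge(w, x, y)
```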

Regularization and Stability

  • Regularized Loss Minimization
  • Stability and Overfitting
  • Proof of Learnability
    • Convex-Lipschitz-Bounded problem
    • Convex-Smooth-Bounded problem

RLM learning rule

A(S)=\arg\min\limits_w (L_{S}(w)+R(w))

With the regularizer R(w)=\lambda \lVert w\rVert^{2} this becomes Tikhonov regularization:

A(S)=\arg\min\limits_w (L_{S}(w)+\lambda \lVert w\rVert^{2})

Ridge Regression

\arg\min\limits_{w\in \mathbb{R}^{d}} \left(\lambda \lVert w\rVert_{2}^{2}+\frac{1}{m}\sum\limits_{i=1}^{m}\frac{1}{2}(\langle w, x_{i}\rangle-y_{i})^{2}\right)

Setting the gradient to zero gives

(2\lambda mI+A)w=b,\quad w=(2\lambda mI+A)^{-1}b,\quad A=\sum\limits_{i=1}^{m}x_{i}x_{i}^{\intercal},\quad b=\sum\limits_{i=1}^{m}y_{i}x_{i}
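
The closed form can be implemented directly; a self-contained Python sketch with a plain Gaussian-elimination solver (all helper names are ours, and the toy data is noise-free by construction):

```python
import random

def gauss_solve(A, b):
    # solve A w = b by Gaussian elimination with partial pivoting
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def ridge_closed_form(data, lam):
    # solve (2*lam*m*I + A) w = b with A = sum_i x_i x_i^T and b = sum_i y_i x_i
    m, d = len(data), len(data[0][0])
    A = [[2.0 * lam * m if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, y in data:
        for i in range(d):
            b[i] += y * x[i]
            for j in range(d):
                A[i][j] += x[i] * x[j]
    return gauss_solve(A, b)

# noise-free toy data generated from w_true; a tiny lam recovers w_true
random.seed(0)
w_true = [2.0, -1.0]
data = []
for _ in range(50):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    data.append((x, w_true[0] * x[0] + w_true[1] * x[1]))
w = ridge_closed_form(data, lam=1e-9)
assert abs(w[0] - 2.0) < 1e-4 and abs(w[1] + 1.0) < 1e-4
```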

Stability

S = (z_{1},...,z_{m}),\quad S^{(i)} = (z_{1},...,z_{i-1},z^{\prime},z_{i+1},...,z_{m})

How much can replacing one example change the loss on z_{i}?

?\geq l(A(S^{(i)}),z_{i})-l(A(S),z_{i})\geq0

Stability and Overfitting

Replace-One-Stability

Strong Convexity

A(S)=\arg\min\limits_w (L_{S}(w)+\lambda \lVert w\rVert^{2})

The regularized objective L_{S}(w)+\lambda\lVert w\rVert^{2} is 2\lambda-strongly convex, and the proof of replace-one-stability exploits this strong convexity.

RLM Stability-Fitting Tradeoff

Lipschitzness would help

Stochastic Gradient Descent

  • Gradient Descent to SGD
  • Learning with SGD
  • Comparison of SGD and RLM
  • Application of SGD

Gradient Descent

w^{(t+1)}=w^{(t)}-\eta\nabla f(w^{(t)})

Output the averaged iterate

\bar{w}=\frac{1}{T}\sum\limits_{t=1}^{T}w^{(t)}

or simply the last iterate, \bar{w}=w^{(T)}
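
A minimal Python sketch of the update with the averaged-iterate output (the objective, step size, and iteration count are arbitrary choices):

```python
def gradient_descent(grad, w0, eta, T):
    # w^{(t+1)} = w^{(t)} - eta * grad(w^{(t)}); return the averaged iterate
    w = list(w0)
    avg = [0.0] * len(w0)
    for _ in range(T):
        avg = [a + wi / T for a, wi in zip(avg, w)]
        w = [wi - eta * gi for wi, gi in zip(w, grad(w))]
    return avg

# minimize f(w) = (w_1 - 3)^2 + (w_2 + 1)^2, whose minimizer is (3, -1)
grad = lambda w: [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]
w_bar = gradient_descent(grad, [0.0, 0.0], eta=0.1, T=5000)
assert abs(w_bar[0] - 3.0) < 0.01 and abs(w_bar[1] + 1.0) < 0.01
```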

Importance of Lipschitzness and Smoothness

Lipschitzness:

\lvert f^{\prime}(x)\rvert\leq \rho

Self-boundedness (smoothness):

\lVert f^{\prime}(x) \rVert^{2} \leq 2\beta f(x)

Gradient Descent

Stochastic Gradient Descent

f(\bar{w})-f(w^{*})\leq \epsilon\quad\text{when}\quad T \geq \frac{B^{2}\rho^{2}}{\epsilon^{2}}

same as GD

What if we step outside the hypothesis set? (Use projected SGD: project each iterate back onto H.)

What if we have strong convexity? (A decreasing step size \eta_{t}=\frac{1}{\lambda t} gives a faster rate.)

SGD Learning

We directly minimize the true risk L_{D}(w), using an unbiased estimate of its gradient: draw z\sim D and use \nabla l(w,z), since \mathbb{E}_{z\sim D}[\nabla l(w,z)]=\nabla L_{D}(w)

The SGD analysis then yields the guarantee we want
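
A sketch of this idea in Python: each step draws a fresh example from a toy distribution (our invention) and uses the gradient of the logistic loss on that single example, an unbiased estimate of the true-risk gradient:

```python
import math
import random

def stable_sigmoid(t):
    # numerically stable 1/(1 + e^{-t})
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sgd_logistic(sample_z, w0, eta, T):
    # SGD on the true risk: draw z = (x, y) ~ D, step along -grad l(w, z),
    # and return the averaged iterate
    w = list(w0)
    avg = [0.0] * len(w0)
    for _ in range(T):
        avg = [a + wi / T for a, wi in zip(avg, w)]
        x, y = sample_z()
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        coef = -y * stable_sigmoid(-margin)  # gradient factor of log(1+e^{-margin})
        w = [wi - eta * coef * xi for wi, xi in zip(w, x)]
    return avg

# toy distribution: the label is the sign of the first coordinate
random.seed(0)
def sample_z():
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    return x, (1 if x[0] > 0 else -1)

w_bar = sgd_logistic(sample_z, [0.0, 0.0], eta=0.1, T=5000)
assert w_bar[0] > 0.1  # the learned direction favors the informative coordinate
```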

Comparison

Both RLM and SGD can guarantee

\mathbb{E}_{S}[L_{D}(A(S))]\leq \min\limits_{w\in H}L_{D}(w)+\epsilon

with the following sample/iteration requirements:

  samples / iterations       RLM                                   SGD
  convex-Lipschitz-bounded   \frac{8\rho^{2}B^{2}}{\epsilon^{2}}   \frac{\rho^{2}B^{2}}{\epsilon^{2}}
  convex-Smooth-bounded      \frac{150\beta B^{2}}{\epsilon^{2}}   \frac{12\beta B^{2}}{\epsilon^{2}}

(Figure: RLM is a learning rule, while SGD is a specific algorithm.)

Application

When training a DNN, SGD is the most common algorithm:

  • The hypothesis set is all possible neural networks
  • SGD is implemented with mini-batches of training data
  • The ERM rule is applied to choose the best DNN model
  • The square loss function is commonly used
  • In general, the loss function is not convex

SGD with Momentum

v^{(t+1)}=\gamma v^{(t)}+\eta\nabla f(w^{(t)})
w^{(t+1)}=w^{(t)}-v^{(t+1)}
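
A minimal Python sketch of the momentum update (we apply the velocity update before the weight step, one common ordering; hyperparameters are arbitrary, and a deterministic full gradient stands in for the stochastic one):

```python
def sgd_momentum(grad, w0, eta=0.05, gamma=0.9, T=500):
    # v^{(t+1)} = gamma * v^{(t)} + eta * grad(w^{(t)})
    # w^{(t+1)} = w^{(t)} - v^{(t+1)}
    w = list(w0)
    v = [0.0] * len(w0)
    for _ in range(T):
        g = grad(w)
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        w = [wi - vi for wi, vi in zip(w, v)]
    return w

# minimize f(w) = (w_1 - 3)^2; the velocity smooths successive gradient steps
w = sgd_momentum(lambda w: [2.0 * (w[0] - 3.0)], [0.0])
assert abs(w[0] - 3.0) < 1e-3
```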

In practice SGD is a successful method for training DNNs

  • SGD itself introduces randomness
  • SGD with momentum helps escape local minima
  • Local minima are few in practical problems

Other Issues of SGD

  • Proper selection of the initial position
  • Boosting efficiency: RMSProp, AdaGrad
  • Proper batch size selection
  • Backpropagation to evaluate gradients