Supervised Learning:

Support Vector Machine (SVM)

Speaker: Joanne Tseng

2014/11/15 

Outline

  1. Review
    (supervised / semi-supervised / unsupervised learning)
  2. The general supervised learning process
  3. Support Vector Machine (linearly separable problem)
    - Cost Function
    - Decision Boundary
  4. SVM with kernel (non-linearly separable problem)
    - Non-linear separability
    - Some choices of kernel

Review

  1. Supervised learning: training data with labels

     \{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{m},y_{m})\}

  2. Semi-supervised learning: training data with both labeled and unlabeled examples

  3. Unsupervised learning: training data without labels

The general Supervised Learning Process

Training Examples: \{(X_{1},y_{1}),(X_{2},y_{2}),\ldots,(X_{m},y_{m})\}

Unknown Target Function: f: X \rightarrow Y

Learning Algorithm: A

Hypothesis Set: H = \{h_{1},h_{2},\ldots,h_{M}\}

Final Hypothesis: g

want: g(x) \approx f(x)
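A minimal sketch of this process in Python with scikit-learn (the library, the synthetic data, and all names below are my choices, not part of the slides): the learning algorithm A searches the hypothesis set H and returns the final hypothesis g.

```python
# Sketch of the supervised learning process, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Training examples {(X_1, y_1), ..., (X_m, y_m)}, drawn from an
# unknown target function f: X -> Y (here simulated synthetically).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Learning algorithm A searches the hypothesis set H (linear classifiers)
# and returns the final hypothesis g.
A = LinearSVC()
g = A.fit(X, y)

# We want g(x) ~ f(x); training accuracy is a first (optimistic) check.
print("training accuracy:", g.score(X, y))
```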

Support Vector Machine

- Cost Function

- Linear Decision Boundary

First, let's start from logistic regression. [REVIEW]

Cost Function(I)

Idea of the cost function:

choose \theta_{1},\theta_{2},\ldots so that h_{\theta}(x) is close to y for our training examples \{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{m},y_{m})\}

Cost Function(II)

Logistic Regression:

J(\theta)=-(y\log h_{\theta}+(1-y)\log(1-h_{\theta}))

where

h_{\theta} = \frac{1}{1+e^{-\theta^{T} X}}
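A short numerical sketch of this cost in NumPy (the function names and toy data are illustrative, not from the slides):

```python
import numpy as np

def h_theta(theta, X):
    """Logistic hypothesis: h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def logistic_cost(theta, X, y):
    """J(theta) = -(y*log(h) + (1-y)*log(1-h)), averaged over examples."""
    h = h_theta(theta, X)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy check: two examples, two features (first column acts as the bias).
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
theta = np.array([0.1, 0.5])
print(logistic_cost(theta, X, y))
```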

Cost Function(III)

Support Vector Machine:

J(\theta)=yCost_{1}(\theta^{T}X)+(1-y)Cost_{0}(\theta^{T}X)

where Cost_{1}(z) and Cost_{0}(z) are piecewise-linear approximations of the two logistic-loss terms -\log h_{\theta} and -\log(1-h_{\theta}).
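The slides do not pin Cost_1 and Cost_0 down exactly; a standard choice, matching the hinge loss, is the piecewise-linear pair sketched below (an assumption, not the only option):

```python
import numpy as np

def cost_1(z):
    """Cost for y = 1: zero once z >= 1, linear penalty below (hinge)."""
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    """Cost for y = 0: zero once z <= -1, linear penalty above (mirror)."""
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-3.0, 3.0, 7)
print(cost_1(z))  # large for z << 1, exactly 0 for z >= 1
print(cost_0(z))  # mirror image: exactly 0 for z <= -1
```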

Optimization Objective(I)

Logistic Regression :

\min_{\theta}\{\frac{1}{m} \sum^{m}_{i=1} [y^{(i)} (-\log h_{\theta}(x^{(i)}))+(1-y^{(i)})(-\log(1-h_{\theta}(x^{(i)})))]+ \frac{\lambda}{2m} \sum^{n}_{j=1} \theta_{j}^{2} \}

Two parts:

  • first term: cost from the training data (A)
  • second term: regularization term (B)
  • \lambda: regularization parameter

Support Vector Machine :

\min_{\theta} \{CA+B\}

where

C = \frac{1}{\lambda}
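Why C = 1/\lambda: multiplying an objective by a positive constant does not change its minimizer. Writing B = \frac{1}{2}\sum^{n}_{j=1}\theta_{j}^{2} and scaling the logistic objective by m/\lambda gives

\frac{m}{\lambda}\left(\frac{1}{m}A+\frac{\lambda}{m}B\right)=\frac{1}{\lambda}A+B=CA+B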

Optimization Objective(II)

Support Vector Machine :

\min_{\theta}\{C \sum^{m}_{i=1} [y^{(i)} Cost_{1}(\theta^{T}x^{(i)})+(1-y^{(i)})Cost_{0}(\theta^{T}x^{(i)})]+ \frac{1}{2} \sum^{n}_{j=1} \theta_{j}^{2} \}

NOTE: When C is very large, minimizing the objective forces the first term (A) to be (nearly) zero.
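A sketch of this objective in NumPy, reusing the hinge-style Cost_1/Cost_0 from above (an assumption); it makes the large-C behavior easy to see:

```python
import numpy as np

def svm_objective(theta, X, y, C):
    """C * sum_i [y_i*Cost1(theta^T x_i) + (1-y_i)*Cost0(theta^T x_i)]
       + (1/2) * sum_j theta_j^2, with hinge-style Cost1/Cost0."""
    z = X @ theta
    cost1 = np.maximum(0.0, 1.0 - z)  # penalty when y = 1 but z < 1
    cost0 = np.maximum(0.0, 1.0 + z)  # penalty when y = 0 but z > -1
    data_term = np.sum(y * cost1 + (1 - y) * cost0)  # term (A)
    reg_term = 0.5 * np.sum(theta ** 2)              # term (B)
    # When C is huge, any nonzero data term dominates, so the minimizer
    # must drive (A) to (near) zero.
    return C * data_term + reg_term
```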

SVM Decision Boundary:

Linearly Separable Case

Among all separating boundaries, the SVM picks the one with the largest margin, hence the name:

Large Margin Classifier
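One way to see the large margin concretely is to fit a linear SVM and compute the margin width 2/||w|| (scikit-learn and the toy points below are my assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs (toy data).
X = np.array([[0, 0], [1, 1], [4, 4], [5, 5]], dtype=float)
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # very large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]
print("margin width:", 2.0 / np.linalg.norm(w))   # gap between the margin lines
print("support vectors:\n", clf.support_vectors_)  # the points that fix the boundary
```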

SVM Decision Boundary:

In the presence of outliers:

  • C not too large (\lambda large): the classifier tolerates the outlier and keeps the large margin
  • C very large (\lambda small): the decision boundary shifts to accommodate the outlier
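A toy experiment on this trade-off, again assuming scikit-learn (data and C values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Two separable blobs plus one class-0 outlier sitting near class 1.
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6],
              [4.5, 4.5]], dtype=float)  # last point is the outlier
y = np.array([0, 0, 0, 1, 1, 1, 0])

for C in (0.1, 1e6):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # Moderate C tolerates one misclassified outlier and keeps the margin;
    # very large C shifts the boundary to accommodate the outlier.
    print(f"C={C:g}  accuracy={clf.score(X, y):.2f}  "
          f"margin={2.0 / np.linalg.norm(w):.2f}")
```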

SVM with kernel

- Non-linear Decision Boundary

- Other choices of kernel

Non-Linear Decision Boundary

  • Predict y = 1 (i.e. h_{\theta}(x)=1) if
    \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\theta_{3}x_{1}x_{2}+\theta_{4}x_{1}^{2} +\ldots\geq 0

Non-Linear Decision Boundary

Rewriting the model from the previous slide:

\theta_{0}+\theta_{1}f_{1}+\theta_{2}f_{2}+\theta_{3}f_{3}+\theta_{4}f_{4} +\ldots

where

f_{1}=x_{1},\ f_{2}=x_{2},\ f_{3}=x_{1}x_{2},\ f_{4}=x_{1}^{2}
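A sketch of this hand-built feature map in NumPy (names are illustrative):

```python
import numpy as np

def feature_map(X):
    """f1 = x1, f2 = x2, f3 = x1*x2, f4 = x1^2 (the map from the slide)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2])

# A linear decision function in f-space, theta_0 + theta^T f,
# is a non-linear boundary in the original (x1, x2) space.
X = np.array([[1.0, 2.0], [0.5, -1.0]])
print(feature_map(X))
```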

Non-Linear Decision Boundary

SVM without Kernel :

\min_{\theta}\{C \sum^{m}_{i=1} [y^{(i)} Cost_{1}(\theta^{T}x^{(i)})+(1-y^{(i)})Cost_{0}(\theta^{T}x^{(i)})]+ \frac{1}{2} \sum^{n}_{j=1} \theta_{j}^{2} \}

SVM with Kernels :

\min_{\theta}\{C \sum^{m}_{i=1} [y^{(i)} Cost_{1}(\theta^{T}f^{(i)})+(1-y^{(i)})Cost_{0}(\theta^{T}f^{(i)})]+ \frac{1}{2} \sum^{m}_{j=1} \theta_{j}^{2} \}
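The slides leave f^{(i)} undefined; in the usual Gaussian-kernel construction each training example serves as a landmark l^{(i)}, and each feature measures similarity to one landmark (which is why the regularization sum now runs to m). A sketch under that assumption:

```python
import numpy as np

def gaussian_features(x, landmarks, sigma=1.0):
    """f_i = exp(-||x - l_i||^2 / (2 sigma^2)), one feature per landmark.
    Using every training example as a landmark makes n equal to m."""
    d2 = np.sum((landmarks - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

landmarks = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # l^(1)..l^(3)
x = np.array([1.0, 0.0])
print(gaussian_features(x, landmarks))  # each f_i in (0, 1], peaks near its landmark
```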

Some Choices of Kernel

  • Gaussian/RBF kernel (Radial Basis Function)
    - most commonly used
  • Polynomial kernel
  • Sigmoid kernel
  • Chi-squared kernel
  • String kernel
  • Histogram intersection kernel
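Three of these kernels are built into scikit-learn's SVC; the others can be supplied as a precomputed Gram matrix. A quick comparison on a toy dataset (library choice and data are mine):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ("rbf", "poly", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(f"{kernel:8s} training accuracy = {clf.score(X, y):.2f}")
```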

Logistic Regression vs. SVM

  • Let n be the number of features
  • Let m be the number of training examples 
  1. If n is large (> 10,000) relative to m:
    use logistic regression or SVM without a kernel
  2. If n is small (1-1,000) and m is intermediate:
    use SVM with a Gaussian kernel
  3. If n is small and m is large (50,000+):
    create/add more features, then use logistic regression or SVM without a kernel
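The rule of thumb above, encoded as a small helper (the thresholds follow the slide; the function itself is illustrative):

```python
def pick_model(n_features, m_examples):
    """Heuristic from the slide for choosing between LR and SVM variants."""
    if n_features > 10_000:                          # n large relative to m
        return "logistic regression or SVM without a kernel"
    if n_features <= 1_000 and m_examples < 50_000:  # n small, m intermediate
        return "SVM with a Gaussian kernel"
    return "create/add features, then logistic regression or linear SVM"

print(pick_model(50_000, 5_000))
print(pick_model(100, 10_000))
print(pick_model(100, 500_000))
```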

Use Orange & LIBSVM to practice ^.<
