Supervised Learning:
Support Vector Machine (SVM)
Speaker: Joanne Tseng
2014/11/15
Outline
- Review (supervised / semi-supervised / unsupervised learning)
- The general supervised learning process
- Support Vector Machine (linearly separable problem)
  - Cost Function
  - Decision Boundary
- SVM with kernel (non-linearly separable problem)
  - Non-linear Decision Boundary
  - Some choices of Kernel
Review
- supervised learning: training data with labels
- semi-supervised learning: training data with both labeled and unlabeled examples
- unsupervised learning: training data without labels
Labeled training data: \{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{m},y_{m})\}
The general Supervised Learning Process
- Unknown Target Function: f:X\rightarrow Y
- Training Examples: \{(X_{1},y_{1}),(X_{2},y_{2}),\ldots,(X_{m},y_{m})\}
- Hypothesis Set: H=\{h_{1},h_{2},\ldots,h_{M}\}
- Learning Algorithm: A
- Final Hypothesis: g, want g(x)\approx f(x)
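As a minimal illustration of this loop (my own sketch in Python, not from the original slides): treat the hypothesis set H as a short list of candidate classifiers and let the learning algorithm A pick the hypothesis g with the lowest training error.

import numpy as np

# Hypothetical training examples {(X_i, y_i)}: 1-D inputs with binary labels.
X = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])
y = np.array([0, 0, 1, 1, 1])

# Hypothesis set H = {h_1, ..., h_M}: simple threshold classifiers.
H = [lambda x, t=t: (x >= t).astype(int) for t in (-1.5, 0.0, 1.5)]

# Learning algorithm A: pick the hypothesis with the lowest training error.
errors = [np.mean(h(X) != y) for h in H]
g = H[int(np.argmin(errors))]          # final hypothesis g, hoping g(x) ≈ f(x)
print(errors, g(np.array([0.7])))      # the middle threshold fits this toy data best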
Support Vector Machine
- Cost Function
- Linear Decision Boundary
First, let's start from logistic regression.
[REVIEW]
Cost Function(I)
Idea of the cost function:
choose \theta_{1},\theta_{2},\ldots so that h_{\theta}(x) is close to y for our training examples \{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{m},y_{m})\}
Cost Function(II)
Logistic Regression:
J(\theta)=-\big(y\log h_{\theta}+(1-y)\log(1-h_{\theta})\big)
where h_{\theta}=\frac{1}{1+e^{-\theta^{T}X}}
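A minimal numpy sketch of this cost (my own illustration; all variable names are assumptions, not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    # Average logistic-regression cost J(theta); X is the (m, n) design matrix,
    # y holds labels in {0, 1}.
    h = sigmoid(X @ theta)                     # h_theta(x) for every example
    return np.mean(-(y * np.log(h) + (1 - y) * np.log(1 - h)))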
Cost Function(III)
Support Vector Machine:
J(\theta)=y\,Cost_{1}(\theta^{T}X)+(1-y)\,Cost_{0}(\theta^{T}X)
[Figure: plots of Cost_{1}(z) and Cost_{0}(z), piecewise-linear approximations of the two logistic cost terms]
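The slides show Cost_1(z) and Cost_0(z) only as plots; one common concrete choice (the hinge-style form, assumed here rather than taken from the slides) is:

import numpy as np

def cost_1(z):
    # Cost when y = 1: zero once z = theta^T x is at least 1, linear below that.
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    # Cost when y = 0: zero once z = theta^T x is at most -1, linear above that.
    return np.maximum(0.0, 1.0 + z)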
Optimization Objective(I)
Logistic Regression:
\min_{\theta}\Big\{\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\big(-\log h_{\theta}(x^{(i)})\big)+(1-y^{(i)})\big(-\log(1-h_{\theta}(x^{(i)}))\big)\Big]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}\Big\}
Two parts:
- first term: cost on the training examples (A)
- second term: regularization term (B)
- \lambda : regularization parameter
Support Vector Machine:
\min_{\theta}\{CA+B\}
where C=\frac{1}{\lambda}
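One step the slide leaves implicit (my own note): writing A=\sum_{i=1}^{m}\big[y^{(i)}(-\log h_{\theta}(x^{(i)}))+(1-y^{(i)})(-\log(1-h_{\theta}(x^{(i)})))\big] and B=\frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}, dropping the constant 1/m and rescaling by 1/\lambda leaves the minimizer unchanged:

\min_{\theta}\Big\{\frac{1}{m}A+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}\Big\}
\;\Longleftrightarrow\;
\min_{\theta}\Big\{\frac{1}{\lambda}A+\frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}\Big\}
=\min_{\theta}\{CA+B\},\qquad C=\frac{1}{\lambda}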
Optimization Objective(II)
Support Vector Machine:
\min_{\theta}\Big\{C\sum_{i=1}^{m}\Big[y^{(i)}\,Cost_{1}(\theta^{T}x^{(i)})+(1-y^{(i)})\,Cost_{0}(\theta^{T}x^{(i)})\Big]+\frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}\Big\}
NOTE: When C is very large, minimizing this objective pushes the first term (A) to (or very close to) zero.
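A minimal numpy sketch of this objective (my own illustration, using the hinge-style Cost_1/Cost_0 from above; the names and the convention that the first column of X is all ones are assumptions):

import numpy as np

def svm_objective(theta, X, y, C):
    # Unconstrained SVM objective: C * (summed per-example cost) + regularizer.
    z = X @ theta                               # theta^T x^(i) for every example
    cost1 = np.maximum(0.0, 1.0 - z)            # hinge-style Cost_1 (used when y = 1)
    cost0 = np.maximum(0.0, 1.0 + z)            # hinge-style Cost_0 (used when y = 0)
    data_term = np.sum(y * cost1 + (1 - y) * cost0)
    reg_term = 0.5 * np.sum(theta[1:] ** 2)     # theta_0 is left unregularized
    return C * data_term + reg_term

With a very large C, any example whose hinge cost is nonzero dominates the objective, which is exactly the observation in the NOTE above.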
SVM Decision Boundary:
Linearly Separable Case
SVM picks the separating boundary with the largest margin, so it is also known as a Large Margin Classifier.
SVM Decision Boundary:
In the presence of outliers
- C not too large (\lambda large): the boundary keeps a large margin and tolerates the outlier
- C very large (\lambda small): the boundary shifts to accommodate the outlier
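A quick way to see this trade-off (a sketch using scikit-learn, whose SVC wraps LIBSVM; its C parameter plays the same role as the C above, and the data set here is made up):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [0] * 20)
X[0] = [-2.5, -2.5]                      # plant one outlier inside the other class

for C in (0.1, 1000.0):                  # small C ~ large lambda, large C ~ small lambda
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.coef_, clf.intercept_)  # larger C pulls the boundary toward the outlier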
SVM with kernel
- Non-linear Decision Boundary
- Other choices of kernel
Non-Linear Decision Boundary
- Predict y = 1 (i.e. h_{\theta}(x)=1) if
\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\theta_{3}x_{1}x_{2}+\theta_{4}x_{1}^{2}+\ldots\geq 0
Non-Linear Decision Boundary
Using the model from the previous page:
\theta_{0}+\theta_{1}f_{1}+\theta_{2}f_{2}+\theta_{3}f_{3}+\theta_{4}f_{4}+\ldots
where f_{1}=x_{1},\; f_{2}=x_{2},\; f_{3}=x_{1}x_{2},\; f_{4}=x_{1}^{2}
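A minimal sketch of this hand-crafted feature map (f_1..f_4 follow the slide; the particular theta and data points are hypothetical):

import numpy as np

def feature_map(x1, x2):
    # Map (x1, x2) to [1, f1, f2, f3, f4] = [1, x1, x2, x1*x2, x1^2].
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2])

theta = np.array([-1.0, 0.0, 0.0, 0.0, 1.0])       # hypothetical: y = 1 when x1^2 >= 1

def predict(x1, x2):
    return int(theta @ feature_map(x1, x2) >= 0)    # predict y = 1 if theta^T f >= 0

print(predict(0.5, 3.0), predict(2.0, 0.0))         # -> 0 1 : a non-linear boundary in x-space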
Non-Linear Decision Boundary
SVM without Kernel:
\min_{\theta}\Big\{C\sum_{i=1}^{m}\Big[y^{(i)}\,Cost_{1}(\theta^{T}x^{(i)})+(1-y^{(i)})\,Cost_{0}(\theta^{T}x^{(i)})\Big]+\frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}\Big\}
SVM with Kernels:
\min_{\theta}\Big\{C\sum_{i=1}^{m}\Big[y^{(i)}\,Cost_{1}(\theta^{T}f^{(i)})+(1-y^{(i)})\,Cost_{0}(\theta^{T}f^{(i)})\Big]+\frac{1}{2}\sum_{j=1}^{m}\theta_{j}^{2}\Big\}
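The slides do not spell out how f^{(i)} is built. In the usual Gaussian-kernel construction (assumed here), every training example serves as a landmark l^{(i)} and each feature measures similarity to one landmark:

import numpy as np

def gaussian_features(x, landmarks, sigma=1.0):
    # f_i(x) = exp(-||x - l^(i)||^2 / (2 sigma^2)), plus a leading 1 for f_0.
    sims = np.exp(-np.sum((landmarks - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return np.concatenate([[1.0], sims])

# With landmarks taken from the m training points, f(x) has m + 1 entries,
# which is why the regularization sum in the kernelized objective runs over m terms.
landmarks = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
print(gaussian_features(np.array([1.0, 1.0]), landmarks))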
Some choices of Kernels
- Gaussian / RBF Kernel (Radial Basis Function) - the most commonly used
- Polynomial Kernel
- Sigmoid Kernel
- Chi-squared Kernel
- String Kernel
- Histogram intersection Kernel
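In practice these kernels are usually selected by name rather than implemented by hand; a small scikit-learn sketch (scikit-learn wraps LIBSVM, which the last slide suggests for practice; the data set here is arbitrary):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ("rbf", "poly", "sigmoid", "linear"):   # built-in kernel choices
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, round(clf.score(X, y), 3))          # RBF usually fits this data best

The chi-squared, string and histogram-intersection kernels are not built in; they would be supplied as a callable or a precomputed kernel matrix.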
Logistic Regression vs. SVM
- Let n be the number of features
- Let m be the number of training examples
- If n is large (> 10,000) relative to m: use logistic regression / SVM without kernel
- If n is small (1 - 1,000) and m is intermediate: use SVM with Gaussian Kernel
- If n is small and m is large (50,000+): create / add more features, then use logistic regression / SVM without kernel
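The guideline above written as a toy helper (purely illustrative; the thresholds are the slide's rough numbers, not a substitute for validation):

def suggest_model(n_features, m_examples):
    # Rule-of-thumb model choice following the slide's heuristics.
    if n_features > 10_000:
        return "logistic regression or SVM without kernel"
    if n_features <= 1_000 and m_examples < 50_000:
        return "SVM with Gaussian (RBF) kernel"
    return "create/add more features, then logistic regression or SVM without kernel"

print(suggest_model(20_000, 5_000))
print(suggest_model(100, 10_000))
print(suggest_model(100, 200_000))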
use Orange & LIBSVM to practice ^.<