Support Vector Machines (SVM)

Models for target classification



Logistic Regression - recap

Modeling data for which the target consists of binary values.

In these cases we can model our data using the logistic function*:

f(x) = \frac{1}{1 + e^{-z}}; \quad z = mx + b

* to be interpreted as the probability that the target is True (= 1)

ML PARADIGM

      parameters: m and b

      metric: the log-likelihood

log(\mathcal{L}) = \sum_i \left( y_i \log(f(x_i)) + (1 - y_i) \log(1 - f(x_i)) \right)

      algorithm: SGD (or similar)
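To make the recap concrete, here is a minimal sketch of this paradigm in Python (not from the slides; the toy data, learning rate, and step count are illustrative choices, and plain full-batch gradient ascent stands in for SGD):

import numpy as np

def sigmoid(z):
    # logistic function: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, lr=0.5, n_steps=2000):
    # Maximize log(L) = sum(y_i*log(f) + (1 - y_i)*log(1 - f))
    # by gradient ascent on the parameters m and b.
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_steps):
        f = sigmoid(m * x + b)             # predicted P(y = 1)
        m += lr * np.sum((y - f) * x) / n  # d log(L) / dm
        b += lr * np.sum(y - f) / n        # d log(L) / db
    return m, b

# toy 1-D data: class 0 clustered near 0, class 1 near 3
x = np.array([0.2, 0.5, 0.9, 2.3, 2.8, 3.4])
y = np.array([0, 0, 0, 1, 1, 1])
m, b = fit_logistic(x, y)
print(m, b, sigmoid(m * 1.6 + b))  # probability that a point at x = 1.6 belongs to class 1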


Support Vector Machines (SVM)

Thinking about the classification problem differently:

Logistic regression is a probabilistic approach:

 -  finds the probability a data point falls in a class

Can't we just draw a line that separates the classes?

But... which line is "best"?

recall: to find the "best" line we need a metric to optimize

In Support Vector Machines (SVM):

          Metric: maximize the gap (aka the "margin") between the classes

[figure: two classes separated by the max-margin separating hyperplane]


Support Vector Machines (SVM)

Which points should influence the decision?

Logistic regression:

          All points

SVM:

          Only the "difficult" points near the decision boundary

Support Vectors:

          Points (vectors from the origin) that would influence the decision if moved

          Points that lie exactly on the boundary of the margin

Separating hyperplane:

w \cdot x + b = 0

y = \begin{cases} +1 & \text{if } w \cdot x + b \geq +1 \\ -1 & \text{if } w \cdot x + b \leq -1 \end{cases}

or, equivalently,

y (w \cdot x + b) \geq 1
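A minimal sketch of these ideas with scikit-learn's SVC (not from the slides; the 2-D toy points are illustrative, and a very large C approximates the hard-margin case):

import numpy as np
from sklearn.svm import SVC

# toy linearly separable 2-D data
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # class -1
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # very large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]        # normal vector w of the separating hyperplane
b = clf.intercept_[0]   # offset b
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)

# every training point satisfies y_i (w . x_i + b) >= 1 (up to numerical tolerance)
print(y * (X @ w + b))

Only the points listed in clf.support_vectors_ determine w and b; moving any other point, as long as it stays outside the margin, would leave the hyperplane unchanged.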

Support Vector Machines (SVM)

Objective:

          Maximize the margin by minimizing:

\frac{||w||}{2}

Constraints:

          such that, for every point i,

y_i (w \cdot x_i + b) \geq 1
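Why minimizing ||w|| maximizes the margin (a standard derivation, not spelled out on the slides): the two margin boundaries are

w \cdot x + b = +1 \quad \text{and} \quad w \cdot x + b = -1

and the perpendicular distance between these two parallel hyperplanes is

\text{margin} = \frac{(+1) - (-1)}{||w||} = \frac{2}{||w||}

so making ||w|| as small as possible makes the gap 2/||w|| as large as possible, which is exactly the objective above.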


Support Vector Machines (SVM)

What if it is not possible to cleanly separate the data?

Minimizing

\frac{||w||}{2}

is known as Hard Margin SVM.

Allow for some errors using Soft Margin SVM:

          modify the objective and minimize:

\frac{||w||}{2} + c \sum_{i=1}^n \zeta_i

where:

\zeta_i = 0 for all correctly classified points

and:

\zeta_i = the distance to the margin boundary for all misclassified points

c is the penalty term:

    large c penalizes mistakes - approaches a hard margin

    small c lowers the penalty and allows errors - a softer margin

[figure: max-margin boundary with three misclassified points at distances d1, d2, d3 from it, so \sum_{i=1}^n \zeta_i = d_1 + d_2 + d_3]
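A minimal sketch of the effect of c, using scikit-learn's SVC (not from the slides; SVC's soft-margin objective matches the one above up to squaring ||w||, its penalty term is spelled C, and the overlapping toy blobs are illustrative):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# two overlapping Gaussian blobs, so no straight line separates them cleanly
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=2.0, scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C: wide, forgiving margin (many support vectors);
    # large C: narrower margin that tries harder to avoid mistakes
    print(f"C={C:6.2f}  support vectors={len(clf.support_vectors_)}  "
          f"train accuracy={clf.score(X, y):.2f}")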


Support Vector Machines (SVM)

What if the data is not linearly separable?

Kernel Trick:

          Use a mathematical function to map the data into a representation in which the classes become linearly separable

[figure: non-separable data transformed by a kernel function into a linearly separable set]

Some kernel types:

Linear kernel ("non-kernel")

f(x_1, x_2) = x_1 \cdot x_2 + r

Polynomial kernel

f(x_1, x_2) = (\gamma \, x_1 \cdot x_2 + r)^d

Sigmoid kernel

f(x_1, x_2) = \tanh(\gamma \, x_1 \cdot x_2 + r)

Radial Basis Function (RBF) kernel

f(x_1, x_2) = \exp(-\gamma \, ||x_1 - x_2||^2)
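A minimal sketch comparing these kernels on data that no straight line can separate (not from the slides; make_circles and the parameter choices are illustrative):

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# concentric circles: impossible to separate with a straight line in the original space
X, y = make_circles(n_samples=300, factor=0.4, noise=0.08, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "sigmoid", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(f"{kernel:8s} test accuracy = {clf.score(X_test, y_test):.2f}")

# the RBF kernel typically separates the circles almost perfectly,
# while the linear kernel stays close to chance level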

Support Vector Machines (SVM)

Logistic Regression:

  • Probabilistic - relies on maximizing the likelihood of the observed labels
  • Relies on well-identified independent variables
  • Vulnerable to overfitting and to the influence of outliers
  • Simple to implement and use
  • Efficient on large datasets with a low number of features

Support Vector Machines:

  • Geometric - relies on maximizing the margin between the classes
  • Capable of handling unstructured or semi-structured data like text and images
  • Lower risk of overfitting and less sensitive to outliers
  • Choosing the kernel is difficult
  • Inefficient on large datasets; performs best on small datasets with a large number of features
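A brief head-to-head sketch of the two approaches (not from the slides; the synthetic dataset and settings are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# small dataset with a relatively large number of features
X, y = make_classification(n_samples=200, n_features=40, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=5000).fit(X_train, y_train)
svm = SVC(kernel="rbf").fit(X_train, y_train)

print("logistic regression accuracy:", logreg.score(X_test, y_test))
print("class probability for one point:", logreg.predict_proba(X_test[:1]))  # probabilistic output
print("SVM accuracy:", svm.score(X_test, y_test))  # purely geometric; no probabilities by default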

MLPP - Support Vector Machines (SVM)

By Farid Qamar
