### Farid Qamar

Data Science || Machine Learning || Remote Sensing || Astrophysics || Public Policy

MLPP


Modeling data for which the target consists of binary values


In these cases we can model our data using the logistic function*:

f(x) = \frac{1}{1+e^{-z}}; \quad z = mx+b

* to be interpreted as the probability that the target is True (=1)
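As a minimal plain-Python sketch of the function above (the slope m = 2 and intercept b = 0 are illustrative values, not fitted parameters):

```python
import math

def logistic(x, m, b):
    """Logistic function f(x) = 1 / (1 + e^{-z}) with z = m*x + b."""
    z = m * x + b
    return 1.0 / (1.0 + math.exp(-z))

# z = 0 at x = 0, so f(0) = 0.5; outputs approach 1 for large positive x
# and 0 for large negative x, which is why f is read as a probability.
print(logistic(0.0, 2.0, 0.0))   # 0.5
print(logistic(3.0, 2.0, 0.0))   # close to 1
```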


__ML PARADIGM__

parameters: m and b

metric: the log-likelihood

\log(\mathcal{L}) = \sum_i \left( y_i \log f(x_i) + (1-y_i) \log(1-f(x_i)) \right)

algorithm: SGD (or similar)
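The paradigm above can be sketched end-to-end in plain Python on made-up toy data: SGD ascent on the log-likelihood, using the fact that the gradient of log(L) with respect to z = mx + b at one point is (y - f(x)). The learning rate and epoch count are arbitrary choices for illustration.

```python
import math, random

def f(x, m, b):
    """Logistic function evaluated at x with parameters m, b."""
    return 1.0 / (1.0 + math.exp(-(m * x + b)))

# Toy 1-D data: targets are 1 for positive x, 0 for negative x.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

# SGD ascent on log(L): per point, the gradient is (y - f(x)) * x for m
# and (y - f(x)) for b.
m, b = 0.0, 0.0
lr = 0.5
random.seed(0)
for epoch in range(200):
    random.shuffle(data)
    for x, y in data:
        err = y - f(x, m, b)
        m += lr * err * x
        b += lr * err

print(m, b)  # fitted parameters; f(x, m, b) now tracks the labels
```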


Thinking about the classification problem differently:

Can't we just draw a line that separates the classes? But... which line is "best"?

Recall: to find the "best" line we need a metric to optimize.

Logistic regression is a probabilistic approach:

- finds the probability that a data point falls in a class

In Support Vector Machines (SVM):

Metric: maximize the gap (aka the "margin") between the classes

[figure: the max-margin separating hyperplane between the two classes]
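The margin can be made concrete with a small plain-Python sketch: for a candidate hyperplane w·x + b = 0, each point's signed distance to the boundary is (w·x + b) / ||w||, and the margin is set by the closest points. The hyperplane and points below are made-up numbers for illustration.

```python
import math

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w·x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (dot + b) / norm

# Hypothetical 2-D hyperplane x1 + x2 - 1 = 0, i.e. w = (1, 1), b = -1.
w, b = (1.0, 1.0), -1.0
pts = [(2.0, 2.0), (0.0, 0.0), (1.0, 0.0)]
dists = [signed_distance(p, w, b) for p in pts]

# The sign tells us which side of the line a point falls on; the smallest
# absolute distance on each side sets the margin SVM tries to maximize.
print(dists)
```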


Which points should influence the decision?

Logistic regression: all points.

SVM: only the "difficult" points near the decision boundary.

Support Vectors: the points (vectors from the origin) that would influence the decision if moved; they are the points that just touch the boundary of the margin.

Separating hyperplane:

w \cdot x + b = 0

y_i = \begin{cases} +1 & \text{if } w \cdot x_i + b \geq +1 \\ -1 & \text{if } w \cdot x_i + b \leq -1 \end{cases}

or, equivalently:

y_i (w \cdot x_i + b) \geq 1

[figure: the support vectors touching the margin boundaries]

Objective:

Maximize the margin (whose width is 2 / ||w||) by minimizing:

\frac{||w||}{2}

subject to the constraints:

y_i (w \cdot x_i + b) \geq 1
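A plain-Python sketch of these pieces, on a made-up separable dataset with a hand-picked hyperplane (not a trained one): check the constraint for every point, pick out the support vectors as the points where it holds with equality, and evaluate the objective ||w|| / 2.

```python
import math

def score(x, w, b):
    """Raw decision value w·x + b for point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def constraint_ok(x, y, w, b):
    """Check the SVM constraint y * (w·x + b) >= 1 for one labeled point."""
    return y * score(x, w, b) >= 1 - 1e-9

def objective(w):
    """Hard-margin objective ||w|| / 2 (smaller ==> wider margin,
    since the margin width is 2 / ||w||)."""
    return math.sqrt(sum(wi * wi for wi in w)) / 2

# Hypothetical labeled data, separable by the line x1 = 0 (w = (1, 0), b = 0).
data = [((1.0, 0.0), +1), ((2.0, 1.0), +1),
        ((-1.0, 0.0), -1), ((-2.0, -1.0), -1)]
w, b = (1.0, 0.0), 0.0

assert all(constraint_ok(x, y, w, b) for x, y in data)

# Support vectors: points where the constraint is tight, y*(w·x + b) == 1.
svs = [x for x, y in data if abs(y * score(x, w, b) - 1.0) < 1e-9]
print(svs, objective(w))
```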


What if it is not possible to cleanly separate the data?

Minimizing

\frac{||w||}{2}

is known as Hard Margin SVM.

Allow for some errors using Soft Margin SVM: modify the objective and minimize

\frac{||w||}{2} + c \sum_{i=1}^n \zeta_i

where \zeta_i = 0 for all correctly classified points, and \zeta_i = the distance to the margin boundary for all incorrectly classified points.

c is the penalty term:

__Large c__ penalizes mistakes, creating a hard margin

__small c__ lowers the penalty and allows errors, creating a soft margin

[figure: max-margin boundary with three violating points at distances d_1, d_2, d_3, so \sum_{i=1}^n \zeta_i = d_1 + d_2 + d_3]
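A plain-Python sketch of the soft-margin objective, on made-up data with one point on the wrong side. In the standard formulation the slack is written \zeta_i = max(0, 1 - y_i (w·x_i + b)): zero for points that satisfy the margin constraint, positive for violators, matching the description above.

```python
def slack(x, y, w, b):
    """Hinge slack zeta_i = max(0, 1 - y * (w·x + b)): zero when the margin
    constraint holds, positive (distance past the margin) when violated."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * s)

def soft_objective(w, b, data, c):
    """||w|| / 2 + c * sum(zeta_i); large c punishes violations harshly."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return norm / 2 + c * sum(slack(x, y, w, b) for x, y in data)

# Hypothetical data: the third point sits on the wrong side of the boundary.
data = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((0.5, 0.0), -1)]
w, b = (1.0, 0.0), 0.0

zetas = [slack(x, y, w, b) for x, y in data]
print(zetas)                                # only the violator has zeta > 0
print(soft_objective(w, b, data, c=10.0))   # large c: violation is costly
print(soft_objective(w, b, data, c=0.1))    # small c: violation is tolerated
```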


What if the data is not linearly separable?

**Kernel Trick:**

Use a mathematical function to map the data into a space where the classes become linearly separable.

[figure: a kernel function mapping the data into a higher-dimensional, linearly separable space]

**Some kernel types:**

Linear kernel ("non-kernel"): f(x_1, x_2) = x_1 \cdot x_2 + r

Polynomial kernel: f(x_1, x_2) = (\gamma \, x_1 \cdot x_2 + r)^d

Sigmoid kernel: f(x_1, x_2) = \tanh(\gamma \, x_1 \cdot x_2 + r)

Radial Basis Function (RBF) kernel: f(x_1, x_2) = \exp(-\gamma \, ||x_1 - x_2||^2)
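The four kernels above translate directly into plain Python; the default values for gamma, r, and d below are arbitrary illustrative choices:

```python
import math

def dot(x1, x2):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x1, x2))

def linear_kernel(x1, x2, r=0.0):
    return dot(x1, x2) + r

def polynomial_kernel(x1, x2, gamma=1.0, r=1.0, d=2):
    return (gamma * dot(x1, x2) + r) ** d

def sigmoid_kernel(x1, x2, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x1, x2) + r)

def rbf_kernel(x1, x2, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)

# The RBF kernel equals 1 when the two points coincide and decays
# toward 0 as the points move apart.
print(rbf_kernel((1.0, 2.0), (1.0, 2.0)))  # 1.0
print(rbf_kernel((0.0, 0.0), (3.0, 4.0)))  # ~0
```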


**Logistic Regression:**

- Probabilistic: relies on maximizing the likelihood of the observed labels

- Relies on well-identified independent variables

- Vulnerable to overfitting and to the influence of outliers

- Simple to implement and use

- Efficient on large datasets with a small number of features

**Support Vector Machines:**

- Geometric: relies on maximizing the margin distance between the classes

- Capable of handling unstructured or semi-structured data like text and images

- Lower risk of overfitting and less sensitive to outliers

- Choosing the kernel can be difficult

- Inefficient on large datasets; performs best on small datasets with a large number of features

By Farid Qamar
