# Machine Learning Algorithms

A crash course with a concise approach

Disclaimer

The mathematical content is not fully rigorous and the notation may be incomplete or inaccurate. Please forgive me.

There is math.

There is theory.

There is pseudocode.

A lot of content in 40 minutes

Interact with me!

# "M.L. is complicated!"

clf = linear_model.LogisticRegression(C=1.0,
penalty='l1', tol=1e-6)

"What does this method do?"

"M.L. "

"..."

# "M.L. is easy!11!"

$\theta_0 = \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x_0^i)$

$\theta_1 = \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x_1^i)$

$\theta_2 = \theta_2 - \alpha \frac{1}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x_2^i)$

$\vdots$

$\theta_n = \theta_n - \alpha \frac{1}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x_n^i)$

"Because it is as cool as big data now."

# Goals

• Get started with Machine Learning
• Connect the missing pieces
• Show the relation between Mathematics, ML theory and implementation
• You can apply M.L. to your personal projects (Example: my bots )

# Agenda

• General ideas about Machine Learning
• Problem 1 - Used car prices (Linear Regression)
• Problem 2 - Is this job position interesting? (Logistic regression)
• Problem 3 - Improving Problem 2 solution (Regularization)
• Problem 4 - Improving problem 3 (Support vector machines)
• Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
• Extras and References

# Some problems

Supervised Learning: we have an idea of the right answer for what we are asking. Example: given a picture of a person, predict how old they are.

Unsupervised Learning: we have no idea of the right answer for what we are asking. Example: given a collection of items you don't know, try to group them by similarity.

Reinforcement Learning: let the machine take control of the context while you provide input feedback. Example: reduce items in stock by creating dynamic promotions.

# General idea

Training set -> Learning Algorithm -> hypothesis

input -> hypothesis -> predicted output

# This presentation will cover distinct algorithms + theory + practice

## For every algorithm, we will:

• Come up with a problem to solve
• Come up with a strategy
• Understand the mechanics of the algorithm
• Come up with a test scenario
• Provide examples with libraries in C/C++/Python

# Agenda

• General ideas about Machine Learning
• Problem 1 - Used car prices (Linear Regression)
• Problem 2 - Is this job position interesting? (Logistic regression)
• Problem 3 - Improving Problem 2 solution (Regularization)
• Problem 4 - Improving problem 3 (Support vector machines)
• Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
• Extras and References

# Problem:

Sell used cars. Find the best price to sell them (not considering people who collect old cars)

Do we have any idea of known relations?

older -> cheaper

unpopular brands -> cheaper

too many kms -> cheaper

# Problem:

Sell used cars. Find the best price to sell them (not considering people who collect old cars)

What kind of M.L. Algorithm would you use here?

Supervised Learning

# Algorithm draft:

Choose one variable to analyse against what you want to predict (example: year vs. price). Price is the variable you want to predict.

Come up with a training set to analyse these variables

• Number of training examples: m
• Input variable or features: x
• Output variable or target: y

# Back to the principle

Training set (m examples) -> Learning Algorithm -> hypothesis h

input x -> h -> predicted output $\hat{y}$

# Strategy to h

Linear equation

h = ax + b

How do you choose a and b?

From the training set, we have expected values y for a certain x:

$(x^i , y^i)$

Come up with a hypothesis that gives you the smallest error over the whole training set.

# The algorithm

For an input x, measure the difference between h(x) and y (the output of the training set):

$h(x) - y$

# The algorithm

Measure the difference between h(x) and y (the output of the training set), now for the entire training set.

# The algorithm

For an input x, measure the difference between the prediction and the output of the training set, over the entire training set:

$\sum_{i=1}^{m}(h(x^i) - y^i)$

# The algorithm

We don't want positive and negative values to cancel out, so square the difference; then take the average:

$\frac{1}{2m}\sum_{i=1}^{m}(h(x^i) - y^i)^2$

# The algorithm

$J = \frac{1}{2m}\sum_{i=1}^{m}(h(x^i) - y^i)^2$

This is the Cost Function. We want to minimize the difference.
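In code, this cost function is only a few lines. A minimal NumPy sketch (the data and names below are made up for illustration, not the presentation's official code):

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2), with h(x) = X @ theta."""
    m = len(y)
    predictions = X @ theta           # hypothesis h(x) for every training example
    errors = predictions - y          # difference between prediction and expected output
    return (1 / (2 * m)) * np.sum(errors ** 2)

# toy data: X carries a column of ones (the intercept) plus one feature
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(X, y, np.array([0.0, 2.0])))  # perfect fit -> 0.0
```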

# Understanding it

We can come up with different hypotheses (slopes for the h function)


# Understanding it

That's the difference $h(x^i) - y^i$ for the hypothesis $h = a_1x + b$, giving a certain value of the cost function.

# Understanding it

That's the difference $h(x^i) - y^i$ for another hypothesis, $h = a_2x + b$, giving another value of the cost function.

On $h = ax + b$ we are varying a, so we can plot $J(a)$ and look for its minimum value.

# We also need to vary b. J will be a surface (3 dimensions)

h = ax + b

J(a,b)

## Back to the algorithm

Minimize any cost function: start somewhere and 'walk' towards the minimum value.

# Calculus - we can get this information from the derivative

$\frac{\partial }{\partial a}$

# Back to the cost function

Start from a guess $(a_0, b_0)$ and 'walk' on the graph towards the minimum value, using the partial derivatives:

$a = a_0 - \alpha \frac{\partial }{\partial a}J(a,b)$

$b = b_0 - \alpha \frac{\partial }{\partial b}J(a,b)$

$\alpha$ is the learning rate. Each step produces another guess.

Repeat until convergence

# Expand the derivatives:

$a = a_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x^i)$

$b = b_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(h(x^i) - y^i)$

Repeat until convergence
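Before jumping to Octave, here is a minimal Python/NumPy sketch of exactly these two update rules (toy data and made-up names, just to show the loop):

```python
import numpy as np

def gradient_descent(x, y, alpha, iterations):
    """Fit h(x) = a*x + b by repeating the two update rules until convergence."""
    a, b = 0.0, 0.0                                   # initial guess
    m = len(y)
    for _ in range(iterations):
        h = a * x + b                                 # current hypothesis on the training set
        grad_a = (1 / m) * np.sum((h - y) * x)        # d/da J(a, b)
        grad_b = (1 / m) * np.sum(h - y)              # d/db J(a, b)
        a, b = a - alpha * grad_a, b - alpha * grad_b # simultaneous update
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 1.0                                     # data generated from a known line
print(gradient_descent(x, y, alpha=0.05, iterations=5000))  # approx (3.0, 1.0)
```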

## The code in Octave

% Initialise
data = load('used_cars.csv'); % year x price
y = data(:, 2);
m = length(y);                % number of training examples
X = [ones(m, 1), data(:,1)];  % add the x_0 = 1 column
theta = zeros(2, 1);          % linear function: two parameters
iterations = 1500;
alpha = 0.01;

% Cost function
predictions = X*theta;
sqErrors = (predictions - y).^2;
J = 1/(2*m) * sum(sqErrors);

% Gradient descent
J_history = zeros(iterations, 1);
for iter = 1:iterations
  x = X(:,2);
  delta = theta(1) + (theta(2)*x);                 % h(x) for every example
  t0 = theta(1) - alpha * (1/m) * sum(delta - y);
  t1 = theta(2) - alpha * (1/m) * sum((delta - y) .* x);
  theta = [t0; t1];                                % simultaneous update
  J_history(iter) = computeCost(X, y, theta);      % computeCost wraps the J above
end

# Possible problems with Octave/Matlab

• Performance issues
• Integration with an existing project

# In a large project

Linear regression with C++

mlpack_linear_regression --training_file used_cars.csv
--test_file used_cars_test.csv -v
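If your large project is in Python instead, scikit-learn gives an equivalent one-liner. A rough sketch, assuming used_cars.csv holds year,price rows with no header (the file name and layout are taken from the earlier slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

data = np.loadtxt("used_cars.csv", delimiter=",")   # assumed: "year,price" per row
X = data[:, :1]                                     # feature: year
y = data[:, 1]                                      # target: price

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)                # the learned a and b
print(model.predict([[2010]]))                      # predicted price for a 2010 car
```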

# Going back to used cars

We are only analysing year vs. price, but we have more factors (features): model, how much the car was used before, etc.

# Back to the principle

Training set (m examples) -> Learning Algorithm -> hypothesis h

input x -> h -> predicted output $\hat{y}$

# Strategy to h

Consider multiple variables: a, b, c, ... (or, using Greek letters, $\theta_0, \theta_1, \theta_2, \dots$):

$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots$

# Gradient Descent for multiple Variables

Repeat until convergence:

$\theta_0 = \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x_0^i)$

$\theta_1 = \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x_1^i)$

$\theta_2 = \theta_2 - \alpha \frac{1}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x_2^i)$

$\vdots$

$\theta_n = \theta_n - \alpha \frac{1}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x_n^i)$
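All of these updates can be done at once with a single vectorized expression. A minimal NumPy sketch (toy, already-normalized features; the names are mine, not the presentation's code):

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.1, iterations=1500):
    """Simultaneously apply theta_j := theta_j - alpha*(1/m)*sum((h(x^i)-y^i)*x_j^i)
    for every j, written as one matrix expression. X carries a first column of ones (x_0 = 1)."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        h = X @ theta                                   # hypothesis for every example
        theta = theta - alpha * (1 / m) * (X.T @ (h - y))
    return theta

# toy, already-scaled features (real features such as year/km would be normalized first)
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
y = np.array([6.0, 0.0, 10.0, 4.0])                     # generated from theta = [5, 2, -3]
print(gradient_descent_multi(X, y))                     # approx [5, 2, -3]
```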

# Production tips

• Too many features - YOU NEED MEMORY :scream:
• Normalisation - check the Coursera lectures (see the small sketch below)
• There is more than one way to minimise the cost function besides gradient descent - check a method called 'Normal Equation'
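A tiny sketch of the normalisation step mentioned above (mean normalization / feature scaling; the car numbers are made up):

```python
import numpy as np

def normalize_features(X):
    """Mean normalization: scale each feature to roughly the same range so gradient
    descent converges faster. Returns the scaled matrix plus the statistics needed
    to apply the same transformation to new inputs later."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X = np.array([[2005.0, 120000.0], [2010.0, 60000.0], [2015.0, 15000.0]])  # year, km
X_norm, mu, sigma = normalize_features(X)
print(X_norm)          # every column now has mean 0 and standard deviation 1
```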

# Recap

• Hypothesis h
• Cost function: the difference between our hypothesis and the output in the training set
• Gradient descent: minimize the difference

# Multivariable regression - My Bots

Determine response time in a chat.

Goal - Prevent too much conversation and be helpful

Input: response minutes, human reply minutes, presence of words indicating it's taking too long, human typing action

# Agenda

• General ideas about Machine Learning
• Problem 1 - Used car prices (Linear Regression)
• Problem 2 - Is this job position interesting? (Logistic regression)
• Problem 3 - Improving Problem 2 solution (Regularization)
• Problem 4 - Improving problem 3 (Support vector machines)
• Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
• Extras and References

# Problem

Determine if a job opportunity is interesting to me or not

Do we have any idea of known relations?

outsourcing -> no

Machine Learning -> yes

# Problem

Determine if a job opportunity is interesting to me or not

What kind of M.L. Algorithm would you use here?

Supervised Learning

# Algorithm draft:

The output is yes or no

Come up with a training set to analyse these variables

$y \in \{ 0,1 \}$

$0 \leq h \leq 1$

# Back to the principle

Training set (m examples) -> Learning Algorithm -> hypothesis h

input x -> h -> predicted output $\hat{y}$

# Strategy

• Hypothesis h
• Cost function: the difference between our hypothesis and the output in the training set
• Gradient descent: minimize the difference

(nothing new!)

# Strategy

h is a function between 0 and 1

Come up with a function to compare the test output with the predicted value

(cost function)

Minimize the cost function

# Strategy

h is a function between 0 and 1

Sigmoid function:

$\sigma=\frac{1}{1+e^{-z}}$

where $z = \theta^Tx$, with x the input (from your input data).

h gives us a probability of yes or no.
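A minimal Python sketch of this hypothesis (the parameter values below are made up, just to show the output range):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic hypothesis: the probability that y = 1 for input x."""
    return sigmoid(theta @ x)          # z = theta^T x

theta = np.array([-3.0, 1.0, 1.0])     # made-up parameters
x = np.array([1.0, 2.0, 2.5])          # x_0 = 1 plus two features
print(h(theta, x))                     # ~0.82, read as "82% chance of 'yes'"
```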

# How am I supposed to find a minimum on this weird hypothesis function?

# Our cost function

It needs to separate the data, not measure the difference from a fitted slope.

# Our cost function

$loss = -y \log(h(x)) - (1-y)\log(1 - h(x))$

for every element of the training set

# Improving it

$J = -\frac{1}{m} \sum_{i=1}^{m} [ y^i \log(h(x^i)) + (1-y^i)\log(1 - h(x^i)) ]$

# Now the minimum:

$\theta_j = \theta_j - \alpha \frac{\partial }{\partial \theta_j }J(\theta)$

Repeat until convergence

# Please calculate this derivative for me!!111

$\frac{\partial }{\partial \theta_j }J(\theta)$, where

$J = -\frac{1}{m} \sum_{i=1}^{m} [ y^i \log(h(x^i)) + (1-y^i)\log(1 - h(x^i)) ]$

The result is the update:

$\theta_j = \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x_j^i)$
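Putting the pieces together, gradient descent for logistic regression looks almost identical to the linear case; only h changes. A NumPy sketch with a made-up toy training set:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Same update rule as linear regression, but h is now sigmoid(X @ theta):
    theta_j := theta_j - (alpha/m) * sum_i (h(x^i) - y^i) * x_j^i."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        theta = theta - (alpha / m) * (X.T @ (h - y))
    return theta

# toy training set: x_0 = 1 plus one feature; the label flips around feature value 3
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_gradient_descent(X, y)
print(sigmoid(X @ theta) >= 0.5)       # predicted classes: [False False True True]
```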

# Logistic Regression

## The code in Octave

% Initialise
data = load('job_positions.csv');
X = data(:, [1, 2]); y = data(:, 3);
[m, n] = size(X);
X = [ones(m, 1) X];          % add the x_0 = 1 column
theta = zeros(n + 1, 1);     % initial guess

% Cost function
h = sigmoid(X*theta);        % hypothesis for every example
J = (1/m)*(-y'* log(h) - (1 - y)'* log(1-h));

% Prediction: class 1 when the probability is at least 0.5
p = sigmoid(X*theta) >= 0.5;

# In a large project

Logistic regression with Python

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X, y)
model.score(X, y)

# Agenda

• General ideas about Machine Learning
• Problem 1 - Used car prices (Linear Regression)
• Problem 2 - Is this job position interesting? (Logistic regression)
• Problem 3 - Improving Problem 2 solution (Regularization)
• Problem 4 - Improving problem 3 (Support vector machines)
• Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
• Extras and References

# Problem

The training set works fine; new predictions are terrible :'(

High variance

It fails to generalise!

Where is the problem? Is it in our hypothesis?

# Problem

The training set works fine; new predictions are terrible.

The problem might be the cost function (comparing the predicted values with the training set):

$J = \frac{1}{2m}\sum_{i=1}^{m}(h(x^i) - y^i)^2$

## A better cost function

$J = \frac{1}{2m}\left[\sum_{i=1}^{m}(h(x^i) - y^i)^2 + \lambda \sum_{j=1}^{numfeat} \theta_j^2\right]$

$\lambda$ is the regularization parameter: it controls the tradeoff and lowers the variance.

# Regularization

We can use it with Linear and Logistic Regression
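In scikit-learn this shows up as a single knob. A small sketch with toy data: Ridge's alpha plays the role of $\lambda$ for linear regression, and LogisticRegression's C behaves like $1/\lambda$ (which is why the very first slide had C=1.0); the data and values below are made up:

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

# Regularized linear regression: alpha here plays the role of lambda
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_, ridge.intercept_)

# Regularized logistic regression: C is the inverse of lambda,
# so a smaller C means stronger regularization (lower variance)
X2 = np.array([[1.0], [2.0], [4.0], [5.0]])
y2 = np.array([0, 0, 1, 1])
clf = LogisticRegression(C=1.0).fit(X2, y2)
print(clf.predict(X2))
```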

# Agenda

• General ideas about Machine Learning
• Problem 1 - Used car prices (Linear Regression)
• Problem 2 - Is this job position interesting? (Logistic regression)
• Problem 3 - Improving Problem 2 solution (Regularization)
• Problem 4 - Improving problem 3 (Support vector machines)
• Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
• Extras and References

## Problem

Logistic regression is inaccurate.

## Problem

Logistic regression is too slow.


# Strategy

Which one is better? The one with the largest, equal distance to both groups, not influenced by outliers.

distance => vector norm:

$\parallel x - b \parallel$

# Strategy

Measure group similarity: the input vs. each point in the graph.

# Strategy

Can you guess a component of the similarity function? Building it up a bit at a time:

$\parallel x - b \parallel$

${\parallel x - b \parallel}^2$

$\frac{{\parallel x - b \parallel}^2}{2\sigma^2}$

$similarity = \exp\left(-\frac{{\parallel x - b \parallel}^2}{2\sigma^2}\right)$

And we do that for all the inputs x from the training set.
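A minimal Python sketch of that similarity (Gaussian/RBF kernel) function, with made-up points:

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma=1.0):
    """Gaussian (RBF) kernel: 1.0 when x sits on the landmark,
    falling towards 0 as the distance ||x - landmark|| grows."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

landmark = np.array([3.0, 3.0])
print(gaussian_similarity(np.array([3.0, 3.0]), landmark))   # 1.0 (same point)
print(gaussian_similarity(np.array([5.0, 1.0]), landmark))   # ~0.018 (far away)
```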

## Back to the beginning: Strategy

• Hypothesis h
• Cost function: the difference between our hypothesis and the output in the training set
• Minimize the difference (it was the gradient descent in the previous methods)

(nothing new!)

## Back to the beginning: Strategy

• Hypothesis => similarity [or kernels] √
• Cost function: the difference between our hypothesis and the output in the training set
• Minimize the difference (it was the gradient descent in the previous methods)

## The cost function

$J = \frac{1}{m}\sum_{i=1}^{m}\left[ y^i(-\log(h(x^i))) + (1-y^i)(-\log(1-h(x^i)))\right] + \frac{\lambda}{2m}\sum_{j=1}^{numfeat}\theta_j^2$

The first part measures the similarity of the groups (by the difference of the output in the training set with your predicted value); it looks like Logistic Regression. The last term is the regularization.

## The cost function - alternative notation

$J = \sum_{i=1}^{m}\left[ y^i(-\log(h(x^i))) + (1-y^i)(-\log(1-h(x^i)))\right] + \frac{\lambda}{2}\sum_{j=1}^{numfeat}\theta_j^2$

And replacing $C = \frac{1}{\lambda}$:

## The cost function - alternative notation

$J = C\sum_{i=1}^{m}\left[ y^i(-\log(h(x^i))) + (1-y^i)(-\log(1-h(x^i)))\right] + \frac{1}{2}\sum_{j=1}^{numfeat}\theta_j^2$

Don't forget - h is the similarity/kernel function:

$\exp\left(-\frac{{\parallel x - b \parallel}^2}{2\sigma^2}\right)$

## And now minimize it!

$J = C\sum_{i=1}^{m}\left[ y^i(-\log(h(x^i))) + (1-y^i)(-\log(1-h(x^i)))\right] + \frac{1}{2}\sum_{j=1}^{numfeat}\theta_j^2$

And that is a lot of work!

Instead, we can use a library to calculate the minimum.

# Extras

• There is more than one formula for a kernel
• How many kernels should I have? You need to start with a guess.
• When should I use Logistic Regression and when should I use SVM? (quick analysis: if there are fewer features (< 1000), use SVM; otherwise, Logistic Regression)

# In a project

Support Vector Machine with Python

import numpy as npy
from sklearn import svm

X = npy.loadtxt(open("jobs.csv", "rb"), delimiter=",", skiprows=1)
Y = [0] * 8 + [1] * 8   # labels (assumed here: first 8 rows 'no', next 8 rows 'yes')

for kernel in ('linear', 'poly', 'rbf'):
    clf = svm.SVC(kernel=kernel, gamma=2)
    clf.fit(X, Y)



# Agenda

• General ideas about Machine Learning
• Problem 1 - Used car prices (Linear Regression)
• Problem 2 - Is this job position interesting? (Logistic regression)
• Problem 3 - Improving Problem 2 solution (Regularization)
• Problem 4 - Improving problem 3 (Support vector machines)
• Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
• Extras and References

# Problem

I have no idea about the number of kernels. I can't establish any relation within my data.

What kind of M.L. algorithm is this?

Unsupervised Learning

# Problem

I have no idea about the number of kernels. I can't establish any relation within my data.

input = {x1, x2, x3...}

no y!

# Problem

Is this another group or an anomaly?

# Problem

My bot is taking too long to respond - is it just heavy processing or is it someone trolling me?

# Strategy

Calculate the probability of each element of the input training set to determine if it's normal or not

Gaussian distribution

https://www.phy.ornl.gov/csep/gif_figures/mcf7.gif

# Strategy

Gaussian distribution

$probability = p(x_1, \mu_1, \sigma_1^2)\,p(x_2, \mu_2, \sigma_2^2)\,p(x_3, \mu_3, \sigma_3^2)\dots$

$probability = \prod_{j=1}^{numfeat} p(x_j, \mu_j, \sigma_j^2)$

Select the elements on the training set that you consider weird (example: add upper/lower bounds)

# Algorithm

Expand it according to the Gaussian distribution:

$\mu_j = \frac{1}{m} \sum_{i=1}^{m}x_j^i$

$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m}(x_j^i - \mu_j)^2$

Now we compute:

$p(x) = \prod_{j=1}^{numfeat} p(x_j, \mu_j, \sigma_j^2) = \prod_{j=1}^{numfeat} \frac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$

# Algorithm

$p(x) = \prod_{j=1}^{numfeat} \frac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$

Flag an anomaly when $p(x) < \varepsilon$

(ex: 5% or 10% deviation)

There is an algorithm to determine the best Epsilon
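A minimal NumPy sketch of the whole idea, with made-up response times and a hand-picked epsilon (normally you would tune epsilon on a labelled validation set, as the Octave code below does):

```python
import numpy as np

def fit_gaussian(X):
    """Estimate mu_j and sigma_j^2 for every feature from the training set."""
    return X.mean(axis=0), X.var(axis=0)

def p(X, mu, sigma2):
    """Product of the per-feature Gaussian densities, for every example."""
    densities = (1.0 / np.sqrt(2 * np.pi * sigma2)) * \
                np.exp(-((X - mu) ** 2) / (2 * sigma2))
    return densities.prod(axis=1)

# toy response times (two features); the last new example is deliberately extreme
X_train = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0]])
mu, sigma2 = fit_gaussian(X_train)
X_new = np.array([[1.05, 0.95], [9.0, 0.2]])
epsilon = 0.05                          # threshold, hand-picked here for illustration
print(p(X_new, mu, sigma2) < epsilon)   # [False True] -> the second point is an anomaly
```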

## The code in Octave

% Initialise
data = load('response_times.csv');
X = data(:, [1, 2]);
[m, n] = size(X);

% Gaussian parameters: one mu and sigma^2 per feature
mu = mean(X)';
sigma2 = var(X, 1)';

% Density of every element, using a diagonal covariance built from sigma2
k = length(mu);
if (size(sigma2, 2) == 1) || (size(sigma2, 1) == 1)
  sigma2 = diag(sigma2);
end
X = bsxfun(@minus, X, mu(:)');
p = (2 * pi) ^ (- k / 2) * det(sigma2) ^ (-0.5) * ...
    exp(-0.5 * sum(bsxfun(@times, X * pinv(sigma2), X), 2));

## (cont..)The code in Octave

% Determine the best epsilon from a labelled validation set
% (pval = densities on that set, yval = 1 for known anomalies)
bestEpsilon = 0;
bestF1 = 0;
F1 = 0;
stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)
  predictions = (pval < epsilon);
  tp = sum((predictions == 1) & (yval == 1));
  fp = sum((predictions == 1) & (yval == 0));
  fn = sum((predictions == 0) & (yval == 1));
  precision = tp / (tp + fp);
  recall = tp / (tp + fn);
  F1 = (2 * precision * recall) / (precision + recall);
  if F1 > bestF1
    bestF1 = F1;
    bestEpsilon = epsilon;
  end
end

% The anomalies: every element whose probability falls below the best epsilon
outliers = find(p < bestEpsilon);

# Agenda

• General ideas about Machine Learning
• Problem 1 - Used car prices (Linear Regression)
• Problem 2 - Is this job position interesting? (Logistic regression)
• Problem 3 - Improving Problem 2 solution (Regularization)
• Problem 4 - Improving problem 3 (Support vector machines)
• Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
• Extras and References

# Bonus!

## Problem

The set of features is getting too large.

(n > 1000 )

I don't wanna deal with huuuuuge polynomials!

## Problem

The set of features is getting too large.

(n > 1000 )

Neural Networks

A full explanation of Neural Networks is out of the scope of this presentation

## Out of our scope - but interesting

• Backpropagation
• PCA (Principal Component Analysis)
• Obtaining the datasets
• A deeper comparison of which method you should select
• Is it Supervised or unsupervised?
• Reinforcement learning
• Which features should I select?

## Special Thanks

• B.C., for the constant review
• @confoo team

## Thank you :)

Questions?

hannelita@gmail.com

@hannelita

#### Machine Learning Algorithms - Intro (UNIFEI)

By Hanneli Tavante (hannelita)
