A crash and concise approach
Disclaimer
The mathematical content is not rigorous, and the notation may be incomplete or inaccurate. Please forgive me.
There is math.
There is theory.
There is pseudocode.
A lot of content in 40 minutes
Interact with me!
from sklearn import linear_model

clf = linear_model.LogisticRegression(C=1.0,
                                      penalty='l1', tol=1e-6)
"What does this method do?"
"M.L. "
"..."
"Yet another presentation about M.L..."
"Because it is as cool as big data now."
Supervised Learning
Unsupervised Learning
Reinforcement Learning
We have an idea of the right answer for what we are asking. Example: Given a picture of a person, predict how old he or she is.
We have no idea of the right answer for what we are asking. Example: Given a collection of items you don't know, try to group them by similarity.
Let the machine take control of the context while you provide feedback on its actions.
Example: Reduce items in stock by creating dynamic promotions
[Diagram: Training set → Learning Algorithm → hypothesis; input → hypothesis → predicted output]
Sell used cars. Find the best price to sell them (not considering people who collect old cars)
Do we have any idea of known relations?
older -> cheaper
unpopular brands -> cheaper
too many kms -> cheaper
Sell used cars. Find the best price to sell them (not considering people who collect old cars)
What kind of M.L. Algorithm would you use here?
Supervised Learning
Choose one variable to analyse against what you want to predict (example: year vs. price). Price is the variable for which you want to make a prediction.
Come up with a training set to analyse these variables
input variable or features - x
output variable or target - y
[Diagram: Training set (m examples) → Learning Algorithm → hypothesis h; input x → h → predicted output]
Linear equation
h = ax + b
How do you choose a and b?
From the training set, we have expected values y for a certain x:
Come up with a hypothesis that gives you the smallest error over the entire training set:
h(x) - y

Your hypothesis for an input x, minus the output of the training set: that measures the difference for one example. To measure the difference for the entire training set, square each term (we don't want positive and negative values to cancel) and take the average:

Cost Function

J = (1/2m) * Σ_{i=1..m} ( h(x_i) - y_i )²
We want to minimize the difference
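A minimal sketch of this cost function in Python with NumPy (the toy year/price numbers are made up for illustration):

import numpy as np

def cost(a, b, x, y):
    # J = (1/2m) * sum((h(x_i) - y_i)^2), with h = a*x + b
    h = a * x + b
    return np.sum((h - y) ** 2) / (2 * len(y))

# Made-up training set: year vs. price
x = np.array([2001, 2005, 2010, 2014])
y = np.array([3000, 5500, 9000, 14000])
print(cost(1.0, 0.0, x, y))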
We can come up with different hypotheses (different slopes for the h function), and each one produces a different value of the cost function.

[Plot: on h = ax + b, varying a traces the curve J(a); each slope lands at a different point on the curve, and the minimum value marks the best hypothesis. Varying both a and b gives the surface J(a, b).]
Minimize any Cost Function.
[Plot: gradient descent on the cost surface. We start with a guess and 'walk' on the graph towards the min value.]
Partial derivatives tell us in which direction to walk on the graph towards the min value. Starting from a guess, repeat until convergence:

a := a - α * ∂J/∂a
b := b - α * ∂J/∂b

α is the learning rate (another guess).
% Initialise
data = load('used_cars.csv'); % year x price
y = data(:, 2);
m = length(y);
X = [ones(m, 1), data(:, 1)];
theta = zeros(2, 1); % linear function
iterations = 1500;
alpha = 0.01;

% Cost Function (computeCost.m)
function J = computeCost(X, y, theta)
  m = length(y);
  predictions = X * theta;           % h(x) for every example
  sqErrors = (predictions - y) .^ 2;
  J = 1 / (2 * m) * sum(sqErrors);
end

% Gradient Descent
J_history = zeros(iterations, 1);
for iter = 1:iterations
  x = X(:, 2);
  delta = theta(1) + (theta(2) * x);                      % current predictions
  tz = theta(1) - alpha * (1/m) * sum(delta - y);         % update intercept
  t1 = theta(2) - alpha * (1/m) * sum((delta - y) .* x);  % update slope
  theta = [tz; t1];                                       % simultaneous update
  J_history(iter) = computeCost(X, y, theta);
end
Linear regression with C++
mlpack_linear_regression --training_file used_cars.csv
--test_file used_cars_test.csv -v
So far we are only analysing year vs. price, but we have more factors (features): model, how much the car was used before, etc.
[Diagram: Training set (m examples) → Learning Algorithm → hypothesis h; input x → h → predicted output]
Consider multiple variables: a, b, c, ... (or, using Greek letters, θ0, θ1, ..., θn):

h = θ0 + θ1*x1 + θ2*x2 + ... + θn*xn

Repeat until convergence (updating every θj simultaneously):

θj := θj - α * (1/m) * Σ_{i=1..m} ( h(x_i) - y_i ) * x_i,j
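A minimal vectorized sketch of this update in Python with NumPy (illustrative only; X is assumed to already include the leading column of ones):

import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1500):
    # Batch gradient descent for multivariate linear regression.
    # X: (m, n+1) matrix whose first column is all ones.
    m, ncols = X.shape
    theta = np.zeros(ncols)
    for _ in range(iterations):
        h = X @ theta                                # predictions for every example
        theta -= alpha * (1 / m) * (X.T @ (h - y))   # simultaneous update of all theta
    return theta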
Determine the response time in a chat.
Goal - prevent too much conversation while staying helpful.
Input: response minutes, human reply minutes, presence of words indicating it's taking too long, human typing action.
Determine if a job opportunity is interesting to me or not
Do we have any idea of known relations?
outsourcing -> no
Machine Learning -> yes
Determine if a job opportunity is interesting to me or not
What kind of M.L. Algorithm would you use here?
Supervised Learning
The output is yes or no
Come up with a training set to analyse these variables
[Diagram: Training set (m examples) → Learning Algorithm → hypothesis h; input x → h → predicted output]
(nothing new!)
h is a function between 0 and 1
Come up with a function to compare the test output with the predicted value
(cost function)
Minimize the cost function
(gradient descent)
h is a function between 0 and 1: the Sigmoid function

h(x) = g(θᵀx), with g(z) = 1 / (1 + e^(-z))

h gives us a probability of yes or no.
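A quick sketch of the sigmoid in Python with NumPy (purely illustrative):

import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1), so the output reads as a probability.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))   # 0.5: undecided
print(sigmoid(6))   # ~0.998: almost certainly yes
print(sigmoid(-6))  # ~0.002: almost certainly no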
Build the input matrix from your input data. How do we measure the error on this weird hypothesis function? The cost needs to separate data, not measure the difference between the slopes. For every element of the training set:

J = (1/m) * Σ_{i=1..m} [ -y_i * log(h(x_i)) - (1 - y_i) * log(1 - h(x_i)) ]
Repeat until convergence:

θj := θj - α * (1/m) * Σ_{i=1..m} ( h(x_i) - y_i ) * x_i,j

where h is the sigmoid hypothesis above.
% Initialise
data = load('job_positions.csv');
X = data(:, [1, 2]); y = data(:, 3);
[m, n] = size(X);
X = [ones(m, 1) X];
theta = zeros(n + 1, 1);

% Sigmoid hypothesis (sigmoid.m)
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

% Cost Function
h = sigmoid(X * theta);
J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));

% Gradient Descent (gradient of J)
grad = (1/m) * X' * (h - y);

% Prediction
p = sigmoid(X * theta) >= 0.5;
Logistic regression with Python

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X, y)
model.score(X, y)
The training set works fine; new predictions are terrible :'(
High variance
It fails to generalise!
Where is the problem? Is it in our hypothesis?
The training set works fine; new predictions are terrible
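One way to see the symptom in practice, as a hypothetical sketch (the 30% split is an arbitrary choice): score the model on examples it never saw during training.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out part of the data to measure generalisation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_train, y_train))  # high: the training set works fine
print(model.score(X_test, y_test))    # much lower: it fails to generalise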
The problem might be the cost function (comparing the predicted values with the training set)
J = (1/m) * Σ_{i=1..m} [ -y_i * log(h(x_i)) - (1 - y_i) * log(1 - h(x_i)) ]

J = (1/m) * Σ_{i=1..m} [ -y_i * log(h(x_i)) - (1 - y_i) * log(1 - h(x_i)) ] + (λ/2m) * Σ_{j=1..n} θj²

λ is the Regularization param: it controls the tradeoff and lowers the variance.
We can use it with Linear and Logistic Regression
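In scikit-learn this knob is the C parameter of LogisticRegression, the inverse of the regularization strength; that is what the C=1.0 in the very first slide was about. A small sketch (the values are arbitrary):

from sklearn.linear_model import LogisticRegression

# Smaller C = stronger regularization = lower variance
strongly_regularized = LogisticRegression(C=0.01)
weakly_regularized = LogisticRegression(C=100.0)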
Logistic regression is inaccurate.
Logistic regression is too slow.
[Plot: two candidate decision boundaries, each at a distance d from the nearest points. Which one is better? The one with the equal largest distance (margin) to both groups, not considering outliers. distance => vector norm]
Measure group similarity: the input vs. each point in the graph.
Can you guess a component of the function? A bit more...
And we do that for all the inputs x from the training set
(nothing new!)
The similarity of the groups (measured by the difference between the output in the training set and your predicted value).
The cost function looks like Logistic Regression's, with Regularization, and replacing the inputs with the kernel features.

Don't forget - h is the similarity/kernel function.
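A common choice for this similarity function is the Gaussian (RBF) kernel, one of the kernels tried in the scikit-learn example below. A minimal sketch (names and numbers are illustrative):

import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    # Similarity: 1.0 when x and the landmark coincide,
    # falling towards 0.0 as they move apart.
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, np.array([1.0, 2.0])))  # 1.0: identical
print(gaussian_kernel(x, np.array([9.0, 9.0])))  # ~0.0: far apart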
And that is a lot of work!
Instead, we can use a library to calculate the minimum.
Support Vector Machine with Python

import numpy as npy
from sklearn import svm

X = npy.loadtxt(open("jobs.csv", "rb"), delimiter=",", skiprows=1)
Y = [0] * 8 + [1] * 8
for kernel in ('linear', 'poly', 'rbf'):
    clf = svm.SVC(kernel=kernel, gamma=2)
    clf.fit(X, Y)
I have no idea about the number of kernels. I can't establish any relations within my data.
What kind of M.L. algorithm is this?
Unsupervised Learning
I have no idea about the number of kernels. I can't establish any relations within my data.
input = {x1, x2, x3...}
no y!
Is this another group
or an anomaly?
My bot is taking too long to respond - is it just heavy processing or is it someone trolling me?
Calculate the probability of each element of the input training set to determine if it's normal or not
Gaussian distribution
[Image: https://www.phy.ornl.gov/csep/gif_figures/mcf7.gif]
Select the elements in the training set that you consider weird (example: add upper/lower bounds). Expand it according to the Gaussian distribution:

p(x) = (1 / (sqrt(2π) * σ)) * exp( -(x - μ)² / (2σ²) )

Now we compute: p(x) < ε (ex: 5% or 10% deviation)
There is an algorithm to determine the best Epsilon
% Initialise
data = load('response_times.csv');
X = data(:, [1, 2]);
[m, n] = size(X);

% Estimate the Gaussian parameters from the elements
mu = mean(X)';
sigma2 = var(X, 1)';

% Gaussian density p(x) for every element
k = length(mu);
Sigma2 = sigma2;
if (size(Sigma2, 2) == 1) || (size(Sigma2, 1) == 1)
  Sigma2 = diag(Sigma2);  % variance vector -> diagonal covariance matrix
end
Xc = bsxfun(@minus, X, mu(:)');
p = (2 * pi) ^ (-k / 2) * det(Sigma2) ^ (-0.5) * ...
    exp(-0.5 * sum(bsxfun(@times, Xc * pinv(Sigma2), Xc), 2));
% Determine the best Epsilon (the max deviant) using the F1 score.
% pval: densities p(x) on a labelled validation set; yval: its labels.
bestEpsilon = 0;
bestF1 = 0;
stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)
  predictions = (pval < epsilon);
  tp = sum((predictions == 1) & (yval == 1));
  fp = sum((predictions == 1) & (yval == 0));
  fn = sum((predictions == 0) & (yval == 1));
  precision = tp / (tp + fp);
  recall = tp / (tp + fn);
  F1 = (2 * precision * recall) / (precision + recall);
  if F1 > bestF1
    bestF1 = F1;
    bestEpsilon = epsilon;
  end
end
outliers = find(p < bestEpsilon);   % the anomalies
The anomaly!
The set of features is getting too large (n > 1000). I don't want to handle huuuuuge polynomials!

Neural Networks
A full explanation of Neural Networks is beyond the scope of this presentation.
Questions?
hannelita@gmail.com
@hannelita