Machine Learning Algorithms

A crash course: a concise approach

Disclaimer

The mathematical content is not rigorous. The notation may be incomplete or inaccurate. Please forgive me.

There is math.

There is theory.

There is pseudocode.

A lot of content in 40min

Interact with me!

There are several presentations about M.L.

"M.L. is complicated!"

from sklearn import linear_model

clf = linear_model.LogisticRegression(C=1.0,
  penalty='l1', tol=1e-6)

"What does this method do?"

"M.L. "

"..."

"M.L. is easy!11!"

"M.L. is easy!11!"

\theta_0 = \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(({h (x^i) - y^i})x_0^i)

\theta_1 = \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}(({h (x^i) - y^i})x_1^i)

\theta_2 = \theta_2 - \alpha \frac{1}{m}\sum_{i=1}^{m}(({h (x^i) - y^i})x_2^i)

\vdots

\theta_n = \theta_n - \alpha \frac{1}{m}\sum_{i=1}^{m}(({h (x^i) - y^i})x_n^i)

"Yet another presentation about M.L..."

"Because it is as cool as big data now."

I'm Hanneli - @hannelita

  • Computer Engineer
  • Programming
  • Electronics
  • Math <3 <3
  • Physics
  • Lego
  • Meetups
  • Animals
  • Coffee
  • GIFs
  • Pokémon

Goals

  • Get started with Machine Learning
  • Connect the missing pieces
  • Show the relation between Mathematics, ML theory and implementation
  • You can apply M.L. to your personal projects (Example: my bots)

Agenda

  • General ideas about Machine Learning
  • Problem 1 - Used car prices (Linear Regression)
  • Problem 2 - Is this job position interesting? (Logistic regression)
  • Problem 3 - Improving Problem 2 solution (Regularization)
  • Problem 4 - Improving problem 3 (Support vector machines)
  • Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
  • Extras and References

Some problems

Supervised Learning

We have an idea of the right answer for what we are asking. Example: given a picture of a person, predict how old they are.

Unsupervised Learning

We have no idea of the right answer for what we are asking. Example: given a collection of items you don't know, try to group them by similarity.

Reinforcement Learning

Let the machine take control of the context while you provide feedback on its actions.

Example: reduce items in stock by creating dynamic promotions.

General idea

[Diagram: Training set → Learning Algorithm → hypothesis; input → hypothesis → predicted output]

This presentation will cover distinct algorithms + theory + practice

For every algorithm, we will:

  • Come up with a problem to solve
  • Come up with a strategy
  • Understand the mechanics of the algorithm
  • Come up with a test scenario
  • Provide examples with libraries in C/C++/Python

Agenda

  • General ideas about Machine Learning
  • Problem 1 - Used car prices (Linear Regression)
  • Problem 2 - Is this job position interesting? (Logistic regression)
  • Problem 3 - Improving Problem 2 solution (Regularization)
  • Problem 4 - Improving problem 3 (Support vector machines)
  • Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
  • Extras and References

Problem:

Sell used cars. Find the best price to sell them (not considering people who collect old cars)

Do we have any ideas about known relations?

older -> cheaper

unpopular brands -> cheaper

too many kms -> cheaper

Problem:

Sell used cars. Find the best price to sell them (not considering people who collect old cars)

What kind of M.L. Algorithm would you use here?

Supervised Learning

Algorithm draft:

Choose one variable to analyse against what you want to predict (example: year vs price). Price is the variable you want to predict.

Come up with a training set to analyse these variables

Number of training examples: m

Input variable or features: x

Output variable or target: y

Back to the principle

[Diagram: Training set (m examples) → Learning Algorithm → hypothesis h; input x → h → predicted output y]

Strategy to h

Strategy to h

Linear equation

h = ax + b

How do you choose a and b?

From the training set, we have expected values y for a certain x:

(x^i , y^i)

Come up with a hypothesis that gives you the smallest error over the entire training set:

The algorithm

Measure the difference between your hypothesis for an input x and the output of the training set:

h(x) - y

The algorithm

Measure the difference for the entire training set:

\sum_{i=1}^{m}(h(x^i) - y^i)

The algorithm

We don't want positive and negative values to cancel out, so square each difference. Then average over the training set (the extra 1/2 just makes the derivative cleaner later):

\frac{1}{2m}\sum_{i=1}^{m}(h(x^i) - y^i)^2

Cost Function

J = \frac{1}{2m}\sum_{i=1}^{m}(h(x^i) - y^i)^2

We want to minimize the difference
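To make it concrete, here is a minimal numpy sketch of this cost function (the toy numbers are invented for illustration, not from the original slides):

import numpy as np

def cost(a, b, x, y):
    # J(a, b) = 1/(2m) * sum((h(x) - y)^2), with h(x) = a*x + b
    m = len(y)
    h = a * x + b
    return np.sum((h - y) ** 2) / (2 * m)

# toy training set: year vs price (made-up values)
x = np.array([2005., 2008., 2010., 2013., 2015.])
y = np.array([3000., 4500., 6000., 9000., 12000.])
print(cost(900.0, -1800000.0, x, y))  # J for one particular guess of (a, b)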

Understanding it

We can come up with different hypotheses (different slopes for the h function)

Understanding it

That's the difference h(x^i) - y^i for one hypothesis, h = a'x + b (one value of the cost function)

Understanding it

That's the difference h(x^i) - y^i for another hypothesis, h = a''x + b (another value of the cost function)

This cost function (J) will be a parabola with a minimum value: on h = ax + b, we are varying only a, so the cost is J(a).

We also need to vary b: for h = ax + b, the cost J(a,b) becomes a surface (3 dimensions).

Back to the algorithm

Minimize any Cost Function.

We start with a guess and 'walk' to the min value.

Back to the algorithm

Or: [plot of another starting guess walking to a different minimum]

How do you find the min values of a function?

Calculus is useful here

Calculus - we can get this information from the derivative

\frac{\partial }{\partial a}

Remember: we start with a guess and walk to the min value

Back to the cost function

\frac{\partial }{\partial a}J(a,b) \quad \frac{\partial }{\partial b}J(a,b)

Towards the min value: we start with a guess (a_0, b_0) and walk on the graph:

a = a_0 - \alpha \frac{\partial }{\partial a}J(a,b)

b = b_0 - \alpha \frac{\partial }{\partial b}J(a,b)

Partial Derivatives

Gradient Descent

We start with a guess (a_0, b_0) and walk on the graph towards the min value, using the partial derivatives:

a = a_0 - \alpha \frac{\partial }{\partial a}J(a,b)

b = b_0 - \alpha \frac{\partial }{\partial b}J(a,b)

Back to the cost function

a = a_0 - \alpha \frac{\partial }{\partial a}J(a,b)

b = b_0 - \alpha \frac{\partial }{\partial b}J(a,b)

\alpha is the learning rate; (a, b) is another guess.

Repeat until convergence

How do I transform all this into code???

Octave/ Matlab for prototyping

Adjust your theory to work with Linear Algebra and Matrices

 

Expand the derivatives:

a = a_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(({h (x_i) - y_i}) x_i)

b = b_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}({h (x_i) - y_i})

Repeat until convergence

The code in Octave

% Initialise
data = load('used_cars.csv'); % year x price
y = data(:, 2);
m = length(y);
X = [ones(m, 1), data(:,1)];  % add the x0 = 1 column
theta = zeros(2, 1);          % linear function: [b; a]
iterations = 1500;
alpha = 0.01;
J = 0;

% Cost function
predictions = X*theta;
sqErrors = (predictions - y).^2;
J = 1/(2*m) * sum(sqErrors);

% Gradient descent
J_history = zeros(iterations, 1);
for iter = 1:iterations
    x = X(:,2);
    delta = theta(1) + (theta(2)*x);
    tz = theta(1) - alpha * (1/m) * sum(delta-y);
    t1 = theta(2) - alpha * (1/m) * sum((delta - y) .* x);
    theta = [tz; t1];
    J_history(iter) = computeCost(X, y, theta); % the cost function above, as a helper
end
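For Python readers, a rough numpy equivalent of the Octave loop above (a sketch, assuming used_cars.csv holds year,price rows; in practice you would normalise the year column first, see the production tips later):

import numpy as np

data = np.loadtxt('used_cars.csv', delimiter=',')  # assumed columns: year, price
x, y = data[:, 0], data[:, 1]
m = len(y)

a, b = 0.0, 0.0       # theta(2) and theta(1) in the Octave code
alpha = 0.01
for _ in range(1500):
    h = a * x + b                                # current hypothesis
    b -= alpha * (1 / m) * np.sum(h - y)         # b = b - alpha * dJ/db
    a -= alpha * (1 / m) * np.sum((h - y) * x)   # a = a - alpha * dJ/da

print(a, b)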

Possible problems with Octave/Matlab

  • Performance issues
  • Integrate with existing project

In a large project

Linear regression with C++

mlpack_linear_regression --training_file used_cars.csv
 --test_file used_cars_test.csv -v

Going back to used cars

We are only analysing year vs price. We have more factors: model, how much the car was used before, etc

Back to the principle

[Diagram: Training set (m examples) → Learning Algorithm → hypothesis h; input x → h → predicted output]

Strategy to h

Consider multiple variables: a, b, c, ... (or, using Greek letters, \theta_0, \theta_1, \theta_2, ...)

h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3

Can you suggest the next steps?
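One detail worth writing down before the next steps (standard notation, not spelled out on the slide): add a constant feature x_0 = 1 so the whole hypothesis becomes a single vector product — the same \theta^T x that shows up again in the logistic regression section.

h(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x, \quad x_0 = 1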

Gradient Descent for multiple Variables

Repeat until convergence:

\theta_0 = \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}(({h (x^i) - y^i})x_0^i)

\theta_1 = \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}(({h (x^i) - y^i})x_1^i)

\theta_2 = \theta_2 - \alpha \frac{1}{m}\sum_{i=1}^{m}(({h (x^i) - y^i})x_2^i)

\vdots

\theta_n = \theta_n - \alpha \frac{1}{m}\sum_{i=1}^{m}(({h (x^i) - y^i})x_n^i)

Production tips

  • Too many features - YOU NEED MEMORY :scream:
  • Normalisation - check the Coursera lectures on feature scaling
  • There is more than one way to minimise the cost function besides gradient descent - check a method called the 'Normal Equation' (see the formula below)
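For reference, the Normal Equation solves for all the parameters in one step (X is the design matrix with the x_0 = 1 column, y the vector of targets); no learning rate and no iterations, but it gets expensive when the number of features is large:

\theta = (X^T X)^{-1} X^T y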

Recap

  • Hypothesis h
  • Cost function: the difference between our hypothesis and the output in the training set
  • Gradient descent: minimize the difference

Don't forget :)

Multivariable regression - My Bots

Determine response time in a chat.

Goal - Prevent too much conversation and be helpful

Input: response minutes, human reply minutes, presence of words indicating it's taking too long, human typing action

Agenda

  • General ideas about Machine Learning
  • Problem 1 - Used car prices (Linear Regression)
  • Problem 2 - Is this job position interesting? (Logistic regression)
  • Problem 3 - Improving Problem 2 solution (Regularization)
  • Problem 4 - Improving problem 3 (Support vector machines)
  • Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
  • Extras and References

Problem

Determine if a job opportunity is interesting to me or not

Do we have any ideas about known relations?

outsourcing -> no

Akka -> Yes

Machine Learning -> yes

Problem

Determine if a job opportunity is interesting to me or not

What kind of M.L. Algorithm would you use here?

Supervised Learning

Algorithm draft:

The output is yes or no

Come up with a training set to analyse these variables

y \in {\{ 0,1 \}}

0 \leq h \leq 1

Input data:

Back to the principle

[Diagram: Training set (m examples) → Learning Algorithm → hypothesis h; input x → h → predicted output y]

Strategy

  • Hypothesis h
  • Cost function: the difference between our hypothesis and the output in the training set
  • Gradient descent: minimize the difference

(nothing new!)

Strategy

h is a function between 0 and 1

Come up with a function to compare the test output with the predicted value

(cost function)

Minimize the cost function

(gradient descent)

Strategy

h is a function between 0 and 1

Sigmoid function

g = \frac{1}{1+e^{-z}}

h gives us a probability of yes or no

Input matrix (from your input data):

z = \theta^T x
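A minimal numpy sketch of this hypothesis (the feature names and numbers are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, X):
    # probability of 'yes' for each row of X; X already has the x0 = 1 column
    return sigmoid(X @ theta)

X = np.array([[1.0, 0.0, 1.0],       # [x0, mentions outsourcing, mentions ML] (made up)
              [1.0, 1.0, 0.0]])
theta = np.array([-1.0, -3.0, 4.0])  # made-up parameters
print(h(theta, X))                   # roughly [0.95, 0.02]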

How am I supposed to find a minimum 

on this weird hypothesis function?

Back to Calculus

The e function often relates to ln, which often relates to log

Our cost function needs to separate data, not measure the difference between the slopes

Our cost function

Cost = -y \log(h(x)) - (1-y)\log(1 - h(x))

for every element of the training set

Improving it

J = -\frac{1}{m} \sum_{i=1}^{m} [ y^i \log(h(x^i)) + (1-y^i)\log(1 - h(x^i)) ]

Now the minimum:

\theta_j = \theta_j - \alpha \frac{\partial }{\partial \theta_j }J(\theta)

Repeat until convergence

Please calculate this derivative for me!!111

\frac{\partial }{\partial \theta_j }J(\theta)

where

J = -\frac{1}{m} \sum_{i=1}^{m} [ y^i \log(h(x^i)) + (1-y^i)\log(1 - h(x^i)) ]

It works out to:

\theta_j = \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}((h(x^i) - y^i)x_j^i)

This classification algorithm is called

Logistic Regression

The code in Octave

% Initialise
data = load('job_positions.csv');
X = data(:, [1, 2]); y = data(:, 3);
[m, n] = size(X);
X = [ones(m, 1) X];          % add the x0 = 1 column
theta = zeros(n + 1, 1);     % initial guess
J = 0;
grad = zeros(size(theta));

% Cost function
h = sigmoid(X*theta);
J = (1/m)*(-y'* log(h) - (1 - y)'* log(1-h));

% Gradient descent step
grad = (1/m)*X'*(h - y);

% Prediction
p = sigmoid(X*theta) >= 0.5;

In a large project

Logistic regression with Python

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X, y)
model.score(X, y)

Agenda

  • General ideas about Machine Learning
  • Problem 1 - Used car prices (Linear Regression)
  • Problem 2 - Is this job position interesting? (Logistic regression)
  • Problem 3 - Improving Problem 2 solution (Regularization)
  • Problem 4 - Improving problem 3 (Support vector machines)
  • Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
  • Extras and References

Problem

The training set works fine; new predictions are terrible :'(

High variance

It fails to generalise!

Where is the problem? Is it in our hypothesis?

Problem

The training set works fine; new predictions are terrible

The problem might be in the cost function (comparing the predicted values with the training set):

J = \frac{1}{2m}\sum_{i=1}^{m}(h(x^i) - y^i)^2

A better cost function

J = \frac{1}{2m}\left[\sum_{i=1}^{m}(h(x^i) - y^i)^2 + \lambda \sum_{j=1}^{numfeat} \theta_j^2\right]

Regularization param \lambda

It controls the tradeoff / lowers the variance

Regularization

We can use it with Linear and Logistic Regression
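In scikit-learn the regularization param is just a constructor argument. A minimal sketch with toy data (Ridge for regularised linear regression; for LogisticRegression the parameter C works inversely, C = 1/lambda, the same substitution used later in the SVM section):

import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])  # toy features
y_reg = np.array([1.0, 2.0, 3.0, 0.5])                          # toy targets
y_clf = np.array([1, 0, 1, 0])                                  # toy labels

# Linear regression with an L2 penalty; lambda is called alpha here
Ridge(alpha=1.0).fit(X, y_reg)

# Logistic regression: smaller C = stronger penalty = lower variance
LogisticRegression(C=1.0).fit(X, y_clf)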

Agenda

  • General ideas about Machine Learning
  • Problem 1 - Used car prices (Linear Regression)
  • Problem 2 - Is this job position interesting? (Logistic regression)
  • Problem 3 - Improving Problem 2 solution (Regularization)
  • Problem 4 - Improving problem 3 (Support vector machines)
  • Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
  • Extras and References

Problem

Logistic regression is inaccurate.

Problem

Logistic regression is too slow.

Strategy

[Plot: candidate decision boundaries separating the two classes, each at a distance d from the nearest points]

Strategy

Which one is better? The one with the largest equal distance to each group (not considering outliers).

distance => vector norm

\parallel x - b \parallel

Strategy

Measure group similarity: the input vs each point in the graph.

Can you guess a component of the function? A bit more...

similarity = \parallel x - b \parallel

Strategy

A bit more...

similarity = {\parallel x - b \parallel}^2

Strategy

A bit more...

similarity = \frac{{\parallel x - b \parallel}^2}{2\sigma^2}

Strategy

Finally:

similarity = exp\left(-\frac{{\parallel x - b \parallel}^2}{2\sigma^2}\right)


And we do that for all the inputs x from the training set
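A minimal numpy version of this similarity (the Gaussian / RBF kernel); b is one landmark point and sigma is a guess:

import numpy as np

def similarity(x, b, sigma=1.0):
    # exp(-||x - b||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - b) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
b = np.array([1.5, 1.0])
print(similarity(x, b))  # close to 1 when x is near b, close to 0 when it is far away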

Back to the beginning: Strategy

  • Hypothesis h
  • Cost function: the difference between our hypothesis and the output in the training set
  • Minimize the difference (it was the gradient descent in the previous methods)

(nothing new!)

Back to the beginning: Strategy

  • Hypothesis => similarity [or kernels] √
  • Cost function: the difference between our hypothesis and the output in the training set
  • Minimize the difference (it was the gradient descent in the previous methods)

The cost function

The similarity of the groups (measured by the difference between the output in the training set and your predicted value):

J = \frac{1}{m}\sum_{i=1}^{m}\left[ y^i (-\log h(x^i)) + (1-y^i)(-\log(1-h(x^i))) \right] + \frac{\lambda}{2m}\sum_{j=1}^{numfeat}\theta_j^2

Looks like Logistic Regression, plus the Regularization term.

The cost function - alternative notation

J = \sum_{i=1}^{m}\left[ y^i (-\log h(x^i)) + (1-y^i)(-\log(1-h(x^i))) \right] + \frac{\lambda}{2}\sum_{j=1}^{numfeat}\theta_j^2

And replacing C = \frac{1}{\lambda}:

The cost function - alternative notation

J = C \sum_{i=1}^{m}\left[ y^i (-\log h(x^i)) + (1-y^i)(-\log(1-h(x^i))) \right] + \frac{1}{2}\sum_{j=1}^{numfeat}\theta_j^2

Don't forget - h is the similarity/kernel function:

exp\left(-\frac{{\parallel x - b \parallel}^2}{2\sigma^2}\right)

And now minimize it!

J = C \sum_{i=1}^{m}\left[ y^i (-\log h(x^i)) + (1-y^i)(-\log(1-h(x^i))) \right] + \frac{1}{2}\sum_{j=1}^{numfeat}\theta_j^2

And that is a lot of work!

Instead, we can use a library to calculate the minimum.

Welcome to the Support Vector Machine (SVM)

Extras

  • There is more than one formula for a kernel
  • How many kernels should I have? You need to start with a guess.
  • When should I use Logistic Regression and when should I use SVM? (Quick analysis: with fewer features (< 1000), use SVM; otherwise, Logistic Regression.)

In a project

Support Vector Machine with Python

import numpy as npy
from sklearn import svm

X = npy.loadtxt(open("jobs.csv", "rb"), delimiter=",", skiprows=1)
Y = [0] * 8 + [1] * 8

for kernel in ('linear', 'poly', 'rbf'):
    clf = svm.SVC(kernel=kernel, gamma=2)
    clf.fit(X, Y)


Agenda

  • General ideas about Machine Learning
  • Problem 1 - Used car prices (Linear Regression)
  • Problem 2 - Is this job position interesting? (Logistic regression)
  • Problem 3 - Improving Problem 2 solution (Regularization)
  • Problem 4 - Improving problem 3 (Support vector machines)
  • Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
  • Extras and References

Problem

I have no idea about the number of kernels. I can't establish any relations in my data.

What kind of M.L. algorithm is this?

Unsupervised Learning

Problem

I have no idea about the number of kernels. I can't establish any relations in my data.

input = {x1, x2, x3...}

no y!

Welcome to unsupervised learning!

Problem

Is this another group, or an anomaly?

Problem

My bot is taking too long to respond - is it just heavy processing or is it someone trolling me?

Strategy

Calculate the probability of each element of the input training set to determine if it's normal or not

Gaussian distribution

https://www.phy.ornl.gov/csep/gif_figures/mcf7.gif

Strategy

Gaussian distribution

probability = p(x_1, \mu_1, \sigma_1^2)\, p(x_2, \mu_2, \sigma_2^2)\, p(x_3, \mu_3, \sigma_3^2)\dots

probability = \prod_{j=1}^{trsetsize} p(x_j, \mu_j, \sigma_j^2)

Select the elements on the training set that you consider weird (example: add upper/lower bounds)

Algorithm

Expand it according to the Gaussian distribution:

\mu = \frac{1}{m} \sum_{i=1}^{m}(x^i)

\sigma^2 = \frac{1}{m} \sum_{i=1}^{m}(x^i - \mu)^2

Now we compute:

p(x) = \prod_{j=1}^{trsetsize} p(x_j, \mu_j, \sigma_j^2)

p(x) = \prod_{j=1}^{trsetsize} \frac{1}{\sqrt{2\pi}\sigma_j} exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)

Algorithm

p(x) = \prod_{j=1}^{trsetsize} \frac{1}{\sqrt{2\pi}\sigma_j} exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)

p(x) < \varepsilon

(ex: 5% or 10% deviation)
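A rough numpy sketch of the whole recipe (the response-time matrix and epsilon are invented for illustration; the Octave code on the next slides picks epsilon automatically):

import numpy as np

X = np.array([[12.0, 0.4], [11.5, 0.5], [12.3, 0.45], [30.0, 2.0]])  # made-up response times
mu = X.mean(axis=0)
sigma2 = X.var(axis=0)

def p(x):
    # product over the features of the univariate Gaussian density
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))

epsilon = 0.01                   # hand-picked threshold
anomalies = [x for x in X if p(x) < epsilon]
print(anomalies)                 # only the [30.0, 2.0] row should be flagged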

There is an algorithm to determine the best Epsilon

The code in Octave

% Initialise
data = load('response_times.csv');
X = data(:, [1, 2]);
[m, n] = size(X);

% Elements: estimate the Gaussian parameters
mu = mean(X)';
Sigma2 = var(X, 1)';

% Gaussian: p(x) for every element
k = length(mu);
if (size(Sigma2, 2) == 1) || (size(Sigma2, 1) == 1)
    Sigma2 = diag(Sigma2);
end
X = bsxfun(@minus, X, mu(:)');
p = (2 * pi) ^ (- k / 2) * det(Sigma2) ^ (-0.5) * ...
    exp(-0.5 * sum(bsxfun(@times, X * pinv(Sigma2), X), 2));

The code in Octave (cont.)

% Determine epsilon (the max deviation threshold) using the F1 score
bestEpsilon = 0;
bestF1 = 0;
F1 = 0;
stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)
  predictions = (pval < epsilon);
  tp = sum((predictions == 1 & yval == 1));
  fp = sum((predictions == 1 & yval == 0));
  fn = sum((predictions == 0 & yval == 1));
  precision = tp / (tp + fp);
  recall = tp / (tp + fn);
  F1 = (2 * precision * recall) / (precision + recall);
  if F1 > bestF1
     bestF1 = F1;
     bestEpsilon = epsilon;
  end
end

% The anomalies!
outliers = find(p < bestEpsilon);

WE MADE IT TO THE END OF THIS SESSION!

Agenda

  • General ideas about Machine Learning
  • Problem 1 - Used car prices (Linear Regression)
  • Problem 2 - Is this job position interesting? (Logistic regression)
  • Problem 3 - Improving Problem 2 solution (Regularization)
  • Problem 4 - Improving problem 3 (Support vector machines)
  • Problem 5 - Is this an attack or just heavy processing (Anomaly detection)
  • Extras and References

Bonus!

Problem

The set of features is getting too large.

(n > 1000 )

I don't wanna handle huuuuuge polynomials!

Problem

The set of features is getting too large.

(n > 1000 )

Neural Networks

Problem

The set of features is getting too large.

(n > 1000 )

Neural Networks

A full explanation of Neural Networks is out of the scope of this presentation

Out of our scope - but interesting

  • Backpropagation
  • PCA (Principal Component Analysis)
  • Obtaining the datasets
  • A deeper comparison of which method you should select
  • Is it Supervised or unsupervised?
  • Reinforcement learning
  • Which features should I select?

References

Special Thanks

  • B.C., for the constant review
  • @confoo team

Thank you :)

Questions?

 

hannelita@gmail.com

@hannelita