Machine Learning Algorithms
A crash course: a concise approach
Disclaimer
The mathematical content is not rigorous and the notation may be incomplete or inaccurate. Please forgive me.
There is math.
There is theory.
There is pseudocode.
A lot of content in 40 minutes
Interact with me!
There are several presentations about M.L.
"M.L. is complicated!"
clf = linear_model.LogisticRegression(C=1.0,
penalty='l1', tol=1e-6)
"What does this method do?"
"M.L. "
"..."
"M.L. is easy!11!"
"M.L. is easy!11!"
"Yet another presentation about M.L..."
"Because it is as cool as big data now."
Goals
- Get started with Machine Learning
- Connect the missing pieces
- Show the relation between Mathematics, ML theory and implementation
- You can apply M.L. to your personal projects (example: my bots)
Agenda
- General ideas about Machine Learning
- Problem 1 - Used car prices (Linear Regression)
- Problem 2 - Is this job position interesting? (Logistic Regression)
- Problem 3 - Improving Problem 2 solution (Regularization)
- Problem 4 - Improving Problem 3 (Support Vector Machines)
- Problem 5 - Is this an attack or just heavy processing? (Anomaly detection)
- Extras and References
Some problems
Supervised Learning - we have an idea of the right answer for what we are asking. Example: given a picture of a person, predict their age.
Unsupervised Learning - we have no idea of the right answer for what we are asking. Example: given a collection of items you don't know, try to group them by similarity.
Reinforcement Learning - let the machine take control of the context while you provide feedback on its actions. Example: reduce items in stock by creating dynamic promotions.
General idea
Training set -> Learning Algorithm -> hypothesis
input -> hypothesis -> predicted output
This presentation will cover distinct algorithms + theory + practice
For every algorithm, we will:
- Come up with a problem to solve
- Come up with a strategy
- Understand the mechanics of the algorithm
- Come up with a test scenario
- Provide examples with libraries in C/C++/Python
Agenda
- General ideas about Machine Learning
- Problem 1 - Used car prices (Linear Regression)
- Problem 2 - Is this job position interesting? (Logistic Regression)
- Problem 3 - Improving Problem 2 solution (Regularization)
- Problem 4 - Improving Problem 3 (Support Vector Machines)
- Problem 5 - Is this an attack or just heavy processing? (Anomaly detection)
- Extras and References
Problem:
Sell used cars. Find the best price to sell them (not considering people who collect old cars)
Do we have any idea of known relations?
older -> cheaper
unpopular brands -> cheaper
too many kms -> cheaper
Problem:
Sell used cars. Find the best price to sell them (not considering people who collect old cars)
What kind of M.L. Algorithm would you use here?
Supervised Learning
Algorithm draft:
Choose one variable to analyse against what you want to predict (example: year vs price). Price is the variable you want to predict.
Come up with a training set to analyse these variables
Number of training examples (m)
input variable or features - x
output variable or target - y
Back to the principle
Training set (m examples) -> Learning Algorithm -> hypothesis h
input x -> h -> predicted output
Strategy for h
Linear equation
h = ax + b
How do you choose a and b?
From the training set, we have expected values y for a certain x:
Come up with a hypothesis that gives you
the smallest error for all the training set:
The algorithm
Your hypothesis for an input x gives h(x); the output of the training set is y. Measure the difference:

h(x) - y

Measure the difference for the entire training set, squaring each term (we don't want positive and negative values to cancel):

\sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2

Take the average. This is the Cost Function:

J = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2

We want to minimize the difference.
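To make the formula concrete, here is a minimal sketch in Python (assuming NumPy; the three-example training set is hypothetical):

import numpy as np

# hypothetical toy training set (m = 3 examples)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def cost(a, b):
    # J(a, b) = 1/(2m) * sum((h(x) - y)^2), with h(x) = a*x + b
    m = len(x)
    h = a * x + b
    return (1.0 / (2 * m)) * np.sum((h - y) ** 2)

print(cost(0.0, 0.0))  # poor hypothesis -> J is large (~9.33)
print(cost(2.0, 0.0))  # perfect fit -> J = 0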
Understanding it
We can come up with different hypotheses (different slopes for the h function); each one gives a different value of the cost function.
On h = ax + b, varying only a, the cost function J(a) is a parabola, and the best hypothesis sits at its minimum value.
We also need to vary b: with both parameters, J(a, b) becomes a surface (3 dimensions).
Back to the algorithm
Minimize any Cost Function: we start with a guess and 'walk' to the min value.
How do you find the min values of a function? Calculus is useful here - we can get this information from the derivative.
Gradient Descent
Back to the cost function: using the partial derivatives of J, we start with a guess and walk on the graph towards the min value. The size of each step is set by the learning rate \alpha (another guess).
Repeat until convergence:

a := a - \alpha \frac{\partial}{\partial a} J(a, b)
b := b - \alpha \frac{\partial}{\partial b} J(a, b)
How do I transform all this into code???
Octave/ Matlab for prototyping
Adjust your theory to work with Linear Algebra and Matrices
Expand the derivatives and repeat until convergence:

b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)
a := a - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)}
The code in Octave
% Initialise
data = load('used_cars.csv'); % columns: year, price
y = data(:, 2);
m = length(y);                % number of training examples
X = [ones(m, 1), data(:, 1)]; % add the intercept column
theta = zeros(2, 1);          % linear function
iterations = 1500;
alpha = 0.01;                 % learning rate

% Cost Function
predictions = X * theta;
sqErrors = (predictions - y) .^ 2;
J = 1 / (2 * m) * sum(sqErrors);

% Gradient Descent
J_history = zeros(iterations, 1);
for iter = 1:iterations
  x = X(:, 2);
  delta = theta(1) + (theta(2) * x); % h(x) for every example
  tz = theta(1) - alpha * (1/m) * sum(delta - y);
  t1 = theta(2) - alpha * (1/m) * sum((delta - y) .* x);
  theta = [tz; t1];                  % simultaneous update
  J_history(iter) = computeCost(X, y, theta); % the J computation above
end
Possible problems with Octave/Matlab
- Performance issues
- Integration with an existing project
In a large project
Linear regression with C++
mlpack_linear_regression --training_file used_cars.csv
--test_file used_cars_test.csv -v
Going back to used cars
We are only analysing year vs price, but we have more factors (features): model, how much the car was used before, etc.
Back to the principle
Training set (m examples) -> Learning Algorithm -> hypothesis h
input x -> h -> predicted output
Strategy for h
Consider multiple variables: a, b, c, ... (or, using Greek letters, \theta_0, \theta_1, \theta_2, ...)
Can you suggest the next steps?
Gradient Descent for multiple Variables
Repeat until convergence, simultaneously for every j:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
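As a sketch, the same update can be vectorised in Python with NumPy (assuming X already carries the column of ones, as in the Octave code above):

import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1500):
    # X: (m, n+1) input matrix with a leading column of ones
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        h = X @ theta  # predictions for every training example
        theta = theta - alpha * (1.0 / m) * (X.T @ (h - y))  # update every theta_j at once
    return theta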
Production tips
- Too many features - YOU NEED MEMORY :scream:
- Normalisation - check the Coursera lectures
- There is more than one way to minimise the cost function besides gradient descent: check a method called the 'Normal Equation' (see the sketch after this list)
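For reference, a minimal NumPy sketch of the Normal Equation (again assuming X carries the intercept column); it computes theta directly, with no learning rate and no iterations:

import numpy as np

def normal_equation(X, y):
    # theta = pinv(X'X) X'y; pinv also copes with a non-invertible X'X
    return np.linalg.pinv(X.T @ X) @ X.T @ y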
Recap
- Hypothesis h
- Cost function: the difference between our hypothesis and the output in the training set
- Gradient descent: minimize the difference
Don't forget :)
Multivariable regression - My Bots
Determine response time in a chat.
Goal - Prevent too much conversation and be helpful
Input: response minutes, human reply minutes, presence of words indicating it's taking too long, human typing action
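A minimal scikit-learn sketch of this idea (the feature values, the target 'minutes to wait before replying' and all numbers below are hypothetical, not the bot's real data):

import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical features: [response minutes, human reply minutes,
# 'taking too long' words present (0/1), human typing action (0/1)]
X = np.array([[1.0, 0.5, 0, 1],
              [5.0, 2.0, 1, 0],
              [0.5, 0.2, 0, 1]])
y = np.array([0.5, 3.0, 0.3])  # minutes to wait before replying

model = LinearRegression().fit(X, y)
print(model.predict([[2.0, 1.0, 0, 1]]))  # response time for a new chat state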
Agenda
- General ideas about Machine Learning
- Problem 1 - Used car prices (Linear Regression)
- Problem 2 - Is this job position interesting? (Logistic Regression)
- Problem 3 - Improving Problem 2 solution (Regularization)
- Problem 4 - Improving Problem 3 (Support Vector Machines)
- Problem 5 - Is this an attack or just heavy processing? (Anomaly detection)
- Extras and References
Problem
Determine if a job opportunity is interesting to me or not
Do we have any idea of known relations?
outsourcing -> no
Machine Learning -> yes
Problem
Determine if a job opportunity is interesting to me or not
What kind of M.L. Algorithm would you use here?
Supervised Learning
Algorithm draft:
The output is yes or no
Come up with a training set to analyse these variables
Input data:
Back to the principle
Training set (m examples) -> Learning Algorithm -> hypothesis h
input x -> h -> predicted output
Strategy
- Hypothesis h
- Cost function: the difference between our hypothesis and the output in the training set
- Gradient descent: minimize the difference
(nothing new!)
Strategy
h is a function between 0 and 1
Come up with a function to compare the test output with the predicted value
(cost function)
Minimize the cost function
(gradient descent)
Strategy
h is a function between 0 and 1: the Sigmoid function

h(x) = \frac{1}{1 + e^{-\theta^T x}}

h gives us a probability of yes or no.
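A quick sketch of the sigmoid in Python (assuming NumPy) shows why its output reads as a probability:

import numpy as np

def sigmoid(z):
    # maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10))  # ~0.00005 -> almost certainly 'no'
print(sigmoid(0))    # 0.5 -> undecided
print(sigmoid(10))   # ~0.99995 -> almost certainly 'yes'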
Input matrix (from your input data)
How am I supposed to find a minimum on this weird hypothesis function?
Back to Calculus: the e function often relates to ln, which often relates to log.
Our cost function needs to separate data, not measure the difference between slopes. For every element of the training set:

\mathrm{Cost}(h(x), y) = -\log(h(x)) \text{ if } y = 1, \quad -\log(1 - h(x)) \text{ if } y = 0

Improving it into a single expression:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h(x^{(i)})\right) \right]

Now the minimum. Repeat until convergence (please calculate this derivative for me!!111):

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

where h(x) = \frac{1}{1 + e^{-\theta^T x}}
This classification algorithm is called
Logistic Regression
The code in Octave
% Initialise
data = load('job_positions.csv');
X = data(:, [1, 2]); y = data(:, 3);
[m, n] = size(X);
X = [ones(m, 1) X];  % add the intercept column
theta = zeros(n + 1, 1);

% Cost Function
h = sigmoid(X * theta);
J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));

% Gradient Descent (the gradient used by the minimisation routine)
grad = (1/m) * X' * (h - y);

% Prediction
p = sigmoid(X * theta) >= 0.5;
In a large project
Logistic regression with Python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X, y)  # X: input matrix, y: yes/no labels
model.score(X, y)        # mean accuracy on the training set
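As a usage sketch (the two-feature vector below is hypothetical), the fitted model can then score a new job position:

# probability of [no, yes] and the final decision for a new position
print(model.predict_proba([[1.0, 0.0]]))
print(model.predict([[1.0, 0.0]]))  # yes/no, thresholded at 0.5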
Agenda
- General ideas about Machine Learning
- Problem 1 - Used car prices (Linear Regression)
- Problem 2 - Is this job position interesting? (Logistic Regression)
- Problem 3 - Improving Problem 2 solution (Regularization)
- Problem 4 - Improving Problem 3 (Support Vector Machines)
- Problem 5 - Is this an attack or just heavy processing? (Anomaly detection)
- Extras and References
Problem
The training set works fine; new predictions are terrible :'(
High variance
It fails to generalise!
Where is the problem? Is it in our hypothesis?
Problem
The training set works fine; new predictions are terrible.
The problem might be the cost function (comparing the predicted values with the training set):

J = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2

A better cost function:

J = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]

\lambda is the Regularization param: it controls the tradeoff and lowers the variance.
Regularization
We can use it with Linear and Logistic Regression
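In scikit-learn the regularization strength is a constructor parameter; a minimal sketch (the parameter values are just guesses):

from sklearn.linear_model import Ridge, LogisticRegression

# linear regression with L2 regularization: larger alpha -> stronger penalty
linear = Ridge(alpha=1.0)

# logistic regression: C is the inverse of the regularization strength,
# so a smaller C -> stronger penalty -> lower variance
logistic = LogisticRegression(C=1.0)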
Agenda
- General ideas about Machine Learning
- Problem 1 - Used car prices (Linear Regression)
- Problem 2 - Is this job position interesting? (Logistic Regression)
- Problem 3 - Improving Problem 2 solution (Regularization)
- Problem 4 - Improving Problem 3 (Support Vector Machines)
- Problem 5 - Is this an attack or just heavy processing? (Anomaly detection)
- Extras and References
Problem
Logistic regression is inaccurate.
Problem
Logistic regression is too slow.
Strategy
Separate the groups with a boundary that keeps a distance d from each of them.
Which one is better? The boundary with the equal largest distance to both groups, not considering outliers.
distance => vector norm
Strategy
Measure group similarity: the input vs each point in the graph.
Strategy
Can you guess a component of the function? A bit more than a plain distance - a similarity (kernel) function built from the norm:

\mathrm{similarity}(x, l) = \exp\left( -\frac{\lVert x - l \rVert^2}{2\sigma^2} \right)

And we do that for all the inputs x from the training set.
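A minimal Python sketch of that similarity function (assuming NumPy; sigma is a guess):

import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    # similarity ~1 when x is close to the landmark, -> 0 far away
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, np.array([1.0, 2.0])))  # 1.0 (same point)
print(gaussian_kernel(x, np.array([5.0, 9.0])))  # ~0 (far away)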
Back to the beginning: Strategy
- Hypothesis h
- Cost function: the difference between our hypothesis and the output in the training set
- Minimize the difference (it was the gradient descent in the previous methods)
(nothing new!)
Back to the beginning: Strategy
- Hypothesis => similarity [or kernels] √
- Cost function: the difference between our hypothesis and the output in the training set
- Minimize the difference (it was the gradient descent in the previous methods)
The cost function
It measures the similarity of the groups (by the difference between the output in the training set and your predicted value).
Looks like Logistic Regression.
Regularization also appears here.
The cost function - alternative notation:

J(\theta) = C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \, \mathrm{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2

And replacing the inputs by the kernel features f - don't forget, h is the similarity/kernel function.
And now minimize it! And that is a lot of work!
Instead, we can use a library to calculate the minimum.
Welcome to the Support Vector Machine (SVM)
Extras
- There is more than one formula for a kernel
- How many kernels should I have? You need to start with a guess.
- When should I use Logistic Regression and when should I use SVM? (quick analysis: if you have fewer features (< 1000), use SVM; else, Logistic Regression)
In a project
Support Vector Machine with Python
import numpy as npy
from sklearn import svm

# features from the CSV; labels: 8 'no' examples then 8 'yes' examples
X = npy.loadtxt(open("jobs.csv", "rb"), delimiter=",", skiprows=1)
Y = [0] * 8 + [1] * 8

# try different kernel functions
for kernel in ('linear', 'poly', 'rbf'):
    clf = svm.SVC(kernel=kernel, gamma=2)
    clf.fit(X, Y)
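A usage sketch for the fitted classifier (the feature vector is hypothetical and must match the columns of jobs.csv):

# classify a new, hypothetical job position with the last fitted kernel
print(clf.predict([[0.3, 0.7]]))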
Agenda
- General ideas about Machine Learning
- Problem 1 - Used car prices (Linear Regression)
- Problem 2 - Is this job position interesting? (Logistic Regression)
- Problem 3 - Improving Problem 2 solution (Regularization)
- Problem 4 - Improving Problem 3 (Support Vector Machines)
- Problem 5 - Is this an attack or just heavy processing? (Anomaly detection)
- Extras and References
Problem
I have no idea about the number of kernels. I can't establish any relations in my data.
What kind of M.L. algorithm is this?
Unsupervised Learning
Problem
I have no idea about the number of kernels. I can't establish any relations in my data.
input = {x1, x2, x3...}
no y!
Welcome to unsupervised learning!
Problem
Is this another group
or an anomaly?
Problem
My bot is taking too long to respond - is it just heavy processing or is it someone trolling me?
Strategy
Calculate the probability of each element of the input training set to determine if it's normal or not
Gaussian distribution
(figure: Gaussian curve - https://www.phy.ornl.gov/csep/gif_figures/mcf7.gif)
Strategy
Gaussian distribution
Select the elements of the training set that you consider weird (example: add upper/lower bounds).
Algorithm
Expand it according to the Gaussian distribution:

p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

Now we compute p(x) for each element and flag an anomaly when p(x) < \varepsilon (ex: a 5% or 10% deviation).
There is an algorithm to determine the best Epsilon.
The code in Octave
% Initialise
data = load('response_times.csv');
X = data(:, [1, 2]);
[m, n] = size(X);

% Gaussian parameters for the elements of the training set
mu = mean(X)';
sigma2 = var(X, 1)';

% Multivariate Gaussian: p(x) for every element
k = length(mu);
if (size(sigma2, 2) == 1) || (size(sigma2, 1) == 1)
  sigma2 = diag(sigma2); % variance vector -> covariance matrix
end
X = bsxfun(@minus, X, mu(:)');
p = (2 * pi) ^ (- k / 2) * det(sigma2) ^ (-0.5) * ...
    exp(-0.5 * sum(bsxfun(@times, X * pinv(sigma2), X), 2));
The code in Octave (cont.)
% Determine the best Epsilon (max deviant) from a labelled
% cross-validation set: pval holds p(x) for each example, yval the labels
bestEpsilon = 0;
bestF1 = 0;
stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)
  predictions = (pval < epsilon);
  tp = sum((predictions == 1) & (yval == 1)); % true positives
  fp = sum((predictions == 1) & (yval == 0)); % false positives
  fn = sum((predictions == 0) & (yval == 1)); % false negatives
  precision = tp / (tp + fp);
  recall = tp / (tp + fn);
  F1 = (2 * precision * recall) / (precision + recall);
  if F1 > bestF1
    bestF1 = F1;
    bestEpsilon = epsilon;
  end
end

% The anomaly!
outliers = find(p < bestEpsilon);
WE MADE IT TO THE END OF THIS SESSION!
Agenda
- General ideas about Machine Learning
- Problem 1 - Used car prices (Linear Regression)
- Problem 2 - Is this job position interesting? (Logistic Regression)
- Problem 3 - Improving Problem 2 solution (Regularization)
- Problem 4 - Improving Problem 3 (Support Vector Machines)
- Problem 5 - Is this an attack or just heavy processing? (Anomaly detection)
- Extras and References
Bonus!
Problem
The set of features is getting too large (n > 1000). I don't wanna handle huuuuuge polynomials!
Neural Networks
A full explanation of Neural Networks is out of the scope of this presentation.
Out of our scope - but interesting
- Backpropagation
- PCA (Principal Component Analysis)
- Obtaining the datasets
- A deeper comparison of which method you should select
- Is it Supervised or Unsupervised?
- Reinforcement learning
- Which features should I select?
References
- Coursera Stanford course
- Awesome M.L. Books
- Alex Smola and S.V.N. Vishwanathan - Introduction to Machine Learning
- mlpack
- scikit-learn
Special Thanks
- B.C., for the constant review
- @confoo team
Thank you :)
Questions?
hannelita@gmail.com
@hannelita