A crash and concise approach
Disclaimer
The mathematical content is not rigorous. The notation may be incomplete or inaccurate. Please forgive me.
There is math.
There is theory.
There is pseudocode.
A lot of content in 40min
Interact with me!
from sklearn import linear_model

clf = linear_model.LogisticRegression(C=1.0, penalty='l1', tol=1e-6)
"What does this method do?"
"M.L. "
"..."
"Yet another presentation about M.L..."
"Because it is as cool as big data now."
Supervised Learning
Unsupervised Learning
Reinforcement Learning
We have an idea of the right answer for what we are asking. Example: Given a picture of a person, predict how old he or she is.
We have no idea of the right answer for what we are asking. Example: Given a collection of items you don't know, try to group them by similarity.
Let the machine take control of the context while you provide feedback as input.
Example: Reduce items in stock by creating dynamic promotions
Training set → Learning Algorithm → hypothesis
input → hypothesis → predicted output
Sell used cars. Find the best price to sell them (not considering people who collect old cars)
Do we have any idea of known relations?
older -> cheaper
unpopular brands -> cheaper
too many kms -> cheaper
Sell used cars. Find the best price to sell them (not considering people who collect old cars)
What kind of M.L. Algorithm would you use here?
Supervised Learning
Choose one variable to analyse against what you want to predict (example: year vs price). Price is the variable you want to predict.
Come up with a training set to analyse these variables
input variable or features - x
output variable or target - y
Training set (m examples) → Learning Algorithm → hypothesis h
input x → h → predicted output y
Linear equation
h = ax + b
How do you choose a and b?
From the training set, we have expected values y for a certain x:
Come up with a hypothesis that gives you
the smallest error for all the training set:
Your hypothesis h(x) for an input x, and the output y of the training set: measure the difference
h(x) - y
Measure the difference for the entire training set:
sum over the m examples of ( h(x_i) - y_i )
We don't want to cancel positive and negative values, so square each difference:
( h(x_i) - y_i )²
Take the average over the m examples.
Cost Function:
J = 1/(2m) * sum over the m examples of ( h(x_i) - y_i )²
We want to minimize the difference
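To make the cost function concrete, here is a minimal Python/NumPy sketch of it (not from the slides); the used-car years and prices below are made up for illustration.
import numpy as np

def cost(a, b, x, y):
    # J: average of the squared differences between h(x) = a*x + b and the expected y
    m = len(y)
    predictions = a * x + b
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([2005, 2008, 2011, 2014])   # year (toy values)
y = np.array([4000, 6500, 9000, 12000])  # price (toy values)
print(cost(800.0, -1600000.0, x, y))     # the cost for one guess of a and b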
We can come up with different hypotheses (different slopes for the h function)
That's the difference for one hypothesis; that's the difference for another one.
In h = ax + b, varying only a gives the curve J(a); its lowest point is the minimum value.
Varying both a and b gives the surface J(a, b).
Minimize any Cost Function.
Min: we start with a guess and 'walk' to the min value.
Or: starting from the guess, we walk on the graph towards the min value.
Partial Derivatives
The partial derivatives point towards the min value: we start with a guess and walk on the graph.
Repeat until convergence:
a := a - α * ∂J/∂a
b := b - α * ∂J/∂b
α is the learning rate (another guess)
data = load('used_cars.csv'); % year x price
y = data(:, 2);
m = length(y);                 % number of training examples
X = [ones(m, 1), data(:, 1)];  % add a column of ones for the intercept
theta = zeros(2, 1);           % linear function: two parameters
iterations = 1500;
alpha = 0.01;                  % learning rate

predictions = X * theta;             % h(x) for every example
sqErrors = (predictions - y) .^ 2;   % squared differences
J = 1/(2*m) * sum(sqErrors);         % cost function
J_history = zeros(iterations, 1);
for iter = 1:iterations
    x = X(:, 2);
    delta = theta(1) + (theta(2) * x);                    % current predictions h(x)
    tz = theta(1) - alpha * (1/m) * sum(delta - y);
    t1 = theta(2) - alpha * (1/m) * sum((delta - y) .* x);
    theta = [tz; t1];                                     % simultaneous update of both parameters
    J_history(iter) = computeCost(X, y, theta);           % track the cost at every step
end
Initialise
Cost Function
Gradient Descent
Linear regression with C++
mlpack_linear_regression --training_file used_cars.csv
--test_file used_cars_test.csv -v
We are only analysing year vs price. We have more factors: model, how much the car was used before, etc
Training set (m examples) → Learning Algorithm → hypothesis h
input x → h → predicted output
Consider multiple variables: a, b, c, ... (or using Greek letters θ_0, θ_1, θ_2, ...)
h = θ_0 + θ_1*x_1 + θ_2*x_2 + ...
Repeat until convergence, updating every parameter with its own partial derivative:
θ_j := θ_j - α * (1/m) * sum of ( h(x_i) - y_i ) * x_i,j
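A minimal Python/NumPy sketch of this multivariate update (not from the slides); it assumes the features are already collected in a matrix X with a leading column of ones, and the toy values are made up:
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1500):
    # vectorised gradient descent for h(x) = X @ theta, any number of features
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        predictions = X @ theta                 # h(x) for every example
        gradient = X.T @ (predictions - y) / m  # partial derivative of J for each theta_j
        theta -= alpha * gradient               # simultaneous update
    return theta

X = np.array([[1.0, 0.5, 1.2], [1.0, 1.0, 0.8], [1.0, 1.5, 0.3]])  # ones column + 2 scaled features
y = np.array([1.0, 2.0, 3.0])
theta = gradient_descent(X, y)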
Determine response time in a chat.
Goal - Prevent too much conversation and be helpful
Input: response minutes, human reply minutes, presence of words indicating it's taking too long, human typing action
Determine if a job opportunity is interesting to me or not
Do we have any idea of known relations?
outsourcing -> no
Akka -> Yes
Machine Learning -> yes
Determine if a job opportunity is interesting to me or not
What kind of M.L. Algorithm would you use here?
Supervised Learning
The output is yes or no
Come up with a training set to analyse these variables
Training set (m examples) → Learning Algorithm → hypothesis h
input x → h → predicted output y
(nothing new!)
h is a function between 0 and 1
Come up with a function to compare the test output with the predicted value
(cost function)
Minimize the cost function
(gradient descent)
h is a function between 0 and 1
Sigmoid function: g(z) = 1 / (1 + e^(-z)), and h(x) = g(θᵀx)
h gives us a probability of yes or no
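A tiny Python sketch of that hypothesis (the names are illustrative):
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # the probability that example x belongs to the "yes" class
    return sigmoid(np.dot(theta, x))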
Input matrix (from your input data)
How do we measure the error on this weird hypothesis function? The cost function needs to separate data, not measure the difference between the slopes.
For every element of the training set:
J = -(1/m) * sum of ( y_i * log(h(x_i)) + (1 - y_i) * log(1 - h(x_i)) )
Repeat until convergence:
θ_j := θ_j - α * (1/m) * sum of ( h(x_i) - y_i ) * x_i,j
where h(x) = sigmoid(θᵀx) = 1 / (1 + e^(-θᵀx))
data = load('job_positions.csv');
X = data(:, [1, 2]); y = data(:, 3);   % two features, binary target
[m, n] = size(X);
X = [ones(m, 1) X];                    % add intercept column
theta = zeros(n + 1, 1);               % initial guess

h = sigmoid(X * theta);                                  % hypothesis for every example
J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));      % cost function
grad = (1/m) * X' * (h - y);                             % gradient

p = sigmoid(X * theta) >= 0.5;   % predict "yes" when the probability is at least 0.5
Initialise
Cost Function
Gradient Descent
Prediction
Logistic regression with Python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model = model.fit(X, y)
model.score(X, y)
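Once fitted, the same model can score a new job offer; the feature vector below is a made-up example with the same columns as X:
new_job = [[1, 0]]             # hypothetical feature values
model.predict(new_job)         # 0 or 1: not interesting / interesting
model.predict_proba(new_job)   # the probability of each class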
The training set works fine; new predictions are terrible :'(
High variance
It fails to generalise!
Where is the problem? Is it in our hypothesis?
The training set works fine; new predictions are terrible
The problem might be the cost function (comparing the predicted values with the training set)
The cost function we had:
J = 1/(2m) * sum of ( h(x_i) - y_i )²
With regularization, we add a penalty on the parameters:
J = 1/(2m) * [ sum of ( h(x_i) - y_i )² + λ * sum of θ_j² ]
λ - the regularization param
It controls the tradeoff / lowers the variance
We can use it with Linear and Logistic Regression
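In scikit-learn, the regularization strength of LogisticRegression is controlled through the C parameter (C is the inverse of the regularization strength), so a rough sketch for tuning it could look like this; the grid of values is arbitrary:
from sklearn.linear_model import LogisticRegression

for C in (0.01, 0.1, 1.0, 10.0):               # smaller C = stronger regularization
    model = LogisticRegression(C=C).fit(X, y)
    print(C, model.score(X, y))                 # in practice, score on a separate test set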
Logistic regression is inaccurate.
Logistic regression is too slow.
Which one is better?
The boundary with the equal largest distance d to each group, not considering outliers.
distance => vector norm
Measure group similarity
input vs each point in the graph
Can you guess a component of the function?
A bit more...
And we do that for all the inputs x from the training set
(nothing new!)
The similarity of the groups (measured by the difference between the output in the training set and your predicted value)
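As a sketch of what such a similarity (kernel) function can look like, here is the Gaussian / RBF kernel (the same 'rbf' used in the scikit-learn example below), computed against every input of the training set; sigma is a free parameter:
import numpy as np

def gaussian_kernel(x, landmark, sigma=1.0):
    # similarity is 1 when x sits on the landmark and decays with the distance (vector norm)
    return np.exp(-np.linalg.norm(x - landmark) ** 2 / (2 * sigma ** 2))

def similarity_features(x, X_train, sigma=1.0):
    # one similarity feature per training example
    return np.array([gaussian_kernel(x, xi, sigma) for xi in X_train])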
The cost function looks like Logistic Regression, plus regularization, and replacing the inputs with the kernel features.
Don't forget - h is the similarity/kernel function.
And that is a lot of work!
Instead, we can use a library to calculate the minimum.
Support Vector Machine with Python
import numpy as npy
from sklearn import svm
X = npy.loadtxt(open("jobs.csv", "rb"), delimiter=",", skiprows=1)  # skip the header row
Y = [0] * 8 + [1] * 8                                               # first 8 rows "no", last 8 rows "yes"
for kernel in ('linear', 'poly', 'rbf'):
    clf = svm.SVC(kernel=kernel, gamma=2)
    clf.fit(X, Y)
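A possible follow-up, classifying a new (made-up) job offer with the last fitted classifier, assuming jobs.csv has two feature columns:
new_offer = [[1, 0]]            # hypothetical feature values
print(clf.predict(new_offer))   # 0 = not interesting, 1 = interesting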
I have no idea about the number of kernels. I can't establish any relations in my data.
What kind of M.L. algorithm is this?
Unsupervised Learning
I have no idea about the number of kernels. I can't establish any relations in my data.
input = {x1, x2, x3...}
no y!
Is this another group
or an anomaly?
My bot is taking too long to respond - is it just heavy processing or is it someone trolling me?
Calculate the probability of each element of the input training set to determine if it's normal or not
Gaussian distribution
(image: https://www.phy.ornl.gov/csep/gif_figures/mcf7.gif)
Select the elements on the training set that you consider weird (example: add upper/lower bounds)
Expand it according to the Gaussian distribution: estimate the mean μ and the variance σ² of each feature from the training set.
Now we compute the probability p(x) of each element and flag it as anomalous when p(x) < ε (ex: a 5% or 10% deviation).
There is an algorithm to determine the best Epsilon
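A minimal Python sketch of the idea, with an independent Gaussian per feature (the Octave version below uses the multivariate form); the file name and threshold are illustrative:
import numpy as np
from scipy.stats import norm

X = np.loadtxt('response_times.csv', delimiter=',')   # one row per observation
mu = X.mean(axis=0)
sigma = X.std(axis=0)
p = norm.pdf(X, mu, sigma).prod(axis=1)   # p(x) = product of the per-feature Gaussian densities
epsilon = 0.05                            # guessed threshold; see the F1-based selection below
anomalies = np.where(p < epsilon)[0]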
data = load('response_times.csv');
X = data(:, [1, 2]);
[m, n] = size(X);
mu = mean(X)';        % per-feature mean
sigma2 = var(X, 1)';  % per-feature variance

k = length(mu);
if (size(sigma2, 2) == 1) || (size(sigma2, 1) == 1)
    sigma2 = diag(sigma2);   % treat the variance vector as a diagonal covariance matrix
end
X = bsxfun(@minus, X, mu(:)');   % center the data
p = (2 * pi) ^ (- k / 2) * det(sigma2) ^ (-0.5) * ...
    exp(-0.5 * sum(bsxfun(@times, X * pinv(sigma2), X), 2));   % multivariate Gaussian density
Gaussian Elements
Initialise
bestEpsilon = 0;
bestF1 = 0;
F1 = 0;
stepsize = (max(pval) - min(pval)) / 1000;   % pval: probabilities on a labelled validation set (yval)
for epsilon = min(pval):stepsize:max(pval)
    predictions = (pval < epsilon);
    tp = sum((predictions == 1) & (yval == 1));   % true positives
    fp = sum((predictions == 1) & (yval == 0));   % false positives
    fn = sum((predictions == 0) & (yval == 1));   % false negatives
    precision = tp / (tp + fp);
    recall = tp / (tp + fn);
    F1 = (2 * precision * recall) / (precision + recall);
    if F1 > bestF1
        bestF1 = F1;
        bestEpsilon = epsilon;
    end
end
outliers = find(p < bestEpsilon);   % the elements flagged as anomalies
Determine Epsilon (max deviant)
The anomaly!
The set of features is getting too large.
(n > 1000)
I don't wanna handle huuuuuge polynomials!
The set of features is getting too large.
(n > 1000)
Neural Networks
A full explanation of Neural Networks is out of the scope of this presentation
Questions?
hannelita@gmail.com
@hannelita