A Gentle Introduction to Machine Learning and Neural Networks

Sean Meling Murray,

Department of Mathematics,

University of Bergen

Sources: New York Post, Futurism, VICE Motherboard, The Conversation, The Verge

Machine learning gives computers the ability to learn from data without being explicitly programmed to do so.

Algorithms allow us to build models that make data-driven predictions about the stuff we're interested in.

Source: Wikipedia

Regression

Typically used when we want our model to predict a continuous numerical value.

E.g. How much is my apartment worth given size, number of rooms, etc.?

Classification

Typically used when we want our model to classify an event or a thing into one of K categories.

E.g. Given today's housing market, will I be able to sell my apartment (yes or no)?

Let's look closer at linear regression:

Using the data points (x_i, y_i), we want to find the straight line that best fits the data:

\hat{y} = \hat{w}_0 + \hat{w}_1x

This is our model: \hat{w}_0 is the intercept, \hat{w}_1 is the slope (together they are the parameters), x is the input (aka features, i.e. properties of the thing we are trying to predict), and \hat{y} is the output. For each data point (x_i, y_i) we want

y_i \approx w_0 + w_1x_i

Supervised

We compare our model's prediction to a known label or value and try to make the prediction error as small as possible.

Unsupervised

We try to identify meaningful patterns in unlabelled data. In other words, we let the data speak for itself.

Source: Tango with Code

Source: Sagar Sharma, towardsdatascience.com

We want to minimize the prediction errors. A standard choice is the mean squared error (MSE) cost function:

C(y,\hat{y}) = \frac{1}{2N} \sum_{i=1}^N (y_i - \hat{y}_i)^2

Here C is the cost, y_i is the true value and \hat{y}_i is the prediction.

Using the input-output pairs (x_i, y_i) in our data, which combination of weight settings gives us the lowest value of the cost function?
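To make the cost concrete, here is a minimal sketch in Python/NumPy (the helper name mse_cost and the apartment-style toy numbers are illustrative assumptions, not part of the slides):

import numpy as np

def mse_cost(y_true, y_pred):
    # Mean squared error with the 1/(2N) convention used above.
    n = len(y_true)
    return np.sum((y_true - y_pred) ** 2) / (2 * n)

# Made-up toy data: apartment size (m^2) vs. price (in thousands).
x = np.array([30.0, 45.0, 60.0, 80.0])
y = np.array([2000.0, 2800.0, 3500.0, 4600.0])

w0, w1 = 100.0, 50.0       # one possible setting of the weights
y_hat = w0 + w1 * x        # the model's predictions for that setting
print(mse_cost(y, y_hat))  # the cost tells us how bad this setting is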

Finding the weight settings that minimize the cost is called training, learning, or optimizing the model.

For the straight-line model we can do this analytically. Solve:

\frac{\partial C}{\partial w_0} = 0, \qquad \frac{\partial C}{\partial w_1} = 0
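As a sketch of what that solution looks like in practice (Python/NumPy, reusing the made-up toy data from above), setting the two partial derivatives to zero leads to the normal equations, which we can solve directly:

import numpy as np

# Same made-up toy data as above.
x = np.array([30.0, 45.0, 60.0, 80.0])
y = np.array([2000.0, 2800.0, 3500.0, 4600.0])

# With a design matrix whose rows are [1, x_i], the conditions
# dC/dw0 = 0 and dC/dw1 = 0 become (X^T X) w = X^T y.
X = np.column_stack([np.ones_like(x), x])
w0, w1 = np.linalg.solve(X.T @ X, X.T @ y)

print(f"intercept w0 = {w0:.1f}, slope w1 = {w1:.1f}")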

Let's visualize the regression line a little differently...

y_i = w_0+w_1x_i

We can draw this model as a diagram: two input circles, one holding the constant 1 (the intercept term) and one holding the feature x_i, are connected by the weights w_0 and w_1 to a circle \sum{} that adds up the weighted inputs and outputs y_i. In other words, y_i = w_0 \cdot 1 + w_1 \cdot x_i.

Let's call these circles neurons!

What if we want to model a non-linear relationship?

y_i = \frac{1}{1+e^{-(w_0+w_1x_i)}}

In the diagram, the neuron still computes the weighted sum of the inputs 1 and x_i with the weights w_0 and w_1, but now passes it through a non-linear squashing function (drawn as \sum{}|\sigma):

\sigma_1(input) = \frac{1}{1+e^{-input}}

We can also chain neurons. A second neuron can apply thresholding to the output of the first:

\sigma_2(\sigma_1(input)) = \begin{cases} \text{bored} & \text{if }\sigma_1(input)>0.5\\ \text{not bored} & \text{otherwise} \end{cases}

In the diagram, the inputs 1, \dots, x_n feed into one \sum{}|\sigma neuron, whose output (together with a constant 1) feeds into a second \sum{}|\sigma neuron that produces the final output y.
We call the weighted sum of inputs the activation of the neuron. The non-linear transformation is called the activation function.
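As a rough sketch of a single such neuron in code (Python/NumPy; the weights, the input value and the "bored / not bored" labels are purely illustrative assumptions):

import numpy as np

def sigma1(activation):
    # Sigmoid activation function: squashes the activation into (0, 1).
    return 1.0 / (1.0 + np.exp(-activation))

def sigma2(p):
    # Thresholding: turn the sigmoid output into a class label.
    return "bored" if p > 0.5 else "not bored"

w0, w1 = -3.0, 0.05            # illustrative weights
x_i = 90.0                     # e.g. minutes into a lecture (made up)
activation = w0 * 1 + w1 * x_i # the activation: weighted sum of the inputs 1 and x_i
print(sigma2(sigma1(activation)))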

Neural network!

y = \sigma_2\left(\text{w}_2^\text{T}\,\sigma_1\left(\text{W}_1x\right)\right)

Composition of non-linear functions!

Here \text{W}_1 and \text{w}_2 are the weights of the first and second layer.
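A minimal sketch of this composition as a forward pass (Python/NumPy; the layer sizes and random weights are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# y = sigma2( w2^T sigma1( W1 x ) )
x  = np.array([1.0, 0.5, -1.2])  # input features
W1 = np.random.randn(4, 3)       # first-layer weights: 3 inputs -> 4 hidden neurons
w2 = np.random.randn(4)          # second-layer weights: 4 hidden neurons -> 1 output

hidden = sigmoid(W1 @ x)         # sigma1 applied elementwise to the hidden activations
y = sigmoid(w2 @ hidden)         # sigma2 applied to the output activation
print(y)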

Network architecture

The number of neurons in a layer is its width.

Many layers result in a deep model, which is why we call it deep learning.

Image credit: KDNuggets

The number of layers in a network is its depth.

How does a neural network learn?

1. Backpropagation

2. Gradient descent

1. Backpropagation

Check out 3blue1brown on YouTube

The backprop algorithm calculates the gradients of the cost function with respect to the network's weights using the chain rule!

2. Gradient descent

Iteratively nudge the weights in the direction where the cost decreases the most, i.e. the negative of the gradient calculated in step 1.

(Figure: the cost C(y,\hat{y}) for the model \hat{y}_i = \hat{w}_0 + \hat{w}_1x_i. Source: Tango with Code)
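Putting the two steps together for the straight-line model, here is a minimal sketch of the training loop (Python/NumPy; the tiny data set, learning rate and step count are illustrative assumptions):

import numpy as np

# Tiny made-up data set whose true relationship is y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
n = len(x)

w0, w1 = 0.0, 0.0   # initial guess for the weights
lr = 0.1            # learning rate: how big a nudge we take each step

for step in range(1000):
    y_hat = w0 + w1 * x               # predictions of the current model
    error = y_hat - y
    # Step 1: gradients of C = 1/(2N) sum (y_i - y_hat_i)^2 via the chain rule.
    grad_w0 = np.sum(error) / n
    grad_w1 = np.sum(error * x) / n
    # Step 2: nudge the weights in the negative gradient direction.
    w0 -= lr * grad_w0
    w1 -= lr * grad_w1

print(f"w0 = {w0:.3f}, w1 = {w1:.3f}")  # approaches the true values 1 and 2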

Deep Learning models

Convolutional neural networks

Image recognition and object detection:

https://github.com/shaoanlu/Udacity-SDCND-Vehicle-Detection

Recurrent neural networks

Natural language processing:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Reinforcement learning

Building software agents that learn by interacting with their environment:

From the documentary AlphaGo

Google DeepMind

Generative adversarial networks

Pairs of networks that generate images and other types of data:

https://github.com/junyanz/CycleGAN

Thanks!

A Gentle Introduction to Machine Learning and Neural Networks

By Sean Meling Murray

Presentation held at the first machine learning seminar arranged by the Western Norway University of Applied Sciences, the University of Bergen and Haukeland University Hospital. See https://mmiv.no/upcoming/ for upcoming events.
