CS6910: Fundamentals of Deep Learning

Lecture 2: McCulloch Pitts Neuron, Thresholding Logic, Perceptrons, Perceptron Learning Algorithm and Convergence, Multilayer Perceptrons (MLPs), Representation Power of MLPs

Mitesh M. Khapra

Department of Computer Science and Engineering, IIT Madras

Learning Objectives

At the end of this lecture, students will understand the basics of McCulloch Pitts Neurons, Thresholding Logic, Perceptrons, the Perceptron Learning Algorithm and Convergence, Multilayer Perceptrons (MLPs), and the Representation Power of MLPs

Module 2.1: Biological Neurons

The most fundamental unit of a deep neural network is called an artificial neuron

Why is it called a neuron ? Where does the inspiration come from ?

The inspiration comes from biology (more specifically, from the brain)

biological neurons = neural cells = neural processing units

We will first see what a biological neuron looks like

Artificial Neuron

[Figure: an artificial neuron with inputs \(x_1, x_2, x_3\), weights \(w_1, w_2, w_3\), an aggregation/activation \(\sigma\), and output \(y_1\)]

dendrite: receives signals from other neurons

synapse: point of connection to other neurons

soma: processes the information

axon: transmits the output of this neuron

Biological Neurons*

*Image adapted from

https://cdn.vectorstock.com/i/composite/12,25/neuron-cell-vector-81225.jpg


Let us see a very cartoonish illustration of how a neuron works

Our sense organs interact with the outside world

They relay information to the neurons

The neurons (may) get activated and produce a response (laughter in this case)

There is a massively parallel interconnected network of neurons

Of course, in reality, it is not just a single neuron which does all this

The sense organs relay information to the lowest layer of neurons

Some of these neurons may fire (in red) in response to this information and in turn relay information to other neurons they are connected to

These neurons may also fire (again, in red) and the process continues, eventually resulting in a response (laughter in this case)

An average human brain has around \(10^{11}\) (100 billion) neurons!

For the XOR function, a single perceptron (which outputs 1 if \(w_0 + \sum_{i=1}^{2} w_ix_i \geq 0\) and 0 otherwise) would have to satisfy one condition per row of the truth table:

\(x_1\) | \(x_2\) | XOR | condition
0 | 0 | 0 | \(w_0 + \sum_{i=1}^{2} w_ix_i < 0\)
0 | 1 | 1 | \(w_0 + \sum_{i=1}^{2} w_ix_i \geq 0\)
1 | 0 | 1 | \(w_0 + \sum_{i=1}^{2} w_ix_i \geq 0\)
1 | 1 | 0 | \(w_0 + \sum_{i=1}^{2} w_ix_i < 0\)

Substituting the input values row by row gives:

\(w_0 + w_1 \cdot 0 + w_2 \cdot 0 < 0 \implies w_0 < 0\)
\(w_0 + w_1 \cdot 0 + w_2 \cdot 1 \geq 0 \implies w_2 \geq -w_0\)
\(w_0 + w_1 \cdot 1 + w_2 \cdot 0 \geq 0 \implies w_1 \geq -w_0\)
\(w_0 + w_1 \cdot 1 + w_2 \cdot 1 < 0 \implies w_1 + w_2 < -w_0\)

The fourth condition contradicts conditions 2 and 3

Hence we cannot have a solution to this set of inequalities

Indeed, you can see that it is impossible to draw a single line which separates the red points from the blue points

[Figure: the four points \((0,0)\), \((1,0)\), \((0,1)\), \((1,1)\) plotted in the \(x_1\)-\(x_2\) plane]
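To make the argument concrete, here is a small sketch (not part of the original slides) that searches over a grid of integer weights and confirms that no choice of \(w_0, w_1, w_2\) reproduces XOR, while a linearly separable function such as OR is found immediately. The function names and the grid range are arbitrary choices made for illustration.

```python
import itertools

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR = [0, 1, 1, 0]
OR  = [0, 1, 1, 1]

def perceptron(w0, w1, w2, x1, x2):
    # fires (outputs 1) when w0 + w1*x1 + w2*x2 >= 0
    return 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0

def find_weights(targets, grid=range(-3, 4)):
    # try every integer weight combination in a small grid
    for w0, w1, w2 in itertools.product(grid, repeat=3):
        if [perceptron(w0, w1, w2, x1, x2) for x1, x2 in inputs] == targets:
            return (w0, w1, w2)
    return None  # no perceptron in this grid implements the function

print("OR :", find_weights(OR))   # finds some valid (w0, w1, w2)
print("XOR:", find_weights(XOR))  # None -- consistent with the argument above
```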

Most real world data is not linearly separable and will always contain some outliers

While a single perceptron cannot deal with such data, we will show that a network of perceptrons can indeed deal with such data

In fact, sometimes there may not be any outliers but still the data may not be linearly separable

We need computational units (models) which can deal with such data

Before seeing how a network of perceptrons can deal with linearly inseparable data, we will discuss boolean functions in some more detail ...

How many boolean functions can you design from 2 inputs ?

Let us begin with some easy ones which you already know ..

Of these, how many are linearly separable ? (turns out all except XOR and !XOR - feel free to verify)

In general, how many boolean functions can you have for \(n\) inputs ?

How many of these \(2^{2^n}\) functions are not linearly separable ? For the time being, it suffices to know that at least some of these may not be linearly separable (I encourage you to figure out the exact answer :-))

\(x_1\) \(x_2\) | \(f_1\) \(f_2\) \(f_3\) \(f_4\) \(f_5\) \(f_6\) \(f_7\) \(f_8\) \(f_9\) \(f_{10}\) \(f_{11}\) \(f_{12}\) \(f_{13}\) \(f_{14}\) \(f_{15}\) \(f_{16}\)
0 0 | 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
0 1 | 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
1 0 | 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
1 1 | 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

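As a small illustrative check (not from the slides), the snippet below enumerates all \(2^{2^2}=16\) boolean functions of two inputs as 4-bit output columns and tests each one by brute force over a grid of integer weights; 14 of them turn out to be linearly separable, the two exceptions being XOR and !XOR. The grid range is an assumption chosen for illustration.

```python
import itertools

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

def separable(targets, grid=range(-3, 4)):
    # is there some (w0, w1, w2) whose thresholded output matches `targets`?
    for w0, w1, w2 in itertools.product(grid, repeat=3):
        if all((1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0) == t
               for (x1, x2), t in zip(inputs, targets)):
            return True
    return False

# every 4-bit output column is one boolean function f_1 ... f_16
functions = list(itertools.product([0, 1], repeat=4))
sep = [f for f in functions if separable(list(f))]
print(len(functions), "functions in total,", len(sep), "linearly separable")
# expected: 16 functions in total, 14 linearly separable
# the two exceptions are XOR (0,1,1,0) and !XOR (1,0,0,1)
```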

Module 2.8: Representation Power of a Network of Perceptrons

We will now see how to implement any boolean function using a network of perceptrons ...


For this discussion, we will assume True = +1
and False = -1

We consider 2 inputs and 4 perceptrons

Each input is connected to all the 4 perceptrons with specific weights

The bias (\(w_0\)) of each perceptron is \(-2\) (i.e., each perceptron will fire only if the weighted sum of its input is \(\geq 2\))

Each of these perceptrons is connected to an output perceptron by weights (which need to be learned)

The output of this perceptron (\(y\)) is the output
of this network

red edge indicates \(w\) = -1

blue edge indicates \(w\) = +1


Terminology:

This network contains 3 layers

The layer containing the inputs (\(x_1,x_2\)) is called the input layer

The middle layer containing the 4 perceptrons is called the hidden layer

The final layer containing one output neuron is called the output layer

The red and blue edges are called layer 1 weights


The outputs of the 4 perceptrons in the hidden layer are denoted by \(h_1,h_2,h_3,h_4\)

\(w_1,w_2,w_3,w_4\) are called layer 2 weights


We claim that this network can be used to implement any boolean function (linearly separable or not) !

In other words, we can find \(w_1,w_2,w_3,w_4\) such that the truth table of any boolean function can be represented by this network

Astonishing claim! Well, not really, if you understand what is going on

Each perceptron in the middle layer fires only for a specific input (and no two perceptrons fire for the same input)


the first perceptron fires for {-1,-1}


the second perceptron fires for {-1,1}


the third perceptron fires for {1,-1}


the fourth perceptron fires for {1,1}
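A minimal sketch of this hidden layer (assuming, as above, True = +1 and False = -1, incoming weights equal to the pattern each perceptron detects, and bias \(-2\)) confirms that each hidden perceptron fires for exactly one of the four inputs. Names such as `hidden_layer` and `patterns` are illustrative choices, not from the lecture.

```python
# Each hidden perceptron's incoming weights equal the input pattern it detects
# (blue edge = +1, red edge = -1) and its bias is -2, so it fires only when the
# weighted sum of its inputs reaches 2.
patterns = [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]

def hidden_layer(x1, x2, bias=-2):
    h = []
    for w1, w2 in patterns:           # weights of h1, h2, h3, h4
        s = w1 * x1 + w2 * x2 + bias  # weighted sum plus bias
        h.append(1 if s >= 0 else 0)  # the perceptron fires iff s >= 0
    return h

for x1, x2 in patterns:
    print((x1, x2), "->", hidden_layer(x1, x2))
# each input pattern activates exactly one hidden perceptron:
# (-1, -1) -> [1, 0, 0, 0], (-1, +1) -> [0, 1, 0, 0], and so on
```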


Let us see why this network works by taking an example of the XOR function


Let \(w_0\) be the bias of the output neuron (i.e., it will fire if \(\sum_{i=1}^{4} w_ih_i \geq w_0\))

The following table shows, for each input, the outputs of the hidden layer and the weighted sum reaching the output neuron:

\(x_1\) | \(x_2\) | XOR | \(h_1\) | \(h_2\) | \(h_3\) | \(h_4\) | \(\sum_{i=1}^{4} w_ih_i\)
0 | 0 | 0 | 1 | 0 | 0 | 0 | \(w_1\)
0 | 1 | 1 | 0 | 1 | 0 | 0 | \(w_2\)
1 | 0 | 1 | 0 | 0 | 1 | 0 | \(w_3\)
1 | 1 | 0 | 0 | 0 | 0 | 1 | \(w_4\)

This results in the following four conditions to implement XOR: \(w_1<w_0\), \(w_2 \geq w_0\), \(w_3 \geq w_0\), \(w_4 < w_0\)

Unlike before, there are no contradictions now and the system of inequalities can be satisfied

Essentially each \(w_i\) is now responsible for one of the 4 possible inputs and can be adjusted to get the desired output for that input

It should be clear that the same network can be used to represent the remaining 15 boolean functions also


Each boolean function will result in a different set of non-contradicting inequalities which can be satisfied by appropriately setting \(w_1,w_2,w_3,w_4\)

Try it!
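Here is a self-contained sketch of the full two-layer network (again assuming True = +1, False = -1, and taking the output threshold \(w_0 = 0\)). With \(w = (-1, +1, +1, -1)\) the four XOR conditions above are satisfied; changing only \(w_1,\dots,w_4\) yields other boolean functions such as AND. Function and variable names are illustrative, not from the lecture.

```python
# Each hidden perceptron (bias -2) fires for exactly one input pattern; the
# layer-2 weight w_i is chosen to be >= w_0 exactly when the target is true.
patterns = [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]

def network(x1, x2, w, w0=0.0):
    # hidden layer: h_i = 1 only when (x1, x2) matches the i-th pattern
    h = [1 if p1 * x1 + p2 * x2 - 2 >= 0 else 0 for p1, p2 in patterns]
    total = sum(wi * hi for wi, hi in zip(w, h))  # equals w_i for that input
    return 1 if total >= w0 else 0

w_xor = [-1, +1, +1, -1]  # w_1 < w_0, w_2 >= w_0, w_3 >= w_0, w_4 < w_0
print([network(x1, x2, w_xor) for x1, x2 in patterns])  # [0, 1, 1, 0]

# changing only w_1 ... w_4 gives any of the other 15 functions, e.g. AND,
# which is true only for (+1, +1):
w_and = [-1, -1, -1, +1]
print([network(x1, x2, w_and) for x1, x2 in patterns])  # [0, 0, 0, 1]
```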


What if we have more than 2 inputs ?

Again each of the 8 perceptrons will fire only for one of the 8 inputs

Each of the 8 weights in the second layer is responsible for one of the 8 inputs and can be adjusted to produce the desired output for that input

[Figure: a network with 3 inputs \(x_1, x_2, x_3\), 8 perceptrons in the hidden layer (each with \(bias=-3\)), and an output perceptron \(y\) connected by layer-2 weights \(w_1, \dots, w_8\)]

What if we have more than 3 inputs ?

Theorem

Any boolean function of \(n\) inputs can be represented exactly by a network of perceptrons containing 1 hidden layer with \(2^n\) perceptrons and one output layer containing 1 perceptron

Proof (informal): We just saw how to construct such a network

Note: A network of \(2^n + 1\) perceptrons is not necessary but sufficient. For example, we already saw how to represent the AND function with just 1 perceptron

Catch: As \(n\) increases, the number of perceptrons in the hidden layer obviously increases exponentially
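The construction in the proof can be written down directly. The sketch below is an illustration under the same conventions (True = +1, False = -1, hidden bias \(-n\), output threshold 0); the function and variable names are my own, not from the lecture. It builds such a network from any truth table and checks it on 3-input parity, the \(n=3\) analogue of XOR.

```python
import itertools

def build_network(truth_table, n):
    """truth_table maps each n-tuple of +/-1 values to 0 or 1."""
    patterns = list(itertools.product([-1, +1], repeat=n))
    # layer 1: the weights of hidden perceptron i are its pattern, bias is -n,
    # so it fires only when the input matches that pattern exactly
    # layer 2: w_i = +1 if the function is true on pattern i, else -1
    w2 = [+1 if truth_table[p] else -1 for p in patterns]

    def forward(x):
        h = [1 if sum(wi * xi for wi, xi in zip(p, x)) - n >= 0 else 0
             for p in patterns]
        return 1 if sum(wi * hi for wi, hi in zip(w2, h)) >= 0 else 0

    return forward

# example: 3-input parity, true when an odd number of inputs are +1
parity = {p: int(p.count(+1) % 2 == 1)
          for p in itertools.product([-1, +1], repeat=3)}
net = build_network(parity, 3)
assert all(net(p) == parity[p] for p in parity)  # exact representation
```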

Again, why do we care about boolean functions ?

How does this help us with our original problem, which was to predict whether we like a movie or not ? Let us see!

We are given this data about our past movie experience

For each movie, we are given the values of the various factors (\(x_1,x_2,...,x_n\)) that we base our decision on, and we are also given the value of \(y\) (like/dislike)

\(p_i\)'s are the points for which the output was 1 and \(n_j\)'s are the points for which it was 0

The data may or may not be linearly separable

The proof that we just saw tells us that it is possible to have a network of perceptrons and learn the weights in this network such that for any given \(p_i\) or \(n_j\) the output of the network will be the same as \(y_i\) or \(y_j\) (i.e., we can separate the positive and the negative points)

$$\begin{matrix} p_1\\ p_2\\ \vdots\\ n_1\\ n_2\\ \vdots \end{matrix}
\begin{bmatrix} x_{11} & x_{12} & \dots & x_{1n} & y_1=1\\ x_{21} & x_{22} & \dots & x_{2n} & y_2=1\\ \vdots & \vdots & & \vdots & \vdots\\ x_{k1} & x_{k2} & \dots & x_{kn} & y_k=0\\ x_{j1} & x_{j2} & \dots & x_{jn} & y_j=0\\ \vdots & \vdots & & \vdots & \vdots \end{bmatrix}$$


The story so far...

Networks of the form that we just saw (containing an input layer, an output layer and one or more hidden layers) are called Multilayer Perceptrons (MLPs, in short)

More appropriate terminology would be "Multilayered Network of Perceptrons" but MLP is the more commonly used name

The theorem that we just saw gives us the representation power of an MLP with a single hidden layer

Specifically, it tells us that an MLP with a single hidden layer can represent any boolean function
