Lecture 6: Neural Networks

 

Shen Shen

Oct 4, 2024

Intro to Machine Learning

Outline

  • Recap, the leap from simple linear models
  • (Feedforward) Neural Networks Structure
    • Design choices
  • Forward pass
  • Backward pass
    • Back-propagation
Recap: leveraging nonlinear transformations

Transform via
\[
\phi\left(\left[x_1 ; x_2\right]\right)=\left[1 ;\left|x_1-x_2\right|\right]
\]
Importantly, the model is linear in \(\phi\), non-linear in \(x\).
​Pointed out key ideas (enabling neural networks):

  • Nonlinear feature transformation
  • "Composing" simple transformations
    (together, these give expressiveness)
  • Backpropagation
    (this gives efficient training)

\[
\sigma_1 = \sigma(5 x_1 - 5 x_2 + 1), \qquad \sigma_2 = \sigma(-5 x_1 + 5 x_2 + 1)
\]

Two epiphanies:

  • nonlinear transformation empowers linear tools
  • "composing" simple nonlinearities amplifies this effect

(the \(\sigma_1, \sigma_2\) above are then combined via some appropriate weighted sum)

Outline

  • Recap, the leap from simple linear models
  • (Feedforward) Neural Networks Structure
    • Design choices
  • Forward pass
  • Backward pass
    • Back-propagation

 

👋 heads-up, in this section, for simplicity:

all neural network diagrams focus on a single data point

A neuron:

  • \(x\): \(d\)-dimensional input
  • \(w\): weights (i.e. parameters); these are what the algorithm learns
  • \(z\): pre-activation output, \(z = w^T x\) (a scalar)
  • \(f\): activation function; this is what we engineers choose
  • \(a\): post-activation output, \(a = f(z) = f(w^T x)\) (a scalar)
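As a concrete sketch (my own, not from the slides), here is a single neuron in numpy; the input, weights, and activation below are made-up example values:

```python
import numpy as np

def neuron(x, w, f):
    """A single neuron: pre-activation z = w^T x, post-activation a = f(z)."""
    z = w @ x           # scalar pre-activation
    return f(z)         # scalar post-activation a

# hypothetical 3-dimensional input and weights
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.3, 0.8])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(neuron(x, w, sigmoid))   # post-activation output a
```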

Choose activation \(f(z)=z\): e.g., a linear regressor represented as a computation graph, with learnable parameters (weights) \(w\):
\[
g = f(z) = z = w^T x
\]

Choose activation \(f(z)=\sigma(z)\): e.g., a linear logistic classifier represented as a computation graph, with learnable parameters (weights) \(w\):
\[
g = f(z) = \sigma(z) = \sigma(w^T x)
\]
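A minimal sketch (my own, with made-up weights) of how these two activation choices turn the same computation graph into a regressor or a classifier:

```python
import numpy as np

def linear_regressor(x, w):
    """Identity activation: g = f(z) = z = w^T x."""
    return w @ x

def logistic_classifier(x, w):
    """Sigmoid activation: g = f(z) = sigma(w^T x)."""
    return 1.0 / (1.0 + np.exp(-(w @ x)))

x = np.array([0.5, -1.0])
w = np.array([2.0, 1.0])
print(linear_regressor(x, w), logistic_classifier(x, w))
```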

A layer:

A layer is a set of \(m\) neurons sharing the same input; each neuron \(j\) computes a linear combination \(z^j\) (with its own learnable weights) followed by an activation \(a^j = f(z^j)\).

  • (# of neurons) = (layer's output dimension).
  • Typically, all neurons in one layer use the same activation \(f\) (if not, uglier algebra).
  • Typically fully connected: all \(x_i\) are connected to all \(z^j\), meaning each \(x_i\) eventually influences every \(a^j\).
  • Typically no "cross-wiring", meaning e.g. \(z^1\) won't affect \(a^2\). (The final layer may be an exception if softmax is used.)
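As a vectorized sketch (assuming the convention \(z = W^T x\), with \(W\) of shape \(d \times m\), consistent with the scalar-neuron case above), a whole layer is just a matrix multiply followed by an element-wise activation:

```python
import numpy as np

def layer(x, W, f):
    """A fully-connected layer of m neurons.

    W has shape (d, m); column j holds the weights of neuron j,
    so z = W^T x stacks all m pre-activations and a = f(z) is element-wise."""
    z = W.T @ x
    return f(z)

# hypothetical dimensions: d = 4 inputs, m = 3 neurons, tanh activation
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(4, 3))
print(layer(x, W, np.tanh))
```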

A (fully-connected, feed-forward) neural network:

The input \(x_1, \dots, x_d\) feeds into a first layer of neurons, whose outputs feed into the next layer, and so on, through the hidden layers to the output layer. Every neuron computes a weighted sum of its inputs (with its own learnable weights) followed by an activation \(f(\cdot)\).

We choose:

  • activation \(f\) in each layer
  • # of layers
  • # of neurons in each layer
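A sketch of these design choices in code (my example: the layer widths, activations, and random weights below are all made up), composing the layers one after another:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
identity = lambda z: z

def mlp_forward(x, weights, activations):
    """Feed-forward network: each layer is a linear combination then an activation."""
    a = x
    for W, f in zip(weights, activations):
        a = f(W.T @ a)
    return a

# hypothetical design: input d=4, hidden widths 5 and 3 (ReLU), one output unit (identity)
rng = np.random.default_rng(1)
weights = [rng.normal(size=(4, 5)), rng.normal(size=(5, 3)), rng.normal(size=(3, 1))]
print(mlp_forward(rng.normal(size=4), weights, [relu, relu, identity]))
```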

Outline

  • Recap, the leap from simple linear models
  • (Feedforward) Neural Networks Structure
    • Design choices
  • Forward pass
  • Backward pass
    • Back-propagation
Recall this example:
\[
\sigma_1 = \sigma(5 x_1 - 5 x_2 + 1), \qquad \sigma_2 = \sigma(-5 x_1 + 5 x_2 + 1),
\]
followed by some appropriate weighted sum of \(\sigma_1\) and \(\sigma_2\).

It can be represented as a small network: the inputs \(x_1, x_2\) (plus a constant 1) feed into two hidden neurons with \(f(\cdot) = \sigma(\cdot)\), whose outputs feed into one output neuron with \(f(\cdot)\) the identity function.
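A tiny numerical sketch of this network (the output weights \(v_1, v_2\) below are placeholders for the "appropriate weighted sum", not values given in the slides):

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

def recap_net(x1, x2, v1=1.0, v2=1.0):
    """Two sigmoid hidden units (with the fixed weights from the recap example),
    followed by an identity output unit with placeholder weights v1, v2."""
    s1 = sigma(5 * x1 - 5 * x2 + 1)
    s2 = sigma(-5 * x1 + 5 * x2 + 1)
    return v1 * s1 + v2 * s2

print(recap_net(0.0, 1.0), recap_net(1.0, 1.0))
```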

Activation function \(f\) choices

\(\sigma\) used to be the most popular:

  • interpretation as the firing rate of a neuron
  • elegant gradient \(\sigma^{\prime}(z)=\sigma(z) \cdot(1-\sigma(z))\)

Nowadays, ReLU is the default choice in hidden layers:
\[
\operatorname{ReLU}(z)=\left\{\begin{array}{ll} 0 & \text {if } z<0 \\ z & \text {otherwise} \end{array}\right. = \max (0, z)
\]

  • very simple function form, and so is its gradient:
\[
\frac{\partial \operatorname{ReLU}(z)}{\partial z}=\left\{\begin{array}{ll} 0 & \text {if } z<0 \\ 1 & \text {otherwise} \end{array}\right.
\]
  • drawback: if strongly in the negative region, a single ReLU can be "dead" (no gradient).
  • luckily, we typically have lots of units, so not all of them are dead.

Compositions of ReLUs can be quite expressive; in fact, asymptotically, they can approximate any function!
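A minimal numpy sketch of ReLU and the (sub)gradient used in practice (taking the derivative to be 1 at \(z = 0\), matching the definition above):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 0 where z < 0, 1 otherwise (including at z = 0)
    return (z >= 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), relu_grad(z))
```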

(image credit: Phillip Isola)

Or they can give arbitrary decision boundaries!

(image credit: Tamara Broderick)
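For a tiny flavor of this expressiveness (my own example, not the figures from the slides): summing just two ReLUs already builds a non-trivial shape, e.g. \(\operatorname{ReLU}(x)+\operatorname{ReLU}(-x)=|x|\).

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
x = np.linspace(-2.0, 2.0, 9)
print(np.allclose(relu(x) + relu(-x), np.abs(x)))   # True: a sum of two ReLUs equals |x|
```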

output layer design choices

  • # neurons, activation, and loss depend on the high-level goal. 
  • typically straightforward.
  • Multi-class setup: if we predict one and only one class out of \(K\) possibilities, then the last layer has \(K\) neurons, softmax activation, and cross-entropy loss.
  • For other multi-class settings, see the discussion in lab.

e.g., say \(K=5\) classes: input \(x\), then hidden layer(s), then an output layer of 5 neurons.
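A sketch of this last-layer recipe (softmax over \(K=5\) pre-activations, then cross-entropy against the true class; the numbers are made up):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, true_class):
    """Negative log-likelihood of the true class."""
    return -np.log(probs[true_class])

z = np.array([2.0, -1.0, 0.5, 0.0, 1.0])   # hypothetical K = 5 pre-activations
p = softmax(z)
print(p, cross_entropy(p, true_class=0))
```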

  • Width: # of neurons in each layer
  • Depth: # of layers
  • The network is more expressive if we increase either the width or the depth.
  • The usual pitfall of overfitting applies (though in NN-land, this is also an active research topic).


Outline

  • Recap, the leap from simple linear models
  • (Feedforward) Neural Networks Structure
    • Design choices
  • Forward pass
  • Backward pass
    • Back-propagation
e.g. forward-pass of a linear regressor:

  • Compute the model output \(g = f(z) = z = w^T x^{(i)}\) (identity activation)
  • Evaluate the loss \(\mathcal{L}(g, y^{(i)}) = (g-y^{(i)})^2\)
  • Repeat for each data point, and average the \(n\) individual losses
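A one-data-point sketch of this forward pass (squared loss; the values below are hypothetical):

```python
import numpy as np

def forward_linear_regressor(x, y, w):
    g = w @ x                # identity activation: g = z = w^T x
    loss = (g - y) ** 2      # squared loss on this single data point
    return g, loss

print(forward_linear_regressor(np.array([1.0, 2.0]), 0.5, np.array([0.3, -0.1])))
```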
e.g. forward-pass of a linear logistic classifier:

  • Compute the model output \(g = \sigma(z) = \sigma(w^T x^{(i)})\)
  • Evaluate the loss \(\mathcal{L}(g, y^{(i)}) = - [y^{(i)} \log g+(1-y^{(i)}) \log (1-g)]\)
  • Repeat for each data point, and average the \(n\) individual losses
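And the analogous sketch for the logistic classifier (negative log-likelihood loss, with \(y \in \{0, 1\}\); values are again hypothetical):

```python
import numpy as np

def forward_logistic_classifier(x, y, w):
    g = 1.0 / (1.0 + np.exp(-(w @ x)))                   # g = sigma(w^T x)
    loss = -(y * np.log(g) + (1 - y) * np.log(1 - g))    # negative log-likelihood
    return g, loss

print(forward_logistic_classifier(np.array([1.0, 2.0]), 1.0, np.array([0.3, -0.1])))
```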
In general, for a network with layers \(f^1, f^2, \dots, f^L\) (each a linear combination followed by a nonlinear activation) and weights \(W^1, W^2, \dots, W^L\):

Forward pass: evaluate, given the current parameters,

  • the model output \(g^{(i)} = f^L\left(\dots f^2\left(f^1\left(x^{(i)} ; W^1\right) ; W^2\right) \dots ; W^L\right)\)
  • the loss incurred on the current data \(\mathcal{L}(g^{(i)}, y^{(i)})\)
  • the training error \(J = \frac{1}{n} \sum_{i=1}^{n}\mathcal{L}(g^{(i)}, y^{(i)})\)
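Putting the pieces together, a sketch of the forward pass over a whole dataset (the loss function, activations, and shapes are placeholders chosen for illustration):

```python
import numpy as np

def forward_pass(X, Y, weights, activations, loss):
    """Return the training error J = (1/n) * sum of per-point losses."""
    losses = []
    for x, y in zip(X, Y):
        a = x
        for W, f in zip(weights, activations):   # g = f^L(... f^2(f^1(x; W^1); W^2) ...; W^L)
            a = f(W.T @ a)
        losses.append(loss(a, y))
    return np.mean(losses)

squared_loss = lambda g, y: float((g - y) ** 2)
```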

Outline

  • Recap, the leap from simple linear models
  • (Feedforward) Neural Networks Structure
    • Design choices
  • Forward pass
  • Backward pass
    • Back-propagation
Backward pass: run SGD to update the parameters, e.g. to update \(W^2\):

  • Randomly pick a data point \((x^{(i)}, y^{(i)})\)
  • Evaluate the gradient \(\nabla_{W^2} \mathcal{L}(g^{(i)},y^{(i)})\)
  • Update the weights \(W^2 \leftarrow W^2 - \eta \nabla_{W^2} \mathcal{L}(g^{(i)},y^{(i)})\)

Backward pass: similarly, to update \(W^1\):

  • Evaluate the gradient \(\nabla_{W^1} \mathcal{L}(g^{(i)},y^{(i)})\)
  • Update the weights \(W^1 \leftarrow W^1 - \eta \nabla_{W^1} \mathcal{L}(g^{(i)},y^{(i)})\)

How do we get these gradients, though?
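A sketch of this SGD loop (here `grad_fn` is a hypothetical stand-in for whatever computes the per-layer gradients, e.g. back-propagation, discussed next):

```python
import numpy as np

def sgd(params, X, Y, grad_fn, eta=0.01, steps=1000, seed=0):
    """grad_fn(params, x, y) is assumed to return [dL/dW^1, ..., dL/dW^L]."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(len(X))                   # randomly pick a data point
        grads = grad_fn(params, X[i], Y[i])        # evaluate the gradients
        params = [W - eta * dW                     # update every layer's weights
                  for W, dW in zip(params, grads)]
    return params
```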

Outline

  • Recap, the leap from simple linear models
  • (Feedforward) Neural Networks Structure
    • Design choices
  • Forward pass
  • Backward pass
    • Back-propagation
e.g. backward-pass of a linear regressor:

  • Randomly pick a data point \((x^{(i)}, y^{(i)})\)
  • Evaluate the gradient \(\nabla_{w} \mathcal{L}(g^{(i)},y^{(i)})\) (a vector of the same shape as \(w\))
  • Update the weights \(w \leftarrow w - \eta \nabla_w \mathcal{L}(g^{(i)},y^{(i)})\)
e.g. backward-pass of a linear regressor, with \(x \in \mathbb{R}^d\), \(w \in \mathbb{R}^d\), \(y \in \mathbb{R}\), and \(g = w^T x\):
\[
\nabla_{w} \mathcal{L}(g,y)
= \frac{\partial \mathcal{L}(g,y)}{\partial w}
= \frac{\partial\left[(g - y)^2\right]}{\partial w}
= \frac{\partial\left[(w^T x - y)^2\right]}{\partial w}
= \underbrace{x}_{\partial g / \partial w} \cdot \underbrace{2(g - y)}_{\partial \mathcal{L} / \partial g}
\]
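A quick sanity-check sketch of this formula against a numerical (finite-difference) gradient, on made-up values:

```python
import numpy as np

def grad_w(x, y, w):
    """dL/dw for L = (w^T x - y)^2, as derived: x * 2 * (g - y)."""
    g = w @ x
    return x * 2.0 * (g - y)

x, y, w = np.array([1.0, -2.0, 0.5]), 0.3, np.array([0.2, 0.1, -0.4])
eps = 1e-6
loss = lambda w: (w @ x - y) ** 2
numerical = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                      for e in np.eye(3)])
print(np.allclose(grad_w(x, y, w), numerical))   # True
```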
e.g. backward-pass of a non-linear regressor, with \(x \in \mathbb{R}^d\), \(w \in \mathbb{R}^d\), \(y \in \mathbb{R}\), \(z = w^T x\), and \(g = \operatorname{ReLU}(z)\):
\[
\nabla_{w} \mathcal{L}(g,y)
= \frac{\partial \mathcal{L}(g,y)}{\partial w}
= \frac{\partial\left[(g - y)^2\right]}{\partial w}
= \underbrace{x}_{\partial z / \partial w} \cdot \underbrace{\frac{\partial[\operatorname{ReLU}(z)]}{\partial z}}_{\partial g / \partial z} \cdot \underbrace{2(g - y)}_{\partial \mathcal{L} / \partial g}
\]
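The same chain rule in code form (a sketch, using the convention that the ReLU derivative is 1 at \(z = 0\)):

```python
import numpy as np

def grad_w_relu_regressor(x, y, w):
    """dL/dw for g = ReLU(w^T x), L = (g - y)^2: x * dReLU(z)/dz * 2 * (g - y)."""
    z = w @ x
    g = max(0.0, z)
    dg_dz = 1.0 if z >= 0 else 0.0
    return x * dg_dz * 2.0 * (g - y)

print(grad_w_relu_regressor(np.array([1.0, -2.0]), 0.3, np.array([0.3, 0.1])))
```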
Now, back propagation: reuse of computation.

Write the network as \(x \rightarrow f^1(\cdot\,; W^1) \rightarrow f^2(\cdot\,; W^2) \rightarrow \dots \rightarrow f^L(\cdot\,; W^L) \rightarrow g \rightarrow \mathcal{L}(g,y)\), with pre-activations \(Z^1, Z^2, \dots, Z^L\) and post-activations \(A^1, A^2, \dots\) By the chain rule,
\[
\frac{\partial \mathcal{L}(g,y)}{\partial W^2}
= \frac{\partial Z^2}{\partial W^{2}}
\underbrace{\frac{\partial A^2}{\partial Z^{2}}
\frac{\partial Z^3}{\partial A^{2}} \frac{\partial A^3}{\partial Z^{3}} \cdots \frac{\partial Z^L}{\partial A^{L-1}}
\frac{\partial g}{\partial Z^{L}}
\frac{\partial \mathcal{L}(g,y)}{\partial g}}_{\frac{\partial \mathcal{L}(g,y)}{\partial Z^2}}
\]

How to find \(\frac{\partial \mathcal{L}(g,y)}{\partial W^1}\)?
By the chain rule again,
\[
\frac{\partial \mathcal{L}(g,y)}{\partial W^1}
= \frac{\partial Z^1}{\partial W^{1}}
\frac{\partial A^1}{\partial Z^{1}}
\frac{\partial Z^2}{\partial A^{1}}
\underbrace{\frac{\partial A^2}{\partial Z^{2}}
\frac{\partial Z^3}{\partial A^{2}} \frac{\partial A^3}{\partial Z^{3}} \cdots \frac{\partial Z^L}{\partial A^{L-1}}
\frac{\partial g}{\partial Z^{L}}
\frac{\partial \mathcal{L}(g,y)}{\partial g}}_{\frac{\partial \mathcal{L}(g,y)}{\partial Z^2}}
\]

Back propagation: reuse of computation. The braced factor \(\frac{\partial \mathcal{L}(g,y)}{\partial Z^2}\) is exactly what we already computed for \(\frac{\partial \mathcal{L}(g,y)}{\partial W^2}\), so it can be reused rather than recomputed; working backward layer by layer, each \(\frac{\partial \mathcal{L}(g,y)}{\partial Z^l}\) is obtained cheaply from \(\frac{\partial \mathcal{L}(g,y)}{\partial Z^{l+1}}\).
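A compact sketch of this reuse in code (my own illustration, assuming every layer uses the same activation \(f\) with derivative \(f'\), squared loss, and the convention \(Z^l = (W^l)^T A^{l-1}\)):

```python
import numpy as np

def backprop(x, y, weights, f, df):
    """Return [dL/dW^1, ..., dL/dW^L] for L = sum((g - y)^2)."""
    # forward pass, caching pre-activations Z^l and post-activations A^l
    As, Zs = [x], []
    for W in weights:
        Zs.append(W.T @ As[-1])
        As.append(f(Zs[-1]))
    g = As[-1]

    # backward pass: start from dL/dZ^L and reuse it layer by layer
    dL_dZ = df(Zs[-1]) * 2.0 * (g - y)                    # dL/dZ^L
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(As[l], dL_dZ)                 # dL/dW^l = A^{l-1} (dL/dZ^l)^T
        if l > 0:
            dL_dZ = df(Zs[l - 1]) * (weights[l] @ dL_dZ)  # dL/dZ^{l-1}, reusing dL/dZ^l
    return grads

relu = lambda z: np.maximum(0.0, z)
drelu = lambda z: (z >= 0).astype(float)

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 1))]
print([G.shape for G in backprop(rng.normal(size=3), 0.5, Ws, relu, drelu)])
```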

Summary

  • We saw that introducing non-linear transformations of the inputs can substantially increase the power of linear tools. But it's difficult and tedious to select a good transformation by hand.
  • Multi-layer neural networks are a way to automatically find good transformations for us!
  • Standard NNs have layers that alternate between parametrized linear transformations and fixed non-linear transforms (but many other designs are possible).
  • Typical non-linearities include sigmoid, tanh, and ReLU; nowadays ReLU is used most of the time.
  • Typical output transformations for classification are, as we've seen, sigmoid or softmax.
  • There's a systematic way to compute gradients via back-propagation, in order to update the parameters.

Thanks!

We'd love to hear your thoughts.