
Lecture 6: Neural Networks
Shen Shen
Oct 4, 2024
Intro to Machine Learning

Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation



Recap:
leveraging nonlinear transformations
- transform via a feature map φ(x)
- importantly, the model is linear in φ, non-linear in x
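For instance, a minimal numpy sketch (the particular polynomial feature map below is just an illustrative assumption, not the one from the recap slide):

import numpy as np

# feature map phi(x): the model theta . phi(x) is linear in phi, but non-linear in x
phi = lambda x: np.array([1.0, x, x**2])

theta = np.array([0.5, -1.0, 2.0])    # parameters of the *linear* model in phi-space
predict = lambda x: theta @ phi(x)    # a non-linear function of the original input x
print(predict(3.0))                   # 0.5 - 3.0 + 18.0 = 15.5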




Pointed out key ideas (enabling neural networks):
- Nonlinear feature transformation (expressiveness)
- "Composing" simple transformations (expressiveness)
- Backpropagation (efficient training)


Two epiphanies:
- nonlinear transformation empowers linear tools
- "composing" simple nonlinearities amplifies such effect

Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation
Heads-up: in this section, for simplicity, all neural network diagrams focus on a single data point.
A neuron:
- x: d-dimensional input
- w: weights (i.e. parameters); what the algorithm learns
- f: activation function; what we engineers choose
- z: pre-activation output (a scalar)
- a: post-activation output (a scalar)
Choosing the activation f(z)=z, a single neuron is exactly a linear regressor represented as a computation graph (the learnable parameters are the weights).
Choosing the activation f(z)=σ(z), a single neuron is exactly a linear logistic classifier represented as a computation graph (again, the learnable parameters are the weights).
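As a minimal sketch of these two computation graphs (not from the slides; numpy, with an explicit offset w0 added for illustration):

import numpy as np

def neuron(x, w, w0, f):
    # one neuron: pre-activation z = w . x + w0 (scalar), post-activation a = f(z) (scalar)
    z = np.dot(w, x) + w0
    return f(z)

identity = lambda z: z                      # f(z) = z        -> linear regressor
sigmoid  = lambda z: 1 / (1 + np.exp(-z))   # f(z) = sigma(z) -> linear logistic classifier

x  = np.array([1.0, 2.0])     # d-dimensional input (d = 2 here)
w  = np.array([0.5, -0.3])    # weights: what the algorithm learns
w0 = 0.1                      # offset
print(neuron(x, w, w0, identity))   # regression output
print(neuron(x, w, w0, sigmoid))    # probability of the positive class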
A layer (with learnable weights):
- (# of neurons) = (layer's output dimension).
- typically, all neurons in one layer use the same activation f (if not, uglier algebra).
- typically fully connected: all x_i are connected to all z_j, meaning each x_i eventually influences every a_j.
- typically, no "cross-wiring", meaning e.g. z_1 won't affect a_2 (the final layer may be an exception if softmax is used).
(Diagram: a layer computes a linear combination of its inputs, then applies the activations.)
A (fully-connected, feed-forward) neural network:
(Diagram: the input feeds into one or more hidden layers of neurons, each with learnable weights, followed by an output layer.)
We choose:
- the activation f in each layer
- the # of layers
- the # of neurons in each layer
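A hedged sketch of the layer and network just described (the shape convention, one column of W per neuron, and the specific widths and activations are assumptions for illustration):

import numpy as np

def layer(x, W, W0, f):
    # fully-connected layer: z = W^T x + W0 has one entry per neuron; f is applied elementwise
    return f(W.T @ x + W0)

relu    = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

d, m = 2, 3                                        # input dimension, hidden-layer width
rng = np.random.default_rng(0)
W1, W1_0 = rng.normal(size=(d, m)), np.zeros(m)    # hidden layer's learnable weights
W2, W2_0 = rng.normal(size=(m, 1)), np.zeros(1)    # output layer's learnable weights

x  = np.array([1.0, -2.0])
a1 = layer(x, W1, W1_0, relu)       # hidden activations, shape (3,)
g  = layer(a1, W2, W2_0, sigmoid)   # network output, shape (1,)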
Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation



Recall this example: it can be represented as a small network whose hidden units use f(·)=σ(·) and whose output unit uses f(·) = the identity function, combined via some appropriate weighted sum.
Activation function f choices

σ used to be the most popular:
- interpretable as the firing rate of a neuron
- elegant gradient: σ′(z) = σ(z)·(1 − σ(z))

Nowadays ReLU, ReLU(z) = max(0, z), is the default choice in hidden layers:
- very simple function form, and so is its gradient
- drawback: if pushed strongly into the negative region, a single ReLU can be "dead" (no gradient)
- luckily, we typically have lots of units, so not every one is dead
Compositions of ReLU(s) can be quite expressive (image credit: Phillip Isola);
in fact, asymptotically, they can approximate any function (image credit: Tamara Broderick),
or give arbitrary decision boundaries! (image credit: Tamara Broderick)
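In code, a minimal sketch of the two activations and their gradients (numpy; nothing here beyond the formulas above):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)                # the elegant form sigma'(z) = sigma(z)(1 - sigma(z))

def relu(z):
    return np.maximum(0, z)           # very simple function form...

def relu_grad(z):
    return (z > 0).astype(float)      # ...and so is its gradient: 0 wherever the unit is "dead"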
output layer design choices
- # neurons, activation, and loss depend on the high-level goal.
- typically straightforward.
- Multi-class setup: if we predict one and only one class out of K possibilities, then the last layer has K neurons, softmax activation, and cross-entropy loss (see the sketch below).
- other multi-class settings, see discussion in lab.

(Diagram: e.g., say K=5 classes; input x → hidden layer(s) → output layer with 5 neurons.)
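A minimal sketch of that last layer for K=5 (numpy; the particular numbers and the one-hot label are illustrative assumptions):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))         # subtract max for numerical stability
    return e / e.sum()                # K probabilities summing to 1

def cross_entropy(g, y):
    return -np.sum(y * np.log(g))     # y is one-hot, g holds the predicted probabilities

z = np.array([2.0, 0.5, -1.0, 0.0, 1.0])    # K = 5 pre-activations from the output layer
g = softmax(z)                              # K = 5 class probabilities
y = np.array([1, 0, 0, 0, 0])               # true class (one-hot)
print(g, cross_entropy(g, y))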

- Width: # of neurons in layers
- Depth: # of layers
- More expressive if increasing either the width or depth.
- The usual pitfall of overfitting (though in NN-land, it's also an active research topic).
Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation
e.g. forward-pass of a linear regressor:
- Evaluate the loss L = (g − y)^2
- Repeat for each data point; average the sum of the n individual losses
e.g. forward-pass of a linear logistic classifier:
- Evaluate the loss L = −[y log g + (1 − y) log(1 − g)]
- Repeat for each data point; average the sum of the n individual losses
Forward pass:
working layer by layer (linear combination, then nonlinear activation, ..., then the loss function), evaluate, given the current parameters,
- the model output g^(i)
- the loss incurred on the current data point, L(g^(i), y^(i))
- the training error J = (1/n) Σ_{i=1}^n L(g^(i), y^(i))
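A hedged forward-pass sketch for one possible architecture (one ReLU hidden layer, identity output, squared loss; these choices, and the omission of offsets, are assumptions for illustration):

import numpy as np

def forward(x, y, W1, W2):
    # forward pass for one data point: model output g and the loss it incurs
    z1 = W1.T @ x             # linear combination (hidden layer)
    a1 = np.maximum(0, z1)    # nonlinear activation (ReLU)
    g  = W2 @ a1              # output layer with identity activation
    L  = (g - y) ** 2         # squared loss on this data point
    return g, L

def training_error(X, Y, W1, W2):
    # J = (1/n) * sum of the n individual losses
    return np.mean([forward(x, y, W1, W2)[1] for x, y in zip(X, Y)])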
Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation
Backward pass:
Run SGD to update the parameters, e.g. to update W2:
- Randomly pick a data point (x^(i), y^(i))
- Evaluate the gradient ∇_{W2} L(g^(i), y^(i))
- Update the weights: W2 ← W2 − η ∇_{W2} L(g^(i), y^(i))

Likewise, to update W1:
- Evaluate the gradient ∇_{W1} L(g^(i), y^(i))
- Update the weights: W1 ← W1 − η ∇_{W1} L(g^(i), y^(i))

How do we get these gradients, though?
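Setting aside for a moment how the gradients are obtained, here is a sketch of that SGD loop (offsets omitted; loss_grads is a hypothetical helper, e.g. the back-propagation routine sketched in the next section, returning ∇_{W1} L and ∇_{W2} L for one data point):

import numpy as np

def sgd(X, Y, W1, W2, loss_grads, eta=0.01, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        i = rng.integers(len(X))                    # randomly pick a data point (x^(i), y^(i))
        gW1, gW2 = loss_grads(X[i], Y[i], W1, W2)   # evaluate the gradients of L(g^(i), y^(i))
        W1 = W1 - eta * gW1                         # W1 <- W1 - eta * grad_W1 L
        W2 = W2 - eta * gW2                         # W2 <- W2 - eta * grad_W2 L
    return W1, W2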
Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation
e.g. backward-pass of a linear regressor:
- Randomly pick a data point (x^(i), y^(i))
- Evaluate the gradient ∇_w L(g^(i), y^(i))
- Update the weights: w ← w − η ∇_w L(g^(i), y^(i))
e.g. backward-pass of a non-linear regressor: the same SGD recipe, except the gradient now has to pass through the nonlinear activations.
Now, back-propagation: reuse of computation.
How do we find each of these gradients (e.g. ∇_{W2} L, then ∇_{W1} L)? Apply the chain rule, starting from the loss and working backwards through the layers; the terms computed for a later layer's gradient get reused when computing the earlier layers' gradients.
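A minimal back-propagation sketch for one particular network (one sigmoid hidden layer, identity output, squared loss, no offsets; these are illustrative assumptions). Note how dL_dz2 and the forward-pass activations are computed once and then reused for both gradients:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_grads(x, y, W1, W2):
    # forward pass, keeping the intermediate values so they can be reused
    z1 = W1.T @ x                      # hidden pre-activations, shape (m,)
    a1 = sigmoid(z1)                   # hidden activations, shape (m,)
    g  = W2 @ a1                       # output (identity activation), scalar
    # backward pass: chain rule from the loss back towards the input
    dL_dz2  = 2 * (g - y)              # dL/dg, and g = z2, so also dL/dz2
    grad_W2 = dL_dz2 * a1              # reuses a1 from the forward pass
    dL_da1  = dL_dz2 * W2              # reuses dL_dz2 (the "reuse of computation")
    dL_dz1  = dL_da1 * a1 * (1 - a1)   # sigma'(z1) = sigma(z1) * (1 - sigma(z1))
    grad_W1 = np.outer(x, dL_dz1)      # shape (d, m), matching W1
    return grad_W1, grad_W2

This returns exactly the pair of gradients that the SGD sketch above expected from its loss_grads helper.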
Summary
- We saw that introducing non-linear transformations of the inputs can substantially increase the power of linear tools. But it's kind of difficult/tedious to select a good transformation by hand.
- Multi-layer neural networks are a way to automatically find good transformations for us!
- Standard NNs have layers that alternate between parametrized linear transformations and fixed non-linear transforms (but many other designs are possible.)
- Typical non-linearities include sigmoid, tanh, and ReLU; nowadays people mostly use ReLU.
- Typical output transformations for classification are as we've seen: sigmoid, or softmax.
- There's a systematic way to compute gradients via back-propagation, in order to update parameters.
Thanks!
We'd love to hear your thoughts.