
Lecture 6: Neural Networks
Shen Shen
Oct 4, 2024
Intro to Machine Learning

Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation



Recap:
leveraging nonlinear transformations
- transform via a feature map φ(x)
- importantly, the model is linear in φ, non-linear in x
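For instance, a minimal numpy sketch (the particular polynomial feature map below is just an illustrative assumption, not the one from the recap slide):

import numpy as np

# feature map phi(x): the model theta . phi(x) is linear in phi, but non-linear in x
phi = lambda x: np.array([1.0, x, x**2])

theta = np.array([0.5, -1.0, 2.0])    # parameters of the *linear* model in phi-space
predict = lambda x: theta @ phi(x)    # a non-linear function of the original input x
print(predict(3.0))                   # 0.5 - 3.0 + 18.0 = 15.5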




Pointed out key ideas (enabling neural networks):
- Nonlinear feature transformation (expressiveness)
- "Composing" simple transformations (expressiveness)
- Backpropagation (efficient training)


Two epiphanies:
- nonlinear transformation empowers linear tools
- "composing" simple nonlinearities amplifies such effect

Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation
Heads-up: in this section, for simplicity, all neural network diagrams focus on a single data point.
A neuron:
- x: d-dimensional input
- w: weights (i.e. parameters); what the algorithm learns
- f: activation function; what we engineers choose
- z: pre-activation output (a scalar)
- a: post-activation output (a scalar)
Choosing the activation f(z)=z, a single neuron is exactly a linear regressor represented as a computation graph (the learnable parameters are the weights).
Choosing the activation f(z)=σ(z), a single neuron is exactly a linear logistic classifier represented as a computation graph (again, the learnable parameters are the weights).
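As a minimal sketch of these two computation graphs (not from the slides; numpy, with an explicit offset w0 added for illustration):

import numpy as np

def neuron(x, w, w0, f):
    # one neuron: pre-activation z = w . x + w0 (scalar), post-activation a = f(z) (scalar)
    z = np.dot(w, x) + w0
    return f(z)

identity = lambda z: z                      # f(z) = z        -> linear regressor
sigmoid  = lambda z: 1 / (1 + np.exp(-z))   # f(z) = sigma(z) -> linear logistic classifier

x  = np.array([1.0, 2.0])     # d-dimensional input (d = 2 here)
w  = np.array([0.5, -0.3])    # weights: what the algorithm learns
w0 = 0.1                      # offset
print(neuron(x, w, w0, identity))   # regression output
print(neuron(x, w, w0, sigmoid))    # probability of the positive class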
A layer (with learnable weights):
- (# of neurons) = (layer's output dimension).
- typically, all neurons in one layer use the same activation f (if not, uglier algebra).
- typically fully connected: all x_i are connected to all z_j, meaning each x_i eventually influences every a_j.
- typically, no "cross-wiring", meaning e.g. z_1 won't affect a_2 (the final layer may be an exception if softmax is used).
(Diagram: a layer computes a linear combination of its inputs, then applies the activations.)
A (fully-connected, feed-forward) neural network:
(Diagram: the input feeds into one or more hidden layers of neurons, each with learnable weights, followed by an output layer.)
We choose:
- the activation f in each layer
- the # of layers
- the # of neurons in each layer
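A hedged sketch of the layer and network just described (the shape convention, one column of W per neuron, and the specific widths and activations are assumptions for illustration):

import numpy as np

def layer(x, W, W0, f):
    # fully-connected layer: z = W^T x + W0 has one entry per neuron; f is applied elementwise
    return f(W.T @ x + W0)

relu    = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

d, m = 2, 3                                        # input dimension, hidden-layer width
rng = np.random.default_rng(0)
W1, W1_0 = rng.normal(size=(d, m)), np.zeros(m)    # hidden layer's learnable weights
W2, W2_0 = rng.normal(size=(m, 1)), np.zeros(1)    # output layer's learnable weights

x  = np.array([1.0, -2.0])
a1 = layer(x, W1, W1_0, relu)       # hidden activations, shape (3,)
g  = layer(a1, W2, W2_0, sigmoid)   # network output, shape (1,)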
Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation



Recall this example: it can be represented as a small network whose hidden units use f(·)=σ(·) and whose output unit uses f(·) = the identity function, combined via some appropriate weighted sum.
Activation function f choices

σ used to be the most popular:
- interpretable as the firing rate of a neuron
- elegant gradient: σ′(z) = σ(z)·(1 − σ(z))

Nowadays ReLU, ReLU(z) = max(0, z), is the default choice in hidden layers:
- very simple function form, and so is its gradient
- drawback: if pushed strongly into the negative region, a single ReLU can be "dead" (no gradient)
- luckily, we typically have lots of units, so not every one is dead
Compositions of ReLU(s) can be quite expressive (image credit: Phillip Isola);
in fact, asymptotically, they can approximate any function (image credit: Tamara Broderick),
or give arbitrary decision boundaries! (image credit: Tamara Broderick)
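In code, a minimal sketch of the two activations and their gradients (numpy; nothing here beyond the formulas above):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)                # the elegant form sigma'(z) = sigma(z)(1 - sigma(z))

def relu(z):
    return np.maximum(0, z)           # very simple function form...

def relu_grad(z):
    return (z > 0).astype(float)      # ...and so is its gradient: 0 wherever the unit is "dead"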
output layer design choices
- # neurons, activation, and loss depend on the high-level goal.
- typically straightforward.
- Multi-class setup: if we predict one and only one class out of K possibilities, then the last layer has K neurons, softmax activation, and cross-entropy loss (see the sketch below).
- other multi-class settings, see discussion in lab.

(Diagram: e.g., say K=5 classes; input x → hidden layer(s) → output layer with 5 neurons.)
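A minimal sketch of that last layer for K=5 (numpy; the particular numbers and the one-hot label are illustrative assumptions):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))         # subtract max for numerical stability
    return e / e.sum()                # K probabilities summing to 1

def cross_entropy(g, y):
    return -np.sum(y * np.log(g))     # y is one-hot, g holds the predicted probabilities

z = np.array([2.0, 0.5, -1.0, 0.0, 1.0])    # K = 5 pre-activations from the output layer
g = softmax(z)                              # K = 5 class probabilities
y = np.array([1, 0, 0, 0, 0])               # true class (one-hot)
print(g, cross_entropy(g, y))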

- Width: # of neurons in layers
- Depth: # of layers
- More expressive if increasing either the width or depth.
- The usual pitfall of overfitting (though in NN-land, it's also an active research topic).
Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation
e.g. forward-pass of a linear regressor:
- Evaluate the loss L = (g − y)^2
- Repeat for each data point; average the sum of the n individual losses
e.g. forward-pass of a linear logistic classifier:
- Evaluate the loss L = −[y log g + (1 − y) log(1 − g)]
- Repeat for each data point; average the sum of the n individual losses
Forward pass:
working layer by layer (linear combination, then nonlinear activation, ..., then the loss function), evaluate, given the current parameters,
- the model output g^(i)
- the loss incurred on the current data point, L(g^(i), y^(i))
- the training error J = (1/n) Σ_{i=1}^n L(g^(i), y^(i))
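A hedged forward-pass sketch for one possible architecture (one ReLU hidden layer, identity output, squared loss; these choices, and the omission of offsets, are assumptions for illustration):

import numpy as np

def forward(x, y, W1, W2):
    # forward pass for one data point: model output g and the loss it incurs
    z1 = W1.T @ x             # linear combination (hidden layer)
    a1 = np.maximum(0, z1)    # nonlinear activation (ReLU)
    g  = W2 @ a1              # output layer with identity activation
    L  = (g - y) ** 2         # squared loss on this data point
    return g, L

def training_error(X, Y, W1, W2):
    # J = (1/n) * sum of the n individual losses
    return np.mean([forward(x, y, W1, W2)[1] for x, y in zip(X, Y)])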
Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation
Backward pass:
Run SGD to update the parameters, e.g. to update W2:
- Randomly pick a data point (x^(i), y^(i))
- Evaluate the gradient ∇_{W2} L(g^(i), y^(i))
- Update the weights: W2 ← W2 − η ∇_{W2} L(g^(i), y^(i))

Likewise, to update W1:
- Evaluate the gradient ∇_{W1} L(g^(i), y^(i))
- Update the weights: W1 ← W1 − η ∇_{W1} L(g^(i), y^(i))

How do we get these gradients, though?
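Setting aside for a moment how the gradients are obtained, here is a sketch of that SGD loop (offsets omitted; loss_grads is a hypothetical helper, e.g. the back-propagation routine sketched in the next section, returning ∇_{W1} L and ∇_{W2} L for one data point):

import numpy as np

def sgd(X, Y, W1, W2, loss_grads, eta=0.01, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        i = rng.integers(len(X))                    # randomly pick a data point (x^(i), y^(i))
        gW1, gW2 = loss_grads(X[i], Y[i], W1, W2)   # evaluate the gradients of L(g^(i), y^(i))
        W1 = W1 - eta * gW1                         # W1 <- W1 - eta * grad_W1 L
        W2 = W2 - eta * gW2                         # W2 <- W2 - eta * grad_W2 L
    return W1, W2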
Outline
- Recap, the leap from simple linear models
- (Feedforward) Neural Networks Structure
- Design choices
- Forward pass
- Backward pass
- Back-propagation
e.g. backward-pass of a linear regressor:
- Randomly pick a data point (x^(i), y^(i))
- Evaluate the gradient ∇_w L(g^(i), y^(i))
- Update the weights: w ← w − η ∇_w L(g^(i), y^(i))
e.g. backward-pass of a non-linear regressor: the same SGD recipe, except the gradient now has to pass through the nonlinear activations.
Now, back-propagation: reuse of computation.
How do we find each of these gradients (e.g. ∇_{W2} L, then ∇_{W1} L)? Apply the chain rule, starting from the loss and working backwards through the layers; the terms computed for a later layer's gradient get reused when computing the earlier layers' gradients.
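A minimal back-propagation sketch for one particular network (one sigmoid hidden layer, identity output, squared loss, no offsets; these are illustrative assumptions). Note how dL_dz2 and the forward-pass activations are computed once and then reused for both gradients:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_grads(x, y, W1, W2):
    # forward pass, keeping the intermediate values so they can be reused
    z1 = W1.T @ x                      # hidden pre-activations, shape (m,)
    a1 = sigmoid(z1)                   # hidden activations, shape (m,)
    g  = W2 @ a1                       # output (identity activation), scalar
    # backward pass: chain rule from the loss back towards the input
    dL_dz2  = 2 * (g - y)              # dL/dg, and g = z2, so also dL/dz2
    grad_W2 = dL_dz2 * a1              # reuses a1 from the forward pass
    dL_da1  = dL_dz2 * W2              # reuses dL_dz2 (the "reuse of computation")
    dL_dz1  = dL_da1 * a1 * (1 - a1)   # sigma'(z1) = sigma(z1) * (1 - sigma(z1))
    grad_W1 = np.outer(x, dL_dz1)      # shape (d, m), matching W1
    return grad_W1, grad_W2

This returns exactly the pair of gradients that the SGD sketch above expected from its loss_grads helper.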
Summary
- We saw that introducing non-linear transformations of the inputs can substantially increase the power of linear tools. But it's kind of difficult/tedious to select a good transformation by hand.
- Multi-layer neural networks are a way to automatically find good transformations for us!
- Standard NNs have layers that alternate between parametrized linear transformations and fixed non-linear transforms (but many other designs are possible.)
- Typical non-linearities include sigmoid, tanh, and ReLU; nowadays people mostly use ReLU.
- Typical output transformations for classification are as we've seen: sigmoid, or softmax.
- There's a systematic way to compute gradients via back-propagation, in order to update parameters.
Thanks!
We'd love to hear your thoughts.