Shen Shen
March 7, 2025
11am, Room 10-250
A neuron:
\(w\): what the algorithm learns
\(f\): what we engineers choose
\(z\): scalar
\(a\): scalar
Choose activation \(f(z)=z\)
learnable parameters (weights)
e.g. linear regressor represented as a computation graph
neuron
Choose activation \(f(z)=\sigma(z)\)
learnable parameters (weights)
e.g. linear logistic classifier represented as a computation graph
neuron
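A minimal sketch in NumPy (the helper names `neuron` and `sigmoid` are my own, for illustration): the neuron computes the scalar \(z = w^\top x + w_0\) and then the scalar \(a = f(z)\); choosing \(f(z)=z\) gives the linear regressor, and \(f(z)=\sigma(z)\) gives the linear logistic classifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, w0, f):
    """One neuron: scalar pre-activation z = w.x + w0, then scalar activation a = f(z)."""
    z = np.dot(w, x) + w0
    return f(z)

x  = np.array([1.0, 2.0])    # input
w  = np.array([0.5, -0.3])   # learnable weights
w0 = 0.1                     # learnable offset

a_regressor  = neuron(x, w, w0, lambda z: z)   # f(z) = z        -> linear regressor
a_classifier = neuron(x, w, w0, sigmoid)       # f(z) = sigma(z) -> linear logistic classifier
```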
A layer:
learnable weights
A layer:
layer
linear combo
activations
A (fully-connected, feed-forward) neural network
layer
input
neuron
learnable weights
We choose:
hidden
output
(a.k.a. multi-layer perceptrons, MLPs)
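A sketch of one layer and a small two-layer MLP, under one common convention \(z^l = (W^l)^\top a^{l-1} + W_0^l\) (the shapes, seed, and the helper name `layer` are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(a_prev, W, W0, f):
    """One fully-connected layer: linear combo z = W^T a_prev + W0, then activations f(z)."""
    z = W.T @ a_prev + W0
    return f(z)

rng = np.random.default_rng(0)
x = rng.normal(size=2)                           # input

W1, W01 = rng.normal(size=(2, 3)), np.zeros(3)   # hidden-layer learnable weights
W2, W02 = rng.normal(size=(3, 1)), np.zeros(1)   # output-layer learnable weights

a1 = layer(x,  W1, W01, sigmoid)                 # hidden layer, f = sigma
g  = layer(a1, W2, W02, lambda z: z)             # output layer, f = identity
```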
\(-3(\sigma_1 +\sigma_2)\)
recall this example
\(f =\sigma(\cdot)\)
\(f(\cdot) \) identity function
Recall
\(-3(\sigma_1 +\sigma_2)\)
e.g. forward-pass of a linear regressor
e.g. forward-pass of a linear logistic classifier
\(\dots\)
Forward pass: evaluate, given the current parameters
linear combination
loss function
(nonlinear) activation
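A sketch of the whole forward pass for a two-layer network (ReLU hidden layer, identity output, squared loss; these particular choices and the name `forward_pass` are mine, picked for concreteness):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward_pass(x, y, W1, W01, W2, W02):
    """Evaluate the network and the loss, given the current parameters."""
    z1 = W1.T @ x + W01                 # layer 1: linear combination
    a1 = relu(z1)                       # layer 1: (nonlinear) activation
    z2 = W2.T @ a1 + W02                # layer 2: linear combination
    g  = z2                             # layer 2: identity activation
    loss = 0.5 * np.sum((g - y) ** 2)   # loss function (here, squared loss)
    return loss, (z1, a1, z2, g)

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), 1.0
W1, W01 = rng.normal(size=(2, 3)), np.zeros(3)
W2, W02 = rng.normal(size=(3, 1)), np.zeros(1)
loss, _ = forward_pass(x, y, W1, W01, W2, W02)
```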
compositions of ReLU(s) can be quite expressive
in fact, with enough units, they can approximate any (continuous) function arbitrarily well!
image credit: Phillip Isola
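To make the claim concrete, a small numeric sketch (my own, not from the lecture): fit a linear combination of shifted ReLUs to \(\sin(x)\) by least squares; adding more ReLU "kinks" makes the piecewise-linear fit arbitrarily good on the interval.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Approximate sin(x) on [0, 2*pi] with a linear combination of shifted ReLUs.
xs    = np.linspace(0, 2 * np.pi, 200)
knots = np.linspace(0, 2 * np.pi, 20)         # one ReLU "kink" per knot
Phi   = np.c_[relu(xs[:, None] - knots[None, :]), np.ones(len(xs))]
coef, *_ = np.linalg.lstsq(Phi, np.sin(xs), rcond=None)
approx = Phi @ coef                           # piecewise-linear approximation
print(np.max(np.abs(approx - np.sin(xs))))    # small maximum error
```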
Recall:
stochastic gradient descent to learn a linear regressor
for simplicity, say the dataset has only one data point \((x,y)\)
example on black-board
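A sketch of that blackboard setup (the specific numbers are made up): one data point \((x, y)\), squared loss \(\tfrac{1}{2}(w^\top x + w_0 - y)^2\), and repeated gradient steps on \(w, w_0\).

```python
import numpy as np

x, y = np.array([1.0, 2.0]), 3.0        # the single data point
w, w0, eta = np.zeros(2), 0.0, 0.1      # initial weights and step size

for _ in range(100):
    g = np.dot(w, x) + w0               # forward pass: prediction (identity activation)
    grad_w  = (g - y) * x               # dL/dw  = (g - y) x
    grad_w0 = (g - y)                   # dL/dw0 = (g - y)
    w, w0 = w - eta * grad_w, w0 - eta * grad_w0

print(np.dot(w, x) + w0)                # prediction converges toward y = 3.0
```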
Recall:
now, slightly more interesting activation:
example on black-board
\(\dots\)
\(\nabla_{W^2} \mathcal{L}(g^{(i)},y^{(i)})\)
Backward pass: run SGD to update all parameters
\(\dots\)
\(\nabla_{W^2} \mathcal{L}(g,y)\)
Backward pass: run SGD to update all parameters
Evaluate the gradient \(\nabla_{W^2} \mathcal{L}(g,y)\)
Update the weights \(W^2 \leftarrow W^2 - \eta \nabla_{W^2} \mathcal{L}(g,y)\)
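A sketch of this output-layer step, reusing the two-layer ReLU/squared-loss setup from the forward-pass sketch above; with the convention \(z^2 = (W^2)^\top a^1 + W_0^2\), the gradient is \(\nabla_{W^2}\mathcal{L} = a^1 \,(\partial\mathcal{L}/\partial z^2)^\top\).

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), 1.0
W1, W01 = rng.normal(size=(2, 3)), np.zeros(3)
W2, W02 = rng.normal(size=(3, 1)), np.zeros(1)
eta = 0.1

# Forward pass (ReLU hidden layer, identity output)
z1 = W1.T @ x + W01
a1 = np.maximum(z1, 0.0)
z2 = W2.T @ a1 + W02
g  = z2

# Evaluate the gradient w.r.t. the output-layer weights (squared loss: dL/dz2 = g - y)
dL_dz2   = g - y
grad_W2  = np.outer(a1, dL_dz2)
grad_W02 = dL_dz2

# SGD update: W2 <- W2 - eta * grad_W2 (kept under new names here so the
# forward-pass values can still be reused for the W1 gradient in the next sketch)
W2_next, W02_next = W2 - eta * grad_W2, W02 - eta * grad_W02
```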
\(\dots\)
how to find \(\nabla_{W^1} \mathcal{L}(g,y)\)?
Now, how to update \(W^1\)?
\(\dots\)
Evaluate the gradient \(\nabla_{W^1} \mathcal{L}(g,y)\)
Update the weights \(W^1 \leftarrow W^1 - \eta \nabla_{W^1} \mathcal{L}(g,y)\)
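Continuing the same sketch (same variable names as above), the chain rule carries \(\partial\mathcal{L}/\partial z^2\) back through \(W^2\) and the ReLU to give \(\nabla_{W^1}\mathcal{L}\):

```python
# Back through the output layer and the ReLU:
# dL/da1 = W2 dL/dz2, and dL/dz1 = dL/da1 * ReLU'(z1), with ReLU'(z) = 1 if z > 0 else 0.
dL_da1 = W2 @ dL_dz2
dL_dz1 = dL_da1 * (z1 > 0)

# Evaluate the gradient, then update the first-layer weights
grad_W1  = np.outer(x, dL_dz1)
grad_W01 = dL_dz1
W1_next, W01_next = W1 - eta * grad_W1, W01 - eta * grad_W01
```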
\(\dots\)
how to find \(\nabla_{W^1} \mathcal{L}(g,y)\)?
Previously, we found
\(\dots\)
Now, how to find \(\nabla_{W^1} \mathcal{L}(g,y)\)?
\(\dots\)
back propagation: reuse of computation
\(\dots\)
back propagation: reuse of computation
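The reuse is exactly this: the quantity \(\partial\mathcal{L}/\partial z^l\) ("delta") computed for one layer is what every earlier layer's gradient needs. A generic vectorized sketch for a stack of fully-connected ReLU layers (identity output, squared loss; the function name `backprop_step` and all shapes are my own choices):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def backprop_step(x, y, Ws, W0s, eta=0.1):
    """One SGD step on a stack of fully-connected ReLU layers (identity output, squared loss)."""
    # Forward pass: cache every pre-activation z and activation a for reuse
    a, zs, activations = x, [], [x]
    for l, (W, W0) in enumerate(zip(Ws, W0s)):
        z = W.T @ a + W0
        a = z if l == len(Ws) - 1 else relu(z)      # identity activation on the last layer
        zs.append(z)
        activations.append(a)

    # Backward pass: delta = dL/dz^l, computed once per layer and reused
    delta = activations[-1] - y                     # output layer: dL/dz = g - y
    for l in reversed(range(len(Ws))):
        grad_W, grad_W0 = np.outer(activations[l], delta), delta
        if l > 0:                                   # reuse delta for the next-earlier layer
            delta = (Ws[l] @ delta) * (zs[l - 1] > 0)
        Ws[l]  -= eta * grad_W                      # update only after delta has been passed back
        W0s[l] -= eta * grad_W0
    return Ws, W0s

rng = np.random.default_rng(0)
Ws  = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
W0s = [np.zeros(3), np.zeros(1)]
Ws, W0s = backprop_step(rng.normal(size=2), 1.0, Ws, W0s)
```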
Let's revisit this:
example on black-board
now, slightly more complex network:
example on black-board
now, slightly more complex network:
example on black-board
if \(z^2 > 0\) and \(z_1^1 < 0\), some weights (grayed-out ones) won't get updated
now, slightly more complex network:
example on black-board
if \(z^2 < 0\), no weights get updated
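A quick numeric check of this (a toy network with a ReLU at the output as well; the numbers are made up so that \(z^2 < 0\)): since \(\mathrm{ReLU}'(z^2) = 0\), every gradient upstream of the output is zero and no weight moves.

```python
import numpy as np

# g = ReLU(z2), z2 = W2^T a1 + W02, a1 = ReLU(z1), z1 = W1^T x + W01; squared loss
x, y = np.array([1.0, 2.0]), 1.0
W1, W01 = np.array([[0.5], [0.5]]), np.array([0.0])
W2, W02 = np.array([[1.0]]), np.array([-5.0])       # offset chosen so that z2 < 0

z1 = W1.T @ x + W01
a1 = np.maximum(z1, 0.0)                            # z1 = 1.5 > 0
z2 = W2.T @ a1 + W02
g  = np.maximum(z2, 0.0)                            # z2 = -3.5 < 0, so g = 0

dL_dz2  = (g - y) * (z2 > 0)                        # ReLU'(z2) = 0
grad_W2 = np.outer(a1, dL_dz2)                      # all zeros
dL_dz1  = (W2 @ dL_dz2) * (z1 > 0)
grad_W1 = np.outer(x, dL_dz1)                       # all zeros too: nothing gets updated
print(grad_W2, grad_W1)
```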
However, in the realm of neural networks, the precise nature of this relationship remains an active area of research; see, for example, phenomena like the double-descent curve and scaling laws.
Recall:
Residual (skip) connection:
example on black-board
Now, \(g = a^1 + \text{ReLU}(z^2)\); even if \(z^2 < 0\), with the skip connection, weights in earlier layers can still get updated
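Continuing that toy example (same variables), adding the skip connection \(g = a^1 + \mathrm{ReLU}(z^2)\) restores a gradient path around the dead ReLU:

```python
# Same toy numbers, now with a skip connection: g = a1 + ReLU(z2)
g = a1 + np.maximum(z2, 0.0)            # the ReLU branch is 0 (z2 < 0), but a1 skips around it

dL_dg   = g - y                         # squared loss
dL_dz2  = dL_dg * (z2 > 0)              # still 0: the ReLU branch contributes nothing
dL_da1  = W2 @ dL_dz2 + dL_dg           # ...but the skip path passes dL_dg straight through
dL_dz1  = dL_da1 * (z1 > 0)
grad_W1 = np.outer(x, dL_dz1)           # nonzero: earlier-layer weights still get updated
print(grad_W1)
```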
We'd love to hear your thoughts.