(DRAFT)
Shen Shen
March 7, 2025
Recap:
leveraging nonlinear transformations
👆 importantly, linear in \(\phi\), non-linear in \(x\)
transform via
Pointed out key ideas (enabling neural networks):
expressiveness
efficient training
Two epiphanies:
some appropriate weighted sum
👋 heads-up, in this section, for simplicity:
all neural network diagrams focus on a single data point
A neuron:
\(w\): what the algorithm learns
\(f\): what we engineers choose
\(z\): scalar
\(a\): scalar
Choose activation \(f(z)=z\)
learnable parameters (weights)
e.g. linear regressor represented as a computation graph
Choose activation \(f(z)=\sigma(z)\)
learnable parameters (weights)
e.g. linear logistic classifier represented as a computation graph
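To make these two computation graphs concrete, here is a minimal numpy sketch (illustrative, not from the slides) of a single neuron: a weighted sum \(z = w^\top x + w_0\) followed by an activation \(f\); the offset \(w_0\) and the variable names are assumptions for the example. With \(f(z)=z\) it computes a linear regressor's output; with \(f(z)=\sigma(z)\), a linear logistic classifier's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, w0, f):
    """One neuron: pre-activation z = w^T x + w0, then activation a = f(z)."""
    z = np.dot(w, x) + w0   # scalar pre-activation
    return f(z)             # scalar activation (the neuron's output)

x = np.array([1.0, 2.0])                 # a single data point
w, w0 = np.array([0.5, -0.3]), 0.1       # learnable parameters

a_linear   = neuron(x, w, w0, lambda z: z)   # f(z) = z        -> linear regressor
a_logistic = neuron(x, w, w0, sigmoid)       # f(z) = sigma(z) -> logistic classifier
```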
A layer:
learnable weights
(diagram labels: layer, linear combination, activations)
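A hedged numpy sketch of one such layer acting on a single data point (shapes and names are illustrative): all neurons in the layer share the same input \(x\), so their pre-activations can be computed as one matrix product, followed by an elementwise activation.

```python
import numpy as np

def layer_forward(x, W, W0, f):
    """One fully-connected layer on a single data point.

    x  : (d,)   input vector
    W  : (d, m) one weight column per neuron in the layer
    W0 : (m,)   offsets
    f  : activation, applied elementwise
    """
    z = W.T @ x + W0    # (m,) pre-activations, one per neuron
    return f(z)         # (m,) activations

rng = np.random.default_rng(0)
x = rng.normal(size=3)                        # d = 3 inputs
W, W0 = rng.normal(size=(3, 4)), np.zeros(4)  # m = 4 neurons in this layer
a = layer_forward(x, W, W0, np.tanh)          # shape (4,)
```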
A (fully-connected, feed-forward) neural network:
(diagram labels: layer, input, neuron, learnable weights)
We choose: the activation \(f\) for the hidden layer(s) and for the output layer.
Recall this example ("some appropriate weighted sum"): with \(f(\cdot) = \sigma(\cdot)\) in the hidden layer and \(f(\cdot)\) the identity function at the output, it can be represented as a neural network.
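A minimal numpy sketch of that representation (my reconstruction, assuming sigmoid hidden units and a single identity output unit, not the slide's own figure):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_forward(x, W1, W01, W2, W02):
    """Hidden layer: sigmoid; output layer: identity (a weighted sum of hidden units)."""
    a1 = sigmoid(W1.T @ x + W01)   # hidden activations
    return W2.T @ a1 + W02         # identity output: weighted sum of a1

x = np.array([1.0, -2.0])
W1, W01 = np.ones((2, 3)), np.zeros(3)   # 2 inputs -> 3 hidden units
W2, W02 = np.ones((3, 1)), np.zeros(1)   # 3 hidden units -> 1 output
g = two_layer_forward(x, W1, W01, W2, W02)   # shape (1,)
```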
Activation function \(f\) choices
\(\sigma\) used to be the most popular
nowadays, the ReLU, \(\mathrm{ReLU}(z) = \max(0, z)\), is the most popular choice
compositions of ReLU(s) can be quite expressive
in fact, asymptotically, they can approximate any function!
(image credit: Phillip Isola)
(image credit: Tamara Broderick)
or give arbitrary decision boundaries!
(image credit: Tamara Broderick)
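For reference, a short numpy sketch (illustrative, not course-provided code) of the two activations mentioned above, plus a tiny weighted sum of shifted ReLUs to hint at why their compositions are so expressive:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # smooth, squashes to (0, 1)

def relu(z):
    return np.maximum(0.0, z)            # max(0, z)

# A weighted sum of shifted ReLUs is piecewise linear; with enough pieces,
# such combinations can approximate a 1-d function arbitrarily well.
z = np.linspace(-2, 2, 9)
approx = 1.0 * relu(z) - 2.0 * relu(z - 0.5) + 1.5 * relu(z - 1.0)
```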
output layer design choices
e.g., say \(K=5\) classes
(diagram labels: input \(x\), hidden layer(s), output layer)
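As a hedged sketch of one common design choice (the slide's own table and diagram are not reproduced here): for \(K=5\) classes the output layer typically uses a softmax activation, so the five outputs form a probability distribution over the classes.

```python
import numpy as np

def softmax(z):
    """Map K pre-activations to a probability distribution over K classes."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

a_hidden = np.array([0.2, -1.0, 0.7])         # activations from the last hidden layer
W, W0 = np.full((3, 5), 0.1), np.zeros(5)     # 3 hidden units -> K = 5 outputs
probs = softmax(W.T @ a_hidden + W0)          # shape (5,), entries sum to 1
```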
(The demo won't embed in PDF. But the direct link below works.)
e.g. forward-pass of a linear regressor
e.g. forward-pass of a linear logistic classifier
(diagram labels: linear combination, nonlinear activation, \(\dots\))
Forward pass:
evaluate, given the current parameters, the loss function
\(\dots\)
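A minimal numpy sketch of this forward pass (the layer shapes and the squared-error loss are illustrative assumptions): each layer applies its linear combination and activation in turn, and the final output \(g\) is plugged into the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """layers: list of (W, W0, f) tuples; returns the final output g."""
    a = x
    for W, W0, f in layers:
        a = f(W.T @ a + W0)   # linear combination, then activation
    return a

# two layers: sigmoid hidden units, identity output (a regressor)
layers = [(np.ones((2, 3)), np.zeros(3), sigmoid),
          (np.ones((3, 1)), np.zeros(1), lambda z: z)]
x, y = np.array([1.0, -1.0]), np.array([2.0])   # one data point and its label
g = forward(x, layers)
loss = ((g - y) ** 2).sum()   # squared-error loss for this data point
```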
Backward pass:
Run SGD to update the parameters, e.g. to update \(W^2\)
\(\nabla_{W^2} \mathcal{L}(g^{(i)}, y^{(i)})\)
\(\dots\)
\(\nabla_{W^2} \mathcal{L}(g, y)\)
Evaluate the gradient \(\nabla_{W^2} \mathcal{L}(g^{(i)}, y^{(i)})\)
Update the weights \(W^2 \leftarrow W^2 - \eta \nabla_{W^2} \mathcal{L}(g^{(i)}, y^{(i)})\)
How do we get these gradients, though?
\(\nabla_{W^1} \mathcal{L}(g, y)\)
Backward pass:
Run SGD to update the parameters, e.g. to update \(W^1\)
Evaluate the gradient \(\nabla_{W^1} \mathcal{L}(g^{(i)}, y^{(i)})\)
Update the weights \(W^1 \leftarrow W^1 - \eta \nabla_{W^1} \mathcal{L}(g^{(i)}, y^{(i)})\)
\(\dots\)
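A short numpy sketch of this SGD update (the step size \(\eta\) and the gradient values are placeholders; how the gradient is actually computed is the backpropagation question below):

```python
import numpy as np

eta = 0.01   # learning rate (step size), chosen by us

def sgd_step(W, grad_W):
    """One stochastic-gradient update based on a single data point's loss."""
    return W - eta * grad_W

# illustrative shapes; grad_W2 would come from backpropagation (next)
W2 = np.ones((3, 1))
grad_W2 = np.full((3, 1), 0.2)   # stand-in for the gradient of L(g^(i), y^(i)) w.r.t. W2
W2 = sgd_step(W2, grad_W2)
```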
e.g. backward-pass of a linear regressor
e.g. backward-pass of a non-linear regressor
\(\dots\)
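As a worked illustration (reconstructed here, not copied from the slides): for a linear regressor \(g = w^\top x + w_0\) with squared-error loss, the backward pass is one application of the chain rule; for the non-linear regressor \(g = \sigma(w^\top x + w_0)\), one extra factor \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) appears.
\[
\mathcal{L}(g, y) = (g - y)^2,\quad g = w^\top x + w_0
\;\Rightarrow\;
\nabla_{w} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial g}\,\frac{\partial g}{\partial w} = 2\,(g - y)\,x,
\qquad
\frac{\partial \mathcal{L}}{\partial w_0} = 2\,(g - y)
\]
\[
g = \sigma(z),\; z = w^\top x + w_0
\;\Rightarrow\;
\nabla_{w} \mathcal{L} = \frac{\partial \mathcal{L}}{\partial g}\,\frac{\partial g}{\partial z}\,\frac{\partial z}{\partial w}
= 2\,(g - y)\,\sigma(z)\bigl(1 - \sigma(z)\bigr)\,x
\]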
Now, backpropagation: reuse of computation
How do we find these gradients?
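To illustrate the reuse of computation, here is a hedged numpy sketch (my own minimal example, assuming a two-layer network with sigmoid hidden units, an identity output, and squared-error loss): working backwards from the output, the error term computed for \(\nabla_{W^2}\mathcal{L}\) is reused, not recomputed, when evaluating \(\nabla_{W^1}\mathcal{L}\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), np.array([1.0])       # one data point and its label
W1, W01 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> 3 hidden units
W2, W02 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> 1 output

# forward pass: cache the intermediate quantities
z1 = W1.T @ x + W01
a1 = sigmoid(z1)
g  = W2.T @ a1 + W02                             # identity output
loss = ((g - y) ** 2).sum()

# backward pass: push the error term back, reusing it at every layer
dL_dz2 = 2.0 * (g - y)                           # also dL/dg, since the output is identity
grad_W2  = np.outer(a1, dL_dz2)                  # reuses the cached forward value a1
grad_W02 = dL_dz2
dL_da1 = W2 @ dL_dz2                             # reuse dL_dz2 rather than recomputing it
dL_dz1 = dL_da1 * a1 * (1.0 - a1)                # sigmoid'(z1), from the cached a1
grad_W1  = np.outer(x, dL_dz1)
grad_W01 = dL_dz1
```

This reuse of cached forward values and of the upstream error term is what makes backpropagation much cheaper than differentiating each weight matrix from scratch.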
We'd love to hear your thoughts.