Shen Shen
Oct 4, 2024
Recap:
leveraging nonlinear transformations
👆
importantly, linear in \(\phi\), non-linear in \(x\)
transform via \(\phi\)
Pointed out key ideas (enabling neural networks):
expressiveness
efficient training
Two epiphanies:
some appropriate weighted sum
👋 heads-up, in this section, for simplicity:
all neural network diagrams focus on a single data point
A neuron:
\(w\): what the algorithm learns
\(f\): what we engineers choose
\(z\): the linear combination of the inputs (a scalar)
\(a = f(z)\): the activation output (a scalar)
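To make the picture concrete, here is a minimal numpy sketch of one neuron acting on a single data point; the names (`neuron`, the offset `w0`) are illustrative, not from the slides.

```python
import numpy as np

# One neuron, acting on a single data point x (a sketch; names are illustrative).
def neuron(x, w, w0, f):
    z = w @ x + w0   # linear combination of the inputs: a scalar
    a = f(z)         # activation: apply the chosen f to z, also a scalar
    return a

x  = np.array([1.0, 2.0])    # a single data point
w  = np.array([0.5, -0.3])   # learnable weights
w0 = 0.1                     # learnable offset
print(neuron(x, w, w0, f=lambda z: z))   # f(z) = z here, just as an example
```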
Choose activation \(f(z)=z\)
learnable parameters (weights)
e.g. linear regressor represented as a computation graph
Choose activation \(f(z)=\sigma(z)\)
learnable parameters (weights)
e.g. linear logistic classifier represented as a computation graph
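Swapping the activation in the sketch above recovers both computation graphs; `sigmoid` below is the usual \(\sigma(z) = 1/(1+e^{-z})\), and the numbers are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigma(z) = 1 / (1 + e^{-z})

x, w, w0 = np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1
z = w @ x + w0               # the shared linear-combination node of the graph

g_regressor  = z             # f(z) = z: exactly a linear regressor
g_classifier = sigmoid(z)    # f(z) = sigma(z): exactly a linear logistic classifier
```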
A layer:
learnable weights
linear combo (one \(z\) per neuron in the layer)
activations (apply \(f\) to each \(z\))
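A minimal sketch of one fully-connected layer on a single data point: with one weight column per neuron, the whole layer is a single matrix product followed by an elementwise \(f\). The shapes here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One layer = several neurons sharing the same input x.
def layer(x, W, W0, f):
    z = W.T @ x + W0    # linear combo: one scalar z per neuron (3 of them here)
    a = f(z)            # activations: f applied elementwise
    return a

x  = np.array([1.0, 2.0])      # input, d = 2
W  = np.random.randn(2, 3)     # learnable weights: column j belongs to neuron j
W0 = np.zeros(3)               # learnable offsets
a  = layer(x, W, W0, sigmoid)  # 3 activations out
```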
A (fully-connected, feed-forward) neural network:
[diagram: input, neurons with learnable weights, organized into layers]
We choose:
hidden layer(s)
output layer
some appropriate weighted sum
recall this example
\(f(\cdot) = \sigma(\cdot)\)
\(f(\cdot)\): identity function
it can be represented as
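For instance, here is a sketch of a small fully-connected, feed-forward network with exactly these activation choices (\(\sigma\) in the hidden layer, identity at the output); the layer sizes and random weights are placeholders, not the example's actual parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Feed one data point through a stack of layers: each layer is a linear combo
# followed by its activation, and the next layer consumes the previous activations.
def forward(x, layers):
    a = x
    for (W, W0, f) in layers:
        a = f(W.T @ a + W0)
    return a

x = np.array([1.0, 2.0])
layers = [
    (np.random.randn(2, 3), np.zeros(3), sigmoid),       # hidden layer, f = sigma
    (np.random.randn(3, 1), np.zeros(1), lambda z: z),   # output layer, f = identity
]
g = forward(x, layers)   # the network's output for this data point
```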
Activation function \(f\) choices
\(\sigma\) used to be the most popular
nowadays, \(\mathrm{ReLU}(z) = \max(0, z)\) is the most common choice
compositions of ReLU(s) can be quite expressive
in fact, asymptotically, can approximate any function!
(image credit: Phillip Isola)
(image credit: Tamara Broderick)
or give arbitrary decision boundaries!
(image credit: Tamara Broderick)
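A quick numerical illustration of that expressiveness claim (not from the slides): a weighted sum of shifted ReLUs is piecewise-linear, and with enough kinks it can track a smooth target, here \(x^2\) on \([0,1]\), as closely as we like.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

target = lambda x: x ** 2                       # the function we try to mimic
knots  = np.linspace(0.0, 1.0, 11)              # where the piecewise-linear kinks sit
y      = target(knots)
slopes = np.diff(y) / np.diff(knots)            # slope of each linear segment
coeffs = np.diff(slopes, prepend=0.0)           # change of slope at each knot

def relu_approx(x):
    # y[0] + sum_k coeffs[k] * ReLU(x - knots[k]) interpolates the target at the knots
    return y[0] + sum(c * relu(x - t) for c, t in zip(coeffs, knots[:-1]))

xs = np.linspace(0.0, 1.0, 201)
print(np.max(np.abs(relu_approx(xs) - target(xs))))   # max error ~ 0.0025 with 11 knots
```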
output layer design choices
e.g., say \(K=5\) classes
input \(x\)
hidden layer(s)
output layer
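One common design for that output layer, sketched below: with \(K = 5\) classes, the last layer has 5 units and a softmax activation, so the outputs behave like class probabilities. The hidden-layer size here is made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

K = 5
a_hidden = np.random.randn(4)              # activations coming out of the hidden layer(s)
W_out    = np.random.randn(4, K)           # learnable output-layer weights
W0_out   = np.zeros(K)
g = softmax(W_out.T @ a_hidden + W0_out)   # 5 nonnegative scores summing to 1
print(g, g.sum())
```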
(The demo won't embed in PDF. But the direct link below works.)
e.g. forward-pass of a linear regressor
e.g. forward-pass of a linear logistic classifier
linear combination
nonlinear activation
\(\dots\)
Forward pass:
evaluate, given the current parameters, the loss function
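A sketch of that forward pass for one data point \((x^{(i)}, y^{(i)})\), using a squared loss just as an example; the two-layer shape matches the network sketched earlier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: with the current W^1, W^2 (and offsets), compute the prediction g
# for one data point, then evaluate the loss L(g, y).
def forward_loss(x, y, W1, W01, W2, W02):
    a1 = sigmoid(W1.T @ x + W01)         # hidden layer
    g  = W2.T @ a1 + W02                 # output layer (identity activation)
    loss = 0.5 * np.sum((g - y) ** 2)    # squared loss, one possible choice of L
    return g, loss

x, y = np.array([1.0, 2.0]), np.array([0.7])
W1, W01 = np.random.randn(2, 3), np.zeros(3)
W2, W02 = np.random.randn(3, 1), np.zeros(1)
g, loss = forward_loss(x, y, W1, W01, W2, W02)
```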
\(\dots\)
Backward pass:
Run SGD to update the parameters, e.g. to update \(W^2\)
\(\nabla_{W^2} \mathcal{L}(g^{(i)},y^{(i)})\)
\(\dots\)
\(\nabla_{W^2} \mathcal{L}(g,y)\)
Backward pass:
Run SGD to update the parameters, e.g. to update \(W^2\)
Evaluate the gradient \(\nabla_{W^2} \mathcal{L}(g^{(i)},y^{(i)})\)
Update the weights \(W^2 \leftarrow W^2 - \eta \nabla_{W^2} \mathcal{L}(g^{(i)},y^{(i)})\)
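The update step itself is one line; in this sketch `grad_W2` is just a stand-in for the evaluated gradient \(\nabla_{W^2}\mathcal{L}(g^{(i)}, y^{(i)})\), however it was obtained.

```python
import numpy as np

eta = 0.01                        # step size (learning rate)
W2 = np.random.randn(3, 1)        # current output-layer weights
grad_W2 = np.random.randn(3, 1)   # placeholder for the evaluated gradient of L w.r.t. W^2
W2 = W2 - eta * grad_W2           # SGD step: W^2 <- W^2 - eta * gradient
```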
How do we get these gradients, though?
\(\nabla_{W^1} \mathcal{L}(g,y)\)
Backward pass:
Run SGD to update the parameters, e.g. to update \(W^1\)
Evaluate the gradient \(\nabla_{W^1} \mathcal{L}(g^{(i)},y^{(i)})\)
Update the weights \(W^1 \leftarrow W^1 - \eta \nabla_{W^1} \mathcal{L}(g^{(i)},y^{(i)})\)
\(\dots\)
e.g. backward-pass of a linear regressor
e.g. backward-pass of a non-linear regressor
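A hand-rolled sketch of such a backward pass (one sigmoid hidden layer, identity output, squared loss): every gradient below comes from the chain rule applied to the quantities cached during the forward pass. Shapes and values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([1.0, 2.0]), np.array([0.7])
W1, W01 = np.random.randn(2, 3), np.zeros(3)
W2, W02 = np.random.randn(3, 1), np.zeros(1)

# forward pass (keep the intermediates; the backward pass needs them)
z1 = W1.T @ x + W01
a1 = sigmoid(z1)
g  = W2.T @ a1 + W02
loss = 0.5 * np.sum((g - y) ** 2)

# backward pass: chain rule, from the loss back toward the input
dL_dg    = g - y                        # dL/dg for the squared loss
grad_W2  = np.outer(a1, dL_dg)          # nabla_{W^2} L
grad_W02 = dL_dg                        # nabla_{W^2_0} L
dL_da1   = W2 @ dL_dg                   # push the gradient through the output layer
dL_dz1   = dL_da1 * a1 * (1.0 - a1)     # through the sigmoid: sigma'(z) = sigma(z)(1 - sigma(z))
grad_W1  = np.outer(x, dL_dz1)          # nabla_{W^1} L
grad_W01 = dL_dz1                       # nabla_{W^1_0} L
```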
\(\dots\)
Now, back propagation: reuse of computation
how to find these gradients, \(\nabla_{W^2}\mathcal{L}\), \(\nabla_{W^1}\mathcal{L}\), \(\dots\), efficiently?
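A compact sketch of that reuse (all-sigmoid layers, squared loss, illustrative shapes): the backward sweep computes \(\partial\mathcal{L}/\partial z\) once per layer, and reuses it both for that layer's weight gradient and to reach the layer below, instead of redoing the whole chain rule for each \(W\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass caches every activation; the backward pass then reuses dL/dz
# from the layer above to get the gradient for the layer below.
def backprop(x, y, weights):
    a, As = x, [x]
    for (W, W0) in weights:              # forward pass (all-sigmoid layers here)
        a = sigmoid(W.T @ a + W0)
        As.append(a)
    loss = 0.5 * np.sum((a - y) ** 2)    # squared loss at the output

    grads = [None] * len(weights)
    dL_dz = (a - y) * a * (1.0 - a)                  # dL/dz at the output layer
    for l in reversed(range(len(weights))):
        W, _ = weights[l]
        grads[l] = (np.outer(As[l], dL_dz), dL_dz)   # (grad wrt W, grad wrt W0) of layer l
        if l > 0:                                    # reuse dL/dz to move one layer down
            dL_dz = (W @ dL_dz) * As[l] * (1.0 - As[l])
    return loss, grads

x, y = np.array([1.0, 2.0]), np.array([0.5])
weights = [(np.random.randn(2, 3), np.zeros(3)),   # W^1, W^1_0
           (np.random.randn(3, 1), np.zeros(1))]   # W^2, W^2_0
loss, grads = backprop(x, y, weights)
```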
We'd love to hear your thoughts.