Shen Shen
March 8, 2024
(👉 the Live Slides)
(many slides adapted from Phillip Isola and Tamara Broderick)
learnable parameters (weights)
We saw that one way of getting complex input-output behavior is
to leverage nonlinear transformations
transform
e.g. use for decision boundary
👆 importantly, linear in \(\phi\), non-linear in \(x\)
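To make this concrete, here is a minimal sketch (not from the slides; the feature map and weights are made up for illustration) of a fixed nonlinear transform \(\phi\): a decision rule that is linear in \(\phi(x)\) traces a nonlinear (here, circular) boundary in the original \(x\) space.

```python
# Minimal sketch (illustrative only): a fixed nonlinear transform phi makes a
# problem that is not linearly separable in x separable in phi(x).
import numpy as np

def phi(x):
    # x is a length-2 input; map it to polynomial features
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1**2, x2**2])

# A decision rule linear in phi(x): predict +1 if theta . phi(x) + theta0 > 0.
theta = np.array([0.0, 0.0, 0.0, 1.0, 1.0])   # illustrative weights
theta0 = -1.0                                  # boundary in x: x1^2 + x2^2 = 1

print(np.sign(theta @ phi(np.array([0.2, 0.3])) + theta0))  # inside circle -> -1.0
print(np.sign(theta @ phi(np.array([1.5, 0.0])) + theta0))  # outside circle -> +1.0
```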
Today (2nd cool idea): "stacking" helps too!
So, two epiphanies: nonlinear transformations of the input help, and stacking helps too.
(👋 heads-up: all neural network graphs focus on a single data point for simple illustration.)
[Network diagram: each layer applies learnable weights (a linear combination of the previous layer's activations) followed by an activation function; the first layer takes the input.]
\(\sigma\) used to be popular
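As a rough sketch of what one such layer computes (the shapes, names, and the ReLU/sigmoid choices here are illustrative assumptions, not the course's exact convention): a linear combination with learnable weights, then an elementwise activation.

```python
# Minimal sketch of one fully-connected layer, for a single data point.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def layer(a_in, W, b, activation=relu):
    # linear combination with learnable weights, then elementwise nonlinearity
    z = W.T @ a_in + b          # pre-activations
    return activation(z)        # activations passed on to the next layer

rng = np.random.default_rng(0)
x = rng.normal(size=3)                         # input (one data point)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)  # learnable weights of layer 1
a1 = layer(x, W1, b1)                          # activations of layer 1
```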
(output layer)
More complicated example: predict one class out of \(K\) possibilities (e.g., say \(K=5\) classes);
then the last layer has \(K\) neurons with softmax activation.
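A minimal sketch of that output layer, assuming \(K=5\) and made-up pre-activation values: softmax turns the last layer's pre-activations into class probabilities.

```python
# Minimal sketch (illustrative values): a last layer with K neurons whose
# softmax activation turns pre-activations z into probabilities over K classes.
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z_last = np.array([2.0, 1.0, 0.1, -1.0, 0.5])   # K = 5 pre-activations
probs = softmax(z_last)
print(probs, probs.sum())        # nonnegative, sums to 1
print(np.argmax(probs))          # predicted class: 0
```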
How do we optimize
\(J(\mathbf{W})=\sum_{i=1}^{n} \mathcal{L}\left(f_L\left(\ldots f_2\left(f_1\left(\mathbf{x}^{(i)}, \mathbf{W}_1\right), \mathbf{W}_2\right), \ldots \mathbf{W}_L\right), \mathbf{y}^{(i)}\right)\) though?
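In code, this objective is just "compose the layers, then score against the label, summed over data points." A minimal sketch, with illustrative layer functions, weights, and a squared loss standing in for \(\mathcal{L}\):

```python
# Minimal sketch (illustrative only) of J(W): run each data point through the
# composed layers f_L(... f_2(f_1(x, W_1), W_2) ..., W_L) and sum a loss.
import numpy as np

def forward(x, weights, fs):
    # fs[l] is the l-th layer function; weights[l] its learnable parameters
    a = x
    for f, W in zip(fs, weights):
        a = f(a, W)
    return a

def J(weights, fs, loss, X, Y):
    return sum(loss(forward(x, weights, fs), y) for x, y in zip(X, Y))

# Tiny usage with two layers and a squared loss (all values illustrative):
relu = lambda z: np.maximum(0.0, z)
fs = [lambda a, W: relu(W.T @ a), lambda a, W: W @ a]
weights = [np.ones((2, 3)), np.ones(3)]
sq_loss = lambda guess, y: (guess - y) ** 2
X, Y = [np.array([1.0, 2.0]), np.array([0.0, 1.0])], [1.0, 0.0]
print(J(weights, fs, sq_loss, X, Y))
```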
Recall that the chain rule says:
For the composed function \(h(\mathbf{x})=f(g(\mathbf{x}))\), its derivative is \(h^{\prime}(\mathbf{x})=f^{\prime}(g(\mathbf{x}))\, g^{\prime}(\mathbf{x})\).
Here, our loss depends on the final output,
and the final output \(A^L\) comes from a chain of composed functions
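Applying the chain rule repeatedly to that composition (a sketch in the \(A^L\)-style notation above, glossing over matrix-ordering and transpose details) gives, e.g. for the first layer's weights:
\[
\frac{\partial \mathcal{L}}{\partial \mathbf{W}_1}
= \frac{\partial \mathcal{L}}{\partial A^L}\,
  \frac{\partial A^L}{\partial A^{L-1}}\,
  \cdots\,
  \frac{\partial A^2}{\partial A^1}\,
  \frac{\partial A^1}{\partial \mathbf{W}_1}
\]
so each layer contributes one factor, and the shared prefix of factors can be reused when computing the gradient with respect to every \(\mathbf{W}_l\).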
(The demo won't embed in PDF. But the direct link below works.)
We'd love for you to share some lecture feedback.