Intro to Machine Learning
Lecture 6: Neural Networks
Shen Shen
March 8, 2024
(See the Live Slides)
(many slides adapted from Phillip Isola and Tamara Broderick)
Outline
- Recap and neural networks motivation
- Neural Networks
- A single neuron
- A single layer
- Many layers
- Design choices (activation function and loss function choices)
- Forward pass
- Backward pass (back-propagation)
e.g. linear regression represented as a computation graph
- Each data point incurs a loss of $\left(w^\top x^{(i)} + w_0 - y^{(i)}\right)^2$
- Repeat for each data point, sum up the individual losses
- Gradient of the total loss gives us the "signal" on how to optimize the learnable parameters (weights) $w, w_0$
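To make the computation concrete, here is a minimal numpy sketch (hypothetical helper, not the lecture's code) of the per-data-point squared loss and its gradient with respect to the weights:

```python
# Minimal sketch (assumed helper, not the lecture's code): per-point squared
# loss for linear regression and its gradient w.r.t. the learnable weights.
import numpy as np

def squared_loss_and_grad(x, y, w, w0):
    """x: (m,) input, y: scalar label, w: (m,) weights, w0: scalar offset."""
    residual = w @ x + w0 - y      # w^T x + w0 - y
    loss = residual ** 2           # (w^T x + w0 - y)^2
    grad_w = 2 * residual * x      # d loss / d w
    grad_w0 = 2 * residual         # d loss / d w0
    return loss, grad_w, grad_w0

# Summing the per-point losses and gradients over the dataset gives the total
# loss and the "signal" used to update w, w0 by gradient descent.
```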
e.g. linear logistic regression (linear classification) represented as a computation graph
- Each data point incurs a loss of $-\left(y^{(i)} \log g^{(i)} + (1 - y^{(i)}) \log(1 - g^{(i)})\right)$, where $g^{(i)} = \sigma(w^\top x^{(i)} + w_0)$
- Repeat for each data point, sum up the individual losses
- Gradient of the total loss gives us the "signal" on how to optimize the learnable parameters (weights) $w, w_0$
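Likewise, a minimal sketch (assumed helper names, not the lecture's code) of the per-point negative log-likelihood loss and its gradient:

```python
# Minimal sketch (assumed helper, not the lecture's code): per-point negative
# log-likelihood loss for linear logistic regression and its gradient.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_loss_and_grad(x, y, w, w0):
    """x: (m,) input, y: label in {0, 1}, w: (m,) weights, w0: scalar offset."""
    g = sigmoid(w @ x + w0)                               # the guess g^(i)
    loss = -(y * np.log(g) + (1 - y) * np.log(1 - g))     # NLL (log) loss
    grad_w = (g - y) * x                                  # d loss / d w
    grad_w0 = g - y                                       # d loss / d w0
    return loss, grad_w, grad_w0
```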
We saw that one way of getting complex input-output behavior is to leverage nonlinear transformations:
- transform $x$ into features $\phi(x)$, e.g. to use for a decision boundary
- importantly, the hypothesis is linear in $\phi$, non-linear in $x$
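As a concrete illustration of this first idea (the polynomial $\phi$ below is an assumed example, not the lecture's exact transform), a hand-designed feature map lets a linear classifier carve out a boundary that is non-linear in $x$:

```python
# Minimal sketch (assumed features, not the lecture's exact phi): a fixed
# nonlinear transform phi lets a linear classifier realize a boundary that is
# non-linear (here, circular) in the original x-space.
import numpy as np

def phi(x):
    """Hand-designed polynomial features of x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# Hypothesis linear in phi(x): sign(theta^T phi(x) + theta0).
theta = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
theta0 = -1.0                                    # boundary: x1^2 + x2^2 = 1
prediction = np.sign(theta @ phi(np.array([0.5, 0.5])) + theta0)
```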
Today (2nd cool idea): "stacking" helps too!
So, two epiphanies:
- nonlinearity empowers linear tools
- stacking helps
(Heads-up: all neural network graphs focus on a single data point for simple illustration.)
Outline
- Recap and neural networks motivation
- Neural Networks
- A single neuron
- A single layer
- Many layers
- Design choices (activation function and loss function choices)
- Forward pass
- Backward pass (back-propagation)
A single neuron is
- the basic operating "unit" in a neural network.
- the basic "node" when a neural network is viewed as computational graph.
- a function that maps a vector input $x \in \mathbb{R}^m$ to a scalar output
- inside the neuron, circles do function evaluation/computation
- x: m-dimensional input (a single data point)
- w: weights, the learnable parameters
- z: pre-activation scalar output
- f: activation function (we engineers choose it)
- a: post-activation scalar output
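Putting the pieces together, here is a minimal sketch of one neuron (assumed names; the offset term is written separately here, though it can also be folded into $w$):

```python
# Minimal sketch (assumed names; the offset w0 is written separately here,
# though it can also be folded into w): a single neuron.
import numpy as np

def neuron(x, w, w0, f):
    """x: (m,) input, w: (m,) weights, w0: scalar offset, f: activation."""
    z = w @ x + w0   # pre-activation: a linear combination of the inputs
    a = f(z)         # post-activation: scalar output of the neuron
    return a

# e.g. a ReLU neuron on a 3-dimensional input:
a = neuron(np.array([1.0, -2.0, 3.0]),
           np.array([0.5, 0.1, -0.2]), 0.3,
           lambda z: max(z, 0.0))
```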
A single layer is
- made of many individual neurons.
- (# of neurons) = (layer output dimension).
- typically, all neurons in one layer use the same activation f (if not, the algebra gets uglier/messier)
- typically, no "cross-wiring" between neurons, e.g. $z_1$ doesn't influence $a_2$. In other words, a layer has the same activation applied element-wise. (Softmax is an exception to this; details later.)
- typically, fully connected, i.e. there's an edge connecting $x_i$ to $z_j$ for all $i \in \{1, 2, \ldots, m\}$, $j \in \{1, 2, \ldots, n\}$. In other words, all $x_i$ influence all $a_j$.
(Figure: layers with learnable weights; each layer takes an input, forms linear combinations, then applies activations.)
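In code, one fully connected layer might look like the following minimal sketch (assumed shapes and names, not the lecture's code):

```python
# Minimal sketch (assumed shapes/names): a fully connected layer of n neurons.
# Every input x_i influences every output a_j, and the same activation f is
# applied element-wise to the pre-activations.
import numpy as np

def layer(x, W, W0, f):
    """x: (m,) input, W: (m, n) weights, W0: (n,) offsets, f: activation."""
    z = W.T @ x + W0   # pre-activations: z_j = w_j^T x + w0_j, one per neuron
    a = f(z)           # element-wise activation
    return a

# e.g. a layer mapping a 3-dimensional input to 4 ReLU units:
x = np.random.randn(3)
W, W0 = np.random.randn(3, 4), np.random.randn(4)
a = layer(x, W, W0, lambda z: np.maximum(z, 0.0))   # a has shape (4,)
```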
A (feed-forward) neural network is made of many layers stacked in sequence: each layer's output serves as the next layer's input.
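A minimal sketch of the forward computation through such a stack (assumed structure, not the lecture's code):

```python
# Minimal sketch (assumed structure): a feed-forward network is just layers
# applied in sequence, each layer's output becoming the next layer's input.
import numpy as np

def forward(x, weights, offsets, activations):
    """weights[l]: (m_l, n_l), offsets[l]: (n_l,), activations[l]: callable."""
    a = x
    for W, W0, f in zip(weights, offsets, activations):
        a = f(W.T @ a + W0)   # one layer: linear combination, then activation
    return a                  # the network's output (last layer's activation)

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z
# e.g. a 2 -> 4 -> 1 network: one ReLU hidden layer, then a linear output unit.
weights = [np.random.randn(2, 4), np.random.randn(4, 1)]
offsets = [np.random.randn(4), np.random.randn(1)]
out = forward(np.random.randn(2), weights, offsets, [relu, identity])
```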
Activation function f choices
$\sigma$ (the sigmoid) used to be popular
- interpretation: the firing rate of a neuron
- $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
ReLU is the de-facto activation choice nowadays
- Default choice in hidden layers.
- Pro: very efficient to implement; we choose to let the gradient be 1 for $z > 0$ and 0 for $z \le 0$.
- Drawback: if strongly in negative region, unit can be "dead" (no gradient).
- Inspired variants like ELU and leaky ReLU.
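For reference, a minimal sketch (assumed helper names) of these two activations and the gradients used later in back-propagation:

```python
# Minimal sketch (assumed helpers): the sigmoid and ReLU activations and their
# gradients; the ReLU gradient at exactly z = 0 is set to 0 by convention.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # sigma'(z) = sigma(z) (1 - sigma(z))

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)   # 1 where z > 0, else 0 ("dead" region)
```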
The last layer, the output layer, is special
- activation and loss depend on the problem at hand
- we've seen e.g. regression (one unit in last layer, squared loss).
- More complicated example: predict one class out of K possibilities (e.g., say K = 5 classes)
- then the last layer has K neurons with softmax activation
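A minimal sketch (assumed helper) of the softmax activation on the output layer:

```python
# Minimal sketch (assumed helper): softmax turns the K pre-activations of the
# last layer into K non-negative scores that sum to 1.
import numpy as np

def softmax(z):
    """z: (K,) pre-activations from the K output neurons."""
    exp_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exp_z / exp_z.sum()      # a probability distribution over classes

# e.g. K = 5 classes: predict the class with the largest probability.
probs = softmax(np.random.randn(5))
predicted_class = int(np.argmax(probs))
```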
Outline
- Recap and neural networks motivation
- Neural Networks
- A single neuron
- A single layer
- Many layers
- Design choices (activation function and loss function choices)
- Forward pass
- Backward pass (back-propagation)
How do we optimize
$J(W) = \sum_{i=1}^{n} \mathcal{L}\Big(f_L\big(\ldots f_2(f_1(x^{(i)}, W_1), W_2) \ldots, W_L\big),\ y^{(i)}\Big)$ though?
Backprop = gradient descent & the chain rule
Recall that, the chain rule says:
For the composed function $h(x) = f(g(x))$, its derivative is $h'(x) = f'(g(x))\, g'(x)$
Here, our loss depends on the final output,
and the final output $A_L$ comes from a chain of composed functions
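To see the chain rule in action, here is a minimal sketch of the forward and backward passes for a tiny two-layer network with squared loss (assumed shapes and names, offsets omitted, not the lecture's code):

```python
# Minimal sketch (assumed shapes/names, offsets omitted): forward pass caches
# intermediate values; backward pass applies the chain rule layer by layer,
# from the loss back to each weight matrix.
import numpy as np

def relu(z):      return np.maximum(z, 0.0)
def relu_grad(z): return np.where(z > 0, 1.0, 0.0)

def forward_backward(x, y, W1, W2):
    # forward pass: x -> z1 -> a1 -> z2 (the network output)
    z1 = W1.T @ x
    a1 = relu(z1)
    z2 = W2.T @ a1                     # linear output layer
    loss = 0.5 * np.sum((z2 - y) ** 2)

    # backward pass: chain rule, from the loss back through the layers
    d_z2 = z2 - y                      # dL/dz2
    d_W2 = np.outer(a1, d_z2)          # dL/dW2
    d_a1 = W2 @ d_z2                   # dL/da1, via the output-layer weights
    d_z1 = d_a1 * relu_grad(z1)        # dL/dz1, through the ReLU
    d_W1 = np.outer(x, d_z1)           # dL/dW1
    return loss, d_W1, d_W2

# A gradient-descent step then updates W1 -= eta * d_W1, W2 -= eta * d_W2.
loss, d_W1, d_W2 = forward_backward(np.random.randn(2), np.random.randn(1),
                                    np.random.randn(2, 3), np.random.randn(3, 1))
```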
(The demo won't embed in the PDF, but the direct link below works.)
Summary
- We saw last week that introducing non-linear transformations of the inputs can substantially increase the power of linear regression and classification hypotheses.
- We also saw that it's kind of difficult to select a good transformation by hand.
- Multi-layer neural networks are a way to make (S)GD find good transformations for us!
- Fundamental idea is easy: specify a hypothesis class and loss function so that d Loss / d theta is well behaved, then do gradient descent.
- Standard feed-forward NNs (sometimes called multi-layer perceptrons, which is actually kind of a misnomer) are organized into layers that alternate between parametrized linear transformations and fixed non-linear transforms (but many other designs are possible!)
- Typical non-linearities include sigmoid, tanh, and ReLU, but mostly people use ReLU
- Typical output transformations for classification are as we have seen: sigmoid and/or softmax
- There's a systematic way to compute d Loss / d theta via backpropagation
Thanks!
We'd love for you to share some lecture feedback.
introml-sp24-lec6