Shen Shen
Feb 28, 2025
11am, Room 10-250
linear regressor
Recap:
the regressor is linear in the feature \(x\)
\(y = \theta^{\top} x+\theta_0\)
linear classifier
Recap:
separator
the separator is linear in the feature \(x\)
linear logistic classifier
\(g(x)=\sigma\left(\theta^{\top} x+\theta_0\right)\)
Recap:
separator
the separator is linear in the feature \(x\)
Linear classification played a pivotal role in kicking off the first wave of AI enthusiasm.
Not linearly separable.
Linear tools cannot, by themselves, solve interesting tasks.
XOR dataset
feature engineering 👉
👈 neural networks
old/raw/original features \(x \in \mathbb{R}^d\)
new features \(\phi(x) \in \mathbb{R}^{d^{\prime}}\)
non-linear in \(x\)
linear in \(\phi\)
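A minimal sketch of this idea on the XOR dataset mentioned above (the product feature below is an illustrative choice, not necessarily the lecture's): with inputs in \(\{-1,+1\}\), the engineered feature \(\phi(x)=x_1 x_2\) makes the classes linearly separable even though no line in the original \(x\)-plane separates them.

```python
import numpy as np

# XOR-style dataset with inputs in {-1, +1}: the label is positive
# exactly when the two coordinates have opposite signs.
X = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
y = np.array([0, 1, 1, 0])

# No single line in x-space separates the classes, but the engineered
# feature phi(x) = x1 * x2 puts them on opposite sides of phi = 0.
phi = X[:, 0] * X[:, 1]
pred = (phi < 0).astype(int)   # a linear threshold in phi-space

print(pred)  # matches y: [0 1 1 0]
```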
non-linear transformation
Linearly separable in \(\phi(x) = x^2\) space, e.g. predict positive if \(\phi \geq 3\)
Non-linearly separated in \(x\) space, e.g. predict positive if \(x^2 \geq 3\)
transform via \(\phi(x) = x^2\)
training data

point | \(x\) | \(y\) |
p1 | -2 | 5 |
p2 | 1 | 2 |
p3 | 3 | 10 |

transform via \(\phi(x)=x^2\)

training data, in the new feature

point | \(\phi\) | \(y\) |
p1 | 4 | 5 |
p2 | 1 | 2 |
p3 | 9 | 10 |

a linear regressor fits the transformed data exactly: \(y=\phi+1\)

back in the original feature space, this is the non-linear fit \(y=x^2+1\)
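The fit can be checked numerically; a minimal NumPy least-squares sketch on the three training points from the table:

```python
import numpy as np

# the three training points from the table
x = np.array([-2.0, 1.0, 3.0])
y = np.array([5.0, 2.0, 10.0])

# transform: phi(x) = x^2
phi = x ** 2

# fit a linear regressor y = theta * phi + theta_0 by least squares
A = np.column_stack([phi, np.ones_like(phi)])
(theta, theta_0), *_ = np.linalg.lstsq(A, y, rcond=None)

print(theta, theta_0)  # ≈ 1.0 and 1.0, i.e. y = phi + 1 = x^2 + 1
```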
systematic polynomial features construction
9 data points; each data point has a feature \(x \in \mathbb{R},\) label \(y \in \mathbb{R}\)
generated from green dashed line
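One systematic way to construct polynomial features of a 1-D input (a sketch; the lecture's exact construction may differ) is to stack the powers \(x^0, x^1, \ldots, x^k\) as columns:

```python
import numpy as np

def polynomial_features(x, order):
    """Map a 1-D array of inputs to the powers x^0, x^1, ..., x^order."""
    return np.column_stack([x ** j for j in range(order + 1)])

x = np.linspace(-1.0, 1.0, 9)        # e.g. 9 data points
Phi = polynomial_features(x, order=3)
print(Phi.shape)  # (9, 4): one column per power 0..3
```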
Underfitting: high error on train set, high error on test set
Appropriate model: low error on train set, low error on test set
Overfitting: low error on train set, high error on test set
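The pattern above can be sketched numerically: with 9 training points, an order-8 polynomial can interpolate the training data exactly (near-zero train error) even though it generalizes poorly, while a low-order fit leaves a larger train error. (The data-generating function below is an illustrative stand-in for the green dashed line.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 9)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(9)  # illustrative ground truth + noise

def train_mse(order):
    # least-squares fit with polynomial features up to the given order
    Phi = np.column_stack([x ** j for j in range(order + 1)])
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.mean((Phi @ theta - y) ** 2)

# train error only goes down as the model gets more flexible ...
print(train_mse(1), train_mse(3), train_mse(8))
# ... and order 8 interpolates all 9 points (train error ~ 0):
# the overfitting regime of low train error, high test error.
```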
Similar overfitting can happen in classification
Using polynomial features of order 3
Quick summary:
leveraging nonlinear transformations
importantly, linear in \(\phi\), non-linear in \(x\)
Outlined the fundamental concepts of neural networks:
expressiveness
efficient learning
Two epiphanies:
some appropriately weighted sum
👋 heads-up:
all neural network diagrams focus on a single data point
A neuron:
\(w\): what the algorithm learns
\(f\): what we engineers choose
\(z\): scalar (the pre-activation, \(z = w^{\top} x + w_0\))
\(a\): scalar (the activation, \(a = f(z)\))
Choose activation \(f(z)=z\)
learnable parameters (weights)
e.g. linear regressor represented as a computation graph
Choose activation \(f(z)=\sigma(z)\)
learnable parameters (weights)
e.g. linear logistic classifier represented as a computation graph
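The two computation graphs above share one forward computation; a minimal sketch of a single neuron on a single data point (weights below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, w0, f):
    """One neuron: z = w.x + w0 (pre-activation), a = f(z) (activation)."""
    z = np.dot(w, x) + w0   # scalar
    return f(z)             # scalar

x = np.array([1.0, 2.0])
w, w0 = np.array([0.5, -0.25]), 0.1

# choosing f(z) = z gives a linear regressor ...
y_hat = neuron(x, w, w0, f=lambda z: z)
# ... choosing f(z) = sigma(z) gives a linear logistic classifier
g = neuron(x, w, w0, f=sigmoid)
print(y_hat, g)
```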
A layer:
learnable weights
linear combo
activations
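A layer applies the same recipe to many neurons at once: stack the weight vectors into a matrix so the linear combos become one matrix product, then apply the activation elementwise (a sketch; shapes are assumptions):

```python
import numpy as np

def layer(x, W, W0, f):
    """One layer of m neurons on an input x in R^d.
    W: (d, m) learnable weights, W0: (m,) learnable offsets."""
    Z = W.T @ x + W0        # m linear combos, one per neuron
    return f(Z)             # m activations, applied elementwise

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                                   # d = 3 input
W, W0 = rng.standard_normal((3, 4)), rng.standard_normal(4)  # m = 4 neurons
a = layer(x, W, W0, f=np.tanh)
print(a.shape)  # (4,): one activation per neuron
```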
A (fully-connected, feed-forward) neural network:
layer
input
neuron
learnable weights
We choose:
hidden
output
some appropriate weighted sum
recall this example
\(f(\cdot) = \sigma(\cdot)\)
\(f(\cdot) \) identity function
it can be represented as
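The example above (sigmoid hidden layer, identity output) can be sketched as a two-layer forward pass on a single data point (the sizes below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W1_0, W2, W2_0):
    """Two-layer feed-forward net: f = sigma in the hidden layer,
    f = identity in the output layer (as in the example above)."""
    a1 = sigmoid(W1.T @ x + W1_0)   # hidden-layer activations
    return W2.T @ a1 + W2_0         # identity output layer

rng = np.random.default_rng(1)
x = rng.standard_normal(2)                                       # one data point
W1, W1_0 = rng.standard_normal((2, 3)), rng.standard_normal(3)   # 3 hidden units
W2, W2_0 = rng.standard_normal((3, 1)), rng.standard_normal(1)   # 1 output unit
out = forward(x, W1, W1_0, W2, W2_0)
print(out.shape)  # (1,): a single scalar prediction
```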
Hidden layer activation function \(f\) choices
\(\sigma\) used to be the most popular
very simple function form (so is the gradient).
nowadays, default choice: ReLU
compositions of ReLU(s) can be quite expressive
in fact, asymptotically, can approximate any function!
image credit: Phillip Isola
or give arbitrary decision boundaries!
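A tiny concrete instance of that expressiveness: weighted sums of ReLUs build piecewise-linear functions, and two ReLU units already reproduce the absolute-value kink exactly:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

x = np.linspace(-2.0, 2.0, 9)

# a weighted sum of two ReLU units: relu(x) + relu(-x) == |x|
approx = relu(x) + relu(-x)
print(np.allclose(approx, np.abs(x)))  # True

# more units with shifted kinks, relu(x - b), give arbitrary
# piecewise-linear shapes -- the core of the universal-approximation intuition.
```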
output layer design choices
e.g., say \(K=5\) classes
input \(x\)
hidden layer(s)
output layer
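For \(K\) classes the standard design choice (assumed here, though the slides do not spell it out) is a softmax output layer, which turns \(K\) raw scores into a probability distribution:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # K = 5 raw output-layer scores
p = softmax(scores)
print(p.sum())          # sums to 1: a valid distribution over the 5 classes
print(np.argmax(p))     # 0: the predicted class is the one with the largest score
```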
However, in the realm of neural networks, the precise nature of this relationship remains an active area of research (consider, for example, phenomena like the double-descent curve and scaling laws).
(The demo won't embed in PDF. But the direct link below works.)
We'd love to hear your thoughts.