Recall:
| step | logistic | multi-class |
|---|---|---|
| \(z = \theta^{\top} x+\theta_0\) | \(z = \theta^{\top} x+\theta_0\) | \(z_k = \theta_k^{\top} x+\theta_{0,k}\) |
| \(g = \text{step}(z)\) | \(g = \sigma(z)\) | \(g = \text{softmax}(z_1,\ldots,z_K)\) |

decision boundary is linear in feature \(x\)
Linear classification played a pivotal role in kicking off the first wave of AI enthusiasm.
👆
👇
Not linearly separable.
Linear tools cannot solve interesting tasks.
Linear tools cannot, by themselves, solve interesting tasks.
XOR dataset
feature engineering 👉
👈 neural networks
original features \(x \in \mathbb{R}^{d}\)
new features \(\phi(x) \in \mathbb{R}^{d^{\prime}}\)
non-linear in \(x\)
linear in \(\phi\)
non-linear transformation \(\phi\)
Not linearly separable in \(x\) space
transform via \(\phi(x) = x^2\)
Linearly separable in \(\phi(x) = x^2\) space, e.g. predict positive if \(\phi \geq 3\)
Non-linearly separated in \(x\) space, e.g. predict positive if \(x^2 \geq 3\)
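A minimal sketch of the 1-D example, assuming the rule "predict positive if \(\phi \geq 3\)" from the slide. The input values and the names `phi` and `predict` are made up for illustration.

```python
import numpy as np

def phi(x):
    """Quadratic feature map phi(x) = x^2."""
    return x ** 2

def predict(x, threshold=3.0):
    """Linear rule in phi-space: +1 iff phi(x) >= threshold, else -1."""
    return np.where(phi(x) >= threshold, 1, -1)

# Hypothetical 1-D inputs: positives sit on both sides of the negatives,
# so no single threshold on x itself can separate them.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
labels = predict(x)
print(labels)  # [ 1  1 -1 -1 -1  1  1]
```

The classifier is a single threshold on \(\phi\), yet in \(x\) it labels two disjoint intervals as positive.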
non-linear classification
data in \(x\)-space
\(x_1^2 + x_2^2 = 10\): non-linear in \(x\)
\(z = x_1^2 + x_2^2\), threshold at \(z\!=\!10\)
transform via \(\phi(x) = (x_1^2,\, x_2^2)\)
data in \(\phi\)-space
\(\phi_1 + \phi_2 = 10\): linear in \(\phi\)
\(z = \phi_1 + \phi_2\), threshold at \(z\!=\!10\)
decision boundary is linear in \(\phi,\) nonlinear in \(x\)
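The circle example can be sketched the same way. The test points are hypothetical, and labelling the outside of the circle as class 1 is an assumption (the slide only gives the threshold \(z = 10\)).

```python
import numpy as np

def phi(X):
    """Map each row (x1, x2) to (x1^2, x2^2)."""
    return X ** 2

theta = np.array([1.0, 1.0])  # weights on (phi_1, phi_2)
threshold = 10.0

def predict(X):
    z = phi(X) @ theta        # z = phi_1 + phi_2 = x1^2 + x2^2
    # Assumption: class 1 means "on or outside the circle".
    return (z >= threshold).astype(int)

# Hypothetical points; z = 2, 13, 16, 9
X = np.array([[1.0, 1.0], [3.0, 2.0], [-4.0, 0.0], [0.0, -3.0]])
preds = predict(X)
print(preds)  # [0 1 1 0]
```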
training data
| | \(x\) | \(y\) |
|---|---|---|
| p1 | -2 | 5 |
| p2 | 1 | 2 |
| p3 | 3 | 10 |
transform via \(\phi(x)=x^2\)
training data
| | \(\phi\) | \(y\) |
|---|---|---|
| p1 | 4 | 5 |
| p2 | 1 | 2 |
| p3 | 9 | 10 |
non-linear regression
\(g = \phi + 1\)
\(g = x^2 + 1\)
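A quick check of this fit, sketched with ordinary least squares on the transformed feature. The three points come from the table above; solving with `np.linalg.lstsq` is one possible choice, not necessarily how the course fits the model.

```python
import numpy as np

# Training data from the table: (x, y) for p1, p2, p3
x = np.array([-2.0, 1.0, 3.0])
y = np.array([5.0, 2.0, 10.0])

phi = x ** 2                                   # transformed feature
A = np.column_stack([phi, np.ones_like(phi)])  # columns: phi, constant 1

# Ordinary linear regression, but in phi-space
(theta, theta_0), *_ = np.linalg.lstsq(A, y, rcond=None)

def g(x_new):
    """Prediction: linear in phi, quadratic in x."""
    return theta * x_new ** 2 + theta_0

print(round(theta, 6), round(theta_0, 6))  # 1.0 1.0
```

The recovered parameters give \(g = \phi + 1\), i.e. \(g = x^2 + 1\) in the original feature.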
systematic polynomial features construction
| | \(d = 1\), features: \(x_1\) | \(d = 2\), features: \(x_1, x_2\) |
|---|---|---|
| \(k=0\) | \(1\) | \(1\) |
| \(k=1\) | \(1,\; x_1\) | \(1,\; x_1,\; x_2\) |
| \(k=2\) | \(1,\; x_1,\; x_1^2\) | \(1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2\) |
| \(k=3\) | \(1,\; x_1,\; x_1^2,\; x_1^3\) | \(1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2,\; x_1^3,\; x_1^2 x_2,\; x_1 x_2^2,\; x_2^3\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
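The table can be generated mechanically. Below is a minimal sketch that enumerates all monomials of total degree up to \(k\). The function name `poly_features` is hypothetical, and the ordering simply follows the table.

```python
from itertools import combinations_with_replacement
from math import prod

def poly_features(x, k):
    """Return all monomials of the entries of x with total degree <= k."""
    feats = []
    for degree in range(k + 1):
        # Each index tuple (i, j, ...) stands for the monomial x_i * x_j * ...
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(prod(x[i] for i in idx))  # empty product is 1
    return feats

# d = 2, k = 2 with x = (2, 3): 1, x1, x2, x1^2, x1*x2, x2^2
print(poly_features([2, 3], 2))  # [1, 2, 3, 4, 6, 9]
```

The number of features grows quickly with \(d\) and \(k\): with \(d = 2\), \(k = 3\) there are already 10.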
\( h = \textcolor{#4a6fa5}{\theta_0} + \textcolor{#4a6fa5}{\theta_1} \textcolor{#888}{x} \)
Learn 2 parameters — degree-1
\( h = \textcolor{#4a6fa5}{\theta_0} + \textcolor{#4a6fa5}{\theta_1} \textcolor{#888}{x} + \textcolor{#4a6fa5}{\theta_2} \textcolor{#888}{x^2} \)
Learn 3 parameters — degree-2
| | Underfitting | Appropriate model | Overfitting |
|---|---|---|---|
| train set | high error | low error | low error |
| test set | high error | low error | high error |
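One way to see these three regimes is to fit polynomials of increasing degree to a small noisy dataset. This is a sketch only: the target function, noise level, sample sizes, and degrees are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical data: noisy samples of sin(3x) on [-1, 1]."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(12)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = {}, {}
for k in (1, 3, 9):  # low, moderate, and high polynomial degree
    coeffs = np.polyfit(x_train, y_train, deg=k)
    train_err[k] = mse(coeffs, x_train, y_train)
    test_err[k] = mse(coeffs, x_test, y_test)
    print(f"degree {k}: train {train_err[k]:.4f}, test {test_err[k]:.4f}")
```

Training error can only go down as the degree grows, because each model contains the previous one. Test error usually does not follow, though the exact numbers depend on the random seed.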
Previously:
\(\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}\) → Linear Learning Algorithm (🧠⚙️ hypothesis class, loss function, hyperparameters)
today, so far:
\(\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}\) → 🧠⚙️ feature transformation \(\phi(x)\) → \(\left\{\left(\phi(x^{(i)}), y^{(i)}\right)\right\}_{i=1}^{n}\) → Linear Learning Algorithm (🧠⚙️ hypothesis class, loss function, hyperparameters)
can we automate 👆? i.e. fold it into the learning algorithm?
neural networks:
\(\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}\) → Non-linear Learning Algorithm (🧠⚙️ hypothesis class, loss function, hyperparameters)
the nonlinearity \(\phi\) is learned, not hand-designed
👋 heads-up:
all neural network diagrams focus on a single data point
Outlined the fundamental concepts of neural networks:
expressiveness
efficient learning
layered structure
We abstract this into a simple mathematical unit.
importantly, linear in \(\phi\), non-linear in \(x\)
transform via
some appropriately weighted sum
A neuron:
\(w\): what the algorithm learns
\(f\): what we engineers choose
\(z\): scalar
\(a\): scalar
Choose activation \(f(z)=z\)
learnable parameters (weights)
e.g. linear regressor represented as a computation graph
Choose activation \(f(z)=\sigma(z)\)
learnable parameters (weights)
e.g. linear logistic classifier represented as a computation graph
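Both examples are the same computation with a different choice of \(f\). A minimal sketch follows; the function name `neuron`, the bias name `w0`, and the numbers are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, w0, f):
    """One neuron: weighted sum z (a scalar), then activation a = f(z)."""
    z = w @ x + w0  # learned: w and w0
    a = f(z)        # chosen by the engineer: f
    return a

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
w0 = 0.0  # so z = 0.5 - 0.5 + 0 = 0

a_linear = neuron(x, w, w0, lambda z: z)  # identity: linear regressor
a_logistic = neuron(x, w, w0, sigmoid)    # sigmoid: logistic classifier
print(a_linear, a_logistic)  # 0.0 0.5
```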
A layer:
learnable weights
layer
linear combo
activations
A (fully-connected, feed-forward) neural network:
layer
input
neuron
learnable weights
Engineers choose:
hidden
output
some appropriately weighted sum
recall this example
\(f =\sigma(\cdot)\)
\(f(\cdot)\): identity function
\(-3(\sigma_1 +\sigma_2)\)
can be represented as:
e.g. forward-pass of a linear regressor
e.g. forward-pass of a linear logistic classifier
\(\dots\)
Forward pass: evaluate, given the current parameters
linear combination
loss function
(nonlinear) activation
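A forward pass for a single data point might look like the sketch below. It assumes one hidden ReLU layer, an identity output, and squared loss. The shapes, the random weights, and the names `forward` and `squared_loss` are all hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Evaluate the network on one data point, given current parameters."""
    W1, b1, W2, b2 = params
    z1 = W1.T @ x + b1   # linear combination (hidden layer)
    a1 = relu(z1)        # (nonlinear) activation
    z2 = W2.T @ a1 + b2  # linear combination (output layer)
    return z2            # identity output activation, as in regression

def squared_loss(g, y):
    return float(np.sum((g - y) ** 2))

rng = np.random.default_rng(0)
d, m = 3, 4  # input dimension, number of hidden neurons
params = (rng.normal(size=(d, m)), np.zeros(m),
          rng.normal(size=(m, 1)), np.zeros(1))

x = rng.normal(size=d)  # a single data point
g = forward(x, params)
loss = squared_loss(g, y=np.array([1.0]))
```

Nothing is learned here: the forward pass only evaluates the prediction and the loss at the current weights.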
Hidden layer activation function \(f\) choices
\(\sigma\) used to be the most popular
nowadays, the default choice: ReLU, \(\text{ReLU}(z) = \max(0, z)\)
very simple functional form (and so is its gradient).
compositions of ReLU(s) can be quite expressive
asymptotically (given enough neurons), can approximate any continuous function on a bounded domain arbitrarily well (for regression)
therefore can also approximate arbitrary decision boundaries (for classification)
\(\text{ReLU}(-x_1 - x_2 - 1) + \text{ReLU}(x_1 - x_2 - 0.5) + \text{ReLU}(-x_1 + x_2 - 0.5) + \text{ReLU}(x_1 + x_2)\)
just 4 ReLU neurons already carve out a non-trivial surface
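A sketch of that surface, assuming the four ReLU units are combined by a plain sum with unit output weights. That is an assumption: the output weights are not given in the text. The name `surface` is hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hidden-layer weights and biases read off the four ReLU units
W = np.array([[-1.0, -1.0],
              [ 1.0, -1.0],
              [-1.0,  1.0],
              [ 1.0,  1.0]])
b = np.array([-1.0, -0.5, -0.5, 0.0])

def surface(x):
    """Sum of the four ReLU outputs at input x = (x1, x2)."""
    return float(relu(W @ np.asarray(x, dtype=float) + b).sum())

print(surface([0, 0]))    # 0.0  (all four units inactive)
print(surface([1, 1]))    # 2.0  (only ReLU(x1 + x2) is active)
print(surface([2, 0]))    # 3.5  (two units active: 1.5 + 2)
print(surface([-2, -2]))  # 3.0  (only ReLU(-x1 - x2 - 1) is active)
```

Each unit is flat on one side of a line and rises linearly on the other, so the sum is piecewise linear with kinks along four lines.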
However, for neural networks, the precise relationship between model complexity and overfitting remains an active area of research; examples include the double-descent curve and scaling laws.
output layer design choices
e.g., say \(K=5\) classes
input \(x\)
hidden layer(s)
output layer
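For the \(K=5\) case, the output layer can be sketched as a softmax over five pre-activations \(z_1,\ldots,z_5\). The values of `z` below are made up, and subtracting the maximum is a common numerical-stability trick rather than something stated on the slide.

```python
import numpy as np

def softmax(z):
    """Turn K real-valued scores into K probabilities that sum to 1."""
    z = z - np.max(z)  # does not change the result, avoids overflow
    e = np.exp(z)
    return e / e.sum()

# Hypothetical output-layer pre-activations for K = 5 classes
z = np.array([2.0, 1.0, 0.1, -1.0, 0.0])
g = softmax(z)
predicted_class = int(np.argmax(g))  # index of the most probable class
print(g.round(3), predicted_class)
```

For binary classification a single sigmoid output plays the same role.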
Linear models are convenient but lack expressiveness for most real-world tasks.
A fixed non-linear feature transformation lets us use linear methods on complex problems, but designing features by hand gets tedious.
Neural networks automate feature learning: layers alternate parameterized linear maps with non-linear activations (typically ReLU).
For classification outputs, we apply sigmoid (binary) or softmax (multi-class).
Next time: How do we train neural networks? (gradient descent + backpropagation)