Recall the squared loss for linear regression:
\(\mathcal{L}\left(h\left(x^{(i)}\right), y^{(i)}\right) =(\theta^T x^{(i)}- y^{(i)} )^2\)
Here \(x\) holds the features, \(y^{(i)}\) is the label, \(\theta\) holds the parameters, and \(h\left(x ; \theta\right) = \theta^T x\) is the guess (prediction).
See lec1/rec1 for discussion of the offset.
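As a quick illustration (a minimal NumPy sketch, not from the original notes; the numbers are made up), here is the guess and the squared loss for one data point:

```python
import numpy as np

# Hypothetical 1-d example: feature x^{(i)}, label y^{(i)}, candidate parameter theta.
x = np.array([3.0])
y = 6.0
theta = np.array([1.5])

h = theta @ x            # guess (prediction): theta^T x = 4.5
loss = (h - y) ** 2      # squared loss = 2.25
print(h, loss)
```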
Recall the closed-form solution
\(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\)
Let
\(X = \begin{bmatrix}x_1^{(1)} & \dots & x_d^{(1)}\\\vdots & \ddots & \vdots\\x_1^{(n)} & \dots & x_d^{(n)}\end{bmatrix} \in \mathbb{R}^{n\times d}\), \(Y = \begin{bmatrix}y^{(1)}\\\vdots\\y^{(n)}\end{bmatrix} \in \mathbb{R}^{n\times 1}\), \(\theta = \begin{bmatrix}\theta_{1}\\\vdots\\\theta_{d}\end{bmatrix} \in \mathbb{R}^{d\times 1}\).
Then
\( J(\theta) =\frac{1}{n}({X} \theta-{Y})^{\top}({X} \theta-{Y}) \in \mathbb{R}^{1\times 1}\).
By matrix calculus and optimization,
\(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\).
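A minimal NumPy sketch of this closed form (an illustration, assuming \(X^\top X\) is invertible; the function name and synthetic data are made up). It solves the normal equations rather than explicitly inverting \(X^\top X\):

```python
import numpy as np

def ols_closed_form(X, Y):
    """theta* = (X^T X)^{-1} X^T Y, computed by solving the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Tiny synthetic check: noiseless labels generated from a known theta.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # n = 100 points, d = 3 features
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true
print(ols_closed_form(X, Y))              # recovers approximately [ 1. -2.  0.5]
```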
Spotted in lab:
\(J(\theta) = (3 \theta-6)^{2}\)
This is the least-squares objective for 1-d feature training data \((x,y) = (3,6)\):
\(X=x = [3]\)
\(Y=y = [6]\)
\(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\)
\(\theta^*=(xx)^{-1}(xy)\)
\(=\frac{xy}{xx} =\frac{y}{x} = \frac{6}{3}=2\)
1-d feature training data set:
point | \(x\) | \(y\) |
p1 | 2 | 5 |
p2 | 3 | 6 |
p3 | 4 | 7 |
\(X = \begin{bmatrix}2 \\3\\4\end{bmatrix}\)
\(Y = \begin{bmatrix}5 \\6 \\7\end{bmatrix}\)
\(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\)
\(=\frac{X^{\top}Y}{X^{\top}X}= \frac{56}{29}\approx 1.93\)
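A quick numeric check of the two worked examples above (a sketch; NumPy is assumed):

```python
import numpy as np

# Lab example: single point (x, y) = (3, 6)
X = np.array([[3.0]]); Y = np.array([6.0])
print(np.linalg.solve(X.T @ X, X.T @ Y))    # [2.]

# Three-point example: x = 2, 3, 4 with y = 5, 6, 7
X = np.array([[2.0], [3.0], [4.0]]); Y = np.array([5.0, 6.0, 7.0])
print(np.linalg.solve(X.T @ X, X.T @ Y))    # [1.9310...]  i.e. 56/29
```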
Now consider \(d=1\); assume \(n=1\) and \(y=1\). Then \(\theta^*=\frac{1}{x}\).
If the data is \((x,y) = (0.002,1)\), then \(\theta^* = 500\).
If the data is \((x,y) = (-0.0002,1)\), then \(\theta^* = -5,000\).
A tiny change in the data produces a wildly different \(\theta^*\).
Most of the time, the formula behaves nicely, but we run into trouble when \(\left({X}^{\top} {X}\right)\) is singular. Equivalently:
\({X}\) is not full column rank
\(\left({X}^{\top} {X}\right)\) has zero eigenvalue(s)
\(\left({X}^{\top} {X}\right)\) is not full rank
the determinant of \(\left({X}^{\top} {X}\right)\) is zero
More generally, for \(d\geq1\), \(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\) behaves nicely most of the time, but we run into trouble when \({X}\) is not full column rank, since then \(X^{\top}X\) is singular. \({X}\) is not full column rank when:
(a). \(d=1\) and \(X \in \mathbb{R}^{n \times 1} \) is simply an all-zero vector
(b). \(n<d\), e.g. a single data point \((x_1, x_2) = (2,3),\ y =4\) with \(d=2\) features
(c). columns (features) in \( {X} \) are linearly dependent, e.g.
\((x_1, x_2) = (2,3),\ y =7\)
\((x_1, x_2) = (4,6),\ y =8\)
\((x_1, x_2) = (6,9),\ y =9\)
(here \(x_2 = 1.5\,x_1\))
All three cases have similar visual interpretations, and in each case there are infinitely many optimal \(\theta\).
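To make the three cases concrete, here is a small NumPy check (a sketch using the example data above; no offset/bias feature is included):

```python
import numpy as np

def is_singular(X):
    """True when X^T X is rank-deficient, i.e. X is not full column rank."""
    return np.linalg.matrix_rank(X.T @ X) < X.shape[1]

print(is_singular(np.zeros((3, 1))))                        # (a) all-zero column: True
print(is_singular(np.array([[2.0, 3.0]])))                  # (b) n=1 < d=2: True
print(is_singular(np.array([[2.0, 3.0],
                            [4.0, 6.0],
                            [6.0, 9.0]])))                  # (c) x2 = 1.5*x1: True
```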
For intuition: predicting energy used \(y\) from temperature \(x_1\) and population \(x_2\) is a typical well-behaved setup, but predicting energy used \(y\) from temperature (°F) \(x_1\) and temperature (°C) \(x_2\) uses two features carrying the same information, so the feature columns are (effectively) linearly dependent. For such data, the MSE has no unique minimizer, and
\(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\) is not well-defined
Quick Summary:
Typically, \(X\) is full column rank, and \(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\) behaves nicely 🥰.
When \(X\) is not full column rank 🥺, \(\theta^*\) is not well-defined; the formula isn't wrong, the data is trouble-making.
When \(X^\top X\) is almost singular: technically, \(\theta^*\) does exist and does give the unique optimal hyperplane, but
\(\theta^*\) tends to have huge magnitude
\(\theta^*\) tends to be very sensitive to small changes in the data
\(\theta^*\) tends to overfit
(Think of a spectrum from 🥺 to 🥰 as \(X^\top X\) becomes more invertible.)
In this almost-singular regime there are other hypotheses (other \(\theta\)) which do reasonably well in fitting the training data, and we prefer a \(\theta\) with small magnitude (less sensitive prediction when \(x\) changes slightly).
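To illustrate the almost-singular regime (a sketch with made-up, nearly collinear data): \(\theta^*\) comes out with huge magnitude, and a tiny change in the labels moves it a lot.

```python
import numpy as np

# Two nearly collinear feature columns: x2 is almost 1.5 * x1.
X = np.array([[2.0, 3.001],
              [4.0, 6.000],
              [6.0, 8.999]])
Y1 = np.array([7.0, 8.0, 9.0])
Y2 = Y1 + np.array([0.01, 0.0, -0.01])      # tiny perturbation of the labels

def solve(Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)

print(solve(Y1))               # huge magnitude: roughly [-4498., 3000.]
print(solve(Y2) - solve(Y1))   # a 0.01 label change shifts theta* by roughly [-15., 10.]
```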
Regularization: Ridge Regression
When \(X^\top X\) is almost singular, the remedy is regularization: set up a new optimization problem that prefers small-magnitude \(\theta\). For least squares this gives ridge regression.
Consider the case (c) training data set again:
\((x_1, x_2) = (2,3),\ y =7\)
\((x_1, x_2) = (4,6),\ y =8\)
\((x_1, x_2) = (6,9),\ y =9\)
Comments on \(\lambda\):
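The ridge objective and its solution aren't written out above, so this sketch assumes one common convention: minimize \(\frac{1}{n}\|X\theta-Y\|^2 + \lambda\|\theta\|^2\), whose minimizer is \(\left(X^\top X + n\lambda I\right)^{-1} X^\top Y\) (other treatments scale the penalty differently). Applying it to the case (c) data shows the effect of \(\lambda\):

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Ridge solution under the (1/n)*MSE + lam*||theta||^2 convention (an assumption here)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

# Case (c) data: x2 = 1.5 * x1, so X^T X itself is singular.
X = np.array([[2.0, 3.0], [4.0, 6.0], [6.0, 9.0]])
Y = np.array([7.0, 8.0, 9.0])

for lam in [0.01, 0.1, 1.0, 10.0]:
    theta = ridge_closed_form(X, Y, lam)
    print(lam, theta, np.linalg.norm(theta))
# Larger lambda -> smaller-magnitude theta (and a larger training MSE).
```

For any \(\lambda > 0\), the matrix \(X^\top X + n\lambda I\) is positive definite and hence invertible, even though \(X^\top X\) itself is singular here.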
Regression
A regression learning algorithm 💻 takes training data \(\mathcal{D}_\text{train} = \left\{\left(x^{(1)}, y^{(1)}\right), \dots, \left(x^{(n)}, y^{(n)}\right)\right\}\), with each \(x^{(i)} \in \mathbb{R}^d \) and \(y^{(i)} \in \mathbb{R}\), and outputs a hypothesis 🧠.
Validation
Run the regression learning algorithm 💻 on \(\mathcal{D}_\text{train}\), fix the resulting hypothesis 🧠, then evaluate it on a separate validation set \(\left\{\left(x^{(1)}, y^{(1)}\right), \dots, \left(x^{(n)}, y^{(n)}\right)\right\}\) to get the validation error.
Cross-validation
Split the data \(\left\{\left(x^{(1)}, y^{(1)}\right), \dots, \left(x^{(n)}, y^{(n)}\right)\right\}\) into \(k\) chunks. For each chunk: run the regression learning algorithm 💻 on \(\mathcal{D}_\text{train}\) made of the other \(k-1\) chunks, fix the resulting hypothesis 🧠, and evaluate it on the held-out chunk, giving the chunk-1 validation error, chunk-2 validation error, ..., chunk-\(k\) validation error. Averaging these gives the cross-validation error.
Comments on (cross)-validation
good idea to shuffle data first
a way to "reuse" data
cross-validation is more "reliable" than validation (less sensitive to chance)
its purpose is not to evaluate a hypothesis (that's what the testing error is for)
rather, it's to evaluate a learning algorithm (e.g. hypothesis class choice, hyperparameter choice)
can have an outer loop for picking a good hyperparameter or hypothesis class (a sketch follows below)
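A minimal sketch of \(k\)-fold cross-validation with such an outer loop over \(\lambda\) (the chunking scheme, the ridge helper, and the synthetic data are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def ridge(X, Y, lam):
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

def cross_val_mse(X, Y, lam, k=5, seed=0):
    """Average validation MSE over k chunks (shuffling the data first)."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)    # good idea to shuffle first
    chunks = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = chunks[i]                                  # chunk-i: held out for validation
        train = np.concatenate([chunks[j] for j in range(k) if j != i])
        theta = ridge(X[train], Y[train], lam)           # train on the other chunks
        errs.append(np.mean((X[val] @ theta - Y[val]) ** 2))
    return np.mean(errs)                                 # cross-validation error

# Outer loop over candidate hyperparameters (synthetic data for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)
for lam in [0.0, 0.01, 0.1, 1.0]:
    print(lam, cross_val_mse(X, Y, lam))
```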
The closed-form formula for OLS is not well-defined when \(X^TX\) is singular, and in that case there are infinitely many optimal \(\theta^*\).
Even when \(X^TX\) is merely ill-conditioned, we get sensitivity issues and many almost-as-good solutions, while the single best \(\theta^*\) tends to overfit the data.
We need to indicate our preference somehow, and also fight overfitting.
Regularization helps battle overfitting by constructing a new optimization problem that implicitly prefers small-magnitude \(\theta.\)
Least-squares regularization leads to the ridge-regression formulation. (Good news: we can still solve it analytically!)
\(\lambda\) trades off training MSE and regularization strength; it's a hyperparameter.
Validation and cross-validation are ways to choose (regularization) hyperparameters.
We'd love to hear your thoughts.
Why does \({X}\) not being full column rank imply that \(\left({X}^{\top} {X}\right)\) is not full rank? \({X}\) not full column rank means rank\((X) < d\), and \(\left({X}^{\top} {X}\right)\) not full rank means rank\(({X}^{\top} {X}) < d\). Since rank\(({AB})\leq\min\)[ rank\((A)\), rank\((B)\)], we get rank\(({X}^{\top} {X}) \leq\) rank\((X) < d\).
For this \(X^{\top}X\) matrix, since it's not just a generic, arbitrary matrix, we can read a bit more into it: it's square and symmetric.
From the data's perspective, when is this product matrix singular? Recall
\(X = \begin{bmatrix}x_1^{(1)} & \dots & x_d^{(1)}\\\vdots & \ddots & \vdots\\x_1^{(n)} & \dots & x_d^{(n)}\end{bmatrix}\)
where each row is a data point and each column is a feature (feature 1, feature 2, ...). The entries of \(X^\top X\) are inner products between these feature columns, so \(X^\top X\) is singular exactly when the feature columns are linearly dependent.
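A small numeric check of this rank relationship (a sketch with made-up matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))          # generic matrix: full column rank (almost surely)
B = A.copy()
B[:, 2] = 2 * B[:, 0] - B[:, 1]      # force column 3 to be linearly dependent

for X in (A, B):
    print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X.T @ X))
# rank(X) and rank(X^T X) agree: 3 3 for A, then 2 2 for B.
```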