Supervised Learning: Least Squares Regression

ML in Feedback Sys #2

Fall 2025, Prof Sarah Dean

Linear least-squares

"What we do"

  • Given: data \(\{x_i,y_i\}_{i=1}^n\)
    • Possible feature transformation \(x\leftarrow\phi(x)\)
  • Find linear coefficients:
    • Directly (pick a \(\lambda\geq 0\)) $$\displaystyle \hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top + \lambda I\Big)^{-1}\sum_{i=1}^n y_ix_i $$
    • Or iteratively (pick a \(\lambda,\eta\geq 0\)) $$\theta_{t+1} = \theta_t - \eta\Big( \sum_{i=1}^n(\theta_t^\top x_i - y_i) x_i + \lambda \theta_t\Big)$$
  • Make predictions with \(\hat y = \hat\theta^\top x\)
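
A minimal numpy sketch of both routes (the function names, step-size choice, and synthetic data below are illustrative, not from the lecture):

```python
import numpy as np

def fit_direct(X, y, lam=0.1):
    """Closed form: (sum_i x_i x_i^T + lam*I)^{-1} sum_i y_i x_i."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def fit_gd(X, y, lam=0.1, eta=None, iters=1000):
    """Gradient descent on the regularized squared loss."""
    d = X.shape[1]
    if eta is None:
        # choose a step size below 2 / (lam + lambda_max(sum_i x_i x_i^T)); see Fact 2
        eta = 1.0 / (lam + np.linalg.eigvalsh(X.T @ X)[-1])
    theta = np.zeros(d)
    for _ in range(iters):
        theta = theta - eta * (X.T @ (X @ theta - y) + lam * theta)
    return theta

# illustrative synthetic data: rows of X are the x_i
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta_hat = fit_direct(X, y)
print(theta_hat)                   # closed-form solution
print(fit_gd(X, y))                # should match the closed form
x_new = np.array([0.5, 1.0, -1.0])
print(theta_hat @ x_new)           # prediction y_hat = theta_hat^T x
```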

Squared Loss

"Why we do it"

  • The empirical risk minimization problem for linear models and squared loss $$\min_{\theta\in \mathbb R^d} \sum_{i=1}^n (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2$$
  • Fact 1: \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top + \lambda I\Big)^{-1}\sum_{i=1}^n y_ix_i\) is the unique minimizer when \(\lambda > 0\)
  • Fact 2: As long as \(0 < \eta < \frac{2}{\lambda + \lambda_{\max}(\sum_{i=1}^n x_i x_i^\top)}\), gradient descent converges to \(\hat\theta\)
  • Fact 3: When \(\lambda=0\), as long as \(0 < \eta < \frac{2}{\lambda_{\max}(\sum_{i=1}^n x_i x_i^\top)}\), gradient descent with \(\theta_0=0\) converges to the minimum norm solution \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top \Big)^{\dagger}\sum_{i=1}^n y_ix_i\).

Optimization

The empirical risk objective is

$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2$$

Review: critical points

  • Critical points = where gradient is zero
  • Solutions to unconstrained optimization of differentiable function occur at critical points

[Figure: examples of critical points (global max, local max, global min, local min, saddle)]

Proof of Fact 1.

  • The optimum \(\hat\theta\) must satisfy \(\nabla J(\hat \theta) = 0\), where \(\nabla J(\theta) = 2\sum_{i=1}^n(\theta^\top x_i - y_i) x_i + 2\lambda \theta\)
  • Therefore we must have $$\Big(\sum_{i=1}^n x_i x_i^\top + \lambda I\Big) \hat \theta = \sum_{i=1}^n y_ix_i$$
  • For \(\lambda>0\), unique solution is \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top + \lambda I\Big)^{-1}\sum_{i=1}^n y_ix_i\)

Review: linear algebra

  • For \(A\in\mathbb R^{d\times d}\), the equation \(A \theta = b\) will have a unique solution \(A^{-1}b\) if \(A\) is rank \(d\) (full rank)

  • For positive semi-definite \(M_1 \succeq 0\) and positive definite \(M_2 \succ 0\), \(M_1+M_2\succ 0\)
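
Combining these two facts explains why the matrix in Fact 1 is invertible: for any nonzero \(v\in\mathbb R^d\) and \(\lambda>0\),

$$v^\top\Big(\sum_{i=1}^n x_i x_i^\top + \lambda I\Big)v = \sum_{i=1}^n (x_i^\top v)^2 + \lambda\|v\|_2^2 > 0,$$

so \(\sum_{i=1}^n x_i x_i^\top + \lambda I \succ 0\) and in particular it has rank \(d\).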

[Figure: example objectives that are strongly convex (one global min), convex (infinitely many global minima), and nonconvex (local and global minima)]

Review: convexity (resource)

  • For a differentiable convex function, all critical points are global minima

Gradient descent

The empirical risk objective is

$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2$$

Gradient descent

$$\theta_{t+1} = \theta_t - \eta\Big( \sum_{i=1}^n(\theta_t^\top x_i - y_i) x_i + \lambda \theta_t\Big)$$

Proof of Fact 2 (GD converges).

  • Subtracting \(\hat\theta\) from both sides of the update and using \(\big(\sum_{i=1}^n x_i x_i^\top + \lambda I\big)\hat\theta = \sum_{i=1}^n y_i x_i\): $$\theta_{t+1} -\hat\theta = \Big((1-\eta\lambda) I - \eta \sum_{i=1}^n x_i x_i^\top  \Big) (\theta_t-\hat\theta)$$

  • Consequently, $$\|\theta_{t+1} -\hat\theta\|_2 \leq \underbrace{\Big\|(1-\eta\lambda) I - \eta \sum_{i=1}^n x_i x_i^\top  \Big\|_2}_{= \max_j |1-\eta(\lambda + \lambda_j(\sum_{i=1}^n x_i x_i^\top))|} \|\theta_t-\hat\theta\|_2$$

  • So as long as \(0 < \eta < \frac{2}{\lambda + \lambda_{\max}(\sum_{i=1}^n x_i x_i^\top)}\), we have \(\|\theta_t-\hat\theta\|_2 \leq  \rho^t\|\theta_0-\hat\theta\|_2\) for some \(0<\rho<1\)
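
A quick numerical check of this contraction (the data, \(\lambda\), and step size below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 0.5

A = X.T @ X + lam * np.eye(4)
theta_hat = np.linalg.solve(A, X.T @ y)

# step size just under the 2 / (lam + lambda_max) bound from Fact 2
eta = 1.9 / (lam + np.linalg.eigvalsh(X.T @ X)[-1])
rho = np.max(np.abs(1 - eta * np.linalg.eigvalsh(A)))   # contraction factor from the proof
print(rho)                                              # strictly less than 1

theta = np.zeros(4)
for t in range(200):
    theta = theta - eta * (X.T @ (X @ theta - y) + lam * theta)
print(np.linalg.norm(theta - theta_hat))                # roughly rho**200 * ||theta_0 - theta_hat||
```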

Minimum norm solution

The empirical risk objective is

$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 $$

Lemma: The minimum norm solution is \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top \Big)^{\dagger}\sum_{i=1}^n y_ix_i\).

  • Recall that any minimizers must satisfy \(\sum_{i=1}^n x_i x_i^\top \hat \theta = \sum_{i=1}^n y_ix_i\)

\(\sum_{i=1}^n x_i x_i^\top\) may not be full rank due to redundancy in features

  • example: the first entry of \(x\) is always \(0\)

    • then the first entry of least-norm \(\hat\theta\) will be zero

  • example: the first entry of \(x\) is twice the second entry of \(x\)

    • then the least-norm \(\hat\theta\) satisfies \(\hat\theta_1 = 2\hat\theta_2\), since the direction \((1,-2,0,\dots,0)\) lies in the null space
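
A small numerical illustration of the first example (synthetic data with a dead first feature; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
X[:, 0] = 0.0                          # first feature is identically zero
y = rng.normal(size=20)

G = X.T @ X                            # rank deficient: e_1 lies in its null space
theta_min_norm = np.linalg.pinv(G) @ (X.T @ y)
print(theta_min_norm)                  # first entry is (numerically) zero
```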

Review: linear algebra

  • A square symmetric matrix \(A\) is always diagonalizable, i.e. there exists an orthogonal matrix \(V\) (with \(V^\top V = I\)) and a diagonal matrix \(S\) such that \(A = VSV^\top\)
  • In this case, the pseudo-inverse is \(A^\dagger = V S^\dagger V^\top\) where $$S^{\dagger} = \mathrm{diag}(s_1,s_2,\dots, s_r, 0, \dots, 0)^{\dagger} = \mathrm{diag}(1/s_1,1/s_2,\dots, 1/s_r, 0, \dots, 0)$$
  • When \(b\in\mathrm{range}(A)\), the solution set to \(Ax=b\) is \(\{A^\dagger b + u \mid u\in\mathrm{null}(A)\}\) (nullspace)
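
For instance, the pseudo-inverse of a symmetric matrix can be built from its eigendecomposition and compared against `np.linalg.pinv` (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 2))
A = B @ B.T                                     # symmetric, rank 2 (so not invertible)

s, V = np.linalg.eigh(A)                        # A = V diag(s) V^T with orthogonal V
s_pinv = np.array([1.0 / si if abs(si) > 1e-10 else 0.0 for si in s])
A_pinv = V @ np.diag(s_pinv) @ V.T

print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True
```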

  • So the solution set is \((\sum_{i=1}^n x_i x_i^\top)^\dagger \sum_{i=1}^n y_ix_i+u\) for \(u\in\mathrm{null}(\sum_{i=1}^n x_ix_i^\top)\)

  • Note that \(\mathrm{range}(\sum_{i=1}^n x_i x_i^\top)=\mathrm{span}(x_1,...,x_n)\)

  • Thus the two terms are orthogonal (the first lies in the range, \(u\) in the null space), so the norm is smallest when \(u=0\)

Minimum norm solution

The empirical risk objective is

$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 $$

Gradient descent

$$\theta_{t+1} = \theta_t - \eta\sum_{i=1}^n(\theta_t^\top x_i - y_i) x_i$$

Proof of Fact 3 (GD converges to min norm).

  • Note that \(\theta_t\in\mathrm{span}(x_i) \implies \theta_{t+1} \in \mathrm{span}(x_i)\), since the update adds a linear combination of the \(x_i\); with \(\theta_0=0\), every iterate stays in the span

  • Like before, we have $$\theta_{t+1} -\hat\theta = \Big( I - \eta \sum_{i=1}^n x_i x_i^\top  \Big) (\theta_t-\hat\theta)$$

  • Exercise: extend the argument from Fact 2, accounting for the fact that in general \(\max_j |1-\eta \lambda_j(\sum_{i=1}^n x_i x_i^\top)| = 1\) (some eigenvalues are zero when the features are redundant)
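
A numerical sanity check of Fact 3 (redundant synthetic features and an arbitrary step size below the bound; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
X[:, 0] = 2 * X[:, 1]                       # redundant feature, so sum_i x_i x_i^T is rank deficient
y = rng.normal(size=30)

G = X.T @ X
theta_min_norm = np.linalg.pinv(G) @ (X.T @ y)

eta = 1.0 / np.linalg.eigvalsh(G)[-1]       # below the 2 / lambda_max bound
theta = np.zeros(5)                         # theta_0 = 0, so every iterate stays in span(x_i)
for t in range(5000):
    theta = theta - eta * (X.T @ (X @ theta - y))

print(np.linalg.norm(theta - theta_min_norm))   # small: GD converged to the min-norm solution
```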

Summary: linear least-squares

  • Linear least-squares empirical risk minimization problem: $$\min_{\theta\in\mathbb R^d} \sum_{i=1}^n \left(\theta^\top x_i - y_i\right)^2 + \lambda \|\theta\|_2^2$$
  • The minimizer is unique when \(\lambda>0\), and gradient descent converges when the step size is sufficiently small
  • When \(\lambda=0\) and the features are redundant, minimizers are not unique, and gradient descent depends on initialization

What if we don't already have a good, finite-dimensional feature embedding?

Linear representation of nonlinear functions?

We can encode rich representations by expanding the features (increasing \(d\))

For example, \(y = (x-1)^2\) is nonlinear in \(x\) but linear in the expanded features:

$$y = \begin{bmatrix}1\\-2\\1\end{bmatrix}^\top \underbrace{\begin{bmatrix}1\\x\\x^2\end{bmatrix}}_{\varphi(x)}$$

Polynomials of degree \(n\) in \(d_0\) original features require \(d=\binom{d_0+n}{d_0}\approx \frac{(d_0+n)^{d_0+n}}{d_0^{d_0}n^n}\) dimensions
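
For example, fitting the quadratic above with an explicit degree-2 feature expansion (a hypothetical sketch; `poly_features` is not lecture code):

```python
import numpy as np

def poly_features(x, degree=2):
    """Map each scalar x to [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

x = np.linspace(-2.0, 4.0, 50)
y = (x - 1.0) ** 2

Phi = poly_features(x)                                  # 50 x 3 matrix with rows [1, x, x^2]
lam = 1e-6
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)
print(theta)                                            # approximately [1, -2, 1]
```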

Kernel ridge regression

"What we do"

  • Given: data \(\{x_i,y_i\}_{i=1}^n\) and kernel function \(k:\mathcal X\times\mathcal X\to\mathbb R\)
  • Find coefficients:
    • Construct kernel matrix \(K\in\mathbb R^{n\times n}\) with \(K_{ij} = k(x_i,x_j)\)
      and target vector \(Y = \begin{bmatrix}y_1 & \dots & y_n\end{bmatrix}^\top\)
    • Compute \(\alpha = (K+\lambda I)^{-1} Y\)
  • Make predictions with \(\hat y = \sum_{i=1}^n \alpha_i k(x, x_i)\)
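
A direct translation of this recipe into numpy (the kernel is passed in as a Python function; function names are illustrative):

```python
import numpy as np

def krr_fit(X, Y, k, lam=0.1):
    """Kernel ridge regression coefficients: alpha = (K + lam*I)^{-1} Y."""
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), Y)

def krr_predict(x, X, alpha, k):
    """Prediction: y_hat = sum_i alpha_i k(x, x_i)."""
    return sum(a * k(x, xi) for a, xi in zip(alpha, X))
```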

High dim features

"Why we do it"

  • Recall the empirical risk minimization problem $$\min_{\theta\in \mathbb R^d} \sum_{i=1}^n (\theta^\top \varphi(x_i) - y_i)^2 + \lambda \|\theta\|_2^2$$
  • Fact 4: Suppose that \(k(x,x') = \varphi(x)^\top \varphi(x')\). Then the predictions made by the ERM solution are identical to those of KRR: \(\hat y = \hat\theta^\top \varphi(x)=\sum_{i=1}^n \alpha_i k(x, x_i)\).

Equivalence

The empirical risk objective is

$$J(\theta) = \sum_{i=1}^n (\theta^\top \varphi(x_i) - y_i)^2 + \lambda \|\theta\|_2^2$$

Proof of Fact 4.

  • We have that \(\hat\theta^\top \varphi(x) = \sum_{i=1}^n y_i\varphi(x_i)^\top \Big( \sum_{i=1}^n \varphi(x_i)\varphi(x_i)^\top + \lambda I\Big)^{-1}\varphi(x)\)
    • in matrix form, \( =Y^\top \Phi (\Phi^\top\Phi + \lambda I)^{-1} \varphi(x)\), where \(\Phi\in\mathbb R^{n\times d}\) has rows \(\varphi(x_i)^\top\)
  • Matrix inversion lemma: \(U(V U +  I)^{-1} = (UV + I)^{-1}U\)
  • Therefore \(\hat\theta^\top \varphi(x) = Y^\top (\underbrace{\Phi\Phi^\top }_K+ \lambda I)^{-1} \Phi \varphi(x)= \alpha^\top \Phi \varphi(x)=\sum_{i=1}^n \alpha_i k(x, x_i)\).
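
A numerical check of Fact 4, taking \(\varphi\) to be degree-2 polynomial features and \(k(x,x')=\varphi(x)^\top\varphi(x')\) (an illustrative choice, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=20)
y = rng.normal(size=20)
lam = 0.3

phi = lambda t: np.array([1.0, t, t ** 2])              # explicit feature map
k = lambda s, t: phi(s) @ phi(t)                        # induced kernel

# explicit-feature ridge regression
Phi = np.stack([phi(t) for t in x])                     # n x d, rows phi(x_i)^T
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)

# kernel ridge regression
K = Phi @ Phi.T                                         # K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(20), y)

x_test = 0.7
print(theta @ phi(x_test))                              # ERM prediction
print(alpha @ np.array([k(x_test, t) for t in x]))      # KRR prediction: identical
```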

Example: RBF Kernel
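
For reference, the (Gaussian) RBF kernel is \(k(x,x') = \exp\big(-\|x-x'\|_2^2/(2\sigma^2)\big)\), which corresponds to an infinite-dimensional feature map. A minimal 1D kernel ridge regression with it (the bandwidth \(\sigma\), regularization, and data below are illustrative choices):

```python
import numpy as np

sigma, lam = 0.5, 1e-3
x = np.linspace(0.0, 2 * np.pi, 40)
y = np.sin(x)                                                    # nonlinear target

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))   # K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

x_test = 1.0
y_hat = alpha @ np.exp(-(x_test - x) ** 2 / (2 * sigma ** 2))    # sum_i alpha_i k(x_test, x_i)
print(y_hat)                                                     # close to sin(1.0), about 0.84
```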

Summary: kernel regression

  • Kernel regression is linear least-squares in a high (possibly infinite) dimensional feature space
  • Selecting a kernel \(\iff\) choosing a nonlinear feature transformation

Recap

  • Linear least-squares & kernel regression
  • Review of convexity, optimization, linear algebra

  • For more, see Ch 4 of Hardt & Recht, "Patterns, Predictions, and Actions" mlstory.org.

Next time: sequential data
