Supervised Learning: Least Squares Regression

ML in Feedback Sys #2

Fall 2025, Prof Sarah Dean

Linear least-squares

"What we do"

  • Given: data \(\{x_i,y_i\}_{i=1}^n\)
    • Possible feature transformation \(x\leftarrow\phi(x)\)
  • Find linear coefficients:
    • Directly (pick a \(\lambda\geq 0\)) $$\displaystyle \hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top + \lambda I\Big)^{-1}\sum_{i=1}^n y_ix_i $$
    • Or iteratively (pick a \(\lambda,\eta\geq 0\)) $$\theta_{t+1} = \theta_t - \eta\Big( \sum_{i=1}^n(\theta_t^\top x_i - y_i) x_i + \lambda \theta_t\Big)$$
  • Make predictions with \(\hat y = \hat\theta^\top x\)
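
A minimal numpy sketch of both routes (the function names, step-size choice, and synthetic data below are illustrative, not from the lecture):

```python
import numpy as np

def fit_direct(X, y, lam=0.1):
    """Closed form: (sum_i x_i x_i^T + lam*I)^{-1} sum_i y_i x_i."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def fit_gd(X, y, lam=0.1, eta=None, iters=1000):
    """Gradient descent on the regularized squared loss."""
    d = X.shape[1]
    if eta is None:
        # choose a step size below 2 / (lam + lambda_max(sum_i x_i x_i^T)); see Fact 2
        eta = 1.0 / (lam + np.linalg.eigvalsh(X.T @ X)[-1])
    theta = np.zeros(d)
    for _ in range(iters):
        theta = theta - eta * (X.T @ (X @ theta - y) + lam * theta)
    return theta

# illustrative synthetic data: rows of X are the x_i
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta_hat = fit_direct(X, y)
print(theta_hat)                   # closed-form solution
print(fit_gd(X, y))                # should match the closed form
x_new = np.array([0.5, 1.0, -1.0])
print(theta_hat @ x_new)           # prediction y_hat = theta_hat^T x
```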

Squared Loss

"Why we do it"

  • The empirical risk minimization problem for linear models and squared loss $$\min_{\theta\in \mathbb R^d} \sum_{i=1}^n (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2$$
  • Fact 1: \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top + \lambda I\Big)^{-1}\sum_{i=1}^n y_ix_i\) is the unique minimizer when \(\lambda > 0\)
  • Fact 2: As long as \(0 < \eta < \frac{2}{\lambda + \lambda_{\max}(\sum_{i=1}^n x_i x_i^\top)}\), gradient descent converges to \(\hat\theta\)
  • Fact 3: When \(\lambda=0\), as long as \(0 < \eta < \frac{2}{\lambda_{\max}(\sum_{i=1}^n x_i x_i^\top)}\), gradient descent with \(\theta_0=0\) converges to the minimum norm solution \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top \Big)^{\dagger}\sum_{i=1}^n y_ix_i\).

Optimization

The empirical risk objective is

$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2$$

Review: critical points

  • Critical points = where gradient is zero
  • Solutions to unconstrained optimization of differentiable function occur at critical points

[Figure: examples of critical points (global max, local max, global min, local min, saddle)]

Proof of Fact 1.

  • The optimum \(\hat\theta\) must satisfy \(\nabla J(\hat \theta) = 0\), where \(\nabla J(\theta) = 2\sum_{i=1}^n(\theta^\top x_i - y_i) x_i + 2\lambda \theta\)
  • Therefore we must have $$\Big(\sum_{i=1}^n x_i x_i^\top + \lambda I\Big) \hat \theta = \sum_{i=1}^n y_ix_i$$
  • For \(\lambda>0\), unique solution is \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top + \lambda I\Big)^{-1}\sum_{i=1}^n y_ix_i\)

Review: linear algebra

  • For \(A\in\mathbb R^{d\times d}\), the equation \(A \theta = b\) will have a unique solution \(A^{-1}b\) if \(A\) is rank \(d\) (full rank)

  • For positive semi-definite \(M_1 \succeq 0\) and positive definite \(M_2 \succ 0\), \(M_1+M_2\succ 0\)
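
Combining these two facts explains why the matrix in Fact 1 is invertible: for any nonzero \(v\in\mathbb R^d\) and \(\lambda>0\),

$$v^\top\Big(\sum_{i=1}^n x_i x_i^\top + \lambda I\Big)v = \sum_{i=1}^n (x_i^\top v)^2 + \lambda\|v\|_2^2 > 0,$$

so \(\sum_{i=1}^n x_i x_i^\top + \lambda I \succ 0\) and in particular it has rank \(d\).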

[Figure: example objectives that are strongly convex (one global min), convex (infinitely many global minima), and nonconvex (local and global minima)]

Review: convexity (resource)

  • For a differentiable convex function, all critical points are global minima

Gradient descent

The empirical risk objective is

$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2$$

Gradient descent

$$\theta_{t+1} = \theta_t - \eta\Big( \sum_{i=1}^n(\theta_t^\top x_i - y_i) x_i + \lambda \theta_t\Big)$$

Proof of Fact 2 (GD converges).

  • Subtracting \(\hat\theta\) from both sides of the update and using \(\big(\sum_{i=1}^n x_i x_i^\top + \lambda I\big)\hat\theta = \sum_{i=1}^n y_i x_i\): $$\theta_{t+1} -\hat\theta = \Big((1-\eta\lambda) I - \eta \sum_{i=1}^n x_i x_i^\top  \Big) (\theta_t-\hat\theta)$$

  • Consequently, $$\|\theta_{t+1} -\hat\theta\|_2 \leq \underbrace{\Big\|(1-\eta\lambda) I - \eta \sum_{i=1}^n x_i x_i^\top  \Big\|_2}_{= \max_j |1-\eta(\lambda + \lambda_j(\sum_{i=1}^n x_i x_i^\top))|} \|\theta_t-\hat\theta\|_2$$

  • So as long as \(0 < \eta < \frac{2}{\lambda + \lambda_{\max}(\sum_{i=1}^n x_i x_i^\top)}\), we have \(\|\theta_t-\hat\theta\|_2 \leq  \rho^t\|\theta_0-\hat\theta\|_2\) for some \(0<\rho<1\)
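
A quick numerical check of this contraction (the data, \(\lambda\), and step size below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 0.5

A = X.T @ X + lam * np.eye(4)
theta_hat = np.linalg.solve(A, X.T @ y)

# step size just under the 2 / (lam + lambda_max) bound from Fact 2
eta = 1.9 / (lam + np.linalg.eigvalsh(X.T @ X)[-1])
rho = np.max(np.abs(1 - eta * np.linalg.eigvalsh(A)))   # contraction factor from the proof
print(rho)                                              # strictly less than 1

theta = np.zeros(4)
for t in range(200):
    theta = theta - eta * (X.T @ (X @ theta - y) + lam * theta)
print(np.linalg.norm(theta - theta_hat))                # roughly rho**200 * ||theta_0 - theta_hat||
```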

Minimum norm solution

The empirical risk objective is

$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 $$

Lemma: The minimum norm solution is \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top \Big)^{\dagger}\sum_{i=1}^n y_ix_i\).

  • Recall that any minimizers must satisfy \(\sum_{i=1}^n x_i x_i^\top \hat \theta = \sum_{i=1}^n y_ix_i\)

\(\sum_{i=1}^n x_i x_i^\top\) may not be full rank due to redundancy in features

  • example: the first entry of \(x\) is always \(0\)

    • then the first entry of least-norm \(\hat\theta\) will be zero

  • example: the first entry of \(x\) is twice the second entry of \(x\)

    • then the least-norm \(\hat\theta\) satisfies \(\hat\theta_1 = 2\hat\theta_2\), since the direction \((1,-2,0,\dots,0)\) lies in the null space
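
A small numerical illustration of the first example (synthetic data with a dead first feature; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
X[:, 0] = 0.0                          # first feature is identically zero
y = rng.normal(size=20)

G = X.T @ X                            # rank deficient: e_1 lies in its null space
theta_min_norm = np.linalg.pinv(G) @ (X.T @ y)
print(theta_min_norm)                  # first entry is (numerically) zero
```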

Review: linear algebra

  • A square symmetric matrix \(A\) is always diagonalizable, i.e. there exists an orthogonal matrix \(V\) (with \(V^\top V = I\)) and a diagonal matrix \(S\) such that \(A = VSV^\top\)
  • In this case, the pseudo-inverse is \(A^\dagger = V S^\dagger V^\top\) where $$S^{\dagger} = \mathrm{diag}(s_1,s_2,\dots, s_r, 0, \dots, 0)^{\dagger} = \mathrm{diag}(1/s_1,1/s_2,\dots, 1/s_r, 0, \dots, 0)$$
  • When \(b\in\mathrm{range}(A)\), the solution set to \(Ax=b\) is \(\{A^\dagger b + u \mid u\in\mathrm{null}(A)\}\) (nullspace)
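
For instance, the pseudo-inverse of a symmetric matrix can be built from its eigendecomposition and compared against `np.linalg.pinv` (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 2))
A = B @ B.T                                     # symmetric, rank 2 (so not invertible)

s, V = np.linalg.eigh(A)                        # A = V diag(s) V^T with orthogonal V
s_pinv = np.array([1.0 / si if abs(si) > 1e-10 else 0.0 for si in s])
A_pinv = V @ np.diag(s_pinv) @ V.T

print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True
```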

  • So the solution set is \((\sum_{i=1}^n x_i x_i^\top)^\dagger \sum_{i=1}^n y_ix_i+u\) for \(u\in\mathrm{null}(\sum_{i=1}^n x_ix_i^\top)\)

  • Note that \(\mathrm{range}(\sum_{i=1}^n x_i x_i^\top)=\mathrm{span}(x_1,...,x_n)\)

  • Thus the two terms are orthogonal (the first lies in the range, \(u\) in the null space), so the norm is smallest when \(u=0\)

Minimum norm solution

The empirical risk objective is

$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 $$

Gradient descent

$$\theta_{t+1} = \theta_t - \eta\sum_{i=1}^n(\theta_t^\top x_i - y_i) x_i$$

Proof of Fact 3 (GD converges to min norm).

  • Note that \(\theta_t\in\mathrm{span}(x_i) \implies \theta_{t+1} \in \mathrm{span}(x_i)\), since the update adds a linear combination of the \(x_i\); with \(\theta_0=0\), every iterate stays in the span

  • Like before, we have $$\theta_{t+1} -\hat\theta = \Big( I - \eta \sum_{i=1}^n x_i x_i^\top  \Big) (\theta_t-\hat\theta)$$

  • Exercise: extend the argument from Fact 2, accounting for the fact that in general \(\max_j |1-\eta \lambda_j(\sum_{i=1}^n x_i x_i^\top)| = 1\) (some eigenvalues are zero when the features are redundant)
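
A numerical sanity check of Fact 3 (redundant synthetic features and an arbitrary step size below the bound; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
X[:, 0] = 2 * X[:, 1]                       # redundant feature, so sum_i x_i x_i^T is rank deficient
y = rng.normal(size=30)

G = X.T @ X
theta_min_norm = np.linalg.pinv(G) @ (X.T @ y)

eta = 1.0 / np.linalg.eigvalsh(G)[-1]       # below the 2 / lambda_max bound
theta = np.zeros(5)                         # theta_0 = 0, so every iterate stays in span(x_i)
for t in range(5000):
    theta = theta - eta * (X.T @ (X @ theta - y))

print(np.linalg.norm(theta - theta_min_norm))   # small: GD converged to the min-norm solution
```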

Summary: linear least-squares

  • Linear least-squares empirical risk minimization problem: $$\min_{\theta\in\mathbb R^d} \sum_{i=1}^n \left(\theta^\top x_i - y_i\right)^2 + \lambda \|\theta\|_2^2$$
  • The minimizer is unique when \(\lambda>0\), and gradient descent converges when the step size is sufficiently small
  • When \(\lambda=0\) and the features are redundant, minimizers are not unique, and gradient descent depends on initialization

What if we don't already have a good, finite-dimensional feature embedding?

Linear representation of nonlinear functions?

We can encode rich representations by expanding the features (increasing \(d\))

For example, \(y = (x-1)^2\) is nonlinear in \(x\) but linear in the expanded features:

$$y = \begin{bmatrix}1\\-2\\1\end{bmatrix}^\top \underbrace{\begin{bmatrix}1\\x\\x^2\end{bmatrix}}_{\varphi(x)}$$

Polynomials of degree \(n\) in \(d_0\) original features require \(d=\binom{d_0+n}{d_0}\approx \frac{(d_0+n)^{d_0+n}}{d_0^{d_0}n^n}\) dimensions
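
For example, fitting the quadratic above with an explicit degree-2 feature expansion (a hypothetical sketch; `poly_features` is not lecture code):

```python
import numpy as np

def poly_features(x, degree=2):
    """Map each scalar x to [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

x = np.linspace(-2.0, 4.0, 50)
y = (x - 1.0) ** 2

Phi = poly_features(x)                                  # 50 x 3 matrix with rows [1, x, x^2]
lam = 1e-6
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)
print(theta)                                            # approximately [1, -2, 1]
```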

Kernel ridge regression

"What we do"

  • Given: data \(\{x_i,y_i\}_{i=1}^n\) and kernel function \(k:\mathcal X\times\mathcal X\to\mathbb R\)
  • Find coefficients:
    • Construct kernel matrix \(K\in\mathbb R^{n\times n}\) with \(K_{ij} = k(x_i,x_j)\)
      and target vector \(Y = \begin{bmatrix}y_1 & \dots & y_n\end{bmatrix}^\top\)
    • Compute \(\alpha = (K+\lambda I)^{-1} Y\)
  • Make predictions with \(\hat y = \sum_{i=1}^n \alpha_i k(x, x_i)\)
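
A direct translation of this recipe into numpy (the kernel is passed in as a Python function; function names are illustrative):

```python
import numpy as np

def krr_fit(X, Y, k, lam=0.1):
    """Kernel ridge regression coefficients: alpha = (K + lam*I)^{-1} Y."""
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), Y)

def krr_predict(x, X, alpha, k):
    """Prediction: y_hat = sum_i alpha_i k(x, x_i)."""
    return sum(a * k(x, xi) for a, xi in zip(alpha, X))
```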

High dim features

"Why we do it"

  • Recall the empirical risk minimization problem $$\min_{\theta\in \mathbb R^d} \sum_{i=1}^n (\theta^\top \varphi(x_i) - y_i)^2 + \lambda \|\theta\|_2^2$$
  • Fact 4: Suppose that \(k(x,x') = \varphi(x)^\top \varphi(x')\). Then the predictions made by the ERM solution are identical to those of KRR: \(\hat y = \hat\theta^\top \varphi(x)=\sum_{i=1}^n \alpha_i k(x, x_i)\).

Equivalence

The empirical risk objective is

$$J(\theta) = \sum_{i=1}^n (\theta^\top \varphi(x_i) - y_i)^2 + \lambda \|\theta\|_2^2$$

Proof of Fact 4.

  • We have that \(\hat\theta^\top \varphi(x) = \sum_{i=1}^n y_i\varphi(x_i)^\top \Big( \sum_{i=1}^n \varphi(x_i)\varphi(x_i)^\top + \lambda I\Big)^{-1}\varphi(x)\)
    • in matrix form, \( =Y^\top \Phi (\Phi^\top\Phi + \lambda I)^{-1} \varphi(x)\), where \(\Phi\in\mathbb R^{n\times d}\) has rows \(\varphi(x_i)^\top\)
  • Matrix inversion lemma: \(U(V U +  I)^{-1} = (UV + I)^{-1}U\)
  • Therefore \(\hat\theta^\top \varphi(x) = Y^\top (\underbrace{\Phi\Phi^\top }_K+ \lambda I)^{-1} \Phi \varphi(x)= \alpha^\top \Phi \varphi(x)=\sum_{i=1}^n \alpha_i k(x, x_i)\).
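
A numerical check of Fact 4, taking \(\varphi\) to be degree-2 polynomial features and \(k(x,x')=\varphi(x)^\top\varphi(x')\) (an illustrative choice, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=20)
y = rng.normal(size=20)
lam = 0.3

phi = lambda t: np.array([1.0, t, t ** 2])              # explicit feature map
k = lambda s, t: phi(s) @ phi(t)                        # induced kernel

# explicit-feature ridge regression
Phi = np.stack([phi(t) for t in x])                     # n x d, rows phi(x_i)^T
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)

# kernel ridge regression
K = Phi @ Phi.T                                         # K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(20), y)

x_test = 0.7
print(theta @ phi(x_test))                              # ERM prediction
print(alpha @ np.array([k(x_test, t) for t in x]))      # KRR prediction: identical
```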

Example: RBF Kernel
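
For reference, the (Gaussian) RBF kernel is \(k(x,x') = \exp\big(-\|x-x'\|_2^2/(2\sigma^2)\big)\), which corresponds to an infinite-dimensional feature map. A minimal 1D kernel ridge regression with it (the bandwidth \(\sigma\), regularization, and data below are illustrative choices):

```python
import numpy as np

sigma, lam = 0.5, 1e-3
x = np.linspace(0.0, 2 * np.pi, 40)
y = np.sin(x)                                                    # nonlinear target

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))   # K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)

x_test = 1.0
y_hat = alpha @ np.exp(-(x_test - x) ** 2 / (2 * sigma ** 2))    # sum_i alpha_i k(x_test, x_i)
print(y_hat)                                                     # close to sin(1.0), about 0.84
```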

Summary: kernel regression

  • Kernel regression is linear least-squares in a high (possibly infinite) dimensional feature space
  • Selecting a kernel \(\iff\) choosing a nonlinear feature transformation

Recap

  • Linear least-squares & kernel regression
  • Review of convexity, optimization, linear algebra

  • For more, see Ch 4 of Hardt & Recht, "Patterns, Predictions, and Actions" mlstory.org.

Next time: sequential data
