Linear Regression

Cornell CS 3/5780 · Spring 2026


1. Setup and Assumptions

  • Data Assumption: \(y_i \in \mathbb{R}\) (real-valued labels) and \(\mathbf x_i \in \mathbb{R}^d\)
  • Model Assumption: data looks like a "line" through the origin, with small errors $$ y_i = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i \quad \text{where } \epsilon_i \text{ is small} $$
  • Ordinary Least Squares: minimize the average squared error $$ \min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w} - y_i)^2 $$
  • MLE connection: This learning objective is equivalent to MLE (maximizing \(P(D|\mathbf{w})\)) under the probability model assumption: $$ y_i|\mathbf{x}_i \sim N(\mathbf{w}^\top\mathbf{x}_i, \sigma^2) \quad \Rightarrow \quad P(y_i|\mathbf{x}_i, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(\mathbf{x}_i^\top\mathbf{w}-y_i)^2}{2\sigma^2}} $$
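
The MLE equivalence can be checked numerically: the negative log-likelihood is an affine function of the squared-error objective, so the two share the same minimizer. A minimal sketch in NumPy (synthetic data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.1
X = rng.normal(size=(d, n))                  # columns are the data points x_i
w_true = rng.normal(size=d)
y = w_true @ X + sigma * rng.normal(size=n)  # y_i = w^T x_i + eps_i

def ols_loss(w):
    """Mean squared error: (1/n) sum_i (x_i^T w - y_i)^2."""
    r = X.T @ w - y
    return np.mean(r ** 2)

def neg_log_lik(w):
    """-log P(D|w) under y_i | x_i ~ N(w^T x_i, sigma^2)."""
    r = X.T @ w - y
    return np.sum(r ** 2) / (2 * sigma ** 2) + (n / 2) * np.log(2 * np.pi * sigma ** 2)

# neg_log_lik(w) = (n / (2 sigma^2)) * ols_loss(w) + constant,
# so minimizing one minimizes the other.
```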

2. Solving the minimization

  • Gradient: the gradient of the learning objective $$ \nabla_{\mathbf{w}} L(\mathbf{w})  = \frac{2}{n}\sum_{i=1}^n \mathbf{x}_i (\mathbf{x}_i^\top\mathbf{w} - y_i) $$
  • At optimum: gradient is equal to zero $$\frac{2}{n}\sum_{i=1}^n \mathbf{x}_i (\mathbf{x}_i^\top\mathbf{w} - y_i) = \mathbf{0}  \iff \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top\mathbf{w} = \sum_{i=1}^n \mathbf{x}_i y_i  $$
  • This is a system of linear equations that we need to solve for \(\mathbf{w}\)
  • Matrix notation: \(\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]\in\mathbb R^{d\times n}\), \(\mathbf{y} = [y_1, \ldots, y_n] \in\mathbb R^{1\times n}\) $$ \mathbf{X}\mathbf{X}^\top\mathbf{w} = \mathbf{X}\mathbf{y}^\top $$
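
In code, the normal equations can be solved directly with a linear solver. A sketch with synthetic data (NumPy; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(d, n))                 # X = [x_1, ..., x_n], shape (d, n)
w_true = rng.normal(size=d)
y = w_true @ X + 0.01 * rng.normal(size=n)

# Solve X X^T w = X y^T for w (y is a 1D array here, so X y^T is just X @ y)
w_hat = np.linalg.solve(X @ X.T, X @ y)

# The gradient (2/n) sum_i x_i (x_i^T w - y_i) vanishes at w_hat
grad = (2 / n) * X @ (X.T @ w_hat - y)
```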

3. Solving system of linear equations

$$ \mathbf{X}\mathbf{X}^\top\mathbf{w} = \mathbf{X}\mathbf{y}^\top $$

  • If \(\text{rank}(\mathbf{X}) = d\) (full rank): then \(\mathbf{X}\mathbf{X}^\top\) is invertible
    • Occurs when \(n \geq d\) and data spans feature space
    • Unique solution: the standard closed-form solution $$ \mathbf{w}^* = (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^\top $$
  • If \(\text{rank}(\mathbf{X}) < d\): then \(\mathbf{X}\mathbf{X}^\top\) is not invertible
    • Fewer data points than features (\(n < d\)) OR data lies in lower-dimensional subspace
    • Infinitely many solutions achieve the same minimum loss

4. Decomposing the Weight Vector

  • Any weight vector can be decomposed: \(\mathbf{w} = \mathbf{w}_{\parallel} + \mathbf{w}_{\perp} \)
    • \(\mathbf{w}_{\parallel}\) lies in column space of \(\mathbf{X}\): \(\mathbf{w}_{\parallel} = \mathbf{X}\mathbf{v}\) for some \(\mathbf{v}\)
    • \(\mathbf{w}_{\perp}\) lies in null space of \(\mathbf{X}^\top\): \(\mathbf{X}^\top\mathbf{w}_{\perp} = \mathbf{0}\)
    • They are orthogonal: \(\mathbf{w}_{\parallel}^\top\mathbf{w}_{\perp} = 0\)
  • Key observation: prediction depends only on \(\mathbf{w}_{\parallel}\)! $$ \mathbf{x}_i^\top\mathbf{w} = \mathbf{x}_i^\top(\mathbf{w}_{\parallel} + \mathbf{w}_{\perp}) = \mathbf{x}_i^\top\mathbf{w}_{\parallel} + \underbrace{\mathbf{x}_i^\top\mathbf{w}_{\perp}}_{=0} = \mathbf{x}_i^\top\mathbf{w}_{\parallel} $$ Thus, any two weight vectors with the same \(\mathbf{w}_{\parallel}\) achieve the same loss.
  • Solution set (affine subspace): $$ \{\mathbf{w}^*_{\parallel} + \mathbf{w}_{\perp} : \mathbf{w}_{\perp} \in \text{null}(\mathbf{X}^\top)\} $$
  • Minimum-norm solution: the unique minimizer with \(\mathbf{w}_{\perp} = \mathbf{0}\), given by the pseudoinverse $$ \mathbf{w}^*_{\text{min-norm}} = (\mathbf{X}\mathbf{X}^\top)^{\dagger}\mathbf{X}\mathbf{y}^\top $$
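
In the rank-deficient case (\(n < d\) below), the pseudoinverse picks out the minimum-norm minimizer, and `np.linalg.lstsq` returns the same vector. A sketch, assuming synthetic Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 20                                 # fewer points than features: rank(X) < d
X = rng.normal(size=(d, n))
y = rng.normal(size=n)

# Minimum-norm solution via the pseudoinverse
w_min = np.linalg.pinv(X @ X.T) @ X @ y

# lstsq on the underdetermined system X^T w = y returns the same min-norm w
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
```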

5. (Stochastic) Gradient Descent

  • Gradient descent: Starting from \(\mathbf{w}^{(0)}\), with learning rate \(\eta > 0\): $$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \frac{2}{n}\sum_{i=1}^n \mathbf{x}_i (\mathbf{x}_i^\top\mathbf{w}^{(t)} - y_i) $$
  • SGD: Approximate gradient using single random data point \(i\) $$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \cdot 2\mathbf{x}_i(\mathbf{x}_i^\top\mathbf{w}^{(t)} - y_i) $$
  • Key observation: gradient is in the column space because it is a linear combination of \(\mathbf x_i\) (columns of the data matrix \(\mathbf{X}\))
  • (S)GD updates only change \(\mathbf{w}_{\parallel}\), never \(\mathbf{w}_{\perp}\)
  • If initialized at \(\mathbf{w}^{(0)} = \mathbf{0}\), so that \(\mathbf{w}_{\perp} = \mathbf{0}\) throughout, (S)GD converges to the minimum-norm solution.
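
This is easy to verify numerically: running plain GD from the origin on an underdetermined problem lands on the pseudoinverse (minimum-norm) solution. A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 20                                 # underdetermined: infinitely many minimizers
X = rng.normal(size=(d, n))
y = rng.normal(size=n)

w = np.zeros(d)                              # start at the origin, so w_perp = 0
eta = 0.01
for _ in range(20000):
    w -= eta * (2 / n) * X @ (X.T @ w - y)   # each update stays in col(X)

w_min_norm = np.linalg.pinv(X @ X.T) @ X @ y  # GD converges to this
```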

6. Ridge Regression

(Figure: ridge regression fits for \(\lambda = 0\), \(\lambda = 1\), and \(\lambda = 10\).)

  • Data & Model Assumption: Same setup \(y_i = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i\), but now we also assume weights are small.

  • Ridge Objective: minimize squared errors plus a penalty on weight size, where \(\lambda > 0\): $$ \min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w} - y_i)^2 + \lambda\|\mathbf{w}\|_2^2 $$

  • MAP connection: This objective is equivalent to MAP estimation, maximizing \(P(\mathbf{w}|D) \propto P(D|\mathbf{w})P(\mathbf{w})\), with the Gaussian prior: $$ P(\mathbf{w}) = \frac{1}{(2\pi\tau^2)^{d/2}}e^{-\frac{\mathbf{w}^\top\mathbf{w}}{2\tau^2}} \quad \Rightarrow \quad \lambda = \frac{\sigma^2}{n\tau^2} $$

  • Closed-form solution: Set gradient to zero and solve (always unique): $$ \mathbf{w}^* = (\mathbf{X}\mathbf{X}^\top + n\lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^\top $$
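
The ridge normal equations are always well-posed, even when \(n < d\). A quick sketch (NumPy, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 5, 20, 0.1                       # rank-deficient, but ridge is still unique
X = rng.normal(size=(d, n))
y = rng.normal(size=n)

# (X X^T + n*lambda*I) is positive definite for lambda > 0, hence invertible
w_ridge = np.linalg.solve(X @ X.T + n * lam * np.eye(d), X @ y)

# Gradient of the ridge objective vanishes at w_ridge
grad = (2 / n) * X @ (X.T @ w_ridge - y) + 2 * lam * w_ridge
```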

7. Summary

Ordinary Least Squares (unregularized):

  • From MLE with Gaussian noise assumption
  • Closed form (minimum norm): \(\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{\dagger}\mathbf{X}\mathbf{y}^\top\)
  • GD from origin \(\implies\) minimum-norm solution (implicit regularization!)

Ridge Regression (regularized):

  • From MAP with Gaussian prior on weights
  • Closed form: \(\mathbf{w} = (\mathbf{X}\mathbf{X}^\top + n\lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^\top\)
  • Always has unique solution, more stable
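
The two closed forms are linked: as \(\lambda \to 0\), the ridge solution converges to the minimum-norm OLS solution, which gives another view of GD-from-the-origin as implicit regularization. A quick numerical check (NumPy, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 5, 20
X = rng.normal(size=(d, n))
y = rng.normal(size=n)

w_min_norm = np.linalg.pinv(X @ X.T) @ X @ y
for lam in [1.0, 1e-3, 1e-6]:
    w_ridge = np.linalg.solve(X @ X.T + n * lam * np.eye(d), X @ y)
    # distance to the min-norm solution shrinks as lambda -> 0
    print(lam, np.linalg.norm(w_ridge - w_min_norm))
```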

By Sarah Dean