Fall 2025, Prof Sarah Dean
"What we do"
"Why we do it"
The empirical risk objective is
$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2$$
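As a concrete reference point, here is a minimal numpy sketch that evaluates this objective on synthetic data (the data, sizes, and \(\lambda\) here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1             # illustrative sizes and regularization strength
X = rng.normal(size=(n, d))        # rows are the x_i
y = rng.normal(size=n)
theta = rng.normal(size=d)

# J(theta) = sum_i (theta^T x_i - y_i)^2 + lambda * ||theta||_2^2
J = np.sum((X @ theta - y) ** 2) + lam * np.dot(theta, theta)
print(J)
```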
Review: critical points
global max
local max
global min
local min
saddle
Proof of Fact 1.
The empirical risk objective is
$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2$$
Proof of Fact 1.
Review: linear algebra
For \(A\in\mathbb R^{d\times d}\), the equation \(A \theta = b\) will have a unique solution \(A^{-1}b\) if \(A\) is rank \(d\) (full rank)
For positive semi-definite \(M_1 \succeq 0\) and positive definite \(M_2 \succ 0\), \(M_1+M_2\succ 0\)
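The proof of Fact 1 is not written out in this text export; a plausible sketch using the two facts above (my reconstruction, not the slide's): setting the gradient to zero,
$$\nabla J(\theta) = 2\sum_{i=1}^n (\theta^\top x_i - y_i)x_i + 2\lambda \theta = 0 \iff \Big(\sum_{i=1}^n x_i x_i^\top + \lambda I\Big)\theta = \sum_{i=1}^n y_i x_i$$
Since \(\sum_{i=1}^n x_i x_i^\top \succeq 0\) and \(\lambda I \succ 0\), the matrix on the left is positive definite and hence full rank, so the unique critical point is \(\hat\theta = (\sum_{i=1}^n x_i x_i^\top + \lambda I)^{-1}\sum_{i=1}^n y_i x_i\); the Hessian \(2(\sum_{i=1}^n x_i x_i^\top + \lambda I) \succ 0\) makes \(J\) strongly convex, so this critical point is the one global minimum.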
The empirical risk objective is
$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2$$
one global min
infinitely many global min
local and global min
strongly convex
convex
nonconvex
Review: convexity (resource)
The empirical risk objective is
$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 + \lambda \|\theta\|_2^2$$
Gradient descent
$$\theta_{t+1} = \theta_t - \eta\Big( \sum_{i=1}^n(\theta_t^\top x_i - y_i) x_i + \lambda \theta_t\Big)$$
Proof of Fact 2 (GD converges).
$$\theta_{t+1} -\hat\theta = \Big((1-\eta\lambda) I - \eta \sum_{i=1}^n x_i x_i^\top \Big) (\theta_t-\hat\theta)$$
Consequently, $$\|\theta_{t+1} -\hat\theta\|_2 \leq \underbrace{\Big\|(1-\eta\lambda) I - \eta \sum_{i=1}^n x_i x_i^\top \Big\|_2}_{= \max_j |1-\eta(\lambda + \lambda_j(\sum_{i=1}^n x_i x_i^\top))|} \|\theta_t-\hat\theta\|_2$$
So as long as \(\eta>0\) is small enough, \(\|\theta_t-\hat\theta\| \leq \rho^t\|\theta_0-\hat\theta\|\) for \(0<\rho<1\)
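A small numpy check of this contraction; the step size \(\eta = 1/(\lambda+\lambda_{\max})\) and the synthetic data are illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))               # rows are the x_i
y = rng.normal(size=n)

A = X.T @ X + lam * np.eye(d)             # sum_i x_i x_i^T + lambda I
theta_hat = np.linalg.solve(A, X.T @ y)   # closed-form minimizer

eigs = np.linalg.eigvalsh(X.T @ X)        # lambda_j(sum_i x_i x_i^T)
eta = 1.0 / (lam + eigs.max())            # "small enough" step size
rho = np.abs(1 - eta * (lam + eigs)).max()  # contraction factor from the proof

theta = np.zeros(d)
for t in range(50):
    grad = X.T @ (X @ theta - y) + lam * theta   # gradient used in the update above
    theta -= eta * grad

# error after 50 steps vs. the guarantee rho**50 * ||theta_0 - theta_hat||
print(np.linalg.norm(theta - theta_hat), rho**50 * np.linalg.norm(theta_hat))
```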
The empirical risk objective is
$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 $$
Lemma: The minimum norm solution is \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top \Big)^{\dagger}\sum_{i=1}^n y_ix_i\).
Recall that any minimizers must satisfy \(\sum_{i=1}^n x_i x_i^\top \hat \theta = \sum_{i=1}^n y_ix_i\)
\(\sum_{i=1}^n x_i x_i^\top\) may not be full rank due to redundancy in features
example: the first entry of \(x\) is always \(0\)
then the first entry of least-norm \(\hat\theta\) will be zero
example: the first entry of \(x\) is twice the second entry of \(x\)
then the least-norm \(\hat\theta\) will also have its first entry equal to twice its second entry (see the numeric sketch below)
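A numeric sketch of the lemma and the redundancy example above, using illustrative synthetic data (the pinv call computes the pseudoinverse in the lemma):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 4
X = rng.normal(size=(n, d))
X[:, 0] = 2 * X[:, 1]              # redundancy: first feature is twice the second
y = rng.normal(size=n)

# minimum-norm solution via the pseudoinverse, as in the lemma
theta_hat = np.linalg.pinv(X.T @ X) @ (X.T @ y)

print(np.linalg.matrix_rank(X.T @ X))   # 3 < d=4: not full rank
print(theta_hat[0], 2 * theta_hat[1])   # equal: theta_hat inherits the feature relation
```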
The empirical risk objective is
$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 $$
Lemma: The minimum norm solution is \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top \Big)^{\dagger}\sum_{i=1}^n y_ix_i\).
Recall that any minimizers must satisfy \(\sum_{i=1}^n x_i x_i^\top \hat \theta = \sum_{i=1}^n y_ix_i\)
Review: linear algebra
The empirical risk objective is
$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 $$
Lemma: The minimum norm solution is \(\hat\theta = \Big( \sum_{i=1}^n x_i x_i^\top \Big)^{\dagger}\sum_{i=1}^n y_ix_i\).
Recall that any minimizers must satisfy \(\sum_{i=1}^n x_i x_i^\top \hat \theta = \sum_{i=1}^n y_ix_i\)
So the solution set is \((\sum_{i=1}^n x_i x_i^\top)^\dagger \sum_{i=1}^n y_ix_i+u\) for \(u\in\mathrm{null}(\sum_{i=1}^n x_ix_i^\top)\)
Note that \(\mathrm{range}(\sum_{i=1}^n x_i x_i^\top)=\mathrm{span}(x_1,...,x_n)\)
The first term lies in this range and \(u\) lies in its orthogonal complement \(\mathrm{null}(\sum_{i=1}^n x_ix_i^\top)\), so the two terms are orthogonal and the norm is smallest when \(u=0\)
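Spelled out (my expansion, not on the slides), orthogonality gives
$$\Big\|\Big(\sum_{i=1}^n x_i x_i^\top\Big)^{\dagger} \sum_{i=1}^n y_ix_i + u\Big\|_2^2 = \Big\|\Big(\sum_{i=1}^n x_i x_i^\top\Big)^{\dagger} \sum_{i=1}^n y_ix_i\Big\|_2^2 + \|u\|_2^2$$
which is minimized over the solution set exactly at \(u=0\).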
The empirical risk objective is
$$J(\theta) = \sum_{i=1}^n (\theta^\top x_i - y_i)^2 $$
Gradient descent
$$\theta_{t+1} = \theta_t - \eta\sum_{i=1}^n(\theta_t^\top x_i - y_i) x_i$$
Proof of Fact 3 (GD converges to min norm).
Note that \(\theta_t\in\mathrm{span}(x_1,\dots,x_n) \implies \theta_{t+1} \in \mathrm{span}(x_1,\dots,x_n)\), so if \(\theta_0\) is in the span (e.g. \(\theta_0=0\)), every iterate stays in the span
Like before, we have $$\theta_{t+1} -\hat\theta = \Big( I - \eta \sum_{i=1}^n x_i x_i^\top \Big) (\theta_t-\hat\theta)$$
Exercise: extend the argument from Fact 2, accounting for the fact that in general \(\max_j |1-\eta \lambda_j(\sum_{i=1}^n x_i x_i^\top)| = 1\)
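A numeric sketch of Fact 3 on an illustrative underdetermined problem (step size, iteration count, and data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 30                       # more features than data points
X = rng.normal(size=(n, d))         # rows are the x_i
y = rng.normal(size=n)

# minimum-norm solution from the lemma
theta_mn = np.linalg.pinv(X.T @ X) @ (X.T @ y)

eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()
theta = np.zeros(d)                 # theta_0 = 0 lies in span(x_1, ..., x_n)
for _ in range(5000):
    theta -= eta * (X.T @ (X @ theta - y))   # the gradient step above

print(np.linalg.norm(theta - theta_mn))      # small: GD reached the min-norm solution
```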
What if we don't already have a good, finite dimensional feature embedding?
Linear representation of nonlinear functions?
We can encode rich representations by expanding the features (increasing \(d\))
Example: \(y = (x-1)^2\) can be written as \(y = \begin{bmatrix}1\\-2\\1\end{bmatrix}^\top \begin{bmatrix}1\\x\\x^2\end{bmatrix}\), i.e. a linear function of the expanded features \(\varphi(x) = \begin{bmatrix}1 & x & x^2\end{bmatrix}^\top\)
Polynomials of degree \(n\) in \(d_0\) input dimensions require \(d=\binom{d_0+n}{d_0}\approx \frac{(d_0+n)^{d_0+n}}{d_0^{d_0}n^n}\) expanded features
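A minimal sketch of the example above, fitting \(y=(x-1)^2\) with the expanded features and evaluating the monomial count (data and dimensions are illustrative):

```python
import numpy as np
from math import comb

# fit y = (x - 1)^2 with the expanded features phi(x) = [1, x, x^2]
x = np.linspace(-2, 2, 20)
y = (x - 1) ** 2
Phi = np.stack([np.ones_like(x), x, x ** 2], axis=1)

theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta_hat)                    # approximately [1, -2, 1]

# number of monomials of degree <= n in d0 input variables
d0, n = 10, 3
print(comb(d0 + n, d0))             # 286 expanded features
```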
"What we do"
"Why we do it"
The empirical risk objective is
$$J(\theta) = \sum_{i=1}^n (\theta^\top \varphi(x_i) - y_i)^2 + \lambda \|\theta\|_2^2$$
Proof of Fact 4.
Example: RBF Kernel
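The kernel itself is not spelled out in this text export; below is the standard RBF kernel \(k(x,x') = \exp(-\|x-x'\|_2^2/2\sigma^2)\) together with a minimal kernel ridge regression sketch (bandwidth, data, and \(\lambda\) are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), computed for all pairs of rows
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(30, 1))
y = np.sin(3 * X[:, 0])

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # kernel ridge coefficients

X_test = np.linspace(-2, 2, 5)[:, None]
pred = rbf_kernel(X_test, X) @ alpha                   # predictions at test points
print(pred)
```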
Review of convexity, optimization, linear algebra
For more, see Ch 4 of Hardt & Recht, "Patterns, Predictions, and Actions" mlstory.org.
Next time: sequential data