Shen Shen
❤️ Feb 14, 2025 ❤️
11am, Room 10-250
1. Typically, \(X\) is full column rank
2. When \(X\) is not full column rank:
   a. either when \(n < d\), or
   b. when columns (features) in \( {X} \) have linear dependency
We want a more efficient and general method => gradient descent methods
For \(f: \mathbb{R}^m \rightarrow \mathbb{R}\), its gradient \(\nabla f: \mathbb{R}^m \rightarrow \mathbb{R}^m\) is defined at the point \(p=\left(x_1, \ldots, x_m\right)\) as:
\[\nabla f(p)=\begin{bmatrix}\frac{\partial f}{\partial x_1}(p) \\ \vdots \\ \frac{\partial f}{\partial x_m}(p)\end{bmatrix}\]
Sometimes the gradient is undefined or ill-behaved, but today it is well-behaved unless stated otherwise.
3. The gradient can be symbolic or numerical.
example: \(f(x) = \cos(x)\); its symbolic gradient is \(\nabla f(x) = -\sin(x)\), a function, just like a derivative can be a function or a number.
Evaluating the symbolic gradient at a point gives a numerical gradient:
4. The gradient points in the direction of the (steepest) increase in the function value.
\(\frac{d}{dx} \cos(x) \bigg|_{x = -4} = -\sin(-4) \approx -0.7568\)
\(\frac{d}{dx} \cos(x) \bigg|_{x = 5} = -\sin(5) \approx 0.9589\)
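These two numerical gradients can be checked with a centered finite difference (a quick sketch; `numerical_grad` is an illustrative helper, not part of the lecture code):

```python
import math

def numerical_grad(f, x, h=1e-6):
    """Centered finite difference: approximates f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = math.cos
# The symbolic gradient of cos is -sin; compare at x = -4 and x = 5.
for x in (-4.0, 5.0):
    symbolic = -math.sin(x)
    numeric = numerical_grad(f, x)
    print(x, symbolic, numeric)
```

The finite-difference values agree with the symbolic ones to several decimal places.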
A single training data point: \((x,y) = (3,6)\).
Want to fit a line (without offset) to minimize the MSE: \(f(\theta) = (3 \theta-6)^{2}\)
MSE could get better.
How to formalize this?
Suppose we fit a line \(y= 1.5x\). With \(f(\theta) = (3 \theta-6)^{2}\),
\(\nabla_\theta f = f'(\theta) = 2[3(3 \theta-6)]\big|_{\theta=1.5} = 6(4.5-6) = -9 < 0\)
MSE could get better. How? Leveraging the gradient.
Now suppose we fit a line \(y= 2.4 x\). Then
\(\nabla_\theta f = 2[3(3 \theta-6)]\big|_{\theta=2.4} = 6(7.2-6) = 7.2 > 0\)
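These two sign checks can be reproduced directly (a small sketch; `f` and `grad_f` are the slide's objective and its derivative):

```python
def f(theta):
    # MSE for the single training point (x, y) = (3, 6)
    return (3 * theta - 6) ** 2

def grad_f(theta):
    # d/dtheta of (3*theta - 6)**2 = 2 * 3 * (3*theta - 6)
    return 2 * 3 * (3 * theta - 6)

print(grad_f(1.5))  # negative: increasing theta decreases the MSE
print(grad_f(2.4))  # positive: decreasing theta decreases the MSE
```

The sign of the gradient tells us which direction to move \(\theta\); its magnitude is handled by the learning rate below.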
Hyperparameters:
- initial guess of parameters
- learning rate, aka step size
- precision
Gradient descent pseudocode:

1. procedure Gradient-Descent(\(\Theta_{\text{init}}, \eta, f, \nabla_{\Theta} f, \epsilon\))
2. &nbsp;&nbsp; \(\Theta^{(0)} = \Theta_{\text{init}}\)
3. &nbsp;&nbsp; \(t = 0\)
4. &nbsp;&nbsp; repeat
5. &nbsp;&nbsp;&nbsp;&nbsp; \(t = t+1\)
6. &nbsp;&nbsp;&nbsp;&nbsp; \(\Theta^{(t)} = \Theta^{(t-1)} - \eta \nabla_{\Theta} f(\Theta^{(t-1)})\)
7. &nbsp;&nbsp; until \(\left|f(\Theta^{(t)}) - f(\Theta^{(t-1)})\right| < \epsilon\)
8. &nbsp;&nbsp; return \(\Theta^{(t)}\)

(figure: level sets of \(f\), with gradient descent steps overlaid)
Q: what does this condition imply?
A: the gradient at the current parameter is nearly zero.
Other possible stopping criteria for line 7:
- Small parameter change: \( \|\Theta^{(t)} - \Theta^{(t-1)}\| < \epsilon \), or
- Small gradient norm: \( \|\nabla_{\Theta} f(\Theta^{(t)})\| < \epsilon \).

Both also imply the same thing: the gradient is close to zero.
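The procedure can be sketched in a few lines of Python (an illustrative implementation, using the change-in-\(f\) stopping criterion; here it minimizes the slide's \(f(\theta) = (3\theta - 6)^2\), whose minimizer is \(\theta = 2\)):

```python
def gradient_descent(theta_init, eta, f, grad_f, epsilon):
    """Plain gradient descent; stops when f changes by less than epsilon."""
    theta = theta_init
    prev_val = f(theta)
    while True:
        theta = theta - eta * grad_f(theta)   # line 6: the update
        val = f(theta)
        if abs(val - prev_val) < epsilon:     # line 7: stopping criterion
            return theta
        prev_val = val

f = lambda th: (3 * th - 6) ** 2
grad_f = lambda th: 2 * 3 * (3 * th - 6)
theta_star = gradient_descent(theta_init=0.0, eta=0.01, f=f, grad_f=grad_f,
                              epsilon=1e-10)
print(theta_star)  # close to 2
```

With \(\eta = 0.01\) each step multiplies the error \(\theta - 2\) by \(0.82\), so the iterates contract toward the minimizer.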
When minimizing a function, we aim for a global minimizer.
- At a global minimizer, the gradient vector is zero (\(\Rightarrow\)); the converse does not hold in general (\(\nLeftarrow\)).
- Gradient descent can drive the gradient to zero (to arbitrary precision).
- If the objective function is convex, a zero gradient does imply a global minimizer (\(\Leftarrow\)).
A function \(f\) is convex if any line segment connecting two points of the graph of \(f\) lies above or on the graph.
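This definition can be checked numerically (a sketch: we test the chord condition \(f(\lambda a + (1-\lambda) b) \le \lambda f(a) + (1-\lambda) f(b)\) on a grid; the non-convex double-well example is my own, not from the slides):

```python
def chord_above(f, a, b, num=50):
    """Check the convexity inequality along the segment [a, b]."""
    for k in range(num + 1):
        lam = k / num
        point = lam * a + (1 - lam) * b
        chord = lam * f(a) + (1 - lam) * f(b)
        if f(point) > chord + 1e-12:   # graph pokes above the chord
            return False
    return True

convex_f = lambda th: (3 * th - 6) ** 2          # the slide's MSE: convex
nonconvex_f = lambda th: th ** 4 - 3 * th ** 2   # double well: non-convex
print(chord_above(convex_f, -1.0, 5.0))
print(chord_above(nonconvex_f, -1.5, 1.5))
```

The convex objective passes on every segment; the double well fails because the hump at \(\theta = 0\) rises above the chord between its two wells.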
When minimizing a function, we aim for a global minimizer.
At a global minimizer
Some examples:
- Convex functions
- Non-convex functions

If the assumptions behind gradient descent's guarantee are violated:
- may not have a gradient; can't run gradient descent
- may get stuck at a saddle point or a local minimum
- may not terminate / no minimum to converge to
- (learning rate) see demo on next slide, also lab/recitation/hw
Fit a line (without offset) to the dataset to minimize the MSE.

Training data:

| point | \(x\) | \(y\) |
|-------|-------|-------|
| p1    | 2     | 5     |
| p2    | 3     | 6     |
| p3    | 4     | 7     |
Suppose we fit a line \(y= 2.5x\)
\(f(\theta) = \frac{1}{3}\left[(2 \theta-5)^2+(3 \theta-6)^{2}+(4 \theta-7)^2\right]\)
\(\nabla_\theta f = \frac{2}{3}[2(2 \theta-5)+3(3 \theta-6)+4(4 \theta-7)]\)
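A quick check of these two expressions at \(\theta = 2.5\) (a sketch; the data come from the table above):

```python
data = [(2, 5), (3, 6), (4, 7)]

def f(theta):
    # MSE over the three training points
    return sum((x * theta - y) ** 2 for x, y in data) / len(data)

def grad_f(theta):
    # d/dtheta of the MSE: (2/n) * sum of x_i * (x_i*theta - y_i)
    return 2 / len(data) * sum(x * (x * theta - y) for x, y in data)

print(f(2.5), grad_f(2.5))  # gradient is positive: theta should decrease
```

At \(\theta = 2.5\) the gradient is \(\frac{2}{3}(0 + 4.5 + 12) = 11 > 0\), so decreasing \(\theta\) lowers the MSE.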
gradient can help MSE get better
Using our example data set,
\(f(\theta) = \frac{1}{3}\left[(2 \theta-5)^2+(3 \theta-6)^{2}+(4 \theta-7)^2\right]\)
\(\nabla_\theta f = \frac{2}{3}[2(2 \theta-5)+3(3 \theta-6)+4(4 \theta-7)]\)
Using any dataset, in general:
\[f(\theta) = \frac{1}{n}\sum_{i=1}^{n} f_i(\theta), \qquad \nabla_\theta f = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\theta)\]
(the gradient of the sum) = (the sum of the gradients)
Each \(\nabla f_i(\theta) \in \mathbb{R}^{d}\), and we need to add \(n\) of them. Costly!
Let's do stochastic gradient descent (on the board): at each step, for a randomly picked data point \(i\),
\[\Theta^{(t)} = \Theta^{(t-1)} - \eta(t)\, \nabla f_i(\Theta^{(t-1)})\]
where the step sizes satisfy \(\sum_{t=1}^{\infty} \eta(t)=\infty\) and \(\sum_{t=1}^{\infty} \eta(t)^2<\infty\).
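A minimal SGD sketch on the three-point dataset (assumptions: the step-size schedule \(\eta(t) = 1/(t+50)\) is my own choice, picked to satisfy the two conditions above and keep early steps stable; per-point gradients follow the slide):

```python
import random

data = [(2, 5), (3, 6), (4, 7)]

def grad_fi(theta, x, y):
    # Gradient of the single-point squared error (x*theta - y)**2
    return 2 * x * (x * theta - y)

def sgd(theta_init, steps, seed=0):
    rng = random.Random(seed)
    theta = theta_init
    for t in range(1, steps + 1):
        x, y = rng.choice(data)   # randomly picked data point i
        eta = 1 / (t + 50)        # decaying step size: sum diverges, sum of squares converges
        theta = theta - eta * grad_fi(theta, x, y)
    return theta

theta_hat = sgd(0.0, 5000)
print(theta_hat)  # near the MSE minimizer 56/29
```

The full-batch minimizer is \(\theta^* = \sum x_i y_i / \sum x_i^2 = 56/29 \approx 1.93\); SGD hovers around it with some noise.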
Compared with GD, SGD:
- is more "random"
- is more efficient
- may get us out of a local min
- Most ML methods can be formulated as optimization problems.
- We won't always be able to solve optimization problems analytically (in closed form).
- We won't always be able to solve (for a global optimum) efficiently.
- We can still use numerical algorithms to good effect. Lots of sophisticated ones are available.
- We introduced the idea of gradient descent in 1D: only two directions! But the magnitude of the step is important.
- In higher dimensions, the direction is very important as well as the magnitude.
- GD, under appropriate conditions (most notably, when the objective function is convex), can guarantee convergence to a global minimum.
- SGD: approximates GD; more efficient, more random, with fewer guarantees.
We'd love to hear your thoughts.
What do we need to know:
From lecture 2 feedback and questions:
1. Post slides before lecture.
   Yes, will do.
2. What's the difference between notes and lectures?
   Same scope, different media; think of a novel versus its movie adaptation. Also, one form might just fit your schedule better.
   We should also mention the new work-in-progress HTML notes; we'll share their goal and seek suggestions.
3. Better mic positioning.
   Yep, will do.