Logistics
1. There is a student in our class who needs copies of class notes as an approved accommodation. If you're interested in serving as a paid note taker, please reach out to DAS at 617-253-1674 or das-student@mit.edu.
2. Midterm 1: October 8, 7:30pm-9pm. It covers all material up to and including week 4 (linear classification). If you need to take the conflict or accommodation exam, please get in touch with us at 6.390-personal@mit.edu by Sept 24.
3. Heads-up: Midterm 2 is November 12, 7:30pm-9pm. The final is December 15, 9am-12pm.
More details are on the introML homepage.
Recall:
Typically, \(X\) is full column rank, so the closed-form least-squares solution is well-defined.
When \(X\) is not full column rank, the closed form breaks down.
On the growth of model sizes:
https://epoch.ai/blog/machine-learning-model-sizes-and-the-parameter-gap
https://arxiv.org/pdf/2001.08361
In the real world, we need a more efficient and general algorithm to train
=> gradient descent methods
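For concreteness, here is a small numpy sketch (my own, not from the slides) of why the closed form can fail: \(\theta = (X^\top X)^{-1}X^\top y\) requires \(X\) to have full column rank, and a dataset whose second feature is 1.5× the first makes \(X^\top X\) singular.

```python
# Sketch (not from the slides): closed-form least squares needs X to have
# full column rank; otherwise X^T X is singular and cannot be inverted.
import numpy as np

y = np.array([7.0, 8.0, 9.0])

# Full-column-rank X: the closed form works.
X = np.array([[2.0, 3.0], [4.0, 5.0], [6.0, 10.0]])
theta = np.linalg.inv(X.T @ X) @ X.T @ y

# Rank-deficient X (second column = 1.5 * first): X^T X is singular.
X_bad = np.array([[2.0, 3.0], [4.0, 6.0], [6.0, 9.0]])
print(np.linalg.matrix_rank(X_bad))  # 1
# np.linalg.inv(X_bad.T @ X_bad) raises LinAlgError here, so the closed
# form fails; iterative methods like gradient descent still apply.
```

Gradient descent sidesteps the inversion entirely, which also matters when \(X^\top X\) is merely expensive (huge models) rather than singular.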
For \(f: \mathbb{R}^m \rightarrow \mathbb{R}\), its gradient \(\nabla f: \mathbb{R}^m \rightarrow \mathbb{R}^m\) is defined at the point \(p=\left(x_1, \ldots, x_m\right)\) as:
\(\nabla f(p)=\left[\frac{\partial f}{\partial x_1}(p), \ldots, \frac{\partial f}{\partial x_m}(p)\right]^\top\)
Sometimes the gradient is undefined or ill-behaved, but today it is well-behaved.
3. The gradient can be symbolic or numerical.
example: \(f(x) = \cos(x)\)
its symbolic gradient: \(\nabla f(x) = -\sin(x)\)
just like a derivative can be a function or a number,
evaluating the symbolic gradient at a point gives a numerical gradient:
4. The gradient points in the direction of the (steepest) increase in the function value.
\(\frac{d}{dx} \cos(x) \bigg|_{x = -4} = -\sin(-4) \approx -0.7568\)
\(\frac{d}{dx} \cos(x) \bigg|_{x = 5} = -\sin(5) \approx 0.9589\)
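The two values above can be double-checked in code; a small sketch (the helper name is mine) comparing the symbolic gradient with a centered finite-difference estimate:

```python
# Compare the symbolic derivative of cos (which is -sin) against a
# centered finite-difference (numerical) estimate at two points.
import math

def numerical_derivative(f, x, h=1e-6):
    # centered difference: (f(x+h) - f(x-h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-4.0, 5.0):
    symbolic = -math.sin(x)                      # evaluate symbolic gradient
    numeric = numerical_derivative(math.cos, x)  # numerical gradient
    print(f"x={x}: symbolic={symbolic:.4f}, numeric={numeric:.4f}")
```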
5. For a differentiable function, the gradient at the function's minimizer is necessarily zero.
Want to fit a line (without offset) to minimize the MSE: \(J(\theta) = (3 \theta-6)^{2}\)
A single training data point
\((x,y) = (3,6)\)
MSE could get better.
How to formalize this?
Suppose we fit a line \(y= 1.5x\)
\(J(\theta) = (3 \theta-6)^{2}\)
\(\nabla_\theta J = J'(\theta) = 2\left[3(3 \theta-6)\right] = 6(3\theta-6)\)
\(J'(\theta)\big|_{\theta=1.5} = 6(4.5-6) = -9 < 0\)
MSE could get better. How to? Leveraging the gradient.
Suppose we now fit a line \(y= 2.4 x\)
\(J(\theta) = (3 \theta-6)^{2}\)
\(\nabla_\theta J = J'(\theta) = 6(3\theta-6)\)
\(J'(\theta)\big|_{\theta=2.4} = 6(7.2-6) = 7.2 > 0\)
Hyperparameters: the initial guess of parameters, the learning rate, and the precision.
(The level sets of \(J\) give the contour plot.)
\(t\) is the iteration counter; stop when the objective improvement is nearly zero.
Other possible stopping criteria for line 6: the change in parameters is nearly zero, or the gradient's magnitude is nearly zero.
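The loop can be sketched in Python (function names and default values are mine, not the lecture's pseudocode); it stops when the objective improvement is nearly zero:

```python
# Gradient descent: repeat theta <- theta - eta * grad_J(theta) until the
# objective improvement is nearly zero (or max_iters is hit).
def gradient_descent(J, grad_J, theta0, eta=0.05, eps=1e-10, max_iters=10000):
    theta = theta0
    prev = J(theta)
    for t in range(max_iters):               # iteration counter
        theta = theta - eta * grad_J(theta)  # parameter update rule
        curr = J(theta)
        if abs(prev - curr) < eps:           # improvement nearly zero: stop
            break
        prev = curr
    return theta

# Running example: J(theta) = (3*theta - 6)^2, minimized at theta = 2.
J = lambda th: (3 * th - 6) ** 2
grad_J = lambda th: 6 * (3 * th - 6)
print(gradient_descent(J, grad_J, theta0=1.5))  # converges near 2.0
```

Swapping the `if` condition for a check on the parameter change or the gradient magnitude gives the alternative stopping criteria mentioned above.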
When minimizing a function, we aim for a global minimizer.
At a global minimizer, the gradient vector is zero (\(\Rightarrow\)); gradient descent can achieve this (to arbitrary precision).
But the converse fails in general (\(\nLeftarrow\)): a zero gradient need not mean a global minimizer.
The converse does hold (\(\Leftarrow\)) when the objective function is convex.
A function \(f\) is convex if any line segment connecting two points of the graph of \(f\) lies above or on the graph.
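A quick numeric spot-check of this definition (my own throwaway sketch) on the running objective \(J(\theta)=(3\theta-6)^2\): random chords never dip below the graph.

```python
# Spot-check convexity: for convex f, f(la*a + (1-la)*b) <= la*f(a) + (1-la)*f(b)
# for all points a, b and all la in [0, 1] (chord lies on or above the graph).
import random

f = lambda x: (3 * x - 6) ** 2
random.seed(0)

ok = True
for _ in range(1000):
    a, b = random.uniform(-5, 5), random.uniform(-5, 5)
    la = random.random()
    # function value at the mixed point vs. chord value there
    ok = ok and f(la * a + (1 - la) * b) <= la * f(a) + (1 - la) * f(b) + 1e-9
print(ok)  # True for a convex f
```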
Some examples
Convex functions
Non-convex functions
Convexity is why we can claim that a point whose gradient is zero is a global minimizer.
The linear-regression MSE objective is always convex.
\((x_1, x_2) = (2,3), y =7\)
\((x_1, x_2) = (4,6), y =8\)
\((x_1, x_2) = (6,9), y =9\)
case (c) training data set again
GD's convergence guarantee requires several conditions:
- if the function is not differentiable, it may not have a gradient, and we can't run gradient descent
- if it is not convex, GD may get stuck at a saddle point or a local minimum
- if it is not bounded below, GD may not terminate (there is no minimum to converge to)
- if the learning rate is not small enough: see demo on next slide, also lab/hw
Fit a line (without offset) to the dataset; the MSE:
training data
 | x | y |
---|---|---|
p1 | 2 | 5 |
p2 | 3 | 6 |
p3 | 4 | 7 |
\(J(\theta) = \frac{1}{3}\left[(2 \theta-5)^2+(3 \theta-6)^{2}+(4 \theta-7)^2\right]\)
Suppose we fit a line \(y= 2.5x\)
\(J(\theta) = \frac{1}{3}\left[(2 \theta-5)^2+(3 \theta-6)^{2}+(4 \theta-7)^2\right]\)
\(\nabla_\theta J = \frac{1}{3}[4(2 \theta-5)+6(3 \theta-6)+8(4 \theta-7)]\big|_{\theta=2.5} = \frac{1}{3}[0+6(7.5-6)+8(10-7)] = 11 > 0\)
This gradient info can help the MSE get better.
\(J(\theta) = \frac{1}{3}\left[(2 \theta-5)^2+(3 \theta-6)^{2}+(4 \theta-7)^2\right] = \frac{1}{3}\left[J_1 + J_2 + J_3\right]\)
\(\nabla_\theta J = \frac{1}{3}[4(2 \theta-5)+6(3 \theta-6)+8(4 \theta-7)] = \frac{1}{3}\left[\nabla_\theta J_1 + \nabla_\theta J_2 + \nabla_\theta J_3\right]\)
 | x | y |
---|---|---|
p1 | 2 | 5 |
p2 | 3 | 6 |
p3 | 4 | 7 |
Using our example data set,
\(J(\theta) = \frac{1}{3}\left[(2 \theta-5)^2+(3 \theta-6)^{2}+(4 \theta-7)^2\right]\)
\(\nabla_\theta J = \frac{1}{3}[4(2 \theta-5)+6(3 \theta-6)+8(4 \theta-7)]\)
Using any dataset, in general:
\(J(\theta) = \frac{1}{n}\sum_{i=1}^{n} J_i(\theta)\), where \(J_i\) is the loss incurred on the single \(i^{\text{th}}\) data point.
(gradient of the sum) = (sum of the gradients):
\(\nabla_\theta J(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta J_i(\theta)\)
Each term \(\nabla_\theta J_i(\theta) \in \mathbb{R}^{d}\) carries the gradient info from a single \(i^{\text{th}}\) data point's loss; we need to add \(n\) of these per update. Costly in practice!
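The per-point decomposition is easy to write out on the three-point example (a sketch of mine; `J_i` and `grad_J_i` are my own helper names):

```python
# J = (1/n) * sum_i J_i and grad J = (1/n) * sum_i grad J_i: the full
# gradient is an average of n per-point gradients.
pts = [(2, 5), (3, 6), (4, 7)]
n = len(pts)

def J_i(theta, x, y):        # loss on a single data point
    return (x * theta - y) ** 2

def grad_J_i(theta, x, y):   # gradient of that point's loss
    return 2 * x * (x * theta - y)

theta = 2.5
J = sum(J_i(theta, x, y) for x, y in pts) / n
grad = sum(grad_J_i(theta, x, y) for x, y in pts) / n
print(J, grad)  # 3.75 11.0
```

SGD exploits exactly this structure: instead of summing all \(n\) per-point gradients, it samples one of them per update.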
 | x1 | x2 | y |
---|---|---|---|
p1 | 1 | 2 | 3 |
p2 | 2 | 1 | 2 |
p3 | 3 | 4 | 6 |
Compared with GD, SGD:
- is much more "random"
- is more efficient
- may get us out of a local min
A classical condition on the learning-rate schedule for SGD to converge:
\(\sum_{t=1}^{\infty} \eta(t)=\infty\) and \(\sum_{t=1}^{\infty} \eta(t)^2<\infty\)
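A minimal SGD sketch (my own code, on the slide's three-point dataset) using the schedule \(\eta(t)=\frac{1}{20+t}\), which satisfies both conditions; the constant 20 is my choice to keep early steps stable:

```python
# SGD: each update uses the gradient of one sampled point's loss, with a
# decaying learning rate (sum eta(t) diverges, sum eta(t)^2 converges).
import random

pts = [(2.0, 5.0), (3.0, 6.0), (4.0, 7.0)]
random.seed(0)

theta = 0.0
for t in range(1, 5001):
    x, y = random.choice(pts)   # single sampled data point
    eta = 1.0 / (20.0 + t)      # schedule satisfying the two conditions
    theta -= eta * 2 * x * (x * theta - y)
# theta ends near 112/58 ~ 1.93, the minimizer of the full MSE
```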
batch size: a larger batch gives
🥰 a more accurate gradient estimate
🥰 a stronger theoretical guarantee
🥺 a higher cost per parameter update
mini-batch GD sits between SGD (batch size 1) and GD (batch size \(n\)).
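A mini-batch variant (again my own sketch; `batch_size=2` on the same toy data) illustrating the middle ground:

```python
# Mini-batch GD: average the per-point gradients over a sampled batch.
# batch_size = n recovers GD; batch_size = 1 recovers SGD.
import random

pts = [(2.0, 5.0), (3.0, 6.0), (4.0, 7.0)]
random.seed(0)

def batch_gradient(theta, batch):
    return sum(2 * x * (x * theta - y) for x, y in batch) / len(batch)

theta, eta, batch_size = 0.0, 0.02, 2
for _ in range(2000):
    batch = random.sample(pts, batch_size)      # sampled mini-batch
    theta -= eta * batch_gradient(theta, batch)
# theta hovers near the full-MSE minimizer ~1.93, with less noise than SGD
```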
Most ML methods can be formulated as optimization problems. We won't always be able to solve optimization problems analytically (in closed form) or efficiently.
We can still use numerical algorithms to good effect. Lots of sophisticated ones are available; gradient descent is one of the simplest.
GD is an iterative algorithm: it keeps applying the parameter update rule.
Under appropriate conditions (most notably, when the objective function is convex and the learning rate is small enough), GD can guarantee convergence to a global minimum.
SGD approximates GD: it uses a single data point to approximate the gradient over the entire data set. It is more efficient and more random, with fewer guarantees.
Mini-batch GD is a middle ground between GD and SGD.
We'd love to hear your thoughts.
What do we need to know:
\(J(\theta) = \frac{1}{3}\left[(2 \theta-5)^2+(3 \theta-6)^{2}+(4 \theta-7)^2\right]\)
\(\nabla_\theta J = \frac{1}{3}[4(2 \theta-5)+6(3 \theta-6)+8(4 \theta-7)]\)
 | x | y |
---|---|---|
p1 | 2 | 5 |
p2 | 3 | 6 |
p3 | 4 | 7 |