Lecture 3: Gradient Descent Methods

Shen Shen

September 13, 2024

Intro to Machine Learning

For $f: \mathbb{R}^m \rightarrow \mathbb{R}$, its gradient $\nabla f: \mathbb{R}^m \rightarrow \mathbb{R}^m$ is defined at the point $p=\left(x_1, \ldots, x_m\right)$ in $m$-dimensional space as the vector

\nabla f(p)=\left[\begin{array}{c} \frac{\partial f}{\partial x_1}(p) \\ \vdots \\ \frac{\partial f}{\partial x_m}(p) \end{array}\right]

1. Generalizes 1-dimensional derivatives.
2. By construction, always has the same dimensionality as the function input.

(Aside: sometimes, the gradient doesn't exist, or doesn't behave nicely, as we'll see later in this course. For today, we have well-defined, nice, gradients.)

Gradient

One example:

f(x, y, z) = x^2 + y^3 + z

A gradient can be the (symbolic) function

\nabla f(x, y, z) = \begin{bmatrix} 2x \\ 3y^2 \\ 1 \end{bmatrix}

or, we can also evaluate the gradient function at a point and get a (numerical) gradient vector:

\nabla f(3, 2, 1) = \begin{bmatrix} 6\\ 12 \\ 1 \end{bmatrix}

exactly like how a derivative can be both a function and a number.
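The function-versus-number distinction can be sketched in a few lines of Python. This is only an illustration: the helper names `grad_f` and `numerical_grad` and the step size `h` are assumptions made for this sketch, not part of the lecture. The central-difference check is there only to confirm the hand-derived gradient.

```python
def f(x, y, z):
    # The example function from the slide: f(x, y, z) = x^2 + y^3 + z
    return x**2 + y**3 + z

def grad_f(x, y, z):
    # The gradient as a (symbolic) function, derived by hand
    return [2 * x, 3 * y**2, 1]

def numerical_grad(func, point, h=1e-6):
    # Central-difference approximation of each partial derivative
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((func(*forward) - func(*backward)) / (2 * h))
    return grad

# Evaluating the gradient function at a point gives a numerical vector
analytic = grad_f(3, 2, 1)
approx = numerical_grad(f, [3, 2, 1])
print(analytic)  # [6, 12, 1]
print(approx)    # close to the analytic values, up to finite-difference error
```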

3. The gradient points in the direction of the (steepest) increase in the function value.

$\frac{d}{dx} \cos(x) \bigg|_{x = -4} = -\sin(-4) \approx -0.7568$

$\frac{d}{dx} \cos(x) \bigg|_{x = 5} = -\sin(5) \approx 0.9589$
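As a small sanity check of the two derivative values above, the following sketch takes a tiny step in the direction given by the sign of the derivative and confirms that $\cos$ increases. The step size `0.01` and the variable names are assumptions chosen for illustration.

```python
import math

def dcos(x):
    # Derivative of cos(x)
    return -math.sin(x)

step = 0.01  # small illustrative step size (an assumption)
results = []
for x in (-4.0, 5.0):
    slope = dcos(x)
    moved = x + step * slope  # move in the direction the derivative points
    increased = math.cos(moved) > math.cos(x)
    results.append((x, round(slope, 4), increased))
    print(x, round(slope, 4), increased)
```

At $x=-4$ the derivative is negative, so moving toward smaller $x$ increases the function; at $x=5$ it is positive, so moving toward larger $x$ does.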

Gradient Descent Performance

Assumptions:
- $f$ is sufficiently "smooth" (if violated: may not have a gradient, so we can't run gradient descent)
- $f$ has at least one global minimum (if violated: may not terminate, since there is no minimum to converge to)
- Run the algorithm long enough
- $\eta$ is sufficiently small (if violated: see demo on next slide, also lab/recitation/hw)
- $f$ is convex (if violated: may get stuck at a saddle point or a local minimum)
Conclusion:
- Gradient descent will return a parameter value within $\tilde{\epsilon}$ of a global minimum (for any chosen $\tilde{\epsilon}>0$)
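To make the role of the step-size assumption concrete, here is a minimal sketch that assumes the usual update rule $\theta \leftarrow \theta - \eta \nabla f(\theta)$. The objective $f(\theta) = (\theta - 3)^2$, the function name `gradient_descent`, and the two values of $\eta$ are hypothetical choices for illustration. This toy $f$ is smooth and convex with a global minimum at $\theta = 3$.

```python
def gradient_descent(grad, theta_init, eta, num_steps):
    # Repeatedly step against the gradient with a fixed step size eta
    theta = theta_init
    for _ in range(num_steps):
        theta = theta - eta * grad(theta)
    return theta

def grad(theta):
    # Gradient of the toy objective f(theta) = (theta - 3)**2
    return 2 * (theta - 3)

# A small step size: the iterate approaches the minimizer theta = 3
theta_small = gradient_descent(grad, theta_init=0.0, eta=0.1, num_steps=100)

# A step size that is too large for this f: the iterate moves away from 3
theta_large = gradient_descent(grad, theta_init=0.0, eta=1.1, num_steps=100)

print(theta_small)
print(theta_large)
```

For this particular quadratic, each step multiplies the distance to the minimizer by $|1 - 2\eta|$, so the iterates shrink toward 3 only when $0 < \eta < 1$.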

Gradient of an ML objective

An ML objective function is a finite sum. In general,

f(\Theta)=\frac{1}{n} \sum_{i=1}^n f_i(\Theta)

For instance, the MSE of a linear hypothesis:

\frac{1}{n} \sum_{i=1}^n\left(\theta^{\top} x^{(i)}-y^{(i)}\right)^2

The gradient of an ML objective: in general,

\nabla f(\Theta) = \nabla \left(\frac{1}{n} \sum_{i=1}^n f_i(\Theta)\right) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(\Theta)

(gradient of the sum) = (sum of the gradient)

For instance, the MSE of a linear hypothesis has gradient w.r.t. $\theta$:

\frac{2}{n} \sum_{i=1}^n\left(\theta^{\top} x^{(i)}-y^{(i)}\right) x^{(i)}
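The "gradient of the sum equals sum of the gradients" identity for the MSE can be sketched as follows. The tiny dataset, the parameter vector, and the helper names (`dot`, `per_point_grad`, `mse_grad`) are all made up for illustration. The finite-difference comparison is only a sanity check of the averaged per-point gradients.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mse(theta, xs, ys):
    # f(theta) = (1/n) * sum_i (theta^T x_i - y_i)^2
    return sum((dot(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def per_point_grad(theta, x, y):
    # Gradient of f_i(theta) = (theta^T x - y)^2, i.e. 2 (theta^T x - y) x
    residual = dot(theta, x) - y
    return [2 * residual * xj for xj in x]

def mse_grad(theta, xs, ys):
    # Average of the per-point gradients
    n = len(xs)
    total = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        g = per_point_grad(theta, x, y)
        total = [t + gj for t, gj in zip(total, g)]
    return [t / n for t in total]

# Hypothetical data, chosen only for this sketch
xs = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
ys = [1.0, 2.0, 3.0]
theta = [0.5, -0.25]

analytic = mse_grad(theta, xs, ys)

# Central-difference gradient of the full objective, for comparison
h = 1e-6
numeric = []
for j in range(len(theta)):
    up = list(theta)
    down = list(theta)
    up[j] += h
    down[j] -= h
    numeric.append((mse(up, xs, ys) - mse(down, xs, ys)) / (2 * h))

print(analytic)
print(numeric)
```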

Concrete example

Three data points:

{(2,5), (3,6), (4,7)}

Fit a line (without offset) to the dataset, MSE:

f(\theta) = \frac{1}{3}\left[(2 \theta-5)^2+(3 \theta-6)^{2}+(4 \theta-7)^2\right]

\nabla_\theta f = \frac{2}{3}[2(2 \theta-5)+3(3 \theta-6)+4(4 \theta-7)]

Each term inside the brackets is one data point's "pull" on $\theta$:
- First data point's "pull": $2(2 \theta-5)$
- Second data point's "pull": $3(3 \theta-6)$
- Third data point's "pull": $4(4 \theta-7)$
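The three "pulls" can be computed directly. The sketch below uses the three data points from the slide; the function names `pulls` and `grad` are assumptions for illustration. Setting the gradient to zero gives $29\theta - 56 = 0$, so the pulls cancel at $\theta = 56/29$.

```python
data = [(2, 5), (3, 6), (4, 7)]

def pulls(theta):
    # One term x * (x * theta - y) per data point
    return [x * (x * theta - y) for x, y in data]

def grad(theta):
    # Gradient of the MSE: (2/3) times the sum of the three pulls
    return 2 * sum(pulls(theta)) / 3

print(pulls(1))  # [-6, -9, -12]
print(grad(1))   # -18.0

# At theta = 56/29 the pulls cancel and the gradient is (numerically) zero
theta_star = 56 / 29
print(grad(theta_star))
```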

Stochastic gradient descent performance

Assumptions:
- $f$ is sufficiently "smooth"
- $f$ has at least one global minimum
- Run the algorithm long enough
- $\eta$ is sufficiently small and satisfies an additional "scheduling" condition: $\sum_{t=1}^{\infty} \eta(t)=\infty$ and $\sum_{t=1}^{\infty} \eta(t)^2<\infty$
- $f$ is convex
Conclusion:
- Stochastic gradient descent will return a parameter value within $\tilde{\epsilon}$ of a global minimum with probability 1 (for any chosen $\tilde{\epsilon}>0$)
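A minimal sketch of a step-size schedule that satisfies both sums, applied to the three-point line-fitting example. It assumes the common form of stochastic gradient descent in which each step uses the gradient of one data point picked uniformly at random. The schedule $\eta(t) = 0.05/t$, the seed, the step count, and the function names are hypothetical choices for illustration.

```python
import random

data = [(2, 5), (3, 6), (4, 7)]

def point_grad(theta, x, y):
    # Gradient of the single-point loss (x * theta - y)^2
    return 2 * x * (x * theta - y)

def sgd(theta_init, num_steps, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    theta = theta_init
    for t in range(1, num_steps + 1):
        # eta(t) = 0.05 / t: the sum of eta(t) diverges,
        # while the sum of eta(t)^2 stays finite
        eta = 0.05 / t
        x, y = rng.choice(data)
        theta = theta - eta * point_grad(theta, x, y)
    return theta

theta_hat = sgd(theta_init=0.0, num_steps=5000)
print(theta_hat)  # should land near the MSE minimizer 56/29
```

A schedule that decays like $1/t$ is one standard way to meet the condition; a constant $\eta$ would violate the second sum.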

https://introml.mit.edu/

6.390 IntroML (Fall24) - Lecture 3 Gradient Descent Methods

shenshen.mit.edu
