Shen Shen
September 13, 2024
Recall
For \(f: \mathbb{R}^m \rightarrow \mathbb{R}\), its gradient \(\nabla f: \mathbb{R}^m \rightarrow \mathbb{R}^m\) is defined at the point \(p=\left(x_1, \ldots, x_m\right)\) in \(m\)-dimensional space as the vector of partial derivatives

\(\nabla f(p)=\left[\frac{\partial f}{\partial x_1}(p), \ldots, \frac{\partial f}{\partial x_m}(p)\right]^{\top}\)
(Aside: sometimes, the gradient doesn't exist, or doesn't behave nicely, as we'll see later in this course. For today, we have well-defined, nice, gradients.)
A gradient can be a (symbolic) function, and we can also evaluate the gradient function at a point to get (numerical) gradient vectors — exactly like how a derivative can be both a function and a number. Moreover, the gradient points in the direction of the (steepest) increase in the function value.
\(\frac{d}{dx} \cos(x) \bigg|_{x = -4} = -\sin(-4) \approx -0.7568\)
\(\frac{d}{dx} \cos(x) \bigg|_{x = 5} = -\sin(5) \approx 0.9589\)
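The same distinction shows up in code: the gradient as a function is a callable, and evaluating it at a point gives a numerical vector. A minimal sketch, using a made-up example \(f(x, y) = x^2 + \sin(y)\) (this function is illustrative, not from the slides):

```python
import math

def f(x, y):
    return x**2 + math.sin(y)

# The (symbolic) gradient of f, worked out by hand, kept as a function:
def grad_f(x, y):
    return (2 * x, math.cos(y))

# Evaluating the gradient function at a point gives a numerical vector,
# just like evaluating the derivative of cos at x = 5 gives a number.
g = grad_f(1.0, 0.0)
print(g)  # (2.0, 1.0)
```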
Hyperparameters:
- initial guess of parameters
- learning rate, aka step size
- precision
Gradient descent, with hyperparameters \(\Theta_{\text{init}}\) (initial guess), \(\eta\) (learning rate), and \(\epsilon\) (precision):

1: procedure GD\(\left(\Theta_{\text{init}}, \eta, f, \nabla_{\Theta} f, \epsilon\right)\)
2:   \(\Theta^{(0)} = \Theta_{\text{init}}\)
3:   \(t = 0\)
4:   repeat
5:     \(t = t + 1\)
6:     \(\Theta^{(t)} = \Theta^{(t-1)} - \eta\, \nabla_{\Theta} f\left(\Theta^{(t-1)}\right)\)
7:   until \(\left|f\left(\Theta^{(t)}\right) - f\left(\Theta^{(t-1)}\right)\right| < \epsilon\)
8:   return \(\Theta^{(t)}\)
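The eight-line gradient-descent procedure can be sketched in plain Python (a minimal sketch; the function name `gd` and the quadratic test function are illustrative, not from the slides):

```python
def gd(theta_init, eta, f, grad_f, eps, max_iter=10000):
    """Gradient descent with the |f-change| < eps stopping criterion (line 7)."""
    theta = theta_init
    for _ in range(max_iter):
        theta_new = theta - eta * grad_f(theta)       # line 6: the GD update
        if abs(f(theta_new) - f(theta)) < eps:        # line 7: stopping check
            return theta_new
        theta = theta_new
    return theta

# Illustrative use: minimize f(theta) = (theta - 3)^2, whose minimizer is 3.
f = lambda th: (th - 3.0) ** 2
grad_f = lambda th: 2.0 * (th - 3.0)
theta_star = gd(theta_init=0.0, eta=0.1, f=f, grad_f=grad_f, eps=1e-12)
print(theta_star)  # very close to 3.0
```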
Q: if the stopping condition on line 7 is satisfied, what does it imply?
A: the gradient at the current parameter is almost zero.
Other possible stopping criteria for line 7:
- run for a fixed number of iterations \(T\)
- stop when the parameters barely move: \(\left\|\Theta^{(t)} - \Theta^{(t-1)}\right\| < \epsilon\)
- stop when the gradient is almost zero: \(\left\|\nabla_{\Theta} f\left(\Theta^{(t)}\right)\right\| < \epsilon\)
When minimizing a function, we'd hope to get to a global minimizer.

At a global minimizer, the gradient vector is the zero vector \((\Rightarrow)\). The converse does not hold in general \((\nLeftarrow)\): a zero gradient need not mean a global minimizer.

The converse \((\Leftarrow)\) does hold if the function is a convex function: for a convex function, any point where the gradient is the zero vector is a global minimizer.
A function \(f\) is convex if any line segment connecting two points of the graph of \(f\) lies above or on the graph.
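The line-segment condition can be spot-checked numerically: for a convex \(f\), \(f(\lambda a + (1-\lambda) b) \le \lambda f(a) + (1-\lambda) f(b)\) for all \(\lambda \in [0,1]\). A sketch (the example functions are illustrative, not from the slides):

```python
def violates_convexity(f, a, b, lam):
    """True if the chord from (a, f(a)) to (b, f(b)) dips below the graph."""
    chord = lam * f(a) + (1 - lam) * f(b)
    return f(lam * a + (1 - lam) * b) > chord + 1e-12

convex = lambda x: x ** 2          # a convex example
nonconvex = lambda x: x ** 3 - x   # a non-convex example

print(violates_convexity(convex, -2.0, 2.0, 0.5))     # False
print(violates_convexity(nonconvex, -2.0, 2.0, 0.75)) # True
```

Note this only certifies a violation; showing convexity would require the inequality to hold for every pair of points and every \(\lambda\).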
Some examples:

[figure: plots of example convex functions and example non-convex functions]
What do we need to know (for the convergence guarantee to hold):
- \(f\) is differentiable — if violated, we may not have a gradient, and can't run gradient descent.
- \(f\) has a minimum — if violated, gradient descent may not terminate; there is no minimum to converge to.
- the learning rate \(\eta\) is small enough — if violated: see the demo on the next slide, and also lab/recitation/hw.
- \(f\) is convex — if violated, we may get stuck at a saddle point or a local minimum.
In general, gradients are linear: (gradient of the sum) = (sum of the gradients). For instance, \(\nabla_{\Theta} \sum_{i} f_i(\Theta) = \sum_{i} \nabla_{\Theta} f_i(\Theta)\).
Concrete example

Three data points: \(\{(2,5), (3,6), (4,7)\}\)

Fit a line (without offset), \(h(x; \theta) = \theta x\), to the dataset. MSE:

\(J(\theta) = \frac{1}{3}\left[(2\theta - 5)^2 + (3\theta - 6)^2 + (4\theta - 7)^2\right]\)

The gradient is the sum of three per-point terms:

\(\frac{dJ}{d\theta} = \underbrace{\tfrac{2}{3} \cdot 2(2\theta - 5)}_{\text{first data point's ``pull''}} + \underbrace{\tfrac{2}{3} \cdot 3(3\theta - 6)}_{\text{second data point's ``pull''}} + \underbrace{\tfrac{2}{3} \cdot 4(4\theta - 7)}_{\text{third data point's ``pull''}}\)
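A quick numeric check of this three-point example (a sketch; it assumes the MSE \(J(\theta)=\frac{1}{3}\sum_i (\theta x^{(i)} - y^{(i)})^2\), and the helper names are illustrative):

```python
data = [(2, 5), (3, 6), (4, 7)]

def mse(theta):
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def pull(theta, x, y):
    """One data point's contribution to dJ/dtheta."""
    return (2 / len(data)) * x * (theta * x - y)

def grad(theta):
    # gradient of the sum = sum of the per-point "pulls"
    return sum(pull(theta, x, y) for x, y in data)

# Run gradient descent from theta = 0 with a small step size.
theta, eta = 0.0, 0.01
for _ in range(2000):
    theta -= eta * grad(theta)
print(round(theta, 3))  # 1.931, i.e. the minimizer 56/29
```

Setting \(\frac{dJ}{d\theta}=0\) analytically gives \(\theta = \frac{\sum_i x^{(i)} y^{(i)}}{\sum_i (x^{(i)})^2} = \frac{56}{29} \approx 1.931\), matching the numerical result.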
Stochastic gradient descent (SGD): at each step, update using only the gradient contributed by a single, randomly picked data point \(i\):

\(\Theta^{(t)} = \Theta^{(t-1)} - \eta(t)\, \nabla_{\Theta} f_i\left(\Theta^{(t-1)}\right)\)

Under appropriate conditions, SGD converges if the step sizes satisfy \(\sum_{t=1}^{\infty} \eta(t)=\infty\) and \(\sum_{t=1}^{\infty} \eta(t)^2<\infty\).
Compared with GD, SGD:
- is more "random"
- is more efficient (each update uses a single data point)
- may get us out of a local min
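The SGD update can be sketched on the same three-point example (a sketch; the schedule \(\eta(t)=\frac{1}{10+t}\) is one illustrative choice satisfying \(\sum_t \eta(t)=\infty\) and \(\sum_t \eta(t)^2<\infty\)):

```python
import random

data = [(2, 5), (3, 6), (4, 7)]

def point_grad(theta, x, y):
    # gradient of one data point's squared error (theta*x - y)^2
    return 2 * x * (theta * x - y)

random.seed(0)
theta = 0.0
for t in range(1, 20001):
    x, y = random.choice(data)   # randomly picked data point i
    eta = 1.0 / (10 + t)         # decaying step-size schedule
    theta -= eta * point_grad(theta, x, y)
print(round(theta, 2))  # close to the MSE minimizer 56/29 ≈ 1.93
```

Each step is cheap (one data point instead of all three), and the decaying step size damps the randomness so the iterates settle near the minimizer rather than bouncing around it.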
Most ML methods can be formulated as optimization problems.
We won’t always be able to solve optimization problems analytically (in closed-form).
We won’t always be able to solve (for a global optimum) efficiently.
We can still use numerical algorithms to good effect. Lots of sophisticated ones available.
Introduce the idea of gradient descent in 1D: only two directions! But magnitude of step is important.
In higher dimensions the direction is very important as well as magnitude.
GD, under appropriate conditions (most notably, when objective function is convex), can guarantee convergence to a global minimum.
SGD: an approximation of GD; more efficient, more random, and with fewer guarantees.
We'd love to hear your thoughts.