An introduction to gradient descent algorithms

 Abdellah CHKIFA

abdellah.chkifa@um6p.ma

Outline

1. Gradient Descent

  • Gradient Descent
  • Linear regression
  • Batch Gradient Descent
  • Stochastic Gradient Descent 

2. Back-propagation

  • Logistic regression
  • Soft-max regression
  • Neural networks
  • Deep neural networks

Gradient Descent: an intuitive algorithm

We want to minimize an objective function f (also called a loss function or cost function):

min_{\theta \in \Omega} f(\theta)

Starting from an initial guess \theta^{(0)}, gradient descent iterates

\theta^{(k+1)} = \theta^{(k)} - \eta \nabla f (\theta^{(k)})

where \eta > 0 is the learning rate.
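A minimal Python sketch of this loop, assuming a user-supplied gradient `grad_f` and a fixed number of iterations (the names `gradient_descent` and `grad_f` are illustrative choices):

```python
import numpy as np

def gradient_descent(grad_f, theta0, eta, n_iters=1000):
    """Iterate theta <- theta - eta * grad_f(theta) from the initial guess theta0."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - eta * grad_f(theta)  # the gradient descent update
    return theta

# Example: f(theta) = ||theta||^2 has gradient 2*theta and minimizer theta* = 0.
print(gradient_descent(lambda th: 2 * th, theta0=[4.0, -3.0], eta=0.1))
```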

Importance of learning rate

We consider GD applied to

f: x\mapsto x^2

with x_0 = 4 and learning rates \eta=0.1, \eta=0.01, \eta=0.9 and \eta=1.01 (see the sketch below).
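A minimal sketch of these four runs, assuming the gradient of f: x \mapsto x^2 is hard-coded as 2x (the helper name `gd_1d` and the iteration count are illustrative):

```python
def gd_1d(x0, eta, n_iters=100):
    """Gradient descent on f(x) = x^2, whose gradient is 2x, starting from x0."""
    x = x0
    for _ in range(n_iters):
        x = x - eta * 2 * x  # equivalent to x <- (1 - 2*eta) * x
    return x

for eta in [0.1, 0.01, 0.9, 1.01]:
    print(f"eta = {eta:5}: x_100 = {gd_1d(4.0, eta):.3e}")
# eta = 0.1  -> fast monotone convergence towards 0
# eta = 0.01 -> converges, but slowly
# eta = 0.9  -> oscillates around 0 (factor 1 - 2*eta = -0.8) yet still converges
# eta = 1.01 -> diverges, since |1 - 2*eta| = 1.02 > 1
```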

Gradient Descent: a convergence theorem

\theta^{(k+1)} = \theta^{(k)} - \eta \nabla f (\theta^{(k)})

\theta^{(0)}: initial guess

f: {\mathbb R}^d \to {\mathbb R} convex:

f(t\theta + (1-t)\theta') \leq t f(\theta) + (1-t) f(\theta'),\quad\quad \theta,\theta'\in{\mathbb R^d},\ t\in [0,1]

f L-smooth:

\| \nabla f(\theta) - \nabla f(\theta')\| \leq L \|\theta-\theta'\|,\quad\quad \theta,\theta'\in{\mathbb R^d}.

Theorem. Suppose that f is convex and L-smooth. The gradient descent algorithm with η < 1/L converges to θ* and yields the convergence rate

0 \leq f(\theta^{(k)}) - f(\theta^*) \leq \frac {\|\theta^{(0)} - \theta^*\|^2}{2\eta k}
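A numerical sanity check of this O(1/k) bound, assuming a least-squares objective f(θ) = ½‖Aθ − b‖², which is convex and L-smooth with L the largest eigenvalue of AᵀA (the data and step size below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

f = lambda th: 0.5 * np.sum((A @ th - b) ** 2)      # least-squares objective
grad = lambda th: A.T @ (A @ th - b)

L = np.linalg.eigvalsh(A.T @ A).max()               # smoothness constant
eta = 0.9 / L                                       # step size below 1/L
theta_star = np.linalg.lstsq(A, b, rcond=None)[0]   # exact minimizer

theta0 = np.zeros(5)
theta = theta0.copy()
for k in range(1, 201):
    theta = theta - eta * grad(theta)
    bound = np.linalg.norm(theta0 - theta_star) ** 2 / (2 * eta * k)
    assert f(theta) - f(theta_star) <= bound + 1e-12  # the O(1/k) guarantee
print("optimality gap at k = 200:", f(theta) - f(theta_star))
```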

Gradient Descent: another convergence theorem

\theta^{(k+1)} = \theta^{(k)} - \eta \nabla f (\theta^{(k)})

\theta^{(0)}: initial guess

f: {\mathbb R}^d \to {\mathbb R} μ-strongly convex:

f(\theta') \geq f(\theta) + ⟨ \nabla f(\theta), \theta'-\theta ⟩ + \frac \mu 2 \|\theta'-\theta\|^2 ,\quad \quad \theta,\theta'\in{\mathbb R^d}.

Theorem. Suppose that f is μ-strongly convex and L-smooth. The gradient descent algorithm with η < 1/L converges to θ* with

\| \theta^{(k)} - \theta^* \| \leq (1-\eta \mu)^k \|\theta^{(0)} - \theta^* \|

μ-strong convexity can be weakened to the Polyak-Łojasiewicz condition.
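A similar sanity check of the geometric rate, assuming a strongly convex quadratic f(θ) = ½θᵀQθ − bᵀθ, where μ and L are taken as the extreme eigenvalues of Q (an illustrative setup):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
Q = M @ M.T + np.eye(5)                    # symmetric positive definite
b = rng.standard_normal(5)

grad = lambda th: Q @ th - b               # gradient of f(theta) = 0.5*theta^T Q theta - b^T theta
eigs = np.linalg.eigvalsh(Q)
mu, L = eigs.min(), eigs.max()             # strong convexity and smoothness constants
eta = 0.9 / L                              # step size below 1/L

theta_star = np.linalg.solve(Q, b)         # exact minimizer
theta0 = rng.standard_normal(5)
theta = theta0.copy()
for k in range(1, 101):
    theta = theta - eta * grad(theta)
    rhs = (1 - eta * mu) ** k * np.linalg.norm(theta0 - theta_star)
    assert np.linalg.norm(theta - theta_star) <= rhs + 1e-12  # geometric convergence
print("distance to theta* after 100 steps:", np.linalg.norm(theta - theta_star))
```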

Pros and Cons

Pros:

  • Easy to implement.
  • Each iteration is (relatively!) cheap: one gradient computation.
  • Can be very fast for smooth objective functions, as shown.

Cons:

  • The coefficient L may not be accurately estimated, which forces the choice of a small learning rate.
  • Concrete objective functions are usually neither convex nor strongly convex.
  • GD does not handle non-differentiable functions (biggest downside).

Reference

https://www.stat.cmu.edu/~ryantibs/convexopt/

IndabaX Morocco 2019