A very very short introduction
What $(x, y)$ values will minimize my function?
Convexity guarantees a single global minimum
Follow the slope!
Start from some initial point $x_0$ and iterate:
$$ x_{n+1} = x_{n} - [ \nabla^2 f(x_n) ]^{-1} \nabla f (x_n) $$
Computing and inverting the Hessian can be very costly; quasi-Newton methods work around this by building an approximation of the inverse Hessian from gradient evaluations.
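The Newton update above can be sketched in a few lines of NumPy. This is a minimal illustration (function names are mine, not from the slides); note that it solves a linear system rather than inverting the Hessian explicitly:

```python
import numpy as np

def newton_minimize(grad, hess, x0, n_iter=20):
    """Newton's method: x_{n+1} = x_n - H(x_n)^{-1} grad(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # Solve H dx = grad instead of forming the inverse Hessian
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Convex quadratic f(x) = 0.5 x^T Q x - b^T x, minimized at Q^{-1} b;
# Newton converges in a single step on a quadratic
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = newton_minimize(lambda x: Q @ x - b, lambda x: Q, np.zeros(2))
```

On a convex quadratic the Hessian is constant and one Newton step lands exactly on the minimizer, which makes this a convenient sanity check.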
Example loss function for regression:
$$ L = \parallel y - f_{w}(x) \parallel^2 $$
$$ L = \sum\limits_{i=1}^{N} (y_i - f_w(x_i))^2$$
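As a concrete instance of this loss, here is a sketch for a linear model $f_w(x) = w^\top x$ on synthetic data (the data and weights are illustrative, not from the slides):

```python
import numpy as np

def squared_loss(w, x, y):
    """L = sum_i (y_i - f_w(x_i))^2 for a linear model f_w(x) = w . x."""
    return np.sum((y - x @ w) ** 2)

# Synthetic noise-free data generated from known weights
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = x @ w_true

loss_at_truth = squared_loss(w_true, x, y)   # zero: model matches data exactly
loss_at_zero = squared_loss(np.zeros(3), x, y)  # positive elsewhere
```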
Deconvolution
Inpainting
Denoising
$$ y = \bm{A} x + n $$
$\bm{A}$ is non-invertible or ill-conditioned
$$ L = \parallel y - \bm{A} x \parallel^2 + R(x) $$
$$ R(x) = \lambda \parallel x \parallel_2^2 $$
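With the squared-$\ell_2$ (Tikhonov) penalty the minimizer has a closed form via the normal equations, $(\bm{A}^\top \bm{A} + \lambda I)\, x = \bm{A}^\top y$. A minimal sketch (the matrix and data are illustrative):

```python
import numpy as np

def tikhonov(A, y, lam):
    """Minimize ||y - A x||^2 + lam ||x||_2^2 via the normal equations:
    (A^T A + lam I) x = A^T y."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# An ill-conditioned forward operator: the unregularized normal equations
# are nearly singular, but the regularized solve stays stable
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-8]])
y = np.array([2.0, 2.0])
x_reg = tikhonov(A, y, lam=1e-3)
```

Adding $\lambda I$ shifts every eigenvalue of $\bm{A}^\top \bm{A}$ away from zero, which is exactly what makes the ill-conditioned system solvable in practice.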
Check out the Numerical Tours
$$ L = \parallel y - \bm{A} x \parallel^2 + R(x) $$
$$ R(x) = \lambda \parallel \Phi x \parallel_1 $$
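The $\ell_1$ penalty has no closed form, but it can be minimized by iterative soft-thresholding (ISTA), one standard algorithm for this objective. A sketch for the special case $\Phi = I$, i.e. $x$ itself is sparse (the measurement setup is illustrative):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (entrywise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Iterative soft-thresholding for min ||y - A x||^2 + lam ||x||_1,
    taking Phi = identity (x itself is sparse)."""
    L = 2 * np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2 * A.T @ (A @ x - y)       # gradient of the data-fidelity term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Recover a sparse signal from a few random measurements
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 50))
x_true = np.zeros(50)
x_true[[3, 17, 40]] = [2.0, -1.5, 1.0]
y = A @ x_true
x_hat = ista(A, y, lam=0.1)
```

Each iteration takes a gradient step on the data-fidelity term, then shrinks every entry toward zero, which is what drives small coefficients exactly to zero and produces sparsity.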