Optimization

A very very short introduction

It all starts with a function

$$ z = f(x, y) $$

What (x, y) values will minimize my function?

$$ f(\bm{x}) $$

What can you tell me about your function?

Convexity

Convexity guarantees that any local minimum is a global minimum (strict convexity guarantees it is unique)

Differentiability 

$$ f^\prime(a) = \lim\limits_{h \rightarrow 0} \frac{f(a + h) - f(a)}{h} $$
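
As a quick numerical check of this definition, here is a finite-difference sketch (the test function, point, and step size h are arbitrary choices):

```python
def numerical_derivative(f, a, h=1e-5):
    """Forward-difference approximation of f'(a), following the limit definition."""
    return (f(a + h) - f(a)) / h

# Example: f(x) = x**2, whose exact derivative at a = 3 is 6.
f = lambda x: x**2
print(numerical_derivative(f, 3.0))   # ~6.00001
```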

The idea behind gradient descent

Follow the slope!

$$ \nabla f = \left[\begin{matrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{matrix}\right] $$
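
The same finite-difference idea extends to the gradient, one partial derivative at a time; a minimal sketch (the test function is an arbitrary choice):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x by perturbing one coordinate at a time."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = h
        grad[i] = (f(x + step) - f(x)) / h
    return grad

# Example: f(x, y) = x**2 + 3*y, whose gradient is [2x, 3].
f = lambda p: p[0]**2 + 3 * p[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))   # ~[2. 3.]
```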

Gradient descent algorithm

Start from some point x

  1. Compute the direction of the gradient
  2. Take a step in that direction
  3. Go back to 1
$$ \bm{x_{n+1}} = \bm{x_{n}} - \mu \nabla f(\bm{x_{n}}) $$
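
A minimal sketch of this update rule (the quadratic test function, step size \( \mu \), and iteration count are arbitrary choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, mu=0.1, n_steps=100):
    """Iterate x_{n+1} = x_n - mu * grad_f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - mu * grad_f(x)
    return x

# Example: f(x, y) = (x - 1)**2 + (y + 2)**2, minimized at (1, -2).
grad_f = lambda p: np.array([2 * (p[0] - 1), 2 * (p[1] + 2)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # ~[ 1. -2.]
```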

How big of a step should I take?

$$ \bm{x_{n+1}} = \bm{x_{n}} - {\color{red} \mu} \nabla f(\bm{x_{n}}) $$
  • Fixed step:
    • If \( \nabla f \) is L-Lipschitz, convergence is guaranteed for \( \mu < 1/L \)
  • Line search: \( \min\limits_{\mu > 0} f( x - \mu \nabla f(x)) \)
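
A sketch of the line-search variant, picking \( \mu \) at each iteration with scipy.optimize.minimize_scalar (the test function and the upper bound on \( \mu \) are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent_line_search(f, grad_f, x0, n_steps=20):
    """At each step, choose mu by minimizing f(x - mu * grad_f(x)) over mu in (0, 1]."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad_f(x)
        best = minimize_scalar(lambda mu: f(x - mu * g), bounds=(0.0, 1.0), method='bounded')
        x = x - best.x * g
    return x

# Example: same bowl as before, minimized at (1, -2).
f = lambda p: (p[0] - 1)**2 + (p[1] + 2)**2
grad_f = lambda p: np.array([2 * (p[0] - 1), 2 * (p[1] + 2)])
print(gradient_descent_line_search(f, grad_f, x0=[0.0, 0.0]))   # ~[ 1. -2.]
```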

Let's have a look at scipy.optimize
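
For example, scipy.optimize.minimize wraps several of these algorithms behind one interface; a minimal run on the same bowl (the starting point and the choice of method='CG', nonlinear conjugate gradient, are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

# Same bowl: f(x, y) = (x - 1)**2 + (y + 2)**2, minimum at (1, -2).
f = lambda p: (p[0] - 1)**2 + (p[1] + 2)**2
grad_f = lambda p: np.array([2 * (p[0] - 1), 2 * (p[1] + 2)])

result = minimize(f, x0=[0.0, 0.0], jac=grad_f, method='CG')
print(result.x, result.success)   # ~[ 1. -2.] True
```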

Newton's method

Newton and quasi-Newton

  • Newton's update:

$$ x_{n+1} = x_{n} - [ \nabla^2 f(x_n) ]^{-1} \nabla f(x_n) $$

where \( \nabla^2 f \) is the Hessian:

$$ \nabla^2 f = \left[\begin{matrix} \frac{\partial^2 f}{\partial x \partial x} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y \partial y} \end{matrix}\right] $$
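
A minimal sketch of this update, solving the linear system rather than explicitly inverting the Hessian (the quadratic test function is an arbitrary choice, so Newton converges in a single step):

```python
import numpy as np

def newton(grad_f, hess_f, x0, n_steps=10):
    """Iterate x_{n+1} = x_n - [hess f(x_n)]^{-1} grad f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - np.linalg.solve(hess_f(x), grad_f(x))
    return x

# Example: f(x, y) = 2*x**2 + x*y + y**2, minimized at (0, 0).
grad_f = lambda p: np.array([4 * p[0] + p[1], p[0] + 2 * p[1]])
hess_f = lambda p: np.array([[4.0, 1.0], [1.0, 2.0]])
print(newton(grad_f, hess_f, x0=[3.0, -2.0]))   # ~[0. 0.]
```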

Computing and inverting the Hessian can be very costly; quasi-Newton methods work around it.
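
For instance, BFGS, a quasi-Newton method in scipy.optimize, only needs gradient evaluations and builds up its own approximation of the (inverse) Hessian:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Rosenbrock function (minimum at [1, 1, 1, 1, 1]); no Hessian is supplied.
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
result = minimize(rosen, x0, jac=rosen_der, method='BFGS')
print(result.x)    # ~[1. 1. 1. 1. 1.]
print(result.nit)  # number of iterations used
```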

Why is this useful?

Deep Neural Networks

Example loss function for regression:
$$ L = \parallel y - f_{w}(x) \parallel^2 $$

$$ L = \sum\limits_{i=1}^{N} (y_i - f_w(x_i))^2$$ 

Stochastic Gradient Descent
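
A minimal SGD sketch for the summed squared-error loss above, with a one-parameter linear model \( f_w(x) = w x \) standing in for the network: each update uses the gradient of the loss on a small random mini-batch rather than the full sum (the synthetic data, learning rate, and batch size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1D regression data: y = 3 * x + noise.
x = rng.normal(size=1000)
y = 3.0 * x + 0.1 * rng.normal(size=1000)

w = 0.0                      # single parameter of the model f_w(x) = w * x
mu, batch_size = 0.05, 32    # learning rate and mini-batch size

for epoch in range(20):
    order = rng.permutation(x.size)
    for start in range(0, x.size, batch_size):
        batch = order[start:start + batch_size]
        # Gradient of sum_i (y_i - w * x_i)**2, restricted to the mini-batch.
        residual = y[batch] - w * x[batch]
        grad = -2.0 * np.sum(residual * x[batch])
        w = w - mu * grad / batch.size

print(w)   # ~3.0
```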

It's not all bad

Inverse problems

Deconvolution

Inpainting

Denoising

$$ y = \bm{A} x + n $$
\( \bm{A} \) is non-invertible or ill-conditioned
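
A small illustration with an arbitrary nearly-singular \( \bm{A} \): the condition number is huge, and the naive solve amplifies the noise wildly.

```python
import numpy as np

rng = np.random.default_rng(0)

# An ill-conditioned forward operator (arbitrary toy example).
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
x_true = np.array([1.0, 2.0])
y = A @ x_true + 1e-3 * rng.normal(size=2)   # y = A x + n

print(np.linalg.cond(A))      # ~4e4: ill-conditioned
print(np.linalg.solve(A, y))  # far from x_true: the small noise is hugely amplified
```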

Regularization

$$ L = \parallel y - \bm{A} x \parallel^2 + R(x) $$

$$ R(x) = \lambda \parallel x \parallel_2 $$
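
A sketch of this on the same ill-conditioned toy example, using the squared form \( \lambda \parallel x \parallel^2 \) of the penalty so the minimizer has a closed form through the normal equations (the value of \( \lambda \) is an arbitrary choice):

```python
import numpy as np

def ridge_solve(A, y, lam):
    """Minimize ||y - A x||^2 + lam * ||x||^2 via its normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Same nearly-singular A and noisy data as above.
rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
x_true = np.array([1.0, 2.0])
y = A @ x_true + 1e-3 * rng.normal(size=2)

print(ridge_solve(A, y, lam=1e-3))   # stable, much closer to x_true than the naive solve
```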

Check out the Numerical Tours

Another example

$$ L = \parallel y - \bm{A} x \parallel^2 + R(x) $$

$$ R(x) = \lambda \parallel \Phi x \parallel_1 $$
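
The \( \ell_1 \) term is non-smooth, so plain gradient descent no longer applies directly; one standard option is the iterative soft-thresholding algorithm (ISTA), sketched here with \( \Phi \) taken as the identity for simplicity (the synthetic data, \( \lambda \), and iteration count are arbitrary choices):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: element-wise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_steps=1000):
    """Minimize ||y - A x||^2 + lam * ||x||_1 (Phi = identity) by proximal gradient."""
    x = np.zeros(A.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(A, 2)**2)   # 1/L for the smooth data-fit term
    for _ in range(n_steps):
        grad = -2.0 * A.T @ (y - A @ x)            # gradient of the data-fit term
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Sparse ground truth observed through a random forward operator.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 100))
x_true = np.zeros(100)
x_true[[5, 42, 77]] = [2.0, -1.5, 3.0]
y = A @ x_true + 0.01 * rng.normal(size=30)

x_hat = ista(A, y, lam=0.5)
print(np.flatnonzero(np.abs(x_hat) > 0.1))   # should (roughly) recover the support [5 42 77]
```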

What I haven't talked about

  • Constrained optimization
  • Simulated annealing
  • NP-hard problems
  • ...

Optimization

By eiffl

Practical statistics series