# Optimization

A very very short introduction

## It all starts with a function

$$z = f(x, y)$$

What $(x, y)$ values will minimize my function?

More generally, for a vector of inputs $\bm{x}$:

$$f(\bm{x})$$

## Convexity

Convexity guarantees that any local minimum is a global minimum (strict convexity makes it unique)
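Formally, $f$ is convex if the line segment between any two points of its graph lies above the graph:

$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y), \quad \forall \lambda \in [0, 1]$$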

# Differentiability

The derivative of $f$ at a point $a$ is defined by the limit:

$$f^\prime(a) = \lim\limits_{h \rightarrow 0 } \frac{ f(a +h) - f(a) } {h}$$
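For intuition, this limit can be checked numerically with a small but finite $h$; a minimal sketch (the test function and step size are illustrative choices):

```python
def numerical_derivative(f, a, h=1e-6):
    """Forward-difference approximation of f'(a)."""
    return (f(a + h) - f(a)) / h

# d/dx x^2 = 2x, so the derivative at x = 3 should be close to 6
print(numerical_derivative(lambda x: x**2, 3.0))  # ~6.000001
```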

## The idea behind gradient descent

The gradient collects the partial derivatives and points in the direction of steepest ascent:

$$\nabla f = \left[\begin{matrix} \frac{ \partial f}{ \partial x} \\ \frac{ \partial f}{ \partial y} \end{matrix}\right]$$

Start from some point $\bm{x}_0$, then repeat:

1. Compute the gradient $\nabla f(\bm{x}_n)$
2. Take a step in the opposite direction (downhill)
3. Go back to 1

$$\bm{x}_{n+1} = \bm{x}_{n} - \mu \nabla f( \bm{x}_{n} )$$
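As a concrete illustration of this update rule, here is a minimal NumPy sketch; the quadratic test function and the step size $\mu = 0.1$ are illustrative choices:

```python
import numpy as np

def gradient_descent(grad_f, x0, mu=0.1, n_steps=100):
    """Iterate x_{n+1} = x_n - mu * grad_f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - mu * grad_f(x)
    return x

# f(x, y) = x^2 + y^2 has gradient (2x, 2y) and its minimum at (0, 0)
grad_f = lambda x: 2 * x
print(gradient_descent(grad_f, [3.0, -2.0]))  # close to [0, 0]
```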

## How big of a step should I take?

$$\bm{x}_{n+1} = \bm{x}_{n} - {\color{red} \mu} \nabla f( \bm{x}_{n} )$$

• Fixed step: if $\nabla f$ is $L$-Lipschitz, convergence is guaranteed for $\mu < 1/L$
• Line search: $\min\limits_{\mu > 0} f( \bm{x} - \mu \nabla f(\bm{x}) )$ (a backtracking sketch follows below)
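In practice the exact minimization above is often replaced by backtracking: start from a large step and shrink it until the objective decreases enough. A minimal sketch of the Armijo backtracking rule (the constants `beta` and `c` are conventional choices, not from the slides):

```python
import numpy as np

def backtracking_step(f, grad_f, x, mu0=1.0, beta=0.5, c=1e-4):
    """Shrink mu until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    mu = mu0
    while f(x - mu * g) > f(x) - c * mu * np.dot(g, g):
        mu *= beta
    return mu
```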

## Newton and quasi-Newton

Newton's update:

$$\bm{x}_{n+1} = \bm{x}_{n} - [ \nabla^2 f(\bm{x}_n) ]^{-1} \nabla f (\bm{x}_n)$$

where $\nabla^2 f$ is the Hessian matrix:

$$\nabla^2 f = \left[\begin{matrix} \frac{ \partial^2 f}{ \partial x^2} & \frac{\partial^2 f}{ \partial x \partial y} \\ \frac{ \partial^2 f}{ \partial y \partial x} & \frac{ \partial^2 f}{ \partial y^2} \end{matrix}\right]$$

Computing and inverting the Hessian can be very costly; quasi-Newton methods work around it by building an approximation to the Hessian from successive gradients (e.g. BFGS, L-BFGS).
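In practice, quasi-Newton methods are available off the shelf, e.g. in SciPy; a minimal sketch on an illustrative quadratic:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: np.sum((x - 1.0) ** 2)   # illustrative objective
grad = lambda x: 2.0 * (x - 1.0)       # its analytic gradient

res = minimize(f, x0=np.zeros(3), jac=grad, method="L-BFGS-B")
print(res.x)  # close to [1, 1, 1]
```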

# Why is this useful?

## Deep Neural Networks

An example loss function for regression:

$$L = \parallel y - f_{w}(x) \parallel^2$$

which, summed over a dataset of $N$ samples, reads:

$$L = \sum\limits_{i=1}^{N} (y_i - f_w(x_i))^2$$
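For instance, with a linear model $f_w(x) = w x$, this loss can be minimized with the gradient descent update from earlier; the synthetic data below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)  # synthetic data, true w = 2

w, mu = 0.0, 1e-3
for _ in range(200):
    grad_L = -2.0 * np.sum((y - w * x) * x)  # dL/dw for L = sum (y_i - w x_i)^2
    w = w - mu * grad_L
print(w)  # close to 2.0
```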

## Inverse problems

• Deconvolution
• Inpainting
• Denoising

$$y = \bm{A} x + n$$
where $\bm{A}$ is non-invertible or ill-conditioned, and $n$ is noise
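A quick way to see the problem: with an ill-conditioned $\bm{A}$, even tiny noise is hugely amplified by naive inversion. A minimal sketch with an illustrative near-singular matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.vander(np.linspace(0, 1, 20), 20)     # notoriously ill-conditioned
x_true = rng.normal(size=20)
y = A @ x_true + 1e-8 * rng.normal(size=20)  # observation with tiny noise n

x_naive = np.linalg.solve(A, y)              # naive inversion
print(np.linalg.cond(A))                     # huge condition number
print(np.linalg.norm(x_naive - x_true))      # large error despite tiny noise
```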

## Regularization

$$L = \parallel y - \bm{A} x \parallel^2 + R(x)$$

$$R(x) = \lambda \parallel x \parallel_2$$
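With the closely related squared variant $R(x) = \lambda \parallel x \parallel_2^2$ (ridge / Tikhonov regularization), the problem even has a closed-form solution; a minimal sketch:

```python
import numpy as np

def ridge_solve(A, y, lam):
    """Minimize ||y - A x||^2 + lam * ||x||_2^2 via the normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)
```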

Check out the Numerical Tours

## Another example

$$L = \parallel y - \bm{A} x \parallel^2 + R(x)$$

$$R(x) = \lambda \parallel \Phi x \parallel_1$$
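The $\ell_1$ term is non-differentiable, so plain gradient descent no longer applies; a standard alternative is proximal gradient descent (ISTA). A minimal sketch assuming $\Phi = I$ for simplicity (in practice $\Phi$ is often a wavelet or other sparsifying transform):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_steps=500):
    """Minimize ||y - A x||^2 + lam * ||x||_1 by proximal gradient."""
    mu = 0.5 / np.linalg.norm(A, 2) ** 2    # 1/L, with L = 2 ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_steps):
        grad = 2.0 * A.T @ (A @ x - y)      # gradient of the smooth term
        x = soft_threshold(x - mu * grad, mu * lam)
    return x
```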

## What I haven't talked about

• Constrained optimization
• Simulated annealing
• NP-Hard problems
• ...

By EiffL

Practical statistics series