Stochastic Truncated Newton with Conjugate Gradient
Luis Manuel Román García
ITAM, 2017
Presentation Overview
- Newton step
- Conjugate Gradient
- Stochastic Truncated Newton with Conjugate Gradient
- Conclusion
Newton Step
Motivation
Suppose we want to find the root of a twice differentiable function f.
Newton Step
Taking this idea one step further, we can look for the root, not of the original function, but of its gradient.
In this case the iteration:

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$

becomes:

$$x_{k+1} = x_k - \left[ \nabla^2 f(x_k) \right]^{-1} \nabla f(x_k)$$
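As a reference, a minimal Python sketch of this iteration (the `grad` and `hess` callables, tolerance, and iteration cap are illustrative assumptions, not part of the original slides):

```python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-8, max_iter=50):
    """Pure Newton iteration on the gradient: x_{k+1} = x_k - H(x_k)^{-1} g(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop once the gradient is (nearly) zero
            break
        # Solve H p = -g rather than explicitly inverting the Hessian
        p = np.linalg.solve(hess(x), -g)
        x = x + p
    return x
```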
Newton Step
Advantages:
- Quadratic convergence near the optimum
Disadvantages:
- It is necessary to compute and store the whole Hessian at every iteration.
Conjugate Gradient
Motivation
Suppose we want to find the solution of the linear system $Ax = b$.
Options:
- Gaussian elimination (scaled partial pivoting)
- Gram-Schmidt procedure
- Householder reduction (unconditionally stable)
- Givens reduction (unconditionally stable)
???
Motivation
HINT: If A is symmetric positive definite, solving $Ax = b$ is equivalent to finding the minimum of:

$$\phi(x) = \tfrac{1}{2}\, x^\top A x - b^\top x$$
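This follows because the gradient of the quadratic vanishes exactly at the solution of the linear system:

$$\nabla \phi(x) = Ax - b = 0 \iff Ax = b$$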
Motivation
If A were a diagonal matrix, the procedure would be straightforward:
just find the minimum along each coordinate axis, one at a time.
Motivation
Of course, in the real world A is rarely diagonal, so minimizing along the coordinate axes is not guaranteed to terminate in a finite number of steps.
Thankfully, since A is symmetric positive definite, there is an invertible matrix $S = [\, s_1 \; s_2 \; \cdots \; s_n \,]$ such that

$$S^\top A S = A'$$

where $A'$ is diagonal; hence, we can minimize along the directions $s_1, \dots, s_n$, which are conjugate with respect to A (that is, $s_i^\top A s_j = 0$ for $i \neq j$).
Conjugate Gradient
Conjugate Gradient
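The algorithm is the standard CG iteration for $Ax = b$ with A symmetric positive definite; a minimal Python sketch (function name and defaults are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Standard CG for Ax = b with A symmetric positive definite."""
    b = np.asarray(b, dtype=float)
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                      # residual = -gradient of the quadratic
    p = r.copy()                       # first search direction
    rs_old = r @ r
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # exact minimizer of the quadratic along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # next direction, A-conjugate to the previous ones
        rs_old = rs_new
    return x
```

Note that A only ever appears through the product `A @ p`, which is what makes the Hessian-free variant later in the deck possible.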
A remarkable property of CG is that the number of iterations needed to reach the exact solution is bounded above by the number of distinct eigenvalues of A. Moreover, if A has eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, then:

$$\lVert x_{k+1} - x^* \rVert_A^2 \le \left( \frac{\lambda_{n-k} - \lambda_1}{\lambda_{n-k} + \lambda_1} \right)^2 \lVert x_0 - x^* \rVert_A^2$$
Stochastic Truncated Newton with Conjugate Gradient
Newton CG
Recall that the Newton step is given by the following formula:

$$p_k = -\left[ \nabla^2 f(x_k) \right]^{-1} \nabla f(x_k)$$

which is equivalent to solving the linear system:

$$\nabla^2 f(x_k)\, p_k = -\nabla f(x_k)$$

When close enough to the optimum, the Hessian is symmetric positive definite. Sounds familiar?
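A minimal sketch of the resulting truncated Newton-CG outer loop, reusing the `conjugate_gradient` sketch above (the callables `grad`, `hess` and the iteration caps are illustrative assumptions):

```python
import numpy as np

def newton_cg(grad, hess, x0, tol=1e-6, max_newton=50, cg_iters=10):
    """Truncated Newton: solve hess(x) p = -grad(x) only approximately with CG."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_newton):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # A handful of CG iterations gives an inexact but cheap Newton direction
        p = conjugate_gradient(hess(x), -g, max_iter=cg_iters)
        x = x + p
    return x
```

Here the Hessian is still formed explicitly; the advantage on the next slide is that CG only needs Hessian-vector products, so even that can be avoided.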
Newton CG
Advantage:
- No need to store the whole Hessian: CG only needs the product $\nabla^2 f(x_k)\, v$, which can be approximated via central finite differences at the cost of two extra gradient evaluations per product (see the sketch after this list).
Disadvantage:
- Less precision in the descent direction.
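A sketch of the central-difference Hessian-vector product just mentioned; the step size `h` is an illustrative choice:

```python
import numpy as np

def hessian_vector_product(grad, x, v, h=1e-5):
    """Approximate (Hessian of f at x) @ v with two gradient evaluations."""
    return (grad(x + h * v) - grad(x - h * v)) / (2.0 * h)
```

Inside the CG loop, every product `A @ p` is then replaced by `hessian_vector_product(grad, x, p)`, so the Hessian is never formed or stored.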
Newton CG
Now, if we add a stochastic element and use a sample instead of the whole data set, much of the per-iteration cost can be reduced, although the number of iterations might grow. There are now multiple hyper-parameters (a sketch combining these pieces follows the list):
- The size of the sample (both for the Hessian and the gradient)
- The number of CG iterations
- The size of the step
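A sketch of the subsampled variant, reusing `hessian_vector_product` from the previous sketch. The hyper-parameters are exactly the ones listed above; `grad_batch(x, idx)`, the default sample sizes, and the fixed step size are illustrative assumptions:

```python
import numpy as np

def stochastic_newton_cg(grad_batch, x0, n_data,
                         grad_sample=256, hess_sample=64,
                         cg_iters=5, step_size=1.0,
                         max_iter=100, h=1e-5, seed=None):
    """Subsampled truncated Newton-CG (illustrative hyper-parameter defaults).

    grad_batch(x, idx) is assumed to return the gradient of the objective
    averaged over the data points indexed by idx.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g_idx = rng.choice(n_data, size=grad_sample, replace=False)
        h_idx = rng.choice(n_data, size=hess_sample, replace=False)
        g = grad_batch(x, g_idx)                    # subsampled gradient
        hess_grad = lambda z: grad_batch(z, h_idx)  # gradient on the Hessian sample

        # Truncated, matrix-free CG on the subsampled system  H_S p = -g
        p = np.zeros_like(x)
        r = -g
        d = r.copy()
        rs_old = r @ r
        for _ in range(cg_iters):
            Hd = hessian_vector_product(hess_grad, x, d, h)
            dHd = d @ Hd
            if dHd <= 0:                            # negative curvature in the sample
                if not p.any():
                    p = -g                          # fall back to steepest descent
                break
            alpha = rs_old / dHd
            p = p + alpha * d
            r = r - alpha * Hd
            rs_new = r @ r
            if np.sqrt(rs_new) < 1e-10:
                break
            d = r + (rs_new / rs_old) * d
            rs_old = rs_new

        x = x + step_size * p                       # fixed step size for simplicity
    return x
```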
Conclusion