Stochastic Truncated Newton with Conjugate Gradient

Luis Manuel Román García

ITAM, 2017

Presentation Overview

  1. Newton step

  2. Conjugate Gradient

  3. Stochastic Truncated Newton with Conjugate Gradient

  4. Conclusion

Newton Step

Motivation

Suppose we want to find the root of a twice-differentiable function f(x). Starting from an initial guess x_0, the iteration

x_1 = x_0 - \nabla_x f(x_0)^{-1}f(x_0)

produces successive approximations x_1, x_2, x_3, ... that approach the root.

[Figure: Newton iterates x_0, x_1, x_2, x_3 converging to the root of f(x)]
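As a quick illustration (a sketch of mine, not from the slides), here is the scalar form of this iteration in Python, assuming callables f and fprime for the function and its derivative:

# Minimal sketch of Newton's root-finding iteration for a scalar function f.
def newton_root(f, fprime, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)   # x_{k+1} = x_k - f'(x_k)^{-1} f(x_k)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2) ≈ 1.41421356.
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))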

Newton Step

Taking this idea one step further, we could look for the root not of the original function but of its gradient:

\nabla f(x)

In this case the iteration:

x_{k+1} = x_k - \nabla_x f(x_k)^{-1}f(x_k)

becomes:

x_{k+1} = x_k - \nabla_{xx}^{2}f(x_k)^{-1}\nabla f(x_k)
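A minimal sketch of this iteration in Python (mine, not from the slides), assuming callables grad and hess that return the gradient and Hessian of f:

import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton's method for minimization: x_{k+1} = x_k - [Hess f(x_k)]^{-1} grad f(x_k)."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve the Newton system instead of inverting the Hessian explicitly.
        x = x - np.linalg.solve(hess(x), g)
    return x

# Example: f(x) = x_1^4 + x_2^2, whose minimizer is the origin.
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton_minimize(grad, hess, np.array([2.0, 3.0])))   # ≈ [0, 0]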

Newton Step

Advantages:

  • Quadratic convergence near the optimum              

 

Disadvantages:

  • It is necessary to compute and store the whole Hessian at every iteration. 

Conjugate Gradient

Motivation

Suppose we want to find the solution of the linear system

Ax = b

Options:

  1. Gaussian elimination (scaled partial pivoting): \approx \frac{n^3}{3}
  2. Gram-Schmidt procedure: \approx n^3
  3. Householder reduction: \approx \frac{2n^3}{3} (unconditionally stable)
  4. Givens reduction: \approx \frac{4n^3}{3} (unconditionally stable)

But what happens when n >> 0?

Motivation

HINT: If A is symmetric positive definite, solving

Ax = b

is equivalent to finding the minimum of:

\phi(x) = \frac{1}{2}x^tAx - b^tx
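A quick numeric check of this equivalence (my own illustration, not from the slides): the gradient of \phi is Ax - b, so the minimizer of \phi coincides with the solution of the linear system.

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)       # symmetric positive definite by construction
b = rng.standard_normal(5)

x_star = np.linalg.solve(A, b)    # solution of Ax = b
grad_phi = A @ x_star - b         # gradient of phi evaluated at x_star
print(np.allclose(grad_phi, 0))   # True: x_star is the (unique) minimizer of phi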

Motivation

If A were a diagonal matrix, the procedure would be straightforward:

Just find the minimum along each coordinate axis.

[Figure: coordinate-wise minimization steps x_0, x_1 on the level sets of \phi]

Motivation

Of course, in the real world A is rarely diagonal, so minimizing along the coordinate axes is no longer guaranteed to terminate in a finite number of steps.

Thankfully, since A is positive definite, there is a matrix S such that

A = SA'S^t

where A' is diagonal. Hence, we can instead minimize along the conjugate directions p_k (for instance, the columns of S).

Conjugate Gradient

r_0\gets Ax_0-b, \quad p_0\gets -r_0, \quad k\gets 0

While \quad r_k \neq 0:

\alpha_k \gets-\frac{r_k^tp_k}{p_k^tAp_k}

x_{k+1} \gets x_k + \alpha_kp_k

r_{k+1} \gets Ax_{k+1} - b

\beta_{k+1} \gets\frac{r_{k+1}^tAp_k}{p_k^tAp_k}

p_{k+1} \gets -r_{k+1} + \beta_{k+1}p_k

k \gets k+1
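A direct NumPy transcription of the pseudocode above (a sketch of mine, not part of the slides):

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A with conjugate gradient."""
    x = x0.astype(float)
    r = A @ x - b                        # r_0 <- A x_0 - b
    p = -r                               # p_0 <- -r_0
    max_iter = max_iter if max_iter is not None else len(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:      # while r_k != 0
            break
        Ap = A @ p
        alpha = -(r @ p) / (p @ Ap)      # alpha_k
        x = x + alpha * p                # x_{k+1}
        r = A @ x - b                    # r_{k+1}
        beta = (r @ Ap) / (p @ Ap)       # beta_{k+1}
        p = -r + beta * p                # p_{k+1}
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b, np.zeros(2)))   # ≈ [0.0909, 0.6364]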

Conjugate Gradient

A remarkable property of CG is that the number of iterations is bounded above by the number of distinct eigenvalues of A. Moreover, if A has eigenvalues

\lambda_1\leq \lambda_2\leq...\leq\lambda_n

Then: 

\|x_{k}-x^*\|_A\leq2\bigg(\frac{\sqrt{\frac{\lambda_n}{\lambda_1}}-1}{\sqrt{\frac{\lambda_n}{\lambda_1}}+1}\bigg)^k\|x_0-x^*\|_A
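A small experiment (my own, reusing the conjugate_gradient sketch above) illustrating the first property: if A has only three distinct eigenvalues, CG essentially reaches the exact solution in three iterations.

import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))   # random orthogonal matrix
eigs = np.repeat([1.0, 4.0, 9.0], [20, 20, 10])      # only 3 distinct eigenvalues
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(50)

x = conjugate_gradient(A, b, np.zeros(50), max_iter=3)
print(np.linalg.norm(A @ x - b))   # residual near machine precision after 3 steps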

Stochastic Truncated Newton with Conjugate Gradient

 

Newton CG

Recall that the Newton step is given by the following formula: 

x_{k+1} = x_k - \nabla_{xx}^{2}f(x_k)^{-1}\nabla f(x_k)

which is equivalent to solving the linear system:

\nabla_{xx}^{2}f(x_k)\,p_k = -\nabla f(x_k), \qquad x_{k+1} = x_k + p_k

When close enough to the optimum, the Hessian is symmetric positive definite. Sound familiar?

Newton CG

Advantage:

 

- No need to store the whole Hessian

CG only needs the Hessian-vector products

\nabla_{xx}^{2}f(x_k)\,p

which can be approximated via central finite differences of the gradient with a complexity of roughly

O(n^2)

per Newton step, so the Hessian never has to be formed explicitly (a sketch follows below).

Disadvantage:

 

- Less precision in the descent direction
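To make the idea concrete, here is a rough Hessian-free sketch (mine, not the thesis implementation, written with the standard CG recurrences): CG touches the Hessian only through products with a vector p, and each product is approximated by a central finite difference of the gradient. grad is an assumed callable.

import numpy as np

def hess_vec_fd(grad, x, p, eps=1e-5):
    # Central finite difference: Hess f(x) p ≈ (grad f(x + eps p) - grad f(x - eps p)) / (2 eps)
    return (grad(x + eps * p) - grad(x - eps * p)) / (2 * eps)

def newton_cg_step(grad, x, cg_iters=10, tol=1e-8):
    """Approximately solve Hess f(x) d = -grad f(x) with CG, never forming the Hessian."""
    g = grad(x)
    d = np.zeros_like(x)
    r = g.copy()                 # residual of Hess f(x) d + grad f(x) at d = 0
    p = -r
    for _ in range(cg_iters):
        if np.linalg.norm(r) < tol:
            break
        Hp = hess_vec_fd(grad, x, p)
        alpha = (r @ r) / (p @ Hp)
        d = d + alpha * p
        r_new = r + alpha * Hp
        p = -r_new + (r_new @ r_new) / (r @ r) * p
        r = r_new
    return d

# Example on the quadratic f(x) = 0.5 x^T A x - b^T x, whose gradient is Ax - b:
A = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b
x = np.zeros(2)
print(x + newton_cg_step(grad, x))   # one Newton-CG step ≈ A^{-1} b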

Newton CG

Now, if we add a stochastic element and use sampling instead of the whole data set, much of the per-iteration cost can be reduced, although the number of iterations may grow. This introduces several hyper-parameters (a rough sketch of one such step follows the list below):

 

- The size of the sample (both for the Hessian and the gradient)

- The number of CG iterations

- The size of the step
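Below is a rough sketch of one such stochastic step (my own illustration, not the thesis code), written for a least-squares loss; the batch sizes, number of CG iterations, and step size play the role of the hyper-parameters listed above.

import numpy as np

def stochastic_newton_cg_step(Adata, y, x, grad_batch=500, hess_batch=200,
                              cg_iters=5, step_size=1.0, rng=None):
    """One stochastic truncated Newton-CG step for f(x) = (1/N) sum_i 0.5 (a_i^T x - y_i)^2."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(y)
    # Subsampled gradient on one mini-batch ...
    ig = rng.choice(n, size=min(grad_batch, n), replace=False)
    Ag, yg = Adata[ig], y[ig]
    g = Ag.T @ (Ag @ x - yg) / len(ig)
    # ... and subsampled Hessian-vector products on a (smaller) independent batch.
    ih = rng.choice(n, size=min(hess_batch, n), replace=False)
    Ah = Adata[ih]
    hvp = lambda p: Ah.T @ (Ah @ p) / len(ih)
    # Truncated CG on the sampled Newton system H d = -g.
    d, r, p = np.zeros_like(x), g.copy(), -g.copy()
    for _ in range(cg_iters):
        if np.linalg.norm(r) < 1e-12:
            break
        Hp = hvp(p)
        alpha = (r @ r) / (p @ Hp)
        d = d + alpha * p
        r_new = r + alpha * Hp
        p = -r_new + (r_new @ r_new) / (r @ r) * p
        r = r_new
    return x + step_size * d

# Toy usage on noiseless synthetic least squares; the error typically shrinks per step.
rng = np.random.default_rng(0)
Adata = rng.standard_normal((2000, 10)); x_true = rng.standard_normal(10)
y = Adata @ x_true
x = np.zeros(10)
for _ in range(20):
    x = stochastic_newton_cg_step(Adata, y, x, rng=rng)
print(np.linalg.norm(x - x_true))   # should be small, up to sampling noise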

 

Conclusion
