Stochastic Truncated Newton with Conjugate Gradient
Luis Manuel Román García
ITAM, 2017
Presentation Overview
- Newton step
- Conjugate Gradient
- Stochastic Truncated Newton with Conjugate Gradient
- Conclusion
Newton Step
Motivation
Suppose we want to find the root of a twice differentiable function f.
Newton Step
Taking this idea one step further, we can look for the root, not of the original function, but of its gradient.
In this case the iteration:

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$

becomes:

$$x_{k+1} = x_k - \left[ \nabla^2 f(x_k) \right]^{-1} \nabla f(x_k)$$
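As a reference, a minimal Python sketch of this iteration (the `grad` and `hess` callables, tolerance, and iteration cap are illustrative assumptions, not part of the original slides):

```python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-8, max_iter=50):
    """Pure Newton iteration on the gradient: x_{k+1} = x_k - H(x_k)^{-1} g(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop once the gradient is (nearly) zero
            break
        # Solve H p = -g rather than explicitly inverting the Hessian
        p = np.linalg.solve(hess(x), -g)
        x = x + p
    return x
```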
Newton Step
Advantages:
- Quadratic convergence near the optimum
Disadvantages:
- It is necessary to compute and store the whole Hessian at every iteration.
Conjugate Gradient
Motivation
Suppose we want to find the solution of the linear system $Ax = b$.
Options:
- Gaussian elimination (scaled partial pivoting)
- Gram-Schmidt procedure
- Householder reduction (unconditionally stable)
- Givens reduction (unconditionally stable)
???
Motivation
HINT: If A is symmetric positive definite, solving $Ax = b$ is equivalent to finding the minimum of:

$$\phi(x) = \tfrac{1}{2}\, x^\top A x - b^\top x$$
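This follows because the gradient of the quadratic vanishes exactly at the solution of the linear system:

$$\nabla \phi(x) = Ax - b = 0 \iff Ax = b$$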
Motivation
If A were a diagonal matrix, the procedure would be straightforward:
just find the minimum along each coordinate axis, one at a time.
Motivation
Of course, in the real world A is rarely diagonal, so minimizing along the coordinate axes is not guaranteed to terminate in a finite number of steps.
Thankfully, since A is symmetric positive definite, there is an invertible matrix $S = [\, s_1 \; s_2 \; \cdots \; s_n \,]$ such that

$$S^\top A S = A'$$

where $A'$ is diagonal; hence, we can minimize along the directions $s_1, \dots, s_n$, which are conjugate with respect to A (that is, $s_i^\top A s_j = 0$ for $i \neq j$).
Conjugate Gradient
Conjugate Gradient
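The algorithm is the standard CG iteration for $Ax = b$ with A symmetric positive definite; a minimal Python sketch (function name and defaults are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Standard CG for Ax = b with A symmetric positive definite."""
    b = np.asarray(b, dtype=float)
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                      # residual = -gradient of the quadratic
    p = r.copy()                       # first search direction
    rs_old = r @ r
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # exact minimizer of the quadratic along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # next direction, A-conjugate to the previous ones
        rs_old = rs_new
    return x
```

Note that A only ever appears through the product `A @ p`, which is what makes the Hessian-free variant later in the deck possible.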
A remarkable property of CG is that the number of iterations needed to reach the exact solution is bounded above by the number of distinct eigenvalues of A. Moreover, if A has eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, then:

$$\lVert x_{k+1} - x^* \rVert_A^2 \le \left( \frac{\lambda_{n-k} - \lambda_1}{\lambda_{n-k} + \lambda_1} \right)^2 \lVert x_0 - x^* \rVert_A^2$$
Stochastic Truncated Newton with Conjugate Gradient
Newton CG
Recall that the Newton step is given by the following formula:

$$p_k = -\left[ \nabla^2 f(x_k) \right]^{-1} \nabla f(x_k)$$

which is equivalent to solving the linear system:

$$\nabla^2 f(x_k)\, p_k = -\nabla f(x_k)$$

When close enough to the optimum, the Hessian is symmetric positive definite. Sounds familiar?
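A minimal sketch of the resulting truncated Newton-CG outer loop, reusing the `conjugate_gradient` sketch above (the callables `grad`, `hess` and the iteration caps are illustrative assumptions):

```python
import numpy as np

def newton_cg(grad, hess, x0, tol=1e-6, max_newton=50, cg_iters=10):
    """Truncated Newton: solve hess(x) p = -grad(x) only approximately with CG."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_newton):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # A handful of CG iterations gives an inexact but cheap Newton direction
        p = conjugate_gradient(hess(x), -g, max_iter=cg_iters)
        x = x + p
    return x
```

Here the Hessian is still formed explicitly; the advantage on the next slide is that CG only needs Hessian-vector products, so even that can be avoided.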
Newton CG
Advantage:
- No need to store the whole Hessian: CG only needs the product $\nabla^2 f(x_k)\, v$, which can be approximated via central finite differences at the cost of two extra gradient evaluations per product (see the sketch after this list).
Disadvantage:
- Less precision in the descent direction.
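A sketch of the central-difference Hessian-vector product just mentioned; the step size `h` is an illustrative choice:

```python
import numpy as np

def hessian_vector_product(grad, x, v, h=1e-5):
    """Approximate (Hessian of f at x) @ v with two gradient evaluations."""
    return (grad(x + h * v) - grad(x - h * v)) / (2.0 * h)
```

Inside the CG loop, every product `A @ p` is then replaced by `hessian_vector_product(grad, x, p)`, so the Hessian is never formed or stored.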
Newton CG
Now, if we add a stochastic element and use a sample instead of the whole data set, much of the per-iteration cost can be reduced, although the number of iterations might grow. There are now multiple hyper-parameters (a sketch combining these pieces follows the list):
- The size of the sample (both for the Hessian and the gradient)
- The number of CG iterations
- The size of the step
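A sketch of the subsampled variant, reusing `hessian_vector_product` from the previous sketch. The hyper-parameters are exactly the ones listed above; `grad_batch(x, idx)`, the default sample sizes, and the fixed step size are illustrative assumptions:

```python
import numpy as np

def stochastic_newton_cg(grad_batch, x0, n_data,
                         grad_sample=256, hess_sample=64,
                         cg_iters=5, step_size=1.0,
                         max_iter=100, h=1e-5, seed=None):
    """Subsampled truncated Newton-CG (illustrative hyper-parameter defaults).

    grad_batch(x, idx) is assumed to return the gradient of the objective
    averaged over the data points indexed by idx.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g_idx = rng.choice(n_data, size=grad_sample, replace=False)
        h_idx = rng.choice(n_data, size=hess_sample, replace=False)
        g = grad_batch(x, g_idx)                    # subsampled gradient
        hess_grad = lambda z: grad_batch(z, h_idx)  # gradient on the Hessian sample

        # Truncated, matrix-free CG on the subsampled system  H_S p = -g
        p = np.zeros_like(x)
        r = -g
        d = r.copy()
        rs_old = r @ r
        for _ in range(cg_iters):
            Hd = hessian_vector_product(hess_grad, x, d, h)
            dHd = d @ Hd
            if dHd <= 0:                            # negative curvature in the sample
                if not p.any():
                    p = -g                          # fall back to steepest descent
                break
            alpha = rs_old / dHd
            p = p + alpha * d
            r = r - alpha * Hd
            rs_new = r @ r
            if np.sqrt(rs_new) < 1e-10:
                break
            d = r + (rs_new / rs_old) * d
            rs_old = rs_new

        x = x + step_size * p                       # fixed step size for simplicity
    return x
```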
Conclusion