Luis Manuel Román García
ITAM, 2017
Suppose we want to find the root of a twice-differentiable function $f$.
Taking this idea one step further, we could think of finding the root not of the original function itself but of its gradient.
In this case the iteration:
$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$
becomes:
$$x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$$
Advantages:
- Quadratic convergence near the solution
Disadvantages:
- Every iteration requires forming, storing, and solving with the full Hessian
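A minimal sketch of this Newton iteration on the gradient; the test function, its gradient, and its Hessian are hypothetical choices made only for the example:

```python
import numpy as np

# Hypothetical test function (an assumption for this example):
# f(x) = (x0 - 1)^2 + 10 * (x1 - x0^2)^2
def grad(x):
    return np.array([2 * (x[0] - 1) - 40 * x[0] * (x[1] - x[0] ** 2),
                     20 * (x[1] - x[0] ** 2)])

def hess(x):
    return np.array([[2 - 40 * (x[1] - 3 * x[0] ** 2), -40 * x[0]],
                     [-40 * x[0], 20]])

x = np.array([-1.0, 1.0])
for _ in range(20):
    # Newton step on the gradient: solve the system instead of forming the inverse
    x = x + np.linalg.solve(hess(x), -grad(x))
    if np.linalg.norm(grad(x)) < 1e-10:
        break
print(x)  # converges to the minimizer (1, 1)
```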
Suppose we want to find the solution of the linear system $Ax = b$.
Options:
(Unconditionally stable)
(Unconditionally stable)
???
HINT: If $A$ is positive definite, solving
$$Ax = b$$
is equivalent to finding the minimum of:
$$\phi(x) = \frac{1}{2} x^T A x - b^T x$$
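A quick numerical check of this equivalence, with a randomly generated positive definite matrix (the size and seed are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)             # a random symmetric positive definite matrix
b = rng.standard_normal(n)

phi = lambda x: 0.5 * x @ A @ x - b @ x

x_star = np.linalg.solve(A, b)          # solution of the linear system
print(np.linalg.norm(A @ x_star - b))   # ~0: the gradient of phi vanishes at x_star
print(all(phi(x_star + v) >= phi(x_star)
          for v in rng.standard_normal((200, n))))   # True: x_star is the minimum
```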
If $A$ were a diagonal matrix, the procedure would be straightforward:
Just find the minimum along each coordinate axis.
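For instance, in the diagonal case one sweep over the coordinate axes already gives the exact solution (the numbers below are arbitrary):

```python
import numpy as np

A = np.diag([4.0, 2.0, 9.0])
b = np.array([8.0, 1.0, 3.0])

# Along the i-th axis, d phi / d x_i = A_ii * x_i - b_i, so each 1-D minimization
# gives x_i = b_i / A_ii and one sweep over the axes solves the system exactly.
x = b / np.diag(A)
print(np.allclose(A @ x, b))   # True
```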
Of course, in the real world $A$ is not diagonal, so minimizing along the coordinate axes is no longer guaranteed to terminate in a finite number of steps.
Thankfully, since $A$ is positive definite, there is a matrix $S$ such that
$$S^T A S = A'$$
where $A'$ is diagonal; hence, we minimize through the directions given by the columns of $S$ (the conjugate directions), one at a time.
A remarkable property of CG is that the number of iterations is bounded above by the number of distinct eigenvalues of $A$. Moreover, if $A$ has eigenvalues $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$, then:
$$\|x_{k+1} - x^*\|_A^2 \le \left(\frac{\lambda_{n-k} - \lambda_1}{\lambda_{n-k} + \lambda_1}\right)^2 \|x_0 - x^*\|_A^2$$
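A minimal sketch of plain CG that illustrates this property: on a matrix with only 3 distinct eigenvalues it stops after 3 iterations (the matrix construction and eigenvalues are arbitrary choices for the example):

```python
import numpy as np

def cg(A, b, tol=1e-10):
    # Plain conjugate gradient for a symmetric positive definite A
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for k in range(len(b)):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            return x, k + 1
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x, len(b)

# A 50x50 SPD matrix with only 3 distinct eigenvalues (values chosen arbitrarily)
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
A = Q @ np.diag(np.repeat([1.0, 4.0, 10.0], [20, 20, 10])) @ Q.T
b = rng.standard_normal(50)
x, iters = cg(A, b)
print(iters, np.linalg.norm(A @ x - b))   # 3 iterations, residual near machine precision
```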
Recall that the Newton step is given by the following formula:
$$x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$$
Which is equivalent to solving the linear system:
$$\nabla^2 f(x_k)\, p_k = -\nabla f(x_k)$$
When close enough to the optimum, the Hessian is symmetric positive definite. Sounds familiar?
Advantage:
- No need to store the whole Hessian: CG only needs Hessian-vector products $\nabla^2 f(x_k)\, v$, which can be approximated via central finite differences at the cost of two extra gradient evaluations per product (see the sketch after this list):
$$\nabla^2 f(x_k)\, v \approx \frac{\nabla f(x_k + \epsilon v) - \nabla f(x_k - \epsilon v)}{2\epsilon}$$
Disadvantage:
- Less precision in the descent direction (CG is truncated and the Hessian-vector products are only approximate)
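A minimal sketch of a Hessian-free (Newton-CG) step along these lines, where CG sees the Hessian only through the finite-difference product above; the toy gradient reused at the end is the same hypothetical test function as in the earlier Newton example:

```python
import numpy as np

def hvp(grad, x, v, eps=1e-6):
    # Central finite-difference approximation of the Hessian-vector product
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

def newton_cg_step(grad, x, cg_iters=10, tol=1e-8):
    # Approximately solve  (Hessian) p = -gradient  with CG,
    # touching the Hessian only through hvp()
    g = grad(x)
    p = np.zeros_like(x)
    r, d = -g.copy(), -g.copy()
    for _ in range(min(cg_iters, x.size)):
        Hd = hvp(grad, x, d)
        dHd = d @ Hd
        if dHd <= 0:          # safeguard against non-positive curvature / noise
            break
        alpha = (r @ r) / dHd
        p = p + alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return x + p

# Same hypothetical test function as before, now without ever forming its Hessian
grad_f = lambda x: np.array([2 * (x[0] - 1) - 40 * x[0] * (x[1] - x[0] ** 2),
                             20 * (x[1] - x[0] ** 2)])
x = np.array([-1.0, 1.0])
for _ in range(20):
    if np.linalg.norm(grad_f(x)) < 1e-6:
        break
    x = newton_cg_step(grad_f, x)
print(x)  # approaches the minimizer (1, 1)
```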
Now, if we add a stochastic element and use sampling instead of the whole data set, much of the per-iteration cost can be reduced, although the number of iterations might grow. There are now multiple hyper-parameters (a sketch follows this list):
- The size of the sample (both for the Hessian and the gradient)
- The number of CG iterations
- The size of the step
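A minimal sketch of how these hyper-parameters could fit together in a subsampled Newton-CG loop, assuming a regularized logistic-regression objective on synthetic data; the loss, sample sizes, number of CG iterations, and step size are all hypothetical choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 10_000, 20, 1e-2          # data size, dimension, L2 penalty (hypothetical)
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = (X @ w_true + 0.1 * rng.standard_normal(N) > 0).astype(float)

def grad(w, idx):
    # Mini-batch gradient of the regularized logistic loss over the rows in idx
    p = 1 / (1 + np.exp(-X[idx] @ w))
    return X[idx].T @ (p - y[idx]) / len(idx) + lam * w

def hvp(w, v, idx):
    # Mini-batch Hessian-vector product of the same loss
    p = 1 / (1 + np.exp(-X[idx] @ w))
    return X[idx].T @ ((p * (1 - p)) * (X[idx] @ v)) / len(idx) + lam * v

# The hyper-parameters listed above (all values hypothetical):
grad_sample, hess_sample, cg_iters, step = 1000, 200, 5, 1.0

w = np.zeros(d)
for _ in range(30):
    gi = rng.choice(N, grad_sample, replace=False)   # sample for the gradient
    hi = rng.choice(N, hess_sample, replace=False)   # smaller sample for the Hessian
    g = grad(w, gi)
    # A few CG iterations on  H p = -g, with H seen only through sampled products
    p_dir, r = np.zeros(d), -g.copy()
    dvec = r.copy()
    for _ in range(cg_iters):
        Hd = hvp(w, dvec, hi)
        alpha = (r @ r) / (dvec @ Hd)
        p_dir = p_dir + alpha * dvec
        r_new = r - alpha * Hd
        dvec = r_new + ((r_new @ r_new) / (r @ r)) * dvec
        r = r_new
    w = w + step * p_dir
print(np.mean((X @ w > 0) == (y == 1)))   # training accuracy of the sketch
```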