On the Use of Second Order Stochastic Information

for Learning Deep Artificial Neural Networks

Presentation Overview

  1. The state of artificial intelligence

  2. Deep Artificial Neural Networks

  3. The promises of second order methods

  4. Conclusion

The state of artificial intelligence

2010

The breakthrough of Deep Artificial Neural Networks

Images

Speech

Language

drug effectiveness | particle acceleration | brain circuit reconstruction | predicting DNA mutation effects | topic classification |

sentiment analysis | question answering | video processing | creative computing | language translation | time series analysis

2011

Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings, and received the first-place prize of $1 million.

Figure: historical top-5 accuracy of the annual winner of the ImageNet challenge, 2010-2015 (traditional methods vs. CNNs).

2016

AlphaGo: the first computer program to beat a professional human player at the game of Go.

2016

The ConceptNet 4 system was given a verbal IQ test and obtained a score described as “average for a four-year-old child”.

Poem 1

And lying sleeping on an open bed.
And I remember having started tripping,
Or any angel hanging overhead,
Without another cup of coffee dripping.

Surrounded by a pretty little sergeant,
Another morning at an early crawl.
And from the other side of my apartment,
An empty room behind the inner wall.

Poem 2

A green nub pushes up from moist, dark soil.
Three weeks without stirring, now without strife
From the unknown depths of a thumbpot life
In patient rhythm slides forth without turmoil,
A tiny green thing poking through its sheath.
Shall I see the world? Yes, it is bright.
Silent and slow it stretches for the light
And opens, uncurling, above and beneath.

Who is the human?

2016

“The development of full artificial intelligence could spell the end of the human race.”

- Stephen Hawking

2018

With the development of video-to-video synthesis, highly convincing replications of individuals can now be created in video.

Deep Artificial Neural Networks

Where does learnability come from?

Deep learning methods are representation-learning methods with multiple levels of representation

 

Composing several layers makes it possible to learn very complex functions efficiently
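
Concretely, a deep network computes a composition of simple parametric maps; a sketch of this structure (the symbols f_l, W_l, b_l and \sigma are introduced here only for illustration and are not used elsewhere in these slides):

f(x; w) = (f_L\circ f_{L-1}\circ\dots\circ f_1)(x),\quad f_l(z) = \sigma(W_l z + b_l)

Each layer re-represents the output of the previous one, which is what lets the overall function be very expressive while every individual layer stays simple.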

Example architecture: LSTM
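
For reference, the standard LSTM cell updates (the weights W, U and biases b are the usual notation, assumed here rather than defined on the slide; \sigma is the logistic sigmoid and \odot the element-wise product):

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)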

 

The problem is that they are very difficult to train:

 

  • Non-linear
  • Non-convex

 

 

The tool of choice for training is stochastic gradient descent, which misbehaves in scenarios with strong non-linearity

Stochastic gradient descent

w_{k+1} = w_k - \alpha\nabla l(h(x_i, w_k))
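
A minimal sketch of this update in Python, on a synthetic least-squares problem (the data, loss and step size below are illustrative assumptions, not part of the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))               # synthetic inputs
w_true = rng.standard_normal(10)
y = X @ w_true + 0.1 * rng.standard_normal(1000)  # noisy targets

w = np.zeros(10)
alpha = 0.05                                      # fixed step size
for k in range(5000):
    i = rng.integers(len(X))                      # draw one example x_i
    grad = (X[i] @ w - y[i]) * X[i]               # gradient of 0.5*(x_i.w - y_i)^2
    w = w - alpha * grad                          # w_{k+1} = w_k - alpha * grad

With the fixed step size, the iterates hover around w_true at a noise level set by alpha rather than converging exactly.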

Under some regularity assumptions, the best we can expect is sublinear convergence: with a fixed step size, the expected optimality gap only decreases down to a noise floor proportional to the step size.

 

Best case scenario

\mathbb{E}[F(w_k)- F(w^*)] \leq \frac{\hat{\alpha}LM}{2c\mu} + (1-\hat{\alpha}c\mu)^{k-1}\Big(F(w_1)-F(w^*)-\frac{\hat{\alpha} LM}{2c\mu}\Big)

In an online scenario:

 

Best case scenario

regret_T \leq \frac{3}{2}GD\sqrt{T}

The promises of second order methods

Second order methods

w_{k+1} = w_k - \alpha H_{S_k} \frac{1}{|X_k|}\displaystyle\sum_{i=1}^{|X_k|}\nabla l(h(x_i, w_k)), \quad |S_k|\leq |X_k|

where H_{S_k} is an approximation to the inverse Hessian built from the curvature subsample S_k, and X_k is the gradient sample.
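
A minimal sketch of one such subsampled Newton step in Python, reusing the least-squares toy problem from the SGD sketch above (the sample sizes, step size and conjugate-gradient solver are illustrative choices, not the slides' prescription):

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
w_true = rng.standard_normal(10)
y = X @ w_true + 0.1 * rng.standard_normal(1000)

def subsampled_newton_step(w, alpha=1.0, n_x=256, n_s=64):
    xk = rng.choice(len(X), n_x, replace=False)   # gradient sample X_k
    sk = rng.choice(len(X), n_s, replace=False)   # curvature sample S_k, |S_k| <= |X_k|
    g = X[xk].T @ (X[xk] @ w - y[xk]) / n_x       # subsampled gradient
    Xs = X[sk]
    # subsampled Hessian (1/|S_k|) Xs^T Xs, applied matrix-free as Hessian-vector products
    H = LinearOperator((10, 10), matvec=lambda v: Xs.T @ (Xs @ v) / n_s)
    d, _ = cg(H, g)                               # approximately solve H d = g
    return w - alpha * d                          # Newton-like step

w = np.zeros(10)
for k in range(10):
    w = subsampled_newton_step(w)

The curvature information enters only through Hessian-vector products with the subsample S_k, which is also what makes the per-iteration work easy to distribute across examples.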

Under some regularity assumptions, the best we can expect is superlinear, up to quadratic, convergence:

 

Best case scenario

|X_k| \geq |X_0|\eta_k^k;\quad |X_0|\geq\bigg(\frac{6v\gamma M}{\hat{\mu}^2}\bigg), \eta_k > \eta_{k-1}, \eta_k \rightarrow\infty, \eta_1>1
|S_k| > |S_{k-1}|;\quad \displaystyle\lim_{k\rightarrow\infty}|S_k|=\infty; \quad |S_0|\geq\bigg(\frac{4\sigma}{\hat{\mu}}\bigg)^2
\|w_0-w^*\|\leq\frac{\hat{\mu}}{3\gamma M}
\mathbb{E}[\|w_k - w^*\|]\leq\tau_k\quad\quad\displaystyle\lim_{k\rightarrow\infty}\frac{\tau_{k+1}}{\tau_k} = 0

In an online scenario, the regret grows only as O(log T):

 

Best case scenario

\gamma = \frac{1}{2}\min\{\frac{1}{4GD}, \alpha\}, \epsilon = \frac{1}{\gamma^2 D^2}
regret_T \leq 5(\frac{1}{\alpha}+ GD)n\log(T)
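These parameters and this regret bound match the Online Newton Step algorithm of Hazan, Agarwal and Kale; for reference, a sketch of its update (with A_0 = \epsilon I_n, K the feasible set and \nabla_t the gradient observed at round t):

A_t = A_{t-1} + \nabla_t\nabla_t^\top
y_{t+1} = x_t - \frac{1}{\gamma}A_t^{-1}\nabla_t
x_{t+1} = \Pi_K^{A_t}(y_{t+1}) = \displaystyle\operatorname{arg\,min}_{x\in K}\,(y_{t+1}-x)^\top A_t\,(y_{t+1}-x)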

Second order methods have several advantages, along with one notable cost:

 

  • Faster convergence rate

  • Embarrassingly parallel

  • Take curvature information into account

  • Ideal for highly varying functions

  • Higher per-iteration cost (their main drawback)

2010-2016

  • Martens successfully trains deep auto-encoders with a Hessian-free (truncated Newton) method.

  • Sutskever successfully trains recurrent neural networks with a generalized Gauss-Newton algorithm.

  • Bengio achieves state-of-the-art results training recurrent networks with second order methods.

Conclusion

Second Order Methods for Learning Deep Artificial Neural Networks

By Luis Roman
