Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein
(NIPS 2018)
What is the relationship between trainability, generalisation, and architecture choices?
between two minima
Minimise \(L(\theta) = \frac{1}{m} \sum_{i=1}^m l(x_i, y_i; \theta)\)
\(x_i\) feature vectors
\(y_i\) labels
\(\theta\) model parameters
For two parameter sets \(\theta_1\) and \(\theta_2\),
plot \(f(\alpha) = L((1 - \alpha) \theta_1 + \alpha \theta_2)\)
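A minimal sketch of this interpolation plot in PyTorch, assuming two trained copies of the same architecture (`model_1`, `model_2`) and a `loss_on_dataset(model)` helper returning the full-dataset loss; both names are hypothetical, not from the paper:

```python
# Loss along the line segment between two solutions theta_1 and theta_2.
import copy
import torch

def interpolation_curve(model_1, model_2, loss_on_dataset, alphas):
    """Return L((1 - alpha) * theta_1 + alpha * theta_2) for each alpha."""
    theta_1 = [p.detach().clone() for p in model_1.parameters()]
    theta_2 = [p.detach().clone() for p in model_2.parameters()]
    probe = copy.deepcopy(model_1)  # throwaway model that holds the blended weights
    losses = []
    for alpha in alphas:
        with torch.no_grad():
            for p, t1, t2 in zip(probe.parameters(), theta_1, theta_2):
                p.copy_((1 - alpha) * t1 + alpha * t2)
        losses.append(loss_on_dataset(probe))
    return losses

# e.g. alphas = torch.linspace(-0.5, 1.5, 41) to look a little beyond both minima
```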
around one minimum
Around a parameter set \(\theta\): pick two random (Gaussian) direction vectors \(d_1, d_2\) and plot \(f(\alpha, \beta) = L(\theta + \alpha d_1 + \beta d_2)\)
Scaling problem
Direction vectors have the same dimensionality as \(\theta\), but different filters and layers of a trained network live at very different scales, so a raw random perturbation distorts some of them far more than others and makes plots incomparable across networks
Fix: normalise each direction filter-wise, \(d_{i,j} \leftarrow \frac{d_{i,j}}{||d_{i,j}||} ||\theta_{i,j}||\), where \(d_{i,j}\) is the \(j\)-th filter of the \(i\)-th layer of \(d\)
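A sketch of the filter-normalised 2D plot, using the same hypothetical `loss_on_dataset` helper as above; reading \(d_{i,j}\) as the \(j\)-th slice along the first (output-channel) dimension of the \(i\)-th parameter tensor is my assumption, and the handling of biases/batch-norm parameters is a simplification:

```python
# f(alpha, beta) = L(theta + alpha * d1 + beta * d2) with filter-normalised
# Gaussian directions d1, d2.
import copy
import torch

def filter_normalised_direction(model):
    """Random direction d with each filter rescaled so ||d_ij|| = ||theta_ij||."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() > 1:                    # conv / linear weights: per-filter scaling
            for d_f, p_f in zip(d, p):     # iterate over the output dimension
                d_f.mul_(p_f.norm() / (d_f.norm() + 1e-10))
        else:                              # biases / BN params: whole-tensor scaling
            d.mul_(p.norm() / (d.norm() + 1e-10))   # (simplification)
        direction.append(d)
    return direction

def loss_surface(model, loss_on_dataset, alphas, betas):
    theta = [p.detach().clone() for p in model.parameters()]
    d1 = filter_normalised_direction(model)
    d2 = filter_normalised_direction(model)
    probe = copy.deepcopy(model)
    surface = torch.zeros(len(alphas), len(betas))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            with torch.no_grad():
                for p, t, u, v in zip(probe.parameters(), theta, d1, d2):
                    p.copy_(t + a * u + b * v)
            surface[i, j] = loss_on_dataset(probe)
    return surface   # contour-plot this over (alphas, betas)
```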
VGG (2014)
AlexNet (2012)
ResNet (2015)
Wide-ResNet (2016)
ResNet-noshort (ResNet with skip connections removed)
DenseNet (2016)
Ok, now what about \(\Longleftarrow\)? Does apparent convexity in the projected plot imply convexity of the full-dimensional surface?
To check: plot \(\left|\frac{\lambda_{min}}{\lambda_{max}}\right|\) for the original (full) Hessian at each point of the plot (a sketch of this computation follows the argument below)
The principal curvatures of a randomly projected (Gaussian) surface are weighted averages of the principal curvatures of the original surface, with non-negative (chi-square distributed) weights
So: non-convexity in the projected surface
\(\Longleftrightarrow\) a negative projected Hessian eigenvalue
\(\Longrightarrow\) a negative original Hessian eigenvalue (since the weights are non-negative)
\(\Longleftrightarrow\) non-convexity in the original surface
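One way to get \(\lambda_{min}\) and \(\lambda_{max}\) of the full Hessian without ever forming it is Lanczos (via SciPy/ARPACK) on Hessian-vector products computed by double backprop; a sketch under that assumption, with `loss_fn(model)` a hypothetical full-batch loss helper (the authors' own implementation may differ):

```python
# |lambda_min / lambda_max| of the loss Hessian at the current parameters.
import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, eigsh

def hessian_eig_ratio(model, loss_fn):
    params = [p for p in model.parameters() if p.requires_grad]
    n = sum(p.numel() for p in params)

    loss = loss_fn(model)                                   # full-batch loss
    grads = torch.autograd.grad(loss, params, create_graph=True)

    def hvp(v_np):
        """Hessian-vector product H @ v via d/dtheta <grad L, v>."""
        v = torch.as_tensor(np.asarray(v_np).ravel(), dtype=torch.float32)
        vs, offset = [], 0
        for p in params:                                    # unflatten v
            vs.append(v[offset:offset + p.numel()].view_as(p))
            offset += p.numel()
        dot = sum((g * u).sum() for g, u in zip(grads, vs))
        hv = torch.autograd.grad(dot, params, retain_graph=True)
        return torch.cat([h.reshape(-1) for h in hv]).detach().numpy()

    op = LinearOperator((n, n), matvec=hvp)
    lam_max = eigsh(op, k=1, which='LA', tol=1e-2, return_eigenvectors=False)[0]
    lam_min = eigsh(op, k=1, which='SA', tol=1e-2, return_eigenvectors=False)[0]
    return abs(lam_min / lam_max)
```

A small ratio means the apparent convexity at that point is not hiding significant negative curvature; a large ratio flags genuine non-convexity.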
In high dimension \(n\), random vectors are nearly orthogonal to anything drawn independently (\(E[|S_C(v_1, v_2)|] \approx \sqrt{\frac{2}{\pi n}}\) for the cosine similarity \(S_C\))
So random projections capture very little of an optimisation trajectory's variation
Use the top PCA directions of the trajectory instead
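A sketch of the PCA-direction idea, assuming `snapshots` is a list of flattened parameter vectors saved during training (a name introduced here), with the final entry the converged point:

```python
# Project a training trajectory onto its top-2 PCA directions.
import numpy as np

def pca_trajectory(snapshots):
    final = snapshots[-1]
    # One row per checkpoint: displacement from the final point
    M = np.stack([s - final for s in snapshots[:-1]])
    # Top right-singular vectors of M = leading PCA directions of the trajectory
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    d1, d2 = Vt[0], Vt[1]
    # 2-D coordinates of every checkpoint in the (d1, d2) plane
    coords = np.stack([[(s - final) @ d1, (s - final) @ d2] for s in snapshots])
    return d1, d2, coords
```

Loss contours for the same plane can then be drawn by evaluating \(L(\theta_{final} + \alpha d_1 + \beta d_2)\) on a grid, as in the earlier surface sketch.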