Visualizing the loss landscape of neural nets

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein

(NIPS 2018)

What is the relationship between trainability, generalisation, and architecture choices?

  • architecture choices \(\rightarrow\) loss landscape
  • global loss landscape (convex or not) \(\leftrightarrow\) trainability
  • landscape surrounding minima \(\rightarrow\) generalisation


Tricks of the trade

  • Mini-batching
  • Batch normalisation
  • Self-normalisation
  • Dropout
  • Residual connections

Techniques for visualising

between two minima

Minimise \(L(\theta) = \frac{1}{m} \sum_{i=1}^m l(x_i, y_i; \theta)\)

\(x_i\) feature vectors

\(y_i\) labels

\(\theta\) model parameters

For two parameter sets \(\theta_1\) and \(\theta_2\),

plot \(f(\alpha) = L((1 - \alpha) \theta_1 + \alpha \theta_2)\)
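The interpolation above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the linear-regression `loss`, the data `X`, `y`, and the two parameter vectors are toy stand-ins chosen for this example (assumptions), so that \(\theta_2\) is a known minimum.

```python
import numpy as np

def loss(theta, X, y):
    """Mean squared error of a linear model; a toy stand-in for L(theta)."""
    return float(np.mean((X @ theta - y) ** 2))

def interpolate_loss(theta_1, theta_2, X, y, alphas):
    """Evaluate f(alpha) = L((1 - alpha) * theta_1 + alpha * theta_2) for each alpha."""
    return np.array([loss((1 - a) * theta_1 + a * theta_2, X, y) for a in alphas])

# Toy data (assumption): targets generated by a known parameter vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])

theta_1 = np.zeros(3)                   # an arbitrary starting point
theta_2 = np.array([1.0, -2.0, 0.5])    # the minimiser of the toy loss
alphas = np.linspace(-0.5, 1.5, 41)     # extend slightly beyond both endpoints
f_values = interpolate_loss(theta_1, theta_2, X, y, alphas)
```

Plotting `f_values` against `alphas` gives the 1D slice; for a real network, `theta_1` and `theta_2` would be the flattened weights of two trained models.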

Techniques for visualising

around one minimum

Around a parameter set \(\theta\):

  • For 1D, choose a direction vector \(\delta\), and plot \( f(\alpha) = L(\theta + \alpha \delta) \)
  • For 2D, choose two direction vectors \(\delta\) and \(\eta\), and plot \( f(\alpha, \beta) = L(\theta + \alpha \delta + \beta \eta) \)
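A sketch of the 2D case, evaluating the loss on a grid of \((\alpha, \beta)\) values. The `loss` function here is a toy non-convex placeholder and the directions are plain Gaussian draws; both are assumptions for illustration only.

```python
import numpy as np

def loss(theta):
    """Toy non-convex loss; a placeholder for the network loss L(theta)."""
    return float(np.sum(theta ** 2) + np.sum(np.sin(3.0 * theta)))

def loss_surface(theta, delta, eta, alphas, betas):
    """Return a grid with entry [i, j] = L(theta + alphas[i] * delta + betas[j] * eta)."""
    grid = np.empty((len(alphas), len(betas)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            grid[i, j] = loss(theta + a * delta + b * eta)
    return grid

rng = np.random.default_rng(0)
theta = rng.normal(size=10)   # centre point of the plot
delta = rng.normal(size=10)   # first random direction
eta = rng.normal(size=10)     # second random direction
alphas = np.linspace(-1.0, 1.0, 21)
betas = np.linspace(-1.0, 1.0, 21)
surface = loss_surface(theta, delta, eta, alphas, betas)
```

The resulting `surface` array can be passed to any contour or surface plotting routine. Each grid cell costs one full loss evaluation, so the resolution is usually kept modest.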

Scaling problem

Filter-wise normalisation

Direction vectors have the same dimensionality as \(\theta\)

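Normalise each direction vector filter-wise: \(d_{i,j} \leftarrow \frac{d_{i,j}}{\|d_{i,j}\|} \|\theta_{i,j}\|\), where \(d_{i,j}\) is the \(j\)-th filter of the \(i\)-th layer of direction \(d\) and \(\|\cdot\|\) is the Frobenius norm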
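The rule above, sketched for a single layer. The function name `filter_normalise`, the `eps` guard against zero-norm filters, and the assumed weight layout (one filter per first index) are choices made for this example, not taken from the source.

```python
import numpy as np

def filter_normalise(direction, weights, eps=1e-10):
    """Rescale each filter of `direction` to the norm of the matching filter in `weights`.

    Both arrays are assumed to have shape (num_filters, ...), one filter per first index.
    """
    d = direction.reshape(direction.shape[0], -1)   # flatten each filter to a row
    w = weights.reshape(weights.shape[0], -1)
    d_norm = np.linalg.norm(d, axis=1, keepdims=True)
    w_norm = np.linalg.norm(w, axis=1, keepdims=True)
    # eps (an assumption) avoids division by zero for an all-zero filter
    return (d / (d_norm + eps) * w_norm).reshape(direction.shape)

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 3, 3, 3))      # e.g. 8 filters of a 3x3 convolution over 3 channels
direction = rng.normal(size=weights.shape)   # random Gaussian direction for this layer
normalised = filter_normalise(direction, weights)
```

For a whole network the same function would be applied layer by layer, after which the direction has the same per-filter scale as the weights it perturbs.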

Test models

VGG (2014)

AlexNet (2012)

ResNet (2015)

Wide-ResNet (2016)


DenseNet (2016)

Residual connections \(\leftrightarrow\) network depth

Residual connections \(\leftrightarrow\) network width

Does this capture convexity?

Ok, now what about \(\Longleftarrow\)?

Plot \(\left|\frac{\lambda_{min}}{\lambda_{max}}\right|\) for the original Hessian at each point
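A small sketch of the ratio at a single point. It builds a dense Hessian by finite differences, which is only feasible for tiny models; for real networks one would need an iterative eigenvalue method instead. The helper names and the quadratic `toy_loss` are assumptions for illustration.

```python
import numpy as np

def numerical_hessian(f, theta, h=1e-3):
    """Central finite-difference Hessian of f at theta (dense, so tiny problems only)."""
    n = theta.size
    steps = h * np.eye(n)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = steps[i], steps[j]
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4.0 * h * h)
    return 0.5 * (H + H.T)   # symmetrise to remove rounding asymmetry

def curvature_ratio(f, theta):
    """Return |lambda_min / lambda_max| of the Hessian of f at theta."""
    eigenvalues = np.linalg.eigvalsh(numerical_hessian(f, theta))  # ascending order
    return float(abs(eigenvalues[0] / eigenvalues[-1]))

def toy_loss(theta):
    """Saddle-shaped toy loss with Hessian diag(2, -0.2)."""
    return float(theta[0] ** 2 - 0.1 * theta[1] ** 2)

ratio = curvature_ratio(toy_loss, np.array([0.3, -0.2]))
```

Here `ratio` comes out at approximately 0.1: the negative curvature is small relative to the positive curvature. Evaluating this at every grid point of a 2D plot gives the heat map described above.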

The principal curvatures of a randomly projected (Gaussian) surface are weighted averages of the principal curvatures of the original surface (with Chi-square coefficients)
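
One way to see this for a single Gaussian direction (a sketch, assuming the Hessian at the point is \(H = Q \Lambda Q^T\) with eigenvalues \(\lambda_i\), and \(\delta \sim \mathcal{N}(0, I)\); the 2D case is not worked out here):

\(\frac{\delta^T H \delta}{\delta^T \delta} = \frac{\sum_i \lambda_i z_i^2}{\sum_i z_i^2}, \quad z = Q^T \delta \sim \mathcal{N}(0, I)\)

Each \(z_i^2\) is Chi-square distributed with one degree of freedom, so the curvature along \(\delta\) is a weighted average of the \(\lambda_i\), with non-negative weights \(z_i^2 / \sum_k z_k^2\) that sum to 1. A negative curvature along \(\delta\) therefore requires at least one negative \(\lambda_i\).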


So: non-convexity in the projected surface

\(\Longleftrightarrow\) some negative projected Hessian eigenvalue

\(\Longrightarrow\) some negative original Hessian eigenvalue

\(\Longleftrightarrow\) non-convexity in original surface

Visualising optimisation paths

In high dimensions, a random vector is nearly orthogonal to any independent vector: the expected magnitude of the cosine similarity \(S_C\) in \(n\) dimensions is \(E[|S_C(v_1, v_2)|] \approx \sqrt{\frac{2}{\pi n}}\)

So random projections capture very little of the variation along an optimisation path
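A quick numerical check of the estimate quoted above, read as the expected magnitude of the cosine similarity between independent Gaussian vectors. The function name, trial count, and dimension are arbitrary choices for this sketch.

```python
import numpy as np

def mean_abs_cosine(n, trials=2000, seed=0):
    """Monte Carlo estimate of E[|cos(v1, v2)|] for independent Gaussian vectors in n dimensions."""
    rng = np.random.default_rng(seed)
    v1 = rng.normal(size=(trials, n))
    v2 = rng.normal(size=(trials, n))
    cos = np.sum(v1 * v2, axis=1) / (np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1))
    return float(np.mean(np.abs(cos)))

n = 1000
empirical = mean_abs_cosine(n)
predicted = float(np.sqrt(2.0 / (np.pi * n)))   # about 0.025 for n = 1000
```

The two values should agree closely, and both shrink as `n` grows. With millions of parameters, a random direction is almost orthogonal to the direction an optimiser actually moves in.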

Visualising optimisation paths

Use PCA directions instead: project the path onto the leading principal components of the parameter trajectory
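A sketch of choosing the two plotting directions by PCA. It assumes the trajectory is stored as one flattened parameter vector per row, takes differences to the final point, and uses an SVD without mean-centring so that the final point sits at the origin; these are choices made for this example. The synthetic trajectory is a toy stand-in for saved training checkpoints.

```python
import numpy as np

def pca_directions(trajectory):
    """Return the two leading directions of (theta_k - theta_final) and their variance shares.

    trajectory: array of shape (num_steps, num_params); the last row is the final point.
    """
    M = trajectory[:-1] - trajectory[-1]
    _, singular_values, Vt = np.linalg.svd(M, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return Vt[:2], explained[:2]

def project(trajectory, directions):
    """Coordinates of each trajectory point in the 2D plane spanned by `directions`."""
    return (trajectory - trajectory[-1]) @ directions.T

# Toy trajectory (assumption): decay inside a 2D subspace of a 50-dimensional space, plus noise.
rng = np.random.default_rng(0)
basis = np.linalg.qr(rng.normal(size=(50, 2)))[0].T   # two orthonormal directions, shape (2, 50)
steps = np.arange(20)[:, None]
coeffs = np.hstack([5.0 * 0.8 ** steps, 2.0 * 0.5 ** steps])
trajectory = coeffs @ basis + 1e-3 * rng.normal(size=(20, 50))

directions, explained = pca_directions(trajectory)
path_2d = project(trajectory, directions)
```

`path_2d` can be overlaid on a loss surface computed along `directions`, and `explained` reports how much of the path's variation each axis captures. In this toy case the two axes capture nearly all of it, which a pair of random directions would not.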