### Visualizing the loss landscape of neural nets

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein

*(NIPS 2018)*

What is the relationship between trainability, generalisation, and architecture choices?

- architecture choices \(\rightarrow\) loss landscape
- global loss landscape (convex or not) \(\leftrightarrow\) trainability
- landscape surrounding minima \(\rightarrow\) generalisation

### Questions

### Tricks of the trade

- Mini-batching
- Batch normalisation
- Self-normalisation
- Dropout
- Residual connections

### Techniques for visualising

between two minima

Minimise \(L(\theta) = \frac{1}{m} \sum_{i=1}^m l(x_i, y_i; \theta)\)

\(x_i\) feature vectors

\(y_i\) labels

\(\theta\) model parameters

For two parameter sets \(\theta_1\) and \(\theta_2\),

plot \(f(\alpha) = L((1 - \alpha) \theta_1 + \alpha \theta_2)\)

### Techniques for visualising

around one minimum

Around a parameter set \(\theta\):

- For 1D, choose a direction vector \(\delta\), and plot \( f(\alpha) = L(\theta + \alpha \delta) \)
- For 2D, choose two direction vectors \(\delta\) and \(\eta\), and plot \( f(\alpha, \beta) = L(\theta + \alpha \delta + \beta \eta) \)

Scaling problem

### Filter-wise normalisation

Direction vectors have same dimensionality as \(\theta\)

Normalise each direction vector filter-wise: \(d_{i,j} \leftarrow \frac{d_{i,j}}{||d_{i,j}||} ||\theta_{i,j}||\)

### Test models

VGG (2014)

AlexNet (2012)

ResNet (2015)

Wide-ResNet (2016)

ResNet-noshort

DenseNet (2016)

### Residual connections \(\leftrightarrow\) network depth

### Residual connections \(\leftrightarrow\) network width

### Does this capture convexity?

Ok, now what about \(\Longleftarrow\)?

Plot \(\left|\frac{\lambda_{min}}{\lambda_{max}}\right|\) for the original Hessian at each point

*The principle curvatures of a randomly projected (Gaussian) surface are weighted averages of the principle curvatures of the original surface (with Chi-square coefficients)*

So: non-convexity in the projected surface

\(\Longleftrightarrow\) non-positive projected Hessian eigenvalues

\(\Longrightarrow\) non-positive original Hessian eigenvalues

\(\Longleftrightarrow\) non-convexity in original surface

### Visualising optimisation paths

In high dimension, random vectors are mostly orthogonal to anything independent (\(E[S_C(v_1, v_2)] \sim \sqrt{\frac{2}{\pi n}}\))

So random projections don't capture much

### Visualising optimisation paths

use PCAs instead

#### Visualizing the loss landscape of neural nets

By Sébastien Lerique