### Visualizing the loss landscape of neural nets

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein

(NIPS 2018)

What is the relationship between trainability, generalisation, and architecture choices?

• architecture choices $$\rightarrow$$ loss landscape
• global loss landscape (convex or not) $$\leftrightarrow$$ trainability
• landscape surrounding minima $$\rightarrow$$ generalisation

### Tricks of the trade

• Mini-batching
• Batch normalisation
• Self-normalisation
• Dropout
• Residual connections

### Techniques for visualising

between two minima

Minimise $$L(\theta) = \frac{1}{m} \sum_{i=1}^m l(x_i, y_i; \theta)$$

$$x_i$$ feature vectors

$$y_i$$ labels

$$\theta$$ model parameters

For two parameter sets $$\theta_1$$ and $$\theta_2$$,

plot $$f(\alpha) = L((1 - \alpha) \theta_1 + \alpha \theta_2)$$
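As a minimal sketch of this interpolation (assuming two trained PyTorch models with the same architecture and a hypothetical `evaluate_loss(model)` helper that computes $$L$$ on a fixed dataset; the function names and grid size are my own choices, not the paper's code):

```python
import copy

import numpy as np
import torch

def interpolation_curve(model_1, model_2, evaluate_loss, n_points=25):
    """Loss along the straight line between the parameter sets of two models."""
    theta_1 = [p.detach().clone() for p in model_1.parameters()]
    theta_2 = [p.detach().clone() for p in model_2.parameters()]
    probe = copy.deepcopy(model_1)  # scratch model whose weights get overwritten

    alphas = np.linspace(0.0, 1.0, n_points)
    losses = []
    for alpha in alphas:
        with torch.no_grad():
            for p, t1, t2 in zip(probe.parameters(), theta_1, theta_2):
                p.copy_((1 - alpha) * t1 + alpha * t2)
        # f(alpha) = L((1 - alpha) * theta_1 + alpha * theta_2)
        losses.append(evaluate_loss(probe))
    return alphas, np.array(losses)
```

Plotting `losses` against `alphas` then gives the $$f(\alpha)$$ curve between the two minima.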

### Techniques for visualising

around one minimum

Around a parameter set $$\theta$$:

• For 1D, choose a direction vector $$\delta$$, and plot $$f(\alpha) = L(\theta + \alpha \delta)$$
• For 2D, choose two direction vectors $$\delta$$ and $$\eta$$, and plot $$f(\alpha, \beta) = L(\theta + \alpha \delta + \beta \eta)$$
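A minimal sketch of the 2D case (assuming a PyTorch model near a minimum, the same hypothetical `evaluate_loss(model)` helper as in the interpolation sketch above, and direction vectors `delta` and `eta` given as lists of tensors shaped like the parameters):

```python
import numpy as np
import torch

def loss_surface_2d(model, evaluate_loss, delta, eta, span=1.0, n=21):
    """Evaluate f(alpha, beta) = L(theta + alpha * delta + beta * eta) on a grid."""
    theta = [p.detach().clone() for p in model.parameters()]
    alphas = betas = np.linspace(-span, span, n)
    surface = np.empty((n, n))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            with torch.no_grad():
                for p, t, d, e in zip(model.parameters(), theta, delta, eta):
                    p.copy_(t + a * d + b * e)
            surface[i, j] = evaluate_loss(model)
    with torch.no_grad():  # restore the original parameters
        for p, t in zip(model.parameters(), theta):
            p.copy_(t)
    return alphas, betas, surface
```

A contour plot of `surface` over the `alphas` × `betas` grid reproduces the 2D visualisations.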

Scaling problem: with batch normalisation, rescaling a layer's weights leaves the network's output unchanged, so the apparent sharpness of such plots depends on the arbitrary scale of the weights and of the random directions, and is not comparable across models.

### Filter-wise normalisation

Direction vectors have the same dimensionality as $$\theta$$

Normalise each direction vector filter-wise: $$d_{i,j} \leftarrow \frac{d_{i,j}}{\|d_{i,j}\|} \|\theta_{i,j}\|$$, where $$d_{i,j}$$ is the $$j$$-th filter of the $$i$$-th layer of the direction $$d$$
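A minimal sketch of drawing one filter-normalised random direction for a PyTorch model (the treatment of biases and batch-norm parameters, and the function name, are my own assumptions rather than the paper's exact code):

```python
import torch

def filter_normalised_direction(model):
    """Random Gaussian direction whose filters are rescaled to the norms
    of the corresponding filters of the model's parameters."""
    direction = []
    for theta in model.parameters():
        d = torch.randn_like(theta)
        if theta.dim() <= 1:
            d.zero_()  # leave biases / batch-norm parameters unperturbed (assumption)
        else:
            for j in range(theta.size(0)):  # first dimension indexes the filters
                # d_{i,j} <- d_{i,j} / ||d_{i,j}|| * ||theta_{i,j}||
                d[j].mul_(theta[j].norm() / (d[j].norm() + 1e-10))
        direction.append(d)
    return direction
```

Two such directions can be used as `delta` and `eta` in the 2D slice sketch above.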

### Test models

• VGG (2014)
• AlexNet (2012)
• ResNet (2015)
• Wide-ResNet (2016)
• ResNet-noshort (ResNet with the skip connections removed)
• DenseNet (2016)

### Does this capture convexity?

The 1D/2D slices can make a surface look convex — but does apparent (non-)convexity in the projection say anything about the full-dimensional surface?

The principal curvatures of a randomly projected (Gaussian) surface are weighted averages of the principal curvatures of the original surface, with non-negative (chi-square distributed) weights.

So: non-convexity in the projected surface

$$\Longleftrightarrow$$ a negative projected Hessian eigenvalue

$$\Longrightarrow$$ a negative original Hessian eigenvalue

$$\Longleftrightarrow$$ non-convexity in the original surface

What about $$\Longleftarrow$$? Apparent convexity in the projection can still hide negative curvature, so check by plotting $$\left|\frac{\lambda_{min}}{\lambda_{max}}\right|$$ for the original Hessian at each point
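This eigenvalue ratio can be estimated without materialising the full Hessian, by combining Hessian-vector products (a second backward pass) with an iterative eigensolver. A minimal sketch, assuming a PyTorch model, a loss criterion, and one fixed batch `(x, y)`; the function name and the use of SciPy's `eigsh` are my own choices:

```python
import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, eigsh

def min_max_eig_ratio(model, criterion, x, y):
    """|lambda_min / lambda_max| of the loss Hessian at the current parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    n = sum(p.numel() for p in params)

    # Keep the graph of the gradient so we can differentiate it again.
    loss = criterion(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    def hvp(v):
        """Hessian-vector product via double backpropagation."""
        v = torch.as_tensor(np.asarray(v), dtype=torch.float32)
        vs, offset = [], 0
        for p in params:
            vs.append(v[offset:offset + p.numel()].view_as(p))
            offset += p.numel()
        dot = sum((g * w).sum() for g, w in zip(grads, vs))
        hv = torch.autograd.grad(dot, params, retain_graph=True)
        return torch.cat([h.reshape(-1) for h in hv]).detach().numpy()

    H = LinearOperator((n, n), matvec=hvp, dtype=np.float32)
    lam_max = eigsh(H, k=1, which='LA', return_eigenvectors=False)[0]
    lam_min = eigsh(H, k=1, which='SA', return_eigenvectors=False)[0]
    return abs(lam_min / lam_max)
```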

### Visualising optimisation paths

In high dimension $$n$$, independent random vectors are nearly orthogonal: the expected absolute cosine similarity is $$E[|\cos(v_1, v_2)|] \approx \sqrt{\frac{2}{\pi n}}$$

So projecting an optimisation path onto random directions captures almost none of its motion
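A quick numerical check of that scaling (pure NumPy; the dimensionality is an arbitrary stand-in for a network's parameter count):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000_000, 50  # stand-in for the number of network parameters
cosines = []
for _ in range(trials):
    v1, v2 = rng.standard_normal(n), rng.standard_normal(n)
    cosines.append(abs(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(np.mean(cosines))            # empirical E[|cos(v1, v2)|], around 8e-4
print(np.sqrt(2 / (np.pi * n)))    # predicted sqrt(2 / (pi * n)), ~8e-4
```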

### Visualising optimisation paths

Use PCA directions instead: collect the parameters at each epoch, form the matrix of differences $$\theta_i - \theta_{final}$$, and project the optimisation path onto its first two principal components
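A minimal sketch, assuming the trajectory has been saved as one flattened parameter vector per epoch (the array layout and the use of scikit-learn's `PCA` are my own choices):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_trajectory(checkpoints):
    """Project a training path onto its first two PCA directions.

    `checkpoints` has shape (n_steps, n_params): one flattened parameter
    vector theta_i per saved epoch, ending at the final parameters.
    """
    diffs = np.asarray(checkpoints) - checkpoints[-1]  # theta_i - theta_final
    pca = PCA(n_components=2)
    coords = pca.fit_transform(diffs)   # 2D coordinates of each step along the path
    directions = pca.components_        # the two axes to use for the surface plot
    print("explained variance ratio:", pca.explained_variance_ratio_)
    return coords, directions
```

The projected path (`coords`) can then be overlaid on a loss contour plot drawn along those same two directions.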

#### Visualizing the loss landscape of neural nets

By Sébastien Lerique
