## A jamming transition from under- to over-parametrization affects loss landscape and generalization

Stefano Spigler, Mario Geiger, Stéphane d’Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio Biroli, Matthieu Wyart

Deep learning: a high-dimensional, non-convex problem

• What is the geometry of the landscape?

• How does it depend on the number of parameters?

• Why doesn't training get stuck in local minima?

### Set-up

• Dataset: $$\large \color{red}P$$ points $$\mathbf{x}_1,\dots,\mathbf{x}_P$$

• Binary classification: each point $$\mathbf{x}_\mu$$ has a label $$y_\mu=\pm1$$

• Fully-connected network, $$\large \color{red}N$$ parameters $$\mathbf{W}\in\mathbb{R}^N$$

[Diagram: input $$\mathbf{x}$$ passes through $$L$$ fully-connected layers to produce the output $$f(\mathbf{x}; \mathbf{W})$$]

### Training

• The goal of training is to fit all the points $$\mathbf{x}_\mu$$:

$y_\mu \cdot f(\mathbf{x}_\mu;\mathbf{W}) > 0$

• $$\mathbf{W}$$ is fixed by minimizing a loss function

$\mathcal{L}(\mathbf{W}) = \frac1P \sum_\mu \ell\left(y_\mu \cdot f(\mathbf{x}_\mu; \mathbf{W})\right)$
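The empirical loss above can be sketched numerically. This is a minimal illustration, not the paper's code: the one-hidden-layer tanh network, the sizes, and all names below are illustrative assumptions.

```python
import numpy as np

def forward(x, W1, W2):
    """Output f(x; W) of a toy one-hidden-layer fully-connected net."""
    return np.tanh(x @ W1) @ W2

def empirical_loss(X, y, W1, W2, ell):
    """L(W) = (1/P) * sum_mu ell(y_mu * f(x_mu; W))."""
    margins = y * forward(X, W1, W2)
    return np.mean(ell(margins))

# Toy dataset: P points in d dimensions with random +-1 labels
rng = np.random.default_rng(0)
P, d, h = 50, 10, 20
X = rng.standard_normal((P, d))
y = rng.choice([-1.0, 1.0], size=P)
W1 = rng.standard_normal((d, h)) / np.sqrt(d)
W2 = rng.standard_normal(h) / np.sqrt(h)

# Example with the cross-entropy (logistic) loss ell(yf) = log(1 + exp(-yf))
L = empirical_loss(X, y, W1, W2, lambda m: np.log1p(np.exp(-m)))
```

Any margin-based loss $$\ell$$ can be plugged in the same way; the quadratic hinge used later in the talk is another choice of `ell`.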

[Phase diagram: number of patterns $$P$$ vs number of parameters $$N$$, separating the under-parametrized and over-parametrized regimes]

[Sketch: loss $$\mathcal{L}$$ vs parameters $$\mathbf{W}$$ in the two regimes. Under-parametrized: minima with $$\mathcal{L}>0$$. Over-parametrized (above the upper bound): minima reach $$\mathcal{L}=0$$]

### Loss landscape

Jamming in spheres

• $$\large N =$$ Number of degrees of freedom
• $$\large N_\Delta =$$ Number of contacts (constraints)

Unjammed phase: $$N_\Delta/N = 0$$

Jammed phase: $$N_\Delta/N \geq 1$$

### Loss function

Typical choice:

• cross entropy

$$\ell(yf) = \log(1+e^{-yf})$$

infinite range: the loss never exactly vanishes, so every pattern always contributes

$\mathcal{L} = \frac1P \sum_\mu \ell\left(y_\mu f_\mu\right)$

[Plot: $$\ell$$ as a function of the margin $$fy$$]

### Loss function

Let's use

$$\ell(yf) = \theta(\Delta) \frac12 \Delta^2$$

finite range!

same performance, but a sharp jamming transition

[Plot: quadratic hinge $$\ell$$ vs margin $$fy$$; the loss vanishes for $$fy \geq 1$$]

$$\Delta=1-yf$$
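The quadratic hinge is a one-liner; a minimal sketch (function name is illustrative):

```python
import numpy as np

def quadratic_hinge(margin):
    """ell(yf) = theta(Delta) * Delta^2 / 2, with Delta = 1 - yf.
    Finite range: the loss is identically zero once the margin yf exceeds 1."""
    delta = 1.0 - margin
    return np.where(delta > 0, 0.5 * delta**2, 0.0)

# A pattern fitted with margin beyond 1 contributes exactly zero loss:
assert quadratic_hinge(2.0) == 0.0
```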

Dataset:

• $$\mathbf{x}_\mu=$$ random points on the sphere,

• $$y_\mu=$$ random labels $$\pm1$$

### Location of the transition

[Phase diagram: jamming transition line in the $$(N, P)$$ plane]

$${\large\color{red}N_\Delta} =$$ number of unsatisfied constraints $$\Delta_\mu > 0$$
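Counting $$N_\Delta$$ from the margins is straightforward; a small sketch (names are illustrative):

```python
import numpy as np

def n_delta(margins):
    """Number of unsatisfied constraints: patterns mu with
    Delta_mu = 1 - y_mu * f_mu > 0, i.e. margin below 1."""
    return int(np.sum(1.0 - margins > 0))
```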

### Sharpness of the transition

$\mathcal{L} = \frac1{P} \sum_\mu \theta(\Delta_\mu) \frac12 \Delta_\mu^2$

[Plot: $$N_\Delta/N$$ vs $$P/N$$ at $$N \approx 8000$$, showing a sharp jump at the transition]

After minimization   $$\longrightarrow$$   net is in a minimum of $$\mathcal{L}(\mathbf{W})$$

Stability   $$\Longrightarrow$$   positive semi-definite Hessian $$\nabla^2 \mathcal{L}(\mathbf{W})$$

• Upper bound: $$N < \mathrm{const} \times P$$

• Discontinuous jump: $$N_\Delta > \mathrm{const} \times N$$

$\mathcal{L} = \frac1{P} \sum_\mu \theta(\Delta_\mu)\; \frac12 \Delta_\mu^2, \qquad \Delta_\mu = 1 - y_\mu f(\mathbf x_\mu; \mathbf W)$

$\mathcal{H} = \underbrace{\frac1P \sum_\mu \theta(\Delta_\mu)\; \nabla\Delta_\mu\otimes\nabla\Delta_\mu}_{\large\mathcal{H}_0} + \underbrace{\frac1P \sum_\mu \theta(\Delta_\mu)\; \Delta_\mu\, \nabla^2 \Delta_\mu}_{\large\mathcal{H}_p}$

• $$\mathcal{H}_0$$ is positive semi-definite, with $$\mathrm{rank}\leq N_\Delta$$

• Let $$N_- =$$ number of eigenvalues $$< 0$$

• Stability   $$\Longrightarrow \ \ \large N_-\leq N_\Delta$$
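The split $$\mathcal{H} = \mathcal{H}_0 + \mathcal{H}_p$$ can be checked on a toy model. A minimal sketch, assuming the illustrative one-parameter-vector model $$f(\mathbf x; \mathbf w) = \tanh(\mathbf w \cdot \mathbf x)$$ (not the paper's architecture), for which the gradients and Hessians of $$\Delta_\mu$$ are available in closed form:

```python
import numpy as np

def hessian_split(X, y, w):
    """For the toy model f(x; w) = tanh(w . x), build the two Hessian pieces:
    H0 = (1/P) sum theta(Delta) grad(Delta) x grad(Delta)   (psd, rank <= N_Delta)
    Hp = (1/P) sum theta(Delta) Delta * hess(Delta)         (can have negative eigenvalues)
    """
    P, N = X.shape
    H0 = np.zeros((N, N))
    Hp = np.zeros((N, N))
    for x, yi in zip(X, y):
        t = np.tanh(w @ x)
        delta = 1.0 - yi * t
        if delta <= 0:  # theta(Delta): satisfied constraints drop out
            continue
        # grad f = (1 - t^2) x,  hess f = -2 t (1 - t^2) x x^T
        g = -yi * (1.0 - t**2) * x                             # grad Delta
        h = -yi * (-2.0 * t * (1.0 - t**2)) * np.outer(x, x)   # hess Delta
        H0 += np.outer(g, g) / P
        Hp += delta * h / P
    return H0, Hp

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
y = rng.choice([-1.0, 1.0], size=6)
w = rng.standard_normal(4)
H0, Hp = hessian_split(X, y, w)
```

By construction $$\mathcal{H}_0$$ is a sum of at most $$N_\Delta$$ rank-one outer products, which is exactly the rank bound used in the stability argument.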

### Spectrum of $$\mathcal{H}_p$$

[Histogram of the eigenvalues of $$\mathcal{H}_p$$: roughly half are negative, $$N_- \approx N/2$$]

• Upper bound:   $$N < \frac{1}{C_0} \times P$$

• Discontinuous jump:   $$N_\Delta > C_0 \times N$$
• Stability   $$\Longrightarrow \ N_-\leq N_\Delta\leq P$$

• Assumption:  $$N_- = \left\{\begin{array}{ll}C_0\times N & \mathcal{L} > 0 \\ 0 & \mathcal{L} = 0\end{array} \right.$$

$$C_0 \approx 0.5$$   in all observed cases
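Combining the assumption with the stability chain gives the upper bound in two lines:

```latex
\mathcal{L} > 0 \;\Longrightarrow\; C_0 \times N \,=\, N_- \,\leq\, N_\Delta \,\leq\, P
\;\Longrightarrow\; N \,\leq\, \frac{1}{C_0}\, P
```

so any stable minimum of a network with $$N > \frac{1}{C_0} P$$ must have $$\mathcal{L} = 0$$.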

• $$\mathcal{L}=0$$:   $$N_-=0$$,   $$N_\Delta=0$$

• $$\mathcal{L}>0$$:   $$N_-= C_0\times N$$,   $$N_\Delta>C_0\times N$$

Dataset: $$\mathbf{x}_\mu =$$ MNIST images, $$y_\mu =$$ parity of the digit

[Plot: jamming line in the $$(N, P)$$ plane for MNIST, with slope $$\approx 3/4$$]

### Spectrum of $$\mathcal{H}$$

| Regime | $$\mathcal{H}_p$$ | $$N_\Delta$$ |
| --- | --- | --- |
| Over-parametrized | $$=0$$ | $$=0$$ |
| Jamming | $$=0$$ | $$>C_0\times N$$ |
| Under-parametrized | $$\neq 0$$, scale $$\sim\sqrt{\mathcal{L}}$$ | increasing |

[Eigenvalue spectra of $$\mathcal{H}$$ in the three regimes]

## Conclusions

• Jamming transition delimits over- and under-parametrized networks

• In the over-parametrized phase only $$\mathcal{L}=0$$ (perfect fitting) is stable

### Generalization for MNIST

[Plot: generalization for MNIST in the $$(N, P)$$ plane]

### Generalization

• There is a cusp at jamming

• Test error monotonically decreases in the over-parametrized phase

[Plot: test error vs $$N/N_\mathrm{jamming}$$, with a cusp at jamming and a monotonic decrease beyond it]



HEP-AI talk on "A jamming transition from under- to over-parametrization affects loss landscape and generalization", arXiv:1809.09349 and arXiv:1810.09665
