Journal Club @EPFL, 05.01.21

Prevalence of neural collapse during the terminal phase of deep learning training

Vardan Papyan, X. Y. Han and David L. Donoho

  • Training of modern architectures involves a terminal phase of training (TPT), in which $$\epsilon_\text{train} = 0, \qquad \text{while} \qquad \mathcal{L}(\text{train-set}) \to 0$$
  • In this phase, the authors observe an inductive bias in last layer activations.
    They call it Neural Collapse (NC).
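As a rough sketch of this protocol (not the authors' code; the model, data loader, and optimizer are assumed to exist), a loop that keeps minimizing the cross-entropy loss after the train error has already reached zero might look like this:

```python
import torch
import torch.nn.functional as F

def train_with_tpt(model, loader, optimizer, total_epochs=350, device="cpu"):
    """Keep training past zero train error: the terminal phase of training (TPT)."""
    for epoch in range(total_epochs):
        correct, total = 0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            loss = F.cross_entropy(logits, y)   # keeps decreasing even after 100% train accuracy
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += (logits.argmax(dim=1) == y).sum().item()
            total += y.numel()
        train_error = 1 - correct / total
        # Once train_error == 0 we are in TPT: we do NOT stop, we keep driving the loss down.
        print(f"epoch {epoch}: train error {train_error:.4f}, last batch loss {loss.item():.4f}")
```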

[Figure: train error reaches zero while the train loss keeps decreasing during TPT]


Some notation

  • Classes: \(\quad c \in \{1, \dots, C\}\)
  • Data-point: \(\quad x_{i,c},\quad\) \(i-\)th example in class \(c\)
     
  • Net (focus on last layer): the network maps an input \(x\) to last-layer features \(h(x)\), which feed a linear classifier with weights \(w_c\) and biases \(b_c\).
  • Net prediction: $$\widehat{c}(x) = \argmax_c\langle w_c, h(x)\rangle + b_c$$
    Here \(h(x)\) are the features, \(w_c\) the classifier weights, and \(\langle w_c, h(x)\rangle + b_c\) the classifier output for class \(c\).
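A minimal sketch of this last-layer view in PyTorch (the `backbone` / `head` split and all layer sizes are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

# Hypothetical decomposition of a classifier net into a feature map and a linear head.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 512), nn.ReLU())  # x -> h(x)
head = nn.Linear(512, 10)                                                       # h -> <w_c, h> + b_c

x = torch.randn(8, 3, 32, 32)        # a batch of dummy inputs
h = backbone(x)                      # last-layer features h(x)
logits = head(h)                     # classifier outputs, one per class c
c_hat = logits.argmax(dim=1)         # net prediction: argmax_c <w_c, h(x)> + b_c
```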

Some more notation

  • Global mean: \(\quad \mu_G = \langle h_{i,c} \rangle_{i,c}\)
  • Class mean: \(\;\;\;\quad \mu_c = \langle h_{i,c} \rangle_{i}\)
  • Within-class covariance: \(\quad \Sigma_W = \langle (h_{i,c}-\mu_c)(h_{i,c}-\mu_c)^T \rangle_{i,c}\)
  • Between-class covariance: \(\quad \Sigma_B = \langle (\mu_c-\mu_G)(\mu_c-\mu_G)^T \rangle_{c}\)

Notice that the features \(h_{i,c}\) and the classifier weights \(w_c\) live in the same space.

That's the space in which we observe Neural Collapse.
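Given the last-layer features \(h_{i,c}\), these statistics can be computed directly; a minimal numpy sketch (the array layout is an assumption):

```python
import numpy as np

def nc_statistics(H):
    """H has shape (C, N, D): N features of dimension D for each of C classes."""
    C, N, D = H.shape
    mu_c = H.mean(axis=1)                          # class means, shape (C, D)
    mu_G = H.reshape(-1, D).mean(axis=0)           # global mean, shape (D,)

    centered = H - mu_c[:, None, :]                # h_{i,c} - mu_c
    Sigma_W = np.einsum("cnd,cne->de", centered, centered) / (C * N)   # within-class covariance

    diff = mu_c - mu_G                             # mu_c - mu_G
    Sigma_B = diff.T @ diff / C                    # between-class covariance
    return mu_G, mu_c, Sigma_W, Sigma_B
```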

[Figures: the classifier weights \(w_c\), the features \(h_{i,c}\), and the class means \(\mu_c\) drawn in the shared features & classifiers space, together with the within-class covariance \(\Sigma_W = \langle (h_{i,c}-\mu_c)(h_{i,c}-\mu_c)^T \rangle_{i,c}\) and the between-class covariance \(\Sigma_B = \langle (\mu_c-\mu_G)(\mu_c-\mu_G)^T \rangle_{c}\)]

Simplex Equiangular Tight Frame (ETF):

A set of vectors which are

  1. Equinorm: all vectors have the same norm
  2. Equiangular: every pair of vectors makes the same angle
  3. Maximally separated: this common angle is the largest possible one; for \(C\) vectors its cosine is \(-1/(C-1)\)
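One standard construction of a simplex ETF (a sketch, not taken from the paper's code) rescales the centered identity matrix; the quick check below verifies the three properties:

```python
import numpy as np

def simplex_etf(C):
    """C equinorm, equiangular, maximally separated vectors in R^C (one per column)."""
    return np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)

C = 10
M = simplex_etf(C)
norms = np.linalg.norm(M, axis=0)
cosines = (M.T @ M) / np.outer(norms, norms)
off_diag = cosines[~np.eye(C, dtype=bool)]

print(norms)                            # all equal (equinorm)
print(off_diag.min(), off_diag.max())   # all equal to -1/(C-1): equiangular, maximal angle
```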

Neural Collapse

Training a net past \(\epsilon_\text{train} = 0\), while \(\mathcal{L}(\text{train-set}) \to 0\), frequently induces neural collapse (NC), which is characterized by:

  • [NC1] Within-class variability collapses: $$\Sigma^t_W \to 0 \text{ as } t \to \infty$$
  • [NC2] Class means converge to Simplex ETF
  • [NC3] Convergence to self-duality:
    $$\left\vert\left\vert\dfrac{\mu^t_c}{||\mu^t_c||} - \dfrac{w^t_c}{||w^t_c||} \right\vert \right\vert \to 0$$
  • [NC4] Classification in the last layer is equivalent to nearest-neighbor classifier w.r.t. class means.

VGG-13 trained on CIFAR-10

 

Evidence of Neural Collapse

  • Related to "compression" of features irrelevant for the task.
  • For ImageNet one could argue that this is not exactly going to zero. It could be that:
    • one just needs to spend more time in TPT
    • noisy labels play a role
    • ...

[NC1] Within-class variability collapses

[NC2a] Class means become equinorm

y-axis: \(\dfrac{\text{Std}_c[||\mu_c - \mu_G||]}{\mathbb{E}_c[||\mu_c - \mu_G||]}\)

[NC2b] Class means approach equiangularity

y-axis: \(\text{Std}_{c, c' \neq c}\cos_\mu(c, c')\), where \(\cos_\mu(c, c') = \dfrac{\langle \mu_c - \mu_G, \mu_{c'} - \mu_G \rangle}{||\mu_c - \mu_G|| \,||\mu_{c'} - \mu_G||}\)

[NC2c] Class means approach maximal angle

y-axis: \(\mathbb{E}_{c, c' \neq c} \left|\cos_\mu(c, c') + 1/(C-1)\right|\)

[NC3] Self-duality

Class means matrix: \(\mathbf{M} = [\mu_c - \mu_G,\: c \in \{1, \dots, C\}]\)

Classifier weights matrix: \(\mathbf{W}\)

y-axis: \(\left\vert\left\vert\dfrac{\mathbf{M}}{||\mathbf{M}||_F} - \left(\dfrac{\mathbf{W}}{||\mathbf{W}||_F}\right)^\intercal \right\vert \right\vert_F^2\)
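A sketch of how these y-axis quantities could be computed at each epoch from the class means and classifier weights (numpy; array names and shapes are assumptions, not the paper's code):

```python
import numpy as np

def nc_metrics(mu_c, mu_G, W):
    """mu_c: (C, D) class means, mu_G: (D,) global mean, W: (C, D) classifier weights."""
    C = mu_c.shape[0]
    M = mu_c - mu_G                                    # centered class means, one per row

    # [NC2a] equinorm: relative spread of the class-mean norms
    norms = np.linalg.norm(M, axis=1)
    equinorm = norms.std() / norms.mean()

    # [NC2b] / [NC2c] equiangularity and maximal angle
    cosines = (M @ M.T) / np.outer(norms, norms)
    off = cosines[~np.eye(C, dtype=bool)]
    equiangular = off.std()
    max_angle = np.abs(off + 1 / (C - 1)).mean()

    # [NC3] self-duality between (normalized) class means and classifier weights
    self_duality = np.linalg.norm(M / np.linalg.norm(M) - W / np.linalg.norm(W)) ** 2

    return equinorm, equiangular, max_angle, self_duality
```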

[NC4] Last layer ~ nearest-neighbor classifier w.r.t. class means

Net classifier: \(\argmax_c\langle w_c, h(x)\rangle + b_c\)

Nearest-neighbor classifier: \(\argmin_c ||h(x) - \mu_c||_2\)

y-axis: mismatches between the two classifiers
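A sketch of this comparison (numpy; `H`, `W`, `b`, and `mu_c` are assumed arrays of features, classifier weights, biases, and class means):

```python
import numpy as np

def nc4_mismatch(H, W, b, mu_c):
    """H: (N, D) last-layer features, W: (C, D) weights, b: (C,) biases, mu_c: (C, D) class means."""
    net_pred = (H @ W.T + b).argmax(axis=1)                           # argmax_c <w_c, h(x)> + b_c
    dists = np.linalg.norm(H[:, None, :] - mu_c[None, :, :], axis=2)  # ||h(x) - mu_c|| for every c
    ncc_pred = dists.argmin(axis=1)                                   # nearest class mean
    return (net_pred != ncc_pred).mean()                              # fraction of disagreements
```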

Training beyond zero-error improves test error

Why the simplex ETF?
Is it optimal, in some sense?

Simplex ETF emergence in the presence of [NC1]

  • We are given observations \(h = \mu_\gamma + \mathbf z\), with \(\mathbf z \sim \mathcal{N}(\mathbf 0, \sigma^2 \mathbf I)\) and \(\gamma \sim \text{Unif}\{1, \dots, C\}\).
  • Goal: recover \(\gamma\) from \(h\) with a linear classifier:
    $$\hat{\gamma}(h) = \argmax_\gamma\:\langle w_\gamma, h\rangle + b_\gamma$$
  • Task: design a classifier \((\mathbf w, \mathbf b)\) and choose the class means \(\mathbf M = \{\mu_c\}_{c=1}^C\) which minimize the classification error, subject to \(||\mu_c||_2 \leq 1\).
  • Optimality criterion: the large-deviations error exponent

    $$\beta(\mathbf M, \mathbf w, \mathbf b) = -\lim_{\sigma\to 0} \sigma^2 \log P_\sigma\{\hat \gamma (h) \neq \gamma\}$$

  • Theorem. Optimal error exponent \(\beta^* = \max_{(\mathbf M, \mathbf w, \mathbf b)}\beta(\mathbf M, \mathbf w, \mathbf b)\) is achieved by the simplex ETF, \(\mathbf M^*\):
    $$\beta(\mathbf M^*,\mathbf M^*,0) = \beta^*$$
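A small Monte Carlo sanity check of this claim (a sketch under the stated Gaussian model, not from the paper): compare the misclassification rate when the class means form a simplex ETF versus random unit-norm means, decoding with \(w_\gamma = \mu_\gamma\), \(b_\gamma = 0\); the ETF codebook should typically give the lower error.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, n, sigma = 10, 10, 20000, 0.35

def error_rate(M):
    """M: (C, D) class means; decode with w_gamma = mu_gamma, b = 0."""
    gamma = rng.integers(0, C, size=n)
    h = M[gamma] + sigma * rng.normal(size=(n, D))   # h = mu_gamma + z
    gamma_hat = (h @ M.T).argmax(axis=1)             # argmax_gamma <mu_gamma, h>
    return (gamma_hat != gamma).mean()

# Simplex ETF means (unit norm), embedded in R^D with D >= C
M_etf = np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)
M_etf = np.pad(M_etf, ((0, 0), (0, D - C)))

# Random unit-norm means
M_rand = rng.normal(size=(C, D))
M_rand /= np.linalg.norm(M_rand, axis=1, keepdims=True)

print("ETF error:   ", error_rate(M_etf))
print("random error:", error_rate(M_rand))
```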

Information theory perspective

  • The class means act as codewords transmitted over a noisy channel, and the linear classifier acts as a linear decoder [Shannon, 1959].
  • The norm constraint \(||\mu_c||_2 \leq 1\) corresponds to a limit on signal strength.
  • The Gaussian noise around each class mean plays the role of the (vanishing) within-class variability of [NC1].
  • Task: design the decoder and the codebook for optimal retrieval.

Simplex ETF emerges as the optimal structure in the presence of [NC1] and Gaussian noise.

Relation to previous results

Sharpening previous results (1)

Slides from Papyan's talk at MIT 9.520/6.860: Statistical Learning Theory and Applications, Fall 2020

Sharpening previous results (2)

Slides from Papyan's talk at MIT 9.520/6.860: Statistical Learning Theory and Applications, Fall 2020

The margin is the same for each point in the dataset and it is as large as it can possibly be.
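To make the margin statement concrete, a small numpy check (a sketch assuming the fully collapsed configuration: features sit exactly at their class means and the classifier equals the simplex ETF): every point then attains the same margin.

```python
import numpy as np

C = 10
M = np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)   # simplex ETF, rows = class means
W = M.copy()                                                    # self-duality: classifier = class means

# Under exact NC1 every feature h_{i,c} equals its class mean mu_c.
logits = M @ W.T                                                # (C, C): logits of each class mean
correct = np.diag(logits)
others = np.where(np.eye(C, dtype=bool), -np.inf, logits).max(axis=1)
margins = correct - others                                      # margin of each (collapsed) class

print(margins)   # identical for every class: 1 - (-1/(C-1)) = C/(C-1)
```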

Conclusions

  • The paper studies the canonical deep-net training protocol
  • During the terminal phase of training, NC takes place:
    • [NC1] Within-class variability vanishes
    • [NC2-3] Class means and classifiers tend to a simplex ETF
    • [NC4] The last-layer classifier acts as a nearest-neighbor classifier w.r.t. class means
  • The simplex ETF configuration is optimal under Gaussian noise (in the error-exponent sense)
  • NC sharpens previous insights

[Journal Club @EPFL] Prevalence of neural collapse during the terminal phase of deep learning training

By Leonardo Petrini
