Geometric compression of invariant manifolds in neural nets

Jonas Paccolat, Leonardo Petrini, Mario Geiger, Kevin Tyloo, and Matthieu Wyart

Leonardo Petrini

Currently PhD Student @

Physics of Complex Systems Lab

Advisor: Prof. Matthieu Wyart

 

Research Interests:

  • Statistical Physics
  • Neural Nets
  • Reinforcement Learning
    (very recently)

Other interests:

  • Hiking
  • Some climbing 🧗
  • Pizza 🍕
  • ...

Motivation: neural nets adapt to the data structure

(Figure: a CIFAR10 data point; e.g. pixels in the corner are unrelated to the class label.)

The success of neural nets is often attributed to (e.g. Mallat, 2016):

  • learning data features / invariants
  • compressing irrelevant directions in data space


 

  1. There is no quantitative general framework to understand these aspects
  2. Some observables to quantify this compression have been proposed
    • Mutual Information
      Shwartz-Ziv and Tishby (2017), Saxe et al. (2019)
    • Effective dimension
      Ansuini et al. (2019), Recanatesi et al. (2019)


Neural Nets Learning Regimes

starting from [Jacot et al. 2018] ...

Lazy Regime

 

weights and NTK are
~constant during learning 

 

Cannot learn features of the data

Feature Regime

 

weights and NTK
evolve during learning

 

Can possibly learn features and perform compression

 


Idea... use the evolving NTK to diagnose compression

Focus on simple models for compression: the Stripe Model

Classification task:

Data-points:   \(\vec x = (\vec x_\parallel, \vec x_\bot)\)

Labelling function:   \(y(\vec x) = y(\vec x_\parallel) \in \{-1, 1\}\)

Dataset:
Train set size: \(p\)

\(\vec x^\mu \sim \mathcal{N}(0, I_d)\),   for \(\mu=1,\dots,p\)

Classification task:  \(y(\vec x) = y(\vec x_\parallel) \in \{-1, 1\}\)

(Figure: an instance of the dataset for \(d=2\).)

Learning Algo:

Fully connected one-hidden layer NN:

$$f(\vec x) = \frac{1}{h} \sum_{n=1}^h \beta_n \: \sigma \left(\frac{\vec \omega_n \cdot \vec x}{\sqrt{d}} + b_n \right)$$

with \(\sigma(\cdot) = \text{ReLU}(\cdot)\).

◦ Hinge Loss  \(l(y, \hat y) = \max(0, 1- y\hat y)\)

Vanilla gradient descent on  $$F(\vec x) = \alpha \left(f(\vec x) - f_0(\vec x)\right),$$ where \(f_0\) is the network function at initialization.

Varying \(\alpha\) drives the network dynamics from the feature (small \(\alpha\)) to the lazy (large \(\alpha\)) regime [Chizat et al. 2018].
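For concreteness, here is a minimal PyTorch sketch of this setup. The specific labelling rule (y given by the sign of the first coordinate), all hyperparameter values, and the learning-rate rescaling with \(\alpha\) are illustrative assumptions, not specifications from the talk.

```python
import torch
import torch.nn as nn

d, h, p, alpha = 2, 1000, 256, 1e-2      # input dim, width, train-set size, scale alpha

# Dataset: x^mu ~ N(0, I_d); the label depends only on x_parallel (here the first coordinate).
X = torch.randn(p, d)
y = torch.sign(X[:, 0])                  # hypothetical choice of y(x_parallel)

class OneHiddenLayer(nn.Module):
    """f(x) = (1/h) * sum_n beta_n * ReLU(omega_n . x / sqrt(d) + b_n)."""
    def __init__(self, d, h):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(h, d))
        self.b = nn.Parameter(torch.randn(h))
        self.beta = nn.Parameter(torch.randn(h))

    def forward(self, x):
        pre = x @ self.omega.t() / d ** 0.5 + self.b
        return torch.relu(pre) @ self.beta / self.beta.shape[0]

net = OneHiddenLayer(d, h)
f0 = OneHiddenLayer(d, h)                # frozen copy of the initialization for the alpha-trick
f0.load_state_dict(net.state_dict())
for prm in f0.parameters():
    prm.requires_grad_(False)

def F(x):
    # F(x) = alpha * (f(x) - f_0(x)); small alpha -> feature regime, large alpha -> lazy regime.
    return alpha * (net(x) - f0(x))

def hinge(pred, target):
    # l(y, y_hat) = max(0, 1 - y * y_hat), averaged over the training set.
    return torch.clamp(1.0 - pred * target, min=0.0).mean()

opt = torch.optim.SGD(net.parameters(), lr=1.0 / alpha ** 2)  # lr scaling with alpha: an assumption
for step in range(1000):
    opt.zero_grad()
    hinge(F(X), y).backward()
    opt.step()
```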


Training dynamics

Compression in the feature regime

(Figure: weights evolution during training, for a subset of neurons.)
  • The network function loses its dependence on the \(\vec x_\bot\) direction
  • The NTK evolves with the weights and also loses its dependence on \(\vec x_\bot\)
    (a more formal argument is given in the preprint arXiv:2007.11471)
  • NTK eigenvectors are of two kinds:
    \(\phi^1_\lambda(\vec x)= \phi_\lambda^1(\vec x_\parallel)\) and  \(\phi^2_\lambda(\vec x) = \phi_\lambda^2(\vec x_\parallel) \vec u \cdot \vec x_\bot\)
  • Given that the labelling function is \(y(\vec x) = y(\vec x_\parallel)\), we expect \(\phi_\lambda(\vec x)\) at the end of training to be informative about the label
  • We also expect the projection of the top \(\phi_\lambda(\vec x)\) on the label to be large
  • For kernel methods, if this projection on the top eigenvectors is large, one expects good performance [Schölkopf et al. 2002]

more tomorrow in Matthieu's lecture

Testing predictions via Kernel PCA

  • Top kernel principal components become more informative about the output label
  • Larger projection \(\rightarrow\) expect improved end-of-training NTK performance if used in a kernel method (see the kernel-PCA sketch below)

(Figure: the squared projection \(\omega_{\lambda_r}^2 = \frac{1}{p^2} \langle \phi_{\lambda_r} | y \rangle^2\) and the mutual information \(I(\phi_{\lambda_r}; y)\), both plotted against the eigenvector rank \(r\).)
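To make this test concrete, here is a small kernel-PCA sketch in PyTorch: given a \(p \times p\) Gram matrix (e.g. the empirical NTK) and the labels, it returns the spectrum together with the squared label projection at each eigenvector rank. The feature-space centering step and the exact \(1/p^2\) normalisation are our assumptions about the conventions behind the figure.

```python
import torch

def kpca_label_projections(Theta, y):
    """Eigen-decompose a p x p kernel and project the labels on each eigenvector."""
    p = Theta.shape[0]
    one = torch.full((p, p), 1.0 / p, dtype=Theta.dtype)
    # Standard kernel-PCA centering in feature space.
    K = Theta - one @ Theta - Theta @ one + one @ Theta @ one
    evals, evecs = torch.linalg.eigh(K)            # eigenvalues in ascending order
    evals, evecs = evals.flip(0), evecs.flip(1)    # reorder so that rank r = 0 is the top
    proj2 = (evecs.t() @ y.to(Theta.dtype)) ** 2 / p ** 2   # omega_{lambda_r}^2 for each rank r
    return evals, proj2

# Expectation tested on the Stripe Model: after training in the feature regime,
# proj2 concentrates on the first few ranks, i.e. the top kernel principal
# components become informative about the label y.
```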

Neural Tangent Kernel PCA

We look at the first two
NTK eigenvectors, \(\phi_{\lambda_1}\) and \(\phi_{\lambda_2}\) (figure).


NTK dynamics

The neural tangent kernel [Jacot et al. 2018] is defined as $$\Theta(\vec x^\mu, \vec x^\nu) = \partial_W f(\vec x^\mu) \cdot \partial_W f(\vec x^\nu)$$

where the scalar product runs over all network weights.

 

Gradient Flow: 

$$\partial_t W = \frac{1}{p} \sum_{\mu=1}^p \partial_W f(\vec x^\mu) \, y^\mu \: l'\left(f(\vec x^\mu) \, y^\mu\right),$$

 

Gradient evolution in functional space:

$$\begin{aligned} \partial_t f(\vec x) &= \sum_w \partial_w f(\vec x)\, \partial_t w \\ &= \frac{1}{p} \sum_\mu \partial_W f(\vec x)\cdot \partial_W f(\vec x^\mu)\, y^\mu \, l'\left(f(\vec x^\mu)\, y^\mu\right) \\ &= \frac{1}{p} \sum_\mu \Theta(\vec x, \vec x^\mu)\, y^\mu \, l'\left(f(\vec x^\mu)\, y^\mu\right) \end{aligned}$$

In general, \(\Theta\) evolves with time.
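For reference, a brute-force way to compute the empirical NTK Gram matrix of a small network with autograd (a sketch of ours, not the authors' implementation; the per-sample loop is fine at these sizes but not efficient):

```python
import torch

def ntk_gram(model, X):
    """Theta_{mu nu} = <grad_W f(x^mu), grad_W f(x^nu)>, over all trainable weights."""
    params = [prm for prm in model.parameters() if prm.requires_grad]
    rows = []
    for x in X:
        out = model(x.unsqueeze(0)).squeeze()                   # scalar network output f(x)
        grads = torch.autograd.grad(out, params)                # gradients w.r.t. every weight
        rows.append(torch.cat([g.reshape(-1) for g in grads]))  # flatten into one long vector
    J = torch.stack(rows)                                       # shape (p, n_params)
    return J @ J.t()                                            # p x p Gram matrix

# Usage (with the `net` and `X` of the stripe-model sketch above):
# Theta_init = ntk_gram(net, X)     # NTK at initialization
# ...train in the feature regime...
# Theta_end = ntk_gram(net, X)      # NTK after training: it has evolved and compressed x_perp
```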

Measure Performance: Learning Curves

Learning curves (vs. training set size) for:

  • Kernel Method on NTK at initialization
  • Training the Net in the Feature Regime
  • Kernel Method with NTK after training

(the kernel-method baselines are sketched in code below)

The kernel at the end of learning performs as well as the network itself!

Compression makes the NTK more performant!
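One way to realise the "kernel method on the NTK" baselines is a support-vector classifier with a precomputed kernel, sketched below; the choice of an SVM (rather than, say, kernel regression) and its hyperparameters are our assumptions, not details given in the talk.

```python
import numpy as np
from sklearn.svm import SVC

def kernel_test_error(K_train, y_train, K_test_train, y_test):
    """Fit a kernel method on a precomputed train Gram matrix and return its test error."""
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(K_train, y_train)             # K_train:      (p, p) kernel among train points
    preds = clf.predict(K_test_train)     # K_test_train: (p_test, p) test-vs-train kernel
    return np.mean(preds != y_test)

# Repeating this for several train-set sizes p, with the NTK at initialization and the
# NTK after feature-regime training, gives the learning curves compared above.
```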

Empirical data: MNIST

(Figure: the same diagnostics as for the Stripe Model, i.e. the projection \(\omega_{\lambda_r}^2\), the mutual information \(I(\phi_{\lambda_r}; y)\), and the learning curves.)

Similarities with the Stripe Model
→ hint that compression is key in MNIST as well

Plotting the NTK eigenvectors

 

We look at the values of the first two
NTK eigenvectors

(color corresponds to class labels)

  • random at initialization
  • classes are well separated after learning
    (a plotting sketch follows below)
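A plotting sketch for panels like these (it assumes Gram matrices such as Theta_init and Theta_end from the sketch above; the styling is ours):

```python
import matplotlib.pyplot as plt
import torch

def plot_top_eigenvectors(Theta, y, title):
    """Scatter the entries of the two leading kernel eigenvectors, coloured by class label."""
    _, evecs = torch.linalg.eigh(Theta)              # eigenvectors, ascending eigenvalues
    phi1, phi2 = evecs[:, -1], evecs[:, -2]          # the two leading eigenvectors
    plt.scatter(phi1.numpy(), phi2.numpy(), c=y.numpy(), s=8, cmap="coolwarm")
    plt.xlabel(r"$\phi_{\lambda_1}$")
    plt.ylabel(r"$\phi_{\lambda_2}$")
    plt.title(title)
    plt.show()

# plot_top_eigenvectors(Theta_init, y, "NTK at initialization")   # points look random
# plot_top_eigenvectors(Theta_end,  y, "NTK after training")      # classes separate
```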

Conclusions:

  • Introduced a simple model for compression of invariants in the feature learning regime
  • Compression shapes the evolution of the NTK and makes it more performant
  • Kernel PCA is a good diagnostic for compression
  • Similarities between the Stripe Model and MNIST support that compression is relevant in the latter as well

thank you!

[Les Houches 2020] Geometric compression of invariant manifolds in neural nets

By Leonardo Petrini


Talk for the Statistical Physics and ML Summer Workshop @ Ecole de Physique des Houches, August 2020. Video recording: https://bit.ly/3kQBAYe (from minute 12)
