Geometric compression of invariant manifolds in neural nets

Jonas Paccolat, Leonardo Petrini, Mario Geiger, Kevin Tyloo, and Matthieu Wyart

Leonardo Petrini


Currently a PhD student in the Physics of Complex Systems Lab
Advisor: Prof. Matthieu Wyart

 

Research Interests:

  • Statistical Physics
  • Neural Nets
  • Reinforcement Learning
    (very recently)

Other interests:

  • Hiking
  • Some climbing 🧗
  • Pizza 🍕
  • ...

Motivation: neural nets adapt to the data structure

CIFAR10 data point: e.g. pixels in the corner are unrelated to the class label

The success of neural nets is often attributed to (e.g. Mallat, 2016):

  • learning data features / invariants
  • compressing irrelevant directions in data space


 

  1. Yet there is no quantitative, general framework to understand these aspects
  2. Some observables to quantify this compression have been proposed
    • Mutual Information
      Shwartz-Ziv and Tishby (2017), Saxe et al. (2019)
    • Effective dimension
      Ansuini et al. (2019), Recanatesi et al. (2019)


Neural Nets Learning Regimes

starting from [Jacot et al. 2018] ...


Lazy Regime: weights and NTK are ~constant during learning → cannot learn features of the data

Feature Regime: weights and NTK evolve during learning → can possibly learn features and perform compression

 


Idea: use the evolving NTK to diagnose compression

Classification task:

Data-points:   \(\vec x = (\vec x_\parallel, \vec x_\bot)\)

 Labelling function:   \(y(\vec x) = y(\vec x_\parallel) \in \{-1, 1\}\)

Focus on a simple model for compression: the Stripe Model

Dataset:
Train set size: \(p\)

\(\vec x^\mu \sim \mathcal{N}(0, I_d)\),   for \(\mu=1,\dots,p\)

Classification task:  \(y(\vec x) = y(\vec x_\parallel) \in \{-1, 1\}\)

Instance for \(d=2\)
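To make the setup concrete, here is a minimal NumPy sketch of such a dataset. The choice of a single informative coordinate and of stripes given by the sign of \(\cos(\pi x_\parallel)\) is an illustrative assumption, not necessarily the interfaces used in the paper.

```python
import numpy as np

def stripe_dataset(p, d, seed=0):
    """Gaussian inputs x ~ N(0, I_d); the label depends only on the
    'parallel' coordinate x_par = x[:, 0], so the remaining d-1
    'perpendicular' directions are uninformative."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((p, d))
    x_par = x[:, 0]
    # Illustrative stripe labelling (interface positions are an arbitrary choice here).
    y = np.sign(np.cos(np.pi * x_par))
    return x, y

x_train, y_train = stripe_dataset(p=1000, d=2)
```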

Learning Algo:

Fully connected one-hidden layer NN:

$$f(\vec x) = \frac{1}{h} \sum_{n=1}^h \beta_n \: \sigma \left(\frac{\vec \omega_n \cdot \vec x}{\sqrt{d}} + b_n \right)$$

with \(\sigma(\cdot) = \text{ReLU}(\cdot)\).

◦ Hinge Loss  \(l(y, \hat y) = \max(0, 1- y\hat y)\)

Vanilla gradient descent on  $$F(\vec x) = \alpha \left(f(\vec x) - f_0(\vec x)\right),$$ where \(f_0\) is the network function at initialization.

Varying \(\alpha\) drives the network dynamics from the feature (small \(\alpha\)) to the lazy (large \(\alpha\)) regime [Chizat et al. 2018].
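A minimal PyTorch sketch of this architecture and training scheme, reusing `x_train`, `y_train` from the dataset sketch above; the hyper-parameters (`h`, `alpha`, `lr`, `steps`) are illustrative placeholders, not the values used in the paper.

```python
import torch

d, h, alpha, lr, steps = 2, 1000, 0.1, 1.0, 1000     # illustrative values

# Parameters of f(x) = (1/h) sum_n beta_n ReLU(w_n . x / sqrt(d) + b_n)
w = torch.randn(h, d, requires_grad=True)
b = torch.randn(h, requires_grad=True)
beta = torch.randn(h, requires_grad=True)
params = [w, b, beta]

def f(x, ps):
    w, b, beta = ps
    return torch.relu(x @ w.t() / d ** 0.5 + b) @ beta / h

x = torch.as_tensor(x_train, dtype=torch.float32)
y = torch.as_tensor(y_train, dtype=torch.float32)

with torch.no_grad():
    f0 = f(x, params)                                # network function at initialization

for _ in range(steps):
    F = alpha * (f(x, params) - f0)                  # alpha-rescaled output
    loss = torch.clamp(1 - y * F, min=0).mean()      # hinge loss
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p_, g in zip(params, grads):
            p_ -= lr * g                             # vanilla (full-batch) gradient descent
```

As stated above, small \(\alpha\) corresponds to the feature regime and large \(\alpha\) to the lazy regime; in practice the learning rate may also need to be adjusted with \(\alpha\).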

The Stripe Model: training dynamics

Compression in the feature regime: evolution of the weights \((\vec \omega_n, \beta_n)\) during training, shown for a subset of neurons.

  • The network function loses its dependence on the \(x_\bot\) direction
  • NTK evolves with weights and also loses dependence on \(x_\bot\)
    a more formal argument is given in the preprint arXiv:2007.11471
  • NTK eigenvectors are of two kinds:
                  \(\phi^1_\lambda(\vec x)= \phi_\lambda^1(\vec x_\parallel)\) and  \(\phi^2_\lambda(\vec x) = \phi_\lambda^2(\vec x_\parallel) \vec u \cdot \vec x_\bot\) 
  • Since the labelling function satisfies \(y(\vec x) = y(\vec x_\parallel)\), we expect the eigenvectors \(\phi_\lambda(\vec x)\) of the end-of-training NTK to be informative about the label
  • We also expect the projection of the label onto the top eigenvectors \(\phi_\lambda(\vec x)\) to be large
  • For kernel methods, if this projection on top eigenvectors is large, one expects good performance [Schölkopf et al. 2002]

more on this tomorrow in Matthieu's lecture

Testing predictions via Kernel PCA

  • Top kernel principal components become more informative about the output label
  • Larger projection \(\rightarrow\) we expect improved performance of the end-of-training NTK when used in a kernel method (a sketch of this diagnostic follows below)
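A minimal NumPy sketch of this kernel-PCA diagnostic, assuming a precomputed \(p \times p\) Gram matrix `K` (the NTK evaluated on the training points; see the NTK sketch later in these notes) and labels `y`. Note that NumPy returns unit-norm eigenvectors, which may differ from the paper's normalization by a constant factor.

```python
import numpy as np

def label_projections(K, y):
    """For each eigenvector phi_lambda of the Gram matrix K (ranked by
    decreasing eigenvalue), return the squared label projection
    (1/p^2) <phi_lambda | y>^2."""
    p = len(y)
    eigval, eigvec = np.linalg.eigh(K)      # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]        # rank r = 1 is the top eigenvector
    return (eigvec[:, order].T @ y) ** 2 / p ** 2

# Compare before / after training (K_init, K_final are assumed precomputed):
# proj_init, proj_final = label_projections(K_init, y), label_projections(K_final, y)
```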

Figure panels: the projection \(\omega_{\lambda_r}^2 = \frac{1}{p^2} \langle \phi_{\lambda_r}|y \rangle^2\) and the mutual information \(I(\phi_{\lambda_r}; y)\), plotted against the eigenvector rank \(r\).

Neural Tangent Kernel PCA

We look at the first two NTK eigenvectors, \(\phi_{\lambda_1}\) and \(\phi_{\lambda_2}\).

Measure Performance: Learning Curves

  • Kernel Method on NTK at initialization
  • Training the Net in the Feature Regime
  • Kernel Method with NTK after training

(as a function of the training set size \(p\))

NTK dynamics

The neural tangent kernel [Jacot et al. 2018] is defined as $$\Theta(\vec x^\mu, \vec x^\nu) = \partial_W f(\vec x^\mu) \cdot \partial_W f(\vec x^\nu)$$

where the scalar product runs over all network weights.

 

Gradient Flow: 

$$\partial_t W = \frac{1}{p} \sum_{\mu=1}^p \partial_W f(\vec x^\mu) \, y^\mu \: l'\left(f(\vec x^\mu) \, y^\mu\right),$$

 

Gradient evolution in function space:

$$\begin{aligned} \partial_t f(\vec x) &= \sum_w \partial_w f(\vec x)\, \partial_t w \\ &= \frac{1}{p} \sum_\mu \partial_W f(\vec x)\cdot \partial_W f(\vec x^\mu)\, y^\mu \: l'\left(f(\vec x^\mu)\, y^\mu\right) \\ &= \frac{1}{p} \sum_\mu \Theta(\vec x, \vec x^\mu)\, y^\mu \: l'\left(f(\vec x^\mu)\, y^\mu\right) \end{aligned}$$

In general, \(\Theta\) evolves with time.
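A minimal sketch of estimating the empirical NTK Gram matrix on the training set by stacking per-sample gradients, reusing `f`, `params` and `x` from the training sketch above; the explicit \(p \times (\text{number of parameters})\) Jacobian makes this an illustration for small \(p\) only.

```python
import torch

def empirical_ntk(f, params, x):
    """Theta[mu, nu] = grad_W f(x_mu) . grad_W f(x_nu), with the scalar
    product running over all network weights."""
    rows = []
    for mu in range(x.shape[0]):
        g = torch.autograd.grad(f(x[mu:mu + 1], params).squeeze(), params)
        rows.append(torch.cat([gi.reshape(-1) for gi in g]))
    J = torch.stack(rows)        # (p, n_params) per-sample gradients
    return J @ J.t()             # (p, p) NTK Gram matrix

# K_final = empirical_ntk(f, params, x).numpy()   # kernel after training
```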

Measure Performance: Learning Curves


The kernel at the end of learning performs as well as the network itself!

Compression makes the NTK more performant!

Empirical data: MNIST

Figure panels: the projection \(\omega_{\lambda_r}^2 = \frac{1}{p^2} \langle \phi_{\lambda_r}|y \rangle^2\), the mutual information \(I(\phi_{\lambda_r}; y)\), and the learning curves.

Similarities with the Stripe Model
→ a hint that compression is also key in MNIST

Plotting the NTK eigenvectors

 

We look at the values of the first two NTK eigenvectors
(color corresponds to class labels; a plotting sketch follows the observations below)

 

  • random at initialization
  • classes are well separated after learning
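A minimal matplotlib sketch of this visualization, assuming a Gram matrix `K` (NTK at initialization or after training) and integer class labels `labels`; it scatters each point at its coordinates on the top two kernel eigenvectors.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_top_eigenvectors(K, labels):
    """Scatter the data points in the plane of the two leading kernel
    eigenvectors, colored by class label."""
    eigval, eigvec = np.linalg.eigh(K)           # ascending eigenvalues
    phi1, phi2 = eigvec[:, -1], eigvec[:, -2]    # top two eigenvectors
    plt.scatter(phi1, phi2, c=labels, cmap="tab10", s=5)
    plt.xlabel(r"$\phi_{\lambda_1}$")
    plt.ylabel(r"$\phi_{\lambda_2}$")
    plt.show()

# plot_top_eigenvectors(K_init, labels)    # random cloud at initialization
# plot_top_eigenvectors(K_final, labels)   # classes separate after training
```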

Conclusions:

  • Introduced a simple model for compression of invariants in the feature learning regime
  • Compression shapes the evolution of the NTK and makes it more performant
  • Kernel-PCA is a good diagnostic for compression
  • Similarities between the Stripe Model and MNIST support that compression is relevant in the latter as well

Thank you!

[Les Houches 2020] Geometric compression of invariant manifolds in neural nets
By Leonardo Petrini


Talk for the Statistical Physics and ML Summer Workshop @ Ecole de Physique des Houches, August 2020. Video recording: https://bit.ly/3kQBAYe (from minute 12)
