PhD Private Defense
Breaking the Curse of Dimensionality
in Deep Neural Networks by
Learning Invariant Representations

Candidate: Leonardo Petrini

Exam jury:

Prof Hugo Dil, jury president
Prof Matthieu Wyart, thesis director
Prof Florent Krzakala, examiner
Prof Andrew Saxe, examiner
Prof Nathan Srebro, examiner

 

September 12, 2023

Learning from Labelled Examples

(supervised learning)

  • Training:







     
  • Testing:

 

Figure: labelled training images (e.g. "dog", "cat") are passed to the model + algorithm; at test time the model + algorithm must give the correct prediction for a new image.

  • Measuring performance in classification tasks:
    average number of errors on test samples (generalization).
  • Deep learning is a class of methods for supervised learning

tune the model to make correct predictions on training data

  • In principle, generalization for these tasks is very hard

Figure: the unit interval, square, and cube covered by hypercubes of side \(\epsilon\).
  • Images live in very high dim., e.g.  \(1'000\times1'000 = 10^6\) pixels;
  • Goal: learn a generic function in dimension \(d\) by covering the space with hypercubes of side length \(\epsilon\), one per datapoint;
  • Num. of datapoints \(P\) to cover the space scales as $$P \sim (1 / \epsilon)^d \quad \text{(exponential in d !!)}$$

Example for \(\epsilon = 1/3\):

\(P = 3 \text{ in } d = 1\)

\(P = 9 \text{ in } d = 2\)

\(P = 27 \text{ in } d = 3 \dots\)
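The counting in this example can be reproduced in a few lines. This is a minimal sketch: the function names are illustrative and not from the slides, and \(\epsilon = 1/n\) is parametrized by the integer number of cells per side \(n\).

```python
import math

def num_hypercubes(cells_per_side: int, d: int) -> int:
    """Number of hypercubes of side eps = 1/cells_per_side needed to tile [0, 1]^d."""
    return cells_per_side ** d

def log10_num_hypercubes(cells_per_side: int, d: int) -> float:
    """log10 of the same count, usable when the count itself is astronomically large."""
    return d * math.log10(cells_per_side)

# eps = 1/3, as in the example above
for d in (1, 2, 3):
    print(d, num_hypercubes(3, d))

# A 1000 x 1000 image has d = 10**6 pixels: report the order of magnitude of P
print(log10_num_hypercubes(3, 10**6))
```

For \(d = 10^6\) the exponent of ten is itself in the hundreds of thousands, which is the sense in which learning a generic function is infeasible.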

The Curse of Dimensionality

  • Learning generic functions in high dimension seems infeasible

The puzzling success of Deep Learning

Language, e.g. ChatGPT

Go-playing

Autonomous cars

Pose estimation

etc...

  • Successful in many high-dimensional tasks
  • Many open questions
    1. What is the structure of real data that makes them learnable?
    2. How is it exploited by deep neural networks?
    3. How much data is needed to learn a given task?

The Structure of Real Data

  • High \(d\) but low intrinsic dimensionality (ID)
    • Benchmark dataset ImageNet has \(10^7\) examples and  \(ID \approx 50\) with \(\exp(50) \gg 10^7\).
    • Hence, low ID does not explain why the curse is beaten.
  • Invariances give structure to real data:
    • Invariances = transformations of the input that leave the label unchanged;
    • Does learning representations that are invariant to these transformations help beat the curse?

Figure: data manifold with \(ID = 2\) embedded in \(d = 3\).

Pope et al. '21

Goodfellow et al. (2009); Bengio et al. (2013); Bruna and Mallat (2013); Mallat (2016)

  • Which invariances are present in real data?
  • Can neural networks learn data invariances?
    Start by introducing neural networks.

Deng et al. '09

Deep Convolutional Neural Networks

Shallow fully-connected networks (FCNs) with ReLU activations, $$f(\bm{x}) = \frac{1}{H} \sum_{h=1}^H \omega_h \sigma( \bm{\theta}_h\cdot\bm{x}) $$

 

 

 

 

 

  • If only \(\omega\)'s are learned, then this is a kernel method (Random Feature Model);
  • If both layers are trained, features \(\sigma( \bm{\theta}_h\cdot\bm{x})\) can be learned from data.
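Figure: network diagram with input \(\bm x\), first-layer weights \(\bm \theta\), second-layer weights \(\omega\), and output \(f(\bm x)\).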

$$H: \text{width}$$

$$d: \text{input dimension}$$

$${\bm \theta}: \text{1st layer weights}$$

$$\omega: \text{2nd layer weights}$$

$$\sigma: \text{activation function}$$
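As a minimal sketch of the shallow network defined above, the forward pass can be written with NumPy. The Gaussian initialization and the sizes below are assumptions chosen for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_fcn(x, theta, omega):
    """f(x) = (1/H) * sum_h omega_h * relu(theta_h . x).

    x: (n, d) inputs, theta: (H, d) first-layer weights, omega: (H,) second-layer weights.
    """
    H = theta.shape[0]
    features = relu(x @ theta.T)      # (n, H): the features sigma(theta_h . x)
    return features @ omega / H       # (n,)

rng = np.random.default_rng(0)
d, H, n = 5, 100, 3
theta = rng.standard_normal((H, d))   # assumed Gaussian initialization
omega = rng.standard_normal(H)
x = rng.standard_normal((n, d))
print(shallow_fcn(x, theta, omega).shape)
```

Training only `omega` with `theta` frozen corresponds to the random feature model; training both lets the features themselves change.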

Deep Convolutional Neural Networks

  • Locality: in many real world tasks neighboring inputs make sense together \(\rightarrow\) local neurons.
  • Depth: finally, the success of deep learning arguably stems from the use of multiple layers of neurons.
     
  • Deep CNNs are responsible for the modern deep learning revolution.
  • Does the success of deep learning lie in its ability to learn relevant features of the data?

network depth

Figure: inputs that maximally activate neurons at a given layer:

  • More and more abstract representations are built with depth;
  • More abstract = lower dimensional? Some evidence for that when measuring intrinsic dimension.

Do Deep CNNs learn relevant features?

  • How many points are needed to learn such representations?
  • How to measure the impact of feature learning on performance?
  • Many empirical works address this question.

Zeiler and Fergus (2014); Yosinski et al. (2015); Olah et al. (2017)

Ansuini et al. '19; Recanatesi et al. '19

Learning relevant data representations

  • Hypothesis: deep learning success is in its ability to exploit data structure by learning representations relevant for the task:
    • Deeper neural network layers respond to higher-level, more abstract features in a hierarchical manner;







       
    • Are more abstract representations lower dimensional?
    • Empirically: dimensionality of internal representations is reduced with depth \(\rightarrow\) lower dimensionality of the problem \(\rightarrow\) beat the curse. 

Ansuini et al. '19

Recanatesi et al. '19

adapted from Lee et al. '13

How does dimensionality reduction affect performance?

\(\mathbf x \rightarrow f_\theta(\mathbf x)\)

Lazy Regime

  • Weights are approximately constant during training
  • Equivalent to a Kernel Method with Neural Tangent Kernel (NTK)
  • Features are determined by architectural choice and are fixed during training

Feature Regime

  • Weights evolve during training
  • Can possibly learn features from data!

Neural Networks Training Regimes

 Jacot et al. (2018); Chizat et al. (2019); Bach (2017); Mei et al. (2018); Rotskoff and Vanden-Eijnden (2018); [...]

  • By comparing performance in the two regimes we can disentangle the role of feature learning and architectural choice in the success of neural networks

The same neural network architecture can be trained in two regimes, depending on initialization scale. Regimes are well characterized at infinite width.

(also NTK regime)

(also rich / hydrodynamic / mean-field / active)

How does operating in the feature or lazy regime affect neural network performance?

Each regime can be favoured by different architectures and data structures [Geiger et al. 2020b]:

Figure: test error \(\epsilon\) in the feature vs. lazy regime, two panels (one of them: MNIST, CNN).

Training regimes and Performance

 Chizat and Bach (2018);

Geiger et al. (2020b, 2021);

Novak et al. (2019);

Woodworth et al. (2020)...

algo: gradient descent

init. scale,

often power law in practice
Hestness et al. (2017); Spigler et al. (2020); Kaplan et al. (2020)

Moreover, if we look at generalization error vs. number of training points,

$$\epsilon \sim P^{-\beta}$$

(\(\epsilon\): test error, \(P\): training set size), we measure a different exponent \(\beta\) for the two regimes.
In the following, we will often use \(\beta\) to characterize performance.
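One common way to extract \(\beta\) from a learning curve is a linear fit in log-log scale. The sketch below assumes this fitting procedure (the slides do not specify one) and uses synthetic numbers with a known exponent, not measured data.

```python
import numpy as np

def fit_power_law_exponent(P, eps):
    """Least-squares fit of log(eps) = log(c) - beta * log(P); returns (beta, c)."""
    slope, intercept = np.polyfit(np.log(P), np.log(eps), deg=1)
    return -slope, np.exp(intercept)

# Synthetic learning curve with a known exponent (illustrative numbers only)
P = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
true_beta = 0.4
eps = 2.0 * P ** (-true_beta)

beta, c = fit_power_law_exponent(P, eps)
print(f"beta = {beta:.3f}, prefactor = {c:.3f}")
```

On real curves the fit would be restricted to the range of \(P\) where the power law actually holds.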

  • To provide answers we need to characterize the structure of the data that feature learning can adapt to;
  • We propose to do that in terms of data invariances.
  1. What causes this performance gap between the two regimes?
  2. More specifically, can we design simple data models to make sense of the generalization error difference between the feature and lazy regimes?
  3. Why are FCNs trained on image data often better when features are not learned?
  4. Why can deep CNNs successfully learn features in such a setting?

Questions Motivating the Thesis

We study the interplay between data invariances
and neural network architectures:

  1. Shallow Neural Networks
    • Linear Invariance
    • Non-linear Invariances (rotations, deformations)
  2. Deep Convolutional Neural Networks
    • Deformation Invariance
    • Synonymic Invariance

Outline

Irrelevant input directions

  • Simple invariance in image classification: some pixels may be irrelevant for the task.
     
  • Gives rise to Linear Invariance.
     
  • Can shallow networks learn it?

irrelevant pixels

Linear Invariance

Barron (1993); Bach (2017); Chizat and Bach (2020); Schmidt-Hieber (2020); Yehudai and Shamir (2019); Ghorbani et al. (2019, 2020); Wei et al. (2019)...

The target function is highly anisotropic, in the sense that it depends only on a linear subspace of the input space:
$$ f^*(\bm x) = g(A{\bm x}) \quad\text{where}\quad A\,\, : \mathbb{R}^d \to \mathbb{R}^{d'} \quad \text{and} \quad d' \ll d. $$

  • In [Paccolat, LP et al. 2020] we study a classification task exhibiting linear invariance.

Animation: weights evolution during training

  • We show that in the feature regime the weights align with the relevant directions, and quantify the magnitude of alignment by $$\frac{\|\bm \theta_{d'}\|}{\|\bm \theta_{d-d'}\|} \sim \sqrt{P};$$
  • We compute the scaling exponents of the generalization error, find \(\beta_\text{Feature} > \beta_\text{Lazy}\), and verify that these predictions are tight in practice.
  • Shallow networks can learn linear invariance in the feature regime;
  • However, they often do not perform well on image tasks.
    Are there other invariances they cannot exploit?
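To make the notion of linear invariance concrete, here is a minimal sketch of a target of the form \(f^*(\bm x) = g(A\bm x)\). The axis-aligned \(A\), the sign rule `g`, and the per-neuron averaging in `alignment_ratio` are assumptions made for illustration; they are not the task or the exact observable of [Paccolat, LP et al. 2020].

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_rel = 10, 1                      # d' = d_rel relevant directions out of d

# A projects onto the first d' coordinates (an assumed, axis-aligned choice)
A = np.zeros((d_rel, d))
A[np.arange(d_rel), np.arange(d_rel)] = 1.0

def g(z):
    """Hypothetical label rule on the relevant subspace: sign of the first coordinate."""
    return np.sign(z[:, 0])

def f_star(x):
    return g(x @ A.T)

x = rng.standard_normal((1000, d))

# Perturb only the irrelevant directions: the label must not change
noise = rng.standard_normal(x.shape)
noise[:, :d_rel] = 0.0
labels_unchanged = np.array_equal(f_star(x), f_star(x + 5.0 * noise))
print(labels_unchanged)

def alignment_ratio(theta, d_rel):
    """Mean over neurons of ||theta_relevant|| / ||theta_irrelevant||; theta has shape (H, d)."""
    rel = np.linalg.norm(theta[:, :d_rel], axis=1)
    irr = np.linalg.norm(theta[:, d_rel:], axis=1)
    return np.mean(rel / irr)
```

Tracking `alignment_ratio` on the first-layer weights during training is one possible way to observe the alignment described above.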

Non-linear Invariances:    Rotations

Simple model of data for a task (approximately) invariant to rotations,
e.g. \({\bm x}\) uniform on the circle and regression of \(f^*({\bm x}) = 1\):

  • Weights in the feature regime collapse to a finite number of directions, proportional to the number of data points \(\rightarrow\) overfitting.
  • More generally we show that, for functions of controlled smoothness \(\nu_t\) on the hypersphere, $$\| f^*(\bm{x})-f^*(\bm{y})\| \sim \|\bm{x}-\bm{y}\|^{\nu_t},$$ the lazy regime performs better, \(\beta_\text{Lazy} > \beta_\text{Feature}\), if the target is smooth enough.

Example: \(d=3\), large \(\nu_t\)

  • Which invariance in image tasks cannot be learned by shallow nets?

LP, Cagnetta et al. 2022

  • In real data, features are sparse in space:
    only occupy a small portion of image frame.
     
  • Consequently, frame can be deformed, and relevant features be moved, without altering the content (deformation invariance).
     
  • Hypothesis: neural networks' performance is related to their invariance to input deformations.

Bruna and Mallat '13, Mallat '16

\approx
  1. True for FCNs in feature vs lazy regimes?
  2. And for deep CNNs?

Non-linear Invariances:    Deformations

Measuring deformation invariance

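Figure: an input \(x\) and its deformation \(\tau x\) are mapped by the network to \(f(x)\) and \(f(\tau x)\).

Invariance measure: relative stability

$$R_f \propto \langle\|f(x) - f(\tau x)\|^2 \rangle_{x, \tau}$$

(normalized such that \(R_f = 1\) if there is no stability to diffeomorphisms)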

we introduced a model to generate deformations of controlled magnitude
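A rough sketch of how such a relative measure could be estimated is given below. Two assumptions are made here and are not taken from the slides: the normalization divides by the response to isotropic noise of the same norm as the deformation, and a one-pixel cyclic shift stands in for the controlled-magnitude deformation model.

```python
import numpy as np

def relative_stability(f, x, transform, rng):
    """Estimate R_f = D_f / G_f on a batch x of shape (n, d).

    D_f: mean squared change of f under `transform`.
    G_f: mean squared change of f under isotropic noise with the same norm
         as the displacement produced by `transform`.
    """
    x_t = transform(x)
    delta = x_t - x
    noise = rng.standard_normal(x.shape)
    noise *= (np.linalg.norm(delta, axis=1, keepdims=True)
              / np.linalg.norm(noise, axis=1, keepdims=True))
    D = np.mean(np.sum((f(x_t) - f(x)) ** 2, axis=1))
    G = np.mean(np.sum((f(x + noise) - f(x)) ** 2, axis=1))
    return D / G

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((500, d))
W = rng.standard_normal((d, 64))

def shift(batch):
    """Stand-in for a small deformation: cyclic shift by one pixel."""
    return np.roll(batch, 1, axis=1)

def generic(batch):
    """Random linear map with no built-in invariance."""
    return batch @ W

def pooled(batch):
    """Global average pooling, invariant to shifts by construction."""
    return batch.mean(axis=1, keepdims=True)

print(relative_stability(generic, x, shift, rng))
print(relative_stability(pooled, x, shift, rng))
```

Under these assumptions a function with no special stability gives a ratio of order one, while a shift-invariant one gives a ratio near zero.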

Rationalize suboptimal performance in image tasks

  • On images, FCNs trained by gradient descent in the feature regime are outperformed by the corresponding kernel method.



     
  • Proposed explanation:
    1. image classes vary smoothly along small input deformations;
    2. high smoothness requires a continuous distribution of neurons;
    3. it follows that sparsity is detrimental to performance.

 Geiger et al. (2020a,b); Lee et al. (2020)

Figure. Lazy predictor is smoother along diffeomorphisms for image classification.

more smooth

Fashion MNIST

Correlation between Deformation Invariance and Performance?

  • \(R_f \sim 1\) at initialization for all arch.
  • Deep CNNs learn deformation invariance

more invariant

initialization: \(R_f \sim 1\)

Recap: Invariances in Real Data

Mossel '16; Poggio et al. '17; Malach and Shalev-Shwartz '18, '20; LP et al '23;

Bruna and Mallat '13, LP et al. '21, Tomasini, LP et al. '22;

Figure: hierarchy of features for the class "dog": dog; face, paws; eyes, nose, mouth, ear; ...; edges.

  • Dimensionality of the problem can be reduced by becoming invariant to aspects of the data irrelevant for the task:
     
    • Linear invariance. Due to presence of irrelevant pixels.

       
    • Deformation invariance. The relevant features are sparse in space, their exact position does not matter.

       
    • Synonymic invariance. Related to hierarchical structure of the task (higher-level features are a composition of lower-level features). Features can have different synonyms.
       

text:

images:

irrelevant pixel

Barron '93; Bach '17; Chizat and Bach '20; Schmidt-Hieber '20; Ghorbani et al. '19, '20; Paccolat, LP et al. '20; [...]

Learning Hierarchical Tasks
with Deep Neural Networks

Figure: hierarchy of features for the class "dog": dog; face, paws; eyes, nose, mouth, ear; ...; edges.

Previous works:

  • Hierarchical tasks can be represented by neural networks that are deep;
  • Correlations between low-level features and the task are necessary for performance;
  • ...

Poggio et al. '17

Mossel '16, Malach and Shalev-Shwartz '18, '20

Open questions:

  • Can deep neural networks trained by gradient descent learn hierarchical tasks?
  • How many samples do they need?

Physicist approach:
Introduce a simplified model of data

The Random Hierarchy Model

  1. Introduce structure of target function;


     
  2. How to generate data?




     
  3. Sample complexity of neural networks?

We consider hierarchically compositional and local tasks of the form: $$f^*({\bm x}) = g_3(g_{2}(g_{1}(x_1, x_2), g_{1}(x_3, x_4)), g_{2}(g_{1}(x_5, x_6), g_{1}(x_7, x_8))).$$

The Random Hierarchy Model:     Structure

  • Inputs \(x_i\) take values in a finite vocabulary \(V\) with \(|V| = v\);
  • Constituent functions \(g_l\) map groups of \(m\) input patches into each of \(v\) outputs:
      \(g_l: U_l \subseteq V^s \to V\)    with    \(|U_l| = mv\).

Depth
\(L=3\)

locality: \(s=2\)

input size: \(d = s^L\)

Figure: \(g_l(\cdot) = g_l(\cdot)\) for the patches assigned to the same output.

Example: \(m=v=s=2\)

Starting from outputs:
can use it as generative model
 

How to choose the \(g_l\)'s?

Figure: \(g_l\) for \(m=2, v=3, s=2\), mapping \(U \subseteq V^s\) to \(V\).

The constituent functions \(g_l\) are chosen uniformly at random:

randomly assign \(m\) input patches in \(V^s\) to each of the \(v\) outputs.

The Random Hierarchy Model:     Generation

Figure: synonyms, i.e. the patches assigned to the same output, are each chosen with probability \(\tfrac{1}{2}\).

Can generate samples starting from outputs!

Simple example with \(m=v=2\)
Patches of dim. \(s=2\)

V^s
V

  • Similarly, rules for lower levels:






     
  • After specifying all the \(g_1, \dots, g_L\), new samples can be generated:
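The generative procedure described on these slides can be sketched as follows. This is an illustrative implementation under stated assumptions: the function names are hypothetical, the number of classes is taken equal to \(v\), and `rules[0]` plays the role of the top-level rule \(g_L\).

```python
import itertools
import numpy as np

def sample_rules(v, m, s, L, rng):
    """For each level, assign m distinct patches in V^s to each of the v symbols.

    rules[l][a] is an (m, s) array: the m synonymous patches that encode symbol a.
    Assumes m * v <= v**s so that patches are never shared between symbols.
    """
    all_patches = np.array(list(itertools.product(range(v), repeat=s)))
    rules = []
    for _ in range(L):
        chosen = rng.permutation(len(all_patches))[: m * v]
        rules.append(all_patches[chosen].reshape(v, m, s))
    return rules

def generate_sample(label, rules, rng):
    """Start from the label and expand each symbol into one of its m patches, level by level."""
    symbols = np.array([label])
    for rule in rules:                       # rules[0] is applied to the label
        m = rule.shape[1]
        picks = rng.integers(m, size=len(symbols))
        symbols = rule[symbols, picks].reshape(-1)
    return symbols                           # input of length s**L

rng = np.random.default_rng(0)
v, m, s, L = 3, 2, 2, 3
rules = sample_rules(v, m, s, L, rng)
x = generate_sample(label=1, rules=rules, rng=rng)
print(x, len(x))
```

Because no patch is shared between two symbols, the label can be recovered from an input by applying the rules in reverse order.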

The Random Hierarchy Model:     Generation

Figure: generating samples for \(L=2\), starting from the labels (label 1, label 2). Each label is expanded through \(g_2\) into lower-level features (intermediate representations), which are expanded through \(g_1\) into the inputs; at every step one of the synonymous patches is chosen with probability \(\tfrac{1}{2}\).

How many points do neural networks need to learn these tasks?

in practice: one-hot encoding of input features/color

Shallow neural networks are cursed

Training a shallow fully-connected neural network
with gradient descent to fit a training set of size \(P\). 

 

  • The sample complexity is proportional to the maximal training set size \(P_\text{max}\);
  • \(P_\text{max}\) grows exponentially with the input dimension \(\rightarrow\) curse of dimensionality.
Figure (\(m=n_c=v\), \(s=2\), \(L=2\)): two panels, with the original and the rescaled \(x\)-axis.

Deep neural networks beat the curse

Training a deep convolutional neural network with \(L\)
hidden layers on a training set of size \(P\).

  • The sample complexity: $$P^* \sim n_c m^L$$
  • Power law in the input dimension \(d=s^L\), since \(m^L = d^{\ln m / \ln s}\): the curse is beaten!!

Figure: two panels, with the original \(x\)-axis (\(P\)) and the rescaled \(x\)-axis (\(P/n_c m^L\)).
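The gap between the two scalings can be made concrete with a short computation. The sketch assumes that \(P_\text{max}\) is the total number of distinct inputs the model can generate, obtained by counting one choice among \(m\) per expanded symbol; this identification is an assumption and is not spelled out on the slides. The prefactor of \(n_c m^L\) is set to 1.

```python
def total_num_data(n_c: int, m: int, s: int, L: int) -> int:
    """Number of distinct inputs: one choice among m per expanded symbol.

    The label expands 1 symbol, the next level s symbols, ..., the last s**(L-1),
    giving m**(1 + s + ... + s**(L-1)) inputs per class.
    """
    num_choices = sum(s ** l for l in range(L))   # = (s**L - 1) / (s - 1)
    return n_c * m ** num_choices

def deep_sample_complexity(n_c: int, m: int, L: int) -> int:
    """n_c * m**L, the deep-network scaling with the prefactor set to 1."""
    return n_c * m ** L

n_c = m = 8
s = 2
for L in (2, 3, 4):
    d = s ** L
    print(d, total_num_data(n_c, m, s, L), deep_sample_complexity(n_c, m, L))
```

The first count grows exponentially with \(d = s^L\), the second only as a power of \(d\).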

Can we understand DNNs success?

Are they learning the structure of the data?

Synonymic Invariance indicates Performance

  • How do neural networks learn?
  • One way is to learn synonymic invariance at each layer;
  • We can measure sensitivity to exchanges of synonymous features in deep CNNs (figure: \(m=n_c=v\), \(L=3\), \(s=2\));
  • Invariance and performance are strongly correlated!
  • \(P^*\) is also the point at which invariance is learned.

How is synonymic invariance learned?

Input-output correlations in the RHM

  • Synonymous patches have the same correlations with the labels.
     
  • Correlations can be exploited to recognize synonyms and learn the invariance, starting from the input, and going up the hierarchy until the label!
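The first statement can be checked by exact enumeration in a small instance. The sketch below is a toy setup under stated assumptions: \(L=2\), \(n_c = v\), uniformly distributed labels, and hypothetical names throughout. It computes \(\text{Prob}(\text{label} | \text{input patch})\) for the first input patch and compares synonymous patches.

```python
import itertools
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
v, m, s = 3, 2, 2                     # L = 2 levels, n_c = v classes (assumed)
patches = list(itertools.product(range(v), repeat=s))

def random_rule():
    """Assign m distinct patches to each of the v symbols, uniformly at random."""
    order = rng.permutation(len(patches))[: m * v]
    return {a: [patches[i] for i in order[a * m:(a + 1) * m]] for a in range(v)}

top, bottom = random_rule(), random_rule()

# Enumerate every (label, input) pair; all are equally likely under the generative process
counts = Counter()                    # counts[(first input patch, label)]
for label in range(v):
    for mid in top[label]:            # mid = tuple of s level-1 symbols
        for x_patches in itertools.product(*(bottom[a] for a in mid)):
            counts[(x_patches[0], label)] += 1

def label_given_patch(patch):
    """Prob(label | first input patch) as a length-v vector (zeros if the patch never occurs)."""
    c = np.array([counts[(patch, label)] for label in range(v)], dtype=float)
    return c / c.sum() if c.sum() > 0 else c

# Synonyms (patches encoding the same level-1 symbol) share the same conditional distribution
for a in range(v):
    probs = [label_given_patch(p) for p in bottom[a]]
    print(a, probs[0], np.allclose(probs[0], probs[1]))
```

With finite data these conditional probabilities are only estimated, which is where the sampling noise discussed next enters.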

Input-output correlations in the RHM

  • Rules are chosen uniformly at random \(\rightarrow\) correlations exist between higher and lower-level features (example: \(m=v=3\), \(s=2\));
  • For the rightmost input patch, (red; pink) and (orange; green) have the same correlation with the output;
  • In general, correlations are the same for semantically equivalent tuples;
  • Correlations can be exploited to learn semantic invariance!
  • In RHM example: observing blue on the left \(\rightarrow\) class more likely to be "1";
  • In real data: a wheel somewhere in the image makes the class more likely to be a vehicle.

\(P^*\) points are needed to measure correlations

  • Signal. We can compute the average correlations as the variance of \(\text{Prob}(\text{label} | \text{input patch})\).
  • Sampling noise. For a finite \(P\), correlations (full dots) are estimated from their empirical counterparts (shaded dots);
  • Learning is possible starting from signal = noise, which gives: $$P_{\text{signal = noise}} \sim n_c m^L = P^*$$ = sample complexity of DNNs!!

Conclusions

Summary of Results

Main takeaway: neural networks can profitably learn data invariances in the feature regime, provided the right architecture.

  • Shallow Neural Networks:
    • Can learn linear invariances in feature regime;
    • Cannot learn certain non-linear invariances (performance deterioration in feature vs lazy);
    • Cannot learn hierarchical tasks in any regime.
       
  • The Role of Depth and Locality.
    • For hierarchical tasks exhibiting synonymic invariance: depth is necessary;
    • For real-world tasks exhibiting deformation invariance: deeper networks generally perform better, if they use local filters (deep CNNs).

 

Some open questions:

  1. Role of local filters in hierarchical model we introduced?
  2. Model for both synonymic and deformation invariance?

Some Open Questions (1/2)

  • Learning Locality from Scratch. In the Random Hierarchy Model setting, we studied shallow FCNs and deep CNNs. What is the performance of deep FCNs?
    Can they learn locality and hierarchical structure when not imposed from the start?








     
  • Yes! Neurons split into groups and specialize to different patches. 
  • Sample complexity? Current hypothesis \(P_\text{dFCN}^* = C(d) P_\text{dCNN}^* \), with \(C(d) = \text{poly}(d)\).
  • Open question: theory of learning locality and neuron specialization.

Neyshabur (2020); Pellegrini and Biroli (2021); Ingrosso and Goldt (2022)

Some Open Questions (2/2)

  • Deformation Invariance in RHM. We can extend the RHM to encompass deformation invariance by adding \(s_0\) empty / irrelevant pixels in random positions.








     
  • Relevant (colored) features can be moved without affecting the task.
  • Preliminary results show that the same number of training points is needed to learn both deformation and synonymic invariance.
  • Ongoing work: determine sample complexity depending on sparsity. 
\dots
\dots

RHM

Sparse RHM

Tomasini et al. (in preparation)

Thank you
