Beating the curse of dimensionality in deep neural networks by learning invariant representations

Leonardo Petrini

joint work with

Francesco Cagnetta, Alessandro Favero, Mario Geiger, Umberto Tomasini, Matthieu Wyart

Learning high dimensional data:

the curse of dimensionality

\(P\): training set size

\(d\): data-space dimension

  • Prerequisite for artificial intelligence: make sense of high-dimensional data
  • e.g. learn to classify images from examples 
  • Images live in high dimensions, e.g. \(1'000\times1'000 = 10^6\) pixels
  • Even measures of effective dimension give \(d_\text{eff} \sim 10^2\)
  • \(\rightarrow\) curse of dimensionality = no efficient sampling is possible (need exponential number of points).
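
A back-of-the-envelope illustration of the last point. It assumes the data fill the unit cube \([0,1]^d\) and that sampling means laying a regular grid with a fixed number of points per axis; both are simplifying assumptions for illustration, not part of the talk, and the function name is hypothetical.

```python
def grid_points_needed(d: int, points_per_axis: int) -> int:
    """Number of points in a regular grid covering [0, 1]^d.

    Illustrative assumption: sampling a d-dimensional cube at fixed
    resolution needs points_per_axis points along each of the d axes.
    """
    return points_per_axis ** d


if __name__ == "__main__":
    # The count grows exponentially with the dimension d.
    for d in (1, 2, 10, 100):
        print(f"d = {d:3d}: {grid_points_needed(d, 10)} points")
```

Even at the effective dimension \(d_\text{eff} \sim 10^2\) quoted above, ten points per axis would already require \(10^{100}\) samples.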

To be learnable, real data must be highly structured

Deep learning 

  • Learning from examples a function in high-dimensions with artificial neural networks:
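    • Training: the parameters of a deep net are optimized with gradient descent to achieve the correct output on labelled examples, e.g. image \(\rightarrow\) deep net \(\stackrel{!}{=} \text{dog}\) (or \(\text{cat}\));
    • Testing: the trained deep net gives the correct prediction on new examples, e.g. > 90% accuracy for many algorithms and tasks.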

Understanding deep learning success

  • Successful in many high-dimensional tasks
  • Many open questions
    1. What is the structure of real data that makes them learnable?
    2. How is it exploited by deep neural networks?
    3. How much data is needed to learn a given task?

Learning relevant data representations

  • Hypothesis: deep learning success is in its ability to exploit data structure by learning representations relevant for the task:
    • Deeper neural network layers respond to higher-level, more abstract features in a hierarchical manner;

    • Are more abstract representations lower dimensional?
    • Empirically: dimensionality of internal representations is reduced with depth \(\rightarrow\) lower dimensionality of the problem \(\rightarrow\) beat the curse. 

How is dimensionality reduction achieved?

\(\mathbf x \rightarrow f_\theta(\mathbf x)\)










Data invariances allow dimensionality reduction

Focus of Part I

Focus of Part II

Bruna and Mallat '13, Petrini et al. '21, Tomasini et al. '22












  • Dimensionality of the problem can be reduced by becoming invariant to aspects of the data irrelevant for the task:
    • Semantic invariance. Related to hierarchical structure of the task (higher-level features are a composition of lower-level features). Features can have different synonyms.

    • Deformation invariance. The relevant features are sparse in space; their exact position does not matter.

Intro summary

  • Learning high-d tasks \(\rightarrow\) curse of dimensionality;
  • Deep learning is successful on these tasks;
  • \(\implies\) real data is highly structured;
  • Deep networks beat the curse by exploiting this structure?
  • Evidence: DNNs learn representations that are
    • Hierarchical;
    • Low dimensional;
  • Our approach: characterize dimensionality reduction through data invariances to gain understanding and possibly predict sample complexity (i.e. how many points are needed to learn a task).

Learning Hierarchical Tasks with Deep Neural Networks:
The Random Hierarchy Model

Part I











DNNs Learning Hierarchical Tasks: open questions

Q1: Can deep neural networks trained by gradient descent learn hierarchical tasks?

Q2: How many samples do they need?

Previous works:

  • Hierarchical tasks can be represented by neural networks that are deep;
  • Generative models of hierarchical tasks are introduced; 
  • Correlations between low-level features and the task are necessary for performance;
  • ...

Physicist approach:
introduce a simplified model of data

The Random Hierarchy Model (RHM)

Propose a generative model of hierarchical data and study the sample complexity of neural networks.

1) Hierarchy:

  • any class (say dog) corresponds to a string of \(s\) features;
  • a feature (say head) corresponds to a string of \(s\) sub-features;
  • etc... Repeat \(L\) times.
  • Dimension of data: \(d = s^L\)

2) Finite vocabulary:

  • Number of distinct classes is \(n_c\)
  • Number of features is \(v\)
  • Number of sub-features is \(v\)
  • etc...

3) Degeneracy:

  • A class can be represented by \(m\) different strings of \(s\) features, out of \(v^s\) possible strings;
  • A given feature corresponds to \(m\) strings of sub-features, etc...
  • Two classes (or two features) cannot be represented by the same string: \(m \leq v^{s-1}\)
  • \(P_\text{max} \sim n_c m^{\frac{s^L-1}{s-1}}\), exponential in \(d\).

4) Random (frozen) rules:

  • The \(m\) possible strings of a class are chosen uniformly at random among the \(v^s\) possible ones;
  • Same for features, etc...

Figure: \(s=2\), \(L=3\)
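
The exponent in \(P_\text{max}\) can be checked by counting choices: one of \(m\) strings is picked for the class, then one of \(m\) for each of its \(s\) features, then for each of the \(s^2\) sub-features, and so on for \(L\) levels, i.e. \(1 + s + \dots + s^{L-1} = \frac{s^L-1}{s-1}\) independent choices in total. A minimal sketch of that count (the function name is hypothetical):

```python
def rhm_p_max(n_c: int, m: int, s: int, L: int) -> int:
    """Total number of distinct data points of a Random Hierarchy Model.

    Each of the 1 + s + ... + s^(L-1) symbols that gets expanded can be
    replaced by any of its m equivalent strings, and there are n_c classes.
    """
    n_choices = sum(s ** level for level in range(L))  # = (s^L - 1) / (s - 1)
    return n_c * m ** n_choices


if __name__ == "__main__":
    n_c = m = 8
    s = 2
    for L in (1, 2, 3):
        print(f"L = {L}, d = {s ** L}: P_max = {rhm_p_max(n_c, m, s, L)}")
```

Since \(d = s^L\), the exponent equals \((d-1)/(s-1)\), which makes the exponential dependence on \(d\) explicit.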

Sampling from the RHM

To generate data, for a given choice of \((v, n_c, m, L, s)\):

  1. Sample a set of frozen rules (the task);
  2. Sample a label;
  3. Sample one \(s\)-tuple of features out of the \(m\) possible ones corresponding to that label;
  4. For each sampled feature, repeat step 3 at the next level, until all \(L\) levels have been generated.

Figure: \(L=3\), \(s=2\)

At every level, semantically equivalent representations exist. 
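
The sampling steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: symbols are encoded as integers, the rules of each level are drawn without replacement so that no two symbols share a string, and the function names `sample_rules` and `sample_datum` are hypothetical.

```python
import itertools
import random


def sample_rules(v, n_c, m, s, L, rng):
    """Step 1: draw the frozen rules, one dict per level.

    rules[level][symbol] is the list of m s-tuples that can replace symbol.
    Level 0 expands class labels; deeper levels expand the v features.
    """
    assert m * max(v, n_c) <= v ** s, "not enough distinct strings"
    all_tuples = list(itertools.product(range(v), repeat=s))
    rules = []
    for level in range(L):
        n_symbols = n_c if level == 0 else v
        # Sampling without replacement: two symbols never share a string.
        chosen = rng.sample(all_tuples, n_symbols * m)
        rules.append({sym: chosen[sym * m:(sym + 1) * m]
                      for sym in range(n_symbols)})
    return rules


def sample_datum(rules, n_c, rng):
    """Steps 2-4: draw a label, then expand it level by level."""
    label = rng.randrange(n_c)
    symbols = [label]
    for level_rules in rules:
        expanded = []
        for sym in symbols:
            # Pick one of the m semantically equivalent strings.
            expanded.extend(rng.choice(level_rules[sym]))
        symbols = expanded
    return symbols, label


if __name__ == "__main__":
    rng = random.Random(0)
    rules = sample_rules(v=4, n_c=4, m=2, s=2, L=3, rng=rng)
    x, y = sample_datum(rules, n_c=4, rng=rng)
    print("input of length s^L =", len(x), ":", x, "label:", y)
```

Because the strings of a level are all distinct, the label can be recovered from an input by inverting the rules from the bottom level up.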

How many points do neural networks need to learn these tasks?

Simple neural networks are cursed

Training a one-hidden-layer fully-connected neural network
with gradient descent to fit a training set of size \(P\). 


  • The sample complexity is proportional to the maximal training set size \(P_\text{max}\);
  • \(P_\text{max}\) grows exponentially with the input dimension \(\rightarrow\) curse of dimensionality.
Figure: \(m=n_c=v\), \(s=2\), \(L=2\); rescaled \(x\)-axis.

Deep neural networks beat the curse

Training a deep convolutional neural network with \(L\)
hidden layers on a training set of size \(P\).

  • The sample complexity: \(P^* \sim n_c m^L\);
  • Power law in the input dimension \(d=s^L\): the curse is beaten!

Figure: rescaled \(x\)-axis.
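
To see why this is a power law, take the scaling \(P^* \sim n_c m^L\) stated in the talk's conclusions and substitute \(L = \ln d / \ln s\), which gives \(m^L = d^{\ln m / \ln s}\). A small numerical sketch of this rewriting follows; the function names are hypothetical and the parameter values are chosen only for illustration.

```python
import math


def p_star(n_c: int, m: int, L: int) -> int:
    """Sample complexity scaling quoted in the talk: P* ~ n_c * m^L."""
    return n_c * m ** L


def power_law_exponent(m: int, s: int) -> float:
    """Exponent a such that m^L = d^a when d = s^L, i.e. a = ln m / ln s."""
    return math.log(m) / math.log(s)


if __name__ == "__main__":
    n_c = m = 8
    s = 2
    a = power_law_exponent(m, s)
    for L in (1, 2, 3, 4):
        d = s ** L
        # Exponent of m in P_max grows linearly with d, not with log d.
        p_max_exponent = (d - 1) // (s - 1)
        print(f"L={L} d={d:2d}  P* = {p_star(n_c, m, L)} "
              f"= n_c * d^{a:.1f}   P_max = n_c * m^{p_max_exponent}")
```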


Can we understand DNNs' success?

Are they learning the structure of the data?

Semantic Invariance indicates Performance

  • Are the neural networks capturing the semantic invariance of the task in order to perform well?
  • We can measure sensitivity to exchanges of semantically equivalent features in deep CNNs (figure: \(m=n_c=v\), \(L=3\), \(s=2\));
  • Invariance and performance are strongly correlated!
  • \(P^*\) is also the point at which invariance is learned.

How is semantic invariance learned?

\(P^*\) points are needed to measure correlations

  • We can compute correlations (the signal) and show that one step of layer-wise gradient descent learns an invariant representation when training on all data;
  • For a finite \(P\), correlations are estimated by their empirical counterparts, which adds sampling noise;
  • Learning becomes possible once the signal is comparable to the noise, which gives:

\(P_{\text{signal = noise}} \sim n_c m^L = P^*\) = sample complexity of DNNs!

Input-label correlations predict sample complexity

From correlations to semantic invariance

  • Input-output correlations are the same for sub-features with the same meaning:
  • \(\rightarrow\) they can be exploited to learn semantic invariance!
  • For layer-wise training, we can show that gradient descent can use correlations to learn invariant representations (skipping details).

To sum up

On hierarchical tasks:

  • Simple (shallow) neural networks are cursed;
  • Deep neural networks beat the curse;
  • They do so by learning the semantic structure of the task;
  • Semantic structure can be learned by exploiting input-output correlations;
  • Correlations predict sample complexity.

Relative Stability to Diffeomorphisms indicates Performance in Deep Nets

Part II

Sparsity in space and stability to smooth deformations

Is it true or not?

Can we test it?

  • We discussed DNNs learning features' meaning in a hierarchical manner;
  • In real data, features are sparse in space: the exact feature position is irrelevant;
  • Hypothesis: DNNs become invariant to exact feature position \(\rightarrow\) effectively reduce the dimensionality of the problem \(\rightarrow\) beat the curse of dimensionality.

Bruna and Mallat '13, Mallat '16



  1. Introduce a controlled way to deform images;
  2. Introduce a measure of deformation invariance of neural network predictors;
  3. Test on real neural networks.

Max-entropy model of diffeomorphisms

  • Goal: deform images \(x \rightarrow \tau x\) in a controlled way;
  • How to sample uniformly among all diffeomorphisms \(\tau\) that have the same norm \(\|\tau\|\)?
  • This can be solved as a classical problem in physics, in which the norm plays the role of an energy that is controlled by introducing a temperature \(T\).

Figure: images \(\tau x\), increasingly deformed.
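
One way such a temperature-controlled ensemble could be realised is sketched below. The specific form is an assumption for illustration and is not stated in the slides: each component of the displacement field is a sum of sine modes \(\sin(i\pi u)\sin(j\pi v)\) with \(i^2+j^2 \leq c^2\), and the coefficients are independent Gaussians of variance \(T/(i^2+j^2)\), which is the Boltzmann distribution of an energy quadratic in the coefficients. The function names, the cutoff \(c\), and the nearest-neighbour resampling are likewise illustrative choices.

```python
import numpy as np


def sample_displacement(n, T, cutoff, rng):
    """Sample one scalar displacement field on an n x n grid over [0, 1]^2.

    Assumed form: sum_ij C_ij sin(i*pi*u) sin(j*pi*v) with
    i^2 + j^2 <= cutoff^2 and C_ij ~ Normal(0, T / (i^2 + j^2)).
    """
    u = np.linspace(0.0, 1.0, n)
    modes = np.arange(1, cutoff + 1)
    S = np.sin(np.pi * np.outer(modes, u))        # S[i-1, a] = sin(i*pi*u_a)
    i2j2 = modes[:, None] ** 2 + modes[None, :] ** 2
    keep = i2j2 <= cutoff ** 2                    # high-frequency cutoff
    C = rng.normal(size=(cutoff, cutoff)) * np.sqrt(T / i2j2) * keep
    return S.T @ C @ S                            # shape (n, n)


def deform(image, T, cutoff, rng):
    """Apply a random smooth deformation to a square 2D image.

    Pixels are resampled with nearest-neighbour lookup for simplicity.
    """
    n = image.shape[0]
    tau_u = sample_displacement(n, T, cutoff, rng)
    tau_v = sample_displacement(n, T, cutoff, rng)
    grid = np.arange(n)
    rows = np.clip(np.rint(grid[:, None] + (n - 1) * tau_u), 0, n - 1).astype(int)
    cols = np.clip(np.rint(grid[None, :] + (n - 1) * tau_v), 0, n - 1).astype(int)
    return image[rows, cols]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32))
    for T in (1e-4, 1e-3, 1e-2):
        moved = np.mean(deform(image, T, cutoff=3, rng=rng) != image)
        print(f"T = {T:g}: fraction of pixels changed = {moved:.2f}")
```

In this sketch the typical displacement scales as \(\sqrt{T}\), so raising the temperature produces more strongly deformed images.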

Measuring deformation invariance

Invariance measure: relative stability

\(R_f \propto \langle\|f(x) - f(\tau x)\|^2 \rangle_{x, \tau}\)

(normalized such that \(R_f = 1\) if \(f\) has no stability to diffeomorphisms)
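
A minimal sketch of how such a measure could be estimated numerically. The slides only state that \(R_f\) is normalized to 1 in the absence of diffeomorphism stability; the normalization used here, dividing by the response of \(f\) to isotropic Gaussian noise of the same norm as the deformation, is an assumption chosen to satisfy that property. The function name and the `deform` callable are hypothetical.

```python
import numpy as np


def relative_stability(f, images, deform, rng):
    """Estimate R_f = <|f(x) - f(tau x)|^2> / <|f(x) - f(x + eta)|^2>.

    eta is Gaussian noise rescaled to the norm of (tau x - x), so a
    predictor that treats deformations like generic noise gets R_f near 1.
    """
    numerator, denominator = 0.0, 0.0
    for x in images:
        x_def = deform(x)
        delta_norm = np.linalg.norm(x_def - x)
        noise = rng.normal(size=x.shape)
        noise *= delta_norm / np.linalg.norm(noise)   # match the norms
        numerator += np.sum((f(x) - f(x_def)) ** 2)
        denominator += np.sum((f(x) - f(x + noise)) ** 2)
    return numerator / denominator


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = [rng.integers(0, 10, size=(8, 8)).astype(float) for _ in range(50)]

    def shift(x):
        # A one-pixel cyclic shift stands in for a small deformation.
        return np.roll(x, 1, axis=1)

    def identity(x):
        return x

    def total_intensity(x):
        # Unchanged by any rearrangement of pixels.
        return np.array([x.sum()])

    print("identity map:   ", relative_stability(identity, images, shift, rng))
    print("total intensity:", relative_stability(total_intensity, images, shift, rng))
```

The identity map has no preference between deformations and noise, so its ratio is 1, while the pixel-sum predictor is fully invariant to the shift and its ratio vanishes.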

Good deep nets learn to become invariant


  1. At initialization (shaded bars) \(R_f \approx 1\): SOTA nets show no stability to diffeomorphisms.
  2. After training (full bars) \(R_f\) is reduced by one to two orders of magnitude, consistently across datasets and SOTA architectures.

  3. By contrast, (2.) does not hold for fully connected networks and simple CNNs, for which \(R_f \sim 1\) before and after training.

Deep nets learn to become stable to diffeomorphisms!

Relationship with performance?

\(R_f \propto \langle\|f(x) - f(\tau x)\|^2 \rangle_{x, \tau}\)

Relative stability to diffeomorphisms remarkably correlates with performance!


Conclusions and Perspectives

  • Understanding deep learning performance boils down to understanding data invariances and how deep nets learn them;
  • In the Random Hierarchy Model, we understand the relationship between the emergence of invariant representations and performance;
  • We predict sample complexity in terms of simple parameters: \(P^* \sim n_cm^L\). This gives a rule of thumb for the sample complexity of real data;
  • Many future directions: e.g. use the model to study language capabilities of transformer architectures;
  • For images, deformation stability seems necessary for performance;
  • By which mechanisms is this stability achieved in deep nets?
  • Combine the two invariances by adding a notion of space in the RHM?

Thank you!


We aim to answer the fundamental question of how deep learning algorithms achieve remarkable success in processing high-dimensional data, such as images and text, while overcoming the curse of dimensionality. This curse makes it hard to sample data efficiently: in generic settings the sample complexity, i.e. the number of data points needed to learn a task, scales exponentially with the space dimension. Our investigation centers on the idea that, to be learnable, real-world data must be highly structured. We explore two aspects of this structure: (i) the hierarchical nature of data, where higher-level features are compositions of lower-level features (a face is made up of eyes, nose, mouth, etc.), and (ii) the irrelevance of the exact spatial location of such features. Accordingly, we investigate the hypothesis that deep learning is successful because it constructs useful hierarchical representations of data that exploit its structure (i) while being insensitive to aspects irrelevant for the task at hand (ii).

