PhD Public Defense
Breaking the Curse of Dimensionality
in Deep Neural Networks by
Learning Invariant Representations

Candidate: Leonardo Petrini

Thesis Advisor: Prof. Matthieu Wyart

Physics of Complex Systems Lab

November 17, 2023

Making Predictions from Labelled Data

(supervised learning)

Example: predict person height based on age.

age

height

Making Predictions from Data

(supervised learning)

Example: predict person height based on age.

Tune prediction to fit the data (training);
Use the prediction on new test data;

age

height

train point

prediction

test point

true value

predicted value

Supervised learning aims for accurate predictions on test data

Making Predictions from Data

(supervised learning)

Example: predict person height based on age.

Tune prediction to fit the data (training);
Use the prediction on new test data;

age

height

train point

prediction

test point

predicted value

true value

predicted value

Image Classification

Modern supervised learning can handle more complex tasks as recognizing images.

model

"cat"

"dog"

What's in the black box?

model

Opening the Black Box

Solving supervised learning tasks with deep neural networks.

depth

neurons

input

output

model

CAT

DOG

Training Neural Networks

CAT

DOG

Training Neural Networks

CAT

DOG

Training Neural Networks

CAT

DOG

Training Neural Networks

CAT

DOG

Training Neural Networks

repeat for thousands of images

Training Neural Networks

Testing on unseen images

CAT

DOG

It learned the task
Performance: fraction of correct predictions on test images, can be close to 100%!!

new image

Surprising as this is a complicated task for computers to solve!

An Image is a Table of Numbers

An Image is a Table of Numbers

pixel

The computer just sees a table of numbers, task looks harder now!
How do we compare these tables?

0 100

One number can be represented on a line

(one dimension)

still a dog?

\bm{\rightarrow}

Images as $d-$dimensional vectors

Represent images as points in space:

The computer just sees a table of numbers, task looks harder now!
How do we compare these tables?

Images as $d-$dimensional vectors

Represent images as points in space:

0 100

Two numbers on a square

(two dimensions)

(81, 81)

100

Images can be made of a million pixels.
How do we represent one million dim space?

The computer just sees a table of numbers, task looks harder now!
How do we compare these tables?

Images as $d-$dimensional vectors

0 100

Three numbers on a cube

(three dimensions)

(81, 81, 74)

100

Represent images as points in space:

We cannot represent high-dimensional spaces;
Still, we can study some of their properties.

The Curse of Dimensionality

as the dimensionality increases, images are further and further apart

When testing generalization, the new image could be very far away from all the ones in the training set!
For each new test point to be at distance $\epsilon$ from a training point we need a number of training points $P$ exponential in the dimension!!

P \propto (1/\epsilon)^d

To be learnable, real data must be highly structured!
How to characterize this structure?

Understanding a Visual Scene

We want to identify properties that give structure to the space of images
How do we humans solve an image classification task?

Step 1: Locate the object

usually pixels at the border do not matter for recognizing the object

Step 2: hierarchically recognize edges $\rightarrow$ parts $\rightarrow$ full object

lines $\rightarrow$ textures $\rightarrow$ paws, eyes etc. $\rightarrow$ head etc. $\rightarrow$ dog

Understanding a Visual Scene

We want to identify properties that give structure to the space of images
How do we humans solve an image classification task?

Understanding a Visual Scene

How do we humans solve an image classification task?
New image, and now?

Still a dog, though many things changed

it moved
legs etc. in different
relative positions
ears realized differently

Bottomline: many irrelevant details for solving the task

Invariances give structure to real data

Irrelevant details are variations in the input that do not change the label: we call them invariances

pixels here can be of different colors without affecting the class

position does not affect the class

Bottomline: many irrelevant details for solving the task

Do neural networks understand these invariances?
How to test it?

"Breaking the Curse of Dimensionality in Deep Neural Networks

by Learning Invariant Representations"

Thesis Title and Outline

Irrelevant pixels $\rightarrow$ linear invariance;
Irrelevance of exact feature positions $\rightarrow$ deformation invariance;
Hierarchical structure $\rightarrow$ synonymic invariance;

Linear Invariance

This first invariance can be modeled as: $$\bm{x} \sim \mathcal{N}(0, 1)$$ $$ f^*(\bm x) = g(x_\parallel) $$
In our work, we show that simple neural networks are able to learn this invariance

label:

input:

Weights align in the relevant direction
We can predict how many datapoints are needed to solve this kind of tasks.

Neural networks perform well as they can learn linear invariance

In real data, features only occupy a small
portion of the image frame.
Consequently, the frame can be deformed, and
relevant features be moved, without altering the content (deformation invariance).
Hypothesis: neural networks performance related to how good they are at understanding this property.

Bruna and Mallat '13, Mallat '16

\approx

Deformation Invariance

Can we test this hypothesis?

Measuring deformation invariance

f(x)

\tau x

\tau

f(\tau x)

R_f \propto \langle\|f(x) - f(\tau x)\|^2 \rangle_{x, \tau}

Invariance measure: relative stability

(normalized such that is =1 if no diffeo stability)

we introduced a model to generate deformations of controlled magnitude

Correlation between Deformation Invariance and Performance?

$R_f \sim 1$ at initialization for all arch.

Modern neural networks learn deformation invariance

more invariant

initialization: $R_f \sim 1$

more performant

Suggest that understanding deformation invariance is crucial for solving image classification

Learning Hierarchical Tasks with Deep Neural Networks

...

dog

face

paws

eyes

nose

mouth

ear

edges

Hard to formally characterize the hierarchical structure in real tasks like images or text:

Physicist approach:
Introduce a simplified model of data

Simple Model of Hierarchical Data

classes:

high-level
features:

etc...

low-level
features:

etc...

\tfrac{1}{2}

synonyms

Invariance of this model is with respect to exchanges of synonyms

Simple Model of Hierarchical Data

\tfrac{1}{2}

Having defined the rules we can sample datapoints

Crucial to make the task non-trivial: low-level features are shared
Crucial to make the task solvable: there exist correlations between low-level features and labels (gain info about class by observing low level features)

start from class:

intermediate
representations

inputs

Results on Hierarchical Data

Deep neural networks are able to solve hierarchical tasks without incurring in the curse of dimensionality (sample complexity is $\text{poly} (d)$)

They do that by learning the hierarchical structure of the task: recognize synonyms
They recognize synonyms by exploiting input-output correlations
This allows us to predict the number of training points
needed to learn depending on the hierarchical structure
(num classes, num levels, num features etc...)