# Feature vs Lazy Learning

Which regime is more suitable for which situation, and why?

• Feature Learning (FL) -> large weight changes
• Lazy Learning (LL) -> small weight changes

## Data Symmetries

Fully Connected network:

f(x) = \frac{1}{h} \sum_{i = 1\dots h} \beta_i\: \max \left(0, \frac{1}{\sqrt{d}}\omega_i \cdot x + b_i\right)
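A minimal NumPy sketch of this one-hidden-layer ReLU network; the standard Gaussian initialization of ω_i, b_i, β_i is an illustrative assumption, not specified above:

```python
import numpy as np

def fc_net(x, w, beta, b):
    """f(x) = (1/h) sum_i beta_i * max(0, w_i . x / sqrt(d) + b_i)."""
    d = x.shape[-1]
    pre = w @ x / np.sqrt(d) + b                 # (h,) pre-activations
    return (beta * np.maximum(0.0, pre)).mean()  # (1/h) sum over hidden units

rng = np.random.default_rng(0)
d, h = 2, 100
w = rng.standard_normal((h, d))   # hidden weights omega_i
b = rng.standard_normal(h)        # biases b_i
beta = rng.standard_normal(h)     # readout weights beta_i
x = rng.standard_normal(d)
print(fc_net(x, w, beta, b))
```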

Stripe Model

Sphere Model

Goal: show that FL can benefit from choosing the proper architecture.

Fully Connected network:

f(x) = \frac{1}{h} \sum_{i = 1\dots h} \beta_i\: \max \left(0, \frac{1}{\sqrt{d}}\omega_i \cdot x + b_i\right)

Convolutional Network:

f(x) = \frac{1}{h} \sum_{i = 1\dots h} \frac{\beta_i}{d}\: \sum_\delta\: \max\left(0, \frac{1}{\sqrt{d}}\: t_\delta[\omega_i] \cdot x + b_i\right)
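A sketch of the convolutional readout, taking t_δ to be a cyclic shift of the filter by δ (the shift convention is an assumption):

```python
import numpy as np

def cnn_1d(x, w, beta, b):
    """f(x) = (1/h) sum_i (beta_i/d) sum_delta max(0, t_delta[w_i] . x / sqrt(d) + b_i),
    with t_delta a cyclic shift of the filter by delta."""
    d = x.shape[-1]
    # all d cyclic translations of each filter: shape (h, d, d)
    shifts = np.stack([np.roll(w, s, axis=-1) for s in range(d)], axis=1)
    pre = shifts @ x / np.sqrt(d) + b[:, None]        # (h, d)
    return (beta * np.maximum(0.0, pre).mean(axis=1)).mean()
```

Averaging over all d shifts makes f exactly invariant under cyclic translations of x.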

## Translation-Invariant Datasets

x(r) = a\cos r + b\sin r, \quad r \in [0, 2\pi]
a,b \sim \mathcal{N}(0,1)
y = \text{sign}(\sqrt{a^2 + b^2} - C_0)
• 2D problem for the FC
• 1D problem for the CNN
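The construction above can be sampled as follows; the grid size d and the threshold C_0 are free parameters (the values here are placeholders):

```python
import numpy as np

def make_dataset(n, d, C0=1.0, seed=0):
    """Signals x(r) = a cos r + b sin r on a d-point grid over [0, 2*pi),
    labeled by whether the amplitude sqrt(a^2 + b^2) exceeds C0."""
    rng = np.random.default_rng(seed)
    a, b = rng.standard_normal((2, n))                    # a, b ~ N(0, 1)
    r = np.linspace(0, 2 * np.pi, d, endpoint=False)
    x = a[:, None] * np.cos(r) + b[:, None] * np.sin(r)   # (n, d)
    y = np.sign(np.sqrt(a**2 + b**2) - C0)
    return x, y
```

Translating r only rotates (a, b), leaving the amplitude and hence the label unchanged; this is the invariance the CNN can exploit.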

Issues:

1. For the 2D sphere, we know FL > LL.
2. In the 1D stripe there are no other dimensions to compress, so weight orientation does not matter.

Figure: learning curves, 1 Fourier component.

## Translation-Invariant Dataset

x(r) = a\cos r + b\sin r + \mathcal{N}(0,\sigma^2), \qquad\sigma = 0.1
y = \text{sign}(\sqrt{a^2 + b^2} - C_0)
• Same as before + Gaussian Noise in Real Space
• In Fourier Space, this is equivalent to adding non-informative dimensions
• For the FC net this is a short cylinder
• For the CNN this is a stripe where non-informative directions have small variance
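The Fourier-space claim can be checked directly: the clean signal occupies only the k = 1 frequency, while the noise spreads small power over all frequencies (grid size and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 64, 0.1
r = np.linspace(0, 2 * np.pi, d, endpoint=False)
a, b = rng.standard_normal(2)
clean = a * np.cos(r) + b * np.sin(r)
noisy = clean + sigma * rng.standard_normal(d)

F_clean = np.fft.rfft(clean) / d
F_noisy = np.fft.rfft(noisy) / d
print(np.abs(F_clean).round(3))  # power only at frequency k = 1
print(np.abs(F_noisy).round(3))  # small extra power at every frequency
```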

Figure: learning curves, 1 Fourier component + noise.

## Translation-Invariant Dataset

x(r_1, r_2) = \sum_{i \in \{1,2\}} \left(a_i\cos r_i + b_i\sin r_i\right)
y = \text{sign}\left(\sum_i \sqrt{a_i^2 + b_i^2} - C\right)
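A sampling sketch for this two-component version; the per-dimension grid size d and threshold C are placeholders:

```python
import numpy as np

def make_dataset_2d(n, d, C=2.0, seed=0):
    """x(r1, r2) = sum_i (a_i cos r_i + b_i sin r_i) on a d x d grid,
    labeled by sign(sum_i sqrt(a_i^2 + b_i^2) - C)."""
    rng = np.random.default_rng(seed)
    a, b = rng.standard_normal((2, n, 2))                       # (n, 2) each
    r = np.linspace(0, 2 * np.pi, d, endpoint=False)
    x1 = a[:, 0, None] * np.cos(r) + b[:, 0, None] * np.sin(r)  # varies with r1
    x2 = a[:, 1, None] * np.cos(r) + b[:, 1, None] * np.sin(r)  # varies with r2
    x = x1[:, :, None] + x2[:, None, :]                         # (n, d, d) images
    y = np.sign(np.sqrt(a**2 + b**2).sum(axis=-1) - C)
    return x, y
```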

2D image, 1 Fourier component per dimension.

2D CNN:

f(x) = \frac{1}{h} \sum_{i = 1\dots h} \frac{\beta_i}{d^2}\: \sum_{(\delta_1, \delta_2)}\: \max\left(0, \frac{1}{d}\: t_{(\delta_1, \delta_2)}[\omega_i] \cdot x + b_i\right)
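A direct (unoptimized) sketch of the 2D translation average, again taking t_(δ1, δ2) to be a cyclic 2D shift; note the 1/d inner normalization, since the input now has d² pixels:

```python
import numpy as np

def cnn_2d(x, w, beta, b):
    """f(x) = (1/h) sum_i (beta_i/d^2) sum_{(d1,d2)} max(0, t_{(d1,d2)}[w_i] . x / d + b_i),
    where x and each filter w_i are (d, d) images."""
    d = x.shape[-1]
    pre = []
    for d1 in range(d):
        for d2 in range(d):
            shifted = np.roll(w, (d1, d2), axis=(-2, -1))   # t_{(d1,d2)}[w_i]
            pre.append((shifted * x).sum(axis=(-2, -1)) / d + b)
    pre = np.stack(pre, axis=-1)                            # (h, d*d)
    return (beta * np.maximum(0.0, pre).mean(axis=-1)).mean()
```

As in 1D, summing over all d² shifts makes f exactly invariant under cyclic translations along both image axes.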

## Translation-Invariant Dataset

2D Image

Motivation: a 1D CNN reduces the dimensionality of the problem by one (translations in 1D); a 2D CNN reduces it by two.

• FC -> 4D problem
• CNN -> 2D problem

#### FL vs LL - Network Symmetries

By Leonardo Petrini
