## Feature vs. Lazy Training

Stefano Spigler

Matthieu Wyart

Mario Geiger

Arthur Jacot

https://arxiv.org/abs/1906.08034

Two different regimes in the dynamics of neural networks $$f(w,x)$$:

• feature training: the network learns features
• lazy training: no features are learned; the network stays close to its linearization

$$f(w) \approx f(w_0) + \nabla f(w_0) \cdot dw$$

In this talk: a characteristic training time separating the two regimes, and a training procedure to force each regime.

How does the network perform in the infinite width limit $$n\to\infty$$?

Overparametrized networks ($$n$$ neurons per layer)

• perform well
• are theoretically tractable

There exist two limits in the literature.

The two limits can be understood in the context of the central limit theorem, for i.i.d. variables $$X_i$$:

Central Limit Theorem:

$$\displaystyle Y = \frac{1}{\color{red}\sqrt{n}} \sum_{i=1}^n X_i$$

As $${\color{red}n} \to \infty, \quad Y \longrightarrow$$ Gaussian

$$\langle Y \rangle={\color{red}\sqrt{n}}\langle X_i \rangle$$

$$\langle Y^2\rangle - \langle Y\rangle^2 = \langle X_i^2 \rangle - \langle X_i\rangle^2$$

Law of large numbers:

$$\displaystyle Y = \frac{1}{\color{red}n} \sum_{i=1}^n X_i$$

$$\langle Y \rangle=\langle X_i \rangle$$

$$\langle Y^2\rangle - \langle Y\rangle^2 = {\color{red}\frac{1}{n}} (\langle X_i^2 \rangle - \langle X_i\rangle^2)$$
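A quick numerical illustration of the two normalizations (a minimal sketch of my own, not from the slides; sizes and names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000          # number of summed terms ("neurons")
trials = 2_000      # independent realizations of Y

X = rng.normal(loc=0.3, scale=1.0, size=(trials, n))   # i.i.d. X_i with mean 0.3

Y_clt = X.sum(axis=1) / np.sqrt(n)   # 1/sqrt(n) normalization (CLT)
Y_lln = X.sum(axis=1) / n            # 1/n normalization (law of large numbers)

print("1/sqrt(n): mean ~ sqrt(n)*<X> =", Y_clt.mean(), " var ~ O(1) =", Y_clt.var())
print("1/n      : mean ~ <X>         =", Y_lln.mean(), " var ~ 1/n  =", Y_lln.var())
```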

regime 1: kernel limit

Network $$f(w,x)$$, with $$w$$ = all the weights, and hidden activations

$$\displaystyle z^{\ell+1}_i = {\frac1{\color{red}\sqrt{n}}} \sum_{j=1}^n W^\ell_{ij} \ \phi(z^\ell_j)$$

At initialization

$$W_{ij} \longleftarrow \mathcal{N}(0, 1)$$

With this scaling, small correlations add up significantly => the weights will change only a little during training

Independent terms in the sum, CLT => when $${\color{red}n} \to\infty$$ the output is a Gaussian process in the space of functions $$\mathbb{R}^d \to \mathbb{R}$$ (dimension $$\infty$$) [1995 R. M. Neal]
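A minimal numerical sketch (my own, with made-up sizes, not the authors' code) of this $$1/\sqrt{n}$$-scaled forward pass: for a fixed input, resampling the $$\mathcal{N}(0,1)$$ weights gives an output that is approximately Gaussian with $$\mathcal{O}(1)$$ variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_ntk(x, W_in, W_hidden, w_out, phi=np.tanh):
    """1/sqrt(n)-scaled forward pass with N(0,1) weights."""
    z = W_in @ x / np.sqrt(x.shape[0])                    # z^1
    for W in W_hidden:
        z = W @ phi(z) / np.sqrt(z.shape[0])              # z^{l+1} = (1/sqrt(n)) W phi(z^l)
    return float(w_out @ phi(z) / np.sqrt(z.shape[0]))    # scalar output f(w, x)

d, n = 10, 512                                            # input dim, hidden width (made up)
x = rng.normal(size=d)

outs = [forward_ntk(x,
                    rng.normal(size=(n, d)),              # W^0
                    [rng.normal(size=(n, n))],             # W^1
                    rng.normal(size=n))                    # output weights
        for _ in range(2000)]

# As n grows, f(w0, x) over random initializations approaches a Gaussian with O(1) variance.
print("mean:", np.mean(outs), "var:", np.var(outs))
```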

regime 1: kernel limit

[Picture: the network manifold $$f(w)$$ (dimension $$N$$) lives in the space of functions $$\mathbb{R}^d \to \mathbb{R}$$ (dimension $$\infty$$); around $$f(w_0)$$, the tangent space $$f(w_0) + \nabla f(w_0) \cdot dw$$ (dimension $$N$$) contains the kernel space $$f(w_0) + \sum_\mu c_\mu \Theta(w_0, x_\mu)$$ (dimension $$m$$)]

$$d$$ input dimension, $$N$$ number of parameters, $$m$$ size of the trainset

[2018 Jacot et al.] [2018 Du et al.] [2019 Lee et al.]

regime 1: kernel limit

neural tangent kernel:

$$\Theta(w,x_1,x_2) = \nabla_w f(w, x_1) \cdot \nabla_w f(w, x_2)$$

for $${\color{red}n}\to\infty$$ the NTK is independent of the initialization
and is constant through learning
=> the network behaves like a kernel method

The weights barely change:
$$\|dw\|\sim \mathcal{O}(1)$$ and $$dW_{ij} \sim \mathcal{O}({\color{red}1/n})$$   ($$\mathcal{O}({\color{red}1/\sqrt{n}})$$ at the extremity)

The internal activations barely change:
$$dz \sim \mathcal{O}({\color{red}1/\sqrt{n}})$$
=> no feature training
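A sketch of how the empirical NTK can be computed (assuming PyTorch; this is not the code from the linked repository, and the model sizes are made up). The same Gram matrix evaluated at initialization and after training gives the relative kernel change $$\frac{\|\Theta - \Theta_0\|}{\|\Theta_0\|}$$ used later.

```python
import torch

def empirical_ntk(model, xs):
    """Gram matrix Theta[a, b] = grad_w f(w, x_a) . grad_w f(w, x_b)."""
    grads = []
    for x in xs:
        model.zero_grad()
        model(x.unsqueeze(0)).squeeze().backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    G = torch.stack(grads)            # shape (m, N): one flattened gradient per input
    return G @ G.T                    # shape (m, m)

# Example with a small fully connected net (sizes are made up).
torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(10, 256), torch.nn.Softplus(),
    torch.nn.Linear(256, 1),
)
xs = torch.randn(20, 10)
theta0 = empirical_ntk(model, xs)

# After training the same model, the relative kernel change is
#   (empirical_ntk(model, xs) - theta0).norm() / theta0.norm()
# small in the lazy regime, O(1) in the feature regime.
```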

regime 2: mean field limit

$$\displaystyle f(w,x) = \frac{1}{\color{red}n} \sum_{i=1}^n W_i \; \phi \! \left(\frac{1}{\color{red}\sqrt{n}} \sum_{j=1}^n W_{ij} x_j \right)$$

the output prefactor was $$\frac{1}{\color{red}\sqrt{n}}$$ in the kernel limit

studied theoretically for 1 hidden layer

### Another limit!

for $${\color{red}n}\longrightarrow \infty$$

$$\displaystyle f(w,x) = \frac{1}{\color{red}n} \sum_{i=1}^n W_i \; \phi \! \left(\frac{1}{\color{red}\sqrt{n}} \sum_{j=1}^n W_{ij} x_j \right)$$

$$\frac{1}{\color{red}n}$$ instead of $$\frac{1}{\color{red}\sqrt{n}}$$ implies

• no output fluctuations at initialization

• we can replace the sum by an integral

$$\displaystyle f(w,x) \approx f(\rho,x) = \int d\rho(W,\vec W) \; W \; \phi \! \left(\frac{1}{\sqrt{n}} \sum_{j=1}^n W_j x_j \right)$$

where $$\rho$$ is the density of the neurons' weights

regime 2: mean field limit

$$\displaystyle f(\rho,x) = \int d\rho(W,\vec W) \; W \; \phi \! \left(\frac{1}{\sqrt{n}} \sum_{j=1}^n W_j x_j \right)$$

$$\rho$$ follows a differential equation

[2018 S. Mei et al], [2018 Rotskoff and Vanden-Eijnden], [2018 Chizat and Bach]

In this limit, the internal activations do change
(as in the law of large numbers above, $$\langle Y \rangle=\langle X_i \rangle$$) => feature training

regime 2: mean field limit
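A quick numerical check (a sketch with made-up sizes, not from the slides) that the $$\frac{1}{n}$$ output prefactor removes the output fluctuations at initialization, while the $$\frac{1}{\sqrt{n}}$$ prefactor keeps them of order 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 10, 4096, 1000
x = rng.normal(size=d)

f_kernel, f_meanfield = [], []
for _ in range(trials):
    W_in = rng.normal(size=(n, d))           # W_{ij}
    w_out = rng.normal(size=n)               # W_i
    h = np.tanh(W_in @ x / np.sqrt(d))       # phi of the normalized preactivation
    f_kernel.append(w_out @ h / np.sqrt(n))      # 1/sqrt(n): kernel scaling
    f_meanfield.append(w_out @ h / n)            # 1/n: mean field scaling

print("1/sqrt(n) output std at init:", np.std(f_kernel))      # O(1)
print("1/n       output std at init:", np.std(f_meanfield))   # O(1/sqrt(n)) -> 0
```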

What is the difference between the two limits?

• which limit describes finite-$$n$$ networks better?
• are there corresponding regimes for finite $$n$$?

kernel limit: output prefactor $$\frac{1}{\color{red}\sqrt{n}}$$, internal activations $$z^\ell$$ frozen
mean field limit: output prefactor $$\frac{1}{\color{red}n}$$, internal activations $$z^\ell$$ change

$$\displaystyle{\color{red}\alpha} f(w,x) = \frac{{\color{red}\alpha}}{\sqrt{n}} \sum_{i=1}^n W_i \ \phi(z_i)$$

[2019 Chizat and Bach]

• if $$\alpha$$ is a fixed constant and $$n\to\infty$$ => kernel limit
• if $${\color{red}\alpha} \sim \frac{1}{\sqrt{n}}$$ and $$n\to\infty$$ => mean field limit

we use the scaling factor $$\alpha$$ to investigate the transition between the two regimes

$$\alpha \cdot ( f(w,x) - f(w_0, x) )$$

linearize the network with $$f - f_0$$: we would like that, for any finite $$n$$, the network behaves linearly in the limit $$\alpha \to \infty$$

loss function:

$$\displaystyle \mathcal{L}(w) = \frac{1}{\alpha^2 |\mathcal{D}|} \sum_{(x,y)\in\ \mathcal{D}} \ell\left( {\color{red}\alpha (f(w,x) - f(w_0,x))}, y \right)$$

the $$\alpha^2$$ prefactor makes the dynamics converge in a time that does not scale with $$\alpha$$ in the limit $$\alpha \to \infty$$
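A minimal sketch of this objective (assuming PyTorch; the quadratic hinge $$\ell$$ and all sizes are stand-ins I chose, not necessarily the paper's choices). `model0` is a frozen copy of the network at initialization:

```python
import copy
import torch

def alpha_scaled_loss(model, model0, x, y, alpha):
    """L = (1/(alpha^2 |D|)) sum ell( alpha * (f(w,x) - f(w0,x)), y )."""
    out = alpha * (model(x) - model0(x)).squeeze(-1)
    ell = torch.relu(1.0 - out * y) ** 2          # stand-in loss ell(., y); y in {-1, +1}
    return ell.mean() / alpha**2

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 256), torch.nn.Softplus(),
                            torch.nn.Linear(256, 1))
model0 = copy.deepcopy(model)                     # frozen snapshot of the initialization
for p in model0.parameters():
    p.requires_grad_(False)

x = torch.randn(32, 10)
y = torch.randint(0, 2, (32,)).float() * 2 - 1    # labels in {-1, +1}
loss = alpha_scaled_loss(model, model0, x, y, alpha=100.0)
loss.backward()
```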

Architecture: fully connected, input $$x_i$$, hidden activations $$z^1_i, z^2_i, z^3_i$$, weights $$W^0_{ij}, W^1_{ij}, W^2_{ij}, W^3_i$$

$$\displaystyle z^{\ell+1}_i = {\frac1{\color{red}\sqrt{n}}} \sum_{j=1}^n W^\ell_{ij} \ \phi(z^\ell_j)$$

$$\displaystyle f(w,x) = \frac{1}{{\color{red}\sqrt{n}}} \sum_{i=1}^n W^3_i \ \phi(z^3_i)$$

$$W_{ij}(t=0) \longleftarrow \mathcal{N}(0, 1)$$

continuous dynamics

$$\dot w = -\nabla_w \mathcal{L}(w)$$

Implemented with a dynamical adaptation of the time step $$dt$$ such that

$$10^{-4} < \frac{\|\nabla \mathcal{L}(t_{i+1})- \nabla \mathcal{L}(t_i)\|^2}{\|\nabla \mathcal{L}(t_{i+1})\|\cdot\|\nabla \mathcal{L}(t_i)\|} < 10^{-2}$$

(works well only with full batch and a smooth loss)

momentum dynamics

$$\dot v = -\frac{1}{\tau}(v + \nabla \mathcal{L})$$

$$\dot w = v$$
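A sketch of one way to implement this scheme (my own reconstruction from the two equations and the step-size criterion; the actual implementation lives in the linked repository): Euler steps of the momentum ODE, with $$dt$$ enlarged or reduced so that the relative change of the gradient between consecutive steps stays inside $$[10^{-4}, 10^{-2}]$$.

```python
import torch

def flat(gs):
    return torch.cat([g.flatten() for g in gs])

def momentum_flow(params, loss_fn, tau=1.0, dt=1e-4, steps=2000):
    """Euler integration of  dv/dt = -(v + grad L)/tau,  dw/dt = v,
    with dt adapted from the relative change of grad L between steps."""
    v = [torch.zeros_like(p) for p in params]
    g = torch.autograd.grad(loss_fn(), params)
    for _ in range(steps):
        with torch.no_grad():
            for p, vi, gi in zip(params, v, g):
                vi.add_(-(vi + gi) / tau * dt)    # dv = -(v + grad L)/tau * dt
                p.add_(vi * dt)                   # dw = v * dt
        g_new = torch.autograd.grad(loss_fn(), params)
        change = (flat(g_new) - flat(g)).norm() ** 2 \
                 / (flat(g_new).norm() * flat(g).norm())
        if change > 1e-2:
            dt /= 1.1        # gradient changed too fast: reduce the time step
        elif change < 1e-4:
            dt *= 1.1        # gradient barely changed: increase the time step
        g = g_new
    return params

# Usage sketch (made-up model and data, full batch, smooth loss):
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.Softplus(),
                            torch.nn.Linear(64, 1))
x, y = torch.randn(128, 10), torch.randn(128, 1)
loss_fn = lambda: ((model(x) - y) ** 2).mean()
momentum_flow(list(model.parameters()), loss_fn)
```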

there is a plateau for large values of $$\alpha$$: the lazy regime

[Plot: test error vs $$\alpha$$ for several $$n$$; MNIST 10k parity, FC L=3, softplus, gradient flow with momentum]

the ensemble average converges with $$n \to \infty$$, but as a function of $$\alpha$$ the curves for different $$n$$ do not overlap

the ensemble average

$$\displaystyle \bar f(x) = \int f(w(w_0),x) \;d\mu(w_0)$$

plotted as a function of $$\sqrt{n}\,\alpha$$, the curves for different $$n$$ overlap

[Plot: test error vs $$\sqrt{n}\alpha$$; MNIST 10k parity, FC L=3, softplus, gradient flow with momentum]
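A sketch of this average over initializations (assuming a hypothetical `train(seed)` routine that trains one network starting from the initialization drawn with that seed; the integral over $$w_0$$ is replaced by an empirical mean over a few seeds):

```python
import torch

def ensemble_average(train, xs, seeds=range(10)):
    """Approximate  f_bar(x) = int f(w(w0), x) dmu(w0)  by averaging the
    outputs of networks trained from several random initializations w0."""
    outs = []
    for seed in seeds:
        torch.manual_seed(seed)          # draws a new initialization w0
        model = train(seed)              # assumed: trains and returns the network
        with torch.no_grad():
            outs.append(model(xs))
    return torch.stack(outs).mean(dim=0)
```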

the kernel evolution $$\frac{\|\Theta - \Theta_0 \|}{\|\Theta_0\|}$$ displays two regimes

[Plot: relative kernel change; MNIST 10k parity, FC L=3, softplus, gradient flow with momentum]

the phase space $$(n, \alpha)$$ is split in two by $$\alpha^*$$, which decays as $$\frac{1}{\sqrt{n}}$$

[Phase diagram in the $$(n, \alpha)$$ plane: feature training below $$\alpha^* \sim \frac{1}{\sqrt{n}}$$ (mean field limit), lazy training above (kernel limit)]

same for other datasets: the trend depends on the dataset

[Plots vs $$\sqrt{n}\alpha$$, FC L=3, softplus, gradient flow with momentum, each dataset reduced to 2 classes: MNIST 10k, 10PCA MNIST 10k, EMNIST 10k, Fashion MNIST 10k, CIFAR10 10k]

CNN: the tendency is inverted

how do the learning curves depend on $$n$$ and $$\alpha$$?

[Plots: loss vs time $$t$$ for several $$n$$ and $$\alpha$$; MNIST 10k parity, L=3, softplus, gradient flow with momentum]

• curves with the same $$\sqrt{n} \alpha$$ overlap
• in the lazy regime, convergence happens at the same time
• there is a characteristic time $$t_1$$ in the learning curves

$$t_1$$ characterises the curvature of the network manifold

[Same picture as before: the network manifold $$f(w)$$ (dimension $$N$$) in the space of functions $$\mathbb{R}^d \to \mathbb{R}$$ (dimension $$\infty$$), with the tangent space $$f(w_0) + \nabla f(w_0) \cdot dw$$ (dimension $$N$$) and the kernel space $$f(w_0) + \sum_\mu c_\mu \Theta(w_0, x_\mu)$$ (dimension $$m$$); $$d$$ input dimension, $$N$$ number of parameters, $$m$$ size of the trainset]

$$t_1$$ is the time you need to drive to realize the earth is curved: at speed $$v$$ on a planet of radius $$R$$, $$t_1 \sim R/v$$

the rate of change of the weights $$W$$ determines when we leave the tangent space, i.e. $$t_1 \sim \sqrt{n}\alpha$$

$$\displaystyle z^{\ell+1}_i = \frac1{\sqrt{n}} \sum_{j=1}^n W^\ell_{ij} \ \phi(z^\ell_j)$$

$$W_{ij}$$ and $$z_i$$ are initialized $$\sim 1$$, so their rates of change $$\dot W_{ij}$$ and $$\dot z_i$$ at initialization give $$t_1$$

Upper bound: consider the weights of the last layer,

$$\dot W^L= -\frac{\partial\mathcal{L}}{\partial W^L} = \mathcal{O}\left(\frac{1}{\alpha \sqrt{n}}\right)$$

$$\Rightarrow t_1 \sim \sqrt{n} \alpha$$

this upper bound is actually the correct scaling; the other rates are

$$\dot W^\ell = \mathcal{O}(\frac{1}{\alpha n}), \quad \dot W^0 = \mathcal{O}(\frac{1}{\alpha \sqrt{n}}), \quad \dot z = \mathcal{O}(\frac{1}{\alpha \sqrt{n}})$$
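Spelling out the last-layer estimate (a sketch; it assumes $$\ell'(\cdot) = \mathcal{O}(1)$$ and $$\phi(z^L_i) = \mathcal{O}(1)$$ at initialization):

$$\dot W^L_i = -\frac{\partial\mathcal{L}}{\partial W^L_i} = -\frac{1}{\alpha^2 |\mathcal{D}|} \sum_{(x,y)} \ell'\big(\alpha(f - f_0), y\big)\, \alpha\, \frac{\partial f}{\partial W^L_i} = \mathcal{O}\!\left(\frac{1}{\alpha} \cdot \frac{\phi(z^L_i)}{\sqrt{n}}\right) = \mathcal{O}\!\left(\frac{1}{\alpha\sqrt{n}}\right)$$

$$\text{since } W^L_i \sim 1 \text{ at initialization:} \quad t_1 \sim \frac{|W^L_i|}{|\dot W^L_i|} \sim \alpha \sqrt{n}$$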

convergence time is of order 1 in lazy regime

• $$\displaystyle \mathcal{L}(w) = \frac{1}{\alpha^2 |\mathcal{D}|} \sum_{(x,y)\in\ \mathcal{D}} \ell\left( {\color{red}\alpha (f(w,x) - f(w_0,x))}, y \right)$$

• $$\dot f(w, x) = \nabla_w f(w,x) \cdot \dot w$$
• $$= - \nabla_w f(w,x) \cdot \nabla_w \mathcal{L}$$
• $$\sim - \nabla_w f \cdot (\frac{\partial}{\partial f} \mathcal{L} \; \nabla_w f)$$
• $$\sim \frac{1}{\alpha} \Theta$$

• "$$\alpha \dot f \, t = 1$$" $$\Rightarrow$$ "$$t_{lazy} \sim \|\Theta\|^{-1} \sim 1$$"

when the dynamics stops before $$t_1$$ we are in the lazy regime

$$t_1 \sim \sqrt{n} \alpha$$ (time to exit the tangent space)

$$t_{lazy} \sim 1$$ (time to converge in the lazy regime)

setting $$t_{lazy} \sim t_1$$ $$\Longrightarrow$$ $${\color{red}\alpha^* \sim \frac{1}{\sqrt{n}}}$$

then for large $$n$$

$$\alpha^* f(w_0, x) \ll 1$$

$$\Rightarrow$$

$$\alpha^* (f(w,x) - f(w_0, x)) \approx \alpha^* f(w,x)$$

=> for large $$n$$ our conclusions should hold without this trick: linearizing the network with $$f - f_0$$ was not necessary

arxiv.org/abs/1906.08034

github.com/mariogeiger/feature_lazy

Summary, in the $$(n, \alpha)$$ plane:

• lazy training: the dynamics stops before $$t_1$$ (kernel limit as $$n\to\infty$$)
• feature training: the dynamics goes beyond $$t_1$$ (mean field limit as $$n\to\infty$$)
• the two regimes are separated by $$\alpha^* \sim \frac{1}{\sqrt{n}}$$

time to leave the tangent space $$t_1 \sim \sqrt{n}\alpha$$
=> time for learning features

By Mario Geiger

# [SPML] feature-lazy

Presentation of the paper https://arxiv.org/abs/1906.08034
