Feature vs. Lazy Training
Stefano Spigler
Matthieu Wyart
Mario Geiger
Arthur Jacot
https://arxiv.org/abs/1906.08034
Two different regimes in the dynamics of neural networks
feature training: the network learns features
lazy training: no features are learned; the output $f(w,x)$ stays close to its linearization
$f(w) \approx f(w_0) + \nabla f(w_0) \cdot dw$
characteristic training time separating the two regimes
training procedure to force each regime
How does the network perform in the infinite width limit?
n→∞
There exist two limits in the literature
n neurons per layer
Overparametrized networks:
- perform well
- are theoretically tractable
The 2 limits can be understood in the context of the central limit theorem

$X_i$ (i.i.d.)

Central Limit Theorem:
$Y = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$
As $n \to \infty$, $Y \longrightarrow$ Gaussian
$\langle Y \rangle = \sqrt{n}\, \langle X_i \rangle$
$\langle Y^2 \rangle - \langle Y \rangle^2 = \langle X_i^2 \rangle - \langle X_i \rangle^2$

Law of large numbers:
$Y = \frac{1}{n} \sum_{i=1}^{n} X_i$
$\langle Y \rangle = \langle X_i \rangle$
$\langle Y^2 \rangle - \langle Y \rangle^2 = \frac{1}{n}\,\big( \langle X_i^2 \rangle - \langle X_i \rangle^2 \big)$
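To make the two scalings concrete, here is a minimal numpy sketch (not from the slides; the exponential distribution is just an arbitrary choice of i.i.d. $X_i$):

```python
# Compare the 1/sqrt(n) (CLT) and 1/n (law of large numbers) scalings of a sum
# of i.i.d. variables: the first keeps O(1) fluctuations, the second keeps the mean.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 2_000
X = rng.exponential(size=(trials, n))      # i.i.d. X_i with mean 1 and variance 1

Y_clt = X.sum(axis=1) / np.sqrt(n)         # 1/sqrt(n): kernel-limit scaling
Y_lln = X.sum(axis=1) / n                  # 1/n: mean-field-limit scaling

print("1/sqrt(n): mean ~ sqrt(n)<X> =", Y_clt.mean(), " var ~ Var(X) =", Y_clt.var())
print("1/n:       mean ~ <X>        =", Y_lln.mean(), " var ~ Var(X)/n =", Y_lln.var())
```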
regime 1: kernel limit
[Diagram: fully-connected network with input $x$, weights $W^1_{ij}$, hidden activations $z^1_j$, $z^2_i$, and output $f(w,x)$; $w$ = all the weights]
$z_i^{\ell+1} = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} W_{ij}^{\ell}\, \phi(z_j^{\ell})$
At initialization: $W_{ij} \longleftarrow \mathcal{N}(0,1)$
With this scaling, small correlations add up significantly => the weights will change only a little during training
[1995 R. M. Neal]
Independent terms in the sum, CLT
=> when n→∞ output is a Gaussian process
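A small numpy sketch of this initialization (the softplus activation and a depth of two hidden layers are illustration choices, not taken from this slide): sampling the output over many independent draws of the weights shows O(1), approximately Gaussian fluctuations.

```python
# Forward pass with the 1/sqrt(n) scaling at random initialization; over many draws
# of the weights the scalar output becomes Gaussian as n grows (Neal 1995).
import numpy as np

def forward(x, n, rng):
    softplus = lambda z: np.log1p(np.exp(z))
    z = rng.standard_normal((n, x.size)) @ x / np.sqrt(x.size)   # z^1
    z = rng.standard_normal((n, n)) @ softplus(z) / np.sqrt(n)   # z^2
    return rng.standard_normal(n) @ softplus(z) / np.sqrt(n)     # scalar output f(w, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
outputs = np.array([forward(x, n=256, rng=rng) for _ in range(1000)])
print(outputs.mean(), outputs.std())   # fluctuations of order 1, roughly Gaussian
```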
regime 1: kernel limit
[Diagram: in the space of functions $\mathbb{R}^d \to \mathbb{R}$ (dimension $\infty$), the network manifold $f(w)$ (dimension $N$) is approximated near $f(w_0)$ by the tangent space $f(w_0) + \nabla f(w_0) \cdot dw$ (dimension $N$); the dynamics stays in the kernel space $f(w_0) + \sum_\mu c_\mu \Theta(w_0, x_\mu)$ (dimension $m$). Here $d$ = input dimension, $N$ = number of parameters, $m$ = size of the trainset]
[2018 Jacot et al.]
[2018 Du et al.]
[2019 Lee et al.]
The NTK is independent of the initialization
and is constant through learning
=> the network behaves like a kernel method
The weights barely change
$\|dw\| \sim O(1)$ and $dW_{ij} \sim O(1/n)$ ($O(1/\sqrt{n})$ at the extremities)
The internal activations barely change
$dz \sim O(1/\sqrt{n})$
=> no feature training
for n→∞
neural tangent kernel
$\Theta(w, x_1, x_2) = \nabla_w f(w, x_1) \cdot \nabla_w f(w, x_2)$
regime 1: kernel limit
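A hedged PyTorch sketch of the empirical NTK defined above (the small fully-connected architecture in the usage example is an arbitrary illustration, not the authors' setup):

```python
# Theta(w, x1, x2) = grad_w f(w, x1) . grad_w f(w, x2) for a scalar-output network.
import torch

def ntk(model, x1, x2):
    def flat_grad(x):
        out = model(x.unsqueeze(0)).squeeze()                    # f(w, x)
        grads = torch.autograd.grad(out, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])         # grad_w f(w, x)
    return torch.dot(flat_grad(x1), flat_grad(x2)).item()

# usage on a toy network
model = torch.nn.Sequential(torch.nn.Linear(10, 256), torch.nn.Softplus(),
                            torch.nn.Linear(256, 1))
x1, x2 = torch.randn(10), torch.randn(10)
print(ntk(model, x1, x2))
```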
regime 2: mean field limit
$f(w,x) = \frac{1}{n} \sum_{i=1}^{n} W_i\, \phi\!\Big(\frac{1}{\sqrt{n}} \sum_{j=1}^{n} W_{ij}\, x_j\Big)$
[Diagram: one-hidden-layer network with input $x$, hidden weights $W_{ij}$, output weights $W_i$, output $f(w,x)$]
the output prefactor was $\frac{1}{\sqrt{n}}$ in the kernel limit
studied theoretically for 1 hidden layer
Another limit!
for $n \to \infty$
$f(w,x) = \frac{1}{n} \sum_{i=1}^{n} W_i\, \phi\!\Big(\frac{1}{\sqrt{n}} \sum_{j=1}^{n} W_{ij}\, x_j\Big)$
$\frac{1}{n}$ instead of $\frac{1}{\sqrt{n}}$ implies
- no output fluctuations at initialization
- we can replace the sum by an integral
$f(w,x) \approx f(\rho,x) = \int d\rho(W, \vec{W})\; W\, \phi\!\Big(\frac{1}{\sqrt{n}} \sum_{j=1}^{n} \vec{W}_j\, x_j\Big)$
where $\rho$ is the density of the neurons' weights $(W, \vec{W})$
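A quick numpy sketch of the first point (one hidden layer; the tanh activation and the $1/\sqrt{d}$ input scaling are illustration choices): with the $1/n$ prefactor the output at initialization concentrates, while with $1/\sqrt{n}$ it keeps O(1) fluctuations.

```python
# Output fluctuations at initialization for the 1/n versus 1/sqrt(n) prefactor.
import numpy as np

rng = np.random.default_rng(1)
d, n, draws = 10, 10_000, 200
x = rng.standard_normal(d)

outs = []
for _ in range(draws):
    W_hidden = rng.standard_normal((n, d))
    W_out = rng.standard_normal(n)
    h = np.tanh(W_hidden @ x / np.sqrt(d))
    outs.append((W_out @ h / n, W_out @ h / np.sqrt(n)))
outs = np.array(outs)
print("std of f at init with 1/n:      ", outs[:, 0].std())   # ~ 1/sqrt(n), vanishes
print("std of f at init with 1/sqrt(n):", outs[:, 1].std())   # ~ O(1)
```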
regime 2: mean field limit
$f(\rho,x) = \int d\rho(W, \vec{W})\; W\, \phi\!\Big(\frac{1}{\sqrt{n}} \sum_{j=1}^{n} \vec{W}_j\, x_j\Big)$
ρ follows a differential equation
[2018 S. Mei et al], [2018 Rotskoff and Vanden-Eijnden], [2018 Chizat and Bach]
In this limit, the internal activations do change
$\langle Y \rangle = \langle X_i \rangle$ => feature training
regime 2: mean field limit
What is the difference between the two limits?
- which limit describes finite-$n$ networks better?
- are there corresponding regimes at finite $n$?
kernel limit: output prefactor $\frac{1}{\sqrt{n}}$, internal activations $z^\ell$ frozen
mean field limit: output prefactor $\frac{1}{n}$, internal activations $z^\ell$ change
$\alpha\, f(w,x) = \frac{\alpha}{\sqrt{n}} \sum_{i=1}^{n} W_i\, \phi(z_i)$
[2019 Chizat and Bach]
- if $\alpha$ is a fixed constant and $n \to \infty$ => kernel limit
- if $\alpha \sim \frac{1}{\sqrt{n}}$ and $n \to \infty$ => mean field limit
we use the scaling factor α to investigate
the transition between the two regimes
$\alpha \cdot (f(w,x) - f(w_0,x))$
linearize the network with $f - f_0$
We would like the network to behave linearly in the limit $\alpha \to \infty$, for any finite $n$
the $\frac{1}{\alpha^2}$ prefactor is there so that the dynamics converges in a time that does not scale with $\alpha$ in the limit $\alpha \to \infty$
$\mathcal{L}(w) = \frac{1}{\alpha^2 |D|} \sum_{(x,y)\in D} \ell\big(\alpha\,(f(w,x) - f(w_0,x)),\, y\big)$
loss function
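A hedged PyTorch sketch of this rescaled, centered loss (the frozen copy `model0` of the network at initialization, the toy architecture and the quadratic pointwise loss are illustration choices, not the authors' code):

```python
# L(w) = 1/(alpha^2 |D|) * sum_{(x,y) in D} loss(alpha * (f(w,x) - f(w0,x)), y)
import copy
import torch

def centered_loss(model, model0, alpha, x, y, pointwise_loss):
    with torch.no_grad():
        f0 = model0(x)                    # f(w0, x): output at initialization, frozen
    out = alpha * (model(x) - f0)         # alpha * (f(w, x) - f(w0, x))
    return pointwise_loss(out, y).mean() / alpha**2

# usage with a full batch (x, y) and a quadratic pointwise loss
model = torch.nn.Sequential(torch.nn.Linear(10, 256), torch.nn.Softplus(),
                            torch.nn.Linear(256, 1))
model0 = copy.deepcopy(model)             # keep the initialization w0
for p in model0.parameters():
    p.requires_grad_(False)
x, y = torch.randn(128, 10), torch.randn(128, 1)
loss = centered_loss(model, model0, alpha=100.0, x=x, y=y,
                     pointwise_loss=lambda out, target: (out - target) ** 2)
```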
$f(w,x) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} W_i^3\, \phi(z_i^3)$
$W_{ij}(t=0) \longleftarrow \mathcal{N}(0,1)$
$z_i^{\ell+1} = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} W_{ij}^{\ell}\, \phi(z_j^{\ell})$
[Diagram: fully-connected network with input $x_i$, weight matrices $W_{ij}^0$, $W_{ij}^1$, $W_{ij}^2$, hidden activations $z_i^1$, $z_i^2$, $z_i^3$, and output weights $W_i^3$]
$\dot{w} = -\nabla_w \mathcal{L}(w)$
Implemented with a dynamical adaptation of the time step dt such that
$10^{-4} < \frac{\|\nabla\mathcal{L}(t_{i+1}) - \nabla\mathcal{L}(t_i)\|^2}{\|\nabla\mathcal{L}(t_{i+1})\| \cdot \|\nabla\mathcal{L}(t_i)\|} < 10^{-2}$
(works well only with full batch and smooth loss)
continuous dynamics
momentum dynamics
$\dot{v} = -\frac{1}{\tau}\,(v + \nabla\mathcal{L})$
$\dot{w} = v$
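A sketch of the adaptive time step described above (not the authors' implementation; see github.com/mariogeiger/feature_lazy for theirs). `grad_fn` is an assumed helper that returns the full-batch gradient of the loss as a list of tensors; the momentum dynamics would be integrated with the same adapted dt.

```python
import torch

def gradient_flow_step(params, grad_fn, dt):
    """One forward-Euler step of dw/dt = -grad L(w), with dt adapted so that the
    relative change of the gradient stays roughly between 1e-4 and 1e-2."""
    g0 = grad_fn(params)
    while True:
        trial = [p - dt * g for p, g in zip(params, g0)]
        g1 = grad_fn(trial)
        num = sum(((a - b) ** 2).sum() for a, b in zip(g1, g0))
        den = torch.sqrt(sum((a ** 2).sum() for a in g1)) * torch.sqrt(sum((b ** 2).sum() for b in g0))
        ratio = (num / den).item()
        if ratio > 1e-2:
            dt *= 0.5          # gradient changed too much: retry with a smaller step
            continue
        if ratio < 1e-4:
            dt *= 1.1          # gradient barely changed: grow the step for next time
        return trial, dt
```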
there is a plateau for large values of α
MNIST 10k parity, FC L=3, softplus, gradient flow with momentum
lazy regime
MNIST 10k parity, FC L=3, softplus, gradient flow with momentum
ensemble average
the ensemble average converges as n→∞
no overlap when plotted against α
the ensemble average
$\bar{f}(x) = \int f(w(w_0), x)\, d\mu(w_0)$
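A sketch of how the ensemble average can be estimated (Monte-Carlo over random seeds; `make_model` and `train` are assumed helpers that build a freshly initialized network and train it to convergence):

```python
import torch

def ensemble_average(make_model, train, seeds, x):
    """Estimate f_bar(x) = E_{w0}[ f(w(w0), x) ] by averaging trained networks
    that differ only in their random initialization."""
    outputs = []
    for seed in seeds:
        torch.manual_seed(seed)           # draw w0
        model = train(make_model())       # w(w0): the network trained from this w0
        with torch.no_grad():
            outputs.append(model(x))
    return torch.stack(outputs).mean(dim=0)
```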
MNIST 10k parity, FC L=3, softplus, gradient flow with momentum
plotted as a function of $\sqrt{n}\,\alpha$, the lines overlap!
the kernel evolution $\frac{\|\Theta - \Theta_0\|}{\|\Theta_0\|}$ displays two regimes
MNIST 10k parity, FC L=3, softplus, gradient flow with momentum
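A sketch of the quantity plotted here, using the Gram matrix of the empirical NTK over a list of inputs `xs` (it reuses the `ntk` helper sketched earlier; `model_t` and `model_0` stand for the trained network and a frozen copy at initialization):

```python
import torch

def relative_kernel_change(model_t, model_0, xs):
    """|| Theta - Theta_0 || / || Theta_0 || in Frobenius norm over the inputs xs."""
    gram = lambda m: torch.tensor([[ntk(m, a, b) for b in xs] for a in xs])
    theta_t, theta_0 = gram(model_t), gram(model_0)
    return (torch.linalg.norm(theta_t - theta_0) / torch.linalg.norm(theta_0)).item()
```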
the phase space is split in two by $\alpha^*$, which decays with $n$
[Phase diagram in the $(n, \alpha)$ plane: lazy training above the line $\alpha^* \sim \frac{1}{\sqrt{n}}$, feature training below it; the kernel limit is reached at fixed $\alpha$ with $n \to \infty$, the mean field limit at $\alpha \sim \frac{1}{\sqrt{n}}$]
same for other datasets: the trend depends on the dataset
MNIST 10k
reduced to 2 classes
10PCA MNIST 10k
reduced to 2 classes
FC L=3, softplus, gradient flow with momentum
EMNIST 10k
reduced to 2 classes
Fashion MNIST 10k
reduced to 2 classes
FC L=3, softplus, gradient flow with momentum
same for other datasets: the trend depends on the dataset
CIFAR10 10k
reduced to 2 classes
CNN SGD ADAM
CNN: the tendency is inverted
how do the learning curves depend on $n$ and $\alpha$?
MNIST 10k parity, L=3, softplus, gradient flow with momentum
[Plot annotations: curves labelled by $\sqrt{n}\,\alpha$ overlap; same convergence time in the lazy regime]
MNIST 10k parity, L=3, softplus, gradient flow with momentum
there is a characteristic time in the learning curves
[Plot annotations: curves labelled by $\sqrt{n}\,\alpha$ overlap; lazy regime; characteristic time $t_1$]
$t_1$ characterizes the curvature of the network manifold
[Diagram repeated from before: in the space of functions $\mathbb{R}^d \to \mathbb{R}$ (dimension $\infty$), the network manifold $f(w)$ (dimension $N$), the tangent space $f(w_0) + \nabla f(w_0) \cdot dw$ (dimension $N$), and the kernel space $f(w_0) + \sum_\mu c_\mu \Theta(w_0, x_\mu)$ (dimension $m$); $d$ = input dimension, $N$ = number of parameters, $m$ = size of the trainset]
$t_1$ is the time you need to drive to realize the Earth is curved:
driving at speed $v$ on a sphere of radius $R$, $t_1 \sim R/v$
the rate of change of $W$ determines when we leave the tangent space, aka $t_1 \sim \alpha\sqrt{n}$
[Diagram: network with input $x$, weights $W_{ij}$, activations $z_j$, $z_i$, output $f(w,x)$]
$z_i^{\ell+1} = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} W_{ij}^{\ell}\, \phi(z_j^{\ell})$
$W_{ij}$ and $z_i$ are initialized $\sim 1$
$\dot{W}_{ij}$ and $\dot{z}_i$ at initialization $\Rightarrow t_1$
$\dot{W}^{\ell} = O\!\big(\tfrac{1}{\alpha n}\big)$ (intermediate layers), $\dot{W}^{0} = O\!\big(\tfrac{1}{\alpha\sqrt{n}}\big)$, $\dot{z} = O\!\big(\tfrac{1}{\alpha\sqrt{n}}\big)$
Upper bound: consider the weights of the last layer
$\dot{W}^{L} = -\frac{\partial \mathcal{L}}{\partial W^{L}} = O\!\big(\tfrac{1}{\alpha\sqrt{n}}\big)$
$\Rightarrow t_1 \sim \alpha\sqrt{n}$ (actually the correct scaling)
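Spelling out the last step of the estimate (the only ingredient is the $1/\alpha$ factor that the rescaled loss puts in front of the gradients):

```latex
\dot W^{L} = -\frac{\partial \mathcal L}{\partial W^{L}}
           \sim \frac{1}{\alpha}\,\frac{\partial f}{\partial W^{L}}
           \sim \frac{1}{\alpha\sqrt{n}},
\qquad
t_1 \sim \frac{W^{L}\big|_{t=0}}{\dot W^{L}}
    \sim \frac{O(1)}{1/(\alpha\sqrt{n})} = \alpha\sqrt{n}.
```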
convergence time is of order 1 in lazy regime
- $\mathcal{L}(w) = \frac{1}{\alpha^2 |D|} \sum_{(x,y)\in D} \ell\big(\alpha\,(f(w,x) - f(w_0,x)),\, y\big)$
- $\dot{f}(w,x) = \nabla_w f(w,x) \cdot \dot{w}$
- $\qquad = -\nabla_w f(w,x) \cdot \nabla_w \mathcal{L}$
- $\qquad \sim -\nabla_w f \cdot \big(\frac{\partial \mathcal{L}}{\partial f}\, \nabla_w f\big)$
- $\qquad \sim \frac{1}{\alpha}\,\Theta$
- "$\alpha\,\dot{f}\; t = 1$" $\Rightarrow$ "$t_{\mathrm{lazy}} = \|\Theta\|^{-1}$"
$\Longrightarrow \alpha^* \sim \frac{1}{\sqrt{n}}$
when the dynamics stops before t1 we are in the lazy regime
$t_1 \sim \alpha\sqrt{n}$ (time to exit the tangent space)
$t_{\mathrm{lazy}} \sim 1$ (time to converge in the lazy regime)
$t_{\mathrm{lazy}} \sim t_1 \;\Rightarrow\; \alpha^* \sim 1/\sqrt{n}$
then for large $n$:
$\alpha^* f(w_0,x) \ll 1$
$\Rightarrow\; \alpha^*\,(f(w,x) - f(w_0,x)) \approx \alpha^* f(w,x)$
for large $n$ our conclusions should hold without this trick:
linearizing the network with $f - f_0$ was not necessary
arxiv.org/abs/1906.08034
github.com/mariogeiger/feature_lazy
[Summary phase diagram in the $(n, \alpha)$ plane]
lazy training: the dynamics stops before $t_1$ (kernel limit)
feature training: the dynamics goes beyond $t_1$ (mean field limit)
$\alpha^* \sim \frac{1}{\sqrt{n}}$
time to leave the tangent space $t_1 \sim \alpha\sqrt{n}$ => time for learning features