Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks
Journal club
B. Bordelon, A. Canatar, C. Pehlevan
Overview
- Setting: kernel regression for a generic target function
- Decompose the generalization error over the kernel's spectral components
- Derive an approximate formula for the error
- Validated via numerical experiments (with focus on the NTK)
- Error components associated with larger eigenvalues decay faster with the training set size: learning proceeds through successive stages
Kernel (ridge) regression
- $p$ training points $\{x_i, f^\star(x_i)\}_{i=1}^{p}$, generated by a target function $f^\star : \mathbb{R}^d \to \mathbb{R}$ with inputs $x_i \sim p(x)$
- Kernel regression: $\min_{f \in \mathcal{H}(K)} \sum_{i=1}^{p} \big[ f(x_i) - f^\star(x_i) \big]^2 + \lambda \, \|f\|_K^2$
- Estimator: $f(x) = y^t (K + \lambda \mathbb{I})^{-1} k(x)$, where $y_i = f^\star(x_i)$, $K_{ij} = K(x_i, x_j)$, $k_i(x) = K(x, x_i)$
- Generalization error: $E_g = \big\langle [\, f(x) - f^\star(x) \,]^2 \big\rangle_{x \sim p(x)}$
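To make the estimator concrete, here is a minimal numerical sketch; the RBF kernel, the 1-d target, and all parameter values are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of kernel ridge regression: f(x) = y^T (K + lambda I)^{-1} k(x).
import numpy as np

def rbf_kernel(X1, X2, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between two sets of points (illustrative choice)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def kernel_ridge_fit(X_train, y_train, lam):
    """Return the dual coefficients alpha = (K + lam I)^{-1} y."""
    K = rbf_kernel(X_train, X_train)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

def kernel_ridge_predict(X_test, X_train, alpha):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) on the test points."""
    return rbf_kernel(X_test, X_train) @ alpha

# Toy usage: learn a 1-d target and estimate the generalization error by Monte Carlo.
rng = np.random.default_rng(0)
f_star = lambda x: np.sin(3 * x[:, 0])          # assumed target function
X_tr = rng.uniform(-1, 1, size=(200, 1))
X_te = rng.uniform(-1, 1, size=(2000, 1))
alpha = kernel_ridge_fit(X_tr, f_star(X_tr), lam=1e-3)
E_g = np.mean((kernel_ridge_predict(X_te, X_tr, alpha) - f_star(X_te)) ** 2)
print(f"estimated generalization error: {E_g:.4f}")
```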
Mercer decomposition
- $\{\lambda_\rho, \phi_\rho\}$ are the kernel's eigenstates: $\int \mathrm{d}x'\, p(x')\, K(x, x')\, \phi_\rho(x') = \lambda_\rho\, \phi_\rho(x)$, so that $K(x, x') = \sum_\rho \lambda_\rho\, \phi_\rho(x)\, \phi_\rho(x')$
- $\{\phi_\rho\}$ are chosen to form an orthonormal basis: $\int \mathrm{d}x\, p(x)\, \phi_\rho(x)\, \phi_\gamma(x) = \delta_{\rho\gamma}$
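As a bridge to the feature-space picture on the next slide, here is a short sketch of the feature convention I assume for the $\psi_\rho$ used below; with this choice the RKHS norm of $f$ becomes the Euclidean norm of the weight vector $w$.

```latex
% Feature map built from the Mercer eigenfunctions (assumed convention):
\[
  \psi_\rho(x) \equiv \sqrt{\lambda_\rho}\,\phi_\rho(x),
  \qquad
  K(x, x') = \sum_\rho \psi_\rho(x)\,\psi_\rho(x'),
\]
% so that for f expanded in these features the RKHS norm is simply ||w||:
\[
  f(x) = \sum_\rho w_\rho\,\psi_\rho(x)
  \;\Longrightarrow\;
  \|f\|_K^2 = \sum_\rho w_\rho^2 = \|w\|^2 .
\]
```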
Kernel regression in feature space
- Expand the target and the estimator in the kernel's feature basis $\{\psi_\rho\}$ (defined above):
  $f^\star(x) = \sum_\rho \bar w_\rho\, \psi_\rho(x)$, $\quad f(x) = \sum_\rho w_\rho\, \psi_\rho(x)$
- Then kernel regression can be written as
  $\min_{w,\ \|w\| < \infty}\ \| \Psi^t w - y \|^2 + \lambda \| w \|^2$, with design matrix $\Psi_{\rho i} = \psi_\rho(x_i)$
- And its solution is $w = (\Psi \Psi^t + \lambda \mathbb{I})^{-1} \Psi\, y$ (derivation sketched below)
[Figure: example of the expansion for a Gaussian teacher function]
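For completeness, a sketch of where the feature-space solution comes from: it is the standard ridge-regression normal equation.

```latex
% Setting the gradient of the ridge objective to zero gives the normal equations,
% whose solution is the w quoted on the slide.
\[
  \nabla_w\!\left[ \|\Psi^t w - y\|^2 + \lambda \|w\|^2 \right]
  = 2\,\Psi\,(\Psi^t w - y) + 2\lambda\, w = 0
  \;\Longrightarrow\;
  w = (\Psi\Psi^t + \lambda\,\mathbb{I})^{-1}\,\Psi\, y .
\]
```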
Generalization error and spectral components
- We can then derive $E_g = \sum_\rho E_\rho$, with
  $E_\rho = \dfrac{\langle \bar w_\rho^2 \rangle}{\lambda_\rho}\, \big\langle G^2(p) \big\rangle_{\rho\rho}$, where $G(p) = \Big( \tfrac{1}{\lambda} \Phi \Phi^t + \Lambda^{-1} \Big)^{-1}$, $\Phi_{\rho i} = \phi_\rho(x_i)$, $\Lambda = \mathrm{diag}(\lambda_\rho)$
- the target function enters only through the prefactor $\langle \bar w_\rho^2 \rangle$!
- the data points enter only through $\langle G^2(p) \rangle$ (average over training sets)!
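A sketch of the steps behind this decomposition, assuming noiseless labels $y = \Psi^t \bar w$ and the feature convention $\Psi = \Lambda^{1/2}\Phi$ from above; the diagonal form in the last line is the approximation used for the mode errors, as I read the slides.

```latex
% Compare the learned weights to the teacher's, then expand the error over modes:
\[
  w - \bar w = (\Psi\Psi^t + \lambda\mathbb{I})^{-1}\Psi\Psi^t\,\bar w - \bar w
             = -\lambda\,(\Psi\Psi^t + \lambda\mathbb{I})^{-1}\bar w ,
  \qquad
  E_g = (w - \bar w)^t\,\Lambda\,(w - \bar w).
\]
% Writing Psi = Lambda^{1/2} Phi and G = (Phi Phi^t / lambda + Lambda^{-1})^{-1} gives
\[
  E_g = \bar w^t\,\Lambda^{-1/2}\, G^2\,\Lambda^{-1/2}\,\bar w
  \;\approx\; \sum_\rho \frac{\langle\bar w_\rho^2\rangle}{\lambda_\rho}\,\big\langle G^2 \big\rangle_{\rho\rho},
\]
% where the last step averages over training sets (and teachers) and keeps the diagonal.
```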
Approximation for ⟨G²⟩
- Define $\tilde G(p, v) \equiv \left( \frac{1}{\lambda}\Phi\Phi^t + \Lambda^{-1} + v\,\mathbb{I} \right)^{-1}$, so that $\langle G^2 \rangle = -\,\partial_v \langle \tilde G(p, v) \rangle \big|_{v=0}$
- Derive a recurrence for the addition of the $(p{+}1)$-st point $x_{p+1}$, corresponding to the feature vector $\phi = (\phi_\rho(x_{p+1}))_\rho$
- Use the Sherman-Morrison formula (rank-one Woodbury inversion); the resulting update is sketched below
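The rank-one update, written out (a sketch; all $\tilde G$ on the right-hand side are evaluated at the same $(p, v)$).

```latex
% Adding x_{p+1} changes Phi Phi^t by the rank-one term phi phi^t, so Sherman-Morrison
% gives the exact update of G-tilde:
\[
  \tilde G(p+1, v)
  = \tilde G(p, v)
  - \frac{\tfrac{1}{\lambda}\,\tilde G(p, v)\,\phi\,\phi^t\,\tilde G(p, v)}
         {1 + \tfrac{1}{\lambda}\,\phi^t\,\tilde G(p, v)\,\phi} .
\]
```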
Approximation for ⟨G²⟩
- First approximation: average over the new point $x_{p+1}$ and approximate the second term of the update as
  $\Big\langle \dfrac{\tfrac{1}{\lambda}\, \tilde G\, \phi \phi^t\, \tilde G}{1 + \tfrac{1}{\lambda}\, \phi^t \tilde G\, \phi} \Big\rangle_{x_{p+1}} \approx \dfrac{\big\langle \tilde G^2 \big\rangle}{\lambda + \operatorname{tr} \big\langle \tilde G \big\rangle}$
- Second approximation: treat $p$ as continuous, which turns the recurrence into a PDE in $(p, v)$ (see the sketch below)
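Putting the two approximations together, a sketch of the resulting flow, using $\langle \phi \phi^t \rangle = \mathbb{I}$ for the orthonormal eigenfunctions and $\tilde G^2 = -\partial_v \tilde G$.

```latex
% Average the update over the new point, substitute G-tilde^2 = -dG-tilde/dv,
% and treat p as continuous:
\[
  \big\langle \tilde G(p+1, v) \big\rangle - \big\langle \tilde G(p, v) \big\rangle
  \;\approx\; -\,\frac{\big\langle \tilde G(p, v)^2 \big\rangle}
                      {\lambda + \operatorname{tr}\big\langle \tilde G(p, v) \big\rangle}
  \quad\Longrightarrow\quad
  \frac{\partial \big\langle \tilde G \big\rangle}{\partial p}
  \;=\; \frac{1}{\lambda + \operatorname{tr}\big\langle \tilde G \big\rangle}\,
        \frac{\partial \big\langle \tilde G \big\rangle}{\partial v} .
\]
```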
PDE solution
- This first-order PDE can be solved exactly (with the method of characteristics)
- Then the error component $E_\rho$ is obtained from $E_\rho = -\frac{\langle \bar w_\rho^2 \rangle}{\lambda_\rho}\, \partial_v \tilde G_{\rho\rho}(p, v) \big|_{v=0}$ (a reconstruction of the resulting formula is sketched below)
Note: the same result is found with replica calculations!
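A reconstruction of the resulting learning-curve formula, in my notation: $t(p) = \operatorname{tr} \langle \tilde G(p, 0) \rangle$ solves an implicit equation, and each mode error is the bare error $\lambda_\rho \langle \bar w_\rho^2 \rangle$ suppressed according to how strongly that mode has been learned.

```latex
% Implicit equation for t(p) and the mode errors (reconstructed from the derivation above):
\[
  t(p) = \sum_\rho \Big( \frac{1}{\lambda_\rho} + \frac{p}{\lambda + t(p)} \Big)^{-1},
  \qquad
  E_\rho(p) \approx \frac{\lambda_\rho\,\langle\bar w_\rho^2\rangle}
                         {\big(1 + \frac{p\,\lambda_\rho}{\lambda + t(p)}\big)^{2}}\,
                    \frac{1}{1 - \gamma(p)},
\]
\[
  \gamma(p) = \frac{p}{(\lambda + t(p))^{2}}
              \sum_\rho \Big( \frac{1}{\lambda_\rho} + \frac{p}{\lambda + t(p)} \Big)^{-2} .
\]
```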
Comments on the result
- The effect of the target function is simply a (mode-dependent) prefactor $\langle \bar w_\rho^2 \rangle$
- Ratio between the errors of two modes (written out in the sketch below):
- The error is large if the target function puts a lot of weight on modes with small $\lambda_\rho$
- Small $p$: the ratio is set by $\lambda_\rho \langle \bar w_\rho^2 \rangle$, i.e. by each mode's share of the target
- Large $p$: the ratio is inverted, $\propto \langle \bar w_\rho^2 \rangle / \lambda_\rho$: modes with larger eigenvalues are learned first
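The ratio and its two limits, written out as a sketch using the reconstructed formula above (mode indices $\rho, \sigma$; the $1/(1-\gamma)$ factor cancels in the ratio).

```latex
% Ratio of two mode errors and its small-p / large-p limits:
\[
  \frac{E_\rho}{E_\sigma}
  = \frac{\langle\bar w_\rho^2\rangle\,\lambda_\rho}{\langle\bar w_\sigma^2\rangle\,\lambda_\sigma}
    \left( \frac{\lambda + t(p) + p\,\lambda_\sigma}{\lambda + t(p) + p\,\lambda_\rho} \right)^{\!2}
\]
\[
  p \to 0:\;\;
  \frac{E_\rho}{E_\sigma} \to \frac{\langle\bar w_\rho^2\rangle\,\lambda_\rho}{\langle\bar w_\sigma^2\rangle\,\lambda_\sigma},
  \qquad\qquad
  p \to \infty:\;\;
  \frac{E_\rho}{E_\sigma} \to \frac{\langle\bar w_\rho^2\rangle\,\lambda_\sigma}{\langle\bar w_\sigma^2\rangle\,\lambda_\rho} .
\]
```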
Dot-product kernels in d→∞
- We now consider dot-product kernels $K(x, x') = K(x \cdot x')$, with $x \in \mathbb{S}^{d-1}$ (e.g. the NTK)
- Eigenstates are the spherical harmonics $Y_{km}$; the eigenvalues $\lambda_k$ are degenerate within each degree $k$ (see the sketch below)
- (everything said next could be derived for translation-invariant kernels as well)
- For $d \to \infty$: $N(d, k) \sim d^{k}$, and $\lambda_k \sim N(d, k)^{-1} \sim d^{-k}$
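A sketch of where the degeneracy and the eigenvalue scaling come from; the degeneracy formula is the standard dimension count for degree-$k$ spherical harmonics on $\mathbb{S}^{d-1}$.

```latex
% Mercer decomposition of a dot-product kernel on the sphere: eigenfunctions are the
% spherical harmonics Y_{km}, and all N(d,k) harmonics of degree k share eigenvalue lambda_k.
\[
  K(x \cdot x') = \sum_{k \ge 0} \lambda_k \sum_{m=1}^{N(d,k)} Y_{km}(x)\, Y_{km}(x'),
  \qquad
  N(d,k) = \binom{d+k-1}{k} - \binom{d+k-3}{k-2}
  \;\underset{d\to\infty}{\sim}\; \frac{d^{k}}{k!} .
\]
% Since tr K = K(1) = sum_k N(d,k) lambda_k must stay finite, each degree can carry at most
% O(1) of the trace, which forces lambda_k = O(N(d,k)^{-1}) ~ d^{-k}.
```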

Dot-product kernels in d→∞
- Learning proceeds in stages: take $p = \alpha\, d^{\ell}$ (see the sketch below)
- Modes with larger $\lambda_k$ are learned earlier!
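How the stages come out of the reconstructed formula, a sketch with $p = \alpha d^{\ell}$ and $\lambda_k \sim d^{-k}$; the constant $c$ and the correction $\gamma(\alpha)$ are $O(1)$ quantities left implicit.

```latex
% The suppression factor behaves as p lambda_k / (lambda + t) ~ alpha d^{ell - k},
% so in the d -> infinity limit:
\[
  \frac{E_k(p = \alpha d^{\ell})}{E_k(0)}
  \;\longrightarrow\;
  \begin{cases}
    0, & k < \ell \quad \text{(already learned)}\\[4pt]
    e_\ell(\alpha), & k = \ell \quad \text{(currently being learned)}\\[4pt]
    1, & k > \ell \quad \text{(not yet learned)}
  \end{cases}
\]
% with e_ell(alpha) ~ (1 + c alpha)^{-2} / (1 - gamma(alpha)) interpolating from 1 at
% alpha -> 0 toward 0 as alpha -> infinity: the learning curve descends one step at a time.
```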
Numerical experiments
Three settings are considered:
- Kernel Teacher-Student with 4-layer NTK kernels (the same kernel for teacher and student)
- Finite-width NNs learning pure modes
- Finite-width Teacher-Student 2-layer NNs
kernel regression with $K_{\mathrm{NTK}}$ → learn with NNs (4 layers, $h = 500$; 2 layers, $h = 10000$)
Note: this target contains several spherical harmonics
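A minimal sketch of the first setting, kernel teacher/student on the sphere. The arccos recursion below is the standard infinite-width ReLU NTK; the depth convention, the normalization, the GP-sampled teacher, and all parameter values are illustrative assumptions, not necessarily the paper's exact choices.

```python
# Kernel teacher/student on S^{d-1} with an (assumed) infinite-width ReLU NTK.
import numpy as np

def ntk_relu(X1, X2, depth=4):
    """Infinite-width ReLU NTK Gram matrix between unit-norm inputs (rows of X1, X2)."""
    kappa0 = lambda rho: (np.pi - np.arccos(rho)) / np.pi
    kappa1 = lambda rho: (np.sqrt(1.0 - rho**2) + rho * (np.pi - np.arccos(rho))) / np.pi
    sigma = np.clip(X1 @ X2.T, -1.0, 1.0)          # layer-0 covariance x . x'
    theta = sigma.copy()
    for _ in range(depth - 1):
        new_sigma = kappa1(sigma)
        theta = new_sigma + theta * kappa0(sigma)  # NTK recursion
        sigma = np.clip(new_sigma, -1.0, 1.0)
    return theta

def sphere_sample(n, d, rng):
    """n points drawn uniformly on the unit sphere S^{d-1}."""
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, p, n_test, lam = 10, 200, 1000, 1e-3
X = sphere_sample(p + n_test, d, rng)
K = ntk_relu(X, X)

# Teacher: a random function with covariance K (a Gaussian-process sample, illustrative).
evals, evecs = np.linalg.eigh(K)
f_star = evecs @ (np.sqrt(np.clip(evals, 0.0, None)) * rng.standard_normal(len(X)))

# Student: kernel (ridge) regression with the same NTK on the first p points.
K_tr, K_te_tr = K[:p, :p], K[p:, :p]
alpha = np.linalg.solve(K_tr + lam * np.eye(p), f_star[:p])
E_g = np.mean((K_te_tr @ alpha - f_star[p:]) ** 2)
print(f"estimated generalization error at p={p}: {E_g:.4f}")
```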
Kernel regression with 4-layer NTK kernel
[Figure: learning curves for three settings: $d = 10$, $\lambda = 5$; $d = 10$, $\lambda = 0$ (ridgeless); $d = 100$, $\lambda = 0$ (ridgeless)]
- Degenerate modes are grouped per degree: $E_k = \sum_{m=1}^{N(d,k)} E_{km} = N(d, k)\, E_{k,1}$
Pure $\lambda_k$ modes with NNs
[Figure: learning curves for targets $f^\star$ containing only the degree-$k$ mode, with $d = 30$; panels: 2 layers, width 10000; 4 layers, width 500]
Teacher-Student 2-layer NNs
[Figure: learning curves for the 2-layer teacher-student setup, $d = 25$, width 8000]
Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks
By Stefano Spigler