B. Bordelon, A. Canatar, C. Pehlevan
Overview
Kernel (ridge) regression
\(y_i = f^\star(\mathbf x_i)\)
\(\mathbb K_{ij} = K(\mathbf x_i,\mathbf x_j)\)
\(k_i(\mathbf x) = K(\mathbf x,\mathbf x_i)\)
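The definitions above (Gram matrix \(\mathbb K\), kernel vector \(k(\mathbf x)\)) give the ridge predictor \(f(\mathbf x)=k(\mathbf x)^\top(\mathbb K+\lambda I)^{-1}\mathbf y\). A minimal numpy sketch; the RBF kernel and all sizes here are illustrative choices, not the kernel used in the slides.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2); an illustrative kernel choice."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=1e-2, gamma=1.0):
    """Solve alpha = (K + lam I)^{-1} y, with K_ij = K(x_i, x_j)."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test, gamma=1.0):
    """f(x) = k(x)^T alpha = sum_i alpha_i K(x, x_i)."""
    return rbf_kernel(X_test, X_train, gamma) @ alpha

# toy check: near-interpolation of a smooth target at tiny ridge
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.sin(X.sum(axis=1))               # y_i = f*(x_i)
alpha = krr_fit(X, y, lam=1e-6)
train_err = np.mean((krr_predict(X, alpha, X) - y) ** 2)
```

As \(\lambda\to 0\) the predictor interpolates the training data, which is the "ridgeless" limit used in the experiments later in the talk.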
Mercer decomposition
Kernel regression in feature space
design matrix \(\Psi_{\rho,i}=\psi_\rho(\mathbf x_i)\)
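With the design matrix \(\Psi\), kernel regression and feature-space ridge regression are the same problem, via the push-through identity \(\Psi(\Psi^\top\Psi+\lambda I)^{-1}=(\Psi\Psi^\top+\lambda I)^{-1}\Psi\). A sketch checking this numerically, using Gaussian random features as a stand-in for the Mercer features (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
p, N = 40, 200                        # p samples, N features (illustrative)
Psi = rng.standard_normal((N, p))     # design matrix Psi[rho, i] = psi_rho(x_i)
y = rng.standard_normal(p)
lam = 0.1

# kernel-space solve: K = Psi^T Psi, alpha = (K + lam I)^{-1} y
K = Psi.T @ Psi
alpha = np.linalg.solve(K + lam * np.eye(p), y)
w_kernel = Psi @ alpha                # representer form: w = Psi alpha

# feature-space ridge: w = (Psi Psi^T + lam I)^{-1} Psi y
w_feature = np.linalg.solve(Psi @ Psi.T + lam * np.eye(N), Psi @ y)

print(np.allclose(w_kernel, w_feature))
```

The kernel form inverts a \(p\times p\) matrix, the feature form an \(N\times N\) one; the spectral analysis of the generalization error works in the feature picture.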
e.g. Teacher = Gaussian:
Generalization error and spectral components
the target function enters only here!
the data points enter only here!
Approximation for \(\left\langle G^2 \right\rangle\)
PDE solution
Note: the same result is found with replica calculations!
Comments on the result
Small \(p\):
Large \(p\):
Dot-product kernels in \(d\to\infty\)
e.g. NTK
(everything I say next could be derived for translation-invariant kernels as well)
for \(d\to\infty\), \(N(d,k)\sim d^k\)
and \(\lambda_k \sim N(d,k)^{-1} \sim d^{-k}\)
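The degeneracy \(N(d,k)\) is the dimension of the space of degree-\(k\) spherical harmonics in \(\mathbb R^d\), computable from the standard formula \(N(d,k)=\binom{k+d-1}{d-1}-\binom{k+d-3}{d-1}\). A short check of the \(N(d,k)\sim d^k\) scaling (up to a \(1/k!\) constant) quoted above:

```python
from math import comb, factorial

def sph_harm_dim(d, k):
    """N(d,k): dimension of degree-k harmonic homogeneous polynomials in d variables."""
    if k == 0:
        return 1
    return comb(k + d - 1, d - 1) - comb(k + d - 3, d - 1)

# d = 3 recovers the familiar 2k + 1 degeneracy
print([sph_harm_dim(3, k) for k in range(4)])   # [1, 3, 5, 7]

# fixed k, growing d: N(d,k) * k! / d^k -> 1, i.e. N(d,k) ~ d^k / k!
for d in (10, 100, 1000):
    print(d, sph_harm_dim(d, 2) * factorial(2) / d**2)
```

Combined with the trace normalization, this gives \(\lambda_k\sim N(d,k)^{-1}\sim d^{-k}\): higher-degree modes carry exponentially (in \(k\)) smaller eigenvalues, so they require more samples to learn.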
NTK
Dot-product kernels in \(d\to\infty\)
Numerical experiments
Three settings are considered:
kernel regression with \(K^\mathrm{NTK}\)
\(\to\) learn with NNs (4 layers, width \(h=500\); 2 layers, width \(h=10000\))
Note: this contains several spherical harmonics
Kernel regression with 4-layer NTK kernel
\(d=10,\ \lambda=5\)
\(d=10,\ \lambda=0\) ridgeless
\(d=100,\ \lambda=0\) ridgeless
\(E_k = \sum_{m=1}^{N(d,k)} E_{km} = N(d,k) E_{k,1}\)
Pure \(\lambda_k\) modes with NNs
2 layers, width 10000
4 layers, width 500
\(f^\star\) has only the \(\textcolor{red}{k}\) mode
\(d=30\)
Teacher-Student 2-layer NNs
\(d=25\), width 8000
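The teacher-student setup trains a student network on labels generated by a fixed teacher of the same architecture. A minimal numpy sketch with full-batch gradient descent; the sizes (\(d=5\), width 50) and learning rate are small illustrative stand-ins for the slide's \(d=25\), width 8000, chosen only so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(2)
d, width, p = 5, 50, 200            # illustrative stand-ins for d=25, width 8000

def init(width, d):
    """Random 2-layer ReLU net parameters, standard 1/sqrt scaling."""
    return (rng.standard_normal((width, d)) / np.sqrt(d),
            rng.standard_normal(width) / np.sqrt(width))

def forward(params, X):
    W, a = params
    return np.maximum(X @ W.T, 0.0) @ a      # f(x) = a^T relu(W x)

teacher = init(width, d)
X = rng.standard_normal((p, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # inputs on the sphere
y = forward(teacher, X)                         # teacher labels

W, a = init(width, d)                           # independent student init
lr, losses = 0.05, []
for _ in range(500):
    H = np.maximum(X @ W.T, 0.0)               # hidden activations
    err = H @ a - y
    losses.append(np.mean(err ** 2))
    grad_a = H.T @ err / p                      # dL/da
    grad_W = ((err[:, None] * (H > 0) * a).T @ X) / p   # dL/dW via ReLU mask
    a -= lr * grad_a
    W -= lr * grad_W
```

The training loss should decrease steadily; in the wide-network regime studied in the slides, the learning curve of this student matches the kernel-regression prediction with the corresponding NTK.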