Deep equals Shallow in Kernel Regimes
Study Group on Learning Theory
Daniel Yukimura
and related results...
Previously...
- Deep Neural Networks (DNN) are universal approximators
Challenges:
- Non-convex optimization.
- Generalization + Overparameterization.
Neural Tangent Kernels (NTK):
- Under certain conditions overparameterized DNNs are equivalent to kernel methods.
- Successful optimization!
So... is deep learning solved?
- Performance gap between NNs and NTKs:
- 5-7% in favor of DNNs on CIFAR-10
- NTKs outperform DNNs on small datasets
- Larger gap for ResNet
- Depends on the task complexity
- DNNs approximate some classes of functions better
- Depth doesn't matter much for NTKs
Deep equals shallow for ReLU Networks in Kernel Regimes
Result: Kernels derived from DNNs have the same approximation properties as those of shallow networks.
Consequence: The Kernel framework doesn't seem to explain the benefits of deep architectures.
Approx. using dot-product kernels:
\bullet \hspace{2mm} \text{Consider the shallow case: } f(x) = \frac{1}{\sqrt{m}} \sum\limits_{j=1}^m v_j \sigma( w_j^\intercal x)
\bullet \hspace{2mm} \text{When only the } (v_j)_j \text{ are trained (with } \ell_2 \text{ reg.) } \Rightarrow \text{ random feature approximation, with kernel}
k(x,x') = \mathbb{E}_{w\sim \mathcal{N}(0,I)} [\sigma(w^\intercal x) \sigma(w^\intercal x')]
\bullet \hspace{2mm} \text{If } x, x' \text{ are on the sphere, then } k(x,x') = \kappa(x^\intercal x') \text{ for some } \kappa.
\bullet \hspace{2mm} \text{If we decompose } \sigma \text{ using Hermite polynomials } h_i \text{ (orthogonal for the Gaussian measure),}
\sigma(u) = \displaystyle\sum\limits_{i\geq 0} a_i h_i(u) \hspace{3mm} \Rightarrow \hspace{3mm} \kappa(u) = \sum\limits_{i\geq 0} a_i^2 u^i
\bullet \hspace{2mm} \text{Useful } \kappa\text{'s:}
\kappa_0(u) = \dfrac{1}{\pi}(\pi - \arccos(u)) \hspace{5mm} \text{(step function)}
\kappa_1(u) = \dfrac{1}{\pi}\left( u (\pi - \arccos(u)) + \sqrt{1-u^2} \right) \hspace{5mm} \text{(ReLU)}
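Sanity check (my own sketch, not from the paper): a Monte Carlo estimate of the random-feature kernel of a single ReLU unit against kappa_1. It assumes the normalization sigma(u) = sqrt(2) * max(u, 0), chosen so that kappa(1) = 1.

```python
import numpy as np

# Sketch: Monte Carlo check that the ReLU random-feature kernel on the sphere
# equals the dot-product kernel kappa_1.
# Assumption: sigma(u) = sqrt(2) * max(u, 0), a normalization giving kappa(1) = 1.

def kappa1(u):
    # arc-cosine kernel of degree 1 (ReLU)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u**2)) / np.pi

rng = np.random.default_rng(0)
d, n_features = 5, 200_000

# two unit vectors on the sphere
x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)
u = x @ y

W = rng.standard_normal((n_features, d))              # rows w_j ~ N(0, I)
sigma = lambda t: np.sqrt(2.0) * np.maximum(t, 0.0)   # normalized ReLU (assumption above)

k_mc = np.mean(sigma(W @ x) * sigma(W @ y))           # empirical random-feature kernel
print(f"u = {u:.3f}   MC estimate = {k_mc:.4f}   kappa_1(u) = {kappa1(u):.4f}")
```

The two printed values should agree up to Monte Carlo error (roughly the second decimal for this many features).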
Neural Tangent Kernels:
\bullet \hspace{2mm} \text{Deep networks: } f(x,\theta) = \sigma( W_L \sigma(\dots W_2\sigma(W_1 x)))
\bullet \hspace{2mm} \text{Linearization around initialization:}
f(x,\theta) \approx f(x,\theta_0) + \left<\theta - \theta_0, \nabla_\theta f(x,\theta_0)\right>
\bullet \hspace{2mm} \text{In the overparam. regime with } W^\ell_{i,j} \sim \mathcal{N}\left(0, \frac{1}{n_\ell}\right):
k_{NTK} (x,x') = \lim\limits_{m\rightarrow \infty} \left< \nabla_\theta f(x, \theta_0), \nabla_\theta f(x', \theta_0) \right>
\bullet \hspace{2mm} \text{If our inputs are on the sphere: } k_{NTK} (x,x') = \kappa^L_{NTK}(x^\intercal x'),
\text{defined recursively as } \kappa^1(u) = \kappa^1_{NTK}(u) = u \text{ and, for } \ell = 2,\dots,L,
\kappa^\ell(u) = \kappa_1\left( \kappa^{\ell-1}(u) \right)
\kappa^\ell_{NTK}(u) = \kappa^{\ell-1}_{NTK}(u)\, \kappa_0\left(\kappa^{\ell-1}(u)\right) + \kappa^\ell(u)
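A minimal sketch of this recursion, assuming the kappa_0 and kappa_1 defined on the previous slide. A handy sanity check that follows directly from the recursion: kappa_NTK^L(1) = L.

```python
import numpy as np

# Sketch of the recursive construction of kappa_NTK^L on the sphere.

def kappa0(u):
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u**2)) / np.pi

def kappa_ntk(u, L):
    """kappa^1 = kappa_NTK^1 = u; for ell = 2..L:
       kappa^ell     = kappa_1(kappa^{ell-1}),
       kappa_NTK^ell = kappa_NTK^{ell-1} * kappa_0(kappa^{ell-1}) + kappa^ell."""
    k, k_ntk = u, u
    for _ in range(2, L + 1):
        k_ntk = k_ntk * kappa0(k) + kappa1(k)
        k = kappa1(k)
    return k_ntk

u = np.linspace(-1.0, 1.0, 5)
for L in (2, 3, 5):
    print(L, np.round(kappa_ntk(u, L), 3))   # last entry (u = 1) equals L
```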
Spherical harmonics and description of the RKHS
\bullet \hspace{2mm} \text{Consider the integral operator } T\text{, with } \tau \text{ the uniform measure on } \mathbb{S}^{d-1}:
T f(x) = \displaystyle\int k(x,y) f(y)\, d \tau (y)
\bullet \hspace{2mm} \text{On the sphere we can diagonalize } T \text{ using spherical harmonics:}
T Y_{k,j} = \mu_k Y_{k,j}
\text{where } Y_{k,j} \text{ is the } j\text{-th spherical harmonic polynomial of degree } k \text{ and the } \mu_k \text{ are the eigenvalues.}
\bullet \hspace{2mm} \# \text{ of spherical harmonics of degree } k: \hspace{3mm} N(d,k) = \frac{2k+d-2}{k} \binom{k+d-3}{d-2} \sim \mathcal{O} \left( k^{d-2} \right)
\bullet \hspace{2mm} \text{Eigenvalues, with } P_k \text{ the Legendre polynomial of degree } k \text{ (in dimension } d\text{) and } \omega_{d-1} \text{ the surface area of } \mathbb{S}^{d-1}:
\mu_k = \dfrac{\omega_{d-2}}{\omega_{d-1}} \displaystyle\int\limits_{-1}^1 \kappa(t) P_k(t) (1-t^2)^{\frac{d-3}{2}}\, dt
\bullet \hspace{2mm} \text{RKHS associated to the kernel:}
\mathcal{H} = \left\{ f =\displaystyle\sum\limits_{k\geq 0, \mu_k\neq 0} \sum\limits_{j=1}^{N(d,k)} a_{k,j} Y_{k,j}(\cdot) \text{ s.t. }
\|f\|_{\mathcal{H}}^2 = \sum\limits_{k\geq 0, \mu_k\neq 0} \sum\limits_{j=1}^{N(d,k)} \frac{a_{k,j}^2}{\mu_{k}} < \infty \right\}
\bullet \hspace{2mm} \text{If } \mu_k \text{ has a fast decay} \Rightarrow \text{the } a_{k,j} \text{ must also decay quickly.}
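To make the eigenvalue formula concrete, here is a rough numerical sketch (my own, not from the paper) for kappa = kappa_1 and d = 3: in that case the weight (1-t^2)^{(d-3)/2} is 1, P_k is the usual Legendre polynomial, and omega_{d-2}/omega_{d-1} = (2*pi)/(4*pi) = 1/2.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre

# Numerical evaluation of mu_k for kappa_1 on S^{d-1} with d = 3 (assumptions above).

def kappa1(t):
    return (t * (np.pi - np.arccos(t)) + np.sqrt(1.0 - t**2)) / np.pi

def mu(k):
    val, _ = quad(lambda t: kappa1(t) * eval_legendre(k, t), -1.0, 1.0,
                  limit=400, epsabs=1e-12, epsrel=1e-10)
    return 0.5 * val   # omega_1 / omega_2 = (2*pi) / (4*pi)

for k in (2, 4, 8, 16):
    print(f"k = {k:2d}   mu_k = {mu(k):.3e}")
# The mu_k decay quickly, so functions in the RKHS must have quickly decaying
# coefficients a_{k,j}; the precise rate is identified by Theorem 1 below.
```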
\textbf{Theorem 1: }\text{(Decay from regularity of }\kappa\text{ at the endpoints)}
\text{Let } \kappa:[-1,1] \rightarrow \mathbb{R} \text{ be } C^\infty \text{ on } (-1,1),
\text{and assume that around } \pm 1 \text{ we have}:
\kappa(1-t) = p_1(t) + c_1 t^\nu + o(t^\nu)
\kappa(-1+t) = p_{-1}(t) + c_{-1} t^\nu + o(t^\nu)
\text{for } t\geq 0, \text{ with } p_1, p_{-1} \text{ polynomials and } \nu\notin \mathbb{Z}.
\text{Then there is a constant } C(d,\nu) \text{ s.t.:}
\bullet \hspace{2mm} \text{For } k \text{ even, if } c_1 + c_{-1}\neq 0: \hspace{3mm} \mu_k \sim (c_1+c_{-1})\, C(d,\nu)\, k^{-d - 2\nu +1}
\bullet \hspace{2mm} \text{For } k \text{ odd, if } c_1 - c_{-1}\neq 0: \hspace{3mm} \mu_k \sim (c_1 - c_{-1})\, C(d,\nu)\, k^{-d - 2\nu +1}
Consequences for ReLU networks:
\bullet \hspace{2mm} \text{Expansions around } +1:
\kappa_0 (1-t) = 1 - \frac{\sqrt{2}}{\pi} t^{1/2} + \mathcal{O}(t^{3/2})
\kappa_1 (1-t) = 1 - t + \frac{2\sqrt{2}}{3 \pi} t^{3/2} + \mathcal{O}(t^{5/2})
\bullet \hspace{2mm} \text{and similarly around } -1\dots
\text{From Theorem 1} \Rightarrow
\bullet \hspace{2mm} \text{a decay of } k^{-d-2}\text{ for the even coefficients of }\kappa_1
\bullet \hspace{2mm} \text{a decay of } k^{-d}\text{ for the odd coefficients of }\kappa_0
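Spelling out the exponents (reading \nu off the expansions above; the values of c_{\pm 1} follow from the analogous expansions around -1):
\kappa_1: \hspace{2mm} \nu = \tfrac{3}{2},\ c_1 = c_{-1} = \tfrac{2\sqrt{2}}{3\pi} \ \Rightarrow\ \mu_k \sim C\, k^{-d-2\nu+1} = C\, k^{-d-2} \text{ for even } k \text{ (the odd-}k\text{ factor } c_1 - c_{-1} \text{ vanishes)}
\kappa_0: \hspace{2mm} \nu = \tfrac{1}{2},\ c_1 = -c_{-1} = -\tfrac{\sqrt{2}}{\pi} \ \Rightarrow\ \mu_k \sim C\, k^{-d-2\nu+1} = C\, k^{-d} \text{ for odd } k \text{ (the even-}k\text{ factor } c_1 + c_{-1} \text{ vanishes)}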
\textbf{Corollary 3: }\text{(Deep NTK decay)}
\text{For the } \kappa_{NTK}^L \text{ of an } L\text{-layer ReLU network with } L\geq 3,
\text{we have } \mu_k \sim C(d,L)\, k^{-d}, \text{ where }C(d,L) \text{ differs with the}
\text{parity of }k \text{ and grows quadratically with } L.
\bullet \hspace{2mm} \text{A similar result holds for random feature approx.}
\bullet \hspace{2mm} \text{Shows that NTKs have the same eigenvalue decay as the Laplace kernel}
\bullet \hspace{2mm} \text{DNNs with the step activation improve with depth.}
\bullet \hspace{2mm} \text{Kernels assoc. with }C^\infty \text{ activations have eigenvalue decay faster}
\text{than any polynomial}\Rightarrow \text{the RKHS contains only smooth functions}
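A rough numerical check of Corollary 3 (my own sketch, same eigenvalue quadrature and d = 3 assumptions as in the earlier sketch, with L = 3): the predicted decay is k^{-d} = k^{-3}, so mu_k * k^3 should be roughly constant, with a different constant for each parity of k.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre

# Numerical check of the k^{-d} decay of the 3-layer NTK kernel for d = 3.

def kappa0(u):
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u**2)) / np.pi

def kappa_ntk(u, L=3):
    k, k_ntk = u, u
    for _ in range(2, L + 1):
        k_ntk = k_ntk * kappa0(k) + kappa1(k)
        k = kappa1(k)
    return k_ntk

def mu(k):
    val, _ = quad(lambda t: kappa_ntk(t) * eval_legendre(k, t), -1.0, 1.0,
                  limit=400, epsabs=1e-12, epsrel=1e-10)
    return 0.5 * val   # omega_1 / omega_2 = (2*pi) / (4*pi) for d = 3

for k in (4, 8, 16):       # even k
    print(f"k = {k:2d}   mu_k * k^3 = {mu(k) * k**3:.4f}")
for k in (5, 9, 17):       # odd k, approaches a different constant
    print(f"k = {k:2d}   mu_k * k^3 = {mu(k) * k**3:.4f}")
```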
What else?
- Lazy vs Active regimes:
\Rightarrow \text{lazy training} \sim \text{linearization}
\text{occurs when } \Delta(L) \gg \Delta(Df)
\text{(the relative change of the loss } L \text{ is much larger than the relative change of the model gradient } Df\text{)}
\Rightarrow \text{some DNNs might show lazy behavior, but the well-performing ones are not in this regime.}
What else?
- Taylorized Training
- Double descent curve:
\text{initial evidence using NTKs and random features}
- Hierarchical Learning:
\text{Backward Feature Correction: a learner learns a complicated target by decomposing it into a sequence of simpler functions.}
\text{Neural representations}
- Hierarchical learning seems to be an important property of DNNs.
- Hierarchical learning is not leveraged by NTKs.
Deep equals Shallow for NTKs
By Daniel Yukimura