UBC MLRG - March 9 2022
Amin
Arthur Jacot, Franck Gabriel, Clement Hongler
Input dataset is fixed:
We refer to the notion below as all the functions in our function space:
On this space, we use the following seminorm:
The cost function is defined as
We can define the dual space of all functions as . This is the set of all such that .
Thus, each member of this dual class can be written as
for some d in the Function Space!
Now let's derive loss function's derivative with respect to function! Note according to the definition below, as the dataset is fixed, the value of cost function only depends on the values of f!
Cost function derivative is:
Note that this is a member of , as it maps an input f to a real number!
If we write out the expressions, turns out that neural networks trained using gradient descent evolve according to Kernel Gradient Descent, with a specific kernel called "Neural Tangent Kernel":
Now that we have this, are we done with our analysis?
Jacot et al. 2018 showed that for fully connected neural networks at initialization, when the width of the network goes to infinity, the NTK converges in probability to a deterministic limiting kernel.
Moreover, they showed that this kernel stays the same during the training phase!
As a consequence, in this limit, the dynamics of the neural network can be explained through a differential equation:
If we solve this differential equation, we'll see that the network function evolves according to the following dynamics while trained using gradient descent:
Thus, we can analytically characterize the behaviour of infinitely wide (and obviously, overparameterized) neural networks, using a simple kernel ridge regression formula!
Arora et al. 2019 showed that under some regulatory conditions, for any and , for any width, with probability at least 1- , we have:
where L can be determined using the width, and .
This suggests that as the width grows, the NTK of the network gets more similar to the NTK of the corresponding infinite width network.
They also showed that under some regulatory conditions, for any and , for any width, with probability at least 1- , we have:
where is the kernel ridge regression using the "empirical" NTK evaluated using the network.
Lee et al. showed that the training dynamics of a linearized version of a neural network can be explained using kernel ridge regression with the kernel as the empirical Neural Tangent Kernel of the network:
Moreover, they showed corresponding bounds between the predictions of the actual network, and the linear network: